# Scraping data from HTML tables

- Reading tables from a string
- Reading tables from a URL
- Reading tables from a file
- Parsing date columns with parse_dates
- Explicitly typecast with converters
- MultiIndex, header, and index column
- Matching a table with match
- Filtering tables with attrs
- Working with missing values

[Pandas read_html](https://bindichen.medium.com/all-pandas-read-html-you-should-know-for-scraping-data-from-html-tables-a3cbb5ce8274)

In [5]:
# %load command1.py
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

%config InlineBackend.figure_format='svg'
plt.rcParams['figure.dpi']=120

pd.options.display.float_format='{:,.2f}'.format
pd.set_option('display.max_colwidth', None)


**Reading tables from a string**

In [17]:
html_string = """
<table>
  <thead>
    <tr>
      <th>ID</th>
      <th>date</th>
      <th>name</th>
      <th>year</th>
      <th>cost</th>
      <th>region</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>001</td>
      <td>2020-01-01</td>
      <td>Jenny</td>
      <td>1998</td>
      <td>0.2</td>
      <td>South</td>
    </tr>
    <tr>
      <td>002</td>
      <td>2020-01-02</td>
      <td>Alice</td>
      <td>1992</td>
      <td>-1.34</td>
      <td>East</td>
    </tr>
    <tr>
      <td>003</td>
      <td>2020-01-03</td>
      <td>Tomas</td>
      <td>1982</td>
      <td>1.00023</td>
      <td>South</td>
    </tr>
  </tbody>
</table>
"""

In [8]:
dfs = pd.read_html(html_string)
dfs
print()
print(type(dfs))
dfs[0]  # read_html returns a list; index it to get the first DataFrame

[         date   name  year  cost region
 0  2020-01-01  Jenny  1998  0.20  South
 1  2020-01-02  Alice  1992 -1.34   East
 2  2020-01-03  Tomas  1982  1.00  South]


<class 'list'>


Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,0.2,South
1,2020-01-02,Alice,1992,-1.34,East
2,2020-01-03,Tomas,1982,1.0,South


In [9]:
dfs[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    3 non-null      object 
 1   name    3 non-null      object 
 2   year    3 non-null      int64  
 3   cost    3 non-null      float64
 4   region  3 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 248.0+ bytes
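
Newer pandas releases (2.1 and later, to the best of my knowledge) emit a `FutureWarning` when a literal HTML string is passed to `read_html` and recommend wrapping it in `io.StringIO`. The sketch below shows that pattern; the two-column table is a made-up example, not the data used above.

```python
from io import StringIO

import pandas as pd

# Hypothetical table, used only for illustration.
html_text = """
<table>
  <tr><th>name</th><th>year</th></tr>
  <tr><td>Jenny</td><td>1998</td></tr>
  <tr><td>Alice</td><td>1992</td></tr>
</table>
"""

# StringIO turns the markup into a file-like object, which read_html
# accepts without the literal-string deprecation warning.
tables = pd.read_html(StringIO(html_text))
df = tables[0]
print(df)
```

The same wrapper should work for every `pd.read_html(html_string, ...)` call in this notebook.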


**Reading tables from a URL**

In [12]:
URL = 'https://en.wikipedia.org/wiki/London'
dfs = pd.read_html(URL)

print(f'Total tables: {len(dfs)}')

Total tables: 31
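
Some sites reject requests that lack a browser-like `User-Agent`, in which case `pd.read_html(URL)` can fail with an HTTP error. A possible workaround is to download the page yourself and pass the text to `read_html`. The helper name `read_tables_from_url` and the default user-agent string are assumptions made for this sketch.

```python
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd


def read_tables_from_url(url, user_agent="Mozilla/5.0 (compatible; example-script)", **kwargs):
    """Download a page with an explicit User-Agent and parse its tables.

    Hypothetical helper: extra keyword arguments are forwarded to read_html.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html_text = response.read().decode(charset, errors="replace")
    return pd.read_html(StringIO(html_text), **kwargs)


# Example call (needs network access, so it is left commented out):
# dfs = read_tables_from_url("https://en.wikipedia.org/wiki/London")
# print(f"Total tables: {len(dfs)}")
```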


**Reading tables from a file**

In [13]:
file_path = './pandasData/html_string.txt'
with open(file_path, 'r') as f:
    # read_html accepts a file-like object, so pass the open handle directly
    dfs = pd.read_html(f)
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,0.2,South
1,2020-01-02,Alice,1992,-1.34,East
2,2020-01-03,Tomas,1982,1.0,South


**Parsing date columns with parse_dates**

In [14]:
dfs = pd.read_html(html_string, parse_dates=['date'])
dfs[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    3 non-null      datetime64[ns]
 1   name    3 non-null      object        
 2   year    3 non-null      int64         
 3   cost    3 non-null      float64       
 4   region  3 non-null      object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 248.0+ bytes
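
If the tables are already loaded, the same conversion can be done afterwards with `pd.to_datetime`. This is a minimal sketch on a made-up two-row table; the explicit `format` is an assumption that the dates are ISO formatted, as they are in the example above.

```python
from io import StringIO

import pandas as pd

# Hypothetical table, used only for illustration.
html_text = """
<table>
  <tr><th>date</th><th>name</th></tr>
  <tr><td>2020-01-01</td><td>Jenny</td></tr>
  <tr><td>2020-01-02</td><td>Alice</td></tr>
</table>
"""

df = pd.read_html(StringIO(html_text))[0]

# Convert the text column to datetime after the table has been read.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
print(df.dtypes)
```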


**Explicitly typecast with converters**

In [18]:
dfs = pd.read_html(html_string, converters={
    'ID': str,
    'year': int,
    'cost': float,
})
dfs[0]

Unnamed: 0,ID,date,name,year,cost,region
0,001,2020-01-01,Jenny,1998,0.2,South
1,002,2020-01-02,Alice,1992,-1.34,East
2,003,2020-01-03,Tomas,1982,1.0,South
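
The main reason to reach for `converters` here is the `ID` column. Without a converter, pandas infers integers and the leading zeros disappear. The sketch below compares both reads on a small made-up table.

```python
from io import StringIO

import pandas as pd

# Hypothetical table, used only for illustration.
html_text = """
<table>
  <tr><th>ID</th><th>name</th></tr>
  <tr><td>001</td><td>Jenny</td></tr>
  <tr><td>002</td><td>Alice</td></tr>
</table>
"""

# Default type inference turns "001" into the integer 1.
default_df = pd.read_html(StringIO(html_text))[0]

# A str converter keeps the cell text exactly as written.
typed_df = pd.read_html(StringIO(html_text), converters={"ID": str})[0]

print(default_df["ID"].tolist())
print(typed_df["ID"].tolist())
```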


**MultiIndex, header, and index column**

In [19]:
html_string = """
<table>
  <thead>
    <tr>
      <th colspan="5">Year 2020</th>
    </tr>
    <tr>
      <th>date</th>
      <th>name</th>
      <th>year</th>
      <th>cost</th>
      <th>region</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2020-01-01</td>
      <td>Jenny</td>
      <td>1998</td>
      <td>1.2</td>
      <td>South</td>
    </tr>
    <tr>
      <td>2020-01-02</td>
      <td>Alice</td>
      <td>1992</td>
      <td>-1.34</td>
      <td>East</td>
    </tr>
  </tbody>
</table>
"""

In [20]:
dfs = pd.read_html(html_string)
dfs[0]

Unnamed: 0_level_0,Year 2020,Year 2020,Year 2020,Year 2020,Year 2020
Unnamed: 0_level_1,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,-1.34,East


In [21]:
# specify header row
dfs = pd.read_html(html_string, header=1)
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,-1.34,East


In [22]:
# specify an index column
dfs = pd.read_html(html_string, header=1, index_col=0)
dfs[0]

Unnamed: 0_level_0,name,year,cost,region
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-01-01,Jenny,1998,1.2,South
2020-01-02,Alice,1992,-1.34,East
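
An alternative to `header=1` is to read the table with its two-level header and drop the outer level afterwards. This is a sketch on a shortened, made-up version of the table above.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with a two-row header, used only for illustration.
html_text = """
<table>
  <thead>
    <tr><th colspan="2">Year 2020</th></tr>
    <tr><th>date</th><th>name</th></tr>
  </thead>
  <tbody>
    <tr><td>2020-01-01</td><td>Jenny</td></tr>
    <tr><td>2020-01-02</td><td>Alice</td></tr>
  </tbody>
</table>
"""

df = pd.read_html(StringIO(html_text))[0]
print(type(df.columns))  # the two header rows become a MultiIndex

# Drop the outer "Year 2020" level and keep the inner column names.
flat = df.droplevel(0, axis=1)
print(flat.columns.tolist())
```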


**Matching a table with match**

In [23]:
html_string = """
<table id="report">
  <caption>2020 report</caption>
  <thead>
    <tr>
      <th>date</th>
      <th>name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2020-01-01</td>
      <td>Jenny</td>
    </tr>
    <tr>
      <td>2020-01-02</td>
      <td>Alice</td>
    </tr>
  </tbody>
</table>


<table>
  <caption>Average income</caption>
  <thead>
    <tr>
      <th>name</th>
      <th>income</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tom</td>
      <td>200</td>
    </tr>
    <tr>
      <td>James</td>
      <td>300</td>
    </tr>
  </tbody>
</table>
"""

In [26]:
# text in caption
dfs = pd.read_html(html_string, match='2020 report')
dfs[0]

# text in table cell
dfs = pd.read_html(html_string, match='James')
dfs[0]

# text in caption
dfs = pd.read_html(html_string, match='Average income')
dfs[0]

Unnamed: 0,date,name
0,2020-01-01,Jenny
1,2020-01-02,Alice


Unnamed: 0,name,income
0,Tom,200
1,James,300


Unnamed: 0,name,income
0,Tom,200
1,James,300
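
When no table contains the `match` text, `read_html` raises a `ValueError` instead of returning an empty list. The wrapper below, whose name `read_matching_tables` is made up for this sketch, turns that case into an empty list. It catches every `ValueError`, so it may also hide unrelated parsing problems.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, used only for illustration.
html_text = """
<table>
  <tr><th>date</th><th>name</th></tr>
  <tr><td>2020-01-01</td><td>Jenny</td></tr>
</table>
<table>
  <tr><th>name</th><th>income</th></tr>
  <tr><td>James</td><td>300</td></tr>
</table>
"""


def read_matching_tables(html, pattern):
    """Return the tables whose text matches pattern, or [] if none do."""
    try:
        return pd.read_html(StringIO(html), match=pattern)
    except ValueError:
        # Raised by read_html when no table matches.
        return []


print(len(read_matching_tables(html_text, "James")))
print(read_matching_tables(html_text, "no such text"))
```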


**Filtering tables with attrs**

In [28]:
dfs = pd.read_html(html_string, attrs={'id': 'report'})
dfs[0]

Unnamed: 0,date,name
0,2020-01-01,Jenny
1,2020-01-02,Alice


**Working with missing values**

In [29]:
html_string = """
<table>
  <tr>
    <th>date</th>
    <th>name</th>
    <th>year</th>
    <th>cost</th>
    <th>region</th>
  </tr>
  <tr>
    <td>2020-01-01</td>
    <td>Jenny</td>
    <td>1998</td>
    <td>1.2</td>
    <td>South</td>
  </tr>
  <tr>
    <td>2020-01-02</td>
    <td>Alice</td>
    <td>1992</td>
    <td></td>
    <td>East</td>
  </tr>
  <tr>
    <td>2020-01-03</td>
    <td>Tomas</td>
    <td>1982</td>
    <td></td>
    <td>South</td>
  </tr>
</table>
"""

In [30]:
dfs = pd.read_html(html_string)
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,,East
2,2020-01-03,Tomas,1982,,South


In [31]:
# By default, empty cells are read as NaN. To keep them as empty strings, set keep_default_na=False.

dfs = pd.read_html(html_string, keep_default_na=False)
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,,East
2,2020-01-03,Tomas,1982,,South
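
The rendered tables look the same either way, so counting missing values makes the difference visible. This is a sketch on a made-up table with one empty `cost` cell.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with one empty cell, used only for illustration.
html_text = """
<table>
  <tr><th>name</th><th>cost</th></tr>
  <tr><td>Jenny</td><td>1.2</td></tr>
  <tr><td>Alice</td><td></td></tr>
  <tr><td>Tomas</td><td>3.4</td></tr>
</table>
"""

with_nan = pd.read_html(StringIO(html_text))[0]
with_blank = pd.read_html(StringIO(html_text), keep_default_na=False)[0]

# Default: the empty cell is counted as missing.
print(with_nan["cost"].isna().sum())

# keep_default_na=False: the empty cell stays as text, so nothing is missing.
print(with_blank["cost"].isna().sum())
```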


In [32]:
# Tables sometimes use other characters to represent missing values. If we know which characters a table uses,
# we can pass them to the na_values parameter:

dfs = pd.read_html(html_string, na_values=['?', '&'])
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,,East
2,2020-01-03,Tomas,1982,,South


In [33]:
# Once the DataFrame has been created, we can use the DataFrame.replace() method to handle these values:
df_clean = dfs[0].replace({ "?": np.nan, "&": np.nan })
df_clean

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,,East
2,2020-01-03,Tomas,1982,,South
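
The table above only has empty cells, so `na_values` and `replace()` have nothing to act on. The sketch below uses a made-up table whose `cost` column contains the placeholders `?` and `--` (both chosen for illustration) and applies each approach.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical table with placeholder markers, used only for illustration.
html_text = """
<table>
  <tr><th>name</th><th>cost</th></tr>
  <tr><td>Jenny</td><td>1.2</td></tr>
  <tr><td>Alice</td><td>?</td></tr>
  <tr><td>Tomas</td><td>--</td></tr>
</table>
"""

# Without na_values the placeholders are kept as ordinary text.
raw = pd.read_html(StringIO(html_text))[0]

# Option 1: declare the placeholders while reading.
clean = pd.read_html(StringIO(html_text), na_values=["?", "--"])[0]

# Option 2: replace them afterwards, then convert the column to numbers.
replaced = raw.replace({"?": np.nan, "--": np.nan})
replaced["cost"] = pd.to_numeric(replaced["cost"])

print(clean["cost"].isna().tolist())
print(replaced["cost"].isna().tolist())
```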
