# Web Scraping with HTML Tables

This walkthrough shows how to use the Pandas read_html() function to scrape data from HTML tables. First, in the simplest example, we use Pandas to read HTML from a string. Then we go through two examples in which we scrape data from Wikipedia tables with read_html().

Pandas read_html() Example 1:

In [5]:
import pandas as pd

html = '''<table>
  <tr>
    <th>a</th>
    <th>b</th>
    <th>c</th>
    <th>d</th>
  </tr>
  <tr>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
  </tr>
  <tr>
    <td>5</td>
    <td>6</td>
    <td>7</td>
    <td>8</td>
  </tr>
</table>'''

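from io import StringIO  # pandas 2.1+ deprecates passing a literal HTML string

# Wrap the HTML string in a file-like object before parsing it.
df = pd.read_html(StringIO(html))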

In [6]:
type(df)

list

Note that the result is not a Pandas DataFrame but a Python list of DataFrames, one per table found in the HTML, as the type() function confirms. To get the table itself, select the first element of the list (index 0).

In [8]:
df[0]

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,5,6,7,8

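Because read_html() returns one DataFrame per table element, a quick way to find the right list position is to print each table's shape and columns. The following is a minimal sketch, assuming a parser such as lxml is installed; the two tables and the name two_tables_html are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Two small, made-up tables in one HTML string (illustrative data only).
two_tables_html = """
<table>
  <tr><th>a</th><th>b</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
<table>
  <tr><th>x</th><th>y</th><th>z</th></tr>
  <tr><td>3</td><td>4</td><td>5</td></tr>
  <tr><td>6</td><td>7</td><td>8</td></tr>
</table>
"""

# StringIO turns the string into a file-like object, which read_html accepts.
tables = pd.read_html(StringIO(two_tables_html))

# One DataFrame per <table> element, in document order.
for position, table in enumerate(tables):
    print(position, table.shape, list(table.columns))
```

The same loop works on a list scraped from a real page, where it is often easier than counting tables by eye.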

Pandas read_html() Example 2:

In the second example, we scrape data from Wikipedia. Specifically, we read the HTML tables on the page for Pythonidae snakes (also known as python snakes).

In [9]:
import pandas as pd

dfs = pd.read_html('https://en.wikipedia.org/wiki/Pythonidae')

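This returns a list of tables. When this example was run the list held 9 of them, as len(dfs) shows; the count can change as the Wikipedia page is edited.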

In [10]:
len(dfs)

9

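On the Wikipedia page, the first table we see is the infobox on the right. Here we select the second element of the list, dfs[1]. In the output below it holds the infobox's taxonomy data, so list positions do not always match the order in which tables appear on screen.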

In [11]:
dfs[1]

Unnamed: 0,Pythonidae,Pythonidae.1
0,,
1,Indian python (Python molurus),Indian python (Python molurus)
2,Scientific classification,Scientific classification
3,Kingdom:,Animalia
4,Phylum:,Chordata
5,Class:,Reptilia
6,Order:,Squamata
7,Suborder:,Serpentes
8,Superfamily:,Pythonoidea
9,Family:,"PythonidaeFitzinger, 1826"

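Selecting a table by position can break when the page changes. The match parameter of read_html() offers an alternative: it keeps only tables whose text matches a string or regular expression. This is a hedged sketch on an inline HTML string rather than the live page; the name page_html and the second table's placeholder values are assumptions for illustration, and a parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Stand-in for a page with several tables (illustrative data only).
page_html = """
<table>
  <tr><th>Rank</th><th>Value</th></tr>
  <tr><td>Family:</td><td>Pythonidae</td></tr>
</table>
<table>
  <tr><th>Column one</th><th>Column two</th></tr>
  <tr><td>placeholder</td><td>placeholder</td></tr>
</table>
"""

# Keep only tables that contain text matching the pattern "Family".
matching = pd.read_html(StringIO(page_html), match="Family")

print(len(matching))
print(matching[0])
```

If no table matches, read_html() raises a ValueError, so a pattern that is too narrow fails loudly instead of returning the wrong table.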

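Pandas read_html() Example 3:

Wikipedia has an interesting chart displaying which muscle groups are worked by different weightlifting exercises. Passing index_col=0 tells read_html() to use the first column, the exercise name, as the row index.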

In [12]:
weightlifting_df_list = pd.read_html('https://en.wikipedia.org/wiki/List_of_weight_training_exercises', index_col=0)
len(weightlifting_df_list)

2

In [13]:
weightlifting_df_list[0]

Unnamed: 0_level_0,Calves,Quad-riceps,Ham-strings,Gluteus,Hipsother,Lowerback,Lats,Trapezius,Abdominals,Pectorals,Deltoids,Triceps,Biceps,Forearms
Exercise,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Squat,Some,Yes,Some,Yes,Yes,Some,,,Yes,,,,,
Leg press,Some,Yes,Some,Yes,,,,,,,,,,
Lunge,,Yes,Yes,Yes,Yes,,,,,,,,,
Deadlift,Some,Yes,Yes,Yes,Yes,Yes,,Some,Some,,,,,Some
Leg extension,,Yes,,,,,,,,,,,,
Leg curl,Some,,Yes,,,,,,,,,,,
Standing calf raise,Yes,,,,,,,,,,,,,
Seated calf raise,Yes,,,,,,,,,,,,,
Hip adductor,,,,,Yes,,,,,,,,,
Bench press,,,,,,,,,,Yes,,Yes,,

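To see what index_col=0 does without fetching the page, here is a small sketch on an inline table. The three rows are copied from the output above; the variable names are assumptions for illustration, and a parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# A cut-down version of the exercise table, using values shown above.
exercise_html = """
<table>
  <tr><th>Exercise</th><th>Calves</th><th>Ham-strings</th></tr>
  <tr><td>Squat</td><td>Some</td><td>Some</td></tr>
  <tr><td>Leg curl</td><td>Some</td><td>Yes</td></tr>
  <tr><td>Standing calf raise</td><td>Yes</td><td></td></tr>
</table>
"""

# index_col=0 makes the first column (the exercise name) the row index.
by_name = pd.read_html(StringIO(exercise_html), index_col=0)[0]

# Rows can now be looked up by exercise name instead of by position.
print(by_name.loc["Leg curl", "Ham-strings"])
print(list(by_name.index))
```

With the exercise names as the index, label-based lookups such as .loc["Leg curl"] work directly.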

Let's assign the first DataFrame to a variable.

In [14]:
exercises_df = weightlifting_df_list[0]

Now you could do whatever you like with the DataFrame. Maybe you want to filter it to show only the exercises that work your hamstrings.

In [15]:
hammies = exercises_df[(exercises_df['Ham-strings']=='Yes') | (exercises_df['Ham-strings']=='Some')]
hammies

Unnamed: 0_level_0,Calves,Quad-riceps,Ham-strings,Gluteus,Hipsother,Lowerback,Lats,Trapezius,Abdominals,Pectorals,Deltoids,Triceps,Biceps,Forearms
Exercise,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Squat,Some,Yes,Some,Yes,Yes,Some,,,Yes,,,,,
Leg press,Some,Yes,Some,Yes,,,,,,,,,,
Lunge,,Yes,Yes,Yes,Yes,,,,,,,,,
Deadlift,Some,Yes,Yes,Yes,Yes,Yes,,Some,Some,,,,,Some
Leg curl,Some,,Yes,,,,,,,,,,,

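The same filter can be written more compactly with Series.isin(), which tests membership in a list of accepted values. This is a sketch on a hand-built DataFrame that reuses the Ham-strings values shown above, assuming a single-level column header; the variable names are illustrative.

```python
import pandas as pd

# Hand-built stand-in for the scraped table (values taken from the output above).
exercises = pd.DataFrame(
    {"Ham-strings": ["Some", "Some", "Yes", "Yes", None, "Yes"]},
    index=pd.Index(
        ["Squat", "Leg press", "Lunge", "Deadlift", "Leg extension", "Leg curl"],
        name="Exercise",
    ),
)

# isin() replaces the chain of == comparisons joined with |.
works_hamstrings = exercises[exercises["Ham-strings"].isin(["Yes", "Some"])]
print(works_hamstrings)
```

Rows with a missing value, such as Leg extension here, are excluded because a missing value is not in the list.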

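Let's sort the table so that the exercises that work the hamstrings the most come first. The column holds strings, so a descending sort puts 'Yes' ahead of 'Some' only because of alphabetical order.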

In [16]:
hammies.sort_values(by='Ham-strings', ascending=False)

Unnamed: 0_level_0,Calves,Quad-riceps,Ham-strings,Gluteus,Hipsother,Lowerback,Lats,Trapezius,Abdominals,Pectorals,Deltoids,Triceps,Biceps,Forearms
Exercise,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Lunge,,Yes,Yes,Yes,Yes,,,,,,,,,
Deadlift,Some,Yes,Yes,Yes,Yes,Yes,,Some,Some,,,,,Some
Leg curl,Some,,Yes,,,,,,,,,,,
Squat,Some,Yes,Some,Yes,Yes,Some,,,Yes,,,,,
Leg press,Some,Yes,Some,Yes,,,,,,,,,,

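The descending sort works here because "Yes" happens to come after "Some" alphabetically. One way to make the ranking explicit is an ordered categorical. The following is a minimal sketch on a hand-built DataFrame with the values shown above; the names intensity and ranked are assumptions for illustration.

```python
import pandas as pd

# Hand-built stand-in for the filtered table (values taken from the output above).
hammies = pd.DataFrame(
    {"Ham-strings": ["Some", "Some", "Yes", "Yes", "Yes"]},
    index=pd.Index(
        ["Squat", "Leg press", "Lunge", "Deadlift", "Leg curl"], name="Exercise"
    ),
)

# Declare the ranking explicitly: "Some" is lower than "Yes".
intensity = pd.CategoricalDtype(categories=["Some", "Yes"], ordered=True)

ranked = hammies.copy()
ranked["Ham-strings"] = ranked["Ham-strings"].astype(intensity)

# Sorting now follows the declared order rather than alphabetical order.
ranked = ranked.sort_values(by="Ham-strings", ascending=False)
print(ranked)
```

The declared order keeps the sort correct even if the labels change to words whose alphabetical order does not match their meaning.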

References:
1. https://www.marsja.se/how-to-use-pandas-read_html-to-scrape-data-from-html-tables/
2. https://deepnote.com/@deepnote/Scrape-HTML-Tables-Without-Leaving-Pandas-Xq9isYiTRGyTcaDHt4T6pA