# Scraping with Pandas

In [1]:
import pandas as pd

We can use the `read_html` function in Pandas to scrape tabular data from a page: it parses each `<table>` element it finds into a DataFrame.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'

In [3]:
tables = pd.read_html(url)
tables

[                                           0   \
 0                                       State   
 1   Municipal (Within city proper boundaries)   
 2                                     Alabama   
 3                                      Alaska   
 4                                     Arizona   
 5                                    Arkansas   
 6                                  California   
 7                                    Colorado   
 8                                 Connecticut   
 9                                    Delaware   
 10                                    Florida   
 11                                    Georgia   
 12                                     Hawaii   
 13                                      Idaho   
 14                                   Illinois   
 15                                    Indiana   
 16                                       Iowa   
 17                                     Kansas   
 18                                   Kentucky   


What we get in return is a list of DataFrames, one for each table that Pandas found on the page.

In [4]:
type(tables)

list
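To see the same behavior without a network request, here is a minimal sketch that parses a small inline HTML string. The markup and the names `html`, `demo_tables`, and `territories` are made up for illustration, and `read_html` still needs an HTML parser such as `lxml` installed. The string is wrapped in `StringIO` because recent Pandas versions warn when handed literal HTML.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment with two tables (not the Wikipedia page).
html = """
<table>
  <thead><tr><th>State</th><th>Capital</th></tr></thead>
  <tbody>
    <tr><td>Alabama</td><td>Montgomery</td></tr>
    <tr><td>Alaska</td><td>Juneau</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Territory</th><th>Capital</th></tr></thead>
  <tbody>
    <tr><td>Puerto Rico</td><td>San Juan</td></tr>
  </tbody>
</table>
"""

# One DataFrame per <table>, returned as a plain Python list.
demo_tables = pd.read_html(StringIO(html))
print(type(demo_tables), len(demo_tables))

# `match` keeps only the tables whose text matches the given pattern.
territories = pd.read_html(StringIO(html), match='Territory')
print(territories[0])
```

On a page with many tables, `match` can be easier than guessing which list position holds the table you want.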

We can select any of those DataFrames using normal list indexing.

In [5]:
df = tables[0]
df.columns = ['State', 'Abr.', 'State-hood Rank', 'Capital', 
              'Capital Since', 'Area (sq-mi)', 'Municipal Population', 'Metropolitan', 
              'Metropolitan Population', 'Population Rank', 'Notes']
df.head()

Unnamed: 0,State,Abr.,State-hood Rank,Capital,Capital Since,Area (sq-mi),Municipal Population,Metropolitan,Metropolitan Population,Population Rank,Notes
0,State,Abr.,State-hood,Capital,Capital since,Area (mi²),Population (2010),Notes,,,
1,Municipal (Within city proper boundaries),Metropolitan (Both within the capital city pro...,Rank in state,Rank in US,,,,,,,
2,Alabama,AL,1819,Montgomery,1846,155.4,205764,374536,2.0,102.0,Birmingham is the state's largest city.
3,Alaska,AK,1959,Juneau,1906,2716.7,31275,,3.0,,Largest capital by municipal land area.
4,Arizona,AZ,1912,Phoenix,1889,474.9,1445632,4192887,1.0,6.0,Phoenix is the most populous capital city in t...


Clean up the extra rows: the first two rows are leftover header text that `read_html` parsed as data, so we drop them.

In [6]:
df = df.iloc[2:]
df.head()

Unnamed: 0,State,Abr.,State-hood Rank,Capital,Capital Since,Area (sq-mi),Municipal Population,Metropolitan,Metropolitan Population,Population Rank,Notes
2,Alabama,AL,1819,Montgomery,1846,155.4,205764,374536.0,2.0,102.0,Birmingham is the state's largest city.
3,Alaska,AK,1959,Juneau,1906,2716.7,31275,,3.0,,Largest capital by municipal land area.
4,Arizona,AZ,1912,Phoenix,1889,474.9,1445632,4192887.0,1.0,6.0,Phoenix is the most populous capital city in t...
5,Arkansas,AR,1836,Little Rock,1821,116.2,193524,699757.0,1.0,117.0,
6,California,CA,1850,Sacramento,1854,97.2,466488,2149127.0,6.0,35.0,
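Because header text was mixed in with the data, columns that look numeric may still be stored as strings after the extra rows are dropped. The following is a minimal sketch on a made-up two-column frame (the names `raw` and `clean` are hypothetical) showing one way to convert such a column with `pd.to_numeric`; whether the real table needs this depends on how Pandas parsed it.

```python
import pandas as pd

# Toy stand-in for the scraped table: two header-like rows, then data.
raw = pd.DataFrame({
    'State': ['State', 'Municipal', 'Alabama', 'Alaska'],
    'Area (sq-mi)': ['Area', None, '155.4', '2716.7'],
})

# Drop the first two rows; .copy() makes the result an independent frame.
clean = raw.iloc[2:].copy()

# Convert the string column to numbers; unparseable values become NaN.
clean['Area (sq-mi)'] = pd.to_numeric(clean['Area (sq-mi)'], errors='coerce')

print(clean.dtypes)
```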


Set the index to the `State` column

In [7]:
df = df.set_index('State')
df.head()

State,Abr.,State-hood Rank,Capital,Capital Since,Area (sq-mi),Municipal Population,Metropolitan,Metropolitan Population,Population Rank,Notes
Alabama,AL,1819,Montgomery,1846,155.4,205764,374536.0,2.0,102.0,Birmingham is the state's largest city.
Alaska,AK,1959,Juneau,1906,2716.7,31275,,3.0,,Largest capital by municipal land area.
Arizona,AZ,1912,Phoenix,1889,474.9,1445632,4192887.0,1.0,6.0,Phoenix is the most populous capital city in t...
Arkansas,AR,1836,Little Rock,1821,116.2,193524,699757.0,1.0,117.0,
California,CA,1850,Sacramento,1854,97.2,466488,2149127.0,6.0,35.0,


In [8]:
df.loc['Alabama']

Abr.                                                            AL
State-hood Rank                                               1819
Capital                                                 Montgomery
Capital Since                                                 1846
Area (sq-mi)                                                 155.4
Municipal Population                                        205764
Metropolitan                                                374536
Metropolitan Population                                          2
Population Rank                                                102
Notes                      Birmingham is the state's largest city.
Name: Alabama, dtype: object
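With the state names as the index, `.loc` looks rows up by label. As a small sketch on a made-up frame (the name `demo` is hypothetical; the values are taken from the rows shown above), the same accessor can return a whole row, a single cell, or a sub-table:

```python
import pandas as pd

demo = pd.DataFrame({
    'State': ['Alabama', 'Alaska'],
    'Capital': ['Montgomery', 'Juneau'],
    'Capital Since': [1846, 1906],
}).set_index('State')

# A single row label returns a Series indexed by column name.
row = demo.loc['Alabama']

# A row label plus a column label returns a single value.
capital = demo.loc['Alaska', 'Capital']

# Lists of labels return a smaller DataFrame.
subset = demo.loc[['Alabama', 'Alaska'], ['Capital']]

print(row)
print(capital)
print(subset)
```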

## DataFrames as HTML

Pandas also has a `to_html` method that we can use to generate HTML tables from DataFrames.

In [9]:
html_table = df.to_html()
html_table

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Abr.</th>\n      <th>State-hood Rank</th>\n      <th>Capital</th>\n      <th>Capital Since</th>\n      <th>Area (sq-mi)</th>\n      <th>Municipal Population</th>\n      <th>Metropolitan</th>\n      <th>Metropolitan Population</th>\n      <th>Population Rank</th>\n      <th>Notes</th>\n    </tr>\n    <tr>\n      <th>State</th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Alabama</th>\n      <td>AL</td>\n      <td>1819</td>\n      <td>Montgomery</td>\n      <td>1846</td>\n      <td>155.4</td>\n      <td>205764</td>\n      <td>374536</td>\n      <td>2.0</td>\n      <td>102.0</td>\n      <td>Birmingham is the state\'s largest city.</td>\n    </tr>\n    <tr>\n      <th>Alaska</th>\n      <td>AK</td>\n 

You may have to strip unwanted newlines to clean up the table.

In [10]:
html_table.replace('\n', '')

'<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Abr.</th>      <th>State-hood Rank</th>      <th>Capital</th>      <th>Capital Since</th>      <th>Area (sq-mi)</th>      <th>Municipal Population</th>      <th>Metropolitan</th>      <th>Metropolitan Population</th>      <th>Population Rank</th>      <th>Notes</th>    </tr>    <tr>      <th>State</th>      <th></th>      <th></th>      <th></th>      <th></th>      <th></th>      <th></th>      <th></th>      <th></th>      <th></th>      <th></th>    </tr>  </thead>  <tbody>    <tr>      <th>Alabama</th>      <td>AL</td>      <td>1819</td>      <td>Montgomery</td>      <td>1846</td>      <td>155.4</td>      <td>205764</td>      <td>374536</td>      <td>2.0</td>      <td>102.0</td>      <td>Birmingham is the state\'s largest city.</td>    </tr>    <tr>      <th>Alaska</th>      <td>AK</td>      <td>1959</td>      <td>Juneau</td>      <td>1906</td>      <td>2716.7</td>      <td>312
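`to_html` also takes a few keyword arguments that adjust the generated markup. The sketch below uses a made-up frame (`demo` is hypothetical), and the CSS class names are only an example for a page that happens to load matching styles.

```python
import pandas as pd

demo = pd.DataFrame({
    'State': ['Alabama', 'Alaska'],
    'Capital': ['Montgomery', 'Juneau'],
}).set_index('State')

# Default output, as in the cell above.
html_default = demo.to_html()

# border=0 avoids the default border="1"; `classes` adds CSS classes
# next to the built-in "dataframe" class; na_rep controls how NaN prints.
html_custom = demo.to_html(border=0, classes='table table-striped', na_rep='')

# Same newline cleanup as before, applied to the customized table.
compact = html_custom.replace('\n', '')
print(compact)
```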

You can also save the table directly to a file.

In [11]:
df.to_html('table.html')

In [12]:
# macOS users can run this to open the file in a browser;
# on other systems, find the file and open it in the browser manually
!open table.html
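The `open` shell command is specific to macOS. As a hedged alternative, the standard-library `webbrowser` module can usually do the same on other platforms. The helper names `as_file_url` and `open_in_browser` below are hypothetical, and whether a browser actually launches depends on the environment the notebook runs in.

```python
import webbrowser
from pathlib import Path


def as_file_url(path):
    """Turn a local path into an absolute file:// URL."""
    return Path(path).resolve().as_uri()


def open_in_browser(path):
    """Ask the default browser to open a local file.

    Returns True if a browser could be launched, False otherwise.
    """
    return webbrowser.open(as_file_url(path))
```

Calling `open_in_browser('table.html')` in a cell should then open the saved table on a machine with a desktop browser.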