### Scraping from the Web using 'read_html' ###

In [3]:
import pandas as pd

##### You may have to install a module called lxml if you get an error message for the cell below #####

conda install lxml
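
Before running the install, it can help to check whether lxml is already available. The snippet below is a minimal sketch using only the standard library; the variable name `has_lxml` is just illustrative. pandas can also fall back to the bs4 and html5lib parsers, so a `False` result here does not necessarily mean `read_html` will fail.

```python
import importlib.util

# find_spec returns None when the package cannot be found on the import path
has_lxml = importlib.util.find_spec("lxml") is not None
print(f"lxml installed: {has_lxml}")
```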

In [4]:
# Read in the tables on this Wikipedia page
tables = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')

# Check how many tables have been read in
print(f"no_of_tables_retrieved = {len(tables)}")

no_of_tables_retrieved = 28
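
With this many tables, one way to decide which index to pick is to print the shape and column names of each one. The sketch below uses two small hand-built DataFrames (`example_tables`, a hypothetical stand-in for the list returned by `read_html`) so that it runs without a network connection; with the real data you would loop over `tables` instead.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html
example_tables = [
    pd.DataFrame({"Country": ["Mexico", "Somalia"], "Population": [95227, 76658]}),
    pd.DataFrame({"Year": [1860, 1864], "No.": [22069, 25055]}),
]

# Collect (index, shape, column names) for every table in the list
summary = [(i, t.shape, list(t.columns)) for i, t in enumerate(example_tables)]
for i, shape, columns in summary:
    print(f"table {i}: shape={shape}, columns={columns}")
```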


In [5]:
# Assign one of the tables to a variable and print it

df = tables[5]
print(f"Table number 5 in DataFrame format:\n\n{df}")

Table number 5 in DataFrame format:

            Country  Population
0            Mexico       95227
1           Somalia       76658
2   Hmong people[b]       55005
3             India       39559
4          Ethiopia       36982
5              Laos       24901
6             China       24353
7           Vietnam       22283
8           Liberia       20168
9       South Korea       20126
10         Thailand       19235
11           Canada       18804
12            Kenya       16823
13          Myanmar       15679
14      Philippines       13544
15           Russia       12787
16      El Salvador       12137
17          Nigeria        9508
18        Guatemala        7727
19          Ecuador        6298
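
Scraped Wikipedia tables often carry footnote markers, such as the `[b]` in "Hmong people[b]" above. A possible cleanup, sketched here on the first three rows of that table (the `sample` frame is rebuilt by hand for illustration), is to strip any bracketed marker with a regular expression.

```python
import pandas as pd

# First three rows of the table printed above, rebuilt by hand
sample = pd.DataFrame({
    "Country": ["Mexico", "Somalia", "Hmong people[b]"],
    "Population": [95227, 76658, 55005],
})

# Remove bracketed footnote markers such as [b] or [12], then trim whitespace
sample["Country"] = (
    sample["Country"].str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
)
print(sample)
```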


In [6]:
# With 28 tables, it can be challenging to find the one you need.
# To make the selection easier, use the match parameter to keep only the tables whose text matches a string or regex.
# Here the caption “United States presidential election results for Minnesota” selects the table.

# Read from the Wikipedia page: skiprows=1 skips the first of the multiple header rows, header=0 uses the next row as the header, and [0] takes the first table in the returned list
table = pd.read_html(
    'https://en.wikipedia.org/wiki/Minnesota',
    match='United States presidential election results for Minnesota',
    skiprows=1,
    header=0,
)[0]

# Assign df1
df1 = table
# Check type of data structure
print(f"Data type: {type(df1)}\n")

# Rename columns
df1 = df1.rename(columns={'No.':'Republican','%':'R%','No..1':'Democratic','%.1':'D%','No..2':'Third Party','%.2':'T%'})

# Take a look
print(f"The dataframe:\n\n{df1.head()}\n\nIts shape: {df1.shape}\n\nThe index: {df1.index}")

Data type: <class 'pandas.core.frame.DataFrame'>

The dataframe:

   Year  Republican      R%  Democratic      D%  Third Party     T%
0  1860       22069  63.53%       11920  34.31%          748  2.15%
1  1864       25055  59.06%       17367  40.94%            0  0.00%
2  1868       43722  60.88%       28096  39.12%            0  0.00%
3  1872       55708  61.27%       35211  38.73%            0  0.00%
4  1876       72955  58.80%       48587  39.16%         2533  2.04%

Its shape: (42, 7)

The index: RangeIndex(start=0, stop=42, step=1)
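
The odd-looking keys in the rename step (`'No..1'`, `'%.1'`, and so on) come from pandas making repeated header names unique by appending `.1`, `.2`, and so on. The sketch below shows the same renaming with `read_csv` on a hand-written string, chosen only because it needs no HTML parser; the assumption is that `read_html` applies the same rule to the repeated "No." and "%" headers.

```python
import io
import pandas as pd

# Two header cells are repeated ("No." and "%"), as in the election table
csv_text = "Year,No.,%,No.,%\n1860,22069,63.53%,11920,34.31%\n"

# pandas appends .1 to the second occurrence of each repeated name
demo = pd.read_csv(io.StringIO(csv_text))
print(list(demo.columns))
```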


In [7]:
# Gather info and check the datatypes of the individual columns
print(df1.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Year         42 non-null     int64 
 1   Republican   42 non-null     int64 
 2   R%           42 non-null     object
 3   Democratic   42 non-null     int64 
 4   D%           42 non-null     object
 5   Third Party  42 non-null     int64 
 6   T%           42 non-null     object
dtypes: int64(4), object(3)
memory usage: 2.4+ KB
None
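
Rather than reading the `info()` output by eye, the non-numeric columns can also be listed programmatically. This is a small sketch on a hand-built two-row sample (the names `sample` and `text_columns` are illustrative); on the real data you would pass `df1` instead.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Two rows rebuilt by hand from the election table above
sample = pd.DataFrame({
    "Year": [1860, 1864],
    "Republican": [22069, 25055],
    "R%": ["63.53%", "59.06%"],
})

# Keep the names of columns that pandas did not parse as numbers
text_columns = [col for col in sample.columns if not is_numeric_dtype(sample[col])]
print(text_columns)
```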


In [8]:
# We need to convert the % columns to numeric values before doing any analysis.
# df1['R%'].astype('float') fails because each value ends with a % sign, so strip it first.
# Note: this expression only displays the converted columns; df1 itself is unchanged unless the result is assigned back.

df1[['D%','R%']].replace({'%':''}, regex=True).astype('float')

      D%     R%
0  34.31  63.53
1  40.94  59.06
2  39.12  60.88
3  38.73  61.27
4  39.16  58.80
5  35.36  62.28
6  36.87  58.78
7  39.65  54.12
8  37.76  45.96
9  40.89  56.62
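
The cell above returns a converted copy and leaves the original columns as text. To keep the numeric values, the result has to be assigned back. The sketch below does this on a hand-built two-row sample and uses `str.rstrip('%')` as an alternative to the regex replace; `sample` and `pct_cols` are illustrative names, and with the real data you would use `df1`.

```python
import pandas as pd

# First two rows of the election table, rebuilt by hand
sample = pd.DataFrame({
    "Year": [1860, 1864],
    "R%": ["63.53%", "59.06%"],
    "D%": ["34.31%", "40.94%"],
    "T%": ["2.15%", "0.00%"],
})

pct_cols = ["R%", "D%", "T%"]

# Strip the trailing % sign and store the floats back in the frame
sample[pct_cols] = sample[pct_cols].apply(
    lambda col: col.str.rstrip("%").astype(float)
)
print(sample.dtypes)
```

After the assignment the percentage columns can be used in numeric operations such as `sample["R%"] - sample["D%"]`.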
