# Week 8 - Web Scraping

### Objective:

#### Parse data from the HTML of a website

### Helpful Resources

Site to scrape - https://www.multistate.us/issues/covid-19-state-reopening-guide

Beautiful Soup documentation - https://www.crummy.com/software/BeautifulSoup/bs4/doc/

#### Import libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

#### Make soup

Make a request to the website

In [25]:
response_ranks = requests.get('https://www.multistate.us/issues/covid-19-state-reopening-guide')
type(response_ranks)

requests.models.Response
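
The `Response` object also records whether the request succeeded. Below is a minimal sketch that fails loudly on an HTTP error instead of parsing an error page. It assumes a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout; neither is part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as a string (hypothetical helper, not from the notebook)."""
    response = requests.get(url, timeout=timeout)  # the timeout value is an assumption
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Usage (performs a real network request):
# html = fetch_html('https://www.multistate.us/issues/covid-19-state-reopening-guide')
```

The returned string can be passed to `BeautifulSoup` in the same way as `response_ranks.text`.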

Create a soup object from the text of the response object

In [26]:
soup_ranks = BeautifulSoup(response_ranks.text, 'lxml')
type(soup_ranks)

bs4.BeautifulSoup
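
The `'lxml'` parser is a separate package that must be installed alongside Beautiful Soup. As a hedged sketch, the hypothetical helper `make_soup` below falls back to Python's built-in `'html.parser'` when `lxml` is missing. The HTML string is a made-up stand-in, not the real page.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml if available, else the built-in parser (hypothetical helper)."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:  # raised by bs4 when the requested parser is not installed
        return BeautifulSoup(html, 'html.parser')


# Stand-in document used only for illustration
soup_demo = make_soup('<html><head><title>Demo</title></head><body></body></html>')
title_text = soup_demo.title.string  # the title text without the <title> tags
```

The two parsers can treat malformed HTML differently, so results on a messy page may not be identical.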

#### Check soup

In [27]:
soup_ranks.title

<title>COVID-19 State Reopening Guide | MultiState</title>

In [24]:
'COVID' in soup_ranks.text

True

In [29]:
'Ratings' in soup_ranks.text

True

In [30]:
'Molly' in soup_ranks.text

False

#### Soup to array

Retrieve the ratings table from the soup object

In [None]:
table = soup_ranks.find('table') # find() returns the first <table>; this works because our soup has only one table
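
Because `find('table')` returns only the first match, it can help to confirm how many tables a page contains before relying on it. The sketch below uses made-up stand-in HTML with two tables and a hypothetical helper `has_header` to pick the table whose header row contains `Rank`.

```python
from bs4 import BeautifulSoup

# Stand-in HTML with two tables (an assumption for illustration; the real page has one)
html = """
<table><tr><th>Name</th></tr><tr><td>x</td></tr></table>
<table><tr><th>Rank</th><th>State</th><th>Score</th></tr>
<tr><td>(1)</td><td>Alabama</td><td>100</td></tr></table>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

tables = soup_demo.find_all('table')  # every <table> in the document
n_tables = len(tables)


def has_header(table, wanted):
    """True if any <th> in the table has exactly the wanted text (hypothetical helper)."""
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    return wanted in headers


# Select the first table that has a 'Rank' header cell
rank_table = next(t for t in tables if has_header(t, 'Rank'))
first_state = rank_table.find_all('td')[1].get_text(strip=True)
```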

Retrieve the values in the table and create a nested array (one inner array per table row)

In [31]:
array_ranks = []
for row in table.find_all('tr'): # tr = table row
    temp = [] # create empty array for each table row
    for cell in row.find_all(['th', 'td']): # th = table header; td = table cell
        temp.append(cell.text)
    array_ranks.append(temp)
array_ranks

[['Rank', 'State', 'Score', 'Rank', 'State', 'Score'],
 ['(1)', 'Alabama', '100', '(26)', 'North Carolina', '90'],
 ['(2)', 'Arizona', '100', '(27)', 'Utah', '90'],
 ['(3)', 'Arkansas', '100', '(28)', 'Wisconsin', '90'],
 ['(4)', 'Florida', '100', '(29)', 'Connecticut', '86'],
 ['(5)', 'Georgia', '100', '(30)', 'Ohio', '86'],
 ['(6)', 'South Carolina', '100', '(31)', 'New Jersey', '82'],
 ['(7)', 'Alaska', '96', '(32)', 'Virginia', '81'],
 ['(8)', 'Indiana', '96', '(33)', 'Vermont', '80'],
 ['(9)', 'Iowa', '96', '(34)', 'New York', '79'],
 ['(10)', 'Montana', '96', '(35)', 'Minnesota', '74'],
 ['(11)', 'New Hampshire', '96', '(36)', 'Maine', '72'],
 ['(12)', 'North Dakota', '96', '(37)', 'Nevada', '72'],
 ['(13)', 'Oklahoma', '96', '(38)', 'Kentucky', '70'],
 ['(14)', 'South Dakota', '96', '(39)', 'Pennsylvania', '68'],
 ['(15)', 'Texas', '96', '(40)', 'Michigan', '65'],
 ['(16)', 'West Virginia', '96', '(41)', 'Rhode Island', '63'],
 ['(17)', 'Louisiana', '93', '(42)', 'Massachusetts'
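
The row-by-row loop above can be wrapped in a reusable function. This is a minimal sketch, assuming a hypothetical helper named `table_to_rows`. It uses `get_text(strip=True)` rather than `.text` so that stray whitespace inside cells is removed. The HTML is a small stand-in, not the real page.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return one list of stripped cell strings per <tr> (hypothetical helper)."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        for row in table.find_all('tr')
    ]


# Stand-in table with untidy whitespace, for illustration only
html = """
<table>
  <tr><th> Rank </th><th>State</th><th>Score</th></tr>
  <tr><td>(1)</td><td> Alabama
  </td><td>100</td></tr>
</table>
"""
rows = table_to_rows(BeautifulSoup(html, 'html.parser').find('table'))
```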

#### Array to dataframe

Using the array from above, create a pandas dataframe

In [32]:
colnames = array_ranks[0] # first array in our nested array contains the column names
df_ranks = pd.DataFrame(data=array_ranks, columns=colnames)
df_ranks

|  | Rank | State | Score | Rank | State | Score |
|---|---|---|---|---|---|---|
| 0 | Rank | State | Score | Rank | State | Score |
| 1 | (1) | Alabama | 100 | (26) | North Carolina | 90 |
| 2 | (2) | Arizona | 100 | (27) | Utah | 90 |
| 3 | (3) | Arkansas | 100 | (28) | Wisconsin | 90 |
| 4 | (4) | Florida | 100 | (29) | Connecticut | 86 |
| 5 | (5) | Georgia | 100 | (30) | Ohio | 86 |
| 6 | (6) | South Carolina | 100 | (31) | New Jersey | 82 |
| 7 | (7) | Alaska | 96 | (32) | Virginia | 81 |
| 8 | (8) | Indiana | 96 | (33) | Vermont | 80 |
| 9 | (9) | Iowa | 96 | (34) | New York | 79 |


In [33]:
df_ranks.drop([0], inplace=True) # row 0 repeats the column names, so remove it
df_ranks

|  | Rank | State | Score | Rank | State | Score |
|---|---|---|---|---|---|---|
| 1 | (1) | Alabama | 100 | (26) | North Carolina | 90 |
| 2 | (2) | Arizona | 100 | (27) | Utah | 90 |
| 3 | (3) | Arkansas | 100 | (28) | Wisconsin | 90 |
| 4 | (4) | Florida | 100 | (29) | Connecticut | 86 |
| 5 | (5) | Georgia | 100 | (30) | Ohio | 86 |
| 6 | (6) | South Carolina | 100 | (31) | New Jersey | 82 |
| 7 | (7) | Alaska | 96 | (32) | Virginia | 81 |
| 8 | (8) | Indiana | 96 | (33) | Vermont | 80 |
| 9 | (9) | Iowa | 96 | (34) | New York | 79 |
| 10 | (10) | Montana | 96 | (35) | Minnesota | 74 |
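
The header row can also be left out when the dataframe is built, which avoids the separate drop step. This is a minimal sketch using a shortened copy of the array shown earlier; `array_demo` and `df_demo` are illustrative names.

```python
import pandas as pd

# Shortened copy of the nested array, for illustration
array_demo = [
    ['Rank', 'State', 'Score', 'Rank', 'State', 'Score'],
    ['(1)', 'Alabama', '100', '(26)', 'North Carolina', '90'],
    ['(2)', 'Arizona', '100', '(27)', 'Utah', '90'],
]

# Row 0 supplies the column names; the remaining rows supply the data
df_demo = pd.DataFrame(data=array_demo[1:], columns=array_demo[0])
```

One difference from the drop approach is that the index here starts at 0 rather than 1.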


Fix repeated columns

In [37]:
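df1 = df_ranks.iloc[:, 0:3] # first three columns (left half of the table)
df2 = df_ranks.iloc[:, 3:] # last three columns (right half of the table)
df_final = pd.concat([df1, df2], ignore_index=True) # stack the halves and renumber the rows from 0
df_final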

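|  | Rank | State | Score |
|---|---|---|---|
| 0 | (1) | Alabama | 100 |
| 1 | (2) | Arizona | 100 |
| 2 | (3) | Arkansas | 100 |
| 3 | (4) | Florida | 100 |
| 4 | (5) | Georgia | 100 |
| 5 | (6) | South Carolina | 100 |
| 6 | (7) | Alaska | 96 |
| 7 | (8) | Indiana | 96 |
| 8 | (9) | Iowa | 96 |
| 9 | (10) | Montana | 96 |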
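
Every value scraped from HTML is a string, so `Rank` and `Score` cannot yet be sorted or averaged as numbers. As a possible next step, the sketch below strips the parentheses from `Rank` and converts both columns to integers. It uses a three-row stand-in dataframe and assumes every cell holds a valid number, which may not be true of the full table.

```python
import pandas as pd

# Stand-in for a few rows of the final dataframe
df_demo = pd.DataFrame({
    'Rank': ['(1)', '(2)', '(26)'],
    'State': ['Alabama', 'Arizona', 'North Carolina'],
    'Score': ['100', '100', '90'],
})

df_clean = df_demo.copy()
# Remove the surrounding parentheses, then convert the remaining digits to integers
df_clean['Rank'] = df_clean['Rank'].str.strip('()').astype(int)
df_clean['Score'] = df_clean['Score'].astype(int)
```

If some cells are empty or non-numeric, `astype(int)` raises an error. `pd.to_numeric(..., errors='coerce')` is one alternative, since it turns such values into `NaN` instead.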
