# Week 3 -- Web Scraping Project

Create a Jupyter notebook with code to load and analyze Covid case metrics from https://www.worldometers.info/coronavirus/country/us/#nav-yesterday

Upload your completed notebook to a public GitHub repo for your AD450 coursework and submit a link to the uploaded file in Canvas.

Your notebook should have clearly labeled sections that:

* load state-level data from the provided web url into a Pandas dataframe with these requirements:
  * dataframe column names should match the HTML column headers
  * exclude the '#', 'source', and 'projections' columns 
  * exclude rows for country totals
  * state column should contain just the state name (no HTML)
* describe the dataframe by printing the first few rows
* print dataframe summary statistics 
* print the top 5 states (name & value) for each of these metrics:
  * new cases
  * total deaths
  * total cases / 1M pop
  * total deaths / 1M pop
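
The column and row exclusions listed above can be sketched on a small stand-in table. This is a minimal illustration, not the required solution: the header names (`USA State`, `Total Cases`, `Source`, `Projections`), the `USA Total` label, and all values are assumptions made up for the example.

```python
import pandas as pd

# Made-up rows that mimic the layout of a scraped table; not real data.
raw = pd.DataFrame(
    {
        "#": ["", "1", "2"],
        "USA State": ["USA Total", "\nCalifornia ", "\nTexas "],
        "Total Cases": ["300", "200", "100"],
        "Source": ["", "[1]", "[2]"],
        "Projections": ["", "[projections]", "[projections]"],
    }
)

# Drop the columns the assignment excludes, matching names case-insensitively.
excluded = {"#", "source", "projections"}
keep = [col for col in raw.columns if col.strip().lower() not in excluded]
states = raw[keep].copy()

# Strip whitespace left over from the HTML, then drop the country total row.
states["USA State"] = states["USA State"].str.strip()
states = states[states["USA State"] != "USA Total"].reset_index(drop=True)

print(states)
```

Filtering by column name rather than by position keeps the remaining headers identical to the ones in the source table.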

NOTE:

Each notebook section should include a text description and code that executes successfully.


In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

Load data from Worldometer:

In [None]:
url = "https://www.worldometers.info/coronavirus/country/us/#nav-yesterday"
response = requests.get(url)
print(response)

soup = BeautifulSoup(response.text, "html.parser")
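table = soup.find("table", id="usa_table_countries_today")
body = table.find("tbody")
# Drop the country-level totals row, if present, so only state rows remain
total_row = body.find("tr", class_="total_row_usa")
if total_row is not None:
  total_row.decompose()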

table_data = []
for row in body.find_all("tr"):
  # Collect the text of every cell in this row
  row_data = [cell.text for cell in row.find_all("td")]
  if row_data:
    data_item = {"State": row_data[1][1:],
                 "Total_Cases": row_data[2],
                 "New_Cases": row_data[3][1:],
                 "Total_Deaths": int(row_data[4][1:].replace(',','')),
                 "New_Deaths": row_data[5],
                 "Total_Recovered": row_data[6][1:],
                 "Active_Cases": row_data[7][1:],
                 "Total_Cases_Per_1M_Pop": int(row_data[8].replace(',','')),
                 "Deaths_Per_1M_Pop": int(row_data[9].replace(',','')),
                 "Total_Tests": row_data[10][1:],
                 "Tests_Per_1M_Pop": row_data[11],
                 "Population": row_data[12][1:]}
    table_data.append(data_item)

df = pd.DataFrame(table_data)
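
Several of the fields above are kept as text, and the numeric ones are converted with `int(...)`, which raises an error on a blank cell. One possible way to make the conversion tolerant is a small helper like the one below. The name `parse_count` is hypothetical, and the sample inputs are assumptions about what a cell might contain, not values taken from the page.

```python
from typing import Optional


def parse_count(text: str) -> Optional[int]:
    """Convert cell text such as '+1,234' to an int, or None if it holds no number."""
    cleaned = text.strip().lstrip("+").replace(",", "")
    # Blank cells and placeholders such as 'N/A' carry no number.
    if not cleaned.isdigit():
        return None
    return int(cleaned)


print(parse_count("+1,234"))    # 1234
print(parse_count("\n5,678 "))  # 5678
print(parse_count(""))          # None
print(parse_count("N/A"))       # None
```

Returning `None` for missing values lets Pandas store them as `NaN`, so the column can still be sorted and summarized.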


Describe the dataframe by printing the first few rows

In [None]:
# Show at most 4 columns so the preview stays readable
pd.set_option("display.max_columns", 4)
print(df.head())

Print dataframe summary statistics

In [None]:
print(df.describe())
print(f"shape: {df.shape}")
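
Note that `describe()` summarizes only numeric columns by default, so any column still stored as text is left out of the output. The sketch below shows the difference on a toy frame; the values are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame(
    {
        "State": ["A", "B", "C"],
        "Total_Cases": ["1,000", "2,000", "3,000"],  # still strings
        "Total_Deaths": [10, 20, 30],                # already integers
    }
)

# By default only numeric columns are summarized.
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# include="all" adds count/unique/top/freq for the text columns as well.
full_summary = toy.describe(include="all")
print(full_summary.columns.tolist())
```

Converting the text columns to numbers before calling `describe()` is the other option when mean, min, and max are wanted for every metric.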

Print the top 5 states (name and value) for each of these metrics

In [None]:
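# New Cases
# This column came through malformed in my scrape, so don't expect much from the output.
# It is stored as text, so convert it to numbers first; sorting the text would be alphabetical.
new_cases = pd.to_numeric(df["New_Cases"].str.replace(",", ""), errors="coerce")
top_new_cases = df.assign(New_Cases=new_cases).sort_values("New_Cases", ascending=False)
print(top_new_cases[["State","New_Cases"]].head(5))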
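
Sorting a text column compares values character by character, which is one way a numeric ranking can go wrong. The sketch below uses made-up values to show the effect and one possible fix with `pd.to_numeric` and `nlargest`. The column name `New_Cases_Num` is an assumption introduced for the example.

```python
import pandas as pd

# Made-up values for illustration only; the blank cell stands for a missing count.
toy = pd.DataFrame(
    {
        "State": ["A", "B", "C", "D"],
        "New_Cases": ["+950", "+1,200", "", "+87"],
    }
)

# Sorting the raw strings ranks "+950" above "+1,200".
string_order = toy.sort_values("New_Cases", ascending=False)["State"].tolist()
print(string_order)

# Strip "+" and "," first, then coerce anything non-numeric (such as "") to NaN.
cleaned = toy["New_Cases"].str.replace(r"[+,]", "", regex=True)
toy["New_Cases_Num"] = pd.to_numeric(cleaned, errors="coerce")

# nlargest returns the rows with the highest values; NaN rows are not ranked first.
top = toy.nlargest(2, "New_Cases_Num")
print(top[["State", "New_Cases_Num"]])
```

The same pattern applies to any metric column that was scraped as text.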

In [None]:
# Total Deaths
top_deaths = df.sort_values("Total_Deaths", ascending=False)
print(top_deaths[["State","Total_Deaths"]].head(5))

In [None]:
# Total Cases per 1M pop
top_cases_per_pop = df.sort_values("Total_Cases_Per_1M_Pop", ascending=False)
print(top_cases_per_pop[["State","Total_Cases_Per_1M_Pop"]].head(5))

In [None]:
# Total Deaths per 1M pop
top_deaths_per_pop = df.sort_values("Deaths_Per_1M_Pop", ascending=False)
print(top_deaths_per_pop[["State","Deaths_Per_1M_Pop"]].head(5))

The cell below was used to check column names and values

In [None]:
# For testing purposes, run at your own risk: this prints every row and column
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(df)