# ETL - COVID State Vaccination Data
This project consolidates US COVID vaccination data by state and prepares it for loading into a PostgreSQL database.
- Notes on the data: I could not find consolidated vaccine data on the CDC websites that covered everything I wanted. Cumulative totals by state and daily counts at the national level were available, but I wanted daily vaccination counts by state. I therefore downloaded several files from Our World in Data (https://ourworldindata.org/us-states-vaccinations):
    - us-covid-number-fully-vaccinated-in-US.csv
    - us-covid-share-fully-vaccinated.csv
    - us-daily-covid-vaccine-doses-administered-by-state.csv
    - us-daily-covid-vaccine-doses-per-million.csv

The various steps are detailed below:
1. [Extract and Transform National and State level Vaccination Data](#Extract-and-Transform-National-and-State-level-Vaccination-Data)
2. [Extract and Transform COVID Case Data](#Extract-and-Transform-COVID-Case-Data)
3. [Extract and Transform COVID Death Data](#Extract-and-Transform-COVID-Death-Data)
4. [Load all data to the COVID PostgreSQL Database](#Load-Final-Data-to-PostgreSQL-Database)


In [1]:
# Dependencies
import pandas as pd
# Import psycopg2 - the DB API 2.0 compliant PostgreSQL driver for Python
import psycopg2
from sqlalchemy import create_engine


## Extract and Transform National and State level Vaccination Data 

In [2]:
# Files to Load
number_fully_vaccinated_to_load = "Resources/us-covid-number-fully-vaccinated-in-US.csv"
share_fully_vaccinated_to_load = "Resources/us-covid-share-fully-vaccinated.csv"
number_doses_administered_to_load = "Resources/us-daily-covid-vaccine-doses-administered-by-state.csv"
number_doses_per_million_to_load = "Resources/us-daily-covid-vaccine-doses-per-million.csv"

# Read Vaccine data files and store into Pandas DataFrames
nbr_fully_vaccinated_df = pd.read_csv(number_fully_vaccinated_to_load)
shr_fully_vaccinated_df = pd.read_csv(share_fully_vaccinated_to_load)
nbr_doses_administered_df = pd.read_csv(number_doses_administered_to_load)
nbr_doses_per_million_df = pd.read_csv(number_doses_per_million_to_load)


In [3]:
# Remove empty columns before merging
nbr_fully_vaccinated_df = nbr_fully_vaccinated_df.drop(columns=['Code'])
shr_fully_vaccinated_df = shr_fully_vaccinated_df.drop(columns=['Code'])
nbr_doses_administered_df = nbr_doses_administered_df.drop(columns=['Code'])
nbr_doses_per_million_df = nbr_doses_per_million_df.drop(columns=['Code'])

# Combine the data into a single dataset, joining on Entity and Date
df1 = pd.merge(nbr_fully_vaccinated_df, shr_fully_vaccinated_df, how="left", on=["Entity","Date"])
df2 = pd.merge(df1,nbr_doses_administered_df, how="left", on=["Entity","Date"])
df3 = pd.merge(df2, nbr_doses_per_million_df, how="left", on=["Entity","Date"])
# df3.head(100)
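
Each merge above joins on `Entity` and `Date`, so a duplicated key in any file would silently multiply rows, and a key missing from the right-hand file shows up only as NaN. The following is a minimal sketch, using small made-up frames rather than the real files, of how the `validate` and `indicator` options of `pd.merge` could be used to check for both cases. The Alaska row and its value are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for two of the source files (the Alaska row is made up)
left = pd.DataFrame({
    "Entity": ["Alabama", "Alabama", "Alaska"],
    "Date": ["1/12/2021", "1/13/2021", "1/12/2021"],
    "people_fully_vaccinated": [7270, 9245, 5400],
})
right = pd.DataFrame({
    "Entity": ["Alabama", "Alabama"],
    "Date": ["1/12/2021", "1/13/2021"],
    "daily_vaccinations": [0, 5906],
})

# validate raises MergeError if either side has duplicate (Entity, Date) keys;
# indicator adds a _merge column recording where each row came from
merged = pd.merge(left, right, how="left", on=["Entity", "Date"],
                  validate="one_to_one", indicator=True)

# Rows present only in the left frame carry NaN in the right-hand columns
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["Entity", "Date"]])
```

Rows flagged `left_only` are the ones that later receive zeros from `fillna`, so inspecting them first shows how many values are being filled in.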

In [4]:
# Clean up the merged file:
# Remove the rows listed under federal agencies. They duplicate data counted elsewhere and contain many NaNs:
# Bureau of Prisons, Dept of Defense, Indian Health Svc, Long Term Care, Veterans Health
federal_entities = ["Bureau of Prisons", "Dept of Defense", "Indian Health Svc",
                    "Long Term Care", "Veterans Health"]
vaccinations_df = df3.loc[~df3["Entity"].isin(federal_entities), :].copy()

# Replace remaining NaN values with zeros - these primarily occurred on the first day of data collection for some states.
vaccinations_df.fillna(value=0, inplace=True)

# Change the columns back to integers (the NaN values had forced them to float, which shows a decimal position)
vaccinations_df['daily_vaccinations'] = vaccinations_df['daily_vaccinations'].astype(int) 
vaccinations_df['daily_vaccinations_per_million'] = vaccinations_df['daily_vaccinations_per_million'].astype(int)
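
To illustrate why the cast back to integer is needed, here is a small sketch on a toy column (not the real data). It also shows one possible alternative, the pandas nullable `Int64` dtype, which keeps a missing value missing instead of recording it as zero. Whether zero or "missing" is the better representation for a state's first reporting day is a judgment call that this notebook resolves in favor of zero.

```python
import pandas as pd

# Toy column with a missing first-day value (illustrative only)
df = pd.DataFrame({"daily_vaccinations": [float("nan"), 5906, 7478]})
print(df["daily_vaccinations"].dtype)  # float64, because NaN is a float

# Approach used in this notebook: fill with 0, then cast back to int
filled = df["daily_vaccinations"].fillna(0).astype(int)

# Alternative: the nullable integer dtype keeps the value missing (<NA>)
nullable = df["daily_vaccinations"].astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```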


In [5]:
vaccinations_df.head()

| | Entity | Date | people_fully_vaccinated | people_fully_vaccinated_per_hundred | daily_vaccinations | daily_vaccinations_per_million |
|---|---|---|---|---|---|---|
| 0 | Alabama | 1/12/2021 | 7270 | 0.15 | 0 | 0 |
| 1 | Alabama | 1/13/2021 | 9245 | 0.19 | 5906 | 1205 |
| 2 | Alabama | 1/15/2021 | 13488 | 0.28 | 7478 | 1525 |
| 3 | Alabama | 1/19/2021 | 16346 | 0.33 | 7523 | 1534 |
| 4 | Alabama | 1/20/2021 | 17956 | 0.37 | 7880 | 1607 |


In [6]:
# Restructure the data before finalizing it
# Rename Entity to state_name and Date to date_administered to better reflect the content of the final, cleaned-up dataframe.
vaccinations_df.rename(columns={'Entity':'state_name', 'Date':'date_administered' }, 
                 inplace=True)

# Split the national rows (state_name == "United States") into their own dataframe to make it easier to create
# the two tables, US_vaccinations and State_vaccinations.
# Important note: the national and state numbers aren't always the same because of the way the different
# jurisdictions report their data and how the CDC cross-checks and totals it. I am preserving that difference
# by creating two separate tables.
US_vaccinations_df = vaccinations_df.loc[(vaccinations_df["state_name"] == "United States"), :].copy()
US_vaccinations_df = US_vaccinations_df.drop(columns=['state_name'])
US_vaccinations_df.reset_index(drop=True, inplace=True)

# US_vaccinations_df = US_vaccinations_df.set_index('Date')

state_vaccinations_df = vaccinations_df.loc[(vaccinations_df["state_name"] != "United States"), :].copy()
state_vaccinations_df.reset_index(drop=True, inplace=True)
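
The dates arrive from the CSV files as text such as `1/12/2021`, and the notebook passes them on unchanged. One optional extra step, sketched here on a toy frame and assuming the month/day/year layout seen in the sample output, would be to parse them explicitly so that sorting and the database column type do not depend on implicit conversion. The variable names `us_df` and `state_df` are placeholders for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "state_name": ["Alabama", "Alabama", "United States"],
    "date_administered": ["1/12/2021", "1/13/2021", "1/12/2021"],
})

# Assumes month/day/year text, as in the sample output
df["date_administered"] = pd.to_datetime(df["date_administered"], format="%m/%d/%Y")

# Same national/state split as in the cell above, written with a boolean mask
is_national = df["state_name"] == "United States"
us_df = df.loc[is_national].drop(columns=["state_name"]).reset_index(drop=True)
state_df = df.loc[~is_national].reset_index(drop=True)

print(us_df.dtypes)
print(state_df)
```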


In [7]:
# Write the merged/cleaned up files to new csv files for backup purposes 
US_vaccinations_df.to_csv("Resources/US_vaccinations.csv", index=False, encoding="utf-8")
state_vaccinations_df.to_csv("Resources/State_Vaccinations.csv", index=False, encoding="utf-8")

## Extract and Transform COVID Case Data 

## Extract and Transform COVID Death Data 

## Load Final Data to PostgreSQL Database 

In [8]:
connection_string = "postgres:password@localhost:5432/COVID"
engine = create_engine(f'postgresql://{connection_string}')
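
The connection string above hard-codes the password in the notebook. A possible alternative, shown here as a sketch rather than what this project does, is to assemble the URL from environment variables. The helper name `build_postgres_url` is hypothetical, and the use of the `PGUSER`/`PGPASSWORD`/`PGHOST`/`PGPORT`/`PGDATABASE` variables is an assumption about how the environment is set up.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a postgresql:// URL from environment-style settings (hypothetical helper)."""
    user = env.get("PGUSER", "postgres")
    # Escape characters such as @ or / that would otherwise break the URL
    password = quote_plus(env.get("PGPASSWORD", ""))
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "COVID")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# A plain dict stands in for os.environ so the sketch runs without any setup
url = build_postgres_url({"PGUSER": "postgres", "PGPASSWORD": "p@ss/word"})
print(url)  # postgresql://postgres:p%40ss%2Fword@localhost:5432/COVID
```

The resulting string can be passed to `create_engine()` in place of the f-string used above.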

In [9]:
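# Confirm tables (Engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and 2.0)
from sqlalchemy import inspect
inspect(engine).get_table_names()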

['state_vaccinations', 'us_vaccinations']

In [10]:
US_vaccinations_df.to_sql(name='us_vaccinations', con=engine, if_exists='append', index=False)

In [12]:
pd.read_sql_query('select * from us_vaccinations', con=engine).head()

| | date_administered | people_fully_vaccinated | people_fully_vaccinated_per_hundred | daily_vaccinations | daily_vaccinations_per_million |
|---|---|---|---|---|---|
| 0 | 2021-01-12 | 782228 | 0.24 | 641524 | 1932 |
| 1 | 2021-01-13 | 1020260 | 0.31 | 710238 | 2139 |
| 2 | 2021-01-15 | 1610524 | 0.49 | 798707 | 2406 |
| 3 | 2021-01-19 | 2023124 | 0.61 | 911493 | 2745 |
| 4 | 2021-01-20 | 2161419 | 0.65 | 892403 | 2688 |


In [11]:
state_vaccinations_df.to_sql(name='state_vaccinations', con=engine, if_exists='append', index=False)

In [13]:
pd.read_sql_query('select * from state_vaccinations', con=engine).head()

| | state_name | date_administered | people_fully_vaccinated | people_fully_vaccinated_per_hundred | daily_vaccinations | daily_vaccinations_per_million |
|---|---|---|---|---|---|---|
| 0 | Alabama | 2021-01-12 | 7270 | 0.15 | 0 | 0 |
| 1 | Alabama | 2021-01-13 | 9245 | 0.19 | 5906 | 1205 |
| 2 | Alabama | 2021-01-15 | 13488 | 0.28 | 7478 | 1525 |
| 3 | Alabama | 2021-01-19 | 16346 | 0.33 | 7523 | 1534 |
| 4 | Alabama | 2021-01-20 | 17956 | 0.37 | 7880 | 1607 |
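
One thing to keep in mind with `if_exists='append'` is that rerunning the load cells adds the same rows again. The sketch below demonstrates this with an in-memory SQLite database standing in for PostgreSQL, so it runs without a server. The two-row frame is a toy subset of the data shown above.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the Postgres database (an assumption for this sketch)
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({
    "state_name": ["Alabama", "Alabama"],
    "date_administered": ["2021-01-12", "2021-01-13"],
    "daily_vaccinations": [0, 5906],
})
count_sql = "select count(*) as n from state_vaccinations"

# Running the load twice with if_exists="append" stores every row twice
df.to_sql("state_vaccinations", conn, if_exists="append", index=False)
df.to_sql("state_vaccinations", conn, if_exists="append", index=False)
appended = pd.read_sql_query(count_sql, conn)["n"].iloc[0]

# if_exists="replace" drops and recreates the table, so a rerun leaves one copy
df.to_sql("state_vaccinations", conn, if_exists="replace", index=False)
replaced = pd.read_sql_query(count_sql, conn)["n"].iloc[0]

print(appended, replaced)  # 4 2
conn.close()
```

Note that `replace` also discards any column types or keys defined when the table was created, so for tables set up in advance, as they appear to be here, emptying the table before appending may be the safer way to make reruns repeatable.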
