# ETL - COVID State Vaccination Data
This project consolidates US COVID vaccination data by state to prepare it for upload to a Postgres database.
- Notes on the data: I could not find consolidated vaccine data on the CDC websites that covered everything I wanted. I could get cumulative totals by state, and I could get daily counts at the national level, but I wanted daily vaccination counts by state. So I had to use Our World in Data's site and download the following files from https://ourworldindata.org/us-states-vaccinations:
    - us-covid-number-fully-vaccinated-in-US.csv
    - us-covid-share-fully-vaccinated.csv
    - us-daily-covid-vaccine-doses-administered-by-state.csv
    - us-daily-covid-vaccine-doses-per-million.csv

The various steps are detailed below:
1. [Extract, Merge and Clean State level Vaccination Data](#Extract,-Merge-and-Clean-State-level-Vaccination-Data)


In [1]:
# Dependencies
import pandas as pd


## Extract, Merge and Clean State level Vaccination Data 

In [2]:
# Files to Load
number_fully_vaccinated_to_load = "Resources/us-covid-number-fully-vaccinated-in-US.csv"
share_fully_vaccinated_to_load = "Resources/us-covid-share-fully-vaccinated.csv"
number_doses_administered_to_load = "Resources/us-daily-covid-vaccine-doses-administered-by-state.csv"
number_doses_per_million_to_load = "Resources/us-daily-covid-vaccine-doses-per-million.csv"

# Read the vaccine data files and store them in pandas DataFrames
nbr_fully_vaccinated_df = pd.read_csv(number_fully_vaccinated_to_load)
shr_fully_vaccinated_df = pd.read_csv(share_fully_vaccinated_to_load)
nbr_doses_administered_df = pd.read_csv(number_doses_administered_to_load)
nbr_doses_per_million_df = pd.read_csv(number_doses_per_million_to_load)


In [3]:
# Remove empty columns before merging
nbr_fully_vaccinated_df = nbr_fully_vaccinated_df.drop(columns=['Code'])
shr_fully_vaccinated_df = shr_fully_vaccinated_df.drop(columns=['Code'])
nbr_doses_administered_df = nbr_doses_administered_df.drop(columns=['Code'])
nbr_doses_per_million_df = nbr_doses_per_million_df.drop(columns=['Code'])

# Combine the four DataFrames into a single dataset, joining on Entity and Date
df1 = pd.merge(nbr_fully_vaccinated_df, shr_fully_vaccinated_df, how="left", on=["Entity", "Date"])
df2 = pd.merge(df1, nbr_doses_administered_df, how="left", on=["Entity", "Date"])
df3 = pd.merge(df2, nbr_doses_per_million_df, how="left", on=["Entity", "Date"])
# df3.head(100)
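
A left merge keeps every row of the first DataFrame and silently fills in NaN where the other file has no matching `Entity`/`Date` pair. The sketch below shows one way to surface those gaps using the `validate` and `indicator` parameters of `pd.merge`. It is an illustration only: the two small DataFrames are hypothetical stand-ins for the real CSVs, and the Alaska figures are invented for the example.

```python
import pandas as pd

# Hypothetical miniature versions of two of the source files.
# The Alaska numbers are made up for illustration.
fully_vaccinated = pd.DataFrame({
    "Entity": ["Alabama", "Alabama", "Alaska"],
    "Date": ["1/12/2021", "1/13/2021", "1/12/2021"],
    "people_fully_vaccinated": [7270, 9245, 100],
})
daily_doses = pd.DataFrame({
    "Entity": ["Alabama", "Alaska"],
    "Date": ["1/13/2021", "1/12/2021"],
    "daily_vaccinations": [5906, 50],
})

merged = pd.merge(
    fully_vaccinated,
    daily_doses,
    how="left",
    on=["Entity", "Date"],
    validate="one_to_one",  # raises MergeError if a key pair repeats on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Rows that exist in the first file but have no match in the second.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["Entity", "Date"]])
```

Rows flagged `left_only` are the ones that end up with NaN after the merge, which is where the zero-filling in the next cell applies.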

In [4]:
# Clean up the merged file:
# Remove the rows listed under federal agencies. They duplicate the state data and also contain many NaNs.
federal_entities = ["Bureau of Prisons", "Dept of Defense", "Indian Health Svc",
                    "Long Term Care", "Veterans Health"]
vaccinations_df = df3.loc[~df3["Entity"].isin(federal_entities), :].copy()

# Replace remaining NaN values with zeros - these primarily occurred on the first day of data collection for some states.
vaccinations_df.fillna(value=0, inplace=True)

# Change the columns back to ints (the NaN values had forced them to float, so they display with a decimal)
vaccinations_df['daily_vaccinations'] = vaccinations_df['daily_vaccinations'].astype(int) 
vaccinations_df['daily_vaccinations_per_million'] = vaccinations_df['daily_vaccinations_per_million'].astype(int)

# Change column name from Entity to State to better reflect the content of the final, cleaned up dataframe. 
vaccinations_df.rename(columns={'Entity': 'State'}, inplace=True)
vaccinations_df.head()

,State,Date,people_fully_vaccinated,people_fully_vaccinated_per_hundred,daily_vaccinations,daily_vaccinations_per_million
0,Alabama,1/12/2021,7270,0.15,0,0
1,Alabama,1/13/2021,9245,0.19,5906,1205
2,Alabama,1/15/2021,13488,0.28,7478,1525
3,Alabama,1/19/2021,16346,0.33,7523,1534
4,Alabama,1/20/2021,17956,0.37,7880,1607
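
The count columns become floats because NaN cannot be stored in a plain integer column, which is why the cell above fills with zeros and then casts back to `int`. As a hedged alternative sketch (not what this notebook does), pandas' nullable `Int64` dtype keeps missing values as `<NA>` while still storing whole numbers. The series below is a small stand-in built from the Alabama values shown above.

```python
import pandas as pd

# Stand-in for a merged count column with one missing first-day value.
counts = pd.Series([float("nan"), 5906, 7478], name="daily_vaccinations")
print(counts.dtype)  # float64, because of the NaN

# Approach used in this notebook: treat missing as zero, then cast to int.
filled = counts.fillna(0).astype(int)

# Alternative: nullable integers keep the gap visible as <NA>.
nullable = counts.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

Filling with zero is simpler for downstream loading, but it makes "no data reported" indistinguishable from "zero doses administered". Which trade-off is right depends on how the table will be queried.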


In [5]:
# Write the merged, cleaned-up data to a single new CSV file
# (forward slash so the path works on any OS and avoids the "\M" escape sequence)
vaccinations_df.to_csv('Resources/MergedVaccinations.csv', index=False)
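
Since the CSV is meant for a Postgres load, it may help to write the dates in ISO `YYYY-MM-DD` form, which can be read into a `date` column unambiguously. The sketch below assumes the `Date` values are month/day/year strings, as in the sample output above. The small DataFrame is a stand-in for `vaccinations_df`, and the CSV is written to a string rather than to a file.

```python
import pandas as pd

# Stand-in for the cleaned DataFrame, using rows from the sample output.
sample = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "Date": ["1/12/2021", "1/13/2021"],
    "daily_vaccinations": [0, 5906],
})

# Parse the month/day/year strings into real datetimes (assumed format).
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

# Write ISO dates explicitly; with no path argument, to_csv returns a string.
csv_text = sample.to_csv(index=False, date_format="%Y-%m-%d")
print(csv_text)
```

Parsing with an explicit `format` also raises an error on any malformed date, so bad rows are caught before the load rather than during it.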