# COVID-19 ETL

This notebook loads and cleans COVID-19 data and exports it to PostgreSQL. The data comes from:

* Data on COVID-19 (coronavirus) by Our World in Data: https://github.com/owid/covid-19-data/tree/master/public/data
* Data on COVID-19 (coronavirus) vaccinations by Our World in Data: https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations
* COVID-19 Case Surveillance Public Use Data with Geography: https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-with-Ge/n8mc-b4w4

In [1]:
import pandas as pd
from sqlalchemy import create_engine

## Global COVID-19 Data
The goal of this part is to divide the global data set into 3 parts:
* Stats vs Cases data
* Stats vs Tests data
* Stats vs Vaccines data

At the end of each part there will be a clean dataframe that will be uploaded to PostgreSQL. 
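
The upload step itself is not shown in this notebook, so the following is only a minimal sketch of how a clean dataframe could be written with `DataFrame.to_sql`. The helper name `export_frame`, the table name `stats_vs_cases`, and the sample rows are assumptions made for illustration. An in-memory SQLite connection stands in for PostgreSQL so the sketch runs without a database server.

```python
import sqlite3

import pandas as pd


def export_frame(frame, table_name, connection):
    """Write a dataframe to a SQL table, replacing the table if it exists."""
    frame.to_sql(table_name, connection, if_exists="replace", index=False)


# Stand-in connection (assumption). For PostgreSQL, an engine built with
# create_engine("postgresql://<user>:<password>@<host>:<port>/<database>")
# would be passed instead.
connection = sqlite3.connect(":memory:")

# Made-up rows, not taken from the dataset.
sample = pd.DataFrame({"iso_code": ["AAA", "BBB"], "total_cases": [100.0, 250.0]})

export_frame(sample, "stats_vs_cases", connection)
round_trip = pd.read_sql("SELECT * FROM stats_vs_cases", connection)
print(round_trip)
```

`if_exists="replace"` drops and recreates the table on every run, which suits a notebook that is re-executed from the top; `"append"` would be the choice if rows should accumulate instead.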

In [40]:
covid_data = "./resources/owid-covid-data.csv"

covid_data_df = pd.read_csv(covid_data)
covid_data_df

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
0,AFG,Asia,Afghanistan,2020-02-24,1.000,1.000,,,,,...,1803.987,,597.029,9.590,,,37.746,0.500,64.830,0.511
1,AFG,Asia,Afghanistan,2020-02-25,1.000,0.000,,,,,...,1803.987,,597.029,9.590,,,37.746,0.500,64.830,0.511
2,AFG,Asia,Afghanistan,2020-02-26,1.000,0.000,,,,,...,1803.987,,597.029,9.590,,,37.746,0.500,64.830,0.511
3,AFG,Asia,Afghanistan,2020-02-27,1.000,0.000,,,,,...,1803.987,,597.029,9.590,,,37.746,0.500,64.830,0.511
4,AFG,Asia,Afghanistan,2020-02-28,1.000,0.000,,,,,...,1803.987,,597.029,9.590,,,37.746,0.500,64.830,0.511
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80981,ZWE,Africa,Zimbabwe,2021-04-07,36984.000,18.000,14.571,1531.000,0.000,1.143,...,1899.775,21.400,307.846,1.820,1.600,30.700,36.791,1.700,61.490,0.571
80982,ZWE,Africa,Zimbabwe,2021-04-08,37052.000,68.000,22.286,1532.000,1.000,1.286,...,1899.775,21.400,307.846,1.820,1.600,30.700,36.791,1.700,61.490,0.571
80983,ZWE,Africa,Zimbabwe,2021-04-09,37147.000,95.000,34.857,1535.000,3.000,1.571,...,1899.775,21.400,307.846,1.820,1.600,30.700,36.791,1.700,61.490,0.571
80984,ZWE,Africa,Zimbabwe,2021-04-10,37273.000,126.000,51.714,1538.000,3.000,2.000,...,1899.775,21.400,307.846,1.820,1.600,30.700,36.791,1.700,61.490,0.571


## Global Stats vs Cases data

In [39]:
# Suppress scientific notation by forcing formatting
pd.options.display.float_format = '{:.3f}'.format

# Inefficient way of finding the row with the highest total_cases for each country and inserting it into a new dataframe
# Ideally, collecting the rows in a list and building the new dataframe once would be much more efficient
# But there are so many columns that, time-wise, this approach takes less time

countries = covid_data_df["location"].unique().tolist()
stats = pd.DataFrame()
df = pd.DataFrame()

# Make dataframe from the total number of cases
for country in countries:
    df = covid_data_df.loc[covid_data_df["location"] == country]
    num_cases = df["total_cases"].argmax()
    df = df.iloc[num_cases].to_frame().T
    stats = pd.concat([stats,df])
    
del df
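
The per-country loop above can probably be replaced by a grouped lookup. The sketch below is one possible vectorized version, not the notebook's own method: the helper name `peak_rows` and the sample values are assumptions. Rows with a missing value are dropped first so that `idxmax` never sees an all-NaN group.

```python
import pandas as pd


def peak_rows(frame, value_col, group_col="location"):
    """Return, for each group, the row where value_col is largest."""
    valid = frame.dropna(subset=[value_col])
    # idxmax gives the index label of the first maximum within each group.
    peak_index = valid.groupby(group_col, sort=False)[value_col].idxmax()
    return valid.loc[peak_index]


# Made-up rows, not taken from the dataset.
sample = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C"],
    "total_cases": [1.0, 5.0, 2.0, 3.0, None],
    "population": [10, 10, 20, 20, 30],
})

peaks = peak_rows(sample, "total_cases")
print(peaks)
```

Because the missing values are removed up front, groups with no data at all (location `C` here) simply do not appear in the result, which matches what the later `dropna(subset=...)` call achieves for the loop version.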

In [20]:
# Dataframe to be used for scatter plots - needs some cleaning (probably)
# The lookup could use the max instead of the last row, depending on the column it is applied to
stats = stats.dropna(subset=['total_cases'])
stats
# Still to do: tests vs stats and vaccines vs stats

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
412,AFG,Asia,Afghanistan,2021-04-11,57160.00,16.00,69.14,2521.00,0.00,3.43,...,1803.99,,597.03,9.59,,,37.75,0.50,64.83,0.51
836,OWID_AFR,,Africa,2021-04-11,4350198.00,9684.00,11325.57,115710.00,288.00,265.29,...,,,,,,,,,,
1248,ALB,Europe,Albania,2021-04-11,128393.00,238.00,266.00,2317.00,7.00,7.43,...,11803.43,1.10,304.19,10.08,7.10,51.20,,2.89,78.57,0.80
1660,DZA,Africa,Algeria,2021-04-11,118516.00,138.00,127.71,3130.00,4.00,3.57,...,13913.84,0.50,278.36,6.73,0.70,30.40,83.74,1.90,76.88,0.75
2066,AND,Europe,Andorra,2021-04-11,12545.00,48.00,44.86,120.00,0.00,0.43,...,,,109.14,7.97,29.00,37.80,,,83.73,0.87
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79394,VNM,Asia,Vietnam,2021-04-11,2693.00,1.00,8.86,35.00,0.00,0.00,...,6171.88,2.00,245.47,6.00,1.00,45.90,85.85,2.60,75.40,0.70
79840,OWID_WRL,,World,2021-04-11,136046624.00,690739.00,674669.29,2936364.00,8557.00,11871.00,...,15469.21,10.00,233.07,8.51,6.43,34.63,60.13,2.71,72.58,0.74
80207,YEM,Asia,Yemen,2021-04-11,5357.00,81.00,79.86,1049.00,18.00,14.71,...,1479.15,18.80,495.00,5.35,7.60,29.20,49.54,0.70,66.12,0.47
80597,ZMB,Africa,Zambia,2021-04-11,90029.00,111.00,157.00,1226.00,0.00,0.86,...,3689.25,57.50,234.50,3.94,3.10,24.70,13.94,2.00,63.89,0.58


In [21]:
# Total cases vs stats
countries_stats = stats[["iso_code", "continent", "location", "total_cases", "population", "population_density", "median_age", "aged_65_older", "aged_70_older", "gdp_per_capita", "extreme_poverty", "cardiovasc_death_rate"]]
countries_stats

Unnamed: 0,iso_code,continent,location,total_cases,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate
412,AFG,Asia,Afghanistan,57160.00,38928341.00,54.42,18.60,2.58,1.34,1803.99,,597.03
836,OWID_AFR,,Africa,4350198.00,1340598113.00,,,,,,,
1248,ALB,Europe,Albania,128393.00,2877800.00,104.87,38.00,13.19,8.64,11803.43,1.10,304.19
1660,DZA,Africa,Algeria,118516.00,43851043.00,17.35,29.10,6.21,3.86,13913.84,0.50,278.36
2066,AND,Europe,Andorra,12545.00,77265.00,163.75,,,,,,109.14
...,...,...,...,...,...,...,...,...,...,...,...,...
79394,VNM,Asia,Vietnam,2693.00,97338583.00,308.13,32.60,7.15,4.72,6171.88,2.00,245.47
79840,OWID_WRL,,World,136046624.00,7794798729.00,58.05,30.90,8.70,5.36,15469.21,10.00,233.07
80207,YEM,Asia,Yemen,5357.00,29825968.00,53.51,20.30,2.92,1.58,1479.15,18.80,495.00
80597,ZMB,Africa,Zambia,90029.00,18383956.00,23.00,17.70,2.48,1.54,3689.25,57.50,234.50


In [24]:
# Total cases vs more stats
more_countries_stats = stats[["iso_code", "continent", "location", "total_cases", "diabetes_prevalence", "female_smokers", "male_smokers", "hospital_beds_per_thousand", "life_expectancy", "human_development_index"]]
more_countries_stats

Unnamed: 0,iso_code,continent,location,total_cases,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
412,AFG,Asia,Afghanistan,57160.00,9.59,,,0.50,64.83,0.51
836,OWID_AFR,,Africa,4350198.00,,,,,,
1248,ALB,Europe,Albania,128393.00,10.08,7.10,51.20,2.89,78.57,0.80
1660,DZA,Africa,Algeria,118516.00,6.73,0.70,30.40,1.90,76.88,0.75
2066,AND,Europe,Andorra,12545.00,7.97,29.00,37.80,,83.73,0.87
...,...,...,...,...,...,...,...,...,...,...
79394,VNM,Asia,Vietnam,2693.00,6.00,1.00,45.90,2.60,75.40,0.70
79840,OWID_WRL,,World,136046624.00,8.51,6.43,34.63,2.71,72.58,0.74
80207,YEM,Asia,Yemen,5357.00,5.35,7.60,29.20,0.70,66.12,0.47
80597,ZMB,Africa,Zambia,90029.00,3.94,3.10,24.70,2.00,63.89,0.58


In [25]:
merged_country_stats = countries_stats.merge(more_countries_stats, how="inner", on=["iso_code","continent","location","total_cases"])
merged_country_stats

Unnamed: 0,iso_code,continent,location,total_cases,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
0,AFG,Asia,Afghanistan,57160.00,38928341.00,54.42,18.60,2.58,1.34,1803.99,,597.03,9.59,,,0.50,64.83,0.51
1,OWID_AFR,,Africa,4350198.00,1340598113.00,,,,,,,,,,,,,
2,ALB,Europe,Albania,128393.00,2877800.00,104.87,38.00,13.19,8.64,11803.43,1.10,304.19,10.08,7.10,51.20,2.89,78.57,0.80
3,DZA,Africa,Algeria,118516.00,43851043.00,17.35,29.10,6.21,3.86,13913.84,0.50,278.36,6.73,0.70,30.40,1.90,76.88,0.75
4,AND,Europe,Andorra,12545.00,77265.00,163.75,,,,,,109.14,7.97,29.00,37.80,,83.73,0.87
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194,VNM,Asia,Vietnam,2693.00,97338583.00,308.13,32.60,7.15,4.72,6171.88,2.00,245.47,6.00,1.00,45.90,2.60,75.40,0.70
195,OWID_WRL,,World,136046624.00,7794798729.00,58.05,30.90,8.70,5.36,15469.21,10.00,233.07,8.51,6.43,34.63,2.71,72.58,0.74
196,YEM,Asia,Yemen,5357.00,29825968.00,53.51,20.30,2.92,1.58,1479.15,18.80,495.00,5.35,7.60,29.20,0.70,66.12,0.47
197,ZMB,Africa,Zambia,90029.00,18383956.00,23.00,17.70,2.48,1.54,3689.25,57.50,234.50,3.94,3.10,24.70,2.00,63.89,0.58


In [26]:
# Global cases data 
stats_cases = stats[["iso_code", "continent", "location", "total_cases", "total_deaths", "total_cases_per_million", "total_deaths_per_million"]]
stats_cases

Unnamed: 0,iso_code,continent,location,total_cases,total_deaths,total_cases_per_million,total_deaths_per_million
412,AFG,Asia,Afghanistan,57160.00,2521.00,1468.34,64.76
836,OWID_AFR,,Africa,4350198.00,115710.00,3244.97,86.31
1248,ALB,Europe,Albania,128393.00,2317.00,44614.98,805.13
1660,DZA,Africa,Algeria,118516.00,3130.00,2702.70,71.38
2066,AND,Europe,Andorra,12545.00,120.00,162363.30,1553.10
...,...,...,...,...,...,...,...
79394,VNM,Asia,Vietnam,2693.00,35.00,27.67,0.36
79840,OWID_WRL,,World,136046624.00,2936364.00,17453.51,376.71
80207,YEM,Asia,Yemen,5357.00,1049.00,179.61,35.17
80597,ZMB,Africa,Zambia,90029.00,1226.00,4897.15,66.69


In [27]:
# Final dataframe for cases vs stats
ready_stats_vs_cases = merged_country_stats.merge(stats_cases, how="inner", on=["iso_code","continent","location","total_cases"])
ready_stats_vs_cases

Unnamed: 0,iso_code,continent,location,total_cases,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,...,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index,total_deaths,total_cases_per_million,total_deaths_per_million
0,AFG,Asia,Afghanistan,57160.00,38928341.00,54.42,18.60,2.58,1.34,1803.99,...,597.03,9.59,,,0.50,64.83,0.51,2521.00,1468.34,64.76
1,OWID_AFR,,Africa,4350198.00,1340598113.00,,,,,,...,,,,,,,,115710.00,3244.97,86.31
2,ALB,Europe,Albania,128393.00,2877800.00,104.87,38.00,13.19,8.64,11803.43,...,304.19,10.08,7.10,51.20,2.89,78.57,0.80,2317.00,44614.98,805.13
3,DZA,Africa,Algeria,118516.00,43851043.00,17.35,29.10,6.21,3.86,13913.84,...,278.36,6.73,0.70,30.40,1.90,76.88,0.75,3130.00,2702.70,71.38
4,AND,Europe,Andorra,12545.00,77265.00,163.75,,,,,...,109.14,7.97,29.00,37.80,,83.73,0.87,120.00,162363.30,1553.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
194,VNM,Asia,Vietnam,2693.00,97338583.00,308.13,32.60,7.15,4.72,6171.88,...,245.47,6.00,1.00,45.90,2.60,75.40,0.70,35.00,27.67,0.36
195,OWID_WRL,,World,136046624.00,7794798729.00,58.05,30.90,8.70,5.36,15469.21,...,233.07,8.51,6.43,34.63,2.71,72.58,0.74,2936364.00,17453.51,376.71
196,YEM,Asia,Yemen,5357.00,29825968.00,53.51,20.30,2.92,1.58,1479.15,...,495.00,5.35,7.60,29.20,0.70,66.12,0.47,1049.00,179.61,35.17
197,ZMB,Africa,Zambia,90029.00,18383956.00,23.00,17.70,2.48,1.54,3689.25,...,234.50,3.94,3.10,24.70,2.00,63.89,0.58,1226.00,4897.15,66.69
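
The frames merged in this part are all column subsets of the same `stats` dataframe, so a single column selection should give the same result as the chain of merges. A minimal sketch on made-up rows (`stats_sample` and its values are illustrative assumptions); `validate="one_to_one"` is an optional guard that makes pandas raise an error if a merge key is duplicated on either side.

```python
import pandas as pd

# Made-up rows, not taken from the dataset.
stats_sample = pd.DataFrame({
    "iso_code": ["AAA", "BBB"],
    "location": ["Aland", "Bland"],
    "total_cases": [100.0, 250.0],
    "population": [1000.0, 5000.0],
    "total_deaths": [3.0, 9.0],
})

keys = ["iso_code", "location", "total_cases"]
left = stats_sample[keys + ["population"]]
right = stats_sample[keys + ["total_deaths"]]

# Merge the two subsets, checking that each key appears once per side.
merged = left.merge(right, how="inner", on=keys, validate="one_to_one")

# Selecting every wanted column in one step from the original frame.
direct = stats_sample[keys + ["population", "total_deaths"]]

print(merged.equals(direct))
```

The merge is still useful when the two sides come from different sources; when both come from one frame, the direct selection avoids any risk of rows being dropped or duplicated by the join.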


## Global Stats vs Test data

In [9]:
# Iterate through each country, find the max number of tests done and concatenate to new dataframe
test_stats = pd.DataFrame()
df = pd.DataFrame()

for country in countries:
    df = covid_data_df.loc[covid_data_df["location"] == country]
    num_tests = df["total_tests"].argmax()
    df = df.iloc[num_tests].to_frame().T
    test_stats = pd.concat([test_stats,df])

del df

In [14]:
# Global test data
# Drop NaN and get test columns 
test_stats = test_stats.dropna(subset=['total_tests'])
test_stats_df = test_stats[["iso_code", "continent", "location", "new_tests", "total_tests", "total_tests_per_thousand", "new_tests_per_thousand", "positive_rate", "tests_per_case", "tests_units"]]
test_stats_df

Unnamed: 0,iso_code,continent,location,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,positive_rate,tests_per_case,tests_units
1244,ALB,Europe,Albania,2538.00,555376.00,192.99,0.88,0.11,9.20,tests performed
2060,AND,Europe,Andorra,,171485.00,2219.44,,0.12,8.10,people tested
3352,ARG,South America,Argentina,32157.00,7610064.00,168.38,0.71,0.23,4.40,tests performed
3764,ARM,Asia,Armenia,4943.00,889872.00,300.30,1.67,0.21,4.70,tests performed
4652,AUS,Oceania,Australia,44549.00,15998167.00,627.38,1.75,0.00,5250.40,tests performed
...,...,...,...,...,...,...,...,...,...,...
77207,USA,North America,United States,484155.00,385064247.00,1163.33,1.46,0.07,14.30,tests performed
77604,URY,South America,Uruguay,,1488362.00,428.46,,0.22,4.50,tests performed
79367,VNM,Asia,Vietnam,,2482302.00,25.50,,,,samples tested
80594,ZMB,Africa,Zambia,6016.00,1286686.00,69.99,0.33,0.04,28.00,tests performed


In [28]:
# Stats to merge with test data 
more_test_stats = test_stats[["iso_code", "continent", "location", "total_tests", "population", "population_density", "median_age", "aged_65_older", "aged_70_older", "gdp_per_capita", "extreme_poverty", "cardiovasc_death_rate", "diabetes_prevalence", "female_smokers", "male_smokers", "hospital_beds_per_thousand", "life_expectancy", "human_development_index"]]
more_test_stats

Unnamed: 0,iso_code,continent,location,total_tests,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
1244,ALB,Europe,Albania,555376.00,2877800.00,104.87,38.00,13.19,8.64,11803.43,1.10,304.19,10.08,7.10,51.20,2.89,78.57,0.80
2060,AND,Europe,Andorra,171485.00,77265.00,163.75,,,,,,109.14,7.97,29.00,37.80,,83.73,0.87
3352,ARG,South America,Argentina,7610064.00,45195777.00,16.18,31.90,11.20,7.44,18933.91,0.60,191.03,5.50,16.20,27.70,5.00,76.67,0.84
3764,ARM,Asia,Armenia,889872.00,2963234.00,102.93,35.70,11.23,7.57,8787.58,1.80,341.01,7.11,1.50,52.10,4.20,75.09,0.78
4652,AUS,Oceania,Australia,15998167.00,25499881.00,3.20,37.90,15.50,10.13,44648.71,0.50,107.79,5.07,13.00,16.50,3.84,83.44,0.94
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77207,USA,North America,United States,385064247.00,331002647.00,35.61,38.30,15.41,9.73,54225.45,1.20,151.09,10.79,19.10,24.60,2.77,78.86,0.93
77604,URY,South America,Uruguay,1488362.00,3473727.00,19.75,35.60,14.65,10.36,20551.41,0.10,160.71,6.93,14.00,19.90,2.80,77.91,0.82
79367,VNM,Asia,Vietnam,2482302.00,97338583.00,308.13,32.60,7.15,4.72,6171.88,2.00,245.47,6.00,1.00,45.90,2.60,75.40,0.70
80594,ZMB,Africa,Zambia,1286686.00,18383956.00,23.00,17.70,2.48,1.54,3689.25,57.50,234.50,3.94,3.10,24.70,2.00,63.89,0.58


In [29]:
# Final dataframe for test vs stats
ready_test_stats = test_stats_df.merge(more_test_stats, how="inner", on=["iso_code","continent","location","total_tests"])
ready_test_stats

Unnamed: 0,iso_code,continent,location,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,positive_rate,tests_per_case,tests_units,...,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
0,ALB,Europe,Albania,2538.00,555376.00,192.99,0.88,0.11,9.20,tests performed,...,8.64,11803.43,1.10,304.19,10.08,7.10,51.20,2.89,78.57,0.80
1,AND,Europe,Andorra,,171485.00,2219.44,,0.12,8.10,people tested,...,,,,109.14,7.97,29.00,37.80,,83.73,0.87
2,ARG,South America,Argentina,32157.00,7610064.00,168.38,0.71,0.23,4.40,tests performed,...,7.44,18933.91,0.60,191.03,5.50,16.20,27.70,5.00,76.67,0.84
3,ARM,Asia,Armenia,4943.00,889872.00,300.30,1.67,0.21,4.70,tests performed,...,7.57,8787.58,1.80,341.01,7.11,1.50,52.10,4.20,75.09,0.78
4,AUS,Oceania,Australia,44549.00,15998167.00,627.38,1.75,0.00,5250.40,tests performed,...,10.13,44648.71,0.50,107.79,5.07,13.00,16.50,3.84,83.44,0.94
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108,USA,North America,United States,484155.00,385064247.00,1163.33,1.46,0.07,14.30,tests performed,...,9.73,54225.45,1.20,151.09,10.79,19.10,24.60,2.77,78.86,0.93
109,URY,South America,Uruguay,,1488362.00,428.46,,0.22,4.50,tests performed,...,10.36,20551.41,0.10,160.71,6.93,14.00,19.90,2.80,77.91,0.82
110,VNM,Asia,Vietnam,,2482302.00,25.50,,,,samples tested,...,4.72,6171.88,2.00,245.47,6.00,1.00,45.90,2.60,75.40,0.70
111,ZMB,Africa,Zambia,6016.00,1286686.00,69.99,0.33,0.04,28.00,tests performed,...,1.54,3689.25,57.50,234.50,3.94,3.10,24.70,2.00,63.89,0.58


## Global Stats vs Vaccine data

In [11]:
# Each country vaccine data for stats
vaccine_stats = pd.DataFrame()
df = pd.DataFrame()

for country in countries:
    df = covid_data_df.loc[covid_data_df["location"] == country]
    num_vacc = df["total_vaccinations"].argmax()
    df = df.iloc[num_vacc].to_frame().T
    vaccine_stats = pd.concat([vaccine_stats,df])

del df

In [30]:
vaccine_stats = vaccine_stats.dropna(subset=['total_vaccinations'])
vaccine_stats_df = vaccine_stats[["iso_code","continent","location","total_vaccinations","people_vaccinated","people_fully_vaccinated","new_vaccinations","total_vaccinations_per_hundred","people_vaccinated_per_hundred","people_fully_vaccinated_per_hundred"]]
vaccine_stats_df

Unnamed: 0,iso_code,continent,location,total_vaccinations,people_vaccinated,people_fully_vaccinated,new_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred
408,AFG,Asia,Afghanistan,120000.00,120000.00,,,0.31,0.31,
836,OWID_AFR,,Africa,13477122.00,8962280.00,4502546.00,86405.00,1.01,0.67,0.34
1247,ALB,Europe,Albania,256810.00,,,6134.00,8.92,,
1609,DZA,Africa,Algeria,75000.00,,,,0.17,,
2066,AND,Europe,Andorra,17091.00,,,,22.12,,
...,...,...,...,...,...,...,...,...,...,...
78000,UZB,Asia,Uzbekistan,148642.00,148642.00,,,0.44,0.44,
78940,VEN,South America,Venezuela,98000.00,98000.00,,,0.34,0.34,
79392,VNM,Asia,Vietnam,58037.00,58037.00,,1678.00,0.06,0.06,
79840,OWID_WRL,,World,788189884.00,439334176.00,172417377.00,11757158.00,10.11,5.64,2.21


In [31]:
more_vaccine_stats = vaccine_stats[["iso_code", "continent", "location", "total_vaccinations", "population", "population_density", "median_age", "aged_65_older", "aged_70_older", "gdp_per_capita", "extreme_poverty", "cardiovasc_death_rate", "diabetes_prevalence", "female_smokers", "male_smokers", "hospital_beds_per_thousand", "life_expectancy", "human_development_index"]]
more_vaccine_stats

Unnamed: 0,iso_code,continent,location,total_vaccinations,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
408,AFG,Asia,Afghanistan,120000.00,38928341.00,54.42,18.60,2.58,1.34,1803.99,,597.03,9.59,,,0.50,64.83,0.51
836,OWID_AFR,,Africa,13477122.00,1340598113.00,,,,,,,,,,,,,
1247,ALB,Europe,Albania,256810.00,2877800.00,104.87,38.00,13.19,8.64,11803.43,1.10,304.19,10.08,7.10,51.20,2.89,78.57,0.80
1609,DZA,Africa,Algeria,75000.00,43851043.00,17.35,29.10,6.21,3.86,13913.84,0.50,278.36,6.73,0.70,30.40,1.90,76.88,0.75
2066,AND,Europe,Andorra,17091.00,77265.00,163.75,,,,,,109.14,7.97,29.00,37.80,,83.73,0.87
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78000,UZB,Asia,Uzbekistan,148642.00,33469199.00,76.13,28.20,4.47,2.87,6253.10,,724.42,7.57,1.30,24.70,4.00,71.72,0.72
78940,VEN,South America,Venezuela,98000.00,28435943.00,36.25,29.00,6.61,3.92,16745.02,,204.85,6.47,,,0.80,72.06,0.71
79392,VNM,Asia,Vietnam,58037.00,97338583.00,308.13,32.60,7.15,4.72,6171.88,2.00,245.47,6.00,1.00,45.90,2.60,75.40,0.70
79840,OWID_WRL,,World,788189884.00,7794798729.00,58.05,30.90,8.70,5.36,15469.21,10.00,233.07,8.51,6.43,34.63,2.71,72.58,0.74


In [32]:
ready_vaccine_stats = vaccine_stats_df.merge(more_vaccine_stats, how="inner", on=["iso_code","continent","location","total_vaccinations"])
ready_vaccine_stats

Unnamed: 0,iso_code,continent,location,total_vaccinations,people_vaccinated,people_fully_vaccinated,new_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,...,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_thousand,life_expectancy,human_development_index
0,AFG,Asia,Afghanistan,120000.00,120000.00,,,0.31,0.31,,...,1.34,1803.99,,597.03,9.59,,,0.50,64.83,0.51
1,OWID_AFR,,Africa,13477122.00,8962280.00,4502546.00,86405.00,1.01,0.67,0.34,...,,,,,,,,,,
2,ALB,Europe,Albania,256810.00,,,6134.00,8.92,,,...,8.64,11803.43,1.10,304.19,10.08,7.10,51.20,2.89,78.57,0.80
3,DZA,Africa,Algeria,75000.00,,,,0.17,,,...,3.86,13913.84,0.50,278.36,6.73,0.70,30.40,1.90,76.88,0.75
4,AND,Europe,Andorra,17091.00,,,,22.12,,,...,,,,109.14,7.97,29.00,37.80,,83.73,0.87
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
175,UZB,Asia,Uzbekistan,148642.00,148642.00,,,0.44,0.44,,...,2.87,6253.10,,724.42,7.57,1.30,24.70,4.00,71.72,0.72
176,VEN,South America,Venezuela,98000.00,98000.00,,,0.34,0.34,,...,3.92,16745.02,,204.85,6.47,,,0.80,72.06,0.71
177,VNM,Asia,Vietnam,58037.00,58037.00,,1678.00,0.06,0.06,,...,4.72,6171.88,2.00,245.47,6.00,1.00,45.90,2.60,75.40,0.70
178,OWID_WRL,,World,788189884.00,439334176.00,172417377.00,11757158.00,10.11,5.64,2.21,...,5.36,15469.21,10.00,233.07,8.51,6.43,34.63,2.71,72.58,0.74


## US COVID-19 Data

In [41]:
# Get data from US only
covid_data_df_us = covid_data_df.loc[covid_data_df["location"] == "United States"]
covid_data_df_us

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
76767,USA,North America,United States,2020-01-22,1.000,,,,,,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926
76768,USA,North America,United States,2020-01-23,1.000,0.000,,,,,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926
76769,USA,North America,United States,2020-01-24,2.000,1.000,,,,,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926
76770,USA,North America,United States,2020-01-25,2.000,0.000,,,,,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926
76771,USA,North America,United States,2020-01-26,5.000,3.000,,,,,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77208,USA,North America,United States,2021-04-07,30922386.000,75038.000,65936.000,559202.000,2570.000,989.286,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926
77209,USA,North America,United States,2021-04-08,31002264.000,79878.000,66056.571,560202.000,1000.000,979.571,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926
77210,USA,North America,United States,2021-04-09,31084962.000,82698.000,67896.000,561074.000,872.000,970.000,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926
77211,USA,North America,United States,2021-04-10,31151495.000,66533.000,68404.429,561783.000,709.000,969.429,...,54225.446,1.200,151.089,10.790,19.100,24.600,,2.770,78.860,0.926


In [42]:
# Date, total cases, new cases, and deaths data
covid_numbers_us_df = covid_data_df_us[["location", "date", "total_cases", "new_cases", "total_deaths", "new_deaths", "total_cases_per_million", "new_cases_per_million", "total_deaths_per_million", "new_deaths_per_million"]]
covid_numbers_us_df

Unnamed: 0,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million
76767,United States,2020-01-22,1.000,,,,0.003,,,
76768,United States,2020-01-23,1.000,0.000,,,0.003,0.000,,
76769,United States,2020-01-24,2.000,1.000,,,0.006,0.003,,
76770,United States,2020-01-25,2.000,0.000,,,0.006,0.000,,
76771,United States,2020-01-26,5.000,3.000,,,0.015,0.009,,
...,...,...,...,...,...,...,...,...,...,...
77208,United States,2021-04-07,30922386.000,75038.000,559202.000,2570.000,93420.359,226.699,1689.419,7.764
77209,United States,2021-04-08,31002264.000,79878.000,560202.000,1000.000,93661.680,241.321,1692.440,3.021
77210,United States,2021-04-09,31084962.000,82698.000,561074.000,872.000,93911.521,249.841,1695.074,2.634
77211,United States,2021-04-10,31151495.000,66533.000,561783.000,709.000,94112.525,201.004,1697.216,2.142


In [45]:
# Test data
covid_testdata_df_us = covid_data_df_us[["location", "date", "new_tests", "total_tests", "total_tests_per_thousand", "new_tests_per_thousand", "positive_rate", "tests_per_case"]]
# Keep only the rows with at least 3 non-NA values (thresh=3), i.e. at least one value besides location and date
covid_testdata_df_us = covid_testdata_df_us.dropna(thresh=3)
covid_testdata_df_us

Unnamed: 0,location,date,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,positive_rate,tests_per_case
76806,United States,2020-03-01,372.000,372.000,0.001,0.001,,
76807,United States,2020-03-02,550.000,922.000,0.003,0.002,,
76808,United States,2020-03-03,933.000,1855.000,0.006,0.003,,
76809,United States,2020-03-04,924.000,2779.000,0.008,0.003,,
76810,United States,2020-03-05,1205.000,3984.000,0.012,0.004,,
...,...,...,...,...,...,...,...,...
77203,United States,2021-04-02,1210503.000,382758324.000,1156.360,3.657,0.055,18.200
77204,United States,2021-04-03,841444.000,383599768.000,1158.902,2.542,0.057,17.500
77205,United States,2021-04-04,450322.000,384050090.000,1160.263,1.360,0.057,17.500
77206,United States,2021-04-05,530002.000,384580092.000,1161.864,1.601,0.061,16.400
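
`dropna(thresh=3)` keeps a row only when at least three of its values are non-NA. Since `location` and `date` are filled in on every row shown here, that amounts to requiring at least one test value. The sketch below, on made-up rows, compares it with a more explicit spelling of the same intent, `subset=..., how="all"`; the column list `test_columns` is an assumption for the example.

```python
import pandas as pd

# Made-up rows, not taken from the dataset.
sample = pd.DataFrame({
    "location": ["United States"] * 3,
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "new_tests": [None, None, 372.0],
    "total_tests": [None, 10.0, 372.0],
})

test_columns = ["new_tests", "total_tests"]

# Keep rows with at least 3 non-NA values across all columns.
by_threshold = sample.dropna(thresh=3)

# Drop rows only when every test column is missing.
by_subset = sample.dropna(subset=test_columns, how="all")

print(by_threshold)
print(by_subset)
```

The `subset` form does not depend on how many identifying columns the frame has, so it keeps working if a column such as `iso_code` is added later.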


In [48]:
# Can date be a primary key? All values have to be unique, so if the length of the unique array matches the length of the date column, then every value is unique

if len(covid_testdata_df_us["date"].unique()) == len(covid_testdata_df_us["date"]):
    print("Date can be a primary key")
else:
    print("Date can't be a primary key")

Date can be a primary key
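
The same check can be written with `Series.is_unique`. A primary key also cannot contain NULLs, so the sketch below tests for missing values as well. The helper name `can_be_primary_key` and the sample series are assumptions made for illustration.

```python
import pandas as pd


def can_be_primary_key(column):
    """True when the column has no missing values and no duplicates."""
    return bool(column.notna().all() and column.is_unique)


# Made-up values, not taken from the dataset.
distinct_dates = pd.Series(["2020-03-01", "2020-03-02", "2020-03-03"])
repeated_dates = pd.Series(["2020-03-01", "2020-03-01"])
dates_with_gap = pd.Series(["2020-03-01", None])

print(can_be_primary_key(distinct_dates))
print(can_be_primary_key(repeated_dates))
print(can_be_primary_key(dates_with_gap))
```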


In [9]:
covid_data_state = "./resources/COVID-19_Case_Surveillance_Public_Use_Data_with_Geography.csv"

covid_data_state_df = pd.read_csv(covid_data_state, low_memory=False)

covid_data_state_df


Unnamed: 0,case_month,res_state,state_fips_code,res_county,county_fips_code,age_group,sex,race,ethnicity,case_positive_specimen_interval,case_onset_interval,process,exposure_yn,current_status,symptom_status,hosp_yn,icu_yn,death_yn,underlying_conditions_yn
0,2020-02,,,,,,,,,,0.0,Missing,Missing,Laboratory-confirmed case,Symptomatic,Yes,Missing,,
1,2020-02,,,,,,,,,3.0,0.0,Clinical evaluation,Yes,Laboratory-confirmed case,Symptomatic,Yes,Yes,,Yes
2,2020-02,,,,,,,,,,0.0,Clinical evaluation,Missing,Laboratory-confirmed case,Symptomatic,Yes,No,,Yes
3,2020-08,,,,,,,,,0.0,,Routine surveillance,Missing,Laboratory-confirmed case,Asymptomatic,No,No,Missing,Yes
4,2020-08,,,,,,,,,0.0,,Routine surveillance,Missing,Laboratory-confirmed case,Missing,Missing,Missing,Missing,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22507134,2020-12,AZ,4.0,YUMA,4027.0,65+ years,Male,White,Non-Hispanic/Latino,,,Missing,Missing,Laboratory-confirmed case,Missing,Yes,Missing,Yes,
22507135,2020-12,AZ,4.0,YUMA,4027.0,65+ years,Male,White,Non-Hispanic/Latino,,,Missing,Missing,Laboratory-confirmed case,Missing,Missing,Missing,Yes,
22507136,2020-12,AZ,4.0,YUMA,4027.0,65+ years,Male,White,Non-Hispanic/Latino,,,Missing,Missing,Laboratory-confirmed case,Missing,Yes,Missing,Yes,
22507137,2020-12,AZ,4.0,YUMA,4027.0,65+ years,Male,White,Non-Hispanic/Latino,,,Missing,Missing,Laboratory-confirmed case,Missing,Yes,Missing,Yes,
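
In the output above the FIPS codes are displayed as floats (`4.0`, `4027.0`). One option, sketched below under assumptions, is to pass an explicit `dtype` to `pd.read_csv` so that these identifier columns are read as text; this also spares pandas from guessing their type across a very large file. The inline CSV is a made-up stand-in for the real file, and whether the real file stores leading zeros is not confirmed here.

```python
import io

import pandas as pd

# Made-up stand-in for the surveillance CSV, reduced to a few columns.
raw_csv = io.StringIO(
    "case_month,res_state,state_fips_code,res_county,county_fips_code\n"
    "2020-12,AZ,04,YUMA,04027\n"
    "2020-12,CA,06,YOLO,06113\n"
)

# Read the identifier columns as text so they are not converted to floats.
fips_as_text = {"state_fips_code": str, "county_fips_code": str}
surveillance = pd.read_csv(raw_csv, dtype=fips_as_text)

# Same state filter as used for the California data.
california = surveillance.loc[surveillance["res_state"] == "CA"]
print(california)
```

With text columns, any leading zeros present in the source survive the load, which matters if the codes are later joined against other geographic tables.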


In [11]:
# California data
covid_data_CA = covid_data_state_df.loc[covid_data_state_df["res_state"] == "CA"]
covid_data_CA

Unnamed: 0,case_month,res_state,state_fips_code,res_county,county_fips_code,age_group,sex,race,ethnicity,case_positive_specimen_interval,case_onset_interval,process,exposure_yn,current_status,symptom_status,hosp_yn,icu_yn,death_yn,underlying_conditions_yn
205,2020-12,CA,6.0,,,Missing,,,,,0.0,Missing,Missing,Laboratory-confirmed case,Symptomatic,Yes,Missing,,
206,2020-12,CA,6.0,,,Missing,,,,,,Missing,Missing,Laboratory-confirmed case,Unknown,No,Missing,,
207,2020-12,CA,6.0,,,,,,,,0.0,Missing,Missing,Laboratory-confirmed case,Symptomatic,Yes,Unknown,,
7413,2020-03,CA,6.0,BUTTE,6007.0,,,,,,0.0,Missing,Missing,Laboratory-confirmed case,Symptomatic,No,Missing,No,
7414,2020-03,CA,6.0,BUTTE,6007.0,,,,,,0.0,Missing,Missing,Laboratory-confirmed case,Symptomatic,Yes,No,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22506961,2020-12,CA,6.0,YOLO,6113.0,65+ years,Male,White,Non-Hispanic/Latino,,,Missing,Missing,Laboratory-confirmed case,Unknown,No,Missing,,
22506962,2020-12,CA,6.0,YOLO,6113.0,65+ years,Male,White,Non-Hispanic/Latino,,0.0,Missing,Missing,Laboratory-confirmed case,Symptomatic,Yes,Yes,,
22506963,2020-12,CA,6.0,YOLO,6113.0,65+ years,Male,White,Non-Hispanic/Latino,,,Missing,Missing,Laboratory-confirmed case,Unknown,Yes,Yes,,
22506964,2020-12,CA,6.0,YOLO,6113.0,65+ years,Male,White,Non-Hispanic/Latino,,,Missing,Missing,Laboratory-confirmed case,Unknown,Yes,Missing,,


## COVID-19 US Vaccination Data

In [7]:
us_vaccination = "./resources/us_state_vaccinations.csv"

us_vaccination_df = pd.read_csv(us_vaccination)

us_vaccination_df

Unnamed: 0,date,location,total_vaccinations,total_distributed,people_vaccinated,people_fully_vaccinated_per_hundred,total_vaccinations_per_hundred,people_fully_vaccinated,people_vaccinated_per_hundred,distributed_per_hundred,daily_vaccinations_raw,daily_vaccinations,daily_vaccinations_per_million,share_doses_used
0,2021-01-12,Alabama,78134.0,377025.0,70861.0,0.15,1.59,7270.0,1.45,7.69,,,,0.207
1,2021-01-13,Alabama,84040.0,378975.0,74792.0,0.19,1.71,9245.0,1.53,7.73,5906.0,5906.0,1205.0,0.222
2,2021-01-14,Alabama,92300.0,435350.0,80480.0,,1.88,,1.64,8.88,8260.0,7083.0,1445.0,0.212
3,2021-01-15,Alabama,100567.0,444650.0,86956.0,0.28,2.05,13488.0,1.77,9.07,8267.0,7478.0,1525.0,0.226
4,2021-01-16,Alabama,,,,,,,,,7557.0,7498.0,1529.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5933,2021-04-08,Wyoming,288814.0,436025.0,169230.0,20.75,49.90,120094.0,29.24,75.34,228.0,2655.0,4587.0,0.662
5934,2021-04-09,Wyoming,289028.0,447855.0,169409.0,20.78,49.94,120246.0,29.27,77.38,214.0,2633.0,4549.0,0.645
5935,2021-04-10,Wyoming,289340.0,450525.0,169683.0,20.83,49.99,120534.0,29.32,77.84,312.0,2640.0,4561.0,0.642
5936,2021-04-11,Wyoming,310702.0,450525.0,180223.0,22.80,53.68,131933.0,31.14,77.84,21362.0,3431.0,5928.0,0.690


In [55]:
ca_us_vaccination_df = us_vaccination_df.loc[us_vaccination_df["location"] == "California"]
ca_us_vaccination_df

Unnamed: 0,date,location,total_vaccinations,total_distributed,people_vaccinated,people_fully_vaccinated_per_hundred,total_vaccinations_per_hundred,people_fully_vaccinated,people_vaccinated_per_hundred,distributed_per_hundred,daily_vaccinations_raw,daily_vaccinations,daily_vaccinations_per_million,share_doses_used
546,2021-01-12,California,816301.0,3286050.0,703540.0,0.25,2.07,100089.0,1.78,8.32,,,,0.248
547,2021-01-13,California,891489.0,3435650.0,744545.0,0.34,2.26,133689.0,1.88,8.70,75188.00,75188.0,1903.0,0.259
548,2021-01-14,California,975293.0,3540175.0,801998.0,,2.47,,2.03,8.96,83804.00,79496.0,2012.0,0.275
549,2021-01-15,California,1072959.0,3548575.0,865387.0,0.52,2.72,204374.0,2.19,8.98,97666.00,85553.0,2165.0,0.302
550,2021-01-16,California,,,,,,,,,96867.75,88381.0,2237.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
632,2021-04-08,California,21243518.0,27861050.0,14123008.0,19.23,53.76,7599559.0,35.74,70.51,377626.00,377051.0,9543.0,0.762
633,2021-04-09,California,21725654.0,28532520.0,14445185.0,19.80,54.98,7822226.0,36.56,72.21,482136.00,391393.0,9906.0,0.761
634,2021-04-10,California,22281619.0,29034050.0,14803675.0,20.53,56.39,8110488.0,37.47,73.48,555965.00,404572.0,10239.0,0.767
635,2021-04-11,California,22754163.0,29034050.0,15123816.0,21.09,57.59,8332396.0,38.28,73.48,472544.00,398820.0,10094.0,0.784


In [8]:
vaccination_by_man = "./resources/vaccinations-by-manufacturer.csv"

vaccination_by_man_df = pd.read_csv(vaccination_by_man)

vaccination_by_man_df

Unnamed: 0,location,date,vaccine,total_vaccinations
0,Chile,2020-12-24,Pfizer/BioNTech,420
1,Chile,2020-12-25,Pfizer/BioNTech,5198
2,Chile,2020-12-26,Pfizer/BioNTech,8338
3,Chile,2020-12-27,Pfizer/BioNTech,8649
4,Chile,2020-12-28,Pfizer/BioNTech,8649
...,...,...,...,...
2261,United States,2021-04-10,Moderna,82622178
2262,United States,2021-04-10,Pfizer/BioNTech,94715143
2263,United States,2021-04-11,Johnson&Johnson,6453740
2264,United States,2021-04-11,Moderna,83847244


In [56]:
vaccination_by_man_us_df = vaccination_by_man_df.loc[vaccination_by_man_df["location"] == "United States"]
vaccination_by_man_us_df

Unnamed: 0,location,date,vaccine,total_vaccinations
2061,United States,2021-01-12,Moderna,3835859
2062,United States,2021-01-12,Pfizer/BioNTech,5488697
2063,United States,2021-01-13,Moderna,4249795
2064,United States,2021-01-13,Pfizer/BioNTech,6025872
2065,United States,2021-01-15,Moderna,5122662
...,...,...,...,...
2261,United States,2021-04-10,Moderna,82622178
2262,United States,2021-04-10,Pfizer/BioNTech,94715143
2263,United States,2021-04-11,Johnson&Johnson,6453740
2264,United States,2021-04-11,Moderna,83847244
