---
<img src="CoronavirusImage.png" align="center"/>

# Group: CovidCats

### *Members:  Aldo, Andrew, Araz, Jane & Veohnti*

## Project:   Extract – Transform – Load

---

In the midst of the 2020 global coronavirus pandemic, this project extracts, transforms, and loads (ETL) the COVID-19 daily case data from World Health Organization (WHO) reports and other sources, *as compiled by Johns Hopkins University (JHU) and published on [JHU's GitHub repository](https://github.com/CSSEGISandData/COVID-19)*. The ultimate goal is to **compare growth trajectories by country on comparable timescales**, beginning from the day on which each respective country reached one hundred confirmed cases of COVID-19 (*i.e.*, the first day with 100 or more cases = Day 0).
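The "Day 0" alignment described above could be sketched roughly as follows. This is only an illustration, not the project's own code: the function name `add_days_since_threshold`, the `Day` column, and the sample numbers are all assumptions made for the example. It expects a long table with `Country/Region`, `Date`, and `Confirmed` columns.

```python
import pandas as pd


def add_days_since_threshold(df, threshold=100):
    """Keep rows at or above `threshold` confirmed cases and count days from the first such date."""
    reached = df[df["Confirmed"] >= threshold].copy()
    # First date on which each country met the threshold (its Day 0)
    day_zero = reached.groupby("Country/Region")["Date"].transform("min")
    reached["Day"] = (reached["Date"] - day_zero).dt.days
    return reached.sort_values(["Country/Region", "Date"]).reset_index(drop=True)


# Illustrative (made-up) numbers, not JHU data
sample = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "Confirmed": [50, 120, 200, 10, 30, 105],
})

aligned = add_days_since_threshold(sample)
print(aligned)
```

Counting from each country's own first qualifying date, rather than from a calendar date, is what puts the trajectories on a comparable timescale.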

The daily time series case data is available on JHU's GitHub repository in three (3) separate CSV files by case type:  

+ confirmed cases, 
+ deaths, and 
+ recovered cases.

We used an API from [GitHub's Developer Tools](https://developer.github.com/v3/repos/contents/#get-contents "Click to visit GitHub's Developer documentation") to pull the most current daily data so that our extracted, cleaned dataframes and loaded tables could automatically update each time the ETL process was performed.

---

> ### Step 1 – Extract
>
> First we pulled the raw data from JHU's GitHub repository, reading JHU's CSV files into Pandas dataframes:

---

In [1]:
# Install PyGitHub for extracting data via GitHub API
get_ipython().system(' pip install PyGithub')



In [2]:
# Import dependencies (including those needed for cleaning later...) and a separately saved config.py file containing a GitHub personal access token (API key).
import pandas as pd
import os
import numpy as np
import datetime
from config import git_key

In [3]:
# Importing the Population file
pop_file = "WPP2019_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.xlsx"

df_pop = pd.read_excel(pop_file)

In [4]:
# View raw dataframe of World Populations (from the UN)
df_pop

Unnamed: 0,Index,Variant,"Region, subregion, country or area *",Notes,Country code,Type,Parent code,1950,1951,1952,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,1,Estimates,WORLD,,900,World,0,2.53643e+06,2.58403e+06,2.63086e+06,...,7.04119e+06,7.12583e+06,7.21058e+06,7.29529e+06,7.3798e+06,7.46402e+06,7.54786e+06,7.63109e+06,7.71347e+06,7.7948e+06
1,2,Estimates,UN development groups,a,1803,Label/Separator,900,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,3,Estimates,More developed regions,b,901,Development Group,1803,814819,824004,833720,...,1.23956e+06,1.24411e+06,1.24845e+06,1.25262e+06,1.25662e+06,1.26048e+06,1.26415e+06,1.26756e+06,1.27063e+06,1.2733e+06
3,4,Estimates,Less developed regions,c,902,Development Group,1803,1.72161e+06,1.76003e+06,1.79714e+06,...,5.80164e+06,5.88171e+06,5.96213e+06,6.04268e+06,6.12317e+06,6.20354e+06,6.28371e+06,6.36353e+06,6.44284e+06,6.52149e+06
4,5,Estimates,Least developed countries,d,941,Development Group,902,195428,199180,203015,...,856471,876867,897793,919223,941131,963520,986385,1.00969e+06,1.03339e+06,1.05744e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284,285,Estimates,Bermuda,14,60,Country/Area,918,37.256,37.8,38.437,...,65.076,64.737,64.381,64.038,63.695,63.36,63.04,62.763,62.508,62.273
285,286,Estimates,Canada,,124,Country/Area,918,13733.4,14078.4,14445.5,...,34539.2,34922,35296.5,35664.3,36026.7,36382.9,36732.1,37074.6,37411,37742.2
286,287,Estimates,Greenland,26,304,Country/Area,918,22.993,23.466,23.936,...,56.555,56.477,56.412,56.383,56.378,56.408,56.473,56.565,56.66,56.772
287,288,Estimates,Saint Pierre and Miquelon,2,666,Country/Area,918,4.567,4.609,4.648,...,6.323,6.251,6.168,6.073,5.992,5.933,5.885,5.845,5.821,5.795


In [5]:
df_pop = df_pop.rename(columns={"Region, subregion, country or area *" : "Country/Region"})
df_pop

Unnamed: 0,Index,Variant,Country/Region,Notes,Country code,Type,Parent code,1950,1951,1952,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,1,Estimates,WORLD,,900,World,0,2.53643e+06,2.58403e+06,2.63086e+06,...,7.04119e+06,7.12583e+06,7.21058e+06,7.29529e+06,7.3798e+06,7.46402e+06,7.54786e+06,7.63109e+06,7.71347e+06,7.7948e+06
1,2,Estimates,UN development groups,a,1803,Label/Separator,900,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,3,Estimates,More developed regions,b,901,Development Group,1803,814819,824004,833720,...,1.23956e+06,1.24411e+06,1.24845e+06,1.25262e+06,1.25662e+06,1.26048e+06,1.26415e+06,1.26756e+06,1.27063e+06,1.2733e+06
3,4,Estimates,Less developed regions,c,902,Development Group,1803,1.72161e+06,1.76003e+06,1.79714e+06,...,5.80164e+06,5.88171e+06,5.96213e+06,6.04268e+06,6.12317e+06,6.20354e+06,6.28371e+06,6.36353e+06,6.44284e+06,6.52149e+06
4,5,Estimates,Least developed countries,d,941,Development Group,902,195428,199180,203015,...,856471,876867,897793,919223,941131,963520,986385,1.00969e+06,1.03339e+06,1.05744e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284,285,Estimates,Bermuda,14,60,Country/Area,918,37.256,37.8,38.437,...,65.076,64.737,64.381,64.038,63.695,63.36,63.04,62.763,62.508,62.273
285,286,Estimates,Canada,,124,Country/Area,918,13733.4,14078.4,14445.5,...,34539.2,34922,35296.5,35664.3,36026.7,36382.9,36732.1,37074.6,37411,37742.2
286,287,Estimates,Greenland,26,304,Country/Area,918,22.993,23.466,23.936,...,56.555,56.477,56.412,56.383,56.378,56.408,56.473,56.565,56.66,56.772
287,288,Estimates,Saint Pierre and Miquelon,2,666,Country/Area,918,4.567,4.609,4.648,...,6.323,6.251,6.168,6.073,5.992,5.933,5.885,5.845,5.821,5.795


In [5]:
# Define function to extract current coronavirus data from Johns Hopkins' GitHub repository and read a CSV file into a dataframe
def repo_to_df(git_key, file_path):

    # Import dependencies
    from github import Github
    import io

    # Create a GitHub API instance using a personal access token
    g = Github(git_key)

    # Get the repository, then the contents of the CSV file at file_path
    repo = g.get_repo("CSSEGISandData/COVID-19")
    contents = repo.get_contents(file_path)

    # Decode the file contents and read the CSV into a dataframe
    df = pd.read_csv(io.StringIO(contents.decoded_content.decode('utf-8')))
    return df
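
The GitHub contents endpoint returns a file's body base64-encoded, and PyGithub's `decoded_content` undoes that encoding before the text is handed to Pandas. The sketch below reproduces just that decoding step offline. The helper name `contents_to_df` and the sample payload are assumptions made for illustration; a real payload would come from the API response.

```python
import base64
import io

import pandas as pd


def contents_to_df(encoded_content):
    """Decode a base64-encoded CSV payload and read it into a dataframe."""
    csv_text = base64.b64decode(encoded_content).decode("utf-8")
    return pd.read_csv(io.StringIO(csv_text))


# Stand-in payload built locally (made-up values, not JHU data)
sample_csv = "Country/Region,1/22/20,1/23/20\nA,0,1\nB,2,3\n"
payload = base64.b64encode(sample_csv.encode("utf-8")).decode("ascii")

decoded_df = contents_to_df(payload)
print(decoded_df)
```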

In [6]:
# Use defined function (above) to extract CSVs of coronavirus data from Github into dataframes (confirmed cases, deaths...)
confirmed_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
deaths_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
recovered_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv")

In [7]:
# View raw dataframe of confirmed cases
confirmed_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20,3/30/20,3/31/20
0,,Afghanistan,33.000000,65.000000,0,0,0,0,0,0,...,40,40,74,84,94,110,110,120,170,174
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,89,104,123,146,174,186,197,212,223,243
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,201,230,264,302,367,409,454,511,584,716
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,113,133,164,188,224,267,308,334,370,376
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,2,3,3,3,4,4,5,7,7,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,Turks and Caicos Islands,United Kingdom,21.694000,-71.797900,0,0,0,0,0,0,...,0,0,0,0,0,0,4,4,5,5
252,,MS Zaandam,0.000000,0.000000,0,0,0,0,0,0,...,0,0,0,0,0,0,2,2,2,2
253,,Botswana,-22.328500,24.684900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,4
254,,Burundi,-3.373100,29.918900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


In [8]:
# View raw dataframe of deaths
deaths_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20,3/30/20,3/31/20
0,,Afghanistan,33.000000,65.000000,0,0,0,0,0,0,...,1,1,1,2,4,4,4,4,4,4
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,2,4,5,5,6,8,10,10,11,15
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,17,17,19,21,25,26,29,31,35,44
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,1,1,1,1,3,3,3,6,8,12
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,Turks and Caicos Islands,United Kingdom,21.694000,-71.797900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
252,,MS Zaandam,0.000000,0.000000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
253,,Botswana,-22.328500,24.684900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
254,,Burundi,-3.373100,29.918900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# View raw dataframe of recovered cases
recovered_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20,3/30/20,3/31/20
0,,Afghanistan,33.000000,65.000000,0,0,0,0,0,0,...,1,1,1,2,2,2,2,2,2,5
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,2,2,10,17,17,31,31,33,44,52
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,65,65,24,65,29,29,31,31,37,46
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,10,10
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,Turks and Caicos Islands,United Kingdom,21.694000,-71.797900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
238,,MS Zaandam,0.000000,0.000000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
239,,Botswana,-22.328500,24.684900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
240,,Burundi,-3.373100,29.918900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---

> ### Step 2 – Transform
>
>
>> #### *Sub-step 2.1 – Clean & Aggregate by Country*
>>
>> Next we cleaned and aggregated the raw data with defined functions in order to:
>>
>> + drop unnecessary columns (*e.g.*, specific latitude and longitude coordinates),
>> + group and aggregate (*i.e.*, sum) case counts by overall Country/Region in order to eliminate the current breakout by Province/State,
>> + reshape the daily date columns into one row per Country/Region and date,
>> + convert the dates to a datetime type,
>> + sort by Country/Region and date,
>> + fill any NaNs in the population data with zero values and set its Country/Region as the index, and
>> + merge the confirmed cases, deaths, recovered cases, and population into a single table.


---
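
The per-country aggregation and the wide-to-long reshaping that the case-data cleaning performs can be tried on a toy table first. This is a minimal sketch with made-up rows in the same layout as the JHU files. The explicit `format="%m/%d/%y"` is an assumption about the date column headers, based on the headers shown in the raw dataframes above.

```python
import pandas as pd

# Illustrative (made-up) rows in the same wide layout as the JHU files
raw = pd.DataFrame({
    "Province/State": ["North", "South", None],
    "Country/Region": ["A", "A", "B"],
    "Lat": [1.0, 2.0, 3.0],
    "Long": [4.0, 5.0, 6.0],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 6],
})

# Drop the columns that should not be summed, then total the counts per country
wide = raw.drop(columns=["Province/State", "Lat", "Long"])
wide = wide.groupby("Country/Region", as_index=False).sum()

# Reshape from one column per date to one row per country and date
long_df = wide.melt(id_vars=["Country/Region"], var_name="Date", value_name="Confirmed")
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")
long_df = long_df.sort_values(["Country/Region", "Date"]).reset_index(drop=True)
print(long_df)
```

The long layout gives every case type the same `Country/Region` and `Date` keys, which is what makes the later merges straightforward.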

In [65]:
# Define function for cleaning the population data
def clean_pop_df(df_pop):
    # Get the index labels of rows that are not individual countries/areas
    indexNames = df_pop[df_pop['Type'] != "Country/Area"].index
    # Drop those rows (inplace=True also modifies the dataframe that was passed in)
    df_pop.drop(indexNames, inplace=True)
    # Drop the metadata columns and every year column before 2020
    year_cols = [str(year) for year in range(1950, 2020)]
    df_pop = df_pop.drop(columns=["Variant", "Notes", "Parent code", "Index", "Country code", "Type"] + year_cols)
    df_pop = df_pop.sort_values(by=df_pop.columns[0], ascending=True)  # Sort by country/region name
    df_pop = df_pop.set_index(["Country/Region"])                      # Set country/region as the index
    df_pop = df_pop.rename(columns={"2020": "Current Population"})
    df_pop = df_pop.fillna(value=0)                                    # Fill null values with zero
    return df_pop
df_pop

Unnamed: 0,Index,Variant,Country/Region,Notes,Country code,Type,Parent code,1950,1951,1952,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
26,27,Estimates,Burundi,,108,Country/Area,910,2308.93,2360.44,2406.03,...,8958.41,9245.99,9540.3,9844.3,10160,10488,10827,11175.4,11530.6,11890.8
27,28,Estimates,Comoros,,174,Country/Area,910,159.459,163.146,166.538,...,706.578,723.865,741.511,759.39,777.435,795.597,813.89,832.322,850.891,869.595
28,29,Estimates,Djibouti,,262,Country/Area,910,62,63.313,64.744,...,853.671,868.136,883.296,898.707,913.998,929.117,944.1,958.923,973.557,988.002
29,30,Estimates,Eritrea,,232,Country/Area,910,822.347,835,849.258,...,3213.97,3250.1,3281.45,3311.44,3342.82,3376.56,3412.89,3452.8,3497.12,3546.43
30,31,Estimates,Ethiopia,,231,Country/Area,910,18128,18467,18819.7,...,90139.9,92727,95385.8,98094.3,100835,103603,106400,109224,112079,114964
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284,285,Estimates,Bermuda,14,60,Country/Area,918,37.256,37.8,38.437,...,65.076,64.737,64.381,64.038,63.695,63.36,63.04,62.763,62.508,62.273
285,286,Estimates,Canada,,124,Country/Area,918,13733.4,14078.4,14445.5,...,34539.2,34922,35296.5,35664.3,36026.7,36382.9,36732.1,37074.6,37411,37742.2
286,287,Estimates,Greenland,26,304,Country/Area,918,22.993,23.466,23.936,...,56.555,56.477,56.412,56.383,56.378,56.408,56.473,56.565,56.66,56.772
287,288,Estimates,Saint Pierre and Miquelon,2,666,Country/Area,918,4.567,4.609,4.648,...,6.323,6.251,6.168,6.073,5.992,5.933,5.885,5.845,5.821,5.795


In [66]:
# Cleaning the data
pop_clean = clean_pop_df(df_pop)

In [67]:
# Viewing the Data, population is in thousands
pop_clean.head()

Country/Region,Current Population
Afghanistan,38928.341
Albania,2877.8
Algeria,43851.043
American Samoa,55.197
Andorra,77.265


In [70]:
# Keep the index so that the Country/Region names are written to the file
pop_clean.to_csv("pop_data.csv", index=True)

In [13]:
# Define function for cleaning the case data
# (named clean_cases_df so that the merged dataframe called clean_df below does not overwrite it)
def clean_cases_df(df, case_type):

    # Drop "Province/State" (text, must not be summed) along with "Lat" and "Long"
    df = df.drop(columns=["Province/State", "Lat", "Long"])
    # Group by country/region and sum the case counts
    df = df.groupby(['Country/Region'], as_index=False).sum()
    # Reshape from one column per date to one row per country/region and date
    df = pd.melt(df, id_vars=["Country/Region"], var_name="Date", value_name=str(case_type))
    df = df.reset_index(drop=True)  # Define new index
    df['Date'] = pd.to_datetime(df['Date'])  # Convert date column into a date type
    df = df.sort_values(by=["Country/Region", 'Date'], ascending=True)  # Sort by country/region, then by date
    return df

In [14]:
# Use defined function (above) to clean each dataframe
confirmed_clean = clean_cases_df(confirmed_df, "Confirmed")
deaths_clean = clean_cases_df(deaths_df, "Deaths")
recovered_clean = clean_cases_df(recovered_df, "Recovered")


In [15]:
# Merge DFs for one transformed table
first_merge = pd.merge(confirmed_clean, deaths_clean, on=['Country/Region', 'Date'])
clean_df = pd.merge(first_merge, recovered_clean, on=['Country/Region', 'Date'])
clean_df

Unnamed: 0,Country/Region,Date,Confirmed,Deaths,Recovered
0,Afghanistan,2020-01-22,0,0,0
1,Afghanistan,2020-01-23,0,0,0
2,Afghanistan,2020-01-24,0,0,0
3,Afghanistan,2020-01-25,0,0,0
4,Afghanistan,2020-01-26,0,0,0
...,...,...,...,...,...
12595,Zimbabwe,2020-03-27,5,1,0
12596,Zimbabwe,2020-03-28,7,1,0
12597,Zimbabwe,2020-03-29,7,1,0
12598,Zimbabwe,2020-03-30,7,1,0


In [68]:
#merge tables on countries
final_merge = pd.merge(clean_df, pop_clean, how='left', on='Country/Region')
final_merge

Unnamed: 0,Country/Region,Date,Confirmed,Deaths,Recovered,Current Population
0,Afghanistan,2020-01-22,0,0,0,38928.341
1,Afghanistan,2020-01-23,0,0,0,38928.341
2,Afghanistan,2020-01-24,0,0,0,38928.341
3,Afghanistan,2020-01-25,0,0,0,38928.341
4,Afghanistan,2020-01-26,0,0,0,38928.341
...,...,...,...,...,...,...
12595,Zimbabwe,2020-03-27,5,1,0,14862.927
12596,Zimbabwe,2020-03-28,7,1,0,14862.927
12597,Zimbabwe,2020-03-29,7,1,0,14862.927
12598,Zimbabwe,2020-03-30,7,1,0,14862.927


In [69]:
# Export cleaned data set into a csv file
final_merge.to_csv("clean_data.csv",index=False)
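
Because the population table is joined with a left merge on `Country/Region`, any country whose name is spelled differently in the two sources ends up with a missing population. For example, the JHU files list the United States as `US`, while the UN workbook uses a longer name. A quick check for such mismatches might look like the sketch below. The helper `unmatched_countries`, the `name_map` dictionary, and the sample values are all assumptions for illustration, not part of the project's code.

```python
import pandas as pd


def unmatched_countries(cases, population):
    """Return the country names in `cases` that have no match in `population`."""
    checked = cases.merge(population, how="left", on="Country/Region", indicator=True)
    missing = checked.loc[checked["_merge"] == "left_only", "Country/Region"]
    return missing.unique().tolist()


# Illustrative values only
cases = pd.DataFrame({
    "Country/Region": ["Albania", "US"],
    "Confirmed": [243, 1000],
})
population = pd.DataFrame({
    "Country/Region": ["Albania", "United States of America"],
    "Current Population": [2877.8, 331002.651],
})

unmatched = unmatched_countries(cases, population)
print(unmatched)

# Hypothetical mapping from the population source's names to the case data's names
name_map = {"United States of America": "US"}
population["Country/Region"] = population["Country/Region"].replace(name_map)
unmatched_after = unmatched_countries(cases, population)
print(unmatched_after)
```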

---

> ### Step 2 – Transform
>
>
>> #### *Sub-step 2.2 – ______________*
>>
>> Next we ____________ the data in order to: 
>>
>> + .... 


---

> ### Step 3 – Load
>
>
> Finally we loaded the *transformed* dataframes into respective tables within a relational PostgreSQL database: 
>
> + confirmed cases, 
> + deaths, and 
> + recovered cases.
>

---
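
The load step is not shown in the notebook, so the following is only a hedged sketch of how a dataframe can be written to a SQL table with Pandas' `to_sql`. It uses an in-memory SQLite connection purely as a stand-in so that it runs anywhere. For PostgreSQL, one would typically pass a SQLAlchemy engine as `con` instead. The helper `load_table`, the table name `confirmed_cases`, and the sample rows are assumptions.

```python
import sqlite3

import pandas as pd


def load_table(df, table_name, con):
    """Write a dataframe to a SQL table, replacing any existing table of that name."""
    df.to_sql(table_name, con, if_exists="replace", index=False)


# Illustrative rows only
confirmed = pd.DataFrame({
    "country_region": ["A", "A"],
    "date": ["2020-03-30", "2020-03-31"],
    "confirmed": [7, 8],
})

con = sqlite3.connect(":memory:")  # stand-in, not the project's Postgres database
load_table(confirmed, "confirmed_cases", con)

# Read the row count back to confirm the load
row_count = con.execute("SELECT COUNT(*) FROM confirmed_cases").fetchone()[0]
print(row_count)
con.close()
```

Using `if_exists="replace"` fits a pipeline that re-pulls the full time series on every run, since each load then overwrites the previous snapshot rather than appending duplicates.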

Now the latest COVID-19 data has been migrated from the Johns Hopkins University's CSV files into a production SQL database ready for querying, analysis, and visualizations of countries' growth trajectories on comparable timescales.

### Thank you for viewing our ETL project!

#### ~ The CovidCats

---