---
<img src="CoronavirusImage.png" align="center"/>

# Group: CovidCats

### *Members:  Aldo, Andrew, Araz, Jane & Veohnti*

## Project:   Extract – Transform – Load

---

In the midst of the 2020 global coronavirus pandemic, this project extracts, transforms, and loads (ETLs) the daily COVID-19 case data from World Health Organization (WHO) reports and other sources, *as compiled by Johns Hopkins University (JHU) and published on [JHU's GitHub repository](https://github.com/CSSEGISandData/COVID-19)*. The ultimate goal is to **compare growth trajectories by country on comparable timescales**, with each country's timeline starting on the first day it reported one hundred or more confirmed cases of COVID-19 (*i.e.*, Day 0 = first day with 100 or more cases).
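
The Day 0 convention can be illustrated with a small sketch. This is not the project's own code: the function name `align_to_day_zero`, the axis label `days_since_100`, and the toy numbers are assumptions made for illustration. It expects one row per country and one column per calendar date, and it shifts each country's series so that row 0 is the first day at or above the threshold.

```python
import pandas as pd

def align_to_day_zero(cases_by_country, threshold=100):
    """Re-index each country's series so Day 0 is its first day at or above `threshold`."""
    aligned = {}
    for country, series in cases_by_country.iterrows():
        mask = (series >= threshold).to_numpy()
        if not mask.any():
            continue                      # country has not reached the threshold yet
        start = mask.argmax()             # position of the first day at/above the threshold
        aligned[country] = series.iloc[start:].reset_index(drop=True)
    # Columns are countries, rows are days since Day 0; shorter series are padded with NaN
    return pd.DataFrame(aligned).rename_axis("days_since_100")

# Toy cumulative counts (made up), one row per country and one column per date
toy = pd.DataFrame(
    {"d1": [50, 0], "d2": [120, 90], "d3": [300, 150], "d4": [700, 400]},
    index=["A", "B"],
)
aligned = align_to_day_zero(toy)
print(aligned)
```

Because countries cross the threshold on different dates, the aligned series have different lengths, so the shorter ones end in NaN.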

The daily time series case data is available on JHU's GitHub repository in three (3) separate CSV files by case type:  

+ confirmed cases, 
+ deaths, and 
+ recovered cases.

We used the repository contents API from [GitHub's Developer Tools](https://developer.github.com/v3/repos/contents/#get-contents "Click to visit GitHub's Developer documentation") to pull the most current daily data, so that our extracted and cleaned dataframes and our loaded tables update automatically each time the ETL process is run.
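
For context, the contents endpoint returns JSON in which the file body is normally base64-encoded. The sketch below shows only that decoding step, with no network call. The helper name `contents_payload_to_df` and the sample payload are made up for illustration; in the notebook itself, PyGithub's `decoded_content` does this work.

```python
import base64
import io
import pandas as pd

def contents_payload_to_df(payload):
    """Turn a contents-API JSON payload (already parsed into a dict) into a dataframe."""
    if payload.get("encoding") != "base64":
        # Assumption: anything other than base64 is treated as unsupported here
        raise ValueError("expected base64-encoded file content")
    csv_text = base64.b64decode(payload["content"]).decode("utf-8")
    return pd.read_csv(io.StringIO(csv_text))

# Made-up payload imitating the response shape (illustrative values only)
sample_csv = "Province/State,Country/Region,1/22/20\n,Examplestan,3\n"
payload = {
    "encoding": "base64",
    "content": base64.encodebytes(sample_csv.encode("utf-8")).decode("ascii"),
}
sample_df = contents_payload_to_df(payload)
print(sample_df)
```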

---

> ### Step 1 – Extract
>
> First, we pulled the raw data from JHU's GitHub repository, reading JHU's CSV files into pandas dataframes:

---

In [None]:
# Install PyGithub for extracting data via the GitHub API
get_ipython().system(' pip install PyGithub')

In [36]:
# Import dependencies (including those needed for cleaning later...)
# plus a separately saved config.py file containing a GitHub personal access token (API key)
import pandas as pd
import os
import numpy as np
import datetime
from config import git_key

In [37]:
# Define function to extract current coronavirus data from Johns Hopkins' GitHub repository and read the CSV into a dataframe
def repo_to_df(git_key, file_path):
    
    # Import dependencies
    from github import Github
    import io
    
    # Create a GitHub API instance using a personal access token
    g = Github(git_key)
    
    # Set the GitHub repository name, then request the CSV file located at file_path within that repository
    repo = g.get_repo("CSSEGISandData/COVID-19")
    contents = repo.get_contents(file_path)
    
    # Decode the file's bytes as UTF-8 text and read the CSV into a dataframe
    df = pd.read_csv(io.StringIO(contents.decoded_content.decode('utf-8')))
    return df

In [38]:
# Use defined function (above) to extract CSVs of coronavirus data from Github into dataframes (confirmed cases, deaths...)
confirmed_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
deaths_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
recovered_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv")

In [39]:
# View raw dataframe of confirmed cases
confirmed_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/20/20,3/21/20,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20
0,,Afghanistan,33.0000,65.0000,0,0,0,0,0,0,...,24,24,40,40,74,84,94,110,110,120
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,70,76,89,104,123,146,174,186,197,212
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,90,139,201,230,264,302,367,409,454,511
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,75,88,113,133,164,188,224,267,308,334
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,1,2,2,3,3,3,4,4,5,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248,,Burma,21.9162,95.9560,0,0,0,0,0,0,...,0,0,0,0,0,0,0,8,8,10
249,Anguilla,United Kingdom,18.2206,-63.0686,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,2
250,British Virgin Islands,United Kingdom,18.4207,-64.6400,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,2
251,Turks and Caicos Islands,United Kingdom,21.6940,-71.7979,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,4


In [40]:
# View raw dataframe of deaths
deaths_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/20/20,3/21/20,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20
0,,Afghanistan,33.0000,65.0000,0,0,0,0,0,0,...,0,0,1,1,1,2,4,4,4,4
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,2,2,2,4,5,5,6,8,10,10
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,11,15,17,17,19,21,25,26,29,31
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,0,0,1,1,1,1,3,3,3,6
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248,,Burma,21.9162,95.9560,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
249,Anguilla,United Kingdom,18.2206,-63.0686,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
250,British Virgin Islands,United Kingdom,18.4207,-64.6400,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
251,Turks and Caicos Islands,United Kingdom,21.6940,-71.7979,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [41]:
# View raw dataframe of recovered cases
recovered_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/20/20,3/21/20,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20
0,,Afghanistan,33.0000,65.0000,0,0,0,0,0,0,...,1,1,1,1,1,2,2,2,2,2
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,0,2,2,2,10,17,17,31,31,33
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,32,32,65,65,24,65,29,29,31,31
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234,,Burma,21.9162,95.9560,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
235,Anguilla,United Kingdom,18.2206,-63.0686,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
236,British Virgin Islands,United Kingdom,18.4207,-64.6400,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
237,Turks and Caicos Islands,United Kingdom,21.6940,-71.7979,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---

> ### Step 2 – Transform
>
>
>> #### *Sub-step 2.1 – Clean & Aggregate by Country*
>>
>> Next we cleaned and aggregated the raw data with a defined function in order to: 
>>
>> + fill any NaNs with zero values, 
>> + drop unnecessary columns (*e.g.*, specific latitude and longitude coordinates), 
>> + group and aggregate (*i.e.*, sum) case counts by overall Country/Region in order to eliminate the current breakout by Province/State, 
>> + set the Country/Region as the index, 
>> + set all values as integers, and
>> + sort descending based on the latest calendar date's numbers. 
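
To make the effect of these steps concrete, here is a toy sketch on made-up rows laid out like the JHU files. The country names and counts are invented, and dropping `Province/State` explicitly before summing is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same layout as the JHU files (values are illustrative only)
toy_raw = pd.DataFrame({
    "Province/State": ["East", "West", np.nan],
    "Country/Region": ["Freedonia", "Freedonia", "Sylvania"],
    "Lat": [10.0, 11.0, 50.0],
    "Long": [20.0, 21.0, 60.0],
    "1/22/20": [1, 2, 0],
    "1/23/20": [4, np.nan, 3],
})

toy_clean = (
    toy_raw.drop(columns=["Province/State", "Lat", "Long"])  # keep only country and daily counts
    .fillna(0)                                               # missing counts become zero
    .groupby("Country/Region").sum()                         # one row per country, indexed by country
    .astype(int)                                             # whole-number case counts
    .sort_values(by="1/23/20", ascending=False)              # latest date first, largest on top
)
print(toy_clean)
```

The two Freedonia province rows collapse into a single country row, which is the same reduction the cleaned tables below show for countries that JHU reports by province or state.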


---

In [47]:
# Define function for cleaning the data
def clean_df(df):
    
    df = df.drop(columns=["Province/State", "Lat", "Long"])          # Drop province/state labels and coordinates, leaving the country and daily counts
    df = df.fillna(value=0)                                          # Fill NaN with zero values
    df = df.groupby("Country/Region").sum()                          # Sum case counts by country/region; the group key becomes the index
    df = df.astype(int)                                              # Set all values as integers
    df = df.sort_values(by=df.columns[-1], ascending=False)          # Sort descending by the most recent date column

    return df

In [48]:
# Use defined function (above) to clean each dataframe
confirmed_clean = clean_df(confirmed_df)
deaths_clean = clean_df(deaths_df)
recovered_clean = clean_df(recovered_df)

In [49]:
# View cleaned dataframe of confirmed cases
confirmed_clean

Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,3/20/20,3/21/20,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20
US,1,1,2,2,5,5,5,5,5,7,...,19100,25489,33276,43847,53740,65778,83836,101657,121478,140886
Italy,0,0,0,0,0,0,0,0,0,2,...,47021,53578,59138,63927,69176,74386,80589,86498,92472,97689
China,548,643,920,1406,2075,2877,5509,6087,8141,9802,...,81250,81305,81435,81498,81591,81661,81782,81897,81999,82122
Spain,0,0,0,0,0,0,0,0,0,0,...,20410,25374,28768,35136,39885,49515,57786,65719,73235,80110
Germany,0,0,0,0,0,1,4,4,4,5,...,19848,22213,24873,29056,32986,37323,43938,50871,57695,62095
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Belize,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,2,2,2,2,2
Guinea-Bissau,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,2,2,2,2
Timor-Leste,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,1,1,1,1,1
Saint Vincent and the Grenadines,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1


In [50]:
# View cleaned dataframe of deaths
deaths_clean

Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,3/20/20,3/21/20,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20
Italy,0,0,0,0,0,0,0,0,0,0,...,4032,4825,5476,6077,6820,7503,8215,9134,10023,10779
Spain,0,0,0,0,0,0,0,0,0,0,...,1043,1375,1772,2311,2808,3647,4365,5138,5982,6803
China,17,18,26,42,56,82,131,133,171,213,...,3253,3259,3274,3274,3281,3285,3291,3296,3299,3304
Iran,0,0,0,0,0,0,0,0,0,0,...,1433,1556,1685,1812,1934,2077,2234,2378,2517,2640
France,0,0,0,0,0,0,0,0,0,0,...,451,563,676,862,1102,1333,1698,1997,2317,2611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
El Salvador,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Mongolia,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Equatorial Guinea,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Mauritania,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
# View cleaned dataframe of recovered cases
recovered_clean

Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,3/20/20,3/21/20,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20
China,28,30,36,39,49,58,101,120,135,214,...,71266,71857,72362,72814,73280,73773,74181,74720,75100,75582
Spain,0,0,0,0,0,0,0,0,0,0,...,1588,2125,2575,2575,3794,5367,7015,9357,12285,14709
Italy,0,0,0,0,0,0,0,0,0,0,...,4440,6072,7024,7024,8326,9362,10361,10950,12384,13030
Iran,0,0,0,0,0,0,0,0,0,0,...,6745,7635,7931,7931,8913,9625,10457,11133,11679,12391
Germany,0,0,0,0,0,0,0,0,0,0,...,180,233,266,266,3243,3547,5673,6658,8481,9211
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Madagascar,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Eswatini,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Serbia,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,15,15,0,0,0,0
Seychelles,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---

> ### Step 2 – Transform
>
>
>> #### *Sub-step 2.2 – ______________*
>>
>> Next we ____________ the data in order to: 
>>
>> + .... 


---

---

> ### Step 3 – Load
>
>
> Finally, we loaded the *transformed* dataframes into their respective tables within a PostgreSQL relational database: 
>
> + confirmed cases, 
> + deaths, and 
> + recovered cases.
>
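
The load code itself is not shown here, so the following is only a rough sketch of one way such a load could look, not the project's actual method. It assumes pandas' `DataFrame.to_sql`, substitutes an in-memory SQLite database for Postgres so that it runs anywhere, and reshapes each wide table into long format (one row per country and date) before writing. The function name `load_table`, the column names `country`, `date` and `cases`, and the toy data are all hypothetical.

```python
import sqlite3
import pandas as pd

def load_table(clean_df, table_name, conn):
    """Write a cleaned (wide) dataframe to a SQL table in long format; returns the row count."""
    long_df = (
        clean_df.reset_index()
        .melt(id_vars="Country/Region", var_name="date", value_name="cases")
        .rename(columns={"Country/Region": "country"})
    )
    # Convert JHU-style column labels such as "1/22/20" into ISO dates such as "2020-01-22"
    long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y").dt.strftime("%Y-%m-%d")
    long_df.to_sql(table_name, conn, if_exists="replace", index=False)
    return len(long_df)

# Toy stand-in for a cleaned dataframe (made-up numbers)
demo_clean = pd.DataFrame(
    {"1/22/20": [1, 0], "1/23/20": [5, 2]},
    index=pd.Index(["US", "Italy"], name="Country/Region"),
)

conn = sqlite3.connect(":memory:")      # stand-in for a Postgres connection
n_rows = load_table(demo_clean, "confirmed", conn)
rows = conn.execute(
    "SELECT cases FROM confirmed WHERE country = 'US' AND date = '2020-01-23'"
).fetchall()
print(n_rows, rows)
```

With Postgres, the `conn` argument would instead be a SQLAlchemy engine built from the database's connection URL. Using `if_exists="replace"` rebuilds each table on every run, which suits a pipeline that re-pulls the full time series each time.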

---

The latest COVID-19 data has now been migrated from Johns Hopkins University's CSV files into a production SQL database, ready for querying, analysis, and visualization of countries' growth trajectories on comparable timescales.

### Thank you for viewing our ETL project!

#### ~ The CovidCats

---