---
<img src="CoronavirusImage.png" align="center"/>

# Group: CovidCats

### *Members:  Aldo, Andrew, Araz, Jane & Veohnti*

## Project:   Extract – Transform – Load

---

Amid the 2020 global coronavirus pandemic, this project extracts, transforms, and loads (ETLs) daily COVID-19 case data drawn from World Health Organization (WHO) reports and other sources, *as compiled by Johns Hopkins University (JHU) and published on [JHU's GitHub repository](https://github.com/CSSEGISandData/COVID-19)*. The ultimate goal is to **compare growth trajectories by country on comparable timescales**, counting from the first day on which each country had at least one hundred confirmed cases of COVID-19 (*i.e.*, the first day with 100 or more cases = Day 0).
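
The Day 0 alignment described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `align_to_day_zero` and the toy numbers are hypothetical, and the input is assumed to be a long-format dataframe with `Country/Region`, `Date`, and `Confirmed` columns.

```python
import pandas as pd

def align_to_day_zero(df, threshold=100):
    """Hypothetical helper: keep each country's rows from the first date it
    reached `threshold` confirmed cases and add a 'Day' counter (Day 0 = that date)."""
    # First date on which each country met or exceeded the threshold
    reached = df[df["Confirmed"] >= threshold]
    day_zero = (
        reached.groupby("Country/Region")["Date"].min().rename("DayZero").reset_index()
    )
    # Countries that never reached the threshold drop out of the inner join
    out = df.merge(day_zero, on="Country/Region", how="inner")
    out = out[out["Date"] >= out["DayZero"]].copy()
    out["Day"] = (out["Date"] - out["DayZero"]).dt.days
    return out.drop(columns="DayZero").reset_index(drop=True)

# Toy data, made up for illustration (not JHU figures)
toy = pd.DataFrame({
    "Country/Region": ["A"] * 4 + ["B"] * 4,
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"] * 2),
    "Confirmed": [50, 120, 200, 310, 10, 40, 90, 150],
})
aligned = align_to_day_zero(toy)
print(aligned)
```

With every country re-indexed by `Day` rather than by calendar date, trajectories that started weeks apart can be plotted on one shared axis.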

The daily time series case data is available on JHU's GitHub repository in three (3) separate CSV files by case type:  

+ confirmed cases, 
+ deaths, and 
+ recovered cases.

We used an API from [GitHub's Developer Tools](https://developer.github.com/v3/repos/contents/#get-contents "Click to visit GitHub's Developer documentation") to pull the most current daily data so that our extracted, cleaned dataframes and loaded tables could automatically update each time the ETL process was performed.

---

> ### Step 1 – Extract
>
> First we pulled the raw data from JHU's GitHub repository, reading JHU's CSV files into Pandas dataframes:

---

In [60]:
# Install PyGithub for extracting data via the GitHub API
get_ipython().system(' pip install PyGithub')



In [61]:
# Import dependencies (including those needed for cleaning later...) and a separately saved config.py file containing a GitHub personal access token (API key).
import pandas as pd
import os
import numpy as np
import datetime
from config import git_key

In [62]:
# Define function to extract current coronavirus data from Johns Hopkins' GitHub repository and read a CSV file into a dataframe
def repo_to_df(git_key, path):
    
    # Import dependencies
    from github import Github
    import io
    
    # Create a GitHub API instance using a personal access token
    g = Github(git_key)
    
    # Get the JHU repository, then request the contents of the CSV file at the given path
    repo = g.get_repo("CSSEGISandData/COVID-19")
    contents = repo.get_contents(path)
    
    # Decode the file's bytes as UTF-8 and read the CSV text into a dataframe 
    df = pd.read_csv(io.StringIO(contents.decoded_content.decode('utf-8')))
    return df

In [63]:
# Use defined function (above) to extract CSVs of coronavirus data from Github into dataframes (confirmed cases, deaths...)
confirmed_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")
deaths_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")
recovered_df = repo_to_df(git_key,"/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv")

In [64]:
# View raw dataframe of confirmed cases
confirmed_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20,3/30/20,3/31/20
0,,Afghanistan,33.000000,65.000000,0,0,0,0,0,0,...,40,40,74,84,94,110,110,120,170,174
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,89,104,123,146,174,186,197,212,223,243
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,201,230,264,302,367,409,454,511,584,716
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,113,133,164,188,224,267,308,334,370,376
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,2,3,3,3,4,4,5,7,7,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,Turks and Caicos Islands,United Kingdom,21.694000,-71.797900,0,0,0,0,0,0,...,0,0,0,0,0,0,4,4,5,5
252,,MS Zaandam,0.000000,0.000000,0,0,0,0,0,0,...,0,0,0,0,0,0,2,2,2,2
253,,Botswana,-22.328500,24.684900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,4
254,,Burundi,-3.373100,29.918900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


In [65]:
# View raw dataframe of deaths
deaths_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20,3/30/20,3/31/20
0,,Afghanistan,33.000000,65.000000,0,0,0,0,0,0,...,1,1,1,2,4,4,4,4,4,4
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,2,4,5,5,6,8,10,10,11,15
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,17,17,19,21,25,26,29,31,35,44
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,1,1,1,1,3,3,3,6,8,12
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,Turks and Caicos Islands,United Kingdom,21.694000,-71.797900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
252,,MS Zaandam,0.000000,0.000000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
253,,Botswana,-22.328500,24.684900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
254,,Burundi,-3.373100,29.918900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [66]:
# View raw dataframe of recovered cases
recovered_df

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,3/22/20,3/23/20,3/24/20,3/25/20,3/26/20,3/27/20,3/28/20,3/29/20,3/30/20,3/31/20
0,,Afghanistan,33.000000,65.000000,0,0,0,0,0,0,...,1,1,1,2,2,2,2,2,2,5
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,2,2,10,17,17,31,31,33,44,52
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,65,65,24,65,29,29,31,31,37,46
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,10,10
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,Turks and Caicos Islands,United Kingdom,21.694000,-71.797900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
238,,MS Zaandam,0.000000,0.000000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
239,,Botswana,-22.328500,24.684900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
240,,Burundi,-3.373100,29.918900,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---

> ### Step 2 – Transform
>
>
>> #### *Sub-step 2.1 – Clean & Aggregate by Country*
>>
>> Next we cleaned and aggregated the raw data with a defined function in order to: 
>>
>> + fill any NaNs with zero values, 
>> + drop unnecessary columns (*e.g.*, specific latitude and longitude coordinates), 
>> + group and aggregate (*i.e.*, sum) case counts by overall Country/Region in order to eliminate the current breakout by Province/State, 
>> + set the Country/Region as the index, 
>> + set all values as integers, and
>> + sort descending based on the latest calendar date's numbers. 


---
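
The bullet list in Sub-step 2.1 describes a wide layout (one row per country, one column per date), whereas the cleaning cell in this notebook melts the data into long format. As a rough sketch of the wide-format steps exactly as listed, assuming a hypothetical helper named `clean_wide` and made-up toy values, and additionally dropping `Province/State` so that only numeric columns are summed:

```python
import pandas as pd

def clean_wide(df):
    """Hypothetical helper following the listed steps on JHU-style wide data."""
    # Drop columns that are not case counts (dropping Province/State is an assumption)
    df = df.drop(columns=["Province/State", "Lat", "Long"])
    # Fill any NaNs with zero values
    df = df.fillna(0)
    # Sum by country; groupby makes Country/Region the index
    df = df.groupby("Country/Region").sum()
    # Set all values as integers
    df = df.astype(int)
    # Sort descending on the latest (right-most) date column
    latest = df.columns[-1]
    return df.sort_values(by=latest, ascending=False)

# Toy data, made up for illustration (not JHU figures)
raw = pd.DataFrame({
    "Province/State": ["X", "Y", None],
    "Country/Region": ["A", "A", "B"],
    "Lat": [1.0, 2.0, 3.0],
    "Long": [4.0, 5.0, 6.0],
    "1/22/20": [1, 2, 10],
    "1/23/20": [3, None, 20],
})
wide = clean_wide(raw)
print(wide)
```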

In [89]:
# Define function for cleaning the data and reshaping it to long format (one row per country/region per date)
def clean_df(df, death_confirmed_recovered):
    
    #df = df.fillna(value=0)                                          # Optional: fill NaN with zero values (disabled; sum() already skips NaN)
    df = df.drop(columns=["Province/State", "Lat", "Long"]) # Drop province/state labels and coordinates so only case counts are summed
    df = df.groupby(['Country/Region'], as_index = False).sum() # Sum case counts across provinces/states within each country/region
    df = pd.melt(df, id_vars=["Country/Region"],var_name="Date", value_name= str(death_confirmed_recovered)) # Reshape the date columns into rows
    df = df.reset_index(drop=True) # Reset to a default integer index
    df['Date'] = pd.to_datetime(df['Date']) # Convert date strings (e.g., "1/22/20") into a date type
    df = df.sort_values(by=["Country/Region", 'Date'], ascending=True) # Sort by country/region, then chronologically
    return df

In [90]:
# Use defined function (above) to clean each dataframe
confirmed_clean = clean_df(confirmed_df, "Confirmed")
deaths_clean = clean_df(deaths_df, "Deaths")
recovered_clean = clean_df(recovered_df, "Recovered")
confirmed_clean

Unnamed: 0,Country/Region,Date,Confirmed
0,Afghanistan,2020-01-22,0
180,Afghanistan,2020-01-23,0
360,Afghanistan,2020-01-24,0
540,Afghanistan,2020-01-25,0
720,Afghanistan,2020-01-26,0
...,...,...,...
11879,Zimbabwe,2020-03-27,5
12059,Zimbabwe,2020-03-28,7
12239,Zimbabwe,2020-03-29,7
12419,Zimbabwe,2020-03-30,7


In [95]:
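# Merge the three cleaned dataframes into one transformed table
# Note: the 'Date' columns must share a dtype in every frame; the ValueError recorded below is what pandas raises when one side holds strings and the other datetime64 values
first_merge = pd.merge(confirmed_clean, deaths_clean, on=['Country/Region', 'Date'])
merged_df = pd.merge(first_merge, recovered_clean, on=['Country/Region', 'Date']) # Named merged_df so the clean_df function is not overwritten
merged_df.head()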

ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
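
The `ValueError` above reports that one merge key is an `object` column and the other is `datetime64[ns]`. The `recovered_clean` output shown in this notebook still lists dates as strings such as `1/22/20`, which suggests that at least one frame had not been passed through the date conversion before the merge. A minimal sketch of checking and aligning the key dtypes first, using made-up toy frames and a hypothetical helper named `ensure_datetime`:

```python
import pandas as pd

# Toy frames (made-up values): datetime64 dates on one side, string dates on the other
confirmed_toy = pd.DataFrame({
    "Country/Region": ["A", "A"],
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23"]),
    "Confirmed": [0, 1],
})
recovered_toy = pd.DataFrame({
    "Country/Region": ["A", "A"],
    "Date": ["1/22/20", "1/23/20"],
    "Recovered": [0, 0],
})

def ensure_datetime(df, column="Date"):
    """Hypothetical helper: return a copy whose `column` holds datetime values."""
    out = df.copy()
    if not pd.api.types.is_datetime64_any_dtype(out[column]):
        # JHU column headers use month/day/two-digit-year
        out[column] = pd.to_datetime(out[column], format="%m/%d/%y")
    return out

# Compare key dtypes before merging; a mismatch here is what triggers the error
dtypes_match = bool(confirmed_toy["Date"].dtype == recovered_toy["Date"].dtype)
print("Date dtypes match before conversion:", dtypes_match)

keys = ["Country/Region", "Date"]
merged_toy = pd.merge(ensure_datetime(confirmed_toy), ensure_datetime(recovered_toy), on=keys)
print(merged_toy)
```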

In [80]:
recovered_clean

Unnamed: 0,Country/Region,Date,Recovered
0,Afghanistan,1/22/20,0
180,Afghanistan,1/23/20,0
360,Afghanistan,1/24/20,0
540,Afghanistan,1/25/20,0
720,Afghanistan,1/26/20,0
...,...,...,...
7919,Zimbabwe,3/5/20,0
8099,Zimbabwe,3/6/20,0
8279,Zimbabwe,3/7/20,0
8459,Zimbabwe,3/8/20,0


---

> ### Step 2 – Transform
>
>
>> #### *Sub-step 2.2 – ______________*
>>
>> Next we ____________ the data in order to: 
>>
>> + .... 


---

> ### Step 3 – Load
>
>
> Finally we loaded the *transformed* dataframes into respective tables within a relational PostgreSQL database: 
>
> + confirmed cases, 
> + deaths, and 
> + recovered cases.
>

---
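
The notebook does not include the load cell itself, so the following is only a sketch of how such a load could look with pandas' `DataFrame.to_sql`. To keep it runnable without a database server, it uses an in-memory SQLite connection as a stand-in; for PostgreSQL, a SQLAlchemy engine would be passed as `con` instead. The helper name `load_tables`, the table names, and the toy rows are all assumptions.

```python
import sqlite3
import pandas as pd

def load_tables(frames, con):
    """Hypothetical helper: write each dataframe to a table named after its dict key."""
    for table_name, frame in frames.items():
        # if_exists="replace" rebuilds the table on each ETL run
        frame.to_sql(table_name, con, if_exists="replace", index=False)

# Toy long-format frames, made up for illustration (not JHU figures)
dates = ["2020-01-22", "2020-01-23"]
frames = {
    "confirmed": pd.DataFrame({"country_region": ["A", "A"], "date": dates, "confirmed": [0, 1]}),
    "deaths": pd.DataFrame({"country_region": ["A", "A"], "date": dates, "deaths": [0, 0]}),
    "recovered": pd.DataFrame({"country_region": ["A", "A"], "date": dates, "recovered": [0, 0]}),
}

con = sqlite3.connect(":memory:")  # stand-in for the project's Postgres database
load_tables(frames, con)

# Read back one row count per table to confirm the load
row_counts = {
    name: int(pd.read_sql(f"SELECT COUNT(*) AS n FROM {name}", con)["n"].iloc[0])
    for name in frames
}
print(row_counts)
con.close()
```

Replacing each table on every run matches the stated goal of refreshing the tables whenever the ETL process is performed, although appending only new dates would be another reasonable design.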

Now the latest COVID-19 data has been migrated from the Johns Hopkins University's CSV files into a production SQL database ready for querying, analysis, and visualizations of countries' growth trajectories on comparable timescales.

### Thank you for viewing our ETL project!

#### ~ The CovidCats

---