# 2019 Novel Coronavirus COVID-19 (2019-nCoV) Unpivoted Data

The following script takes data from the repository of the 2019 Novel Coronavirus Visual Dashboard operated by 
the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It will apply necessary 
cleansing/reformatting to make it use in traditional relational databases and data visualization tools.


In [34]:
import pandas as pd
import pygsheets
import os
from datetime import datetime
import pycountry
from copy import deepcopy
import boto3
from botocore.exceptions import ClientError

Data downloaded directly from Johns Hopkins git repository, located at: https://github.com/CSSEGISandData/COVID-19. Their repository has three different CSV files for `confirmed`, `deaths` and `recovered` data.

In [35]:
confirmed = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv",keep_default_na=False)
deaths = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv",keep_default_na=False)
recovered = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv",keep_default_na=False)

confirmed['Case_Type'] = 'Confirmed'
deaths['Case_Type'] = 'Deaths'
recovered['Case_Type'] = 'Recovered'

key_columns = ['Country/Region','Province/State','Lat','Long','Case_Type']

data = [confirmed, deaths, recovered]

The original dataset stores the number of `Cases` for a given day in columns. 
This is not useful for reporting, thus we move these date columns to rows:

In [36]:
def unpivot(df):
    # unpivot all non-key columns
    melted = df.melt(id_vars=key_columns, var_name='Date', value_name = 'Cases')
    # change our new Date field to Date type
    melted['Date']= pd.to_datetime(melted['Date']) 
    
    return melted

unpivoted_data = list(map(unpivot, data))


### Data Quality 

 1. Replace empty values in cases to zero
 2. Maitain consistent country naming (see: https://github.com/CSSEGISandData/COVID-19/issues/396)
 3. After renaming countries, aggregate values values for one country/province per day

In [38]:
def drop_incorrect_county_state_data(df):
    stateBeforeMarch9th = df[ (df['Date'] <= '2020-03-09') & (df['Country/Region'] == 'US') & (df['Province/State'].str.contains(',') == False) ].index
    countryAfterMarch10th = df[ (df['Date'] > '2020-03-09') & (df['Country/Region'] == 'US') & df['Province/State'].str.contains(',') ].index

    return df.drop(stateBeforeMarch9th).drop(countryAfterMarch10th)

In [39]:
# Drop incorrect county/state data

unpivoted_data = [drop_incorrect_county_state_data(df) for df in unpivoted_data]

In [40]:
subdivisions = {i.name: i.code for i in pycountry.subdivisions.get(country_code="US")}
abbreviations = {subdivisions[k]: k for k in subdivisions}
locality_replacements = {"Washington, D.C.": "District of Columbia"}

for idx, df in enumerate(unpivoted_data):
    unpivoted_data[idx].replace(locality_replacements, inplace=True)

In [41]:
def resolve_US_geography(row):
    county, state = row["Province/State"].split(", ")
    state.replace("D.C.", "DC")
    row["Province/State"] = abbreviations["US-" + state.strip()]
    return row
        
def resolve_geography_df(df):
    return df.apply(lambda row: resolve_US_geography(row) if row['Country/Region'] == 'US' and row["Province/State"] not in list(subdivisions.keys()) and ", " in row["Province/State"] else row, axis="columns")

In [42]:
for df in unpivoted_data:
    df = resolve_geography_df(df)

In [43]:
unpivoted_data[0]

Unnamed: 0,Country/Region,Province/State,Lat,Long,Case_Type,Date,Cases
0,Thailand,,15.0000,101.0000,Confirmed,2020-01-22,2
1,Japan,,36.0000,138.0000,Confirmed,2020-01-22,2
2,Singapore,,1.2833,103.8333,Confirmed,2020-01-22,0
3,Nepal,,28.1667,84.2500,Confirmed,2020-01-22,0
4,Malaysia,,2.5000,112.5000,Confirmed,2020-01-22,0
...,...,...,...,...,...,...,...
21887,Aruba,,12.5211,-69.9683,Confirmed,2020-03-13,2
21888,Canada,Grand Princess,37.6489,-122.6655,Confirmed,2020-03-13,2
21889,Kenya,,-0.0236,37.9062,Confirmed,2020-03-13,1
21890,Antigua and Barbuda,,17.0608,-61.7964,Confirmed,2020-03-13,1


In [44]:
changed_names = {
    "Holy See": "Vatican City",
    "Hong Kong SAR": "Hong Kong",
    "Iran (Islamic Republic of)": "Iran",
    "Macao SAR": "Macau",
    "Republic of Korea": "South Korea",
    "Republic of Moldova": "Moldova",
    "Russian Federation": "Russia",
    "Saint Martin": "St. Martin",
    "Taipei and environs": "Taiwan",
    "Viet Nam": "Vietnam",
    "occupied Palestinian territory": "Palestine",
}


for idx,df in enumerate(unpivoted_data):
    df["Country/Region"] = df["Country/Region"].replace(changed_names)
    df["Cases"] = df["Cases"].replace('',0).astype(int)
        
    unpivoted_data[idx] = df.groupby(by=["Country/Region","Province/State","Date","Case_Type"], as_index=False) \
        .agg({"Cases": "sum", "Long": "first", "Lat": "first"})
    

    


Sorting the data by primary keys and `Date`, to make sure we can add a `Differences` column easily. 

As `Cases` are actual snapshots (running numbers), to know what was the change since the previous day we introduce a new column called `Differences`.

In [45]:
sorted_data = list( map(lambda df: df.sort_values(by=key_columns + ['Date'], ascending=True), unpivoted_data) )

#sorted_data[0].tail(5)

`Difference` is today's `Cases` minus yesterday's `Cases` for each region/state.

In [46]:
for df in sorted_data:
    df["Difference"] = df["Cases"] - df.groupby( key_columns )["Cases"].shift(1, fill_value = 0) 

concated = pd.concat(sorted_data)

#concated.tail(5)

In [47]:
#concated = concated[concated['Date'] <= '2020-03-09' ]

concated['Date'].max()


Timestamp('2020-03-13 00:00:00')

We also want to show the number of active cases. In our definition, `Active` is calculated as:

```
Active = Confirmed - Deaths - Recovered
```

As a first step, we merge the different type of cases into a single line for each `Country/Province/Date` keys:

In [48]:
confirmed = concated[concated["Case_Type"].eq("Confirmed")]
deaths = concated[concated["Case_Type"].eq("Deaths")]
recovered = concated[concated["Case_Type"].eq("Recovered")]

active = confirmed  \
        .merge(deaths, validate= "one_to_one", suffixes =["","_d"], on=["Country/Region","Province/State","Date"]) \
        .merge(recovered, validate= "one_to_one", suffixes =["","_r"], on= ["Country/Region","Province/State","Date"])

#active.head()

The apply the calculations both for `Cases` and `Difference`:

In [49]:
active["Case_Type"] = 'Active'
active["Cases"] = active["Cases"] - active["Cases_r"] - active["Cases_d"]
active["Difference"] = active["Difference"] - active["Difference_r"] - active["Difference_d"]

#active.tail()

Then merge the `Active` dataset with the original one. 

In [50]:
data = pd.concat([concated,active], join="inner")

data["Case_Type"].unique()

array(['Confirmed', 'Deaths', 'Recovered', 'Active'], dtype=object)

In [53]:
data[(data["Country/Region"] == "US") & (data["Date"] > "2020-03-08")]["Province/State"].unique()

array(['Adams, IN', 'Alabama', 'Alachua, FL', 'Alameda County, CA',
       'Alaska', 'Anoka, MN', 'Arapahoe, CO', 'Arizona', 'Arkansas',
       'Arlington, VA', 'Beadle, SD', 'Bennington County, VT',
       'Bergen County, NJ', 'Berkshire County, MA', 'Bernalillo, NM',
       'Bon Homme, SD', 'Boone, IN', 'Broward County, FL', 'Bucks, PA',
       'Burlington, NJ', 'Calaveras, CA', 'California', 'Camden, NC',
       'Camden, NJ', 'Carver County, MN', 'Charles Mix, SD',
       'Charleston County, SC', 'Charlotte County, FL', 'Charlton, GA',
       'Chatham County, NC', 'Cherokee County, GA', 'Clark County, NV',
       'Clark County, WA', 'Cobb County, GA', 'Collier, FL',
       'Collin County, TX', 'Colorado', 'Connecticut',
       'Contra Costa County, CA', 'Cook County, IL', 'Cuyahoga, OH',
       'Dallas, TX', 'Dane, WI', 'Davidson County, TN',
       'Davis County, UT', 'Davison, SD', 'DeKalb, GA', 'Delaware',
       'Delaware County, PA', 'Denver County, CO', 'Deschutes, OR',
      

Before we save the file locally, we add the `Last_Update_Date` in `UTC` time zone.

### Writing local file: `JHU_COVID-19.csv`

In [17]:
data["Last_Update_Date"] = datetime.utcnow()
data.to_csv("./JHU_COVID-19.csv", index=False)

### Upload results to publicly available Google Sheets

You have to have set service account credentials in `GSHEET_API_CREDENTIALS` environment variable. More information on how authententication works explained here: https://pygsheets.readthedocs.io/en/stable/authorization.html#environment-variables

The public google sheet URL is: https://docs.google.com/spreadsheets/d/1avGWWl1J19O_Zm0NGTGy2E-fOG05i4ljRfjl87P7FiA/edit?ts=5e5e8a9e#gid=0


In [39]:
gsheet_key = os.environ.get('GSHEET_KEY', '1ZILeAru7cNH0FOUwFQllWh2MlVsdBKSc3LyBLmZsi9o')
#gsheet_key = '1avGWWl1J19O_Zm0NGTGy2E-fOG05i4ljRfjl87P7FiA'

gc = pygsheets.authorize(service_account_env_var='GSHEET_API_CREDENTIALS')

sheet = gc.open_by_key(gsheet_key)[0]

if sheet.rows < len(data.index):
    sheet.add_rows(len(data.index) - sheet.rows)

sheet.set_dataframe(data, 'A1')

"{} rows added to the worksheet".format(sheet.rows)

KeyError: '1ZILeAru7cNH0FOUwFQllWh2MlVsdBKSc3LyBLmZsi9o'

## Upload S3

In [None]:
# You need to set up the AWS Access Key ID and AWS Secret Access Key to make it work
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
BUCKET = 'test-covid19'# TODO update when we have the final s3 bucket
FILE_NAME = 'JHU_COVID-19.csv' 

In [None]:
def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket
    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        print(e)
        return False
    return True

In [None]:
upload_file(FILE_NAME, BUCKET, object_name=None)