<h1>Covid Data Prep</h1>
<h3>Feature Engineering</h3>
<p>This notebook takes the Covid-19 dataframe and prepares it for use in the XGBoost and Deep Learning notebooks.</p>
<br>
<p>The raw data consists of the following features:</p>
<ol>
    <li>dateRep</li>
    <li style="color:red;">day</li>
    <li style="color:red;">month</li>
    <li style="color:red;">year</li>
    <li>cases</li>
    <li>deaths</li>
    <li>countriesAndTerritories</li>
    <li style="color:red;">geoId</li>
    <li style="color:red;">countryterritoryCode</li>
    <li>popData2018</li>
    <li style="color:red;">continentExp</li>
</ol>
<br>
<p>Feature names in red will be removed, as they were judged to add little information to the model. The remaining features are evaluated in the notebook below.</p>

<h5>Import dependencies</h5>
<ul>
    <li>pandas: feature extraction and creation</li>
    <li>numpy: numerical data manipulation</li>
    <li>matplotlib.pyplot: plotting library</li>
    <li>os: interaction with the operating system</li>
    <li>seaborn: plotting library</li>
    <li>sklearn.model_selection.train_test_split: splitting the data into the various data sets (train, test and validation)</li>
    <li>sklearn.impute.SimpleImputer: imputing missing values</li>
</ul>

In [157]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [158]:
import os
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

<h2>Read in raw data</h2>

<p>Change to the relevant directory and read in the csv</p>

The csv will need some preprocessing

In [159]:
os.chdir(r"/Users/samueleckford/Python/Covid_19/covid_19_analysis/datasets")

<p>Now read the file. No lines should need skipping.</p>

In [175]:
# Import and format dataframe
covid19_df = pd.read_csv('COVID-19-geographic-disbtribution-worldwide-2020-05-08.csv', engine='python')
covid19_df.head(2)

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
0,08/05/2020,8,5,2020,171,2,Afghanistan,AF,AFG,37172386.0,Asia
1,07/05/2020,7,5,2020,168,9,Afghanistan,AF,AFG,37172386.0,Asia


In [176]:
# Columns to drop (shown in red in the feature list above)
drop_columns = ['geoId', 'day', 'month', 'year', 'countryterritoryCode', 'continentExp']
# Convert 'dateRep' from day-first strings (dd/mm/yyyy) to datetime
covid19_df['dateRep'] = pd.to_datetime(covid19_df['dateRep'], dayfirst=True)
# Drop the columns that add little value
covid19_df.drop(columns=drop_columns, inplace=True)

In [177]:
# Sort the table by the date
covid19_df.sort_values(by=['dateRep'], ascending=True, inplace=True)
# Create a cumulative sum of covid cases and deaths
covid19_df['Cum_Cases'] = covid19_df.groupby("countriesAndTerritories")['cases'].cumsum()
covid19_df['Cum_Deaths'] = covid19_df.groupby("countriesAndTerritories")['deaths'].cumsum()
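
<p>As a minimal sketch of the per-country cumulative sum used above, the toy table below (made-up numbers, with hypothetical countries "A" and "B") shows that sorting by date first and then grouping gives a running total that restarts for each country.</p>

```python
import pandas as pd

# Toy data, made up for illustration: two countries, three days each, deliberately unsorted
toy = pd.DataFrame({
    "dateRep": pd.to_datetime(
        ["03/01/2020", "01/01/2020", "02/01/2020",
         "02/01/2020", "01/01/2020", "03/01/2020"],
        dayfirst=True,
    ),
    "countriesAndTerritories": ["A", "A", "A", "B", "B", "B"],
    "cases": [30, 10, 20, 5, 1, 9],
})

# Sort chronologically so the running total accumulates in date order
toy = toy.sort_values(by="dateRep")
# groupby + cumsum restarts the running total for each country
toy["Cum_Cases"] = toy.groupby("countriesAndTerritories")["cases"].cumsum()

# Re-sort by country and date purely to make the result easy to read
result = toy.sort_values(["countriesAndTerritories", "dateRep"])
print(result)
```

<p>Country "A" accumulates to 10, 30, 60 and country "B" to 1, 6, 15, independently of each other.</p>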

In [178]:
# Create a column counting the days since a country passed 100 cumulative cases
# The 0/1 flag can be computed globally because it is a simple true/false test per row
covid19_df['flag'] = np.where(covid19_df['Cum_Cases'] > 100, 1, 0)
# Group the flagged rows by country and take a cumulative sum of 'flag' to get a day counter;
# rows at or below 100 cumulative cases are excluded here, so they become NaN on assignment
covid19_df['flag'] = covid19_df.loc[covid19_df['Cum_Cases'] > 100].groupby("countriesAndTerritories")['flag'].cumsum()
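
<p>The effect of the two 'flag' assignments is easier to see on a small table. The sketch below uses made-up cumulative counts for two hypothetical countries and follows the same steps: a 0/1 flag, then a per-country cumulative sum over only the rows above the threshold.</p>

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; rows are already in date order per country
toy = pd.DataFrame({
    "countriesAndTerritories": ["A"] * 5 + ["B"] * 3,
    "Cum_Cases": [20, 80, 150, 300, 600, 50, 120, 130],
})

over = toy["Cum_Cases"] > 100
# Step 1: 0/1 indicator, computed over the whole table
toy["flag"] = np.where(over, 1, 0)
# Step 2: per-country running count over the flagged rows only.
# The result has no entries for the unflagged rows, so index alignment
# on assignment leaves those rows as NaN.
toy["flag"] = toy.loc[over].groupby("countriesAndTerritories")["flag"].cumsum()

# Filling the NaNs with 0 gives "0 days since passing 100 cases"
days_since = toy["flag"].fillna(0)
print(toy.assign(days_since=days_since))
```

<p>Rows before the threshold end up as NaN in 'flag', which is why the dataframe needs a fill step afterwards.</p>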

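<p>We don't want NaNs (the 'flag' column is NaN for every row before a country passed 100 cumulative cases), so we check for them and replace them in the next few cells.</p>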

In [None]:
# Print the fraction of missing values in each column
for i in covid19_df.columns:
    frac_null = covid19_df[i].isna().sum() / len(covid19_df)
    print(i, ':', frac_null)
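
<p>The same per-column check can also be written without a loop, since the mean of a boolean mask is the fraction of True values. A minimal sketch on made-up data:</p>

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: 'flag' is missing in half of the rows
toy = pd.DataFrame({
    "deaths": [1, 2, 3, 4],
    "flag": [np.nan, np.nan, 1.0, 2.0],
})

# isna() gives a boolean frame; mean() per column is the fraction of NaNs
frac_null = toy.isna().mean()
print(frac_null)
```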

In [196]:
top_count = covid19_df.groupby('countriesAndTerritories')['deaths'].sum().sort_values(ascending=False).iloc[:20]
top_count = list(top_count.keys())
top_count[:4]

['United_States_of_America', 'United_Kingdom', 'Italy', 'Spain']

In [180]:
len(covid19_df)

15698

In [181]:
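# Trim the dataframe to just the 20 countries with the most deaths
# (.copy() avoids a SettingWithCopyWarning when filling in place below)
covid19_df = covid19_df[covid19_df['countriesAndTerritories'].isin(top_count)].copy()
# Replace the remaining NaNs with 0; this fills the empty 'flag' rows
covid19_df.fillna(0, inplace=True)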

In [182]:
len(covid19_df)

2443

<p>Above we kept only the countries in the 'top_count' list, which reduces the dataframe from 15698 rows to 2443.</p>
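
<p>The ranking and filtering steps can be sketched on a toy table. The example below uses made-up data and keeps the top 2 countries instead of 20; <code>nlargest</code> is used here as a shorthand for sorting in descending order and slicing, which is an alternative to the exact calls in the notebook.</p>

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "countriesAndTerritories": ["A", "A", "B", "B", "C", "C"],
    "deaths": [5, 5, 1, 0, 20, 30],
})

top_n = 2  # the notebook uses 20
# Total deaths per country, then the names of the top_n countries
top = (
    toy.groupby("countriesAndTerritories")["deaths"]
    .sum()
    .nlargest(top_n)
    .index.tolist()
)

# Keep only rows belonging to those countries; copy so later edits are safe
filtered = toy[toy["countriesAndTerritories"].isin(top)].copy()
print(top, len(toy), "->", len(filtered))
```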

<p>Below we export the dataframe to a .csv file as the master copy. We then carry out the final processing stages for the train and test datasets.</p>

In [185]:
# Export the polished dataframe to re-import after modelling.
covid19_df.to_csv(r'/Users/samueleckford/Python/Covid_19/covid_19_analysis/datasets/Data_Export/covid19_df.csv')
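
<p>Because <code>to_csv</code> writes the index as an unnamed first column, a plain <code>read_csv</code> on re-import produces an extra 'Unnamed: 0' column and reads the dates back as strings. The sketch below, which uses an in-memory buffer and made-up rows in place of the real file, shows one way to restore the index and the datetime type on re-import.</p>

```python
import io

import pandas as pd

# Toy data, made up for illustration, with a non-default index
toy = pd.DataFrame(
    {"dateRep": pd.to_datetime(["2020-04-22", "2020-04-23"]), "deaths": [2, 9]},
    index=[7, 3],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)

# A plain read turns the saved index into an 'Unnamed: 0' column
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores the index; parse_dates restores the datetime column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=["dateRep"])
print(naive.columns.tolist())
print(restored.dtypes)
```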

In [191]:
# 'get_dummies' creates a new column for each country that is populated with either a 1 or a 0
df_train = pd.get_dummies(data=covid19_df, columns=["countriesAndTerritories"])
# df_y is target to predict, in this case 'deaths'
df_y = df_train[['deaths', 'dateRep']]
# df_train contains the columns we will use to predict 'deaths'
df_train.drop(columns=['cases', 'deaths', 'Cum_Cases', 'Cum_Deaths'], inplace=True)
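
<p>To illustrate what <code>get_dummies</code> does to the country column, here is a minimal sketch on made-up rows. The country column is replaced by one indicator column per country; depending on the pandas version, the indicators are stored as integers or booleans.</p>

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "countriesAndTerritories": ["Italy", "Spain", "Italy"],
    "flag": [1, 1, 2],
    "deaths": [3, 4, 5],
})

# One indicator column per country, named <column>_<value>
encoded = pd.get_dummies(data=toy, columns=["countriesAndTerritories"])
print(encoded.columns.tolist())

# Cast to int so the values print as 1/0 regardless of pandas version
italy = encoded["countriesAndTerritories_Italy"].astype(int).tolist()
print(italy)
```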

In [192]:
# Create the output folders; exist_ok=True makes the cell safe to re-run
# and also covers the case where "data" exists but a subfolder is missing
os.makedirs("./data/train", exist_ok=True)
os.makedirs("./data/test", exist_ok=True)

In [193]:
# Instead of randomly splitting the data we will select a date to test 'blind' from
date_slice = '2020-04-23'

In [194]:
# Split the data according to the chosen date
X_train = df_train.loc[df_train['dateRep'] < date_slice]
X_test = df_train.loc[df_train['dateRep'] >= date_slice]
y_train = df_y.loc[df_y['dateRep'] < date_slice]
y_test = df_y.loc[df_y['dateRep'] >= date_slice]
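
<p>A date-based split like the one above can be checked on a toy table. The sketch below uses made-up data with the same cut-off string and confirms that every training row is strictly earlier than every test row, which is the property that keeps the test period 'blind'.</p>

```python
import pandas as pd

# Toy data, made up for illustration: six consecutive days around the cut-off
toy = pd.DataFrame({
    "dateRep": pd.date_range("2020-04-20", periods=6, freq="D"),
    "deaths": range(6),
})

date_slice = "2020-04-23"
# pandas compares a datetime column against a date string directly
train = toy.loc[toy["dateRep"] < date_slice]
test = toy.loc[toy["dateRep"] >= date_slice]

# No overlap: the latest training date precedes the earliest test date
no_leak = train["dateRep"].max() < test["dateRep"].min()
print(len(train), len(test), no_leak)
```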

In [195]:
X_train.to_csv("./data/train/train_x.csv")
y_train.to_csv("./data/train/train_y.csv")
X_test.to_csv("./data/test/test_x.csv")
y_test.to_csv("./data/test/test_y.csv")