# Preprocessing and Training Data Environment
In this notebook the data set is prepared for the modeling step. This includes creating dummy features for the categorical cluster assignment, standardizing the numeric features, and splitting the data into training and test sets.

In [1]:
# Load Python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.model_selection import train_test_split
%matplotlib inline
os.getcwd()

'/Users/lisahw/Documents/Courses and Conferences/DataScience/MyProject/Capstone_02/Springboard/notebooks'

### Load the EDA data set

In [5]:
df = pd.read_csv('../data/interim/COVID_cluster.csv',index_col=0)
df.head()

Unnamed: 0,Country,Confirmed,Deaths,Recovered,Active,Cardio Death Rate,Diabetes Percentage,Obesity,Undernourished,PopMale,PopFemale,PopTotal,Total Population,Clusters
0,US,0.397961,0.023945,0.0,0.374016,0.151089,10.79,37.3,1.0,2.812048,3.923944,6.735992,329064.917,2
1,Canada,0.184218,0.012892,0.0,0.171326,0.105599,7.37,31.3,1.0,3.159014,4.289525,7.44854,37411.038,2
2,United Kingdom,0.320635,0.046886,0.001482,0.272266,0.122137,4.28,29.5,1.0,3.676556,4.856698,8.533254,67530.161,2
3,China,0.005858,0.000323,0.005519,1.6e-05,0.261899,9.74,6.6,8.5,1.555179,2.04811,3.603289,1433783.692,0
4,Netherlands,0.249054,0.031824,0.000871,0.216358,0.109361,5.29,23.1,1.0,3.546222,4.785875,8.332098,17097.123,2


The cluster assignment is categorical, whereas the country name is only an identifier. Total Population (in thousands) is also of interest: the other measures are given relative to it, so it is needed to recover absolute numbers.
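To illustrate how a relative measure and the population column combine into an absolute number, here is a minimal sketch. It rests on an assumption that the data set itself does not state: that relative columns such as `Confirmed` are percentages of the population, and that `Total Population` is given in thousands. The helper `absolute_count` and the frame `toy` are hypothetical names used only for this example.

```python
import pandas as pd

def absolute_count(percent_of_population, population_thousands):
    """Convert a percentage of the population into a head count.

    ASSUMPTION: the relative value is a percentage and the population
    is expressed in thousands; adjust the factors if the units differ.
    """
    return percent_of_population / 100 * population_thousands * 1000

# Toy values, not taken from the COVID data set
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'Confirmed': [0.5, 0.25],                # assumed: percent of population
    'Total Population': [1000.0, 40000.0],   # in thousands
})
toy['Confirmed (absolute)'] = absolute_count(toy['Confirmed'], toy['Total Population'])
print(toy)
```

If the units in the real data differ from this assumption, only the two conversion factors in `absolute_count` need to change.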

### Create dummies for the Clusters

In [20]:
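# One-hot encode the cluster label; drop_first=True makes cluster 0 the baseline
df_dummies = pd.get_dummies(df['Clusters'],drop_first=True,prefix='Cluster')
# Replace the original Clusters column with the dummy columns
df_all = pd.concat([df.drop('Clusters',axis=1),df_dummies],axis=1,join='inner')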
df_all.head()

Unnamed: 0,Country,Confirmed,Deaths,Recovered,Active,Cardio Death Rate,Diabetes Percentage,Obesity,Undernourished,PopMale,PopFemale,PopTotal,Total Population,Cluster_1,Cluster_2
0,US,0.397961,0.023945,0.0,0.374016,0.151089,10.79,37.3,1.0,2.812048,3.923944,6.735992,329064.917,0,1
1,Canada,0.184218,0.012892,0.0,0.171326,0.105599,7.37,31.3,1.0,3.159014,4.289525,7.44854,37411.038,0,1
2,United Kingdom,0.320635,0.046886,0.001482,0.272266,0.122137,4.28,29.5,1.0,3.676556,4.856698,8.533254,67530.161,0,1
3,China,0.005858,0.000323,0.005519,1.6e-05,0.261899,9.74,6.6,8.5,1.555179,2.04811,3.603289,1433783.692,0,0
4,Netherlands,0.249054,0.031824,0.000871,0.216358,0.109361,5.29,23.1,1.0,3.546222,4.785875,8.332098,17097.123,0,1
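The effect of `drop_first=True` can be seen on a small toy series: with three clusters only two dummy columns are produced, and a row belonging to cluster 0 is encoded as zeros in both. This is a sketch on made-up values. `dtype=int` is passed explicitly because recent pandas versions return boolean dummies by default, whereas the output above shows 0/1 values.

```python
import pandas as pd

# Toy cluster labels, not taken from the COVID data set
clusters = pd.Series([2, 2, 0, 1], name='Clusters')

# Three categories (0, 1, 2) yield two columns; category 0 is the dropped baseline
dummies = pd.get_dummies(clusters, prefix='Cluster', drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids a redundant column: the omitted cluster is implied whenever all remaining dummy columns are zero.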


### Standardize value ranges
The StandardScaler class is used to standardize all numeric value ranges to zero mean and unit variance. The country identifier and the two dummy features are left unscaled.

In [36]:
# Scale every column except the identifier and the two dummy features
num_cols = df_all.columns.drop(['Country','Cluster_1','Cluster_2'])
scaler = StandardScaler()  # already imported in the first cell
df_scaled = pd.DataFrame(scaler.fit_transform(df_all[num_cols]),columns=num_cols,index=df_all.index)
# Reassemble: identifier, scaled features, dummy features
df_scaled = pd.concat([df_all['Country'],df_scaled,df_all[['Cluster_1','Cluster_2']]],axis=1)
df_scaled.head()


Unnamed: 0,Country,Confirmed,Deaths,Recovered,Active,Cardio Death Rate,Diabetes Percentage,Obesity,Undernourished,PopMale,PopFemale,PopTotal,Total Population,Cluster_1,Cluster_2
0,US,2.752835,1.797774,-0.422845,5.481596,-0.888446,0.930265,1.999616,-0.800812,1.003797,0.748004,0.855217,1.675076,0,1
1,Canada,0.9389,0.792328,-0.422845,2.205108,-1.270117,0.017751,1.360261,-0.800812,1.280396,0.925822,1.073087,-0.074331,0,1
2,United Kingdom,2.096601,3.884546,-0.404671,3.836811,-1.131361,-0.806713,1.168454,-0.800812,1.692977,1.201695,1.404751,0.10633,0,1
3,China,-0.574762,-0.350955,-0.355182,-0.564141,0.041266,0.650108,-1.271752,-0.178617,0.00183,-0.1644,-0.102644,8.301432,0,0
4,Netherlands,1.489127,2.514475,-0.41216,2.933047,-1.238552,-0.537228,0.486475,-0.800812,1.589076,1.167247,1.343245,-0.196179,0,1


In [37]:
df_scaled.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Confirmed,147.0,-3.247591e-17,1.003419,-0.623486,-0.591817,-0.486957,0.236951,4.719149
Deaths,147.0,-4.720336e-17,1.003419,-0.380374,-0.370784,-0.342408,-0.163702,6.384002
Recovered,147.0,-2.076948e-17,1.003419,-0.422845,-0.408573,-0.355836,-0.067954,6.646002
Active,147.0,1.812609e-17,1.003419,-0.564396,-0.540252,-0.443827,-0.044857,5.481596
Cardio Death Rate,147.0,-6.117555e-17,1.003419,-1.490183,-0.790136,-0.1105,0.56402,3.921871
Diabetes Percentage,147.0,-3.413747e-16,1.003419,-1.68454,-0.667968,-0.099648,0.460668,3.926619
Obesity,147.0,3.088988e-16,1.003419,-1.751268,-1.101257,0.358604,0.763529,1.999616
Undernourished,147.0,-6.948335e-17,1.003419,-0.800812,-0.800812,-0.344535,0.327435,4.060606
PopMale,147.0,0.0,1.003419,-1.090112,-0.870169,-0.362523,0.792058,3.156894
PopFemale,147.0,-1.04225e-16,1.003419,-1.113988,-0.838078,-0.393192,0.782449,2.98781


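As expected, the mean of each scaled column is zero up to floating-point error. The reported standard deviation is 1.003419 rather than exactly one because `describe()` computes the sample standard deviation (ddof=1), whereas StandardScaler divides by the population standard deviation (ddof=0). For 147 rows the ratio of the two is sqrt(147/146) ≈ 1.003419.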
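The value 1.003419 in the `std` column can be reproduced with a short sketch. It standardizes a toy column of 147 random values by hand, dividing by the population standard deviation (ddof=0) as StandardScaler does, and then compares NumPy's population estimate with the sample estimate that pandas reports by default. The random data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=147)  # toy column with 147 rows

# Standardize by hand with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

pop_std = np.std(z, ddof=0)       # one, up to floating-point error
sample_std = pd.Series(z).std()   # pandas default is ddof=1
expected = np.sqrt(147 / 146)     # ratio between the two estimates

print(pop_std, sample_std, expected)
```

The gap between the two estimates shrinks as the number of rows grows, so it is only noticeable for small data sets like this one.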

### Train-Test Split
Our total data set consists of 147 instances and 14 features. To keep enough instances in each set, 75% go to training and 25% to testing, which gives 110 training and 37 test instances. As we have seen before, some features are strongly correlated with others; these are dropped because they add no extra information.

In [41]:
# Drop the identifier and the strongly correlated (redundant) features
df_analysis = df_scaled.drop(['Country','Recovered','Active','PopFemale','PopTotal'],axis=1)
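One possible way to find such redundant features is to list the feature pairs whose absolute Pearson correlation exceeds a threshold. The sketch below runs on toy data. The helper `highly_correlated_pairs` and the threshold of 0.9 are assumptions made for illustration, not necessarily the procedure used to select the columns dropped above.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: the second column is almost a copy of the first, the third is independent
rng = np.random.default_rng(0)
pop_male = rng.normal(size=200)
toy = pd.DataFrame({
    'PopMale': pop_male,
    'PopFemale': pop_male + rng.normal(scale=0.05, size=200),
    'Obesity': rng.normal(size=200),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

For each reported pair, keeping one member is usually enough. Which member to keep remains a modeling decision.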

In [42]:
# We want to predict the death rate for a certain country
X = df_analysis.drop(['Deaths'],axis=1)
y = df_analysis['Deaths']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42)


In [43]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110 entries, 145 to 102
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Confirmed            110 non-null    float64
 1   Cardio Death Rate    110 non-null    float64
 2   Diabetes Percentage  110 non-null    float64
 3   Obesity              110 non-null    float64
 4   Undernourished       110 non-null    float64
 5   PopMale              110 non-null    float64
 6   Total Population     110 non-null    float64
 7   Cluster_1            110 non-null    uint8  
 8   Cluster_2            110 non-null    uint8  
dtypes: float64(7), uint8(2)
memory usage: 7.1 KB
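The 110 training entries follow from how `train_test_split` rounds a fractional `test_size`: the test set receives ceil(0.25 × 147) = 37 rows and the training set the remaining 110. A minimal check on a placeholder array of row numbers (not the real features) might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder for the 147 rows of the data set
rows = np.arange(147)

train_rows, test_rows = train_test_split(rows, test_size=0.25, random_state=42)
print(len(train_rows), len(test_rows))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.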


We now have training and test sets for modeling and predicting the death rate of a country based on the health conditions of its population. Our initial question was whether the number of COVID-19 cases can be predicted from the health conditions and demography of a country. This will be assessed in a second step.