# Modeling COVID-19 fatality relative to the confirmed cases

### This notebook closely follows the one modeling the death rate. The difference lies in the data set: here COVID-19 fatality (deaths/confirmed cases) is predicted from demographic features only, so no COVID-19 data is used as a predictor.

### As shown below, the models do not perform well: the best test R2 is about 0.33 (tuned random forest). I therefore focused my analysis on the other modeling notebook and data set.

In this step we look for the model that best relates relative fatality to a country's demography. The goal is to predict at-risk countries that did not yet have COVID-19 cases in May; those predictions could then be evaluated against more recent COVID-19 data.

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score,mean_absolute_error
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
%matplotlib inline
os.getcwd()

'/Users/lisahw/Documents/Courses and Conferences/DataScience/MyProject/Capstone_02/Springboard/notebooks'

### Read in Train and Test data sets

In [27]:
data_file = 'sqlite:///../data/interim/COVID_fatal_train_test.db'
X_train = pd.read_sql('SELECT * FROM XTRAIN',data_file,index_col='index')
X_test = pd.read_sql('SELECT * FROM XTEST',data_file,index_col='index')
y_train = np.ravel(pd.read_sql('SELECT * FROM yTRAIN',data_file,index_col='index'))
y_test = np.ravel(pd.read_sql('SELECT * FROM yTEST',data_file,index_col='index'))
X_train.head()

index,Cardio Death Rate,Diabetes Percentage,Obesity,Undernourished,PopMale,Total Population,Cluster_1,Cluster_2
145,-0.188623,-0.89743,-1.282407,2.99043,-1.073046,-0.191597,0,0
42,-0.977732,-0.467855,0.08155,-0.228392,-0.211033,-0.19452,0,1
16,0.34418,0.287237,-1.612741,0.335731,-0.448477,0.679257,0,0
10,0.179408,-0.153011,0.859433,-0.56023,-0.321158,-0.040489,1,0
115,-0.036122,1.151723,0.561067,-0.410903,0.294547,-0.298068,1,0


In [5]:
X_test.columns

Index(['Cardio Death Rate', 'Diabetes Percentage', 'Obesity', 'Undernourished',
       'PopMale', 'Total Population', 'Cluster_1', 'Cluster_2'],
      dtype='object')

In [6]:
X_train.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
Cardio Death Rate,110.0,-0.032341,1.013118,-1.490183,-0.836498,-0.182965,0.541127,3.921871
Diabetes Percentage,110.0,0.058671,1.072147,-1.68454,-0.656628,-0.056957,0.513364,3.926619
Obesity,110.0,0.014224,1.02962,-1.666021,-1.090601,0.353276,0.763529,1.999616
Undernourished,110.0,0.007363,1.000471,-0.800812,-0.800812,-0.352831,0.331583,3.372043
PopMale,110.0,-0.038553,0.997058,-1.090112,-0.858592,-0.381903,0.653262,3.156894
Total Population,110.0,0.041129,1.142138,-0.298068,-0.271199,-0.230962,-0.07574,8.301432
Cluster_1,110.0,0.409091,0.493916,0.0,0.0,0.0,1.0,1.0
Cluster_2,110.0,0.245455,0.432326,0.0,0.0,0.0,0.0,1.0


The data has been standardized and the clusters identified. There are 3 clusters, encoded as dummy variables; one has been dropped to avoid perfect multicollinearity among the dummies.
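The preprocessing itself happens in an earlier notebook that is not shown here. As a rough sketch of the two steps described above, assuming a hypothetical toy frame `df` with a `Cluster` label column (the real column names and pipeline may differ), standardization and dummy encoding with one level dropped could look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the real features come from the demography data set.
df = pd.DataFrame({
    "Obesity": [10.0, 20.0, 30.0, 40.0],
    "PopMale": [48.0, 50.0, 52.0, 49.0],
    "Cluster": [0, 1, 2, 1],
})

numeric_cols = ["Obesity", "PopMale"]

# Standardize the numeric columns to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)

# One-hot encode the cluster label and drop the first level, so the
# remaining dummies are not perfectly collinear with an intercept.
dummies = pd.get_dummies(df["Cluster"], prefix="Cluster", drop_first=True).astype(int)

X = pd.concat([scaled, dummies], axis=1)
print(X.columns.tolist())
```

With three cluster labels this leaves two dummy columns, matching the `Cluster_1` and `Cluster_2` columns seen in `X_train`.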

## Linear Regression Model

In [7]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train,y_train)
print('R2 score/Coeff. of determination: ',linreg.score(X_test,y_test))
y_pred = linreg.predict(X_test)
print('Explained Variance: ',explained_variance_score(y_test,y_pred))
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('Mean absolute error: ',mean_absolute_error(y_test,y_pred))

R2 score/Coeff. of determination:  0.05497018859179903
Explained Variance:  0.07633640225217875
Mean square error:  0.7779026424003744
Mean absolute error:  0.6765600807271355


In [26]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
polyfeat = PolynomialFeatures(degree=2,include_bias=False)
linreg = LinearRegression()
pipeline = Pipeline([("polynomial_features", polyfeat),("linear_regression", linreg)])
pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)
print('Explained Variance: ',explained_variance_score(y_test,y_pred))
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('Mean absolute error: ',mean_absolute_error(y_test,y_pred))
print('R2 score/Coeff. of determination: ',pipeline.score(X_test,y_test))

Explained Variance:  -1.3791823180863063
Mean square error:  1.987074045023469
Mean absolute error:  0.9915424262703362
R2 score/Coeff. of determination:  -1.413983585694214
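
One plausible reason for the negative R2 above, offered as an interpretation rather than something this notebook verifies: a degree-2 expansion of the 8 predictors produces 44 features, which is a lot to fit with only 110 training rows. The feature count can be checked with a placeholder array of the same shape as `X_train`:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Placeholder array with the same shape as X_train (110 rows, 8 columns);
# the values are irrelevant for counting the generated features.
X_dummy = np.zeros((110, 8))

poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit(X_dummy)

# 8 linear terms + 8 squares + 28 pairwise interactions
n_poly = poly.n_output_features_
print(n_poly)
```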


## Decision Tree Regressor

In [8]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_train, y_train)
y_pred = tree_reg.predict(X_test)
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('R2 score/Coeff. of determination: ',r2_score(y_test,y_pred))

Mean square error:  0.7492417718432132
R2 score/Coeff. of determination:  0.08978865509533429


In [9]:
param_grid = {"ccp_alpha": [0.0, 0.5],
              "max_depth": [2, 3, None],
              "max_features": [3, 6],  # 9 dropped: X has only 8 features
              "min_samples_split": [2, 5],
              "min_samples_leaf": [1, 3]}
reg3 = DecisionTreeRegressor(random_state=42)
grid = GridSearchCV(estimator=reg3, param_grid=param_grid, n_jobs=-1,cv=2)
grid.fit(X_train, y_train)
print('Best score: ',grid.best_score_)
print('Best set of parameters: ' ,grid.best_params_)
y_pred = grid.predict(X_test)
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('R2 score/Coeff. of determination: ',r2_score(y_test,y_pred))

Best score:  -0.4046568299912895
Best set of parameters:  {'ccp_alpha': 0.0, 'max_depth': 2, 'max_features': 6, 'min_samples_leaf': 1, 'min_samples_split': 5}
Mean square error:  0.7261074247104945
R2 score/Coeff. of determination:  0.11789326165692582
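
Note that `best_score_` is the mean cross-validated score of the best candidate, and for a regressor `GridSearchCV` defaults to the estimator's own `score` method, i.e. R2. It is therefore not on the same scale as the test MSE, and a negative value means the best candidate did, on average, worse in cross-validation than a constant prediction at the validation-fold mean. A minimal sketch on synthetic stand-in data (all `demo` names are hypothetical and not part of this notebook):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; not the COVID data set.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=60)

demo_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [2, 3]},
    cv=2,
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the highest mean R2 across the CV folds.
mean_scores = demo_grid.cv_results_["mean_test_score"]
print(demo_grid.best_score_, mean_scores.max())
```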


In [140]:
tree_reg = DecisionTreeRegressor(max_depth=3, random_state=42, max_features=6,min_samples_leaf=3)
tree_reg.fit(X_train, y_train)
y_pred = tree_reg.predict(X_test)

## Random Forest Regressor

In [28]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(random_state=0, n_estimators=250)
# regressor = RandomForestRegressor(bootstrap= False,max_depth= 3, max_features= 6, min_samples_leaf= 3, min_samples_split=2,n_estimators= 250)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('R2 score/Coeff. of determination: ',r2_score(y_test,y_pred))

Mean square error:  0.5953416780648717
R2 score/Coeff. of determination:  0.27675315254229715


In [15]:
print(X_train.columns)
print(regressor.feature_importances_)

Index(['Cardio Death Rate', 'Diabetes Percentage', 'Obesity', 'Undernourished',
       'PopMale', 'Total Population', 'Cluster_1', 'Cluster_2'],
      dtype='object')
[0.17035396 0.19538908 0.12078505 0.21458881 0.17989542 0.11238417
 0.0052508  0.00135271]


In [29]:
param_grid = {"n_estimators": [150, 250],
              "max_depth": [3, None],
              "max_features": [3, 6],  # 9 dropped: X has only 8 features
              "min_samples_split": [2, 5],
              "min_samples_leaf": [1, 3],
              "bootstrap": [True, False]}
reg2 = RandomForestRegressor(random_state=0)
grid = GridSearchCV(estimator=reg2, param_grid=param_grid, n_jobs=-1,cv=2)
grid.fit(X_train, y_train)
print('Best score: ',grid.best_score_)
print('Best set of parameters: ' ,grid.best_params_)
y_pred = grid.predict(X_test)
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('R2 score/Coeff. of determination: ',r2_score(y_test,y_pred))

Best score:  -0.36751542472491716
Best set of parameters:  {'bootstrap': True, 'max_depth': None, 'max_features': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}
Mean square error:  0.5510218830436412
R2 score/Coeff. of determination:  0.3305947584807003


## Gradient Boosting Regressor

In [30]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(random_state=42)
# Fit the model on the training data.
gbr.fit(X_train, y_train)
# Predict on the test data and report regression metrics.
y_pred = gbr.predict(X_test)
print('Explained Variance: ',explained_variance_score(y_test,y_pred))
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('Mean absolute error: ',mean_absolute_error(y_test,y_pred))
print('R2 score/Coeff. of determination: ',r2_score(y_test,y_pred))

Explained Variance:  0.20910202617265017
Mean square error:  0.651076685124857
Mean absolute error:  0.6858489237094716
R2 score/Coeff. of determination:  0.20904385276642168


In [31]:
param_grid = {"n_estimators": [100, 150, 250],
              "subsample": [0.5, 0.75, 1],
              "max_depth": [3, None],
              "max_features": [3, 6],  # 9 dropped: X has only 8 features
              "min_samples_split": [2, 5],
              "min_samples_leaf": [1, 3]}
gbr = GradientBoostingRegressor(random_state=42)
grid = GridSearchCV(estimator=gbr, param_grid=param_grid, n_jobs=-1,cv=2)
grid.fit(X_train, y_train)
print('Best score: ',grid.best_score_)
print('Best set of parameters: ' ,grid.best_params_)
y_pred = grid.predict(X_test)
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('R2 score/Coeff. of determination: ',r2_score(y_test,y_pred))

Best score:  -0.3569188371563269
Best set of parameters:  {'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100, 'subsample': 0.5}
Mean square error:  0.6160685018032027
R2 score/Coeff. of determination:  0.2515733096405107


In [32]:
gbr = GradientBoostingRegressor(random_state=42,max_depth=3,max_features=6,n_estimators=100,subsample=0.5,min_samples_split=5)
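gbr.fit(X_train,y_train)
# Recompute predictions from the refit model so the metrics below refer to it
y_pred = gbr.predict(X_test)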
for col,imp in zip(X_train.columns,gbr.feature_importances_):
    print('{} : {:.1f}%'.format(col,100*imp))
print('Explained Variance: ',explained_variance_score(y_test,y_pred))
print('Mean square error: ',mean_squared_error(y_test,y_pred))
print('Mean absolute error: ',mean_absolute_error(y_test,y_pred))
print('R2 score/Coeff. of determination: ',r2_score(y_test,y_pred))

Cardio Death Rate : 18.7%
Diabetes Percentage : 13.7%
Obesity : 12.9%
Undernourished : 21.8%
PopMale : 16.2%
Total Population : 13.0%
Cluster_1 : 3.6%
Cluster_2 : 0.2%
Explained Variance:  0.253856249745411
Mean square error:  0.6160685018032027
Mean absolute error:  0.6681127624799257
R2 score/Coeff. of determination:  0.2515733096405107
