# Course 3 Assignment 4

In this assignment we had to fit a logistic regression model.

In [1]:
# import needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline

## Loading and preparing data

In [2]:
# show all columns
pd.set_option('display.max_columns', None)
# loading the data from the local file
df = pd.read_csv('data/covid_data.csv')

In [3]:
# prepare data
df.date = pd.to_datetime(df.date)
dfx = df.dropna(subset=['continent'])  # gets rid of summaries for 'world' and 'africa' etc, as I only want data for countries
# the columns I need for this task
cols = ['location', 'date', 'new_cases_per_million','new_deaths_per_million', 'people_fully_vaccinated', 'human_development_index', 'population', 'extreme_poverty']
dfx = dfx[cols].dropna()  # getting rid of rows with empty data
# getting rid of rows where new cases or new deaths are below zero (due to error correction)
dfx = dfx[dfx.new_deaths_per_million >= 0]
dfx = dfx[dfx.new_cases_per_million >= 0]
# limiting it to 2021 which is when vaccinations really got started
dfx = dfx[dfx['date'].dt.year == 2021]
# so as to compare like with like, I'm keeping only countries with human development indices over 0.9
dfx = dfx[dfx.human_development_index > 0.9]
# calculating percentage of population fully vaccinated
dfx['percentage_fully_vaccinated'] = (dfx.people_fully_vaccinated/dfx.population) * 100

# binning response variable into two equal-width bins: 0 = low, 1 = high new deaths per million
dfx['new_deaths_binned'] = pd.cut(dfx.new_deaths_per_million, 2, labels=[0, 1])
dfx.new_deaths_binned = pd.to_numeric(dfx.new_deaths_binned)
dfx.tail()

,location,date,new_cases_per_million,new_deaths_per_million,people_fully_vaccinated,human_development_index,population,extreme_poverty,percentage_fully_vaccinated,new_deaths_binned
81005,United States,2021-04-25,96.872,0.843,94772329.0,0.926,331002647.0,1.2,28.631895,0
81006,United States,2021-04-26,144.08,1.432,95888088.0,0.926,331002647.0,1.2,28.968979,0
81007,United States,2021-04-27,153.642,1.937,96747454.0,0.926,331002647.0,1.2,29.228604,0
81008,United States,2021-04-28,166.539,2.897,98044421.0,0.926,331002647.0,1.2,29.620434,0
81009,United States,2021-04-29,175.826,2.58,99668945.0,0.926,331002647.0,1.2,30.111223,0
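
A note on the binning step: `pd.cut` with an integer number of bins splits the observed range into equal-width intervals, so with skewed data most rows can end up in the low bin. The sketch below uses made-up values (not the COVID data) to show the difference from a median split with `pd.qcut`; it is an illustration only, not a change to the analysis above.

```python
import pandas as pd

# Hypothetical, skewed values standing in for new_deaths_per_million
deaths = pd.Series([0.0, 1.0, 2.0, 3.0, 10.0])

# pd.cut with an integer splits the observed range (0 to 10) into equal-width bins,
# so only values above the midpoint (5) are labelled 1
equal_width = pd.cut(deaths, 2, labels=[0, 1]).astype(int)

# pd.qcut splits at quantiles instead (here the median), giving more balanced classes
median_split = pd.qcut(deaths, 2, labels=[0, 1]).astype(int)

print(equal_width.tolist())
print(median_split.tolist())
# Class counts show how unbalanced the equal-width split is
print(equal_width.value_counts().to_dict())
```

Running `dfx.new_deaths_binned.value_counts()` on the real data would show how many rows fall in each bin.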


## Logistic regressions

In [4]:
# trying with just the main explanatory variable
reg = smf.logit(formula='new_deaths_binned ~ percentage_fully_vaccinated', data=dfx).fit()
reg.summary()

Optimization terminated successfully.
         Current function value: 0.118530
         Iterations 11


Dep. Variable:,new_deaths_binned,No. Observations:,1170.0
Model:,Logit,Df Residuals:,1168.0
Method:,MLE,Df Model:,1.0
Date:,"Tue, 04 May 2021",Pseudo R-squ.:,0.09835
Time:,15:37:56,Log-Likelihood:,-138.68
converged:,True,LL-Null:,-153.81
Covariance Type:,nonrobust,LLR p-value:,3.791e-08

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.4697,0.237,-10.417,0.000,-2.934,-2.005
percentage_fully_vaccinated,-0.4015,0.107,-3.757,0.000,-0.611,-0.192


In [5]:
params = reg.params
conf = reg.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']  # OR == odds ratio
np.exp(conf)

,Lower CI,Upper CI,OR
Intercept,0.053163,0.134651,0.084607
percentage_fully_vaccinated,0.54281,0.825238,0.669289


Odds ratio < 1 (OR=0.67, 95% CI=0.54-0.83), so the association is negative.
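
To make the link between the coefficient table and the odds ratio explicit, here is a minimal sketch that exponentiates the reported log-odds values and turns them into predicted probabilities. The numbers are copied from the summary above; the vaccination levels in the loop are illustrative choices, not values taken from the data.

```python
import math

# Coefficients and 95% CI bounds copied from the summary table above (log-odds scale)
intercept = -2.4697
coef = -0.4015
ci_low, ci_high = -0.611, -0.192

# Exponentiating a log-odds coefficient gives the odds ratio per one-unit increase
odds_ratio = math.exp(coef)
or_ci = (math.exp(ci_low), math.exp(ci_high))
print(f"OR = {odds_ratio:.3f}, 95% CI = {or_ci[0]:.3f}-{or_ci[1]:.3f}")


def predicted_probability(pct_vaccinated):
    """Probability of the 'high deaths' bin under the single-predictor model."""
    log_odds = intercept + coef * pct_vaccinated
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative vaccination levels (chosen for the example, not taken from the data)
for pct in (0, 5, 10):
    print(pct, round(predicted_probability(pct), 4))
```

An odds ratio below one means the predicted probability of the high-deaths bin falls as the percentage fully vaccinated rises.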

In [6]:
# let's just throw them in, if it gets silly I will try them separately
final = smf.logit(formula='new_deaths_binned ~ percentage_fully_vaccinated + human_development_index + new_cases_per_million + extreme_poverty', data=dfx).fit()
final.summary()

Optimization terminated successfully.
         Current function value: 0.093849
         Iterations 11


Dep. Variable:,new_deaths_binned,No. Observations:,1170.0
Model:,Logit,Df Residuals:,1165.0
Method:,MLE,Df Model:,4.0
Date:,"Tue, 04 May 2021",Pseudo R-squ.:,0.2861
Time:,15:37:56,Log-Likelihood:,-109.8
converged:,True,LL-Null:,-153.81
Covariance Type:,nonrobust,LLR p-value:,3.491e-18

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,30.0031,18.068,1.661,0.097,-5.410,65.416
percentage_fully_vaccinated,-0.3191,0.102,-3.122,0.002,-0.519,-0.119
human_development_index,-35.4746,19.364,-1.832,0.067,-73.428,2.479
new_cases_per_million,0.0038,0.001,6.165,0.000,0.003,0.005
extreme_poverty,-2.4556,0.867,-2.831,0.005,-4.156,-0.756


HDI is not significant (p=0.067); all the rest are.

In [7]:
# final model
final = smf.logit(formula='new_deaths_binned ~ percentage_fully_vaccinated + new_cases_per_million + extreme_poverty', data=dfx).fit()
final.summary()

Optimization terminated successfully.
         Current function value: 0.095368
         Iterations 11


Dep. Variable:,new_deaths_binned,No. Observations:,1170.0
Model:,Logit,Df Residuals:,1166.0
Method:,MLE,Df Model:,3.0
Date:,"Tue, 04 May 2021",Pseudo R-squ.:,0.2745
Time:,15:37:56,Log-Likelihood:,-111.58
converged:,True,LL-Null:,-153.81
Covariance Type:,nonrobust,LLR p-value:,3.404e-18

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.1698,0.400,-7.924,0.000,-3.954,-2.386
percentage_fully_vaccinated,-0.3573,0.104,-3.443,0.001,-0.561,-0.154
new_cases_per_million,0.0041,0.001,7.033,0.000,0.003,0.005
extreme_poverty,-1.8592,0.810,-2.295,0.022,-3.447,-0.271
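
The decision to drop HDI can be cross-checked with a likelihood-ratio test on the two log-likelihoods reported above (-109.8 with HDI, -111.58 without). This is a rough sketch using the rounded values from the summaries, so the result is approximate; it uses only the standard library, relying on the fact that a chi-square tail with one degree of freedom can be written with `math.erfc`.

```python
import math

# Log-likelihoods copied (rounded) from the two summaries above
ll_with_hdi = -109.8      # four predictors, including human_development_index
ll_without_hdi = -111.58  # three predictors

# Likelihood-ratio statistic; the models differ by one parameter, so df = 1
lr_stat = 2 * (ll_with_hdi - ll_without_hdi)

# For 1 degree of freedom the chi-square upper tail equals erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(f"LR = {lr_stat:.2f}, p = {p_value:.3f}")
```

A p-value above 0.05 here would agree with the Wald test in the coefficient table that HDI does not significantly improve the fit.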


In [8]:
params = final.params
conf = final.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']  # OR == odds ratio
np.exp(conf)

,Lower CI,Upper CI,OR
Intercept,0.019179,0.092018,0.04201
percentage_fully_vaccinated,0.570828,0.857338,0.699566
new_cases_per_million,1.002995,1.005316,1.004155
extreme_poverty,0.031829,0.762558,0.155794


## Summary

For this task I split my response variable, new deaths per million, into two equal-width bins: low (0) and high (1).

I used the following variables in my logistic regression model to try to predict whether new COVID-19 related deaths per million people were low or high:
- percentage of the population fully vaccinated (main hypothesis variable)
- new COVID-19 cases per million
- Human Development Index (HDI)
- percentage of the population in extreme poverty

HDI was not statistically significant, so I removed it from the model. All of the other variables were significant, and none appeared to confound the relationship between percentage vaccinated and new deaths, which remained significant throughout.

When controlling for the other variables named, new cases per million was significantly positively associated with high new deaths, although the odds ratio per additional case per million was barely over one (OR=1.004, 95% CI=1.003-1.005, p<0.001). Percentage fully vaccinated (OR=0.70, 95% CI=0.57-0.86, p=0.001) and extreme poverty (OR=0.16, 95% CI=0.03-0.76, p=0.022) are both significantly negatively associated with new deaths per million, such that an increase in either measure is associated with a reduced likelihood of high new deaths. The confidence interval for extreme poverty is very wide, and the result is confusing as all other tests so far have shown an increase in new deaths with an increase in extreme poverty.
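
One caveat when reading the odds ratio for new cases per million: it describes a change of a single case per million, which is a tiny step on that variable's scale. The sketch below rescales the reported coefficient (0.0041) to larger increments; the increments are arbitrary examples, and the helper name is mine.

```python
import math

coef_per_case = 0.0041  # new_cases_per_million coefficient from the final model


def odds_ratio_for_increase(k):
    """Odds ratio for a k-unit increase: exp(k * coefficient)."""
    return math.exp(k * coef_per_case)


# Illustrative increments in cases per million
for k in (1, 100, 500):
    print(k, round(odds_ratio_for_increase(k), 3))
```

An odds ratio close to one per unit can therefore still correspond to a sizeable change in odds across the range the variable actually takes.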

I note that Python has been giving me warnings about possible complete quasi-separation throughout these models, meaning that a fraction of the observations can be perfectly predicted from the predictors, in which case some parameters will not be identified. So I would not say this model is exactly reliable!
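
As a rough way to see what separation looks like, the sketch below checks whether a single predictor's values overlap between the two outcome classes. The function name and the toy data are hypothetical, and this is only a crude one-variable check: separation can also arise from combinations of predictors, which this would not detect.

```python
def ranges_overlap(values, labels):
    """Return True if the predictor ranges of class 0 and class 1 overlap."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return max(zeros) >= min(ones) and max(ones) >= min(zeros)


# Hypothetical toy data: every 'high deaths' row has more cases than any 'low' row
cases = [10, 20, 30, 200, 250]
high_deaths = [0, 0, 0, 1, 1]
separated = not ranges_overlap(cases, high_deaths)
print(separated)

# Adding one low-deaths row with many cases restores the overlap
still_separated = not ranges_overlap(cases + [220], high_deaths + [0])
print(still_separated)
```

When a predictor separates the classes like this, the maximum-likelihood estimate for its coefficient is not well defined, which is why the fitted values should be treated with caution.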

But it does support my original hypothesis, which is that an increase in the percentage of the population fully vaccinated is associated with a decrease in the number of new deaths per million.