The purpose of this notebook is to explore and replicate the findings reported in this (non-peer-reviewed) paper:
https://www.medrxiv.org/content/10.1101/2020.03.24.20042937v1.full.pdf

It is titled: Correlation between universal BCG vaccination policy and reduced morbidity and mortality for COVID-19: an epidemiological study

They also give you access to the data used to produce the results here:
https://www.medrxiv.org/highwire/filestream/74536/field_highwire_adjunct_files/0/2020.03.24.20042937-1.xls

In [None]:
import numpy as np
import pylab as pl
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
from scipy.stats import spearmanr
import os

Let us read in the data and explore their headers:

In [None]:
train = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-3/train.csv")
BCG_policy = pd.read_csv("/kaggle/input/bcg-data-otazu/BCG_Data.csv")
train.head(2)

In [None]:
BCG_policy.head(2)

Rename the columns, in particular Country to Country_Region, so that the key matches the case data:

In [None]:
BCG_policy.columns = ('Country_Region','IncomeLevel','Population','Total COVID-19 Tests Administered (lower bound)','COVID-19 cases','COVID-19 Deaths','case mortality','BCG % coverage','BCG strain','BCG_policy')
BCG_policy.head(2)

OK here we need to bring in the case fatality rate, and in order to do so we need to aggregate by country:

In [None]:
country_date = pd.DataFrame(train.groupby(['Country_Region', 'Date'],as_index=False).agg({'ConfirmedCases':['sum'],'Fatalities':['sum']}))
country_date[country_date['Country_Region'] == 'China'].head(5)

And so now we can calculate the column:

CaseFatalityRate (cfr, as a fraction; multiply by 100 for %) = TotalFatalities / TotalConfirmedCases

In [None]:
country_date.columns = ('Country_Region','Date','ConfirmedCases','Fatalities')
# ConfirmedCases and Fatalities are cumulative, so the per-country maximum is the latest total
country_aggs = country_date.groupby('Country_Region', as_index=False).agg({'ConfirmedCases':'max','Fatalities':'max'})
country_aggs.columns = ('Country_Region','TotalConfirmedCases','TotalFatalities')
country_aggs['CaseFatalityRate'] = country_aggs['TotalFatalities']/country_aggs['TotalConfirmedCases']
country_aggs.head(2)
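
To make the aggregation concrete, here is a minimal sketch on a made-up two-country frame. The country labels and counts are invented for illustration and are not taken from train.csv. It assumes, as the code above does, that ConfirmedCases and Fatalities are cumulative, so the per-country maximum is the latest total:

```python
import pandas as pd

# Hypothetical toy data: cumulative counts per country and date (invented values)
toy = pd.DataFrame({
    'Country_Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': ['2020-03-01', '2020-03-02', '2020-03-03'] * 2,
    'ConfirmedCases': [10, 50, 100, 0, 20, 200],
    'Fatalities': [0, 1, 5, 0, 0, 2],
})

# Latest cumulative total per country = maximum over dates
toy_aggs = toy.groupby('Country_Region', as_index=False).agg(
    TotalConfirmedCases=('ConfirmedCases', 'max'),
    TotalFatalities=('Fatalities', 'max'),
)

# Case fatality rate as a fraction (multiply by 100 for a percentage)
toy_aggs['CaseFatalityRate'] = toy_aggs['TotalFatalities'] / toy_aggs['TotalConfirmedCases']
print(toy_aggs)
```

Note that a country with zero confirmed cases would get a missing rate (0/0), which is worth dropping before any statistical test.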

Now we can left join the BCG policy data onto the case dataset:

In [None]:
combined = pd.merge(country_aggs,
                 BCG_policy,
                 on='Country_Region', 
                 how='left')
# countries without a BCG record, or recorded as '--', are coded 0 (unknown)
combined['BCG_policy'] = combined['BCG_policy'].fillna('0').map({'--':0,'0':0,'1':1,'2':2,'3':3})
#pd.set_option('display.max_rows', 200)
combined.head(2)
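
A left join silently keeps countries whose names do not match between the two files, and those end up in the "unknown" group. A minimal sketch of how one might check for that, using the `indicator` option of `pd.merge`. The two small frames and their values are placeholders invented for illustration; on the real data the same call would be applied to country_aggs and BCG_policy:

```python
import pandas as pd

# Hypothetical toy frames (placeholder values) standing in for country_aggs and BCG_policy
cases = pd.DataFrame({'Country_Region': ['Italy', 'Japan', 'Korea, South'],
                      'CaseFatalityRate': [0.10, 0.03, 0.02]})
policy = pd.DataFrame({'Country_Region': ['Italy', 'Japan', 'South Korea'],
                       'BCG_policy': ['3', '1', '1']})

# indicator=True adds a _merge column showing whether each row found a match
checked = pd.merge(cases, policy, on='Country_Region', how='left', indicator=True)

# Rows flagged 'left_only' had no BCG record, e.g. because of a spelling mismatch
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country_Region'].tolist()
print(unmatched)
```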

OK so now let's explore, visually, the case fatality rate by BCG policy, where 0 = unknown, 1 = universal BCG policy, 2 = used to have BCG policy and 3 = never had a universal BCG policy:

In [None]:
ax = sns.boxplot(x="BCG_policy", y="CaseFatalityRate", data=combined)
ax = sns.stripplot(x="BCG_policy", y="CaseFatalityRate", color='black',alpha=0.3,data=combined)
ax.set_ylim([0, 0.125]) 

Right off the bat, BCG policy = 3 (never having had a universal policy) is visibly higher in terms of case fatality rate. There are only 5 countries in that category though:

In [None]:
combined[combined['BCG_policy'] == 3].head(10)

What we can also do is run the same graph by IncomeLevel. According to the paper: "The mortality rate might be influenced by multiple factors including a country’s standard of medical care. In order to account for that, we classified countries according to their GNI per capita in 2018 using the World Bank data (https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups). Countries were divided in three categories: low income (L) with an annual income of 1,025 dollars or less, lower middle income with an income between 1,026 and 3,995 dollars, and middle high and high income countries, which included countries with annual incomes over 3,996 dollars."

In [None]:
ax = sns.boxplot(x="IncomeLevel", y="CaseFatalityRate", data=combined)
ax = sns.stripplot(x="IncomeLevel", y="CaseFatalityRate", color='black',alpha=0.3,data=combined)
ax.set_ylim([0, 0.125]) 

If you look at the medians, there appears to be a slight, roughly linear relationship between case fatality rate and income level.
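
One way to put a number on that impression is a rank correlation, since IncomeLevel is an ordinal variable. Below is a minimal sketch using spearmanr (imported at the top of the notebook) on invented values. On the real data one would pass the IncomeLevel and CaseFatalityRate columns of combined, after dropping rows where either is missing:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical toy data (invented values), one row per country
toy = pd.DataFrame({
    'IncomeLevel': [1, 1, 2, 2, 3, 3, 4, 4],
    'CaseFatalityRate': [0.010, 0.020, 0.015, 0.030, 0.025, 0.040, 0.035, 0.060],
})

# Spearman's rho works on ranks, so it only assumes a monotonic relationship
rho, p_value = spearmanr(toy['IncomeLevel'], toy['CaseFatalityRate'])
print('rho=%.3f, p=%.3f' % (rho, p_value))
```

A rho near 0 would suggest that the pattern in the medians is not a consistent trend across countries.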

Let us now bring both factors into the boxplot graph:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ax = sns.boxplot(x="IncomeLevel", y="CaseFatalityRate",hue="BCG_policy", data=combined)
# dodge=True places the points over their own hue-level box
ax = sns.stripplot(x="IncomeLevel", y="CaseFatalityRate", hue="BCG_policy", dodge=True, color='black',alpha=0.3,data=combined)
ax.set_ylim([0, 0.15])

As in the paper, the combination of high income and never having had a universal policy stands out again. If we combine income levels 1 and 2 (LowerLowerMiddle), combine levels 3 and 4 (MiddleHighHigh), and keep only countries with at least 1,000 confirmed cases, then we see the following:

In [None]:
combined['IncomeLevel2'] = combined['IncomeLevel'].apply({1:'LowerLowerMiddle',2:'LowerLowerMiddle',3: 'MiddleHighHigh', 4: 'MiddleHighHigh'}.get)
# keep only countries with at least 1000 confirmed cases
large = combined[combined['TotalConfirmedCases'] >= 1000]
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
ax = sns.boxplot(x="BCG_policy", y="CaseFatalityRate",hue="IncomeLevel2", data=large)
# dodge=True places the points over their own hue-level box
ax = sns.stripplot(x="BCG_policy", y="CaseFatalityRate", hue="IncomeLevel2", dodge=True, color='black',alpha=0.3,data=large)
ax.set_ylim([0, 0.15])

Can we replicate the paper's p = 0.006 for the comparison between income levels 3 and 4 with policy 1 and income levels 3 and 4 with policy 3?
Indeed we can. I used scipy's mannwhitneyu, which is the same test as the Wilcoxon rank-sum test reported in the paper (the two names are equivalent) and does not require the samples to be of equal size:

In [None]:
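from scipy.stats import mannwhitneyu
# middle-high and high income countries (IncomeLevel 3 or 4), split by BCG policy
high_income = combined['IncomeLevel'].astype(float) >= 3
# drop missing rates (e.g. countries with zero confirmed cases) so they cannot propagate NaN
data1 = combined[high_income & (combined['BCG_policy'].astype(float) == 3)]["CaseFatalityRate"].dropna()
data2 = combined[high_income & (combined['BCG_policy'].astype(float) == 1)]["CaseFatalityRate"].dropna()
# compare samples; state the alternative explicitly because the default has changed across scipy versions
stat, p = mannwhitneyu(data1, data2, alternative='two-sided')
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')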
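
As a sanity check on the choice of test, the sketch below compares mannwhitneyu with scipy's ranksums (the Wilcoxon rank-sum test) on two invented samples of unequal size. It assumes scipy 1.7 or later for the `method` argument, and it switches off the continuity correction so that both functions use the same normal approximation. The sample values are hypothetical and contain no ties:

```python
import numpy as np
from scipy.stats import mannwhitneyu, ranksums

# Hypothetical samples of unequal size with no tied values (invented numbers)
group_a = np.array([0.091, 0.072, 0.064, 0.058, 0.049])
group_b = np.array([0.012, 0.021, 0.033, 0.027, 0.041, 0.018, 0.036, 0.055])

# Mann-Whitney U: normal approximation without continuity correction, two-sided
u_stat, p_mwu = mannwhitneyu(group_a, group_b, alternative='two-sided',
                             method='asymptotic', use_continuity=False)

# Wilcoxon rank-sum test: same ranks, same normal approximation
z_stat, p_rs = ranksums(group_a, group_b)

print('Mann-Whitney p=%.4f, rank-sum p=%.4f' % (p_mwu, p_rs))
```

With default settings the two p-values can differ slightly, because mannwhitneyu may use an exact method or a continuity correction for small samples.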

Now just a note, the sample size = 5 on the MiddleHighHigh and policy = 3 combo. Should we be worried?

When it comes to sample size estimation we need to tread carefully. The effect size is small here, which means we need decent samples in both sets of data. And by decent I mean we can calculate how many observations should be in each group.

Let us say that we hypothesize that the fatality rate is higher in countries that never had a universal BCG policy than in countries that have one. A hypothesis test will compare the fatality rates of the two groups. If we deem that a 3-fold increase would be clinically meaningful (from 3%, the average in countries with a universal policy and IncomeLevel = 3 or 4, to 9%, Italy for instance), then how many countries should be included in the study to ensure that the test has 80% power to detect this difference between the groups? A two-sided test will be used with a 5% level of significance.

We first compute the effect size by substituting the proportions of cases in each group who are expected to pass away from the disease, p1=0.09 (i.e., 0.03*3.0=0.09) and p2=0.03 and the overall proportion, p=0.06 (i.e., (0.09+0.03)/2):

In [None]:
from math import sqrt
r=0.03
p1=r*3
p2=r
p=(p1+p2)/2.0
ES = (p1-p2)/sqrt(p*(1-p))
ES

So now that we have calculated the effect size, let us go ahead and calculate the sample size required in each group:

In [None]:
# estimate sample size via power analysis
from statsmodels.stats.power import TTestIndPower
# parameters for power analysis
effect = ES
alpha = 0.05
power = 0.8
# perform power analysis
analysis = TTestIndPower()
result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)
print('Sample Size: %.3f' % result)

It would seem that we need roughly 246 observations in each group. We have far fewer, so the test has low statistical power. Statistical power, or the power of a hypothesis test, is the probability that the test correctly rejects the null hypothesis when it is in fact false. Learn more about sample size estimation here:
http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Power/BS704_Power_print.html
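
The sample size can also be cross-checked by hand with the usual normal-approximation formula n = 2 * ((z_alpha + z_beta) / ES)^2 per group. The sketch below does that with scipy and then estimates, under the same approximation, the power one would have with only 5 countries per group. Treating both groups as having 5 countries is a simplifying assumption made here for illustration; in the data above only the policy = 3 group is that small:

```python
from math import sqrt, ceil
from scipy.stats import norm

# Same inputs as in the effect size calculation above
p1, p2 = 0.09, 0.03
p = (p1 + p2) / 2.0
ES = (p1 - p2) / sqrt(p * (1 - p))

alpha, power = 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)   # about 1.96 for a two-sided 5% test
z_beta = norm.ppf(power)            # about 0.84 for 80% power

# Normal-approximation sample size per group
n_per_group = 2 * ((z_alpha + z_beta) / ES) ** 2
print('Sample size per group: %d' % ceil(n_per_group))

# Approximate power if each group had only 5 countries (equal sizes are an assumption)
n_small = 5
approx_power = norm.cdf(ES * sqrt(n_small / 2) - z_alpha)
print('Approximate power with n=5 per group: %.3f' % approx_power)
```

The t-test based solver used above gives a slightly larger number than the normal approximation, which is expected for this kind of calculation.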
