# COVID-19 HYPOTHESIS TESTING
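Hypothesis testing uses sample data to assess whether there is enough evidence to reject a null hypothesis in favor of an alternative hypothesis. It does not give the probability that a hypothesis is true.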

## Table of Contents
1. Import Libraries   
2. Load Data
3. Normality Test
4. Test Of Equal Variance
5. Mood's Median Test

### 1. Import Libraries

In [2]:
import statsmodels
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats import anderson
import matplotlib.pyplot as plt

### 2. Load Data

In [3]:
# Read the aggregated COVID-19 CSV file for each continent into its own pandas DataFrame.
asi = pd.read_csv("C:\\Users\\dkowu\\Desktop\\Asia Cases.csv")
af = pd.read_csv("C:\\Users\\dkowu\\Desktop\\Africa Cases.csv")
eu = pd.read_csv("C:\\Users\\dkowu\\Desktop\\Europe Cases.csv")
na = pd.read_csv("C:\\Users\\dkowu\\Desktop\\North America Cases.csv")
sa = pd.read_csv("C:\\Users\\dkowu\\Desktop\\South America Cases.csv")
oc = pd.read_csv("C:\\Users\\dkowu\\Desktop\\Oceania Cases.csv")
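
The absolute paths above only work on one machine. As a minimal sketch of a more portable variant, assuming all six CSV files sit in a single folder, the files could be loaded in a loop. The names `CONTINENT_FILES`, `load_continent_frames` and `data_dir` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from the short names used in this notebook to the CSV file names.
CONTINENT_FILES = {
    "asi": "Asia Cases.csv",
    "af": "Africa Cases.csv",
    "eu": "Europe Cases.csv",
    "na": "North America Cases.csv",
    "sa": "South America Cases.csv",
    "oc": "Oceania Cases.csv",
}


def load_continent_frames(data_dir):
    """Read each continent CSV from data_dir into a dict of DataFrames."""
    data_dir = Path(data_dir)
    return {key: pd.read_csv(data_dir / name) for key, name in CONTINENT_FILES.items()}
```

With this helper, `frames = load_continent_frames(some_folder)` would give `frames["asi"]`, `frames["af"]` and so on in place of the six separate variables.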

### 3. Normality Test

Many statistical models assume that the data are normally distributed, hence the need to test for normality.
The normality test is a hypothesis test. The null hypothesis (Ho) is that the data follow a normal distribution. The alternative hypothesis (Ha) is that the data do not follow a normal distribution. Most normality tests report a p-value, and the decision to reject or not reject the null is based on it.

For this specific normality test, the Anderson-Darling method is used. As used here, SciPy's anderson returns a test statistic together with critical values at fixed significance levels rather than a p-value: the null is rejected at a given level when the statistic exceeds the corresponding critical value. The syntax is anderson(arr, dist='norm') where:
* arr: an array of sample data
* dist: the distribution to test against. The default is 'norm'; other options include 'expon' and 'logistic'.

As shown below, each continent has a test statistic well above every critical value, which is sufficient evidence to conclude that none of the data sets is normally distributed.

In [4]:
# Run the Anderson-Darling normality test on daily new cases for each continent.
samples = {
    'Asia': asi['New_Asia_Cases'],
    'Africa': af['New_Africa_Cases'],
    'Europe': eu['New_Europe_Cases'],
    'Oceania': oc['New_Oceania_Cases'],
    'North America': na['New_NA_Cases'],
    'South America': sa['New_SA_Cases'],
}

for name, cases in samples.items():
    result = anderson(cases, dist='norm')
    print(name)
    print(f"A-D statistic: {result.statistic}")
    print(f"Critical values: {result.critical_values}")
    print(f"Significance levels: {result.significance_level}\n")

Asia
A-D statistic: 75.9097569265083
Critical values: [0.573 0.653 0.784 0.914 1.087]
Significance levels: [15.  10.   5.   2.5  1. ]

Africa
A-D statistic: 38.73001146631941
Critical values: [0.573 0.653 0.783 0.914 1.087]
Significance levels: [15.  10.   5.   2.5  1. ]

Europe
A-D statistic: 94.0577592669955
Critical values: [0.573 0.653 0.783 0.914 1.087]
Significance levels: [15.  10.   5.   2.5  1. ]

Oceania
A-D statistic: 178.67487813321804
Critical values: [0.573 0.653 0.783 0.914 1.087]
Significance levels: [15.  10.   5.   2.5  1. ]

North America
A-D statistic: 91.31806255644096
Critical values: [0.573 0.653 0.784 0.914 1.087]
Significance levels: [15.  10.   5.   2.5  1. ]

South America
A-D statistic: 45.182472018739645
Critical values: [0.573 0.653 0.784 0.914 1.087]
Significance levels: [15.  10.   5.   2.5  1. ]
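
The decision rule applied to the output above (reject normality at a level when the statistic exceeds that level's critical value) can be sketched as a small helper. This is an illustration only: `rejected_levels` is a hypothetical name, and the data are synthetic right-skewed values rather than the COVID-19 series.

```python
import numpy as np
from scipy.stats import anderson


def rejected_levels(sample):
    """Return the significance levels (in %) at which normality is rejected."""
    result = anderson(sample, dist='norm')
    # Normality is rejected at a level when the statistic exceeds that level's critical value.
    return [
        float(level)
        for crit, level in zip(result.critical_values, result.significance_level)
        if result.statistic > crit
    ]


# Synthetic, strongly right-skewed data (an assumption, for illustration only).
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1000.0, size=500)
print(rejected_levels(skewed))
```

An empty list would mean that normality is not rejected at any of the tabulated levels.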


### 4. Test Of Equal Variance

Like normality, equal variance is an assumption behind many statistical tests, hence the need to test for it.
Levene's test is used to verify this assumption. It was selected because it is less sensitive to departures from normality than alternatives such as Bartlett's test. The null hypothesis (Ho) is that the variances are equal across all samples/groups. The alternative hypothesis (Ha) is that at least one group's variance differs from the others. Typically, if the p-value is less than 0.05 we reject the null hypothesis; otherwise we fail to reject it.

Since the p-value for covid cases across the continents is less than 0.05, we have sufficient evidence to say that the variance in daily covid-19 cases is not equal across the 6 continents.
Similarly, for daily covid-19 deaths the p-value is less than 0.05, so we have sufficient evidence to say that the variances are not equal.


In [5]:
# Levene's test centered at the median for daily cases
stats.levene(
    asi["New_Asia_Cases"], af["New_Africa_Cases"], eu["New_Europe_Cases"],
    na["New_NA_Cases"], sa["New_SA_Cases"], oc["New_Oceania_Cases"],
    center='median',
)



LeveneResult(statistic=186.1240318355679, pvalue=2.712075070748478e-183)

In [6]:
# Levene's test centered at the median for daily deaths
stats.levene(
    asi["New_Asia_Deaths"], af["New_Africa_Deaths"], eu["New_Europe_Deaths"],
    na["New_NA_Deaths"], sa["New_SA_Deaths"], oc["New_Oceania_Deaths"],
    center='median',
)


LeveneResult(statistic=419.3862442478768, pvalue=0.0)
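
To make the 0.05 decision rule concrete, here is a minimal sketch on synthetic data, assuming two groups with the same mean but very different spread. The variable names and the data are illustrative and do not come from the COVID-19 files.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic groups (illustrative only): same mean, very different spread.
narrow = rng.normal(loc=0.0, scale=1.0, size=300)
wide = rng.normal(loc=0.0, scale=5.0, size=300)

alpha = 0.05  # conventional significance level
stat, p_value = stats.levene(narrow, wide, center='median')

if p_value < alpha:
    print("Reject Ho: the variances differ")
else:
    print("Fail to reject Ho: no evidence that the variances differ")
```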

### 5. Mood's Median Test

Mood's median test is a nonparametric test that compares the medians of two or more samples. It requires at least two samples, and the samples do not have to be the same length. The null hypothesis (Ho) is that the median is equal across all samples/groups. The alternative hypothesis (Ha) is that at least one group's median differs from the others.
In SciPy, median_test returns the test statistic, the p-value, the grand median (the median of all samples combined), and a 2 by n contingency table, where n is the number of groups. The first row of the table counts the values above the grand median in each group; the second row counts the values below it, with values equal to the grand median counted as below by default.
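
The layout of the returned contingency table is easier to see on a tiny example. The sketch below uses two made-up samples, chosen so that the grand median is 5 and each sample contains one value equal to it.

```python
from scipy import stats

# Two small, made-up samples (illustrative only).
a = [1, 2, 3, 4, 5]
b = [5, 6, 7, 8, 9]

# ties='below' is the default: values equal to the grand median count as "below".
stat, p, grand_median, table = stats.median_test(a, b)

print(grand_median)  # 5.0
print(table)
# [[0 4]   <- values strictly above the grand median in a and b
#  [5 1]]  <- values at or below the grand median in a and b
```

Each column of the table corresponds to one sample, in the order the samples were passed to the function.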

Typically, if the p-value is greater than 0.05, we fail to reject the null hypothesis.
From the results below, the p-values for both covid cases and deaths across the 6 continents are less than 0.05 (both are reported as 0.0, meaning they are too small to be represented as a floating-point number).
This is sufficient evidence to say that the median of both daily covid cases and daily deaths is not equal across the 6 continents.

In [7]:
# Mood's median test for daily cases
stat, p, med, tbl = stats.median_test(
    asi["New_Asia_Cases"], af["New_Africa_Cases"], eu["New_Europe_Cases"],
    na["New_NA_Cases"], sa["New_SA_Cases"], oc["New_Oceania_Cases"],
)
print(stat, p, med, tbl)

1913.2019660836645 0.0 41159.0 [[739  39 643 633 518 112]
 [167 833 245 276 391 772]]


In [8]:
# Mood's median test for daily deaths
stat, p, med, tbl = stats.median_test(
    asi["New_Asia_Deaths"], af["New_Africa_Deaths"], eu["New_Europe_Deaths"],
    na["New_NA_Deaths"], sa["New_SA_Deaths"], oc["New_Oceania_Deaths"],
)
print(stat, p, med, tbl)

2168.6613987132764 0.0 670.0 [[675  88 667 681 573   0]
 [231 784 221 228 336 884]]
