# ANOVA (Analysis of Variance):

#### ANOVA is a statistical technique used to compare means of three or more groups to determine if there are any statistically significant differences between them. It divides the total variability observed in a data set into different components attributable to different sources of variation.
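The partition of variability described above can be checked by hand. The sketch below is a minimal illustration on three made-up groups (the numbers are hypothetical and not taken from the Covid data): it splits the total sum of squares into a between-group and a within-group part, forms the F statistic, and compares it with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Three hypothetical groups (made-up values, for illustration only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total sum of squares: observations around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("SS between + SS within =", ss_between + ss_within, "| SS total =", ss_total)
print("F by hand =", f_stat, "| F from SciPy =", f_scipy)
```

A large F means the group means differ by more than the spread inside the groups would suggest.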



In [None]:
## Importing Libraries 

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


In [None]:
df=pd.read_csv('Covid_sample.csv')
df.head(4)

In [None]:
## Stats

import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
## One-way ANOVA

mod = ols("total_vaccinations ~ continent", data=df).fit()
# typ=2 requests Type II sums of squares
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

In [None]:
## Pairwise comparison
# Bonferroni-corrected pairwise t-tests; method='sidak' is an alternative correction
pair_t = mod.t_test_pairwise('continent', method='bonferroni')
pair_t.result_frame

In [None]:
## Tukey HSD test

import pingouin as pg

# first calculate the ANOVA table
aov = pg.anova(data=df, dv='total_vaccinations', between='continent', detailed=True)
print(aov)

# Tukey HSD pairwise comparisons
pt = pg.pairwise_tukey(data=df, dv='total_vaccinations', between='continent')
print(pt)

# Normal Distribution:

### A normal distribution, also known as Gaussian distribution, is a probability distribution that is symmetric around its mean, meaning most of the observations cluster around the central peak and fewer occur as you move away from the mean in either direction. Many statistical tests and methods assume that the data is normally distributed.
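The clustering around the mean can be seen numerically. The sketch below is a small illustration on simulated data (the seed, sample size, and variable names are arbitrary choices, not part of the notebook's dataset): it draws values from a standard normal distribution and measures how many fall within one and two standard deviations of the mean, and how many fall below the mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean, std = sample.mean(), sample.std()

# Share of observations close to the central peak
within_1 = np.mean(np.abs(sample - mean) <= 1 * std)
within_2 = np.mean(np.abs(sample - mean) <= 2 * std)

# Symmetry: about half of the observations lie on each side of the mean
below_mean = np.mean(sample < mean)

print(f"within 1 std: {within_1:.3f}")
print(f"within 2 std: {within_2:.3f}")
print(f"below the mean: {below_mean:.3f}")
```

For a normal distribution these shares are roughly 68%, 95%, and 50%.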

In [None]:
# Normal distribution: how to draw it

def pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    return 1 / (std * np.sqrt(2 * np.pi)) * np.exp(-(x - mean) ** 2 / (2 * std ** 2))

# Generate an evenly spaced array of x values
x = np.arange(-2, 2, 0.1)
# Evaluate the standard normal density (mean 0, std 1) on that grid
y = pdf(x)

# Plot the normal curve (bell curve, Gaussian distribution)
sns.set_theme()  # apply seaborn's default plot theme
plt.figure(figsize=(5, 5))

plt.plot(x, y, color='blue', linestyle='dashed')

plt.scatter(x, y, marker='o', s=25, color='red')
plt.show()

In [None]:
# Histogram check of a numeric column, with a density estimate overlaid
sns.histplot(df['total_vaccinations'], kde=True)


In [None]:
# Q-Q plot
# pip install statsmodels
from statsmodels.graphics.gofplots import qqplot

# Q-Q plot of a numeric column against the normal distribution
qqplot(df['total_vaccinations'].dropna(), line='s')
plt.show()

### Normality Tests

There are many statistical tests that quantify whether a sample of data looks as though it was drawn from a Gaussian distribution. Each test makes different assumptions and considers different aspects of the data. This section looks at three commonly used tests that you can apply to your own data samples:

- Shapiro-Wilk Test
- D’Agostino’s K^2 Test
- Anderson-Darling Test

For the tests that report a p-value, the result is read against a chosen significance level alpha:

- p <= alpha: reject H0; the sample does not look normal.
- p > alpha: fail to reject H0; the sample is consistent with a normal distribution.

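The decision rule above can be written as a small helper. The sketch below is only an illustration: the function name `normality_decision` and the simulated, deliberately skewed sample are assumptions made for this example, and alpha = 0.05 is a conventional choice rather than a requirement.

```python
import numpy as np
from scipy.stats import shapiro

def normality_decision(p, alpha=0.05):
    """Reject H0 (normality) when p <= alpha, otherwise fail to reject it."""
    if p <= alpha:
        return "reject H0 (not normal)"
    return "fail to reject H0 (consistent with normal)"

# Simulated right-skewed data, which should not pass as Gaussian
rng = np.random.default_rng(42)
skewed = rng.exponential(scale=1.0, size=500)

stat, p = shapiro(skewed)
decision = normality_decision(p)
print("W=%.3f, p=%.3g -> %s" % (stat, p, decision))
```

Failing to reject H0 does not prove normality; it only means the test found no evidence against it at the chosen alpha.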
#### 1. Shapiro-Wilk Test

The Shapiro-Wilk test, named for Samuel Shapiro and Martin Wilk, evaluates a data sample and quantifies how likely it is that the data was drawn from a Gaussian distribution.

In practice, the Shapiro-Wilk test is believed to be a reliable test of normality, although there is some suggestion that it is best suited to smaller samples of data, e.g. thousands of observations or fewer.

The `shapiro()` SciPy function calculates the Shapiro-Wilk test on a given dataset. It returns both the W-statistic calculated by the test and the p-value.

Assumptions

- Observations in each sample are independent and identically distributed.

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python code is here:

In [None]:
# Shapiro-Wilk test

# import library
from scipy.stats import shapiro

# drop missing values first so they do not affect the result
stat, p = shapiro(df['total_vaccinations'].dropna())
print('stat=%.3f, p=%.3f' % (stat, p))

# make a conditional statement for further use
if p > 0.05:
    print('Probably Gaussian (normal) distribution')
else:
    print('Probably not a Gaussian (normal) distribution')

#### 2. D’Agostino’s K^2 Test

D’Agostino’s K^2 test, named for Ralph D’Agostino, calculates summary statistics from the data, namely kurtosis and skewness, to determine whether the data distribution departs from the normal distribution. It is a simple and commonly used statistical test for normality.

- Skew is a quantification of how much a distribution is pushed left or right, a measure of asymmetry in the distribution.
- Kurtosis quantifies how much of the distribution is in the tails.

D’Agostino’s K^2 test is available via the `normaltest()` SciPy function, which returns the test statistic and the p-value.

Assumptions

- Observations in each sample are independent and identically distributed.

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python code is here:

In [None]:
# D’Agostino’s K^2 test

# import library
from scipy.stats import normaltest

# drop missing values first so they do not affect the result
stat, p = normaltest(df['total_vaccinations'].dropna())
print('stat=%.3f, p=%.3f' % (stat, p))

# make a conditional statement for further use
if p > 0.05:
    print('Probably Gaussian (normal) distribution')
else:
    print('Probably not a Gaussian (normal) distribution')

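The skewness and kurtosis that the K^2 test combines can also be inspected directly. The sketch below is a minimal illustration on simulated data (the two samples and their names are assumptions made for this example): it compares a symmetric sample with a right-skewed one using `scipy.stats.skew` and `scipy.stats.kurtosis`, then runs `normaltest()` on the skewed sample.

```python
import numpy as np
from scipy.stats import skew, kurtosis, normaltest

rng = np.random.default_rng(1)
symmetric = rng.normal(size=5000)          # drawn from a normal distribution
right_skewed = rng.exponential(size=5000)  # long tail on the right

for name, data in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    # kurtosis() returns excess kurtosis by default, so a normal sample is near 0
    print("%-12s skew=%.2f excess kurtosis=%.2f" % (name, skew(data), kurtosis(data)))

stat_skewed, p_skewed = normaltest(right_skewed)
print("K^2=%.1f, p=%.3g" % (stat_skewed, p_skewed))
```

Values of skew and excess kurtosis far from zero are what drive the K^2 statistic up and the p-value down.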
#### 3. Anderson-Darling Test

The Anderson-Darling test, named for Theodore Anderson and Donald Darling, is a statistical test that can be used to evaluate whether a data sample comes from one of several known probability distributions.

It can be used to check whether a data sample is normal. The test is a modified version of the Kolmogorov-Smirnov test, a nonparametric goodness-of-fit test.

A feature of the Anderson-Darling test is that it returns a list of critical values rather than a single p-value. This can provide the basis for a more thorough interpretation of the result.

The `anderson()` SciPy function implements the Anderson-Darling test. It takes as parameters the data sample and the name of the distribution to test it against. By default, the test checks against the Gaussian distribution (`dist='norm'`).

Assumptions

- Observations in each sample are independent and identically distributed.

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python code is here:

In [None]:
from scipy.stats import anderson

# select a numeric column (missing values dropped) to check for normality
result = anderson(df["total_vaccinations"].dropna())
print('stat=%.3f' % (result.statistic))
# compare the statistic with the critical value at each significance level
for sl, cv in zip(result.significance_level, result.critical_values):
    if result.statistic < cv:
        print('Probably Gaussian/Normal Distribution at the %.1f%% level' % (sl))
    else:
        print('Probably not Gaussian/Normal Distribution at the %.1f%% level' % (sl))

## Correlation

Correlation describes how strongly two variables in a dataset are related. Variables can be related for many reasons. For example:

1. One variable could cause or depend on the values of another variable.
2. One variable could be weakly associated with another variable.
3. Two variables could both depend on a third, unknown variable.

Types of correlation coefficient:

- Pearson's r
- Spearman's rho
- Kendall's tau

> Positive correlation: both variables change in the same direction.\
> Neutral correlation: no relationship in the change of the variables.\
> Negative correlation: variables change in opposite directions.

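The three coefficients can give different answers on the same data. The sketch below is a small illustration on made-up values (not the Covid data): `y` always increases with `x`, but not along a straight line, so the rank-based coefficients report a perfect relationship while Pearson's r does not.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical data: y grows with x, but along a curve rather than a line
x = np.arange(1, 11, dtype=float)
y = x ** 3

r, _ = pearsonr(x, y)      # strength of the linear relationship
rho, _ = spearmanr(x, y)   # rank-based, sensitive to any monotonic relationship
tau, _ = kendalltau(x, y)  # rank-based, counts concordant and discordant pairs

print("Pearson's r    = %.3f" % r)
print("Spearman's rho = %.3f" % rho)
print("Kendall's tau  = %.3f" % tau)
```

This is one reason to prefer Spearman's rho or Kendall's tau when the relationship is monotonic but clearly not linear.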
## Covariance 

- Variables can be related by a linear relationship, that is, a relationship that is consistently additive across the two data samples.
- This relationship between two variables can be summarized by a quantity called the covariance.
- The sign of the covariance indicates whether the two variables change in the same direction (positive) or in opposite directions (negative).
- The magnitude of the covariance is not easily interpreted, because it depends on the units of both variables. A covariance of zero indicates that the variables have no linear relationship; it does not by itself show that they are completely independent.

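A short numerical sketch may help with reading covariance values. It uses made-up numbers (not the Covid data) and shows three things: the sign of the covariance for a linear relationship, how dividing by both standard deviations turns it into Pearson's r, and a case where the covariance is zero even though one variable is fully determined by the other.

```python
import numpy as np

# Hypothetical values, symmetric around zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_linear = 3.0 * x + 1.0  # increases linearly with x
y_square = x ** 2         # depends on x, but not linearly

# np.cov returns the 2x2 covariance matrix; [0, 1] is cov(x, y)
cov_linear = np.cov(x, y_linear)[0, 1]
cov_square = np.cov(x, y_square)[0, 1]

# Dividing by both sample standard deviations removes the units
r_linear = cov_linear / (x.std(ddof=1) * y_linear.std(ddof=1))

print("cov(x, 3x + 1) =", cov_linear)
print("cov(x, x^2)    =", cov_square)
print("r(x, 3x + 1)   =", r_linear)
```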
In [None]:
df.head(4)

In [None]:
new_df = df[[ 'total_vaccinations',
       'people_partially_vaccinated', 'people_fully_vaccinated']].copy()


In [None]:
new_df.corr()

In [None]:
cor = new_df.corr(method='pearson')  # Pearson's r: suited to roughly normal data

In [None]:
cor1 = new_df.corr(method='spearman')  # Spearman's rho: rank-based, for non-Gaussian data

In [None]:
sns.regplot(x='people_partially_vaccinated',y='people_fully_vaccinated',data=new_df)

In [None]:
sns.regplot(x='total_vaccinations',y='people_partially_vaccinated',data=new_df)

In [None]:
corr=new_df.corr(method='pearson')
sns.heatmap(corr)

In [None]:
corr=new_df.corr(method='pearson')
sns.heatmap(corr,annot=True)

In [None]:
corr.style.background_gradient(cmap="coolwarm")

In [None]:
# pairwise scatter plots of the original variables
sns.pairplot(new_df)

# MANOVA

In [None]:
from statsmodels.multivariate.manova import MANOVA

MANOVA.from_formula("total_vaccinations+ people_partially_vaccinated+ people_fully_vaccinated~continent", data=df)

In [None]:
mova=MANOVA.from_formula("total_vaccinations+ people_partially_vaccinated+ people_fully_vaccinated~continent", data=df)
print(mova.mv_test())