# Statistics in Python

- How can we check whether our data follow a normal (Gaussian) distribution in Python?

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Normal Distribution and its tests
1. Import a dataset
2. Subset the dataset
3. Visual tests for normal distribution
   - Histogram
   - Q-Q norm plot
4. Statistical tests

In [None]:
# Import a dataset
kashti = sns.load_dataset('titanic')
kashti.head()

In [None]:
# Take a subset of columns
kashti = kashti[['sex', 'age', 'fare']]
kashti.head()

In [None]:
# histogram test
sns.histplot(kashti['age'])

In [None]:
%pip install statsmodels

In [None]:
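# Q-Q norm plot (requires statsmodels)
from statsmodels.graphics.gofplots import qqplot

# age has missing values; drop them before plotting.
# line='s' adds a standardized reference line for comparison.
qqplot(kashti['age'].dropna(), line='s')
plt.show()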
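
To make the Q-Q plot easier to read, it can help to look at the numbers behind it. The sketch below uses `scipy.stats.probplot` on two synthetic samples (not the Titanic data; the seed, sizes, and distribution parameters are illustrative assumptions). It reports the correlation `r` between the sample quantiles and the theoretical normal quantiles. Points that hug the reference line give an `r` close to 1, while a skewed sample bends away from the line and gives a lower `r`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Two synthetic samples: one drawn from a normal distribution, one skewed
normal_sample = rng.normal(loc=30, scale=14, size=500)
skewed_sample = rng.exponential(scale=30, size=500)

# probplot returns the Q-Q points and a least-squares fit (slope, intercept, r)
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

This is only a visual aid in numeric form; it is not a formal test of normality.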

# Normality Tests
There are many statistical tests that we can use to quantify whether a sample of data looks as though it was drawn from a Gaussian distribution. Each test makes different assumptions and considers different aspects of the data. We will look at 3 commonly used tests that you can apply to your own data samples.

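1. Shapiro-Wilk Test
2. D’Agostino’s K^2 Test
3. Anderson-Darling Test

For tests that return a p-value, compare it against a chosen significance level alpha (commonly 0.05):

- p <= alpha: reject H0; the sample does not look normal.
- p > alpha: fail to reject H0; the sample is consistent with a normal distribution.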
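
The decision rule for a p-value can be written as a small helper. This is a minimal sketch; the function name `interpret_p_value` and the default `alpha=0.05` are illustrative choices, not part of any library.

```python
def interpret_p_value(p, alpha=0.05):
    """Apply the rule: p <= alpha means reject H0 (hypothetical helper)."""
    if p <= alpha:
        return "reject H0: sample does not look Gaussian"
    return "fail to reject H0: sample is consistent with Gaussian"

print(interpret_p_value(0.003))  # small p-value
print(interpret_p_value(0.40))   # large p-value
```

Note that failing to reject H0 does not prove the data are normal; it only means the test found no strong evidence against normality.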

1. Shapiro-Wilk Test
The Shapiro-Wilk test, named for Samuel Shapiro and Martin Wilk, evaluates a data sample and quantifies how likely it is that the data were drawn from a Gaussian distribution.

In practice, the Shapiro-Wilk test is considered a reliable test of normality, although it is best suited to smaller samples of data, e.g. thousands of observations or fewer. The SciPy documentation notes that the p-value may not be accurate for more than 5000 observations.

The shapiro() SciPy function calculates the Shapiro-Wilk test on a given dataset. The function returns both the W-statistic calculated by the test and the p-value.

Assumptions

- Observations in each sample are independent and identically distributed.

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python code is here:

In [None]:
# Shapiro-Wilk test

# import library
from scipy.stats import shapiro

# age has missing values, which would invalidate the result; drop them first
stat, p = shapiro(kashti['age'].dropna())
print('stat=%.3f, p=%.3f' % (stat, p))

# interpret the p-value at alpha = 0.05
if p > 0.05:
    print('Probably Gaussian (normal) distribution')
else:
    print('Probably not Gaussian (normal) distribution')
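
As a sanity check on how shapiro() behaves, the sketch below runs it on two synthetic samples rather than the Titanic data. The seed and sample sizes are arbitrary assumptions. A clearly skewed sample should give a very small p-value, while a sample drawn from a normal distribution usually, though not always, gives a p-value above 0.05.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)  # fixed seed, for reproducibility only

normal_sample = rng.normal(size=200)       # drawn from a normal distribution
skewed_sample = rng.exponential(size=200)  # strongly right-skewed

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    stat, p = shapiro(sample)
    print(f"{name}: W={stat:.3f}, p={p:.3g}")
```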

2. D’Agostino’s K^2 Test
The D’Agostino’s K^2 test, named for Ralph D’Agostino, calculates summary statistics from the data, namely kurtosis and skewness, to determine whether the data distribution departs from the normal distribution. It is a simple and commonly used statistical test for normality.

- Skew quantifies how far a distribution is pushed left or right; it is a measure of asymmetry in the distribution.
- Kurtosis quantifies how much of the distribution is in the tails.

The D’Agostino’s K^2 test is available via the normaltest() SciPy function, which returns the test statistic and the p-value.

Assumptions

- Observations in each sample are independent and identically distributed.

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python code is here:

In [None]:
# D’Agostino’s K^2 test

# import library
from scipy.stats import normaltest

# age has missing values, which would make the result NaN; drop them first
stat, p = normaltest(kashti['age'].dropna())
print('stat=%.3f, p=%.3f' % (stat, p))

# interpret the p-value at alpha = 0.05
if p > 0.05:
    print('Probably Gaussian (normal) distribution')
else:
    print('Probably not Gaussian (normal) distribution')
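
To see what the K^2 statistic responds to, skewness and kurtosis can be computed directly. The sketch below uses a synthetic right-skewed sample (the seed and size are arbitrary assumptions). It prints the sample skewness and excess kurtosis, then checks that the statistic reported by normaltest() matches the sum of the squared z-scores from SciPy's skewtest() and kurtosistest().

```python
import numpy as np
from scipy.stats import skew, kurtosis, skewtest, kurtosistest, normaltest

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
sample = rng.exponential(size=500)  # synthetic, right-skewed data

# Both are close to 0 for normally distributed data
# (kurtosis() uses the Fisher definition, i.e. excess kurtosis, by default)
print(f"skewness        = {skew(sample):.3f}")
print(f"excess kurtosis = {kurtosis(sample):.3f}")

# K^2 combines the z-scores of the skewness and kurtosis tests
z_skew, _ = skewtest(sample)
z_kurt, _ = kurtosistest(sample)
k2, p = normaltest(sample)

print(f"z_skew^2 + z_kurt^2 = {z_skew**2 + z_kurt**2:.3f}")
print(f"normaltest K^2      = {k2:.3f}, p = {p:.3g}")
```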