<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

<a id="chisq"></a>
# 2. Chi-Square Test

The chi-square test is a non-parametric test. `Non-parametric tests` do not assume that the population from which the sample is drawn follows a particular parametric distribution. They can be applied to ordinal or nominal data, and they remain usable when the data contain outliers.

The chi-square test statistic follows a Chi-square ($\chi^{2}$) distribution under the null hypothesis. It can be used to check the relationship between the categorical variables. 

Let us calculate the right-tailed $\chi^{2}$ values for different levels of significance ($\alpha$).
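The right-tailed critical values mentioned above can be read off the inverse survival function of the chi-square distribution. The snippet below is a minimal sketch: the degrees of freedom (`dof = 3`) and the list of significance levels are illustrative assumptions, not values taken from this notebook.

```python
from scipy import stats

# illustrative assumptions: degrees of freedom and significance levels
dof = 3
alphas = [0.10, 0.05, 0.01]

# stats.chi2.isf(alpha, dof) returns the value whose right-tail area is 'alpha'
chi2_critical = {alpha: stats.chi2.isf(alpha, dof) for alpha in alphas}

for alpha, value in chi2_critical.items():
    print(f"alpha = {alpha}: right-tailed critical value = {value:.4f}")
```

A smaller $\alpha$ pushes the critical value further into the right tail, so a larger test statistic is needed to reject $H_{0}$.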

<a id="goodness"></a>
## 2.1 Chi-Square Test for Goodness of Fit

This test is used to compare the distribution of the categorical data with the expected distribution. 

<p style='text-indent:6em'> <strong> $H_{0}$: There is no significant difference between the observed frequencies and the frequencies expected under the claimed distribution</strong></p>
<p style='text-indent:6em'> <strong> $H_{1}$: There is a significant difference between the observed frequencies and the frequencies expected under the claimed distribution</strong></p>

### Example:



#### 1. At an emporium, the manager wants to know which age groups visit during the day so that he can plan his inventory accordingly. He defines the categories as children, teenagers, adults and senior citizens. He claims that, of all the visitors, 5% are children, 38% are teenagers, 2% are senior citizens and the remaining are adults. From a sample of 180 people, 25 were children, 50 were teenagers, 90 were adults and 15 were senior citizens. Test the manager’s claim at a 95% confidence level.
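One possible way to work this example is with `stats.chisquare`, sketched below. The adult proportion (55%) is derived as the remainder of the stated percentages; the variable names are illustrative. Note that the expected count for senior citizens is below 5, so the chi-square approximation should be read with some caution.

```python
import numpy as np
from scipy import stats

# observed counts: children, teenagers, adults, senior citizens
observed = np.array([25, 50, 90, 15])

# claimed proportions; adults take the remaining share (1 - 0.05 - 0.38 - 0.02)
claimed = np.array([0.05, 0.38, 0.55, 0.02])

# expected counts under H0 for a sample of 180 people
expected = claimed * observed.sum()

# chi-square goodness-of-fit test (degrees of freedom = categories - 1 = 3)
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print("test statistic:", chi2_stat)
print("p-value:", p_value)

# decision at the 5% level of significance
alpha = 0.05
if p_value < alpha:
    print("Reject H0: the observed frequencies differ from the claimed distribution")
else:
    print("Fail to reject H0: no evidence against the claimed distribution")
```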


### Practice:

1) In a school, the sports teacher wants to examine the proportion of
students participating in different sports. He expects all the sports
to have equal participation. After observation, he found the following:

cricket - 35%<br>
volleyball - 25%<br>
football - 20%<br>
basketball - 20%

Total number of students in the school - 200

Check the hypothesis at a 95% confidence level.

<a id="ind"></a>
## 2.2 Chi-Square Test for Independence

This test is used to test whether the categorical variables are independent or not.

<p style='text-indent:20em'> <strong> $H_{0}$: The variables are independent</strong></p>
<p style='text-indent:20em'> <strong> $H_{1}$: The variables are not independent (i.e. variables are dependent)</strong></p>

Consider a categorical variable `A` with `r` levels and variable `B` with `c` levels. Let us test the independence of variables A and B.

The test statistic is given as:
<p style='text-indent:25em'> <strong> $\chi^{2} = \sum_{i= 1}^{r}\sum_{j = 1}^{c}\frac{O_{ij}^{2}}{E_{ij}} - N$</strong></p>

Where, <br>
$O_{ij}$: Observed frequency for category (i,j) <br>
$E_{ij}$: Expected frequency for category (i,j)<br>
$N$: Total number of observations

Under $H_{0}$, the test statistic follows a chi-square distribution with $(r-1)(c-1)$ degrees of freedom.
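The statistic above can be computed by hand and compared with `stats.chi2_contingency`. The sketch below uses the gender-by-lunch counts that appear later in this section; the variable names are illustrative. It also checks that the shortcut form $\sum O_{ij}^{2}/E_{ij} - N$ agrees with the more familiar $\sum (O_{ij} - E_{ij})^{2}/E_{ij}$.

```python
import numpy as np
from scipy import stats

# observed counts: rows = gender (female, male), columns = lunch (free/reduced, standard)
observed = np.array([[188, 329],
                     [166, 317]])

n_total = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# expected frequency under independence: (row total * column total) / N
expected = row_totals * col_totals / n_total

# shortcut form of the statistic and the usual definition
chi2_shortcut = (observed ** 2 / expected).sum() - n_total
chi2_definition = ((observed - expected) ** 2 / expected).sum()

# degrees of freedom: (r - 1)(c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# reference values from scipy (no continuity correction)
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed, correction=False)

print(chi2_shortcut, chi2_definition, chi2_scipy, dof)
```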

### Example:

#### 1. Check if there is any relationship between the gender and education level of students with 95% confidence. 

Use the performance dataset of students available in the CSV file `students_data.csv`.

In [19]:
df = pd.read_csv('students_data.csv')
df.head()

Unnamed: 0,gender,ethnicity,education,lunch,test_prep_course,math_score,reading_score,writing_score,total_score,training_institute
0,female,group B,bachelor's degree,standard,none,89,55,56,200,Nature Learning
1,female,group C,college,standard,completed,55,63,72,190,Nature Learning
2,female,group B,master's degree,standard,none,64,71,56,191,Nature Learning
3,male,group A,associate's degree,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,college,standard,none,75,66,51,192,Nature Learning


### Practice:

In [7]:
# H0: gender and education are independent (no relationship)
# Ha: gender and education are dependent (a relationship exists)
obs_val = pd.crosstab(df["gender"], df["education"])
obs_val

gender \ education,Ph.D.,associate's degree,bachelor's degree,college,high school,master's degree
female,91,116,63,117,94,36
male,88,106,55,108,103,23


In [11]:
# store the degrees of freedom in 'dof' so that the DataFrame 'df' is not overwritten
chi2, pval, dof, exp_val = stats.chi2_contingency(obs_val, correction = False)

In [10]:
print(chi2,pval,dof,exp_val)

3.5267538812534243 0.6193433487137843 5 [[ 92.543 114.774  61.006 116.325 101.849  30.503]
 [ 86.457 107.226  56.994 108.675  95.151  28.497]]


In [None]:
# pval = 0.62
# sig lvl = 0.05
# pval > sig lvl
# fail to reject H0
# no evidence that gender and education are related
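The same decision can be reached by comparing the test statistic with the right-tailed critical value. The sketch below reuses the statistic and degrees of freedom printed above; the variable names are illustrative.

```python
from scipy import stats

# values taken from the chi2_contingency output above
chi2_stat = 3.5267538812534243
dof = 5
alpha = 0.05

# right-tailed critical value at the 5% level of significance
critical_value = stats.chi2.isf(alpha, dof)

# p-value is the right-tail area beyond the observed statistic
p_value = stats.chi2.sf(chi2_stat, dof)

# reject H0 only if the statistic falls in the critical region
reject_h0 = chi2_stat > critical_value
print(critical_value, p_value, reject_h0)
```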

#### Find the relationship between lunch vs gender

In [16]:
obs_val = pd.crosstab(df["gender"], df["lunch"])
obs_val

gender \ lunch,free/reduced,standard
female,188,329
male,166,317


In [17]:
# store the degrees of freedom in 'dof' so that the DataFrame 'df' is not overwritten
chi2, pval, dof, exp_val = stats.chi2_contingency(obs_val, correction = False)

In [18]:
print(chi2,pval,dof,exp_val)

0.43464430395171516 0.5097188141605202 1 [[183.018 333.982]
 [170.982 312.018]]


In [None]:
# pval = 0.51
# sig lvl = 0.05
# pval > sig lvl
# fail to reject H0
# no evidence that gender and lunch type are related

<a id="1way"></a>
# 3. One-way ANOVA

One-way ANOVA is used to test the equality of population means across more than two independent samples. Each group is considered a `treatment`. The test assumes that the samples are drawn from normally distributed populations, which can be checked with the `Shapiro-Wilk Test`. It also assumes that the population variances are equal, which can be checked with `Levene's Test`.

The null and alternative hypotheses are given as:
<p style='text-indent:20em'> <strong> $H_{0}$: The averages of all treatments are the same. </strong></p>
<p style='text-indent:20em'> <strong> $H_{1}$: At least one treatment has a different average. </strong></p>

Suppose there are `t` treatments and `N` total observations. The test statistic is given as:
<p style='text-indent:28em'> <strong> $F = \frac{MTrSS}{MESS} $</strong></p>

Where,<br>
MTrSS = $\frac{TrSS}{df_{Tr}}$<br>

TrSS = $\sum_{i=1}^{t}n_{i}{(\bar{x_{i}}. - \bar{x}..)}^{2}$<br> $n_{i}$ is the number of observations in the $i^{th}$ treatment. <br>$\bar{x_{i}}.$ is the mean of the $i^{th}$ treatment <br> $\bar{x}..$ is the grand mean (i.e. mean of all the observations). <br>

$df_{Tr}$ is the degrees of freedom for treatments (= $t-1$)

MESS = $\frac{ESS}{df_{e}}$<br>

ESS = $\sum_{i}^{t}\sum_{j}^{n_{i}}{(x_{ij} - \bar{x_{i}}.)}^{2}$

$df_{e}$ is the degrees of freedom for error (= $N-t$)

Under $H_{0}$, the test statistic follows F-distribution with ($t-1,  N-t$) degrees of freedom.
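To make the pieces of the F statistic concrete, the sketch below computes the treatment and error sums of squares by hand and compares the result with `stats.f_oneway`. It borrows the tensile-strength readings from Example 2 later in this section; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# tensile-strength readings for machines A, B, C and D (from Example 2)
groups = [
    np.array([68.7, 75.4, 70.9, 79.1, 78.2]),
    np.array([62.7, 68.5, 63.1, 62.2, 60.3]),
    np.array([55.9, 56.1, 57.3, 59.2, 50.1]),
    np.array([80.7, 70.3, 80.9, 85.4, 82.3]),
]

t = len(groups)                          # number of treatments
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# treatment sum of squares: n_i * (treatment mean - grand mean)^2, summed over treatments
trss = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# error sum of squares: squared deviations of observations from their treatment mean
ess = sum(((g - g.mean()) ** 2).sum() for g in groups)

# mean squares and the F statistic
mtrss = trss / (t - 1)
mess = ess / (n_total - t)
f_manual = mtrss / mess

# right-tail p-value from the F distribution with (t-1, N-t) degrees of freedom
p_manual = stats.f.sf(f_manual, t - 1, n_total - t)

# reference values from scipy
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_manual, p_scipy)
```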

Let us calculate the F values for different levels of significance ($\alpha$).
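The critical F values can be obtained from the inverse survival function of the F distribution. In the sketch below, the degrees of freedom (3 and 16, corresponding to four treatments with five observations each) and the significance levels are illustrative assumptions.

```python
from scipy import stats

# illustrative assumptions: t = 4 treatments, N = 20 observations
df_treatment = 4 - 1     # t - 1
df_error = 20 - 4        # N - t
alphas = [0.10, 0.05, 0.01]

# stats.f.isf(alpha, dfn, dfd) returns the value whose right-tail area is 'alpha'
f_critical = {alpha: stats.f.isf(alpha, df_treatment, df_error) for alpha in alphas}

for alpha, value in f_critical.items():
    print(f"alpha = {alpha}: critical F value = {value:.4f}")
```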

### Example:

#### 1. Total marks in aptitude exam are recorded for students with different race/ethnicity. Test whether all the races/ethnicities have an equal average score with 0.05 level of significance. 

Use the performance dataset of students available in the CSV file `students_data.csv`.

In [20]:
# read the students performance data 
df = pd.read_csv('students_data.csv')

# display the first five observations
df.head()

Unnamed: 0,gender,ethnicity,education,lunch,test_prep_course,math_score,reading_score,writing_score,total_score,training_institute
0,female,group B,bachelor's degree,standard,none,89,55,56,200,Nature Learning
1,female,group C,college,standard,completed,55,63,72,190,Nature Learning
2,female,group B,master's degree,standard,none,64,71,56,191,Nature Learning
3,male,group A,associate's degree,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,college,standard,none,75,66,51,192,Nature Learning


In [21]:
g1 = df[df['ethnicity']=='group A']['total_score']
g2 = df[df['ethnicity']=='group B']['total_score']
g3 = df[df['ethnicity']=='group C']['total_score']
g4 = df[df['ethnicity']=='group D']['total_score']
g5 = df[df['ethnicity']=='group E']['total_score']


In [22]:
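# Assumptions
# 1. Data is normal

# Shapiro-Wilk test
# H0: the sample comes from a normally distributed population
# Ha: the sample does not come from a normally distributed population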
stats.shapiro(g1)

ShapiroResult(statistic=0.9894436001777649, pvalue=0.6901752352714539)

In [None]:
# pval = 0.69
# sig lvl = 0.05
# pval > sig lvl
# fail to reject H0
# no evidence against normality for group A

In [26]:
stats.shapiro(g2)

ShapiroResult(statistic=0.9947066307067871, pvalue=0.7402700185775757)

In [25]:
stats.shapiro(g3)

ShapiroResult(statistic=0.9973903298377991, pvalue=0.8950209617614746)

In [24]:
stats.shapiro(g4)

ShapiroResult(statistic=0.9948431253433228, pvalue=0.5269628167152405)

In [23]:
stats.shapiro(g5)

ShapiroResult(statistic=0.991719126701355, pvalue=0.5859840512275696)

In [None]:
# all p-values > 0.05
# normality is not rejected for any group

In [None]:
# Assumption 2: equal variances
# Test of variance: Levene's test
# H0: all variances are equal
# Ha: at least one variance is not equal

In [27]:
stats.levene(g1,g2,g3,g4,g5)

LeveneResult(statistic=1.8006030590828939, pvalue=0.12649444001357793)

In [None]:
# pval = 0.13
# sig lvl = 0.05
# pval > sig lvl
# fail to reject H0
# the variances can be treated as equal

In [None]:
#Data is normal
#Variance is equal
#more than 2 samples so using ANOVA

In [None]:
# H0: all means are equal
# Ha: at least one mean is not equal

In [28]:
stats.f_oneway(g1,g2,g3,g4,g5)

F_onewayResult(statistic=0.789109595922189, pvalue=0.5322937031083035)

In [None]:
# pval = 0.53
# sig lvl = 0.05
# pval > sig lvl
# fail to reject H0
# no evidence that the mean total score differs across ethnic groups

#### 2. Ryan is a production manager at a company that manufactures alloy seals. They have 4 machines - A, B, C and D. Ryan wants to study whether all the machines have equal efficiency. He collects tensile strength data from all 4 machines as given. Test at the 5% level of significance.

<img src='1_ANOVA.png'>

In [30]:
a = [68.7, 75.4, 70.9, 79.1, 78.2]
b = [62.7, 68.5, 63.1, 62.2, 60.3]
c = [55.9, 56.1, 57.3, 59.2, 50.1]
d = [80.7, 70.3, 80.9, 85.4, 82.3]
stats.f_oneway(a,b,c,d)

F_onewayResult(statistic=32.03072350199285, pvalue=5.375613532781072e-07)

In [None]:
# pval = 5.4e-07
# pval < 0.05
# reject H0
# at least one machine has a different mean tensile strength
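Example 2 applies `f_oneway` directly. As a sketch, the normality and equal-variance checks used in Example 1 could be repeated on the machine data before relying on the ANOVA result. With only five readings per machine these tests have limited power, so the p-values should be read with care; the dictionary and variable names below are illustrative.

```python
from scipy import stats

# tensile-strength readings for each machine (from the example above)
machines = {
    'A': [68.7, 75.4, 70.9, 79.1, 78.2],
    'B': [62.7, 68.5, 63.1, 62.2, 60.3],
    'C': [55.9, 56.1, 57.3, 59.2, 50.1],
    'D': [80.7, 70.3, 80.9, 85.4, 82.3],
}

# Assumption 1: Shapiro-Wilk test for normality, one test per machine
shapiro_pvalues = {}
for name, values in machines.items():
    _, p_value = stats.shapiro(values)
    shapiro_pvalues[name] = p_value
    print(f"machine {name}: Shapiro-Wilk p-value = {p_value:.4f}")

# Assumption 2: Levene's test for equality of variances across machines
levene_stat, levene_pvalue = stats.levene(*machines.values())
print(f"Levene's test p-value = {levene_pvalue:.4f}")
```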