<table align="left" width=100%>
    <tr>
        <td width="20%">
            <img src="GL-2.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> Faculty Notebook <br> ( Day 4) </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Table of Contents

1. **[Import Libraries](#lib)**
2. **[Z Proportion Test](#prop)**
    - 2.1 - **[Two Sample Test](#2_p)**
3. **[Chi-Square Test](#chisq)**
    - 3.1 - **[Chi-Square Test for Goodness of Fit](#goodness)**
    - 3.2 - **[Chi-Square Test for Independence](#ind)**

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

import scipy.stats as stats
from statsmodels.stats import weightstats as wstats

import statsmodels.api as sma

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

<a id="prop"></a>
# 2. Z Proportion Test

<a id="2_p"></a>
## 2.1 Two Sample Test

Perform two sample Z test for the population proportion. We check the equality of population proportions $P_{1}$ and $P_{2}$.

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: P_{1} - P_{2} = P_{0}$ or $P_{1} - P_{2} \geq P_{0}$ or $P_{1} - P_{2} \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P_{1} - P_{2} \neq P_{0}$ or $P_{1} - P_{2} < P_{0}$ or $P_{1} - P_{2} > P_{0}$</strong></p>

The test statistic for two sample proportion Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{(p_{1} -  p_{2}) - P_{0}}{\sqrt{\bar{P}(1-\bar{P})(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}$   $\hspace{2 cm} \bar{P} = \frac{n_{1}p_{1} + n_{2}p_{2}}{n_{1} + n_{2}}$ </strong></p>

Where, <br>
$p_{1}, p_{2}$: Sample proportions<br>
$P_{0}$: Hypothesized difference between the population proportions (0 when testing equality)<br>
$\bar{P}$: Proportion of pooled sample<br>
$n_{1}, n_{2}$: Sample sizes
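The statistic above can be computed directly from success counts and sample sizes. The following is a minimal sketch using only the Python standard library; the function name `two_proportion_ztest` and the counts in the usage lines are illustrative assumptions and do not come from the dataset used in this notebook.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2, p0=0.0):
    """Pooled two-sample Z test for proportions; returns (z, two-sided p-value).

    x1, x2 are success counts, n1, n2 are sample sizes and p0 is the
    hypothesized difference P1 - P2 under the null hypothesis.
    """
    p1, p2 = x1 / n1, x2 / n2
    # pooled proportion: (n1*p1 + n2*p2) / (n1 + n2)
    p_bar = (x1 + x2) / (n1 + n2)
    se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    z = (p1 - p2 - p0) / se
    # two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# made-up counts, for illustration only
z, p_value = two_proportion_ztest(x1=45, n1=100, x2=30, n2=100)
print(round(z, 4), round(p_value, 4))
```

For a one-sided alternative, only one tail of the normal distribution would be used instead of doubling the tail area.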

### Example:

#### 1. A team of nutritionists believes that each institute provides 'standard' lunch to an equal proportion of students. A sample of students from institutes <i>Nature Learning</i> and <i>Speak Global Learning</i> is given. Consider the null hypothesis as equality of proportion with 0.1 level of significance.

Consider the sample data available in the CSV file `StudentsPerformance (2).csv`.

In [4]:
# read the students performance data 
df_student = pd.read_csv('StudentsPerformance (2).csv')

# display the first two observations
df_student.head(2)

| | gender | race/ethnicity | lunch | test preparation course | math score | reading score | writing score | total score | training institute |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | standard | none | 89 | 55 | 56 | 200 | Nature Learning |
| 1 | female | group C | standard | completed | 55 | 63 | 72 | 190 | Nature Learning |


In [63]:
# get the training institutes in the dataframe
df_student['training institute'].unique()

array(['Nature Learning', 'Speak Global Learning'], dtype=object)

In [64]:
df_student['lunch'].unique()

array(['standard', 'free/reduced'], dtype=object)

The dataset contains the information about the students from two different institutes.

In [112]:
# Counts needed for the sample proportions

# Count of students with standard lunch at Nature Learning
nl_lunch=df_student.loc[(df_student['training institute']=='Nature Learning')&
                        (df_student['lunch']=='standard')].shape[0]

In [114]:
# Count of students with standard lunch at Speak Global Learning
sgl_lunch=df_student.loc[(df_student['training institute']=='Speak Global Learning')&
                         (df_student['lunch']=='standard')].shape[0]

In [115]:
# Total number of students at each institute
nl=df_student.loc[df_student['training institute']=='Nature Learning'].shape[0]
sgl=df_student.loc[df_student['training institute']=='Speak Global Learning'].shape[0]

In [120]:
# Import the required library
import statsmodels.api as sma
# two-sided test by default; returns (z-statistic, p-value)
sma.stats.proportions_ztest(count=np.array([nl_lunch,sgl_lunch]),nobs=np.array([nl,sgl]))

(0.7935300106078008, 0.4274690915859791)

Here the absolute value of the z-score (0.79) is less than the two-tailed critical value of 1.645, and the p-value (0.427) is greater than 0.1. Consistent with this, a 90% confidence interval for the difference in proportions would be expected to contain the value in the null hypothesis (i.e. 0). Thus, we fail to reject the null hypothesis: we do not have enough evidence to conclude that the proportion of students with standard lunch differs between the two institutes.

#### 2. Steve owns a kiosk where he sells two magazines, A and B, in a month. He buys 100 copies of magazine A, out of which 78 were sold, and 70 copies of magazine B, out of which 65 were sold. Is there enough evidence to say that magazine B is more popular? Test the claim using the p-value technique with α = 0.05.

The null and alternative hypotheses are:

H<sub>0</sub>: Prop_A = Prop_B (both magazines are equally popular)<br> 
H<sub>1</sub>: Prop_B - Prop_A > 0 (magazine B is more popular)

In [121]:
mag_a=100
mag_a_sold=78
mag_b=70
mag_b_sold=65

In [124]:
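# counts are passed in the order (A, B), so the statistic is based on Prop_A - Prop_B;
# H1: Prop_B > Prop_A therefore corresponds to alternative='smaller'
sma.stats.proportions_ztest(count=np.array([mag_a_sold,mag_b_sold]),
                            nobs=np.array([mag_a,mag_b]),alternative='smaller')

(-2.60830803458311, 0.0045495516005473)

The p-value (about 0.0045) is less than α = 0.05, so we reject the null hypothesis. There is enough evidence to say that magazine B is more popular than magazine A.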
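The magazine example can also be looked at through a confidence interval for the difference in proportions. The sketch below is one possible way to do this, assuming a simple Wald (unpooled) interval; the function name `diff_proportion_ci` is hypothetical, and the interval is two-sided, so it complements the one-sided test rather than replacing it.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Wald (unpooled) confidence interval for P1 - P2."""
    p1, p2 = x1 / n1, x2 / n2
    # unpooled standard error of the difference in sample proportions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se


# magazine B (65 of 70 sold) versus magazine A (78 of 100 sold)
lower, upper = diff_proportion_ci(65, 70, 78, 100)
print(round(lower, 4), round(upper, 4))
```

An interval for Prop_B - Prop_A that lies entirely above 0 points towards a higher proportion of copies sold for magazine B.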

<a id="chisq"></a>
# 3. Chi-Square Test

It is a non-parametric test. `Non-parametric tests` do not assume that the population from which the sample is taken follows a particular distribution with specific parameters. These tests can be applied to ordinal/nominal data, and they can be performed on data containing outliers.

The chi-square test statistic follows a Chi-square ($\chi^{2}$) distribution under the null hypothesis. It can be used to check the relationship between the categorical variables. 

<a id="goodness"></a>
## 3.1 Chi-Square Test for Goodness of Fit

This test is used to compare the distribution of the categorical data with the expected distribution. 

<p style='text-indent:6em'> <strong> $H_{0}$: There is no significant difference between the observed frequencies and the frequencies expected under the assumed distribution</strong></p>
<p style='text-indent:6em'> <strong> $H_{1}$: There is a significant difference between the observed frequencies and the frequencies expected under the assumed distribution</strong></p>

The test statistic is given as:
<p style='text-indent:25em'> <strong> $\chi^{2} = \sum_{i = 1}^{k}\frac{(O_{i} - E_{i})^{2}}{E_{i}} = \sum_{i = 1}^{k}\frac{O_{i}^{2}}{E_{i}} - N$</strong></p>

Where, <br>
$O_{i}$: Observed frequency for category i <br>
$E_{i}$: Expected frequency for category i<br>
$N$: Total number of observations

Under $H_{0}$, the test statistic follows a chi-square distribution with $(k - p - 1)$ degrees of freedom, where k is the number of classes and p is the number of parameters estimated from the data. 

**Note:** All the expected frequencies should be greater than or equal to 5. If not, merge classes so that each resulting class has an expected frequency greater than or equal to 5.

### Example:

#### 1. A bank has an ATM installed inside the bank, and it is available to its customers only from 7 am to 6 pm Monday through Friday. The manager of the bank wanted to investigate if the number of people who use this ATM is the same for each of the 5 days (Monday through Friday) of the week. She randomly selected one week and counted the number of people who used this ATM on each of the 5 days during that week.

At a 1% level of significance, can we reject the null hypothesis that the number of people who use this ATM each of the 5 days of the week is the same?

* **H<sub>0</sub>: The number of people using the ATM is the same on all five days**
* **H<sub>a</sub>: The number of people using the ATM is not the same on all five days**

In [5]:
days=['Mon','Tue','Wed','Thu','Fri']
count=[253,197,204,279,267] # Observed value

In [8]:
# Find the Critical Value
# there are 5 categories (days), so degrees of freedom = k-1 = 5-1 = 4
stats.chi2.isf(0.01,4)

13.276704135987625

In [10]:
# Expected frequency for each day under H0 (equal usage across the 5 days)
exp=np.mean(count)

# Find the Test Stats & Compare
num=np.sum((np.array(count)-exp)**2)
teststats=num/exp
teststats

23.183333333333334

In [11]:
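# Remember that the chi-square test is always a right-tailed test...
# the chi-square distribution takes values from 0 to infinity,
# and large values of the statistic indicate a departure from H0
# p-value
1-stats.chi2.cdf(teststats,df=len(count)-1)

0.00011638214275699887

The test statistic (23.18) exceeds the critical value (13.28), and the p-value (about 0.0001) is less than 0.01. We therefore reject the null hypothesis: the number of people using the ATM is not the same on all five days.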
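The shortcut form $\sum O_{i}^{2}/E_{i} - N$ of the statistic and the form $\sum (O_{i} - E_{i})^{2}/E_{i}$ used in the code cells give the same value whenever the observed and expected frequencies both add up to $N$. A small check with the ATM counts, written as a plain-Python sketch (the variable names are chosen here for illustration):

```python
observed = [253, 197, 204, 279, 267]   # ATM users, Monday to Friday
n_total = sum(observed)
# equal usage on every day under H0
expected = [n_total / len(observed)] * len(observed)

# definitional form: sum of (O - E)^2 / E
chi2_definition = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# shortcut form: sum of O^2 / E, minus N
chi2_shortcut = sum(o ** 2 / e for o, e in zip(observed, expected)) - n_total

print(round(chi2_definition, 4), round(chi2_shortcut, 4))
```

Both expressions agree up to floating-point rounding, so either can be used when computing the statistic by hand.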

#### 2. At an emporium, the manager is interested in knowing which age groups visit the mall during the day. He defines the categories as children, teenagers, adults and senior citizens, and plans his inventory of goods accordingly. He claims that, out of all the people who visit, 5% are children, 38% are teenagers, 2% are senior citizens and the remaining are adults. From a sample of 180 people, it was seen that 25 were children, 50 were teenagers, 90 were adults and 15 were senior citizens. Test the manager's claim at a 95% confidence level.


The null and alternative hypotheses are:

H<sub>0</sub>: The manager's claim is correct <br>
H<sub>1</sub>: The manager's claim is not correct

For α = 0.05 and degrees of freedom = 3, calculate the critical value.

In [23]:
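cat=['children','teenagers','adults','senior citizens']
# claimed proportions, in the same order as 'cat' and 'obs'
# (adults are the remainder: 1-0.05-0.38-0.02 = 0.55)
ratio=[0.05,0.38,0.55,0.02]
obs=[25,50,90,15]
n=180

In [34]:
# Expected frequencies under the manager's claim
exp=n*np.array(ratio)
exp

array([ 9. , 68.4, 99. ,  3.6])

In [35]:
# Calculate the Critical Value...
stats.chi2.isf(0.05,df=3)

7.814727903251178

In [43]:
# Calculate the Test Statistic
num=(obs-exp)**2
teststats=num/exp
round(sum(teststats),4)

70.3123

In [44]:
# cross-check with scipy
chi2_stat,p_value=stats.chisquare(obs,exp)
print(round(chi2_stat,4),f'{p_value:.2e}')

70.3123 3.66e-15

In [45]:
# p-value from the survival function (more accurate than 1-cdf for very small values)
print(f"{stats.chi2.sf(sum(teststats),df=len(obs)-1):.2e}")

3.66e-15

The test statistic (70.31) is far greater than the critical value (7.81) and the p-value is practically 0, so we reject the null hypothesis: the data do not support the manager's claim. Note that the expected frequency for senior citizens (3.6) is below 5, so the chi-square approximation should be treated with some caution here.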
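The note on expected frequencies applies to this example: under the manager's claim, the expected count of senior citizens is 180 × 0.02 = 3.6, which is below 5. One possible remedy is sketched below, assuming that senior citizens are merged with adults; which classes to merge is a judgment call, and the dictionary names are illustrative. The proportions and counts are taken from the problem statement.

```python
from math import exp

n = 180
# claimed proportions and observed counts from the problem statement
claimed = {'children': 0.05, 'teenagers': 0.38, 'adults': 0.55, 'senior citizens': 0.02}
observed = {'children': 25, 'teenagers': 50, 'adults': 90, 'senior citizens': 15}

# assumption: fold 'senior citizens' (expected 3.6 < 5) into 'adults'
merged_claimed = {
    'children': claimed['children'],
    'teenagers': claimed['teenagers'],
    'adults + senior citizens': claimed['adults'] + claimed['senior citizens'],
}
merged_observed = {
    'children': observed['children'],
    'teenagers': observed['teenagers'],
    'adults + senior citizens': observed['adults'] + observed['senior citizens'],
}

# expected frequencies and chi-square statistic for the merged classes
expected = {k: n * p for k, p in merged_claimed.items()}
chi2_stat = sum((merged_observed[k] - expected[k]) ** 2 / expected[k] for k in expected)
dof = len(expected) - 1

# for exactly 2 degrees of freedom the chi-square survival function is exp(-x / 2)
p_value = exp(-chi2_stat / 2)
print(round(chi2_stat, 2), dof, p_value)
```

With three classes the test has 2 degrees of freedom, and every expected frequency is now at least 5.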

<a id="ind"></a>
## 3.2 Chi-Square Test for Independence

This test is used to test whether the categorical variables are independent or not.

<p style='text-indent:20em'> <strong> $H_{0}$: The variables are independent</strong></p>
<p style='text-indent:20em'> <strong> $H_{1}$: The variables are not independent (i.e. variables are dependent)</strong></p>

Consider a categorical variable `A` with `r` levels and variable `B` with `c` levels. Let us test the independence of variables A and B.

The test statistic is given as:
<p style='text-indent:25em'> <strong> $\chi^{2} = \sum_{i= 1}^{r}\sum_{j = 1}^{c}\frac{(O_{ij} - E_{ij})^{2}}{E_{ij}} = \sum_{i= 1}^{r}\sum_{j = 1}^{c}\frac{O_{ij}^{2}}{E_{ij}} - N$</strong></p>

Where, <br>
$O_{ij}$: Observed frequency for category (i,j) <br>
$E_{ij}$: Expected frequency for category (i,j)<br>
$N$: Total number of observations

Under $H_{0}$, the test statistic follows a chi-square distribution with $(r-1)(c-1)$ degrees of freedom.
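
The expected frequencies $E_{ij}$ follow from the row and column totals when the two variables are independent. As a worked statement of this standard result (the symbols $R_{i}$ and $C_{j}$ are introduced here only for illustration and are not used elsewhere in this notebook):

$$E_{ij} = \frac{R_{i}\,C_{j}}{N}, \qquad R_{i} = \sum_{j=1}^{c} O_{ij}, \qquad C_{j} = \sum_{i=1}^{r} O_{ij}$$

For instance, a cell whose row total is 40 and whose column total is 30, in a table of $N = 100$ observations, has $E_{ij} = 40 \times 30 / 100 = 12$.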

### Example:

#### 1. Check if there is any relationship between the gender and race/ethnicity of students with 95% confidence. 

Use the performance dataset of students available in the CSV file `StudentsPerformance (2).csv`.

In [47]:
# read the students performance data 
df_student = pd.read_csv('StudentsPerformance (2).csv')

# display the first two observations
df_student.head(2)

| | gender | race/ethnicity | lunch | test preparation course | math score | reading score | writing score | total score | training institute |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | standard | none | 89 | 55 | 56 | 200 | Nature Learning |
| 1 | female | group C | standard | completed | 55 | 63 | 72 | 190 | Nature Learning |


The null and alternative hypotheses are:

H<sub>0</sub>: The variables gender and race/ethnicity are independent<br>
H<sub>1</sub>: The variables gender and race/ethnicity are not independent

In [48]:
df_student['gender'].unique()

array(['female', 'male'], dtype=object)

In [51]:
# Generate a contingency table
tb1=pd.crosstab(df_student['gender'],df_student['race/ethnicity'])
tb1

| gender \ race/ethnicity | group A | group B | group C | group D | group E |
|---|---|---|---|---|---|
| female | 36 | 104 | 180 | 128 | 69 |
| male | 54 | 86 | 139 | 133 | 71 |


In [55]:
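# Chi square test
# returns the test statistic, p-value, degrees of freedom and expected frequencies
teststats, pvalue, df, exp_freq=stats.chi2_contingency(tb1)
print('Teststats',teststats)
print('p-value',pvalue)

Teststats 9.554257224920228
p-value 0.048644241994791254

The p-value (0.0486) is slightly below 0.05, so at the 5% level of significance we reject the null hypothesis and conclude that gender and race/ethnicity are not independent. The evidence is borderline, since the p-value is very close to the significance level.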
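
The statistic reported above can be reproduced by hand from the contingency table. The sketch below uses plain Python and the counts shown in the crosstab output; the p-value line relies on a closed form of the chi-square survival function that holds only for 4 degrees of freedom, which is an assumption specific to this 2 × 5 table.

```python
from math import exp

# contingency table from the crosstab output: rows = gender, columns = groups A to E
table = [
    [36, 104, 180, 128, 69],   # female
    [54, 86, 139, 133, 71],    # male
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n_total = sum(row_totals)

chi2_stat = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        # expected frequency under independence: row total * column total / N
        expected = row_totals[i] * col_totals[j] / n_total
        chi2_stat += (observed - expected) ** 2 / expected

# degrees of freedom: (r - 1) * (c - 1)
dof = (len(table) - 1) * (len(table[0]) - 1)

# closed form of the chi-square survival function, valid for 4 degrees of freedom only
p_value = exp(-chi2_stat / 2) * (1 + chi2_stat / 2)
print(round(chi2_stat, 4), dof, round(p_value, 4))
```

The values agree with the `chi2_contingency` output to the displayed precision.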


### Titanic Dataset

* Deal with the Missing Values 
* Manipulate Deck to remove the missing values
* See if you can create new features
* Perform Hypothesis Testing on the Dataset

In [23]:
titanic= sns.load_dataset("titanic")
titanic.head()

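| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |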
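
As a starting point for the first three tasks, the sketch below shows one possible approach on a tiny hand-made frame that mimics a few Titanic columns (it is not the real data). Filling age with the median, adding an `Unknown` level to `deck`, and the `family_size` feature are all illustrative choices rather than the required solution.

```python
import numpy as np
import pandas as pd

# tiny hand-made stand-in for a few Titanic columns (not the real data)
df = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'deck': pd.Categorical([np.nan, 'C', np.nan, 'C', 'E']),
    'sibsp': [1, 1, 0, 1, 0],
    'parch': [0, 0, 0, 2, 0],
})

# numeric column: fill missing ages with the median age
df['age'] = df['age'].fillna(df['age'].median())

# categorical column: add an explicit 'Unknown' level before filling,
# because a category must exist before it can be used as a fill value
df['deck'] = df['deck'].cat.add_categories(['Unknown']).fillna('Unknown')

# new feature: siblings/spouses + parents/children + the passenger
df['family_size'] = df['sibsp'] + df['parch'] + 1

print(df)
```

The same steps should carry over to the full `titanic` frame, after which a contingency table of, say, `family_size` against `survived` could be passed to a chi-square test of independence.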


### Summary of Chi Square Test
* It is based on categorical data; the test is applied only to categories.
* It is well suited to classification problems in machine learning, for example to check whether a categorical feature is related to a categorical target.
* The H<sub>0</sub> is that the two categorical variables have nothing in common, i.e. one variable has no effect on the other.
* The alternative is that the two variables are related.

* Again, this is a non-parametric test, meaning that no particular distribution is assumed for the data (the expected frequencies should still be at least 5).

* Since we deal only with categories, we work only on frequency (`value_counts()`) data.

* It is based on observed values (coming from the data) and expected values (which we calculate under H<sub>0</sub>); the test statistic is computed from these.

* A chi-square test of independence is a test of hypothesis where two categorical variables are tested for independence.

* **Here, we do not compare the means of samples coming from the same or different populations. Here, we have categorical/dichotomous variables on which we perform the test.**

* We check whether the categorical variables are independent of each other or whether there is some pattern of dependence.

* If there is dependence, the researcher concludes that the two variables have a significant relationship. In other words, one variable is linked to the other. For example, survival on the Titanic may depend on the family count.

* **Formula:** $\chi^{2} = \sum \frac{(Obs - Exp)^{2}}{Exp}$ (the chi-square test statistic).