# Chi Squared Test for Independence
- We'll build a crosstab from two Series (two categorical columns) and feed it to our chi2 function
- The output from the chi2 function that we care about is the `p` value

## Process
1. Set your alpha and state your null hypothesis. For the chi2 test, the null hypothesis is:
    - there is no relationship between A and B
    - that is, categories A and B are independent
2. Build the table of observed counts: `observed = pd.crosstab(df.A, df.B)`
3. Run `stats.chi2_contingency(observed)` to get `p`.
4. Compare `p` to alpha. If `p < alpha`, we reject the null hypothesis: we have evidence supporting the alternative hypothesis.
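
The process above can be sketched as a single helper function. This is a minimal sketch rather than part of the original notebook: the function name `chi2_independence`, its return format, and the toy data are made up for illustration. Only `pd.crosstab` and `stats.chi2_contingency` come from the text.

```python
import pandas as pd
from scipy import stats


def chi2_independence(df, col_a, col_b, alpha=0.05):
    """Run a chi2 test of independence on two categorical columns.

    Hypothetical helper: the name and return format are illustrative only.
    """
    # Build the table of observed counts
    observed = pd.crosstab(df[col_a], df[col_b])
    # chi2_contingency returns four values; p is the one we compare to alpha
    chi2, p, degf, expected = stats.chi2_contingency(observed)
    return {
        "chi2": float(chi2),
        "p": float(p),
        "degf": int(degf),
        "reject_null": bool(p < alpha),
    }


# Made-up data in which B is completely determined by A
toy = pd.DataFrame({"A": ["x"] * 50 + ["y"] * 50, "B": ["u"] * 50 + ["v"] * 50})
result = chi2_independence(toy, "A", "B")
print(result["reject_null"])
```

Because the toy columns are perfectly dependent, the helper should report that the null is rejected. On real data, pass the DataFrame and the two column names you want to compare.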

In [1]:
import pandas as pd
from scipy import stats
from pydataset import data
df = data("tips")
df.head()

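| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |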


In [2]:
# Let's investigate smoking status and time of day
# The null hypothesis is that they are independent
# Step 2: make a crosstab of the two variables we want to investigate
observed = pd.crosstab(df.smoker, df.time)
observed

| smoker \ time | Dinner | Lunch |
|---|---|---|
| No | 106 | 45 |
| Yes | 70 | 23 |



In [7]:
alpha = .05

In [8]:
# chi2_contingency returns four values, so we unpack four variables;
# the p value is the only one we really need
chi2, p, degf, expected = stats.chi2_contingency(observed)
p

0.4771485672079724
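
To see where this `p` comes from, the calculation can be sketched by hand for the 2x2 smoker/time table. This is an illustration, not scipy's source code. The helper names are made up, the sketch assumes the default behaviour of `chi2_contingency` (which applies Yates' continuity correction when there is one degree of freedom), and the tail-probability formula used here is valid only for one degree of freedom.

```python
import math

# Observed counts from the crosstab above
# (rows: smoker No/Yes, columns: Dinner/Lunch)
observed = [[106, 45], [70, 23]]


def expected_counts(table):
    """Counts expected under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi2_stat_2x2(table):
    """Chi2 statistic for a 2x2 table with Yates' continuity correction."""
    expected = expected_counts(table)
    stat = 0.0
    for obs_row, exp_row in zip(table, expected):
        for obs, exp in zip(obs_row, exp_row):
            # Shrink each |observed - expected| by 0.5, but never below zero
            diff = max(abs(obs - exp) - 0.5, 0.0)
            stat += diff ** 2 / exp
    return stat


stat = chi2_stat_2x2(observed)
# For 1 degree of freedom only: P(chi2 > x) = erfc(sqrt(x / 2))
p_by_hand = math.erfc(math.sqrt(stat / 2))
print(round(p_by_hand, 4))
```

The hand-computed value should agree with the `p` reported above. Since it is well above an alpha of .05, we fail to reject the null hypothesis that smoking status and time of day are independent.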

## What about gender and day?

In [9]:
# Step 1: set your alpha and define your null hypothesis
# null = gender and day are independent

In [10]:
# Step 2: calculate the observed values with a crosstab
observed = pd.crosstab(df.sex, df.day)
observed

| sex \ day | Fri | Sat | Sun | Thur |
|---|---|---|---|---|
| Female | 9 | 28 | 18 | 32 |
| Male | 10 | 59 | 58 | 30 |


In [12]:
type(observed) # Labeled rows and columns indicates that it's a DataFrame

pandas.core.frame.DataFrame

In [11]:
# Step 3: run chi2_contingency to get the p value
chi2, p, degf, expected = stats.chi2_contingency(observed)
p

0.004180302092822257
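
The same result can be checked by hand for the sex/day table, as a sketch. Here the degrees of freedom are (rows - 1) * (columns - 1) = 3, and scipy applies the continuity correction only when there is one degree of freedom, so none is used. The function names are made up for illustration, and the closed-form tail probability in `chi2_sf_3` holds only for exactly three degrees of freedom.

```python
import math

# Observed counts from the crosstab above
# (rows: Female, Male; columns: Fri, Sat, Sun, Thur)
observed = [[9, 28, 18, 32], [10, 59, 58, 30]]


def chi2_stat(table):
    """Pearson chi2 statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat


def chi2_sf_3(x):
    """P(chi2 > x) for exactly 3 degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


degf = (len(observed) - 1) * (len(observed[0]) - 1)
stat = chi2_stat(observed)
p_by_hand = chi2_sf_3(stat)
print(degf, round(stat, 2), round(p_by_hand, 4))
```

The hand-computed `p` should agree with the value above. It is below an alpha of .05, so we reject the null hypothesis that sex and day are independent.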

## What about time of day and which day?
- null hypothesis: time and day are independent

In [14]:
observed = pd.crosstab(df.time, df.day)
observed

| time \ day | Fri | Sat | Sun | Thur |
|---|---|---|---|---|
| Dinner | 12 | 87 | 76 | 1 |
| Lunch | 7 | 0 | 0 | 61 |


In [15]:
chi2, p, degf, expected = stats.chi2_contingency(observed)
p

8.449897551777147e-47

In [17]:
if p < alpha:
    print("Reject the null")
else:
    print("Fail to reject the null")

Reject the null
