# Chi$^2$ ($\chi^2$) Test for Independence

Also known as Pearson's Chi$^2$ test. 'Chi' is pronounced 'Ki', as in kite.


https://docs.google.com/presentation/d/13V7cMcgbM6bIQL2fbMtONre15iiNKpxnX7ECiWTxrVI/edit?usp=sharing

The test lets us check whether two categorical variables (groups) are independent of each other
- $H_0$ is always that there is no association between the groups (they are independent)
- $H_a$ is that there is an association between the groups (they are not independent)


Under the null hypothesis, the observed frequencies in each cell of the contingency table should be close to the frequencies we would expect if the two variables were independent
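
As a sketch of what "expected frequencies" means here, the standard Pearson statistic can be written as below. The notation ($O_{ij}$ for observed counts, $E_{ij}$ for expected counts, $R_i$ and $C_j$ for row and column totals, $N$ for the grand total, $r$ and $c$ for the number of rows and columns) is introduced for illustration and does not appear in the original notes.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
\text{degrees of freedom} = (r - 1)(c - 1)
```

The larger the gaps between observed and expected counts, the larger $\chi^2$ becomes.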

## The Quick Way To Run a Chi$^2$ Test

1. Form the hypotheses
2. Make a contingency table
3. Use stats.chi2_contingency
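
The three steps can be sketched end to end as follows. This is a minimal illustration, not part of the lesson: the helper name `chi2_independence`, the toy DataFrame, and its `plan` / `churned` columns are all made up for the example.

```python
import pandas as pd
from scipy import stats


def chi2_independence(df, col_a, col_b, alpha=0.05):
    """Hypothetical helper: run a chi^2 test of independence on two columns."""
    # Step 2: contingency table of observed counts
    observed = pd.crosstab(df[col_a], df[col_b])
    # Step 3: chi2_contingency gives chi2, p-value, degrees of freedom, expected counts
    chi2, p, degf, expected = stats.chi2_contingency(observed)
    return {"chi2": chi2, "p": p, "degf": degf, "reject_null": bool(p < alpha)}


# Made-up data: 30 'basic' and 30 'premium' customers
toy = pd.DataFrame({
    "plan": ["basic"] * 30 + ["premium"] * 30,
    "churned": ["yes"] * 20 + ["no"] * 10 + ["yes"] * 5 + ["no"] * 25,
})

# Step 1 (hypotheses): H_0 is that plan and churned are independent
result = chi2_independence(toy, "plan", "churned")
print(result)
```

Returning a dictionary keeps the test statistic, the p-value and the decision together, which is convenient when the same test is run on several pairs of columns.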

## Example 1 - Tips Data

In [1]:
import pandas as pd
import numpy as np

from pydataset import data
from scipy import stats

In [4]:
#load tips dataset
df = data('tips')
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.50,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
240,29.03,5.92,Male,No,Sat,Dinner,3
241,27.18,2.00,Female,Yes,Sat,Dinner,2
242,22.67,2.00,Male,Yes,Sat,Dinner,2
243,17.82,1.75,Male,No,Sat,Dinner,2


In [5]:
#check out the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


## Is smoking independent of time of day?

### 1. Form hypothesis

- $H_0$: There is no association between smoker and time of day (independence)
- $H_a$: There is an association between smoker and time of day

### 2. Make contingency table

In [7]:
#set our alpha
alpha = 0.05

In [8]:
#look at smoker data
df.smoker.value_counts()

No     151
Yes     93
Name: smoker, dtype: int64

In [9]:
#look at time data
df.time.value_counts()

Dinner    176
Lunch      68
Name: time, dtype: int64

In [11]:
#make 'contingency' table using pandas crosstab
observed = pd.crosstab(df.smoker, df.time)
observed

smoker / time,Dinner,Lunch
No,106,45
Yes,70,23


### 3. Use stats.chi2_contingency

In [14]:
#use stats chi2_contingency
stats.chi2_contingency(observed)

(0.5053733928754354,
 0.4771485672079724,
 1,
 array([[108.91803279,  42.08196721],
        [ 67.08196721,  25.91803279]]))

In [15]:
#chi2_contingency returns 4 values - chi2, p-value, degrees of freedom, 
# expected values
chi2, p, degf, expected = stats.chi2_contingency(observed)
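
One detail worth knowing about the numbers above: for a table with one degree of freedom (such as this 2x2 table), `stats.chi2_contingency` applies Yates' continuity correction by default, controlled by its `correction` parameter. The sketch below re-enters the observed counts by hand and compares the corrected and uncorrected statistics; the variable names are chosen for illustration.

```python
import numpy as np
from scipy import stats

# Observed smoker x time counts from the crosstab above
observed = np.array([[106, 45],
                     [70, 23]])

# Default call: Yates' correction is applied because degf == 1
chi2_corrected, p_corrected, degf, expected = stats.chi2_contingency(observed)

# Same test with the correction switched off
chi2_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

# Manual versions of both statistics
manual_plain = ((observed - expected) ** 2 / expected).sum()
# Yates: shrink each |observed - expected| gap by 0.5 before squaring
# (this simple form is valid here because every gap is larger than 0.5)
manual_corrected = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

print(f"corrected:   chi^2 = {chi2_corrected:.4f}, p = {p_corrected:.4f}")
print(f"uncorrected: chi^2 = {chi2_plain:.4f}, p = {p_plain:.4f}")
```

The correction makes the test slightly more conservative; in this example the conclusion is the same either way.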

In [16]:
#output values (astype(int) truncates the expected counts, so 108.92 prints as 108)
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[106  45]
 [ 70  23]]

Expected
[[108  42]
 [ 67  25]]

----
chi^2 = 0.5054
p     = 0.4771


In [18]:
if p < alpha:
    print('We reject the null')
else:
    print("we fail to reject the null")

we fail to reject the null


> A low chi^2 value typically leads to a high p-value: the observed counts are close to what independence would predict
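
The link between the statistic and the p-value can be checked directly: the p-value is the upper-tail probability of a chi^2 distribution with the given degrees of freedom. A minimal sketch using `stats.chi2.sf` (the survival function), with the values from Example 1 typed in by hand:

```python
from scipy import stats

# Values reported by chi2_contingency for smoker vs time
chi2 = 0.5053733928754354
degf = 1

# Survival function = P(chi^2 random variable >= observed statistic)
p = stats.chi2.sf(chi2, degf)
print(f"p = {p:.4f}")
```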

## Example 2 - Attrition Data

In [19]:
#get data
df = pd.read_csv("https://gist.githubusercontent.com/ryanorsinger/6ba2dd985c9aa92f5598fc0f7c359f6a/raw/b20a508cee46e6ac69eb1e228b167d6f42d665d8/attrition.csv")

In [20]:
#always look at your data!!
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

### What could be categorical data???

Let's guess by looking at how many different values each column has

In [24]:
#.nunique counts the number of unique values in each column
df.nunique()

Age                           43
Attrition                      2
BusinessTravel                 3
DailyRate                    886
Department                     3
DistanceFromHome              29
Education                      5
EducationField                 6
EmployeeCount                  1
EmployeeNumber              1470
EnvironmentSatisfaction        4
Gender                         2
HourlyRate                    71
JobInvolvement                 4
JobLevel                       5
JobRole                        9
JobSatisfaction                4
MaritalStatus                  3
MonthlyIncome               1349
MonthlyRate                 1427
NumCompaniesWorked            10
Over18                         1
OverTime                       2
PercentSalaryHike             15
PerformanceRating              2
RelationshipSatisfaction       4
StandardHours                  1
StockOptionLevel               4
TotalWorkingYears             40
TrainingTimesLastYear          7
WorkLifeBa
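
One way to turn this guess into code is to keep the columns whose number of unique values falls under some cutoff. The sketch below uses a tiny made-up DataFrame and a hypothetical cutoff `max_levels`; for the attrition data a larger cutoff (for example 10) would probably be more appropriate, and the result still needs a human look.

```python
import pandas as pd

# Made-up miniature of the attrition data, for illustration only
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "Department": ["Sales", "Sales", "HR", "R&D", "R&D", "R&D"],
    "Income": [1000, 2000, 3000, 4000, 5000, 6000],
    "EmployeeCount": [1, 1, 1, 1, 1, 1],
})

max_levels = 3  # hypothetical cutoff for "few enough values to be a category"

n_unique = toy.nunique()
# Skip constant columns (1 unique value): they cannot be associated with anything
candidates = n_unique[(n_unique > 1) & (n_unique <= max_levels)].index.tolist()
print(candidates)
```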

## Ex. Is Attrition independent from Business Travel amount?

### 1. Form hypothesis

$H_0$: Attrition and Business travel have no association (They are independent)

$H_a$: Attrition and Business travel are associated (They are dependent)

### 2. Make contingency table

Let's scope out our columns and see what categories we have

In [26]:
#get unique values and counts from Attrition
df.Attrition.value_counts()

No     1233
Yes     237
Name: Attrition, dtype: int64

In [27]:
#get unique values and counts from BusinessTravel
df.BusinessTravel.value_counts()

Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: BusinessTravel, dtype: int64

In [39]:
pd.crosstab(df.Attrition, df.BusinessTravel, margins=True)

Attrition / BusinessTravel,Non-Travel,Travel_Frequently,Travel_Rarely,All
No,138,208,887,1233
Yes,12,69,156,237
All,150,277,1043,1470
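
The `All` margins are exactly what is needed to work out the expected counts by hand: each expected cell is its row total times its column total, divided by the grand total. A small sketch, with the observed counts typed in from the table above and illustrative variable names:

```python
import numpy as np
from scipy import stats

# Observed Attrition x BusinessTravel counts (without the margins)
observed = np.array([[138, 208, 887],
                     [12, 69, 156]])

row_totals = observed.sum(axis=1)   # the 'All' column: 1233, 237
col_totals = observed.sum(axis=0)   # the 'All' row: 150, 277, 1043
n = observed.sum()                  # grand total: 1470

# Expected counts under independence
expected_manual = np.outer(row_totals, col_totals) / n

# Compare with the expected counts reported by SciPy
_, _, _, expected_scipy = stats.chi2_contingency(observed)
print(expected_manual.round(2))
print(np.allclose(expected_manual, expected_scipy))
```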


In [40]:
#make contingency table
observed = pd.crosstab(df.Attrition, df.BusinessTravel)
observed

Attrition / BusinessTravel,Non-Travel,Travel_Frequently,Travel_Rarely
No,138,208,887
Yes,12,69,156


### 3. Use stats.chi2_contingency

In [41]:
#calculate chi2 values
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [42]:
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[138 208 887]
 [ 12  69 156]]

Expected
[[125 232 874]
 [ 24  44 168]]

----
chi^2 = 24.1824
p     = 0.0000


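> Typically a high chi^2 value leads to a low p-value, but how high counts as 'high' depends on the degrees of freedom. The p-value above prints as 0.0000 only because it is rounded to 4 decimal places; it is very small, not exactly zero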
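
To make the role of the degrees of freedom concrete, the sketch below prints the critical chi^2 value at `alpha = 0.05` for a few table shapes, using degrees of freedom = (rows - 1) * (columns - 1), and then recomputes the unrounded p-value for this example from the printed statistic. The table shapes in the loop are arbitrary illustrations.

```python
from scipy import stats

alpha = 0.05

# Critical value: the chi^2 statistic must exceed this to reject H_0 at alpha
for n_rows, n_cols in [(2, 2), (2, 3), (3, 3)]:
    degf = (n_rows - 1) * (n_cols - 1)
    critical_value = stats.chi2.ppf(1 - alpha, degf)
    print(f"{n_rows}x{n_cols} table: degf = {degf}, critical chi^2 = {critical_value:.3f}")

# Attrition vs BusinessTravel is a 2x3 table, so degf = 2
p_travel = stats.chi2.sf(24.1824, 2)
print(f"p = {p_travel:.2e}")
```

The same chi^2 value is therefore stronger evidence in a small table than in a large one.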

In [43]:
if p < alpha:
    print('We reject the null')
else:
    print("we fail to reject the null")

We reject the null


## Mini Exercise:
### Is Attrition independent from Department?

### 1. Form hypothesis

- $H_0$: There is no association between Attrition and Department (They are independent)
- $H_a$: There is an association between Attrition and Department (They are not independent)

### 2. Make contingency table

In [44]:
# how many categories do we have in the 'Department' column? (hint: value_counts())
df.Department.value_counts()

Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64

In [None]:
df.Attrition.value_counts()

In [49]:
# crosstab for observed values between Attrition and Depts
observed = pd.crosstab(df.Department, df.Attrition)
observed

Department / Attrition,No,Yes
Human Resources,51,12
Research & Development,828,133
Sales,354,92


### 3. Use stats.chi2_contingency

In [52]:
# use stats.chi2_contingency test 
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [53]:
#output values
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[ 51  12]
 [828 133]
 [354  92]]

Expected
[[ 52  10]
 [806 154]
 [374  71]]

----
chi^2 = 10.7960
p     = 0.0045


In [55]:
if p < alpha:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

We reject the null hypothesis
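
Note that this crosstab puts Department on the rows, while the earlier example put Attrition on the rows. The orientation does not matter for the test, as the following sketch suggests by running it on the table and on its transpose (counts typed in from the output above; variable names are illustrative).

```python
import numpy as np
from scipy import stats

# Department (rows) x Attrition (columns), as in the mini exercise
by_department = np.array([[51, 12],
                          [828, 133],
                          [354, 92]])

# Attrition (rows) x Department (columns)
by_attrition = by_department.T

chi2_a, p_a, degf_a, _ = stats.chi2_contingency(by_department)
chi2_b, p_b, degf_b, _ = stats.chi2_contingency(by_attrition)

print(f"Department rows: chi^2 = {chi2_a:.4f}, p = {p_a:.4f}, degf = {degf_a}")
print(f"Attrition rows:  chi^2 = {chi2_b:.4f}, p = {p_b:.4f}, degf = {degf_b}")
```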


## Chi$^2$ Extra FYI

> As a rule of thumb, the expected values should all be at least 5 for the chi^2 approximation to be reliable; we normally don't run into this issue

> If some expected values are below 5, use Fisher's exact test instead
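
A minimal sketch of that fallback, assuming a small 2x2 table (as far as I know, `stats.fisher_exact` has traditionally accepted only 2x2 tables). The counts below are made up for illustration and do not come from the datasets used in this lesson.

```python
import numpy as np
from scipy import stats

# Hypothetical small 2x2 table of counts
small_table = np.array([[8, 2],
                        [1, 5]])

# The fourth value from chi2_contingency holds the expected counts under independence
_, p_chi2, _, expected = stats.chi2_contingency(small_table)

if (expected < 5).any():
    # Some expected counts are below 5, so fall back to Fisher's exact test
    odds_ratio, p_value = stats.fisher_exact(small_table)
    test_used = "fisher_exact"
else:
    p_value = p_chi2
    test_used = "chi2_contingency"

print(test_used, round(p_value, 4))
```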