## Chi-Square Test

In [1]:
# import pandas
import pandas as pd

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

### What is a chi-square test?

- A Pearson’s chi-square test is a statistical test for categorical data.
- It is used to determine whether your data are significantly different from what you expected. 

There are two types of Pearson’s chi-square tests:

- The chi-square goodness of fit test is used to test whether the frequency distribution of a categorical variable is different from your expectations.
- The chi-square test of independence is used to test whether two categorical variables are related to each other.

### What is the chi-square goodness of fit test?

- A chi-square (χ²) goodness of fit test checks how well a hypothesized distribution fits an observed categorical variable. Goodness of fit is a measure of how well a statistical model fits a set of observations.
    - When goodness of fit is high, the values expected based on the model are close to the observed values.
    - When goodness of fit is low, the values expected based on the model are far from the observed values.
      
### What is the chi-square test of independence?

- The chi-square test of independence, also known as the chi-square test of association, is used to determine whether two categorical variables are related.
- The test compares the observed frequencies to the frequencies you would expect if the two variables are unrelated.
- When the variables are unrelated, the observed and expected frequencies will be similar.
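As a minimal sketch of what "expected if the two variables are unrelated" means, the expected count for each cell can be computed from the row totals, column totals, and grand total. The table below is hypothetical and only illustrates the arithmetic.

```python
import numpy as np

# Hypothetical 2x3 table of observed counts (illustrative numbers only)
observed = np.array([[20, 30, 50],
                     [30, 20, 50]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# Under independence: expected[i, j] = row_total[i] * col_total[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

The closer the observed table is to this expected table, the smaller the chi-square statistic will be.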

### Contingency tables

- When you want to perform a chi-square test of independence, the best way to organize your data is a type of frequency distribution table called a contingency table.
- A contingency table, also known as a cross tabulation or crosstab, shows the number of observations in each combination of groups. 
- It also usually includes row and column totals.
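A small sketch of building a contingency table with `pd.crosstab`, where `margins=True` adds the row and column totals. The `toy` DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical survey data: one row per respondent
toy = pd.DataFrame({
    "group":  ["A", "A", "B", "B", "B", "A"],
    "answer": ["yes", "no", "yes", "yes", "no", "yes"],
})

# Counts for each (group, answer) combination; margins=True appends an "All" row and column
table = pd.crosstab(toy["group"], toy["answer"], margins=True)
print(table)
```

When the table is passed to a chi-square test, the totals should be left out (the default `margins=False`), since only the cell counts are tested.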


### The chi-square formula

$\begin{equation*} \chi^2=\sum{\frac{(O-E)^2}{E}} \end{equation*}$

Where:

- $\chi^2$ is the chi-square test statistic
- $\sum$ is the summation operator (it means "take the sum of", here over all groups or cells)
- $O$ is the observed frequency
- $E$ is the expected frequency
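The formula can be evaluated by hand in a few lines. This is a sketch with hypothetical counts for three categories that are expected to be equally frequent.

```python
import numpy as np

# Hypothetical counts: 60 observations, expected to split evenly over 3 categories
observed = np.array([18, 22, 20])
expected = np.array([20, 20, 20])

# Sum of (O - E)^2 / E over all categories
chi_sq = np.sum((observed - expected) ** 2 / expected)
print(chi_sq)
```

Each category contributes its squared deviation scaled by the expected count, so a deviation of 2 matters more when only 20 observations were expected than when 200 were.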

### When to use a chi-square test

1. You want to test a hypothesis about one or more categorical variables. 
2. The sample was randomly selected from the population.
3. There are a minimum of five observations expected in each group or combination of groups.
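The third condition can be checked before trusting the test result. The sketch below uses a hypothetical 2x2 table and reads the expected frequencies returned by `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts
observed = np.array([[12, 2],
                     [ 8, 8]])

# The fourth returned value is the table of expected frequencies
_, _, _, expected = chi2_contingency(observed)

# True only if every expected count is at least 5
meets_rule = bool((expected >= 5).all())
print(expected)
print("All expected counts >= 5:", meets_rule)
```

Here one expected count falls below five, so the chi-square approximation may be unreliable for this table; combining categories or using an exact test are common alternatives.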

### Chi-square distribution properties

1. The variance of the distribution is equal to two times the number of degrees of freedom.
2. The mean of the distribution is equal to the number of degrees of freedom.
3. The chi-square distribution curve approaches the normal distribution as the number of degrees of freedom increases.
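These properties can be checked numerically with `scipy.stats.chi2`. In this sketch the degrees of freedom are arbitrary, and skewness is used as a rough indicator of how close the curve is to the symmetric normal shape.

```python
from scipy.stats import chi2

# Mean, variance, and skewness for several degrees of freedom.
# Skewness shrinking toward 0 reflects the approach to the normal distribution.
summary = {}
for k in (2, 5, 30, 200):
    mean, var, skew = chi2.stats(df=k, moments="mvs")
    summary[k] = (float(mean), float(var), float(skew))
    print(f"df={k:>3}  mean={float(mean):.1f}  variance={float(var):.1f}  skewness={float(skew):.3f}")
```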

### Some of the uses of the Chi-Squared test:

1. The Chi-squared test can be used to see if your data follows a well-known theoretical probability distribution like the Normal or Poisson distribution.
2. The Chi-squared test allows you to assess your trained regression model's goodness of fit on the training, validation, and test data sets.



In [2]:
from scipy.stats import chi2
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

### Goodness of fit

- Tests whether the data follow an expected pattern or distribution.

Example 1:
- A company in Los Angeles has three functional departments - Research and Development, Sales, and Human Resources. The company claims that the percentage of employees in these 3 departments is 55%, 35% and 10% respectively. Check the company's claim using p-value criteria. Consider a 5% level of significance.

  The null and alternative hypotheses are:

- H0: There is no significant difference between the observed and expected values.

- H1: There is a significant difference between the observed and expected values.



In [3]:
df_emp=pd.read_csv(r"C:\Users\Shree\Desktop\ANALYTICS VIDYA\PYTHON\class\Employee_Attrition.csv")
df_emp.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [4]:
df_emp["Department"].value_counts()

Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64

In [5]:
obs_values=df_emp["Department"].value_counts(normalize=True).reset_index()
obs_values

Unnamed: 0,index,Department
0,Research & Development,0.653741
1,Sales,0.303401
2,Human Resources,0.042857


In [6]:
obs_values["Department"][0]

0.6537414965986394

### stats.chisquare()

- Calculate a one-way chi-square test.

- The chi-square test tests the null hypothesis that the categorical data has the given frequencies.
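A minimal sketch of calling `stats.chisquare` on hypothetical data. Note that `f_obs` and `f_exp` are frequencies (counts): the statistic grows with sample size, so passing proportions instead of counts would understate it by a factor of the sample size.

```python
from scipy import stats

# Hypothetical counts for three categories (60 observations in total)
observed_counts = [18, 22, 20]
expected_counts = [20, 20, 20]   # must sum to the same total as the observed counts

chi_score, p_value = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)
print("chi-score:", chi_score)
print("p-value:", p_value)
```

If `f_exp` is omitted, `chisquare` assumes all categories are equally likely.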

In [7]:
n=3
alpha=0.05

# chisquare() expects frequencies (counts), not proportions: the statistic scales with sample size
observed_values=df_emp["Department"].value_counts().values
# expected counts under the company's claim, in the same order as value_counts():
# Research & Development, Sales, Human Resources
expected_values=np.array([0.55,0.35,0.1])*observed_values.sum()

chi_score, p_value = stats.chisquare(f_obs = observed_values, f_exp = expected_values)
print("chi-score:",round(chi_score,2))
print("p-value:",format(p_value,".1e"))

chi_critical=stats.chi2.isf(q = alpha, df =n-1 )
print("chi critical:",chi_critical)

## P-value approach

if (p_value <=alpha):   
    print("P-value approach: Reject null hypothesis")
else:
    print("P-value approach: Fail to reject null hypothesis")
    
## Critical value approach

if(chi_score > chi_critical):
    print("Critical value approach: Reject null hypothesis")
else:
    print("Critical value approach: Fail to reject null hypothesis")

chi-score: 85.88
p-value: 2.2e-19
chi critical: 5.991464547107983
P-value approach: Reject null hypothesis
Critical value approach: Reject null hypothesis


Conclusion:
- There is a significant difference between the observed and expected values, so the data do not support the company's claimed 55%/35%/10% split.

### Test of independence

Example 2:
- Check whether travelling for work depends upon the job role of an employee. Use p-value criteria to test the dependence with 99% confidence.

  The null and alternative hypotheses are:

- H0: Business travel and job role are independent
- H1: Business travel and job role are not independent

In [8]:
df_emp.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [9]:
df_emp.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [10]:
table = pd.crosstab(df_emp['BusinessTravel'], df_emp['JobRole'])
table

JobRole,Healthcare Representative,Human Resources,Laboratory Technician,Manager,Manufacturing Director,Research Director,Research Scientist,Sales Executive,Sales Representative
BusinessTravel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Non-Travel,15,4,28,12,13,6,28,39,5
Travel_Frequently,26,10,51,13,29,12,54,59,23
Travel_Rarely,90,38,180,77,103,62,210,228,55


In [11]:
obs = table.values
obs

array([[ 15,   4,  28,  12,  13,   6,  28,  39,   5],
       [ 26,  10,  51,  13,  29,  12,  54,  59,  23],
       [ 90,  38, 180,  77, 103,  62, 210, 228,  55]], dtype=int64)

In [12]:
df_emp['BusinessTravel'].nunique()

3

In [13]:
df_emp['JobRole'].nunique()

9

### chi2_contingency()

- Chi-square test of independence of variables in a contingency table.

- This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table. It also returns the degrees of freedom and the table of expected frequencies.
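Before applying it to real data, here is a sketch of `chi2_contingency` on a small hypothetical table. It also checks that the degrees of freedom equal (rows - 1) x (columns - 1).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of observed counts
observed = np.array([[20, 30, 50],
                     [30, 20, 50]])

chi_score, p_value, dof, expected = chi2_contingency(observed)

n_rows, n_cols = observed.shape
print("chi score value:", chi_score)
print("p-value:", p_value)
print("degree of freedom:", dof, "=", (n_rows - 1) * (n_cols - 1))
```

By default (`correction=True`) the function applies Yates' continuity correction, but only when the degrees of freedom equal 1, that is, for 2x2 tables.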

In [14]:
alpha=0.01

chi_score, p_value, dof, expected_value = stats.chi2_contingency(observed = obs)

print("chi score value:",chi_score)
print("p-value:",p_value)
print("degree of freedom:",dof)

chi score value: 11.987695596739206
p-value: 0.7448263418408124
degree of freedom: 16


In [15]:
print("expected values:",expected_value)

expected values: [[ 13.36734694   5.30612245  26.42857143  10.40816327  14.79591837
    8.16326531  29.79591837  33.26530612   8.46938776]
 [ 24.68503401   9.79863946  48.8047619   19.22040816  27.32312925
   15.07482993  55.02312925  61.42993197  15.64013605]
 [ 92.94761905  36.8952381  183.76666667  72.37142857 102.88095238
   56.76190476 207.18095238 231.3047619   58.89047619]]


In [16]:
chi_critical = stats.chi2.isf(q = alpha, df = dof )
chi_critical

31.999926908815176

In [17]:
## P-value approach

if (p_value <= alpha):   
    print("P-value approach: Reject null hypothesis")
else:
    print("P-value approach: Fail to reject null hypothesis")
    
## Critical value approach

if(chi_score > chi_critical):
    print("Critical value approach: Reject null hypothesis")
else:
    print("Critical value approach: Fail to reject null hypothesis")

P-value approach: Fail to reject null hypothesis
Critical value approach: Fail to reject null hypothesis


Conclusion:
- We fail to reject the null hypothesis: at the 1% level of significance there is no evidence that business travel depends on job role.

### References:
- https://www.scribbr.com/statistics/chi-square-tests/
- https://www.cuemath.com/chi-square-formula/