# Pearson's $\chi^2$ Test

Pearson's chi-squared test or Pearson's $\chi^2$ test is a statistical method used to determine whether observed differences between sets of categorical data are likely due to chance. 

**There are two types of Pearson’s chi-squared tests:**
- The Chi-Square Goodness of Fit Test
- The Chi-Square Test of Independence

## I. Chi-Square Goodness of Fit Test

The test statistic for the chi-square goodness-of-fit test is calculated as:

$$
\chi^2\ (cal.) = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

Where:

$\chi^2$ = Chi-square test statistic, compared against a chi-square distribution with $k - 1$ degrees of freedom

$O_i$ = Observed count/frequency of type $i$

$E_i$ = Expected count/frequency of type $i$

$k$ =  Number of distinct categories or groups

$df$ (Degrees of freedom) = $k - 1$

### *Question 1*

**Question 1:** A coin was flipped 100 times, yielding 40 heads and 60 tails. Conduct a hypothesis test at the 95% confidence level to assess whether the coin is biased.

**Importing Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

**Creating Data**

In [2]:
# Creating lists of data from the question 1 above
obs = [40, 60]
exp = [50, 50]

**Stating Hypothesis**

Null hypothesis H0: The coin is unbiased.

Alternative hypothesis H1: The coin is biased.

**Selecting the Level of Significance or α Level**

Given that the confidence level (1 - α) is 95%, the significance level (α) is 5% (or 0.05). 

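The chi-square goodness-of-fit test is right-tailed: the test statistic, calculated as $\sum \frac {(O - E)^2}{E}$, is a sum of squared deviations divided by positive expected counts, so it is never negative, and larger values indicate a greater discrepancy between the observed and expected counts.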
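
The right-tailed decision rule can be illustrated with the critical value of the chi-square distribution. The following is a minimal sketch, assuming two categories (so $df = 1$) and α = 0.05; the variable names `alpha`, `df`, and `critical_value` are illustrative and not part of the original notebook.

```python
from scipy import stats

alpha = 0.05   # significance level
df = 2 - 1     # k - 1 degrees of freedom for k = 2 categories (heads, tails)

# Right-tailed critical value: the point with an area of alpha to its right
critical_value = stats.chi2.ppf(1 - alpha, df)
print(round(critical_value, 3))  # 3.841
```

A calculated statistic larger than this critical value falls in the rejection region, which is equivalent to obtaining a p-value below α.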

**Calculating Test Statistic and P-Value**

In [3]:
stats.chisquare(f_obs=obs, f_exp=exp)

Power_divergenceResult(statistic=4.0, pvalue=0.04550026389635857)

In [4]:
# By default the expected values (i.e., f_exp) are assumed to be equally likely
stats.chisquare(obs)

Power_divergenceResult(statistic=4.0, pvalue=0.04550026389635857)

Here, the calculated chi-square value is 4.0 and the right-tailed p-value is approximately 0.0455.
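
As a cross-check, the same result can be reproduced by applying the formula directly. This is a sketch rather than part of the original analysis; it uses NumPy arrays for the coin data and `scipy.stats.chi2.sf` (the survival function) for the right-tail area. The names `chi2_stat` and `p_value` are illustrative.

```python
import numpy as np
from scipy import stats

obs = np.array([40, 60])   # observed heads and tails
exp = np.array([50, 50])   # expected counts for a fair coin

# Sum of (O - E)^2 / E over the categories: 100/50 + 100/50
chi2_stat = np.sum((obs - exp) ** 2 / exp)

df = len(obs) - 1                        # k - 1 degrees of freedom
p_value = stats.chi2.sf(chi2_stat, df)   # area to the right of the statistic

print(chi2_stat, p_value)
```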

**Interpreting Results**

Since the p-value (about 0.0455) is less than α = 0.05, we reject the null hypothesis. The data therefore provide statistically significant evidence, at the 5% level, that the coin is biased.

### *Question 2*

**Question 2:** A shop owner wants to compare the number of T-shirts sold in each size to the expected proportions. The total number of shirts sold was 25 (Small), 41 (Medium), 91 (Large), and 68 (X-Large). The expected proportions were 0.1, 0.2, 0.4, and 0.3 for Small, Medium, Large, and X-Large, respectively. Conduct a hypothesis test at the 95% confidence level to determine whether the observed (sample) sales differ significantly from the expected (population) sales.

**Creating Data**

In [5]:
# Creating the data as a pandas Series based on question 2 above
exp_prop = pd.Series([0.1, 0.2, 0.4, 0.3])
obs_freq = pd.Series([25, 41, 91, 68])
exp_freq = exp_prop * sum(obs_freq)

In [6]:
obs_freq

0    25
1    41
2    91
3    68
dtype: int64

In [7]:
exp_freq

0    22.5
1    45.0
2    90.0
3    67.5
dtype: float64

**Stating Hypothesis**

Null hypothesis H0:  The observed sales follow the expected proportions or O $\approx$ E.

Alternative hypothesis H1: The observed sales do not follow the expected proportions or O $\neq$ E.

**Selecting the Level of Significance or α Level**

Given that the confidence level (1 - α) is 95%, the significance level (α) is 5% (or 0.05). 

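The chi-square goodness-of-fit test is right-tailed: the test statistic, calculated as $\sum \frac {(O - E)^2}{E}$, is a sum of squared deviations divided by positive expected counts, so it is never negative, and larger values indicate a greater discrepancy between the observed and expected counts.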

**Calculating Test Statistic and P-Value**

In [8]:
stats.chisquare(obs_freq, exp_freq)

Power_divergenceResult(statistic=0.648148148148148, pvalue=0.8853267818237286)
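
The result object can also be used to make the decision programmatically. The sketch below rebuilds the T-shirt data with NumPy so that it runs on its own; `result`, `alpha`, and `reject_h0` are illustrative names, not part of the original notebook.

```python
import numpy as np
from scipy import stats

obs_freq = np.array([25, 41, 91, 68])        # Small, Medium, Large, X-Large
exp_prop = np.array([0.1, 0.2, 0.4, 0.3])    # expected proportions
exp_freq = exp_prop * obs_freq.sum()         # expected counts out of 225 shirts

result = stats.chisquare(obs_freq, exp_freq)

alpha = 0.05
reject_h0 = result.pvalue < alpha            # True would mean "reject H0"

print(f"statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}, reject H0: {reject_h0}")
```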

**Interpreting Results**

Since the p-value is greater than α = 0.05, we fail to reject the null hypothesis. Therefore, there is not enough evidence to conclude that the observed shirt sales differ from the expected proportions.

## II. Chi-Square Test of Independence in a Two-Way Table (or, a Contingency Table)

Unlike the one-way goodness-of-fit test, which involves only a single row of data, a contingency table—also known as a cross-tabulation or crosstab—is a matrix-style table that presents the frequency distribution of variables across both rows and columns. It is used to examine the relationship between two discrete variables.

*The test statistic for the Chi-Square Test of Independence in a two-way contingency table is calculated with the same formula as in the Chi-Square Goodness-of-Fit Test described above. However, the degrees of freedom are calculated as:*

$df$ (Degrees of freedom) = $(r - 1)(c - 1)$

Where:

$r$ = Number of rows

$c$ = Number of columns

**Creating Data**

In [9]:
# Creating the data as a pandas DataFrame, where the rows correspond to Shift 1, Shift 2, and Shift 3, and the columns correspond to Operator 1, Operator 2, and Operator 3. 
shif_oper = pd.DataFrame(data=[[22, 26, 23], [28, 62, 26], [72, 22, 66]], index=['Shift 1', 'Shift 2', 'Shift 3'], columns=['Operator 1', 'Operator 2', 'Operator 3'])
shif_oper

| | Operator 1 | Operator 2 | Operator 3 |
|---|---|---|---|
| Shift 1 | 22 | 26 | 23 |
| Shift 2 | 28 | 62 | 26 |
| Shift 3 | 72 | 22 | 66 |


*The data is now in the form of a contingency table.*

### *Question 1*

**Question 1:** Conduct a hypothesis test at the 95% confidence level to assess whether a relationship exists between the row (shift) and column (operator) variables, using the data presented above as a contingency table of observed frequencies. 

**Stating Hypothesis**

Null hypothesis H0: There is no relationship between the row variable (shift) and the column variable (operator).

Alternative hypothesis H1:  There is a relationship between the row variable (shift) and the column variable (operator).


**Note:** In a contingency table analysis, the alternative hypothesis simply states that a relationship exists between the row and column variables. It doesn't specify:

- The nature of that relationship (positive, negative, causal, etc.)

- The direction or strength of the association

- Which variable might influence the other

**Selecting the Level of Significance or α Level**

Given that the confidence level (1 - α) is 95%, the significance level (α) is 5% (or 0.05). 

**Calculating Test Statistic and P-Value**

In [10]:
stats.chi2_contingency(observed=shif_oper)
# Here, the 'observed' parameter represents the data containing observed frequencies in the form of a contingency table.  

Chi2ContingencyResult(statistic=50.09315721064659, pvalue=3.4527076339398545e-10, dof=4, expected_freq=array([[24.96253602, 22.50720461, 23.53025937],
       [40.78386167, 36.77233429, 38.44380403],
       [56.25360231, 50.7204611 , 53.0259366 ]]))

Here, the calculated chi-squared value is approximately 50.09, with 4 degrees of freedom, and the p-value is approximately 3.45e-10.
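
To show where the expected frequencies and degrees of freedom come from, the calculation can be reproduced by hand. This is a sketch under the usual assumption that, under independence, each expected count equals (row total × column total) / grand total; the variable names are illustrative. The table is rebuilt as a NumPy array so the block is self-contained.

```python
import numpy as np
from scipy import stats

# Rows: Shift 1-3, columns: Operator 1-3 (same counts as the DataFrame above)
observed = np.array([[22, 26, 23],
                     [28, 62, 26],
                     [72, 22, 66]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count for each cell under independence
expected = np.outer(row_totals, col_totals) / grand_total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)   # (r - 1)(c - 1)
p_value = stats.chi2.sf(chi2_stat, dof)                   # right-tail area

print(chi2_stat, dof, p_value)
```

The values should agree with the `statistic`, `dof`, `pvalue`, and `expected_freq` reported by `stats.chi2_contingency`.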

**Interpreting Results**

Since the p-value is less than α = 0.05, we reject the null hypothesis. Therefore, we conclude that there is a relationship between the row variable (shift) and the column variable (operator).

### *Question 2*

**Question 2:** Using the built-in Tips dataset from the Seaborn library, perform a hypothesis test at the 95% confidence level to assess whether a relationship exists between the row (day) and column (smoker) variables. 

**Loading Data**

In [11]:
# Checking the Tips Dataset in the Seaborn library
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

In [12]:
# Loading the Tips Dataset, a built-in data in the Seaborn library
tips = sns.load_dataset('tips')
tips.head()

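| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |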


In [13]:
# Inspecting column types and non-null counts with the info() method
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [14]:
# Grouping the rows by the 'day' and 'smoker' columns and counting the non-null values in each remaining column
tips.groupby(by=['day', 'smoker'], observed=True).count()

| day | smoker | total_bill | tip | sex | time | size |
|---|---|---|---|---|---|---|
| Thur | Yes | 17 | 17 | 17 | 17 | 17 |
| Thur | No | 45 | 45 | 45 | 45 | 45 |
| Fri | Yes | 15 | 15 | 15 | 15 | 15 |
| Fri | No | 4 | 4 | 4 | 4 | 4 |
| Sat | Yes | 42 | 42 | 42 | 42 | 42 |
| Sat | No | 45 | 45 | 45 | 45 | 45 |
| Sun | Yes | 19 | 19 | 19 | 19 | 19 |
| Sun | No | 57 | 57 | 57 | 57 | 57 |


In [15]:
# Creating a pivot table that counts the 'tip' entries for each combination of 'day' (rows) and 'smoker' (columns)
# Passing values='tip' (a column name) avoids pivoting every column and then selecting ['tip'] afterwards
day_smoker_pivot = tips.pivot_table(values='tip', index='day', columns='smoker', aggfunc='count')
day_smoker_pivot

| day \ smoker | Yes | No |
|---|---|---|
| Thur | 17 | 45 |
| Fri | 15 | 4 |
| Sat | 42 | 45 |
| Sun | 19 | 57 |


*The data is now in the form of a contingency table.*

In [16]:
# Contingency table can also be created directly using pd.crosstab() function to summarize the frequency of occurrences for combinations of two categorical variables
day_smoker_contin = pd.crosstab(tips['day'], tips['smoker'])
day_smoker_contin

| day \ smoker | Yes | No |
|---|---|---|
| Thur | 17 | 45 |
| Fri | 15 | 4 |
| Sat | 42 | 45 |
| Sun | 19 | 57 |


**Stating Hypothesis**

Null hypothesis H0: There is no relationship between the row variable (day of the week) and the column variable (smoker status) in the Tips dataset.

Alternative hypothesis H1: There is a relationship between the row variable (day of the week) and the column variable (smoker status) in the Tips dataset.

**Selecting the Level of Significance or α Level**

Given that the confidence level (1 - α) is 95%, the significance level (α) is 5% (or 0.05). 

**Calculating Test Statistic and P-Value**

In [17]:
# Using the pivot table dataFrame
stats.chi2_contingency(day_smoker_pivot)

Chi2ContingencyResult(statistic=25.787216672396262, pvalue=1.0567572499836523e-05, dof=3, expected_freq=array([[23.63114754, 38.36885246],
       [ 7.24180328, 11.75819672],
       [33.15983607, 53.84016393],
       [28.96721311, 47.03278689]]))

Here, the calculated chi-squared value is approximately 25.79, with 3 degrees of freedom, and the p-value is approximately 1.06e-05.

In [18]:
# Using the contingency table dataFrame
stats.chi2_contingency(day_smoker_contin)

Chi2ContingencyResult(statistic=25.787216672396262, pvalue=1.0567572499836523e-05, dof=3, expected_freq=array([[23.63114754, 38.36885246],
       [ 7.24180328, 11.75819672],
       [33.15983607, 53.84016393],
       [28.96721311, 47.03278689]]))

Here, the calculated chi-squared value is again approximately 25.79 and the p-value is approximately 1.06e-05, confirming that the pivot table and the crosstab give the same result.
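
The individual values can be pulled out of the result by tuple unpacking, which makes the decision step explicit. The sketch below hard-codes the day-by-smoker counts shown above as a NumPy array so that it runs without downloading the dataset; the names `chi2_stat`, `p_value`, `dof`, `expected`, and `reject_h0` are illustrative.

```python
import numpy as np
from scipy import stats

# Rows: Thur, Fri, Sat, Sun; columns: smoker = Yes, No
day_smoker = np.array([[17, 45],
                       [15, 4],
                       [42, 45],
                       [19, 57]])

# The result unpacks into statistic, p-value, degrees of freedom, expected counts
chi2_stat, p_value, dof, expected = stats.chi2_contingency(day_smoker)

alpha = 0.05
reject_h0 = p_value < alpha

print(f"chi2={chi2_stat:.2f}, dof={dof}, p-value={p_value:.2e}, reject H0: {reject_h0}")
```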

**Interpreting Results**

Since the p-value is less than α = 0.05, we reject the null hypothesis. Therefore, we conclude that there is a relationship between the row variable (day of the week) and the column variable (smoker status) in the Tips dataset.