# 🎯 Goal of the Chi-Squared Test: Checking Independence Between Two Categorical Variables

**Let’s assume we want to test whether gender and loan approval are independent.**

Example Dataset  

Data from a bank on 100 clients:

|Gender|Loan approved| Loan denied| Total|
|----------|----------|----------|----------|
|Male|30|20|50|
|Female|10|40|50|
|Total|40|60|100|

❓ **Hypotheses**

- $H_0$ (null hypothesis): Gender and loan approval are independent.  

- $H_1$ (alternative hypothesis): Gender and loan approval are not independent.

## Step 1: Compute Expected Frequencies

**If the variables were independent, the expected frequency for each cell would be:**

$E_{ij} = \dfrac{(\text{Row Total})_i \times (\text{Column Total})_j}{\text{Grand Total}}$

Let's compute them for each cell:

For Male & Approved:  
- $E = \dfrac{50 \times 40}{100}=20$

For Male & Denied:  
- $E = \dfrac{50 \times 60}{100}=30$

For Female & Approved:  
- $E = \dfrac{50 \times 40}{100}=20$

For Female & Denied:  
- $E = \dfrac{50 \times 60}{100}=30$

**Expected Frequency Table:**  

|Gender| Approved| Denied|
|----------|----------|----------|
|Male| 20| 30|
|Female| 20| 30|
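
The four hand calculations above follow one pattern, so they can be checked in a single step. This is a minimal sketch, assuming NumPy is available; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Observed counts from the example table
# rows: Male, Female; columns: Approved, Denied
observed = np.array([[30, 20],
                     [10, 40]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# E_ij = (row total i * column total j) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

The outer product builds every row-total by column-total pair at once, so the same code works for tables larger than 2x2.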

## Step 2: Chi-Squared Statistic

Use the formula:  

$\chi^2 = \sum \dfrac{(O-E)^2}{E}$  

Where $O$ is the observed frequency and $E$ is the expected frequency of each cell.

$\chi^2 = \dfrac{(30-20)^2}{20} + \dfrac{(20-30)^2}{30} + \dfrac{(10-20)^2}{20} + \dfrac{(40-30)^2}{30}$  

$\chi^2 = \dfrac{100}{20} + \dfrac{100}{30} + \dfrac{100}{20} + \dfrac{100}{30} = \dfrac{50}{3} \approx 16.67$
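
The same sum can be checked numerically without rounding the individual terms. This is a minimal sketch, assuming NumPy and the observed and expected tables from the steps above; the variable names are illustrative.

```python
import numpy as np

# rows: Male, Female; columns: Approved, Denied
observed = np.array([[30, 20],
                     [10, 40]])
expected = np.array([[20, 30],
                     [20, 30]])   # expected frequencies from Step 1

# Per-cell contribution (O - E)^2 / E, then the total
contributions = (observed - expected) ** 2 / expected
chi2_stat = contributions.sum()

print(contributions)
print(f"chi-squared = {chi2_stat:.2f}")
```

Printing the per-cell contributions also shows which cells deviate most from independence relative to their expected count.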

## Step 3: Compute the p-value

- Degrees of freedom:  

$df = (rows - 1) \times (columns - 1) = (2 - 1) \times (2 - 1) = 1$

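```python
from scipy.stats import chi2

# Statistic from Step 2: 100/20 + 100/30 + 100/20 + 100/30 = 50/3
chi2_stat = 50 / 3
df = 1
# sf is the survival function (1 - cdf); it is more accurate for very small tail probabilities
p_value = chi2.sf(chi2_stat, df)
print(f"p-value = {p_value:.5f}")
```

```text
p-value = 0.00004
```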


✅ **Conclusion**

Since the p-value (about 0.00004) is below the 0.05 significance level, we reject the null hypothesis.  
- There is strong evidence that gender and loan approval are not independent in this sample.
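
An equivalent way to reach the same decision is to compare the statistic with the critical value of the chi-squared distribution at the chosen significance level. This is a minimal sketch, assuming the 0.05 level used above and the unrounded statistic from Step 2; the variable names are illustrative.

```python
from scipy.stats import chi2

alpha = 0.05          # significance level used in the conclusion
df = 1                # degrees of freedom for a 2x2 table
chi2_stat = 50 / 3    # unrounded statistic from Step 2

# Value that a chi-squared variable with df degrees of freedom
# exceeds with probability alpha under the null hypothesis
critical_value = chi2.ppf(1 - alpha, df)
reject_null = chi2_stat > critical_value

print(f"critical value = {critical_value:.3f}, reject H0: {reject_null}")
```

Rejecting when the statistic exceeds the critical value is the same rule as rejecting when the p-value is below `alpha`.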

## 🐍 Python Code Version

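```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies (rows: Male, Female; columns: Approved, Denied)
data = np.array([[30, 20],
                 [10, 40]])

# For 2x2 tables, chi2_contingency applies Yates' continuity correction by default
chi2_stat, p_value, dof, expected = chi2_contingency(data)

print(f"Chi-squared statistic: {chi2_stat:.2f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:\n", expected)
print(f"p-value: {p_value:.4f}")
```

```text
Chi-squared statistic: 15.04
Degrees of freedom: 1
Expected frequencies:
 [[20. 30.]
 [20. 30.]]
p-value: 0.0001
```

The statistic here (15.04) is lower than the hand-computed value from Step 2 because `chi2_contingency` applies Yates' continuity correction by default when $df = 1$. The correction subtracts 0.5 from each $|O - E|$ before squaring. The conclusion is unchanged: the p-value is still far below 0.05.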
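
To compare the library output with the manual calculation in Step 2, a minimal sketch can call `chi2_contingency` twice: once with its default settings and once with `correction=False`, which turns off Yates' continuity correction. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows: Male, Female; columns: Approved, Denied
data = np.array([[30, 20],
                 [10, 40]])

# Default behaviour: continuity correction is applied when df = 1
stat_yates, p_yates, dof, _ = chi2_contingency(data)

# Uncorrected Pearson statistic, matching the sum of (O - E)^2 / E
stat_plain, p_plain, _, _ = chi2_contingency(data, correction=False)

print(f"With correction:    chi2 = {stat_yates:.2f}, p = {p_yates:.5f}")
print(f"Without correction: chi2 = {stat_plain:.2f}, p = {p_plain:.5f}")
```

The uncorrected call is the one to use when checking a hand calculation. The corrected value is more conservative.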

