<font color = "red" size = 12>Chi-Square (χ²) Test</font>

<font color = "red" >Chi-Square (χ²) Distribution</font>


The **Chi-Square distribution** is a type of probability distribution commonly used in **statistical hypothesis testing**. It plays a key role in tests that measure how well an observed dataset fits an expected pattern, such as **goodness-of-fit tests** and **tests for independence**.

**How It Works**
- It arises as the sum of the **squares** of independent standard normal variables.
- The **degrees of freedom (df)**, the number of squared variables being summed, determine its shape and spread. More degrees of freedom lead to a more **symmetrical shape**.
- It is always **non-negative**, because it is a sum of squares.
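
The construction in the first bullet can be checked empirically. The following is a minimal sketch, assuming a fixed seed, `df = 3`, and 20,000 draws (all illustrative choices, not taken from these notes). It builds each value by summing squared standard normal draws, so no value can be negative.

```python
import random

def simulate_chi_square(df, n_draws, seed=0):
    """Draw n_draws values, each the sum of df squared standard normal variables."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(n_draws)]

df = 3  # illustrative choice
samples = simulate_chi_square(df, n_draws=20_000)

sample_mean = sum(samples) / len(samples)
print(f"smallest draw: {min(samples):.4f}")  # never negative, since squares are summed
print(f"sample mean:   {sample_mean:.3f}")   # should land close to df
```

Raising `df` in this sketch and plotting a histogram of `samples` is one way to see the shape become more symmetrical.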

**Key Properties**
1. **Continuous distribution:** It deals with a range of possible values rather than discrete outcomes.
2. **Positively skewed:** The distribution has a long tail to the right, most pronounced when the **degrees of freedom** are small.
3. **Mean & Variance:**
   - Mean = Degrees of freedom (df)
   - Variance = **2 × Degrees of freedom**
4. **Approaches a normal distribution:** As the degrees of freedom increase, the skew shrinks and the shape becomes close to normal.
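
These properties can be read off directly from `scipy.stats.chi2`. The sketch below assumes SciPy is installed and uses three arbitrary df values; it asks the distribution for its mean, variance, and skewness so the trend toward symmetry is visible in numbers.

```python
from scipy import stats

results = {}
for df in (2, 10, 50):  # arbitrary df values chosen for illustration
    mean, var, skew = stats.chi2.stats(df, moments="mvs")
    results[df] = (float(mean), float(var), float(skew))
    print(f"df={df:>2}  mean={results[df][0]:.1f}  "
          f"variance={results[df][1]:.1f}  skewness={results[df][2]:.3f}")
```

The mean equals df, the variance equals 2 × df, and the skewness falls as df grows.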

**Where It’s Used**
- **Goodness-of-Fit Test:** Determines if an observed dataset matches an expected distribution.
- **Test for Independence:** Used in contingency tables to check relationships between categorical variables.

**Note: the file `6_code` in this folder contains code to help you understand the chi-square distribution.**



<font color = "red">**Chi-Square Goodness-of-Fit Test**</font>

The **Chi-Square Goodness-of-Fit test** helps determine whether the observed data of a **single categorical variable** matches an expected theoretical distribution (such as uniform, binomial, or Poisson). It is useful for checking whether sample data follows a specific probability pattern or significantly deviates from it.

**Steps Involved**
1. **Define Hypotheses**  
   - **Null hypothesis (H₀):** The observed data follows the expected theoretical distribution.  
   - **Alternative hypothesis (H₁):** The observed data does **not** follow the expected theoretical distribution.  

2. **Calculate Expected Frequencies**  
   - Use the theoretical distribution and sample size to estimate the expected values for each category.  

3. **Compute the Chi-Square Test Statistic**  
   - The formula used is:  
     $$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$  
     - Where:
       - $O_i$ = Observed frequency of category $i$
       - $E_i$ = Expected frequency of category $i$
       - The summation runs over all categories.  

4. **Determine Degrees of Freedom (df)**  
   - Formula: **df = (Number of categories - 1)**  

5. **Find the p-value**  
   - Compare the test statistic ($\chi^2$) to the **Chi-Square distribution** based on the degrees of freedom.  
   - If the **p-value is small (typically < 0.05)**, reject the **null hypothesis**, meaning the data does **not** fit the expected distribution.  
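
The five steps map onto a few lines of code. This is a minimal sketch, assuming SciPy is available; the function name `chi_square_gof` and the three-category counts are hypothetical and serve only to illustrate the procedure.

```python
from scipy import stats

def chi_square_gof(observed, expected_probs, alpha=0.05):
    """Chi-square goodness-of-fit test for one categorical variable.

    observed        -- observed counts, one per category
    expected_probs  -- category probabilities under H0 (should sum to 1)
    """
    n = sum(observed)
    expected = [n * p for p in expected_probs]                              # step 2
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # step 3
    df = len(observed) - 1                                                  # step 4
    p_value = stats.chi2.sf(statistic, df)                                  # step 5
    return statistic, df, p_value, bool(p_value < alpha)

# Hypothetical counts for three categories assumed equally likely under H0.
statistic, df, p_value, reject = chi_square_gof([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])
print(f"chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.4f}, reject H0: {reject}")
```

`stats.chi2.sf` returns the upper-tail probability, which is the p-value for this test because only large values of the statistic count against H₀.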

**Key Assumptions**
- **Independent observations:** Each data point must be independent.
- **Categorical data:** The variable being analyzed must be categorical (not continuous).
- **Expected frequencies:** Each category should have an expected frequency of at least **5** for reliable results.
- **Fixed distribution:** The theoretical distribution being compared to the observed data should be specified before the test is conducted. It is essential to avoid choosing a distribution based on the observed data, as doing so can lead to biased results.



**The Chi-Square Goodness-of-Fit test is a non-parametric test. Non-parametric tests do not assume that the data come from a specific probability distribution, nor do they make assumptions about population parameters such as the mean or standard deviation.**

**Example:**

Suppose a marketing team at a retail company wants to understand the distribution of visits to their website by day of the week. They hypothesize that visits are uniformly distributed across all days of the week, meaning they expect an equal number of visits on each day. They collected data on website visits for four weeks and want to test whether the observed distribution matches the expected uniform distribution.

Observed frequencies (number of website visits per day of the week for four weeks):
- Monday: 420
- Tuesday: 380
- Wednesday: 410
- Thursday: 400
- Friday: 410
- Saturday: 430
- Sunday: 390
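
The notes state this problem without working through it. One way to run the test is sketched below, assuming SciPy is available and using `scipy.stats.chisquare`, which compares the counts with equal expected frequencies when no `f_exp` is given.

```python
from scipy import stats

visits = {"Monday": 420, "Tuesday": 380, "Wednesday": 410, "Thursday": 400,
          "Friday": 410, "Saturday": 430, "Sunday": 390}
observed = list(visits.values())

# With no f_exp argument, chisquare assumes all categories are equally likely,
# i.e. an expected count of sum(observed) / 7 for every day.
result = stats.chisquare(observed)
statistic, p_value = result.statistic, result.pvalue

df = len(observed) - 1
critical_value = stats.chi2.ppf(0.95, df)  # alpha = 0.05

print(f"chi-square = {statistic:.3f}, df = {df}")
print(f"critical value = {critical_value:.2f}, p-value = {p_value:.3f}")
```

With these counts the statistic is about 4.37, below the critical value for 6 degrees of freedom (about 12.59), so the uniform hypothesis would not be rejected at α = 0.05.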


**Chi-Square Goodness-of-Fit Test for a Six-Sided Die**

**Step 1: Define Hypotheses**
- **Null Hypothesis (H₀)**: The die is fair, meaning each side has an equal probability of appearing.
- **Alternative Hypothesis (H₁)**: The die is not fair, meaning at least one side appears with a different probability.

**Step 2: Observed and Expected Frequencies**

**Observed Frequencies:**
| Side | Observed Frequency (O) |
|------|------------------------|
| 1    | 12                     |
| 2    | 8                      |
| 3    | 11                     |
| 4    | 9                      |
| 5    | 10                     |
| 6    | 10                     |

**Expected Frequencies:**
If the die is fair, each side should appear equally. The total number of rolls is **60**, and there are **6** sides:


$$
E = \frac{60}{6} = 10
$$



| Side | Expected Frequency (E) |
|------|------------------------|
| 1    | 10                     |
| 2    | 10                     |
| 3    | 10                     |
| 4    | 10                     |
| 5    | 10                     |
| 6    | 10                     |

**Step 3: Calculate Chi-Square Statistic**

The Chi-Square statistic is calculated as:



$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$



For each side:



$$
\chi^2 = \frac{(12 - 10)^2}{10} + \frac{(8 - 10)^2}{10} + \frac{(11 - 10)^2}{10} + \frac{(9 - 10)^2}{10} + \frac{(10 - 10)^2}{10} + \frac{(10 - 10)^2}{10}
$$





$$
\chi^2 = \frac{4}{10} + \frac{4}{10} + \frac{1}{10} + \frac{1}{10} + \frac{0}{10} + \frac{0}{10}
$$





$$
\chi^2 = 0.4 + 0.4 + 0.1 + 0.1 + 0 + 0
$$





$$
\chi^2 = 1.0
$$



**Step 4: Determine Degrees of Freedom**

The degrees of freedom (df) is calculated as:



$$
df = k - 1 = 6 - 1 = 5
$$



**Step 5: Find Critical Value or P-value**

Using a chi-square table at a significance level (α) of 0.05 and **df = 5**, the critical value is approximately **11.07**.

**Step 6: Conclusion**

- Since **χ² = 1.0** is much smaller than the critical value **11.07**, we **fail to reject** the null hypothesis.
- There is **no significant evidence** to suggest that the die is unfair.

**Final Answer:**

The observed data is consistent with a fair die.
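
The hand calculation for the die can be reproduced with plain Python. This sketch uses only the numbers from the tables above; the critical value 11.07 is the table value quoted in Step 5, not computed here.

```python
observed = [12, 8, 11, 9, 10, 10]                       # counts for sides 1..6
n_rolls = sum(observed)                                 # 60
expected = [n_rolls / len(observed)] * len(observed)    # 10 per side for a fair die

# Per-side contribution (O - E)^2 / E, then the total statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

df = len(observed) - 1
critical_value = 11.07  # alpha = 0.05, df = 5, taken from the chi-square table
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_h0}")
```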

<font color = "red" >**Example - 2**</font>


**Step 1: Define Hypotheses**

- **Null Hypothesis (H₀)**: Male and female births are equally probable, meaning each child has a 50% chance of being a boy or a girl.
- **Alternative Hypothesis (H₁)**: Male and female births are not equally probable.



**Step 2: Observed and Expected Frequencies**

From a survey of 800 families, each with 4 children, we observed the following distribution:

| Girls (G) | Boys (B) | Number of Families (Observed O) |
|-----------|---------|--------------------------------|
| 4         | 0       | 32                             |
| 3         | 1       | 178                            |
| 2         | 2       | 290                            |
| 1         | 3       | 236                            |
| 0         | 4       | 64                             |

If male and female births are equally probable (p = 0.5 for each), the expected number of families for each case follows a **Binomial Distribution**:

$$
E = P(X = k) \times 800
$$

Where $ P(X = k) $ is the binomial probability of having $ k $ girls in a family of 4 children, calculated using:

$$
P(X = k) = \binom{4}{k} \times (0.5)^k \times (0.5)^{(4-k)}
$$

Calculating each expected probability:

$$
P(X = 4) = \frac{4!}{4!(0!)} \times (0.5)^4 = 0.0625
$$
$$
P(X = 3) = \frac{4!}{3!(1!)} \times (0.5)^4 = 0.25
$$
$$
P(X = 2) = \frac{4!}{2!(2!)} \times (0.5)^4 = 0.375
$$
$$
P(X = 1) = \frac{4!}{1!(3!)} \times (0.5)^4 = 0.25
$$
$$
P(X = 0) = \frac{4!}{0!(4!)} \times (0.5)^4 = 0.0625
$$

Multiplying by 800 families:

| Girls (G) | Expected Probability $ P(X = k) $ | Expected Families $E $ |
|-----------|-----------------------------------|---------------------------|
| 4         | 0.0625                            | 50                        |
| 3         | 0.25                              | 200                       |
| 2         | 0.375                             | 300                       |
| 1         | 0.25                              | 200                       |
| 0         | 0.0625                            | 50                        |
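
The expected counts in this table can be regenerated with the standard library. A small sketch, assuming the same setup as above (4 children per family, p = 0.5, 800 families); the variable names are illustrative.

```python
from math import comb

n_children, p_girl, n_families = 4, 0.5, 800

expected = {}
for k in range(n_children, -1, -1):  # k = number of girls, from 4 down to 0
    prob = comb(n_children, k) * p_girl**k * (1 - p_girl) ** (n_children - k)
    expected[k] = prob * n_families
    print(f"girls = {k}: P = {prob:.4f}, expected families = {expected[k]:.0f}")
```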

---

**Step 3: Compute Chi-Square Statistic**

The formula for the **Chi-Square statistic** is:

$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$

Plugging in values:

$$
\chi^2 = \frac{(32 - 50)^2}{50} + \frac{(178 - 200)^2}{200} + \frac{(290 - 300)^2}{300} + \frac{(236 - 200)^2}{200} + \frac{(64 - 50)^2}{50}
$$

$$
= \frac{(-18)^2}{50} + \frac{(-22)^2}{200} + \frac{(-10)^2}{300} + \frac{(36)^2}{200} + \frac{(14)^2}{50}
$$

$$
= \frac{324}{50} + \frac{484}{200} + \frac{100}{300} + \frac{1296}{200} + \frac{196}{50}
$$

$$
= 6.48 + 2.42 + 0.33 + 6.48 + 3.92
$$

$$
\chi^2 = 19.63
$$



**Step 4: Determine Degrees of Freedom**

The degrees of freedom ($df$) are:

$$
df = \text{categories} - 1 = 5 - 1 = 4
$$



**Step 5: Find Critical Value or P-value**

Using the **Chi-Square distribution table** at a significance level ($\alpha$) of 0.05 and **df = 4**, the **critical value** is approximately **9.49**.

Since **χ² = 19.63** is **greater** than the critical value **9.49**, we **reject** the null hypothesis.

The corresponding p-value is approximately **0.0006**.

**Step 6: Conclusion**
- Since the p-value ($\approx 0.0006$) is less than $\alpha = 0.05$, we **reject the null hypothesis**.
- There is **statistically significant evidence** that male and female births **are NOT equally probable**.

This means that the observed data **does not** match the assumption that each child has an equal chance of being male or female.
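
The whole of Example 2 can be cross-checked in one call. The sketch below assumes SciPy is available and passes the observed and expected counts from the tables above to `scipy.stats.chisquare`.

```python
from scipy import stats

observed = [32, 178, 290, 236, 64]   # families with 4, 3, 2, 1, 0 girls
expected = [50, 200, 300, 200, 50]   # binomial expectation with p = 0.5

result = stats.chisquare(f_obs=observed, f_exp=expected)
statistic, p_value = result.statistic, result.pvalue

df = len(observed) - 1
critical_value = stats.chi2.ppf(0.95, df)  # alpha = 0.05

print(f"chi-square = {statistic:.2f}, df = {df}")
print(f"critical value = {critical_value:.2f}, p-value = {p_value:.4f}")
```

`chisquare` uses df = (number of categories − 1) by default, which matches Step 4.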


<font color = "red">**Chi-Square Test for Independence**</font>

The **Chi-Square Test for Independence**, also known as the **Chi-Square Test for Association**, helps determine whether two categorical variables are related. It assesses whether the distribution of one variable differs across the categories of the other, or whether the two are independent.

**How It Works**
This test compares **observed frequencies** in a **contingency table** (which shows how different categories of two variables intersect) with the **expected frequencies** calculated under the assumption that the variables are independent.



**Steps to Perform the Chi-Square Test**
1. **Define Hypotheses**
   - **Null Hypothesis (H₀):** No association exists between the two categorical variables—they are independent.
   - **Alternative Hypothesis (H₁):** There is an association between the two variables—they are dependent.
   
2. **Create a Contingency Table**
   - Organize observed values for different categories of the two variables.

3. **Calculate Expected Frequencies**
   - Compute what the values would be **if** the variables were independent.

4. **Compute the Chi-Square Test Statistic**  
   $$
   \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
   $$
   where:
   - $O_{ij}$ = observed frequency in cell $(i, j)$
   - $E_{ij}$ = expected frequency in cell $(i, j)$

5. **Determine Degrees of Freedom (df)**  
   $$
   df = (rows - 1) \times (columns - 1)
   $$

6. **Compare with Critical Value or P-Value**  
   - Use a **Chi-Square distribution table** or statistical software to get the **p-value**.
   - If **p-value < 0.05**, reject **H₀**, meaning there **is** a significant association.

7. **Draw Conclusions**
   - If **H₀ is rejected**, the variables are **associated**.
   - If **H₀ is not rejected**, the data show no significant evidence of an association; this is consistent with independence but does not prove it.



**Key Assumptions**
- **Independence of Observations**: Each observation in the sample should be independent.
- **Categorical Variables**: The test is meant for **categorical data** (not continuous).
- **Adequate Sample Size**: Each cell should have **expected frequency ≥ 5** for accuracy.
- **Fixed Marginal Totals**: The total counts for rows and columns should not change based on variable relationships.

<font color = "red">**Example 1**</font>

Problem Statement: A researcher wants to investigate if there is an association between education level (a categorical variable) and preference for a particular type of exercise (also a categorical variable) among a group of 150 individuals. The researcher collects data and organizes it into a contingency table, displaying the observed frequencies for different education levels and exercise types.

Using the Chi-Square Test for Independence, determine whether there is a significant association between education level and exercise preference.

Given Contingency Table

| Education Level | Yoga | Running | Swimming | Total |
|---------------|------|---------|---------|------|
| High School  | 25   | 20      | 15      | 60   |
| Bachelor's   | 10   | 25      | 30      | 65   |
| Master's or PhD | 5 | 15      | 5       | 25   |
| **Total**    | 40   | 60      | 50      | 150  |

**Core Idea: Probability and Independence**
When two variables are **independent**, their joint probability is simply the product of their individual probabilities.

Mathematically, if we have **events A and B**, their joint probability under independence is:

$$
P(A \cap B) = P(A) \times P(B)
$$

In the **Chi-Square Test**, we use this principle to estimate expected frequencies in a **contingency table**.



**Step-by-Step Explanation**

**Step 1: Define Hypotheses**
- **Null Hypothesis (H₀):** The two categorical variables are **independent**.
- **Alternative Hypothesis (H₁):** The two categorical variables are **dependent**.

**Step 2: Contingency Table**
A contingency table organizes the **observed frequencies** for each combination of categories across two categorical variables.

| Education Level | Yoga | Running | Swimming | Total |
|---------------|------|---------|---------|------|
| High School  | 25   | 20      | 15      | 60   |
| Bachelor's   | 10   | 25      | 30      | 65   |
| Master's or PhD | 5 | 15      | 5       | 25   |
| **Total**    | 40   | 60      | 50      | 150  |



**Step 3: Deriving the Expected Frequency Formula**

**Concept Behind Expected Frequency Calculation**
- If the variables are **independent**, the probability of any cell occurring is:

  $$
  P(\text{Row Category} \cap \text{Column Category}) = P(\text{Row Category}) \times P(\text{Column Category})
  $$

- Since probability is **relative frequency**, we replace probabilities with their observed **marginal frequencies** (row totals and column totals) divided by the grand total:

  $$
  E_{ij} = \left(\frac{\text{Row Total}}{\text{Grand Total}}\right) \times \left(\frac{\text{Column Total}}{\text{Grand Total}}\right) \times \text{Grand Total}
  $$

- This simplifies to:

  $$
  E_{ij} = \frac{(\text{Row Total} \times \text{Column Total})}{\text{Grand Total}}
  $$


**Step 4: Expected Frequency Calculation Example**
For **High School & Yoga**, we use the formula:

$$
E_{11} = \frac{(60 \times 40)}{150} = \frac{2400}{150} = 16
$$

Computing the expected values for the entire table:

| Education Level | Yoga (Expected) | Running (Expected) | Swimming (Expected) |
|---------------|----------------|----------------|----------------|
| High School  | 16  | 24  | 20  |
| Bachelor's   | 17.33  | 26  | 21.67 |
| Master's or PhD | 6.67 | 10 | 8.33 |
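
The expected table follows mechanically from the row and column totals. A short sketch in plain Python, using the observed counts from the contingency table (rows: High School, Bachelor's, Master's or PhD; columns: Yoga, Running, Swimming):

```python
observed = [
    [25, 20, 15],   # High School
    [10, 25, 30],   # Bachelor's
    [5, 15, 5],     # Master's or PhD
]

row_totals = [sum(row) for row in observed]          # [60, 65, 25]
col_totals = [sum(col) for col in zip(*observed)]    # [40, 60, 50]
grand_total = sum(row_totals)                        # 150

# E_ij = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print("  ".join(f"{value:6.2f}" for value in row))
```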



**Step 5: Compute Chi-Square Statistic**
Using:

$$
\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

For High School & Yoga:

$$
\frac{(25 - 16)^2}{16} = \frac{81}{16} = 5.06
$$

Compute similarly for each cell, then sum them up.


$$
\chi^2 \approx 5.06 + 0.67 + 1.25 + 3.10 + 0.04 + 3.21 + 0.42 + 2.50 + 1.33 = 17.58
$$
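
The per-cell contributions can be computed without rounding the expected values first. This sketch in plain Python recomputes them from the observed table; the total may differ from a hand sum in the last digit because the hand sum adds rounded terms.

```python
observed = [
    [25, 20, 15],   # High School
    [10, 25, 30],   # Bachelor's
    [5, 15, 5],     # Master's or PhD
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

def cell_contribution(o, row_total, col_total):
    """(O - E)^2 / E for one cell, with E taken from the marginal totals."""
    e = row_total * col_total / grand_total
    return (o - e) ** 2 / e

contributions = [
    [cell_contribution(o, r, c) for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]
chi_square = sum(sum(row) for row in contributions)

for row in contributions:
    print("  ".join(f"{value:5.2f}" for value in row))
print(f"chi-square = {chi_square:.2f}")
```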

**Step 6: Degrees of Freedom**
$$
df = (\text{Rows} - 1) \times (\text{Columns} - 1) = (3-1) \times (3-1) = 2 \times 2 = 4
$$



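**Step 7: Compare to Critical Value**
- Look up the **Chi-Square distribution table** at **α = 0.05** for **df = 4**; the critical value is approximately **9.49**.
- The computed statistic, $\chi^2 \approx 17.6$, is **greater** than the critical value, so we reject $H_0$.



**Step 8: Conclusion**
- Since **H₀ is rejected**, there is statistically significant evidence that education level and exercise preference are **associated**.
- Had H₀ not been rejected, we would conclude only that the data show no significant evidence of an association, not that the variables are proven independent.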
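
For this example the full test can also be run with `scipy.stats.chi2_contingency`, which returns the statistic, p-value, degrees of freedom, and expected table in one call. A minimal sketch, assuming SciPy is available:

```python
from scipy import stats

# Observed counts: rows are education levels, columns are Yoga, Running, Swimming.
table = [
    [25, 20, 15],   # High School
    [10, 25, 30],   # Bachelor's
    [5, 15, 5],     # Master's or PhD
]

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
critical_value = stats.chi2.ppf(0.95, dof)  # alpha = 0.05

print(f"chi-square = {chi2_stat:.2f}, df = {dof}")
print(f"critical value = {critical_value:.2f}, p-value = {p_value:.4f}")
print("reject H0:", chi2_stat > critical_value)
```

The default continuity correction in `chi2_contingency` applies only when df = 1, so it has no effect on this 3 × 3 table.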



**Key Assumptions**
- **Independence of Observations:** Each observation should be independent.
- **Categorical Variables:** The test applies to **categorical** (not continuous) data.
- **Adequate Sample Size:** Expected frequency in each cell should be **≥ 5**.
- **Fixed Marginal Totals:** Row and column totals should be predefined.


<font color = "red">**Chi-Square Test Applications in Machine Learning**</font>

The **Chi-Square Test** is widely used in **machine learning** and **data analysis** to evaluate relationships between categorical variables, improve model efficiency, and optimize feature selection.



**1. Feature Selection**  
- The **Chi-Square Test** acts as a **filter-based feature selection method**.
- It ranks and selects the most relevant categorical features in a dataset.
- By measuring **associations between categorical features and the target variable**, irrelevant or redundant features can be eliminated.
- This improves **model performance** and **efficiency**.
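
As a rough illustration of filter-based selection, the sketch below scores each categorical feature by its chi-square statistic against the target and ranks the features. The data, the feature names `plan` and `colour`, and the helper `chi2_score` are all hypothetical; SciPy is assumed to be available.

```python
from collections import Counter
from scipy import stats

def chi2_score(feature, target):
    """Chi-square statistic and p-value for one categorical feature vs. the target."""
    counts = Counter(zip(feature, target))
    rows = sorted(set(feature))
    cols = sorted(set(target))
    table = [[counts[(r, c)] for c in cols] for r in rows]   # contingency table
    statistic, p_value, _, _ = stats.chi2_contingency(table, correction=False)
    return statistic, p_value

# Hypothetical toy data: "plan" tracks the target closely, "colour" does not.
target = ["yes"] * 20 + ["no"] * 20
features = {
    "plan": ["pro"] * 18 + ["free"] * 2 + ["pro"] * 3 + ["free"] * 17,
    "colour": ["red", "blue"] * 20,
}

scores = {name: chi2_score(values, target) for name, values in features.items()}
ranked = sorted(scores, key=lambda name: scores[name][0], reverse=True)

for name in ranked:
    statistic, p_value = scores[name]
    print(f"{name:<7} chi-square = {statistic:6.2f}  p-value = {p_value:.4f}")
```

Features with a large statistic (small p-value) are candidates to keep; those near zero carry little information about the target in this sample.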



**2. Evaluation of Classification Models**  
- In **multi-class classification problems**, the **Chi-Square Test** can compare **observed vs. expected** class frequencies in a **confusion matrix**.
- Helps assess the **goodness of fit** of a model—showing how well predictions align with actual class distributions.



**3. Analyzing Relationships Between Categorical Features**  
- Used in **exploratory data analysis** to find **associations between categorical features**.
- Identifying **relationships between variables** informs **feature engineering**.
- Provides **insights into dataset structure** for better modeling decisions.


<font color = "red">
**4. Discretization of Continuous Variables**  
- When converting **continuous variables** into **categorical bins**, the **Chi-Square Test** can help determine **optimal binning intervals**.
- Ensures **effective representation** of continuous variables when mapped to categorical target variables.


Note: this item is highlighted in red because it still needs to be studied.

</font>



**5. Variable Selection in Decision Trees**  
- Certain **decision tree algorithms**, such as **CHAID (Chi-Squared Automatic Interaction Detection)**, use the **Chi-Square Test** to:
  - Identify **most significant splitting variables**.
  - Improve **tree interpretability and efficiency**.
  - Ensure decisions are **data-driven** rather than arbitrary.



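```python
import scipy.stats as stats

# p-value for Example 2 of the goodness-of-fit test (chi-square = 19.63, df = 4).
test_statistic = 19.63
df = 4

# sf is the survival function (1 - cdf), i.e. the upper-tail probability.
p_value = stats.chi2.sf(test_statistic, df)
print(p_value)  # 0.0005907805433338796
```

```python
import scipy.stats as stats

# Critical value at alpha = 0.05 with df = 5 (the six-sided die example).
critical_value = stats.chi2.ppf(0.95, 5)
print(critical_value)  # 11.070497693516351
```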