# Content

### The Core Idea: The Chi-Square (χ²) Statistic

Both tests are built around the same fundamental idea. We compare the **Observed Counts** in our data table to the **Expected Counts** we would see, on average, *if the null hypothesis were true*.

The **Chi-Square (χ²) test statistic** measures how far the observed counts are from the expected counts. A large value means a big discrepancy, while a small value means the observed data is very close to what was expected.

*   **Formula:** **χ² = Σ [ (Observed - Expected)² / Expected ]**
    *   `Observed`: The actual count from your sample data in a given cell.
    *   `Expected`: The count you would expect in that cell if H₀ were true.
    *   You calculate the `(O-E)²/E` value for **every cell** in the table and then **sum (Σ)** them all up.

A large χ² value is evidence against the null hypothesis. We use this statistic to find a P-value from a **chi-square distribution**, which for a two-way table is defined by **degrees of freedom (df) = (number of rows - 1) * (number of columns - 1)**. Do not count the total row or total column when finding df.
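The formula and the degrees-of-freedom rule can be sketched in a few lines of plain Python. The function names and the toy 2×2 counts below are made up for illustration; they do not come from any dataset in these notes.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of two same-shaped tables."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total


def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1); the table must not include totals."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)


# Toy 2x2 example (hypothetical counts, chosen so the arithmetic is easy).
# Every cell is 2 away from its expected count of 10, so each cell
# contributes 2^2 / 10 = 0.4 to the statistic.
toy_observed = [[12, 8], [8, 12]]
toy_expected = [[10, 10], [10, 10]]

toy_chi_sq = chi_square_statistic(toy_observed, toy_expected)
toy_df = degrees_of_freedom(toy_observed)
print(f"chi-square = {toy_chi_sq:.2f}, df = {toy_df}")
```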


### The Chi-Square Test for Homogeneity

#### Theory
*   **The Question:** "Is the distribution of a single categorical variable the **same** (homogeneous) across two or more different populations?"
*   **The Data Collection:** You take **multiple independent random samples**, one from each population you want to compare. Then, you classify each subject in those samples according to one categorical variable.

#### Hypotheses
*   **Null Hypothesis (H₀):** The distribution of [categorical variable] is the same for all populations.
*   **Alternative Hypothesis (Hₐ):** The distribution of [categorical variable] is *not* the same for all populations.

#### Step-by-Step Example
**Scenario:** A marketing firm wants to know if the distribution of preferred social media platform is the same for different age groups. They take a random sample of 100 Teenagers, a separate random sample of 100 Young Adults, and a third separate random sample of 100 Adults.

**Observed Data Table:**

| | Teenagers | Young Adults | Adults | **Row Total** |
| :--- | :---: | :---: | :---: | :---: |
| **TikTok** | 60 | 40 | 10 | **110** |
| **Instagram** | 30 | 45 | 25 | **100** |
| **Facebook** | 10 | 15 | 65 | **90** |
| **Column Total** | **100** | **100** | **100** | **300** |

**Calculations (The "DO" Step):**
1.  **Calculate Expected Counts:** The formula for each cell is **(Row Total * Column Total) / Grand Total**.
    *   Expected(Teenagers, TikTok) = (110 * 100) / 300 ≈ **36.7**
    *   Expected(Adults, Facebook) = (90 * 100) / 300 = **30.0**
    *   ...and so on for every cell.
2.  **Calculate the χ² statistic:** For each cell, find `(O-E)²/E` and sum them.
    *   Cell (Teenagers, TikTok): `(60 - 36.7)² / 36.7 ≈ 14.8`
    *   Cell (Adults, Facebook): `(65 - 30.0)² / 30.0 ≈ 40.8`
    *   ...after doing this for all 9 cells, you sum them up to get the final `χ²` value. For this table it comes out to **`χ² ≈ 102.7`**.
3.  **Find the P-value:**
    *   `df = (3 rows - 1) * (3 columns - 1) = 2 * 2 = 4`.
    *   The P-value is the probability of getting a `χ²` statistic of 102.7 or greater on a chi-square distribution with df = 4. This P-value is extremely small (virtually zero).
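The "DO" step for the social media table can be checked with a short script. This is a minimal sketch in plain Python; the variable names are illustrative, and the counts are copied from the observed data table without the totals.

```python
# Observed counts (rows: TikTok, Instagram, Facebook;
# columns: Teenagers, Young Adults, Adults). Totals are not included.
observed = [
    [60, 40, 10],
    [30, 45, 25],
    [10, 15, 65],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: (row total * column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Per-cell contribution (O - E)^2 / E, then the sum over all 9 cells.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi_sq = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"Expected(Teenagers, TikTok) = {expected[0][0]:.1f}")
print(f"Contribution(Adults, Facebook) = {contributions[2][2]:.1f}")
print(f"chi-square = {chi_sq:.1f}, df = {df}")
```

Keeping the per-cell contributions in their own table makes it easy to see which cells drive the statistic; here the (Adults, Facebook) cell is the largest single contributor.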

**Conclusion:**
*   Since the P-value is less than any reasonable `α` (like 0.05), we **reject H₀**. We have very strong evidence that the distribution of preferred social media platforms is **not the same** across these three age groups.

### The Chi-Square Test for Association (Independence)

#### Theory
*   **The Question:** "Is there an **association** (or relationship) between two categorical variables within a **single population**?" This is equivalent to asking if the two variables are **independent**.
*   **The Data Collection:** You take **one single random sample** from one population. Then, you classify each subject according to **two different categorical variables**.

#### Hypotheses
*   **Null Hypothesis (H₀):** There is no association between [Variable 1] and [Variable 2]. (They are independent).
*   **Alternative Hypothesis (Hₐ):** There is an association between [Variable 1] and [Variable 2]. (They are dependent).

#### Step-by-Step Example
**Scenario:** A sociologist takes a single random sample of 300 university students and asks each of them two questions: 1) "What is your primary academic division?" and 2) "Do you live on-campus or off-campus?"

**Observed Data Table:**

| | On-Campus | Off-Campus | **Row Total** |
| :--- | :---: | :---: | :---: |
| **Arts & Sciences** | 70 | 50 | **120** |
| **Engineering** | 40 | 40 | **80** |
| **Business** | 30 | 70 | **100** |
| **Column Total** | **140** | **160** | **300** |

**Calculations (The "DO" Step):**
The process is identical to the test for homogeneity.
1.  **Calculate Expected Counts:** Formula is **(Row Total * Column Total) / Grand Total**.
    *   Expected(Arts & Sci, On-Campus) = (120 * 140) / 300 = **56.0**
    *   Expected(Business, Off-Campus) = (100 * 160) / 300 ≈ **53.3**
2.  **Calculate the χ² statistic:** Sum the `(O-E)²/E` for all cells.
    *   Cell (Arts & Sci, On-Campus): `(70 - 56.0)² / 56.0 ≈ 3.5`
    *   ...after summing all 6 cells, we get **`χ² ≈ 18.1`**.
3.  **Find the P-value:**
    *   `df = (3 rows - 1) * (2 columns - 1) = 2 * 1 = 2`.
    *   The P-value is `P(χ²₂ ≥ 18.1)`. This is a very small number, `≈ 0.0001`.

**Conclusion:**
*   Since the P-value (≈ 0.0001) is less than `α` (0.05), we **reject H₀**. We have strong evidence of an **association** between a student's academic division and their housing status. The data suggest that the two variables are not independent.
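The calculation for the housing table can be sketched the same way. The version below also computes the P-value using a shortcut that is valid only when df = 2, where the upper-tail probability of the chi-square distribution reduces to `exp(-χ²/2)`. For other degrees of freedom a table or software is needed. Variable names are illustrative.

```python
import math

# Observed counts (rows: Arts & Sciences, Engineering, Business;
# columns: On-Campus, Off-Campus). Totals are not included.
observed = [
    [70, 50],
    [40, 40],
    [30, 70],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Shortcut that only holds for df = 2: P(chi-square >= x) = exp(-x / 2).
if df != 2:
    raise ValueError("the exp(-x/2) shortcut applies only when df = 2")
p_value = math.exp(-chi_sq / 2)

print(f"chi-square = {chi_sq:.2f}, df = {df}, P-value = {p_value:.5f}")
```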

---

### Conditions for Both Chi-Square Tests

For the results of either test to be valid, we must check three conditions:
1.  **Random:** The data must be collected using a random sample or samples.
2.  **Independent (10% Rule):** When sampling without replacement, each sample size should be no more than 10% of its respective population.
3.  **Large Counts:** **All expected counts must be greater than or equal to 5.** Check the *expected* counts, not the observed counts. If any expected count is less than 5, the chi-square distribution may not approximate the test statistic well, so the P-value cannot be trusted.
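The Large Counts condition is easy to check in code before running either test. The helper names below are hypothetical, and the third table is an invented small-sample example included only to show a failing case.

```python
def expected_counts(observed):
    """Expected counts under H0: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def large_counts_ok(observed, minimum=5):
    """Return (condition met?, smallest expected count)."""
    smallest = min(min(row) for row in expected_counts(observed))
    return smallest >= minimum, smallest


social_media = [[60, 40, 10], [30, 45, 25], [10, 15, 65]]
housing = [[70, 50], [40, 40], [30, 70]]
tiny_sample = [[3, 2], [4, 11]]  # hypothetical table, not from these notes

for name, table in [
    ("social media", social_media),
    ("housing", housing),
    ("tiny sample", tiny_sample),
]:
    ok, smallest = large_counts_ok(table)
    status = "met" if ok else "NOT met"
    print(f"{name}: smallest expected count = {smallest:.2f} -> {status}")
```

Both example tables from these notes pass the check. The invented table fails because its smallest expected count is (5 * 7) / 20 = 1.75.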

---

### Python Code Illustration

The `scipy.stats.chi2_contingency` function is used for both tests. It takes the table of observed counts as input and returns the `χ²` statistic, the P-value, the degrees of freedom, and the table of expected counts.
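A minimal sketch of that call, applied to the two observed tables from the worked examples, might look like the following. It assumes NumPy and SciPy are installed; the dictionary keys and variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts only: leave out the row and column totals.
tables = {
    "homogeneity (social media)": np.array([
        [60, 40, 10],   # TikTok
        [30, 45, 25],   # Instagram
        [10, 15, 65],   # Facebook
    ]),
    "association (housing)": np.array([
        [70, 50],       # Arts & Sciences
        [40, 40],       # Engineering
        [30, 70],       # Business
    ]),
}

results = {}
for name, observed in tables.items():
    # Returns the statistic, P-value, degrees of freedom, expected counts.
    chi2_stat, p_value, dof, expected = chi2_contingency(observed)
    results[name] = (chi2_stat, p_value, dof, expected)

    print(f"--- {name} ---")
    print(f"chi-square = {chi2_stat:.2f}, df = {dof}, P-value = {p_value:.3g}")
    print("Expected counts:")
    print(np.round(expected, 1))
    print(f"Large Counts condition met: {bool((expected >= 5).all())}")
```

The code is the same for both tests; only the way the data were collected and the wording of the hypotheses differ. One detail to keep in mind: for a 2×2 table (df = 1), `chi2_contingency` applies Yates' continuity correction by default, so its statistic will not match the hand formula unless you pass `correction=False`. Neither table here is 2×2, so the results are unaffected.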

