# 🧪 Full Guide: Solving Chi-Squared Questions (For Beginners)

This guide explains **how to solve chi-squared exercises** step-by-step. It covers how to recognize the question, choose the right test, get the right inputs, write the code, interpret results, and formulate a clear answer.

---

## 🧠 1. How to Recognize the Type of Chi-Squared Question

### ✅ A. **Chi-Squared Test of Independence**

**Question clues:**

* "Is there a relationship between X and Y?"
* "Does gender influence preference?"
* "Are these two categorical variables dependent or independent?"

Use this test when you have **two categorical variables** and want to check whether they are associated.

---

### ✅ B. **Chi-Squared Goodness-of-Fit Test**

**Question clues:**

* "Is the sample representative of the population?"
* "Do these frequencies match the expected distribution?"
* "Is there a fair/biased distribution?"

Use this test when you have **one categorical variable** and known **expected percentages**.

---

## 🔢 2. Getting the Right Inputs

### A. Observed Frequencies

These are the actual counts from your dataset or the exercise.

* In pandas:

```python
import pandas as pd
observed = pd.Series([...])  # for goodness-of-fit or fairness
# or
observed = pd.crosstab(df['Variable1'], df['Variable2'])  # for independence test
```
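As a minimal sketch of both cases, assuming a small made-up dataset (the column names `gender` and `preference` and all values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical survey data, invented purely for illustration
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "preference": ["A", "B", "A", "A", "B", "B"],
})

# One categorical variable: a single row of counts (goodness-of-fit)
one_way = df["preference"].value_counts().sort_index()

# Two categorical variables: a contingency table (independence)
two_way = pd.crosstab(df["gender"], df["preference"])

print(one_way)
print(two_way)
```

The one-way counts feed a goodness-of-fit test, while the two-way table feeds a test of independence.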

### B. Expected Frequencies

#### For Goodness-of-Fit (known percentages):

If the population is:

* Category A: 50%
* Category B: 30%
* Category C: 20%

Then:

```python
import numpy as np

expected_percentages = np.array([0.5, 0.3, 0.2])  # must sum to 1
expected = expected_percentages * observed.sum()
```
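To make this concrete, here is a small sketch that applies the 50% / 30% / 20% distribution to a sample of 100. The observed counts are hypothetical, invented only for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import chisquare

# Hypothetical observed counts for categories A, B and C
observed = pd.Series([48, 35, 17], index=["A", "B", "C"])

# Known population percentages (from the example above)
expected_percentages = np.array([0.5, 0.3, 0.2])

# Scale the percentages to the sample size so both totals match
expected = expected_percentages * observed.sum()

chi2_stat, p = chisquare(f_obs=observed, f_exp=expected)

print(f"Expected counts: {expected}")
print(f"Chi² = {chi2_stat:.4f}, p = {p:.4f}")
```

Multiplying by `observed.sum()` matters: recent SciPy versions raise an error when the observed and expected totals differ.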

#### For Independence:

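These are calculated automatically by `chi2_contingency` (returned as its fourth value). To compute them by hand, take the row totals, the column totals, and the grand total of the observed table: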

```python
expected = np.outer(row_totals, col_totals) / total
```
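The sketch below checks the manual formula against SciPy on a hypothetical 2×3 table (the counts and the labels `group_1`, `low`, and so on are invented for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of observed counts
observed = pd.DataFrame(
    [[20, 30, 50], [30, 30, 40]],
    index=["group_1", "group_2"],
    columns=["low", "mid", "high"],
)

row_totals = observed.sum(axis=1).to_numpy()   # one total per row
col_totals = observed.sum(axis=0).to_numpy()   # one total per column
total = observed.to_numpy().sum()              # grand total

# Expected count per cell = row total * column total / grand total
expected_manual = np.outer(row_totals, col_totals) / total

# SciPy returns the same expected table as its fourth value
_, _, _, expected_scipy = chi2_contingency(observed)

print(expected_manual)
print(np.allclose(expected_manual, expected_scipy))
```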

---

## 🧪 3. Code Examples (With Explanation)

### A. Chi-Squared Test of Independence (two variables)

**Used for:** Two categorical variables (e.g. Gender vs. Survey response)

```python
import pandas as pd
from scipy import stats

alpha = 0.05

# Build a contingency table of the two categorical variables in your DataFrame `df`
# (rows = Variabele1, columns = Variabele2; the test itself is symmetric)
observed = pd.crosstab(df["Variabele1"], df["Variabele2"])

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Critical value
critical_value = stats.chi2.ppf(q=1 - alpha, df=dof)

# Run test (returns the same dof as computed above).
# Note: for 2x2 tables SciPy applies Yates' continuity correction by default.
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"Chi² = {chi2:.4f}")
print(f"p-value = {p:.4f}")
print(f"p-value in % = {p * 100:.2f}")
print(f"Degrees of freedom = {dof}")
print(f"Critical value at α = {alpha}: {critical_value:.4f}")

# Decision based on the p-value
if p < alpha:
    print("❌ Reject H0: there is probably a relationship between the variables.")
else:
    print("✅ Do not reject H0: no evidence of a relationship.")

# Equivalent decision based on the critical value
if chi2 > critical_value:
    print("❌ Reject H0: Chi² is greater than the critical value.")
else:
    print("✅ Do not reject H0: Chi² is not greater than the critical value.")
```

**📌 Interpretation:**
- If $p < 0.05$: **Reject $H_0$** → there is **probably a relationship**
- If $p \geq 0.05$: **Do not reject $H_0$** → **no evidence of a relationship**

- If $\chi^2 > \text{critical value}$ ⇒ **Reject $H_0$**
- If $\chi^2 \leq \text{critical value}$ ⇒ **Do not reject $H_0$**

**Answer Example:**

> H0: Gender and Survey are independent.
> H1: Gender and Survey are not independent.
> Chi² = 4.26, p = 0.372 > 0.05 → Do not reject H0.
> Conclusion: There is no significant relationship between gender and survey responses.
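
The p-value rule and the critical-value rule always lead to the same decision, because both come from the same chi-squared distribution. The sketch below illustrates this using the statistic from the answer example (4.26). The example does not state its degrees of freedom; `dof = 4` is an assumption that happens to reproduce p ≈ 0.372. The other two statistics are arbitrary values added for comparison.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 4  # assumption: not stated in the answer example
critical_value = chi2.ppf(1 - alpha, df=dof)

results = []
for stat in (4.26, 9.0, 12.5):
    p = chi2.sf(stat, df=dof)                    # upper-tail probability
    reject_by_p = bool(p < alpha)                # rule 1: p-value
    reject_by_cv = bool(stat > critical_value)   # rule 2: critical value
    results.append((stat, p, reject_by_p, reject_by_cv))
    print(f"Chi² = {stat:5.2f}  p = {p:.4f}  "
          f"reject (p) = {reject_by_p}  reject (critical) = {reject_by_cv}")
```

Reporting either rule is enough in an answer; showing both is mainly a consistency check.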

---

### ✅ C. Chi-Squared Goodness-of-Fit Test (based on variable exposure)

**Use this when:** The expected distribution depends on external values like number of days, exposure time, or surface area.

**Question clues:**

* "Are births proportional to lunar phases (which have different lengths)?"
* "Does observed frequency align with unequal expected times/lengths/etc.?"

```python
import pandas as pd
import numpy as np
from scipy.stats import chisquare, chi2

# Sample setup — requires 3 columns: category, exposure, observed
df = pd.DataFrame({
    'category': [...],                 # Category labels (optional for labeling only)
    'exposure': [...],                 # e.g. number of days, time, area...
    'observed': [...]                  # Actual counts per category
})

alpha = 0.05                          # Significance level

# Step 1: Calculate total exposure and expected values
exposure_total = df['exposure'].sum()
observed_total = df['observed'].sum()
df['expected'] = (df['exposure'] / exposure_total) * observed_total

# Step 2: Run chi-squared test
chi2_stat, p = chisquare(f_obs=df['observed'], f_exp=df['expected'])
dof = len(df) - 1
critical_value = chi2.ppf(1 - alpha, df=dof)

# Step 3: Interpret result
print(f"Chi² = {chi2_stat:.4f}")
print(f"p-value = {p:.4f}")
print(f"Critical value = {critical_value:.4f}")

if p < alpha:
    print("❌ Reject H0: the distribution deviates significantly from what exposure predicts")
else:
    print("✅ Do not reject H0: no evidence that the distribution deviates from exposure")
```
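
A filled-in version of the template might look like the sketch below. The category names, exposure values, and counts are all hypothetical, invented to show the arithmetic; they are not real data.

```python
import pandas as pd
from scipy.stats import chisquare

# Hypothetical data: three periods of unequal length and the events counted in each
df = pd.DataFrame({
    "category": ["period_1", "period_2", "period_3"],
    "exposure": [10, 20, 30],      # e.g. number of days per period
    "observed": [18, 35, 67],      # events counted per period
})

# Expected counts are proportional to exposure and sum to the observed total
share = df["exposure"] / df["exposure"].sum()
df["expected"] = share * df["observed"].sum()

chi2_stat, p = chisquare(f_obs=df["observed"], f_exp=df["expected"])

print(df)
print(f"Chi² = {chi2_stat:.4f}, p = {p:.4f}")
```

Here the longest period is expected to hold half of all events (30 of 60 exposure units), so equal counts per category would not count as "fair".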

---

### ✅ D. Chi-Squared Goodness-of-Fit Test (for a single known percentage)

**Use this when:** You compare the observed frequency of one category to a known percentage (e.g. 20%, 5.7%, etc.)

**Question clues:**

* "Is market share more than 20%?"
* "Is hiring proportionate to 5.7% of the population?"
* "Are X% of outcomes happening as expected?"

```python
import numpy as np
from scipy.stats import chisquare, chi2

# Inputs
success = ...                         # Number of cases in category (e.g. 15 African American)
total = ...                           # Total sample size (e.g. 405 teachers)
expected_proportion = 0.057          # Known percentage in decimal
alpha = 0.05                          # Significance level

# Step 1: Setup arrays
observed = np.array([success, total - success])
expected = np.array([expected_proportion, 1 - expected_proportion]) * total

# Step 2: Chi-squared test
chi2_stat, p = chisquare(f_obs=observed, f_exp=expected)
dof = 1
critical_value = chi2.ppf(1 - alpha, df=dof)

# Step 3: Interpret
print(f"Chi² = {chi2_stat:.4f}")
print(f"p-value = {p:.4f}")
print(f"Critical value = {critical_value:.4f}")

if p < alpha:
    print("❌ Reject H0: the percentage deviates significantly from the expected proportion")
else:
    print("✅ Do not reject H0: no evidence that the percentage deviates from the expected proportion")
```
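
As a sketch, the template can be run with the example figures from its own comments (15 out of 405, expected proportion 5.7%). The comparison with a one-sample z-test for a proportion is an addition for illustration: with two categories, the chi-squared statistic equals the squared z-statistic, so both give the same two-sided p-value.

```python
import numpy as np
from scipy.stats import chisquare, norm

success, total = 15, 405          # example figures from the template comments
expected_proportion = 0.057

observed = np.array([success, total - success])
expected = np.array([expected_proportion, 1 - expected_proportion]) * total
chi2_stat, p = chisquare(f_obs=observed, f_exp=expected)

# Equivalent two-sided z-test for a single proportion
p_hat = success / total
standard_error = np.sqrt(expected_proportion * (1 - expected_proportion) / total)
z = (p_hat - expected_proportion) / standard_error
p_z = 2 * norm.sf(abs(z))

print(f"Chi² = {chi2_stat:.4f}, z² = {z**2:.4f}")
print(f"p (chi²) = {p:.4f}, p (z) = {p_z:.4f}")
```

Because the chi-squared test is two-sided, a directional question such as "more than 20%?" is usually answered with the one-sided z-test instead, where the sign of `z` shows the direction of the deviation.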

---

### ❓ How do you choose between C and D?

| Question or situation                                                                 | Use test |
| ------------------------------------------------------------------------------------- | -------- |
| You have multiple categories with different exposure (time, days, length, ...)        | **C**    |
| You compare one category with a known proportion (%)                                  | **D**    |
| "Is the share of X equal to 20% or 5.7%?"                                             | **D**    |
| "Is the distribution fair across X, Y, Z days?"                                       | **C**    |
| You do NOT expect equal probabilities, but probabilities based on length/time         | **C**    |
| You test whether one group is under- or overrepresented                               | **D**    |

---

## 📊 Optional: Cramér's V (Effect Size, only for Independence)

```python
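import numpy as np

# `observed` and `chi2` come from the independence test in section 3A
n = observed.to_numpy().sum()        # total sample size
min_dim = min(observed.shape) - 1    # min(rows, columns) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"Cramér's V: {cramers_v:.3f}")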
```

| Value | Meaning        |
| ----- | -------------- |
| 0.0   | No association |
| 0.1   | Weak           |
| 0.25  | Moderate       |
| 0.5+  | Strong         |
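
The calculation and the table can be combined into two small helpers, sketched below. The function names `cramers_v` and `describe_strength` are hypothetical, the example tables are invented, and reading the table values as interval boundaries is one possible interpretation of these rule-of-thumb thresholds. The sketch passes `correction=False` so that a perfectly associated 2×2 table yields exactly V = 1; this is a design choice, not something the snippet above does.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table of observed counts."""
    # correction=False: no Yates correction, so V can reach 1 for 2x2 tables
    chi2_stat, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2_stat / (n * min_dim)))


def describe_strength(v: float) -> str:
    """Map V to a label, treating the table values as lower bounds."""
    if v < 0.1:
        return "negligible"
    if v < 0.25:
        return "weak"
    if v < 0.5:
        return "moderate"
    return "strong"


# Hypothetical tables for illustration
example = pd.DataFrame([[20, 30, 50], [30, 30, 40]])
perfect = pd.DataFrame([[50, 0], [0, 50]])

for name, table in [("example", example), ("perfect", perfect)]:
    v = cramers_v(table)
    print(f"{name}: V = {v:.3f} ({describe_strength(v)})")
```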

---

## ✅ Summary Checklist

| Step | What to Do                       |
| ---- | -------------------------------- |
| 1    | Identify the type of question    |
| 2    | Extract observed values          |
| 3    | Determine expected values        |
| 4    | Run the appropriate chi² test    |
| 5    | Interpret the p-value            |
| 6    | Formulate your answer in context |

Use this guide to solve **every chi-squared question** with confidence!