## Chi-Squared Goodness-of-Fit Test

The chi-squared goodness-of-fit test is a statistical test used to assess whether the observed frequency distribution of categorical data matches the expected distribution. It is a non-parametric test that is particularly useful when dealing with categorical variables or counts.

### Assumptions:

1. **Categorical Data:** The data should be categorical, meaning observations fall into distinct categories.

2. **Independent Observations:** Each observation should be independent of others.

3. **Expected Frequencies:** The expected frequency in each category should be large enough for the chi-squared approximation to hold. A common rule of thumb is an expected frequency of at least 5 in every category.
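The expected-frequency rule of thumb can be checked before running the test. The snippet below is a minimal sketch: the helper name `small_expected_cells` and the example proportions are hypothetical, chosen only to illustrate the check.

```python
import numpy as np

def small_expected_cells(expected, minimum=5.0):
    """Return the indices of categories whose expected count is below `minimum`."""
    expected = np.asarray(expected, dtype=float)
    return np.flatnonzero(expected < minimum)

# Hypothetical case: a sample of 50 with expected proportions 0.60, 0.32, 0.08
expected = 50 * np.array([0.60, 0.32, 0.08])
flagged = small_expected_cells(expected)
print(flagged)  # the third category (index 2) has an expected count of about 4
```

If any category is flagged, a common remedy is to merge it with a neighbouring category or to collect a larger sample.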

### Mathematical Formulation:

The test involves comparing the observed frequencies ($O_i$) with the expected frequencies ($E_i$). The chi-squared statistic ($\chi^2$) is calculated using the formula:

$$ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} $$

where the sum is taken over all categories.
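As an illustration, the formula translates almost directly into NumPy. This is a sketch rather than a reference implementation: the function name `chi2_statistic` and the die-roll counts are made up for the example.

```python
import numpy as np

def chi2_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Hypothetical data: 60 rolls of a die, with 10 rolls expected per face if the die is fair
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
stat = chi2_statistic(observed, expected)
print(stat)  # (4 + 4 + 1 + 1 + 0 + 0) / 10 = 1.0
```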

### Degrees of Freedom:

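The degrees of freedom ($df$) for the chi-squared goodness-of-fit test depend on the number of categories ($k$). When the expected proportions are fully specified in advance, $df = k - 1$: one degree of freedom is lost because the expected frequencies must sum to the same total as the observed frequencies.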
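To see how the degrees of freedom affect the test, the following sketch prints the critical value for a few category counts. The significance level of 0.05 is an assumption made for the illustration.

```python
from scipy.stats import chi2

alpha = 0.05  # assumed significance level

# Critical value of the chi-squared distribution for k = 2, 3, 4 categories
critical_values = {k: chi2.ppf(1 - alpha, k - 1) for k in (2, 3, 4)}

for k, critical in critical_values.items():
    print(f"k = {k}, df = {k - 1}, critical value = {critical:.4f}")
# k = 2, df = 1, critical value = 3.8415
# k = 3, df = 2, critical value = 5.9915
# k = 4, df = 3, critical value = 7.8147
```

More categories mean more degrees of freedom, so a larger statistic is needed to reject the null hypothesis.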

### Decision Rule:

The obtained $\chi^2$ value is compared against the critical $\chi^2$ value at a chosen significance level to determine whether to reject the null hypothesis.

- **Null Hypothesis ($H_0$):** The data follow the expected distribution; any difference between the observed and expected frequencies is due to sampling variation.

- **Alternative Hypothesis ($H_1$):** The data do not follow the expected distribution; the observed frequencies differ from the expected frequencies by more than sampling variation would explain.

### Interpretation:

- If the calculated $\chi^2$ value is greater than the critical $\chi^2$ value (equivalently, if the p-value is below the significance level), the null hypothesis is rejected, suggesting a significant difference.

- If the calculated $\chi^2$ value is less than or equal to the critical $\chi^2$ value, there is insufficient evidence to reject the null hypothesis.
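The decision rule can be sketched in a few lines. The function name `goodness_of_fit_decision` and the input values are hypothetical. The sketch reports both the critical-value comparison and the p-value, which lead to the same decision.

```python
from scipy.stats import chi2

def goodness_of_fit_decision(statistic, k, alpha=0.05):
    """Compare a chi-squared statistic with its critical value for k categories."""
    df = k - 1
    critical = chi2.ppf(1 - alpha, df)   # upper-tail critical value
    p_value = chi2.sf(statistic, df)     # upper-tail probability, i.e. 1 - cdf
    return {
        "df": df,
        "critical": float(critical),
        "p_value": float(p_value),
        "reject_h0": bool(statistic > critical),
    }

# Hypothetical statistic of 7.2 from a test with 4 categories
result = goodness_of_fit_decision(statistic=7.2, k=4)
print(result)
```

Here the statistic is below the critical value for $df = 3$, so the sketch reports that $H_0$ is not rejected.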

The chi-squared goodness-of-fit test is widely used in various fields, such as market research, biology, and social sciences, to assess whether the observed distribution of categorical data differs significantly from the expected distribution.


### Example:

**Problem Statement:** 

Suppose we have conducted a survey on the preferred modes of transportation for individuals in a city. The expected distribution based on historical data suggests that 40% prefer cars, 30% prefer public transportation, and 30% prefer bicycles.

We collected data from a random sample of 200 individuals and want to assess whether the observed distribution matches the expected distribution.

**Hypotheses:**  

- **Null Hypothesis ($H_0$):** Transportation preferences follow the historical distribution (40% cars, 30% public transportation, 30% bicycles).

- **Alternative Hypothesis ($H_1$):** Transportation preferences do not follow the historical distribution.

**Data:**

- Observed Frequencies:
  - Cars: 90
  - Public Transportation: 70
  - Bicycles: 40

### Assumptions:

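1. Categorical Data: each respondent falls into exactly one of three transportation categories.
2. Independent Observations: the 200 individuals form a random sample.
3. Expected Frequencies of at least 5: the expected counts are 80, 60, and 60.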

### Calculation:

The chi-squared statistic ($\chi^2$) is calculated using the formula:

$$ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} $$

where $O_i$ is the observed frequency, $E_i$ is the expected frequency, and the sum is taken over all categories.

### Degrees of Freedom:

$df = k - 1$ where $k$ is the number of categories. With $k = 3$ transportation modes, $df = 3 - 1 = 2$.

### Decision Rule:

Compare the calculated $\chi^2$ value against the critical $\chi^2$ value at a chosen significance level.

### Solution:

1. Calculate the expected frequencies based on the proportions.
   - Expected Cars: $0.40 \times 200 = 80$
   - Expected Public Transportation: $0.30 \times 200 = 60$
   - Expected Bicycles: $0.30 \times 200 = 60$

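2. Apply the chi-squared formula to obtain the test statistic:

   $$ \chi^2 = \frac{(90 - 80)^2}{80} + \frac{(70 - 60)^2}{60} + \frac{(40 - 60)^2}{60} = 1.25 + 1.6667 + 6.6667 \approx 9.5833 $$

3. Determine the critical $\chi^2$ value at the chosen significance level (e.g., 0.05) with $df = 3 - 1 = 2$, which is approximately $5.9915$.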

4. Compare the calculated $\chi^2$ value with the critical value.

### Conclusion:

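If the calculated $\chi^2$ value exceeds the critical value, we reject the null hypothesis, indicating a significant difference between the observed and expected distributions. If not, we fail to reject the null hypothesis. Here, $\chi^2 \approx 9.58$ exceeds the critical value of about $5.99$ (p-value $\approx 0.0083$), so we reject $H_0$ at the 0.05 level: the surveyed preferences differ significantly from the historical distribution.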

This example illustrates the application of the chi-squared goodness-of-fit test in analyzing categorical data.


In [14]:
import numpy as np
from scipy.stats import chisquare, chi2

# Observed frequencies
observed = np.array([90, 70, 40])

# Expected proportions based on historical data
expected_proportions = np.array([0.4, 0.3, 0.3])

# Sample size
total_observed = np.sum(observed)

# Calculate expected frequencies
expected = expected_proportions * total_observed

# Significance level
alpha = 0.05

# Degrees of freedom
df = len(observed) - 1

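# Perform the chi-squared test
chi2_statistic = np.sum((observed - expected)**2/expected)
# Upper-tail probability; sf(x) equals 1 - cdf(x) but is more accurate for small p-values
p_value = chi2.sf(chi2_statistic, df)
# Critical value at the chosen significance level
chi2_critical = chi2.ppf(1-alpha, df)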


# Output results
print(f'chi2-statistic: {chi2_statistic: 3.4f} \np-value: {p_value: 3.4f} \nchi2-critical: {chi2_critical: 3.4f}')
print('Reject H0:', p_value < alpha)


chi2-statistic:  9.5833 
p-value:  0.0083 
chi2-critical:  5.9915
Reject H0: True


In [15]:
# Cross-check with scipy.stats.chisquare (default degrees of freedom: k - 1)
chisquare(observed, f_exp=expected)

Power_divergenceResult(statistic=9.583333333333334, pvalue=0.008298614824955013)
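As a follow-up, it can help to see which categories drive the statistic. The sketch below breaks the sum into its per-category terms and also computes Pearson residuals, $(O_i - E_i)/\sqrt{E_i}$. It is an optional extra, not part of the test itself, and the variable names are chosen for illustration.

```python
import numpy as np

categories = ["Cars", "Public Transportation", "Bicycles"]
observed = np.array([90, 70, 40])
expected = np.array([0.4, 0.3, 0.3]) * observed.sum()

# Each category's term in the chi-squared sum
contributions = (observed - expected) ** 2 / expected
# Signed residuals: positive means more observations than expected
pearson_residuals = (observed - expected) / np.sqrt(expected)

for name, c, r in zip(categories, contributions, pearson_residuals):
    print(f"{name:<22} contribution = {c:.4f}, residual = {r:+.4f}")
print(f"Total chi2 = {contributions.sum():.4f}")
```

Bicycles contribute about 6.67 of the total 9.58, so the shortfall in bicycle users accounts for most of the departure from the historical distribution.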