# Lab | Goodness of Fit and Independence Tests

## Question 1
A researcher gathers information about the physical activity patterns of fifth-grade children at a public primary school. He defines three categories of physical activity (Low, Medium, High). He also asks about the regular consumption of sugary drinks at school and defines two categories (Yes = consumed, No = not consumed). We would like to evaluate, at a 5% significance level, whether there is an association between physical activity patterns and the consumption of sugary drinks among the children of this school. The results are in the following table:

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table4.png)
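
For reference, the chi-square test of independence for a table like this one can be written as below. The symbols are notation introduced here, not taken from the lab: $O_{ij}$ and $E_{ij}$ are the observed and expected counts in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of row and column categories (margins excluded).

```latex
H_0:\ \text{physical activity and sugary drink consumption are independent}
\qquad
H_1:\ \text{they are associated}

\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\text{dof} = (r - 1)(c - 1)
```

With $r = 3$ activity levels and $c = 2$ drink categories, the test has $(3 - 1)(2 - 1) = 2$ degrees of freedom. As an example, the expected count for Low activity and Yes is $44 \cdot 52 / 95 \approx 24.08$.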

In [1]:
import pandas as pd

data = {
    "Sugary Drinks Yes": [32, 14, 6],
    "Sugary Drinks No": [12, 22, 9]
}

index = ["Low", "Medium", "High"]

crosstab = pd.DataFrame(data, index=index)
crosstab.index.name = "Physical Activity"

crosstab.loc["Total"] = crosstab.sum()
crosstab["Total"] = crosstab.sum(axis=1)

print(crosstab)


                   Sugary Drinks Yes  Sugary Drinks No  Total
Physical Activity                                            
Low                               32                12     44
Medium                            14                22     36
High                               6                 9     15
Total                             52                43     95


In [2]:
# your answer here
from scipy.stats import chi2_contingency

# Test only the observed 3x2 counts: the "Total" row and column are
# margins, not categories, and would inflate the degrees of freedom.
observed = crosstab.drop(index="Total", columns="Total")

chi2, p, dof, expected = chi2_contingency(observed)

print(f"Chi²: {chi2:.4f}")
print(f"p-value: {p:.4f}")
print(f"DoF: {dof}")
print("expected frequency:")
print(expected)

Chi²: 10.7122
p-value: 0.0047
DoF: 2
expected frequency:
[[24.08421053 19.91578947]
 [19.70526316 16.29473684]
 [ 8.21052632  6.78947368]]


In [3]:
alpha = 0.05
if p < alpha:
    print('\n→ H0 is rejected: There is a significant relationship between the variables "Physical Activity" and "Sugary Drinks".')
else:
    print('\n→ H0 is not rejected: There is NO significant relationship between the variables "Physical Activity" and "Sugary Drinks".')



→ H0 is rejected: There is a significant relationship between the variables "Physical Activity" and "Sugary Drinks".
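
As a sanity check, the test can also be worked by hand on the 3×2 table of observed counts, with the "Total" margins left out. The sketch below is only an illustration in plain Python. It relies on the fact that with exactly 2 degrees of freedom the chi-square survival function reduces to `exp(-x / 2)`. The variable names are chosen here and are not part of the lab.

```python
import math

# Observed counts from the table (rows: Low, Medium, High; columns: Yes, No),
# without the "Total" margins.
observed = [[32, 12], [14, 22], [6, 9]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Closed form of the chi-square survival function, valid only for dof == 2.
p_value = math.exp(-chi2_stat / 2)

print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4f}")
```

Every expected count is above 5, so the usual rule of thumb for the chi-square approximation is met.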


## [OPTIONAL] Question 2
The following table shows how often each number of 6-point scores occurred in American rugby matches during the 1979 season.

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table1.png)

Based on these results, we create a Poisson distribution whose parameter is the sample mean, λ = 2.435. At the .05 significance level, is it reasonable to believe that the number of scores is a Poisson variable?
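
The value 2.435 can be reproduced from the observed frequencies as a weighted mean. This is a minimal sketch under one loudly stated assumption: every observation in the open-ended "7+" class is counted as exactly 7 scores. The counts are the ones used in this lab's data, and the variable names are illustrative.

```python
# Observed frequencies: position k holds the number of matches with k scores
# (k = 0..6); the last entry is the open-ended "7+" class.
observed_counts = [35, 99, 104, 110, 62, 25, 10, 3]

# Assumption: each match in the "7+" class is treated as exactly 7 scores.
score_values = list(range(8))

total_matches = sum(observed_counts)
total_scores = sum(k * n for k, n in zip(score_values, observed_counts))

# Sample mean = total scores / total matches, used as the Poisson parameter.
lam_hat = total_scores / total_matches
print(f"matches = {total_matches}, estimated lambda = {lam_hat:.3f}")
```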

See [this guide](https://www.geeksforgeeks.org/how-to-create-a-poisson-probability-mass-function-plot-in-python/) for how to create a Poisson distribution and how to calculate the expected counts using the probability mass function (pmf).

A Poisson distribution is a discrete probability distribution. It gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events.
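
The pmf itself is short enough to write out: P(X = k) = λ^k · e^(−λ) / k!. The sketch below is an illustration in plain Python rather than the `scipy.stats.poisson` route the lab uses. It turns these probabilities into expected counts, assuming 448 observations in total (the sum of the observed counts in this lab's data). The helper name `poisson_pmf` is made up for this example.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with mean lam: lam^k * e^(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2.435        # sample mean given in the question
n_matches = 448    # assumed total number of observations

# Probabilities for 0..6 scores, plus the remaining tail mass for "7+".
probs = [poisson_pmf(k, lam) for k in range(7)]
probs.append(1 - sum(probs))

# Expected count per class = class probability * number of observations.
expected = [p * n_matches for p in probs]
for label, e in zip(["0", "1", "2", "3", "4", "5", "6", "7+"], expected):
    print(f"{label:>2}: {e:.2f}")
```

Putting the leftover probability into the last class keeps the expected counts summing to the number of observations, which a goodness-of-fit test requires.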

In [8]:
import pandas as pd
import numpy as np
from scipy.stats import poisson, chisquare

# Original data
observed_counts = [35, 99, 104, 110, 62, 25, 10, 3]
categories = ["0", "1", "2", "3", "4", "5", "6", "7+"]

df = pd.DataFrame({
    "scores": categories,
    "observed": observed_counts
})

total_obs = sum(observed_counts)
λ = 2.435

# Individual probabilities from 0 to 6
expected_probs = [poisson.pmf(k, mu=λ) for k in range(7)]

# Probability for "7 or more" = 1 - sum of first 7
expected_probs.append(1 - sum(expected_probs))

# Expected counts (probability * total)
expected_counts = [round(p * total_obs, 2) for p in expected_probs]

df["expected"] = expected_counts
print(df)

# λ was estimated from the sample, so one extra degree of freedom is lost:
# ddof=1 gives dof = k - 1 - 1 = 6 instead of the default k - 1 = 7.
chi2_stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts, ddof=1)

print(f"\nChi-square statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("➡️ Reject H0: The distribution is NOT Poisson.")
else:
    print("✅ Fail to reject H0: The distribution could be Poisson.")


  scores  observed  expected
0      0        35     39.24
1      1        99     95.56
2      2       104    116.34
3      3       110     94.43
4      4        62     57.49
5      5        25     28.00
6      6        10     11.36
7     7+         3      5.58

Chi-square statistic: 6.4891
p-value: 0.3707
✅ Fail to reject H0: The distribution could be Poisson.
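
Because λ is taken from the sample mean, the usual convention is to remove one more degree of freedom, which leaves 8 − 1 − 1 = 6. The sketch below assumes that convention and evaluates the p-value for the reported statistic by hand, using the closed form of the chi-square survival function for even degrees of freedom. The helper name `chi2_sf_even` is made up for this illustration. The conclusion at the 5% level stays the same.

```python
import math

chi2_stat = 6.4891   # statistic reported in the output above
n_classes = 8        # score classes 0..6 and "7+"
n_estimated = 1      # one parameter (lambda) estimated from the sample

dof = n_classes - 1 - n_estimated

def chi2_sf_even(x: float, dof: int) -> float:
    """Chi-square survival function P(X > x), closed form for even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive even dof")
    half = x / 2
    # For dof = 2m: sf(x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!
    terms = sum(half ** j / math.factorial(j) for j in range(dof // 2))
    return math.exp(-half) * terms

p_value = chi2_sf_even(chi2_stat, dof)
print(f"dof = {dof}, p-value = {p_value:.4f}")
```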
