# Lab | Goodness of Fit and Independence Tests

## Question 1
A researcher collects data on the physical activity patterns of fifth-grade children at a public primary school. Physical activity is classified into three categories (Low, Medium, High). The researcher also records whether each child regularly consumes sugary drinks at school, using two categories (Yes = consumed, No = not consumed). We want to test, at the 5% significance level, whether physical activity patterns and sugary drink consumption are associated among the children of this school. The results are in the following table:

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table4.png)

- Null Hypothesis (H₀):
"Physical activity level and sugary drink consumption are independent (no association)."

- Alternative Hypothesis (H₁):
"Physical activity level and sugary drink consumption are dependent (there is an association)."

In [5]:
import numpy as np
import pandas as pd

# Observed data (3x2 table)
observed = np.array([
    [32, 12],  # Low activity
    [14, 22],  # Medium activity
    [6, 9]     # High activity
])
print("Observed Contingency Table:")
print(observed)

Observed Contingency Table:
[[32 12]
 [14 22]
 [ 6  9]]


In [14]:
# Create the contingency table with labels
data = {
    'Sugary Drinks (Yes)': [32, 14, 6],
    'Sugary Drinks (No)': [12, 22, 9]
}

# Create DataFrame with proper row labels
df = pd.DataFrame(data,
                 index=['Low', 'Medium', 'High'])
df.index.name = 'Physical Activity'

print("Contingency Table:")
display(df)

Contingency Table:


Physical Activity,Sugary Drinks (Yes),Sugary Drinks (No)
Low,32,12
Medium,14,22
High,6,9


In [17]:
from scipy.stats import chi2_contingency
chi2_stat, p_value, dof, expected = chi2_contingency(df)

print(f"\nChi-square statistic: {chi2_stat:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:")
print(expected)


Chi-square statistic: 10.71
P-value: 0.0047
Degrees of freedom: 2
Expected frequencies:
[[24.08421053 19.91578947]
 [19.70526316 16.29473684]
 [ 8.21052632  6.78947368]]
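
The expected frequencies and the statistic reported above can be reproduced by hand, which also makes the decision at the 5% level explicit. The following is a minimal sketch, assuming the same observed counts as in the table; the names `expected_manual`, `chi2_manual`, `dof_manual`, `p_manual` and `decision` are introduced here only for illustration.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts (rows: Low, Medium, High; columns: Yes, No)
observed = np.array([[32, 12], [14, 22], [6, 9]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / n

# Pearson chi-square statistic and degrees of freedom (rows - 1) * (columns - 1)
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof_manual)

alpha = 0.05
decision = "reject H0" if p_manual < alpha else "fail to reject H0"
print(f"chi2 = {chi2_manual:.2f}, dof = {dof_manual}, p = {p_manual:.4f} -> {decision}")
```

Since the p-value (about 0.0047) is below 0.05, H₀ is rejected: the data suggest an association between physical activity level and sugary drink consumption in this school.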


## [OPTIONAL] Question 2
The following table indicates the number of 6-point scores in an American rugby match in the 1979 season.

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table1.png)

Based on these results, we fit a Poisson distribution whose parameter λ is the sample mean, λ = 2.435. At the 0.05 significance level, is there any reason to believe that the number of scores follows a Poisson distribution?

Check [here](https://www.geeksforgeeks.org/how-to-create-a-poisson-probability-mass-function-plot-in-python/) how to create a Poisson distribution and how to calculate the expected counts using the probability mass function (pmf).
A Poisson distribution is a discrete probability distribution. It gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events.
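
To make the definition concrete, the probability of exactly k events is P(X = k) = e^(−λ) · λ^k / k!. The sketch below evaluates that formula directly and compares it with `scipy.stats.poisson.pmf`; the helper `poisson_pmf_manual` is a hypothetical name used only for this illustration.

```python
import math
from scipy.stats import poisson

lam = 2.435  # sample mean given in the question


def poisson_pmf_manual(k: int, lam: float) -> float:
    """Return P(X = k) = exp(-lam) * lam**k / k! for a Poisson variable."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Compare the hand-written formula with SciPy for the first few values of k
for k in range(4):
    manual = poisson_pmf_manual(k, lam)
    library = poisson.pmf(k, mu=lam)
    print(f"k={k}: manual={manual:.5f}, scipy={library:.5f}")
```

Multiplying each probability by the total number of matches gives the expected count for that value of k.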

In [24]:
# your answer here
from scipy.stats import poisson, chisquare

# Create a dictionary with the data
data = {
    'Number of Scores (k)': [0, 1, 2, 3, 4, 5, 6, '7 or more'],
    'Observed Matches': [35, 99, 104, 110, 62, 25, 10, 3] 
}

# Create DataFrame
df = pd.DataFrame(data)
display(df)

,Number of Scores (k),Observed Matches
0,0,35
1,1,99
2,2,104
3,3,110
4,4,62
5,5,25
6,6,10
7,7 or more,3


In [27]:
# Observed data from your DataFrame
observed = df['Observed Matches'].values  # [35, 99, 104, 110, 62, 25, 10, 3]
total_matches = observed.sum()  # Total matches = 35+99+104+110+62+25+10+3 = 448
lambda_ = 2.435  # Given mean

# Compute Poisson probabilities for k=0 to 6
k_values = np.arange(0, 7)  # 0,1,2,3,4,5,6
poisson_probs = poisson.pmf(k_values, mu=lambda_)


# For k ≥7, compute cumulative probability
prob_7_or_more = 1 - poisson_probs.sum()
poisson_probs = np.append(poisson_probs, prob_7_or_more)

# Calculate expected frequencies
expected = poisson_probs * total_matches

# Update DataFrame
df['Expected Matches'] = np.round(expected, 2)
display(df)

,Number of Scores (k),Observed Matches,Expected Matches
0,0,35,39.24
1,1,99,95.56
2,2,104,116.34
3,3,110,94.43
4,4,62,57.49
5,5,25,28.0
6,6,10,11.36
7,7 or more,3,5.58


In [28]:
print("Expected frequencies:", df['Expected Matches'].values)

Expected frequencies: [ 39.24  95.56 116.34  94.43  57.49  28.    11.36   5.58]


In [30]:
# Combine k=5,6,7+ if needed
observed_combined = [
    df['Observed Matches'].iloc[0],  # k=0
    df['Observed Matches'].iloc[1],  # k=1
    df['Observed Matches'].iloc[2],  # k=2
    df['Observed Matches'].iloc[3],  # k=3
    df['Observed Matches'].iloc[4],  # k=4
    df['Observed Matches'].iloc[5] + df['Observed Matches'].iloc[6] + df['Observed Matches'].iloc[7]  # k=5+
]

expected_combined = [
    df['Expected Matches'].iloc[0],  # k=0
    df['Expected Matches'].iloc[1],  # k=1
    df['Expected Matches'].iloc[2],  # k=2
    df['Expected Matches'].iloc[3],  # k=3
    df['Expected Matches'].iloc[4],  # k=4
    df['Expected Matches'].iloc[5] + df['Expected Matches'].iloc[6] + df['Expected Matches'].iloc[7]  # k=5+
]
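
Whether the tail categories need to be merged depends on the expected counts. A common rule of thumb (an assumption here, not something stated in the exercise) is that every expected count should be at least 5. The sketch below recomputes the expected counts for the eight original categories and checks that rule; `min_expected` and `needs_merging` are illustrative names.

```python
import numpy as np
from scipy.stats import poisson

lam, n_matches = 2.435, 448  # sample mean and total number of matches

# Probabilities for k = 0..6, plus the tail bin "7 or more"
probs = poisson.pmf(np.arange(7), mu=lam)
probs = np.append(probs, 1 - probs.sum())
expected = probs * n_matches

# Rule-of-thumb check: flag any category with an expected count below 5
min_expected = expected.min()
needs_merging = bool((expected < 5).any())
print(f"Smallest expected count: {min_expected:.2f}; merging required: {needs_merging}")
```

Under this rule the smallest expected count (about 5.58, for "7 or more") is already above 5, so the merge into a single "5 or more" category is optional rather than required.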

In [31]:
# Perform Chi-Square test
# λ = 2.435 is the sample mean, i.e. it was estimated from the data,
# so one extra degree of freedom is removed via ddof=1
chi2_stat, p_value = chisquare(f_obs=observed_combined, f_exp=expected_combined, ddof=1)

# Degrees of freedom = (number of categories after merging) - 1 - (parameters estimated)
df_degrees = len(observed_combined) - 1 - 1

print(f"Chi-square statistic: {chi2_stat:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {df_degrees}")

Chi-square statistic: 5.88
P-value: 0.2080
Degrees of freedom: 4


In [32]:
# chi2_stat and p_value come from the test in the previous cell
alpha = 0.05        # Significance level

print("\n--- Chi-Square Goodness-of-Fit Test Conclusion ---")
print(f"Chi-square statistic: {chi2_stat:.2f}")
print(f"P-value: {p_value:.4f}")

if p_value < alpha:
    print(f"\nAt a {alpha*100}% significance level:")
    print("Reject the null hypothesis (H₀).")
    print("Conclusion: The number of scores does NOT follow a Poisson distribution with λ = 2.435.")
else:
    print(f"\nAt a {alpha*100}% significance level:")
    print("Fail to reject the null hypothesis (H₀).")
    print("Conclusion: The number of scores is consistent with a Poisson distribution with λ = 2.435.")


--- Chi-Square Goodness-of-Fit Test Conclusion ---
Chi-square statistic: 5.88
P-value: 0.2080

At a 5.0% significance level:
Fail to reject the null hypothesis (H₀).
Conclusion: The number of scores is consistent with a Poisson distribution with λ = 2.435.
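
The same decision can be reached by comparing the statistic with a critical value instead of using the p-value. The sketch below does this for both candidate degrees of freedom (5 if λ is treated as known, 4 if one is subtracted because λ is the sample mean), as a rough check that the conclusion does not hinge on that choice; `critical_values` and `verdicts` are illustrative names.

```python
from scipy.stats import chi2

alpha = 0.05
chi2_stat = 5.88  # statistic reported above for the six merged categories

# Upper-tail critical value of the chi-square distribution for each candidate dof
critical_values = {dof: chi2.ppf(1 - alpha, dof) for dof in (5, 4)}

# Reject H0 only if the statistic exceeds the critical value
verdicts = {
    dof: "reject H0" if chi2_stat > crit else "fail to reject H0"
    for dof, crit in critical_values.items()
}

for dof, crit in critical_values.items():
    print(f"dof={dof}: critical value={crit:.3f} -> {verdicts[dof]}")
```

In both cases the statistic is well below the critical value, so the data give no reason to reject the Poisson model at the 5% level.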
