# Lab | Goodness of Fit and Independence Tests

## Question 1
A researcher gathers information about the physical activity patterns of fifth-grade children at a public primary school. He defines three categories of physical activity (Low, Medium, High). He also asks about regular consumption of sugary drinks at school and defines two categories (Yes = consumed, No = not consumed). We would like to evaluate, at a 5% significance level, whether there is an association between physical activity patterns and sugary drink consumption among the children of this school. The results are in the following table:

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table4.png)

In [21]:
# your answer here
import numpy as np
import pandas as pd
import scipy.stats as stats

# Observed frequencies
data = {'Physical Activity': ['Low', 'Medium', 'High'],
        'Yes': [32, 14, 6],  # Sugary drinks consumed
        'No': [12, 22, 9]}   # Sugary drinks not consumed

# Creating the DataFrame
df = pd.DataFrame(data)
df.head()

   Physical Activity  Yes  No
0                Low   32  12
1             Medium   14  22
2               High    6   9


In [22]:
# Converting the DataFrame to a contingency table (rows: activity level; columns: Yes, No)
observed = df[['Yes', 'No']].values

# Chi_square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-Square Statistic: {chi2_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("\nExpected Frequencies:")
print(pd.DataFrame(expected, columns=['Yes', 'No'], index=['Low', 'Medium', 'High']))

Chi-Square Statistic: 10.712198008709638
P-value: 0.004719280137040844
Degrees of Freedom: 2

Expected Frequencies:
              Yes         No
Low     24.084211  19.915789
Medium  19.705263  16.294737
High     8.210526   6.789474
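
The expected frequencies above can be reproduced by hand, which makes the independence assumption explicit: each expected count is the row total times the column total divided by the grand total. The following is a minimal sketch, not part of the original answer; the names `observed_q1`, `expected_manual`, `chi2_manual`, and `dof_manual` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Observed counts from the table above
# (rows: Low, Medium, High; columns: Yes, No)
observed_q1 = np.array([[32, 12],
                        [14, 22],
                        [6, 9]])

row_totals = observed_q1.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed_q1.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed_q1.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / grand_total

# Pearson chi-square statistic: sum over all cells of (O - E)^2 / E
chi2_manual = ((observed_q1 - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed_q1.shape[0] - 1) * (observed_q1.shape[1] - 1)

# Cross-check against SciPy
chi2_scipy, _, dof_scipy, expected_scipy = stats.chi2_contingency(observed_q1)
print("Expected match:", np.allclose(expected_manual, expected_scipy))
print("Statistic match:", np.isclose(chi2_manual, chi2_scipy))
print("Degrees of freedom:", dof_manual, dof_scipy)
```

Every expected count here is above 5, the usual rule of thumb for the chi-square approximation to be reasonable.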


In [23]:
# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis: There is a statistically significant association between physical activity levels and sugary drink consumption.")
    print("Children's physical activity patterns are not independent of sugary drink consumption.")
else:
    print("Fail to reject the null hypothesis: No significant association between physical activity levels and sugary drink consumption.")

Reject the null hypothesis: There is a statistically significant association between physical activity levels and sugary drink consumption.
Children's physical activity patterns are not independent of sugary drink consumption.
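
The same decision can also be reached by comparing the statistic with the critical value of the chi-square distribution at the 5% level, rather than comparing the p-value with 0.05. This is a small sketch under the assumption that the statistic and degrees of freedom are the ones reported above; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05
dof_q1 = 2                            # (3 - 1) * (2 - 1)
chi2_observed = 10.712198008709638    # statistic reported above

# Critical value: the point with upper-tail probability alpha
critical_value = stats.chi2.ppf(1 - alpha, df=dof_q1)

# The p-value is the upper-tail probability beyond the observed statistic
p_from_stat = stats.chi2.sf(chi2_observed, df=dof_q1)

print(f"Critical value at alpha={alpha}: {critical_value:.3f}")
print(f"Observed statistic: {chi2_observed:.3f}")
print(f"P-value recomputed from the statistic: {p_from_stat:.6f}")
print("Reject H0:", chi2_observed > critical_value)
```

Both approaches are equivalent: the statistic exceeds the critical value exactly when the p-value is below alpha.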


## [OPTIONAL] Question 2
The following table indicates the number of 6-point scores in an American rugby match in the 1979 season.

![](https://education-team-2020.s3.eu-west-1.amazonaws.com/ds-ai/lab-goodness-of-fit/table1.png)

Based on these results, we create a Poisson distribution with the sample mean as its parameter, λ = 2.435. At the .05 significance level, is there any reason to believe that the number of scores is a Poisson variable?

Check [here](https://www.geeksforgeeks.org/how-to-create-a-poisson-probability-mass-function-plot-in-python/) how to create a Poisson distribution and how to calculate the expected counts using the probability mass function (pmf).

A Poisson distribution is a discrete probability distribution. It gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events.
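
To make the definition concrete, the Poisson pmf is P(X = k) = λ^k · e^(−λ) / k!. The sketch below evaluates that formula directly and compares it with `scipy.stats.poisson.pmf`; the helper `poisson_pmf` and the name `lam` are illustrative, not part of the lab template.

```python
import math
from scipy import stats

lam = 2.435  # sample mean given in the question


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


for k in range(4):
    manual = poisson_pmf(k, lam)
    library = stats.poisson.pmf(k, mu=lam)
    print(f"k={k}: formula={manual:.6f}, scipy={library:.6f}")
```

Multiplying each probability by the total number of observations gives the expected count for that value of k.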

In [None]:
# your answer here
# Given data
observed_counts = [35, 99, 104, 110, 62, 25, 10, 3]
total = 448  # sum of observed_counts
lambda_mean = 2.435

# Defining k values (0-7)
k_values = np.arange(len(observed_counts))

# Poisson expected probabilities
expected_probs = stats.poisson.pmf(k_values, mu=lambda_mean)

# Adjusting last category (7 or more) by summing remaining probabilities
expected_probs[-1] = 1 - expected_probs[:-1].sum()

# Expected frequencies
expected_counts = expected_probs * total

# Chi-square goodness-of-fit test
chi2_stat, p_value = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)

# Creating DataFrame for clarity
df = pd.DataFrame({'Scores': k_values, 'Observed': observed_counts, 'Expected': expected_counts})
print(df)
print(f"\nChi-Square Statistic: {chi2_stat}")
print(f"P-value: {p_value}")


# Conclusion
if p_value < 0.05:
    print("\nReject the null hypothesis → The data does NOT follow a Poisson distribution.")
else:
    print("\nFailed to reject the null hypothesis → The data is consistent with a Poisson distribution.")

   Scores  Observed    Expected
0       0        35   39.243791
1       1        99   95.558630
2       2       104  116.342632
3       3       110   94.431437
4       4        62   57.485137
5       5        25   27.995262
6       6        10   11.361410
7       7         3    5.581701

Chi-Square Statistic: 6.491310681109821
P-value: 0.4836889068537269

Failed to reject the null hypothesis → The data is consistent with a Poisson distribution.
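
One refinement worth considering: the question describes λ = 2.435 as the sample mean, so it was estimated from the same data. A common convention in that case is to subtract one extra degree of freedom per estimated parameter, which `scipy.stats.chisquare` supports through its `ddof` argument. The sketch below assumes that convention applies here; the names `observed_q2`, `expected_q2`, `stat_adj`, `p_adj`, and `dof_adj` are illustrative.

```python
import numpy as np
from scipy import stats

observed_q2 = np.array([35, 99, 104, 110, 62, 25, 10, 3])
lam_q2 = 2.435                 # sample mean given in the question
n_q2 = observed_q2.sum()       # total number of observations

# Expected counts, with the last category absorbing the upper tail (7 or more)
probs = stats.poisson.pmf(np.arange(len(observed_q2)), mu=lam_q2)
probs[-1] = 1 - probs[:-1].sum()
expected_q2 = probs * n_q2

# ddof=1 removes one extra degree of freedom for the estimated lambda:
# dof = categories - 1 - 1
stat_adj, p_adj = stats.chisquare(observed_q2, f_exp=expected_q2, ddof=1)
dof_adj = len(observed_q2) - 1 - 1

print(f"Chi-Square Statistic: {stat_adj}")
print(f"Degrees of Freedom: {dof_adj}")
print(f"P-value with ddof=1: {p_adj}")
```

The statistic itself does not change; only the reference distribution does, so the p-value is somewhat smaller than with the default degrees of freedom. It should still be compared with 0.05 in the same way.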
