In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Overview
The chi-squared (χ²) test is a non-parametric statistical method for categorical data. It compares the observed frequency in each category with the frequency expected under a null hypothesis, and measures how far the observed counts depart from those expectations. Depending on the null hypothesis, it can check whether a single categorical variable follows a specified distribution (goodness of fit) or whether two categorical variables are associated (test for independence, where the expected frequencies are those implied by no association).
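The comparison of observed and expected frequencies reduces to one number: the sum over all categories of (observed - expected)² / expected. The sketch below is a minimal illustration; `chi2_statistic` is a hypothetical helper name chosen here, not a library function, and the counts are made up.

```python
import numpy as np

def chi2_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Made-up counts: identical observed and expected values give 0,
# and larger gaps between them give larger values.
same = chi2_statistic([10, 20, 30], [10, 20, 30])
shifted = chi2_statistic([5, 25, 30], [10, 20, 30])
print(same, shifted)
```

A larger statistic means the observed counts sit further from what the null hypothesis predicts.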

### Chi-Squared Test for Goodness of Fit
- Purpose: It assesses whether an observed frequency distribution matches an expected distribution. Essentially, it tests if a single categorical variable fits a specific distribution.
- Use Case: Used when you have one categorical variable from a single population and you want to check if the distribution of categories matches a theoretical distribution.
- Example: Checking if a six-sided die is fair by comparing the observed frequency of each outcome (1 to 6) with the expected frequency if the die were fair.
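The die example can be sketched as follows. The roll counts are hypothetical and only illustrate the call; when `f_exp` is omitted, `scipy.stats.chisquare` assumes all categories are equally likely, which matches the fair-die hypothesis.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for faces 1 to 6 from 120 rolls (illustrative, not real data)
rolls = np.array([18, 22, 16, 25, 19, 20])

# No f_exp given: every face is expected 120 / 6 = 20 times
die_stat, die_p = stats.chisquare(rolls)

# Degrees of freedom for goodness of fit: number of categories minus 1
die_dof = len(rolls) - 1

print(f"statistic={die_stat:.3f}, p-value={die_p:.3f}, dof={die_dof}")
```

With these made-up counts the p-value is well above 0.05, so there would be no evidence against fairness.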

#### Case Study - One-Sample Chi-Squared Test
A psychologist is observing the eating behaviour of 131 three-year-old children from Newcastle. He presents each child with 20 new foods that they have never eaten before, then records the number of foods each child actually tries.

Previous research with thousands of children from across the country has shown that we expect 40% of young children to try 0 to 5 new foods, 30% to try 6 to 10 new foods, 20% to try 11 to 15 new foods and 10% to try 16 to 20 new foods.

Perform a test to see if the children from Newcastle follow the same distribution that the research on British children has found.

#### Hypotheses:
- H0: The children from Newcastle follow the same distribution found by the research.
- H1: The children from Newcastle do not follow the same distribution found by the research.

In [5]:
# test results of children from Newcastle
data = [['0-5',45],['6-10',34],['11-15',31],['16-20',21]]
df = pd.DataFrame(data, columns=['Number of food tried','Frequency'])
df

   Number of food tried  Frequency
0                   0-5         45
1                  6-10         34
2                 11-15         31
3                 16-20         21


In [10]:
# observed frequencies
observed = list(df['Frequency'])

# expected frequencies
total = df['Frequency'].sum()
expected = [0.4*total, 0.3*total, 0.2*total, 0.1*total]

In [11]:
observed

[45, 34, 31, 21]

In [12]:
expected

[52.400000000000006, 39.3, 26.200000000000003, 13.100000000000001]
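Before running the test it can help to sanity-check the expected counts. A common rule of thumb holds that the chi-squared approximation is reasonable when every expected count is at least 5. The sketch below rebuilds the expected counts from the case-study numbers; the names `observed_counts`, `proportions` and `expected_counts` are chosen here for illustration.

```python
import numpy as np

observed_counts = np.array([45, 34, 31, 21])   # Newcastle sample
proportions = np.array([0.4, 0.3, 0.2, 0.1])   # national distribution

# Expected counts must be on the same scale as the observed counts
expected_counts = proportions * observed_counts.sum()

# The proportions should sum to 1, and the totals should match
totals_match = np.isclose(expected_counts.sum(), observed_counts.sum())

# Rule-of-thumb check: every expected count at least 5
large_enough = bool(expected_counts.min() >= 5)

print(totals_match, large_enough)
```

Matching totals matter in practice because `scipy.stats.chisquare` expects the observed and expected frequencies to sum to the same value.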

In [13]:
import scipy.stats as stats

# Perform the Chi-Squared test
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

# Output the results
print(f"Chi-Squared Statistic: {chi2_stat}")
print(f"P-Value: {p_value}")

# Decision based on p-value
alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: The children from Newcastle do not follow the same distribution found by the research.")
else:
    print("Fail to reject the null hypothesis: There is not enough evidence that the children from Newcastle deviate from the distribution found by the research.")


Chi-Squared Statistic: 7.40330788804071
P-Value: 0.06009563348294773
Fail to reject the null hypothesis: There is not enough evidence that the children from Newcastle deviate from the distribution found by the research.
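The statistic and p-value reported above can be reproduced by hand, which makes the role of the degrees of freedom explicit. This is a sketch using the case-study counts; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

obs = np.array([45, 34, 31, 21])
exp = np.array([0.4, 0.3, 0.2, 0.1]) * obs.sum()

# Chi-squared statistic: sum of (O - E)^2 / E
manual_stat = np.sum((obs - exp) ** 2 / exp)

# Goodness of fit with 4 categories has 4 - 1 = 3 degrees of freedom
dof = len(obs) - 1

# p-value: upper-tail probability of the chi-squared distribution
manual_p = stats.chi2.sf(manual_stat, dof)

# Critical value at the 5% significance level
critical = stats.chi2.ppf(0.95, dof)

print(f"statistic={manual_stat:.4f}, p-value={manual_p:.4f}, critical={critical:.4f}")
```

The statistic falls just below the 5% critical value, which is the same decision as comparing the p-value with alpha. The largest single contribution comes from the 16-20 group, where 21 children were observed against 13.1 expected.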


### Chi-Squared Test for Independence
- Purpose: It evaluates whether there is a significant association between two categorical variables. The null hypothesis states that the variables are independent, while the alternative hypothesis suggests that they are associated.
- Use Case: Used when you have two categorical variables and you want to see if there is a relationship between them.
- Example: Testing if there is an association between gender (male/female) and preference for a particular type of movie (action/comedy/drama).
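Survey data usually arrives as one row per respondent rather than as a table of counts. The sketch below shows how such rows could be turned into a contingency table with `pd.crosstab`. The six responses are hypothetical and far too few for a real test; they only show the shape of the data.

```python
import pandas as pd

# Hypothetical raw responses, one row per respondent (illustrative only)
responses = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female", "male"],
    "movie":  ["action", "comedy", "drama", "comedy", "action", "action"],
})

# Cross-tabulate into a table of counts: rows are genders, columns are movie types
table = pd.crosstab(responses["gender"], responses["movie"])
print(table)
```

The resulting table of counts is the kind of input a test for independence works on.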

#### Case Study
Imagine you work for a supermarket, and you want to investigate if there is an association between Gender (Male or Female) and Preference for a Product (Prefer Product A or Prefer Product B).

You conducted a survey of 200 customers, and the results are shown in the table below.

In [16]:
survey_data = [['Male',30,50,80],['Female',70,50,120]]
df_survey = pd.DataFrame(survey_data, columns=['Gender', 'Product A', 'Product B', 'Total'])
df_survey

   Gender  Product A  Product B  Total
0    Male         30         50     80
1  Female         70         50    120


#### Hypotheses:
- H0: There is no association between Gender and Product Preference (they are independent).
- H1: There is an association between Gender and Product Preference (they are not independent).

In [17]:
# Observed data from the table
observed = np.array([[30, 50],  # Male
                     [70, 50]]) # Female

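# Perform the Chi-Squared Test of Independence.
# Note: for a 2x2 table, chi2_contingency applies Yates' continuity
# correction by default (correction=True), so the statistic is slightly
# smaller than the uncorrected sum of (O - E)^2 / E.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)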

# Output the results
print(f"Chi-Squared Statistic: {chi2_stat}")
print(f"P-Value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(expected)

# Decision based on p-value
alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is an association between Gender and Product Preference.")
else:
    print("Fail to reject the null hypothesis: No significant association between Gender and Product Preference.")

Chi-Squared Statistic: 7.520833333333333
P-Value: 0.006098945931214356
Degrees of Freedom: 1
Expected Frequencies Table:
[[40. 40.]
 [60. 60.]]
Reject the null hypothesis: There is an association between Gender and Product Preference.
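The expected frequencies in the output can be derived by hand: under independence, each cell's expected count is (row total × column total) / grand total. The sketch below reproduces them and also compares the default result with the uncorrected one, since `chi2_contingency` applies Yates' continuity correction to 2x2 tables unless `correction=False` is passed. Variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

table = np.array([[30, 50],   # Male
                  [70, 50]])  # Female

# Expected counts under independence: outer product of the margins / grand total
row_totals = table.sum(axis=1)
col_totals = table.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / table.sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (table.shape[0] - 1) * (table.shape[1] - 1)

# Default call (Yates' correction on) versus the uncorrected statistic
stat_yates, p_yates, _, _ = stats.chi2_contingency(table)
stat_plain, p_plain, _, _ = stats.chi2_contingency(table, correction=False)

print(expected_manual)
print(f"with correction:    statistic={stat_yates:.4f}, p-value={p_yates:.4f}")
print(f"without correction: statistic={stat_plain:.4f}, p-value={p_plain:.4f}")
```

Both versions lead to the same decision here; the correction only makes the test slightly more conservative.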
