<a href="https://colab.research.google.com/github/MK316/Spring2024/blob/main/Seminar/Chi_Squared_GoodnessOfFit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📙**Part 2. Chi-Squared Test of Goodness-of-fit**

### The Chi-squared goodness of fit test is used to determine whether observed categorical data fits a certain distribution or pattern that is expected based on theoretical considerations or prior knowledge.

+ Note: There are 3 datasets for you to practice with. Complete them by 6/22 and save this file to your GitHub repository.

# 🌀 Sample data analysis (Kim MR)

## 0.1 Example: Color Preferences in a Population

+ Description: This dataset represents a survey conducted to understand color preferences in a population. Participants (N=1000) were asked to choose their favorite color from a list of five options: 'Green', 'Yellow', 'Purple', 'Blue', 'Red'

+ 🔎 Null Hypothesis (H<sub>0</sub>): The distribution of color preferences in the population is uniform, with each color being equally favored.
+ 🔎 Alternative Hypothesis (H<sub>A</sub>): The distribution of color preferences in the population is not uniform; at least one color is chosen more or less often than expected.

### 0.2 Data: Contingency table

|Green|Yellow|Purple|Blue|Red|Total|
|--|--|--|--|--|--|
|200|250|150|200|200|1000|


### Dataset preview

|Responses|Color|
|--|--|
|1|Red|
|2|Green|
|3|Yellow|
|4|Purple|
|...|...|


## 🌱 Step [1] Run the code to generate contingency table of the data

Assume that the responses have already been counted, giving the observed frequencies used below.

In [None]:
import numpy as np

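# Observed counts, in the order Green, Yellow, Purple, Blue, Red
sample_size = 1000

observed_frequencies = [200, 250, 150, 200, 200]  # Observed frequencies for each color
# Under the uniform null hypothesis, every color is expected equally often
expected = sample_size / len(observed_frequencies)

expected_frequency = [expected] * len(observed_frequencies)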

print(observed_frequencies)
print(expected_frequency)
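
If the counts are not yet available, they can be tallied from raw responses like those in the dataset preview. The sketch below assumes a small hypothetical list named `responses` (not the actual survey data) and uses `collections.Counter` from the standard library. The order of `labels` fixes the order of the resulting frequencies.

```python
from collections import Counter

# Hypothetical raw responses, one favorite color per participant (illustration only)
responses = ['Red', 'Green', 'Yellow', 'Purple', 'Green', 'Blue', 'Green']

# Category order used throughout the analysis
labels = ['Green', 'Yellow', 'Purple', 'Blue', 'Red']

# Count each color; Counter returns 0 for colors that never occur
counts = Counter(responses)
observed = [counts[label] for label in labels]

print(dict(zip(labels, observed)))
```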

## 🌱 Step [2]: Perform the test

In [None]:
# Perform chi-squared test
from scipy.stats import chisquare
chi2_stat, p_value = chisquare(observed_frequencies)

print("Chi-squared Statistic:", chi2_stat)
print("P-value:", p_value)
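
To see what `chisquare` computes, the statistic can be reproduced by hand as the sum of (Observed - Expected)^2 / Expected over the categories, with degrees of freedom equal to the number of categories minus one. The following is a minimal sketch assuming the uniform null hypothesis used in this example; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([200, 250, 150, 200, 200])
# Uniform null hypothesis: the same expected count in every category
expected = np.full(len(observed), observed.sum() / len(observed))

# Chi-squared statistic computed term by term
stat_manual = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1

# Upper-tail probability of the Chi-squared distribution
p_manual = chi2.sf(stat_manual, dof)

# Compare with SciPy's built-in test (uniform expected counts by default)
stat_scipy, p_scipy = chisquare(observed)

print("Manual:", stat_manual, p_manual)
print("SciPy: ", stat_scipy, p_scipy)
```

Both lines should report the same statistic and P-value.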

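The `e` in the output is scientific notation: for example, `5.03e-05` means 5.03 × 10<sup>-5</sup>, or 0.0000503. To display the P-value in fixed-point notation, use the code below.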

In [None]:
print("P-value:", '{:.20f}'.format(p_value))
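
Twenty decimal places are usually more than a report needs. As an alternative sketch, f-strings can format the same value in either notation. The P-value is hard-coded here from the result reported in this notebook so that the snippet runs on its own.

```python
p_value = 5.0309817823062075e-05  # P-value from the sample analysis

# Scientific notation with two decimals
print(f"P-value (scientific): {p_value:.2e}")   # 5.03e-05

# Fixed-point notation with six decimals
print(f"P-value (fixed): {p_value:.6f}")        # 0.000050
```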

## 🌱 Step [3]: Calculate Standardized Residuals

### Standardized Residuals

To analyze which color is significantly different from the expected distribution in a goodness of fit test, you can calculate the standardized residuals for each category. Standardized residuals measure the difference between the observed and expected frequencies in terms of standard deviations.

+ Calculate the expected frequencies for each color based on the expected distribution (uniform distribution in this case).

+ Calculate the Chi-squared statistic and P-value for the goodness of fit test.

+ Calculate the standardized residuals for each color using the formula below (this quantity is also known as the Pearson residual):

> Standardized Residual = (Observed Frequency - Expected Frequency) / sqrt(Expected Frequency)

+ Determine the **critical value** for statistical significance (e.g., ±1.96 for a significance level of 0.05).

+ Identify colors with standardized residuals that exceed the critical value in absolute terms. These colors are significantly different from the expected distribution.
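
The ±1.96 threshold is the two-sided critical value of the standard normal distribution at a significance level of 0.05. Here is a short sketch of how it could be obtained with `scipy.stats.norm` (the variable names are illustrative):

```python
from scipy.stats import norm

alpha = 0.05

# Two-sided test: place alpha / 2 in each tail of the standard normal distribution
critical_value = norm.ppf(1 - alpha / 2)

print(round(critical_value, 2))  # 1.96
```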

In [None]:
import numpy as np
from scipy.stats import chisquare

# Labels for the colors
labels = ['Green', 'Yellow', 'Purple', 'Blue', 'Red']

# Observed frequencies for each color
observed_frequencies = [200, 250, 150, 200, 200]

# Calculate expected frequency (uniform distribution)
sample_size = sum(observed_frequencies)
expected_frequency = sample_size / len(observed_frequencies)
expected_frequencies = [expected_frequency] * len(observed_frequencies)

# Perform Chi-squared test
chi2_stat, p_value = chisquare(observed_frequencies, expected_frequencies)

# Calculate standardized residuals
standardized_residuals = [(observed - expected) / np.sqrt(expected) for observed, expected in zip(observed_frequencies, expected_frequencies)]

# Determine critical value for statistical significance (e.g., ±1.96 for 95% confidence)
critical_value = 1.96

# Identify colors significantly different from expected
significant_colors = [labels[i] for i, residual in enumerate(standardized_residuals) if abs(residual) > critical_value]

# Print results
print("Chi-squared Statistic:", chi2_stat)
print("P-value:", p_value)
print("Standardized Residuals:", dict(zip(labels, standardized_residuals)))
print("Significantly different colors:", significant_colors)
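
Some textbooks reserve the term "standardized" (or "adjusted") residual for a variant that also divides by sqrt(1 - p), where p is the expected proportion of the category, so that each residual has unit variance under the null hypothesis. The sketch below computes both versions side by side as an optional comparison, assuming the same uniform expected distribution. It is not part of the original analysis.

```python
import numpy as np

labels = ['Green', 'Yellow', 'Purple', 'Blue', 'Red']
observed = np.array([200, 250, 150, 200, 200])
n = observed.sum()

# Uniform null hypothesis: each color has expected proportion 1/5
expected_props = np.full(len(observed), 1 / len(observed))
expected = n * expected_props

# Residuals as used in this notebook
pearson_residuals = (observed - expected) / np.sqrt(expected)

# Variant with the extra (1 - p) factor in the denominator
adjusted_residuals = (observed - expected) / np.sqrt(expected * (1 - expected_props))

for label, pearson, adjusted in zip(labels, pearson_residuals, adjusted_residuals):
    print(f"{label}: Pearson = {pearson:.3f}, adjusted = {adjusted:.3f}")
```

With five equally likely categories, the adjusted values are larger by a factor of 1 / sqrt(0.8), roughly 1.118, so the same two colors exceed ±1.96 under either definition.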


## Reporting format:

+ This analysis examines whether the observed distribution of a categorical variable matches an expected distribution, using a Chi-squared goodness-of-fit test. The dataset consists of [brief description of dataset].

+ (Methodology): A Chi-squared goodness-of-fit test was conducted with a significance level of 0.05.

+ Results:

  + Chi-squared Statistic: 25.0
  + Degrees of Freedom: (5-1) = 4
  + Standardized Residuals: [0.0, 3.536, -3.536, 0.0, 0.0]
  + P-value: 5.0309817823062075e-05

+ Interpretation:
  + 1) The Chi-squared statistic of 25.0 with a P-value of 5.03e-05 indicates that the observed distribution differs significantly from the expected (uniform) distribution at the 0.05 significance level. Therefore, we reject the null hypothesis and conclude that the colors are not equally preferred.
  + 2) Standardized residuals help identify which specific categories (colors) contribute most to the overall Chi-squared statistic. The standardized residuals are:

>||Green|Yellow|Purple|Blue|Red|
>|--|--|--|--|--|--|
>|S.Residuals|0.0|3.536|-3.536|0.0|0.0|

  + A standardized residual value indicates how many standard deviations the observed frequency is from the expected frequency. Typically, a standardized residual whose absolute value exceeds 1.96 indicates a significant deviation at the 0.05 significance level.
  + Based on the standardized residuals: Colors 2 (Yellow) and 3 (Purple) have standardized residuals of approximately +3.54 and -3.54, both well beyond ±1.96. This means that these two colors deviate significantly from the expected frequencies.
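
The decision can also be framed with a critical value instead of a P-value. With 4 degrees of freedom and a significance level of 0.05, the null hypothesis is rejected when the statistic exceeds the 95th percentile of the Chi-squared distribution. Below is a minimal sketch using `scipy.stats.chi2`, with the statistic hard-coded from the results above:

```python
from scipy.stats import chi2

chi2_stat = 25.0   # statistic reported in the results
dof = 5 - 1        # number of categories minus one
alpha = 0.05

# 95th percentile of the Chi-squared distribution with 4 degrees of freedom
critical_value = chi2.ppf(1 - alpha, dof)

print(f"Critical value: {critical_value:.3f}")   # about 9.488
print("Reject H0:", chi2_stat > critical_value)
```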

## 🌱 Residual Plot

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chisquare

# Labels for the colors
labels = ['Green', 'Yellow', 'Purple', 'Blue', 'Red']

# Observed frequencies for each color
observed_frequencies = [200, 250, 150, 200, 200]

# Calculate expected frequency (uniform distribution)
sample_size = sum(observed_frequencies)
expected_frequency = sample_size / len(observed_frequencies)
expected_frequencies = [expected_frequency] * len(observed_frequencies)

# Perform Chi-squared test
chi2_stat, p_value = chisquare(observed_frequencies, expected_frequencies)

# Calculate standardized residuals
standardized_residuals = [(observed - expected) / np.sqrt(expected) for observed, expected in zip(observed_frequencies, expected_frequencies)]

# Plot standardized residuals
plt.figure(figsize=(10, 6))
bars = plt.bar(labels, standardized_residuals, color=['Green', 'Yellow', 'Purple', 'Blue', 'Red'])

# Add a horizontal line at y=0 for reference
plt.axhline(0, color='black', linewidth=0.8)

# Add a horizontal line at y=±1.96 to indicate the significance threshold
plt.axhline(1.96, color='grey', linestyle='--', linewidth=1)
plt.axhline(-1.96, color='grey', linestyle='--', linewidth=1)

# Add labels and title
plt.xlabel('Colors')
plt.ylabel('Standardized Residuals')
plt.title('Standardized Residuals for Color Preferences')
plt.xticks(rotation=45)

# Annotate bars that are significantly different
for bar in bars:
    height = bar.get_height()
    if abs(height) > 1.96:
        # Place the label above positive bars and below negative bars
        offset, valign = (3, 'bottom') if height > 0 else (-3, 'top')
        plt.annotate(f'{height:.2f}',
                     xy=(bar.get_x() + bar.get_width() / 2, height),
                     xytext=(0, offset),  # vertical offset in points
                     textcoords="offset points",
                     ha='center', va=valign)

# Show the plot
plt.tight_layout()
plt.show()


## 🌱 Step [4]: Calculate Contributions

### **Calculating contributions**

Reporting contributions to the Chi-squared statistic can provide additional insights into which categories (colors, in this case) contribute most to the observed differences from the expected distribution. This can help in understanding the nature of the deviation more precisely.

+ The contribution of each category i to the Chi-squared statistic is given by:
> Contribution_i = (Observed_i - Expected_i)^2 / Expected_i

||Green|Yellow|Purple|Blue|Red|
|--|--|--|--|--|--|
|Contributions|0.0|12.5|12.5|0.0|0.0|


In [None]:
import numpy as np
from scipy.stats import chisquare

# Labels for the colors
labels = ['Green', 'Yellow', 'Purple', 'Blue', 'Red']

# Observed frequencies for each color
observed_frequencies = [200, 250, 150, 200, 200]

# Calculate expected frequency (uniform distribution)
sample_size = sum(observed_frequencies)
expected_frequency = sample_size / len(observed_frequencies)
expected_frequencies = [expected_frequency] * len(observed_frequencies)

# Perform Chi-squared test
chi2_stat, p_value = chisquare(observed_frequencies, expected_frequencies)

# Calculate contributions to the Chi-squared statistic
contributions = [(observed - expected) ** 2 / expected for observed, expected in zip(observed_frequencies, expected_frequencies)]

# Print contributions to Chi-squared statistic
contribution_dict = dict(zip(labels, contributions))
print("Contributions to Chi-squared Statistic:", contribution_dict)
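
As a quick check, the contributions should add up to the Chi-squared statistic itself, and expressing each one as a share of the total makes the comparison between categories easier to read. The sketch below uses plain Python with illustrative variable names and assumes the same uniform expected distribution.

```python
labels = ['Green', 'Yellow', 'Purple', 'Blue', 'Red']
observed = [200, 250, 150, 200, 200]

# Uniform null hypothesis: the same expected count for every color
expected = sum(observed) / len(observed)

# Per-category contribution: (Observed - Expected)^2 / Expected
contributions = [(o - expected) ** 2 / expected for o in observed]

# The contributions sum to the Chi-squared statistic
total = sum(contributions)

# Share of the statistic attributable to each color, in percent
percentages = [100 * c / total for c in contributions]

print("Total:", total)
print("Percent of total:", dict(zip(labels, percentages)))
```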


## Step [5]: Interpretation of the result and conclusion


+ The Chi-squared test results indicate a significant difference between the observed and expected distributions of color preferences. The standardized residuals show that the colors Yellow and Purple significantly deviate from the expected frequencies.
+ Specifically, **Yellow is more preferred than expected, and Purple is less preferred than expected.** The contributions to the Chi-squared statistic indicate that the deviations in Yellow and Purple are the primary drivers of the significant result, each contributing 12.5 to the total Chi-squared statistic of 25.0.
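
The steps above (test, residuals, contributions) can be collected into one reusable function. This is only a sketch under stated assumptions: the function name `goodness_of_fit_summary` and its parameters are hypothetical and not part of the original notebook, and the expected distribution defaults to uniform unless proportions are supplied.

```python
import numpy as np
from scipy.stats import chisquare


def goodness_of_fit_summary(labels, observed, expected_props=None, critical_value=1.96):
    """Run a Chi-squared goodness-of-fit test and collect per-category diagnostics.

    expected_props: hypothesized proportions (should sum to 1); uniform if None.
    """
    observed = np.asarray(observed, dtype=float)
    k = len(observed)
    if expected_props is None:
        expected_props = np.full(k, 1 / k)
    expected_props = np.asarray(expected_props, dtype=float)

    # Expected counts share the same total as the observed counts
    expected = observed.sum() * expected_props

    chi2_stat, p_value = chisquare(observed, f_exp=expected)

    # Residuals as defined in Step [3]; contributions are their squares
    residuals = (observed - expected) / np.sqrt(expected)
    contributions = residuals ** 2

    return {
        "chi2": float(chi2_stat),
        "dof": k - 1,
        "p_value": float(p_value),
        "residuals": dict(zip(labels, residuals.round(3).tolist())),
        "contributions": dict(zip(labels, contributions.round(3).tolist())),
        "flagged": [lab for lab, r in zip(labels, residuals) if abs(r) > critical_value],
    }


# Usage with the color preference data from this section
summary = goodness_of_fit_summary(
    ['Green', 'Yellow', 'Purple', 'Blue', 'Red'],
    [200, 250, 150, 200, 200],
)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Returning a dictionary keeps the numbers needed for the reporting format in one place.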

---
# ⏰ Data set 1 (Jung WC)

## 1.1 Data: Preference for Social Media Platforms
Description: This dataset explores the preference for different social media platforms among internet users. The options include Facebook, Instagram, Twitter, Snapchat, and LinkedIn.

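+ Null Hypothesis: The distribution of social media platform preferences among internet users follows a specific expected distribution (here, a uniform distribution in which every platform is equally preferred, matching the expected frequencies computed in Step [1]).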

+ Alternative Hypothesis: There is a significant difference in the preferences for social media platforms among internet users.
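
The phrase "a specific expected distribution" leaves room for a non-uniform null hypothesis. In that case the expected counts can be passed to `chisquare` through its `f_exp` argument, and they must add up to the same total as the observed counts. The proportions below are purely hypothetical, chosen only to illustrate the call; they do not come from the dataset description.

```python
import numpy as np
from scipy.stats import chisquare

platforms = ['Facebook', 'Instagram', 'Twitter', 'Snapchat', 'LinkedIn']
observed = np.array([200, 150, 100, 250, 100])

# Hypothetical expected proportions (an assumption for illustration; must sum to 1)
expected_props = np.array([0.30, 0.25, 0.15, 0.20, 0.10])

# Convert proportions to expected counts with the same total as the observed data
expected = observed.sum() * expected_props

chi2_stat, p_value = chisquare(observed, f_exp=expected)

print("Expected counts:", dict(zip(platforms, expected.tolist())))
print("Chi-squared Statistic:", chi2_stat)
print("P-value:", p_value)
```

The exercise cells in this section use equal expected counts, which corresponds to calling `chisquare` without `f_exp`.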

## 1.2 Data contingency table

|Facebook|Instagram|Twitter|Snapchat|LinkedIn|
|--|--|--|--|--|
|200|150|100|250|100|

## Step [1] Get the data and contingency table

In [None]:
import numpy as np

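# Observed counts for each platform, in the order of the table above
sample_size = 800
observed_frequencies = [200, 150, 100, 250, 100]  # Observed frequencies for each platform
expected = sample_size / len(observed_frequencies)  # Expected count per platform if all are equally preferred

expected_frequency = [expected] * len(observed_frequencies)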

print(observed_frequencies)
print(expected_frequency)

## Step [2] Perform Chi-squared test of Goodness of fit

In [None]:
# Perform chi-squared test
from scipy.stats import chisquare



## Step [3] Calculate Standardized Residuals

## 🔎 Generate Residual Plot

## Step [4] Calculate Contributions

## Step [5] Interpret the result and draw a conclusion

🔎 Write your interpretation here, based on the result.

---
# ⏰ Data set 2 (Sohn HS)

## 2.1 Description:

This dataset investigates the preference for different music genres among a group of listeners. The genres include Pop, Rock, Hip-Hop, Jazz, and Classical.

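+ Null Hypothesis: The distribution of music genre preferences among listeners is consistent with a specific expected distribution (here, a uniform distribution in which every genre is equally preferred, matching the expected frequencies computed in Step [1]).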
+ Alternative Hypothesis: There is a notable difference in music genre preferences among listeners.

## 2.2 Data contingency table

|Pop|Rock|Hip-Hop|Jazz|Classical|
|--|--|--|--|--|
|250|200|150|300|300|

## Step [1] Run the code below to get the data, observed frequency (e.g., contingency table)

In [None]:
import numpy as np

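# Observed counts for each genre, in the order of the table above
sample_size = 1200
observed_frequencies = [250, 200, 150, 300, 300]  # Observed frequencies for each genre
expected = sample_size / len(observed_frequencies)  # Expected count per genre if all are equally preferred

expected_frequency = [expected] * len(observed_frequencies)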

print(observed_frequencies)
print(expected_frequency)


## Step [2] Perform Chi-squared test of Goodness of fit

## Step [3] Calculate Standardized Residuals

## Step [4] Calculate Contributions

## Step [5] Interpret the result and draw a conclusion

🔎 Write your interpretation here, based on the result.

---
# ⏰ Data set 3 (Choi JM)

## 3.1 Smartphone Brand Preferences
Description: This dataset examines the preference for different smartphone brands among consumers. The brands include Apple, Samsung, Huawei, Xiaomi, and OnePlus.

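+ Null Hypothesis: The distribution of smartphone brand preferences among consumers conforms to a specific expected distribution (here, a uniform distribution in which every brand is equally preferred, which is the default assumed by `chisquare`).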
+ Alternative Hypothesis: There exists a significant difference in smartphone brand preferences among consumers.

## 3.2 Data contingency table

|Apple|Samsung|Huawei|Xiaomi|OnePlus|
|--|--|--|--|--|
|350|250|200|400|300|

## Step [1] Run the code below to get the data, observed frequency (e.g., contingency table)

In [None]:
import numpy as np

# Observed counts for each brand, in the order of the table above
sample_size = 1500
observed_frequencies = [350, 250, 200, 400, 300]  # Observed frequencies for each brand
expected_frequency = sample_size / len(observed_frequencies)  # Expected count per brand if all are equally preferred

# Perform chi-squared test
from scipy.stats import chisquare
chi2_stat, p_value = chisquare(observed_frequencies)

print("Chi-squared Statistic:", chi2_stat)
print("P-value:", p_value)


## Step [2] Perform Chi-squared test of Goodness of fit

## Step [3] Calculate Standardized Residuals

## Step [4] Calculate Contributions

## Step [5] Interpret the result and draw a conclusion

🔎 Write your interpretation here, based on the result.

---
The End