# Hypothesis Testing - Relationship Between Categorical Variables

### Objective
The aim is to perform statistical testing of the relationship between categorical variables using the **Chi-squared Test for Independence** and **Fisher's Exact Test**.

## 1. Test Characteristics & Comparison

### Chi-squared Test for Independence
The Chi-squared Test for Independence assesses whether two categorical variables are associated. It compares the observed cell counts of a contingency table with the counts that would be expected if the variables were independent; large discrepancies are evidence against independence. The test is widely used in fields such as sociology, psychology, and medicine.
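For reference, the comparison of observed and expected counts can be written out as follows. The notation is an assumption introduced here, not taken from the analysis itself: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $n$ is the grand total, and the table has $r$ rows and $c$ columns.

$$
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{df} = (r - 1)(c - 1)
$$

Under the null hypothesis of independence, $\chi^2$ approximately follows a Chi-squared distribution with $\text{df}$ degrees of freedom when the expected counts are not too small.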

### Fisher's Exact Test
Fisher's Exact Test is a statistical test used to determine whether there is a significant association between two categorical variables, especially when the sample is small. It is based on the hypergeometric distribution and is most commonly applied to 2×2 contingency tables.

The hypergeometric distribution gives the probability of drawing 𝑎 objects from group A and 𝑏 objects from group B when 𝑛 objects in total are drawn without replacement from a population in which 𝐾 objects belong to group A. With the row and column totals of the table held fixed, this is exactly the situation described by Fisher's Exact Test, so the distribution is used to calculate the probability of obtaining the observed data (or data at least as extreme) given that the null hypothesis is true.
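The link between the hypergeometric distribution and the test can be sketched in a few lines. This is a minimal illustration, not the implementation used by any library: the function name `fisher_two_sided_p` is hypothetical, the counts are the Snickers table reported later in this document, and the two-sided p-value is taken as the total probability of all tables (with the same margins) that are no more likely than the observed one.

```python
import numpy as np
from scipy.stats import hypergeom


def fisher_two_sided_p(table):
    """Two-sided exact p-value for a 2x2 table (illustrative helper)."""
    (a, b), (c, d) = table
    total = a + b + c + d      # population size
    row1 = a + b               # objects in the first row
    col1 = a + c               # objects in the first column

    # With margins fixed, the top-left cell can only take these values.
    k = np.arange(max(0, row1 + col1 - total), min(row1, col1) + 1)
    probs = hypergeom.pmf(k, total, row1, col1)
    p_observed = hypergeom.pmf(a, total, row1, col1)

    # Sum the probabilities of tables that are no more likely than the
    # observed one; the small tolerance guards against rounding ties.
    return float(probs[probs <= p_observed * (1 + 1e-7)].sum())


p_value = fisher_two_sided_p([[43, 30], [8, 19]])
print(f"Exact two-sided p-value: {p_value:.4f}")
```

The result should agree closely with the p-value that `scipy.stats.fisher_exact` reports for the same table.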

### Comparison of Tests
Fisher's Exact Test and the Chi-squared (χ²) Test are both used to analyze the association
between categorical variables, but there are differences in their application and assumptions.

**Key Differences:**
1. **Data Type:**
    - Chi-squared Test: Applies to contingency tables of any size that hold frequency counts for two categorical variables.
    - Fisher's Exact Test: Primarily applied to 2×2 tables, especially when the sample is small or expected cell counts are low.
2. **Distribution:**
    - Chi-squared Test: Based on the Chi-squared distribution.
    - Fisher's Exact Test: Based on the hypergeometric distribution.
3. **Assumptions:**
    - Chi-squared Test: Relies on a large-sample approximation. A common rule of thumb requires an expected count of at least 5 in each cell; some contexts use a stricter threshold such as 10.
    - Fisher's Exact Test: Places no restriction on expected cell counts.
4. **Accuracy:**
    - Fisher's Exact Test computes the p-value exactly rather than by approximation, which makes it the more reliable choice when expected cell counts are low. When the sample size is large, the two tests give approximately the same results.
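The rule of thumb on expected cell counts can be turned into a small check. The sketch below is only one possible way to encode it: the helper name `suggest_test`, the default threshold of 5, and the returned labels are assumptions made for illustration. It uses `scipy.stats.contingency.expected_freq` to obtain the expected counts.

```python
import numpy as np
from scipy.stats.contingency import expected_freq


def suggest_test(observed, min_expected=5.0):
    """Suggest a test from the smallest expected count (hypothetical helper)."""
    observed = np.asarray(observed)
    expected = expected_freq(observed)

    if expected.min() >= min_expected:
        # The large-sample approximation is considered adequate.
        return "chi-squared"
    if observed.shape == (2, 2):
        # Small expected counts in a 2x2 table: use the exact test.
        return "fisher"
    # Larger tables with small expected counts need more care.
    return "chi-squared (interpret with caution)"


print(suggest_test([[43, 30], [8, 19]]))  # all expected counts exceed 5
print(suggest_test([[3, 1], [1, 3]]))     # every expected count equals 2
```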

## 2. Formulation of Hypotheses

### Snickers Preference by Gender
**Null Hypothesis (H₀):** There is no relationship between gender and preference for Snickers.  
**Alternative Hypothesis (H₁):** There is a relationship between gender and preference for Snickers.

### Political Party Affiliation by Gender
**Null Hypothesis (H₀):** There is no relationship between gender and political party affiliation.  
**Alternative Hypothesis (H₁):** There is a relationship between gender and political party affiliation.

## 3. Justification for Test Selection

For the Snickers preference data, both the Chi-squared test and Fisher's Exact Test were applied. The Chi-squared test suits moderate to large samples, while Fisher's Exact Test is preferred for small samples or when the assumptions of the Chi-squared test are not met. The Snickers table is 2×2 with a moderate sample size (n = 100) and all expected counts above 5, so both tests are valid, and running both checks that the conclusion is robust. For the political party affiliation data, only the Chi-squared test was used: the table is 3×2, the sample is larger (n = 130), and every expected count is well above 5.

## 4. Contingency Tables and Tests

### Snickers Preference by Gender

#### Observed Data:
|                        | Boy | Girl | Sum |
|------------------------|-----|------|-----|
| Like Snickers          | 43  | 30   | 73  |
| Doesn't Like Snickers  | 8   | 19   | 27  |
| Sum                    | 51  | 49   | 100 |

#### Chi-squared Test:
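- **Test Statistic ($\chi^2$):** 5.639 (with Yates' continuity correction, which SciPy's `chi2_contingency` applies by default to 2×2 tables)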
- **P-Value:** 0.0176
- **Degrees of Freedom:** 1
- **Expected Frequencies:**

|                        | Boy  | Girl  |
|------------------------|------|-------|
| Like Snickers          | 37.23| 35.77 |
| Doesn't Like Snickers  | 13.77| 13.23 |

**Decision:** Since the p-value (0.0176) is less than the significance level (α = 0.05), we reject the null hypothesis.  
**Conclusion:** There is a statistically significant relationship between gender and preference for Snickers.
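The reported statistic can be reproduced by hand from the observed table, which makes the role of the continuity correction visible. The sketch below assumes the standard Yates adjustment (each absolute difference between observed and expected counts is reduced by 0.5); the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[43, 30],
                     [8, 19]], dtype=float)

# Expected counts under independence: (row total * column total) / n.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Yates' correction: shrink each |O - E| by 0.5 before squaring.
corrected_diff = np.abs(observed - expected) - 0.5
statistic = float((corrected_diff ** 2 / expected).sum())

# Same computation without the correction, for comparison.
uncorrected = float(((observed - expected) ** 2 / expected).sum())

dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = float(chi2.sf(statistic, dof))

print(f"Corrected chi2 = {statistic:.3f}, p = {p_value:.4f}, dof = {dof}")
print(f"Uncorrected chi2 = {uncorrected:.3f}")
```

The corrected statistic matches the value reported above; the uncorrected statistic is larger, so the correction makes the test slightly more conservative.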

#### Fisher's Exact Test:
- **Odds Ratio:** 3.404
- **P-Value:** 0.0129

**Decision:** Since the p-value (0.0129) is less than the significance level (α = 0.05), we reject the null hypothesis.  
**Conclusion:** There is a statistically significant relationship between gender and preference for Snickers.
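The odds ratio above is the sample (cross-product) odds ratio of the 2×2 table, and it can be checked directly together with the share of each gender that likes Snickers. The variable names in this short sketch are illustrative.

```python
# Cell counts from the observed table.
a, b = 43, 30   # like Snickers: boys, girls
c, d = 8, 19    # do not like Snickers: boys, girls

# Cross-product odds ratio: odds of liking Snickers for boys vs girls.
odds_ratio = (a * d) / (b * c)

# Share of each gender that likes Snickers.
share_boys = a / (a + c)
share_girls = b / (b + d)

print(f"Odds ratio: {odds_ratio:.3f}")
print(f"Boys who like Snickers: {share_boys:.1%}")
print(f"Girls who like Snickers: {share_girls:.1%}")
```

An odds ratio above 1 indicates that, in this sample, the odds of liking Snickers are higher for boys than for girls.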

### Political Party Affiliation by Gender

#### Observed Data:
|              | Male | Female | Total |
|--------------|------|--------|-------|
| Democrat     | 20   | 22     | 42    |
| Republican   | 21   | 16     | 37    |
| Independent  | 19   | 32     | 51    |
| Total        | 60   | 70     | 130   |


#### Chi-squared Test:
- **Test Statistic ($\chi^2$):** 3.335
- **P-Value:** 0.1887
- **Degrees of Freedom:** 2
- **Expected Frequencies:**

|              | Male  | Female  |
|--------------|-------|---------|
| Democrat     | 19.38 | 22.62   |
| Republican   | 17.08 | 19.92   |
| Independent  | 23.54 | 27.46   |

**Decision:** Since the p-value (0.1887) is greater than the significance level (α = 0.05), we fail to reject the null hypothesis.  
**Conclusion:** The data do not provide statistically significant evidence of a relationship between gender and political party affiliation.
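The same decision can be reached with the critical-value approach instead of the p-value. The sketch below takes the reported test statistic as given and compares it with the 95th percentile of the Chi-squared distribution; the variable names are illustrative.

```python
from scipy.stats import chi2

alpha = 0.05
statistic = 3.3351430662355046   # reported Chi-squared statistic

# Degrees of freedom for a 3 x 2 table: (rows - 1) * (columns - 1).
n_rows, n_cols = 3, 2
dof = (n_rows - 1) * (n_cols - 1)

# Reject H0 only if the statistic exceeds this critical value.
critical_value = float(chi2.ppf(1 - alpha, dof))

# Upper-tail probability of the statistic, i.e. the p-value.
p_value = float(chi2.sf(statistic, dof))

decision = "reject" if statistic > critical_value else "fail to reject"
print(f"dof = {dof}, critical value = {critical_value:.3f}, p = {p_value:.4f}")
print(f"Decision: {decision} the null hypothesis")
```

Because the statistic falls below the critical value, the two approaches agree: the null hypothesis is not rejected at α = 0.05.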

## 5. Brief Conclusions

- **Snickers Preference by Gender:**
  - The Chi-squared test and Fisher's Exact test both indicate a significant relationship between gender and preference for Snickers, with boys showing a higher preference for Snickers compared to girls.

- **Political Party Affiliation by Gender:**
  - The Chi-squared test finds no statistically significant relationship between gender and political party affiliation. This means the sample provides no evidence of an association; it does not prove that gender has no influence on party preference.

Based on these results, we conclude that gender and Snickers preference are related: both tests provide evidence against the null hypothesis at α = 0.05. For political party affiliation, the data provide no significant evidence that gender is associated with party choice, so the Chi-squared test fails to reject the null hypothesis.

```python
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact

# Define the data tables (including the marginal totals)
snickers_gender = pd.DataFrame([[43, 30, 73], [8, 19, 27], [51, 49, 100]],
                               columns=["Boy", "Girl", "Sum"],
                               index=["like Snickers", "doesn't like Snickers", "Sum"])

parties_gender = pd.DataFrame([[20, 22, 42], [21, 16, 37], [19, 32, 51], [60, 70, 130]],
                              columns=["Male", "Female", "Total"],
                              index=["Democrat", "Republican", "Independent", "Total"])

# Extract the observed values from the tables (excluding the sum rows/columns)
observed_snickers = snickers_gender.iloc[:2, :2].values
observed_parties = parties_gender.iloc[:3, :2].values

# Chi-squared Test for snickers_gender
# (2x2 table, so Yates' continuity correction is applied by default)
chi2_snickers, p_snickers, dof_snickers, expected_snickers = chi2_contingency(observed_snickers)
# Chi-squared Test for parties_gender (3x2 table, no correction applied)
chi2_parties, p_parties, dof_parties, expected_parties = chi2_contingency(observed_parties)

# Fisher's Exact Test for snickers_gender (2x2 table)
oddsratio_snickers, p_fisher_snickers = fisher_exact(observed_snickers)

# Output the results
print("Chi-squared Test for snickers vs gender:")
print(f"\tChi2 Statistic: {chi2_snickers}")
print(f"\tP-Value: {p_snickers}")
print(f"\tDegrees of Freedom: {dof_snickers}")
print("Expected Frequencies:")
print(expected_snickers)

print("\n\nFisher's Exact Test for snickers vs gender:")
print(f"\tOdds Ratio: {oddsratio_snickers}")
print(f"\tP-Value: {p_fisher_snickers}")

print("\n\nChi-squared Test for parties vs gender:")
print(f"\tChi2 Statistic: {chi2_parties}")
print(f"\tP-Value: {p_parties}")
print(f"\tDegrees of Freedom: {dof_parties}")
print("Expected Frequencies:")
print(expected_parties)

# Draw conclusions based on p-values
alpha = 0.05
conclusion_snickers_chi2 = "reject the null hypothesis" if p_snickers <= alpha else "fail to reject the null hypothesis"
conclusion_snickers_fisher = "reject the null hypothesis" if p_fisher_snickers <= alpha else "fail to reject the null hypothesis"
conclusion_parties_chi2 = "reject the null hypothesis" if p_parties <= alpha else "fail to reject the null hypothesis"

print("\n\nConclusions:")
print(f"\tChi-squared Test for snickers_gender: We {conclusion_snickers_chi2} at alpha = {alpha}")
print(f"\tFisher's Exact Test for snickers_gender: We {conclusion_snickers_fisher} at alpha = {alpha}")
print(f"\tChi-squared Test for parties_gender: We {conclusion_parties_chi2} at alpha = {alpha}")
```

```text
Chi-squared Test for snickers vs gender:
	Chi2 Statistic: 5.638561868177003
	P-Value: 0.01756961307936074
	Degrees of Freedom: 1
Expected Frequencies:
[[37.23 35.77]
 [13.77 13.23]]


Fisher's Exact Test for snickers vs gender:
	Odds Ratio: 3.404166666666667
	P-Value: 0.012892039252329731


Chi-squared Test for parties vs gender:
	Chi2 Statistic: 3.3351430662355046
	P-Value: 0.18870477294190233
	Degrees of Freedom: 2
Expected Frequencies:
[[19.38461538 22.61538462]
 [17.07692308 19.92307692]
 [23.53846154 27.46153846]]


Conclusions:
	Chi-squared Test for snickers_gender: We reject the null hypothesis at alpha = 0.05
	Fisher's Exact Test for snickers_gender: We reject the null hypothesis at alpha = 0.05
	Chi-squared Test for parties_gender: We fail to reject the null hypothesis at alpha = 0.05
```
