# Hypothesis Testing - Association Between Two Categorical Variables

### Objective
The aim is to perform statistical testing of the relationship between categorical variables and measure the strength of association.

## 1. Cramér's V, Phi and Contingency Coefficients
### Cramér's V
Cramér's V is the most popular of the chi-square-based measures of nominal association
because it is designed so that the attainable upper limit is always 1. It is based on the
Chi-Square (χ²) statistic and can be applied to contingency tables of any size. The values of
Cramér's V range from 0 to 1, where 0 indicates no association and 1 indicates a perfect
association.
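
As a minimal sketch of the definition, V = √(χ² / (n · (min(r, c) − 1))) for a table with r rows, c columns and n observations. The helper name `cramers_v` is an assumption made for illustration; the inputs are the χ² statistic and sample size of the political party table analysed later in this document.

```python
import math

def cramers_v(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramér's V from a chi-squared statistic, sample size, and table shape."""
    # Dividing by min(rows, cols) - 1 is what caps the measure at 1.
    k = min(n_rows, n_cols) - 1
    return math.sqrt(chi2 / (n * k))

# Political party table: chi2 = 3.3351..., n = 130, 3 rows x 2 columns.
v = cramers_v(3.3351430662355046, 130, 3, 2)
print(f"{v:.3f}")
```

For a table with only two rows or two columns the divisor is 1, so V reduces to √(χ²/n).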

### Phi Coefficient
The Phi coefficient (φ) is a measure of the strength of association between two binary
variables. It is also called the Yule phi or the mean square contingency coefficient. It is
analogous to the Pearson correlation coefficient but is specifically designed for 2x2
contingency tables. The value of the Phi coefficient ranges from -1 to 1, where the sign
gives the direction of the association. When it is computed as √(χ²/n), as in the code for
this analysis, only the magnitude (0 to 1) is obtained.
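
To illustrate where the sign comes from, the following sketch computes φ directly from the four cell counts of a 2x2 table [[a, b], [c, d]] as (ad − bc) / √((a+b)(c+d)(a+c)(b+d)). The function name `phi_coefficient` is hypothetical; the counts are those of the Snickers table used in this document.

```python
import math

def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Signed phi for the 2x2 table [[a, b], [c, d]]."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Snickers table: rows = like / doesn't like, columns = boy / girl.
phi = phi_coefficient(43, 30, 8, 19)
print(f"{phi:.3f}")  # 0.260

# Swapping the two columns flips the sign but keeps the magnitude.
print(f"{phi_coefficient(30, 43, 19, 8):.3f}")  # -0.260
```

The magnitude here (0.260) is larger than the 0.237 reported in the results because that figure is derived from a χ² statistic that includes a continuity correction.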

### Contingency Coefficient
The Contingency Coefficient is a measure of association between two categorical variables
in a contingency table. It is computed from the χ² statistic and the sample size as
√(χ² / (χ² + n)). A value of 0 indicates no association, and larger values indicate a
stronger association. The contingency coefficient suffers from the disadvantage that it
cannot reach 1.0: the highest it can reach in a 2×2 table is 0.707. It can reach values
closer to 1.0 in contingency tables with more categories; for example, it can reach a
maximum of about 0.866 in a 4×4 table. It should, therefore, not be used to compare
associations in different tables if they have different numbers of categories.
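
The size-dependent ceiling can be sketched as follows. Since χ² is at most n · (k − 1) with k = min(rows, columns), the largest possible coefficient is √((k − 1) / k). The function name `max_contingency_coefficient` is an assumption for illustration.

```python
import math

def max_contingency_coefficient(n_rows: int, n_cols: int) -> float:
    """Upper bound of the contingency coefficient: sqrt((k - 1) / k), k = min(r, c)."""
    k = min(n_rows, n_cols)
    return math.sqrt((k - 1) / k)

# The ceiling rises with the table size but never reaches 1.
for size in (2, 3, 4):
    print(f"{size}x{size}: {max_contingency_coefficient(size, size):.3f}")
```

Dividing an observed coefficient by this bound is one common way to make tables of different sizes more comparable, although that adjustment is not used in this analysis.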

## 2. Formulation of Hypotheses

#### Snickers Preference by Gender

**Null Hypothesis (H₀):** There is no relationship between gender and preference for Snickers.  
**Alternative Hypothesis (H₁):** There is a relationship between gender and preference for Snickers.

#### Political Party Affiliation by Gender

**Null Hypothesis (H₀):** There is no relationship between gender and political party affiliation.  
**Alternative Hypothesis (H₁):** There is a relationship between gender and political party affiliation.

#### Technical Writing Experience by Discipline

**Null Hypothesis (H₀):** There is no relationship between a student's discipline and their technical writing experience.  
**Alternative Hypothesis (H₁):** There is a relationship between a student's discipline and their technical writing experience.

## 3. Justification for Test Selection

For all datasets, the Chi-squared test of independence is appropriate because each one cross-classifies two categorical variables in a contingency table. Fisher's Exact Test is additionally applied to the Snickers preference data because it is a 2x2 table. Measures of association (Cramér's V, Phi coefficient, and Contingency Coefficient) are calculated to describe the strength of the relationship between the variables.
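
The Chi-squared test compares each observed count with the count expected if the two variables were independent. As a rough sketch of that step (standard library only, with illustrative variable names, using the counts of the Snickers table from Section 4), the expected frequencies follow from the row and column totals:

```python
observed = [[43, 30],   # Like Snickers: boy, girl
            [8, 19]]    # Doesn't like Snickers: boy, girl

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence, E_ij = (row total i) * (column total j) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```

These are the same expected frequencies that `chi2_contingency` returns for this table in the code cells at the end of the document.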

## 4. Contingency Tables and Tests

#### Snickers Preference by Gender

**Observed Data:**
|                        | Boy | Girl | Sum |
|------------------------|-----|------|-----|
| Like Snickers          | 43  | 30   | 73  |
| Doesn't Like Snickers  | 8   | 19   | 27  |
| Sum                    | 51  | 49   | 100 |


#### Political Party Affiliation by Gender

**Observed Data:**
|              | Male | Female | Total |
|--------------|------|--------|-------|
| Democrat     | 20   | 22     | 42    |
| Republican   | 21   | 16     | 37    |
| Independent  | 19   | 32     | 51    |
| Total        | 60   | 70     | 130   |


#### Technical Writing Experience by Discipline

**Observed Data:**
|              | Beginner | Intermediate | Expert | Total |
|--------------|----------|--------------|--------|-------|
| STEM         | 10       | 13           | 35     | 58    |
| Soc.Sci.     | 47       | 12           | 9      | 68    |
| HUM          | 11       | 17           | 25     | 53    |
| Total        | 68       | 42           | 69     | 179   |


## 5. Results and Interpretation

#### Snickers Preference by Gender

**Expected Frequencies:**

  |                        | Boy  | Girl  |
  |------------------------|------|-------|
  | Like Snickers          | 37.23| 35.77 |
  | Doesn't Like Snickers  | 13.77| 13.23 |

- **Chi-squared Test Statistic:** 5.639 (with Yates' continuity correction, which SciPy applies by default to 2x2 tables)
- **Chi-squared P-Value:** 0.0176
- **Degrees of Freedom:** 1
- **Fisher's Exact Test P-Value:** 0.0129
- **Phi Coefficient:** 0.237
- **Contingency Coefficient:** 0.231

**Decision:** Since the p-values from both tests are less than the significance level (α = 0.05), we reject the null hypothesis.  
**Conclusion:** There is a statistically significant relationship between gender and preference for Snickers. The Phi coefficient indicates a weak association.
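
The reported statistic of 5.639 is the value `chi2_contingency` produces with its default continuity correction for 2x2 tables. The sketch below, which uses only the standard library and hypothetical helper names, recomputes the statistic with and without that correction to show how much it matters. The p-value uses the closed form `erfc(sqrt(x / 2))`, which is valid only for 1 degree of freedom.

```python
import math

observed = [[43, 30], [8, 19]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

def chi2_2x2(yates: bool) -> float:
    """Pearson chi-squared for the table, optionally with Yates' correction."""
    total = 0.0
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            e = r * c / n
            diff = abs(observed[i][j] - e)
            if yates:
                diff = max(diff - 0.5, 0.0)  # shrink each deviation by 0.5, never below 0
            total += diff ** 2 / e
    return total

def p_value_1dof(chi2: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

for label, yates in (("corrected", True), ("uncorrected", False)):
    stat = chi2_2x2(yates)
    print(f"{label}: chi2 = {stat:.3f}, p = {p_value_1dof(stat):.4f}, "
          f"phi = {math.sqrt(stat / n):.3f}")
```

Both versions give a p-value below 0.05, so the decision is the same either way. In SciPy, passing `correction=False` to `chi2_contingency` switches the correction off.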

#### Political Party Affiliation by Gender

**Expected Frequencies:**

  |              | Male  | Female  |
  |--------------|-------|---------|
  | Democrat     | 19.38 | 22.62   |
  | Republican   | 17.08 | 19.92   |
  | Independent  | 23.54 | 27.46   |

- **Chi-squared Test Statistic:** 3.335
- **P-Value:** 0.1887
- **Degrees of Freedom:** 2
- **Cramér's V:** 0.160
- **Contingency Coefficient:** 0.158

**Decision:** Since the p-value is greater than the significance level (α = 0.05), we fail to reject the null hypothesis.  
**Conclusion:** There is insufficient statistical evidence of a relationship between gender and political party affiliation. Both Cramér's V and the Contingency Coefficient suggest a very weak association.

#### Technical Writing Experience by Discipline

**Expected Frequencies:**

  |              | Beginner | Intermediate | Expert  |
  |--------------|----------|--------------|---------|
  | STEM         | 22.03    | 13.61        | 22.36   |
  | Soc.Sci.     | 25.83    | 15.96        | 26.21   |
  | HUM          | 20.13    | 12.44        | 20.43   |

- **Chi-squared Test Statistic:** 50.217
- **P-Value:** 3.25e-10
- **Degrees of Freedom:** 4
- **Cramér's V:** 0.375
- **Contingency Coefficient:** 0.468

**Decision:** Since the p-value is less than the significance level (α = 0.05), we reject the null hypothesis.  
**Conclusion:** There is a statistically significant relationship between a student's discipline and their technical writing experience. Both Cramér's V and the Contingency Coefficient suggest a moderate association.
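
As an independent check on these figures, the statistic and both association measures can be recomputed by hand from the observed table. This is a sketch with illustrative variable names, using only the standard library. The p-value line relies on the closed form exp(−x/2) · (1 + x/2), which holds only for exactly 4 degrees of freedom.

```python
import math

observed = {
    "STEM":     [10, 13, 35],
    "Soc.Sci.": [47, 12, 9],
    "HUM":      [11, 17, 25],
}
rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
n = sum(row_totals)

# Pearson chi-squared: sum over cells of (observed - expected)^2 / expected.
chi2 = sum(
    (rows[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(rows))
    for j in range(len(col_totals))
)

dof = (len(rows) - 1) * (len(col_totals) - 1)
# Closed-form upper tail, valid for 4 degrees of freedom only.
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

cramers_v = math.sqrt(chi2 / (n * (min(len(rows), len(col_totals)) - 1)))
contingency_c = math.sqrt(chi2 / (chi2 + n))

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3e}")
print(f"Cramér's V = {cramers_v:.3f}, C = {contingency_c:.3f}")
```

The results agree with the SciPy output for this table in the code cells at the end of the document (χ² ≈ 50.22, V ≈ 0.375, C ≈ 0.468).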

## 6. Brief Conclusions

- **Snickers Preference by Gender:**
  - The Chi-squared test and Fisher's Exact Test both indicate a significant relationship between gender and preference for Snickers, with a larger share of boys (43 of 51, about 84%) than girls (30 of 49, about 61%) liking Snickers.
  - The Phi coefficient indicates a weak association between gender and Snickers preference.

- **Political Party Affiliation by Gender:**
  - The Chi-squared test does not detect a significant relationship between gender and political party affiliation, so these data provide no evidence that party preference differs by gender.
  - Both Cramér's V and the Contingency Coefficient suggest a very weak association.

- **Technical Writing Experience by Discipline:**
  - The Chi-squared test indicates a significant relationship between a student's discipline and their technical writing experience.
  - Both Cramér's V and the Contingency Coefficient suggest a moderate association between a student's discipline and their level of technical writing experience.

In [16]:
import pandas as pd
import scipy.stats as stats

# Creating the contingency table for Snickers preference by gender
table1 = pd.DataFrame([[43, 30], [8, 19]],
                      columns=["Boy", "Girl"],
                      index=["Like Snickers", "Doesn't Like Snickers"])

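# Performing the Chi-squared test
# (for a 2x2 table, chi2_contingency applies Yates' continuity correction by default)
chi2_1, p_1, dof_1, expected_1 = stats.chi2_contingency(table1)

# Performing Fisher's Exact Test
oddsratio, fisher_p_1 = stats.fisher_exact(table1)

# Calculating the Phi coefficient (magnitude only, based on the corrected statistic above)
phi = (chi2_1 / table1.values.sum()) ** 0.5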

# Calculating the Contingency Coefficient
contingency_coeff_1 = (chi2_1 / (chi2_1 + table1.values.sum())) ** 0.5

results_1 = {
    "Chi-squared Test Statistic": chi2_1,
    "Chi-squared P-Value": p_1,
    "Degrees of Freedom": dof_1,
    "Expected Frequencies": expected_1,
    "Fisher's Exact Test P-Value": fisher_p_1,
    "Phi Coefficient": phi,
    "Contingency Coefficient": contingency_coeff_1
}

results_1

{'Chi-squared Test Statistic': 5.638561868177003,
 'Chi-squared P-Value': 0.01756961307936074,
 'Degrees of Freedom': 1,
 'Expected Frequencies': array([[37.23, 35.77],
        [13.77, 13.23]]),
 "Fisher's Exact Test P-Value": 0.012892039252329731,
 'Phi Coefficient': 0.2374565616734354,
 'Contingency Coefficient': 0.23103242407056598}

In [17]:
# Creating the contingency table for political party affiliation by gender
table2 = pd.DataFrame([[20, 22], [21, 16], [19, 32]],
                      columns=["Male", "Female"],
                      index=["Democrat", "Republican", "Independent"])

# Performing the Chi-squared test
chi2_2, p_2, dof_2, expected_2 = stats.chi2_contingency(table2)

# Calculating Cramér's V
n_2 = table2.values.sum()
cramers_v_2 = (chi2_2 / (n_2 * (min(table2.shape)-1))) ** 0.5

# Calculating the Contingency Coefficient
contingency_coeff_2 = (chi2_2 / (chi2_2 + n_2)) ** 0.5

results_2 = {
    "Chi-squared Test Statistic": chi2_2,
    "P-Value": p_2,
    "Degrees of Freedom": dof_2,
    "Expected Frequencies": expected_2,
    "Cramér's V": cramers_v_2,
    "Contingency Coefficient": contingency_coeff_2
}

results_2

{'Chi-squared Test Statistic': 3.3351430662355046,
 'P-Value': 0.18870477294190233,
 'Degrees of Freedom': 2,
 'Expected Frequencies': array([[19.38461538, 22.61538462],
        [17.07692308, 19.92307692],
        [23.53846154, 27.46153846]]),
 "Cramér's V": 0.16017161628500237,
 'Contingency Coefficient': 0.15815572544877715}

In [18]:
# Creating the contingency table for technical writing experience by discipline
table3 = pd.DataFrame({
    "Beginner": [10, 47, 11],
    "Intermediate": [13, 12, 17],
    "Expert": [35, 9, 25]
}, index=["STEM", "Soc.Sci.", "HUM"])

# Performing the Chi-squared test
chi2_3, p_3, dof_3, expected_3 = stats.chi2_contingency(table3)

# Calculating Cramér's V
n_3 = table3.values.sum()
cramers_v_3 = (chi2_3 / (n_3 * (min(table3.shape)-1))) ** 0.5

# Calculating the Contingency Coefficient
contingency_coeff_3 = (chi2_3 / (chi2_3 + n_3)) ** 0.5

results_3 = {
    "Chi-squared Test Statistic": chi2_3,
    "P-Value": p_3,
    "Degrees of Freedom": dof_3,
    "Expected Frequencies": expected_3,
    "Cramér's V": cramers_v_3,
    "Contingency Coefficient": contingency_coeff_3
}

results_3

{'Chi-squared Test Statistic': 50.21749532350336,
 'P-Value': 3.2523365803358503e-10,
 'Degrees of Freedom': 4,
 'Expected Frequencies': array([[22.03351955, 13.60893855, 22.3575419 ],
        [25.83240223, 15.95530726, 26.2122905 ],
        [20.13407821, 12.43575419, 20.4301676 ]]),
 "Cramér's V": 0.37452948255895063,
 'Contingency Coefficient': 0.468062278674923}