# EGCI 305: Chapter 10 (Categorical Data)

Outline
> 1. [Packages](#ch10_packages)

> 2. [Chi-square goodness-of-fit test](#ch10_chi_goodness)
>    - [Example: Mendel](#ch10_ex_mendel)
>    - [Example: hour of birth](#ch10_ex_hour)

> 3. [Chi-square contingency](#ch10_chi_contingency)
>    - [Example: can products](#ch10_ex_can)
>    - [Example: gasoline marketing](#ch10_ex_gasoline)
>    - [Example: pregnancy](#ch10_ex_pregnancy)

<a name="ch10_packages"></a>

## Packages
> - **numpy** -- array manipulation
> - **matplotlib** -- plotting backend used by seaborn
> - **seaborn** -- high-level statistical visualization
> - **math** -- basic calculations such as sqrt (if not using sympy)
> - **scipy.stats** -- statistical distributions and hypothesis tests

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("Numpy version =", np.version.version)
print("Seaborn version =", sns.__version__)

import math
import scipy
print("Scipy version =", scipy.__version__)

from scipy import stats
from scipy.stats import chi2               # Chi-squared distribution
from scipy.stats import chisquare          # Chi-squared goodness-of-fit
from scipy.stats import chi2_contingency   # Chi-squared homogeneity & independence

<a name="ch10_chi_goodness"></a>

## Chi-square Goodness-of-Fit test
- **[Manual: scipy.stats.chisquare](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html)**
- This function runs the whole test (statistic and P-value) in one call **if the observed frequencies (raw data) are available**
- H<sub>0</sub> : the categorical data follow the given proportions

<a name="ch10_ex_mendel"></a>

### Example : Mendel

In [None]:
freq_obs       = np.array( [926, 288, 293, 104] )
h0_proportions = np.array( [9/16, 3/16, 3/16, 1/16] )
n        = sum(freq_obs)
freq_exp = n * h0_proportions

# The totals of freq_obs and freq_exp must agree (both = 1611). Recent SciPy versions
# raise a ValueError when the two totals differ by more than a small relative tolerance.
# freq_exp = np.array( [906.19, 302.06, 302.06, 100.69] )

freq_exp
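If only rounded expected frequencies are at hand and their total has drifted from the observed total, one option is to rescale them before calling `chisquare`. The sketch below assumes a hypothetical one-decimal rounding (`freq_exp_rounded` is made up for illustration and is not part of the original example).

```python
import numpy as np
from scipy.stats import chisquare

freq_obs = np.array([926, 288, 293, 104])

# Hypothetical expected frequencies rounded to one decimal (illustration only);
# their total is 1611.1, not the observed total of 1611
freq_exp_rounded = np.array([906.2, 302.1, 302.1, 100.7])

# Rescale so that both totals agree before running the test
freq_exp_rescaled = freq_exp_rounded * freq_obs.sum() / freq_exp_rounded.sum()

result = chisquare(freq_obs, freq_exp_rescaled)
print("Total observed  = %d"   % freq_obs.sum())
print("Total expected  = %.2f" % freq_exp_rescaled.sum())
print("Calculated chi2 = %.2f" % result.statistic)
```

Computing `freq_exp` directly as `n * h0_proportions`, as in the cell above, avoids the issue altogether.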

In [None]:
result = chisquare(freq_obs, freq_exp)
df = freq_obs.size - 1

print("Calculated chi2 = %.2f" % result.statistic)
print("df              = %d"   % df)
print("P-value         = %.3f" % result.pvalue)

In [None]:
# Critical value and P-value from the manually calculated statistic (chi2 = 1.47, df = 3)

critical = chi2.ppf(1-0.05, 3)
pvalue   = chi2.sf(1.47, 3)
print("Critical value = %.2f" % critical)
print("P-value        = %.3f" % pvalue)

<a name="ch10_ex_hour"></a>

### Example : hour of birth
- Hypothesis
    >- H<sub>0</sub> : p<sub>i</sub> = 1/24 ; for all i
    >- H<sub>1</sub> : p<sub>i</sub> $\ne$ 1/24 ; for some i

In [None]:
freq_obs = np.array( [52, 73, 89, 88, 68, 47, 58, 47, 48, 53, 47, 34, 
                      21, 31, 40, 24, 37, 31, 47, 34, 36, 44, 78, 59] )
n        = sum(freq_obs)
freq_exp = np.full(freq_obs.size, n/24)
np.around(freq_exp, 2)

In [None]:
result = chisquare(freq_obs, freq_exp)
df = freq_obs.size - 1

print("Calculated chi2 = %.2f" % result.statistic)
print("df              = %d"   % df)
print("P-value         = %.3f" % result.pvalue)

<a name="ch10_chi_contingency"></a>

## Chi-square Contingency (Homogeneity & Independence Tests)
- **[Manual: scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)**
- This function runs the whole test (statistic, P-value, df, and expected frequencies) in one call **if the observed frequencies (raw data) are available**

<a name="ch10_ex_can"></a>

### Example : can products
- Hypothesis
    >- p<sub>ij</sub> = proportion of nonconformity j in production line i
    >- H<sub>0</sub> : p<sub>1j</sub> = p<sub>2j</sub> = p<sub>3j</sub> ; for all j
    >- H<sub>1</sub> : p<sub>ij</sub> $\ne$ p<sub>kj</sub> ; for some j and some production lines i $\ne$ k

In [None]:
import pandas as pd

# Build the contingency table in one call: rows = production lines, columns = nonconformity types
df_xtab = pd.DataFrame(
    [[34, 65, 17, 21, 13],
     [23, 52, 25, 19,  6],
     [32, 28, 16, 14, 10]],
    index   = ['Line 1', 'Line 2', 'Line 3'],
    columns = ['Blemish', 'Crack', 'Location', 'Missing', 'Other'],
)
df_xtab

In [None]:
result = chi2_contingency(df_xtab)

print("Calculated chi2 = %.2f" % result.statistic)
print("df              = %d"   % result.dof)
print("P-value         = %.3f" % result.pvalue, "\n")

print("Expected freq")
print( np.around(result.expected_freq, 2) )

In [None]:
# Critical value and P-value from the manually calculated statistic (chi2 = 14.16, df = 8)

critical = chi2.ppf(1-0.05, 8)
pvalue   = chi2.sf(14.16, 8)
print("Critical value = %.2f" % critical)
print("P-value        = %.3f" % pvalue)

<a name="ch10_ex_gasoline"></a>

### Example : gasoline marketing
- Hypothesis
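    >- p<sub>ij</sub> = proportion of the population with facility condition i and pricing strategy j
    >- H<sub>0</sub> : facility condition and pricing strategy are independent (p<sub>ij</sub> = p<sub>i.</sub> p<sub>.j</sub> for all i, j)
    >- H<sub>1</sub> : facility condition and pricing strategy are not independent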

In [None]:
import pandas as pd

# Build the contingency table in one call: rows = facility conditions, columns = pricing strategies
df_xtab = pd.DataFrame(
    [[24, 15, 17],
     [52, 73, 80],
     [58, 86, 36]],
    index   = ['Substandard', 'Standard', 'Modern'],
    columns = ['Aggressive', 'Neutral', 'Nonaggressive'],
)
df_xtab

In [None]:
result = chi2_contingency(df_xtab)

print("Calculated chi2 = %.2f" % result.statistic)
print("df              = %d"   % result.dof)
print("P-value         = %.6f" % result.pvalue, "\n")

print("Expected freq")
print( np.around(result.expected_freq, 2) )
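When the test rejects independence, the individual terms (observed - expected)<sup>2</sup> / expected show which cells contribute most to the statistic. The sketch below is one way to list them; `contrib` is a name introduced here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[24, 15, 17],    # Substandard
                     [52, 73, 80],    # Standard
                     [58, 86, 36]])   # Modern

stat, pvalue, dof, expected = chi2_contingency(observed)

# Per-cell contribution to the chi-square statistic
contrib = (observed - expected) ** 2 / expected
print("Cell contributions")
print(np.around(contrib, 2))

# Position (row, column) of the largest contribution
i, j = np.unravel_index(np.argmax(contrib), contrib.shape)
print("Largest contribution at row %d, column %d" % (i, j))
```

The contributions add up to the reported statistic because no continuity correction is applied when df > 1.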

<a name="ch10_ex_pregnancy"></a>

### Example : pregnancy
- Hypothesis
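    >- p<sub>ij</sub> = proportion of the population with pregnancy outcome i and smoking status j
    >- H<sub>0</sub> : smoking status and pregnancy outcome are independent (p<sub>ij</sub> = p<sub>i.</sub> p<sub>.j</sub> for all i, j)
    >- H<sub>1</sub> : smoking status and pregnancy outcome are not independent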

In [None]:
import pandas as pd

# Build the contingency table in one call: rows = pregnancy outcomes, columns = smoking status
df_xtab = pd.DataFrame(
    [[  88,   66],
     [2542, 2963]],
    index   = ['Premature', 'Full-term'],
    columns = ['Smoker', 'Nonsmoker'],
)
df_xtab

In [None]:
### When df = 1, this function will apply Yates’ correction by default
#   (adjust each observed value by 0.5 towards the corresponding expected value)

result = chi2_contingency(df_xtab)                        # with correction
#result = chi2_contingency(df_xtab, correction=False)      # without correction

print("Calculated chi2 = %.2f" % result.statistic)
print("df              = %d"   % result.dof)
print("P-value         = %.6f" % result.pvalue, "\n")

print("Expected freq")
print( np.around(result.expected_freq, 2) )
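The effect of Yates' correction can be checked by hand: each |observed - expected| is reduced by 0.5 before squaring. The sketch below assumes every |observed - expected| exceeds 0.5, which holds for this table, and compares the manual value with both settings of `correction`.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[  88,   66],    # Premature
                     [2542, 2963]])   # Full-term

# Default call: Yates' correction is applied because df = 1
stat_corr, p_corr, dof, expected = chi2_contingency(observed)

# Same test without the correction (index 0 = statistic)
stat_raw = chi2_contingency(observed, correction=False)[0]

# Manual Yates statistic: shrink each |O - E| by 0.5 before squaring
# (valid here because every |O - E| is larger than 0.5)
yates_manual = np.sum((np.abs(observed - expected) - 0.5) ** 2 / expected)

print("chi2 with correction    = %.2f" % stat_corr)
print("chi2 without correction = %.2f" % stat_raw)
print("Manual Yates chi2       = %.2f" % yates_manual)
```

The corrected statistic is always the smaller of the two, so the correction makes the test more conservative.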

In [1]:
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed data
data = np.array([
    [202, 369, 482, 361, 811],  # Male counts
    [230, 251, 238, 164, 258]   # Female counts
])

# Calculate the Chi-squared test using scipy's chi2_contingency function
chi2_stat, p_value, df, expected = chi2_contingency(data)

print("Chi-squared Statistic:", chi2_stat)
print("Degrees of Freedom:", df)
print("P-value:", p_value)
print("Expected Frequencies:")
print(expected)

# Determine the critical value at the 5% significance level (0.95 quantile)
critical_value = chi2.ppf(0.95, df)

# Compare the chi-squared statistic to the critical value
print("Critical Value:", critical_value)
if chi2_stat > critical_value:
    print("Reject the null hypothesis - significant differences exist.")
else:
    print("Fail to reject the null hypothesis - no significant differences exist.")


Chi-squared Statistic: 131.4959157344572
Degrees of Freedom: 4
P-value: 1.864072008475466e-27
Expected Frequencies:
[[285.56149733 409.83363042 475.93582888 347.03654189 706.63250149]
 [146.43850267 210.16636958 244.06417112 177.96345811 362.36749851]]
Critical Value: 9.487729036781154
Reject the null hypothesis - significant differences exist.


In [2]:
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed data
data = np.array([
    [141, 54, 40],
    [68, 44, 51],
    [17, 11, 19]
])

# Calculate the Chi-squared test using scipy's chi2_contingency function
chi2_stat, p_value, df, expected = chi2_contingency(data)

print("Chi-squared Statistic:", chi2_stat)
print("Degrees of Freedom:", df)
print("P-value:", p_value)
print("Expected Frequencies:")
print(expected)

# Determine the critical value at the 5% significance level (0.95 quantile)
critical_value = chi2.ppf(0.95, df)

# Compare the chi-squared statistic to the critical value
print("Critical Value:", critical_value)
if chi2_stat > critical_value:
    print("Reject the null hypothesis - significant differences exist.")
else:
    print("Fail to reject the null hypothesis - no significant differences exist.")


Chi-squared Statistic: 22.373137280702615
Degrees of Freedom: 4
P-value: 0.00016889484997124787
Expected Frequencies:
[[119.34831461  57.56179775  58.08988764]
 [ 82.78202247  39.9258427   40.29213483]
 [ 23.86966292  11.51235955  11.61797753]]
Critical Value: 9.487729036781154
Reject the null hypothesis - significant differences exist.
