## **ONE-VARIABLE ($\chi^2$) TEST**

### **19.1** SURVEY OF BLOOD TYPES

#### **A Test for Qualitative (Nominal) Data**

#### **One-Variable Versus Two-Variable**

- Definition:
    + One-Variable ($\chi^2$) Test: Evaluates whether observed frequencies for a single qualitative variable are adequately described by the hypothesized or expected frequencies.
    + Goodness-of-fit: An indication of how well the observed frequencies fit the frequencies expected under the hypothesized distribution.

### **19.2** STATISTICAL HYPOTHESES

#### **Null Hypothesis**

- In this type of test, the null hypothesis takes the form of an assumed or expected probability/frequency for each category of the qualitative variable.

#### **Other Examples**

#### **Alternative Hypothesis**

- As in every other test, the alternative hypothesis states the contrary of the null hypothesis: something that is not attributable to chance is occurring in the underlying population.

#### **Progress Check 19.1**

(a) $H_0: P_A = P_B = \frac{1}{2}$ <br>
Where $P_A$ is the probability/frequency of candidate A receiving votes and $P_B$ is that of candidate B.

(b) $H_0: P_{north} = P_{south} = P_{west} = P_{east} = \frac{1}{4}$ <br>
Where $P_{direction}$ is the probability that the migratory birds fly in a particular direction.

(c) $H_0: P_{cc-\text{day of week}} = \frac{1}{7} \approx 0.14$ <br>
Where $P_{cc-\text{day of week}}$ is the probability that a crime is committed on a particular day of the week.

(d) $H_0: P_{cc-\text{weekend}} = \frac{2}{7},\; P_{cc-\text{weekday}} = \frac{5}{7}$

### **19.3** DETAILS: CALCULATING $\chi^2$

- Definition:
    + Expected Frequency ($f_e$): The hypothesized frequency for each category, if the null hypothesis is true.
    + Observed Frequency ($f_o$): The observed frequency for each category.

<center><b>EXPECTED FREQUENCY (ONE-VARIABLE $\chi^2$ TEST)</b></center> <br>
<center>$\Large \it f_e = \text{(expected proportion)(total sample size)}$</center>
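As a minimal sketch of this formula, the snippet below converts hypothesized proportions into expected frequencies. The helper name `expected_frequencies` is an assumption made for illustration; the proportions are those of Progress Check 19.1(d), and the sample size of 140 is borrowed from Review Question 19.9.

```python
from fractions import Fraction

def expected_frequencies(proportions, sample_size):
    """Return f_e = (expected proportion)(total sample size) for each category."""
    # The hypothesized proportions must cover every category, so they sum to 1
    if abs(sum(proportions) - 1) > 1e-9:
        raise ValueError("expected proportions must sum to 1")
    return [float(p * sample_size) for p in proportions]

# Illustration only: weekend vs. weekday proportions applied to n = 140 crimes
weekend_weekday = [Fraction(2, 7), Fraction(5, 7)]
print(expected_frequencies(weekend_weekday, 140))  # [40.0, 100.0]
```

Exact fractions are used here only to keep the arithmetic free of rounding; ordinary floats work as well.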

#### **Evaluating Discrepancies**

- In any kind of test, it is most unlikely that a random sample, because of its inevitable variability, will exactly reflect the characteristics of its population. Variability is therefore always taken into account when statistical hypotheses are tested. The crucial question is whether the discrepancies between observed and expected frequencies are small enough to be regarded as a common outcome, if the null hypothesis is true. If so, the null hypothesis is retained. If the discrepancies are large enough to qualify as a rare outcome, the null hypothesis is rejected.

#### **Computing $\chi^2$**

<center><b>$\chi^2$ RATIO</b></center> <br>
<center>$\Large \chi^2 = \sum{\frac{(f_o - f_e)^2}{f_e}}$</center>
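The ratio can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the notebook's own helper: the function name `chi_square` is an assumption, and the observed frequencies are those of Progress Check 19.2 with equal expected proportions.

```python
def chi_square(observed, expected):
    """Sum (f_o - f_e)^2 / f_e over all categories."""
    return sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))

observed = [25, 10, 5, 25, 10, 15]               # f_o for the six categories
n = sum(observed)                                # total sample size, 90
expected = [n / len(observed)] * len(observed)   # equal proportions: 15 per category

chi2 = chi_square(observed, expected)
print(round(chi2, 2))  # 23.33
```

Each squared discrepancy is divided by its own expected frequency, so a gap of 10 counts for more when only 15 observations were expected than it would if 150 were expected.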

### **19.4** TABLE FOR THE $\chi^2$ DISTRIBUTION

<center><b>DEGREES OF FREEDOM (ONE-VARIABLE $\chi^2$ TEST)</b></center> <br>
<center>$\Large \it \text{df} = c - 1$</center>

Where:
- c: the total number of categories of the qualitative variable.

#### **Lose One Degree of Freedom**

- In terms of degrees of freedom, one proportion is never free to vary: once the other c - 1 proportions are fixed, it must be adjusted so that all c proportions sum to 1, because the total probability across the categories is always 1.
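A short sketch makes the lost degree of freedom concrete. The helper `last_proportion` is a hypothetical name; the three proportions come from the four-direction hypothesis in Progress Check 19.1(b).

```python
def last_proportion(free_proportions):
    """Given the c - 1 freely chosen proportions, return the one that is forced."""
    remaining = 1 - sum(free_proportions)
    if remaining < 0:
        raise ValueError("the free proportions already exceed 1")
    return remaining

free = [0.25, 0.25, 0.25]       # north, south, west are chosen freely
print(last_proportion(free))    # 0.25, so east is not free to vary
print(len(free))                # 3, which is df = c - 1 for c = 4 categories
```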

### **19.5** $\chi^2$ TEST

#### **Progress Check 19.2**

(a)

In [None]:
observed_frequencies = [25, 10, 5, 25, 10, 15]
categories = len(observed_frequencies)
expected_frequency = 90 / categories
chisquare = calc_chisquare(observed_frequencies, expected_frequency)
chisquare

In [None]:
kwargs = dict(
    report_type="summary",
    # Raw strings keep LaTeX backslashes such as \chi from being read as escape sequences
    test_type=r"Chi-Square $\chi^2$ for One Variable",
    research_problem="""
                     Are desires for love, wealth, power, health, fame, and
                     family happiness equally distributed among college students?
                     """,
    h0=r"""
       $H_0: P_{love} = P_{wealth} = P_{power} = P_{health} = P_{fame} = P_{fam. hap.} \approx 0.167$
       """,
    h1=r"""
       $H_1: H_0\; is\; false$
       """,
    levelof_significance=0.05,
    crit_value=r"$\chi^2 \ge 11.07$",
    act_value=r"$\chi^2 = 23.33$",
    decision=False,
    hypothesis="",
    interpretation=""
)
SR.Report(**kwargs)

(b) p < 0.001
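The approximate p-value is found by comparing the obtained $\chi^2$ with the tabled critical values for the relevant degrees of freedom. The sketch below does this for df = 5; the critical values are the standard tabled ones, while the function name `approximate_p` and the second test value of 3.00 are assumptions made for illustration.

```python
# Critical chi-square values for df = 5, as listed in a standard chi-square table
CRITICAL_DF5 = {0.10: 9.24, 0.05: 11.07, 0.01: 15.09, 0.001: 20.52}

def approximate_p(chi2, critical_values):
    """Bracket the p-value using a {level of significance: critical value} table."""
    # Check the strictest level first, then progressively more lenient ones
    for level in sorted(critical_values):
        if chi2 >= critical_values[level]:
            return f"p < {level}"
    return f"p > {max(critical_values)}"

print(approximate_p(23.33, CRITICAL_DF5))  # p < 0.001
print(approximate_p(3.00, CRITICAL_DF5))   # p > 0.1
```

Only large values of $\chi^2$ lead to rejection, so the comparison is always against the upper tail of the distribution.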

#### **$\chi^2$ Test Is Nondirectional**

## **TWO-VARIABLE $\chi^2$ TEST**

### **19.6** LOST LETTER STUDY

- Definition:
    + Two-Variable $\chi^2$ Test: Evaluates whether observed frequencies reflect the independence of two qualitative variables.

### **19.7** STATISTICAL HYPOTHESES

#### **Null Hypothesis**

- For the two-variable $\chi^2$ test, the null hypothesis always makes a statement about the lack of relationship between two qualitative variables in the underlying population.
- Example:
<center>$H_0$: "Qualitative variable 1" and "qualitative variable 2" are independent</center>

#### **Alternative Hypothesis**

- The alternative hypothesis is the contrary to the null hypothesis, which claims the existence of a relationship between the variables.
<center>$H_1$: $H_0$ is false</center> <br>

#### **Progress Check 19.3**

(a) $H_0$: The educational level of adults and their preference on right-to-abortion legislation are independent.

(b) $H_0$: Adults' attendance at marital therapy and their parents' marital status are independent.

(c) $H_0$: The performances of employees and their work schedules are independent.

### **19.8** DETAILS: CALCULATING $\chi^2$

#### **Finding Expected Frequencies from Proportions**

#### **Finding Expected Frequencies from Totals**

<center><b>EXPECTED FREQUENCY (TWO-VARIABLE $\chi^2$ TEST)</b></center> <br>
<center>$\Large \it f_e = \frac{\text{(row total)(column total)}}{\text{grand total}}$</center>

Where:
- $\it \text{row total}$: the total frequency for the row occupied by a particular cell
- $\it \text{column total}$: the total frequency for the column occupied by that cell
- $\it \text{grand total}$: the total for all columns (or rows)
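The totals-based formula can be sketched without NumPy as follows. This is an illustration under stated assumptions, not the notebook's `calc_expfrequency`: the name `expected_from_totals` is hypothetical, and the table is the one used in Progress Check 19.4.

```python
def expected_from_totals(table):
    """Return f_e = (row total)(column total) / grand total for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

observed = [[10, 20], [30, 30], [60, 30], [80, 40]]
expected = expected_from_totals(observed)
print(expected)  # [[18.0, 12.0], [36.0, 24.0], [54.0, 36.0], [72.0, 48.0]]

# Chi-square sums (f_o - f_e)^2 / f_e over every cell of the table
chi2 = sum((f_o - f_e) ** 2 / f_e
           for o_row, e_row in zip(observed, expected)
           for f_o, f_e in zip(o_row, e_row))
print(round(chi2, 2))  # 15.28
```

The result has the same shape as the observed table, so no transposition is needed before the cell-by-cell comparison.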

### **19.9** TABLE FOR THE $\chi^2$ DISTRIBUTION 

<center><b>DEGREES OF FREEDOM (TWO-VARIABLE $\chi^2$ TEST)</b></center> <br>
<center>$\Large \it \text{df} = (c - 1)(r - 1)$</center>

Where:
- c: the number of categories for the column variable.
- r: the number of categories for the row variable.

#### **Explanation for Degrees of Freedom**

### **19.10** $\chi^2$ TEST

#### **Progress Check 19.4**

(a)

In [80]:
data = np.array([[10, 20], [30, 30], [60, 30], [80, 40]])
exp_frequency = calc_expfrequency(data)
df = calc_df(data)
chisquare = calc_chisquare(data, exp_frequency.T)
chisquare, df

(15.28, 3.0)

(b) p < 0.01

### **19.11** ESTIMATING EFFECT SIZE

- Definition:
    + Squared Cramer's Phi Coefficient ($\phi^2_c$): A very rough estimate of the proportion of explained variance (or predictability) between two qualitative variables.

#### **Squared Cramer's Phi Coefficient ($\phi^2_c$)**

<center><b>PROPORTION OF EXPLAINED VARIANCE (TWO-VARIABLE $\chi^2$)</b></center> <br>
<center>$\Large \phi^2_c = \frac{\chi^2}{n(k - 1)}$</center>

Where:
- $\chi^2$: the obtained value of the statistically significant two-variable $\chi^2$ test.
- $n$: the total sample size.
- $k$: the smaller of c (the number of columns) or r (the number of rows).

- As the definition of $\chi^2$ in the formula suggests, $\phi^2_c$ is estimated only when the value of $\chi^2$ indicates a statistically significant result.

- Cohen's guidelines for estimating the effect size: <br>
![image.png](attachment:3885ab3e-8696-4bf8-bb42-a47b277965ad.png) <br>
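A minimal sketch of the effect-size formula, assuming the $\chi^2$ value has already been found to be statistically significant. The function name `squared_cramers_phi` is hypothetical; the inputs are the values from Progress Check 19.4 (a 4 x 2 table with n = 300).

```python
def squared_cramers_phi(chi2, n, n_rows, n_cols):
    """Return chi2 / (n * (k - 1)), where k is the smaller of rows and columns."""
    k = min(n_rows, n_cols)
    if k < 2:
        raise ValueError("a two-variable table needs at least 2 rows and 2 columns")
    return chi2 / (n * (k - 1))

phi2 = squared_cramers_phi(15.28, n=300, n_rows=4, n_cols=2)
print(round(phi2, 2))  # 0.05
```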

#### **Progress Check 19.5**

In [86]:
# calc_cramersphi works out chi-square, the sample size, and k from the table itself
cramersphi = calc_cramersphi(data)
cramersphi

0.05

### **19.12** ODDS RATIOS

- Definition:
    + Odds Ratio (OR): Indicates the relative occurrence of one value of the dependent variable across the two categories of the independent variable.

#### **Calculating the Odds Ratio**

<center><b>ODDS</b></center> <br>
<center>$\Large \it \text{odds of each dependent variable} = \frac{\text{frequency of occurrence}}{\text{frequency of nonoccurrence}}$</center>
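The odds formula and the resulting odds ratio can be sketched as below. The function name `odds` and the row-wise layout (occurrences first, nonoccurrences second) are assumptions for illustration; the frequencies are those used in Progress Check 19.6.

```python
def odds(occurrences, nonoccurrences):
    """Return frequency of occurrence divided by frequency of nonoccurrence."""
    return occurrences / nonoccurrences

table = [[69, 61], [51, 19]]  # each row: [occurrences, nonoccurrences]
row_odds = [odds(yes, no) for yes, no in table]

# An odds ratio compares the odds for one category with the odds for the other
ratio = row_odds[1] / row_odds[0]
print(round(ratio, 2))      # 2.37
print(round(1 / ratio, 2))  # 0.42
```

Swapping numerator and denominator gives the reciprocal, which is why the same table yields both 2.37 and 0.42.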

#### **Progress Check 19.6**

(a) The odds that a returned letter comes from campus are 2.37 times the odds that it comes from off-campus.

In [96]:
data = [[69, 61], [51, 19]]
odds = calc_odds(data)
ratio = odds[1] / odds[0]
ratio

2.3729977116704806

(b) The odds that a returned letter comes from off-campus are 0.42 times the odds that it comes from campus.

In [97]:
ratio = odds[0] / odds[1]
ratio

0.4214079074252652

### **19.13** REPORTS IN THE LITERATURE

![image.png](attachment:9d938f33-705f-4b8b-a246-a3c1090d1d67.png)

#### **Progress Check 19.7**

There is evidence that susceptibility to poison oak is related to a person's hair color [$\chi^2$(3, n = 300) = 15.28, p < 0.01, $\phi^2_c$ = 0.05].
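The bracketed summary follows a fixed pattern: degrees of freedom and sample size, the obtained $\chi^2$, the approximate p-value, and the effect size if one was estimated. A small sketch of that pattern is shown below; the function name `format_chi_square_report` and the plain-text spellings `chi2` and `phi2_c` are assumptions made for illustration.

```python
def format_chi_square_report(df, n, chi2, p_text, phi2=None):
    """Build a plain-text summary in the order df, n, chi-square, p, effect size."""
    report = f"chi2({df}, n = {n}) = {chi2:.2f}, {p_text}"
    # The effect size is optional because it is estimated only for significant results
    if phi2 is not None:
        report += f", phi2_c = {phi2:.2f}"
    return report

print(format_chi_square_report(3, 300, 15.28, "p < 0.01", 0.05))
# chi2(3, n = 300) = 15.28, p < 0.01, phi2_c = 0.05
```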

### **19.14** SOME PRECAUTIONS

#### **Avoid Dependent Observations**

- The valid use of $\chi^2$ requires that observations be independent of one another.

- The total observed frequencies should never exceed the total number of subjects.

#### **Avoid Small Expected Frequencies**

- A conservative rule specifies that all expected frequencies be 5 or more.

- Avoid small expected frequencies by using larger sample sizes.
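The conservative rule is easy to check before running the test. The sketch below flags any cell whose expected frequency falls under 5; the function name `small_expected_cells` and the second table are hypothetical, while the first table holds the expected frequencies for the data of Progress Check 19.4.

```python
def small_expected_cells(expected, minimum=5):
    """Return (row, column, f_e) for every cell whose expected frequency is below minimum."""
    return [(i, j, f_e)
            for i, row in enumerate(expected)
            for j, f_e in enumerate(row)
            if f_e < minimum]

expected = [[18.0, 12.0], [36.0, 24.0], [54.0, 36.0], [72.0, 48.0]]
print(small_expected_cells(expected))      # []

hypothetical = [[3.5, 6.5], [10.5, 19.5]]  # made-up table with one small cell
print(small_expected_cells(hypothetical))  # [(0, 0, 3.5)]
```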

#### **Avoid Extreme Sample Size**

- An unduly small sample size produces a test that tends to miss even a seriously false null hypothesis, while an excessively large sample size produces a test that tends to detect small, unimportant departures from the null hypothesis.

### **19.15** COMPUTER OUTPUT

#### **Progress Check 19.8**

(a) 30

(b) (32.50, 25.00, 42.50) or (26.25, 50.00, 23.75)

(c) 0.07

### **Review Questions**

#### **19.9**

(a) The null hypothesis is retained.

In [108]:
data = [17, 21, 22, 18, 23, 24, 15]
exp_frequency = np.sum(data) / len(data)
chisquare = calc_chisquare(data, exp_frequency)
chisquare

3.4

(b) p > 0.1

(c) There is no evidence that crimes are more likely to be committed on some days of the week than on others [$\chi^2$(6, n = 140) = 3.4, p > 0.1].

In [109]:
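# One-variable case: apply chi-square / (n * (k - 1)) directly, with k = 7 categories
cramersphi = np.round(chisquare / (np.sum(data) * (len(data) - 1)), decimals=2)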
cramersphi

0.0

#### **19.10**

(a) The null hypothesis is rejected.

In [112]:
data = [33, 70]
exp_frequency = [103 / 2]
chisquare = calc_chisquare(data, exp_frequency)
chisquare

13.29

(b) p < 0.001

(c) There is evidence to suggest that elderly people are more likely to pass away after rather than before the Harvest Moon Festival [$\chi^2$(1, n = 103) = 13.29, p < 0.001, $\phi^2_c$ = 0.13].

In [113]:
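# One-variable case: apply chi-square / (n * (k - 1)) directly, with n = 103 and k = 2
cramersphi = np.round(chisquare / (103 * (2 - 1)), decimals=2)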
cramersphi

0.13

#### **19.11**

(a) The null hypothesis is retained; that is, there is no evidence that the coin is biased.

In [114]:
data = [30, 20]
exp_frequency = 25
chisquare = calc_chisquare(data, exp_frequency)
chisquare

2.0

(b) p > 0.1.

#### **19.12**

In [115]:
data = [27, 53 - 27]
exp_frequency = 5.3 * np.array([2, 8])
chisquare = calc_chisquare(data, exp_frequency)
chisquare

31.72

The null hypothesis is rejected.

#### **19.13**

(a) Yes, each of the observed frequencies is a multiple of 10, suggesting that the dataset is fictitious and was constructed to simplify calculations.

(b)

In [122]:
data = [[30, 10, 10, 0], [30, 10, 10, 0], [40, 40, 20, 0], [60, 20, 20, 0], [40, 20, 40, 100]]
exp_frequency = calc_expfrequency(data).T
df = calc_df(data)
chisquare = calc_chisquare(data, exp_frequency)
chisquare

220.0

(c)

In [128]:
cramersphi = calc_cramersphi(data)
cramersphi

0.15

#### **19.14**

In [48]:
data = [[299, 280], [186, 526]]
exp_frequency = calc_expfrequency(data)
chisquare = calc_chisquare(data, exp_frequency.T)
df = calc_df(data)
cramersphi = calc_cramersphi(data)
odds = calc_odds(data)

In [49]:
df, chisquare, cramersphi, odds

(1.0, 88.64, 0.07, array([1.06785714, 0.35361217]))

In [50]:
odds[0] / odds[1]

3.019854070660522

(a) There is significant evidence of a relationship between survival and the passengers' accommodations on the Titanic. Thus the decision is to reject the null hypothesis.

(b) The value of $\phi^2_c$ is 0.07, which suggests a medium-strength relationship between the two qualitative variables, according to Cohen's guidelines.

(c) Based on the observed frequencies, the odds of surviving the Titanic were about 3 times as large for cabin passengers as for steerage passengers.

#### **19.15**

In [51]:
data = [[72, 71, 25, 25], [28, 29, 75, 75]]
exp_frequency = calc_expfrequency(data)
chisquare = calc_chisquare(data, exp_frequency.T)
df = calc_df(data)
cramersphi = calc_cramersphi(data)
odds = calc_odds(data)

In [52]:
df, chisquare, cramersphi, odds

(3.0, 86.62, 0.22, array([1.01408451, 0.96551724]))

In [53]:
exp_frequency

array([[48.25, 51.75],
       [48.25, 51.75],
       [48.25, 51.75],
       [48.25, 51.75]])

(a) The addresses of the letters are related to their return rates.

(b) p < 0.001

(c) The value of $\phi^2_c$ is 0.22, which suggests a strong relationship.

(d) There's significant evidence to suggest that the return rates of letters are strongly associated with their intended addresses [$\chi^2$(3, n = 400) = 86.62, p < 0.001, $\phi^2_c$ = 0.22].

#### **19.16**

#### **19.17**

#### **19.18**

### Imports and Functions

In [1]:
import numpy as np
import pandas as pd
import os
import sys

sys.path.append(os.path.join(sys.path[0], os.path.pardir))

import utils.StatsReport as SR

In [2]:
def calc_chisquare(observed_frequencies, expected_frequency):
    observed_frequencies = np.array(observed_frequencies)
    chisquare = np.sum( (observed_frequencies-expected_frequency)**2 / expected_frequency)
    
    return np.round(chisquare, decimals=2)

In [3]:
def calc_expfrequency(data):
    data = np.array(data)
    col_totals = data.sum(axis=0)  # axis=0 sums down each column
    row_totals = data.sum(axis=1)  # axis=1 sums across each row
    expected_frequencies = list()
    for col_total in col_totals:
        expected_frequencies.append(np.round( (row_totals*col_total)/data.sum() , decimals=2))
    
    # The result has shape (columns, rows): callers must transpose it to match the data array
    return np.array(expected_frequencies)

In [4]:
def calc_df(data):
    data = np.array(data)
    no_cols = data.size / np.size(data, axis=0)  # cells per row = number of columns
    no_rows = data.size / no_cols                # number of rows
    
    return (no_rows - 1)*(no_cols - 1)

In [5]:
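# Requires the obtained exp_frequency to be transposed to the shape of data
def calc_chisquare(data, exp_frequency):
    # Coerce both inputs to arrays so that plain lists and scalars are accepted too
    data, exp_frequency = np.asarray(data, dtype=float), np.asarray(exp_frequency, dtype=float)
    return np.round(np.sum( (data-exp_frequency)**2 / exp_frequency ), decimals=2)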

In [30]:
def calc_cramersphi(data):
    exp_frequency = calc_expfrequency(data)
    chisquare = calc_chisquare(data, exp_frequency.T)
    # k is the smaller of the number of rows and the number of columns
    k = min(np.shape(data))

    return np.round( chisquare / (np.sum(data)*(k - 1)) , decimals=2)

In [23]:
def calc_odds(data):
    data = np.array(data)
    odds = data[:, 0] / data[:, 1]
        
    return odds