In [3]:
import itertools
from typing import List

### $\chi^2$ Distribution & Testing

- Assumptions: the data are a random sample from the population, and both outcome variables are categorical.  
- Hypotheses:  
  $H_0$ : Variables are statistically *independent*  
  $H_a$ : Variables are statistically *dependent*  

- Test Statistic: let $r = \text{row total}$, $c = \text{column total}$, $n = \text{total sample size}$, and $f_o = \text{observed frequency of a cell}$. The expected frequency of that cell is $f_e = \frac{r \cdot c}{n}$, and $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, summed over all cells of the table.

#### Procedure:

1. Take the frequency distribution table, and calculate expected frequency $f_e$ for each cell of the table.

In [2]:
fuji = {"high" : 45, "low" : 5}
honeycrisp = {"high" : 35, "low" : 15}


2. Calculate the $\chi^2$ statistic for the table: for each cell, square the difference between the observed frequency $f_o$ (the actual observed count) and the expected frequency $f_e$, divide by $f_e$, and sum over all cells.

3. Perform a significance test. This can be done by p-value or critical value.
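
Steps 1 and 2 can be sketched in plain Python on the apple counts from the cell above. This is a minimal sketch that assumes the table layout is rows = apple variety and columns = (high, low); the variable names are illustrative only.

```python
# Observed counts; assumed layout: rows = apple variety, columns = (high, low).
observed = [
    [45, 5],   # fuji
    [35, 15],  # honeycrisp
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Step 1: expected frequency f_e = (row total * column total) / n for each cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Step 2: sum (f_o - f_e)^2 / f_e over all cells.
chi2_stat = sum(
    (f_o - f_e) ** 2 / f_e
    for obs_row, exp_row in zip(observed, expected)
    for f_o, f_e in zip(obs_row, exp_row)
)

print(expected)   # [[40.0, 10.0], [40.0, 10.0]]
print(chi2_stat)  # 6.25
```

Every cell differs from its expected frequency by 5, so the statistic is $25/40 + 25/10 + 25/40 + 25/10 = 6.25$.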

p-value Method:  

1. Compute an exact p-value with software.  
2. If p-value $\le \alpha$, reject $H_0$  

Critical Value Method:  

1. Check the observed $\chi^2$ against the critical value table for the specific distribution.
Note that $\chi^2$ is technically a family of distributions, where the distribution changes depending on the degrees of freedom. We also have one table for each $\alpha$ value.  
$df = (r - 1)(c - 1)$, where here $r$ is the number of rows and $c$ the number of columns (not the row and column totals).
2. If $\chi^2_\text{obs} \ge \text{critical value}$, reject $H_0$.
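
Both decision rules can be sketched with the standard library for the special case of a 2x2 table, where $df = 1$ and the $\chi^2$ variable is the square of a standard normal. The helper name `chi2_p_value_df1` and the observed statistic `6.25` are illustrative assumptions; for other degrees of freedom a statistics library would be needed (for example `scipy.stats.chi2.sf`, assuming SciPy is available).

```python
import math


def chi2_p_value_df1(chi2_stat: float) -> float:
    """Upper-tail p-value for a chi-squared statistic with exactly 1 degree of freedom."""
    # With df = 1, P(X >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2_stat / 2))


n_rows, n_cols = 2, 2
df = (n_rows - 1) * (n_cols - 1)  # 1 for a 2x2 table

alpha = 0.05
critical_value = 3.841  # standard table value for df = 1, alpha = 0.05
chi2_obs = 6.25         # illustrative observed statistic

# p-value method
p_value = chi2_p_value_df1(chi2_obs)
reject_by_p_value = p_value <= alpha

# critical value method
reject_by_critical_value = chi2_obs >= critical_value

print(f"df={df}, p-value={p_value:.4f}")
print(reject_by_p_value, reject_by_critical_value)
```

The two methods always agree for the same $\alpha$, because the critical value is exactly the statistic whose p-value equals $\alpha$.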

In [4]:
class Chi2:
    def __init__(self, rows:List[List[float]], row_labels:List[str], col_labels:List[str], name="Study", row_category="Row Variable", col_category="Col Variable"):
        self._rows = rows
        self._cols = [[row[i] for row in rows] for i in range(len(rows[0]))]
        self._row_labels = row_labels
        self._col_labels = col_labels

    @staticmethod
    def fe(row_total, col_total, grand_total):
        return (row_total * col_total) / grand_total
    @staticmethod
    def chiStat(observed, expected):
        # Pearson's statistic divides by the expected frequency, not the observed one.
        return ((observed - expected) ** 2) / expected
    @staticmethod
    def ChiStat(observed:List[float], expected:List[float]):
        if len(observed) != len(expected):
            raise ValueError(f"List of observed frequencies is not the same length ({len(observed)}) as the list of expected frequencies ({len(expected)})!")
        return sum(Chi2.chiStat(o, e) for o, e in zip(observed, expected))

    def _expectedFrequencies(self):
        _row_totals = [sum(row) for row in self._rows]
        _col_totals = [sum(col) for col in self._cols]
        _grand_total = sum(_row_totals)
        return [[Chi2.fe(_row_totals[i], _col_totals[j], _grand_total) for j in range(len(row))] for i,row in enumerate(self._rows)]
    
    def PrintExpectedFrequencies(self):
        # Header row of column labels, then one line per row of expected frequencies.
        width = max(len(label) for label in self._row_labels)
        print(f"{' ' * width} | {' | '.join(self._col_labels)}")
        for label, row in zip(self._row_labels, self._expectedFrequencies()):
            cells = " | ".join(f"{value:.2f}" for value in row)
            print(f"{label:<{width}} | {cells}")

    def Statistic(self):
        # Flatten both tables in the same row-major order so the cells line up.
        observed_frequencies = itertools.chain.from_iterable(self._rows)
        expected_frequencies = itertools.chain.from_iterable(self._expectedFrequencies())
        return Chi2.ChiStat( observed=list(observed_frequencies), expected=list(expected_frequencies) )

In [5]:
sex_study = Chi2([[30, 26], [89, 75]], ["Male", "Female"], ["Usual Care", "Intervention"])
print(f"Study outcome is: fe's: {sex_study._expectedFrequencies()}, Chi2: {sex_study.Statistic():.4f}")

living_conditions_study = Chi2([[16, 18], [102, 81]], ["Alone", "With Family"], ["Usual Care", "Intervention"])
print(f"Study outcome is: fe's: {living_conditions_study._expectedFrequencies()}, Chi2: {living_conditions_study.Statistic():.4f}")

Study outcome is: fe's: [[30.29090909090909, 25.70909090909091], [88.7090909090909, 75.2909090909091]], Chi2: 0.0082
Study outcome is: fe's: [[18.48847926267281, 15.511520737327189], [99.51152073732719, 83.48847926267281]], Chi2: 0.8706


### Standardized Residuals

When the $\chi^2$ test is significant (at least one association exists), we can calculate a standardized residual for each cell, which is essentially a z-score for that cell.

We calculate the residuals $z$ by:
$z = \frac{f_o - f_e}{\sqrt{f_e (1 - p_\text{row})(1 - p_\text{column})}}$,  
where $p_\text{row} = \frac{\text{row total}}{\text{grand total}}$ and $p_\text{column} = \frac{\text{column total}}{{\text{grand total}}}$
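
The residual formula can be sketched as a small function. This is an illustrative sketch, not a library API: the function name `standardized_residuals` is hypothetical, and the table reuses the apple counts from earlier with an assumed layout of rows = variety and columns = (high, low).

```python
import math
from typing import List


def standardized_residuals(observed: List[List[float]]) -> List[List[float]]:
    """Standardized residual z for every cell of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    residuals = []
    for i, row in enumerate(observed):
        p_row = row_totals[i] / n
        z_row = []
        for j, f_o in enumerate(row):
            p_col = col_totals[j] / n
            f_e = row_totals[i] * col_totals[j] / n
            # z = (f_o - f_e) / sqrt(f_e * (1 - p_row) * (1 - p_col))
            z_row.append((f_o - f_e) / math.sqrt(f_e * (1 - p_row) * (1 - p_col)))
        residuals.append(z_row)
    return residuals


# Assumed layout: rows = (fuji, honeycrisp), columns = (high, low).
residuals = standardized_residuals([[45, 5], [35, 15]])
for z_row in residuals:
    print([round(z, 3) for z in z_row])
```

For this table every residual has magnitude 2.5 (up to floating-point rounding), positive where the observed count exceeds the expected one. In a 2x2 table the squared residual of any cell equals the $\chi^2$ statistic, here $2.5^2 = 6.25$.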

### Strength of Association

There are multiple approaches to quantifying the strength of an association.

1. Difference in (conditional) proportions.
  The conditional proportion of a cell is $\text{prop} = \frac{f_o}{\text{column total}}$, where $f_o$ is the observed frequency of the cell.
  Then simply take the difference in conditional proportions between the two cells.  
  The difference lies between -1 and 1, and its absolute value indicates the strength of association.
  TODO: finish working this out

2. Odds Ratio  
  Given a 2x2 table:  

  | row | col 1 | col 2 |
  | --- | ---   | ----  |
  | 1   | A     | B     |
  | 2   | C     | D     |

  We compute the odds ratio $\theta$ as $\theta = \frac{\text{odds row }1}{\text{odds row }2} = \frac{A/B}{C/D} = \frac{A \cdot D}{B \cdot C}$.  
  Effectively, then, we are just cross-multiplying.
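
Both measures can be sketched on a 2x2 table. This is a hedged sketch: the function names are hypothetical, the counts reuse the apple table with an assumed layout (A = 45, B = 5, C = 35, D = 15), and the difference in proportions assumes that "the two cells" are the two cells of the first row, each divided by its own column total.

```python
def odds_ratio(a: float, b: float, c: float, d: float) -> float:
    """Odds ratio theta = (A/B) / (C/D) = (A*D) / (B*C) for a 2x2 table."""
    return (a * d) / (b * c)


def difference_in_proportions(a: float, b: float, c: float, d: float) -> float:
    """Difference between the column-conditional proportions of the row-1 cells."""
    prop_col_1 = a / (a + c)  # f_o / column total for cell A
    prop_col_2 = b / (b + d)  # f_o / column total for cell B
    return prop_col_1 - prop_col_2


# Assumed layout: row 1 = (A, B) = (45, 5), row 2 = (C, D) = (35, 15).
A, B, C, D = 45, 5, 35, 15

theta = odds_ratio(A, B, C, D)
diff = difference_in_proportions(A, B, C, D)

print(f"odds ratio: {theta:.3f}")
print(f"difference in proportions: {diff:.4f}")
```

Here $\theta = \frac{45 \cdot 15}{5 \cdot 35} = \frac{675}{175} \approx 3.857$, and the difference in proportions is $45/80 - 5/20 = 0.3125$. An odds ratio of 1 (or a difference of 0) would correspond to no association.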