# G-squared and Fisher's Exact Tests

Like the chi-square test, both of these tests assess whether two categorical variables in a contingency table are associated.  



In [2]:
# Here's some code that will likely appear near the top of every homework or lecture this semester.

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import scipy.stats as stats

import scipy

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=np.VisibleDeprecationWarning)

# To add text (referred to as a comment) to a code cell,
# just put a hashtag at the beginning of the comment.
# Comments are ignored by the computer when executing the code.

mort22 = Table.read_table("SmallMort2022.csv")

## G-squared test or log-likelihood test

Using the code that we already know, we can run a log-likelihood test by passing the argument `lambda_="log-likelihood"` to the `chi2_contingency` function.  


Recall this Pew Research Center data:


|     |Right/Leans Right  |  Left/Leans Left  |
|:-:  |  :-:              |  :-:              |
|Men  |2343|2073|
|Women|2444|2833|

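How is this test different from the $\chi^2$ test? Recall that $\displaystyle \chi^2 = \sum_{All\ i,\ j} \frac{\left(O_{ij}-E_{ij}\right)^2}{E_{ij}}$

The log-likelihood statistic uses the same observed counts $O_{ij}$ and expected counts $E_{ij}$, but combines them through a natural logarithm: $\displaystyle G^2 = 2 \sum_{All\ i,\ j} O_{ij} \ln\left(\frac{O_{ij}}{E_{ij}}\right)$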
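To make the two formulas concrete, here is a minimal sketch that computes both statistics by hand for the Pew table. The variable names (`observed`, `expected`, `chi2_by_hand`, `g2_by_hand`) are our own, and `expected_freq` comes from `scipy.stats.contingency`. No continuity correction is applied in this sketch, so the values can differ slightly from what `chi2_contingency` reports by default for a 2x2 table.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts from the Pew table (rows: Men, Women)
observed = np.array([[2343, 2073], [2444, 2833]])

# Expected counts under independence: (row total * column total) / grand total
expected = expected_freq(observed)

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

# G-squared statistic: 2 * sum of O * ln(O / E) over all cells
g2_by_hand = 2 * (observed * np.log(observed / expected)).sum()

print("chi-square:", chi2_by_hand)
print("G-squared: ", g2_by_hand)
```

Note that `np.log` is the natural logarithm, which is the logarithm the $G^2$ formula calls for.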



In [3]:
pew_table = np.array([[2343, 2073],[2444,2833]])

pew_table

array([[2343, 2073],
       [2444, 2833]])

In [4]:
scipy.stats.chi2_contingency(pew_table, lambda_ = "log-likelihood")

(43.489347989545195,
 4.262734591470558e-11,
 1,
 array([[ 2180.89260291,  2235.10739709],
        [ 2606.10739709,  2670.89260291]]))

The second number in the output is the p-value, $4.3 \times 10^{-11}$, which is essentially 0.  Would we get the same result if we had just run a $\chi^2$ test?

Essentially, but not exactly.  

In [91]:
scipy.stats.chi2_contingency(pew_table)

(43.460025698201854,
 4.3270900686034906e-11,
 1,
 array([[ 2180.89260291,  2235.10739709],
        [ 2606.10739709,  2670.89260291]]))
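One detail worth knowing when comparing these outputs to the formulas: for 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), which shrinks both statistics slightly. The sketch below, which uses a dictionary named `results` of our own choosing, runs each test with and without the correction so the effect is visible.

```python
import numpy as np
from scipy.stats import chi2_contingency

pew_table = np.array([[2343, 2073], [2444, 2833]])

# Extra keyword arguments that select each test
tests = {
    "chi-square": {},
    "G-squared": {"lambda_": "log-likelihood"},
}

results = {}
for name, kwargs in tests.items():
    # Index 0 of the returned result is the test statistic
    corrected = chi2_contingency(pew_table, correction=True, **kwargs)[0]
    uncorrected = chi2_contingency(pew_table, correction=False, **kwargs)[0]
    results[name] = {"corrected": corrected, "uncorrected": uncorrected}
    print(f"{name}: corrected = {corrected:.4f}, uncorrected = {uncorrected:.4f}")
```

With a sample this large the correction barely matters, but on small tables it can move a p-value across a significance threshold.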

### Small data sets

For medium-sized data sets, the log-likelihood test may be slightly more conservative than the $\chi^2$-test; that is, it can give a slightly larger p-value.  

Consider this fictional data.

In [88]:
example_table = np.array([[13,34], [14,12]])

example_table

array([[13, 34],
       [14, 12]])

In [89]:
scipy.stats.chi2_contingency(example_table)

(3.8657715434426807,
 0.04928053082953246,
 1,
 array([[ 17.38356164,  29.61643836],
        [  9.61643836,  16.38356164]]))

In [90]:
scipy.stats.chi2_contingency(example_table, lambda_="log-likelihood")

(3.8239881519535244,
 0.050523845121755243,
 1,
 array([[ 17.38356164,  29.61643836],
        [  9.61643836,  16.38356164]]))

Strictly applying a 5% level of significance, the $\chi^2$ test is significant on this data ($p \approx 0.049$) but the log-likelihood test is not ($p \approx 0.051$).  How do you know which one to run?  That decision may be left to you as the data person on a given project, or your industry may have a standard procedure that *everyone* uses.  
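The comparison can also be written out in code. This is a small sketch, with `alpha`, `p_chi2`, and `p_g2` as names of our own choosing, that applies the same 5% cutoff to both p-values from the fictional table.

```python
import numpy as np
from scipy.stats import chi2_contingency

example_table = np.array([[13, 34], [14, 12]])
alpha = 0.05  # significance level

# Index 1 of the returned result is the p-value
p_chi2 = chi2_contingency(example_table)[1]
p_g2 = chi2_contingency(example_table, lambda_="log-likelihood")[1]

for name, p in [("chi-square", p_chi2), ("log-likelihood", p_g2)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```

Two p-values this close to the cutoff carry nearly the same evidence, which is one reason to choose the test before looking at the results.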

In [17]:
azt = Table.read_table("aids.csv")

race,azt,yes,no
white,yes,14,93
white,no,32,81
black,yes,11,52
black,no,12,43


## Really small data sets

The assumptions of the $\chi^2$ test include that there are no 0 cells in the expected table AND that no more than 20% of the cells in the expected table have a value below 5.  
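These conditions can be checked directly from the expected table. The helper below is a hypothetical sketch (the name `expected_counts_ok` and its parameters are ours, not part of SciPy), and `tiny_table` is made-up data used only to show a failing case.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def expected_counts_ok(table, min_count=5, max_fraction=0.20):
    """Return True if no expected count is 0 and at most
    `max_fraction` of the expected counts fall below `min_count`."""
    expected = expected_freq(np.asarray(table))
    no_zero_cells = bool(np.all(expected > 0))
    fraction_small = float(np.mean(expected < min_count))
    return no_zero_cells and fraction_small <= max_fraction

example_table = np.array([[13, 34], [14, 12]])
tiny_table = np.array([[3, 1], [1, 4]])  # made-up data, for illustration only

print(expected_counts_ok(example_table))  # every expected count is above 5
print(expected_counts_ok(tiny_table))     # every expected count is below 5
```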

What can we do if that assumption is violated?  Really small data sets are especially prone to these issues.  

If the table is a 2x2 two-way table, we can use Fisher's Exact test, which tests a hypothesis about the table's odds ratio (OR).  

$H_0: OR = 1$

This test permits us to set an alternative.

We can use the default, which is:

$H_a: OR \not= 1$

or we can set a directional (one-sided) alternative, which is one of:

$H_a: OR > 1$
 
$H_a: OR < 1$
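In SciPy, the alternative is chosen with the `alternative` argument of `fisher_exact`: `"two-sided"` (the default) for $OR \neq 1$, `"greater"` for $OR > 1$, and `"less"` for $OR < 1$. The sketch below runs all three on the fictional table from earlier; the dictionary `p_values` is a name of our own choosing.

```python
import numpy as np
from scipy.stats import fisher_exact

example_table = np.array([[13, 34], [14, 12]])

p_values = {}
for alternative in ["two-sided", "less", "greater"]:
    # fisher_exact returns the sample odds ratio and the p-value
    odds_ratio, p_value = fisher_exact(example_table, alternative=alternative)
    p_values[alternative] = p_value
    print(f"{alternative}: odds ratio = {odds_ratio:.4f}, p-value = {p_value:.4f}")
```

The odds ratio is the same in every call; only the p-value changes with the alternative. As with any directional test, the direction should be chosen before looking at the data.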



In [95]:
scipy.stats.fisher_exact(example_table)

(0.32773109243697479, 0.042051852515520002)

The second number in the output is the p-value, but the first number is our old friend the Odds Ratio.  
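As a quick check, the odds ratio that `fisher_exact` reports is the sample cross-product ratio of the table, which we can compute by hand. The names `a`, `b`, `c`, `d`, and `odds_ratio_by_hand` below are our own labels for the four cells and the result.

```python
import numpy as np
from scipy.stats import fisher_exact

example_table = np.array([[13, 34], [14, 12]])

# Label the cells:  [[a, b],
#                    [c, d]]
(a, b), (c, d) = example_table

# Cross-product (sample) odds ratio: (a * d) / (b * c)
odds_ratio_by_hand = (a * d) / (b * c)

odds_ratio_scipy, p_value = fisher_exact(example_table)

print("by hand:", odds_ratio_by_hand)
print("scipy:  ", odds_ratio_scipy)
```

Here that is $(13 \times 12)/(34 \times 14) = 156/476 \approx 0.328$, matching the first number in the output.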