# Confidence intervals for Odds Ratios

We learned about the odds ratio when we learned about relative risk, accuracy, sensitivity and specificity.  In this notebook, the odds ratio gets some special attention.  

The odds ratio is what is called log-normal. A random variable (or its probability distribution) $Y$ is called log-normal if $\log(Y)$ is normal. For the odds ratio, this means the natural log of the sample odds ratio is approximately normal when the sample is large. We know a lot about normal distributions, and so far very little about log-normal distributions. We'll exploit our existing knowledge of normal distributions to find confidence intervals for the odds ratio.
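
To make the definition concrete, here is a small simulation sketch. The seed, the sample size, and the parameters `mu` and `sigma` are arbitrary choices for illustration only; they do not come from any data set used in this notebook.

```python
import numpy as np

# Arbitrary illustration values (assumptions, not from the course data)
rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5

# Y is log-normal: it is built by exponentiating normal draws
y = np.exp(rng.normal(loc=mu, scale=sigma, size=100_000))

# Taking the natural log recovers the normal draws
log_y = np.log(y)

# Y itself is right-skewed (mean above the median), while log(Y) is symmetric
print("mean and median of Y:     ", y.mean(), np.median(y))
print("mean and median of log(Y):", log_y.mean(), np.median(log_y))
```

The mean of `y` should sit noticeably above its median, while the mean and median of `log_y` should nearly agree, which is the symmetry we rely on when building the interval on the log scale.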

Here's a rough outline of the steps:

|               |             |         |             |               |            |       |
|:--|:-:|---|:-:|---|:-:|---|
|Find Odds Ratio|$\rightarrow$|Take natural log |$\rightarrow$|Becomes Normal|$\rightarrow$|Use normality to find C% interval for log-odds|
|               |             |         |             |              |             |    $\downarrow$                |
| $(e^{lb}, e^{ub})$              |$\leftarrow$ |No longer normal |  $\leftarrow$  | Put bounds into $f(x) = e^x$   |$\leftarrow$  |$(lb, ub)$ |

If your two way table summarizing the data is 

|   |  |
|:-:|:-:|
|a  |b|
|c  |d|


Then the odds ratio is $\displaystyle OR = \frac{ad}{bc}$, and the standard error of its natural log (the log-odds) is $\displaystyle  SE =  \sqrt{ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} }$.

The confidence interval for the log-odds is $\ln\left(OR \right) \pm z^* SE$.

The value of $z^*$ depends on the level of confidence. For 95% confidence, we want to capture the middle 95% of the standard normal distribution between $-z^*$ and $+z^*$. The value of $z^*$ that does that is close to 1.9599, or about 1.96.


Then $e^{\ln\left(OR \right) \pm z^* SE} = \left(e^{\ln\left(OR \right) - z^* SE}, e^{\ln\left(OR \right) + z^* SE}  \right)$ is the confidence interval for the odds ratio.
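
The steps above can be sketched as one small function. This is only an illustration under stated assumptions: the name `odds_ratio_ci` and the counts in the example call are hypothetical, and the table is assumed to have rows `(a, b)` and `(c, d)` with no zero cells.

```python
import numpy as np
from scipy import stats

def odds_ratio_ci(a, b, c, d, C=0.95):
    """Return (OR, lower, upper) for a 2x2 table with rows (a, b) and (c, d)."""
    odds_ratio = (a * d) / (b * c)
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)       # standard error of ln(OR)
    z_star = stats.norm.ppf(1 - (1 - C) / 2)  # positive critical value
    log_or = np.log(odds_ratio)
    lb, ub = log_or - z_star * se, log_or + z_star * se
    # Exponentiate the bounds to return to the odds-ratio scale
    return odds_ratio, np.exp(lb), np.exp(ub)

# Hypothetical counts, chosen only to illustrate the call
estimate, lower, upper = odds_ratio_ci(20, 10, 5, 15)
print(estimate, lower, upper)
```

Returning the estimate together with the bounds makes it easy to report all three numbers at once.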

Odds ratio confidence intervals are NOT symmetric around their estimate. The confidence interval for the log-odds is symmetric, but that symmetry is destroyed by the exponential function. So, it's important to also report the actual value of the odds ratio when describing the confidence interval.

In [43]:
# Here's some code that will likely appear near the top of every homework or lecture this semester.

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import scipy.stats as stats

import scipy

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=np.VisibleDeprecationWarning)

# To add text, referred to as comments, to a code cell
# just put a hashtag at the beginning of the comment
# Comments are ignored by the computer when executing the code

mort22 = Table.read_table("SmallMort2022.csv")

Find a 95% confidence interval for the odds ratio from this azt2 table.  

In [44]:
azt = Table.read_table("AIDS.csv" )

azt2 = azt.drop("race").group("azt", sum).relabel(1,"Yes").relabel(2, "No")

azt2

azt,Yes,No
no,44,124
yes,25,145


In [45]:
#recall this from before

OR = 3.33

In [46]:
SE = (1/44 + 1/124 + 1/25 + 1/145)**0.5
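
As a quick hand check of this cell, plugging the four counts from the table into the standard error formula gives the following (all values rounded):

$$SE = \sqrt{\frac{1}{44} + \frac{1}{124} + \frac{1}{25} + \frac{1}{145}} \approx \sqrt{0.0227 + 0.0081 + 0.0400 + 0.0069} = \sqrt{0.0777} \approx 0.279$$

So, with $z^* \approx 1.96$, the margin of error on the log scale is roughly $1.96 \times 0.279 \approx 0.55$.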

For 95% confidence we use $z^* \approx 1.96$, as described above.

In [47]:
# Work on the log scale first, using the OR and SE defined above
lb = np.log(OR) - 1.96*SE
ub = np.log(OR) + 1.96*SE

(np.exp(lb), np.exp(ub))

(1.928357858475813, 5.7504368036567355)

In [48]:
np.exp((lb, ub))

array([ 1.92835786,  5.7504368 ])
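
The asymmetry mentioned earlier can be checked directly on this interval. The sketch below reuses the estimate and counts from the cells above and compares differences with ratios.

```python
import numpy as np

OR = 3.33                                  # estimate used in the cells above
SE = (1/44 + 1/124 + 1/25 + 1/145)**0.5    # same standard error as above
lower, upper = np.exp(np.log(OR) + np.array([-1.96, 1.96]) * SE)

# Distances from the estimate differ on the odds-ratio scale ...
print("OR - lower:", OR - lower)
print("upper - OR:", upper - OR)

# ... but the ratios agree, because the interval is symmetric on the log scale
print("OR / lower:", OR / lower)
print("upper / OR:", upper / OR)
```

The upper bound is farther from the estimate than the lower bound, yet both bounds are the same multiplicative factor away from it.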

For other levels of confidence, use the inverse norm function (`scipy.stats.norm.ppf`) and plug in $\frac{1-C}{2}$, the area in the lower tail. This returns $-z^*$, so flip the sign to get $z^*$.

In [49]:
# -z^* for 95% confidence (lower-tail area typed in directly)
scipy.stats.norm.ppf(0.025)

-1.9599639845400545

In [50]:
# -z^* for 95% confidence (lower-tail area computed from C = 0.95)
scipy.stats.norm.ppf((1-0.95)/2)

-1.959963984540054

In [51]:
# -z^* for 99% confidence

C = 0.99

scipy.stats.norm.ppf((1-C)/2)

-2.5758293035489004

In [52]:
# -z^* for 90% confidence

C = 0.90

scipy.stats.norm.ppf((1-C)/2)

-1.6448536269514729
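
Since `ppf` returns the negative value here, it can be convenient to wrap the sign flip in a helper. The function name `z_star` below is a hypothetical choice for illustration, not something defined elsewhere in the course materials.

```python
from scipy import stats

def z_star(C):
    """Positive critical value z* for confidence level C (0 < C < 1)."""
    # (1 - C)/2 is the lower-tail area, so ppf returns -z*; negate it
    return -stats.norm.ppf((1 - C) / 2)

for level in (0.90, 0.95, 0.99):
    print(level, z_star(level))
```

The printed values should be the positive versions of the three outputs above.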

In [53]:
asthma = Table.read_table("asthma.csv")

asthma.select("Asthma", "Father")

asthma.pivot("Asthma", "Father")

Father,No,Yes
No,11329,185
Yes,0,37


Use this data to compute the odds ratio approximating how many times more likely a child is to have asthma if his father had or has it.  

In [54]:
OR = 37*11329/(185*0)

ZeroDivisionError: division by zero

This data has a problem, but as in real life, this problem is *our* opportunity.  It gives me an excuse to introduce the so-called Plus-2 adjustment (also known as the Haldane-Anscombe correction).  How does it work?  Simply add $\frac{1}{2}$ to every cell in the 2x2 two-way table, which adds 2 to the table total.
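
Here is a minimal sketch of the adjustment applied to a whole table at once. It assumes the counts are arranged as rows `(a, b)` and `(c, d)` with `a = 37`, `b = 185`, `c = 0`, `d = 11329`; the variable names are illustrative only.

```python
import numpy as np

# Observed counts from the pivot table above, as rows (a, b) and (c, d)
observed = np.array([[37, 185], [0, 11329]])

# Add 1/2 to every cell; the table total grows by 2
adjusted = observed + 0.5

# Unpack the adjusted cells and form the odds ratio a*d / (b*c)
(a, b), (c, d) = adjusted
adjusted_or = (a * d) / (b * c)

print(adjusted)
print(adjusted_or)
```

Because no cell is zero after the adjustment, the odds ratio and its standard error are both finite.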

In [35]:
OR = 37.5*11329.5/(185.5*0.5)

OR

4580.66037735849

Wow, so if your father had asthma, your odds of having asthma are approximately 4600 times higher (compared to children whose fathers do not and did not have asthma).  Is that reasonable?  Let's see what a 95% confidence interval says.

In [36]:
z_star_array = np.array([-1.96, 1.96])

SE = (1/11329.5 +1/185.5 + 1/0.5 + 1/37.5)**0.5

np.exp(np.log(OR) + SE*z_star_array)


array([   280.21377348,  74880.15036684])

In [65]:
## Let's add a conf interval to our TwoWaySummary function

def TwoWaySummary(x, C=0.95):
    """Print summary statistics and an odds ratio confidence interval.

    x must be a 2x2 table arranged T+ F-, F+ T-.
    C must be a positive number less than 1.
    """
    a = x[[0],[0]][0]
    b = x[[0],[1]][0]
    c = x[[1],[0]][0]
    d = x[[1],[1]][0]
    
    z_star = scipy.stats.norm.ppf((1-C)/2)
    z_star_array = np.array([z_star, -1*z_star])
    SE = (1/a + 1/b + 1/c + 1/d)**0.5
    
    c_int = np.exp(np.log((a*d)/(b*c)) + SE*z_star_array)
    
    # With the T+ F-, F+ T- layout: sensitivity = a/(a+b), specificity = d/(c+d)
    print(f"sensitivity = {a/(a+b)}\nspecificity = {d/(c+d)}\nrelative risk = {a*(c+d)/(c*(a+b))}\nodds ratio = {a*d/(b*c)}\naccuracy = {(a+d)/(a+d+c+b)}")
    print(f"The {C*100}% Confidence Interval for the odds ratio is {c_int}.")


In [66]:
obs_asthma_tab = np.array([[37, 185], [0, 11329 ]])  # T+ F- then F+ T-


TwoWaySummary(obs_asthma_tab)

sensitivity = 0.16666666666666666
specificity = 1.0
relative risk = inf
odds ratio = inf
accuracy = 0.9839840706432343
The 95.0% Confidence Interval for the odds ratio is [ nan  inf].


The zero cell causes a division by zero inside the function, so the relative risk and odds ratio come out as `inf` and the confidence interval is unusable.  NumPy flags the offending lines (the `SE`, `c_int`, and `print` lines) with runtime warnings.


In [67]:
obs_asthma_tab = np.array([[37.5, 185.5], [0.5, 11329.5 ]])  # T+ F- then F+ T-


TwoWaySummary(obs_asthma_tab)

sensitivity = 0.1681614349775785
specificity = 0.9999558693733451
relative risk = 3810.5381165919284
odds ratio = 4580.66037735849
accuracy = 0.9839002856400935
The 95.0% Confidence Interval for the odds ratio is [   280.22816037  74876.3060254 ].


In [68]:
TwoWaySummary( obs_asthma_tab , C=0.99)

sensitivity = 0.1681614349775785
specificity = 0.9999558693733451
relative risk = 3810.5381165919284
odds ratio = 4580.66037735849
accuracy = 0.9839002856400935
The 99.0% Confidence Interval for the odds ratio is [  1.16473932e+02   1.80147173e+05].


In [71]:
## Let's add a conf interval to our TwoWaySummary function
## Let's add something to automatically use the Plus 2 correction
## and print a warning about it.

def TwoWaySummary(x, C=0.95):
    """Print summary statistics and an odds ratio confidence interval.

    x must be a 2x2 table arranged T+ F-, F+ T-.
    C must be a positive number less than 1.
    """
    a = int(x[[0],[0]][0])
    b = int(x[[0],[1]][0])
    c = int(x[[1],[0]][0])
    d = int(x[[1],[1]][0])
    
    # Apply the correction only when some cell is exactly 0
    if 0 in (a, b, c, d):
        print("At least one entry is 0; using the Plus 2 correction")
        a = a + 0.5
        b = b + 0.5
        c = c + 0.5
        d = d + 0.5
    
    z_star = scipy.stats.norm.ppf((1-C)/2)
    z_star_array = np.array([z_star, -1*z_star])
    SE = (1/a + 1/b + 1/c + 1/d)**0.5
    
    c_int = np.exp(np.log((a*d)/(b*c)) + SE*z_star_array)
    
    print(f"sensitivity = {a/(a+b)}\nspecificity = {d/(c+d)}\nrelative risk = {a*(c+d)/(c*(a+b))}\nodds ratio = {a*d/(b*c)}\naccuracy = {(a+d)/(a+d+c+b)}")
    print(f"The {C*100}% Confidence Interval for the odds ratio is {c_int}.")


In [72]:
obs_asthma_tab = np.array([[37, 185], [0, 11329 ]])  # T+ F- then F+ T-


TwoWaySummary(obs_asthma_tab)

At least one entry is 0; using the Plus 2 correction
sensitivity = 0.1681614349775785
specificity = 0.9999558693733451
relative risk = 3810.5381165919284
odds ratio = 4580.66037735849
accuracy = 0.9839002856400935
The 95.0% Confidence Interval for the odds ratio is [   280.22816037  74876.3060254 ].
