# Prepare synthetic data

- Zella King
- 2022-09-30

Our goal is to develop a Bayesian model to predict a patient's probability of being discharged in the next 24 hours. 

Here are three initial steps I think we need to do: 

 1. Decide on our "prior" probability and express it as a distribution 
 2. Identify some data we can use for the "likelihood". This can be synthetic to start with
 3. Develop a posterior distribution from the prior and likelihood. 

In this example I'm using terminology and methods from here: https://allendowney.github.io/ThinkBayes2/chap06.html

In [5]:
import os
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import poisson, lognorm
import sqlalchemy as sa

In [None]:
%matplotlib inline

## Prior odds

Previously we found, by interrogating some synthetic data, that a patient's probability of discharge in the next 24 hours was 0.1111. Let's call this a probability of 1/9.

We could express this as odds. For every 1 patient discharged, 8 remain. Therefore the odds of discharge are 1:8, or 0.125.


In [83]:
def odds(p):
    """Convert a probability to odds in favour."""
    return p / (1 - p)

odds(1/9)

def prob(o):
    """Convert odds in favour to a probability."""
    return o / (o + 1)
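
As a quick check, the two helpers should be inverses of each other. This minimal sketch redefines them so that it runs on its own, and confirms that a probability of 1/9 corresponds to odds of 1:8 and back again.

```python
import math

def odds(p):
    """Convert a probability to odds in favour."""
    return p / (1 - p)

def prob(o):
    """Convert odds in favour to a probability."""
    return o / (o + 1)

prior_odds = odds(1 / 9)        # odds of 1:8
round_trip = prob(prior_odds)   # back to a probability of 1/9

print(math.isclose(prior_odds, 1 / 8), math.isclose(round_trip, 1 / 9))
```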

## Likelihood


In probability form, Bayes' theorem is this:

$ P(H | D) =\frac{P(H)P(D|H)}{P(D)}$

It can be rewritten in terms of odds (Bayes' rule). We can express the odds in favour of discharge as:

$ \text{odds}(\text{discharge} \mid \text{data}) = \text{odds}(\text{discharge}) \times \frac{P(\text{data} \mid \text{discharge})}{P(\text{data} \mid \text{not discharged})}$
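
The odds form follows from writing Bayes' theorem once for each hypothesis and dividing one by the other, which cancels $P(D)$. A short derivation, writing $\bar{H}$ for "not $H$" (notation introduced here for illustration only):

$ \frac{P(H | D)}{P(\bar{H} | D)} = \frac{P(H)P(D|H) / P(D)}{P(\bar{H})P(D|\bar{H}) / P(D)} = \frac{P(H)}{P(\bar{H})} \times \frac{P(D|H)}{P(D|\bar{H})}$

The left-hand side is the posterior odds, the first factor on the right is the prior odds, and the second factor is the likelihood ratio.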

We will read in the dummy data to get the likelihood. Note that this table of synthetic data has been adjusted so that older people tend to have a longer time to discharge.

In [55]:
sqlite_engine = sa.create_engine('sqlite:///../../data/dummy.db')
df = pd.read_sql_query("SELECT id, hours_to_discharge, department, age, pulse from discharges_age_adjusted", sqlite_engine)

# Let N be the number of patients observed
N = df.shape[0]

# Let X be the number of patients who were discharged in 24 hours
X = df[df.hours_to_discharge <= 24].shape[0]
print("Number of people discharged in 24 hours: " + str(X))
print("Total number of people observed: " + str(N))

Number of people discharged in 24 hours: 32
Total number of people observed: 444
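
For comparison with the prior, the two counts printed above give the discharge rate within this particular table. The sketch below is a sanity check under the assumption that those counts are copied correctly; it is not part of the model. The rate comes out at roughly 0.072, which is lower than the 1/9 prior taken from the earlier synthetic data.

```python
X, N = 32, 444                            # counts printed above
base_rate = X / N                         # share discharged within 24 hours
base_odds = base_rate / (1 - base_rate)   # the same quantity expressed as odds

print(round(base_rate, 4), round(base_odds, 4))
```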


Let's group age into two categories (over 65 or not), flag which patients were actually discharged within 24 hours, and cross-tabulate the two:

In [65]:
# Flag patients over 65 in a new column, keeping the original age column intact
df['over65'] = df.age > 65
df['dischargein24'] = df.hours_to_discharge <= 24
# Count patients in each combination of discharge status and age group
pivot = pd.crosstab(df['dischargein24'], df['over65'])
pivot

dischargein24,over65=False,over65=True
False,233,179
True,26,6
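
Each row of this table can be turned into a conditional proportion by dividing by the row total. The sketch below hard-codes the four counts from the cross-tabulation so that it runs without the database; the helper name `p_over65_given` is hypothetical.

```python
# Counts copied from the cross-tabulation: counts[dischargein24][over65]
counts = {
    False: {False: 233, True: 179},
    True: {False: 26, True: 6},
}

def p_over65_given(discharged):
    """Share of patients over 65 within one discharge group."""
    row = counts[discharged]
    return row[True] / (row[False] + row[True])

p_over65_dis = p_over65_given(True)        # 6 / 32
p_over65_not_dis = p_over65_given(False)   # 179 / 412

print(p_over65_dis, p_over65_not_dis)
```

These two proportions are the numerator and denominator of the likelihood ratio for a patient over 65.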


## Posterior

Let's say we are looking at one patient of the 444 and this person is over 65. We want to know the odds that this person will be discharged in the next 24 hours. These posterior odds are given by this equation:

$ \text{odds}(\text{discharge} \mid \text{over65}) = \text{odds}(\text{discharge}) \times \frac{P(\text{over65} \mid \text{discharge})}{P(\text{over65} \mid \text{not discharged})}$

$P(\text{over65} \mid \text{discharged})$ is:

In [90]:
likelihood_gt65_given_dis = df[(df.dischargein24) & (df.over65)].shape[0] / df[df.dischargein24].shape[0]
print(likelihood_gt65_given_dis)

0.1875


$P(\text{over65} \mid \text{not discharged})$ is:

In [91]:
# Despite the "lte65" in its name, this variable holds P(over65 | not discharged)
likelihood_lte65_given_dis = df[~(df.dischargein24) & (df.over65)].shape[0] / df[~df.dischargein24].shape[0]
print(likelihood_lte65_given_dis)

0.4344660194174757


The likelihood ratio, given by $\frac{P(\text{over65} \mid \text{discharge})}{P(\text{over65} \mid \text{not discharged})}$, is:

In [92]:
likelihood_gt65_given_dis/likelihood_lte65_given_dis

0.43156424581005587

In [82]:
prior_odds = 1/8
likelihood_ratio = likelihood_gt65_given_dis/likelihood_lte65_given_dis
post_odds = prior_odds * likelihood_ratio
post_odds

0.053945530726256984

The posterior odds for a patient over 65 are about 0.054, less than half the prior odds of 0.125.
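
To read the posterior odds as a probability, they can be passed back through `prob`. The sketch below repeats the calculation from the counts in the cross-tabulation and, as a cross-check, computes the same posterior with the probability form of Bayes' theorem, assuming the 1/9 prior. Both routes should agree, at a probability of roughly 0.051.

```python
def prob(o):
    """Convert odds in favour to a probability."""
    return o / (o + 1)

# Odds form: prior odds times likelihood ratio
prior_odds = 1 / 8
likelihood_ratio = (6 / 32) / (179 / 412)   # counts from the cross-tabulation
post_odds = prior_odds * likelihood_ratio
post_prob = prob(post_odds)

# Probability form: P(H | D) = P(H) P(D | H) / P(D)
prior = 1 / 9
numerator = prior * (6 / 32)
evidence = numerator + (1 - prior) * (179 / 412)
post_prob_direct = numerator / evidence

print(round(post_prob, 4), round(post_prob_direct, 4))
```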

## Using Allen Downey functions

In [86]:
index = ['prior']
table = pd.DataFrame(index = index)
table['odds'] = 1/8
table['prob'] = prob(table['odds'])
table

Unnamed: 0,odds,prob
prior,0.125,0.111111


In [87]:
index = ['lte65', 'gt65']
table = pd.DataFrame(index = index)
table['odds'] = 1/8
table['prob'] = prob(table['odds'])
table



Unnamed: 0,odds,prob
lte65,0.125,0.111111
gt65,0.125,0.111111


Adding the likelihood we computed earlier:

In [89]:
# First entry is P(over65 | discharged), second is P(over65 | not discharged)
table['likelihood'] = [likelihood_gt65_given_dis, likelihood_lte65_given_dis]
table

Unnamed: 0,odds,prob,likelihood
lte65,0.125,0.111111,0.1875
gt65,0.125,0.111111,0.434466


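Calculating the posterior odds for each age group, by multiplying the prior odds by that group's likelihood ratio $\frac{P(\text{age group} \mid \text{discharge})}{P(\text{age group} \mid \text{not discharged})}$:

In [93]:
# Both inputs are P(over65 | ...); their complements give the 65-and-under row
p_dis, p_not = likelihood_gt65_given_dis, likelihood_lte65_given_dis
table['posterior odds'] = table['odds'] * [(1 - p_dis) / (1 - p_not), p_dis / p_not]
table

Unnamed: 0,odds,prob,likelihood,posterior odds
lte65,0.125,0.111111,0.1875,0.179587
gt65,0.125,0.111111,0.434466,0.053946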
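
An alternative layout, in the style of the Bayes tables used in Think Bayes, puts the two hypotheses on the rows and works in probabilities rather than odds. This is a sketch rather than the notebook's own method: the row labels and the column names `prior`, `likelihood`, `unnorm` and `posterior` are assumptions chosen for illustration, and the counts are copied from the cross-tabulation.

```python
import pandas as pd

# One row per hypothesis about the patient
table = pd.DataFrame(index=["discharged", "not discharged"])
table["prior"] = [1 / 9, 8 / 9]
table["likelihood"] = [6 / 32, 179 / 412]   # P(over65 | hypothesis)

# Multiply prior by likelihood, then normalise so the posteriors sum to 1
table["unnorm"] = table["prior"] * table["likelihood"]
table["posterior"] = table["unnorm"] / table["unnorm"].sum()

# The ratio of the two posteriors recovers the posterior odds
post_odds = table.loc["discharged", "posterior"] / table.loc["not discharged", "posterior"]
print(table)
print(post_odds)
```

The ratio of the two posterior probabilities should match the posterior odds obtained from the odds form of Bayes' rule.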
