In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
import random

In [11]:
# Importing the test and train data (20/80 split), and the full dataset
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
full_data = pd.read_csv('cleaned_data/diabetes_cleaned.csv')

# Removing the first column
train = train.iloc[:, 1:]
test = test.iloc[:, 1:]
full_data = full_data.iloc[:, 1:]

In [13]:
# Following the method used in KEH_LTW modeling notebook to create a train dataset 
# that has an equal distribution of (0) and (1) response
# 1. Determine the observations with (1) response. Let this be n observations.
#   MODIFICATION: pull the (1) response from the full dataset to increase the number of observations
# 2. Randomly sample n observations with (0) response, and combine in a new train dataset with (1) response observations.
# 3. Randomly shuffle the new train dataset

admitted = full_data.loc[full_data.readmitted == 1, :]
n = admitted.shape[0]

not_readmitted = full_data.loc[full_data.readmitted == 0, :].sample(n)

train1 = pd.concat([admitted, not_readmitted])
train1 = train1.sample(frac = 1)

In [15]:
train1.readmitted.value_counts()

1    6293
0    6293
Name: readmitted, dtype: int64
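
The balancing steps above can also be written as a small reusable function. This is only a sketch: the helper name `balance_by_undersampling`, the `seed` argument, and the synthetic `demo` frame are assumptions added for illustration, not part of the original notebook. Fixing `random_state` makes the sampled (0) rows reproducible across runs.

```python
import numpy as np
import pandas as pd

def balance_by_undersampling(df, target="readmitted", seed=0):
    """Keep every (1) row plus an equally sized random sample of (0) rows."""
    positives = df.loc[df[target] == 1]
    negatives = df.loc[df[target] == 0].sample(n=len(positives), random_state=seed)
    # Shuffle so the two classes are interleaved, then reset the index
    combined = pd.concat([positives, negatives])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

# Synthetic stand-in for the real data (hypothetical values): 100 ones, 900 zeros
demo = pd.DataFrame({
    "readmitted": np.repeat([1, 0], [100, 900]),
    "time_in_hospital": np.arange(1000) % 14 + 1,
})
balanced = balance_by_undersampling(demo)
print(balanced["readmitted"].value_counts())
```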

#### Summary of KEH LTW Modeling

Variables that have a relationship with `readmitted`, from full dataset EDA:
- `age`: as age increases, the proportion of individuals readmitted to the hospital increases, and the overall number of observations increases as well. This trend persists until age 75, after which the number of hospital visits drops and the ratio of readmitted : not readmitted evens out. This variable is already binned by nature of our original dataset.
    - -> because different age ranges have different distributions of readmitted : not readmitted, look into making each age bin its own predictor, if this isn't happening already
- `time_in_hospital`: as time in hospital increases, the number of observations increases (up until 3 days) and then starts to decrease. The ratio of readmitted : not readmitted gradually increases as `time_in_hospital` increases.
    - -> the distribution of readmitted : not readmitted differs for each value of `time_in_hospital`. Similarly, look into making each day bin its own predictor, if it isn't already
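- `num_of_changes`: there doesn't seem to be much difference in the ratio of readmitted : not readmitted across values
- `number_inpatient`: large difference in the ratio of readmitted : not readmitted across values!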

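One way to make each `age` bin and each `time_in_hospital` value its own predictor is dummy coding with `pd.get_dummies`. The sketch below uses a tiny made-up frame; the age labels shown are hypothetical and may not match the labels in the cleaned dataset. Inside a statsmodels formula, wrapping a variable as `C(age)` has a similar effect.

```python
import pandas as pd

# Tiny made-up frame; the age labels are hypothetical
demo = pd.DataFrame({
    "age": ["[50-60)", "[60-70)", "[70-80)", "[60-70)"],
    "time_in_hospital": [2, 3, 7, 3],
})

# One indicator column per level; drop_first avoids a redundant reference column
dummies = pd.get_dummies(demo, columns=["age", "time_in_hospital"], drop_first=True)
print(dummies.columns.tolist())
```

With `drop_first=True`, the first level of each variable becomes the baseline that the other indicators are compared against.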
Interactions
- as `age` increases, `time_in_hospital` increases
- as `time_in_hospital` increases, `num_of_changes` also increases

Models
1. `readmitted ~ time_in_hospital*age + num_of_changes + number_inpatient`
2. `readmitted ~ num_of_changes*time_in_hospital + number_inpatient + age`
3. `readmitted ~ num_of_changes*time_in_hospital + age*time_in_hospital + number_inpatient`
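
As a rough illustration of how a formula like model 1 could be fit, the sketch below assumes a logistic regression via `logit`, since `readmitted` is binary; the fitting call used in the KEH_LTW notebook is not shown here. The data are synthetic, `age` is treated as a numeric decade index purely for the demo, and the coefficients used to generate the response are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm  # same alias as the import cell above

# Synthetic stand-in data (hypothetical values, not the diabetes dataset)
rng = np.random.default_rng(0)
n = 800
demo = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=n),
    "age": rng.integers(0, 10, size=n),            # decade index, an assumption
    "num_of_changes": rng.integers(0, 5, size=n),
    "number_inpatient": rng.integers(0, 4, size=n),
})
# Made-up relationship so the response is noisy but not random
log_odds = -1.5 + 0.05 * demo["time_in_hospital"] + 0.5 * demo["number_inpatient"]
demo["readmitted"] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# `a*b` in a formula expands to a + b + a:b (both main effects plus the interaction)
formula = "readmitted ~ time_in_hospital*age + num_of_changes + number_inpatient"
result = sm.logit(formula, data=demo).fit(disp=0)
print(result.params)
```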

Thoughts
- More variable selection
    - diag_1, diag_2, diag_3
    - num_medications
    - num_lab_procedures
- Make the bins in `age` and `time_in_hospital` their own predictors
- Evaluate performance of the existing KEH-LTW models with k-fold cross-validation
- Look to see if there are other interactions or transformations
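
A minimal sketch of the k-fold cross-validation idea, written with NumPy and statsmodels only. The helper `kfold_accuracy`, the 0.5 classification threshold, the choice of accuracy as the metric, and the synthetic data are all assumptions for illustration; the real evaluation may call for a different metric.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm  # same alias as the import cell above

def kfold_accuracy(df, formula, target="readmitted", k=5, seed=0):
    """Fit a logit model on k-1 folds and score accuracy on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(df)), k)
    scores = []
    for i in range(k):
        held_out = df.iloc[folds[i]]
        training = df.iloc[np.concatenate([folds[j] for j in range(k) if j != i])]
        fit = sm.logit(formula, data=training).fit(disp=0)
        # Classify as readmitted when the predicted probability is at least 0.5
        predicted = (fit.predict(held_out).to_numpy() >= 0.5).astype(int)
        scores.append(float((predicted == held_out[target].to_numpy()).mean()))
    return scores

# Synthetic stand-in data (hypothetical values, not the diabetes dataset)
rng = np.random.default_rng(1)
n = 600
demo = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=n),
    "number_inpatient": rng.integers(0, 4, size=n),
})
log_odds = -1.0 + 0.05 * demo["time_in_hospital"] + 0.5 * demo["number_inpatient"]
demo["readmitted"] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

scores = kfold_accuracy(demo, "readmitted ~ time_in_hospital + number_inpatient")
print(np.mean(scores))
```

Averaging the per-fold scores gives a single number for comparing the three candidate formulas on the same folds.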