# SYD DAT 4 Project : Hospital Readmissions prediction

### Overview 

I had the opportunity to work as a software engineer at Vanderbilt University Medical Center in the US, where I was involved in biomedical research projects. As a computer science person, I find it very gratifying to help solve interesting and challenging health care problems, which is why I chose this project.

Hospital readmission occurs when a patient is re-admitted to a hospital within a short period (usually 28 or 30 days) after the initial discharge. Such readmissions are a major health care concern in many countries, including the US, the UK and Australia. They lead to increased wait times and more medical errors, which put patient safety at risk and can cause unnecessary deaths. A high readmission rate is in fact an indicator of poor quality of care at the hospital. It also puts a huge monetary burden on hospitals as well as the government. In this project, I want to identify patients who are likely to be readmitted to the hospital, so that their care providers can plan proper care and management for them.

### Goal(s) : 
Predict whether a patient will be readmitted to the hospital.

### Data set used : 

To achieve the above goal(s), I need a hospital data set that records all hospitalizations of its patients over a number of years. It should contain information such as why the patients were admitted, which department they were admitted to, how many times they were admitted, what medications they were on, what lab tests were conducted, how many days they stayed in hospital, vital signs and demographics (height, weight, age, race, blood pressure, smoking status), their electronic medical records, billing records, genetics data, etc.

Of course, such a heterogeneous dataset is hard to obtain publicly because patient data are highly confidential. There are, however, quite a few data sets that contain substantial information, and after going through some of them I decided to use the "Diabetes 130-US hospitals for years 1999-2008 Data Set".

This is a publicly available database from the Center for Clinical and Translational Research, Virginia Commonwealth University. The data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO). It contains 10 years of diabetes patient data across 130 US hospitals.

http://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

### Summary of data : 

The Python code below leads to the following observations:
 * There are 50 features and 101766 observations (hospitalizations).
 
 
 * Each row in the dataframe represents one hospital admission. On average, there are about 1.4 hospitalizations per patient (101766 admissions across 71518 unique patients; patient_nbr is the de-identified patient ID).
 
 
 * In the dataset, "readmitted" is the target variable with three classes:
 
   * `NO` means never readmitted (~54% of total dataset)
   * `<30` means readmitted within 30 days (~11% of total dataset)
   * `>30` means readmitted after 30 days (~35% of total dataset)


 * To start with a simple binary classification model, I group the `<30` and `>30` classes of the readmitted target variable into one class "1", meaning the patient was readmitted at any time. Class `NO` becomes "0", meaning the patient was never readmitted. This also gives a fairly even distribution between the two classes (~46% vs ~54% of all admissions).
 
 
 * Several patients have multiple hospitalizations, so I use only each patient's first admission and predict whether they will be readmitted or not. This leaves 71518 hospitalizations, one per unique patient.
 
 
 * The numeric features are: time_in_hospital, num_medications, num_lab_procedures, num_procedures, number_outpatient, number_inpatient, number_diagnoses.


 * The average time spent in hospital (time_in_hospital) is about 4 days, and the feature varies widely, from 1 day to 14 days.
 
 
 * The number of lab procedures conducted during the stay is 43 on average and ranges from 1 to 132.
 
 
 * The average number of medications administered is 16, with a minimum of 1 and a maximum of 81.
 
 
 * Many features have non-numeric values, e.g. race, gender, age (given as a range), medications like citoglipton and insulin, and the diagnosis codes diag_1, diag_2, diag_3. These need to be transformed before modelling.
 
 
 * The weight feature is missing for 97% of the records, so I will not include it. I am not sure it makes sense to impute that much missing data, so it is probably better to exclude the feature for the time being.
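
The missing-data check behind the weight observation can be sketched as follows. This is only an illustration: it assumes that missing values are stored as the string `?` (as they are in the UCI copy of this data set), and both the toy frame and the helper name `missing_share` are made up for the example.

```python
import pandas as pd

def missing_share(df, marker="?"):
    """Fraction of missing entries per column, counting `marker` and NaN as missing."""
    mask = df.isin([marker]) | df.isnull()
    return mask.mean().sort_values(ascending=False)

# Toy stand-in for a few columns of the hospital data (values are made up)
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "time_in_hospital": [1, 3, 2, 14],
})

share = missing_share(toy)
print(share)
```

On the real data the same helper could be called as `missing_share(hospital_set)` to rank all 50 columns by how much of each is missing.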
 

### Modelling techniques : 

* So far I have used only numeric features for classification, with logistic regression, regularized logistic regression and decision trees.


* My next step is to transform the non-numeric features into categorical (nominal) values and add them to the models. I think some of them are crucial.


* Try out other models such as KNN classification, random forests, SVM, etc.


* Question: Several non-numeric features have more than 3 categories. Does it make sense to binarize them all so they can be used in classic logistic regression? Or would it be better to use models that accept categorical features without binarizing them?
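
As one possible answer to the question above, here is a minimal sketch of binarizing a multi-category feature with `pd.get_dummies` before fitting a logistic regression. The toy data is invented, and `drop_first=True` is a design choice (it keeps one category as the reference level so the dummy columns are not redundant), not something the project has settled on. Note that scikit-learn's logistic regression and decision trees both expect numeric input, so some encoding is needed either way.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: one nominal feature with four categories and one numeric feature (values are made up)
toy = pd.DataFrame({
    "race": ["Caucasian", "AfricanAmerican", "Hispanic", "Asian",
             "Caucasian", "AfricanAmerican", "Hispanic", "Asian"],
    "time_in_hospital": [1, 7, 3, 9, 2, 8, 4, 10],
    "readmitted": [0, 1, 0, 1, 0, 1, 0, 1],
})

# One 0/1 column per category; drop_first removes one level as the reference category
race_dummies = pd.get_dummies(toy["race"], prefix="race", drop_first=True)
X = pd.concat([toy[["time_in_hospital"]], race_dummies], axis=1).astype(float)
y = toy["readmitted"]

model = LogisticRegression()
model.fit(X, y)

print(list(X.columns))
print(model.predict(X))
```

Features with very many levels (such as the diagnosis codes) would produce hundreds of dummy columns this way, which is one reason to group them into broader categories first.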

### Visualizations : 

Although the main goal of the project is to use a modelling technique to predict patient readmission, if time permits I would like to build a dashboard that hospitals or clinicians can use to visualize data for their patients.








In [135]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot') # This styles the graphs in a nicer format

In [136]:
# read diabetic_data.csv into a DataFrame called 'hospital_set'
hospital_set = pd.read_csv('dataset_diabetes/diabetic_data.csv')

In [137]:
#hospital_set.head()
hospital_set['diag_1'].head(5)

0    250.83
1       276
2       648
3         8
4       197
Name: diag_1, dtype: object

In [138]:
# examine the default index, data types, and shape
#hospital_set.index
#hospital_set.dtypes
hospital_set.shape
 

(101766, 50)

In [139]:
hospital_set.describe() 

| | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 |
| mean | 165201600.0 | 54330400.0 | 2.024006 | 3.715642 | 5.754437 | 4.395987 | 43.095641 | 1.33973 | 16.021844 | 0.369357 | 0.197836 | 0.635566 | 7.422607 |
| std | 102640300.0 | 38696360.0 | 1.445403 | 5.280166 | 4.064081 | 2.985108 | 19.674362 | 1.705807 | 8.127566 | 1.267265 | 0.930472 | 1.262863 | 1.9336 |
| min | 12522.0 | 135.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 84961190.0 | 23413220.0 | 1.0 | 1.0 | 1.0 | 2.0 | 31.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| 50% | 152389000.0 | 45505140.0 | 1.0 | 1.0 | 7.0 | 4.0 | 44.0 | 1.0 | 15.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| 75% | 230270900.0 | 87545950.0 | 3.0 | 4.0 | 7.0 | 6.0 | 57.0 | 2.0 | 20.0 | 0.0 | 0.0 | 1.0 | 9.0 |
| max | 443867200.0 | 189502600.0 | 8.0 | 28.0 | 25.0 | 14.0 | 132.0 | 6.0 | 81.0 | 42.0 | 76.0 | 21.0 | 16.0 |


In [140]:
# Here "readmitted" is the target variable
hospital_set.readmitted.value_counts()

NO     54864
>30    35545
<30    11357
Name: readmitted, dtype: int64

In [141]:
# Transforming the "readmitted" variable into a binary outcome.
# NO  = 0 (patient was never readmitted)
# >30 = 1 (patient was readmitted after 30 days)
# <30 = 1 (patient was readmitted within 30 days)
hospital_set['readmitted'] = hospital_set['readmitted'].map({"<30": 1, ">30": 1, "NO": 0})
hospital_set.readmitted.value_counts()



0    54864
1    46902
Name: readmitted, dtype: int64
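
The effect of the target definition on class balance can be checked with a small sketch. The counts are copied from the `value_counts()` output earlier in the notebook; the stricter "within 30 days only" mapping is an alternative shown for comparison and is not the definition used here.

```python
import pandas as pd

# Class counts copied from the value_counts() output of the raw "readmitted" column
counts = pd.Series({"NO": 54864, ">30": 35545, "<30": 11357})

# Definition used in this notebook: any readmission counts as positive
any_readmit = {"NO": 0, ">30": 1, "<30": 1}
# Stricter alternative (an assumption, not used here): only readmission within 30 days is positive
within_30 = {"NO": 0, ">30": 0, "<30": 1}

def positive_share(mapping):
    """Share of admissions labelled 1 under the given class-to-label mapping."""
    positives = sum(n for label, n in counts.items() if mapping[label] == 1)
    return positives / counts.sum()

print(round(positive_share(any_readmit), 3))
print(round(positive_share(within_30), 3))
```

Under the stricter definition the positive class is much smaller, so accuracy alone would be a misleading metric and class imbalance would need to be handled.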

In [142]:
# Many patients have multiple encounter ids (hospitalizations).
# For predictions, I take each patient's first encounter as the first admission and
# predict whether the patient will be readmitted or not.

# Group hospital_set by patient, then take the row with the minimum
# encounter id as the first admission for each patient.
hospital_subset = hospital_set.loc[hospital_set.groupby("patient_nbr")["encounter_id"].idxmin()]
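
How the `groupby` / `idxmin` selection works can be seen on a toy frame. This is a sketch with made-up IDs; like the cell above, it relies on the assumption that encounter IDs are issued in chronological order, so that the smallest ID per patient is the earliest stay.

```python
import pandas as pd

# Toy encounters (made-up IDs): patient 10 has three stays, patient 20 has one
toy = pd.DataFrame({
    "encounter_id": [500, 100, 300, 200],
    "patient_nbr": [10, 10, 10, 20],
    "readmitted": [0, 1, 1, 0],
})

# Row label of the smallest encounter_id within each patient
first_idx = toy.groupby("patient_nbr")["encounter_id"].idxmin()
first_encounters = toy.loc[first_idx]

# Average number of encounters per patient
encounters_per_patient = len(toy) / toy["patient_nbr"].nunique()

print(first_encounters)
print(encounters_per_patient)
```

The same ratio on the real data (`len(hospital_set) / hospital_set.patient_nbr.nunique()`) gives the average number of hospitalizations per patient.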


In [143]:
# feature counts
#hospital_subset.discharge_disposition_id.value_counts()
#hospital_subset.admission_type_id.value_counts()
#hospital_subset['diag_1'].head(5)
hospital_subset.diag_1.value_counts()
#hospital_set.diag_1.value_counts()

#hospital_subset.index
#hospital_subset.dtypes
#hospital_subset.shape
#hospital_subset.columns

414      5233
428      3980
786      3040
410      2902
486      2439
427      2053
715      1908
434      1581
682      1470
780      1421
491      1334
276      1204
996      1125
38       1116
250.8    1084
599       998
584       963
820       824
574       775
435       754
562       711
577       690
493       677
518       664
722       660
V57       659
296       635
250.6     634
433       616
440       613
         ... 
V60         1
V66         1
V67         1
804         1
160         1
381         1
363         1
365         1
366         1
V70         1
580         1
84          1
V43         1
923         1
207         1
649         1
10          1
334         1
944         1
791         1
133         1
145         1
131         1
826         1
939         1
700         1
703         1
148         1
143         1
704         1
Name: diag_1, dtype: int64

In [93]:
# TRANSFORMATIONS FOR CATEGORICAL VALUES

# create dummy variables for the admission_source_id and admission_type_id features
# (these DataFrames are needed later when building feature_cols)
admission_source_dumies = pd.get_dummies(hospital_subset.admission_source_id, prefix='admission_source')
admission_type_dumies = pd.get_dummies(hospital_subset.admission_type_id, prefix='admission_type')

# encode the diabetesMed feature as 1 (Yes) / 0 (No)
hospital_subset['diabetesMed'] = hospital_subset.diabetesMed.map({'Yes':1, 'No':0})

# concatenate the original DataFrame and the dummy DataFrames (axis=0 means rows, axis=1 means columns)
hospital_subset = pd.concat([hospital_subset, admission_source_dumies, admission_type_dumies], axis=1)


In [148]:
# TRANSFORMATIONS FOR diag_1 feature
# diag_1 holds ICD9 diagnosis codes stored as strings; group them into disease groups.
# Series.map() with a dict turns every value that is not a key of the dict into NaN,
# so the codes are converted to numbers and assigned to groups by range instead.

# non-numeric codes (e.g. V57) cannot be converted and become NaN
diag_num = pd.to_numeric(hospital_subset['diag_1'], errors='coerce')

# 0 = any code that does not fall into one of the groups below
diag_group = pd.Series(0, index=hospital_subset.index)

# 1. Circulatory disease : ICD9 codes 390–459, 785
diag_group[((diag_num >= 390) & (diag_num < 460)) | (diag_num == 785)] = 1

# 2. Respiratory disease : 460–519, 786
diag_group[((diag_num >= 460) & (diag_num < 520)) | (diag_num == 786)] = 2

# 3. Digestive disease : 520–579, 787
diag_group[((diag_num >= 520) & (diag_num < 580)) | (diag_num == 787)] = 3

# 4. Injury : 800–999
diag_group[(diag_num >= 800) & (diag_num < 1000)] = 4

# 5. 

hospital_subset['diag_1'] = diag_group

In [95]:
hospital_subset.shape
hospital_subset.head()
hospital_subset.diabetesMed.value_counts()

1    54319
0    17199
Name: diabetesMed, dtype: int64

In [34]:
# Look for any linear correlations in the data
hospital_subset.corr()

| | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | ... | admission_source_22 | admission_source_25 | admission_type_1 | admission_type_2 | admission_type_3 | admission_type_4 | admission_type_5 | admission_type_6 | admission_type_7 | admission_type_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| encounter_id | 1.0 | 0.502034 | -0.150258 | -0.136536 | -0.113674 | -0.069457 | -0.051946 | -0.00531 | 0.056166 | 0.070958 | ... | 0.009159 | 0.011879 | 0.097435 | -0.042423 | 0.059947 | -0.003214 | -0.100475 | -0.148568 | 0.016861 | 0.012189 |
| patient_nbr | 0.502034 | 1.0 | -0.010712 | -0.136942 | -0.01944 | -0.017976 | 0.008597 | -0.022938 | 0.015527 | 0.08748 | ... | 0.00501 | 0.006573 | 0.01567 | 0.003631 | -0.011134 | -0.005587 | -0.055184 | 0.018484 | 0.013988 | 0.02877 |
| admission_type_id | -0.150258 | -0.010712 | 1.0 | 0.088565 | 0.121644 | -0.017477 | -0.117187 | 0.13205 | 0.095376 | 0.038149 | ... | 0.004461 | -0.003858 | -0.744539 | -0.03137 | 0.293219 | 0.01413 | 0.414313 | 0.676944 | 0.055674 | 0.250019 |
| discharge_disposition_id | -0.136536 | -0.136942 | 0.088565 | 1.0 | 0.005202 | 0.163874 | 0.028224 | 0.021512 | 0.114623 | -0.016582 | ... | -0.001906 | -9.4e-05 | -0.027389 | -0.052614 | 0.012948 | -0.00079 | -0.007949 | 0.129063 | 0.010305 | -0.020117 |
| admission_source_id | -0.113674 | -0.01944 | 0.121644 | 0.005202 | 1.0 | 0.007097 | 0.089209 | -0.127122 | -0.058141 | 0.018547 | ... | 0.029363 | 0.024574 | 0.281051 | -0.236494 | -0.485358 | -0.001472 | 0.248244 | 0.383803 | 0.005528 | -0.036269 |
| time_in_hospital | -0.069457 | -0.017976 | -0.017477 | 0.163874 | 0.007097 | 1.0 | 0.330146 | 0.188911 | 0.469426 | -0.014984 | ... | 0.006874 | 0.003068 | 0.009225 | 0.027689 | -0.030596 | -0.004058 | -0.03174 | 0.020926 | 0.003301 | -0.028088 |
| num_lab_procedures | -0.051946 | 0.008597 | -0.117187 | 0.028224 | 0.089209 | 0.330146 | 1.0 | 0.050072 | 0.261911 | -0.006933 | ... | -0.008087 | 0.003028 | 0.223056 | -0.05459 | -0.219834 | 0.002894 | -0.170554 | 0.128479 | 0.004107 | 0.002136 |
| num_procedures | -0.00531 | -0.022938 | 0.13205 | 0.021512 | -0.127122 | 0.188911 | 0.050072 | 1.0 | 0.403738 | -0.018347 | ... | 0.003482 | 0.007721 | -0.209191 | 0.053818 | 0.215339 | 0.000797 | -0.058166 | 0.035978 | 0.013429 | 0.023178 |
| num_medications | 0.056166 | 0.015527 | 0.095376 | 0.114623 | -0.058141 | 0.469426 | 0.261911 | 0.403738 | 1.0 | 0.026183 | ... | 0.00589 | 0.002733 | -0.105718 | -0.053848 | 0.173548 | -0.006051 | -0.003838 | 0.020114 | 0.002572 | 0.012761 |
| number_outpatient | 0.070958 | 0.08748 | 0.038149 | -0.016582 | 0.018547 | -0.014984 | -0.006933 | -0.018347 | 0.026183 | 1.0 | ... | -0.001959 | -0.001386 | -0.0136 | -0.017817 | -0.017998 | 0.000559 | 0.124942 | -0.020818 | -0.00449 | 0.003801 |
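
A wide correlation matrix is hard to read by eye, so one option is to list only the strongest pairs. The sketch below does this on synthetic columns; the helper name `top_correlations` and the random data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    """Return the n column pairs with the largest absolute linear correlation."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Synthetic columns: b closely follows a, c is independent noise
rng = np.random.RandomState(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

top = top_correlations(toy, n=2)
print(top)
```

Applied to the numeric hospital columns, a listing like this would surface pairs such as time_in_hospital and num_medications, which correlate at about 0.47 in the table.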


In [96]:
feature_cols = ['time_in_hospital', 'num_medications','num_lab_procedures','num_procedures','number_outpatient','number_inpatient','number_diagnoses','diabetesMed']

for t in admission_source_dumies:
    feature_cols.append(t)
    
for t in admission_type_dumies:
    feature_cols.append(t)
    
feature_cols

['time_in_hospital',
 'num_medications',
 'num_lab_procedures',
 'num_procedures',
 'number_outpatient',
 'number_inpatient',
 'number_diagnoses',
 'diabetesMed',
 'admission_source_1',
 'admission_source_2',
 'admission_source_3',
 'admission_source_4',
 'admission_source_5',
 'admission_source_6',
 'admission_source_7',
 'admission_source_8',
 'admission_source_9',
 'admission_source_10',
 'admission_source_11',
 'admission_source_13',
 'admission_source_14',
 'admission_source_17',
 'admission_source_20',
 'admission_source_22',
 'admission_source_25',
 'admission_type_1',
 'admission_type_2',
 'admission_type_3',
 'admission_type_4',
 'admission_type_5',
 'admission_type_6',
 'admission_type_7',
 'admission_type_8']

In [97]:
# build the feature matrix X and the target vector y for logistic regression

X = hospital_subset[feature_cols]
y = hospital_subset.readmitted



In [98]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
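
The modelling step described in the overview (logistic regression, a more strongly regularized logistic regression, and a decision tree) could continue from this split roughly as follows. This is a sketch on synthetic data, not the notebook's actual results: `X_demo` and `y_demo` are random stand-ins for `X` and `y`, and the values `C=0.1` and `max_depth=3` are arbitrary illustrative choices. It imports `train_test_split` from `sklearn.model_selection`, its location in current scikit-learn releases.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (random numbers, not hospital data)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(400, 3))
noise = rng.normal(scale=0.5, size=400)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

models = {
    # scikit-learn applies an L2 penalty by default (C=1.0)
    "logistic regression (default C=1.0)": LogisticRegression(),
    # smaller C means a stronger penalty
    "logistic regression (C=0.1)": LogisticRegression(C=0.1),
    "decision tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(name, round(scores[name], 3))

# Reference point: accuracy of always predicting the majority class of the test split
baseline = max(np.mean(y_te == 0), np.mean(y_te == 1))
print("majority-class baseline", round(baseline, 3))
```

Comparing each score against the majority-class baseline matters here because the real target is only roughly balanced, so a model has to beat about 54% accuracy to be useful.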