# SYD DAT 4 Project : Hospital Readmissions prediction

### Overview 

I had the opportunity to work as a software engineer at Vanderbilt University Medical Center in the US, where I was involved in biomedical research projects. As a computer scientist, I find it very gratifying to help solve interesting and challenging health care problems, which is why I chose this project.

Hospital readmission occurs when a patient is re-admitted to a hospital within a short period (usually 28 or 30 days) after an initial discharge. Such readmissions are a major health care concern in many countries, including the US, the UK, and Australia. They lead to increased wait times and more medical errors, putting patient safety at risk and causing unnecessary deaths. A high readmission rate is in fact an indicator of poor quality of care at a hospital. It also places a large financial burden on hospitals and governments. In this project, I want to identify patients who are likely to be readmitted, so that their care providers can plan proper care and management for them.

### Goal(s) : 
Predict whether a patient will be readmitted to the hospital within 30 days.

### Data set used : 

To achieve the above goal, I need a hospital data set that records all hospitalizations of its patients over a number of years. It should contain information such as why the patients were admitted, which department admitted them, how many times they were admitted, what medications they were on, what lab tests were conducted, how many days they stayed in hospital, vital signs and demographics such as height, weight, age, race, blood pressure, and smoking status, as well as their electronic medical records, billing records, genetic data, etc.

Of course, such a heterogeneous data set is hard to obtain publicly because patient data are highly confidential. There are, however, quite a few data sets that contain substantial information, and after going through some of them I decided to use the "Diabetes 130-US hospitals for years 1999-2008 Data Set".

This is a publicly available data set from the Center for Clinical and Translational Research, Virginia Commonwealth University. It is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO) and contains 10 years of data on diabetes patients across 130 US hospitals.

http://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

### Summary of data : 

The Python code below leads to the following observations:
 * There are 50 columns (features, identifiers, and the target) and 101,766 observations (hospitalizations).
 
 * Each row in the DataFrame represents one hospital admission. On average there are about 1.4 hospitalizations per patient: 101,766 admissions across 71,518 unique patients (patient_nbr is the de-identified patient ID).
 
 * Several patients have multiple hospitalizations, so I use only the first admission of each patient and predict whether they will be readmitted.
 
 * Numeric features are: time_in_hospital, num_medications, num_lab_procedures, num_procedures, number_outpatient, number_inpatient, number_diagnoses.

 * The average time spent in hospital (time_in_hospital) is about 4.4 days, and the feature varies from 1 day to 14 days.

 * The number of lab procedures conducted during a stay averages 43 and ranges from 1 to 132.

 * The average number of medications administered is 16, with a minimum of 1 and a maximum of 81.

 * Many features have non-numeric values, e.g. race, gender, age (given as a range), medications such as citoglipton and insulin, and the diagnosis codes diag_1, diag_2, diag_3. These need to be transformed into numeric values.

 * The weight feature is missing for 97% of rows, so I will not include it. I am not sure it makes sense to impute such a large share of missing data, so it is probably better to exclude the feature for the time being.
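
The missing-value and per-patient checks in the list above can be sketched as follows. This is a minimal sketch on a toy DataFrame whose rows are made up for illustration; it assumes, as the `head()` output further down suggests, that missing values are coded as `?`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "patient_nbr": [1, 1, 2, 3, 3, 3],
    "weight": ["?", "?", "[75-100)", "?", "?", "?"],
    "race": ["Caucasian", "?", "AfricanAmerican",
             "Caucasian", "Caucasian", "Caucasian"],
})

# Treat the "?" placeholder as a real missing value.
cleaned = toy.replace("?", np.nan)

# Fraction of missing values per column, largest first.
missing_fraction = cleaned.isnull().mean().sort_values(ascending=False)

# Average number of encounters (rows) per patient.
encounters_per_patient = toy.groupby("patient_nbr").size().mean()

print(missing_fraction)
print(encounters_per_patient)
```

Run against `hospital_set`, the same two expressions should reproduce the missing share of `weight` and the average number of admissions per patient.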
 

### Modelling techniques : 

To start with, I have applied logistic regression. I then plan to move on to more advanced techniques such as regularized logistic regression, decision trees, and random forests.
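
One way to compare these model families is cross-validation with a metric that tolerates class imbalance. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the hospital data, and the hyperparameters (`C`, `max_depth`, `n_estimators`) are arbitrary assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the hospital data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, n_informative=4,
    weights=[0.9, 0.1], random_state=1)

# Smaller C means stronger L2 regularization in LogisticRegression.
models = {
    "logistic (C=1.0)": LogisticRegression(C=1.0, max_iter=1000),
    "logistic (C=0.1)": LogisticRegression(C=0.1, max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# ROC AUC is less misleading than accuracy when one class dominates.
cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print("%-18s mean AUC = %.3f" % (name, cv_auc[name]))
```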

### Visualizations : 

The main goal of the project is to use a modelling technique to predict patient readmission, but if time permits I would like to build a dashboard that hospitals or clinicians can use to visualize data for their patients.

### Discussion  :

* I have taken one hospitalization for each patient and predict whether they will be readmitted within 30 days or not.

* ~8.8% of patients (6,293 of 71,518) are readmitted within 30 days of their first admission, based on the labeled feature "readmitted". Across all 101,766 admissions the rate is ~11.2%.

* I have applied logistic regression using numeric features in dataset. I have two classes : "0" means not readmitted within 30 days and "1" means readmitted within 30 days. I have yet to include several nominal features.

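* Logistic regression gives an accuracy_score of 0.91 on the test set. This is no better than the null accuracy of 0.91 obtained by always predicting "not readmitted" (0.9086 versus 0.9106).

* The classification_report shows precision of 0.91 for "0" (Readmission_NO) and 0.53 for "1" (Readmission_YES), but recall for "1" is effectively 0.00: only 8 of the 1,636 readmitted patients in the test set are identified. The model therefore misses almost all patients who are readmitted. It is recall (sensitivity) for this class that I need to improve, since identifying such patients is the purpose of the project.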

* Several nominal features are not yet included in the model and need to be transformed into numeric values. Since many of them have more than 3 categories, is it a good idea to binarize them, given that this will create a large number of features?

* Or is it better to try other methods?
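
On the binarization question, one common option (not necessarily the best one for this data set) is to merge rare categories into a single label before one-hot encoding, which limits the number of new columns. The sketch below uses made-up toy rows. The helper `group_rare` and its `min_count` threshold are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Toy rows; column names follow the data set, values are illustrative.
toy = pd.DataFrame({
    "race": ["Caucasian", "AfricanAmerican", "Caucasian",
             "Asian", "Caucasian", "Hispanic"],
    "insulin": ["No", "Up", "Steady", "No", "Down", "No"],
    "diag_1": ["250.83", "276", "648", "8", "197", "414"],
    "time_in_hospital": [1, 3, 2, 2, 1, 3],
})


def group_rare(series, min_count=2, other="Other"):
    """Replace categories seen fewer than min_count times with one label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)


toy["race"] = group_rare(toy["race"])
toy["diag_1"] = group_rare(toy["diag_1"])

# One indicator column per remaining category; numeric columns pass through.
encoded = pd.get_dummies(toy, columns=["race", "insulin", "diag_1"])
print(encoded.columns.tolist())
```

On the real data the threshold would need to be much larger than 2. High-cardinality columns such as diag_1, diag_2, and diag_3 may be better handled by grouping codes into broader clinical categories first.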



In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot') # This styles the graphs in a nicer format

In [2]:
# read diabetic_data.csv into a DataFrame called 'hospital_set'
hospital_set = pd.read_csv('dataset_diabetes/diabetic_data.csv')

In [3]:
hospital_set.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


In [4]:
# examine the default index, data types, and shape
#hospital_set.index
#hospital_set.dtypes
hospital_set.shape
 

(101766, 50)

In [5]:
hospital_set.describe() 

| | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 |
| mean | 165201600.0 | 54330400.0 | 2.024006 | 3.715642 | 5.754437 | 4.395987 | 43.095641 | 1.33973 | 16.021844 | 0.369357 | 0.197836 | 0.635566 | 7.422607 |
| std | 102640300.0 | 38696360.0 | 1.445403 | 5.280166 | 4.064081 | 2.985108 | 19.674362 | 1.705807 | 8.127566 | 1.267265 | 0.930472 | 1.262863 | 1.9336 |
| min | 12522.0 | 135.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 84961190.0 | 23413220.0 | 1.0 | 1.0 | 1.0 | 2.0 | 31.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| 50% | 152389000.0 | 45505140.0 | 1.0 | 1.0 | 7.0 | 4.0 | 44.0 | 1.0 | 15.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| 75% | 230270900.0 | 87545950.0 | 3.0 | 4.0 | 7.0 | 6.0 | 57.0 | 2.0 | 20.0 | 0.0 | 0.0 | 1.0 | 9.0 |
| max | 443867200.0 | 189502600.0 | 8.0 | 28.0 | 25.0 | 14.0 | 132.0 | 6.0 | 81.0 | 42.0 | 76.0 | 21.0 | 16.0 |


In [6]:
# Here "readmitted" is the target variable
hospital_set.readmitted.value_counts()

NO     54864
>30    35545
<30    11357
Name: readmitted, dtype: int64

In [7]:
# Transforming the "readmitted" variable into binary outcome. 
# NO = 0 (patients are not readmitted, it could be their first/initial admission)
# >30 = 0 (patients admitted after 30 days are not called readmitted)
# < 30 = 1 (patients admitted within 30 days are called readmitted)
hospital_set['readmitted'] = hospital_set['readmitted'].map({ "<30" : 1, ">30" : 0,"NO" : 0})
hospital_set.readmitted.value_counts()



0    90409
1    11357
Name: readmitted, dtype: int64

In [8]:
# TRANSFORMATIONS FOR CATEGORICAL VALUES : TBD


In [9]:
# Many patients have multiple encounter ids (hospitalizations).
# For predictions, I take the lowest encounter id of each patient as the first admission
# and predict whether the patient will be readmitted within a 30-day window.

# Group hospital_set by patient, then take the row with the minimum
# encounter id to get the first admission for each patient.
hospital_subset = hospital_set.loc[hospital_set.groupby("patient_nbr")["encounter_id"].idxmin()]
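
To sanity-check the first-admission selection, the same expression can be run on a small toy frame where the expected result is obvious. The rows below are made up for illustration. Like the cell above, the sketch assumes that a lower `encounter_id` means an earlier admission, which the data set does not state explicitly.

```python
import pandas as pd

# Toy encounters; IDs and labels are made up for illustration.
toy = pd.DataFrame({
    "encounter_id": [30, 10, 20, 50, 40],
    "patient_nbr":  [1, 1, 2, 3, 3],
    "readmitted":   [0, 1, 0, 0, 1],
})

# Same approach as the cell above: for each patient, find the row label
# of the smallest encounter_id, then select those rows.
first_idx = toy.groupby("patient_nbr")["encounter_id"].idxmin()
first_by_idxmin = toy.loc[first_idx]

# Equivalent alternative: sort by encounter_id, keep the first row per patient.
first_by_sort = toy.sort_values("encounter_id").drop_duplicates("patient_nbr")

print(first_by_idxmin)
```

Both variants keep exactly one row per patient, so `patient_nbr` should be unique in the result.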


In [37]:
#subset.shape
#hospital_subset.head()
hospital_subset.readmitted.value_counts()

0    65225
1     6293
Name: readmitted, dtype: int64

In [11]:
# Look for any linear correlations in the data
# numeric_only=True skips the string columns (needed on recent pandas versions)
hospital_subset.corr(numeric_only=True)

| | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| encounter_id | 1.0 | 0.502034 | -0.150258 | -0.136536 | -0.113674 | -0.069457 | -0.051946 | -0.00531 | 0.056166 | 0.070958 | 0.051455 | -0.041483 | 0.256566 | -0.046924 |
| patient_nbr | 0.502034 | 1.0 | -0.010712 | -0.136942 | -0.01944 | -0.017976 | 0.008597 | -0.022938 | 0.015527 | 0.08748 | 0.05634 | 0.093525 | 0.228945 | 0.00459 |
| admission_type_id | -0.150258 | -0.010712 | 1.0 | 0.088565 | 0.121644 | -0.017477 | -0.117187 | 0.13205 | 0.095376 | 0.038149 | -0.017267 | 0.03909 | -0.116886 | -0.000619 |
| discharge_disposition_id | -0.136536 | -0.136942 | 0.088565 | 1.0 | 0.005202 | 0.163874 | 0.028224 | 0.021512 | 0.114623 | -0.016582 | -0.026477 | -0.021484 | 0.048314 | 0.057583 |
| admission_source_id | -0.113674 | -0.01944 | 0.121644 | 0.005202 | 1.0 | 0.007097 | 0.089209 | -0.127122 | -0.058141 | 0.018547 | 0.056719 | 0.03009 | 0.066753 | 0.004146 |
| time_in_hospital | -0.069457 | -0.017976 | -0.017477 | 0.163874 | 0.007097 | 1.0 | 0.330146 | 0.188911 | 0.469426 | -0.014984 | -0.009805 | 0.063736 | 0.233338 | 0.053531 |
| num_lab_procedures | -0.051946 | 0.008597 | -0.117187 | 0.028224 | 0.089209 | 0.330146 | 1.0 | 0.050072 | 0.261911 | -0.006933 | 0.014091 | 0.080162 | 0.157574 | 0.028875 |
| num_procedures | -0.00531 | -0.022938 | 0.13205 | 0.021512 | -0.127122 | 0.188911 | 0.050072 | 1.0 | 0.403738 | -0.018347 | -0.035178 | -0.023977 | 0.089153 | -0.001392 |
| num_medications | 0.056166 | 0.015527 | 0.095376 | 0.114623 | -0.058141 | 0.469426 | 0.261911 | 0.403738 | 1.0 | 0.026183 | 0.0024 | 0.037487 | 0.259201 | 0.034204 |
| number_outpatient | 0.070958 | 0.08748 | 0.038149 | -0.016582 | 0.018547 | -0.014984 | -0.006933 | -0.018347 | 0.026183 | 1.0 | 0.095002 | 0.068591 | 0.076612 | 0.008659 |


In [18]:
# use numeric features to apply logistic regression

feature_cols = ['time_in_hospital', 'num_medications','num_lab_procedures','num_procedures','number_outpatient','number_inpatient','number_diagnoses']

X = hospital_subset[feature_cols]
y = hospital_subset.readmitted

In [31]:
# Split the data into training and testing sets
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [32]:
#X_test.head()
#y_train.head()
#y_train.value_counts()
y_test.value_counts()

0    16244
1     1636
Name: readmitted, dtype: int64

In [33]:
# Fit a logistic regression model and examine the coefficients
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
list(zip(feature_cols, logreg.coef_[0]))

[('time_in_hospital', 0.044561012702812278),
 ('num_medications', 0.0037811242351986191),
 ('num_lab_procedures', 0.00019726056157569197),
 ('num_procedures', -0.026385094224620816),
 ('number_outpatient', -0.0067079905875209933),
 ('number_inpatient', 0.33839848754443019),
 ('number_diagnoses', 0.049327701323899589)]
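
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below uses the coefficient values printed above, rounded to five decimals. Note that the features were not scaled, so the ratios describe a one-unit change in each feature and are not directly comparable as measures of importance.

```python
import numpy as np

# Coefficients copied from the output above (rounded to five decimals).
coefs = {
    "time_in_hospital": 0.04456,
    "num_medications": 0.00378,
    "num_lab_procedures": 0.00020,
    "num_procedures": -0.02639,
    "number_outpatient": -0.00671,
    "number_inpatient": 0.33840,
    "number_diagnoses": 0.04933,
}

# exp(coefficient) is the multiplicative change in the odds of readmission
# for a one-unit increase in the feature, holding the other features fixed.
odds_ratios = {name: float(np.exp(c)) for name, c in coefs.items()}

# Print from largest to smallest odds ratio.
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: -kv[1]):
    print("%-20s %.3f" % (name, ratio))
```

Under this model, each additional prior inpatient visit (number_inpatient) multiplies the odds of readmission by roughly 1.4, by far the largest effect among the numeric features.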

In [34]:
# Make predictions on testing set and calculate accuracy
from sklearn import metrics
y_pred_class = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.90855704698


In [29]:
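# compute null accuracy manually: the share of each class in the test set
print(y_test.mean())      # always predicting "readmitted"
print(1 - y_test.mean())  # always predicting "not readmitted"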

0.0894155480984
0.910584451902


In [35]:
# confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[16237,     7],
       [ 1628,     8]])

In [38]:
# calculate the sensitivity
8 / float(1628 + 8)

0.004889975550122249

In [39]:
# calculate the specificity
16237 / float(16237 + 7)

0.9995690716572273
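
The two hand calculations above, together with precision and accuracy, can be derived from the confusion matrix in one place. This sketch hard-codes the matrix printed above and assumes scikit-learn's layout: rows are actual classes, columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual class (0, 1), columns = predicted class (0, 1).
cm = np.array([[16237, 7],
               [1628, 8]])
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)     # recall for the readmitted class
specificity = tn / (tn + fp)     # recall for the not-readmitted class
precision = tp / (tp + fp)       # share of predicted readmissions that are real
accuracy = (tp + tn) / cm.sum()

print("sensitivity=%.4f specificity=%.4f precision=%.4f accuracy=%.4f"
      % (sensitivity, specificity, precision, accuracy))
```

The precision of about 0.53 rests on only 15 positive predictions, 8 of which are correct, so it says little about how well readmitted patients are found.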

In [36]:
from sklearn.metrics import classification_report
target_names = ['Readmission_NO', 'Readmission_YES']
print(classification_report(y_test, y_pred_class, target_names=target_names))

                 precision    recall  f1-score   support

 Readmission_NO       0.91      1.00      0.95     16244
Readmission_YES       0.53      0.00      0.01      1636

    avg / total       0.87      0.91      0.87     17880
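
Since recall for the readmitted class is close to zero, two simple things to try are class weighting and a lower decision threshold. The sketch below is an illustration on synthetic data from `make_classification`, not a result on the hospital data. The 0.2 threshold is an arbitrary assumption that would need tuning against the precision one is willing to give up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the hospital data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1)

# Default fit: every sample counts equally, so the majority class dominates.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# class_weight="balanced" reweights samples inversely to class frequency.
balanced = LogisticRegression(class_weight="balanced", max_iter=1000)
balanced.fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))

# Alternative: keep the plain model but lower the threshold from 0.5 to 0.2.
proba = plain.predict_proba(X_te)[:, 1]
recall_low_threshold = recall_score(y_te, (proba >= 0.2).astype(int))

print(recall_plain, recall_balanced, recall_low_threshold)
```

Both options trade precision for recall, so the confusion matrix and classification report should be re-checked after either change.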

