[View in Colaboratory](https://colab.research.google.com/github/Gagan-K-Shetty/Da_project/blob/master/Hospital_readmission.ipynb)

# Introduction

Hospital readmission measures have been touted not only as a quality measure, but also as a means to bend the healthcare cost curve. In the United States, hospital readmissions within a span of 30 days turn out to be an immense burden on the healthcare system and the patient's lifestyle.<br><br>
A UCSF study in the US shows that 27% of readmissions are potentially avoidable through better diagnosis at discharge or through a predictive mechanism that evaluates a patient's chance of being readmitted based on various factors. The cost per readmission in the US varies from 10 to 14 thousand dollars, depending on the insurance plan, the medical facilities used, etc.<br><br>
For this purpose, we intend to analyze the electronic health records of patients, determine the main indicators of hospital readmission, and predict the chance of future readmission for current patients.


### The rest of the notebook is organized as follows:-
<ol>
    <li><a href='#Research_goal'> Setting the research goal </a></li>
    <li><a href='#Retrieving_data'> Retrieving data </a></li>
    <li><a href='#Data_preparation'> Data preparation </a></li>
    <li><a href='#Data_exploration'> Data exploration </a></li>
    <li><a href='#Data_modeling'> Data modeling </a></li> 
    <li><a href='#Presentation'> Presentation and automation </a></li>
</ol>

<a id='Research_goal'></a>
# Setting the research goal
The main aim of this project is to **predict the chances of readmission of a patient**. Using the electronic health records, we can identify the main indicators of hospital readmission and flag the patients at high risk of being readmitted. The scope of this project is limited to identifying readmission in two cases:-
<ol>
    <li> Readmission within 3 months </li>
    <li> Readmission between 3 to 6 months</li>
</ol>
The main advantages of this project:-
  * **Patient comfort** - By reducing the number of readmissions, we can improve the lives of patients and reduce the time they repeatedly spend in hospital by targeting problems at an earlier stage.
  * **Hospital burden** - By reducing the number of readmissions, we can reduce the burden on the healthcare system by potentially treating diseases at an earlier stage and reducing the number of patients overburdening the hospital (at least those whose readmission could have been avoided).
  * **Cost factor** - By reducing the number of readmissions, we can reduce the cost associated with healthcare, which is a burden on both the patients and the insurance companies.
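
The two readmission windows above could be turned into a target label roughly as follows. This is a minimal sketch rather than the notebook's final labelling code: it assumes an admissions table with `subject_id`, `admittime` and `dischtime` columns, approximates 3 and 6 months as 90 and 180 days, and uses a hypothetical function name (`label_readmissions`) and hypothetical label strings.

```python
import pandas as pd

def label_readmissions(admissions: pd.DataFrame) -> pd.DataFrame:
    """Label each admission by the time until the same patient's next admission."""
    df = admissions.sort_values(["subject_id", "admittime"]).copy()
    # Admission time of the same patient's next stay (NaT for the last stay).
    next_admit = df.groupby("subject_id")["admittime"].shift(-1)
    df["days_to_readmission"] = (next_admit - df["dischtime"]).dt.total_seconds() / 86400
    # Assumption: 3 months is approximated as 90 days, 6 months as 180 days.
    df["readmission_class"] = "none_within_6_months"
    df.loc[df["days_to_readmission"] <= 180, "readmission_class"] = "3_to_6_months"
    df.loc[df["days_to_readmission"] <= 90, "readmission_class"] = "within_3_months"
    return df

# Toy data for illustration only (not taken from MIMIC-III).
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "admittime": pd.to_datetime(["2150-01-01", "2150-02-01", "2150-07-01", "2150-03-01"]),
    "dischtime": pd.to_datetime(["2150-01-05", "2150-02-10", "2150-07-04", "2150-03-09"]),
})
labelled = label_readmissions(toy)
print(labelled[["subject_id", "days_to_readmission", "readmission_class"]])
```

A real label would probably also need to exclude stays that ended in the patient's death (see hospital_expire_flag in the admissions table), since those patients cannot be readmitted.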

<a id='Retrieving_data'></a>
# Retrieving the data
Since the aim of the project is to reduce the readmission of patients, we need data recorded while patients are admitted to the hospital. Many hospitals maintain such records, which are called Electronic Health Records (EHR).<br> For the purpose of the project, we looked for publicly available EHR datasets to work on. After reviewing the available datasets, we found the MIMIC-III dataset to be the best choice for our project.<br><br>
<a href='https://www.nature.com/articles/sdata201635'>MIMIC-III</a> is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.
<a href='https://mit-lcp.github.io/mimic-schema-spy'>Click here</a> for a detailed schema of the database.<br>
After some research, we decided to use the following tables from the database:-
   * **patients** : Contains patients associated with an admission to the ICU
   * **admissions** : Contains the hospital admissions associated with an ICU stay.
   * **diagnoses** : Contains diagnoses relating to a hospital admission coded using the ICD9 system.
   * **drgCodes** : Hospital stays classified using the Diagnosis-Related Group system.
   * **icuStays** : List of ICU admissions.
   * **procedures** : Procedures relating to a hospital admission coded using the ICD9 system.
   * **prescriptions** : Contains a list of the medicines prescribed.
   * **dIcdDiagnoses** : Dictionary of the International Classification of Diseases, 9th Revision (Diagnoses).
   * **dIcdProcedures** : Dictionary of the International Classification of Diseases, 9th Revision (Procedures).
   
For this project we pulled records (a maximum of 50k per table) from each of these tables. The contents of the tables are described in the next step.

Since we are using Google Colab, we need some boilerplate code to load the data stored in Google Drive:-

In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import os
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 500)

In [2]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
folder_id = '19W1TgqRQqPxTxCCb1hGR1QhGajSjSs01'
listed = drive.ListFile({'q': "title contains '.csv' and '"+folder_id+"' in parents"}).GetList()
# Keep only the title and ID of each CSV file found in the folder
files = [{'title': f['title'], 'id': f['id']} for f in listed]
download_path = os.path.expanduser('~/data')
# Create the download directory; do nothing if it already exists
os.makedirs(download_path, exist_ok=True)
dataset = {}
for x in files:
  output_file = os.path.join(download_path, x['title'])
  temp_file = drive.CreateFile({'id': x['id']})
  temp_file.GetContentFile(output_file)
  data = pd.read_csv(output_file)
  #print(data)
  dataset[x['title']] = data
print("Table names are",", ".join(dataset.keys()))

Table names are procedures.csv, prescriptions.csv, patients.csv, icuStays.csv, drgCodes.csv, dIcdDiagnoses.csv, diagnoses.csv, admissions.csv, dIcdProcedures.csv


<a id='Data_preparation'></a>
# Data Preparation
In this step, we perform the standard data cleaning steps. Let's take a look at each table one by one. First, let's load the required libraries.

In [0]:
import numpy as np

Let's take a look at the actual contents of the table "admissions"

In [4]:
admissions = dataset["admissions.csv"]
admissions

Unnamed: 0,admission_location,admission_type,admittime,deathtime,diagnosis,discharge_location,dischtime,edouttime,edregtime,ethnicity,hadm_id,has_chartevents_data,hospital_expire_flag,insurance,language,marital_status,religion,row_id,subject_id
0,EMERGENCY ROOM ADMIT,EMERGENCY,2196-04-09T12:26:00.000Z,,BENZODIAZEPINE OVERDOSE,DISC-TRAN CANCER/CHLDRN H,2196-04-10T15:54:00.000Z,2196-04-09T13:24:00.000Z,2196-04-09T10:06:00.000Z,WHITE,165315,1,0,Private,,MARRIED,UNOBTAINABLE,21,22
1,PHYS REFERRAL/NORMAL DELI,ELECTIVE,2153-09-03T07:15:00.000Z,,CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS...,HOME HEALTH CARE,2153-09-08T19:10:00.000Z,,,WHITE,152223,1,0,Medicare,,MARRIED,CATHOLIC,22,23
2,TRANSFER FROM HOSP/EXTRAM,EMERGENCY,2157-10-18T19:34:00.000Z,,BRAIN MASS,HOME HEALTH CARE,2157-10-25T14:00:00.000Z,,,WHITE,124321,1,0,Medicare,ENGL,MARRIED,CATHOLIC,23,23
3,TRANSFER FROM HOSP/EXTRAM,EMERGENCY,2139-06-06T16:14:00.000Z,,INTERIOR MYOCARDIAL INFARCTION,HOME,2139-06-09T12:48:00.000Z,,,WHITE,161859,1,0,Private,,SINGLE,PROTESTANT QUAKER,24,24
4,EMERGENCY ROOM ADMIT,EMERGENCY,2160-11-02T02:06:00.000Z,,ACUTE CORONARY SYNDROME,HOME,2160-11-05T14:55:00.000Z,2160-11-02T04:27:00.000Z,2160-11-02T01:01:00.000Z,WHITE,129635,1,0,Private,,MARRIED,UNOBTAINABLE,25,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,EMERGENCY ROOM ADMIT,EMERGENCY,2101-12-28T00:07:00.000Z,,BRAIN MASS,HOME,2102-01-04T12:40:00.000Z,2101-12-28T00:53:00.000Z,2101-12-27T19:18:00.000Z,WHITE,193198,1,0,Private,ENGL,MARRIED,CATHOLIC,49113,68944
49996,CLINIC REFERRAL/PREMATURE,EMERGENCY,2102-11-14T18:50:00.000Z,,PNEUMONIA,LONG TERM CARE HOSPITAL,2102-12-07T13:46:00.000Z,2102-11-14T19:21:00.000Z,2102-11-14T17:20:00.000Z,WHITE,192475,1,0,Private,ENGL,MARRIED,CATHOLIC,49114,68944
49997,CLINIC REFERRAL/PREMATURE,EMERGENCY,2103-02-05T04:22:00.000Z,,SEIZURES,REHAB/DISTINCT PART HOSP,2103-02-14T13:00:00.000Z,2103-02-05T06:30:00.000Z,2103-02-05T00:36:00.000Z,WHITE,145719,1,0,Private,ENGL,MARRIED,CATHOLIC,49115,68944
49998,TRANSFER FROM HOSP/EXTRAM,EMERGENCY,2103-05-22T18:05:00.000Z,,PNEUMONIA,REHAB/DISTINCT PART HOSP,2103-06-07T15:37:00.000Z,,,WHITE,170602,1,0,Private,ENGL,MARRIED,CATHOLIC,49116,68944


Now that we have an idea of the contents of the table, let's take a look at the summary statistics.

In [5]:
admissions.describe(include = [np.number])

Unnamed: 0,hadm_id,has_chartevents_data,hospital_expire_flag,row_id,subject_id
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,149912.05214,0.97208,0.09954,25025.7288,24471.16132
std,28876.168275,0.164745,0.299389,14476.341475,18707.237912
min,100001.0,0.0,0.0,1.0,2.0
25%,124888.5,1.0,0.0,12500.75,10220.0
50%,149903.5,1.0,0.0,25000.5,20483.5
75%,174893.25,1.0,0.0,37500.25,30798.25
max,199999.0,1.0,1.0,58341.0,97948.0


The above output shows the descriptive statistics of the numerical columns. However, most of the numeric columns are primary keys and foreign keys, so these statistics carry little meaning. Let's take a look at the categorical data.
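
Before doing so, a quick uniqueness check can confirm which numeric columns behave like keys. The sketch below is only illustrative: `key_summary` is a hypothetical helper, and the toy frame reuses the identifiers of the first four rows shown earlier rather than the full table.

```python
import pandas as pd

def key_summary(df: pd.DataFrame, key_columns: list) -> pd.DataFrame:
    """Report row count, distinct values and uniqueness for each identifier column."""
    return pd.DataFrame({
        "rows": {col: len(df) for col in key_columns},
        "distinct": {col: df[col].nunique() for col in key_columns},
        "is_unique": {col: df[col].is_unique for col in key_columns},
    })

# Identifiers of the first four admissions displayed above.
toy_admissions = pd.DataFrame({
    "row_id": [21, 22, 23, 24],
    "hadm_id": [165315, 152223, 124321, 161859],
    "subject_id": [22, 23, 23, 24],
})
summary = key_summary(toy_admissions, ["row_id", "hadm_id", "subject_id"])
print(summary)
```

In this sample hadm_id is unique per row while subject_id repeats. Repeated subject_id values are the repeat admissions of one patient, which is what a readmission analysis relies on.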

In [6]:
admissions.describe(include = ['O'])

Unnamed: 0,admission_location,admission_type,admittime,deathtime,diagnosis,discharge_location,dischtime,edouttime,edregtime,ethnicity,insurance,language,marital_status,religion
count,50000,50000,50000,4977,49979,50000,50000,25161,25161,50000,50000,24748,40238,49542
unique,9,4,49775,4957,13564,17,49772,25152,25158,41,5,68,7,20
top,EMERGENCY ROOM ADMIT,EMERGENCY,2191-08-23T07:15:00.000Z,2195-11-28T17:17:00.000Z,NEWBORN,HOME,2101-09-22T13:00:00.000Z,2149-02-06T17:18:00.000Z,2130-05-30T16:53:00.000Z,WHITE,Medicare,ENGL,MARRIED,CATHOLIC
freq,20331,34574,4,2,7823,16575,3,2,2,34512,23301,20985,19952,17446


The above cell describes the categorical data
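
The count row also hints at missing data: deathtime has only 4,977 non-null values out of 50,000 and language only 24,748. One way to make this explicit is to compute the fraction of missing values per column. The sketch below runs on a small made-up frame with a few of the same column names; on the real table the same expression would be applied to admissions.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy_admissions = pd.DataFrame({
    "admission_type": ["EMERGENCY", "ELECTIVE", "EMERGENCY", "EMERGENCY"],
    "language": [np.nan, np.nan, "ENGL", np.nan],
    "deathtime": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy_admissions.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```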

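One thing we might notice is that the time variables are stored as text strings (for example 2196-04-09T12:26:00.000Z), which is why they are summarized as categorical data above. So let's go ahead and convert them to datetimes, starting with admittime:-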

In [7]:
admissions.admittime = pd.to_datetime(admissions.admittime)
admissions.admittime.head()

0   2196-04-09 12:26:00
1   2153-09-03 07:15:00
2   2157-10-18 19:34:00
3   2139-06-06 16:14:00
4   2160-11-02 02:06:00
Name: admittime, dtype: datetime64[ns]

You might have noticed that, according to the admission dates, admission events occurred between 2100 and 2200, even though the events actually occurred between 2001 and 2012. The reason is that, before the data was incorporated into the MIMIC-III database, it was deidentified in accordance with Health Insurance Portability and Accountability Act (HIPAA) standards using structured data cleansing and date shifting. The deidentification process for structured data required the removal of all eighteen of the identifying data elements listed in HIPAA, including fields such as patient name, telephone number, address, and dates.

In particular, dates were shifted into the future by a random offset for each individual patient in a consistent manner to preserve intervals, resulting in stays which occur sometime between the years 2100 and 2200. Time of day, day of the week, and approximate seasonality were conserved during date shifting. Dates of birth for patients aged over 89 were shifted to obscure their true age and comply with HIPAA regulations: these patients appear in the database with ages of over 300 years.

Since the intervals are preserved, this won't create any issues, as we are only interested in differences between times (length of stay, age, time to readmission, etc.). Let's go ahead and convert the remaining dates as well.

In [0]:
admissions.dischtime = pd.to_datetime(admissions.dischtime)
admissions.edouttime = pd.to_datetime(admissions.edouttime)
admissions.edregtime = pd.to_datetime(admissions.edregtime)
admissions.deathtime = pd.to_datetime(admissions.deathtime)
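
With the columns converted, interval features such as length of stay become simple subtractions. The sketch below uses the timestamps of the first two admissions shown earlier. The column name `los_days` and the fixed offset are assumptions made for illustration, and the offset only mimics the idea of date shifting, not the actual MIMIC-III procedure.

```python
import pandas as pd

# Timestamps of the first two admissions displayed earlier.
toy = pd.DataFrame({
    "admittime": ["2196-04-09T12:26:00.000Z", "2153-09-03T07:15:00.000Z"],
    "dischtime": ["2196-04-10T15:54:00.000Z", "2153-09-08T19:10:00.000Z"],
})
for col in ["admittime", "dischtime"]:
    toy[col] = pd.to_datetime(toy[col])

# Length of stay in (fractional) days.
toy["los_days"] = (toy["dischtime"] - toy["admittime"]).dt.total_seconds() / 86400

# Shifting both timestamps by the same offset leaves the interval unchanged,
# which is why the date-shifted data is still usable for interval features.
offset = pd.Timedelta(days=365 * 150)
shifted_los = (
    (toy["dischtime"] - offset) - (toy["admittime"] - offset)
).dt.total_seconds() / 86400

print(toy["los_days"].round(2).tolist())
```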

Now that we have all the data in the expected format, let's look at the categorical data. Consider religion:

In [9]:
pd.set_option('display.max_rows', 30)
admissions.religion.value_counts(dropna=False)

CATHOLIC                  17446
NOT SPECIFIED              9245
UNOBTAINABLE               7508
PROTESTANT QUAKER          6048
JEWISH                     4601
OTHER                      2399
EPISCOPALIAN                652
NaN                         458
GREEK ORTHODOX              366
CHRISTIAN SCIENTIST         351
BUDDHIST                    228
MUSLIM                      206
JEHOVAH'S WITNESS           116
UNITARIAN-UNIVERSALIST      101
HINDU                        89
ROMANIAN EAST. ORTH          71
7TH DAY ADVENTIST            64
BAPTIST                      28
HEBREW                       15
METHODIST                     7
LUTHERAN                      1
Name: religion, dtype: int64

Based on the counts above, we can see that several religions, such as LUTHERAN and METHODIST, appear in very few records (compared to the 50k total). We can merge these into the "OTHER" category, as we cannot extract any meaningful information from so few records. Furthermore, "NOT SPECIFIED" and "UNOBTAINABLE" mean essentially the same thing. NaN can also be merged into the NOT SPECIFIED category, as NaN indicates a problem in recording the data and we can treat it as an unobtainable value. Another thing to consider is that CATHOLIC, EPISCOPALIAN, etc. are all Christian denominations.
To group the religions, we first merge similar religions (e.g. the different Christian denominations), then merge every religion with fewer than 1000 records into "OTHER".

In [10]:
# Christian denominations to merge into a single CHRISTIAN category
christian = ["CATHOLIC","PROTESTANT QUAKER","EPISCOPALIAN","GREEK ORTHODOX","CHRISTIAN SCIENTIST","JEHOVAH'S WITNESS","UNITARIAN-UNIVERSALIST","ROMANIAN EAST. ORTH","7TH DAY ADVENTIST","BAPTIST","METHODIST"]
# Exact-value replacement, so names such as "ROMANIAN EAST. ORTH" are not treated as regex patterns
admissions.religion = admissions.religion.replace(christian, "CHRISTIAN")
admissions.religion = admissions.religion.replace("UNOBTAINABLE", "NOT SPECIFIED")
admissions.religion = admissions.religion.fillna("NOT SPECIFIED")
# Merge every remaining religion with fewer than 1000 records into OTHER
counts = admissions.religion.value_counts()
admissions.religion = admissions.religion.replace(list(counts[counts < 1000].index), "OTHER")
admissions.religion.value_counts()

CHRISTIAN        25250
NOT SPECIFIED    17211
JEWISH            4601
OTHER             2938
Name: religion, dtype: int64

Now the number of religion categories has been reduced to four, which makes the column easier to work with at a minimal loss of information.
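
Since other categorical columns are likely to need similar treatment, the "merge rare values" step could be factored into a small helper. This is a sketch under assumptions: `lump_rare_categories` and its parameters are hypothetical names, not part of the original notebook, and the data is made up.

```python
import pandas as pd

def lump_rare_categories(values: pd.Series, min_count: int,
                         other_label: str = "OTHER") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Series.where keeps values where the condition holds and replaces the rest;
    # missing values are not in `rare`, so they stay missing.
    return values.where(~values.isin(rare), other_label)

# Toy data for illustration only.
toy_status = pd.Series(["MARRIED"] * 5 + ["SINGLE"] * 3 + ["SEPARATED", "LIFE PARTNER"])
lumped = lump_rare_categories(toy_status, min_count=2)
print(lumped.value_counts())
```

The threshold is a judgment call. The value 1000 used for religion suits a table of 50k rows, but a different cutoff may fit other columns better.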

Let's move on to marital status.

In [11]:
admissions.marital_status.value_counts()

MARRIED              19952
SINGLE               10775
WIDOWED               6075
DIVORCED              2635
SEPARATED              496
UNKNOWN (DEFAULT)      294
LIFE PARTNER            11
Name: marital_status, dtype: int64

<a id='Data_exploration'></a>
# Data Exploration

<a id='Data_modeling'></a>
# Data Modeling

<a id='Presentation'></a>
# Presentation