# [Paris Saclay Center for Data Science](http://www.datascience-paris-saclay.fr)

## ICU stay: predicting length-of-stay (LOS) with characteristics at admission

_David BERTOIN, Emmanuel GILSON, Vincent KERMOUNI, Paul MANGOLD, Dinh-Phong NGUYEN_

## Introduction

Intensive care unit (ICU) length of stay (LOS) is a frequent measure of ICU **resource use** and **performance** [1]. Predictions of ICU LOS are routinely used to guide resource allocation, because patients with prolonged ICU LOS account for a large proportion of resource use [2]. Early identification of these patients may help in future **planning**, such as determining discharge alternatives (e.g. long-term acute care facilities) or making sure the receiving ward after stabilization has enough available beds at the time of discharge.

Nevertheless, prediction of ICU LOS is difficult and less studied than the prediction of mortality [3]. Prolonged stay in ICU not only **increases the overall costs** and **consumes more resources**, but also **limits the number of beds** available for use. Patients, families, physicians and managers also demand better health care information. Finally, predictive ICU models could be a building block in the larger process of making _do not resuscitate_ (DNR) decisions, which determine whether to stop patient therapy to avoid unnecessary suffering and treatment costs [4].

The ability to predict LOS as an **initial assessment** of patients’ risk is therefore critical for better **resource planning and allocation**, especially when resources are limited, as in ICUs. It can also facilitate management with **higher flexibility in hospital bed use** and a better assessment of the **cost-effectiveness of treatment**.

Thus, we believe ICU LOS to be a **very valuable key performance indicator (KPI)**, pertinent in all the critical hospital management fields mentioned above. The goal of this challenge is to predict ICU LOS with the help of patients' characteristics at admission in ICU. All data has been queried from the MIMIC-III database [4], an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~40,000 critical care patients.

### Requirements

* numpy $\geq$ 1.10.0  
* matplotlib $\geq$ 1.5.0 
* pandas $\geq$ 0.19.0  
* scikit-learn $\geq$ 0.18 (the `sklearn.model_selection` module imported below was introduced in v0.18)   

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

### Loading data

In [None]:
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename)
data['dob'] = pd.to_datetime(data['dob'])
data

### Variables description

- __SUBJECT_ID, HADM_ID, ICUSTAY_ID__: <br>
Identifiers which specify the patient: SUBJECT_ID is unique to a patient, HADM_ID is unique to a patient hospital stay and ICUSTAY_ID is unique to a patient ICU stay.


- __DOB__: <br>
is the date of birth of the given patient. Patients who are older than 89 years old at any time in the database have had their date of birth shifted to obscure their age and comply with HIPAA. The shift process was as follows: the patient’s age at their first admission was determined. The date of birth was then set to exactly 300 years before their first admission.


- __ADMISSION_TYPE__: <br>
describes the type of the admission: ‘ELECTIVE’, ‘URGENT’, ‘NEWBORN’ or ‘EMERGENCY’. Emergency/urgent indicate unplanned medical care, and are often collapsed into a single category in studies. Elective indicates a previously planned hospital admission. Newborn indicates that the HADM_ID pertains to the patient’s birth.


- __INSURANCE, LANGUAGE, RELIGION, MARITAL_STATUS, ETHNICITY__: <br>
These columns describe patient demographics. These columns occur in the ADMISSIONS table as they are originally sourced from the admission, discharge, and transfers (ADT) data from the hospital database. The values occasionally change between hospital admissions (HADM_ID) for a single patient (SUBJECT_ID). This is reasonable for some fields (e.g. MARITAL_STATUS, RELIGION), but less reasonable for others (e.g. ETHNICITY).


- __DIAGNOSIS__ : <br>
The DIAGNOSIS column provides a preliminary, free text diagnosis for the patient on hospital admission. The diagnosis is usually assigned by the admitting clinician and does not use a systematic ontology.


- __FIRST_CAREUNIT, LAST_CAREUNIT__ : <br>
Contain, respectively, the first and last ICU type in which the patient was cared for.


- __sysbp_min, sysbp_max, sysbp_mean__ : <br>
Contain, respectively, the minimum, maximum and mean of the patient's systolic blood pressure measured during their first day in the ICU.


- __diasbp_min, diasbp_max, diasbp_mean__ : <br>
Contain, respectively, the minimum, maximum and mean of the patient's diastolic blood pressure measured during their first day in the ICU.


- __meanbp_min, meanbp_max, meanbp_mean__ : <br>
Contain, respectively, the minimum, maximum and mean of the patient's mean blood pressure measured during their first day in the ICU. Mean blood pressure is the combination of 1/3 of the patient's systolic blood pressure and 2/3 of the patient's diastolic blood pressure (the usual approximation of mean arterial pressure).


- __resprate_min, resprate_max, resprate_mean__ : <br>
Contain, respectively, the minimum, maximum and mean of the patient's respiratory rate measured during their first day in the ICU.


- __tempc_min, tempc_max, tempc_mean__ : <br>
Contain, respectively, the minimum, maximum and mean of the patient's temperature (in degrees Celsius) measured during their first day in the ICU.


- __spo2_min, spo2_max, spo2_mean__ : <br>
Contain, respectively, the minimum, maximum and mean of the patient's peripheral oxygen saturation (SpO2) measured during their first day in the ICU.


- __glucose_min, glucose_max, glucose_mean__ : <br>
Contain, respectively, the minimum, maximum and mean of the patient's blood sugar level measured during their first day in the ICU.
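
Two of the descriptions above translate directly into preprocessing steps. The sketch below, on a toy frame, merges `URGENT` into `EMERGENCY` and derives an approximate age in years while capping the shifted dates of birth. The `admittime` column, the toy values, the `admission_group` and `age` column names and the cap of 90 are all assumptions made for illustration; they are not taken from the challenge data.

```python
import pandas as pd

# Toy rows; `admittime` is a hypothetical admission timestamp column.
toy = pd.DataFrame({
    "dob": pd.to_datetime(["2100-05-01", "1850-03-10"]),
    "admittime": pd.to_datetime(["2160-06-15", "2150-03-10"]),
    "admission_type": ["URGENT", "ELECTIVE"],
})

# Collapse the two unplanned categories into a single one.
toy["admission_group"] = toy["admission_type"].replace({"URGENT": "EMERGENCY"})

# Use the difference of calendar years (approximate, can be off by one).
# Subtracting timestamps 300 years apart can overflow the range of a
# nanosecond-resolution Timedelta.
age = toy["admittime"].dt.year - toy["dob"].dt.year

# Shifted dates of birth give ages of about 300; cap them at 90
# (an assumed placeholder value).
toy["age"] = age.clip(upper=90)
print(toy[["admission_group", "age"]])
```

Whether a cap, a separate indicator column, or both work best for patients with shifted dates of birth is a modelling choice left open here.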



### Basic data exploration

In [None]:
data.dtypes

In [None]:
data.describe()

In [None]:
data.count()
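
Since `count()` only reports the number of non-null entries, it can help to express missingness as a fraction per column. The following is a minimal sketch on a toy frame with made-up values; with the challenge data, the same method chain could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing vital signs (values are made up).
toy = pd.DataFrame({
    "heartrate_mean": [80.0, np.nan, 95.0, 70.0],
    "tempc_mean": [36.8, np.nan, np.nan, 37.2],
    "admission_type": ["EMERGENCY", "ELECTIVE", "EMERGENCY", "URGENT"],
})

# Fraction of missing values per column, most incomplete column first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```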

In [None]:
plt.figure(figsize=(16, 10))
plt.subplot(2,2,1)
plt.title('Gender')
data.gender.value_counts().plot(kind='bar')
plt.subplot(2,2,2)
plt.title('Admission type')
data.admission_type.value_counts().plot(kind='bar')
plt.figure(figsize=(16, 10))
plt.subplot(2,2,1)
plt.title('Religion')
data.religion.value_counts().plot(kind='bar')
plt.subplot(2,2,2)
plt.title('Ethnicity')
data.ethnicity.value_counts().plot(kind='bar')

## The pipeline

For submitting at the [RAMP site](http://ramp.studio), you will have to write two classes, saved in two different files:   
* the class `FeatureExtractor`, which will be used to extract features for regression from the dataset and produce a numpy array of size (number of samples $\times$ number of features). 
* the class `Regressor`, which will be used to predict the ICU length of stay from these features.
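
How the two classes fit together can be sketched as follows. This is only an illustration of the call order (fit the extractor, transform, fit the regressor, predict), assuming stand-in classes `ToyFeatureExtractor` and `ToyRegressor`, a toy frame and made-up target values. The actual orchestration is done by the RAMP workflow and may differ in its details.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class ToyFeatureExtractor:
    """Stand-in extractor (hypothetical): keeps two numeric columns."""

    def fit(self, X_df, y_array):
        return self

    def transform(self, X_df):
        return X_df[["sysbp_mean", "resprate_mean"]].values


class ToyRegressor:
    """Stand-in regressor (hypothetical) wrapping a linear regression."""

    def fit(self, X, y):
        self.reg = LinearRegression()
        self.reg.fit(X, y)
        return self

    def predict(self, X):
        return self.reg.predict(X)


X_df = pd.DataFrame({
    "sysbp_mean": [110.0, 130.0, 95.0, 120.0],
    "resprate_mean": [18.0, 22.0, 25.0, 16.0],
})
y = np.array([2.0, 4.5, 7.0, 1.5])  # made-up LOS values

# 1. Fit the feature extractor, 2. build the feature matrix,
# 3. fit the regressor, 4. predict.
fe = ToyFeatureExtractor().fit(X_df, y)
X = fe.transform(X_df)
y_pred = ToyRegressor().fit(X, y).predict(X)
print(X.shape, y_pred.shape)
```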

### Feature Extractor

The feature extractor implements a `transform` member function. It is saved in the file [`submissions/starting_kit/feature_extractor.py`](/edit/submissions/starting_kit/feature_extractor.py). It receives a pandas dataframe `X_df` holding the input columns of the data loaded at the beginning of the notebook. It should produce a numpy array representing the extracted features, which will then be used for the regression.  

Note that the following code cells are *not* executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.

In [None]:
%%file submissions/starting_kit/feature_extractor.py
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.preprocessing import LabelBinarizer


def fill_mean(feat):
    """Replace the missing values of a column by the mean of that column."""
    return feat.fillna(feat.mean())


class FeatureExtractor(object):
    def __init__(self):
        # Placeholder categories for missing and unseen admission types.
        self.na_string = "__na"
        self.other_string = "__other"

    def rename_other(self, X_df):
        # Map categories that were not seen during fit to the placeholder.
        known = set(self.encoder.classes_)
        return X_df.map(lambda s: s if s in known else self.other_string)

    def fit(self, X_df, y_array=None):
        # Append the "other" placeholder so that it is always a known class.
        X = pd.concat([X_df['admission_type'],
                       pd.Series([self.other_string])])
        self.encoder = LabelBinarizer()
        self.encoder.fit(X.fillna(self.na_string))
        return self

    def transform(self, X_df):
        numeric = ['heartrate_mean', 'sysbp_mean', 'diasbp_mean',
                   'resprate_mean', 'tempc_mean']
        # Drop the original index so that the concatenation aligns row-wise.
        X = X_df[numeric + ['admission_type']].reset_index(drop=True)
        filled = [fill_mean(X[col]) for col in numeric]
        admit = pd.DataFrame(self.encoder.transform(
            self.rename_other(X['admission_type'].fillna(self.na_string))))
        X = pd.concat(filled + [admit], axis=1)
        # Return a numpy array, as described above.
        return X.values

    def fit_transform(self, X_df, y_array=None):
        return self.fit(X_df, y_array).transform(X_df)
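
One design point worth noting: a `fill_mean`-style helper computes the mean of whatever frame it receives, so at test time it imputes with test-set statistics. A possible variant, sketched below under the assumption that training means are preferred, stores the means in `fit` and reuses them in `transform`. The class name `MeanImputingExtractor` is hypothetical and this is not the starting kit's implementation.

```python
import numpy as np
import pandas as pd


class MeanImputingExtractor:
    """Hypothetical extractor that imputes with means learned in fit."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X_df, y_array=None):
        # Column means are computed on the training frame only.
        self.means_ = X_df[self.columns].mean()
        return self

    def transform(self, X_df):
        # Missing values are replaced by the stored training means.
        return X_df[self.columns].fillna(self.means_).values


train = pd.DataFrame({"sysbp_mean": [100.0, 120.0, np.nan]})
test = pd.DataFrame({"sysbp_mean": [np.nan, 140.0]})

fe = MeanImputingExtractor(["sysbp_mean"]).fit(train)
X_test = fe.transform(test)
print(X_test)
```

Here the missing test value is filled with the training mean rather than with the mean of the test frame.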

### Regressor

The regressor follows a classical scikit-learn regressor template. It should be saved in the file [`submissions/starting_kit/regressor.py`](/submissions/starting_kit/regressor.py). In its simplest form it creates a scikit-learn estimator, assigns it to `self.reg` in `fit`, then calls the estimator's `fit` and `predict` functions in the corresponding member functions.

In [None]:
%%file submissions/starting_kit/regressor.py
# -*- coding: utf-8 -*-
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression


class Regressor(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        self.reg = LinearRegression()
        self.reg.fit(X, y)
        # Return the fitted estimator, following the scikit-learn convention.
        return self

    def predict(self, X):
        return self.reg.predict(X)

The next cell writes an alternative regressor, a random forest, to the same file; running it overwrites the linear regression version above.

In [None]:
%%file submissions/starting_kit/regressor.py
# -*- coding: utf-8 -*-
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestRegressor


class Regressor(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        self.reg = RandomForestRegressor(n_estimators=500, n_jobs=-1)
        self.reg.fit(X, y)
        # Return the fitted estimator, following the scikit-learn convention.
        return self

    def predict(self, X):
        return self.reg.predict(X)
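
Before running the full RAMP test, a regressor can be sanity-checked with plain scikit-learn cross-validation. The sketch below uses synthetic data and the root mean squared error; both are assumptions made for illustration, and the score actually reported by the challenge may be defined differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic features and target (not the challenge data).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
coef = np.array([1.0, 0.5, 0.0, -0.5, 2.0])
y = X @ coef + rng.normal(scale=0.1, size=200)

# scikit-learn maximises scores, hence the negated mean squared error.
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=5)
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean())
```

The same call works with any estimator exposing `fit` and `predict`, so a linear model and a random forest can be compared on identical folds.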

## Local testing (before submission)

It is <b><span style="color:red">important that you test your submission files before submitting them</span></b>. For this we provide a unit test. Note that the test runs on your files in [`submissions/starting_kit`](/tree/submissions/starting_kit), not on the classes defined in the cells of this notebook.

First `pip install ramp-workflow` or install it from the [GitHub repo](https://github.com/paris-saclay-cds/ramp-workflow). Make sure that the python files `feature_extractor.py` and `regressor.py` are in the  [`submissions/starting_kit`](/tree/submissions/starting_kit) folder, and that the data files `train.csv` and `test.csv` are in [`data`](/tree/data). Then run

```ramp_test_submission```

If it runs and prints the training and test errors on each fold, then you can submit the code.

In [None]:
!ramp_test_submission --quick-test

### References

[1] Rhodes A, Moreno RP, Azoulay E et al. Prospectively defined indicators to improve the safety and quality of care for critically ill patients: a report from the Task Force on Safety and Quality of the European Society of Intensive Care Medicine (ESICM). Intensive Care Med 2012;38:598–605

[2] Stricker K, Rothen HU, Takala J. Resource use in the ICU: short- vs. long-term patients. Acta Anaesthesiol Scand  2003;47:508–15

[3] Perez A, Chan W, Dennis RJ. Predicting the length of stay of patients admitted for intensive care using a first step analysis. Health Services and Outcomes Research Methodology 2006;6:127–138

[4] Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Scientific Data 2016. DOI: 10.1038/sdata.2016.35. Available at: http://www.nature.com/articles/sdata201635

[5] Pei-Fang (Jennifer) Tsai, Po-Chia Chen, Yen-You Chen, et al., Length of Hospital Stay Prediction at the Admission Stage for Cardiology Patients Using Artificial Neural Network, Journal of Healthcare Engineering, vol. 2016, Article ID 7035463, 11 pages, 2016. doi:10.1155/2016/7035463