# MIMIC-III and eICU-CRD: Freely available critical care databases

## Tom Pollard

### MIT Laboratory for Computational Physiology (MIT-LCP)

<img style="float: right; height: 100px" src="./images/MIT-LCP-logo.png">

#### AIMed Datathon: Sat 8 Sept 2018

# Overview

1. Programme
2. Datasets
    - MIMIC-III
    - eICU Collaborative Research Database
3. Key points
    - Context
    - Distribution and access
4. How are people using the data?
5. Navigating the data
6. Reproducibility and final points

# Programme

## Today
**08:30**: Registration  

**09:00**: Overview of the datasets (Tom)  
**09:30**: Pitching  
**10:00**: Coffee and team formation  
**10:30**: Tutorial (Google Cloud)

**11:00**: Start projects!  

**13:00**: Lunch  
**18:00**: Drinks 

## Tomorrow

**09:00**: Projects

**12:00**: Lunch  

**15:00**: Final presentations  
**16:00**: Post-datathon guidance (Google Cloud)  
**16:30**: Awards and closing  

# Datasets

## `http://mimic.physionet.org`

![MIMIC website](./images/mimicwebsite.png) 

# MIMIC-III (**v1.4**)

MIMIC-III = Medical Information Mart for Intensive Care

- Freely available! 
- **Single Centre**: Beth Israel Deaconess Medical Center
    - U.S.-based (Boston, MA)
    - Has MICU, SICU, CCU, CSRU, TSICU, ...
- **Detailed in-ICU** information derived from: electronic medical records, critical care information systems, lab systems, ...
- **Limited out-of-ICU** information (Social Security Death Master File)
- \>60,000 ICU stays, \>40,000 patients (2001-2012)

## `http://eicu-crd.mit.edu`

![eICU-CRD website](./images/eicuwebsite.png)

# eICU Collaborative Research Database (**v2.0**)

- Freely available! 
- **Multi-Centre**: Sourced from Philips eICU Telehealth Program.
    - \> 250 ICUs across the United States
    - Has MICU, SICU, CCU, ...
- **Detailed in-ICU** information derived from: electronic medical records, critical care information systems
- **Limited outside-of-ICU** information.
- **~200,000 stays between 2014 and 2015** (the private source dataset is about 10x larger)
- **Wide variation in data quality** between hospitals

# Why put effort into sharing these datasets?

- For reproducibility.
- For benchmarking.
- For education.
- For collaboration.
- To encourage others to do likewise.

... and ultimately to **accelerate progress in health research**.

# Key points

# Real patients


![ICU patients](./images/icu_patient.png)

# Real data

![MIMIC on Reddit](./images/reddit.png)

- The data does not come in nice, tidy spreadsheets.
- The data was not collected especially for us to use for research.

# Getting access

- Sign a Data Use Agreement 
- Take a free, online course in human subjects research
- Data typically shared as CSVs with build scripts
- Updates are by formal versioned releases

![MIMIC Data Use Agreement](./images/mimicdua.png)


# Datathon access

![Google Cloud](./images/google_cloud.png)

# How are people using the data?

# Global Open Source Severity of Illness Score (GOSSIS)

- How GOSSIS differs from APACHE-IV:
    - Uses more heterogeneous data.
    - Uses min/max (mostly) instead of 'worst'.
    - Does not bin predictors.
    - Integrates other sources of information for missing data, instead of assuming 'normal'.
    - Takes a hierarchical approach for diagnoses.

![GOSSIS](./images/gossis.png)

# Evaluating sepsis criteria

- Compared five methods of identifying sepsis in electronic health records
- Found large variation in cohort sizes and severity of illness as measured by in-hospital mortality rate
- Sepsis studies should recognize the differences in identification methods and contextualize their findings according to the different cohorts identified.

![Sepsis-3](./images/sepsis3.png)

> Johnson AEW, Aboab J, Raffa J, Pollard T, Deliberato R, Celi LA, Stone D. A Comparative Analysis of Sepsis Identification Methods in an Electronic Database. Critical Care Medicine (2018). http://dx.doi.org/10.1097/CCM.0000000000002965

# Conferences (e.g. among the 26 papers at MLHC 2017...)

- [Continuous State-Space Models for Optimal Sepsis Treatment: a Deep Reinforcement Learning Approach](http://proceedings.mlr.press/v68/raghu17a/raghu17a.pdf)
- [Generating Multi-label Discrete Patient Records using Generative Adversarial Networks](http://proceedings.mlr.press/v68/choi17a/choi17a.pdf)
- [Marked Point Process for Severity of Illness Assessment](http://proceedings.mlr.press/v68/islam17a/islam17a.pdf)
- [Piecewise-constant parametric approximations for survival learning](http://proceedings.mlr.press/v68/weiss17a/weiss17a.pdf)
- [Diagnostic Inferencing via Improving Clinical Concept Extraction with Deep Reinforcement Learning: A Preliminary Study](http://proceedings.mlr.press/v68/ling17a/ling17a.pdf)
- [Clinical Intervention Prediction and Understanding with Deep Neural Networks](http://proceedings.mlr.press/v68/suresh17a/suresh17a.pdf)
- [Reproducibility in critical care: a mortality prediction case study](http://proceedings.mlr.press/v68/johnson17a/johnson17a.pdf)

![MLHC 2017](./images/MLHC.png)



# Navigating the data (focusing on MIMIC)

# `patients`, `admissions` and `icustays` 

![patient tracking tables](./images/subj_hadm_icustays.png)
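
As a minimal sketch of how the three tracking tables link together, the example below joins them on `subject_id` and `hadm_id`. It assumes an in-memory SQLite database standing in for the real PostgreSQL or BigQuery instance, keeps only a few columns, and uses invented rows; the table and key names follow MIMIC-III.

```python
import sqlite3

# In-memory SQLite stands in for the real database; every row below is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients   (subject_id INTEGER, gender TEXT);
CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER, admittime TEXT);
CREATE TABLE icustays   (subject_id INTEGER, hadm_id INTEGER,
                         icustay_id INTEGER, intime TEXT);
INSERT INTO patients   VALUES (1, 'F'), (2, 'M');
INSERT INTO admissions VALUES (1, 100, '2150-01-01 08:00:00'),
                              (1, 101, '2150-06-10 12:00:00'),
                              (2, 200, '2131-03-05 09:30:00');
INSERT INTO icustays   VALUES (1, 100, 1000, '2150-01-01 10:00:00'),
                              (1, 100, 1001, '2150-01-04 02:00:00'),
                              (1, 101, 1002, '2150-06-10 15:00:00'),
                              (2, 200, 2000, '2131-03-05 11:00:00');
""")

# One patient can have many admissions; one admission can have many ICU stays.
stay_rows = conn.execute("""
SELECT p.subject_id, a.hadm_id, i.icustay_id
FROM patients p
JOIN admissions a ON a.subject_id = p.subject_id
JOIN icustays i   ON i.hadm_id = a.hadm_id
ORDER BY p.subject_id, a.hadm_id, i.icustay_id
""").fetchall()

for subject_id, hadm_id, icustay_id in stay_rows:
    print(subject_id, hadm_id, icustay_id)
```

Each output row is one ICU stay, so a patient with several stays appears several times. Decide early whether your unit of analysis is the patient, the admission or the ICU stay.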

# Events tables

- `chartevents`: Charted observations for a patient
- `labevents`: Lab measurements both within hospital and (sometimes) outpatient clinics
- `inputevents`: Input fluids (e.g. intravenous medications)
- `microbiologyevents`: Microbiology measurements and sensitivities
- <s>`noteevents`</s>: Deidentified patient notes

# Other tables

- `diagnoses_icd`: Hospital-assigned diagnosis codes
- `procedures_icd`: Hospital-assigned procedure codes
- `caregivers`: Caregivers who have recorded data
- `prescriptions`: Medications ordered for a patient
- ...

More tables and full documentation: https://mimic.physionet.org/

# Dictionary tables

- `d_cpt`
- `d_icd_diagnoses`
- `d_icd_procedures`
- `d_items`
- `d_labitems`

Lookup tables for main tables.

# `SELECT * FROM d_items WHERE label ILIKE '%heart%'`

![heartrate](./images/query_d_items.png)

# `SELECT * FROM chartevents WHERE itemid IN (211,220045)`

![image.png](./images/query_chartevents.png)

Please use the code repository!
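
The two queries above can be chained: find the relevant `itemid` values in `d_items`, then pull the matching rows from `chartevents`. The sketch below is illustrative only. It assumes an in-memory SQLite database with invented measurements, and uses `LIKE` (case-insensitive for ASCII text in SQLite) where PostgreSQL would use `ILIKE`. The item IDs 211 and 220045 are the ones used in the query above.

```python
import sqlite3

# Toy stand-in for the real tables; measurements are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_items     (itemid INTEGER, label TEXT, dbsource TEXT);
CREATE TABLE chartevents (icustay_id INTEGER, itemid INTEGER,
                          charttime TEXT, valuenum REAL);
INSERT INTO d_items VALUES (211,    'Heart Rate',       'carevue'),
                           (220045, 'Heart Rate',       'metavision'),
                           (618,    'Respiratory Rate', 'carevue');
INSERT INTO chartevents VALUES (1000, 211,    '2150-01-01 10:00:00', 88),
                               (1000, 618,    '2150-01-01 10:00:00', 18),
                               (2000, 220045, '2131-03-05 11:00:00', 102);
""")

# Step 1: look up candidate item IDs by label in the dictionary table.
heart_itemids = [
    row[0]
    for row in conn.execute(
        "SELECT itemid FROM d_items WHERE label LIKE '%heart%' ORDER BY itemid"
    )
]

# Step 2: fetch the charted values and attach the label via a join.
heart_rates = conn.execute("""
SELECT c.icustay_id, d.label, d.dbsource, c.valuenum
FROM chartevents c
JOIN d_items d ON d.itemid = c.itemid
WHERE c.itemid IN (211, 220045)
ORDER BY c.icustay_id
""").fetchall()

print(heart_itemids)
for row in heart_rates:
    print(row)
```

The same concept can appear under more than one `itemid`, here one per source system, so it is worth checking `d_items` before filtering `chartevents`.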

# Putting it together

![verbal](./images/examplepatient.jpeg)

# Read the docs!

![read the docs](./images/mimicdocs.png)

# Don't start from scratch

![MIMIC Code Repository](./images/coderepo.png)

# Concepts

- Code for severity of illness scores and sepsis definitions is available in the code repository.
- Also available on BigQuery. No need to run code to get concepts!

- I often use:
    - Elixhauser (comorbidity burden index)
    - SOFA
    - OASIS (alternative to APACHE)
    - Angus sepsis definition
    - Ventilation durations
    
- Tutorials online:
    - e.g. Notebooks at: https://eicu-crd.mit.edu/tutorials/admissiondrug/

**Please use these resources!** We are happy to incorporate your contributions to improve the concepts!

# FAQs

**Q1. I think there is something wrong with the dates. They are all in 21xx.**    
A1. This is part of the deidentification process: dates are shifted into the future. Within a patient, all intervals are preserved. Between patients, the intervals are not meaningful, so do not compare dates across patients.
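
Because intervals within a patient are preserved, durations such as length of stay can still be computed from the shifted dates. The sketch below illustrates the idea with invented timestamps and a hypothetical per-patient offset; the real shifting procedure is not reproduced here.

```python
from datetime import datetime, timedelta


def length_of_stay_days(intime, outtime):
    """Return the interval between two timestamps in fractional days."""
    return (outtime - intime).total_seconds() / 86400


# Invented "true" timestamps for one patient.
real_in = datetime(2010, 3, 1, 10, 0)
real_out = datetime(2010, 3, 3, 22, 0)

# Hypothetical offset applied to every date belonging to this patient.
offset = timedelta(days=51135)
shifted_in = real_in + offset
shifted_out = real_out + offset

real_los = length_of_stay_days(real_in, real_out)
shifted_los = length_of_stay_days(shifted_in, shifted_out)

print(shifted_in.year, real_los, shifted_los)
```

The absolute year changes but the duration does not, which is why within-patient intervals remain safe to analyse.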

**Q2. Why are there patients that are 300 years old?**  
A2. This is part of the deidentification process. Under HIPAA, an age over 89 is protected health information, so dates of birth for these patients are shifted to make them appear about 300 years old. Be careful with analysis of age!
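
One way to guard against this is to compute age at admission and flag the over-89 group before any analysis. This is a minimal sketch with invented dates and a hypothetical `age_at_admission` helper; how to treat the flagged patients (exclude them, or assign a single category) is a study design choice.

```python
from datetime import datetime

DAYS_PER_YEAR = 365.242


def age_at_admission(dob, admittime):
    """Return age in years at admission, computed from (shifted) datetimes."""
    return (admittime - dob).days / DAYS_PER_YEAR


# Invented rows: the second patient mimics the shifted date of birth
# used for patients older than 89.
admissions = [
    {"subject_id": 1, "dob": datetime(2090, 5, 1), "admittime": datetime(2150, 5, 1)},
    {"subject_id": 2, "dob": datetime(1850, 1, 1), "admittime": datetime(2150, 1, 1)},
]

for adm in admissions:
    adm["age"] = age_at_admission(adm["dob"], adm["admittime"])
    # Flag instead of trusting the raw value: the true age is only known to be > 89.
    adm["over_89"] = adm["age"] > 89
    print(adm["subject_id"], round(adm["age"], 1), adm["over_89"])
```

Averaging the raw ages without this flag would pull the mean sharply upwards.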

**Q3. How do I find a patient’s first ICU stay?**  
A3. Use the `icustay_detail` concept table and have a look at `icustay_seq`.
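
For intuition, a sequence number of this kind can be derived by ordering each patient's stays by admission time. The sketch below assumes an in-memory SQLite database (version 3.25 or later, for window functions) with invented rows. The column `stay_seq` is a hypothetical name, and the real concept table may partition and rank stays differently, so prefer the maintained concept in practice.

```python
import sqlite3

# Toy icustays table; rows are invented and deliberately out of order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE icustays (subject_id INTEGER, hadm_id INTEGER,
                       icustay_id INTEGER, intime TEXT);
INSERT INTO icustays VALUES (1, 100, 1001, '2150-01-04 02:00:00'),
                            (1, 100, 1000, '2150-01-01 10:00:00'),
                            (1, 101, 1002, '2150-06-10 15:00:00'),
                            (2, 200, 2000, '2131-03-05 11:00:00');
""")

# Number each patient's stays by ICU admission time and keep the first.
first_stays = conn.execute("""
WITH ranked AS (
    SELECT subject_id, icustay_id, intime,
           ROW_NUMBER() OVER (
               PARTITION BY subject_id ORDER BY intime
           ) AS stay_seq
    FROM icustays
)
SELECT subject_id, icustay_id
FROM ranked
WHERE stay_seq = 1
ORDER BY subject_id
""").fetchall()

print(first_stays)
```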

**Q4. How do I continue on with my project after the datathon?**   
A4. We’ll talk more about it tomorrow, but for MIMIC-III and eICU-CRD you would need to sign a similar data use agreement and take a course on human subjects research.

# MIMIC: strengths and limitations

**Strengths**:  
1. Well-developed documentation and codebase.
2. Large community and literature.
3. Provider notes available **

**Limitations**:  
1. Data is slightly stale (2001-2012).
2. Smaller sample size when compared with eICU database.
3. Switchover from CareVue to MetaVision can complicate extraction.
4. Generated from a single center.
5. Diagnoses only available at end of patient stay.

# eICU Database: strengths and limitations

**Strengths**:  
1. Recent data (2014-2015).    
2. Well documented severity (APACHE) scores for all patients.
3. `carePlanGeneral` table contains treatment plans.
4. Diagnoses available during the patient stay.

**Limitations**: 
1. Underdeveloped documentation and code.
2. Little prior literature using the data (perhaps good, also!).  
3. Data quality is highly variable between hospitals.  

# Reproducibility



# Reproducibility in critical care

- Reviewed 28 published mortality prediction models using MIMIC.
- Attempted to reproduce the cohorts (+ bonus, compared performance to logistic regression).
- 75% of reproduced cohorts differed by >1000 patients. Percent mortality differed by up to ~20%.

![MLHC results](./images/mlhc-paper-results.png)

> Johnson AEW, Pollard T, Mark RG. Reproducibility in critical care: a mortality prediction case study. Proceedings of the 2nd Machine Learning for Healthcare Conference (2017).

# Practical steps

- Version control your code
- Host your code in a repository
- Be clear about the version of data you are using
- Include a license
- Include a readme explaining how to run the code
- Make it easy for others to run your code (e.g. use Jupyter/Colab notebooks and RMarkdown)
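
One lightweight way to be clear about the data version is to save a small provenance record alongside your results. The sketch below is an illustration only: `build_run_record` and its field names are hypothetical, not part of any MIMIC or eICU tooling.

```python
import json
import platform
from datetime import date


def build_run_record(dataset, dataset_version, notes=""):
    """Collect the details someone else needs to rerun an analysis."""
    return {
        "dataset": dataset,
        "dataset_version": dataset_version,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "recorded_on": date.today().isoformat(),
        "notes": notes,
    }


record = build_run_record("MIMIC-III", "1.4", notes="datathon cohort extraction")
record_json = json.dumps(record, indent=2, sort_keys=True)

# Print here; in a project, write this to a file committed with the code.
print(record_json)
```

Committing a record like this with the code means the data version travels with the analysis rather than living only in the readme.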

# And finally...

- Learn something
- Get to know your collaborators
- Enjoy yourself!

> A “datathon” model to support cross-disciplinary collaboration. Science Translational Medicine (2016). [DOI: 10.1126/scitranslmed.aad9072](http://dx.doi.org/10.1126/scitranslmed.aad9072)

![datathon](./images/datathon.jpg)