# How are MIMIC and eICU-CRD used?

### MIT Laboratory for Computational Physiology (MIT-LCP)
#### Tuesday 19 June 2018

# Overview

- datasets
- courses
- events
- research
- success
- challenges

# Datasets

# MIMIC-III

- Comprehensive, freely-available dataset. Sourced from Beth Israel Deaconess Medical Center.
- \>60,000 ICU stays, \>40,000 patients (2002-2012)
- More data coming soon (2012-2018, OR, ED, Chest X-rays)

![MIMIC website](./images/mimicwebsite.png)


# eICU Collaborative Research Database

- Freely-available multi-center dataset. Sourced from Philips eICU Telehealth Program
- Public dataset is \>200,000 stays across \>250 hospitals. Private dataset is ~10x larger
- Wide variation in data quality between hospitals

![eICU-CRD website](./images/eicuwebsite.png)


# More datasets on PhysioNet

- Physiological datasets published on http://physionet.org (in process of rebuild!)
- Emphasis on waveforms (ECG, EEG, etc)
- 3 levels of access: Public; PhysioNet account; Credentialed user

![PhysioNet](./images/physionet.png)


# Technical notes

# Building

- PostgreSQL
- Testing framework on Continuous Integration server (Jenkins/Travis)
- git for version control
- Issues tracked on private GitHub repository
- Deidentification using custom software (in active development)

# Distributing 

- Gaining access requires signing a Data Use Agreement and taking a free, online course in human subjects research
- Access requests are handled by a lab administrator
- Data typically shared as CSVs with build scripts
- Experimenting with various cloud options
- Updates are by formal versioned releases
- Web-based "[Querybuilder](https://mimic.physionet.org/tools/querybuilder/)" for new users

![MIMIC Data Use Agreement](./images/mimicdua.png)
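
As a rough illustration of the "CSVs with build scripts" approach, the sketch below loads a CSV into a database table and runs one simple build check. It is a minimal sketch under stated assumptions, not the MIMIC or eICU-CRD build scripts: it uses Python's built-in `sqlite3` instead of PostgreSQL so that it runs without a server, and the `stays` table, its columns, and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a distributed data file.
SAMPLE_CSV = """stay_id,patient_id,los_days
1,100,2.5
2,101,7.0
3,100,1.25
"""


def build_database(csv_text):
    """Create an in-memory database and load the CSV rows into one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stays ("
        "stay_id INTEGER PRIMARY KEY, "
        "patient_id INTEGER NOT NULL, "
        "los_days REAL NOT NULL)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (int(r["stay_id"]), int(r["patient_id"]), float(r["los_days"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO stays VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def check_build(conn, expected_rows):
    """Simple build test: the loaded row count matches the shipped file."""
    (n_rows,) = conn.execute("SELECT COUNT(*) FROM stays").fetchone()
    return n_rows == expected_rows


conn = build_database(SAMPLE_CSV)
build_ok = check_build(conn, expected_rows=3)
```

A check like `check_build` is the kind of assertion a Continuous Integration server could run after each build; a real test suite would cover far more than row counts.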


# Supporting reuse

- Code repositories (e.g. MIMIC Code Repository)
- Tools to facilitate analysis (Python WFDB Toolbox; tableone)
- Data mapping efforts (e.g. OMOP)


![MIMIC Code Repository](./images/mimiccoderepo.png)

# Courses


# HST-953: Collaborative Data Science in Medicine

- Run in 2016 and 2017, Sept-Dec. ~30 participants each year

> Clinicians face difficult treatment decisions in contexts that are not well addressed by available evidence. The digitization of medicine provides an opportunity to find solutions to previously ambiguous questions. This course covers material from clinical epidemiology, biostatistics and machine learning as applied to electronic health record data.

![HST953](./images/hst953.jpg)

# HST-953 Syllabus

1. (a) Introduction
   (b) Machine Learning that Matters
   (c) Software Setup

2. (a) Observational Data
   (b) Project Pitches
   (c) Introduction to SQL

3. (a) Formulating a Question
   (b) Defining a Cohort
   (c) Intermediate SQL

4. (a) Hospital Panel: Sources of Data
   (b) Data Preparation
   (c) Text & NLP

5. (a) Reproducibility
   (b) Preprocessing; Missing Data
   (c) Data visualization

6. (a) Noise and Outliers
   (b) Regression
   (c) Exploratory Data Analysis

7. (a) Predictive Analytics
   (b) Data Analysis
   (c) Prediction

8. (a) Advanced Data Analysis Approaches
   (b,c) Propensity Scores

9. (a) Blood Pressure and AKI
   (b) Validation and Sensitivity
   (c) Project Mentoring

10. (a) Trend Analysis
    (b) Instrumental Variables
    (c) Project Mentoring

11. Final Project Presentations

# \>20 courses worldwide using MIMIC

- Columbia University (Noémie Elhadad). [BINF G4002 Methods II: Computational Methods in Biomedical Informatics](https://www.dbmi.columbia.edu/for-current-staff-students/courses/)
- Georgia Tech (Jimeng Sun). [CSE8803 Big Data Analytics for Healthcare](http://www.sunlab.org/teaching/cse8803/)
- Stanford (Nigam Shah). [BIOMEDIN 215 Data Driven Medicine](http://shahlab.stanford.edu/biomedin215)
- University of Texas at Austin (Joydeep Ghosh). [EE 381V: Advanced Data Mining](http://hercules.ece.utexas.edu/ghosh/bdah-f15.htm)
- Berkeley. [Data Science W266: Natural Language Processing with Deep Learning](https://www.ischool.berkeley.edu/courses/datasci/266)
- Ludwig Maximilian University of Munich. [Recent Developments in Biostatistics (elective for "Clinical Epidemiology" MSc)](http://www.en.msc-epidemiologie.med.uni-muenchen.de/msc/programme/modules/index.html)
- Yale Center for Medical Informatics (Cynthia Brandt & Kei-Hoi Cheung). CBB750 Core Topics in Biomedical Informatics and Data Science.
- George Washington University (L Davidson). [INFR 6101. Principles of Medical Informatics](https://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201701&subjId=INFR)
- Seville University (José Troyano). [Applied Data Mining](http://www.us.es/estudios/grados/plan_226/asignatura_2260060)
- Imperial College London (Laure de Preux). [BS1812 Healthcare and Medical Analytics](http://www.imperial.ac.uk/people/l.depreux/teaching.html)
- Udacity. [Health Informatics in the Cloud](https://www.udacity.com/course/health-informatics-in-the-cloud--ud809#)

# Events

# Datathons

- Create interdisciplinary teams to work together on real clinical problems
- Run in collaboration with UK Intensive Care Society, Australia and NZ Intensive Care Society, etc
- Raise awareness. Build partnerships. Work towards integrated, international datasets

> A “datathon” model to support cross-disciplinary collaboration. Science Translational Medicine (2016). [DOI: 10.1126/scitranslmed.aad9072](http://dx.doi.org/10.1126/scitranslmed.aad9072)

![datathon](./images/datathon.jpg)

# Competitions

![IEEE Challenge](./images/ieee-challenge.png)

# Research and projects

# Global Open Source Severity of Illness Score (GOSSIS)

- How GOSSIS differs from APACHE-IV:
    - Uses more heterogeneous data.
    - Uses min/max (mostly) instead of 'worst'.
    - Does not bin predictors.
    - Integrates other sources of information for missing data, instead of assuming 'normal'.
    - Takes a hierarchical approach for diagnoses

![GOSSIS](./images/gossis.png)
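
To make the min/max versus 'worst' distinction concrete, here is a minimal sketch. It assumes that 'worst' means the single reading furthest from a reference normal value, and that the min/max approach keeps both extremes as separate, unbinned predictors. The function names, the reference value, and the readings are illustrative assumptions, not GOSSIS or APACHE-IV definitions.

```python
def worst_value(readings, normal):
    """'Worst' summary: the single reading furthest from a reference value."""
    return max(readings, key=lambda x: abs(x - normal))


def min_max_features(readings):
    """Min/max summary: keep both extremes as separate continuous predictors."""
    return {"min": min(readings), "max": max(readings)}


# Hypothetical first-day heart rates (bpm) for one patient.
heart_rates = [58, 72, 95, 110]
ASSUMED_NORMAL_HR = 75  # illustrative reference value only

worst = worst_value(heart_rates, ASSUMED_NORMAL_HR)
features = min_max_features(heart_rates)
```

With these readings the 'worst' summary keeps only the high extreme, while the min/max summary also retains the low one.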

# Evaluating sepsis criteria

- Assessed five methods of identifying sepsis in electronic health records, and found that they identified cohorts differing in size and in severity of illness, as measured by in-hospital mortality rate
- Future studies on sepsis should recognize the differences in outcome incidence among identification methods and contextualize their findings according to the different cohorts identified.

![Sepsis-3](./images/sepsis3.png)

> Johnson AEW, Aboab J, Raffa J, Pollard T, Deliberato R, Celi LA, Stone D. A Comparative Analysis of Sepsis Identification Methods in an Electronic Database. Critical Care Medicine (2018). http://dx.doi.org/10.1097/CCM.0000000000002965
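
The comparison described on this slide can be sketched as follows: for each identification method, count the admissions it flags and compute the in-hospital mortality rate within that cohort. This is a minimal sketch, not the paper's code; the method names (`method_a`, `method_b`), the record layout, and the data are hypothetical.

```python
# Hypothetical admissions: which methods flagged each one, and the outcome.
admissions = [
    {"id": 1, "flags": {"method_a", "method_b"}, "died": True},
    {"id": 2, "flags": {"method_a"}, "died": False},
    {"id": 3, "flags": {"method_b"}, "died": False},
    {"id": 4, "flags": {"method_a"}, "died": False},
    {"id": 5, "flags": set(), "died": False},
]


def summarize(records, method):
    """Cohort size and in-hospital mortality rate for one method."""
    cohort = [r for r in records if method in r["flags"]]
    deaths = sum(r["died"] for r in cohort)
    rate = deaths / len(cohort) if cohort else float("nan")
    return {"n": len(cohort), "mortality": rate}


summary_a = summarize(admissions, "method_a")
summary_b = summarize(admissions, "method_b")
```

Even in this toy data the two methods select cohorts of different size and mortality, which is the effect the study asks readers to account for.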

# Reproducibility in critical care

- Reviewed 28 published mortality prediction models using MIMIC.
- Attempted to reproduce the cohorts (+ bonus, compared performance to logistic regression).
- 75% of reproduced cohorts differed by >1000 patients. Percent mortality differed by up to ~20%. 

![MLHC results](./images/mlhc-paper-results.png)

> Johnson AEW, Pollard T, Mark RG. Reproducibility in critical care: a mortality prediction case study. Proceedings of the 2nd Machine Learning for Healthcare Conference (2017).
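
A statistic such as "share of reproduced cohorts differing by more than 1000 patients" can be computed as in the sketch below. The function name and the cohort sizes are made up for illustration; they are not values from the paper.

```python
def fraction_differing(pairs, threshold):
    """Share of studies whose reproduced cohort size differs from the
    reported size by more than `threshold` patients."""
    if not pairs:
        raise ValueError("need at least one study")
    n_over = sum(
        1 for reported, reproduced in pairs
        if abs(reported - reproduced) > threshold
    )
    return n_over / len(pairs)


# Hypothetical (reported, reproduced) cohort sizes.
example_pairs = [(10000, 12500), (8000, 8200), (15000, 13000), (5000, 5600)]
share = fraction_differing(example_pairs, threshold=1000)
```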

# Success

# Credentialed MIMIC users

![MIMIC users](./images/mimicusers.png)

# Conferences (e.g. of 26 papers at MLHC 2017...)

- [Piecewise-constant parametric approximations for survival learning](http://proceedings.mlr.press/v68/weiss17a/weiss17a.pdf)
- [Continuous State-Space Models for Optimal Sepsis Treatment: a Deep Reinforcement Learning Approach](http://proceedings.mlr.press/v68/raghu17a/raghu17a.pdf)
- [Marked Point Process for Severity of Illness Assessment](http://proceedings.mlr.press/v68/islam17a/islam17a.pdf)
- [Diagnostic Inferencing via Improving Clinical Concept Extraction with Deep Reinforcement Learning: A Preliminary Study](http://proceedings.mlr.press/v68/ling17a/ling17a.pdf)
- [Generating Multi-label Discrete Patient Records using Generative Adversarial Networks](http://proceedings.mlr.press/v68/choi17a/choi17a.pdf)
- [Clinical Intervention Prediction and Understanding with Deep Neural Networks](http://proceedings.mlr.press/v68/suresh17a/suresh17a.pdf)
- [Reproducibility in critical care: a mortality prediction case study](http://proceedings.mlr.press/v68/johnson17a/johnson17a.pdf)

![MLHC 2017](./images/MLHC.png)


# Reuse in publications

- ~277 citations on Google Scholar to "MIMIC-III, a freely accessible critical care database. Scientific data (2016)".

![MIMIC-III citations](./images/mimiccitations.png)



# Challenges

# Challenges

- Process for handling data requests is cumbersome.
- Building databases is time-consuming and requires expert knowledge.
- Data is complex and easily misinterpreted.
- Tracking reuse and demonstrating impact is non-trivial.
- How do we release more data; support the community; and do research?

![MIMIC on Reddit](./images/reddit.png)