# Data Description

## Data collection

The MIMIC III (Medical Information Mart for Intensive Care) dataset is used in our project. MIMIC III is a single-center database of patient admissions to critical care units. A full description is available at [physionet.org](https://physionet.org/content/mimiciii/1.4/); the database incorporates several data sources, including:

- critical care information systems:
    - Two different systems: Philips **CareVue** and iMDsoft **MetaVision** ICU. Most of the data from the two systems were merged, except for fluid intake. Data that were not merged are given a suffix to distinguish the source, e.g. "*CV*" for CareVue and "*MV*" for MetaVision. These two systems provided clinical data including:
        - physiological measurements, e.g. heart rate, arterial blood pressure, or respiratory rate;
        - documented progress notes by care providers;
        - drip medications and fluid balances.
- hospital electronic health records:
    - patient demographics
    - in-hospital mortality
    - laboratory test results, including hematology, chemistry and microbiology results.
    - discharge summaries
    - reports of electrocardiogram and imaging studies.
    - **billing-related information** such as International Classification of Disease, 9th Edition (ICD-9) codes, Diagnosis Related Group (DRG) codes, and Current Procedural Terminology (CPT) codes.
- Social Security Administration Death Master File.
    - Out-of-hospital mortality dates

## De-identification Processes

The data were de-identified using *structured data cleansing* and *date shifting*.

1. Patient names, telephone numbers, and addresses were removed.
2. Dates were shifted into the future, with intervals preserved. Time of day, day of the week, and approximate seasonality were conserved.
3. **Patients older than 89 years appear with ages of over 300 years.**
4. Protected health information was removed from free-text fields, such as diagnostic reports and physician notes.
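
The date-shifting step can be illustrated with a minimal sketch. It is not the procedure actually used for MIMIC III: the 100 to 200 year range, the function names `make_offset` and `shift`, and the sample timestamps are all assumptions made for illustration. The idea is that a single per-patient offset, rounded to a whole number of weeks and close to a whole number of years, keeps intervals, time of day, day of the week, and approximate season intact.

```python
from datetime import datetime, timedelta
import random

DAYS_PER_YEAR = 365.2425  # mean Gregorian year length


def make_offset(rng, min_years=100, max_years=200):
    """Pick a hypothetical per-patient shift: about N years, in whole weeks."""
    years = rng.randint(min_years, max_years)
    weeks = round(years * DAYS_PER_YEAR / 7)  # whole weeks keep the weekday
    return timedelta(weeks=weeks)


def shift(timestamp, offset):
    """Apply the same offset to every timestamp belonging to one patient."""
    return timestamp + offset


rng = random.Random(0)  # fixed seed so the sketch is reproducible
offset = make_offset(rng)

admit = datetime(2012, 3, 5, 14, 30)  # made-up admission time
disch = datetime(2012, 3, 9, 10, 0)   # made-up discharge time
shifted_admit = shift(admit, offset)
shifted_disch = shift(disch, offset)

# The length of stay is unchanged because both dates moved by the same amount.
print(shifted_disch - shifted_admit == disch - admit)
```

Because the offset is constant within a patient, any duration computed from shifted dates (for example, length of stay) equals the true duration.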

## Data Description

- **A relational database** consisting of 26 tables linked by identifiers which usually have the suffix ‘ID’, e.g. 
    - SUBJECT_ID (a unique patient)
    - HADM_ID (a unique admission to the hospital), and 
    - ICUSTAY_ID (a unique admission to an intensive care unit)
- Five tables track **patient stays**: ADMISSIONS; PATIENTS; ICUSTAYS; SERVICES; and TRANSFERS
- Another five tables are **dictionaries** (prefixed with ‘D_’): D_CPT; D_ICD_DIAGNOSES; D_ICD_PROCEDURES; D_ITEMS; and D_LABITEMS. 
    - Dictionary tables provide definitions for identifiers. E.g. each ITEMID in CHARTEVENTS is explained in D_ITEMS, which describes the concept measured.
- The remaining tables contain data associated with patient care, such as physiological measurements, caregiver observations, and billing information.
- ‘Events’ tables record a series of charted events such as notes, laboratory tests, and fluid balance, e.g. the OUTPUTEVENTS table holds all measurements related to output for a given patient, and the LABEVENTS table holds laboratory test results for a patient.
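
The relationship between an events table and its dictionary table can be sketched as a simple lookup. The rows below are toy data, not real MIMIC III records: the ITEMID values, labels, and measurements are made up, and the column names `LABEL` and `VALUENUM` are assumed for illustration. The helper `label_events` is likewise hypothetical.

```python
# Toy stand-in for D_ITEMS: ITEMID -> LABEL (IDs and labels are made up).
d_items = {
    1: "Heart Rate",
    2: "Respiratory Rate",
}

# Toy stand-in for CHARTEVENTS rows (all values are made up).
chartevents = [
    {"SUBJECT_ID": 10, "HADM_ID": 100, "ICUSTAY_ID": 1000, "ITEMID": 1, "VALUENUM": 82.0},
    {"SUBJECT_ID": 10, "HADM_ID": 100, "ICUSTAY_ID": 1000, "ITEMID": 2, "VALUENUM": 18.0},
    {"SUBJECT_ID": 11, "HADM_ID": 101, "ICUSTAY_ID": 1001, "ITEMID": 9, "VALUENUM": 5.0},
]


def label_events(events, dictionary):
    """Attach the dictionary label to each event; unknown ITEMIDs get None."""
    return [{**row, "LABEL": dictionary.get(row["ITEMID"])} for row in events]


labelled = label_events(chartevents, d_items)
for row in labelled:
    print(row["HADM_ID"], row["ITEMID"], row["LABEL"], row["VALUENUM"])
```

Keeping unmatched rows (with a `None` label) rather than dropping them makes it easy to spot ITEMIDs that are missing from the dictionary.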

# Variables of Interest (Variable Definition)

## ID Variables
- SUBJECT_ID: ID variable obtained from *ADMISSIONS.csv*. One unique subject id was assigned to each patient.
- HADM_ID: ID variable obtained from *ADMISSIONS.csv*. **Primary key**. One unique hospital admission ID (HADM_ID) was assigned to each hospitalization of a patient, so multiple HADM_IDs may correspond to the same person.
- ADMITTIME: datetime variable obtained from *ADMISSIONS.csv*. This variable records the time when the hospitalization started.
- DISCHTIME: datetime variable obtained from *ADMISSIONS.csv*. This variable records the time when the hospitalization ended.
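
Since HADM_ID serves as the primary key while SUBJECT_ID may repeat, a quick sanity check on the admissions data can be sketched as follows. The rows are made up and the helper `check_keys` is a hypothetical name; in practice the rows would come from *ADMISSIONS.csv*.

```python
from collections import Counter

# Made-up rows standing in for ADMISSIONS.csv.
admissions = [
    {"SUBJECT_ID": 10, "HADM_ID": 100},
    {"SUBJECT_ID": 10, "HADM_ID": 101},
    {"SUBJECT_ID": 11, "HADM_ID": 102},
]


def check_keys(rows):
    """Return duplicated HADM_IDs and the number of admissions per patient."""
    hadm_counts = Counter(row["HADM_ID"] for row in rows)
    duplicates = [hadm_id for hadm_id, n in hadm_counts.items() if n > 1]
    admissions_per_patient = Counter(row["SUBJECT_ID"] for row in rows)
    return duplicates, admissions_per_patient


duplicates, admissions_per_patient = check_keys(admissions)
print("duplicated HADM_IDs:", duplicates)
print("admissions per patient:", dict(admissions_per_patient))
```

An empty list of duplicates confirms that HADM_ID can safely be used as the key when joining the other tables.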

## Output:
- HOSPITAL_EXPIRE_FLAG: Boolean variable obtained from *ADMISSIONS.csv*. This variable indicates whether the patient passed away during hospitalization.

## Input:
- ETHNICITY: categorical variable obtained from *ADMISSIONS.csv*. **Demographic information**.
- DIAGNOSIS: string variable obtained from *ADMISSIONS.csv*.
- GENDER: categorical variable obtained from *PATIENTS.csv* via joining on SUBJECT_ID. **Demographic information**. 
- DOB: datetime variable obtained from *PATIENTS.csv* via joining on SUBJECT_ID. **Demographic information**. 
- AGE_ON_AD: numeric variable derived from DOB and ADMITTIME. **Demographic information**. This variable represents the patient's age at admission. An age greater than 300 indicates that the patient was older than 89.
- SERVICES: multi-valued categorical variable obtained from *SERVICES.csv* via joining on SUBJECT_ID and HADM_ID. This variable indicates services that the patients utilized during hospitalization.
- ICU_STAY_DAYS: numeric variable derived from *ICUSTAYS.csv*. This variable gives the total number of days spent in the ICU during this hospitalization.
- MULTI_ENTRY_ICU: Boolean variable derived from *ICUSTAYS.csv*. MULTI_ENTRY_ICU is set to True if the patient entered the ICU before this hospitalization.
- ICD9_code: ten Boolean variables derived from *PROCEDURES_ICD.csv* via joining on SUBJECT_ID and HADM_ID. Each variable indicates whether one of the ten most frequent procedures was recorded for the hospitalization.
- TOTAL_ITEMID_code: twenty integer variables derived from *LABEVENTS.csv* via joining on SUBJECT_ID and HADM_ID. These variables cover the twenty most frequent lab tests during hospitalization. TOTAL_ITEMID_code indicates how many times the lab test was carried out in each hospitalization.
- ABNORMAL_ITEMID_code: twenty integer variables derived from *LABEVENTS.csv* via joining on SUBJECT_ID and HADM_ID. Unlike TOTAL_ITEMID_code, ABNORMAL_ITEMID_code indicates how many lab tests of this kind were abnormal.
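
Two of the derived inputs, AGE_ON_AD and the per-admission lab counts, can be sketched as follows. This is an illustration under stated assumptions, not the project's actual feature code: the timestamp format, the `FLAG` column with the value `"abnormal"`, the helper names, and all sample rows are assumed or made up.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout in the CSV files


def age_on_admission(dob, admittime):
    """Whole years between DOB and ADMITTIME (both strings in TIME_FORMAT)."""
    born = datetime.strptime(dob, TIME_FORMAT)
    admitted = datetime.strptime(admittime, TIME_FORMAT)
    had_birthday = (admitted.month, admitted.day) >= (born.month, born.day)
    return admitted.year - born.year - (0 if had_birthday else 1)


def most_frequent_itemids(labevents, n):
    """ITEMIDs of the n most frequently ordered lab tests."""
    counts = Counter(row["ITEMID"] for row in labevents)
    return {itemid for itemid, _ in counts.most_common(n)}


def lab_counts(labevents, itemids):
    """Per-admission TOTAL_ and ABNORMAL_ counts for the chosen ITEMIDs."""
    features = {}
    for row in labevents:
        if row["ITEMID"] not in itemids:
            continue
        counts = features.setdefault(row["HADM_ID"], Counter())
        counts[f"TOTAL_ITEMID_{row['ITEMID']}"] += 1
        if row["FLAG"] == "abnormal":  # assumed marker for abnormal results
            counts[f"ABNORMAL_ITEMID_{row['ITEMID']}"] += 1
    return features


# Made-up rows standing in for LABEVENTS.csv.
labevents = [
    {"HADM_ID": 100, "ITEMID": 1, "FLAG": "abnormal"},
    {"HADM_ID": 100, "ITEMID": 1, "FLAG": ""},
    {"HADM_ID": 100, "ITEMID": 2, "FLAG": ""},
    {"HADM_ID": 101, "ITEMID": 1, "FLAG": ""},
    {"HADM_ID": 101, "ITEMID": 2, "FLAG": "abnormal"},
    {"HADM_ID": 101, "ITEMID": 3, "FLAG": "abnormal"},
]

top_items = most_frequent_itemids(labevents, n=2)  # the document uses n=20
features = lab_counts(labevents, top_items)

age = age_on_admission("2050-06-15 00:00:00", "2120-03-01 08:00:00")
old_age = age_on_admission("1815-03-01 00:00:00", "2120-03-01 08:00:00")
is_over_89 = old_age > 300  # shifted DOBs produce implausibly large ages

print(age, old_age, is_over_89)
print({hadm_id: dict(counts) for hadm_id, counts in features.items()})
```

Using `Counter` means a test that never occurred in an admission reads as 0 rather than raising an error, which matches the integer-count definition of TOTAL_ITEMID_code and ABNORMAL_ITEMID_code.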