<a href="https://colab.research.google.com/github/Aditya-2204/AI-Discovery-of-Unknown-Multi-Disease-Phenotypes-in-Longitudinal-Electronic-Health-Records/blob/master/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
import pandas as pd
import numpy as np
import os
import pickle

URL_DATA = '/content/drive/MyDrive/AI-Discovery-of-Unknown-Multi-Disease-Phenotypes-in-Longitudinal-Electronic-Health-Records/physionet.org/files/mimic-iv-demo/2.2/'
URL_PROJECT = '/content/drive/MyDrive/AI-Discovery-of-Unknown-Multi-Disease-Phenotypes-in-Longitudinal-Electronic-Health-Records/'

I. Core Patient and Admission Tables (from `/hosp`):

These tables provide the foundational demographic and administrative information needed to identify unique patients and their hospital encounters.

`patients.csv.gz`:

* `subject_id`: Unique patient identifier (critical for linking all other tables).
* `gender`: Male/Female.
* `anchor_age`, `anchor_year`, `anchor_year_group`: De-identified age and year information. Used to infer approximate real age and hospitalization year while preserving privacy.
* `dod` (Date of Death): If available, an important outcome variable.
 * Purpose: Demographics, patient linking, age calculation, mortality outcome.

`admissions.csv.gz`:

* `subject_id`, `hadm_id`: Patient identifier and unique hospital admission identifier (`hadm_id` links to all hospital-level events).
* `admittime`, `dischtime`: Admission and discharge timestamps.
* `deathtime`: Time of death during hospitalization (if applicable).
* `admission_type`, `admission_location`, `discharge_location`: Context of admission and discharge.
* `hospital_expire_flag`: Binary flag indicating in-hospital death.
* `insurance`, `language`, `marital_status`, `race`: Socio-demographic context.
 * Purpose: Define hospital stays, calculate Length of Stay (LOS), identify readmissions, capture admission context, patient-level outcomes.
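
As a rough illustration of the LOS, readmission, and age calculations mentioned above, the sketch below works on tiny made-up frames that stand in for `patients` and `admissions`. The helper name `add_stay_features` and the derived column names are hypothetical, and the age formula (anchor age plus years elapsed since the anchor year) is only an approximation.

```python
import pandas as pd

# Toy rows standing in for patients.csv.gz and admissions.csv.gz (values are made up).
patients = pd.DataFrame({
    "subject_id": [1, 2],
    "anchor_age": [60, 45],
    "anchor_year": [2150, 2130],
})
admissions = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "hadm_id": [10, 11, 20],
    "admittime": pd.to_datetime(
        ["2150-03-01 08:00", "2152-06-10 12:00", "2131-01-05 00:00"]),
    "dischtime": pd.to_datetime(
        ["2150-03-04 20:00", "2152-06-11 12:00", "2131-01-07 12:00"]),
})


def add_stay_features(admissions, patients):
    """Hypothetical helper: add LOS, approximate age, and gap since last discharge."""
    df = admissions.merge(patients, on="subject_id", how="left")
    df = df.sort_values(["subject_id", "admittime"]).reset_index(drop=True)
    # Length of stay in fractional days.
    df["los_days"] = (df["dischtime"] - df["admittime"]).dt.total_seconds() / 86400
    # Approximate age at admission: anchor_age plus years elapsed since anchor_year.
    df["age_at_admit"] = df["anchor_age"] + (df["admittime"].dt.year - df["anchor_year"])
    # Days since the same patient's previous discharge (NaN for a first admission).
    prev_disch = df.groupby("subject_id")["dischtime"].shift(1)
    df["days_since_prev_discharge"] = (
        (df["admittime"] - prev_disch).dt.total_seconds() / 86400)
    return df


stays = add_stay_features(admissions, patients)
print(stays[["hadm_id", "los_days", "age_at_admit", "days_since_prev_discharge"]])
```

A readmission flag could then be derived by thresholding `days_since_prev_discharge` (for example at 30 days), depending on the cohort definition.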

`transfers.csv.gz`:

* `subject_id`, `hadm_id`, `transfer_id`: Identifiers for transfer events within the hospital.
* `intime`, `outtime`: Transfer in and out times for specific hospital units/wards.
* `careunit`: The care unit the patient was transferred to.
 * Purpose: Detailed longitudinal tracking of patient movement within the hospital, especially between different wards.

II. Clinical Data for Phenotype Discovery (primarily from `/hosp` and `/icu`):

This is where the rich, longitudinal clinical data for identifying multi-disease phenotypes resides.

`diagnoses_icd.csv.gz` (from `/hosp`):

* `subject_id`, `hadm_id`, `seq_num`: Patient, admission, and sequence number of diagnosis (important for primary diagnosis).
* `icd_code`, `icd_version`: ICD (International Classification of Diseases) codes (both ICD-9 and ICD-10 in MIMIC-IV).
 * Purpose: Core for defining diseases and comorbidities. You'll need to map these codes to higher-level disease categories (e.g., Clinical Classifications Software (CCS) categories or custom groupings) to abstract phenotypes.
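
One simple custom grouping, sketched below, truncates each code to its leading category characters and then builds a per-admission indicator matrix. This is not CCS, only a stand-in. The function name `icd_category` and the toy rows are hypothetical, and the sketch assumes codes are stored without the decimal point.

```python
import pandas as pd

# Toy rows standing in for diagnoses_icd.csv.gz (values are illustrative).
diagnoses = pd.DataFrame({
    "hadm_id": [10, 10, 10, 11, 11],
    "icd_code": ["I5021", "E119", "I5033", "4280", "25000"],
    "icd_version": [10, 10, 10, 9, 9],
})


def icd_category(code, version):
    """Hypothetical grouping: keep the leading category characters of an ICD code."""
    code = str(code).strip().upper()
    # ICD-9 external-cause codes (E800 and similar) use a four-character category;
    # all other ICD-9 and ICD-10 codes use three characters.
    width = 4 if (version == 9 and code.startswith("E")) else 3
    # Prefix with the version so ICD-9 and ICD-10 categories never collide.
    return f"ICD{version}_{code[:width]}"


diagnoses["category"] = [
    icd_category(code, version)
    for code, version in zip(diagnoses["icd_code"], diagnoses["icd_version"])
]

# One row per admission, one 0/1 column per diagnosis category.
multi_hot = (pd.crosstab(diagnoses["hadm_id"], diagnoses["category"]) > 0).astype(int)
print(multi_hot)
```

Here both heart-failure codes for admission 10 collapse into the single column `ICD10_I50`, which is the kind of abstraction the mapping step is meant to provide.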

`d_icd_diagnoses.csv.gz` (from `/hosp`):

* `icd_code`, `long_title`: Provides human-readable descriptions for ICD codes.
 * Purpose: To interpret and categorize `diagnoses_icd` codes.

`labevents.csv.gz` (from `/hosp`):

* `subject_id`, `hadm_id`, `itemid`: Identifiers.
* `charttime`: Timestamp of the lab test.
* `valuenum`, `valueuom`: Numerical result and unit of measurement.
* `flag`: Abnormal flag.
 * Purpose: Longitudinal physiological status, identifying abnormal trends, and creating features like `mean`, `min`, `max`, standard deviation, and trends of various biomarkers (e.g., creatinine, white blood cell count, sodium, potassium).
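
The aggregation described above could look roughly like the following. The data frame is a made-up stand-in for `labevents`, the `itemid` value is a placeholder rather than a real lab identifier, and `slope_per_day` is a hypothetical helper that fits a least-squares line to each lab series.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for labevents.csv.gz; itemid 1 is a placeholder.
labs = pd.DataFrame({
    "hadm_id": [10, 10, 10, 10, 11, 11, 11],
    "itemid": [1] * 7,
    "charttime": pd.to_datetime([
        "2150-03-01", "2150-03-02", "2150-03-03", "2150-03-04",
        "2152-06-10", "2152-06-11", "2152-06-12",
    ]),
    "valuenum": [1.0, 1.5, 2.0, 2.5, 1.2, 1.1, 1.0],
})


def slope_per_day(group):
    """Least-squares slope of valuenum against time in days (NaN if under 2 time points)."""
    t = (group["charttime"] - group["charttime"].min()).dt.total_seconds() / 86400
    if t.nunique() < 2:
        return np.nan
    return np.polyfit(t.to_numpy(), group["valuenum"].to_numpy(), 1)[0]


labs = labs.dropna(subset=["valuenum"])
keys = ["hadm_id", "itemid"]

# Summary statistics per admission and lab item.
stats = labs.groupby(keys)["valuenum"].agg(["mean", "min", "max", "std", "count"])
# Trend feature, aligned to the same (hadm_id, itemid) index.
stats["slope_per_day"] = labs.groupby(keys)[["charttime", "valuenum"]].apply(slope_per_day)

# Pivot to one row per admission with one column per (item, statistic) pair.
wide = stats.unstack("itemid")
wide.columns = [f"item{item}_{stat}" for stat, item in wide.columns]
print(wide)
```

Restricting `labs` to a time window (for example the first 24 hours) before grouping gives period-specific versions of the same features.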

`d_labitems.csv.gz` (from `/hosp`):

* `itemid`, `label`, `fluid`, `category`: Descriptions for `labevents` `itemid` values.
 * Purpose: To understand and filter specific lab tests.

`chartevents.csv.gz` (from `/icu`):

* `subject_id`, `hadm_id`, `stay_id`: Identifiers (`stay_id` is the ICU stay key in MIMIC-IV).
* `charttime`: Timestamp of the observation.
* `itemid`: Identifier for the type of observation (e.g., heart rate, blood pressure).
* `valuenum`, `valueuom`: Numerical value and unit.
 * Purpose: High-resolution longitudinal data on vital signs, physical assessments, and clinical scores (e.g., GCS, RASS). Crucial for capturing acute changes and patient trajectory within the ICU.

`d_items.csv.gz` (from `/icu`):

* `itemid`, `label`, `category`, `unitname`: Descriptions for the `itemid` values used in `chartevents` and the other ICU event tables.
 * Purpose: To understand and filter specific vital signs and chart observations.

`prescriptions.csv.gz` (from `/hosp`):

* `subject_id`, `hadm_id`, `starttime`, `stoptime`: Patient, admission, and the prescribed start and stop times of the medication.
* `drug`, `dose_val_rx`, `dose_unit_rx`: Medication name, dosage, and unit.
* `route`: Administration route.
 * Purpose: Medication history, identifying drug classes, polypharmacy, and specific treatments associated with certain phenotypes.
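
A minimal sketch of a medication-count feature follows, using made-up rows in place of `prescriptions`. The threshold of five distinct drugs for the polypharmacy flag is an assumption chosen for illustration, as is the simple lower-casing used to normalize drug names.

```python
import pandas as pd

# Toy rows standing in for prescriptions.csv.gz (values are made up).
prescriptions = pd.DataFrame({
    "hadm_id": [10, 10, 10, 11],
    "drug": ["Furosemide", "furosemide ", "Metoprolol", "Insulin"],
})

# Normalize free-text drug names so trivial variants are not double counted.
drug_norm = prescriptions["drug"].str.strip().str.lower()

# Number of distinct drugs ordered during each admission.
n_unique_drugs = (
    drug_norm.groupby(prescriptions["hadm_id"]).nunique().rename("n_unique_drugs"))

# Assumed threshold: flag admissions with five or more distinct drugs.
POLYPHARMACY_THRESHOLD = 5
polypharmacy = (n_unique_drugs >= POLYPHARMACY_THRESHOLD).rename("polypharmacy")

print(pd.concat([n_unique_drugs, polypharmacy], axis=1))
```

Mapping drug names to drug classes would need an external vocabulary, which is outside the scope of this sketch.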

`procedures_icd.csv.gz` (from `/hosp`):

* `subject_id`, `hadm_id`, `icd_code`, `icd_version`: Procedure codes (ICD-9 and ICD-10).
* `seq_num`: Sequence of procedure.
 * Purpose: Identify interventions and treatments, which can be part of a phenotype definition or a response to disease.


`d_icd_procedures.csv.gz` (from `/hosp`):

* `icd_code`, `long_title`: Descriptions for procedure codes.
 * Purpose: To interpret `procedures_icd` codes.

`icustays.csv.gz` (from `/icu`):

* `subject_id`, `hadm_id`, `stay_id`: Patient, admission, and unique ICU stay identifier.
* `intime`, `outtime`: ICU admission and discharge times.
* `first_careunit`, `last_careunit`: Specific ICU units (e.g., MICU, SICU, CCU).
* `los`: Length of stay in the ICU, in fractional days.
 * Purpose: Define ICU episodes, calculate ICU LOS, track unit transfers within critical care.

`inputevents.csv.gz` / `outputevents.csv.gz` (from `/icu`):

* `subject_id`, `hadm_id`, `stay_id`: Identifiers.
* `starttime`, `endtime` (`inputevents`) / `charttime` (`outputevents`): Timestamps.
* `itemid`: Type of input/output.
* `amount`, `amountuom` (`inputevents`) / `value`, `valueuom` (`outputevents`): Quantity and unit.
 * Purpose: Fluid balance, medication infusions (`inputevents`), urine output, drain outputs (`outputevents`). Can reflect renal function, hydration status, etc.

III. Less Commonly Used but Potentially Useful Tables (for advanced phenotyping):

`drgcodes.csv.gz` (from `/hosp`):

* Diagnosis Related Group (DRG) codes, used for billing and resource utilization. Can be an outcome or summary feature.

`microbiologyevents.csv.gz` (from `/hosp`):

* Culture results, identification of pathogens, and antibiotic susceptibilities. Crucial for infectious disease phenotypes.

`pharmacy.csv.gz` (from `/hosp`):

* Detailed pharmacy orders (often overlaps with prescriptions).

`poe.csv.gz` / `poe_detail.csv.gz` (from `/hosp`):

* Provider order entry system data. Can give insight into clinician actions.


`services.csv.gz` (from `/hosp`):

* Hospital services involved in patient care (e.g., Cardiology, General Medicine).

`datetimeevents.csv.gz` (from `/icu`):

* Events with date/time values (e.g., last menstrual period, exam findings).

`procedureevents.csv.gz` (from `/icu`):

* More granular details on procedures performed in the ICU.

# Key Steps in Utilizing this Data for AI Phenotype Discovery:

* Patient Cohort Definition: Define who you are including (e.g., all adults, patients with multiple admissions, first ICU stay).
* Temporal Alignment: Events are linked across tables by `subject_id`, `hadm_id`, and `stay_id`, and ordered in time by `charttime`, `admittime`, `dischtime`, `intime`, and `outtime`. You'll need to align these events chronologically, typically relative to an anchor such as hospital admission or ICU `intime`.
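
One way to sketch the alignment step is to express each event time relative to the ICU `intime` and keep only events inside a fixed window. The frames below are made-up stand-ins for `icustays` and `chartevents`, keyed by `stay_id` (the ICU stay identifier in MIMIC-IV). The helper `events_in_window` and the 24-hour default are assumptions.

```python
import pandas as pd

# Toy rows standing in for icustays.csv.gz and chartevents.csv.gz (values are made up).
icustays = pd.DataFrame({
    "stay_id": [100, 101],
    "intime": pd.to_datetime(["2150-03-01 10:00", "2152-06-10 14:00"]),
})
chartevents = pd.DataFrame({
    "stay_id": [100, 100, 100, 101],
    "itemid": [1, 1, 1, 1],  # placeholder item identifier
    "charttime": pd.to_datetime([
        "2150-03-01 11:00",  # 1 hour after intime: kept
        "2150-03-02 09:00",  # 23 hours after intime: kept
        "2150-03-02 12:00",  # 26 hours after intime: dropped
        "2152-06-10 13:00",  # 1 hour before intime: dropped
    ]),
    "valuenum": [88.0, 92.0, 101.0, 75.0],
})


def events_in_window(events, stays, hours=24):
    """Hypothetical helper: keep events within `hours` after each stay's intime."""
    df = events.merge(stays[["stay_id", "intime"]], on="stay_id", how="inner")
    # Time of each event relative to the start of its ICU stay.
    df["hours_from_intime"] = (df["charttime"] - df["intime"]).dt.total_seconds() / 3600
    in_window = (df["hours_from_intime"] >= 0) & (df["hours_from_intime"] < hours)
    return df.loc[in_window].sort_values(["stay_id", "charttime"]).reset_index(drop=True)


first_day = events_in_window(chartevents, icustays, hours=24)
print(first_day[["stay_id", "charttime", "hours_from_intime", "valuenum"]])
```

The same pattern applies to hospital-level tables by merging on `hadm_id` and anchoring on `admittime` instead.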

## Feature Engineering:

* Static Features: Age, gender, ethnicity, admission type, primary diagnosis.
* Aggregated Longitudinal Features: For time-series data (labs, vitals, medications):
  * Summary Statistics: Mean, median, min, max, standard deviation, variance of values over a specific period (e.g., first 24 hours of ICU stay, entire hospital stay).
  * Trends: Slope of values over time.
  * Frequency/Counts: Number of abnormal lab values, number of unique medications, number of procedures.
* Categorical Encoding: One-hot encode diagnoses, procedures, medication classes.
* Time-Series Representation: For more advanced models (RNNs, LSTMs), you might keep the raw time-series of key lab values or vital signs, potentially binned into time intervals.
* Missing Data Imputation: Handle NaN values (e.g., median imputation, forward fill, or more sophisticated methods like MICE).

## Clustering and Interpretation:

* Dimensionality Reduction/Clustering: Once features are extracted, apply techniques like PCA, UMAP, or t-SNE for visualization and dimensionality reduction, followed by clustering algorithms (K-Means, Hierarchical Clustering, DBSCAN, Gaussian Mixture Models) to identify the "unknown multi-disease phenotypes."
* Phenotype Characterization: After clustering, analyze the clinical characteristics of each cluster (e.g., what are the most common diagnoses, lab abnormalities, medications, or outcomes within each identified group?) to interpret and name your discovered phenotypes.

By systematically extracting, processing, and integrating data from these MIMIC-IV tables, you can create a rich feature set suitable for AI-driven discovery of complex patient phenotypes.
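
To make the last few steps concrete, here is a dependency-light sketch on synthetic data: median imputation, standardization, PCA via SVD, a small k-means loop, and a per-cluster profile. The feature names and values are invented, and the hand-written `kmeans` function is only for illustration. In practice scikit-learn's `PCA` and `KMeans` classes would be the more usual choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic feature matrix with two well-separated groups (names are illustrative).
group_a = rng.normal(loc=[1.0, 140.0, 2.0], scale=0.1, size=(20, 3))
group_b = rng.normal(loc=[3.0, 130.0, 8.0], scale=0.1, size=(20, 3))
features = pd.DataFrame(
    np.vstack([group_a, group_b]),
    columns=["creatinine_mean", "sodium_mean", "n_unique_drugs"],
)
features.iloc[0, 0] = np.nan  # simulate a missing value

# 1. Median imputation, then z-score standardization.
X = features.fillna(features.median())
X = (X - X.mean()) / X.std(ddof=0)

# 2. PCA via SVD of the centered matrix; keep the first two component scores.
U, S, Vt = np.linalg.svd(X.to_numpy(), full_matrices=False)
Z = U[:, :2] * S[:2]
explained = S**2 / np.sum(S**2)  # fraction of variance per component


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    local_rng = np.random.default_rng(seed)
    centers = points[local_rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers, keeping the old one if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# 3. Cluster in the reduced space.
labels = kmeans(Z, k=2)

# 4. Characterize each cluster by its mean feature values on the original scale.
profile = features.assign(cluster=labels).groupby("cluster").mean()
print(explained.round(3))
print(profile)
```

On real data the number of clusters is unknown, so `k` would need to be chosen with a criterion such as silhouette score, and the profile table would be extended with diagnoses, medications, and outcomes.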

In [18]:
# Load the previously pickled hospital and ICU tables ("rb" = read binary).
with open(URL_PROJECT + 'csvs_hosp.pkl', "rb") as f:
  csvs_hosp = pickle.load(f)

with open(URL_PROJECT + 'csvs_icu.pkl', "rb") as f:
  csvs_icu = pickle.load(f)