# Improving Medical Predictions by Irregular Multimodal Electronic Health Records Modeling

# Introduction
This paper aims to address the challenges of handling irregularity and integrating multimodal data in medical prediction tasks.

## Background of the problem
### What type of problem:
The paper focuses on two main problems: mortality prediction and phenotype classification.
### What is the importance/meaning of solving the problem: 
ICUs admit patients with life-threatening conditions. Improving the accuracy and efficiency of predictions by accounting for irregular data in EHRs can help medical providers make faster, more accurate decisions that could save lives.

### What is the difficulty of the problem:
The primary difficulty is handling the irregular sampling of the data and effectively integrating and modeling EHR records such as numerical time series and textual notes, which are recorded at different points in time and at different frequencies.

### The state of the art methods and effectiveness.
For irregular data handling:
> [1] Lipton, Z. C., Kale, D., and Wetzel, R. Directly modeling
> missing data in sequences with rnns: Improved classification of clinical time series. In Machine learning for
> healthcare conference, pp. 253–270. PMLR, 2016.

> [2] Shukla, S. N. and Marlin, B. M. Multi-time attention networks for irregularly sampled time series. arXiv preprint
> arXiv:2101.10318, 2021.

For irregular clinical notes processing:
> [3] Golmaei, S. N. and Luo, X. Deepnote-gnn: predicting hospital readmission using clinical notes and patient network.
> In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics,
> pp. 1–9, 2021.

> [4] Mahbub, M., Srinivasan, S., Danciu, I., Peluso, A., Begoli, E., Tamang, S., and Peterson, G. D.
> Unstructured clinical notes within the 24 hours since admission predict short, mid & long-term mortality in adult icu patients.
> Plos one, 17(1):e0262182, 2022.

## Paper explanation
### What did the paper propose
The paper proposes a better approach to handling the irregular multimodal data found in EHRs, with the goal of improving real-time predictions in ICUs.

### What are the innovations of the method
To better handle irregularity and multimodal data, the paper proposes integrating time series and clinical notes while accounting for their irregularities. It does so as follows:

- Modeling Irregularity in Time Series: 
    - Temporal Discretization-Based Embeddings (TDE): Utilizes a novel unified approach (UTDE) that combines: 
        - Imputation: Regularizes time series by filling in missing values based on prior observations or statistical methods. 
        - Discretized Multi-Time Attention (mTAND): Applies a learned interpolation method using a multi-time attention mechanism to represent the irregular time series data better. 
    - Unified Approach (UTDE): This approach integrates imputation and mTAND through a gating mechanism to dynamically combine the representation of the time series. 
     
 
- Processing Irregular Clinical Notes: 
    - Text Encoding: Uses a pretrained model (TextEncoder) to encode clinical notes into a series of representations. 
    - Irregularity Modeling: Sorts these representations by time, treats them as Multivariate Irregularly Sampled Time Series (MISTS), and employs mTAND to generate a set of text interpolation representations to handle irregularities.
 
 
- Multimodal Fusion: 
    - Interleaved Attention Mechanism: Fuses time series and clinical note representations across temporal steps, integrating irregularity into multimodal representations. 
    - Self and Cross-Attention: 
        - Multi-Head Self-Attention (MH): Acquires contextual embeddings for each modality by focusing within the same modality across time. 
        - Multi-Head Cross-Attention (CMH): Each modality learns from the other, integrating information across modalities. 
    - Feed-Forward and Prediction Layers: A feed-forward sublayer follows the CMH outputs, with layer normalization and residual connections applied. The final step involves passing the integrated representations through fully connected layers to predict the outcome. 


### How well the proposed method work (in its own metrics)
The proposed methods consistently outperform state-of-the-art (SOTA) baselines on two medical prediction tasks, both in each single modality and in multimodal fusion scenarios.
The paper reports relative improvements of 6.5%, 3.6%, and 4.3% in F1 for time series, clinical notes, and multimodal fusion, respectively.
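
To make the metric concrete, the following is a minimal sketch of how a binary F1 score and a relative improvement over a baseline can be computed. The helper names `f1_score` and `relative_improvement` and the toy labels are assumptions made for illustration; they are not taken from the paper's code, and the numbers are not results from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new, baseline):
    """Gain of `new` over `baseline`, expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


# Toy labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(f"F1 = {f1_score(y_true, y_pred):.3f}")
```

A relative improvement is measured against the baseline score, so a baseline F1 of 0.40 and a new F1 of 0.50 would count as a 25% relative improvement, not 10%.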

### What is the contribution to the research regime (referring to the Background above, how important the paper is to the problem).
The paper's contribution is important because it gives EHR-based predictive models a new direction: accounting for time irregularity, which could lead to more accurate and reliable medical predictions that benefit patients and healthcare processes.

# Scope of Reproducibility:

For our project we plan to reproduce the In-Hospital Mortality (IHM) experiment and test the following hypotheses:


1. The inclusion of UTDE improves the performance of the model.
2. Considering irregularities in clinical note embedding improves the performance of the model.
3. Using UTDE and mTAND to process time series and clinical notes, respectively, together with multimodal fusion, achieves a higher F1 score than standard baselines.

# Methodology

The project reproduction consists of the following sections:
- Data:
    - Data descriptions
    - Implementation code

- Models:
    - Model descriptions
    - Implementation code

- Training:
    - Computational requirements
    - Implementation code

- Evaluation:
    - Metrics descriptions
    - Implementation code



In [1]:
# Imports and configs
import subprocess
import pickle
import os
from GlobalConfigs import *

DOWNLOAD_DATASET = False
EXTRACT_COMPRESSED_CSVS = False
PREPROCESS_BENCHMARKS = False
PREPROCESS_CLINICAL_NOTES = False
PREPROCESS_MULTIMODAL = False

# Getting the data
This paper uses the MIMIC-III dataset. The following cells contain code to download and extract the dataset files.

In [2]:
# change physionet_username to your username
if DOWNLOAD_DATASET:
   
    physionet_username = "ftrujillo"
    password = "your_pass"
    destination_directory = "data/MIMICIII_Original"
    
    command = [
        "wget", "-r", "-N", "-c", "-np",
        "--user", physionet_username,
        "--password", password,
        "https://physionet.org/files/mimiciii/1.4/",
        "-P", destination_directory
    ]
    
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    
    for line in process.stdout:
        print(line, end='')
    
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")


In [3]:
if EXTRACT_COMPRESSED_CSVS:
    command = ['./decompress_mimic.sh', '-d', 'data/MIMICIII_Original/physionet.org/files/mimiciii/1.4/', '-o', 'data/mimic3']
    
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    
    for line in process.stdout:
        print(line, end='')   
        
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")


# Data Preprocessing
The original paper relies on the following projects to prepare and extract data from the original MIMIC CSVs.

### MIMIC benchmarks
Helps process time series data and split it into train and test sets
https://github.com/YerevaNN/mimic3-benchmarks.git

The project does not work out of the box, so we downloaded the source code and modified it inside this repo under the [mimic3-benchmarks](./mimic3-benchmarks) folder.

To simplify the process we have created the script `./build_benchmark_data.sh` to run all the required time series tasks.

In [4]:
if PREPROCESS_BENCHMARKS:
    command = ["./build_benchmark_data.sh"]

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, shell=True)
    
    for line in process.stdout:
        print(line, end='')
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")


### ClinicalNotesICU
Helps process medical notes and split them into train and test sets
 https://github.com/kaggarwal/ClinicalNotesICU.git

The project does not work out of the box, so we downloaded the source code and modified it inside this repo under the [ClinicalNotesICU](./ClinicalNotesICU) folder.

To simplify the process we have created the script `./extract_med_notes.sh` to run all the required tasks for clinical notes.


In [5]:
if PREPROCESS_CLINICAL_NOTES:
    command = ["./extract_med_notes.sh"]

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, shell=True)
    
    for line in process.stdout:
        print(line, end='')
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")

# Preprocess time series and notes to create timestamps and text chunk PKLs

The [paper's repo](https://github.com/XZhang97666/MultimodalMIMIC.git) provides a preprocessing script to take care of this task.

The project does not work out of the box, so we downloaded the source code and modified it inside this repo under the [MultimodalMIMIC](./MultimodalMIMIC) folder.
The preprocessing script is located at [MultimodalMIMIC/preprocessing.py](MultimodalMIMIC/preprocessing.py).

In [6]:
if PREPROCESS_MULTIMODAL:
    # Not working if executed from notebook, run from command line with MultimodalMIMIC as cwd
    # from MultimodalMIMIC import preprocessing
    # %run MultimodalMIMIC/preprocessing.py
    # preprocessing.main()
    command = ["python", "MultimodalMIMIC/preprocessing.py"]

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    
    for line in process.stdout:
        print(line, end='')
        
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")


Executing mimic multimodal preprocess
Namespace(data='/media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/mimic3-benchmarks/data/in-hospital-mortality/', period_length=48, task='ihm', outputdir='/media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/', timestep=1.0, imputation='previous', small_part=False)
Preprocessing ihm
Extracting train irregular data
Extracting val irregular data
Extracting test irregular data
Normalizing train times data
Normalizing val times data
Normalizing test times data
Preparing  train text data
Suceed Merging:  13655
Missing Merging:  4248
Preparing  val text data
Suceed Merging:  13655
Missing Merging:  4248
Preparing  test text data
Suceed Merging:  2488
Missing Merging:  748
Preprocessing Done


In [None]:


clinical_notes_path = f"{BENCHMARKS_ROOT_PATH}/data/root/text_fixed"

# Start times pkl contains a map of patient_id notes with its recorded time
test_note_start_time = f"{clinical_notes_path}/test_starttime.pkl"
train_note_start_time = f"{clinical_notes_path}/starttime.pkl"

test_start_times_dict = {}
train_start_times_dict = {}
with open(test_note_start_time, 'rb') as f:
    test_start_times_dict.update(pickle.load(f))

with open(train_note_start_time, 'rb') as f:
    train_start_times_dict.update(pickle.load(f))

print("test:", len(test_start_times_dict))
print("train:", len(train_start_times_dict))


In [None]:
text_train_files = []
text_test_files = []

text_train_filepath = f"{clinical_notes_path}/train"
text_test_filepath = f"{clinical_notes_path}/test"

with os.scandir(text_train_filepath) as entries:
    text_train_files.extend([entry.name for entry in entries if entry.is_file() and entry.name[0].isdigit()])

with os.scandir(text_test_filepath) as entries:
    text_test_files.extend([entry.name for entry in entries if entry.is_file() and entry.name[0].isdigit()])

print("test texts:", len(text_test_files))
print("train texts:", len(text_train_files))
print(f"{text_test_filepath}/{text_test_files[0]}")


In [None]:
# Open PKL
dataPath = "/media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/p2x_data.pkl"
if os.path.isfile(dataPath):
    print('Using', dataPath)
    with open(dataPath, 'rb') as f:
        data = pickle.load(f)
        print("pkl data:", data[0].keys())



### Modeling Irregularity in Time Series:
1. Temporal Discretization-Based Embeddings (TDE): Utilizes a novel unified approach (UTDE) that combines:
    - Imputation: Regularizes time series by filling in missing values based on prior observations or statistical methods.
    - Discretized Multi-Time Attention (mTAND): Applies a learned interpolation method using a multi-time attention mechanism to represent the irregular time series data better.
2. Unified Approach (UTDE): This approach integrates imputation and mTAND through a gating mechanism to dynamically combine the representation of the time series.
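
The two ingredients described above can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: `forward_fill` shows imputation by carrying the previous observation into empty regular bins, and `gated_fusion` assumes a per-dimension sigmoid gate over the concatenated embeddings. The function names, the gate parameters `weights` and `bias`, and the toy values are all hypothetical.

```python
import math


def forward_fill(times, values, num_bins, bin_width=1.0, default=0.0):
    """Place irregular observations into regular bins, then carry the last
    observed value forward into empty bins (`default` before the first one)."""
    bins = [None] * num_bins
    for t, v in sorted(zip(times, values)):
        idx = int(t // bin_width)
        if 0 <= idx < num_bins:
            bins[idx] = v  # the latest observation in a bin wins
    filled, last = [], default
    for v in bins:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(e_imp, e_mtand, weights, bias):
    """Fuse two embeddings of equal length with a per-dimension gate.

    g = sigmoid(W @ [e_imp; e_mtand] + b); output = g * e_imp + (1 - g) * e_mtand.
    """
    concat = list(e_imp) + list(e_mtand)
    fused = []
    for d in range(len(e_imp)):
        g = sigmoid(sum(w * x for w, x in zip(weights[d], concat)) + bias[d])
        fused.append(g * e_imp[d] + (1.0 - g) * e_mtand[d])
    return fused


# Toy heart-rate readings at irregular hours (invented values).
hourly = forward_fill(times=[0.5, 3.2, 3.9], values=[80, 95, 90], num_bins=6)
print(hourly)  # [80, 80, 80, 90, 90, 90]

# With zero weights and bias the gate is 0.5, so the result is a plain average.
zero_w = [[0.0] * 4 for _ in range(2)]
print(gated_fusion([1.0, 2.0], [3.0, 4.0], zero_w, [0.0, 0.0]))  # [2.0, 3.0]
```

In a trained model the gate parameters would be learned, letting the network lean on the imputed embedding in some dimensions and on the mTAND embedding in others.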


### Processing Irregular Clinical Notes:
1. Text Encoding: Uses a pretrained model (TextEncoder) to encode clinical notes into a series of representations.
2. Irregularity Modeling: Sorts these representations by time, treats them as Multivariate Irregularly Sampled Time Series (MISTS), and employs mTAND to generate a set of text interpolation representations to handle irregularities.
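
The interpolation idea can be illustrated with a small time-attention sketch: each regular reference time attends over the irregular note times, and the note embeddings are averaged with the resulting weights. This is a simplification under stated assumptions, not mTAND itself. It uses fixed sine/cosine time features (so the score depends only on the time gap) in place of mTAND's learned time embedding and projections, and the names `time_embedding`, `time_attention_weights`, `interpolate`, and `FREQS` are hypothetical.

```python
import math

FREQS = [1.0, 0.5, 0.25]  # fixed frequencies, assumed for illustration


def time_embedding(t, freqs=FREQS):
    """Embed a scalar time as [sin(w*t), cos(w*t)] pairs, one per frequency."""
    emb = []
    for w in freqs:
        emb.extend([math.sin(w * t), math.cos(w * t)])
    return emb


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_attention_weights(ref_time, obs_times, freqs=FREQS):
    """Attention of one reference time over all observed times."""
    q = time_embedding(ref_time, freqs)
    scale = math.sqrt(len(q))
    scores = []
    for t in obs_times:
        k = time_embedding(t, freqs)
        scores.append(sum(a * b for a, b in zip(q, k)) / scale)
    return softmax(scores)


def interpolate(ref_times, obs_times, obs_values, freqs=FREQS):
    """Return one attention-weighted average of `obs_values` per reference time."""
    dim = len(obs_values[0])
    out = []
    for r in ref_times:
        w = time_attention_weights(r, obs_times, freqs)
        out.append([sum(wi * v[d] for wi, v in zip(w, obs_values)) for d in range(dim)])
    return out


# Three notes written at irregular hours; 2-d vectors stand in for TextEncoder output.
note_times = [1.0, 6.0, 20.0]
note_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grid = interpolate([0.0, 6.0, 12.0], note_times, note_embs)
print(len(grid), len(grid[0]))  # 3 2
```

Because the weights form a softmax, every interpolated vector is a convex combination of the observed note embeddings, and a reference time that coincides with a note time puts its largest weight on that note.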

### Multimodal Fusion:
1. Interleaved Attention Mechanism: Fuses time series and clinical note representations across temporal steps, integrating irregularity into multimodal representations.
2. Self and Cross-Attention:
    - Multi-Head Self-Attention (MH): Acquires contextual embeddings for each modality by focusing within the same modality across time.
    - Multi-Head Cross-Attention (CMH): Each modality learns from the other, integrating information across modalities.
3. Feed-Forward and Prediction Layers: A feed-forward sublayer follows the CMH outputs, with layer normalization and residual connections applied. The final step involves passing the integrated representations through fully connected layers to predict the outcome.
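
The self- and cross-attention steps above can be sketched with single-head scaled dot-product attention. This is a deliberately reduced illustration: it omits the learned projections, multiple heads, layer normalization, and feed-forward sublayer that the description mentions, and the names `attention` and `cross_modal_step` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out


def add(xs, ys):
    """Element-wise sum of two sequences of vectors (residual connection)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def cross_modal_step(ts, notes):
    """One simplified fusion step over two sequences of equal feature size."""
    # Self-attention: each modality attends over its own time steps.
    ts_ctx = attention(ts, ts, ts)
    notes_ctx = attention(notes, notes, notes)
    # Cross-attention: queries come from one modality, keys/values from the other.
    ts_fused = add(ts_ctx, attention(ts_ctx, notes_ctx, notes_ctx))
    notes_fused = add(notes_ctx, attention(notes_ctx, ts_ctx, ts_ctx))
    return ts_fused, notes_fused


# Toy sequences: 3 time-series steps and 2 note steps, 2 features each.
ts_seq = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
note_seq = [[0.2, 0.8], [0.9, 0.1]]
ts_out, note_out = cross_modal_step(ts_seq, note_seq)
print(len(ts_out), len(note_out))  # 3 2
```

Each modality keeps its own sequence length after fusion; only the content of each step changes, since the cross-attention output is added back onto the modality that issued the queries.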

# Train Model

# Evaluate

# Ablation 1 - Drop UTDE

# Ablation 2 - Remove mTAND