# Improving Medical Predictions by Irregular Multimodal Electronic Health Records Modeling
 
The paper addresses a general problem: how to handle the irregular, multimodal data
recorded in EHRs so that real-time predictions in ICUs improve.
“Health conditions among patients in intensive care units (ICUs) are monitored via
electronic health records (EHRs), composed of numerical time series and lengthy clinical
note sequences, both taken at irregular time intervals. Dealing with such irregularity in
every modality, and integrating irregularity into multimodal representations to improve
medical predictions, is a challenging problem.” (Zhang et al., 2023)
<br>

> Zhang, X., Li, S., Chen, Z., Yan, X., & Petzold, L. R. (2023, July). Improving medical predictions
> by irregular multimodal electronic health records modeling. In International Conference on
> Machine Learning (pp. 41300-41313). PMLR. [link](https://arxiv.org/abs/2210.12156)

## Specific approach
The paper models EHR data by integrating time series and clinical notes while
accounting for their irregularity. To achieve this, it addresses three main
challenges: modeling irregularity in time series, processing irregular clinical
notes, and multimodal fusion.


In [53]:
# Imports and configs
import subprocess
import pickle
import os
from GlobalConfigs import *

DOWNLOAD_DATASET = False
EXTRACT_COMPRESSED_CSVS = False
PREPROCESS_BENCHMARKS = False
PREPROCESS_CLINICAL_NOTES = False
PREPROCESS_MULTIMODAL = False

# Getting the data
This paper uses the MIMIC-III dataset. The following cells download and extract the dataset files.

In [46]:
# change physionet_username to your username
if DOWNLOAD_DATASET:
   
    physionet_username = "ftrujillo"
    password = "your_pass"
    destination_directory = "data/MIMICIII_Original"
    
    command = [
        "wget", "-r", "-N", "-c", "-np",
        "--user", physionet_username,
        "--password", password,
        "https://physionet.org/files/mimiciii/1.4/",
        "-P", destination_directory
    ]
    
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    
    for line in process.stdout:
        print(line, end='')
    
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")
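
The download cell above keeps the PhysioNet password in the notebook itself. One alternative is to read the credentials from the environment, sketched below under stated assumptions: the variable names `PHYSIONET_USER` and `PHYSIONET_PASSWORD` and both helper functions are hypothetical and not part of the original project. The `wget` arguments are the ones already used above.

```python
import getpass
import os


def build_wget_command(username, password, destination_directory):
    """Return the wget argument list used above to mirror MIMIC-III v1.4."""
    return [
        "wget", "-r", "-N", "-c", "-np",
        "--user", username,
        "--password", password,
        "https://physionet.org/files/mimiciii/1.4/",
        "-P", destination_directory,
    ]


def read_physionet_credentials(env=os.environ, prompt=getpass.getpass):
    """Read credentials from the environment (hypothetical variable names).

    Falls back to interactive prompts when a variable is unset or empty.
    """
    username = env.get("PHYSIONET_USER") or input("PhysioNet username: ")
    password = env.get("PHYSIONET_PASSWORD") or prompt("PhysioNet password: ")
    return username, password
```

With these helpers, the download cell would call `read_physionet_credentials()` and pass the result to `build_wget_command(...)` instead of assigning the password as a string literal.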


In [47]:
if EXTRACT_COMPRESSED_CSVS:
    command = ['./decompress_mimic.sh', '-d', 'data/MIMICIII_Original/physionet.org/files/mimiciii/1.4/', '-o', 'data/mimic3']
    
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    
    for line in process.stdout:
        print(line, end='')   
        
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")
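
The cells in this notebook repeat the same launch-and-stream pattern for every external command. A minimal sketch of a helper that factors it out, assuming the name `run_and_stream` (hypothetical, not defined in this repo):

```python
import subprocess
import sys


def run_and_stream(command, **popen_kwargs):
    """Run a command, echo its combined stdout/stderr line by line, return the exit code."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, as the cells do
        text=True,
        **popen_kwargs,
    )
    for line in process.stdout:
        print(line, end="")
    process.wait()
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")
    return process.returncode


# Example: run a trivial child process with the current interpreter.
exit_code = run_and_stream([sys.executable, "-c", "print('hello from child')"])
```

Returning the exit code lets a later cell decide whether to continue, for example by skipping preprocessing when extraction failed.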


# Data Preprocessing
The original paper relies on the following projects to prepare and extract the data from the original MIMIC CSVs.

### MIMIC benchmarks
Processes the time series data and splits it into train and test sets:
https://github.com/YerevaNN/mimic3-benchmarks.git

The project does not work out of the box, so we downloaded the source code and modified it inside this repo under the [mimic3-benchmarks](./mimic3-benchmarks) folder.

To simplify the process, we created the script `./build_benchmark_data.sh` to run all the required time series tasks.

In [49]:
if PREPROCESS_BENCHMARKS:
    # Pass a plain string because Popen is called with shell=True below
    command = "./build_benchmark_data.sh"

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, shell=True)
    
    for line in process.stdout:
        print(line, end='')
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")


### ClinicalNotesICU
Processes the clinical notes and splits them into train and test sets:
https://github.com/kaggarwal/ClinicalNotesICU.git

The project does not work out of the box, so we downloaded the source code and modified it inside this repo under the [ClinicalNotesICU](./ClinicalNotesICU) folder.

To simplify the process, we created the script `./extract_med_notes.sh` to run all the required tasks for clinical notes.


In [50]:
if PREPROCESS_CLINICAL_NOTES:
    # Pass a plain string because Popen is called with shell=True below
    command = "./extract_med_notes.sh"

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, shell=True)
    
    for line in process.stdout:
        print(line, end='')
    process.wait()
    
    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")

# Preprocess time series and notes to create timestamps and text chunk PKLs

The [paper's repo](https://github.com/XZhang97666/MultimodalMIMIC.git) provides a preprocessing script to take care of this task.

The project does not work out of the box, so we downloaded the source code and modified it inside this repo under the [MultimodalMIMIC](./MultimodalMIMIC) folder.

In [59]:
if PREPROCESS_MULTIMODAL:
    # Not working if executed from notebook, run from command line with MultimodalMIMIC as cwd
    %run MultimodalMIMIC/preprocessing.py


In [None]:


clinical_notes_path = f"{BENCHMARKS_ROOT_PATH}/data/root/text_fixed"

# Each start-time pickle maps a patient's notes to the times they were recorded
test_note_start_time = f"{clinical_notes_path}/test_starttime.pkl"
train_note_start_time = f"{clinical_notes_path}/starttime.pkl"

test_start_times_dict = {}
train_start_times_dict = {}
with open(test_note_start_time, 'rb') as f:
    test_start_times_dict.update(pickle.load(f))

with open(train_note_start_time, 'rb') as f:
    train_start_times_dict.update(pickle.load(f))

print("test:", len(test_start_times_dict))
print("train:", len(train_start_times_dict))


In [None]:
text_train_files = []
text_test_files = []

text_train_filepath = f"{clinical_notes_path}/train"
text_test_filepath = f"{clinical_notes_path}/test"

with os.scandir(text_train_filepath) as entries:
    text_train_files.extend([entry.name for entry in entries if entry.is_file() and entry.name[0].isdigit()])

with os.scandir(text_test_filepath) as entries:
    text_test_files.extend([entry.name for entry in entries if entry.is_file() and entry.name[0].isdigit()])

print("test texts:", len(text_test_files))
print("train texts:", len(text_train_files))
print(f"{text_test_filepath}/{text_test_files[0]}")


In [None]:
# Open PKL
dataPath = "/media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/p2x_data.pkl"
if os.path.isfile(dataPath):
    print('Using', dataPath)
    with open(dataPath, 'rb') as f:
        data = pickle.load(f)
        print("pkl data:", data[0].keys())



### Modeling Irregularity in Time Series:
1. Temporal Discretization-Based Embeddings (TDE): Two complementary ways of
representing an irregular time series on a regular time grid:
    - Imputation: Regularizes time series by filling in missing values based
on prior observations or statistical methods.
    - Discretized Multi-Time Attention (mTAND): Applies a learned
interpolation method using a multi-time attention mechanism to
better represent the irregular time series data.
2. Unified Approach (UTDE): Integrates imputation and mTAND
through a gating mechanism that dynamically combines the two representations
of the time series.
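
The imputation and gating steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: forward filling is only one possible imputation, the gate uses three scalar parameters (`w_imp`, `w_attn`, `bias`, all hypothetical) where the paper learns the gate from both embeddings, and the `e_attn` values are made up rather than produced by mTAND.

```python
import math
from bisect import bisect_right


def forward_fill_on_grid(times, values, grid, default=0.0):
    """Discretize an irregular series: at each grid time, carry the last observation forward.

    `times` must be sorted in ascending order. Grid points before the first
    observation receive `default`.
    """
    filled = []
    for t in grid:
        idx = bisect_right(times, t)  # number of observations with time <= t
        filled.append(values[idx - 1] if idx > 0 else default)
    return filled


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(e_imp, e_attn, w_imp, w_attn, bias):
    """Mix two equal-length embeddings with an element-wise gate g in (0, 1)."""
    fused = []
    for a, b in zip(e_imp, e_attn):
        g = sigmoid(w_imp * a + w_attn * b + bias)
        fused.append(g * a + (1.0 - g) * b)
    return fused


# Irregular readings: hours since admission and measured values.
times = [0.5, 2.2, 2.9]
values = [80.0, 95.0, 90.0]
hourly = forward_fill_on_grid(times, values, grid=[0, 1, 2, 3])
# hourly == [0.0, 80.0, 80.0, 90.0]

e_attn = [78.0, 82.0, 91.0, 92.0]  # made-up stand-in for an mTAND output
fused = gated_fusion(hourly, e_attn, w_imp=0.0, w_attn=0.0, bias=0.0)
```

With all gate parameters at zero the gate is 0.5 everywhere, so `fused` is the plain average of the two representations. A trained gate would instead shift weight toward whichever representation is more useful at each position.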


### Processing Irregular Clinical Notes:
1. Text Encoding: Uses a pretrained model (TextEncoder) to encode clinical
notes into a series of representations.
2. Irregularity Modeling: Sorts these representations by time, treats them as
multivariate irregularly sampled time series (MISTS), and employs mTAND
to generate a set of text interpolation representations that handle the irregularity.
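
The interpolation step can be sketched as attention over timestamps. The code below is a simplified illustration, not mTAND itself: it uses a fixed sinusoidal time embedding where mTAND learns its embedding, it has a single head, and the note vectors are made-up stand-ins for TextEncoder outputs. The names `time_embedding` and `interpolate_at` are hypothetical.

```python
import math


def time_embedding(t, dim=8, max_period=48.0):
    """Fixed sinusoidal embedding of a timestamp (stand-in for a learned embedding)."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (max_period ** (2 * i / dim))
        emb.extend([math.sin(t * freq), math.cos(t * freq)])
    return emb


def softmax(scores):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_at(query_times, obs_times, obs_vectors, dim=8):
    """For each query time, return an attention-weighted average of the observed vectors."""
    keys = [time_embedding(t, dim) for t in obs_times]
    width = len(obs_vectors[0])
    outputs = []
    for q_t in query_times:
        q = time_embedding(q_t, dim)
        # Similarity between the query time and every observation time.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * vec[j] for w, vec in zip(weights, obs_vectors))
            for j in range(width)
        ])
    return outputs


# Three notes written at irregular times (hours), each encoded as a 2-d vector.
note_times = [1.0, 7.5, 20.0]
note_vectors = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8]]
interp = interpolate_at([0.0, 12.0, 24.0], note_times, note_vectors)
```

Each output row is a convex combination of the note vectors, so the irregular note sequence becomes a fixed-length sequence aligned to the chosen query times.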

### Multimodal Fusion:
1. Interleaved Attention Mechanism: Fuses time series and clinical note
representations across temporal steps, integrating irregularity into multimodal
representations.
2. Self and Cross-Attention:
    - Multi-Head Self-Attention (MH): Acquires contextual embeddings for
each modality by focusing within the same modality across time.
    - Multi-Head Cross-Attention (CMH): Each modality learns from the
other, integrating information across modalities.
3. Feed-Forward and Prediction Layers: A feed-forward sublayer follows the
CMH outputs, with layer normalization and residual connections applied. The
final step involves passing the integrated representations through fully
connected layers to predict the outcome.
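
The self-attention and cross-attention steps above can be sketched in a few lines. This is a heavily simplified illustration that assumes both modalities have already been interpolated to vectors of the same width. It omits the learned projections, multiple heads, layer normalization, feed-forward sublayer, and prediction layers described above, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of equal-width vectors."""
    d = len(queries[0])
    width = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return out


def cross_modal_step(ts_repr, txt_repr):
    """Self-attention within each modality, then cross-attention with a residual connection."""
    ts_ctx = attention(ts_repr, ts_repr, ts_repr)
    txt_ctx = attention(txt_repr, txt_repr, txt_repr)
    # Each modality queries the other one.
    ts_cross = attention(ts_ctx, txt_ctx, txt_ctx)
    txt_cross = attention(txt_ctx, ts_ctx, ts_ctx)
    ts_out = [[a + b for a, b in zip(x, y)] for x, y in zip(ts_ctx, ts_cross)]
    txt_out = [[a + b for a, b in zip(x, y)] for x, y in zip(txt_ctx, txt_cross)]
    return ts_out, txt_out


# Toy inputs: three time-series steps and two note steps, all 2-d.
ts_repr = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
txt_repr = [[0.5, 0.5], [0.5, 0.5]]
ts_out, txt_out = cross_modal_step(ts_repr, txt_repr)
```

Each output keeps the shape of its own modality: the time-series rows now carry information from the notes, and the note rows carry information from the time series.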

# Train Model

# Evaluate

# Ablation 1 - Drop UTDE

# Ablation 2 - Remove mTAND