# Improving Medical Predictions by Irregular Multimodal Electronic Health Records Modeling

# Team and Repo
## Team Members:
- Franco E. Trujillo - fet2@illinois.edu
- Hongyi Wu - hongyiw6@illinois.edu

## Project Repo:
- [https://github.com/FrancoETrujillo/CS598_Final](https://github.com/FrancoETrujillo/CS598_Final)

## Reference Repos:
- [https://github.com/XZhang97666/MultimodalMIMIC](https://github.com/XZhang97666/MultimodalMIMIC)
- [https://github.com/YerevaNN/mimic3-benchmarks](https://github.com/YerevaNN/mimic3-benchmarks)
- [https://github.com/kaggarwal/ClinicalNotesICU](https://github.com/kaggarwal/ClinicalNotesICU)

# Introduction
This paper intends to address the challenges of handling irregularity and the integration of multimodal data for medical prediction tasks.

## Background of the problem
### What type of problem:
The paper focuses on two main problems: mortality prediction and phenotype classification.
### What is the importance/meaning of solving the problem: 
ICUs admit patients with life-threatening conditions. Improving the accuracy and efficiency of predictions by accounting for irregular data in EHRs can help medical providers make faster, more accurate decisions that could save lives.

### What is the difficulty of the problem:
The primary difficulty is handling the irregular sampling of the data and effectively integrating and modeling heterogeneous EHR records, such as numerical time series and textual notes, that are recorded at different points in time and at different frequencies.

![EHR sample image](.img/sample_ehr.png)

### The state-of-the-art methods and effectiveness.
For irregular data handling:
> [1] Lipton, Z. C., Kale, D., and Wetzel, R. Directly modeling
> missing data in sequences with rnns: Improved classification of clinical time series. In Machine learning for
> healthcare conference, pp. 253–270. PMLR, 2016.

> [2] Shukla, S. N. and Marlin, B. M. Multi-time attention networks for irregularly sampled time series. arXiv preprint
> arXiv:2101.10318, 2021.

For irregular clinical notes processing:
> [3] Golmaei, S. N. and Luo, X. Deepnote-gnn: predicting hospital readmission using clinical notes and patient network.
> In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics,
> pp. 1–9, 2021.

> [4] Mahbub, M., Srinivasan, S., Danciu, I., Peluso, A., Begoli, E., Tamang, S., and Peterson, G. D.
> Unstructured clinical notes within the 24 hours since admission predict short, mid & long-term mortality in adult ICU patients.
> PLoS ONE, 17(1):e0262182, 2022.

## Paper explanation
### What did the paper propose
The paper proposes a better approach to handling the irregular multimodal data found in EHRs in order to improve real-time predictions in ICUs.

### What are the innovations of the method
To better handle irregularity and multimodal data, the paper proposes integrating the time series and the clinical notes while accounting for their irregularities. It does so as follows:

![High level arch](.img/high_arch_w_desc.png)

#### Modeling Irregularity in Time Series:
1. Temporal Discretization-Based Embeddings (TDE): Utilizes a novel unified
approach (UTDE) that combines:
    - Imputation: Regularizes time series by filling in missing values based
on prior observations or statistical methods.
    - Discretized Multi-Time Attention (mTAND): Applies a learned
interpolation method using a multi-time attention mechanism to
represent the irregular time series data better.
2. Unified Approach (UTDE): This approach integrates imputation and mTAND
through a gating mechanism to dynamically combine the representation of
the time series.
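
The gating step can be illustrated with a minimal sketch. It assumes that the gate is a sigmoid applied per embedding dimension and that the output is a convex combination of the two embeddings. The function name `gated_fusion` and the per-dimension weights are hypothetical stand-ins for the learned gating network, not the paper's implementation.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(e_imp: List[float], e_mtand: List[float],
                 w_imp: List[float], w_mtand: List[float],
                 bias: List[float]) -> List[float]:
    """Blend two embeddings of one time step with an element-wise gate.

    g   = sigmoid(w_imp * e_imp + w_mtand * e_mtand + bias)
    out = g * e_imp + (1 - g) * e_mtand
    The weights are hypothetical placeholders for a learned gating network.
    """
    fused = []
    for a, b, wa, wb, c in zip(e_imp, e_mtand, w_imp, w_mtand, bias):
        g = sigmoid(wa * a + wb * b + c)
        fused.append(g * a + (1.0 - g) * b)
    return fused


# Made-up embeddings for a single time step.
e_imp = [1.0, 0.0, -2.0]
e_mtand = [0.0, 1.0, 2.0]
zeros = [0.0, 0.0, 0.0]
averaged = gated_fusion(e_imp, e_mtand, zeros, zeros, zeros)
print(averaged)
```

With all weights at zero the gate is 0.5 everywhere, so the result is the plain average of the two embeddings; a large positive bias pushes the output toward the imputation embedding.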
 
 ![Detail arch](.img/imputation_plus_mtand.png)

#### Processing Irregular Clinical Notes:
1. Text Encoding: Uses a pretrained model (TextEncoder) to encode clinical
notes into a series of representations.
2. Irregularity Modeling: Sorts these representations by time, treats them as
Multivariate Irregularly Sampled Time Series (MISTS), and employs mTAND
to generate a set of text interpolation representations to handle irregularities.
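
The idea of interpolating irregular note representations onto fixed reference times can be sketched as follows. This is a simplification: mTAND learns its time embeddings, whereas this sketch assumes a fixed similarity that decays with the time gap. The function name, the `scale` parameter, and the sample values are all hypothetical.

```python
import math
from typing import List, Sequence


def time_attention_interpolate(obs_times: Sequence[float],
                               obs_values: Sequence[Sequence[float]],
                               ref_times: Sequence[float],
                               scale: float = 1.0) -> List[List[float]]:
    """For each reference time, return a softmax-weighted average of the
    observed vectors, where the weights decay with |ref_time - obs_time|."""
    if not obs_times or len(obs_times) != len(obs_values):
        raise ValueError("obs_times and obs_values must be non-empty and aligned")
    dim = len(obs_values[0])
    out = []
    for r in ref_times:
        scores = [-abs(r - t) / scale for t in obs_times]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, obs_values))
                    for d in range(dim)])
    return out


# Made-up example: three notes written at irregular hours, 2-d embeddings.
note_times = [2.5, 7.0, 30.0]
note_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference_times = [0.0, 12.0, 24.0, 36.0, 48.0]
interpolated = time_attention_interpolate(note_times, note_embeddings,
                                          reference_times, scale=6.0)
print(len(interpolated), len(interpolated[0]))
```

Every reference time receives a representation, regardless of how many notes exist or when they were written, which is what lets the downstream layers work on a regular grid.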


#### Multimodal Fusion:
1. Interleaved Attention Mechanism: Fuses time series and clinical note
representations across temporal steps, integrating irregularity into multimodal
representations.
2. Self and Cross-Attention:
    - Multi-Head Self-Attention (MH): Acquires contextual embeddings for
each modality by focusing within the same modality across time.
    - Multi-Head Cross-Attention (CMH): Each modality learns from the
other, integrating information across modalities.
3. Feed-Forward and Prediction Layers: A feed-forward sublayer follows the
CMH outputs, with layer normalization and residual connections applied. The
final step involves passing the integrated representations through fully
connected layers to predict the outcome.
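
As a rough illustration of the cross-attention step, the sketch below lets one modality query the other with scaled dot-product attention. It assumes a single head and omits the learned projections, residual connections, and layer normalization; the variable names and values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention without learned projections.

    Queries come from one modality; keys and values come from the other.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Made-up representations: two time-series steps attend over two note steps.
ts_steps = [[1.0, 0.0], [0.0, 1.0]]
note_keys = [[1.0, 0.0], [1.0, 0.0]]
note_values = [[0.0, 2.0], [4.0, 6.0]]
fused = cross_attention(ts_steps, note_keys, note_values)
print(fused)
```

Because both note keys are identical here, each query weights the two notes equally and receives the mean of the note values.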


### How well the proposed method works (in its own metrics)
The proposed methods consistently outperform state-of-the-art (SOTA) baselines on two medical prediction tasks, in each single-modality setting and in multimodal fusion.
The paper reports relative improvements of 6.5%, 3.6%, and 4.3% in F1 for time series, clinical notes, and multimodal fusion, respectively.
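
For readers unfamiliar with the metric, a relative improvement is the gain divided by the baseline score. The short sketch below shows the arithmetic; the scores used are made-up placeholders and are not results from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Return the relative improvement of `proposed` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (proposed - baseline) / baseline * 100.0


# Hypothetical F1 scores, chosen only to illustrate the formula.
gain = relative_improvement(baseline=0.40, proposed=0.50)
print(f"{gain:.1f}%")
```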

### What is the contribution to the research regime (referring to the Background above, how important the paper is to the problem).
The paper's contribution is important because it provides a new direction for EHR-based predictive models to consider time irregularity that could lead to more accurate and reliable medical predictions, helping patients and healthcare processes.

# Scope of Reproducibility:

For our project we plan to reproduce the In-Hospital Mortality (IHM) experiment and test the following hypotheses:


1. The inclusion of UTDE improves the performance of the model.
2. Considering irregularities in clinical note embedding improves the performance of the model.
3. Using UTDE and mTAND to process time series and clinical notes, respectively, together with multimodal fusion, yields a higher F1 score than the standard baselines.
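
Since the hypotheses are compared on F1, the following sketch shows how a binary F1 score can be computed from predictions. It is a plain reference implementation for illustration, with made-up labels; the evaluation code in the reference repos may compute it differently.

```python
from typing import Sequence


def f1_score_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up mortality labels (1 = died in hospital) and predictions.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score_binary(y_true, y_pred)
print(round(score, 3))
```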

# Prerequisites to Reproduce the project
- Get access to the MIMIC dataset
- Modify GlobalConfigs.py to use your own project and data paths
- Install the required dependencies listed in Requirements.txt; we recommend using Conda with Python 3.11
- Modify the directory variables in the **Configuring imports and directories** section below if needed

**Notes:** 

- We are unable to share the preprocessed pkl files due to the [MIMIC DUA](https://physionet.org/content/mimiciii/view-dua/1.4/)
- This project has been developed and tested on Linux Mint 21.2; Ubuntu variants should work, but you may need to modify it to run on another OS
- More information about our directory structure can be found in our README.md
- For this project we used an NVIDIA 4070 GPU with a limited dataset; you may need more resources for your case.

# Methodology

The project reproduction consists of the following sections:
- Data
- Models
- Training
- Evaluation

# Data

This paper uses the MIMIC-III dataset as the starting point for obtaining time series information and medical notes.

The MIMIC-III dataset is composed of a set of CSV files containing information about patients, their stays, events, and notes. 

For our project the most relevant tables are:
### ADMISSIONS
Contains information about the admissions of patients to the hospital.

![Admissions](.img/Addmissions_table.png)


### PATIENTS
Contains information about the patients.

![Patients](.img/patients_table.png)


### ICUSTAYS
Contains information about the ICU stays of patients.

![ICUStays](.img/icu_stays_table.png)


### NOTEEVENTS
Contains information about the notes taken for each patient.

![NoteEvents](.img/NoteEvents_table.png)


### CALLOUT
Contains information about when patients were ready for discharge (called out), and the actual time of their discharge (or more generally, their outcome).

![Callout](.img/callout_table.png)


More info about the table structures can be found at https://mit-lcp.github.io/mimic-schema-spy/index.html


 
## Getting the data
The following cells contain helper code to download and extract the dataset files locally.
 
**Note**: To download the MIMIC dataset, you must first complete the access request at [PhysioNet](https://physionet.org/).

After downloading and extracting the dataset, we will have a directory structure like this:
.
├── ClinicalNotesICU
│   ├── models
│   └── scripts
├── mimic3-benchmarks
│   ├── data
│   │   ├── decompensation
│   │   ├── in-hospital-mortality
│   │   ├── length-of-stay
│   │   ├── multitask
│   │   ├── phenotyping
│   │   └── root
│   │       ├── test_text_fixed
│   │       └── text_fixed
│   ├── mimic3benchmark
│   │   ├── evaluation
│   │   ├── resources
│   │   ├── scripts
│   │   └── tests
│   │       └── resources
│   └── mimic3models
│       ├── decompensation
│       │   └── logistic
│       ├── in_hospital_mortality
│       │   └── logistic
│       ├── keras_models
│       ├── length_of_stay
│       │   └── logistic
│       ├── multitask
│       ├── phenotyping
│       │   └── logistic
│       └── resources
└── MultimodalMIMIC
    ├── Data
    │   ├── ihm
    │   └── irregular
    └── run
        └── TS_Text

## Config tasks to execute
Open and edit GlobalConfigs.py to set up the local path to the project.

In [11]:
import os
import pickle
# Imports and configs
import subprocess
from tqdm import tqdm

from GlobalConfigs import *

DOWNLOAD_DATASET = False
EXTRACT_COMPRESSED_CSVS = False
PREPROCESS_BENCHMARKS = False
PREPROCESS_CLINICAL_NOTES = False
PREPROCESS_MULTIMODAL = False

## Download dataset

In [12]:
# change physionet_username to your username
if DOWNLOAD_DATASET:

    physionet_username = "your_user_name"
    password = "your_pass"
    destination_directory = "data/MIMICIII_Original"

    command = [
        "wget", "-r", "-N", "-c", "-np",
        "--user", physionet_username,
        "--password", password,
        "https://physionet.org/files/mimiciii/1.4/",
        "-P", destination_directory
    ]

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

    for line in process.stdout:
        print(line, end='')

    process.wait()

    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")


In [13]:
if EXTRACT_COMPRESSED_CSVS:
    command = ['./decompress_mimic.sh', '-d', 'data/MIMICIII_Original/physionet.org/files/mimiciii/1.4/', '-o',
               'data/mimic3']

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)

    for line in process.stdout:
        print(line, end='')

    process.wait()

    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")


# Preparing the data
The original paper leverages the following projects to help with data preparation and extraction from the original MIMIC CSVs.

It leverages **mimic3-benchmarks** and **ClinicalNotesICU** for the following:

- Clean up invalid data
- Map the events, diagnoses, and stays for each patient
- Extract time series for in-hospital mortality
- Split the time series data into train and test sets
- Extract medical notes for patients
- Split the medical notes into train and test sets

### MIMIC benchmarks
Helps process time series data and split it into train and test sets
[mimic3-benchmarks](https://github.com/YerevaNN/mimic3-benchmarks.git)

This repo contains a set of scripts that take the RAW mimic CSVs and prepare the irregular data:
- extract_subjects.py:
Generates one directory per SUBJECT_ID and writes ICU stay information to data/{SUBJECT_ID}/stays.csv, diagnoses to data/{SUBJECT_ID}/diagnoses.csv, and events to data/{SUBJECT_ID}/events.csv


- validate_events.py
Attempts to fix some issues (for example, a missing ICU stay ID) and removes events that have missing information. About 80% of events remain after removing all suspicious rows.


- extract_episodes_from_subjects.py
Breaks up per-subject data into separate episodes (pertaining to ICU stays). Time series of events are stored in {SUBJECT_ID}/episode{#}_timeseries.csv (where # counts distinct episodes), while episode-level information (patient age, gender, ethnicity, height, weight) and outcomes (mortality, length of stay, diagnoses) are stored in {SUBJECT_ID}/episode{#}.csv. This script requires two files: one that maps event ITEMIDs to clinical variables and another that defines valid ranges for clinical variables.


- split_train_and_test.py
Splits the whole dataset into training and testing sets.


- create_in_hospital_mortality.py
Generates task-specific datasets for in-hospital mortality prediction.

After running the preparation scripts, we end up with a directory data/in-hospital-mortality that has two subdirectories: train and test. Each of them contains a number of ICU stays and a file named listfile.csv, which lists all samples in that set. Each row of listfile.csv has the following form: icu_stay, period_length, label(s). A row specifies a sample whose input is the collection of ICU events of icu_stay that occurred in the first period_length hours of the stay and whose target is label(s). In the in-hospital mortality prediction task, period_length is always 48 hours.
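
To make the listfile format concrete, here is a minimal sketch that parses rows of the form described above. The column names (`stay`, `period_length`, `y_true`) and the sample rows are assumptions made for illustration; check the header of your generated listfile.csv before relying on them.

```python
import csv
import io
from typing import List, Tuple

# Made-up rows that follow the "icu_stay, period_length, label" layout.
SAMPLE_LISTFILE = """stay,period_length,y_true
101_episode1_timeseries.csv,48.0,0
102_episode1_timeseries.csv,48.0,1
102_episode2_timeseries.csv,48.0,0
"""


def parse_listfile(text: str) -> List[Tuple[str, float, int]]:
    """Return (stay file name, period length in hours, label) for each row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["stay"], float(row["period_length"]), int(row["y_true"]))
            for row in reader]


samples = parse_listfile(SAMPLE_LISTFILE)
positive_rate = sum(label for _, _, label in samples) / len(samples)
print(len(samples), round(positive_rate, 3))
```

Checking the positive rate this way is a quick sanity test of class balance before training.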


The project does not work out of the box, so we downloaded the source code and modified it inside this repo under the [mimic3-benchmarks](./mimic3-benchmarks) folder.
To simplify the process, we created the script `./build_benchmark_data.sh` to run all the required time series tasks.



In [14]:
if PREPROCESS_BENCHMARKS:
    command = ["./build_benchmark_data.sh"]

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, shell=True)

    for line in process.stdout:
        print(line, end='')
    process.wait()

    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")


### ClinicalNotesICU
Helps process medical notes and split them into train and test sets
 [ClinicalNotesICU](https://github.com/kaggarwal/ClinicalNotesICU.git)

Similar to the mimic3-benchmarks, this repo contains a set of scripts that take the RAW mimic CSVs and process the clinical notes for the previously generated train and test datasets.

- extract_notes.py
Uses NOTEEVENTS.csv and the previously generated train and test sets to extract the notes within the first 48 hours of the stay and saves them in its own train and test data directories.

- extract_T0.py
Uses stays.csv and events.csv to extract each episode's start time and saves the start times into a binary pkl file.

The project does not work out of the box, so we downloaded the source code and modified it inside this repo under the [ClinicalNotesICU](./ClinicalNotesICU) folder.

To simplify the process, we created the script `./extract_med_notes.sh` to run all the required tasks for clinical notes.


In [15]:
if PREPROCESS_CLINICAL_NOTES:
    command = ["./extract_med_notes.sh"]

    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, shell=True)

    for line in process.stdout:
        print(line, end='')
    process.wait()

    if process.returncode != 0:
        print(f"Command failed with return code {process.returncode}")

# Preprocess time series For Mimic Multimodal 
The next step is to discretize and normalize the timeseries data, as well as link the clinical notes with their corresponding timestamps.

The [paper's repo](https://github.com/XZhang97666/MultimodalMIMIC.git) provides a preprocessing script to work on this task.

After running the preprocessing steps we save the following PKLs to be used by the model: 
```
mean_std.pkl 
norm_ts_test.pkl
norm_ts_train.pkl
norm_ts_val.pkl
testp2x_data.pkl
trainp2x_data.pkl
ts_test.pkl
ts_train.pkl
ts_val.pkl
valp2x_data.pkl
```
The project does not work out of the box, so we use it only partially by calling some of its functions from this notebook. The downloaded source code and our modifications are inside this repo under the [MultimodalMIMIC](./MultimodalMIMIC) folder.

# Configuring imports and directories

In [16]:
from mimic3benchmark.readers import InHospitalMortalityReader
from MultimodalMIMIC.preprocessing import Discretizer_multi
from typing import Optional, Any
from readers import Reader
import gzip
from mimic3models.preprocessing import Normalizer
from MultimodalMIMIC.preprocessing import extract_irregular
from MultimodalMIMIC.preprocessing import mean_std
from MultimodalMIMIC.preprocessing import normalize
from MultimodalMIMIC.text_utils import TextReader
from MultimodalMIMIC.preprocessing import merge_text_ts


GENERATE_PREPROCESSED_PKL = True

# Paths for data
ihm_data_path = f"{BENCHMARKS_ROOT_PATH}/data/in-hospital-mortality"
ihm_train_data_path = f"{ihm_data_path}/train"
ihm_test_data_path = f"{ihm_data_path}/test"
discretizer_config_path = f"{MULTI_MODAL_MIMIC_PATH}/Data/irregular/discretizer_config.json"
channel_info_path = f"{MULTI_MODAL_MIMIC_PATH}/Data/irregular/channel_info.json"
textdata_fixed = f"{BENCHMARKS_ROOT_PATH}/data/root/text_fixed/train/"
text_start_time_path = f"{BENCHMARKS_ROOT_PATH}/data/root/text_fixed/starttime.pkl"
test_textdata_fixed = f"{BENCHMARKS_ROOT_PATH}/data/root/text_fixed/test/"
test_text_start_time_path = f"{BENCHMARKS_ROOT_PATH}/data/root/text_fixed/test_starttime.pkl"
ihm_discrete_save_path = f"{MULTI_MODAL_MIMIC_PATH}/Data/ihm/"

# Modify this to take only a subset of the full data; None takes the full data
n_samples_elements = 1000

mortality_period = 48
timestep = 1.0
imputation = "previous"
dataset_types = ["train", "val", "test"]

normalizer_state_file_path = f'{BENCHMARKS_ROOT_PATH}/mimic3models/in_hospital_mortality/ihm_ts{timestep}.input_str-{imputation}.start_time-zero.normalizer'


### Reading the data
First, we define the InHospitalMortalityReader instances that give us easy access to the split datasets generated by mimic3-benchmarks. The reader provides helper functions to read the time series samples and labels for each patient and map them to a dictionary.

In [17]:
if GENERATE_PREPROCESSED_PKL:
    print(ihm_train_data_path)
    train_reader = InHospitalMortalityReader(dataset_dir=ihm_train_data_path,
                                             listfile=os.path.join(ihm_train_data_path, 'listfile.csv'),
                                             period_length=mortality_period)
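    # NOTE: val_reader reads the same directory and listfile as train_reader, so the
    # "val" split produced below contains the same samples as the train split.
    val_reader = InHospitalMortalityReader(dataset_dir=ihm_train_data_path,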
                                           listfile=os.path.join(ihm_train_data_path, 'listfile.csv'),
                                           period_length=mortality_period)
    
    test_reader = InHospitalMortalityReader(dataset_dir=ihm_test_data_path,
                                            listfile=os.path.join(ihm_test_data_path, 'listfile.csv'),
                                            period_length=mortality_period)

/media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/mimic3-benchmarks/data/in-hospital-mortality/train


In [18]:

def save_compressed_pkl_gz(data_to_dump: Any, save_name: str):
    print("saving and compressing:", save_name)
    with gzip.open(f'{save_name}.pkl.gz', 'wb') as file:
        pickle.dump(data_to_dump, file)


def load_compressed_pkl(full_file_name: str) -> Any:
    with gzip.open(full_file_name, 'rb') as file:
        data_loaded = pickle.load(file)
    return data_loaded


def read_chunk(reader, chunk_size):
    chunk_data = {}
    for _ in tqdm(range(chunk_size), desc="reading data"):
        ret = reader.read_next()
        for k, v in ret.items():
            if k not in chunk_data:
                chunk_data[k] = []
            chunk_data[k].append(v)
    chunk_data["header"] = chunk_data["header"][0]
    return chunk_data


def discretize_and_save_data(reader: Reader, discretizer: Discretizer_multi, save_path: str,
                             partial_n_samples: Optional[int] = None, save_name: str = "ts_data",
                             compress_pkl: bool = False):
    n_samples = reader.get_number_of_examples()
    if partial_n_samples:
        n_samples = partial_n_samples
    ret = read_chunk(reader, n_samples)
    irg_data = ret["X"]
    ts = ret["t"]
    labels = ret["y"]
    discrete_names = ret["name"]

    reg_data = []
    for X, t in tqdm(zip(irg_data, ts), total=len(irg_data), desc=f"discretizing data "):
        transformed_data = discretizer.transform(X, end=t)[0]
        reg_data.append(transformed_data)

    os.makedirs(save_path, exist_ok=True)
    save_full_path = save_path + save_name
    if compress_pkl:
        save_compressed_pkl_gz((irg_data, reg_data, labels, discrete_names), save_full_path)
    else:
        print("Saving", f"{save_full_path}.pkl")
        with open(f"{save_full_path}.pkl", 'wb') as file:
            pickle.dump((irg_data, reg_data, labels, discrete_names), file)


### Discretize temporal data and add imputation

To create the embeddings for the proposed model, we first need to discretize the data and impute the missing values. In our case, we impute each missing value with the previous value of the series.

The Discretizer_multi transforms the irregular data into samples at each timestep and fills in the missing data with the chosen imputation strategy across all time-based features.

![Discretizer image](.img/only_discretize.png)
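
The discretization and previous-value imputation described above can be sketched for a single channel as follows. This is not the Discretizer_multi implementation: the function name, the rule that the last observation in a bin wins, and the `default` used before the first observation are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def discretize_previous(times: Sequence[float], values: Sequence[float],
                        n_bins: int, timestep: float = 1.0,
                        default: float = 0.0) -> Tuple[List[float], List[int]]:
    """Bin one irregular channel into `n_bins` steps of `timestep` hours.

    The last observation inside a bin wins. Empty bins reuse the previous
    value, or `default` if nothing has been observed yet. The mask marks
    bins that contain a real observation.
    """
    latest: List[Optional[float]] = [None] * n_bins
    for t, v in sorted(zip(times, values)):
        if t < 0:
            continue  # ignore measurements taken before the window starts
        idx = math.floor(t / timestep)
        if idx < n_bins:
            latest[idx] = v

    filled: List[float] = []
    mask: List[int] = []
    previous = default
    for value in latest:
        if value is not None:
            previous = value
            mask.append(1)
        else:
            mask.append(0)
        filled.append(previous)
    return filled, mask


# Made-up heart-rate readings at irregular hours, binned into 6 hourly steps.
filled, mask = discretize_previous([0.5, 2.2, 2.9, 5.0],
                                   [80.0, 85.0, 90.0, 70.0], n_bins=6)
print(filled)
print(mask)
```

Keeping the mask alongside the imputed values preserves the information about which steps were actually observed.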


In [19]:
if GENERATE_PREPROCESSED_PKL:
    discretizer_multi = Discretizer_multi(
        impute_strategy='previous',
        store_masks=True,
        start_time='zero',
        config_path=discretizer_config_path,
        channel_path=channel_info_path
    )
    discretizer_header = discretizer_multi.transform(train_reader.read_example(0)["X"])[1].split(',')
    cont_channels = [i for (i, x) in enumerate(discretizer_header) if x.find("->") == -1]

    
    print("discretize and save train")
    discretize_and_save_data(train_reader, discretizer_multi, ihm_discrete_save_path, n_samples_elements, "ts_train")
    print("discretize and save val")
    discretize_and_save_data(val_reader, discretizer_multi, ihm_discrete_save_path, n_samples_elements, "ts_val")
    print("discretize and save test")
    discretize_and_save_data(test_reader, discretizer_multi, ihm_discrete_save_path, n_samples_elements, "ts_test")

discretize and save train


reading data: 100%|██████████| 1000/1000 [00:00<00:00, 1053.41it/s]
discretizing data : 100%|██████████| 1000/1000 [00:02<00:00, 388.73it/s]


Saving /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/ts_train.pkl
discretize and save val


reading data: 100%|██████████| 1000/1000 [00:00<00:00, 1475.51it/s]
discretizing data : 100%|██████████| 1000/1000 [00:02<00:00, 407.18it/s]


Saving /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/ts_val.pkl
discretize and save test


reading data: 100%|██████████| 1000/1000 [00:00<00:00, 1835.17it/s]
discretizing data : 100%|██████████| 1000/1000 [00:01<00:00, 569.37it/s]


Saving /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/ts_test.pkl


### Paddings and masks
After discretizing, we need to apply padding and create masks for all features. For this we use the function extract_irregular from MultimodalMIMIC, which creates the padded irregular and mask arrays and saves them for later use.
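
The general idea behind padding and masking can be sketched as below. This is not the extract_irregular implementation; the function name `pad_with_mask` and the zero pad value are assumptions, and the sample sequences are made up.

```python
from typing import List, Sequence, Tuple


def pad_with_mask(sequences: Sequence[Sequence[float]],
                  pad_value: float = 0.0) -> Tuple[List[List[float]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build a matching
    mask where 1 marks a real value and 0 marks padding."""
    max_len = max((len(s) for s in sequences), default=0)
    padded: List[List[float]] = []
    masks: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# Two patients with a different number of observations for one feature.
padded, masks = pad_with_mask([[1.0, 2.0, 3.0], [4.0]])
print(padded)
print(masks)
```

The mask lets the model ignore padded positions instead of treating the pad value as a measurement.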

In [20]:
if GENERATE_PREPROCESSED_PKL:
    for dataset_type in dataset_types:
        print(f"Extracting {dataset_type} irregular data", flush=True)
        in_extract_data_path = ihm_discrete_save_path + 'ts_' + dataset_type + '.pkl'
        out_extract_data_path = ihm_discrete_save_path + 'ts_' + dataset_type + '.pkl'
        extract_irregular(in_extract_data_path, out_extract_data_path, channel_info_path, discretizer_config_path)


Extracting train irregular data
Saving: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/ts_train.pkl
Extracting val irregular data
Saving: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/ts_val.pkl
Extracting test irregular data
Saving: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/ts_test.pkl


In [21]:
if GENERATE_PREPROCESSED_PKL:
    mean_std(ihm_discrete_save_path + 'ts_train.pkl', ihm_discrete_save_path + 'mean_std.pkl')


Saving: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/mean_std.pkl


### Normalizing timeseries data
Now we normalize our data as x = (x - means[f_idx]) / stds[f_idx]. For this, we leverage the Normalizer provided by the mimic3-benchmarks repo.
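
The normalization formula can be illustrated with a small, self-contained sketch. It assumes statistics are fitted on the training rows only and then reused for other splits, that the population standard deviation is used, and that a zero standard deviation is replaced by 1.0 to avoid division by zero. These choices and the function names are assumptions, not the repo's behavior.

```python
import math
from typing import List, Sequence, Tuple


def fit_mean_std(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Compute the per-feature mean and (population) standard deviation."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        variance = sum((r[j] - means[j]) ** 2 for r in rows) / n
        std = math.sqrt(variance)
        stds.append(std if std > 0 else 1.0)  # guard against constant features
    return means, stds


def apply_zscore(rows: Sequence[Sequence[float]],
                 means: Sequence[float], stds: Sequence[float]) -> List[List[float]]:
    """Apply x = (x - means[f_idx]) / stds[f_idx] to every feature."""
    return [[(x - means[j]) / stds[j] for j, x in enumerate(r)] for r in rows]


# Made-up rows with two features; the second feature is constant in train.
train_rows = [[1.0, 10.0], [3.0, 10.0]]
test_rows = [[2.0, 12.0]]
means, stds = fit_mean_std(train_rows)
norm_train = apply_zscore(train_rows, means, stds)
norm_test = apply_zscore(test_rows, means, stds)
print(norm_train, norm_test)
```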


In [22]:
if GENERATE_PREPROCESSED_PKL:
    normalizer = Normalizer(fields=cont_channels)
    normalizer.load_params(normalizer_state_file_path)
    
    for dataset_type in dataset_types:
        print(f"Normalizing {dataset_type} times data", flush=True)
        normalize(ihm_discrete_save_path + 'ts_' + dataset_type + '.pkl',
                  ihm_discrete_save_path + 'norm_ts_' + dataset_type + '.pkl',
                  ihm_discrete_save_path + 'mean_std.pkl')

Normalizing train times data
Saving: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/norm_ts_train.pkl
Normalizing val times data
Saving: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/norm_ts_val.pkl
Normalizing test times data
Saving: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/norm_ts_test.pkl


### Preparing the Text data
Finally, we prepare the text data within the prediction period (48 hours in our case) and generate a JSON containing each note as well as its time until the end of the period. For this we use the TextReader helper from the MultimodalMIMIC module.
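
The "time until the end of the period" can be sketched as follows. This is not the TextReader implementation: the function name, the inclusive window bounds, and the sample notes and timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def notes_within_period(start: datetime,
                        notes: Sequence[Tuple[datetime, str]],
                        period_hours: float = 48.0) -> List[Tuple[str, float]]:
    """Keep notes charted within `period_hours` of `start` and return
    (text, hours remaining until the end of the period), sorted by time."""
    kept = []
    for charttime, text in sorted(notes):
        elapsed = (charttime - start).total_seconds() / 3600.0
        if 0.0 <= elapsed <= period_hours:
            kept.append((text, period_hours - elapsed))
    return kept


# Made-up stay start and notes; the last note falls outside the 48-hour window.
stay_start = datetime(2100, 1, 1, 8, 0)
raw_notes = [
    (stay_start + timedelta(hours=30), "Nursing progress note"),
    (stay_start + timedelta(hours=6), "Admission note"),
    (stay_start + timedelta(hours=50), "Discharge planning note"),
]
prepared = notes_within_period(stay_start, raw_notes)
print(prepared)
```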

In [23]:

if GENERATE_PREPROCESSED_PKL:
    for dataset_type in dataset_types:
        print(f"Preparing  {dataset_type} text data", flush=True)
    
        with open(ihm_discrete_save_path + 'norm_ts_' + dataset_type + '.pkl', 'rb') as f:
            tsdata = pickle.load(f)
    
        names = [data['name'] for data in tsdata]
    
        if (dataset_type == 'train') or (dataset_type == 'val'):
            text_reader = TextReader(textdata_fixed, text_start_time_path)
        else:
            text_reader = TextReader(test_textdata_fixed, test_text_start_time_path)
    
        data_text, data_times, data_time = text_reader.read_all_text_append_json(names, mortality_period)
        merge_text_ts(data_text, data_times, data_time, tsdata, mortality_period,
                      ihm_discrete_save_path + dataset_type + 'p2x_data.pkl')

Preparing  train text data
Suceed Merging:  750
Missing Merging:  250
File dumped at: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/trainp2x_data.pkl
Preparing  val text data
Suceed Merging:  750
Missing Merging:  250
File dumped at: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/valp2x_data.pkl
Preparing  test text data
Suceed Merging:  762
Missing Merging:  238
File dumped at: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/testp2x_data.pkl


### Preprocessing results
After running the preprocessing we will obtain the pkl files that will be used by the model for training and evaluation. The following cell will show the structure of the main pkl files.



In [42]:
def display_p2x_pkl_structure():
    print("Structure of the p2x pkl files")
    with open(f"{PROJECT_BASE_PATH}/MultimodalMIMIC/Data/ihm/testp2x_data.pkl", 'rb') as file:
        p2x_data = pickle.load(file)
        print(f"Keys p2x: {p2x_data[0].keys()}")
        
    with open(f"{PROJECT_BASE_PATH}/MultimodalMIMIC/Data/ihm/norm_ts_test.pkl", 'rb') as file:
        p2x_data = pickle.load(file)
        print(f"Keys norm: {p2x_data[0].keys()}")

    with open(f"{PROJECT_BASE_PATH}/MultimodalMIMIC/Data/ihm/mean_std.pkl", 'rb') as file:
        p2x_data = pickle.load(file)
        print(f"Keys mean_std tuple type : ({type(p2x_data[0])},{type(p2x_data[1])})")
        

display_p2x_pkl_structure()

Structure of the p2x pkl files
Keys p2x: dict_keys(['reg_ts', 'name', 'label', 'ts_tt', 'irg_ts', 'irg_ts_mask', 'text_data', 'text_time_to_end'])
Keys norm: dict_keys(['reg_ts', 'name', 'label', 'ts_tt', 'irg_ts', 'irg_ts_mask'])
Keys mean_std tuple type : (<class 'list'>,<class 'list'>)
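
A quick way to sanity-check a loaded record against the keys printed above is sketched below. The key sets are copied from that output; the helper name `missing_keys` and the dummy records are hypothetical.

```python
from typing import Dict, Iterable, Set

# Keys copied from the printed structure of the p2x and norm pkl files.
P2X_KEYS = {"reg_ts", "name", "label", "ts_tt", "irg_ts", "irg_ts_mask",
            "text_data", "text_time_to_end"}
NORM_KEYS = {"reg_ts", "name", "label", "ts_tt", "irg_ts", "irg_ts_mask"}


def missing_keys(record: Dict[str, object], expected: Iterable[str]) -> Set[str]:
    """Return the expected keys that are absent from `record`."""
    return set(expected) - set(record)


# Dummy records standing in for entries loaded from the pkl files.
norm_record = {key: None for key in NORM_KEYS}
p2x_record = {key: None for key in P2X_KEYS}
print(missing_keys(p2x_record, P2X_KEYS))
print(sorted(missing_keys(norm_record, P2X_KEYS)))
```

The second call shows that the text fields are the only difference between a normalized time series record and a merged p2x record.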


# Train Model

## Import required modules

In [25]:
import sys

from tensorboardX import SummaryWriter

import warnings
import logging

logger = logging.getLogger(__name__)
sys.path.insert(0, 'MultimodalMIMIC')
from GlobalConfigs import *
from MultimodalMIMIC.model import *
from MultimodalMIMIC.train import *
from MultimodalMIMIC.checkpoint import *
from accelerate import Accelerator
from MultimodalMIMIC.interp import *

## Set arguments

This step sets the parameters that control how the model is trained and evaluated.

Here we train the model for only 3 epochs for demonstration.

In [26]:
from MultimodalMIMIC.util import parse_args
ihm_discrete_save_path = f'{MULTI_MODAL_MIMIC_PATH}/Data/ihm' if ihm_discrete_save_path is None else ihm_discrete_save_path

parser = parse_args()
args = parser.parse_args(['--num_train_epochs','3',
                         '--train_batch_size','2',
                         '--eval_batch_size','8',
                         '--gradient_accumulation_steps','16',
                         '--num_update_bert_epochs','2',
                         '--notes_order','Last',
                         '--max_length','512',
                         '--output_dir','run/TS_Text',
                         '--embed_dim','128',
                         '--model_name','bioLongformer',
                         '--file_path',f'{ihm_discrete_save_path}',
                         '--mixup_level','batch',
                         '--fp16',
                         '--irregular_learn_emb_text',
                         '--irregular_learn_emb_ts',
                         '--reg_ts'])


print(vars(args))

{'task': 'ihm', 'file_path': '/media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/', 'output_dir': 'run/TS_Text', 'tensorboard_dir': None, 'seed': 42, 'mode': 'train', 'modeltype': 'TS_Text', 'eval_score': ['auc', 'auprc', 'f1'], 'num_labels': 2, 'max_length': 512, 'pad_to_max_length': False, 'model_path': None, 'train_batch_size': 2, 'eval_batch_size': 8, 'num_update_bert_epochs': 2, 'num_train_epochs': 3, 'txt_learning_rate': 5e-05, 'ts_learning_rate': 0.0004, 'gradient_accumulation_steps': 16, 'weight_decay': 0.01, 'lr_scheduler_type': 'linear', 'pt_mask_ratio': 0.15, 'mean_mask_length': 3, 'chunk': False, 'chunk_type': 'sent_doc_pos', 'warmup_proportion': 0.1, 'kernel_size': 1, 'num_heads': 8, 'layers': 3, 'cross_layers': 3, 'embed_dim': 128, 'irregular_learn_emb_ts': True, 'irregular_learn_emb_text': True, 'reg_ts': True, 'tt_max': 48, 'embed_time': 64, 'ts_to_txt': False, 'txt_to_ts': False, 'dropout': 0.1, 'model_name': 'bioLongformer', 'num_of_notes': 5,
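
The arguments above set `train_batch_size` to 2 and `gradient_accumulation_steps` to 16. The sketch below shows the arithmetic these two values imply, assuming a single process, that the last partial batch is kept, and that a trailing partial accumulation window still triggers an optimizer step. The actual training loop may behave differently, and the helper names are hypothetical.

```python
import math


def effective_batch_size(train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_processes: int = 1) -> int:
    """Number of samples that contribute to one optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def optimizer_steps_per_epoch(n_samples: int, train_batch_size: int,
                              gradient_accumulation_steps: int) -> int:
    """Optimizer updates per epoch under the assumptions stated above."""
    batches = math.ceil(n_samples / train_batch_size)
    return math.ceil(batches / gradient_accumulation_steps)


# 750 is the number of training samples that merged successfully earlier.
batch = effective_batch_size(2, 16)
steps = optimizer_steps_per_epoch(750, 2, 16)
print(batch, steps)
```

A small per-device batch with accumulation keeps memory use low on a single GPU while still averaging gradients over a larger group of samples.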

## Set up training environment

Based on the arguments given above, set up the training environment.

In [27]:
if args.fp16:
    args.mixed_precision = "fp16"
else:
    args.mixed_precision = "no"
accelerator = Accelerator(mixed_precision=args.mixed_precision, cpu=args.cpu)

device = accelerator.device
print(f'device: {device}')
os.makedirs(args.output_dir, exist_ok=True)
if args.tensorboard_dir is not None:
    writer = SummaryWriter(args.tensorboard_dir)
else:
    writer = None

warnings.filterwarnings('ignore')
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

if args.seed is not None:
    set_seed(args.seed)

output_path = make_save_dir(args)

if args.seed == 0:
    copy_file(args.ck_file_path + 'model/', src=os.getcwd())

device: cuda




run/TS_Text/ihm/TS_Text/TS_48/Atten/Text_48/bioLongformer/512/cross_attn3/irregular_TS_64/irregular_Text_64/5e-05_2_3_0.0004_3_8_128_1_2/
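
Two of the arguments printed above interact: with `train_batch_size` set to 2 and `gradient_accumulation_steps` set to 16, each optimizer update sees gradients accumulated over 32 samples. The sketch below only performs that arithmetic. The helper names are hypothetical, and it assumes a single device, the 1,000-sample demo subset mentioned in the conclusion, and that a trailing partial accumulation window still triggers an update.

```python
import math

def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

def updates_per_epoch(num_samples: int, per_device_batch: int, accumulation_steps: int) -> int:
    """Optimizer updates per epoch, assuming a partial final window still steps."""
    batches = math.ceil(num_samples / per_device_batch)
    return math.ceil(batches / accumulation_steps)

# train_batch_size=2 and gradient_accumulation_steps=16 come from the argument dump;
# 1000 samples is an assumption based on the demo subset size.
print(effective_batch_size(2, 16))     # 32
print(updates_per_epoch(1000, 2, 16))  # 32
```

A small per-step batch keeps memory use low for the long text inputs, while accumulation recovers a larger effective batch.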


## Load data

Load the training, validation, and test datasets.

In [28]:
from MultimodalMIMIC.data import data_perpare
from MultimodalMIMIC.util import loadBert

if args.mode == 'train':
    if 'Text' in args.modeltype:
        BioBert, BioBertConfig, tokenizer = loadBert(args, device)
    else:
        BioBert, tokenizer = None, None
    # data_perpare (sic) is the helper's name as imported from MultimodalMIMIC.data
    train_dataset, train_sampler, train_dataloader = data_perpare(args, 'train', tokenizer)
    val_dataset, val_sampler, val_dataloader = data_perpare(args, 'val', tokenizer)
    _, _, test_data_loader = data_perpare(args, 'test', tokenizer)

Some weights of LongformerModel were not initialized from the model checkpoint at yikuan8/Clinical-Longformer and are newly initialized: ['longformer.pooler.dense.bias', 'longformer.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Franco Trying to load: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/trainp2x_data.pkl
Using /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/trainp2x_data.pkl
Franco Trying to load: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/valp2x_data.pkl
Using /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/valp2x_data.pkl
Franco Trying to load: /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/testp2x_data.pkl
Using /media/ftrujillo/FRD/Projects/UIUC/DLH/CS598_Final/MultimodalMIMIC/Data/ihm/testp2x_data.pkl


## Load model

Load the model from the model Python file.
For the full model implementation, see the project repo on GitHub.

Depending on the `modeltype` argument, the code below loads one of two models.

- MULTCrossModel:
  - This is a multimodal cross model that combines text and time series data. Depending on the configuration, it may use Transformer encoders to process the time series and attention mechanisms to process the text embeddings. The model integrates text and time series data at different levels using fusion techniques such as self- and cross-attention or cross-modal fusion. The output depends on the task (`ihm` or `pheno`), and the loss function is chosen accordingly.
- TSMixed:
  - This is a mixed model for time series data only. It combines interpolation techniques, such as S_Interp and Cross_Interp, with Transformer-based encoders or other models such as LSTM or CNN. It handles mixed-level data, such as batch-level, sequence-level, or feature-level mixup, and outputs predictions based on the task (`ihm` or `pheno`).

![high level arch with desc](.img/high_arch_w_desc.png)
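
To make the cross-attention idea in the MULTCrossModel description concrete, here is a minimal sketch of single-head scaled dot-product attention in which vectors from one modality (for example, note embeddings) attend over vectors from the other (for example, time series steps). It is not the repo's implementation: the configuration above uses `num_heads` 8 and `cross_layers` 3, whereas this sketch uses one head, one layer, no learned projections, and plain Python lists so it runs without dependencies. The function names and toy inputs are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: d-dim vectors from one modality (e.g. note embeddings)
    keys, values: vectors from the other modality (e.g. time series steps)
    Returns one fused vector per query.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Toy inputs: 2 "note" vectors attend over 3 "time series" vectors (d = 2).
text = [[1.0, 0.0], [0.0, 1.0]]
ts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(text, ts, ts)
print(len(fused), len(fused[0]))  # 2 2
```

Each output row is a weighted average of the time series vectors, so the text side receives a summary of the steps most similar to each note. Swapping the roles of the two inputs gives the opposite direction of fusion.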

In [29]:
if 'Text' in args.modeltype:
    model = MULTCrossModel(args=args, device=device, orig_d_ts=17, orig_reg_d_ts=34, orig_d_txt=768,
                           ts_seq_num=args.tt_max, text_seq_num=args.num_of_notes, Biobert=BioBert)
else:
    model = TSMixed(args=args, device=device, orig_d_ts=17, orig_reg_d_ts=34, ts_seq_num=args.tt_max)

## Train

This part of the code trains the model.
Based on the `modeltype` argument above, it sets up the optimizer for the chosen model type (`TS`, `Text`, or `TS_Text`).

In [30]:
if args.modeltype == 'TS':
    optimizer = torch.optim.Adam(model.parameters(), lr=args.ts_learning_rate)
elif args.modeltype in ('Text', 'TS_Text'):
    # Two parameter groups: non-BERT parameters use the default lr (ts_learning_rate),
    # BERT parameters use the smaller txt_learning_rate.
    optimizer = torch.optim.Adam([
        {'params': [p for n, p in model.named_parameters() if 'bert' not in n]},
        {'params': [p for n, p in model.named_parameters() if 'bert' in n], 'lr': args.txt_learning_rate}
    ], lr=args.ts_learning_rate)
else:
    raise ValueError("Unknown modeltype in optimizer.")

model, optimizer, train_dataloader, val_dataloader, test_data_loader = \
    accelerator.prepare(model, optimizer, train_dataloader, val_dataloader, test_data_loader)

trainer_irg(model=model, args=args, accelerator=accelerator, train_dataloader=train_dataloader,
            dev_dataloader=val_dataloader, test_data_loader=test_data_loader, device=device,
            optimizer=optimizer, writer=writer)

  0%|          | 0/3 [00:00<?, ?it/s]

bert update at epoch 0
0 True




Current auc 0.8144493177387914
Best auc 0.8144493177387914
Current auprc 0.4352551154055549
Best auprc 0.4352551154055549
Current f1 0.34375
Best f1 0.34375
1 False




Current auc 0.8369534948482317
Best auc 0.8369534948482317
Current auprc 0.48688579057417025
Best auprc 0.48688579057417025
Current f1 0.47750865051903113
Best f1 0.47750865051903113
bert update at epoch 2
2 True




Current auc 0.8585700362016151
Best auc 0.8585700362016151
Current auprc 0.532729023433215
Best auprc 0.532729023433215
Current f1 0.21935483870967742
Best f1 0.47750865051903113
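
In the log above, the best F1 stays at its epoch 1 value while AUC and AUPRC keep improving at epoch 2, which suggests that each metric's best score is tracked independently. The sketch below shows one way such bookkeeping could work. `update_best` is a hypothetical helper, not a function from the repo, and the scores are the logged values rounded to four decimals.

```python
def update_best(best, current, epoch):
    """Update per-metric best scores in place and return the metrics that improved."""
    improved = []
    for name, value in current.items():
        if name not in best or value > best[name][0]:
            best[name] = (value, epoch)
            improved.append(name)
    return improved

# Scores copied (rounded) from the training log, one dict per epoch.
history = [
    {"auc": 0.8144, "auprc": 0.4353, "f1": 0.3438},
    {"auc": 0.8370, "auprc": 0.4869, "f1": 0.4775},
    {"auc": 0.8586, "auprc": 0.5327, "f1": 0.2194},
]

best = {}
for epoch, scores in enumerate(history):
    improved = update_best(best, scores, epoch)
    print(epoch, improved)

print(best["f1"])  # (0.4775, 1)
```

Tracking metrics separately matters here because the epoch with the best AUC (epoch 2) is not the epoch with the best F1 (epoch 1). The checkpoint path printed by the evaluation cell contains an `f1/` component, which is consistent with checkpoints being saved per metric, though that is an inference from the path rather than something confirmed here.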





# Evaluate

In [31]:
eval_test(args, model, test_data_loader, device)

run/TS_Text/ihm/TS_Text/TS_48/Atten/Text_48/bioLongformer/512/cross_attn3/irregular_TS_64/irregular_Text_64/5e-05_2_3_0.0004_3_8_128_1_2/f1/42.pth.tar


  1%|          | 1/125 [00:00<01:53,  1.09it/s]Input ids are automatically padded from 504 to 512 to be a multiple of `config.attention_window`: 512
 17%|█▋        | 21/125 [00:18<01:31,  1.13it/s]Input ids are automatically padded from 494 to 512 to be a multiple of `config.attention_window`: 512
 70%|███████   | 88/125 [01:18<00:32,  1.13it/s]Input ids are automatically padded from 478 to 512 to be a multiple of `config.attention_window`: 512
100%|██████████| 125/125 [01:52<00:00,  1.12it/s]
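
The `Input ids are automatically padded ...` messages in the output above report that each input is padded so that its length becomes a multiple of `config.attention_window` (512 here). The arithmetic can be sketched as follows; `padded_length` is a hypothetical helper for illustration, not a library function.

```python
def padded_length(seq_len: int, window: int) -> int:
    """Smallest multiple of `window` that is greater than or equal to seq_len."""
    return -(-seq_len // window) * window  # ceiling division without floats

# Sequence lengths taken from the messages in the evaluation output.
for n in (504, 494, 478):
    print(n, "->", padded_length(n, 512))

# A hypothetical longer input would be padded to the next multiple.
print(513, "->", padded_length(513, 512))
```

Since `max_length` is 512 in the arguments above, every note in this run ends up at exactly 512 tokens after padding.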


## Load evaluation result

The performance of the proposed method and the baselines is measured by F1, AUPR (reported as `auprc` below), and AUROC (reported as `auc`).

In [32]:
for result_file in os.listdir(args.ck_file_path):
    if 'result.pkl' in result_file:
        eval_result_path = os.path.join(args.ck_file_path, result_file)
        with open(eval_result_path, 'rb') as f:
            evaluation_result = pickle.load(f)
        print(evaluation_result)

{42: {'auc': {'val': 0.8369534948482317, 'test': 0.7976461038961039}, 'auprc': {'val': 0.48688579057417025, 'test': 0.37045235396337717}, 'f1': {'val': 0.47750865051903113, 'test': 0.3651452282157676}}}
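
For readers unfamiliar with the three metrics in the result above, the sketch below computes them on a four-sample toy example in plain Python. It is not the evaluation code used by the project. The toy labels and scores are made up for illustration, the 0.5 decision threshold for F1 is an assumption, and the AUPR here is the step-wise average precision, which assumes no tied scores and at least one positive label.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each positive
    return total / hits

# Toy data (not from the project): labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(round(auroc(y_true, scores), 4))              # 0.75
print(round(average_precision(y_true, scores), 4))  # 0.8333
print(round(f1_score(y_true, y_pred), 4))           # 0.6667
```

AUROC and AUPR depend only on how the scores rank the samples, while F1 depends on the chosen threshold. This is one reason F1 can move differently from the other two metrics between epochs.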


# Conclusion

Because this demonstration trained on only 1,000 samples for three epochs (`num_train_epochs: 3` in the arguments above) and filled entries that had missing values with dummy data, we could not reproduce the results reported in the paper.

We plan to further refine the code and run it on the full dataset for more epochs to see whether we can match the paper's results.