### Early Sepsis Onset Detection Setup

This notebook outlines the methodology for establishing a dataset aimed at early sepsis onset detection, as described in *Section 3.3, "Early Sepsis Onset Prediction Setup,"* of our paper. The setup process is divided into three main stages:

1. **Feature Extraction:** Extracting and preprocessing input feature data (currently focused on vital signs, with plans to add more features in the future).
2. **Instance Construction:** Assigning sepsis labels for each input instance.
3. **Data Split:** Implementing a 5-fold cross-validation at the patient level (subject_id) to prevent data leakage.
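
As a minimal sketch of what a patient-level split means in stage 3, the snippet below assigns every row of a toy table to one of five folds keyed on `subject_id`, so that all nights of a patient land in the same fold. The helper name `assign_patient_folds`, the `fold` column, and the round-robin assignment are illustrative assumptions, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd

def assign_patient_folds(df, n_splits=5, seed=0, id_col="subject_id"):
    """Hypothetical helper: return a copy of df with a per-patient 'fold' column."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(df[id_col].unique())
    # Round-robin over shuffled patients keeps folds balanced in patients (not rows)
    fold_of = {pid: i % n_splits for i, pid in enumerate(shuffled)}
    return df.assign(fold=df[id_col].map(fold_of))

# Toy table: several nights per patient
toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3, 3, 4, 5, 6, 7],
    "Night":      [2, 3, 2, 2, 3, 4, 2, 2, 2, 2],
})
folds = assign_patient_folds(toy)

# Every patient appears in exactly one fold, so no patient leaks across folds
assert (folds.groupby("subject_id")["fold"].nunique() == 1).all()
```

Splitting on rows instead of patients would let nights from the same patient appear in both training and validation folds, which is the leakage this stage is meant to avoid.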

Following the approach proposed by [Stewart et al. 2023](https://www.computer.org/csdl/proceedings-article/bigdata/2023/10386180/1TUPtOpspXy), we implement a nightly detection setup tailored to Intensive Care Unit (ICU) needs. This setup uses data recorded during **nighttime hours, from 6 p.m. to 6 a.m. the following day**. A positive label is assigned only to an instance whose sepsis onset occurs within the following 24 hours.
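
To make the nightly window concrete, here is a small sketch that maps a timestamp to the calendar date on which its night began. The function name `night_of` is hypothetical. It assumes, as the preprocessing code in this notebook does, that hours 0 through 6 inclusive belong to the night that started the previous evening.

```python
from datetime import datetime, date, timedelta

def night_of(ts, start_hour=18, end_hour=6):
    """Hypothetical helper: date on which the night containing `ts` began, or None for daytime."""
    if ts.hour >= start_hour:
        return ts.date()                         # evening part of the night
    if ts.hour <= end_hour:
        return (ts - timedelta(days=1)).date()   # early morning belongs to the previous evening
    return None                                  # daytime record, outside the window

print(night_of(datetime(2177, 8, 29, 19, 0)))   # evening record
print(night_of(datetime(2177, 8, 30, 0, 30)))   # after midnight, same night as above
print(night_of(datetime(2177, 8, 30, 12, 0)))   # daytime
```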

**Advantages of Using Nighttime Data for Sepsis Detection:**

1. **Reduced External Interference**: Nighttime hours in the ICU involve fewer procedures, such as surgeries, diagnostic tests, and routine interventions. This results in physiological data being less affected by external factors, providing a cleaner and more accurate reflection of the patient's condition. Cleaner data helps the model make more precise predictions.
  
2. **Limited Staff Availability**: Night shifts generally have fewer healthcare staff, resulting in a higher patient-to-provider ratio. In this context, the model acts as an additional eye to supplement the limited human resources, continuously monitoring patients and assisting with early detection when direct supervision is reduced.

3. **Integration with Morning Rounds**: ICU morning rounds set the stage for planning patient treatment for the next 24 hours. By analyzing data from the previous night and predicting sepsis onset risk within the following 24 hours, the model naturally integrates into this workflow, supporting timely and informed decision-making for patient care.

This notebook will construct two versions of the dataset:
* **"S dataset"** (for the **S**tandard dataset without NaN values)
* **"N dataset"** (for the dataset with **N**aN values)



**Reference**:
T. Stewart, K. Stern, G. O'Keefe, A. Teredesai and J. Hu, "NPRL: Nightly Profile Representation Learning for Early Sepsis Onset Prediction in ICU Trauma Patients," in 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 2023, pp. 1843-1852, doi: 10.1109/BigData59044.2023.10386180. (Stewart et al. 2023)

# 0. Environment Setup

## Mount Google Drive
Considering that the overall process may take a long time and Colab execution may be interrupted, we highly recommend mounting your Google Drive to Colab to save intermediate results.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Set your parameters

Please make sure to update the following parameters to your own:

- **Project ID**: The BigQuery project ID to query the MIMIC-III v1.4 raw data. (If you're not sure what your project ID is, check details in `notebooks/MIMIC_III_Data_Access_Instructions.ipynb`.)
- **BASE_PATH**: The path where the GitHub project will be cloned.

In [None]:
# Remember to update the BigQuery project ID to your own to query raw data
PROJECT_ID = 'sepsis-mimic3'

# Remember to update this variable to your own path
# BASE_PATH is where the GitHub project will be cloned
BASE_PATH = "/content/drive/MyDrive/GitHub_Testing"

## Import libraries

In [None]:
%cd {BASE_PATH}/SepsisOnset_TraumaCohort

import os
import numpy as np
import pandas as pd
# Note: `time` below is datetime.time; a separate `import time` would be shadowed by this import
from datetime import datetime, time, date, timedelta
from matplotlib import pyplot as plt

from src.data import data_utils, sql2df, data_fetcher
from scripts.cohort_extraction import extract_trauma_cohort_ids
from scripts.sepsis_onset_label_assignment import assign_sepsis_labels


# Initialize the ProjectPaths object
from src import path_manager
project_path_obj = path_manager.ProjectPaths(f'{BASE_PATH}/SepsisOnset_TraumaCohort')

/content/drive/MyDrive/GitHub_Testing/SepsisOnset_TraumaCohort


## Load Trauma Cohort

This section extracts a cohort of critically ill trauma patients and their corresponding hospital admission information from the MIMIC-III v1.4 dataset. The rationale and detailed explanations of the cohort extraction process, along with other relevant details, can be found in the `'notebooks/cohort_extraction.ipynb'` file.

In [None]:
# Check if the file exists
if os.path.exists(project_path_obj.trauma_cohort_info_path):
    # Load the existing file
    trum_ids = pd.read_csv(project_path_obj.trauma_cohort_info_path, index_col=0)
else:
    # File does not exist, extract cohort IDs and generate statistics report
    trum_ids = extract_trauma_cohort_ids(project_path_obj,    # Saved file paths
                                         PROJECT_ID,          # To query raw data
                                         is_report=True,      # Print statistics report
                                         is_saved=True        # Save the cohort IDs
                                        )

# This table should contain the hospital admission IDs (hadm_id) of trauma patients and the corresponding admission information.
# 1 row per patient
trum_cohort_info_df = trum_ids[['subject_id', 'hadm_id', 'icustay_id', 'admittime']]#.drop_duplicates('hadm_id') # we only need Hospital Admission ID,
# trum_cohort_info_df['admittime'] = pd.to_datetime(trum_cohort_info_df.admittime)
# trum_cohort_info_df['adm_date'] = pd.to_datetime(trum_cohort_info_df.admittime).dt.date
print(f"Trauma Cohort: {trum_cohort_info_df.hadm_id.nunique()} trauma patients")
trum_cohort_info_df.head()

Trauma Cohort: 1570 trauma patients


Unnamed: 0,subject_id,hadm_id,icustay_id,admittime
0,43,146828,225852,2186-10-01 23:15:00
9,141,168006,234668,2140-11-06 11:07:00
11,147,103631,252947,2158-06-24 18:50:00
15,179,161310,256090,2173-05-26 02:01:00
17,188,164735,284015,2161-07-01 19:44:00


# 1. Feature Extraction

The model uses nighttime vital signs data to detect early sepsis onset within the next 24 hours. The focus is on data collected from 18:00 to 06:00 the following day, covering nine key features: **heart rate, systolic blood pressure, diastolic blood pressure, mean blood pressure, respiratory rate, temperature, SpO2, glucose, and FiO2**. These features are essential for assessing physiological status and are commonly used for early sepsis detection.


The extraction, processing, and generation of 2D time-series data involve the following steps:

#### 1.1 Extract Vital Sign Records
We extract and combine raw data from the [CHARTEVENTS](https://mimic.mit.edu/docs/iii/tables/chartevents/) table of the MIMIC-III dataset for the trauma cohort. Each row in the resulting table represents the feature values at a particular timestamp for a patient. Note that the data is not necessarily recorded at hourly intervals.

#### 1.2 Extract and Process Nighttime Data
Given the raw records, this section focuses on extracting and aggregating nighttime data. Optionally, missing values are filled using the specified imputation method to maintain hourly intervals.

#### 1.3 Convert to 2D Time-Series Data
The cleaned records are then converted into 2D time-series data for each night and filtered to focus on the critical period for early sepsis detection.


## 1.1 Extract Vital Sign Records

This section details the extraction of nine vital sign features from the CHARTEVENTS table of the MIMIC-III dataset for trauma patients. The extraction is based on two SQL scripts provided by the official MIMIC GitHub project. In our adaptation of the source scripts, we replaced the use of `icustay_id` with `hadm_id` to track patients across their entire hospital stay, not just within the ICU. Additionally, we modified the extraction process to include as many charted records as possible, extending beyond ICU stays.

**Source Files**:
- **CHARTEVENTS**: The primary data repository for ICU patients, recording vital signs, ventilator settings, laboratory values, code status, and mental status. **Each row in the table represents a single value for one feature at a specific timestamp for a patient.** [Official Documentation](https://mimic.mit.edu/docs/iii/tables/chartevents/)

- **pivoted_vital.sql**: [View Script](https://github.com/MIT-LCP/mimic-code/blob/main/mimic-iii/concepts/pivot/pivoted_vital.sql)
- **pivoted_fio2.sql**: [View Script](https://github.com/MIT-LCP/mimic-code/blob/main/mimic-iii/concepts/pivot/pivoted_fio2.sql)

**Estimated Processing Time**: Approximately 17 minutes.



In [None]:
# Extract raw input data (vital signs) for the trauma cohort
def extract_trauma_vitalsign(project_path_obj, project_id,
                              trauma_ids,
                              is_report=True):
    """
    Extracts and merges vital signs and FiO2 data for trauma patients from the MIMIC-III dataset.
    The extracted features include: 'HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2', 'Glucose', and 'FiO2'.

    Parameters:
        project_path_obj (object): Provides paths to processed data files.
        project_id (str): Project identifier for BigQuery database access.
        trauma_ids (DataFrame): DataFrame containing IDs and their corresponding hospital admission information of trauma patients.
        is_report (bool): Flag to enable printing of summary statistics for the extracted data.

    Returns:
        DataFrame: A DataFrame containing vital signs and FiO2 data for the specified trauma patients,
                  sorted by 'icustay_id' and 'charttime'.

  """
    trauma_ids = trauma_ids[['subject_id', 'hadm_id', 'admittime']].drop_duplicates()
    # Load vital signs data
    path = project_path_obj.get_raw_data_file("pivoted_vital.csv")
    if os.path.exists(path):
        vital_df = pd.read_csv(path, index_col=0)
    else:
        vital_df = sql2df.vital_signs_sql2df(project_id, saved_path=path)
    vital_df.drop('icustay_id', axis=1, inplace=True) # no need for icustay_id


    # Load FiO2 data
    path = project_path_obj.get_raw_data_file("pivoted_fio2.csv")
    if os.path.exists(path):
        fio2_df = pd.read_csv(path, index_col=0)
    else:
        fio2_df = sql2df.fio2_sql2df(project_id, saved_path=path)

    # Merge trauma patients' IDs with FiO2 and vital signs data
    trauma_fio2 = trauma_ids.merge(fio2_df, on='hadm_id', how='left')
    trauma_vital_df = trauma_ids.merge(vital_df, on=['hadm_id'], how='left')
    raw_df = trauma_vital_df.merge(trauma_fio2, on=['subject_id', 'hadm_id', 'admittime', 'charttime'], how='outer')
    raw_df.rename(columns={
        'heartrate': 'HeartRate',
        'sysbp': 'SysBP',
        'diasbp':'DiasBP',
        'meanbp': 'MeanBP',
        'resprate': 'RespRate',
        "tempc":'TempC',
        'spo2': 'SpO2',
        'glucose':'Glucose',
        'fio2': 'FiO2'}, inplace=True)

    if is_report:
        print(f"Extracted {trauma_fio2.shape[0]} FiO2 samples for {trauma_fio2['hadm_id'].nunique()} trauma patients.")
        print(f"Extracted {trauma_vital_df.shape[0]} vital sign samples for {trauma_vital_df['hadm_id'].nunique()} trauma patients.")
        print(f"Total samples after merging 2 tables: {raw_df.shape[0]} for {raw_df['hadm_id'].nunique()} trauma patients.")


    # Prepare datetime and time variables
    raw_df['admittime'] = pd.to_datetime(raw_df['admittime'])
    raw_df['charttime'] = pd.to_datetime(raw_df['charttime'])
    raw_df['Date'] = raw_df['charttime'].dt.date
    raw_df['Day'] = (raw_df['charttime'].dt.date - raw_df['admittime'].dt.date).apply(lambda x: x.days) + 1
    raw_df.loc[:,['Hour']] = raw_df.charttime.dt.hour

    return raw_df.sort_values(by=['hadm_id', 'charttime'])[
        ['subject_id', 'hadm_id',
         'Date', 'Day', 'Hour', #'admittime', 'charttime',
         'HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2', 'Glucose', 'FiO2'
          ]]

In [None]:
# Example usage
raw_vs = extract_trauma_vitalsign(project_path_obj, PROJECT_ID, trum_cohort_info_df, is_report=True)
raw_vs.iloc[40:55]

Extracted 179910 FiO2 samples for 1570 trauma patients.
Extracted 686212 vital sign samples for 1570 trauma patients.
Total samples after merging 2 tables: 703095 for 1570 trauma patients.


Unnamed: 0,subject_id,hadm_id,Date,Day,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2
617578,87977,100011,2177-08-29,1,19,104.0,150.0,79.0,95.0,20.0,37.722222,100.0,,50.0
617579,87977,100011,2177-08-29,1,19,,144.0,84.0,99.0,,,,,
617580,87977,100011,2177-08-29,1,20,103.0,165.0,87.0,105.0,20.0,,100.0,140.0,
617581,87977,100011,2177-08-29,1,20,,139.0,82.0,94.0,,,,,
617582,87977,100011,2177-08-29,1,21,100.0,143.5,77.0,93.5,20.0,,100.0,,
617583,87977,100011,2177-08-29,1,22,103.0,123.0,76.0,87.0,20.0,38.444444,100.0,,
617584,87977,100011,2177-08-29,1,23,99.0,133.0,80.0,91.5,20.0,,100.0,,50.0
617585,87977,100011,2177-08-30,2,0,103.0,146.0,83.0,98.0,20.0,38.722222,100.0,,
617586,87977,100011,2177-08-30,2,0,,,,,,,,,50.0
617587,87977,100011,2177-08-30,2,0,,,,,,,,160.0,


Note that the number of merged samples (703,095) is smaller than the sum of the two tables (686,212 vital sign samples + 179,910 FiO2 samples = 866,122). The merge is an outer join on patient ID and chart time, so records that share a chart time across the two tables collapse into a single row. No data is lost or missed in the merge; the difference simply reflects how the two tables align in time.
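
The row-count behaviour can be reproduced on toy data. The sketch below uses made-up values and assumes each `(hadm_id, charttime)` key is unique within each table; under that assumption, the merged row count equals the sum of the two tables minus the number of shared keys.

```python
import pandas as pd

# Toy vital sign records for one admission
vital = pd.DataFrame({
    "hadm_id": [1, 1, 1],
    "charttime": pd.to_datetime(["2177-08-29 19:00", "2177-08-29 20:00", "2177-08-29 21:00"]),
    "HeartRate": [104.0, 103.0, 100.0],
})
# Toy FiO2 records; the 19:00 timestamp also appears in `vital`
fio2 = pd.DataFrame({
    "hadm_id": [1, 1],
    "charttime": pd.to_datetime(["2177-08-29 19:00", "2177-08-29 23:00"]),
    "FiO2": [50.0, 50.0],
})

# Outer join keeps every record; `indicator=True` marks where each row came from
merged = vital.merge(fio2, on=["hadm_id", "charttime"], how="outer", indicator=True)
n_overlap = int((merged["_merge"] == "both").sum())

print(len(vital), len(fio2), len(merged), n_overlap)
assert len(merged) == len(vital) + len(fio2) - n_overlap
```

Every original measurement is still present in `merged`; only the row count shrinks because shared timestamps occupy one row instead of two.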

## 1.2 Extract and Process Nighttime Data

This section implements the preprocessing of nighttime data. The function performs the following tasks:

1. **Nighttime Data Extraction**: Isolates data recorded between 18:00 and 06:00, optionally using a wider window as the source for filling missing values. To keep the setup deployable, the filling window never extends past the end of the night at 06:00.
2. **Fill Missing Timestamps**: Ensures continuous time coverage by adding any missing hourly timestamps.
3. **Aggregation**: Aggregates multiple values recorded within the same hour into a single representative value for each feature.
4. **Fill Missing Values**: Optionally fills missing values using the specified imputation method (applies only to the S dataset, the standard dataset without NaN values).
5. **Drop Invalid Data**: Optionally removes rows with any remaining NaN values, ensuring each row accurately represents a patient's record at a specific timestamp (applies only to the S dataset, the standard dataset without NaN values).
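
The core of these steps can be sketched on a single toy night with one feature. This is a simplified illustration, not the notebook's function: the values are made up, the names `NIGHT_HOURS`, `template`, and `filled` are illustrative, and filling here uses only values inside the night, whereas the notebook's function can draw on a wider filling window.

```python
import numpy as np
import pandas as pd

NIGHT_HOURS = list(range(18, 24)) + list(range(0, 7))  # 18:00 ... 06:00, 13 hourly slots

# Toy records for one admission; two readings fall within hour 19
records = pd.DataFrame({
    "hadm_id": [1, 1, 1, 1],
    "Day": [1, 1, 1, 2],
    "Hour": [18, 19, 19, 2],
    "HeartRate": [100.0, 104.0, 108.0, 96.0],
})

# Early-morning hours (<= 06:00) belong to the night that began the previous evening
records["Night"] = records["Day"].where(records["Hour"] > 6, records["Day"] - 1)

# Aggregate multiple readings within the same hour into their mean
hourly = records.groupby(["hadm_id", "Night", "Hour"], as_index=False)["HeartRate"].mean()

# Left-join onto a full hourly template so missing hours become explicit NaN rows
template = pd.DataFrame({"Hour": NIGHT_HOURS, "TimeIndex": range(len(NIGHT_HOURS))})
night = template.merge(hourly[hourly["Night"] == 1], on="Hour", how="left")

# Record which slots were originally missing, then forward- and backward-fill
night["isNan"] = night["HeartRate"].isna().astype(int)
filled = night["HeartRate"].ffill().bfill()

print(night[["TimeIndex", "Hour", "HeartRate", "isNan"]].assign(Filled=filled))
```

Keeping the `isNan` mask alongside the filled values lets a downstream model distinguish observed measurements from imputed ones.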


In [None]:
def extract_night_data(df, night_time_window = [18,6],
                       feature_li = ['HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2', 'Glucose', 'FiO2'],
                       is_fill = True):
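    """
    Extracts hourly nighttime records for each patient-night.

    Hours up to `night_end` are assigned to the previous night. Values are averaged per hour,
    a per-row NaN mask (`isNan`) is recorded, and, if `is_fill` is True, gaps are forward-filled
    and then backward-filled within the 24-hour filling window (07:00 to 06:00 the next day).
    Only rows inside the nighttime window are returned.
    """
    night_start, night_end = night_time_window  # Default: 18:00 to 06:00 the next day
    window_s = night_end + 1  # Filling window starts at 07:00 on the first day
    window_e = night_end      # Filling window ends at 06:00 on the second day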

    # Select rows within the nighttime window
    # night_df = df[(df['Hour'] >= window_s) | (df['Hour'] <= window_e)].sort_values(['hadm_id', 'charttime'])
    df = df.copy()
    print(f"Extracted nighttime data with filling window (24h): {df.shape[0]} samples for {df.hadm_id.nunique()} trauma patients")

    # Adjust relative day for early-morning hours (belong to previous night)
    df.loc[df['Hour'] <= window_e, 'Day'] = df['Day'] - 1
    df.rename(columns={'Day': 'Night'}, inplace=True)
    df.loc[df['Hour']<= window_e, 'Date'] = (df.Date - timedelta(days=1))

    # Construct ID columns and feature list
    day_ids = ['subject_id','hadm_id','Date', 'Night']
    hour_ids = day_ids + ['Hour']
    night_time_list = [i for i in range(night_start, 24)] + [i for i in range(night_end+1)]
    df = df[hour_ids+feature_li]

    # Filter out nights with fewer than `miss_time_threshhold` valid hourly entries
    miss_time_threshhold = 0 # filter out night with all nan values
    count_df = df.groupby(['hadm_id','Night']).apply(lambda x: x.Hour.isin(night_time_list).sum())
    nighttime_count_df = count_df[count_df > miss_time_threshhold].reset_index().loc[:,['hadm_id','Night']]
    df = df.merge(nighttime_count_df, on=['hadm_id','Night'], how='inner')
    print(f"After filtering out nights with fewer than {miss_time_threshhold} valid hourly entries: {df.shape[0]} samples for {df.hadm_id.nunique()} trauma patients")

    # Create a full hourly template for each valid night
    night_hour = df.groupby(day_ids).apply(
        lambda x: pd.DataFrame(night_time_list, columns=['Hour'])
        ).reset_index(names= day_ids +['TimeIndex'])

    # Fill missing timestamps in the nighttime range
    full_night = df.merge(
        night_hour, on=hour_ids, how='outer'
        )

    # Assign dummy index for sorting, sort and fill
    full_night['TimeIndex'] = full_night['TimeIndex'].fillna(-1)
    full_night.sort_values(['hadm_id', 'Night', 'TimeIndex','Hour'], inplace=True)

    # Aggregate values in the same hour into one value per feature
    full_night = full_night.groupby(hour_ids).mean().reset_index()
    print(f"After aggregating one hour into one value: {full_night.shape[0]} samples for {full_night.hadm_id.nunique()} trauma patients")

    # Record Nan Values
    full_night['isNan'] = full_night[feature_li].apply(lambda x: np.isnan(x).astype(int).to_numpy(),axis=1)

    if is_fill:
        # Forward fill followed by backward fill
        full_night = full_night.groupby(day_ids).apply(lambda group: group.ffill()).reset_index(drop=True)
        full_night = full_night.groupby(day_ids).apply(lambda group: group.bfill()).reset_index(drop=True)
        print(f"After forward and backward filling: {full_night.shape[0]} samples for {full_night.hadm_id.nunique()} trauma patients")

    # Retain only the rows within nighttime hours (18:00 to 06:00 the next day)
    night_window = full_night[full_night['Hour'].isin(night_time_list)]
    print(f"After filtering out nighttime hours(from{night_start}-{night_end}): {night_window.shape[0]} samples for {night_window.hadm_id.nunique()} trauma patients")

    return night_window.sort_values(['hadm_id', 'Night', 'TimeIndex'])

In [None]:
# Extract night-time data with missing values retained
data_w_null = extract_night_data(raw_vs, is_fill=False,
                                 night_time_window = [18,6])
data_w_null.iloc[13:26, :]

Extracted nighttime data with filling window (24h): 703095 samples for 1570 trauma patients


After filtering out nights with fewer than 0 valid hourly entries: 689608 samples for 1570 trauma patients


After aggregating one hour into one value: 429115 samples for 1570 trauma patients
After filtering out nighttime hours(from18-6): 246324 samples for 1570 trauma patients


Unnamed: 0,subject_id,hadm_id,Date,Night,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2,TimeIndex,isNan
398649,87977,100011,2177-08-29,1,18,101.0,150.0,72.0,90.0,20.0,,100.0,,,0.0,"[0, 0, 0, 0, 0, 1, 0, 1, 1]"
398650,87977,100011,2177-08-29,1,19,104.0,147.0,81.5,97.0,20.0,37.722222,100.0,,50.0,1.0,"[0, 0, 0, 0, 0, 0, 0, 1, 0]"
398651,87977,100011,2177-08-29,1,20,103.0,152.0,84.5,99.5,20.0,,100.0,140.0,,2.0,"[0, 0, 0, 0, 0, 1, 0, 0, 1]"
398652,87977,100011,2177-08-29,1,21,100.0,143.5,77.0,93.5,20.0,,100.0,,,3.0,"[0, 0, 0, 0, 0, 1, 0, 1, 1]"
398653,87977,100011,2177-08-29,1,22,103.0,123.0,76.0,87.0,20.0,38.444444,100.0,,,4.0,"[0, 0, 0, 0, 0, 0, 0, 1, 1]"
398654,87977,100011,2177-08-29,1,23,99.0,133.0,80.0,91.5,20.0,,100.0,,50.0,5.0,"[0, 0, 0, 0, 0, 1, 0, 1, 0]"
398632,87977,100011,2177-08-29,1,0,103.0,146.0,83.0,98.0,20.0,38.722222,100.0,160.0,50.0,6.0,"[0, 0, 0, 0, 0, 0, 0, 0, 0]"
398633,87977,100011,2177-08-29,1,1,107.0,149.0,81.0,98.0,20.0,,100.0,153.0,,7.0,"[0, 0, 0, 0, 0, 1, 0, 0, 1]"
398634,87977,100011,2177-08-29,1,2,128.0,148.0,73.0,95.0,20.0,,100.0,,,8.0,"[0, 0, 0, 0, 0, 1, 0, 1, 1]"
398635,87977,100011,2177-08-29,1,3,111.0,95.0,53.0,67.0,18.0,37.666667,100.0,,40.0,9.0,"[0, 0, 0, 0, 0, 0, 0, 1, 0]"


In [None]:
raw_vs[(raw_vs.hadm_id==100011) & (raw_vs.Day==1) & (raw_vs.Hour.isin([15, 16, 17]))]

Unnamed: 0,subject_id,hadm_id,Date,Day,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2
617571,87977,100011,2177-08-29,1,15,90.0,148.0,54.0,76.0,,,100.0,,60.0
617572,87977,100011,2177-08-29,1,15,,,,,6.0,,,,
617573,87977,100011,2177-08-29,1,15,91.0,141.0,46.0,69.0,,,100.0,,
617574,87977,100011,2177-08-29,1,16,95.0,148.0,64.0,84.0,20.0,,100.0,,50.0
617575,87977,100011,2177-08-29,1,16,,,,,,38.722222,,,
617576,87977,100011,2177-08-29,1,17,103.0,134.0,68.0,84.0,20.0,38.333333,100.0,,


In [None]:
# Extract night-time data with missing values filled using forward and backward filling
data_wo_null = extract_night_data(raw_vs, is_fill=True,
                                 night_time_window = [18,6])
data_wo_null.iloc[13:26, :]

Extracted nighttime data with filling window (24h): 703095 samples for 1570 trauma patients


After filtering out nights with fewer than 0 valid hourly entries: 689608 samples for 1570 trauma patients


After aggregating one hour into one value: 429115 samples for 1570 trauma patients


After forward and backward filling: 429115 samples for 1570 trauma patients
After filtering out nighttime hours(from18-6): 246324 samples for 1570 trauma patients


Unnamed: 0,subject_id,hadm_id,Date,Night,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2,TimeIndex,isNan
398649,87977,100011,2177-08-29,1,18,101.0,150.0,72.0,90.0,20.0,38.333333,100.0,133.0,50.0,0.0,"[0, 0, 0, 0, 0, 1, 0, 1, 1]"
398650,87977,100011,2177-08-29,1,19,104.0,147.0,81.5,97.0,20.0,37.722222,100.0,133.0,50.0,1.0,"[0, 0, 0, 0, 0, 0, 0, 1, 0]"
398651,87977,100011,2177-08-29,1,20,103.0,152.0,84.5,99.5,20.0,37.722222,100.0,140.0,50.0,2.0,"[0, 0, 0, 0, 0, 1, 0, 0, 1]"
398652,87977,100011,2177-08-29,1,21,100.0,143.5,77.0,93.5,20.0,37.722222,100.0,140.0,50.0,3.0,"[0, 0, 0, 0, 0, 1, 0, 1, 1]"
398653,87977,100011,2177-08-29,1,22,103.0,123.0,76.0,87.0,20.0,38.444444,100.0,140.0,50.0,4.0,"[0, 0, 0, 0, 0, 0, 0, 1, 1]"
398654,87977,100011,2177-08-29,1,23,99.0,133.0,80.0,91.5,20.0,38.444444,100.0,140.0,50.0,5.0,"[0, 0, 0, 0, 0, 1, 0, 1, 0]"
398632,87977,100011,2177-08-29,1,0,103.0,146.0,83.0,98.0,20.0,38.722222,100.0,160.0,50.0,6.0,"[0, 0, 0, 0, 0, 0, 0, 0, 0]"
398633,87977,100011,2177-08-29,1,1,107.0,149.0,81.0,98.0,20.0,38.722222,100.0,153.0,50.0,7.0,"[0, 0, 0, 0, 0, 1, 0, 0, 1]"
398634,87977,100011,2177-08-29,1,2,128.0,148.0,73.0,95.0,20.0,38.722222,100.0,153.0,50.0,8.0,"[0, 0, 0, 0, 0, 1, 0, 1, 1]"
398635,87977,100011,2177-08-29,1,3,111.0,95.0,53.0,67.0,18.0,37.666667,100.0,153.0,40.0,9.0,"[0, 0, 0, 0, 0, 0, 0, 1, 0]"


## 1.3 Convert to 2D Time-Series Data

The final step converts the records into a 2D time-series format by grouping the data by night and aggregating 1D chart records. It then filters the nights to include only those from days 2 to 14, focusing on the critical period for early sepsis detection.
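
As a rough sketch of this conversion, the snippet below turns a toy long-format table into one `(T, F)` array per patient-night and then keeps nights 2 to 14. The data and the variable names (`long_df`, `nights`, `kept`) are made up for illustration; the notebook's own function stores the arrays in a DataFrame column instead of a dictionary.

```python
import numpy as np
import pandas as pd

features = ["HeartRate", "SpO2"]

# Toy long-format table: one row per (patient, night, hour slot)
long_df = pd.DataFrame({
    "hadm_id":   [1, 1, 1, 1, 1, 1],
    "Night":     [1, 1, 1, 2, 2, 2],
    "TimeIndex": [0, 1, 2, 0, 1, 2],
    "HeartRate": [100.0, 104.0, np.nan, 88.0, 90.0, 94.0],
    "SpO2":      [100.0, 100.0, 99.0, 98.0, np.nan, 100.0],
})

# One (T, F) array per patient-night: rows are time steps, columns are features
nights = {
    key: grp.sort_values("TimeIndex")[features].to_numpy()
    for key, grp in long_df.groupby(["hadm_id", "Night"])
}

# Keep only nights 2 to 14
kept = {key: arr for key, arr in nights.items() if 2 <= key[1] <= 14}

for key, arr in kept.items():
    print(key, arr.shape)
```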

In [None]:
def gen_2Dnight_ti(df, feature_li = ['HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2', 'Glucose', 'FiO2']):
  """
  Groups by patient and night, then aggregates the values into 2D arrays with shape of (T, F).
  Each row represents one patient on one night.
  Filters the nights to include only those from days 2 to 14
  """
  day_index_columns = ['subject_id', 'hadm_id', 'Date', 'Night']
  index_columns = day_index_columns + ['Hour', 'TimeIndex']
  # df = df.sort_values(index_columns) #the input df should be a sorted table according to hadm_id, Night, TimeIndex

  # Group by patient and night, then aggregate values into 2D arrays
  ti = df.groupby(day_index_columns).apply(
  lambda x: pd.Series({
      "Temporal Features": x[feature_li].values,
      "isNan": np.stack(x.isNan.values)
      })
  ).reset_index()
  print(f"After aggregating one night into 2D time-series, {ti.shape[0]} samples for {ti['hadm_id'].nunique()} trauma patients.")

  # Exclude the first night (keep Night >= 2)
  ti_after2D = ti[(ti.Night>=2)]
  print(f"After filtering out the first night, {ti_after2D.shape[0]} samples for {ti_after2D['hadm_id'].nunique()} trauma patients.")

  # Filter out nights after day 14
  ti = ti_after2D[(ti_after2D.Night<=14)]
  # ti = ti[ti.Night<=14]
  print(f"After filtering out nights beyond day 14, {ti.shape[0]} samples for {ti['hadm_id'].nunique()} trauma patients.")

  return ti.sort_values(['hadm_id', 'Night'])

In [None]:
night_ti = gen_2Dnight_ti(data_w_null)
night_ti.head()

After aggregating one night into 2D time-series, 18948 samples for 1570 trauma patients.
After filtering out the first night, 17228 samples for 1570 trauma patients.
After filtering out nights beyond day 14, 12441 samples for 1561 trauma patients.



Unnamed: 0,subject_id,hadm_id,Date,Night,Temporal Features,isNan
17607,87977,100011,2177-08-30,2,"[[88.0, 167.0, 82.0, 105.0, 16.0, nan, 100.0, ...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ..."
17608,87977,100011,2177-08-31,3,"[[109.33333333333333, 158.33333333333334, 82.6...","[[0, 0, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, ..."
17609,87977,100011,2177-09-01,4,"[[97.0, 143.0, 65.0, 87.0, 23.0, nan, 98.0, na...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ..."
17610,87977,100011,2177-09-02,5,"[[81.0, 153.0, 79.0, 100.0, 21.0, nan, 100.0, ...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ..."
17611,87977,100011,2177-09-03,6,"[[95.0, 128.0, 59.0, 75.0, 45.0, nan, 96.0, na...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ..."


### Sample Check

In [None]:
# Hourly records before aggregation into 2D arrays
data_w_null.iloc[26:39, :]

Unnamed: 0,subject_id,hadm_id,Date,Night,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2,TimeIndex,isNan
398673,87977,100011,2177-08-30,2,18,88.0,167.0,82.0,105.0,16.0,,100.0,,,0.0,"[0, 0, 0, 0, 0, 1, 0, 1, 1]"
398674,87977,100011,2177-08-30,2,19,90.0,153.0,72.0,94.0,15.5,,100.0,,40.0,1.0,"[0, 0, 0, 0, 0, 1, 0, 1, 0]"
398675,87977,100011,2177-08-30,2,20,94.0,173.0,78.0,104.0,15.0,38.444444,100.0,124.0,,2.0,"[0, 0, 0, 0, 0, 0, 0, 0, 1]"
398676,87977,100011,2177-08-30,2,21,99.0,151.0,83.0,106.0,20.0,,100.0,,,3.0,"[0, 0, 0, 0, 0, 1, 0, 1, 1]"
398677,87977,100011,2177-08-30,2,22,104.0,127.0,71.0,88.0,14.0,38.277778,100.0,,,4.0,"[0, 0, 0, 0, 0, 0, 0, 1, 1]"
398678,87977,100011,2177-08-30,2,23,103.0,130.0,69.0,87.0,14.0,,100.0,,40.0,5.0,"[0, 0, 0, 0, 0, 1, 0, 1, 0]"
398655,87977,100011,2177-08-30,2,0,102.0,155.0,70.0,94.0,14.0,38.222222,100.0,,,6.0,"[0, 0, 0, 0, 0, 0, 0, 1, 1]"
398656,87977,100011,2177-08-30,2,1,108.0,149.0,72.0,93.0,14.0,,100.0,131.5,,7.0,"[0, 0, 0, 0, 0, 1, 0, 0, 1]"
398657,87977,100011,2177-08-30,2,2,104.0,138.0,65.0,87.0,14.0,38.055556,100.0,,,8.0,"[0, 0, 0, 0, 0, 0, 0, 1, 1]"
398658,87977,100011,2177-08-30,2,3,97.0,163.0,77.0,101.0,14.0,,100.0,,40.0,9.0,"[0, 0, 0, 0, 0, 1, 0, 1, 0]"


In [None]:
night_ti.loc[(night_ti.hadm_id==100011) & (night_ti.Night==2), 'Temporal Features'].values

array([array([[ 88.        , 167.        ,  82.        , 105.        ,
                16.        ,          nan, 100.        ,          nan,
                        nan],
              [ 90.        , 153.        ,  72.        ,  94.        ,
                15.5       ,          nan, 100.        ,          nan,
                40.        ],
              [ 94.        , 173.        ,  78.        , 104.        ,
                15.        ,  38.44444444, 100.        , 124.        ,
                        nan],
              [ 99.        , 151.        ,  83.        , 106.        ,
                20.        ,          nan, 100.        ,          nan,
                        nan],
              [104.        , 127.        ,  71.        ,  88.        ,
                14.        ,  38.27777778, 100.        ,          nan,
                        nan],
              [103.        , 130.        ,  69.        ,  87.        ,
                14.        ,          nan, 100.        ,          nan

# 2. Instance Construction

This section labels nighttime instances using each patient's (HADM_ID) sepsis onset timestamp. A nighttime instance is labeled 1 if **sepsis onset occurs within 24 hours after the nighttime instance**; otherwise, it is labeled 0. Consequently, all nighttime instances of non-sepsis patients are labeled negative (0). For sepsis patients, only one nighttime instance receives a positive label (1); instances before it are labeled negative, and instances after onset are excluded because they are not relevant to early sepsis detection.

## 2.1 Load Post-Trauma Sepsis Onset Timestamps

Post-Trauma Sepsis is defined based on [Stern et al. (2023)](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2800552) and adheres to Sepsis-3 consensus guidelines. Each row records the sepsis label and the corresponding onset timestamp for a patient (HADM_ID).

More detailed explanations and applications can be found in `notebooks/S2_Sepsis_Onset_Label_Assignment.ipynb`.

**References**:
- Stern, K., Qiu, Q., Weykamp, M., O’Keefe, G., & Brakenridge, S. C. (2023). Defining posttraumatic sepsis for population-level research. *JAMA Network Open, 6*(1), e2251445. https://doi.org/10.1001/jamanetworkopen.2022.51445

---

In [None]:
# Load sepsis patient labels and corresponding onset timestamps
sepsis_label_path = project_path_obj.sepsis_label_path  # Define the path to sepsis labels

if os.path.exists(sepsis_label_path):
    # If the file exists, load it from the specified path
    sepsis_label_df = pd.read_csv(sepsis_label_path, index_col=0)
else:
    # If the file does not exist, generate the sepsis labels by querying the raw data
    sepsis_label_df = assign_sepsis_labels(project_path_obj,  # Pass object containing file paths
                                           PROJECT_ID         # Provide the project ID for database access
    )

sepsis_label_df.head()

Unnamed: 0,hadm_id,is_infection,is_sepsis,onset_datetime,onset_day,cx_index,abx_index,sofa_index_1,sofa_index_2
0,100011,1.0,1.0,2177-09-04 11:12:00,7.0,0.0,1.0,94.0,164.0
1,100035,1.0,1.0,2115-02-27 15:27:00,6.0,4.0,4.0,395.0,444.0
2,100132,1.0,0.0,,,,,,
3,100133,0.0,0.0,,,,,,
4,100138,1.0,0.0,,,,,,


In [None]:
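# Look up counts by label value instead of relying on the frequency ordering of value_counts()
sepsis_counts = sepsis_label_df.is_sepsis.value_counts()
num_sepsis_patient = sepsis_counts.get(1, 0)
num_nonsepsis_patient = sepsis_counts.get(0, 0)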
print(f'Number of trauma patients: {sepsis_label_df.shape[0]}')
print(f'Number of Sepsis patients: {num_sepsis_patient}')
print(f'Number of Non-Sepsis patients: {num_nonsepsis_patient}')

Number of trauma patients: 1570
Number of Sepsis patients: 535
Number of Non-Sepsis patients: 1035


## 2.2 Assign Instance Labels

Assign labels to each nighttime instance based on the sepsis status of the patient. The label is set as follows:
- **1**: If the patient develops sepsis within 24 hours after the end of the nighttime instance (in the code below, `0 <= time_diff < 24h`, where `time_diff` is the onset time minus the night's end time).
- **0**: Otherwise.

**Note**: Instances after sepsis onset are dropped, as they reflect a physiological status affected by sepsis treatment.

This means that:
- All instances for non-sepsis patients will be labeled as negative (0).
- For sepsis patients, at most one nighttime instance will be labeled as positive (1), while all other retained nighttime instances will be labeled as negative (0).


In [None]:
def assign_label2instance(ti_df, label_df):
    """
    Assigns labels (0/1) to nighttime instances based on sepsis onset timestamps.
    Specifically, assigns a positive label if sepsis onset occurs within 24 hours after the night.
    """
    # Identify sepsis and non-sepsis patient identifiers based on labels
    nonsepsis_ids = label_df.is_sepsis == 0
    sepsis_ids = label_df.is_sepsis == 1
    # print(f"Trauma Cohort: sepsis patients ({sum(sepsis_ids)}) + non-sepsis patients ({sum(nonsepsis_ids)}) = {label_df.shape[0]}")

    # Extract data for non-sepsis patients & assign negative label; these data are ready
    nonsepsis_patient_ti_df = ti_df[ti_df['hadm_id'].isin(label_df[nonsepsis_ids]['hadm_id'])]
    nonsepsis_patient_ti_df = nonsepsis_patient_ti_df.assign(Label=0)
    nonsepsis_patient_ti_df['onset_datetime'] = np.nan # Add a new column 'onset_datetime' with NaN values for non-sepsis patients
    print(f"{nonsepsis_patient_ti_df.shape[0]} Negative instances for {nonsepsis_patient_ti_df.hadm_id.nunique()} non-sepsis patients")

    # Extract data for sepsis patients
    sepsis_patient_ti_df = ti_df[ti_df['hadm_id'].isin(label_df[sepsis_ids]['hadm_id'])]
    print(f"{sepsis_patient_ti_df.shape[0]} instances for {sepsis_patient_ti_df.hadm_id.nunique()} sepsis patients")
    sepsis_patient_df = sepsis_patient_ti_df.merge(label_df[['hadm_id', 'onset_datetime', 'onset_day']], on='hadm_id')

    # Classify according to the relationship between recorded time and onset time
    night_end_time = pd.to_datetime(sepsis_patient_df.Date) + pd.to_timedelta(1, unit='d') + pd.to_timedelta('06:59:59') #pd.to_timedelta(6, unit='h')
    time_diff = (pd.to_datetime(sepsis_patient_df['onset_datetime']) - night_end_time)
    sepsis_patient_df['time_diff'] = time_diff
    # 0 <= time_diff < 24h
    is_positive = (time_diff >= pd.to_timedelta(0, unit='d')) & (time_diff < pd.to_timedelta(1, unit='d'))
    sepsis_patient_df['Label'] = np.where(is_positive, 1, 0)
    # Drop instances after the onset time
    after_onset = (time_diff < pd.to_timedelta(0, unit='d')) # time_diff<0 => onset < night (i.e. the night after onset time)
    sepsis_patient_df = sepsis_patient_df[~after_onset]
    print(f"Dropped {after_onset.sum()} instances after sepsis onset")
    print(f"\t {sepsis_patient_df.Label.value_counts()[1]} (1s) + {sepsis_patient_df.Label.value_counts()[0]} (0s)")

    # Combine data from sepsis and non-sepsis patients
    mimic_data_df = pd.concat([nonsepsis_patient_ti_df, sepsis_patient_df[nonsepsis_patient_ti_df.columns]])
    print(f"Final Dataset: {mimic_data_df['Label'].value_counts()[1]}(1s) + {mimic_data_df['Label'].value_counts()[0]}(0s) = {mimic_data_df.shape[0]} (Patients={mimic_data_df['hadm_id'].nunique()})")

    return mimic_data_df

In [None]:
# dataset with missing value
# night_ti = gen_2Dnight_ti(data_w_null)
print(f"In total, there are {night_ti.shape[0]} samples for {night_ti.hadm_id.nunique()} unique hospital admissions.")
mimic_data_df = assign_label2instance(night_ti, sepsis_label_df)

In total, there are 12441 samples for 1561 unique hospital admissions.
6952 Negative instances for 1032 non-sepsis patients
5489 instances for 529 sepsis patients
Dropped 3464 instances after sepsis onset
	 455 (1s) + 1570 (0s)
Final Dataset: 455(1s) + 8522(0s) = 8977 (Patients=1535)


### Explain with Samples

For the purpose of grouping nighttime instances, we define a `Night` column and update the `Date` column based on the start date of the night. A single night spans two calendar days, starting at 18:00 (6:00 PM) on the first day and ending at 06:59:59 on the second day, consistent with `night_time_window = [18, 6]` and the `night_end_time` computed in `assign_label2instance`. The `Date` for a night is the day on which the night begins. The `Night` value counts hospital days from the patient's admission date.

For example, if the `Date` is `2177-09-03`, the corresponding night spans from `2177-09-03 18:00:00` to `2177-09-04 06:59:59`. For this night to be labeled positive, the sepsis onset time must fall between `2177-09-04 07:00:00` and `2177-09-05 06:59:59`, that is, within the 24 hours following the night.

#### Labeling Rules:
1. **Positive Night**: If the sepsis onset occurs between the end of the night (06:59:59 on the next day) and 24 hours after that, the night is labeled as `1`.
2. **Negative Night**: Any night prior to the positive night is labeled as `0`.
3. **Post-Sepsis Data Removal**: All nights after the positive night are removed from the dataset, because the patient's physiology is then affected by antibiotics or other interventions. This keeps treatment effects out of the prediction task.

This method ensures that data used for detecting sepsis is clean, focusing only on the period before any interventions, which might otherwise affect the patient's physiological signals.
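
As a minimal sketch of the three rules above, the snippet below applies them to a toy frame built from the dates and onset time of admission `104877`, which appears in the sample output that follows. The frame `nights` and the variable names are illustrative only; the notebook's actual implementation is `assign_label2instance`.

```python
import pandas as pd

# Toy frame: one row per night, keyed by the date on which the night starts.
nights = pd.DataFrame({"Date": ["2150-10-13", "2150-10-14", "2150-10-15"]})
onset = pd.Timestamp("2150-10-16 06:47:00")

# End of each night: 06:59:59 on the day after `Date`.
night_end = pd.to_datetime(nights["Date"]) + pd.Timedelta(days=1) + pd.Timedelta("06:59:59")
time_diff = onset - night_end

# Rule 1: positive if onset falls in [night_end, night_end + 24h).
is_positive = (time_diff >= pd.Timedelta(0)) & (time_diff < pd.Timedelta(days=1))
nights["Label"] = is_positive.astype(int)

# Rule 3: nights that end after onset are dropped; the rest stay negative (rule 2).
labeled = nights[time_diff >= pd.Timedelta(0)].reset_index(drop=True)
print(labeled)
```

The night of `2150-10-14` is the only positive one, and the night of `2150-10-15` is dropped because it ends after the onset time.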


In [None]:
# Edge case: onset at 06:47, just before a night's 06:59:59 end time
sample_104877 = night_ti[night_ti.hadm_id==104877].merge(sepsis_label_df[['hadm_id', 'onset_datetime', 'onset_day']], on='hadm_id')
sample_104877['night_end_time'] = pd.to_datetime(sample_104877.Date) + pd.to_timedelta(1, unit='d') + pd.to_timedelta('06:59:59')
sample_104877['time_diff'] = (pd.to_datetime(sample_104877['onset_datetime']) - sample_104877.night_end_time)
display(sample_104877.head())
display(mimic_data_df[mimic_data_df.hadm_id==104877])

Unnamed: 0,subject_id,hadm_id,Date,Night,Temporal Features,isNan,onset_datetime,onset_day,night_end_time,time_diff
0,87008,104877,2150-10-13,2,"[[87.0, 141.0, 65.0, 87.0, 20.0, nan, 99.0, na...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",2150-10-16 06:47:00,5.0,2150-10-14 06:59:59,1 days 23:47:01
1,87008,104877,2150-10-14,3,"[[nan, 134.0, 72.0, 91.0, 20.0, nan, nan, nan,...","[[1, 0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 0, ...",2150-10-16 06:47:00,5.0,2150-10-15 06:59:59,0 days 23:47:01
2,87008,104877,2150-10-15,4,"[[88.11111111111111, 113.11111111111111, 63.33...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",2150-10-16 06:47:00,5.0,2150-10-16 06:59:59,-1 days +23:47:01
3,87008,104877,2150-10-16,5,"[[76.0, 126.0, 67.0, 84.0, 20.0, nan, 98.0, na...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",2150-10-16 06:47:00,5.0,2150-10-17 06:59:59,-2 days +23:47:01
4,87008,104877,2150-10-17,6,"[[74.0, 123.0, 68.0, 84.0, 16.0, nan, 95.0, na...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",2150-10-16 06:47:00,5.0,2150-10-18 06:59:59,-3 days +23:47:01


Unnamed: 0,subject_id,hadm_id,Date,Night,Temporal Features,isNan,Label,onset_datetime
227,87008,104877,2150-10-13,2,"[[87.0, 141.0, 65.0, 87.0, 20.0, nan, 99.0, na...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2150-10-16 06:47:00
228,87008,104877,2150-10-14,3,"[[nan, 134.0, 72.0, 91.0, 20.0, nan, nan, nan,...","[[1, 0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 0, ...",1,2150-10-16 06:47:00


In [None]:
# Edge case: onset at 07:01, just after a night's 06:59:59 end time
sample_117182 = night_ti[night_ti.hadm_id==117182].merge(sepsis_label_df[['hadm_id', 'onset_datetime', 'onset_day']], on='hadm_id')
sample_117182['night_end_time'] = pd.to_datetime(sample_117182.Date) + pd.to_timedelta(1, unit='d') + pd.to_timedelta('06:59:59')
sample_117182['time_diff'] = (pd.to_datetime(sample_117182['onset_datetime']) - sample_117182.night_end_time)
display(sample_117182.loc[3:7])
display(mimic_data_df[mimic_data_df.hadm_id==117182]) # Notice: samples after onset have been dropped

Unnamed: 0,subject_id,hadm_id,Date,Night,Temporal Features,isNan,onset_datetime,onset_day,night_end_time,time_diff
3,9920,117182,2181-06-09,5,"[[84.5, 177.5, 61.5, 100.5, 26.0, nan, 100.0, ...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",2181-06-12 07:01:00,8.0,2181-06-10 06:59:59,2 days 00:01:01
4,9920,117182,2181-06-10,6,"[[88.0, 101.5, 53.0, 70.5, 27.0, 38.4444427490...","[[0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, ...",2181-06-12 07:01:00,8.0,2181-06-11 06:59:59,1 days 00:01:01
5,9920,117182,2181-06-11,7,"[[93.0, 136.0, 62.5, 88.0, 25.0, nan, 100.0, n...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",2181-06-12 07:01:00,8.0,2181-06-12 06:59:59,0 days 00:01:01
6,9920,117182,2181-06-12,8,"[[84.0, 109.0, 89.0, 98.0, 25.0, 38.4444427490...","[[0, 0, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, ...",2181-06-12 07:01:00,8.0,2181-06-13 06:59:59,-1 days +00:01:01
7,9920,117182,2181-06-13,9,"[[75.0, 131.0, 44.0, 70.6666488647461, 27.0, n...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",2181-06-12 07:01:00,8.0,2181-06-14 06:59:59,-2 days +00:01:01


Unnamed: 0,subject_id,hadm_id,Date,Night,Temporal Features,isNan,Label,onset_datetime
837,9920,117182,2181-06-06,2,"[[74.0, 132.0, 44.0, 66.0, 8.0, nan, 100.0, 70...","[[0, 0, 0, 0, 0, 1, 0, 0, 1], [0, 0, 0, 0, 0, ...",0,2181-06-12 07:01:00
838,9920,117182,2181-06-07,3,"[[75.0, 139.0, 47.0, 76.0, 17.0, nan, 100.0, n...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2181-06-12 07:01:00
839,9920,117182,2181-06-08,4,"[[76.0, 145.5, 66.0, 94.0, 17.0, nan, 100.0, 1...","[[0, 0, 0, 0, 0, 1, 0, 0, 1], [0, 0, 0, 0, 0, ...",0,2181-06-12 07:01:00
840,9920,117182,2181-06-09,5,"[[84.5, 177.5, 61.5, 100.5, 26.0, nan, 100.0, ...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2181-06-12 07:01:00
841,9920,117182,2181-06-10,6,"[[88.0, 101.5, 53.0, 70.5, 27.0, 38.4444427490...","[[0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, ...",0,2181-06-12 07:01:00
842,9920,117182,2181-06-11,7,"[[93.0, 136.0, 62.5, 88.0, 25.0, nan, 100.0, n...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",1,2181-06-12 07:01:00






# Integration and Execution: Instance Construction

In [None]:
def instance_construction(project_path_obj, project_id, trum_cohort_info_df,
                          is_fill=True, is_report=True,
                          night_time_window = [18, 6],
                          feature_li = ['HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2', 'Glucose', 'FiO2']
                          ):
    """
    Extracts and processes night-time data from the trauma cohort based on specified parameters.

    Parameters:
    -----------
    project_path_obj : object
        The object that provides access to project paths.
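    project_id : str
        The ID of the project (used for database access).
    trum_cohort_info_df : pandas.DataFrame
        DataFrame containing trauma cohort information.
    is_fill : bool, optional
        If True, fills missing values in night-time data using forward and backward filling,
        then drops instances that still contain NaN. Default is True.
    is_report : bool, optional
        If True, generates a report. Default is True.
    night_time_window : list of int, optional
        Start and end hour of the night. Default is [18, 6].
    feature_li : list of str, optional
        Names of the vital-sign features to include in each instance.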

    Returns:
    --------
    pandas.DataFrame
        A DataFrame containing processed night-time data, with missing values filled or retained as specified.
    """
    # Extract raw vital sign data
    raw_vs = extract_trauma_vitalsign(project_path_obj, project_id, trum_cohort_info_df, is_report=is_report)

    # Extract night-time data with or without filling missing values based on is_fill
    night_data = extract_night_data(raw_vs, is_fill=is_fill, night_time_window = night_time_window)

    # Generate 2D night-time instances
    night_ti = gen_2Dnight_ti(night_data, feature_li=feature_li)
    if is_fill:
        # Drop instances that still contain NaN after filling
        no_missing_value = night_ti['Temporal Features'].apply(lambda x: np.isnan(x).sum() == 0)
        night_ti = night_ti[no_missing_value]

    # Load sepsis patient labels and corresponding onset timestamps
    sepsis_label_path = project_path_obj.sepsis_label_path  # Define the path to sepsis labels
    if os.path.exists(sepsis_label_path):
        # If the file exists, load it from the specified path
        sepsis_label_df = pd.read_csv(sepsis_label_path, index_col=0)
    else:
        # If the file does not exist, generate the sepsis labels by querying the raw data
        sepsis_label_df = assign_sepsis_labels(project_path_obj,  # Pass object containing file paths
                                               project_id         # Use the function argument, not the global PROJECT_ID
        )

    # Assigns labels (0/1) to nighttime instances based on sepsis onset timestamps.
    mimic_data_df = assign_label2instance(night_ti, sepsis_label_df)

    # Convert 'Night' to string and pad it with leading zeros to 3 digits
    mimic_data_df['Night'] = mimic_data_df['Night'].astype(str).str.zfill(3)
    # Create the 'instance_id' by concatenating 'hadm_id' and the 3-digit 'Night'
    mimic_data_df.index = (mimic_data_df['hadm_id'].astype(str) + mimic_data_df['Night']).astype(int)

    return mimic_data_df
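
The last step of `instance_construction` builds an integer instance ID from `hadm_id` and the zero-padded `Night`. The sketch below reproduces that step on a toy frame and shows how both parts could be recovered from the ID. The column names `instance_id`, `hadm_back`, and `night_back` are hypothetical; the notebook stores the ID in the index rather than in a column.

```python
import pandas as pd

# Toy frame with the two fields that make up the instance ID.
toy = pd.DataFrame({"hadm_id": [104877, 104877, 117182], "Night": [2, 3, 12]})

# Pad Night to 3 digits so every ID has the same layout: <hadm_id><NNN>.
night_str = toy["Night"].astype(str).str.zfill(3)
toy["instance_id"] = (toy["hadm_id"].astype(str) + night_str).astype(int)

# Because Night always occupies the last 3 digits, both parts can be recovered.
toy["hadm_back"] = toy["instance_id"] // 1000
toy["night_back"] = toy["instance_id"] % 1000
print(toy)
```

The fixed-width padding is what keeps the IDs unambiguous: without it, admission `10487` on night `72` and admission `104877` on night `2` would map to the same integer.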

In [None]:
print("Generating Dataset w/o nan value..")
data_wo_nan = instance_construction(project_path_obj, PROJECT_ID, trum_cohort_info_df,
                                    feature_li = ['HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2',
                                                  # 'Glucose', 'FiO2'
                                                  ],
                                    is_fill=True, is_report=True
                                    )
print("\nGenerating Dataset with nan value...")
data_with_nan = instance_construction(project_path_obj, PROJECT_ID, trum_cohort_info_df,
                                      feature_li = ['HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2',
                                                  # 'Glucose', 'FiO2'
                                                  ],
                                      is_fill=False, is_report=True)

Generating Dataset w/o nan value..
Extracted 179910 FiO2 samples for 1570 trauma patients.
Extracted 686212 vital sign samples for 1570 trauma patients.
Total samples after merging 2 tables: 703095 for 1570 trauma patients.
Extracted nighttime data with filling window (24h): 703095 samples for 1570 trauma patients


After filtering out nights with fewer than 0 valid hourly entries: 689608 samples for 1570 trauma patients


After aggregating one hour into one value: 429115 samples for 1570 trauma patients


After forward and backward filling: 429115 samples for 1570 trauma patients
After filtering out nighttime hours(from18-6): 246324 samples for 1570 trauma patients


After aggregating one night into 2D time-series, 18948 samples for 1570 trauma patients.
After filtering out the first night, 17228 samples for 1570 trauma patients.
After filtering out nights beyond day 14, 12441 samples for 1561 trauma patients.
6821 Negative instances for 1031 non-sepsis patients
5349 instances for 527 sepsis patients
Dropped 3411 instances after sepsis onset
	 440 (1s) + 1498 (0s)
Final Dataset: 440(1s) + 8319(0s) = 8759 (Patients=1522)

Generating Dataset with nan value...
Extracted 179910 FiO2 samples for 1570 trauma patients.
Extracted 686212 vital sign samples for 1570 trauma patients.
Total samples after merging 2 tables: 703095 for 1570 trauma patients.
Extracted nighttime data with filling window (24h): 703095 samples for 1570 trauma patients


After filtering out nights with fewer than 0 valid hourly entries: 689608 samples for 1570 trauma patients


After aggregating one hour into one value: 429115 samples for 1570 trauma patients
After filtering out nighttime hours(from18-6): 246324 samples for 1570 trauma patients
After aggregating one night into 2D time-series, 18948 samples for 1570 trauma patients.
After filtering out the first night, 17228 samples for 1570 trauma patients.
After filtering out nights beyond day 14, 12441 samples for 1561 trauma patients.
6952 Negative instances for 1032 non-sepsis patients
5489 instances for 529 sepsis patients
Dropped 3464 instances after sepsis onset
	 455 (1s) + 1570 (0s)
Final Dataset: 455(1s) + 8522(0s) = 8977 (Patients=1535)


In [None]:
# Hospital Admission: 'hadm_id' represents a unique hospital admission,
#                      it aligns with the concept of a "patient" in clinical research
print(f"hadm_id \t{ data_wo_nan.hadm_id.nunique()} \t{data_with_nan.hadm_id.nunique()}")

# Subject: Each 'subject_id' represents a unique patient; a single patient can have multiple hospital admissions.
print(f"subject_id \t{ data_wo_nan.subject_id.nunique()} \t{data_with_nan.subject_id.nunique()}")

hadm_id 	1522 	1535
subject_id 	1505 	1518


# 3. Data Split

The function ensures a fair and structured data split for evaluation, using a **5-fold stratified split** (by default):  

1. **Patient-Level Splitting**:  
   - Each patient (`subject_id`) is assigned to a single fold, preventing data leakage across folds.

2. **Stratified Split**:  
   - The split maintains the same sepsis prevalence across folds to ensure a balanced distribution of positive and negative cases.

3. **Fold Assignment**:  
   - Patients are grouped by subject ID, and a `Fold` column is added to indicate fold assignments.   

This approach guarantees **consistent and unbiased** model evaluation while preventing data leakage.
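
One way to confirm the patient-level property is to check that no `subject_id` appears in more than one fold. The helper below is a hedged sketch, not part of the repository; `check_no_patient_leakage` and the toy frames are hypothetical names introduced for illustration.

```python
import pandas as pd

def check_no_patient_leakage(df, id_col="subject_id", fold_col="Fold"):
    """Return the list of patient IDs that appear in more than one fold."""
    folds_per_patient = df.groupby(id_col)[fold_col].nunique()
    leaked = folds_per_patient[folds_per_patient > 1]
    return leaked.index.tolist()

# Clean assignment: every patient stays within a single fold.
clean = pd.DataFrame({"subject_id": [1, 1, 2, 3, 3], "Fold": [0, 0, 1, 2, 2]})
# Leaky assignment: patient 1 has instances in folds 0 and 1.
leaky = pd.DataFrame({"subject_id": [1, 1, 2, 3, 3], "Fold": [0, 1, 1, 2, 2]})

print(check_no_patient_leakage(clean))
print(check_no_patient_leakage(leaky))
```

An empty list means the split is leakage-free at the patient level; the same check could be run on the instance-level datasets after the `Fold` column has been merged in.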

**Note**: For fair comparison, this pre-defined split should be used in all experiments. The corresponding file is already stored in the GitHub repository:  

📁 `SepsisOnset_TraumaCohort/dataset/Fold_IDs.csv`

The following section details how this file is constructed.


In [None]:
f"Subject_id for each fold saved at {project_path_obj.fold_patient_info_path}"

'Subject_id for each fold saved at /content/drive/MyDrive/GitHub_Testing/SepsisOnset_TraumaCohort/dataset/Fold_IDs.csv'

In [None]:
from sklearn.model_selection import StratifiedKFold

def stratified_patient_split(patient_df, n_splits=5, random_state=42, is_report=True, is_saved=True):
    """
    Assigns each patient (subject) to one of `n_splits` stratified folds.

    Parameters
    ----------
    patient_df : pandas.DataFrame
        One row per patient, containing:
        - 'subject_id': Unique patient identifier.
        - 'Label': Binary patient-level label indicating sepsis presence (0 or 1).

    n_splits : int, optional
        Number of stratified folds for cross-validation (default: 5).

    random_state : int, optional
        Random seed for reproducibility (default: 42).

    is_report : bool, optional
        If True, displays the number of patients and positive patients per fold (default: True).

    is_saved : bool, optional
        If True, saves the fold assignment to `project_path_obj.fold_patient_info_path` (default: True).

    Returns
    -------
    pandas.DataFrame
        A DataFrame with the columns 'subject_id' and 'Fold'.

    Notes
    -----
    - Labels must already be aggregated to the patient level (max Label per subject) before calling.
    - StratifiedKFold keeps the sepsis prevalence of each fold close to that of the entire dataset.
    - A 'Fold' column is added to `patient_df` in place.

    Example
    -------
    ```python
    fold_df = stratified_patient_split(patient_df)
    ```
    """
    # Define Stratified 5-Fold Cross-Validation for patient-level split
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    for fold, (train_val_idx, test_idx) in enumerate(skf.split(patient_df, patient_df.Label)):
        # Assign the held-out (test) subjects of this split to the current fold
        test_subjects = patient_df.iloc[test_idx]['subject_id']
        patient_df.loc[patient_df.subject_id.isin(test_subjects), 'Fold'] = int(fold)

    # Statistics Report
    if is_report:
      fold_info_df = patient_df.groupby('Fold').agg({
          'subject_id':['nunique'],
          'Label': ['sum']}).reset_index().astype(int)
      # fold_info_df['Imbalance Ratio'] = (fold_info_df[('Label', 'sum')]/ fold_info_df[('subject_id', 'nunique')]).round(3)
      fold_info_df.columns = ['Fold', 'NumPatients', 'NumPosPatients']
      display(fold_info_df)

    if is_saved:
      patient_df[['subject_id', 'Fold']].to_csv(project_path_obj.fold_patient_info_path)

    return patient_df[['subject_id', 'Fold']]

# Aggregate Label to the patient (subject) level
subject_id_df = trum_cohort_info_df[['subject_id', 'hadm_id']].merge(sepsis_label_df[['hadm_id', 'is_sepsis']], on='hadm_id')
patient_df = subject_id_df.rename(columns={'is_sepsis': 'Label'}).groupby('subject_id').Label.max().reset_index()
patient_df = stratified_patient_split(patient_df, n_splits=5, random_state=42, is_saved=False)

Unnamed: 0,Fold,NumPatients,NumPosPatients
0,0,311,107
1,1,310,106
2,2,310,106
3,3,310,107
4,4,310,107
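
The stratification can be sanity-checked directly from the table above. The short sketch below copies the per-fold counts from that output and compares each fold's patient-level sepsis prevalence with the overall prevalence; the variable names are illustrative.

```python
# Per-fold counts copied from the fold table above.
num_patients = [311, 310, 310, 310, 310]
num_pos_patients = [107, 106, 106, 107, 107]

# Overall patient-level sepsis prevalence across all folds.
overall = sum(num_pos_patients) / sum(num_patients)

# Prevalence within each fold.
rates = [pos / n for pos, n in zip(num_pos_patients, num_patients)]
for fold, rate in enumerate(rates):
    print(f"Fold {fold}: prevalence={rate:.3f} (overall={overall:.3f})")

# Largest deviation of any fold from the overall prevalence.
max_gap = max(abs(rate - overall) for rate in rates)
print(f"Largest deviation from overall prevalence: {max_gap:.4f}")
```

Every fold stays within a fraction of a percentage point of the overall prevalence, which is the behavior expected from a stratified split.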


In [None]:
# patient_df.head(5)

In [None]:
# data_wo_nan_df = data_wo_nan.copy()
# data_with_nan_df = data_with_nan.copy()
# data_wo_nan_df.subject_id.nunique(), data_with_nan_df.subject_id.nunique(), patient_df.subject_id.nunique()


# Integration and Execution: Dataset Construction

In [None]:
def dataset_construction(project_path_obj, project_id,
                         night_time_window = [18, 6],
                         feature_li = ['HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2', 'Glucose', 'FiO2'],
                         is_report=True, is_saved=True):
    """
    Constructs and saves two datasets:
    - One with NaN values retained.
    - One with NaN values filled.

    Each dataset contains the following columns:
    - Temporal Features: Multivariate time-series input data with shape (# of timestamps, # of features).
    - Label: Binary (0/1) indicating the output class.
    - Fold: Index of the cross-validation fold to which the instance's patient (`subject_id`) is assigned.

    Each row represents a nighttime instance, associated with patient identifiers (`subject_id`, `hadm_id`) and a timestamp (`Night`).

    Parameters:
    -----------
    project_path_obj : object
        Provides paths to processed data files.
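    project_id : str
        Project identifier for BigQuery database access.
    night_time_window : list of int, optional (default=[18, 6])
        Start and end hour of the night.
    feature_li : list of str, optional
        Names of the vital-sign features to include in each instance.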
    is_report : bool, optional (default=True)
        If True, generates and prints dataset statistics.
    is_saved : bool, optional (default=True)
        If True, saves the generated datasets.

    Returns:
    --------
    tuple of DataFrames:
        - DataFrame containing NaN values.
        - DataFrame with NaN values filled.
    """

    # Check if both datasets already exist
    if os.path.exists(project_path_obj.dataset_with_nan_path) and os.path.exists(project_path_obj.dataset_wo_nan_path):
        print("Both datasets already exist. Skipping dataset construction and loading existing files.")

        # Load the datasets
        data_with_nan_df = pd.read_pickle(project_path_obj.dataset_with_nan_path)
        data_wo_nan_df = pd.read_pickle(project_path_obj.dataset_wo_nan_path)

    else:
        print("Generating datasets...")

        # Load Trauma Cohort
        # Detailed explanations of the cohort extraction process can be found in `notebooks/cohort_extraction.ipynb`.
        if os.path.exists(project_path_obj.trauma_cohort_info_path):
            # Load the existing file
            trauma_ids = pd.read_csv(project_path_obj.trauma_cohort_info_path, index_col=0)
        else:
            # File does not exist, extract cohort IDs and generate statistics report
            trauma_ids = extract_trauma_cohort_ids(project_path_obj, project_id, is_report=False, is_saved=True)

        # Extract necessary columns from trauma cohort data
        trauma_cohort_info_df = trauma_ids[['subject_id', 'hadm_id', 'icustay_id', 'admittime']]

        # Load patient fold assignment
        patient_df = pd.read_csv(project_path_obj.fold_patient_info_path, index_col=0, dtype=int)

        # Generate dataset with NaN values
        print("\nGenerating N Dataset (with NaN values)...")
        data_with_nan = instance_construction(project_path_obj, project_id, trauma_cohort_info_df,
                                              night_time_window = night_time_window, feature_li = feature_li, is_fill=False,
                                              is_report=is_report)
        # Assign fold ID
        data_with_nan_df = data_with_nan.merge(patient_df, on='subject_id', how='left')

        # Generate dataset without NaN values
        print("Generating S Dataset (without NaN values)...")
        data_wo_nan = instance_construction(project_path_obj, project_id, trauma_cohort_info_df,
                                            night_time_window = night_time_window, feature_li = feature_li, is_fill=True,
                                            is_report=is_report)
        # Retain only the instances in `data_wo_nan` that are also present in `data_with_nan` (to ensure consistency)
        data_wo_nan = data_wo_nan[data_wo_nan.index.isin(data_with_nan.index)]
        # Assign fold ID
        data_wo_nan_df = data_wo_nan.merge(patient_df, on='subject_id', how='left')

        # Save datasets if required
        if is_saved:
            print(f"Saving datasets to {project_path_obj.dataset_with_nan_path}...")
            data_with_nan_df.to_pickle(project_path_obj.dataset_with_nan_path)
            print(f"Saving datasets to {project_path_obj.dataset_wo_nan_path}...")
            data_wo_nan_df.to_pickle(project_path_obj.dataset_wo_nan_path)

    # Calculate statistics per fold
    if is_report:
        for name, df in {"N dataset": data_with_nan_df, "S dataset": data_wo_nan_df}.items():
            print(f"\nDataset: {name} | Shape: {df.shape} | Unique Patients (hadm_id): {df.hadm_id.nunique()}")

            # Initialize statistics report
            report_df = pd.DataFrame(
                columns=['NumInstance', 'NumPosInstance', 'RatioPosInstance', 'NumPatient(subject_id)',
                         'NumSepPatient(subject_id)', 'RatioSepPatient(subject_id)'],
                index=['test', 'train']
            )

            # Compute fold statistics
            fold_stats = df.groupby('Fold')['Label'].agg(
                Total_Instances='count',
                Positive_Instances=lambda x: (x == 1).sum(),
                Negative_Instances=lambda x: (x == 0).sum()
            ).reset_index()

            # Calculate imbalance ratio (pos/total)
            fold_stats['Imbalance_Ratio'] = fold_stats['Positive_Instances'] / fold_stats['Total_Instances']

            # Add total row
            total_row = {
                'Fold': 'Total',
                'Total_Instances': fold_stats['Total_Instances'].sum(),
                'Positive_Instances': fold_stats['Positive_Instances'].sum(),
                'Negative_Instances': fold_stats['Negative_Instances'].sum(),
                'Imbalance_Ratio': fold_stats['Positive_Instances'].sum() / fold_stats['Total_Instances'].sum()
            }
            fold_stats = pd.concat([fold_stats, pd.DataFrame([total_row])], ignore_index=True)

            display(fold_stats)

    return data_with_nan_df, data_wo_nan_df

# Example usage
data_with_nan_df, data_wo_nan_df = dataset_construction(project_path_obj, PROJECT_ID,
                                                        night_time_window = [18, 6],
                                                        feature_li = ['HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2',
                                                                      # 'Glucose', 'FiO2'
                                                                      ],
                                                        is_report=True)

Both datasets already exist. Skipping dataset construction and loading existing files.

Dataset: N dataset | Shape: (8977, 9) | Unique Patients (hadm_id): 1535


Unnamed: 0,Fold,Total_Instances,Positive_Instances,Negative_Instances,Imbalance_Ratio
0,0,1788,92,1696,0.051454
1,1,1813,90,1723,0.049641
2,2,1825,94,1731,0.051507
3,3,1740,92,1648,0.052874
4,4,1811,87,1724,0.04804
5,Total,8977,455,8522,0.050685



Dataset: S dataset | Shape: (8759, 9) | Unique Patients (hadm_id): 1522


Unnamed: 0,Fold,Total_Instances,Positive_Instances,Negative_Instances,Imbalance_Ratio
0,0,1735,88,1647,0.05072
1,1,1763,87,1676,0.049348
2,2,1797,93,1704,0.051753
3,3,1711,90,1621,0.052601
4,4,1753,82,1671,0.046777
5,Total,8759,440,8319,0.050234
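
With the `Fold` column in place, each cross-validation round can hold out one fold as the test set and train on the rest. The generator below is a minimal sketch of that usage on a toy frame; `iter_fold_splits` is a hypothetical helper, not a function from the repository, and it assumes every row has a fold assigned.

```python
import pandas as pd

def iter_fold_splits(df, fold_col="Fold"):
    """Yield (fold, train_df, test_df), holding out one fold at a time."""
    for fold in sorted(df[fold_col].dropna().unique()):
        is_test = df[fold_col] == fold
        yield int(fold), df[~is_test], df[is_test]

# Toy instance-level frame: several nights per patient, one fold per patient.
toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3, 4],
    "Fold":       [0, 0, 1, 2, 2, 1],
    "Label":      [0, 1, 0, 0, 0, 1],
})

splits = list(iter_fold_splits(toy))
for fold, train_df, test_df in splits:
    print(f"Fold {fold}: train={len(train_df)} instances, test={len(test_df)} instances")
```

Because folds are assigned per `subject_id`, all nights of a patient move between train and test together.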


In [None]:
sample = data_wo_nan_df['Temporal Features'][0]
sample.shape, sample

((13, 7),
 array([[ 90.        , 142.        ,  73.        ,  99.        ,
          16.        ,  37.55555471,  96.        ],
        [ 84.        ,  99.        ,  57.        ,  73.        ,
          14.        ,  37.55555471,  96.        ],
        [ 90.        , 109.        ,  62.        ,  81.        ,
          17.        ,  37.05555386,  95.        ],
        [111.5       , 139.        ,  72.5       , 100.5       ,
          12.75      ,  37.05555386,  96.5       ],
        [ 87.        , 105.        ,  55.        ,  73.        ,
          17.        ,  37.05555386,  93.        ],
        [ 80.        ,  90.        ,  47.        ,  62.        ,
          14.        ,  37.05555386,  93.        ],
        [ 84.        , 114.        ,  58.        ,  79.        ,
          15.        ,  36.88888974,  96.        ],
        [107.        , 153.        ,  78.        , 110.        ,
          22.        ,  36.88888974,  94.        ],
        [ 87.        , 111.        ,  59.        ,  78
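
Each entry of `Temporal Features` is a 2D array of shape (hours, features), here (13, 7). For model training it is often convenient to stack these into a single 3D array. The sketch below shows one way to do this on a toy frame with placeholder values; it assumes all instances share the same shape, and the names `X` and `y` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: two instances, each a (13 hours x 7 features) array of placeholder values.
toy = pd.DataFrame({
    "Temporal Features": [np.zeros((13, 7)), np.ones((13, 7))],
    "Label": [0, 1],
})

# Stack the per-instance arrays into shape (num_instances, hours, features).
X = np.stack(toy["Temporal Features"].tolist())
y = toy["Label"].to_numpy()

print(X.shape, y.shape)
```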

In [None]:
data_wo_nan_df

Unnamed: 0,subject_id,hadm_id,Date,Night,Temporal Features,isNan,Label,onset_datetime,Fold
0,19984,100132,2179-03-06,002,"[[90.0, 142.0, 73.0, 99.0, 16.0, 37.5555547078...","[[0, 0, 0, 0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, ...",0,,3
1,19984,100132,2179-03-07,003,"[[97.0, 129.0, 61.0, 86.0, 20.0, 36.2222205268...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,,3
2,19984,100132,2179-03-08,004,"[[68.0, 87.0, 47.0, 60.33330154418945, 15.0, 3...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,,3
3,19984,100132,2179-03-09,005,"[[96.0, 133.0, 71.0, 91.6667022705078, 22.0, 3...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,,3
4,19984,100132,2179-03-10,006,"[[107.0, 188.0, 85.0, 119.33300018310548, 23.0...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,,3
...,...,...,...,...,...,...,...,...,...
8754,5726,199931,2199-07-12,003,"[[86.0, 126.0, 58.0, 79.0, 23.0, 37.1111128065...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2199-07-24 05:16:00,3
8755,5726,199931,2199-07-13,004,"[[83.0, 152.0, 62.0, 92.0, 31.0, 36.9444444444...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2199-07-24 05:16:00,3
8756,5726,199931,2199-07-14,005,"[[85.0, 135.0, 57.0, 83.0, 32.0, 36.7777760823...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2199-07-24 05:16:00,3
8757,5726,199931,2199-07-15,006,"[[70.0, 173.0, 84.0, 113.66699981689452, 24.0,...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2199-07-24 05:16:00,3


In [None]:
data_with_nan_df

Unnamed: 0,subject_id,hadm_id,Date,Night,Temporal Features,isNan,Label,onset_datetime,Fold
0,19984,100132,2179-03-06,002,"[[90.0, 142.0, 73.0, 99.0, 16.0, nan, 96.0], [...","[[0, 0, 0, 0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, ...",0,,3
1,19984,100132,2179-03-07,003,"[[97.0, 129.0, 61.0, 86.0, 20.0, nan, 95.0], [...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,,3
2,19984,100132,2179-03-08,004,"[[68.0, 87.0, 47.0, 60.33330154418945, 15.0, n...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,,3
3,19984,100132,2179-03-09,005,"[[96.0, 133.0, 71.0, 91.6667022705078, 22.0, n...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,,3
4,19984,100132,2179-03-10,006,"[[107.0, 188.0, 85.0, 119.33300018310548, 23.0...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,,3
...,...,...,...,...,...,...,...,...,...
8972,5726,199931,2199-07-12,003,"[[86.0, 126.0, 58.0, 79.0, 23.0, nan, 94.0], [...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2199-07-24 05:16:00,3
8973,5726,199931,2199-07-13,004,"[[83.0, 152.0, 62.0, 92.0, 31.0, nan, 98.0], [...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2199-07-24 05:16:00,3
8974,5726,199931,2199-07-14,005,"[[85.0, 135.0, 57.0, 83.0, 32.0, nan, 96.0], [...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2199-07-24 05:16:00,3
8975,5726,199931,2199-07-15,006,"[[70.0, 173.0, 84.0, 113.66699981689452, 24.0,...","[[0, 0, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 0, ...",0,2199-07-24 05:16:00,3


# Dropped/missing data

## 1 patient in Section 1.2

In **Section 1.2: Extract and Process Nighttime Data**, for the dataset without missing values, we dropped one patient whose missing data persisted even after applying the filling method. This was caused by **a significant amount of missing data** in the 'SysBP', 'DiasBP', 'TempC', 'SpO2', 'Glucose', and 'FiO2' feature columns. On each of the 13 days, at least one of these features was missing at every timestamp in the filling window (from 7 a.m. to 6 a.m. the next day), so the gap could not be filled.

The missing data is likely due to human error, such as forgetting to document information or incorrect data entry, or technical issues, such as errors during data transfer, storage, or extraction. Given the high-pressure working environment in the ICU, such gaps in data collection are unavoidable. Dropping this patient does not affect the overall quality of the dataset, as the vast majority of other patients have complete or adequately filled records, ensuring a reliable analysis. Furthermore, this patient is included in the dataset with missing values, which is the version we recommend for study.
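
A minimal sketch of how this condition could be checked, assuming it means that on each day at least one feature has no observed value at any timestamp. The helper `fully_missing_features_by_day` and the toy frame are hypothetical and are not part of the notebook's pipeline; only the `Day` column and the feature names mirror `raw_vs`.

```python
import numpy as np
import pandas as pd


def fully_missing_features_by_day(df, features, day_col="Day"):
    """Map each day to the features that have no observed value on that day."""
    # True where a feature is NaN in every record of the day
    all_nan = df[features].isna().groupby(df[day_col]).all()
    return {day: [f for f in features if row[f]] for day, row in all_nan.iterrows()}


# Toy data (hypothetical): TempC is never observed on day 1
toy = pd.DataFrame({
    "Day":   [1, 1, 2, 2],
    "SysBP": [120.0, np.nan, 118.0, 121.0],
    "TempC": [np.nan, np.nan, 37.1, np.nan],
})
result = fully_missing_features_by_day(toy, ["SysBP", "TempC"])
print(result)
```

A day with a non-empty list cannot be completed by any fill that only uses values from the same window, which is the situation described for this patient.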

In [None]:
df = raw_vs[raw_vs.hadm_id == 124142]
print(f"num of Days: {df.Day.nunique()}")
print(f"num of records: {df.shape[0]}")
display(df[['HeartRate', 'SysBP', 'DiasBP', 'MeanBP', 'RespRate', 'TempC', 'SpO2', 'Glucose', 'FiO2']].isna().sum())
df

num of Days: 13
num of records: 468


Unnamed: 0,0
HeartRate,170
SysBP,364
DiasBP,364
MeanBP,111
RespRate,173
TempC,416
SpO2,222
Glucose,448
FiO2,422


Unnamed: 0,subject_id,hadm_id,Date,Day,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2
401035,47643,124142,2130-07-28,1,19,,,,,,,,,100.0
401036,47643,124142,2130-07-29,2,1,,,,,26.0,,,,
401037,47643,124142,2130-07-29,2,1,73.0,,,,,,,,
401038,47643,124142,2130-07-29,2,1,,,,,14.0,,,,
401039,47643,124142,2130-07-29,2,1,,,,,,,72.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401498,47643,124142,2130-08-09,13,17,70.0,103.0,59.0,64.0,26.0,,,,
401499,47643,124142,2130-08-09,13,19,77.0,,,,15.0,,,,
401500,47643,124142,2130-08-09,13,20,76.0,119.0,47.0,,18.0,,,,
401501,47643,124142,2130-08-09,13,20,80.0,,,,19.0,37.055556,,,


## 9 patients in Section 1.3
* Patients dropped because they only have records after day 14
  > 9 patients were dropped from the **dataset with missing values**.
  > Their first ICU stay started after day 14 of the hospital admission.
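
The same filter can be sketched without `groupby(...).apply`, assuming `Night` holds integers: an admission has records only after the cutoff exactly when its smallest night index is at least 14. The helper `admissions_only_after` and the toy data below are illustrative and not part of the notebook.

```python
import pandas as pd


def admissions_only_after(nights_df, cutoff=14, id_col="hadm_id", night_col="Night"):
    """Return admission ids whose earliest recorded night is >= cutoff."""
    first_night = nights_df.groupby(id_col)[night_col].min()
    return first_night[first_night >= cutoff].index.tolist()


# Toy data (hypothetical ids): admission 1 has early nights, 2 and 3 do not
toy = pd.DataFrame({
    "hadm_id": [1, 1, 1, 2, 2, 3, 3],
    "Night":   [2, 3, 15, 19, 20, 14, 16],
})
late_only = admissions_only_after(toy)
print(late_only)
```

Using a plain `min()` aggregation keeps the result a flat Series and avoids passing the grouping columns into a lambda.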



In [None]:
# Filter dropped patients
night_check = data_w_null.groupby(['subject_id', 'hadm_id']).apply(lambda df: df.Night.unique()).reset_index()
night_check.rename(columns={0:'Night_unique'}, inplace=True)
night_check['only_after_day14'] = night_check.Night_unique.apply(lambda x: len(x[x<14])==0)

# Merge with relevant info
droped_after14days = night_check.loc[night_check.only_after_day14, ['hadm_id', 'Night_unique']].merge(sepsis_label_df[['hadm_id', 'onset_day', 'onset_datetime']], on='hadm_id')

print(droped_after14days.shape)
droped_after14days

(9, 4)




Unnamed: 0,hadm_id,Night_unique,onset_day,onset_datetime
0,156050,"[42, 43, 44, 45]",,
1,157559,"[29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 4...",35.0,2129-05-22 08:35:00
2,159858,"[26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]",23.0,2187-08-14 13:27:00
3,146480,"[69, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 1...",33.0,2175-01-27 00:00:00
4,161643,"[23, 24, 25, 26, 28, 29, 30, 31, 38, 39, 42, 43]",,
5,155470,"[19, 20, 21, 22, 23, 24, 25, 26, 27]",18.0,2185-12-24 03:45:00
6,196517,"[35, 36, 37, 38, 39, 40, 71, 72, 73, 74, 75]",11.0,2156-12-20 11:20:00
7,173748,"[19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 31, 3...",,
8,118886,"[32, 33, 34, 41, 42, 43, 44, 45, 46, 47, 48, 4...",26.0,2137-05-08 01:39:00


In [None]:
# Patient's details
query = """
SELECT
    admissions.hadm_id, admissions.admittime, admissions.dischtime, admissions.deathtime, admissions.edregtime, admissions.edouttime,
    icustays.icustay_id , icustays.intime, icustays.outtime, icustays.los
FROM
    `physionet-data.mimiciii_clinical.admissions` AS admissions
JOIN
    `physionet-data.mimiciii_clinical.icustays` AS icustays
ON
    admissions.hadm_id = icustays.hadm_id
WHERE
    admissions.hadm_id IN (156050, 157559, 159858, 146480, 161643, 155470, 196517, 173748, 118886)
"""
patient_info = data_utils.run_query(query, PROJECT_ID)
patient_info['icu_start_day'] = (patient_info.intime.dt.date - patient_info.admittime.dt.date).apply(lambda x: x.days)

# # display
# for id in droped_df.hadm_id[:]:
#   print("\nhadm_id:", id, )
#   display(droped_df[droped_df.hadm_id==id])
#   display(patient_info.loc[patient_info.hadm_id==id, ['hadm_id', 'admittime', 'dischtime', 'deathtime', 'edregtime', 'edouttime']].drop_duplicates())
#   display(patient_info.loc[patient_info.hadm_id==id, ['icustay_id', 'intime', 'outtime', 'los', 'icu_start_day']].sort_values('intime'))



In [None]:
df = droped_after14days.merge(patient_info[['hadm_id', 'icu_start_day']], on='hadm_id').sort_values('icu_start_day')
df.shape, display(df)

Unnamed: 0,hadm_id,Night_unique,onset_day,onset_datetime,icu_start_day
12,155470,"[19, 20, 21, 22, 23, 24, 25, 26, 27]",18.0,2185-12-24 03:45:00,18
15,173748,"[19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 31, 3...",,,18
8,161643,"[23, 24, 25, 26, 28, 29, 30, 31, 38, 39, 42, 43]",,,23
5,159858,"[26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]",23.0,2187-08-14 13:27:00,25
10,161643,"[23, 24, 25, 26, 28, 29, 30, 31, 38, 39, 42, 43]",,,28
4,157559,"[29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 4...",35.0,2129-05-22 08:35:00,28
16,173748,"[19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 31, 3...",,,30
18,118886,"[32, 33, 34, 41, 42, 43, 44, 45, 46, 47, 48, 4...",26.0,2137-05-08 01:39:00,31
14,196517,"[35, 36, 37, 38, 39, 40, 71, 72, 73, 74, 75]",11.0,2156-12-20 11:20:00,35
9,161643,"[23, 24, 25, 26, 28, 29, 30, 31, 38, 39, 42, 43]",,,37


((19, 5), None)

In [None]:
# Sample: sepsis onset before the first ICU stay
id = 118886
display(droped_after14days[droped_after14days.hadm_id==id])
display(patient_info.loc[patient_info.hadm_id==id, ['hadm_id', 'admittime', 'icustay_id', 'intime', 'outtime', 'icu_start_day']].sort_values('icu_start_day'))
display(raw_vs[raw_vs.hadm_id==id].head())

Unnamed: 0,hadm_id,Night_unique,onset_day,onset_datetime
8,118886,"[32, 33, 34, 41, 42, 43, 44, 45, 46, 47, 48, 4...",26.0,2137-05-08 01:39:00


Unnamed: 0,hadm_id,admittime,icustay_id,intime,outtime,icu_start_day
17,118886,2137-04-13 13:54:00,232398,2137-05-14 14:23:21,2137-05-17 14:22:28,31
16,118886,2137-04-13 13:54:00,271628,2137-05-23 23:47:20,2137-06-03 18:16:39,40


Unnamed: 0,subject_id,hadm_id,Date,Day,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2
581763,77471,118886,2137-05-14,32,14,59.0,107.0,55.0,68.0,15.0,,,,40.0
581764,77471,118886,2137-05-14,32,14,,,,,,,100.0,,
581765,77471,118886,2137-05-14,32,14,81.0,,,,17.0,35.222222,100.0,,
581766,77471,118886,2137-05-14,32,15,66.0,104.0,55.0,66.0,14.0,,100.0,,
581767,77471,118886,2137-05-14,32,16,71.0,112.0,66.0,76.0,15.0,,100.0,,


## 26 patients in Section 2.2 (N dataset)

* There are 169 samples from 26 patients who only have records after sepsis onset.
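
As a rough sketch of the check behind this count: a night counts as "before onset" only if it ends before the onset timestamp. The night-end offset (the following day at 06:59:59) follows the cell below; the helper name `admissions_without_pre_onset_night` and the toy frames are assumptions made for illustration.

```python
import pandas as pd


def admissions_without_pre_onset_night(nights, onsets):
    """Return admission ids for which no recorded night ends before sepsis onset."""
    merged = nights.merge(onsets, on="hadm_id")
    # A night starting on `Date` is treated as ending the next day at 06:59:59
    night_end = pd.to_datetime(merged["Date"]) + pd.Timedelta(
        days=1, hours=6, minutes=59, seconds=59
    )
    is_pre_onset = night_end < pd.to_datetime(merged["onset_datetime"])
    has_pre_onset = is_pre_onset.groupby(merged["hadm_id"]).any()
    return has_pre_onset[~has_pre_onset].index.tolist()


# Toy data (hypothetical): admission 2 only has a night that ends after onset
nights = pd.DataFrame({
    "hadm_id": [1, 1, 2],
    "Date": ["2130-01-01", "2130-01-02", "2130-01-05"],
})
onsets = pd.DataFrame({
    "hadm_id": [1, 2],
    "onset_datetime": ["2130-01-05 10:00:00", "2130-01-05 12:00:00"],
})
only_post_onset = admissions_without_pre_onset_night(nights, onsets)
print(only_post_onset)
```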

In [None]:
print(f"In total, there are {night_ti.shape[0]} samples for {night_ti.hadm_id.nunique()} unique hospital admissions.")
mimic_data_df = assign_label2instance(night_ti, sepsis_label_df)

In total, there are 12441 samples for 1561 unique hospital admissions.
6952 Negative instances for 1032 non-sepsis patients
5489 instances for 529 sepsis patients
Dropped 3464 instances after sepsis onset
	 455 (1s) + 1570 (0s)
Final Dataset: 455(1s) + 8522(0s) = 8977 (Patients=1535)


In [None]:
# Find the 26 missing sepsis patients
nightly_ti_ids = night_ti.hadm_id.unique()
mimic_data_ids = mimic_data_df.hadm_id.unique()
nightly_ti_ids.shape, mimic_data_ids.shape

droped_afteronset_df = night_ti[~(night_ti.hadm_id.isin(mimic_data_ids))].merge(sepsis_label_df[['hadm_id', 'onset_datetime', 'onset_day']],on='hadm_id')
droped_afteronset_df['Night_end'] = pd.to_datetime(droped_afteronset_df.Date) + pd.to_timedelta(1, unit='d') + pd.to_timedelta('06:59:59')
print(f"In total, there are {droped_afteronset_df.shape[0]} samples for {droped_afteronset_df.hadm_id.nunique()} unique hospital admissions.")

# Display records from nights that end before the patient's onset
df = droped_afteronset_df.loc[droped_afteronset_df.Night_end<droped_afteronset_df.onset_datetime ]
if df.shape[0] == 0:
  print(f'missing sepsis patients due to missing data before sepsis onset')
else:
  display(df)

In total, there are 169 samples for 26 unique hospital admissions.
missing sepsis patients due to missing data before sepsis onset


## 48 patients without a locatable positive sample in Section 2.2

In [None]:
# Identify patients for whom negative samples can be located but no positive sample
sepsis_ids = sepsis_label_df[sepsis_label_df.is_sepsis == 1]#.unique()
sepsis_ids.shape

missing_pacient = sepsis_ids[
    ~(sepsis_ids.hadm_id.isin(mimic_data_df[mimic_data_df.Label==1].hadm_id)) # (455) positive sample successfully located
    & ~(sepsis_ids.hadm_id.isin(droped_after14days.hadm_id))   # (6) dropped: only has records after 14 days
    & ~(sepsis_ids.hadm_id.isin(droped_afteronset_df.hadm_id)) # (26) dropped: only has records after onset
]

missing_pos_instence = missing_pacient[['hadm_id', 'onset_datetime', 'onset_day']].merge(night_ti[['hadm_id', 'Date', 'Night']])
missing_pos_instence = missing_pos_instence.groupby(['hadm_id', 'onset_datetime',	'onset_day']).apply(lambda x: list(x.Night)).reset_index(name='Night_unique')

# Calculate the night of the positive sample (2 nights before onset_day if onset is before 07:00, else 1)
missing_pos_instence['pos_night'] = np.where(pd.to_datetime(missing_pos_instence['onset_datetime']).dt.time < time(7, 0), 2, 1)
missing_pos_instence['pos_night'] = (missing_pos_instence.onset_day - missing_pos_instence.pos_night).apply(lambda x: int(x))
display(missing_pos_instence.hadm_id.unique())
# print({missing_pos_instence.hadm_id.nunique()})

# Sanity check: rows whose positive night exists in the data (expected to be empty)
missing_pos_instence[missing_pos_instence.apply(lambda p: p.pos_night in p.Night_unique, axis=1)]



array([100619, 104665, 106591, 117412, 120032, 121701, 123562, 125256,
       129470, 131246, 132275, 135091, 136740, 138137, 138787, 139953,
       140482, 141976, 143113, 144855, 144894, 147742, 152253, 152517,
       158834, 163158, 164156, 164563, 164729, 166362, 168331, 169240,
       171956, 173136, 173453, 175706, 175881, 176342, 178038, 180714,
       180992, 186637, 190379, 191606, 193172, 193534, 195694, 199931])

Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night


In [None]:
# Patient's details
query = f"""
SELECT
    admissions.hadm_id, admissions.admittime, admissions.dischtime, admissions.deathtime, admissions.edregtime, admissions.edouttime,
    icustays.icustay_id , icustays.intime, icustays.outtime, icustays.los
FROM
    `physionet-data.mimiciii_clinical.admissions` AS admissions
JOIN
    `physionet-data.mimiciii_clinical.icustays` AS icustays
ON
    admissions.hadm_id = icustays.hadm_id
WHERE
    admissions.hadm_id IN (100619, 104665, 106591, 117412, 120032, 121701, 123562, 125256,
       129470, 131246, 132275, 135091, 136740, 138137, 138787, 139953,
       140482, 141976, 143113, 144855, 144894, 147742, 152253, 152517,
       158834, 163158, 164156, 164563, 164729, 166362, 168331, 169240,
       171956, 173136, 173453, 175706, 175881, 176342, 178038, 180714,
       180992, 186637, 190379, 191606, 193172, 193534, 195694, 199931)
"""
patient_info = data_utils.run_query(query, PROJECT_ID)
patient_info['icu_start_day'] = (patient_info.intime.dt.date - patient_info.admittime.dt.date).apply(lambda x: x.days)


In [None]:
# Overview
print(missing_pos_instence.shape)
missing_pos_instence

(48, 5)


Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night
0,100619,2158-05-25 19:55:00,12.0,"[2, 3, 12, 13, 14]",11
1,104665,2134-08-18 18:04:00,7.0,"[2, 3, 4, 7, 8]",6
2,106591,2145-05-07 10:11:00,5.0,"[2, 5, 6, 7, 8]",4
3,117412,2155-09-20 20:36:00,20.0,[2],19
4,120032,2133-10-17 13:00:00,20.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",19
5,121701,2154-04-04 09:25:00,16.0,"[2, 3, 5, 6, 7, 14]",15
6,123562,2164-09-15 09:51:00,11.0,"[2, 3, 4, 11, 12, 13, 14]",10
7,125256,2146-10-07 11:00:00,16.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",15
8,129470,2122-11-14 17:31:00,16.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",15
9,131246,2167-06-22 21:40:00,5.0,"[2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",4


In [None]:
# Sample
id = 164563
# display(night_ti[night_ti.hadm_id==id])
display(missing_pos_instence[missing_pos_instence.hadm_id==id])
display(patient_info.loc[patient_info.hadm_id==id, ['hadm_id', 'admittime', 'icustay_id', 'intime', 'outtime', 'icu_start_day']].sort_values('icu_start_day'))
display(raw_vs[
    (raw_vs.hadm_id==id)
    & (raw_vs.Day.isin([4, 5, 6]))
    ].head())

Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night
27,164563,2194-08-31 22:49:00,6.0,"[2, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14]",5


Unnamed: 0,hadm_id,admittime,icustay_id,intime,outtime,icu_start_day
15,164563,2194-08-26 10:35:00,253969,2194-08-26 10:36:50,2194-08-30 15:04:41,0
20,164563,2194-08-26 10:35:00,246373,2194-09-01 10:04:55,2194-09-21 16:11:53,6


Unnamed: 0,subject_id,hadm_id,Date,Day,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2
35407,3275,164563,2194-08-29,4,0,106.0,148.0,76.0,103.0,24.0,,93.0,,
35408,3275,164563,2194-08-29,4,0,115.0,163.0,82.0,112.0,24.0,,98.0,,
35409,3275,164563,2194-08-29,4,0,130.0,194.0,101.0,136.0,21.0,,95.0,,
35410,3275,164563,2194-08-29,4,0,132.0,175.0,95.0,127.0,27.0,,94.0,,
35411,3275,164563,2194-08-29,4,0,,156.0,89.0,,25.0,,,,


## 74 positive onset instances in Section 2.2

* For 74 patients, the positive instance cannot be located in the dataset without null values.
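
The positive night looked up in this section follows a simple rule in the cell below: if onset occurs before 07:00, the positive night is `onset_day - 2`; otherwise it is `onset_day - 1`. A small sketch of that rule, with a hypothetical helper name `positive_night`, checked against rows from the output table:

```python
from datetime import time

import pandas as pd


def positive_night(onset_datetime, onset_day, cutoff=time(7, 0)):
    """Night index of the positive sample for a given onset timestamp and day."""
    # Onset before the 07:00 cutoff shifts the positive night back one more day
    offset = 2 if pd.Timestamp(onset_datetime).time() < cutoff else 1
    return int(onset_day) - offset


# Rows taken from the table in this section
early = positive_night("2192-11-30 01:19:00", 5.0)   # hadm_id 110130
late = positive_night("2158-05-25 19:55:00", 12.0)   # hadm_id 100619
print(early, late)  # 3 11
```

A patient is counted as missing when this night index does not appear in the patient's list of recorded nights.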


In [None]:
# Boolean mask selecting sepsis admissions from the label table
sepsis_ids = sepsis_label_df.is_sepsis == 1

# Extract data for sepsis patients
sepsis_patient_ti_df = night_ti[night_ti['hadm_id'].isin(sepsis_label_df[sepsis_ids]['hadm_id'])]
sepsis_patient_df = sepsis_patient_ti_df.merge(sepsis_label_df[['hadm_id', 'onset_datetime', 'onset_day']], on='hadm_id')
sepsis_patient_df = sepsis_patient_df.groupby(['hadm_id', 'onset_datetime',	'onset_day']).apply(lambda x: list(x.Night)).reset_index(name='Night_unique')

# Calculate the night of the positive sample (2 nights before onset_day if onset is before 07:00, else 1)
sepsis_patient_df['pos_night'] = np.where(pd.to_datetime(sepsis_patient_df['onset_datetime']).dt.time < time(7, 0), 2, 1)
sepsis_patient_df['pos_night'] = (sepsis_patient_df.onset_day - sepsis_patient_df.pos_night).apply(lambda x: int(x))
# sepsis_patient_df
# Locate missing data
is_missing = ~(sepsis_patient_df.apply(lambda p: p.pos_night in p.Night_unique, axis=1))
missing_df = sepsis_patient_df[is_missing]
missing_df



Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night
6,100619,2158-05-25 19:55:00,12.0,"[2, 3, 12, 13, 14]",11
22,104665,2134-08-18 18:04:00,7.0,"[2, 3, 4, 7, 8]",6
33,106591,2145-05-07 10:11:00,5.0,"[2, 5, 6, 7, 8]",4
48,110130,2192-11-30 01:19:00,5.0,"[4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",3
75,116404,2111-10-12 00:00:00,4.0,"[3, 4, 5, 14]",2
...,...,...,...,...,...
501,194731,2124-01-11 09:45:00,8.0,"[10, 11, 12, 13, 14]",7
508,195694,2130-03-01 09:20:00,24.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",23
509,195956,2159-07-03 13:19:00,11.0,"[11, 12, 13, 14]",10
526,199533,2146-12-08 09:31:00,11.0,"[11, 12, 13, 14]",10


In total

In [None]:
# Missing reason 1: onset after 14 days
after14d = missing_df[missing_df.pos_night>14]
print(f"{after14d.shape[0]} patients onset after 14 days")
after14d

20 patients onset after 14 days


Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night
82,117412,2155-09-20 20:36:00,20.0,[2],19
97,120032,2133-10-17 13:00:00,20.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",19
104,121701,2154-04-04 09:25:00,16.0,"[2, 3, 5, 6, 7, 14]",15
121,125256,2146-10-07 11:00:00,16.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",15
147,129470,2122-11-14 17:31:00,16.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",15
177,135091,2125-05-15 10:25:00,27.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",26
196,138787,2132-12-22 13:30:00,18.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",17
212,141976,2172-05-25 15:30:00,16.0,"[2, 3, 4, 5]",15
217,143113,2163-09-22 16:19:00,17.0,"[11, 12, 13, 14]",16
225,144855,2175-09-11 11:50:00,29.0,"[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]",28


In [None]:
# Missing reason 2: onset before the first ICU stay
before_icustay = missing_df.apply(lambda p: p.pos_night < min(p.Night_unique), axis=1)
before_icustay_df = missing_df[before_icustay]
print(f"{before_icustay_df.shape[0]} patients onset before first ICUstay start")

# Display sample
id = 140204
display(before_icustay_df[before_icustay_df.hadm_id==id])
display(patient_info.loc[patient_info.hadm_id==id, ['hadm_id', 'admittime', 'icustay_id', 'intime', 'outtime', 'icu_start_day']].sort_values('icu_start_day'))
display(raw_vs[raw_vs.hadm_id==id].head())

26 patients onset before first ICUstay start


Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night


Unnamed: 0,hadm_id,admittime,icustay_id,intime,outtime,icu_start_day


Unnamed: 0,subject_id,hadm_id,Date,Day,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2
71887,6594,140204,2183-11-24,4,17,,,,,,,,,48.0
71888,6594,140204,2183-11-24,4,18,,,,,,,,,81.0
71889,6594,140204,2183-11-24,4,20,,,,,,,,,60.0
71890,6594,140204,2183-11-27,7,21,88.0,105.0,66.0,79.0,,37.055554,,,
71891,6594,140204,2183-11-27,7,21,87.0,106.0,77.0,86.666702,,,,,


In [None]:
# Missing reason 3: positive night falls in the gap between two ICU stays
df = missing_df[~(missing_df.hadm_id.isin(after14d.hadm_id) | missing_df.hadm_id.isin(before_icustay_df.hadm_id))]

for id in df.hadm_id[:3]:
  print("\nhadm_id:", id, )
  display(missing_df[missing_df.hadm_id==id])
  display(patient_info.loc[patient_info.hadm_id==id, ['hadm_id', 'admittime', 'icustay_id', 'intime', 'outtime', 'icu_start_day']].sort_values('icu_start_day'))

# 137668
# Sample display


hadm_id: 100619


Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night
6,100619,2158-05-25 19:55:00,12.0,"[2, 3, 12, 13, 14]",11


Unnamed: 0,hadm_id,admittime,icustay_id,intime,outtime,icu_start_day
52,100619,2158-05-14 20:54:00,254464,2158-05-14 20:55:32,2158-05-17 17:40:21,0
53,100619,2158-05-14 20:54:00,267881,2158-05-25 16:54:09,2158-06-23 11:02:46,11



hadm_id: 104665


Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night
22,104665,2134-08-18 18:04:00,7.0,"[2, 3, 4, 7, 8]",6


Unnamed: 0,hadm_id,admittime,icustay_id,intime,outtime,icu_start_day
50,104665,2134-08-12 07:31:00,239104,2134-08-12 07:32:47,2134-08-16 16:05:04,0
64,104665,2134-08-12 07:31:00,268438,2134-08-18 11:46:26,2134-08-19 22:59:42,6



hadm_id: 106591


Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night
33,106591,2145-05-07 10:11:00,5.0,"[2, 5, 6, 7, 8]",4


Unnamed: 0,hadm_id,admittime,icustay_id,intime,outtime,icu_start_day
72,106591,2145-05-03 10:40:00,209928,2145-05-03 10:41:13,2145-05-05 16:20:39,0
70,106591,2145-05-03 10:40:00,241605,2145-05-07 14:36:03,2145-05-11 02:26:40,4


In [None]:
# Missing reason 4: missing nighttime data
# for id in [134244, 147742, 163158, 171956, 198296]:
#   print("\nhadm_id:", id, )
#   display(missing_df[missing_df.hadm_id==id])
#   display(patient_info.loc[patient_info.hadm_id==id, ['hadm_id', 'admittime', 'icustay_id', 'intime', 'outtime', 'icu_start_day']].sort_values('icu_start_day'))
#   break

id = 134244
display(missing_df[missing_df.hadm_id==id])
display(patient_info.loc[patient_info.hadm_id==id, ['hadm_id', 'admittime', 'icustay_id', 'intime', 'outtime', 'icu_start_day']].sort_values('icu_start_day'))
raw_vs[raw_vs.hadm_id==id].iloc[55:61]

Unnamed: 0,hadm_id,onset_datetime,onset_day,Night_unique,pos_night


Unnamed: 0,hadm_id,admittime,icustay_id,intime,outtime,icu_start_day
71,134244,2133-03-27 16:31:00,264413,2133-03-27 16:31:37,2133-04-04 16:15:28,0


Unnamed: 0,subject_id,hadm_id,Date,Day,Hour,HeartRate,SysBP,DiasBP,MeanBP,RespRate,TempC,SpO2,Glucose,FiO2
596397,81436,134244,2133-03-29,3,17,120.0,131.0,79.0,91.0,28.0,,96.0,,
596398,81436,134244,2133-03-29,3,18,116.0,,,68.0,23.0,,98.0,,
596399,81436,134244,2133-03-29,3,18,,135.0,79.0,,,,,,
596400,81436,134244,2133-03-29,3,19,118.0,,,,32.0,,93.0,,
596401,81436,134244,2133-03-29,3,20,,,,,,37.166667,,,
596402,81436,134244,2133-03-30,4,16,113.0,,,,16.0,,,,
