# Competition overview

**Competition Summary**

This is the PHEMS Hackathon: Pediatric Sepsis Prediction, a machine learning competition focused on developing an algorithm for the early detection of sepsis in pediatric intensive care unit (PICU) patients. Participants are tasked with creating a model that can predict the onset of sepsis up to 6 hours before a clinical diagnosis, using retrospective physiological and clinical time-series data.

Early detection of sepsis is a critical, life-saving intervention. The competition's goal is to create a data-driven solution that can enhance current clinical protocols and improve patient outcomes.

The primary evaluation metric is precision-recall AUC, which is particularly important for imbalanced medical datasets. Secondary metrics like accuracy and F1-score will also be considered. In case of a tie, model interpretability, execution time, and computational resource usage will serve as tiebreakers.
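As a rough illustration of the primary metric, the sketch below computes two common summaries of the precision-recall curve with scikit-learn. The labels and scores are toy values, and which of the two summaries the competition scorer uses is not stated here, so treat the choice as an assumption.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy labels and predicted probabilities (illustrative values, not competition data)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Step-wise summary of the precision-recall curve
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the same curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc_trapezoid = auc(recall, precision)

print(f"average precision:  {ap:.4f}")
print(f"trapezoidal PR AUC: {pr_auc_trapezoid:.4f}")
```

Both summaries are computed from predicted probabilities rather than thresholded labels, which is why the ranking of scores matters more than any single cut-off.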

The competition runs from January 13, 2025, to February 5, 2025, and offers a total prize pool of 2,500€ for the top three winning teams.

### [Link to competition](https://www.kaggle.com/competitions/phems-hackathon-early-sepsis-prediction/data?select=SepsisLabel_sample_submission.csv)

# Dataset Overview

The dataset for the PHEMS Hackathon is a retrospective collection of pediatric patient data from the Pediatric Intensive Care Unit (PICU) of Hospital Sant Joan de Déu. The data is provided in a time-series format, with each row representing a measurement or event at a specific timestamp during a patient's stay. The dataset is split into training, testing, and a hidden private test set.

The data is organized into several standard tables, each providing a different category of patient information. The core objective is to use these predictor datasets to forecast the SepsisLabel for each patient at every time point in the test set. Importantly, all positive SepsisLabels in the training data have been adjusted to appear 6 hours earlier than the true onset of sepsis, aligning with the competition's goal of early prediction.
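To make the 6-hour label adjustment concrete, here is a minimal sketch of how such a shift could be produced. It assumes one row per patient per hour and a hypothetical `onset_label` column marking the true onset; the organizers' actual procedure is not documented here.

```python
import pandas as pd

def shift_labels_earlier(df, label_col="onset_label", hours=6):
    """Return labels moved `hours` rows earlier within each patient.

    Assumes one row per patient per hour, sorted by time within each patient.
    """
    def _shift(s):
        # Row t takes the label of row t + hours; the tail keeps the last label
        return s.shift(-hours).fillna(s.iloc[-1])

    return df.groupby("person_id")[label_col].transform(_shift).astype(int)

# Hypothetical patient: ten hourly rows, true onset at hour 8
demo = pd.DataFrame({
    "person_id": [1] * 10,
    "measurement_datetime": pd.Timestamp("2025-01-01")
    + pd.to_timedelta(list(range(10)), unit="h"),
    "onset_label": [0] * 8 + [1] * 2,
})
demo["SepsisLabel"] = shift_labels_earlier(demo)
print(demo["SepsisLabel"].tolist())
```

In this toy case the first positive label moves from hour 8 to hour 2, so a model trained on the shifted labels is rewarded for flagging sepsis six hours ahead.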

# Data Dictionary

The following is a data dictionary for the provided files:

**`SepsisLabel_(train|test).csv`**

This file contains the target variable for the competition.

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-k9u1{border-color:inherit;color:#1B1C1D;font-size:100%;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-k9u1">Column</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">person_id</td>
    <td class="tg-0lax">A unique identifier for each patient.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">measurement_datetime</td>
    <td class="tg-0lax">The specific timestamp of the measurement.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">SepsisLabel</td>
    <td class="tg-0lax">The binary outcome variable: 1 for a positive sepsis assessment, 0 for a negative assessment.</td>
  </tr>
</tbody>
</table>

**`devices.csv`**

This table records the usage of medical devices.

<table class="tg"><thead>
  <tr>
    <th class="tg-k9u1">Column</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">visit_occurrence_id</td>
    <td class="tg-0lax">A unique ID for a specific PICU episode.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">device_datetime_hourly</td>
    <td class="tg-0lax">The timestamp (hourly granularity) when the device was used.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">person_id</td>
    <td class="tg-0lax">A unique patient identifier.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">device</td>
    <td class="tg-0lax">The type of medical device used.</td>
  </tr>
</tbody>
</table>

**`drugsexposure.csv`**

This table contains data on drug administration.

<table class="tg"><thead>
  <tr>
    <th class="tg-k9u1">Column</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">visit_occurrence_id</td>
    <td class="tg-0lax">A unique ID for a specific PICU episode.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">person_id</td>
    <td class="tg-0lax">A unique patient identifier.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">drug_datetime_hourly</td>
    <td class="tg-0lax">The timestamp (hourly granularity) of drug administration.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">drug_concept_id</td>
    <td class="tg-0lax">An identifier for the administered drug.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">quantity</td>
    <td class="tg-0lax">The amount of the drug administered.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">dose_unit_source_value</td>
    <td class="tg-0lax">The unit of measurement for the drug dosage.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">route_concept_id</td>
    <td class="tg-0lax">The route of drug administration (e.g., intravenous).</td>
  </tr>
</tbody></table>

**`observation.csv`**

This table contains general clinical observations.

<table class="tg"><thead>
  <tr>
    <th class="tg-k9u1">Column</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">visit_occurrence_id</td>
    <td class="tg-0lax">A unique ID for a specific PICU episode.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">person_id</td>
    <td class="tg-0lax">A unique patient identifier.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">observation_datetime</td>
    <td class="tg-0lax">The timestamp when the observation was made.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">observation_concept_id</td>
    <td class="tg-0lax">A label describing the type of observation.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">valuefilled</td>
    <td class="tg-0lax">The value or description of the observation.</td>
  </tr>
</tbody>
</table>

**`person_demographics_episode.csv`**

This table contains patient demographic information.

<table class="tg"><thead>
  <tr>
    <th class="tg-k9u1">Column</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">visit_occurrence_id</td>
    <td class="tg-0lax">A unique ID for a specific PICU episode.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">person_id</td>
    <td class="tg-0lax">A unique patient identifier.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">age_in_months</td>
    <td class="tg-0lax">The patient's age in months.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">gender</td>
    <td class="tg-0lax">The patient's gender.</td>
  </tr>
</tbody>
</table>

**`measurement_lab.csv`, `measurement_meds.csv`, `measurement_observation.csv`**

These files contain lab measurements, vital signs (the notebook reads heart rate, blood pressure, and similar columns from `measurement_meds`), and clinical observations. The competition description does not list their columns. In the notebook code they are keyed by `person_id` and `measurement_datetime`, followed by one column per lab test, vital sign, or observation.
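Because the columns are not documented, a quick way to discover them is to read only the header row. The sketch below uses an in-memory CSV as a stand-in for a real file; the helper name `list_columns` and the sample columns are illustrative assumptions.

```python
import io
import pandas as pd

def list_columns(csv_source):
    """Return the column names of a CSV without loading any data rows."""
    return pd.read_csv(csv_source, nrows=0).columns.tolist()

# In-memory stand-in for a measurement file (hypothetical content)
demo_csv = io.StringIO(
    "person_id,measurement_datetime,Heart rate,Body temperature\n"
    "1,2025-01-01 00:00:00,120,37.1\n"
)
columns = list_columns(demo_csv)
print(columns)
```

On Kaggle, the same helper can be pointed at a path under `/kaggle/input/` instead of the `StringIO` object.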

### [Link to dataset](https://www.kaggle.com/competitions/phems-hackathon-early-sepsis-prediction/data?select=SepsisLabel_sample_submission.csv)

# Submission Notebook Pipeline Overview

The script executes a sequential workflow to prepare the test data and generate predictions for a machine learning competition.

- **Data Loading and Feature Engineering:** The process starts by loading the raw test-set files, including patient vitals, lab results, and medication data. Using pandas, the code merges these datasets into a single dataframe. During this stage it also engineers features such as the length of each patient's monitored stay (`duration`) and fills missing values with the mean or mode.

- **Data Transformation:** This is a critical step where the test data is transformed to match the format of the training data used for the pre-trained model.

  - Text to Embedding: It loads a pre-trained Word2Vec model to convert text-based features (e.g., device, procedure) into dense numerical vectors, so that terms used in similar contexts receive similar vectors.
  - Standardization: A previously fitted `StandardScaler` is loaded to standardize the numerical features using the mean and standard deviation learned when it was fitted. This keeps features with large numeric ranges from dominating those with small ranges.
  - Dimensionality Reduction: A previously fitted PCA (Principal Component Analysis) model is loaded to reduce the number of features. This lowers computational cost and mitigates the curse of dimensionality while retaining most of the variance in the data.
  - Time Series Conversion: The `create_time_series_dataset` function reshapes the data into a 3D array of shape `(samples, time_steps, features)`, the input shape the deep learning model expects. It does this by building a sequence of past data points for each sample.

- **Prediction and Submission:** The script loads a pre-trained TensorFlow model from a saved file. The prepared time-series data is passed to this model to generate sepsis predictions as probabilities. These probabilities are then converted into binary labels (0 or 1) using a threshold of 0.5. Finally, the predictions are formatted into a submission file as required by the competition platform and saved as `submission.csv`.
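The last two stages can be sketched with NumPy alone. The helper `make_windows` below is a hypothetical stand-in, not the notebook's `create_time_series_dataset`; in particular, padding early rows by repeating the first row is an assumption of this sketch, and the probabilities are made-up values standing in for the model output.

```python
import numpy as np

def make_windows(features, time_steps):
    """Build one window of the `time_steps` most recent rows for every row.

    Early rows are left-padded by repeating the first row (an assumption of
    this sketch), so the output has shape (samples, time_steps, features).
    """
    features = np.asarray(features, dtype=float)
    pad = np.repeat(features[:1], time_steps - 1, axis=0)
    padded = np.vstack([pad, features])
    return np.stack([padded[i:i + time_steps] for i in range(len(features))])

features = np.arange(12).reshape(6, 2)
windows = make_windows(features, time_steps=3)
print(windows.shape)  # (6, 3, 2)

# Stand-in for the model's predicted probabilities, one per sample
probabilities = np.array([0.05, 0.20, 0.51, 0.90, 0.49, 0.70])
labels = (probabilities > 0.5).astype(int)  # strict ">" at 0.5 is an assumption
print(labels)
```

Each window ends at the row it belongs to, so the prediction for a timestamp only sees that timestamp and the rows before it.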

# Review Model Architecture and Training Concepts

The model used in this notebook is a pre-trained deep learning model for time-series classification. The code does not define the architecture, and the file name (`optimized_time_series_modelV1.h5`) only indicates a Keras model saved in HDF5 format for time-series data. Recurrent networks such as LSTMs and Temporal Convolutional Networks (TCNs) are common choices for capturing temporal patterns, but the actual architecture is defined in the training notebook linked at the end of this section.

**Core Concepts**

**Time Series Classification:**

The task is to predict a future event (sepsis) based on a sequence of past data points. The model needs to learn patterns and dependencies over time.

**Feature Engineering:** 

This process involves using domain knowledge to create new features from existing data. For this project, features such as the length of a patient's monitored stay in hours and the mean hour of day at which devices were recorded are created as additional signals of a patient's condition.

**Word Embeddings:** 

This is a concept from Natural Language Processing (NLP) that is applied to the categorical text data. Word2Vec learns to represent words as numerical vectors, where words with similar meanings have similar vectors. By averaging these vectors for a given patient's history, the model can infer a patient's overall medical state.
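The averaging step can be sketched as follows. A plain dictionary stands in for the trained embedding lookup so that the example runs without gensim; with gensim, a trained model's `model.wv` supports the same membership test and indexing. The helper name, the tokens, and the two-dimensional vectors are illustrative assumptions.

```python
import numpy as np

def average_embedding(text, vectors, dim):
    """Average the vectors of the comma-separated terms found in `vectors`.

    Terms without a vector are skipped; if none are known, a zero vector
    is returned.
    """
    tokens = [t.strip() for t in text.split(",") if t.strip()]
    known = [np.asarray(vectors[t], dtype=float) for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy lookup table standing in for a trained Word2Vec vocabulary
toy_vectors = {"ventilator": [1.0, 0.0], "catheter": [0.0, 1.0]}

patient_vector = average_embedding("ventilator, catheter, unknown", toy_vectors, dim=2)
empty_vector = average_embedding("unknown", toy_vectors, dim=2)
print(patient_vector, empty_vector)
```

Averaging yields a fixed-length vector regardless of how many terms a patient has, which is what lets the result be concatenated with the numerical features.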

**Dimensionality Reduction:** 

PCA works by finding new, uncorrelated dimensions (called principal components) that capture the maximum variance in the data. This reduces the number of features while retaining most of the important information, which is essential when working with high-dimensional data like electronic health records.
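A minimal sketch of the fit-on-train, transform-on-test pattern is shown below with synthetic data. The feature count, the number of components, and the data itself are assumptions chosen for illustration; in the notebook the fitted objects are loaded from disk rather than fitted in place.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 1.0, 1.0])  # features on very different scales
X_train = rng.normal(size=(200, 5)) * scales
X_test = rng.normal(size=(20, 5)) * scales

# Fit both transformers on the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
pca = PCA(n_components=2).fit(X_train_scaled)

# Apply the already-fitted transformers to the test data
X_test_reduced = pca.transform(scaler.transform(X_test))
explained = pca.explained_variance_ratio_.sum()

print(X_test_reduced.shape)
print(f"variance retained by 2 components: {explained:.2f}")
```

Reusing the fitted scaler and PCA on the test set keeps the test features in the same coordinate system the model saw during training.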

**Model Training:** 

The pre-trained model was trained on a separate dataset (the training data) using a process that involved an optimizer (e.g., Adam) and a loss function (e.g., Binary Cross-Entropy) to minimize prediction error. Techniques like Early Stopping and Dropout were likely used to prevent overfitting, ensuring the model generalizes well to new data rather than just memorizing the training examples.

### [Model Training Notebook](https://www.kaggle.com/code/misterfour/phems-hackathon-model-training)

# Import libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Libraries for embeddings, saved preprocessing objects, and the model
# (numpy and pandas are already imported above)
from gensim.models import Word2Vec
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import tensorflow as tf

/kaggle/input/phems-hackathon-early-sepsis-prediction/SepsisLabel_sample_submission.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/drugsexposure_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/observation_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/devices_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/measurement_lab_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/proceduresoccurrences_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/person_demographics_episode_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/SepsisLabel_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/measurement_observation_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/training_data/measurement_meds_train.csv
/kaggle/input/phems-hackathon-early-sepsis-prediction/tes



# Data loading

This step is the data preparation pipeline for a neural network model. It loads and merges datasets covering patient records, medical devices, lab results, vital signs, medications, and procedures. The core idea is feature engineering: using domain knowledge to derive features that help a model interpret the data. The step turns raw, fragmented tables into a single, cohesive dataset.

**Step-By-Step Breakdown**

**1. Initial Data Loading and Device Information**

The pipeline begins by loading the main `SepsisLabel_test.csv` file, which contains core patient and timestamp information. It then loads `devices_test.csv` and merges it.

- **Duration Calculation:** The code computes how long each person was monitored, in hours, by grouping by `person_id` and taking the difference between the earliest and latest measurement timestamps.
- **Device Aggregation:** It groups the device records by `person_id` and joins the unique device names into one comma-separated string per patient. It also extracts the hour of day from `device_datetime_hourly` and stores its per-patient mean as `device_mean_hour`.
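The two aggregations above can be tried on a toy example. The data is made up, and the duration is computed with a vectorized `transform("max") - transform("min")` form, which is intended to match the notebook's lambda-based calculation but is not the notebook's exact code.

```python
import pandas as pd

labels = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2],
    "measurement_datetime": pd.to_datetime([
        "2025-01-01 00:00", "2025-01-01 02:00", "2025-01-01 05:00",
        "2025-01-02 10:00", "2025-01-02 11:30",
    ]),
})
devices = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "device": ["ventilator", "catheter", "ventilator", "catheter"],
})

# Monitoring span per patient in hours, repeated on every row of that patient
by_person = labels.groupby("person_id")["measurement_datetime"]
span = by_person.transform("max") - by_person.transform("min")
labels["duration"] = span.dt.total_seconds() / 3600

# One comma-separated string of unique device names per patient
device_summary = (
    devices.groupby("person_id")["device"]
    .agg(lambda x: ", ".join(x.unique()))
    .reset_index()
)
print(labels)
print(device_summary)
```

Note that `duration` is a patient-level constant: it uses the patient's last timestamp, so every row of that patient carries the same value.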

**2. Merging Lab and Medical Data**

This section loads and aggregates data from two more sources: `measurement_lab_test.csv` (lab results) and `measurement_meds_test.csv` (vital signs and oxygen measurements).

- **Data Consolidation:** For both files, the code groups by `person_id` and `measurement_datetime` and applies `.sum()` to the lab tests (e.g., Bilirubin, Platelet count) and vital signs (e.g., Systolic blood pressure, Heart rate), so that each patient has at most one row per timestamp. Note that `.sum()` adds duplicate readings together and, with pandas defaults, returns 0 rather than NaN when every value in a group is missing.
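The behavior of `.sum()` in this consolidation step is easy to check on a toy frame. The readings below are invented, and the `min_count=1` and `.mean()` variants are shown only for comparison; they are not what the notebook uses.

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "person_id": [1, 1, 1],
    "measurement_datetime": pd.to_datetime(
        ["2025-01-01 00:00", "2025-01-01 00:00", "2025-01-01 01:00"]
    ),
    "Heart rate": [120.0, 124.0, np.nan],
})
keys = ["person_id", "measurement_datetime"]

# Default sum: duplicates are added, an all-missing group becomes 0
summed = readings.groupby(keys)["Heart rate"].sum().reset_index()

# min_count=1 keeps an all-missing group as NaN
summed_keep_nan = readings.groupby(keys)["Heart rate"].sum(min_count=1).reset_index()

# Mean averages duplicates instead of adding them
averaged = readings.groupby(keys)["Heart rate"].mean().reset_index()

print(summed, summed_keep_nan, averaged, sep="\n\n")
```

Since the saved scaler, PCA, and model were fitted on data prepared the same way, changing the aggregation at inference time alone would shift the feature distribution; any change would need to be applied to training as well.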

**3. Merging Observations and Procedures**

The pipeline continues to build the master dataset by merging data from two more files: `measurement_observation_test.csv` and `proceduresoccurrences_test.csv`.

- **Observation Aggregation:** It follows the same pattern of grouping by `person_id` and `measurement_datetime` and then summing values for observations like Glasgow coma scale and pupil diameter.
- **Procedure Aggregation:** It computes the time span in hours between each patient's first and last recorded procedure and joins the unique procedure names into one string per patient. This gives the model insight into the types of medical interventions a patient has undergone.

**4. Merging Drug and General Observation Data**

Finally, the pipeline incorporates information on drug exposure and general medical observations.

- **Drug Exposure:** The `drugsexposure_test.csv` file is loaded. The code computes the time span in hours between each patient's first and last drug administration and aggregates the unique drug concept IDs and routes into one string each per patient.
- **General Observations:** The `observation_test.csv` file is loaded and aggregated in the same manner, computing the observation time span and listing unique observation concepts and values.

The result of this process is a single wide DataFrame (`merged_test`) organized by `person_id` and `measurement_datetime`, combining timestamp-level measurements with patient-level summaries. It is the input to the feature engineering steps that follow.
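The merge pattern used throughout this step, a left join of patient-level summaries onto the timestamp-level label table, can be sketched as follows. The frames are toy data, and the `validate` argument is an optional safety check added for this sketch; the notebook does not use it.

```python
import pandas as pd

timeline = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "measurement_datetime": pd.to_datetime([
        "2025-01-01 00:00", "2025-01-01 01:00",
        "2025-01-01 00:00", "2025-01-01 00:00",
    ]),
})
patient_summary = pd.DataFrame({
    "person_id": [1, 2],
    "device": ["ventilator, catheter", "catheter"],
})

# how="left" keeps every label row; validate raises if a patient appears
# more than once in the summary, which would silently duplicate label rows
merged = pd.merge(
    timeline, patient_summary, on="person_id", how="left", validate="many_to_one"
)
print(merged)
```

Patients without a summary row (person 3 here) keep their label rows and receive NaN, which is why the imputation step that follows is needed.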

In [2]:
# Load the test labels and parse the timestamps once
test_sepsislabel = pd.read_csv('/kaggle/input/phems-hackathon-early-sepsis-prediction/testing_data/SepsisLabel_test.csv')
test_sepsislabel['measurement_datetime'] = pd.to_datetime(test_sepsislabel['measurement_datetime'])
# Keep an unmodified copy of the label rows before features are added
test_submission = test_sepsislabel.copy()

# Monitoring span per patient in hours (latest minus earliest timestamp)
test_sepsislabel['duration'] = test_sepsislabel.groupby('person_id')['measurement_datetime'].transform(lambda x: (x.max() - x.min()).total_seconds()/3600)
# Load device usage and join each patient's unique device names into one string
test_devices = pd.read_csv('/kaggle/input/phems-hackathon-early-sepsis-prediction/testing_data/devices_test.csv')
test_devices_grouped = test_devices.groupby('person_id')['device'].agg(lambda x: ', '.join(x.unique())).reset_index()

# Drop specified columns
test_devices_hr = test_devices.copy()
test_devices_hr = test_devices_hr.drop(columns=['visit_occurrence_id', 'device'])

# Convert 'device_datetime_hourly' to datetime objects
test_devices_hr['device_datetime_hourly'] = pd.to_datetime(test_devices_hr['device_datetime_hourly'])

# Extract the hour from the datetime column
test_devices_hr['device_hour'] = test_devices_hr['device_datetime_hourly'].dt.hour

# Group by 'person_id' and calculate the mean hour
mean_hours = test_devices_hr.groupby('person_id')['device_hour'].mean().reset_index()
mean_hours['device_mean_hour'] = mean_hours['device_hour']

# Merge the mean hour back into the original DataFrame
test_devices_hr = pd.merge(test_devices_hr, mean_hours[['person_id', 'device_mean_hour']], on='person_id', how='left')

# Remove duplicate 'person_id' rows (keeping the first occurrence)
test_devices_hr = test_devices_hr.drop_duplicates(subset='person_id')
test_devices_hr = test_devices_hr.drop(columns='device_datetime_hourly')
test_devices_merge = pd.merge(test_devices_grouped, test_devices_hr, on='person_id', how='left')
merged_test = pd.merge(test_sepsislabel, test_devices_merge, on='person_id', how='left')


test_lab=pd.read_csv('/kaggle/input/phems-hackathon-early-sepsis-prediction/testing_data/measurement_lab_test.csv')
test_lab = test_lab.groupby(['person_id','measurement_datetime'])[['Base excess in Venous blood by calculation',
       'Base excess in Arterial blood by calculation',
       'Phosphate [Moles/volume] in Serum or Plasma',
       'Potassium [Moles/volume] in Blood',
       'Bilirubin.total [Moles/volume] in Serum or Plasma',
       'Neutrophil Ab [Units/volume] in Serum',
       'Bicarbonate [Moles/volume] in Arterial blood',
       'Hematocrit [Volume Fraction] of Blood',
       'Glucose [Moles/volume] in Serum or Plasma',
       'Calcium [Moles/volume] in Serum or Plasma',
       'Chloride [Moles/volume] in Blood',
       'Sodium [Moles/volume] in Serum or Plasma',
       'C reactive protein [Mass/volume] in Serum or Plasma',
       'Carbon dioxide [Partial pressure] in Venous blood',
       'Oxygen [Partial pressure] in Venous blood',
       'Albumin [Mass/volume] in Serum or Plasma',
       'Bicarbonate [Moles/volume] in Venous blood',
       'Oxygen [Partial pressure] in Arterial blood',
       'Carbon dioxide [Partial pressure] in Arterial blood',
       'Interleukin 6 [Mass/volume] in Body fluid',
       'Magnesium [Moles/volume] in Blood', 'Prothrombin time (PT)',
       'Procalcitonin [Mass/volume] in Serum or Plasma',
       'Lactate [Moles/volume] in Blood', 'Creatinine [Mass/volume] in Blood',
       'Fibrinogen measurement', 'Bilirubin measurement',
       'Partial thromboplastin time', ' activated', 'Total white blood count',
       'Platelet count', 'White blood cell count', 'Blood venous pH',
       'D-dimer level', 'Blood arterial pH',
       'Hemoglobin [Moles/volume] in Blood', 'Ionised calcium measurement']].sum().reset_index()

# Ensure 'measurement_datetime' is converted to datetime64[ns] in all DataFrames
def ensure_datetime(df, datetime_column):
    df[datetime_column] = pd.to_datetime(df[datetime_column], errors='coerce')
    return df

# Convert 'measurement_datetime' to datetime in all relevant DataFrames
test_sepsislabel = ensure_datetime(test_sepsislabel, 'measurement_datetime')
test_lab = ensure_datetime(test_lab, 'measurement_datetime')

# Merge the DataFrames
merged_test = pd.merge(merged_test, test_lab, on=['person_id', 'measurement_datetime'], how='left')


test_meds=pd.read_csv('/kaggle/input/phems-hackathon-early-sepsis-prediction/testing_data/measurement_meds_test.csv')
test_meds = test_meds.groupby(['person_id','measurement_datetime'])[['Systolic blood pressure', 'Diastolic blood pressure',
       'Body temperature', 'Respiratory rate', 'Heart rate',
       'Measurement of oxygen saturation at periphery',
       'Oxygen/Gas total [Pure volume fraction] Inhaled gas']].sum().reset_index()


# ensure_datetime is already defined above; reuse it so both merge keys are datetime64[ns]
merged_test = ensure_datetime(merged_test, 'measurement_datetime')
test_meds = ensure_datetime(test_meds, 'measurement_datetime')

# Merge the DataFrames

merged_test = pd.merge(merged_test, test_meds, on=['person_id', 'measurement_datetime'], how='left')


test_obs=pd.read_csv('/kaggle/input/phems-hackathon-early-sepsis-prediction/testing_data/measurement_observation_test.csv')

test_obs = test_obs.groupby(['person_id','measurement_datetime'])[['Left pupil Diameter Auto', 'Right pupil Diameter Auto',
       'Glasgow coma scale', 'Capillary refill [Time]', 'Pulse',
       'Arterial pulse pressure', 'Right pupil Pupillary response',
       'Left pupil Pupillary response']].sum().reset_index()


# ensure_datetime is already defined above; reuse it so both merge keys are datetime64[ns]
merged_test = ensure_datetime(merged_test, 'measurement_datetime')
test_obs = ensure_datetime(test_obs, 'measurement_datetime')
merged_test = pd.merge(merged_test, test_obs, on=['person_id', 'measurement_datetime'], how='left')


test_procedure = pd.read_csv('/kaggle/input/phems-hackathon-early-sepsis-prediction/testing_data/proceduresoccurrences_test.csv')
test_procedure['procedure_datetime_hourly'] = pd.to_datetime(test_procedure['procedure_datetime_hourly'])
test_procedure['procedure_duration'] = test_procedure.groupby('person_id')['procedure_datetime_hourly'].transform(lambda x: (x.max() - x.min()).total_seconds() / 3600)

test_procedure_A = test_procedure[['person_id','procedure_duration']]
test_procedure_A = test_procedure_A.drop_duplicates(subset=['person_id'])

test_procedure_grouped = test_procedure.groupby('person_id')[['procedure']].agg(lambda x: ', '.join(x.unique())).reset_index()
merged_test = pd.merge(merged_test, test_procedure_grouped, on='person_id', how='left')
merged_test = pd.merge(merged_test, test_procedure_A, on='person_id', how='left')
merged_test = merged_test.sort_values(by='measurement_datetime') 


test_drug=pd.read_csv('/kaggle/input/phems-hackathon-early-sepsis-prediction/testing_data/drugsexposure_test.csv')
test_drug=test_drug[['person_id','drug_datetime_hourly','drug_concept_id','route_concept_id']]
test_drug['route_concept_id'] = test_drug['route_concept_id'].astype(str)
test_drug['drug_datetime_hourly'] = pd.to_datetime(test_drug['drug_datetime_hourly'])
test_drug['drug_duration'] = test_drug.groupby('person_id')['drug_datetime_hourly'].transform(lambda x: (x.max() - x.min()).total_seconds()/3600)
test_drug_A = test_drug[['person_id','drug_duration']]
test_drug_A = test_drug_A.drop_duplicates(subset=['person_id'])
test_drug_grouped = test_drug.groupby(['person_id'])[['drug_concept_id','route_concept_id']].agg(lambda x: ', '.join(x.unique())).reset_index()
test_drug_grouped = pd.DataFrame(test_drug_grouped)
merged_test = pd.merge(merged_test, test_drug_grouped, on='person_id', how='left')
merged_test = pd.merge(merged_test, test_drug_A, on='person_id', how='left')



test_observation=pd.read_csv('/kaggle/input/phems-hackathon-early-sepsis-prediction/testing_data/observation_test.csv')
# Convert 'observation_datetime' to datetime
test_observation['observation_datetime'] = pd.to_datetime(test_observation['observation_datetime'])
test_observation['observation_duration'] = test_observation.groupby('person_id')['observation_datetime'].transform(lambda x: (x.max() - x.min()).total_seconds()/3600)
test_observation_A = test_observation[['person_id','observation_duration']]
test_observation_A = test_observation_A.drop_duplicates(subset=['person_id'])
test_observation = test_observation.groupby('person_id')[['observation_concept_name','valuefilled']].agg(lambda x: ', '.join(x.unique())).reset_index()
merged_test = pd.merge(merged_test, test_observation_A, on='person_id', how='left')
merged_test = pd.merge(merged_test, test_observation, on='person_id', how='left')

# Feature Engineering

This section focuses on feature engineering and on preparing the data for a model that consumes sequential (time series) input. The goal is to turn the merged test data into a clean, structured, and scaled format. This involves handling missing values, converting text data into numerical embeddings, scaling numerical features, and structuring the data into time-based sequences.

**Data Cleaning and Preparation**

This initial phase ensures the dataset is in a usable state.

- **Sorting the Dataset:** The first action is to sort `merged_test` by the `measurement_datetime` column. Order matters for time series: forward fill and sequence construction both assume rows are in chronological order. Note that the sort is global rather than per patient, so neighboring rows can belong to different patients, and that sorting alone does not rule out leakage, because backward fill copies later values into earlier rows.
- **Filling Missing Values (Imputation):** The code then fills missing values (NaN) using a different imputation technique per column group. Real-world datasets are rarely complete, and missing values can cause errors or poor model performance, so each gap is replaced with a substitute value chosen according to the data type.

**Step-By-Step Code Breakdown:**

- **`mode()`:** Used for the categorical columns in `extra_cols`; each gap is filled with the most frequently occurring value.
- **`ffill()` and `bfill()`:** Used for `categorical_cols`. `ffill()` (forward fill) propagates the last valid observation forward, while `bfill()` (backward fill) propagates the next valid observation backward. This is common for time series data where a value is assumed to remain constant until the next measurement.
- **`mean()`:** Used for `numerical_cols`. Filling missing numerical data with the mean is a simple baseline that works best when the data is not heavily skewed.
- **Dropping Unnecessary Columns:** Finally, a list of columns is dropped from the DataFrame. This is likely based on previous analysis of the training data, where these columns were found to be irrelevant for the model's predictions.
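The three imputation techniques can be compared on a small example. The column names and values below are illustrative assumptions and do not reproduce the notebook's `extra_cols`, `categorical_cols`, or `numerical_cols` lists.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "measurement_datetime": pd.to_datetime([
        "2025-01-01 00:00", "2025-01-01 01:00",
        "2025-01-01 02:00", "2025-01-01 03:00",
    ]),
    "gender": ["F", None, "F", "M"],
    "device": [None, "ventilator", None, "catheter"],
    "Heart rate": [120.0, np.nan, 130.0, np.nan],
}).sort_values("measurement_datetime")

# Mode: fill with the most frequent category
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# Forward fill, then backward fill for gaps at the start of the series
df["device"] = df["device"].ffill().bfill()

# Mean: fill numerical gaps with the column average
df["Heart rate"] = df["Heart rate"].fillna(df["Heart rate"].mean())

print(df)
```

The first `device` value can only be filled by `bfill()`, since no earlier observation exists; this is the case where a later value flows into an earlier row.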

## Sort dataset

In [3]:
# Convert 'measurement_datetime' to datetime (if not already done)
merged_test['measurement_datetime'] = pd.to_datetime(merged_test['measurement_datetime'], errors='coerce')

# Sort the DataFrame by 'measurement_datetime' from earliest to latest (ascending order)
merged_test = merged_test.sort_values(by='measurement_datetime')

# Alias for the remaining steps (x_test refers to the same object as merged_test, not a copy)
x_test = merged_test

## Fill missing values

In [4]:
date_time = ['measurement_datetime']

numerical_cols = ['person_id',
 'duration',
 'device_hour',
 'device_mean_hour',
 'Base excess in Venous blood by calculation',
 'Base excess in Arterial blood by calculation',
 'Phosphate [Moles/volume] in Serum or Plasma',
 'Potassium [Moles/volume] in Blood',
 'Bilirubin.total [Moles/volume] in Serum or Plasma',
 'Neutrophil Ab [Units/volume] in Serum',
 'Bicarbonate [Moles/volume] in Arterial blood',
 'Hematocrit [Volume Fraction] of Blood',
 'Glucose [Moles/volume] in Serum or Plasma',
 'Calcium [Moles/volume] in Serum or Plasma',
 'Chloride [Moles/volume] in Blood',
 'Sodium [Moles/volume] in Serum or Plasma',
 'C reactive protein [Mass/volume] in Serum or Plasma',
 'Carbon dioxide [Partial pressure] in Venous blood',
 'Oxygen [Partial pressure] in Venous blood',
 'Albumin [Mass/volume] in Serum or Plasma',
 'Bicarbonate [Moles/volume] in Venous blood',
 'Oxygen [Partial pressure] in Arterial blood',
 'Carbon dioxide [Partial pressure] in Arterial blood',
 'Interleukin 6 [Mass/volume] in Body fluid',
 'Magnesium [Moles/volume] in Blood',
 'Prothrombin time (PT)',
 'Procalcitonin [Mass/volume] in Serum or Plasma',
 'Lactate [Moles/volume] in Blood',
 'Creatinine [Mass/volume] in Blood',
 'Fibrinogen measurement',
 'Bilirubin measurement',
 'Partial thromboplastin time',
 ' activated',
 'Total white blood count',
 'Platelet count',
 'White blood cell count',
 'Blood venous pH',
 'D-dimer level',
 'Blood arterial pH',
 'Hemoglobin [Moles/volume] in Blood',
 'Ionised calcium measurement',
 'Systolic blood pressure',
 'Diastolic blood pressure',
 'Body temperature',
 'Respiratory rate',
 'Heart rate',
 'Measurement of oxygen saturation at periphery',
 'Oxygen/Gas total [Pure volume fraction] Inhaled gas',
 'Left pupil Diameter Auto',
 'Right pupil Diameter Auto',
 'Glasgow coma scale',
 'procedure_duration',
 'drug_duration',
 'observation_duration']

categorical_cols = ['device',
 'Capillary refill [Time]',
 'Pulse',
 'Arterial pulse pressure',
 'Right pupil Pupillary response',
 'Left pupil Pupillary response',
 'procedure',
 'drug_concept_id',
 'route_concept_id',
 'observation_concept_name',
 'valuefilled']

extra_cols = [
 'Right pupil Pupillary response',
 'Left pupil Pupillary response']

x_test.loc[:, extra_cols] = x_test[extra_cols].apply(lambda col: col.fillna(col.mode()[0]), axis=0)
x_test.loc[:, categorical_cols] = x_test[categorical_cols].ffill().bfill()
x_test.loc[:, numerical_cols] = x_test[numerical_cols].fillna(x_test[numerical_cols].mean())
x_test.loc[:, date_time] = x_test[date_time].fillna(x_test[date_time].mean())

## Drop unnecessary columns

In [5]:
x_test.drop('measurement_datetime', axis=1, inplace=True)

In [6]:
columns_to_drop = [
    'person_id',
    "Calcium [Moles/volume] in Serum or Plasma",
    "Left pupil Pupillary response",
    "Capillary refill [Time]",
    "Arterial pulse pressure",
    "Left pupil Diameter Auto",
    "Right pupil Pupillary response",
    "observation_concept_name",
    "Neutrophil Ab [Units/volume] in Serum",
    "Heart rate",
    "Interleukin 6 [Mass/volume] in Body fluid",
    "Glucose [Moles/volume] in Serum or Plasma",
    "C reactive protein [Mass/volume] in Serum or Plasma",
    "Oxygen [Partial pressure] in Venous blood",
    "Albumin [Mass/volume] in Serum or Plasma",
    "Oxygen [Partial pressure] in Arterial blood",
    "Carbon dioxide [Partial pressure] in Arterial blood",
    "Lactate [Moles/volume] in Blood",
    "D-dimer level",
    "Bilirubin measurement",
    " activated",  # Note: There might be a typo here (extra space at the beginning)
    "Total white blood count",
    "Platelet count",
    "White blood cell count",
    "Bicarbonate [Moles/volume] in Arterial blood",
    "Blood venous pH"
]

# Drop columns from x_test
x_test = x_test.drop(columns=columns_to_drop)

print("x_test shape after dropping columns:", x_test.shape)

x_test shape after dropping columns: (130483, 39)


## Text representation, Data scaling, and Dimensionality reduction

This is the core feature engineering part of the pipeline, where raw data is converted into a format suitable for a neural network.

- **Text Representation (Word2Vec):** Categorical text columns (e.g., 'device', 'procedure') are converted into numerical vectors, since a neural network operates on numerical input. Word2Vec is a technique that turns words (or in this case, categories) into dense numerical vectors (embeddings). The key idea is that words used in similar contexts get similar vector representations, which allows the model to learn relationships between different devices or procedures. The code loads a pre-trained Word2Vec model and defines a `TextToEmbeddingTransformer` class to apply it. For each text column, it splits the text into words, looks up the vector for each known word, and averages the word vectors to create a single, fixed-size vector for each observation. Missing values and text with no known words map to a zero vector.

- **Data Scaling (StandardScaler):** Features are standardized to have a mean of 0 and a standard deviation of 1. Many machine learning algorithms, especially those that use gradient descent (like neural networks), perform better when features are on a similar scale, because standardization prevents features with larger numerical values from dominating the learning process. The code loads a previously fitted `StandardScaler` so that the test data is scaled with the same statistics as the training data. In the cell below, the scaler is applied to `x_test_embedded`, the array built by stacking the embedding vectors of the text columns.

- **Dimensionality Reduction (PCA):** The final step in this section is to reduce the number of features using Principal Component Analysis. After creating embeddings, the dataset's dimensionality (number of columns) can become very large, which can lead to computational inefficiency and the curse of dimensionality. PCA finds a smaller set of new features (principal components) that capture most of the variance in the original data. This can reduce noise and may improve model performance. The code loads a previously fitted `PCA` model to apply this transformation, leaving 23 features per row.
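
The averaging of word vectors described above can be sketched without the real model. In this sketch, `toy_vectors` is a hypothetical stand-in for `word2vec_model.wv`, and its three-dimensional vectors are invented for illustration; `embed` only handles `None` as a missing value, whereas the notebook uses `pd.isna`.

```python
import numpy as np

# Hypothetical stand-in for word2vec_model.wv (token -> vector)
toy_vectors = {
    "central": np.array([1.0, 0.0, 0.0]),
    "venous": np.array([0.0, 1.0, 0.0]),
    "catheter": np.array([0.0, 0.0, 1.0]),
}
VECTOR_SIZE = 3

def embed(text, vectors=toy_vectors, size=VECTOR_SIZE):
    """Average the vectors of the known tokens; fall back to zeros."""
    tokens = [] if text is None else str(text).split()
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

two_words = embed("central venous")   # mean of two vectors
unknown = embed("unknown words")      # no known token: zero vector

# One row with two text columns becomes a single flat feature vector
row = np.hstack([embed("central venous"), embed("catheter")])
print(two_words, unknown, row.shape)
```

Stacking one averaged vector per text column, as in the last line, is what determines the number of input columns the scaler and PCA receive.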

In [7]:
# Load the saved transformer object
text_to_embedding_tuple = joblib.load("/kaggle/input/text_to_embedding_v0/scikitlearn/default/1/text_to_embedding_transformer.pkl")

# Debugging: Check what's inside
print("Loaded object:", text_to_embedding_tuple)
print("Type of loaded object:", type(text_to_embedding_tuple))
print("Length of loaded object:", len(text_to_embedding_tuple))

# Extract text_columns correctly
#text_columns = text_to_embedding_tuple[0]  
text_columns = ['device', 'Pulse', 'procedure', 'drug_concept_id', 'route_concept_id', 'valuefilled']
# Load Word2Vec model separately
word2vec_model = Word2Vec.load("/kaggle/input/word2vec_v2/scikitlearn/default/1/word2vec.model")

# Define tokenizer
def tokenize_text(text):
    if pd.isna(text):
        return []
    return str(text).split()

# Define transformer class
class TextToEmbeddingTransformer:
    def __init__(self, word2vec_model, text_columns):
        self.word2vec_model = word2vec_model
        self.text_columns = text_columns

    def transform_text_to_embedding(self, text):
        tokens = tokenize_text(text)
        if not tokens:
            return np.zeros(self.word2vec_model.vector_size)
        vectors = [self.word2vec_model.wv[word] for word in tokens if word in self.word2vec_model.wv]
        if not vectors:
            return np.zeros(self.word2vec_model.vector_size)
        return np.mean(vectors, axis=0)

    def transform(self, X):
        X = X.copy()
        for col in self.text_columns:
            if col in X.columns:  # Ensure column exists before transformation
                X[col] = X[col].apply(self.transform_text_to_embedding)
            else:
                print(f"Warning: Column '{col}' not found in input DataFrame.")
        return X

# Recreate the TextToEmbeddingTransformer object
text_to_embedding = TextToEmbeddingTransformer(word2vec_model, text_columns)

# Apply text-to-embedding transformation
x_test_transformed = text_to_embedding.transform(x_test)

# Verify the transformation
#print("Sample transformed row:")
#print(x_test_transformed.iloc[0])

# Convert transformed embeddings to a structured NumPy array
embedding_dim = word2vec_model.vector_size
x_test_embedded = np.array([np.hstack(row[text_columns].values) for _, row in x_test_transformed.iterrows()])

# Load the saved scaler
scaler = joblib.load("/kaggle/input/standard_scaler_v3/scikitlearn/default/1/standard_scaler.joblib")

# Standardize the test data
x_test_scaled = scaler.transform(x_test_embedded)

# Load the saved PCA model
pca = joblib.load("/kaggle/input/pca_model/scikitlearn/default/1/pca_model.joblib")

# Apply PCA transformation
x_test_pca = pca.transform(x_test_scaled)

# Output the transformed test data
print("Transformed x_test shape:", x_test_pca.shape)

Loaded object: ['device', 'Pulse', 'procedure', 'drug_concept_id', 'route_concept_id', 'valuefilled']
Type of loaded object: <class 'list'>
Length of loaded object: 6



Transformed x_test shape: (130483, 23)
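
The reason for loading fitted `StandardScaler` and `PCA` objects, rather than fitting new ones on the test set, can be sketched in plain NumPy. This is an illustration of the underlying arithmetic under assumed random data, not the scikit-learn implementation, and the sizes (6 input features, 2 components) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 6))  # assumed training data
test = rng.normal(loc=5.0, scale=2.0, size=(50, 6))    # assumed test data

# "Fit": statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)
train_scaled = (train - mean) / std

# PCA directions from the SVD of the centered, scaled training matrix
pca_mean = train_scaled.mean(axis=0)
_, _, vt = np.linalg.svd(train_scaled - pca_mean, full_matrices=False)
components = vt[:2]  # keep the top 2 principal directions

# "Transform": the test data reuses the stored training statistics
test_scaled = (test - mean) / std
test_pca = (test_scaled - pca_mean) @ components.T

print(test_pca.shape)
```

Reusing `mean`, `std`, and `components` from training keeps the test features in the same coordinate system the model was trained on.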


## Prepared time series dataset

**Time Series Data Structuring**

The final step reshapes the cleaned and transformed data into a 3D tensor suitable for a sequence model. Models like LSTMs or Transformers require input data in a specific 3D format: (`n_samples, n_timesteps, n_features`). This structure allows the model to analyze sequences of data over time rather than just single, isolated observations. The `create_time_series_dataset` function transforms the 2D data into this 3D shape.

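The function iterates through the dataset and, for each row, builds a sequence from that row and the `time_steps - 1` rows before it. This creates a "sliding window" over the data, which is a common technique for preparing time series data for deep learning. For the earliest rows, where fewer than `time_steps` observations exist, the available rows are written to the start of the window and the remaining positions are left as zeros. Because the data was sorted by timestamp only, the windows run over the whole dataset in row order rather than per patient.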

After these steps, the `x_test_time_series` variable contains the fully prepared, time-sequenced data, ready to be fed into the final machine learning model for inference.

In [8]:
def create_time_series_dataset(data, time_steps):
    """
    Build one look-back window per input row (sliding window over the rows).
    
    Args:
        data (np.ndarray): Input data of shape (n_samples, n_features).
        time_steps (int): Number of time steps to use for creating sequences.
    
    Returns:
        np.ndarray: Time series data of shape (n_samples, time_steps, n_features).
    """
    n_samples, n_features = data.shape
    Xs = np.zeros((n_samples, time_steps, n_features))  # Initialize output array
    
    for i in range(n_samples):
        if i < time_steps:
            # Not enough history yet: fill the first i+1 slots and leave trailing zeros
            Xs[i, :i+1, :] = data[:i+1, :]
        else:
            # Slice time series data
            Xs[i, :, :] = data[i-time_steps+1:i+1, :]
    
    return Xs

time_steps = 100  # Number of time steps

# Create time series dataset
x_test_time_series = create_time_series_dataset(x_test_pca, time_steps)

print("Shape of time series X_test:", x_test_time_series.shape)

Shape of time series X_test: (130483, 100, 23)
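
For comparison, here is a vectorized sketch of the same sliding-window idea using `np.lib.stride_tricks.sliding_window_view`. It is a variant, not the notebook's function: `sliding_windows` is a hypothetical helper that places the zero padding before the first observation, so the current row is always the last step of its window. A model trained with the notebook's layout would need that same layout at inference.

```python
import numpy as np

def sliding_windows(data, time_steps):
    """Return an array of shape (n_samples, time_steps, n_features)."""
    n_samples, n_features = data.shape
    # Zeros in front so that every row has a full look-back window
    padded = np.vstack([np.zeros((time_steps - 1, n_features)), data])
    # The window axis is appended last: (n_samples, n_features, time_steps)
    view = np.lib.stride_tricks.sliding_window_view(padded, time_steps, axis=0)
    # Reorder to (n_samples, time_steps, n_features) and make a writable copy
    return view.transpose(0, 2, 1).copy()

toy = np.arange(1, 9, dtype=float).reshape(4, 2)  # 4 rows, 2 features
windows = sliding_windows(toy, time_steps=3)

print(windows.shape)
print(windows[0])  # two zero rows, then the first observation
```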


# Model Prediction

**Step-By-Step Breakdown**

**Loading the Trained Model**

After a model is trained, its architecture and learned parameters (weights and biases) are saved to a file, a process called serialization. Loading a trained model means you don't need to retrain it, which saves a significant amount of time and computational resources. The `.h5` (HDF5) format is a common legacy way to save Keras/TensorFlow models; it stores the model's architecture, weights, and optimizer state in a single file. The code loads the saved model with `tf.keras.models.load_model(...)`, the standard Keras loading function, which reads the structure and weights from the specified file path and recreates the model as it was saved.

**Making Predictions on New Data**

Once the model is loaded, the code uses it to generate predictions on the prepared test data. Inference is the process of using a trained model to make predictions on new data. `model.predict(x_test_time_series)` is the core command: it runs the input through the network in a forward pass, layer by layer, until the final output is produced. The input `x_test_time_series` is the 3D array prepared in the previous step, with shape (`number_of_samples, time_steps, number_of_features`) = (130483, 100, 23), a format commonly used for time series models like LSTMs or Transformers. The output, `predictions`, is a NumPy array containing the model's prediction for each input sample.

In [9]:
# Load the TensorFlow model
model = tf.keras.models.load_model("/kaggle/input/mild_model_v6/tensorflow2/default/1/optimized_time_series_modelV1.h5")

# Make predictions
predictions = model.predict(x_test_time_series)

2025-09-02 22:13:03.016800: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


[1m4078/4078[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m594s[0m 145ms/step
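
The step count in the progress output follows from the batch size. Assuming `model.predict` ran with the Keras default of `batch_size=32` (the call above does not set it), the arithmetic is:

```python
import math

n_samples = 130483   # rows in x_test_time_series
batch_size = 32      # Keras default for model.predict (assumed here)

# The last batch is partial, so round up
n_steps = math.ceil(n_samples / batch_size)
print(n_steps)
```

This gives 4078 steps, matching the `4078/4078` shown in the log.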


# Submission File Generation

**Step-By-Step Breakdown**

**Adding Predictions to the Main DataFrame**

The first action is to add the generated predictions to the merged_test DataFrame.

- `merged_test['SepsisLabel'] = predictions` : A new column (`SepsisLabel`) is created in the `merged_test` DataFrame. The values from the predictions array, which are the model's output, are assigned to this new column. This effectively links the final prediction to the corresponding patient and timestamp.

**Merging Predictions with the Submission Template**

This part of the code merges the predictions back into the original test_submission DataFrame, which likely serves as the template for the final output.

- `pd.merge(...)`: The `pd.merge` function is used to join `test_submission` with the `merged_test` DataFrame.
- `on='person_id'`: The merge key is the `person_id` column alone. This ties each prediction to the right patient, but because a patient has many rows on both sides, every submission row is paired with every prediction row of that patient rather than with the prediction for its own timestamp.
- `how='left'`: A left merge is used. This means that all rows from the `test_submission` DataFrame are kept and the matching `SepsisLabel` values from `merged_test` are attached; patients without a match receive `NaN`.

**Creating a Unique Identifier**

A new column is created to serve as a unique identifier for each prediction. This is a common requirement for submission files in data science competitions.

- `test_submission['person_id_datetime'] = (...)`: A new column is created by concatenating the `person_id` and the `measurement_datetime` with an underscore separator. This creates a unique key for each prediction instance, as each patient at each specific time point has a distinct record.

**Finalizing the Submission File**

The final steps clean up the DataFrame and save it to a CSV file.

- `test_submission = test_submission[['person_id_datetime', 'SepsisLabel']]`: This command selects only the two required columns for the final submission: the unique identifier (`person_id_datetime`) and the predicted value (`SepsisLabel`). All other columns are dropped.
- `test_submission = test_submission.drop_duplicates(...)`: This keeps only the first row for each `person_id_datetime`. Because the merge on `person_id` produces several rows per identifier, this step is what reduces the result to one prediction per submission row.
- `test_submission.to_csv('submission.csv', index=False)`: The final DataFrame is saved as a CSV file named `submission.csv`. The `index=False` argument prevents pandas from writing the DataFrame's index as an extra column in the CSV file, which is typically not desired in a submission file.
- `print("Submission successfully created.")`: A message is printed to the console to confirm that the process is complete.
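
The effect of the merge key can be seen on a toy example. The frames `template` and `preds` below are made up for illustration, and the second merge is a variant that assumes the prediction frame still carries the timestamp column; it is not what the notebook does.

```python
import pandas as pd

# Hypothetical submission template and prediction frame
template = pd.DataFrame({
    "person_id": [1, 1, 2],
    "measurement_datetime": ["t1", "t2", "t1"],
})
preds = pd.DataFrame({
    "person_id": [1, 1, 2],
    "measurement_datetime": ["t1", "t2", "t1"],
    "SepsisLabel": [0.1, 0.9, 0.5],
})

# Key = person_id only: each template row pairs with all rows of that patient
by_id = template.merge(preds[["person_id", "SepsisLabel"]], on="person_id", how="left")

# Key = person_id + timestamp: one prediction per template row
by_id_time = template.merge(preds, on=["person_id", "measurement_datetime"], how="left")

print(len(by_id), len(by_id_time))
print(by_id_time)
```

Merging on `person_id` alone expands the three template rows to five, so a later `drop_duplicates` has to pick one label per row; merging on both keys keeps three rows, each with the label for its own timestamp.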

In [10]:
# Ensure 'measurement_datetime' is a string for concatenation
test_submission['measurement_datetime'] = test_submission['measurement_datetime'].astype(str)

# Add predictions to the merged_test dataset
merged_test['SepsisLabel'] = predictions

# Left-join the predictions onto the submission template, matching rows by person_id
# (the merge is done in a single call; only the needed columns are selected to limit memory use)
test_submission = pd.merge(
    test_submission,
    merged_test[['person_id', 'SepsisLabel']],  # Only select necessary columns
    on='person_id',
    how='left'
)

# Concatenate 'person_id' and 'measurement_datetime' with '_' as separator
test_submission['person_id_datetime'] = (
    test_submission['person_id'].astype(str) + '_' + test_submission['measurement_datetime']
)

# Select only the required columns
test_submission = test_submission[['person_id_datetime', 'SepsisLabel']]

# Remove duplicates based on 'person_id_datetime'
test_submission = test_submission.drop_duplicates(subset=['person_id_datetime'])

# Save the submission file
test_submission.to_csv('submission.csv', index=False)

print("Submission successfully created.")

Submission successfully created.
