# **OHIOT1DM DATA PROCESSING**   
- **Without feature enhancement**  
 
- **2:1:2 hypo:eu:hyper sampling ratio**  

# **CONTENTS**

[1. Requirements & Encironment](##1.-Requirements-&-Environment)  
[2. Read n OHIO-T1DM Data](##2.-Read-n-OHIO-T1DM-Data)  
[3. Initial Data Processing](##3.-OHIO-T1DM-Data-Initial-Processing)  
[4. OHIO T1DM Data Processing - No Undersampling](##4.-OHIO-T1DM-Data-Processing---No-Undersampling)  
[5. OHIO T1DM Data Processing - Hypo Oversamplin and  Eu/Hyper Undersampling](##5.-OHIO-T1DM-Data-Processing---Hypo-Oversampling-and-Eu/Hyper-Undersampling)  
[6. OHIO T1DM Validation Data Processing](##6.-OHIO-T1DM-Validation-Data-Processing)  
[7. OHIO T1DM Test Data Processing](##7.-OHIO-T1DM-Test-Data-Processing)  



## **1. Requirements & Environment**

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
import sys
import random


from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from data_processing_modules import *
from data_processing_parameters import *



In [2]:
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.cuda.manual_seed_all(0)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('mps') if torch.backends.mps.is_available() else torch.device('cpu')
print(f"Using device: {device}")

Using device: mps


[Back to Table of Contents](#CONTENTS)

## **2. Read in OHIO-T1DM Data**

In [3]:
# saved file file paths

ohio_training_directory_no_undersampling = '../processed_data/ohio/training/no_undersampling'
os.makedirs(ohio_training_directory_no_undersampling, exist_ok=True)

ohio_training_directory_undersampling = '../processed_data/ohio/training/over_and_under_sampling'
os.makedirs(ohio_training_directory_undersampling, exist_ok=True)

ohio_validation_directory = '../processed_data/ohio/validation'
os.makedirs(ohio_validation_directory, exist_ok=True)

ohio_test_directory = '../processed_data/ohio/testing'
os.makedirs(ohio_test_directory, exist_ok=True)

In [None]:
ohio_training_data = get_ohio_data('training', 'Train')

In [5]:
print(ohio_training_data.keys())

dict_keys([584, 575, 563, 559, 540, 596, 588, 570, 591, 567, 552, 544])


In [6]:
ohio_test_data = get_ohio_data('test', 'Test')

In [7]:
ohio_test_data.keys()

dict_keys([552, 540, 559, 544, 588, 570, 563, 596, 591, 584, 567, 575])

In [8]:
print(ohio_training_data.keys() == ohio_test_data.keys())

True


[Back to Table of Contents](#CONTENTS)

Confirms that data for each patient ID is present in both the training and testing datasets.

## **3. OHIO-T1DM Data Initial Processing**

In [9]:
for ptid, df in ohio_training_data.items():
    print(f"Patient ID: {ptid}, Number of Glucose Measurements: {len(df)}")

Patient ID: 584, Number of Glucose Measurements: 12150
Patient ID: 575, Number of Glucose Measurements: 11866
Patient ID: 563, Number of Glucose Measurements: 12124
Patient ID: 559, Number of Glucose Measurements: 10796
Patient ID: 540, Number of Glucose Measurements: 11947
Patient ID: 596, Number of Glucose Measurements: 10877
Patient ID: 588, Number of Glucose Measurements: 12640
Patient ID: 570, Number of Glucose Measurements: 10982
Patient ID: 591, Number of Glucose Measurements: 10847
Patient ID: 567, Number of Glucose Measurements: 10858
Patient ID: 552, Number of Glucose Measurements: 9080
Patient ID: 544, Number of Glucose Measurements: 10623


In [10]:
ohio_training_dict = {}
ohio_validation_dict = {}

for ptid, df in ohio_training_data.items():
    train, test = train_test_split(df, test_size=0.2, shuffle=False)
    ohio_training_dict[ptid] = train
    ohio_validation_dict[ptid] = test

[Back to Table of Contents](#CONTENTS)

## **4. OHIO T1DM Data Processing - No Undersampling**

In [11]:
for ptid, df in ohio_training_dict.items():
    df = df.copy()
    df['real_value_flag'] = 1
    df['TimeDiff'] = df['timestamp'].diff().dt.total_seconds()

    # Identify rows where TimeDiff is around 600 seconds (10 min)
    mask = (df['TimeDiff'] > 595) & (df['TimeDiff'] < 605)
    insert_rows = df[mask].copy()

    if not insert_rows.empty:
        # Modify new rows: set `real_value_flag = 0`, shift `DateTime`, and set `GlucoseValue = NaN`
        insert_rows['real_value_flag'] = 0
        insert_rows['timestamp'] -= pd.to_timedelta(5, unit='m')
        insert_rows['GlucoseValue'] = np.nan

        # Append new rows to the dataframe and sort
        df = pd.concat([df, insert_rows]).sort_values(by='timestamp').reset_index(drop=True)

    # Convert 'value' column to numeric before interpolation
    df['value'] = pd.to_numeric(df['value'], errors='coerce')
    df['GlucoseValue'] = df['value'].interpolate(method='linear')

    df['Hour'] = df['timestamp'].dt.hour
    df['Minute'] = df['timestamp'].dt.minute
    df['TimeDiff'] = df['timestamp'].diff().dt.total_seconds()
    df['TimeDiffFlag'] = df['TimeDiff'].apply(lambda x: 0 if x < 295 or x > 305 else 1)
    df['RollingTimeDiffFlag'] = df['TimeDiffFlag'].rolling(window=96).sum()

    # Drop first 96 rows due to NaN values
    # df = df.iloc[95:].reset_index(drop=True)

    # drop columns
    df = df.drop(columns=['timestamp', 'value', 'TimeDiff', 'TimeDiffFlag', 'real_value_flag'])

    ohio_training_dict[ptid] = df

In [12]:
# Replace-BG normalisation metrics

normalisation_mean = 152.91051040286524

normalisation_std = 70.27050122812615

In [13]:
for ptid, df in ohio_training_dict.items():
    df['GlucoseValue'] = (df['GlucoseValue'] - normalisation_mean) / normalisation_std

In [16]:
ptid_training_slice_dict = {}


for ptid, df in ohio_training_dict.items():
    rolling_flag_array = df["RollingTimeDiffFlag"].to_numpy()  # Convert to NumPy array for fast indexing
    num_rows = len(df)
    starting_index = 0

    slice_list = []

    while starting_index + slice_size <= num_rows:
        if rolling_flag_array[starting_index + slice_size - 1] == slice_size:  # Use precomputed array
            slice_df = df.iloc[starting_index:starting_index + slice_size].copy()
            slice_df = slice_df.drop(columns='RollingTimeDiffFlag')
            slice_list.append(slice_df)
            starting_index += 1

        else:
            starting_index += 1
        
    ptid_training_slice_dict[ptid] = slice_list

In [None]:


for ptid, slice_list in ptid_training_slice_dict.items():


    ptid_count = 0

    encoder_dir = os.path.join(ohio_training_directory_no_undersampling, f'ohio_training_{ptid}', 'EncoderSlices')
    os.makedirs(encoder_dir, exist_ok=True)
    decoder_dir = os.path.join(ohio_training_directory_no_undersampling, f'ohio_training_{ptid}', 'DecoderSlices')
    os.makedirs(decoder_dir, exist_ok=True)
    target_dir = os.path.join(ohio_training_directory_no_undersampling, f'ohio_training_{ptid}', 'TargetSlices')
    os.makedirs(target_dir, exist_ok=True)

    for i, slice_df in enumerate(slice_list):
        # Replace all instances of 'slice' with 'slice_df'
        encoder_input = slice_df.iloc[:encoder_input_size]
        target = slice_df.iloc[encoder_input_size:]['GlucoseValue']

        decoder_input = slice_df.iloc[-decoder_input_size:].copy().reset_index(drop=True)
        decoder_input.loc[decoder_input.index[start_token_size:], 'GlucoseValue'] = 0

        encoder_path = os.path.join(encoder_dir, f'{ptid_count}.pt')
        decoder_path = os.path.join(decoder_dir, f'{ptid_count}.pt')
        target_path = os.path.join(target_dir, f'{ptid_count}.pt')

        torch.save(torch.tensor(encoder_input.values, dtype=torch.float32), encoder_path)
        torch.save(torch.tensor(decoder_input.values, dtype=torch.float32), decoder_path)
        torch.save(torch.tensor(target.values, dtype=torch.float32), target_path)

        ptid_count += 1

In [19]:
# load first file for ohio_training_544
encoder_path = get_first_file(os.path.join(ohio_training_directory_no_undersampling, 'ohio_training_544', 'EncoderSlices'))
decoder_path = get_first_file(os.path.join(ohio_training_directory_no_undersampling, 'ohio_training_544', 'DecoderSlices'))
target_path = get_first_file(os.path.join(ohio_training_directory_no_undersampling, 'ohio_training_544', 'TargetSlices'))

encoder_tensor = torch.load(encoder_path)
decoder_tensor = torch.load(decoder_path)
target_tensor = torch.load(target_path)

encoder_df = pd.DataFrame(encoder_tensor.numpy(), columns=["GlucoseValue", 'Hour', 'Minute'])
decoder_df = pd.DataFrame(decoder_tensor.numpy(), columns=["GlucoseValue", 'Hour', 'Minute'])
target_df = pd.DataFrame(target_tensor.numpy(), columns=["GlucoseValue"])

print(f"\n Encoder Shape: {encoder_df.shape}")
print(encoder_df.tail())
print(f"\n Decoder Shape: {decoder_df.shape}")
print(decoder_df.tail(30))
print(f"\n Target Shape: {target_df.shape}")
print(target_df.tail())


 Encoder Shape: (72, 3)
    GlucoseValue  Hour  Minute
67      0.157811  23.0    59.0
68      0.129350   0.0     4.0
69      0.100889   0.0     9.0
70      0.072427   0.0    14.0
71      0.043966   0.0    19.0

 Decoder Shape: (36, 3)
    GlucoseValue  Hour  Minute
6       0.172042  23.0    54.0
7       0.157811  23.0    59.0
8       0.129350   0.0     4.0
9       0.100889   0.0     9.0
10      0.072427   0.0    14.0
11      0.043966   0.0    19.0
12      0.000000   0.0    24.0
13      0.000000   0.0    29.0
14      0.000000   0.0    34.0
15      0.000000   0.0    39.0
16      0.000000   0.0    44.0
17      0.000000   0.0    49.0
18      0.000000   0.0    54.0
19      0.000000   0.0    59.0
20      0.000000   1.0     4.0
21      0.000000   1.0     9.0
22      0.000000   1.0    14.0
23      0.000000   1.0    19.0
24      0.000000   1.0    24.0
25      0.000000   1.0    29.0
26      0.000000   1.0    34.0
27      0.000000   1.0    39.0
28      0.000000   1.0    44.0
29      0.000000   1

[Back to Table of Contents](#CONTENTS)

## **5. OHIO T1DM Data Processing - Hypo OverSampling and Eu/Hyper Undersampling**


In [20]:
ptid_hypo_training_slice_dict = {}
ptid_hyper_training_slice_dict = {}
ptid_eu_training_slice_dict = {}

for ptid, df_list in ptid_training_slice_dict.items():
    hypo_list = []
    hyper_list = []
    eu_list = []

    normalised_hypo_threshold = (70 - normalisation_mean) / normalisation_std
    normalised_hyper_threshold = (180 - normalisation_mean) / normalisation_std

    threshold_count = 6

    for slice in df_list:

        target_slice = slice.iloc[-target_size:]['GlucoseValue'].values

        hypo_count = np.sum(target_slice < normalised_hypo_threshold)
        hyper_count = np.sum(target_slice > normalised_hyper_threshold)
        eu_count = target_size - hypo_count - hyper_count

        if hypo_count >= threshold_count:
            hypo_list.append(slice)
        elif hyper_count >= threshold_count:
            hyper_list.append(slice)
        else:
            eu_list.append(slice)

    
    print(f"Patient ID: {ptid}, Hypo: {len(hypo_list)}, Hyper: {len(hyper_list)}, EU: {len(eu_list)}")

    ptid_hypo_training_slice_dict[ptid] = hypo_list
    ptid_hyper_training_slice_dict[ptid] = hyper_list
    ptid_eu_training_slice_dict[ptid] = eu_list

Patient ID: 584, Hypo: 97, Hyper: 3366, EU: 2138
Patient ID: 575, Hypo: 641, Hyper: 1140, EU: 2725
Patient ID: 563, Hypo: 357, Hyper: 2411, EU: 5497
Patient ID: 559, Hypo: 270, Hyper: 2635, EU: 2584
Patient ID: 540, Hypo: 847, Hyper: 2143, EU: 4703
Patient ID: 596, Hypo: 146, Hyper: 2390, EU: 4343
Patient ID: 588, Hypo: 141, Hyper: 4332, EU: 4781
Patient ID: 570, Hypo: 190, Hyper: 4523, EU: 2711
Patient ID: 591, Hypo: 279, Hyper: 3004, EU: 3398
Patient ID: 567, Hypo: 402, Hyper: 1659, EU: 2508
Patient ID: 552, Hypo: 329, Hyper: 1313, EU: 2732
Patient ID: 544, Hypo: 189, Hyper: 3251, EU: 3571


In [21]:
for ptid, slice_list in ptid_hypo_training_slice_dict.items():
    #duplicate the hypo slices
    slice_list = slice_list + slice_list
    ptid_hypo_training_slice_dict[ptid] = slice_list
    print(f"Patient ID: {ptid}, Hypo: {len(slice_list)}")

Patient ID: 584, Hypo: 194
Patient ID: 575, Hypo: 1282
Patient ID: 563, Hypo: 714
Patient ID: 559, Hypo: 540
Patient ID: 540, Hypo: 1694
Patient ID: 596, Hypo: 292
Patient ID: 588, Hypo: 282
Patient ID: 570, Hypo: 380
Patient ID: 591, Hypo: 558
Patient ID: 567, Hypo: 804
Patient ID: 552, Hypo: 658
Patient ID: 544, Hypo: 378


In [22]:
for ptid, slice_list in ptid_hyper_training_slice_dict.items():

    target_size = len(ptid_hypo_training_slice_dict[ptid])

    hyper_dict = {idx: slice for idx, slice in enumerate(slice_list)}
    hyper_dict = undersample_dict(hyper_dict, target_size)

    ptid_hyper_training_slice_dict[ptid] = list(hyper_dict.values())

    print(f"Patient ID: {ptid}, Hyper: {len(ptid_hyper_training_slice_dict[ptid])}")

Patient ID: 584, Hyper: 194
Patient ID: 575, Hyper: 1140
Patient ID: 563, Hyper: 714
Patient ID: 559, Hyper: 540
Patient ID: 540, Hyper: 1694
Patient ID: 596, Hyper: 292
Patient ID: 588, Hyper: 282
Patient ID: 570, Hyper: 380
Patient ID: 591, Hyper: 558
Patient ID: 567, Hyper: 804
Patient ID: 552, Hyper: 658
Patient ID: 544, Hyper: 378


In [23]:
for ptid, slice_list in ptid_eu_training_slice_dict.items():

    target_size = len(ptid_hypo_training_slice_dict[ptid])

    eu_dict = {idx: slice for idx, slice in enumerate(slice_list)}
    eu_dict = undersample_dict(eu_dict, target_size)

    ptid_eu_training_slice_dict[ptid] = list(eu_dict.values())

    print(f"Patient ID: {ptid}, EU: {len(ptid_eu_training_slice_dict[ptid])}")

Patient ID: 584, EU: 194
Patient ID: 575, EU: 1282
Patient ID: 563, EU: 714
Patient ID: 559, EU: 540
Patient ID: 540, EU: 1694
Patient ID: 596, EU: 292
Patient ID: 588, EU: 282
Patient ID: 570, EU: 380
Patient ID: 591, EU: 558
Patient ID: 567, EU: 804
Patient ID: 552, EU: 658
Patient ID: 544, EU: 378


In [24]:
undersampled_training_dict = {}

for ptid, slice_list in ptid_hypo_training_slice_dict.items():
    hypo_slices = slice_list
    hyper_slices = ptid_hyper_training_slice_dict[ptid]
    eu_slices = ptid_eu_training_slice_dict[ptid]

    final_slices = hypo_slices + hyper_slices + eu_slices

    np.random.shuffle(final_slices)
    undersampled_training_dict[ptid] = final_slices

    print(f"Patient ID: {ptid}, Total Slices: {len(final_slices)}")

Patient ID: 584, Total Slices: 582
Patient ID: 575, Total Slices: 3704
Patient ID: 563, Total Slices: 2142
Patient ID: 559, Total Slices: 1620
Patient ID: 540, Total Slices: 5082
Patient ID: 596, Total Slices: 876
Patient ID: 588, Total Slices: 846
Patient ID: 570, Total Slices: 1140
Patient ID: 591, Total Slices: 1674
Patient ID: 567, Total Slices: 2412
Patient ID: 552, Total Slices: 1974
Patient ID: 544, Total Slices: 1134


In [25]:
for ptid, slice_list in undersampled_training_dict.items():
    
    ptid_count = 0

    encoder_dir = os.path.join(ohio_training_directory_undersampling, f'ohio_training_{ptid}', 'EncoderSlices')
    os.makedirs(encoder_dir, exist_ok=True)
    decoder_dir = os.path.join(ohio_training_directory_undersampling, f'ohio_training_{ptid}', 'DecoderSlices')
    os.makedirs(decoder_dir, exist_ok=True)
    target_dir = os.path.join(ohio_training_directory_undersampling, f'ohio_training_{ptid}', 'TargetSlices')
    os.makedirs(target_dir, exist_ok=True)

    for i, slice in enumerate(slice_list):

        encoder_input = slice.iloc[:encoder_input_size]
        target = slice.iloc[encoder_input_size:]['GlucoseValue']

        decoder_input = slice_df.iloc[-decoder_input_size:].copy().reset_index(drop=True)
        decoder_input.loc[decoder_input.index[start_token_size:], 'GlucoseValue'] = 0

        encoder_path = os.path.join(encoder_dir, f'{ptid_count}.pt')
        decoder_path = os.path.join(decoder_dir, f'{ptid_count}.pt')
        target_path = os.path.join(target_dir, f'{ptid_count}.pt')

        torch.save(torch.tensor(encoder_input.values, dtype=torch.float32), encoder_path)
        torch.save(torch.tensor(decoder_input.values, dtype=torch.float32), decoder_path)
        torch.save(torch.tensor(target.values, dtype=torch.float32), target_path)

        ptid_count += 1


In [26]:
# load first file for ohio_training_559
encoder_path = get_first_file(os.path.join(ohio_training_directory_undersampling, 'ohio_training_540', 'EncoderSlices'))
decoder_path = get_first_file(os.path.join(ohio_training_directory_undersampling, 'ohio_training_540', 'DecoderSlices'))
target_path = get_first_file(os.path.join(ohio_training_directory_undersampling, 'ohio_training_540', 'TargetSlices'))

encoder_tensor = torch.load(encoder_path)
decoder_tensor = torch.load(decoder_path)
target_tensor = torch.load(target_path)

encoder_df = pd.DataFrame(encoder_tensor.numpy(), columns=["GlucoseValue", 'Hour', 'Minute'])
decoder_df = pd.DataFrame(decoder_tensor.numpy(), columns=["GlucoseValue", 'Hour', 'Minute'])
target_df = pd.DataFrame(target_tensor.numpy(), columns=["GlucoseValue"])

print(f"\n Encoder Shape: {encoder_df.shape}")
print(encoder_df.tail())
print(f"\n Decoder Shape: {decoder_df.shape}")
print(decoder_df.tail(30))
print(f"\n Target Shape: {target_df.shape}")
print(target_df.tail())


 Encoder Shape: (72, 3)
    GlucoseValue  Hour  Minute
67      0.029735  19.0    56.0
68      0.072427  20.0     1.0
69      0.058196  20.0     6.0
70      0.015504  20.0    11.0
71      0.001274  20.0    16.0

 Decoder Shape: (36, 3)
    GlucoseValue  Hour  Minute
6      -0.141034  23.0    19.0
7      -0.141034  23.0    24.0
8      -0.126803  23.0    29.0
9      -0.126803  23.0    34.0
10     -0.126803  23.0    39.0
11     -0.112572  23.0    44.0
12      0.000000  23.0    49.0
13      0.000000  23.0    54.0
14      0.000000  23.0    59.0
15      0.000000   0.0     4.0
16      0.000000   0.0     9.0
17      0.000000   0.0    14.0
18      0.000000   0.0    19.0
19      0.000000   0.0    24.0
20      0.000000   0.0    29.0
21      0.000000   0.0    34.0
22      0.000000   0.0    39.0
23      0.000000   0.0    44.0
24      0.000000   0.0    49.0
25      0.000000   0.0    54.0
26      0.000000   0.0    59.0
27      0.000000   1.0     4.0
28      0.000000   1.0     9.0
29      0.000000   1

[Back to Table of Contents](#CONTENTS)

## **6. OHIO T1DM Validation Data Processing**

In [27]:
for ptid, df in ohio_validation_dict.items():
    df = df.copy()
    df['real_value_flag'] = 1
    df['TimeDiff'] = df['timestamp'].diff().dt.total_seconds()

    # Identify rows where TimeDiff is around 600 seconds (10 min)
    mask = (df['TimeDiff'] > 595) & (df['TimeDiff'] < 605)
    insert_rows = df[mask].copy()

    if not insert_rows.empty:
        # Modify new rows: set `real_value_flag = 0`, shift `DateTime`, and set `GlucoseValue = NaN`
        insert_rows['real_value_flag'] = 0
        insert_rows['timestamp'] -= pd.to_timedelta(5, unit='m')
        insert_rows['GlucoseValue'] = np.nan

        # Append new rows to the dataframe and sort
        df = pd.concat([df, insert_rows]).sort_values(by='timestamp').reset_index(drop=True)

    # Convert 'value' column to numeric before interpolation
    df['value'] = pd.to_numeric(df['value'], errors='coerce')
    df['GlucoseValue'] = df['value'].interpolate(method='linear')

    df['Hour'] = df['timestamp'].dt.hour
    df['Minute'] = df['timestamp'].dt.minute
    df['TimeDiff'] = df['timestamp'].diff().dt.total_seconds()
    df['TimeDiffFlag'] = df['TimeDiff'].apply(lambda x: 0 if x < 295 or x > 305 else 1)
    df['RollingTimeDiffFlag'] = df['TimeDiffFlag'].rolling(window=96).sum()

    # Drop first 96 rows due to NaN values
    # df = df.iloc[95:].reset_index(drop=True)

    # drop columns
    df = df.drop(columns=['timestamp', 'value', 'TimeDiff', 'TimeDiffFlag', 'real_value_flag'])

    ohio_validation_dict[ptid] = df

In [28]:
for ptid, df in ohio_validation_dict.items():
    df['GlucoseValue'] = (df['GlucoseValue'] - normalisation_mean) / normalisation_std

In [29]:
ptid_validation_slice_dict = {}


for ptid, df in ohio_validation_dict.items():
    rolling_flag_array = df["RollingTimeDiffFlag"].to_numpy()  # Convert to NumPy array for fast indexing
    num_rows = len(df)
    starting_index = 0

    slice_list = []

    while starting_index + slice_size <= num_rows:
        if rolling_flag_array[starting_index + slice_size - 1] == slice_size:  # Use precomputed array
            slice_df = df.iloc[starting_index:starting_index + slice_size].copy()
            slice_df = slice_df.drop(columns='RollingTimeDiffFlag')
            slice_list.append(slice_df)
            starting_index += 1

        else:
            starting_index += 1
        
    ptid_validation_slice_dict[ptid] = slice_list

In [30]:
for ptid, slice_list in ptid_validation_slice_dict.items():


    ptid_count = 0

    encoder_dir = os.path.join(ohio_validation_directory, f'ohio_validation_{ptid}', 'EncoderSlices')
    os.makedirs(encoder_dir, exist_ok=True)
    decoder_dir = os.path.join(ohio_validation_directory, f'ohio_validation_{ptid}', 'DecoderSlices')
    os.makedirs(decoder_dir, exist_ok=True)
    target_dir = os.path.join(ohio_validation_directory, f'ohio_validation_{ptid}', 'TargetSlices')
    os.makedirs(target_dir, exist_ok=True)

    for i, slice_df in enumerate(slice_list):
        # Replace all instances of 'slice' with 'slice_df'
        encoder_input = slice.iloc[:encoder_input_size]
        target = slice.iloc[encoder_input_size:]['GlucoseValue']

        decoder_input = slice_df.iloc[-decoder_input_size:].copy().reset_index(drop=True)
        decoder_input.loc[decoder_input.index[start_token_size:], 'GlucoseValue'] = 0

        encoder_path = os.path.join(encoder_dir, f'{ptid_count}.pt')
        decoder_path = os.path.join(decoder_dir, f'{ptid_count}.pt')
        target_path = os.path.join(target_dir, f'{ptid_count}.pt')

        torch.save(torch.tensor(encoder_input.values, dtype=torch.float32), encoder_path)
        torch.save(torch.tensor(decoder_input.values, dtype=torch.float32), decoder_path)
        torch.save(torch.tensor(target.values, dtype=torch.float32), target_path)

        ptid_count += 1

In [31]:
# load first file for ohio_training_559
encoder_path = get_first_file(os.path.join(ohio_validation_directory, 'ohio_validation_544', 'EncoderSlices'))
decoder_path = get_first_file(os.path.join(ohio_validation_directory, 'ohio_validation_544', 'DecoderSlices'))
target_path = get_first_file(os.path.join(ohio_validation_directory, 'ohio_validation_544', 'TargetSlices'))

encoder_tensor = torch.load(encoder_path)
decoder_tensor = torch.load(decoder_path)
target_tensor = torch.load(target_path)

encoder_df = pd.DataFrame(encoder_tensor.numpy(), columns=["GlucoseValue", 'Hour', 'Minute'])
decoder_df = pd.DataFrame(decoder_tensor.numpy(), columns=["GlucoseValue", 'Hour', 'Minute'])
target_df = pd.DataFrame(target_tensor.numpy(), columns=["GlucoseValue"])

print(f"\n Encoder Shape: {encoder_df.shape}")
print(encoder_df.tail())
print(f"\n Decoder Shape: {decoder_df.shape}")
print(decoder_df.tail(30))
print(f"\n Target Shape: {target_df.shape}")
print(target_df.tail())



 Encoder Shape: (72, 3)
    GlucoseValue  Hour  Minute
67     -1.108723   6.0     4.0
68     -1.151415   6.0     9.0
69     -1.108723   6.0    14.0
70     -1.108723   6.0    19.0
71     -1.208338   6.0    24.0

 Decoder Shape: (36, 3)
    GlucoseValue  Hour  Minute
6       1.353192  21.0    50.0
7       1.367423  21.0    55.0
8       1.381654  22.0     0.0
9       1.410115  22.0     5.0
10      1.424346  22.0    10.0
11      1.424346  22.0    15.0
12      0.000000  22.0    20.0
13      0.000000  22.0    25.0
14      0.000000  22.0    30.0
15      0.000000  22.0    35.0
16      0.000000  22.0    40.0
17      0.000000  22.0    45.0
18      0.000000  22.0    50.0
19      0.000000  22.0    55.0
20      0.000000  23.0     0.0
21      0.000000  23.0     5.0
22      0.000000  23.0    10.0
23      0.000000  23.0    15.0
24      0.000000  23.0    20.0
25      0.000000  23.0    25.0
26      0.000000  23.0    30.0
27      0.000000  23.0    35.0
28      0.000000  23.0    40.0
29      0.000000  23

[Back to Table of Contents](#CONTENTS)

## **7. OHIO T1DM Test Data Processing**

In [32]:
for ptid, df in ohio_test_data.items():
    print(f"Patient ID: {ptid}, Number of Glucose Measurements: {len(df)}")

Patient ID: 552, Number of Glucose Measurements: 2364
Patient ID: 540, Number of Glucose Measurements: 2896
Patient ID: 559, Number of Glucose Measurements: 2514
Patient ID: 544, Number of Glucose Measurements: 2716
Patient ID: 588, Number of Glucose Measurements: 2791
Patient ID: 570, Number of Glucose Measurements: 2745
Patient ID: 563, Number of Glucose Measurements: 2570
Patient ID: 596, Number of Glucose Measurements: 2743
Patient ID: 591, Number of Glucose Measurements: 2760
Patient ID: 584, Number of Glucose Measurements: 2665
Patient ID: 567, Number of Glucose Measurements: 2389
Patient ID: 575, Number of Glucose Measurements: 2590


In [33]:
for ptid, df in ohio_test_data.items():
    df = df.copy()
    df['real_value_flag'] = 1
    df['TimeDiff'] = df['timestamp'].diff().dt.total_seconds()

    # Identify rows where TimeDiff is around 600 seconds (10 min)
    mask = (df['TimeDiff'] > 595) & (df['TimeDiff'] < 605)
    insert_rows = df[mask].copy()

    if not insert_rows.empty:
        # Modify new rows: set `real_value_flag = 0`, shift `DateTime`, and set `GlucoseValue = NaN`
        insert_rows['real_value_flag'] = 0
        insert_rows['timestamp'] -= pd.to_timedelta(5, unit='m')
        insert_rows['GlucoseValue'] = np.nan

        # Append new rows to the dataframe and sort
        df = pd.concat([df, insert_rows]).sort_values(by='timestamp').reset_index(drop=True)

    # Convert 'value' column to numeric before interpolation
    df['value'] = pd.to_numeric(df['value'], errors='coerce')
    df['GlucoseValue'] = df['value'].interpolate(method='linear')

    df['Hour'] = df['timestamp'].dt.hour
    df['Minute'] = df['timestamp'].dt.minute
    df['TimeDiff'] = df['timestamp'].diff().dt.total_seconds()
    df['TimeDiffFlag'] = df['TimeDiff'].apply(lambda x: 0 if x < 295 or x > 305 else 1)
    df['RollingTimeDiffFlag'] = df['TimeDiffFlag'].rolling(window=96).sum()

    # Drop first 96 rows due to NaN values
    # df = df.iloc[95:].reset_index(drop=True)

    # drop columns
    df = df.drop(columns=['timestamp', 'value', 'TimeDiff', 'TimeDiffFlag', 'real_value_flag'])

    ohio_test_data[ptid] = df

In [36]:
for ptid, df in ohio_test_data.items():
    df['GlucoseValue'] = (df['GlucoseValue'] - normalisation_mean) / normalisation_std

In [37]:
ptid_test_slice_dict = {}


for ptid, df in ohio_test_data.items():
    rolling_flag_array = df["RollingTimeDiffFlag"].to_numpy()  # Convert to NumPy array for fast indexing
    num_rows = len(df)
    starting_index = 0

    slice_list = []

    while starting_index + slice_size <= num_rows:
        if rolling_flag_array[starting_index + slice_size - 1] == slice_size:  # Use precomputed array
            slice_df = df.iloc[starting_index:starting_index + slice_size].copy()
            slice_df = slice_df.drop(columns='RollingTimeDiffFlag')
            slice_list.append(slice_df)
            starting_index += 1

        else:
            starting_index += 1
        
    ptid_test_slice_dict[ptid] = slice_list

In [38]:
for ptid, slice_list in ptid_test_slice_dict.items():

    ptid_count = 0

    encoder_dir = os.path.join(ohio_test_directory, f'ohio_test_{ptid}', 'EncoderSlices')
    os.makedirs(encoder_dir, exist_ok=True)
    decoder_dir = os.path.join(ohio_test_directory, f'ohio_test_{ptid}', 'DecoderSlices')
    os.makedirs(decoder_dir, exist_ok=True)
    target_dir = os.path.join(ohio_test_directory, f'ohio_test_{ptid}', 'TargetSlices')
    os.makedirs(target_dir, exist_ok=True)

    for i, slice_df in enumerate(slice_list):
        # Replace all instances of 'slice' with 'slice_df'
        encoder_input = slice.iloc[:encoder_input_size]
        target = slice.iloc[encoder_input_size:]['GlucoseValue']

        decoder_input = slice_df.iloc[-decoder_input_size:].copy().reset_index(drop=True)
        decoder_input.loc[decoder_input.index[start_token_size:], 'GlucoseValue'] = 0

        encoder_path = os.path.join(encoder_dir, f'{ptid_count}.pt')
        decoder_path = os.path.join(decoder_dir, f'{ptid_count}.pt')
        target_path = os.path.join(target_dir, f'{ptid_count}.pt')

        torch.save(torch.tensor(encoder_input.values, dtype=torch.float32), encoder_path)
        torch.save(torch.tensor(decoder_input.values, dtype=torch.float32), decoder_path)
        torch.save(torch.tensor(target.values, dtype=torch.float32), target_path)

        ptid_count += 1

In [39]:
# load first file for ohio_training_559
encoder_path = get_first_file(os.path.join(ohio_test_directory, 'ohio_test_540', 'EncoderSlices'))
decoder_path = get_first_file(os.path.join(ohio_test_directory, 'ohio_test_540', 'DecoderSlices'))
target_path = get_first_file(os.path.join(ohio_test_directory, 'ohio_test_540', 'TargetSlices'))

encoder_tensor = torch.load(encoder_path)
decoder_tensor = torch.load(decoder_path)
target_tensor = torch.load(target_path)

encoder_df = pd.DataFrame(encoder_tensor.numpy(), columns=["GlucoseValue", 'Hour', 'Minute'])
decoder_df = pd.DataFrame(decoder_tensor.numpy(), columns=["GlucoseValue", 'Hour', 'Minute'])
target_df = pd.DataFrame(target_tensor.numpy(), columns=["GlucoseValue"])



print(f"\n Encoder Shape: {encoder_df.shape}")
print(encoder_df.tail())
print(f"\n Decoder Shape: {decoder_df.shape}")
print(decoder_df.tail(30))
print(f"\n Target Shape: {target_df.shape}")
print(target_df.tail())




 Encoder Shape: (72, 3)
    GlucoseValue  Hour  Minute
67     -1.108723   6.0     4.0
68     -1.151415   6.0     9.0
69     -1.108723   6.0    14.0
70     -1.108723   6.0    19.0
71     -1.208338   6.0    24.0

 Decoder Shape: (36, 3)
    GlucoseValue  Hour  Minute
6      -0.496802   5.0    42.0
7      -0.525263   5.0    47.0
8      -0.553725   5.0    52.0
9      -0.567955   5.0    57.0
10     -0.553725   6.0     2.0
11     -0.525263   6.0     7.0
12      0.000000   6.0    12.0
13      0.000000   6.0    17.0
14      0.000000   6.0    22.0
15      0.000000   6.0    27.0
16      0.000000   6.0    32.0
17      0.000000   6.0    37.0
18      0.000000   6.0    42.0
19      0.000000   6.0    47.0
20      0.000000   6.0    52.0
21      0.000000   6.0    57.0
22      0.000000   7.0     2.0
23      0.000000   7.0     7.0
24      0.000000   7.0    12.0
25      0.000000   7.0    17.0
26      0.000000   7.0    22.0
27      0.000000   7.0    27.0
28      0.000000   7.0    32.0
29      0.000000   7

[Back to Table of Contents](#CONTENTS)