# T0.00 Simple Data Loader Examples

## Purpose
This notebook shows how the different data loaders in our package behave.

### 1. Kaggle Bristol T1D


In [1]:
from src.data.diabetes_datasets.data_loader import get_loader

#### No Cache - Training Data Load

In [2]:
loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="train",
    keep_columns=[
        "datetime",
        "p_num",
        "bg_mM",
        "hr_bpm",
        "steps",
        "cals",
        "cob",
        "carb_availability",
        "dose_units",
        "iob",
        "insulin_availability",
    ],
    use_cached=False,
    parallel=True,
    max_workers=9,
)

2025-09-09T00:20:41 - Beginning data loading process with the following parmeters:
2025-09-09T00:20:41 - 	Dataset: kaggle_brisT1D - train
2025-09-09T00:20:41 - 	Columns: ['datetime', 'p_num', 'bg_mM', 'hr_bpm', 'steps', 'cals', 'cob', 'carb_availability', 'dose_units', 'iob', 'insulin_availability']
2025-09-09T00:20:41 - 	Generic patient start date: 2024-01-01 00:00:00
2025-09-09T00:20:41 - 	Number of validation days: 20
2025-09-09T00:20:41 - 	In parallel with up to 9 workers.

2025-09-09T00:20:41 - Processed cache not found or not used, processing raw data and saving to cache...
2025-09-09T00:20:41 - Raw data for kaggle_brisT1D already exists in cache
2025-09-09T00:20:51 - _process_raw_train_data: Processing train data. This may take a while...
2025-09-09T00:20:51 - Processing 9 patients:
2025-09-09T00:20:51 - Running process_single_patient_data(), 
			Processing patient p01 data...
			Patient start date: 2024-01-01
2025-09-09T00:20:51 - 	Inputed patient start time: 06:10:00
2025-09-0

In [3]:
for key, value in loader.processed_data.items():
    print(f"{key}: {value.shape}")

p01: (8711, 14)
p05: (8808, 14)
p06: (8791, 14)
p12: (26371, 14)
p11: (25559, 14)
p10: (25803, 14)
p04: (24983, 14)
p02: (26423, 14)
p03: (26423, 14)


#### Cached - Training Data Load

In [8]:
loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="train",
    keep_columns=[
        "datetime",
        "p_num",
        "bg_mM",
        "hr_bpm",
        "steps",
        "cals",
        "cob",
        "carb_availability",
        "dose_units",
        "iob",
        "insulin_availability",
    ],
    use_cached=True,
    parallel=True,
    max_workers=9,
)

2025-09-09T00:00:28 - Beginning data loading process with the following parmeters:
2025-09-09T00:00:28 - 	Dataset: kaggle_brisT1D - train
2025-09-09T00:00:28 - 	Columns: ['datetime', 'p_num', 'bg_mM', 'hr_bpm', 'steps', 'cals', 'cob', 'carb_availability', 'dose_units', 'iob', 'insulin_availability']
2025-09-09T00:00:28 - 	Generic patient start date: 2024-01-01 00:00:00
2025-09-09T00:00:28 - 	Number of validation days: 20
2025-09-09T00:00:28 - 	In parallel with up to 9 workers.

2025-09-09T00:00:28 - Processed data path for kaggle_brisT1D: /u6/cjrisi/nocturnal/cache/data/kaggle_brisT1D/processed
2025-09-09T00:00:28 - cache_manager.load_processed_data() returned dfs for:
 [['p11', 'p10', 'p12', 'p04', 'p05', 'p06', 'p01', 'p02', 'p03']]
2025-09-09T00:00:28 - Loading processed data from cache...


#### Processed Training Data Results


##### Processed
This should be the full dataset that gets split into training and validation.

In [4]:
for p_num, df in loader.processed_data.items():
    print(f"Patient Number: {p_num}")
    print(f"DataFrame Shape: {df.shape}")
    print(f"Dataframe Head:\n{df.head()}\n")

Patient Number: p01
DataFrame Shape: (8711, 14)
Dataframe Head:
                        id p_num  bg_mM  dose_units  food_g  hr_bpm  steps  \
datetime                                                                     
2024-01-01 06:10:00  p01_0   p01   15.1      0.0417     NaN     NaN    NaN   
2024-01-01 06:25:00  p01_1   p01   14.4      0.0417     NaN     NaN    NaN   
2024-01-01 06:40:00  p01_2   p01   13.9      0.0417     NaN     NaN    NaN   
2024-01-01 06:55:00  p01_3   p01   13.8      0.0417     NaN     NaN    NaN   
2024-01-01 07:10:00  p01_4   p01   13.4      0.0417     NaN     NaN    NaN   

                     cals activity msg_type  cob  carb_availability  \
datetime                                                              
2024-01-01 06:10:00   NaN      NaN      NaN  0.0                0.0   
2024-01-01 06:25:00   NaN      NaN      NaN  0.0                0.0   
2024-01-01 06:40:00   NaN      NaN      NaN  0.0                0.0   
2024-01-01 06:55:00   NaN      NaN

##### Training Data
This should be approximately the first 60-70 days per patient.
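
As a quick sanity check on that expectation, a row count can be converted into days of coverage. The sketch below is only arithmetic: the row counts are the p01 train and validation shapes printed in this section, and the 15-minute interval is read off p01's datetime index. The helper name `rows_to_days` is hypothetical and not part of the package.

```python
MINUTES_PER_DAY = 24 * 60


def rows_to_days(n_rows: int, interval_minutes: int) -> float:
    """Approximate number of days spanned by n_rows samples at a fixed interval."""
    return n_rows * interval_minutes / MINUTES_PER_DAY


# Row counts come from the p01 outputs in this notebook; the 15-minute
# interval is an assumption based on p01's datetime index.
train_days = rows_to_days(6723, 15)
validation_days = rows_to_days(1921, 15)

print(f"train: {train_days:.1f} days")            # train: 70.0 days
print(f"validation: {validation_days:.1f} days")  # validation: 20.0 days
```

Patients sampled at a different interval need their own `interval_minutes` value, so the same row count does not imply the same number of days.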

In [5]:
for p_num, df in loader.train_data.items():
    print(f"Patient Number: {p_num}")
    print(f"DataFrame Shape: {df.shape}")
    print(f"Dataframe Head:\n{df.head()}\n")

Patient Number: p01
DataFrame Shape: (6723, 14)
Dataframe Head:
                        id p_num  bg_mM  dose_units  food_g  hr_bpm  steps  \
datetime                                                                     
2024-01-01 06:10:00  p01_0   p01   15.1      0.0417     NaN     NaN    NaN   
2024-01-01 06:25:00  p01_1   p01   14.4      0.0417     NaN     NaN    NaN   
2024-01-01 06:40:00  p01_2   p01   13.9      0.0417     NaN     NaN    NaN   
2024-01-01 06:55:00  p01_3   p01   13.8      0.0417     NaN     NaN    NaN   
2024-01-01 07:10:00  p01_4   p01   13.4      0.0417     NaN     NaN    NaN   

                     cals activity msg_type  cob  carb_availability  \
datetime                                                              
2024-01-01 06:10:00   NaN      NaN      NaN  0.0                0.0   
2024-01-01 06:25:00   NaN      NaN      NaN  0.0                0.0   
2024-01-01 06:40:00   NaN      NaN      NaN  0.0                0.0   
2024-01-01 06:55:00   NaN      NaN

##### Validation Data

This should be about the last 20 days per patient.

In [6]:
for p_num, df in loader.validation_data.items():
    print(f"Patient Number: {p_num}")
    print(f"DataFrame Shape: {df.shape}")
    print(f"Dataframe Head:\n{df.head()}\n")

Patient Number: p01
DataFrame Shape: (1921, 14)
Dataframe Head:
                           id p_num  bg_mM  dose_units  food_g  hr_bpm  steps  \
datetime                                                                        
2024-03-11 06:55:00       NaN   NaN    NaN         NaN     NaN     NaN    NaN   
2024-03-11 07:10:00  p01_6506   p01    NaN      0.0417     NaN    80.3   92.0   
2024-03-11 07:25:00  p01_6507   p01    NaN      0.0417     NaN    75.5    0.0   
2024-03-11 07:40:00  p01_6508   p01    NaN      0.0417     NaN    76.9   65.0   
2024-03-11 07:55:00  p01_6509   p01    NaN      0.0417     NaN    77.1    0.0   

                     cals activity msg_type  cob  carb_availability  \
datetime                                                              
2024-03-11 06:55:00   NaN      NaN      NaN  0.0                0.0   
2024-03-11 07:10:00   4.8      NaN      NaN  0.0                0.0   
2024-03-11 07:25:00   4.8      NaN      NaN  0.0                0.0   
2024-03-11 07

#### No Cache - Testing Data Load
For Kaggle, the test data is provided separately and its format differs substantially from the training set. Outside of a few special use cases, it should not be treated as true test data.

If you want a proper train/val/test split, we suggest splitting the validation set into two 10-day datasets, or into a 15-day and a 5-day dataset.
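
One way to make such a split is to cut each patient's validation frame at a fixed number of days after its first timestamp. The following is a minimal sketch under assumptions, not the loader's own logic: `split_by_days` is a hypothetical helper, and the frame is a synthetic stand-in with a `datetime` index at 15-minute steps.

```python
import pandas as pd


def split_by_days(df: pd.DataFrame, first_days: int) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a time-indexed frame into its first `first_days` days and the remainder."""
    cutoff = df.index.min() + pd.Timedelta(days=first_days)
    first = df.loc[df.index < cutoff]
    second = df.loc[df.index >= cutoff]
    return first, second


# Synthetic stand-in for one patient's validation frame:
# 20 days of samples at 15-minute steps (96 samples per day).
index = pd.date_range("2024-03-11 06:55", periods=20 * 96, freq="15min", name="datetime")
validation_df = pd.DataFrame({"bg_mM": 7.0}, index=index)

# 10/10 split; pass first_days=15 for the 15-day and 5-day variant.
new_validation, new_test = split_by_days(validation_df, first_days=10)
print(new_validation.shape, new_test.shape)
```

Since `loader.validation_data` is a dictionary of per-patient DataFrames, the same helper could be applied to each value in turn. Cutting by time rather than by row count keeps the split aligned to days even when rows are missing.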

In [2]:
from src.data.diabetes_datasets.data_loader import get_loader

In [3]:
loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="test",
    keep_columns=[
        "datetime",
        "p_num",
        "bg_mM",
        "hr_bpm",
        "steps",
        "cals",
        "cob",
        "carb_availability",
        "dose_units",
        "iob",
        "insulin_availability",
    ],
    use_cached=False,
    parallel=True,
    max_workers=8,
)

2025-09-05T21:26:42 - Beginning data loading process with the following parmeters:
2025-09-05T21:26:42 - 	Dataset: kaggle_brisT1D - test
2025-09-05T21:26:42 - 	Columns: ['datetime', 'p_num', 'bg_mM', 'hr_bpm', 'steps', 'cals', 'cob', 'carb_availability', 'dose_units', 'iob', 'insulin_availability']
2025-09-05T21:26:42 - 	Generic patient start date: 2024-01-01 00:00:00
2025-09-05T21:26:42 - 	In parallel with up to 8 workers.

2025-09-05T21:26:42 - Processed cache not found or not used, processing raw data and saving to cache...
2025-09-05T21:26:42 - Raw data for kaggle_brisT1D already exists in cache
2025-09-05T21:26:42 - Processing test data. This may take a while...
2025-09-05T21:28:23 - Processed data path for kaggle_brisT1D: /u6/cjrisi/nocturnal/cache/data/kaggle_brisT1D/processed
2025-09-05T21:28:23 - Processed test data will be cached at: /u6/cjrisi/nocturnal/cache/data/kaggle_brisT1D/processed/test
2025-09-05T21:28:23 - Processing test data in parallel with up to 8 workers...
202

#### Cached - Testing Data Load

In [6]:
loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="test",
    keep_columns=[
        "datetime",
        "p_num",
        "bg_mM",
        "hr_bpm",
        "steps",
        "cals",
        "cob",
        "carb_availability",
        "dose_units",
        "iob",
        "insulin_availability",
    ],
    use_cached=True,
    parallel=True,
    max_workers=8,
)

2025-09-05T21:36:28 - Beginning data loading process with the following parmeters:
2025-09-05T21:36:28 - 	Dataset: kaggle_brisT1D - test
2025-09-05T21:36:28 - 	Columns: ['datetime', 'p_num', 'bg_mM', 'hr_bpm', 'steps', 'cals', 'cob', 'carb_availability', 'dose_units', 'iob', 'insulin_availability']
2025-09-05T21:36:28 - 	Generic patient start date: 2024-01-01 00:00:00
2025-09-05T21:36:28 - 	In parallel with up to 8 workers.

2025-09-05T21:36:28 - Processed data path for kaggle_brisT1D: /u6/cjrisi/nocturnal/cache/data/kaggle_brisT1D/processed
2025-09-05T21:36:28 - Processed data path for kaggle_brisT1D: /u6/cjrisi/nocturnal/cache/data/kaggle_brisT1D/processed
2025-09-05T21:36:29 - Loaded nested test data from cache
2025-09-05T21:36:29 - Loaded data for 15 patients
2025-09-05T21:36:29 - Loaded nested test data from compressed cache
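
The log timestamps give a rough sense of what the cache saves. The sketch below parses the leading timestamp of the first and last visible log lines for the uncached and cached test loads. The message text is abbreviated here, `elapsed_seconds` is a hypothetical helper, and the uncached log shown above is truncated, so its figure is only a lower bound.

```python
from datetime import datetime


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Seconds between the leading ISO timestamps of two log lines."""

    def stamp(line: str) -> datetime:
        # Each log line has the form 'YYYY-MM-DDTHH:MM:SS - message'.
        return datetime.fromisoformat(line.split(" - ", 1)[0])

    return (stamp(last_line) - stamp(first_line)).total_seconds()


# Timestamps copied from the logs above; messages shortened for readability.
uncached = elapsed_seconds(
    "2025-09-05T21:26:42 - Beginning data loading process",
    "2025-09-05T21:28:23 - Processing test data in parallel",
)
cached = elapsed_seconds(
    "2025-09-05T21:36:28 - Beginning data loading process",
    "2025-09-05T21:36:29 - Loaded nested test data from compressed cache",
)
print(uncached, cached)  # 101.0 1.0
```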


#### Processed Test Data Results

In [7]:
for p_num, data_dict in loader.processed_data.items():
    print(f"Patient Number: {p_num}")
    for row_id, df in data_dict.items():
        print(f"Row ID: {row_id}")
        print(f"DataFrame Shape: {df.shape}\n")

Patient Number: p01
Row ID: p01_8459
DataFrame Shape: (72, 12)

Row ID: p01_8460
DataFrame Shape: (72, 12)

Row ID: p01_8461
DataFrame Shape: (72, 12)

Row ID: p01_8462
DataFrame Shape: (72, 12)

Row ID: p01_8463
DataFrame Shape: (72, 12)

Row ID: p01_8464
DataFrame Shape: (72, 12)

Row ID: p01_8465
DataFrame Shape: (72, 12)

Row ID: p01_8466
DataFrame Shape: (72, 12)

Row ID: p01_8467
DataFrame Shape: (72, 12)

Row ID: p01_8468
DataFrame Shape: (72, 12)

Row ID: p01_8469
DataFrame Shape: (72, 12)

Row ID: p01_8470
DataFrame Shape: (72, 12)

Row ID: p01_8471
DataFrame Shape: (72, 12)

Row ID: p01_8472
DataFrame Shape: (72, 12)

Row ID: p01_8473
DataFrame Shape: (72, 12)

Row ID: p01_8474
DataFrame Shape: (72, 12)

Row ID: p01_8475
DataFrame Shape: (72, 12)

Row ID: p01_8476
DataFrame Shape: (72, 12)

Row ID: p01_8477
DataFrame Shape: (72, 12)

Row ID: p01_8478
DataFrame Shape: (72, 12)

Row ID: p01_8479
DataFrame Shape: (72, 12)

Row ID: p01_8480
DataFrame Shape: (72, 12)

Row ID: p01_
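
Because the test data is nested as patient, then row ID, then DataFrame, a compact summary is often easier to read than printing every window. The sketch below is illustrative only: `summarize_windows` is a hypothetical helper and the nested dictionary is a synthetic stand-in. It also assumes the 5-minute spacing seen in the `p24_260` window, under which 72 rows cover 6 hours.

```python
import pandas as pd


def summarize_windows(nested: dict[str, dict[str, pd.DataFrame]]) -> dict[str, dict]:
    """Count windows per patient and collect the distinct DataFrame shapes."""
    summary = {}
    for p_num, windows in nested.items():
        summary[p_num] = {
            "n_windows": len(windows),
            "shapes": {df.shape for df in windows.values()},
        }
    return summary


# Synthetic stand-in for the nested test data: 72 rows at 5-minute steps.
index = pd.date_range("2024-01-01 21:15", periods=72, freq="5min", name="datetime")
window = pd.DataFrame({"bg_mM": 9.9}, index=index)
nested = {
    "p01": {"p01_0": window, "p01_1": window.copy()},
    "p02": {"p02_0": window.copy()},
}

summary = summarize_windows(nested)
for p_num, info in summary.items():
    print(p_num, info["n_windows"], info["shapes"])

# Duration covered by one window, counting the final 5-minute step.
span = index[-1] - index[0] + pd.Timedelta(minutes=5)
print(span)
```

A patient whose `shapes` set holds more than one entry has windows of unequal length, which is worth checking before stacking them into a single array.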

In [8]:
loader.processed_data["p24"]["p24_260"]

datetime,bg_mM,dose_units,food_g,hr_bpm,steps,cals,activity,p_num,cob,carb_availability,insulin_availability,iob
2024-01-01 21:15:00,9.9,0.0792,,76.8,58.0,8.04,,p24,0.0,0.0,0.000000,0.079200
2024-01-01 21:20:00,10.1,0.0792,,84.4,98.0,11.04,,p24,0.0,0.0,0.000000,0.158400
2024-01-01 21:25:00,10.0,0.0792,,84.8,30.0,9.87,,p24,0.0,0.0,0.000000,0.237600
2024-01-01 21:30:00,9.9,0.0792,,91.9,29.0,8.88,,p24,0.0,0.0,0.000655,0.316800
2024-01-01 21:35:00,10.0,0.0792,,81.7,29.0,7.38,,p24,0.0,0.0,0.002048,0.395548
...,...,...,...,...,...,...,...,...,...,...,...,...
2024-01-02 02:50:00,6.4,0.0771,,72.7,0.0,4.31,,p24,0.0,0.0,0.109248,2.349364
2024-01-02 02:55:00,6.4,0.0000,,71.8,,4.15,,p24,0.0,0.0,0.113737,2.259138
2024-01-02 03:00:00,6.4,0.0000,,76.2,0.0,4.23,,p24,0.0,0.0,0.117266,2.165815
2024-01-02 03:05:00,6.5,0.0327,,70.5,,4.15,,p24,0.0,0.0,0.119963,2.102757
