# 1.08 Data Loader Example Notebook

## Purpose
This notebook shows how the package's data loaders behave under different settings: loading with and without the processed-data cache, and loading training versus test data.

### Kaggle Bristol T1D
#### No Cache - Training Data Load

In [1]:
from src.data.diabetes_datasets.data_loader import get_loader


In [3]:
loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="train",
    keep_columns=[
        "datetime",
        "p_num",
        "bg_mM",
        "hr_bpm",
        "steps",
        "cals",
        "cob",
        "carb_availability",
        "dose_units",
        "iob",
        "insulin_availability",
    ],
    use_cached=False,
    parallel=True,
    max_workers=9,
)

2025-09-04T23:58:42 - Beginning data loading process with the following parmeters:
2025-09-04T23:58:42 - 	Dataset: kaggle_brisT1D - train
2025-09-04T23:58:42 - 	Columns: ['datetime', 'p_num', 'bg_mM', 'hr_bpm', 'steps', 'cals', 'cob', 'carb_availability', 'dose_units', 'iob', 'insulin_availability']
2025-09-04T23:58:42 - 	Generic patient start date: 2024-01-01 00:00:00
2025-09-04T23:58:42 - 	Number of validation days: 20
2025-09-04T23:58:42 - 	In parallel with up to 9 workers.

2025-09-04T23:58:42 - Processed cache not found or not used, processing raw data and saving to cache...
2025-09-04T23:58:42 - Raw data for kaggle_brisT1D already exists in cache
2025-09-04T23:58:51 - _process_raw_train_data: Processing train data. This may take a while...
2025-09-04T23:58:52 - Processing 9 patients:
2025-09-04T23:58:52 - Running process_single_patient_data(), 
			Processing patient p01 data...
			Patient start date: 2024-01-01
2025-09-04T23:58:52 - 	Inputed patient start time: 06:10:00
[output truncated]

In [4]:
for key, value in loader.processed_data.items():
    print(f"{key}: {value.shape}")

p01: (8711, 14)
p05: (8808, 14)
p06: (8791, 14)
p12: (26371, 14)
p11: (25559, 14)
p10: (25803, 14)
p04: (24983, 14)
p02: (26423, 14)
p03: (26423, 14)
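
The patient keys above come back in no particular order, which makes runs harder to compare by eye. As a minimal sketch, the shapes can be sorted and totalled with plain Python. The `shapes` dictionary below is copied by hand from the output above, and `summarize_shapes` is a hypothetical helper, not part of the package.

```python
# Shapes copied from the cell output above: patient id -> (rows, columns).
shapes = {
    "p01": (8711, 14), "p05": (8808, 14), "p06": (8791, 14),
    "p12": (26371, 14), "p11": (25559, 14), "p10": (25803, 14),
    "p04": (24983, 14), "p02": (26423, 14), "p03": (26423, 14),
}


def summarize_shapes(shapes):
    """Return (sorted patient ids, total row count, set of column counts)."""
    patient_ids = sorted(shapes)
    total_rows = sum(rows for rows, _ in shapes.values())
    column_counts = {cols for _, cols in shapes.values()}
    return patient_ids, total_rows, column_counts


patient_ids, total_rows, column_counts = summarize_shapes(shapes)
for p_num in patient_ids:
    print(f"{p_num}: {shapes[p_num]}")
print(f"total rows: {total_rows}, column counts seen: {column_counts}")
```

With a live loader, the same dictionary could be built as `{k: v.shape for k, v in loader.processed_data.items()}`.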


#### Cached - Training Data Load

In [5]:
loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="train",
    keep_columns=[
        "datetime",
        "p_num",
        "bg_mM",
        "hr_bpm",
        "steps",
        "cals",
        "cob",
        "carb_availability",
        "dose_units",
        "iob",
        "insulin_availability",
    ],
    use_cached=True,
    parallel=True,
    max_workers=9,
)

2025-09-05T00:01:14 - Beginning data loading process with the following parmeters:
2025-09-05T00:01:14 - 	Dataset: kaggle_brisT1D - train
2025-09-05T00:01:14 - 	Columns: ['datetime', 'p_num', 'bg_mM', 'hr_bpm', 'steps', 'cals', 'cob', 'carb_availability', 'dose_units', 'iob', 'insulin_availability']
2025-09-05T00:01:14 - 	Generic patient start date: 2024-01-01 00:00:00
2025-09-05T00:01:14 - 	Number of validation days: 20
2025-09-05T00:01:14 - 	In parallel with up to 9 workers.

2025-09-05T00:01:14 - Processed data path for kaggle_brisT1D: /u6/cjrisi/nocturnal/cache/data/kaggle_brisT1D/processed
2025-09-05T00:01:14 - cache_manager.load_processed_data() returned:
 ['p11', 'p10', 'p12', 'p04', 'p05', 'p06', 'p01', 'p02', 'p03']
2025-09-05T00:01:14 - Loading processed data from cache...
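
The cached and uncached calls above repeat the same column list and differ only in `use_cached`. One possible way to make that explicit is to build the keyword arguments once. This is a sketch only: `KEEP_COLUMNS` and `loader_kwargs` are hypothetical names introduced here, and the block does not import the package, so it runs on its own.

```python
# Columns requested in every get_loader call in this notebook.
KEEP_COLUMNS = [
    "datetime", "p_num", "bg_mM", "hr_bpm", "steps", "cals", "cob",
    "carb_availability", "dose_units", "iob", "insulin_availability",
]


def loader_kwargs(dataset_type, use_cached, max_workers):
    """Hypothetical helper: build the keyword arguments passed to get_loader."""
    return {
        "data_source_name": "kaggle_brisT1D",
        "dataset_type": dataset_type,
        "keep_columns": list(KEEP_COLUMNS),  # copy so callers cannot mutate the constant
        "use_cached": use_cached,
        "parallel": True,
        "max_workers": max_workers,
    }


uncached = loader_kwargs("train", use_cached=False, max_workers=9)
cached = loader_kwargs("train", use_cached=True, max_workers=9)

# List the keys whose values differ between the two configurations.
differing = sorted(k for k in uncached if uncached[k] != cached[k])
print(differing)
```

With the package importable, `get_loader(**cached)` would correspond to the cached call shown above.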


#### Processed Training Data

In [6]:
for p_num, df in loader.train_data.items():
    print(f"Patient Number: {p_num}")
    print(f"DataFrame Shape: {df.shape}")
    print(f"Dataframe Head:\n{df.head()}\n")

Patient Number: p11
DataFrame Shape: (18803, 10)
Dataframe Head:
                    p_num  bg_mM  hr_bpm  steps  cals  cob  carb_availability  \
datetime                                                                        
2024-01-01 06:05:00   p11    9.3     NaN    NaN   NaN  0.0                0.0   
2024-01-01 06:10:00   p11    9.2     NaN    NaN   NaN  0.0                0.0   
2024-01-01 06:15:00   p11    9.1     NaN    NaN   NaN  0.0                0.0   
2024-01-01 06:20:00   p11    9.1     NaN    NaN   NaN  0.0                0.0   
2024-01-01 06:25:00   p11    9.2     NaN    NaN   NaN  0.0                0.0   

                     dose_units  iob  insulin_availability  
datetime                                                    
2024-01-01 06:05:00         NaN  0.0                   0.0  
2024-01-01 06:10:00         NaN  0.0                   0.0  
2024-01-01 06:15:00         NaN  0.0                   0.0  
2024-01-01 06:20:00         NaN  0.0                   0.0  
[output truncated]

#### Processed Validation Data

In [7]:
for p_num, df in loader.validation_data.items():
    print(f"Patient Number: {p_num}")
    print(f"DataFrame Shape: {df.shape}")
    print(f"Dataframe Head:\n{df.head()}\n")

Patient Number: p11
DataFrame Shape: (5548, 10)
Dataframe Head:
                    p_num  bg_mM  hr_bpm  steps  cals  cob  carb_availability  \
datetime                                                                        
2024-03-09 06:55:00   p11    4.6     NaN    NaN  5.95  0.0                0.0   
2024-03-09 07:00:00   p11    4.6     NaN    NaN  5.95  0.0                0.0   
2024-03-09 07:05:00   p11    4.6     NaN    NaN  5.95  0.0                0.0   
2024-03-09 07:10:00   p11    4.5     NaN    NaN  5.95  0.0                0.0   
2024-03-09 07:15:00   p11    4.3     NaN    NaN  5.95  0.0                0.0   

                     dose_units       iob  insulin_availability  
datetime                                                         
2024-03-09 06:55:00       0.075  1.841735              0.104553  
2024-03-09 07:00:00       0.075  1.831227              0.104047  
2024-03-09 07:05:00       0.075  1.818093              0.103205  
2024-03-09 07:10:00       0.075  1.805
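
As a rough sanity check, the row counts printed for p11 can be converted into days. The sketch below assumes a gap-free 5-minute grid, which matches the spacing visible in the DataFrame heads but is not guaranteed by anything shown here. `approx_days` is a hypothetical helper, and how the loader actually splits training from validation data is not visible in this notebook.

```python
from datetime import timedelta

# Sampling interval seen in the DataFrame heads above (assumed constant).
SAMPLE_INTERVAL = timedelta(minutes=5)
# Dividing one timedelta by another with // yields an integer count.
ROWS_PER_DAY = timedelta(days=1) // SAMPLE_INTERVAL


def approx_days(n_rows, rows_per_day=ROWS_PER_DAY):
    """Convert a row count into days, assuming no gaps in the time index."""
    return n_rows / rows_per_day


train_days = approx_days(18803)       # p11 training rows, from the output above
validation_days = approx_days(5548)   # p11 validation rows, from the output above
print(f"rows per day: {ROWS_PER_DAY}")
print(f"train: {train_days:.1f} days, validation: {validation_days:.1f} days")
```

Under that assumption the validation frame for p11 spans a little over 19 days, close to the 20 validation days reported in the log. The sketch does not explain the difference.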

#### Testing Data

In [8]:
from src.data.diabetes_datasets.data_loader import get_loader

loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="test",
    keep_columns=[
        "datetime",
        "p_num",
        "bg_mM",
        "hr_bpm",
        "steps",
        "cals",
        "cob",
        "carb_availability",
        "dose_units",
        "iob",
        "insulin_availability",
    ],
    use_cached=False,
    parallel=True,
    max_workers=2,
)

2025-09-05T00:01:33 - Beginning data loading process with the following parmeters:
2025-09-05T00:01:33 - 	Dataset: kaggle_brisT1D - test
2025-09-05T00:01:33 - 	Columns: ['datetime', 'p_num', 'bg_mM', 'hr_bpm', 'steps', 'cals', 'cob', 'carb_availability', 'dose_units', 'iob', 'insulin_availability']
2025-09-05T00:01:33 - 	Generic patient start date: 2024-01-01 00:00:00
2025-09-05T00:01:33 - 	In parallel with up to 2 workers.

2025-09-05T00:01:33 - Processed cache not found or not used, processing raw data and saving to cache...
2025-09-05T00:01:33 - Raw data for kaggle_brisT1D already exists in cache
2025-09-05T00:01:33 - Processing test data. This may take a while...
2025-09-05T00:03:00 - Processed data path for kaggle_brisT1D: /u6/cjrisi/nocturnal/cache/data/kaggle_brisT1D/processed
2025-09-05T00:03:00 - ensure_regular_time_intervals(): Ensuring regular time intervals...
2025-09-05T00:03:00 - 	Most common time interval: 5 minutes
2025-09-05T00:03:00 - Post-ensure_regular_time_interval

KeyError: 'food_g'
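
The test load stops with `KeyError: 'food_g'`, a column that does not appear in `keep_columns`. The cause is not visible in this output. One possible way to surface this kind of problem earlier is to compare the columns a step needs against the columns that are present before indexing. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, and the idea that a processing step requires `food_g` is an assumption drawn from the error message alone.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their given order."""
    present = set(available)
    return [name for name in required if name not in present]


# Columns requested via keep_columns in the call above.
available = [
    "datetime", "p_num", "bg_mM", "hr_bpm", "steps", "cals", "cob",
    "carb_availability", "dose_units", "iob", "insulin_availability",
]

# Assumption: some downstream step needs food_g (inferred from the KeyError only).
required = ["bg_mM", "food_g"]

absent = missing_columns(available, required)
if absent:
    print(f"Missing required columns: {absent}")
```

With a pandas DataFrame `df`, the same check could be written as `missing_columns(df.columns, required)`.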