# T0.01 Advanced Data Cleaner Analysis Examples

- **Author**: Christopher Risi
- **AI Assistance**: Claude Sonnet 3.7

The purpose of this notebook is to show examples of how our data cleaning functions should work when preparing data for modeling. 

## Requirements
### General Requirements
1. Install the package locally with `pip install -e .` in your project root directory
2. Activate your virtual environment (`.noctprob-venv`) before running any code
3. Ensure required Python packages are installed from `requirements.txt`

### Dataset-Specific Requirements

#### Kaggle BrisT1D
1. Set up the Kaggle API on your machine:

    - Create a Kaggle account if you don't have one
    - Generate and download an API key from your Kaggle account settings
    - Place the `kaggle.json` file in the `~/.kaggle/` directory
    - Set proper permissions: ```chmod 600 ~/.kaggle/kaggle.json```

2. Download the dataset using the provided script:

    ```bash scripts/data_downloads/kaggle_data_download.sh```

3. Ensure the dataset files are in the correct locations:

    - Training data: ```src/data/datasets/kaggle_bris_t1d/raw/train.csv```
    - Test data: ```src/data/datasets/kaggle_bris_t1d/raw/test.csv```
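
Before running the loaders, it can help to confirm that these prerequisites are in place. The following is a minimal sketch, not part of the project: `check_kaggle_setup` is a hypothetical helper, and it assumes the raw-data layout listed above relative to the project root.

```python
import stat
from pathlib import Path


def check_kaggle_setup(home: Path, project_root: Path) -> dict:
    """Return a mapping of check name -> bool (hypothetical helper)."""
    key = home / ".kaggle" / "kaggle.json"
    raw_dir = project_root / "src" / "data" / "datasets" / "kaggle_bris_t1d" / "raw"
    return {
        "kaggle.json exists": key.is_file(),
        # chmod 600 means read/write for the owner only
        "kaggle.json mode is 600": key.is_file()
        and stat.S_IMODE(key.stat().st_mode) == 0o600,
        "train.csv exists": (raw_dir / "train.csv").is_file(),
        "test.csv exists": (raw_dir / "test.csv").is_file(),
    }


# Run from the project root; each line reports one prerequisite.
for name, ok in check_kaggle_setup(Path.home(), Path.cwd()).items():
    print(f"{'OK     ' if ok else 'MISSING'} {name}")
```

Any `MISSING` line points back to the corresponding step in the list above.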

#### Gluroo Example Dataset
1. Ensure the Gluroo JSON data is available at the path you'll specify with the `file_path` parameter
2. Optional: Configure custom cleaning parameters through the `config` dictionary

## Code Example

### Kaggle Bristol T1D DataLoader Initialization
> [!Note] 
>
> If your data is not yet cached, the implementation as of 2025/07/05 takes ~25 minutes to run on WATGPU.
>
> As of 2025/07/06, processing the patients in parallel reduces this to ~9 minutes on WATGPU.
> Once the data is cached, loading takes ~6 seconds.
>
> We are working on further efficiency improvements for this processing.


In [6]:
# Import the factory function for creating data loaders
from src.data.diabetes_datasets.data_loader import get_loader

# For Bristol T1D dataset (train)
bris_loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="train",
    use_cached=True,  # Set to False to reprocess raw data
)
bris_data = bris_loader.processed_data

ModuleNotFoundError: No module named 'src'

In [10]:
# For Bristol T1D dataset (test)
bris_test_loader = get_loader(
    data_source_name="kaggle_brisT1D", dataset_type="test", use_cached=True
)
bris_test_data = bris_test_loader.processed_data

#### Train/Val Data Analysis

In [11]:
# Display all methods/attributes available on the bris_loader
print("Available methods and attributes of bris_loader:")
for item in dir(bris_loader):
    if not item.startswith("__"):  # Skip dunder/magic methods
        print(f"- {item}")

Available methods and attributes of bris_loader:
- _abc_impl
- _get_day_splits
- _load_from_cache
- _load_test_data_from_cache
- _process_and_cache_data
- _process_patient_data
- _process_raw_data
- _process_raw_data_old
- _split_train_validation
- _validate_data
- cached_path
- dataset_name
- dataset_type
- file_path
- get_validation_day_splits
- keep_columns
- load_data
- load_data_old
- load_raw
- num_train_days
- num_validation_days
- processed_data
- raw_data_path
- train_data
- train_dt_col_type
- use_cached
- val_dt_col_type
- validation_data


In [12]:
print("\nBristol T1D Cleaned Dataset Information: \n")
print("Dataset Name: ", bris_loader.dataset_name)
print("Dataset Type: ", bris_loader.dataset_type)
print("Raw Data Path: ", bris_loader.raw_data_path)
print("File Path: ", bris_loader.file_path)
print("Cached Path: ", bris_loader.cached_path)
print("Use Cached: ", bris_loader.use_cached)


print("\n")
print("\nBristol T1D Cleaned Dataset Properties:  \n")
print("Keep Columns: ", bris_loader.keep_columns)
print("\nValidation Data Shape: ", bris_loader.validation_data.shape)
print("Number of Validation Days: ", bris_loader.num_validation_days)
print("Validation Data Type: ", bris_loader.val_dt_col_type)

print("\nTrain Data Shape: ", bris_loader.train_data.shape)
print("Number of Train Days: ", bris_loader.num_train_days)
print("Train Data Type: ", bris_loader.train_dt_col_type)


Bristol T1D Cleaned Dataset Information: 

Dataset Name:  kaggle_brisT1D
Dataset Type:  train
Raw Data Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/raw/train.csv
File Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/train_cached.csv
Cached Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/train_cached.csv
Use Cached:  True



Bristol T1D Cleaned Dataset Properties:  

Keep Columns:  ['datetime', 'id', 'p_num', 'time', 'bg-0:00', 'insulin-0:00', 'carbs-0:00', 'hr-0:00', 'steps-0:00', 'cals-0:00', 'activity-0:00', 'cob', 'carb_availability', 'insulin_availability', 'iob']

Validation Data Shape:  (40329, 15)
Number of Validation Days:  20
Validation Data Type:  datetime64[ns]

Train Data Shape:  (140160, 15)
Number of Train Days:  72
Train Data Type:  datetime64[ns]
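
The output above reports 20 validation days and 72 train days. As a rough illustration of how a day-based hold-out can work, here is a minimal sketch that reserves each patient's final `n_days` calendar days for validation. It is an assumption for illustration only and not the loader's `_split_train_validation`; `split_last_n_days` and the toy data are hypothetical.

```python
import pandas as pd


def split_last_n_days(df, n_days, patient_col="p_num", time_col="datetime"):
    """Hold out each patient's final n_days calendar days as validation."""
    day = df[time_col].dt.normalize()
    # Last calendar day observed for each row's patient
    last_day = day.groupby(df[patient_col]).transform("max")
    cutoff = last_day - pd.Timedelta(days=n_days - 1)
    is_val = day >= cutoff
    return df[~is_val], df[is_val]


# Toy data: p01 has 5 days and p02 has 3 days, two readings per day
step = pd.Timedelta(hours=12)
toy = pd.DataFrame(
    {
        "p_num": ["p01"] * 10 + ["p02"] * 6,
        "datetime": list(pd.date_range("2025-01-01 06:00", periods=10, freq=step))
        + list(pd.date_range("2025-01-01 06:00", periods=6, freq=step)),
    }
)
train, val = split_last_n_days(toy, n_days=2)
print(len(train), len(val))
```

Splitting on whole days rather than on a row count keeps every validation day complete, which matters when the modeling target is a nocturnal window.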


In [13]:
from IPython.display import display

print("Train Data: ")
display(bris_loader.train_data.head())
print(bris_loader.train_data["datetime"].dtype)

print("\nValidation Data:")
display(bris_loader.validation_data.head())
print(bris_loader.validation_data["datetime"].dtype)

Train Data: 


Unnamed: 0,datetime,id,p_num,time,bg-0:00,insulin-0:00,carbs-0:00,hr-0:00,steps-0:00,cals-0:00,activity-0:00,cob,carb_availability,insulin_availability,iob
0,2025-01-01 06:10:00,p01_0,p01,06:10:00,15.1,0.0417,,,,,,0.0,0.0,0.0,0.4028
1,2025-01-01 06:25:00,p01_1,p01,06:25:00,14.4,0.0417,,,,,,0.0,0.0,0.003428,0.872082
2,2025-01-01 06:40:00,p01_2,p01,06:40:00,13.9,0.0417,,,,,,0.0,0.0,0.012039,1.385682
3,2025-01-01 06:55:00,p01_3,p01,06:55:00,13.8,0.0417,,,,,,0.0,0.0,0.024747,1.838095
4,2025-01-01 07:10:00,p01_4,p01,07:10:00,13.4,0.0417,,,,,,0.0,0.0,0.040416,2.203691


datetime64[ns]

Validation Data:


Unnamed: 0,datetime,id,p_num,time,bg-0:00,insulin-0:00,carbs-0:00,hr-0:00,steps-0:00,cals-0:00,activity-0:00,cob,carb_availability,insulin_availability,iob
6723,2025-03-12 06:55:00,p01_6723,p01,,,,,,,,,87.40897,48.647153,1.931029,28.187021
6724,2025-03-12 07:10:00,p01_6724,p01,07:10:00,,0.0417,,80.3,92.0,4.8,,81.39424,47.24137,1.899391,27.295364
6725,2025-03-12 07:25:00,p01_6725,p01,07:25:00,,0.0417,,75.5,0.0,4.8,,75.574056,45.48211,1.863221,26.425267
6726,2025-03-12 07:40:00,p01_6726,p01,07:40:00,,0.0417,,76.9,65.0,4.8,,69.987191,43.473307,1.823792,25.57234
6727,2025-03-12 07:55:00,p01_6727,p01,07:55:00,,0.0417,,77.1,0.0,5.0,,59.271225,36.874083,1.513891,22.067263


datetime64[ns]
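
The previews above show `p01` sampled every 15 minutes with several empty `bg-0:00` values. A quick way to check sampling cadence and missingness per patient is sketched below. `sampling_summary` is a hypothetical helper, and the toy frame only mimics the column names shown above.

```python
import pandas as pd


def sampling_summary(df, patient_col="p_num", time_col="datetime", value_col="bg-0:00"):
    """Per patient: median gap between readings and fraction of missing values."""
    rows = []
    for patient, group in df.groupby(patient_col):
        gaps = group[time_col].sort_values().diff().dropna()
        rows.append(
            {
                patient_col: patient,
                "median_interval": gaps.median() if not gaps.empty else pd.NaT,
                "missing_fraction": group[value_col].isna().mean(),
            }
        )
    return pd.DataFrame(rows).set_index(patient_col)


toy = pd.DataFrame(
    {
        "p_num": ["p01"] * 4 + ["p02"] * 3,
        "datetime": pd.to_datetime(
            [
                "2025-01-01 06:10", "2025-01-01 06:25",
                "2025-01-01 06:40", "2025-01-01 06:55",
                "2025-01-01 06:00", "2025-01-01 06:05", "2025-01-01 06:10",
            ]
        ),
        "bg-0:00": [15.1, 14.4, None, 13.8, 7.2, 7.4, 7.5],
    }
)
summary = sampling_summary(toy)
print(summary)
```

With the real loader, the same call could presumably be made on `bris_loader.train_data`, assuming it keeps the `p_num`, `datetime`, and `bg-0:00` columns listed under "Keep Columns".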


In [14]:
print("Validation Day Splits Method: ", bris_loader.get_validation_day_splits)
print("Load Data: ", bris_loader.load_data)
print("Load Raw: ", bris_loader.load_raw)

Validation Day Splits Method:  <bound method BrisT1DDataLoader.get_validation_day_splits of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ba0de378b90>>
Load Data:  <bound method BrisT1DDataLoader.load_data of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ba0de378b90>>
Load Raw:  <bound method BrisT1DDataLoader.load_raw of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ba0de378b90>>


#### Test Data Analysis

In [16]:
# Display all methods/attributes available on the bris_test_loader
print("Available methods and attributes of bris_test_loader:")
for item in dir(bris_test_loader):
    if not item.startswith("__"):  # Skip dunder/magic methods
        print(f"- {item}")

Available methods and attributes of bris_test_loader:
- _abc_impl
- _get_day_splits
- _load_from_cache
- _load_test_data_from_cache
- _process_and_cache_data
- _process_patient_data
- _process_raw_data
- _process_raw_data_old
- _split_train_validation
- _validate_data
- cached_path
- dataset_name
- dataset_type
- file_path
- get_validation_day_splits
- keep_columns
- load_data
- load_data_old
- load_raw
- num_train_days
- num_validation_days
- processed_data
- raw_data_path
- train_dt_col_type
- use_cached
- val_dt_col_type


In [15]:
print("\nBristol T1D Cleaned Test Dataset Information: \n")
print("Dataset Name: ", bris_test_loader.dataset_name)
print("Dataset Type: ", bris_test_loader.dataset_type)
print("Raw Data Path: ", bris_test_loader.raw_data_path)
print("File Path: ", bris_test_loader.file_path)
print("Cached Path: ", bris_test_loader.cached_path)
print("Use Cached: ", bris_test_loader.use_cached)


print("\n")
print("\nBristol T1D Cleaned Dataset Properties:  \n")
print("Keep Columns: ", bris_test_loader.keep_columns)
print("\nNumber of Validation Days: ", bris_test_loader.num_validation_days)
print("Validation Data Type: ", bris_test_loader.val_dt_col_type)

print("\nNumber of Train Days: ", bris_test_loader.num_train_days)
print("Train Data Type: ", bris_test_loader.train_dt_col_type)


Bristol T1D Cleaned Test Dataset Information: 

Dataset Name:  kaggle_brisT1D
Dataset Type:  test
Raw Data Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/raw/test.csv
File Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/test_cached
Cached Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/test_cached
Use Cached:  True



Bristol T1D Cleaned Dataset Properties:  

Keep Columns:  None

Number of Validation Days:  0
Validation Data Type:  None

Number of Train Days:  0
Train Data Type:  None


In [None]:
from IPython.display import display

print("Test Data (1 patient, 6 hours): ")
display(bris_test_loader.processed_data["p02"]["p02_25995"])

Test Data: 


Unnamed: 0.1,Unnamed: 0,datetime,time,bg-0:00,insulin-0:00,carbs-0:00,hr-0:00,steps-0:00,cals-0:00,activity-0:00,p_num,id,cob,carb_availability,insulin_availability,iob
0,0,2025-01-01 06:00:00,06:00:00,7.2,0.1321,,,,,,p02,p02_0,0.000000,0.000000,0.000000,0.132100
1,1,2025-01-01 06:05:00,06:05:00,7.4,0.1185,,,,,,p02,p02_1,0.000000,0.000000,0.000233,0.250577
2,2,2025-01-01 06:10:00,06:10:00,7.5,0.1166,,,,,,p02,p02_2,0.000000,0.000000,0.001262,0.366788
3,3,2025-01-01 06:15:00,06:15:00,7.6,0.1118,,,,,,p02,p02_3,0.000000,0.000000,0.003197,0.477239
4,4,2025-01-01 06:20:00,06:20:00,7.6,0.1094,,,,,,p02,p02_4,0.000000,0.000000,0.005974,0.583715
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,67,2025-01-01 11:35:00,11:35:00,10.5,0.1000,,73.0,,0.52,,p02,p02_67,16.549977,11.507406,0.363245,5.267237
68,68,2025-01-01 11:40:00,11:40:00,10.2,0.1000,,71.0,,0.17,,p02,p02_68,15.151727,10.708740,0.355503,5.097970
69,69,2025-01-01 11:45:00,11:45:00,10.0,0.1000,,71.1,,,,p02,p02_69,13.852006,9.937309,0.347881,4.940906
70,70,2025-01-01 11:50:00,11:50:00,9.9,0.1000,,72.5,,1.67,,p02,p02_70,12.647151,9.197928,0.341171,4.802700
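
The cell above indexes the test data as `processed_data["p02"]["p02_25995"]`, which suggests a nested mapping of patient to row ID to DataFrame. Assuming that structure, the sketch below flattens such a mapping into one frame. `flatten_nested`, the `window_id` column, and the toy keys are hypothetical.

```python
import pandas as pd


def flatten_nested(nested):
    """Flatten {patient: {window_id: DataFrame}} into a single DataFrame."""
    frames = []
    for patient, windows in nested.items():
        for window_id, frame in windows.items():
            # Tag each row with the keys it came from
            frames.append(frame.assign(p_num=patient, window_id=window_id))
    return pd.concat(frames, ignore_index=True)


# Toy stand-in for the nested test data
nested = {
    "p02": {
        "p02_a": pd.DataFrame({"bg-0:00": [7.2, 7.4]}),
        "p02_b": pd.DataFrame({"bg-0:00": [10.5, 10.2, 10.0]}),
    },
    "p03": {"p03_a": pd.DataFrame({"bg-0:00": [6.1]})},
}
flat = flatten_nested(nested)
print(flat.groupby(["p_num", "window_id"]).size())
```

A flat frame is convenient for summary statistics, while the nested form keeps each window separate for per-window forecasting.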


### Gluroo Example Data

TODO: The Gluroo example is a WIP. 

```python
import pandas as pd

# For Gluroo dataset with custom configuration
gluroo_config = {
    "max_consecutive_nan_values_per_day": 36,
    "coerse_time_interval": pd.Timedelta(minutes=5),
    "day_start_time": pd.Timedelta(hours=4),
    "min_carbs": 5,
    "meal_length": pd.Timedelta(hours=2),
    "n_top_carb_meals": 3,
}

gluroo_loader = get_loader(
    data_source_name="gluroo",
    file_path="path/to/gluroo_data.csv",
    config=gluroo_config,
    use_cached=False,
)
gluroo_data = gluroo_loader.processed_data
```
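
To make the configuration keys more concrete, here is a minimal sketch of one possible reading of `max_consecutive_nan_values_per_day` and `day_start_time`: find the longest run of missing values within each day, where a day starts at 04:00 instead of midnight. This interpretation is an assumption, not the Gluroo cleaner's confirmed behavior, and both helper functions are hypothetical. If the limit counts 5-minute samples, 36 would correspond to 3 hours.

```python
import numpy as np
import pandas as pd


def longest_nan_run(values: pd.Series) -> int:
    """Length of the longest run of consecutive NaN values."""
    is_nan = values.isna()
    if not is_nan.any():
        return 0
    # Every non-NaN value starts a new run id; NaNs share the id of the
    # preceding non-NaN value, so summing is_nan per id gives run lengths.
    run_id = (~is_nan).cumsum()
    return int(is_nan.groupby(run_id).sum().max())


def days_exceeding_nan_limit(series, max_run, day_start=pd.Timedelta(hours=4)):
    """Return days whose longest NaN run is greater than max_run."""
    # Shift timestamps so each "day" begins at day_start rather than midnight
    day = (series.index - day_start).normalize()
    runs = series.groupby(day).apply(longest_nan_run)
    return runs[runs > max_run]


# Two days of 5-minute samples starting at 04:00
index = pd.date_range("2025-01-01 04:00", periods=576, freq=pd.Timedelta(minutes=5))
bg = pd.Series(6.0, index=index)
bg.iloc[100:140] = np.nan  # 40 missing samples on the first day
bg.iloc[300:310] = np.nan  # 10 missing samples on the second day
flagged = days_exceeding_nan_limit(bg, max_run=36)
print(flagged)
```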

## Lynch_2022 Examples



Ensure the Lynch 2022 dataset files are downloaded from https://public.jaeb.org/dataset/579 
and placed in `cache/data/awesome_cgm/lynch_2022/raw/` before running.

In [None]:
import sys
import pathlib

# Workaround: add the project root to sys.path so that `src` is importable.
# Replace the path below with the location of your local checkout.
sys.path.insert(
    0,
    str(
        pathlib.Path(
            r"C:\Users\jonat\Documents\Code\nocturnal-hypo-gly-prob-forecast"
        ).resolve()
    ),
)

from src.data.diabetes_datasets.data_loader import get_loader

# For Lynch 2022 dataset
lynch_loader = get_loader(
    data_source_name="lynch_2022",
    use_cached=True,
)
lynch_data = lynch_loader.processed_data

2025-10-14T20:36:38 - Beginning data loading process with the following parameters:
2025-10-14T20:36:38 - 	Dataset: lynch_2022 - train
2025-10-14T20:36:38 - 	Columns: None
2025-10-14T20:36:38 - 	Generic patient start date: 2024-01-01 00:00:00
2025-10-14T20:36:38 - 	Number of validation days: 20
2025-10-14T20:36:38 - 	In parallel with up to 3 workers.

2025-10-14T20:36:38 - Processed data path for lynch_2022: c:\Users\jonat\Documents\Code\nocturnal-hypo-gly-prob-forecast\cache\data\awesome_cgm\lynch_2022\processed
  df = pd.read_csv(csv_file, index_col=0, parse_dates=True)
  df = pd.read_csv(csv_file, index_col=0, parse_dates=True)
  df = pd.read_csv(csv_file, index_col=0, parse_dates=True)
2025-10-14T20:37:12 - Loaded full processed data from cache for 343 patients
2025-10-14T20:37:12 - Processed data path for lynch_2022: c:\Users\jonat\Documents\Code\nocturnal-hypo-gly-prob-forecast\cache\data\awesome_cgm\lynch_2022\processed
2025-10-14T20:37:17 - Loaded existing train/validation spli

#### Lynch 2022 Data Analysis

In [None]:
# Display all methods/attributes available on the lynch_loader
print("Available methods and attributes of lynch_loader:")
for item in dir(lynch_loader):
    if not item.startswith("__"):
        print(f"- {item}")

Available methods and attributes of lynch_loader:
- _abc_impl
- _load_from_cache
- _load_nested_test_data_from_cache
- _process_and_cache_data
- _process_raw_data
- _process_raw_test_data
- _process_raw_train_data
- _split_train_validation
- _validate_data
- cache_manager
- data_shape_summary
- dataset_config
- dataset_name
- dataset_type
- generic_patient_start_date
- keep_columns
- load_data
- load_raw
- max_workers
- num_patients
- num_train_days
- num_validation_days
- parallel
- patient_ids
- processed_data
- raw_data
- test_data
- to_dataframe
- train_data
- train_dt_col_type
- use_cached
- val_dt_col_type
- validation_data


In [None]:
print("\nLynch 2022 Cleaned Dataset Information: \n")
print("Dataset Name: ", lynch_loader.dataset_name)
print("Dataset Type: ", lynch_loader.dataset_type)
print("Use Cached: ", lynch_loader.use_cached)


print("\n")
print("\nLynch 2022 Cleaned Dataset Properties:  \n")
print("Keep Columns: ", lynch_loader.keep_columns)
print("Data Shape: ", lynch_loader.data_shape_summary)
print("Number of Validation Days: ", lynch_loader.num_validation_days)
print("Validation Data Type: ", lynch_loader.val_dt_col_type)

print("Number of Train Days: ", lynch_loader.num_train_days)
print("Train Data Type: ", lynch_loader.train_dt_col_type)


Lynch 2022 Cleaned Dataset Information: 

Dataset Name:  lynch_2022
Dataset Type:  train
Use Cached:  True



Lynch 2022 Cleaned Dataset Properties:  

Keep Columns:  None
Data Shape:  {'lynch_100': (26212, 19), 'lynch_101': (26204, 19), 'lynch_103': (26488, 19), 'lynch_105': (27716, 19), 'lynch_107': (62392, 19), 'lynch_10': (50493, 19), 'lynch_111': (26194, 19), 'lynch_112': (27653, 19), 'lynch_113': (26177, 19), 'lynch_116': (27612, 19), 'lynch_119': (27670, 19), 'lynch_121': (26200, 19), 'lynch_125': (27595, 19), 'lynch_126': (26210, 19), 'lynch_128': (27062, 19), 'lynch_12': (26166, 19), 'lynch_130': (27086, 19), 'lynch_131': (27961, 19), 'lynch_133': (28203, 19), 'lynch_134': (25923, 19), 'lynch_135': (27028, 19), 'lynch_137': (26230, 19), 'lynch_138': (26177, 19), 'lynch_139': (28212, 19), 'lynch_13': (27037, 19), 'lynch_141': (26164, 19), 'lynch_143': (26207, 19), 'lynch_144': (26187, 19), 'lynch_146': (182, 19), 'lynch_148': (26179, 19), 'lynch_151': (27318, 19), 'lynch_153':
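
The shape summary is hard to scan as a raw dictionary. The sketch below loads a few of the entries printed above into a DataFrame and flags patients with very little data, such as `lynch_146` with 182 rows. The `min_rows` threshold is a hypothetical choice (one day of 5-minute samples), not a project setting.

```python
import pandas as pd

# A few entries copied from the printed data_shape_summary output
shape_summary = {
    "lynch_100": (26212, 19),
    "lynch_107": (62392, 19),
    "lynch_10": (50493, 19),
    "lynch_146": (182, 19),
}

shapes = pd.DataFrame.from_dict(
    shape_summary, orient="index", columns=["n_rows", "n_cols"]
)

min_rows = 288  # hypothetical threshold: 24 h of 5-minute samples
short = shapes[shapes["n_rows"] < min_rows]

print(shapes.sort_values("n_rows"))
print("Patients below threshold:", short.index.tolist())
```

With the real loader, `shape_summary` could presumably be replaced by `lynch_loader.data_shape_summary`, assuming it is a plain dictionary as the output suggests.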

In [None]:
from IPython.display import display

print("Train Data: ")


display(lynch_loader.train_data.head())
print(lynch_loader.train_data.index.get_level_values("datetime").dtype)

print("\nValidation Data:")
display(lynch_loader.validation_data.tail())
print(lynch_loader.validation_data.index.get_level_values("datetime").dtype)

Train Data: 


p_num,datetime,bg_mM,dose_units,food_g,hr_bpm,steps,cals,activity,msg_type,age,sex,insulinModality,type,device,dataset,cob,carb_availability,insulin_availability,iob
lynch_10,2019-11-12 00:00:18,9.611111,0.09,0.0,,,,,bg,8.0,F,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.0,0.09
lynch_10,2019-11-12 00:05:18,10.0,0.175,0.0,,,,,bg,8.0,F,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.0,0.265
lynch_10,2019-11-12 00:10:18,10.444444,0.26,0.0,,,,,bg,8.0,F,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.0,0.525
lynch_10,2019-11-12 00:15:18,10.888889,0.285,0.0,,,,,bg,8.0,F,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.000744,0.81
lynch_10,2019-11-12 00:20:18,11.333333,0.28,0.0,,,,,bg,8.0,F,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.003029,1.089487


datetime64[ns]

Validation Data:


p_num,datetime,bg_mM,dose_units,food_g,hr_bpm,steps,cals,activity,msg_type,age,sex,insulinModality,type,device,dataset,cob,carb_availability,insulin_availability,iob
lynch_99,2021-04-20 06:39:03,8.055556,0.1,0.0,,,,,bg,18.0,M,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.077013,2.077236
lynch_99,2021-04-20 06:44:03,7.944444,0.085,0.0,,,,,bg,18.0,M,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.083539,2.096622
lynch_99,2021-04-20 06:49:03,7.777778,0.08,0.0,,,,,bg,18.0,M,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.089667,2.109179
lynch_99,2021-04-20 06:54:03,7.611111,0.08,0.0,,,,,bg,18.0,M,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.095284,2.119289
lynch_99,2021-04-20 06:59:03,7.166667,0.0,0.0,,,,,bg,18.0,M,1.0,1.0,Dexcom G6,lynch2022,0.0,0.0,0.098951,2.033941


datetime64[ns]
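
Unlike the Bristol frames, the Lynch train and validation data are indexed by a `(p_num, datetime)` MultiIndex, as the `get_level_values("datetime")` calls above show. The sketch below demonstrates common ways to work with that layout in pandas. The toy frame is a hypothetical stand-in with a single `bg_mM` column.

```python
import pandas as pd

# Toy frame with the same index layout as the Lynch data
index = pd.MultiIndex.from_product(
    [
        ["lynch_10", "lynch_99"],
        pd.date_range("2019-11-12 00:00:18", periods=4, freq=pd.Timedelta(minutes=5)),
    ],
    names=["p_num", "datetime"],
)
toy = pd.DataFrame({"bg_mM": [9.6, 10.0, 10.4, 10.9, 8.1, 7.9, 7.8, 7.6]}, index=index)

# Select one patient; the result is indexed by datetime only
one_patient = toy.xs("lynch_10", level="p_num")

# Aggregate per patient using the index level directly
per_patient_mean = toy.groupby(level="p_num")["bg_mM"].mean()

# Inspect the dtype of the datetime level, as in the cell above
datetime_dtype = toy.index.get_level_values("datetime").dtype

print(one_patient)
print(per_patient_mean)
print(datetime_dtype)
```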





## Tamborlane_2008 Examples

In [2]:
import sys
import pathlib

# Add project root to Python path
project_root = pathlib.Path(r"/Users/kirby/BCG-WatAI/nocturnal-hypo-gly-prob-forecast")
if str(project_root) not in sys.path:  # sys.path holds strings, so compare as str
    sys.path.insert(0, str(project_root))

print(f"Added to Python path: {project_root}")


Added to Python path: /Users/kirby/BCG-WatAI/nocturnal-hypo-gly-prob-forecast
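
The cells above hard-code a machine-specific project path. A more portable option, sketched here under the assumption that the project root is the nearest parent directory containing a `src` folder, is to locate the root at runtime. `find_project_root` and `add_to_sys_path` are hypothetical helpers, not part of the project.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from start until a directory containing marker is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_to_sys_path(root: Path) -> None:
    """Insert root at the front of sys.path unless it is already present."""
    root_str = str(root)  # sys.path holds strings, not Path objects
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
```

In a notebook this could be used as `add_to_sys_path(find_project_root(Path.cwd()))` before importing from `src`.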


In [3]:
from src.data.diabetes_datasets.data_loader import get_loader

# For Tamborlane 2008 dataset
tamborlane_loader = get_loader(
    data_source_name="tamborlane_2008",
    use_cached=True,
)
tamborlane_data = tamborlane_loader.processed_data

2025-11-03T13:45:05 - Beginning Tamborlane 2008 data loading process:
2025-11-03T13:45:05 - 	Dataset: tamborlane_2008 - train
2025-11-03T13:45:05 - 	Columns: None
2025-11-03T13:45:05 - 	Extract features: True
2025-11-03T13:45:05 - 	Generic patient start date: 2008-01-01 00:00:00
2025-11-03T13:45:05 - 	Number of validation days: 20
2025-11-03T13:45:05 - 	Using parallel processing with 3 workers
2025-11-03T13:45:05 - Processed data path for tamborlane_2008: /Users/kirby/BCG-WatAI/nocturnal-hypo-gly-prob-forecast/cache/data/tamborlane_2008/processed
2025-11-03T13:45:05 - Processed data path for tamborlane_2008: /Users/kirby/BCG-WatAI/nocturnal-hypo-gly-prob-forecast/cache/data/tamborlane_2008/processed
2025-11-03T13:45:05 - Processing raw data...


