# 0.00 Data Cleaner Examples

- **Author**: Christopher Risi
- **AI Assistance**: Claude Sonnet 3.7

This notebook shows how our data cleaning functions are expected to behave when preparing data for modeling.

## Requirements
### General Requirements
1. Install the package locally with `pip install -e .` in your project root directory
2. Activate your virtual environment (`.noctprob-venv`) before running any code
3. Ensure required Python packages are installed from `requirements.txt`

### Dataset-Specific Requirements

#### Kaggle BrisT1D
1. Set up the Kaggle API on your machine:

    - Create a Kaggle account if you don't have one
    - Generate and download an API key from your Kaggle account settings
    - Place the `kaggle.json` file in the `~/.kaggle/` directory
    - Set proper permissions: `chmod 600 ~/.kaggle/kaggle.json`

2. Download the dataset using the provided script:

    `bash scripts/data_downloads/kaggle_data_download.sh`

3. Ensure the dataset files are in the correct locations:

    - Training data: `src/data/datasets/kaggle_bris_t1d/raw/train.csv`
    - Test data: `src/data/datasets/kaggle_bris_t1d/raw/test.csv`
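
Before creating a loader, it can help to confirm that the raw files listed above are actually present. The following is a minimal sketch, assuming the notebook is run from the project root; `find_missing_files` is a hypothetical helper and not part of the package.

```python
from pathlib import Path

# Paths taken from the list above, relative to the project root.
EXPECTED_FILES = [
    "src/data/datasets/kaggle_bris_t1d/raw/train.csv",
    "src/data/datasets/kaggle_bris_t1d/raw/test.csv",
]


def find_missing_files(project_root, expected=EXPECTED_FILES):
    """Return the expected dataset files that are absent under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Assumption: the current working directory is the project root.
missing = find_missing_files(".")
if missing:
    print("Missing dataset files:")
    for rel in missing:
        print(f"- {rel}")
else:
    print("All BrisT1D raw files are in place.")
```

If anything is reported missing, re-running the download script from step 2 is the likely fix.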

#### Gluroo Example Dataset
1. Ensure the Gluroo JSON data is available at the path you pass as the `file_path` parameter
2. Optional: Configure custom cleaning parameters through the `config` dictionary

## Code Example

### Kaggle Bristol T1D DataLoader Initialization
> [!NOTE]
>
> If the data is not yet cached, processing took ~25 minutes on WATGPU with the implementation as of 2025/07/05.
>
> As of 2025/07/06, patients are processed in parallel and the same run takes ~9 minutes on WATGPU.
> Once the data is cached, loading takes ~6 seconds.
>
> We are working on further efficiency improvements for this processing.
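
The note above mentions processing patients in parallel. The sketch below illustrates the general idea only and is not the loader's actual implementation: `process_patient` is a hypothetical placeholder for the per-patient cleaning step, and a thread pool is used to keep the example runnable inside a notebook. For CPU-bound pandas work, a process pool is usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def process_patient(patient_df):
    """Hypothetical per-patient step: order the rows by timestamp."""
    return patient_df.sort_values("datetime").reset_index(drop=True)


def process_patients_in_parallel(df, max_workers=4):
    """Split df by p_num, process each patient concurrently, return a dict."""
    groups = {p_num: group for p_num, group in df.groupby("p_num")}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so keys and results line up.
        results = pool.map(process_patient, groups.values())
        return dict(zip(groups.keys(), results))


# Toy data standing in for the raw BrisT1D rows.
toy = pd.DataFrame(
    {
        "p_num": ["p02", "p01", "p01", "p02"],
        "datetime": pd.to_datetime(
            [
                "2025-01-01 06:05",
                "2025-01-01 06:25",
                "2025-01-01 06:10",
                "2025-01-01 06:00",
            ]
        ),
    }
)
per_patient = process_patients_in_parallel(toy, max_workers=2)
print({p_num: len(frame) for p_num, frame in per_patient.items()})
```

Because each patient's rows are independent, the work splits cleanly by `p_num`, which is what makes this kind of speed-up possible.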


In [9]:
# Import the factory function for creating data loaders
from src.data.diabetes_datasets.data_loader import get_loader

# For Bristol T1D dataset (train)
bris_loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="train",
    use_cached=True,  # Set to False to reprocess raw data
)
bris_data = bris_loader.processed_data

In [10]:
# For Bristol T1D dataset (test)
bris_test_loader = get_loader(
    data_source_name="kaggle_brisT1D", dataset_type="test", use_cached=True
)
bris_test_data = bris_test_loader.processed_data
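
Both loaders above are created with `use_cached=True`. The pattern behind such a flag can be sketched as follows. This is an assumption about the general approach, not the loader's code: `load_or_process` and `process_fn` are hypothetical names, and the cache is assumed to be a CSV file with a `datetime` column.

```python
from pathlib import Path

import pandas as pd


def load_or_process(raw_path, cache_path, process_fn, use_cached=True):
    """Return cached data when allowed and available, else process and cache."""
    cache_path = Path(cache_path)
    if use_cached and cache_path.is_file():
        # Fast path: read the previously processed file.
        return pd.read_csv(cache_path, parse_dates=["datetime"])

    # Slow path: process the raw file, then write the cache for next time.
    processed = process_fn(pd.read_csv(raw_path))
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    processed.to_csv(cache_path, index=False)
    return processed
```

Passing `use_cached=False` forces the slow path and overwrites the cache, which matches the "Set to False to reprocess raw data" comment in the cell above.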

#### Train/Val Data Analysis

In [11]:
# Display all methods/attributes available on the bris_loader
print("Available methods and attributes of bris_loader:")
for item in dir(bris_loader):
    if not item.startswith("__"):  # Skip dunder/magic methods
        print(f"- {item}")

Available methods and attributes of bris_loader:
- _abc_impl
- _get_day_splits
- _load_from_cache
- _load_test_data_from_cache
- _process_and_cache_data
- _process_patient_data
- _process_raw_data
- _process_raw_data_old
- _split_train_validation
- _validate_data
- cached_path
- dataset_name
- dataset_type
- file_path
- get_validation_day_splits
- keep_columns
- load_data
- load_data_old
- load_raw
- num_train_days
- num_validation_days
- processed_data
- raw_data_path
- train_data
- train_dt_col_type
- use_cached
- val_dt_col_type
- validation_data


In [12]:
print("\nBristol T1D Cleaned Dataset Information: \n")
print("Dataset Name: ", bris_loader.dataset_name)
print("Dataset Type: ", bris_loader.dataset_type)
print("Raw Data Path: ", bris_loader.raw_data_path)
print("File Path: ", bris_loader.file_path)
print("Cached Path: ", bris_loader.cached_path)
print("Use Cached: ", bris_loader.use_cached)


print("\n")
print("\nBristol T1D Cleaned Dataset Properties:  \n")
print("Keep Columns: ", bris_loader.keep_columns)
print("\nValidation Data Shape: ", bris_loader.validation_data.shape)
print("Number of Validation Days: ", bris_loader.num_validation_days)
print("Validation Data Type: ", bris_loader.val_dt_col_type)

print("\nTrain Data Shape: ", bris_loader.train_data.shape)
print("Number of Train Days: ", bris_loader.num_train_days)
print("Train Data Type: ", bris_loader.train_dt_col_type)


Bristol T1D Cleaned Dataset Information: 

Dataset Name:  kaggle_brisT1D
Dataset Type:  train
Raw Data Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/raw/train.csv
File Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/train_cached.csv
Cached Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/train_cached.csv
Use Cached:  True



Bristol T1D Cleaned Dataset Properties:  

Keep Columns:  ['datetime', 'id', 'p_num', 'time', 'bg-0:00', 'insulin-0:00', 'carbs-0:00', 'hr-0:00', 'steps-0:00', 'cals-0:00', 'activity-0:00', 'cob', 'carb_availability', 'insulin_availability', 'iob']

Validation Data Shape:  (40329, 15)
Number of Validation Days:  20
Validation Data Type:  datetime64[ns]

Train Data Shape:  (140160, 15)
Number of Train Days:  72
Train Data Type:  datetime64[ns]
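
The output above reports 20 validation days and 72 train days. One plausible way to produce such a split is to hold out each patient's final days, sketched below. This is an illustration under assumptions, not the logic of `_split_train_validation`: the helper `split_last_n_days` is hypothetical, and it counts calendar days, whereas the loader may define day boundaries differently.

```python
import pandas as pd


def split_last_n_days(df, n_days=20, patient_col="p_num", time_col="datetime"):
    """Hold out each patient's final n_days calendar days as validation."""
    train_parts, val_parts = [], []
    for _, patient_df in df.groupby(patient_col):
        # Midnight of the last recorded day, moved back n_days - 1 days.
        last_day = patient_df[time_col].max().normalize()
        cutoff = last_day - pd.Timedelta(days=n_days - 1)
        is_val = patient_df[time_col] >= cutoff
        train_parts.append(patient_df[~is_val])
        val_parts.append(patient_df[is_val])
    return pd.concat(train_parts), pd.concat(val_parts)


# Toy data: one patient, hourly readings over ten days.
times = pd.date_range("2025-01-01 06:00", periods=240, freq="60min")
toy = pd.DataFrame({"p_num": "p01", "datetime": times, "bg": 7.0})
train, val = split_last_n_days(toy, n_days=3)
print(train.shape, val.shape)
```

Splitting per patient keeps every patient represented in both sets and avoids leaking later readings into the training data.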


In [13]:
from IPython.display import display

print("Train Data: ")
display(bris_loader.train_data.head())
print(bris_loader.train_data["datetime"].dtype)

print("\nValidation Data:")
display(bris_loader.validation_data.head())
print(bris_loader.validation_data["datetime"].dtype)

Train Data: 


Unnamed: 0,datetime,id,p_num,time,bg-0:00,insulin-0:00,carbs-0:00,hr-0:00,steps-0:00,cals-0:00,activity-0:00,cob,carb_availability,insulin_availability,iob
0,2025-01-01 06:10:00,p01_0,p01,06:10:00,15.1,0.0417,,,,,,0.0,0.0,0.0,0.4028
1,2025-01-01 06:25:00,p01_1,p01,06:25:00,14.4,0.0417,,,,,,0.0,0.0,0.003428,0.872082
2,2025-01-01 06:40:00,p01_2,p01,06:40:00,13.9,0.0417,,,,,,0.0,0.0,0.012039,1.385682
3,2025-01-01 06:55:00,p01_3,p01,06:55:00,13.8,0.0417,,,,,,0.0,0.0,0.024747,1.838095
4,2025-01-01 07:10:00,p01_4,p01,07:10:00,13.4,0.0417,,,,,,0.0,0.0,0.040416,2.203691


datetime64[ns]

Validation Data:


Unnamed: 0,datetime,id,p_num,time,bg-0:00,insulin-0:00,carbs-0:00,hr-0:00,steps-0:00,cals-0:00,activity-0:00,cob,carb_availability,insulin_availability,iob
6723,2025-03-12 06:55:00,p01_6723,p01,,,,,,,,,87.40897,48.647153,1.931029,28.187021
6724,2025-03-12 07:10:00,p01_6724,p01,07:10:00,,0.0417,,80.3,92.0,4.8,,81.39424,47.24137,1.899391,27.295364
6725,2025-03-12 07:25:00,p01_6725,p01,07:25:00,,0.0417,,75.5,0.0,4.8,,75.574056,45.48211,1.863221,26.425267
6726,2025-03-12 07:40:00,p01_6726,p01,07:40:00,,0.0417,,76.9,65.0,4.8,,69.987191,43.473307,1.823792,25.57234
6727,2025-03-12 07:55:00,p01_6727,p01,07:55:00,,0.0417,,77.1,0.0,5.0,,59.271225,36.874083,1.513891,22.067263


datetime64[ns]


In [14]:
print("Validation Day Splits: ", bris_loader.get_validation_day_splits)
print("Load Data: ", bris_loader.load_data)
print("Load Raw: ", bris_loader.load_raw)

Validation Day Splits:  <bound method BrisT1DDataLoader.get_validation_day_splits of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ba0de378b90>>
Load Data:  <bound method BrisT1DDataLoader.load_data of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ba0de378b90>>
Load Raw:  <bound method BrisT1DDataLoader.load_raw of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ba0de378b90>>
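
The cell above prints the bound method objects rather than calling them. To see which arguments a method expects before calling it, `inspect.signature` from the standard library can help. The sketch below uses a stand-in class because the real signatures are not shown in this notebook; `DemoLoader` and its parameters are hypothetical.

```python
import inspect


class DemoLoader:
    """Stand-in class; not the project's BrisT1DDataLoader."""

    def get_validation_day_splits(self, patient_id, num_days=20):
        """Hypothetical signature, for illustration only."""
        return []


loader = DemoLoader()

# Without parentheses, this is the bound method object itself.
method = loader.get_validation_day_splits
print(method)

# inspect.signature lists the parameters without calling the method.
print(inspect.signature(method))  # prints: (patient_id, num_days=20)
```

On the real loader, `inspect.signature(bris_loader.get_validation_day_splits)` would show the actual parameters.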


#### Test Data Analysis

In [16]:
# Display all methods/attributes available on the bris_test_loader
print("Available methods and attributes of bris_test_loader:")
for item in dir(bris_test_loader):
    if not item.startswith("__"):  # Skip dunder/magic methods
        print(f"- {item}")

Available methods and attributes of bris_test_loader:
- _abc_impl
- _get_day_splits
- _load_from_cache
- _load_test_data_from_cache
- _process_and_cache_data
- _process_patient_data
- _process_raw_data
- _process_raw_data_old
- _split_train_validation
- _validate_data
- cached_path
- dataset_name
- dataset_type
- file_path
- get_validation_day_splits
- keep_columns
- load_data
- load_data_old
- load_raw
- num_train_days
- num_validation_days
- processed_data
- raw_data_path
- train_dt_col_type
- use_cached
- val_dt_col_type


In [15]:
print("\nBristol T1D Cleaned Test Dataset Information: \n")
print("Dataset Name: ", bris_test_loader.dataset_name)
print("Dataset Type: ", bris_test_loader.dataset_type)
print("Raw Data Path: ", bris_test_loader.raw_data_path)
print("File Path: ", bris_test_loader.file_path)
print("Cached Path: ", bris_test_loader.cached_path)
print("Use Cached: ", bris_test_loader.use_cached)


print("\n")
print("\nBristol T1D Cleaned Dataset Properties:  \n")
print("Keep Columns: ", bris_test_loader.keep_columns)
print("\nNumber of Validation Days: ", bris_test_loader.num_validation_days)
print("Validation Data Type: ", bris_test_loader.val_dt_col_type)

print("\nNumber of Train Days: ", bris_test_loader.num_train_days)
print("Train Data Type: ", bris_test_loader.train_dt_col_type)


Bristol T1D Cleaned Test Dataset Information: 

Dataset Name:  kaggle_brisT1D
Dataset Type:  test
Raw Data Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/raw/test.csv
File Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/test_cached
Cached Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/test_cached
Use Cached:  True



Bristol T1D Cleaned Dataset Properties:  

Keep Columns:  None

Number of Validation Days:  0
Validation Data Type:  None

Number of Train Days:  0
Train Data Type:  None


In [None]:
from IPython.display import display

print("Test Data (1 patient, 6 hours): ")
display(bris_test_loader.processed_data["p02"]["p02_25995"])

Test Data: 


Unnamed: 0.1,Unnamed: 0,datetime,time,bg-0:00,insulin-0:00,carbs-0:00,hr-0:00,steps-0:00,cals-0:00,activity-0:00,p_num,id,cob,carb_availability,insulin_availability,iob
0,0,2025-01-01 06:00:00,06:00:00,7.2,0.1321,,,,,,p02,p02_0,0.000000,0.000000,0.000000,0.132100
1,1,2025-01-01 06:05:00,06:05:00,7.4,0.1185,,,,,,p02,p02_1,0.000000,0.000000,0.000233,0.250577
2,2,2025-01-01 06:10:00,06:10:00,7.5,0.1166,,,,,,p02,p02_2,0.000000,0.000000,0.001262,0.366788
3,3,2025-01-01 06:15:00,06:15:00,7.6,0.1118,,,,,,p02,p02_3,0.000000,0.000000,0.003197,0.477239
4,4,2025-01-01 06:20:00,06:20:00,7.6,0.1094,,,,,,p02,p02_4,0.000000,0.000000,0.005974,0.583715
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,67,2025-01-01 11:35:00,11:35:00,10.5,0.1000,,73.0,,0.52,,p02,p02_67,16.549977,11.507406,0.363245,5.267237
68,68,2025-01-01 11:40:00,11:40:00,10.2,0.1000,,71.0,,0.17,,p02,p02_68,15.151727,10.708740,0.355503,5.097970
69,69,2025-01-01 11:45:00,11:45:00,10.0,0.1000,,71.1,,,,p02,p02_69,13.852006,9.937309,0.347881,4.940906
70,70,2025-01-01 11:50:00,11:50:00,9.9,0.1000,,72.5,,1.67,,p02,p02_70,12.647151,9.197928,0.341171,4.802700
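
The indexing in the cell above, `processed_data["p02"]["p02_25995"]`, suggests that the test data is a nested dictionary mapping patient to row ID to DataFrame. Assuming that layout, a quick overview can be built as sketched below; `summarize_test_data` is a hypothetical helper, and the toy dictionary only mimics the structure.

```python
import pandas as pd


def summarize_test_data(nested):
    """Build one summary row per (patient, row ID) DataFrame."""
    rows = []
    for p_num, by_row_id in nested.items():
        for row_id, df in by_row_id.items():
            rows.append(
                {
                    "p_num": p_num,
                    "row_id": row_id,
                    "n_rows": len(df),
                    "start": df["datetime"].min(),
                    "end": df["datetime"].max(),
                }
            )
    return pd.DataFrame(rows)


# Toy stand-in for bris_test_loader.processed_data.
toy_nested = {
    "p02": {
        "p02_0": pd.DataFrame(
            {"datetime": pd.to_datetime(["2025-01-01 06:00", "2025-01-01 06:05"])}
        )
    },
    "p03": {
        "p03_0": pd.DataFrame({"datetime": pd.to_datetime(["2025-01-01 06:00"])})
    },
}
print(summarize_test_data(toy_nested))
```

With the real loader, the call would be `summarize_test_data(bris_test_loader.processed_data)`, provided the nested layout assumed here holds for every patient.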


### Gluroo Example Data

TODO: The Gluroo example is a WIP. 

```python
import pandas as pd

from src.data.diabetes_datasets.data_loader import get_loader

# For Gluroo dataset with custom configuration
gluroo_config = {
    "max_consecutive_nan_values_per_day": 36,
    "coerse_time_interval": pd.Timedelta(minutes=5),
    "day_start_time": pd.Timedelta(hours=4),
    "min_carbs": 5,
    "meal_length": pd.Timedelta(hours=2),
    "n_top_carb_meals": 3,
}

gluroo_loader = get_loader(
    data_source_name="gluroo",
    file_path="path/to/gluroo_data.csv",
    config=gluroo_config,
    use_cached=False,
)
gluroo_data = gluroo_loader.processed_data
```
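
The Gluroo cleaner is still a work in progress, so the meaning of each config key is not documented here. As one guess at what `coerse_time_interval` might control, the sketch below snaps irregular timestamps onto a regular 5-minute grid. The helper `snap_to_interval` is hypothetical, and the actual cleaner may behave differently.

```python
import pandas as pd


def snap_to_interval(df, interval="5min", time_col="datetime"):
    """Round timestamps to the nearest interval, keeping the first row per slot."""
    out = df.copy()
    out[time_col] = out[time_col].dt.round(interval)
    out = out.drop_duplicates(subset=time_col, keep="first")
    return out.sort_values(time_col).reset_index(drop=True)


# Toy readings with irregular timestamps.
toy = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2025-01-01 00:01",
                "2025-01-01 00:02",
                "2025-01-01 00:04",
                "2025-01-01 00:11",
            ]
        ),
        "bg": [7.0, 7.1, 7.2, 7.4],
    }
)
print(snap_to_interval(toy))
```

Here `"5min"` corresponds to the `pd.Timedelta(minutes=5)` value in the config above. Keeping the first row per slot is an arbitrary choice for the sketch; averaging the readings in each slot is another option.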