# 0.00 Data Cleaner Examples

- **Author**: Christopher Risi
- **AI Assistance**: Claude Sonnet 3.7

This notebook shows how our data cleaning functions are expected to behave when preparing data for modeling.

## Requirements
### General Requirements
1. Install the package locally with `pip install -e .` in your project root directory
2. Activate your virtual environment (`.noctprob-venv`) before running any code
3. Ensure required Python packages are installed from `requirements.txt`

### Dataset-Specific Requirements

#### Kaggle BrisT1D
1. Set up the Kaggle API on your machine:

    - Create a Kaggle account if you don't have one
    - Generate and download an API key from your Kaggle account settings
    - Place the `kaggle.json` file in the `~/.kaggle/` directory
    - Set proper permissions: ```chmod 600 ~/.kaggle/kaggle.json```

2. Download the dataset using the provided script:

    ```bash scripts/data_downloads/kaggle_data_download.sh```

3. Ensure the dataset files are in the correct locations:

    - Training data: ```src/data/datasets/kaggle_bris_t1d/raw/train.csv```
    - Test data: ```src/data/datasets/kaggle_bris_t1d/raw/test.csv```
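
The setup steps above can be checked before running the notebook. The following is a minimal sketch, not part of the project: `check_kaggle_setup` is a hypothetical helper, and it assumes a POSIX system, the credential location listed above, and that the dataset paths are relative to the project root.

```python
import stat
from pathlib import Path


def check_kaggle_setup(project_root, kaggle_dir=None):
    """Return a list of problems found; an empty list means the setup looks fine.

    Hypothetical helper for illustration, not part of the project.
    """
    project_root = Path(project_root)
    kaggle_dir = Path(kaggle_dir) if kaggle_dir is not None else Path.home() / ".kaggle"
    problems = []

    # Step 1: the API key exists and is not accessible to group/others (chmod 600)
    key_file = kaggle_dir / "kaggle.json"
    if not key_file.is_file():
        problems.append(f"missing API key: {key_file}")
    else:
        mode = stat.S_IMODE(key_file.stat().st_mode)
        if mode & 0o077:
            problems.append(f"{key_file} has mode {oct(mode)}, expected 0o600")

    # Step 3: the raw dataset files are where the loader expects them
    raw_dir = project_root / "src" / "data" / "datasets" / "kaggle_bris_t1d" / "raw"
    for name in ("train.csv", "test.csv"):
        if not (raw_dir / name).is_file():
            problems.append(f"missing dataset file: {raw_dir / name}")

    return problems
```

Calling `check_kaggle_setup(Path.cwd())` from the project root and printing each returned entry gives a quick pre-flight report.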

#### Gluroo Example Dataset
1. Ensure the Gluroo JSON data is available at the path you pass via the `file_path` parameter
2. Optional: configure custom cleaning parameters through the `config` dictionary

## Code Example

### Kaggle Bristol T1D
> [!NOTE]
>
> With the current implementation (2025/07/05), processing uncached data takes ~25 minutes on WATGPU.
>
> Once the data is cached, loading takes ~6 seconds.
>
> We are working on making this processing more efficient.

In [2]:
# Import the factory function for creating data loaders
from src.data.datasets.data_loader import get_loader

# For Bristol T1D dataset (train)
bris_loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="train",
    use_cached=True,  # Set to False to reprocess raw data
)
bris_data = bris_loader.processed_data

# For Bristol T1D dataset (test)
bris_test_loader = get_loader(
    data_source_name="kaggle_brisT1D", dataset_type="test", use_cached=True
)
bris_test_data = bris_test_loader.processed_data
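
The `use_cached` flag above suggests a cache-or-reprocess pattern: reuse a processed file when one exists, otherwise process the raw file and save the result. The sketch below illustrates that pattern with the standard library only. It is an assumption about the general idea, not the project's implementation; `load_rows` and its parameters are hypothetical names.

```python
import csv
from pathlib import Path


def load_rows(raw_path, cached_path, process, use_cached=True):
    """Load processed rows, reusing a cached CSV when allowed and available.

    `process` is a function that maps one raw row (a dict) to one processed row.
    """
    raw_path, cached_path = Path(raw_path), Path(cached_path)

    # Fast path: reuse the previously processed file
    if use_cached and cached_path.is_file():
        with cached_path.open(newline="") as f:
            return list(csv.DictReader(f))

    # Slow path: process the raw file, then cache the result for next time
    with raw_path.open(newline="") as f:
        rows = [process(row) for row in csv.DictReader(f)]
    if rows:
        cached_path.parent.mkdir(parents=True, exist_ok=True)
        with cached_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

Values read back from the cache are strings, so a real loader would also need to restore column types after reading.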

In [3]:
# Display all methods/attributes available on the bris_loader
print("Available methods and attributes of bris_loader:")
for item in dir(bris_loader):
    if not item.startswith("__"):  # Skip dunder/magic methods
        print(f"- {item}")

Available methods and attributes of bris_loader:
- _abc_impl
- _get_day_splits
- _process_raw_data
- _validate_data
- cached_path
- dataset_name
- dataset_type
- default_path
- file_path
- get_validation_day_splits
- keep_columns
- load_data
- load_raw
- num_validation_days
- processed_data
- train_data
- use_cached
- validation_data


In [4]:
print("Cached Path: ", bris_loader.cached_path)
print("Dataset Name: ", bris_loader.dataset_name)
print("Dataset Type: ", bris_loader.dataset_type)
print("Raw Data Path: ", bris_loader.default_path)
print("File Path: ", bris_loader.file_path)
print("Keep Columns: ", bris_loader.keep_columns)
print("Number of Validation Days: ", bris_loader.num_validation_days)
print("Use Cached: ", bris_loader.use_cached)

Cached Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/train_cached.csv
Dataset Name:  kaggle_brisT1D
Dataset Type:  train
Raw Data Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/raw/train.csv
File Path:  /u6/cjrisi/nocturnal/src/data/datasets/kaggle_bris_t1d/processed/train_cached.csv
Keep Columns:  None
Number of Validation Days:  20
Use Cached:  True
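
The output above reports 20 validation days. One plausible way to build such a split is to hold out each patient's most recent N calendar days. The sketch below shows that idea in plain Python so it runs without dependencies. It is an illustration only: the project's `_get_day_splits` may work differently, and `split_last_days` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_last_days(rows, num_validation_days):
    """Hold out each patient's last `num_validation_days` calendar days.

    `rows` is a list of dicts with "p_num" (patient id) and "datetime" keys.
    Returns (train_rows, validation_rows), preserving input order.
    """
    if num_validation_days <= 0:
        raise ValueError("num_validation_days must be positive")

    # Collect the distinct calendar dates seen for each patient
    days_by_patient = defaultdict(set)
    for row in rows:
        days_by_patient[row["p_num"]].add(row["datetime"].date())

    # The validation days are the most recent N dates per patient
    validation_days = {
        patient: set(sorted(days)[-num_validation_days:])
        for patient, days in days_by_patient.items()
    }

    train, validation = [], []
    for row in rows:
        is_validation = row["datetime"].date() in validation_days[row["p_num"]]
        (validation if is_validation else train).append(row)
    return train, validation


# Toy data: one reading per day for 5 days, for two patients
start = datetime(2025, 1, 1, 6, 10)
rows = [
    {"p_num": p, "datetime": start + timedelta(days=d)}
    for p in ("p01", "p02")
    for d in range(5)
]
train, validation = split_last_days(rows, num_validation_days=2)
print(len(train), len(validation))  # 6 4
```

Splitting per patient keeps every patient represented in both sets, which matters when recordings start and end on different dates.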


In [10]:
from IPython.display import display

print("Train Data: ")
display(bris_loader.train_data.head())
print("\nValidation Data:")
display(bris_loader.validation_data.head())
print(type(bris_loader.validation_data["datetime"]))

Train Data: 


| Unnamed: 0 | datetime | id | p_num | time | bg-0:00 | insulin-0:00 | carbs-0:00 | hr-0:00 | steps-0:00 | cals-0:00 | activity-0:00 | cob | carb_availability | insulin_availability | iob |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2025-01-01 06:10:00 | p01_0 | p01 | 06:10:00 | 15.1 | 0.0417 |  |  |  |  |  | 0.0 | 0.0 | 0.0 | 0.4028 |
| 1 | 2025-01-01 06:25:00 | p01_1 | p01 | 06:25:00 | 14.4 | 0.0417 |  |  |  |  |  | 0.0 | 0.0 | 0.003428 | 0.872082 |
| 2 | 2025-01-01 06:40:00 | p01_2 | p01 | 06:40:00 | 13.9 | 0.0417 |  |  |  |  |  | 0.0 | 0.0 | 0.012039 | 1.385682 |
| 3 | 2025-01-01 06:55:00 | p01_3 | p01 | 06:55:00 | 13.8 | 0.0417 |  |  |  |  |  | 0.0 | 0.0 | 0.024747 | 1.838095 |
| 4 | 2025-01-01 07:10:00 | p01_4 | p01 | 07:10:00 | 13.4 | 0.0417 |  |  |  |  |  | 0.0 | 0.0 | 0.040416 | 2.203691 |



Validation Data:


| Unnamed: 0 | datetime | id | p_num | time | bg-0:00 | insulin-0:00 | carbs-0:00 | hr-0:00 | steps-0:00 | cals-0:00 | activity-0:00 | cob | carb_availability | insulin_availability | iob |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6723 | 2025-03-12 06:55:00 | p01_6723 | p01 |  |  |  |  |  |  |  |  | 87.40897 | 48.647153 | 1.931029 | 28.187021 |
| 6724 | 2025-03-12 07:10:00 | p01_6724 | p01 | 07:10:00 |  | 0.0417 |  | 80.3 | 92.0 | 4.8 |  | 81.39424 | 47.24137 | 1.899391 | 27.295364 |
| 6725 | 2025-03-12 07:25:00 | p01_6725 | p01 | 07:25:00 |  | 0.0417 |  | 75.5 | 0.0 | 4.8 |  | 75.574056 | 45.48211 | 1.863221 | 26.425267 |
| 6726 | 2025-03-12 07:40:00 | p01_6726 | p01 | 07:40:00 |  | 0.0417 |  | 76.9 | 65.0 | 4.8 |  | 69.987191 | 43.473307 | 1.823792 | 25.57234 |
| 6727 | 2025-03-12 07:55:00 | p01_6727 | p01 | 07:55:00 |  | 0.0417 |  | 77.1 | 0.0 | 5.0 |  | 59.271225 | 36.874083 | 1.513891 | 22.067263 |


<class 'pandas.core.series.Series'>


In [6]:
print("Get Validation Day Splits: ", bris_loader.get_validation_day_splits)
print("Load Data: ", bris_loader.load_data)
print("Load Raw: ", bris_loader.load_raw)

Get Validation Day Splits:  <bound method BrisT1DDataLoader.get_validation_day_splits of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ca0e0a68650>>
Load Data:  <bound method BrisT1DDataLoader.load_data of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ca0e0a68650>>
Load Raw:  <bound method BrisT1DDataLoader.load_raw of <src.data.datasets.kaggle_bris_t1d.bris_t1d.BrisT1DDataLoader object at 0x7ca0e0a68650>>
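
The cell above references the methods without calling them, so it prints bound-method objects. To see how a method is meant to be called without running it, the standard library's `inspect` module can help. The sketch below uses a hypothetical `ExampleLoader` class whose `load_data` signature is invented for illustration; it does not reflect the project's actual signatures.

```python
import inspect


class ExampleLoader:
    """Hypothetical stand-in for a loader; not the project's class."""

    def load_data(self, use_cached=True):
        """Return processed rows, optionally from a cache."""
        return []


loader = ExampleLoader()

# Referencing the method (no parentheses) yields a bound-method object
print(loader.load_data)

# Inspect the call signature and docstring without executing the method
print(inspect.signature(loader.load_data))  # (use_cached=True)
print(inspect.getdoc(loader.load_data))  # Return processed rows, optionally from a cache.

# Calling the method (with parentheses) actually runs it
print(loader.load_data())  # []
```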


### Gluroo Example Data

TODO: The Gluroo example is a WIP. 

```python
import pandas as pd

# For Gluroo dataset with custom configuration
gluroo_config = {
    "max_consecutive_nan_values_per_day": 36,
    "coerse_time_interval": pd.Timedelta(minutes=5),
    "day_start_time": pd.Timedelta(hours=4),
    "min_carbs": 5,
    "meal_length": pd.Timedelta(hours=2),
    "n_top_carb_meals": 3,
}

gluroo_loader = get_loader(
    data_source_name="gluroo",
    file_path="path/to/gluroo_data.csv",
    config=gluroo_config,
    use_cached=False,
)
gluroo_data = gluroo_loader.processed_data
```
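
Because the loader is configured through a plain dictionary, a misspelled key can be easy to miss. Whether `get_loader` validates keys is not shown here, so the following is a hypothetical pre-check. It assumes the six keys in the example above are the full set of supported options, which may not be true; `KNOWN_CONFIG_KEYS` and `find_unknown_keys` are illustrative names.

```python
import difflib

# Assumption: only the keys used in the example above; the real loader may accept more
KNOWN_CONFIG_KEYS = frozenset(
    {
        "max_consecutive_nan_values_per_day",
        "coerse_time_interval",  # spelled as in the example above
        "day_start_time",
        "min_carbs",
        "meal_length",
        "n_top_carb_meals",
    }
)


def find_unknown_keys(config, known_keys=KNOWN_CONFIG_KEYS):
    """Map each unrecognized key to its closest known key (or None)."""
    unknown = {}
    for key in config:
        if key not in known_keys:
            matches = difflib.get_close_matches(key, sorted(known_keys), n=1)
            unknown[key] = matches[0] if matches else None
    return unknown


print(find_unknown_keys({"min_carb": 5, "meal_length": 2}))
# {'min_carb': 'min_carbs'}
```

Running such a check before calling `get_loader` surfaces typos early instead of letting a setting be silently ignored.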