# 0.00 Data Cleaner Examples

- **Author**: Christopher Risi
- **AI Assistance**: Claude Sonnet 3.7

This notebook shows, by example, how our data cleaning functions work when preparing data for modeling.

## Requirements
### General Requirements
1. Install the package locally with `pip install -e .` in your project root directory
2. Activate your virtual environment (`.noctprob-venv`) before running any code
3. Ensure required Python packages are installed from `requirements.txt`
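
The code cells in this notebook import from `src`, so a missing editable install surfaces as `ModuleNotFoundError: No module named 'src'`. As a quick pre-flight check, the sketch below asks Python's import system whether a module can be located. The helper name `is_importable` is hypothetical and not part of this project.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python's import system can locate `module_name`."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names whose parent package is missing.
        return False


if __name__ == "__main__":
    if not is_importable("src"):
        print("`src` is not importable; run `pip install -e .` from the project root.")
```

If the check fails even after installing, the notebook kernel is probably not using the `.noctprob-venv` environment.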

### Dataset-Specific Requirements

#### Kaggle BrisT1D
1. Set up the Kaggle API on your machine:

    - Create a Kaggle account if you don't have one
    - Generate and download an API key from your Kaggle account settings
    - Place the `kaggle.json` file in the `~/.kaggle/` directory
    - Set proper permissions: ```chmod 600 ~/.kaggle/kaggle.json```

2. Download the dataset using the provided script:

    ```bash scripts/data_downloads/kaggle_data_download.sh```

3. Ensure the dataset files are in the correct locations:

    - Training data: ```src/data/datasets/kaggle_bris_t1d/raw/train.csv```
    - Test data: ```src/data/datasets/kaggle_bris_t1d/raw/test.csv```
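
The checklist above can be verified programmatically before loading any data. The following is a minimal sketch, assuming a POSIX file system and the paths listed above; `find_setup_problems` and `RAW_DIR` are hypothetical names introduced for illustration, not part of this project or of the Kaggle API.

```python
import stat
from pathlib import Path

# Location of the raw BrisT1D files, relative to the project root (from the list above).
RAW_DIR = Path("src/data/datasets/kaggle_bris_t1d/raw")


def find_setup_problems(kaggle_dir: Path, project_root: Path) -> list[str]:
    """Return human-readable descriptions of missing or misconfigured files."""
    problems = []

    key_file = kaggle_dir / "kaggle.json"
    if not key_file.is_file():
        problems.append(f"missing API key: {key_file}")
    else:
        mode = stat.S_IMODE(key_file.stat().st_mode)
        # Any group/other permission bit means the key is not private.
        if mode & 0o077:
            problems.append(f"{key_file} has mode {mode:o}; run chmod 600")

    for name in ("train.csv", "test.csv"):
        csv_path = project_root / RAW_DIR / name
        if not csv_path.is_file():
            problems.append(f"missing dataset file: {csv_path}")

    return problems


if __name__ == "__main__":
    # Run from the project root; prints nothing when everything is in place.
    for problem in find_setup_problems(Path.home() / ".kaggle", Path.cwd()):
        print(problem)
```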

#### Gluroo Example Dataset
1. Ensure the Gluroo JSON data is available at the path you pass in the `file_path` parameter
2. Optional: Configure custom cleaning parameters through the `config` dictionary

## Code Example

In [2]:
# Import the factory function for creating data loaders
from src.data.datasets.data_loader import get_loader

# For Bristol T1D dataset (train)
bris_loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="train",
    use_cached=True  # Set to False to reprocess raw data
)
bris_data = bris_loader.processed_data

# For Bristol T1D dataset (test)
bris_test_loader = get_loader(
    data_source_name="kaggle_brisT1D",
    dataset_type="test",
    use_cached=True
)
bris_test_data = bris_test_loader.processed_data



ModuleNotFoundError: No module named 'src'

In [None]:
bris_data

In [None]:
import pandas as pd

# For Gluroo dataset with custom configuration
gluroo_config = {
    "max_consecutive_nan_values_per_day": 36,
    "coerse_time_interval": pd.Timedelta(minutes=5),
    "day_start_time": pd.Timedelta(hours=4),
    "min_carbs": 5,
    "meal_length": pd.Timedelta(hours=2),
    "n_top_carb_meals": 3
}

gluroo_loader = get_loader(
    data_source_name="gluroo",
    file_path="path/to/gluroo_data.csv", 
    config=gluroo_config,
    use_cached=False
)
gluroo_data = gluroo_loader.processed_data
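
To get a feel for the scale of these settings, the sketch below converts them into sample counts and durations. It assumes (this is not confirmed by the notebook) that `max_consecutive_nan_values_per_day` counts samples on the 5-minute grid set by `coerse_time_interval`. It uses the standard library's `datetime.timedelta`, which `pd.Timedelta` subclasses, so the same arithmetic applies to the values in `gluroo_config`.

```python
from datetime import timedelta

# Values copied from gluroo_config above.
interval = timedelta(minutes=5)
max_consecutive_nan = 36
meal_length = timedelta(hours=2)

# Number of samples in a full day on a 5-minute grid.
samples_per_day = timedelta(days=1) // interval
# Longest tolerated gap, assuming the limit counts 5-minute samples.
longest_gap = max_consecutive_nan * interval
# Number of samples spanned by one meal window.
samples_per_meal = meal_length // interval

print(samples_per_day)   # 288
print(longest_gap)       # 3:00:00
print(samples_per_meal)  # 24
```

Under that assumption, a day would tolerate a gap of at most three hours out of its 288 samples.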

In [2]:
# `load_data` comes from the project's data-loading code; its import is not shown in this notebook
test_df = load_data(dataset_type="test")

In [3]:
test_df.head()

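| | id | p_num | time | bg-5:55 | bg-5:50 | bg-5:45 | bg-5:40 | bg-5:35 | bg-5:30 | bg-5:25 | ... | activity-0:45 | activity-0:40 | activity-0:35 | activity-0:30 | activity-0:25 | activity-0:20 | activity-0:15 | activity-0:10 | activity-0:05 | activity-0:00 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p01_8459 | p01 | 06:45:00 | NaN | 9.2 | NaN | NaN | 10.2 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | p01_8460 | p01 | 11:25:00 | NaN | NaN | 9.9 | NaN | NaN | 9.4 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Walk | Walk | Walk |
| 2 | p01_8461 | p01 | 14:45:00 | NaN | 5.5 | NaN | NaN | 5.5 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | p01_8462 | p01 | 04:30:00 | NaN | 3.4 | NaN | NaN | 3.9 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | p01_8463 | p01 | 04:20:00 | NaN | NaN | 8.3 | NaN | NaN | 10.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |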


In [4]:
# Returns a dict keyed by patient (e.g. "p01"); each value maps a row id to a cleaned DataFrame
patient_dfs = clean_brist1d_test_data(test_df)
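
Each wide row of `test_df` stores its history in column names such as `bg-5:55`, where the suffix is the lag before the row's `time`. The sketch below illustrates one way such a row could be unrolled into one record per timestamp. It is not the implementation of `clean_brist1d_test_data`; the helpers `parse_clock`, `format_clock`, and `wide_row_to_long` are hypothetical, and wrapping times around midnight is an assumption. It uses only the standard library, so a row would first be converted with something like `test_df.iloc[0].to_dict()`.

```python
from datetime import timedelta


def parse_clock(text: str) -> timedelta:
    """Parse 'H:MM' or 'HH:MM:SS' into a timedelta."""
    hours, minutes, *rest = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=rest[0] if rest else 0)


def format_clock(delta: timedelta) -> str:
    """Format a timedelta as a 24-hour clock time, wrapping around midnight."""
    total = int(delta.total_seconds()) % 86400
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def wide_row_to_long(row: dict) -> list[dict]:
    """Turn one wide row into a list of per-timestamp records, oldest first."""
    end_time = parse_clock(row["time"])
    by_lag = {}
    for column, value in row.items():
        # Lagged columns look like "<variable>-<H:MM>"; skip id, p_num, time.
        variable, sep, lag = column.rpartition("-")
        if not sep or ":" not in lag:
            continue
        by_lag.setdefault(parse_clock(lag), {})[f"{variable}-value"] = value
    return [
        {"time": format_clock(end_time - lag), **by_lag[lag]}
        for lag in sorted(by_lag, reverse=True)  # largest lag = oldest reading
    ]


# First test row from the table above, truncated to three columns.
row = {"id": "p01_8459", "p_num": "p01", "time": "06:45:00",
       "bg-5:55": None, "bg-5:50": 9.2, "bg-5:45": None}
for record in wide_row_to_long(row):
    print(record)
```

For this row, the `bg-5:50` reading of 9.2 lands at 00:55:00 (06:45 minus 5:50), which agrees with the cleaned output printed for `p01_8459`.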

In [5]:
# Print the row id and the first 15 rows of the first cleaned DataFrame for patient p01
for k, v in patient_dfs["p01"].items():
    print(k)
    print(v.head(15))
    break

p01_8459
        time  bg-value  insulin-value  carbs-value  hr-value  steps-value  \
0   00:50:00       NaN         0.0083          NaN      59.7          0.0   
1   00:55:00       9.2         0.0083          NaN      55.6          NaN   
2   01:00:00       NaN         0.0083          NaN      58.2          0.0   
3   01:05:00       NaN         0.0083          NaN      59.3          0.0   
4   01:10:00      10.2         0.0083          NaN      58.0          0.0   
5   01:15:00       NaN         0.0083          NaN      62.7          0.0   
6   01:20:00       NaN         0.0083          NaN      59.7          NaN   
7   01:25:00      10.3         0.0083          NaN      55.7          0.0   
8   01:30:00       NaN         0.0083          NaN      55.8          NaN   
9   01:35:00       NaN         0.0083          NaN      57.2          NaN   
10  01:40:00      10.2         0.0083          NaN      61.1          NaN   
11  01:45:00       NaN         0.0083          NaN      57.9       