# Brown 2019 (DCLP3) Data Pipeline

This notebook implements the complete data processing pipeline for the Brown 2019 dataset.

## Dataset Overview

- **Study**: DCLP3 - Closed-Loop Control vs Sensor-Augmented Pump therapy
- **Patients**: 168 total (125 with insulin pump data)
- **Data Sources**:
  - `cgm.txt` - CGM readings (~9M rows)
  - `Pump_BasalRateChange.txt` - Basal rate changes (~2.6M rows)
  - `Pump_BolusDelivered.txt` - Bolus deliveries (~221K rows)

## Pipeline Steps

1. Load raw data
2. Parse timestamps
3. Floor timestamps to 5-min grid
4. Rename columns to standard names
5. Aggregate collisions
6. Create regular 5-min grid for CGM
7. Merge insulin data onto CGM backbone
8. Fill missing values
9. Final output

## Setup & Imports

In [14]:
import pandas as pd

from src.utils.os_helper import get_project_root
from src.data.preprocessing.data_splitting import split_multipatient_dataframe
from src.data.preprocessing.sampling import (
    ensure_regular_time_intervals_with_aggregation,
)

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)

# Paths
root = get_project_root()
CACHE_DIR = root / "cache" / "data" / "awesome_cgm" / "brown_2019"
DATA_TABLES = (
    CACHE_DIR / "raw" / "DCLP3 Public Dataset - Release 3 - 2022-08-04" / "Data Files"
)

print(f"Data directory: {DATA_TABLES}")
print(f"Directory exists: {DATA_TABLES.exists()}")

Data directory: /Volumes/LaCieSSD/WATai/BGC/nocturnal-hypo-gly-prob-forecast/cache/data/awesome_cgm/brown_2019/raw/DCLP3 Public Dataset - Release 3 - 2022-08-04/Data Files
Directory exists: True


---
## Step 1: Load Raw Data

In [15]:
# Load raw data files
cgm_df = pd.read_csv(DATA_TABLES / "cgm.txt", sep="|")
basal_df = pd.read_csv(
    DATA_TABLES / "Pump_BasalRateChange.txt", sep="|", low_memory=False
)
bolus_df = pd.read_csv(DATA_TABLES / "Pump_BolusDelivered.txt", sep="|")

# Fix CGM unnamed column (data format quirk)
cgm_df = cgm_df.rename(columns={"Unnamed: 3": "CGM"})

print("=== Raw Data Loaded ===")
print(f"CGM:   {len(cgm_df):,} rows, {cgm_df['PtID'].nunique()} patients")
print(f"Basal: {len(basal_df):,} rows, {basal_df['PtID'].nunique()} patients")
print(f"Bolus: {len(bolus_df):,} rows, {bolus_df['PtID'].nunique()} patients")

=== Raw Data Loaded ===
CGM:   9,032,235 rows, 168 patients
Basal: 2,610,772 rows, 125 patients
Bolus: 221,292 rows, 125 patients
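
The `Unnamed: 3` rename works because pandas labels a blank header field `Unnamed: <position>`. A minimal sketch of that behavior on an in-memory sample follows. The header line used here is an assumption, since the notebook does not show the first row of `cgm.txt`.

```python
import io

import pandas as pd

# Hypothetical two-line sample: the fourth header field is left blank
raw = "PtID|Period|DataDtTm|\n1|1. Baseline|11DEC17:23:59:25|172\n"

sample = pd.read_csv(io.StringIO(raw), sep="|")
columns_before = list(sample.columns)

# Give the blank column a real name, as the notebook does
sample = sample.rename(columns={"Unnamed: 3": "CGM"})

print(columns_before)
print(list(sample.columns))
```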


In [16]:
# Preview each dataframe
print("=== CGM ===")
print(cgm_df.head(3))
print(f"\nColumns: {list(cgm_df.columns)}")

print("\n=== Basal ===")
print(basal_df.head(3))
print(f"\nColumns: {list(basal_df.columns)}")

print("\n=== Bolus ===")
print(bolus_df.head(3))
print(f"\nColumns: {list(bolus_df.columns)}")

=== CGM ===
   PtID       Period          DataDtTm  CGM
0     1  1. Baseline  11DEC17:23:59:25  172
1     1  1. Baseline  12DEC17:00:04:24  170
2     1  1. Baseline  12DEC17:00:09:24  167

Columns: ['PtID', 'Period', 'DataDtTm', 'CGM']

=== Basal ===
   PtID  RecID             DataDtTm  CommandedBasalRate DataDtTm_adjusted
0    10      3  2018-04-04 12:52:41                2.00               NaN
1    10      4  2018-04-04 12:57:41                0.00               NaN
2    12      7  2018-05-17 12:43:45                0.55               NaN

Columns: ['PtID', 'RecID', 'DataDtTm', 'CommandedBasalRate', 'DataDtTm_adjusted']

=== Bolus ===
   PtID   RecID             DataDtTm  BolusAmount DataDtTm_adjusted BolusType
0    79   10100  2018-04-29 04:45:44     0.645283               NaN  Standard
1    31   23863  2018-08-27 22:25:50     1.380000               NaN  Standard
2    88  111267  2018-06-12 08:16:17     2.000000               NaN  Standard

Columns: ['PtID', 'RecID', 'DataDtTm', 'BolusAmount', 'DataDtTm_adjusted', 'BolusType']

---
## Step 2: Parse Timestamps

Note: CGM has a different datetime format than basal/bolus.

In [18]:
# CGM has format: '11DEC17:23:59:25' (DDmmmYY:HH:MM:SS)
cgm_df["datetime"] = pd.to_datetime(cgm_df["DataDtTm"], format="%d%b%y:%H:%M:%S")

# Basal/Bolus: use DataDtTm_adjusted if available (corrected timestamps), else DataDtTm
# Some patients (114, 165) have incorrect dates in DataDtTm (e.g., 2010 instead of 2018)
basal_df["datetime"] = pd.to_datetime(
    basal_df["DataDtTm_adjusted"].fillna(basal_df["DataDtTm"])
)
bolus_df["datetime"] = pd.to_datetime(
    bolus_df["DataDtTm_adjusted"].fillna(bolus_df["DataDtTm"])
)

print("=== Timestamps Parsed ===")
print(f"CGM datetime range:   {cgm_df['datetime'].min()} to {cgm_df['datetime'].max()}")
print(
    f"Basal datetime range: {basal_df['datetime'].min()} to {basal_df['datetime'].max()}"
)
print(
    f"Bolus datetime range: {bolus_df['datetime'].min()} to {bolus_df['datetime'].max()}"
)

=== Timestamps Parsed ===
CGM datetime range:   2016-08-04 00:01:34 to 2019-03-25 07:06:50
Basal datetime range: 2017-07-31 12:22:36 to 2019-03-25 07:01:51
Bolus datetime range: 2017-07-31 13:42:34 to 2019-03-25 00:47:57
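
The two parsing rules in this step can be tried on a few rows in isolation. This is a small sketch with made-up pump rows (the 2010 date imitates the kind of error mentioned for patients 114 and 165); only the column names and the CGM format string come from the notebook.

```python
import pandas as pd

# CGM-style strings: day, abbreviated month, two-digit year, then time
cgm_raw = pd.Series(["11DEC17:23:59:25", "12DEC17:00:04:24"])
cgm_dt = pd.to_datetime(cgm_raw, format="%d%b%y:%H:%M:%S")

# Pump-style rows (illustrative values): prefer the adjusted timestamp,
# fall back to the original one when no adjustment exists
pump = pd.DataFrame(
    {
        "DataDtTm": ["2010-04-04 12:52:41", "2018-04-04 12:57:41"],
        "DataDtTm_adjusted": ["2018-04-04 12:52:41", None],
    }
)
pump_dt = pd.to_datetime(pump["DataDtTm_adjusted"].fillna(pump["DataDtTm"]))

print(cgm_dt.tolist())
print(pump_dt.tolist())
```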


---
## Step 3: Floor Timestamps to 5-min Grid

**Why floor (not round)?**
- Preserves causality for time-series prediction
- An event at 18:38 should be in the 18:35 bin (it happened before 18:40)
- Rounding would place it in the 18:40 bin, causing data leakage

In [19]:
# Floor all timestamps to 5-minute grid
cgm_df["datetime"] = cgm_df["datetime"].dt.floor("5min")
basal_df["datetime"] = basal_df["datetime"].dt.floor("5min")
bolus_df["datetime"] = bolus_df["datetime"].dt.floor("5min")

print("=== Timestamps Floored to 5-min Grid ===")
print(f"Sample CGM timestamps: {cgm_df['datetime'].head(3).tolist()}")
print(f"Sample Basal timestamps: {basal_df['datetime'].head(3).tolist()}")
print(f"Sample Bolus timestamps: {bolus_df['datetime'].head(3).tolist()}")

=== Timestamps Floored to 5-min Grid ===
Sample CGM timestamps: [Timestamp('2017-12-11 23:55:00'), Timestamp('2017-12-12 00:00:00'), Timestamp('2017-12-12 00:05:00')]
Sample Basal timestamps: [Timestamp('2018-04-04 12:50:00'), Timestamp('2018-04-04 12:55:00'), Timestamp('2018-05-17 12:40:00')]
Sample Bolus timestamps: [Timestamp('2018-04-29 04:45:00'), Timestamp('2018-08-27 22:25:00'), Timestamp('2018-06-12 08:15:00')]
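
To see how flooring and rounding differ, compare them on two timestamps that sit on either side of a bin's midpoint. The timestamps are illustrative only.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2018-01-01 18:36:00", "2018-01-01 18:38:00"]))

# floor: always map to the start of the bin the event occurred in
floored = ts.dt.floor("5min")

# round: map to the nearest grid point, which can be later than the event
rounded = ts.dt.round("5min")

print(pd.DataFrame({"raw": ts, "floor": floored, "round": rounded}))
```

Both events floor to 18:35, while rounding sends the 18:38 event forward to 18:40.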


---
## Step 4: Rename Columns to Standard Names

In [20]:
# Rename CGM columns
cgm_df = cgm_df.rename(columns={"PtID": "p_num", "CGM": "bg_mgdL", "Period": "period"})

# Add mmol/L version (standard unit)
cgm_df["bg_mM"] = (cgm_df["bg_mgdL"] / 18.0).round(2)

# Rename insulin columns
basal_df = basal_df.rename(columns={"PtID": "p_num"})
bolus_df = bolus_df.rename(columns={"PtID": "p_num"})

# Drop original datetime string columns
cgm_df = cgm_df.drop(columns=["DataDtTm"])
basal_df = basal_df.drop(columns=["DataDtTm", "DataDtTm_adjusted", "RecID"])
bolus_df = bolus_df.drop(columns=["DataDtTm", "DataDtTm_adjusted", "RecID"])

print("=== Columns Renamed ===")
print(f"CGM columns: {list(cgm_df.columns)}")
print(f"Basal columns: {list(basal_df.columns)}")
print(f"Bolus columns: {list(bolus_df.columns)}")

=== Columns Renamed ===
CGM columns: ['p_num', 'period', 'bg_mgdL', 'datetime', 'bg_mM']
Basal columns: ['p_num', 'CommandedBasalRate', 'datetime']
Bolus columns: ['p_num', 'BolusAmount', 'BolusType', 'datetime']
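
The unit conversion can be checked by hand on the first three CGM readings. This sketch uses the same factor of 18.0 as the notebook; that factor is the common clinical approximation, and the molar mass of glucose (about 180.16 g/mol) gives a slightly larger value of roughly 18.016.

```python
import pandas as pd

bg_mgdL = pd.Series([172, 170, 167])

# mg/dL -> mmol/L with the approximate factor used in this pipeline
bg_mM = (bg_mgdL / 18.0).round(2)

print(bg_mM.tolist())
```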


---
## Step 5: Aggregate Collisions

When multiple readings fall in the same 5-min bin:
- **CGM**: Take mean (duplicate readings are similar)
- **Bolus**: SUM (don't lose any insulin!)
- **Basal**: Take last (most recent rate is what's active)

In [22]:
# Check for collisions before aggregation
cgm_collisions = cgm_df.groupby(["p_num", "datetime"]).size()
basal_collisions = basal_df.groupby(["p_num", "datetime"]).size()
bolus_collisions = bolus_df.groupby(["p_num", "datetime"]).size()

print("=== Collisions Detected ===")
print(f"CGM bins with >1 reading:   {(cgm_collisions > 1).sum():,}")
print(f"Basal bins with >1 change:  {(basal_collisions > 1).sum():,}")
print(f"Bolus bins with >1 bolus:   {(bolus_collisions > 1).sum():,}")

=== Collisions Detected ===
CGM bins with >1 reading:   6,215
Basal bins with >1 change:  7,944
Bolus bins with >1 bolus:   2,188


In [23]:
# Aggregate CGM: mean for numeric, first for categorical
cgm_agg = (
    cgm_df.groupby(["p_num", "datetime"])
    .agg({"bg_mgdL": "mean", "bg_mM": "mean", "period": "first"})
    .reset_index()
)

# Aggregate Bolus: SUM (critical - don't lose insulin!)
bolus_agg = (
    bolus_df.groupby(["p_num", "datetime"])
    .agg(
        {
            "BolusAmount": "sum",
            "BolusType": "first",  # Keep for reference
        }
    )
    .reset_index()
)

# Aggregate Basal: LAST (most recent rate in bin)
basal_agg = (
    basal_df.groupby(["p_num", "datetime"])
    .agg({"CommandedBasalRate": "last"})
    .reset_index()
)

print("=== After Aggregation ===")
print(f"CGM:   {len(cgm_agg):,} rows (was {len(cgm_df):,})")
print(f"Basal: {len(basal_agg):,} rows (was {len(basal_df):,})")
print(f"Bolus: {len(bolus_agg):,} rows (was {len(bolus_df):,})")

=== After Aggregation ===
CGM:   9,026,020 rows (was 9,032,235)
Basal: 2,602,733 rows (was 2,610,772)
Bolus: 219,046 rows (was 221,292)
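
The bolus and basal rules can be illustrated on a toy example. One caveat worth sketching: `"last"` returns the last row in the frame's current order, so it only means "most recent" if rows are sorted by their original timestamps. The `raw_time` column below is hypothetical; the notebook overwrites `datetime` in place during flooring, so it would need to keep a copy of the unfloored timestamp to sort this way.

```python
import pandas as pd

bolus = pd.DataFrame(
    {
        "p_num": [1, 1, 1],
        "datetime": pd.to_datetime(
            ["2018-01-01 08:15", "2018-01-01 08:15", "2018-01-01 08:20"]
        ),
        "BolusAmount": [1.0, 0.5, 2.0],
    }
)

# Two basal changes in the same 5-min bin, stored out of chronological order
basal = pd.DataFrame(
    {
        "p_num": [1, 1],
        "raw_time": pd.to_datetime(["2018-01-01 08:18:30", "2018-01-01 08:16:10"]),
        "CommandedBasalRate": [0.0, 0.6],
    }
)
basal["datetime"] = basal["raw_time"].dt.floor("5min")

# Bolus: sum so that no delivered insulin is dropped
bolus_agg = bolus.groupby(["p_num", "datetime"], as_index=False)["BolusAmount"].sum()

# Basal: sort by the unfloored time first so "last" is the most recent rate
basal_agg = (
    basal.sort_values(["p_num", "raw_time"])
    .groupby(["p_num", "datetime"], as_index=False)["CommandedBasalRate"]
    .last()
)

print(bolus_agg)
print(basal_agg)
```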


---
## Step 6: Create Regular 5-min Grid for CGM

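Use the existing `ensure_regular_time_intervals_with_aggregation` helper to put each patient on a regular 5-min grid; timestamps missing from the CGM record are added as NaN rows.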

In [24]:
# Set datetime as index for processing
cgm_indexed = cgm_agg.set_index("datetime")

# Split by patient
patient_dict = split_multipatient_dataframe(cgm_indexed, patient_col="p_num")

print(f"Processing {len(patient_dict)} patients...")

# Create regular grid per patient
processed_patients = {}
for i, (pid, pdf) in enumerate(patient_dict.items()):
    processed_df, freq = ensure_regular_time_intervals_with_aggregation(pdf)
    processed_patients[pid] = processed_df

    if (i + 1) % 50 == 0:
        print(f"  Processed {i + 1}/{len(patient_dict)} patients...")

print(f"Done! Processed all {len(patient_dict)} patients.")

Processing 168 patients...
  Processed 50/168 patients...
  Processed 100/168 patients...
  Processed 150/168 patients...
Done! Processed all 168 patients.

2025-11-29T23:47:48 - ensure_regular_time_intervals_with_aggregation(): Ensuring regular time intervals with aggregation...
2025-11-29T23:47:48 - 	Most common time interval: 5 minutes
2025-11-29T23:47:48 - 	Aggregation strategy: {'p_num': 'first', 'bg_mgdL': 'sum', 'bg_mM': 'mean', 'period': 'first'}
2025-11-29T23:47:48 - Post-ensure_regular_time_intervals_with_aggregation(): 
			Patient 1 
			 - old index length: 57016, 
			 - new index length: 58465

[Log output truncated; the same messages repeat for each patient. Index lengths from the surviving entries:]

Patient  old index length  new index length
      1             57016             58465
      2             53500             58269
      4             48772             56301
      5             57114             58465
     57             56926             58464
     58             54043             55301
    107             53059             55584
    108             55015             56244
    158             52490             54432
    159             54564             55872


In [25]:
# Recombine into single dataframe
cgm_regular = pd.concat(processed_patients.values()).reset_index()

print("=== Regular Grid Created ===")
print(f"Total rows: {len(cgm_regular):,}")
print(f"Rows added (gaps): {len(cgm_regular) - len(cgm_agg):,}")
print("\nSample data:")
print(cgm_regular.head())

=== Regular Grid Created ===
Total rows: 9,742,311
Rows added (gaps): 716,291

Sample data:
             datetime  p_num  bg_mgdL  bg_mM       period
0 2017-12-11 23:55:00    1.0    172.0   9.56  1. Baseline
1 2017-12-12 00:00:00    1.0    170.0   9.44  1. Baseline
2 2017-12-12 00:05:00    1.0    167.0   9.28  1. Baseline
3 2017-12-12 00:10:00    1.0    163.0   9.06  1. Baseline
4 2017-12-12 00:15:00    1.0    160.0   8.89  1. Baseline
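
The project helper is not shown in this notebook, so the following is only a stand-in that illustrates the idea with plain pandas: build the full 5-min index between a patient's first and last reading, then reindex so that missing timestamps appear as NaN rows. The values are illustrative.

```python
import pandas as pd

# One patient's CGM readings with a 15-minute gap after 00:00
cgm = pd.DataFrame(
    {"bg_mgdL": [172.0, 170.0, 160.0]},
    index=pd.to_datetime(
        ["2017-12-11 23:55", "2017-12-12 00:00", "2017-12-12 00:15"]
    ),
)

# Full 5-min grid spanning the patient's record
full_index = pd.date_range(cgm.index.min(), cgm.index.max(), freq="5min")

# Reindexing inserts NaN rows at 00:05 and 00:10
regular = cgm.reindex(full_index)

print(regular)
```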


---
## Step 7: Merge Insulin Data onto CGM Backbone

Left join: keep all CGM rows, add insulin where available.
- 125 patients have insulin data
- 43 patients will have NaN for insulin columns

In [26]:
# Merge bolus onto CGM
merged = cgm_regular.merge(
    bolus_agg[["p_num", "datetime", "BolusAmount"]],
    on=["p_num", "datetime"],
    how="left",
)

# Merge basal onto result
merged = merged.merge(
    basal_agg[["p_num", "datetime", "CommandedBasalRate"]],
    on=["p_num", "datetime"],
    how="left",
)

print("=== After Merge ===")
print(f"Total rows: {len(merged):,}")
print(f"Rows with bolus data: {merged['BolusAmount'].notna().sum():,}")
print(f"Rows with basal data: {merged['CommandedBasalRate'].notna().sum():,}")
print("\nSample:")
print(merged.head(10))

=== After Merge ===
Total rows: 9,742,311
Rows with bolus data: 218,186
Rows with basal data: 2,599,420

Sample:
             datetime  p_num  bg_mgdL  bg_mM       period  BolusAmount  \
0 2017-12-11 23:55:00    1.0    172.0   9.56  1. Baseline          NaN   
1 2017-12-12 00:00:00    1.0    170.0   9.44  1. Baseline          NaN   
2 2017-12-12 00:05:00    1.0    167.0   9.28  1. Baseline          NaN   
3 2017-12-12 00:10:00    1.0    163.0   9.06  1. Baseline          NaN   
4 2017-12-12 00:15:00    1.0    160.0   8.89  1. Baseline          NaN   
5 2017-12-12 00:20:00    1.0    158.0   8.78  1. Baseline          NaN   
6 2017-12-12 00:25:00    1.0    157.0   8.72  1. Baseline          NaN   
7 2017-12-12 00:30:00    1.0    155.0   8.61  1. Baseline          NaN   
8 2017-12-12 00:35:00    1.0    154.0   8.56  1. Baseline          NaN   
9 2017-12-12 00:40:00    1.0    153.0   8.50  1. Baseline          NaN   

   CommandedBasalRate  
0                 NaN  
1                 NaN  


In [30]:
print(merged[merged["BolusAmount"].notna()].head(10))
merged[merged["CommandedBasalRate"].notna()].head(10)

                  datetime  p_num  bg_mgdL  bg_mM                 period  \
119285 2018-01-05 20:35:00    3.0    113.0   6.28            1. Baseline   
119316 2018-01-05 23:10:00    3.0    155.0   8.61            1. Baseline   
119439 2018-01-06 09:25:00    3.0    118.0   6.56            1. Baseline   
119464 2018-01-06 11:30:00    3.0    116.0   6.44            1. Baseline   
119748 2018-01-07 11:10:00    3.0    141.0   7.83            1. Baseline   
120980 2018-01-11 17:50:00    3.0    188.0  10.44  2. Post Randomization   
120981 2018-01-11 17:55:00    3.0    189.0  10.50  2. Post Randomization   
120996 2018-01-11 19:10:00    3.0    161.0   8.94  2. Post Randomization   
121017 2018-01-11 20:55:00    3.0    162.0   9.00  2. Post Randomization   
121143 2018-01-12 07:25:00    3.0     99.0   5.50  2. Post Randomization   

        BolusAmount  CommandedBasalRate  
119285     0.810000                 NaN  
119316     0.200000                 NaN  
119439     0.400000                 N

Unnamed: 0,datetime,p_num,bg_mgdL,bg_mM,period,BolusAmount,CommandedBasalRate
119191,2018-01-05 12:45:00,3.0,126.0,7.0,1. Baseline,,0.6
119326,2018-01-06 00:00:00,3.0,196.0,10.89,1. Baseline,,0.5
119410,2018-01-06 07:00:00,3.0,156.0,8.67,1. Baseline,,0.6
119614,2018-01-07 00:00:00,3.0,130.0,7.22,1. Baseline,,0.5
119698,2018-01-07 07:00:00,3.0,74.0,4.11,1. Baseline,,0.6
119902,2018-01-08 00:00:00,3.0,198.0,11.0,1. Baseline,,0.5
119986,2018-01-08 07:00:00,3.0,108.0,6.0,1. Baseline,,0.6
120010,2018-01-08 09:00:00,3.0,165.0,9.17,1. Baseline,,0.0
120011,2018-01-08 09:05:00,3.0,174.0,9.67,1. Baseline,,0.6
120115,2018-01-08 17:45:00,3.0,100.0,5.56,1. Baseline,,0.0


---
## Step 8: Fill Missing Values

- **Bolus**: Missing = 0 (no bolus was given)
- **Basal**: Forward fill per patient (rate persists until next change)
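
A toy example of the two fill rules, with made-up values: the grouped forward fill carries a rate forward only within one patient, and rows before a patient's first logged rate stay NaN.

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame(
    {
        "p_num": [1, 1, 1, 2, 2],
        "CommandedBasalRate": [nan, 0.6, nan, nan, nan],
        "BolusAmount": [nan, 2.0, nan, nan, nan],
    }
)

# Bolus: a missing value means no bolus was delivered in that bin
toy["bolus_u"] = toy["BolusAmount"].fillna(0)

# Basal: carry the last known rate forward, never across patients
toy["basal_rate_uhr"] = toy.groupby("p_num")["CommandedBasalRate"].ffill()

# U/hr -> U per 5-min bin (12 bins per hour)
toy["basal_u"] = toy["basal_rate_uhr"] / 12

print(toy[["p_num", "bolus_u", "basal_rate_uhr", "basal_u"]])
```

Patient 1 keeps a leading NaN because the rate before the first change is unknown, and patient 2 (no pump data) stays NaN throughout.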

In [31]:
# Bolus: no bolus = 0 units
merged["bolus_u"] = merged["BolusAmount"].fillna(0)

# Basal: forward fill per patient (rate persists until next change)
merged = merged.sort_values(["p_num", "datetime"])
merged["basal_rate_uhr"] = merged.groupby("p_num")["CommandedBasalRate"].ffill()

# Convert basal U/hr to U per 5-min interval
# 0.8 U/hr = 0.8 / 12 = 0.0667 U per 5-min
merged["basal_u"] = merged["basal_rate_uhr"] / 12

print("=== After Filling ===")
print(f"Bolus NaN remaining: {merged['bolus_u'].isna().sum()}")
print(
    f"Basal NaN remaining: {merged['basal_u'].isna().sum()} (includes 43 patients without pump data)"
)
print(
    f"\nPatients with basal data: {merged[merged['basal_u'].notna()]['p_num'].nunique()}"
)
print(
    f"Patients with at least one NaN basal row: {merged[merged['basal_u'].isna()]['p_num'].nunique()}"
)

=== After Filling ===
Bolus NaN remaining: 0
Basal NaN remaining: 2994423 (includes 43 patients without pump data)

Patients with basal data: 125
Patients with at least one NaN basal row: 165


In [43]:
# === Basal NaN Analysis ===
# After forward-fill, understand where remaining NaN values come from

print("=== Basal Coverage After Forward-Fill ===")
print(f"Rows with basal data: {merged['basal_u'].notna().sum():,}")
print(f"Rows with NaN basal:  {merged['basal_u'].isna().sum():,}")

# Identify patient groups based on basal coverage
patients_with_any_nan = set(merged[merged["basal_u"].isna()]["p_num"])
patients_with_any_data = set(merged[merged["basal_u"].notna()]["p_num"])
patients_no_pump_data = patients_with_any_nan - patients_with_any_data  # ONLY have NaN

print("\n=== Patient Breakdown ===")
print(f"Patients with at least some NaN:  {len(patients_with_any_nan)}")
print(f"Patients with at least some data: {len(patients_with_any_data)}")
print(f"Patients with NO pump data at all: {len(patients_no_pump_data)}")

# Quantify NaN sources
rows_from_no_pump = merged[merged["p_num"].isin(patients_no_pump_data)].shape[0]
total_nan = merged["basal_u"].isna().sum()

print("\n=== NaN Source Breakdown ===")
print(
    f"NaN from {len(patients_no_pump_data)} patients with no pump data: {rows_from_no_pump:,} ({rows_from_no_pump/total_nan*100:.1f}%)"
)
print(
    f"NaN from leading values (before 1st rate change): {total_nan - rows_from_no_pump:,} ({(total_nan - rows_from_no_pump)/total_nan*100:.1f}%)"
)

# Note: Basal data only logs RATE CHANGES, not continuous values.
# Forward-fill propagates each rate until the next change.
# Leading NaN = rows before a patient's first logged rate change (unknown rate).

=== Basal Coverage After Forward-Fill ===
Rows with basal data: 6,747,888
Rows with NaN basal:  2,994,423

=== Patient Breakdown ===
Patients with at least some NaN:  165
Patients with at least some data: 125
Patients with NO pump data at all: 43

=== NaN Source Breakdown ===
NaN from 43 patients with no pump data: 2,438,315 (81.4%)
NaN from leading values (before 1st rate change): 556,108 (18.6%)
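
The same split can be computed per patient with grouped boolean reductions instead of set arithmetic. This is a sketch on toy data; the variable names are illustrative and not part of the pipeline.

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame(
    {
        "p_num": [1, 1, 1, 2, 2, 3, 3],
        "basal_u": [nan, 0.05, 0.05, nan, nan, 0.04, 0.04],
    }
)

is_nan = toy["basal_u"].isna()
all_nan = is_nan.groupby(toy["p_num"]).all()  # True: patient has no basal at all
any_nan = is_nan.groupby(toy["p_num"]).any()  # True: patient has at least one NaN

# Patients with no pump data versus patients with only a gap in coverage
no_pump_data = set(all_nan[all_nan].index)
partial_gap_only = set(any_nan[any_nan & ~all_nan].index)

print(no_pump_data, partial_gap_only)
```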


---
## Step 9: Final Output

In [44]:
# Select and order final columns
final_columns = [
    "datetime",
    "p_num",
    "period",
    "bg_mM",
    "bg_mgdL",
    "basal_u",  # U per 5-min interval
    "basal_rate_uhr",  # Original U/hr (for reference)
    "bolus_u",  # U delivered in this bin
]

output_df = merged[final_columns].copy()
output_df = output_df.set_index("datetime").sort_index()

print("=== Final Output ===")
print(f"Shape: {output_df.shape}")
print(f"Patients: {output_df['p_num'].nunique()}")
print(f"Date range: {output_df.index.min()} to {output_df.index.max()}")
print(f"\nColumns: {list(output_df.columns)}")
print("\nData types:")
print(output_df.dtypes)

=== Final Output ===
Shape: (9742311, 7)
Patients: 168
Date range: 2016-08-04 00:00:00 to 2019-03-25 07:05:00

Columns: ['p_num', 'period', 'bg_mM', 'bg_mgdL', 'basal_u', 'basal_rate_uhr', 'bolus_u']

Data types:
p_num             float64
period             object
bg_mM             float64
bg_mgdL           float64
basal_u           float64
basal_rate_uhr    float64
bolus_u           float64
dtype: object


In [None]:
# Sample output for a patient WITH insulin data
sample_pid = output_df[output_df["basal_u"].notna()]["p_num"].iloc[0]
print(f"=== Sample Patient {sample_pid} (with insulin) ===")
sample = output_df[output_df["p_num"] == sample_pid].head(20)
print(sample)

In [None]:
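# Sample output for a patient WITHOUT insulin data (basal is NaN on every row).
# Picking the first NaN row is not enough: patients with pump data also have
# leading NaN rows before their first logged rate change.
basal_all_nan = output_df.groupby("p_num")["basal_u"].apply(lambda s: s.isna().all())
sample_pid_no_ins = basal_all_nan[basal_all_nan].index[0]
print(f"=== Sample Patient {sample_pid_no_ins} (without insulin) ===")
sample_no_ins = output_df[output_df["p_num"] == sample_pid_no_ins].head(20)
print(sample_no_ins)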

---
## Validation Checks

In [None]:
print("=== Validation Summary ===")

# Check 1: all timestamps on the 5-min grid
ts_check = (output_df.index.minute % 5 == 0) & (output_df.index.second == 0)
print(f"\n1. All timestamps on 5-min grid: {ts_check.all()}")

# Checks 2-4: no negative values
print(f"2. No negative BG values: {(output_df['bg_mM'].dropna() >= 0).all()}")
print(f"3. No negative bolus values: {(output_df['bolus_u'] >= 0).all()}")
print(f"4. No negative basal values: {(output_df['basal_u'].dropna() >= 0).all()}")

# Check 5: period values
print(f"5. Period values: {output_df['period'].unique().tolist()}")

# Checks 6-8: patient counts
# A patient has insulin data if any row has a basal value. Counting patients
# with any NaN row would also include those with only leading NaN values.
total_patients = output_df["p_num"].nunique()
patients_with_insulin = output_df[output_df["basal_u"].notna()]["p_num"].nunique()
patients_without_insulin = total_patients - patients_with_insulin
print(f"\n6. Total patients: {total_patients}")
print(f"7. Patients with insulin data: {patients_with_insulin}")
print(f"8. Patients without insulin data: {patients_without_insulin}")

# Check 9: bolus statistics
print("\n9. Bolus statistics:")
print(f"   - Rows with bolus > 0: {(output_df['bolus_u'] > 0).sum():,}")
print(f"   - Max bolus: {output_df['bolus_u'].max():.2f} U")
print(
    f"   - Mean bolus (when given): {output_df[output_df['bolus_u'] > 0]['bolus_u'].mean():.2f} U"
)

# Check 10: basal statistics
print("\n10. Basal statistics:")
print(f"   - Max basal rate: {output_df['basal_rate_uhr'].max():.2f} U/hr")
print(f"   - Mean basal rate: {output_df['basal_rate_uhr'].mean():.2f} U/hr")
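
The grid check above inspects only the minute and second fields, so a timestamp with a sub-second component would still pass. A stricter variant, sketched here on two illustrative timestamps, compares each timestamp with its own 5-min floor:

```python
import pandas as pd

idx = pd.to_datetime(["2018-01-01 08:00:00.000", "2018-01-01 08:05:00.500"])

# Field-based check: passes for both, because the seconds field is 0
field_check = (idx.minute % 5 == 0) & (idx.second == 0)

# Floor-based check: a timestamp is on the grid only if flooring leaves it unchanged
floor_check = idx == idx.floor("5min")

print(field_check.tolist(), floor_check.tolist())
```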

---
## Save Output (Optional)

In [None]:
# Uncomment to save
# output_path = CACHE_DIR / "processed" / "brown_2019_processed.parquet"
# output_path.parent.mkdir(parents=True, exist_ok=True)
# output_df.to_parquet(output_path)
# print(f"Saved to: {output_path}")