# Supervised Learning

**Supervised learning** methods are applied directly to the event-level data to
predict the next blood glucose measurement based on recent management actions and physiological context.
By framing each glucose measurement as a prediction target and using preceding insulin, meal, exercise,
and glucose events within a fixed lookback window as input features, the task evaluates
how well short-term glucose dynamics can be inferred from logged behavioral sequences.
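
The framing above can be illustrated with a small sketch. It is not the preprocessing code that produced the dataset: the toy event log, the `kind`/`value` schema, and the 6-hour `LOOKBACK` are assumptions made for illustration. Only the output column names (`bg_last_value`, `ins_window_sum`, `delta_t_minutes`, `y_next_glucose`) are taken from the dataset used later in this notebook.

```python
import pandas as pd

# Toy event log for one patient (illustrative values, not from the dataset).
events = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["1991-01-01 08:00", "1991-01-01 08:05", "1991-01-01 12:00", "1991-01-01 18:00"]
        ),
        "kind": ["glucose", "insulin", "glucose", "glucose"],
        "value": [150.0, 5.0, 110.0, 180.0],
    }
).sort_values("time")

LOOKBACK = pd.Timedelta(hours=6)  # assumed window length

glucose = events[events["kind"] == "glucose"].reset_index(drop=True)
rows = []
# Every glucose reading except the last is an anchor; the next reading is its target.
for i in range(len(glucose) - 1):
    anchor_time = glucose.loc[i, "time"]
    target_time = glucose.loc[i + 1, "time"]
    # Events inside the half-open window (anchor - LOOKBACK, anchor].
    in_window = events[
        (events["time"] > anchor_time - LOOKBACK) & (events["time"] <= anchor_time)
    ]
    bg = in_window[in_window["kind"] == "glucose"]
    insulin = in_window[in_window["kind"] == "insulin"]
    rows.append(
        {
            "event_time": anchor_time,
            "delta_t_minutes": int((target_time - anchor_time).total_seconds() // 60),
            "bg_last_value": bg["value"].iloc[-1],
            "bg_window_count": len(bg),
            "ins_window_sum": insulin["value"].sum(),
            "y_next_glucose": glucose.loc[i + 1, "value"],
        }
    )

samples = pd.DataFrame(rows)
print(samples)
```

Each row pairs the features available at `event_time` with the glucose value observed at the next measurement, so no information from after the anchor enters the inputs.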

In [3]:
!uv pip install \
  "fastparquet>=2025.12.0" \
  "ipykernel>=7.2.0" \
  "llvmlite>=0.42" \
  "matplotlib>=3.10.8" \
  "numba>=0.59" \
  "numpy>=2.3.5" \
  "pandas==2.3.3" \
  "scikit-learn>=1.8.0" \
  "scipy>=1.17.0" \
  "seaborn>=0.13.2"

Using Python 3.12.12 environment at: /Users/z.yang/playground/srh-stat-and-ml-exam/.venv
Audited 10 packages in 6ms


In [4]:
from pathlib import Path

ROOT_DIR = Path().resolve().parent
DATA_DIR = ROOT_DIR / "data"

In [5]:
import pandas as pd

pd.set_option("future.no_silent_downcasting", True)
pd.set_option('display.max_columns', None)

In [6]:
import warnings

warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=SyntaxWarning)

## Data Splitting

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split

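# Load the supervised samples (one row per glucose prediction target)
df = pd.read_parquet(DATA_DIR / "processed" / "supervised" / "diabetes_next_glucose_supervised.parquet")

# Two-stage split: 80% training, then the remaining 20% is halved
# into validation (10%) and testing (10%)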
df_train, df_val_test = train_test_split(
    df,
    train_size=0.8,
    random_state=42)
df_val, df_test = train_test_split(
    df_val_test,
    train_size=0.5,
    random_state=42)

print(f"Shape of df_train: {df_train.shape}")
print(f"Shape of df_val: {df_val.shape}")
print(f"Shape of df_test: {df_test.shape}")

SAMPLE_DIR = DATA_DIR / "processed" / "supervised" / "samples"
SAMPLE_DIR.mkdir(parents=True, exist_ok=True)
df_train.to_parquet(SAMPLE_DIR / "train.parquet")
df_val.to_parquet(SAMPLE_DIR / "val.parquet")
df_test.to_parquet(SAMPLE_DIR / "test.parquet")

Shape of df_train: (8489, 30)
Shape of df_val: (1061, 30)
Shape of df_test: (1062, 30)
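
The split sizes printed above follow from simple arithmetic on the two stages. The sketch below reproduces them without scikit-learn. The helper `two_stage_split_sizes` is hypothetical, and it assumes that the training share is rounded down and the remainder goes to the other part, which is how `train_test_split` is understood to behave when only `train_size` is given.

```python
import math


def two_stage_split_sizes(n_rows, train_frac=0.8, val_frac_of_rest=0.5):
    """Return (n_train, n_val, n_test) for a two-stage split.

    Assumption: each stage rounds its first part down and gives the
    remainder to the second part.
    """
    n_train = math.floor(train_frac * n_rows)
    n_rest = n_rows - n_train
    n_val = math.floor(val_frac_of_rest * n_rest)
    n_test = n_rest - n_val
    return n_train, n_val, n_test


n_total = 8489 + 1061 + 1062  # row counts printed above
sizes = two_stage_split_sizes(n_total)
print(sizes)
```

Because the second stage takes half of the remaining 20%, validation and test each receive roughly 10% of the rows, and rounding explains why the test set is one row larger than the validation set.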


## Preprocessing

In [20]:
df_train

Unnamed: 0,id,event_time,target_time,delta_t_minutes,y_next_glucose,hour_of_day,day_of_week,is_weekend,is_paper_like_time_flag,bg_last_value,bg_window_count,bg_window_mean,bg_window_std,bg_window_min,bg_window_max,bg_window_range,time_since_last_bg,ins_window_count,ins_window_sum,ins_window_mean,ins_window_std,ins_window_max,time_since_last_ins,ins_33_sum,ins_34_sum,ins_35_sum,meal_events_count,exercise_events_count,hypo_event_flag,special_event_flag
8273,data-56,1989-03-03 18:00:00,1989-03-03 22:00:00,240,63.0,18,4,False,True,172.0,1,172.0,0.0,172.0,172.0,0.0,0,1,5.0,5.0,0.000000,5.0,0.0,5.0,0.0,0.0,0,0,False,False
9315,data-63,1991-07-28 07:11:00,1991-07-28 12:31:00,320,341.0,7,6,True,False,84.0,1,84.0,0.0,84.0,84.0,0.0,0,0,0.0,,,,,0.0,0.0,0.0,0,0,False,False
7874,data-55,1991-05-22 12:00:00,1991-05-22 18:00:00,360,212.0,12,2,False,True,201.0,1,201.0,0.0,201.0,201.0,0.0,0,1,8.0,8.0,0.000000,8.0,0.0,8.0,0.0,0.0,0,0,False,False
1482,data-16,1991-07-11 08:56:00,1991-07-11 19:05:00,609,182.0,8,3,False,False,246.0,1,246.0,0.0,246.0,246.0,0.0,0,0,0.0,,,,,0.0,0.0,0.0,0,0,False,False
7581,data-55,1991-03-10 18:00:00,1991-03-10 22:00:00,240,167.0,18,6,True,True,187.0,1,187.0,0.0,187.0,187.0,0.0,0,1,10.0,10.0,0.000000,10.0,0.0,10.0,0.0,0.0,0,0,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5734,data-42,1991-09-01 08:00:00,1991-09-01 12:00:00,240,182.0,8,6,True,True,49.0,1,49.0,0.0,49.0,49.0,0.0,0,1,2.0,2.0,0.000000,2.0,0.0,2.0,0.0,0.0,0,0,False,False
5191,data-39,1991-07-18 12:10:00,1991-07-18 18:05:00,355,161.0,12,3,False,False,104.0,1,104.0,0.0,104.0,104.0,0.0,0,0,0.0,,,,,0.0,0.0,0.0,0,0,False,False
5390,data-41,1991-01-20 22:00:00,1991-01-21 08:00:00,600,96.0,22,6,True,True,140.0,1,140.0,0.0,140.0,140.0,0.0,0,1,5.0,5.0,0.000000,5.0,0.0,0.0,5.0,0.0,0,0,False,False
860,data-07,1989-04-28 08:00:00,1989-04-28 12:00:00,240,126.0,8,4,False,True,215.0,1,215.0,0.0,215.0,215.0,0.0,0,2,25.0,12.5,7.778175,18.0,0.0,7.0,18.0,0.0,0,0,False,False


### Handle missing values

In [19]:
# Inspect the full dataset (df) so the counts cover all three splits
cols_with_nan = df.columns[df.isna().any()]
print(f"columns with NaN: {cols_with_nan.tolist()}")

nan_counts = df.isna().sum()
print(f"NaN counts: \n{nan_counts[nan_counts > 0]}")

columns with NaN: ['ins_window_mean', 'ins_window_std', 'ins_window_max', 'time_since_last_ins']
NaN counts: 
ins_window_mean        5474
ins_window_std         5474
ins_window_max         5474
time_since_last_ins    5474
dtype: int64


Although each of the four columns listed above is missing in 5,474 of the 10,612 rows, this is expected for diabetes treatment behavior data, where the absence of insulin usage within a given period is a normal scenario. Because all four columns are derived statistical features rather than raw measurements, we fill the missing values with zeros instead of discarding the affected rows.
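
A minimal sketch of this zero-filling step is shown below. The helper `fill_insulin_stats` and the toy frame are hypothetical; the column names come from the NaN report above. In the notebook the same function would be applied to `df_train`, `df_val`, and `df_test` alike.

```python
import numpy as np
import pandas as pd

# Columns reported as containing NaN in the output above.
INSULIN_STAT_COLS = [
    "ins_window_mean",
    "ins_window_std",
    "ins_window_max",
    "time_since_last_ins",
]


def fill_insulin_stats(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the insulin window statistics filled with 0."""
    filled = frame.copy()
    filled[INSULIN_STAT_COLS] = filled[INSULIN_STAT_COLS].fillna(0.0)
    return filled


# Toy frame (illustrative values): one row with insulin, one without.
toy = pd.DataFrame(
    {
        "ins_window_count": [1, 0],
        "ins_window_sum": [5.0, 0.0],
        "ins_window_mean": [5.0, np.nan],
        "ins_window_std": [0.0, np.nan],
        "ins_window_max": [5.0, np.nan],
        "time_since_last_ins": [0.0, np.nan],
    }
)

toy_filled = fill_insulin_stats(toy)
print(toy_filled)
```

In the preview rows shown earlier, the rows with missing statistics also have `ins_window_count` equal to 0, so that column still lets a model tell "no insulin in the window" apart from a genuine zero such as `time_since_last_ins = 0.0`.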