# FreeStyle Libre Sensor Data Preprocessing: HUPA-UCM Diabetes Dataset

This notebook documents the preprocessing of **Continuous Glucose Monitoring (CGM)** data obtained from the **HUPA-UCM Diabetes Dataset**, following the official methodology described in the research paper *"HUPA-UCM Diabetes Dataset" (Data in Brief, 2024)*.

The raw data come from the **FreeStyle Libre sensor** and include glucose readings for **five patients**:
**HUPA0001P**, **HUPA0003P**, **HUPA0005P**, **HUPA0006P**, and **HUPA0007P**.  
Each patient’s original folder contained sensor-specific files, which were first merged into a single table, **`free_style_sensor_all.csv`**, that holds the glucose readings of all five patients.

In this notebook, we perform a detailed preprocessing of that merged sensor file to prepare it for machine learning analysis.  
The steps implemented here are **fully aligned with the preprocessing protocol** outlined in the HUPA-UCM research article:

1. **Header and structure correction:** Detect and load the actual column header from the raw semicolon-separated file.  
2. **Column normalization:** Rename and clean column names to standardized English labels.  
3. **Data filtering:** Retain only the relevant variables: the timestamp (`time`) and the glucose readings (`glucose_history`, falling back to `glucose_scan` if the former is absent).  
4. **Time alignment:** Convert timestamps to datetime objects and round them to the nearest 5 minutes.  
5. **Data resampling:** Aggregate the glucose readings into 15-minute bins (the code takes the mean of each bin) to unify the sampling frequency.  
6. **Interpolation:** Apply linear interpolation to reconstruct a continuous 5-minute glucose time series.  
7. **Export:** Save the final, cleaned dataset as **`free_style_sensor_cleaned.csv`**, which contains evenly spaced glucose data with no missing values, ready for integration with Fitbit and insulin pump records.

This preprocessing pipeline ensures that all glucose readings are consistent, evenly spaced, and directly usable for downstream analysis such as glucose level prediction, hypoglycemia detection, and AI-driven diabetes management research.
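
The time-alignment, resampling, and interpolation steps listed above can be sketched end to end on a few made-up rows. This is only a minimal illustration: the function name `preprocess_glucose` and the three sample readings are assumptions, not part of the dataset tooling.

```python
import pandas as pd

def preprocess_glucose(df, time_col="time", value_col="glucose"):
    """Round to 5 min, average into 15-min bins, then interpolate to a 5-min grid."""
    out = df[[time_col, value_col]].copy()
    out[time_col] = pd.to_datetime(out[time_col], errors="coerce").dt.round("5min")
    out[value_col] = pd.to_numeric(out[value_col], errors="coerce")
    out = out.dropna().sort_values(time_col).drop_duplicates(subset=time_col, keep="last")
    series = out.set_index(time_col)[value_col]
    coarse = series.resample("15min").mean()                     # one value per 15-minute bin
    fine = coarse.resample("5min").interpolate(method="linear")  # fill the 5-minute grid
    return fine.rename("glucose").reset_index()

# Hypothetical sample readings, roughly 15 minutes apart
raw = pd.DataFrame({
    "time": ["2018-06-13 17:14:00", "2018-06-13 17:29:00", "2018-06-13 17:44:00"],
    "glucose": [488, 469, 436],
})
clean = preprocess_glucose(raw)
print(clean)
```

Three raw readings spanning 30 minutes become seven evenly spaced 5-minute records.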


In [1]:
# --- IMPORTS ---
import pandas as pd
import numpy as np
from pathlib import Path

In [17]:
# --- STEP 1: LOAD RAW DATA (semicolon-separated, Spanish-encoded) ---
sensor_df = pd.read_csv("/content/free_style_sensor_all.csv", sep=';', encoding='latin-1', engine='python')

In [18]:
# --- RELOAD CLEANLY WITH AUTO HEADER DETECTION ---
sensor_df = pd.read_csv("/content/free_style_sensor_all.csv", sep=';', encoding='latin-1', engine='python', header=None)

# Find which row actually contains the real header (it has the word 'Hora' or 'Hist')
header_row = None
for i in range(min(5, len(sensor_df))):  # check at most the first 5 rows
    row = sensor_df.iloc[i].astype(str).tolist()
    if any("Hora" in c or "Hist" in c for c in row):
        header_row = i
        break

if header_row is None:
    raise ValueError("Could not locate header row automatically")

# Reload with the correct header
sensor_df = pd.read_csv("/content/free_style_sensor_all.csv", sep=';', encoding='latin-1', engine='python', header=header_row)
print("Correct header detected at row:", header_row)
print("Columns now:", sensor_df.columns.tolist()[:10])
print("Shape:", sensor_df.shape)


Correct header detected at row: 1
Columns now: ['ID', 'Hora', 'Tipo de registro', 'HistÃ³rico glucosa (mg/dL)', 'Glucosa leÃ\xadda (mg/dL)', 'Insulina de acciÃ³n rÃ¡pida sin valor numÃ©rico', 'Insulina de acciÃ³n rÃ¡pida (unidades)', 'Alimentos sin valor numÃ©rico', 'Carbohidratos (raciones)', 'Insulina de acciÃ³n lenta sin valor numÃ©rico']
Shape: (6938, 19)
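
Column names such as `HistÃ³rico` are what UTF-8 text looks like when it is decoded as Latin-1, which suggests the file itself is probably UTF-8 encoded. The sketch below shows how such names could be repaired after loading. The helper `fix_mojibake` is hypothetical, and this notebook does not apply it: the rename step matches the garbled names as they are.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was decoded as Latin-1; return the input unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

garbled = ["Hora", "HistÃ³rico glucosa (mg/dL)", "Glucosa leÃ\xadda (mg/dL)"]
repaired = [fix_mojibake(name) for name in garbled]
print(repaired)
```

If the names were repaired this way, or the file were read with `encoding='utf-8'`, the keys of any rename dictionary would need the correctly accented spellings instead.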


In [19]:
# --- STEP 4: CLEAN & STANDARDIZE COLUMN NAMES ---

# rename Spanish -> English
rename_map = {
    'ID': 'record_id',
    'Hora': 'time',
    'Tipo de registro': 'record_type',
    'HistÃ³rico glucosa (mg/dL)': 'glucose_history',
    'Glucosa leÃ\xadda (mg/dL)': 'glucose_scan',
    'Insulina de acciÃ³n rÃ¡pida sin valor numÃ©rico': 'rapid_insulin_no_value',
    'Insulina de acciÃ³n rÃ¡pida (unidades)': 'rapid_insulin_units',
    'Alimentos sin valor numÃ©rico': 'food_no_value',
    'Carbohidratos (raciones)': 'carbohydrates_servings',
    'Insulina de acciÃ³n lenta sin valor numÃ©rico': 'long_insulin_no_value',
    'Insulina de acciÃ³n lenta (unidades)': 'long_insulin_units',
    'Notas': 'notes',
    'Glucosa de la tira (mg/dL)': 'strip_glucose_mgdl',
    'Cetonas (mmol/L)': 'ketones_mmol_l',
    'Insulina comida (unidades)': 'meal_insulin_units',
    'Insulina correcciÃ³n (unidades)': 'correction_insulin_units',
    'Insulina cambio usuario (unidades)': 'user_change_insulin_units',
    'Hora anterior': 'previous_time',
    'Hora actualizada,HUPA0001P,,,,': 'patient_id'
}

sensor_df.rename(columns=rename_map, inplace=True)


In [20]:
# remove columns that are completely empty
empty_cols = [c for c in sensor_df.columns if sensor_df[c].isna().all()]
if empty_cols:
    sensor_df.drop(columns=empty_cols, inplace=True)

# confirm column cleanup
print("Columns after cleaning:")
print(sensor_df.columns.tolist())
print("Shape after cleaning:", sensor_df.shape)
print("Non-null values per column:")
print(sensor_df.notnull().sum())

Columns after cleaning:
['record_id', 'time', 'record_type', 'glucose_history', 'glucose_scan', 'patient_id']
Shape after cleaning: (6938, 6)
Non-null values per column:
record_id          6938
time               1548
record_type        1548
glucose_history    1331
glucose_scan        217
patient_id         1548
dtype: int64


In [31]:
print("\n--- STEP 5: Selecting Relevant Columns ---")

# Keep only columns related to glucose and time, exactly as the paper specifies
# (Interstitial glucose values processing section)
if 'glucose_history' in sensor_df.columns:
    glucose_col = 'glucose_history'
    print("Using 'glucose_history' as glucose source.")
elif 'glucose_scan' in sensor_df.columns:
    glucose_col = 'glucose_scan'
    print("Using 'glucose_scan' as glucose source.")
else:
    raise ValueError("❌ No glucose column found! Please check your dataset headers.")



--- STEP 5: Selecting Relevant Columns ---
Using 'glucose_history' as glucose source.


In [32]:
# Drop rows without valid time or glucose
before_rows = len(sensor_df)
sensor_df = sensor_df.dropna(subset=['time', glucose_col])
after_rows = len(sensor_df)
print(f"Dropped {before_rows - after_rows} rows with missing time or glucose values.")
print("Remaining records:", after_rows)

Dropped 0 rows with missing time or glucose values.
Remaining records: 1331


In [33]:
print("\n--- STEP 6: Parsing and Rounding Time to Nearest 5 Minutes ---")

sensor_df['time'] = pd.to_datetime(sensor_df['time'], errors='coerce')
before_na = sensor_df['time'].isna().sum()
sensor_df = sensor_df.dropna(subset=['time'])
print(f"Converted 'time' column to datetime. Dropped {before_na} invalid timestamps.")
sensor_df['time'] = sensor_df['time'].dt.round('5min')
print("Rounded timestamps to the nearest 5 minutes.")


--- STEP 6: Parsing and Rounding Time to Nearest 5 Minutes ---
Converted 'time' column to datetime. Dropped 0 invalid timestamps.
Rounded timestamps to the nearest 5 minutes.
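
A small illustration of what `dt.round('5min')` does, using made-up timestamps. Two distinct readings can land on the same 5-minute mark, which is why duplicates are handled afterwards.

```python
import pandas as pd

# Hypothetical raw timestamps, not taken from the dataset
stamps = pd.Series(pd.to_datetime([
    "2018-06-13 17:13:00",
    "2018-06-13 17:16:40",
    "2018-06-13 17:18:00",
]))
rounded = stamps.dt.round("5min")
print(rounded.dt.strftime("%H:%M").tolist())  # ['17:15', '17:15', '17:20']
```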


In [35]:
print("\n--- STEP 7: Sorting and Removing Duplicates ---")

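# mergesort is stable, so keep='last' retains the row that appears last in the file
sensor_df = sensor_df.sort_values('time', kind='mergesort').drop_duplicates(subset='time', keep='last')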
print(f"Sorted by time and removed duplicate timestamps. Final shape: {sensor_df.shape}")



--- STEP 7: Sorting and Removing Duplicates ---
Sorted by time and removed duplicate timestamps. Final shape: (1331, 6)
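
Because the merged file combines several patients, de-duplicating on `time` alone keeps a single row per timestamp across all of them. The sketch below contrasts that with de-duplicating per patient. It assumes a `patient_id` column holding one clean label per patient, and the labels and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": ["A", "A", "B"],  # hypothetical patient labels
    "time": pd.to_datetime(["2018-06-13 17:15"] * 3),
    "glucose_history": [120, 124, 95],
})

# De-duplicate on time only: one row survives for all patients together
by_time = (
    df.sort_values("time", kind="mergesort")
      .drop_duplicates(subset="time", keep="last")
)

# De-duplicate on (patient, time): one row per patient and timestamp
by_patient_time = (
    df.sort_values(["patient_id", "time"])
      .drop_duplicates(subset=["patient_id", "time"], keep="last")
)

print(len(by_time), len(by_patient_time))  # 1 2
```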


In [36]:
print("\n--- STEP 8: Converting Glucose Values to Numeric ---")

sensor_df[glucose_col] = pd.to_numeric(sensor_df[glucose_col], errors='coerce')
na_glucose = sensor_df[glucose_col].isna().sum()
print(f"Converted glucose values to numeric. Found {na_glucose} NaNs (will handle via interpolation).")



--- STEP 8: Converting Glucose Values to Numeric ---
Converted glucose values to numeric. Found 0 NaNs (will handle via interpolation).
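
For reference, `errors='coerce'` turns anything that cannot be parsed as a number into `NaN` rather than raising. The values below are invented for illustration. Whether the sensor export ever contains non-numeric entries is not established here.

```python
import pandas as pd

raw_values = pd.Series(["118", "97.5", "n/a"])  # hypothetical cell contents
numeric = pd.to_numeric(raw_values, errors="coerce")
print(numeric.tolist())           # [118.0, 97.5, nan]
print(int(numeric.isna().sum()))  # 1
```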


In [39]:
print("\n--- STEP 9: Resampling to 15-Minute Intervals ---")

# Keep only time and glucose columns for numeric processing
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
sensor_df = sensor_df[['time', glucose_col]].copy()

# Ensure glucose is numeric
sensor_df[glucose_col] = pd.to_numeric(sensor_df[glucose_col], errors='coerce')

# Set time as index and resample
sensor_df = sensor_df.set_index('time').resample('15min').mean()

print(f"Resampled to 15-minute intervals. New shape: {sensor_df.shape}")



--- STEP 9: Resampling to 15-Minute Intervals ---
Resampled to 15-minute intervals. New shape: (1340, 1)
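
The following sketch, on made-up readings, shows how `resample('15min').mean()` behaves. Readings that fall in the same 15-minute bin are averaged, and bins with no readings appear as `NaN`. Counting the longest run of empty bins is an optional check added here for illustration, not a step of this notebook.

```python
import pandas as pd

# Hypothetical readings with a pause between 17:25 and 18:00
readings = pd.Series(
    [100.0, 110.0, 130.0],
    index=pd.to_datetime(["2018-06-13 17:15", "2018-06-13 17:25", "2018-06-13 18:00"]),
)
coarse = readings.resample("15min").mean()
print(coarse)

# Longest run of consecutive empty 15-minute bins
is_empty = coarse.isna()
run_id = (~is_empty).cumsum()  # increments at every non-empty bin
longest_gap = int(is_empty.astype(int).groupby(run_id).sum().max())
print("Longest gap (bins):", longest_gap)  # 2
```

Knowing the longest gap before interpolating shows how much of the final series is reconstructed rather than measured.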


In [40]:
print("\n--- STEP 10: Linear Interpolation to Obtain 5-Minute Records ---")

sensor_df = sensor_df.resample('5min').interpolate(method='linear')
print(f"Interpolated linearly to 5-minute intervals. Records now: {len(sensor_df)}")



--- STEP 10: Linear Interpolation to Obtain 5-Minute Records ---
Interpolated linearly to 5-minute intervals. Records now: 4018
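
As a worked example of the interpolation, take two consecutive 15-minute values, 488 and 469 mg/dL. The two intermediate 5-minute points lie one third and two thirds of the way between them: 488 + (469 - 488)/3 ≈ 481.67 and 488 + 2(469 - 488)/3 ≈ 475.33. A minimal sketch reproducing that arithmetic:

```python
import pandas as pd

coarse = pd.Series(
    [488.0, 469.0],
    index=pd.to_datetime(["2018-06-13 17:15", "2018-06-13 17:30"]),
)
fine = coarse.resample("5min").interpolate(method="linear")
print(fine.round(2).tolist())  # [488.0, 481.67, 475.33, 469.0]
```

With default arguments the interpolation bridges a gap of any length, so long sensor dropouts are filled with a straight line as well.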


In [41]:
print("\n--- STEP 11: Resetting Index and Finalizing ---")
sensor_df = sensor_df.reset_index()
sensor_df.rename(columns={glucose_col: 'glucose'}, inplace=True)

if sensor_df['glucose'].isna().any():
    print("Warning: Some glucose values are still missing after interpolation.")
else:
    print("All glucose values filled successfully after interpolation.")


--- STEP 11: Resetting Index and Finalizing ---
All glucose values filled successfully after interpolation.


In [42]:
print("\n--- STEP 12: Saving Cleaned Freestyle Sensor Data ---")

output_path = Path("/content/free_style_sensor_cleaned.csv")
sensor_df.to_csv(output_path, index=False)

print(f"SUCCESS! Cleaned Freestyle Libre sensor data saved to: {output_path}")
print(f"Total cleaned records: {len(sensor_df)}")
print("\nSample of final cleaned data:")
print(sensor_df.head(10))

print("\n--- STEP 13: Final Summary ---")
print(sensor_df.info())
print("Preprocessing for Freestyle Sensor completed successfully according to the research paper!")



--- STEP 12: Saving Cleaned Freestyle Sensor Data ---
SUCCESS! Cleaned Freestyle Libre sensor data saved to: /content/free_style_sensor_cleaned.csv
Total cleaned records: 4018

Sample of final cleaned data:
                 time     glucose
0 2018-06-13 17:15:00  488.000000
1 2018-06-13 17:20:00  481.666667
2 2018-06-13 17:25:00  475.333333
3 2018-06-13 17:30:00  469.000000
4 2018-06-13 17:35:00  458.000000
5 2018-06-13 17:40:00  447.000000
6 2018-06-13 17:45:00  436.000000
7 2018-06-13 17:50:00  430.000000
8 2018-06-13 17:55:00  424.000000
9 2018-06-13 18:00:00  418.000000

--- STEP 13: Final Summary ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4018 entries, 0 to 4017
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   time     4018 non-null   datetime64[ns]
 1   glucose  4018 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 62.9 KB
None
Preprocessing for Freestyle Sensor completed successfully according to the research paper!
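
A possible sanity check on an exported file is to read it back and confirm that the timestamps are evenly spaced and that no glucose value is missing. The sketch below does this on a small stand-in table written to a temporary directory. The four rows are illustrative, not the real output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the cleaned table (illustrative values only)
grid = pd.DataFrame({
    "time": pd.date_range("2018-06-13 17:15", periods=4, freq="5min"),
    "glucose": [488.0, 481.7, 475.3, 469.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "free_style_sensor_cleaned.csv"
    grid.to_csv(path, index=False)
    restored = pd.read_csv(path, parse_dates=["time"])

steps = restored["time"].diff().dropna()
is_uniform = bool((steps == pd.Timedelta(minutes=5)).all())
has_missing = bool(restored["glucose"].isna().any())
print("Uniform 5-minute spacing:", is_uniform)
print("Missing glucose values:", has_missing)
```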