# Freestyle Libre Sensor Data Preprocessing — HUPA-UCM Diabetes Dataset

This notebook implements and documents the preprocessing pipeline for **Continuous Glucose Monitoring (CGM)** data from the **HUPA-UCM Diabetes Dataset**, following the methodology described in the research article *“HUPA-UCM Diabetes Dataset” (Data in Brief, 2024)*.

The dataset consists of glucose readings collected from **five patients** using the **Freestyle Libre sensor**:  
**HUPA0001P**, **HUPA0003P**, **HUPA0005P**, **HUPA0006P**, and **HUPA0007P**.  
Each patient’s sensor data is stored in a separate CSV file.  

For this notebook, the preprocessing procedure was **tested and validated on one representative patient (HUPA0001P)**, but the pipeline is designed to handle **all five patients** independently following the same steps.

---

## Pipeline Overview

The following preprocessing steps are implemented, following the methodology described in the **HUPA-UCM** article:

1. **Raw data inspection:** Load and preview the unprocessed Freestyle Libre sensor file.  
2. **Header detection:** Automatically identify and parse the correct header row.  
3. **Column normalization:** Standardize column names from Spanish to English.  
4. **Patient-specific data cleaning:**  
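   - Remove invalid start/end periods (the first and last 2 hours of each patient's recording, covering sensor warm-up and shutdown)  
   - Remove glucose values outside the 40 to 400 mg/dL range  
   - Split data into continuous segments (a new segment starts at any gap longer than 30 minutes)  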
5. **Time alignment:** Convert timestamps and round them to the nearest 5 minutes.  
6. **Resampling:** Subsample glucose readings to uniform 15-minute intervals.  
7. **Interpolation:** Linearly interpolate values to obtain a continuous 5-minute time series.  
8. **Export:** Save the final cleaned dataset (`free_style_sensor_cleaned.csv`) for integration with other patient data (e.g., Fitbit, insulin pump).

---

This preprocessing ensures that all glucose data are **consistent, evenly spaced, and biologically valid**, making them ready for downstream applications such as **glucose prediction**, **hypoglycemia detection**, and **AI-based diabetes management research**.
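
As a rough illustration of steps 5 to 7, the sketch below runs the round, subsample, and interpolate sequence on three readings taken from the raw preview further down. The `toy` frame and variable names are illustrative only; the calls mirror the ones used later in this notebook.

```python
import pandas as pd

# Three historic readings copied from the raw file preview (mg/dL).
toy = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2018-06-13 18:04", "2018-06-13 18:19", "2018-06-13 18:34"]
        ),
        "glucose": [418.0, 397.0, 358.0],
    }
)

# Step 5: round timestamps to the nearest 5 minutes.
toy["time"] = toy["time"].dt.round("5min")

# Step 6: keep the first reading in each 15-minute bin.
toy_15min = toy.set_index("time")[["glucose"]].resample("15min").first().dropna()

# Step 7: upsample to a 5-minute grid and fill the new rows linearly.
toy_5min = toy_15min.resample("5min").interpolate(method="linear").reset_index()

print(toy_5min)
```

Note that `resample("15min")` labels each bin by its left edge, so the reading rounded to 18:05 is reported at 18:00. Whether that shift is acceptable depends on the downstream use.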


In [1]:
# --- IMPORTS ---
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
# --- STEP 0: LOAD DATA (semicolon-separated, read with Latin-1 encoding) ---
data_path = Path("/content/free_style_sensor_all.csv")
temp = pd.read_csv(data_path, sep=';', encoding='latin-1', header=None, engine='python')


In [44]:
# Preview
print("Dataset loaded successfully.")
print("Shape:", temp.shape)
print("\nFirst few rows of the raw data:")
display(temp.head(10))   # or use print(temp.head(10)) if display() is not available

print("\nColumn names:")
print(temp.columns.tolist())

Dataset loaded successfully.
Shape: (6940, 19)

First few rows of the raw data:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,HUPA0001P,,,,,,,,,,,,,,,,,,",patient_id,HUPA0003P,HUPA0005P,HUPA0006P,HUPA..."
1,ID,Hora,Tipo de registro,HistÃ³rico glucosa (mg/dL),Glucosa leÃ­da (mg/dL),Insulina de acciÃ³n rÃ¡pida sin valor numÃ©rico,Insulina de acciÃ³n rÃ¡pida (unidades),Alimentos sin valor numÃ©rico,Carbohidratos (raciones),Insulina de acciÃ³n lenta sin valor numÃ©rico,Insulina de acciÃ³n lenta (unidades),Notas,Glucosa de la tira (mg/dL),Cetonas (mmol/L),Insulina comida (unidades),Insulina correcciÃ³n (unidades),Insulina cambio usuario (unidades),Hora anterior,"Hora actualizada,HUPA0001P,,,,"
2,48745,2018/06/13 17:19,0,488,,,,,,,,,,,,,,,",HUPA0001P,,,,"
3,48746,2018/06/13 17:34,0,469,,,,,,,,,,,,,,,",HUPA0001P,,,,"
4,48747,2018/06/13 17:49,0,436,,,,,,,,,,,,,,,",HUPA0001P,,,,"
5,48748,2018/06/13 18:10,1,,435,,,,,,,,,,,,,,",HUPA0001P,,,,"
6,48761,2018/06/13 18:04,0,418,,,,,,,,,,,,,,,",HUPA0001P,,,,"
7,48762,2018/06/13 18:19,0,397,,,,,,,,,,,,,,,",HUPA0001P,,,,"
8,48763,2018/06/13 18:34,0,358,,,,,,,,,,,,,,,",HUPA0001P,,,,"
9,48764,2018/06/13 18:49,0,326,,,,,,,,,,,,,,,",HUPA0001P,,,,"



Column names:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]


In [3]:
# --- STEP 1: DETECT AND LOAD TRUE HEADER ROW ---
header_row = None
for i in range(10):
    row = temp.iloc[i].astype(str).tolist()
    if any("Hora" in c or "Hist" in c for c in row):
        header_row = i
        break

if header_row is None:
    raise ValueError("Could not locate header row automatically")

# Reload with the correct header
sensor_df  = pd.read_csv(data_path, sep=';', encoding='latin-1', engine='python', header=header_row)

In [5]:
print("Correct header detected at row:", header_row)
print("Columns now:", sensor_df.columns.tolist()[:10])
print("Shape:", sensor_df.shape)

Correct header detected at row: 1
Columns now: ['ID', 'Hora', 'Tipo de registro', 'HistÃ³rico glucosa (mg/dL)', 'Glucosa leÃ\xadda (mg/dL)', 'Insulina de acciÃ³n rÃ¡pida sin valor numÃ©rico', 'Insulina de acciÃ³n rÃ¡pida (unidades)', 'Alimentos sin valor numÃ©rico', 'Carbohidratos (raciones)', 'Insulina de acciÃ³n lenta sin valor numÃ©rico']
Shape: (6938, 19)
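
The garbled accents in the headers above (for example `HistÃ³rico`) are what UTF-8 bytes look like when decoded as Latin-1. The standard-library sketch below reproduces the effect and shows one way to reverse it; `repair_mojibake` is a hypothetical helper, not part of this notebook. It is an untested assumption that reading this particular file with `encoding='utf-8'` would work.

```python
# Standard library only: reproduce the garbled header seen in the output above.
original = "Histórico glucosa (mg/dL)"
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)


def repair_mojibake(text: str) -> str:
    """Hypothetical helper: undo a UTF-8-read-as-Latin-1 mix-up.

    Returns the text unchanged if it cannot be round-tripped.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake(garbled))
```

If the headers were repaired this way, or the file were read as UTF-8, the keys of the `rename_map` in the next step would need to use the properly accented names instead.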


In [7]:
# --- STEP 2: CLEAN AND STANDARDIZE COLUMN NAMES ---
# Convert Spanish column labels to standardized English names for consistency
rename_map = {
    'ID': 'record_id',
    'Hora': 'time',
    'Tipo de registro': 'record_type',
    'HistÃ³rico glucosa (mg/dL)': 'glucose_history',
    'Glucosa leÃ\xadda (mg/dL)': 'glucose_scan',
    'Insulina de acciÃ³n rÃ¡pida sin valor numÃ©rico': 'rapid_insulin_no_value',
    'Insulina de acciÃ³n rÃ¡pida (unidades)': 'rapid_insulin_units',
    'Alimentos sin valor numÃ©rico': 'food_no_value',
    'Carbohidratos (raciones)': 'carbohydrates_servings',
    'Insulina de acciÃ³n lenta sin valor numÃ©rico': 'long_insulin_no_value',
    'Insulina de acciÃ³n lenta (unidades)': 'long_insulin_units',
    'Notas': 'notes',
    'Glucosa de la tira (mg/dL)': 'strip_glucose_mgdl',
    'Cetonas (mmol/L)': 'ketones_mmol_l',
    'Insulina comida (unidades)': 'meal_insulin_units',
    'Insulina correcciÃ³n (unidades)': 'correction_insulin_units',
    'Insulina cambio usuario (unidades)': 'user_change_insulin_units',
    'Hora anterior': 'previous_time',
    'Hora actualizada,HUPA0001P,,,,': 'patient_id'
}

sensor_df.rename(columns=rename_map, inplace=True)


In [36]:
# Fix patient column if it contains any HUPA code in its name
possible_patient_cols = [c for c in sensor_df.columns if "HUPA" in c]
if possible_patient_cols:
    sensor_df.rename(columns={possible_patient_cols[0]: "patient_id"}, inplace=True)

In [9]:
# Drop completely empty columns
empty_cols = [c for c in sensor_df.columns if sensor_df[c].isna().all()]
if empty_cols:
    sensor_df.drop(columns=empty_cols, inplace=True)

# confirm
print("Columns after cleaning:")
print(sensor_df.columns.tolist())
print("Shape after cleaning:", sensor_df.shape)
print("Non-null values per column:")
print(sensor_df.notnull().sum())

Columns after cleaning:
['record_id', 'time', 'record_type', 'glucose_history', 'glucose_scan', 'patient_id']
Shape after cleaning: (6938, 6)
Non-null values per column:
record_id          6938
time               1548
record_type        1548
glucose_history    1331
glucose_scan        217
patient_id         1548
dtype: int64


In [37]:
# Ensure patient_id column exists
if 'patient_id' not in sensor_df.columns:
    raise ValueError("Missing patient_id column.")

# Strip stray commas and whitespace from patient_id (e.g. ',HUPA0001P,,,,' -> 'HUPA0001P')
sensor_df['patient_id'] = sensor_df['patient_id'].str.replace(',', '').str.strip()
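
Removing commas works for the values seen here. A slightly more defensive alternative is to extract the ID with a regular expression. The sketch below assumes every ID follows the pattern `HUPA` + four digits + `P`, which holds for the five IDs listed at the top of this notebook but is otherwise an assumption. The `raw_ids` series is made-up test input.

```python
import pandas as pd

# Made-up examples of how the patient column may look after loading.
raw_ids = pd.Series([",HUPA0001P,,,,", " HUPA0003P ", None])

# Assumed ID pattern: 'HUPA', four digits, 'P'. Non-matching or missing values become NaN.
clean_ids = raw_ids.str.extract(r"(HUPA\d{4}P)", expand=False)

print(clean_ids.tolist())
```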

In [12]:
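# --- STEP 5: SELECT RELEVANT COLUMNS ---
# Pick the glucose source column: prefer historic readings, fall back to scan readings
print("--- STEP 5: Selecting Relevant Columns ---")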
if 'glucose_history' in sensor_df.columns:
    glucose_col = 'glucose_history'
    print("Using 'glucose_history' as glucose source.")
elif 'glucose_scan' in sensor_df.columns:
    glucose_col = 'glucose_scan'
    print("Using 'glucose_scan' as glucose source.")
else:
    raise ValueError("❌ No glucose column found! Please check your dataset headers.")



--- STEP 5: Selecting Relevant Columns ---
Using 'glucose_history' as glucose source.


In [14]:
# Drop rows without valid time or glucose
before_rows = len(sensor_df)
sensor_df = sensor_df.dropna(subset=['time', glucose_col])
after_rows = len(sensor_df)
print(f"Dropped {before_rows - after_rows} rows with missing time or glucose values.")
print("Remaining records:", after_rows)

Dropped 0 rows with missing time or glucose values.
Remaining records: 1306


In [15]:
# --- Convert the time column to datetime (unparseable values become NaT) ---
sensor_df['time'] = pd.to_datetime(sensor_df['time'], errors='coerce')
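
Because `errors='coerce'` silently turns unparseable strings into `NaT`, it can help to pass an explicit format and count the failures. The sketch below assumes the `YYYY/MM/DD HH:MM` layout visible in the raw preview; the `times` series is illustrative.

```python
import pandas as pd

# Two timestamps in the layout seen in the raw file, plus one deliberately bad value.
times = pd.Series(["2018/06/13 17:19", "2018/06/13 17:34", "not a time"])

# An explicit format avoids ambiguous day/month guessing.
parsed = pd.to_datetime(times, format="%Y/%m/%d %H:%M", errors="coerce")

# Count how many values could not be parsed before any rows are dropped.
n_bad = int(parsed.isna().sum())
print(f"Unparseable timestamps: {n_bad}")
```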

In [16]:

# --- PATIENT-SPECIFIC CLEANING ---
# Written to support any number of patients: trims invalid start/end periods,
# removes out-of-range glucose values, and splits the series at large gaps.

def trim_invalid_periods(df):
    """Drop the first and last 2 hours of the recording (sensor warm-up and shutdown)."""
    start = df['time'].min() + pd.Timedelta(hours=2)
    end   = df['time'].max() - pd.Timedelta(hours=2)
    return df[(df['time'] >= start) & (df['time'] <= end)]

def remove_glucose_outliers(df, col='glucose_history', low=40, high=400):
    """Keep only readings within [low, high] mg/dL."""
    return df[(df[col] >= low) & (df[col] <= high)]

def split_large_gaps(df, max_gap_min=30):
    """Split df into segments wherever consecutive readings are more than max_gap_min apart."""
    df = df.sort_values('time')
    diffs = df['time'].diff().dt.total_seconds().div(60)
    # Each gap above the threshold starts a new segment
    df = df.assign(segment_id=(diffs > max_gap_min).cumsum())
    # Segments with a single reading are discarded
    return [g.drop(columns=['segment_id']) for _, g in df.groupby('segment_id') if len(g) > 1]

patients_clean = []
for pid in sensor_df['patient_id'].unique():
    patient_df = sensor_df[sensor_df['patient_id'] == pid].copy()
    patient_df = trim_invalid_periods(patient_df)
    patient_df = remove_glucose_outliers(patient_df, glucose_col)
    segments = split_large_gaps(patient_df, max_gap_min=30)
    for seg in segments:
        seg['patient_id'] = pid
        patients_clean.append(seg)

sensor_df = pd.concat(patients_clean, ignore_index=True).reset_index(drop=True)
print(f"After patient-specific cleaning: {sensor_df.shape}")

After patient-specific cleaning: (1290, 6)
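
To make the gap-splitting logic concrete, the sketch below applies the same `diff` and `cumsum` idea to five made-up timestamps with one 75-minute gap. The `toy` frame is illustrative only.

```python
import pandas as pd

# Five made-up timestamps: 15-minute spacing with one 75-minute gap in the middle.
toy = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2018-06-13 21:20",
                "2018-06-13 21:35",
                "2018-06-13 21:50",
                "2018-06-13 23:05",
                "2018-06-13 23:20",
            ]
        )
    }
)

# Minutes elapsed since the previous reading (NaN for the first row).
gap_min = toy["time"].diff().dt.total_seconds().div(60)

# Every gap above 30 minutes increments the running segment counter.
toy["segment_id"] = (gap_min > 30).cumsum()

print(toy)
```

The first three rows fall into segment 0 and the last two into segment 1, because only the 75-minute jump exceeds the threshold.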


In [17]:
print(sensor_df['patient_id'].value_counts())
print(sensor_df['time'].min(), "→", sensor_df['time'].max())
sensor_df.describe()

patient_id
,HUPA0001P,,,,    1290
Name: count, dtype: int64
2018-06-13 21:20:00 → 2018-06-27 12:10:00


Unnamed: 0,time,record_type,glucose_history,glucose_scan
count,1290,1290.0,1290.0,0.0
mean,2018-06-20 16:10:31.674418688,0.0,180.011628,
min,2018-06-13 21:20:00,0.0,40.0,
25%,2018-06-17 06:10:45,0.0,127.0,
50%,2018-06-20 15:30:30,0.0,166.0,
75%,2018-06-24 02:36:15,0.0,229.0,
max,2018-06-27 12:10:00,0.0,398.0,
std,,0.0,70.992569,


In [18]:
sensor_df = sensor_df[['time', 'glucose_history', 'patient_id']].copy()
sensor_df.rename(columns={'glucose_history': 'glucose'}, inplace=True)

print(sensor_df.head())
print(sensor_df.info())

                 time  glucose      patient_id
0 2018-06-13 21:20:00     46.0  ,HUPA0001P,,,,
1 2018-06-13 21:35:00     43.0  ,HUPA0001P,,,,
2 2018-06-13 21:50:00     73.0  ,HUPA0001P,,,,
3 2018-06-13 22:05:00    102.0  ,HUPA0001P,,,,
4 2018-06-13 22:20:00    128.0  ,HUPA0001P,,,,
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1290 entries, 0 to 1289
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   time        1290 non-null   datetime64[ns]
 1   glucose     1290 non-null   float64       
 2   patient_id  1290 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 30.4+ KB
None


In [19]:
sensor_df['patient_id'] = sensor_df['patient_id'].str.replace(',', '').str.strip()
print(sensor_df['patient_id'].unique())

['HUPA0001P']


In [20]:
print(sensor_df.head())
print(sensor_df.info())

                 time  glucose patient_id
0 2018-06-13 21:20:00     46.0  HUPA0001P
1 2018-06-13 21:35:00     43.0  HUPA0001P
2 2018-06-13 21:50:00     73.0  HUPA0001P
3 2018-06-13 22:05:00    102.0  HUPA0001P
4 2018-06-13 22:20:00    128.0  HUPA0001P
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1290 entries, 0 to 1289
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   time        1290 non-null   datetime64[ns]
 1   glucose     1290 non-null   float64       
 2   patient_id  1290 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 30.4+ KB
None


In [38]:
# --- STEP 6: JOURNAL-ALIGNED PREPROCESSING (HUPA-UCM 2024) ---
# Round timestamps, resample to 15-min intervals, and interpolate back to 5-min intervals

final_patients = []

for pid, p_df in sensor_df.groupby('patient_id'):
    print(f"Processing {pid}...")

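    # Sort by time and round timestamps to the nearest 5 minutes
    p_df = p_df.sort_values('time')
    p_df['time'] = p_df['time'].dt.round('5min')

    # Step 1: subsample to 15-min intervals (first reading in each bin).
    # Only the numeric glucose column is kept so that the interpolation below
    # is not applied to the object-dtype patient_id column.
    sensor_15min = (
        p_df
          .set_index('time')[['glucose']]
          .resample('15min')
          .first()
          .dropna(subset=['glucose'])
    )

    # Step 2: upsample to 5-min intervals with linear interpolation
    sensor_5min = (
        sensor_15min
          .resample('5min')
          .interpolate(method='linear')
          .reset_index()
    )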

    # Add patient ID back
    sensor_5min['patient_id'] = pid
    final_patients.append(sensor_5min)

Processing HUPA0001P...
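
One thing to be aware of: the cell above resamples each patient's full series, so linear interpolation also fills any gaps longer than 30 minutes that the earlier cleaning step split on. If interpolating across such gaps is not wanted, a possible variant is to interpolate each continuous segment separately. The sketch below is one way to do that; `interpolate_within_segments` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def interpolate_within_segments(series_15min: pd.Series, max_gap_min: int = 30) -> pd.Series:
    """Hypothetical helper: upsample to 5 min without bridging gaps > max_gap_min."""
    s = series_15min.dropna().sort_index()
    # Minutes between consecutive 15-minute samples.
    gap = s.index.to_series().diff().dt.total_seconds().div(60)
    # A new segment starts after every gap above the threshold.
    segment = (gap > max_gap_min).cumsum()
    parts = [
        g.resample("5min").interpolate(method="linear")
        for _, g in s.groupby(segment)
    ]
    return pd.concat(parts)


# Made-up 15-minute series with a 90-minute hole between 21:30 and 23:00.
idx = pd.to_datetime(
    ["2018-06-13 21:15", "2018-06-13 21:30", "2018-06-13 23:00", "2018-06-13 23:15"]
)
s15 = pd.Series([100.0, 130.0, 160.0, 190.0], index=idx, name="glucose")

segmented = interpolate_within_segments(s15)
naive = s15.resample("5min").interpolate(method="linear")

print(len(segmented), "rows when segments are kept apart")
print(len(naive), "rows when the gap is bridged")
```

With this variant the output is no longer one unbroken 5-minute grid per patient, so any later step that expects a regular grid would need to handle the missing stretches.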




In [39]:
# --- STEP 7: EXPORT & VERIFICATION ---
# Combine all processed patients and save the final dataset

sensor_final = pd.concat(final_patients, ignore_index=True)

In [40]:
# Save
output_path = Path("/content/free_style_sensor_cleaned.csv")
sensor_final.to_csv(output_path, index=False)

In [41]:
print(f"\n Saved final cleaned data for all patients: {output_path}")
print(f"Total records: {len(sensor_final)}")


 Saved final cleaned data for all patients: /content/free_style_sensor_cleaned.csv
Total records: 3922


In [42]:
# --- Verify all patients were processed ---
unique_patients = sensor_final['patient_id'].unique()
print(f"\n Unique patients in final dataset: {len(unique_patients)}")
print("Patient IDs:", unique_patients)



 Unique patients in final dataset: 1
Patient IDs: ['HUPA0001P']
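
Beyond counting patients, it may be worth confirming that the exported series really sits on a uniform 5-minute grid. The sketch below counts consecutive timestamps per patient that are not exactly one step apart; `count_irregular_steps` is a hypothetical helper and the `demo` frame is made up. It could be called on `sensor_final` in the same way.

```python
import pandas as pd


def count_irregular_steps(df: pd.DataFrame, freq: str = "5min") -> int:
    """Hypothetical check: number of consecutive rows per patient not exactly `freq` apart."""
    step = pd.Timedelta(freq)
    ordered = df.sort_values(["patient_id", "time"])
    diffs = ordered.groupby("patient_id")["time"].diff()
    # The first row of each patient has no predecessor and is ignored.
    irregular = diffs.notna() & (diffs != step)
    return int(irregular.sum())


# Made-up frame with one 10-minute jump (00:10 -> 00:20).
demo = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2018-06-13 00:00", "2018-06-13 00:05", "2018-06-13 00:10", "2018-06-13 00:20"]
        ),
        "glucose": [100.0, 105.0, 110.0, 120.0],
        "patient_id": ["HUPA0001P"] * 4,
    }
)

n_irregular = count_irregular_steps(demo)
print(f"Irregular steps: {n_irregular}")
```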
