# HUPA-UCM Diabetes Dataset - Loading & Merging Notebook
**Research Paper:** [Data in Brief 55 (2024) 110559 – HUPA-UCM Diabetes Dataset](https://www.sciencedirect.com/science/article/pii/S2352340924005262#sec0003)  
**Dataset Source:** [Mendeley Data – HUPA-UCM Diabetes Dataset (DOI: 10.17632/3hbcscwz44.1)](https://data.mendeley.com/datasets/3hbcscwz44/1)  

# Purpose of this notebook

This notebook performs the first stage of data preparation for the HUPA-UCM Diabetes Dataset, which contains real physiological and behavioral data from 25 individuals with Type 1 Diabetes Mellitus (T1DM).
Each patient's folder includes data collected from three devices:

- FreeStyle Libre 2 Sensor → continuous glucose monitoring (CGM) readings

- Insulin Pump (Medtronic/Roche) → basal and bolus insulin infusion data

- Fitbit Ionic Watch → physical activity, heart rate, calories burned, and sleep quality

The main objective here is to:

- Explore and verify the raw dataset structure

- Load and merge individual CSV files for each patient

- Unify and clean folder naming inconsistencies

- Merge all available per-patient data by device type (Fitbit, Sensor, Pump)

For this stage, I selected **five patients** with complete and consistent records  
(**HUPA0001P, HUPA0003P, HUPA0005P, HUPA0006P, and HUPA0007P**) to test and validate the merging process.  
Validating on this smaller subset first makes problems easier to catch before scaling to all 25 patients.  

This work forms the foundation for later preprocessing and modeling steps, including  
glucose prediction, hypoglycemia/hyperglycemia event detection, and physiological pattern analysis.



### Imports

In [2]:
import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

## Exploring and Loading the Raw Dataset

The raw HUPA-UCM Diabetes Dataset is organized by **patient folders**, each representing one of the 25 Type 1 Diabetes participants.  
Each patient folder (e.g., `HUPA0001P`, `HUPA0002P`, etc.) is expected to contain three subfolders, one per data source (the checks later in this notebook show that not every patient has all three):

- **`fitbit/`** → activity data such as steps, calories, heart rate, and sleep.  
- **`free_style_sensor/`** → continuous glucose monitoring (CGM) readings (from FreeStyle Libre 2).  
- **`medtronic_insulin_pump/`** → insulin infusion data (basal and bolus doses).

In this section, we will:
1. **Verify the directory structure** to ensure all 25 patient folders exist.  
2. **Inspect the contents** of a few sample folders.  
3. **Load the data** from Fitbit, CGM, and insulin pump CSVs for a single patient to understand their format.  
4. **Prepare to merge** the data sources based on timestamps to create unified time-series records.

This step is essential before preprocessing: it helps us understand data alignment, missing files, and how each device's readings will be synchronized later.
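As a rough illustration of what timestamp-based synchronization could look like later, the sketch below aligns two toy tables with `pd.merge_asof`. The column names (`time`, `glucose`, `heart_rate`), the values, and the 5-minute tolerance are all assumptions made for the example; the real CSV headers are not shown in this notebook.

```python
import pandas as pd

# Toy data with hypothetical column names; not taken from the dataset.
cgm = pd.DataFrame({
    "time": pd.to_datetime(["2018-06-13 10:00", "2018-06-13 10:15", "2018-06-13 10:30"]),
    "glucose": [110, 125, 140],
})
heart = pd.DataFrame({
    "time": pd.to_datetime(["2018-06-13 10:01", "2018-06-13 10:14", "2018-06-13 10:50"]),
    "heart_rate": [72, 80, 95],
})

# merge_asof requires both frames to be sorted by the key column.
aligned = pd.merge_asof(
    cgm.sort_values("time"),
    heart.sort_values("time"),
    on="time",
    direction="nearest",             # closest reading before or after
    tolerance=pd.Timedelta("5min"),  # assumed window; farther readings become NaN
)
print(aligned)
```

Each CGM row keeps its own timestamp and picks up the nearest heart-rate reading within the tolerance; rows without a close enough match are left empty rather than dropped.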


In [3]:
raw_path = "/content/drive/MyDrive/GP Preprocessing/Raw_Data"
patients = sorted(os.listdir(raw_path))

print(f"Found {len(patients)} patients:", patients)

Found 25 patients: ['HUPA0001P', 'HUPA0002P', 'HUPA0003P', 'HUPA0004P', 'HUPA0005P', 'HUPA0006P', 'HUPA0007P', 'HUPA0009P', 'HUPA0010P', 'HUPA0011P', 'HUPA0014P', 'HUPA0015P', 'HUPA0016P', 'HUPA0017P', 'HUPA0018P', 'HUPA0019P', 'HUPA0020P', 'HUPA0021P', 'HUPA0022P', 'HUPA0023P', 'HUPA0024P', 'HUPA0025P', 'HUPA0026P', 'HUPA0027P', 'HUPA0028P']
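Counting the patient folders does not yet confirm that each one contains the expected subfolders. A minimal sketch of such a check follows; the helper name `audit_patient_folders` is hypothetical, and the demo runs on a throwaway directory tree rather than the real dataset.

```python
import os
import tempfile

EXPECTED_SUBFOLDERS = ("fitbit", "free_style_sensor", "medtronic_insulin_pump")

def audit_patient_folders(root, expected=EXPECTED_SUBFOLDERS):
    """Return {patient_id: [missing subfolder names]} for every patient directory."""
    report = {}
    for pid in sorted(os.listdir(root)):
        p_path = os.path.join(root, pid)
        if not os.path.isdir(p_path):
            continue
        present = {d for d in os.listdir(p_path)
                   if os.path.isdir(os.path.join(p_path, d))}
        report[pid] = [name for name in expected if name not in present]
    return report

# Demo on a temporary toy layout (patient IDs here are made up).
with tempfile.TemporaryDirectory() as demo_root:
    for sub in EXPECTED_SUBFOLDERS:
        os.makedirs(os.path.join(demo_root, "P_A", sub))
    os.makedirs(os.path.join(demo_root, "P_B", "fitbit"))
    demo_report = audit_patient_folders(demo_root)

print(demo_report)
```

On the real data this would be called as `audit_patient_folders(raw_path)`; an empty list means the patient has all three subfolders.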


In [6]:
# Inspect just one patient to check the structure of the folders
for pid in patients[:1]:
    p_path = os.path.join(raw_path, pid)
    print(f"\n{pid} folders and files:")
    for sub in os.listdir(p_path):
        sub_path = os.path.join(p_path, sub)
        if os.path.isdir(sub_path):
            print(f"  📁 {sub} → {os.listdir(sub_path)}")
        else:
            print(f"  📄 {sub}")



HUPA0001P folders and files:
  📁 free_style_sensor → ['HUPA0001P_free_style_sensor_2018-06-13_2018-06-27.csv']
  📁 medtronic_insulin_pump → ['HUPA0001P_medtronic_insulin_pump_&_sensor_2018-06-13_2018-06-27.csv']
  📁 fitbit → ['HUPA0001P_cals_2018-07-04.csv', 'HUPA0001P_sleep_2018-06-25_night_summary.csv', 'HUPA0001P_cals_2018-06-21.csv', 'HUPA0001P_sleep_2018-06-16_night_summary.csv', 'HUPA0001P_sleep_2018-06-25_nap_0_summary.csv', 'HUPA0001P_sleep_2018-06-18_night.csv', 'HUPA0001P_sleep_2018-06-27_nap_0_summary.csv', 'HUPA0001P_heart_2018-06-22.csv', 'HUPA0001P_sleep_2018-06-24_night.csv', 'HUPA0001P_sleep_2018-06-22_night_summary.csv', 'HUPA0001P_sleep_2018-06-19_nap_0_summary.csv', 'HUPA0001P_heart_2018-06-16.csv', 'HUPA0001P_sleep_2018-06-27_night_summary.csv', 'HUPA0001P_cals_2018-07-01.csv', 'HUPA0001P_sleep_2018-06-25_night.csv', 'HUPA0001P_sleep_2018-06-20_night.csv', 'HUPA0001P_cals_2018-06-25.csv', 'HUPA0001P_heart_2018-06-21.csv', 'HUPA0001P_cals_2018-06-20.c

In [7]:
# Give each patient three merged CSV files (fitbit.csv, medtronic_pump.csv, free_style_sensor.csv)

def merge_patient_csvs(raw_path, save_over=True):
    patients = sorted(os.listdir(raw_path))

    for pid in patients:
        p_path = os.path.join(raw_path, pid)
        if not os.path.isdir(p_path):
            continue

        print(f"\n🔄 Processing {pid}...")

        for subfolder, outname in {
            "fitbit": "fitbit.csv",
            "free_style_sensor": "free_style_sensor.csv",
            "medtronic_insulin_pump": "medtronic_pump.csv"
        }.items():

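            folder_path = os.path.join(p_path, subfolder)
            out_path = os.path.join(p_path, outname)

            # Honor save_over: keep an existing merged file unless overwriting is allowed
            if os.path.exists(out_path) and not save_over:
                print(f"  ⚠️ {outname} already exists, skipping")
                continue

            if not os.path.exists(folder_path):
                print(f"  ⚠️ Missing folder: {subfolder}")
                continue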

            csv_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith(".csv")]
            if not csv_files:
                print(f"  ⚠️ No CSVs in {subfolder}")
                continue

            try:
                merged_df = pd.concat([pd.read_csv(f) for f in sorted(csv_files)], ignore_index=True)
                merged_df["patient_id"] = pid
                merged_df.to_csv(out_path, index=False)
                print(f"  ✅ Saved {outname} ({merged_df.shape[0]} rows)")
            except Exception as e:
                print(f"  ❌ Error merging {subfolder}: {e}")


In [8]:
# Preview Merged Files

def preview_merged_folders(raw_path):
    """Prints folder contents to confirm each patient has 3 merged CSVs."""
    for pid in sorted(os.listdir(raw_path)):
        p_path = os.path.join(raw_path, pid)
        if os.path.isdir(p_path):
            files = [f for f in os.listdir(p_path) if f.endswith(".csv")]
            print(f"\n📁 {pid}: {files}")


In [9]:
merge_patient_csvs(raw_path)



🔄 Processing HUPA0001P...
  ✅ Saved fitbit.csv (555226 rows)
  ✅ Saved free_style_sensor.csv (1549 rows)
  ✅ Saved medtronic_pump.csv (14700 rows)

🔄 Processing HUPA0002P...
  ✅ Saved fitbit.csv (341927 rows)
  ⚠️ Missing folder: free_style_sensor
  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0003P...
  ✅ Saved fitbit.csv (175616 rows)
  ✅ Saved free_style_sensor.csv (1482 rows)
  ✅ Saved medtronic_pump.csv (849 rows)

🔄 Processing HUPA0004P...
  ✅ Saved fitbit.csv (261941 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0005P...
  ✅ Saved fitbit.csv (241895 rows)
  ✅ Saved free_style_sensor.csv (1555 rows)
  ✅ Saved medtronic_pump.csv (1030 rows)

🔄 Processing HUPA0006P...
  ✅ Saved fitbit.csv (348000 rows)
  ✅ Saved free_style_sensor.csv (786 rows)
  ✅ Saved medtronic_pump.csv (2059 rows)

🔄 Processing HUPA0007P...
  ✅ Saved fitbit.csv (164914 rows)
  ✅ Saved free_style_sensor.csv (1567 rows)
  ✅ Saved medtronic_pump.csv (3272 rows)

🔄 Processing HUPA0009P...
  ✅ Saved fitbit.csv (196562 rows)
  ✅ Saved free_style_sensor.csv (3 rows)
  ✅ Saved medtronic_pump.csv (1822 rows)

🔄 Processing HUPA0010P...
  ✅ Saved fitbit.csv (269903 rows)
  ✅ Saved free_style_sensor.csv (391 rows)
  ✅ Saved medtronic_pump.csv (808 rows)

🔄 Processing HUPA0011P...
  ✅ Saved fitbit.csv (181074 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0014P...
  ✅ Saved fitbit.csv (219211 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 3, saw 5

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0015P...
  ✅ Saved fitbit.csv (487490 rows)
  ✅ Saved free_style_sensor.csv (1264 rows)
  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0016P...
  ✅ Saved fitbit.csv (567730 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 259, saw 2

  ✅ Saved medtronic_pump.csv (3202 rows)

🔄 Processing HUPA0017P...
  ✅ Saved fitbit.cs

  ✅ Saved fitbit.csv (382440 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0021P...
  ✅ Saved fitbit.csv (111678 rows)
  ✅ Saved free_style_sensor.csv (1585 rows)
  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0022P...
  ✅ Saved fitbit.csv (184630 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0023P...
  ✅ Saved fitbit.csv (324057 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0024P...
  ✅ Saved fitbit.csv (354137 rows)
  ✅ Saved free_style_sensor.csv (2921 rows)
  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0025P...
  ✅ Saved fitbit.csv (150658 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 19, saw 2

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0026P...
  ✅ Saved fitbit.csv (1869821 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 1 fields in line 14260, saw 2

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0027P...
  ✅ Saved fitbit.csv (14265331 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 5 fields in line 3, saw 19

  ⚠️ Missing folder: medtronic_insulin_pump

🔄 Processing HUPA0028P...
  ✅ Saved fitbit.csv (2857452 rows)
  ❌ Error merging free_style_sensor: Error tokenizing data. C error: Expected 5 fields in line 3, saw 19

  ⚠️ Missing folder: medtronic_insulin_pump
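The `Error tokenizing data` messages above mean the parser saw more fields on some line than it expected. This notebook does not establish the cause; a non-comma delimiter, a preamble before the header, or files with different layouts inside one folder are all possibilities. The sketch below is one hedged way to look before deciding: it counts a few candidate delimiters on the first lines of a file. The helper names and the candidate list are assumptions, and the demo uses a made-up toy file.

```python
import os
import tempfile
from collections import Counter

CANDIDATES = [",", ";", "\t", "|"]  # assumed set of delimiters worth checking

def profile_delimiters(path, max_lines=20):
    """Return [(line_no, {delimiter: count})] for the first `max_lines` lines."""
    profile = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_no, line in enumerate(handle, start=1):
            if line_no > max_lines:
                break
            profile.append((line_no, {d: line.count(d) for d in CANDIDATES}))
    return profile

def guess_delimiter(profile):
    """Pick the candidate present on the most lines (ties go to the earlier candidate)."""
    hits = Counter()
    for _, counts in profile:
        hits.update(d for d, n in counts.items() if n > 0)
    if not hits:
        return None
    return max(CANDIDATES, key=lambda d: hits[d])

# Toy file: one preamble line, then semicolon-separated rows (invented content).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write("preamble line\ntime;glucose\n2018-06-13 10:00;110\n2018-06-13 10:15;125,5\n")
    toy_path = tmp.name

toy_profile = profile_delimiters(toy_path)
toy_guess = guess_delimiter(toy_profile)
os.remove(toy_path)
print(toy_guess, toy_profile[0])
```

If a file turned out to look like the toy example, passing `sep` and `skiprows` to `pd.read_csv` would be one way to read it without discarding rows.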


In [11]:
preview_merged_folders(raw_path)



📁 HUPA0001P: ['fitbit.csv', 'free_style_sensor.csv', 'medtronic_pump.csv']

📁 HUPA0002P: ['fitbit.csv']

📁 HUPA0003P: ['fitbit.csv', 'free_style_sensor.csv', 'medtronic_pump.csv']

📁 HUPA0004P: ['fitbit.csv']

📁 HUPA0005P: ['fitbit.csv', 'free_style_sensor.csv', 'medtronic_pump.csv']

📁 HUPA0006P: ['fitbit.csv', 'free_style_sensor.csv', 'medtronic_pump.csv']

📁 HUPA0007P: ['fitbit.csv', 'free_style_sensor.csv', 'medtronic_pump.csv']

📁 HUPA0009P: ['fitbit.csv', 'free_style_sensor.csv', 'medtronic_pump.csv']

📁 HUPA0010P: ['fitbit.csv', 'free_style_sensor.csv', 'medtronic_pump.csv']

📁 HUPA0011P: ['fitbit.csv']

📁 HUPA0014P: ['fitbit.csv']

📁 HUPA0015P: ['fitbit.csv', 'free_style_sensor.csv']

📁 HUPA0016P: ['fitbit.csv', 'medtronic_pump.csv']

📁 HUPA0017P: ['fitbit.csv']

📁 HUPA0018P: ['fitbit.csv', 'free_style_sensor.csv']

📁 HUPA0019P: ['fitbit.csv', 'free_style_sensor.csv']

📁 HUPA0020P: ['fitbit.csv']

📁 HUPA0021P: ['fitbit.csv'

In [14]:
def repair_and_merge_all(raw_path):
    summary = []
    folder_aliases = {
        # NOTE: Dexcom is a CGM brand, so treating this folder as pump data may need review.
        "dexcom": "medtronic_insulin_pump",
        "Roche_insulin_pump": "medtronic_insulin_pump",
        "medtronic_insulin_pump": "medtronic_insulin_pump",
        "fitbit": "fitbit",
        "free_style_sensor": "free_style_sensor"
    }

    for pid in sorted(os.listdir(raw_path)):
        p_path = os.path.join(raw_path, pid)
        if not os.path.isdir(p_path):
            continue

        print(f"\n🧩 Repairing {pid}...")
        patient_summary = {"patient": pid}

        # unify folder names if misnamed
        for sub in os.listdir(p_path):
            old_path = os.path.join(p_path, sub)
            if os.path.isdir(old_path) and sub in folder_aliases and sub != folder_aliases[sub]:
                new_path = os.path.join(p_path, folder_aliases[sub])
                if not os.path.exists(new_path):
                    os.rename(old_path, new_path)
                    print(f"  🔧 Renamed {sub} → {folder_aliases[sub]}")

        # now handle merging again safely
        for folder, outname in {
            "fitbit": "fitbit.csv",
            "free_style_sensor": "free_style_sensor.csv",
            "medtronic_insulin_pump": "medtronic_pump.csv"
        }.items():

            folder_path = os.path.join(p_path, folder)
            out_path = os.path.join(p_path, outname)

            if not os.path.exists(folder_path):
                patient_summary[folder] = "missing"
                continue

            csv_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith(".csv")]
            if not csv_files:
                patient_summary[folder] = "empty"
                continue

            dfs = []
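            for f in sorted(csv_files):  # sorted for a reproducible row order
                try:
                    # on_bad_lines='skip' silently drops rows with too many fields
                    df = pd.read_csv(f, on_bad_lines='skip', low_memory=False)
                    dfs.append(df)
                except Exception as e:
                    print(f"  ⚠️ Skipping bad file {f}: {e}")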
            if dfs:
                merged = pd.concat(dfs, ignore_index=True)
                merged["patient_id"] = pid
                merged.to_csv(out_path, index=False)
                patient_summary[folder] = f"{merged.shape[0]} rows"
                print(f"  ✅ {folder} merged → {merged.shape[0]} rows")
            else:
                patient_summary[folder] = "no valid files"

        summary.append(patient_summary)

    return pd.DataFrame(summary)
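Because `on_bad_lines='skip'` discards malformed rows without reporting how many, it may be worth estimating the loss per file. pandas documents a bad line as one with too many fields, so the sketch below counts data rows that are wider than the header using the standard `csv` module. It is an approximation (quoting rules and header detection can differ from the pandas parser), the helper name `count_overwide_rows` is hypothetical, and the demo file is invented.

```python
import csv
import os
import tempfile

def count_overwide_rows(path, delimiter=","):
    """Return (data_rows, rows_with_more_fields_than_the_header)."""
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:          # empty file
            return 0, 0
        total = overwide = 0
        for row in reader:
            if not row:             # ignore blank lines
                continue
            total += 1
            if len(row) > len(header):
                overwide += 1
    return total, overwide

# Toy file: header with 2 fields, one normal row, one over-wide row, one short row.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write("a,b\n1,2\n3,4,5\n6\n")
    toy_path = tmp.name

toy_counts = count_overwide_rows(toy_path)
os.remove(toy_path)
print(toy_counts)
```

A large share of over-wide rows would suggest the file needs a different delimiter or header row rather than row skipping.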


In [15]:
summary_df = repair_and_merge_all(raw_path)
summary_df


🧩 Repairing HUPA0001P...
  ✅ fitbit merged → 555226 rows
  ✅ free_style_sensor merged → 1549 rows
  ✅ medtronic_insulin_pump merged → 14700 rows

🧩 Repairing HUPA0002P...
  ✅ fitbit merged → 341927 rows

🧩 Repairing HUPA0003P...
  ✅ fitbit merged → 175616 rows
  ✅ free_style_sensor merged → 1482 rows
  ✅ medtronic_insulin_pump merged → 849 rows

🧩 Repairing HUPA0004P...
  ✅ fitbit merged → 261941 rows
  ✅ free_style_sensor merged → 965 rows

🧩 Repairing HUPA0005P...
  ✅ fitbit merged → 241895 rows
  ✅ free_style_sensor merged → 1555 rows
  ✅ medtronic_insulin_pump merged → 1030 rows

🧩 Repairing HUPA0006P...
  ✅ fitbit merged → 348000 rows
  ✅ free_style_sensor merged → 786 rows
  ✅ medtronic_insulin_pump merged → 2059 rows

🧩 Repairing HUPA0007P...
  ✅ fitbit merged → 164914 rows
  ✅ free_style_sensor merged → 1567 rows
  ✅ medtronic_insulin_pump merged → 3272 rows

🧩 Repairing HUPA0009P...
  ✅ fitbit merged → 196562 rows
  ✅ free_style_sensor merged → 3 rows
  ✅ medtronic_insulin_pump merged → 1822 rows

🧩 Repairing HUPA0010P...
  ✅ fitbit merged → 269903 rows
  ✅ free_style_sensor merged → 391 rows
  ✅ medtronic_insulin_pump merged → 808 rows

🧩 Repairing HUPA0011P...
  ✅ fitbit merged → 181074 rows
  ✅ free_style_sensor merged → 5004 rows

🧩 Repairing HUPA0014P...
  ✅ fitbit merged → 219211 rows
  ✅ free_style_sensor merged → 1253 rows

🧩 Repairing HUPA0015P...
  ✅ fitbit merged → 487490 rows
  ✅ free_style_sensor merged → 1264 rows

🧩 Repairing HUPA0016P...
  ✅ fitbit merged → 567730 rows
  ✅ free_style_sensor merged → 6474 rows
  ✅ medtronic_insulin_pump merged → 3202 rows

🧩 Repairing HUPA0017P...
  ✅ fitbit merged → 216977 rows
  ✅ free_style_sensor merged → 1494 rows

🧩 Repairing HUPA0018P...
  ✅ fitbit merged → 510965 rows
  ✅ free_style_sensor merged → 1419 rows

🧩 Repairing HUPA0019P...
  ✅ fitbit merged → 154741 rows
  ✅ free_style_sensor merged → 1620 rows

🧩 Repairing HUPA0020P...
  ✅ fitbit merged → 382440 rows
  ✅ free_style_sensor merged → 119 rows

🧩 Repairing HUPA0021P...
  ✅ fitbit merged → 111678 rows
  ✅ free_style_sensor merged → 1585 rows

🧩 Repairing HUPA0022P...
  ✅ fitbit merged → 184630 rows
  ✅ free_style_sensor merged → 202 rows

🧩 Repairing HUPA0023P...
  ✅ fitbit merged → 324057 rows
  ✅ free_style_sensor merged → 283 rows

🧩 Repairing HUPA0024P...
  ✅ fitbit merged → 354137 rows
  ✅ free_style_sensor merged → 2921 rows

🧩 Repairing HUPA0025P...
  ✅ fitbit merged → 150658 rows
  ✅ free_style_sensor merged → 1797 rows

🧩 Repairing HUPA0026P...
  ✅ fitbit merged → 1869821 rows
  ✅ free_style_sensor merged → 16215 rows

🧩 Repairing HUPA0027P...
  🔧 Renamed dexcom → medtronic_insulin_pump
  ✅ fitbit merged → 14265331 rows
  ✅ free_style_sensor merged → 15372 rows
  ✅ medtronic_insulin_pump merged → 25322 rows

🧩 Repairing HUPA0028P...
  ✅ fitbit merged → 2857452 rows
  ✅ free_style_sensor merged → 1 rows

|   | patient | fitbit | free_style_sensor | medtronic_insulin_pump |
|---|---|---|---|---|
| 0 | HUPA0001P | 555226 rows | 1549 rows | 14700 rows |
| 1 | HUPA0002P | 341927 rows | missing | missing |
| 2 | HUPA0003P | 175616 rows | 1482 rows | 849 rows |
| 3 | HUPA0004P | 261941 rows | 965 rows | missing |
| 4 | HUPA0005P | 241895 rows | 1555 rows | 1030 rows |
| 5 | HUPA0006P | 348000 rows | 786 rows | 2059 rows |
| 6 | HUPA0007P | 164914 rows | 1567 rows | 3272 rows |
| 7 | HUPA0009P | 196562 rows | 3 rows | 1822 rows |
| 8 | HUPA0010P | 269903 rows | 391 rows | 808 rows |
| 9 | HUPA0011P | 181074 rows | 5004 rows | missing |
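The summary can also be filtered programmatically to shortlist patients with usable data from all three devices. The sketch below works on a list of dicts shaped like the summary rows (for example `summary_df.to_dict("records")`); the helper names and the `min_rows=500` threshold are assumptions for illustration, not the criterion used in this notebook. The sample data are the ten rows displayed above.

```python
DEVICES = ("fitbit", "free_style_sensor", "medtronic_insulin_pump")

def parse_rows(cell):
    """Turn '1549 rows' into 1549; 'missing', 'empty', etc. count as 0."""
    parts = str(cell).split()
    return int(parts[0]) if parts and parts[0].isdigit() else 0

def select_complete(summary_rows, min_rows=500):
    """Return patient IDs whose every device has at least `min_rows` rows."""
    return [
        row["patient"]
        for row in summary_rows
        if all(parse_rows(row.get(dev, "missing")) >= min_rows for dev in DEVICES)
    ]

# The ten summary rows shown above, as (patient, fitbit, sensor, pump).
shown = [
    ("HUPA0001P", "555226 rows", "1549 rows", "14700 rows"),
    ("HUPA0002P", "341927 rows", "missing", "missing"),
    ("HUPA0003P", "175616 rows", "1482 rows", "849 rows"),
    ("HUPA0004P", "261941 rows", "965 rows", "missing"),
    ("HUPA0005P", "241895 rows", "1555 rows", "1030 rows"),
    ("HUPA0006P", "348000 rows", "786 rows", "2059 rows"),
    ("HUPA0007P", "164914 rows", "1567 rows", "3272 rows"),
    ("HUPA0009P", "196562 rows", "3 rows", "1822 rows"),
    ("HUPA0010P", "269903 rows", "391 rows", "808 rows"),
    ("HUPA0011P", "181074 rows", "5004 rows", "missing"),
]
summary_rows = [dict(zip(("patient",) + DEVICES, row)) for row in shown]

shortlist = select_complete(summary_rows)
print(shortlist)
```

A row-count threshold is only a coarse filter; it says nothing about time coverage or overlap between devices, which would need to be checked separately.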


In [None]:
# 1. Add a patient_id column to every per-patient csv
# 2. Merge them per device type -> fitbit_all.csv, medtronic_all.csv, free_style_sensor_all.csv.

def combine_patients_by_device(base_path, patient_list, output_path="merged_five_patients"):
    os.makedirs(output_path, exist_ok=True)

    # Device aliases (so naming differences don't break it)
    device_aliases = {
        "fitbit": ["fitbit"],
        "free_style_sensor": ["free_style_sensor"],
        "medtronic_insulin_pump": ["medtronic_insulin_pump", "medtronic_pump"]
    }

    combined = {device: [] for device in device_aliases}

    for pid in patient_list:
        p_path = os.path.join(base_path, pid)
        if not os.path.exists(p_path):
            print(f"⚠️ Skipping {pid}: folder missing")
            continue

        for device, aliases in device_aliases.items():
            csv_path = None
            for alias in aliases:
                test_path = os.path.join(p_path, f"{alias}.csv")
                if os.path.exists(test_path):
                    csv_path = test_path
                    break

            if csv_path:
                try:
                    df = pd.read_csv(csv_path, low_memory=False)
                    df["patient_id"] = pid
                    combined[device].append(df)
                    print(f"✅ Added {device} for {pid} ({len(df)} rows)")
                except Exception as e:
                    print(f"❌ Error reading {device} for {pid}: {e}")
            else:
                print(f"⚠️ Missing {device} for {pid}")

    # Save combined outputs
    for device, dfs in combined.items():
        if dfs:
            merged = pd.concat(dfs, ignore_index=True)
            save_path = os.path.join(output_path, f"{device}_all.csv")
            merged.to_csv(save_path, index=False)
            print(f"💾 Saved {save_path} ({len(merged)} total rows)")
        else:
            print(f"⚠️ No data found for {device}")

    print("\n Done combining all selected patients!")


### Some patients have missing data, so only five patients with complete records are combined here

In [25]:
sample_patients = ["HUPA0001P", "HUPA0003P", "HUPA0005P", "HUPA0006P", "HUPA0007P"]
combine_patients_by_device(base_path=raw_path, patient_list=sample_patients)

✅ Added fitbit for HUPA0001P (555226 rows)
✅ Added free_style_sensor for HUPA0001P (1549 rows)
✅ Added medtronic_insulin_pump for HUPA0001P (14700 rows)
✅ Added fitbit for HUPA0003P (175616 rows)
✅ Added free_style_sensor for HUPA0003P (1482 rows)
✅ Added medtronic_insulin_pump for HUPA0003P (849 rows)
✅ Added fitbit for HUPA0005P (241895 rows)
✅ Added free_style_sensor for HUPA0005P (1555 rows)
✅ Added medtronic_insulin_pump for HUPA0005P (1030 rows)
✅ Added fitbit for HUPA0006P (348000 rows)
✅ Added free_style_sensor for HUPA0006P (786 rows)
✅ Added medtronic_insulin_pump for HUPA0006P (2059 rows)
✅ Added fitbit for HUPA0007P (164914 rows)
✅ Added free_style_sensor for HUPA0007P (1567 rows)
✅ Added medtronic_insulin_pump for HUPA0007P (3272 rows)
💾 Saved merged_across_patients/fitbit_all.csv (1485651 total rows)
💾 Saved merged_across_patients/free_style_sensor_all.csv (6939 total rows)
💾 Saved merged_across_patients/medtronic_insulin_pump_all.cs
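A quick sanity check on the combined files is that each reported total equals the sum of the per-patient row counts logged above. The sketch below does that arithmetic for the two devices whose totals are fully visible in the log; the pump total is cut off in the output, so it is left out. The helper name `check_totals` is hypothetical.

```python
def check_totals(per_patient_rows, reported_totals):
    """Return {device: True/False} for whether the per-patient counts sum to the reported total."""
    return {
        device: sum(per_patient_rows[device]) == total
        for device, total in reported_totals.items()
    }

# Row counts copied from the log above (patients 0001, 0003, 0005, 0006, 0007).
per_patient_rows = {
    "fitbit": [555226, 175616, 241895, 348000, 164914],
    "free_style_sensor": [1549, 1482, 1555, 786, 1567],
}
reported_totals = {"fitbit": 1485651, "free_style_sensor": 6939}

totals_ok = check_totals(per_patient_rows, reported_totals)
print(totals_ok)
```

Matching totals confirm that no rows were lost or duplicated during concatenation, though they do not validate the contents of those rows.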