# Table of Contents
- [Introduction](#Introduction)
    - [Capstone 2 Project Scope](#Capstone-2-Project-Scope)
    - [Data Available](#Data-Available)
    - [References](#References)
- [Data Preparation](#Data-Preparation)
    - [Import necessary libraries](#Import-necessary-libraries)
    - [Patient_001 Example - Loading wristband + CGM data frames](#Patient_001-Example---Loading-Wristband--CGM-data-frames)
        - [Inspect DFs and column names](#Inspect-DFs-and-column-names)
        - [Column renaming for consistency, strip unnecessary characters, convert to datetime](#Column-renaming-for-consistency-strip-unnecessary-characters-convert-to-datetime)
    - [**Data sampling mismatch**](#Data-sampling-mismatch)
        - [Early feature engineering on respective sampling periods](#Early-feature-engineering-on-respective-sampling-periods)
        - [Acceleration magnitude](#Acceleration-magnitude)
        - [Peak Detection in electrodermal activity (EDA)](#Peak-Detection-in-electrodermal-activity-EDA)
        - [5 minute aggregate features - mean, std, min, max, 25% quantile, 75% quantile, and skewness](#5-minute-aggregate-features---mean-std-min-max-25-quantile-75-quantile-and-skewness)
        - [Merge each dataframe resampling on our glucose measurements (5min)](#Merge-each-dataframe-resampling-on-our-glucose-measurements-5min)
        - [Drop NaN values](#Drop-NaN-values)
        - [Inspect merged_df](#Inspect-merged_df)
    - [Universal code to wrangle the data (similar as we did for Patient_001) for each of the 16 patients, creating one large df with each patient's data](#Universal-code-to-wrangle-the-data-similar-as-we-did-for-patient_001-for-each-of-the-16-patients-creating-one-large-df-with-each-patients-data)
        - [Same wrangling steps as before](#Universal-code-to-wrangle-the-data-similar-as-we-did-for-patient_001-for-each-of-the-16-patients-creating-one-large-df-with-each-patients-data)
        - [Add column with Patient ID from 1 to 16](#Universal-code-to-wrangle-the-data-similar-as-we-did-for-patient_001-for-each-of-the-16-patients-creating-one-large-df-with-each-patients-data)
    - [Demographics CSV](#Demographics-csv)
        - [Includes gender, HbA1c, and patient ID](#Includes-gender-hba1c-and-patient-id)
        - [Load and create pandas df from demographics_csv](#Load-and-create-pandas-df-from-demographics_csv)
        - [Merge previous df with demographics_df](#Merge-previous-df-with-demographics_df)
        - [Organize columns in a more intuitive order](#Organize-columns-in-a-more-intuitive-order)
        - [Save df with wearables+demographic data to patient_df.csv](#Save-df-with-wearablesdemographic-data-to-patient_df.csv)


# Introduction
Prediabetes affects one in three people and has a 10% annual conversion rate to type 2 diabetes without lifestyle or medical interventions. Managing glycemic health is essential to prevent progression to type 2 diabetes. However, there is currently no commercially available, noninvasive method for monitoring glycemic health to aid in self-management of prediabetes, so there is a critical need for innovative, practical strategies to improve monitoring and management. This study uses a dataset of 25,000 simultaneous interstitial glucose and noninvasive wearable smartwatch measurements. The goal is to demonstrate the feasibility of using noninvasive smartwatches and food logs, recorded over 10 days, to continuously detect personalized glucose deviations and to predict the exact interstitial glucose value in real time.

### Capstone 2 Project Scope
- My goal is to re-create and improve on the model from the Nature publication.
- The authors predicted interstitial glucose (mg/dL) with an RMSE of 21.2 and a MAPE of 14.3%.
- From the initial 8 features, the authors engineered a total of 69 features for their final model.
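
For reference, the two error metrics quoted above can be computed as in the minimal sketch below. The helper names `rmse` and `mape` are my own and the glucose values are made up for illustration; none of this is taken from the publication's code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Made-up glucose readings and predictions (mg/dL), for illustration only
y_true = [100, 120]
y_pred = [110, 110]

print(f"RMSE: {rmse(y_true, y_pred):.2f} mg/dL")
print(f"MAPE: {mape(y_true, y_pred):.2f} %")
```

RMSE penalizes large misses more heavily, while MAPE expresses the error relative to the true glucose level, so the two are worth reporting together.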

### Data Available

Each patient has their own set of files. Note that the sampling period differs from file to file, and that the Dexcom glucose value is the target variable we want to predict.
<br>

| CSV              | Description                                                    | Source                                         | Median Sampling Period |
|------------------|----------------------------------------------------------------|------------------------------------------------|------------------------|
| **ACC_001**      | Tri-axial accelerometry (X-Y-Z)                                | Empatica E4 wrist-worn device                  | 0.03125 seconds        |
| **BVP_001**      | Blood volume pulse                                             | Empatica E4 wrist-worn device                  | 0.015625 seconds       |
| **Dexcom_001**   | Interstitial glucose concentration (mg/dL)                     | Dexcom G6, a continuous glucose monitor system | 300.0 seconds          |
| **EDA_001**      | Electrodermal activity                                         | Empatica E4 wrist-worn device                  | 0.25 seconds           |
| **HR_001**       | Heart Rate                                                     | Empatica E4 wrist-worn device                  | 1.24 seconds           |
| **IBI_001**      | Interbeat interval                                             | Empatica E4 wrist-worn device                  | 0.98442 seconds        |
| **TEMP_001**     | Skin Temperature                                               | Empatica E4 wrist-worn device                  | 0.25 seconds           |
| **food_log**     | Log of food intake with timestamps and nutritional information | User input                                     | As needed              |
| **demographics_csv** | Sex, HbA1c, Patient ID                                         | User input                                     | One time               |
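
To get a feel for the mismatch, the sampling periods in the table can be converted into sampling rates and into the number of samples that fall inside one 5-minute glucose interval. This is a small arithmetic sketch; the dictionary names are my own and only a few of the files are included.

```python
# Median sampling periods in seconds, copied from the table above
median_period_s = {
    "ACC": 0.03125,
    "BVP": 0.015625,
    "EDA": 0.25,
    "TEMP": 0.25,
    "Dexcom": 300.0,
}

window_s = 300.0  # one Dexcom interval (5 minutes)

# Sampling rate in Hz and expected samples per 5-minute window
rate_hz = {name: 1.0 / period for name, period in median_period_s.items()}
samples_per_window = {name: window_s / period for name, period in median_period_s.items()}

for name in median_period_s:
    print(f"{name}: {rate_hz[name]:g} Hz, {samples_per_window[name]:g} samples per 5 min")
```

Each glucose reading therefore has to be paired with thousands of wristband samples, which is why the features are aggregated per window before merging.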

<br>




### References
- The dataset is publicly available from PhysioNet: [Dataset](https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/001/#files-panel)
- The Nature publication is available without a paywall: [Nature publication](https://www.nature.com/articles/s41746-021-00465-w#article-info)





# Data Preparation

### Import necessary libraries

In [1]:
# Importing necessary libraries
import pandas as pd
from scipy.signal import find_peaks

### Patient_001 Example - Loading wristband + CGM data frames

In this section we load a single patient's data to see what it looks like and to work out how best to clean and reshape it. Further down this notebook we streamline the whole process into a single step.

In [2]:
# Filepaths in the local directory
filepaths = ['Data/001/ACC_001.csv', 'Data/001/BVP_001.csv', 'Data/001/Dexcom_001.csv', 'Data/001/EDA_001.csv', 'Data/001/HR_001.csv', 'Data/001/IBI_001.csv', 'Data/001/TEMP_001.csv']

# Dictionary to store the dataframes
dfs = {}

for csv in filepaths:
    key = csv.split('/')[-1].split('.')[0]  # Get the filename without the extension
    dfs[key] = pd.read_csv(csv)  # Read the csv file and store the DataFrame in the dictionary

### Inspect DFs and column names

In [3]:
# Display the DFs and respective columns within the dictionary
for key, df in dfs.items():
    print(f"DataFrame: {key}")
    print("Columns:", df.columns.tolist())
    print()

DataFrame: ACC_001
Columns: ['datetime', ' acc_x', ' acc_y', ' acc_z']

DataFrame: BVP_001
Columns: ['datetime', ' bvp']

DataFrame: Dexcom_001
Columns: ['Index', 'Timestamp (YYYY-MM-DDThh:mm:ss)', 'Event Type', 'Event Subtype', 'Patient Info', 'Device Info', 'Source Device ID', 'Glucose Value (mg/dL)', 'Insulin Value (u)', 'Carb Value (grams)', 'Duration (hh:mm:ss)', 'Glucose Rate of Change (mg/dL/min)', 'Transmitter Time (Long Integer)']

DataFrame: EDA_001
Columns: ['datetime', ' eda']

DataFrame: HR_001
Columns: ['datetime', ' hr']

DataFrame: IBI_001
Columns: ['datetime', ' ibi']

DataFrame: TEMP_001
Columns: ['datetime', ' temp']



### Column renaming for consistency, strip unnecessary characters, convert to datetime

In [4]:
# Remove leading/trailing spaces from column names
for key, df in dfs.items():
    df.columns = df.columns.str.strip()

# Rename the column 'Timestamp ...' in Dexcom_001 to 'datetime'
dfs['Dexcom_001'] = dfs['Dexcom_001'].rename(columns={'Timestamp (YYYY-MM-DDThh:mm:ss)': 'datetime'})

# Convert 'datetime' columns to datetime dtype
for key in dfs.keys():
    if 'datetime' in dfs[key].columns:
        dfs[key]['datetime'] = pd.to_datetime(dfs[key]['datetime'])
        
# Display the DFs and respective columns within the dictionary
for key, df in dfs.items():
    print(f"DataFrame: {key}")
    print("Columns:", df.columns.tolist())
    print()



DataFrame: ACC_001
Columns: ['datetime', 'acc_x', 'acc_y', 'acc_z']

DataFrame: BVP_001
Columns: ['datetime', 'bvp']

DataFrame: Dexcom_001
Columns: ['Index', 'datetime', 'Event Type', 'Event Subtype', 'Patient Info', 'Device Info', 'Source Device ID', 'Glucose Value (mg/dL)', 'Insulin Value (u)', 'Carb Value (grams)', 'Duration (hh:mm:ss)', 'Glucose Rate of Change (mg/dL/min)', 'Transmitter Time (Long Integer)']

DataFrame: EDA_001
Columns: ['datetime', 'eda']

DataFrame: HR_001
Columns: ['datetime', 'hr']

DataFrame: IBI_001
Columns: ['datetime', 'ibi']

DataFrame: TEMP_001
Columns: ['datetime', 'temp']



In [5]:
# Keep only the datetime and glucose columns in Dexcom_001
dfs['Dexcom_001'] = dfs['Dexcom_001'][['datetime', 'Glucose Value (mg/dL)']]

dfs['Dexcom_001'] = dfs['Dexcom_001'].rename(columns={"Glucose Value (mg/dL)": "glucose"})

# Remove rows where 'datetime' is null/NaN (the first twelve rows)
dfs['Dexcom_001'].dropna(subset=['datetime'], inplace=True)
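
A toy version of this clean-up shows how rows without a timestamp are dropped. The three-row frame below is made up (it is not the real Dexcom export); only the column names mirror the real file.

```python
import pandas as pd

# Invented stand-in for a Dexcom export: one metadata row with no timestamp,
# followed by two glucose readings
toy = pd.DataFrame({
    "Timestamp (YYYY-MM-DDThh:mm:ss)": [None, "2020-02-13T17:23:32", "2020-02-13T17:28:32"],
    "Event Type": ["metadata", "EGV", "EGV"],
    "Glucose Value (mg/dL)": [None, 61.0, 59.0],
})

toy = toy.rename(columns={
    "Timestamp (YYYY-MM-DDThh:mm:ss)": "datetime",
    "Glucose Value (mg/dL)": "glucose",
})
toy["datetime"] = pd.to_datetime(toy["datetime"])  # missing values become NaT

# Keep the two columns of interest and drop rows with no timestamp
toy = toy[["datetime", "glucose"]].dropna(subset=["datetime"])
print(toy)
```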

In [6]:
# Set 'datetime' column as index
for df in dfs.values():
    df.set_index('datetime', inplace=True)

In [7]:
for key, df in dfs.items():
    dfs[key] = df.sort_index()

In [8]:
for key, df in dfs.items():
    if df.index.is_monotonic_increasing or df.index.is_monotonic_decreasing:
        print(f"Index for DataFrame: {key} is monotonic")
    else:
        print(f"Index for DataFrame: {key} is NOT monotonic")

Index for DataFrame: ACC_001 is monotonic
Index for DataFrame: BVP_001 is monotonic
Index for DataFrame: Dexcom_001 is monotonic
Index for DataFrame: EDA_001 is monotonic
Index for DataFrame: HR_001 is monotonic
Index for DataFrame: IBI_001 is monotonic
Index for DataFrame: TEMP_001 is monotonic


### Data sampling mismatch

Because the signals come from several sensors recorded at different rates, we inspect each file's sampling period before combining them.

In [9]:
for key, df in dfs.items():
    # Calculate time delta series
    timedelta_series = df.index.to_series().diff()

    # Compute mean/median of timedelta_series in seconds
    mean_sampling_period_seconds = timedelta_series.mean().total_seconds()
    median_sampling_period_seconds = timedelta_series.median().total_seconds()

    # Count the number of samples
    num_samples = len(df)

    # Get the start time and end time
    start_time = df.index.min()
    end_time = df.index.max()

    print(f"For DataFrame '{key}', mean sampling period is {mean_sampling_period_seconds} seconds, \
          median sampling period is {median_sampling_period_seconds} seconds,\
          there are {num_samples} samples,\
          start time is {start_time}, and end time is {end_time}.\n")


For DataFrame 'ACC_001', mean sampling period is 0.038747 seconds,           median sampling period is 0.03125 seconds,          there are 20296428 samples,          start time is 2020-02-13 15:28:50, and end time is 2020-02-22 17:56:03.843750.

For DataFrame 'BVP_001', mean sampling period is 0.019373 seconds,           median sampling period is 0.015625 seconds,          there are 40592838 samples,          start time is 2020-02-13 15:28:50, and end time is 2020-02-22 17:56:03.781250.

For DataFrame 'Dexcom_001', mean sampling period is 304.449609 seconds,           median sampling period is 300.0 seconds,          there are 2561 samples,          start time is 2020-02-13 17:23:32, and end time is 2020-02-22 17:53:23.

For DataFrame 'EDA_001', mean sampling period is 0.30998 seconds,           median sampling period is 0.25 seconds,          there are 2537046 samples,          start time is 2020-02-13 15:28:50, and end time is 2020-02-22 17:56:03.250000.

For DataFrame 'HR_001', mean


| Dataframe   | Mean Sampling Period | Median Sampling Period | Number of Samples | Start Time                | End Time                  |
|-------------|----------------------|------------------------|-------------------|---------------------------|---------------------------|
| ACC_001     | 0.038747 seconds     | 0.03125 seconds        | 20296428          | 2020-02-13 15:28:50       | 2020-02-22 17:56:03.843750|
| BVP_001     | 0.019373 seconds     | 0.015625 seconds       | 40592838          | 2020-02-13 15:28:50       | 2020-02-22 17:56:03.781250|
| Dexcom_001  | 304.449609 seconds   | 300.0 seconds          | 2561              | 2020-02-13 17:23:32       | 2020-02-22 17:53:23       |
| EDA_001     | 0.30998 seconds      | 0.25 seconds           | 2537046           | 2020-02-13 15:28:50       | 2020-02-22 17:56:03.250000|
| HR_001_new  | 1.240044 seconds     | 0.0 seconds            | 634188            | 2020-02-13 15:29:00       | 2020-02-22 17:56:00       |
| IBI_001     | 2.950438 seconds     | 0.98442 seconds        | 266366            | 2020-02-13 15:33:22.059328| 2020-02-22 17:51:35.691598|
| TEMP_001    | 0.30998 seconds      | 0.25 seconds           | 2537040           | 2020-02-13 15:28:50       | 2020-02-22 17:56:03.750000|
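
In every row of the table the mean sampling period is larger than the median. One plausible explanation (an assumption on my part, not verified against the raw files) is that occasional recording gaps inflate the mean while leaving the median untouched. The synthetic timestamps below illustrate the effect.

```python
import pandas as pd

# Synthetic 4 Hz timestamps: 8 samples, a 10-second gap, then 8 more samples
first = pd.date_range("2020-01-01 00:00:00", periods=8, freq="250ms")
second = pd.date_range(first[-1] + pd.Timedelta(seconds=10), periods=8, freq="250ms")
index = first.append(second)

# Time difference between consecutive samples
deltas = index.to_series().diff().dropna()

mean_s = deltas.mean().total_seconds()
median_s = deltas.median().total_seconds()
print(f"mean = {mean_s} s, median = {median_s} s")
```

A single gap is enough to pull the mean well above the nominal period, so the median is the better estimate of the device's sampling rate.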



### Early feature engineering on respective sampling periods

### Acceleration magnitude

In [10]:
# Create 'acc' column in 'ACC_001' DataFrame: Euclidean magnitude sqrt(x^2 + y^2 + z^2)
dfs['ACC_001']['acc'] = (dfs['ACC_001'][['acc_x', 'acc_y', 'acc_z']] ** 2).sum(axis=1) ** 0.5
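
As a reminder of what the magnitude measures: for a tri-axial signal it is the Euclidean norm, sqrt(x² + y² + z²). The sketch below uses made-up samples to show how it differs from simply summing the axes, where opposite-signed axes cancel each other out. The frame `toy_acc` and its values are invented for illustration.

```python
import pandas as pd

# Two invented accelerometer samples
toy_acc = pd.DataFrame({
    "acc_x": [3.0, 10.0],
    "acc_y": [4.0, -10.0],
    "acc_z": [0.0, 0.0],
})
axes = ["acc_x", "acc_y", "acc_z"]

# Euclidean magnitude of each sample
toy_acc["magnitude"] = (toy_acc[axes] ** 2).sum(axis=1) ** 0.5

# Absolute value of the summed axes, shown only for contrast
toy_acc["abs_sum"] = toy_acc[axes].sum(axis=1).abs()

print(toy_acc)
```

In the second sample the x and y axes cancel in the sum even though the wrist is clearly moving, while the magnitude stays large.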


In [11]:
# Display the DFs and respective columns within the dictionary
for key, df in dfs.items():
    print(f"DataFrame: {key}")
    print("Columns:", df.columns.tolist())
    print()

DataFrame: ACC_001
Columns: ['acc_x', 'acc_y', 'acc_z', 'acc']

DataFrame: BVP_001
Columns: ['bvp']

DataFrame: Dexcom_001
Columns: ['glucose']

DataFrame: EDA_001
Columns: ['eda']

DataFrame: HR_001
Columns: ['hr']

DataFrame: IBI_001
Columns: ['ibi']

DataFrame: TEMP_001
Columns: ['temp']



In [12]:
for key, df in dfs.items():
    dfs[key].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20296428 entries, 2020-02-13 15:28:50 to 2020-02-22 17:56:03.843750
Data columns (total 4 columns):
 #   Column  Dtype  
---  ------  -----  
 0   acc_x   float64
 1   acc_y   float64
 2   acc_z   float64
 3   acc     float64
dtypes: float64(4)
memory usage: 774.2 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 40592838 entries, 2020-02-13 15:28:50 to 2020-02-22 17:56:03.781250
Data columns (total 1 columns):
 #   Column  Dtype  
---  ------  -----  
 0   bvp     float64
dtypes: float64(1)
memory usage: 619.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2561 entries, 2020-02-13 17:23:32 to 2020-02-22 17:53:23
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   glucose  2561 non-null   float64
dtypes: float64(1)
memory usage: 40.0 KB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2537046 entries, 2020-02-13 15:28:50 to 2020-02-22 17:56:03.250000
Data

### Confirm datetime index is monotonic

In [13]:
for key, df in dfs.items():
    if df.index.is_monotonic_increasing or df.index.is_monotonic_decreasing:
        print(f"Index for DataFrame: {key} is monotonic")
    else:
        print(f"Index for DataFrame: {key} is NOT monotonic")


Index for DataFrame: ACC_001 is monotonic
Index for DataFrame: BVP_001 is monotonic
Index for DataFrame: Dexcom_001 is monotonic
Index for DataFrame: EDA_001 is monotonic
Index for DataFrame: HR_001 is monotonic
Index for DataFrame: IBI_001 is monotonic
Index for DataFrame: TEMP_001 is monotonic


### Early Feature Engineering

We engineer features early so that we make full use of the high-frequency data. Glucose measurements from Dexcom are sampled every 5 minutes, while the wristband signals are sampled far more often. If we simply resampled everything onto the Dexcom timestamps we would discard most of that data, so we build features at the original sampling rates before merging the DataFrames. The features below are domain-specific ones supported by the publication:
- Electrodermal activity (EDA) peak detection
- Mean, std, min, max, Q1, Q3, and skewness
- Rolling 2-hour mean and 2-hour max

In each case we calculate one value per 5-minute chunk.
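
To see how the peak-detection parameters behave, here is a minimal sketch that runs `scipy.signal.find_peaks` on a made-up signal with the same parameter values used in this notebook (`height=0`, `distance=4`, `prominence=0.3`). The signal itself is invented and is not real EDA data.

```python
import numpy as np
from scipy.signal import find_peaks

# Invented signal with three local maxima: at index 1, index 6 and index 11
signal = np.array([0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0.2, 0])

# Same parameter values as in this notebook
peaks, properties = find_peaks(signal, height=0, distance=4, prominence=0.3)

print("peak indices:", peaks)
print("number of peaks:", len(peaks))
```

The small bump at index 11 rises only 0.2 above its surroundings, so the `prominence=0.3` threshold filters it out, and the other two maxima are far enough apart to satisfy `distance=4`.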


In [14]:
# Define parameters for peak detection
height = 0
distance = 4
prominence = 0.3

# This function finds peaks in a timeseries and returns the number of peaks.
def count_peaks(x):
    peaks, _ = find_peaks(x, height=height, distance=distance, prominence=prominence)
    return len(peaks)

def process_df(df, resample_period='5min', calculate_peaks=False, rolling_2h=False):
    # Resample the data
    resampled = df.resample(resample_period)

    # Initialize an empty dataframe to store results
    df_result = pd.DataFrame(index=resampled.indices.keys())

    for col_name in df.columns:
        # Calculate statistics for each column
        df_result[f'{col_name}_mean'] = resampled[col_name].mean()
        df_result[f'{col_name}_std'] = resampled[col_name].std()
        df_result[f'{col_name}_min'] = resampled[col_name].min()
        df_result[f'{col_name}_max'] = resampled[col_name].max()
        df_result[f'{col_name}_q1'] = resampled[col_name].quantile(0.25)
        df_result[f'{col_name}_q3'] = resampled[col_name].quantile(0.75)
        df_result[f'{col_name}_skew'] = resampled[col_name].skew()

        if calculate_peaks:
            df_result[f'{col_name}_peaks'] = resampled[col_name].apply(count_peaks)

        if rolling_2h:
            rolling_2h_agg_func = df[col_name].rolling('2h').agg(['mean', 'max'])
            df_result[f'{col_name}_2hr_mean'] = rolling_2h_agg_func['mean'].resample(resample_period).last()
            df_result[f'{col_name}_2hr_max'] = rolling_2h_agg_func['max'].resample(resample_period).last()

    return df_result


# Then apply this function to each of your dataframes
dfs['EDA_001'] = process_df(dfs['EDA_001'], calculate_peaks=True)

# Apply function for dfs that need 2h rolling features
dfs['ACC_001'] = process_df(dfs['ACC_001'], rolling_2h=True)

# Apply function for remaining dfs
for name in ['HR_001', 'TEMP_001', 'IBI_001', 'BVP_001']:
    dfs[name] = process_df(dfs[name])
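
The difference between the 5-minute aggregates and the rolling 2-hour features is easiest to see on a toy series. The sketch below uses an invented one-sample-per-minute ramp, not wristband data, and mirrors the `resample(...)` and `rolling('2h')` calls used in `process_df`.

```python
import pandas as pd

# Invented series: one sample per minute for 3 hours, values 0, 1, ..., 179
idx = pd.date_range("2020-01-01 00:00", periods=180, freq="min")
toy = pd.Series(range(180), index=idx, dtype="float64")

# 5-minute aggregate: each bin only sees its own 5 samples
five_min_mean = toy.resample("5min").mean()

# Rolling 2-hour mean: each sample sees the previous 2 hours,
# then we keep the last rolling value inside each 5-minute bin
rolling_2h_mean = toy.rolling("2h").mean().resample("5min").last()

print(five_min_mean.head(3))
print(rolling_2h_mean.tail(3))
```

Both results share the same 5-minute index, which is what allows them to sit side by side as columns of one feature table.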


In [15]:
print(dfs['IBI_001'].head())

                     ibi_mean   ibi_std   ibi_min   ibi_max    ibi_q1  \
2020-02-13 15:30:00  0.903166  0.059910  0.828163  0.984420  0.875040   
2020-02-13 15:35:00  0.849333  0.228782  0.468771  1.140677  0.625028   
2020-02-13 15:40:00  0.930846  0.159200  0.437520  1.078174  0.910197   
2020-02-13 15:45:00  0.953820  0.157979  0.562526  1.250057  0.890666   
2020-02-13 15:50:00  0.937543  0.098188  0.734409  1.125051  0.859414   

                       ibi_q3  ibi_skew  
2020-02-13 15:30:00  0.937543  0.253720  
2020-02-13 15:35:00  1.039110 -0.492573  
2020-02-13 15:40:00  1.023484 -2.479890  
2020-02-13 15:45:00  1.046923 -0.681244  
2020-02-13 15:50:00  1.000046 -0.440397  


In [16]:
for key, df in dfs.items():
    dfs[key].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2125 entries, 2020-02-13 15:25:00 to 2020-02-22 17:55:00
Data columns (total 36 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   acc_x_mean      2125 non-null   float64
 1   acc_x_std       2125 non-null   float64
 2   acc_x_min       2125 non-null   float64
 3   acc_x_max       2125 non-null   float64
 4   acc_x_q1        2125 non-null   float64
 5   acc_x_q3        2125 non-null   float64
 6   acc_x_skew      2125 non-null   float64
 7   acc_x_2hr_mean  2125 non-null   float64
 8   acc_x_2hr_max   2125 non-null   float64
 9   acc_y_mean      2125 non-null   float64
 10  acc_y_std       2125 non-null   float64
 11  acc_y_min       2125 non-null   float64
 12  acc_y_max       2125 non-null   float64
 13  acc_y_q1        2125 non-null   float64
 14  acc_y_q3        2125 non-null   float64
 15  acc_y_skew      2125 non-null   float64
 16  acc_y_2hr_mean  2125 non-null   float64
 1

### Merge separate dfs into one, resampling on Dexcom (glucose)

In [17]:
# Starting with Dexcom_001 DataFrame and setting 'glucose' as the only column
merged_df = dfs['Dexcom_001'][['glucose']]

# Making sure the index is sorted
merged_df = merged_df.sort_index()

for key, df in dfs.items():
    # skip if the current dataframe is 'Dexcom_001'
    if key == 'Dexcom_001':
        continue

    # make sure the df is sorted by index
    df_sorted = df.sort_index()

    # Merge with the current dataframe
    merged_df = pd.merge_asof(merged_df, df_sorted, left_index=True, right_index=True, direction='nearest',
                              tolerance=pd.Timedelta('4min'))

merged_df.info()  # print the info of the merged dataframe


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2561 entries, 2020-02-13 17:23:32 to 2020-02-22 17:53:23
Data columns (total 73 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   glucose         2561 non-null   float64
 1   acc_x_mean      2071 non-null   float64
 2   acc_x_std       2071 non-null   float64
 3   acc_x_min       2071 non-null   float64
 4   acc_x_max       2071 non-null   float64
 5   acc_x_q1        2071 non-null   float64
 6   acc_x_q3        2071 non-null   float64
 7   acc_x_skew      2071 non-null   float64
 8   acc_x_2hr_mean  2071 non-null   float64
 9   acc_x_2hr_max   2071 non-null   float64
 10  acc_y_mean      2071 non-null   float64
 11  acc_y_std       2071 non-null   float64
 12  acc_y_min       2071 non-null   float64
 13  acc_y_max       2071 non-null   float64
 14  acc_y_q1        2071 non-null   float64
 15  acc_y_q3        2071 non-null   float64
 16  acc_y_skew      2071 non-null   float64
 1
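
To make the matching rule concrete, the sketch below applies `pd.merge_asof` with the same `direction='nearest'` and 4-minute tolerance to two tiny, made-up frames. The timestamps and `eda_mean` values are invented; only the call mirrors the merge above.

```python
import pandas as pd

# Invented glucose readings on Dexcom-like timestamps
glucose = pd.DataFrame(
    {"glucose": [61.0, 59.0, 58.0]},
    index=pd.to_datetime([
        "2020-02-13 17:23:32",
        "2020-02-13 17:28:32",
        "2020-02-13 17:33:32",
    ]),
)

# Invented feature rows, with a gap between 17:25 and 17:40
features = pd.DataFrame(
    {"eda_mean": [1.0, 2.0]},
    index=pd.to_datetime(["2020-02-13 17:25:00", "2020-02-13 17:40:00"]),
)

merged = pd.merge_asof(
    glucose, features,
    left_index=True, right_index=True,
    direction="nearest",
    tolerance=pd.Timedelta("4min"),
)
print(merged)
```

The first two readings lie within 4 minutes of the 17:25 feature row and pick up its value. The third reading has no feature row within 4 minutes, so it is left as NaN, which is where the missing values in the merged frame come from.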

### Drop NaN values

In [18]:
# Drop rows with NaN values
merged_df = merged_df.dropna()

# Print the info of the DataFrame after dropping NaN rows
merged_df.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2014 entries, 2020-02-13 17:23:32 to 2020-02-22 17:53:23
Data columns (total 73 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   glucose         2014 non-null   float64
 1   acc_x_mean      2014 non-null   float64
 2   acc_x_std       2014 non-null   float64
 3   acc_x_min       2014 non-null   float64
 4   acc_x_max       2014 non-null   float64
 5   acc_x_q1        2014 non-null   float64
 6   acc_x_q3        2014 non-null   float64
 7   acc_x_skew      2014 non-null   float64
 8   acc_x_2hr_mean  2014 non-null   float64
 9   acc_x_2hr_max   2014 non-null   float64
 10  acc_y_mean      2014 non-null   float64
 11  acc_y_std       2014 non-null   float64
 12  acc_y_min       2014 non-null   float64
 13  acc_y_max       2014 non-null   float64
 14  acc_y_q1        2014 non-null   float64
 15  acc_y_q3        2014 non-null   float64
 16  acc_y_skew      2014 non-null   float64
 1

### Inspect merged_df

In [19]:
merged_df.head()

| datetime | glucose | acc_x_mean | acc_x_std | acc_x_min | acc_x_max | acc_x_q1 | acc_x_q3 | acc_x_skew | acc_x_2hr_mean | acc_x_2hr_max | ... | ibi_q1 | ibi_q3 | ibi_skew | temp_mean | temp_std | temp_min | temp_max | temp_q1 | temp_q3 | temp_skew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-02-13 17:23:32 | 61.0 | 1.25625 | 10.071903 | -52.0 | 80.0 | -4.0 | 4.0 | 3.424449 | 17.888216 | 127.0 | ... | 0.671906 | 0.76566 | 0.316494 | 33.171867 | 0.266618 | 32.73 | 33.57 | 32.925 | 33.43 | -0.174456 |
| 2020-02-13 17:28:32 | 59.0 | -0.251875 | 6.610492 | -27.0 | 25.0 | -4.0 | 4.0 | 0.108406 | 15.825048 | 127.0 | ... | 0.76566 | 0.906291 | 0.427697 | 33.136333 | 0.251755 | 32.75 | 33.47 | 32.87 | 33.39 | -0.175517 |
| 2020-02-13 17:33:32 | 58.0 | 34.318958 | 27.275652 | -90.0 | 127.0 | 5.0 | 55.0 | -0.710716 | 15.7548 | 127.0 | ... | 0.718783 | 0.812537 | 1.671801 | 33.244767 | 0.05205 | 33.11 | 33.39 | 33.21 | 33.27 | -0.178307 |
| 2020-02-13 17:38:32 | 59.0 | 36.178958 | 17.215578 | -61.0 | 127.0 | 30.0 | 45.0 | -0.674044 | 15.917096 | 127.0 | ... | 0.718783 | 0.890666 | -0.018164 | 33.315067 | 0.068227 | 33.21 | 33.43 | 33.25 | 33.37 | -0.111803 |
| 2020-02-13 17:43:31 | 63.0 | 30.370521 | 7.888225 | -33.0 | 70.0 | 25.0 | 36.0 | -0.416 | 15.850135 | 127.0 | ... | 0.687531 | 0.828163 | 0.390202 | 33.660067 | 0.145768 | 33.34 | 33.91 | 33.59 | 33.75 | -0.730198 |


### Universal code to wrangle the data (similar as we did for Patient_001) for each of the 16 patients, creating one large df with each patient's data

Below we combine all of the steps above into a single function, loop it over every patient, and concatenate the results into a single DataFrame.

In [21]:
# This function finds peaks in a timeseries and returns the number of peaks.
def count_peaks(x):
    height = 0
    distance = 4
    prominence = 0.3
    peaks, _ = find_peaks(x, height=height, distance=distance, prominence=prominence)
    return len(peaks)


def process_df(df, resample_period='5min', calculate_peaks=False, rolling_2h=False):
    # Resample the data
    resampled = df.resample(resample_period)

    # Initialize an empty dataframe to store results
    df_result = pd.DataFrame(index=resampled.indices.keys())

    for col_name in df.columns:
        # Calculate statistics for each column
        df_result[f'{col_name}_mean'] = resampled[col_name].mean()
        df_result[f'{col_name}_std'] = resampled[col_name].std()
        df_result[f'{col_name}_min'] = resampled[col_name].min()
        df_result[f'{col_name}_max'] = resampled[col_name].max()
        df_result[f'{col_name}_q1'] = resampled[col_name].quantile(0.25)
        df_result[f'{col_name}_q3'] = resampled[col_name].quantile(0.75)
        df_result[f'{col_name}_skew'] = resampled[col_name].skew()

        if calculate_peaks:
            df_result[f'{col_name}_peaks'] = resampled[col_name].apply(count_peaks)

        if rolling_2h:
            rolling_2h_agg_func = df[col_name].rolling('2h').agg(['mean', 'max'])
            df_result[f'{col_name}_2hr_mean'] = rolling_2h_agg_func['mean'].resample(resample_period).last()
            df_result[f'{col_name}_2hr_max'] = rolling_2h_agg_func['max'].resample(resample_period).last()

    return df_result

def process_patient_data(patient_id):
    # Filepaths in the local directory
    filepaths = [f'Data/{patient_id}/ACC_{patient_id}.csv', f'Data/{patient_id}/BVP_{patient_id}.csv',
                 f'Data/{patient_id}/Dexcom_{patient_id}.csv', f'Data/{patient_id}/EDA_{patient_id}.csv',
                 f'Data/{patient_id}/HR_{patient_id}.csv', f'Data/{patient_id}/IBI_{patient_id}.csv',
                 f'Data/{patient_id}/TEMP_{patient_id}.csv']

    # Dictionary to store the dataframes
    dfs = {}

    for csv in filepaths:
        key = csv.split('/')[-1].split('.')[0]  # Get the filename without the extension
        dfs[key] = pd.read_csv(csv)  # Read the csv file and store the DataFrame in the dictionary

        # Remove leading/trailing spaces from column names
        dfs[key].columns = dfs[key].columns.str.strip()

        # Special preprocessing for Dexcom files
        if 'Dexcom' in key:
            dfs[key] = dfs[key].rename(columns={'Timestamp (YYYY-MM-DDThh:mm:ss)': 'datetime'})
            dfs[key]['datetime'] = pd.to_datetime(dfs[key]['datetime'], format='mixed')
            dfs[key] = dfs[key][['datetime', 'Glucose Value (mg/dL)']].rename(columns={"Glucose Value (mg/dL)": "glucose"})
            dfs[key].dropna(subset=['datetime'], inplace=True)

        if 'datetime' in dfs[key].columns:
            dfs[key]['datetime'] = pd.to_datetime(dfs[key]['datetime'], format='mixed')

        # Set 'datetime' column as index and sort the DataFrame by index
        if 'datetime' in dfs[key].columns:
            dfs[key].set_index('datetime', inplace=True)
            dfs[key].sort_index(inplace=True)
        
        # Special preprocessing for ACC files: Euclidean magnitude of the three axes
        if 'ACC' in key:
            dfs[key]['acc'] = (dfs[key][['acc_x', 'acc_y', 'acc_z']] ** 2).sum(axis=1) ** 0.5

    processed_dfs = {}

    for name in [f'{prefix}_{patient_id}' for prefix in ['EDA', 'ACC', 'HR', 'TEMP', 'IBI', 'BVP']]:
        if name.startswith('EDA'):
            processed_dfs[name] = process_df(dfs[name], calculate_peaks=True)
        elif name.startswith('ACC'):
            processed_dfs[name] = process_df(dfs[name], rolling_2h=True)
        else:
            processed_dfs[name] = process_df(dfs[name])

    # Merge all dataframes
    merged_df = dfs[f'Dexcom_{patient_id}'][['glucose']].sort_index()
    for key, df in processed_dfs.items():
        if key == f'Dexcom_{patient_id}':
            continue
        df_sorted = df.sort_index()
        merged_df = pd.merge_asof(merged_df, df_sorted, left_index=True, right_index=True, direction='nearest',
                                  tolerance=pd.Timedelta('4min'))

    merged_df = merged_df.dropna()
    return merged_df



# List of patient ids
patient_ids = [f"{i + 1:03d}" for i in range(16)]  # This will generate a list: ['001', '002', ..., '015', '016']

# List to store each processed dataframe
dfs_list = []

for pid in patient_ids:
    df = process_patient_data(pid)

    # Add a 'patient_id' column
    df['patient_id'] = pid

    dfs_list.append(df)

    print(f'Patient {pid} is finished')

# Concatenate all dataframes in the list into one
final_df = pd.concat(dfs_list)


Patient 001 is finished
Patient 002 is finished
Patient 003 is finished
Patient 004 is finished
Patient 005 is finished
Patient 006 is finished
Patient 007 is finished
Patient 008 is finished
Patient 009 is finished
Patient 010 is finished
Patient 011 is finished
Patient 012 is finished
Patient 013 is finished
Patient 014 is finished
Patient 015 is finished
Patient 016 is finished


In [22]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26665 entries, 2020-02-13 17:23:32 to 2020-07-23 22:28:07
Data columns (total 74 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   glucose         26665 non-null  float64
 1   eda_mean        26665 non-null  float64
 2   eda_std         26665 non-null  float64
 3   eda_min         26665 non-null  float64
 4   eda_max         26665 non-null  float64
 5   eda_q1          26665 non-null  float64
 6   eda_q3          26665 non-null  float64
 7   eda_skew        26665 non-null  float64
 8   eda_peaks       26665 non-null  float64
 9   acc_x_mean      26665 non-null  float64
 10  acc_x_std       26665 non-null  float64
 11  acc_x_min       26665 non-null  float64
 12  acc_x_max       26665 non-null  float64
 13  acc_x_q1        26665 non-null  float64
 14  acc_x_q3        26665 non-null  float64
 15  acc_x_skew      26665 non-null  float64
 16  acc_x_2hr_mean  26665 non-null  float64
 

In [23]:
final_df.describe()

|  | glucose | eda_mean | eda_std | eda_min | eda_max | eda_q1 | eda_q3 | eda_skew | eda_peaks | acc_x_mean | ... | ibi_q1 | ibi_q3 | ibi_skew | bvp_mean | bvp_std | bvp_min | bvp_max | bvp_q1 | bvp_q3 | bvp_skew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | ... | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 | 26665.0 |
| mean | 114.653778 | 1.133445 | 0.25143 | 0.681762 | 1.748982 | 0.933382 | 1.322333 | 0.366475 | 9.768311 | -1.28884 | ... | 0.808752 | 0.886245 | -0.152614 | 0.00028 | 75.379098 | -621.812248 | 568.628263 | -33.886198 | 34.646923 | -0.3913 |
| std | 22.815274 | 2.501562 | 0.898707 | 1.727409 | 3.679443 | 2.161359 | 3.044264 | 2.658492 | 53.272854 | 37.423859 | ... | 0.159997 | 0.159704 | 0.989126 | 0.109694 | 55.423698 | 534.850417 | 520.803998 | 28.150735 | 28.712791 | 1.185936 |
| min | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -34.641016 | 0.0 | -66.026875 | ... | 0.312514 | 0.359391 | -8.479147 | -4.3155 | 0.590773 | -3414.13 | 2.08 | -202.0975 | 0.39 | -20.775731 |
| 25% | 99.0 | 0.196076 | 0.009713 | 0.05121 | 0.271713 | 0.166553 | 0.212599 | -0.379025 | 0.0 | -37.562083 | ... | 0.703157 | 0.781286 | -0.600243 | -0.024085 | 28.752605 | -934.95 | 152.6 | -49.68 | 10.1 | -0.531109 |
| 50% | 112.0 | 0.375073 | 0.035328 | 0.221497 | 0.538134 | 0.3267 | 0.401251 | 0.100827 | 0.0 | -0.147083 | ... | 0.812537 | 0.886759 | -0.098833 | 0.000132 | 65.927052 | -500.13 | 432.73 | -28.16 | 28.79 | -0.27078 |
| 75% | 126.0 | 0.891795 | 0.14495 | 0.527762 | 1.55844 | 0.71751 | 0.999404 | 0.70595 | 2.0 | 33.851875 | ... | 0.921917 | 0.992233 | 0.3596 | 0.024734 | 110.670453 | -168.48 | 835.66 | -9.97 | 51.03 | -0.039316 |
| max | 261.0 | 62.594477 | 19.379012 | 49.525658 | 71.29129 | 58.617767 | 66.688246 | 34.641016 | 573.0 | 63.378021 | ... | 1.53132 | 1.718828 | 7.980958 | 2.962534 | 378.143987 | -2.08 | 3563.34 | -0.39 | 193.93 | 15.271768 |


In [24]:
final_df.groupby('patient_id').size()

patient_id
001    2014
002    1877
003    1276
004    1244
005    2192
006    1486
007    1995
008    1966
009    1893
010    1653
011    1966
012    1628
013    1752
014    1438
015     493
016    1792
dtype: int64

In [25]:
final_df

datetime,glucose,eda_mean,eda_std,eda_min,eda_max,eda_q1,eda_q3,eda_skew,eda_peaks,acc_x_mean,...,ibi_q3,ibi_skew,bvp_mean,bvp_std,bvp_min,bvp_max,bvp_q1,bvp_q3,bvp_skew,patient_id
2020-02-13 17:23:32,61.0,0.848050,0.527374,0.156304,1.669808,0.166553,1.295926,-0.329936,2.0,1.256250,...,0.765660,0.316494,-0.004786,14.599009,-69.01,84.95,-9.6800,9.6800,-0.106814,001
2020-02-13 17:28:32,59.0,0.632578,0.283507,0.269314,1.654434,0.384620,0.840736,0.684854,0.0,-0.251875,...,0.906291,0.427697,-0.001255,12.277287,-64.23,65.07,-7.4400,7.9900,-0.333560,001
2020-02-13 17:33:32,58.0,1.544714,0.139987,1.139401,2.021050,1.459695,1.628811,0.167808,4.0,34.318958,...,0.812537,1.671801,0.020368,24.076577,-174.61,202.98,-8.5125,9.6100,-0.234153,001
2020-02-13 17:38:32,59.0,1.839445,0.352127,1.097122,3.046176,1.747960,2.058524,-0.398859,11.0,36.178958,...,0.890666,-0.018164,-0.009613,21.945661,-191.80,130.97,-5.8000,6.7200,-0.735376,001
2020-02-13 17:43:31,63.0,4.880899,1.612257,1.999104,7.903478,3.938064,6.247704,-0.014927,7.0,30.370521,...,0.828163,0.390202,-0.012741,14.068040,-147.92,102.04,-6.8000,6.9000,-0.880729,001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-07-23 22:08:08,114.0,0.223396,0.032818,0.133186,0.517377,0.210024,0.229234,3.591163,1.0,-23.731042,...,0.781286,-0.469285,-0.054992,67.391250,-325.48,306.67,-48.2550,49.6175,-0.221133,016
2020-07-23 22:13:07,121.0,0.300561,0.024836,0.206183,0.341930,0.286863,0.320159,-0.963423,0.0,27.118437,...,0.812537,0.237338,0.112629,37.053240,-285.10,236.23,-28.2325,28.4500,-0.204556,016
2020-07-23 22:18:07,127.0,0.363086,0.015251,0.334246,0.393156,0.349614,0.376507,-0.073975,0.0,29.574792,...,0.796911,0.233227,-0.032901,38.154863,-75.18,69.96,-33.2325,33.9525,-0.118820,016
2020-07-23 22:23:07,132.0,0.390997,0.002382,0.386752,0.398278,0.389314,0.393156,0.568695,0.0,30.000000,...,0.812537,0.146047,0.199750,41.394166,-77.17,71.76,-36.0975,37.2175,-0.103555,016


In [26]:
final_df.columns

Index(['glucose', 'eda_mean', 'eda_std', 'eda_min', 'eda_max', 'eda_q1',
       'eda_q3', 'eda_skew', 'eda_peaks', 'acc_x_mean', 'acc_x_std',
       'acc_x_min', 'acc_x_max', 'acc_x_q1', 'acc_x_q3', 'acc_x_skew',
       'acc_x_2hr_mean', 'acc_x_2hr_max', 'acc_y_mean', 'acc_y_std',
       'acc_y_min', 'acc_y_max', 'acc_y_q1', 'acc_y_q3', 'acc_y_skew',
       'acc_y_2hr_mean', 'acc_y_2hr_max', 'acc_z_mean', 'acc_z_std',
       'acc_z_min', 'acc_z_max', 'acc_z_q1', 'acc_z_q3', 'acc_z_skew',
       'acc_z_2hr_mean', 'acc_z_2hr_max', 'acc_mean', 'acc_std', 'acc_min',
       'acc_max', 'acc_q1', 'acc_q3', 'acc_skew', 'acc_2hr_mean',
       'acc_2hr_max', 'hr_mean', 'hr_std', 'hr_min', 'hr_max', 'hr_q1',
       'hr_q3', 'hr_skew', 'temp_mean', 'temp_std', 'temp_min', 'temp_max',
       'temp_q1', 'temp_q3', 'temp_skew', 'ibi_mean', 'ibi_std', 'ibi_min',
       'ibi_max', 'ibi_q1', 'ibi_q3', 'ibi_skew', 'bvp_mean', 'bvp_std',
       'bvp_min', 'bvp_max', 'bvp_q1', 'bvp_q3', 'bvp_skew', 'patien

In [27]:
final_df.dtypes

glucose       float64
eda_mean      float64
eda_std       float64
eda_min       float64
eda_max       float64
               ...   
bvp_max       float64
bvp_q1        float64
bvp_q3        float64
bvp_skew      float64
patient_id     object
Length: 74, dtype: object

In [28]:
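# Convert zero-padded string IDs (e.g. '001') to integers so they can be matched against the integer 'ID' column in the demographics data
final_df['patient_id'] = final_df['patient_id'].astype(int)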
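
As a quick illustration of why this cast matters, here is a minimal sketch on a toy frame (the name `toy_df` and its values are made up for this example, not taken from the dataset). It shows that zero-padded strings such as `'001'` become plain integers, which is the form the demographics `ID` column uses.

```python
import pandas as pd

# Toy stand-in for final_df; names and values are illustrative only.
toy_df = pd.DataFrame({"glucose": [61.0, 114.0], "patient_id": ["001", "016"]})

# Zero-padded strings parse as base-10 integers, so '001' becomes 1.
toy_df["patient_id"] = toy_df["patient_id"].astype(int)

print(toy_df["patient_id"].tolist())
```

Without the cast, the key columns would have different types (strings on one side, integers on the other), and the later merge on `patient_id` and `ID` could not match them.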


# Demographics CSV

### Load and create pandas df from demographics_csv

### Includes gender, HbA1c, and patient ID

In [29]:
# Load the CSV file from the 'Data' folder in your current directory
demographics_df = pd.read_csv('./Data/Demographics.csv')

# Display the DataFrame
demographics_df


,ID,Gender,HbA1c
0,13,MALE,5.7
1,1,FEMALE,5.5
2,3,FEMALE,5.9
3,4,FEMALE,6.4
4,5,FEMALE,5.7
5,2,MALE,5.6
6,6,FEMALE,5.8
7,7,FEMALE,5.3
8,8,FEMALE,5.6
9,10,FEMALE,6.0


In [30]:
# Convert the 'datetime' index into a column
final_df.reset_index(inplace=True)

### Merge previous df with demographics_df

In [35]:
# Left-merge final_df with demographics_df, matching 'patient_id' to 'ID'
patient_df = final_df.merge(demographics_df, left_on='patient_id', right_on='ID', how='left')

# Drop the 'ID' column, which duplicates 'patient_id'
patient_df = patient_df.drop('ID', axis=1)

# Print the first 5 rows of the new DataFrame
patient_df.head()


,datetime,glucose,eda_mean,eda_std,eda_min,eda_max,eda_q1,eda_q3,eda_skew,eda_peaks,...,bvp_mean,bvp_std,bvp_min,bvp_max,bvp_q1,bvp_q3,bvp_skew,patient_id,Gender,HbA1c
0,2020-02-13 17:23:32,61.0,0.84805,0.527374,0.156304,1.669808,0.166553,1.295926,-0.329936,2.0,...,-0.004786,14.599009,-69.01,84.95,-9.68,9.68,-0.106814,1,FEMALE,5.5
1,2020-02-13 17:28:32,59.0,0.632578,0.283507,0.269314,1.654434,0.38462,0.840736,0.684854,0.0,...,-0.001255,12.277287,-64.23,65.07,-7.44,7.99,-0.33356,1,FEMALE,5.5
2,2020-02-13 17:33:32,58.0,1.544714,0.139987,1.139401,2.02105,1.459695,1.628811,0.167808,4.0,...,0.020368,24.076577,-174.61,202.98,-8.5125,9.61,-0.234153,1,FEMALE,5.5
3,2020-02-13 17:38:32,59.0,1.839445,0.352127,1.097122,3.046176,1.74796,2.058524,-0.398859,11.0,...,-0.009613,21.945661,-191.8,130.97,-5.8,6.72,-0.735376,1,FEMALE,5.5
4,2020-02-13 17:43:31,63.0,4.880899,1.612257,1.999104,7.903478,3.938064,6.247704,-0.014927,7.0,...,-0.012741,14.06804,-147.92,102.04,-6.8,6.9,-0.880729,1,FEMALE,5.5
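
A left merge keeps every wearable row, so any patient missing from the demographics table would silently end up with NaN in `Gender` and `HbA1c`. One way to check for that is sketched below, assuming toy stand-ins (`toy_wearable`, `toy_demo`, and the ID 99 are hypothetical) and using the `validate` and `indicator` options of `DataFrame.merge`. This is not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for final_df and demographics_df; values are illustrative only.
toy_wearable = pd.DataFrame(
    {"patient_id": [1, 1, 2, 99], "glucose": [61.0, 59.0, 110.0, 120.0]}
)
toy_demo = pd.DataFrame(
    {"ID": [1, 2], "Gender": ["FEMALE", "MALE"], "HbA1c": [5.5, 5.6]}
)

checked = toy_wearable.merge(
    toy_demo,
    left_on="patient_id",
    right_on="ID",
    how="left",
    validate="many_to_one",  # raises MergeError if an ID repeats in toy_demo
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Patients with wearable rows but no demographics entry.
unmatched_ids = sorted(
    checked.loc[checked["_merge"] == "left_only", "patient_id"].unique().tolist()
)
print(unmatched_ids)
```

An empty `unmatched_ids` list on the real data would indicate that every patient received demographic values.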


In [36]:
patient_df.columns

Index(['datetime', 'glucose', 'eda_mean', 'eda_std', 'eda_min', 'eda_max',
       'eda_q1', 'eda_q3', 'eda_skew', 'eda_peaks', 'acc_x_mean', 'acc_x_std',
       'acc_x_min', 'acc_x_max', 'acc_x_q1', 'acc_x_q3', 'acc_x_skew',
       'acc_x_2hr_mean', 'acc_x_2hr_max', 'acc_y_mean', 'acc_y_std',
       'acc_y_min', 'acc_y_max', 'acc_y_q1', 'acc_y_q3', 'acc_y_skew',
       'acc_y_2hr_mean', 'acc_y_2hr_max', 'acc_z_mean', 'acc_z_std',
       'acc_z_min', 'acc_z_max', 'acc_z_q1', 'acc_z_q3', 'acc_z_skew',
       'acc_z_2hr_mean', 'acc_z_2hr_max', 'acc_mean', 'acc_std', 'acc_min',
       'acc_max', 'acc_q1', 'acc_q3', 'acc_skew', 'acc_2hr_mean',
       'acc_2hr_max', 'hr_mean', 'hr_std', 'hr_min', 'hr_max', 'hr_q1',
       'hr_q3', 'hr_skew', 'temp_mean', 'temp_std', 'temp_min', 'temp_max',
       'temp_q1', 'temp_q3', 'temp_skew', 'ibi_mean', 'ibi_std', 'ibi_min',
       'ibi_max', 'ibi_q1', 'ibi_q3', 'ibi_skew', 'bvp_mean', 'bvp_std',
       'bvp_min', 'bvp_max', 'bvp_q1', 'bvp_q3', 'bvp_sk

### Organize columns in a more intuitive order

In [37]:
# Columns to place first; all remaining columns keep their existing order
lead_cols = ['datetime', 'patient_id', 'glucose', 'Gender', 'HbA1c', 'acc_mean', 'bvp_mean', 'eda_mean', 'hr_mean',
             'ibi_mean', 'temp_mean']
patient_df = patient_df[lead_cols + [col for col in patient_df.columns if col not in lead_cols]]


In [38]:
patient_df.columns

Index(['datetime', 'patient_id', 'glucose', 'Gender', 'HbA1c', 'acc_mean',
       'bvp_mean', 'eda_mean', 'hr_mean', 'ibi_mean', 'temp_mean', 'eda_std',
       'eda_min', 'eda_max', 'eda_q1', 'eda_q3', 'eda_skew', 'eda_peaks',
       'acc_x_mean', 'acc_x_std', 'acc_x_min', 'acc_x_max', 'acc_x_q1',
       'acc_x_q3', 'acc_x_skew', 'acc_x_2hr_mean', 'acc_x_2hr_max',
       'acc_y_mean', 'acc_y_std', 'acc_y_min', 'acc_y_max', 'acc_y_q1',
       'acc_y_q3', 'acc_y_skew', 'acc_y_2hr_mean', 'acc_y_2hr_max',
       'acc_z_mean', 'acc_z_std', 'acc_z_min', 'acc_z_max', 'acc_z_q1',
       'acc_z_q3', 'acc_z_skew', 'acc_z_2hr_mean', 'acc_z_2hr_max', 'acc_std',
       'acc_min', 'acc_max', 'acc_q1', 'acc_q3', 'acc_skew', 'acc_2hr_mean',
       'acc_2hr_max', 'hr_std', 'hr_min', 'hr_max', 'hr_q1', 'hr_q3',
       'hr_skew', 'temp_std', 'temp_min', 'temp_max', 'temp_q1', 'temp_q3',
       'temp_skew', 'ibi_std', 'ibi_min', 'ibi_max', 'ibi_q1', 'ibi_q3',
       'ibi_skew', 'bvp_std', 'bvp_min', 'bvp_

### Save df with wearables+demographic data to patient_df.csv

In [39]:
patient_df.to_csv('patient_df.csv', index=False)
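
CSV files store timestamps as plain text, so the `datetime` column comes back as strings unless it is parsed on load. The sketch below shows one way to reload the data with `parse_dates`, using a toy frame and an in-memory buffer in place of `patient_df.csv` (both are assumptions made for this example).

```python
import io

import pandas as pd

# Toy stand-in for patient_df; values are illustrative only.
toy_df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2020-02-13 17:23:32", "2020-02-13 17:28:32"]),
        "patient_id": [1, 1],
        "glucose": [61.0, 59.0],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV text does not preserve.
reloaded = pd.read_csv(buffer, parse_dates=["datetime"])
print(reloaded.dtypes)
```

With the real file, the equivalent call would be `pd.read_csv('patient_df.csv', parse_dates=['datetime'])`.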