# Table of Contents
- [Introduction](#introduction)
- [Data Preparation](#data-preparation)
    - [Import necessary libraries](#import-necessary-libraries)
    - [Patient_001 Example - Loading wristband + CGM data frames](#patient_001-example-loading-wristband-CGM-data-frames)
        - [Inspect DFs and column names](#Inspect-DFs-and-column-names)
        - [Column renaming for consistency, strip unnecessary characters, convert to datetime](#column-renaming-for-consistency-strip-unnecessary-characters-convert-to-datetime)
    - [**Data sampling mismatch**](#data-sampling-mismatch)
        - [Early feature engineering on respective sampling periods](#early-feature-engineering-on-respective-sampling-periods)
        - [Acceleration magnitude](#acceleration-magnitude)
        - [Peak Detection in electrodermal activity (EDA)](#early-feature-engineering)
        - [5 minute aggregate features - mean, std, min, max, 25% quantile, 75% quantile, and skewness](#early-feature-engineering)
        - [Merge each dataframe, resampling onto our glucose measurements (5 min)](#merge-separate-dfs-into-1-resampling-on-dexcom-glucose)
        - [Drop NaN values](#drop-nan-values)
        - [Inspect merged_df](#inspect-merged_df)
    - [Universal code to wrangle the data (similar as we did for Patient_001) for each of the 16 patients, creating one large df with each patient's data](#all-patient-merge-code)
        - [Same wrangling steps as before](#all-patient-merge-code)
        - [Add column with Patient ID from 1 to 16](#all-patient-merge-code)
    - [Demographics CSV](#Demographics-CSV)
        - [Includes gender, HbA1c, and patient ID](#Demographics-CSV)
        - [Load and create pandas df from demographics_csv](#Demographics-CSV)
        - [Merge previous df with demographics_df](#demographics-merge)
        - [Organize columns in a more intuitive order](#Reorder-columns)
        - [Save df with wearables+demographic data to patient_df.csv](#patient-df-csv)


# Introduction
Prediabetes affects one in three people and has a 10% annual conversion rate to type 2 diabetes without lifestyle or medical interventions. Management of glycemic health is essential to prevent progression to type 2 diabetes. However, there is currently no commercially available, noninvasive method for monitoring glycemic health to aid in self-management of prediabetes. There is a critical need for innovative, practical strategies to improve monitoring and management of glycemic health. In this study, using a dataset of 25,000 simultaneous interstitial glucose and noninvasive wearable smartwatch measurements, the goal is to demonstrate the feasibility of using noninvasive smartwatches and food logs recorded over 10 days to continuously detect personalized glucose deviations and to predict the exact interstitial glucose value in real time.

### Capstone 2 Project scope:
- My goal is to re-create and improve on the model from the Nature publication.
- The authors were able to predict interstitial glucose (mg/dL) with an RMSE of 21.2 and a MAPE of 14.3%.
- Starting from 8 initial features, the authors engineered a total of 69 features for their final model.

### Data Available:

Each patient has a set of files with their own data. Note that the sampling period differs from file to file, and that the Dexcom glucose value is the target variable we want to predict.
<br>

| CSV              | Description                                                    | Source                                         | Median Sampling Period |
|------------------|----------------------------------------------------------------|------------------------------------------------|------------------------|
| **ACC_001**      | Tri-axial accelerometry (X-Y-Z)                                | Empatica E4 wrist-worn device                  | 0.03125 seconds        |
| **BVP_001**      | Blood volume pulse                                             | Empatica E4 wrist-worn device                  | 0.015625 seconds       |
| **Dexcom_001**   | Interstitial glucose concentration (mg/dL)                     | Dexcom G6, a continuous glucose monitor system | 300.0 seconds          |
| **EDA_001**      | Electrodermal activity                                         | Empatica E4 wrist-worn device                  | 0.25 seconds           |
| **HR_001**       | Heart Rate                                                     | Empatica E4 wrist-worn device                  | 1.24 seconds           |
| **IBI_001**      | Interbeat interval                                             | Empatica E4 wrist-worn device                  | 0.98442 seconds        |
| **TEMP_001**     | Skin Temperature                                               | Empatica E4 wrist-worn device                  | 0.25 seconds           |
| **food_log**     | Log of food intake with timestamps and nutritional information | User input                                     | As needed              |
| **demographics_csv** | Sex, HbA1c, Patient ID                                         | User input                                     | One time               |

<br>
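
The sampling periods in the table translate directly into how many raw samples fall inside one glucose reading. The following is a minimal sketch of that arithmetic, using only the median periods listed above; the dictionary and variable names are illustrative and not part of the notebook.

```python
# Median sampling periods in seconds, copied from the table above.
median_period_s = {
    "ACC": 0.03125,
    "BVP": 0.015625,
    "EDA": 0.25,
    "TEMP": 0.25,
    "Dexcom": 300.0,
}

cgm_period_s = median_period_s["Dexcom"]

# Sampling rate in Hz and number of raw samples per glucose reading.
rate_hz = {name: 1.0 / period for name, period in median_period_s.items()}
samples_per_reading = {
    name: cgm_period_s / period for name, period in median_period_s.items()
}

for name in median_period_s:
    print(
        f"{name:>6}: {rate_hz[name]:9.4f} Hz, "
        f"{samples_per_reading[name]:7.0f} samples per glucose reading"
    )
```

Under these medians, one Dexcom reading spans 9,600 accelerometer samples and 19,200 BVP samples, which is why the wearable signals are summarized into 5-minute features before they are merged with glucose.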




### References: 
- The dataset is publicly available from PhysioNet: [Dataset](https://physionet.org/content/big-ideas-glycemic-wearable/1.1.2/001/#files-panel)
- The Nature publication is open access (no paywall): [Nature publication](https://www.nature.com/articles/s41746-021-00465-w#article-info)





# Data Preparation

### Import necessary libraries

In [82]:
# Importing necessary libraries
import math

import numpy as np
import pandas as pd
from pandas import to_datetime
from scipy.signal import find_peaks

import wearablecompute as wc


### Patient_001 Example - Loading wristband + CGM data frames

In [83]:
# Filepaths in the local directory
filepaths = ['Data/001/ACC_001.csv', 'Data/001/BVP_001.csv', 'Data/001/Dexcom_001.csv', 'Data/001/EDA_001.csv', 'Data/001/HR_001.csv', 'Data/001/IBI_001.csv', 'Data/001/TEMP_001.csv']

# Dictionary to store the dataframes
dfs = {}

for csv in filepaths:
    key = csv.split('/')[-1].split('.')[0]  # Get the filename without the extension
    dfs[key] = pd.read_csv(csv)  # Read the csv file and store the DataFrame in the dictionary

### Inspect DFs and column names

In [84]:
  
# Display the DFs and respective columns within the dictionary
for key, df in dfs.items():
    print(f"DataFrame: {key}")
    print("Columns:", df.columns.tolist())
    print()

DataFrame: ACC_001
Columns: ['datetime', ' acc_x', ' acc_y', ' acc_z']

DataFrame: BVP_001
Columns: ['datetime', ' bvp']

DataFrame: Dexcom_001
Columns: ['Index', 'Timestamp (YYYY-MM-DDThh:mm:ss)', 'Event Type', 'Event Subtype', 'Patient Info', 'Device Info', 'Source Device ID', 'Glucose Value (mg/dL)', 'Insulin Value (u)', 'Carb Value (grams)', 'Duration (hh:mm:ss)', 'Glucose Rate of Change (mg/dL/min)', 'Transmitter Time (Long Integer)']

DataFrame: EDA_001
Columns: ['datetime', ' eda']

DataFrame: HR_001
Columns: ['datetime', ' hr']

DataFrame: IBI_001
Columns: ['datetime', ' ibi']

DataFrame: TEMP_001
Columns: ['datetime', ' temp']



### Column renaming for consistency, strip unnecessary characters, convert to datetime

In [85]:
# Remove leading/trailing spaces from column names
for key, df in dfs.items():
    df.columns = df.columns.str.strip()

# Rename the column 'Timestamp ...' in Dexcom_001 to 'datetime'
dfs['Dexcom_001'] = dfs['Dexcom_001'].rename(columns={'Timestamp (YYYY-MM-DDThh:mm:ss)': 'datetime'})

# Convert the 'datetime' columns to datetime dtype
for key in dfs.keys():
    if 'datetime' in dfs[key].columns:
        dfs[key]['datetime'] = pd.to_datetime(dfs[key]['datetime'])
        
# Display the DFs and respective columns within the dictionary
for key, df in dfs.items():
    print(f"DataFrame: {key}")
    print("Columns:", df.columns.tolist())
    print()


DataFrame: ACC_001
Columns: ['datetime', 'acc_x', 'acc_y', 'acc_z']

DataFrame: BVP_001
Columns: ['datetime', 'bvp']

DataFrame: Dexcom_001
Columns: ['Index', 'datetime', 'Event Type', 'Event Subtype', 'Patient Info', 'Device Info', 'Source Device ID', 'Glucose Value (mg/dL)', 'Insulin Value (u)', 'Carb Value (grams)', 'Duration (hh:mm:ss)', 'Glucose Rate of Change (mg/dL/min)', 'Transmitter Time (Long Integer)']

DataFrame: EDA_001
Columns: ['datetime', 'eda']

DataFrame: HR_001
Columns: ['datetime', 'hr']

DataFrame: IBI_001
Columns: ['datetime', 'ibi']

DataFrame: TEMP_001
Columns: ['datetime', 'temp']



In [86]:
# Keep only the datetime and glucose columns in Dexcom_001, rename the glucose
# column, and drop rows where 'datetime' is null/NaN (the first twelve rows).
dfs['Dexcom_001'] = (
    dfs['Dexcom_001'][['datetime', 'Glucose Value (mg/dL)']]
    .rename(columns={'Glucose Value (mg/dL)': 'glucose'})
    .dropna(subset=['datetime'])
)

In [87]:
# Set 'datetime' column as index
for df in dfs.values():
    df.set_index('datetime', inplace=True)

In [88]:
for key, df in dfs.items():
    dfs[key] = df.sort_index()


In [89]:
for key, df in dfs.items():
    if df.index.is_monotonic_increasing or df.index.is_monotonic_decreasing:
        print(f"Index for DataFrame: {key} is monotonic")
    else:
        print(f"Index for DataFrame: {key} is NOT monotonic")


Index for DataFrame: ACC_001 is monotonic
Index for DataFrame: BVP_001 is monotonic
Index for DataFrame: Dexcom_001 is monotonic
Index for DataFrame: EDA_001 is monotonic
Index for DataFrame: HR_001 is monotonic
Index for DataFrame: IBI_001 is monotonic
Index for DataFrame: TEMP_001 is monotonic


### Data sampling mismatch

In [90]:
for key, df in dfs.items():
    # Calculate time delta series
    timedelta_series = df.index.to_series().diff()

    # Compute mean/median of timedelta_series in seconds
    mean_sampling_period_seconds = timedelta_series.mean().total_seconds()
    median_sampling_period_seconds = timedelta_series.median().total_seconds()

    # Count the number of samples
    num_samples = len(df)

    # Get the start time and end time
    start_time = df.index.min()
    end_time = df.index.max()

    print(f"For DataFrame '{key}', mean sampling period is {mean_sampling_period_seconds} seconds, \
          median sampling period is {median_sampling_period_seconds} seconds,\
          there are {num_samples} samples,\
          start time is {start_time}, and end time is {end_time}.\n")


For DataFrame 'ACC_001', mean sampling period is 0.038747 seconds,           median sampling period is 0.03125 seconds,          there are 20296428 samples,          start time is 2020-02-13 15:28:50, and end time is 2020-02-22 17:56:03.843750.

For DataFrame 'BVP_001', mean sampling period is 0.019373 seconds,           median sampling period is 0.015625 seconds,          there are 40592838 samples,          start time is 2020-02-13 15:28:50, and end time is 2020-02-22 17:56:03.781250.

For DataFrame 'Dexcom_001', mean sampling period is 304.449609 seconds,           median sampling period is 300.0 seconds,          there are 2561 samples,          start time is 2020-02-13 17:23:32, and end time is 2020-02-22 17:53:23.

For DataFrame 'EDA_001', mean sampling period is 0.30998 seconds,           median sampling period is 0.25 seconds,          there are 2537046 samples,          start time is 2020-02-13 15:28:50, and end time is 2020-02-22 17:56:03.250000.

For DataFrame 'HR_001', mean


| Dataframe   | Mean Sampling Period | Median Sampling Period | Number of Samples | Start Time                | End Time                  |
|-------------|----------------------|------------------------|-------------------|---------------------------|---------------------------|
| ACC_001     | 0.038747 seconds     | 0.03125 seconds        | 20296428          | 2020-02-13 15:28:50       | 2020-02-22 17:56:03.843750|
| BVP_001     | 0.019373 seconds     | 0.015625 seconds       | 40592838          | 2020-02-13 15:28:50       | 2020-02-22 17:56:03.781250|
| Dexcom_001  | 304.449609 seconds   | 300.0 seconds          | 2561              | 2020-02-13 17:23:32       | 2020-02-22 17:53:23       |
| EDA_001     | 0.30998 seconds      | 0.25 seconds           | 2537046           | 2020-02-13 15:28:50       | 2020-02-22 17:56:03.250000|
| HR_001      | 1.240044 seconds     | 0.0 seconds            | 634188            | 2020-02-13 15:29:00       | 2020-02-22 17:56:00       |
| IBI_001     | 2.950438 seconds     | 0.98442 seconds        | 266366            | 2020-02-13 15:33:22.059328| 2020-02-22 17:51:35.691598|
| TEMP_001    | 0.30998 seconds      | 0.25 seconds           | 2537040           | 2020-02-13 15:28:50       | 2020-02-22 17:56:03.750000|
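
In the table, every mean sampling period is larger than the corresponding median. One plausible reason is gaps in the recording (for example, when the wristband is taken off), since a few long gaps inflate the mean but barely move the median. The sketch below illustrates the effect on a synthetic index, not on the patient data; the timestamps and the 10-second gap are made up.

```python
import pandas as pd

# Hypothetical 4 Hz signal with a single 10-second gap in the middle.
first = pd.date_range("2020-02-13 15:28:50", periods=8, freq="250ms")
second = pd.date_range(first[-1] + pd.Timedelta(seconds=10), periods=8, freq="250ms")
index = first.append(second)

# Same calculation as in the cell above: differences between consecutive timestamps.
deltas = index.to_series().diff().dropna()
mean_s = deltas.mean().total_seconds()
median_s = deltas.median().total_seconds()

# Flag intervals much longer than the typical one (the factor 4 is arbitrary).
n_gaps = int((deltas > 4 * deltas.median()).sum())

print(f"mean = {mean_s} s, median = {median_s} s, gaps = {n_gaps}")
```

Here the median stays at the nominal 0.25 s while the mean rises to 0.9 s, so the median is the more faithful description of the device's sampling period.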



### Early feature engineering on respective sampling periods

### Acceleration magnitude

In [91]:
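# Create 'acc' column in 'ACC_001' DataFrame as the acceleration magnitude:
# the Euclidean norm of the three axes, sqrt(acc_x^2 + acc_y^2 + acc_z^2).
# A plain sum of the axes would let opposite-signed readings cancel out.
dfs['ACC_001']['acc'] = np.sqrt((dfs['ACC_001'][['acc_x', 'acc_y', 'acc_z']] ** 2).sum(axis=1))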
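
Acceleration magnitude is conventionally the Euclidean norm of the three axes. The sketch below, on made-up readings rather than Empatica data, compares that norm with the absolute value of the axis sum to show how the two quantities differ when axes have opposite signs.

```python
import numpy as np
import pandas as pd

# Two hypothetical accelerometer readings.
acc = pd.DataFrame(
    {"acc_x": [3.0, -30.0], "acc_y": [4.0, 30.0], "acc_z": [0.0, 0.0]}
)
axes = ["acc_x", "acc_y", "acc_z"]

# Euclidean norm: sqrt(x^2 + y^2 + z^2)
magnitude = np.sqrt((acc[axes] ** 2).sum(axis=1))

# Absolute value of the summed axes: opposite signs cancel.
abs_of_sum = acc[axes].sum(axis=1).abs()

print(pd.DataFrame({"magnitude": magnitude, "abs_of_sum": abs_of_sum}))
```

In the second row the device is clearly accelerating, yet the summed axes give exactly zero, while the norm does not.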


In [92]:
  
# Display the DFs and respective columns within the dictionary
for key, df in dfs.items():
    print(f"DataFrame: {key}")
    print("Columns:", df.columns.tolist())
    print()

DataFrame: ACC_001
Columns: ['acc_x', 'acc_y', 'acc_z', 'acc']

DataFrame: BVP_001
Columns: ['bvp']

DataFrame: Dexcom_001
Columns: ['glucose']

DataFrame: EDA_001
Columns: ['eda']

DataFrame: HR_001
Columns: ['hr']

DataFrame: IBI_001
Columns: ['ibi']

DataFrame: TEMP_001
Columns: ['temp']



In [93]:
for key, df in dfs.items():
    dfs[key].info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20296428 entries, 2020-02-13 15:28:50 to 2020-02-22 17:56:03.843750
Data columns (total 4 columns):
 #   Column  Dtype  
---  ------  -----  
 0   acc_x   float64
 1   acc_y   float64
 2   acc_z   float64
 3   acc     float64
dtypes: float64(4)
memory usage: 774.2 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 40592838 entries, 2020-02-13 15:28:50 to 2020-02-22 17:56:03.781250
Data columns (total 1 columns):
 #   Column  Dtype  
---  ------  -----  
 0   bvp     float64
dtypes: float64(1)
memory usage: 619.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2561 entries, 2020-02-13 17:23:32 to 2020-02-22 17:53:23
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   glucose  2561 non-null   float64
dtypes: float64(1)
memory usage: 40.0 KB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2537046 entries, 2020-02-13 15:28:50 to 2020-02-22 17:56:03.250000
Data

### Confirm datetime index is monotonic

In [94]:
for key, df in dfs.items():
    if df.index.is_monotonic_increasing or df.index.is_monotonic_decreasing:
        print(f"Index for DataFrame: {key} is monotonic")
    else:
        print(f"Index for DataFrame: {key} is NOT monotonic")


Index for DataFrame: ACC_001 is monotonic
Index for DataFrame: BVP_001 is monotonic
Index for DataFrame: Dexcom_001 is monotonic
Index for DataFrame: EDA_001 is monotonic
Index for DataFrame: HR_001 is monotonic
Index for DataFrame: IBI_001 is monotonic
Index for DataFrame: TEMP_001 is monotonic


### Early Feature Engineering

In [110]:
# Define parameters for peak detection
height = 0
distance = 4
prominence = 0.3

# This function finds peaks in a timeseries and returns the number of peaks.
def count_peaks(x):
    peaks, _ = find_peaks(x, height=height, distance=distance, prominence=prominence)
    return len(peaks)

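def process_df(df, resample_period='5min', calculate_peaks=False, rolling_2h=False):
    """Aggregate every column of df into resample_period windows.

    Returns one row per window with mean, std, min, max, Q1, Q3 and skewness,
    plus an optional peak count and an optional trailing 2-hour mean and max.
    """
    # Resample the data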
    resampled = df.resample(resample_period)

    # Initialize an empty dataframe to store results
    df_result = pd.DataFrame(index=resampled.indices.keys())

    for col_name in df.columns:
        # Calculate statistics for each column
        df_result[f'{col_name}_mean'] = resampled[col_name].mean()
        df_result[f'{col_name}_std'] = resampled[col_name].std()
        df_result[f'{col_name}_min'] = resampled[col_name].min()
        df_result[f'{col_name}_max'] = resampled[col_name].max()
        df_result[f'{col_name}_q1'] = resampled[col_name].quantile(0.25)
        df_result[f'{col_name}_q3'] = resampled[col_name].quantile(0.75)
        df_result[f'{col_name}_skew'] = resampled[col_name].skew()

        if calculate_peaks:
            df_result[f'{col_name}_peaks'] = resampled[col_name].apply(count_peaks)

        if rolling_2h:
            rolling_2h_agg_func = df[col_name].rolling('2h').agg(['mean', 'max'])
            df_result[f'{col_name}_2hr_mean'] = rolling_2h_agg_func['mean'].resample(resample_period).last()
            df_result[f'{col_name}_2hr_max'] = rolling_2h_agg_func['max'].resample(resample_period).last()

    return df_result


# Then apply this function to each of your dataframes
dfs['EDA_001'] = process_df(dfs['EDA_001'], calculate_peaks=True)

# Apply function for dfs that need 2h rolling features
dfs['ACC_001'] = process_df(dfs['ACC_001'], rolling_2h=True)

# Apply function for remaining dfs
for name in ['HR_001', 'TEMP_001', 'IBI_001', 'BVP_001']:
    dfs[name] = process_df(dfs[name])
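
To make the `find_peaks` parameters used in `count_peaks` concrete, here is a small sketch on a synthetic trace (the values are invented, not EDA data). With `height=0`, `distance=4` and `prominence=0.3`, a candidate peak survives only if it is not within 4 samples of a taller peak and its prominence is at least 0.3.

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic trace: flat baseline with four spikes of different heights.
signal = np.zeros(40)
signal[10] = 1.0  # tall, isolated peak: kept
signal[12] = 0.9  # only 2 samples from the taller peak at index 10: dropped by distance=4
signal[20] = 0.5  # prominence 0.5 >= 0.3: kept
signal[30] = 0.2  # prominence 0.2 < 0.3: dropped

peaks, properties = find_peaks(signal, height=0, distance=4, prominence=0.3)
n_peaks = len(peaks)

print("peak indices:", peaks.tolist())
print("prominences:", properties["prominences"].tolist())
```

Only the spikes at indices 10 and 20 are counted. Since EDA is sampled at 4 Hz here, `distance=4` corresponds to roughly one second between counted peaks.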


In [111]:
print(dfs['IBI_001'].head())

                     ibi_mean   ibi_std   ibi_min   ibi_max    ibi_q1  \
2020-02-13 15:30:00  0.903166  0.059910  0.828163  0.984420  0.875040   
2020-02-13 15:35:00  0.849333  0.228782  0.468771  1.140677  0.625028   
2020-02-13 15:40:00  0.930846  0.159200  0.437520  1.078174  0.910197   
2020-02-13 15:45:00  0.953820  0.157979  0.562526  1.250057  0.890666   
2020-02-13 15:50:00  0.937543  0.098188  0.734409  1.125051  0.859414   

                       ibi_q3  ibi_skew  
2020-02-13 15:30:00  0.937543  0.253720  
2020-02-13 15:35:00  1.039110 -0.492573  
2020-02-13 15:40:00  1.023484 -2.479890  
2020-02-13 15:45:00  1.046923 -0.681244  
2020-02-13 15:50:00  1.000046 -0.440397  
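
The 5-minute aggregation in `process_df` can be checked by hand on a tiny series. The sketch below uses a hypothetical heart-rate column sampled once per minute and computes a subset of the same statistics with a single `agg` call; the helper names `q1` and `q3` and all values are illustrative.

```python
import pandas as pd

# Ten hypothetical heart-rate samples, one per minute (two 5-minute windows).
index = pd.date_range("2020-02-13 15:30:00", periods=10, freq="1min")
hr = pd.Series(
    [60, 62, 64, 66, 68, 70, 70, 70, 70, 70],
    index=index,
    dtype="float64",
    name="hr",
)


def q1(x):
    """25% quantile of a window."""
    return x.quantile(0.25)


def q3(x):
    """75% quantile of a window."""
    return x.quantile(0.75)


# One row per 5-minute window, one column per statistic.
features = hr.resample("5min").agg(["mean", "std", "min", "max", q1, q3])
features.columns = [f"hr_{name}" for name in features.columns]

print(features)
```

Each row is labeled with the start of its window (the pandas default for a `5min` frequency), so the row at 15:30:00 summarizes the samples from 15:30:00 up to, but not including, 15:35:00.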


In [None]:
for key, df in dfs.items():
    dfs[key].info()

### Merge separate dfs into 1 resampling on Dexcom (glucose)

In [None]:
# Starting with Dexcom_001 DataFrame and setting 'glucose' as the only column
merged_df = dfs['Dexcom_001'][['glucose']]

# Making sure the index is sorted
merged_df = merged_df.sort_index()

for key, df in dfs.items():
    # skip if the current dataframe is 'Dexcom_001'
    if key == 'Dexcom_001':
        continue

    # make sure the df is sorted by index
    df_sorted = df.sort_index()

    # Merge with the current dataframe
    merged_df = pd.merge_asof(merged_df, df_sorted, left_index=True, right_index=True, direction='nearest',
                              tolerance=pd.Timedelta('4min'))

merged_df.info()  # print the info of the merged dataframe
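
The behavior of `pd.merge_asof` with `direction='nearest'` and a 4-minute tolerance is easiest to see on a few rows. The following sketch uses invented glucose readings and invented 5-minute heart-rate features; only the call signature mirrors the cell above.

```python
import pandas as pd

# Hypothetical glucose readings (left side of the merge).
glucose = pd.DataFrame(
    {"glucose": [95.0, 101.0, 110.0]},
    index=pd.to_datetime(
        ["2020-02-13 17:23:32", "2020-02-13 17:28:32", "2020-02-13 17:48:32"]
    ),
)

# Hypothetical 5-minute features (right side), with nothing after 17:30.
hr_features = pd.DataFrame(
    {"hr_mean": [61.0, 63.0, 65.0]},
    index=pd.to_datetime(
        ["2020-02-13 17:20:00", "2020-02-13 17:25:00", "2020-02-13 17:30:00"]
    ),
)

aligned = pd.merge_asof(
    glucose,
    hr_features,
    left_index=True,
    right_index=True,
    direction="nearest",
    tolerance=pd.Timedelta("4min"),
)

print(aligned)
```

The first two readings pick up the feature row closest in time, while the third has no feature row within 4 minutes and gets `NaN`, which is what the later `dropna()` removes. Because `resample('5min')` labels each window by its start time, a nearest match can pair a glucose reading with a window that begins after it; whether that is acceptable depends on whether the features are meant to be strictly historical.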


### Drop NaN values

In [None]:
# Drop rows with NaN values
merged_df = merged_df.dropna()

# Print the info of the DataFrame after dropping NaN rows
merged_df.info()


### Inspect merged_df

In [None]:
merged_df.head()

In [None]:
merged_df.info()

### All patient merge code

In [None]:
# This function finds peaks in a timeseries and returns the number of peaks.
def count_peaks(x):
    height = 0
    distance = 4
    prominence = 0.3
    peaks, _ = find_peaks(x, height=height, distance=distance, prominence=prominence)
    return len(peaks)


def process_df(df, resample_period='5min', calculate_peaks=False, rolling_2h=False):
    # Resample the data
    resampled = df.resample(resample_period)

    # Initialize an empty dataframe to store results
    df_result = pd.DataFrame(index=resampled.indices.keys())

    for col_name in df.columns:
        # Calculate statistics for each column
        df_result[f'{col_name}_mean'] = resampled[col_name].mean()
        df_result[f'{col_name}_std'] = resampled[col_name].std()
        df_result[f'{col_name}_min'] = resampled[col_name].min()
        df_result[f'{col_name}_max'] = resampled[col_name].max()
        df_result[f'{col_name}_q1'] = resampled[col_name].quantile(0.25)
        df_result[f'{col_name}_q3'] = resampled[col_name].quantile(0.75)
        df_result[f'{col_name}_skew'] = resampled[col_name].skew()

        if calculate_peaks:
            df_result[f'{col_name}_peaks'] = resampled[col_name].apply(count_peaks)

        if rolling_2h:
            rolling_2h_agg_func = df[col_name].rolling('2h').agg(['mean', 'max'])
            df_result[f'{col_name}_2hr_mean'] = rolling_2h_agg_func['mean'].resample(resample_period).last()
            df_result[f'{col_name}_2hr_max'] = rolling_2h_agg_func['max'].resample(resample_period).last()

    return df_result

def process_patient_data(patient_id):
    # Filepaths in the local directory
    filepaths = [f'Data/{patient_id}/ACC_{patient_id}.csv', f'Data/{patient_id}/BVP_{patient_id}.csv',
                 f'Data/{patient_id}/Dexcom_{patient_id}.csv', f'Data/{patient_id}/EDA_{patient_id}.csv',
                 f'Data/{patient_id}/HR_{patient_id}.csv', f'Data/{patient_id}/IBI_{patient_id}.csv',
                 f'Data/{patient_id}/TEMP_{patient_id}.csv']

    # Dictionary to store the dataframes
    dfs = {}

    for csv in filepaths:
        key = csv.split('/')[-1].split('.')[0]  # Get the filename without the extension
        dfs[key] = pd.read_csv(csv)  # Read the csv file and store the DataFrame in the dictionary

        # Remove leading/trailing spaces from column names
        dfs[key].columns = dfs[key].columns.str.strip()

        # Special preprocessing for Dexcom files
        if 'Dexcom' in key:
            dfs[key] = dfs[key].rename(columns={'Timestamp (YYYY-MM-DDThh:mm:ss)': 'datetime'})
            dfs[key]['datetime'] = pd.to_datetime(dfs[key]['datetime'], format='mixed')
            dfs[key] = dfs[key][['datetime', 'Glucose Value (mg/dL)']].rename(columns={"Glucose Value (mg/dL)": "glucose"})
            dfs[key].dropna(subset=['datetime'], inplace=True)

        if 'datetime' in dfs[key].columns:
            dfs[key]['datetime'] = pd.to_datetime(dfs[key]['datetime'], format='mixed')

        # Set 'datetime' column as index and sort the DataFrame by index
        if 'datetime' in dfs[key].columns:
            dfs[key].set_index('datetime', inplace=True)
            dfs[key].sort_index(inplace=True)
        
        # Special preprocessing for ACC files: acceleration magnitude as the
        # Euclidean norm of the three axes
        if 'ACC' in key:
            dfs[key]['acc'] = np.sqrt((dfs[key][['acc_x', 'acc_y', 'acc_z']] ** 2).sum(axis=1))

    processed_dfs = {}

    for name in [f'{prefix}_{patient_id}' for prefix in ['EDA', 'ACC', 'HR', 'TEMP', 'IBI', 'BVP']]:
        if name.startswith('EDA'):
            processed_dfs[name] = process_df(dfs[name], calculate_peaks=True)
        elif name.startswith('ACC'):
            processed_dfs[name] = process_df(dfs[name], rolling_2h=True)
        else:
            processed_dfs[name] = process_df(dfs[name])

    # Merge all dataframes
    merged_df = dfs[f'Dexcom_{patient_id}'][['glucose']].sort_index()
    for key, df in processed_dfs.items():
        if key == f'Dexcom_{patient_id}':
            continue
        df_sorted = df.sort_index()
        merged_df = pd.merge_asof(merged_df, df_sorted, left_index=True, right_index=True, direction='nearest',
                                  tolerance=pd.Timedelta('4min'))

    merged_df = merged_df.dropna()
    return merged_df



# List of patient ids
patient_ids = [f"{i + 1:03d}" for i in range(16)]  # This will generate a list: ['001', '002', ..., '015', '016']

# List to store each processed dataframe
dfs_list = []

for pid in patient_ids:
    df = process_patient_data(pid)

    # Add a 'patient_id' column
    df['patient_id'] = pid

    dfs_list.append(df)

    print(f'Patient {pid} is finished')

# Concatenate all dataframes in the list into one
final_df = pd.concat(dfs_list)
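
The per-patient loop can be pictured on toy data. This sketch replaces `process_patient_data` with a hypothetical stand-in, `fake_patient_frame`, that returns two rows per patient, then tags and concatenates the frames in the same way.

```python
import pandas as pd

# Zero-padded IDs for three hypothetical patients: ['001', '002', '003']
patient_ids = [f"{i:03d}" for i in range(1, 4)]


def fake_patient_frame(pid):
    """Hypothetical stand-in for process_patient_data: two glucose rows."""
    index = pd.date_range("2020-02-13 17:00", periods=2, freq="5min", name="datetime")
    return pd.DataFrame({"glucose": [100.0, 105.0]}, index=index)


# Tag each frame with its patient ID (stored as an integer), then stack them.
frames = [fake_patient_frame(pid).assign(patient_id=int(pid)) for pid in patient_ids]
combined = pd.concat(frames)

print(combined.groupby("patient_id").size())
print("index is unique:", combined.index.is_unique)
```

In this toy case the concatenated datetime index repeats across patients, so rows are only identified by the combination of timestamp and `patient_id`. The same can happen with real recordings that overlap in time.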


In [None]:
final_df.info()

In [None]:
final_df.describe()

In [None]:
final_df.groupby('patient_id').size()


In [None]:
final_df

In [None]:
final_df.columns

In [None]:
final_df.dtypes

In [None]:
final_df['patient_id'] = final_df['patient_id'].astype(int)


### Demographics CSV

In [None]:
# Load the CSV file from the 'Data' folder in your current directory
demographics_df = pd.read_csv('./Data/Demographics.csv')

# Display the full demographics DataFrame
demographics_df


In [None]:
# Convert the 'datetime' index into a column
final_df.reset_index(inplace=True)

### Demographics Merge

In [None]:
# Merge final_df with demographics_df on the patient ID
patient_df = final_df.merge(demographics_df, left_on='patient_id', right_on='ID', how='left')

# Drop 'ID' (a duplicate of 'patient_id') and any leftover index columns.
# 'index' and 'level_0' only exist if reset_index() was run more than once,
# so errors='ignore' keeps this cell safe to re-run.
patient_df = patient_df.drop(columns=['ID', 'index', 'level_0'], errors='ignore')

# Print the first 5 rows of the new DataFrame
patient_df.head()
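
A left merge like the one above silently duplicates rows if the demographics table ever contains the same ID twice. As a hedged sketch, the `validate` argument of `DataFrame.merge` can guard against that; the column names match the notebook, but the rows below are invented.

```python
import pandas as pd

# Hypothetical wearable rows: several per patient.
wearables = pd.DataFrame(
    {"patient_id": [1, 1, 2], "glucose": [100.0, 104.0, 98.0]}
)

# Hypothetical demographics: exactly one row per patient.
demographics = pd.DataFrame(
    {"ID": [1, 2], "Gender": ["FEMALE", "MALE"], "HbA1c": [5.5, 5.6]}
)

# validate='many_to_one' raises pandas.errors.MergeError if 'ID' is not unique.
merged = wearables.merge(
    demographics,
    left_on="patient_id",
    right_on="ID",
    how="left",
    validate="many_to_one",
).drop(columns="ID")

print(merged)
```

The merged frame keeps one row per wearable row, with each patient's demographics repeated on every one of their rows.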


In [None]:
patient_df.columns

### Reorder columns

In [None]:
# Columns to place first, in this order; all remaining columns keep their current order
leading_cols = ['datetime', 'patient_id', 'glucose', 'Gender', 'HbA1c', 'acc_mean', 'bvp_mean',
                'eda_mean', 'hr_mean', 'ibi_mean', 'temp_mean']
patient_df = patient_df[leading_cols + [col for col in patient_df.columns if col not in leading_cols]]


In [None]:
patient_df.columns

### patient df csv

In [None]:
patient_df.to_csv('patient_df.csv', index=False)
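
CSV stores timestamps as plain text, so the `datetime` column comes back as strings unless it is parsed on load. The sketch below shows a round trip through an in-memory buffer with a one-row, made-up frame; the real file would be read the same way by passing `'patient_df.csv'` instead of the buffer.

```python
import io

import pandas as pd

# One hypothetical row with the same leading columns as patient_df.
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2020-02-13 17:25:00"]),
        "patient_id": [1],
        "glucose": [101.0],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when loading.
restored = pd.read_csv(buffer, parse_dates=["datetime"])

print(restored.dtypes)
```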