# Data Resampling and Cleaning

This file resamples the combined sensor data and cleans the outcomes dataset. We use the cleaned outcomes dataset to select the periods of time in which each activity occurred in the combined sensor dataset. These time segments are then added to a new dataframe, and we write out one file per participant plus one file covering all participants.

__INPUT: .csv files containing the combined sensor data and the outcomes time dataset__ (These files are output from 01_sensor_concat: 10_ACC_Combined.csv, 10_Temp_Combined.csv, 10_EDA_Combined.csv, 10_BVP_Combined.csv, 10_HR_Combined.csv, 10_Outcomes.csv)

__OUTPUT: Datasets for Individuals and Outcomes Dataset w/ End-Times__ (SUBJECT_ID_HERE_aggregated_with_activity.csv for each participant, 30_all_partic_aggregated_with_activity.csv, 20_Outcomes_w_end.csv)

## Imports

In [4]:
import pandas as pd
import datetime 
from datetime import timedelta
import os
import matplotlib.pyplot as plt
import seaborn as sns
import re

import warnings
warnings.simplefilter("ignore")

## Read in Data

In [6]:
ACC1 = pd.read_csv("10_ACC_Combined.csv")
TEMP1 = pd.read_csv("10_Temp_Combined.csv")
EDA1 = pd.read_csv("10_EDA_Combined.csv")
BVP1 = pd.read_csv("10_BVP_Combined.csv")
HR1 = pd.read_csv("10_HR_Combined.csv")

outcomes1 = pd.read_csv("10_Outcomes.csv")

In [4]:
ACC = ACC1.copy()
TEMP = TEMP1.copy()
EDA = EDA1.copy()
BVP = BVP1.copy()
HR = HR1.copy()
outcomes = outcomes1.copy()

## Pre-Processing 

Below we convert the Time column to date-time format, set time as the index, and drop Subject_ID.

In [6]:
# Convert to date time and set it as the index so interpolation can work

ACC['Time'] = pd.to_datetime(ACC['Time'])
ACC = ACC.set_index('Time')

TEMP['Time'] = pd.to_datetime(TEMP['Time'])
TEMP = TEMP.set_index('Time')

EDA['Time'] = pd.to_datetime(EDA['Time'])
EDA = EDA.set_index('Time')

BVP['Time'] = pd.to_datetime(BVP['Time'])
BVP = BVP.set_index('Time')

HR['Time'] = pd.to_datetime(HR['Time'])
HR = HR.set_index('Time')

In [7]:
ids = EDA['Subject_ID'].copy()
ids.isnull().values.any()

False

In [8]:
EDA = EDA.drop(['Subject_ID'], axis = 1)
ACC = ACC.drop(['Subject_ID'], axis = 1)
TEMP = TEMP.drop(['Subject_ID'], axis = 1)
BVP = BVP.drop(['Subject_ID'], axis = 1)
HR = HR.drop(['Subject_ID'], axis = 1)

## Resampling & Interpolation

Our raw sensor data was sampled at various frequencies: 
- Accelerometry 3-axis (32 Hz)
- Skin Temperature (4 Hz)
- Electrodermal Activity (4 Hz)
- Blood Volume Pulse (64 Hz)

In order to put our data into a machine learning model, we needed the same sampling rate for all sensors. Below, we put every sensor on a common 4 Hz grid (one sample every 250 ms): the faster signals are downsampled, and any empty grid points are filled by linear interpolation.

In [10]:
EDA = EDA.resample('250L').interpolate()

In [11]:
TEMP = TEMP.resample('250L').interpolate()

In [12]:
BVP = BVP.resample('250L').interpolate()

In [None]:
HR = HR.resample('250L').interpolate()

In [None]:
ACC = ACC.drop_duplicates()

In [None]:
ACC = ACC.resample('250L').interpolate()

In [15]:
ids = ids.resample('250L').ffill()
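As a rough illustration of what the resampling cells above do, here is a minimal sketch on made-up data. The two toy signals, their sampling rates, and their column values are assumptions for the example only. It uses the `"250ms"` spelling, which newer pandas versions prefer over the `"250L"` alias.

```python
import numpy as np
import pandas as pd

# Hypothetical toy signals: 2 seconds of a 64 Hz sensor and 3 samples of a 1 Hz sensor
fast_index = pd.date_range("2019-07-24 18:22:15", periods=128, freq="15625us")  # 1/64 s
slow_index = pd.date_range("2019-07-24 18:22:15", periods=3, freq="1s")

fast = pd.DataFrame({"BVP": np.arange(128, dtype=float)}, index=fast_index)
slow = pd.DataFrame({"HR": [60.0, 64.0, 68.0]}, index=slow_index)

# Put both signals on the same 250 ms grid
fast_4hz = fast.resample("250ms").interpolate()
slow_4hz = slow.resample("250ms").interpolate()

# Every grid point coincides with an existing 64 Hz sample here, so the fast
# signal is simply thinned to every 16th value (no averaging is applied).
print(fast_4hz["BVP"].tolist())

# The slow signal gains three new grid points per second, filled linearly.
print(slow_4hz["HR"].tolist())
```

In this toy case the downsampled signal keeps only the values that already sit on the grid, while the slower signal is upsampled with straight-line estimates between its original samples.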

## Concatenate Resampled Data From Individual Sensors

In [18]:
comb = pd.concat([ACC, TEMP, EDA, BVP, HR, ids], axis = 1)
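Because the concatenation runs along `axis = 1`, rows are matched on the time index, and timestamps present in only one input are kept with missing values in the other columns. The sketch below shows this on two small made-up inputs; the values and the subject label are hypothetical.

```python
import pandas as pd

# Two hypothetical inputs on the same 250 ms grid, offset by half a second
idx_a = pd.date_range("2019-07-24 18:22:15.000", periods=4, freq="250ms")
idx_b = pd.date_range("2019-07-24 18:22:15.500", periods=4, freq="250ms")

temp = pd.DataFrame({"TEMP": [24.79, 24.79, 24.80, 24.80]}, index=idx_a)
subject = pd.Series(["19-001"] * 4, index=idx_b, name="Subject_ID")

# Column-wise concat aligns on the index and keeps the union of timestamps
toy_comb = pd.concat([temp, subject], axis=1)

print(len(toy_comb))                         # rows in the union of both indexes
print(toy_comb["TEMP"].isna().sum())         # timestamps with no TEMP reading
print(toy_comb["Subject_ID"].isna().sum())   # timestamps with no subject label
```

Rows with missing values in `comb` can therefore indicate time ranges that not every sensor covers.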

## Cleaning Outcomes Dataframe

Now that we have a concatenated dataframe for all participants with all activities, we need to use the outcomes file to select the periods of time where specific activities were occurring and label them.

Experimental Procedure for the 56 Participants
1. seated rest to measure baseline __(4  min)__ <br>
2. paced deep breathing __(1  min)__ <br>
3. physical activity (walking to increase HR up to 50% of the recommended maximum) __(5  min)__ <br>
4. seated rest (washout from physical activity) __(~2  min)__ <br>
5. a typing task __(1  min)__ <br>

### Convert all columns to datetime

In [None]:
for i in list(outcomes.columns[1:9]):
    outcomes[i] = pd.to_datetime(outcomes[i])

### Fix Extra Space

In [22]:
outcomes = outcomes.rename(columns = {'Activity Start 1 ': 'Activity Start 1'})
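If other column names also carried stray whitespace, a more general alternative would be to strip every header at once. This is a sketch on a made-up, empty frame, not a change to the cell above.

```python
import pandas as pd

# Hypothetical frame whose second header has a trailing space
toy_outcomes = pd.DataFrame(columns=["Subject ID", "Activity Start 1 ", "Activity Start 2"])

# Remove leading and trailing whitespace from all column names
toy_outcomes.columns = toy_outcomes.columns.str.strip()

print(list(toy_outcomes.columns))
```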

### Activity Segmentation

Below, we generate end times for each activity so we can subset for the data between the start and end times.

In [23]:
outcomes['Baseline End 1'] = outcomes['Baseline Start 1'] + timedelta(minutes = 4)
outcomes['Baseline End 2'] = outcomes['Baseline Start 2'] + timedelta(minutes = 4)
outcomes['DB End 1'] = outcomes['DB Start 1'] + timedelta(minutes = 1)
outcomes['DB End 2'] = outcomes['DB Start 2'] + timedelta(minutes = 1)
outcomes['Type End 1'] = outcomes['Type Start 1'] + timedelta(minutes = 1)
outcomes['Type End 2'] = outcomes['Type Start 2'] + timedelta(minutes = 1)
outcomes['Activity End 1'] = outcomes['Activity Start 1'] + timedelta(minutes = 5)
outcomes['Activity End 2'] = outcomes['Activity Start 2'] + timedelta(minutes = 5)
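The eight assignments above follow one pattern: the end time is the start time plus the protocol duration for that activity. A compact way to express the same idea is to loop over a duration table, sketched here on a made-up one-row frame. The start times and the `durations_min` name are assumptions for the example.

```python
from datetime import timedelta

import pandas as pd

# Protocol durations in minutes, taken from the experimental procedure above
durations_min = {"Baseline": 4, "DB": 1, "Type": 1, "Activity": 5}

# Hypothetical outcomes row: every activity round starts at a made-up time
toy_outcomes = pd.DataFrame({"Subject ID": ["19-001"]})
start = pd.Timestamp("2019-07-24 10:00:00")
for name in durations_min:
    for rnd in (1, 2):
        toy_outcomes[f"{name} Start {rnd}"] = start
        start += timedelta(minutes=10)

# End time = start time + protocol duration, for both rounds of each activity
for name, minutes in durations_min.items():
    for rnd in (1, 2):
        toy_outcomes[f"{name} End {rnd}"] = (
            toy_outcomes[f"{name} Start {rnd}"] + timedelta(minutes=minutes)
        )

print(toy_outcomes[["Baseline Start 1", "Baseline End 1"]])
```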

### Write Out Updated Outcomes to CSV

We write this dataframe to csv in case it is needed for later use.

In [25]:
outcomes.to_csv('20_Outcomes_w_end.csv', index = False)

## Clean & Segment Combined Sensor Dataset

### Reset Index

In [None]:
comb = comb.reset_index()

### Evaluate Existing Sensor Dataset

The value counts differ widely among subjects. Once the data is segmented, we should see an equal number of data points for all subjects.

In [27]:
comb['Subject_ID'].value_counts()

19-015    1997756
19-040    1920120
19-006    1074288
19-030     866768
19-011     318648
19-054     294888
19-007     287452
19-025     276156
19-047     258428
19-044     253480
19-005     237060
19-021     232516
19-050     232148
19-002     231264
19-017     229320
19-035     181748
19-003     100500
19-028      98668
19-034      86512
19-048      85572
19-055      75140
19-039      72960
19-001      61528
19-020      58352
19-041      56756
19-056      51320
19-036      50656
19-016      45464
19-045      42752
19-009      32092
19-027      31960
19-031      31208
19-014      31172
19-052      31004
19-049      30932
19-012      30600
19-053      29108
19-046      29096
19-019      28884
19-018      28588
19-004      28476
19-008      28004
19-023      27852
19-022      27628
19-024      27468
19-032      27248
19-033      26792
19-013      26404
19-010      25832
19-037      25804
19-038      24796
19-051      23800
19-043      21680
19-026      13176
19-029      10932
19-042    

This imbalance arises because resampling produces a row every 250 ms across the full span of each recording, and for some participants that span covers multiple days, as shown for subject 19-015 below:

In [28]:
comb.loc[comb['Subject_ID'] =='19-015']

| | Time | ACC1 | ACC2 | ACC3 | TEMP | EDA | BVP | HR | Subject_ID |
|---|---|---|---|---|---|---|---|---|---|
| 2513320 | 2019-07-24 18:22:15.000 | 2.006549 | -62.943078 | 9.976828 | 24.790000 | 0.000000e+00 | -0.000000 | 58.054587 | 19-015 |
| 2513321 | 2019-07-24 18:22:15.250 | 2.006003 | -62.947821 | 9.978759 | 24.790000 | 3.842000e-03 | -0.050000 | 58.053222 | 19-015 |
| 2513322 | 2019-07-24 18:22:15.500 | 2.005457 | -62.952565 | 9.980690 | 24.790000 | 2.177200e-02 | 6.250000 | 58.051857 | 19-015 |
| 2513323 | 2019-07-24 18:22:15.750 | 2.004911 | -62.957308 | 9.982621 | 24.790000 | 2.689500e-02 | 28.670000 | 58.050493 | 19-015 |
| 2513324 | 2019-07-24 18:22:16.000 | 2.004366 | -62.962052 | 9.984552 | 24.790000 | 2.561400e-02 | 121.190000 | 58.049128 | 19-015 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4511071 | 2019-07-30 13:06:12.750 | -72.998734 | -33.000060 | 9.999265 | 23.590014 | 2.894725e-08 | -0.000131 | 114.999438 | 19-015 |
| 4511072 | 2019-07-30 13:06:13.000 | -72.998787 | -33.000058 | 9.999295 | 23.590011 | 2.315780e-08 | -0.000105 | 114.999451 | 19-015 |
| 4511073 | 2019-07-30 13:06:13.250 | -72.998840 | -33.000055 | 9.999326 | 23.590008 | 1.736835e-08 | -0.000079 | 114.999463 | 19-015 |
| 4511074 | 2019-07-30 13:06:13.500 | -72.998892 | -33.000053 | 9.999357 | 23.590006 | 1.157890e-08 | -0.000053 | 114.999476 | 19-015 |
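The size of this block follows directly from the time span it covers. As a quick check, the sketch below takes the first and last timestamps displayed above for 19-015 and counts how many 250 ms grid points lie between them, end points included. The variable names are only for illustration.

```python
import pandas as pd

# First and last timestamps displayed above for subject 19-015
first = pd.Timestamp("2019-07-24 18:22:15.000")
last_shown = pd.Timestamp("2019-07-30 13:06:13.500")

span = last_shown - first
step = pd.Timedelta("250ms")

# Number of 4 Hz grid points from the first to the last displayed row, inclusive
rows_spanned = span // step + 1

print(span)          # length of the displayed span
print(rows_spanned)  # rows needed to cover it at 4 Hz
```

The result equals the difference between the first and last displayed row labels plus one (4511074 - 2513320 + 1), so nearly six days of 4 Hz rows account for the large count for this subject.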


### Filtering Combined Dataset for Activity Time Segments

Below we create a new dataframe 'comb_filter,' where we append our segmented times and label each activity with the activity type and the round number for that activity.

In [29]:
comb_filter = pd.DataFrame(columns = comb.columns)

In [30]:
# Segment labels, in the order they are added for each subject
segments = ['Baseline 1', 'Baseline 2', 'DB 1', 'DB 2',
            'Activity 1', 'Activity 2', 'Type 1', 'Type 2']

pieces = []
for subject in outcomes['Subject ID']:
    # Outcomes row and sensor rows belonging to this subject
    subject_outcomes = outcomes.loc[outcomes['Subject ID'] == subject]
    subject_data = comb[comb['Subject_ID'] == subject]

    for label in segments:
        # 'Baseline 1' -> columns 'Baseline Start 1' and 'Baseline End 1'
        name, number = label.rsplit(' ', 1)
        start = subject_outcomes[f'{name} Start {number}'].item()
        end = subject_outcomes[f'{name} End {number}'].item()

        # Keep rows between start and end (both inclusive) and label them
        keep = subject_data.loc[(subject_data['Time']>=start) & (subject_data['Time']<=end)].copy()
        keep['Activity'] = label
        pieces.append(keep)

# DataFrame.append was removed in pandas 2.0, so the segments are collected
# in a list and concatenated once
comb_filter = pd.concat(pieces)

In [32]:
comb_filter['Subject_ID'].value_counts()

19-030    5288
19-004    5288
19-011    5288
19-051    5288
19-016    5288
19-019    5288
19-023    5288
19-035    5288
19-029    5288
19-022    5288
19-046    5288
19-054    5288
19-040    5288
19-053    5288
19-032    5288
19-039    5288
19-003    5288
19-024    5288
19-013    5288
19-006    5288
19-005    5288
19-008    5288
19-031    5288
19-052    5288
19-036    5288
19-012    5288
19-045    5288
19-033    5288
19-047    5288
19-001    5288
19-050    5288
19-021    5288
19-010    5288
19-007    5288
19-041    5288
19-027    5288
19-034    5288
19-055    5288
19-048    5288
19-026    5288
19-018    5288
19-042    5288
19-014    5288
19-043    5288
19-017    5288
19-044    5288
19-025    5288
19-015    5288
19-037    5288
19-020    5288
19-056    5288
19-038    5288
19-049    5288
19-002    5288
19-009    5288
19-028    2665
Name: Subject_ID, dtype: int64

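Much better: every subject except 19-028 (2665 rows) now has 5288 rows, which is consistent with eight labelled segments totalling 22 minutes sampled at 4 Hz. We have successfully segmented the activity data for each individual, although 19-028 has only about half as many rows as the others.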
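The 5288 figure can be reproduced from the protocol durations. The sketch below assumes that every start time falls exactly on a 250 ms grid point, so that the inclusive `>=` / `<=` filter keeps one extra row per segment. The variable names are illustrative.

```python
# Protocol durations in minutes; each activity is performed in two rounds
durations_min = {"Baseline": 4, "DB": 1, "Activity": 5, "Type": 1}
rounds = 2
sampling_hz = 4

# Rows per segment: duration in seconds times 4 Hz, plus 1 for the inclusive end point
rows_per_segment = {
    name: minutes * 60 * sampling_hz + 1 for name, minutes in durations_min.items()
}

expected_rows = rounds * sum(rows_per_segment.values())

print(rows_per_segment)
print(expected_rows)  # 5288
```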

## Create Individual Datasets from Filtered & Combined Sensor Data

We create a list of unique Subject IDs, which is used to create and name the individual csv files.

In [None]:
sub_name = list(comb_filter['Subject_ID'].unique())

Below we save both a file for each individual participant with their segmented activity periods and a csv with all participants and their activity segments.

In [33]:
# Each individual participant
for name in sub_name:
    # to_csv returns None when given a path, so its result is not assigned
    subject_rows = comb_filter[comb_filter['Subject_ID'] == name]
    subject_rows.to_csv(f'./10_Individual Subjects Activity/{name}_aggregated_with_activity.csv', index = False)

In [34]:
#All participants with activity
comb_filter.to_csv("../20_exploratory_data_analysis/30_all_partic_aggregated_with_activity.csv", index = False)
comb_filter.to_csv("../10_code/30_end_pre_processing/31_remove_outliers/30_all_partic_aggregated_with_activity.csv", index = False)