# Physio preprocessing

In this notebook, we perform a preliminary analysis. The first step is preprocessing, in which we add a timestamp to each measurement.

In what follows, we show in detail how the data are preprocessed for each physio metric.
We focus on the data from subject `005`.

At the bottom, we combine the data (for ACC) from this preprocessing with the datasets created by Seung to make sure our analyses result in the same data.

**Notes:** The tags files are empty, so I have not used them.

### Preamble and constants

In [1]:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import os



In [2]:
datadir = "/Users/jokedurnez/Box/CAFE Consortium/Heather Info for CAFE Physio Pilot"
subject = 'WI_AMP_005'

subdir = os.path.join(datadir,'Preliminary Physio Wristband Data for Mollie',subject)
outdir = os.path.join(datadir,'preprocessed',subject)
# create the output directory (and its parent) if it does not exist yet
os.makedirs(outdir, exist_ok=True)
physiometrics = ['ACC',"BVP","EDA",'IBI','TEMP','HR','TAGS']

### Show procedure using temperature data

Below, we walk through each step in detail for the temperature (`TEMP`) data.

In [3]:
# reading in the data
metric = "TEMP"
file = os.path.join(subdir,"%s.csv"%metric)
data = pd.read_csv(file,header=None)
data.columns = ["%s_%i"%(metric,ind) for ind in range(len(data.columns))]

# converting the first timestamp to a datetime field
timestamp = list(data.iloc[0])[0]
timestamp = datetime.fromtimestamp(timestamp)

# converting the sample rate (Hz) to a sampling period (seconds)
samplerate = list(data.iloc[1])[0]
samplerate_seconds = 1/samplerate

# extract the measurements (starting from row 2 - the third row)
measurements = pd.DataFrame(data.iloc[2:]).reset_index(drop=True)

# add timestamps for each measurement
offsets = measurements.index * timedelta(seconds=samplerate_seconds)
measurements['timestamp'] = timestamp + offsets

# show
measurements.head(10)

Unnamed: 0,TEMP_0,timestamp
0,42.63,2018-03-22 14:24:47.000
1,42.63,2018-03-22 14:24:47.250
2,42.63,2018-03-22 14:24:47.500
3,42.63,2018-03-22 14:24:47.750
4,42.63,2018-03-22 14:24:48.000
5,42.63,2018-03-22 14:24:48.250
6,42.63,2018-03-22 14:24:48.500
7,42.63,2018-03-22 14:24:48.750
8,42.55,2018-03-22 14:24:49.000
9,42.55,2018-03-22 14:24:49.250
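
As a compact illustration of the file layout used above (first row: start time as a Unix timestamp; second row: sample rate in Hz; remaining rows: samples), here is a minimal sketch on an in-memory file. The values in `raw` are made up for the example, and the names `raw` and `sketch` are hypothetical. Note that `pd.to_datetime(..., unit="s")` yields UTC-based timestamps, whereas `datetime.fromtimestamp` as used above converts to the machine's local time zone.

```python
import io
import pandas as pd

# Made-up miniature of a TEMP.csv-style file:
# row 0 = start time (Unix seconds), row 1 = sample rate (Hz), then samples
raw = "1521754000.0\n4.0\n42.63\n42.63\n42.55\n"
data = pd.read_csv(io.StringIO(raw), header=None)

start = pd.to_datetime(data.iloc[0, 0], unit="s")  # naive timestamp in UTC
rate_hz = data.iloc[1, 0]
values = data.iloc[2:, 0].reset_index(drop=True)

# sample i was taken i / rate_hz seconds after the start
offsets = pd.to_timedelta(values.index / rate_hz, unit="s")
sketch = pd.DataFrame({"TEMP_0": values, "timestamp": start + offsets})
print(sketch)
```

At 4 Hz the sampling period is 0.25 s, which matches the 250 ms steps in the table above.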


### Repeat preprocessing for all similar files

Now we wrap the code above in a function that we can reuse for all files in the same format.

In [4]:
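def extract_measurements(metric, subdir):
    """Read <metric>.csv from subdir and return its samples with a timestamp column.

    Expects row 0 to hold the start time (Unix timestamp) and row 1 the sample rate (Hz).
    """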
    file = os.path.join(subdir,"%s.csv"%metric)
    data = pd.read_csv(file,header=None)
    data.columns = ["%s_%i"%(metric,ind) for ind in range(len(data.columns))]

    # converting the first timestamp to a datetime field
    timestamp = list(data.iloc[0])[0]
    timestamp = datetime.fromtimestamp(timestamp)

    # converting the sample rate (Hz) to a sampling period (seconds)
    samplerate = list(data.iloc[1])[0]
    samplerate_seconds = 1/samplerate

    # extract the measurements (starting from row 2 - the third row)
    measurements = pd.DataFrame(data.iloc[2:]).reset_index(drop=True)

    # add timestamps for each measurement
    offsets = measurements.index * timedelta(seconds=samplerate_seconds)
    measurements['timestamp'] = timestamp + offsets
    
    return measurements

Next, we run this function on all fixed-rate physio metrics. We store the data in a dictionary for further handling, and we also export the data to CSV files.

**Question:** I'm a bit confused about heart rate.  I don't really get how you can get a heart rate metric on a fixed timestamp, i.e. how this is related to the IBI metrics.  I don't really _have_ to understand it, but it'd be great if you could confirm that this is indeed a fixed-interval measurement for heart rate...
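
On the arithmetic linking the two quantities: an inter-beat interval of `t` seconds corresponds to an instantaneous rate of `60 / t` beats per minute, and averaging such rates over a window gives one value per window, i.e. a fixed-interval series. The sketch below only illustrates that relation with made-up intervals; whether the wristband derives its `HR` file this way is an assumption that this notebook does not verify.

```python
# Made-up inter-beat intervals in seconds (not taken from subject 005)
ibi_seconds = [0.80, 0.75, 1.00]

# an interval of t seconds between beats corresponds to 60 / t beats per minute
instantaneous_bpm = [60.0 / ibi for ibi in ibi_seconds]

# averaging over a window yields a single heart-rate value for that window
mean_bpm = sum(instantaneous_bpm) / len(instantaneous_bpm)
print([round(b, 1) for b in instantaneous_bpm], round(mean_bpm, 1))
```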

In [5]:
measurements = {}

for metric in ['ACC', 'EDA', 'BVP', 'TEMP', 'HR']:
    measurements[metric] = extract_measurements(metric,subdir)
    # for 'ACC': add SVM
    if metric == 'ACC':
        measurements[metric]['SVM'] = np.sqrt(measurements[metric]['ACC_0']**2 + \
            measurements[metric]['ACC_1']**2 + \
            measurements[metric]['ACC_2']**2)
    measurements[metric].to_csv(os.path.join(
        outdir,"PHYSIO_%s_%s.csv"%(subject,metric)),index=False)
    # logging
    start = measurements[metric]['timestamp'].iloc[0].strftime("%H:%M:%S")
    end = measurements[metric]['timestamp'].iloc[-1].strftime("%H:%M:%S")
    print("%s was measured from %s to %s"%(metric, start, end))


ACC was measured from 14:24:47 to 15:02:17
EDA was measured from 14:24:47 to 15:02:16
BVP was measured from 14:24:47 to 15:02:17
TEMP was measured from 14:24:47 to 15:02:12
HR was measured from 14:24:57 to 15:02:17
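
The `SVM` column added for `ACC` is the Euclidean norm of the three axes. As a quick sanity check, the sketch below recomputes it by hand for the first ACC sample (8, 11, 60) that appears in the validation table later in this notebook. Reading `SVM` as "signal vector magnitude" is an assumption about the abbreviation.

```python
import math

# first ACC sample of subject 005, as shown in the validation section
x, y, z = 8, 11, 60

# same formula as in the loop above: sqrt(x^2 + y^2 + z^2)
svm = math.sqrt(x**2 + y**2 + z**2)
print(round(svm, 6))
```

The result agrees with the `SVM` value reported for that row (61.522354).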


### Unique approach for IBI

Below we preprocess IBI. The procedure is as follows: we add each IBI offset (first column, in seconds) to the recording start time that we extract from the BVP file.  **Is this the right approach?** I'm a bit confused since the last IBI measurement for this subject is 40 seconds before the last BVP measurement...

In [6]:
# read in data
IBIdata = pd.read_csv(os.path.join(subdir,"IBI.csv"),header=None)
IBIdata.columns = ['offset_seconds','IBI']

# read in BVP to get onset timestamp
BVPdata = pd.read_csv(os.path.join(subdir,"BVP.csv"),header=None)
BVPonset = datetime.fromtimestamp(BVPdata.iloc[0,0])

# extract timestamps:
# attention: onset = BVPonset + IBIonset (1st column in IBI)
IBItimestamps = [BVPonset + timedelta(seconds = x) for x in 
                 IBIdata['offset_seconds']]

# add timestamp, remove the offset column, and drop the first row
IBIdata['timestamp'] = IBItimestamps
IBIdata = IBIdata.drop(['offset_seconds'],axis=1).iloc[1:]

# export
IBIdata.to_csv(os.path.join(outdir,"PHYSIO_%s_IBI.csv"%(subject)), index=False)

## Validation

Below, I compare the preprocessed data for `ACC` from Seung with the data from this analysis.  As a validation, I create a new dataset which is a join of both analyses.  The steps to make the timepoints and metrics match:

- the timestamps are off by two hours (Seung's timestamps are 2 hours later: 16:24:47 vs. 14:24:47).  Not sure if this is due to a time zone difference or DST?
- Seung's resolution is 1/1000 of a second (milliseconds); mine is in microseconds.

For now, I take this as _evidence_ that this preprocessing is correct.  We will need to figure out which timestamp is the correct one...
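
One possible explanation for a constant shift is that the same Unix timestamp was converted on machines set to different time zones: `datetime.fromtimestamp` without a `tz` argument uses the local time zone. The sketch below is only an illustration of that effect. The timestamp is a rounded, illustrative value, and the two UTC offsets are hypothetical, not confirmed settings of either machine.

```python
from datetime import datetime, timedelta, timezone

ts = 1521754000  # illustrative Unix timestamp (seconds since the epoch)

# hypothetical UTC offsets for two different machines
tz_a = timezone(timedelta(hours=-7))
tz_b = timezone(timedelta(hours=-5))

local_a = datetime.fromtimestamp(ts, tz=tz_a)
local_b = datetime.fromtimestamp(ts, tz=tz_b)

# same instant, but the wall-clock readings differ by the offset difference
gap = local_b.replace(tzinfo=None) - local_a.replace(tzinfo=None)
print(local_a.isoformat(), local_b.isoformat(), gap)
```

Passing an explicit `tz` (for example `timezone.utc`) when converting would make the exported timestamps independent of the machine they were produced on.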

In [7]:
preproc_Seung_dir = "/Users/jokedurnez/Documents/projects/projectsOngoing/accounts/Data/CAFE/Physio/Preliminary Physio Wristband Data for Mollie/Physio+Attention data templete/Converted files for participant 005"
preproc_Seung = pd.read_excel(os.path.join(preproc_Seung_dir,'Physio data/ACC_005.xlsx'),sheet_name='ACC',header=None,skiprows=2)

In [8]:
preproc_Seung.columns = ['ACC_0','ACC_1','ACC_2','index','offset','timestamp']

In [9]:
preproc_Seung.head()

Unnamed: 0,ACC_0,ACC_1,ACC_2,index,offset,timestamp
0,8,11,60,0,1521754000.0,2018-03-22 16:24:47.000
1,7,11,60,1,1521754000.0,2018-03-22 16:24:47.031
2,6,11,60,2,1521754000.0,2018-03-22 16:24:47.062
3,6,10,60,3,1521754000.0,2018-03-22 16:24:47.094
4,6,7,60,4,1521754000.0,2018-03-22 16:24:47.125


In [10]:
# shift Seung's timestamps back by 2 hours so they line up with ours
preproc_Seung['timestamp'] = preproc_Seung['timestamp']-timedelta(hours=2)

In [11]:
# the resolution is different in both datasets.  To make them match, I truncate the timestamps to 1/100 s
def ch_reso(ts):
    dt_str = ts.strftime("%Y-%m-%d %H:%M:%S.%f")
    dt_splt = dt_str.split(".")
    dt_str = "%s.%s"%(dt_splt[0],dt_splt[1][:2])
    return dt_str

preproc_Seung['timestamp_rnd'] = preproc_Seung['timestamp'].apply(ch_reso)
measurements['ACC']['timestamp_rnd'] = measurements['ACC']['timestamp'].apply(ch_reso)
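
As an alternative sketch, the same truncation to 1/100 s can be done without going through strings, using pandas' `Series.dt.floor`. The two example timestamps are the fourth ACC sample at both resolutions; the variable names are hypothetical.

```python
import pandas as pd

# the same sample at microsecond and at millisecond resolution
ts = pd.Series([
    pd.Timestamp("2018-03-22 14:24:47.093750"),
    pd.Timestamp("2018-03-22 14:24:47.094"),
])

# floor to 10 ms, i.e. truncate to 1/100 of a second
keys = ts.dt.floor("10ms")
print(keys)
```

This keeps the key as a datetime rather than a string, so it still sorts and subtracts like a timestamp.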

In [12]:
np.min(measurements['ACC']['timestamp_rnd'])

'2018-03-22 14:24:47.00'

In [13]:
np.min(preproc_Seung['timestamp_rnd'])

'2018-03-22 14:24:47.00'

Below, we can see that the two datasets match once the two steps above are applied.

In [14]:
merged = pd.merge(measurements['ACC'],preproc_Seung,
                  on='timestamp_rnd',how='outer',
                  suffixes=['_Joke','_Seung'])

merged.sort_values(by='timestamp_rnd')

Unnamed: 0,ACC_0_Joke,ACC_1_Joke,ACC_2_Joke,timestamp_Joke,SVM,timestamp_rnd,ACC_0_Seung,ACC_1_Seung,ACC_2_Seung,index,offset,timestamp_Seung
0,8.0,11.0,60.0,2018-03-22 14:24:47.000000,61.522354,2018-03-22 14:24:47.00,8,11,60,0,1.521754e+09,2018-03-22 14:24:47.000
1,7.0,11.0,60.0,2018-03-22 14:24:47.031250,61.400326,2018-03-22 14:24:47.03,7,11,60,1,1.521754e+09,2018-03-22 14:24:47.031
2,6.0,11.0,60.0,2018-03-22 14:24:47.062500,61.294372,2018-03-22 14:24:47.06,6,11,60,2,1.521754e+09,2018-03-22 14:24:47.062
3,6.0,10.0,60.0,2018-03-22 14:24:47.093750,61.122827,2018-03-22 14:24:47.09,6,10,60,3,1.521754e+09,2018-03-22 14:24:47.094
4,6.0,7.0,60.0,2018-03-22 14:24:47.125000,60.704201,2018-03-22 14:24:47.12,6,7,60,4,1.521754e+09,2018-03-22 14:24:47.125
5,5.0,9.0,62.0,2018-03-22 14:24:47.156250,62.849025,2018-03-22 14:24:47.15,5,9,62,5,1.521754e+09,2018-03-22 14:24:47.156
6,14.0,16.0,60.0,2018-03-22 14:24:47.187500,63.655322,2018-03-22 14:24:47.18,14,16,60,6,1.521754e+09,2018-03-22 14:24:47.187
7,15.0,12.0,60.0,2018-03-22 14:24:47.218750,63.000000,2018-03-22 14:24:47.21,15,12,60,7,1.521754e+09,2018-03-22 14:24:47.219
8,4.0,12.0,60.0,2018-03-22 14:24:47.250000,61.318839,2018-03-22 14:24:47.25,4,12,60,8,1.521754e+09,2018-03-22 14:24:47.250
9,4.0,11.0,60.0,2018-03-22 14:24:47.281250,61.131007,2018-03-22 14:24:47.28,4,11,60,9,1.521754e+09,2018-03-22 14:24:47.281
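
Inspecting the first rows by eye only covers a handful of samples. A minimal sketch of a more systematic check, on made-up toy frames rather than the real data, is to pass `indicator=True` to `pd.merge` and count unmatched rows and value mismatches. The frames `left` and `right` and their contents are hypothetical.

```python
import pandas as pd

# toy stand-ins for the two preprocessed ACC datasets
left = pd.DataFrame({"timestamp_rnd": ["t0", "t1", "t2"], "ACC_0": [8, 7, 6]})
right = pd.DataFrame({"timestamp_rnd": ["t0", "t1", "t3"], "ACC_0": [8, 7, 5]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
check = pd.merge(left, right, on="timestamp_rnd", how="outer",
                 suffixes=("_Joke", "_Seung"), indicator=True)
counts = check["_merge"].value_counts()

# among rows present in both, count samples whose values disagree
both = check[check["_merge"] == "both"]
n_mismatch = int((both["ACC_0_Joke"] != both["ACC_0_Seung"]).sum())
print(counts, n_mismatch)
```

On the real data, zero `left_only`/`right_only` rows and zero mismatches would be stronger evidence that the two pipelines agree.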
