# Project Description

Use data collected from a Samsung health app to draw relevant conclusions.

#### Files:
+ [sleep-export2.csv](https://www.dropbox.com/s/7fdmc0l3410g8hu/sleep-export2.csv?dl=0)
+ [exercise.csv](https://www.dropbox.com/s/swvtjxw2ilcn4pl/exercise.csv?dl=0)
+ [heart_rate.csv](https://www.dropbox.com/s/7h2sphkvf4cjbsh/heart_rate.csv?dl=0)
+ [Step_Count](https://www.dropbox.com/s/4edk6mwwsb6dogp/step_co7unt.csv?dl=0)
+ [Floors_climbed](https://www.dropbox.com/s/wyde3yf57gurp1v/floors_climbed.csv?dl=0)

#### Jupyter Notebook:
+ Set up
  + Imports
  + Define Retrieve_Data class
+ Preprocess each data file individually
  + Convert time labels to meaningful format
  + Create coarse features, such as
    + Sleep hour
    + Day of the week
    + Time since timezone has changed
+ Merge data from the multiple sources
+ Analyze individual files
+ Analyze the combined data

#### Classes:
+ Retrieve_Time
+ Merge_Data

#### Samsung app documentation:
+ [Technical details](https://developer.samsung.com/html/techdoc/ProgrammingGuide_SHealthService.pdf)
+ [Property description](https://developer.samsung.com/onlinedocs/health/index.html?com/samsung/android/sdk/healthdata/HealthConstants.Sleep.html)
+ [Health data](https://developer.samsung.com/onlinedocs/health/index.html?com/samsung/android/sdk/healthdata/HealthConstants.html)

#### Notes:
+ The reported times are all measured at the UTC timezone. They are corrected for the local time for these analyses. (See Field Detail - START_TIME in the app [documentation](https://developer.samsung.com/onlinedocs/health/index.html?com/samsung/android/sdk/healthdata/HealthConstants.Sleep.html).


---
# Set up
+ Imports
+ Define Retrieve_Data class
+ Define Merge_Data class
---

#### Imports

In [2]:
import sys
import os
import calendar
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.ticker import MultipleLocator
from datetime import datetime
from datetime import timedelta
from dateutil import tz
from collections import Counter
%matplotlib inline

#Set matplotlib variables for prettier plots.
mpl.rcParams['mathtext.fontset'] = 'stix'
mpl.rcParams['mathtext.fontset'] = 'stix'
mpl.rcParams['font.family'] = 'STIXGeneral'
fs = 36.

#### Rerieve_Time

In [3]:
class Retrieve_Timestamps(object):
    """
    Description:
    ------------
    Given a list of time strings, convert it to a datetime object which is in the
    local timezone.
    """        
    def __init__(self, target, tz_true, tz_used, inp_format, time_format):
        self.target = target
        self.tz_true = tz_true
        self.tz_used = tz_used
        self.inp_format = inp_format
        self.time_format = time_format
        
        self.out = None
        self.time_obj = None
                
        self.create_date_obj()
        self.get_corrected_timeobj()

    def create_date_obj(self):
        if self.inp_format == 'datestr':
            self.time_obj = np.array(
              [datetime.strptime(t, self.time_format) for t in self.target])
        elif self.inp_format == 'milisec':
            self.time_obj = np.array(
              [datetime.fromtimestamp(t) for t in self.target])            
        else:
            raise ValueError('inp_format of %s is not accepeted.'\
                             %(self.inp_format))

    def get_corrected_timeobj(self):
        from_zone = tz.gettz(self.tz_used)
        to_zone = [tz.gettz(t_off) for t_off in self.tz_true]
        
        #The timezone information below is erased for compatibility with pandas resample.
        self.out = np.array([_t.replace(tzinfo=from_zone).astimezone(_to_zone).replace(tzinfo=None)
                            for (_t,_to_zone) in zip(self.time_obj,to_zone)])

#Run tests
def run_tests():
    assert\
        Retrieve_Timestamps(['2018-04-07 17:26:10'], 'UTC', 'UTC-0200', 'datestr', '%Y-%m-%d %H:%M:%S').out\
        == [datetime(2018, 4, 7, 15, 26, 10)], ValueError('Time conversion not working.')
    assert\
        Retrieve_Timestamps([1326244364], 'UTC', 'UTC-0200', 'milisec', '%Y-%m-%d %H:%M:%S').out\
        == [datetime(2012, 1, 10, 17, 12, 44)], ValueError('Time conversion not working.')
    
run_tests()

---
# Analysis: sleep data
---

#### Retrieve data

In [4]:
fpath = './data/sleep-export2.csv'
sleep_df = pd.read_csv(fpath, header=0, index_col=0, low_memory=False)

#Rename columns for simplicity.
newcols = {col : col.replace('com.samsung.health.sleep.', '') for col in sleep_df.columns}
sleep_df.rename(columns=newcols, inplace=True)    

#### Preprocess data

In [52]:
#Use the Retrieve_Timestamps class to convert timestamps to readable values.
time_format = '%Y-%m-%d %H:%M:%S.%f'
starttime_obj = Retrieve_Timestamps(
  sleep_df['start_time'].values/1000., sleep_df['time_offset'].values, 'UTC', 'milisec', time_format).out
endtime_obj = Retrieve_Timestamps(
  sleep_df['end_time'].values/1000., sleep_df['time_offset'].values, 'UTC', 'milisec', time_format).out
sleep_df['Start_time_obj'] = starttime_obj

#Compute the measurement date (without including hour--for merging purposes).
sleep_df['date'] = np.array([t.strftime('%Y/%m/%d') for t in starttime_obj])

#Compute hour of the day the measurement started.
sleep_df['start_hour'] = np.array([t.hour + t.minute/60. + t.second/3600. for t in starttime_obj])

#Compute day of the week.
sleep_df['weekday'] = np.array([calendar.day_name[t.weekday()] for t in starttime_obj])

#Compute sleep duration.
duration = endtime_obj - starttime_obj
sleep_df['sleep_duration'] = np.array([t.days*24.*60 + t.seconds/60. for t in duration]) #In minutes

#Compute time progression.
ref_date = min(starttime_obj)
time_prog = starttime_obj - ref_date
sleep_df['time_prog'] = np.array([t.days + t.seconds/86400. for t in time_prog]) #In days

#Sort sleep data according to time progression.
sleep_df.sort_values(by ='time_prog', inplace=True)

print('Start date: ', (min(starttime_obj)))
print('End date: ', (max(starttime_obj)))

Start date:  2017-12-15 20:02:00
End date:  2018-07-28 21:04:00


In [33]:
#Compute time elapsed since a timezone change.
#Compute number of days since time zone change.
time_since = 0.
tz_duration = []
for ((index2,row2),(index1,row1)) in zip(sleep_df.shift(1).iterrows(),sleep_df.iterrows()):
    if row1['time_offset'] == row2['time_offset']:
        time_since += (row1['time_prog'] - row2['time_prog']) #Additional time transpired since the tz changed.
    else:
        time_since = 0.
    tz_duration.append(time_since)

tz_duration = np.array(tz_duration)    

#We do not need tz_duration to be fine for plotting purposes.
def coarsify_duration(x):
    if x <= 2.:
        return 'tz < 2'
    elif (x > 2.) and (x <= 5.):
        return '2 < tz < 5'
    elif (x > 5.):
        return 'tz > 5'

sleep_df['tz_duration'] = [coarsify_duration(tz) for tz in tz_duration]
#We don't know for how long the person had been on the initial time zone.
sleep_df['tz_duration'].iloc[0:3] = np.nan

In [76]:
#Aggregate the data according to date.
aggregator = {'sleep_duration':'sum', 'efficiency':'mean', 'tz_duration':'first', 'date':'first', 'weekday':'first'}
sleep_agg_df = sleep_df.resample('D', on='Start_time_obj').agg(aggregator)
sleep_agg_df = sleep_agg_df.reset_index()
sleep_agg_df = sleep_agg_df.dropna(subset=['date'])

---
# Analysis: Exercise
---

#### Retrieve data

In [35]:
fpath = './data/exercise.csv'
exer_df = pd.read_csv(fpath, header=0, index_col=0, low_memory=False)

#### Preprocess data

In [64]:
time_format = '%Y-%m-%d %H:%M:%S.%f'
#Use the Retrieve_Timestamps class to convert timestamps to readable values.
starttime_obj = Retrieve_Timestamps(
  exer_df['start_time'].values, exer_df['time_offset'].values, 'UTC', 'datestr', time_format).out
endtime_obj = Retrieve_Timestamps(
  exer_df['end_time'].values, exer_df['time_offset'].values, 'UTC', 'datestr', time_format).out
exer_df['Start_time_obj'] = starttime_obj

#Compute the measurement date (without including hour--for merging purposes).
exer_df['date'] = np.array([t.strftime('%Y/%m/%d') for t in starttime_obj])

#Compute duration. This is, supposedly always 1min.
duration = endtime_obj - starttime_obj
exer_df['exer_duration'] = np.array([t.days*24.*60 + t.seconds/60. for t in duration]) #In minutes

#Compute hour of the day the measurement started.
exer_df['start_hour'] = np.array([t.hour + t.minute/60. + t.second/3600. for t in starttime_obj])

#Compute day of the week.
exer_df['weekday'] = np.array([calendar.day_name[t.weekday()] for t in starttime_obj])

#Compute time progression.
ref_date = min(starttime_obj)
time_prog = starttime_obj - ref_date
exer_df['time_prog'] = np.array([t.days + t.seconds/86400. for t in time_prog]) #In days

#Sort sleep data according to time progression.
exer_df.sort_values(by ='time_prog', inplace=True)

print('Start date: ', (min(starttime_obj)))
print('End date: ', (max(starttime_obj)))

#Note: The duration seems to be a fixed small interval, which
#indicates that the heart_rate entry is the instantaneous heart_rate
#at the start time.

Start date:  2016-06-25 04:28:30.517000
End date:  2018-05-03 05:04:31


In [24]:
#Aggregate the data according to date.
aggregator = {'distance':'sum', 'exer_duration':'sum', }
exer_agg_df = exer_df.resample('D', on='Start_time_obj').agg(aggregator)
exer_agg_df = exer_agg_df.reset_index()

# Analysis: Steps

#### Retrieve data

In [12]:
fpath = './data/step_count.csv'
step_df = pd.read_csv(fpath, header=0, index_col=0, low_memory=False)

#### Preprocess data


---
# Analysis: Heart rate
---

#### Retrieve data

In [41]:
fpath = './data/heart_rate.csv'
heart_df = pd.read_csv(fpath, header=0, index_col=0, low_memory=False)

#### Preprocess data


In [49]:
time_format = '%Y-%m-%d %H:%M:%S.%f'
#Use the Retrieve_Timestamps class to convert timestamps to readable values.
starttime_obj = Retrieve_Timestamps(
  heart_df['start_time'].values, heart_df['time_offset'].values, 'UTC', 'datestr', time_format).out
#The two earliest dates seem spurious. Remove them.
heart_df = heart_df.drop(heart_df[heart_df.start_time == min(heart_df.start_time)].index)

#Re-calculate datetime objects without the spurious entries.
starttime_obj = Retrieve_Timestamps(
  heart_df['start_time'].values, heart_df['time_offset'].values, 'UTC', 'datestr', time_format).out
endtime_obj = Retrieve_Timestamps(
  heart_df['end_time'].values, heart_df['time_offset'].values, 'UTC', 'datestr', time_format).out

#Compute the measurement date (without including hour--for merging purposes).
heart_df['date'] = np.array([t.strftime('%Y/%m/%d') for t in starttime_obj])

#Compute hour of the day the measurement started.
heart_df['start_hour'] = np.array([t.hour + t.minute/60. + t.second/3600. for t in starttime_obj])

#Compute day of the week.
heart_df['weekday'] = np.array([calendar.day_name[t.weekday()] for t in starttime_obj])

#Compute time progression.
ref_date = min(starttime_obj)
time_prog = starttime_obj - ref_date
heart_df['time_prog'] = np.array([t.days + t.seconds/86400. for t in time_prog]) #In days

#Sort sleep data according to time progression.
heart_df.sort_values(by ='time_prog', inplace=True)

print('Start date: ', (min(starttime_obj)))
print('End date: ', (max(starttime_obj)))

#Note: The duration seems to be a fixed small interval, which
#indicates that the heart_rate entry is the instantaneous heart_rate
#at the start time.

Start date:  2016-06-19 12:19:30.327000
End date:  2018-04-25 14:46:13.476000


---
# Analysis: Floors climbed
---

#### Retrieve data

In [15]:
fpath = './data/floors_climbed.csv'
floor_df = pd.read_csv(fpath, header=0, index_col=0, low_memory=False)

#### Preprocess data


In [66]:
#Use the Retrieve_Timestamps class to convert timestamps to readable values.
starttime_obj = Retrieve_Timestamps(
  floor_df['start_time'].values, floor_df['time_offset'].values, 'UTC', 'datestr', time_format).out
endtime_obj = Retrieve_Timestamps(
  floor_df['end_time'].values, floor_df['time_offset'].values, 'UTC', 'datestr', time_format).out
floor_df['Start_time_obj'] = starttime_obj

#Compute the measurement date (without including hour--for merging purposes).
floor_df['date'] = np.array([t.strftime('%Y/%m/%d') for t in starttime_obj])

#Compute hour of the day the measurement started.
floor_df['start_hour'] = np.array([t.hour + t.minute/60. + t.second/3600. for t in starttime_obj])

#Compute day of the week.
floor_df['weekday'] = np.array([calendar.day_name[t.weekday()] for t in starttime_obj])

#Compute duration. This is, supposedly always 1min.
duration = endtime_obj - starttime_obj
floor_df['floors_duration'] = np.array([t.days*24.*60 + t.seconds/60. for t in duration]) #In minutes

#Compute time progression.
ref_date = min(starttime_obj)
time_prog = starttime_obj - ref_date
floor_df['time_prog'] = np.array([t.days + t.seconds/86400. for t in time_prog]) #In days

#Sort floors data according to time progression.
floor_df.sort_values(by ='time_prog', inplace=True)

print('Start date: ', (min(starttime_obj)))
print('End date: ', (max(starttime_obj)))

<class 'str'>
Start date:  2017-12-16 04:46:12
End date:  2018-05-03 05:13:11


In [77]:
#Aggregate the data according to date.
aggregator = {'floor':'sum', 'floors_duration':'sum', 'date':'first', 'weekday':'first'}
floor_agg_df = floor_df.resample('D', on='Start_time_obj').agg(aggregator)
floor_agg_df = floor_agg_df.reset_index()
floor_agg_df = floor_agg_df.dropna(subset=['date'])

---
# Merge data
---

In [79]:
#master_df = pd.merge(sleep_agg_df, floor_agg_df, how='left', on='date')
master_df = pd.merge(sleep_agg_df, floor_agg_df, how='left', on=['date', 'weekday'])
print(master_df.shape)
print(sleep_agg_df.shape)
print(floor_agg_df.shape)
master_df

(218, 9)
(218, 6)
(115, 5)


Unnamed: 0,Start_time_obj_x,sleep_duration,efficiency,tz_duration,date,weekday,Start_time_obj_y,floor,floors_duration
0,2017-12-15,411.0,94.902916,,2017/12/15,Friday,NaT,,
1,2017-12-17,491.0,96.341460,,2017/12/17,Sunday,2017-12-17,9.0,2.166667
2,2017-12-19,293.0,93.174065,,2017/12/19,Tuesday,2017-12-19,10.0,2.016667
3,2017-12-21,501.0,91.434265,tz < 2,2017/12/21,Thursday,2017-12-21,9.0,3.266667
4,2017-12-22,322.0,93.188850,tz < 2,2017/12/22,Friday,2017-12-22,8.0,3.100000
5,2017-12-23,526.0,94.117645,tz < 2,2017/12/23,Saturday,2017-12-23,10.0,4.916667
6,2017-12-24,563.0,93.262410,2 < tz < 5,2017/12/24,Sunday,2017-12-24,12.0,3.450000
7,2017-12-25,502.0,90.323890,2 < tz < 5,2017/12/25,Monday,2017-12-25,79.0,29.700000
8,2017-12-27,545.0,89.926735,tz > 5,2017/12/27,Wednesday,2017-12-27,6.0,1.883333
9,2017-12-28,495.0,90.120964,tz > 5,2017/12/28,Thursday,2017-12-28,2.0,0.733333


In [19]:
#Derive quantities for merged data. This avoids confusion in trying
#to merge these quantities from the original data sets.
time_obj = pd.to_datetime(master_df['Start_time_obj'].values)

#Compute day of the week.
master_df['weekday'] = np.array([calendar.day_name[t.weekday()] for t in time_obj])

#Compute time progression.
ref_date = min(time_obj)
time_prog = time_obj - ref_date
master_df['time_prog'] = np.array([t.days for t in time_prog]) #In days, same as the index.
print(master_df)

    Start_time_obj  sleep_duration  efficiency tz_duration  floor  \
0       2017-12-15           411.0   94.902916         NaN    NaN   
1       2017-12-16             0.0         NaN         NaN   10.0   
2       2017-12-17           491.0   96.341460         NaN    9.0   
3       2017-12-18             0.0         NaN         NaN   13.0   
4       2017-12-19           293.0   93.174065         NaN   10.0   
..             ...             ...         ...         ...    ...   
221     2018-07-24           374.0   92.266670      tz > 5    NaN   
222     2018-07-25           499.0   89.779564      tz > 5    NaN   
223     2018-07-26           459.0   92.826090      tz > 5    NaN   
224     2018-07-27           521.0   90.804596      tz > 5    NaN   
225     2018-07-28           425.0   89.906105      tz > 5    NaN   

     floors_duration    weekday  time_prog  
0                NaN     Friday          0  
1           2.266667   Saturday          1  
2           2.166667     Sunday     

---
# Analysis: Combined data
---