# APPLE HEALTH KIT IMPORT TOOL
The purpose of this notebook is to transform elements of *Apple's Health Kit* `export.xml` into Python objects and to visually explore the dataset.\
The focus is on `Workout` and `Record` tags from the *xml* structure.
If you want to load an already processed dataset from a pickle file (without running the time-consuming xml conversion steps), please run the import statements and then jump to [this paragraph](#Load-from-pickle).

It takes around 10 minutes to complete the conversion process with a 500 MB `export.xml` file.

## Import statements
Classes inside the `healthkit.py` module are dedicated to handling *workouts* (`HKWorkout`) and *records* (`HKRecord`).

In [1]:
# Apple Health Kit data handling
import xml.etree.ElementTree as ET
from html.entities import name2codepoint
import healthkit as hk
import concurrent.futures as cf
# Data analysis
import pandas as pd
import numpy as np
# Plots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
# GUI features
import ipywidgets as widgets
import IPython
from tqdm.notebook import tqdm
# Data export
import pickle
# String replacement
import re

# Remove pandas limits on column width and number of displayed rows
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

We removed the *pandas* limits on column width and on the number of displayed rows so that the exploratory tables we are going to print can be scrolled in full.
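
These options are global, so they stay in effect for the whole session. As a minimal sketch of an alternative (not used elsewhere in this notebook), `pd.option_context` applies the same settings only inside a `with` block. The small `demo` dataframe is made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe, only used to illustrate the display options
demo = pd.DataFrame({'Record Type': ['HKQuantityTypeIdentifierHeartRate'] * 3,
                     'Value': [61.0, 64.0, 70.0]})

rows_before = pd.get_option('display.max_rows')

# Limits are lifted only inside the `with` block
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    rows_inside = pd.get_option('display.max_rows')
    text = demo.to_string()

# The previous setting is restored when the block exits
rows_after = pd.get_option('display.max_rows')
```

This keeps the default truncation for any table printed outside the block.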

## Load XML data
The following cell sets the path of the `export.xml` file: point it to the file inside Apple's Health Kit export folder.

In [2]:
xml_file_path = 'apple_health_export/export.xml'

We explore the tree from its root with `ElementTree`.

In [3]:
# Load root
tree = ET.parse(xml_file_path)
root_data = tree.getroot()
print('ROOT')
print(root_data.tag)

# Print attributes
print('\n> ATTRIBUTES')
for attribute in root_data.attrib:
    print(f'\t* {attribute}: {root_data.attrib[attribute]}')

ROOT
HealthData

> ATTRIBUTES
	* locale: it_IT


The root has only one attribute.
## Exploratory analysis
Let's now explore the children: each one has its own *tag* and some defining *attributes*.\
Unique combinations of *tag*, *source*, *type* and *workout type* are shown as rows in the following table.\
This should give us an idea of what's inside *Apple's Health Kit* exported data.

In [4]:
# Print a table with unique combinations of tag, source, type and workout type
df_rows = []
seen = set()
for child in root_data:
    key = (child.tag, child.get('sourceName'), child.get('type'), child.get('workoutActivityType'))
    # Keep only the first element found for each combination
    if key not in seen:
        seen.add(key)
        df_rows.append((*key, list(child.attrib.keys())))
df = pd.DataFrame(df_rows, 
                  columns=['Tag', 'Source', 'Record Type', 'Workout Type', 'Attributes']
                 ).sort_values(['Tag', 'Source']).reset_index(drop=True)
display(df)

Unnamed: 0,Tag,Source,Record Type,Workout Type,Attributes
0,ActivitySummary,,,,"[dateComponents, activeEnergyBurned, activeEnergyBurnedGoal, activeEnergyBurnedUnit, appleMoveTime, appleMoveTimeGoal, appleExerciseTime, appleExerciseTimeGoal, appleStandHours, appleStandHoursGoal]"
1,Correlation,HealthManager,HKCorrelationTypeIdentifierBloodPressure,,"[type, sourceName, sourceVersion, creationDate, startDate, endDate]"
2,ExportDate,,,,[value]
3,Me,,,,"[HKCharacteristicTypeIdentifierDateOfBirth, HKCharacteristicTypeIdentifierBiologicalSex, HKCharacteristicTypeIdentifierBloodType, HKCharacteristicTypeIdentifierFitzpatrickSkinType, HKCharacteristicTypeIdentifierCardioFitnessMedicationsUse]"
4,Record,Apple Watch di Michele,HKQuantityTypeIdentifierHeartRate,,"[type, sourceName, sourceVersion, device, unit, creationDate, startDate, endDate, value]"
5,Record,Apple Watch di Michele,HKQuantityTypeIdentifierStepCount,,"[type, sourceName, sourceVersion, device, unit, creationDate, startDate, endDate, value]"
6,Record,Apple Watch di Michele,HKQuantityTypeIdentifierDistanceWalkingRunning,,"[type, sourceName, sourceVersion, device, unit, creationDate, startDate, endDate, value]"
7,Record,Apple Watch di Michele,HKQuantityTypeIdentifierBasalEnergyBurned,,"[type, sourceName, sourceVersion, device, unit, creationDate, startDate, endDate, value]"
8,Record,Apple Watch di Michele,HKQuantityTypeIdentifierActiveEnergyBurned,,"[type, sourceName, sourceVersion, device, unit, creationDate, startDate, endDate, value]"
9,Record,Apple Watch di Michele,HKQuantityTypeIdentifierFlightsClimbed,,"[type, sourceName, sourceVersion, device, unit, creationDate, startDate, endDate, value]"


Two *tags* are of interest for *walking* and *running* activities:
- `Workout`: a physical activity; *duration* and *energy burned* are the relevant data stored for this class.\
The activity type of each workout is listed under the *Workout Type* column.
- `Record`: a parameter recorded during activities; a *numeric value* and a *unit of measurement* are available for each record. Every record is identified by a *type* attribute, as shown in the *Record Type* column.
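
To make the two tags concrete, here is a minimal sketch that parses a made-up XML fragment with `ElementTree`. The `Record` attribute names come from the table above; the `Workout` attribute names `duration` and `totalEnergyBurned` are assumptions about the export layout, and all values are invented.

```python
import xml.etree.ElementTree as ET

# Made-up sample mimicking the export layout; values are not real data
sample_xml = """
<HealthData locale="it_IT">
  <Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Watch"
          unit="count/min" startDate="2020-06-03 12:50:00 +0200"
          endDate="2020-06-03 12:50:00 +0200" value="92"/>
  <Workout workoutActivityType="HKWorkoutActivityTypeWalking" sourceName="Watch"
           duration="24.8" totalEnergyBurned="110.5"
           startDate="2020-06-03 12:49:17 +0200"
           endDate="2020-06-03 13:14:04 +0200"/>
</HealthData>
"""

root = ET.fromstring(sample_xml)

# Attribute values always come back as strings
record = root.find('Record')
workout = root.find('Workout')
print(record.get('type'), record.get('value'), record.get('unit'))
print(workout.get('workoutActivityType'), workout.get('duration'))
```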

## Data preparation
Now we can transform *workouts* and *records* into Python objects.\
For this purpose I have moved all the code into a dedicated **Python module** (`healthkit.py`).
I have built two classes (`HKWorkout` and `HKRecord`) to support the analysis based on the table above:
basically, each attribute of *records* and *workouts* is saved into a dedicated class property.
Attribute values returned by `ElementTree` are strings (`str`), so each class takes care of data type conversion (to `float`) when appropriate.\
To keep the code *clean* inside the notebook, I call `load_workouts` and `load_records` from the `healthkit` module.
Those functions use the `concurrent.futures` package for **multi-threading**: we have a lot of data to analyze and *threading* speeds up the process.
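
The idea can be sketched as follows. This is not the code in `healthkit.py`: `RecordSketch` and `to_record` are hypothetical names, and the attribute dictionaries are made up. It only illustrates string-to-`float` conversion combined with a thread pool from `concurrent.futures`.

```python
import concurrent.futures as cf
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordSketch:
    """Hypothetical stand-in for HKRecord; not the actual healthkit.py class."""
    type: Optional[str]
    unit: Optional[str]
    value: Optional[float]


def to_record(attrib: dict) -> RecordSketch:
    """Convert raw string attributes into typed fields."""
    raw = attrib.get('value')
    try:
        value = float(raw) if raw is not None else None
    except ValueError:
        # Category records may carry non-numeric values
        value = None
    return RecordSketch(attrib.get('type'), attrib.get('unit'), value)


# Made-up attribute dictionaries, shaped like ElementTree's `attrib`
raw_records = [
    {'type': 'HKQuantityTypeIdentifierHeartRate', 'unit': 'count/min', 'value': '92'},
    {'type': 'HKQuantityTypeIdentifierStepCount', 'unit': 'count', 'value': '31'},
    {'type': 'HKCategoryTypeIdentifierAppleStandHour',
     'value': 'HKCategoryValueAppleStandHourStood'},
]

# executor.map returns results in the same order as the inputs
with cf.ThreadPoolExecutor(max_workers=4) as executor:
    converted = list(executor.map(to_record, raw_records))
```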

A table with existing workouts (sorted by *data source* and *starting time*) is then displayed.

In [5]:
# Threads number for parallel computing
n_threads = 100

# Initiate dataframe rows
df_rows = []

# Return healthkit.HKWorkout objects
workouts = hk.load_workouts(root_data, n_threads=n_threads)

# Return healthkit.HKRecord objects
records = hk.load_records(root_data, n_threads=n_threads)

# Print Workout dataframe
df_rows = [(w.activity_type, w.start_date, w.end_date, w.source_name) for w in workouts]
df = pd.DataFrame(df_rows, columns=['Activity', 'Start', 'End', 'Source'])\
                .sort_values(['Source', 'Start']).reset_index(drop=True)
display(df)

100%|██████████| 379/379 [00:00<00:00, 1826.08wrk/s, Submit jobs for Workout]
100%|██████████| 379/379 [00:00<00:00, 297073.67wrk/s, Convert Workout]
100%|██████████| 1368056/1368056 [00:40<00:00, 33770.79rec/s, Submit jobs for Record]
100%|██████████| 1368056/1368056 [10:07<00:00, 2250.25rec/s, Convert Record]


Unnamed: 0,Activity,Start,End,Source
0,HKWorkoutActivityTypeWalking,2020-06-03 12:49:17+02:00,2020-06-03 13:14:04+02:00,Apple Watch di Michele
1,HKWorkoutActivityTypeWalking,2020-06-03 14:10:59+02:00,2020-06-03 14:31:26+02:00,Apple Watch di Michele
2,HKWorkoutActivityTypeTennis,2020-06-06 15:13:00+02:00,2020-06-06 17:07:35+02:00,Apple Watch di Michele
3,HKWorkoutActivityTypeWalking,2020-06-09 17:41:38+02:00,2020-06-09 18:13:07+02:00,Apple Watch di Michele
4,HKWorkoutActivityTypeTennis,2020-06-10 08:08:20+02:00,2020-06-10 08:47:00+02:00,Apple Watch di Michele
5,HKWorkoutActivityTypeTennis,2020-06-11 18:39:38+02:00,2020-06-11 19:30:22+02:00,Apple Watch di Michele
6,HKWorkoutActivityTypeWalking,2020-06-14 16:49:37+02:00,2020-06-14 17:19:41+02:00,Apple Watch di Michele
7,HKWorkoutActivityTypeTennis,2020-06-15 17:19:22+02:00,2020-06-15 18:00:07+02:00,Apple Watch di Michele
8,HKWorkoutActivityTypeWalking,2020-06-16 12:47:41+02:00,2020-06-16 13:10:05+02:00,Apple Watch di Michele
9,HKWorkoutActivityTypeTennis,2020-06-19 20:07:09+02:00,2020-06-19 21:11:36+02:00,Apple Watch di Michele


Finally, the function `build_workouts_timeseries` from the `healthkit` module creates a list of dictionaries.\
Each dictionary has the following keys:
- `workout`: the reference workout object (namely `HKWorkout`).
- `records`: the records (`HKRecord`) logged during the time span of the workout.
- `timeseries`: a dictionary of pandas time series (made out of the `value` property of each record).\
Records are grouped by type, so each key of this dictionary is a record type.
- `units`: a dictionary of units. Each key of this dictionary is a record type and its value is the unit of measurement (*UoM*).
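
As a rough illustration of how such an entry could be assembled (a sketch under assumptions, not the actual `build_workouts_timeseries` implementation), the snippet below filters made-up records by a workout time window, groups them by type, and drops duplicated timestamps with pandas.

```python
import pandas as pd

# Made-up records: (type, unit, start timestamp, value)
records = [
    ('HKQuantityTypeIdentifierHeartRate', 'count/min', '2020-06-03 12:50:00+02:00', 92.0),
    ('HKQuantityTypeIdentifierHeartRate', 'count/min', '2020-06-03 12:50:00+02:00', 93.0),
    ('HKQuantityTypeIdentifierHeartRate', 'count/min', '2020-06-03 12:55:00+02:00', 101.0),
    ('HKQuantityTypeIdentifierStepCount', 'count', '2020-06-03 12:52:00+02:00', 40.0),
    ('HKQuantityTypeIdentifierHeartRate', 'count/min', '2020-06-03 15:00:00+02:00', 70.0),
]

# Hypothetical workout time window
start = pd.Timestamp('2020-06-03 12:49:17+02:00')
end = pd.Timestamp('2020-06-03 13:14:04+02:00')

df = pd.DataFrame(records, columns=['type', 'unit', 'time', 'value'])
df['time'] = pd.to_datetime(df['time'])

# Keep only the records that fall inside the workout window
in_window = df[(df['time'] >= start) & (df['time'] <= end)]

timeseries, units = {}, {}
for rec_type, group in in_window.groupby('type'):
    series = group.set_index('time')['value'].sort_index(kind='mergesort')
    # Drop duplicated timestamps, keeping one value per timestamp
    timeseries[rec_type] = series[~series.index.duplicated(keep='first')]
    units[rec_type] = group['unit'].iloc[0]

entry = {'timeseries': timeseries, 'units': units}
```

Here the heart-rate series ends up with two points: the duplicated 12:50 timestamp is collapsed and the 15:00 record falls outside the window.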

In [6]:
# Create a list of dicts where every dict connects a workout with its records and units of measurement.
# Records are grouped by type and converted to pandas time series.
# It is possible to select specific data sources and to ask for the removal of duplicated timestamps from time series.
# Only the time series benefit from this clean-up: the 'records' key stores all records from
# the original xml file.
HK_data = hk.build_workouts_timeseries(workouts, 
                                       records, 
                                       n_threads=n_threads, 
                                       rem_duplicates=True, 
                                       ts_source=['Apple\xa0Watch di Michele'])

100%|██████████| 379/379 [01:49<00:00,  3.48wrk/s, Submit jobs for building timeseries]
100%|██████████| 379/379 [00:11<00:00, 34.31wrk/s, Build timeseries for each workout] 


## Save to pickle
We don't want to wait every time we need this *Health Kit* data!\
Let's save the resulting list of dictionaries to a file with `pickle`, so we can reload it whenever we want.

In [7]:
with open('HKdata.pickle', mode='wb') as pickle_file:
    pickle.dump(HK_data, file=pickle_file)

## Load from pickle
If you want to explore the dataset without running the previous steps, run the following cell to load `HK_data` from the pickle file.

**Note:** *you need a `HKdata.pickle` file inside the notebook folder.*
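
A small guard can make the requirement in the note explicit. The helper below is hypothetical (it is not part of `healthkit.py`) and simply returns `None` when the file is missing instead of raising an error.

```python
import pickle
from pathlib import Path


def load_pickle_if_present(path):
    """Hypothetical helper: return the unpickled object, or None if the file is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    with path.open('rb') as pickle_file:
        return pickle.load(pickle_file)


data = load_pickle_if_present('HKdata.pickle')
if data is None:
    print('HKdata.pickle not found: run the conversion steps first.')
```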

In [8]:
with open('HKdata.pickle', mode='rb') as pickle_file:
    HK_data = pickle.load(pickle_file)

## Workout and record types
Now that the data is organized, we can build two tables: one with the workout types and one with the record types.\
Find them below.

In [9]:
# Workout types
temp = [d['workout'].activity_type for d in HK_data]
workout_types = pd.Series(list(set(temp)), name='Workout Type')
display(workout_types.to_frame())

# Record types (loop through records in every HK data)
record_types = []
units = {}
for item in HK_data:
    record_types = record_types + list(item['records'].keys())
    units.update(item['units'])
# Keep unique types from each workout
record_types = pd.DataFrame(list(set(record_types)), columns=['Record Type'])
record_types['Unit'] = record_types['Record Type'].apply(lambda x: units.get(x, 'Not Found'))
display(record_types)

Unnamed: 0,Workout Type
0,HKWorkoutActivityTypeOther
1,HKWorkoutActivityTypeCrossTraining
2,HKWorkoutActivityTypeFunctionalStrengthTraining
3,HKWorkoutActivityTypeCooldown
4,HKWorkoutActivityTypeHiking
5,HKWorkoutActivityTypeCoreTraining
6,HKWorkoutActivityTypeRunning
7,HKWorkoutActivityTypeCycling
8,HKWorkoutActivityTypeSwimming
9,HKWorkoutActivityTypeWalking


Unnamed: 0,Record Type,Unit
0,HKQuantityTypeIdentifierStairAscentSpeed,m/s
1,HKQuantityTypeIdentifierDistanceSwimming,m
2,HKQuantityTypeIdentifierWalkingStepLength,cm
3,HKQuantityTypeIdentifierDistanceCycling,km
4,HKQuantityTypeIdentifierBasalEnergyBurned,kcal
5,HKQuantityTypeIdentifierAppleExerciseTime,min
6,HKCategoryTypeIdentifierAppleStandHour,
7,HKQuantityTypeIdentifierHeadphoneAudioExposure,dBASPL
8,HKQuantityTypeIdentifierActiveEnergyBurned,kcal
9,HKQuantityTypeIdentifierDistanceWalkingRunning,km


## Inspect numeric data
This section is a simple *widget* for data visualization.\
Here is the list of all workouts with their *indexes* and *starting dates*.\
Search for one that looks interesting and keep its index in mind.

In [10]:
activity = [item['workout'].activity_type.replace('HKWorkoutActivityType','') for item in HK_data]
activity = [re.sub(r'(\w)([A-Z])', r'\1 \2', a) for a in activity]
start_date = [item['workout'].start_date for item in HK_data]
workouts_indexes = pd.DataFrame({'Activity': activity, 'Start date': start_date})
display(workouts_indexes)

Unnamed: 0,Activity,Start date
0,High Intensity Interval Training,2020-03-29 17:40:20+02:00
1,Cross Training,2020-03-30 19:12:26+02:00
2,High Intensity Interval Training,2020-03-30 19:21:49+02:00
3,Functional Strength Training,2020-04-01 17:39:15+02:00
4,High Intensity Interval Training,2020-04-02 18:00:22+02:00
5,High Intensity Interval Training,2020-04-02 18:04:28+02:00
6,Cross Training,2020-04-03 18:41:07+02:00
7,Functional Strength Training,2020-04-04 18:33:00+02:00
8,High Intensity Interval Training,2020-04-05 19:38:23+02:00
9,Cross Training,2020-04-05 19:46:52+02:00


Now run the following cell and then fill the `Workout index` selector with the index you have chosen from the previous table (if you type the number with your keyboard, click outside the selector to plot the data).\
All the available time series for the selected workout are plotted right below and, since they are powered by `plotly`, they are fully interactive.

Enjoy your time, explorer!

**Note:** *if a source for time series has been set by using `ts_source` in `hk.build_workouts_timeseries`, some workouts may have no available time series to plot (because they belong to an excluded source).*
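
To find out in advance which indexes have something to plot, a quick check like the following can help. It is a sketch that assumes excluded workouts end up with an empty `timeseries` dictionary; `sample_data` is a made-up stand-in for `HK_data`.

```python
# Made-up stand-in for HK_data: only the 'timeseries' key matters here
sample_data = [
    {'timeseries': {'HKQuantityTypeIdentifierHeartRate': [92.0, 101.0]}},
    {'timeseries': {}},  # assumed shape for a workout from an excluded source
    {'timeseries': {'HKQuantityTypeIdentifierStepCount': [40.0]}},
]


def indexes_with_timeseries(data):
    """Return the indexes of the entries that have at least one time series."""
    return [i for i, item in enumerate(data) if len(item['timeseries']) > 0]


print(indexes_with_timeseries(sample_data))  # → [0, 2]
```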

In [11]:
def print_selected_workout(idx):
    # Helpers that strip Apple prefixes and split CamelCase names into words (RegEx)
    clean_ts_names = lambda name: re.sub(r"(\w)([A-Z])", r"\1 \2", name.replace('HKQuantityTypeIdentifier',''))
    clean_act_names = lambda name: re.sub(r"(\w)([A-Z])", r"\1 \2", name.replace('HKWorkoutActivityType',''))
    
    # Extract time series
    sel_data = HK_data[idx.new]['timeseries']
    sel_dict = {rec_type: sel_data[rec_type] for rec_type in sel_data if 'category' not in rec_type.lower()}
    
    # Color palette (repeat the palette if more colors are required)
    colors = ["#1e78a8", "#c70a26", "#e6af2e", "#4c956c", "#8bc1c4", 
              "#1a5a5e", "#7a776e", "#6b2b67", "#223f4f", "#9c895f"]
    if len(sel_dict) <= len(colors):
        color_list = colors
    else:
        # Repeat the palette enough times to cover every time series (ceiling division)
        n, r = divmod(len(sel_dict), len(colors))
        color_list = colors * (n + (1 if r > 0 else 0))
    color_n = 0
    
    # Clean output (used by widget when refreshing)
    IPython.display.clear_output()
    display(selector)
    
    # Print activity type and source (Apple prefixes removed, names split into words)
    activity = clean_act_names(HK_data[idx.new]['workout'].activity_type)
    source = clean_act_names(HK_data[idx.new]['workout'].source_name)
    print(f"ACTIVITY TYPE: {activity}")
    print(f"SOURCE: {source}")
    
    # Plot all time series
    for k, ts in sel_dict.items():
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=ts.index.to_numpy(), 
                                 y=ts.to_numpy(), 
                                 name=clean_ts_names(k), 
                                 mode="markers+lines",
                                 line_color=color_list[color_n],
                                 marker_color=color_list[color_n],
                                 marker_size=8,
                                 marker_symbol='diamond-open-dot',
                                 marker_line_width=1.5))
        color_n += 1
    
        # Show plotted image
        fig.update_layout(height=300, width=950, showlegend=False,
                         title=f"{clean_ts_names(k)} [{HK_data[idx.new]['units'].get(k, '')}]")
        fig.show()

# Create widget (simple layout with selector)
layout = widgets.Layout(width='160px', height='40px')
selector = widgets.BoundedIntText(value=1,
                                  min=0,
                                  max=len(HK_data)-1,
                                  step=1,
                                  description='Workout index:',
                                  layout=layout,
                                  style={'description_width': 'initial'})

# Connect selector to the callback function
selector.observe(print_selected_workout, 'value')

# Start by plotting the first workout
selector.value=0

BoundedIntText(value=0, description='Workout index:', layout=Layout(height='40px', width='160px'), max=378, st…

ACTIVITY TYPE: High Intensity Interval Training
SOURCE: Nike Training
