Concordia Big Data Course Project

This is a project for Concordia's Big Data class (SOEN691 UU), taught by Dr. Tristan Glatard, in Winter 2020. Team members:

  1. Le, Manh Quoc Dat (Student ID: 40153127)
  2. Tran, Trong Tuan (Student ID: 40151694)
  3. Phan, Vu Hong Hai (Student ID: 40154023)
  4. Zhang, Yefei (Student ID: 40153319)

Abstract

The data from wearable gadgets (e.g. smartwatches or fitness tracking devices) is a reliable source of knowledge with the potential to support a wide range of health-related applications such as health monitoring and physical training recommendation. Along with the increase in the quantity and quality of this data, the need to get the most out of it becomes more and more crucial. However, these data are heterogeneous, noisy, individual-driven, and sequential, which are the key challenges to deal with. In this project, our main interest is to analyze the data and build models that predict people's heart rates from sequential sensor data. The dataset was collected from endomondo.com by the authors of [1] and contains about 200,000 records of over 1,000 users, with a couple hundred million sensor measurements and metadata. Within the project's scope, we develop machine learning (ML) models with the main concern of being able to process big time-series data, following two different approaches: one based on the traditional methods learned in class and available in Apache Spark, and another exploring somewhat more advanced methods, and then compare them.

Introduction

Wearable gadgets such as smartwatches are becoming more and more popular nowadays. With many sensors, they can provide a lot of useful information about individual users as well as the environment. With appropriate data mining techniques, we can use these data for a large range of potential applications such as context-aware health supervisors. To achieve such capabilities, several challenges need to be overcome: (1) the collected data are sequential, heterogeneous in type, and noisy, (2) the activity patterns and the subjects' conditions change over time, and (3) the mining models need to be either small enough to run on wearable gadgets or big enough to handle huge numbers of requests at a time. Within the course's scope, we divide our work into two main parts following the typical framework:

  1. Applying learned techniques to analyze and pre-process the dataset: the raw data contains a lot of noisy metadata that needs to be rectified. We also need to calculate several derived features to improve the models' performance.
  2. Implementing ML models that can be scaled in Spark: we expect to implement at least two prediction models in Apache Spark to predict the heart rate of individual users based on the context of the training session, such as historical workouts, current heart rate, and user-dependent features. We plan to implement non-deep-learning models only, because of the limited processing time.

Recently, there have been several relevant works on mining sensor data for the health domain: [2] provided an overview of data mining research in healthcare and discussed the impact of these techniques on the pervasive sensing market; Farseev et al. proposed a model that combines collected exercise and social network data to predict users' wellness trend (Body Mass Index, BMI) using an AdaBoost-based method [4]; [3, 5] built context-aware models applied in many fields such as recommendation, social networks, and clinical predictions, using individual information like BMI, age, gender, etc.

Materials and Methods

Dataset

The dataset used in this project was collected by Ni et al. [1] from endomondo.com. It contains sensor data (heart rate, timestamps, distance, speed, ...) and contextual data (GPS location (longitude and latitude), altitude, gender, sport (activity), user ID, etc.). An overview of this dataset is given in Table 1.

Table 1: Endomondo dataset overview

| Attribute | Quantity |
| --- | --- |
| Number of workouts | 253,020 |
| Total number of records | 111,541,956 |
| Number of users | 1,104 |
| Average length (hours) | 5.998 |

Each workout in the dataset consists of 500 data points recorded at flexible time intervals, ranging from seconds to several minutes, so we need to deal with this interval variation. In this project, after analyzing the data, our main goal is to predict the heart rate of each user over the course of a workout based on the sensor and contextual data. The heart rate prediction model can then help warn users about abnormal health conditions and give recommendations about increasing or decreasing workout intensity.

Techniques and Algorithms

We consider data analysis as part of pre-processing the data to train the models. We expect to find some insights about the data that help us select and derive features, which can improve the ML models' performance. The main pre-exploratory techniques used are:

  1. Plotting and normalizing based on some distribution for each data type
  2. Calculating data's common statistics: mean, standard deviation, etc.
  3. Using unsupervised techniques such as clustering to gain insight
  4. Generating derived features and using interpolation and resampling techniques to fill missing data points (a small sketch follows this list)
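To illustrate item 4, here is a minimal sketch of filling gaps in a single sensor sequence by resampling onto a regular grid and interpolating linearly; the function name `resample_series`, the 10-second grid, and the toy data are our own illustrative choices, not the project's actual pipeline.

```python
import numpy as np

def resample_series(timestamps, values, step=10):
    """Resample an irregular sensor sequence onto a regular grid and
    linearly interpolate missing points (NaNs) along the way."""
    t = np.asarray(timestamps, dtype=float)
    v = np.asarray(values, dtype=float)

    # Drop NaN measurements before interpolating.
    mask = ~np.isnan(v)
    t_clean, v_clean = t[mask], v[mask]

    # Regular grid from the first to the last timestamp, every `step` seconds.
    grid = np.arange(t_clean[0], t_clean[-1] + step, step)

    # Linear interpolation of the measurement onto the grid.
    return grid, np.interp(grid, t_clean, v_clean)

# Example: a heart-rate sequence with a missing reading.
ts = [0, 10, 20, 40, 50]
hr = [90, 92, np.nan, 97, 99]
grid, hr_filled = resample_series(ts, hr)
```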

After analyzing and pre-processing the data, we feed it to the algorithms we implement, with PySpark as the main framework. Since the goal of our predictor is to predict the heart rate (beats per minute, BPM), the result can be obtained in two ways: (a) directly via a regression model, or (b) via a classification model in which each specific BPM value is treated as a label (given the range of human heart rates, there would be at most about 200 labels). After considering multiple approaches, we decided to implement a rolling-window technique with a random forest regressor. The idea is that, to predict the value at time step t, the model only needs information from the k previous time steps.
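A minimal sketch of the rolling-window idea on a single heart-rate sequence, assuming plain Python lists; the helper name `make_windows` is hypothetical and the real pipeline builds richer features (described below):

```python
def make_windows(heart_rates, k=3):
    """Turn a per-workout heart-rate sequence into (features, target) pairs:
    the k previous readings are the features, the current reading is the target."""
    samples = []
    for t in range(k, len(heart_rates)):
        window = heart_rates[t - k:t]   # readings at t-k .. t-1
        target = heart_rates[t]         # reading to predict at time t
        samples.append((window, target))
    return samples

# Example: with k=3, the first sample uses readings 0..2 to predict reading 3.
samples = make_windows([90, 92, 95, 97, 99, 101], k=3)
# -> [([90, 92, 95], 97), ([92, 95, 97], 99), ([95, 97, 99], 101)]
```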

The variables in the dataset are described in Table 2. We also need to derive some other features for our model:

  1. The distance between 2 consecutive timestamps
  2. The speed at timestamps where it is missing (due to missing data or lack of a sensor)
  3. The time of day

Since we are applying traditional machine learning techniques, feature engineering heavily affects the performance of the prediction model. In this work, at each window, the features are:

  1. Categorical features: sport and gender
  2. Sequence features: longitude, latitude, altitude, time (hour), speed, heart rate, time variation, heart rate variation, and distance. These features are taken from every timestamp of the rolling window (except, of course, the heart rate and heart rate variation of the current timestamp)
  3. Aggregation features: min, max, mean, and standard deviation of the features in 2

The prediction model is tree-based, so we do not need to standardize the features.
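As an illustration, the categorical and window features could be assembled in PySpark roughly as follows; the DataFrame `windows_df` and the column names are assumed placeholders, and note that no scaler stage is included because the model is tree-based:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Index the categorical columns; tree-based models can consume the indices directly.
sport_indexer = StringIndexer(inputCol="sport", outputCol="sport_idx", handleInvalid="keep")
gender_indexer = StringIndexer(inputCol="gender", outputCol="gender_idx", handleInvalid="keep")

# Assemble categorical indices, window features, and aggregates into one vector.
# No StandardScaler stage is needed because the model is tree-based.
assembler = VectorAssembler(
    inputCols=["sport_idx", "gender_idx",
               "speed_t1", "speed_t2", "speed_t3",   # hypothetical per-window columns
               "hr_t2", "hr_t3",
               "hr_mean", "hr_std", "speed_mean"],   # hypothetical aggregate columns
    outputCol="features",
)

feature_pipeline = Pipeline(stages=[sport_indexer, gender_indexer, assembler])
# features_df = feature_pipeline.fit(windows_df).transform(windows_df)
```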

One of the key hyperparameters is the width of the rolling window, k. If k is too small, a lot of information is ignored; if k is too big, the number of features becomes too large. In our implementation, we chose a reasonable size of k=3.

We use random grid search and cross-validation for the model's hyperparameter selection. The number of folds for cross-validation is set to 5, and the metric used is RMSE. The details of the model can be found in the corresponding notebook.
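A sketch of what this tuning setup could look like in PySpark, assuming a training DataFrame with a `features` vector column and a `heart_rate` label; since Spark's ParamGridBuilder enumerates a full grid, the random search is approximated here by sampling a subset of that grid:

```python
import random
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestRegressor(featuresCol="features", labelCol="heart_rate")

# Full grid of candidate hyperparameters ...
full_grid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [20, 50, 100])
             .addGrid(rf.maxDepth, [5, 10, 15])
             .build())
# ... from which we randomly sample a subset (the "random grid search").
random.seed(42)
param_grid = random.sample(full_grid, 4)

evaluator = RegressionEvaluator(labelCol="heart_rate", predictionCol="prediction",
                                metricName="rmse")

cv = CrossValidator(estimator=rf, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=5)
# cv_model = cv.fit(train_df)
# rmse = evaluator.evaluate(cv_model.transform(test_df))
```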

Table 2: Contextual data and measurement description

| Variable | Type | Unit |
| --- | --- | --- |
| Heart rate | Sequence | Beats per minute |
| Timestamp | Sequence | Unix timestamp |
| Speed | Sequence | Miles per hour |
| Longitude | Sequence | Degrees |
| Latitude | Sequence | Degrees |
| Altitude | Sequence | Feet |
| User ID | String | |
| Sport | String | |
| Gender | String | Male, female |

Evaluation

We use RMSE (Root Mean Square Error) as the main metric to evaluate our efforts. The training and testing datasets are divided by each user's workouts.
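A sketch of one way to perform such a split in PySpark so that all windows of a given workout land on the same side; `windows_df` and the `workout_id` column are assumed names:

```python
# Split at the workout level so windows from one workout never leak
# between the training and test sets.
workout_ids = windows_df.select("workout_id").distinct()
train_ids, test_ids = workout_ids.randomSplit([0.8, 0.2], seed=42)

train_df = windows_df.join(train_ids, on="workout_id", how="inner")
test_df = windows_df.join(test_ids, on="workout_id", how="inner")
```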

Results

Exploratory Data Analysis

Note: we performed quite a lot of exploratory analyses but only summarize a few main points here. For more details, please see the Exploratory Analysis Notebook.

Dataset Overview

We first worked with the raw dataset from the original project. Original columns and data types:
[Figure: column_types]

This is what the first 2 data rows look like:

[Figure: first_2_rows]

Based on the column types, the data frame description, and the first 2 rows, we know that each data row is a record of a single user's workout, with the associated activity type (sport) and other data like heart rates, locations, and record times (timestamps).

High level summary of the data set:

| Users count | Activity types count | Workouts count | Total records count |
| --- | --- | --- | --- |
| 1,104 | 49 | 253,020 | 111,541,956 |

Filtering

During the exploration phase, we observed some abnormalities in some major columns.

  • Missing / abnormal values across columns:
    • For string columns, we check for None and null
    • For numeric columns, we check for zeros and NaN
    • For array-type columns, we check whether the array contains zeroes or NaN

[Figure: missing_data]

  • The number of records per workout is abnormally low for a few workouts (min column):

[Figure: explore_workout_count_stat]

  • Some record intervals are far higher than the 95th percentile, for example:

(interval: difference between 2 consecutive timestamps of a workout)

[Figure: explore_filtering_interval_stats.png]

  • Statistical summaries of heart rate by sport show some extreme outliers (min/max) compared to the normal range of 25-250 bpm:

[Figure: explore_filtering_heart_rate_stats.png]

Based on those abnormalities, we applied some filters to remove them from our dataset (a PySpark sketch follows the list):

  1. Removed all rows containing heart rates below 25 or above 260 bpm
  2. Removed rows that have fewer than 50 records
  3. Removed rows having intervals larger than 278.35 (the 95th percentile of the interval statistic)
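A rough PySpark sketch of these filters, assuming a flattened per-record DataFrame `records_df` with `workout_id`, `timestamp`, and `heart_rate` columns (the real data keeps per-workout arrays, so the actual code differs):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("workout_id").orderBy("timestamp")

records = (records_df
           # Interval = difference between two consecutive timestamps of a workout.
           .withColumn("interval", F.col("timestamp") - F.lag("timestamp").over(w))
           # 1. Keep heart rates within the plausible 25-260 bpm range.
           .filter((F.col("heart_rate") >= 25) & (F.col("heart_rate") <= 260)))

# 2. Drop workouts with fewer than 50 records.
counts = records.groupBy("workout_id").count().filter(F.col("count") >= 50)
records = records.join(counts.select("workout_id"), on="workout_id", how="inner")

# 3. Drop workouts containing an interval above the 95th percentile (278.35).
bad = records.filter(F.col("interval") > 278.35).select("workout_id").distinct()
records = records.join(bad, on="workout_id", how="left_anti")
```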

Final result after filtering:

| Users count | Activity types count | Workouts count | Total records count |
| --- | --- | --- | --- |
| 1,092 | 46 | 201,710 | 89,201,998 |

Class Imbalance

There is a lot of class imbalance in the dataset.

  • Gender imbalance, where male users and their activities dominate over female ones:

[Figure: explore_unbalance_gender]

  • Sport type imbalance, where a few sports account for the majority of the activities:

[Figure: explore_unbalance_sports]

These class imbalances might heavily impact the prediction accuracy, especially when predicting for users belonging to minority classes. Therefore, during the sampling phase for model training, we make sure to sample across all genders and sports (see the sketch below).
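One possible way to do this in PySpark is stratified sampling with `sampleBy`, keyed on a combined gender-and-sport column; the DataFrame `workouts_df` and the 20% fraction are assumptions for illustration:

```python
from pyspark.sql import functions as F

# Build a single stratum key from gender and sport.
strata_df = workouts_df.withColumn("stratum", F.concat_ws("_", "gender", "sport"))

# Sample the same fraction from every stratum so minority classes are represented.
strata = [r["stratum"] for r in strata_df.select("stratum").distinct().collect()]
fractions = {s: 0.2 for s in strata}
sampled_df = strata_df.sampleBy("stratum", fractions=fractions, seed=42)
```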

Some record-level plots for visualization purposes

  • Plot of workout heart rates over normalized time (duration from workout start), sampled from a few users:

[Figure: explore_heart_rate_vs_time]

  • Plot of some workouts' routes on 3D graphs based on longitude, latitude, and altitude:

[Figure: expore_workout_paths]

Statistical Inferences

Note: details on the statistical inferences can be found in the Statistical Inferences Notebook.

Some basic analyses

  • Difference in average heart rate between males and females across sports:

[Figure: inferences_avg_heart_rate_diff_male_vs_female_by_sport.png]

This plot shows that in most sports in which both genders participate, the average heart rate of females is higher than that of males.

  • Difference between the average Pearson coefficients of (heart rate, altitude) and the average coefficients of (heart rate, speed):

[Figure: inferences_pierson_coe_diff]

The two charts show that, for both males and females, the average correlation between heart rate and altitude is not very different from that between heart rate and speed. A sketch of how these averaged coefficients can be computed follows this list.

  • Comparison of average heart rate, speed, and altitude across workouts started at different periods of the day:

[Figure: inference_avg_measurements_by_periods]

Again, overall, the average speed, altitude, and heart rate do not differ much among time periods.
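For reference, the averaged Pearson coefficients above could be computed along these lines, assuming a flattened per-record DataFrame `records_df` with `workout_id`, `gender`, `heart_rate`, `altitude`, and `speed` columns (assumed names):

```python
from pyspark.sql import functions as F

# Pearson correlation per workout, then averaged by gender.
per_workout = records_df.groupBy("workout_id", "gender").agg(
    F.corr("heart_rate", "altitude").alias("corr_hr_alt"),
    F.corr("heart_rate", "speed").alias("corr_hr_speed"),
)

avg_by_gender = per_workout.groupBy("gender").agg(
    F.avg("corr_hr_alt").alias("avg_corr_hr_alt"),
    F.avg("corr_hr_speed").alias("avg_corr_hr_speed"),
)
```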

K-means clustering to find similar users

We applied the k-means clustering technique to group similar users based on their gender, workout counts per period, workout counts per sport, and average heart rate, speed, and altitude per workout. (Since this is just a preprocessing step, we did not spend many resources on record-level measurements, only on per-workout averages.)

  • Example of users’ coordinate vectors:

[Figure: inference_kmeans_coordinates]

Each feature above was weighted by a predefined number and standardized by a scaling factor when calculating the distances from users to centroids.
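A sketch of this clustering step with Spark ML, assuming a per-user feature DataFrame `users_df`; the column names, the weight vector, and k=4 are illustrative choices rather than the values we actually used:

```python
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import ElementwiseProduct, StandardScaler, VectorAssembler
from pyspark.ml.linalg import Vectors

assembler = VectorAssembler(
    inputCols=["gender_idx", "workouts_per_period", "avg_heart_rate",
               "avg_speed", "avg_altitude"],          # hypothetical per-user columns
    outputCol="raw_features",
)
# Standardize each feature so scales are comparable ...
scaler = StandardScaler(inputCol="raw_features", outputCol="scaled_features",
                        withMean=True, withStd=True)
# ... then apply predefined per-feature weights.
weighter = ElementwiseProduct(scalingVec=Vectors.dense([2.0, 1.0, 1.0, 1.0, 1.0]),
                              inputCol="scaled_features", outputCol="features")

kmeans = KMeans(k=4, featuresCol="features", seed=42)
pipeline = Pipeline(stages=[assembler, scaler, weighter, kmeans])
# model = pipeline.fit(users_df)
# clustered_df = model.transform(users_df)   # adds a 'prediction' cluster column
```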

  • K-means result:

[Figure: inferences_kmeans_converged]

[Figure: inferences_kmeans_plot]

The summary plots of the k-means results show that average heart rate and speed do not differ much among groups; the only obvious difference is in average altitude.

Prediction model result

Since we had some technical issues with our cluster in the lab, the results shown here were obtained from a sub-sampled dataset of only 200 workouts.

Prediction model training code: Prediction Model

The RMSE of our regression model is 3.8, a solid result compared to the scale of heart rate values. With this result, we can confidently give recommendations to users when their heart rate is slightly lower or higher than the predicted one (for example, minus or plus 5 BPM). We can also alert users if their heart rate exceeds the safe region (for example, more than 10 BPM above the predicted value).
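A tiny sketch of how such a rule could be applied to a DataFrame of predictions; the thresholds follow the example above, while `predictions_df` and the column names are assumed:

```python
from pyspark.sql import functions as F

diff = F.col("heart_rate") - F.col("prediction")
feedback_df = predictions_df.withColumn(
    "feedback",
    F.when(diff > 10, "alert: above safe region")       # well above the predicted rate
     .when(diff > 5, "suggest lowering intensity")
     .when(diff < -5, "suggest increasing intensity")
     .otherwise("ok"),
)
```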

Feature importances

From the results, we find that the model performs better for sports that are less affected by environmental conditions. This suggests we need to add more contextual information to achieve better results. From the model, we extracted the top 5 most important features at timestamp t:

  1. Heart rate variation from timestamp t-2 to t-1
  2. Elapsed time since the last measurement
  3. The current speed
  4. The heart rate at timestamp t-2
  5. The standard deviation of heart rate

We can see that heart rate variation (trend), along with speed, plays an important role in determining the current heart rate. The elapsed time is also important because elapsed time and heart rate variation are the two fundamental variables for building the trending function.
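For reference, a sketch of how these importances can be read from a fitted Spark random forest; `cv_model` and `feature_names` are assumed to come from a tuning step like the one sketched earlier:

```python
# Pair each assembled feature name with its importance and take the top 5.
best_rf = cv_model.bestModel                      # RandomForestRegressionModel
importances = best_rf.featureImportances.toArray()
top5 = sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True)[:5]
for name, score in top5:
    print(f"{name}: {score:.3f}")
```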

Discussion

Limitations

The Random Forest algorithm has several known disadvantages [6]: it requires a large amount of computation and memory to train many deep trees, its predictions are slower, which can put pressure on downstream applications, and the trained model is less explainable than an individual decision tree. In addition, the rolling-window approach ignores long-term relations with features from earlier timestamps and involves heavy feature engineering.

Future Work

Due to the limited time, we could not apply as many algorithms as we had planned. We leave them for future work, which would give us more time to analyze the dataset for deeper insights and to implement a variety of models based on different time-series algorithms, for example Conditional Random Fields (CRF) and Recurrent Neural Networks.

There is an additional data file that we have not analyzed yet, because the data in it does not contribute much to our current model. However, some attributes in that file, such as weather, could be helpful for other algorithms, so we plan to work on it to build more accurate models using the two algorithms mentioned above.

Finally, we would really like to embed each user's historical workout measurements into our models. The reason is that we want to develop a user-centered system: it could model each user's workout sequences to predict how that user's heart rate will fluctuate across a workout. In addition, the system could identify clusters of users who share common embedding structures and recommend alternative routes that help users achieve a target heart rate profile [1].

References

[1]. Jianmo Ni, Larry Muhlstein, and Julian McAuley. 2019. Modeling Heart Rate and Activity Data for Personalized Fitness Recommendation. In Proceedings of the 2019 World Wide Web Conference (WWW '19), San Francisco, USA, May 2019.

[2]. Hadi Banaee, Mobyen Uddin Ahmed, and Amy Loutfi. 2013. Data Mining for Wearable Sensors in Health Monitoring Systems: A Review of Recent Trends and Challenges. In Sensors.

[3]. Miguel Ramos de Araujo, Pedro Manuel Pinto Ribeiro, and Christos Faloutsos. 2017. TensorCast: Forecasting with Context Using Coupled Tensors (Best Paper Award). In ICDM.

[4]. Aleksandr Farseev and Tat-Seng Chua. 2017. TweetFit: Fusing Multiple Social Media and Sensor Data for Wellness Profile Learning. In AAAI.

[5]. Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Multi-level Multiple Attentions for Contextual Multimodal Sentiment Analysis. In ICDM.

[6]. S. Jansen. 2018. Hands-On Machine Learning for Algorithmic Trading. Packt Publishing.
