# COGS 118A - Project Checkpoint

# Names


- Jacob Au
- Evan Liu
- Lauren Lui
- Rina Kaura

# Abstract 

Our project aims to predict an individual's quality of sleep from their exercise habits, in the hope that these predictions can inform changes that improve sleep. We will examine the Lifesnaps Fitbit dataset, which records exercise and sleep information such as activity type, sleep duration, minutes to fall asleep, sleep efficiency, number of active minutes, and heart rate (BPM). With this data, we will fit a linear regression model to make predictions and to identify relationships between an individual's quality of sleep and the exercise features listed above. The performance of the model will be assessed using the mean absolute error (MAE), mean squared error (MSE), and the R-squared metric.


# Background

A large body of prior research investigates the relationship between exercise and sleep quality. The general consensus is that exercise can improve sleep quality and duration for all age groups[1]. However, some studies show that exercise has a greater positive effect on people over 40 years of age than on younger people[2], especially when these older individuals lead a sedentary life[4]. Exercise has also been shown to be an effective substitute for pharmacological interventions to improve sleep quality in people with insomnia[3]. The participants in our dataset come from a wide range of age groups and span the spectrum from sedentary to active lifestyles. Whether their exercise consisted of walking, cycling, aerobics, or sports should not affect the effectiveness of our model, since prior research suggests that increases in exercise intensity and duration do not appear to have significant effects on sleep quality[5]. Given the existing body of knowledge, we expect a correlation between exercise and sleep that our model can use for prediction.

One unanswered question in the literature is how strong the correlation between exercise and sleep quality is. Most studies set out to establish that the correlation exists, not to measure its strength. The studies that did test its strength had biases and limitations that call their findings into question; many of them used subject groups that were mostly, if not entirely, male, which is not a representative sample. Hence, with a representative dataset, our group aims to test the correlation between exercise and sleep quality by seeing whether we can successfully train a model to predict sleep quality from the type, duration, and intensity of exercise. The correlation between the two variables may turn out to be weak, but that is also something our group is willing to investigate and either confirm or refute.

[1] https://www.researchgate.net/profile/Zubia-Veqar/publication/236582394_Sleep_Quality_Improvement_and_Exercise_A_Review/links/00b495180b7a157bb9000000/Sleep-Quality-Improvement-and-Exercise-A-Review.pdf

[2] https://www.sciencedirect.com/science/article/pii/S1836955312701066

[3] https://peerj.com/articles/5172/

[4] https://onlinelibrary.wiley.com/doi/full/10.1111/eci.13202

[5] https://link.springer.com/article/10.1007/s00421-011-2034-9




# Problem Statement

The problem our project aims to address is predicting, and ultimately helping to improve, the quality of an individual's sleep based on their exercise habits, a problem that is quantifiable, measurable, and replicable. We expect exercise and quality of sleep to be correlated, which raises the question of how. What kind of exercise? How much exercise? At what time of day? How do these factors affect quality of sleep? Thanks to recent advances in personal technology, in particular the Fitbit, we can easily obtain large amounts of quantitative data on people's sleeping habits (REM-to-sleep ratio, duration, etc.) and exercise habits (kind of exercise, heart rate, calories burned). Using these metrics, we aim to predict how an individual's exercise habits affect their sleep quality.


# Data


**Dataset Name:** Lifesnaps Fitbit Dataset

**Link to the dataset:** https://www.kaggle.com/datasets/skywescar/lifesnaps-fitbit-dataset

**Number of variables:** 63

**Number of observations:** 7,410

**Description:** Each observation in this dataset records a specific Fitbit user's activity and device usage statistics over the course of a single day. Each observation was collected from one of 71 willing study participants.

**Critical Variables:** For the purposes of the present analysis, some of the notable variables from the aforementioned dataset include:
- id: Numeric representation of individual study participant
- nremhr: Number of minutes of REM sleep logged in a single night (not hours)
- minutesAsleep: Number of minutes of total sleep logged in a single night
- sleep_rem_ratio: Proportion of sleep minutes occupied by REM sleep
- bpm: Average daily beats per minute
- activityType: List of all activities a user engaged in during a single day
- sedentary_minutes: Number of minutes spent idling in a single day
- lightly_active_minutes: Number of minutes spent engaging in light physical activity in a single day
- moderately_active_minutes: Number of minutes spent engaging in moderate physical activity in a single day
- very_active_minutes: Number of minutes spent engaging in strenuous physical activity in a single day
- steps: Total number of steps walked over the course of a single day

**Required Transformations:** Before operating on the original dataset, our group refactored both the data and the column labels in order to simplify data extraction and identification. For example, the original dataset uses inconsistent column labeling schemes that make relevant columns harder to identify, such as camel-case labels mixed with underscore-separated ones. We expect that standardizing the naming convention of these multi-word labels, and renaming unintuitive labels such as "nremhr", which actually describes the number of minutes of REM sleep for a given user, will help our group use the dataset quickly and effectively in our model. Additionally, because columns with similar contents are recorded in different units in the original dataset, we felt it necessary to standardize the time-related metrics so that they are all recorded in minutes rather than in a mix of hours and milliseconds. We anticipated that this transformation would allow us to apply a single model to our data more easily and would prevent inaccurate regression lines caused by misleading numerical trends.
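
The two transformations described above (consistent label names and a single time unit) could be sketched roughly as follows. This is only an illustration on a toy frame: the helper `to_snake_case`, the column `duration_ms`, and all values are hypothetical and are not taken from our actual cleaning code.

```python
import re

import pandas as pd


def to_snake_case(label: str) -> str:
    """Convert a camelCase label such as 'minutesAsleep' to 'minutes_asleep'."""
    # Insert an underscore wherever a lowercase letter or digit is followed
    # by an uppercase letter, then lowercase the whole label.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label).lower()


# Toy frame with hypothetical values; 'duration_ms' is an illustrative name.
toy = pd.DataFrame({
    "minutesAsleep": [445.0, 460.0],
    "activityType": ["['Walk']", "['Walk']"],
    "duration_ms": [26_700_000, 27_600_000],
})

# Standardize every label to snake_case (already-snake_case labels are unchanged).
toy = toy.rename(columns=to_snake_case)

# Convert milliseconds to minutes so all time columns share one unit.
toy["duration_minutes"] = toy.pop("duration_ms") / 60_000

print(list(toy.columns))
```

Labels containing acronyms or digits (for example `filteredDemographicVO2Max`) would likely need manual renaming instead of this simple rule.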

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("daily_fitbit_sema_df_unprocessed.csv")

In [2]:
bad_cols = ['rmssd', 'spo2', 'responsiveness_points_percentage', 'badgeType', 
            'filteredDemographicVO2Max', 'mindfulness_session', 'minutes_in_default_zone_1', 
            'minutes_below_default_zone_1', 'minutes_in_default_zone_2','nightly_temperature',
            'minutes_in_default_zone_3', 'step_goal', 'min_goal', 'max_goal', 
            'step_goal_label', 'ALERT', 'HAPPY', 'NEUTRAL', 'RESTED/RELAXED', 'date', 
            'SAD', 'TENSE/ANXIOUS', 'TIRED', 'ENTERTAINMENT', 'GYM', 'HOME', 'calories', 
            'HOME_OFFICE', 'OTHER', 'OUTDOORS', 'TRANSIT', 'WORK/SCHOOL', 
            'scl_avg', 'age', 'gender', 'bmi', 'daily_temperature_variation', 
            'full_sleep_breathing_rate', 'Unnamed: 0', 'sleep_duration']

df = df.drop(columns=bad_cols)
df

Unnamed: 0,id,nremhr,stress_score,sleep_points_percentage,exertion_points_percentage,distance,activityType,bpm,lightly_active_minutes,moderately_active_minutes,...,minutesToFallAsleep,minutesAsleep,minutesAwake,minutesAfterWakeup,sleep_efficiency,sleep_deep_ratio,sleep_wake_ratio,sleep_light_ratio,sleep_rem_ratio,steps
0,621e2e8e67b776a24055b564,57.432,78.0,0.833333,0.675,6517.5,['Walk'],71.701565,149.0,24.0,...,0.0,445.0,76.0,0.0,93.0,1.243243,0.987013,0.921642,1.341772,8833.0
1,621e2e8e67b776a24055b564,57.681,80.0,0.833333,0.725,7178.6,['Walk'],70.579300,132.0,25.0,...,0.0,460.0,88.0,0.0,94.0,1.466667,1.142857,0.947566,1.197531,9727.0
2,621e2e8e67b776a24055b564,57.481,84.0,0.966667,0.725,6090.9,['Walk'],71.842573,112.0,27.0,...,0.0,493.0,67.0,0.0,96.0,1.116883,0.858974,1.015038,1.670732,8253.0
3,621e2e8e67b776a24055b564,57.493,82.0,0.933333,0.725,6653.1,['Walk'],71.725477,133.0,21.0,...,0.0,540.0,87.0,0.0,93.0,1.128205,1.129870,1.191729,1.588235,9015.0
4,621e2e8e67b776a24055b564,56.750,81.0,0.866667,0.725,9557.9,['Walk'],74.401028,136.0,42.0,...,0.0,493.0,68.0,0.0,94.0,0.910256,0.871795,1.211896,1.090909,12949.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7405,621e362467b776a2404ad513,,,,,,,,,,...,,,,,,,,,,
7406,621e36f967b776a240e5e7c9,,,,,,,,,,...,,,,,,,,,,
7407,621e362467b776a2404ad513,,,,,,,,,,...,,,,,,,,,,
7408,621e339967b776a240e502de,,,,,,,,,,...,,,,,,,,,,


In [3]:
# Total time in bed (minutes): asleep + awake + falling asleep + after waking up.
sleep_cols = ['minutesAwake', 'minutesAsleep', 'minutesToFallAsleep', 'minutesAfterWakeup']
# skipna=False keeps NaN when any component is missing, matching plain addition.
df['sleep_time'] = df[sleep_cols].sum(axis=1, skipna=False)
df = df.drop(columns=sleep_cols)

In [4]:
# Drop rows that are missing a sleep score (sleep_points_percentage).
df = df.dropna(subset=['sleep_points_percentage'])
df.shape

(1876, 20)

In [5]:
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,id,nremhr,stress_score,sleep_points_percentage,exertion_points_percentage,distance,activityType,bpm,lightly_active_minutes,moderately_active_minutes,very_active_minutes,sedentary_minutes,resting_hr,sleep_efficiency,sleep_deep_ratio,sleep_wake_ratio,sleep_light_ratio,sleep_rem_ratio,steps,sleep_time
0,621e2e8e67b776a24055b564,57.432,78.0,0.833333,0.675,6517.5,['Walk'],71.701565,149.0,24.0,33.0,713.0,62.07307,93.0,1.243243,0.987013,0.921642,1.341772,8833.0,521.0
1,621e2e8e67b776a24055b564,57.681,80.0,0.833333,0.725,7178.6,['Walk'],70.5793,132.0,25.0,31.0,704.0,62.121476,94.0,1.466667,1.142857,0.947566,1.197531,9727.0,548.0
2,621e2e8e67b776a24055b564,57.481,84.0,0.966667,0.725,6090.9,['Walk'],71.842573,112.0,27.0,31.0,710.0,62.263999,96.0,1.116883,0.858974,1.015038,1.670732,8253.0,560.0
3,621e2e8e67b776a24055b564,57.493,82.0,0.933333,0.725,6653.1,['Walk'],71.725477,133.0,21.0,37.0,622.0,62.3689,93.0,1.128205,1.12987,1.191729,1.588235,9015.0,627.0
4,621e2e8e67b776a24055b564,56.75,81.0,0.866667,0.725,9557.9,['Walk'],74.401028,136.0,42.0,54.0,647.0,61.965409,94.0,0.910256,0.871795,1.211896,1.090909,12949.0,561.0


In [6]:
def normalize(columnName):
    """Min-max scale one column of the global df to [0, 1], in place."""
    col = df[columnName]
    df[columnName] = (col - col.min()) / (col.max() - col.min())

toBeNormalized = ['nremhr', 'stress_score', 'distance', 'bpm', 'lightly_active_minutes', 
                  'moderately_active_minutes', 'very_active_minutes', 'sedentary_minutes',
                 'resting_hr', 'sleep_efficiency', 'steps', 'sleep_time']

for col in toBeNormalized:
    normalize(col)
    
# Inspect the distribution of the normalized sleep_efficiency values.
df['sleep_efficiency'].value_counts()

0.862069    232
0.827586    227
0.793103    212
0.896552    211
0.758621    188
0.931034    160
0.724138    140
0.965517    108
0.689655     98
0.655172     79
0.620690     47
0.586207     43
0.551724     35
1.000000     31
0.517241     24
0.482759     12
0.448276      7
0.379310      4
0.310345      3
0.413793      1
0.000000      1
0.241379      1
0.344828      1
Name: sleep_efficiency, dtype: int64
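
The `normalize` helper applies min-max scaling, which maps each column onto the range [0, 1]. As a worked check on `sleep_efficiency` (the minimum of 71 and maximum of 100 are inferred from the printed values, which are all multiples of 1/29, and are not stated anywhere in the dataset documentation):

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad \frac{93 - 71}{100 - 71} = \frac{22}{29} \approx 0.758621
$$

This agrees with the first row of the data, whose `sleep_efficiency` of 93.0 becomes 0.758621 after normalization.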

In [7]:
df = df.rename(columns={"activityType": "activity_type"})
df

Unnamed: 0,id,nremhr,stress_score,sleep_points_percentage,exertion_points_percentage,distance,activity_type,bpm,lightly_active_minutes,moderately_active_minutes,very_active_minutes,sedentary_minutes,resting_hr,sleep_efficiency,sleep_deep_ratio,sleep_wake_ratio,sleep_light_ratio,sleep_rem_ratio,steps,sleep_time
0,621e2e8e67b776a24055b564,0.620571,0.829787,0.833333,0.675,0.218337,['Walk'],0.241212,0.252144,0.083045,0.080685,0.504243,0.368842,0.758621,1.243243,0.987013,0.921642,1.341772,0.204885,0.419823
1,621e2e8e67b776a24055b564,0.623262,0.851064,0.833333,0.725,0.240483,['Walk'],0.227209,0.222985,0.086505,0.075795,0.497878,0.370130,0.793103,1.466667,1.142857,0.947566,1.197531,0.225622,0.441579
2,621e2e8e67b776a24055b564,0.621101,0.893617,0.966667,0.725,0.204045,['Walk'],0.242972,0.188679,0.093426,0.075795,0.502122,0.373920,0.862069,1.116883,0.858974,1.015038,1.670732,0.191432,0.451249
3,621e2e8e67b776a24055b564,0.621230,0.872340,0.933333,0.725,0.222879,['Walk'],0.241511,0.224700,0.072664,0.090465,0.439887,0.376709,0.758621,1.128205,1.129870,1.191729,1.588235,0.209107,0.505238
4,621e2e8e67b776a24055b564,0.613202,0.861702,0.866667,0.725,0.320190,['Walk'],0.274894,0.229846,0.145329,0.132029,0.457567,0.365979,0.793103,0.910256,0.871795,1.211896,1.090909,0.300357,0.452055
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1871,621e375b67b776a240290cdc,0.712060,0.893617,0.833333,0.900,0.301360,['Walk'],0.392483,0.605489,0.065744,0.004890,0.472419,0.679802,0.827586,1.148936,0.807692,0.983051,0.800000,0.316339,0.319903
1872,621e375b67b776a240290cdc,0.707403,0.946809,1.000000,0.900,0.118490,,0.338379,0.308748,0.055363,0.007335,0.541726,0.666075,0.931034,1.416667,0.780000,1.080508,1.353659,0.124467,0.381144
1873,621e375b67b776a240290cdc,0.689466,0.872340,0.900000,0.800,0.230852,,0.407651,0.452830,0.079585,0.048900,0.471711,0.676481,0.896552,0.921569,0.979592,1.042017,1.406977,0.242346,0.373892
1874,621e375b67b776a240290cdc,0.685025,0.893617,0.633333,0.900,0.377013,['Walk'],0.330711,0.464837,0.169550,0.063570,0.393211,0.670656,0.758621,0.420000,1.265306,1.141667,1.274725,0.395505,0.381950


# Proposed Solution

To predict a user's quality of sleep from a selection of exercise features, we will use **linear or polynomial regression**. This will allow us to find relationships between the exercise features and sleep, and to make predictions about a user's sleep quality. Using a train-validate-test split, we will run cross-validation to select the degree of polynomial to fit. After finding the best degree, we will train the model on the training data and evaluate how it performs on the held-out test set.
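
The degree-selection procedure described above could look roughly like the following sketch. It assumes scikit-learn and uses synthetic stand-in data, not the Fitbit data; the candidate degrees, the 80/20 split, and 5-fold cross-validation are illustrative choices and not decisions we have finalized.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in data (NOT the Fitbit data): 3 features, quadratic target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] ** 2 + rng.normal(0, 0.05, size=300)

# Hold out a test set that is never used during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline so that each cross-validation fold is
# scaled using statistics from its own training portion only.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("reg", LinearRegression()),
])

# 5-fold cross-validation over candidate degrees, scored by negated MSE.
search = GridSearchCV(
    pipeline,
    param_grid={"poly__degree": [1, 2, 3, 4]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_degree = search.best_params_["poly__degree"]
# Pipeline.score delegates to LinearRegression.score, which returns R^2.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(best_degree, round(test_r2, 3))
```

Placing the scaler inside the pipeline is a deliberate design choice in this sketch: it avoids leaking information from validation folds into the scaling step.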

Furthermore, if we find that the model has too many input columns after the steps above, we may use **Principal Component Analysis** to reduce the dimensionality, and then repeat the process on the reduced dataset.
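
A minimal sketch of the dimensionality-reduction step, assuming scikit-learn and synthetic stand-in data. The 95% explained-variance threshold is an illustrative assumption, not a value we have chosen for the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Fitbit data): 10 observed columns that are
# noisy linear mixtures of 3 underlying factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + rng.normal(0, 0.05, size=(500, 10))

# PCA is sensitive to scale, so standardize each column first.
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining at least 95% of variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

The reduced matrix `X_reduced` would then take the place of the original feature matrix in the regression step.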

# Evaluation Metrics

Because the proposed analysis uses a regression model to predict sleep quality from exercise-related metrics, we expect to evaluate the model with Mean Squared Error (MSE), Mean Absolute Error (MAE), and the R-squared score. MSE is a natural starting point because it measures the average squared difference between true and model-predicted labels, and it is the quantity that ordinary least-squares regression minimizes, so it is simple and inexpensive to compute. Because the errors are squared, positive and negative errors cannot cancel one another and the score is always non-negative, which makes models from successive validation folds easy to compare. The drawback is that squaring makes the measure sensitive to outliers: given the wide range of values in the exercise and sleep columns of the dataset, a few unusual rows could dominate the score.

To address this concern, our group also considered Mean Absolute Error, which replaces the squaring in MSE with the absolute value of each error. Errors still cannot cancel, but large errors are penalized linearly instead of quadratically, so outliers have less influence on the score.

Our group also considered supporting either of these metrics with the R-squared score, which is one minus the ratio of the sum of squared residuals to the total sum of squares; the latter is computed by summing the squared differences between each true label and the mean of the labels. This measure is easier to interpret than the other metrics: it expresses how much better the model predicts than a baseline that always predicts the mean, with 0 representing no improvement, 1 representing a perfect fit, and negative values indicating a fit worse than the mean.
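
To make the three metrics concrete, the sketch below computes them directly with NumPy, taking R-squared as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `regression_metrics` and the toy values are hypothetical; they are not results from our model.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)      # squared errors: sensitive to outliers
    mae = np.mean(np.abs(residuals))   # absolute errors: linear penalty

    ss_res = np.sum(residuals ** 2)                  # sum of squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "r2": r2}


# Hypothetical toy values, not results from the Fitbit data.
metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics)
```

For these toy values, the single error of size 1 contributes two thirds of the squared-error total but only half of the absolute-error total, which illustrates why MSE reacts more strongly to outliers than MAE.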


# Preliminary results




# Ethics & Privacy

Because our dataset contains a large amount of personal health and exercise information, it is important that the participants remain anonymous. Although the participants consented to the study and the data was collected unobtrusively through a Fitbit, it may still be possible to re-identify individuals by combining their activity records with their demographic information. Furthermore, because our project is concerned with what constitutes healthy exercise and sleep patterns, our findings could unintentionally promote unhealthy exercise, dietary, or sleep habits. Health is not one-size-fits-all, so we will take extra care when discussing our findings.


# Team Expectations 

- **Team Expectation 1:** Given our proposed project timeline, all team members are expected to attend each scheduled meeting so that the section assigned to that meeting can be completed effectively.
- **Team Expectation 2:** Work on the project should be shared equally by all members of our team. Equal contribution to the discussion, writing, and review of our project is expected of all team members.
- **Team Expectation 3:** Consideration of individual contributions, specifically the strengths and weaknesses of each member, is important for effective and efficient completion of the project.

# Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting           | Discuss at Meeting                                                   |
| ------------ | ------------ | ---------------------------------- | -------------------------------------------------------------------- |
| 02/21        | 4:00 PM      | Brainstorm project ideas           | Project ideas for the project proposal and discuss team expectations |
| 02/25        | 4:00 PM      | Peer review other projects         | Discuss how to improve project and start on project checkpoint       |
| 03/01        | 4:00 PM      | Peer Review Due                    | Peer Review Due                                                      |
| 03/04        | 4:00 PM      | Programming for project            | EDA, discussion of analysis, project code                            |
| 03/08        | 4:00 PM      | Project Checkpoint Due             | Project Checkpoint Due                                               |
| 03/18        | 4:00 PM      | Finalize project code and analysis | Review and edit project for submission                               |
| 03/22        | 4:00 PM      | Final Project due                  | Final Project due                                                    |


