Fitbit Fitness Data Analysis in R

1. Ask Questions to Make Data-driven Decisions

1.1. Introduction

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.

The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

1.2. Business Task

Bellabeat wants to identify trends in how consumers use smart devices and the business growth opportunities those trends suggest. In addition, the analysis should produce high-level recommendations to inform the marketing strategy.

1.3. Business Objectives

The main objectives of the case study are based on three business questions that define the scope of the study:

What are the trends identified?
How could these trends apply to Bellabeat consumers?
How could these trends help influence Bellabeat marketing strategy?

1.4. Key Stakeholders

The following key stakeholders are interested in the outcome of this business task:

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer.
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

1.5. Deliverables

A clear summary of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
A summary of your analysis
Supporting visualizations and key findings
Your top high-level content recommendations based on your analysis

2. Prepare Data for Exploration

In this section we describe where the data is stored and how it was verified with the ROCCC method, and we check its licensing, privacy, security, accessibility, and integrity. We also highlight how the data helps us answer the business questions.

2.1. Source of Data

The data for this case study is publicly available on Kaggle. The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (March 12 to May 12, 2016). Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

2.2. ROCCC Method

High-quality data helps us make reliable decisions, so we need to check the quality of our data before using it. The ROCCC method (Reliable, Original, Comprehensive, Current, Cited) shows us what the data quality is.

Reliable: The data comes from 30 Fitbit users who consented to the submission of personal tracker data, collected through a survey distributed via Amazon Mechanical Turk.
Original: The data comes from 30 Fitbit users who consented to the submission of personal tracker data via Amazon Mechanical Turk.
Comprehensive: The data includes minute-level output for physical activity, heart rate, and sleep monitoring. Although it tracks many aspects of user activity and sleep, the sample size is small and most data is recorded on certain days of the week.
Current: The data is from March 2016 to May 2016. It is not current, so users' habits may be different now.
Cited: Unknown.

2.3. Data Ethics and Privacy

Bellabeat has set standards for how this data is collected, shared, and used, and has maintained the privacy and validity of the data. The Fitbit dataset meets the six elements of data ethics: ownership, transaction transparency, consent, currency, privacy, and openness.

3. Process Data from Dirty to Clean

The Fitbit data is not clean. To process it, we will go through the dataset files, check the variables and observations, sort and filter the data, remove missing data, change column names, and prepare a clean dataset.

3.1. Data Cleaning Tool

This Bellabeat case study uses R to clean and analyze the data. R is a widely used programming language for data analysis and was originally created for statistical computing.

3.2. Set Working Environment

setwd("~/Desktop/fitbit/Fitness")

3.3. Load the Essential Libraries

R uses packages (libraries) to speed up the data analysis process. In this capstone, we will use the following packages.

library("tidyverse")
library("skimr")
library("here")
library("lubridate")
library("janitor")
library("dplyr")
library("scales")
library("ggpubr")

3.4. Import FitBit datasets

The Fitbit dataset contains a number of CSV files. We will analyze only the three most important ones: daily activity, daily sleep, and hourly steps. To explore the data, we first need to import these datasets into our environment.

activity <- read_csv("~/Desktop/fitbit/Fitness/fitabase_data/dailyActivity_merged.csv")
sleep <- read_csv("~/Desktop/fitbit/Fitness/fitabase_data/sleepDay_merged.csv")
steps <- read_csv("~/Desktop/fitbit/Fitness/fitabase_data/hourlySteps_merged.csv")

Let us explore the data and check its quality. We have loaded three datasets: daily activity, daily sleep, and hourly steps. Before we analyze the data, we need to clean it by removing duplicates and missing values and by formatting any columns that need it.

3.5. Data Cleaning

In this stage, we take an overall look at our datasets and identify any missing values, duplicates, and data type problems.

3.5.1. Exploring the Dataset

We have determined that this data needs to be cleaned and tidied. First, we check the number of users in each dataset. We expect about 30 users, but the actual number may be slightly higher or lower.

n_unique(activity$Id)
n_unique(sleep$Id)
n_unique(steps$Id)
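
As a rough illustration of what this check does, the sketch below counts distinct IDs in a small made-up data frame using base R only. The `toy` data frame and its values are hypothetical; `n_unique()` should give the same count on a column like this when it has no missing values.

```r
# Hypothetical data: three users, one of them with two daily records
toy <- data.frame(
  Id = c(1001, 1001, 1002, 1003),
  TotalSteps = c(4000, 5200, 8100, 12000)
)

# Number of distinct users, regardless of how many rows each one has
n_users <- length(unique(toy$Id))
print(n_users)
```

A count per dataset makes it easy to spot files in which only some of the participants recorded data.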

3.5.2. Remove Duplicates

sum(duplicated(activity))
sum(duplicated(sleep))
sum(duplicated(steps))

We found 3 duplicate observations in the sleep data. Let us remove them, along with any rows that contain missing values:

sleep <- sleep %>%
  distinct() %>%
  drop_na()

Now let us check that the duplicates have been removed.

sum(duplicated(sleep))
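
To make the two cleaning steps easier to follow, here is a minimal base R sketch on made-up data. The `toy_sleep` data frame is hypothetical; `unique()` and `complete.cases()` are used here as base R counterparts of `distinct()` and `drop_na()`, not as the code this analysis actually ran.

```r
# Hypothetical sleep records: row 2 repeats row 1, row 4 has a missing value
toy_sleep <- data.frame(
  id = c(1, 1, 2, 3),
  totalminutesasleep = c(420, 420, 390, NA)
)

n_dupes <- sum(duplicated(toy_sleep))          # exact repeated rows
deduped <- unique(toy_sleep)                   # keep one copy of each row
cleaned <- deduped[complete.cases(deduped), ]  # drop rows with any NA

print(n_dupes)
print(nrow(cleaned))
```

Note that the second step removes whole rows, so it is worth checking how many observations are lost before relying on it.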

3.5.3. Clean Column Names

Column names that mix upper and lower case letters can cause confusion during analysis, so it is good practice to convert all column names to lower case.

# rename_with(tolower) lowercases the names without inserting underscores
# (ActivityDate becomes activitydate), which is what the later code expects.
# janitor::clean_names() would give snake_case names (activity_date) instead,
# so it is not applied here.

# Daily Activity dataset
activity <- rename_with(activity, tolower)

# Daily Sleep dataset
sleep <- rename_with(sleep, tolower)

# Hourly Steps dataset
steps <- rename_with(steps, tolower)

3.5.4. Format Date & Time

Dates and times are very important in this analysis, because we are going to analyze daily activity records. If the date and time columns are not converted properly, date-based grouping and merging will give incorrect results.

As we have seen, the activitydate column in daily_activity and the sleepday column in daily_sleep are stored as character data. Let us convert them to date format, and convert the activityhour column in hourly_steps to date-time format.

activity <- activity %>% 
  rename(date = activitydate) %>% 
  mutate(date = as_date(date, format = "%m/%d/%Y"))

sleep <- sleep %>% 
  rename(date = sleepday) %>% 
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))

steps <- steps %>% 
  rename(date_time = activityhour) %>% 
  mutate(date_time = as.POSIXct(date_time, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))

str(activity)
str(sleep)
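
The format strings above can be tried out on single values first. The sketch below uses base R only, with made-up input strings in the same layout as the CSV columns; the choice of UTC is an assumption for the example, and the AM/PM marker (`%p`) is interpreted according to the session locale, so results may differ in a non-English locale.

```r
# Hypothetical raw strings in the same layout as the CSV columns
raw_date      <- "4/12/2016"
raw_date_time <- "4/12/2016 1:00:00 PM"

# Date only: month/day/four-digit year
d <- as.Date(raw_date, format = "%m/%d/%Y")

# Date-time on a 12-hour clock; %I is the hour (01-12), %p the AM/PM marker
dt <- as.POSIXct(raw_date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(d)
print(format(dt, "%Y-%m-%d %H:%M:%S"))
```

A value that does not match the format string is returned as NA, so counting NAs after conversion is a quick sanity check.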

3.5.5. Merging Data

We have arrived at the last stage of data processing. We now combine two datasets to examine the relationship between daily_activity and daily_sleep.

activity_sleep <- merge(activity, sleep, by=c("id", "date"))

glimpse(activity_sleep)
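
By default, `merge()` performs an inner join, so only id/date pairs that appear in both tables survive. The sketch below shows this on two small hypothetical data frames (`toy_activity` and `toy_sleep` are made up for illustration) and contrasts it with `all.x = TRUE`, which keeps every activity row.

```r
# Hypothetical activity and sleep records
toy_activity <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(8000, 9500, 3000)
)
toy_sleep <- data.frame(
  id = c(1, 2, 3),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  totalminutesasleep = c(410, 380, 450)
)

# Default merge(): inner join on the id/date pairs present in both tables
inner <- merge(toy_activity, toy_sleep, by = c("id", "date"))

# all.x = TRUE keeps every activity row and fills missing sleep with NA
left <- merge(toy_activity, toy_sleep, by = c("id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```

Because of the inner join, activity_sleep only contains days on which a user recorded both activity and sleep, which matters for any per-user day counts derived from it.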

4. Data Analysis and Visualization

We are going to extract insights into how Fitbit users use their devices, so that Bellabeat can identify trends in the market.

Activity trackers provide data which enables you to become aware of your physical activity levels, work towards a goal and monitor progress. Studies using the 10,000 steps per day goal have shown weight loss, improved glucose tolerance, and reduced blood pressure from increased physical activity toward achieving this goal. The following pedometer indices have been developed to provide a guideline on steps and activity levels:

Sedentary is less than 5,000 steps per day
Low active is 5,000 to 7,499 steps per day
Somewhat active is 7,500 to 9,999 steps per day
Active is 10,000 to 12,499 steps per day
Highly active is 12,500 steps per day or more

Although the program promotes the goal of reaching 10,000 steps each day for healthy adults, this goal is not universally appropriate across all ages and physical function. There are some groups where the goal of 10,000 steps may not be accurate, such as the elderly and children. Your individual step goal should be based on current activity levels and overall health and fitness goals. For people who normally do fewer than 10,000 steps, increasing daily activity by 1,000 to 2,000 steps per day will provide health benefits.

4.1. Correlation Between Steps & Calories Burned

# The fill mapping was dropped: it has no visible effect on the default point shape
ggplot(data = activity, aes(x = totalsteps, y = calories)) +
  geom_point() + geom_smooth() + labs(title = "Total Steps vs Calories")

The plot shows a positive relationship between total steps and calories: the more steps users take, the more calories they tend to burn.

daily_average <- activity_sleep %>%
  group_by(id) %>%
  summarise (mean_daily_steps = mean(totalsteps), mean_daily_calories = mean(calories), mean_daily_sleep = mean(totalminutesasleep))

head(daily_average)

Now, let us classify our users by daily average steps:

# Upper bounds use the next category's lower bound, so fractional means
# such as 7499.5 are still classified instead of becoming NA
user_type <- daily_average %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000 ~ "sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active",
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active",
    mean_daily_steps >= 10000 ~ "very active"
  ))

head(user_type)
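
The same binning can also be sketched with base R's `cut()`, which makes the interval boundaries explicit. This is an alternative illustration, not the method used above; the step values are hypothetical and were chosen to sit on or near the category boundaries.

```r
# Hypothetical mean daily step counts, including boundary cases
mean_daily_steps <- c(4200, 7499.5, 7500, 9999.9, 10000, 13000)

# right = FALSE makes each interval closed on the left: [5000, 7500), etc.
user_type <- cut(
  mean_daily_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right = FALSE
)

print(data.frame(mean_daily_steps, user_type))
```

Because the breaks share their end points, every finite value falls into exactly one category, which is the property the classification needs when it is applied to averages rather than whole step counts.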

4.2. Types of Users

Now that we have a new column with the user type, we will create a data frame with the percentage of each user type so that we can visualize them on a graph.

user_type_percent <- user_type %>%
  group_by(user_type) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(user_type) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

user_type_percent$user_type <- factor(user_type_percent$user_type , levels = c("very active", "fairly active", "lightly active", "sedentary"))


head(user_type_percent)
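
For comparison, the share of each category can also be computed in base R with `table()` and `prop.table()`. The vector of user types below is made up for illustration, and the percentage labels are built with `paste0()` rather than the scales package.

```r
# Hypothetical user types for eight users
user_type <- c("sedentary", "very active", "fairly active", "sedentary",
               "lightly active", "very active", "fairly active", "sedentary")

# Share of users in each category; the shares sum to 1
shares <- prop.table(table(user_type))

# Build a small summary table with rounded percentage labels
pct_labels <- paste0(round(100 * as.numeric(shares)), "%")
summary_tbl <- data.frame(
  user_type = names(shares),
  total_percent = as.numeric(shares),
  labels = pct_labels
)
print(summary_tbl)
```

With only about 30 users in the real dataset, each user moves a category's share by roughly three percentage points, so small differences between categories should be read with caution.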

Below we can see that users are fairly evenly distributed across activity levels based on their daily step counts. This suggests that users of all activity levels wear smart devices.

user_type_percent %>%
  ggplot(aes(x="",y=total_percent, fill=user_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#85e085","#e6e600", "#ffd480", "#ff8080")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  labs(title="User type distribution")

4.3. Steps and minutes asleep per weekday

We now want to know on which days of the week users are more active and on which days they sleep more. We will also check whether users walk the recommended number of steps and get the recommended amount of sleep.

Below we derive the weekday from our date column. We also calculate the average steps walked and minutes slept per weekday.

weekday_steps_sleep <- activity_sleep %>%
  mutate(weekday = weekdays(date))

weekday_steps_sleep$weekday <- ordered(
  weekday_steps_sleep$weekday,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"))

weekday_steps_sleep <- weekday_steps_sleep %>%
  group_by(weekday) %>%
  summarize (daily_steps = mean(totalsteps), daily_sleep = mean(totalminutesasleep))

head(weekday_steps_sleep)
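
One thing to be aware of is that `weekdays()` returns day names in the language of the session locale, so the English factor levels above only match in an English locale. A minimal sketch of a locale-independent alternative, using the ISO weekday number from `format()` and hypothetical dates, could look like this:

```r
# Hypothetical dates: a Tuesday, a Saturday and a Sunday
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# %u gives the ISO weekday number: 1 = Monday ... 7 = Sunday
day_number <- as.integer(format(dates, "%u"))

# Attach English labels explicitly instead of relying on the locale
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
weekday <- factor(day_names[day_number], levels = day_names, ordered = TRUE)

print(weekday)
```

The resulting ordered factor sorts from Monday to Sunday in plots regardless of the system language.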

ggarrange(
    ggplot(weekday_steps_sleep) +
      geom_col(aes(weekday, daily_steps), fill = "#006699") +
      geom_hline(yintercept = 7500) +
      labs(title = "Daily steps per weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1)),
    ggplot(weekday_steps_sleep, aes(weekday, daily_sleep)) +
      geom_col(fill = "#85e0e0") +
      geom_hline(yintercept = 480) +
      labs(title = "Minutes asleep per weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))
  )

In the graphs above we can determine the following:

Users walk the recommended 7,500 steps a day on every day except Sunday.
Users do not sleep the recommended 8 hours (480 minutes).

4.4. Hourly steps throughout the day

Going deeper into our analysis, we want to know when exactly users are most active during the day.

We will use the hourly steps data frame and split its date_time column into separate date and time columns.

head(steps)

steps <- steps %>%
  separate(date_time, into = c("date", "time"), sep= " ") %>%
  mutate(date = ymd(date)) 
  
head(steps)
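
Splitting on the space character relies on how the date-time column is converted to text, and depending on the R version that conversion may omit the time part for values that fall exactly on midnight. As a hedged alternative sketch, the date and time can be extracted directly with `as.Date()` and `format()`; the values below are made up and UTC is an assumed time zone.

```r
# Hypothetical hourly timestamps, including one at midnight
date_time <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 13:00:00"),
                        tz = "UTC")

# Calendar date of each timestamp in the same time zone
date <- as.Date(date_time, tz = "UTC")

# Time of day as text; an explicit format keeps "00:00:00" for midnight
time <- format(date_time, "%H:%M:%S")

print(data.frame(date, time))
```

Checking for missing values in the time column after the split is a quick way to see whether the midnight rows were affected.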

steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(steptotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) + 
  labs(title = "Hourly steps throughout the day", x="", y="") + 
  scale_fill_gradient(low = "green", high = "red")+
  theme(axis.text.x = element_text(angle = 90))

We can see that users are more active between 8am and 7pm, walking the most steps during lunch time from 12pm to 2pm and in the evening from 5pm to 7pm.

4.5. Correlations: Daily Steps and Daily Sleep

We will now determine if there is any correlation between different variables:

Daily steps and daily sleep
Daily steps and calories

ggarrange(
ggplot(activity_sleep, aes(x=totalsteps, y=totalminutesasleep))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Daily steps vs Minutes asleep", x = "Daily steps", y= "Minutes asleep") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(activity_sleep, aes(x=totalsteps, y=calories))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Daily steps vs Calories", x = "Daily steps", y= "Calories") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
)

Per our plots:

There is no clear correlation between daily activity level based on steps and the number of minutes users sleep per day.
On the other hand, we can see a positive correlation between steps and calories burned. As expected, the more steps walked, the more calories tend to be burned.
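
A visual impression of correlation can be backed up with a number. The sketch below computes Pearson's correlation coefficient with base R's `cor()` on made-up values; the vectors are hypothetical and do not come from the Fitbit data.

```r
# Hypothetical daily totals for six days
totalsteps <- c(2000, 4000, 6000, 8000, 10000, 12000)
calories   <- c(1600, 1850, 2000, 2300, 2450, 2700)

# Pearson correlation: close to 1 for a strong positive linear relationship,
# close to 0 when there is no linear relationship
r <- cor(totalsteps, calories)
print(round(r, 3))
```

Applied to the real columns, for example `cor(activity_sleep$totalsteps, activity_sleep$totalminutesasleep)`, the same call would put a number on the two plots above.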

4.6. Use of smart device

4.6.1. Days used smart device

Now that we have seen some trends in activity, sleep, and calories burned, we want to see how often the users in our sample use their devices. That way we can plan our marketing strategy and see which features would benefit the use of smart devices.

We will calculate the number of users that use their smart device on a daily basis, classifying our sample into three categories, knowing that the date interval is 31 days:

high use - users who use their device between 21 and 31 days.
moderate use - users who use their device between 11 and 20 days.
low use - users who use their device between 1 and 10 days.

First we will create a new data frame grouping by id, calculating the number of days used, and creating a new column with the classification explained above.

# Each row of activity_sleep is one day on which a user recorded
# both activity and sleep, so n() counts those days per user
daily_use <- activity_sleep %>%
  group_by(id) %>%
  summarize(days_used = n()) %>%
  mutate(usage = case_when(
    days_used >= 1 & days_used <= 10 ~ "low use",
    days_used >= 11 & days_used <= 20 ~ "moderate use",
    days_used >= 21 & days_used <= 31 ~ "high use"
  ))
  
head(daily_use)

We will now create a percentage data frame to better visualize the results in the graph. We are also ordering our usage levels.

daily_use_percent <- daily_use %>%
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usage) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

daily_use_percent$usage <- factor(daily_use_percent$usage, levels = c("high use", "moderate use", "low use"))

head(daily_use_percent)

Now that we have our new table, we can create the plot.

daily_use_percent %>%
  ggplot(aes(x="",y=total_percent, fill=usage)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("#006633","#00e673","#80ffbf"),
                    labels = c("High use - 21 to 31 days",
                                 "Moderate use - 11 to 20 days",
                                 "Low use - 1 to 10 days"))+
  labs(title="Daily use of smart device")

Analyzing our results, we can see that:

50% of the users in our sample use their device frequently, between 21 and 31 days.
12% use their device between 11 and 20 days.
38% of the users in our sample rarely use their device, between 1 and 10 days.

5. Conclusion & Recommendations

The findings from the Fitbit dataset point to a relationship between fitness app usage and health. A fitness app acts as a motivator and has a close relationship with its users. These findings describe the different activities users engage in and how they track their health trends.

Bellabeat should improve the quality of its apps and offer something better than its market rivals.
Bellabeat should also produce smart watches for couples, so that partners can motivate and compete with one another.
The app's summary results should offer motivational advice to users depending on whether or not they reach their daily or weekly goals.

6. References

https://www.kaggle.com/code/macarenalacasa/capstone-case-study-bellabeat

https://www.10000steps.org.au/articles/counting-steps/
