
Bellabeat's Data Analysis Case Study for Google Data Analytics Professional Certificate


xgabrielex/Bellabeat-Data-Analysis-Case-Study

Bellabeat-Data-Analysis-Case-Study

Bellabeat is a wellness brand for women with a variety of products and services focused on women’s health. The company develops wearables and accompanying products that monitor biometric and lifestyle data to help women better understand how their bodies work and make healthier choices. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women. By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. (https://bellabeat.com/)

Characters:

  • Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals — as well as how you, as a junior data analyst, can help Bellabeat achieve them.

Products:

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf Tracker connects to the Bellabeat app to track activity, sleep, and stress.
  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty, and mindfulness based on their lifestyle and goals.

ASK

Scenario:

Urška Sršen, co-founder and Chief Creative Officer of Bellabeat believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. As a Junior Data Analyst, I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices.

There are three main questions that need to be answered:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat's marketing strategy?

Business task

Perform data analysis on non-Bellabeat products to discover consumer usage trends, and provide well-founded insights and high-level recommendations to Bellabeat’s stakeholders: Urška Sršen, co-founder and Chief Creative Officer, and Sando Mur, mathematician and Bellabeat’s cofounder.

PREPARE

About the data

The data source used in this analysis is a Kaggle data set. The crowd-sourced Fitbit dataset, dated 03.12.2016-05.12.2016, consists of 18 CSV files generated by respondents to a survey distributed via Amazon Mechanical Turk. It was published on Zenodo, where it received a DOI, under a CC0: Public Domain license.

The Digital Object Identifier (DOI) of this data is 10.5281/zenodo.53894. It was posted on Zenodo by Furberg, Robert; Brinton, Julia; Keating, Michael; Ortiz, Alexa. Because the data was collected through a survey, its integrity is hard to verify. However, it is hosted on Zenodo, a credible repository operated by CERN and OpenAIRE to support Open Science.

There are some issues with bias and credibility in this data. The data could be biased because there is no information about the respondents: we do not know their gender, age, and so on. The data also does not come from a Bellabeat product, and Fitbit usage could differ slightly from usage of Bellabeat's products. Ideally, data-driven decisions for Bellabeat would be based on data from Bellabeat products.

Evaluation of ROCCC

  • Reliable - Data is not necessarily very reliable since it’s not Bellabeat’s data and it’s only 30 consumers' data.
  • Original - The data is not original and was produced by a third party - Amazon Mechanical Turk.
  • Comprehensive - The data is somewhat comprehensive since it has 18 datasets that include calories, steps, sleep, and so on, which are relevant to our analysis.
  • Current - The data is from 2016, so it’s not current. At the time of the analysis it is 7 years old, which could mean that usage and trends have changed since then.
  • Cited – The data on Kaggle has been used many times by various analysts. On Zenodo, where it got the DOI, it was cited 3 times.

PROCESS

Tools for the process

To clean, manipulate, and analyze the data, I have chosen the R programming language. I will ensure data integrity by not changing the original data, but rather creating new data frames and manipulating those. I will also keep notes of all changes made and actions taken regarding this data.

Start of data processing

I have uploaded the dataset to RStudio. Once all 18 CSV files had been uploaded, I installed and loaded the tidyverse package and set the working directory.

# installing and loading tidyverse
install.packages("tidyverse")
library(tidyverse)
# setting working directory
setwd("/cloud/project/Fitabase Data 4.12.16-5.12.16")

Choosing the files to work with

I looked through the CSV files in the File pane and immediately decided not to use the following files:

hourlyCalories_merged.csv, hourlySteps_merged.csv, minuteCaloriesNarrow_merged.csv, minuteCaloriesWide_merged.csv, minuteIntensitiesNarrow_merged.csv, minuteIntensitiesWide_merged.csv, minuteSleep_merged.csv, minuteStepsNarrow_merged.csv, minuteStepsWide_merged.csv, minuteMETsNarrow_merged.csv.

I chose to analyze daily patterns rather than minute and hour ones because I believe they will provide more useful insights. Then, I read the remaining CSV files and assigned them data frames.

# reading files in the data set and assigning them to data frames
daily_activity <- read.csv('dailyActivity_merged.csv')
daily_calories <- read.csv('dailyCalories_merged.csv')
daily_intensities <- read.csv('dailyIntensities_merged.csv')
daily_steps <- read.csv('dailySteps_merged.csv')
# note: read_csv() comes from readr (loaded with tidyverse) and returns a tibble
heartrate_seconds <- read_csv('heartrate_seconds_merged.csv')
hourly_intensities <- read.csv('hourlyIntensities_merged.csv')
sleep_day <- read.csv('sleepDay_merged.csv')
weight_log_info <- read.csv('weightLogInfo_merged.csv')

After assigning CSV files data frames, I got familiar with the data, to see which CSV files I would be using for the analysis. I used head() and n_distinct() functions to get familiar with data frames daily_activity, daily_calories, daily_intensities, daily_steps.

#trying to get familiar with data 
head(daily_activity)
head(daily_calories)
head(daily_intensities)
head(daily_steps)
n_distinct(daily_activity$Id)
n_distinct(daily_calories$Id)
n_distinct(daily_intensities$Id)
n_distinct(daily_steps$Id)

Running this code showed that the contents of daily_calories, daily_intensities, and daily_steps are already included in the daily_activity data frame, so I will use daily_activity for further analysis. There are also 33 distinct Ids in all 4 data frames, although 30 were expected.
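
The visual comparison with head() can be backed by an explicit check. The following is a minimal sketch on made-up toy data, assuming that the calories file keys its rows by Id and a date column named ActivityDay (an assumption about the column name); it joins the two tables and confirms that the Calories values agree.

```r
# Toy stand-ins for daily_activity and daily_calories (all values are made up).
daily_activity_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 5000),
  Calories = c(1985, 1797, 2100)
)
daily_calories_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),  # assumed column name
  Calories = c(1985, 1797, 2100)
)

# Join on Id and date; the shared Calories column gets one suffix per source.
merged <- merge(
  daily_activity_toy, daily_calories_toy,
  by.x = c("Id", "ActivityDate"), by.y = c("Id", "ActivityDay"),
  suffixes = c("_activity", "_calories")
)

# TRUE only if every calories row found a partner and the values are identical.
calories_match <- nrow(merged) == nrow(daily_calories_toy) &&
  all(merged$Calories_activity == merged$Calories_calories)
print(calories_match)
```

If the check returns TRUE for each of the three smaller data frames, dropping them in favor of daily_activity loses no information.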

Next, I got familiar with the remaining data frames.

# getting familiar with heartrate_seconds data frame
head(heartrate_seconds)
n_distinct(heartrate_seconds$Id)

After executing the code, it was clear that this data frame contains fewer than half of the expected unique Ids, so it will not be used for further analysis. Next, I got familiar with the hourly_intensities and sleep_day data frames.

# getting familiar with hourly_intensities and sleep_day data frames
head(hourly_intensities)
n_distinct(hourly_intensities$Id) 
head(sleep_day)
n_distinct(sleep_day$Id)

After executing the code, it looks like these two data sets can be useful for further analysis.

# getting familiar with weight_log_info data frame
head(weight_log_info)
n_distinct(weight_log_info$Id)

Unfortunately, after running the code it was clear that I could not use the data frame weight_log_info, since there are only 8 distinct Ids in this data frame.
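
The per-file checks above can also be done in one pass. This is a rough sketch on toy data frames; the cutoff `min_ids` is a hypothetical threshold chosen for illustration, not a value from the analysis.

```r
# Toy data frames standing in for the real ones (Ids are made up).
frames <- list(
  daily_activity = data.frame(Id = c(1, 2, 3, 3)),
  sleep_day = data.frame(Id = c(1, 2, 2)),
  weight_log_info = data.frame(Id = c(1, 1))
)

# Number of distinct participants in each data frame.
id_counts <- sapply(frames, function(df) length(unique(df$Id)))

# Keep only the frames that cover at least min_ids participants
# (min_ids is a hypothetical cutoff for this toy example).
min_ids <- 2
usable <- names(id_counts)[id_counts >= min_ids]

print(id_counts)
print(usable)
```

Keeping the counts in a single named vector makes it easy to document why a data frame was excluded.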

Files that will be used for processing and analysis:

After reviewing all data frames, I have decided to use the following for further processing and analysis:

  • daily_activity
  • hourly_intensities
  • sleep_day

Cleaning selected data frames - Daily Activity

I started with the daily_activity data frame. Running the glimpse() and str() functions gave me basic information: the data frame has 940 rows and 15 columns, and I could see the data type of each variable. That is where I noticed that ActivityDate is a character variable rather than a date, which I would need to change.

# gathering basic information about daily_activity data frame
glimpse(daily_activity)
str(daily_activity)

I checked how many distinct Ids there were and whether there were any NA values or duplicates. There were 33 distinct Ids, and no NA or duplicate values were found.

# number of distinct Id in the daily_activity data frame
n_distinct(daily_activity$Id)

# checking to see if there are any NA values in the data frame
sum(is.na(daily_activity))

# checking for duplicates
sum(duplicated(daily_activity))

Since there were no NA values, and no duplicates, it was time to change the data type of ActivityDate. I've installed the necessary package lubridate and changed the data type.

# installing and loading the needed package
install.packages("lubridate")
library(lubridate)

# changing ActivityDate data type
daily_activity$Date <- mdy(daily_activity$ActivityDate)

# checking to see if the data type has been changed correctly
str(daily_activity)
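
Besides inspecting str(), it can help to count how many values failed to parse. The sketch below uses base R's as.Date() instead of lubridate's mdy() so that it runs without extra packages; the sample strings are made up, and the last one is deliberately invalid.

```r
# Sample month/day/year strings (made up); the last one is deliberately invalid.
activity_dates <- c("4/12/2016", "5/1/2016", "not a date")

# Base R equivalent of the month/day/year parsing step.
parsed <- as.Date(activity_dates, format = "%m/%d/%Y")

# Values that could not be parsed become NA, so counting NAs finds bad rows.
n_failed <- sum(is.na(parsed))
print(parsed)
print(n_failed)

# Weekday as a number (0 = Sunday ... 6 = Saturday), independent of locale.
weekday_number <- as.POSIXlt(parsed[1])$wday
print(weekday_number)
```

A non-zero `n_failed` on the real column would indicate rows whose format differs from the expected month/day/year pattern.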

The next step was creating a new column from Date that shows the day of the week. I also created new columns converting minutes to hours, cleaned up the column names to make them neater, and changed the data type of Calories and Total_Steps to numeric. Finally, I created a new data frame with only the columns that I intended to use.

# now I will be creating a new column for weekdays
daily_activity$Weekday <- wday(daily_activity$Date, label=TRUE)
head(daily_activity)

# creating active hours column combining all activities
daily_activity$ActiveHours <-((daily_activity$VeryActiveMinutes)+(daily_activity$FairlyActiveMinutes)+
                                (daily_activity$LightlyActiveMinutes))/60

# creating sedentary hours column
daily_activity$SedentaryHours <-(daily_activity$SedentaryMinutes)/60

# creating different active hours columns
daily_activity$VeryActiveHours <-(daily_activity$VeryActiveMinutes)/60
daily_activity$FairlyActiveHours <-(daily_activity$FairlyActiveMinutes)/60
daily_activity$LightlyActiveHours <-(daily_activity$LightlyActiveMinutes)/60

# making the column names neat
colnames(daily_activity) <- gsub("([a-z])([A-Z])", "\\1_\\2", colnames(daily_activity))

#making calories and total steps as numeric 
daily_activity$Calories <- as.numeric(daily_activity$Calories)
daily_activity$Total_Steps <- as.numeric(daily_activity$Total_Steps)

# creating data frame from daily_activity with only columns that I will use
library(dplyr)
daily_activity2 <- daily_activity %>%
  select(Id, Date, Weekday, Calories, Total_Steps, Total_Distance, Very_Active_Distance, Moderately_Active_Distance, Light_Active_Distance, Active_Hours, Sedentary_Hours, Very_Active_Hours, Fairly_Active_Hours, Lightly_Active_Hours) 
head(daily_activity2)

Cleaning and manipulating daily activity was done. Next, it was time for another data frame.
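
The renaming step above relies on a regular expression that inserts an underscore wherever a lowercase letter is directly followed by an uppercase one. A small illustration on a handful of column names shows the effect:

```r
# A few of the original camel-case column names.
old_names <- c("Id", "ActivityDate", "TotalSteps", "VeryActiveMinutes", "Calories")

# "([a-z])([A-Z])" captures a lowercase/uppercase pair;
# "\\1_\\2" writes the pair back with an underscore in between.
new_names <- gsub("([a-z])([A-Z])", "\\1_\\2", old_names)
print(new_names)
```

Names without an internal lowercase-to-uppercase boundary, such as Id and Calories, are left unchanged, which is why the later select() call can still refer to them by their original names.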

Cleaning selected data frames - Hourly Intensities

I've run head(), glimpse(), and str() for basic information about the data frame. The hourly intensities data frame consists of 4 columns and 22099 rows. ActivityHour contains date and time in one column, so I will separate them and change the data type of Time. Also, column names need some manipulation to make them neater.

# now let's dive into another data set - hourly intensities
head(hourly_intensities)
str(hourly_intensities)
glimpse(hourly_intensities)

# checking for distinct Ids
n_distinct(hourly_intensities$Id)

# checking for NA values
sum(is.na(hourly_intensities))

# checking for duplicates
sum(duplicated(hourly_intensities))

I've also checked for unique Ids, NA values, and duplicates. The number of unique Ids is 33, and no NA values or duplicates were found. Then, I cleaned up the column names and split the Activity_Hour column into two using the str_split_fixed() function. I changed the Time data type and formatted it as a factor with explicit levels so that the hours appear in order when charted.

# after looking at the data set, we can see that the column names need tidying

colnames(hourly_intensities) <- gsub("([a-z])([A-Z])", "\\1_\\2", colnames(hourly_intensities))
glimpse(hourly_intensities)

# create two separate columns from Activity_Hour: one for the date and one for the time

library(stringr)

hourly_intensities[c('Date', 'Time')] <- str_split_fixed(hourly_intensities$Activity_Hour, ' ', 2)

# changing the format of the new column Time
# note: strptime() and format() below are base R functions

install.packages("chron")
library(chron)

hourly_intensities$Time <- strptime(hourly_intensities$Time, format = "%I:%M:%S %p")

# extract only the hours and format the result to include AM/PM
hourly_intensities$Time <- factor(format(hourly_intensities$Time, format = "%I %p"),
                                  levels = c("12 AM", "01 AM", "02 AM", "03 AM", "04 AM", "05 AM", 
                                         "06 AM", "07 AM", "08 AM", "09 AM", "10 AM", "11 AM", 
                                         "12 PM", "01 PM", "02 PM", "03 PM", "04 PM", "05 PM", 
                                         "06 PM", "07 PM", "08 PM", "09 PM", "10 PM", "11 PM"))

glimpse(hourly_intensities)

Cleaning and manipulation of the hourly intensities data frame was done. Next, it was time to clean and manipulate the sleep day data frame.
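
The 24 factor levels used for Time do not have to be typed by hand. As an alternative sketch (not the method used above), they can be generated arithmetically, which also avoids any dependence on how the locale prints AM/PM:

```r
# Hours of the day on a 24-hour clock.
hours_24 <- 0:23

# Convert to a 12-hour clock: 0 and 12 both display as 12.
hour_12 <- ifelse(hours_24 %% 12 == 0, 12, hours_24 %% 12)
am_pm <- ifelse(hours_24 < 12, "AM", "PM")

# Labels in chronological order: "12 AM", "01 AM", ..., "11 PM".
time_levels <- sprintf("%02d %s", hour_12, am_pm)

# A factor built with these levels sorts chronologically, not alphabetically.
sample_times <- factor(c("01 PM", "12 AM", "09 AM"), levels = time_levels)
sorted_times <- as.character(sort(sample_times))
print(sorted_times)
```

Because ggplot2 draws discrete axes in level order, a factor built this way keeps the hours in chronological order on a chart.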

Cleaning selected data frames - Sleep Day

I've run head(), glimpse(), and str() for basic information about the data frame. The sleep day data frame consists of 5 columns and 413 rows. I've also checked for unique Ids, NA values, and duplicates. There are 24 unique Ids and no NA values, but 3 duplicates were found. I removed the duplicates with the distinct() function and cleaned up the column names.

# cleaning the sleep_day data frame
head(sleep_day)
glimpse(sleep_day)
str(sleep_day)

# unique Ids
n_distinct(sleep_day$Id)

#checking for NA values
sum(is.na(sleep_day))

# checking for duplicates
sum(duplicated(sleep_day))

#removing duplicates
sleep_day <- distinct(sleep_day)
sum(duplicated(sleep_day))

#making col names neat
colnames(sleep_day) <- gsub("([a-z])([A-Z])", "\\1_\\2", colnames(sleep_day))
View(sleep_day)



Cleaning of all three selected data frames was done. Now the time has come for analysis of the data.
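
The duplicate check and removal used for sleep_day can be illustrated on a toy table. This sketch uses base R's duplicated() and unique(), which behave like the dplyr distinct() call for fully identical rows; the column names and values are made up for the example.

```r
# Toy sleep records (made up); rows 1 and 2 are exact duplicates.
sleep_toy <- data.frame(
  Id = c(1, 1, 2, 2),
  Sleep_Day = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  Total_Minutes_Asleep = c(400, 400, 380, 410)
)

# duplicated() flags every repeat of an earlier, fully identical row.
n_dupes <- sum(duplicated(sleep_toy))

# unique() keeps the first occurrence of each row.
sleep_dedup <- unique(sleep_toy)

print(n_dupes)
print(nrow(sleep_dedup))
```

Only rows that match in every column count as duplicates here; two records for the same Id and day with different values would both be kept.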

ANALYZE

Analysing daily_activity2 data frame

I started with the summary() function to obtain summary statistics for the variables in daily_activity2.

# data analysis of daily activity starts
summary(daily_activity2)

# from this we get important information on mean of variables like
# daily calories, steps and distance

This gave an average of 2304 calories burned per day, which is above the range for a day without exercise. According to Cleveland Clinic, humans burn 1300-2000 calories per day without working out, but that depends on age, sex, and so on. (https://health.clevelandclinic.org/calories-burned-in-a-day/)

Another statistic that I have observed is the average daily steps taken, which was 7638. The recommended step count per day is 8000 - 10000, so our consumers could do a little better in achieving that goal. (https://utswmed.org/medblog/how-many-steps-per-day/)
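
An average can hide how often the goal is actually met. As a small sketch on made-up step counts, the share of days reaching the lower bound of the recommendation can be computed alongside the mean; `step_goal` is set to the 8000-step lower bound quoted above.

```r
# Made-up daily step counts for illustration.
total_steps <- c(4000, 8500, 12000, 6000)

# Lower bound of the recommended range quoted above.
step_goal <- 8000

mean_steps <- mean(total_steps)

# Comparing to the goal gives TRUE/FALSE per day; the mean of that is the share of days.
share_days_on_goal <- mean(total_steps >= step_goal)

print(mean_steps)
print(share_days_on_goal)
```

Applied to daily_activity2$Total_Steps, the same two lines would show whether the shortfall comes from most days being slightly short or from a mix of very active and very inactive days.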

Next, I have created a graph to showcase which days of the week people were more active.

# now we will see which days in a week people are more active

install.packages("ggplot2")
library(ggplot2)

Frequency_Weekdays <- ggplot(daily_activity2, aes(x = factor(format(Date, "%a"), levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))) +
  geom_bar(fill = "lightpink") +
  labs(title = "Frequency on Weekdays",
       x = "Weekdays",
       y = "Frequency") +
  theme_classic()

print(Frequency_Weekdays)

ggsave("Frequency_Weekdays.png", plot = Frequency_Weekdays, device = "png")

# looks like most records are logged on Tue, Wed, and Thu, and then usage of the
# tracker drops towards the weekend
[Figure: bar chart "Frequency on Weekdays"]

The chart counts logged records per weekday, so it suggests that users tracked their activity most often on Tuesday, Wednesday, and Thursday. Tracking then slowly drops towards the weekend, and on Monday we observe a slight rise compared to Sunday.

Next, I created a pie chart representing different activity hours logged during the day.

# now I will make a pie chart for active hours and sedentary hours 

# first I will calculate sums of all hours logged, sedentary and active

Total_Sedentary_Hours <- sum(daily_activity2$Sedentary_Hours)
Total_Very_Active_Hours <- sum(daily_activity2$Very_Active_Hours)
Total_Fairly_Active_Hours <- sum(daily_activity2$Fairly_Active_Hours)
Total_Lightly_Active_Hours <- sum(daily_activity2$Lightly_Active_Hours)

# then I will create a data frame where these calculations will be stored

hours_chart_data <- data.frame (
  Variable = c("Very Active Hours", "Fairly Active Hours","Lightly Active Hours", "Sedentary Hours"),
  Value = c(Total_Very_Active_Hours, Total_Fairly_Active_Hours, Total_Lightly_Active_Hours, Total_Sedentary_Hours))

# then I will create percentage values for better representation

percentages <- round(100 * hours_chart_data$Value / sum(hours_chart_data$Value), 1)

# create a vector of labels with percentages
labels_with_percentages <- paste(hours_chart_data$Variable, percentages, "%")

# to have nicer color themes I will use RColorBrewer package

install.packages("RColorBrewer")
library(RColorBrewer)
library(graphics)
# pick a palette and label size, then create the pie chart

myPalette <- brewer.pal(5, "RdPu") 
label_size <- 0.6

# open a PNG graphics device so that the chart is saved to pie.png
png("pie.png")

pie(hours_chart_data$Value, 
    labels = labels_with_percentages, 
    main = "Percentage of Logged Hours", 
    cex.main = 1.2, 
    radius = 1, 
    border = "white", 
    col = myPalette, 
    cex = label_size)

# dev.off() closes the graphics device and writes the file
dev.off()

# this data suggests that people mostly log sedentary hours, which take up
# more than 81 % of all hours logged
[Figure: pie chart "Percentage of Logged Hours"]

The data suggests that users mostly log sedentary hours, which add up to 81.3% of all logged hours. This could mean either that users are not active enough or that they simply forget to log a workout or activity in the tracker. According to Fitbit, it takes at least 15 min for the tracker to start recording activity, so short bouts of activity may go unrecorded, which could be one reason why users appear mostly inactive during the day.

The next graph that I've decided to create was Steps vs Calories. This was done to see if there is any positive relationship between these two metrics.

# now I will create a scatter plot to see if there is any relationship 
# between steps and calories 

Steps_vs_Calories <- ggplot(daily_activity2, aes(x=Calories, y=Total_Steps, color=(Id)))+
  geom_point()+
  geom_smooth(method="loess", se=TRUE, fullrange=FALSE, level=0.95, color="black")+
  ggtitle("Steps Taken vs Calories Burned") +
  xlab("Calories") + ylab("Steps")+
  scale_color_gradient(low = "pink", high = "pink3")

print(Steps_vs_Calories)

ggsave("Steps_vs_Calories.png")

# as we can see there is positive correlation in the graph, showing that for 
# every Id the more steps taken the more calories are burned, which was expected
[Figure: scatter plot "Steps Taken vs Calories Burned"]

I observed a positive relationship between steps taken and calories burned per day. This means that for the most part, the more steps were taken the more calories were burned.
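
The visual impression from the scatter plot can be quantified with a correlation coefficient. The following sketch uses made-up values rather than the Fitbit data; cor() is base R, and the Spearman variant is included because it only assumes a monotonic relationship, not a linear one.

```r
# Made-up daily values for illustration.
steps <- c(2000, 5000, 8000, 11000, 14000)
calories <- c(1700, 2100, 2300, 2900, 3100)

# Pearson measures linear association; Spearman measures monotonic association.
pearson_r <- cor(steps, calories, method = "pearson")
spearman_rho <- cor(steps, calories, method = "spearman")

print(pearson_r)
print(spearman_rho)
```

On the real data this would be `cor(daily_activity2$Total_Steps, daily_activity2$Calories)`. A coefficient close to 1 would support the positive relationship seen in the plot, while a value near 0 would contradict it.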

Next, I wanted to see how calories correlated with active and sedentary hours. I have made two graphs called Active Hours vs Calories and Sedentary Hours vs Calories.

# now let's see if there is any correlation between active hours and calories

Active_hours_vs_Calories <- ggplot(daily_activity2, aes(x=Calories, y=Active_Hours, color=(Id)))+
  geom_point()+
  geom_smooth(method="loess", se=TRUE, fullrange=FALSE, level=0.95, color="black")+
  ggtitle("Active Hours vs Calories") +
  xlab("Calories") + ylab("Active Hours")+
  scale_color_gradient(low = "lightblue1", high = "lightblue3")

print(Active_hours_vs_Calories)

ggsave("Active_hours_vs_Calories.png")

# there is a positive correlation: the more active hours per day, the more calories are burned

Sedentary_hours_vs_Calories <- ggplot(daily_activity2, aes(x=Calories, y=Sedentary_Hours, color=(Id)))+
  geom_point()+
  geom_smooth(method="loess", se=TRUE, fullrange=FALSE, level=0.95, color="black")+
  ggtitle("Sedentary Hours vs Calories") +
  xlab("Calories") + ylab("Sedentary Hours")+
  scale_color_gradient(low = "seagreen1", high = "seagreen4")

print(Sedentary_hours_vs_Calories)

ggsave("Sedentary_hours_vs_Calories.png")

# we can see that sedentary hours do not have much correlation with
# calories burned during the day; as long as you are moving at least a little bit,
# you will be burning calories, and the sedentary rest of the day will
# have little effect
[Figure: scatter plots "Active Hours vs Calories" and "Sedentary Hours vs Calories"]

As predicted, the data suggests that active hours have a positive relationship with calories: the more active you are, the more calories you burn. Sedentary hours have, for the most part, no clear relationship with burned calories. This suggests that, however sedentary the rest of the day is, even a short period of activity contributes to the calories burned.

The next relationship that I wanted to explore was how activity differences affected distance.

# now let's see how activity differences translate to distance

# total distance and total active hours per activity type, one row per type
activity_distance <- data.frame(
  activity_types = c("Very Active", "Fairly Active", "Lightly Active"),
  distance = c(sum(daily_activity2$Very_Active_Distance),
               sum(daily_activity2$Moderately_Active_Distance),
               sum(daily_activity2$Light_Active_Distance)),
  active_hours = c(sum(daily_activity2$Very_Active_Hours),
                   sum(daily_activity2$Fairly_Active_Hours),
                   sum(daily_activity2$Lightly_Active_Hours)))

# distance covered per active hour for each activity type
activity_distance$ratio <- activity_distance$distance / activity_distance$active_hours

View(activity_distance)

# from this table we can see that very active hours have the largest ratio, meaning
# that 1 very active hour gives about 4 miles, while 1 lightly active hour gives only about 1 mile
[Table: activity_distance with distance, active_hours, and ratio per activity type]

From the table above we can see that very active hours have the largest ratio: 1 very active hour corresponds to about 4 miles of distance, whereas 1 lightly active hour gives only about 1 mile. This ratio suggests that you could reach your distance goals and have a more productive workout with higher intensity.

Next, I wanted to see different activity types of hours in correlation with burned calories.

# now let's see how activity type affects calories

install.packages("ggpubr")
library(ggpubr)

Very_Active_Hours_vs_Calories <- ggplot(daily_activity2, aes(x=Calories, y=Very_Active_Hours, color=(Id)))+
  geom_point()+
  geom_smooth(method="loess", se=TRUE, fullrange=FALSE, level=0.95, color="black")+
  ggtitle("Very Active Hours vs Calories") +
  xlab("Calories") + ylab("Very Active Hours")+
  scale_color_gradient(low = "palevioletred1", high = "palevioletred4")

Fairly_Active_Hours_vs_Calories <- ggplot(daily_activity2, aes(x=Calories, y=Fairly_Active_Hours, color=(Id)))+
  geom_point()+
  geom_smooth(method="loess", se=TRUE, fullrange=FALSE, level=0.95, color="black")+
  ggtitle("Fairly Active Hours vs Calories") +
  xlab("Calories") + ylab("Fairly Active Hours")+
  scale_color_gradient(low = "paleturquoise1", high = "paleturquoise4")

Lightly_Active_Hours_vs_Calories <- ggplot(daily_activity2, aes(x=Calories, y=Lightly_Active_Hours, color=(Id)))+
  geom_point()+
  geom_smooth(method="loess", se=TRUE, fullrange=FALSE, level=0.95, color="black")+
  ggtitle("Lightly Active Hours vs Calories") +
  xlab("Calories") + ylab("Lightly Active Hours")+
  scale_color_gradient(low = "palegreen", high = "palegreen4")

ggarrange(Very_Active_Hours_vs_Calories, Fairly_Active_Hours_vs_Calories, Lightly_Active_Hours_vs_Calories, 
          ncol = 2, nrow = 2)
[Figure: scatter plots "Very Active Hours vs Calories", "Fairly Active Hours vs Calories", and "Lightly Active Hours vs Calories"]

The graphs above show an interesting pattern. As expected, very active hours have the strongest positive correlation with burned calories. Fairly active hours rise slowly with burned calories, which could mean a very weak positive correlation. Lightly active hours show a rapid rise, and then at around 2000 calories and 4 hours the curve flattens, which would mean that beyond this point there is no correlation between the two metrics.

Analysing hourly_intensities data frame

With this data set, I wanted to see which hours of the day were mostly logged and therefore the most active. First I grouped the data set by Time variable.

time_intensity<- hourly_intensities %>%
  group_by(Time) %>%
  summarize(Sum_Total_Intensity=sum(Total_Intensity))

View(time_intensity)

Then I created a chart to observe the relationship between intensity and hours in the day.

intensity_vs_hour <- ggplot(time_intensity, aes(x = Time, y = Sum_Total_Intensity))+
  geom_col(fill = "lightpink") +
  labs(title = "Hourly Intensity",
       x = "Time",
       y = "Intensity") +
  theme_classic()
print(intensity_vs_hour)  
[Figure: bar chart "Hourly Intensity"]

The graph suggests that users had the highest intensity logged from 5 PM to 7 PM. As expected we see very low activity from 11 PM until 5 AM. From 5 AM the intensity is rising steadily. The second highest wave is observed around 12 PM - 2 PM.
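
The peak hour can also be read off programmatically instead of from the chart. This is a minimal base R sketch on made-up records; aggregate() stands in for the group_by() and summarize() pipeline used above.

```r
# Made-up hourly records for illustration.
hourly_toy <- data.frame(
  Time = c("05 PM", "05 PM", "06 PM", "02 AM", "06 PM", "02 AM"),
  Total_Intensity = c(20, 25, 30, 1, 28, 0)
)

# Sum the intensity per hour label.
totals <- aggregate(Total_Intensity ~ Time, data = hourly_toy, FUN = sum)

# which.max() returns the row index of the largest total.
peak_hour <- as.character(totals$Time[which.max(totals$Total_Intensity)])

print(totals)
print(peak_hour)
```

Replacing `FUN = sum` with `FUN = mean` would give the average intensity per hour, which is less sensitive to some hours having more logged records than others.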

Analysing sleep_day data frame

I wanted to figure out how many hours users sleep per night on average and how long it takes them to fall asleep.

# how many hours users sleep on average
summary(sleep_day)
mean(sleep_day$Total_Minutes_Asleep)/60

# how many minutes users spend awake in bed
sleep_day$Minutes_Awake <- sleep_day$Total_Time_In_Bed - sleep_day$Total_Minutes_Asleep

average_minutes_awake <- mean(sleep_day$Minutes_Awake, na.rm = TRUE)

print(average_minutes_awake)

On average users sleep 6.9 hours per night, which is slightly below the NIH recommendation that adults sleep 7-9 hours per night. (https://www.nhlbi.nih.gov/health/sleep/how-much-sleep)

When it comes to minutes spent in bed before falling asleep, the Sleep Foundation recommends that it should take 15-20 min. In our case, users spend around 40 minutes awake in bed, which is used here as an estimate of the time it takes to fall asleep. (https://www.sleepfoundation.org/sleep-faqs/how-long-should-it-take-to-fall-asleep#:~:text=Most%20healthy%20people%20fall%20asleep,fall%20asleep%20easily%20every%20night)
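
To make the two sleep metrics concrete, here is a small worked example on made-up records. It repeats the calculations above and adds the share of nights below the 7-hour minimum, which is an extra illustrative metric and not part of the original analysis.

```r
# Made-up sleep records for illustration (minutes).
sleep_toy <- data.frame(
  Total_Minutes_Asleep = c(420, 390, 450),
  Total_Time_In_Bed = c(450, 440, 480)
)

# Average sleep duration in hours.
mean_hours_asleep <- mean(sleep_toy$Total_Minutes_Asleep) / 60

# Time in bed that was not spent asleep.
minutes_awake <- sleep_toy$Total_Time_In_Bed - sleep_toy$Total_Minutes_Asleep

# Share of nights shorter than the 7-hour minimum (illustrative extra metric).
share_below_7h <- mean(sleep_toy$Total_Minutes_Asleep < 7 * 60)

print(mean_hours_asleep)
print(minutes_awake)
print(share_below_7h)
```

The difference between time in bed and time asleep covers all awake time in bed, including waking during the night, so it is an upper bound on the time needed to fall asleep.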

SHARE

[Figure: overview of all visualizations]

Analysis Summary

  • The average daily calories burned is 2304. This exceeds the 1300-2000 calories that people typically burn per day without working out.
  • The average daily steps taken is 7638. Users have not reached the recommended 8000-10000 steps per day.
  • Users logged their activity most often on Tuesday, Wednesday, and Thursday. Logging dropped over the weekend and started to rise slowly on Monday.
  • 81.3% of hours logged were sedentary. This could mean either that users are not active enough or that they simply forget to log a workout or activity in the tracker.
  • As expected, daily steps and daily burned calories are positively correlated: the more steps were taken, the more calories were burned during the day. In the same way, active hours and daily calories are positively correlated: the more active hours logged during the day, the more calories burned. Sedentary hours and calories, however, show no clear correlation. This is expected but still encouraging, as it suggests that however sedentary our lifestyle is, moving at least a little burns calories.
  • The ratio of distance to active hours was highest for very active hours, at about 4 miles per very active hour, compared with about 1 mile per lightly active hour. This ratio suggests that users could reach their distance goals and have a more productive workout with higher intensity.
  • Looking at the different activity types and their relationship with burned calories, very active hours have a positive correlation, as with the distance metric: the more very active hours users logged, the more calories were burned. Up to about the 1-hour and 3000-calorie mark, hours rise slowly while calories rise fast. Beyond that mark, the hours needed to burn more calories rise significantly, so users have to stay very active for longer to burn more than 3000 calories. Fairly active hours show a weak positive correlation. Lightly active hours show a positive correlation up to around the 4-hour and 2000-calorie mark, after which I observed no correlation at all. This would suggest that shorter, higher-intensity work would be beneficial for those who are trying to lose some weight.
  • Hourly intensities logged during the day suggest that users had the highest intensity from 5 PM to 7 PM, which are typical after-work hours. As expected, we see very low activity from 11 PM until 5 AM, as it is night time. From 5 AM the intensity rises steadily. The second highest wave is observed around 12 PM - 2 PM, which is probably people who work night shifts going to work out, or some walking done during lunch at work.
  • Regarding sleep data, on average users sleep 6.9 hours per night, which according to NIH is not enough. To meet the minimum recommended hours of sleep, adults should sleep 7-9 hours per night.
  • When it comes to minutes spent in bed before falling asleep, the Sleep Foundation recommends that it should take 15-20 min. In our case, it takes around 40 minutes for users to fall asleep, which could mean that they have some trouble falling asleep, stay on their phones, and so on.

ACT

Bellabeat is a wellness brand for women with a goal to help women improve their health by developing wearables and accompanying products that monitor biometric and lifestyle data. After performing an analysis on non-Bellabeat products, I have identified user trends that can help not only Bellabeat's users but also influence the company's marketing strategy.

As my assignment was to produce recommendations towards one of Bellabeat's products, I've chosen the Bellabeat app, which collects the activity, sleep, stress, menstrual cycle, and mindfulness habits of its users. This data closely resembles the type of data collected from Fitbit users (which was analyzed in this project) and therefore can be used in drawing conclusions regarding users of Bellabeat's app.

Before presenting the main conclusions and recommendations regarding Bellabeat's app to the stakeholders, Urška Sršen, CCO, and Sando Mur, mathematician and Bellabeat’s cofounder, I would like to mention that the data I worked with was limited. The datasets contained only 24-33 unique users, and there was no information on whether the users were women, their age, and so on. The data was also collected through external sources and came from non-Bellabeat products.

For the future, I would suggest continuing the analysis of user trends. However, the data should come from the Bellabeat app, it should include user metadata, and the sample should be much larger than 30 users.

Conclusions

According to the CDC (Centers for Disease Control and Prevention), living a non-active lifestyle can lead to heart disease, obesity, high blood pressure, high blood cholesterol, type 2 diabetes, and various cancers. Regular physical activity is one of the most important factors in improving people's health, and it benefits everyone, regardless of their age, sex, race, ethnicity, or current fitness level. That being said, our analysis has shown that most of our users' logged hours (81.3%) were sedentary, that users were not hitting the recommended step count, and that they had trouble falling asleep and sleeping the recommended 7-9 hours per day. It is important to encourage Bellabeat's users to lead a more active lifestyle and understand its benefits.
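
A share such as the 81.3% sedentary figure is each activity level's logged time divided by the total logged time. The sketch below shows that arithmetic under stated assumptions: the function name, the level names, and the minute totals are hypothetical and are not taken from the dataset.

```python
def activity_shares(minutes_by_level):
    """Return each activity level's share of total logged minutes, in percent."""
    total = sum(minutes_by_level.values())
    if total == 0:
        raise ValueError("no logged minutes")
    return {level: 100 * m / total for level, m in minutes_by_level.items()}


# Hypothetical totals, chosen only to make the arithmetic easy to follow.
logged = {
    "sedentary": 800,
    "lightly_active": 150,
    "fairly_active": 20,
    "very_active": 30,
}

shares = activity_shares(logged)
for level, pct in shares.items():
    print(f"{level}: {pct:.1f}%")
```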

Recommendations

1. Reminder notifications

Users did not reach the recommended 8000-10000 steps per day; the figure they reached was 7638. Notifications could remind users of their step goal if they have one, or of a default target of 10000 steps per day.

2. Weekend challenge

Since I observed that users' activity declines on weekends, a "weekend fitness challenge" could be a great way to encourage people to be more active and log their activity on weekends.

3. The activity report

During the analysis of users' logged hours, it was clear that a striking 81.3% of logged time was sedentary. To avoid this in the Bellabeat app, it is important to communicate clearly with users about the importance of logging their activity, as it helps form recommendations, predictions, and suggestions for their healthy habits. After a month of usage, the user could get a monthly report with a short analysis of their data. The report would clearly state which days the app was not used and how accurate the report is as a percentage. This would help users understand that by logging their active time they are actually helping themselves.

4. Move a little

During the analysis, I learned that steps taken and active hours have a positive correlation with burned calories and that sedentary hours have no correlation. This is expected but still exciting news. With this information, the Bellabeat app could have a "Move a little" notification, which would be triggered after a long period of sedentary time logged in the app. It would encourage people to, as the name says, move at least a little, because that's all it takes to get closer to fitness goals or get those steps in.
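
One way the "Move a little" trigger could work is sketched below. It is only an illustration of the idea, not an existing Bellabeat feature or API: the 60-minute threshold and the function name are assumptions, and the quiet hours simply mirror the 11 PM to 5 AM low-activity window observed in the hourly intensity data.

```python
from datetime import datetime, timedelta

# Assumed threshold for "a long period of sedentary time" (not from the data).
SEDENTARY_LIMIT = timedelta(minutes=60)


def should_nudge(last_active_at, now, limit=SEDENTARY_LIMIT):
    """Return True if a "Move a little" notification should be sent."""
    # No nudges at night: 11 PM to 5 AM showed very low activity.
    in_quiet_hours = now.hour >= 23 or now.hour < 5
    sedentary_for = now - last_active_at
    return not in_quiet_hours and sedentary_for >= limit


afternoon = datetime(2024, 1, 1, 15, 0)
print(should_nudge(datetime(2024, 1, 1, 13, 30), afternoon))  # sedentary 90 min
print(should_nudge(datetime(2024, 1, 1, 14, 30), afternoon))  # sedentary 30 min
```

Keeping the threshold as a parameter would let the app tune it per user instead of hard-coding one value for everyone.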

5. Recommended HIIT workouts

For someone who is trying to lose weight or reach a certain distance daily, the data suggest that shorter, higher-intensity workouts would be the key. The Bellabeat app could have a recommended workouts section based on the activity logged and the preferences pre-defined by the user.

6. Motivation boost

As I learned from the data analysis, the most active hours of the day are 12 PM to 2 PM and 5 PM to 7 PM. People are most likely to be active during these hours, so the Bellabeat app could send "Motivation boost" notifications at these times. If the person was working out and received a motivational message, it would validate their work, and if not, it would work as a reminder to be more active.

7. Evening meditation

User data suggest that people are not getting enough sleep and are having a hard time falling asleep. To try to solve this issue, the Bellabeat app could have an "Evening meditation" integration that would recommend a 5-minute meditation before sleep. This could potentially help users fall asleep and get the recommended 7-9 hours of sleep per day.

Resources
