---
title: "BellaBeat Case Study using R"
author: "Thierry L."
date: "2023-02-19"
output: html_document
---


![](https://www.mobis.hr/upload/catalog/manufacturer/bellabeat-logo-02_5a2506cecd6ed.png)



## 1. Overview

Bellabeat is a high-tech company that makes health-focused smart products. To educate and empower women, it provides a variety of smart devices that track physical activity, sleep, mental stress, and reproductive health.

In this case, we'll be looking at fitness data from intelligent devices to see if we can take advantage of any untapped expansion opportunities for Bellabeat. Below, we'll take a closer look at Bellabeat's flagship app. 

The Bellabeat app tracks a user's activity, sleep, stress, menstrual cycle, and mindfulness practices and displays this information to the user. This insight into their current habits helps users make more informed choices about how to improve their health. Used together with Bellabeat's smart wellness products, the app can help users stay healthy and on track.


## 2. Ask

### 2.1 Business Task

This business task aims to inform the marketing strategy for Bellabeat products by analyzing consumer data from non-Bellabeat smart devices (Fitbit fitness trackers). The research will look for patterns in the participants' device usage and determine how Bellabeat can apply them to its product line to boost sales and customer loyalty. The findings will inform strategic marketing decisions and increase the efficiency of advertising initiatives.

Key Stakeholders:

* Urška Sršen, Bellabeat's cofounder and Chief Creative Officer
* Sando Mur, Mathematician and Bellabeat's cofounder; key member of the Bellabeat executive team
* Bellabeat marketing analytics team, responsible for collecting, analyzing, and reporting data that helps guide Bellabeat's marketing strategy
* Bellabeat customers, specifically those who use smart devices
* Potential Bellabeat customers who may be interested in purchasing a smart device
* Online retailers who sell Bellabeat products
* Traditional advertising media partners such as radio, out-of-home billboards, print, and television.


## 3. Prepare

### 3.1 Data source

Our investigation relies on data collected from Fitbit fitness trackers. The dataset was made available by [Mobius](https://www.kaggle.com/arashnic/fitbit) and is hosted on Kaggle.

### 3.2 Data information

To the extent permitted by law, the author has waived all copyright to the work worldwide, including all related and neighboring rights. The data can therefore be copied, modified, distributed, and performed, even for commercial purposes.

Thirty Fitbit users gave their permission to submit personal tracker data, which included minute-by-minute output for exercise, heart rate, and sleep tracking. Variation in the output can be attributed to the wide variety of Fitbit trackers available and to the fact that different people have different tracking habits and preferences.

In total, we have access to 18 separate CSV files, and the quantitative Fitbit data is spread across them. Each row represents a single time point for one subject, so every subject has data in multiple rows. Since information is recorded in discrete chunks throughout the day, each user is identified by an ID and has their own set of rows.

The data may be biased, as the sample consists of only 30 Fitbit users who opted to share their data. The sample may therefore differ from the general population. To assess the credibility of the data, we can use the ROCCC framework (Reliable, Original, Comprehensive, Current, and Cited). On those criteria the dataset falls short in two further ways: it is not current (it dates from 2016), and the collection period was short (2 months). For this reason, we will take a pragmatic approach in our case study.


## 4. Process

### 4.1 Installing and loading libraries

The following packages will be used for the analysis:

* `tidyverse`
* `dplyr`
* `lubridate`
* `ggpubr`


In [None]:
# Silence warnings in the rendered output
options(warn = -1)

In [None]:
library(tidyverse)
library(dplyr)
library(lubridate)
library(ggpubr)

### 4.2 Importing datasets

Although there are many CSV files available, for the sake of our analysis we will focus only on the following data sets:


In [None]:
daily_activity <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
daily_sleep <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weight_log <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

### 4.3 Exploring the datasets

Preview and summary of our data frames.

In [None]:
head(daily_activity)
str(daily_activity)

head(daily_sleep)
str(daily_sleep)

head(weight_log)
str(weight_log)

### 4.4 Cleaning and preprocessing data

Now that we know how our data is structured, we can correct any error or inconsistency.

##### Verifying number of users

In [None]:
n_distinct(daily_activity$Id)
n_distinct(daily_sleep$Id)
n_distinct(weight_log$Id)
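
If the three counts differ, it helps to see which users are missing from the smaller tables. A minimal base R sketch, using made-up IDs rather than the Fitbit data (the vectors `activity_ids` and `sleep_ids` are hypothetical stand-ins for `unique(daily_activity$Id)` and `unique(daily_sleep$Id)`):

```r
# Hypothetical user IDs, standing in for unique(daily_activity$Id) etc.
activity_ids <- c(101, 102, 103, 104)
sleep_ids    <- c(101, 103)

# Users with activity records but no sleep records
missing_sleep <- setdiff(activity_ids, sleep_ids)
print(missing_sleep)

# Share of activity users who also logged sleep
sleep_coverage <- length(intersect(activity_ids, sleep_ids)) / length(activity_ids)
print(sleep_coverage)
```

A low coverage value would mean that any analysis combining the two tables describes only a subset of the participants.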

##### Removing duplicates and N/A

In [None]:
daily_activity <- daily_activity %>% 
  distinct() %>% 
  drop_na()

daily_sleep <- daily_sleep %>% 
  distinct() %>% 
  drop_na()

weight_log <- weight_log %>% 
  distinct() 

We can now verify if every duplicate has been removed.

In [None]:
sum(duplicated(daily_activity))
sum(duplicated(daily_sleep))
sum(duplicated(weight_log))

##### Cleaning columns and correcting date/time inconsistencies

In [None]:
daily_activity_clean <- daily_activity %>%
  filter(Calories > 0 & TotalSteps > 0) %>%
  mutate(date = as.Date(ActivityDate, format = "%m/%d/%Y")) %>% 
  mutate(TotalActivity = LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes)

daily_sleep_clean <- daily_sleep %>%
  filter(TotalMinutesAsleep > 0) %>%
  mutate(date = as.Date(SleepDay, format = "%m/%d/%Y"))

weight_log_clean <- weight_log %>% 
  select(-c(WeightPounds, Fat, IsManualReport, LogId)) %>% 
  filter(BMI > 0 & WeightKg > 0) %>% 
  mutate(date = as.Date(Date, format = "%m/%d/%Y"))
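
The same `format` string is applied to all three date columns. The sketch below shows how `as.Date()` behaves, assuming (the raw values are not printed in this notebook) that some columns hold a bare date such as `4/12/2016` and others a date followed by a time such as `4/12/2016 12:00:00 AM`:

```r
# Hypothetical raw values; the real columns are ActivityDate, SleepDay and Date
date_only      <- "4/12/2016"
date_with_time <- "4/12/2016 12:00:00 AM"

# as.Date() parses the part described by `format` and ignores what follows
parsed_date_only      <- as.Date(date_only, format = "%m/%d/%Y")
parsed_date_with_time <- as.Date(date_with_time, format = "%m/%d/%Y")
print(parsed_date_only)
print(parsed_date_with_time)

# A string that does not match the format becomes NA instead of raising an error
unparsed <- as.Date("April 12, 2016", format = "%m/%d/%Y")
print(is.na(unparsed))
```

Because a mismatch produces `NA` silently, counting missing values in the new column, for example with `sum(is.na(daily_sleep_clean$date))`, is a quick way to confirm that every row was parsed.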

### 4.5 Merging relevant datasets

##### Merging daily activity data with BMI (Body Mass Index)

In [None]:
daily_BMI <- weight_log_clean %>% 
  select(Id, BMI)
daily_activity_BMI <- merge(daily_activity_clean, daily_BMI, by = "Id")

##### Merging daily activity data with the daily sleep

In [None]:
# Match each activity record to the sleep record of the same user and day
daily_activity_sleep <- merge(daily_activity_clean, daily_sleep_clean, by = c("Id", "date"))
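
The choice of key columns determines how many rows a merge returns. When `by` is not supplied, `merge()` matches on all columns the two data frames have in common. A toy sketch with made-up values (not the Fitbit data) shows the difference between matching on `Id` alone and on `Id` plus `date`:

```r
# Toy data: two users, two days each (values are made up)
activity <- data.frame(
  Id = c(1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  TotalSteps = c(4000, 9000, 12000, 7000)
)
sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(420, 390, 480, 450)
)

# Matching on Id alone pairs every activity day with every sleep day of a user
by_id <- merge(activity, sleep, by = "Id")

# Matching on Id and date keeps one row per user and day
by_id_date <- merge(activity, sleep, by = c("Id", "date"))

print(nrow(by_id))       # 8 rows: 2 users x 2 activity days x 2 sleep days
print(nrow(by_id_date))  # 4 rows: 2 users x 2 days
```

Comparing `nrow()` before and after a merge is a cheap sanity check: a result much larger than either input usually means the keys do not identify rows uniquely.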

## 5. Analyse and Share

We can explore the data and identify trends in smart device usage. For example, we can plot the daily step count over time.


### 5.1 Diving into the steps count

In [None]:
ggplot(daily_activity_clean, aes(x = date, y = TotalSteps)) +
  geom_col(fill = "#1f77b4") +
  labs(title = "Daily Step Count", x = "Date", y = "Total steps") +
  scale_y_continuous(labels = scales::comma_format()) 

From the plot, we can see that there is a lot of variability in the step count, with some days having very high or very low step counts. However, there are some patterns that emerge. 

We can also look at the distribution of step counts and calculate summary statistics.


In [None]:
summary(daily_activity$TotalSteps)

In [None]:
daily_activity_clean %>% 
  group_by(Id) %>% 
  summarise(mean_steps = mean(TotalSteps)) %>% 
  ggplot(aes(mean_steps)) +
  geom_histogram(binwidth = 500, fill = "#1f77b4", color = "white") +
  labs(title = "Distribution of Average Daily Steps per User",
       x = "Average Daily Steps",
       y = "Frequency")

In [None]:
ggplot(daily_activity, aes(x = TotalSteps)) +
  geom_histogram(bins = 30, color = "white") +
  ggtitle("Distribution of daily step count") +
  ylab("Frequency") +
  xlab("Total steps")

We can see that the distribution is highly skewed with a long tail to the right. Most daily records show fewer than 10,000 steps, but a few exceed 20,000.

The average number of steps taken is around 7,600, with significant variation, as shown by the summary statistics and the histogram. Activity levels therefore vary widely among device users.
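
One quick numeric check for right skew is to compare the mean with the median: a long right tail pulls the mean above the median. A sketch on made-up step counts (not the Fitbit records):

```r
# Made-up daily step counts with one very active day in the right tail
steps <- c(2000, 4000, 5000, 6000, 7000, 8000, 9000, 25000)

mean_steps   <- mean(steps)    # pulled upward by the 25,000-step day
median_steps <- median(steps)  # robust to the extreme value

print(c(mean = mean_steps, median = median_steps))
print(mean_steps > median_steps)
```

On the notebook's data, `summary(daily_activity$TotalSteps)` reports both values side by side.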

However, to apply these tendencies to Bellabeat customers, it is necessary to consider the unique qualities of Bellabeat products. The Bellabeat line of products, for instance, is designed to improve the health and well-being of women by tracking their exercise, sleep, stress levels, and ovulation. This allows us to analyze the current tendencies in these fields. 

The Fitabase dataset also includes information about sleep patterns, which we examine in section 5.2. Before that, we can look at how step counts are spread across the days of the week.

In [None]:
# Label each record with its day of the week (ordered factor, Sun to Sat)
daily_activity_clean$weekday <- wday(daily_activity_clean$date, label = TRUE)

# Sum the steps recorded on each day of the week
daily_activity_summary <- daily_activity_clean %>%
  group_by(weekday) %>%
  summarize(total_steps = sum(TotalSteps))

# Create the bar chart; the factor levels provide the axis labels
ggplot(daily_activity_summary, aes(x = weekday, y = total_steps)) +
  geom_col(fill = "#1f77b4") +
  labs(x = "Day of the Week", y = "Total Steps",
       title = "Distribution of Daily Steps by Day of the Week") +
  theme_minimal()

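The plot shows that total step counts are higher on weekdays than on weekends, with the highest totals observed on Tuesdays and Wednesdays. Because these are totals, they reflect how many records fall on each weekday as well as how active people were.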
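
Totals per weekday depend on how many times each weekday occurs in the data. Assuming the daily records span April 12 to May 12, 2016, as the folder name `Fitabase Data 4.12.16-5.12.16` suggests (an assumption, since the actual date range is not printed in this notebook), the weekdays are not equally represented:

```r
# Assumed collection window, inferred from the folder name
window <- seq(as.Date("2016-04-12"), as.Date("2016-05-12"), by = "day")

# ISO weekday number: 1 = Monday, ..., 7 = Sunday (independent of locale)
iso_weekday <- format(window, "%u")
weekday_counts <- table(iso_weekday)
print(weekday_counts)
```

Under that assumption Tuesday, Wednesday, and Thursday each occur five times and the other days four times, so summarizing with `mean(TotalSteps)` instead of `sum(TotalSteps)` would give a fairer comparison between weekdays.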

But what insights can we gain from the sleep data?


### 5.2 Diving into the sleep duration

In [None]:
options(warn = -1)

# Bars: mean hours asleep per day across users; red line: smoothed trend over all records
ggplot(daily_sleep_clean, aes(x = date, y = TotalMinutesAsleep/60)) +
  stat_summary(fun = mean, geom = "col", fill = "#1f77b4") +
  geom_smooth(color = "red") +
  labs(title = "Daily Sleep Duration", x = "Date", y = "Hours") +
  scale_y_continuous(limits = c(0, max(daily_sleep_clean$TotalMinutesAsleep/60) + 2), 
                     breaks = seq(0, max(daily_sleep_clean$TotalMinutesAsleep/60) + 2, by = 2))

From the plot, we can see that there is also a lot of variability in sleep duration, with some days having very low or very high sleep durations. However, it would be helpful to explore other variables, such as the quality of sleep, and see if they provide additional insights.


In [None]:
ggplot(daily_sleep, aes(x = TotalMinutesAsleep)) +
  geom_histogram(binwidth = 30, fill = "#1f77b4", color = "white") +
  labs(title = "Distribution of Sleep Duration",
       x = "Total Minutes Asleep", y = "Count")

We will divide the data into three categories (good, medium, poor) based on sleep duration, to get more out of our data.


In [None]:
daily_sleep %>%
  mutate(quality = case_when(
    TotalMinutesAsleep >= 480 ~ "Good",
    TotalMinutesAsleep >= 360 ~ "Medium",
    TotalMinutesAsleep < 360 ~ "Poor"
  )) %>%
  count(quality) %>% 
  ggplot(aes(x = "", y = n, fill = quality)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(100 * n/sum(n), 1), "%")), position = position_stack(vjust = 0.5)) +
  labs(title = "Sleep Quality", fill = "Quality") +
  scale_fill_manual(values = c("#85e085", "#e6e600", "#ff8080")) +
  theme_void()

The visualization shows that a majority of the sleep records fall into the medium category, followed by the good category. Judged by duration alone, the overall sleep quality in the dataset is relatively good, since only a small percentage of records fall into the poor category.

However, it is worth noting that the amount of sleep needed can vary greatly depending on the individual, and quality of sleep is not solely determined by duration. Therefore, it is important not to make sweeping generalisations about sleep quality based solely on this analysis.
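
The thresholds used above (480 and 360 minutes, that is, 8 and 6 hours) can also be expressed with base R's `cut()`. The sketch below applies them to made-up durations, not the Fitbit data, and shows how values exactly on a boundary are classified:

```r
# Made-up sleep durations in minutes, including the two boundary values
minutes_asleep <- c(300, 360, 420, 480, 540)

# right = FALSE makes each interval closed on the left: [360, 480) is "Medium"
quality <- cut(
  minutes_asleep,
  breaks = c(-Inf, 360, 480, Inf),
  labels = c("Poor", "Medium", "Good"),
  right = FALSE
)

print(data.frame(minutes_asleep, quality))
print(table(quality))
```

Keeping the cut points in one `breaks` vector makes it easy to test how sensitive the category shares are to the chosen thresholds.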


We can investigate if there is a relationship between sleep and daily step count. We can start by plotting the distribution of daily step counts for days with different amounts of sleep.



### 5.3 Sleep and Steps correlations

In [None]:
step_sleep <- daily_activity_sleep %>%
  filter(TotalMinutesAsleep > 0) %>%
  group_by(TotalMinutesAsleep) %>%
  summarize(mean_steps = mean(TotalSteps, na.rm = TRUE))

ggplot(step_sleep, aes(x = TotalMinutesAsleep, y = mean_steps)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Total Sleep Minutes", y = "Mean Daily Steps",
       title = "Sleep vs. Daily Steps") +
  theme_minimal()

There is a negative correlation between `Total Sleep Minutes` and `Mean Daily Steps`, which suggests that people who get more sleep tend to have lower daily step counts, on average.

However, it's important to keep in mind that correlation does not imply causation, and there may be other factors at play that influence both sleep and physical activity levels. It could be that individuals with higher daily activity levels have a harder time getting to sleep at night, or that people who work sedentary jobs have more time and energy for exercise, but also have a harder time getting a good night's sleep.

Further analysis and investigation would be needed to better understand the underlying relationships between sleep and physical activity.
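
A scatterplot with a fitted line shows the direction of a relationship but not its strength. A correlation coefficient puts a single number on it. The sketch below uses made-up values (not the Fitbit data) and Spearman's rank correlation, which is one reasonable choice here and not necessarily the measure this analysis relied on:

```r
# Made-up records: longer sleep paired with fewer steps
sleep_minutes <- c(300, 360, 420, 480, 540)
daily_steps   <- c(12000, 10000, 9000, 7000, 5000)

# Spearman's rho measures how consistently one variable falls as the other rises
rho <- cor(sleep_minutes, daily_steps, method = "spearman")
print(rho)

# cor.test() adds a p-value for the null hypothesis of no association
result <- cor.test(sleep_minutes, daily_steps, method = "spearman")
print(result$p.value)
```

With the merged data the equivalent call would be `cor(daily_activity_sleep$TotalMinutesAsleep, daily_activity_sleep$TotalSteps, method = "spearman")`.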


#### Going further


This analysis investigates whether there is a relationship between the number of steps taken (classified into low, medium, and high categories) and the total minutes of sleep per day. The box plot allows for a comparison of the median, range, and distribution of `TotalMinutesAsleep` across the different `steps_category` levels, which can help identify any potential patterns or trends.

In [None]:
daily_activity_clean %>%
  # Pair each activity record with the same user's sleep record for the same day
  inner_join(daily_sleep_clean, by = c("Id", "date")) %>%
  mutate(steps_category = case_when(
    TotalSteps < 5000 ~ "Low",
    TotalSteps >= 5000 & TotalSteps < 10000 ~ "Medium",
    TotalSteps >= 10000 ~ "High"
  )) %>%
  ggplot(aes(steps_category, TotalMinutesAsleep)) +
  geom_boxplot(fill = c("#F8766D", "#7CAE00", "#00BFC4")) +
  labs(x = "Steps Category", y = "Total Minutes Asleep",
       title = "Steps Category vs. Total Minutes Asleep")

The "Low" category appears to have the highest median sleeping minutes on the plot. This is in line with the negative correlation seen above, although a difference in medians alone does not establish a clear trend between step count category and total minutes asleep.

One possible explanation for this observation could be that individuals in the "Low" step count category are more likely to lead a sedentary lifestyle, which can lead to feelings of fatigue and thus longer sleeping times. On the other hand, individuals in the "High" step count category may be more physically active, which could result in shorter sleeping times due to increased energy levels and a more active lifestyle. However, this is just one possible explanation, and further investigation would be necessary to draw any conclusions about the relationship between step count category and total minutes asleep.


### 5.4 Diving into the Body Mass Index


Now let's take a look at the distribution of BMI in the data:

In [None]:
ggplot(daily_activity_BMI, aes(x = BMI)) + 
  geom_histogram(binwidth = 1, fill = "#1f77b4", color = "white") +
  labs(title = "Distribution of BMI", x = "BMI", y = "Count")

This produces a histogram of the BMI values in the data. We can see that the distribution is roughly normal, with a mean around 27.



Now that we have a sense of the distribution of BMI and steps in the data, we can explore their relationship using a scatterplot.


### 5.5 BMI correlations

In [None]:
ggarrange(
  # Scatterplot of BMI against total daily active minutes
  ggplot(data = daily_activity_BMI, aes(x = TotalActivity, y = BMI)) +
    geom_point() +
    geom_smooth() +
    labs(x = "Total Daily Activity (minutes)", y = "BMI") +
    ggtitle("BMI vs. Daily Activity"),

  # Scatterplot of BMI against daily step count
  ggplot(data = daily_activity_BMI, aes(x = TotalSteps, y = BMI)) +
    geom_point() +
    geom_smooth() +
    labs(x = "Daily Steps", y = "BMI", 
         title = "BMI vs. Taken Steps")
)

We can see that there is a weak negative correlation between BMI and steps taken, as well as between BMI and daily activity, which is somewhat expected since higher BMI levels are generally associated with lower physical activity levels. However, there is still a lot of variability in the data, and the relationship is not very strong.
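
Since the BMI table is merged on `Id` only, each BMI reading is repeated for every activity day of the same user, so users with many records weigh more heavily in the scatterplots. One alternative, sketched here on made-up values with base R, is to compare per-user averages:

```r
# Made-up activity and weight records for two users
activity <- data.frame(Id = c(1, 1, 2, 2), TotalSteps = c(4000, 6000, 10000, 12000))
weight   <- data.frame(Id = c(1, 1, 2), BMI = c(28, 29, 22))

# Collapse each table to one row per user
mean_steps <- aggregate(TotalSteps ~ Id, data = activity, FUN = mean)
mean_bmi   <- aggregate(BMI ~ Id, data = weight, FUN = mean)

# One row per user: average steps next to average BMI
per_user <- merge(mean_steps, mean_bmi, by = "Id")
print(per_user)
```

With one point per user, the plot and any correlation computed from it describe differences between people rather than the number of days each person logged.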


### 5.6 Correlation between daily steps and calories burned

In [None]:
# Create scatterplot to show correlation between daily steps and calories burned
ggplot(data = daily_activity_clean, aes(x = TotalSteps, y = Calories)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Daily Step Count", y = "Calories Burned") +
  ggtitle("Correlation between Daily Steps and Calories Burned") +
  theme_bw()

This graph shows a positive correlation between step count and calories burned: the more steps a user takes, the more calories they burn.
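
The relationship can be summarized by the slope of a straight-line fit, which estimates the extra calories burned per additional step. A sketch with `lm()` on made-up numbers (constructed to lie exactly on a line, so the coefficients are not estimates from the Fitbit data):

```r
# Made-up data: a 1500-calorie baseline plus 0.05 calories per step
toy <- data.frame(TotalSteps = c(2000, 4000, 6000, 8000, 10000))
toy$Calories <- 1500 + 0.05 * toy$TotalSteps

# Fit Calories as a linear function of TotalSteps
fit <- lm(Calories ~ TotalSteps, data = toy)

# Intercept: calories at zero steps; slope: calories per additional step
print(coef(fit))
```

On the notebook's data the equivalent call would be `lm(Calories ~ TotalSteps, data = daily_activity_clean)`.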


## 6. Act (Conclusion)


Bellabeat aims to give women the tools they need to find their identities through data. 

To respond to our business task and support Bellabeat in its mission, we should extend this analysis with Bellabeat's own tracking data. The datasets we used had a small sample size and lacked information on users' demographics, so our findings are potentially biased. It also remains essential to keep an eye out for emerging trends so that we can tailor our marketing strategy to the needs of today's young and middle-aged women.

Nevertheless, this research identified a few patterns that may be helpful for our digital marketing strategy and the development of the Bellabeat app:

* Bellabeat should consider expanding its product line to include sleep monitoring features, as the data pointed to a potential link between physical activity and sleep quality. This could help Bellabeat appeal to a wider range of consumers who are interested in improving their overall health and wellness.
* Bellabeat should consider developing a personalized sleep tracking system. The analysis showed that sleep patterns varied significantly among users, which suggests that a one-size-fits-all approach may not be the most effective. A system that lets users track their sleep patterns and receive personalized recommendations based on their data could be a valuable feature.
* Bellabeat should explore ways to make its products more engaging and interactive, as this could help increase user retention and encourage more consistent use of the products. This could include adding social features to the app, such as the ability to compete with friends or share achievements.
* Bellabeat could partner with fitness and wellness influencers to help promote its products and reach a wider audience. Influencer marketing can be a very effective way to increase brand awareness and drive sales.
