# **Case Study: How Can a Wellness Technology Company Play It Smart?**


   ![](https://theme.zdassets.com/theme_assets/1034702/b755e49170d01ece3270371bcad3155ff24d8da5.png)


## **Table of contents**
* [Summary](#C0)
* [Phase 1: Ask](#C1)
    * [1.1 Business task](#SS1.1)
    * [1.2 Stakeholders](#SS1.2)
* [Phase 2: Prepare](#C2)
    * [2.1 Data source](#SS2.1)
    * [2.2 Data organization](#SS2.2)
    * [2.3 Data integrity and credibility](#SS2.3)
    * [2.4 Data privacy and accessibility](#SS2.4)
    * [2.5 Approaching the problem](#SS2.5)
* [Phase 3: Process](#C3)
    * [3.1 Tools for analysis](#SS3.1)
    * [3.2 Data cleaning with R](#SS3.2)
* [Phase 4: Analyze](#C4)
    * [4.1 User active dates analysis](#SS4.1)
    * [4.2 Steps and distance analysis](#SS4.2)
    * [4.3 Active minutes analysis](#SS4.3)
    * [4.4 Sleep data analysis](#SS4.4)
* [Phase 5: Share](#C5)
    * [5.1 Daily usage visualization](#SS5.1)
    * [5.2 Steps per user and activeness visualization](#SS5.2)
    * [5.3 Manual vs automatic distance entries](#SS5.3)
    * [5.4 Active minutes and calories correlations](#SS5.4)
    * [5.5 Sleep data visualization](#SS5.5)
    * [5.6 Steps and sleep visualization](#SS5.6)
* [Phase 6: Act (Conclusion)](#C6)


## **Summary** <a class="anchor"  id="C0"></a>
Bellabeat is a tech company that manufactures health tracking smart products for women. They offer a variety of products that allow users to track their activity, sleep, stress and other wellness habits. 
This study aims to analyze user habits with fitness tracking products in order to develop a marketing strategy to allow Bellabeat to grow further. Activity and sleep habits are the main focus of this study, which provides data-driven insights based on a small sample of 24 users.


## **Phase 1: Ask** <a class="anchor"  id="C1"></a>
### 1.1 Business task <a class="anchor"  id="SS1.1"></a>
Investigate usage trends in data from a competing fitness tracker in order to gain insights about users’ habits and allow the stakeholders to make data-driven business decisions.
### 1.2 Stakeholders <a class="anchor"  id="SS1.2"></a>
**Urška Sršen:** Bellabeat’s cofounder and Chief Creative Officer. <br />
**Sando Mur:** Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team. <br />
**Bellabeat marketing analytics team:** A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.


## **Phase 2: Prepare** <a class="anchor"  id="C2"></a>
### 2.1 Data source <a class="anchor"  id="SS2.1"></a>
The data used in this study is the FitBit Fitness Tracker Data, available on Kaggle.
Link to dataset source: [FitBit Tracker Data](https://www.kaggle.com/arashnic/fitbit)
### 2.2 Data organization <a class="anchor"  id="SS2.2"></a>
The dataset includes 18 .csv files containing health data tracked from different FitBit users. Each user was given a unique ID number and consented to share their personal data, including minute-level output for physical activity, heart rate, and sleep monitoring. A pivot table was generated to gain initial insight into the data, which is summarized in the following table: <br />

| .csv file name | Description | 
| -------------- | ------------|
| dailyActivity_merged | Daily activity records of 33 users with varying number of days per user (4-31 days). The table is in long format and combines physical activities from dailyCalories_merged, dailyIntensities_merged and dailySteps_merged. |
| dailyCalories_merged | Daily calories records of 33 users with varying number of days per user (4-31 days). The data is stored in long format. |
| dailyIntensities_merged | Daily record of intensities of 33 users with varying number of days per user (4-31 days). The intensities are represented by activity minutes and activity distance. The activity rate is divided into sedentary, lightly active, fairly active and very active. The data is stored in long format. |
| dailySteps_merged | Daily steps records of 33 users with varying number of days per user (4-31 days). The data is stored in long format. |
| heartrate_seconds_merged | Heart rate per second of 7 users stored in long format. |
| hourlyCalories_merged | Hourly calories records of 33 users. The data is stored in long format. |
| hourlyIntensities_merged | Hourly record of intensities of 33 users. The intensities are represented by total intensity and average intensity. The data is stored in long format. |
| hourlySteps_merged | Hourly steps records of 33 users. The data is stored in long format. |
| minuteCaloriesNarrow_merged | Minute calories records of 33 users. The data is stored in long format. |
| minuteCaloriesWide_merged | Minute calories records of 33 users. The data is stored in wide format. |
| minuteIntensitiesNarrow_merged | Minute intensity records of 33 users. The data is stored in long format. |
| minuteIntensitiesWide_merged | Minute intensity records of 33 users. The data is stored in wide format. |
| minuteMETsNarrow_merged | Minute METs records of 27 users. The data is stored in long format. |
| minuteSleep_merged | Minute sleep records of 24 users stored in long format. The meaning of the value column is unspecified. |
| minuteStepsNarrow_merged | Minute step records of 33 users. The data is stored in long format. |
| minuteStepsWide_merged | Minute step records of 33 users. The data is stored in wide format. |
| sleepDay_merged | Daily sleep records of 27 users including count of time asleep per day, minutes asleep and time in bed. The table is stored in long format. |
| weightLogInfo_merged | Weight logs of 8 users including the weight in kg and in pounds. It also includes info about fat and BMI. The table is stored in long format. |

### 2.3 Data integrity and credibility <a class="anchor"  id="SS2.3"></a>
No demographic data such as age, gender or location is provided. Hence, we cannot verify that the data is unbiased. Additionally, we cannot verify that the sample is representative of the population. Another limitation is that the data covers only 2 months and is approximately 7 years old.
### 2.4 Data privacy and accessibility <a class="anchor"  id="SS2.4"></a>
The data is made publicly available on Kaggle under a Creative Commons license (CC0: Public Domain). Users' personal data is protected by assigning an ID number to each user.
### 2.5 Approaching the problem <a class="anchor"  id="SS2.5"></a>
For consistency, we will analyze daily activity and daily sleep. We will exclude the heart rate data and weight log info from the analysis because they cover too few users.


## **Phase 3: Process** <a class="anchor"  id="C3"></a>
### 3.1 Tools for analysis <a class="anchor"  id="SS3.1"></a>
Since we have 18 csv files, R is a suitable tool for processing the data: it can read multiple csv files in one script and easily merge data where needed.

### 3.2 Data cleaning with R <a class="anchor"  id="SS3.2"></a>
#### Step 1: Load the required R packages
Load the standard data analytics packages. The `library()` calls assume the packages are already installed.

In [None]:
library(tidyverse) #standard packages for data analysis
library(skimr) #provides summary statistics
library(janitor) #includes tools for data cleaning
library(lubridate) #for date-time management

#### Step 2: Import and preview the csv files
As mentioned in phase 2, we will focus on the following datasets: <br />
- dailyActivity_merged.csv
- sleepDay_merged.csv <br />

In this step, we import the csv files, preview the data, review the data organization and review data types.

In [None]:
# Import and preview daily activity data
daily_activity <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
head(daily_activity)
str(daily_activity)

In [None]:
# Import and preview daily sleep data
daily_sleep <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
head(daily_sleep)
str(daily_sleep)

#### Step 3: Verify the number of users
Check the number of user ids recorded in each dataset. Then check for duplicated information. 

In [None]:
paste("Count of users in the daily activity dataset: ", n_distinct(daily_activity$Id))
paste("Count of users in the daily sleep dataset: ", n_distinct(daily_sleep$Id))

In [None]:
# Check duplicated entries
sum(duplicated(daily_activity))
sum(duplicated(daily_sleep))

#### Step 4: Cleaning the dataset
Generate new cleaned datasets using R pipelines. The cleaning process includes:
- Removing duplicates.
- Removing null values.
- Making column names consistent.
- Making the date format consistent. <br />

In the next code blocks, the cleaning is done separately for the daily activity data and the daily sleep data.

In [None]:
# Cleaning the daily activity data
daily_activity_cleaned <- daily_activity %>% 
  distinct() %>% #keep unique rows
  drop_na() %>% #remove rows with nulls
  clean_names() %>% #make column names consistent
  rename(date = activity_date) %>% #rename activity_date so the naming matches the daily sleep dataset
  mutate(date = as_date(date, format = "%m/%d/%Y")) #convert the date column from chr to date


Next, preview the new data to verify the cleaning result.

In [None]:
head(daily_activity_cleaned)

In [None]:
# Cleaning the daily sleep data
daily_sleep_cleaned <- daily_sleep %>% 
  distinct() %>% #keep unique rows
  drop_na() %>% #remove rows with nulls
  clean_names() %>% #make column names consistent
  rename(date = sleep_day) %>% #rename sleep_day so the naming matches the daily activity dataset
  mutate(date = as_date(date, format = "%m/%d/%Y")) #convert the date column from chr to date

Preview the new data to verify the cleaning result.

In [None]:
head(daily_sleep_cleaned)
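
Both cleaning pipelines parse dates with the same `"%m/%d/%Y"` format. The sketch below uses base R's `as.Date()` on two made-up strings to show why one format can serve both columns. It assumes the raw `sleep_day` values carry a trailing time component (the raw values are not shown in this notebook), and the sample strings are illustrative only.

```r
# Illustrative strings (assumed layouts, not copied from the dataset).
activity_date <- "4/12/2016"
sleep_day <- "4/12/2016 12:00:00 AM"

# The parser reads only as much of each string as the format needs,
# so any trailing characters (here, the time) are ignored.
parsed_activity <- as.Date(activity_date, format = "%m/%d/%Y")
parsed_sleep <- as.Date(sleep_day, format = "%m/%d/%Y")

# Both values now refer to the same calendar day and can serve as a join key.
print(parsed_activity)
print(identical(parsed_activity, parsed_sleep))
```

Having both columns hold plain dates is what allows the two tables to be matched on `id` and `date` in the next step.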

Lastly, check for duplicates in the cleaned datasets.

In [None]:
sum(duplicated(daily_activity_cleaned))
sum(duplicated(daily_sleep_cleaned))

#### Step 5: Merge the cleaned datasets
The generated table is the main dataset used for the analysis phase. The dataset is named activity_sleep_data.

In [None]:
activity_sleep_data <- merge(daily_activity_cleaned, daily_sleep_cleaned, by= c("id", "date")) # Merge by id and date

Preview and verify data in the new dataset.

In [None]:
head(activity_sleep_data)
str(activity_sleep_data)

In [None]:
# Check for the number of users in the merged dataset
n_distinct(activity_sleep_data$id)
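
The user count drops after merging because `merge()` performs an inner join by default: only `id` and `date` pairs present in both tables are kept. The following minimal sketch shows this on two toy tables (all values are made up for illustration) and contrasts it with a left join via `all.x = TRUE`.

```r
# Toy tables with made-up values: three users have activity records,
# but only two of them have sleep records.
activity <- data.frame(
  id = c(1, 1, 2, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  total_steps = c(9000, 7500, 4000, 12000)
)
sleep <- data.frame(
  id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  total_minutes_asleep = c(400, 380)
)

# Default merge: inner join, keeps only matching id/date pairs.
inner <- merge(activity, sleep, by = c("id", "date"))

# all.x = TRUE: left join, keeps every activity row and fills gaps with NA.
left <- merge(activity, sleep, by = c("id", "date"), all.x = TRUE)

print(nrow(inner))
print(length(unique(inner$id)))
print(nrow(left))
```

An inner join suits this study because every remaining row has both activity and sleep values, at the cost of dropping users who never logged sleep.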

## **Phase 4: Analyze** <a class="anchor"  id="C4"></a>
### 4.1 User active dates analysis <a class="anchor"  id="SS4.1"></a>
In this step we will check the first and last date records to determine the span of days over which users used their tracking devices. This step will allow us to identify usage trends among different users. <br />
Considering a duration between April 12th and May 12th, only two users used their device for less than two days. However, the analysis shows that the majority of users used their devices for more than two weeks.

In [None]:
activity_sleep_data %>%
group_by(id) %>%
summarize(start_date = min(date), end_date = max(date),days = max(date)-min(date))
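
Note that `max(date) - min(date)` measures the span between the first and last record, which is not necessarily the number of days with a record. The sketch below contrasts the two measures in base R on made-up records; the data frame and its values are hypothetical.

```r
# Made-up records: user 1 logs three consecutive days,
# user 2 logs only the first and last day of the period.
records <- data.frame(
  id = c(1, 1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                   "2016-04-12", "2016-05-12"))
)

dates_by_user <- split(records$date, records$id)

# Span between first and last record, in days.
span_days <- sapply(dates_by_user, function(d) as.numeric(max(d) - min(d)))

# Number of distinct days that actually have a record.
recorded_days <- sapply(dates_by_user, function(d) length(unique(d)))

print(data.frame(id = names(dates_by_user), span_days, recorded_days))
```

In the dplyr pipeline above, the second measure would correspond to adding something like `n_distinct(date)` inside `summarize()`.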

### 4.2 Steps and distance analysis <a class="anchor"  id="SS4.2"></a>
In this part, we calculate the average steps and the average distance per user, alongside the span of their active days. The output shows a close positive relationship between steps and distance, where more steps correspond to a longer distance. Thus, steps and distance can be used interchangeably for the marketing strategy.

In [None]:
activity_sleep_data %>%
group_by(id) %>%
summarize(days = max(date)-min(date), avg_steps = mean(total_steps), avg_distance = mean(tracker_distance))
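
One way to put a number on the relationship between steps and distance is the Pearson correlation coefficient. The sketch below computes it with base R's `cor()` on a small made-up table; the values are illustrative and are not taken from the dataset.

```r
# Made-up daily totals, for illustration only.
toy <- data.frame(
  total_steps = c(2000, 5000, 8000, 11000, 14000),
  tracker_distance = c(1.4, 3.6, 5.5, 7.9, 9.8)
)

# Pearson correlation: values close to 1 indicate a strong positive
# linear relationship between the two columns.
r <- cor(toy$total_steps, toy$tracker_distance)
print(round(r, 3))
```

On the real data the same call could be applied to `activity_sleep_data$total_steps` and `activity_sleep_data$tracker_distance`. A high coefficient supports treating the two measures as near-equivalent, but it describes association only.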

To further investigate the trends in the number of steps, we will classify user activity into the following:
- insufficient days: if the steps were recorded over a span of less than 5 days.
- lightly active: if the average steps count is less than 5000.
- moderately active: if the average steps count is between 5000 and 9999.
- very active: if the average steps count is 10000 or more. <br />

We will add a new column to the dataset with the user classification.


In [None]:
classified_steps <- activity_sleep_data %>%
group_by(id) %>%
summarize(days = max(date)-min(date), avg_steps = mean(total_steps), 
          avg_distance = mean(tracker_distance), 
          avg_calories = mean(calories)) %>%
mutate(user_activity = case_when(
    days < 5 ~ "insufficient days",
    days >= 5 & avg_steps < 5000 ~ "lightly active",
    days >= 5 & avg_steps >= 5000 & avg_steps < 10000 ~ "moderately active",
    days >= 5 & avg_steps >= 10000 ~ "very active"
)) %>%
relocate(avg_calories, .after = user_activity) %>%
arrange(desc(avg_steps))

head(classified_steps)

Now that we have users categorized into 4 classes, we can better understand the trend by viewing how many users belong to each group.

In [None]:
classified_steps_percent <- classified_steps %>%
group_by(user_activity) %>%
summarize(n_of_users = n(), ratio = (n_of_users)/nrow(classified_steps)) %>% # n() gives the group size
mutate(percentage = scales::percent(ratio)) #gives the calculation as percentage

head(classified_steps_percent)

In the next step, we check how many times users entered their active distance manually. The low share of manual entries suggests that users rarely commit to recording data manually. Hence, tracking is more dependable when it is done automatically.

In [None]:
logged_dist <- sum(activity_sleep_data$logged_activities_distance != 0) # daily records with a manually entered distance
obs <- nrow(activity_sleep_data) # total number of daily records in the activity_sleep_data dataset
paste("Users added active distance manually", logged_dist, "times in", obs, "daily records")
paste0("Only ", format(round((logged_dist/obs * 100), 2), nsmall = 2), "% of daily records include a manually entered distance")

### 4.3 Active minutes analysis <a class="anchor"  id="SS4.3"></a>
The original dataset has already categorized active minutes into very active, fairly active, lightly active and sedentary. In the next code block we calculate the average minutes of each category per user, then we try to find correlations between active minutes and the average calories burnt.
There is no clear correlation when we observe the data in the table. Therefore, it is better to search for correlations as we visualize the data in the next phase.

In [None]:
active_minutes <- activity_sleep_data %>%
group_by(id) %>%
summarize(avg_very_active_mins = mean(very_active_minutes),
          avg_fairly_active_mins = mean(fairly_active_minutes),
          avg_lightly_active_mins = mean(lightly_active_minutes), 
          avg_sedentary_mins = mean(sedentary_minutes), 
          avg_calories = mean(calories))

head(active_minutes)

### 4.4 Sleep data analysis <a class="anchor"  id="SS4.4"></a>
In this step we start by calculating the average sleep minutes per user. Then, we convert the average minutes asleep into hours. Lastly, we subtract the minutes asleep from the total time in bed to estimate how long each user lies in bed awake, which we treat as the time to fall asleep.

In [None]:
activity_sleep_data %>%
group_by(id) %>%
summarize(days = max(date)-min(date), 
          avg_asleep_mins = mean(total_minutes_asleep), 
          avg_asleep_hours = mean(total_minutes_asleep)/60,
          avg_time_to_fall_asleep = mean(total_time_in_bed - total_minutes_asleep))

Based on the output from the previous step, we will categorize user sleeping habits into the following:
- healthy sleeper: if the user sleeps from 6 to 8.3 hours daily.
- long sleeper: if the user sleeps more than 8.3 hours daily.
- short sleeper: if the user sleeps less than 6 hours daily.
- unhealthy: if the user remains in bed without sleep for longer than 45 minutes on average, regardless of sleep duration.

In [None]:
classified_sleep <- activity_sleep_data %>%
group_by(id) %>%
summarize(days = max(date)-min(date), 
          avg_asleep_mins = mean(total_minutes_asleep), 
          avg_asleep_hours = mean(total_minutes_asleep)/60,
          avg_time_to_fall_asleep = mean(total_time_in_bed - total_minutes_asleep))%>%
mutate(sleep_classification = case_when(
    avg_asleep_hours >= 6 & avg_asleep_hours <= 8.3 & avg_time_to_fall_asleep <=45 ~ "healthy sleeper" ,
    avg_asleep_hours > 8.3 & avg_time_to_fall_asleep <=45 ~ "long sleeper" ,
    avg_asleep_hours < 6 & avg_time_to_fall_asleep <=45 ~ "short sleeper" ,
    avg_time_to_fall_asleep > 45 ~ "unhealthy" 
      ))

head(classified_sleep)
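
The classification code compares `avg_asleep_hours` against `8.3`, which is a decimal number of hours and not 8 hours 30 minutes. The short sketch below makes the difference explicit; the variable names and the 8.4-hour example value are introduced here for illustration only.

```r
# The threshold used in the case_when() call, as decimal hours.
threshold_hours <- 8.3
threshold_minutes <- threshold_hours * 60   # decimal hours -> minutes

# 8 hours 30 minutes expressed as decimal hours, for comparison.
half_past_eight_hours <- 8 + 30 / 60

# A hypothetical user averaging 8.4 hours of sleep.
avg_asleep_hours <- 8.4
exceeds_code_threshold <- avg_asleep_hours > threshold_hours
exceeds_half_past_eight <- avg_asleep_hours > half_past_eight_hours

print(threshold_minutes)
print(half_past_eight_hours)
print(c(exceeds_code_threshold, exceeds_half_past_eight))
```

Such a user counts as a long sleeper under the 8.3 threshold but would not under an 8.5 threshold, so the intended cutoff is worth stating explicitly.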

Next, we check the number of users that fall under each sleep category.

In [None]:
classified_sleep_percent <- classified_sleep %>%
group_by(sleep_classification) %>%
summarize(n_of_users = n(), ratio = (n_of_users)/nrow(classified_sleep)) %>% # n() gives the group size
mutate(percentage = scales::percent(ratio)) #gives the calculation as percentage

head(classified_sleep_percent)

## **Phase 5: Share** <a class="anchor"  id="C5"></a>
### 5.1 Daily usage visualization <a class="anchor"  id="SS5.1"></a>
The following bar chart illustrates the number of days on which users were active within the 30-day period. 7 out of 24 users had their data tracked for the full duration. Moreover, the majority of users (22 out of 24) provided data for more than one week of tracking.

In [None]:
ggplot(classified_steps) + geom_bar(mapping = aes(x = days)) + 
labs(title = "Active days vs. Number of Users", subtitle = "Number of days when users used their tracker", x = "Days", y = "Number of users")


### 5.2 Steps per user and activeness visualization <a class="anchor"  id="SS5.2"></a>
The bar chart provides insight into the average number of steps taken by each user, while the bar colors show their activity level group. Meanwhile, the pie chart illustrates the distribution of users in each group.

In [None]:

ggplot(classified_steps) + geom_col(mapping = aes(factor(id), avg_steps, fill = user_activity)) + 
theme(axis.text.x = element_text(angle = 90,vjust = 1, hjust = 1), plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
labs(title = "Steps vs. Users", subtitle = "Number of steps taken by each user", x = "User id", y = "Average steps")

ggplot(classified_steps_percent, aes(x = "", y = ratio, fill = user_activity)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
theme_minimal()+
theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
geom_text(aes(label = percentage),
        position = position_stack(vjust = 0.5))+
labs(title="User Activity Type")


Next we will visualize steps vs calories in a scatter plot to search for trends. The plot shows a positive correlation between steps and calories burnt, where more steps may increase the number of burnt calories.   

In [None]:
ggplot(activity_sleep_data) + geom_point(mapping = aes(calories, total_steps)) +
geom_smooth(mapping=aes(x= calories, y= total_steps)) +
theme(axis.text.x = element_text(angle = 90,vjust = 1, hjust = 1), plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
labs(title = "Steps and Calories Correlation", x = "Calories", y = "Steps")


### 5.3 Manual vs automatic distance entries <a class="anchor"  id="SS5.3"></a>
The pie chart shows that of 410 distance data entries, only 4.63% were entered manually. This could indicate that manual tracking is inconvenient and that users do not commit to it, while automatic tracking is more reliable and more appealing to users.

In [None]:
slices <- c(4.63, 95.37)
lbls <- c("4.63%", "95.37%")
pie(slices, labels = lbls, main="Times when users recorded the distance manually", radius = 1.8)
legend("bottom", legend = c("Manual records", "Automatic records"),
       fill =  c("white", "lightblue"), ncol = 2,cex=1.5, text.width = 2)
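
The slice values above are typed in by hand. As a minimal sketch, they could instead be derived from the counts so the chart stays in sync with the data. The count of 19 manual records used here is an assumption inferred from the reported 4.63% of 410 entries; in the notebook it would come from `logged_dist` and `obs`.

```r
# Assumed counts: 19 is inferred from 4.63% of 410 and is not stated in the text.
logged_dist <- 19
obs <- 410

# Percentage of manual records, rounded to two decimals, and its complement.
manual_pct <- round(logged_dist / obs * 100, 2)
slices <- c(manual_pct, 100 - manual_pct)

# Labels with two decimals and a percent sign.
lbls <- sprintf("%.2f%%", slices)

print(slices)
print(lbls)
```

The resulting `slices` and `lbls` can be passed to the same `pie()` call as above.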

### 5.4 Active minutes and calories correlations <a class="anchor"  id="SS5.4"></a>
The first scatter plot visualizes very active minutes and calories burnt. The plot shows a strong correlation with a significant increase of calories burnt as the very active minutes increase.

In [None]:
ggplot(activity_sleep_data) + geom_point(mapping = aes(calories, very_active_minutes)) +
geom_smooth(mapping=aes(x= calories, y= very_active_minutes)) +
theme(axis.text.x = element_text(angle = 90,vjust = 1, hjust = 1), plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
labs(title = "Very Active Minutes and Calories Correlation",
     subtitle = "Calories burn for users with very active minutes records", 
     x = "Calories", y = "Active minutes")

In the second scatter plot we will visualize fairly active minutes and calories burnt. The trend shows a positive correlation, with only a small increase in calories burnt as fairly active minutes increase.

In [None]:
ggplot(activity_sleep_data) + geom_point(mapping = aes(calories, fairly_active_minutes)) +
geom_smooth(mapping=aes(x= calories, y= fairly_active_minutes)) +
theme(axis.text.x = element_text(angle = 90,vjust = 1, hjust = 1), plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
labs(title = "Fairly Active Minutes and Calories Correlation",
     subtitle = "Calories burn for users with fairly active minutes records", 
     x = "Calories", y = "Active minutes")

Lastly, in the following two plots representing lightly active minutes and sedentary minutes against calories burnt, there is no clear correlation. We might require more data to obtain more accurate trends.

In [None]:
ggplot(activity_sleep_data) + geom_point(mapping = aes(calories, lightly_active_minutes)) +
geom_smooth(mapping=aes(x= calories, y= lightly_active_minutes)) +
theme(axis.text.x = element_text(angle = 90,vjust = 1, hjust = 1), plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
labs(title = "Lightly Active Minutes and Calories Correlation",
     subtitle = "Calories burn for users with lightly active minutes records", 
     x = "Calories", y = "Active minutes")

In [None]:
ggplot(activity_sleep_data) + geom_point(mapping = aes(calories, sedentary_minutes)) +
geom_smooth(mapping=aes(x= calories, y= sedentary_minutes)) +
theme(axis.text.x = element_text(angle = 90,vjust = 1, hjust = 1), plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
labs(title = "Sedentary Minutes and Calories Correlation",
     subtitle = "Calories burn for users with sedentary minutes records", 
     x = "Calories", y = "Sedentary minutes")

### 5.5 Sleep data visualization <a class="anchor"  id="SS5.5"></a>
The following bar chart represents the average sleep time for each user, where the colors indicate the sleep type category.

In [None]:
ggplot(classified_sleep) + geom_col(mapping = aes(x = factor(id), y = avg_asleep_hours, fill = sleep_classification)) + 
geom_hline(yintercept = 8) +
theme(axis.text.x = element_text(angle = 90,vjust = 1, hjust = 1), plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
labs(title = "Sleep Time Per User", subtitle = "Number of average hours slept by each user", x = "User", y = "Average hours") +
annotate("text",label = "Ideal sleep time of 8 hours", x = 20 , y = 8.5)


Meanwhile, the pie chart illustrates how many users belong to each sleep type category, where approximately 42% of users exhibit bad sleeping habits that they may not be aware of without sleep tracking.

In [None]:
ggplot(classified_sleep_percent, aes(x = "", y = ratio, fill = sleep_classification)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
theme_minimal()+
theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
geom_text(aes(label = percentage),
        position = position_stack(vjust = 0.5))+
labs(title="Sleep Type Distribution")

### 5.6 Steps and sleep visualization <a class="anchor"  id="SS5.6"></a>
The following scatter plot is made to check for correlation between steps and sleep minutes. After observing the graph and the data we did not find any reliable correlation.  

In [None]:
ggplot(activity_sleep_data) + geom_point(mapping = aes(total_minutes_asleep, total_steps)) +
geom_smooth(mapping=aes(x= total_minutes_asleep, y= total_steps)) +
theme(axis.text.x = element_text(angle = 90,vjust = 1, hjust = 1), plot.title = element_text(hjust = 0.5, size=14, face = "bold"))+
labs(title = "Steps vs. Sleep", x = "Sleep minutes", y = "Steps")

## **Phase 6: Act (Conclusion)** <a class="anchor"  id="C6"></a>
This case study provides a detailed analysis of FitBit tracker data in order to allow Bellabeat stakeholders to develop a marketing plan for their fitness app. The study analyzed activity and sleep data of 24 users within 30 days. Although the data is limited, we were able to gain some useful insights. However, it would be of great benefit to increase the volume of data and expand the time duration to reach the best conclusion. <br />
With that being said, our study showed that many users did not use the tracker on a daily basis. Furthermore, we were able to detect a trend of high calorie burns with higher activity. On the other hand, about 20% of users are not active enough. Moreover, by tracking sleep we detected more than 40% of users need to change their sleep habits.     

#### **Recommendations** 
- Organize a campaign to educate people about their unhealthy habits and show how the Bellabeat fitness app can improve their wellbeing.
- Notify people through the app if they exhibit low activity rates or bad sleeping habits.
- Develop a reward system where active users can gain points and use them to get discounts at local shops.