![image.png](attachment:04a36e61-8a6e-4812-8674-4411c61325f0.png)

# Introduction
#### Welcome to the Bellabeat data analysis case study!

# Scenario

#### Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

### **Products**

**Bellabeat app:** The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

**Leaf:** Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

**Time:** This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

**Spring:** This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

**Bellabeat membership:** Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.


# Ask

1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

## Data Preparation


In [103]:
install.packages("hms")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [104]:
library(plyr)
library(tidyverse)
library(skimr)
library(ggplot2)
library(janitor)
library(lubridate)
library(gridExtra)
library(rmarkdown)
library(dplyr)
library(hms)

## Data Sources

Import CSV data from all the datasets from Kaggle.


In [105]:
fitbit_activity = read_csv("/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
fitbit_heartbeat = read_csv("/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
fitbit_sleep = read_csv("/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
fitbit_body = read_csv("/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

Rows: 940 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): ActivityDate
dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 2483658 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Time
dbl (2): Id, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 413 Columns: 5
── Column 

## Data Checkup

The checkup focused primarily on the daily activity, sleep, and heart-rate data.

# Data Processing

## Summary:

* Inspect each dataframe for missing values.
* Standardize the column names.
* Remove erroneous outliers.
* Convert data values into consistent metrics.
* Merge & split columns as required.
* Fix typos.
* Sort records.
* Remove machine-dependent or irrelevant columns.
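
The first few checks in the list above can be sketched in base R. The data frame `toy` and its values are made up for illustration; the notebook itself relies on `skim_without_charts()`, `clean_names()`, and `distinct()` for the same purpose.

```r
# Hypothetical toy data frame standing in for one of the Fitbit tables
toy <- data.frame(
  Id = c(1, 1, 2, 2),
  TotalSteps = c(100, 100, NA, 250)
)

# Count missing values per column
missing_per_column <- colSums(is.na(toy))

# Count fully duplicated rows
n_duplicates <- sum(duplicated(toy))

# Drop duplicated rows and standardize column names to lower case
toy_clean <- unique(toy)
names(toy_clean) <- tolower(names(toy_clean))

print(missing_per_column)
print(n_duplicates)
print(names(toy_clean))
```

Base R is used here only so the sketch runs without extra packages; the tidyverse pipeline used in the cells below does the same work in fewer steps.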

# 1. FitBit Fitness Tracker Data

* **ACTIVITY**

In [106]:
head(fitbit_activity)

skim_without_charts(fitbit_activity)

Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1503960366,4/12/2016,13162,8.5,8.5,0,1.88,0.55,6.06,0,25,13,328,728,1985
1503960366,4/13/2016,10735,6.97,6.97,0,1.57,0.69,4.71,0,21,19,217,776,1797
1503960366,4/14/2016,10460,6.74,6.74,0,2.44,0.4,3.91,0,30,11,181,1218,1776
1503960366,4/15/2016,9762,6.28,6.28,0,2.14,1.26,2.83,0,29,34,209,726,1745
1503960366,4/16/2016,12669,8.16,8.16,0,2.71,0.41,5.04,0,36,10,221,773,1863
1503960366,4/17/2016,9705,6.48,6.48,0,3.19,0.78,2.51,0,38,20,164,539,1728


Unnamed: 0_level_0,skim_type,skim_variable,n_missing,complete_rate,character.min,character.max,character.empty,character.n_unique,character.whitespace,numeric.mean,numeric.sd,numeric.p0,numeric.p25,numeric.p50,numeric.p75,numeric.p100
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,character,ActivityDate,0,1,8.0,9.0,0.0,31.0,0.0,,,,,,,
2,numeric,Id,0,1,,,,,,4855407000.0,2424805000.0,1503960366.0,2320127000.0,4445115000.0,6962181000.0,8877689000.0
3,numeric,TotalSteps,0,1,,,,,,7637.911,5087.151,0.0,3789.75,7405.5,10727.0,36019.0
4,numeric,TotalDistance,0,1,,,,,,5.489702,3.924606,0.0,2.62,5.245,7.7125,28.03
5,numeric,TrackerDistance,0,1,,,,,,5.475351,3.907276,0.0,2.62,5.245,7.71,28.03
6,numeric,LoggedActivitiesDistance,0,1,,,,,,0.1081709,0.6198965,0.0,0.0,0.0,0.0,4.942142
7,numeric,VeryActiveDistance,0,1,,,,,,1.502681,2.658941,0.0,0.0,0.21,2.0525,21.92
8,numeric,ModeratelyActiveDistance,0,1,,,,,,0.5675426,0.8835803,0.0,0.0,0.24,0.8,6.48
9,numeric,LightActiveDistance,0,1,,,,,,3.340819,2.040655,0.0,1.945,3.365,4.7825,10.71
10,numeric,SedentaryActiveDistance,0,1,,,,,,0.001606383,0.007346176,0.0,0.0,0.0,0.0,0.11


── Data Summary ────────────────────────
                           Values         
Name                       fitbit_activity
Number of rows             940            
Number of columns          15             
_______________________                   
Column type frequency:                    
  character                1              
  numeric                  14             
________________________                  
Group variables            None           

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 ActivityDate          0             1   8   9     0       31          0

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable            n_missing complete_rate    mean      sd         p0
 1 Id                               0             1 4.86e+9 2.42e+9 150396

In [107]:
fitbit_activity_copy <- fitbit_activity %>% 
clean_names() %>% 
distinct() %>% 
rename(date = activity_date, steps = total_steps, distance = total_distance) %>% 
mutate(date = as.Date(date, "%m/%d/%Y"), week_day = weekdays(date)) %>%  
arrange(id,date)

In [108]:
head(fitbit_activity_copy)

id,date,steps,distance,tracker_distance,logged_activities_distance,very_active_distance,moderately_active_distance,light_active_distance,sedentary_active_distance,very_active_minutes,fairly_active_minutes,lightly_active_minutes,sedentary_minutes,calories,week_day
<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1503960366,2016-04-12,13162,8.5,8.5,0,1.88,0.55,6.06,0,25,13,328,728,1985,Tuesday
1503960366,2016-04-13,10735,6.97,6.97,0,1.57,0.69,4.71,0,21,19,217,776,1797,Wednesday
1503960366,2016-04-14,10460,6.74,6.74,0,2.44,0.4,3.91,0,30,11,181,1218,1776,Thursday
1503960366,2016-04-15,9762,6.28,6.28,0,2.14,1.26,2.83,0,29,34,209,726,1745,Friday
1503960366,2016-04-16,12669,8.16,8.16,0,2.71,0.41,5.04,0,36,10,221,773,1863,Saturday
1503960366,2016-04-17,9705,6.48,6.48,0,3.19,0.78,2.51,0,38,20,164,539,1728,Sunday


* **HEARTBEAT**

In [None]:
fitbit_heartbeat_copy <- fitbit_heartbeat %>%
clean_names() %>%
distinct() %>%
rename(heart_rate = value) %>%
mutate(
    # Parse the full timestamp so the AM/PM marker is kept; splitting on " " into
    # two columns would silently drop it. Assumes values like "4/12/2016 7:21:20 AM".
    date_time = mdy_hms(time),
    date = as_date(date_time),
    time = as_hms(date_time) # use the hms package, which provides a suitable class for times
) %>%
select(-date_time)
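
A small base R sketch of why the timestamp format matters when deriving `date` and `time`. It assumes the `Time` column holds 12-hour timestamps such as `"4/12/2016 7:21:20 PM"`; that layout is an assumption here, since the raw column is not printed in this notebook, and the example value is hypothetical.

```r
# Example timestamp in the assumed "m/d/Y h:m:s AM/PM" layout (hypothetical value)
raw_time <- "4/12/2016 7:21:20 PM"

# Splitting on spaces gives three pieces; keeping only two would drop "PM"
pieces <- strsplit(raw_time, " ")[[1]]

# Parse the whole string instead: %I is the 12-hour clock, %p the AM/PM marker
# (%p is locale-dependent; this assumes an English or C locale)
parsed <- as.POSIXct(raw_time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(length(pieces))
print(format(parsed, "%Y-%m-%d"))
print(format(parsed, "%H:%M:%S"))
```

If the timestamps turn out to be in 24-hour format, the AM/PM handling is unnecessary and a plain split on the space is enough.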

In [None]:
fitbit_heartbeat_copy <- fitbit_heartbeat_copy %>% 
group_by(id, date, time) %>% 
summarize(heart_rate = mean(heart_rate), .groups = "drop") %>% # drop grouping so later steps work on a plain data frame
arrange(id, date, time)

In [None]:
head(fitbit_heartbeat_copy)

###  * **Sleep**

In [None]:
head(fitbit_sleep)

In [None]:
fitbit_sleep_copy <- fitbit_sleep %>% 
clean_names() %>% 
distinct() %>%
mutate(sleep_day = as.Date(sleep_day,"%m/%d/%Y")) %>% 
rename(date = sleep_day, sleep_time = total_minutes_asleep, bed_time = total_time_in_bed)

In [None]:
fitbit_sleep_copy <- fitbit_sleep_copy %>% 
select(-total_sleep_records) %>% 
mutate(wake_time = bed_time - sleep_time) %>% 
arrange(id, date)

In [None]:
head(fitbit_sleep_copy)

### * **Body**

In [None]:
fitbit_body_copy <- fitbit_body %>% 
clean_names() %>% 
distinct() %>% 
rename(weight = weight_kg) %>% 
mutate(date = as.Date(date, "%m/%d/%Y"), height = sqrt(weight/bmi)* 100) %>% 
arrange(id,date) %>% 
select(-c(log_id, weight_pounds))
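
The `height` column above comes from inverting the BMI formula, BMI = weight (kg) / height (m)^2. A minimal worked example in base R, with made-up values for a single weight log entry:

```r
# Hypothetical values for one weight log entry
weight_kg <- 70
bmi <- 24.2

# BMI = weight_kg / height_m^2, so height_m = sqrt(weight_kg / bmi)
height_cm <- sqrt(weight_kg / bmi) * 100

# Round trip: recomputing BMI from the derived height returns the original value
bmi_check <- weight_kg / (height_cm / 100)^2

print(round(height_cm, 1))
print(all.equal(bmi_check, bmi))
```

Because the height is derived from two rounded columns, it should be read as an approximation rather than a measured value.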


In [None]:
head(fitbit_body_copy)

# **Analyze and Visualize**
Now that the data is cleaned, we can look at the number of unique users and the range of dates in each data frame.

In [None]:
# Print the summary of the activity data
print(paste("activity data has", n_distinct(fitbit_activity_copy$id), "Ids, data tracked for", 
            n_distinct(fitbit_activity_copy$date), "days between", min(fitbit_activity_copy$date), 
            "and", max(fitbit_activity_copy$date)))


In [None]:
print(paste("heartrate has ", n_distinct(fitbit_heartbeat_copy$id), " Ids, data tracked for ", 
            n_distinct(fitbit_heartbeat_copy$date), " days between ", min(fitbit_heartbeat_copy$date), 
            " and ", max(fitbit_heartbeat_copy$date)))

In [None]:
print(paste("sleep has ", n_distinct(fitbit_sleep_copy$id), " Ids, data tracked for ", 
            n_distinct(fitbit_sleep_copy$date), " days between ", min(fitbit_sleep_copy$date), 
            " and ", max(fitbit_sleep_copy$date)))


In [None]:
print(paste("weight has ", n_distinct(fitbit_body_copy$id), " Ids, data tracked for ", 
            n_distinct(fitbit_body_copy$date), " days between ", min(fitbit_body_copy$date), 
            " and ", max(fitbit_body_copy$date)))

### **Available data**

* 33 unique users
* Data was recorded over 31 days, between 2016-04-12 and 2016-05-12
* I have taken 4 basic tracked features: steps, sleep, heart rate, weight


In [None]:
colnames(fitbit_activity_copy)
colnames(fitbit_heartbeat_copy)
colnames(fitbit_sleep_copy)
colnames(fitbit_body_copy)

In [None]:
max_steps <- max(fitbit_activity_copy$steps, na.rm = TRUE)
min_steps <- min(fitbit_activity_copy$steps, na.rm = TRUE)
print(paste("The maximum number of steps is:", max_steps))
print(paste("The minimum number of steps is:", min_steps))

In [None]:
# For the daily activity dataframe:
fitbit_activity_copy %>%  
  select(steps,
         distance,
         calories) %>%
  summary()


In [None]:
fitbit_sleep_copy %>%  
  select(sleep_time,
         bed_time) %>%
  summary()


In [None]:
# Load required library
library(ggplot2)

options(repr.plot.width = 15, repr.plot.height = 7)

# Create the scatter plot
ggplot(fitbit_activity_copy, aes(x = steps, y = calories)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + # Added method "lm" for linear model and removed the shaded area
  labs(title = "Scatter Plot of Steps vs. Calories",
       x = "Number of Steps",
       y = "Calories") +
  theme_minimal()


##### The graph above shows a positive relationship between the number of steps and the calories burned: **the more steps a user takes, the more calories they burn.**
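
To put a number on that relationship, the correlation coefficient can be computed alongside the plot. The sketch below uses only the six rows of `fitbit_activity_copy` printed earlier, so its result is illustrative and not a finding; on the full table the equivalent call would be `cor(fitbit_activity_copy$steps, fitbit_activity_copy$calories)`.

```r
# Steps and calories from the six rows of fitbit_activity_copy shown earlier
steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson correlation coefficient (always between -1 and 1)
r <- cor(steps, calories)

# Slope of the fitted line: extra calories per additional step
fit <- lm(calories ~ steps)
slope <- unname(coef(fit)["steps"])

print(round(r, 2))
print(slope)
```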

##### Consequently, I will investigate the calorie data more thoroughly within the dataset. 

As a starting point, let's examine the relationship between the day of the week and the calories burned.

In [None]:
# Create a bar chart of total calories per weekday (geom_col stacks the daily values)
ggplot(fitbit_activity_copy, aes(x = week_day, y = calories)) +
  geom_col() +
  labs(title = "Total Calories by Weekday",
       x = "Weekday",
       y = "Total Calories") +
  theme_minimal()

#### So far, the above graph does not reveal any significant relationship between the days of the week and the calories burned.

In [None]:
# Original distinct IDs
original_ids <- fitbit_activity_copy %>% distinct(id) %>% pull(id)

# Distinct IDs for calories > 2304
high_cal_ids <- fitbit_activity_copy %>% filter(calories > 2304) %>% distinct(id) %>% pull(id)

# Distinct IDs for calories < 1500
low_cal_ids <- fitbit_activity_copy %>% filter(calories < 1500) %>% distinct(id) %>% pull(id)

# Check IDs that are in both groups
common_ids <- intersect(low_cal_ids, high_cal_ids)

# IDs only in the high-calorie group
only_high_cal_ids <- setdiff(high_cal_ids, common_ids)

# IDs only in the low-calorie group
only_low_cal_ids <- setdiff(low_cal_ids, common_ids)

# IDs in both groups (moderate-calorie)
moderate_cal_ids <- common_ids

# Add new column 'calories_group'
fitbit_activity_copy <- fitbit_activity_copy %>%
  mutate(
    calories_group = case_when(
      id %in% only_high_cal_ids ~ "High",
      id %in% only_low_cal_ids ~ "Low",
      id %in% moderate_cal_ids ~ "Moderate",
      TRUE ~ NA_character_  # Users whose daily calories never leave the 1500-2304 range fall in neither set and get NA
    )
  )

# Print the first few rows to check the result
head(fitbit_activity_copy)


#### Now, let's divide the data into three categories based on calorie consumption: high, moderate, and low. We will create a new column to signify the user's category.

In [None]:
# Summarize data
calorie_group_summary <- fitbit_activity_copy %>%
  distinct(id, calories_group) %>%
  group_by(calories_group) %>%
  summarize(count = n())

# Plot the bar chart
options(repr.plot.width = 10, repr.plot.height = 7)

ggplot(calorie_group_summary, aes(x = calories_group, y = count, fill = calories_group)) +
  geom_bar(stat = "identity") +
  labs(title = "Comparison of User Groups by Calorie Consumption",
       x = "Calorie Consumption Group",
       y = "Number of Users") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") 


#### We can see that there are 7 high-calorie users, 8 low-calorie users, and the remaining 18 are moderate-calorie users.

##### Let's check the users on a weekday basis.
##### Then, check the distribution of distances by activity level for each calorie group.


In [None]:
# Summarize data
weekly_summary <- fitbit_activity_copy %>%
  distinct(id, week_day, calories_group) %>%
  group_by(week_day, calories_group) %>%
  summarize(count = n())

# Plot the bar chart
options(repr.plot.width = 15, repr.plot.height = 7)

ggplot(weekly_summary, aes(x = factor(week_day), y = count, fill = calories_group)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Weekly Comparison of User Groups by Calorie Consumption",
       x = "Weekday",
       y = "Number of Users") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") # Optional: for better colors

In [None]:
# Summarize the total distances for each category
distance_summary <- fitbit_activity_copy %>%
  group_by(calories_group) %>%
  summarize(
    very_active_distance = sum(very_active_distance, na.rm = TRUE),
    moderately_active_distance = sum(moderately_active_distance, na.rm = TRUE),
    light_active_distance = sum(light_active_distance, na.rm = TRUE),
    sedentary_active_distance = sum(sedentary_active_distance, na.rm = TRUE)
  )

# Reshape the data to a long format
distance_long <- distance_summary %>%
  pivot_longer(
    cols = c(very_active_distance, moderately_active_distance, light_active_distance, sedentary_active_distance),
    names_to = "activity_level",
    values_to = "distance"
  )

# Create the pie charts
options(repr.plot.width = 15, repr.plot.height = 7)

ggplot(distance_long, aes(x = "", y = distance, fill = activity_level)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  facet_wrap(~ calories_group) +
  labs(title = "Distribution of Distances by Activity Level for Each Calorie Group",
       x = "",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") # "Set4" is not a ColorBrewer palette

In [None]:
colnames(fitbit_activity_copy)

In [None]:
# Summarize the total active minutes for each calorie group
minutes_summary <- fitbit_activity_copy %>%
  group_by(calories_group) %>%
  summarize(
    very_active_minutes = sum(very_active_minutes, na.rm = TRUE),
    fairly_active_minutes = sum(fairly_active_minutes, na.rm = TRUE)
  )

# Reshape the data to a long format
minutes_long <- minutes_summary %>%
  pivot_longer(
    cols = c(very_active_minutes, fairly_active_minutes),
    names_to = "activities",
    values_to = "time_taken"
  )

# Create the pie charts
options(repr.plot.width = 15, repr.plot.height = 7)

ggplot(minutes_long, aes(x = "", y = time_taken, fill = activities)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  facet_wrap(~ calories_group) +
  labs(title = "Distribution of Active Minutes by Intensity for Each Calorie Group",
       x = "",
       y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") 

* **Original unique IDs: 33**
* IDs only in the high-calorie group: **7**
* IDs in the moderate-calorie group: **18**
* IDs only in the low-calorie group: **8**

About 75% of users (the high and moderate groups) are active on a daily basis, in terms of both activity and calories burned.
The remaining 25% (the low group) are less active and less consistent in both calories and activity, although they use the Fitbit device regularly.
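
The percentages above follow directly from the group counts. A short base R check, using the counts reported in this notebook:

```r
# Number of users per calorie group, as counted above
group_counts <- c(High = 7, Moderate = 18, Low = 8)

# Share of each group among all users, in percent
group_share <- round(group_counts / sum(group_counts) * 100, 1)

# Combined share of the high and moderate groups
active_share <- round(
  sum(group_counts[c("High", "Moderate")]) / sum(group_counts) * 100, 1
)

print(group_share)
print(active_share)
```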

In [None]:
# Correctly join the data frames using inner_join
merge_activity_heartbeat <- fitbit_activity_copy %>% inner_join(fitbit_heartbeat_copy, by = c("id","date"))

head(merge_activity_heartbeat)

In [None]:
# Load necessary library
library(dplyr)
library(ggplot2)  # for plotting


# Group by calorie_group and filter rows with non-NA heart_rate values
calories_with_heart_rate <- merge_activity_heartbeat %>%
  filter(!is.na(heart_rate)) %>%
  group_by(calories_group) %>%
  summarise(heart_rate_count = n())

# Calculate percentages
calories_with_heart_rate <- mutate(calories_with_heart_rate,
                                   percentage = heart_rate_count / sum(heart_rate_count) * 100)

# Create a pie chart
ggplot(calories_with_heart_rate, aes(x = "", y = percentage, fill = calories_group)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Set3") +  # Change color palette as needed
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Percentage of Heart Rate Values by Calorie Group",
       fill = "calories_group",
       x = NULL, y = NULL) +
  theme_minimal()

Heart rate is recorded mostly by high- and moderate-calorie users.

Only about 0.1% of the heart-rate records come from low-calorie users.

In [None]:
# Load necessary library
library(dplyr)

# Group by calorie_group and filter rows with non-NA heart_rate values
calories_with_heart_rate <- merge_activity_heartbeat %>%
  filter(!is.na(heart_rate)) %>%
  group_by(calories_group) %>%
  summarise(
    heart_rate_count = n(),
    average_heart_rate = mean(heart_rate)
  )

# Display the results
print(calories_with_heart_rate)

In terms of heart rate:
* High-calorie users have an average of 75 bpm.
* Low-calorie users have an average of approximately 94 bpm. This group has too few records for a reliable average, which is likely why its value is so high.


#### In terms of heart rate record count, high- and moderate-calorie users account for 99.9% of the heartbeat data, whereas low-calorie users account for only 0.1%.

In [None]:
merge_activity_sleep <- fitbit_activity_copy %>% inner_join(fitbit_sleep_copy, by = c("id","date"))

head(merge_activity_sleep)

In [None]:
# Group by calorie_group and filter rows with non-NA sleep values
calories_with_sleep_time <- merge_activity_sleep %>%
  filter(!is.na(sleep_time)) %>%
  group_by(calories_group) %>%
  summarise(sleep_time_count = n())

# Calculate percentages
calories_with_sleep_time <- mutate(calories_with_sleep_time,
                                   percentage = sleep_time_count / sum(sleep_time_count) * 100)

# Create a pie chart
ggplot(calories_with_sleep_time, aes(x = "", y = percentage, fill = calories_group)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Set3") +  # Change color palette as needed
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Percentage of Sleep Time Values by Calorie Group",
       fill = "calories_group",
       x = NULL, y = NULL) +
  theme_minimal()

In [None]:
# Group by calories_group and keep rows with non-NA sleep_time values
calories_with_sleep_time <- merge_activity_sleep %>%
  filter(!is.na(sleep_time)) %>%
  group_by(calories_group) %>%
  summarise(
    sleep_time_count = n(),
    average_sleep_time = mean(sleep_time)
  )

# Display the results
print(calories_with_sleep_time)

In terms of sleep time (minutes asleep per night):
* High-calorie users have an average of 423 min.
* Moderate-calorie users have an average of 427 min.
* Low-calorie users have an average of 393 min.
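
Since `sleep_time` was renamed from `total_minutes_asleep`, the averages above are in minutes. Converting them to hours makes them easier to read; a minimal base R sketch using the reported values:

```r
# Average minutes asleep per calorie group, as reported above
avg_sleep_min <- c(High = 423, Moderate = 427, Low = 393)

# Convert minutes to hours
avg_sleep_hr <- round(avg_sleep_min / 60, 2)

print(avg_sleep_hr)
```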

#### In terms of sleep record count, high-calorie users account for about 20% of the sleep data, moderate-calorie users for about 59%, and low-calorie users for about 21%.

Compared with the daily activity dataset, the sleep feature is used fairly evenly across all user groups.


In [None]:
merge_activity_body <- fitbit_activity_copy %>% inner_join(fitbit_body_copy, by = c("id","date"))

head(merge_activity_body)

In [None]:
# Load necessary library
library(dplyr)

# Create the Weight_Status column based on bmi value
merge_activity_body <- merge_activity_body %>%
  mutate(Weight_Status = case_when(
    bmi < 18.5 ~ "Underweight",
    bmi < 25.0 ~ "Healthy Weight",  # 18.5 <= bmi < 25 (conditions are checked in order)
    bmi < 30.0 ~ "Overweight",      # 25 <= bmi < 30, so values such as 24.95 or 29.95 are not left as NA
    bmi >= 30.0 ~ "Obesity"
  ))

# Display the first few rows of the updated data frame
head(merge_activity_body)
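
An alternative way to bin BMI values is base R's `cut()`, which makes the interval boundaries explicit. The BMI values below are hypothetical and chosen to sit on or near the category edges; the category names match the ones used in this notebook.

```r
# Hypothetical BMI values, including ones on or near the category edges
bmi <- c(17.0, 18.5, 24.95, 25.0, 29.95, 30.0, 47.5)

# right = FALSE makes each interval closed on the left: [18.5, 25), [25, 30), ...
weight_status <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy Weight", "Overweight", "Obesity"),
  right = FALSE
)

print(data.frame(bmi, weight_status))
```

With contiguous breaks every BMI value lands in exactly one category, so no record is left unclassified.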


In [None]:
ggplot(merge_activity_body, aes(x = calories_group, fill = Weight_Status)) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Weight Status by Calorie Group",
       x = "calories_group",
       y = "Count",
       fill = "Weight Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

In [None]:
# Summarize the data to calculate the percentage of each Weight_Status group
summary_data <- merge_activity_body %>%
  group_by(Weight_Status) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(percentage = count / sum(count) * 100)

# Create a donut chart
ggplot(summary_data, aes(x = 2, y = percentage, fill = Weight_Status)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5)) +
  xlim(0.5, 2.5) +  # Adjust the x-axis limits to create the donut hole
  theme_void() +  # Remove background, gridlines, and axis
  theme(legend.position = "right") +
  labs(title = "Percentage of Weight Status Groups",
       fill = "Weight Status")


In [None]:
# Summarize the data to calculate each (manual report, weight status) combination as a percentage of all weight records
summary_weight_status <- merge_activity_body %>%
  group_by(is_manual_report, Weight_Status) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(percentage = count / sum(count) * 100)

# Create a donut chart for weight status within each manual report group
ggplot(summary_weight_status, aes(x = 2, y = percentage, fill = Weight_Status)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5)) +
  facet_wrap(~ is_manual_report, labeller = labeller(is_manual_report = c(`TRUE` = "Manual Report", `FALSE` = "Automatic Report"))) +
  xlim(0.5, 2.5) +  # Adjust the x-axis limits to create the donut hole
  theme_void() +  # Remove background, gridlines, and axis
  theme(legend.position = "right") +
  labs(title = "Weight Status Distribution within Manual and Automatic Users",
       fill = "Weight Status")
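
The percentages in the chart above are shares of all weight records. If shares within each report type are wanted instead, the denominator has to be the group total. A minimal base R sketch of the difference, using invented records (not the Fitbit data):

```r
# Hypothetical weight log records (values invented for illustration)
records <- data.frame(
  is_manual_report = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  weight_status = c("Healthy Weight", "Healthy Weight", "Healthy Weight", "Overweight",
                    "Overweight", "Overweight", "Overweight", "Overweight", "Obesity")
)

counts <- table(records$is_manual_report, records$weight_status)

# Share of ALL records per combination (dividing by the overall total)
overall_pct <- prop.table(counts) * 100

# Share WITHIN each report type: every row sums to 100
within_pct <- prop.table(counts, margin = 1) * 100

print(round(overall_pct, 1))
print(round(within_pct, 1))
```

In a dplyr pipeline the same effect would come from grouping by `is_manual_report` before computing `count / sum(count)`.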


The graph above suggests that manual reporting is most common among healthy-weight users, and much less common among overweight and obese users.

Only 24% of users log body data to track their records.

Let's check the percentages of users. Each figure below scales a share of the weight records by the 24% of the 33 users who log weight.


Manual method: 

* Healthy weight: 50% of the weight records, which corresponds to about 12% of all 33 users preferring the manual method.
* Overweight: 10% of the weight records, which corresponds to about 2.5% of all 33 users preferring the manual method.

Automatic method:

* Overweight: 37% of the weight records, which corresponds to about 9% of all 33 users preferring the automatic method.
* Obesity: 1.5% of the weight records, which corresponds to about 0.4% of all 33 users preferring the automatic method.

In [None]:
# Create a set of unique ids for every group of users
steps_ids <- unique(fitbit_activity_copy$id)
sleep_ids <- unique(fitbit_sleep_copy$id)
heartrate_ids <- unique(fitbit_heartbeat_copy$id)
weight_ids <- unique(fitbit_body_copy$id)

In [None]:
install.packages("VennDiagram")

In [None]:
library(VennDiagram)

options(repr.plot.width = 7, repr.plot.height = 7) # specify the desired size of the figure

# Create a venn diagram
plot <- venn.diagram(
  x = list(steps_ids, sleep_ids, heartrate_ids, weight_ids),
  category.names = c("Steps" , "Sleep" , "Heart rate", "Weight"),
  filename = NULL,
  fill = c("goldenrod2", "seagreen1", "orchid1", "lightskyblue1"),
  cex = 1.5, fontface = "bold", fontfamily = "sans", # formatting of numbers
  cat.cex = 1, cat.fontface = "bold", cat.fontfamily = "sans") # formatting of set names
grid::grid.draw(plot)

Out of 33 users, the percentages are as follows:

![image.png](attachment:36425ec4-f5aa-4c2a-ab58-6f105baf0adb.png)

#### Multi-feature users:

* 100% (33 Ids) have STEPS count records (combined with or without other features)
* 73% (24 Ids) have STEPS count and SLEEP tracking records (this subgroup is fairly close to that of Bellabeat's users)
* 42% (14 Ids) have STEPS count and HEARTRATE monitoring records
* 24% (8 Ids) have STEPS count and WEIGHT tracking records
* 9% (3 Ids) have all four featured records of STEPS - SLEEP - HEARTRATE - WEIGHT

Of which:-

**Single-feature records or users:**

* 18% (6 Ids) have only STEPS count records (no other features being used)

**Duo-feature users:**

* 27% (9 Ids) have only the duo-feature of STEPS - SLEEP records (this subgroup is the closest one to Bellabeat's Leaf users, as it purely recorded Steps - Sleep)
* 1 Id has only the duo-feature of STEPS - WEIGHT records
* 1 Id has only the duo-feature of STEPS - HEARTRATE records

**Trio-feature users:**

* 27% (9 Ids) used the 3 features of STEPS - SLEEP - HEARTRATE
* 9% (3 Ids) used the 3 features of STEPS - SLEEP - WEIGHT
* 1 Id used the trio-feature of STEPS - HEARTRATE - WEIGHT

**Features with 0 users:**

* 0 Ids used only the HEARTRATE, WEIGHT, or SLEEP feature alone
* 0 Ids used only the duo-features of HEARTRATE - WEIGHT, SLEEP - WEIGHT, or HEARTRATE - SLEEP

The user group of 9 Ids that have only a duo-feature of STEPS - SLEEP records is the closest to Leaf users. 

Unfortunately, the sample for this segment is relatively small.
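
The exclusive counts above are read off the Venn diagram, but they can also be tabulated directly. A minimal base R sketch with hypothetical id sets; in this notebook the inputs would be `steps_ids`, `sleep_ids`, `heartrate_ids`, and `weight_ids`.

```r
# Hypothetical id sets standing in for steps_ids, sleep_ids, heartrate_ids, weight_ids
feature_ids <- list(
  steps = c(1, 2, 3, 4, 5),
  sleep = c(2, 3, 4),
  heartrate = c(3, 4),
  weight = c(4, 5)
)

all_ids <- sort(unique(unlist(feature_ids)))

# Logical membership matrix: one row per id, one column per feature
membership <- sapply(feature_ids, function(ids) all_ids %in% ids)

# Label each id with the exact combination of features it appears in
combo <- apply(membership, 1, function(row) {
  paste(names(feature_ids)[row], collapse = " - ")
})

# Number of ids per exclusive combination
combo_counts <- table(combo)
print(combo_counts)
```

Each id is counted once, under the full set of features it uses, so the counts add up to the total number of users.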

In [None]:
# Install DataExplorer (if not already installed)
install.packages("DataExplorer")

# Load the package
library(DataExplorer)

In [None]:
options(repr.plot.width = 15, repr.plot.height = 7) # specify the desired size of the figure
merge_activity_sleep %>%
  select(-c("id")) %>%
  plot_histogram(ncol = 3, ggtheme = theme_light())


In [None]:
# Get the number of days each user has both activity and sleep records in the 31-day period:
obs_days <- merge_activity_sleep %>% group_by(id) %>% 
  summarise(num_dayuse = n(), .groups = "drop") %>% 
  arrange(desc(num_dayuse))
head(obs_days)

In [None]:
# Classify users into usage ranges
usage <- obs_days %>% 
  mutate(group = case_when(
    between(num_dayuse, 1, 10) ~ "low usage",
    between(num_dayuse, 11, 20) ~ "moderate usage",
    between(num_dayuse, 21, 31) ~ "high usage",
    TRUE ~ NA_character_
    ))
# Create a df with new attributes
usage_df <- merge_activity_sleep %>% 
  left_join(usage, by = "id")
# Compute percentage of each usage groups 
sum_usage <- usage %>% 
  mutate(group = fct_relevel(group, c("high usage", "moderate usage", "low usage"))) %>% 
  group_by(group) %>%  
  summarise(num_users = n()) %>% 
  mutate(percent = num_users/sum(num_users)*100)

In [None]:
install.packages("formattable")
library(formattable)


In [None]:
formattable(sum_usage, list(percent = color_bar("yellow")))

In general, users did not wear their devices every day. That is not surprising, but after counting the number of days each user wore their fitness tracker (or had data for that day), I noticed that some users kept their devices on daily or almost daily (27 to 31 days), a few used them on only a handful of days, and the rest fell around the average number of days for the recording period.

Stats:

* 50% of users used their devices frequently, on a nearly daily basis (21-31 days).

* 12% of users used their devices moderately (11-20 days).

* 38% of users used their devices least frequently (1-10 days).
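
The same three usage ranges can be expressed with base R's `cut()`. The day counts below are hypothetical and only illustrate how the boundaries behave; they are not the values behind the percentages above.

```r
# Hypothetical number of recorded days per user
num_dayuse <- c(3, 10, 11, 20, 21, 28, 31)

# Intervals (0, 10], (10, 20], (20, 31] match the 1-10, 11-20, 21-31 day ranges
usage_group <- cut(
  num_dayuse,
  breaks = c(0, 10, 20, 31),
  labels = c("low usage", "moderate usage", "high usage")
)

# Percentage of users in each usage group
usage_pct <- round(prop.table(table(usage_group)) * 100, 1)
print(usage_pct)
```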