# PHASE 1 : ASK
Key stakeholders:
Bellabeat’s two co-founders, Urška Sršen and Sando Mur, and the company's marketing analytics team.

Business task statement:
Bellabeat wants to find new opportunities to grow its business. For that reason, I am going to analyze information about its current users utilizing the products offered by the company. The task is to find trends in the data and make useful recommendations for the company.


# PHASE 2 : PREPARE
**Data Source**
* The dataset used for this data analysis is the FitBit Fitness Tracker Data from Kaggle by Möbius, under open-access terms for educational and portfolio-building purposes.
 
**Data Summary**
* This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. These users consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

**Data Integrity & Limitations**
* Small sample: only 30 users.
* The data was collected in 2016, so it is outdated and may not fully reflect current usage patterns or product versions.
* Short collection window (the export is labeled 4.12.16-5.12.16, about one month).
* No demographic information about users, such as age or region.

*dailyActivity_merged, sleepDay_merged, weightLogInfo_merged, and hourlyIntensities_merged are selected for analysis.*


In [None]:
#Importing libraries
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ggpubr)

In [None]:
# Read the dataset
data_dir <- "/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16"
Daily_Activity <- read.csv(file.path(data_dir, "dailyActivity_merged.csv"))
Sleep_Day <- read.csv(file.path(data_dir, "sleepDay_merged.csv"))
Weight_Log_Info <- read.csv(file.path(data_dir, "weightLogInfo_merged.csv"))
Hourly_Intensities <- read.csv(file.path(data_dir, "hourlyIntensities_merged.csv"))


# PHASE 3: PROCESS
I used R in a Kaggle notebook for the analysis.
First, I previewed each dataset and checked it for errors.


In [None]:
#Preview the dataset
glimpse(Daily_Activity)
n_distinct(Daily_Activity)
sum(is.na(Daily_Activity))
head(Daily_Activity)

glimpse(Sleep_Day)
n_distinct(Sleep_Day)
sum(is.na(Sleep_Day))
head(Sleep_Day)

glimpse(Weight_Log_Info)
n_distinct(Weight_Log_Info)
sum(is.na(Weight_Log_Info))
head(Weight_Log_Info)

glimpse(Hourly_Intensities)
n_distinct(Hourly_Intensities)
sum(is.na(Hourly_Intensities))
head(Hourly_Intensities)

**Findings:**
1. Sleep_Day contains duplicate rows. It has 413 rows in total, but n_distinct() reports only 410 distinct rows.
2. Weight_Log_Info contains 65 NA values, most of them in the "Fat" column. Removing the rows with NA values would leave only 2 rows, so I kept those rows but did not use them in the analysis.
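
The checks behind these findings can be sketched in base R as follows. This is a minimal illustration on a made-up data frame (`toy` and its values are assumptions, not the Fitbit data). It counts fully duplicated rows, NA values per column, and distinct users. Note that `n_distinct()` applied to a whole data frame counts distinct rows, so the number of users has to be taken from the `Id` column.

```r
# Toy data standing in for Sleep_Day / Weight_Log_Info (all values are made up).
toy <- data.frame(
  Id      = c(1, 1, 2, 2, 3),
  Date    = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  Minutes = c(400, 400, 380, 420, 390),
  Fat     = c(NA, NA, 22, NA, NA)
)

# Rows that repeat an earlier row exactly
n_duplicates <- sum(duplicated(toy))

# NA count per column, which shows where the missing values are concentrated
na_by_column <- colSums(is.na(toy))

# Distinct users (not distinct rows)
n_users <- length(unique(toy$Id))

print(n_duplicates)
print(na_by_column)
print(n_users)
```

On the real tables, the same three calls could be run on `Sleep_Day` and `Weight_Log_Info` in place of `toy`.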

In [None]:
# Remove duplicate rows from Sleep_Day
new_Sleep_Day <- Sleep_Day %>% distinct()


# Give each table a common 'Date' column and convert it from text to Date type.
# 1. Process the 'activity' data frame
# The date column is named 'ActivityDate' and has no time component.
activity <- Daily_Activity %>%
  rename(Date = ActivityDate) %>%
  mutate(Date = mdy(Date))

# 2. Process the 'sleep' data frame
# The date column is named 'SleepDay' and includes time.
sleep <- new_Sleep_Day %>%
  rename(Date = SleepDay) %>%
  mutate(Date = as.Date(mdy_hms(Date)))

# 3. Process the 'weight' data frame
# The date column is already named 'Date', so we only need to convert its type.
weight <- Weight_Log_Info %>%
  mutate(Date = as.Date(mdy_hms(Date)))

# 4. Process the 'hourly_intensity' data frame
# Extract the hour of day (0-23) into a new column.
hourly_intensity <- Hourly_Intensities %>%
  mutate(Hour = hour(mdy_hms(ActivityHour)))
   


head(activity)
head(sleep)
head(weight)
head(hourly_intensity)


# PHASE 4: ANALYSIS
I created plots to look for trends and relationships between variables.

In [None]:
ggplot(data = activity, mapping = aes(x = TotalSteps, y = Calories)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Calories vs. Total Steps")

**Days with more steps tend to have more calories burned.**
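
A scatter plot shows the direction of the trend. A correlation coefficient is one way to put a number on its strength. The sketch below uses a small made-up data frame (`toy_activity` and its values are assumptions for illustration only) and base R's `cor()`, which returns the Pearson correlation by default.

```r
# Made-up daily records; the real analysis would use the 'activity' data frame.
toy_activity <- data.frame(
  TotalSteps = c(2000, 5000, 8000, 11000, 14000),
  Calories   = c(1600, 1900, 2100, 2500, 2700)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(toy_activity$TotalSteps, toy_activity$Calories)
print(r)
```

With the notebook's data this would be `cor(activity$TotalSteps, activity$Calories)`. The result describes association only and says nothing about cause.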

I calculated the average intensity across all users for each hour of the day to see which period has the highest intensity.

In [None]:
hourly_summary <- hourly_intensity %>%
  group_by(Hour) %>%
  summarise(MeanIntensity = mean(AverageIntensity, na.rm = TRUE))


ggplot(data = hourly_summary, aes(x = Hour, y = MeanIntensity)) +
    geom_col(fill = "darkblue") +
    labs( title = "Average Activity Intensity by Hour")+
    scale_x_continuous(breaks = seq(0, 23, by = 1)) 

Users are most active between 17:00 and 19:00.
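
The peak can also be read off programmatically rather than from the chart. The following is a minimal base R sketch on made-up hourly records (`toy_hourly` and its values are assumptions): it averages intensity per hour with `aggregate()` and picks the hour with the highest mean using `which.max()`.

```r
# Made-up hourly records; the real analysis would use 'hourly_intensity'.
toy_hourly <- data.frame(
  Hour             = c(8, 8, 12, 12, 18, 18),
  AverageIntensity = c(0.2, 0.4, 0.3, 0.5, 0.6, 0.8)
)

# Mean intensity for each hour of the day
toy_summary <- aggregate(AverageIntensity ~ Hour, data = toy_hourly, FUN = mean)

# Hour with the highest mean intensity
peak_hour <- toy_summary$Hour[which.max(toy_summary$AverageIntensity)]
print(toy_summary)
print(peak_hour)
```

`which.max()` returns only the first maximum, so sorting the summary by intensity is a better fit when several adjacent hours should be reported as a peak window.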

In [None]:
#Join the sleep and activity table into a new table
sleep_activity_data <- inner_join(activity, sleep, by = c("Id", "Date"))

# Calculate Sleep Efficiency and add it as a new column

sleep_activity_data <- sleep_activity_data %>%
  mutate(SleepEfficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100) %>%
  # Data cleaning:
  # 1. Drop rows where TotalTimeInBed is 0 (dividing by zero yields NaN or Inf in R, not an error).
  # 2. Drop data-entry errors where sleep efficiency is > 100%.
  # 3. Drop rows with NA values in the key metrics to ensure clean plots.
  filter(TotalTimeInBed > 0, 
         SleepEfficiency <= 100,
         !is.na(SleepEfficiency),
         !is.na(TotalSteps))


# Plot 1: Total Steps vs. Sleep Efficiency

ggplot(data = sleep_activity_data, mapping=aes(x = TotalSteps, y = SleepEfficiency)) +
  geom_point(alpha = 0.6, color = "purple") + # Scatter plot points
  geom_smooth(method = "lm", color = "black") + # Add a linear model trendline
  labs(
    title = "Relationship between Daily Steps and Sleep Efficiency",
    x = "Total Steps Taken in a Day",
    y = "Sleep Efficiency (%)"
  )



As the number of steps increases, sleep efficiency tends to decrease slightly.
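
Sleep efficiency here is the share of time in bed that was spent asleep, expressed as a percentage. The short base R sketch below works through the calculation on made-up values (`toy_sleep` is an assumption for illustration) and shows why rows with zero time in bed need to be filtered out.

```r
# Made-up sleep records; the third row simulates a night with no recorded time in bed.
toy_sleep <- data.frame(
  TotalMinutesAsleep = c(420, 360, 0),
  TotalTimeInBed     = c(480, 400, 0)
)

# Sleep efficiency = minutes asleep / minutes in bed * 100
toy_sleep$SleepEfficiency <-
  toy_sleep$TotalMinutesAsleep / toy_sleep$TotalTimeInBed * 100

# 420/480 gives 87.5 and 360/400 gives 90; 0/0 gives NaN rather than raising an error
print(toy_sleep)

# Keep only rows with a positive time in bed
clean <- toy_sleep[toy_sleep$TotalTimeInBed > 0, ]
print(clean)
```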

In [None]:
ggplot(data = sleep_activity_data, aes(x = VeryActiveMinutes, y = SleepEfficiency)) +
  geom_point(alpha = 0.6, color = "orange") +
  geom_smooth(method = "lm", color = "black") +
  labs(
    title = "Relationship between Very Active Minutes and Sleep Efficiency",
    x = "Minutes of Very Active Time",
    y = "Sleep Efficiency (%)"
  ) 

On average, as the amount of intense daily activity increases, sleep efficiency tends to increase slightly. However, the relationship is not very strong.
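
One way to check how weak such a relationship is would be to fit the same linear model that `geom_smooth(method = "lm")` draws and inspect its slope and R-squared. The sketch below does this on made-up values (`toy_fit_data` is an assumption, not the Fitbit data), so its numbers say nothing about the real result.

```r
# Made-up records; the real analysis would use 'sleep_activity_data'.
toy_fit_data <- data.frame(
  VeryActiveMinutes = c(0, 10, 20, 30, 40, 50),
  SleepEfficiency   = c(90, 93, 89, 94, 91, 95)
)

# Simple linear regression of sleep efficiency on very active minutes
fit <- lm(SleepEfficiency ~ VeryActiveMinutes, data = toy_fit_data)

# Slope: change in efficiency (percentage points) per extra very active minute
slope <- coef(fit)[["VeryActiveMinutes"]]

# R-squared: share of the variance in efficiency explained by the model
r_squared <- summary(fit)$r.squared

print(slope)
print(r_squared)
```

A positive slope with a low R-squared matches the wording "increases slightly, but not very strong": the trend line tilts upward, yet most of the variation in sleep efficiency is left unexplained.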

In [None]:
ggplot(activity, aes(x = SedentaryMinutes, y = Calories)) +
  geom_point(alpha = 0.6, color = "#E41A1C") +
  geom_smooth(method = "lm", se = FALSE, color = "#377EB8") +
  labs(title = "Relationship Between Sedentary Time and Calories Burned",
       x = "Sedentary Minutes",
       y = "Calories Burned") +
  theme_minimal()

As sedentary time increases, calories burned tend to decrease.

# PHASE 5: ACT

Recommendations:

1. Implement a "Smart" Notification System
*  Enhance sedentary alerts. Instead of just "Time to move!", make them more informative.
*  Create "Golden Hour" push notifications. Around 5 PM, send encouraging messages.
2. Refine Goal-Setting Features
*  Create new challenges for users to achieve, such as "The Sleep-Booster: Achieve 20 minutes of vigorous activity before 8 PM for 5 days in a row."
3. Develop a Personalized Content and Reporting Strategy
*  Create personalized weekly reports with "Did You Know?" insights that tell users their personal health story.