In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Background
## 1.1 About the company
Bellabeat, co-founded by Urška Sršen and Sando Mur, is a leading tech company crafting health-focused smart products with a keen aesthetic touch. It empowers women through data on activity, sleep, stress, and reproductive health. Since 2013, Bellabeat has expanded globally, offering products through online retailers and robust digital marketing. In 2016, it opened worldwide offices, emphasizing digital marketing channels like Google Search, social media, and display ads. To further growth, Sršen has tasked the marketing analytics team to analyze smart device usage data, aiming to shape Bellabeat's marketing strategy. This data-driven approach reflects Bellabeat's commitment to enhancing women's well-being.

## 1.2 Characters
* **Urška Sršen**: Bellabeat’s cofounder and Chief Creative Officer
* **Sando Mur**: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team 
* **Bellabeat marketing analytics team**: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.


## 1.3 Products
* **Bellabeat app**: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
* **Leaf**: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
* **Time**: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
* **Spring**: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.


# 2. Ask
## 2.1 Business Task 
The primary objective is to gather valuable insights into how consumers utilize smart devices. Through a careful analysis of these devices, we aim to gain a better understanding of user behavior, preferences, and patterns. This will aid in the development of Bellabeat's marketing strategy and open new routes for growth opportunities.

## 2.2 Stakeholders
* **Urška Sršen**: Bellabeat’s cofounder and Chief Creative Officer.
* **Sando Mur**: Bellabeat’s cofounder; key member of the Bellabeat executive team.
* **Bellabeat marketing analytics team.**

# 3. Prepare
## 3.1 About the dataset
**Name**: FitBit Fitness Tracker Data

**Source**: Kaggle (https://www.kaggle.com/datasets/arashnic/fitbit)

**Content**: This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

**Data Integrity**: Given the constraints, such as a limited sample size of 30 users and the absence of demographic details, there's a potential for sampling bias. We cannot guarantee that this sample accurately reflects the entire population. Additionally, the dataset is outdated, and the survey spanned only two months, posing limitations. Consequently, we intend to adopt an operational approach in our case study to address these challenges.



# 4. Process
## 4.1 Loading Packages
I will be using the tidyverse, lubridate and ggplot2 packages for this analysis.

In [None]:
library(tidyverse)
library(lubridate)
library(ggplot2)

## 4.2 Importing Data
I will be using the daily data of activity, steps, and sleep, and the hourly data of steps, calories, and intensity. I will also be looking at the weight data.

In [None]:
d_act <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
d_stp <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
d_sle <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
h_stp <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
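# Note: despite the h_ prefix, this file holds minute-level calories (timestamp column ActivityMinute)
h_cal <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteCaloriesNarrow_merged.csv")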
h_int <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
weight <- read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
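
The repeated folder path in the import cell could be factored into a small helper. The following is only a sketch: `read_fitbit` and `demo_dir` are hypothetical names, and a temporary folder with a tiny made-up CSV stands in for the Kaggle input directory so the example runs on its own.

```r
library(readr)

# Assumption: demo_dir stands in for the Kaggle input folder used above.
demo_dir <- tempdir()

# Made-up rows, written only so the helper has a file to read.
write_csv(
  data.frame(Id = c(1, 1, 2), StepTotal = c(100, 250, 80)),
  file.path(demo_dir, "hourlySteps_merged.csv")
)

# Hypothetical helper: builds the full path once and reads the file.
read_fitbit <- function(file_name, dir = demo_dir) {
  read_csv(file.path(dir, file_name), show_col_types = FALSE)
}

h_stp_demo <- read_fitbit("hourlySteps_merged.csv")
print(h_stp_demo)
```

In the notebook itself, `dir` would point at the dataset folder instead of a temporary directory.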

## 4.3 Preview of the data

In [None]:
head(d_act)
head(d_stp)
head(d_sle)
head(h_stp)
head(h_cal)
head(h_int)
head(weight)

## 4.4 Data Cleaning and Formatting
### 4.4.1 Checking the Data Structure

In [None]:
str(d_act)
str(d_stp)
str(d_sle)
str(h_stp)
str(h_cal)
str(h_int)
str(weight)

### 4.4.2 Checking the sample size

In [None]:
dist_d_act <- d_act %>%
    pull(Id) %>%
    n_distinct()
print(dist_d_act)

dist_d_stp <- d_stp %>%
    pull(Id) %>%
    n_distinct()
print(dist_d_stp)

dist_d_sle <- d_sle %>%
    pull(Id) %>%
    n_distinct()
print(dist_d_sle)

dist_h_stp <- h_stp %>%
    pull(Id) %>%
    n_distinct()
print(dist_h_stp)

dist_h_cal <- h_cal %>%
    pull(Id) %>%
    n_distinct()
print(dist_h_cal)

dist_h_int <- h_int %>%
    pull(Id) %>%
    n_distinct()
print(dist_h_int)

dist_weight <- weight %>%
    pull(Id) %>%
    n_distinct()
print(dist_weight)


The sample size for the weight variable is too small to accurately represent the entire population. Therefore, we cannot draw conclusions from it.
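
The per-table participant counts can also be computed in one pass, which makes the tables easier to compare side by side. This is a minimal sketch on made-up stand-in tables; the `Id` values are hypothetical and do not come from the dataset.

```r
library(dplyr)

# Hypothetical stand-ins for the imported tables (Id values are made up).
tables <- list(
  d_act  = tibble(Id = c(1, 1, 2, 3)),
  d_sle  = tibble(Id = c(1, 2, 2)),
  weight = tibble(Id = c(3, 3))
)

# Count distinct participants in every table at once.
id_counts <- sapply(tables, function(df) n_distinct(df$Id))
print(id_counts)
```

With the real data frames in the list, the same call would return one participant count per table.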

### 4.4.3 Checking for duplicates

In [None]:
sum(duplicated(d_act))
sum(duplicated(d_stp))
sum(duplicated(d_sle))
sum(duplicated(h_stp))
sum(duplicated(h_cal))
sum(duplicated(h_int))


The d_sle data frame contains 3 duplicate rows, which need to be removed.

### 4.4.4 Removing duplicates

In [None]:
# Keep only the first occurrence of each fully identical row
d_sle <- d_sle %>%
  distinct()

### 4.4.5 Checking for duplicates
Verifying if the duplicates were removed

In [None]:
sum(duplicated(d_sle))
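
To illustrate what the duplicate check and removal do, here is a small sketch on a made-up table (`sleep_demo` and its values are hypothetical). `duplicated()` flags rows that repeat an earlier row in every column, and `distinct()` keeps one copy of each.

```r
library(dplyr)

# Made-up sleep log in which the second row repeats the first exactly.
sleep_demo <- tibble(
  Id = c(1, 1, 2, 2),
  date = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(400, 400, 380, 420)
)

n_dupes <- sum(duplicated(sleep_demo))  # number of repeated rows
sleep_clean <- distinct(sleep_demo)     # one copy of each row

print(n_dupes)
print(nrow(sleep_clean))
```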

### 4.4.6 Checking for missing values

In [None]:
sum(is.na(d_act))
sum(is.na(d_stp))
sum(is.na(d_sle))
sum(is.na(h_stp))
sum(is.na(h_cal))
sum(is.na(h_int))

There are no missing values.
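
A single total per table hides where missing values sit. As a sketch on made-up data (the `demo` table and its values are hypothetical), `colSums(is.na())` breaks the count down by column:

```r
# Made-up table with gaps in two columns.
demo <- data.frame(
  Id = c(1, 2, 3),
  StepTotal = c(500, NA, 320),
  Calories = c(NA, NA, 90)
)

# Number of missing values in each column.
na_by_col <- colSums(is.na(demo))
print(na_by_col)
```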

### 4.4.7 Formatting date columns
Converting the date columns to date and date-time formats.


In [None]:
d_act <- d_act %>%
  rename(date = ActivityDate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

d_stp <- d_stp %>%
  rename(date = ActivityDay) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

d_sle <- d_sle %>%
  rename(date = SleepDay) %>%
  mutate(date = as_date(date, format ="%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))

h_stp <- h_stp %>% 
  rename(date_time = ActivityHour) %>% 
  mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

h_cal <- h_cal %>% 
  rename(date_time = ActivityMinute) %>% 
  mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

h_int <- h_int %>% 
  rename(date_time = ActivityHour) %>% 
  mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
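
As a quick check of the two timestamp layouts handled above, the sketch below parses sample strings with lubridate's `mdy()` and `mdy_hms()`. The sample strings are assumptions modeled on the format strings in the cell above, and `tz = "UTC"` is used here instead of `Sys.timezone()` only to keep the output reproducible.

```r
library(lubridate)

# Assumed sample values in the daily and hourly layouts.
daily_raw  <- "4/12/2016"
hourly_raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

daily_parsed  <- mdy(daily_raw)                  # Date
hourly_parsed <- mdy_hms(hourly_raw, tz = "UTC") # POSIXct, AM/PM handled

print(daily_parsed)
print(format(hourly_parsed, "%H:%M"))
```

The 12-hour clock values come back as 24-hour times, so midnight and 1 PM end up as "00:00" and "13:00".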


### 4.4.8 Verifying if the changes were made

In [None]:
head(d_act)
head(d_stp)
head(d_sle)
head(h_stp)
head(h_cal)
head(h_int)

# 5. Analyse
## 5.1 Summarize and explore the data sets

### 5.1.1 Amount of sedentary time

In [None]:
d_act %>%
    select(SedentaryMinutes)%>%
    summary()

The participants spent an average of 991.2 minutes or 16.52 hours per day being sedentary, which is alarming.
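
The conversion behind this figure is simple arithmetic, sketched below with the mean taken from the summary above. Expressing it as a share of a 24-hour day is an added illustration, not part of the original analysis.

```r
mean_sedentary_min <- 991.2  # mean SedentaryMinutes reported above

sedentary_hours <- mean_sedentary_min / 60         # minutes to hours
share_of_day    <- mean_sedentary_min / (24 * 60)  # fraction of a full day

print(round(sedentary_hours, 2))
print(round(share_of_day * 100, 1))
```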

### 5.1.2 Amount of daily steps

In [None]:
d_stp %>%
    select(StepTotal)%>%
    summary()

On average, the participants take 7,638 steps per day, which is lower than the commonly recommended 10,000. However, a study reported by the BBC found that taking at least 4,000 steps daily can decrease the risk of premature death from any cause.
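
Beyond the mean, it can be informative to see how many days clear each threshold. The sketch below uses made-up daily totals (`steps_demo` is hypothetical) and relies on the fact that the mean of a logical vector is the share of `TRUE` values.

```r
library(dplyr)

# Made-up daily step totals, for illustration only.
steps_demo <- tibble(StepTotal = c(3200, 4500, 7638, 10250, 12000))

thresholds <- steps_demo %>%
  summarise(
    share_4k  = mean(StepTotal >= 4000),   # share of days with 4,000+ steps
    share_10k = mean(StepTotal >= 10000)   # share of days with 10,000+ steps
  )

print(thresholds)
```

Applied to `d_stp`, the same `summarise()` call would give the shares for the actual participants.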

### 5.1.3 Amount of daily sleep

In [None]:
d_sle %>%
    select(TotalMinutesAsleep)%>%
    summary()

The average sleeping time per day is 419.2 minutes, or approximately 7 hours per night. This is below the recommended 8 hours, but it is still considered a healthy amount of sleep.
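
For reference, the mean can be split into hours and minutes and compared with the 8-hour recommendation. This is a small sketch using the value from the summary above.

```r
mean_asleep_min <- 419.2  # mean TotalMinutesAsleep reported above

hours   <- mean_asleep_min %/% 60        # whole hours
minutes <- round(mean_asleep_min %% 60)  # remaining minutes, rounded
gap_min <- 8 * 60 - mean_asleep_min      # shortfall against 8 hours

print(sprintf("%d h %d min", as.integer(hours), as.integer(minutes)))
print(gap_min)
```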

### 5.1.4 Hourly steps through the day

In [None]:
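# Split the timestamp into date and time columns. format() keeps "00:00:00"
# for midnight rows, which separate() on a POSIXct column can turn into NA.
h_stp <- h_stp %>%
  mutate(date = format(date_time, "%Y-%m-%d"),
         time = format(date_time, "%H:%M:%S")) %>%
  select(-date_time)

head(h_stp)

# Average steps for each hour of the day
h_stp <- h_stp %>%
  drop_na() %>%
  group_by(time) %>%
  summarise(mean_t_stp = mean(StepTotal))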

In [None]:
ggplot(data = h_stp, aes(x = time, y = mean_t_stp)) + geom_col(fill = "blue") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Average Total Steps by Hour")


Activity peaks between 16h30 and 17h30, with a second, lower peak between 11h30 and 14h30. The afternoon peak could correspond to the time when participants finish work, while the midday peak coincides with lunchtime, when participants might take a walk.
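
The peaks can also be read from the summary table rather than from the chart alone. The sketch below uses a made-up hourly table (`hourly_demo` and its values are hypothetical) and `slice_max()` to pull the busiest hours.

```r
library(dplyr)

# Made-up hourly averages, shaped like the summarised h_stp table.
hourly_demo <- tibble(
  time = c("08:00:00", "12:00:00", "13:00:00", "17:00:00", "23:00:00"),
  mean_t_stp = c(250, 540, 500, 590, 120)
)

# The two hours with the highest average step count, highest first.
top_hours <- hourly_demo %>%
  slice_max(mean_t_stp, n = 2)

print(top_hours)
```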


### 5.1.5 Daily steps vs. daily calories
How does the number of steps taken per day correlate with the number of calories burned per day?

In [None]:
ggplot(data = d_act, aes(x = TotalSteps, y= Calories)) +
    geom_point() + geom_smooth() + labs(title = "Total Steps vs. Calories")

As the plot shows, there is a positive correlation between the total number of steps taken and the calories burned.
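
The visual impression can be backed by a number. As a sketch on made-up values (`act_demo` is hypothetical), base R's `cor()` returns the Pearson correlation coefficient, where values near 1 indicate a strong positive linear relationship.

```r
# Made-up daily totals, for illustration only.
act_demo <- data.frame(
  TotalSteps = c(2000, 5000, 7500, 10000, 13000),
  Calories   = c(1700, 2000, 2150, 2500, 2900)
)

# Pearson correlation between steps and calories.
r <- cor(act_demo$TotalSteps, act_demo$Calories)
print(r)
```

On the notebook's data, the equivalent call would be `cor(d_act$TotalSteps, d_act$Calories)`.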

### 5.1.6 Hourly intensity through the day

In [None]:
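# Split the timestamp into date and time columns. format() keeps "00:00:00"
# for midnight rows, which separate() on a POSIXct column can turn into NA.
h_int <- h_int %>%
  mutate(date = format(date_time, "%Y-%m-%d"),
         time = format(date_time, "%H:%M:%S")) %>%
  select(-date_time)

head(h_int)

# Average intensity for each hour of the day
h_int <- h_int %>%
  drop_na() %>%
  group_by(time) %>%
  summarise(mean_t_int = mean(TotalIntensity))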

In [None]:
ggplot(data = h_int, aes(x = time, y = mean_t_int)) + geom_col(fill = "green") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Average Intensity by Hour")

As in the "Average Total Steps by Hour" chart, the users are most active between 16h30 and 17h30.
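
One way to confirm that the two charts agree is to join the hourly summaries and compare their peak hours. The sketch below uses made-up tables (`stp_demo`, `int_demo`, and their values are hypothetical) shaped like the summarised step and intensity tables.

```r
library(dplyr)

# Made-up hourly summaries, for illustration only.
stp_demo <- tibble(
  time = c("12:00:00", "17:00:00", "23:00:00"),
  mean_t_stp = c(540, 590, 120)
)
int_demo <- tibble(
  time = c("12:00:00", "17:00:00", "23:00:00"),
  mean_t_int = c(19, 21, 5)
)

# Line the two summaries up hour by hour.
both <- inner_join(stp_demo, int_demo, by = "time")

peak_stp  <- both$time[which.max(both$mean_t_stp)]
peak_int  <- both$time[which.max(both$mean_t_int)]
same_peak <- identical(peak_stp, peak_int)

print(both)
print(same_peak)
```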

# 6. Share
## 6.1 Summary of the identified trends
* Participants are sedentary for more than 16 hours a day, which can harm their health.
* Participants tend to walk around 7,600 steps daily, which can reduce some health risks.
* Average daily sleeping time is below the recommended 8 hours, but not by much.
* The most active intervals of time are between 16h30 and 17h30, and between 11h30 and 14h30.

# 7. Act
## 7.1 Recommendations for the Bellabeat team
* Collect tracking data from Bellabeat products for better results.
* Extend the data collection period to better analyze seasonal changes in variables.


## 7.2 Recommendations for the Bellabeat app
* Periodically send notifications reminding about the health benefits of leading a more active lifestyle.
* Create a daily goal system to encourage users to walk more. This system could award badges when a goal is reached, helping users feel a sense of accomplishment.
* Create campaigns and in-app advertisements to motivate individuals to input their weight information.
