Introduction Bellabeat
Bellabeat is a high-tech company that manufactures health-focused smart products.
Cofounder Urška Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around
the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with
knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly
positioned itself as a tech-driven wellness company for women.

Scenario:

I am a junior data analyst working on the marketing analyst team at Bellabeat. I've been asked to focus on one of Bellabeat's products and analyze smart device data to gain insight into how consumers are using their smart devices. Then, using this information, the stakeholders would like high-level
recommendations for how these trends can inform Bellabeat's marketing strategy.



## Now, let's explore our datasets:

* These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.
* Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

## Install the Packages

In [None]:
# Install Packages
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("here")
install.packages("stringr")
install.packages("lubridate")
install.packages("readr")
install.packages("janitor")
install.packages("skimr")
install.packages("data.table")


# Load packages
library(tidyverse)
library(ggplot2)
library(dplyr)
library(here)
library(stringr)
library(lubridate)
library(readr)
library(janitor)
library(skimr)
library(data.table)


#### 1. ASK

1. What are Bellabeat's current best-selling products, and how do they differ from other devices?
2. After identifying and analyzing how consumers use non-Bellabeat smart devices, how can these insights be applied to Bellabeat products to improve them and grow the business?


#### 2. PREPARE

This dataset contains 18 files of personal fitness tracker data from thirty Fitbit users. The files include
minute-level output for physical activity, heart rate, and sleep monitoring, along with
information about daily activity, steps, and heart rate that can be used to explore users’ habits. For this analysis, I want to use these files:

* dailyActivity_merged.csv
* dailySteps_merged.csv
* heartrate_seconds_merged.csv
* sleepDay_merged.csv
* weightLogInfo_merged.csv
* hourlyIntensities_merged.csv

In [None]:
# Read the data
daily_activity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

In [None]:
daily_steps <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")

In [None]:
head(daily_steps)

In [None]:
heartrate_seconds <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")

In [None]:
# Check how many unique users are in this dataset
n_distinct(heartrate_seconds$Id)

In [None]:
head(heartrate_seconds)

In [None]:
sleep_day <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

In [None]:
head(sleep_day)

In [None]:
weight_log <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

In [None]:
head(weight_log)

In [None]:
# Now, understand the daily_activity dataset: look at its structure and take a glimpse
str(daily_activity)


In [None]:
glimpse(daily_activity)

In [None]:
head(daily_activity, 5)

In [None]:
# Check for missing (NA) values in each table
print(colSums(is.na(daily_activity)))
cat("\n\n")
print(colSums(is.na(daily_steps)))
cat("\n\n")
print(colSums(is.na(heartrate_seconds)))
cat("\n\n")
print(colSums(is.na(sleep_day)))
cat("\n\n")
print(colSums(is.na(weight_log)))
cat("\n\n")


In [None]:
# weight_log has one feature, Fat, that is missing 65 values.
# Let's see the data
head(weight_log)

In [None]:
# Missing values make up more than 80% of the Fat column, so I decided to drop it.
# LogId does not contribute to this analysis, so I drop that column as well
weight_log <- within(weight_log, rm(Fat,LogId))

In [None]:
head(weight_log)

#### 3. PROCESS

#### In the first dataset, the ActivityDate feature is stored as a character data type. We will convert it to the Date data type.

In [None]:
daily_activity$ActivityDate <- as.Date(daily_activity$ActivityDate, "%m/%d/%Y")
head(daily_activity)

In [None]:
# Let's see how many unique users are in this dataset
n_distinct(daily_activity$Id)

In [None]:
# First, let's take a quick look at summary statistics for the daily_activity data
daily_activity_summary <- daily_activity %>%
    select(-Id, -ActivityDate) %>%   # drop identifier and date before summarizing
    summary()
daily_activity_summary

This summary gives a better overview of the data. For instance, across the daily records of these 33 users, total steps range from a minimum of 0 to a maximum of 36019, with an average of 7638. The median, which indicates the central tendency of the data, is 7406 steps. It is slightly below the mean, so the bulk of the distribution looks fairly symmetric, although the minimum of 0 and the maximum of 36019 point to a long right tail rather than a strictly normal distribution. The 1st quartile of 3790 means that 25% of daily records have fewer than 3790 total steps, while the 3rd quartile means that 75% of daily records have fewer than 10727 steps.
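
As a minimal sketch of how these summary figures are read, the quartiles can be reproduced with `quantile()`. The vector `steps_example` below is made up for illustration and is not the Fitbit data; with the real data, `daily_activity$TotalSteps` would take its place.

```r
# Hypothetical daily step counts, for illustration only (not the Fitbit data)
steps_example <- c(0, 2500, 4000, 6000, 7500, 8000, 9500, 11000, 12500, 30000)

# Quartiles: 25% of observations fall at or below the first value,
# 50% at or below the second (the median), 75% at or below the third
q <- quantile(steps_example, probs = c(0.25, 0.50, 0.75))
print(q)

# A mean noticeably above the median hints at a long right tail
mean_steps   <- mean(steps_example)
median_steps <- median(steps_example)
right_skewed <- mean_steps > median_steps
cat("mean:", mean_steps, "median:", median_steps, "right skewed:", right_skewed, "\n")
```

In this toy vector a single very large value pulls the mean above the median, which is the same pattern the real summary shows on a smaller scale.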

In [None]:
install.packages("rstatix")
install.packages("ggpubr")
library(rstatix)
library(ggpubr)

In [None]:
# See plot summarystats on 2 features: 
#ggsummarystats(
 # daily_activity, x , y, 
  #ggfunc = ggboxplot, add = "jitter")

In [None]:
# See the relationship between totalsteps and calories through ggplot
ggplot(data=daily_activity) +
geom_point(mapping=aes(x=TotalSteps,y=Calories))

In [None]:
# More clearly with line trend
ggplot(data=daily_activity) +
geom_smooth(mapping=aes(x=TotalSteps,y=Calories)) +
geom_point(mapping=aes(x=TotalSteps,y=Calories))

* The plot shows a positive relationship between these two variables: the more total steps users took, the more calories they burned.
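
To put a number on a relationship like this, one option is Pearson's correlation via `cor()`, together with a simple linear fit. This is only a sketch on made-up values: the data frame `steps_calories_example` and its numbers are hypothetical, and the same calls could be pointed at `daily_activity$TotalSteps` and `daily_activity$Calories`.

```r
# Hypothetical values, for illustration only (not the Fitbit data)
steps_calories_example <- data.frame(
  TotalSteps = c(1000, 3000, 5000, 8000, 12000, 15000),
  Calories   = c(1500, 1700, 1900, 2300, 2700, 3100)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(steps_calories_example$TotalSteps, steps_calories_example$Calories)
print(r)

# Linear fit: the TotalSteps coefficient estimates extra calories per additional step
fit <- lm(Calories ~ TotalSteps, data = steps_calories_example)
print(coef(fit))
```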

In [None]:
# Let's look at the BMI distribution in the weight log dataset
weight_log %>%
    select(Id, BMI) %>%   # select() keeps the columns; summarise() is meant for aggregation
    ggplot(aes(BMI)) +
    geom_histogram(binwidth=.5)

* BMI stands for body mass index. It is calculated as a person's weight in kilograms divided by the square of their height in meters. BMI does not measure body fat directly, but it appears to be strongly correlated with various metabolic and disease outcomes.
* According to cdc.gov, BMI is interpreted using standard weight status categories. Let's check out this dataset and see how BMI relates to weight status.
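
The formula and the standard adult categories described above can be sketched in a few lines of base R. The helper names `bmi()` and `bmi_category()` are hypothetical, and the example weight and height are made up.

```r
# BMI = weight in kilograms / (height in meters)^2
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Standard adult categories: <18.5, 18.5 to <25, 25 to <30, and 30 or above
bmi_category <- function(bmi_value) {
  cut(bmi_value,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal or Healthy Weight", "Overweight", "Obese"),
      right = FALSE)  # left-closed intervals, e.g. [18.5, 25)
}

example_bmi <- bmi(weight_kg = 70, height_m = 1.75)  # hypothetical person
print(round(example_bmi, 1))
print(as.character(bmi_category(c(example_bmi, 18.5, 25, 30))))
```

Using left-closed intervals means boundary values such as exactly 18.5 or 25 always land in a category.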


In [None]:
weight_categories <- weight_log %>%
    select(-Date, -IsManualReport) %>%
    group_by(Id) %>%
    summarise(BMI=mean(BMI),
              # Conditions are checked in order, so boundary values such as 18.5
              # or 24.95 always fall into a category instead of returning NA
              weight_status = case_when(BMI < 18.5 ~ "Underweight",
                                       BMI < 25.0 ~ "Normal or Healthy Weight",
                                       BMI < 30.0 ~ "Overweight",
                                       BMI >= 30.0 ~ "Obese"))
weight_categories

In [None]:
install.packages("RColorBrewer")

In [None]:
library("RColorBrewer")

In [None]:
# Visualize BMI and weight status
ggplot(weight_categories) + 
  geom_bar(aes(x = as.character(Id), y = BMI, fill = weight_status), stat = "identity", position = "dodge") + 
  scale_fill_brewer(type = "qual", name = "") +  # customise colours and legend title
  coord_flip() # make bars horizontal
                      

*  The chart shows that half of the users in this dataset are overweight.
*  One user has a BMI over 45, which falls in the obese category.

In [None]:
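# Check out the next dataset, sleep_day
# Split the SleepDay column into Date and Time
# (extra = "merge" keeps any AM/PM suffix with Time instead of discarding it with a warning)
sleep_day <- sleep_day %>%
    separate(SleepDay, c("Date","Time"), sep=" ", extra="merge")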

In [None]:
sleep_day$Date <- as.Date(sleep_day$Date, "%m/%d/%Y")
head(sleep_day)

In [None]:
# merge daily_activity and sleep_day
daily_merge_sleep <- left_join(sleep_day,daily_activity,by=c("Id"="Id","Date"="ActivityDate"))
head(daily_merge_sleep)

In [None]:
glimpse(daily_merge_sleep)

In [None]:
# Let's check how many unique participants are in the dataset
n_distinct(daily_merge_sleep$Id)

* The daily_activity dataset has more users than sleep_day. Because sleep_day is the left table in left_join, the merged data keeps only the users and days that appear in sleep_day.
* So how does activity affect our sleep, or vice versa? We can see what the data tells us by exploring the relationship between some features in this merged dataset. Does exercise help users sleep longer, or reduce the amount of time they lie awake in bed during the night?
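
Why the merged table ends up with fewer users can be seen on a toy example. The sketch below uses base R's `merge()` with `all.x = TRUE`, which behaves like `left_join()` for this purpose; both data frames are made up for illustration.

```r
# Hypothetical tables: three users have activity data, only two have sleep data
sleep_toy    <- data.frame(Id = c(1, 2), TotalMinutesAsleep = c(420, 380))
activity_toy <- data.frame(Id = c(1, 2, 3), TotalSteps = c(9000, 4000, 12000))

# Keep every row of the left table (sleep); user 3 has no sleep record,
# so that user does not appear in the result
left_joined <- merge(sleep_toy, activity_toy, by = "Id", all.x = TRUE)
print(left_joined)

n_users_joined   <- length(unique(left_joined$Id))
n_users_activity <- length(unique(activity_toy$Id))
cat("users after join:", n_users_joined, "of", n_users_activity, "\n")
```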

In [None]:
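# Check out Calories distribution
daily_merge_sleep %>%
    select(Id, Calories) %>%   # select() keeps the columns; summarise() is meant for aggregation
    ggplot(aes(Calories)) +
    geom_histogram() +
    geom_vline(aes(xintercept=mean(Calories, na.rm=TRUE)),   # Ignore NA values for mean
               color="red", linetype="dashed", linewidth=1)  # linewidth replaces size for lines in ggplot2 >= 3.4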

In [None]:
# Check for missing (NA) values
print(colSums(is.na(daily_merge_sleep)))
cat("\n\n")

In [None]:
# Identify 3 types of users based on their activity level.
# The cutoffs 3790 and 10727 are the 1st and 3rd quartiles of TotalSteps from the summary above.
activity_status <- daily_merge_sleep %>%
    filter(TotalSteps > 0) %>%
    group_by(Id) %>%
    summarize(total_steps = sum(TotalSteps), mean_steps = mean(TotalSteps)) %>%
    mutate(activity_level = case_when(
        mean_steps >= 10727 ~ "Very Active",
        mean_steps < 10727 & mean_steps > 3790 ~ "Normal Active",
        mean_steps <= 3790 ~ "Sedentary Active"))


In [None]:
sleep_status <- daily_merge_sleep %>%
    filter(TotalMinutesAsleep > 0) %>%
    group_by(Id) %>%
    summarize(total_asleep = sum(TotalMinutesAsleep), mean_sleep = mean(TotalMinutesAsleep)) %>%
    mutate(sleep_level = case_when(
        mean_sleep >= 490 ~ "Over Sleep",
        mean_sleep < 490 & mean_sleep > 361 ~ "Normal Sleep",
        mean_sleep <= 361 ~ "Light Sleep"))

sleep_status

In [None]:
# merge these 2 dataset into 1
combined_data <- merge(activity_status, sleep_status, by="Id")
combined_data

In [None]:
# Visualize the combined data with ggplot
ggplot(combined_data, aes(x=activity_level,fill=sleep_level)) +
geom_bar()

In [None]:
# Each user has a single mean_steps value here, so every facet shows one flat box
ggplot(combined_data, aes(x=activity_level, y=mean_steps, color=sleep_level)) +
    geom_boxplot() +
    facet_wrap(~Id, scales="free")

* The initial hypothesis might be that the more you exercise, the more tired you are and the more you sleep. But in this sample, users who are very active tend to sleep less than the normal range.
* With the help of a tracker device, users can monitor their daily activity and adjust their habits to get better sleep over time.
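
One way to cross-check a pattern read from a stacked bar chart is a contingency table of activity level against sleep level. The sketch below runs on made-up users: the data frame `levels_example` is hypothetical, and with the real data `combined_data` would take its place.

```r
# Hypothetical users, for illustration only (not the Fitbit data)
levels_example <- data.frame(
  activity_level = c("Very Active", "Very Active", "Normal Active",
                     "Normal Active", "Sedentary Active"),
  sleep_level    = c("Light Sleep", "Normal Sleep", "Normal Sleep",
                     "Normal Sleep", "Over Sleep")
)

# Count users in each combination of activity level and sleep level
counts <- table(levels_example$activity_level, levels_example$sleep_level)
print(counts)

# Row-wise shares: within each activity level, how sleep levels are split
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

With only a handful of users per group, the shares are easier to compare than raw bar heights, but they remain sensitive to individual users.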

In [None]:
# check out the sleep day data
head(sleep_day)

In [None]:
# Let's see how many users are in this dataset
n_distinct(sleep_day$Id)

In [None]:
# Understand the statistical summary for sleep_day dataset
sleep_day %>%  
  select(TotalSleepRecords,
  TotalMinutesAsleep,
  TotalTimeInBed) %>%
  summary()

In [None]:
# Let's compare the time people spend in bed with their total sleep time
avr_bed_sleep <- mutate(sleep_day, TimeInBedvsSleep=TotalTimeInBed - TotalMinutesAsleep) %>%
group_by(Id) %>%
summarize(mean_time = mean(TimeInBedvsSleep))
avr_bed_sleep

* Since an R plot won't show all labels when they are too long, I relabeled the user Ids as 1 to 24 (there are 24 unique users in this dataset), just for the sake of this visualization.

In [None]:
barplot(height = avr_bed_sleep$mean_time,
        names.arg = seq_along(avr_bed_sleep$mean_time),  # label users 1..n instead of hard-coding 24
        las=2)

* Most users spend, on average, around 10 to 30 minutes in bed without being asleep.
* Two users have a much higher average: one more than 150 minutes and the other more than 300 minutes.
* We can also see the relationship between total time in bed and total time asleep in the plot below.


In [None]:
ggplot(data=sleep_day, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) +
geom_point() +
geom_smooth()

* The chart shows a positive relationship between minutes asleep and time in bed.
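
The users with an unusually large gap between time in bed and time asleep can also be picked out programmatically. This is a minimal base R sketch on made-up records: the data frame `bed_example` is hypothetical, and the 60-minute cutoff is an assumption chosen for illustration, not a value from the dataset.

```r
# Hypothetical sleep records, for illustration only (not the Fitbit data)
bed_example <- data.frame(
  Id                 = c("A", "A", "B", "B", "C", "C"),
  TotalMinutesAsleep = c(400, 420, 300, 320, 450, 430),
  TotalTimeInBed     = c(420, 445, 480, 470, 470, 455)
)

# Minutes spent in bed but not asleep, per record
bed_example$awake_in_bed <- bed_example$TotalTimeInBed - bed_example$TotalMinutesAsleep

# Average gap per user
mean_gap <- aggregate(awake_in_bed ~ Id, data = bed_example, FUN = mean)
print(mean_gap)

# Flag users whose average gap exceeds the assumed cutoff
gap_threshold <- 60  # assumed cutoff in minutes
flagged <- as.character(mean_gap$Id[mean_gap$awake_in_bed > gap_threshold])
print(flagged)
```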

## Conclusions:
1. Users can monitor their sleep time to build a better routine and improve their health. Our device could also include an alarm or reminder that prompts users to go to sleep.
2. With the activity tracking function, Bellabeat products can track users' activity and recommend types of activity that help them sleep better. For instance, they could suggest which time of day suits a heavy workout such as running, Tabata, or outdoor bike rides, or when to do yoga to get better sleep and improve heart rate.
3. Since BMI is used to screen for overweight and obesity, users can track it while adjusting their weight to get healthier. Our product could include features that recommend healthy meals low in calories or carbs.
4. With the heart rate tracker, users can not only monitor their quality of sleep but also adjust their exercise timing to allow heart rate recovery, then build a better exercise routine over time.



# Thank you for your time! Hope you enjoy this analysis!