# 1.0 Introduction
This is the capstone project for the Google Data Analytics Professional Certificate, Case Study 2: How Can a Wellness Technology Company Play It Smart?

Project completed by: Souritra Banerjee dated 3-May-2022

# 2.0 Summary of Business Task
What is the problem you are trying to solve?
- Investigate and identify trends in smart device usage to understand how consumers use non-Bellabeat smart devices.
- Identify how the trends discovered can be applied to Bellabeat customers.

How can your insights drive business decisions?

- The trends discovered can inform Bellabeat's marketing strategy and unlock new growth opportunities for the company.

# 3.0 Description of All Data Sources Used
## 3.1 Source, Licensing, Privacy
- I would like to thank Möbius for providing this dataset, which made this analysis of smart wellness device usage and its trends possible.

- License: CC0: Public Domain

- Source: https://zenodo.org/record/53894#.X9oeh3Uzaao

- Privacy: These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

## 3.2 Data Credibility Check Using ROCCC Method

- Reliability: This Kaggle dataset contains personal fitness tracker data from 30 eligible Fitbit users. The sample is too small to reflect the overall population, so sampling bias is likely. Adding another dataset to increase the sample size could help address this limitation. In addition, the dataset description states that ‘Thirty eligible Fitbit users consented to the submission of personal tracker data’ but does not explain what made a user ‘eligible’; further investigation is needed to find out the criteria.

- Original: The datasets are third-party data released into the public domain by Möbius, not obtained directly from the service provider, Amazon Mechanical Turk. Hence, the originality of the datasets is low.

- Comprehensive: Information on age, gender, the type of device used for tracking, and so on is missing. Hence, these datasets are not comprehensive.

- Current: These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016, about six years before this analysis (May 2022), so they are not current.

- Cited: These datasets are crowdsourced data generated by respondents to a distributed survey via Amazon Mechanical Turk. Hence, the data source is considered properly cited.

In [1]:
#Load Libraries
library(tidyverse)
library(janitor)
library(skimr)
library(here)
library(dplyr)
library(lubridate)
library(ggplot2)

In [3]:
#Import Datasets

dailyActivity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
dailyCalories <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
dailyIntensities <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
dailySteps <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
heartrate_seconds <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
hourlyCalories <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hourlyIntensities <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
hourlySteps <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
minuteCaloriesNarrow <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteCaloriesNarrow_merged.csv")
minuteCaloriesWide <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteCaloriesWide_merged.csv")
minuteMETsNarrow <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv")
minuteSleep <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")
minuteStepsNarrow <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv")
minuteStepsWide <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteStepsWide_merged.csv")
sleepDay <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weightLogInfo <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")


## 3.3 Explore Dataset

- The dataset contains 18 tables at daily, hourly, and minute granularity.
- Some preprocessing had already been done: the hourly and minute tables had been aggregated and merged into the daily activity table.

#### The following 3 datasets will be used for trend analysis:

- dailyActivity_merged.csv
- sleepDay_merged.csv
- weightLogInfo_merged.csv
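
Before relying on these tables, it can help to check how many participants and rows each one actually contains, since the sample size is one of the concerns raised in the ROCCC review. The following is a minimal sketch on made-up stand-in data; `activity_toy` and `sleep_toy` are hypothetical names, and the same calls would be applied to `dailyActivity`, `sleepDay`, and `weightLogInfo`.

```r
library(dplyr)

# Hypothetical stand-ins for dailyActivity and sleepDay (made-up values).
activity_toy <- tibble(Id = c(1, 1, 2, 3), TotalSteps = c(5000, 7000, 3000, 9000))
sleep_toy    <- tibble(Id = c(1, 2, 2), TotalMinutesAsleep = c(400, 350, 420))

# Count distinct participants and rows in each table.
participants <- tibble(
  table = c("activity", "sleep"),
  users = c(n_distinct(activity_toy$Id), n_distinct(sleep_toy$Id)),
  rows  = c(nrow(activity_toy), nrow(sleep_toy))
)
print(participants)

# Participants who logged activity but have no sleep records.
missing_sleep <- setdiff(unique(activity_toy$Id), unique(sleep_toy$Id))
print(missing_sleep)
```

A table with far fewer participants than the others will contribute little to any joined analysis, which is worth knowing before merging.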

# 4.0 Documentation of Cleaning or Manipulation of Data
## 4.1 Data Preparation
Clean the column names in the data frames with `clean_names()` from janitor, which converts them to snake_case.

In [4]:
dailyActivity_merged_2 <- clean_names(dailyActivity)
sleepDay_merged_2 <- clean_names(sleepDay)
weightLogInfo_merged_2 <- clean_names(weightLogInfo)

View(dailyActivity_merged_2)
View(sleepDay_merged_2)
View(weightLogInfo_merged_2)


#### Cleaning Dates

dailyActivity stores its dates in mdy format, whereas weightLogInfo and sleepDay store them in mdy_hms format. All three need to be converted to Date values in ymd format.

In [5]:
#standardize date format
dailyActivity_merged_2$activity_date <- as.Date(dailyActivity_merged_2$activity_date, "%m/%d/%Y")
#parse the date-time strings first, then drop the time part;
#as.Date() on a parsed date-time takes no format string
weightLogInfo_merged_2$date <- parse_date_time(weightLogInfo_merged_2$date, orders = 'mdy HMS')
weightLogInfo_merged_2$date <- as.Date(weightLogInfo_merged_2$date)
sleepDay_merged_2$sleep_day <- parse_date_time(sleepDay_merged_2$sleep_day, orders = 'mdy HMS')
sleepDay_merged_2$sleep_day <- as.Date(sleepDay_merged_2$sleep_day)
View(sleepDay_merged_2)
View(weightLogInfo_merged_2)
View(dailyActivity_merged_2)

#examining the structure of dataframes after formatting
str(sleepDay_merged_2)
str(weightLogInfo_merged_2)
str(dailyActivity_merged_2)
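
The date standardization can also be sketched with lubridate's `mdy()` and `mdy_hms()` helpers. This is an alternative illustration, not the code used in this notebook; the sample strings are illustrative values in the two formats described above.

```r
library(lubridate)

# Illustrative strings in the two formats described above.
activity_raw <- c("4/12/2016", "5/2/2016")
sleep_raw    <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# mdy() parses month/day/year strings straight to Date.
activity_date <- mdy(activity_raw)

# mdy_hms() parses the date-time; as_date() then drops the time part.
sleep_date <- as_date(mdy_hms(sleep_raw))

print(activity_date)
print(sleep_date)

# Both columns are now comparable Date values and can serve as join keys.
print(all(activity_date == sleep_date))
```

Once both columns hold plain Date values, they can be matched row by row in a join.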


#### Merge Data Frames

The two data frames have different numbers of observations, so a left join is used to keep every row of the daily activity table. Activity rows with no matching sleep record get NA in the sleep columns, which is replaced by 0 here.

In [6]:
daily_activity_sleep <- merge(x= dailyActivity_merged_2, y= sleepDay_merged_2,
                              by.x = c("id", "activity_date"), by.y = c("id", "sleep_day"), all.x = TRUE)
daily_activity_sleep [is.na(daily_activity_sleep)] <- 0
View(daily_activity_sleep)
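
To illustrate what the left join does, here is a minimal sketch on made-up rows (`activity_toy` and `sleep_toy` are hypothetical stand-ins for the cleaned tables). It also adds a flag column, which is an assumption of this sketch rather than part of the analysis, so that days without a sleep record stay distinguishable after NA values are overwritten.

```r
# Made-up rows standing in for the cleaned activity and sleep tables.
activity_toy <- data.frame(
  id = c(1, 1, 2),
  activity_date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  calories = c(1985, 1797, 2100)
)
sleep_toy <- data.frame(
  id = 1,
  sleep_day = as.Date("2016-04-12"),
  total_minutes_asleep = 327
)

# Left join: every activity row is kept; unmatched sleep columns become NA.
joined <- merge(activity_toy, sleep_toy,
                by.x = c("id", "activity_date"), by.y = c("id", "sleep_day"),
                all.x = TRUE)

# Flag days that have a sleep record before any NA is replaced.
joined$has_sleep_record <- !is.na(joined$total_minutes_asleep)
print(joined)
```

Once NA is replaced by 0, a day with no sleep record looks the same as a day with zero minutes asleep, so a flag like this makes it possible to separate the two cases later.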


# 5.0 Data Analysis
## 5.1 Create Categories

- Sleep into < 6h, 6h-8h, > 8h
- Calories into < 1500, 1500-2500, > 2500
- Distance into < 5km, 5km-10km, > 10km

In [7]:
daily_activity_sleep <- daily_activity_sleep %>% 
  mutate(sleep_categories = case_when(
    total_minutes_asleep >360 & total_minutes_asleep <= 480 ~ "6h-8h",
    total_minutes_asleep > 480 ~ "> 8h",
    TRUE ~ "< 6h"
  )) %>% 
  mutate(calorie_categories = case_when(
    calories > 1500 & calories <= 2500 ~ "1.5k-2.5k",
    calories > 2500 ~ "> 2.5k",
    TRUE ~ "< 1.5k"
  )) %>% 
  mutate(distance_categories = case_when(
    total_distance > 5 & total_distance <= 10 ~ "5km-10km",
    total_distance > 10 ~ "> 10km",
    TRUE ~ "<5km"
  ))

View(daily_activity_sleep)
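
The same binning can be sketched with base R's `cut()`, shown here as an alternative illustration on made-up values rather than the method used above. With the default right-closed intervals, the breaks reproduce the boundaries of the `case_when()` rules: exactly 360 minutes falls in "< 6h" and exactly 480 minutes falls in "6h-8h".

```r
# Made-up values chosen to sit on and around the category boundaries.
minutes_asleep <- c(0, 360, 361, 480, 481)

# Intervals are (-Inf, 360], (360, 480], (480, Inf].
sleep_categories <- cut(
  minutes_asleep,
  breaks = c(-Inf, 360, 480, Inf),
  labels = c("< 6h", "6h-8h", "> 8h")
)

print(data.frame(minutes_asleep, sleep_categories))
```

`cut()` returns an ordered set of factor levels, so plots show the categories in the order of the breaks instead of alphabetical order.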

## 5.2 Create Visualization

In [8]:
#Correlation between distance & calories burnt
ggplot(data= daily_activity_sleep) +
  geom_boxplot(mapping= aes(x=distance_categories, y= calories, fill= distance_categories))

In [9]:
#Correlation between sleep & calories burnt
ggplot(data= daily_activity_sleep) +
  geom_boxplot(mapping= aes(x=sleep_categories, y= calories, fill= sleep_categories)) +
  facet_wrap(~distance_categories)

## 5.3 Summary of Data Analysis

*Correlation between distance & calories burnt*

- The boxplot shows a positive relationship between the distance covered and the calories burnt: the greater the distance, the more calories burnt.
- A person covering less than 5km typically burns around 1800 calories a day.
- A person covering 5km-10km typically burns around 2400 calories a day.
- A person covering more than 10km typically burns around 3100 calories a day.

*Correlation between sleep & calories burnt*

- People who sleep <6h a day and people who sleep >8h a day burn fewer calories than people who sleep 6h-8h while covering a similar distance.
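
Values read off a boxplot can be backed by a grouped summary. The sketch below uses made-up rows in a hypothetical `toy` table, so its numbers are not results from the Fitbit data; on the real data the same pipeline would run on `daily_activity_sleep`.

```r
library(dplyr)

# Made-up rows; not taken from the Fitbit dataset.
toy <- tibble(
  distance_categories = c("<5km", "<5km", "<5km", "5km-10km", "5km-10km", "> 10km"),
  total_distance = c(2, 3, 4, 6, 8, 12),
  calories = c(1600, 1800, 2000, 2200, 2600, 3000)
)

# Days, median and mean calories per distance category.
calorie_summary <- toy %>%
  group_by(distance_categories) %>%
  summarise(
    days = n(),
    median_calories = median(calories),
    mean_calories = mean(calories),
    .groups = "drop"
  )
print(calorie_summary)

# Pearson correlation between the underlying numeric columns.
distance_calorie_cor <- cor(toy$total_distance, toy$calories)
print(distance_calorie_cor)
```

Adding `sleep_categories` to the `group_by()` call would give the corresponding table for the sleep comparison within each distance band.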


# 6.0 Business Recommendations
Based on the analysis conducted, my recommendations for Bellabeat are as follows:

- The analysis shows a relationship between sleep and calories burnt. This can be used to show customers the benefits of tracking sleep when pursuing weight loss goals.
- A marketing strategy can be built around how much sleep the body needs, how it can be achieved, and how Bellabeat can help customers track and improve it.
- One of the most beneficial features of smart wearable devices is motivating customers to lead healthier lifestyles. A peer comparison feature could be developed to encourage customers to increase their activity level and improve their health.
- As the data quality is not great based on the ROCCC review, all the above recommendations require further validation.