# Google Data Analyst Bellabeat Case Study 


# Deliverables 
1. A clear summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of the analysis
5. Supporting visualizations and key findings
6. Top high-level content recommendations based on the analysis

# Business Task 
***Identify trends in the use of smart health devices and apply those insights to a Bellabeat product, so that the product keeps up with people's preferences.***

# Data Sources Used 


#### FitBit Data Source
This data set comes from Kaggle through the following link: https://www.kaggle.com/arashnic/fitbit

**Acknowledgement:**

Furberg, Robert; Brinton, Julia; Keating, Michael; Ortiz, Alexa
https://zenodo.org/record/53894#.YMoUpnVKiP9


More about this data set:
* It is a second-party data set collected through Amazon Mechanical Turk.
* It is a public data set.
* It has its own identifier, so we can trace its origin and treat it as reasonably reliable.
* It has unique data about the use of smart health devices, so we can analyze it to identify possible trends.
* The data was collected in 2016, so more recent data might be needed for a better analysis.


# Data Preparation

Our first task is to load the packages and read the data into variables so that both are easier to work with during the analysis.
We will use R packages dedicated to data manipulation and visualization.


In [None]:
install.packages("tidyverse")
install.packages("janitor")
install.packages("skimr")
install.packages("lubridate")
library(tidyverse)
library(janitor)
library(skimr)
library(lubridate)

In [None]:
daily_sleep <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
daily_activity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
daily_steps  <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
weight <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
hour_calories <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hour_intensities <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
heartrate <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")


# Data Cleaning 

Now that our packages and data are ready to use, we can start the cleaning process.

In [None]:

# Daily Sleep Data, clean names, remove empty rows and arrange the data set by "id" in a new variable called "daily_sleepV2"
# We add a new column with the correct date format 

daily_sleepV2 <- daily_sleep %>% clean_names() %>% remove_empty() %>% arrange(id)
daily_sleepV2$date <-mdy_hms(daily_sleepV2$sleep_day)

# Daily Activity Data, clean names, remove empty rows and arrange the data set by "id" in a new variable called "daily_activityV2"
# We add a new column with the correct date format 

daily_activityV2 <- daily_activity %>% clean_names() %>% remove_empty() %>% arrange(id)
daily_activityV2$date <- mdy(daily_activityV2$activity_date)

# Daily Steps Data, clean names, remove empty rows, arrange the data set by "id" in a new variable called "daily_stepsV2" and rename the column "step_total" to "total_steps"
# We add a new column with the correct date format 

daily_stepsV2 <- daily_steps %>% clean_names() %>% remove_empty() %>% rename(total_steps = step_total) %>% arrange(id)
daily_stepsV2$date <- mdy(daily_stepsV2$activity_day)

# Weight Data, clean names, remove empty rows, arrange the data set by "id" in a new variable called "weightV2" and rename the column "date" to "old_date"
# We add a new column with the correct date format 

weightV2 <- weight %>% clean_names() %>% remove_empty() %>% rename(old_date=date) %>% arrange(id)
weightV2$date <- mdy_hms(weightV2$old_date)

# Hourly Calories Data, clean names, remove empty rows and arrange the data set by "id" in a new variable called "hour_caloriesV2"
# We add a new column with the correct date format 

hour_caloriesV2 <- hour_calories %>% clean_names() %>% remove_empty() %>% arrange(id)
hour_caloriesV2$date <- mdy_hms(hour_caloriesV2$activity_hour)

# Hourly Intensities Data, clean names, remove empty rows and arrange the data set by descending "total_intensity" in a new variable called "hour_intensitiesV2"
# We add a new column with the correct date format

hour_intensitiesV2 <- hour_intensities %>% clean_names() %>% remove_empty() %>%  arrange(desc(total_intensity))
hour_intensitiesV2$date <- mdy_hms(hour_intensitiesV2$activity_hour)

# Heartrate Data, clean names, remove empty rows and arrange the data set by "id" in a new variable called "heartrateV2"
# We add a new column with the correct date format

heartrateV2 <- heartrate %>% clean_names() %>% remove_empty() %>% arrange(id)
heartrateV2$date_time <- mdy_hms(heartrateV2$time)
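
The date conversions above rely on the `lubridate` parsers `mdy()` and `mdy_hms()`. The following is a minimal sketch of what they do, using made-up strings that are only assumed to resemble the month/day/year values in the raw files:

```r
library(lubridate)

# Made-up sample strings (assumed to resemble the raw month/day/year values)
raw_daily  <- c("4/12/2016", "5/1/2016")
raw_hourly <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")

# mdy() returns Date values; mdy_hms() returns date-times (UTC by default)
parsed_daily  <- mdy(raw_daily)
parsed_hourly <- mdy_hms(raw_hourly)

class(parsed_daily)    # class of the parsed daily values
hour(parsed_hourly)    # hour of day on a 24-hour clock
```

Keeping the parsed value in a new column, as the cleaning cell does, leaves the original text column available for checking the conversion.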

# Analysis and Visualizations

Now that our data sets are clean and ready, we can create some visualizations to better understand the data.

In [None]:
# Number of distinct users (ids) in each data set

n_distinct(daily_activityV2$id)
n_distinct(daily_sleepV2$id)
n_distinct(daily_stepsV2$id)
n_distinct(weightV2$id)
n_distinct(hour_caloriesV2$id)
n_distinct(hour_intensitiesV2$id)
n_distinct(heartrateV2$id)

If we look at the number of distinct users in each data set, some data sets cover fewer users than others.

The number of users that have records of:

* Their daily activity is ***33***
* Their daily sleep is ***24***
* Their daily steps is ***33***
* Their weight is ***8***
* Calories Burned during the day is ***33***
* Activity Intensity is ***33***
* Heart Rate is ***14***


Due to the small number of users, tables like weight and heart rate might not be the best for an analysis. ***More data might be needed in order to perform a more complete analysis.***
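
Note that `n_distinct()` counts users, not rows. A small sketch with a hypothetical toy table (the values are invented for illustration) shows the difference:

```r
library(dplyr)

# Toy table (hypothetical): three users, five daily records
toy_activity <- data.frame(
  id          = c(1, 1, 2, 3, 3),
  total_steps = c(4000, 5200, 9100, 300, 7600)
)

record_count <- nrow(toy_activity)            # number of rows (records)
user_count   <- n_distinct(toy_activity$id)   # number of distinct users

record_count
user_count
```

A data set can therefore have many records and still represent only a handful of users, which is the limitation that matters for the weight and heart rate tables.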

In [None]:
ggplot(data= daily_activityV2) +
    geom_smooth(mapping=aes(x=calories,y=total_steps), color= "orange") +
    labs(title="Total Steps vs Calories",x="Calories Burned", y= "Total Steps")

Here we can see a positive correlation between steps and calories. More steps mean more calories burned, which is expected.
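
The smoothed line only shows the trend visually. One way to put a number on it is the Pearson correlation coefficient from base R's `cor()`. The sketch below uses invented toy values, not the FitBit data, so the result says nothing about the real strength of the relationship:

```r
# Toy data (hypothetical values, not taken from the FitBit files)
toy_daily <- data.frame(
  total_steps = c(2000, 5000, 8000, 11000, 14000),
  calories    = c(1600, 1900, 2150, 2500, 2800)
)

# Pearson correlation coefficient: values close to 1 indicate a strong
# positive linear relationship
step_calorie_cor <- cor(toy_daily$total_steps, toy_daily$calories)
step_calorie_cor

# On the real data the call would be (assuming the cleaned table above):
# cor(daily_activityV2$total_steps, daily_activityV2$calories)
```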

Building a feature around this relationship would be a great opportunity for Bellabeat.

* ***A daily goal of steps can be implemented so users try to complete it***
* ***Adding a tracker so users know how many calories they burned just for walking can motivate them to increase the amount of steps per day in order to lose weight***

We can dig deeper into this correlation by analyzing the intensity of activities and the calories burned per hour:

In [None]:
calories_intensity_merged <- merge(hour_intensitiesV2,hour_caloriesV2, by=c("id","date"))


ggplot(data=calories_intensity_merged) +
    geom_smooth(mapping=aes(x=calories,y=total_intensity), color="orange") +
    labs(title="Activity Intensity vs Calories Burned", x= "Calories Burned", y= "Activity Intensity")

As expected, there is a strong correlation between the intensity of the activities people do and the calories burned. This makes sense: the more energy our bodies consume, the more calories we burn.

* ***Adding a list of high intensity activities that people can do might benefit users whose aim is to lose weight and be more active.***
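
The hourly comparison above depends on joining the two tables by `id` and `date`. By default, `merge()` keeps only the rows whose key combination appears in both tables. A minimal sketch with hypothetical toy tables illustrates this:

```r
# Toy hourly tables (hypothetical values)
toy_intensities <- data.frame(
  id   = c(1, 1, 2),
  date = as.POSIXct(c("2016-04-12 08:00:00", "2016-04-12 09:00:00",
                      "2016-04-12 08:00:00"), tz = "UTC"),
  total_intensity = c(10, 25, 4)
)
toy_calories <- data.frame(
  id   = c(1, 1, 3),
  date = as.POSIXct(c("2016-04-12 08:00:00", "2016-04-12 09:00:00",
                      "2016-04-12 08:00:00"), tz = "UTC"),
  calories = c(80, 120, 60)
)

# Inner join: only (id, date) pairs present in both tables survive.
# User 2 has no calories row and user 3 has no intensities row,
# so both are dropped from the result.
toy_merged <- merge(toy_intensities, toy_calories, by = c("id", "date"))
nrow(toy_merged)
```

Comparing `nrow()` before and after the merge is a quick way to check whether any hourly records were lost in the join.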

In [None]:
# New Columns, we add them so that we have the day of the week and the hour of each record.

hour_intensitiesV2$day <- wday(hour_intensitiesV2$date, label=TRUE)
hour_intensitiesV2$time <- hour(hour_intensitiesV2$date)

# New Variable, contains a new table of the daily intensity grouped by the hour of the day and adds a new column with the intensity mean. 
summary_intensity <- hour_intensitiesV2 %>% group_by(time) %>% summarise(mean_intensity=mean(total_intensity))

summary_intensity %>%
ggplot() + 
    geom_col(mapping=aes(x=time,y=mean_intensity),color="white",fill="blue")+
    labs(title="Intensity", subtitle= "Activity Intensity During the Day", x="Day Hour", y="Intensity") 

There are some hours of the day when people tend to do more high-intensity activity. I assume that is when people go to the gym or do some other physical activity.

As we can see, the highest intensity records fall between ***16:00*** and ***20:00***.

* ***Adding a notification or reminder to do physical activity during that part of the day might benefit users.***
* ***Implementing a tracker with the daily activity and intensity that users have achieved during the week or month might keep them motivated about their health.***
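
The peak hours can also be read from the summary table instead of the chart. The sketch below repeats the same group-and-average step on a hypothetical toy table and sorts the result so the busiest hour comes first:

```r
library(dplyr)

# Toy hourly records (hypothetical values): hour of day and intensity
toy_hourly <- data.frame(
  time            = c(8, 8, 17, 17, 18, 18),
  total_intensity = c(5, 7, 20, 24, 30, 26)
)

# Mean intensity per hour, sorted from most to least intense
peak_hours <- toy_hourly %>%
  group_by(time) %>%
  summarise(mean_intensity = mean(total_intensity)) %>%
  arrange(desc(mean_intensity))

peak_hours
```

Applied to `summary_intensity`, only the `arrange(desc(mean_intensity))` step would be needed, since that table is already grouped and averaged.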

Although the data shows when activity increases, we can learn more by looking at the activity levels and the time users tend to spend in each level:

In [None]:
# Summary of daily activity levels

summary_activity <- daily_activityV2 %>% 
    select(very_active_minutes,
           fairly_active_minutes,
           lightly_active_minutes,
           sedentary_minutes)

summary(summary_activity)

As we can see, *sedentary* minutes have the highest values in this summary, meaning that users spend a lot of time being inactive. On average, people spend 991 minutes a day being inactive, ***that is about 16.5 hours.***

This figure also includes the hours users spend asleep, but as the next chart shows, people sleep only between 7 and 8 hours a night.
* ***Adding a tracker so users know the amount of time they spend on every intensity level of activities might motivate them to change some habits and start to consider new ways of being more active.***
* ***Reminders or notifications might benefit users during the day: if a user exceeds a limit of inactive hours, send a reminder or notification to their phone.***
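
The conversion from minutes to hours, and the reminder idea, can be sketched in a few lines. The function name `needs_inactivity_reminder` and its 10-hour default limit are hypothetical placeholders chosen for illustration, not values derived from the data:

```r
mean_sedentary_minutes <- 991   # mean sedentary minutes quoted in the text

sedentary_hours <- mean_sedentary_minutes / 60          # minutes to hours
share_of_day    <- mean_sedentary_minutes / (24 * 60)   # fraction of a 24-hour day

sedentary_hours
share_of_day

# Hypothetical rule: flag a day when sedentary time exceeds a chosen limit.
# The 10-hour default is an arbitrary placeholder, not a recommendation.
needs_inactivity_reminder <- function(sedentary_minutes, limit_hours = 10) {
  sedentary_minutes / 60 > limit_hours
}

needs_inactivity_reminder(c(500, 991))
```

A real product would probably subtract sleep time first and let each user choose the limit.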

In [None]:
# New Column, we add the day of the week

daily_sleepV2$day <- wday(daily_sleepV2$date, label=TRUE)

# New Variable, contains a table grouped by Day and the Minutes Asleep Mean

summary_sleep <- daily_sleepV2 %>% group_by(day) %>% summarise(sleep_mean=mean(total_minutes_asleep))

summary_sleep %>%
    mutate(sleep_mean=sleep_mean / 60) %>%
    ggplot()+
    geom_col(mapping=aes(x=day,y=sleep_mean),fill="dark blue")+
    labs(title="Sleep Hours", subtitle="Sleep Hours per Day", x="Day",y="Sleep Hours")

Physical movement is not the only thing related to better health: sleeping the right number of hours and letting our bodies rest is also important.

According to commonly cited guidelines, adults should sleep between 7 and 9 hours per night.

* ***This can be achieved if we add a "Time to Sleep" reminder for users who want to improve their sleep.***
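
To see how often users actually land in the 7 to 9 hour range, each sleep record can be flagged and the flags averaged. This is a minimal sketch on invented toy values. On the real data the same steps would apply to `daily_sleepV2$total_minutes_asleep`:

```r
# Toy sleep records (hypothetical values, in minutes)
toy_sleep <- data.frame(total_minutes_asleep = c(300, 420, 480, 540, 600))

# Convert to hours and flag nights inside the 7 to 9 hour range
toy_sleep$hours_asleep <- toy_sleep$total_minutes_asleep / 60
toy_sleep$in_range <- toy_sleep$hours_asleep >= 7 & toy_sleep$hours_asleep <= 9

# The mean of a logical vector is the share of TRUE values
share_in_range <- mean(toy_sleep$in_range)
share_in_range
```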

# Recommendations for the Business

The final part of this analysis is the recommendations.
After analyzing the FitBit Fitness Tracker Data, we found some insights that can ***benefit Bellabeat's strategy and products:***

##### Daily Steps 
1. A daily goal of steps can be implemented so users try to complete it. Users can beat their own record of steps per day and track their progress. 
2. Adding a tracker so users know how many calories they burned just by walking can motivate them to increase their steps per day in order to lose weight.

##### Activity Intensity
3. Adding a list of high intensity activities that people can do might benefit users whose aim is to lose weight and be more active.
4. Adding a notification or reminder to do physical activity according to the daily schedule of each user can create a new healthy habit and benefit the user's health.
5. Implementing a tracker with the daily activity and intensity that users have achieved during the week or month can keep them motivated about their health.
6. Adding a tracker so users know the amount of time they spend on every intensity level of activities might motivate them to consider new ways of being more active.
7. Reminders or notifications can benefit users during the day: if a user exceeds a limit of inactive hours, send a reminder or notification to their phone.

##### Sleep
* Reminders for users who want to improve their sleep. At a certain hour, a notification pops up on their phone or smartwatch and tells them that they should go to bed.