In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Summary:

Bellabeat is a high-tech company that manufactures health-focused smart products such as the Bellabeat smart Leaf, smart watches, the Bellabeat app, and the Bellabeat Spring. It was founded by Urška Sršen and Sando Mur.
The company's main goal is to empower women with the data collected by these smart devices and services, which track stress, sleep cycle, activity, and reproductive health.

# 2. ASK phase
Business Objective
To identify market trends among users of non-Bellabeat smart devices, then analyse how these trends can influence Bellabeat customers and improve the company's marketing strategy.

Stakeholders

1. Urška Sršen - Bellabeat cofounder and Chief Creative Officer
2. Sando Mur - Bellabeat cofounder and key member of Bellabeat executive team
3. Bellabeat Marketing Analytics team


# 3. PREPARE Phase

3.1 Data set used

We have used FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius).

3.2 Privacy, security and accessibility of data

The dataset is released under CC0 (public domain), so anyone can download, modify, and redistribute it without asking for permission.

3.3 Information about the dataset

This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B).

3.4 Data credibility and bias (ROCCC)

* Reliable - The data is not reliable because it has only 30 respondents.
* Original - The data is not original because it was collected by a third party, Amazon Mechanical Turk.
* Comprehensive - The data is comprehensive because it matches most of the parameters tracked by Bellabeat's products.
* Current - The dataset is not current because it was recorded between 03.12.2016 and 05.12.2016, which makes it about 6 years old at the time of writing.
* Cited - The data was collected by a third party, so its provenance is weak. Overall the data is considered to be of "bad quality", and it is not recommended to base business decisions on it alone.

3.5 Data Organization

The data is distributed over 18 CSV files. Each file is in long format and records a different set of quantitative data tracked by Fitbit.




# 4. PROCESS Phase
Since the dataset is large, I use R for the analysis. R also offers many libraries for data visualization, which is ultimately useful for sharing the results with stakeholders.

4.1 Package installation
* tidyverse
* here
* skimr
* janitor
* lubridate
* ggpubr
* ggrepel



In [None]:
library(tidyverse)
library(here)
library(skimr)
library(janitor)
library(lubridate)
library(ggrepel)
library(dplyr)


4.2 Importing the dataset
For my analysis I am importing three datasets:
* Daily_activity
* Daily_sleep
* Hourly_steps


In [None]:
daily_activity2 <- read_csv(file = "../input/bellabeat-fitness-tracker-dataset/dailyActivity_merged.csv")
daily_sleep1 <- read_csv(file = "../input/bellabeat-fitness-tracker-dataset/sleepDay_merged.csv")
hourly_steps1 <- read_csv(file = "../input/bellabeat-fitness-tracker-dataset/hourlySteps_merged.csv")

4.3 Dataset Structure

We can preview each dataset with the head() function and inspect its structure with str().

In [None]:
head(daily_activity2)
str(daily_activity2)

head(daily_sleep1)
str(daily_sleep1)

head(hourly_steps1)
str(hourly_steps1)

4.4 Data cleaning and Formatting

To clean the data, we must ensure that the dataset has no redundant observations, so that each row is a unique record.


4.4.1 Number of users

The dataset description states that it covers 30 smart-device users. To check this, we count the number of unique Id values in each data file.


In [None]:
n_unique(daily_activity2$Id)
n_unique(daily_sleep1$Id)
n_unique(hourly_steps1$Id)
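
As a small illustration of this check, the sketch below counts distinct users in a made-up data frame. The `toy` data is hypothetical and not taken from the Fitbit files, and `length(unique())` is used as a base R stand-in for `n_unique()`, assuming the Id column has no missing values.

```r
# Hypothetical toy data: three users, one of them with two records
toy <- data.frame(
  Id = c(101, 101, 102, 103),
  TotalSteps = c(4000, 5200, 8000, 12000)
)

# Number of distinct users = number of unique Id values
n_users <- length(unique(toy$Id))
print(n_users)
```

If the count differs between files, some users did not contribute to every kind of record, which matters when the files are merged later.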

4.4.2 Duplicates
Check each dataset for duplicated rows.

In [None]:
sum(duplicated(daily_activity2)) #check total duplicated rows
sum(duplicated(daily_sleep1)) #check total duplicated rows
sum(duplicated(hourly_steps1)) #check total duplicated rows

4.4.3 Removing the Duplicates and N/A (null values)

The sleep dataset has 3 duplicated rows, whereas the other two datasets (daily_activity2 and hourly_steps1) have none.

In [None]:
daily_activity2 <- daily_activity2 %>% 
  distinct() %>% 
  drop_na()

daily_sleep1 <- daily_sleep1 %>% 
  distinct() %>% 
  drop_na()

hourly_steps1 <- hourly_steps1 %>% 
  distinct() %>% 
  drop_na()

Verify the step above by checking the sleep dataset for duplicates again.

In [None]:
sum(duplicated(daily_sleep1)) #check total duplicated rows
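
The same clean-and-verify cycle can be tried on a tiny example. This is only a sketch on hypothetical data; it uses base R counterparts (`unique()` and `complete.cases()`) in place of `distinct()` and `drop_na()` so that it runs without extra packages.

```r
# Hypothetical toy data: row 2 repeats row 1, row 4 has a missing value
toy <- data.frame(
  Id = c(1, 1, 2, 3),
  TotalMinutesAsleep = c(420, 420, 390, NA)
)

n_dupes <- sum(duplicated(toy))                      # count fully duplicated rows
toy_clean <- unique(toy)                             # base R counterpart of distinct()
toy_clean <- toy_clean[complete.cases(toy_clean), ]  # base R counterpart of drop_na()

print(n_dupes)
print(nrow(toy_clean))
print(sum(duplicated(toy_clean)))  # re-check after cleaning
```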


4.4.4 Renaming the Column and Cleaning the data set

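Rename all column names to lower case to keep them consistent. Note that clean_names() returns a new data frame instead of changing its input, so calling it without assigning the result only prints a preview. Its snake_case output (for example activity_date) is not used here, because the following steps expect plain lower-case names such as activitydate.

In [None]:
# Lower-case every column name; these are the names used in the rest of the notebook
daily_activity2 <- rename_with(daily_activity2, tolower)
daily_sleep1 <- rename_with(daily_sleep1, tolower)
hourly_steps1 <- rename_with(hourly_steps1, tolower)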
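
To make the effect of the lower-casing step concrete, here is a minimal sketch on a hypothetical one-row data frame. Assigning `tolower(names(...))` is a base R stand-in for `rename_with(..., tolower)`.

```r
# Hypothetical one-row data frame with Fitbit-style column names
toy <- data.frame(Id = 1, ActivityDate = "4/12/2016", TotalSteps = 1000)

# Lower-case the column names; the values themselves are untouched
names(toy) <- tolower(names(toy))
print(names(toy))
```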

4.4.5 Consistency in date and time

Next, we need to merge daily_activity2 and daily_sleep1 on id and date. For this merge to work, the date column must have the same name (here, date) and the same type in both datasets.

In [None]:
daily_activity2 <- daily_activity2 %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

daily_sleep1 <- daily_sleep1 %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))


Check the cleaned datasets.


In [None]:
head(daily_activity2)
str(daily_activity2)

head(daily_sleep1)
str(daily_sleep1)

Note: for the hourly steps dataset we need to convert the date string to a date-time format.

In [None]:
hourly_steps1<- hourly_steps1 %>% 
  rename(date_time = activityhour) %>% 
  mutate(date_time = as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

In [None]:
head(hourly_steps1)
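
The format strings used above can be checked on single values. The sketch below parses two illustrative strings laid out like the Fitbit exports, using base R only. Setting `tz = "UTC"` is an assumption made here so that the result does not depend on the machine, and `%p` matches the AM/PM marker as defined by the current locale.

```r
# Illustrative strings in the same layout as the Fitbit exports
day_string  <- "4/12/2016"
hour_string <- "4/12/2016 1:00:00 PM"

# Date only: month/day/four-digit year
day <- as.Date(day_string, format = "%m/%d/%Y")

# Date-time: %I is the hour on a 12-hour clock, %p the AM/PM marker
hour <- as.POSIXct(hour_string, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(day)
print(format(hour, "%Y-%m-%d %H:%M:%S"))
```

A parse that returns NA usually means the format string does not match the text, so checking a few values this way is a cheap safeguard before converting a whole column.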

4.5 Merging the dataset

Now we merge daily_activity2 and daily_sleep1 into a single dataset named daily_activity_sleep2, matching rows on the id and date columns. By default merge() performs an inner join, so only rows whose id and date appear in both datasets are kept.

In [None]:
daily_activity_sleep2 <- merge(daily_activity2, daily_sleep1, by = c("id", "date"))


In [None]:
glimpse(daily_activity_sleep2)
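
The effect of merging on id and date can be seen on a small example. The two data frames below are hypothetical. The sketch contrasts the default inner join with `all.x = TRUE`, which keeps every activity row and fills missing sleep values with NA.

```r
# Hypothetical activity records: user 1 on two days, user 2 on one day
activity <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(5000, 7000, 9000)
)

# Hypothetical sleep records: no sleep entry for user 1 on 2016-04-13
sleep <- data.frame(
  id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  totalminutesasleep = c(400, 450)
)

# Inner join (default): only id/date pairs present in both data frames
inner <- merge(activity, sleep, by = c("id", "date"))

# Left join: keep all activity rows, NA where sleep data is missing
left <- merge(activity, sleep, by = c("id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```

With an inner join, days without a sleep record silently drop out, so averages computed from the merged data describe only days on which both activity and sleep were logged.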


# 5. ANALYZE Phase

In this phase we address the business objective: we analyse the trends in the data and consider how they can influence Bellabeat's marketing strategy.


5.1 Users Type

Depending on their daily activity, users are divided into four categories:
* Sedentary - Less than 5000 steps a day.
* Lightly active - Between 5000 and 7499 steps a day.
* Fairly active - Between 7500 and 9999 steps a day.
* Very active - 10000 or more steps a day.

Classification has been made per the following article https://www.10000steps.org.au/articles/counting-steps/

Step1: Calculating average daily steps taken by the users.

In [None]:
daily_average <- daily_activity_sleep2 %>% 
  group_by(id) %>% 
  summarise(mean_daily_steps=mean(totalsteps), mean_daily_calories= mean(calories), mean_daily_sleep= mean(totalminutesasleep))


In [None]:
head(daily_average)
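
To show what the per-user averaging does, the sketch below computes the same kind of summary on hypothetical data. It uses base R `aggregate()` as a stand-in for `group_by()` and `summarise()`, and the result column names are simply the original column names.

```r
# Hypothetical merged records: two users with two days each
toy <- data.frame(
  id = c(1, 1, 2, 2),
  totalsteps = c(4000, 6000, 10000, 12000),
  totalminutesasleep = c(400, 420, 380, 360)
)

# Mean of each measure per user id
toy_average <- aggregate(
  cbind(totalsteps, totalminutesasleep) ~ id,
  data = toy,
  FUN = mean
)
print(toy_average)
```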

Step 2: Classifying the users by their average daily steps.

In [None]:
user_type <- daily_average %>% 
  mutate(user_type= case_when(
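    mean_daily_steps < 5000 ~ "sedentary",
    # each upper bound is the next category's lower bound, so fractional means such as 7499.5 are not left as NA
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active",
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active",
    mean_daily_steps >= 10000 ~ "Very Active"))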


Check the result of the code above.


In [None]:
head(user_type)
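
The category boundaries can also be written as half-open intervals, which makes it easy to confirm that every possible average lands in exactly one category. The sketch below uses base R `cut()` on hypothetical averages. The label spellings are illustrative, and `right = FALSE` makes each interval include its lower bound and exclude its upper bound.

```r
# Hypothetical average daily steps, including values right at the boundaries
toy_mean_steps <- c(3200, 5000, 7499.5, 9999.9, 10000, 14000)

toy_user_type <- cut(
  toy_mean_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right = FALSE  # intervals are [lower, upper), so 5000 counts as lightly active
)

print(data.frame(toy_mean_steps, toy_user_type))
```

Because averages are rarely whole numbers, boundaries written as 7499 or 9999 would leave small gaps; using the next category's lower bound as the upper limit avoids them.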

To visualize the categories in a chart, we also compute the percentage of users in each one.

In [None]:
user_type_percentage <- user_type %>% 
  group_by(user_type) %>% 
  summarise(total = n()) %>% 
  mutate(totals = sum(total)) %>% 
  group_by(user_type) %>% 
  summarise(total_percent = total/totals) %>% 
  mutate(labels= scales::percent(total_percent))

Verify the result of the code above.

In [None]:
head(user_type_percentage)
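
The share calculation itself is just a count per category divided by the total count. A minimal sketch on hypothetical labels, using base R `table()` and `prop.table()` rather than the dplyr pipeline:

```r
# Hypothetical user categories for five users
toy_types <- c("sedentary", "lightly active", "fairly active",
               "fairly active", "very active")

counts <- table(toy_types)    # users per category
shares <- prop.table(counts)  # counts divided by the total number of users

# Percentage labels for plotting, named by category
pct_labels <- paste0(round(100 * shares, 1), "%")
names(pct_labels) <- names(shares)
print(pct_labels)
```

The shares always sum to 1, which is a quick sanity check before drawing a pie chart.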

Hence we can say that users of all activity levels wear smart devices.

We draw a pie chart to show the percentage of users in each category.

In [None]:
user_type_percentage %>%
  ggplot(aes(x="",y=total_percent, fill=user_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("light blue","yellow", "green", "purple")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  labs(title="Distribution based on User type")


5.2 Steps taken and minutes asleep per weekday

Here we analyze on which days of the week users are most active, measured by the number of steps, and on which days they sleep the most.

In [None]:
weekday_steps_sleep <- daily_activity_sleep2 %>% 
  mutate(weekday= weekdays(date))

In [None]:
weekday_steps_sleep$weekday <- ordered(weekday_steps_sleep$weekday,
                                       levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                                  "Friday", "Saturday", "Sunday"))


In [None]:
weekday_steps_sleep <-weekday_steps_sleep%>%
  group_by(weekday) %>%
  summarize (daily_steps = mean(totalsteps), daily_sleep = mean(totalminutesasleep))



Verify the result of the code above.

In [None]:
head(weekday_steps_sleep, 7)
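
One caveat of this approach is that `weekdays()` returns day names in the current locale, so the English factor levels only match on an English-language system. The sketch below shows a locale-independent alternative on hypothetical data: it takes the ISO weekday number from `format(date, "%u")` and maps it to fixed English names.

```r
# Hypothetical records: 2016-04-10 and 2016-04-17 are Sundays
toy_dates <- as.Date(c("2016-04-10", "2016-04-11", "2016-04-12", "2016-04-17"))
toy_steps <- c(6000, 9000, 8000, 7000)

# %u gives the weekday as a number: 1 = Monday, ..., 7 = Sunday
day_number <- as.integer(format(toy_dates, "%u"))
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
toy_weekday <- factor(day_names[day_number], levels = day_names, ordered = TRUE)

# Mean steps per weekday; weekdays without data come out as NA
mean_steps <- tapply(toy_steps, toy_weekday, mean)
print(mean_steps)
```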

We visualize the table above to draw insights.

In [None]:
ggplot(weekday_steps_sleep) +
    geom_col(aes(weekday, daily_steps), fill = "#006699") +
    geom_hline(yintercept = 7500) +
    labs(title = "Daily steps per weekday", x= "", y = "") +
    theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))

From the bar chart above we can say that on Sundays users do not, on average, walk the recommended 7500 steps marked by the horizontal line. This suggests that they pay less attention to their activity on Sundays.


In [None]:
ggplot(weekday_steps_sleep, aes(weekday, daily_sleep)) +
    geom_col(fill = "#85e0e0") +
    geom_hline(yintercept = 480) +
    labs(title = "Minutes asleep per weekday", x= "", y = "") +
    theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))

From the second bar chart it can be observed that on no day of the week do users get, on average, the recommended 480 minutes (8 hours) of sleep marked by the horizontal line. This suggests that users are not getting enough sleep on any day of the week.
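
Reading thresholds off a chart can be cross-checked numerically. The sketch below compares weekday averages against the two reference lines used in the charts (7500 steps and 480 minutes). The numbers in `toy_summary` are made up for illustration and are not results from the Fitbit data.

```r
# Made-up weekday averages, for illustration only
toy_summary <- data.frame(
  weekday = c("Monday", "Saturday", "Sunday"),
  daily_steps = c(9200, 9800, 7200),
  daily_sleep = c(410, 405, 455),
  stringsAsFactors = FALSE
)

step_target <- 7500   # reference line in the steps chart
sleep_target <- 480   # reference line in the sleep chart (8 hours)

# Weekdays whose average falls below each target
below_steps <- toy_summary$weekday[toy_summary$daily_steps < step_target]
below_sleep <- toy_summary$weekday[toy_summary$daily_sleep < sleep_target]

print(below_steps)
print(below_sleep)
```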

# 6. Conclusion

1. Users of all activity levels wear smart devices.
2. On Sundays users do not, on average, walk the recommended 7500 steps, which suggests that they pay less attention to their activity on that day.
3. On no day of the week do users get, on average, the recommended 480 minutes (8 hours) of sleep, which suggests that they are not getting enough sleep throughout the week.