**THE COMPANY**

Bellabeat is a high-tech manufacturer of health-focused products for women. It designs smart devices that inform and inspire women around the world. By collecting data on activity, sleep, stress, and reproductive health, Bellabeat has empowered women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer, has asked me to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat's marketing strategy.

**PHASE 1: ASK**

WHAT IS THE BUSINESS TASK?
I have been tasked with analysing smart device data from non-Bellabeat devices, applying the insights gained to a Bellabeat product, and using those insights to inform the Bellabeat marketing team’s strategy. In one sentence, the business task is: identify trends in the usage of non-Bellabeat devices and apply the insights to a Bellabeat product to improve the marketing strategy.

WHO ARE THE KEY STAKEHOLDERS:
Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer 
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team 
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. 

**PHASE 2: PREPARE**

DATASET SOURCE:
The dataset I will be using is the FitBit Fitness Tracker Data, a public dataset made available by Mobius on Kaggle. It contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. It contains 18 CSV files in long format.

IS THE DATA ROCCC?
A good dataset must be ROCCC, meaning it must be Reliable, Original, Comprehensive, Current and Cited.

* Reliable: This dataset is not reliable because it has a sample size of 30 respondents, which is too small to be a true representation of fitness app users. It also only accounts for two months of data, which is too short a period and may bias the results.
* Original: The dataset is not original, as it was collected through a third party, Amazon Mechanical Turk.
* Comprehensive: The dataset is not comprehensive enough, as it does not provide details such as the age or health condition of the respondents.
* Current: The dataset is not current. It dates back to 2016, so it would not reflect current trends in smart device usage.
* Cited: We have little information about the credibility of the source.

All of this makes the dataset unreliable and of limited credibility. Further analysis with more recent and reliable data is needed to make reliable recommendations.
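
To give a rough sense of why 30 respondents is a small sample, the sketch below computes the worst-case 95% margin of error for a proportion. It assumes a simple random sample, which this dataset does not guarantee, and `margin_of_error()` is a hypothetical helper that is not part of the original analysis.

```r
# Worst-case (p = 0.5) margin of error for a proportion, normal approximation.
# Assumption: simple random sample. margin_of_error() is an illustrative helper.
margin_of_error <- function(n, p = 0.5, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  z * sqrt(p * (1 - p) / n)
}

moe_30 <- margin_of_error(30)
moe_1000 <- margin_of_error(1000)
round(c(n_30 = moe_30, n_1000 = moe_1000), 3)
```

Under these assumptions, a share estimated from 30 respondents carries a margin of roughly 18 percentage points, against about 3 for 1,000 respondents.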

SORT AND FILTER DATA:
After going through the data in an Excel spreadsheet, I selected the dailyActivity_merged.csv, sleepDay_merged.csv and hourlySteps_merged.csv files because I want to find trends in how the devices are used without focusing too much on the detailed performance of the users. I believe these files will show interesting patterns.



**PHASE 3: PROCESS**

TOOLS USED:
I inspected the data in Excel, then did the analysis and visualization in R.

LOADING PACKAGES:
I loaded the following libraries to use for my analysis.


In [None]:
library(tidyverse)
library(here)
library(skimr)
library(janitor)
library(ggplot2)
library(lubridate)
library(readr)
library(tidyr)
library(ggrepel)
library(sqldf)
library(RColorBrewer)
library(dplyr)

**IMPORTING DATA:**

Next, I imported the data I will be using, which is the following:

*  Daily Activity
*  Sleep Day
*  Hourly Steps

**CLEANING AND FORMATTING DATA**

I used the str() function to check the data types. All were correct except the date columns, which were stored as character instead of date or datetime.


In [None]:
daily_activity <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleep_day <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
hourly_steps <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
str(daily_activity)
str(sleep_day)
str(hourly_steps)

I renamed the date columns and converted them using as_date() for the daily tables and as.POSIXct() for the hourly table.

In [None]:
daily_activity <- daily_activity %>%
  rename(date = ActivityDate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))
str(daily_activity)

In [None]:
sleep_day <- sleep_day %>%
  rename(date = SleepDay) %>%
  mutate(date = as_date(date,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
str(sleep_day)



In [None]:
hourly_steps<- hourly_steps %>% 
  rename(date_time = ActivityHour) %>% 
  mutate(date_time = as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
str(hourly_steps)
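
As a small check of the format string used above, the sketch below parses two made-up timestamps. Fixing `tz = "UTC"` is an assumption made here so the result does not depend on the machine's time zone; the notebook itself uses `Sys.timezone()`.

```r
# Parse 12-hour timestamps with the same format string as the cells above.
# The sample strings are illustrative; tz = "UTC" is an assumption for reproducibility.
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM")
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# %I is the 12-hour clock and %p the AM/PM marker (locale dependent),
# so 12:00:00 AM is read as midnight and 1:00:00 PM as 13:00
parsed_text <- format(parsed, "%Y-%m-%d %H:%M:%S")
parsed_text

# A string that does not match the format becomes NA instead of raising an error
unmatched <- as.POSIXct("2016-04-12", format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
is.na(unmatched)
```

Because a mismatch yields `NA` silently, it is worth checking for missing values right after any date conversion.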

I checked the total number of distinct users in each table.

In [None]:
### checking for total number of respondent
sqldf("select distinct id
      from daily_activity")

sqldf("select distinct id
      from sleep_day")

sqldf("select distinct id
    from hourly_steps")
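
The queries above list every distinct Id. If only the count is needed, it can be computed directly. The sketch below uses base R on a made-up table; with dplyr loaded, `n_distinct()` on the Id column gives the same number.

```r
# Toy stand-in for one of the tables; the Id and step values are made up.
toy_activity <- data.frame(
  Id = c(101, 101, 102, 103, 103, 103),
  TotalSteps = c(4000, 5200, 8100, 12000, 9800, 10100)
)

# Number of distinct users as a single value
n_users <- length(unique(toy_activity$Id))
n_users
```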

I checked for duplicate rows. The sleep_day table had 3.

In [None]:
# checking for duplicate entries
sum(duplicated(daily_activity))
sum(duplicated(sleep_day))
sum(duplicated(hourly_steps))

I removed the duplicate rows from sleep_day.

In [None]:
# sleep_day has 3 duplicated entries, so remove them
sleep_day <- sleep_day %>%
  distinct()
# confirm the duplicates are gone (should return 0)
sum(duplicated(sleep_day))
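
To make the duplicate check concrete, here is a minimal sketch on a made-up table. `duplicated()` flags each repeat of an earlier row, not the first occurrence, so the count equals the number of rows that get removed.

```r
# Made-up sleep records: rows 1 and 2 are identical.
toy_sleep <- data.frame(
  Id = c(1, 1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(400, 400, 380, 420)
)

n_dupes <- sum(duplicated(toy_sleep))  # counts repeats only
deduped <- unique(toy_sleep)           # keeps the first copy of each row

c(duplicates = n_dupes, rows_before = nrow(toy_sleep), rows_after = nrow(deduped))
```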


Next, I checked for null values in the data. I found none.

In [None]:
#checking for null values
colSums(is.na(daily_activity))
colSums(is.na(sleep_day))
colSums(is.na(hourly_steps))


I merged daily_activity and sleep_day to check for correlations in the data, using the "Id" and "date" columns as join keys.

In [None]:
#merge the data
merged_data <- merge(daily_activity, sleep_day, by=c ("Id", "date"))
View(merged_data)
glimpse(merged_data)

summary(merged_data)
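
One point worth keeping in mind: `merge()` keeps only rows whose keys appear in both tables unless told otherwise, so days (and users) without a sleep record drop out of the merged data. The sketch below shows the difference on made-up tables; `all.x = TRUE` is the base R switch for keeping every row of the first table.

```r
# Made-up tables: user 3 and one day of user 1 have no sleep record.
toy_activity <- data.frame(
  Id = c(1, 1, 2, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  TotalSteps = c(5000, 7000, 9000, 11000)
)
toy_sleep <- data.frame(
  Id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(420, 390)
)

# Default: only Id/date pairs present in both tables survive
inner <- merge(toy_activity, toy_sleep, by = c("Id", "date"))

# all.x = TRUE keeps every activity row and fills missing sleep with NA
left <- merge(toy_activity, toy_sleep, by = c("Id", "date"), all.x = TRUE)

c(inner_rows = nrow(inner), left_rows = nrow(left),
  inner_users = length(unique(inner$Id)))
```

This is why it is useful to recount the distinct users after merging.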


Then I checked the number of distinct users again to see whether it is still the same.

In [None]:
### checking for total number of respondent
sqldf("select distinct id from merged_data")



**PHASE 4: ANALYZE**

Here, I analyze the data.

I calculated the average daily steps, calories burned, minutes asleep and distance for each user.

In [None]:
### find the daily means grouped by Id
user_data <- merged_data %>% 
  group_by(Id) %>%
  summarise(mean_daily_steps = mean(TotalSteps), mean_daily_calories = mean(Calories),
            mean_daily_sleep = mean(TotalMinutesAsleep),
            mean_daily_distance = mean(TotalDistance))
View(user_data)

After getting the average daily steps, I classified the users into groups. The classification is based on www.10000steps.org.

In [None]:
### Classify users based on average daily steps
### Each upper bound meets the next lower bound so no average falls between groups
user_type <- user_data %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000 ~ "sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active",
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active",
    mean_daily_steps >= 10000 ~ "very active"
  ))
View(user_type)
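
Because average steps are rarely whole numbers, the bins should share their boundaries so that no value falls between two groups. One way to express that, sketched here with base R's `cut()` on made-up averages, is to list the break points once and use left-closed intervals.

```r
# Made-up averages chosen to sit on or near the bin edges.
mean_daily_steps <- c(4999, 5000, 7499.5, 7500, 9999.9, 10000)

# right = FALSE gives intervals [lower, upper), so each edge belongs
# to the more active group and every value lands in exactly one bin
user_type <- cut(
  mean_daily_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right = FALSE
)

data.frame(mean_daily_steps, user_type)
```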


Next, I created a new data frame in which one column shows the percentage of users in each group.

In [None]:
### to categorise the daily steps
user_type_group <- user_type %>%
  group_by(user_type) %>%
  summarise(total= n()) %>%
  mutate(total_percent= scales::percent (total/sum(total)))

View(user_type_group)

**PHASE 5: VISUALIZATION**

Here I visualize the data.
First, I plotted the total steps against calories burned to see if there is any correlation.



In [None]:
ggplot(data = daily_activity, mapping = aes(x = Calories, y = TotalSteps)) + geom_point() +
  geom_smooth() + labs(title = "Total Steps VS Calories Burned")


We can see there is a positive correlation between the steps taken and calories burned. This indicates that taking more steps is associated with burning more calories.

Next, I performed a statistical analysis to confirm this.

In [None]:
##Calculate the correlation coefficient.
total_steps <- daily_activity$TotalSteps
calories <- daily_activity$Calories
cor(total_steps,calories)


The result shows a correlation coefficient of about 0.59 between the two, a moderate positive correlation that is consistent with the visualization above.
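
A coefficient on its own does not say how precisely it is estimated. As a hedged sketch, base R's `cor.test()` returns the same Pearson coefficient together with a confidence interval and a p-value. The data below are synthetic and only stand in for the steps and calories columns.

```r
# Synthetic stand-in data, not the Fitbit values.
set.seed(42)
steps <- round(runif(100, min = 0, max = 20000))
calories <- 1500 + 0.08 * steps + rnorm(100, sd = 400)

# Pearson correlation test (the default method)
result <- cor.test(steps, calories)

result$estimate   # correlation coefficient
result$conf.int   # 95% confidence interval
result$p.value    # p-value for the null hypothesis of zero correlation
```

On the real data the call would be `cor.test(daily_activity$TotalSteps, daily_activity$Calories)`.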

Next, I visualized the user type distribution. This chart is based on the groups I created and shows how active the users are.

In [None]:
user_type_group %>%
  # total_percent is a text label, so the numeric count sets the slice size
  ggplot(aes(x = "", y = total, fill = user_type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("tomato2", "mediumblue", "yellow2", "seagreen")) +
  geom_text(aes(label = total_percent),
            position = position_stack(vjust = 0.5))+
  labs(title="User Type Distribution") 

To find out on which days the users walk the most, and whether they reach the recommended number of steps and minutes of sleep, I calculated the average steps taken and the average minutes slept for each day of the week.

In [None]:
week_days <- merged_data %>%
  mutate(weekday = weekdays(date))
week_days$weekday <- ordered(week_days$weekday,
                             levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                        "Friday", "Saturday", "Sunday"))

week_days <-week_days%>%
  group_by(weekday) %>%
  summarize (daily_steps = mean(TotalSteps), daily_sleep = mean(TotalMinutesAsleep))

View(week_days)
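
`weekdays()` returns day names in the session's locale, so matching them against English level names can produce `NA` on a non-English system. A locale-independent alternative, sketched below on made-up dates, is to take the ISO weekday number from `format(date, "%u")` (Monday is 1) and map it to fixed labels.

```r
# Made-up dates: a Monday, a Tuesday and a Sunday.
dates <- as.Date(c("2016-04-11", "2016-04-12", "2016-04-17"))

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# %u gives the ISO weekday number as text: "1" for Monday through "7" for Sunday
day_number <- as.integer(format(dates, "%u"))

# Index into the fixed English labels and keep the Monday-first ordering
weekday <- factor(day_names[day_number], levels = day_names, ordered = TRUE)
weekday
```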

In [None]:

ggplot(week_days) +
  geom_col(aes(weekday, daily_steps), fill = "tomato4") +
  geom_hline(yintercept = 7500) +
  labs(title = "Daily Steps Per Week Day", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1))

In [None]:
ggplot(week_days) +
  geom_col(aes(weekday, daily_sleep), fill = "springgreen4") +
  geom_hline(yintercept = 480) +
  labs(title = "Minutes Asleep Per Week Day", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1))

From the above plots and table, we can see that:

* The users are taking the recommended amount of steps.
* The users are not getting the recommended amount of sleep (480 minutes).

Now I want to find the hours of the day in which the users are most active. I used the hourly_steps table and created a time column.

In [None]:
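# Derive date and time directly from the datetime. format() always writes
# midnight as 00:00:00, whereas splitting the printed datetime on a space
# can lose the time part for midnight rows in some R versions.
hourly_steps <- hourly_steps %>%
  mutate(date = as_date(date_time),
         time = format(date_time, format = "%H:%M:%S")) %>%
  select(-date_time)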
  
head(hourly_steps)

In [None]:
hourly_steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(StepTotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) + 
  labs(title = "Hourly steps throughout the day", x="", y="") + 
  scale_fill_gradient(low = "red", high = "darkgreen") +
  theme(axis.text.x = element_text(angle = 90))

We can see that the users are active between 8 am and 7 pm. They are most active from 5 pm to 7 pm and from 12 pm to 2 pm.

Next, I checked how often the users use their smart devices. I grouped them as follows:

* high usage - users who use their device on 21 to 31 days.
* moderate usage - users who use their device on 11 to 20 days.
* low usage - users who use their device on 1 to 10 days.


In [None]:
# Note: merged_data only holds days with both activity and sleep records
app_daily_usage <- merged_data %>%
  group_by(Id) %>%
  summarize(days_used = n()) %>%
  mutate(usage = case_when(
    days_used >= 1 & days_used <= 10 ~ "low usage",
    days_used >= 11 & days_used <= 20 ~ "moderate usage",
    days_used >= 21 & days_used <= 31 ~ "high usage"
  ))
  
head(app_daily_usage)

I converted the counts to percentages to make them easier to visualize.

In [None]:
daily_usage_percent <- app_daily_usage %>%
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usage) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

daily_usage_percent$usage <- factor(daily_usage_percent$usage, levels = c("high usage", "moderate usage", "low usage"))

head(daily_usage_percent)

In [None]:
daily_usage_percent %>%
  ggplot(aes(x="",y=total_percent, fill=usage)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  geom_text(aes(label = labels), position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("sea green 4","spring green 3", "pale green"),
                    labels = c("High use - 21 to 31 days",
                                 "Moderate use - 11 to 20 days",
                                 "Low use - 1 to 10 days"))+
  labs(title="Daily Use Of Smart Device")

From the above plot, we can see that half of the users use their smart devices frequently, 38% rarely use them, and 12% are moderate users.

Next, I checked how many minutes per day the users wear their smart devices. To do this, I merged the daily_activity and app_daily_usage data frames.

In [None]:
merged_daily_usage <- merge(daily_activity, app_daily_usage, by=c ("Id"))
head(merged_daily_usage)

To calculate the minutes the users spend wearing their smart devices each day, I grouped the days as follows:

* All day - device was worn all day.
* More than half day - device was worn more than half of the day.
* Less than half day - device was worn less than half of the day.

In [None]:
minutes_worn <- merged_daily_usage %>% 
  mutate(total_minutes_worn = VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)%>%
  mutate (percent_minutes_worn = (total_minutes_worn/1440)*100) %>%
  mutate (worn = case_when(
    percent_minutes_worn == 100 ~ "All day",
    percent_minutes_worn < 100 & percent_minutes_worn >= 50~ "More than half day", 
    percent_minutes_worn < 50 & percent_minutes_worn > 0 ~ "Less than half day"
  ))

head(minutes_worn)
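
The arithmetic behind the wear-time groups can be checked by hand. The sketch below uses three made-up days and base R only: the four minute columns are summed, divided by the 1,440 minutes in a day, and classified with the same thresholds as above.

```r
# Three made-up days; column names follow the daily activity table.
toy_day <- data.frame(
  VeryActiveMinutes    = c(30, 20, 5),
  FairlyActiveMinutes  = c(20, 10, 5),
  LightlyActiveMinutes = c(200, 150, 60),
  SedentaryMinutes     = c(1190, 700, 400)
)

total_minutes_worn <- rowSums(toy_day)               # 1440, 880, 470
percent_minutes_worn <- total_minutes_worn / 1440 * 100

# Same thresholds as the case_when() above; a day with 0 minutes stays NA
worn <- ifelse(percent_minutes_worn == 100, "All day",
        ifelse(percent_minutes_worn >= 50, "More than half day",
        ifelse(percent_minutes_worn > 0, "Less than half day", NA_character_)))

data.frame(total_minutes_worn, percent = round(percent_minutes_worn, 1), worn)
```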

To visualize this better, I created four data frames. The first shows, for all users, the percentage of days that fall into each of the wear-time groups I created earlier.
The other three are filtered by usage group (high, moderate and low) and show the same percentages for each group.

In [None]:
minutes_worn_percent<- minutes_worn%>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_high_use <- minutes_worn%>%
  filter (usage == "high usage")%>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_moderate_use <- minutes_worn%>%
  filter(usage == "moderate usage") %>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_low_use <- minutes_worn%>%
  filter (usage == "low usage") %>%
  group_by(worn) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(worn) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

minutes_worn_high_use$worn <- factor(minutes_worn_high_use$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_percent$worn <- factor(minutes_worn_percent$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_moderate_use$worn <- factor(minutes_worn_moderate_use$worn, levels = c("All day", "More than half day", "Less than half day"))
minutes_worn_low_use$worn <- factor(minutes_worn_low_use$worn, levels = c("All day", "More than half day", "Less than half day"))

head(minutes_worn_percent)
head(minutes_worn_high_use)
head(minutes_worn_moderate_use)
head(minutes_worn_low_use)
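
The four summaries above repeat the same steps on different subsets. As an optional sketch, the shared logic can be wrapped in a small function and applied to each usage group. `share_by_worn()` is a hypothetical helper written in base R, and the table below is made up.

```r
# share_by_worn() is an illustrative helper, not part of the original analysis.
# It returns the share of rows in each wear-time group.
share_by_worn <- function(df) {
  counts <- table(df$worn)
  data.frame(
    worn = names(counts),
    total_percent = as.numeric(counts) / sum(counts)
  )
}

# Made-up rows standing in for minutes_worn
toy_worn <- data.frame(
  usage = c("high usage", "high usage", "high usage", "low usage", "low usage"),
  worn  = c("All day", "More than half day", "More than half day", "All day", "All day")
)

overall <- share_by_worn(toy_worn)                                    # all users
by_usage <- lapply(split(toy_worn, toy_worn$usage), share_by_worn)   # one table per group

overall
by_usage
```

Keeping the logic in one place means a change to the calculation only has to be made once.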

Now that I have created the data frames, I visualize them.

In [None]:
ggplot(minutes_worn_percent, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)) +
    scale_fill_manual(values = c("sky blue 4", "sky blue 3","sky blue"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3.5)+
  labs(title="Time Worn Per Day", subtitle = "Total Users")
 

In [None]:
ggplot(minutes_worn_high_use, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))+
    scale_fill_manual(values = c("sky blue 4", "sky blue 3","sky blue"))+
  geom_text_repel(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title= "Time Worn Per Day", subtitle = "High Usage Users")

In [None]:
ggplot(minutes_worn_moderate_use, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5)) +
    scale_fill_manual(values = c("sky blue 4", "sky blue 3","sky blue"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="Time Worn Per Day", subtitle = "Moderate Use Users")

In [None]:
 ggplot(minutes_worn_low_use, aes(x="",y=total_percent, fill=worn)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5)) +
    scale_fill_manual(values = c("sky blue 4", "sky blue 3","sky blue"))+
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5), size = 3)+
  labs(title="Time Worn Per Day", subtitle = "Low Use Users")

From the plots, we can see that 36% of all users wear their devices all day, 60% wear them for more than half a day, and 4% wear them for less than half a day.
For the high-usage users (21 to 31 days), 6.8% wear the devices all day, 88.9% for more than half a day, and 4.3% for less than half a day.
For the moderate-usage users (11 to 20 days), 27% wear the devices all day, 69% for more than half a day, and 4% for less than half a day.
For the low-usage users (1 to 10 days), 80% wear the devices all day, 18% for more than half a day, and 2% for less than half a day.

From the above, we can say that the high-usage users are the least likely to wear their smart devices all day (6.8%), while the low-usage users, on the days when they do use their devices, are the most likely to wear them all day (80%).

**PHASE 6: ACT**

My recommendations based on the analysis are:

1) Since 50% of the users are high-usage users and 36% wear their smart devices all day, Bellabeat should ensure the devices have long-lasting batteries and are water resistant. This would encourage users to wear their devices for longer periods.

2) Bellabeat could make its devices more interactive. The analysis shows that users do not get the recommended minutes of sleep per day. Bellabeat could create features that allow users to schedule sleep or nap time and that make recommendations when they do not sleep enough.

3) Bellabeat could set tasks and rewards for users. It could create daily or weekly challenges that encourage users to reach the required amount of steps and sleep, then reward them for completing the tasks.

4) Bellabeat could also send notifications to remind users before their activities are due to start, and check in when users cannot complete their activities or tasks.

5) Bellabeat could give its devices more stylish designs.

6) Bellabeat could recommend simple workout exercises for users.