# <span style="color:#22223b"> Project_01 Bellabeat Case Study  </span>
![bellabeat_logo](https://th.bing.com/th/id/R.a1e9c2ad22f7cfab0c59fbd1cd9fc2fc?rik=17y7jjy%2bo3NWsg&riu=http%3a%2f%2fdizajn.hr%2fwp-content%2fuploads%2f2017%2f01%2fad_hdd-900x600.jpeg&ehk=MvAFG1rywDFxu0txBwxQCcmM6ArCfUcWrH0MGY%2b4A9U%3d&risl=&pid=ImgRaw&r=0)
# <span style="color:#22223b"> Table of Contents </span>
#### [1. Introduction](#summary_1) 

#### [2. Ask Phase](#ask_phase)
> #### [2.1 Summary of Business Task](#business_task_2.1)

#### [3. Prepare Phase](#prepare_3)
> #### [3.1 Data Sources Used and Description](#data_used_3.1)
> #### [3.2 Data Licensing, Credibility and Accessibility](#data_used_3.2)
> #### [3.3 Data Integrity](#data_used_3.3)

#### [4. Process Phase](#process_4)
> #### [4.1 Loading Packages](#process_4.1)
> #### [4.2 Importing Relevant Data](#process_4.2)
> #### [4.3 Preview of Imported Datasets](#process_4.3)
> #### [4.4 Data Wrangling and Formatting](#process_4.4)
>> #### [4.4.1 Number of Unique users](#process_4.4.1)
>> #### [4.4.2 Duplicate Entries](#process_4.4.2)
>> #### [4.4.3 Checking for NA](#process_4.4.3)
>> #### [4.4.4 Removing Duplicates and NA](#process_4.4.4)
>> #### [4.4.5 Consistency of Date columns](#process_4.4.5)
> #### [4.5 Merging Data sets](#process_4.5)

#### [5. Analyse and Share Phase](#analyse_5)
> #### [5.1 Summary Statistics](#analyse_5.1)
> #### [5.2 Weekday Summaries](#analyse_5.2)
>> #### [5.2.1 Calories burned and Hours asleep per weekday](#analyse_5.2.1)
>> #### [5.2.2 Daily steps and Distance per weekday](#analyse_5.2.2)
> #### [5.3 Correlations](#analyse_5.3)
> #### [5.4 Use of Smart Device](#analyse_5.4)

#### [6. Conclusion and Recommendations (Act Phase)](#act_6)


# <span style="color:#22223b"> 1. Introduction </span> <a id="summary_1"></a>
Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights discovered will then help guide marketing strategy for the company.
My report will include the following deliverables:
1. A clear summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of the analysis
5. Supporting visualizations and key findings
6. My top high-level content recommendations based on the analysis

# <span style="color:#22223b"> 2. Ask Phase </span> <a id="ask_phase"></a>
Sršen has asked me to analyze smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one Bellabeat product to apply these insights to in my presentation. These questions guide the analysis:
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?
 
#### <span style="color:#22223b"> 2.1 Summary of Business Task </span> <a id="business_task_2.1"></a>  
Identify trends in how consumers use non-Bellabeat smart devices and apply those insights to Bellabeat’s marketing strategy.

# <span style="color:#22223b"> 3. Prepare Phase </span> <a id="prepare_3"></a>
The prepare phase ensures you have all of the data you need for your analysis and that you have credible, useful data. Here, I used the business task as a guide to decide which data from the dataset is relevant to my analysis.

#### <span style="color:#22223b"> 3.1 Data Sources Used and Description</span> <a id="data_used_3.1"></a>
Sršen encourages the use of public data that explores smart device users’ daily habits. She points to a specific data set: [fitbit_dataset](https://www.kaggle.com/arashnic/fitbit).
This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

#### <span style="color:#22223b"> 3.2 Data Licensing, Credibility and Accessibility </span> <a id="data_used_3.2"></a>
This data is confirmed to be licensed under an open license. It is openly accessible and can be used, edited and shared by anyone for any purpose.

#### <span style="color:#22223b"> 3.3 Data Integrity </span> <a id="data_used_3.3"></a>
Although this data has the potential to answer our business question, it has limitations: the sample size is quite small (30 users) and is not representative of the population of smart device users.

# <span style="color:#22223b"> 4. Process Phase </span> <a id="process_4"></a>
Now that I know the data is credible and relevant to the business problem, I need to clean it so that my analysis is error-free. For this analysis, I will use R because of its flexibility, its ability to handle data of this size, and its support for creating data visualizations to share my results with stakeholders.

#### <span style="color:#22223b"> 4.1 Loading Packages </span> <a id="process_4.1"></a>
To proceed, I need to load some relevant R packages to begin wrangling:
* library(tidyverse)
* library(janitor)
* library(lubridate)
* library(patchwork)
* library(skimr)

In [None]:
library(tidyverse)
library(janitor)
library(lubridate)
library(patchwork)
library(skimr)

#### <span style="color:#22223b"> 4.2 Importing Relevant Data</span> <a id="process_4.2"></a>

In [None]:
activity <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleep <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weight <- read.csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

#### <span style="color:#22223b"> 4.3 Preview of Imported Datasets </span> <a id="process_4.3"></a>

In [None]:
# Keep only the columns needed for the analysis, then preview each data frame
activity <- activity %>% select(Id, ActivityDate, Calories, TotalSteps, TotalDistance, SedentaryMinutes)
head(activity)

sleep <- sleep %>% select(Id, SleepDay, TotalMinutesAsleep, TotalTimeInBed)
head(sleep)

weight <- weight %>% select(Id, Date, WeightKg, BMI)
head(weight)

#### <span style="color:#22223b"> 4.4 Data Wrangling and Formatting</span> <a id="process_4.4"></a>

#### <span style="color:#22223b"> 4.4.1 Number of unique users </span> <a id="process_4.4.1"></a>
The code chunk below checks the number of unique users in each data set using the `Id` column.

In [None]:
n_unique(activity$Id)
n_unique(sleep$Id)
n_unique(weight$Id)
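
As a minimal sketch of what this check does, the base R snippet below counts distinct ids in a small made-up data frame. `toy_activity` and its values are hypothetical and only stand in for the real tables, and `length(unique(x))` is used here so the snippet runs without any packages.

```r
# Hypothetical toy data: three users, one of them with two daily records
toy_activity <- data.frame(
  Id = c(1001, 1001, 1002, 1003),
  TotalSteps = c(5000, 7200, 3100, 9800)
)

# Count distinct users rather than rows
n_users <- length(unique(toy_activity$Id))

# Number of daily records logged by each user
records_per_user <- table(toy_activity$Id)

print(n_users)
print(records_per_user)
```

Looking at records per user alongside the unique count helps show whether a data set with few users is also thin on days per user.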

#### <span style="color:#22223b"> 4.4.2 Duplicate entries </span> <a id="process_4.4.2"></a>
Duplicates can skew or exaggerate the results of our analysis. The code chunk below checks whether there are any duplicates in our data sets.

In [None]:
sum(duplicated(activity)) 
sum(duplicated(sleep))
sum(duplicated(weight))
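
To make the duplicate count easier to interpret, here is a small sketch on made-up data. `toy_sleep` is hypothetical. The point it illustrates is that `duplicated()` flags only the repeated copies of a row, not the first occurrence, so the sum is the number of rows that would be dropped.

```r
# Hypothetical toy data: the first two rows are exact copies of each other
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(400, 400, 380, 420)
)

# duplicated() marks a row TRUE only if an identical row appeared earlier
n_dupes <- sum(duplicated(toy_sleep))

# unique() keeps the first occurrence of each row
deduped <- unique(toy_sleep)

print(n_dupes)
print(nrow(deduped))
```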

#### <span style="color:#22223b"> 4.4.3 Checking for NA </span> <a id="process_4.4.3"></a>

In [None]:
sum(is.na(activity))
sum(is.na(sleep))
sum(is.na(weight))
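
`sum(is.na(df))` returns a single total for the whole data frame. As a sketch of how to see where missing values sit, the snippet below counts them per column with `colSums()`. The data frame `toy_weight` and its values are made up for illustration.

```r
# Hypothetical toy data: missing values only in the Fat column
toy_weight <- data.frame(
  Id = c(1, 2, 3),
  WeightKg = c(52.6, 61.0, 70.2),
  Fat = c(22, NA, NA)
)

# One total for the whole data frame
total_na <- sum(is.na(toy_weight))

# Missing values broken down by column
na_by_column <- colSums(is.na(toy_weight))

print(total_na)
print(na_by_column)
```

A per-column breakdown makes it easier to decide between dropping rows and dropping a column.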

#### <span style="color:#22223b"> 4.4.4 Removing duplicates and NA  </span> <a id="process_4.4.4"></a>
Duplicates and NA values were removed from both the activity and sleep data frames. The raw weight data contains a lot of NA entries in the Fat column, and dropping those rows would erase a large part of the relevant data. Since the Fat column was not among the columns selected in section 4.3, only duplicates are removed from the weight data frame.

In [None]:
activity <- activity %>%
  distinct() %>% 
  drop_na()

sleep <- sleep %>%
  distinct() %>% 
  drop_na()

weight <- weight %>% 
  distinct()

#### <span style="color:#22223b"> 4.4.5 Consistency of date columns </span> <a id="process_4.4.5"></a>
For ease of analysis and visualisation, it is imperative that all dates are in a consistent format.

In [None]:
activity <- activity %>% 
  rename(date = ActivityDate) %>% 
  mutate(date = as_date(date, format = "%m/%d/%Y"))
head(activity)

sleep <- sleep %>%
  rename(date = SleepDay) %>%
  mutate(date = as_date(date,format ="%m/%d/%Y %I:%M:%S %p"))
head(sleep)

weight <- weight %>%
  rename(date = Date) %>%
  mutate(date = as_date(date,format ="%m/%d/%Y %I:%M:%S %p"))
head(weight)
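
The sketch below shows the same idea in base R on a few made-up strings: two differently formatted date columns end up as comparable `Date` values once each is parsed with its own format string. The sample strings are hypothetical, and the `LC_TIME` setting is my assumption to keep `%p` (AM/PM) parsing independent of the machine's locale.

```r
# Assumption: force an English/C time locale so "%p" recognises AM/PM
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical samples in the two formats used by the activity and sleep files
activity_dates <- c("4/12/2016", "5/1/2016")
sleep_stamps <- c("4/12/2016 12:00:00 AM", "5/1/2016 12:00:00 AM")

parsed_activity <- as.Date(activity_dates, format = "%m/%d/%Y")
parsed_sleep <- as.Date(sleep_stamps, format = "%m/%d/%Y %I:%M:%S %p")

# A string that does not match the format becomes NA rather than an error
unparsed <- as.Date("not a date", format = "%m/%d/%Y")

print(parsed_activity)
print(all(parsed_activity == parsed_sleep))
print(is.na(unparsed))
```

Because a mismatched format yields `NA` silently, it is worth checking for `NA` in the date column right after conversion.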

#### <span style="color:#22223b"> 4.5 Merging Data sets </span> <a id="process_4.5"></a>
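For ease of analysis, we merge the relevant data sets using the `Id` and `date` columns as join keys. Note that `merge()` performs an inner join by default, so only user-days present in both data frames are kept.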

In [None]:
daily_activity_sleep <- merge(activity, sleep, by = c("Id", "date"))
head(daily_activity_sleep)
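
To illustrate what the merge keeps and drops, here is a small sketch with made-up data. `toy_activity` and `toy_sleep` are hypothetical. It compares the default inner join with a left join (`all.x = TRUE`), which would keep activity days that have no sleep record.

```r
# Hypothetical toy data: user 1 has two activity days but only one sleep log
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(5000, 7200, 3100)
)
toy_sleep <- data.frame(
  Id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(400, 380)
)

# Default merge(): inner join, unmatched activity days are dropped
inner <- merge(toy_activity, toy_sleep, by = c("Id", "date"))

# all.x = TRUE: left join, unmatched activity days are kept with NA sleep
left <- merge(toy_activity, toy_sleep, by = c("Id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```

Which join is appropriate depends on the question: the inner join suits comparisons between activity and sleep, while a left join preserves every activity day.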

# <span style="color:#22223b"> 5. Analyse and Share Phase </span> <a id="analyse_5"></a>
The goal of this stage is to identify trends and relationships within the data in order to accurately answer the guiding questions.

#### <span style="color:#22223b"> 5.1 Summary statistics </span> <a id="analyse_5.1"></a>
Computing per-user averages gives an overview of the data sets as a whole and some basic insights about them.

In [None]:
activity_avg <- activity %>% select(Id, Calories, TotalSteps, TotalDistance, SedentaryMinutes) %>% 
  group_by(Id) %>% 
  summarize(avg_calories = mean(Calories), avg_steps = mean(TotalSteps), 
            avg_distance = mean(TotalDistance), sedentary_avg = mean(SedentaryMinutes))
head(activity_avg)

sleep_avg <- sleep %>% select(Id, TotalMinutesAsleep, TotalTimeInBed) %>% 
  group_by(Id) %>% 
  summarize(avg_timeinbed = mean(TotalTimeInBed), avg_min_asleep = mean(TotalMinutesAsleep))
head(sleep_avg)

weight_avg<- weight %>% select(Id, WeightKg, BMI) %>%
  group_by(Id) %>% 
  summarize(avg_weight = mean(WeightKg), avg_BMI = mean(BMI))
head(weight_avg)

#### <span style="color:#22223b"> 5.2 Weekday Summaries </span> <a id="analyse_5.2"></a>
I intend to slice the data sets to get a closer look at the average calories, steps, distance, sedentary hours, hours asleep, and hours in bed recorded on each weekday. This can uncover some interesting insights.
First, I'll create a new data frame for this:

In [None]:
# Convert minute columns to hours; hours_in_bed must come from TotalTimeInBed
weekday_summary <- daily_activity_sleep %>%
  mutate(weekday = weekdays(date), sedentary_hours = SedentaryMinutes / 60, 
         hours_asleep = TotalMinutesAsleep / 60, hours_in_bed = TotalTimeInBed / 60)

weekday_summary$weekday <- ordered(weekday_summary$weekday, 
  levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday"))

weekday_summary <- weekday_summary%>%
  group_by(weekday) %>%
  summarize (calories = mean(Calories), steps = mean(TotalSteps), 
             distance = mean(TotalDistance), sedentary_hours = mean(sedentary_hours),
             hours_asleep = mean(hours_asleep), hours_in_bed = mean(hours_in_bed))
head(weekday_summary)
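
The ordered factor matters because weekday names are plain text and would otherwise sort alphabetically in tables and plots. The sketch below shows the difference on three made-up dates. The dates are hypothetical, and the `LC_TIME` setting is my assumption so that `weekdays()` returns English names.

```r
# Assumption: force an English/C time locale so weekdays() returns English names
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical dates: a Monday, a Tuesday and a Sunday
dates <- as.Date(c("2016-04-11", "2016-04-12", "2016-04-17"))
day_names <- weekdays(dates)

# Plain character sorting is alphabetical
alphabetical <- sort(day_names)

# An ordered factor sorts in calendar order instead
week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")
ordered_days <- factor(day_names, levels = week_levels, ordered = TRUE)
calendar_order <- as.character(sort(ordered_days))

print(alphabetical)
print(calendar_order)
```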

#### <span style="color:#22223b"> 5.2.1 Calories burned and Hours asleep per weekday </span> <a id="analyse_5.2.1"></a>
Visuals help identify trends more quickly. These visuals give a clearer picture of the weekday summaries.

In [None]:
p1 <- ggplot(weekday_summary) +
    geom_col(mapping = aes(weekday, calories), fill = ("#12a4d9")) +
    labs(title = "Calories burned on weekdays", x = "", y = "") +
    theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))

p2 <- ggplot(weekday_summary) +
    geom_col(mapping = aes(weekday, hours_asleep), fill = ("#24A897")) + 
    geom_hline(yintercept = 8) +
    labs(title = "Hours asleep per weekday", x = "", y = "") +
    theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))
p1 + p2

From the graphs above we can say the following:
1. On average, users get less than the recommended 8 hours of sleep each day.
2. Users burn the most calories on Mondays, Tuesdays and Saturdays.


#### <span style="color:#22223b"> 5.2.2 Daily steps and Distance per weekday </span> <a id="analyse_5.2.2"></a>

In [None]:
p3 <- ggplot(weekday_summary) +
    geom_col(mapping = aes(x = weekday, y = steps), fill = "#4960ae") +
    labs(title = "Daily steps per weekday", x= "", y = "") +
    theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))

p4 <- ggplot(weekday_summary) +
    geom_col(mapping = aes(x = weekday, y = distance), fill = "#6b7b8c")+
    labs(title = "Distance per weekday (km)", x= "", y = "") +
    theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))
p3 + p4

From the graphs above we can say the following:
1. Users take more steps on Mondays, Tuesdays and Saturdays, which matches the days with the most calories burned.
2. The distance covered tracks the daily steps and also the calories burned.

#### <span style="color:#22223b"> 5.3 Correlations </span> <a id="analyse_5.3"></a>
* Total Steps and Calories burned
* Sedentary minutes and Calories burned

In [None]:
p5 <- ggplot(daily_activity_sleep, aes(x= TotalSteps, y= Calories))+
    geom_jitter() +
    geom_smooth(color = "#e60049") + 
    labs(title = "Daily steps vs Calories", x = "Daily steps", y= "Calories") +
    theme(panel.background = element_blank(),
          plot.title = element_text( size=14))

p6 <- ggplot(daily_activity_sleep, aes(x= SedentaryMinutes, y=Calories))+
    geom_jitter() +
    geom_smooth(color = "#e60049") + 
    labs(title = "Sedentary Minutes vs Calories", x = "Sedentary Minutes", y= "Calories") +
    theme(panel.background = element_blank(),
          plot.title = element_text( size=14))
p5 + p6

From the visuals above:
1. A positive relationship exists between daily steps and calories. Users who take more steps are likely to burn more calories.
2. There's a negative relationship between sedentary minutes and calories burned, which makes sense because the more sedentary minutes a user accumulates, the fewer calories they burn.
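
Scatter plots show the direction of a relationship, and a correlation coefficient can put a number on its strength. The sketch below uses base R's `cor()` (Pearson by default) on made-up vectors. The values are hypothetical and are not taken from the Fitbit data, so only the signs are meaningful here.

```r
# Hypothetical toy data for five user-days
steps <- c(2000, 4000, 6000, 8000, 10000)
calories <- c(1700, 1900, 2150, 2300, 2600)
sedentary_minutes <- c(1100, 1000, 900, 850, 700)

# Pearson correlation: values near +1 or -1 indicate a strong linear relationship
r_steps <- cor(steps, calories)
r_sedentary <- cor(sedentary_minutes, calories)

print(r_steps)
print(r_sedentary)
```

On the real data, the same calls could be run on the columns of `daily_activity_sleep` to check how strong each relationship is.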

#### <span style="color:#22223b"> 5.4 Use of Smart Device </span> <a id="analyse_5.4"></a>
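Some interesting insights can be drawn from how often users use smart devices. To this end, we first group users by the number of days they used their devices. Because the counts come from `daily_activity_sleep`, a day is counted only when both activity and sleep were recorded.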

In [None]:
daily_use <- daily_activity_sleep %>%
  group_by(Id) %>%
  summarize(days_used=n())%>%
  mutate(usage_level = case_when(
    days_used >= 1 & days_used <= 10 ~ "low use",
    days_used >= 11 & days_used <= 20 ~ "moderate use", 
    days_used >= 21 & days_used <= 31 ~ "high use", 
  ))
head(daily_use)

Taking this further, we can work out what percentage of all users falls into each usage level.

In [None]:
daily_use_perc <- daily_use %>% 
  group_by(usage_level) %>% 
  summarize(level_totals = n()) %>% 
  mutate(total = sum(level_totals)) %>%
  group_by(usage_level) %>% 
  summarize(total_perc = level_totals/ total) %>%
  mutate(labels = scales::percent(total_perc))
head(daily_use_perc)
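
As a compact cross-check of the bucketing and the shares, here is a base R sketch on made-up day counts. `days_used` is hypothetical. `cut()` uses right-closed intervals by default, so the breaks below reproduce the 1 to 10, 11 to 20 and 21 to 31 day bands.

```r
# Hypothetical number of days with records for six users
days_used <- c(3, 10, 11, 20, 21, 31)

# Intervals are (0,10], (10,20], (20,31]
usage_level <- cut(days_used,
                   breaks = c(0, 10, 20, 31),
                   labels = c("low use", "moderate use", "high use"))

# Share of users in each usage level
shares <- prop.table(table(usage_level))

print(as.character(usage_level))
print(shares)
```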

The visual below shows the share of users at each usage level.

In [None]:
daily_use_perc %>%
  ggplot(aes(x="",y=total_perc, fill=usage_level)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=20, face = "bold")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
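  # Name the colours and set breaks explicitly so each legend label lines up
  # with its level (character levels would otherwise sort alphabetically)
  scale_fill_manual(values = c("high use" = "#a45167",
                               "moderate use" = "#cc98a7",
                               "low use" = "#ecdbdc"),
                    breaks = c("high use", "moderate use", "low use"),
                    labels = c("High use - 21 to 31 days",
                               "Moderate use - 11 to 20 days",
                               "Low use - 1 to 10 days"))+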
 labs(title="Daily use of smart device")

From the above visual:

1. 50% of users use their device quite frequently, between 21 and 31 days.
2. 12% use their device moderately, between 11 and 20 days.
3. 38% rarely use their device, between 1 and 10 days.

# <span style="color:#22223b"> 6. Conclusion and Recommendations (Act Phase) </span> <a id="act_6"></a>

![](https://th.bing.com/th/id/R.8174426628c86b23d1c666ef99f5a693?rik=y%2bipCwaIz8VKXA&riu=http%3a%2f%2fwww.pixelstalk.net%2fwp-content%2fuploads%2f2016%2f06%2fFree-Desktop-Fitness-Wallpapers-Images.jpg&ehk=8vU6dAJTzv84UHR%2bCWvqvBY1MnntTtbvIc40SIoBBXM%3d&risl=&pid=ImgRaw&r=0)

Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.

To build a more robust recommendation that helps Bellabeat become a larger player, I first recommend that existing smart devices be used to collect more detailed customer data, such as age and other demographics. Furthermore, online surveys are another way Bellabeat can uncover more details about its customers and their preferences.

However, from the above analysis, I recommend the following for the **Bellabeat Time product**:

|Recommendation| Description |
|---| ---|
|1. Sedentary Minutes Check | The analysis shows that users spend about 11 hours each day immobile. Users are most likely unaware of this for several reasons. A notification can be sent to users who have been in the same position for too long, prompting them to get up and take some steps or stretch to get blood flowing. |
|2. Sleep time| Users are not getting the recommended 8 hours of sleep daily. I recommend a notification to remind users when it's bedtime and when it's time to get up. |
|3. Goals | I consider it a good idea for the Bellabeat Time product to give users targets at the beginning of the day, for example: "Hello Mark, do you think you can burn 1000 calories today? Well, let's see you try!" |
|4. Benefits| Most people don't like being told what to do, especially by a gadget. I recommend periodically showing users fun health facts to remind them how important fewer sedentary minutes, more steps and other healthy habits are for the body. |
|5. Features | To encourage more usage days, the Time product can be made to look more fashionable and elegant so it goes with a variety of outfits. |