In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Case Study 2: How Can a Wellness Technology Company Play It Smart?** 

***Introduction***
Welcome to the Bellabeat data analysis case study! In this case study, I will perform many real-world tasks of a junior data analyst. I will imagine I'm working for Bellabeat, a high-tech manufacturer of health-focused products for women, and meet different characters and team members. To answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.


**Scenario**
I'm a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy.


***Characters and products***
● Characters
○ Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
○ Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
○ Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and
reporting data that helps guide Bellabeat’s marketing strategy. I joined this team six months ago and have
been busy learning about Bellabeat’s mission and business goals, as well as how I, as a junior data analyst,
can help Bellabeat achieve them.

● Products
○ Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress,
menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and
make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
○ Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects
to the Bellabeat app to track activity, sleep, and stress.
○ Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user
activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your
daily wellness.
○ Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are
appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your
hydration levels.
○ Bellabeat membership: Bellabeat also offers a subscription-based membership program for users.
Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and
beauty, and mindfulness based on their lifestyle and goals.


**1.Ask Phase:**
First, we need to identify our key stakeholders. In this case, we have the following stakeholders:

Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Business Objectives:

What are some trends in smart device usage?

How could these trends apply to Bellabeat customers?

How could these trends help influence Bellabeat marketing strategy?

**2.Prepare Phase**
Sršen encourages me to use public data that explores smart device users’ daily habits. She points me to a specific data set:
● FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set
contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of
personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes
information about daily activity, steps, and heart rate that can be used to explore users’ habits.

*Is Data ROCCC?*

A good data source is ROCCC which stands for Reliable, Original, Comprehensive, Current, and Cited.

Reliable — LOW — Not reliable as it only has 30 respondents

Original — LOW — Third party provider (Amazon Mechanical Turk)

Comprehensive — MED — Parameters match most of Bellabeat products’ parameters

Current — LOW — Data is 7 years old and may not be relevant

Cited — LOW — Data collected from a third party, hence the source is unknown

Overall, the dataset is considered low-quality data, and it is not recommended to base business recommendations on this data alone.

I downloaded the data through a secure browser and stored it in a secured folder on my secured hard disk.

**3.Prepare Phase**
In this phase we will process the data by cleaning it and ensuring that it is correct, relevant, complete, and error-free.

We have to check whether the data contains any missing or null values
Transform the data into the format we want for the analysis
Tool:

I have used RStudio for data cleaning, data transformation, data analysis, and visualization.

First, we need to install and load the packages needed for the analysis.

Loading packages
Now, I’m going to install some R packages that will help me in my analysis, including some data cleaning packages (the last 3 packages).




In [None]:
install.packages("tidyverse")
install.packages("lubridate")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("here")
install.packages("skimr")
install.packages("janitor")
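
Reinstalling every package on each run is slow, so a common alternative is to install only what is missing. The sketch below shows one way to do that in base R. The helper name `missing_packages` is hypothetical, and the actual `install.packages()` call is left commented out so the block has no side effects.

```r
# Packages this notebook relies on (same list as the install cell).
needed <- c("tidyverse", "lubridate", "dplyr", "ggplot2",
            "tidyr", "here", "skimr", "janitor")

# Hypothetical helper: return the entries of `pkgs` that are not in `installed`.
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_packages(needed)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

On a machine where everything is already installed, `to_install` is empty and nothing would be downloaded.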

Now, I’m going to load these packages.

In [None]:
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
library(here)
library(skimr)
library(janitor)

Importing dataset
Now, I’m going to import all the datasets, then view, clean, format, and organize the data. After
reviewing all the datasets, I decided to make some assumptions and work only with the following data for my analysis:
• dailyActivity_merged.csv

In [None]:
# Read the daily activity table, then inspect its first rows, column names, and structure
Activity <- read.csv("myproject\\Fitabase Data 4.12.16-5.12.16\\dailyActivity_merged.csv")
head(Activity)
colnames(Activity)
str(Activity)

Now for:

dailyCalories_merged.csv

In [None]:
# Read the daily calories table, then inspect its first rows, column names, and structure
Calories <- read.csv("myproject\\Fitabase Data 4.12.16-5.12.16\\dailyCalories_merged.csv")
head(Calories)
colnames(Calories)
str(Calories)

Now for:

dailyIntensities_merged.csv

In [None]:
> Intensities <- read.csv("myproject\\Fitabase Data 4.12.16-5.12.16\\dailyIntensities_merged.csv")
> head(Intensities) 
  Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes
1 1503960366   4/12/2016              728                  328                  13
2 1503960366   4/13/2016              776                  217                  19
3 1503960366   4/14/2016             1218                  181                  11
4 1503960366   4/15/2016              726                  209                  34
5 1503960366   4/16/2016              773                  221                  10
6 1503960366   4/17/2016              539                  164                  20
  VeryActiveMinutes SedentaryActiveDistance LightActiveDistance
1                25                       0                6.06
2                21                       0                4.71
3                30                       0                3.91
4                29                       0                2.83
5                36                       0                5.04
6                38                       0                2.51
  ModeratelyActiveDistance VeryActiveDistance
1                     0.55               1.88
2                     0.69               1.57
3                     0.40               2.44
4                     1.26               2.14
5                     0.41               2.71
6                     0.78               3.19
> colnames(Intensities)
> str(Intensities)

Now for:

heartrate_seconds_merged.csv

In [None]:
> Heartrate <- read.csv("myproject\\Fitabase Data 4.12.16-5.12.16\\heartrate_seconds_merged.csv")
> head(Heartrate)
          Id                 Time Value
1 2022484408 4/12/2016 7:21:00 AM    97
2 2022484408 4/12/2016 7:21:05 AM   102
3 2022484408 4/12/2016 7:21:10 AM   105
4 2022484408 4/12/2016 7:21:20 AM   103
5 2022484408 4/12/2016 7:21:25 AM   101
6 2022484408 4/12/2016 7:22:05 AM    95
> colnames(Heartrate)
[1] "Id"    "Time"  "Value"
> str(Heartrate)
'data.frame':	2483658 obs. of  3 variables:
 $ Id   : num  2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
 $ Time : chr  "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
 $ Value: int  97 102 105 103 101 95 91 93 94 93 ...

Now for:

sleepDay_merged.csv

In [None]:
> Sleep <- read.csv("myproject\\Fitabase Data 4.12.16-5.12.16\\sleepDay_merged.csv")
> head(Sleep)
          Id              SleepDay TotalSleepRecords TotalMinutesAsleep
1 1503960366 4/12/2016 12:00:00 AM                 1                327
2 1503960366 4/13/2016 12:00:00 AM                 2                384
3 1503960366 4/15/2016 12:00:00 AM                 1                412
4 1503960366 4/16/2016 12:00:00 AM                 2                340
5 1503960366 4/17/2016 12:00:00 AM                 1                700
6 1503960366 4/19/2016 12:00:00 AM                 1                304
  TotalTimeInBed
1            346
2            407
3            442
4            367
5            712
6            320
> colnames(Sleep)
[1] "Id"                 "SleepDay"           "TotalSleepRecords" 
[4] "TotalMinutesAsleep" "TotalTimeInBed"    
> str(Sleep)
'data.frame':	413 obs. of  5 variables:
 $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
 $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
 $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
 $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

Last dataset for this analysis:

weightLogInfo_merged.csv

In [None]:
> Weight <- read.csv("myproject\\Fitabase Data 4.12.16-5.12.16\\weightLogInfo_merged.csv")
> head(Weight)
          Id                  Date WeightKg WeightPounds Fat   BMI IsManualReport
1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65           True
2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65           True
3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54          False
4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45           True
5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69           True
6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45           True
         LogId
1 1.462234e+12
2 1.462320e+12
3 1.460510e+12
4 1.461283e+12
5 1.463098e+12
6 1.460938e+12
> colnames(Weight)
[1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
[5] "Fat"            "BMI"            "IsManualReport" "LogId"         
> str(Weight)
'data.frame':	67 obs. of  8 variables:
 $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
 $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
 $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
 $ WeightPounds  : num  116 116 294 125 126 ...
 $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
 $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
 $ IsManualReport: chr  "True" "True" "False" "True" ...
 $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
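
The six imports above repeat the same pattern, so they could also be done in one pass. The sketch below shows the idea on a throwaway folder with two tiny CSV files. The file contents are made up and only the `*_merged.csv` naming mirrors the data set; in the notebook, `data_dir` would point at the Fitabase folder instead.

```r
# Build a temporary folder with two tiny CSV files so the sketch runs anywhere.
# (The values are made up; only the file-name pattern mirrors the data set.)
data_dir <- file.path(tempdir(), "fitabase_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(100, 200)),
          file.path(data_dir, "dailyActivity_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = 1, TotalMinutesAsleep = 400),
          file.path(data_dir, "sleepDay_merged.csv"), row.names = FALSE)

# Read every *_merged.csv file into one named list of data frames.
files <- list.files(data_dir, pattern = "_merged\\.csv$", full.names = TRUE)
datasets <- lapply(files, read.csv)
names(datasets) <- sub("_merged\\.csv$", "", basename(files))

# A single call now gives an overview of every table.
str(datasets, max.level = 1)
```

Keeping the tables in a named list also makes it easy to apply the same check (for example `head` or `colnames`) to all of them with `lapply`.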

**4.Process Phase** 
Basic cleaning:
Now, I'm going to process, clean, and organize the dataset for analysis.

Here are the cleaning steps I performed on the data:

For the Activity, Calories, and Intensities datasets: I found no spelling errors, misfielded values, missing values, extra or blank spaces, or duplicates. For formatting, I used the clear formatting option. For data types, some columns were converted to numeric, and date columns will be converted to the date type.

For the Sleep data: 3 duplicates were found and removed.

For the Weight data: too many missing values were found in one column, so I decided to remove that column.
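
The duplicate and missing-value checks described above can be expressed in a few lines of base R. This is a minimal sketch on made-up toy tables, not the exact steps I ran; the 50% cut-off for dropping a column is an assumption chosen for illustration.

```r
# Toy sleep log with one exact duplicate row (values are made up).
sleep_demo <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)
n_dupes <- sum(duplicated(sleep_demo))           # rows identical to an earlier row
sleep_clean <- sleep_demo[!duplicated(sleep_demo), ]

# Toy weight log where Fat is mostly missing.
weight_demo <- data.frame(
  Id = c(1, 1, 2, 3),
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat = c(22, NA, NA, NA)
)
na_share <- colMeans(is.na(weight_demo))         # fraction of NA values per column
weight_clean <- weight_demo[, na_share <= 0.5, drop = FALSE]  # assumed 50% cut-off

print(n_dupes)
print(na_share)
print(names(weight_clean))
```

In the real Weight table, the summary output further down reports 65 missing `Fat` values out of 67 rows, which is why that column is a natural candidate for removal.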

Fixing formatting
I spotted some problems with the timestamp data. So, before the analysis, I need to convert it to a date-time format and split it into date and time.

In [None]:
# Activity: parse the m/d/Y strings once and keep a Date copy in `date`
Activity$ActivityDate <- as.Date(Activity$ActivityDate, format = "%m/%d/%Y")
Activity$date <- Activity$ActivityDate


# Intensities
Intensities$ActivityDay <- as.Date(Intensities$ActivityDay, format = "%m/%d/%Y")


# Sleep: parse the 12-hour timestamp, then keep the calendar date in `date`
Sleep$SleepDay <- as.POSIXct(Sleep$SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
Sleep$date <- as.Date(Sleep$SleepDay, tz = Sys.timezone())
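
To make the conversion concrete, here is a small worked example on two timestamp strings in the same layout as the `SleepDay` and `Time` columns. It is only a sketch: the time zone is fixed to UTC so the result does not depend on the machine, and parsing `%p` (AM/PM) assumes an English-style locale.

```r
# Two timestamp strings in the "m/d/Y h:m:s AM/PM" layout used by the data set.
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 7:21:05 PM")

# %I is the hour on a 12-hour clock and %p is the AM/PM marker.
stamp <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

date_part <- as.Date(stamp, tz = "UTC")          # calendar date
time_part <- format(stamp, format = "%H:%M:%S")  # time on a 24-hour clock

print(data.frame(raw, date_part, time_part))
```

Note that `%y` reads a two-digit year and `%Y` a four-digit year, and that the format string must not contain stray spaces inside a specifier; either mistake typically yields `NA` or a wrong century.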

We are also going to count the unique IDs to confirm whether the data has 30 IDs, as claimed by the survey. This can be done in two ways: with a direct function call, or with a SQL query. Here I use the direct function.

In [None]:
> n_distinct(Activity$Id)
[1] 33


There are 33 unique IDs, instead of 30 unique IDs as expected. Some users may have created additional IDs during the survey period.

The data cleaning and manipulation are now done, and the data is ready to be analyzed.

**5.Analyse Phase** 
Now that all the data is stored appropriately and has been prepared for analysis, I can start exploring and analyzing the data sets.

Let's look at the total number of participants in each of our data sets:

In [None]:
> n_distinct(Calories$Id)
[1] 33
> n_distinct(Intensities$Id)
[1] 33
> n_distinct(Heartrate$Id)
[1] 14
> n_distinct(Sleep$Id)
[1] 24
> n_distinct(Weight$Id)
[1] 8

So, there are 33 participants in the activity, calories, and intensities data sets, 24 participants in the sleep data, 14 in the heart rate data, and only 8 in the weight data set. Samples of 8 and 14 participants are too small to support any recommendations or conclusions.
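
The same participant count can be taken for every table in one step. The sketch below uses stand-in tables with made-up Ids and the base R equivalent of `n_distinct()`; in the notebook the list would hold `Activity`, `Sleep`, `Weight`, and so on.

```r
# Stand-in tables with made-up Ids.
tables <- list(
  Activity = data.frame(Id = c(1, 1, 2, 3)),
  Sleep    = data.frame(Id = c(1, 2, 2)),
  Weight   = data.frame(Id = c(3))
)

# Distinct participants per table (base R equivalent of dplyr::n_distinct).
participants <- sapply(tables, function(df) length(unique(df$Id)))

# Share of the largest cohort that each table covers.
coverage <- round(participants / max(participants), 2)

print(rbind(participants, coverage))
```

Applied to the counts reported above, the coverage relative to the 33 activity participants is about 0.73 for sleep (24/33), 0.42 for heart rate (14/33), and 0.24 for weight (8/33).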

So I will focus my analysis on the Activity, Calories, Intensities, and Sleep datasets, starting with some quick summary statistics for each data frame (the Weight summary is shown for reference only).

In [None]:
> Activity %>%
+     select(TotalSteps,
+            TotalDistance,
+            SedentaryMinutes, Calories) %>%
+     summary()
   TotalSteps    TotalDistance    SedentaryMinutes    Calories   
 Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
 1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
 Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
 Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
 3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
 Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900  
> Intensities %>%
+     select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>%
+     summary()
 VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
 Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
 1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
 Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
 Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
 3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
 Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0  
> Calories %>%
+     select(Calories) %>%
+     summary()
    Calories   
 Min.   :   0  
 1st Qu.:1828  
 Median :2134  
 Mean   :2304  
 3rd Qu.:2793  
 Max.   :4900  
> Sleep %>%
+     select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
+     summary()
 TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
 Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
 1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
 Median :1.000     Median :433.0      Median :463.0  
 Mean   :1.119     Mean   :419.5      Mean   :458.6  
 3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
 Max.   :3.000     Max.   :796.0      Max.   :961.0  
> Weight %>%
+     select(WeightKg, Fat) %>%
+     summary()
    WeightKg           Fat       
 Min.   : 52.60   Min.   :22.00  
 1st Qu.: 61.40   1st Qu.:22.75  
 Median : 62.50   Median :23.50  
 Mean   : 72.04   Mean   :23.50  
 3rd Qu.: 85.05   3rd Qu.:24.25  
 Max.   :133.50   Max.   :25.00  
                  NA's   :65 

### **Key findings from this analysis:**
1. The average sedentary time is too high (more than 16 hours a day) and needs to be reduced with a good marketing strategy.

2. The majority of the participants are lightly active, with a high sedentary time.

3. Participants sleep once a day for an average of 7 hours.

4. The average total steps per day (7,638) is a little below the 8,000-step level highlighted in CDC research. According to that research, taking 8,000 steps per day was associated with a 51% lower risk of all-cause mortality (death from all causes), and taking 12,000 steps per day was associated with a 65% lower risk, both compared with taking 4,000 steps.
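
The figures quoted in these findings follow directly from the means in the `summary()` output. The short calculation below simply converts minutes to hours and compares the mean step count with the 8,000-step mark; the variable names are my own.

```r
# Means copied from the summary() output above.
mean_sedentary_min <- 991.2
mean_asleep_min    <- 419.5
mean_steps         <- 7638

sedentary_hours <- mean_sedentary_min / 60   # minutes -> hours
asleep_hours    <- mean_asleep_min / 60      # minutes -> hours
step_gap        <- 8000 - mean_steps         # steps short of the 8,000 mark

print(round(c(sedentary_hours = sedentary_hours,
              asleep_hours    = asleep_hours,
              step_gap        = step_gap), 2))
```

This gives roughly 16.5 sedentary hours, just under 7 hours asleep, and a gap of 362 steps per day.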


**Merging some data:**
Before beginning to visualize the data, I'm going to merge two data sets, Activity and Sleep, on the Id column. Note that there are more participant Ids in the Activity dataset than in the Sleep dataset. So, for the analysis, I will use a full outer join to keep all participants in the merged dataset. With merge(), I can do that by adding the extra argument all=TRUE to my code chunk.

In [None]:
> Combined_data_outer <- merge(Sleep, Activity, by="Id", all = TRUE)
> n_distinct(Combined_data_outer$Id)
[1] 33
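
One thing to keep in mind with this merge: both tables have one row per participant per day, so joining on `Id` alone pairs every sleep day with every activity day of the same participant. The sketch below, on made-up toy tables, contrasts that with a join on `Id` and `date`. It assumes both tables carry a `date` column of class `Date`, as created in the formatting step.

```r
# Toy tables: participant 1 has two days in each table, participant 2 has no sleep log.
sleep_demo <- data.frame(
  Id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(400, 420)
)
activity_demo <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(8000, 9000, 7000)
)

# Joining on Id alone pairs every sleep day with every activity day of that Id.
by_id <- merge(sleep_demo, activity_demo, by = "Id", all = TRUE)

# Joining on Id and date keeps one row per participant-day.
by_id_date <- merge(sleep_demo, activity_demo, by = c("Id", "date"), all = TRUE)

print(nrow(by_id))
print(nrow(by_id_date))
```

Both joins keep every participant, so the distinct-Id count is unaffected; only the number of rows (and therefore any averages computed on the merged table) differs.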

**6.Share Phase**
Now let's visualize some key explorations.

Relationship between Steps and Sedentary time
What's the relationship between steps taken in a day and sedentary minutes?

In [None]:
> ggplot(data=Activity, aes(x=TotalSteps, y=SedentaryMinutes)) + geom_point() + geom_smooth() + labs(title="Total Steps vs. Sedentary Minutes")
`geom_smooth()` using method = 'loess' and formula 'y ~ x' 



![Rplot01.png](attachment:c9241972-fd3e-4201-8fa1-2456cb412106.png)

I can see here a negative correlation between steps and sedentary time: the more sedentary time you have, the fewer steps you take during the day. This data shows that the company needs to market more to the customer segments with high sedentary time. To do that, the company needs to find ways to get customers started walking more and measuring their daily steps.
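
A scatter plot shows the direction of the relationship; a correlation coefficient and a fitted slope put a number on it. The sketch below uses made-up daily records purely for illustration. In the notebook, the same calls would be applied to `Activity$TotalSteps` and `Activity$SedentaryMinutes`.

```r
# Made-up daily records, for illustration only.
demo <- data.frame(
  TotalSteps       = c(2000, 4000, 6000, 8000, 10000, 12000),
  SedentaryMinutes = c(1300, 1150, 1100,  900,   850,   700)
)

# Pearson correlation: values near -1 indicate a strong negative linear relation.
r <- cor(demo$TotalSteps, demo$SedentaryMinutes)

# Least-squares slope: change in sedentary minutes per additional step.
fit <- lm(SedentaryMinutes ~ TotalSteps, data = demo)
slope <- unname(coef(fit)["TotalSteps"])

print(round(r, 3))
print(round(slope, 4))
```

A coefficient close to zero on the real data would suggest that the visual trend is weaker than it looks, so it is worth checking before building a recommendation on it.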


Relationship between Minutes Asleep and Time in Bed

What's the relationship between minutes asleep and time in bed?

In [None]:
> ggplot(data=Sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point()+ geom_smooth() + labs(title=" Minutes Asleep vs. Time in Bed Minutes")
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

![Rplot02.png](attachment:4d6f936b-a546-4324-b229-66e8eebb3451.png)

As we might expect, we can see here an almost completely linear trend between minutes asleep and time in bed. So, to help users improve their sleep, the company should consider using notifications that remind them to go to bed.
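
Because the two variables move together so closely, the gap between them is often more informative than either one. The sketch below takes the six rows shown by `head(Sleep)` earlier and derives the minutes spent awake in bed and the share of bed time spent asleep. The column names `AwakeInBed` and `Efficiency` are my own additions.

```r
# The six rows shown by head(Sleep) earlier in the notebook.
sleep_head <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed without sleeping, and the share of bed time spent asleep.
sleep_head$AwakeInBed <- sleep_head$TotalTimeInBed - sleep_head$TotalMinutesAsleep
sleep_head$Efficiency <- round(sleep_head$TotalMinutesAsleep / sleep_head$TotalTimeInBed, 3)

print(sleep_head)
```

For these six nights the participant was asleep for more than 90% of the time spent in bed, which matches the tight linear trend in the plot.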

Relationship between Steps and Calories
What's the relationship between steps taken and calories?

In [None]:
> ggplot(data=Activity, aes(x=TotalSteps, y=Calories)) + 
+     geom_point() + geom_smooth() + labs(title="Total Steps vs. Calories")
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

![Rplot03.png](attachment:907457ff-74fa-4bcc-a798-fc010c3263cd.png)

We can see here a positive correlation between Total Steps and Calories. The more active we are, the more calories we will burn.

Intensities data
Now, let's look at some Intensities data over time.

In [None]:
> Intensities$ActiveIntensity <- (Intensities$VeryActiveMinutes)/60
> 
> Combined_data <- merge(Weight, Intensities, by="Id", all=TRUE)
> Combined_data$time <- format(as.POSIXct(Combined_data$Date, format = "%m/%d/%Y %I:%M:%S %p"), format = "%H:%M:%S")
> ggplot(data=Combined_data, aes(x=time, y=ActiveIntensity)) + geom_histogram(stat = "identity", fill='darkblue') +
+     theme(axis.text.x = element_text(angle = 90)) +
+     labs(title="Total very Active Intensity vs. Time ")
Warning message:
Ignoring unknown parameters: binwidth, bins, pad 

![Rplot04.png](attachment:d70d33e3-2274-4144-9504-dc4ac76a16b6.png)

Analysing the intensity data over time gives the company a good idea of how customers use their product during the day. Most users are active before and after work, I suppose. The company can use these times in the Bellabeat app to remind and motivate users to go for a run or a walk.
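
A time-of-day view like this needs records with an hour component, which the daily tables do not carry. The sketch below shows how hourly records could be averaged by hour of day. The data frame is made up, and the column names `ActivityHour` and `TotalIntensity` are assumptions rather than columns from the tables loaded above.

```r
# Made-up hourly records; the column names are assumptions for illustration.
hourly <- data.frame(
  Id = c(1, 1, 1, 2, 2, 2),
  ActivityHour = c("4/12/2016 7:00:00 AM", "4/12/2016 1:00:00 PM", "4/12/2016 6:00:00 PM",
                   "4/12/2016 7:00:00 AM", "4/12/2016 1:00:00 PM", "4/12/2016 6:00:00 PM"),
  TotalIntensity = c(20, 5, 40, 30, 15, 50)
)

# Parse the timestamp and keep the hour of day as a two-digit string.
stamp <- as.POSIXct(hourly$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
hourly$hour <- format(stamp, format = "%H")

# Average intensity per hour of day across all participants.
by_hour <- aggregate(TotalIntensity ~ hour, data = hourly, FUN = mean)
print(by_hour)
```

The resulting table has one row per hour and can be passed to a bar chart to see at which times of day activity peaks.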

Conclusions & Recommendations for the Business
Collecting data on activity, sleep, stress, etc. will allow Bellabeat to empower its customers with knowledge about their own health and daily habits. Bellabeat is growing rapidly and has quickly positioned itself as a tech-driven wellness company for its customers.

By analyzing the FitBit Fitness Tracker Data set, I found some insights that would help influence Bellabeat marketing strategy.

Target Audience:
People who work full-time jobs, spend a lot of time at the computer and in the office, and need fitness and daily activity to stay in shape.

These users do some light activity to stay healthy (according to the activity type analysis). They need to increase their everyday activity to gain more health benefits, and they might need some knowledge about developing healthy habits and the motivation to keep going.

Message to the Company
The Bellabeat app needs to be a unique fitness activity app: a companion guide (like a friend) that helps its users and customers balance their personal and professional lives with healthy habits.

Recommendations to the Bellabeat Marketing team
The average sedentary time is too high for the users of the app (more than 16 hours a day) and needs to be reduced with a good marketing strategy. The data shows that the company needs to market more to the customer segment with high sedentary time. To do that, the company needs to find ways to get customers started walking more by measuring their daily steps (plus notifications).

Participants sleep once a day for an average of 7 hours. To help users improve their sleep, Bellabeat should consider using app notifications that remind them to go to bed. The Bellabeat app can also recommend that its customers reduce their sedentary time.

The average total steps per day (7,638) is a little below the 8,000-step level highlighted in CDC research. According to that research, taking 8,000 steps per day was associated with a 51% lower risk of all-cause mortality (death from all causes), and taking 12,000 steps per day was associated with a 65% lower risk, both compared with taking 4,000 steps. So, Bellabeat can encourage people to take at least 8,000 steps per day by explaining the health benefits of doing so.

Analysing the intensity data over time gives the company a good idea of how customers use the app during the day. Most users are active before and after work. The company can use these times in the Bellabeat app to remind and motivate users to go for a run or a walk.

For customers who want to lose weight, it can be a good idea to control daily calorie consumption, and Bellabeat can suggest some ideas for low-calorie healthy food (for lunch and dinner).

Thank you very much for your interest in my Bellabeat Case Study.

~ Nilesh 

Credits: 
-MD Diallo 
-Avdhut Yadav