
# Introduction
This case study complements the Google Data Analytics Professional Certificate program. Bellabeat, the case company, is a high-tech company that manufactures health-focused smart products. Its major products include the Bellabeat app, Leaf, Time, Spring, and a membership subscription (for more detail, please visit [their website](https://bellabeat.com/)).

The goal of this study was to draw insights from usage trends in another company's smart-device data and apply them to the marketing strategy of a selected Bellabeat product.

The study achieved this goal by following Google's six-phase approach to data analysis.

The results show an average device downtime of about 4 hours per day. This substantial blackout follows a trend that matches users' weekly device usage.

What do the results suggest? Device feature improvements: longer battery uptime, a shorter charging turnaround, and an alert feature that reports the expected device uptime.

**The takeaway:** For the marketing strategy of Bellabeat's Leaf Chakra product,
the marketing messages should focus on the device's:

* 6-month battery life
* no charging required, and
* 24/7 uptime

Phases to the analysis:

* Ask
* Prepare
* Process
* Analyze
* Share
* Act

# Ask

Major questions to be answered:

* What are some trends in smart device usage?
* How could these trends apply to Bellabeat customers?
* How could these trends help influence Bellabeat marketing strategy?

Key tasks: 

* Identify the business task 
* Consider key stakeholders

**1. The Business task**

* To conduct analysis on external smart device usage data in order to gain insight.
* To apply the insights to the marketing strategy of one of Bellabeat's products.

**2. Stakeholders**

* **Urška Sršen**: Bellabeat’s cofounder and Chief Creative Officer
* **Sando Mur**: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
* **Bellabeat marketing analytics team**: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

# Prepare

Preparation of the data sets.

Key tasks:

* Download data and store it appropriately. 
* Identify how it’s organized. 
* Sort and filter the data.
* Determine the credibility of the data.

**Download data and store it appropriately**

The data is a public data set from Kaggle. It contains personal fitness tracker data from thirty-three Fitbit users. These eligible users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
Source: [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit)  (CC0: Public Domain, data set made available through [Mobius](https://www.kaggle.com/arashnic))

**Why R?**

Its accessibility, data-centric approach, and community base make R ideal for this project.
The essential R packages are installed and loaded beforehand.

In [1]:
library("tidyverse")
library("dplyr")
library("ggplot2")
library("readr")
library("lubridate")
library("here")
library("skimr")
library("janitor")

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union


here() starts at /kaggle/working


Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test




## Import data to RStudio

Using the read.csv function from base R.

The data set contains 18 csv files with daily, hourly, and minute-level dataframes. Here is the data dictionary: [fitabase metadata file](https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf).

**Import the Daily dataframes:**

In [2]:
daily_activity <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

daily_calories <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")

daily_intensities <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")

daily_steps <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")

**Daily dataframes structure**

In [3]:
str(daily_activity)
str(daily_calories)
str(daily_intensities)
str(daily_steps)

'data.frame':	940 obs. of  15 variables:
 $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
 $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
 $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
 $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
 $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
 $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
 $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
 $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
 $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
 $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
 $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
 $ SedentaryMinutes        : int  728 776 1218 726 773 539 

**Daily dataframes Structure Summary**

Variables:
* 31 variables on daily activity, intensity, calories, and total steps dataframes.

Data Types:
* Character
* Integer
* Number

Data Types transformation requirements:
* Date columns: from Character to Date/ Time
* ID columns: from Number to Character

**Import Hourly dataframes**

In [4]:
hourly_calories <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")

hourly_intensities <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")

hourly_steps <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")

**Hourly dataframes structure**

In [5]:
str(hourly_calories)
str(hourly_intensities)
str(hourly_steps)

'data.frame':	22099 obs. of  3 variables:
 $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
 $ Calories    : int  81 61 59 47 48 48 48 47 68 141 ...
'data.frame':	22099 obs. of  4 variables:
 $ Id              : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityHour    : chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
 $ TotalIntensity  : int  20 8 7 0 0 0 0 0 13 30 ...
 $ AverageIntensity: num  0.333 0.133 0.117 0 0 ...
'data.frame':	22099 obs. of  3 variables:
 $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
 $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...


**Hourly dataframes Structure Summary**

Variables:
* 10 variables on hourly intensity, calories, and steps of device users.

Data Types:
* Character
* Integer
* Number

Data Types transformation requirements:
* Date columns: from Character to Date/ Time, and formatting
* ID columns: from Number to Character

**Importing Sleep and Weight dataframes**

In [6]:
sleep_day <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

weight_log_info <- read.csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

**Sleep, and weight dataframes Structure**

In [7]:

str(sleep_day)
str(weight_log_info)


'data.frame':	413 obs. of  5 variables:
 $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
 $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
 $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
 $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
 $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
'data.frame':	67 obs. of  8 variables:
 $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
 $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
 $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
 $ WeightPounds  : num  116 116 294 125 126 ...
 $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
 $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
 $ IsManualReport: chr  "True" "True" "False" "True" ...
 $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1

**Sleep and Weight dataframes Structure Summary**

Variables:
* 13 variables on sleep and weight of device users.

Data Types:
* Logical
* Character
* Integer
* Number

Data Types transformation requirements:
* IsManualReport column: from character to Logical or Boolean
* Date columns: from Character to Date/ Time
* ID columns: from Number to Character
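
The type changes listed above can be sketched on a couple of made-up rows. This is a minimal illustration in base R, not the notebook's own cleaning step; the `raw` and `typed` data frames and their ID values are hypothetical.

```r
# Hypothetical rows shaped like the raw weight log (IDs are made up)
raw <- data.frame(
  Id = c(1234567890, 9876543210),
  IsManualReport = c("True", "False"),
  stringsAsFactors = FALSE
)

typed <- raw
# IDs are labels, not quantities: store them as text so they never print as 1.2e+09
typed$Id <- as.character(typed$Id)
# as.logical() understands the strings "True" and "False"
typed$IsManualReport <- as.logical(typed$IsManualReport)

str(typed)
```

Keeping IDs as character also prevents accidental arithmetic or averaging on them later in the analysis.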


**How many users in the dataframes?**

To get the number of participants in the survey, let us check the number of unique IDs.


In [8]:

n_distinct(daily_activity$Id)
n_distinct(daily_calories$Id)
n_distinct(daily_intensities$Id)
n_distinct(daily_steps$Id)
n_distinct(hourly_calories$Id)
n_distinct(hourly_intensities$Id)
n_distinct(hourly_steps$Id)
n_distinct(sleep_day$Id)
n_distinct(weight_log_info$Id)


Most of the data sets contain data on all 33 participants, but two do not: 'sleep_day' covers 24 participants and 'weight_log_info' only 8, which makes weight the least captured measure in the survey.

By the common rule of thumb associated with the central limit theorem (a sample size of at least 30), these two data sets are too small to support statistical inference.
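
The participant check can also be written once over a named list of data frames instead of one call per data frame. The sketch below uses toy stand-ins; the `frames` list, its values, and the `min_n` cut-off are assumptions for illustration only.

```r
# Toy stand-ins for the imported data frames (hypothetical values, not the Fitbit data)
frames <- list(
  daily_activity  = data.frame(Id = c(1, 1, 2, 3)),
  sleep_day       = data.frame(Id = c(1, 2, 2)),
  weight_log_info = data.frame(Id = c(3))
)

# Number of distinct participants in each data frame
participants <- sapply(frames, function(df) length(unique(df$Id)))

# Flag data frames below a chosen minimum sample size.
# min_n = 3 suits the toy data; a commonly quoted rule of thumb is 30.
min_n <- 3
too_small <- names(participants)[participants < min_n]

print(participants)
print(too_small)
```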

**About the data**

The data consists of structured csv files in long format, organized at daily, hourly, and minute levels. The daily data set is the summary level: it aggregates the hourly data, which in turn aggregates the minute data.

In the data frames, most of the columns contain numeric decimal values (dbl) and date / time columns with character values (chr). The only Boolean values (lgl) are from a weight data set column.

Regarding integrity, the column naming needs some adjustment. For example, the date/time columns are inconsistent in both name and format: the information is presented either as 'mm/dd/yyyy' or as 'mm/dd/yyyy h:mm:ss AM/PM'.
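
To make the two formats concrete, here is a small sketch that parses sample strings copied from the str() output above. The variable names are hypothetical, and fixing the time zone to UTC is an assumption made so the result does not depend on the machine.

```r
# Sample strings copied from the str() output of the raw data
daily_raw  <- c("4/12/2016", "4/13/2016")
hourly_raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM")

# Date-only column: parse straight to Date
daily_date <- as.Date(daily_raw, format = "%m/%d/%Y")

# Date-time column: 12-hour clock (%I) with an AM/PM marker (%p).
# Note: %p parsing depends on the locale; an English or C locale is assumed here.
hourly_time <- as.POSIXct(hourly_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(format(daily_date))
print(format(hourly_time, "%H:%M:%S"))
```

Once both columns are proper date types, they can be compared and joined without relying on string formatting.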

Even though there are 33 participants, insufficient data was collected on sleep and weight. This is handled by using the available data only to identify trends.


**Does the data ROCCC?**

ROCCC stands for Reliable, Original, Cited, Current, and Comprehensive.

Yes, the data is reliable, original, and cited. However:
* Is it current?: 
Although it is relevant for the period in which it was collected (4.12.2016 - 5.12.2016), it is somewhat old to give insights into current technology and trends in smart fitness trackers.
* Is it comprehensive?: 
The data set lacks important information on age and gender, which limits how well it represents consumers' device usage.

* What about privacy and security issues?: 
The prepare phase shows that no personally identifiable information (PII) is included. Privacy and security concerns were handled through data anonymization.

# Process

Key tasks:

* Check the data for errors.
* Choose your tools.
* Transform the data so you can work with it effectively.
* Document the cleaning process

## Data transformation

Split the date/time columns into date and time.


In [9]:

# Activity
daily_activity$ActivityDate=as.POSIXct(daily_activity$ActivityDate, format= "%m/%d/%Y", tz=Sys.timezone())
daily_activity$date <- format(daily_activity$ActivityDate, format = "%m/%d/%y")

# Calories
hourly_calories$ActivityHour=as.POSIXct(hourly_calories$ActivityHour, format= "%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_calories$time <- format(hourly_calories$ActivityHour, format = "%H:%M:%S")
hourly_calories$date <- format(hourly_calories$ActivityHour, format = "%m/%d/%y")

# Intensity
hourly_intensities$ActivityHour=as.POSIXct(hourly_intensities$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_intensities$time <- format(hourly_intensities$ActivityHour, format = "%H:%M:%S")
hourly_intensities$date <- format(hourly_intensities$ActivityHour, format = "%m/%d/%y")

# Sleep
sleep_day$SleepDay=as.POSIXct(sleep_day$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
sleep_day$time <- format(sleep_day$SleepDay, format = "%H:%M:%S")
sleep_day$date <- format(sleep_day$SleepDay, format = "%m/%d/%y")



**Data Cleaning: Checking missing Values**

* Check activity data frame for missing values

In [10]:

skim_without_charts(daily_activity)


**Note:** The daily_activity data set contains 31 days data and no missing values

* Check calories data frame for missing values

In [None]:

skim_without_charts(hourly_calories)


**Note:** the hourly_calories dataframe contains 31 days data and no missing values.

* Check Intensity data frame for missing values

In [None]:
skim_without_charts(hourly_intensities)

**Note:** The Intensity dataframe contains 31 days data and no missing Values.

* Check sleep data frame for missing values

In [None]:
skim_without_charts(sleep_day)

**Note:** The sleep data set contains 31 days data and no missing values.

* Check weight data frame for missing values

In [None]:
skim_without_charts(weight_log_info)

**Note:** But, in Weight data set we have considerable missing values. Only 3% of the fat column is completed (Meaning; out of 67 observations only 2 were completed). To keep bias out of our analysis this column is dropped.
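
The completion rate behind this decision can be computed directly rather than read off the skim output. The following is a minimal sketch on a toy table; `weight_toy`, its values, and the 50% threshold are illustrative assumptions, not figures from the study.

```r
# Toy weight log (hypothetical values); Fat is mostly missing, as in the real data
weight_toy <- data.frame(
  Id       = c("a", "a", "b", "c"),
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat      = c(22, NA, NA, NA)
)

# Share of non-missing values per column, as a percentage
completion <- round(colMeans(!is.na(weight_toy)) * 100, 1)

# Keep only the columns that meet a chosen completeness threshold
threshold <- 50  # assumption: an illustrative cut-off
weight_kept <- weight_toy[, completion >= threshold, drop = FALSE]

print(completion)
print(names(weight_kept))
```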


#### Setup our final data set for analysis

**Adjust column naming to snake_case and Select variables.**

* Check weight data frame columns

In [None]:
colnames(weight_log_info)

**Action:** Remove the Fat, Date, and LogId columns and name the new data frame 'clean_weight'.

In [None]:
weight_new <- weight_log_info %>% 
  select(Id,WeightKg, WeightPounds, BMI, IsManualReport)
clean_weight <- clean_names(weight_new)
clean_weight$id <- as.character(clean_weight$id)
clean_weight$is_manual_report <- as.logical(clean_weight$is_manual_report)

head(clean_weight)


* Check Sleep data frame columns

In [None]:
colnames(sleep_day)
head(sleep_day)

**Action:** Remove sleepDay & time columns and name the new data frame 'clean_sleep'.

In [None]:
sleep_new <- sleep_day %>% 
  select(Id,date,TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed)
clean_sleep <- clean_names(sleep_new)
clean_sleep$date <- as.Date(clean_sleep$date, format = "%m/%d/%y")
clean_sleep$id <- as.character(clean_sleep$id)
head(clean_sleep)

* Check Intensity data frame columns

In [None]:
colnames(hourly_intensities)

**Action:** Remove ActivityHour & time columns and name the new data frame 'cleaned_intensities'.

In [None]:
intensities_new <- hourly_intensities %>% 
  select(Id,date,TotalIntensity, AverageIntensity)
cleaned_intensities <- clean_names(intensities_new)
cleaned_intensities$id <- as.character(cleaned_intensities$id)
cleaned_intensities$date <- as.Date(cleaned_intensities$date, format = "%m/%d/%y")
head(cleaned_intensities)


* Check Calories data frame columns

In [None]:
colnames(hourly_calories)

**Action:** Remove ActivityHour & time columns and name the new data frame 'cleaned_calories'.

In [None]:
calories_new <- hourly_calories %>% 
  select(Id,date, Calories)
cleaned_calories <- clean_names(calories_new)
cleaned_calories$id <- as.character(cleaned_calories$id)
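# The date strings are in mm/dd/yy form, so the format must be given explicitly
cleaned_calories$date <- as.Date(cleaned_calories$date, format = "%m/%d/%y")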

head(cleaned_calories)

* Check Activity data frame columns

In [None]:
colnames(daily_activity)

**Action:** Remove ActivityDate column and name the new data frame 'cleaned_activity_merged'.

In [None]:
activity_new <- daily_activity %>% 
  select(! ActivityDate)
cleaned_activity_merged <- clean_names(activity_new)
cleaned_activity_merged$date <- as.Date(cleaned_activity_merged$date, format = "%m/%d/%y")
cleaned_activity_merged$id <- as.character(cleaned_activity_merged$id)
head(cleaned_activity_merged)

#### Further data Cleanings 

Here our interest is to analyze the days on which consumers actually used their devices. A device can be recorded as sedentary for a whole day without being used at all.

How can we identify actual device use?

Based on activity principles and the Fitabase data dictionary, we can infer the following conditions. If a record fulfills either condition, we treat the device as not used on that date.

Condition_1: A user who wears the device always expends some calories, so a record with zero calories indicates that the device was not used.

Condition_2: The device reports 24 hours in the sedentary state and every other entry is zero.

Note: Condition_2 excludes the Id, date, and calories columns, as they have default input values according to the Fitabase data dictionary.
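
As a compact illustration of the two conditions, the sketch below flags non-use days on a toy table. The `activity_toy` data and the reduced set of columns are assumptions for brevity; the notebook's own filter checks more columns.

```r
# Toy daily records in snake_case (hypothetical values)
activity_toy <- data.frame(
  id                     = c("a", "a", "b", "b"),
  total_steps            = c(13162, 0, 0, 5000),
  very_active_minutes    = c(25, 0, 0, 10),
  fairly_active_minutes  = c(13, 0, 0, 5),
  lightly_active_minutes = c(328, 0, 0, 200),
  sedentary_minutes      = c(728, 1440, 1440, 700),
  calories               = c(1985, 1347, 0, 1800)
)

activity_cols <- c("total_steps", "very_active_minutes",
                   "fairly_active_minutes", "lightly_active_minutes")

# Condition 1: zero calories recorded for the day
no_calories <- activity_toy$calories == 0

# Condition 2: sedentary for all 1440 minutes and every activity column is zero.
# Summing is safe here because none of these columns can be negative.
idle_all_day <- activity_toy$sedentary_minutes == 1440 &
  rowSums(activity_toy[, activity_cols]) == 0

# A day counts as "not used" if either condition holds
not_used <- no_calories | idle_all_day
in_use <- activity_toy[!not_used, ]

print(not_used)
print(nrow(in_use))
```

Computing the two logical vectors separately makes it easy to report how many records each condition removes.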

*Based on Condition_1: Remove records with zero calories*

Check records that fulfill this condition.

In [None]:
as_tibble(cleaned_activity_merged[(cleaned_activity_merged$calories == 0), ])

There are 4 records fulfilling the condition.

**Action:** Remove the records from 'cleaned_activity_merged' dataframe and name it 'clean_calories_zero_activitry_merged'.

In [None]:
clean_calories_zero_activitry_merged <- cleaned_activity_merged[!(cleaned_activity_merged$calories == 0), ]

*Based on Condition_2: Remove observations with 24 hrs (1440 minutes) sedentary and all other entries zero*

Check records that fulfill Condition_2.
Note: "24 hrs sedentary" refers to records with 1440 sedentary minutes (equivalent to 24 hrs).

In [None]:
count(clean_calories_zero_activitry_merged [(clean_calories_zero_activitry_merged$total_steps == 0 & 
                                             clean_calories_zero_activitry_merged$total_distance == 0 & 
                                             clean_calories_zero_activitry_merged$tracker_distance == 0 & 
                                             clean_calories_zero_activitry_merged$logged_activities_distance == 0 & 
                                             clean_calories_zero_activitry_merged$very_active_distance == 0 & 
                                             clean_calories_zero_activitry_merged$light_active_distance == 0 & 
                                             clean_calories_zero_activitry_merged$sedentary_active_distance == 0 & 
                                             clean_calories_zero_activitry_merged$very_active_minutes == 0 & 
                                             clean_calories_zero_activitry_merged$fairly_active_minutes == 0 & 
                                             clean_calories_zero_activitry_merged$lightly_active_minutes == 0 & 
                                             clean_calories_zero_activitry_merged$sedentary_minutes == 1440), ])

There are 68 records fulfilling the condition. 

**Action:** Remove the records from 'clean_calories_zero_activitry_merged' dataframe and save it to 'clean_activity_merged_device_in_Use'.

In [None]:
clean_activity_merged_device_in_Use <- clean_calories_zero_activitry_merged [!(clean_calories_zero_activitry_merged$total_steps == 0 & 
                                                                               clean_calories_zero_activitry_merged$total_distance == 0 & 
                                                                               clean_calories_zero_activitry_merged$tracker_distance == 0 & 
                                                                               clean_calories_zero_activitry_merged$logged_activities_distance == 0 & 
                                                                               clean_calories_zero_activitry_merged$very_active_distance == 0 & 
                                                                               clean_calories_zero_activitry_merged$light_active_distance == 0 & 
                                                                               clean_calories_zero_activitry_merged$sedentary_active_distance == 0 & 
                                                                               clean_calories_zero_activitry_merged$very_active_minutes == 0 & 
                                                                               clean_calories_zero_activitry_merged$fairly_active_minutes == 0 & 
                                                                               clean_calories_zero_activitry_merged$lightly_active_minutes == 0 & 
                                                                               clean_calories_zero_activitry_merged$sedentary_minutes == 1440), ]
head(clean_activity_merged_device_in_Use)
count(clean_activity_merged_device_in_Use)


**Merging relevant data frames**

* Merge Weight and intensity data frames on 'id'.


In [None]:
clean_weight_intensity_combined <- merge(clean_weight, cleaned_intensities, by=c('id'))
head(clean_weight_intensity_combined)

* Merge activity and sleep data frames on 'id' and 'date'.

In [None]:
clean_activity_sleep_combined <- merge(clean_sleep, clean_activity_merged_device_in_Use, by=c('id', 'date'))

head(clean_activity_sleep_combined)
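
A note on what this merge keeps: by default, merge() performs an inner join, so only id/date pairs present in both data frames survive. The sketch below shows the difference from a left join on toy data; both toy tables and their values are hypothetical.

```r
# Toy sleep and activity tables (hypothetical values)
sleep_toy <- data.frame(
  id   = c("a", "a", "b"),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  total_minutes_asleep = c(327, 384, 412)
)
activity_toy <- data.frame(
  id   = c("a", "a", "b", "c"),
  date = as.Date(c("2016-04-12", "2016-04-14", "2016-04-12", "2016-04-12")),
  calories = c(1985, 1776, 2000, 1500)
)

# Inner join (the default): only id/date pairs found in both tables
inner <- merge(sleep_toy, activity_toy, by = c("id", "date"))

# Left join: keep every sleep record, with NA where no activity record matches
left <- merge(sleep_toy, activity_toy, by = c("id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
print(sum(is.na(left$calories)))
```

Comparing the two row counts is a quick way to see how many sleep records have no matching activity day.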

**The Cleaned and merged Data frames for analysis:**

* clean_activity_sleep_combined
* cleaned_activity_merged
* cleaned_calories
* cleaned_intensities
* clean_weight



# Analyze

Key tasks:

* Aggregate the data so it’s useful and accessible. 
* Organize and format the data. 
* Perform calculations. 
* Identify trends and relationships.

#### Understanding some summary statistics

* Device Usage stat summary


In [None]:
count_usage_days <- aggregate(cbind(count = date) ~id, data = clean_activity_merged_device_in_Use, FUN = length)
count_usage_days <- count_usage_days %>% 
  rename(number_of_days=count)

# Device usage stat summary
round(mean(count_usage_days$number_of_days),0) # Average 
round(sd(count_usage_days$number_of_days),0) # Standard deviation
max(count_usage_days$number_of_days) # Maximum
min(count_usage_days$number_of_days) # Minimum


How was the device usage?

The device usage summary shows that devices were used on 26 of the 31 days on average.

* Activity stat summary:

In [None]:
clean_activity_merged_device_in_Use %>%  
  select(very_active_minutes,
         fairly_active_minutes,
         lightly_active_minutes,
         sedentary_minutes) %>%
  summary()

What does this tell us about activities levels?

* Least time spent in fairly active state.
* Most time spent in sedentary state of the 24 hours.

**How many hours a day were users engaged in non-sedentary activities?**

In [None]:
# aggregate variables from activity data frame

clean_activity_merged_device_in_Use$week_day <- weekdays(clean_activity_merged_device_in_Use$date) # convert to week days
weekly_device_activity = clean_activity_merged_device_in_Use %>%
  group_by(id, date)  %>%
  summarise(very_active = sum(very_active_minutes, na.rm = TRUE)/60, 
            fairly_active = sum(fairly_active_minutes, na.rm = TRUE)/60,
            lightly_active = sum(lightly_active_minutes, na.rm = TRUE)/60,
            
            sedentary = sum(sedentary_minutes, na.rm = TRUE)/60,
            device_active_hours = very_active + fairly_active + lightly_active + sedentary,
            downtime_hours = 24 - (device_active_hours),
            week_day = week_day,
                              .groups = 'drop')

# Non-sedentary hours per day 
daily_active_hours_average <- round(mean(weekly_device_activity$very_active + weekly_device_activity$fairly_active + weekly_device_activity$lightly_active ),2)
head(daily_active_hours_average)

Users were in non-sedentary activities for about 4 hours each day.


**How many of the 24 hours were devices in blackout?**

In [None]:
# device average downtime in the 24 hours
round(mean(weekly_device_activity$downtime_hours),1)


Devices were in blackout for about 4 hours each day.
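
The downtime figure comes from a simple identity: downtime hours = 24 - (very active + fairly active + lightly active + sedentary minutes) / 60. Here is a worked sketch using the first three values of each minutes column from the str() output above; the `day` data frame name is hypothetical.

```r
# First three daily records, copied from the str(daily_activity) output
day <- data.frame(
  very_active_minutes    = c(25, 21, 30),
  fairly_active_minutes  = c(13, 19, 11),
  lightly_active_minutes = c(328, 217, 181),
  sedentary_minutes      = c(728, 776, 1218)
)

# Hours per day for which the device recorded any state
tracked_hours <- unname(rowSums(day)) / 60

# Whatever is missing from the 24 hours is treated as downtime
downtime_hours <- 24 - tracked_hours

print(round(downtime_hours, 2))
```

The third record sums to exactly 1440 minutes, so it has no downtime; the first two leave roughly six to seven hours unaccounted for.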


* Sleep and calories expenditure stat summary:

In [None]:
clean_activity_sleep_combined %>% 
  select(total_minutes_asleep,  
         calories, total_time_in_bed) %>%
  summary()

* Users were asleep for about 7 hours on average. This is at the lower bound of the 7 to 9 hours of daily sleep that adults need. (see [here](https://www.sleepfoundation.org/how-sleep-works/sleep-facts-statistics#:~:text=Adults%20between%2018%20and%2064%20need%20seven%20to,average%20for%20less%20than%20seven%20hours%20per%20night.))
* The average calorie expenditure is 2398. This is higher than the daily calorie need of adult females (1,600–2,200 calories). (see [here](https://www.healthline.com/health/fitness-exercise/how-many-calories-do-i-burn-a-day#:~:text=Every%20day%2C%20you%20burn%20calories%20when%20you%20move,unique%20to%20your%20body%20and%20activity%20levels%20%281%29.))


**Trends and Relationships between variables**

Hypotheses


**Hypothesis_1:** Activity levels follow a trend across the days of the week. In particular, very active minutes increase over the weekend.

**Hypothesis_2:** Users with higher calorie expenditure tend to sleep longer.

**Hypothesis_3:** Heavier users tend to do more intense activities.


**Checking the Hypotheses**

**Checking Hypothesis_1:** *Activity levels follow a trend across the days of the week. In particular, very active minutes increase over the weekend.*

In [None]:
active_vs_sedentary_minutes = weekly_device_activity %>%
  group_by(week_day)  %>%
  summarise(very_active = mean(very_active, na.rm = TRUE),
            active = (mean(fairly_active, na.rm = TRUE) + mean(lightly_active, na.rm = TRUE)),
            sedentary = mean(sedentary, na.rm = TRUE),
            
            device_downtime = 24 - (very_active + active + sedentary),
                              .groups = 'drop')
# Order by week days

dayLabs<-c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday") 
active_vs_sedentary_minutes$week_day <- factor(active_vs_sedentary_minutes$week_day, levels= dayLabs)
active_vs_sedentary_minutes<-active_vs_sedentary_minutes[order(active_vs_sedentary_minutes$week_day), ]

View(active_vs_sedentary_minutes)

**Result**

The result shows that Hypothesis_1 is not supported: very active time tends to peak at the start of the week rather than over the weekend.



**Checking Hypothesis_2:** *Users with higher calorie expenditure tend to sleep longer.*

Correlation analysis between calories and asleep minutes

In [None]:
daily_asleep_vs_calories = clean_activity_sleep_combined %>% 
  group_by(id, date) %>% 
  summarise(daily_minute = sum(total_minutes_asleep),
            daily_calories = sum(calories),
           .groups = 'drop')
x <- daily_asleep_vs_calories$daily_minute
y <- daily_asleep_vs_calories$daily_calories
# Base graphics calls are issued one after another; they are not chained with `+`
plot(x, y, xlab = "Minutes Asleep", ylab = "Calories", pch = 19, col = "lightblue")
abline(lm(y ~ x), col = "red", lwd = 1)
text(x = 250, y = 7300, labels = paste("Correlation:", round(cor(x, y), 2)))

**Result**

The analysis shows a weak positive correlation (0.13) between calorie expenditure and minutes asleep.
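
A correlation coefficient on its own does not say how uncertain the estimate is. One way to check, sketched here on toy vectors, is base R's cor.test(), which reports a p-value and a confidence interval alongside the coefficient. The sleep values are the first ten entries of TotalMinutesAsleep from the str() output; the calorie values are made up for illustration.

```r
# First ten TotalMinutesAsleep values from the str(sleep_day) output
minutes_asleep <- c(327, 384, 412, 340, 700, 304, 360, 325, 361, 430)
# Hypothetical calorie values, invented for this sketch
calories <- c(2000, 1800, 1750, 1900, 1700, 1950, 2050, 1800, 1780, 1830)

# Pearson correlation test: coefficient, p-value, and 95% confidence interval
result <- cor.test(minutes_asleep, calories, method = "pearson")

r  <- unname(result$estimate)
p  <- result$p.value
ci <- as.numeric(result$conf.int)

print(round(r, 2))
print(round(p, 3))
print(round(ci, 2))
```

With small samples the interval is typically wide, which is a reason to treat a weak coefficient as inconclusive rather than as evidence of no relationship.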


* Weight and Activity intensity statistical summary

In [None]:
clean_weight_intensity_combined %>% 
  select(weight_kg,
         average_intensity) %>% 
summary()

Note:
Individual difference in weight is observed

**Checking Hypothesis_3:** *Heavier users tend to do more intense activities.*

In [None]:
# Group by individual to box plot

clean_weight_intensity_combined_final = clean_weight_intensity_combined %>% 
  group_by(id) %>% 
  summarize(weight_kg = mean(weight_kg, na.rm = TRUE),
            intensity = mean(total_intensity, na.rm =TRUE))
View(clean_weight_intensity_combined_final)


The result shows that only 8 participants have weight log data, which is too small a sample to support a conclusion. Within this data set, however, weight and intensity are inversely related.

In [None]:
# Correlation plot
x <- clean_weight_intensity_combined_final$weight_kg
y <- clean_weight_intensity_combined_final$intensity
# Base graphics calls are issued one after another; they are not chained with `+`
plot(x, y, xlab = "Weight(Kgs)", ylab = "Intensity", pch = 19, col = "lightblue")
abline(lm(y ~ x), col = "red", lwd = 1)
text(x = 115, y = 17, labels = paste("Correlation:", round(cor(x, y), 2)))

**Result**
The data is insufficient to test the hypothesis, even though a negative correlation of magnitude 0.62 was found.

# Share

Device Usage over the study period

Device used days statistics:

In [None]:

#Plot individuals device use differences:
par(bg = "#f7f7f7")
boxplot(count_usage_days$number_of_days,
        horizontal = TRUE,
        main='How many days were devices used?, over the study period',
        sub = 'Period: 31 days', cex.sub = 1.2,
        ylab='Consumers',
        xlab='Number of days devices used',
        col='#CCCCCC',
        border='#333333')

text(x = 17, y = 1.4, labels = 'Summary stat >>   Mean = 26, Sd = 7, Min = 3, and Max = 31 days', col = '#666633')


Summary statistics for active minutes and sedentary 

In [None]:
# convert to long
# install.packages("reshape2")                               
library("reshape2")

active_vs_sedentary_minutes_long <- melt(active_vs_sedentary_minutes,                                 
                  id.vars = c("week_day"))

# Plot a stacked bar chart of device status by week day
ggplot(active_vs_sedentary_minutes_long,
       aes(x = week_day,        # one bar per week day
           y = value,           # hours spent in each status
           fill = variable)) +  # one bar segment per device status
  # stat = "identity" plots the values as given; position = "stack" stacks the segments
  geom_bar(stat = "identity", position = "stack") +
  labs(x = "Week days", y = "24 hours") +
  guides(fill = guide_legend(title = "Device Status")) +
  scale_fill_manual(values = c('azure2', 'azure3', 'azure4', 'black'),
                    labels = c("Very Active", "Active", "Sedentary", "Downtime (Off)")) +
  ggtitle('Devices: how many of the 24 hours are spent in blackout?', subtitle = 'Status over the weekdays')



Device usage frequency over the week days

Make a table of frequencies and proportions to see the frequency of using the device:

In [None]:
# Order by week days

dayLabs<-c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday") 
weekly_device_activity$week_day <- factor(weekly_device_activity$week_day, levels= dayLabs)
device_usage_week_days<-weekly_device_activity[order(weekly_device_activity$week_day), ]

freq <- table(device_usage_week_days$week_day)
prop <- freq / nrow(device_usage_week_days)
usage_days_stat <- data.frame(week_days = names(freq),
                              freq = as.vector(freq),
                              prop = as.vector(prop))
View(usage_days_stat)



Prepare proportions on Usage

In [None]:
usage_days_stat$week_days <- factor(usage_days_stat$week_days, 
   levels = as.character(wday(c(2:7,1), label=TRUE, abbr=FALSE)))

ggplot(data = usage_days_stat, aes( x = week_days, y = prop*100)) +
  geom_bar(stat = "identity" ) +
  labs(x = "Week days", y = "Device use (%)") +
  ggtitle('Consumers: How they used devices? over the weekdays', subtitle = 'Device use frequency')
#Divide data 'usage_days_stat' data frame into quantile:

#usage_days_stat$week_days=as.integer(usage_days_stat$week_days)
#quantile(usage_days_stat$week_days)


**Individual Differences**

In [None]:
#Plot individual differences in calories expenditure:
boxplot(clean_activity_sleep_combined$calories,
        main='Calories: personal differences',
        xlab='Consumers',
        ylab='Calories(cal)',
        col='antiquewhite',
        border='black') 

#Plot individual differences in weight:
boxplot(clean_weight$weight_kg,
        main='Weight: personal differences',
        xlab='Consumers',
        ylab='Weight(Kg)',
        col='lightblue',
        border='black') 



**Findings**

*about the data*

* Missing values of sleep and weight logs
* Incomplete attributes: age, gender, and spatial aspects of device use are missing

*about the consumers differences*

* High individual differences in days of device usage
* Weight differences
* Calories expenditure differences

*about the devices*

* Activity tracking device has an average downtime of 4 hours a day
* Device Usage on physical activities is 4 hours a day


*about relationships and trends*

* Increases in an individual's daily calorie expenditure correlate more strongly with increases in daily minutes asleep than with increases in time in bed.
* Increases in total steps are associated with increases in calorie expenditure.
* In the weight data collected, heavier individuals tend to engage in less intense activities.
* More sedentary time is observed at the start and end of the working week (Monday and Friday).
* Device downtime is highest in the middle of the week.
