## Phase I: Ask
**Project goal**

Analyze smart device usage data to gain insight into how people are using their smart devices.

**Overall business goal**

Seek new growth opportunities in the smart device industry.

**Stakeholders**

Bellabeat executive team and Bellabeat's marketing strategy team

**Target completion date**

Within a week


## Phase II: Prepare

**Data source**

A Kaggle dataset, [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit), released into the public domain and collected as part of a [research study](https://www.researchprotocols.org/2017/4/e66/) published in JMIR Research Protocols.

**Data description**

Third-party, external, historical data in a structured, long format. The data has a low confidence level and a possibly high margin of error due to the small sample size of 30.

**Data organization**

The data follows a relational database model, with the tables linked by the user ID.

**Bias**

-   Sampling bias, as the sample does not represent the whole population.

    -   According to the study, the individuals who can afford wearable devices (FitBit) are mainly younger (between 18 and 34 years old), and results from MTurk workers may not generalize to other populations.

-   Data was collected only from people who were willing to be paid to use the device and have their non-identifiable data collected.

-   The data is missing the users' gender and age, which play an important role in drawing conclusions from the analysis.

**Data integrity**

| Criterion           | Assessment                                                                                             |
|---------------------|--------------------------------------------------------------------------------------------------------|
| Reliability         | Validity and uniqueness are checked in the cleaning process                                            |
| Original            | Data is third-party                                                                                    |
| Comprehensive       | Data follows a relational data model with easily read and understood CSV files, all linked by user ID |
| Current             | Data collected in 2016                                                                                 |
| Cited               | [research study](https://www.researchprotocols.org/2017/4/e66/)                                        |
| Ethics              | Study approved by the RTI International Institutional Review Board                                     |
| Licensing & Privacy | Individuals gave their consent for their data to be used                                               |
| Accessibility       | Open source and reusable in the public Kaggle database                                                 |
| Accuracy            | Objective data, as it is collected directly from the device                                            |
| Consistency         | Data collected directly from apps synced with the devices                                              |
| Completeness        | No missing data for each entry, after eliminating the battery life and syncing data                   |

## Phase III: Process

Before analysis, the data had to be cleaned: incorrect, inconsistent, incorrectly formatted, duplicate, or incomplete records were fixed or removed to make the dataset usable. The data cleaning and analysis are carried out in the R programming language. The cleaning steps below are written to be, to some extent, reusable for other datasets.

1.  Check whether the necessary packages are installed; if not, install them, then load them.

In [None]:
if(!require(tidyverse)) {
  install.packages("tidyverse"); 
  require(tidyverse)
} 
library(tidyverse)

if(!require(readr)) {
  install.packages("readr"); 
  require(readr)
} 
library(readr)
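
The install-then-load pattern above is repeated for every package in this notebook. A minimal sketch of a reusable helper is shown below; the name `ensure_packages` is hypothetical (it is not part of the original notebook), and it assumes a CRAN mirror is reachable whenever a package is actually missing.

```r
# Hypothetical helper (not in the original notebook): install any missing
# packages, then attach them all. Returns a named logical vector of load status.
ensure_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!installed]
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  vapply(pkgs, function(p) {
    suppressPackageStartupMessages(
      require(p, character.only = TRUE, quietly = TRUE)
    )
  }, logical(1))
}

# Base packages are used here so the example runs without installing anything
status <- ensure_packages(c("stats", "utils"))
print(status)
```

In the notebook itself the call would presumably look like `ensure_packages(c("tidyverse", "readr"))`.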

2.  Import data
    a.  Create a list of CSV file names to import
    b.  Extract the name of each file (11 is the number of characters in the "_merged.csv" suffix that is removed)
    c.  Load the files into separate data frames

In [None]:
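data_dir <- "/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16"

# `pattern` is a regular expression, so match a literal ".csv" suffix
file_list <- list.files(path = data_dir, pattern = "\\.csv$")

# File names without the trailing "_merged.csv" (11 characters)
file_names <- substr(file_list, 1, nchar(file_list) - 11)

# Load each file into a data frame named after the file (e.g. dailySteps_merged.csv)
for (i in file_list) {
  filepath <- file.path(data_dir, i)
  assign(i, read.csv(filepath))
}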
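
As an alternative to creating one global variable per file, the tables could be kept in a single named list. The sketch below is only an illustration: it writes two tiny made-up CSV files to a temporary directory so that it runs anywhere, and the names `demo_dir` and `tables` are assumptions, not part of the notebook.

```r
# Build a small stand-in directory so the sketch runs anywhere
# (file contents are made up for illustration)
demo_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), StepTotal = c(100, 200)),
          file.path(demo_dir, "dailySteps_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2), Calories = c(1800, 2100)),
          file.path(demo_dir, "dailyCalories_merged.csv"), row.names = FALSE)

csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file and name each data frame after its file,
# minus the "_merged.csv" suffix
tables <- lapply(csv_files, read.csv)
names(tables) <- sub("_merged\\.csv$", "", basename(csv_files))

print(names(tables))
print(tables$dailySteps)
```

A list keeps the global environment clean, so a later cleanup step with `rm()` would not be needed; the rest of this notebook, however, relies on the per-file variables created above.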

3.  Merge tables to create one table for each time frame: daily, hourly, and minute-level activity data

In [None]:
Daily <- merge(dailyIntensities_merged.csv,dailyCalories_merged.csv, by=c("Id","ActivityDay")) 
daily_total <- merge(Daily,dailySteps_merged.csv, by=c("Id","ActivityDay")) 

Hourly <- merge(hourlyIntensities_merged.csv,hourlyCalories_merged.csv, by=c("Id","ActivityHour")) 
hourly_total <- merge(Hourly,hourlySteps_merged.csv, by=c("Id","ActivityHour")) 

Minuite_N <- merge(minuteIntensitiesNarrow_merged.csv,minuteCaloriesNarrow_merged.csv, by=c("Id","ActivityMinute")) 
minuite_Narrow_total <- merge(Minuite_N,minuteStepsNarrow_merged.csv, by=c("Id","ActivityMinute")) 

Minuite_W <- merge(minuteIntensitiesWide_merged.csv,minuteCaloriesWide_merged.csv, by=c("Id","ActivityHour")) 
minuite_Wide_total <- merge(Minuite_W,minuteStepsWide_merged.csv, by=c("Id","ActivityHour")) 
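
The pairwise merges above can also be chained in one step. A minimal sketch with toy tables follows (the values are made up; only the column names follow the notebook), using base R's `Reduce` to fold `merge` over a list:

```r
# Toy tables standing in for the daily intensity, calorie, and step files
days <- c("4/12/2016", "4/13/2016", "4/12/2016")
intensities <- data.frame(Id = c(1, 1, 2), ActivityDay = days,
                          SedentaryMinutes = c(700, 650, 800))
calories <- data.frame(Id = c(1, 1, 2), ActivityDay = days,
                       Calories = c(1900, 2000, 1700))
steps <- data.frame(Id = c(1, 1, 2), ActivityDay = days,
                    StepTotal = c(5000, 7000, 3000))

# Merge two tables on the shared key columns
merge_by_keys <- function(x, y) merge(x, y, by = c("Id", "ActivityDay"))

# Fold the merge over all tables: ((intensities + calories) + steps)
daily_demo <- Reduce(merge_by_keys, list(intensities, calories, steps))
print(daily_demo)
```

This avoids the intermediate `Daily`, `Hourly`, and similar objects that otherwise have to be removed afterwards.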

4.  Remove the source tables that were merged, to clear the global environment and focus the work


In [None]:
rm(Daily)
rm(Hourly)
rm(Minuite_N)
rm(Minuite_W)
rm(dailyIntensities_merged.csv,dailyCalories_merged.csv,dailySteps_merged.csv)
rm(hourlyIntensities_merged.csv,hourlyCalories_merged.csv,hourlySteps_merged.csv)
rm(minuteIntensitiesWide_merged.csv,minuteCaloriesWide_merged.csv,minuteStepsWide_merged.csv)
rm(minuteIntensitiesNarrow_merged.csv,minuteCaloriesNarrow_merged.csv,minuteStepsNarrow_merged.csv)

5.  Check whether there is a significant number of missing values

   a.  Find the number of missing values in each data frame
   
   b.  Only 1 user entered a value in the Fat column, so remove this nearly empty column from weightLogInfo_merged (65 missing values)


In [None]:
sapply(ls(), function(x) sum(is.na(get(x))))
weightLogInfo_clean <- weightLogInfo_merged.csv[,-5]
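
Dropping the column by position (`[,-5]`) works for this file, but it is easy to break if the column order changes. A minimal sketch of a rule-based alternative is shown below, using a made-up table; the 50% threshold is an assumption chosen for illustration.

```r
# Made-up weight log: the Fat column is missing for 4 of 5 rows
weight_demo <- data.frame(
  Id = 1:5,
  WeightKg = c(52.6, 72.4, 69.7, 85.8, 61.5),
  Fat = c(22, NA, NA, NA, NA)
)

# Share of missing values in each column
na_share <- colMeans(is.na(weight_demo))
print(na_share)

# Keep only the columns where at most half of the values are missing
weight_clean_demo <- weight_demo[, na_share <= 0.5, drop = FALSE]
print(names(weight_clean_demo))
```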

6.  Identify duplicates

   a.  Form a list of all data frames
   
   b.  Check every data frame in dfs_list for duplicates and return the names of those that contain duplicated rows
   
   c.  Remove the duplicated rows from the identified data frames


In [None]:
# Collect every data frame currently in the environment
dfs_list <- Filter(is.data.frame, mget(ls()))

# Return the de-duplicated data frame if duplicates exist, otherwise NULL
find_dup <- function(x) {
  if (anyDuplicated(x) > 0) x[!duplicated(x), ] else NULL
}

df_dup_checked <- lapply(dfs_list, find_dup)

# Names of the data frames that contain duplicated rows
names(df_dup_checked)[!sapply(df_dup_checked, is.null)]

# Remove the duplicated rows from the data frames identified above
sleepDay_merged.csv <- sleepDay_merged.csv[!duplicated(sleepDay_merged.csv), ]
minuteSleep_merged.csv <- minuteSleep_merged.csv[!duplicated(minuteSleep_merged.csv), ]
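
To see what the duplicate check does, it can help to run it on a tiny example. The sketch below uses two made-up data frames (one with a repeated row) and reports how many duplicated rows each contains before removing them with `unique()`.

```r
# Made-up tables: the sleep table contains one exact duplicate row
sleep_demo <- data.frame(
  Id = c(1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(400, 400, 380, 410)
)
steps_demo <- data.frame(Id = 1:3, StepTotal = c(10, 20, 30))
dfs_demo <- list(sleep = sleep_demo, steps = steps_demo)

# Number of duplicated rows per data frame
dup_counts <- vapply(dfs_demo, function(df) sum(duplicated(df)), integer(1))
print(dup_counts)

# unique() on a data frame keeps the first occurrence of each row
deduped <- lapply(dfs_demo, unique)
print(vapply(deduped, nrow, integer(1)))
```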

7.  Transform the data by converting the date columns in each data frame from strings to dates

In [None]:
if(!require(lubridate)) {
  install.packages("lubridate"); 
  require(lubridate)
} 
library(lubridate)

dailyActivity_merged.csv$ActivityDate <- mdy(dailyActivity_merged.csv$ActivityDate)
daily_total$ActivityDay <- mdy(daily_total$ActivityDay)
heartrate_seconds_merged.csv$Time <- mdy_hms(heartrate_seconds_merged.csv$Time)
hourly_total$ActivityHour <- mdy_hms(hourly_total$ActivityHour)
minuite_Narrow_total$ActivityMinute <- mdy_hms(minuite_Narrow_total$ActivityMinute)
minuite_Wide_total$ActivityHour <- mdy_hms(minuite_Wide_total$ActivityHour)
minuteMETsNarrow_merged.csv$ActivityMinute <- mdy_hms(minuteMETsNarrow_merged.csv$ActivityMinute)
minuteSleep_merged.csv$date <- mdy_hms(minuteSleep_merged.csv$date)
sleepDay_merged.csv$SleepDay <- mdy_hms(sleepDay_merged.csv$SleepDay)
weightLogInfo_clean$Date <- mdy_hms(weightLogInfo_clean$Date)
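
For readers without lubridate, the same conversions can be sketched in base R. The example strings below are assumed to mirror the month/day/year format that `mdy()` and `mdy_hms()` expect; note that `%p` (AM/PM) parsing depends on the session locale, so this is an illustration under an English or C locale rather than a drop-in replacement.

```r
raw_day  <- c("4/12/2016", "5/1/2016")
raw_time <- c("4/12/2016 1:00:00 PM", "4/12/2016 12:00:00 AM")

# Date only: month/day/four-digit year
day_parsed <- as.Date(raw_day, format = "%m/%d/%Y")

# Date-time with a 12-hour clock and AM/PM marker, stored as UTC
time_parsed <- as.POSIXct(raw_time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(format(day_parsed))
print(format(time_parsed, "%Y-%m-%d %H:%M:%S"))
```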



## Phase IV: Analyze & Share

The goal of this phase is to draw conclusions from the data by sorting and formatting it to get different views, summarizing it in pivot-style tables, and creating visuals that build a story.

In this section we addressed a set of questions by:

-   Performing different calculations to obtain additional metrics.

-   Combining attributes from the different datasets to develop a more comprehensive picture.

1. How many users recorded their information in each data frame?

  a.  Install the required package
  b.  Create a table of the number of Ids in each data frame
  c.  Make it visual


In [None]:
    if(!require(dplyr)) {
      install.packages("dplyr"); 
      require(dplyr)
    } 
    library(dplyr)

    database <- c("daily_total")
    no_ID <- c(n_distinct(get("daily_total")$Id))
    users_df <- data.frame(database, no_ID)

    for (object_name in ls()){
      if(is.data.frame(get(object_name))){
        if(n_distinct(get("dailyActivity_merged.csv")$Id)!= n_distinct(get(object_name)$Id)){
          users_df <- rbind(users_df, list(object_name, n_distinct(get(object_name)$Id)))
        }
      }
    }

    if(!require(ggplot2)) {
      install.packages("ggplot2"); 
      require(ggplot2)
    } 
    library(ggplot2)

    # Plot the number of distinct users per data frame
    users_df %>% 
      filter(database != "users_df") %>% 
      ggplot(mapping = aes(x = database, y = no_ID)) +
      geom_bar(stat = "identity") +
      geom_label(aes(label = no_ID)) +
      theme(axis.text.x = element_text(angle = 45)) +
      labs(title = "Datasets with fewer than 33 Ids", 
           subtitle = "The total number of participants (users) in the study is 33. However, not all users wore
the device or recorded their information in all datasets", x = "Datasets", y = "Number of users")



It seems that only 24 of the 33 participants wore the device while sleeping (around 73% of the users). Also, only 8 participants entered their weight (24%).


The next step is to understand the daily activity of the 33 participants.

2. How many Ids wore the device daily? In other words, how many users did not record their data every day of the 31-day period?


In [None]:
# Count the recorded days per user, bin the number of missed days, and plot the shares
daily_total %>% 
  group_by(Id) %>% 
  summarize(days_registered = n()) %>% 
  mutate(days_missed = 31 - days_registered) %>% 
  # Bins are right-closed: 0, 1-5, 6-14, and 15-30 missed days
  mutate(missing_days = cut(days_missed, breaks = c(-1, 0, 5, 14, 30), labels = c("Zero", "1-5", "6-14", "15-30"))) %>% 
  count(missing_days) %>%
  mutate(percent = n / sum(n)) %>% 
  ggplot(mapping = aes(x = "", y = percent, fill = missing_days)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar("y", start = 0) + 
  geom_text(aes(label = paste0(round(percent * 100), "%")), position = position_stack(vjust = 0.5), color = "black") +
  scale_fill_manual(values = c("mistyrose", "pink", "red", "red4")) +
  labs(title = "More than half of the participants recorded data daily", subtitle = "Over the month, almost 64% of the participants recorded data every day,
and only 1 participant missed more than 20 days", x = NULL, y = NULL) + 
  theme_classic() + theme(axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank())
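
By default `cut()` builds right-closed intervals, which decides which bin a boundary value such as 5 missed days falls into. A small sketch with made-up `days_missed` values shows the intervals produced by the break points used above:

```r
# Hypothetical numbers of missed days, including the boundary values
days_missed <- c(0, 1, 5, 6, 14, 15, 27)

# Same break points as in the chart above, with the default interval labels
bins <- cut(days_missed, breaks = c(-1, 0, 5, 14, 30))

print(levels(bins))
print(table(bins))
```

The interval labels show that 5 belongs to the second bin and 14 to the third, so the last bin starts at 15 missed days.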


The data was analyzed with and without the participant who missed most of the month, and excluding them did not cause any significant change in the results. Therefore, the rest of the analysis includes all participants. Now, of those who wore the device, how many exercised daily?


In [None]:
# For each user, count the days with zero recorded steps, bin the counts, and plot the shares
daily_total %>% 
  select(c(Id, StepTotal)) %>% 
  group_by(Id) %>% 
  filter(StepTotal == 0) %>% 
  summarize(Missed_Days = n()) %>% 
  mutate(Missed_Days = cut(Missed_Days, breaks = c(0, 5, 9, 14), labels = c("<5", "<9", ">10"))) %>% 
  count(Missed_Days) %>%
  mutate(percent = n / 33) %>%
  # Users with no zero-step days make up the remaining share of the 33 participants
  add_row(Missed_Days = "Zero", n = 33L - sum(.$n), percent = 1 - sum(.$percent)) %>% 
  ggplot(mapping = aes(x = "", y = percent, fill = Missed_Days)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar("y", start = 0) + 
  geom_text(aes(label = paste0(round(percent * 100), "%")), position = position_stack(vjust = 0.5), color = "black") +
  scale_fill_manual(values = c("red4", "red", "pink", "mistyrose")) +
  labs(title = "About half of the participants walked daily", subtitle = "Among the participants who wore the device daily, 30% of the users 
recorded 0 steps on 5 days or fewer, while only 12% (4 participants) recorded 0
steps on 10 or more days", x = NULL, y = NULL) + 
  theme_classic() + theme(axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank())


3. Which day and time are the most popular for working out or walking?

    a. Group the data by Id and day of the week.
    
    b. Find the day of the week with the maximum number of steps for each Id.
    
    c. Plot the result.

In [None]:
daily_total$day_of_week<-weekdays(daily_total$ActivityDay)

df_group <- daily_total %>%
  group_by(Id, day_of_week) %>%
  summarize(total_steps = sum(StepTotal))

df_max <- df_group %>%
  group_by(Id) %>%
  summarize(max_day = day_of_week[which.max(total_steps)],
            max_steps = max(total_steps))

# Highest per-weekday step total for each user, colored by the day it occurred
ggplot(df_max, aes(x = as.numeric(as.factor(Id)), y = max_steps, fill = max_day)) +
  geom_col() +
  guides(fill = guide_legend(title = "Day of the week")) +
  scale_fill_manual(values = c("red", "orange", "yellow", "green", "blue", "purple")) +
  labs(title = "On which day of the week did the 33 users take
the maximum number of steps?", subtitle = "According to the data, Tuesday and Wednesday are the most preferred days to work 
out among the participants", x = "Samples", y = "Max steps") 

# Number of users whose most active day falls on each day of the week
df_max %>% 
  count(max_day) %>%
  ggplot(mapping = aes(x = max_day, y = n)) +
  geom_bar(stat = "identity", fill = c("red", "orange", "yellow", "green", "blue", "purple")) +
  geom_label(aes(label = n)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(title = "Tuesday is the most popular day of the week to walk", subtitle = "In general, half of our sample walked the most on Tuesday, followed by 9 users 
who preferred Wednesday", x = "Day of the week", y = "No. of users")
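
The "most active day per user" logic in steps a and b can be checked on a small made-up table. The sketch below uses base R only (`aggregate`, `split`, and `which.max`); the data values are hypothetical.

```r
# Made-up daily step records for two users
steps_demo <- data.frame(
  Id = c(1, 1, 1, 2, 2, 2),
  day_of_week = c("Monday", "Tuesday", "Tuesday", "Monday", "Saturday", "Saturday"),
  StepTotal = c(4000, 6000, 5000, 9000, 3000, 2000)
)

# Step a: total steps per user and day of the week
totals <- aggregate(StepTotal ~ Id + day_of_week, data = steps_demo, FUN = sum)

# Step b: for each user, keep the row with the largest total
best_day <- do.call(rbind, lapply(split(totals, totals$Id), function(d) {
  d[which.max(d$StepTotal), ]
}))
print(best_day)
```

User 1 walks 11000 steps across the two Tuesdays versus 4000 on Monday, so Tuesday is selected; user 2's single Monday (9000) beats the two Saturdays combined (5000).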


Next, at what time of day do participants prefer to walk?

In [None]:
Timestep <- hourly_total %>%
  group_by(Id, ActivityHour) %>%
  mutate(hours = hour(ActivityHour)) %>% 
  mutate(time_of_day =  cut(hours, breaks = c(-1, 12, 18, 24),  labels = c("morning", "afternoon", "evening")))

Timestep_max <- Timestep %>%
  group_by(Id, time_of_day) %>%
  summarize(total_steps = sum(StepTotal)) %>% 
  group_by(Id) %>%
  summarize(max_day   = time_of_day[which.max(total_steps)],
            max_steps = max(total_steps)) %>% 
  count(max_day) %>%
  mutate(n) %>% 
  mutate(colours     = c("yellow", "red", "darkblue")) 



After building the Timestep_max table, a custom function is used to plot a half-donut chart. The Timestep_max data is passed to the HPie function.

In [None]:
HPie <- function(labels, amount, cols = NULL, repr=c("absolute", "proportion")) {
  library(ggforce)
  repr = match.arg(repr)
  stopifnot(length(labels) == length(amount))
  if (repr == "proportion") {
    stopifnot(isTRUE(all.equal(sum(amount), 1)))  # tolerate floating-point rounding
  }
  if (!is.null(cols)) {
    names(cols) <- labels
  }
  
  # arc start/end in rads
  cc <- cumsum(c(-pi/2, switch(repr, "absolute" = (amount / sum(amount)) * pi, "proportion" = amount * pi)))
  cc[length(cc)] <- pi/2

  # get angle of arc midpoints
  meanAngles <- colMeans(rbind(cc[-1], cc[-length(cc)]))

  # unit circle
  labelX <- sin(meanAngles)
  labelY <- cos(meanAngles)
  labelY <- ifelse(labelY < 0.015, 0.015, labelY)
  
  #Plot
  p <- ggplot() + theme_no_axes() + coord_fixed() +
    expand_limits(x = c(-1.3, 1.3), y = c(0, 1.3)) + 
    theme(panel.border = element_blank()) +
    theme(legend.position = "none") +
    
    geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0.5, r = 1,
                     start = cc[1:length(amount)], 
                     end = c(cc[2:length(amount)], pi/2), fill = labels)) +
    
    switch(is.null(cols)+1, scale_fill_manual(values = cols), NULL) + 
    
    # for label and line positions, just scale sin & cos to get in and out of arc
    geom_path(aes(x = c(0.9 * labelX, 1.15 * labelX), y = c(0.9 * labelY, 1.15 * labelY),
                  group = rep(1:length(amount), 2)), colour = "white", linewidth = 2) +
    geom_path(aes(x = c(0.9 * labelX, 1.15 * labelX), y = c(0.9 * labelY, 1.15 * labelY),
                  group = rep(1:length(amount), 2)), linewidth = 1) +
    
    geom_label(aes(x = 1.15 * labelX, y = 1.15 * labelY, 
                   label = switch(repr,
                                  "absolute" = sprintf("%s\n%i", labels, amount),
                                  "proportion" = sprintf("%s\n%i%%", labels, round(amount*100)))), fontface = "bold", 
               label.padding = unit(1, "points")) +
    
    geom_point(aes(x = 0.9 * labelX, y = 0.9 * labelY), colour = "white", size = 2) +
    geom_point(aes(x = 0.9 * labelX, y = 0.9 * labelY)) +
    geom_text(aes(x = 0, y = 0, label = switch(repr, 
                                               "absolute" = (sprintf("Total: %i Users", sum(amount))), 
                                               "proportion" = "")),
              fontface = "bold", size = 7) +
    labs(title="More than half of the participants prefer morning workouts", subtitle="In general, 18 of the 33 users exercised most in the morning")
  
  
  return(p)
}

HPie(Timestep_max$max_day, Timestep_max$n, cols = Timestep_max$colours)
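
The geometry inside `HPie` is easier to follow with numbers. The sketch below reproduces only the angle calculation: each category receives a share of the half circle from -pi/2 to pi/2, and labels sit at the midpoint of each arc. The morning count of 18 comes from the text above; the afternoon and evening counts are hypothetical values chosen to sum to 33.

```r
# 18 is stated in the text; 10 and 5 are made up for illustration
amount <- c(morning = 18, afternoon = 10, evening = 5)

# Each category gets its share of pi radians, starting at -pi/2
share  <- amount / sum(amount)
bounds <- cumsum(c(-pi / 2, share * pi))

# Midpoint angle of every arc, used to place the labels
mid <- (bounds[-1] + bounds[-length(bounds)]) / 2

arcs <- data.frame(
  label   = names(amount),
  start   = bounds[-length(bounds)],
  end     = bounds[-1],
  label_x = sin(mid),
  label_y = cos(mid)
)
print(arcs)
```

Because every midpoint lies between -pi/2 and pi/2, `label_y` is never negative, which keeps the labels on the upper half of the plot.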


4. Do you burn more calories running or walking?

In [None]:
if(!require(RColorBrewer)) {
      install.packages("RColorBrewer"); 
      require(RColorBrewer)
    } 
  
library("RColorBrewer")
minuite_Narrow_total$intensity_category <- cut(minuite_Narrow_total$Intensity, breaks = 4, labels = c("Sedentary","LightActive","FairlyActive","VeryActive"))

minuite_Narrow_total$intensity_category <- factor(minuite_Narrow_total$intensity_category, levels = c("VeryActive", 
                                                              "FairlyActive", "LightActive", "Sedentary"))

# Mean calories burnt per minute for each intensity category
minuite_Narrow_total %>% 
  select(Id, Intensity, intensity_category, Calories) %>%
  group_by(intensity_category) %>% 
  summarise(mean_Calories = mean(Calories)) %>% 
  ggplot(mapping = aes(x = factor(intensity_category), y = mean_Calories, fill = intensity_category)) + 
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#FC0000", "#FFCA59", "#FFFFB3", "brown")) +
  scale_x_discrete(labels = c("Very Active", "Fairly Active", "Lightly Active", "Sedentary")) +
  labs(title = "Higher intensity burns more calories", subtitle = "In general, workout intensity has a positive relationship with calories burnt",
       x = "Intensity category", y = "Mean calories")
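
The labelling with `cut(..., breaks = 4)` above works because the intensity codes span the full range 0 to 3. A minimal sketch of what happens otherwise, and of an explicit alternative, is shown below; the mapping of codes 0 to 3 onto the four labels is an assumption inferred from the label order used in this notebook.

```r
labels4 <- c("Sedentary", "LightActive", "FairlyActive", "VeryActive")

# With all four codes present, equal-width bins line up with the codes
intensity <- c(0, 1, 2, 3, 0, 3)
by_cut  <- cut(intensity, breaks = 4, labels = labels4)
by_code <- factor(intensity, levels = 0:3, labels = labels4)
print(identical(as.character(by_cut), as.character(by_code)))

# If code 3 never occurs, cut() rescales the bins to the range 0..2,
# so code 2 ends up in the top bin
partial <- c(0, 1, 2)
print(as.character(cut(partial, breaks = 4, labels = labels4)))
print(as.character(factor(partial, levels = 0:3, labels = labels4)))
```

Mapping the codes explicitly with `factor()` does not depend on which values happen to be present in the data.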


To gain a more thorough understanding, the analysis was conducted on an individual basis for each participant, examining how they spent their time over the course of the month.

In [None]:
intensity_per_min <- daily_total %>% 
  select(Id,SedentaryMinutes,LightlyActiveMinutes,FairlyActiveMinutes,VeryActiveMinutes) %>% 
  group_by(Id) %>%
  summarise(SedentaryMinutes_m = mean(SedentaryMinutes),
            LightlyActiveMinutes_m = mean(LightlyActiveMinutes),
            FairlyActiveMinutes_m = mean(FairlyActiveMinutes),
            VeryActiveMinutes_m = mean(VeryActiveMinutes)) %>% 
  pivot_longer(cols=c("SedentaryMinutes_m","LightlyActiveMinutes_m",
                      "FairlyActiveMinutes_m","VeryActiveMinutes_m"),
                    names_to='Active_Intensity',
                    values_to='Average_minutes')

intensity_per_min$Active_Intensity <- factor(intensity_per_min$Active_Intensity, levels = c("VeryActiveMinutes_m", "FairlyActiveMinutes_m", 
                                                             "LightlyActiveMinutes_m", "SedentaryMinutes_m")) 

ggplot(data = intensity_per_min, mapping =  aes(x = as.numeric(as.factor(Id)), y=Average_minutes, fill = Active_Intensity )) +
  geom_bar(stat = "identity", position = "stack")+
  scale_fill_manual(values=c(  "#FC0000", "#FFCA59", "#FFFFB3", "brown")) +
  labs(title="Participants spent most of their time in a sedentary state",subtitle = "The majority of the participants spent their time in a sedentary state, with only a small 
amount of time dedicated to high-energy walking." ,x = "Sample", y = "Average minutes per day") +
  guides(fill=guide_legend(title="Activity intensity"))



The data does not provide the calories burnt per distance moved; it provides the calories, intensity, and steps taken per minute. From these, we observed a very interesting relationship.

In [None]:
minuite_Narrow_total %>% 
  select(everything()) %>% 
  group_by(Id, intensity_category) %>% 
  summarise(average_steps = mean(Steps),
            average_calories =mean( Calories))%>% 
  ggplot(mapping = aes(x=average_steps, y = average_calories))+
  geom_point(aes(colour=intensity_category))+
  scale_color_manual(values=c( "VeryActive"="#FC0000", "FairlyActive"="#FFCA59", "LightActive"="#FFFFB3", "Sedentary"="brown"))+
  geom_smooth(method = 'loess')+
  labs(title="The more steps taken at a given intensity, the more calories are burnt",subtitle = "The faster you walk (speed), the more calories you burn, hence the 4 isolated very active (red) 
points with few steps. The same steps in a shorter time burn more calories. The plot covers 
the averages over the entire month" ,x = "Average steps", y = "Average calories") 


5. How high or low does the heart rate usually get during walking?

    a. Aggregate the per-second heartrate_seconds_merged.csv table to per-minute values
    
    b. Find the average heart rate per minute
    
    c. Merge the heart rate with the calories-per-minute data and plot

In [None]:
heartrate_seconds_merged.csv$Activeminutes <- as.character(cut(heartrate_seconds_merged.csv$Time, breaks = "1 min"))

heartrate_minutes <- heartrate_seconds_merged.csv %>% 
  select(Id, Activeminutes, Value) %>% 
  group_by(Id, Activeminutes) %>% 
  summarise(heart_rate = mean(Value))

# Parse as UTC so the timestamps match those produced by mdy_hms() in the join below
heartrate_minutes$Activeminutes <- as.POSIXct(heartrate_minutes$Activeminutes, tz = "UTC")

colnames(minuite_Narrow_total)[2] ="Activeminutes"
heartrate_id <- distinct(heartrate_minutes,Id)
new_minuite_Narrow_total <- filter(minuite_Narrow_total, minuite_Narrow_total$Id %in% heartrate_id$Id)

heartwithexercise <- inner_join(heartrate_minutes, new_minuite_Narrow_total, by=c('Id'='Id', 'Activeminutes'='Activeminutes'))

ggplot(data = heartwithexercise,  aes(x=heart_rate, y=Calories,colour=intensity_category))+
  geom_hex()+
  scale_colour_manual(values=c(  "#FC0000", "#FFCA59", "#FFFFB3", "brown")) +
  labs(title="Heart rate range of all participants during the month per activity",subtitle = "In general, a healthy heart rate is typically considered to be between 60 and 100 beats 
per minute at rest. However, this depends significantly on age and other factors.
The maximum heart rate is around 210 beats per minute (bpm)." ,x = "Heart rate (bpm)", y = "Calories per min (kcal)") 


The results shown above were collected from the 14 participants with registered heart rates. However, for higher accuracy and confidence, more data (age, body fat, ...) from a bigger sample is required.
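
The per-second to per-minute aggregation from steps a and b can be illustrated on a few made-up readings. This sketch uses base R's `trunc()` to floor each timestamp to its minute instead of `cut()`; the heart rate values are hypothetical.

```r
# Made-up per-second heart rate readings for one user
hr_demo <- data.frame(
  Id = c(1, 1, 1, 1),
  Time = as.POSIXct(c("2016-04-12 07:21:00", "2016-04-12 07:21:05",
                      "2016-04-12 07:21:50", "2016-04-12 07:22:10"), tz = "UTC"),
  Value = c(97, 102, 101, 110)
)

# Floor every timestamp to the start of its minute
hr_demo$Minute <- as.POSIXct(trunc(hr_demo$Time, units = "mins"))

# Average heart rate per user and minute
hr_per_min <- aggregate(Value ~ Id + Minute, data = hr_demo, FUN = mean)
print(hr_per_min)
```

The three readings in minute 07:21 average to 100 bpm, and the single reading in 07:22 stays at 110 bpm.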

In [None]:
heartwithexercise %>% 
  select(Id, heart_rate, Calories,intensity_category) %>% 
  group_by(Id, intensity_category) %>% 
  summarise(average_rate = mean(heart_rate) , average_calories = mean(Calories)) %>% 
  ggplot(mapping = aes(x = average_calories  , y = average_rate , fill = intensity_category , colour = as.factor(Id) , group = as.factor(Id))) +
  geom_line()+
  geom_point(size = 4, shape = 21)+
  scale_fill_manual(values=c(  "#FC0000", "#FFCA59", "#FFFFB3", "brown")) +

  labs(title = "Average calories burnt for each activity intensity with the corresponding 
average heart rate per participant", 
       subtitle = "A positive correlation is shown between heart rate and calories burnt. 
With lower-intensity walking, fewer calories are burnt and the heart rate is lower.",
       x = "Average calories burnt (kcal)" , y= "Average heart rate (bpm)")


6. Does weight affect the heart rate and calories burnt?

Only 4 participants reported both their weight and heart rate. It is important to note that the relationship between weight and heart rate is not necessarily linear, and many other factors can influence heart rate, including age, gender, physical fitness, and medical conditions. Therefore, more information is needed to investigate further.


In [None]:
weight <- weightLogInfo_clean %>% 
  select(Id, Date, WeightKg,BMI) %>% 
  group_by(Id) %>% 
  summarise(Average_BMI = mean(BMI), average_kg = mean(WeightKg)) %>% 
  merge(heartwithexercise, by = "Id") 

weight %>% 
  select(Id, heart_rate, Calories, intensity_category, average_kg, Average_BMI) %>% 
  group_by(Id, intensity_category) %>% 
  # Average_BMI is constant per Id, so keep a single value per group
  summarise(Average_BMI = first(Average_BMI), average_rate = mean(heart_rate), average_calories = mean(Calories)) %>% 
  ggplot(mapping = aes(x = average_calories, y = average_rate, fill = intensity_category, color = Average_BMI, group = Average_BMI)) +
  geom_line(linewidth = 0.5) +
  geom_point(size = 4, shape = 21) +
  scale_fill_manual(values = c("#FC0000", "#FFCA59", "#FFFFB3", "brown")) +
  labs(title = "Average heart rate and calories burnt plotted for each participant 
with respect to their average BMI", 
subtitle = "A healthy BMI is typically considered to be between 18.5 and 24.9, while a participant
with a BMI of 25 or above is generally considered to be overweight.",
       x = "Average calories burnt (kcal)" , y= "Average heart rate (bpm)")


It is important to note that BMI is only a rough estimate of body fat and may not be accurate for all individuals. For example, highly trained athletes or people with a lot of muscle mass may have a higher BMI but still be considered to be at a healthy weight. Similarly, older adults may have a lower BMI but still be considered to be at a healthy weight due to age-related changes in body composition.

7. How can exercise/walking affect sleep?

In [None]:
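# Convert SleepDay (date-time) to a Date so it has the same type as ActivityDay before merging
sleep_daily <- sleepDay_merged.csv
sleep_daily$SleepDay <- as.Date(sleep_daily$SleepDay)
daily_active_sleep <- merge(x = daily_total, y = sleep_daily, by.x = c("Id", "ActivityDay"), by.y = c("Id", "SleepDay"))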
library(magrittr)
library(dplyr)

behavior<-daily_active_sleep %>% 
  select(Id, ActivityDay, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, TotalMinutesAsleep, TotalTimeInBed) %>% 
  group_by(Id, ActivityDay) %>% 
  summarise(percentage_inactive = (SedentaryMinutes/1440)*100,
            percentage_active = ((LightlyActiveMinutes+ FairlyActiveMinutes+ VeryActiveMinutes)/1440)*100,
            percentage_sleep = (TotalMinutesAsleep/1440)*100,
            percentage_InBed = (TotalTimeInBed/1440)*100,
            TotalMinutesAsleep =TotalMinutesAsleep,
            time_to_sleep = TotalTimeInBed - TotalMinutesAsleep)

behavior %>% 
  select( everything()) %>%
  group_by(Id) %>% 
  summarise(active = mean(percentage_active),
            inactive = mean(percentage_inactive),
            sleep = mean(percentage_sleep),
            InBed = mean(percentage_InBed)) %>% 
  pivot_longer(cols=c("inactive","active","InBed"),
               names_to='Activity',
               values_to='percentages')  %>% 
  ggplot(mapping =   aes(x = (as.numeric(as.factor(Id))), y=percentages, fill = Activity   )) +
  geom_bar(stat = "identity", position = "stack")+ 
  scale_fill_manual(values=c( "indianred4", "darksalmon", "burlywood1")) +
  labs(title = "Average share of each activity during the month 
  for each participant", 
      subtitle = "In general, participants spent most of their month inactive, with around 30% of
the time in bed and about 25% active",  
      x = "Sample" , y= "Percentage")
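
The percentages above are each category's minutes divided by the 1440 minutes in a day. A short worked example with made-up minutes for a single day shows the arithmetic and a quick sanity check that the shares add up:

```r
# Hypothetical minutes for one day (chosen to sum to 1440)
day <- c(sedentary = 720, active = 300, in_bed = 420)

# Share of the 1440 minutes in a day, in percent
pct <- day / 1440 * 100
print(round(pct, 1))

# Sanity check: for a fully recorded day the shares should total 100%
print(sum(pct))
```

With real data the shares may not total exactly 100%, for example on days when the device was not worn the whole time, so a check like this can flag days worth inspecting.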




In [None]:

daily_active_sleep1<-daily_active_sleep %>% 
  select(Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, TotalMinutesAsleep, TotalTimeInBed) %>% 
  group_by(Id) %>% 
  drop_na() %>% 
  summarise(LightActive = sum(LightlyActiveMinutes), 
            fairlyActive = sum(FairlyActiveMinutes), 
            VeryActive = sum(VeryActiveMinutes),
            Sedentary = sum(SedentaryMinutes),
            time_to_sleep = sum(TotalTimeInBed) -sum(TotalMinutesAsleep),
            av_TotalMinutesAsleep = sum(TotalMinutesAsleep)) %>% 
  pivot_longer(cols=c("LightActive","fairlyActive","VeryActive","Sedentary"),
               names_to='Active_Intensity',
               values_to='Average_Active_minutes') 

daily_active_sleep1$Active_Intensity <- factor(daily_active_sleep1$Active_Intensity, levels = c("VeryActive","fairlyActive","LightActive","Sedentary")) 

ggplot(data = daily_active_sleep1,  aes(x = (as.numeric(as.factor(Id))), y=Average_Active_minutes, fill = Active_Intensity   )) +
  geom_bar(stat = "identity", position = "stack")+
  geom_line(aes(x=as.numeric(as.factor(Id)), y=av_TotalMinutesAsleep), colour = "black", linewidth = 1)+
  scale_fill_manual(values=c(  "#FC0000", "#FFCA59", "#FFFFB3", "brown")) +
  labs(title="Activity intensity vs total time",
       subtitle ="The sedentary minutes are highly correlated with the sleep line",
       x = "Sample", y = "Total time (min)") +
  guides(fill=guide_legend(title="Activity intensity"))


It clearly shows that the total sedentary time of all participants is greater than their active time. If we exclude the sedentary time, we see the following.

In [None]:
daily_active_sleep %>% 
  select(Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes, TotalMinutesAsleep, TotalTimeInBed) %>% 
  group_by(Id) %>% 
  summarise(av_LA = sum(LightlyActiveMinutes), 
            av_FA = sum(FairlyActiveMinutes), 
            av_VA = sum(VeryActiveMinutes),
            sum_A = (av_LA+av_FA+av_VA),
            av_TotalMinutesAsleep = sum(TotalMinutesAsleep)) %>% 
  ggplot(mapping = aes(x = av_TotalMinutesAsleep  , y=sum_A))+
  geom_point(aes(color=factor(Id)))+
  geom_smooth()+
  labs(title="Total active time vs total minutes slept for each participant",
       subtitle = "The more active a person is, the more time they sleep to recover",x="Sleeping time (min)",  y="Active time (min)")
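
The relationship between active time and sleep is read off the plot visually; a correlation coefficient would make the claim checkable. The sketch below uses made-up totals for six participants purely to show the calls; it does not reproduce the study's numbers.

```r
# Hypothetical monthly totals (minutes) for six participants
sleep_min  <- c(380, 420, 450, 400, 500, 360)
active_min <- c(200, 300, 260, 230, 340, 180)

# Pearson measures linear association; Spearman only uses the rank order
r   <- cor(active_min, sleep_min, method = "pearson")
rho <- cor(active_min, sleep_min, method = "spearman")

print(round(c(pearson = r, spearman = rho), 3))
```

On the real data, `cor(sum_A, av_TotalMinutesAsleep)` on the summarised table would presumably be the equivalent call, and with so few participants the result should be read with caution.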



## Phase V: Act

From the findings we can conclude:
- A more comfortable device is required to make it easier for people to wear it throughout the entire day. 
- The device could connect wirelessly to a scale to automatically record the participant's weight. 
- A reward or goal system could be added to the device to motivate, notify, and remind users of their progress and daily walk.

However, the data is biased, small, and incomplete, so it is not fully representative of the population. A bigger dataset with more comprehensive data, including age, gender, region, etc., would improve the overall outcome of the analysis. Decisions based on it would most likely be better informed, and the added transparency would lend more support to the findings.
