## Let's Start With a Question: What Is Bellabeat?
* Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen is the co-founder and Chief Creative Officer of Bellabeat.

## The Scenario
*  You are a junior data analyst working on the marketing analyst team at Bellabeat. Urška Sršen believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

## In This Session We Will Follow the General Data Analysis Lifecycle
*  **Ask -> Prepare -> Process -> Analyze -> Share -> Act.** 
 
* **So, without wasting any more time, let's dive right in.**



## 1. Ask Step
* This is an important step: asking the right questions helps you reach your goal faster and with more accurate conclusions. So what should we ask about? We need questions that define the problem, set out what is expected of a solution, and identify the key stakeholders.



**a.	What is the problem we are trying to solve?**
* Analyze smart device fitness data to gain insights and explore trends in how consumers use their smart devices.


**b.	When can we say that we have solved the problem?**
*  When the insights we discover help guide the marketing strategy for the company.


**c.	Who are the key stakeholders?**
*  **1. Urška Sršen**: Bellabeat’s co-founder and Chief Creative Officer. 
*  **2. Sando Mur**: Mathematician and Bellabeat’s co-founder.


**Note: All of these answers are extracted from the scenario; go back to it to see exactly what I mean.**

## 2. Prepare Step
* Now that we have our answers, we know what data we are looking for: fitness data. So let's search for it. Here is what I found:

* [FitBit Fitness Tracker Data](https://www.kaggle.com/datasets/arashnic/fitbit) (CC0: Public Domain, dataset made available through Mobius): This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users.

###	Data checkpoints R-O-C-C-C: (Reliable, Original, Comprehensive, Current, Cited)

a.	The data comes from a **Reliable** source, as it was collected via Amazon Mechanical Turk.

b.	It is not internal data and was not collected by Bellabeat, so it is **not Original data** (third-party data).

c.	**Not Comprehensive**, as there are only 33 users, which is too small a sample for the study and may introduce bias.

d.	The data is **Not Current**, as it was collected in 2016.

e.	The data is **Cited** by many who have used it.


### Data Deep Exploration: (Security, Structure, Organization, Integrity)

a.  The data is **public domain and open-source**, which means you can copy, edit, and publish it without permission.

b.	As it's a small dataset, the data is stored in 18 **spreadsheets in long format**.

c.	The data is already **sorted** by date, and there is **no filter** applied.

d.	The data is accurate and complete, but not consistent.

## Preparation Step Conclusion
* The data has limitations and is **not comprehensive**, as there are just 33 users, but we can rely on it for this study. If you are looking for better accuracy and less biased data, add other data to your study, but never forget to check the preparation points again for any new data you add.


## 3. Process Step
* Now that we have finished exploring the data, it's time to take action and get our hands dirty with it. 
* R is a powerful programming language for data analysis and statistics, as well as for cleaning and preprocessing data, so we will use it to get better insights and accurate conclusions. Let's go.

### R Processing checkpoints

* **a. Load packages and datasets.**
* **b. Preview the data and make sure it loaded successfully.**
* **c. Core cleaning (check formats, remove duplicates, handle null values, clean column names, fix inconsistencies and misspellings).**
* **d. The final step is to merge datasets if needed.**

## Import Packages 
* First of all, we have to install and load the packages we need for cleaning, preprocessing, and analysis.
### Packages Needed 
* **tidyverse :** collection of R packages designed for data science.
* **here :** here package is to enable easy file referencing.
* **skimr :** compact and Flexible Summaries of Data.
* **lubridate :** deals with date-time data in an easier way.
* **ggrepel :** ggrepel provides geoms for ggplot2 to repel overlapping text labels. 
* **janitor:** for examining and cleaning dirty data. 

In [None]:
# 1. The first time, use install.packages("package_name") to install each package before loading it. 
# 2. To load packages, we use the library() function. 

library(tidyverse)   # collection of R packages designed for data science.
library(here)        # here package is to enable easy file referencing.
library(skimr)       # compact and Flexible Summaries of Data.
library(lubridate)   # deal with date-time data in easier way.
library(ggrepel)     # ggrepel provides geoms for ggplot2 to repel overlapping text labels. 
library(janitor)     # for examining and cleaning dirty data. 


## DataSets Import
* As we can see, there are 18 files in [fitness_data](https://www.kaggle.com/arashnic/fitbit), but we will select only the files with the most useful data for our expected conclusions. This is similar to feature selection in machine learning and deep learning. So we will select:

* **dailyActivity.**
* **dailySleep.**
* **hourlySteps.**
* **hourlyIntensity.**

* Note: we will not consider weight (8 users) or heart rate (7 users), as they do not cover enough users.


In [None]:
# to import the datasets we use the read_csv("file_path") function from readr, one of the packages bundled in tidyverse.
# note: when assigning to new variables, use descriptive lowercase names with underscores between words

daily_activity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
daily_sleep <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
hourly_steps <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
hourly_intensity <- read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")

## Preview the Data 
* Congratulations, you have loaded the data. But wait, we have to make sure that the data is fine. We have several options for looking at it:

* **view()** : tibble package.
* **str()** : utils package.
* **head()** : utils package.
* **select()** : dplyr package (nested package in tidyverse).
* **glimpse()** : dplyr package (nested package in tidyverse).

In [None]:
# we will use the head(object, n) function to see the first rows (n = 6 by default).
head(daily_activity)
head(daily_sleep)
head(hourly_steps)
head(hourly_intensity)

In [None]:
# We will use the str(object, ...) function to see a summary of the data (column data types, number of rows, number of columns).
str(daily_activity)
str(daily_sleep)
str(hourly_steps)
str(hourly_intensity)

## Core Cleaning 

* **First, let's check the number of users using the n_unique(x) function from the skimr package.**


In [None]:
# calculate number of unique(distinct) records (Id: for distinct users). 
n_unique(daily_activity$Id)
n_unique(daily_sleep$Id)
n_unique(hourly_steps$Id)
n_unique(hourly_intensity$Id)
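
Counting distinct IDs tells us how many users each table covers, but not how many records each user contributed. Here is a minimal sketch of both checks in base R. The `toy_activity` data frame and its values are hypothetical, made up for illustration and not taken from the Fitbit files.

```r
# Hypothetical toy data standing in for daily_activity (values are made up).
toy_activity <- data.frame(
  Id = c(1, 1, 1, 2, 2, 3),
  TotalSteps = c(5000, 7200, 6100, 300, 0, 11000)
)

# Number of distinct users (the same idea as n_unique(toy_activity$Id)).
n_users <- length(unique(toy_activity$Id))

# Number of records per user, to spot users with very few logged days.
records_per_user <- table(toy_activity$Id)

n_users
records_per_user
```

A user with only a handful of records may deserve a second look before being included in averages.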

* **Look for any duplicates using sum(duplicated(data)).**

In [None]:
# note: is.na() tests every cell for a missing value; is_null() only tests whether the whole object is NULL,
# so it would always give 0 for a data frame.

# calculate the number of duplicate rows and missing values for the daily_activity dataset 
sum(duplicated(daily_activity))
sum(is.na(daily_activity))

# calculate the number of duplicate rows and missing values for the daily_sleep dataset 
sum(duplicated(daily_sleep))
sum(is.na(daily_sleep))

# calculate the number of duplicate rows and missing values for the hourly_steps dataset 
sum(duplicated(hourly_steps))
sum(is.na(hourly_steps))

# calculate the number of duplicate rows and missing values for the hourly_intensity dataset 
sum(duplicated(hourly_intensity))
sum(is.na(hourly_intensity))
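
To see how the duplicate and missing-value counts behave, here is a small sketch on a hypothetical toy table (the data frame and its values are made up). It uses base R's `is.na()`, which tests each cell, together with `colSums()` to show which column the gaps are in.

```r
# Hypothetical toy data with one repeated row and one missing value.
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  TotalMinutesAsleep = c(400, 400, NA, 380)
)

# Rows that exactly repeat an earlier row.
n_duplicates <- sum(duplicated(toy_sleep))

# Missing values in total and per column.
n_missing <- sum(is.na(toy_sleep))
missing_by_column <- colSums(is.na(toy_sleep))

n_duplicates
n_missing
missing_by_column
```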


* **As we see, there are just 3 duplicated rows in daily_sleep, so now let's remove duplicates and null values using distinct() and drop_na().**

In [None]:
# remove duplicates and rows with missing values, then store the result back in daily_sleep using the pipe 
daily_sleep <- daily_sleep %>% distinct() %>% drop_na()

# make sure that all duplicates were removed (this should return 0) 
sum(duplicated(daily_sleep))


* **Clean the column names: we have to ensure that all column names use the right format and syntax (lowercase, not starting with a number or symbol, using underscores to split words).**

In [None]:
# clean_names() returns a cleaned copy of the data frame. It is not assigned here,
# so these calls only preview the snake_case names (e.g. total_steps).
# rename_with(tolower) is what actually renames the columns; the rest of this
# notebook relies on those all-lowercase names (e.g. totalsteps, activitydate).
clean_names(daily_activity)
daily_activity <- rename_with(daily_activity, tolower)

clean_names(daily_sleep)
daily_sleep <- rename_with(daily_sleep, tolower)

clean_names(hourly_steps)
hourly_steps <- rename_with(hourly_steps, tolower)

clean_names(hourly_intensity)
hourly_intensity <- rename_with(hourly_intensity, tolower)
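
The two renaming approaches give different results, which matters because later cells refer to all-lowercase names such as `totalsteps`. Here is a short sketch on a hypothetical toy table, assuming the janitor package is installed. The `toy` data frame is made up; only its column-name style mirrors the raw files.

```r
library(janitor)

# Hypothetical toy table with column names in the style of the raw Fitbit files.
toy <- data.frame(Id = 1, ActivityDate = "4/12/2016", TotalSteps = 5000)

# clean_names() returns a NEW data frame with snake_case names;
# the original is unchanged unless the result is assigned.
snake_names <- names(clean_names(toy))

# tolower() only lowercases, without inserting underscores.
lower_names <- tolower(names(toy))

snake_names  # "id" "activity_date" "total_steps"
lower_names  # "id" "activitydate" "totalsteps"
names(toy)   # still "Id" "ActivityDate" "TotalSteps"
```

Either convention works, as long as the same one is used consistently in every later cell.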

* **Consistency of date and time columns**


In [None]:
# run this cell only once, because afterwards the column names (activitydate, sleepday) will have changed.
# using the pipe, the rename function changes the column name and the mutate function changes the column format.
# in hourly_steps and hourly_intensity, convert the column to date-time.

daily_activity <- daily_activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

daily_sleep <- daily_sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

hourly_steps <- hourly_steps %>% 
  rename(date_time = activityhour) %>% 
  mutate(date_time = as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

hourly_intensity <- hourly_intensity %>% 
  rename(date_time = activityhour) %>%
  mutate(date_time = as.POSIXct(date_time,format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))

# check that the date-time columns are formatted and the column names are clean using the head function.

head(daily_activity)
head(daily_sleep)
head(hourly_steps)
head(hourly_intensity)
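
The format strings are the part of this step that is easiest to get wrong, so here is a minimal base R sketch of what each piece matches. The example strings are illustrative values in the same layout as the raw files, and `tz = "UTC"` is an assumption made here so that the result does not depend on the machine's time zone.

```r
# Example strings in the same layout as the raw files (values are illustrative).
raw_day  <- "4/12/2016"
raw_hour <- "4/12/2016 1:00:00 PM"

# %m/%d/%Y = month/day/four-digit year.
parsed_day <- as.Date(raw_day, format = "%m/%d/%Y")

# %I = hour on a 12-hour clock, %p = AM/PM marker.
parsed_hour <- as.POSIXct(raw_hour, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

format(parsed_day, "%Y-%m-%d")  # "2016-04-12"
format(parsed_hour, "%H:%M")    # "13:00"

# A string that does not match the format becomes NA instead of raising an error,
# so it is worth counting NAs after parsing.
bad_parse <- as.Date("2016-04-12", format = "%m/%d/%Y")
is.na(bad_parse)
```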


## Finally, Merge the Data 
* Note that you are not required to do all of these steps when processing data. It always depends on the data, so keep that in mind.

In [None]:
# Merge (join) the 2 datasets using the id + date columns, as they are common to both datasets.
# by default merge() keeps only the rows whose id + date appear in both datasets (an inner join).
# the glimpse function gives a summary description of the new dataset.
daily_activity_sleep <- merge(daily_activity, daily_sleep, by = c("id", "date"))
glimpse(daily_activity_sleep)
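
Because `merge()` performs an inner join by default, activity records without a matching sleep record are dropped. The sketch below shows the difference on two hypothetical toy tables (made-up values) and is only meant to illustrate the behavior, not to replace the merge above.

```r
# Hypothetical toy tables: user 3 has activity data but no sleep record.
toy_activity <- data.frame(id = c(1, 2, 3), date = as.Date("2016-04-12"),
                           totalsteps = c(5000, 7000, 9000))
toy_sleep <- data.frame(id = c(1, 2), date = as.Date("2016-04-12"),
                        totalminutesasleep = c(400, 380))

# Default merge(): only rows whose id + date appear in BOTH tables are kept.
inner <- merge(toy_activity, toy_sleep, by = c("id", "date"))

# all.x = TRUE keeps every activity row and fills missing sleep values with NA.
left <- merge(toy_activity, toy_sleep, by = c("id", "date"), all.x = TRUE)

nrow(inner)  # 2
nrow(left)   # 3
```

The inner join is a reasonable choice when every analysis on the merged table needs both activity and sleep values, at the cost of a smaller sample.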

## 4-5. Analyze and Share 
* Now it's time to analyze the data to get insights. Before you start, look at the data again and note down all your thoughts. Try to find relationships among the features, and remember to take your time. Trust me, it is worth it.

### From **daily_activity** dataset
* Study the relation between **steps and calories** (we will use a **scatter plot**, as we are studying a relationship).
* Compare the **average very active distance with the average light active distance.** 

In [None]:
# View the daily_activity dataset to check the column names 
head(daily_activity)

# calculate the average of total steps 
average_steps <- mean(daily_activity$totalsteps)
average_steps

# Steps vs Calories
ggplot(data = daily_activity, mapping = aes(x = totalsteps, y = calories, color = id)) +
    geom_point() +
    geom_smooth() +
    labs(title = "Total Steps vs Calories", subtitle = "This graph shows the relation between the total number of steps and calories burned.", caption = "Visualized using R.")

In [None]:
# now let's calculate the average very active and light active distances, and the calories per step 
average_very_high_distance <- mean(daily_activity$veryactivedistance)
average_light_distance <- mean(daily_activity$lightactivedistance)
calory_per_step <- sum(daily_activity$calories)/sum(daily_activity$totalsteps)

average_very_high_distance 
average_light_distance
calory_per_step

### Conclusion
* The average number of steps is around 7,600 per day.
* There is a positive relationship between the total number of steps and the number of calories burned.
* As we see, the average light active distance is almost twice the average very active distance.
* Calories per step = 0.3 cal/step, which is well above the roughly 0.05 cal/step usually expected. One likely explanation is that the calories column counts all calories burned during the day, including at rest, so this ratio overstates the calories burned by walking alone. It may also reflect that highly active steps burn more calories than lightly active steps.
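
The scatter plot shows the direction of the relationship; a correlation coefficient puts a number on its strength. Here is a minimal sketch with base R's `cor()` on made-up values. On the real data the call would presumably be `cor(daily_activity$totalsteps, daily_activity$calories)`, assuming the lowercase column names used above.

```r
# Hypothetical toy values (made up for illustration, not from the dataset).
toy <- data.frame(
  totalsteps = c(1000, 4000, 7000, 10000, 13000),
  calories   = c(1800, 2000, 2300, 2500, 2900)
)

# Pearson correlation: close to +1 means a strong positive linear relationship,
# close to 0 means no linear relationship, close to -1 a strong negative one.
r <- cor(toy$totalsteps, toy$calories)
r
```

Correlation describes association only; it does not show that more steps cause the higher calorie values.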

### daily_activity_sleep dataset 
* View the dataset using the head function.
* Compare the time in bed with the actual time asleep.
* Study the correlation between calories and time in bed to understand general user behavior (**use a scatter plot**).
* Study the correlation between sleep time and sedentary time.

In [None]:
# View the dataset using the head function.
head(daily_activity_sleep)

# Compare the time in bed with the actual time asleep (difference in minutes, converted to days).
wasted_time_in_bed_in_days <- ((sum(daily_activity_sleep$totaltimeinbed) - sum(daily_activity_sleep$totalminutesasleep)) / 60) / 24
wasted_time_in_bed_in_days

# The correlation between calories and time in bed, to understand general user behaviour.
ggplot(data = daily_activity_sleep, aes(x = totaltimeinbed, y = calories)) +
    geom_point() +
    geom_smooth() +
    labs(title = "Calories vs Time in Bed", subtitle = "This graph shows the relation between calories burned and time in bed.", caption = "Visualized using R.")

# Study the correlation between sleep time and sedentary time
ggplot(data = daily_activity_sleep, aes(x = totalminutesasleep, y = sedentaryminutes)) +
    geom_point() +
    geom_smooth() +
    labs(title = "Sleep Time vs Sedentary Time", subtitle = "This graph shows the relation between sleep time and sedentary time in minutes.", caption = "Visualized using R.")


### Conclusion
* As we see, around 11 days in total were spent in bed without sleeping.
* For the user behavior analysis, there is no clear correlation between calories burned and time in bed. The sample is just 33 users, which may have introduced bias.
* As we see in the last visual, there is a negative correlation between sedentary time and sleep time.
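
A total summed over all users and nights is hard to compare between users, so an average per night can be easier to interpret. The sketch below shows one way to compute it in base R. The `toy` data frame and its values are hypothetical; only the column names follow the merged dataset above.

```r
# Hypothetical toy records (minutes); values are made up.
toy <- data.frame(
  id = c(1, 1, 2, 2),
  totaltimeinbed     = c(450, 480, 400, 520),
  totalminutesasleep = c(420, 430, 390, 440)
)

# Minutes spent in bed but not asleep, per record.
toy$minutes_awake_in_bed <- toy$totaltimeinbed - toy$totalminutesasleep

# Average per night across all records, and average per user.
mean_awake <- mean(toy$minutes_awake_in_bed)
awake_by_user <- aggregate(minutes_awake_in_bed ~ id, data = toy, FUN = mean)

mean_awake
awake_by_user
```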

### hourly_intensity dataset 
* Analyze the data to obtain the active-inactive ranges.

In [None]:
# Split the data in date_time column and store it in new columns 
hourly_intensity$time <- format(hourly_intensity$date_time, format = "%H:%M:%S")
hourly_intensity$date <- format(hourly_intensity$date_time, format = "%m/%d/%y")

# View the data to ensure all the data in the right place   
head(hourly_intensity)

# Now calculate the average intensity and group by time then store it in new dataset. 
hourly_intensity_new <- hourly_intensity %>%
                        group_by(time)%>%
                        drop_na()%>%
                        summarize(total_intensity_mean = mean(totalintensity))

# View the data to ensure the grouping was done successfully. 
head(hourly_intensity_new)

# create a bar chart of the average intensity per hour
# (geom_col() draws bars from the values as given, like geom_histogram(stat = "identity"))
ggplot(data = hourly_intensity_new, aes(x = time, y = total_intensity_mean)) +
  geom_col(fill = "darkblue") +
  theme(axis.text.x = element_text(angle = 90)) + # rotate the time labels so they do not overlap
  labs(title = "Average Total Intensity vs. Time")

### Conclusion
* As we see from the last visual, the active hours run from 5 am to 11 pm, and the inactive range runs from 12 am to 4 am. 
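
The active range can also be read off programmatically instead of by eye. The sketch below is one possible approach under a stated assumption: an hour counts as "active" when its mean intensity is above the average over all 24 hours. Both the cut-off and the `toy_hourly` values are hypothetical and chosen only for illustration.

```r
# Hypothetical hourly averages (made-up values), one row per hour of the day.
toy_hourly <- data.frame(
  hour = 0:23,
  total_intensity_mean = c(2, 1, 1, 0, 1, 5, 8, 11, 15, 16, 17, 18,
                           20, 19, 17, 16, 18, 21, 22, 20, 14, 10, 9, 4)
)

# Assumption: an hour is "active" when its mean intensity is above
# the average over all 24 hours. Any other cut-off could be used instead.
threshold <- mean(toy_hourly$total_intensity_mean)
active_hours <- toy_hourly$hour[toy_hourly$total_intensity_mean > threshold]

threshold
active_hours
```

A stricter or looser cut-off will widen or narrow the range, so the chosen threshold should be reported together with the result.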

## 6. Act 
* After finishing cleaning, pre-processing, and analysis, it's now time to take action.
* Let's take a quick look at our analysis conclusions and turn them into business recommendations.


## Business Recommendations Based on Conclusions




* 1. The average number of steps is 7638, so we can send a notification to everyone who reaches more than 5000 steps, encouraging them to achieve 10,000 steps for better health in the long run, as recommended by the CDC.


* 2. There is a positive relationship between the total number of steps and the number of calories burned, so we recommend sending a reminder notification to those who are interested in staying fit, to help them keep to their plan for a healthy life.


* 3. As we know, the average light active distance is almost twice the average very active distance, so we can recommend a sports awareness campaign and suggest different activity plans for all users.


* 4. Based on the analysis, around 11 days in total were spent in bed without sleeping by 33 users during a month, so we can help users manage this time by suggesting relaxing music to help them fall asleep, based on the user data.


* 5. As we found, there is a negative relationship between sedentary time and sleep time, so we can suggest a program to reduce sedentary time in order to improve sleep quality, and help users manage their sleep time with fixed cycles.


* 6. From the hourly intensities, we know the most active and inactive hours, so if we need to send any notification or important message, we should stick to the active range from 7 am to 10 pm.  
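
As an illustration of recommendation 1, the sketch below flags the user-days that would receive a step reminder. The lower limit of 5,000 and the goal of 10,000 come from the recommendation; treating 10,000 as an upper limit for the reminder is an assumption made here, and the `toy` data frame is made up.

```r
# Hypothetical toy user-days (values are made up).
toy <- data.frame(
  id = c(1, 2, 3, 4),
  totalsteps = c(3000, 5200, 8000, 12000)
)

# Assumption: remind users who passed 5,000 steps but have not yet reached 10,000.
toy$send_reminder <- toy$totalsteps > 5000 & toy$totalsteps < 10000

toy
```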