# **Bellabeat: How Can a Wellness Technology Company Play It Smart?**

### **STEP 1: ASK**


**Background** Bellabeat has manufactured beautifully designed, health-focused smart products for women since 2013. By inspiring and empowering women with knowledge about their own health and habits, Bellabeat has grown rapidly and positioned itself as a tech-driven wellness company for women.

The co-founder and Chief Creative Officer, Urška Sršen, is confident that an analysis of non-Bellabeat consumer data (i.e. FitBit fitness tracker usage data) would reveal more opportunities for growth.

**Business Task** Analyze FitBit Fitness Tracker Data to gain insights into how consumers use the FitBit app, and identify trends that can inform Bellabeat's marketing strategy.

**Business Objectives:**
* What are the trends identified?
* How could these trends apply to Bellabeat customers?
* How could these trends help influence Bellabeat marketing strategy?

**Key Stakeholders:**
* Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
* Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team
* Bellabeat marketing analytics team: A team of data analysts guiding Bellabeat's marketing strategy.


### **STEP 2: PREPARE**
**Information on Data Source:**

The data is publicly available on Kaggle as FitBit Fitness Tracker Data and is stored in 18 CSV files. It was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016. Thirty FitBit users consented to the submission of their personal tracker data. The data collected includes:
1. physical activity recorded in minutes,
2. heart rate,
3. sleep monitoring,
4. daily activity and
5. steps.

**Limitations of Data Set:**
The data was collected in 2016. Users' daily activity, fitness and sleeping habits, diet and food consumption may have changed since then, so the data may no longer be timely or relevant.

A sample of 30 female FitBit users is not representative of the entire female population. Because the data was collected through a survey, its integrity and accuracy cannot be verified.

**Is Data ROCCC?**
A good data source is ROCCC, which stands for Reliable, Original, Comprehensive, Current, and Cited.

1.  Reliable - LOW - Not reliable, as it has only 30 respondents
2.  Original - LOW - Third-party provider (Amazon Mechanical Turk)
3.  Comprehensive - MED - Parameters match most of the parameters of Bellabeat's products
4.  Current - LOW - Data is 5 years old and may no longer be relevant
5.  Cited - LOW - Data collected from a third party, hence the source is unknown

Overall, the dataset is considered low-quality data, and basing business recommendations on it is not recommended.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import warnings
warnings.filterwarnings("ignore")

**Importing datasets**


In [None]:
activity = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
calories = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
intensities = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
steps = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
sleep = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
weight = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
heartrate = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")



In [None]:
hourly_calories = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv")
hourly_steps = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
hourly_intensities = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv")
min_sleep = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv")

### **STEP 3: PROCESS**

**Process Daily Activity dataset**

In [None]:
activity.head()

In [None]:
activity.info()

In [None]:
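#converting ActivityDate from object to datetime
#the files write dates month-first (M/D/YYYY), so pandas' default parsing is used; dayfirst = True would misread 4/12/2016 as 4 December
activity['ActivityDate'] = pd.to_datetime(activity['ActivityDate'])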
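
The parsing order matters because a date such as 4/12/2016 is valid both month-first and day-first. The sketch below uses only the standard library and hypothetical sample strings, and it assumes the Fitabase files write dates month-first, as the folder name `4.12.16-5.12.16` suggests. The helper names are illustrative and not part of the notebook.

```python
from datetime import datetime

# Hypothetical sample values in the month-first style assumed for the Fitabase files.
samples = ["4/12/2016", "4/13/2016", "5/12/2016"]

def parse_month_first(text):
    """Parse 'M/D/YYYY' as month, day, year."""
    return datetime.strptime(text, "%m/%d/%Y")

def parse_day_first(text):
    """Parse the same text as day, month, year; raises ValueError if the month exceeds 12."""
    return datetime.strptime(text, "%d/%m/%Y")

month_first = [parse_month_first(s) for s in samples]

# Reading "4/12/2016" day-first silently yields 4 December instead of 12 April.
misread = parse_day_first("4/12/2016")

# "4/13/2016" cannot be read day-first at all, because there is no month 13.
try:
    parse_day_first("4/13/2016")
    day_first_fails = False
except ValueError:
    day_first_fails = True

print(month_first[0].date(), misread.date(), day_first_fails)
```

A quick check after any conversion is to confirm that the earliest and latest parsed dates fall inside the expected collection window.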

**Process the Hourly Activity dataset**

In [None]:
#merging hourly intensities, calories and steps on user Id and hour
hourly_activity = pd.merge(hourly_intensities, hourly_calories, how = 'left', on = ['Id','ActivityHour'])
hourly_activity = pd.merge(hourly_activity, hourly_steps, how = 'left', on = ['Id','ActivityHour'])
#the files write dates month-first, so pandas' default parsing is used (dayfirst = True would misread 4/12/2016 as 4 December)
hourly_activity['ActivityHour'] = pd.to_datetime(hourly_activity['ActivityHour'])
hourly_activity['Day'] = hourly_activity['ActivityHour'].dt.day_name()
hourly_activity['Day'] = hourly_activity['Day'].astype('category')
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
#reorder_categories returns a new Series; its inplace argument was removed in pandas 2.0
hourly_activity['Day'] = hourly_activity['Day'].cat.reorder_categories(weekday)
hourly_activity.head()

In [None]:
hourly_activity['Id'] = hourly_activity['Id'].astype('category')
hourly_activity['Hour'] = hourly_activity['ActivityHour'].dt.hour
hourly_activity

**Process Daily Calories dataset**

In [None]:
calories.head()

In [None]:
calories.info()

In [None]:
#converting ActivityDay from object to datetime (dates are written month-first, so dayfirst is not set)
calories['ActivityDay'] = pd.to_datetime(calories['ActivityDay'])

**Process Hourly Calories dataset**

In [None]:
hourly_calories.info()

**Process Daily Intensities dataset**

In [None]:
intensities.head()

In [None]:
intensities.info()

In [None]:
#converting ActivityDay from object to datetime (dates are written month-first, so dayfirst is not set)
intensities['ActivityDay'] = pd.to_datetime(intensities['ActivityDay'])

**Process Daily Steps dataset**

In [None]:
steps.head()

In [None]:
steps.info()

In [None]:
#converting ActivityDay from object to datetime (dates are written month-first, so dayfirst is not set)
steps['ActivityDay'] = pd.to_datetime(steps['ActivityDay'])

**Process Daily Sleep dataset**

In [None]:
sleep.head()

In [None]:
sleep.info()

In [None]:
#Converting SleepDay from object to datetime (dates are written month-first, so dayfirst is not set)
sleep['SleepDay'] = pd.to_datetime(sleep['SleepDay'])

**Process the weight dataset**

In [None]:
weight.head()

In [None]:
weight.info()

In [None]:
#converting Date from object to datetime (dates are written month-first, so dayfirst is not set)
weight['Date'] = pd.to_datetime(weight['Date'])
weight.head()

**Process the heart rate dataset**

In [None]:
heartrate.head()

In [None]:
heartrate.info()

### **STEP 4: ANALYZE**

In [None]:
#Number of participants in each dataset
print("Number of participants in activity dataset:"+str(activity['Id'].nunique()))
print("Number of participants in calories dataset:"+str(calories['Id'].nunique()))
print("Number of participants in intensities dataset:"+str(intensities['Id'].nunique()))
print("Number of participants in steps dataset:"+str(steps['Id'].nunique()))
print("Number of participants in sleep dataset:"+str(sleep['Id'].nunique()))
print("Number of participants in weight dataset:"+str(weight['Id'].nunique()))

**Activity**

In [None]:
activity.describe()

In [None]:
#Checking rows where calories = 0
activity[activity['Calories'] == 0]


In [None]:
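#Dropping rows where calories = 0 (selected by condition rather than by hard-coded index labels)
activity.drop(labels = activity[activity['Calories'] == 0].index, axis = 0, inplace = True)
#Counting rows with zero steps, sedentary minutes, total distance or light-active distance
#(a row that meets several conditions is counted once per condition)
empty_rows = list(activity[activity['TotalSteps'] == 0].index) + list(activity[activity['SedentaryMinutes'] == 0].index) + list(activity[activity['TotalDistance'] == 0].index) + list(activity[activity['LightActiveDistance'] == 0].index) 
len(empty_rows)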


In [None]:
#similarly we remove empty rows
empty_rows = list(activity[activity['TotalSteps'] == 0].index) + list(activity[activity['SedentaryMinutes'] == 0].index) + list(activity[activity['TotalDistance'] == 0].index) + list(activity[activity['LightActiveDistance'] == 0].index) 
activity.drop(labels = empty_rows, axis = 0, inplace = True)

In [None]:
#Checking for duplicate rows
activity[activity.duplicated()]

In [None]:
#Adding the day of the week as a category in Monday-to-Sunday order
activity['Day'] = activity['ActivityDate'].dt.day_name()
activity['Day'] = activity['Day'].astype('category')
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
activity['Day'] = activity['Day'].cat.reorder_categories(weekday)
activity

In [None]:
#Percentage of daily records in which fewer than 1800 calories were burned

min_calorie_burned = 1800
num_min_calorie = len(activity[activity['Calories'] < min_calorie_burned])
#total_people counts daily records (one row per user per day), not distinct users
total_people = len(activity)
print("Percentage of daily records below the minimum calorie threshold: "+str((num_min_calorie/total_people)*100))
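
The percentage above is computed over rows, and each row is one user on one day. A share of records and a share of users can differ, so it helps to be explicit about which one is reported. The following is a minimal sketch in plain Python with hypothetical records; the user labels and calorie values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (user id, calories burned that day) records.
records = [
    ("A", 1500), ("A", 1700), ("A", 1600),
    ("B", 2400), ("B", 2600),
    ("C", 1750), ("C", 2100), ("C", 2300),
]
threshold = 1800

# Share of daily records below the threshold (the quantity the cell above measures).
below = sum(1 for _, cal in records if cal < threshold)
record_share = below / len(records) * 100

# Share of users whose average daily burn is below the threshold.
per_user = defaultdict(list)
for user, cal in records:
    per_user[user].append(cal)
user_means = {user: sum(vals) / len(vals) for user, vals in per_user.items()}
low_users = [user for user, avg in user_means.items() if avg < threshold]
user_share = len(low_users) / len(user_means) * 100

print(record_share, round(user_share, 2))
```

In this toy data half of the records fall below the threshold, but only one of three users does on average, which is why the two figures should not be used interchangeably.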

In [None]:
#converting Id of activity to String type
activity['Id'] = activity['Id'].astype(str)
activity.dtypes

**Calories dataset**

In [None]:
calories.dtypes

In [None]:
calories[calories.duplicated()]

In [None]:
#Converting Id of calories to String type
calories['Id'] = calories['Id'].astype(str)
calories.dtypes

In [None]:
#removing rows with 0 calories burned on calories dataset
empty_rows = list(calories[calories['Calories'] == 0].index)
calories.drop(labels = empty_rows, axis = 0, inplace = True)

In [None]:
calories['Day'] = calories['ActivityDay'].dt.day_name()
calories['Day'] = calories['Day'].astype('category')
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
#reorder_categories returns a new Series; its inplace argument was removed in pandas 2.0
calories['Day'] = calories['Day'].cat.reorder_categories(weekday)
calories

In [None]:
calories.describe()

**Intensities dataset**

In [None]:
#Converting Id of intensities to String type
intensities['Id'] = intensities['Id'].astype(str)
intensities.dtypes

In [None]:
intensities[intensities.duplicated()]

In [None]:
intensities.describe()

In [None]:
#Removing rows with 0 sedentary minutes or 0 lightly active minutes
empty_rows = list(intensities[intensities['SedentaryMinutes'] == 0].index) + list(intensities[intensities['LightlyActiveMinutes'] == 0].index) 
intensities.drop(labels = empty_rows, axis = 0, inplace = True)


**Steps dataset** 

In [None]:
steps.info()

In [None]:
steps.head()

In [None]:
steps[steps.duplicated()]

In [None]:
#Converting the Id from int to String
steps['Id'] = steps['Id'].astype(str)
steps.dtypes

In [None]:
steps['Day'] = steps['ActivityDay'].dt.day_name()
steps['Day'] = steps['Day'].astype('category')
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
#reorder_categories returns a new Series; its inplace argument was removed in pandas 2.0
steps['Day'] = steps['Day'].cat.reorder_categories(weekday)
steps

In [None]:
#Percentage of daily records with more than 10,000 steps (active lifestyle)
#total_people was set earlier to the number of daily records in activity

min_steps = 10000
num_min_steps = len(activity[activity['TotalSteps'] > min_steps])
print("Percentage of daily records with more than 10k steps: " + str((num_min_steps/total_people)*100))

In [None]:
#Percentage of daily records with fewer than 5,000 steps (sedentary lifestyle)

min_steps = 5000
num_min_steps = len(activity[activity['TotalSteps'] < min_steps])
print("Percentage of daily records with fewer than 5k steps: " + str((num_min_steps/total_people)*100))

In [None]:
#Removing rows with 0 steps

empty_rows = list(steps[steps['StepTotal'] == 0].index)
steps.drop(labels = empty_rows, axis = 0, inplace = True)

**Sleep Dataset**

In [None]:
sleep.info()

In [None]:
sleep.describe()

In [None]:
#Checking for duplicates
sleep.duplicated().sum()


In [None]:
#dropping duplicates
sleep = sleep.drop_duplicates()     


In [None]:
sleep.dtypes

In [None]:
#Converting the Id from int to String
sleep['Id'] = sleep['Id'].astype(str)

sleep['HoursAsleep'] = (sleep['TotalMinutesAsleep']/60).round(2)
sleep['HoursInBed'] = (sleep['TotalTimeInBed']/60).round(2)
sleep['MinutesNotInAsleep'] = (sleep['TotalTimeInBed'] - sleep['TotalMinutesAsleep'])
sleep['PercentAsleep'] = ((sleep['TotalMinutesAsleep']/sleep['TotalTimeInBed'])*100).round(2)
sleep['Day'] = sleep['SleepDay'].dt.day_name()
sleep['Day'] = sleep['Day'].astype('category')
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
#reorder_categories returns a new Series; its inplace argument was removed in pandas 2.0
sleep['Day'] = sleep['Day'].cat.reorder_categories(weekday)
sleep
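
The derived sleep columns are simple arithmetic on two inputs, minutes asleep and minutes in bed. The sketch below restates that arithmetic for a single record in plain Python. The function name and the sample values are hypothetical, and the guard against a non-positive time in bed is an added assumption rather than something the notebook does.

```python
def sleep_metrics(minutes_asleep, minutes_in_bed):
    """Derive the per-record sleep measures from minutes asleep and minutes in bed."""
    if minutes_in_bed <= 0:
        # Assumption: a record without time in bed cannot yield a percentage.
        raise ValueError("minutes_in_bed must be positive")
    return {
        "hours_asleep": round(minutes_asleep / 60, 2),
        "hours_in_bed": round(minutes_in_bed / 60, 2),
        "minutes_awake_in_bed": minutes_in_bed - minutes_asleep,
        "percent_asleep": round(minutes_asleep / minutes_in_bed * 100, 2),
    }

# Hypothetical record: 420 minutes asleep out of 450 minutes in bed.
example = sleep_metrics(420, 450)
print(example)
```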

**Summary of the findings so far:**
* Average number of participants in each dataset: 27.
* Average sedentary time is 16 hours per day. This definitely needs to be reduced.
* On average, participants sleep once a day for 7 hours (still OK).
* 21% of daily records show fewer than 1,800 calories burned, which is not enough.
* 35% of daily records reflect an active lifestyle (>10,000 steps per day, a commonly cited target often attributed to the Centers for Disease Control and Prevention (CDC)).
* 25% of daily records reflect a sedentary lifestyle (<5,000 steps per day, which is very low).
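
The two step thresholds in the summary can be read as a three-way classification of each day. The sketch below is one way to express that in plain Python; the constants mirror the 5,000 and 10,000 step cut-offs used above, while the function name, the "in between" label and the sample step counts are hypothetical.

```python
from collections import Counter

SEDENTARY_BELOW = 5000   # fewer steps than this counts as sedentary
ACTIVE_ABOVE = 10000     # more steps than this counts as active

def classify_steps(steps):
    """Label one day's step count using the two thresholds above."""
    if steps < SEDENTARY_BELOW:
        return "sedentary"
    if steps > ACTIVE_ABOVE:
        return "active"
    return "in between"

# Hypothetical daily step totals.
daily_steps = [3200, 4800, 7500, 12000, 10000, 15000, 9000, 2500]
counts = Counter(classify_steps(s) for s in daily_steps)
shares = {label: count / len(daily_steps) * 100 for label, count in counts.items()}
print(shares)
```

Because both comparisons are strict, a day with exactly 5,000 or exactly 10,000 steps lands in the middle group, which matches the `<` and `>` comparisons used in the analysis cells.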


### **STEP 5: SHARE**

In [None]:
daily_activity = activity.describe().transpose().round(2)
daily_activity

In [None]:
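#selecting the four activity-minute rows by name rather than by position
minute_cols = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes']
df_activity = daily_activity.loc[minute_cols, 'mean'].astype(float)
df_activity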

**Let's look at the activity breakdown**

In [None]:
labels = df_activity.keys()
plt.figure(figsize = (8,8))
plt.pie(df_activity, labels = labels, autopct = "%.1f%%")
plt.title("Activity level breakdown", fontsize = 14)
plt.show()

#On average, people are sedentary 79.2% of the time

**How does people's activity vary through the week?**


In [None]:
#plt.subplots creates the figure, so a separate plt.figure call is not needed
f, axes = plt.subplots(2, 1, figsize = (12,10))

sns.boxplot(data = calories, x = 'Day', y = 'Calories', ax = axes[0] , palette = 'husl')
sns.boxplot(data = steps, x = 'Day', y = 'StepTotal', ax = axes[1], palette = 'husl')
plt.show()


* **The calories burned do not vary much throughout the week.**
* **The median steps taken throughout the week do not differ much.**

**Calories and steps by the hour**

In [None]:
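#averaging the numeric hourly measures for each hour of the day
hourly_cols = ['TotalIntensity', 'AverageIntensity', 'Calories', 'StepTotal']
summary_hourly_activity = hourly_activity.groupby('Hour')[hourly_cols].mean().reset_index().round(2)
summary_hourly_activity.columns = ['Hour','AvgTotalIntensity', 'AvgIntensity', 'AvgCalories', 'AvgSteps']
summary_hourly_activity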

In [None]:
f, axes = plt.subplots(3, 1, figsize = (10,8), sharex = True)
fig_axis = np.arange(0,24)   #hours of the day run from 0 to 23

fig1 = sns.lineplot(data = summary_hourly_activity, x = 'Hour', y = 'AvgIntensity', ax = axes[0])
fig2 = sns.lineplot(data = summary_hourly_activity, x = 'Hour', y = 'AvgCalories', ax = axes[1])
fig3 = sns.lineplot(data = summary_hourly_activity, x = 'Hour', y = 'AvgSteps', ax = axes[2])
fig3.set_xticks(fig_axis)
plt.show()

In [None]:
#the plots have very similar shapes, which suggests the measures are highly correlated
#let's check the correlation with a heatmap
plt.figure(figsize = (10,6))
sns.heatmap(summary_hourly_activity.corr(), annot = True)
plt.show()
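
The heatmap displays Pearson correlation coefficients, which is the default method of `DataFrame.corr()`. As a reminder of what each cell measures, the sketch below computes the coefficient by hand in plain Python. The function name and the hourly averages are hypothetical and are not taken from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, divided by the product of the deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical hourly averages for five hours of the day.
avg_steps = [50, 200, 600, 500, 100]
avg_calories = [70, 90, 130, 120, 80]
r = pearson(avg_steps, avg_calories)
print(round(r, 3))
```

A coefficient computed on 24 hourly averages describes how the daily rhythms move together. It is usually higher than the correlation across individual hourly records, because averaging removes person-to-person variation.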

In [None]:
#A scatter plot with a fitted regression line shows that hourly StepTotal and Calories are positively correlated.

sns.lmplot(data = hourly_activity, x = 'StepTotal', y = 'Calories', aspect = 2)
plt.show()

**Hours of sleep users get in a day**

In [None]:
sleep.describe().transpose().round(2)

In [None]:
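plt.figure(figsize = (8,5))
fig_axis = np.arange(0,14)
#sns.distplot is deprecated; sns.histplot draws the histogram without a density curve
fig = sns.histplot(sleep['HoursAsleep'], color = 'purple')
fig.set_xticks(fig_axis)
#each row of sleep is one sleep record, so the bars count records rather than distinct users
plt.ylabel('Number of sleep records')
plt.show()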

**Time users spend in bed without being asleep**

In [None]:
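plt.figure(figsize = (10,5))
fig_axis = np.arange(0, 400, 20)
#sns.distplot is deprecated; sns.histplot draws the histogram without a density curve
fig = sns.histplot(sleep['MinutesNotInAsleep'], color = 'green')
fig.set_xticks(fig_axis)
#each row of sleep is one sleep record, so the bars count records rather than distinct users
plt.ylabel('Number of sleep records')
plt.show()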

**Do people have different sleep patterns on different days of the week?**

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = sleep, x = 'Day', y = 'HoursAsleep', palette = 'pastel')

# There is a wider variation in sleep time on the weekends.
# Also people spend more time in bed on Sundays.

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(data = sleep, x = 'Day', y = 'HoursAsleep', palette = 'pastel')

# On average people get more than 7 hours of sleep on Wednesday and Sunday, and less than 7 on other days.

**Do people regularly wear their Fitbit throughout the day and to monitor their sleep?**

In [None]:
# First we count how many times the user used their Fitbit to record their sleep in the month under review.

sleep_records = sleep.groupby(['Id'])['SleepDay'].count().to_frame().sort_values(by = 'SleepDay', ascending=True)
sleep_records.reset_index(inplace = True)
sleep_records = sleep_records.rename(columns = {'SleepDay' : 'TotalSleepRecords'})
sleep_records.head()

In [None]:
# Then we count how many days each user wore their Fitbit throughout the month. 

activity_records = activity.groupby(['Id'])['ActivityDate'].count().to_frame().sort_values(by = 'ActivityDate', ascending=True)
activity_records.reset_index(inplace=True)
activity_records = activity_records.rename(columns = {'ActivityDate':'TotalActivityRecords'})
activity_records.head()

In [None]:
#And now we combine the two.

user_records = pd.merge(activity_records, sleep_records, how='outer', left_on=['Id'], right_on = ['Id'])

user_records.head()

In [None]:
user_records['TotalSleepRecords'] = user_records['TotalSleepRecords'].fillna(0)
user_records['Id'] = user_records.Id.astype('category')
user_records['TotalSleepRecords'] = user_records.TotalSleepRecords.astype('int')

In [None]:
user_records.sort_values(by='TotalSleepRecords')

#Here we can see that some people do not regularly wear their Fitbit to monitor their sleep.

In [None]:
user_records.describe()

# On average a person wears their Fitbit on 28 days, but monitors their sleep on only 12 of those days.
# Half of all users record their sleep 5 times or fewer throughout the month.
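
The mean and the median of the sleep-record counts answer different questions, and a few diligent users can pull the mean well above the median. The sketch below reproduces the logic of the outer merge and `fillna(0)` in plain Python. The user labels and counts are hypothetical.

```python
from statistics import mean, median

# Hypothetical per-user counts of days with activity and days with a sleep record.
activity_days = {"A": 31, "B": 30, "C": 28, "D": 26}
sleep_days = {"A": 28, "B": 5, "D": 15}   # user "C" never recorded sleep

# Users missing from sleep_days get 0, as fillna(0) does after the outer merge.
sleep_counts = {user: sleep_days.get(user, 0) for user in activity_days}

# Share of worn days on which sleep was also tracked, per user.
tracking_rate = {
    user: round(sleep_counts[user] / days * 100, 1)
    for user, days in activity_days.items()
}

mean_sleep = mean(sleep_counts.values())
median_sleep = median(sleep_counts.values())
print(sleep_counts, mean_sleep, median_sleep)
```

Reporting a per-user tracking rate alongside the raw counts keeps users who wore the device for fewer days from looking like poor sleep trackers.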