
### **BELLABEAT CASE STUDY**

How can a wellness technology company play it smart?


##### **ABOUT THE COMPANY**

Bellabeat is a high-tech manufacturer of health-focused products for women. Artist Urška Sršen and mathematician Sando Mur founded Bellabeat in 2013 to help women assess their daily activity, sleep, stress, and reproductive health, and to empower them with knowledge about their own health and habits. To do this, the company collects data through its products, such as:
- _Bellabeat App_: The Bellabeat app provides users with health data related to their activity, sleep, stress,
menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and
make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
- _Leaf_: A wellness tracker that can be worn as a bracelet, necklace, or clip. The Leaf tracker connects
to the Bellabeat app to track activity, sleep, and stress.
- _Time_: A wellness watch that combines the timeless look of a classic timepiece with smart technology to track user
activity, sleep, and stress through the Bellabeat app.
- _Spring_: A water bottle that tracks daily water intake using smart technology to ensure that you are
appropriately hydrated throughout the day. It is also synced with the Bellabeat app.
- _Bellabeat membership_: Bellabeat also offers a subscription-based membership program for users.
Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and
beauty, and mindfulness based on their lifestyle and goals.


##### **BUSINESS TASK**

Analyze usage data from one non-Bellabeat smart device to gain insight into how consumers use such devices, then identify how these insights could help influence Bellabeat's marketing strategy.


##### 1. ASK

- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat marketing strategy?


##### 2. PREPARE

Urška Sršen insists on making use of public data that explores smart device users’ daily habits, which is available [here](https://www.kaggle.com/arashnic/fitbit). The dataset being used is **FitBit Fitness Tracker Data** (CC0: Public Domain, dataset made available through [Mobius](https://www.kaggle.com/arashnic)).


##### **About the data**

This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

##### 3. PROCESS

#### Loading libraries

In [209]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Loading datasets

In [210]:
daily_activity_df = pd.read_csv('dailyActivity_merged.csv')
sleep_day_df = pd.read_csv('sleepDay_merged.csv')
hourly_calories_df = pd.read_csv('hourlyCalories_merged.csv')
hourly_steps_df = pd.read_csv('hourlySteps_merged.csv')

#### Data Exploration
Having a look at each dataset

In [211]:
daily_activity_df.head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,4/12/2016,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985
1,1503960366,4/13/2016,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797
2,1503960366,4/14/2016,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776
3,1503960366,4/15/2016,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745
4,1503960366,4/16/2016,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863


In [212]:
sleep_day_df.head()

Unnamed: 0,Id,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,4/12/2016 12:00:00 AM,1,327,346
1,1503960366,4/13/2016 12:00:00 AM,2,384,407
2,1503960366,4/15/2016 12:00:00 AM,1,412,442
3,1503960366,4/16/2016 12:00:00 AM,2,340,367
4,1503960366,4/17/2016 12:00:00 AM,1,700,712


In [213]:
hourly_calories_df.head()

Unnamed: 0,Id,ActivityHour,Calories
0,1503960366,4/12/2016 12:00:00 AM,81
1,1503960366,4/12/2016 1:00:00 AM,61
2,1503960366,4/12/2016 2:00:00 AM,59
3,1503960366,4/12/2016 3:00:00 AM,47
4,1503960366,4/12/2016 4:00:00 AM,48


In [214]:
hourly_steps_df.head()

Unnamed: 0,Id,ActivityHour,StepTotal
0,1503960366,4/12/2016 12:00:00 AM,373
1,1503960366,4/12/2016 1:00:00 AM,160
2,1503960366,4/12/2016 2:00:00 AM,151
3,1503960366,4/12/2016 3:00:00 AM,0
4,1503960366,4/12/2016 4:00:00 AM,0


Now checking for more information and column datatypes for each dataset

In [215]:
daily_activity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories  

In [216]:
sleep_day_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Id                  413 non-null    int64 
 1   SleepDay            413 non-null    object
 2   TotalSleepRecords   413 non-null    int64 
 3   TotalMinutesAsleep  413 non-null    int64 
 4   TotalTimeInBed      413 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 16.3+ KB


In [217]:
hourly_calories_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   Calories      22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB


In [218]:
hourly_steps_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   StepTotal     22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB


##### Dataset statistics

In [219]:
daily_activity_df.describe()

Unnamed: 0,Id,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
count,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0
mean,4855407000.0,7637.910638,5.489702,5.475351,0.108171,1.502681,0.567543,3.340819,0.001606,21.164894,13.564894,192.812766,991.210638,2303.609574
std,2424805000.0,5087.150742,3.924606,3.907276,0.619897,2.658941,0.88358,2.040655,0.007346,32.844803,19.987404,109.1747,301.267437,718.166862
min,1503960000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2320127000.0,3789.75,2.62,2.62,0.0,0.0,0.0,1.945,0.0,0.0,0.0,127.0,729.75,1828.5
50%,4445115000.0,7405.5,5.245,5.245,0.0,0.21,0.24,3.365,0.0,4.0,6.0,199.0,1057.5,2134.0
75%,6962181000.0,10727.0,7.7125,7.71,0.0,2.0525,0.8,4.7825,0.0,32.0,19.0,264.0,1229.5,2793.25
max,8877689000.0,36019.0,28.030001,28.030001,4.942142,21.92,6.48,10.71,0.11,210.0,143.0,518.0,1440.0,4900.0


In [220]:
sleep_day_df.describe()

Unnamed: 0,Id,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
count,413.0,413.0,413.0,413.0
mean,5000979000.0,1.118644,419.467312,458.639225
std,2060360000.0,0.345521,118.344679,127.101607
min,1503960000.0,1.0,58.0,61.0
25%,3977334000.0,1.0,361.0,403.0
50%,4702922000.0,1.0,433.0,463.0
75%,6962181000.0,1.0,490.0,526.0
max,8792010000.0,3.0,796.0,961.0


In [221]:
hourly_calories_df.describe()

Unnamed: 0,Id,Calories
count,22099.0,22099.0
mean,4848235000.0,97.38676
std,2422500000.0,60.702622
min,1503960000.0,42.0
25%,2320127000.0,63.0
50%,4445115000.0,83.0
75%,6962181000.0,108.0
max,8877689000.0,948.0


In [222]:
hourly_steps_df.describe()

Unnamed: 0,Id,StepTotal
count,22099.0,22099.0
mean,4848235000.0,320.166342
std,2422500000.0,690.384228
min,1503960000.0,0.0
25%,2320127000.0,0.0
50%,4445115000.0,40.0
75%,6962181000.0,357.0
max,8877689000.0,10554.0


Checking whether each dataset has the same number of unique participants

In [223]:
daily_activity_df.Id.nunique()

33

In [224]:
sleep_day_df.Id.nunique()

24

In [225]:
hourly_calories_df.Id.nunique()

33

In [226]:
hourly_steps_df.Id.nunique()

33

The _sleep_day_df_ dataframe has fewer participants (24) than the other dataframes (33), which suggests that some participants did not log their sleep. I will still make use of the _sleepDay_merged.csv_ dataset, keeping in mind that it covers only a subset of the participants.
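To see which participants are missing from the sleep data, and how many sleep records each remaining participant logged, one option is a set difference on the `Id` columns plus a per-user row count. This is a minimal sketch on made-up Ids (the toy frames `activity` and `sleep` are hypothetical stand-ins for `daily_activity_df` and `sleep_day_df`):

```python
import pandas as pd

# Toy stand-ins for the two dataframes (made-up Ids, not the real data)
activity = pd.DataFrame({"Id": [1, 1, 2, 3, 3]})
sleep = pd.DataFrame({"Id": [1, 3, 3]})

# Ids present in the activity data but absent from the sleep data
missing_ids = sorted(set(activity["Id"]) - set(sleep["Id"]))

# Number of logged rows per participant in the sleep data
records_per_user = sleep.groupby("Id").size()

print(missing_ids)
print(records_per_user)
```

On the real data, the same two lines would be applied to `daily_activity_df["Id"]` and `sleep_day_df["Id"]`.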

#### Data cleaning

##### Data formatting

Converting the relevant columns in each dataset to the correct datatypes

In [227]:
daily_activity_df["ActivityDate"] = pd.to_datetime(daily_activity_df["ActivityDate"])
sleep_day_df["SleepDay"] = pd.to_datetime(sleep_day_df["SleepDay"])
hourly_calories_df["ActivityHour"] = pd.to_datetime(hourly_calories_df["ActivityHour"])
hourly_steps_df["ActivityHour"] = pd.to_datetime(hourly_steps_df["ActivityHour"])
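The timestamps in these files look like `4/12/2016 12:00:00 AM`, so a possible refinement is to pass an explicit `format` to `pd.to_datetime`. This avoids any month/day ambiguity and raises an error if a row does not match. The sketch below uses two made-up strings in that layout and assumes every row follows the same pattern:

```python
import pandas as pd

# Two example strings in the same layout as the ActivityHour column
raw = pd.Series(["4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM"])

# Month/day/year with a 12-hour clock and an AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

print(parsed)
```

For the date-only `ActivityDate` column, the corresponding format string would be `"%m/%d/%Y"`.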

Renaming the datetime columns to simpler names

In [228]:
daily_activity_df = daily_activity_df.rename(columns={"ActivityDate": "Date"})
sleep_day_df = sleep_day_df.rename(columns={"SleepDay": "Date"})
hourly_calories_df = hourly_calories_df.rename(columns={"ActivityHour": "Date_Time"})
hourly_steps_df = hourly_steps_df.rename(columns={"ActivityHour": "Date_Time"})

Checking to see how many missing values are in each dataframe

In [229]:
daily_activity_df.isna().sum()

Id                          0
Date                        0
TotalSteps                  0
TotalDistance               0
TrackerDistance             0
LoggedActivitiesDistance    0
VeryActiveDistance          0
ModeratelyActiveDistance    0
LightActiveDistance         0
SedentaryActiveDistance     0
VeryActiveMinutes           0
FairlyActiveMinutes         0
LightlyActiveMinutes        0
SedentaryMinutes            0
Calories                    0
dtype: int64

In [230]:
sleep_day_df.isna().sum()

Id                    0
Date                  0
TotalSleepRecords     0
TotalMinutesAsleep    0
TotalTimeInBed        0
dtype: int64

In [231]:
hourly_calories_df.isna().sum()

Id           0
Date_Time    0
Calories     0
dtype: int64

In [232]:
hourly_steps_df.isna().sum()

Id           0
Date_Time    0
StepTotal    0
dtype: int64
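The four checks above can also be collapsed into a single summary by looping over the dataframes. This is a small sketch with made-up frames; the dictionary name `frames` and its keys are hypothetical:

```python
import pandas as pd

# Toy frames standing in for the four datasets (values are made up)
frames = {
    "daily": pd.DataFrame({"Id": [1, 2], "TotalSteps": [100, None]}),
    "sleep": pd.DataFrame({"Id": [1], "TotalMinutesAsleep": [300]}),
}

# Total number of missing cells per dataframe
missing_summary = pd.Series(
    {name: int(df.isna().sum().sum()) for name, df in frames.items()}
)

print(missing_summary)
```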

Checking to see how many duplicates are in each dataframe

In [233]:
daily_activity_df.duplicated().sum()

0

In [234]:
sleep_day_df.duplicated().sum()

3

Three duplicates have been found in this dataset. I will remove them, keeping the first occurrence of each.

In [235]:
sleep_day_df = sleep_day_df.drop_duplicates(subset=None, keep='first')
sleep_day_df.duplicated().sum()

0
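Before dropping duplicates, it can be useful to look at the duplicated rows themselves. Passing `keep=False` to `duplicated()` marks every copy, not only the later ones. A minimal sketch on made-up sleep records:

```python
import pandas as pd

# Toy sleep log in which the first record appears twice (made-up values)
sleep = pd.DataFrame({
    "Id": [1, 1, 1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalMinutesAsleep": [327, 327, 384, 412],
})

# keep=False flags all copies of a duplicated row so they can be inspected
dupes = sleep[sleep.duplicated(keep=False)]

# Drop the repeats, keep the first occurrence, and tidy the index
deduped = sleep.drop_duplicates(keep="first").reset_index(drop=True)

print(dupes)
print(len(deduped))
```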

In [236]:
hourly_calories_df.duplicated().sum()

0

In [237]:
hourly_steps_df.duplicated().sum()

0

##### 4. ANALYZE

Combining the datasets to make the analysis easier

In [257]:
# Merging daily_activity_df and sleep_day_df datasets
merged_daily_data = pd.merge(daily_activity_df, sleep_day_df, on='Id')
merged_daily_data.head()

# Renaming columns for readability
merged_daily_data = merged_daily_data.rename(columns={"Date_x": "Date_Daily","Date_y": "Sleep_Date"})
merged_daily_data.head()

Unnamed: 0,Id,Date_Daily,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,Sleep_Date,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,2016-04-12,1,327,346
1,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,2016-04-13,2,384,407
2,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,2016-04-15,1,412,442
3,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,2016-04-16,2,340,367
4,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,2016-04-17,1,700,712
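Merging on `Id` alone pairs every activity day of a user with every sleep day of the same user, which is why the rows above repeat the 2016-04-12 activity record next to different sleep dates. If the goal is one row per user per day, a possible alternative is to merge on both `Id` and the date column. The sketch below uses made-up toy frames; `validate="one_to_one"` is a standard `pd.merge` option that raises an error if a key pair is not unique on either side. Applying this to the real data would likely change the aggregates computed from `merged_daily_data`.

```python
import pandas as pd

# Toy activity and sleep logs (made-up values, same column names as above)
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalSteps": [13162, 10735, 5000],
})
sleep = pd.DataFrame({
    "Id": [1, 1],
    "Date": pd.to_datetime(["2016-04-12", "2016-04-13"]),
    "TotalMinutesAsleep": [327, 384],
})

# Id only: every activity day is paired with every sleep day of that user
by_id = pd.merge(activity, sleep, on="Id")

# Id and Date: each activity day is matched to the sleep record of the same day
by_id_and_date = pd.merge(
    activity, sleep, on=["Id", "Date"], how="inner", validate="one_to_one"
)

print(len(by_id), len(by_id_and_date))
```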


In [258]:
# Merging hourly_calories_df and hourly_steps_df datasets on both key columns.
# Because 'Id' and 'Date_Time' are both merge keys, no suffixed (_x/_y) columns
# are created, so no renaming is needed afterwards.
merged_hourly_data = pd.merge(hourly_calories_df, hourly_steps_df, on=['Id', 'Date_Time'])
merged_hourly_data.head()

Unnamed: 0,Id,Date_Time,Calories,StepTotal
0,1503960366,2016-04-12 00:00:00,81,373
1,1503960366,2016-04-12 01:00:00,61,160
2,1503960366,2016-04-12 02:00:00,59,151
3,1503960366,2016-04-12 03:00:00,47,0
4,1503960366,2016-04-12 04:00:00,48,0


Finding out how many days the records were logged for

In [243]:
total_days = merged_daily_data['Date_Daily'].max() - merged_daily_data['Date_Daily'].min()
total_days

Timedelta('30 days 00:00:00')
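A `Timedelta` of 30 days is the gap between the first and last date, so counting both endpoints gives 31 calendar days. A short sketch of the difference, using made-up dates that span the same gap, and of counting the distinct days that actually appear in the data:

```python
import pandas as pd

# Made-up dates whose first and last entries are 30 days apart
dates = pd.Series(pd.to_datetime(["2016-04-12", "2016-04-20", "2016-05-12"]))

span = dates.max() - dates.min()       # gap between first and last date
inclusive_days = span.days + 1         # calendar days, counting both endpoints
distinct_days = dates.dt.normalize().nunique()  # days that actually have a record

print(span, inclusive_days, distinct_days)
```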

**Analyzing the average number of minutes per day that users spend very active, fairly active, lightly active and sedentary while using the smart device**

In [246]:
activity_minute_cols = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes']
merged_daily_data[activity_minute_cols].agg(['mean', 'min', 'max'])

Unnamed: 0,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes
mean,23.935779,17.340703,199.848801,799.39148
min,0.0,0.0,0.0,0.0
max,210.0,143.0,518.0,1440.0


On average, users log about **24 minutes** per day as very active and **17 minutes** as fairly active, which is where intentional, vigorous exercise would show up. By contrast, they log about **200 minutes** as lightly active and **799 minutes**, roughly **13 hours**, as sedentary. This suggests that the smart device is used mostly during sedentary time and during light, unintentional day-to-day activities such as shopping, cleaning, watering plants or taking out the trash, rather than during workouts.
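To make the comparison between activity levels concrete, the mean minutes can be converted to hours and to a share of all tracked minutes. The values below are copied from the aggregation output above; the variable names are illustrative only:

```python
# Mean minutes per day, copied from the aggregation output above
mean_minutes = {
    "VeryActiveMinutes": 23.935779,
    "FairlyActiveMinutes": 17.340703,
    "LightlyActiveMinutes": 199.848801,
    "SedentaryMinutes": 799.39148,
}

total_minutes = sum(mean_minutes.values())

# Hours per day and share of all tracked minutes for each activity level
hours = {level: minutes / 60 for level, minutes in mean_minutes.items()}
share = {level: minutes / total_minutes for level, minutes in mean_minutes.items()}

for level in mean_minutes:
    print(f"{level}: {hours[level]:.1f} h, {share[level]:.0%} of tracked time")
```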

**Analyzing the average number of minutes users use the smart device while asleep**

In [247]:
merged_daily_data.agg({'TotalMinutesAsleep': ['mean', 'min', 'max']})

Unnamed: 0,TotalMinutesAsleep
mean,419.10277
min,58.0
max,796.0


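On average, users use the smart device for **419 minutes** while sleeping, which is roughly **7 hours**.

Sedentary time (about **13 hours**) and sleep (about **7 hours**) together account for roughly 20 hours of the day. The remaining **4 hours** or so match the combined very, fairly and lightly active minutes (about 241 minutes), so on average the smart device appears to be in use for nearly the whole day rather than sitting unused for several hours.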
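As a rough sanity check on how the day adds up, the averages reported above can be summed and compared with the 1,440 minutes in a day. The numbers are copied from the two aggregation outputs; a remainder close to zero, or slightly negative, would suggest that the device is worn almost all day, and a negative value can occur if the categories overlap or the averages are taken over different sets of rows.

```python
MINUTES_PER_DAY = 24 * 60

# Mean minutes per day, copied from the outputs above
active_and_sedentary = 23.935779 + 17.340703 + 199.848801 + 799.39148
asleep = 419.10277

accounted = active_and_sedentary + asleep    # minutes with some tracked state
unaccounted = MINUTES_PER_DAY - accounted    # minutes left over in a 24-hour day

print(f"asleep: {asleep / 60:.1f} h")
print(f"accounted: {accounted:.1f} min, unaccounted: {unaccounted:.1f} min")
```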

**Analyzing at what time of day users burn the most calories**

In [288]:
merged_hourly_data.groupby(merged_hourly_data['Date_Time'].dt.hour)['Calories'].mean()

Date_Time
0      71.805139
1      70.165059
2      69.186495
3      67.538049
4      68.261803
5      81.708155
6      86.996778
7      94.477981
8     103.337272
9     106.142857
10    110.460710
11    109.806904
12    117.197397
13    115.309446
14    115.732899
15    106.637158
16    113.327453
17    122.752759
18    123.492274
19    121.484547
20    102.357616
21     96.056354
22     88.265487
23     77.593577
Name: Calories, dtype: float64

Looking at this, users burn the most calories on average at 18:00, followed by 17:00.
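Rather than reading the peak off the printed series, `idxmax()` and `nlargest()` can return it directly. A minimal sketch on a made-up hourly table with the same column names as `merged_hourly_data`:

```python
import pandas as pd

# Toy hourly calorie log (made-up values)
hourly = pd.DataFrame({
    "Date_Time": pd.to_datetime([
        "2016-04-12 17:00", "2016-04-12 18:00",
        "2016-04-13 17:00", "2016-04-13 18:00", "2016-04-13 03:00",
    ]),
    "Calories": [120, 125, 118, 130, 60],
})

# Mean calories for each hour of the day
by_hour = hourly.groupby(hourly["Date_Time"].dt.hour)["Calories"].mean()

peak_hour = by_hour.idxmax()   # hour with the highest mean
top_two = by_hour.nlargest(2)  # the two highest hours, in descending order

print(peak_hour)
print(top_two)
```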

**Analyzing the average calories users burn per hour on each day of the week**

In [276]:
merged_hourly_data.groupby(merged_hourly_data['Date_Time'].dt.day_name())['Calories'].mean().sort_values(ascending=False)

Date_Time
Saturday     99.865866
Tuesday      98.617500
Friday       97.784117
Monday       97.053478
Thursday     97.008529
Wednesday    96.874260
Sunday       94.335981
Name: Calories, dtype: float64

According to this analysis, Saturday is the most active day because users burn the most calories per hour on average, followed by Tuesday.
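Sorting by value makes the ranking easy to read, but for spotting weekly patterns it can help to show the days in calendar order as well. One way, sketched here on made-up data, is to `reindex` the grouped result with an explicit weekday list (the constant `WEEKDAYS` is a name introduced for this example):

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Toy hourly log covering a Saturday, a Tuesday and a Sunday (made-up values)
hourly = pd.DataFrame({
    "Date_Time": pd.to_datetime(
        ["2016-04-16 10:00", "2016-04-12 10:00", "2016-04-17 10:00"]
    ),
    "Calories": [110, 100, 90],
})

by_day = hourly.groupby(hourly["Date_Time"].dt.day_name())["Calories"].mean()

# Calendar order; days without data show up as NaN
by_day_ordered = by_day.reindex(WEEKDAYS)

print(by_day_ordered)
```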

**Analyzing the average number of steps users take on each day of the week**

In [286]:
merged_daily_data.groupby(merged_daily_data['Date_Daily'].dt.day_name())['TotalSteps'].mean().sort_values(ascending=False)

Date_Daily
Tuesday      9016.747621
Monday       8644.214330
Saturday     8622.173292
Friday       8231.167385
Wednesday    7833.033804
Thursday     7735.824324
Sunday       6585.467290
Name: TotalSteps, dtype: float64

**Analyzing the average distance users cover at each activity level**

In [290]:
activity_distance_cols = ['VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance', 'SedentaryActiveDistance']
merged_daily_data[activity_distance_cols].agg(['mean', 'min', 'max'])

Unnamed: 0,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance
mean,1.397498,0.730862,3.532016,0.000679
min,0.0,0.0,0.0,0.0
max,13.4,6.48,10.3,0.11


This analysis shows that users cover an average of about **1.4 miles** per day while very active, compared with about **3.5 miles** during light activity. This can be used to Bellabeat's advantage in the marketing strategy, as will be discussed in the recommendations.
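The same means can be expressed as shares of the total distance, which makes the dominance of light activity easier to quote. The values below are copied from the aggregation output above; the series name `mean_distance` is illustrative:

```python
import pandas as pd

# Mean distance per day by activity level, copied from the output above
mean_distance = pd.Series({
    "VeryActiveDistance": 1.397498,
    "ModeratelyActiveDistance": 0.730862,
    "LightActiveDistance": 3.532016,
    "SedentaryActiveDistance": 0.000679,
})

# Share of the total mean distance contributed by each activity level
distance_share = mean_distance / mean_distance.sum()

print(distance_share.round(3))
```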

##### 5. SHARE

##### 6. ACT