

# Case Study: How Can a Wellness Technology Company Play It Smart?

Title: <b>Bellabeat Case Study</b>

Author: <b>Radosław Gryczka</b>

Date: <b>*December 21, 2022*</b>

Python version: 3.10.1




***

# ASK

### Steps of the data analysis process: **ask**, **prepare**, **process**, **analyze**, **share**, and **act**


### Company

- Bellabeat is the go-to wellness brand for women with an ecosystem of products and services focused on women’s health 
- Successful small company with potential to become a larger player in the global smart device market.


## Business task

- Analyze FitBit Fitness Tracker Data to gain insights into how consumers are using the FitBit app 
- Discover trends and insights for Bellabeat marketing strategy.



#### Questions
<ol>
  <li>What are trends in smart device usage?</li>
  <li>How could these trends apply to Bellabeat customers?</li>
  <li>How could these trends help influence Bellabeat marketing strategy?</li>
</ol>



## Deliverables
- A clear summary of the business task
- A description of all data sources used
- Documentation of any cleaning or manipulation of data
- A summary of analysis
- Supporting visualizations and key findings
- Recommendations based on the analysis


## Stakeholders

- Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
- Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
- Bellabeat marketing analytics team

***

# PREPARE
## Data Source:
- The data is publicly available on Kaggle: [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit).
- Generated by respondents to a distributed survey via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016.
- 30 FitBit users who consented to the submission of personal tracker data.
- Data is provided by a third party and is likely not reliable.
- Data collection is from 2016 (6 years ago)


#### The following file is selected and copied for analysis.

dailyActivity_merged.csv

***

# PROCESS
Using Python to prepare and process the data.








### **Loading packages** and **importing** data set

**pandas**, **os** and **datetime**

In [7]:
import pandas as pd 
import os  
import datetime 

# current working directory
pwd = os.getcwd() 

# filepath to the data
filepath = pwd + "/dailyActivity_merged.csv" 

dataset_import = pd.read_csv(filepath)
dataset_import

|   | Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0.00 | 1.88 | 0.55 | 6.06 | 0.00 | 25 | 13 | 328 | 728 | 1985 |
| 1 | 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0.00 | 1.57 | 0.69 | 4.71 | 0.00 | 21 | 19 | 217 | 776 | 1797 |
| 2 | 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0.00 | 2.44 | 0.40 | 3.91 | 0.00 | 30 | 11 | 181 | 1218 | 1776 |
| 3 | 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0.00 | 2.14 | 1.26 | 2.83 | 0.00 | 29 | 34 | 209 | 726 | 1745 |
| 4 | 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0.00 | 2.71 | 0.41 | 5.04 | 0.00 | 36 | 10 | 221 | 773 | 1863 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 935 | 8877689391 | 5/8/2016 | 10686 | 8.11 | 8.11 | 0.00 | 1.08 | 0.20 | 6.80 | 0.00 | 17 | 4 | 245 | 1174 | 2847 |
| 936 | 8877689391 | 5/9/2016 | 20226 | 18.25 | 18.25 | 0.00 | 11.10 | 0.80 | 6.24 | 0.05 | 73 | 19 | 217 | 1131 | 3710 |
| 937 | 8877689391 | 5/10/2016 | 10733 | 8.15 | 8.15 | 0.00 | 1.35 | 0.46 | 6.28 | 0.00 | 18 | 11 | 224 | 1187 | 2832 |
| 938 | 8877689391 | 5/11/2016 | 21420 | 19.56 | 19.56 | 0.00 | 13.22 | 0.41 | 5.89 | 0.00 | 88 | 12 | 213 | 1127 | 3832 |
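
As a possible alternative to concatenating strings, the file path can be built with `pathlib`, which avoids hard-coding the `/` separator. The sketch below is only an illustration under assumptions: the helper name `load_daily_activity` is hypothetical, the CSV is assumed to sit in the working directory, and the inline sample holds two rows (and only four columns) copied from the output above so that the code runs without the real file.

```python
import io
from pathlib import Path

import pandas as pd


def load_daily_activity(source):
    """Read the daily activity CSV from a path or a file-like object (hypothetical helper)."""
    return pd.read_csv(source)


# Path the notebook would read, assuming the CSV sits in the working directory.
csv_path = Path.cwd() / "dailyActivity_merged.csv"

# Two sample rows copied from the output above, reduced to four columns.
sample_csv = io.StringIO(
    "Id,ActivityDate,TotalSteps,Calories\n"
    "1503960366,4/12/2016,13162,1985\n"
    "1503960366,4/13/2016,10735,1797\n"
)
sample = load_daily_activity(sample_csv)
print(csv_path.name, sample.shape)
```

With the real file in place, `load_daily_activity(csv_path)` would take the place of the sample buffer.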





Create a working copy so the imported data stays intact during further processing

In [8]:
# backup

dataset_modified = dataset_import.copy() 


## Data cleaning and manipulation

- Check for null or missing values

- Perform sanity check of data 

- Observe and familiarize with data

In [9]:
# number of null values

dataset_missing_values_count = dataset_modified.isnull().sum() 
dataset_missing_values_count 

Id                          0
ActivityDate                0
TotalSteps                  0
TotalDistance               0
TrackerDistance             0
LoggedActivitiesDistance    0
VeryActiveDistance          0
ModeratelyActiveDistance    0
LightActiveDistance         0
SedentaryActiveDistance     0
VeryActiveMinutes           0
FairlyActiveMinutes         0
LightlyActiveMinutes        0
SedentaryMinutes            0
Calories                    0
dtype: int64

Check whether there are 30 distinct users (Id)

In [10]:
# unique id's

count_id = len(pd.unique(dataset_modified["Id"])) 
print("# of unique Id: " + str(count_id))

# of unique Id: 33
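
Since the count differs from the expected 30, it may help to see how many days each Id contributes. The following is a minimal sketch on a hypothetical miniature frame (two Ids taken from the output above, with made-up row counts), not the notebook's own cell.

```python
import pandas as pd

# Hypothetical miniature of the data set: two users with uneven numbers of days.
sample = pd.DataFrame({
    "Id": [1503960366, 1503960366, 8877689391, 8877689391, 8877689391],
    "ActivityDate": ["4/12/2016", "4/13/2016", "5/8/2016", "5/9/2016", "5/10/2016"],
})

# Number of distinct users.
n_users = sample["Id"].nunique()

# Number of distinct tracked days per user, smallest first.
days_per_user = sample.groupby("Id")["ActivityDate"].nunique().sort_values()

print(n_users)
print(days_per_user)
```

Users with very few tracked days would appear at the top of `days_per_user`.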


In [11]:
# data types

dataset_modified.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories  

In [12]:
# change data type to datetime from string

dataset_modified["ActivityDate"] = pd.to_datetime(dataset_modified["ActivityDate"]) 
dataset_modified.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Id                        940 non-null    int64         
 1   ActivityDate              940 non-null    datetime64[ns]
 2   TotalSteps                940 non-null    int64         
 3   TotalDistance             940 non-null    float64       
 4   TrackerDistance           940 non-null    float64       
 5   LoggedActivitiesDistance  940 non-null    float64       
 6   VeryActiveDistance        940 non-null    float64       
 7   ModeratelyActiveDistance  940 non-null    float64       
 8   LightActiveDistance       940 non-null    float64       
 9   SedentaryActiveDistance   940 non-null    float64       
 10  VeryActiveMinutes         940 non-null    int64         
 11  FairlyActiveMinutes       940 non-null    int64         
 12  LightlyActiveMinutes  

# Notes 

- There are no null or missing values.

- Data frame has 940 rows and 15 columns.

- ActivityDate was read as object (string) dtype and has been converted to datetime64 dtype.

- There are 33 unique IDs instead of expected 30 fitness tracker users.
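
The checks summarised in these notes can be sketched in a few lines. This is a hedged illustration on three sample rows taken from the output above; the duplicate check on `Id` plus `ActivityDate` and the explicit `format` argument are additions for illustration, not steps the notebook performs.

```python
import pandas as pd

# Three sample rows taken from the output above, reduced to three columns.
sample = pd.DataFrame({
    "Id": [1503960366, 1503960366, 8877689391],
    "ActivityDate": ["4/12/2016", "4/13/2016", "5/11/2016"],
    "TotalSteps": [13162, 10735, 21420],
})

# An explicit format avoids ambiguity between month-first and day-first dates.
sample["ActivityDate"] = pd.to_datetime(sample["ActivityDate"], format="%m/%d/%Y")

# Total number of missing cells across the frame.
null_total = int(sample.isnull().sum().sum())

# Rows that repeat the same user and day (illustrative extra check).
duplicate_rows = int(sample.duplicated(subset=["Id", "ActivityDate"]).sum())

# First and last tracked day.
date_range = (sample["ActivityDate"].min(), sample["ActivityDate"].max())

print(null_total, duplicate_rows, date_range)
```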

### Data manipulation

- Rearrange and rename columns.

- Create new column **day_of_the_week** by separating the date into day of the week for further analysis.

- Create new column **total_mins** being the **sum** of very_active_mins, fairly_active_mins, lightly_active_mins and sedentary_mins.

- Create new column **total_hours** by converting **total_mins** column into number of hours.




In [13]:
# list of columns with new column 'day of the week' for further analysis

dataset_modified.columns
rearranged_columns =['Id', 'ActivityDate', 'day_of_the_week', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
       'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories']

# rearrange columns

dataset_rearr = dataset_modified.reindex(columns=rearranged_columns)
dataset_rearr.columns

Index(['Id', 'ActivityDate', 'day_of_the_week', 'TotalSteps', 'TotalDistance',
       'TrackerDistance', 'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories'],
      dtype='object')

In [14]:
# rename columns to snake_case
dataset_rearr.rename(columns={
    "Id": "id", "ActivityDate": "date", "TotalSteps": "total_steps",
    "TotalDistance": "total_dist", "TrackerDistance": "track_dist",
    "LoggedActivitiesDistance": "logged_dist", "VeryActiveDistance": "very_active_dist",
    "ModeratelyActiveDistance": "moderate_active_dist", "LightActiveDistance": "light_active_dist",
    "SedentaryActiveDistance": "sedentary_active_dist", "VeryActiveMinutes": "very_active_mins",
    "FairlyActiveMinutes": "fairly_active_mins", "LightlyActiveMinutes": "lightly_active_mins",
    "SedentaryMinutes": "sedentary_mins", "Calories": "calories",
}, inplace=True)

dataset_rearr.columns

Index(['id', 'date', 'day_of_the_week', 'total_steps', 'total_dist',
       'track_dist', 'logged_dist', 'very_active_dist', 'moderate_active_dist',
       'light_active_dist', 'sedentary_active_dist', 'very_active_mins',
       'fairly_active_mins', 'lightly_active_mins', 'sedentary_mins',
       'calories'],
      dtype='object')

In [15]:
# days of the week into day_of_the_week
dataset_rearr["day_of_the_week"] = dataset_rearr["date"].dt.day_name()

dataset_rearr.head()

|   | id | date | day_of_the_week | total_steps | total_dist | track_dist | logged_dist | very_active_dist | moderate_active_dist | light_active_dist | sedentary_active_dist | very_active_mins | fairly_active_mins | lightly_active_mins | sedentary_mins | calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | Tuesday | 13162 | 8.5 | 8.5 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 |
| 1 | 1503960366 | 2016-04-13 | Wednesday | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 |
| 2 | 1503960366 | 2016-04-14 | Thursday | 10460 | 6.74 | 6.74 | 0.0 | 2.44 | 0.4 | 3.91 | 0.0 | 30 | 11 | 181 | 1218 | 1776 |
| 3 | 1503960366 | 2016-04-15 | Friday | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 |
| 4 | 1503960366 | 2016-04-16 | Saturday | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 |


In [16]:
# new column 'total_mins' as a sum of very_active_mins, fairly_active_mins, lightly_active_mins and sedentary_mins.

dataset_rearr["total_mins"] = dataset_rearr["very_active_mins"] + dataset_rearr["fairly_active_mins"] + dataset_rearr["lightly_active_mins"] + dataset_rearr["sedentary_mins"]
dataset_rearr["total_mins"] 

0      1094
1      1033
2      1440
3       998
4      1040
       ... 
935    1440
936    1440
937    1440
938    1440
939     931
Name: total_mins, Length: 940, dtype: int64

In [17]:
# new column 'total_hours' as converting to hour values from 'total_mins' column

dataset_rearr["total_hours"] = round(dataset_rearr["total_mins"] / 60)
dataset_rearr["total_mins"]


0      1094
1      1033
2      1440
3       998
4      1040
       ... 
935    1440
936    1440
937    1440
938    1440
939     931
Name: total_mins, Length: 940, dtype: int64
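
To inspect the derived `total_hours` column itself, a sketch along these lines could be used. It rebuilds both derived columns from the five sample rows shown earlier. The check against 1,440 minutes (one full day) is an added sanity test, not part of the notebook.

```python
import pandas as pd

# Minute columns of the first five rows shown in the head() output above.
sample = pd.DataFrame({
    "very_active_mins": [25, 21, 30, 29, 36],
    "fairly_active_mins": [13, 19, 11, 34, 10],
    "lightly_active_mins": [328, 217, 181, 209, 221],
    "sedentary_mins": [728, 776, 1218, 726, 773],
})

minute_cols = ["very_active_mins", "fairly_active_mins",
               "lightly_active_mins", "sedentary_mins"]

# Total tracked minutes per day, then rounded to whole hours.
sample["total_mins"] = sample[minute_cols].sum(axis=1)
sample["total_hours"] = (sample["total_mins"] / 60).round()

# A day has 1,440 minutes, so larger totals would point to a data problem.
over_full_day = int((sample["total_mins"] > 1440).sum())

print(sample[["total_mins", "total_hours"]])
print(over_full_day)
```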

***
# ANALYZE & SHARE


In [18]:
#calculating statistical data

pd.options.display.float_format = '{:.2f}'.format
dataset_rearr.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 940.0 | 4855407369.33 | 2424805475.66 | 1503960366.0 | 2320127002.0 | 4445114986.0 | 6962181067.0 | 8877689391.0 |
| total_steps | 940.0 | 7637.91 | 5087.15 | 0.0 | 3789.75 | 7405.5 | 10727.0 | 36019.0 |
| total_dist | 940.0 | 5.49 | 3.92 | 0.0 | 2.62 | 5.24 | 7.71 | 28.03 |
| track_dist | 940.0 | 5.48 | 3.91 | 0.0 | 2.62 | 5.24 | 7.71 | 28.03 |
| logged_dist | 940.0 | 0.11 | 0.62 | 0.0 | 0.0 | 0.0 | 0.0 | 4.94 |
| very_active_dist | 940.0 | 1.5 | 2.66 | 0.0 | 0.0 | 0.21 | 2.05 | 21.92 |
| moderate_active_dist | 940.0 | 0.57 | 0.88 | 0.0 | 0.0 | 0.24 | 0.8 | 6.48 |
| light_active_dist | 940.0 | 3.34 | 2.04 | 0.0 | 1.95 | 3.36 | 4.78 | 10.71 |
| sedentary_active_dist | 940.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11 |
| very_active_mins | 940.0 | 21.16 | 32.84 | 0.0 | 0.0 | 4.0 | 32.0 | 210.0 |


Interpreting the statistics above:

- On average, users logged about 7,638 steps or 5.5 km per day, which is not adequate. As recommended by the CDC, an adult female should aim for at least 10,000 steps or 8 km per day to benefit general health, weight loss and fitness improvement. **Source**: [Medical News Today article](https://www.medicalnewstoday.com/articles/how-many-steps-should-you-take-a-day#for-general-health)

- Sedentary time dominates: users logged on average 991 sedentary minutes per day, about 81% of the average total minutes.

- Average calories burned is 2,303 per day, equivalent to about 0.6 pound. This could not be interpreted in detail, as calories burned depend on several factors such as age, weight, daily tasks, exercise, hormones and daily calorie intake. **Source**: [Health Line article](https://www.healthline.com/health/fitness-exercise/how-many-calories-do-i-burn-a-day#Burning-calories)
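
The figures in these bullets come down to a few averages and ratios. The sketch below shows how they might be computed, using only the five sample rows shown earlier, so its results differ from the full-data values quoted above. The 3,500 kcal per pound factor is an assumption (a common rule of thumb) and is not stated in the source.

```python
import pandas as pd

# Five sample rows from the head() output above; not the full data set.
sample = pd.DataFrame({
    "total_steps": [13162, 10735, 10460, 9762, 12669],
    "sedentary_mins": [728, 776, 1218, 726, 773],
    "total_mins": [1094, 1033, 1440, 998, 1040],
})

STEP_TARGET = 10_000  # daily step target cited in the text

# Average daily steps and number of days below the target.
mean_steps = sample["total_steps"].mean()
days_below_target = int((sample["total_steps"] < STEP_TARGET).sum())

# Share of average tracked time that is sedentary.
sedentary_share = sample["sedentary_mins"].mean() / sample["total_mins"].mean()

# Assumption (not from the source): roughly 3,500 kcal per pound.
KCAL_PER_POUND = 3500
pounds_equivalent = 2303 / KCAL_PER_POUND

print(round(mean_steps, 1), days_below_target)
print(round(sedentary_share, 3), round(pounds_equivalent, 2))
```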



## Exporting file for further analysis in **Tableau** 

In [19]:
dataset_rearr.to_csv("daily_activity_merged_cleaned.csv",index=False)

# [CHECK GRAPHS](https://public.tableau.com/app/profile/rados.aw5086/viz/BellabeatGraphs/Bars#1)

In the "No. of times users logged in across the week" bar chart, we are looking at the frequency of FitBit app usage by day of the week.

- We discovered that users prefer or remember (giving them the benefit of the doubt that they simply forgot) to track their activity on the app during midweek, from Tuesday to Friday.
- The frequency drops on Friday and stays lower over the weekend and on Monday.
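
The same frequency could also be tabulated in pandas before charting. This is a minimal sketch on six illustrative dates from the tracking period (the combination is made up for the example). It assumes that each row in the data set is one logged day.

```python
import pandas as pd

# Six illustrative dates; each row stands for one logged day.
dates = pd.to_datetime(
    ["4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/2016", "5/10/2016"],
    format="%m/%d/%Y",
)
sample = pd.DataFrame({"date": dates})
sample["day_of_the_week"] = sample["date"].dt.day_name()

# Fixed Monday-to-Sunday order; days without records are shown as 0.
week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]
records_per_day = (
    sample["day_of_the_week"].value_counts().reindex(week_order, fill_value=0)
)

print(records_per_day)
```

Reindexing with an explicit order keeps the weekdays in calendar sequence instead of sorting them by count.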


In the "Calories burned per steps taken" scatter plot, we can notice:

- Steps and calories are positively correlated.
- Calories burned rise most steeply in the range from just above 0 to 15,000 steps, with the burn rate levelling off from 15,000 steps onwards.

Additional observations:

- 1 observation of > 35,000 steps with < 3,000 calories burned.
- Zero value outliers
- The outliers could be due to natural variation in the data, a change in the user's usage or errors in data collection (e.g. miscalculations, data contamination or human error).
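
The correlation and the zero-value records can also be checked numerically. The sketch below is a hedged illustration: the first five rows are sample values shown earlier, and the last row is a hypothetical zero-value record added to demonstrate the filter.

```python
import pandas as pd

sample = pd.DataFrame({
    # Five sample rows from the head() output, plus one hypothetical zero record.
    "total_steps": [13162, 10735, 10460, 9762, 12669, 0],
    "calories": [1985, 1797, 1776, 1745, 1863, 0],
})

# Pearson correlation between steps and calories.
pearson_r = sample["total_steps"].corr(sample["calories"])

# Rows where either value is zero, candidates for closer inspection.
zero_rows = sample[(sample["total_steps"] == 0) | (sample["calories"] == 0)]

print(round(pearson_r, 2))
print(len(zero_rows))
```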

In the "Calories burned for every hour logged" chart, we can notice:

- A weak positive correlation whereby the increase of hours logged does not translate to more calories being burned. That is largely due to the average sedentary hours plotted at the 16 to 17 hours range.
- Zero value outliers


Sedentary activity is the highest column:
- This indicates that users are using the FitBit app to log daily activities such as the daily commute, inactive movements (moving from one spot to another) or running errands.

- The app is rarely used to track fitness (e.g. running), as shown by the minor percentage of fairly active and very active activity. This is highly discouraging, as the FitBit app was developed to encourage fitness.


***
# ACT
1. What are the trends identified?
- The majority of users are using the FitBit app to track sedentary activities and not to track their health habits.
- Users prefer to track their activities on weekdays rather than weekends, perhaps because they spend more time outside on weekdays and stay in on weekends.
2. How could these trends apply to Bellabeat customers?
- Both companies develop products focused on providing women with their health, habit and fitness data and encouraging them to understand their current habits and make healthy decisions. These common trends surrounding health and fitness can very well be applied to Bellabeat customers.
3. How could these trends help influence Bellabeat marketing strategy?
- The Bellabeat marketing team can encourage users by educating them about fitness benefits, suggesting different types of exercise (e.g. a simple 10-minute exercise on weekdays and a more intense exercise on weekends) and providing calorie intake and burn rate information in the Bellabeat app.
- On weekends, the Bellabeat app can also send notifications to encourage users to exercise and reduce sedentary time.
