

# Case Study: How Can a Wellness Technology Company Play It Smart?

Title: <b>Bellabeat Case Study</b>

Author: <b>Radosław Gryczka</b>

Date: <b>*December 20, 2021*</b>

Python version: 3.10.1

Check out my portfolio: [my page](https://duckduckgo.com "The best search engine for privacy").



***

# ASK

### Steps of the data analysis process: **ask**, **prepare**, **process**, **analyze**, **share**, and **act**


### Company

- Bellabeat is the go-to wellness brand for women with an ecosystem of products and services focused on women’s health 
- Successful small company with potential to become a larger player in the global smart device market.


## Business task

- Analyze FitBit Fitness Tracker Data to gain insights into how consumers are using the FitBit app 
- Discover trends and insights for Bellabeat marketing strategy.



#### Questions
<ol>
  <li>What are trends in smart device usage?</li>
  <li>How could these trends apply to Bellabeat customers?</li>
  <li>How could these trends help influence Bellabeat marketing strategy?</li>
</ol>



## Deliverables
- A clear summary of the business task
- A description of all data sources used
- Documentation of any cleaning or manipulation of data
- A summary of analysis
- Supporting visualizations and key findings
- Recommendations based on the analysis


## Stakeholders

- Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
- Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
- Bellabeat marketing analytics team

***

# PREPARE
## Data Source:
- The data is publicly available on Kaggle: [FitBit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit).
- Generated by respondents to a survey distributed via Amazon Mechanical Turk between 12 March 2016 and 12 May 2016.
- 30 FitBit users consented to the submission of personal tracker data.
- The data is provided by a third party and is likely not reliable.
- The data was collected in 2016, more than five years before this analysis, so it may not reflect current usage.


#### The following file is selected and copied for analysis.

dailyActivity_merged.csv

***

# PROCESS
Using Python to prepare and process the data.








### **Loading packages** and **importing** data set

**pandas**, **os** and **datetime**

In [2]:
import pandas as pd # data structures and data analysis
import os # file paths
import datetime # date handling

pwd = os.getcwd() # current working directory (the folder the notebook is run from)

filepath = os.path.join(pwd, "Data - dailyActivity_merged.csv") # full path to the Data - dailyActivity_merged.csv file

dataset_import = pd.read_csv(filepath) # reads the csv into Python
dataset_import

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,4/12/2016,13162,8.500000,8.500000,0.0,1.88,0.55,6.06,0.00,25,13,328,728,1985
1,1503960366,4/13/2016,10735,6.970000,6.970000,0.0,1.57,0.69,4.71,0.00,21,19,217,776,1797
2,1503960366,4/14/2016,10460,6.740000,6.740000,0.0,2.44,0.40,3.91,0.00,30,11,181,1218,1776
3,1503960366,4/15/2016,9762,6.280000,6.280000,0.0,2.14,1.26,2.83,0.00,29,34,209,726,1745
4,1503960366,4/16/2016,12669,8.160000,8.160000,0.0,2.71,0.41,5.04,0.00,36,10,221,773,1863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
935,8877689391,5/8/2016,10686,8.110000,8.110000,0.0,1.08,0.20,6.80,0.00,17,4,245,1174,2847
936,8877689391,5/9/2016,20226,18.250000,18.250000,0.0,11.10,0.80,6.24,0.05,73,19,217,1131,3710
937,8877689391,5/10/2016,10733,8.150000,8.150000,0.0,1.35,0.46,6.28,0.00,18,11,224,1187,2832
938,8877689391,5/11/2016,21420,19.559999,19.559999,0.0,13.22,0.41,5.89,0.00,88,12,213,1127,3832





Working on a copy keeps the imported data safe during further processing

In [3]:
dataset_modified = dataset_import.copy() # backup in case of mistake

## Data cleaning and manipulation

- Check for null or missing values

- Perform sanity check of data 

- Observe and familiarize with data

In [4]:
dataset_missing_values_count = dataset_modified.isnull().sum() # checking for null values
dataset_missing_values_count 

Id                          0
ActivityDate                0
TotalSteps                  0
TotalDistance               0
TrackerDistance             0
LoggedActivitiesDistance    0
VeryActiveDistance          0
ModeratelyActiveDistance    0
LightActiveDistance         0
SedentaryActiveDistance     0
VeryActiveMinutes           0
FairlyActiveMinutes         0
LightlyActiveMinutes        0
SedentaryMinutes            0
Calories                    0
dtype: int64
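The null check above covers only the first item of the checklist. A minimal sketch of two further sanity checks (repeated user/date rows and negative values) is shown below. It runs on a small toy frame that reuses two rows from the output above plus one deliberately duplicated row; the `sample`, `duplicate_rows` and `negative_counts` names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame: two rows from the output above plus one deliberate duplicate
sample = pd.DataFrame({
    "Id": [1503960366, 1503960366, 1503960366],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/13/2016"],
    "TotalSteps": [13162, 10735, 10735],
    "SedentaryMinutes": [728, 776, 776],
})

# Rows that repeat the same user and date (keep=False marks every copy)
duplicate_rows = sample[sample.duplicated(subset=["Id", "ActivityDate"], keep=False)]

# Number of negative values per numeric column; anything above zero is suspicious
numeric_cols = sample.select_dtypes("number").columns.drop("Id")
negative_counts = (sample[numeric_cols] < 0).sum()

print(len(duplicate_rows))
print(negative_counts.to_dict())
```

On the real data the same two expressions could be applied to `dataset_modified` directly.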

Checking whether there are 30 distinct users (Id)

In [5]:
count_id = len(pd.unique(dataset_modified["Id"])) 
print("# of unique Id: " + str(count_id))

# of unique Id: 33
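Since the count differs from the 30 users described in the data source, it may help to see how many days each user actually logged. The sketch below is one possible way to do that, assuming a toy frame with made-up row counts; `sample`, `unique_users` and `days_per_user` are hypothetical names.

```python
import pandas as pd

# Toy frame with two users; the real data has one row per user per day
sample = pd.DataFrame({
    "Id": [1503960366, 1503960366, 8877689391],
    "ActivityDate": ["4/12/2016", "4/13/2016", "5/8/2016"],
})

# Same result as len(pd.unique(...)), written with the Series method
unique_users = sample["Id"].nunique()

# Number of distinct days logged by each user, fewest first
days_per_user = sample.groupby("Id")["ActivityDate"].nunique().sort_values()

print(unique_users)
print(days_per_user)
```

Users with very few logged days would be natural candidates for a closer look before drawing conclusions.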


In [6]:
dataset_modified.info() #checking data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories  

In [7]:
dataset_modified["ActivityDate"] = pd.to_datetime(dataset_modified["ActivityDate"]) # changed data type to datetime64[ns] from object
dataset_modified.info() # confirmation

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Id                        940 non-null    int64         
 1   ActivityDate              940 non-null    datetime64[ns]
 2   TotalSteps                940 non-null    int64         
 3   TotalDistance             940 non-null    float64       
 4   TrackerDistance           940 non-null    float64       
 5   LoggedActivitiesDistance  940 non-null    float64       
 6   VeryActiveDistance        940 non-null    float64       
 7   ModeratelyActiveDistance  940 non-null    float64       
 8   LightActiveDistance       940 non-null    float64       
 9   SedentaryActiveDistance   940 non-null    float64       
 10  VeryActiveMinutes         940 non-null    int64         
 11  FairlyActiveMinutes       940 non-null    int64         
 12  LightlyActiveMinutes  
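The conversion above lets pandas infer the date layout. The raw values look like `4/12/2016` (month/day/year), so one option is to pass the format explicitly, which prevents day and month from being silently swapped. A minimal sketch on three dates taken from the output above; the `raw_dates` and `parsed` names are illustrative only.

```python
import pandas as pd

# Three raw date strings as they appear in the imported file
raw_dates = pd.Series(["4/12/2016", "4/13/2016", "5/8/2016"])

# Explicit month/day/year format; a value that does not match raises an error
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y")

print(parsed.dt.month.tolist())
print(parsed.dtype)
```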

# Notes 

- There are no null or missing values.

- Data frame has 940 rows and 15 columns.

- ActivityDate was wrongly classified as object dtype and has been converted to datetime64 dtype.

- There are 33 unique IDs instead of expected 30 fitness tracker users.

### Data manipulation

- Create new column **day_of_the_week** by extracting the day of the week from the date for further analysis.

- Create new column **total_mins** being the **sum** of very_active_mins, fairly_active_mins, lightly_active_mins and sedentary_mins.

- Create new column **total_hours** by converting **total_mins** column into number of hours.

- Rearrange and rename columns.
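The three derived columns listed above can also be sketched in a single chained step. This is only a compact illustration on two rows taken from the data shown earlier, using the renamed column names; `sample`, `minute_cols` and `derived` are hypothetical names, and the notebook itself builds the columns one cell at a time.

```python
import pandas as pd

# Two rows from the data above, already using the renamed columns
sample = pd.DataFrame({
    "date": pd.to_datetime(["2016-04-12", "2016-04-13"]),
    "very_active_mins": [25, 21],
    "fairly_active_mins": [13, 19],
    "lightly_active_mins": [328, 217],
    "sedentary_mins": [728, 776],
})

minute_cols = ["very_active_mins", "fairly_active_mins",
               "lightly_active_mins", "sedentary_mins"]

# assign evaluates keyword arguments in order, so total_hours can use total_mins
derived = sample.assign(
    day_of_the_week=lambda df: df["date"].dt.day_name(),
    total_mins=lambda df: df[minute_cols].sum(axis=1),
    total_hours=lambda df: (df["total_mins"] / 60).round(),
)

print(derived[["day_of_the_week", "total_mins", "total_hours"]])
```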

In [8]:
dataset_modified.columns
# new list of columns with new column 'day of the week' for further analysis
rearranged_columns =['Id', 'ActivityDate', 'day_of_the_week', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
       'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories']

# rearrangement of columns
dataset_rearr = dataset_modified.reindex(columns=rearranged_columns)
dataset_rearr

Unnamed: 0,Id,ActivityDate,day_of_the_week,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories
0,1503960366,2016-04-12,,13162,8.500000,8.500000,0.0,1.88,0.55,6.06,0.00,25,13,328,728,1985
1,1503960366,2016-04-13,,10735,6.970000,6.970000,0.0,1.57,0.69,4.71,0.00,21,19,217,776,1797
2,1503960366,2016-04-14,,10460,6.740000,6.740000,0.0,2.44,0.40,3.91,0.00,30,11,181,1218,1776
3,1503960366,2016-04-15,,9762,6.280000,6.280000,0.0,2.14,1.26,2.83,0.00,29,34,209,726,1745
4,1503960366,2016-04-16,,12669,8.160000,8.160000,0.0,2.71,0.41,5.04,0.00,36,10,221,773,1863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
935,8877689391,2016-05-08,,10686,8.110000,8.110000,0.0,1.08,0.20,6.80,0.00,17,4,245,1174,2847
936,8877689391,2016-05-09,,20226,18.250000,18.250000,0.0,11.10,0.80,6.24,0.05,73,19,217,1131,3710
937,8877689391,2016-05-10,,10733,8.150000,8.150000,0.0,1.35,0.46,6.28,0.00,18,11,224,1187,2832
938,8877689391,2016-05-11,,21420,19.559999,19.559999,0.0,13.22,0.41,5.89,0.00,88,12,213,1127,3832


In [9]:
# rename columns
dataset_rearr.rename(columns = {"Id":"id", "ActivityDate":"date", "TotalSteps":"total_steps", "TotalDistance":"total_dist", "TrackerDistance":"track_dist", "LoggedActivitiesDistance":"logged_dist", "VeryActiveDistance":"very_active_dist", "ModeratelyActiveDistance":"moderate_active_dist", "LightActiveDistance":"light_active_dist", "SedentaryActiveDistance":"sedentary_active_dist", "VeryActiveMinutes":"very_active_mins", "FairlyActiveMinutes":"fairly_active_mins", "LightlyActiveMinutes":"lightly_active_mins", "SedentaryMinutes":"sedentary_mins", "TotalExerciseMinutes":"total_mins","TotalExerciseHours":"total_hours","Calories":"calories"}, inplace = True)

# print column names to confirm
dataset_rearr.columns
dataset_rearr.head()

Unnamed: 0,id,date,day_of_the_week,total_steps,total_dist,track_dist,logged_dist,very_active_dist,moderate_active_dist,light_active_dist,sedentary_active_dist,very_active_mins,fairly_active_mins,lightly_active_mins,sedentary_mins,calories
0,1503960366,2016-04-12,,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985
1,1503960366,2016-04-13,,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797
2,1503960366,2016-04-14,,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776
3,1503960366,2016-04-15,,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745
4,1503960366,2016-04-16,,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863


In [10]:
# put days of the week data into column  
dataset_rearr["day_of_the_week"] = dataset_rearr["date"].dt.day_name()

# confirmation
dataset_rearr["day_of_the_week"].head(5)

0      Tuesday
1    Wednesday
2     Thursday
3       Friday
4     Saturday
Name: day_of_the_week, dtype: object

In [11]:
dataset_rearr

Unnamed: 0,id,date,day_of_the_week,total_steps,total_dist,track_dist,logged_dist,very_active_dist,moderate_active_dist,light_active_dist,sedentary_active_dist,very_active_mins,fairly_active_mins,lightly_active_mins,sedentary_mins,calories
0,1503960366,2016-04-12,Tuesday,13162,8.500000,8.500000,0.0,1.88,0.55,6.06,0.00,25,13,328,728,1985
1,1503960366,2016-04-13,Wednesday,10735,6.970000,6.970000,0.0,1.57,0.69,4.71,0.00,21,19,217,776,1797
2,1503960366,2016-04-14,Thursday,10460,6.740000,6.740000,0.0,2.44,0.40,3.91,0.00,30,11,181,1218,1776
3,1503960366,2016-04-15,Friday,9762,6.280000,6.280000,0.0,2.14,1.26,2.83,0.00,29,34,209,726,1745
4,1503960366,2016-04-16,Saturday,12669,8.160000,8.160000,0.0,2.71,0.41,5.04,0.00,36,10,221,773,1863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
935,8877689391,2016-05-08,Sunday,10686,8.110000,8.110000,0.0,1.08,0.20,6.80,0.00,17,4,245,1174,2847
936,8877689391,2016-05-09,Monday,20226,18.250000,18.250000,0.0,11.10,0.80,6.24,0.05,73,19,217,1131,3710
937,8877689391,2016-05-10,Tuesday,10733,8.150000,8.150000,0.0,1.35,0.46,6.28,0.00,18,11,224,1187,2832
938,8877689391,2016-05-11,Wednesday,21420,19.559999,19.559999,0.0,13.22,0.41,5.89,0.00,88,12,213,1127,3832


In [12]:
# add new column 'total_mins' as a sum of very_active_mins, fairly_active_mins, lightly_active_mins and sedentary_mins.
dataset_rearr["total_mins"] = dataset_rearr["very_active_mins"] + dataset_rearr["fairly_active_mins"] + dataset_rearr["lightly_active_mins"] + dataset_rearr["sedentary_mins"]
dataset_rearr["total_mins"] 

0      1094
1      1033
2      1440
3       998
4      1040
       ... 
935    1440
936    1440
937    1440
938    1440
939     931
Name: total_mins, Length: 940, dtype: int64
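A day has 1,440 minutes, so `total_mins` should never exceed that value, and rows below it suggest the tracker was not worn all day. A minimal sketch of that check, assuming the first five values from the output above; `MINUTES_PER_DAY`, `exceeds_day` and `full_day_share` are illustrative names.

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60  # 1440

# First five total_mins values from the output above
total_mins = pd.Series([1094, 1033, 1440, 998, 1040], name="total_mins")

# True for any row that reports more minutes than a day contains
exceeds_day = total_mins > MINUTES_PER_DAY

# Share of rows where the tracker covered the full 24 hours
full_day_share = (total_mins == MINUTES_PER_DAY).mean()

print(exceeds_day.any())
print(full_day_share)
```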

In [13]:
# create new column 'total_hours' by converting the values of the 'total_mins' column to hours (rounded to the nearest hour)
dataset_rearr["total_hours"] = round(dataset_rearr["total_mins"] / 60)
dataset_rearr["total_hours"] # confirmation


0      18.0
1      17.0
2      24.0
3      17.0
4      17.0
       ... 
935    24.0
936    24.0
937    24.0
938    24.0
939    16.0
Name: total_hours, Length: 940, dtype: float64

***
# ANALYZE & SHARE


In [14]:
#calculating statistical data
dataset_rearr.describe()

Unnamed: 0,id,total_steps,total_dist,track_dist,logged_dist,very_active_dist,moderate_active_dist,light_active_dist,sedentary_active_dist,very_active_mins,fairly_active_mins,lightly_active_mins,sedentary_mins,calories,total_mins,total_hours
count,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0,940.0
mean,4855407000.0,7637.910638,5.489702,5.475351,0.108171,1.502681,0.567543,3.340819,0.001606,21.164894,13.564894,192.812766,991.210638,2303.609574,1218.753191,20.31383
std,2424805000.0,5087.150742,3.924606,3.907276,0.619897,2.658941,0.88358,2.040655,0.007346,32.844803,19.987404,109.1747,301.267437,718.166862,265.931767,4.437283
min,1503960000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
25%,2320127000.0,3789.75,2.62,2.62,0.0,0.0,0.0,1.945,0.0,0.0,0.0,127.0,729.75,1828.5,989.75,16.0
50%,4445115000.0,7405.5,5.245,5.245,0.0,0.21,0.24,3.365,0.0,4.0,6.0,199.0,1057.5,2134.0,1440.0,24.0
75%,6962181000.0,10727.0,7.7125,7.71,0.0,2.0525,0.8,4.7825,0.0,32.0,19.0,264.0,1229.5,2793.25,1440.0,24.0
max,8877689000.0,36019.0,28.030001,28.030001,4.942142,21.92,6.48,10.71,0.11,210.0,143.0,518.0,1440.0,4900.0,1440.0,24.0


Interpreting the statistics above:

- On average, users logged about 7,638 steps or 5.5 km per day, which is not adequate. As recommended by the CDC, an adult female should aim for at least 10,000 steps or 8 km per day to benefit from general health, weight loss and fitness improvement. **Source**: [Medical News Today article](https://www.medicalnewstoday.com/articles/how-many-steps-should-you-take-a-day#for-general-health)

- Sedentary time dominates: users logged 991 sedentary minutes per day on average, about 81% of the average total tracked minutes.

- The average calories burned is about 2,304 per day, equivalent to roughly 0.66 pound (at about 3,500 calories per pound). This cannot be interpreted in more detail, as calories burned depend on several factors such as age, weight, daily tasks, exercise, hormones and daily calorie intake. **Source**: [Health Line article](https://www.healthline.com/health/fitness-exercise/how-many-calories-do-i-burn-a-day#Burning-calories)
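The figures quoted in this interpretation can be reproduced from the means in the `describe()` output. The sketch below copies those means by hand; the 3,500 calories per pound conversion is a common rule of thumb assumed here, not a value taken from the dataset, and the variable names are illustrative.

```python
# Means copied from the describe() output above
mean_steps = 7637.910638
mean_sedentary_mins = 991.210638
mean_total_mins = 1218.753191
mean_calories = 2303.609574

# Assumption: rule-of-thumb conversion, not part of the dataset
CALORIES_PER_POUND = 3500

# Share of tracked time that is sedentary
sedentary_share = mean_sedentary_mins / mean_total_mins

# Average daily calories expressed in pounds
pounds_equivalent = mean_calories / CALORIES_PER_POUND

# Steps missing to reach the 10,000-step goal
step_goal_gap = 10_000 - mean_steps

print(f"{sedentary_share:.0%}")
print(f"{pounds_equivalent:.2f}")
print(f"{step_goal_gap:.0f}")
```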



Exporting the cleaned file for further analysis in **Tableau**

In [15]:
dataset_rearr.to_excel("Data - dailyActivity_merged_cleaned.xlsx",index=False)
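Writing `.xlsx` files with pandas typically needs an Excel engine such as openpyxl to be installed. If that is not available, a CSV export is a possible alternative, since Tableau also reads text files. A minimal sketch that writes to an in-memory buffer instead of a real file; `cleaned`, `buffer` and `csv_text` are hypothetical names.

```python
import io
import pandas as pd

# One-row stand-in for the cleaned data set
cleaned = pd.DataFrame({
    "id": [1503960366],
    "date": pd.to_datetime(["2016-04-12"]),
    "total_steps": [13162],
})

# index=False drops the row index, as in the Excel export above;
# date_format keeps dates as plain YYYY-MM-DD strings
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False, date_format="%Y-%m-%d")
csv_text = buffer.getvalue()

print(csv_text)
```

Passing a file name instead of `buffer` would write the CSV to disk.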

# [CHECK GRAPHS](https://public.tableau.com/app/profile/rados.aw5086/viz/BellabeatGraphs/Bars#1)

How can these trends apply to Bellabeat customers?

In this histogram, we look at how often the FitBit app was used on each day of the week.

Users prefer, or remember (giving them the benefit of the doubt that they may have forgotten), to track their activity on the app midweek, from Tuesday to Thursday.

Note that the frequency drops on Friday and stays lower over the weekend and on Monday.
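The per-day frequency behind the histogram can also be checked directly in pandas. The sketch below counts records per weekday and puts them in calendar order; it uses a made-up list of day names, so the counts are not the real ones, and `days`, `week_order` and `records_per_day` are illustrative names.

```python
import pandas as pd

# Made-up day names standing in for the day_of_the_week column
days = pd.Series(
    ["Tuesday", "Wednesday", "Tuesday", "Friday", "Monday", "Tuesday"],
    name="day_of_the_week",
)

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Count records per day, then order Monday to Sunday (missing days become 0)
records_per_day = days.value_counts().reindex(week_order, fill_value=0)

print(records_per_day)
```

On the real data, `dataset_rearr["day_of_the_week"]` would take the place of `days`.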



Bellabeat app can recommend reducing sedentary time.