# 1 Introduction



## How Can a Wellness Technology Company Play It Smart?

# Step 1: Ask

### Background

Bellabeat is a high-tech company that manufactures health-focused smart products. They offer different
smart devices that collect data on activity, sleep, stress, and reproductive health to empower women
with knowledge about their own health and habits.

The main focus of this case is to analyze fitness data from smart devices and determine how it could help
unlock new growth opportunities for Bellabeat. We will focus on one of Bellabeat's products: the Bellabeat
app.

The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle,
and mindfulness habits. This data can help users better understand their current habits and make
healthy decisions. The Bellabeat app connects to their line of smart wellness products.

### Key Stakeholders

* Urška Sršen: Bellabeat cofounder and Chief Creative Officer
* Sando Mur: Bellabeat cofounder and key member of the Bellabeat executive team
* Bellabeat Marketing Analytics team

### Business Task

Given the previous facts, the business task is to look for patterns in how people use their
smart devices, in order to gain insights that can later guide marketing decisions. In one
phrase:

How do users use their smart devices? Identify trends in how consumers use non-Bellabeat smart
devices and apply those insights to Bellabeat's marketing strategy.


# Step 2: Prepare

### Dataset used

The data source used for this case study is FitBit Fitness Tracker Data. This dataset is
stored in Kaggle, was made available through Mobius, and was generated by respondents to a
survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.

### Accessibility and privacy of data

The data is licensed under CC0: Public Domain. The person who dedicated the work has waived
all of his or her rights to the work worldwide under copyright law, including all related and
neighboring rights, to the extent allowed by law. The work can be copied, modified,
distributed and performed, even for commercial purposes, all without asking permission.

### Data organization and verification

The dataset is a collection of 18 .csv files: 15 in long format and 3 in wide format. The
files contain a wide range of information, including activity metrics, calories, sleep records,
metabolic equivalent of tasks (METs), heart rate and steps, recorded at intervals of
seconds, minutes, hours and days.

### Data limitations

The data has some limitations which could undermine the results of the analysis.
The limitations to take into consideration are:
* Missing demographics
* Small sample size
* Short time period of data collection

# Step 3: Process

### Loading libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns

from pandas.api.types import CategoricalDtype


### Importing datasets

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

First, let's import the dataset 'dailyActivity_merged.csv' using the pandas pd.read_csv() function

In [None]:
df = pd.read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

### Data exploration

Once the data is loaded, let's first see how many rows and columns we have, using the pandas attribute .shape

In [None]:
df.shape

We can see that we have 940 rows and 15 columns. Let's take a look at the names of the columns, using the pandas attribute .columns

In [None]:
df.columns

Knowing the names of the columns, let's take a quick look at the rows and the data itself, using the pandas function .head()

In [None]:
df.head(8)

The dataset stores the data collected on a daily basis by FitBit fitness tracking devices, such as smartwatches and/or fitness apps.
From a quick look we can summarize the columns as follows (the descriptions of the activity-level columns are our reading of the column names):
* Id:                        a unique identifier of the users in the survey
* ActivityDate:              the specific date of the entry
* TotalSteps:                the total steps each user took each day
* TotalDistance:             the total distance each user covered each day
* TrackerDistance:           the distance the device tracked each day
* LoggedActivitiesDistance:  the distance tracked by the device during specific logged activities
* VeryActiveDistance:        the distance traveled in a very active physical state
* ModeratelyActiveDistance:  the distance traveled in a moderately active physical state
* LightActiveDistance:       the distance traveled in a lightly active physical state
* SedentaryActiveDistance:   the distance traveled in a sedentary state
* VeryActiveMinutes:         the minutes spent in a very active physical state
* FairlyActiveMinutes:       the minutes spent in a fairly active physical state
* LightlyActiveMinutes:      the minutes spent in a lightly active physical state
* SedentaryMinutes:          the minutes spent in a sedentary state
* Calories:                  the calories burned that specific day

Having looked at the columns and the data, we can start the process of cleaning.

## Cleaning the data

### Checking Data types

First we have to check whether the data types align with the content and purpose of the data in each column. We can use the .dtypes attribute for that

In [None]:
df.dtypes

We can see that the Id column is an integer, but in this instance it should be a string (object). Why? Because the Id is only an identifier, and we do not intend to perform arithmetic on it (sums, multiplications, etc.).

Also, the ActivityDate column is an object and should be a date.

All the other columns seem to have the correct data type.

In [None]:
df['Id'] = df['Id'].astype(str)
df['ActivityDate'] = pd.to_datetime(df['ActivityDate'],format="%m/%d/%Y")
df.dtypes # After reformating. We double check the data type

In the cell above we converted the Id column from int to str using the .astype(str) function, and the 'ActivityDate' column from object (string) to datetime using pd.to_datetime(). The final call to .dtypes confirms the new data types.
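
As a minimal sketch of the same conversion on a tiny frame, the cell below uses two made-up rows (the sample values are assumptions for illustration, not taken from the dataset) to show what the data types look like afterwards.

```python
import pandas as pd

# Hypothetical sample rows that mimic the layout of dailyActivity_merged.csv
sample = pd.DataFrame({
    "Id": [1503960366, 1503960366],
    "ActivityDate": ["4/12/2016", "4/13/2016"],
})

# Identifiers become strings so they are never summed or averaged by accident
sample["Id"] = sample["Id"].astype(str)

# An explicit format makes the month/day order unambiguous
sample["ActivityDate"] = pd.to_datetime(sample["ActivityDate"], format="%m/%d/%Y")

print(sample.dtypes)
```

Passing the format explicitly matters for dates such as 4/12/2016, which could otherwise be read as either April 12 or December 4.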

### Checking column values

After that we can get rid of columns that are not relevant for our analysis.
First we look at the 'TotalDistance' column and the other columns related to distance tracking.
At first glance 'TotalDistance' and 'TrackerDistance' seem to have similar values, but we are not sure.
We can also assume that 'TrackerDistance' or 'TotalDistance' is the sum of the different "*ActiveDistance" columns. We may be wrong, so we check first.

In [None]:
# We create a new column, adding up the "ActiveDistance" columns to see if it's equal to the 'TotalDistance' column, or the 'TrackerDistance' column
df['sum_distance'] = df['VeryActiveDistance'] + df['ModeratelyActiveDistance'] + df['LightActiveDistance'] + df['SedentaryActiveDistance']

# We also notice that 'LoggedActivitiesDistance' is 0.0 in most entries, so we filter for the rows where it is greater than 0
df.loc[(df['LoggedActivitiesDistance'] > 0),['TotalDistance','TrackerDistance','LoggedActivitiesDistance','sum_distance']]

The previous output shows that although 'TotalDistance' and 'TrackerDistance' are not always exactly equal, they are the same in most cases.
We also see that there are entries in 'LoggedActivitiesDistance' higher than 0, but only a few.
Finally, we see that the sum of the 'ActiveDistance' columns is equal to the 'TotalDistance' column, differing only in the last decimal place due to rounding.
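
The filtered view above only shows a few rows. One way to check the claim on every row is to compare the two columns within a small tolerance. The cell below is a sketch on made-up values; both the sample numbers and the 0.05 tolerance are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real values come from dailyActivity_merged.csv
sample = pd.DataFrame({
    "TotalDistance": [8.50, 6.97, 0.00],
    "VeryActiveDistance": [1.88, 1.57, 0.00],
    "ModeratelyActiveDistance": [0.55, 0.69, 0.00],
    "LightActiveDistance": [6.06, 4.71, 0.00],
    "SedentaryActiveDistance": [0.00, 0.00, 0.00],
})

parts = ["VeryActiveDistance", "ModeratelyActiveDistance",
         "LightActiveDistance", "SedentaryActiveDistance"]
sample["sum_distance"] = sample[parts].sum(axis=1)

# Treat differences up to 0.05 as rounding noise (the tolerance is an assumption)
matches = np.isclose(sample["sum_distance"], sample["TotalDistance"], atol=0.05)
print(f"{matches.mean():.0%} of rows match within tolerance")
```

Rows where `matches` is False would be the ones worth inspecting by hand.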

So now we have to decide whether to keep all the columns or delete some. We conclude that the 'TotalDistance' column and the 'TrackerDistance' column are equal in most cases (with 'TotalDistance' having the higher values where they differ), so we decide to keep 'TotalDistance'.

As for the 'ActiveDistance' columns, unfortunately we do not know the reasoning behind the categorization. What is the exact difference between 'Moderately Active' and 'Very Active'? Maybe the heart rate at that moment, or the steps per minute? We cannot tell from this specific dataset, but we will keep the columns nonetheless.

The same could be said about the 'ActiveMinutes' columns, so we just add them up in a new column.

In [None]:
df['TotalMinutes'] = df['VeryActiveMinutes'] + df['FairlyActiveMinutes'] + df['LightlyActiveMinutes'] + df['SedentaryMinutes']

### Renaming columns

Now, let's turn the column names into lower case with the function str.lower().
Then we rename them to snake_case with the rename function.

In [None]:
df.columns = df.columns.str.lower()
df.rename(columns = {'trackerdistance':'tracker_distance','activitydate':'activity_date','totalsteps':'total_steps','totaldistance':'total_distance',
       'loggedactivitiesdistance':'logged_activities_distance', 'veryactivedistance':'very_active_distance',
       'moderatelyactivedistance':'moderately_active_distance', 'lightactivedistance':'light_active_distance',
       'sedentaryactivedistance':'sedentary_active_distance', 'veryactiveminutes':'very_active_minutes',
       'fairlyactiveminutes':'fairly_active_minutes','lightlyactiveminutes':'lightly_active_minutes',
       'sedentaryminutes':'sedentary_minutes'}
         ,inplace=True) # We make the changes permanent by using inplace=True
print('Double check the name of the columns:')
df.columns

### Creating columns

Let's add a column that tells us the day of the week, using the datetime function day_name(), and another column with the number of the day of the week, using the weekday attribute

In [None]:
day_of_week = df['activity_date'].dt.day_name()
df['day_of_week'] = day_of_week
df['n_day_of_week'] = df['activity_date'].dt.weekday # 0 represents monday, 6 represents sunday
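
Day names sort alphabetically by default (Friday, Monday, Saturday, ...), which is rarely the order we want in tables and plots. A minimal sketch of one way around this, using the `CategoricalDtype` imported earlier, is shown below. The three dates are made up for illustration, and `ordered_days` is a name introduced here, not part of the notebook.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical dates: 2016-04-12 was a Tuesday
dates = pd.Series(pd.to_datetime(["2016-04-17", "2016-04-12", "2016-04-15"]))

ordered_days = CategoricalDtype(
    categories=["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"],
    ordered=True,
)

day_names = dates.dt.day_name().astype(ordered_days)

# Sorting now follows the calendar week instead of the alphabet
print(day_names.sort_values().tolist())
```

The numeric `n_day_of_week` column created above serves the same purpose when a plain integer sort key is enough.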

### Checking empty cells and null values

Checking for null values with the function isna().sum()

In [None]:
print('Total number of null values are: ')
print(df.isna().sum())

Checking for duplicate entries using the function duplicated().sum()

In [None]:
print('Total number of duplicated values are: ',df.duplicated().sum())

There are no null values or duplicated entries.
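
Note that duplicated() with no arguments only flags rows that are identical in every column. A stricter check, sketched below on made-up data, looks for repeated (id, activity_date) pairs, since each user should have at most one row per day. The sample frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one repeated (id, date) pair
sample = pd.DataFrame({
    "id": ["a", "a", "b", "b"],
    "activity_date": pd.to_datetime(
        ["2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13"]
    ),
    "total_steps": [1000, 1200, np.nan, 3000],
})

null_counts = sample.isna().sum()              # missing values per column
full_row_dupes = sample.duplicated().sum()     # rows identical in every column
key_dupes = sample.duplicated(subset=["id", "activity_date"]).sum()

print(null_counts, full_row_dupes, key_dupes, sep="\n")
```

In this toy frame the full-row check finds nothing, while the key-based check catches the user with two different entries for the same day.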

### Subsetting the data

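Now we can select only the columns we will use for our analysis. In this case we keep the id, the date, the steps, the total distance, the activity minutes, the calories and the day-of-week columns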

In [None]:
df = df[['id', 'activity_date', 'total_steps', 'total_distance',
       #'tracker_distance', 'logged_activities_distance',
       #'very_active_distance', 'moderately_active_distance',
       #'light_active_distance', 'sedentary_active_distance',
       'very_active_minutes', 'fairly_active_minutes',
       'lightly_active_minutes', 'sedentary_minutes', 'calories',
       #'sum_distance','totalminutes', 
       'day_of_week', 'n_day_of_week'
        ]].copy()

### Category creation

Now I'm going to create my own categorization of the users, by level of physical activity and by device usage.

Physical activity will follow these criteria:
* Sedentary: less than 6000 daily steps on average
* Active:  between 6000 and 12000 daily steps on average
* Very active: more than 12000 daily steps on average

Device usage will follow these criteria:
* Low use: less than 8 hours of use per day.
* Normal use: between 8 and 16 hours of use per day.
* High use: more than 16 hours of use per day.


For this dataset, I'll begin by creating only the category 'activity_level'.
I will create the other category when I analyze a dataset that has data stored on an hourly basis

In [None]:
# I first group the data by the id
id_grp = df.groupby(['id'])

# Then I look for the average amount of steps, and sort the results in descending order
id_avg_step = id_grp['total_steps'].mean().sort_values(ascending=False)

# After that, I turn the results into a dataframe
id_avg_step = id_avg_step.to_frame()

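# I want to create a new column which tells which category each user fits into, depending on the average number of steps.
# The conditions are built on the 'total_steps' Series, so each one is a one-dimensional boolean mask
# (an average of exactly 6000 counts as sedentary, and exactly 12000 as very active)
avg_steps_by_id = id_avg_step['total_steps']
conditions = [
    (avg_steps_by_id <= 6000),
    (avg_steps_by_id > 6000) & (avg_steps_by_id < 12000),
    (avg_steps_by_id >= 12000)
] # These are the conditions

values = ['sedentary','active','very_active'] # And here are the names of the values

# I create a column with the numpy function np.select to assign each id a category.
# A string default keeps the result a pure string array; np.select's built-in default is the integer 0
id_avg_step['activity_level'] = np.select(conditions, values, default='unknown')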

# I store the results in a variable to use it in the next step
id_activity_level = id_avg_step['activity_level']

# I use a list comprehension to create the column in our original dataset.
# With this list comprehension I retrieve the categories where the index match the id column
df['activity_level'] = [id_activity_level[c] for c in df['id']]
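
As an alternative sketch (not the approach used above), the same binning can be written with pd.cut, followed by Series.map to bring the labels back onto the daily rows. The user ids and step values below are made up, and `daily` is a hypothetical stand-in for the real dataframe.

```python
import pandas as pd

# Hypothetical average daily steps per user id
avg_steps_by_id = pd.Series(
    {"u1": 4200.0, "u2": 8900.0, "u3": 15100.0}, name="total_steps"
)

# right=False makes each bin [low, high), so exactly 6000 is 'active' and
# exactly 12000 is 'very_active' (a boundary choice that differs slightly
# from the np.select conditions at those exact values)
levels = pd.cut(
    avg_steps_by_id,
    bins=[0, 6000, 12000, float("inf")],
    labels=["sedentary", "active", "very_active"],
    right=False,
)

# Map each user's level back onto the daily rows
daily = pd.DataFrame({
    "id": ["u1", "u1", "u2", "u3"],
    "total_steps": [4000, 4400, 8900, 15100],
})
daily["activity_level"] = daily["id"].map(levels)
print(daily)
```

pd.cut keeps the thresholds in a single list, which makes them easier to change later than a set of hand-written conditions.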

# Step 4: Analyze

Let's check how many unique ids there are with the function nunique(),
and which ones they are with the unique() function

In [None]:
print('Number of unique values in id column:',df['id'].nunique())
print()
print('List of id values:',df['id'].unique())

Now let's see how many times each id appears in the dataset with value_counts()

In [None]:
print('How many times each id appear in the dataset?')
print(df['id'].value_counts())

As we can see, there are 33 unique ids (users). Most of them appear 31 times in the dataset, and some appear fewer times.

Now let's check the date column: the minimum date, the maximum date, and the number of unique dates

In [None]:
print('The min date is:',min(df['activity_date']))
print('The max date is:',max(df['activity_date']))
print('The number of unique dates are:',df['activity_date'].nunique())

As we can see, we have exactly 31 days, ranging from '2016-04-12' to '2016-05-12'.

Now we can start the exploratory data analysis
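
The number of days between the first and last date can also be computed directly and compared with the number of unique dates. The cell below is a small sketch on a generated date range; the `dates` Series is a stand-in for the real 'activity_date' column.

```python
import pandas as pd

# Hypothetical date column covering the same period as the dataset
dates = pd.Series(pd.date_range("2016-04-12", "2016-05-12", freq="D"))

# Subtracting two Timestamps gives a Timedelta; +1 counts both endpoints
span_days = (dates.max() - dates.min()).days + 1

# If the two numbers differ, some calendar days are missing from the data
print(span_days, dates.nunique())
```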

In [None]:
# First we use the describe() function to see some statistics
df.describe()

Here we can see the mean (average), the min and max values, the median (the 50% row), etc.

We can already see in the max row that someone walked 28 miles and someone burned 4900 calories. These could be outliers, so we may pay attention to them later

# Step 5: Share

## Correlation between steps and calories

What is the correlation between the amount of steps done, and the amount of calories burnt?

In [None]:
ax =sns.scatterplot(x='total_steps', y='calories', data=df,hue='activity_level')

#handles, labels = ax.get_legend_handles_labels()
#plt.legend(handles, day_of_week, fontsize=7)
plt.title('Correlation Calories vs. Steps')

plt.show()

We can see in this scatterplot a somewhat positive correlation: the more steps taken, the more calories burnt.
We also colored the dots by the activity_level category, so we can see which group each point belongs to
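
To put a number on "somewhat positive", we could compute Pearson's correlation coefficient. The sketch below runs on synthetic data generated only for illustration, so the printed value says nothing about the FitBit dataset. On the real data the same call would be applied to the 'total_steps' and 'calories' columns.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: calories rise with steps, plus noise
rng = np.random.default_rng(seed=0)
steps = rng.integers(0, 20000, size=200)
calories = 1800 + 0.08 * steps + rng.normal(0, 300, size=200)
sample = pd.DataFrame({"total_steps": steps, "calories": calories})

# Series.corr defaults to Pearson's r, which ranges from -1 to 1
r = sample["total_steps"].corr(sample["calories"])
print(f"Pearson r = {r:.2f}")
```

Values close to 1 indicate a strong positive linear relationship, and values close to 0 indicate little or none.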

## Average number of steps per day

What is the average number of steps per day?

In [None]:
day_of_week = ['Monday','Tuesday','Wednesday','Thursday', 'Friday','Saturday','Sunday']
fig, ax =plt.subplots(1,1,figsize=(9,6))

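day_grp = df.groupby('day_of_week')
# groupby sorts the day names alphabetically, so we reindex to Monday..Sunday
# to keep the bars in the same order as the tick labels set below
avg_daily_steps = day_grp['total_steps'].mean().reindex(day_of_week)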
avg_steps = df['total_steps'].mean()

plt.bar(avg_daily_steps.index,avg_daily_steps)

ax.set_xticks(range(len(day_of_week)))
ax.set_xticklabels(day_of_week)

ax.axhline(y=avg_daily_steps.mean(),color='red', label='Average daily steps')
ax.set_ylabel('Number of steps')
ax.set_xlabel('Day of the week')
ax.set_title('Avg Number of steps per day')

plt.legend()
plt.show()

The results show that Monday, Tuesday and Saturday are the days when the users were most physically active, above the overall average number of steps.
Wednesday, Thursday and Friday are below the average, but all three fall in the same range.
Sunday is the least active day of the week.

From this we can interpret that users tend to be more physically active during the first days of the week and on Saturdays, which gives us a hint of the activities they may do.

## Percentage of activity in minutes

What percentage of the time are people active?

In [None]:
very_active_mins = df['very_active_minutes'].sum() 
fairly_active_mins = df['fairly_active_minutes'].sum()
lightly_active_mins = df['lightly_active_minutes'].sum()
sedentary_mins = df['sedentary_minutes'].sum()

slices = [very_active_mins,fairly_active_mins,lightly_active_mins,sedentary_mins]
labels = ['very active minutes','fairly active minutes','lightly active minutes','sedentary minutes']
explode = [0,0,0,0.1]
plt.pie(slices, labels = labels, explode = explode, autopct='%1.1f%%',textprops=dict(size=9), shadow=True)

plt.title('Percentage of activity in minutes',fontsize=18)
plt.tight_layout()

plt.show()

This pie chart shows that the users are in a sedentary state most of the time, spend about a sixth of the time doing light activity, and spend only 2% of the time being very active, doing proper exercise.
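
The percentages drawn on the pie chart can also be printed as plain numbers. The cell below is a sketch on two made-up rows (the minute values are assumptions), using the same column names as our dataframe.

```python
import pandas as pd

# Hypothetical daily rows; each row adds up to the 1440 minutes of one day
sample = pd.DataFrame({
    "very_active_minutes": [20, 0],
    "fairly_active_minutes": [15, 5],
    "lightly_active_minutes": [200, 150],
    "sedentary_minutes": [1205, 1285],
})

# Total minutes per activity level, then each level's share of all tracked minutes
totals = sample.sum()
shares = (totals / totals.sum() * 100).round(1)
print(shares)
```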

## Correlation Between activity level minutes and calories

In [None]:
n_day_of_week = [0,1,2,3,4,5,6]

fig, axes = plt.subplots(nrows=2, ncols=2,figsize=(11,15),dpi=70)

sns.scatterplot(data=df,x='calories',y='sedentary_minutes',hue='activity_level',ax=axes[0,0],legend=False)

sns.scatterplot(data=df,x='calories',y='lightly_active_minutes',hue='activity_level',ax=axes[0,1],legend=False)

sns.scatterplot(data=df,x='calories',y='fairly_active_minutes',hue='activity_level',ax=axes[1,0],legend=False)

sns.scatterplot(data=df,x='calories',y='very_active_minutes',hue='activity_level',ax=axes[1,1])


plt.legend(title='Activity level',title_fontsize=20,bbox_to_anchor=(1.8,2.2),fontsize=18,frameon=True,scatterpoints=1)
fig.suptitle('Correlation Between activity level minutes and calories',x=0.5,y=0.92,fontsize=24)
plt.show()
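
The four panels can be complemented with one correlation coefficient per activity level. The cell below is a sketch on made-up rows, so the resulting values are illustrative only. `minute_cols` and `r_by_level` are names introduced here.

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "sedentary_minutes": [1300, 1100, 900, 1000, 1200],
    "lightly_active_minutes": [100, 200, 300, 250, 150],
    "fairly_active_minutes": [0, 10, 30, 20, 5],
    "very_active_minutes": [0, 15, 60, 30, 5],
    "calories": [1700, 2100, 2900, 2500, 1900],
})

minute_cols = [c for c in sample.columns if c.endswith("_minutes")]

# corrwith computes Pearson's r between each column and the calories Series
r_by_level = sample[minute_cols].corrwith(sample["calories"])
print(r_by_level.sort_values(ascending=False))
```

Run on the real dataframe, a table like this would show which kind of activity minutes moves most closely with calories burnt.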

# Step 6: Act

After analyzing the FitBit Fitness Tracker Data, we have found some insights that could
help influence Bellabeat's marketing strategy.

## A multipurpose device

Bellabeat can let users know that their products are not only meant for sports or exercise-related
activities. As the data show, many users spend more time wearing the tracking device on weekends
than on weekdays. This could mean that they associate the product only with sports, or with the
usual walk to the park on Sundays. Bellabeat can show that their products are meant to accompany
users wherever they go, for any daily activity such as work, and to help them track information
to improve their overall fitness and health. This will encourage women from diverse demographics
and backgrounds to use Bellabeat's products, which are meant for all women who care about their
overall health.

## Rewards and reminders

Bellabeat can integrate functions within the Bellabeat app or other products, such as rewards,
incentives and reminders, to encourage their users to hit certain marks. These marks could be
reaching a minimum of 7,500 steps per day, burning a certain number of calories for people who
want to lose weight, or keeping an 8-hour sleep pattern. Rewards could include a leaderboard of
the top users who have reached and maintained the minimum daily steps for the longest time, and
virtual medals or prizes, such as discounts or offers. For the reminders, Bellabeat could send
notifications to their users when they are lagging behind on such goals, and it could also offer
recommendations to help them with their sleep or with achieving their goals.
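
As a rough sketch of how such a leaderboard could be computed from daily step data, the cell below ranks users by their longest run of consecutive days at or above the step goal. The user ids, the step values and the helper `longest_streak` are all hypothetical, and the sketch assumes each user has exactly one row per calendar day with no gaps.

```python
import pandas as pd

STEP_GOAL = 7500  # daily target mentioned above

# Hypothetical daily step counts for two users
daily = pd.DataFrame({
    "id": ["u1"] * 4 + ["u2"] * 4,
    "activity_date": list(pd.date_range("2016-04-12", periods=4)) * 2,
    "total_steps": [8000, 9000, 3000, 7600, 2000, 7500, 7700, 8100],
})

def longest_streak(hit_goal: pd.Series) -> int:
    """Length of the longest run of consecutive True values."""
    # Each False starts a new block; summing the True values inside
    # a block gives the length of the run it contains
    blocks = (~hit_goal).cumsum()
    return int(hit_goal.groupby(blocks).sum().max())

daily = daily.sort_values(["id", "activity_date"])
daily["hit_goal"] = daily["total_steps"] >= STEP_GOAL

leaderboard = (
    daily.groupby("id")["hit_goal"]
    .apply(longest_streak)
    .sort_values(ascending=False)
)
print(leaderboard)
```

A real implementation would also need to decide how to treat days on which the device was not worn.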