# Introduction

## Strategic Approaches for Wellness Technology Companies

# Step 1: Ask

### Background

Bellabeat is a technology company specializing in health-focused smart products. Their product line includes smart devices designed to monitor various health metrics such as activity levels, sleep patterns, stress levels, and reproductive health. These devices aim to empower women by providing them with insights into their health and habits.

This case study focuses on analyzing fitness data collected by Bellabeat's smart devices to identify new growth opportunities. Specifically, we will examine the Bellabeat app, a central component of their product ecosystem.

The Bellabeat app offers users insights into their health metrics, including activity levels, sleep quality, stress levels, menstrual cycles, and mindfulness habits. By leveraging this data, users can gain a deeper understanding of their daily routines and make informed decisions to improve their overall well-being. The Bellabeat app integrates seamlessly with Bellabeat's range of smart wellness products.

### Key Stakeholders

* Urška Sršen, Co-founder and Chief Creative Officer at Bellabeat
* Sando Mur, Co-founder and pivotal member of Bellabeat's executive team
* Bellabeat Marketing Analytics team

### Business Task

Given this background, the business objective is to analyze how consumers use non-Bellabeat smart devices (here, FitBit trackers) and to apply the resulting insights to Bellabeat's marketing strategy, so that marketing decisions can be targeted more precisely.

# Step 2: Prepare

### Dataset used

The data source utilized for this case study is the FitBit Fitness Tracker Data, which is accessible via Kaggle. This dataset was obtained through Mobius and compiled from responses gathered via Amazon Mechanical Turk during a survey conducted between March 12, 2016, and May 12, 2016.

### Accessibility and privacy of data

The data is licensed under CC0: Public Domain. The publisher has waived all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. The work can be copied, modified, distributed, and performed, even for commercial purposes, all without asking permission.

### Data organization and verification

The dataset comprises 18 .csv files, with 15 in long format and 3 in wide format. It includes comprehensive information on various metrics such as activity levels, calorie expenditure, sleep patterns, metabolic equivalent of tasks (METs), heart rate, and step counts. The data is recorded across different timeframes, ranging from seconds and minutes to hours and days.
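To make the long/wide distinction concrete, here is a minimal sketch on made-up data. The column names and values are illustrative assumptions, not taken from the Fitabase files. It reshapes a wide table into long format with `melt` and back again with `pivot`.

```python
import pandas as pd

# Hypothetical wide table: one row per user, one column per hour
wide = pd.DataFrame({
    "Id": ["A", "B"],
    "Steps00": [10, 0],
    "Steps01": [25, 5],
})

# Long format: one row per (user, hour) observation
long_df = wide.melt(id_vars="Id", var_name="hour", value_name="steps")

# Pivoting the long table restores the wide layout
wide_again = long_df.pivot(index="Id", columns="hour", values="steps").reset_index()
```

Long format is usually easier to group and plot, while wide format is more compact to read by eye.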

### Data limitations

The dataset comes with several limitations that could potentially impact the analysis results. These limitations include:

* Absence of demographic information
* Small sample size
* Limited duration of data collection period

# Step 3: Process

### Loading libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Importing datasets

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
#pd.set_option('max_column')

In [None]:
df = pd.read_csv("/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")

### Data exploration

After loading the dataset, we first check its dimensions using the pandas .shape attribute, which returns the number of rows and columns.

In [None]:
df.shape

We observe that our dataset consists of 940 rows and 15 columns. Next, we will examine the column names using the .columns attribute in pandas.

In [None]:
df.columns

With the column names identified, let's proceed to get a quick overview of the dataset by examining the first few rows and the data itself using the .head() function in pandas.

In [None]:
df.head(10)


The dataset captures daily data collected by FitBit Fitness tracking devices, which include smartwatches and fitness apps. Based on a preliminary view, we can summarize the columns as follows:

* Id: Unique identifier for each user in the survey.
* ActivityDate: Date of the data entry.
* TotalSteps: Total number of steps taken by each user per day.
* TotalDistance: Total distance covered by each user per day.
* TrackerDistance: Distance tracked by the device each day.
* LoggedActivitiesDistance: Distance tracked by the device during specific activities.
* VeryActiveDistance: Distance covered during very active physical activities.
* ModeratelyActiveDistance: Distance covered during moderately active physical activities.
* LightActiveDistance: Distance covered during lightly active physical activities.
* SedentaryActiveDistance: Distance covered during sedentary activities.
* VeryActiveMinutes: Minutes spent in a very active physical state.
* FairlyActiveMinutes: Minutes spent in a fairly active physical state.
* LightlyActiveMinutes: Minutes spent in a lightly active physical state.
* SedentaryMinutes: Minutes spent in a sedentary state.
* Calories: Calories burned on the specific day.

These columns provide insights into various aspects of daily physical activity and health metrics tracked by FitBit devices.

Now that we have reviewed the columns and examined the data, we can begin the process of data cleaning.

## Cleaning the data

### Checking Data types

First, we need to ensure that the data type of each column matches its content and intended use. We can inspect the data types with the .dtypes attribute.

In [None]:
df.dtypes

Upon review, we notice that the Id column is currently stored as an integer. However, for our purposes, it should be converted to a string or object datatype. This adjustment is necessary because the Id serves solely as an identifier, and we do not intend to perform mathematical operations, such as addition or multiplication, with it.

Additionally, the ActivityDate column is currently stored as an object datatype. To facilitate date-related operations and ensure consistency, it should be converted to a datetime datatype.

All other columns appear to have the correct data types for their respective content and purposes.

In [None]:
df['Id'] = df['Id'].astype(str)
df['ActivityDate'] = pd.to_datetime(df['ActivityDate'],format="%m/%d/%Y")
df.dtypes # After reformatting, we double-check the data types

In the cell above, we converted the 'Id' column from integer to string using .astype(str) and the 'ActivityDate' column from object (string) to datetime using pd.to_datetime, then printed the data types again to confirm the change.
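When converting dates with an explicit format, it can help to check that every value actually parsed. The sketch below uses a small made-up sample (the rows are illustrative, and the malformed value is added on purpose) together with the `errors="coerce"` option of `pd.to_datetime`, which turns unparseable entries into `NaT` so they can be counted.

```python
import pandas as pd

# Illustrative sample; the last date is deliberately malformed
sample = pd.DataFrame({
    "Id": [1503960366, 1503960366, 1624580081],
    "ActivityDate": ["4/12/2016", "4/13/2016", "not a date"],
})

# Identifiers are labels, not quantities, so store them as strings
sample["Id"] = sample["Id"].astype(str)

# Unparseable dates become NaT instead of raising an error
parsed = pd.to_datetime(sample["ActivityDate"], format="%m/%d/%Y", errors="coerce")

# Number of rows whose date did not match the expected format
n_bad = int(parsed.isna().sum())
print("Rows with unparseable dates:", n_bad)
```

A count of zero on the real data would suggest that the chosen format string fits every row.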

### Checking column values

After converting the data types, we can proceed to remove columns that are not relevant for our analysis. Initially, we'll consider the 'TotalDistance' column and other related distance tracking columns. While 'TotalDistance' and 'TrackerDistance' appear similar, we need to verify their relationship.

Additionally, we hypothesize that either 'TrackerDistance' or 'TotalDistance' could be the sum of the '*ActiveDistance' columns, but this assumption requires verification before proceeding further.

In [None]:
# We create a new column, adding up the "ActiveDistance" columns to see if it's equal to the 'TotalDistance' column, or the 'TrackerDistance' column
df['sum_distance'] = df['VeryActiveDistance'] + df['ModeratelyActiveDistance'] + df['LightActiveDistance'] + df['SedentaryActiveDistance']

# 'LoggedActivitiesDistance' is 0.0 in most entries, so we filter for the rows where it is greater than 0
df.loc[(df['LoggedActivitiesDistance'] > 0),['TotalDistance','TrackerDistance','LoggedActivitiesDistance','sum_distance']]

Based on our analysis, we have observed the following:

1. While 'TotalDistance' and 'TrackerDistance' are not always identical, they show a high degree of similarity across most entries.

2. Entries in the 'LoggedActivitiesDistance' column are zero for most rows; only a small number of entries are non-zero.

3. The sum of the '*ActiveDistance' columns aligns closely with the 'TotalDistance' column, differing slightly due to rounding errors.

These observations suggest that 'TotalDistance' and 'TrackerDistance' are closely related metrics, and the '*ActiveDistance' columns contribute significantly to the total distance tracked.
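One way to quantify "aligns closely" is to compare the columns within a small tolerance rather than by eye. The following is a minimal sketch on made-up values. The tolerance of 0.02 is an assumption chosen to absorb two-decimal rounding and is not derived from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the three columns being compared
toy = pd.DataFrame({
    "TotalDistance": [8.50, 6.97, 5.00],
    "TrackerDistance": [8.50, 6.97, 4.00],
    "sum_distance": [8.49, 6.97, 5.01],
})

tolerance = 0.02  # assumed allowance for rounding differences

# Element-wise comparison using an absolute tolerance only
sum_matches_total = np.isclose(
    toy["sum_distance"], toy["TotalDistance"], atol=tolerance, rtol=0
)
tracker_matches_total = np.isclose(
    toy["TrackerDistance"], toy["TotalDistance"], atol=tolerance, rtol=0
)

# Share of rows where the columns agree within the tolerance
share_sum = sum_matches_total.mean()
share_tracker = tracker_matches_total.mean()
print(share_sum, share_tracker)
```

Applied to the real columns, shares close to 1 would support treating the columns as redundant.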

Now, we need to decide whether to retain or remove certain columns based on our analysis:

1. We observed that 'TotalDistance' and 'TrackerDistance' are largely similar, with 'TotalDistance' often having higher values. Therefore, we will retain the 'TotalDistance' column.

2. The 'ActiveDistance' columns categorize different activity levels, such as 'Moderately Active' and 'Very Active'. While we lack specific details on their categorization criteria (such as heart rate or steps per minute), we will keep these columns for their potential insights.

3. Similarly, the 'ActiveMinutes' columns will be kept, and we will create a new column by summing these values.

This approach ensures we retain potentially valuable data while optimizing the dataset for our analysis.

In [None]:
df['TotalMinutes'] = df['VeryActiveMinutes'] + df['FairlyActiveMinutes'] + df['LightlyActiveMinutes'] + df['SedentaryMinutes']

### Renaming columns

We will proceed by converting the column names to lowercase with str.lower() and then renaming them to snake_case with the rename function in pandas:

In [None]:
df.columns = df.columns.str.lower()
df.rename(columns = {'trackerdistance':'tracker_distance','activitydate':'activity_date','totalsteps':'total_steps','totaldistance':'total_distance',
       'loggedactivitiesdistance':'logged_activities_distance', 'veryactivedistance':'very_active_distance',
       'moderatelyactivedistance':'moderately_active_distance', 'lightactivedistance':'light_active_distance',
       'sedentaryactivedistance':'sedentary_active_distance', 'veryactiveminutes':'very_active_minutes',
       'fairlyactiveminutes':'fairly_active_minutes','lightlyactiveminutes':'lightly_active_minutes',
       'sedentaryminutes':'sedentary_minutes'}
         ,inplace=True) # We make the changes permanent by using inplace=True
print('Double check the name of the columns:')
df.columns

### Creating columns

To enhance our analysis, we will add a column with the name of the day of the week using the datetime accessor method .dt.day_name(), and another column with the numeric day of the week using the attribute .dt.weekday.

In [None]:
day_of_week = df['activity_date'].dt.day_name()
df['day_of_week'] = day_of_week
df['n_day_of_week'] = df['activity_date'].dt.weekday # 0 represents monday, 6 represents sunday

### Checking empty cells and null values

To identify null values in the dataset, we can use the isna().sum() method chain in pandas.

In [None]:
print('Total number of null values are: ')
print(df.isna().sum())

To identify duplicate entries in the dataset, we can use the duplicated().sum() function in pandas.

In [None]:
print('Total number of duplicated values are: ',df.duplicated().sum())

### Subsetting the data

Next, we can select only the columns we will use for our analysis in this case.

In [None]:
df = df[['id', 'activity_date', 'total_steps', 'total_distance',
       #'tracker_distance', 'logged_activities_distance',
       #'very_active_distance', 'moderately_active_distance',
       #'light_active_distance', 'sedentary_active_distance',
       'very_active_minutes', 'fairly_active_minutes',
       'lightly_active_minutes', 'sedentary_minutes', 'calories',
       #'sum_distance','totalminutes', 
       'day_of_week', 'n_day_of_week'
        ]].copy()

### Category creation

I will now categorize users based on their physical activity levels and device usage:

For physical activity:

* Sedentary: Average daily steps of 6000 or fewer.
* Active: Average daily steps above 6000 and below 12000.
* Very active: Average daily steps of 12000 or more.

For device usage:

* Low use: Less than 8 hours of daily use.
* Normal use: Between 8 and 16 hours of daily use.
* High use: More than 16 hours of daily use.
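The device-usage thresholds above could be applied with `pd.cut`, as in this rough sketch. It assumes that daily use can be approximated by the total recorded minutes per day divided by 60. That proxy, the toy values, and the label names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up recorded minutes per day for three hypothetical users
toy = pd.DataFrame({
    "id": ["A", "B", "C"],
    "total_minutes": [300, 720, 1440],
})

# Assumed proxy: hours of use = recorded minutes / 60
hours_used = toy["total_minutes"] / 60

# Left-closed bins: [0, 8) low, [8, 16) normal, [16, inf) high
# (a value of exactly 16 hours lands in 'high_use' with this choice)
toy["usage_level"] = pd.cut(
    hours_used,
    bins=[0, 8, 16, np.inf],
    labels=["low_use", "normal_use", "high_use"],
    right=False,
)
print(toy)
```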

For this dataset, I will start by creating the 'activity_level' category. I will consider creating the other category when analyzing datasets that store data on an hourly basis.

In [None]:
# I first group the data by the id
id_grp = df.groupby(['id'])

# Then I look for the average amount of steps, and sort the results in descending order
id_avg_step = id_grp['total_steps'].mean().sort_values(ascending=False)

# After that, I turn the results into a dataframe
id_avg_step = id_avg_step.to_frame()

# I want to create a new column which tells which category each user fits into, depending on the average number of steps
avg = id_avg_step['total_steps'] # Work on the Series so that each condition is one-dimensional
conditions = [
    (avg <= 6000),
    (avg > 6000) & (avg < 12000),
    (avg >= 12000)
] # These are the conditions

values = ['sedentary','active','very_active'] # And here are the names of the values

# I create a column with the numpy function np.select to assign each id a category
# A string default keeps the result dtype consistent with the string labels
id_avg_step['activity_level'] = np.select(conditions,values,default='unknown')

# I store the results in a variable to use it in the next step
id_activity_level = id_avg_step['activity_level']

# I use a list comprehension to create the column in our original dataset.
# With this list comprehension I retrieve the categories where the index match the id column
df['activity_level'] = [id_activity_level[c] for c in df['id']]
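As an alternative sketch of the same idea, the per-user average can be broadcast back to every row with `groupby(...).transform` and binned with `pd.cut`. The data below is made up. Note that with right-closed bins an average of exactly 12000 falls in 'active', which differs slightly from the `np.select` conditions.

```python
import numpy as np
import pandas as pd

# Made-up daily step counts for three hypothetical users
toy = pd.DataFrame({
    "id": ["A", "A", "B", "B", "C", "C"],
    "total_steps": [4000, 5000, 7000, 9000, 13000, 15000],
})

# Per-user mean, repeated on every row of that user
avg_steps = toy.groupby("id")["total_steps"].transform("mean")

# Right-closed bins: (-inf, 6000], (6000, 12000], (12000, inf)
toy["activity_level"] = pd.cut(
    avg_steps,
    bins=[-np.inf, 6000, 12000, np.inf],
    labels=["sedentary", "active", "very_active"],
)
print(toy)
```

This avoids the intermediate lookup table and the list comprehension, at the cost of a slightly different boundary convention.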

# Step 4: Analyze

We will check the number of unique IDs using the nunique() function and list those unique IDs using the unique() function.

In [None]:
print('Number of unique values in id column:',df['id'].nunique())
print()
print('List of id values:',df['id'].unique())

Next, let's determine how frequently each ID appears in the dataset using value_counts().

In [None]:
print('How many times does each id appear in the dataset?')
print(df['id'].value_counts())

There are 33 unique IDs or users in the dataset, with most appearing 31 times. Some IDs appear fewer times than that.
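To see which users have a full set of daily records, the counts can be compared against the expected number of days. This is a small sketch on made-up ids. The value of `expected_days` is set to 3 only to fit the toy data, and the variable names are illustrative.

```python
import pandas as pd

# Made-up ids: A and B have three records each, C has two
toy = pd.DataFrame({"id": ["A"] * 3 + ["B"] * 3 + ["C"] * 2})
expected_days = 3  # assumed length of the observation window in this toy example

records_per_user = toy["id"].value_counts()

# Users with a record for every expected day
complete_users = records_per_user[records_per_user == expected_days].index.tolist()

# Users with fewer records than expected, with their counts
incomplete_users = records_per_user[records_per_user < expected_days]
print(sorted(complete_users), incomplete_users.to_dict())
```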

In [None]:
print('The min date is:',min(df['activity_date']))
print('The max date is:',max(df['activity_date']))
print('The number of unique dates are:',df['activity_date'].nunique())

Next, we can begin conducting exploratory data analysis.

In [None]:
# First we use the describe() function to see some statistics
df.describe()

Here we observe the mean, minimum, and maximum values, as well as the median (50th percentile), among others.

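Notably, the maximum values show a single day with about 28 miles walked and a single day with 4900 calories burned. Because describe() reports each column's maximum separately, these do not necessarily belong to the same user or day. They could be outliers that we should investigate further.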
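One common way to flag candidate outliers is the 1.5 × IQR rule. The sketch below applies it to a made-up series of calorie values. The numbers are illustrative, and the rule is one convention among several rather than the method used in this notebook.

```python
import pandas as pd

# Made-up daily calorie values with one extreme entry
calories = pd.Series([1800, 2000, 2100, 2200, 2300, 2400, 4900], name="calories")

q1 = calories.quantile(0.25)
q3 = calories.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are flagged for review, not removed automatically
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = calories[(calories < lower) | (calories > upper)]
print(outliers.tolist())
```

Flagged rows are worth inspecting by hand, since a very active day can be genuine rather than a recording error.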

# Step 5: Share 

## Correlation Analysis Between Calories and Steps

What is the correlation between the number of steps taken and the calories burned?

In [None]:
ax =sns.scatterplot(x='total_steps', y='calories', data=df,hue='activity_level')

#handles, labels = ax.get_legend_handles_labels()
#plt.legend(handles, day_of_week, fontsize=7)
plt.title('Correlation Calories vs. Steps')

plt.show()

The scatterplot shows a moderately positive correlation: the more steps a user takes, the more calories they burn.
The dots are colored by the activity_level category, so we can see which group each observation belongs to.
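The visual impression can be backed by a number. As a minimal sketch on made-up values, the Pearson correlation coefficient between two columns can be computed with `Series.corr`.

```python
import pandas as pd

# Made-up values that rise together, for illustration only
toy = pd.DataFrame({
    "total_steps": [1000, 4000, 7000, 10000, 13000],
    "calories": [1700, 1900, 2300, 2400, 2900],
})

# Pearson correlation (the default method) between steps and calories
r = toy["total_steps"].corr(toy["calories"])
print(round(r, 2))
```

A coefficient near 1 indicates a strong positive linear relationship, while a value near 0 indicates little linear relationship.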

## Average Daily Step Count

What is the average daily step count?

In [None]:
day_of_week = ['Monday','Tuesday','Wednesday','Thursday', 'Friday','Saturday','Sunday']
fig, ax =plt.subplots(1,1,figsize=(9,6))

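day_grp = df.groupby('day_of_week')
# groupby sorts the day names alphabetically, so reindex to calendar order to keep bars and tick labels aligned
avg_daily_steps = day_grp['total_steps'].mean().reindex(day_of_week)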
avg_steps = df['total_steps'].mean()

plt.bar(avg_daily_steps.index,avg_daily_steps)

ax.set_xticks(range(len(day_of_week)))
ax.set_xticklabels(day_of_week)

ax.axhline(y=avg_daily_steps.mean(),color='red', label='Average daily steps')
ax.set_ylabel('Number of steps')
ax.set_xlabel('Day of the week')
ax.set_title('Avg Number of steps per day')

plt.legend()
plt.show()

The data reveals that Monday, Tuesday, and Saturday stand out as days with higher than average step counts. Wednesday, Thursday, and Friday fall below the average, at similar levels to one another. Sunday appears to be the least active day of the week.

This information suggests that users tend to engage in more physical activity during the early days of the week and on Saturdays, providing insights into their likely activities during these periods.
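Because `groupby` sorts text keys alphabetically, one way to keep per-day summaries in calendar order is to store the day names as an ordered categorical. The following is a sketch on made-up data.

```python
import pandas as pd

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Made-up records, deliberately out of calendar order
toy = pd.DataFrame({
    "day_of_week": ["Sunday", "Monday", "Friday", "Monday"],
    "total_steps": [5000, 8000, 7000, 9000],
})

# An ordered categorical makes sorting and grouping follow week_order
toy["day_of_week"] = pd.Categorical(
    toy["day_of_week"], categories=week_order, ordered=True
)

# observed=True keeps only the days that actually occur in the data
avg_by_day = toy.groupby("day_of_week", observed=True)["total_steps"].mean()
print(avg_by_day)
```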

## Percentage of activity in minutes

What percentage of the time are individuals active?

In [None]:
very_active_mins = df['very_active_minutes'].sum() 
fairly_active_mins = df['fairly_active_minutes'].sum()
lightly_active_mins = df['lightly_active_minutes'].sum()
sedentary_mins = df['sedentary_minutes'].sum()

slices = [very_active_mins,fairly_active_mins,lightly_active_mins,sedentary_mins]
labels = ['very active minutes','fairly active minutes','lightly active minutes','sedentary minutes']
explode = [0,0,0,0.1]
plt.pie(slices, labels = labels, explode = explode, autopct='%1.1f%%',textprops=dict(size=9), shadow=True)

plt.title('Percentage of activity in minutes',fontsize=18)
plt.tight_layout()

plt.show()

This pie chart illustrates that users are predominantly in a sedentary state, spend about a sixth of their time engaged in light activity, and only 2% of their time in active exercise.
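The shares shown in a pie chart like this can also be computed directly, which makes them easier to quote. Here is a minimal sketch on made-up minutes (the values are illustrative only).

```python
import pandas as pd

# Made-up minutes per activity level for two hypothetical days
toy = pd.DataFrame({
    "very_active_minutes": [20, 0],
    "fairly_active_minutes": [10, 10],
    "lightly_active_minutes": [200, 160],
    "sedentary_minutes": [800, 800],
})

# Total minutes per category, then each category's share of all minutes
totals = toy.sum()
share_pct = (totals / totals.sum() * 100).round(1)
print(share_pct)
```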

## Correlation Between Activity Level Minutes and Calories

In [None]:
n_day_of_week = [0,1,2,3,4,5,6]

fig, axes = plt.subplots(nrows=2, ncols=2,figsize=(11,15),dpi=70)

sns.scatterplot(data=df,x='calories',y='sedentary_minutes',hue='activity_level',ax=axes[0,0],legend=False)

sns.scatterplot(data=df,x='calories',y='lightly_active_minutes',hue='activity_level',ax=axes[0,1],legend=False)

sns.scatterplot(data=df,x='calories',y='fairly_active_minutes',hue='activity_level',ax=axes[1,0],legend=False)

sns.scatterplot(data=df,x='calories',y='very_active_minutes',hue='activity_level',ax=axes[1,1])


plt.legend(title='Activity level',title_fontsize=20,bbox_to_anchor=(1.8,2.2),fontsize=18,frameon=True,scatterpoints=1)
fig.suptitle('Correlation Between activity level minutes and calories',x=0.5,y=0.92,fontsize=24)
plt.show()
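To compare the four panels numerically, each minutes column can be correlated with calories in one call using `DataFrame.corrwith`. The sketch below uses made-up values and only two of the minutes columns to stay short.

```python
import pandas as pd

# Made-up values: calories rise with very active minutes and fall with sedentary minutes
toy = pd.DataFrame({
    "sedentary_minutes": [1100, 1000, 1050, 900, 800],
    "very_active_minutes": [0, 5, 10, 30, 60],
    "calories": [1800, 2000, 2200, 2600, 3000],
})

minute_cols = ["sedentary_minutes", "very_active_minutes"]

# Pearson correlation of each minutes column with calories, strongest first
corr_with_calories = (
    toy[minute_cols].corrwith(toy["calories"]).sort_values(ascending=False)
)
print(corr_with_calories)
```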

# Step 6: Act

Upon analyzing the FitBit Fitness Tracker Data, we have identified insights that could inform Bellabeat's marketing strategy.

## A multipurpose device

Bellabeat can inform users that their products are not limited to sports or exercise-related activities. The data indicates that many users wear the tracking device more on weekends than weekdays, suggesting they may associate the product only with sports or leisurely activities like walking in the park on Sundays. Bellabeat can emphasize that their products are designed to accompany users throughout their daily routines, including work, and help them track information to enhance overall fitness and health. This approach aims to encourage women from various demographics and backgrounds to use Bellabeat products, which are tailored for all women interested in holistic health.

## Rewards and reminders
