# Table of Contents

* [Business Task](#h1)
    * [Company Information](#h2)
    * [Task](#h3)
    * [Business Objectives](#h4)
* [Data Source](#h5)
    * [Data Storage & Integrity](#h6)
    * [Limitations of Data](#h7)
* [Cleaning & Manipulation of Data](#h8)
    * [Setting Up Environment](#h9)
    * [Creating Data Frames](#h10)
    * [Data Manipulation](#h11)
    * [Merging Data Frames](#h12)
* [Summary of Analysis & Visualizations](#h13)
    * [Activity Levels](#h14)
    * [Sleep Insights](#h15)
    * [Step Counts](#h16)
* [Key Takeaways](#h17)
* [Recommendations](#h18)

# Business Task <a class="anchor"  id="h1"></a>

### Company Information <a class="anchor"  id="h2"></a>
Bellabeat is a high-tech company that manufactures health-focused smart products that empower women with knowledge about their own health and habits. Founded in 2013, Bellabeat is a successful small company with the potential to become a larger player in the global smart device market. Urska Srsen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

### Task <a class="anchor"  id="h3"></a>
Analyze smart device usage data (FitBit) to gain insight into how consumers use non-Bellabeat smart devices.

### Business Objectives <a class="anchor"  id="h4"></a>

* What are some trends in smart device usage?
* How could these trends apply to Bellabeat customers?
* How could these trends help influence Bellabeat marketing strategy?

# Data Source <a class="anchor"  id="h5"></a>
The dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
The data was downloaded from Kaggle [link](https://www.kaggle.com/datasets/arashnic/fitbit) and is in the Public Domain.

### Data Storage & Integrity <a class="anchor"  id="h6"></a>
The data was downloaded and stored as CSV files. These files were primarily reviewed in Excel.

In viewing the files, I found that there are 33 unique Ids in the Activity, Intensities, Steps, and Calories data frames (more than the 30 users originally stated). There are 24 users with data in the Sleep data frame, 9 of whom have fewer than 10 sleep entries. Any insights generated from the Sleep data frame may not be fully accurate because of this. The Weight data only has 8 users, 2 of whom have multiple entries. The other six have 5 or fewer entries. Therefore we cannot draw accurate insights from the weight information.
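
The counts above came from inspecting the files by hand. A minimal pandas sketch of the same kind of check, run on a small made-up table rather than the real files (the `toy_sleep` frame and the `min_entries` threshold are illustrative assumptions):

```python
import pandas as pd

# Made-up stand-in for a tracker file: one row per user per logged day
toy_sleep = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 3],
    "TotalMinutesAsleep": [420, 390, 450, 300, 510, 480],
})

# Number of distinct users in the file
n_users = toy_sleep["Id"].nunique()

# Number of entries logged by each user
entries_per_user = toy_sleep.groupby("Id").size()

# Users with fewer than `min_entries` records (hypothetical threshold)
min_entries = 3
sparse_users = entries_per_user[entries_per_user < min_entries].index.tolist()

print(n_users)
print(sparse_users)
```

Running the same steps on each real data frame would make the user counts reproducible inside the notebook.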

### Limitations of Data <a class="anchor"  id="h7"></a>

* There are only 30 respondents, which is a very small subset of the number of people who use smart trackers.
* The data is 7 years old. This may not be an accurate sampling of user activity anymore.
* There are no demographics on the users. Therefore the data could be biased and may not represent the full population of users.
* Not all of the files have the same number of user Ids. The Weight data only has 8 users and therefore cannot be used for insights.



# Cleaning & Manipulation of Data <a class="anchor"  id="h8"></a>
### Setting up environment <a class="anchor"  id="h9"></a>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #data visualizations
import matplotlib.pyplot as plt #data visualizations
from scipy.stats import linregress

# Input data files are available in the read-only "../input/" directory
# Running this will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Creating data frames <a class="anchor"  id="h10"></a>

In the initial viewing of the files, the daily Activity file is made up of the information stored in the daily Steps, daily Intensities, and daily Calories.

To answer the business questions, I focused on the following files and made data frames from them:

* daily Activity
* daily Sleep
* hourly Steps

In [None]:
activity = pd.read_csv("/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
sleep = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
steps_hr = pd.read_csv('/kaggle/input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')

### Data Manipulation <a class="anchor"  id="h11"></a>

A summary of the changes made to the data:

* Dropped duplicates in data.
* Added a Day of the Week column to differentiate daily habits.
* Added a Weekend column to differentiate between weekday & weekend habits.
* Added an Active Minutes column to see how long a person was active during the day (not split between levels of activity).
* Added a Recommended Activity column to see how long a person did moderate to vigorous exercise each day.
* Changed columns relating to dates or times to datetime format and extracted date/hour for analysis.
* Added a NotAsleepMinutes column that gives the time a person was in bed but not asleep to look at bedtime habits.
* Added a Sleep Hour column to find out what hour users went to bed.
* Added an activity level column based upon the categories outlined by 10000steps.org [link](https://www.10000steps.org.au/articles/healthy-lifestyles/counting-steps/#:~:text=Sedentary%20is%20less%20than%205%2C000,than%2010%2C000%20steps%20per%20day).
* Added a sleep level column based upon levels outlined by [EPIC](https://epiceducationconsulting.com/sleep-recommendations).

In [None]:
#Drop duplicates
activity.drop_duplicates(inplace=True)
sleep.drop_duplicates(inplace=True)

#Add a day of week column
activity['DayofWeek'] = pd.to_datetime(activity['ActivityDate']).dt.day_name()

#Add a weekend column
activity['Weekend'] = np.where(activity['DayofWeek'].isin(['Saturday', 'Sunday']), True, False)

#Add total Active Minutes Column
activity['ActiveMinutes']=activity['VeryActiveMinutes'] + activity['FairlyActiveMinutes'] + activity['LightlyActiveMinutes']
activity['RecommendedActivity']=activity['VeryActiveMinutes'] + activity['FairlyActiveMinutes']

#Change to date/time formats & extract date/hour
activity['ActivityDate'] = pd.to_datetime(activity['ActivityDate']).dt.date

#Add ActivityDate column so it is labeled the same as in the activity data frame
sleep['ActivityDate'] = pd.to_datetime(sleep['SleepDay']).dt.date

#Extract hour
steps_hr['Hour'] = pd.to_datetime(steps_hr['ActivityHour']).dt.hour

#Add column for the number of minutes spent in bed not asleep
sleep['NotAsleepMinutes'] = sleep['TotalTimeInBed']- sleep['TotalMinutesAsleep']

#Add an active level column (left-closed bins so 0-step days count as Sedentary and 10,000+ as High)
activity['ActivityLevel'] = pd.cut(activity['TotalSteps'], bins=[0, 5000, 7500, 10000, np.inf], right=False, labels=['Sedentary', 'Low', 'Moderate', 'High'])

#Add a sleep level column
sleep['SleepLevel'] = pd.cut(sleep['TotalMinutesAsleep'], bins=[0, 360, 470, 549, 1000], labels=['Bad', 'Okay', 'Good', 'Over'])
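
How `pd.cut` treats a value that sits exactly on a bin edge matters for days with zero steps and for fractional averages. A minimal sketch on made-up step counts, comparing right-closed bins (the `pd.cut` default) with left-closed bins; the `toy_steps` values are illustrative and not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up step counts chosen to sit on or near the category boundaries
toy_steps = pd.Series([0, 4999, 5000, 7499.5, 10000])
labels = ['Sedentary', 'Low', 'Moderate', 'High']

# Default intervals are open on the left, e.g. (0, 4999], so 0 gets no label
right_closed = pd.cut(toy_steps, bins=[0, 4999, 7499, 9999, 100000], labels=labels)

# right=False makes intervals closed on the left, e.g. [0, 5000)
left_closed = pd.cut(toy_steps, bins=[0, 5000, 7500, 10000, np.inf],
                     right=False, labels=labels)

print(right_closed.tolist())
print(left_closed.tolist())
```

With right-closed bins the zero-step value is left as NaN and 7499.5 is labeled Moderate; with left-closed bins every value falls into the category its step count implies.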

### Merging Dataframes <a class="anchor"  id="h12"></a>

I merged the daily activity and daily sleep dataframes using a left join because I didn't want to lose the records of users who did not track sleep data.

In [None]:
#Merge Activity & Sleep data into a new dataframe
combo_data = activity.merge(sleep, how='left', on=['Id', 'ActivityDate'])
combo_data.head()
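
One way to see how many activity rows have no matching sleep record is the `indicator` option of `merge`. A minimal sketch on made-up frames (the `toy_*` names and values are illustrative assumptions, not the notebook's data):

```python
import pandas as pd

# Made-up frames: user 2 has activity rows but no sleep records
toy_activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": ["2016-04-12", "2016-04-13", "2016-04-12"],
    "TotalSteps": [8000, 12000, 3000],
})
toy_sleep = pd.DataFrame({
    "Id": [1],
    "ActivityDate": ["2016-04-12"],
    "TotalMinutesAsleep": [420],
})

# indicator=True adds a _merge column marking where each row came from
toy_combo = toy_activity.merge(toy_sleep, how="left",
                               on=["Id", "ActivityDate"], indicator=True)

# Activity rows that were kept without a matching sleep record
unmatched = int((toy_combo["_merge"] == "left_only").sum())

print(len(toy_combo))
print(unmatched)
```

The left join keeps every activity row, and the unmatched rows carry NaN in the sleep columns.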

In [None]:
print('Activity Count: ' + str(activity.Id.nunique()))
print("Sleep Count: " + str(sleep.Id.nunique()))
print("Combo_Data Count: " + str(combo_data.Id.nunique()))
print("Hourly Steps Count: " + str(steps_hr.Id.nunique()))

In [None]:
combo_data['Id'].value_counts()

Most users tracked with their Fitbit every day; one user was active for only 4 days.

# Summary of Analysis & Visualizations <a class="anchor"  id="h13"></a>

From the Activity & Sleep combined data frame, I'm going to focus on Total Steps, Total Distance, Sedentary Minutes, Calories, Active Minutes (sum of all active level minutes), Total Minutes Asleep, and Not Asleep Minutes (but in bed).

Let's look at the summary statistics for these columns.


In [None]:
#Select specific columns of interest
selected_columns = ['TotalSteps', 'TotalDistance', 'SedentaryMinutes', 'LightlyActiveMinutes', 'FairlyActiveMinutes', 'VeryActiveMinutes', 'Calories', 'RecommendedActivity', 'ActiveMinutes', 'TotalMinutesAsleep', 'NotAsleepMinutes']

#Create a smaller dataframe with select categories
combo_data_short = combo_data[selected_columns]

#Looking at summary statistics
combo_data_short.describe().round(2)

**Items that stick out:**

* As the number of steps increases, so does the number of calories burned. This is not a surprise, but it helps confirm that the data behaves as expected.
* The mean sedentary time is 991.21 minutes, or 16.5 hours! This needs to be reduced.
* The mean amount of time asleep is 419.17 minutes (6.99 hours), which is just under the recommended 7-9 hours.
* Only the top 25% of users get the recommended amount of activity per day, an average of 42 minutes of moderate to vigorous exercise a day [according to the WHO](https://www.who.int/news-room/fact-sheets/detail/physical-activity#:~:text=should%20do%20at%20least%20an,least%203%20days%20a%20week.).
* 25% of users do not do any moderate or vigorous activity in a day, although these users do light activity. This share also needs to be reduced.
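
The minute-to-hour conversions in this list can be checked with a few lines of arithmetic. The two inputs are the means quoted above:

```python
# Convert the mean values reported by describe() from minutes to hours
mean_sedentary_min = 991.21
mean_asleep_min = 419.17

sedentary_hours = round(mean_sedentary_min / 60, 2)
asleep_hours = round(mean_asleep_min / 60, 2)

print(sedentary_hours)  # 16.52
print(asleep_hours)     # 6.99
```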





### Activity Levels <a class="anchor"  id="h14"></a>
Next, I'm making new dataframes from the selected columns and finding the averages of each of these columns grouped by Id.

I'll also add an average activity level column to the average stats by user Id, based upon the categories outlined by [10000steps.org](https://www.10000steps.org.au/articles/healthy-lifestyles/counting-steps/#:~:text=Sedentary%20is%20less%20than%205%2C000,than%2010%2C000%20steps%20per%20day).

* Sedentary is less than 5,000 steps per day 
* Low active is 5,000 to 7,499 steps per day
* Moderate active is 7,500 to 9,999 steps per day
* Very active is more than 10,000 steps per day

In [None]:
#Find mean of selected columns per user
avg_stats_id = combo_data.groupby('Id')[selected_columns].mean()

#Find mean of selected columns per day
avg_stats_day = combo_data.groupby('DayofWeek')[selected_columns].mean().reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

#Find mean of selected columns per Weekday/Weekend
avg_stats_weekend = combo_data.groupby('Weekend')[selected_columns].mean()

#Add Activity level based upon 10,000 steps (left-closed bins so fractional averages such as 4999.5 stay Sedentary)
avg_stats_id['ActivityLevel_avg'] = pd.cut(avg_stats_id['TotalSteps'], bins=[0, 5000, 7500, 10000, np.inf], right=False, labels=['Sedentary', 'Low', 'Moderate', 'High'])

#Add a sleep level column
avg_stats_id['SleepLevel'] = pd.cut(avg_stats_id['TotalMinutesAsleep'], bins=[0, 360, 470, 549, 1000], labels=['Bad', 'Okay', 'Good', 'Over'])
avg_stats_id

In [None]:
sns.countplot(data=avg_stats_id, x='ActivityLevel_avg')
plt.title('Number of Users at Each Activity Level')
plt.xlabel('Activity Level')

The numbers of users at each activity level are fairly close to each other, which **shows that users of all kinds use smart devices**.

In [None]:
sns.boxenplot(data=avg_stats_id, x='ActivityLevel_avg', y = 'RecommendedActivity')
plt.hlines(y = 42, xmin = -0.75, xmax=3.5, colors = 'k', linestyle = 'dashed')
plt.xlabel('Activity Level')
plt.ylabel('Moderate and Vigorous Activity Minutes')
plt.title('Amount of Moderate and Vigorous Activity Minutes \n per Activity Level per Day')
plt.text(0.25, 45, 'Daily Recommended Amount', ha='center', va='center', fontsize =11)

This boxen plot shows that only the very active users and the top 60% of moderately active users reach the daily recommended amount of moderate and vigorous activity.
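
The share of users at or above the 42-minute reference line can also be computed directly rather than read off the plot. A minimal sketch on made-up per-user averages (the `toy_avg` frame and the `MeetsTarget` column are illustrative names, not part of the notebook):

```python
import pandas as pd

# Made-up per-user averages standing in for avg_stats_id
toy_avg = pd.DataFrame({
    "ActivityLevel_avg": ["Moderate", "Moderate", "High", "High", "Low"],
    "RecommendedActivity": [50.0, 30.0, 80.0, 45.0, 10.0],
})

# True where a user's average reaches the 42-minute reference line
toy_avg["MeetsTarget"] = toy_avg["RecommendedActivity"] >= 42

# The mean of a boolean column is the fraction of True values
share_meeting = toy_avg.groupby("ActivityLevel_avg")["MeetsTarget"].mean()

print(share_meeting)
```

Applied to the real per-user averages, this would give an exact percentage for each activity level.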

In [None]:
sns.boxenplot(data=avg_stats_id, x='ActivityLevel_avg', y = 'LightlyActiveMinutes')
plt.xlabel('Activity Level')
plt.ylabel('Light Active Minutes')
plt.title('Amount of Light Activity Minutes per Activity Level per Day')

Looking at the light active minutes users get per day, all but the sedentary users are getting at least 150 minutes (2.5 hours) per day.

In [None]:
sns.boxplot(data=avg_stats_id, x='ActivityLevel_avg', y='SedentaryMinutes')
plt.xlabel('Activity Level')
plt.ylabel('Sedentary Minutes')
plt.title('Sedentary Minutes by Activity Level')

As expected, the sedentary users had the most sedentary minutes (median of 20 hours!).  The low active users were very spread in their sedentary minutes with a median not much lower than the sedentary users (19 hours).  Lowering the amount of sedentary minutes should be an area of focus for Bellabeat.

### Sleep Insights <a class="anchor"  id="h15"></a>

In [None]:
# Count the number of Id in each ActivityLevel_avg where TotalMinutesAsleep is not NaN
count_not_nan_sleep = avg_stats_id[avg_stats_id['TotalMinutesAsleep'].notnull()]['ActivityLevel_avg'].value_counts()
print('Number of Users with Sleep Data in each Activity Level')

count_not_nan_sleep

There are still users at each activity level with sleep data, but they are not spread as evenly, with the fewest in the Low and Very Active categories.

In [None]:
sns.boxplot(data=combo_data, hue='SleepLevel', y='TotalMinutesAsleep', x='ActivityLevel')
plt.xlabel('Activity Level')
plt.ylabel('Minutes Asleep')
plt.title('Minutes Asleep by Activity Level and Sleep Level')

Sedentary users have the widest range in bad sleep and oversleep, both of which can be unhealthy. Users at the other activity levels all seem to have similar ranges, with moderately active users having the highest values in the bad sleep category.

In [None]:
sns.boxplot(data=avg_stats_id, x='ActivityLevel_avg', y='TotalMinutesAsleep')
plt.xlabel('Activity Level')
plt.ylabel('Minutes Asleep')
plt.title('Minutes Asleep by Activity Level')

This box plot shows how long users at different activity levels slept per night on average. Very Active users got the least sleep, while users who averaged a Low or Moderate amount of daily activity got the most. However, the Low and Very Active categories had the fewest users with sleep data, so this may not be reflective of ac

In [None]:
# Filter the data points from 0 to 100
filtered_data = avg_stats_id[(avg_stats_id['NotAsleepMinutes'] >= 0) & (avg_stats_id['NotAsleepMinutes'] <= 100)]

# Create the boxplot with the filtered data
sns.boxplot(data=filtered_data, x='ActivityLevel_avg', y='NotAsleepMinutes')

plt.xlabel('Activity Level')
plt.ylabel('Minutes Not Asleep in Bed')
plt.title('Minutes Not Asleep in Bed by Activity Level')

This box plot shows that the Very Active users spend the least time awake in bed and the Moderate Active users spend the most. More data is needed to make any recommendations. Note: the highest outliers were removed to focus on the main data.

In [None]:
sns.regplot(data=combo_data, x='TotalMinutesAsleep', y='SedentaryMinutes',  
            scatter_kws={'s': 18, 'alpha': 0.7}, line_kws={'color': 'red', 'alpha': 0.5})
plt.title('Total Minutes Asleep vs Sedentary Minutes')
plt.ylabel('Sedentary Minutes')
plt.xlabel('Total Minutes Asleep')
plt.show()

Fewer sedentary minutes are correlated with more time asleep. More analysis is required to determine whether this relationship is causal or only a correlation.
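
The strength of this relationship can be summarized with a correlation coefficient. A minimal sketch using pandas' `Series.corr` (Pearson by default) on made-up daily records; the `toy_days` values are illustrative and the result says nothing about the real data:

```python
import numpy as np
import pandas as pd

# Made-up daily records; NaN mimics a day without sleep tracking
toy_days = pd.DataFrame({
    "TotalMinutesAsleep": [300, 360, 420, 480, np.nan],
    "SedentaryMinutes": [900, 850, 760, 700, 1200],
})

# Keep only days where both measurements exist
paired = toy_days.dropna(subset=["TotalMinutesAsleep", "SedentaryMinutes"])

# Pearson r: a negative value means more sleep goes with fewer sedentary minutes
r = paired["TotalMinutesAsleep"].corr(paired["SedentaryMinutes"])

print(round(r, 2))
```

A coefficient close to 0 would indicate a weak linear relationship even if the fitted line slopes downward.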

### Step Counts <a class="anchor"  id="h16"></a>

In [None]:
#Histogram of daily steps with the median and the 5,000-step minimum marked
combo_data['TotalSteps'].plot.hist()
plt.xlabel('Number of Steps in a Day')
plt.title('Histogram of Steps per Day')
plt.vlines(x = np.median(combo_data['TotalSteps']), ymin=0, ymax=250,
           colors = 'purple', label = 'median', linestyle = 'dashed')
plt.vlines(x=5000, ymin=0, ymax=250, colors = 'red', label = 'min recommended')
plt.legend()
plt.show()

This histogram shows that most users get more than the minimum recommended number of steps per day. It would be good to increase the number of steps for the users below the minimum recommended.

In [None]:
sns.regplot(data=combo_data, x='TotalSteps', y='Calories',
            scatter_kws={'s': 18, 'alpha': 0.7}, line_kws={'color': 'red', 'alpha': 0.5})
plt.xlabel('Total Steps')
plt.title('Steps vs. Calories per Day')

We can see that users who take more steps burn more calories.

In [None]:
sns.boxplot(data = combo_data, y='SedentaryMinutes', x='DayofWeek', 
            order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title('Sedentary Minutes per Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Sedentary Minutes')
plt.show()

Looking at sedentary minutes by day of the week, we can see that most days have quite similar distributions. The median on Thursday is lower than on the other days of the week.

In [None]:
sns.boxplot(data = combo_data, y='ActiveMinutes', x='DayofWeek', 
            order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title('Active Minutes per Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Active Minutes')
plt.show()

Looking at the active minutes per day of the week, we can again see that most days are quite similar. Compared with the other days of the week, the values are slightly higher on Saturday and lower on Sunday.

This could be due to the fact that most people who exercise do so most days of the week.

In [None]:
# Group the hourly steps dataframe by Hour and calculate the average of StepTotal
grouped = steps_hr.groupby('Hour')['StepTotal'].mean()

# Plot the average StepTotal vs Hour
plt.plot(grouped.index, grouped.values)
plt.xlabel('Hour')
plt.ylabel('Average Step Total')
plt.title('Average Step Total vs Hour')
plt.show()

Looking at the average number of steps per hour, we can see that the most steps are taken between 5pm and 7pm. This could be because people are more likely to exercise after work and before bedtime.
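
The busiest hours can also be pulled out of the grouped series instead of being read from the line plot. A minimal sketch on made-up hourly records (the `toy_steps_hr` frame is an illustrative assumption):

```python
import pandas as pd

# Made-up hourly records for two users
toy_steps_hr = pd.DataFrame({
    "Hour": [8, 12, 18, 8, 12, 18],
    "StepTotal": [200, 500, 900, 300, 400, 700],
})

# Average steps per hour of day, grouped the same way as for the plot
toy_grouped = toy_steps_hr.groupby("Hour")["StepTotal"].mean()

# Hour with the highest average, and all hours ranked from busiest down
peak_hour = toy_grouped.idxmax()
ranked_hours = toy_grouped.sort_values(ascending=False).index.tolist()

print(peak_hour)
print(ranked_hours)
```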

# Key Takeaways <a class="anchor"  id="h17"></a>

* Users are less likely to wear their device at night, which means they will not get tracking information about their sleep.
* The mean sedentary time is 991.21 minutes, or 16.5 hours!
* The mean amount of time asleep is 419.17 minutes (6.99 hours), which is just under the recommended 7-9 hours.
* Less sedentary time is associated with more minutes of sleep. Higher levels of activity are associated with less time awake in bed.
* Only 25-40% of users get the recommended amount of activity per day.
* 25% of users do not do any moderate or vigorous activity in a day, although these users do light activity.
* Users are more likely to increase their steps after work hours.

# Recommendations <a class="anchor"  id="h18"></a>

* One reason users were less likely to get sleep tracking from their device could be that those devices need to be charged frequently. Bellabeat should devise an advertising campaign focusing on the fact that the Ivy stays charged for up to 8 days, which would allow users to get more insight into their sleeping habits.
* Since most users do not get the recommended amount of sleep, add sleep notifications to the Bellabeat app to encourage users to stick to a bedtime routine.
* As most users are more likely to increase their step count after work, add notifications to the Bellabeat app that encourage users to walk or work out after work. Additionally, creating a network where users could meet up with other users to walk could be a way to encourage more activity.
* As most users do not meet the recommended amount of moderate and vigorous activity, the Bellabeat app could push posts to users about different activities that would increase their heart rates.