# **Introduction**

***

<p style="font-size:18px;">  This case study will focus on data from Bellabeat, a high-tech manufacturer of health-focused
products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the
global smart device market. 
    
<p style="font-size:18px;">  Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart
device fitness data could help unlock new growth opportunities for the company. As such, this case study will focus on one of
Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The
insights from this analysis will then help guide marketing strategy for the company. The analysis will be presented to the Bellabeat executive team along with high-level recommendations for Bellabeat’s marketing strategy.

##### **Importing libraries**

In [1]:
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px

print("Libraries imported")

Libraries imported


##### **Importing data**

In [2]:
df_activity = pd.read_csv("/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
df_steps = pd.read_csv("/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
df_sleep = pd.read_csv("/kaggle/input/fitbit/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")

print("Data imported")

Data imported


# **Data Cleaning** 
    
***  

<p style="font-size:18px;">  Now that I have imported the datasets which I felt would provide the best insights for this task, it is important to determine whether the data is clean and how many values are missing or incorrect. The following code checks each dataset and reports the number of duplicated rows and the number of null values in each one.

In [3]:
# Adding all datasets to a dictionary
datasets = {"df_activity": df_activity, "df_steps": df_steps, "df_sleep": df_sleep}

# Creating a function to check all datasets in the dictionary
def clean_check():
    for name, df in datasets.items():
        print("Number of null values in the dataset", name, df.isna().sum().sum())
        print("Number of duplicated values in the dataset", name, df.duplicated().sum())
        print("---------------------------------------------------------------")
        
clean_check()

Number of null values in the dataset df_activity 0
Number of duplicated values in the dataset df_activity 0
---------------------------------------------------------------
Number of null values in the dataset df_steps 0
Number of duplicated values in the dataset df_steps 0
---------------------------------------------------------------
Number of null values in the dataset df_sleep 0
Number of duplicated values in the dataset df_sleep 3
---------------------------------------------------------------
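
<p style="font-size:18px;"> The same check can also be returned as a small table, which is easier to compare across datasets than printed lines. This is only a minimal sketch: the helper name <code>summarise_quality</code> and the sample frames are made up for illustration and are not part of the notebook.

```python
import pandas as pd

def summarise_quality(frames):
    """Return one row per dataset with its null and duplicated-row counts."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataset": name,
            "null_values": int(df.isna().sum().sum()),
            "duplicated_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")

# Hypothetical sample data, for illustration only
sample_frames = {
    "clean": pd.DataFrame({"Id": [1, 2], "Steps": [100, 200]}),
    "dirty": pd.DataFrame({"Id": [1, 1, 2], "Steps": [100, 100, None]}),
}
quality = summarise_quality(sample_frames)
print(quality)
```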


<p style="font-size:18px;"> From this I can see there are 3 duplicated rows in the sleep dataset. I decided to remove them, as only a small quantity of rows were duplicated and removing these rows would have minimal impact on my analysis.

In [4]:
df_sleep.drop_duplicates(inplace=True)
print("Duplicate values removed")

Duplicate values removed


<p style="font-size:18px;"> To ensure that the imported datasets had no duplicated rows or missing values, I ran the check again and found 0 duplicated rows and 0 missing values. As such, I could start exploring and analysing the data.

In [5]:
clean_check()

Number of null values in the dataset df_activity 0
Number of duplicated values in the dataset df_activity 0
---------------------------------------------------------------
Number of null values in the dataset df_steps 0
Number of duplicated values in the dataset df_steps 0
---------------------------------------------------------------
Number of null values in the dataset df_sleep 0
Number of duplicated values in the dataset df_sleep 0
---------------------------------------------------------------


# **Data Exploration & Analysis** 
    
***
    

<p style="font-size:18px;"> To begin, I wanted to understand how many users the data reflects, so for each of the imported datasets I wrote the following code to count the distinct users it contains. Overall there is a small number of users in each dataset: the activity and steps datasets contain data from 33 users and the sleep dataset contains data from 24 users.

In [6]:
print(df_activity["Id"].nunique(), "Users data is contained within the daily activity dataset")
print(df_steps["Id"].nunique(), "Users data is contained within the hourly steps dataset")
print(df_sleep["Id"].nunique(), "Users data is contained within the sleep day dataset")

33 Users data is contained within the daily activity dataset
33 Users data is contained within the hourly steps dataset
24 Users data is contained within the sleep day dataset


<p style="font-size:18px;"> To make the data more manageable and allow for a more detailed analysis, I decided it would be beneficial to join the sleep dataset onto the main activity dataset, as this would give me greater control over the data. To begin, I ensured that all the date columns were in datetime format, as I would be using the Id and date columns to join the datasets. Once the datasets had been merged, I dropped the duplicated and unnecessary column and then printed out the head of the data to ensure the datasets had joined correctly.

In [7]:
#Changing the date from object to datetime datatype.
df_activity["ActivityDate"]=pd.to_datetime(df_activity["ActivityDate"])
df_sleep["SleepDay"] = pd.to_datetime(df_sleep["SleepDay"], format='%m/%d/%Y %I:%M:%S %p')

# Merge data frames
df_activity = pd.merge(df_activity, df_sleep, left_on=['Id', 'ActivityDate'], right_on=['Id', 'SleepDay'], how='left')

# Drop the duplicated and nonessential column
df_activity.drop(['SleepDay'], axis=1, inplace=True)

# Display the top 5 rows from the merged dataframe
df_activity.head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,1.0,327.0,346.0
1,1503960366,2016-04-13,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797,2.0,384.0,407.0
2,1503960366,2016-04-14,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776,,,
3,1503960366,2016-04-15,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745,1.0,412.0,442.0
4,1503960366,2016-04-16,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863,2.0,340.0,367.0
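
<p style="font-size:18px;"> One way to double-check a join like this is to ask pandas to validate the key relationship and to report where each row came from. The sketch below uses small made-up frames rather than the real data. It assumes each Id and date pair appears at most once on both sides, which I have not confirmed for the full datasets.

```python
import pandas as pd

# Hypothetical sample frames that mimic the Id/date keys used above
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": pd.to_datetime(["2016-04-12", "2016-04-13", "2016-04-12"]),
    "TotalSteps": [13162, 10735, 8000],
})
sleep = pd.DataFrame({
    "Id": [1, 2],
    "SleepDay": pd.to_datetime(["2016-04-12", "2016-04-12"]),
    "TotalMinutesAsleep": [327, 400],
})

# validate raises a MergeError if a key pair is repeated on either side;
# indicator adds a "_merge" column saying where each row came from
merged = pd.merge(
    activity, sleep,
    left_on=["Id", "ActivityDate"], right_on=["Id", "SleepDay"],
    how="left", validate="one_to_one", indicator=True,
)
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

<p style="font-size:18px;"> Rows marked "left_only" are activity days with no matching sleep record, which is where the null values in the sleep columns come from.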


<p style="font-size:18px;"> Because the sleep dataset does not contain sleep data for all 33 users, there are now several null values in the sleep columns of the df_activity dataset. To correct this, the following code fills all the null values with 0.

In [8]:
# Dictionary specifying columns and their corresponding values to fill the null values
fillna_dict = {"TotalSleepRecords": 0,"TotalMinutesAsleep": 0,"TotalTimeInBed": 0}

# Fill null values in specified columns with corresponding values from the dictionary "fillna_dict"
df_activity.fillna(fillna_dict, inplace=True)
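
<p style="font-size:18px;"> One consequence of filling with 0 is that a night with no sleep record looks the same as a night with zero minutes in bed, so any later average of these columns needs to filter out the zeros. The made-up values below sketch how much the two choices can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical nights: two tracked, two with no sleep record
time_in_bed = pd.Series([420.0, 480.0, np.nan, np.nan])

mean_ignoring_missing = time_in_bed.mean()           # NaN rows are skipped
mean_with_zero_fill = time_in_bed.fillna(0).mean()   # untracked nights count as 0

print(mean_ignoring_missing, mean_with_zero_fill)
```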

<p style="font-size:18px;"> Now that the data was joined into one dataset I started by getting info on the dataset as this allowed me to understand the size of the dataset, the number of columns and the datatypes which made up the dataset. 

In [9]:
df_activity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Id                        940 non-null    int64         
 1   ActivityDate              940 non-null    datetime64[ns]
 2   TotalSteps                940 non-null    int64         
 3   TotalDistance             940 non-null    float64       
 4   TrackerDistance           940 non-null    float64       
 5   LoggedActivitiesDistance  940 non-null    float64       
 6   VeryActiveDistance        940 non-null    float64       
 7   ModeratelyActiveDistance  940 non-null    float64       
 8   LightActiveDistance       940 non-null    float64       
 9   SedentaryActiveDistance   940 non-null    float64       
 10  VeryActiveMinutes         940 non-null    int64         
 11  FairlyActiveMinutes       940 non-null    int64         
 12  LightlyActiveMinutes  

<p style="font-size:18px;">  Next I wanted to add some additional columns which would help with my analysis. The first column I added is called "Total_Active_Mins", which sums all of the active minute columns into one total. The second column is "Total_Mins", which adds the total active minutes and the sedentary minutes together to identify how many minutes of data were recorded in a day. The third column, "Day_of_week", contains the day of the week based on the date in the "ActivityDate" column.

In [10]:
# Create a new column which totals all the active minutes.
df_activity["Total_Active_Mins"] = df_activity['VeryActiveMinutes'] + df_activity['FairlyActiveMinutes'] + df_activity['LightlyActiveMinutes']

# Create a new column which totals the number of recorded minutes.
df_activity["Total_Mins"] = df_activity["Total_Active_Mins"] + df_activity["SedentaryMinutes"]

# Create a new column which contains the day of the week from the date in the ActivityDate column.
df_activity["Day_of_week"] = df_activity["ActivityDate"].dt.day_name()

# Display the newly created columns
df_activity.head()

Unnamed: 0,Id,ActivityDate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,...,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed,Total_Active_Mins,Total_Mins,Day_of_week
0,1503960366,2016-04-12,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,...,13,328,728,1985,1.0,327.0,346.0,366,1094,Tuesday
1,1503960366,2016-04-13,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,...,19,217,776,1797,2.0,384.0,407.0,257,1033,Wednesday
2,1503960366,2016-04-14,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,...,11,181,1218,1776,0.0,0.0,0.0,222,1440,Thursday
3,1503960366,2016-04-15,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,...,34,209,726,1745,1.0,412.0,442.0,272,998,Friday
4,1503960366,2016-04-16,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,...,10,221,773,1863,2.0,340.0,367.0,267,1040,Saturday


<p style="font-size:18px;"> Before I start exploring and analyzing the data, I think it may help to visualise it in a correlation matrix. This gives me a high-level overview of the data and shows whether there is any unexpected or unusual correlation which I could investigate further.

In [11]:
# Defining the columns in the correlation matrix
correlation_matrix = df_activity[['Id', 'ActivityDate', 'TotalSteps', 'TotalDistance', 'TrackerDistance', 'LoggedActivitiesDistance', 'VeryActiveDistance',
        'ModeratelyActiveDistance', 'LightActiveDistance', 'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes','LightlyActiveMinutes', 
        'SedentaryMinutes', 'Calories','Total_Active_Mins', 'Total_Mins','TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed']].corr().round(2)

# Create heatmap
fig = ff.create_annotated_heatmap(z=correlation_matrix.values, x=list(correlation_matrix.columns),y=list(correlation_matrix.index),
    colorscale='Portland_r',showscale=True, zmin=-1, zmax=1)

# Customize layout
fig.update_layout(title='Correlation Heatmap',xaxis_title='Features',yaxis_title='Features', width=1100, height=1000, 
                  xaxis=dict(side='bottom'))

# Show plot
fig.show()

<p style="font-size:18px;"> Although it is a great high-level overview of the data, the correlation matrix did not show any unexpected correlation, so I will continue to explore and check the data.

<p style="font-size:18px;"> To check whether the data was complete and the total minutes per day was equal to 1440 (the number of minutes in a day), I wrote the following code. The code identifies how many rows contain data from a full day, how many contain a partial day, and any errors where more than 1440 minutes were recorded in a day.

In [12]:
# Creating variables for the different criteria
under_24hrs = len(df_activity[df_activity['Total_Mins'] < 1440])
over_24hrs = len(df_activity[df_activity['Total_Mins'] > 1440])
exactly_24hrs = len(df_activity[df_activity['Total_Mins'] == 1440])

print("There are", under_24hrs, "rows which the total mins is less than 24hrs")
print("There are", over_24hrs, "rows which the total mins is more than 24hrs")
print("There are", exactly_24hrs, "rows which the total mins is 24hrs")

There are 462 rows which the total mins is less than 24hrs
There are 0 rows which the total mins is more than 24hrs
There are 478 rows which the total mins is 24hrs


<p style="font-size:18px;"> The good news is that there are no rows which contain data for more than 24hrs in a 24-hour period. However, almost half of the rows contain partial data, where a fitness tracker did not track data for a full day. There are many reasons why this may be the case. It could be that the users are required to remove their fitness trackers under certain conditions, such as swimming, hazardous environments or work, and then forget to put them back on when able to. Alternatively, the fitness tracker may have run low on power and needed to be charged during this period of time.
    
<p style="font-size:18px;"> To investigate this further I added an additional column called "Missing_Mins", which subtracts the "Total_Mins" column from 1440 to identify how many minutes in a particular day a user did not track any fitness data.
    
<p style="font-size:18px;">  With the newly created "Missing_Mins" column I am able to perform further analysis to identify, on average, how many missing minutes there are per user per day.

In [13]:
# Create the "Missing_Mins" column: 1440 (minutes in 24hrs) minus the total recorded minutes
df_activity["Missing_Mins"] = (1440 - df_activity["Total_Mins"])

# Variables to find rows with missing minutes depending on the specified criteria
missing_mins = df_activity.loc[df_activity["Missing_Mins"] != 0, "Missing_Mins"]
negative_missing_mins = df_activity.loc[df_activity["Missing_Mins"] <0].shape[0]

# Function to print results; it uses its arguments rather than the global variables
def missing_mins_check(missing, negative_count):
    print("Average number of mins per day where data was not tracked", round(missing.mean(),2))
    print("Average number of hours per day where data was not tracked", round(missing.mean()/60,2))
    print("Percentage of average missing data for each day",round(missing.mean()/1440,2)*100,"%")
    print("Number of rows where the total mins does not match 1440:", missing.shape[0])
    print("Number of rows with too many minutes:", negative_count)
    
missing_mins_check(missing_mins,negative_missing_mins)

Average number of mins per day where data was not tracked 450.16
Average number of hours per day where data was not tracked 7.5
Percentage of average missing data for each day 31.0 %
Number of rows where the total mins does not match 1440: 462
Number of rows with too many minutes: 0
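
<p style="font-size:18px;"> As a quick check of the arithmetic behind these figures, the snippet below reproduces the hours and percentage from the printed mean. It also shows what the average becomes when it is spread over all 940 rows instead of only the 462 rows with missing minutes. The variable names are my own.

```python
# Figures taken from the output above
mean_missing_mins = 450.16   # mean over rows where Missing_Mins != 0
rows_with_missing = 462
total_rows = 940

hours_missing = mean_missing_mins / 60      # minutes -> hours
share_of_day = mean_missing_mins / 1440     # fraction of a 24-hour day

# The remaining rows have 0 missing minutes, so they only add to the row count
mean_over_all_rows = mean_missing_mins * rows_with_missing / total_rows

print(round(hours_missing, 2), round(share_of_day, 2), round(mean_over_all_rows, 2))
```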


<p style="font-size:18px;"> Wow, on days with any missing data there is on average 7.5 hours of missing data, or 31% of the day. That is a massive amount of missing data. This couldn't be right, I thought; I must be missing something. Maybe 7.5 hours of missing data a day could be users charging their fitness trackers as they sleep? This is when I realised my oversight: I was not looking at all the data. I had joined the sleep tracking data into the activity dataset yet had not included the sleep data in my analysis. To correct this error I repeated my previous check, this time including the sleep data.

In [14]:
# Amended "Missing_Mins" column which now factors in the total time spent in bed
df_activity["Missing_Mins"] = 1440 - (df_activity["Total_Mins"] + df_activity["TotalTimeInBed"])

# Recompute the variables from the previous code block, since they were built from the old "Missing_Mins" values
missing_mins = df_activity.loc[df_activity["Missing_Mins"] != 0, "Missing_Mins"]
negative_missing_mins = df_activity.loc[df_activity["Missing_Mins"] <0].shape[0]

missing_mins_check(missing_mins,negative_missing_mins)

Average number of mins per day where data was not tracked 59.51
Average number of hours per day where data was not tracked 0.99
Percentage of average missing data for each day 4.0 %
Number of rows where the total mins does not match 1440: 336
Number of rows with too many minutes: 155
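
<p style="font-size:18px;"> Note that the mean above now mixes positive values (untracked minutes) with negative values (days totalling more than 1440 minutes), so the two partly cancel out. Of the 336 non-zero rows, 155 are negative, which leaves 181 positive. The sketch below uses made-up values to show how the two groups could be averaged separately.

```python
import pandas as pd

# Hypothetical Missing_Mins values: positive = untracked, negative = over 1440
missing = pd.Series([120, 60, 0, -30, -90, 0])

untracked = missing[missing > 0]
over_counted = missing[missing < 0]

mean_non_zero = missing[missing != 0].mean()      # signs cancel each other out
mean_untracked = untracked.mean()                 # average shortfall
mean_over_counted = over_counted.abs().mean()     # average excess

print("Mean of all non-zero rows:", mean_non_zero)
print("Mean untracked minutes:", mean_untracked)
print("Mean over-counted minutes:", mean_over_counted)
```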


<p style="font-size:18px;"> Okay, that's looking better. The average time per day of untracked data has dropped significantly, which is great, and on average there is now only 4% of missing data per day. However, this has brought along another problem: there is still a large number of rows in which the total minutes does not equal 1440 (the total minutes in a day), and 155 rows now contain more than 1440 minutes. After looking at the data, the only way I could make sense of this is that where the total number of minutes is less than 1440, the gap is (on average) a short period of time in which the users could be charging their fitness trackers. Where 155 rows now have a total of more than 1440 minutes, on the other hand, this suggests to me that somewhere there is double counting. For example, the users are having, let's say, "active time" in bed, where the tracker adds minutes to both the "TotalTimeInBed" column and one of the activity minutes columns.
    
<p style="font-size:18px;"> I spent far too long trying to get every row to have exactly 0 missing minutes, but the problem eludes me. <i> (please let me know if you find a solution) </i> I have minimized the amount of missing data in the dataset, yet there is still a huge number of rows in which the total minutes does not equal 1440. I decided to plot the missing minutes on a graph to see if there were any trends or points of interest.

In [15]:
filtered_data = df_activity[df_activity['Missing_Mins'] != 0]

# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=filtered_data['Missing_Mins'],y=filtered_data['ActivityDate'],mode='markers',hovertemplate='Missing minutes: %{x}<br> %{y}',marker=dict(
        color=filtered_data['Missing_Mins'], colorscale='Portland', colorbar=dict(title='Missing Minutes')), name=''))

# Customize layout
fig.update_layout(
    title='Missing Mins per Date for all users', xaxis_title='Missing Minutes',yaxis_title='Date',
    yaxis=dict(tickvals=df_activity['ActivityDate'],autorange='reversed',),height=750)

# Show plot
fig.show()

###############################################################################################################


# Create histogram trace
histogram_trace = go.Histogram(x=filtered_data['Missing_Mins'],xbins=dict(start=-400, end=1440, size=20),hoverinfo='x+text',text=None,
                               hovertemplate='Number of missing minutes: %{x}<br>Frequency: %{y}',name='')

# Create layout
layout = go.Layout(title='Frequency of Missing Minutes', xaxis=dict(title='Missing Minutes'), yaxis=dict(title='Frequency'), bargap=0.1)

# Create figure
fig = go.Figure(data=[histogram_trace], layout=layout)

# Show plot
fig.show()

<p style="font-size:18px;"> Overall there is a small discrepancy of missing minutes which I have been unable to fully eliminate. The missing minute values over 250 appear to be anomalous. I would need to raise these points of concern with the stakeholders to understand why this is the case.
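
<p style="font-size:18px;"> To take these points to the stakeholders it would help to list the affected rows per user. The sketch below uses made-up rows. The 250-minute cut-off is only the visual threshold mentioned above, not a validated rule.

```python
import pandas as pd

# Hypothetical rows; the 250-minute cut-off comes from the paragraph above
sample = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "Missing_Mins": [0, 35, 610, -120],
})
THRESHOLD = 250

# Keep only the rows above the cut-off, then count them per user
flagged = sample[sample["Missing_Mins"] > THRESHOLD]
flagged_per_user = flagged.groupby("Id").size()
print(flagged_per_user)
```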
    
<p style="font-size:18px;"> Despite the data not correctly matching the minutes in a day, I will continue with the analysis on the assumption that the missing data is due to users charging their devices or minutes being double counted. However, in reality I have no way to know if this is correct, and it is something that will be raised with the stakeholders of this case study.
    
<p style="font-size:18px;"> Moving on, I decided to review the steps the users were taking per day. Using my previously created "Day_of_week" column, I found the average steps per day of the week and plotted this data on a bar chart to determine if there were any trends in the number of steps users were taking per day. Please see the bar chart below.

In [16]:
# Grouping by day of the week and calculating the average of TotalSteps
average_steps_per_day = df_activity.groupby("Day_of_week", observed=True)["TotalSteps"].mean().round(1)

# Sort the data by the order of days
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
average_steps_per_day = average_steps_per_day.reindex(days_order)

# Create bar trace
trace = go.Bar(x=average_steps_per_day.index, y=average_steps_per_day.values,hovertemplate='%{x}:<br> Average No. of Steps %{y}',name='')

# Create layout
layout = go.Layout(title='Average Steps per Day of the Week', xaxis=dict(title='Day of the Week'), yaxis=dict(title='Average Steps'))

# Create figure
fig = go.Figure(data=[trace], layout=layout)

# Show plot
fig.show()

# Calculate and print the average daily steps across the total dataset
print("Average daily steps across total dataset:", round(df_activity["TotalSteps"].mean(), 2))

Average daily steps across total dataset: 7637.91
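
<p style="font-size:18px;"> An alternative to reindexing after the groupby is to store the day names as an ordered categorical, so that grouping and sorting follow calendar order automatically. This is a sketch on made-up rows, not a change to the analysis above.

```python
import pandas as pd

days_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Hypothetical sample rows
sample = pd.DataFrame({
    "Day_of_week": ["Sunday", "Monday", "Sunday", "Wednesday"],
    "TotalSteps": [4000, 9000, 6000, 7000],
})

# An ordered categorical makes groupby results follow calendar order
sample["Day_of_week"] = pd.Categorical(
    sample["Day_of_week"], categories=days_order, ordered=True
)

# observed=True keeps only the days that actually appear in the data
avg_steps = sample.groupby("Day_of_week", observed=True)["TotalSteps"].mean()
print(avg_steps)
```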


<p style="font-size:18px;"> The bar chart unfortunately does not show any obvious trends; the average number of steps remains mostly flat, with Saturday and Tuesday having the highest average steps and Sunday the lowest. The average number of steps per day across the full dataset was 7,637.

<p style="font-size:18px;"> To dive deeper I decided to plot the number of steps per user per date on the graph below to see if there would be any trends. For example, as the dates move closer to summer and the days become longer and sunnier, I wanted to see whether this would lead users to take more steps. Unfortunately I do not know the location of the users or what the weather was per day in the area of the world where the users are located, so this may not show any trends. Please see the plot below for the results.
    

In [17]:
# Calculate average steps per date
avg_steps_per_date = df_activity.groupby('ActivityDate')['TotalSteps'].mean()

# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=df_activity['TotalSteps'], y=df_activity['ActivityDate'], mode='markers', hovertemplate='Steps: %{x}<br>%{y}',
                                marker=dict(color=df_activity['TotalSteps'], colorscale='Portland_r', colorbar=dict(title='Total Steps')), name='', showlegend=False))

# Add average line
fig.add_trace(go.Scatter(x=avg_steps_per_date, y=avg_steps_per_date.index, mode='lines', name='Average Steps per Date', line=dict(color='grey', dash='dashdot'), showlegend=False,
                         hovertemplate='Average Steps: %{x} <br>%{y}<extra></extra>'))

# Customize layout
fig.update_layout(
    title='Steps per Date for all users', xaxis_title='Total number of steps', yaxis_title='Date',
    yaxis=dict(tickvals=df_activity['ActivityDate'], autorange='reversed'), height=750)

# Show plot
fig.show()

print("Number of rows which contain 0 steps taken in a day:",len(df_activity[df_activity["TotalSteps"] == 0]))

Number of rows which contain 0 steps taken in a day: 77
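
<p style="font-size:18px;"> With 77 of the 940 rows showing 0 steps, it is worth seeing how much such rows can pull an average down. The sketch below uses made-up values. Treating a zero-step day as a day the tracker was not worn is an assumption on my part, not something the data confirms.

```python
import pandas as pd

# Hypothetical daily step totals; 0 stands for a day with no steps logged
steps = pd.Series([0, 8000, 12000, 0, 10000])

mean_all_days = steps.mean()                 # zero-step days included
mean_worn_days = steps[steps > 0].mean()     # zero-step days excluded
print(mean_all_days, mean_worn_days)
```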


<p style="font-size:18px;"> Again, no obvious trend was found, although there does appear to be a sharp drop-off in steps on the 12th of May. This is the final date in the export, so those rows may cover only part of a day. Going forward, I would advise the stakeholders that more data across a longer timeframe, together with the location of the users, would allow for further analysis to determine whether the time of the year and location have an impact on how the users are using the fitness trackers and whether the summer months result in increased fitness for the users.
    
<p style="font-size:18px;"> To dive deeper into how users were taking steps I used the hourly steps dataset which, as the name suggests, tracks the number of steps each user took per hour. I wanted to see if there were any trends between the time of day and the number of steps users were adding to the step tracker. Before any analysis was completed, I wanted to inspect the data to understand the size and datatypes included within this dataset.

In [18]:
print(df_steps.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   StepTotal     22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB
None


<p style="font-size:18px;"> The above information tells me that there are 3 columns: the ID of the user, the hour, and the step total for that hour. The first error I noticed is that the "ActivityHour" column has the datatype object instead of the correct datetime datatype, so the following code amends this.

In [19]:
# Convert "ActivityHour" to datetime format
df_steps['ActivityHour'] = pd.to_datetime(df_steps['ActivityHour'], format='%m/%d/%Y %I:%M:%S %p')

df_steps.head()

Unnamed: 0,Id,ActivityHour,StepTotal
0,1503960366,2016-04-12 00:00:00,373
1,1503960366,2016-04-12 01:00:00,160
2,1503960366,2016-04-12 02:00:00,151
3,1503960366,2016-04-12 03:00:00,0
4,1503960366,2016-04-12 04:00:00,0


<p style="font-size:18px;"> Next I could begin to prepare the data for analysis. I began by extracting the time from the "ActivityHour" column, as this would allow me to group the data by the hour of the day regardless of the date. I grouped the data by hour of the day and calculated the average number of steps per hour. I then ensured that the data was in the correct format and sorted it in chronological order from 12am through to 11pm.

In [20]:
# Extract the time component from "ActivityHour"
df_steps['ActivityHour'] = df_steps['ActivityHour'].dt.time

# Grouping by ActivityHour and calculating the average of StepTotal
average_steps_per_hour = df_steps.groupby(df_steps['ActivityHour'])["StepTotal"].mean()

# Convert the index to time format
average_steps_per_hour.index = pd.to_datetime(average_steps_per_hour.index, format='%H:%M:%S').time

# Sort the data by the time index
average_steps_per_hour = average_steps_per_hour.sort_index()
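
<p style="font-size:18px;"> A shorter route to the same grouping is to group on the hour number directly, which also leaves the full timestamp in place for later use. This is a sketch on made-up rows, with the timestamp strings written to match the format parsed above.

```python
import pandas as pd

# Hypothetical hourly rows, written to match the format string used above
hourly = pd.DataFrame({
    "ActivityHour": ["4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM",
                     "4/13/2016 12:00:00 AM", "4/13/2016 1:00:00 PM"],
    "StepTotal": [100, 900, 300, 700],
})
hourly["ActivityHour"] = pd.to_datetime(
    hourly["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p"
)

# Group on the hour number (0-23); the timestamp column itself is not overwritten
steps_by_hour = hourly.groupby(hourly["ActivityHour"].dt.hour)["StepTotal"].mean()
print(steps_by_hour)
```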

<p style="font-size:18px;"> I could now plot this data on a graph to visualise it and see if there was any correlation between the time of the day and the average number of steps taken. Please see the graph below.

In [21]:
# Create figure
fig = go.Figure(data=[go.Bar(x=[hour.strftime('%H:%M') for hour in average_steps_per_hour.index], y=average_steps_per_hour.values.round(0),
                     marker=dict(color=average_steps_per_hour.values, colorscale='Portland'),hovertemplate='Average Steps:%{y}',name='')])

# Update layout
fig.update_layout(title='Average Steps per Hour', xaxis_title='Activity Hour', yaxis_title='Average Steps',
                  xaxis=dict(tickangle=45),yaxis=dict(gridcolor='lightgrey'))

fig.show()

<p style="font-size:18px;"> The data tells us that on average the 33 people in the data typically take steps from 6am until 10pm, with the most steps being taken around the lunch period of 12pm until 2pm and the rush hour and dinner period of 5pm until 7pm. There appears to be a slight decrease in steps at 3pm, which could suggest the users rest slightly after the lunchtime period. Because this graph averages the data per hour, it will not be reflective of an individual user, and it does not show whether the average steps taken by hour change depending on the day, i.e. a Saturday vs a Monday.
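
<p style="font-size:18px;"> The day-of-week limitation could be explored with a pivot table of hour against day. The sketch below uses made-up rows and assumes the full timestamp is still available, whereas the code above replaces "ActivityHour" with the time only.

```python
import pandas as pd

# Hypothetical hourly rows; 2016-04-16 is a Saturday, 2016-04-18 a Monday
hourly = pd.DataFrame({
    "ActivityHour": pd.to_datetime([
        "2016-04-16 09:00", "2016-04-16 18:00",
        "2016-04-18 09:00", "2016-04-18 18:00",
    ]),
    "StepTotal": [1200, 300, 400, 900],
})
hourly["Hour"] = hourly["ActivityHour"].dt.hour
hourly["Day"] = hourly["ActivityHour"].dt.day_name()

# One row per hour, one column per day of the week
steps_by_hour_and_day = hourly.pivot_table(
    index="Hour", columns="Day", values="StepTotal", aggfunc="mean"
)
print(steps_by_hour_and_day)
```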
    
<p style="font-size:18px;"> I next wanted to pivot away from the step data and see if there were any trends in the activity data, so I plotted the average active minutes by day of the week.

In [22]:
# Group by "Day_of_week" and calculate the mean of the active minute columns rounded to 2 decimal places
average_active_day = df_activity.groupby('Day_of_week', observed=False)[['Total_Active_Mins', 'LightlyActiveMinutes', 'VeryActiveMinutes', 'FairlyActiveMinutes']].mean().round(2)

# Reorder days of the week
average_active_day = average_active_day.reindex(days_order)

# Create figure
fig = go.Figure()

# Add traces for each active minute column
for column in average_active_day.columns:
    fig.add_trace(go.Scatter(x=average_active_day.index, y=average_active_day[column], mode='markers+lines', name=column, hovertemplate='<b>%{y:.2f}</b> mins'))

# Update layout
fig.update_layout(title='Average Active Minutes by Day of the Week',xaxis_title='Day of the Week', yaxis_title='Average Active Minutes', yaxis=dict(range=[0, 250]), hovermode='closest')

fig.show()

# Calculate and print average active minutes per day across entire dataset
print("Average active minutes per day across entire dataset:", df_activity["Total_Active_Mins"].mean().round(2), "(", round(df_activity["Total_Active_Mins"].mean() / 60, 2), "hrs )")

Average active minutes per day across entire dataset: 227.54 ( 3.79 hrs )


<p style="font-size:18px;"> The graph shows that the number of active minutes per day stays mostly flat, with Fridays and Saturdays showing the highest number of active minutes by a small margin. One thing that does stick out to me is that on every day of the week the users have more "VeryActiveMinutes" than "FairlyActiveMinutes". I would expect this to be reversed, with the very active minutes being the smallest portion of a user's day, since that is when their heart rate and movement are above the defined threshold.
    
<p style="font-size:18px;"> On average each user is active for 227 minutes, or 3.79 hours, per day. To visualize this in a different way I plotted the average active times and sedentary time on a pie chart. Please see below.

In [23]:
# Calculate the average time for each activity level, converted from minutes to hours
avg_veryactive = round(df_activity['VeryActiveMinutes'].mean()/60,2)
avg_fairlyactive = round(df_activity['FairlyActiveMinutes'].mean()/60,2)
avg_lightlyactive = round(df_activity['LightlyActiveMinutes'].mean()/60,2)
avg_sedentary = round(df_activity['SedentaryMinutes'].mean()/60,2)
avg_inbed = round(df_activity[df_activity['TotalTimeInBed'] > 0]['TotalTimeInBed'].mean() / 60, 2)

# Create a dictionary to hold the average hours for each activity level
activity_minutes = {'Very Active': avg_veryactive, 'Fairly Active': avg_fairlyactive, 'Lightly Active': avg_lightlyactive, 'Sedentary': avg_sedentary, 'In Bed': avg_inbed}


# Plotting with Plotly
fig = go.Figure(data=[go.Pie(labels=list(activity_minutes.keys()), values=list(activity_minutes.values()), hole=0.5, hoverinfo='label+percent+value', 
                             hovertemplate='%{label}: %{percent} <br>Average Hours: %{value}', name='')])
fig.update_layout(title='Average Hours Distribution Per Day', title_x=0.5)
fig.show()

print("Total hours in the chart:",avg_veryactive+avg_fairlyactive+avg_lightlyactive+avg_sedentary+avg_inbed)
print("Average hours in bed per day:",avg_inbed)
print("Average hours out of bed per day:",round(df_activity['Total_Mins'].mean()/60,2))

Total hours in the chart: 27.95
Average hours in bed per day: 7.64
Average hours out of bed per day: 20.31


<p style="font-size:18px;"> Because this pie chart takes an average of the different minute types per day, the pie totals 27.95hrs, which is of course not possible in a 24-hour day. This is due to the total minutes (not including time in bed) having an average of 20.31 hours and the average time in bed being 7.64 hours (for the 24 users who tracked sleep data).
    
<p style="font-size:18px;"> 3.79 hours of active time per day (on average) is great, but across a whole day that's less than 20% of that person's day. Of course it would be impossible for someone to be active 100% of the day. However, with the correct strategy it may be possible to increase users' active time to 25% of their day. 
    
<p style="font-size:18px;"> I next wanted to look at calories to see how different metrics affected a user's burned calories. I began by plotting calories against active minutes. Please see the graph below.

In [24]:
# Calculate correlation coefficient
correlation = np.corrcoef(df_activity['Total_Active_Mins'], df_activity['Calories'])[0, 1]

# Create the scatter plot with trendline using Plotly Express
fig = px.scatter(df_activity, x='Total_Active_Mins', y='Calories', trendline="ols", labels={'Total_Active_Mins': 'Active Mins', 'Calories': 'Calories'}, 
                 title='Calories by Active Mins', color='Calories', color_continuous_scale='viridis')  

# Add correlation coefficient annotation
# Position the label near the bottom right, using the Calories column for the y coordinate
fig.add_annotation(x=df_activity['Total_Active_Mins'].max() - 10, y=df_activity['Calories'].max() - 4400, text=f'Correlation: {correlation:.2f}', showarrow=False)

# Show plot
fig.show()

print("Number of days in which a user was active less than 5 minutes:",len(df_activity[df_activity["Total_Active_Mins"] < 5]))
print("Percentage of days in which a user was active less than 5 minutes:",(len(df_activity[df_activity["Total_Active_Mins"] < 5])/len(df_activity))*100, "%")


Number of days in which a user was active less than 5 minutes: 94
Percentage of days in which a user was active less than 5 minutes: 10.0 %


<p style="font-size:18px;"> As expected, there is a positive correlation between the minutes a user was active and the calories burned. One thing that stuck out to me was the number of days on which users had zero or near-zero active minutes. I wrote an additional line of code to confirm the number of days on which users spent less than 5 minutes active. 94 days with almost no activity is a huge value: that's 10% of the entire dataset! I wonder if there was a reason for this. Were these recovery days for the users, or was it a choice not to be active?
    
<p style="font-size:18px;"> I decided to look at another metric. I wanted to see how steps affected calories burned, so I plotted the graph below.

In [25]:
# Calculate correlation coefficient
correlation = np.corrcoef(df_activity['TotalSteps'], df_activity['Calories'])[0, 1]

# Create the scatter plot with trendline using Plotly Express
fig = px.scatter(df_activity, x='TotalSteps', y='Calories', trendline="ols", labels={'TotalSteps': 'Total Steps', 'Calories': 'Calories'}, 
                 title='Calories by Total Steps', color='Calories', color_continuous_scale='viridis')  

# Add correlation coefficient annotation
fig.add_annotation(x=df_activity['TotalSteps'].max() - 2500, y=df_activity['Calories'].max() - 4400, text=f'Correlation: {correlation:.2f}', showarrow=False)

# Show plot
fig.show()

print("Number of days in which a user took less than 500 steps:",len(df_activity[df_activity["TotalSteps"] < 500]))
print("Percentage of days in which a user took less than 500 steps:",round((len(df_activity[df_activity["TotalSteps"] < 500])/len(df_activity)),2)*100, "%")

Number of days in which a user took less than 500 steps: 98
Percentage of days in which a user took less than 500 steps: 10.0 %


<p style="font-size:18px;"> Again I expected a positive correlation, and this time the correlation between steps and calories was higher than between active minutes and calories. Once again, what surprised me most was the number of days on which users took less than 500 steps: 98 days, or once again around 10% of the data. Surely this could not be correct? I realized that some of the data might be inaccurate. On average an adult burns around 2000 kcal per day (slightly less for women), and even without moving the body burns calories while resting and sleeping, so someone wearing a fitness tracker all day should still record a substantial number. My realistic range would be around 1200 calories and above, so any day on which a user burned less than 1200 would be a concern for me. Could these users have tracked data for only a small portion of the day? Could the users be children or elderly? Could the tracker not be working as intended? Unfortunately I do not have these answers, but it is crucial to raise these points with the stakeholders.
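
<p style="font-size:18px;"> A simple way to collect the rows worth raising with stakeholders is to flag each day against both thresholds discussed above. The sketch below does this on a small made-up table; the `sample` frame, its values and the flag column names are hypothetical, and the 1200-calorie and 500-step cut-offs are the ones chosen in the text rather than any official standard.

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook, values are invented.
sample = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3],
    "TotalSteps": [8200, 0, 450, 12000, 300],
    "Calories": [2100, 1100, 1900, 2800, 900],
})

MIN_PLAUSIBLE_CALORIES = 1200  # threshold taken from the text above
MIN_STEPS = 500                # threshold taken from the text above

# One boolean flag per check keeps the reasons for suspicion visible.
sample["low_calories"] = sample["Calories"] < MIN_PLAUSIBLE_CALORIES
sample["low_steps"] = sample["TotalSteps"] < MIN_STEPS

# Days failing both checks are the strongest candidates for partial-day wear.
suspect = sample[sample["low_calories"] & sample["low_steps"]]
print(suspect[["Id", "TotalSteps", "Calories"]])
```

<p style="font-size:18px;"> Keeping the two flags separate makes it easy to report how many days fail one check, the other, or both.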
    
<p style="font-size:18px;"> I remembered from my earlier analysis that the 12th of May showed a much lower average step count per user, so I wanted to see whether there was a pattern and whether the data from this date could be anomalous. I wrote the following code, which counts the rows with less than 1200 calories on the 12th of May and in the entire dataset.

In [26]:
print("Number of rows on the 12th of May with less than 1200 Calories:",len(df_activity[(df_activity['ActivityDate'] == '2016-05-12') & (df_activity['Calories'] < 1200)]))
print("Number of rows in entire dataset with less than 1200 Calories:",len((df_activity[df_activity['Calories'] < 1200])))
# Use the same 1200-calorie threshold in the numerator and the denominator
print("Percentage of potential anomalous rows on the 12th of May:", 
      round(len(df_activity[(df_activity['ActivityDate'] == '2016-05-12') & (df_activity['Calories'] < 1200)]) / len(df_activity[df_activity['Calories'] < 1200]) * 100, 2), "%")

Number of rows on the 12th of May with less than 1200 Calories: 10
Number of rows in entire dataset with less than 1200 Calories: 18
Percentage of potential anomalous rows on the 12th of May: 55.56 %


<p style="font-size:18px;"> This suggests to me that the data from the 12th of May could be inaccurate and misleading. A total of 18 rows could be inaccurate, which is just under 2% of the entire dataset, and the 12th of May alone makes up 55.56% (10 of 18) of the rows with less than 1200 calories. At this stage I feel fairly confident that these values are inaccurate and should be removed. However, I cannot say this with 100% certainty, and I would raise the question with the stakeholders to understand whether there is a reason why the 12th of May shows such a large amount of anomalous data.
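
<p style="font-size:18px;"> To check whether low-calorie rows cluster on a single date rather than testing one date at a time, a grouped count is a quick option. The sketch below uses a small made-up table (the `sample` frame and its values are hypothetical; the column names and the 1200-calorie threshold follow the notebook).

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook, values are invented.
sample = pd.DataFrame({
    "ActivityDate": pd.to_datetime(
        ["2016-05-10", "2016-05-11", "2016-05-11",
         "2016-05-12", "2016-05-12", "2016-05-12"]),
    "Calories": [2000, 1100, 2300, 900, 50, 1800],
})

# Keep only the rows below the threshold used in the text.
low = sample[sample["Calories"] < 1200]

# Count low-calorie rows per date, largest first, and express each as a share.
low_by_date = low.groupby("ActivityDate").size().sort_values(ascending=False)
share_by_date = (low_by_date / len(low) * 100).round(2)

print(low_by_date)
print(share_by_date)
```

<p style="font-size:18px;"> A date that dominates the top of this list is a candidate for a partial day of recording, such as the final day of the export.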
    
<p style="font-size:18px;"> Next I wanted to look at the sleep data and see if there were any trends I could identify. I plotted every user's sleep data by date and added a line showing the average sleep per date.

In [27]:
# Filter DataFrame to include only rows where TotalMinutesAsleep is greater than 0
sleep_data = df_activity[df_activity['TotalMinutesAsleep'] > 0]

# Calculate average hours of sleep per day
average_hours_of_sleep_per_day = sleep_data.groupby('ActivityDate')['TotalMinutesAsleep'].mean() / 60

# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=sleep_data['ActivityDate'], y=round(sleep_data['TotalMinutesAsleep'] / 60,2), mode='markers',
                                hovertemplate='Date: %{x}<br>Hours Sleep: %{y}', marker=dict(color=sleep_data['TotalMinutesAsleep'], colorscale='Portland'), name=''))

# Customize layout
fig.update_layout(title='Hours sleep per Date for all users', xaxis_title='Date', yaxis_title='Hours of sleep',
                  xaxis=dict(tickvals=sleep_data['ActivityDate'], tickangle=-30), height=750)

# Add a line for the average hours of sleep per day
fig.add_trace(go.Scatter(x=average_hours_of_sleep_per_day.index, y=average_hours_of_sleep_per_day, mode='lines',
                         line=dict(color='grey', dash='dashdot'), hovertemplate='Date: %{x}<br>Average Hours Sleep: %{y}',
                         name=''))

# Show plot
fig.show()

print("Average hours slept across entire dataset:", round(sleep_data['TotalMinutesAsleep'].mean() / 60,2))

Average hours slept across entire dataset: 6.99


<p style="font-size:18px;"> Very interesting. The average sleep per date is between 6 and 8 hours, yet there is a large difference between the users getting the most sleep and the users getting the least. On several dates users are getting less than 4 hours of sleep. I thought there must be a reason for this. Are the users raising children? Working rotating shifts? Why would these users be getting so little sleep? I dug deeper into the dates and found that the days on which users are getting the lowest amounts of sleep are the weekends. This suggests to me that those users were likely going out and staying out late on a Friday/Saturday/Sunday, hence the lack of sleep.
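
<p style="font-size:18px;"> The weekend observation can be checked by labelling each short-sleep record with its day of the week. The sketch below shows one way to do that on a small made-up table; the `sample` frame and its values are hypothetical, the 4-hour cut-off comes from the text, and treating Friday to Sunday as the weekend is an assumption that mirrors the wording above.

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook, values are invented.
sample = pd.DataFrame({
    "ActivityDate": pd.to_datetime(
        ["2016-04-15", "2016-04-16", "2016-04-17",
         "2016-04-18", "2016-04-19", "2016-04-20"]),
    "TotalMinutesAsleep": [200, 180, 430, 420, 230, 450],
})

SHORT_SLEEP_MINUTES = 4 * 60  # "less than 4 hours", as in the text

# Keep only the short-sleep records and label them with the weekday name.
short = sample[sample["TotalMinutesAsleep"] < SHORT_SLEEP_MINUTES].copy()
short["day_name"] = short["ActivityDate"].dt.day_name()

# Assumption: Friday, Saturday and Sunday count as the weekend here.
short["is_weekend"] = short["day_name"].isin(["Friday", "Saturday", "Sunday"])
weekend_share = round(float(short["is_weekend"].mean()) * 100, 2)

print(short[["ActivityDate", "day_name", "TotalMinutesAsleep"]])
print("Share of short-sleep records on weekend days:", weekend_share, "%")
```

<p style="font-size:18px;"> Comparing this share with the share of weekend days in the whole dataset would show whether short nights really are over-represented at weekends.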
    
<p style="font-size:18px;"> To visualize the averages better, I plotted them on an easy-to-read bar chart.

In [28]:
# Calculate average hours of sleep per day of the week
sleep_data_day = sleep_data.groupby('Day_of_week')['TotalMinutesAsleep'].mean() / 60
sleep_data_day = sleep_data_day.reindex(days_order)

# Create a bar graph
fig = go.Figure(go.Bar(x=sleep_data_day.index, y=round(sleep_data_day, 2), hovertemplate='Average Hours Sleep: %{y}', name=''))

# Customize layout
fig.update_layout(title='Average Hours of Sleep per Day of the Week', xaxis_title='Day of the Week', yaxis_title='Average Hours of Sleep', height=750)

# Show plot
fig.show()

<p style="font-size:18px;"> It's great to see that on average the users are getting around 7 hours of sleep per night. However, as seen in my previous graph, there is a massive range in the data, in some cases from 12 hours down to 2 hours of sleep. Science has shown how important sleep is to a person's health, so I wanted to determine how good the users' sleep actually is. To visualize this, I plotted the frequency of sleep durations on a pie chart.

In [29]:
# Count the number of rows meeting the criteria for each category
count_less_than_7 = df_activity[(df_activity['TotalMinutesAsleep'] < 7 * 60) & (df_activity['TotalMinutesAsleep'] > 0)].shape[0]
count_between_7_and_9 = df_activity[(df_activity['TotalMinutesAsleep'] >= 7 * 60) & (df_activity['TotalMinutesAsleep'] < 9 * 60)].shape[0]
count_greater_than_9 = df_activity[df_activity['TotalMinutesAsleep'] >= 9 * 60].shape[0]


# Create pie chart trace
pie_chart_trace = go.Pie(labels=['Less than 7 hours sleep', '7 - 9 hours sleep', 'More than 9 hours sleep'],
                         values=[count_less_than_7, count_between_7_and_9, count_greater_than_9],
                         marker=dict(colors=['red', 'green', 'orange'],line=dict(color='#FFFFFF', width=2)), hole=0.5,
                         hovertemplate='<b>%{label}</b><br>Number of days: %{value}<extra></extra>')

# Create layout
layout = go.Layout(title='Frequencies of Hours Slept')  
                                                            

# Create figure
fig = go.Figure(data=[pie_chart_trace], layout=layout)

# Show plot
fig.show()

<p style="font-size:18px;"> Wow! There is a huge number of days on which users are getting less than 7 hours of sleep. This is definitely something which should be improved upon.
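
<p style="font-size:18px;"> The same three categories can also be expressed as percentages, which makes them easier to quote. The sketch below bins a small made-up series with `pd.cut`; the `sleep_minutes` values are hypothetical, while the 7-hour and 9-hour cut-offs and the labels match the cell above.

```python
import pandas as pd

# Hypothetical sleep durations in minutes; 0 means no sleep record for that day.
sleep_minutes = pd.Series([300, 410, 420, 480, 539, 540, 600, 0],
                          name="TotalMinutesAsleep")
tracked = sleep_minutes[sleep_minutes > 0]

bins = [0, 7 * 60, 9 * 60, float("inf")]
labels = ["Less than 7 hours sleep", "7 - 9 hours sleep", "More than 9 hours sleep"]

# right=False makes each bin [lower, upper), matching the >= 7h and >= 9h cut-offs above.
categories = pd.cut(tracked, bins=bins, labels=labels, right=False)

# Share of tracked days in each category, as a percentage.
share = (categories.value_counts(normalize=True, sort=False) * 100).round(2)
print(share)
```

<p style="font-size:18px;"> Binning once and reusing the result avoids repeating the boundary conditions in three separate filters.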
    
<p style="font-size:18px;"> I feel I have a good understanding of the data and the story it tells. From here I will summarize my findings and make recommendations on how Bellabeat can improve their service and guide their marketing strategy going forward.

# **Summary**
    
***

<p style="font-size:18px;">Thank you for getting this far in my notebook. There has been a lot of information to explore and analyse. In this section I will summarize my findings and make suggestions on how this data could be used to guide Bellabeat's future marketing strategy.

<p style="font-size:18px;">I've decided to break this section down into smaller, more manageable chunks in which I will discuss the missing data, steps, activity and sleep.

<p style="font-size:18px;"><strong>Missing data</strong>
<br>Throughout my data exploration I found several areas of the data which appeared to be missing or inaccurate. The first was the total number of minutes of data tracked in a day. I managed to reduce this to 336 rows (just under 36%) in which the total data tracked per day did not equal 1440 minutes, of which 155 rows contained more minutes of tracked data than is possible in a day. This led me to believe that either the fitness trackers have been double counting the tracking data or the fitness tracker was removed for part of the day.
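
<p style="font-size:18px;"> A check of this kind can be sketched as follows. The table is made up, and which columns are added together to get the tracked minutes is an assumption here (the four activity columns plus time in bed); the `sample` frame and the `status` labels are hypothetical names used only for illustration.

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60  # 1440

# Hypothetical rows; column names follow the notebook, values are invented.
sample = pd.DataFrame({
    "VeryActiveMinutes":    [30, 0, 60],
    "FairlyActiveMinutes":  [20, 0, 30],
    "LightlyActiveMinutes": [200, 100, 250],
    "SedentaryMinutes":     [700, 900, 800],
    "TotalTimeInBed":       [490, 0, 480],
})

# Assumption: tracked minutes are the sum of all five columns.
tracked = sample.sum(axis=1)

# Label each day as complete, under-tracked or over-tracked.
status = pd.Series("complete", index=sample.index)
status[tracked < MINUTES_PER_DAY] = "under"
status[tracked > MINUTES_PER_DAY] = "over"

print(pd.DataFrame({"tracked_minutes": tracked, "status": status}))
```

<p style="font-size:18px;"> Separating the "under" and "over" rows helps distinguish days where the tracker was probably removed from days where minutes appear to be counted twice.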

<p style="font-size:18px;"><strong>Steps</strong>
<br>The data on steps taken per user per day gave a good insight into how many steps users take in a day and the times at which they take the most steps. However, I also found that there are 77 rows (over 8% of the total data) where the users took 0 steps in a day, which is alarming. This made me question the validity of the data, as it suggests either that the users did not wear their fitness trackers on these days or that there was an error with the fitness tracker and steps were not counted. Did users who took 0 steps really not leave their bed or go to the toilet in a 24-hour period? On average the users were taking over 7,600 steps per day, which is great, and science has shown the health benefits of walking. There are many different reports suggesting the optimal number of steps an adult should take to stay healthy, and typically this range is anywhere from 6,000 to 12,000. I believe that 7,600 on average is great, but I also believe that there are ways in which Bellabeat could increase this.

<p style="font-size:18px;"><strong>Activity</strong>
<br>Overall I would consider the users in this dataset to be quite active, with average active minutes of 227.54 (3.79 hours) per day. The majority of that active time is lightly active, with just over 30 minutes of either fairly active or very active time per day on average. The data shows that Saturdays are typically the most active days and Sundays the least. Even so, I believe that Bellabeat could make changes to their service to prompt users to be more active. When looking at calories burned per day, I found that both the number of active minutes and the number of steps per day had a positive correlation with the number of calories burned. However, I also discovered that 10% of the data came from days on which users took less than 500 steps, and 10% came from days on which users had less than 5 active minutes. It is unclear to me whether this data is correct; although not impossible, it is unlikely. What stood out to me more was the fact that on 18 days users had burned less than 1200 calories. An average man burns between 2000 and 2400 calories per day and an average woman burns between 1600 and 2000 calories per day. So for there to be 18 days on which a person burned less than 1200 raises a red flag: even with no activity or movement all day, the average person would burn 1300+ calories. One factor which could be causing this is the age of the user. If the user is over 70 years old or younger than 20 years old, this would affect the calories they burn. Alternatively, these users may not have been wearing their fitness trackers for the full 24-hour period, which would result in a lower number of calories being tracked. More research and data would be required to confirm this. 
    
    
<p style="font-size:18px;"><strong>Sleep</strong>
<br>The amount of sleep across the users is okay, with an average sleep time of 6.99 hours across the entire dataset. Looking into this further, I found that users typically got the most sleep on Sundays and the least on Thursdays. Although the average sleep across the dataset was 6.99 hours, I noticed that there were several days on which some users were getting less than 4 hours of sleep. Further investigation revealed that these outliers were on weekend nights (Friday, Saturday and Sunday). This suggested that on these nights the users were likely going out for drinks, to nightclubs or to shows which went deep into the night, hence the shorter amount of sleep. Looking at the average hours of sleep per day and across the dataset only paints part of the picture and may not be representative of the full data. As such, I wanted to see the frequencies within the data frame of days on which users got less than 7 hours of sleep, between 7 and 9 hours, and more than 9 hours. A typical adult requires 7-9 hours of sleep each night, so I wanted to see how often this was occurring. To my surprise, despite the average sleep per night being 6.99 hours, users achieved 7 or more hours of sleep on 55.81% of the days in the dataset. On the other hand, that does mean that on over 44% of days users received less than 7 hours of sleep, which is a concern. 

# **Conclusion**
    
***

<p style="font-size:18px;"> Now that I have a good understanding of the data, I want to make my recommendations to Bellabeat and their cofounder and Chief Creative Officer, Urška Sršen. I feel there are many ways in which Bellabeat can improve their service to customers, which can help them grow their market share and profits. I will break my recommendations down into separate paragraphs which explain why I have made them.
    
<p style="font-size:18px;"><strong>More data</strong>
<br>The data provided is good for the most part. However, I have found several inconsistencies within the data which could be resolved with more data and better data collection. For example, it would be good to track the total minutes of data for each day, as throughout my analysis I found that for just under 36% of the data the total minutes in a day do not add up to 1440 minutes, and it is not clear whether this is a tracking error, whether the user had intentionally stopped tracking data or whether the tracking device lost power. It would be great to see this breakdown, as it would allow for better insights and a more comprehensive analysis. Alternatively, it would be great to track or identify the times in which the device is not tracking data; for example, charging time per day could be tracked, or the duration of time when the device is not worn. These extra pieces of data would make the data more complete and allow for greater insights. 
    
<p style="font-size:18px;"> Another benefit of more data would be the ability to see if users use the devices differently over different periods of the year. The current data is from the 12th of April to the 12th of May. However, the insights from this data may not be applicable to how a user will use the device in December. Data gathered over a year or a longer period of time may grant better insights and further improve Bellabeat's marketing strategy. It may be determined, for example, that users are more active during the summer than in the winter months, or that users take a higher number of steps in the first week of each month than in the rest of the month. The more data there is, the more insights can be found and the better the marketing strategy that can be determined.
    
<p style="font-size:18px;"> The final benefit of having more data is to identify whether different age groups and fitness levels are using the device differently. Who is the core target audience? Younger users or older users? Athletes or people trying to lose weight? Understanding the core audience will help Bellabeat make decisions which will better market to these users, or alternatively understand why certain customers may not be using their device and introduce ways to appeal to these customers.     
    
<p style="font-size:18px;"><strong>Goals</strong>
<br>Although it is great to see that average steps per day across all users is 7,600, could this be improved? There were many cases in which users were taking less than 500 steps per day. The science is showing that for a person to be healthy and reduce the chances of premature death they should be taking at least 4,000 steps per day. Having users set daily goals gives them something to work towards and may motivate a user to take more steps or spend a longer time being active. To further incentivise users to complete their goals, Bellabeat could offer additional information and highlight the health benefits of walking extra steps or spending those few more minutes being active. Additionally, Bellabeat could offer digital badges which show that a user has completed their goals for X number of days or is in the top X% of users for that month. These could then be shown off on the user's profile as a mini digital trophy case of achievements which the user can look back on.
    
<p style="font-size:18px;"> In addition to goals there could be challenges, for example walk X number of steps in a month or spend X minutes active. To further encourage users, Bellabeat could work with other fitness brands to offer a reward to people who complete these challenges. For example, fitness brand "Nadidas" could offer a 5% saving in their store to users who complete the Nadidas 70,000 steps a week challenge. Alternatively, the users who complete a challenge could be entered into a raffle to win fitness-related accessories or clothing.
    

<p style="font-size:18px;"><strong>Leaderboards and friendships</strong>
<br>To introduce friendly competition and support, Bellabeat could offer friendships and leaderboards. Friendships would allow users to support their friends and encourage them to be active and healthy, whilst also being a way to share and promote a user's achievements such as most steps in a day, most active minutes in a month etc. Combined with a weekly leaderboard which shows a user their weekly fitness compared to their friends or local community, this could help drive motivation and achieve a greater fitness level for users. This idea could also be combined with the previous idea of goals, such that you can see if a friend has achieved their goals and, if not, cheer them on and support them.
    
<p style="font-size:18px;"><strong>Alerts and notifications</strong>
<br>The final recommendation I have is to give alerts and notifications to users who are not as active as they want to be. When setting up a fitness tracker it would be ideal to understand why a user is using the tracker, for example to lose weight, monitor their fitness or improve their fitness. With this understanding in mind, let's take the user who wants to lose weight: if that user has remained static for over an hour, they can receive an alert or notification advising them to get up and take some steps. This again could be linked to a user's goals, such that if a user is halfway to a goal they get a notification that they are halfway there and should keep going. This small boost of motivation may spur users to be more active and have a higher likelihood of hitting their goals.
    
<p style="font-size:18px;">These are my recommendations for Bellabeat and the team. I hope this analysis has been useful and provided insight into how the users have been using their fitness trackers and ways in which Bellabeat could adjust their marketing strategy to improve customer satisfaction and attract more customers.
    
<p style="font-size:18px;"><strong>Many thanks for reading. Please feel free to comment and add any feedback or recommendations.</strong>
    