In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Hey there! 🌟 Looks like we're getting started with a cool data analysis project! Let me walk you through the libraries we're bringing on board:
__os__: This little guy helps us interact with the operating system. It's like our secret agent, allowing us to do all sorts of cool stuff with directories and files. Although we imported it, we won't be using it in this snippet, but hey, it's part of the squad!

__Numpy__: This library is like the mathematical genius in our team! It's all about handling big datasets and performing super-fast mathematical operations. We've given it the nickname "np," so we can call on it with just two letters. It's going to make crunching numbers a breeze!

__Pandas__: Meet our data wizard, "Pandas"! 🧙‍♂️ It knows all the tricks for manipulating and analyzing data like a pro. We're going to call it "pd" to keep things snappy and easy-breezy. With Pandas by our side, we'll tame any dataset that comes our way!

__Seaborn and Matplotlib__: These two are our dynamic duo for data visualization! 🌟 Seaborn is like the stylish fashionista that knows how to create eye-catching plots effortlessly. It builds on top of Matplotlib, which is the backbone of data visualization in Python. We've brought them both on board because they complement each other perfectly. Seaborn adds that extra touch of elegance to our plots, while Matplotlib gives us the flexibility to fine-tune every detail.

__Matplotlib.pyplot__: Give it up for the legendary "Matplotlib" - the Picasso of data visualization! 🎨 It can whip up all sorts of beautiful plots, static or interactive. We'll call it "plt," our trusty artist, to keep things artsy and fun.

In [2]:
fitDataFrame=pd.read_csv("Daily activity metrics.csv")

Nice! Looks like we've loaded your data from the __"Daily activity metrics.csv"__ file into a DataFrame called **fitDataFrame**. 📊

By using the **pd.read_csv()** function from the Pandas library, we've efficiently read the data from the CSV file and converted it into a tabular format - **a DataFrame!** This DataFrame is like a structured spreadsheet that allows us to easily organize, manipulate, and analyze the data.
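Here's a tiny sketch of the same loading step that also parses dates up front. It is an illustration only: the three-row CSV text is a hypothetical stand-in for the real file, `sample_df` is a made-up name, and using `parse_dates` is our assumption rather than something the notebook does.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for "Daily activity metrics.csv".
sample_csv = io.StringIO(
    "Date,Move Minutes count,Calories (kcal)\n"
    "2023-07-30,1,1394.25\n"
    "2023-07-31,76,1628.20\n"
    "2023-08-01,1,916.71\n"
)

# parse_dates asks pandas to convert the Date column to datetime64 while reading,
# instead of leaving it as plain text (dtype "object").
sample_df = pd.read_csv(sample_csv, parse_dates=["Date"])

print(sample_df.dtypes)
print(sample_df.shape)
```

With a real file you would pass the file path instead of the `StringIO` object. Parsing dates at load time makes later date-based filtering and resampling simpler.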

Now, we can dive into the exciting world of data exploration and analysis with the fitDataFrame. Whether it's plotting trends, calculating statistics, or discovering valuable insights, your data journey is just getting started!

Let's keep up the great work together, and have fun uncovering the hidden gems within your Google Fit data! 🌟🚀🔍

In [3]:
fitDataFrame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401 entries, 0 to 400
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Date                        401 non-null    object 
 1   Move Minutes count          325 non-null    float64
 2   Calories (kcal)             400 non-null    float64
 3   Distance (m)                341 non-null    float64
 4   Heart Points                208 non-null    float64
 5   Heart Minutes               208 non-null    float64
 6   Average heart rate (bpm)    58 non-null     float64
 7   Max heart rate (bpm)        58 non-null     float64
 8   Min heart rate (bpm)        58 non-null     float64
 9   Low latitude (deg)          98 non-null     float64
 10  Low longitude (deg)         98 non-null     float64
 11  High latitude (deg)         98 non-null     float64
 12  High longitude (deg)        98 non-null     float64
 13  Average speed (m/s)         338 non

Here's what the output means:

**fitDataFrame** is a DataFrame object from the pandas library.

The DataFrame has a total of **401 entries (rows)** with a RangeIndex starting from 0 to 400.

There are **26 data columns** in the DataFrame, each representing different metrics or features.

The column names and their respective data types are listed under the "Data columns" section.

For each column, the **"Non-Null Count"** indicates how many non-missing (non-null) values are present.

The "Dtype" column represents the data type of each column.

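The "memory usage" line at the end of the full info() output shows the approximate memory consumed by the DataFrame.

From this information, we can see that our dataset has __a mix of numerical and non-numeric data__: Date is stored as object (text), while the metric columns listed are float64. Several columns have missing values, indicated by the difference between the total number of entries (401) and the non-null counts. For example, the three heart rate columns have only 58 non-null values each.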

Now that we have an overview of our data's structure and missing values, we can move on to exploring and analyzing the data further. 📊🕵️‍♀️
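If you'd rather see the gaps as numbers than read them off the info() output, a minimal sketch like the one below can help. The small `sample_df` is a hypothetical stand-in for fitDataFrame, and the `summary` table layout is just one way to present it.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for fitDataFrame.
sample_df = pd.DataFrame({
    "Date": ["2023-07-30", "2023-07-31", "2023-08-01", "2023-08-02"],
    "Step count": [44.0, 7881.0, 117.0, np.nan],
    "Heart Points": [np.nan, 58.0, np.nan, np.nan],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_counts = sample_df.isna().sum()

# The mean of a boolean column is the fraction of True values, i.e. the share missing.
missing_share = sample_df.isna().mean().round(2)

# Combine both views and list the emptiest columns first.
summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
summary = summary.sort_values("missing", ascending=False)
print(summary)
```

Sorting by the missing count puts the sparsest columns at the top, which is handy when deciding what to drop.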

In [4]:
fitDataFrame.tail()

Unnamed: 0,Date,Move Minutes count,Calories (kcal),Distance (m),Heart Points,Heart Minutes,Average heart rate (bpm),Max heart rate (bpm),Min heart rate (bpm),Low latitude (deg),...,Step count,Average weight (kg),Max weight (kg),Min weight (kg),Biking duration (ms),Inactive duration (ms),Walking duration (ms),Running duration (ms),Calisthenics duration (ms),Other duration (ms)
396,2023-07-28,99.0,1563.249975,4764.487783,36.0,36.0,,,,13.713904,...,8218.0,,,,,,3300312.0,,,
397,2023-07-29,59.0,1581.959861,3136.019941,28.0,28.0,,,,13.743002,...,5420.0,,,,,,3665686.0,,,
398,2023-07-30,1.0,1394.250001,24.881966,,,,,,,...,44.0,,,,,,,,,
399,2023-07-31,76.0,1628.198926,4651.878158,58.0,58.0,,,,13.817955,...,7881.0,,,,,,4568666.0,,,
400,2023-08-01,1.0,916.707601,33.929954,,,,,,,...,117.0,,,,,,,,,


The output shows the last five rows of the fitDataFrame, giving us a glimpse of the most recent data. Here's what each column represents:

__Date__: The date of the recorded activity metrics.

__Move Minutes count__: The number of minutes with movement detected.

__Calories (kcal)__: The number of calories burned during the recorded activity.

__Distance (m)__: The distance covered in meters during the activity.

__Heart Points__: Points earned based on heart rate activity.

__Heart Minutes__: The number of minutes with heart rate activity.

__Average heart rate (bpm)__: The average heart rate in beats per minute.

__Max heart rate (bpm)__: The maximum heart rate recorded in beats per minute.

__Min heart rate (bpm)__: The minimum heart rate recorded in beats per minute.

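__Low latitude (deg) and Low longitude (deg)__: The smallest latitude and longitude recorded that day, most likely one corner of a bounding box around the day's locations.

__High latitude (deg) and High longitude (deg)__: The largest latitude and longitude recorded that day, most likely the opposite corner of that bounding box.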

__Average speed (m/s)__: The average speed during the activity in meters per second.

__Max speed (m/s)__: The maximum speed recorded during the activity in meters per second.

__Min speed (m/s)__: The minimum speed recorded during the activity in meters per second.

__Step count__: The number of steps taken during the activity.

__Average weight (kg), Max weight (kg), and Min weight (kg)__: Body weight metrics in kilograms.

__Biking duration (ms), Inactive duration (ms), Walking duration (ms), Running duration (ms), Calisthenics duration (ms), and Other duration (ms)__: Durations in milliseconds for specific activities.

Remember that some columns may have missing values (NaN) since not all metrics are recorded for every date. This tail view helps us quickly assess the most recent data in our DataFrame.🕵️‍♂️📈
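Since the duration columns are in milliseconds, they can be hard to read at a glance. Here is a small sketch of converting one of them to minutes. The two sample values are taken from the tail output above; the `sample_df` name and the new "Walking duration (min)" column are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Two rows borrowed from the tail() output: one with a walking duration, one without.
sample_df = pd.DataFrame({
    "Date": ["2023-07-28", "2023-07-30"],
    "Walking duration (ms)": [3300312.0, np.nan],
})

MS_PER_MINUTE = 60_000  # 1000 ms per second * 60 seconds per minute

# Division is applied element by element; NaN stays NaN.
sample_df["Walking duration (min)"] = (
    sample_df["Walking duration (ms)"] / MS_PER_MINUTE
)

print(sample_df)
```

So 3,300,312 ms works out to roughly 55 minutes of walking, which is much easier to reason about.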

In [5]:
fitDataFrame.drop(['Average heart rate (bpm)','Max heart rate (bpm)','Min heart rate (bpm)','Low latitude (deg)','Low longitude (deg)','High latitude (deg)','High longitude (deg)',
                   'Max weight (kg)','Average weight (kg)','Min weight (kg)','Biking duration (ms)'],
                  axis=1,inplace=True)
fitDataFrame

Unnamed: 0,Date,Move Minutes count,Calories (kcal),Distance (m),Heart Points,Heart Minutes,Average speed (m/s),Max speed (m/s),Min speed (m/s),Step count,Inactive duration (ms),Walking duration (ms),Running duration (ms),Calisthenics duration (ms),Other duration (ms)
0,2021-10-19,9.0,152.299999,,9.0,9.0,,,,,,,,685770.0,
1,2021-10-20,15.0,206.899996,,15.0,15.0,,,,,,,,993930.0,
2,2021-10-21,8.0,163.000000,,8.0,8.0,,,,,,,,582691.0,
3,2021-10-25,10.0,146.699994,,10.0,10.0,,,,,,,,715140.0,
4,2021-10-26,392.0,87.599998,,392.0,392.0,,,,,,,,23670167.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
396,2023-07-28,99.0,1563.249975,4764.487783,36.0,36.0,0.953253,3.476047,0.179258,8218.0,,3300312.0,,,
397,2023-07-29,59.0,1581.959861,3136.019941,28.0,28.0,0.597359,2.178257,0.093243,5420.0,,3665686.0,,,
398,2023-07-30,1.0,1394.250001,24.881966,,,0.282750,0.282750,0.282750,44.0,,,,,
399,2023-07-31,76.0,1628.198926,4651.878158,58.0,58.0,0.804972,1.989570,0.116951,7881.0,,4568666.0,,,


#### Cleaning the Data for Insightful Analysis

Before we embark on our data analysis adventure, it's essential to tidy up the data and make it squeaky clean! Data cleaning is like decluttering our dataset and keeping only what matters most for our research. We want our data to be meaningful, relevant, and free from unnecessary noise.

As we examined the data info, we noticed that some columns are mostly empty: the average, max, and min heart rate columns hold only 58 non-null values out of 401 rows, and the latitude and longitude columns only 98. Since our focus is not on geographical data, we can safely bid farewell to the latitude and longitude columns. The cell above also drops the three weight columns and Biking duration (ms). By cleaning our data and stripping away irrelevant elements, we're setting the stage for an insightful analysis that will bring out the true essence of our dataset. 📊🧹
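As a hedged alternative to listing every column by hand with inplace=True, the sketch below shows two other ways to trim columns. Everything here is illustrative: `sample_df` is a hypothetical stand-in, the heart rate value of 72.0 is made up, and the "at least half filled" threshold is an assumption rather than a rule from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for fitDataFrame; the heart rate value is invented.
sample_df = pd.DataFrame({
    "Date": ["2023-07-28", "2023-07-29", "2023-07-30", "2023-07-31"],
    "Step count": [8218.0, 5420.0, 44.0, 7881.0],
    "Average heart rate (bpm)": [np.nan, np.nan, np.nan, 72.0],
    "Low latitude (deg)": [13.713904, 13.743002, np.nan, 13.817955],
})

# Option 1: drop named columns and get a new DataFrame back.
# Without inplace=True the original sample_df is left untouched.
trimmed = sample_df.drop(columns=["Low latitude (deg)"])

# Option 2: keep only columns with at least `min_non_null` real values.
min_non_null = len(sample_df) // 2  # assumed threshold: half of the rows
dense = sample_df.dropna(axis=1, thresh=min_non_null)

print(list(trimmed.columns))
print(list(dense.columns))
```

Returning a new DataFrame keeps the raw data around in case you want to revisit a dropped column later. The threshold approach removes sparse columns automatically, but it will not remove well-populated columns you simply don't need.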

In [9]:
fitDataFrame.fillna(0, inplace=True)
fitDataFrame

Unnamed: 0,Date,Move Minutes count,Calories (kcal),Distance (m),Heart Points,Heart Minutes,Average speed (m/s),Max speed (m/s),Min speed (m/s),Step count,Inactive duration (ms),Walking duration (ms),Running duration (ms),Calisthenics duration (ms),Other duration (ms)
0,2021-10-19,9.0,152.299999,0.000000,9.0,9.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,685770.0,0.0
1,2021-10-20,15.0,206.899996,0.000000,15.0,15.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,993930.0,0.0
2,2021-10-21,8.0,163.000000,0.000000,8.0,8.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,582691.0,0.0
3,2021-10-25,10.0,146.699994,0.000000,10.0,10.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,715140.0,0.0
4,2021-10-26,392.0,87.599998,0.000000,392.0,392.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,23670167.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
396,2023-07-28,99.0,1563.249975,4764.487783,36.0,36.0,0.953253,3.476047,0.179258,8218.0,0.0,3300312.0,0.0,0.0,0.0
397,2023-07-29,59.0,1581.959861,3136.019941,28.0,28.0,0.597359,2.178257,0.093243,5420.0,0.0,3665686.0,0.0,0.0,0.0
398,2023-07-30,1.0,1394.250001,24.881966,0.0,0.0,0.282750,0.282750,0.282750,44.0,0.0,0.0,0.0,0.0,0.0
399,2023-07-31,76.0,1628.198926,4651.878158,58.0,58.0,0.804972,1.989570,0.116951,7881.0,0.0,4568666.0,0.0,0.0,0.0



We've filled any missing values (NaN) in the fitDataFrame with zeros. Here's a breakdown of what we've done:

**fillna(0)**: The fillna() method is used to fill missing values in a DataFrame with specified values. In this case, we've used fillna(0) to replace any NaN (missing) values with zeros.

**inplace=True**: As before, inplace=True means the fill operation is applied directly to the original DataFrame, modifying it in place.

By filling missing values with zeros, we've effectively plugged the gaps in our data. This is reasonable where a missing value genuinely means "nothing happened", such as no running on a given day. Keep in mind, though, that every filled zero now counts as a real measurement, so it pulls down averages and minimums for columns such as the speed metrics, where a gap more likely means "not recorded".

Now our fitDataFrame is complete and ready for further exploration and analysis, with no more missing values to worry about. 🚀🔍
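To see how much a zero fill can shift a statistic, here is a small sketch using the Heart Points values from the last five rows shown earlier. The variable names and the per-column fill at the end are our own illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Heart Points and Walking duration for the last five rows of the tail() output.
sample_df = pd.DataFrame({
    "Heart Points": [36.0, 28.0, np.nan, 58.0, np.nan],
    "Walking duration (ms)": [3300312.0, 3665686.0, np.nan, 4568666.0, np.nan],
})

# By default mean() skips NaN, so only the three recorded days are averaged.
mean_skipping_nan = sample_df["Heart Points"].mean()

# After fillna(0) the two missing days count as zero and drag the mean down.
mean_with_zeros = sample_df["Heart Points"].fillna(0).mean()

print(round(mean_skipping_nan, 2), round(mean_with_zeros, 2))

# Passing a dict fills only the named columns and leaves the others as NaN.
filled = sample_df.fillna({"Walking duration (ms)": 0})
print(filled)
```

Filling per column lets you use zero where it means "no activity" while keeping NaN where it means "not measured".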

In [10]:
fitDataFrame.describe()

Unnamed: 0,Date,Move Minutes count,Calories (kcal),Distance (m),Heart Points,Heart Minutes,Average speed (m/s),Max speed (m/s),Min speed (m/s),Step count,Inactive duration (ms),Walking duration (ms),Running duration (ms),Calisthenics duration (ms),Other duration (ms)
count,401,401.0,401.0,401.0,401.0,401.0,401.0,401.0,401.0,401.0,401.0,401.0,401.0,401.0,401.0
mean,2023-01-06 01:54:54.763092224,35.78803,1441.384072,1413.356217,13.007481,12.837905,0.42195,1.691777,0.173715,2397.593516,26183.0,1491422.0,3674.23192,184656.3,257356.6
min,2021-10-19 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2022-10-05 00:00:00,1.0,1394.250001,21.488971,0.0,0.0,0.2639,0.28275,0.000162,119.0,0.0,0.0,0.0,0.0,0.0
50%,2023-01-13 00:00:00,13.0,1432.553801,153.250295,2.0,2.0,0.386338,0.984549,0.254475,520.0,0.0,283681.0,0.0,0.0,0.0
75%,2023-04-23 00:00:00,54.0,1564.453145,2488.215878,20.0,20.0,0.587589,1.545567,0.28275,4214.0,0.0,2452866.0,0.0,0.0,0.0
max,2023-08-01 00:00:00,451.0,2213.464527,9554.063741,392.0,392.0,4.014492,50.619999,0.998802,14553.0,6733868.0,10838790.0,543243.0,23670170.0,23280000.0
std,,53.009692,280.050073,2004.353075,25.757571,25.479328,0.378187,3.915655,0.142599,3229.073094,384867.5,2149652.0,32147.762889,1240994.0,1580600.0



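The describe() method provides statistical summaries for each numerical column in the DataFrame. Date appears here too, with date-valued statistics, which suggests it was converted to a datetime type before this cell ran. Here's what each row represents: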

**count**: The number of non-missing values in each column.

**mean**: The average (mean) value of each column.

**std**: The standard deviation, which measures the spread or variability of the data.

**min**: The minimum value in each column.

**25%**: The first quartile or 25th percentile value.

**50%**: The second quartile or median value.

**75%**: The third quartile or 75th percentile value.

**max**: The maximum value in each column.

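These summary statistics offer valuable insights into the distribution and variation of the data. For instance, Move Minutes count averages about 35.8 per day with a median of 13, and Step count averages about 2,398 with a median of 520. A mean well above the median points to a few very active days pulling the average up. Remember that these figures include the zeros we filled in earlier.

With this overview, we can quickly identify the range of values, potential outliers, and patterns in the data, which can guide our further exploration and analysis. One value worth a second look is the Max speed (m/s) maximum of 50.62, roughly 182 km/h, which is far above the 75th percentile of about 1.55. 🚀📊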
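One common way to turn "potential outliers" into something concrete is the interquartile range (IQR) rule. The sketch below is a minimal illustration, not the notebook's method: it uses four Max speed values from the rows shown earlier plus the maximum reported by describe(), and the 1.5 multiplier is the conventional choice rather than anything specific to this dataset.

```python
import pandas as pd

# Four Max speed values from the displayed rows, plus the maximum from describe().
max_speed = pd.Series(
    [3.476047, 2.178257, 0.282750, 1.989570, 50.619999],
    name="Max speed (m/s)",
)

# Quartiles and the interquartile range.
q1 = max_speed.quantile(0.25)
q3 = max_speed.quantile(0.75)
iqr = q3 - q1

# Values above Q3 + 1.5 * IQR are flagged as unusually high.
upper_fence = q3 + 1.5 * iqr
outliers = max_speed[max_speed > upper_fence]

print(f"upper fence: {upper_fence:.2f} m/s")
print(outliers)
```

A flagged value is only a candidate for inspection. Whether it is a sensor glitch or a genuine reading (say, a car ride logged as activity) is a judgment call that needs a look at the underlying day.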

