# Final Project: Analyzing Runkeeper Fitness Data

**Completed By Team 6:** Duggu Aishwarya, Colin Barker, Anna Enge, Carlos Ribadeneira Espinoza, Nathan Weeks

One day, my old running friend and I were chatting about our running styles, training habits, and achievements, when I suddenly realized that I could take an in-depth analytical look at my training. I have been using a popular GPS fitness tracker called Runkeeper for years and decided it was time to analyze my running data to see how I was doing.

Since 2012, I've been using the Runkeeper app, and it's great. One key feature: its excellent data export. Anyone who has a smartphone can download the app and analyze their data like we will in this notebook.

After logging your run, the first step is to export the data from Runkeeper (which I've done already). Then import the data and start exploring to find potential problems. After that, create data cleaning strategies to fix the issues. Finally, analyze and visualize the clean time-series data.

I exported seven years' worth of my training data, from 2012 through 2018. The data is a CSV file where each row is a single training activity. Let's load and inspect it.

### Part 1.
- Import pandas under the alias `pd` and matplotlib.pyplot as `plt`
- Use the `read_csv()` function to load the dataset (`cardioActivities.csv`) into a variable called `df_activities`. Parse the dates with the `parse_dates` parameter and set the index to the `Date` column using the `index_col` parameter.
- Display 3 random rows from `df_activities` using the `sample()` method.
- Print a summary of `df_activities` using the `info()` method.

In [4]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Define file containing dataset
file = 'cardioActivities.csv'

# Create DataFrame, parsing the Date column as datetimes
df_activities = pd.read_csv(file, parse_dates=['Date'])

# Note: set_index() returns a new DataFrame; the result is not assigned here,
# so Date remains a regular column in the outputs below
df_activities.set_index('Date')

# First look at exported data: show the first 3 rows
df_activities.head(3)



Unnamed: 0,Date,Activity Id,Type,Route Name,Distance (km),Duration,Average Pace,Average Speed (km/h),Calories Burned,Climb (m),Average Heart Rate (bpm),Friend's Tagged,Notes,GPX File
0,2018-11-11 14:05:12,c9627fed-14ac-47a2-bed3-2a2630c63c15,Running,,10.44,58:40,5:37,10.68,774.0,130,159.0,,,2018-11-11-140512.gpx
1,2018-11-09 15:02:35,be65818d-a801-4847-a43b-2acdf4dc70e7,Running,,12.84,1:14:12,5:47,10.39,954.0,168,159.0,,,2018-11-09-150235.gpx
2,2018-11-04 16:05:00,c09b2f92-f855-497c-b624-c196b3ef036c,Running,,13.01,1:15:16,5:47,10.37,967.0,171,155.0,,,2018-11-04-160500.gpx
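
The loading step described in Part 1 can be sketched on a tiny in-memory CSV. The `csv_text` string below is an illustrative stand-in for `cardioActivities.csv` (two rows taken from the output above, most columns omitted), and `df_demo` and `df_alt` are hypothetical names. Treat it as a minimal sketch of `parse_dates` and `index_col`, not as the notebook's actual cell.

```python
import io
import pandas as pd

# Illustrative stand-in for cardioActivities.csv (two rows, three columns).
csv_text = """Date,Type,Distance (km)
2018-11-11 14:05:12,Running,10.44
2018-11-09 15:02:35,Running,12.84
"""

# parse_dates converts the Date strings to datetimes; index_col makes Date the index.
df_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# Equivalent two-step form: set_index() returns a new DataFrame, so keep the result.
df_alt = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
df_alt = df_alt.set_index('Date')

print(df_demo.sample(1))  # one random row
df_demo.info()            # dtypes and non-null counts per column
```

With a `DatetimeIndex` in place, date strings such as `'2018'` can later be used to slice rows.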


In [9]:
# Print summary statistics for the numeric columns
df_activities.describe()

Unnamed: 0,Distance (km),Average Speed (km/h),Calories Burned,Climb (m),Average Heart Rate (bpm),Friend's Tagged
count,508.0,508.0,508.0,508.0,294.0,0.0
mean,11.757835,11.341654,18781.97,128.0,143.530612,
std,6.209219,2.510516,218693.0,108.52604,10.583848,
min,0.76,1.04,40.0,0.0,77.0,
25%,7.015,10.47,491.75,53.0,140.0,
50%,11.46,11.03,728.0884,92.0,144.0,
75%,13.6425,11.6425,921.25,172.25,149.0,
max,49.18,24.33,4072685.0,982.0,172.0,


Lucky for us, the column names Runkeeper provides are informative, and we don't need to rename any columns.

However, the summary above does reveal missing values: the `count` row shows only 294 heart rate entries for 508 activities, and none at all for Friend's Tagged. What are the reasons for these missing values? It depends. Some heart rate information is missing because I didn't always use a cardio sensor. The Notes column is an optional field that I sometimes left blank. Also, I only used the Route Name column once, and never used the Friend's Tagged column.

We'll fill in missing values in the heart rate column to avoid misleading results later, but right now, our first data preprocessing steps will be to:

- Remove columns not useful for our analysis.
- Rename the "Other" activity type to "Unicycling", because that was always the "Other" activity.
- Count missing values.

### Part 2.
- Delete unnecessary columns from `df_activities` with the `drop()` method, setting the columns parameter to a list called `cols_to_drop`
- Calculate the activity type counts using the `value_counts()` method on the `Type` column
- Rename the 'Other' values to 'Unicycling' in the `Type` column using `str.replace()`
- Count the missing values in each column using `isnull().sum()`

In [5]:
# Define list of columns to be deleted
cols_to_drop = ["Friend's Tagged",'Route Name','GPX File','Activity Id','Calories Burned', 'Notes']

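# Delete unnecessary columns
# Note: drop() returns a new DataFrame; the result is not assigned here,
# so these columns still appear in the outputs below
df_activities.drop(columns=cols_to_drop)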

# Count types of training activities
df_activities.Type.value_counts()

# Rename 'Other' type to 'Unicycling'
df_activities['Type'] = df_activities['Type'].str.replace('Other', 'Unicycling')


In [6]:
# Count missing values for each column
df_activities.isnull().sum()

Date                          0
Activity Id                   0
Type                          0
Route Name                  507
Distance (km)                 0
Duration                      0
Average Pace                  0
Average Speed (km/h)          0
Calories Burned               0
Climb (m)                     0
Average Heart Rate (bpm)    214
Friend's Tagged             508
Notes                       277
GPX File                      4
dtype: int64
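
The Part 2 steps can be sketched on a few made-up rows. The `df_demo` frame and its values are invented for illustration; only the column names come from the export. The sketch assumes you want the column removal to persist, so it assigns the result of `drop()` back.

```python
import numpy as np
import pandas as pd

# Made-up rows using column names from the export.
df_demo = pd.DataFrame({
    'Type': ['Running', 'Other', 'Cycling', 'Running'],
    'Notes': [np.nan, 'windy', np.nan, np.nan],
    'Average Heart Rate (bpm)': [159.0, np.nan, 118.0, np.nan],
})

# drop() returns a new DataFrame; assign it back to make the change stick.
df_demo = df_demo.drop(columns=['Notes'])

# Rename the 'Other' activity type.
df_demo['Type'] = df_demo['Type'].str.replace('Other', 'Unicycling')

# Count activities per type and missing values per column.
type_counts = df_demo['Type'].value_counts()
missing = df_demo.isnull().sum()
print(type_counts)
print(missing)
```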

As we can see from the last output, there are 214 missing entries for my average heart rate.

We can't go back in time to get that data, but we can fill in the missing values with an average value. This process is called mean imputation. When imputing the mean to fill in missing data, we need to consider that the average heart rate varies for different activities (e.g., walking vs. running). We'll filter the DataFrame by activity type (Type) and calculate each activity's mean heart rate, then fill in the missing values with those means.
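
As a compact illustration of per-activity mean imputation, the sketch below uses `groupby().transform('mean')` on made-up values. This is an alternative to the per-type DataFrames built in Part 3, not the approach the notebook itself takes, and `df_demo`, `hr`, and `group_means` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up heart rates with one gap per activity type.
df_demo = pd.DataFrame({
    'Type': ['Running', 'Running', 'Running', 'Cycling', 'Cycling'],
    'Average Heart Rate (bpm)': [150.0, 160.0, np.nan, 120.0, np.nan],
})
hr = 'Average Heart Rate (bpm)'

# Mean heart rate of each activity type, aligned row by row with df_demo.
group_means = df_demo.groupby('Type')[hr].transform('mean')

# Fill each gap with the mean of its own activity type.
df_demo[hr] = df_demo[hr].fillna(group_means)
print(df_demo)
```

The missing running value becomes 155.0 (the mean of 150 and 160), while the missing cycling value becomes 120.0, so each gap is filled from its own group rather than from the overall mean.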

### Part 3.
- Calculate the sample mean for `'Average Heart Rate (bpm)'` for the `'Cycling'` activity type. Assign the result to `avg_hr_cycle`. Do the same for the `'Running'` activity type and assign it to `avg_hr_run`
- Filter the `df_activities` for the `'Cycling'` activity type. Create a copy of the result using `copy()` and assign the copy to `df_cycle`. Do the same for the `'Running'` and `'Walking'` activity types, calling them `df_run` and `df_walk`
- Fill in the missing values for `'Average Heart Rate (bpm)'` in `df_cycle` with `int(avg_hr_cycle)` using the `fillna()` method. Do the same for the `df_run` using `int(avg_hr_run)`. Fill the missing heart rates in `df_walk` with 110. **Note:** Remember to set `inplace=True`!
- Count the missing values for all columns in `df_run`

In [7]:
# Calculate sample means for heart rate for each training activity type 

avg_hr = df_activities['Average Heart Rate (bpm)'].groupby(df_activities['Type']).mean()
avg_hr_cycle = avg_hr['Cycling']
avg_hr_running = avg_hr['Running']



In [26]:

# Split whole DataFrame into several, specific for different activities

# Cycling
df_cycle = df_activities[df_activities['Type'] == 'Cycling'].copy()

df_cycle.head()


Unnamed: 0,Date,Activity Id,Type,Route Name,Distance (km),Duration,Average Pace,Average Speed (km/h),Calories Burned,Climb (m),Average Heart Rate (bpm),Friend's Tagged,Notes,GPX File
8,2018-10-06 16:45:02,4c163abe-3a57-42fd-b50b-7f365960cbd4,Cycling,,19.63,1:26:26,4:24,13.63,577.0,210,79.0,,,2018-10-06-164502.gpx
10,2018-09-16 14:55:03,30aaa821-1d3a-4f2f-9688-8543cebbd6e8,Cycling,,32.61,1:55:15,3:32,16.98,830.0,462,118.0,,,2018-09-16-145503.gpx
12,2018-09-01 17:06:15,2bd1841f-b428-4683-a41b-2bfb4be7e908,Cycling,,36.89,1:58:39,3:13,18.65,937.0,491,122.0,,,2018-09-01-170615.gpx
13,2018-08-28 18:44:33,c9a8e088-441d-4b3f-bfbc-287e87585ca7,Cycling,,28.17,1:27:07,3:06,19.4,685.0,400,111.0,,,2018-08-28-184433.gpx
14,2018-08-25 17:18:32,12723b6e-571b-4b68-be17-2c797982d3f9,Cycling,,19.41,1:11:33,3:41,16.28,536.0,199,124.0,,,2018-08-25-171832.gpx


In [24]:
# Running

df_run = df_activities[df_activities['Type'] == 'Running'].copy()
df_run.head()

Unnamed: 0,Date,Activity Id,Type,Route Name,Distance (km),Duration,Average Pace,Average Speed (km/h),Calories Burned,Climb (m),Average Heart Rate (bpm),Friend's Tagged,Notes,GPX File
0,2018-11-11 14:05:12,c9627fed-14ac-47a2-bed3-2a2630c63c15,Running,,10.44,58:40,5:37,10.68,774.0,130,159.0,,,2018-11-11-140512.gpx
1,2018-11-09 15:02:35,be65818d-a801-4847-a43b-2acdf4dc70e7,Running,,12.84,1:14:12,5:47,10.39,954.0,168,159.0,,,2018-11-09-150235.gpx
2,2018-11-04 16:05:00,c09b2f92-f855-497c-b624-c196b3ef036c,Running,,13.01,1:15:16,5:47,10.37,967.0,171,155.0,,,2018-11-04-160500.gpx
3,2018-11-01 14:03:58,bc9b612d-3499-43ff-b82a-9b17b71b8a36,Running,,12.98,1:14:25,5:44,10.47,960.0,169,158.0,,,2018-11-01-140358.gpx
4,2018-10-27 17:01:36,972567b2-1b0e-437c-9e82-fef8078d6438,Running,,13.02,1:12:50,5:36,10.73,967.0,170,154.0,,,2018-10-27-170136.gpx


In [11]:
# Walking
df_walk = df_activities[df_activities['Type'] == 'Walking'].copy()
df_walk.head()

Unnamed: 0,Date,Activity Id,Type,Route Name,Distance (km),Duration,Average Pace,Average Speed (km/h),Calories Burned,Climb (m),Average Heart Rate (bpm),Friend's Tagged,Notes,GPX File
422,2013-08-15 18:49:50,666cbe78-b3d5-4bbb-8a22-2717507d32c2,Walking,,2.48,2:23:46,57:56,1.04,306.0,67,,,,2013-08-15-184950.gpx
425,2013-08-08 07:56:08,9ea3ec6e-48fa-417f-8bdf-2b197e19f5d4,Walking,,1.51,15:24,10:11,5.89,85.0,6,,,,2013-08-08-075608.gpx
442,2013-06-03 07:04:59,d091607e-10a2-467a-93d6-f2c54d47a0d9,Walking,,1.33,11:59,9:03,6.63,76.0,5,,,,2013-06-03-070459.gpx
454,2013-04-29 18:48:30,f157d4ff-bbf3-47e9-b07c-58f692ef9e6f,Walking,,1.37,22:39,16:30,3.64,95.0,10,,,,2013-04-29-184830.gpx
455,2013-04-29 13:10:14,313fbef4-deeb-4da5-b1b2-ecee74cba0f6,Walking,,3.83,38:30,10:04,5.96,255.0,25,,,,2013-04-29-131014.gpx


In [16]:
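# Fill missing values with the computed means (fixed value of 110 for walking)
# Assigning the result back avoids chained in-place modification of a column
df_cycle['Average Heart Rate (bpm)'] = df_cycle['Average Heart Rate (bpm)'].fillna(avg_hr_cycle)
df_run['Average Heart Rate (bpm)'] = df_run['Average Heart Rate (bpm)'].fillna(avg_hr_running)
df_walk['Average Heart Rate (bpm)'] = df_walk['Average Heart Rate (bpm)'].fillna(110)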





In [13]:
# Count missing values for each column in cycling data
df_cycle.isnull().sum()

Date                         0
Activity Id                  0
Type                         0
Route Name                  29
Distance (km)                0
Duration                     0
Average Pace                 0
Average Speed (km/h)         0
Calories Burned              0
Climb (m)                    0
Average Heart Rate (bpm)     0
Friend's Tagged             29
Notes                       21
GPX File                     0
dtype: int64

In [28]:
df_run.isnull().sum()

Date                          0
Activity Id                   0
Type                          0
Route Name                  458
Distance (km)                 0
Duration                      0
Average Pace                  0
Average Speed (km/h)          0
Calories Burned               0
Climb (m)                     0
Average Heart Rate (bpm)      0
Friend's Tagged             459
Notes                       237
GPX File                      4
dtype: int64

In [14]:
df_walk.isnull().sum()

Date                         0
Activity Id                  0
Type                         0
Route Name                  18
Distance (km)                0
Duration                     0
Average Pace                 0
Average Speed (km/h)         0
Calories Burned              0
Climb (m)                    0
Average Heart Rate (bpm)     0
Friend's Tagged             18
Notes                       17
GPX File                     0
dtype: int64

Now we can create our first plot! As we found earlier, most of the activities in my data were running (459 of them to be exact). There are only 29, 18, and two instances for cycling, walking, and unicycling, respectively. So for now, let's focus on plotting the different running metrics.

An excellent first visualization is a figure with four subplots, one for each running metric (each numerical column). Each subplot will have a different y-axis, which is explained in each legend. The x-axis, Date, is shared among all subplots.

### Part 4.
- Subset `df_run` for data from `'2013'` through `'2018'`. Keep in mind that the observations are stored in reverse chronological order, with the most recent records first. Assign the result to `runs_subset_2013_2018`
- In the plotting code, enable `subplots` by setting the subplots parameter to `True`. Set `sharex=False`, `figsize=(12,16)`, `linestyle='none'`, `marker='o'`, and `markersize=3`
- Show the plot using `plt.show()`

In [None]:
# Prepare data subsetting period from 2013 till 2018


# Create, plot and customize in one step


# Show plot
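
The cell above is left unfilled, so here is one way the Part 4 steps could look, sketched on made-up data. The `df_demo` rows and the `runs_by_date` name are invented. The sketch also assumes Date is still a regular column, so it sets the index and sorts it ascending before slicing by year instead of relying on the most-recent-first order.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up runs, most recent first like the export.
df_demo = pd.DataFrame({
    'Date': pd.to_datetime(['2018-11-11', '2016-05-01', '2013-03-10', '2012-09-02']),
    'Distance (km)': [10.44, 12.0, 8.5, 5.0],
    'Average Speed (km/h)': [10.68, 10.9, 11.2, 10.1],
    'Climb (m)': [130, 150, 90, 40],
    'Average Heart Rate (bpm)': [159.0, 150.0, 145.0, 144.0],
})

# Date-string slicing needs a DatetimeIndex; sorting ascending keeps the slice unambiguous.
runs_by_date = df_demo.set_index('Date').sort_index()
runs_subset_2013_2018 = runs_by_date.loc['2013':'2018']

# One subplot per numeric column, drawn with markers only.
axes = runs_subset_2013_2018.plot(subplots=True, sharex=False, figsize=(12, 16),
                                  linestyle='none', marker='o', markersize=3)
plt.show()
```

With `subplots=True`, pandas draws each column on its own axis and uses the column name as the legend entry, which is why each subplot gets its own y-axis scale.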


No doubt, running helps people stay mentally and physically healthy and productive at any age. And it is great fun! When runners talk to each other about their hobby, we not only discuss our results, but we also discuss different training strategies.

You'll know you're with a group of runners if you commonly hear questions like:

- What is your average distance?
- How fast do you run?
- Do you measure your heart rate?
- How often do you train?

Let's find the answers to these questions in my data. If you look back at the plots in Part 4, you can see the answer to "Do you measure your heart rate?": before 2015, no. To look at the averages, let's only use the data from 2015 through 2018.

In pandas, the `resample()` method is similar to the `groupby()` method - with `resample()` you group by a specific time span. We'll use `resample()` to group the time series data by a sampling period and apply several methods to each sampling period. In our case, we'll resample annually and weekly.
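
To make the comparison with `groupby()` concrete, here is a small sketch on made-up distances. The `runs` frame and its dates are invented. The `'YS'` (year start) and `'W'` (weekly, ending Sunday) frequency aliases are one possible choice for annual and weekly sampling, not necessarily the ones used later in the notebook.

```python
import pandas as pd

# Made-up run distances indexed by date.
dates = pd.to_datetime(['2017-01-02', '2017-01-04', '2017-01-10', '2018-03-05'])
runs = pd.DataFrame({'Distance (km)': [10.0, 12.0, 8.0, 14.0]}, index=dates)

# Group by calendar year ('YS' labels each bin by its first day) and average.
annual_mean = runs.resample('YS').mean()

# Group by week (bins end on Sunday by default) and total the distance.
weekly_total = runs.resample('W').sum()

# The groupby counterpart of the annual resample groups on the index year.
annual_by_groupby = runs.groupby(runs.index.year).mean()

print(annual_mean)
print(weekly_total.head())
```

One difference worth noting: `resample()` also emits the empty periods between the first and last date (weeks without any run show a total of 0), whereas `groupby()` only returns groups that actually occur in the data.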


### Part 5.
- Subset `df_run` from March 2015 through 2018 then select the `'Average Heart Rate (bpm)'` column. Assign the result to `df_run_hr_all`
- Create a plot with `plt.subplots()`, setting `figsize` to `(8,5)`. Assign the result to `fig, ax`
- Create customized x-axis ticks with `ax.xaxis.set()` by passing `hr_zones` to the `ticks` parameter. Use `ax.set_xticklabels()` and set the parameters `labels` to `zone_names`, `rotation` to `-30`, and `ha` to `'left'`. Use `ax.set()` to set the `title` as `'Distribution of HR'` and the `ylabel` as `'Number of runs'`
- Show the plot with `plt.show()`

In [None]:
# Prepare data
hr_zones = [100, 125, 133, 142, 151, 173]
zone_names = ['Easy', 'Moderate', 'Hard', 'Very hard', 'Maximal']
zone_colors = ['green', 'yellow', 'orange', 'tomato', 'red']


# Create plot


# Plot and customize
n, bins, patches = ax.hist(df_run_hr_all, bins=hr_zones, alpha=0.5)
for i in range(0, len(patches)):
    patches[i].set_facecolor(zone_colors[i])



# Show plot
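
Since the cell above is incomplete, the following sketch shows how the zone histogram could be assembled end to end. The `hr_demo` series is made up and stands in for `df_run_hr_all`. One deliberate deviation from the task text: recent Matplotlib versions raise an error when the number of tick labels differs from the number of ticks, so the sketch places the five labels at the lower edge of each zone (`hr_zones[:-1]`) rather than on all six boundaries.

```python
import matplotlib.pyplot as plt
import pandas as pd

hr_zones = [100, 125, 133, 142, 151, 173]
zone_names = ['Easy', 'Moderate', 'Hard', 'Very hard', 'Maximal']
zone_colors = ['green', 'yellow', 'orange', 'tomato', 'red']

# Made-up heart rates standing in for df_run_hr_all.
hr_demo = pd.Series([110, 128, 130, 140, 145, 148, 150, 155, 160])

fig, ax = plt.subplots(figsize=(8, 5))

# One bar per zone; the bin edges are the zone boundaries.
n, bins, patches = ax.hist(hr_demo, bins=hr_zones, alpha=0.5)
for patch, color in zip(patches, zone_colors):
    patch.set_facecolor(color)

# Label each zone at its lower edge so the label count matches the tick count.
ax.set_xticks(hr_zones[:-1])
ax.set_xticklabels(zone_names, rotation=-30, ha='left')
ax.set(title='Distribution of HR', ylabel='Number of runs')
plt.show()
```

Passing the zone boundaries as `bins` gives bars of unequal width, so the bar heights count the runs per training zone rather than per fixed-width heart rate interval.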



With all this data cleaning, analysis, and visualization, let's create detailed summary tables of my training.

To do this, we'll create two tables. The first table will be a summary of the distance (km) and climb (m) variables for each training activity. The second table will list the summary statistics for the average speed (km/h), climb (m), and distance (km) variables for each training activity.

### Part 6.
- Concatenate the `df_run` DataFrame with `df_walk` and `df_cycle` using `append()` (removed in pandas 2.0, where `pd.concat()` is the replacement), then `sort` based on the index in descending order. Assign the result to `df_run_walk_cycle`
- Group `df_run_walk_cycle` by activity type, then select the columns in `dist_climb_cols`. Sum the result using `sum()`. Assign the result to `df_totals` and print the result.
- Use the `stack()` method on `df_summary` to show a compact reshaped form of the full summary report.
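
How `describe()`, the added totals, and `stack()` fit together can be sketched on a handful of made-up activities. The `df_demo` values and the `stacked` name are invented for illustration; the structure mirrors the steps listed above.

```python
import pandas as pd

# Made-up activities: two runs and two rides.
df_demo = pd.DataFrame({
    'Type': ['Running', 'Running', 'Cycling', 'Cycling'],
    'Distance (km)': [10.0, 12.0, 20.0, 30.0],
    'Climb (m)': [100, 140, 200, 300],
})
cols = ['Distance (km)', 'Climb (m)']

# Totals per activity type.
df_totals = df_demo.groupby('Type')[cols].sum()

# describe() yields a column MultiIndex of (variable, statistic).
df_summary = df_demo.groupby('Type')[cols].describe()

# Add the totals as one more statistic under each variable.
for col in cols:
    df_summary[col, 'total'] = df_totals[col]

# stack() moves the statistic level from the columns into the row index.
stacked = df_summary.stack()
print(stacked)
```

After stacking, each row is identified by an (activity type, statistic) pair and each variable keeps its own column, which is the compact layout shown in the output below.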

In [90]:
dist_climb_cols, speed_col = ['Distance (km)', 'Climb (m)'], ['Average Speed (km/h)']

# Concatenate three DataFrames (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_run_walk_cycle = pd.concat([df_run, df_walk, df_cycle], sort=True, ignore_index=True)


# Calculate total distance and climb in each type of activities
df_totals = df_run_walk_cycle[dist_climb_cols].groupby(df_run_walk_cycle['Type']).sum()
print(df_totals)
print()


# Calculating summary statistics for each type of activities 

dist_climb_speed_col = ['Distance (km)', 'Climb (m)', 'Average Speed (km/h)']
df_summary = df_run_walk_cycle[dist_climb_speed_col].groupby(df_run_walk_cycle['Type']).describe()


# Combine totals with summary
for i in dist_climb_cols:
    df_summary[i,'total'] = df_totals[i]
    
    
# Stack and print summary statistics
print('----------------------------------------')
print(df_summary.stack())

         Distance (km)  Climb (m)
Type                             
Cycling         680.58       6976
Running        5224.50      57278
Walking          33.45        349

----------------------------------------
               Average Speed (km/h)     Climb (m)  Distance (km)
Type                                                            
Cycling 25%               16.980000    139.000000      15.530000
        50%               19.500000    199.000000      20.300000
        75%               21.490000    318.000000      29.400000
        count             29.000000     29.000000      29.000000
        max               24.330000    553.000000      49.180000
        mean              19.125172    240.551724      23.468276
        min               11.380000     58.000000      11.410000
        std                3.257100    128.960289       9.451040
        total                   NaN   6976.000000     680.580000
Running 25%               10.495000     54.000000       7.415000
        