# Exploring my Running Data to Find my Most Grueling Run Ever (Based on Numbers, of Course)
Here I will be exploring my college running data from 2017 to 2022. I'm doing this as a sort of "last hurrah" now that I am officially a washed-up runner with a cycling and snowboarding problem.

In [None]:
import pandas as pd
from collections import Counter
import seaborn as sns
import statistics

In [None]:
# The yearly exports are named "Activities 2017.csv" through "Activities 2022.csv".
data_path = "Activities 20"
frames = [pd.read_csv(data_path + str(year) + ".csv") for year in range(17, 23)]
df = pd.concat(frames, ignore_index=True)

In [None]:
df.head()

## Feature Engineering

Let's have a look at the columns we have in our dataset and decide which ones may be useful to explore.

In [None]:
df.columns

In [None]:
df.info()

To give a good understanding of my running trends and eventually find the most grueling training week of my career, I want to focus on the distance, speed, elevation gain/loss, heart rate, cadence, and temperature of my runs. I admittedly have quite a bit of domain knowledge when it comes to running, and these are oftentimes the most crucial stats for quantifying running performances. Let's clean up the columns containing this data and drop the ones we don't need.

In [None]:
cols = ['Date', 'Title', 'Distance',
       'Calories', 'Time', 'Avg HR', 'Max HR', 'Avg Run Cadence',
       'Max Run Cadence', 'Avg Pace', 'Best Pace', 'Total Ascent',
       'Total Descent', 'Avg Stride Length', 'Min Temp', 'Max Temp', 'Min Elevation',
       'Max Elevation']
cols_to_check = ['Calories', 'Total Ascent', 'Total Descent', 'Min Elevation', 'Max Elevation']
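# Work on a copy so the column assignments below don't write into a view of the original frame.
df = df[cols].copy()
# Strip thousands separators (e.g. "1,200" -> "1200").
df[cols_to_check] = df[cols_to_check].replace({',':''}, regex=True)
# Garmin writes '--' for missing values; the dict form makes the swap to None explicit.
df = df.replace({'--': None})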

In [None]:
for col in cols_to_check:
  med = statistics.median(df[col].dropna().astype(int))
  df[col] = df[col].fillna(med).astype(int)
df
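As a quick sanity check of the cleaning steps above, here is a minimal sketch on a made-up three-row frame (the values are invented for illustration, not taken from my export). It uses `pd.to_numeric(..., errors="coerce")` as a stand-in for the `replace` calls in the notebook, so treat it as an alternative route to the same result rather than the exact code above.

```python
import pandas as pd

# Toy stand-in for one Garmin column (values are made up).
toy = pd.DataFrame({"Calories": ["1,200", "--", "800"]})

# Strip thousands separators, then coerce anything non-numeric ('--') to NaN.
no_commas = toy["Calories"].str.replace(",", "", regex=False)
toy["Calories"] = pd.to_numeric(no_commas, errors="coerce")

# Fill the gap with the median of the values we do have.
median_calories = toy["Calories"].median()
toy["Calories"] = toy["Calories"].fillna(median_calories).astype(int)
print(toy["Calories"].tolist())
```

The median is used rather than the mean so a handful of very long runs don't drag the fill value upward.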

In [None]:
df.info()

The first thing you may notice is some 0's in the *Avg HR, Max HR, Min Temp*, and *Max Temp* columns. This is because I didn't have a fancy watch with a heart rate monitor in the beginning of college. Easy fixes though!

Let's see how many 0's we're looking at.

In [None]:
Counter(df['Max HR'])[0], Counter(df['Avg HR'])[0]

In [None]:
Counter(df['Min Temp'])[0], Counter(df['Max Temp'])[0]

### Filling in missing HR Data
1,282 samples is a very large portion of the 1,975 total samples in our dataset, so we can't afford to throw those out. We have to find a systematic way to fill those in! Most running training programs follow an "80/20" rule where 80% of your training volume is done at easy paces while the other 20% is hard (i.e. running workouts).

In [None]:
df_hr = df[df['Max HR'] > 0]

Somewhat counterintuitively, the Avg Pace, Avg HR, and Avg Run Cadence won't help us too much in differentiating between my easy runs and workouts. This is because all running workouts involve some kind of recovery, which is either much slower running between reps or complete rest. We need to look at the Max columns for these stats to find the runs that were workouts.
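To make the idea concrete, here is a minimal sketch of flagging workouts from the "best" and "max" columns. The eight runs are invented, and both cut-offs are computed as percentiles of the toy data; the notebook itself hard-codes 199 for the cadence cut, so the numbers here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up runs: two fast, high-cadence sessions and six easy days.
runs = pd.DataFrame({
    "Best Pace": [4.8, 5.0, 6.5, 6.9, 7.0, 7.2, 7.4, 7.5],       # decimal minutes
    "Max Run Cadence": [205, 201, 188, 185, 186, 184, 183, 182],  # steps per minute
})

# Fastest quartile of best pace, top quartile of max cadence.
pace_cut = np.percentile(runs["Best Pace"], 25)
cadence_cut = np.percentile(runs["Max Run Cadence"], 75)

# A run counts as a workout only if it clears both thresholds.
runs["Workout"] = (runs["Best Pace"] <= pace_cut) & (runs["Max Run Cadence"] >= cadence_cut)
print(runs["Workout"].sum(), "of", len(runs), "runs flagged as workouts")
```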

In [None]:
df_hr.info()

In [None]:
df_hr = df_hr.replace({'--': None})
df_hr = df_hr.dropna()

In [None]:
# Convert "mm:ss" pace strings to decimal minutes (e.g. "6:30" -> 6.5).
df_hr['Best Pace'] = [(60 * int(x.split(':')[0]) + int(x.split(':')[1]))/60 for x in df_hr['Best Pace']]
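The same conversion can be wrapped in a small helper, which makes it easier to reuse and to test. `pace_to_minutes` is a hypothetical name of my own, and the sketch assumes every pace is a plain "mm:ss" string.

```python
def pace_to_minutes(pace: str) -> float:
    """Convert an "mm:ss" pace string to decimal minutes."""
    minutes, seconds = pace.split(":")
    return int(minutes) + int(seconds) / 60


print(pace_to_minutes("6:30"))
print(pace_to_minutes("5:15"))
```

With a helper like this, the list comprehension above could become `df_hr['Best Pace'].map(pace_to_minutes)`.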

In [None]:
import numpy as np
np.percentile(df_hr['Best Pace'], 25), np.percentile(df_hr['Max HR'], 80), np.percentile(df_hr['Max Run Cadence'], 75)

In [None]:
len(df_hr.loc[(df_hr['Best Pace'] <= np.percentile(df_hr['Best Pace'], 25)) & (df_hr['Max Run Cadence'] >= 199)]) / len(df_hr)

Stats we can use to infill 0's for workout runs.

In [None]:
# Workouts: best pace in the fastest quartile and max cadence of at least 199 steps per minute.
df_workouts = df_hr.loc[(df_hr['Best Pace'] <= np.percentile(df_hr['Best Pace'], 25)) & (df_hr['Max Run Cadence'] >= 199)]
avg_work_hr = statistics.median(df_workouts['Avg HR'])
max_work_hr = statistics.median(df_workouts['Max HR'])
print(avg_work_hr, max_work_hr)

Stats to fill in 0's for easy runs.

In [None]:
df_easy = df_hr.loc[(df_hr['Best Pace'] > np.percentile(df_hr['Best Pace'], 25)) & (df_hr['Max Run Cadence'] < 199)]
print(statistics.median(df_easy['Avg HR']), statistics.median(df_easy['Max HR']))
avg_easy_hr = statistics.median(df_easy['Avg HR'])
max_easy_hr = statistics.median(df_easy['Max HR'])
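One thing worth noting about the two filters above: a workout must be fast *and* high-cadence, while an easy run must be slow *and* low-cadence, so a run that meets only one condition lands in neither group. The sketch below shows this on four made-up runs; the pace cut of 6.0 is hypothetical, while 199 matches the cadence cut used above.

```python
import pandas as pd

# Made-up runs covering all four pace/cadence combinations.
runs = pd.DataFrame({
    "Best Pace": [5.0, 5.1, 7.0, 7.2],
    "Max Run Cadence": [205, 190, 185, 204],
})
pace_cut, cadence_cut = 6.0, 199  # pace_cut is a hypothetical value

fast = runs["Best Pace"] <= pace_cut
high_cadence = runs["Max Run Cadence"] >= cadence_cut

workouts = runs[fast & high_cadence]      # both conditions met
easy = runs[~fast & ~high_cadence]        # neither condition met
unclassified = runs[fast ^ high_cadence]  # exactly one condition met

print(len(workouts), len(easy), len(unclassified))
```

Any run in the "unclassified" bucket keeps its missing heart rate after the infill, which is worth keeping in mind when checking for leftover nulls later.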

### Filling in Missing Temperature Data

The vast majority of my runs over the years have been done either at home in Hurley, MS or in Starkville, MS where I got my undergrad before moving to the Wild West. Garmin does not record GeoLocation data for runs; they only give a somewhat cryptic name for the run's location. Instead of trying to pull another dataset in containing weather data and joining based on location and date, let's compute some representative weather data using the dataset we already have.

First, let's gather the Mississippi runs that do have non-null temperature columns.

In [None]:
Counter(df_hr['Title'])

Any run title containing the word Starkville, Noxubee, Jackson, Oktibbeha, or Mobile occurred either in my hometown or in my college town. Now we're gonna compute some representative temperatures for the state for each of the four seasons to infill our missing temperature data.

In [None]:
# One regex alternation instead of five chained str.contains calls; copy so a 'Season' column can be added later.
ms_places = "Starkville|Noxubee|Jackson|Oktibbeha|Mobile"
df_weather = df_hr.loc[df_hr["Title"].str.contains(ms_places)].copy()

In [None]:
df_weather

In [None]:
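# Season code per calendar month: 1 = winter (Dec-Feb), 2 = spring (Mar-May), 3 = summer (Jun-Aug), 4 = fall (Sep-Nov).
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]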
month_to_season = dict(zip(range(1,13), seasons))
df_weather['Season'] = [month_to_season[x] for x in pd.to_datetime(df_weather['Date']).dt.month]
df_weather
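Here is a small, self-contained check of the month-to-season lookup on a few invented dates. It uses `Series.map` in place of the list comprehension, which should give the same labels.

```python
import pandas as pd

# Same lookup as above: position i-1 holds the season code for month i.
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]
month_to_season = dict(zip(range(1, 13), seasons))

# Made-up dates, one per season plus a December date to check the wrap-around.
dates = pd.Series(pd.to_datetime(
    ["2019-01-15", "2019-04-02", "2019-07-20", "2019-10-31", "2019-12-25"]
))
season_labels = dates.dt.month.map(month_to_season)
print(season_labels.tolist())
```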

In [None]:
# Median min/max temperature for each season (1 = winter, 2 = spring, 3 = summer, 4 = fall).
season_temp_min = {s: statistics.median(df_weather[df_weather['Season'] == s]['Min Temp']) for s in range(1, 5)}
season_temp_max = {s: statistics.median(df_weather[df_weather['Season'] == s]['Max Temp']) for s in range(1, 5)}
season_temp_min, season_temp_max
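For comparison, the seasonal medians can also be computed with a `groupby`, and the resulting lookup can fill gaps in one step. This is a minimal sketch on invented temperatures, not my actual Mississippi data.

```python
import pandas as pd

# Made-up temperatures with one gap in winter (1) and one in summer (3).
toy = pd.DataFrame({
    "Season": [1, 1, 1, 3, 3, 3],
    "Min Temp": [40.0, None, 50.0, 75.0, 79.0, None],
})

# Median per season; NaN values are skipped automatically.
season_median = toy.groupby("Season")["Min Temp"].median()

# Map each row's season to its median and use that only where the value is missing.
toy["Min Temp"] = toy["Min Temp"].fillna(toy["Season"].map(season_median))
print(toy["Min Temp"].tolist())
```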

### Fill in missing data on whole dataset

In [None]:
# Treat 0 as "not recorded"; the dict form makes the swap to None explicit.
df = df.replace({0: None})
df

In [None]:
df['Max Run Cadence'] = df['Max Run Cadence'].fillna(statistics.mean(df.dropna()['Max Run Cadence'].astype(int)))

In [None]:
# Drop runs with no pace data; drop=True avoids leaving stray 'index' columns behind.
df = df.dropna(subset=['Best Pace', 'Avg Pace']).reset_index(drop=True)

In [None]:
df

In [None]:
# Convert "mm:ss" pace strings to decimal minutes.
df['Best Pace'] = [(60 * int(x.split(':')[0]) + int(x.split(':')[1]))/60 for x in df['Best Pace']]
df['Avg Pace'] = [(60 * int(x.split(':')[0]) + int(x.split(':')[1]))/60 for x in df['Avg Pace']]
# Rebuild the workout/easy split on the full dataset, then fill missing HR with the class medians.
fast = df['Best Pace'] <= np.percentile(df['Best Pace'], 25)
high_cadence = df['Max Run Cadence'].astype(int) >= 199
df.loc[df['Max HR'].isna() & fast & high_cadence, 'Max HR'] = max_work_hr
df.loc[df['Avg HR'].isna() & fast & high_cadence, 'Avg HR'] = avg_work_hr
df.loc[df['Max HR'].isna() & ~fast & ~high_cadence, 'Max HR'] = max_easy_hr
df.loc[df['Avg HR'].isna() & ~fast & ~high_cadence, 'Avg HR'] = avg_easy_hr
df

In [None]:
df['Season'] = [month_to_season[x] for x in pd.to_datetime(df['Date']).dt.month]
# Fill missing temperatures with the seasonal medians; fillna catches both None and NaN.
df['Min Temp'] = df['Min Temp'].fillna(df['Season'].map(season_temp_min))
df['Max Temp'] = df['Max Temp'].fillna(df['Season'].map(season_temp_max))

In [None]:
df

In [None]:
df.loc[df['Avg HR'].isna()]

## Totals
Let's have a look at some overall stats from my college running days before we get a bit more granular.

In [None]:
sum(df['Distance'])

In [None]:
# Commas were already stripped from 'Calories' during cleaning, so the values only need summing.
pd.to_numeric(df['Calories']).sum()

In [None]:
df['Time Minutes'] = [60 * int(x.split(':')[0]) + int(x.split(':')[1]) for x in df['Time']]

In [None]:
df['Avg Run Cadence'] * df['Time Minutes']
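Cadence is measured in steps per minute, so multiplying it by a run's duration gives a rough step count. The sketch below assumes the `Time` column is formatted as "hh:mm:ss" and uses `pd.to_timedelta` so the seconds are kept too; the two runs are made up.

```python
import pandas as pd

# Made-up runs; 'Time' is assumed to be "hh:mm:ss".
toy = pd.DataFrame({
    "Time": ["00:30:00", "01:02:30"],
    "Avg Run Cadence": [170, 180],
})

# Parse the duration and express it in minutes, fractional part included.
toy["Time Minutes"] = pd.to_timedelta(toy["Time"]).dt.total_seconds() / 60

# Steps per minute times minutes gives an estimated step count per run.
toy["Estimated Steps"] = toy["Avg Run Cadence"] * toy["Time Minutes"]
print(toy[["Time Minutes", "Estimated Steps"]])
```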

In [None]:
statistics.median(df['Distance'])

In [None]:
statistics.mean(df['Distance'])

In [None]:
max(df['Distance'])

In [None]:
min(df['Distance'])

In [None]:
len(df)

## Stats by Week

In [None]:
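# freq='W' bins are labelled by their ending Sunday; shifting every run back 7 days
# makes each Monday-Sunday training week carry the date of the Sunday just before it.
df['Date'] = pd.to_datetime(df['Date']) - pd.to_timedelta(7, unit='d')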

In [None]:
# Distribution of weekly mileage.
weekly_distance = df.groupby(pd.Grouper(key='Date', freq='W'))['Distance'].sum()
sns.histplot(weekly_distance)
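To see how the weekly grouping behaves, here is a minimal sketch on four invented runs. With `freq="W"`, pandas uses weeks that end on Sunday and labels each bin with that Sunday.

```python
import pandas as pd

# Made-up runs: Monday, Wednesday, and Sunday of one week, then the following Monday.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-01-05", "2022-01-09", "2022-01-10"]),
    "Distance": [5.0, 3.0, 10.0, 4.0],
})

# Sum the distance in each Sunday-ending week.
weekly = toy.groupby(pd.Grouper(key="Date", freq="W"))["Distance"].sum()
print(weekly)
```

The first three runs land in the week labelled 2022-01-09, and the last run starts a new week labelled 2022-01-16.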

In [None]:
# Distribution of the mean pace per week.
weekly_pace = df.groupby(pd.Grouper(key='Date', freq='W'))['Avg Pace'].mean()
sns.histplot(weekly_pace)

In [None]:
sns.displot(data = df[['Distance', 'Season']], y = 'Distance', x = 'Season')

In [None]:
sns.displot(data = df[['Best Pace', 'Season']], y = 'Best Pace', x = 'Season')

## Worst Week Ever

In [None]:
df.loc[df['Total Ascent'].isna(), ['Total Ascent']] = statistics.median(df['Total Ascent'].dropna().astype(int))

In [None]:
# Weekly mileage and mean pace in one table, with Date kept as a column for plotting.
plot = df.groupby(pd.Grouper(key='Date', freq='W')).agg({'Distance': 'sum', 'Avg Pace': 'mean'}).reset_index()
plot
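One possible way to rank weeks from a table like this is to standardize each column and combine them. The sketch below is only an illustration under my own assumptions: the three weeks are made up, `zscore` is a hypothetical helper, and "more miles plus faster pace equals more grueling" is just one of many reasonable definitions.

```python
import pandas as pd

# Made-up weekly summary (pace in decimal minutes, lower is faster).
weeks = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-05", "2020-01-12", "2020-01-19"]),
    "Distance": [40.0, 70.0, 55.0],
    "Avg Pace": [7.5, 7.0, 7.2],
})


def zscore(col: pd.Series) -> pd.Series:
    """Standardize a column to mean 0 and unit (sample) standard deviation."""
    return (col - col.mean()) / col.std()


# Higher mileage raises the score; a faster (smaller) pace raises it too.
weeks["Grueling Score"] = zscore(weeks["Distance"]) - zscore(weeks["Avg Pace"])
hardest = weeks.loc[weeks["Grueling Score"].idxmax()]
print(hardest["Date"].date(), round(hardest["Grueling Score"], 2))
```

Elevation gain, heart rate, or temperature could be folded in the same way by adding more standardized terms.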

## Fitness Trends

In [None]:
sns.lineplot(data = plot, y = plot['Distance'], x = pd.to_datetime(plot['Date']))

In [None]:
sns.lineplot(df['PPM Avg Seconds'])

In [None]:
sns.lineplot(df['PPM Seconds'])

In [None]:
sns.lineplot(df[df['Avg HR'] > 0]['Avg HR'])

