# Running Prediction

Many recreational runners, including myself, may want to step up to the next level and sign up for a race to challenge themselves. Race participants often set goals for themselves, such as a sub-two-hour half marathon or a sub-four-hour full marathon. Such a feat requires strategic pacing, efficient running and breathing techniques, and consistent training.

This project will use machine-learning algorithms to analyze real running data and build a predictive model for race finish times. The models will take data from Strava (distance, pace, heart rate, elevation, etc.) and apply regression techniques to estimate finish times from the user’s training patterns. In addition, classification algorithms can separate actual training runs from recovery or commute runs to improve prediction accuracy.
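
As a rough illustration of the regression idea, the sketch below fits a linear model to a handful of made-up runs and estimates the duration of a hypothetical race. The numbers, the `time_min` column, and the choice of features are assumptions for illustration only; they are not taken from the Strava export and are not the final model of this project.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training runs (illustrative only, not from strava_data.csv):
# distance, elevation gain, and duration in minutes
runs = pd.DataFrame({
    "distance": [2.0, 3.0, 5.0, 6.5, 8.0],
    "elevation_gain": [0, 100, 50, 200, 150],
    "time_min": [19.0, 29.5, 48.0, 63.75, 77.5],
})

features = ["distance", "elevation_gain"]

# Fit duration as a linear function of distance and elevation gain
model = LinearRegression()
model.fit(runs[features], runs["time_min"])

# Hypothetical race: 10 distance units with 300 units of elevation gain
race = pd.DataFrame({"distance": [10.0], "elevation_gain": [300]})
predicted_minutes = model.predict(race)[0]
print(f"Estimated finish time: {predicted_minutes:.1f} minutes")
```

A model like this extrapolates when the race is longer than any training run, so its estimate should be treated with caution until it is validated against real race results.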

# Importing Data

The dataset is saved in the repository as "strava_data.csv". We will start by reading the file and displaying it.

In [33]:
import pandas as pd

# Read csv file
strava_data = pd.read_csv('strava_data.csv')

# Print out the dataframe
strava_data

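| | date | time | distance | avg_speed | max_speed | avg_heartrate | max_heartrate | elevation_gain | avg_power | max_power | total_work | avg_cadence | max_cadence | calories_burned | shoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024/5/23 | 0:18:31 | 1.96 | 0:09:24 | 0:06:02 | 177 | 202 | 0 | NaN | NaN | NaN | NaN | NaN | 208 | Asics Gel-Kayano 28 |
| 1 | 2024/5/28 | 0:27:00 | 2.84 | 0:09:30 | 0:06:38 | 168 | 183 | 94 | NaN | NaN | NaN | NaN | NaN | 257 | Asics Gel-Kayano 28 |
| 2 | 2024/5/30 | 0:23:07 | 2.71 | 0:08:31 | 0:06:41 | 171 | 190 | 100 | NaN | NaN | NaN | NaN | NaN | 251 | Asics Gel-Kayano 28 |
| 3 | 2024/6/1 | 0:11:58 | 1.43 | 0:08:22 | 0:05:00 | 176 | 197 | 170 | NaN | NaN | NaN | NaN | NaN | 143 | Asics Gel-Kayano 28 |
| 4 | 2024/6/3 | 0:12:53 | 1.4 | 0:09:11 | 0:06:43 | 171 | 192 | 172 | NaN | NaN | NaN | NaN | NaN | 136 | Asics Gel-Kayano 28 |
| 5 | 2024/6/7 | 0:56:48 | 5.38 | 0:10:32 | 0:07:32 | 169 | 191 | 656 | NaN | NaN | NaN | NaN | NaN | 530 | Asics Gel-Kayano 28 |
| 6 | 2024/6/10 | 0:08:47 | 0.8 | 0:10:55 | 0:06:53 | 167 | 182 | 0 | NaN | NaN | NaN | NaN | NaN | 94 | Asics Gel-Kayano 28 |
| 7 | 2024/6/13 | 0:18:26 | 1.98 | 0:09:17 | 0:07:04 | 167 | 184 | 206 | NaN | NaN | NaN | NaN | NaN | 198 | Asics Gel-Kayano 28 |
| 8 | 2024/6/17 | 0:31:29 | 2.87 | 0:10:58 | 0:06:55 | 163 | 189 | 205 | NaN | NaN | NaN | NaN | NaN | 271 | Asics Gel-Kayano 28 |
| 9 | 2024/6/25 | 0:36:16 | 3.42 | 0:10:35 | 0:06:59 | 169 | 183 | 0 | NaN | NaN | NaN | NaN | NaN | 454 | Asics Gel-Kayano 28 |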


# Organizing Data

Now that we have imported our data, we will organize it by converting the date column to pandas datetime format, sorting the rows by date, checking for null values, and removing duplicates.

In [34]:
# Convert date attribute
strava_data['date'] = pd.to_datetime(strava_data['date'])

# Ensure data is sorted by date
strava_data.sort_values('date', inplace=True)

# Check for null values (decide later whether to drop them)
print(strava_data.isnull().sum())

# Drop duplicates
strava_data = strava_data.drop_duplicates()

# Print out organized dataframe
strava_data


date                0
time                0
distance            0
avg_speed           0
max_speed           0
avg_heartrate       0
max_heartrate       0
elevation_gain      0
avg_power          10
max_power          10
total_work         10
avg_cadence        10
max_cadence        10
calories_burned     0
shoe                0
dtype: int64


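| | date | time | distance | avg_speed | max_speed | avg_heartrate | max_heartrate | elevation_gain | avg_power | max_power | total_work | avg_cadence | max_cadence | calories_burned | shoe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-05-23 | 0:18:31 | 1.96 | 0:09:24 | 0:06:02 | 177 | 202 | 0 | NaN | NaN | NaN | NaN | NaN | 208 | Asics Gel-Kayano 28 |
| 1 | 2024-05-28 | 0:27:00 | 2.84 | 0:09:30 | 0:06:38 | 168 | 183 | 94 | NaN | NaN | NaN | NaN | NaN | 257 | Asics Gel-Kayano 28 |
| 2 | 2024-05-30 | 0:23:07 | 2.71 | 0:08:31 | 0:06:41 | 171 | 190 | 100 | NaN | NaN | NaN | NaN | NaN | 251 | Asics Gel-Kayano 28 |
| 3 | 2024-06-01 | 0:11:58 | 1.43 | 0:08:22 | 0:05:00 | 176 | 197 | 170 | NaN | NaN | NaN | NaN | NaN | 143 | Asics Gel-Kayano 28 |
| 4 | 2024-06-03 | 0:12:53 | 1.4 | 0:09:11 | 0:06:43 | 171 | 192 | 172 | NaN | NaN | NaN | NaN | NaN | 136 | Asics Gel-Kayano 28 |
| 5 | 2024-06-07 | 0:56:48 | 5.38 | 0:10:32 | 0:07:32 | 169 | 191 | 656 | NaN | NaN | NaN | NaN | NaN | 530 | Asics Gel-Kayano 28 |
| 6 | 2024-06-10 | 0:08:47 | 0.8 | 0:10:55 | 0:06:53 | 167 | 182 | 0 | NaN | NaN | NaN | NaN | NaN | 94 | Asics Gel-Kayano 28 |
| 7 | 2024-06-13 | 0:18:26 | 1.98 | 0:09:17 | 0:07:04 | 167 | 184 | 206 | NaN | NaN | NaN | NaN | NaN | 198 | Asics Gel-Kayano 28 |
| 8 | 2024-06-17 | 0:31:29 | 2.87 | 0:10:58 | 0:06:55 | 163 | 189 | 205 | NaN | NaN | NaN | NaN | NaN | 271 | Asics Gel-Kayano 28 |
| 9 | 2024-06-25 | 0:36:16 | 3.42 | 0:10:35 | 0:06:59 | 169 | 183 | 0 | NaN | NaN | NaN | NaN | NaN | 454 | Asics Gel-Kayano 28 |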
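
The null counts above show that the power and cadence columns are missing for some runs, and the decision on how to handle them is still open. The sketch below shows two common options on a small made-up frame: dropping columns whose share of missing values exceeds a threshold, or dropping incomplete rows. The sample values and the `max_null_share` threshold are hypothetical; neither option has been chosen for this project yet.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same kind of gaps as the power/cadence columns
sample = pd.DataFrame({
    "distance": [1.96, 2.84, 3.10],
    "avg_power": [np.nan, np.nan, 250.0],
    "avg_cadence": [np.nan, 170.0, 172.0],
})

# Fraction of missing values per column
null_share = sample.isnull().mean()

# Option 1: drop columns that are mostly empty (threshold is a hypothetical choice)
max_null_share = 0.5
sparse_cols = null_share[null_share > max_null_share].index.tolist()
trimmed = sample.drop(columns=sparse_cols)

# Option 2: keep every column but drop rows with any missing value
complete = sample.dropna()

print(sparse_cols)
print(trimmed.columns.tolist())
print(len(complete))
```

Dropping columns keeps every run but loses the power and cadence signals, while dropping rows keeps those signals at the cost of the earliest runs.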


# Analyzing Data

Our next step will be performing exploratory data analysis on our running data. We will first print out information and statistics of our dataframe, then explore any relationships and visualize them. 

In [35]:
# Print information
print(strava_data.info())

# Print summary statistics
print(strava_data.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 43 entries, 0 to 42
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   date             43 non-null     datetime64[ns]
 1   time             43 non-null     object        
 2   distance         43 non-null     float64       
 3   avg_speed        43 non-null     object        
 4   max_speed        43 non-null     object        
 5   avg_heartrate    43 non-null     int64         
 6   max_heartrate    43 non-null     int64         
 7   elevation_gain   43 non-null     int64         
 8   avg_power        33 non-null     float64       
 9   max_power        33 non-null     float64       
 10  total_work       33 non-null     float64       
 11  avg_cadence      33 non-null     float64       
 12  max_cadence      33 non-null     float64       
 13  calories_burned  43 non-null     int64         
 14  shoe             43 non-null     object        
dtypes: datetime64[ns](1), float64(6), int64(4), object(4)
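
The `info()` output lists `time`, `avg_speed`, and `max_speed` as `object` columns, meaning they are stored as strings and cannot be plotted or regressed on directly. A minimal sketch of one way to convert them is shown below, assuming the strings are in `H:MM:SS` form and that `avg_speed` is a pace (time per unit of distance). The `time_min` and `pace_min` column names are hypothetical.

```python
import pandas as pd

# Two rows in the same string format as the dataframe above
sample = pd.DataFrame({
    "time": ["0:18:31", "0:27:00"],
    "avg_speed": ["0:09:24", "0:09:30"],
})

# pd.to_timedelta parses "H:MM:SS" strings into Timedelta values;
# total_seconds() / 60 then gives plain floats in minutes
sample["time_min"] = pd.to_timedelta(sample["time"]).dt.total_seconds() / 60
sample["pace_min"] = pd.to_timedelta(sample["avg_speed"]).dt.total_seconds() / 60

print(sample[["time_min", "pace_min"]])
```

With numeric minutes in place, the same regression and plotting steps used for distance can be applied to duration and pace.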

Because we want to know whether our running pace and distance are improving, we will create plots to visualize our average speed and distance over time. We will first fit a linear regression of distance against date and print the correlation coefficient between the actual and fitted distances:

In [41]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.linear_model import LinearRegression
import numpy as np

# Make sure distance is numeric (entries that cannot be parsed become NaN)
strava_data['distance'] = pd.to_numeric(strava_data['distance'], errors='coerce')

# Convert 'date' to an integer day count so it can be used as a feature (help from ChatGPT)
strava_data['date_ordinal'] = strava_data['date'].apply(lambda x: x.toordinal())

# Prepare the data for linear regression
# (double brackets keep x as a 2-D DataFrame, which scikit-learn expects for features)
x = strava_data[['date_ordinal']]
y = strava_data['distance']

# Create and fit the model
model = LinearRegression()
model.fit(x, y)

# Predict values
predicted_distance = model.predict(x)

# Calculate the correlation coefficient
correlation_coefficient = np.corrcoef(strava_data['distance'], predicted_distance)[0, 1]
print(f'Correlation Coefficient: {correlation_coefficient}')

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(strava_data['date'], strava_data['distance'], marker='o', linestyle='none', color='red', label='Actual Distance')
plt.plot(strava_data['date'], predicted_distance, color='blue', label='Fitted Line', linestyle='-')
plt.title('Distance Over Time with Linear Regression')
plt.xlabel('Date')
plt.ylabel('Distance (km)')

# Set major ticks format for the x-axis
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))

plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()

ValueError: Expected a 2-dimensional container but got <class 'pandas.core.series.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
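
The `ValueError` above is the kind of message scikit-learn raises when the feature argument is a one-dimensional pandas Series instead of a two-dimensional DataFrame. The sketch below reproduces the distinction on made-up data (not the Strava export). It also compares two correlation values: the one between actual and fitted values, which is what the cell above computes, and the one between date and distance, which keeps the sign of the trend.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up runs with a downward trend (illustrative only)
runs = pd.DataFrame({
    "date": pd.to_datetime(["2024-05-01", "2024-05-08", "2024-05-15", "2024-05-22"]),
    "distance": [5.0, 4.0, 3.5, 3.0],
})
runs["date_ordinal"] = runs["date"].apply(lambda d: d.toordinal())
y = runs["distance"]

# Single brackets give a 1-D Series, which scikit-learn rejects as a feature matrix
try:
    LinearRegression().fit(runs["date_ordinal"], y)
    series_error = None
except ValueError as err:
    series_error = err

# Double brackets give a 2-D DataFrame with one column, which is accepted
X = runs[["date_ordinal"]]
model = LinearRegression().fit(X, y)
fitted = model.predict(X)

# Correlation of date with distance keeps the sign of the trend;
# correlation of actual with fitted values is never negative
r_signed = np.corrcoef(runs["date_ordinal"], y)[0, 1]
r_fitted = np.corrcoef(y, fitted)[0, 1]

print(f"Series as X raised: {type(series_error).__name__}")
print(f"r(date, distance) = {r_signed:.3f}, r(actual, fitted) = {r_fitted:.3f}")
```

Because the fitted line is a linear function of the date, the second value equals the absolute value of the first, so it shows how strong the trend is but not whether distance is rising or falling.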

We will do the same thing for our average speed: