## Predictive Modeling for Running Performance

This report develops and evaluates machine learning models that predict running performance, measured here as the elapsed time of a run. By comparing several input features and model types, it aims to give runners and coaches insights for refining training strategies and improving performance.


In [None]:
from google.colab import files
import pandas as pd

# Upload the CSV file
uploaded = files.upload()

# Load the dataset
df = pd.read_csv('run_clean_data.csv')
print(df.head())

Saving run_clean_data.csv to run_clean_data.csv
   Activity ID          Activity Date Activity Name Activity Type  \
0   7057073739  28 Apr 2022, 21:49:44   Morning Run           Run   
1   7067754767   1 May 2022, 00:11:03   Morning Run           Run   
2   7073448977   1 May 2022, 23:04:26   Morning Run           Run   
3   7083686840   3 May 2022, 20:34:15   Morning Run           Run   
4   7094718079   5 May 2022, 21:44:27   Morning Run           Run   

   Elapsed Time  Distance  Max Heart Rate  Relative Effort  Commute  \
0          4127     12.77           178.0             78.0    False   
1          7723     24.63           173.0            185.0    False   
2          4202     12.51           182.0            255.0    False   
3          4086     12.72           188.0            276.0    False   
4          4290     13.09           184.0            241.0    False   

                       Filename  ...  Maximum Power 10s  Maximum Power 30s  \
0  activities/7511868053.csv.gz 

Here we load the dataset and print its first rows to inspect the available columns and their value formats. Knowing the structure of the data up front guides the checks and feature choices that follow.


In [None]:
print(df.describe())

        Activity ID  Elapsed Time    Distance  Max Heart Rate  \
count  1.640000e+02    164.000000  164.000000      164.000000   
mean   8.211358e+09   4390.871951   12.584268      178.237805   
std    7.392044e+08   2052.552196    5.369254       10.528875   
min    7.057074e+09   1506.000000    4.060000      140.000000   
25%    7.759300e+09   3015.500000    8.475000      174.750000   
50%    8.146922e+09   4016.500000   12.000000      178.000000   
75%    8.611014e+09   5236.250000   15.362500      184.000000   
max    1.027158e+10  16615.000000   28.870000      210.000000   

       Relative Effort  Moving Time   Max Speed  Average Speed  \
count       164.000000   164.000000  164.000000     164.000000   
mean        152.810976  4211.518293    6.247638       2.981124   
std         106.591449  1771.775123    6.017020       0.135569   
min           6.000000  1506.000000    3.233984       2.696547   
25%          68.750000  2987.750000    3.994507       2.895473   
50%         130.50

In [None]:
# Check for missing values
print(df.isnull().sum())

Activity ID                0
Activity Date              0
Activity Name              0
Activity Type              0
Elapsed Time               0
Distance                   0
Max Heart Rate             0
Relative Effort            0
Commute                    0
Filename                   0
Moving Time                0
Max Speed                  0
Average Speed              0
Elevation Gain             0
Elevation Loss             0
Elevation Low              0
Elevation High             0
Max Grade                  0
Average Grade              0
Max Cadence                0
Average Cadence            0
Average Heart Rate         0
Average Watts              0
Calories                   0
Weighted Average Power     0
Power Count                0
Grade Adjusted Distance    0
Average Elapsed Speed      0
Dirt Distance              0
Maximum Power 5s           0
Maximum Power 10s          0
Maximum Power 30s          0
Maximum Power 1.0min       0
Maximum Power 5.0min       0
Maximum Power 

### Feature Engineering and Normalization


In this section, we derive a new feature, 'Average Pace' (elapsed time divided by distance), and standardize 'Average Pace' and 'Max Heart Rate' with StandardScaler. Standardization rescales each feature to zero mean and unit variance, so that features measured on different scales become comparable.
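
The standardization can be checked by hand from the summary statistics printed earlier. The sketch below uses only the standard library and relies on two facts: `describe()` reports the sample standard deviation (ddof=1), while `StandardScaler` divides by the population standard deviation (ddof=0). The helper `z_score` is introduced here for illustration and is not part of the notebook.

```python
import math

# Summary statistics for 'Max Heart Rate', copied from the df.describe() output.
n = 164
mean_hr = 178.237805
sample_std_hr = 10.528875  # describe() reports the sample std (ddof=1)

# StandardScaler divides by the population std (ddof=0), so convert first.
population_std_hr = sample_std_hr * math.sqrt((n - 1) / n)


def z_score(value, mean, std):
    """Standardize one value: subtract the mean, then divide by the std."""
    return (value - mean) / std


# Raw 'Max Heart Rate' values of the first five rows, copied from df.head().
raw_hr = [178.0, 173.0, 182.0, 188.0, 184.0]
scaled_hr = [z_score(v, mean_hr, population_std_hr) for v in raw_hr]

for raw, scaled in zip(raw_hr, scaled_hr):
    print(f"{raw:6.1f} -> {scaled: .6f}")
```

To the printed precision, these values should agree with the 'Max Heart Rate' column shown in the output of the normalization cell. 'Average Pace' cannot be checked the same way, because its mean and standard deviation are not printed.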


In [None]:
# Derive an additional feature: time per unit distance
df['Average Pace'] = df['Elapsed Time'] / df['Distance']

# Standardize selected features to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset, i.e. before the train/test split.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Average Pace', 'Max Heart Rate']] = scaler.fit_transform(df[['Average Pace', 'Max Heart Rate']])

# Recheck the head to see the new feature and normalized columns
print(df[['Average Pace', 'Max Heart Rate']].head())

   Average Pace  Max Heart Rate
0     -0.499153       -0.022655
1     -0.680932       -0.498994
2     -0.258915        0.358416
3     -0.536061        0.930023
4     -0.413131        0.548952


### Model Building and Evaluation


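We compare three regression models (Linear Regression, Random Forest, and Gradient Boosting) on the task of predicting 'Elapsed Time' from 'Distance', 'Max Heart Rate', and 'Average Pace'. Each model is trained on 80% of the data and scored on the held-out 20% using mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R²).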
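
For reference, the three metrics can be computed directly from their definitions. The following is a minimal sketch in plain Python. The helper name `regression_metrics` and the toy values are assumptions made for illustration; the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R² from their textbook definitions."""
    n = len(y_true)
    # Residual sum of squares: squared prediction errors, summed.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean of the targets.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    mse = ss_res / n              # average squared error
    rmse = mse ** 0.5             # same units as the target
    r2 = 1.0 - ss_res / ss_tot    # share of target variance explained
    return {'MSE': mse, 'RMSE': rmse, 'R2': r2}


# Toy elapsed times in seconds (made-up values, not from the dataset).
y_true = [4000.0, 5000.0, 6000.0]
y_pred = [4100.0, 4900.0, 6000.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is expressed in the units of the target, which makes it easier to interpret than MSE. R² compares the model's squared error with that of always predicting the mean of the targets.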


In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Prepare data
features = ['Distance', 'Max Heart Rate', 'Average Pace']
target = 'Elapsed Time'

X = df[features]
y = df[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define models to test
models = {
    'Random Forest': RandomForestRegressor(random_state=42),
    'Linear Regression': LinearRegression(),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

# Function to evaluate models
def evaluate_models(models, X_train, X_test, y_train, y_test):
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        rmse = mse ** 0.5
        r2 = r2_score(y_test, y_pred)
        results[name] = {'MSE': mse, 'RMSE': rmse, 'R2': r2}
    return results

# Evaluate models and print detailed results
model_results = evaluate_models(models, X_train, X_test, y_train, y_test)
for model_name, metrics in model_results.items():
    print(f"Results for {model_name}:")
    print(f"  Mean Squared Error (MSE): {metrics['MSE']:.2f}")
    print(f"  Root Mean Squared Error (RMSE): {metrics['RMSE']:.2f}")
    print(f"  R² Score: {metrics['R2']:.4f}")
    print("----------")


Results for Random Forest:
  Mean Squared Error (MSE): 270819.94
  Root Mean Squared Error (RMSE): 520.40
  R² Score: 0.9129
----------
Results for Linear Regression:
  Mean Squared Error (MSE): 25452.57
  Root Mean Squared Error (RMSE): 159.54
  R² Score: 0.9918
----------
Results for Gradient Boosting:
  Mean Squared Error (MSE): 81801.00
  Root Mean Squared Error (RMSE): 286.01
  R² Score: 0.9737
----------


This analysis set out to predict running performance with machine learning models, focusing on how distance, heart rate, and pace relate to the elapsed time of a run. Three models were evaluated on the same held-out test set: Linear Regression, Random Forest, and Gradient Boosting.

Linear Regression was the most accurate model, with the highest R² (0.9918) and the lowest RMSE (159.54, roughly 3.6% of the mean elapsed time of 4390.87). Within this dataset, the target is therefore well approximated by a linear function of the chosen features. One caveat applies when interpreting the score: 'Average Pace' is computed as 'Elapsed Time' divided by 'Distance', so the target is fully determined by two of the input features, and the high accuracy largely reflects that built-in relationship. With that in mind, the simplicity and accuracy of Linear Regression make it a sensible choice when the relationships between inputs and outcome are predominantly linear.
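
Because 'Average Pace' is defined as 'Elapsed Time' divided by 'Distance', the target can be rebuilt from the features without fitting any model. The sketch below illustrates this on synthetic data. All values and variable names are made up for illustration, and the standardization mirrors what StandardScaler computes (population mean and standard deviation).

```python
import random
import statistics

random.seed(0)

# Synthetic runs (illustrative values, not taken from the dataset).
distances = [random.uniform(4.0, 29.0) for _ in range(50)]
elapsed = [d * random.uniform(300.0, 360.0) for d in distances]

# Same feature construction as in the notebook: pace = elapsed time / distance,
# followed by standardization with the population mean and std.
pace = [t / d for t, d in zip(elapsed, distances)]
mu = statistics.mean(pace)
sigma = statistics.pstdev(pace)
pace_z = [(p - mu) / sigma for p in pace]

# Undo the scaling and multiply by distance: this recovers the target
# up to floating-point rounding.
reconstructed = [d * (mu + sigma * z) for d, z in zip(distances, pace_z)]
max_error = max(abs(r - t) for r, t in zip(reconstructed, elapsed))
print(f"largest reconstruction error: {max_error:.2e}")
```

Written out, elapsed time equals `mu * distance + sigma * distance * pace_z`. The second term is a product of two features, which an additive linear model can only approximate. That is consistent with an R² that is very high but still below 1.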

Random Forest (RMSE 520.40, R² 0.9129) and Gradient Boosting (RMSE 286.01, R² 0.9737) can capture complex nonlinear relationships, but neither matched Linear Regression here. Random Forest's higher test RMSE may indicate overfitting, but test error alone cannot confirm it; that would require comparing training and test error, which this analysis does not report. The small sample (164 activities, 33 of them in the test set) may also limit what tree ensembles can learn. For this dataset, the more complex models provided no benefit over the simpler linear approach, possibly because the relationships among the variables are straightforward.
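
One common way to check for overfitting is to compare a model's error on the training set with its error on the test set: a training error far below the test error suggests that the model has memorized the training data. The sketch below shows the idea on synthetic data. The `NearestNeighbour` class is a hypothetical stand-in for a very flexible model, and `train_test_rmse` is a helper introduced here; neither appears in the notebook.

```python
import random


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


class NearestNeighbour:
    """Hypothetical stand-in for a very flexible model: it memorizes the
    training points and predicts the target of the closest one."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        return [
            self.y_[min(range(len(self.X_)), key=lambda i: abs(self.X_[i] - x))]
            for x in X
        ]


def train_test_rmse(model, X_train, X_test, y_train, y_test):
    """Fit the model, then return (training RMSE, test RMSE)."""
    model.fit(X_train, y_train)
    return (rmse(y_train, model.predict(X_train)),
            rmse(y_test, model.predict(X_test)))


# Synthetic data (illustrative only): time grows with distance, plus noise.
random.seed(0)
X = [random.uniform(4.0, 29.0) for _ in range(164)]
y = [330.0 * x + random.gauss(0.0, 300.0) for x in X]
X_train, X_test, y_train, y_test = X[:131], X[131:], y[:131], y[131:]

train_err, test_err = train_test_rmse(
    NearestNeighbour(), X_train, X_test, y_train, y_test)
print(f"train RMSE: {train_err:.2f}, test RMSE: {test_err:.2f}")
```

The memorizing model reproduces its training targets exactly, so its training RMSE is zero while its test RMSE is not. The helper only relies on `fit` and `predict`, so it should also accept the notebook's estimators and data splits.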

From a practical perspective, these findings underscore the importance of matching the model to the characteristics of the data and the prediction task. For athletes and coaches, a linear regression model offers coefficients that are easy to interpret, which can help clarify how distance, pace, and heart rate relate to performance and inform training and competition strategies.