## Running Injury Prediction Project

### Introduction

Running is one of the most popular sports in the world, with millions of participants each year. Injuries are common among runners: by some estimates, roughly 50% experience some form of injury annually. In this project, we explore whether machine learning can predict running injuries from weekly training-load data (sessions, rest days, distance, and intensity zones).

### Data Loading and Exploration

We will begin by loading the dataset directly from Kaggle and exploring its contents.

In [1]:
import pandas as pd

# Load the dataset directly from Kaggle
df = pd.read_csv("/kaggle/input/injury-prediction-for-competitive-runners/week_approach_maskedID_timeseries.csv")

# Display the first few rows of the dataset
print(df.head())

# Display a summary of the dataset (info() prints directly and returns None)
df.info()

   nr. sessions  nr. rest days  total kms  max km one day  \
0           5.0            2.0       22.2            16.4   
1           5.0            2.0       21.6            16.4   
2           5.0            2.0       21.6            16.4   
3           5.0            2.0       21.6            16.4   
4           6.0            1.0       39.2            17.6   

   total km Z3-Z4-Z5-T1-T2  nr. tough sessions (effort in Z5, T1 or T2)  \
0                     11.8                                          1.0   
1                     11.7                                          1.0   
2                     11.7                                          1.0   
3                     11.7                                          1.0   
4                     18.9                                          1.0   

   nr. days with interval session  total km Z3-4  max km Z3-4 one day  \
0                             2.0           10.0                 10.0   
1                             2.0   
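Because the modeling step further down predicts the `injury` column, it can help to check how the two classes are distributed before training anything. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); on the real data the same two calls would be applied to `df["injury"]`.

```python
import pandas as pd

# Hypothetical stand-in for df: only the column name 'injury' comes from the notebook.
toy = pd.DataFrame({"injury": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Absolute number of rows per class
counts = toy["injury"].value_counts()

# Share of rows per class (sums to 1.0)
share = toy["injury"].value_counts(normalize=True)

print(counts)
print(share)
```

A strongly skewed share here is an early warning that plain accuracy will be a poor yardstick for the model.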

### Data Preprocessing

Next, we'll preprocess the data by dropping unnecessary columns.

In [2]:
# Drop unnecessary columns
columns_to_drop = ['avg training success', 'min training success', 'max training success', 
                   # More columns to drop...
                   'rel total kms week 1_2']

# Drop the listed columns; errors='ignore' skips any that are not in the DataFrame
df = df.drop(columns=columns_to_drop, errors='ignore')

# Display the modified DataFrame
print(df.head())


   nr. sessions  nr. rest days  total kms  max km one day  \
0           5.0            2.0       22.2            16.4   
1           5.0            2.0       21.6            16.4   
2           5.0            2.0       21.6            16.4   
3           5.0            2.0       21.6            16.4   
4           6.0            1.0       39.2            17.6   

   total km Z3-Z4-Z5-T1-T2  nr. tough sessions (effort in Z5, T1 or T2)  \
0                     11.8                                          1.0   
1                     11.7                                          1.0   
2                     11.7                                          1.0   
3                     11.7                                          1.0   
4                     18.9                                          1.0   

   nr. days with interval session  total km Z3-4  max km Z3-4 one day  \
0                             2.0           10.0                 10.0   
1                             2.0   

### Model Training and Evaluation

We split the data into training and test sets, select features with a Random Forest, tune a Random Forest classifier with grid search, and report accuracy, the confusion matrix, and a classification report.

In [7]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_selection import SelectFromModel

# Splitting data into features and target
X = df.drop(['injury', 'Athlete ID'], axis=1)
y = df['injury']

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature Selection with RandomForestClassifier
selector_rf = RandomForestClassifier(n_estimators=100, random_state=42)
feature_selector = SelectFromModel(selector_rf)
X_train_selected = feature_selector.fit_transform(X_train, y_train)
X_test_selected = feature_selector.transform(X_test)

# Parameter Grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Model Training with Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_selected, y_train)

# Best parameters found by GridSearch
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Model Evaluation
best_rf_classifier = grid_search.best_estimator_
y_pred = best_rf_classifier.predict(X_test_selected)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# Note: zero_division=1 reports precision as 1.00 for a class that is never predicted
print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=1))


Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Accuracy: 0.9879283489096573
Confusion Matrix:
 [[12685     0]
 [  155     0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99     12685
           1       1.00      0.00      0.00       155

    accuracy                           0.99     12840
   macro avg       0.99      0.50      0.50     12840
weighted avg       0.99      0.99      0.98     12840
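The headline accuracy can be unpacked by hand from the confusion matrix. The following sketch uses only the four counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the confusion matrix printed above
# (rows = true class, columns = predicted class).
tn, fp = 12685, 0   # true class 0 (no injury)
fn, tp = 155, 0     # true class 1 (injury)

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(tn + fp, fn + tp) / total
# Recall for the injury class: share of true injuries that were caught.
recall_injury = tp / (tp + fn)
# Precision is undefined when the model never predicts the injury class.
precision_injury = tp / (tp + fp) if (tp + fp) > 0 else None

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"injury recall:     {recall_injury:.4f}")
print(f"injury precision:  {precision_injury}")
```

Accuracy and the majority-class baseline come out identical, which is what one would expect from a model that never predicts an injury.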



### Conclusion

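In this project, we explored machine learning's potential to predict running injuries from training-load data. Although the Random Forest classifier reached an accuracy of 98.79%, the confusion matrix shows that it predicted the non-injury class for every test sample: all 155 injury cases were missed, so recall and F1-score for the injury class are both 0. The precision of 1.00 reported for that class is an artifact of `zero_division=1`, since no injury predictions were made and precision is therefore undefined. The accuracy simply equals the share of non-injury samples in the test set (12685 of 12840), which means the model does no better than a majority-class baseline. Future work should address class imbalance, use evaluation metrics that reflect the minority class, and explore further feature engineering and alternative models to improve predictive performance and contribute to injury prevention in running.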
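As one possible starting point for the class-imbalance direction, the sketch below combines a stratified split, `class_weight="balanced"`, and minority-aware metrics. It runs on synthetic data from `make_classification`, not on the runner dataset, and the sample size, class ratio, and hyperparameters are assumptions chosen for illustration. It is not a result from this project, and whether it helps on the real data would need to be tested.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the runner dataset): roughly 2% positives.
X, y = make_classification(
    n_samples=4000,
    n_features=10,
    n_informative=5,
    weights=[0.98, 0.02],
    random_state=42,
)

# stratify=y keeps the positive share the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so mistakes on the rare class cost more during training.
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    class_weight="balanced",
    random_state=42,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Metrics that stay informative when one class dominates.
recall = recall_score(y_test, y_pred)
bal_acc = balanced_accuracy_score(y_test, y_pred)
print("Minority-class recall:", recall)
print("Balanced accuracy:", bal_acc)
```

The same idea carries over to the grid search: passing `scoring="balanced_accuracy"` or `scoring="f1"` to `GridSearchCV` instead of `scoring="accuracy"` would make model selection sensitive to the injury class.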
