# Introduction to Random-Forest Classification for Injury Prediction

**What is a Random-Forest?**
A Random-Forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the data and considers only a random subset of features at each split. By averaging many trees, it reduces overfitting and captures non-linear interactions.
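The two sources of randomness described above can be sketched with NumPy alone. This is only an illustration under assumed toy sizes (`n_samples` and `n_features` are hypothetical), not how scikit-learn implements the forest internally.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 9   # hypothetical toy sizes

# Bootstrap sample: draw n_samples row indices with replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)
# Share of distinct rows; expected to be near 1 - 1/e (about 0.632).
unique_share = np.unique(boot_idx).size / n_samples

# Random feature subset for one split: sqrt(n_features) candidates, no repeats.
n_candidates = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=n_candidates, replace=False)

print(f"Share of distinct rows in the bootstrap sample: {unique_share:.3f}")
print("Features considered at this split:", sorted(split_features.tolist()))
```

Rows left out of a tree's bootstrap sample are "out-of-bag" for that tree, which is why each tree sees a slightly different view of the data.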

**Why use it?**

- Handles non-linear interactions.
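- Works with a mix of numeric and categorical variables (scikit-learn needs the categorical ones encoded first).
- Relatively robust to outliers and multicollinearity, although correlated features can share importance between them.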
- Returns feature-importance scores showing which variables the model relied on most.

**Outputs:**

- Predicted probability that a player will be injured in a given match.
- Binary injury label (0 = healthy, 1 = injured) after applying a decision threshold.
- Feature importances indicating which features the model relied on most when estimating injury risk.

**Why analyse weather & position effects?**

Soccer injuries stem from both intrinsic factors (age, position workload) and extrinsic factors (pitch & weather). Quantifying these helps coaches and medical staff to:
1. **Target prevention**: Tailor warm-ups or recovery by weather scenario and position.
2. **Rotation planning**: Rest vulnerable players when adverse weather coincides with short rest.
3. **Resource allocation**: Assign medical coverage where risk is highest.



## 1. Imports and Data Loading

Import the required Python libraries.

In [188]:
import pandas as pd
import numpy as np
from pathlib import Path

# Pre-processing & modelling
from sklearn.preprocessing   import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble        import RandomForestClassifier
from imblearn.over_sampling  import SMOTE

# Metrics
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    auc
)

# Determine project directory 
try:
    PROJECT_DIR = Path(__file__).resolve().parent
except NameError:
    PROJECT_DIR = Path.cwd().resolve()

DATA_DIR = PROJECT_DIR 
stats   = pd.read_csv(DATA_DIR / "player_stats_table.csv")   
players = pd.read_csv(DATA_DIR / "player_table.csv")       
matches = pd.read_csv(DATA_DIR / "match_table.csv")         
weather = pd.read_csv(DATA_DIR / "weather_table.csv")      



## 2. Build the Combined Dataset

Merge the four sources into a single DataFrame that has one row per player-match.

In [None]:
# Check that each source table has the columns the later steps rely on.
# Doing this before the merge gives a clear message instead of a KeyError.
tables = {"stats": stats, "players": players, "matches": matches, "weather": weather}
required_cols = {
    "stats"  : {"player_id", "match_id", "injury_occurred"},
    "players": {"player_id", "position", "birthdate"},
    "matches": {"match_id", "match_date"},
    "weather": {"match_id", "conditions"},
}
for name, expected in required_cols.items():
    missing = expected - set(tables[name].columns)   # look up by name, no eval()
    assert not missing, f"{name} is missing columns: {missing}"

# Merge the four datasets so we have one row per player-match,
# including stats, player info, match details, and weather conditions.
df = (
    stats
      .merge(players, on="player_id", how="left")
      .merge(matches, on="match_id",  how="left")
      .merge(weather, on="match_id",  how="left")
)
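A left merge only keeps one row per player-match if the right-hand tables have unique keys. The sketch below shows one way to guard against that with the `validate` argument of `DataFrame.merge`; the `toy_*` tables are hypothetical stand-ins for the real CSV files.

```python
import pandas as pd

# Hypothetical toy tables with the same key columns as the real ones.
toy_stats = pd.DataFrame({"player_id": [1, 1, 2], "match_id": [10, 11, 10],
                          "injury_occurred": [0, 1, 0]})
toy_players = pd.DataFrame({"player_id": [1, 2],
                            "position": ["Defender", "Forward"]})
toy_matches = pd.DataFrame({"match_id": [10, 11],
                            "match_date": ["2024-01-01", "2024-01-05"]})

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand
# table has duplicate keys, which would otherwise silently multiply rows.
toy = (
    toy_stats
      .merge(toy_players, on="player_id", how="left", validate="many_to_one")
      .merge(toy_matches, on="match_id",  how="left", validate="many_to_one")
)

# One row per player-match: the merges must not change the row count.
assert len(toy) == len(toy_stats)
assert not toy.duplicated(["player_id", "match_id"]).any()
print(toy)
```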



## 3. Feature Engineering

Create new features that are not directly available in the raw data.  

In [None]:
# 1. Convert date strings into actual datetime objects
#    This makes it easy to calculate differences and extract components.
df["match_date"] = pd.to_datetime(df["match_date"], errors="coerce")
df["birthdate"]  = pd.to_datetime(df["birthdate"],  errors="coerce")

# 2. Calculate player age in whole years on the day of the match
#    We subtract birthdate from match date, convert to days, then floor-divide by 365.
#    This ignores leap days, so the result can be off by one year close to a birthday.
df["age"] = (df["match_date"] - df["birthdate"]).dt.days // 365

# 3. Determine how many days off each player had since their last game
#    a) Sort so each player's matches are in chronological order
#    b) Use .diff() to find the gap in days between consecutive matches
#    c) Fill the first appearance (NaN) with the median rest across all players
df = df.sort_values(["player_id", "match_date"])
rest = df.groupby("player_id")["match_date"].diff().dt.days
median_rest = rest.median()  
df["rest_days"] = rest.fillna(median_rest)

# 4. Create a flag for back-to-back away games
#    We mark as 1 if a player played the day before (rest_days == 1)
#    and the match was away. Adjust the indicator logic to match your schema.
def away_indicator(row):
    """
    Return True if this match was an away game.
    Modify this function if your DataFrame uses different column names.
    """
    if "venue" in row:
        return str(row["venue"]).lower() == "away"
    elif "is_away" in row:
        return row["is_away"] == 1
    else:
        # If no clear column, default to True so we still capture the rest_days logic
        return True

# Apply the away_indicator per row and convert the boolean to 0/1
df["back_to_back_away"] = (
    ((df["rest_days"] == 1) & df.apply(away_indicator, axis=1))
    .astype(int)
)
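To see what the rest-day logic produces, here is a small worked example on hypothetical match dates for two players. It repeats the sort, `diff()` and median-fill steps from the cell above on a toy frame.

```python
import pandas as pd

# Hypothetical schedule: player 1 plays three matches, player 2 plays two.
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "match_date": pd.to_datetime(
        ["2024-01-01", "2024-01-04", "2024-01-05", "2024-01-02", "2024-01-09"]
    ),
})

toy = toy.sort_values(["player_id", "match_date"])
gap = toy.groupby("player_id")["match_date"].diff().dt.days  # NaN, 3, 1, NaN, 7
toy["rest_days"] = gap.fillna(gap.median())                  # median of 3, 1, 7 is 3

print(toy)
```

Each player's first appearance receives the median gap (3 days here), and only player 1's third match has `rest_days == 1`, so it is the only row that could be flagged as back-to-back.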



## 4. Select Features and Target

Choose the predictor variables and the target variable.

Create a separate copy of the selected columns to avoid modifying the original DataFrame `df`. This is important when performing transformations (e.g., scaling) that shouldn't affect `df`.

Finally, separate features (X) and target (y).

In [None]:
model_cols = [
    "position",          
    "conditions",        
    "age",               
    "rest_days",         
    "back_to_back_away", 
    "injury_occurred"    
]

model_df = df[model_cols].dropna().copy()
X = model_df.drop(columns="injury_occurred")
y = model_df["injury_occurred"].astype(int)


## 5. Train/Test Split

Split the data into training and test sets so the model is trained on one portion and evaluated on unseen data.

Stratification keeps the proportion of injury cases (positive vs. negative class) the same in both sets. This matters for imbalanced classification problems, where one class dominates and an unstratified split can leave the test set unrepresentative.

Adjust `test_size` to control how much data goes into the test set; common values range from 0.2 to 0.3 (i.e., 20–30% of the data used for testing).
`random_state` ensures reproducibility: the same value yields the same split every time. Any integer works.

In [180]:
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.30,
    stratify=y,
    random_state=42
)
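The effect of `stratify` can be checked on a toy label vector. The sketch below uses hypothetical labels with 30% positives and compares the class share in the resulting splits with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 positives, 70 negatives.
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Stratified split: class share is preserved in both parts.
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=42
)
# Unstratified split: class share depends on the random draw.
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42
)

print(f"stratified  : train {y_tr_s.mean():.3f}, test {y_te_s.mean():.3f}")
print(f"unstratified: train {y_tr_r.mean():.3f}, test {y_te_r.mean():.3f}")
```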


## 6. Encode Categorical Features

scikit-learn's Random-Forest needs numeric input, so categorical variables must be encoded first.
One-Hot-Encoding transforms a categorical feature (e.g., position or conditions) into multiple binary (0/1) features, one per category, each indicating whether a sample belongs to that category.
This allows the model to learn a separate split rule for each category.

Because the encoder below is created with `drop="first"`, the first category of each feature (in sorted order) is dropped and becomes the baseline: a row whose dummies for a feature are all 0 belongs to that baseline. The importances of the remaining categories are therefore read relative to the dropped category.

Repeat the following steps for every categorical column you have among your predictors.

In [None]:
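import re
import sklearn

# Prepare keyword arguments for OneHotEncoder depending on scikit-learn version.
# We drop the first category to use it as a baseline, and ensure a dense array output.
# `sparse_output` replaced `sparse` in scikit-learn 1.2. Compare numeric version
# parts: comparing the raw strings would sort "1.10" before "1.2".
ohe_kwargs = {"drop": "first"}
major, minor = map(int, re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())
if (major, minor) >= (1, 2):
    ohe_kwargs["sparse_output"] = False
else:
    ohe_kwargs["sparse"] = False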

# Initialize the encoder with our settings
ohe = OneHotEncoder(**ohe_kwargs)

# Fit the encoder on the training set and transform the 'position' and 'conditions' columns
# This creates new binary columns for each category
X_cat = ohe.fit_transform(X_tr[["position", "conditions"]])

# Retrieve the names of the newly created dummy columns
cat_cols = ohe.get_feature_names_out(["position", "conditions"])

# Build a DataFrame of the encoded features and join it back to the rest of X_tr
X_tr_enc = pd.DataFrame(X_cat, columns=cat_cols, index=X_tr.index)
X_tr     = pd.concat([X_tr_enc, X_tr.drop(columns=["position", "conditions"])], axis=1)

# Now apply the same transformation to the test set 
X_te_cat = ohe.transform(X_te[["position", "conditions"]])
X_te_enc = pd.DataFrame(X_te_cat, columns=cat_cols, index=X_te.index)
X_te     = pd.concat([X_te_enc, X_te.drop(columns=["position", "conditions"])], axis=1)

# Identify which categories were dropped 
baseline_position, baseline_conditions = ohe.categories_[0][0], ohe.categories_[1][0]
print(f"Baseline position   : {baseline_position}")
print(f"Baseline conditions : {baseline_conditions}")


Baseline position   : Defender
Baseline conditions : Cloudy
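What "baseline" means in practice can be seen on a few rows. The sketch below uses `pd.get_dummies(..., drop_first=True)` as a stand-in for `OneHotEncoder(drop="first")`; the position values are hypothetical examples, and the two tools are assumed to agree here because both order categories alphabetically.

```python
import pandas as pd

# Hypothetical positions for five players.
toy_position = pd.Series(
    ["Defender", "Forward", "Goalkeeper", "Midfielder", "Defender"],
    name="position",
)

# drop_first=True removes the first category in sorted order ("Defender").
dummies = pd.get_dummies(toy_position, prefix="position", drop_first=True, dtype=int)
print(dummies)
```

Rows 0 and 4 (the defenders) contain only zeros: the baseline category is encoded by the absence of every other dummy.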


## 7. Scale Numeric Features

Standardize numeric columns so they have mean = 0 and standard deviation = 1. The Random-Forest itself does not need this step, because tree splits are unaffected by rescaling a feature. It matters here because SMOTE (Section 8) picks neighbours by distance, and features with larger ranges would otherwise dominate those distances. The scaler is fitted on the training set only and then applied to the test set.

In [182]:
num_cols = ["age", "rest_days"]
scaler   = StandardScaler()

X_tr[num_cols] = scaler.fit_transform(X_tr[num_cols])
X_te[num_cols] = scaler.transform(X_te[num_cols])
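The transformation itself is a one-line formula, z = (x - mean) / std, with both statistics taken from the training data. A minimal NumPy sketch on hypothetical ages:

```python
import numpy as np

train_age = np.array([19.0, 24.0, 28.0, 33.0])   # hypothetical training ages
test_age = np.array([21.0, 35.0])                # hypothetical test ages

mu = train_age.mean()
sigma = train_age.std()   # population std (ddof=0), the convention StandardScaler uses

train_z = (train_age - mu) / sigma
test_z = (test_age - mu) / sigma   # reuse training statistics; never refit on test data

print("train z-scores:", train_z.round(3))
print("test  z-scores:", test_z.round(3))
```

The test z-scores do not have to average to zero, since they are expressed on the training set's scale.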



## 8. Oversampling with SMOTE

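Generate synthetic examples of the minority class (injured) so that both classes are equally represented in the training set. Only the training data is resampled; the test set keeps its original class distribution. Plain SMOTE interpolates every column, so synthetic rows can hold fractional values in the one-hot and 0/1 columns; imbalanced-learn's `SMOTENC` is designed for mixed numeric and categorical data and may be worth considering.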

In [183]:
smote = SMOTE(random_state=42, k_neighbors=5)
X_tr, y_tr = smote.fit_resample(X_tr, y_tr)
print("After SMOTE class counts:", np.bincount(y_tr))



After SMOTE class counts: [153 153]
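The core idea behind SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The NumPy sketch below is a simplified illustration on hypothetical data, not imbalanced-learn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
minority = rng.normal(size=(20, 2))   # hypothetical minority rows, 2 numeric features
k = 5                                 # same role as k_neighbors above

# Pick one minority sample and find its k nearest minority neighbours.
x = minority[rng.integers(len(minority))]
dist = np.linalg.norm(minority - x, axis=1)
neighbours = np.argsort(dist)[1:k + 1]   # position 0 is x itself (distance 0)

# Interpolate a random fraction of the way towards one chosen neighbour.
x_nn = minority[rng.choice(neighbours)]
u = rng.random()                          # uniform in [0, 1)
synthetic = x + u * (x_nn - x)

print("original :", x.round(3))
print("neighbour:", x_nn.round(3))
print("synthetic:", synthetic.round(3))
```

Because the synthetic point lies on the line segment between two real samples, binary columns would end up with values between 0 and 1.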




## 9. Train the Random-Forest Model

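Fit a Random-Forest on the oversampled training data. SMOTE has already equalised the class counts, so `class_weight="balanced"` assigns both classes the same weight and has no further effect here.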

In [184]:
rf = RandomForestClassifier(
    n_estimators=800,
    max_features="sqrt",
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)
rf.fit(X_tr, y_tr)
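scikit-learn documents the `"balanced"` weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the class counts printed after SMOTE and, for contrast, to a hypothetical imbalanced label vector (the 153/50 split is made up for illustration).

```python
import numpy as np

def balanced_weights(y):
    """Class weights as n_samples / (n_classes * count of each class)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

y_after_smote = np.array([0] * 153 + [1] * 153)   # counts printed after SMOTE
y_imbalanced = np.array([0] * 153 + [1] * 50)     # hypothetical counts for contrast

print(balanced_weights(y_after_smote))   # [1. 1.]
print(balanced_weights(y_imbalanced))    # minority class gets the larger weight
```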


## 10. Cross-Validation and Threshold Search

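Generate test-set predictions at the model's default decision threshold of 0.5. The threshold that maximises F1 can instead be searched inside 5-fold CV on the training set; the cell below does not perform that search.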

In [None]:
# Binary predictions using the model’s default threshold (0.5)
y_pred      = rf.predict(X_te)              # class labels: 0 (healthy) or 1 (injured)

# Predicted probabilities for the positive class (injury=1)
y_pred_prob = rf.predict_proba(X_te)[:, 1]   # continuous risk scores between 0 and 1
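A threshold search of the kind named in this section's heading could look like the sketch below. It is an assumption about how such a search might be set up, not the author's method: it uses synthetic data from `make_classification` as a stand-in for the training set, and all `*_demo` names are hypothetical. If oversampling is used, it would need to happen inside each training fold (for example with imbalanced-learn's `Pipeline`) to keep the out-of-fold scores honest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the training data (about 25% positives).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)

# Out-of-fold probabilities: each row is scored by a model that never saw it.
oof_prob = cross_val_predict(
    model, X_demo, y_demo, cv=cv, method="predict_proba"
)[:, 1]

prec, rec, thresholds = precision_recall_curve(y_demo, oof_prob)
# prec/rec have one more entry than thresholds; drop the final (1, 0) point.
p, r = prec[:-1], rec[:-1]
f1 = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)
best_threshold = thresholds[np.argmax(f1)]

print(f"Best out-of-fold F1 = {f1.max():.3f} at threshold = {best_threshold:.3f}")
```

The chosen value would then replace the default cut-off, e.g. `(rf.predict_proba(X_te)[:, 1] >= best_threshold).astype(int)`, while the test set stays untouched during the search.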

## 11. Final Evaluation on Test Set

In this phase we evaluate the trained model on the test set, which it has not seen during any training or internal validation stage.
We use several complementary metrics to understand different aspects of its performance:

- Classification report: displays precision, recall, and F1-score for each class.
- Confusion matrix: breaks down true positives/negatives and false positives/negatives.
- ROC-AUC: area under the ROC curve, measuring the model's ability to separate positive and negative classes across different thresholds.
- PR-AUC: area under the Precision-Recall curve, which is more informative than ROC-AUC when the classes are imbalanced.

In [186]:
print("\n=== TEST METRICS ===")
print(classification_report(y_te, y_pred, digits=3))

print("Confusion Matrix:\n", confusion_matrix(y_te, y_pred))
print("ROC-AUC :", roc_auc_score(y_te, y_pred_prob))

precision, recall, _ = precision_recall_curve(y_te, y_pred_prob)
print("PR-AUC  :", auc(recall, precision))




=== TEST METRICS ===
              precision    recall  f1-score   support

           0      0.742     0.754     0.748        65
           1      0.333     0.320     0.327        25

    accuracy                          0.633        90
   macro avg      0.538     0.537     0.537        90
weighted avg      0.629     0.633     0.631        90

Confusion Matrix:
 [[49 16]
 [17  8]]
ROC-AUC : 0.5556923076923076
PR-AUC  : 0.41682496246398176
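The figures for the injured class in the report follow directly from the confusion matrix. The arithmetic can be reproduced as follows, using the matrix printed above (rows are true classes, columns are predicted classes).

```python
import numpy as np

cm = np.array([[49, 16],
               [17,  8]])          # rows = true class, columns = predicted class
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)       # 8 / 24
recall_1 = tp / (tp + fn)          # 8 / 25
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()    # 57 / 90

print(f"precision={precision_1:.3f} recall={recall_1:.3f} "
      f"f1={f1_1:.3f} accuracy={accuracy:.3f}")
# precision=0.333 recall=0.320 f1=0.327 accuracy=0.633
```

Only 8 of the 25 injured players in the test set are caught, which is why recall for class 1 is the weakest number in the report.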


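## 12. Interpret Feature Importances
See which features the Random-Forest relied on most. The scores are impurity-based: they sum to 1 and rank how much each feature contributed to the splits, but they do not show whether a feature raises or lowers injury risk.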

In [189]:
imp_df = (
    pd.DataFrame({
        "feature":    X_tr.columns,
        "importance": rf.feature_importances_
    })
    .sort_values("importance", ascending=False)
)

print("\nTOP 20 FEATURES")
print(imp_df.head(20))

print(f"\n Position dummies (baseline = {baseline_position})")
print(imp_df[imp_df.feature.str.startswith("position_")])

print(f"\n Weather dummies  (baseline = {baseline_conditions})")
print(imp_df[imp_df.feature.str.startswith("conditions_")])



TOP 20 FEATURES
                feature  importance
9             rest_days    0.283787
8                   age    0.223292
7      conditions_Windy    0.097744
5       conditions_Snow    0.082768
0      position_Forward    0.077368
1   position_Goalkeeper    0.070786
3        conditions_Fog    0.055915
6      conditions_Sunny    0.037429
4       conditions_Rain    0.035557
2   position_Midfielder    0.035353
10    back_to_back_away    0.000000

 Position dummies (baseline = Defender)
               feature  importance
0     position_Forward    0.077368
1  position_Goalkeeper    0.070786
2  position_Midfielder    0.035353

 Weather dummies  (baseline = Cloudy)
            feature  importance
7  conditions_Windy    0.097744
5   conditions_Snow    0.082768
3    conditions_Fog    0.055915
6  conditions_Sunny    0.037429
4   conditions_Rain    0.035557
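One-hot encoding spreads a categorical variable's importance over several dummy columns, which makes it look smaller than a single numeric column. Summing the dummies per source variable is a common convention (an assumption here, not part of the original analysis) for comparing variables as a whole. The sketch below applies it to the importances printed above; `source_variable` is a hypothetical helper.

```python
import pandas as pd

# Importances copied from the output above.
imp = pd.Series({
    "rest_days": 0.283787, "age": 0.223292,
    "conditions_Windy": 0.097744, "conditions_Snow": 0.082768,
    "position_Forward": 0.077368, "position_Goalkeeper": 0.070786,
    "conditions_Fog": 0.055915, "conditions_Sunny": 0.037429,
    "conditions_Rain": 0.035557, "position_Midfielder": 0.035353,
    "back_to_back_away": 0.000000,
})

def source_variable(name):
    """Map a dummy column back to the categorical variable it came from."""
    for prefix in ("position", "conditions"):
        if name.startswith(prefix + "_"):
            return prefix
    return name

# Group by the index labels and add up the importances within each group.
grouped = imp.groupby(source_variable).sum().sort_values(ascending=False)
print(grouped.round(3))
```

On these values the five weather dummies together account for about 0.309 and the three position dummies for about 0.184, compared with 0.284 for `rest_days` and 0.223 for `age`.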


## 13. Conclusion

**Reference for position:** Defender

**Reference for conditions:** Cloudy

These readings rest on impurity-based importances, which rank how much the forest relied on each feature but do not show the direction of an effect. With a test ROC-AUC of 0.556, close to the 0.5 of random guessing, they should be treated as tentative.

* **Rest days**: The most important feature (0.284). The working interpretation is that shorter rest periods go with higher injury risk.
* **Age**: The second most important feature (0.223). The working interpretation is that older players are injured more often than younger players.
* **Weather conditions (vs. Cloudy)**:
  - **Windy** and **Snow** carry the largest importances among the weather dummies.
  - **Fog**, **Sunny** and **Rain** follow with smaller importances.
* **Player position (vs. Defender)**:
  - **Forward** has the highest importance among the position dummies.
  - **Goalkeeper** follows, then **Midfielder**.
* **Back-to-back-away**: This flag has an importance of 0.000 in its current form and could be removed or redefined.
