# Introduction to Logistic Regression for Injury Prediction

**What is Logistic Regression?**  
Logistic regression is a method for modeling binary outcomes (e.g., injury yes/no). It estimates the probability that an input belongs to the positive class.
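
As a minimal sketch of how the model turns inputs into a probability: it computes a weighted sum of the features (the log-odds) and passes it through the logistic (sigmoid) function. The intercept, coefficients, and feature values below are made-up numbers for illustration, not values from this notebook's data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept: float, coefs: list[float], features: list[float]) -> float:
    """Logistic regression prediction: sigmoid of the weighted feature sum."""
    log_odds = intercept + sum(c * x for c, x in zip(coefs, features))
    return sigmoid(log_odds)

# Hypothetical model with two features (illustrative values only).
p = predict_probability(intercept=-1.0, coefs=[0.5, -0.2], features=[2.0, 1.0])
label = int(p >= 0.5)  # a 0.5 threshold turns the probability into a class label
print(f"probability={p:.3f}, label={label}")
```

A log-odds value of 0 corresponds to a probability of exactly 0.5; positive log-odds push the probability toward 1, negative ones toward 0.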

**Why use it?**  
- It is designed for binary classification tasks.  
- Provides interpretable coefficients indicating the impact of features on injury risk.  
- Allows comparison of categorical predictors relative to a baseline category.

**Outputs:**  
- Predicted probabilities for the target variable.  
- Predicted class labels for the target variable (0 or 1).  
- Model coefficients showing each feature's effect on the target variable.

**Why analyze position effects on injuries?**  
Player position is a fundamental aspect of soccer tactics. Different positions involve different running loads, collision frequencies, and playing styles. Knowing which positions carry higher or lower injury risks helps coaches and medical staff:
1. **Target prevention**: Tailor conditioning or recovery programs by position.  
2. **Squad planning**: Manage rotations to minimize risk for vulnerable roles.  
3. **Data-driven decisions**: Allocate resources (e.g., physiotherapists) where they have the highest impact.

## 1. Imports and Data Loading

Import necessary Python libraries.

In [348]:
import pandas as pd
import numpy as np
from sklearn.preprocessing    import OneHotEncoder, StandardScaler
from sklearn.model_selection  import train_test_split
from sklearn.linear_model     import LogisticRegression
from imblearn.over_sampling   import SMOTE
from sklearn.metrics          import (
    accuracy_score, precision_score, recall_score,
    roc_auc_score, classification_report
)

## 2. Build the Combined Dataset

Load your datasets, and merge the important tables into a single DataFrame for easier feature engineering and modeling.

In [349]:
# Replace this data import with your real data.
df_player_stats = pd.read_csv('player_stats_table.csv')
df_player       = pd.read_csv('player_table.csv')
df_match        = pd.read_csv('match_table.csv')

df = (
    df_player_stats
      .merge(df_player, on='player_id', how='left')
      .merge(df_match,  on='match_id',  how='left')
)
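
A left merge silently multiplies rows when the right-hand table contains duplicate keys. The sketch below, using small made-up tables in place of the CSV files, shows one way to guard against that with the `validate` argument of `DataFrame.merge` and a row-count check. The toy columns other than `player_id`, `match_id`, `position`, and `sprints` are assumptions.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are invented).
stats = pd.DataFrame({
    'player_id': [1, 2, 1],
    'match_id':  [10, 10, 11],
    'sprints':   [14, 9, 17],
})
players = pd.DataFrame({'player_id': [1, 2], 'position': ['Defender', 'Forward']})
matches = pd.DataFrame({'match_id': [10, 11], 'is_away': [0, 1]})

# validate='many_to_one' raises pandas.errors.MergeError if a key is
# duplicated in the right-hand table, instead of silently adding rows.
merged = (
    stats
      .merge(players, on='player_id', how='left', validate='many_to_one')
      .merge(matches, on='match_id',  how='left', validate='many_to_one')
)

# A left merge on unique right-hand keys must keep the row count unchanged.
assert len(merged) == len(stats)
# Rows whose key had no match would show up as missing values here.
print(merged.isna().sum())
```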


## 3. Feature Engineering

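Create new features that are not directly available in the raw data.  
Example: Calculate each player's approximate age in years from their birthdate. Dividing by 365 ignores leap days, and measuring against today's date means the value changes whenever the notebook is rerun.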

In [350]:
df['age'] = (pd.Timestamp('today') - pd.to_datetime(df['birthdate'])).dt.days // 365

# Test if the values have been computed correctly.
print(df[['age']].head())

   age
0   35
1   24
2   29
3   26
4   39
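
Because the cell above measures age against today's date, the feature does not reflect how old the player was when the match took place. A possible alternative is to compute the exact age on a reference date such as the match date. The helper `age_on` below is a hypothetical name, and whether your match table has a date column is an assumption about your data.

```python
from datetime import date

def age_on(birthdate: date, reference: date) -> int:
    """Completed years between birthdate and reference (e.g., the match date)."""
    # Subtract one year if the birthday has not yet occurred in the reference year.
    had_birthday = (reference.month, reference.day) >= (birthdate.month, birthdate.day)
    return reference.year - birthdate.year - (0 if had_birthday else 1)

# Illustrative dates only.
day_before = age_on(date(1990, 6, 15), date(2024, 6, 14))  # one day before the birthday
on_birthday = age_on(date(1990, 6, 15), date(2024, 6, 15))  # on the birthday itself
print(day_before, on_birthday)
```

With pandas, a function like this could be applied row by row once a match date column is available.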


## 4. Select and Split Features and Target

Choose the predictor variables and the target variable.

Create a separate copy of the selected columns to avoid modifying the original DataFrame `df`. This is important when performing transformations (e.g., scaling) that shouldn't affect `df`.

Finally, separate features (X) and target (y).

In [351]:
# Change 'model_columns' to include whatever predictors you need.
model_columns = [
    'position',
    'age',
    'minutes_played',
    'distance_covered_in_meters',
    'sprints',
    'rest_days_since_last_match',
    'is_back_to_back_away_game',
    'injury_occurred'   # target: 1=injured, 0=healthy.
]

model_df = df[model_columns].copy()

X = model_df.drop(columns='injury_occurred')
y = model_df['injury_occurred']

## 5. Train/Test Split

Split the data into training and test sets so that the model is trained on one portion and evaluated on unseen data.

Stratification (`stratify=y`) keeps the proportion of injury cases (positive vs. negative class) approximately the same in the training and test sets. This is especially important for imbalanced classification problems, where one class dominates: without it, a random split could leave the test set with too few injuries to be representative.

You can adjust 'test_size' to control how much data goes into the test set. Common values range from 0.2 to 0.3 (i.e., 20–30% of the data used for testing).
The 'random_state' ensures reproducibility: using the same value will yield the same split every time. You can choose any integer value.


In [352]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
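
As a quick check of what stratification does, the sketch below splits an invented target with 20% positives and prints the positive share in each part. The toy arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target: 100 samples, 20% positives (invented data).
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both parts should stay close to the overall 20%.
print(f"positive share overall: {y_toy.mean():.2f}")
print(f"positive share train  : {y_tr.mean():.2f}")
print(f"positive share test   : {y_te.mean():.2f}")
```

Running the same check on the real `y_train` and `y_test` is a cheap way to confirm the split behaved as intended.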

## 6. Encode Categorical Features

Many machine learning models cannot handle categorical variables directly.
One-hot encoding transforms a categorical feature (e.g., 'position') into multiple binary (0/1) features, one per category, each indicating whether a sample belongs to that category.
This allows the model to learn a separate coefficient for each category.

Here, one category is dropped and becomes the baseline. Without the drop, the dummy columns would always sum to 1 and would be redundant with the model's intercept. The coefficients of the remaining categories are interpreted relative to this dropped baseline category.

Repeat the following steps for every categorical column you have among your predictors.

In [353]:
# Show all categories to decide which to drop.
print("All positions in the dataset:")
print(np.sort(df['position'].unique()))

# Apply One-Hot-Encoding to the 'position' column.
# Choose in the 'drop' argument which category you want to use as the reference category.
encoder = OneHotEncoder(drop=['Goalkeeper'], sparse_output=False)
encoder.fit(X_train[['position']])
pos_cols = encoder.get_feature_names_out(['position'])
X_train_pos = pd.DataFrame(
    encoder.transform(X_train[['position']]),
    columns=pos_cols,
    index=X_train.index
)
X_test_pos = pd.DataFrame(
    encoder.transform(X_test[['position']]),
    columns=pos_cols,
    index=X_test.index
)

# Show which category was dropped (= baseline) for interpretation later.
# Read it from the fitted encoder: 'drop_idx_' holds the index of the dropped
# category, which also works when category names contain underscores.
reference_position = encoder.categories_[0][encoder.drop_idx_[0]]
print(f"\nReference (baseline) position used for comparison: {reference_position}")

# Remove original 'position' column and append the dummies that have just been created.
X_train = pd.concat([X_train.drop(columns='position'), X_train_pos], axis=1)
X_test  = pd.concat([X_test.drop(columns='position'),  X_test_pos],  axis=1)

All positions in the dataset:
['Defender' 'Forward' 'Goalkeeper' 'Midfielder']

Reference (baseline) position used for comparison: Goalkeeper
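
To make the role of the baseline concrete, the sketch below encodes positions by hand, in plain Python rather than with `OneHotEncoder`. The function name `encode_position` is hypothetical. A goalkeeper is represented by zeros in every dummy column, which is why the other coefficients are read as differences from goalkeepers.

```python
CATEGORIES = ['Defender', 'Forward', 'Goalkeeper', 'Midfielder']
BASELINE = 'Goalkeeper'
# One dummy column per category except the baseline.
DUMMY_COLUMNS = [f'position_{c}' for c in CATEGORIES if c != BASELINE]

def encode_position(position: str) -> dict[str, int]:
    """Return the 0/1 dummy values for a single position."""
    if position not in CATEGORIES:
        raise ValueError(f'unknown position: {position}')
    return {col: int(col == f'position_{position}') for col in DUMMY_COLUMNS}

for pos in ['Defender', 'Goalkeeper']:
    print(pos, encode_position(pos))
```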


## 7. Scale Numeric Features

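Standardize numeric columns so they have mean = 0 and standard deviation = 1. This matters because scikit-learn's `LogisticRegression` applies L2 regularization by default, and the penalty acts on coefficient size: without scaling, a feature measured in large units (e.g., distance in meters) is penalized differently from one measured in small units (e.g., age in years).
Standardization also puts the coefficients on a comparable scale: each one describes the effect of a one-standard-deviation change in its feature. The scaler is fitted on the training set only, so no information from the test set leaks into training.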

In [354]:
numeric_cols = [
    'age',
    'minutes_played',
    'distance_covered_in_meters',
    'sprints',
    'rest_days_since_last_match'
]

scaler = StandardScaler()
scaler.fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]  = scaler.transform(X_test[numeric_cols])
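
The following sketch reproduces the arithmetic behind the scaling step using only the standard library, on made-up sprint counts. Like `StandardScaler`, it uses the population standard deviation (dividing by n), and it transforms the test values with the training statistics rather than their own.

```python
from statistics import fmean, pstdev

train_sprints = [8.0, 12.0, 15.0, 9.0, 16.0]   # invented values
test_sprints  = [10.0, 20.0]

# "fit": learn mean and standard deviation from the training data only.
mu = fmean(train_sprints)
sigma = pstdev(train_sprints)

def standardize(values: list[float]) -> list[float]:
    """Apply the training mean and standard deviation to any values."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train_sprints)
test_scaled = standardize(test_sprints)

print(f"train mean after scaling: {abs(fmean(train_scaled)):.2f}")
print(f"train std after scaling : {pstdev(train_scaled):.2f}")
print(f"test values after scaling: {[round(v, 2) for v in test_scaled]}")
```

The scaled test values will generally not have mean 0 and standard deviation 1, which is expected.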

## 8. Oversampling

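Generate synthetic examples of the minority class (injuries) so that the training classes are more balanced. Only the training set is resampled; the test set keeps its original class distribution.  
You can change the value of 'k_neighbors' to see how it influences the model's evaluation metrics.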

In [355]:
smote = SMOTE(random_state=42, k_neighbors=30)
X_train, y_train = smote.fit_resample(X_train, y_train)
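
SMOTE creates a synthetic sample by picking a minority-class point, choosing one of its nearest minority-class neighbors, and placing a new point at a random position on the line segment between them. The sketch below shows only that interpolation step in plain Python with invented feature vectors. It is not the `imblearn` implementation and it skips the neighbor search.

```python
import random

def interpolate(sample: list[float], neighbor: list[float], rng: random.Random) -> list[float]:
    """Return a point at a random position on the segment sample -> neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)
# Two injured players: [scaled sprints, position_Defender dummy] (invented values).
injured_a = [0.8, 1.0]
injured_b = [1.4, 0.0]

synthetic = interpolate(injured_a, injured_b, rng)
print(synthetic)
```

Note that the dummy column of the synthetic point is a fraction between 0 and 1 rather than a valid 0/1 value. If that matters for your analysis, `imblearn` also provides `SMOTENC`, which is intended for data that mixes numeric and categorical features.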

## 9. Train Logistic Regression Model

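Train logistic regression on the resampled training data. Because SMOTE has already balanced the two classes, `class_weight='balanced'` assigns both classes the same weight and has no additional effect here.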

In [356]:
model = LogisticRegression(solver='liblinear', class_weight='balanced')
model.fit(X_train, y_train)

## 10. Make Predictions and Evaluate Performance

Predict the class label and the probability of the target variable on the test set.  
We compute both y_pred and y_pred_prob because:
- y_pred gives the final binary classification ("Injury YES (1) or NO (0)"), used for metrics like accuracy, precision, and recall.
- y_pred_prob gives the predicted probability of the positive class (injury = 1), used for metrics like ROC AUC.  

The metrics are:
- Accuracy: the share of all predictions that are correct.
- Precision: the share of predicted injuries that were actual injuries. Important when false positives are costly (e.g., falsely predicting an injury).
- Recall: the share of actual injuries that the model detected. Important when missing a positive case is costly (e.g., overlooking a real injury).
- ROC AUC: measures how well the model ranks positives above negatives. A value close to 1.0 means the model separates the classes well; 0.5 corresponds to random guessing.

Having both predictions allows us to evaluate the model's performance comprehensively.
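
Accuracy, precision, and recall can be computed directly from the four cells of a confusion matrix. The sketch below does so in plain Python for a small invented set of labels; the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true: list[int], y_pred: list[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 = injury."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

# Invented example: 3 real injuries among 10 player-matches.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy)
accuracy  = (tp + tn) / len(y_true_toy)   # correct predictions / all predictions
precision = tp / (tp + fp)                # predicted injuries that were real
recall    = tp / (tp + fn)                # real injuries that were found

print(f"Accuracy : {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall   : {recall:.2f}")
```

In this toy example, a model that always predicts "no injury" would also reach 0.70 accuracy while finding no injuries at all, which is why accuracy alone is a weak guide on imbalanced data.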


In [357]:
y_pred      = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

print(f"Accuracy : {accuracy_score(y_test, y_pred):.2f}")
print(f"Precision: {precision_score(y_test, y_pred):.2f}")
print(f"Recall   : {recall_score(y_test, y_pred):.2f}")
print(f"ROC AUC  : {roc_auc_score(y_test, y_pred_prob):.2f}")

Accuracy : 0.57
Precision: 0.33
Recall   : 0.52
ROC AUC  : 0.56


## 11. Interpret Model Coefficients

Coefficients show how each feature affects the target variable (here: injury occurrence) in terms of log-odds.  

- Numeric features (e.g., 'sprints', 'age') were standardized in step 7, so their coefficients give the change in injury log-odds per one-standard-deviation increase, not per raw unit. Example: A coefficient of 0.3 for 'sprints' means that a one-standard-deviation increase in sprints multiplies the odds of injury by exp(0.3) ≈ 1.35.
- Categorical features (like 'position') are interpreted relative to the baseline category ('Goalkeeper'). A negative coefficient means a lower injury risk compared to the reference, while a positive one means a higher risk.

In [358]:
coef_df = pd.DataFrame({
    'feature':     X_train.columns,
    'coefficient': model.coef_[0]
}).sort_values(by='coefficient', ascending=False)

print("Feature impacts on injury risk (log-odds):")
print(coef_df)

Feature impacts on injury risk (log-odds):
                      feature  coefficient
6           position_Defender     0.583332
7            position_Forward     0.430203
3                     sprints     0.296763
4  rest_days_since_last_match     0.104831
5   is_back_to_back_away_game     0.034987
1              minutes_played    -0.021885
8         position_Midfielder    -0.215823
0                         age    -0.235150
2  distance_covered_in_meters    -0.241319
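
Log-odds are hard to read directly. Exponentiating a coefficient gives an odds ratio, the factor by which the odds of injury are multiplied. The sketch below applies this to the position coefficients, with values copied from the output above.

```python
import math

# Position coefficients copied from the output above (log-odds vs. Goalkeeper).
position_coefs = {
    'position_Defender':   0.583332,
    'position_Forward':    0.430203,
    'position_Midfielder': -0.215823,
}

# exp(coefficient) = odds ratio relative to the baseline category.
odds_ratios = {name: math.exp(coef) for name, coef in position_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<20} odds ratio = {ratio:.2f}")
```

An odds ratio above 1 means higher odds of injury than the baseline, and a value below 1 means lower odds. Odds ratios are not probability ratios; the two are only close when the outcome is rare.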


## 12. Filter the Results

Filter only the 'position' coefficients to see the influence of the position on the occurrence of injuries.

In [359]:
position_coef_df = coef_df[coef_df['feature'].str.startswith('position_')]
print("\nImpact of player position on injury risk:")
print(position_coef_df.sort_values(by='coefficient', ascending=False))


Impact of player position on injury risk:
               feature  coefficient
6    position_Defender     0.583332
7     position_Forward     0.430203
8  position_Midfielder    -0.215823


## 13. Conclusion

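- Defenders have the largest positive coefficient (0.58): their estimated odds of injury are about exp(0.58) ≈ 1.8 times those of goalkeepers, holding the other features fixed.
- Forwards also have higher estimated odds of injury than goalkeepers (about exp(0.43) ≈ 1.5 times), but lower than defenders.
- Midfielders have slightly lower estimated odds of injury than goalkeepers (about exp(-0.22) ≈ 0.8 times).
- These are point estimates without significance tests, and the model separates the classes only weakly (ROC AUC 0.56), so the ranking should be treated as tentative.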