# 1. Importing libraries

In [1]:
# Data processing  
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np
import pickle

# Path configuration for custom module imports  
# -----------------------------------------------------------------------
import sys  
sys.path.append('../')  # Adds the parent directory to the path for custom module imports  

# Custom functions and classes
# -----------------------------------------------------------------------
from src.preprocess import *
from src.race_prediction_model.classification import ClassificationModels

# 2. Data loading

In [2]:
df = pd.read_csv('../data/output/featured_results.csv', index_col=0)

In [3]:
df.columns

Index(['DriverId', 'TeamId', 'Position', 'GridPosition', 'Time', 'Status',
       'Points', 'round', 'circuitId', 'Winner', 'Podium', 'MeanPreviousGrid',
       'MeanPreviousPosition', 'CurrentDriverWins', 'CurrentDriverPodiums'],
      dtype='object')

Since we want to predict whether a driver will win a race, we need to remove the columns that contain information about the race result, as we cannot provide input data about something that has not yet happened.

Our target variable in this case is `Winner` (it could be `Podium` if we want to predict if a driver will finish on the podium, `Position` if we want to predict the exact position, etc.).

Therefore, we can remove `Position`, `Time`, `Status`, `Points`, `Podium`. The rest of the variables can be known before the race takes place.

In [4]:
target = 'Winner'

drop = 'Winner' if target == 'Podium' else 'Podium'

df.drop(columns=['Position', 'Time', 'Status', 'Points', drop], inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6433 entries, 8 to 11
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   DriverId              6433 non-null   object 
 1   TeamId                6433 non-null   object 
 2   GridPosition          6433 non-null   float64
 3   round                 6433 non-null   int64  
 4   circuitId             6433 non-null   object 
 5   Winner                6433 non-null   int64  
 6   MeanPreviousGrid      6433 non-null   float64
 7   MeanPreviousPosition  6433 non-null   float64
 8   CurrentDriverWins     6433 non-null   int64  
 9   CurrentDriverPodiums  6433 non-null   int64  
dtypes: float64(3), int64(4), object(3)
memory usage: 552.8+ KB


We can also check explicitly for null values:

In [6]:
df.isna().sum()

DriverId                0
TeamId                  0
GridPosition            0
round                   0
circuitId               0
Winner                  0
MeanPreviousGrid        0
MeanPreviousPosition    0
CurrentDriverWins       0
CurrentDriverPodiums    0
dtype: int64

# 3. Preprocess

In [7]:
df.select_dtypes(include='O').columns

Index(['DriverId', 'TeamId', 'circuitId'], dtype='object')

### Encoding

We only need to encode the columns `DriverId`, `TeamId`, and `circuitId`.

* `DriverId`: We will apply target encoding since we want to give more weight to drivers with more victories.

* `TeamId`: We will apply target encoding since we want to give more weight to teams with more victories.

* `circuitId`: We will use ordinal encoding, as the circuits carry no meaning of their own beyond the fact that some teams or drivers perform better at certain circuits than at others.
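
To make the two encodings concrete, here is a minimal sketch in plain Python. It assumes that target encoding replaces each category with the mean of the target observed for that category, and that ordinal encoding maps each category to its index in a given list. The internals of `preprocess` are not shown in this notebook, so the helper names (`target_encode`, `ordinal_encode`) and the toy rows are hypothetical.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories], means

def ordinal_encode(categories, order):
    """Replace each category with its position in `order`."""
    index = {cat: i for i, cat in enumerate(order)}
    return [index[cat] for cat in categories]

# Hypothetical toy data, not taken from featured_results.csv
drivers = ["verstappen", "hamilton", "verstappen", "albon", "hamilton", "verstappen"]
winner = [1, 0, 1, 0, 1, 0]
circuits = ["monza", "spa", "monza", "suzuka", "spa", "suzuka"]

encoded_drivers, driver_means = target_encode(drivers, winner)

# Unique circuits in order of first appearance, similar in spirit to .unique().tolist()
circuit_order = list(dict.fromkeys(circuits))
encoded_circuits = ordinal_encode(circuits, circuit_order)

print(driver_means)
print(encoded_circuits)
```

Drivers with a higher share of wins receive a larger encoded value, which is the weighting effect described above. In practice, a target encoder should be fitted on the training data only, so that the target does not leak into the test features.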

### Scaling

Since our dataset has very few outliers and no extremely high values, we will use a `MinMax` scaler.
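
Min-max scaling maps each feature to the range [0, 1] by computing `(x - min) / (max - min)`. The sketch below shows that formula in plain Python. The function name `minmax_scale` and the sample values are hypothetical, and the scaler actually used inside `preprocess` is not shown here.

```python
def minmax_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical grid positions
grid_positions = [1.0, 5.0, 10.0, 20.0]
scaled = minmax_scale(grid_positions)
print(scaled)
```

Because the minimum and maximum define the whole range, a single extreme value would compress all the other values into a narrow band, which is why the absence of strong outliers matters for this choice.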

In [17]:
encoding_methods = {"onehot": [],
                    "target": ['DriverId', 'TeamId'],
                    "ordinal" : {
                        'circuitId': df['circuitId'].unique().tolist()
                        },
                    "frequency": []
                    }
scaling = 'minmax'

df_encoded, df_scaled = preprocess(df, encoding_methods, scaling, target_variable=target, save_objects=True) 

# 4. Model selection

We are facing a binary classification problem in which we aim to predict whether a driver will win a race or not based on various data. Therefore, we will test one of the simplest classification algorithms, `logistic regression`, alongside a more sophisticated model, `XGBoost`.

Regarding the model metrics, `precision` is the key metric, as it is more important for us to ensure that if we predict a driver will win, they actually do, even at the cost of sometimes missing drivers who will win. However, we will also consider `f1_score`, which provides a balance between both situations.
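
For reference, these metrics have the standard definitions below, where $TP$, $FP$, and $FN$ are the counts of true positives, false positives, and false negatives for the positive class (`Winner = 1`):

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

Precision falls with every false positive (a predicted winner who did not win), while recall falls with every false negative (an actual winner the model missed).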

### Model training

First, we instantiate the `ClassificationModels` class with the scaled dataset and the target variable to predict.

In [13]:
models = ClassificationModels(df_scaled, target)

We create an empty dataframe where we will store the metric results.

In [14]:
df_results = pd.DataFrame()

## 4.1 Logistic regression

In [15]:
model = "logistic_regression"

# Fit model
models.fit_model(model, file_name=model, cross_validation=10)

# Get metrics and store them
df_current_results = models.get_metrics(model)
df_current_results["model"] = model
df_results = pd.concat([df_results, df_current_results], axis=0)

### 4.1.1 Metrics

In [16]:
df_current_results.round(3)

Unnamed: 0,accuracy,precision,recall,f1,kappa,auc,average_precision,time_seconds,model
train,0.963,0.692,0.465,0.556,0.538,0.965,0.643,2.196,logistic_regression
test,0.96,0.474,0.367,0.414,0.394,0.954,0.49,2.196,logistic_regression


The `Logistic Regression` model achieves a high accuracy of 96% and an AUC-ROC of 0.95, indicating strong class discrimination.

However, performance drops when evaluating the test set, with a decrease in precision (from 0.692 to 0.474) and recall (from 0.465 to 0.367). This suggests potential overfitting to the training data, reducing the model's generalization ability.

Additionally, these metric values are not particularly strong, so we will aim to improve them using XGBoost and class balancing.

### 4.1.2 Confusion matrix

In [None]:
models.plot_confusion_matrix(model, size=(6,5))

The `confusion matrix` for `Logistic Regression` shows that the model performs well on the negative class (0) but struggles with the positive class (1):

- True negatives (TN): 1218 → Correctly classified as 0.
- False positives (FP): 20 → Incorrectly classified as 1.
- False negatives (FN): 31 → Cases of class 1 misclassified as 0.
- True positives (TP): 18 → Correctly classified as 1.

The model has high precision for class 0, but its recall for class 1 is low, capturing only 18 out of 49 positive cases (≈36.7%). This indicates difficulty in correctly identifying the minority class.
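
The figures quoted above can be checked directly from the four counts. The sketch below recomputes the test metrics from the confusion matrix. The function name `metrics_from_counts` is introduced here for illustration and is not part of `ClassificationModels`.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }

# Counts read from the Logistic Regression confusion matrix above
lr_test = metrics_from_counts(tn=1218, fp=20, fn=31, tp=18)
print({name: round(value, 3) for name, value in lr_test.items()})
```

Rounded to three decimals, these values agree with the `test` row of the metrics table in section 4.1.1.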

### 4.1.3 Feature importance

In [None]:
models.plot_predictors_importance(model, size=(6,5))

Regarding feature importance, we observe results that make sense. In absolute terms, the most significant factors are `starting grid position` and `average position in the previous races` (specifically, the last 3 races in this case).

Additionally, we see that `the number of current wins` and the `driver ID` are important features. It's worth noting that the driver ID was encoded using `target encoding`, which explains why drivers with the most wins are more likely to be predicted as winners.

In [None]:
models.plot_shap_summary(model)

In the `SHAP diagram`, we can better visualize the impact of each feature on the model, further reinforcing our previous conclusions.  

Interestingly, while the most influential feature is the starting grid position for the current race, the previous grid positions appear to have little to no effect.

## 4.2 XGBoost

In [None]:
model = "xgboost"

# Fit model
models.fit_model(model, file_name=model, cross_validation=10)

# Get metrics and store them
df_current_results = models.get_metrics(model)
df_current_results["model"] = model
df_results = pd.concat([df_results, df_current_results], axis=0)

### 4.2.1 Metrics

In [None]:
df_current_results.round(3)

The `XGBoost model` achieves a high accuracy of 97.4% and an AUC-ROC of 0.967, indicating excellent class discrimination, slightly improving over `Logistic Regression` in both metrics.  

Compared to Logistic Regression, XGBoost exhibits better precision on the test set (0.75 vs. 0.47), meaning it makes fewer false positive predictions. Recall remains modest, indicating that the model still fails to identify many positive cases, but it is better than that of Logistic Regression (0.49 vs. 0.37). The F1-score (0.593) also improves on Logistic Regression (0.414), demonstrating a more balanced trade-off between precision and recall.

While overfitting is still present, XGBoost provides a more **robust performance overall**. To further enhance results, we could explore hyperparameter tuning and class balancing techniques.

### 4.2.2 Confusion matrix

In [None]:
models.plot_confusion_matrix(model, size=(6,5))

The `confusion matrix` for `XGBoost` shows that the model performs well in classifying the negative class (0) but still struggles with the positive class (1):

- True negatives (TN): 1230 → Correctly classified as 0.  
- False positives (FP): 8 → Incorrectly classified as 1.  
- False negatives (FN): 25 → Class 1 cases misclassified as 0.  
- True positives (TP): 24 → Correctly classified as 1.  

Compared to Logistic Regression, XGBoost significantly reduces false positives (from 20 to 8), improving precision. However, it still struggles with recall, as it only captures 24 out of 49 actual positive cases (≈49%). While this is an improvement over Logistic Regression, it suggests that the model still misses many true winners. 
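
Because winners are rare, accuracy is best read against a trivial baseline. The sketch below uses only the confusion-matrix counts quoted in this notebook and compares each model's test accuracy with that of a hypothetical classifier that always predicts class 0.

```python
# Test-set confusion-matrix counts quoted in the notebook: (TN, FP, FN, TP)
counts = {
    "logistic_regression": (1218, 20, 31, 18),
    "xgboost": (1230, 8, 25, 24),
}

def accuracy(tn, fp, fn, tp):
    """Share of all predictions that are correct."""
    return (tn + tp) / (tn + fp + fn + tp)

def baseline_accuracy(tn, fp, fn, tp):
    """Accuracy obtained by always predicting the negative class."""
    return (tn + fp) / (tn + fp + fn + tp)

for name, c in counts.items():
    print(
        f"{name}: accuracy={accuracy(*c):.3f}, "
        f"always-0 baseline={baseline_accuracy(*c):.3f}"
    )
```

On these counts the baseline reaches about 0.962, so the accuracy of Logistic Regression sits slightly below it and that of XGBoost slightly above. This is why precision, recall, and F1 are the more informative metrics for this problem.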

### 4.2.3 Feature importance

In [None]:
models.plot_predictors_importance(model, size=(6,5))

In [None]:
models.plot_shap_summary(model)

In this case, we observe that `previous finishing positions` have greater importance than the `starting grid position` for the current race.

Additionally, the `starting grid positions from previous races` and the `circuit` now carry more weight compared to the previous model, whereas current podiums and victories have less influence.

In [None]:
df_results.round(3)

# 5. Class balancing

### Addressing Class Imbalance  

One of the main challenges we face is **class imbalance**. This is expected, as a Formula 1 race typically features 20 drivers, but only one wins, meaning the minority class represents just 5% of the dataset.  

A common approach to handling this issue is upsampling or downsampling. However, these techniques often introduce biases into the dataset. Instead, we will use a different method: reducing the dataset to include only relevant entries for the model.  

Many drivers have a very low probability of winning due to various factors. Therefore, we can exclude them from the dataset to improve model performance.  

### Filtering criterion

We will retain only results in which the driver's starting or finishing position is within a certain threshold. This removes results in which the driver both qualified and finished in a poor position, as they realistically never had a chance of winning. However, drivers who:  

- Started in a poor position but managed a strong finish will still be included.  
- Started in a good position but had a bad race will also remain in the dataset.  

This filtering process only excludes drivers who performed poorly in both sessions (qualifying and race), ensuring we keep those who are genuinely competitive in the dataset.
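
As a minimal illustration of this criterion, the sketch below applies it to a few made-up results in plain Python. The rows and the helper `keep_result` are hypothetical, and the threshold of 3 is used only as an example value.

```python
def keep_result(grid_position, finish_position, threshold):
    """Keep a result if the driver started OR finished within the threshold."""
    return grid_position <= threshold or finish_position <= threshold

# (driver, grid position, finishing position): made-up rows
results = [
    ("A", 1, 1),    # started at the front and won
    ("B", 12, 2),   # poor start, strong finish
    ("C", 2, 15),   # good start, bad race
    ("D", 14, 16),  # poor in both sessions
]

kept = [
    driver
    for driver, grid, finish in results
    if keep_result(grid, finish, threshold=3)
]
print(kept)
```

Only driver D, who was outside the threshold in both qualifying and the race, is removed.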

## 5.1 Data loading

In [None]:
df = pd.read_csv('../data/output/featured_results.csv', index_col=0)

Now, we will apply the filter. After several tests, we have decided to use a criterion that includes only results where the driver started or finished within the top three positions.

This approach ensures that we retain only the most competitive results while removing those who had little to no realistic chance of winning. By doing so, we aim to improve the model's performance by focusing on relevant cases and reducing the impact of class imbalance.

In [None]:
# Top 3 positions
threshold = 3
mask = (df['Position'] <= threshold) | (df['GridPosition'] <= threshold)

# Apply mask
df = df[mask].copy()  # explicit copy avoids pandas' SettingWithCopyWarning on the in-place drop below

target = 'Winner'
drop = 'Winner' if target == 'Podium' else 'Podium'

df.drop(columns=['Position', 'Time', 'Status', 'Points', drop], inplace=True)

## 5.2 Preprocessing

We carry out the same preprocessing as in the previous case to ensure consistency in the data preparation and maintain comparability between models.

In [None]:
encoding_methods = {"onehot": [],
                    "target": ['DriverId', 'TeamId'],
                    "ordinal" : {
                        'circuitId': df['circuitId'].unique().tolist()
                        },
                    "frequency": []
                    }
scaling = 'minmax'

df_encoded, df_scaled = preprocess(df, encoding_methods, scaling, target_variable=target)

## 5.3 Training and metrics

In [None]:
models_balanced = ClassificationModels(df_scaled, target)

In [None]:
df_results_balanced = pd.DataFrame()

In [None]:
model_list = ["logistic_regression", "xgboost"]

for model in model_list:    

    # Fit model
    models_balanced.fit_model(model, file_name=f"{model}_balanced", cross_validation=10)

    # Plots
    models_balanced.plot_confusion_matrix(model, size=(6,5))
    models_balanced.plot_shap_summary(model)

    # Get metrics
    df_current_results = models_balanced.get_metrics(model)
    df_current_results["model"] = model
    df_results_balanced = pd.concat([df_results_balanced, df_current_results], axis=0)

In [None]:
df_results_balanced.round(3)

Data balancing has had a significant impact on the metrics of the logistic regression model, reducing the bias caused by the original imbalance and improving its generalization ability. Correcting the class distribution has allowed the model to better capture the patterns of the minority class, reducing its tendency to predict the majority class almost exclusively. As a result, the imbalance issue has practically disappeared in this model.

On the other hand, in XGBoost, balancing has not generated significant improvements and, for some metrics, has worsened the results. A likely reason is that XGBoost, as a gradient-boosted tree model in which each new tree concentrates on the errors of the previous ones and in which classes can be weighted in the loss function, tends to be less affected by imbalanced distributions. Additionally, the balanced dataset has introduced additional overfitting in this model, suggesting that the adjustment to the new distribution has led to greater sensitivity to patterns specific to the training set rather than to an improvement in its generalization ability.
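
One such weighting mechanism in XGBoost is the `scale_pos_weight` parameter, for which a commonly used starting value is the ratio of negative to positive training examples. Whether `ClassificationModels` sets this parameter is not shown in this notebook, so the sketch below only illustrates how the ratio could be computed. The helper name and the label list are made up.

```python
def negative_to_positive_ratio(labels):
    """Return (number of negatives) / (number of positives) for binary labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("labels contain no positive examples")
    return negatives / positives

# Hypothetical labels for a single race: one winner among 20 drivers
y_train = [1] + [0] * 19

ratio = negative_to_positive_ratio(y_train)
print(ratio)

# Assumption: the value could then be passed to the classifier, for example
# xgboost.XGBClassifier(scale_pos_weight=ratio)
```

With one winner per 20-driver race the ratio is 19, which gives each positive example roughly the combined weight of the negatives it competes against.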

# 6. Conclusions

In [None]:
display(df_results.round(3))
display(df_results_balanced.round(3))

### Logistic regression

- Full dataset:
  - Train: Accuracy = 0.963, F1 = 0.556, AUC = 0.965
  
  - Test: Accuracy = 0.960, F1 = 0.414, AUC = 0.954

  - Train-test difference: Some overfitting.

- Balanced dataset:
  - Train: Accuracy = 0.828, F1 = 0.533, AUC = 0.831

  - Test: Accuracy = 0.782, F1 = 0.558, AUC = 0.790

  - Train-test difference: Better generalization than the full dataset.

  - Improved recall and F1 in test, meaning the model now predicts the minority class better.

### XGBoost
- Full dataset:
  - Train: Accuracy = 0.971, F1 = 0.569, AUC = 0.981

  - Test: Accuracy = 0.974, F1 = 0.593, AUC = 0.967

  - Train-test difference: Small, good generalization, but slight overfitting.

- Balanced dataset:
  - Train: Accuracy = 0.893, F1 = 0.713, AUC = 0.945

  - Test: Accuracy = 0.782, F1 = 0.584, AUC = 0.808

  - Train-test difference: Larger, indicating clear overfitting.

#### Final decision: Best model selection

| Model | Generalization | Overfitting | Performance (F1, AUC) | Training time |
|--------|---------------|-------------|------------------------|-----------------|
| `Logistic Regression (Balanced)` | Good | Low | Better recall and F1 in test | Fast |
| `XGBoost (Full dataset)` | Good | Slight | Best in AUC and precision | Moderate |
| `XGBoost (Balanced)` | Worse | High | Good F1 but overfitting | Moderate |

#### Final choice: `XGBoost with the full dataset`
- If training time is not an issue, `XGBoost with the full dataset` provides the best overall performance.

- If computational efficiency is critical, `logistic regression with the balanced dataset` is a solid alternative.

**Conclusion: XGBoost (full dataset) for best performance, Logistic Regression (balanced) for efficiency.**

# 7. Model file saving

In [None]:
models.fit_model('xgboost', file_name='best_model', cross_validation=10)
models.get_metrics('xgboost')