# Train Model

**INPUT**: "./data/tennis_atp-master/1finalDataset.csv"

**OUTPUT**: The XGBoost models "./models/xgb_model.json" and "./models/best_xgb_model.json"

In this notebook, we take the final dataset (which contains all the tennis statistics) and train several models on it (Decision Tree, Random Forest, XGBoost, Neural Net). Then we save the best XGBoost models to the models folder.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
from sklearn import tree
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from tensorflow import keras
from tensorflow.keras import layers
pd.set_option('display.max_columns', None)

In [None]:
final_dataset = pd.read_csv("data/tennis_atp-master/1finalDataset.csv")
final_dataset


## Split Training vs Testing Data

We'll shuffle the data and split it 85%/15% between training and testing data.

In [3]:
# Convert data to numpy (exclude the first 5k matches, since ELO hasn't been properly calculated yet)
data = final_dataset.to_numpy(dtype=object)[5000:,:]
np.random.shuffle(data)

# Split the data using an 85% split between training and testing
split = 0.85
value = round(split*len(data))

data_train = data[:value,:]
data_test = data[value:,:]

print("Training Data: "+str(data_train.shape))
print("Testing Data: "+str(data_test.shape))

Training Data: (76819, 68)
Testing Data: (13556, 68)
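
Since `np.random.shuffle` is called without a seed above, the split (and every accuracy reported below) changes from run to run. Here is a minimal sketch of a reproducible variant. The helper name `shuffle_split` and the seed value 42 are assumptions for illustration and do not come from the original notebook:

```python
import numpy as np

def shuffle_split(data, split=0.85, seed=42):
    """Shuffle rows with a seeded generator, then cut at round(split * n_rows)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(data, axis=0)  # returns a shuffled copy, input is untouched
    cut = round(split * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Made-up stand-in for the real dataset: 100 rows, 2 columns
demo = np.arange(200).reshape(100, 2)
train_a, test_a = shuffle_split(demo)
train_b, test_b = shuffle_split(demo)

print(train_a.shape, test_a.shape)
print("Same split on both calls:", np.array_equal(train_a, train_b))
```

With the same `round(0.85 * len(data))` rule, the 90,375 rows that remain after dropping the first 5k matches give the 76,819 / 13,556 split printed above.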


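We need to map the result column to string labels. scikit-learn doesn't strictly require strings, but because the data was converted with dtype=object above, the label column has to be cast to a concrete type before fitting. Readable labels also make the tree printout below easier to follow.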

In [4]:
# Define several mappers
mapper = np.vectorize(lambda x: "Player 2 Wins" if x == 0 else "Player 1 Wins")
reverse_mapper = np.vectorize(lambda x: 0 if x == "Player 2 Wins" else 1)

# Training data
x_train = data_train[:,:-1]
y_pred_train = mapper(data_train[:,-1:]).squeeze()

# Testing data
x_test = data_test[:,:-1]
y_pred_test = mapper(data_test[:,-1:]).squeeze()
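
The two mappers are inverses of each other on 0/1 labels. Here is a small sketch on a made-up three-row label column. The `raw` array is illustrative and not taken from the dataset, and the `np.where` line is only an assumed alternative to `np.vectorize`:

```python
import numpy as np

mapper = np.vectorize(lambda x: "Player 2 Wins" if x == 0 else "Player 1 Wins")
reverse_mapper = np.vectorize(lambda x: 0 if x == "Player 2 Wins" else 1)

# Made-up label column with the same layout as data_train[:, -1:] (object dtype, shape (n, 1))
raw = np.array([[1], [0], [1]], dtype=object)

labels = mapper(raw).squeeze()      # 1-D array of strings
restored = reverse_mapper(labels)   # back to 0/1 integers

# Alternative without np.vectorize: cast to int first, then select the label
labels_where = np.where(raw.squeeze().astype(int) == 0, "Player 2 Wins", "Player 1 Wins")

print(labels)
print(restored)
```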

## Train Models

### Train Simple Decision Tree

We can start by training a really simple decision tree (max_depth=4) to see how good it is.

In [5]:
# Instantiate a Decision Tree
decision_sklearn = DecisionTreeClassifier(max_depth=4)
decision_sklearn = decision_sklearn.fit(x_train, y_pred_train)

# Make predictions and test accuracy
predictions_train = decision_sklearn.predict(x_train)
predictions_test = decision_sklearn.predict(x_test)
print("Train Accuracy: "+str(accuracy_score(y_pred_train, predictions_train)))
print("Test Accuracy: "+str(accuracy_score(y_pred_test, predictions_test)))

Train Accuracy: 0.65740246553587
Test Accuracy: 0.6535851283564473


In [6]:
text_representation = tree.export_text(decision_sklearn, feature_names=list(final_dataset.columns[:-1]))
print(text_representation)

|--- ELO_DIFF <= 4.66
|   |--- ELO_DIFF <= -110.91
|   |   |--- ELO_DIFF <= -229.70
|   |   |   |--- ELO_DIFF <= -348.35
|   |   |   |   |--- class: Player 2 Wins
|   |   |   |--- ELO_DIFF >  -348.35
|   |   |   |   |--- class: Player 2 Wins
|   |   |--- ELO_DIFF >  -229.70
|   |   |   |--- ELO_SURFACE_DIFF <= -81.19
|   |   |   |   |--- class: Player 2 Wins
|   |   |   |--- ELO_SURFACE_DIFF >  -81.19
|   |   |   |   |--- class: Player 2 Wins
|   |--- ELO_DIFF >  -110.91
|   |   |--- ATP_RANK_DIFF <= 162.50
|   |   |   |--- ELO_SURFACE_DIFF <= -44.50
|   |   |   |   |--- class: Player 2 Wins
|   |   |   |--- ELO_SURFACE_DIFF >  -44.50
|   |   |   |   |--- class: Player 2 Wins
|   |   |--- ATP_RANK_DIFF >  162.50
|   |   |   |--- P_1ST_IN_LAST_3_DIFF <= -2.66
|   |   |   |   |--- class: Player 2 Wins
|   |   |   |--- P_1ST_IN_LAST_3_DIFF >  -2.66
|   |   |   |   |--- class: Player 2 Wins
|--- ELO_DIFF >  4.66
|   |--- ELO_DIFF <= 150.49
|   |   |--- ELO_SURFACE_DIFF <= 61.68
|   |   |  

As we can see in the output, the tree splits almost exclusively on ELO features (ELO_DIFF and ELO_SURFACE_DIFF), which we don't really want (since otherwise we could just predict using ELO alone).
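
To make "predict using ELO alone" concrete, here is a rough sketch of that baseline: pick Player 1 whenever the ELO difference is positive. The helper name `elo_only_accuracy` and the sample values are made up, and it assumes a positive `ELO_DIFF` means Player 1 has the higher rating:

```python
import numpy as np

def elo_only_accuracy(elo_diff, result):
    """Accuracy of predicting a Player 1 win (1) whenever elo_diff > 0, else Player 2 (0)."""
    predictions = (np.asarray(elo_diff, dtype=float) > 0).astype(int)
    return float(np.mean(predictions == np.asarray(result, dtype=int)))

# Made-up values for illustration only (not from the dataset)
elo_diff = [120.0, -40.0, 15.0, -200.0, 60.0]
result = [1, 0, 0, 0, 1]

baseline = elo_only_accuracy(elo_diff, result)
print(baseline)  # 0.8 (4 of the 5 made-up matches are predicted correctly)
```

On the real data, `elo_diff` would be the `ELO_DIFF` column and `result` the last column of the dataset. Any model trained here should beat that number to be worth keeping.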

Let's see if a Random Forest works better :)

### Train Random Forest

We start by training a pretty big random forest (n_estimators=500)

In [7]:
# Instantiate a Random Forest
forest_sklearn = RandomForestClassifier(n_estimators=500, max_depth=10, max_features="sqrt", bootstrap=True)
forest_sklearn = forest_sklearn.fit(x_train, y_pred_train)

# Make predictions and test accuracy
predictions_train = forest_sklearn.predict(x_train)
predictions_test = forest_sklearn.predict(x_test)
print("Train Accuracy: "+str(accuracy_score(y_pred_train, predictions_train)))
print("Test Accuracy: "+str(accuracy_score(y_pred_test, predictions_test)))

Train Accuracy: 0.7110871008474466
Test Accuracy: 0.6618471525523754


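That's a slight improvement :). However, the gap between train accuracy (0.711) and test accuracy (0.662) suggests some overfitting, so let's try a simpler, less overfitted model.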

In [8]:
# Instantiate a Random Forest
forest_sklearn2 = RandomForestClassifier(n_estimators=100, max_depth=7, min_samples_split=400, min_samples_leaf=250, max_features="sqrt", bootstrap=True)
forest_sklearn2 = forest_sklearn2.fit(x_train, y_pred_train)

# Make predictions and test accuracy
predictions_train = forest_sklearn2.predict(x_train)
predictions_test = forest_sklearn2.predict(x_test)
print("Train Accuracy: "+str(accuracy_score(y_pred_train, predictions_train)))
print("Test Accuracy: "+str(accuracy_score(y_pred_test, predictions_test)))

Train Accuracy: 0.6710188885562166
Test Accuracy: 0.662142224845087
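
A quick way to compare the two forests is the gap between train and test accuracy. The sketch below just redoes that arithmetic with the numbers printed above; the dictionary keys are labels chosen here for readability:

```python
# (train accuracy, test accuracy) copied from the two random forest outputs above
reported = {
    "n_estimators=500, max_depth=10": (0.7110871008474466, 0.6618471525523754),
    "n_estimators=100, max_depth=7": (0.6710188885562166, 0.662142224845087),
}

# A larger train-test gap means the model fits the training data more closely than it generalizes
gaps = {name: train - test for name, (train, test) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```

The gap shrinks from about 0.049 to about 0.009, while the test accuracy itself barely moves.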


The simpler model overfits less (train accuracy dropped from 0.711 to 0.671), but test accuracy is essentially unchanged at 0.662. I'm going to run a quick GridSearch to see if we can find better hyperparameters :)

In [None]:
# This takes a long time; comment it out if you want to skip it (I already ran it myself and the results are in the models folder)
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [10, 20],
    'min_samples_leaf': [5, 10],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    cv=5, 
    n_jobs=-1,
    verbose=4
)
grid_search.fit(x_train, y_pred_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV 1/5] END max_depth=5, max_features=sqrt, min_samples_leaf=5, min_samples_split=10, n_estimators=100;, score=0.667 total time=  20.3s
[CV 2/5] END max_depth=5, max_features=sqrt, min_samples_leaf=5, min_samples_split=10, n_estimators=100;, score=0.667 total time=  20.6s
[CV 3/5] END max_depth=5, max_features=sqrt, min_samples_leaf=5, min_samples_split=10, n_estimators=100;, score=0.663 total time=  20.6s
[CV 4/5] END max_depth=5, max_features=sqrt, min_samples_leaf=5, min_samples_split=10, n_estimators=100;, score=0.661 total time=  21.1s
[CV 5/5] END max_depth=5, max_features=sqrt, min_samples_leaf=5, min_samples_split=10, n_estimators=100;, score=0.669 total time=  21.0s
[CV 1/5] END max_depth=5, max_features=sqrt, min_samples_leaf=5, min_samples_split=20, n_estimators=100;, score=0.666 total time=  21.7s
[CV 2/5] END max_depth=5, max_features=sqrt, min_samples_leaf=5, min_samples_split=20, n_estimators=100;, score=0.66
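
The "48 candidates, totalling 240 fits" line follows directly from the grid: one candidate per combination of values, each fitted once per fold. Here is a small sketch of that count using only the standard library:

```python
from itertools import product
from math import prod

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [10, 20],
    'min_samples_leaf': [5, 10],
    'max_features': ['sqrt', 'log2'],
}
cv = 5

# Number of combinations is the product of the list lengths: 2 * 3 * 2 * 2 * 2
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv

# Enumerating the combinations explicitly gives the same count
combinations = list(product(*param_grid.values()))

print(n_candidates, n_fits)  # 48 240
```

Adding one more value to any list multiplies the run time accordingly, which is why this cell is slow.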

In [17]:
# Best parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Instantiate a Random Forest
best_forest_model = RandomForestClassifier(max_depth=15, max_features='log2', min_samples_leaf=5, min_samples_split=20, n_estimators=300)
best_forest_model = best_forest_model.fit(x_train, y_pred_train)

# Make predictions and test accuracy
predictions_train = best_forest_model.predict(x_train)
predictions_test = best_forest_model.predict(x_test)
print("Train Accuracy: "+str(accuracy_score(y_pred_train, predictions_train)))
print("Test Accuracy: "+str(accuracy_score(y_pred_test, predictions_test)))

Best Parameters: {'max_depth': 15, 'max_features': 'log2', 'min_samples_leaf': 5, 'min_samples_split': 20, 'n_estimators': 300}
Best Score: 0.6684674689234912
Train Accuracy: 0.7993074629974355
Test Accuracy: 0.6611094718205961
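
The best hyperparameters were typed in by hand in the cell above. As a sketch of an alternative, `GridSearchCV` already keeps a refitted copy of the best model in `best_estimator_` (with the default `refit=True`), and `best_params_` can be unpacked into a new classifier. The synthetic data, the tiny grid and `random_state=0` below are assumptions so the example runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the tennis data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
)
search.fit(X_tr, y_tr)

# Option 1: reuse the model that GridSearchCV refit on the whole training set
best_model = search.best_estimator_

# Option 2: rebuild it from the best parameters without retyping them
rebuilt = RandomForestClassifier(random_state=0, **search.best_params_).fit(X_tr, y_tr)

print("Best Parameters:", search.best_params_)
print("Test Accuracy:", best_model.score(X_te, y_te))
```

Either option avoids the risk of a typo when copying the parameters from the printed output.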


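The GridSearchCV wasn't that successful: the best cross-validation score was 0.6685, but the test accuracy was only 0.6611 (with a train accuracy of 0.7993, so this model overfits more than the previous forests). Let's train an XGBoost model and see if it does better.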

### Train XGBoost Algorithm

Let's try with XGBoost and see if we can get better results.

In [18]:
# Instantiate an XGBoost Classifier
xgb_model = XGBClassifier(n_estimators=200, max_depth=10, learning_rate=0.1, subsample=0.8, colsample_bytree=0.7)

# Train the model
xgb_model.fit(x_train, reverse_mapper(y_pred_train))

# Make predictions
predictions_train = xgb_model.predict(x_train)
predictions_test = xgb_model.predict(x_test)

# Calculate accuracy
print("Train Accuracy: " + str(accuracy_score(reverse_mapper(y_pred_train), predictions_train)))
print("Test Accuracy: " + str(accuracy_score(reverse_mapper(y_pred_test), predictions_test)))

Train Accuracy: 0.9742381442091149
Test Accuracy: 0.6513720861611094


In [19]:
# Sorting by importance in descending order
sorted_mapped_results = sorted(
    list(zip(final_dataset.columns[:-1], list(xgb_model.feature_importances_))),
    key=lambda x: x[1], 
    reverse=True
)

# Extracting sorted labels and their importances
sorted_labels = [label for label, importance in sorted_mapped_results]
sorted_importances = [importance for _, importance in sorted_mapped_results]

# Displaying results
for label, importance in sorted_mapped_results:
    print(f"{label}: {importance:.4f}")


ELO_DIFF: 0.0843
ELO_SURFACE_DIFF: 0.0429
BEST_OF: 0.0204
ATP_RANK_DIFF: 0.0189
AGE_DIFF: 0.0161
WIN_LAST_25_DIFF: 0.0155
P_2ND_WON_LAST_100_DIFF: 0.0154
P_1ST_WON_LAST_10_DIFF: 0.0142
ATP_POINTS_DIFF: 0.0139
P_ACE_LAST_5_DIFF: 0.0138
P_1ST_WON_LAST_3_DIFF: 0.0138
ELO_GRAD_LAST_200_DIFF: 0.0137
H2H_SURFACE_DIFF: 0.0137
P_1ST_WON_LAST_25_DIFF: 0.0136
P_2ND_WON_LAST_200_DIFF: 0.0136
ELO_GRAD_LAST_50_DIFF: 0.0136
P_ACE_LAST_50_DIFF: 0.0136
ELO_GRAD_LAST_100_DIFF: 0.0135
H2H_DIFF: 0.0135
WIN_LAST_200_DIFF: 0.0135
P_1ST_WON_LAST_5_DIFF: 0.0135
P_2ND_WON_LAST_25_DIFF: 0.0135
P_2ND_WON_LAST_10_DIFF: 0.0135
P_DF_LAST_200_DIFF: 0.0135
P_1ST_WON_LAST_50_DIFF: 0.0135
P_1ST_IN_LAST_3_DIFF: 0.0134
P_1ST_IN_LAST_50_DIFF: 0.0134
P_ACE_LAST_25_DIFF: 0.0134
P_1ST_WON_LAST_100_DIFF: 0.0134
P_BP_SAVED_LAST_50_DIFF: 0.0133
P_2ND_WON_LAST_3_DIFF: 0.0133
N_GAMES_DIFF: 0.0133
P_DF_LAST_50_DIFF: 0.0133
P_ACE_LAST_200_DIFF: 0.0132
P_1ST_IN_LAST_200_DIFF: 0.0132
P_DF_LAST_25_DIFF: 0.0132
P_BP_SAVED_LAST_100_DIF
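
One way to read this list is to compare each value with the share every feature would get if all were equally important. This is a rough sketch, assuming the importances sum to 1 (which is how XGBoost's scikit-learn wrapper normally reports them) and taking the 67 features from the 68 columns printed earlier minus the result column:

```python
n_features = 68 - 1             # 68 columns in the printed shapes, minus the result column
uniform_share = 1 / n_features  # about 0.0149 if every feature mattered equally

# Top four values copied from the output above
top_importances = {
    "ELO_DIFF": 0.0843,
    "ELO_SURFACE_DIFF": 0.0429,
    "BEST_OF": 0.0204,
    "ATP_RANK_DIFF": 0.0189,
}

# Ratio > 1 means the feature is used more than an "average" feature
ratios = {name: value / uniform_share for name, value in top_importances.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the uniform share")
```

ELO_DIFF comes out at roughly 5.6 times the uniform share, while most of the remaining features sit slightly below 1.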

Okay, this is overfitting significantly (train accuracy 0.974 vs. test accuracy 0.651). Let's try regularization.

In [21]:
# Instantiate an XGBoost Classifier
xgb_model = XGBClassifier(
    n_estimators=100, 
    max_depth=5, 
    learning_rate=0.05, 
    subsample=0.7, 
    colsample_bytree=0.6,
    reg_alpha=0.1,
    reg_lambda=1.0
)

# Train the model
xgb_model.fit(x_train, reverse_mapper(y_pred_train))

# Make predictions
predictions_train = xgb_model.predict(x_train)
predictions_test = xgb_model.predict(x_test)

# Calculate accuracy
print("Train Accuracy: " + str(accuracy_score(reverse_mapper(y_pred_train), predictions_train)))
print("Test Accuracy: " + str(accuracy_score(reverse_mapper(y_pred_test), predictions_test)))

Train Accuracy: 0.681524102110155
Test Accuracy: 0.6690764237238124


In [22]:
xgb_model.save_model("./models/xgb_model.json")

That's slightly better (test accuracy 0.669 vs. 0.651 before regularization). Let's run a grid search to really make sure.

In [25]:
# Define parameter grid with all specified parameters
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10],
    'learning_rate': [0.01, 0.05],
    'subsample': [0.7],
    'colsample_bytree': [0.6],
    'reg_alpha': [0.1, 0.5],
    'reg_lambda': [0.5, 1.0]
}

# Instantiate an XGBoost Classifier
xgb_model = XGBClassifier()

# Perform GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_model, 
    param_grid=param_grid, 
    scoring='accuracy', 
    cv=5, 
    verbose=3, 
    n_jobs=-1
)
grid_search.fit(x_train, reverse_mapper(y_pred_train))

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV 3/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, n_estimators=100, reg_alpha=0.1, reg_lambda=1.0, subsample=0.7;, score=0.665 total time=   2.9s
[CV 5/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, n_estimators=100, reg_alpha=0.1, reg_lambda=0.5, subsample=0.7;, score=0.670 total time=   3.0s
[CV 4/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, n_estimators=100, reg_alpha=0.1, reg_lambda=0.5, subsample=0.7;, score=0.665 total time=   3.0s
[CV 1/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, n_estimators=100, reg_alpha=0.1, reg_lambda=0.5, subsample=0.7;, score=0.668 total time=   3.0s
[CV 2/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, n_estimators=100, reg_alpha=0.1, reg_lambda=0.5, subsample=0.7;, score=0.672 total time=   3.1s
[CV 2/5] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, n_estimators=100, reg_alpha=0.1, reg_lambda=1.0, s

In [26]:
# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Retrieve the best model (GridSearchCV has already refit it on the full training set)
best_xgb_model = grid_search.best_estimator_

# Make predictions
predictions_train = best_xgb_model.predict(x_train)
predictions_test = best_xgb_model.predict(x_test)

# Calculate accuracy
print("Train Accuracy:", accuracy_score(reverse_mapper(y_pred_train), predictions_train))
print("Test Accuracy:", accuracy_score(reverse_mapper(y_pred_test), predictions_test))

Best Parameters: {'colsample_bytree': 0.6, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 100, 'reg_alpha': 0.5, 'reg_lambda': 0.5, 'subsample': 0.7}
Train Accuracy: 0.6818365248180789
Test Accuracy: 0.6672322218943642


In [27]:
best_xgb_model = grid_search.best_estimator_
best_xgb_model.save_model("./models/best_xgb_model.json")

### Train Neural Net

In [28]:
# Normalize the data
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
X_test_scaled = scaler.transform(x_test)

# Define the neural network
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(x_train_scaled.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train_scaled, reverse_mapper(y_pred_train), epochs=50, batch_size=32, validation_split=0.2, verbose=1)

# Evaluate on the training and test sets
train_loss, train_acc = model.evaluate(x_train_scaled, reverse_mapper(y_pred_train), verbose=0)
test_loss, test_acc = model.evaluate(X_test_scaled, reverse_mapper(y_pred_test), verbose=0)

print(f"Train Accuracy: {train_acc}")
print(f"Test Accuracy: {test_acc}")

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/50
1921/1921 ━━━━━━━━━━━━━━━━━━━━ 2s 664us/step - accuracy: 0.6522 - loss: 0.6208 - val_accuracy: 0.6737 - val_loss: 0.6040
Epoch 2/50
1921/1921 ━━━━━━━━━━━━━━━━━━━━ 1s 595us/step - accuracy: 0.6687 - loss: 0.6013 - val_accuracy: 0.6722 - val_loss: 0.6017
Epoch 3/50
1921/1921 ━━━━━━━━━━━━━━━━━━━━ 1s 597us/step - accuracy: 0.6748 - loss: 0.5997 - val_accuracy: 0.6711 - val_loss: 0.6019
Epoch 4/50
1921/1921 ━━━━━━━━━━━━━━━━━━━━ 1s 768us/step - accuracy: 0.6710 - loss: 0.5993 - val_accuracy: 0.6705 - val_loss: 0.6028
Epoch 5/50
1921/1921 ━━━━━━━━━━━━━━━━━━━━ 1s 601us/step - accuracy: 0.6759 - loss: 0.5942 - val_accuracy: 0.6701 - val_loss: 0.6025
Epoch 6/50
1921/1921 ━━━━━━━━━━━━━━━━━━━━ 1s 599us/step - accuracy: 0.6762 - loss: 0.5937 - val_accuracy: 0.6689 - val_loss: 0.6056
Epoc
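
In the visible part of the log, the validation loss bottoms out early while the training accuracy keeps drifting upward. The sketch below finds the best epoch from the six logged values. Only those six epochs are shown above, so this is a tentative reading rather than a conclusion about the full 50-epoch run:

```python
# (val_accuracy, val_loss) for epochs 1-6, copied from the training log above
history = [
    (0.6737, 0.6040),
    (0.6722, 0.6017),
    (0.6711, 0.6019),
    (0.6705, 0.6028),
    (0.6701, 0.6025),
    (0.6689, 0.6056),
]

val_losses = [loss for _, loss in history]

# Epochs are 1-based in the log, list indices are 0-based
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
best_val_loss = val_losses[best_epoch - 1]

print(best_epoch, best_val_loss)  # 2 0.6017
```

If the same pattern holds for the later epochs, stopping early (for example with Keras's `EarlyStopping` callback monitoring `val_loss`) might be worth trying.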

The neural net didn't give the best result. I could try to optimize it, but since the video was mostly about random forests and decision trees, I focused more on those.

I exported the best result from the GridSearch, and now we can use it to predict stuff :).

See the next notebook (3.Predict.ipynb) for this.