 <span style="color: purple; font-weight: bold; font-size:26px"> Check if any model performs better than the naive classifier </span>


Let's start by importing some libs

In [1]:
# data analysis
import pandas as pd
import numpy as np
import random

# visualisations
import seaborn 
import matplotlib.pyplot as plt
%matplotlib inline

# models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix
import pickle # for saving the model

# preprocessing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.impute import KNNImputer

# just out of curiosity
import time


Now, let's define some helper functions!


In [2]:
# let's define some helper functions:

def plot_models_acc(models, naive_classifier):
    # `models` maps a model name to a dict holding at least an 'accuracy' entry
    labels = tuple(models.keys())
    y_pos = np.arange(len(labels))
    values = [models[n]['accuracy'] for n in models]
    plt.bar(y_pos, values, align='center', alpha=0.5)
    plt.xticks(y_pos, labels, rotation='vertical')
    plt.ylabel('accuracy')
    plt.title('Accuracy of different models')
    # add a horizontal line at the naive classifier's accuracy
    plt.axhline(y=naive_classifier, color='r', linestyle='-', label='Naive Classifier')
    plt.legend(loc='lower right')
    
    plt.show()

    
# checking the distribution of the features, six columns per figure
def feature_dist(dataframe):
    col_num = len(dataframe.columns)
    from_to_ind = [(i, i+6) for i in range(0, col_num, 6)]

    for i, j in from_to_ind:
        if j >= col_num:
            # last chunk: plot every remaining column, including the final one
            dataframe.iloc[:, i:col_num].hist(layout=(1, col_num - i), figsize=(11,11))
        else:
            dataframe.iloc[:, i:j].hist(layout=(2,3), figsize=(11,11))
            
            
def conf_mat(grid_search: GridSearchCV, Y_test):
    # construct the confusion matrix
    # note: predictions use the global X_test created by train_test_split below
    outcome_class_labels = ['Red', 'Draw', 'No contest', 'Blue']
    cm = confusion_matrix(
        Y_test, 
        grid_search.predict(X_test),
        labels = outcome_class_labels
    )

    # create heatmap
    seaborn.heatmap(
        cm, 
        annot=True, 
        cmap='Blues', 
        xticklabels=outcome_class_labels,
        yticklabels=outcome_class_labels,
        fmt='d')

    # add labels
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

    return cm

### Importing data
Now, let's import the data.

The data comes from a Python web-scraping script that visits websites such as ufcstats.com and bestfightodds.com to gather, store and compute fighting statistics and odds. The scraped period ranges from 1994 to 2023 (the most recent fights), but we decided to skip the oldest 2400 fights.

There are several reasons for skipping them. First and foremost, most of the oldest data has every winner labeled as 'Red', which would bias our models. Secondly, a lot of data was missing, especially the odds. Finally, fight formats have changed over time, so we would rather not base our models on 'outdated' information.

Furthermore, even in the selected time frame some odds are missing because the scraper was unable to gather them. Most of the time this was caused by inconsistent fighter names on bestfightodds.com: one fighter could have 3 or 4 different pages simply because of typos in the name. To handle that, we manually mapped these odds into our dataset. In the main Excel file, the green-marked cells signify manual intervention.


In [3]:
data = pd.read_excel('./UFCdata/datasets/UFC_fights_stats_complete.xlsx')
data = data.sort_values('Event_Date', ascending=True).reset_index(drop=True)
data = data.replace(['--', '---'], pd.NA)[2400:] 

In [None]:
data.info()

In [8]:
data.describe()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,R_Height_cm,R_Weight_kg,R_Reach,R_Age,R_Total_Knockdowns,R_Total_Significant_Strikes_Attempted,R_Total_Significant_Strikes_Landed,R_Significant_Strikes_%_Landed,...,B_Opp_Significant_Strikes_in_Clinch_%_Landed,B_Opp_Average_Significant_Strikes_on_Ground_Attempted,B_Opp_Average_Significant_Strikes_on_Ground_Landed,B_Opp_Significant_Strikes_on_Ground_%_Landed,Number_of_Rounds,Last_Round_Duration,R_Open_Odds,B_Open_Odds,R_Closing_Odds,B_Closing_Odds
count,4708.0,4708.0,4708.0,4708.0,4624.0,4708.0,4708.0,4708.0,4708.0,4708.0,...,4708.0,4708.0,4708.0,4708.0,4708.0,4708.0,4708.0,4708.0,4708.0,4708.0
mean,3612.097281,3612.097281,177.486007,73.803455,182.092842,30.937128,1.967502,631.524851,279.340272,0.42236,...,0.522387,5.16545,3.596194,0.505321,2.429269,230.601742,-114.771453,37.722175,-137.179269,39.881903
std,2022.513871,2022.513871,9.244279,16.029015,11.308305,4.246129,2.727167,668.628788,291.850253,0.15253,...,0.320886,7.7708,5.448841,0.351317,1.0064,90.851972,223.974404,203.324964,302.60469,251.600065
min,31.0,31.0,152.4,52.154195,147.32,18.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,5.0,-2300.0,-900.0,-3500.0,-1500.0
25%,1900.75,1900.75,170.18,61.22449,175.26,28.0,0.0,146.0,67.0,0.382689,...,0.333333,0.0,0.0,0.0,1.0,158.0,-230.0,-150.0,-265.0,-155.0
50%,3580.5,3580.5,177.8,70.294785,182.88,31.0,1.0,413.5,184.0,0.442389,...,0.633333,2.9375,2.0,0.62963,3.0,300.0,-155.0,125.0,-152.0,120.0
75%,5392.25,5392.25,185.42,83.900227,190.5,34.0,3.0,905.25,402.25,0.503885,...,0.745532,7.0,4.8,0.765019,3.0,300.0,120.0,175.0,122.0,200.0
max,7096.0,7096.0,210.82,120.181406,213.36,46.0,20.0,6324.0,2975.0,1.0,...,1.0,141.0,84.0,1.0,5.0,300.0,600.0,1100.0,775.0,1000.0


In [9]:
data.describe(include=['O'])

Unnamed: 0,Event_Name,Event_Location,Fight_Weight,Fight_Gender,R_Name,R_Stance,B_Name,B_Stance,Time_Format,Referee,Conclusion_Method,Winner
count,4708,4708,4708,4708,4708,4692,4708,4677,4708,4681,4708,4708
unique,401,137,10,2,1303,3,1471,3,3,164,10,4
top,UFC 286: Edwards vs. Usman 3,"Las Vegas, Nevada, USA",Lightweight,Male,Jim Miller,Orthodox,Angela Hill,Orthodox,3 Rnd (5-5-5),Herb Dean,Decision - Unanimous,Red
freq,15,1526,745,4056,20,3526,16,3421,4226,678,1755,2673


In [None]:
data.head()

In [4]:
# shuffling the data before we proceed:
data = data.sample(frac=1, random_state=2023).reset_index(drop=True)

We further need to drop some columns that will not be of use. I'll also remove the closing odds: they would surely improve model accuracy, but we shouldn't rely on them too much.


In [5]:
dropdata = data.drop(['Unnamed: 0',
                      'Unnamed: 0.1',
                      'Event_Name',
                      'Event_Location',
                      'B_Name',
                      'R_Name',
                      'Conclusion_Method',
                      'Event_Date',
                      'Last_Round_Duration',
                      'Number_of_Rounds',
                      'Referee',
                      'R_Closing_Odds', # bye bye closing odds...
                      'B_Closing_Odds'], axis=1)

# dropping rows with 'Open Stance' since it's so rare (it shouldn't appear from row 2400 onward anyway)
dropdata = dropdata.drop(dropdata[dropdata['R_Stance'] == 'Open Stance'].index)
dropdata = dropdata.drop(dropdata[dropdata['B_Stance'] == 'Open Stance'].index)

Next, let's find numerical and categorical columns in our dataset. 

In [6]:
objecttypes_cat = list(dropdata.select_dtypes(include=['O']).columns)
objecttypes_num = list(dropdata.select_dtypes(include=['int64', 'float64']).columns)

# we don't want to one-hot encode 'Winner' since it's not really required
# and the pipeline breaks if we do (it'll be looking for this var in 
# X_train&X_test but it's obviously not there)
objecttypes_cat = [x for x in objecttypes_cat if x != 'Winner']

## Data correlation

Since we are given a considerable number of features, instead of presenting all possible correlations we'll stick to the N most significant ones.


In [None]:
# # Subset Correlation Matrix
# k = 10 #number of variables for heatmap
# corrmat = dropdata.corr()
# cols = corrmat.nlargest(k, 'Winner')['Winner'].index
# cm = np.corrcoef(dropdata[cols].values.T)
# seaborn.set(font_scale=1.25)
# hm = seaborn.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
# plt.show()
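
The commented-out cell above relies on a numeric `Winner` column, whereas here `Winner` holds labels such as 'Red' and 'Blue'. A minimal sketch of one possible workaround, assuming we only care about whether Red won: build a binary indicator and rank the numeric features by their absolute correlation with it. The `top_correlated` helper and the toy frame are illustrative and not part of the notebook.

```python
import pandas as pd

def top_correlated(df, target_col, positive_label, k=10):
    """Return the k numeric columns most correlated (in absolute value)
    with a binary indicator of `target_col == positive_label`."""
    indicator = (df[target_col] == positive_label).astype(float)
    numeric = df.select_dtypes(include="number")
    corr = numeric.corrwith(indicator)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(k)

# toy data standing in for `dropdata` (values are made up)
toy = pd.DataFrame({
    "R_Age": [25, 30, 35, 40, 28, 33],
    "R_Reach": [180, 178, 182, 175, 185, 177],
    "noise": [1, 5, 2, 6, 3, 4],
    "Winner": ["Red", "Red", "Blue", "Blue", "Red", "Blue"],
})
top = top_correlated(toy, "Winner", "Red", k=2)
print(top)
```

The resulting column names could then be passed to `seaborn.heatmap` in the same way as in the commented-out cell.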

# Candidate models

We've chosen the following candidate models that we will further try to get the best performance from. In the end, we will present a comparison of each model.

#### - Perceptron
#### - Random Forests
#### - Decision Trees Classifier
#### - SGD Classifier
#### - Linear SVC
#### - Gaussian Naive Bayes
#### - KNN
#### - Logistic Regression

PS.
Models can be loaded after they are saved. However, take care when loading them after a kernel restart: if the train/test split differs from the one used for training, the loaded model may have been partially trained on the current test set, which would inflate its test accuracy. This can happen because the data is shuffled during the train/test split.
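
One way to guard against this, sketched under the assumption that the training frame keeps a stable index: store a short hash of the training index next to the pickled model and compare it on load. The helpers `index_fingerprint`, `save_with_fingerprint` and `load_checked` are hypothetical names, not something the notebook defines.

```python
import hashlib
import os
import pickle
import tempfile

import pandas as pd

def index_fingerprint(index):
    """Hash the row labels of a training set into a short hex string."""
    joined = ",".join(map(str, index))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]

def save_with_fingerprint(model, train_index, path):
    """Pickle the model together with the fingerprint of its training rows."""
    payload = {"model": model, "train_fingerprint": index_fingerprint(train_index)}
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_checked(path, train_index):
    """Load the model, refusing if it was trained on a different split."""
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if payload["train_fingerprint"] != index_fingerprint(train_index):
        raise ValueError("saved model was trained on a different split")
    return payload["model"]

# illustrative usage with a stand-in "model"
train_idx = pd.Index([3, 1, 4, 0])
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
save_with_fingerprint({"k": 60}, train_idx, path)
restored = load_checked(path, train_idx)
```

In this notebook the index to pass would be `X_train.index`.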

In [7]:
tuned_models = dict()

### Standardizing the data (based on X_train) and outputting the naive classifier for the test set:

In [8]:
# Set the random seed
random.seed(2137)
np.random.seed(2137)

Y_all = dropdata['Winner']
X_all = dropdata.drop(['Winner'], axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X_all, Y_all, test_size=0.25, random_state=2023)

# standardizer for numerical columns
transformer_num = Pipeline([
    ('scaler', StandardScaler())
])

# one-hot encoder for categorical columns
transformer_cat = Pipeline([
    ('scaler', OneHotEncoder())
])

# combine both transformers into a single preprocessor for all features:
preprocessor = ColumnTransformer(
    transformers = [
        ('num', transformer_num, objecttypes_num),
        ('cat', transformer_cat, objecttypes_cat)   # no 'Winner' encoded
], remainder='passthrough')

# fit on the training set only, so the scaling parameters and the feature names
# (used below via get_feature_names_out) come from X_train:
x = preprocessor.fit_transform(X_train)

In [9]:
# printing the naive classifier on the test set
naive_classifier = Y_test.loc[Y_test == 'Red'].count() / Y_test.count()
print("The naive classifier accuracy in the test set is: ", naive_classifier)

The naive classifier accuracy in the test set is:  0.5548003398470688


The naive classifier always predicts the most common outcome, a Red win, so its accuracy is the proportion of Red winners among all fight outcomes. This is the least we should beat.

Our models therefore need more than about 55.5% accuracy on the test set to beat the naive classifier. Anything below that is no better than always predicting 'Red'.
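
The same kind of baseline can be obtained with scikit-learn's `DummyClassifier`, which learns the most frequent class from the training labels. The sketch below uses made-up labels; the `*_toy` names are illustrative only. Unlike the cell above, it takes the majority class from the training set rather than from the test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y_train_toy = np.array(["Red"] * 6 + ["Blue"] * 4)
y_test_toy = np.array(["Red"] * 3 + ["Blue"] * 2)
# the features are ignored by this strategy, so zeros are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_acc = baseline.score(X_test_toy, y_test_toy)
print(baseline_acc)  # 3 of the 5 test labels are 'Red'
```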

## Perceptron:

The Perceptron is a simple classification algorithm suitable for large-scale learning. By default:

- It does not require a learning rate.
- It is not regularized (penalized).
- It updates its model only on mistakes.

The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the resulting models are sparser.

Note that it is only guaranteed to converge when the classes are linearly separable.
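
The "updates only on mistakes" rule can be sketched in a few lines of NumPy. This is a simplified binary version for illustration, not scikit-learn's implementation. The function `perceptron_fit` and the toy points are assumptions of the example.

```python
import numpy as np

def perceptron_fit(X, y, max_epochs=100):
    """Train a binary perceptron; y must contain -1 and +1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # update only on a mistake
                b += yi
                mistakes += 1
        if mistakes == 0:                # a full pass without mistakes: converged
            break
    return w, b

# linearly separable toy data
X_toy = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 3.0],
                  [-1.0, -2.0], [-2.0, -1.0], [-3.0, -3.0]])
y_toy = np.array([1, 1, 1, -1, -1, -1])

w, b = perceptron_fit(X_toy, y_toy)
pred = np.sign(X_toy @ w + b)
print(w, b, pred)
```

On separable data like this the loop stops as soon as one full pass produces no mistakes. On non-separable data it would simply run until `max_epochs`.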

In [None]:
# starting the timer to see how much time it takes
start_time = time.time()

# Let's define the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', KNNImputer()),
    ('selector', SelectKBest()),
    ('model', Perceptron(random_state=2023))
])


# Define the parameter grid
param_grid = {
    'selector__k': [10, 20, 30, 50, 90, 150, len(preprocessor.get_feature_names_out())],
    'model__penalty': ['elasticnet', 'l2', 'l1', None],
    'model__alpha': [0.01, 0.001, 0.0001, 0.00001],
    'model__max_iter': [500, 1000, 1500],
    'model__tol': [0.1, 0.01, 0.001, 0.0001, 0.00001]
}

# create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=3, verbose=2)

# fit the grid search object to the data
grid_search.fit(X_train, Y_train)

# evaluate the best model on the test set
score = grid_search.score(X_test, Y_test)

# stopping the timer and seeing the result: ------------------------
end_time = time.time()

print('Accuracy on the test set: ', score)
print('\nBest parameters after the grid search: ', grid_search.best_params_)
print("\nTime taken for model to learn: ", end_time - start_time, " seconds")

In [None]:
# save the model to disk using pickle
if input('Are you sure you want to run this cell? It will overwrite current best model. Y/N: ') == 'Y':
    with open('trained_models/best_perceptron_model.pkl', 'wb') as f:
        pickle.dump(grid_search, f)

In [None]:
# load the best model from disk using pickle
with open('trained_models/best_perceptron_model.pkl', 'rb') as f:
    model = pickle.load(f)

print('Best parameters after the grid search: ', model.best_params_)

In [None]:
# Get the selected feature indices after SelectKBest() in the best estimator
selected_feature_indices = model.best_estimator_.named_steps['selector'].get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = preprocessor.get_feature_names_out()[selected_feature_indices]
print('Columns after feature selection: ', selected_feature_names)

In [None]:
# storing accuracy score:
acc_sco = accuracy_score(Y_test, model.predict(X_test)) 

# construct the confusion matrix
cm = conf_mat(model, Y_test)

# store model results in tuned_models dictionary

print('Accuracy on the test set: ',acc_sco)
tuned_models['Perceptron'] = {'accuracy': acc_sco, 'model': model, 'confusion_matrix': cm}


## Random Forest Classifier

A random forest classifier.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

In [10]:
# starting the timer to see how much time it takes
start_time = time.time()

# Let's define the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', KNNImputer()),
    ('selector', SelectKBest()),
    ('model', RandomForestClassifier(random_state=123))
])


# Define the parameter grid
param_grid = {
    'selector__k': [5, 10, 15, 30, 45, 60, 90, 120, 150, len(preprocessor.get_feature_names_out())],
    'model__n_estimators': [50, 100, 200, 300],
    'model__criterion': ['gini'],
    'model__max_depth': [None],
    'model__min_samples_split': [2, 4, 6, 8],
    'model__min_samples_leaf': [1, 2, 4, 8, 12],
    'model__max_features': ['sqrt']
}

# create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=2, n_jobs=4)

# fit the grid search object to the data
grid_search.fit(X_train, Y_train)

# evaluate the best model on the test set
score = grid_search.score(X_test, Y_test)

# stopping the timer and seeing the result: ------------------------
end_time = time.time()

print('Accuracy on the test set: ', score)
print('\nBest parameters after the grid search: ', grid_search.best_params_)
print("\nTime taken for model to learn: ", end_time - start_time, " seconds")

Fitting 5 folds for each of 800 candidates, totalling 4000 fits
Accuracy on the test set:  0.6202209005947323

Best parameters after the grid search:  {'model__criterion': 'gini', 'model__max_depth': None, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 2, 'model__min_samples_split': 6, 'model__n_estimators': 100, 'selector__k': 60}

Time taken for model to learn:  4636.163654088974  seconds
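
The "800 candidates, totalling 4000 fits" in the log follows directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A quick way to check this before launching a long search is `ParameterGrid`. The sketch below uses a grid with the same list lengths as the one above (single-value entries are omitted since they multiply by 1); `toy_grid` is an illustrative name.

```python
from sklearn.model_selection import ParameterGrid

toy_grid = {
    "selector__k": list(range(10)),              # 10 values, as in the grid above
    "model__n_estimators": [50, 100, 200, 300],  # 4 values
    "model__min_samples_split": [2, 4, 6, 8],    # 4 values
    "model__min_samples_leaf": [1, 2, 4, 8, 12], # 5 values
}

n_candidates = len(ParameterGrid(toy_grid))  # 10 * 4 * 4 * 5
n_fits = n_candidates * 5                    # cv=5 folds per candidate
print(n_candidates, n_fits)
```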


In [22]:
# save the model to disk using pickle
if input('Are you sure you want to run this cell? It will overwrite current best model. Y/N: ') == 'Y':
    with open('trained_models/best_random_forest_model.pkl', 'wb') as f:
        pickle.dump(grid_search, f)

Are you sure you want to run this cell? It will overwrite current best model. Y/N: Y


In [14]:
# load the best model from disk using pickle
with open('trained_models/best_random_forest_model.pkl', 'rb') as f:
    model = pickle.load(f)
 
print('Best parameters after the grid search: ', model.best_params_)

Best parameters after the grid search:  {'model__criterion': 'gini', 'model__max_depth': None, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 1, 'model__min_samples_split': 4, 'model__n_estimators': 200, 'selector__k': 90}


In [None]:
# Get the selected feature indices after SelectKBest() in the best estimator
selected_feature_indices = model.best_estimator_.named_steps['selector'].get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = preprocessor.get_feature_names_out()[selected_feature_indices]
print('\nColumns after feature selection: ', selected_feature_names)

In [None]:
# storing accuracy score:
acc_sco = accuracy_score(Y_test, model.predict(X_test)) 

# construct the confusion matrix
cm = conf_mat(model, Y_test)

# store model results in tuned_models dictionary

print('Accuracy on the test set: ', acc_sco)
tuned_models['Random_Forest'] = {'accuracy': acc_sco, 'model': model, 'confusion_matrix': cm}

## Decision Tree Classifier

A decision tree classifier.

In [None]:
# starting the timer to see how much time it takes
start_time = time.time()

# Let's define the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', KNNImputer()),
    ('selector', SelectKBest()),
    ('model', DecisionTreeClassifier(random_state=2023))
])


# Define the parameter grid (k<5 check)
param_grid = {
    'selector__k': [3, 4, 5, 10, 15, 30, 50, 70, 90, len(preprocessor.get_feature_names_out())],
    'model__criterion': ['gini', 'entropy'],
    'model__max_depth': [None, 4],
    'model__min_samples_split': [2],
    'model__min_samples_leaf': [1],
    'model__max_features': ['sqrt', None]
}

# create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=2, n_jobs=4)

# fit the grid search object to the data
grid_search.fit(X_train, Y_train)

# evaluate the best model on the test set
score = grid_search.score(X_test, Y_test)

# stopping the timer and seeing the result: ------------------------
end_time = time.time()

print('Accuracy on the test set: ', score)
print('\nBest parameters after the grid search: ', grid_search.best_params_)
print("\nTime taken for model to learn: ", end_time - start_time, " seconds")

In [None]:
# save the model to disk using pickle
if input('Are you sure you want to run this cell? It will overwrite current best model. Y/N: ') == 'Y':
    with open('trained_models/best_decision_tree_model.pkl', 'wb') as f:
        pickle.dump(grid_search, f)

In [None]:
# load the best model from disk using pickle
with open('trained_models/best_decision_tree_model.pkl', 'rb') as f:
    model = pickle.load(f)
 
print('Best parameters after the grid search: ', model.best_params_)

In [None]:
# Get the selected feature indices after SelectKBest() in the best estimator
selected_feature_indices = model.best_estimator_.named_steps['selector'].get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = preprocessor.get_feature_names_out()[selected_feature_indices]
print('Columns after feature selection: ', selected_feature_names)

In [None]:
# storing accuracy score:
acc_sco = accuracy_score(Y_test, model.predict(X_test)) 

# construct the confusion matrix
cm = conf_mat(model, Y_test)

# store model results in tuned_models dictionary

print('Accuracy on the test set: ',acc_sco)
tuned_models['Decision_Tree'] = {'accuracy': acc_sco, 'model': model, 'confusion_matrix': cm}

## Stochastic Gradient Descent (SGD) Classifier

A linear classifier trained with stochastic gradient descent. By default it uses loss='hinge', which gives a linear SVM.
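
For reference, the hinge loss for a label y in {-1, +1} and a raw decision score s is max(0, 1 - y*s): it is zero for points classified correctly with a margin of at least 1 and grows linearly otherwise. A minimal NumPy sketch on made-up scores (the `hinge_loss` helper is illustrative, not a scikit-learn function):

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss for labels in {-1, +1} and raw decision scores."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_toy = np.array([1, -1, 1, -1])
scores_toy = np.array([2.0, -0.5, 0.3, 1.0])

# per-sample losses: 0.0, 0.5, 0.7, 2.0
loss = hinge_loss(y_toy, scores_toy)
print(loss)
```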

In [None]:
# starting the timer to see how much time it takes
start_time = time.time()

# Let's define the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', KNNImputer()),
    ('selector', SelectKBest()),
    ('model', SGDClassifier(random_state=2023))
])


# Define the parameter grid
param_grid = {
    'selector__k': [10, 20, 30, 40, 50, 70, 90, 150, len(preprocessor.get_feature_names_out())],
    'model__alpha': [0.1, 0.01, 0.001, 0.0001],
    'model__penalty': ['l1', 'l2', 'elasticnet'],
    'model__max_iter': [200, 500, 1000],
    'model__tol': [0.1, 0.01, 0.001, 0.0001]
}
# create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=2, n_jobs=3)

# fit the grid search object to the data
grid_search.fit(X_train, Y_train)

# evaluate the best model on the test set
score = grid_search.score(X_test, Y_test)

# stopping the timer and seeing the result: ------------------------
end_time = time.time()

print('Accuracy on the test set: ', score)
print('\nBest parameters after the grid search: ', grid_search.best_params_)
print("\nTime taken for model to learn: ", end_time - start_time, " seconds")

In [None]:
# save the model to disk using pickle
if input('Are you sure you want to run this cell? It will overwrite current best model. Y/N: ') == 'Y':
    with open('trained_models/best_SGDClassfier_model.pkl', 'wb') as f:
        pickle.dump(grid_search, f)

In [None]:
# load the best model from disk using pickle
with open('trained_models/best_SGDClassfier_model.pkl', 'rb') as f:
    model = pickle.load(f)
 
print('Best parameters after the grid search: ', model.best_params_)

In [None]:
# Get the selected feature indices after SelectKBest() in the best estimator
selected_feature_indices = model.best_estimator_.named_steps['selector'].get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = preprocessor.get_feature_names_out()[selected_feature_indices]
print('Columns after feature selection: ', selected_feature_names)

In [None]:
# storing accuracy score:
acc_sco = accuracy_score(Y_test, model.predict(X_test)) 

# construct the confusion matrix
cm = conf_mat(model, Y_test)

# store model results in tuned_models dictionary

print('Accuracy on the test set: ',acc_sco)
tuned_models['SGDClassifier'] = {'accuracy': acc_sco, 'model': model, 'confusion_matrix': cm}

## Linear SVC

Linear Support Vector Classification.

In [None]:
# starting the timer to see how much time it takes
start_time = time.time()

# Let's define the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', KNNImputer()),
    ('selector', SelectKBest()),
    ('model', LinearSVC(random_state=2023))
])


# Define the parameter grid
param_grid = {
    'selector__k': [5, 10, 20, 40, 60, 90, 130, len(preprocessor.get_feature_names_out())],
    'model__loss': ['hinge'],
    'model__C': [10, 1, 0.1, 0.01, 0.001],
    'model__penalty': ['l2'],
    'model__max_iter': [500, 1000, 1500],
    'model__tol': [0.1, 0.01, 0.001, 0.0001, 0.00001]
}
# create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=2, n_jobs=3)

# fit the grid search object to the data
grid_search.fit(X_train, Y_train)

# evaluate the best model on the test set
score = grid_search.score(X_test, Y_test)

# stopping the timer and seeing the result: ------------------------
end_time = time.time()

print('Accuracy on the test set: ', score)
print('\nBest parameters after the grid search: ', grid_search.best_params_)
print("\nTime taken for model to learn: ", end_time - start_time, " seconds")

In [None]:
# save the model to disk using pickle
if input('Are you sure you want to run this cell? It will overwrite current best model. Y/N: ') == 'Y':
    with open('trained_models/best_LinearSVC_model.pkl', 'wb') as f:
        pickle.dump(grid_search, f)

In [None]:
# load the best model from disk using pickle
with open('trained_models/best_LinearSVC_model.pkl', 'rb') as f:
    model = pickle.load(f)
 
print('Best parameters after the grid search: ', model.best_params_)

In [None]:
# Get the selected feature indices after SelectKBest() in the best estimator
selected_feature_indices = model.best_estimator_.named_steps['selector'].get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = preprocessor.get_feature_names_out()[selected_feature_indices]
print('Columns after feature selection: ', selected_feature_names)

In [None]:
# storing accuracy score:
acc_sco = accuracy_score(Y_test, model.predict(X_test)) 

# construct the confusion matrix
cm = conf_mat(model, Y_test)

# store model results in tuned_models dictionary

print('Accuracy on the test set: ',acc_sco)
tuned_models['LinearSVC'] = {'accuracy': acc_sco, 'model': model, 'confusion_matrix': cm}

## Gaussian Naive Bayes

Gaussian Naive Bayes (GaussianNB)

In [None]:
# starting the timer to see how much time it takes
start_time = time.time()

# Let's define the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', KNNImputer()),
    ('selector', SelectKBest()),
    ('model', GaussianNB())
])


# Define the parameter grid
param_grid = {
    'selector__k': [1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 120, 150, len(preprocessor.get_feature_names_out())],
}
# create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1)

# fit the grid search object to the data
grid_search.fit(X_train, Y_train)

# evaluate the best model on the test set
score = grid_search.score(X_test, Y_test)

# stopping the timer and seeing the result: ------------------------
end_time = time.time()

print('Accuracy on the test set: ', score)
print('\nBest parameters after the grid search: ', grid_search.best_params_)
print("\nTime taken for model to learn: ", end_time - start_time, " seconds")

In [None]:
# save the model to disk using pickle
if input('Are you sure you want to run this cell? It will overwrite current best model. Y/N: ') == 'Y':
    with open('trained_models/best_GaussianNB_model.pkl', 'wb') as f:
        pickle.dump(grid_search, f)

In [None]:
# load the best model from disk using pickle
with open('trained_models/best_GaussianNB_model.pkl', 'rb') as f:
    model = pickle.load(f)
 
print('Best parameters after the grid search: ', model.best_params_)

In [None]:
# Get the selected feature indices after SelectKBest() in the best estimator
selected_feature_indices = model.best_estimator_.named_steps['selector'].get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = preprocessor.get_feature_names_out()[selected_feature_indices]
print('Columns after feature selection: ', selected_feature_names)

In [None]:
# storing accuracy score:
acc_sco = accuracy_score(Y_test, model.predict(X_test)) 

# construct the confusion matrix
cm = conf_mat(model, Y_test)

# store model results in tuned_models dictionary

print('Accuracy on the test set: ',acc_sco)
tuned_models['Gaussian_NB'] = {'accuracy': acc_sco, 'model': model, 'confusion_matrix': cm}

## KNN

Classifier implementing the k-nearest neighbors vote.

In [None]:
# starting the timer to see how much time it takes
start_time = time.time()

# Let's define the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', KNNImputer()),
    ('selector', SelectKBest()),
    ('model', KNeighborsClassifier())
])


# Define the parameter grid
param_grid = {
    'selector__k': [5, 10, 20, 40, 60, 90, 140, len(preprocessor.get_feature_names_out())],
    'model__n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'model__weights': ['uniform', 'distance'],
    'model__metric': ['euclidean', 'manhattan', 'minkowski']  # 'minkowski' with the default p=2 equals 'euclidean'
}
# create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=2, n_jobs=3)

# fit the grid search object to the data
grid_search.fit(X_train, Y_train)

# evaluate the best model on the test set
score = grid_search.score(X_test, Y_test)

# stopping the timer and seeing the result: ------------------------
end_time = time.time()

print('Accuracy on the test set: ', score)
print('\nBest parameters after the grid search: ', grid_search.best_params_)
print("\nTime taken for model to learn: ", end_time - start_time, " seconds")

In [None]:
# save the model to disk using pickle
if input('Are you sure you want to run this cell? It will overwrite current best model. Y/N: ') == 'Y':
    with open('trained_models/best_KNN_model.pkl', 'wb') as f:
        pickle.dump(grid_search, f)

In [None]:
# load the best model from disk using pickle
with open('trained_models/best_KNN_model.pkl', 'rb') as f:
    model = pickle.load(f)
 
print('Best parameters after the grid search: ', model.best_params_)

In [None]:
# Get the selected feature indices after SelectKBest() in the best estimator
selected_feature_indices = model.best_estimator_.named_steps['selector'].get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = preprocessor.get_feature_names_out()[selected_feature_indices]
print('Columns after feature selection: ', selected_feature_names)

In [None]:
# storing accuracy score:
acc_sco = accuracy_score(Y_test, model.predict(X_test)) 

# construct the confusion matrix
cm = conf_mat(model, Y_test)

# store model results in tuned_models dictionary

print('Accuracy on the test set: ',acc_sco)
tuned_models['KNN'] = {'accuracy': acc_sco, 'model': model, 'confusion_matrix': cm}

## Logistic Regression

Logistic Regression (aka logit, MaxEnt) classifier.


In [None]:
# starting the timer to see how much time it takes
start_time = time.time()

# Let's define the model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('imputer', KNNImputer()),
    ('selector', SelectKBest()),
    ('model', LogisticRegression(random_state=2023))
])


# Define the parameter grid
param_grid = {
    'selector__k': [5, 10, 20, 40, 60, 90, 140, len(preprocessor.get_feature_names_out())],
    'model__penalty': ['l2'],
    'model__C': [0.01, 0.1, 1.0],
    'model__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
    'model__max_iter': [50, 100, 150, 200, 300],
    'model__tol': [0.001, 0.0001, 0.00001]
}
# create the grid search object
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=2, n_jobs=3)

# fit the grid search object to the data
grid_search.fit(X_train, Y_train)

# evaluate the best model on the test set
score = grid_search.score(X_test, Y_test)

# stopping the timer and seeing the result: ------------------------
end_time = time.time()

print('Accuracy on the test set: ', score)
print('\nBest parameters after the grid search: ', grid_search.best_params_)
print("\nTime taken for model to learn: ", end_time - start_time, " seconds")

In [None]:
# save the model to disk using pickle
if input('Are you sure you want to run this cell? It will overwrite current best model. Y/N: ') == 'Y':
    with open('trained_models/best_logistic_regression.pkl', 'wb') as f:
        pickle.dump(grid_search, f)

In [None]:
# load the best model from disk using pickle
with open('trained_models/best_logistic_regression.pkl', 'rb') as f:
    model = pickle.load(f)
 
print('Best parameters after the grid search: ', model.best_params_)

In [None]:
# Get the selected feature indices after SelectKBest() in the best estimator
selected_feature_indices = model.best_estimator_.named_steps['selector'].get_support(indices=True)

# Get the column names of the selected features
selected_feature_names = preprocessor.get_feature_names_out()[selected_feature_indices]
print('Columns after feature selection: ', selected_feature_names)

In [None]:
# storing accuracy score:
acc_sco = accuracy_score(Y_test, model.predict(X_test)) 

# construct the confusion matrix
cm = conf_mat(model, Y_test)

# store model results in tuned_models dictionary

print('Accuracy on the test set: ',acc_sco)
tuned_models['Logistic_Regression'] = {'accuracy': acc_sco, 'model': model, 'confusion_matrix': cm}

# FINAL TUNED MODELS COMPARISON

Here we compare the test-set accuracy of each tuned model against the naive classifier.


In [None]:
plot_models_acc(tuned_models, naive_classifier)

# NOTES

If the odds are available before the fight, there is nothing inherently wrong with using them as a predictor. However, any information used for prediction must be available at the time the prediction is made. If the model is trained on data that would not be available at that time (for example, odds published later or statistics from the fight itself), it suffers from data leakage: its measured accuracy will be overly optimistic and it may perform poorly on new, unseen data.
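
A simple safeguard is to check the feature columns against a list of columns that are only known after the fight. The sketch below assumes that `Conclusion_Method`, `Last_Round_Duration` and `Number_of_Rounds` describe how the fight ended; the `leaked_columns` helper and the example feature lists are hypothetical.

```python
def leaked_columns(feature_columns, unavailable_columns):
    """Return the feature columns that would not be known before the fight."""
    return sorted(set(feature_columns) & set(unavailable_columns))

# assumption: these columns describe how the fight ended
post_fight = ["Conclusion_Method", "Last_Round_Duration", "Number_of_Rounds"]

features_ok = ["R_Age", "R_Reach", "R_Open_Odds", "B_Open_Odds"]
features_bad = features_ok + ["Last_Round_Duration"]

print(leaked_columns(features_ok, post_fight))   # []
print(leaked_columns(features_bad, post_fight))  # ['Last_Round_Duration']
```

In this notebook the check could be run on `X_all.columns` before any model is fitted.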

Additionally, it's important to consider the ethics of using betting odds for predictive modeling. While it may be legal to use this information for research purposes, it could be seen as promoting gambling or taking advantage of vulnerable populations. It's important to approach this type of research with sensitivity and to consider the potential impacts of the research on society.