Predicting UFC fight results using data about fighters and the outcomes of the previous fights

In [1]:
import pandas as pd

In [2]:
training_data = pd.read_csv('../../data/ufc-master.csv') #For training
prediction_data = pd.read_csv('../../data/for-predictions.csv') #Fights that models have to predict

Selecting features and preparing the training data

In [3]:
training_data = training_data[['B_fighter', 'R_fighter', 'B_odds', 'R_odds', 'title_bout', 'reach_dif', 'B_age', 'R_age', 
                               'B_current_lose_streak', 'R_current_lose_streak', 'B_current_win_streak',
                               'R_current_win_streak', 'better_rank', 'B_wins', 'R_wins', 'B_losses', 'R_losses',
                               'B_Stance', 'R_Stance', 'Winner']]

In [4]:
#In the blue fighter stance column we have to fix one data point, where 'Switch' is written as 'Switch ' with an extra space
#at the end
training_data['B_Stance'] = training_data['B_Stance'].replace({'Switch ': 'Switch'})
#Fixing values in the reach_dif columns
#We will fix outliers using the data available on the UFC website instead of removing the "broken" datapoints entirely
filter1 = (training_data['reach_dif'] == -187.96) & (training_data['B_fighter'] == 'Parker Porter')
filter2 = (training_data['reach_dif'] == -187.96) & (training_data['B_fighter'] == 'Irwin Rivera')
filter3 = training_data['reach_dif'] == -160.02
training_data[filter1] = training_data[filter1].replace({-187.96: -2.54})
training_data[filter2] = training_data[filter2].replace({-187.96: -17.78 })
training_data[filter3] = training_data[filter3].replace({-160.02: 5.08})

In [5]:
#Now we will use columns B_wins, B_losses, R_wins and R_losses to create a column for both fighters
#that contains the win rate (proportion of wins out of wins and losses combined)
B_ratio = training_data['B_wins'] / (training_data['B_wins'] + training_data['B_losses'])
R_ratio = training_data['R_wins'] / (training_data['R_wins'] + training_data['R_losses'])
training_data['B_wr'] = B_ratio
training_data['R_wr'] = R_ratio
#It is possible that in some of the rows that value is now NaN as the fighter has never fought before. In task 1 we found out
#that the fighters making debut usually win 43% of the time so we will replace NaN with 0.43 as giving them 0 would not be 
# "fair" and will hurt the prediction accuracy
training_data['B_wr'].fillna(0.43, inplace=True)
training_data['R_wr'].fillna(0.43, inplace=True)
#Now we will drop win and loss columns for both fighters because these features are not important for us anymore after
#creating the win rate column
training_data = training_data.drop(columns=['B_wins', 'B_losses', 'R_wins', 'R_losses'])

In [6]:
#Changing values into 1s and 0s where necessary and one-hot encoding stance and rank features
training_data['title_bout'] = (training_data['title_bout']).astype(int)
training_data['Winner'] = training_data['Winner'].map(dict(Blue=1, Red=0))
training_data = pd.get_dummies(training_data, columns=['B_Stance', 'R_Stance', 'better_rank'])

Now that we have prepared the training data we are going to split it into training and validation sets so that we can
choose hyperparameters for the prediction models.

In [7]:
#Creating training and validation sets for choosing hyperparameters for models
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(training_data.drop(columns=['B_fighter', 'R_fighter', 'B_odds', 'R_odds', 'Winner']), training_data['Winner'], test_size = 0.15, random_state = 2)

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
#Note: in the KNN algorithm p=1 is Manhattan distance and p=2 is the Euclidean distance
best_acc = 0
best_comb = [0, 0]
for i in range(1, 301, 10):
    for j in range(1, 3, 1):
        model = KNeighborsClassifier(n_neighbors = i, p = j)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_val, model.predict(X_val))
        if (acc > best_acc):
            best_acc = acc
            best_comb[0] = i
            best_comb[1] = j
print("The best achieved accuracy was: " + str(round(best_acc * 100, 2)) + "%.")
print("The neighbors value should be: " + str(best_comb[0]))
print("The value for p should be: " + str(best_comb[1]))

The best achieved accuracy was: 62.3%.
The neighbors value should be: 71
The value for p should be: 1


For random forest classifier we will tune hyperparameters manually

In [9]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 350, random_state=0, max_depth=11, max_features=9)
forest.fit(X_train, y_train)
accuracy = accuracy_score(y_val, forest.predict(X_val))
print("Accuracy of the forest classifier model: " + str(accuracy))

Accuracy of the forest classifier model: 0.6050670640834576


Now we have two models and their hyperparameter values.

Loading dataset for making predictions (in the predictions 1 means that "Blue" fighter won and 0 that "Red" fighter won). The dataset will be prepared similarily to the training dataset above.

In [10]:
pred = pd.read_csv('../../data/for-predictions.csv')
pred_df = pred[['B_odds', 'R_odds', 'title_bout', 'reach_dif', 'B_age', 'R_age', 
                               'B_current_lose_streak', 'R_current_lose_streak', 'B_current_win_streak',
                               'R_current_win_streak', 'better_rank', 'B_wins', 'R_wins', 'B_losses', 'R_losses',
                               'B_Stance', 'R_Stance']].copy(deep=True)

In [11]:
#Now we will use columns wins and losses for both fighters to create a column that has a win ratio out of all wins and losses
B_ratio = pred_df['B_wins'] / (pred_df['B_wins'] + pred_df['B_losses'])
R_ratio = pred_df['R_wins'] / (pred_df['R_wins'] + pred_df['R_losses'])
pred_df['B_wr'] = B_ratio
pred_df['R_wr'] = R_ratio
#It is possible that in some of the rows that value is now NaN as the fighter has never fought before. In task 1 we found out
#that the fighters making debut usually win 43% of the time so we will replace NaN with 0.43 as giving them 0 would not 
#represent reality very well
pred_df['B_wr'].fillna(0.43, inplace=True)
pred_df['R_wr'].fillna(0.43, inplace=True)
#Now we will drop win and loss columns for both fighters because we have added the winrate column
pred_df = pred_df.drop(columns=['B_wins', 'B_losses', 'R_wins', 'R_losses'])

In [12]:
#One-hot encoding as in the training dataset
pred_df = pd.get_dummies(pred_df, columns=['B_Stance', 'R_Stance', 'better_rank'])

In [13]:
#Adding missing columns (one-hot encoding does not create them when some values are not represented)
pred_df['better_rank_Blue'] = 0
pred_df['B_Stance_Open Stance'] = 0
pred_df['R_Stance_Open Stance'] = 0

Predicting fight results   
Note: The models will be now trained on the entire training data because when we chose hyperparameter values we did not use 15% of the data that was available to us as we needed it for validation set.

Predictions with KNN classifier

Model that does not use betting values that were available before the fight

In [14]:
knn_final_1 = KNeighborsClassifier(n_neighbors = 71, p=1)
knn_final_1.fit(training_data.drop(columns=['B_fighter', 'R_fighter', 'B_odds', 'R_odds', 'Winner']), training_data['Winner'])
pred['KNN'] = knn_final_1.predict(pred_df.drop(columns=['B_odds', 'R_odds']))

Model that uses betting values that were available

In [15]:
knn_final_2 = KNeighborsClassifier(n_neighbors = 71, p=1)
knn_final_2.fit(training_data.drop(columns=['B_fighter', 'R_fighter', 'Winner']), training_data['Winner'])
pred['KNN-2'] = knn_final_2.predict(pred_df)

Predictions with random forest classifier

Model that does not use betting values that were available before the fight

In [16]:
forest_final_1 = RandomForestClassifier(n_estimators = 350, random_state=0, max_depth=11, max_features=9)
forest_final_1.fit(training_data.drop(columns=['B_fighter', 'R_fighter', 'B_odds', 'R_odds', 'Winner']), training_data['Winner'])
pred['Forest'] = forest_final_1.predict(pred_df.drop(columns=['B_odds', 'R_odds']))

Model that uses betting values that were available

In [17]:
forest_final_2 = RandomForestClassifier(n_estimators = 350, random_state=0, max_depth=11, max_features=9)
forest_final_2.fit(training_data.drop(columns=['B_fighter', 'R_fighter', 'Winner']), training_data['Winner'])
pred['Forest-2'] = forest_final_2.predict(pred_df)

Checking the prediction results of the models

To which events fights in the dataset for predictions belonged  
UFC 256 (December 12) - fights 0-9  
UFC Vegas 16 (December 5) - fights 10-20  
UFC Vegas 15 (November 28) - fights 20-30

In [18]:
#Reminder: in the model predictions 1 is the blue fighter and 0 is the red fighter
#In the winner column 1 means that blue won, 0 red won, 0.5 that fight was a draw and -1 cancelled fight
#there have been more fight cancellations than in previous years because of coronavirus
winners = [0.5, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, -1, 0, -1, 0, 1, -1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
pred['Winner'] = winners
results = pred[['B_fighter', 'R_fighter', 'KNN', 'Forest', 'KNN-2', 'Forest-2', 'Winner', 'B_odds', 'R_odds']].copy(deep=True)
#How the results table looks
results

Unnamed: 0,B_fighter,R_fighter,KNN,Forest,KNN-2,Forest-2,Winner,B_odds,R_odds
0,Brandon Moreno,Deiveson Figueiredo,0,1,0,0,0.5,250,-330
1,Charles Oliveira,Tony Ferguson,1,1,0,1,1.0,140,-175
2,Virna Jandiroba,Mackenzie Dern,0,0,0,0,0.0,160,-200
3,Ronaldo Souza,Kevin Holland,0,0,1,0,0.0,-118,-106
4,Ciryl Gane,Junior dos Santos,1,1,1,1,1.0,-455,333
5,Daniel Pineda,Cub Swanson,1,1,0,1,0.0,-159,127
6,Rafael Fiziev,Renato Moicano,0,1,1,1,1.0,-159,130
7,Billy Quarantillo,Gavin Tucker,0,1,0,0,0.0,-167,135
8,Sam Hughes,Tecia Torres,1,1,0,0,0.0,355,-500
9,Peter Barrett,Chase Hooper,0,0,0,0,0.0,255,-335


Accuracy results for each model

In [19]:
KNN_acc = 0
Forest_acc = 0
KNN_acc_2 = 0
Forest_acc_2 = 0
fights_happened = 0
for i in range(21, 31):
    if (results['Winner'].iloc[i] != -1):
        fights_happened += 1
    if (results['Winner'].iloc[i] == results['KNN'].iloc[i]):
        KNN_acc += 1
    if (results['Winner'].iloc[i] == results['Forest'].iloc[i]):
        Forest_acc += 1
    if (results['Winner'].iloc[i] == results['KNN-2'].iloc[i]):
        KNN_acc_2 += 1
    if (results['Winner'].iloc[i] == results['Forest-2'].iloc[i]):
        Forest_acc_2 += 1
        
print("KNN prediction accuracy: " + str(round((KNN_acc / fights_happened) * 100, 2)) + "%.")
print("Forest prediction accuracy: " + str(round((Forest_acc / fights_happened) * 100, 2)) + "%.")
print("KNN-2 (with odds) prediction accuracy: " + str(round((KNN_acc_2 / fights_happened) * 100, 2)) + "%.")
print("Forest prediction (with odds) accuracy: " + str(round((Forest_acc_2 / fights_happened) * 100, 2)) + "%.")

KNN prediction accuracy: 60.0%.
Forest prediction accuracy: 50.0%.
KNN-2 (with odds) prediction accuracy: 40.0%.
Forest prediction (with odds) accuracy: 50.0%.


Algorithm accuracies for each event separately

In [21]:
KNN_acc = 0
Forest_acc = 0
KNN_acc_2 = 0
Forest_acc_2 = 0
fights_happened = 0
for i in range(31):
    if (results['Winner'].iloc[i] != -1):
        fights_happened += 1
    if (results['Winner'].iloc[i] == results['KNN'].iloc[i]):
        KNN_acc += 1
    if (results['Winner'].iloc[i] == results['Forest'].iloc[i]):
        Forest_acc += 1
    if (results['Winner'].iloc[i] == results['KNN-2'].iloc[i]):
        KNN_acc_2 += 1
    if (results['Winner'].iloc[i] == results['Forest-2'].iloc[i]):
        Forest_acc_2 += 1
    if i == 9:
        print("UFC 256 (December 12)\n")
        print("KNN prediction accuracy: " + str(round((KNN_acc / fights_happened) * 100, 2)) + "%.")
        print("Forest prediction accuracy: " + str(round((Forest_acc / fights_happened) * 100, 2)) + "%.")
        print("KNN-2 (with odds) prediction accuracy: " + str(round((KNN_acc_2 / fights_happened) * 100, 2)) + "%.")
        print("Forest prediction (with odds) accuracy: " + str(round((Forest_acc_2 / fights_happened) * 100, 2)) + "%.\n")
        KNN_acc = 0
        Forest_acc = 0
        KNN_acc_2 = 0
        Forest_acc_2 = 0
        fights_happened = 0
    elif i == 20:
        print("UFC Vegas 16(December 5)\n")
        print("KNN prediction accuracy: " + str(round((KNN_acc / fights_happened) * 100, 2)) + "%.")
        print("Forest prediction accuracy: " + str(round((Forest_acc / fights_happened) * 100, 2)) + "%.")
        print("KNN-2 (with odds) prediction accuracy: " + str(round((KNN_acc_2 / fights_happened) * 100, 2)) + "%.")
        print("Forest prediction (with odds) accuracy: " + str(round((Forest_acc_2 / fights_happened) * 100, 2)) + "%.\n")
        KNN_acc = 0
        Forest_acc = 0
        KNN_acc_2 = 0
        Forest_acc_2 = 0
        fights_happened = 0
    elif i == 30:
        print("UFC Vegas 15(November 28)\n")
        print("KNN prediction accuracy: " + str(round((KNN_acc / fights_happened) * 100, 2)) + "%.")
        print("Forest prediction accuracy: " + str(round((Forest_acc / fights_happened) * 100, 2)) + "%.")
        print("KNN-2 (with odds) prediction accuracy: " + str(round((KNN_acc_2 / fights_happened) * 100, 2)) + "%.")
        print("Forest prediction (with odds) accuracy: " + str(round((Forest_acc_2 / fights_happened) * 100, 2)) + "%.\n")

UFC 256 (December 12)

KNN prediction accuracy: 60.0%.
Forest prediction accuracy: 60.0%.
KNN-2 (with odds) prediction accuracy: 70.0%.
Forest prediction (with odds) accuracy: 80.0%.

UFC Vegas 16(December 5)

KNN prediction accuracy: 50.0%.
Forest prediction accuracy: 62.5%.
KNN-2 (with odds) prediction accuracy: 75.0%.
Forest prediction (with odds) accuracy: 75.0%.

UFC Vegas 16(November 28)

KNN prediction accuracy: 60.0%.
Forest prediction accuracy: 50.0%.
KNN-2 (with odds) prediction accuracy: 40.0%.
Forest prediction (with odds) accuracy: 50.0%.



These results show us that it is possible to predict fight outcomes using the data that is available before the fight, but the prediction accuracies are not as high as we hoped before the project (the goal was 70%). It is hard to achieve very good results because mismatches happen rarely and most of the time the opposing fighters are quite evenly matched and there is also a lot of unpredictability when it comes to fighting (both fighters are trained and well prepared and it is always possible to win the fight with a one good punch (usually called lucky punch when the underdog wins)).

It is important to note though that during the last 2 events random forest classifier (wihad accuracies 75% and 80% which shows that it is possible that with more events the prediction accuracy would have gone over 70% after all.