First we have to select the attributes and prepare the data (remove outliers and data points with missing values).

In [1]:
import pandas as pd

From our initial analysis in task 1 we know that we should use attributes such as title bout, height, reach, stance, losing streak and better rank, with 'Winner' as the label. The selection below also includes the age of both fighters. We can add more attributes later, but let's create the first models using only these.

In [2]:
data = pd.read_csv('../../data/ufc-master.csv')[['B_fighter', 'R_fighter', 'title_bout', 'B_current_lose_streak',
       'R_current_lose_streak', 'B_Stance', 'R_Stance', 'B_age', 'R_age', 'height_dif', 'reach_dif', 'better_rank', 'Winner']]
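The preparation step mentions removing data points with missing values. A minimal sketch of how that check could look is shown here on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration and are not taken from ufc-master.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few of the selected columns; the values are made up.
toy = pd.DataFrame({
    'height_dif': [2.54, np.nan, -5.08, 0.0],
    'reach_dif': [5.08, 2.54, np.nan, -2.54],
    'Winner': ['Red', 'Blue', 'Red', 'Blue'],
})

# Count the missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
toy_clean = toy.dropna().reset_index(drop=True)
print(len(toy), '->', len(toy_clean))
```

The same two calls, `isna().sum()` and `dropna()`, could be applied to `data` if the per-column counts show that any of the selected attributes have gaps.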

Fixing mistakes in the dataset

In [3]:
pd.set_option('display.max_rows', 125)
pd.set_option('display.max_columns', 125)
#In the blue fighter stance column we have to fix one data point, where 'Switch' is written as 'Switch ' with a trailing space
data['B_Stance'] = data['B_Stance'].replace({'Switch ': 'Switch'})

#In the height dif there is one outlier where the difference between the two fighters is 187.96 cm, which is obviously a mistake.
#Instead of excluding this datapoint I am going to fix this manually using the height data available for both fighters on the
#UFC website (the fighters were Parker Porter and Kyle Daukaus)
data['height_dif'] = data['height_dif'].replace({-187.96: -7.62})
data['height_dif'].value_counts()

#The reach_dif column contains two erroneous values, -187.96 and -160.02. These will be fixed manually as well.
#The value -187.96 appears in two fights: Parker Porter vs Kyle Daukaus and Irwin Rivera vs Giga Chikadze
#The value -160.02 appears in one fight: Jinh Yu Frey and Kay Hanse
filter1 = (data['reach_dif'] == -187.96) & (data['B_fighter'] == 'Parker Porter')
filter2 = (data['reach_dif'] == -187.96) & (data['B_fighter'] == 'Irwin Rivera')
filter3 = data['reach_dif'] == -160.02
#Assign through .loc so that only the reach_dif cell of the matching rows is changed
data.loc[filter1, 'reach_dif'] = -2.54
data.loc[filter2, 'reach_dif'] = -17.78
data.loc[filter3, 'reach_dif'] = 5.08
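Values such as -187.96 can also be found programmatically instead of by scanning the value counts. The following is only a sketch on made-up rows: the fighter names, the `toy` frame and the 60 cm threshold `MAX_PLAUSIBLE_DIF_CM` are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows that mimic the kind of errors described above.
toy = pd.DataFrame({
    'B_fighter': ['Fighter A', 'Fighter B', 'Fighter C', 'Fighter D'],
    'height_dif': [2.54, -187.96, -7.62, 5.08],
    'reach_dif': [0.0, -187.96, 10.16, -160.02],
})

# Assumed threshold: a difference above 60 cm is treated as implausible.
MAX_PLAUSIBLE_DIF_CM = 60

# Flag every row where either difference exceeds the threshold in absolute value.
too_large = (toy['height_dif'].abs() > MAX_PLAUSIBLE_DIF_CM) | \
            (toy['reach_dif'].abs() > MAX_PLAUSIBLE_DIF_CM)
suspicious = toy[too_large]
print(suspicious)

# Correct a single cell of a single row; the other columns stay untouched.
toy.loc[toy['B_fighter'] == 'Fighter B', 'height_dif'] = -7.62
```

Selecting the row and the column together in `.loc` keeps the fix limited to one cell, which matters when the same wrong number appears in more than one column of a row.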

Now that the most obvious errors in the dataset are fixed, we have to convert the boolean and two-valued columns into 1s and 0s and one-hot encode the remaining non-numerical columns.

In [5]:
#The values in the title_bout column were boolean
data['title_bout'] = (data['title_bout']).astype(int)
data['Winner'] = data['Winner'].map(dict(Blue=1, Red=0))
data = pd.get_dummies(data, columns=['B_Stance', 'R_Stance', 'better_rank'])
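To make the effect of the encoding step visible, here is a small sketch on three made-up rows. The `toy` frame is an assumption for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Three made-up fights with one boolean, one categorical and one label column.
toy = pd.DataFrame({
    'title_bout': [True, False, False],
    'B_Stance': ['Orthodox', 'Southpaw', 'Switch'],
    'Winner': ['Blue', 'Red', 'Red'],
})

# Boolean -> 0/1 and label -> 0/1, as in the cell above.
toy['title_bout'] = toy['title_bout'].astype(int)
toy['Winner'] = toy['Winner'].map({'Blue': 1, 'Red': 0})

# One-hot encoding replaces B_Stance with one indicator column per stance.
encoded = pd.get_dummies(toy, columns=['B_Stance'])
print(encoded.columns.tolist())
```

Note that `map` turns any label that is not a key of the dictionary into NaN, so it is worth checking that 'Winner' contains only 'Blue' and 'Red' before mapping.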

Now the data is ready. Let's split it into a training set and a validation set. Because the end goal is to use the model to predict fights that are not in the dataset, we are not going to create a separate test set and will work only with the training and validation sets.

Because we are not going to use the fighters' names for prediction, we drop these columns for now.

In [6]:
data = data.drop(columns=['B_fighter', 'R_fighter'])

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(data.drop(columns='Winner'), data['Winner'], test_size = 0.15, random_state = 0)

Hyperparameter tuning will come later. First we want to identify algorithms that might do well.

Let's start with a simple decision tree classifier.

In [8]:
from sklearn import tree
from sklearn.metrics import accuracy_score
tree_1 = tree.DecisionTreeClassifier(criterion='gini', random_state = 0)
tree_1.fit(X_train, y_train)
accuracy_1 = accuracy_score(y_val, tree_1.predict(X_val))
print(accuracy_1)

0.5394932935916542


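54% accuracy is not the best start, but it is at least slightly better than a coin flip (50%). A stricter reference point is the share of the more frequent winner in the data, which can be above 50%. At the end we will train the chosen model on the entire dataset, so it will have somewhat more data to learn from, although this does not guarantee better accuracy.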
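Random guessing is only the right reference point if both corners win equally often. A common alternative is a baseline that always predicts the more frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the `*_toy` arrays are assumptions for illustration and do not reflect the real class balance in the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 0 = Red wins, 1 = Blue wins (6 Red and 4 Blue in training).
y_train_toy = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1])
y_val_toy = np.array([0, 1, 0, 0, 1])

# The baseline ignores the features, so placeholder columns are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

# Always predict the class that is most frequent in the training labels.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)
baseline_acc = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print(baseline_acc)
```

Fitting the same baseline on `X_train`, `y_train` and scoring it on `X_val`, `y_val` would give the number that the tree, KNN and forest accuracies should be compared against.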

Let's now test a KNN algorithm

In [9]:
from sklearn.neighbors import KNeighborsClassifier
knn_1 = KNeighborsClassifier(n_neighbors = 1)
knn_1.fit(X_train, y_train)
accuracy_2 = accuracy_score(y_val, knn_1.predict(X_val))
print(accuracy_2)

0.5201192250372578


In [23]:
knn_3 = KNeighborsClassifier(n_neighbors = 3)
knn_3.fit(X_train, y_train)
accuracy_3 = accuracy_score(y_val, knn_3.predict(X_val))
print(accuracy_3)

0.5439642324888226


In [12]:
knn_5 = KNeighborsClassifier(n_neighbors = 5)
knn_5.fit(X_train, y_train)
accuracy_4 = accuracy_score(y_val, knn_5.predict(X_val))
print(accuracy_4)

0.533532041728763


From these simple tests we can see that KNN with n_neighbors = 3 did best, achieving 54.4% accuracy on the validation set.

Testing the random forest classifier

In [13]:
from sklearn.ensemble import RandomForestClassifier
forest_1 = RandomForestClassifier()
forest_1.fit(X_train, y_train)
accuracy_5 = accuracy_score(y_val, forest_1.predict(X_val))
print(accuracy_5)

0.5320417287630402


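The initial findings suggest that the KNN algorithm might be the best choice, although the differences between the models are small (roughly one percentage point). Let's use a loop to find the best combination of the number of neighbors and the distance metric (Euclidean or Manhattan).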

In [25]:
#Note: in the KNN algorithm p=1 is Manhattan distance and p=2 is the Euclidean distance
best_acc = 0
best_comb = [0, 0]
for i in range(1, 200, 1):
    for j in range(1, 3, 1):
        model = KNeighborsClassifier(n_neighbors = i, p = j)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_val, model.predict(X_val))
        if (acc > best_acc):
            best_acc = acc
            best_comb[0] = i
            best_comb[1] = j
print("The best achieved accuracy was: " + str(round(best_acc * 100, 2)) + "%.")
print("The neighbors value should be: " + str(best_comb[0]))
print("The value for p should be: " + str(best_comb[1]))

The best achieved accuracy was: 58.42%.
The neighbors value should be: 109
The value for p should be: 2


Without using most of the attributes in the dataset, we have already reached 58.4% accuracy on the validation set (15% of the data), which is a rather promising sign.
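Because the loop above picks n_neighbors and p by their score on the validation set, the reported accuracy is measured on the same data that was used for the selection and may be somewhat optimistic. One possible alternative, sketched below, is a cross-validated grid search with scikit-learn's `GridSearchCV`. The synthetic data from `make_classification` and the small parameter grid are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (assumption, not the UFC dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# Same two hyperparameters as in the loop: p=1 is Manhattan, p=2 is Euclidean.
param_grid = {'n_neighbors': [1, 5, 15, 25], 'p': [1, 2]}

# Every combination is scored with 5-fold cross-validation on the training data.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With this approach the validation set is not touched during the search, so the accuracy of `search.best_estimator_` on `X_val` remains an independent check.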