Previously we found that a KNN classifier trained on a relatively small number of features reached slightly more than 58% accuracy on the validation set (15% of the whole dataset). That is not an impressive result, but it shows that fight outcomes can be predicted to some degree, even though any fight can be won with a single punch. Now we expand the list of features and check whether the same KNN algorithm can do better (for example, at least 63-65% accuracy) or whether we have to change the algorithm instead. Keep in mind that the validation set is not used for training, so the final result should be a little better: after choosing the final algorithm and parameter settings, we will train the model on the entire dataset and only then use it to predict fights in the test set (fights that have happened since our project dataset was last updated or, in the case of UFC 256 on 12 December, have not happened yet).

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('../../data/ufc-master.csv')[['B_fighter', 'R_fighter', 'title_bout', 'B_current_win_streak', 
        'B_current_lose_streak', 'R_current_win_streak', 'R_current_lose_streak', 'B_Stance', 'R_Stance', 
        'B_avg_TD_landed', 'R_avg_TD_landed', 'B_wins', 'B_losses', 'R_wins', 'R_losses', 'B_age', 'R_age',
        'height_dif', 'reach_dif', 'better_rank', 'gender', 'Winner']]
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

The first part of the data preparation is the same as before.

In [3]:
#In the blue fighter stance column we have to fix one data point, where 'Switch' is written as 'Switch ' with an extra space
#at the end
data['B_Stance'] = data['B_Stance'].replace({'Switch ': 'Switch'})

#In the height dif there is one outlier where the difference between the two fighters is 187.96 cm, which is obviously a mistake.
#Instead of excluding this datapoint I am going to fix this manually using the height data available for both fighters on the
#UFC website (the fighters were Parker Porter and Kyle Daukaus)
data['height_dif'] = data['height_dif'].replace({-187.96: -7.62})
data['height_dif'].value_counts()

#In the reach difference there are two erroneous values: -187.96 and -160.02. These mistakes will be
#fixed as well.
#The value -187.96 appears for two fights: Parker Porter vs Kyle Daukaus and Irwin Rivera vs Giga Chikadze
#The value -160.02 appears for the fight between Jinh Yu Frey and Kay Hansen
filter1 = (data['reach_dif'] == -187.96) & (data['B_fighter'] == 'Parker Porter')
filter2 = (data['reach_dif'] == -187.96) & (data['B_fighter'] == 'Irwin Rivera')
filter3 = data['reach_dif'] == -160.02
#Assign the corrected values to the reach_dif column only, so that no other column in these rows can be changed by accident
data.loc[filter1, 'reach_dif'] = -2.54
data.loc[filter2, 'reach_dif'] = -17.78
data.loc[filter3, 'reach_dif'] = 5.08

We have now included features that have NaN values (average takedowns landed) for some fighters so we will replace them with 0s.

In [4]:
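#Assign the result back instead of using inplace=True on a selected column, which newer pandas versions warn about
data['B_avg_TD_landed'] = data['B_avg_TD_landed'].fillna(0)
data['R_avg_TD_landed'] = data['R_avg_TD_landed'].fillna(0)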
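
Before filling, it can be useful to see how many values are actually missing. The following is a minimal sketch on a made-up three-row frame (the values and the `toy` name are assumptions for illustration, not rows from ufc-master.csv); it counts the NaNs in both takedown columns, fills them with 0 and counts again.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the two takedown columns (assumption: not real data)
toy = pd.DataFrame({
    "B_avg_TD_landed": [1.82, np.nan, 1.0],
    "R_avg_TD_landed": [np.nan, 0.73, np.nan],
})
td_cols = ["B_avg_TD_landed", "R_avg_TD_landed"]

# Number of missing values per column before the fill
missing_before = toy[td_cols].isna().sum()

# Fill both columns in one assignment
toy[td_cols] = toy[td_cols].fillna(0)

# Number of missing values per column after the fill
missing_after = toy[td_cols].isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

If a large share of the rows turns out to be missing, filling with 0 treats "no data" the same as "no takedowns landed", which may be worth keeping in mind when interpreting the results.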

In [5]:
#Now we will use columns wins and losses for both fighters to create a column that has a win ratio out of all wins and losses
B_ratio = data['B_wins'] / (data['B_wins'] + data['B_losses'])
R_ratio = data['R_wins'] / (data['R_wins'] + data['R_losses'])
data['B_wr'] = B_ratio
data['R_wr'] = R_ratio
#In some rows the ratio is now NaN (0 divided by 0) because the fighter has never fought before. In task 1 we found out
#that fighters making their debut win about 43% of the time, so we replace NaN with 0.43, as giving them 0 would not
#represent reality very well
data['B_wr'] = data['B_wr'].fillna(0.43)
data['R_wr'] = data['R_wr'].fillna(0.43)
#Now we will drop win and loss columns for both fighters because we have added the winrate column
data = data.drop(columns=['B_wins', 'B_losses', 'R_wins', 'R_losses'])

#Now changing categorical variables into 1s and 0s where necessary
data['title_bout'] = (data['title_bout']).astype(int)
data['Winner'] = data['Winner'].map(dict(Blue=1, Red=0))
data['gender'] = data['gender'].map(dict(MALE=1, FEMALE=0))
data = pd.get_dummies(data, columns=['B_Stance', 'R_Stance', 'better_rank'])
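
To make the debut case concrete, here is a small sketch of the win ratio calculation on made-up records. The helper name `win_ratio`, its `debut_value` parameter and the toy numbers are assumptions for illustration; only the 0.43 default comes from the text above.

```python
import pandas as pd

def win_ratio(wins, losses, debut_value=0.43):
    """Return wins / (wins + losses), using debut_value where a fighter has no fights."""
    ratio = wins / (wins + losses)  # 0 / 0 gives NaN for a debuting fighter
    return ratio.fillna(debut_value)

# Made-up records: a veteran, a debuting fighter and an unbeaten fighter
toy = pd.DataFrame({"B_wins": [18, 0, 3], "B_losses": [11, 0, 0]})
toy["B_wr"] = win_ratio(toy["B_wins"], toy["B_losses"])
print(toy)
```

The middle row has no wins and no losses, so its ratio is undefined and falls back to the debut value instead of 0.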

In [6]:
#Dropping the fighters' names as well because this whole prediction is based on the stats only
data = data.drop(columns=['B_fighter', 'R_fighter'])

In [7]:
#Creating training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(data.drop(columns='Winner'), data['Winner'], test_size = 0.15, random_state = 0)

In [8]:
#First we are going to use the KNN algorithm that we found earlier and see if it does better with more features or if we have
#to think about changing the algorithm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors = 109)
knn.fit(X_train, y_train)
acc = accuracy_score(y_val, knn.predict(X_val))
print(acc)

0.5827123695976155


The result (about 58.3%) is basically the same as before, so we have to either change the parameter values or change the algorithm.
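
One way to explore other parameter values is a cross-validated search over `n_neighbors`. The sketch below is not what this notebook ran: it uses synthetic data from `make_classification` as a stand-in for the fight data, the list of candidate values is arbitrary, and the `StandardScaler` step is an added assumption (KNN is distance-based, so features on larger scales can otherwise dominate).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared fight data (assumption: not the real dataset)
X, y = make_classification(n_samples=600, n_features=12, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=0)

# Scale the features, then classify with KNN
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Arbitrary candidate values for k, including the 109 used above
candidate_k = [5, 15, 35, 75, 109]
search = GridSearchCV(pipe, {"knn__n_neighbors": candidate_k}, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
val_acc = search.score(X_val, y_val)  # accuracy of the refitted best model on the held-out part
print(best_k, round(val_acc, 3))
```

Because the search uses cross-validation on the training part only, the validation set stays untouched until the final accuracy check.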

Testing again with a random forest classifier

In [9]:
from sklearn.ensemble import RandomForestClassifier
forest_1 = RandomForestClassifier(n_estimators = 100, random_state=0)
forest_1.fit(X_train, y_train)
accuracy_2 = accuracy_score(y_val, forest_1.predict(X_val))
print(accuracy_2)

0.5543964232488823


The result is better than in our first attempt with this classifier, but still worse than the KNN algorithm (about 55.4% versus 58.3%). The problem might lie in how the dataset is prepared for the predictions. Instead of using per-fighter features such as the average takedowns landed by each fighter, we should use the difference between the two fighters, and do the same for age, as we did in task 1.
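
A minimal sketch of that idea follows. It uses the takedown and age values from the first three rows of the prepared data, while the helper name `add_difference_features`, the `_dif` column names (modelled on `height_dif`) and the Blue-minus-Red sign convention are assumptions for illustration.

```python
import pandas as pd

def add_difference_features(df, stems):
    """Replace each B_<stem> / R_<stem> pair with one <stem>_dif column (Blue minus Red)."""
    out = df.copy()
    for stem in stems:
        out[f"{stem}_dif"] = out[f"B_{stem}"] - out[f"R_{stem}"]
        out = out.drop(columns=[f"B_{stem}", f"R_{stem}"])
    return out

# Takedown averages and ages for three fights
toy = pd.DataFrame({
    "B_avg_TD_landed": [1.82, 0.0, 1.0],
    "R_avg_TD_landed": [0.25, 0.73, 2.41],
    "B_age": [36, 26, 21],
    "R_age": [36, 35, 21],
})

result = add_difference_features(toy, ["avg_TD_landed", "age"])
print(result)
```

The sign convention should match whatever the existing `height_dif` and `reach_dif` columns use, so that all difference features point in the same direction.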

In [11]:
data.head(3)

Unnamed: 0,title_bout,B_current_win_streak,B_current_lose_streak,R_current_win_streak,R_current_lose_streak,B_avg_TD_landed,R_avg_TD_landed,B_age,R_age,height_dif,reach_dif,gender,Winner,B_wr,R_wr,B_Stance_Open Stance,B_Stance_Orthodox,B_Stance_Southpaw,B_Stance_Switch,R_Stance_Open Stance,R_Stance_Orthodox,R_Stance_Southpaw,R_Stance_Switch,better_rank_Blue,better_rank_Red,better_rank_neither
0,0,0,2,0,1,1.82,0.25,36,36,-7.62,0.0,1,1,0.62069,0.642857,0,0,1,0,0,1,0,0,0,1,0
1,0,1,0,0,1,0.0,0.73,26,35,5.08,10.16,1,1,1.0,0.666667,0,1,0,0,0,1,0,0,0,0,1
2,0,1,0,1,0,1.0,2.41,21,21,2.54,-12.7,0,1,1.0,1.0,0,1,0,0,0,1,0,0,0,0,1
