<img src="https://github.com/Kesterchia/UFC-Winner-Predictions/blob/main/Pictures/Khabib-vs-Mcgregor.jpg?raw=True" width="50" height="50"> 

# Source:
https://www.kaggle.com/mdabbert/ultimate-ufc-dataset?select=ufc-master.csv

## Context:

This dataset merges several Kaggle datasets on Ultimate Fighting Championship (UFC) fights. It was compiled by Kaggle user Mdabbert (https://www.kaggle.com/mdabbert).

## Content:

This dataset includes:

Rajeev Warrier's excellent dataset (https://www.kaggle.com/rajeevw/ufcdata). This dataset makes up much of the data. It contains data for every UFC bout. The 'red fighter' and 'blue fighter' are improperly recorded prior to around 2010, so that data has been excluded.

Mdabbert's odds dataset (https://www.kaggle.com/mdabbert/ufc-fights-2010-2020-with-betting-odds). This contains gambling odds for each fight.

Mart Jürisoo's Rankings dataset (https://www.kaggle.com/martj42/ufc-rankings). Includes a history of UFC fighter rankings. 

There are 108 columns of data.

## Column descriptions:
R_fighter, B_fighter: Fighter names

R_odds, B_odds: The American odds that the fighter will win. Usually scraped from bestfightodds.com

R_ev, B_ev: The profit on a 100 credit winning bet

date: The date of the fight

location: The location of the fight

country: The country the fight occurs in

Winner: The winner of the fight [Red, Blue, or Draw]

title_bout: Was this a title bout?

weight_class: The weight class of the bout

gender: Gender of the combatants

no_of_rounds: The number of rounds in the fight

B_current_lose_streak, R_current_lose_streak: Current losing streak

B_current_win_streak, R_current_win_streak: Current winning streak

B_draw, R_draw: Number of draws

B_avg_SIG_STR_landed, R_avg_SIG_STR_landed: Significant Strikes Landed per minute

B_avg_SIG_STR_pct, R_avg_SIG_STR_pct: Significant Striking Accuracy

B_avg_SUB_ATT, R_avg_SUB_ATT: Average Submissions Attempted per 15 Minutes

B_avg_TD_landed, R_avg_TD_landed: Average takedowns landed per 15 minutes

B_avg_TD_pct, R_avg_TD_pct: Takedown accuracy

B_longest_win_streak, R_longest_win_streak: Longest winning streak

B_losses, R_losses: Total number of losses

B_total_rounds_fought, R_total_rounds_fought: Total rounds fought

B_total_title_bouts, R_total_title_bouts: Total number of title bouts

B_win_by_Decision_Majority, R_win_by_Decision_Majority: Wins by Majority Decision

B_win_by_Decision_Split, R_win_by_Decision_Split: Wins by Split Decision

B_win_by_Decision_Unanimous, R_win_by_Decision_Unanimous: Wins by Unanimous Decision

B_win_by_KO/TKO, R_win_by_KO/TKO: Wins by KO/TKO

B_win_by_Submission, R_win_by_Submission: Wins by Submission

B_win_by_TKO_Doctor_Stoppage, R_win_by_TKO_Doctor_Stoppage: Wins by Doctor Stoppage

B_wins, R_wins: Total career wins

B_Stance, R_Stance: Fighter stance

B_Height_cms, R_Height_cms: Fighter height in cms

B_Reach_cms, R_Reach_cms: Fighter reach in cms

B_Weight_lbs, R_Weight_lbs: Fighter weight in pounds

B_age, R_age: Fighter age

lose_streak_dif: (Blue lose streak) - (Red lose streak)

win_streak_dif: (Blue win streak) - (Red win streak)

longest_win_streak_dif: (Blue longest win streak) - (Red longest win streak)

win_dif: (Blue wins) - (Red wins)

loss_dif: (Blue losses) - (Red losses)

total_round_dif: (Blue total rounds fought) - (Red total rounds fought)

total_title_bout_dif: (Blue number of title fights) - (Red number of title fights)

ko_dif: (Blue wins by KO/TKO) - (Red wins by KO/TKO)

sub_dif: (Blue wins by submission) - (Red wins by submission)

height_dif: (Blue height) - (Red height) in cms

reach_dif: (Blue reach) - (Red reach) in cms

age_dif: (Blue age) - (Red age)

sig_str_dif: (Blue sig strikes per minute) - (Red sig strikes per minute)

avg_sub_att_dif: (Blue submission attempts) - (Red submission attempts)

avg_td_dif: (Blue TD attempts) - (Red TD attempts)

empty_arena: Did this fight occur in an empty arena? (1,0)

constant_1: The number 1

B_match_weightclass_rank, R_match_weightclass_rank: Rank in the weightclass this bout takes place in

R_Women's Flyweight_rank, B_Women's Flyweight_rank: Rank in the Women's Flyweight Division

B_Women's Featherweight_rank, R_Women's Featherweight_rank: Rank in the Women's Featherweight Division

B_Women's Strawweight_rank, R_Women's Strawweight_rank: Rank in the Women's Strawweight Division

B_Women's Bantamweight_rank, R_Women's Bantamweight_rank: Rank in the Women's Bantamweight Division

B_Heavyweight_rank, R_Heavyweight_rank: Heavyweight rank

B_Light Heavyweight_rank, R_Light Heavyweight_rank: Light Heavyweight rank

B_Middleweight_rank, R_Middleweight_rank: Middleweight rank

B_Welterweight_rank, R_Welterweight_rank: Welterweight rank

B_Lightweight_rank, R_Lightweight_rank: Lightweight rank

B_Featherweight_rank, R_Featherweight_rank: Featherweight rank

B_Bantamweight_rank, R_Bantamweight_rank: Bantamweight rank

B_Flyweight_rank, R_Flyweight_rank: Flyweight rank

B_Pound-for-Pound_rank, R_Pound-for-Pound_rank: Pound-for-Pound rank

better_rank: Who has the better rank (Red, Blue, neither)

finish: How the fight finished

finish_details: More details about the finish if available.

finish_round: The round the fight ended

finish_round_time: Time in the round of the finish

total_fight_time_secs: Total time of the fight in seconds
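The link between the odds and ev columns can be made concrete. Under the usual American odds convention, positive odds give the profit on a 100 credit stake, while negative odds give the stake needed to win 100 credits. The sketch below assumes that convention. The helper name `profit_on_100` is hypothetical and not part of the dataset.

```python
def profit_on_100(american_odds: int) -> float:
    """Profit from a winning 100 credit bet at the given American odds."""
    if american_odds > 0:
        # Positive odds state the profit on a 100 credit stake directly.
        return float(american_odds)
    # Negative odds state the stake needed to win 100 credits,
    # so a 100 credit stake wins 100 * 100 / |odds|.
    return 100.0 * 100.0 / abs(american_odds)


print(round(profit_on_100(-286), 6))  # R_odds of -286
print(profit_on_100(225))             # B_odds of 225
```

These values can be compared against the R_ev and B_ev entries in the first rows of the dataset.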

## Importing libraries and dataset

In [218]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools

from pandas_profiling import ProfileReport

In [219]:
#Import in the dataset
df = pd.read_csv('./Data/ufc-master.csv')

df.head()

Unnamed: 0,R_fighter,B_fighter,R_odds,B_odds,R_ev,B_ev,date,location,country,Winner,...,R_td_attempted_bout,B_td_attempted_bout,R_td_pct_bout,B_td_pct_bout,R_sub_attempts_bout,B_sub_attempts_bout,R_pass_bout,B_pass_bout,R_rev_bout,B_rev_bout
0,Deiveson Figueiredo,Alex Perez,-286,225,34.965035,225.0,11/21/2020,"Las Vegas, Nevada, USA",USA,Red,...,,,,,,,,,,
1,Valentina Shevchenko,Jennifer Maia,-1667,850,5.9988,850.0,11/21/2020,"Las Vegas, Nevada, USA",USA,Red,...,,,,,,,,,,
2,Mike Perry,Tim Means,-150,120,66.666667,120.0,11/21/2020,"Las Vegas, Nevada, USA",USA,Blue,...,,,,,,,,,,
3,Katlyn Chookagian,Cynthia Calvillo,205,-265,205.0,37.735849,11/21/2020,"Las Vegas, Nevada, USA",USA,Red,...,,,,,,,,,,
4,Mauricio Rua,Paul Craig,150,-190,150.0,52.631579,11/21/2020,"Las Vegas, Nevada, USA",USA,Blue,...,,,,,,,,,,


# Data cleaning:

In [220]:
#See some quick info:
df.info(max_cols = 200)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4485 entries, 0 to 4484
Data columns (total 137 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   R_fighter                     4485 non-null   object 
 1   B_fighter                     4485 non-null   object 
 2   R_odds                        4485 non-null   int64  
 3   B_odds                        4485 non-null   int64  
 4   R_ev                          4485 non-null   float64
 5   B_ev                          4485 non-null   float64
 6   date                          4485 non-null   object 
 7   location                      4485 non-null   object 
 8   country                       4485 non-null   object 
 9   Winner                        4485 non-null   object 
 10  title_bout                    4485 non-null   bool   
 11  weight_class                  4485 non-null   object 
 12  gender                        4485 non-null   object 
 13  no

The per-division rank columns, such as 'B_Women's Flyweight_rank' and 'B_Women's Strawweight_rank', are redundant: the weight class of the bout and each fighter's rank in that weight class already have columns of their own. We therefore drop these columns.



In [221]:
df = df.drop(labels = df.columns[81:107],
             axis = 1)
df.info(max_cols = 150)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4485 entries, 0 to 4484
Data columns (total 111 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   R_fighter                     4485 non-null   object 
 1   B_fighter                     4485 non-null   object 
 2   R_odds                        4485 non-null   int64  
 3   B_odds                        4485 non-null   int64  
 4   R_ev                          4485 non-null   float64
 5   B_ev                          4485 non-null   float64
 6   date                          4485 non-null   object 
 7   location                      4485 non-null   object 
 8   country                       4485 non-null   object 
 9   Winner                        4485 non-null   object 
 10  title_bout                    4485 non-null   bool   
 11  weight_class                  4485 non-null   object 
 12  gender                        4485 non-null   object 
 13  no
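A positional slice such as `df.columns[81:107]` depends on the column order of this particular CSV. As an alternative, the same kind of columns can be selected by name. The sketch below uses a made-up toy frame. It assumes the per-division rank columns all start with `R_` or `B_` and end in `_rank`, and that the two `*_match_weightclass_rank` columns should be kept.

```python
import pandas as pd

# Toy frame with a handful of column names taken from the descriptions above.
toy = pd.DataFrame({
    "R_fighter": ["A"],
    "B_match_weightclass_rank": [3.0],
    "R_match_weightclass_rank": [1.0],
    "B_Heavyweight_rank": [None],
    "R_Women's Flyweight_rank": [None],
    "better_rank": ["Red"],
})

# Rank columns that should survive the drop.
keep = {"B_match_weightclass_rank", "R_match_weightclass_rank"}

# Per-division rank columns: fighter prefix, "_rank" suffix, not in the keep set.
division_rank_cols = [
    col for col in toy.columns
    if col.startswith(("R_", "B_")) and col.endswith("_rank") and col not in keep
]

trimmed = toy.drop(columns=division_rank_cols)
print(list(trimmed.columns))
```

On the full dataset it is worth checking that the resulting list has 26 entries, the number of columns removed by the positional slice above.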

Now we separate the target (y, the fight winner) from the features (X):

In [222]:
y = df['Winner']
X = df.drop(labels = 'Winner', axis = 1)

### Imputing variables:

We check how many variables have null values:

In [209]:
nullcols = df.columns[df.isnull().any(axis=0)]

len(nullcols)

41
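Beyond the count, it can help to see how much of each column is missing before choosing an imputation strategy. The sketch below shows one way to do this on a made-up toy frame. The values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy frame: two columns with gaps, one complete column.
toy = pd.DataFrame({
    "finish": ["KO/TKO", None, "SUB", "U-DEC"],
    "finish_details": [None, None, "Rear Naked Choke", None],
    "R_odds": [-286, -150, 205, 150],
})

# Share of missing values per column, largest first, complete columns removed.
null_share = toy.isnull().mean().sort_values(ascending=False)
null_share = null_share[null_share > 0]
print(null_share)
```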

With 41 columns containing missing values, choosing a separate imputation strategy for each is impractical, so we use sklearn's IterativeImputer. It models each column with missing values as a function of the other columns and predicts the missing entries. After it has gone through all columns, it repeats the cycle for up to max_iter iterations.

By default, IterativeImputer uses BayesianRidge as its estimator, a regularized linear regression that serves as an alternative to ordinary least squares (OLS).

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer

In [223]:
X_numeric = X.select_dtypes(exclude = 'object')
X_cat = X.select_dtypes(include = 'object')

### Impute numerical variables first:

In [174]:
%%time
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(estimator = None, #Default bayesian ridge
                           max_iter = 30)

X_numeric_imputed = pd.DataFrame(data = imputer.fit_transform(X_numeric),
                                 columns = X_numeric.columns)

Wall time: 1min 14s
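To see what the imputer does on a small scale, here is a minimal sketch on a made-up two-column frame. The column names and values are illustrative only. `random_state` is set just to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required before the next import
from sklearn.impute import IterativeImputer

# Toy data: two correlated columns, each with one missing entry.
toy = pd.DataFrame({
    "height": [170.0, 180.0, 190.0, 175.0, np.nan],
    "reach": [172.0, 183.0, np.nan, 178.0, 188.0],
})

# Default estimator (BayesianRidge); each column is predicted from the other.
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Only the missing cells are filled in. The observed values pass through unchanged.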




### Impute categorical variables:

In [224]:
X_cat[['finish_round_time']]

Unnamed: 0,finish_round_time
0,
1,
2,
3,
4,
...,...
4480,0:44
4481,2:01
4482,0:47
4483,5:00


In [273]:
dtseries = pd.to_datetime(X_cat['finish_round_time'])
#A Series has no strftime method of its own; datetime formatting goes through the .dt accessor:
dtseries.dt.strftime('%H:%M')

In [267]:
#First we convert the finishing time to an appropriate datatype:

print("For finish_round_time variable:", '\n')
print('Original datatype:', X_cat['finish_round_time'].dtype)

dtseries = pd.to_datetime(X_cat['finish_round_time'])
dtseries.dtype

For finish_round_time variable: 

Original datatype: object


dtype('<M8[ns]')
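Note that `pd.to_datetime` produces full timestamps, so a value like `0:44` becomes a time of day attached to a calendar date. If the goal is the elapsed time within the round, converting to seconds may be a better fit. The sketch below assumes every non-missing value has the form `M:SS`. The helper name `round_time_to_seconds` is hypothetical.

```python
import numpy as np
import pandas as pd


def round_time_to_seconds(times: pd.Series) -> pd.Series:
    """Convert 'M:SS' strings to elapsed seconds; missing values stay missing."""
    parts = times.str.split(":", expand=True)
    minutes = pd.to_numeric(parts[0], errors="coerce")
    seconds = pd.to_numeric(parts[1], errors="coerce")
    return minutes * 60 + seconds


# Sample values as they appear in finish_round_time, plus one missing entry.
sample = pd.Series([np.nan, "0:44", "2:01", "5:00"])
result = round_time_to_seconds(sample)
print(result.tolist())
```

The result is numeric, so it could be imputed alongside the other numeric columns instead of being treated as a categorical variable.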

In [178]:
X_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4485 entries, 0 to 4484
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   R_fighter          4485 non-null   object
 1   B_fighter          4485 non-null   object
 2   date               4485 non-null   object
 3   location           4485 non-null   object
 4   country            4485 non-null   object
 5   weight_class       4485 non-null   object
 6   gender             4485 non-null   object
 7   B_Stance           4485 non-null   object
 8   R_Stance           4485 non-null   object
 9   better_rank        4485 non-null   object
 10  finish             4118 non-null   object
 11  finish_details     2014 non-null   object
 12  finish_round_time  4099 non-null   object
dtypes: object(13)
memory usage: 455.6+ KB
