<a href="https://www.kaggle.com/code/thasankakandage/ufc-fight-predictor?scriptVersionId=190714939" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_event_fight_stats.csv
/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_fighters.csv
/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_events.csv


<a id="section1"></a>
# Why 'Out of' Statistics Are More Reliable Than Accuracy and Normalized Averages in this Dataset

Accuracy figures and other normalized averages can be misleading in this UFC dataset. A significant-strike accuracy of 0.2 means that 20% of a fighter's attempted significant strikes landed, but it says nothing about how many were thrown. A fighter who landed 1 significant strike out of 5 attempts and another who landed 10 out of 50 both have an accuracy of 0.2.

The volume of attempts is critical in a fight. A ratio alone discards it, so two very different performances can look identical. Keeping both the landed count and the attempt count (the 'out of' form) preserves that information, which is why this notebook prefers those columns to accuracy percentages.
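
As a quick illustration of the argument in this section, the sketch below builds two hypothetical fighters (the labels "A" and "B" and the toy frame are made up; the numbers are the 1-of-5 and 10-of-50 case) and shows that the ratio is identical even though the volume differs tenfold.

```python
import pandas as pd

# Hypothetical fighters: 1 landed out of 5 attempts vs. 10 landed out of 50
toy = pd.DataFrame({
    "fighter": ["A", "B"],
    "sig_strikes": [1, 10],
    "sig_strike_atts": [5, 50],
})

# The ratio discards the attempt count, so both rows get the same accuracy
toy["accuracy"] = toy["sig_strikes"] / toy["sig_strike_atts"]

print(toy)
```

Only the `sig_strikes` and `sig_strike_atts` columns together distinguish the two rows, which is the information an 'out of' statistic keeps.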

# Event Fights Data Cleaning and Preprocessing

## Converting 'result' Section of Data



First, remove any fights whose 'result' is 'nc' (no contest), since these ended under problematic circumstances. Then recode the column. In the original data, 'result' holds the ID of the winning fighter (the ID of fighter1 or of fighter2) or marks a draw ('D' or 'd'). After recoding, 0 means fighter1 won, 1 means fighter2 won, and 2 means a draw.
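
The recoding described here can also be written without a row-by-row `apply`. The following is only a sketch on a made-up frame (the IDs and the `result_code` column name are hypothetical), using `np.select` as an alternative technique rather than the approach this notebook takes; the `-1` fallback for unrecognized values is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical fights: f1 wins, f2 wins, a draw, and a no contest
toy = pd.DataFrame({
    "f1_id": [10, 20, 30, 40],
    "f2_id": [11, 21, 31, 41],
    "result": ["10", "21", "D", "nc"],
})

# Drop no-contest fights before recoding
toy = toy[toy["result"].str.lower() != "nc"].copy()

result = toy["result"].astype(str)
conditions = [
    result == toy["f1_id"].astype(str),  # fighter1 won
    result == toy["f2_id"].astype(str),  # fighter2 won
    result.str.lower() == "d",           # draw ('D' or 'd')
]

# 0 = fighter1, 1 = fighter2, 2 = draw, -1 = anything unrecognized (assumed sentinel)
toy["result_code"] = np.select(conditions, [0, 1, 2], default=-1)

print(toy)
```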


In [2]:
train_data = pd.read_csv('/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_event_fight_stats.csv') 



print(f"Rows removed due to fight being 'nc' = no contest: {train_data[train_data['result'] == 'nc'].shape[0]}")


train_data = train_data[train_data['result'] != 'nc']


def map_result(row):

    if row['result'] == str(row['f1_id']):
        return 0
    elif row['result'] == str(row['f2_id']):
        return 1
    elif row['result'] == 'D' or row['result'] == 'd':
        return 2
    else:
        print(row['result'])
        return None 

train_data['result'] = train_data.apply(map_result, axis=1)
print(train_data.head())


Rows removed due to fight being 'nc' = no contest: 85
   f1_id  f2_id            f1_name               f2_name  \
0   1031   1172         Evan Elder       Darrius Flowers   
1    493    648  Mayra Bueno Silva         Macy Chiasson   
2   3571    938      Anthony Smith         Roman Dolidze   
3   3045    270          Joe Pyfer  Marc-Andre Barriault   
4   1832   3502   Charles Jourdain            Jean Silva   

                weight_class  f1_knockdowns  f2_knockdowns  \
0          Welterweight Bout              0              0   
1  Women's Bantamweight Bout              0              0   
2     Light Heavyweight Bout              0              0   
3          Middleweight Bout              1              0   
4         Featherweight Bout              0              2   

   f1_sig_strike_atts  f2_sig_strike_atts  f1_sig_strikes  ...  f2_clinchs  \
0                  86                  60              45  ...           3   
1                  59                  85              4

## Converting 'weight_class' Section of Data from Categorical to Numerical

Remove rows whose weight_class is an outdated or irrelevant category, such as 'Ultimate '96 Tournament Title Bout'. These categories come from an old UFC match style with unreliable statistics, many of which are not even displayed. Open Weight and Catch Weight fights are kept, even though they are rare. Note that the encoding does not separate women's and men's weight classes: for example, 'Women's Bantamweight Bout' and 'Bantamweight Bout' receive the same code.
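
One way to keep a substring-based encoding like this easy to audit is a lookup table checked from the longest name to the shortest, so that "Light Heavyweight" is always tested before "Heavyweight". The sketch below is an illustration under that assumption; `WEIGHT_CLASS_CODES` and `encode_weight_class` are hypothetical names, and the codes simply mirror the ones used in this notebook (4 is not assigned).

```python
# Hypothetical lookup table; the codes mirror this notebook's mapping
WEIGHT_CLASS_CODES = {
    "Strawweight": 0,
    "Flyweight": 1,
    "Bantamweight": 2,
    "Featherweight": 3,
    "Lightweight": 5,
    "Welterweight": 6,
    "Middleweight": 7,
    "Light Heavyweight": 8,
    "Heavyweight": 9,
    "Catch Weight": 10,
    "Open Weight": 11,
}


def encode_weight_class(label):
    """Return the numeric code for a weight_class label, or None if unrecognized."""
    # Longest names first, so a longer name is never shadowed by a shorter substring
    for name in sorted(WEIGHT_CLASS_CODES, key=len, reverse=True):
        if name in label:
            return WEIGHT_CLASS_CODES[name]
    return None


for label in ["Women's Bantamweight Bout", "Light Heavyweight Bout",
              "Heavyweight Bout", "UFC Superfight Championship Bout"]:
    print(label, "->", encode_weight_class(label))
```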

In [3]:
def map_weight_class(row):
    if "Strawweight" in row['weight_class']:
        return 0
    elif "Flyweight" in row['weight_class']:
        return 1
    elif "Bantamweight" in row['weight_class']:
        return 2
    elif "Featherweight" in row['weight_class']:
        return 3
    # Note: code 4 is not assigned to any weight class
    elif "Lightweight" in row['weight_class']:
        return 5
    elif "Welterweight" in row['weight_class']:
        return 6
    elif "Middleweight" in row['weight_class']:
        return 7
    elif "Light Heavyweight" in row['weight_class']:
        return 8
    elif "Heavyweight" in row['weight_class']:
        return 9
    elif "Catch Weight" in row['weight_class']:
        return 10
    elif "Open Weight" in row['weight_class']:
        return 11
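    else:
        # Unrecognized category: log it and return None so the row can be dropped later.
        # row.get avoids a KeyError when the row has no 'fights_url' column.
        print(row['weight_class'], row.get('fights_url'))
        return None
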


train_data['weight_class'] = train_data.apply(map_weight_class, axis=1)
# Remove rows whose 'weight_class' could not be mapped; these UFC fights are very old anyway
train_data = train_data[train_data['weight_class'].notna()]

print(train_data.head())

UFC Superfight Championship Bout http://www.ufcstats.com/fight-details/6a060498e60756af
UFC Superfight Championship Bout http://www.ufcstats.com/fight-details/16b4a0b06427f1ac
Ultimate Ultimate '95 Tournament Title Bout http://www.ufcstats.com/fight-details/524b49a676498c6d
UFC Superfight Championship Bout http://www.ufcstats.com/fight-details/d62aec55bc142346
UFC 6 Tournament Title Bout http://www.ufcstats.com/fight-details/66c029f49d3c5da6
UFC 5 Tournament Title Bout http://www.ufcstats.com/fight-details/6ca94b35719eb300
UFC Superfight Championship Bout http://www.ufcstats.com/fight-details/db8df615610f3632
UFC 3 Tournament Title Bout http://www.ufcstats.com/fight-details/323f543eb8abdb36
UFC 2 Tournament Title Bout http://www.ufcstats.com/fight-details/00835554f95fa911
UFC 7 Tournament Title Bout http://www.ufcstats.com/fight-details/e7ca291dd8f5661b
   f1_id  f2_id            f1_name               f2_name  weight_class  \
0   1031   1172         Evan Elder       Darrius Flowers    

In [4]:


null_counts_per_column = train_data.isnull().sum()

print("Null values per column:")
print(null_counts_per_column)

# These null control times usually appear as "--" on a UFC Stats page, so treat them as 0

train_data['f1_ctrl_time'] = train_data['f1_ctrl_time'].fillna(0)
train_data['f2_ctrl_time'] = train_data['f2_ctrl_time'].fillna(0)

print("Null values per column after replacement:")
print(train_data.isnull().sum())

print(f"Number of duplicate rows: {train_data.duplicated().sum()}")


Null values per column:
f1_id                    0
f2_id                    0
f1_name                  0
f2_name                  0
weight_class             0
f1_knockdowns            0
f2_knockdowns            0
f1_sig_strike_atts       0
f2_sig_strike_atts       0
f1_sig_strikes           0
f2_sig_strikes           0
f1_tot_strike_atts       0
f2_tot_strike_atts       0
f1_tot_strikes           0
f2_tot_strikes           0
f1_takedown_atts         0
f2_takedown_atts         0
f1_takedowns             0
f2_takedowns             0
f1_submissions           0
f2_submissions           0
f1_reversals             0
f2_reversals             0
f1_ctrl_time           123
f2_ctrl_time           123
f1_head_strike_atts      0
f2_head_strike_atts      0
f1_head_strikes          0
f2_head_strikes          0
f1_body_strike_atts      0
f2_body_strike_atts      0
f1_body_strikes          0
f2_body_strikes          0
f1_leg_strike_atts       0
f2_leg_strike_atts       0
f1_leg_strikes           0
f2_l

# Fighter Data Cleaning and Preprocessing

## Removing the Following Statistics Due to Excessive Null Values: Fighter Reach, Fighter Stance, Fighter Weight (Weight Class Will be Used Instead)
Although Fighter Date of Birth and Fighter Height also have many null values, I MIGHT fill them with the mean or median age and height instead of dropping them. Age and height are extremely important stats, and enough values are present that a mean or median can be computed directly.

Fighter Reach had far more null values than the other columns, so I omitted it. Fighter Stance is either orthodox or southpaw, so it offers little variation and could negatively impact predictions; I believe it is better to omit that column as well.
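
The mean/median idea for height and age is not implemented in this notebook. A minimal sketch of what it could look like, on a made-up frame, follows; the values, the `age_years` column name, and the fixed reference date are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical fighters with one missing height and one missing date of birth
toy = pd.DataFrame({
    "fighter_height_cm": [180.0, None, 170.0, 175.0],
    "fighter_dob": ["1990-01-01", None, "1985-06-15", "1995-03-20"],
})

# Median of the observed heights (NaN values are skipped by default)
height_median = toy["fighter_height_cm"].median()
toy["fighter_height_cm"] = toy["fighter_height_cm"].fillna(height_median)

# Age in years at an assumed reference date; missing birth dates become NaN first
reference_date = pd.Timestamp("2024-01-01")
dob = pd.to_datetime(toy["fighter_dob"])
toy["age_years"] = (reference_date - dob).dt.days / 365.25

# Fill the missing ages with the median of the known ages
toy["age_years"] = toy["age_years"].fillna(toy["age_years"].median())

print(toy)
```

The median is used here rather than the mean because it is less sensitive to a few extreme values; either choice replaces a missing value with a typical one and does not recover the true figure.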

In [5]:
fighter_data = pd.read_csv('/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_fighters.csv')
fighter_data = fighter_data.drop(columns=['fighter_weight_lbs', 'fighter_reach_cm', 'fighter_stance'])
print(fighter_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4224 entries, 0 to 4223
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fighter_id            4224 non-null   int64  
 1   fighter_name          4224 non-null   object 
 2   fighter_dob           3453 non-null   object 
 3   fighter_height_cm     3855 non-null   float64
 4   fighter_wins          4224 non-null   int64  
 5   fighter_losses        4224 non-null   int64  
 6   fighter_draws         4224 non-null   int64  
 7   fighter_slpm          4224 non-null   float64
 8   fighter_str_acc_%     4224 non-null   float64
 9   fighter_sapm          4224 non-null   float64
 10  fighter_str_def_%     4224 non-null   float64
 11  fighter_td_avg        4224 non-null   float64
 12  fighter_td_acc_%      4224 non-null   float64
 13  fighter_td_def_%      4224 non-null   float64
 14  fighter_sub_avg       4224 non-null   float64
 15  fighter_url          

In [6]:

duplicate_names = fighter_data[fighter_data.duplicated('fighter_name', keep=False)]

print(f"Number of duplicate fighter names: {duplicate_names['fighter_name'].value_counts().sum()}")

print(duplicate_names)

Number of duplicate fighter names: 12
      fighter_id      fighter_name fighter_dob  fighter_height_cm  \
844          845        Mike Davis         NaN                NaN   
849          850        Mike Davis  1992-10-07             182.88   
1341        1342        Joey Gomez  1986-07-21             177.80   
1343        1344        Joey Gomez  1989-08-29             177.80   
1784        1785      Tony Johnson  1983-05-02             187.96   
1792        1793      Tony Johnson         NaN             185.42   
2389        2390  Michael McDonald  1965-02-06             180.34   
2391        2392  Michael McDonald  1991-01-15             175.26   
3502        3503        Jean Silva  1977-10-08             167.64   
3516        3517       Bruno Silva  1990-03-16             162.56   
3517        3518       Bruno Silva  1989-07-13             182.88   
3527        3528        Jean Silva  1996-12-27             170.18   

      fighter_wins  fighter_losses  fighter_draws  fighter_slpm 

In [7]:

null_counts_per_column = fighter_data.isnull().sum()

print("Null values per column:")
print(null_counts_per_column)

fighter_data['avg_ctrl_time'] = fighter_data['avg_ctrl_time'].fillna(0)
print("Null values per column after replacement:")
print(fighter_data.isnull().sum())

print(f"Number of duplicate rows: {fighter_data.duplicated().sum()}")


Null values per column:
fighter_id                0
fighter_name              0
fighter_dob             771
fighter_height_cm       369
fighter_wins              0
fighter_losses            0
fighter_draws             0
fighter_slpm              0
fighter_str_acc_%         0
fighter_sapm              0
fighter_str_def_%         0
fighter_td_avg            0
fighter_td_acc_%          0
fighter_td_def_%          0
fighter_sub_avg           0
fighter_url               0
avg_knockdowns            0
avg_sig_strike_atts       0
avg_sig_strikes           0
avg_tot_strike_atts       0
avg_tot_strikes           0
avg_takedown_atts         0
avg_takedowns             0
avg_clinch_atts           0
avg_clinchs               0
avg_ctrl_time           142
avg_total_fight_time      0
avg_submissions           0
avg_reversals             0
avg_head_strike_atts      0
avg_head_strikes          0
avg_body_strike_atts      0
avg_body_strikes          0
avg_leg_strike_atts       0
avg_leg_strikes         

## Integrating Some Fighter Career Stats

Each fighter's stat page lists career stats. Some of these will be added to the fighter's stats for each fight during model training.

These career stats are:

- SLpM : Significant Strikes Landed per Minute
- Str. Acc. : Significant Striking Accuracy
- SApM : Significant Strikes Absorbed per Minute
- Str. Def. : Significant Strike Defense (the % of opponents' strikes that did not land)
- TD Avg. : Average Takedowns Landed per 15 minutes
- TD Acc. : Takedown Accuracy
- TD Def. : Takedown Defense (the % of opponents' TD attempts that did not land)
- Sub. Avg. : Average Submissions Attempted per 15 minutes
- Record : Wins-Losses-Draws

Out of these stats, I will only use SLpM, SApM, and the W-L-D record. The others are dropped for the following reasons:

- Str. Acc. - [Reasoning](#section1)
- Str. Def. - [Reasoning](#section1)
- TD Avg. - Already covered by the average takedown attempts and average takedowns in the ufc_fighters.csv file
- TD Acc. - [Reasoning](#section1)
- TD Def. - [Reasoning](#section1)
- Sub. Avg. - Already covered by the average submission attempts and average submissions in the ufc_fighters.csv file
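
Attaching the kept career stats (SLpM, SApM, and the record) to each fight row amounts to a lookup by fighter ID for both fighters. A sketch of that lookup as two left merges is shown here on made-up data; the helper name `attach_career_stats` and all values are hypothetical, and this is an alternative formulation rather than the loop used in this notebook.

```python
import pandas as pd

# Hypothetical fighter table with the career stats that are kept
fighters = pd.DataFrame({
    "fighter_id": [1, 2, 3],
    "fighter_slpm": [3.1, 4.2, 2.5],
    "fighter_sapm": [2.0, 3.3, 4.1],
    "fighter_wins": [10, 5, 7],
    "fighter_losses": [2, 1, 3],
    "fighter_draws": [0, 1, 0],
})

# Hypothetical fights; fighter 99 is deliberately missing from the fighter table
fights = pd.DataFrame({"f1_id": [1, 3], "f2_id": [2, 99]})

CAREER_COLS = ["fighter_slpm", "fighter_sapm", "fighter_wins",
               "fighter_losses", "fighter_draws"]


def attach_career_stats(fights, fighters, corner):
    """Left-merge career stats for one corner ('f1' or 'f2') onto the fights."""
    # fighter_id -> f1_id, fighter_slpm -> f1_slpm, and so on
    renamed = fighters[["fighter_id"] + CAREER_COLS].rename(
        columns=lambda c: c.replace("fighter_", f"{corner}_")
    )
    # how="left" keeps every fight; unmatched fighters get NaN stats
    return fights.merge(renamed, on=f"{corner}_id", how="left")


merged = attach_career_stats(fights, fighters, "f1")
merged = attach_career_stats(merged, fighters, "f2")

print(merged)
```

With a left merge, a fighter ID that has no match produces NaN in all five stat columns for that corner, so missing fighters are easy to count afterwards with `isnull().sum()`.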

In [8]:
fighter_data = fighter_data.drop(columns=['fighter_str_acc_%', 'fighter_str_def_%', 'fighter_td_avg', 'fighter_td_acc_%',
                                         'fighter_td_def_%', 'fighter_sub_avg', 'fighter_url'])
print(fighter_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4224 entries, 0 to 4223
Data columns (total 32 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fighter_id            4224 non-null   int64  
 1   fighter_name          4224 non-null   object 
 2   fighter_dob           3453 non-null   object 
 3   fighter_height_cm     3855 non-null   float64
 4   fighter_wins          4224 non-null   int64  
 5   fighter_losses        4224 non-null   int64  
 6   fighter_draws         4224 non-null   int64  
 7   fighter_slpm          4224 non-null   float64
 8   fighter_sapm          4224 non-null   float64
 9   avg_knockdowns        4224 non-null   float64
 10  avg_sig_strike_atts   4224 non-null   float64
 11  avg_sig_strikes       4224 non-null   float64
 12  avg_tot_strike_atts   4224 non-null   float64
 13  avg_tot_strikes       4224 non-null   float64
 14  avg_takedown_atts     4224 non-null   float64
 15  avg_takedowns        

In [9]:
import numpy as np


train_data['f1_slpm'] = np.nan
train_data['f2_slpm'] = np.nan
train_data['f1_sapm'] = np.nan
train_data['f2_sapm'] = np.nan
train_data['f1_wins'] = np.nan
train_data['f2_wins'] = np.nan
train_data['f1_losses'] = np.nan
train_data['f2_losses'] = np.nan
train_data['f1_draws'] = np.nan
train_data['f2_draws'] = np.nan


f1_stats_list = []
f2_stats_list = []


for index, row in train_data.iterrows():
    f1_id, f2_id = row['f1_id'], row['f2_id']

    f1_stats = fighter_data[fighter_data['fighter_id'] == f1_id]
    f2_stats = fighter_data[fighter_data['fighter_id'] == f2_id]
    
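    if not f1_stats.empty:
        f1_slpm, f1_sapm, f1_wins, f1_losses, f1_draws = f1_stats[['fighter_slpm', 'fighter_sapm', 'fighter_wins', 'fighter_losses', 'fighter_draws']].values.flatten()
    else:
        # No matching fighter: mark all five career stats as missing
        # (otherwise the record values would be undefined or carried over from the previous row)
        f1_slpm, f1_sapm, f1_wins, f1_losses, f1_draws = [np.nan] * 5
    
    if not f2_stats.empty:
        f2_slpm, f2_sapm, f2_wins, f2_losses, f2_draws = f2_stats[['fighter_slpm', 'fighter_sapm', 'fighter_wins', 'fighter_losses', 'fighter_draws']].values.flatten()
    else:
        # Same handling for fighter 2
        f2_slpm, f2_sapm, f2_wins, f2_losses, f2_draws = [np.nan] * 5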

    f1_stats_list.append([f1_slpm, f1_sapm])
    f2_stats_list.append([f2_slpm, f2_sapm])


    train_data.at[index, 'f1_slpm'] = f1_slpm
    train_data.at[index, 'f2_slpm'] = f2_slpm
    train_data.at[index, 'f1_sapm'] = f1_sapm
    train_data.at[index, 'f2_sapm'] = f2_sapm
    train_data.at[index, 'f1_wins'] = f1_wins
    train_data.at[index, 'f2_wins'] = f2_wins
    train_data.at[index, 'f1_losses'] = f1_losses
    train_data.at[index, 'f2_losses'] = f2_losses
    train_data.at[index, 'f1_draws'] = f1_draws
    train_data.at[index, 'f2_draws'] = f2_draws

print(train_data.head())
print(train_data.info())
print(train_data.isnull().sum())

   f1_id  f2_id            f1_name               f2_name  weight_class  \
0   1031   1172         Evan Elder       Darrius Flowers           6.0   
1    493    648  Mayra Bueno Silva         Macy Chiasson           2.0   
2   3571    938      Anthony Smith         Roman Dolidze           8.0   
3   3045    270          Joe Pyfer  Marc-Andre Barriault           7.0   
4   1832   3502   Charles Jourdain            Jean Silva           3.0   

   f1_knockdowns  f2_knockdowns  f1_sig_strike_atts  f2_sig_strike_atts  \
0              0              0                  86                  60   
1              0              0                  59                  85   
2              0              0                  93                 185   
3              1              0                  10                   5   
4              0              2                  46                  55   

   f1_sig_strikes  ...  f1_slpm  f2_slpm  f1_sapm  f2_sapm  f1_wins  f2_wins  \
0              45  ...  

Move the fight-time, result, and URL columns to the end so the DataFrame is easier to read.

In [10]:

columns_to_move = ['f1_total_fight_time', 'f2_total_fight_time', 'result', 'fights_url', 'event_url']


all_columns = train_data.columns.tolist()

columns_first = [col for col in all_columns if col not in columns_to_move]

new_column_order = columns_first + columns_to_move

train_data = train_data[new_column_order]

print(train_data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 7570 entries, 0 to 7664
Data columns (total 64 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   f1_id                7570 non-null   int64  
 1   f2_id                7570 non-null   int64  
 2   f1_name              7570 non-null   object 
 3   f2_name              7570 non-null   object 
 4   weight_class         7570 non-null   float64
 5   f1_knockdowns        7570 non-null   int64  
 6   f2_knockdowns        7570 non-null   int64  
 7   f1_sig_strike_atts   7570 non-null   int64  
 8   f2_sig_strike_atts   7570 non-null   int64  
 9   f1_sig_strikes       7570 non-null   int64  
 10  f2_sig_strikes       7570 non-null   int64  
 11  f1_tot_strike_atts   7570 non-null   int64  
 12  f2_tot_strike_atts   7570 non-null   int64  
 13  f1_tot_strikes       7570 non-null   int64  
 14  f2_tot_strikes       7570 non-null   int64  
 15  f1_takedown_atts     7570 non-null   int64 

In [11]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

X = train_data.drop(columns=['f1_id', 'f2_id', 'f1_name', 'f2_name', 'fights_url', 'event_url', 'result'])
print(X.info())
print(X.head())


y = train_data['result']
print(y.info())
print(y.head())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


<class 'pandas.core.frame.DataFrame'>
Index: 7570 entries, 0 to 7664
Data columns (total 57 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   weight_class         7570 non-null   float64
 1   f1_knockdowns        7570 non-null   int64  
 2   f2_knockdowns        7570 non-null   int64  
 3   f1_sig_strike_atts   7570 non-null   int64  
 4   f2_sig_strike_atts   7570 non-null   int64  
 5   f1_sig_strikes       7570 non-null   int64  
 6   f2_sig_strikes       7570 non-null   int64  
 7   f1_tot_strike_atts   7570 non-null   int64  
 8   f2_tot_strike_atts   7570 non-null   int64  
 9   f1_tot_strikes       7570 non-null   int64  
 10  f2_tot_strikes       7570 non-null   int64  
 11  f1_takedown_atts     7570 non-null   int64  
 12  f2_takedown_atts     7570 non-null   int64  
 13  f1_takedowns         7570 non-null   int64  
 14  f2_takedowns         7570 non-null   int64  
 15  f1_submissions       7570 non-null   int64 

# EDA
While UFC stats allow an almost endless range of analyses, I will focus only on the ones I believe are most important.

# Model Choice


## SVM

In [12]:

# from sklearn.svm import SVC

# svm_model = SVC(kernel='linear', probability = True)  
# svm_model.fit(X_train, y_train)


# y_pred = svm_model.predict(X_test)

# print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')


## Logistic Regression

In [13]:
# from sklearn.linear_model import LogisticRegression


# log_reg_model = LogisticRegression()
# log_reg_model.fit(X_train, y_train)


# y_pred = log_reg_model.predict(X_test)


# print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')

## Random Forest Classifier

In [14]:
from sklearn.ensemble import RandomForestClassifier


rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)


y_pred = rf_model.predict(X_test)


print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')

Accuracy: 0.87


## XGBoost

In [15]:
# import xgboost as xgb


# xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
# xgb_model.fit(X_train, y_train)


# y_pred = xgb_model.predict(X_test)


# print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')


# Testing

To test, enter two fighters' names and the weight class they will be fighting in.
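
Because a name lookup can match zero or several fighters, it may help to check the match count before reading any stats. The helper below is a hypothetical sketch (the name `find_fighter` and the toy rows are not part of this notebook) that raises an error instead of silently using the first match.

```python
import pandas as pd


def find_fighter(fighters, name):
    """Return the single row for `name`, or raise if it is missing or ambiguous."""
    matches = fighters[fighters["fighter_name"] == name]
    if matches.empty:
        raise ValueError(f"No fighter named {name!r}")
    if len(matches) > 1:
        raise ValueError(f"{len(matches)} fighters share the name {name!r}")
    return matches.iloc[0]


# Hypothetical fighter table in which one name appears twice
toy = pd.DataFrame({
    "fighter_id": [1, 2, 3],
    "fighter_name": ["Fighter A", "Fighter B", "Fighter B"],
})

print(find_fighter(toy, "Fighter A")["fighter_id"])

try:
    find_fighter(toy, "Fighter B")
except ValueError as err:
    print(err)
```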

In [16]:


# DON'T USE THESE NAMES! Each of these names is shared with at least one other fighter
# An age input will be added later in case a duplicate name is chosen

# Mike Davis 
# Joey Gomez 
# Tony Johnson    
# Michael McDonald  
# Jean Silva    
# Bruno Silva

# these are your three inputs
#################################
fighter1_name = "Leon Edwards"
fighter2_name = "Belal Muhammad"
weight_class = "Welterweight"
#################################

df = pd.DataFrame({
    'weight_class': [weight_class]
})
df['weight_class_value'] = df.apply(map_weight_class, axis=1)
weight_class_value = df['weight_class_value'].iloc[0]
print(f'Weight class value for "{weight_class}": {weight_class_value}')


fighter1_data = fighter_data[fighter_data["fighter_name"] == fighter1_name]
fighter2_data = fighter_data[fighter_data["fighter_name"] == fighter2_name]
training_columns = X.columns.tolist()
print(X.info())

non_average_stats = ['fighter_wins', 'fighter_losses', 'fighter_draws', 'fighter_slpm', 'fighter_sapm']
features = []

for column in training_columns:

    if column.startswith('f1_'): 
        if column.replace("f1_", "fighter_") not in non_average_stats:
            feature = column.replace("f1_", "avg_")
            if feature not in features:
                features.append(feature)
        else:
            feature = column.replace("f1_", "fighter_")
            if feature not in features:
                features.append(feature)
    elif column.startswith('f2_'): 
        if column.replace("f2_", "fighter_") not in non_average_stats:
            feature = column.replace("f2_", "avg_")
        else:
            feature = column.replace("f2_", "fighter_")
        # f1_ and f2_ columns map to the same fighter-level feature, so only add unseen ones
        if feature not in features:
            features.append(feature)


fighter1_features = fighter1_data[features].columns.tolist()
fighter2_features = fighter2_data[features].columns.tolist()


fighter1_data_list = [(f'f1_{col_name}', fighter1_data[col_name].values[0]) for col_name in fighter1_features]
fighter2_data_list = [(f'f2_{col_name}', fighter2_data[col_name].values[0]) for col_name in fighter2_features]


interleaved_feature_values_list = [val for pair in zip(fighter1_data_list, fighter2_data_list) for val in pair]


for col_name, value in interleaved_feature_values_list:
    print(f"{col_name}: {value}")


fighter1_data_filtered = fighter1_data[features]
fighter2_data_filtered = fighter2_data[features]

    
fighter1_array = fighter1_data_filtered.values.flatten()
fighter2_array = fighter2_data_filtered.values.flatten()

interleaved_data = np.empty(len(X.columns.tolist()))

interleaved_data[0] = weight_class_value


interleaved_data[1::2] = fighter1_array
interleaved_data[2::2] = fighter2_array
interleaved_data = interleaved_data.reshape(1, -1)

single_input = scaler.transform(interleaved_data)
                                
    
    
    
    
# # SVM Model
# prediction_probs = svm_model.predict_proba(single_input)
# single_pred = svm_model.predict(single_input)
# prediction_probability = prediction_probs[0][single_pred[0]]

# # Logistic Regression Model
# prediction_probs = log_reg_model.predict_proba(single_input)  
# single_pred = log_reg_model.predict(single_input)  
# prediction_probability = prediction_probs[0][single_pred[0]] 

# Random Forest Classifier Model - best so far
prediction_probs = rf_model.predict_proba(single_input)  
single_pred = rf_model.predict(single_input)  
prediction_probability = prediction_probs[0][single_pred[0]]  

# # XGBoost Model
# prediction_probs = xgb_model.predict_proba(single_input)  
# single_pred = xgb_model.predict(single_input)  
# prediction_probability = prediction_probs[0][single_pred[0]]  



 
    
if single_pred[0] == 0:
    predicted_winner = fighter1_name
elif single_pred[0] == 1:
    predicted_winner = fighter2_name
elif single_pred[0] == 2:
    predicted_winner = "Draw"
else:
    print(single_pred[0])
    predicted_winner = "Unknown"

print(f'Predicted Winner: {predicted_winner}')
print(f'Probability: {prediction_probability:.2f}')


Weight class value for "Welterweight": 6
<class 'pandas.core.frame.DataFrame'>
Index: 7570 entries, 0 to 7664
Data columns (total 57 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   weight_class         7570 non-null   float64
 1   f1_knockdowns        7570 non-null   int64  
 2   f2_knockdowns        7570 non-null   int64  
 3   f1_sig_strike_atts   7570 non-null   int64  
 4   f2_sig_strike_atts   7570 non-null   int64  
 5   f1_sig_strikes       7570 non-null   int64  
 6   f2_sig_strikes       7570 non-null   int64  
 7   f1_tot_strike_atts   7570 non-null   int64  
 8   f2_tot_strike_atts   7570 non-null   int64  
 9   f1_tot_strikes       7570 non-null   int64  
 10  f2_tot_strikes       7570 non-null   int64  
 11  f1_takedown_atts     7570 non-null   int64  
 12  f2_takedown_atts     7570 non-null   int64  
 13  f1_takedowns         7570 non-null   int64  
 14  f2_takedowns         7570 non-null   int64  
 15  f1



# To Add
- Add further data preprocessing, e.g., checking whether certain rows have too many 0's, and adding an age input for when the user enters a name belonging to multiple fighters
- Test the models further and try out other models
- Model tuning
- Implement grid search
- Further data analysis
- Add weight and age to the training/testing data
- Test using the median instead of the mean, as some UFC fights have extreme and highly variable outcomes, leading to outliers
- Remove very old UFC data; the plan is to drop fights whose format differs from (5-5-5) or (5-5-5-5-5)