<a href="https://www.kaggle.com/code/thasankakandage/ufc-fight-predictor?scriptVersionId=191370800" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<a id="section1"></a>
# Why 'Out of' Statistics Are More Reliable Than Accuracy and Normalized Averages in this Dataset

Using averages to evaluate performance in this UFC dataset can be problematic. For example, a significant-strike accuracy of 0.2 means that 20% of a fighter's significant strike attempts landed. On its own, this figure can be very misleading: a fighter who landed 1 significant strike out of 5 attempts and another who landed 10 out of 50 both have the same accuracy of 0.2.


The volume of attempts is critical in a fight. Averages alone do not account for the number of attempts, which can greatly influence the performance of fighters. Using normalized averages alone can oversimplify the data and fail to capture the true effectiveness or consistency of a fighter's performance.
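
To make the point concrete, here is a minimal sketch in plain Python. The two stat lines are the hypothetical fighters from the example above (1 of 5 and 10 of 50), not rows from the dataset, and `accuracy` is an illustrative helper name.

```python
# Hypothetical (landed, attempted) pairs from the example above, not real dataset rows.
fighter_a = (1, 5)
fighter_b = (10, 50)


def accuracy(landed, attempted):
    """Share of attempts that landed; 0.0 when there were no attempts."""
    return landed / attempted if attempted else 0.0


for name, (landed, attempted) in [("A", fighter_a), ("B", fighter_b)]:
    print(f"Fighter {name}: accuracy={accuracy(landed, attempted):.2f}, "
          f"landed={landed}, attempted={attempted}")

# The ratio alone cannot tell the two fighters apart ...
same_ratio = accuracy(*fighter_a) == accuracy(*fighter_b)
# ... while the raw 'out of' pair keeps the volume information.
same_volume = fighter_a == fighter_b
```

Both fighters end up with the same ratio, yet the raw pairs differ, which is the information the 'out of' columns preserve.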

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_fighters_avg.csv
/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_event_fight_stats.csv
/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_fighters_median.csv
/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_fighters.csv
/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_events.csv


# Event Fights Data Cleaning and Preprocessing

## Converting 'result' Section of Data



This step reformats the 'result' column. In the original data, 'result' holds the ID of fighter1 if fighter1 won, the ID of fighter2 if fighter2 won, 'D' or 'd' for a draw, or 'nc' for a no contest (a fight that ended under problematic circumstances). The cell below maps a fighter1 win to 0 and a fighter2 win to 1 and leaves every other value unchanged; draws and no contests are removed in the next section.


In [2]:
train_data = pd.read_csv('/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_event_fight_stats.csv') 

def map_result(row):
    # 0 = fighter1 won, 1 = fighter2 won; any other value ('d', 'nc', ...) is returned unchanged
    if row['result'] == str(row['f1_id']):
        return 0
    elif row['result'] == str(row['f2_id']):
        return 1
    else:
        return row['result']

train_data['result'] = train_data.apply(map_result, axis=1)

print(train_data.info())
print(train_data.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7715 entries, 0 to 7714
Data columns (total 58 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   f1_id                7715 non-null   int64  
 1   f2_id                7715 non-null   int64  
 2   f1_name              7715 non-null   object 
 3   f2_name              7715 non-null   object 
 4   weight_class         7715 non-null   object 
 5   f1_age_during        7656 non-null   float64
 6   f2_age_during        7564 non-null   float64
 7   f1_height_cm         7708 non-null   float64
 8   f2_height_cm         7703 non-null   float64
 9   f1_knockdowns        7715 non-null   int64  
 10  f2_knockdowns        7715 non-null   int64  
 11  f1_sig_strike_atts   7715 non-null   int64  
 12  f2_sig_strike_atts   7715 non-null   int64  
 13  f1_sig_strikes       7715 non-null   int64  
 14  f2_sig_strikes       7715 non-null   int64  
 15  f1_tot_strike_atts   7715 non-null   i

## Balancing the Dataset

Rows where the result is neither 0 nor 1 are deleted. These are fights that ended in a draw or a no contest, which are rare (144 of 7,715 rows).

In [3]:
print(f"Number of rows in training data: {train_data.shape[0]}")
print(f'Number of rows where result is not 0 or 1: {train_data[~train_data["result"].isin([0, 1])].shape[0]}')
train_data = train_data[train_data['result'].isin([0, 1])].copy()  # copy so later assignments do not act on a view
print(f"Number of rows in training data after deletion: {train_data.shape[0]}")

# changing data type of result column from object to int
# before deleting the rows with result as neither 0 or 1, result can be "d" or "nc" which explains why the data type is "object" before changing
train_data['result'] = train_data['result'].astype('int64')





Number of rows in training data: 7715
Number of rows where result is not 0 or 1: 144
Number of rows in training data after deletion: 7571


The dataset has more rows where f1 won (result = 0) than rows where f2 won (result = 1). To balance it, 1150 randomly chosen rows with result = 0 have their f1 and f2 stats swapped and their result changed from 0 to 1.

In [4]:
# TODO: instead of a static value like 1150, derive n from the difference between the class counts

print(f'Number of rows: {train_data.shape[0]}')
print(f'Number of rows with result 0 (f1 won): {train_data[train_data["result"] == 0].shape[0]}')
print(f'Number of rows with result 1 (f2 won): {train_data[train_data["result"] == 1].shape[0]}')

rows_to_modify = train_data[train_data['result'] == 0].sample(n=1150, random_state=42).index
columns_to_swap = [col for col in train_data.columns if col.startswith('f1_') or col.startswith('f2_')]


f1_columns = [col for col in columns_to_swap if col.startswith('f1_')]
f2_columns = [col for col in columns_to_swap if col.startswith('f2_')]


assert len(f1_columns) == len(f2_columns), "Mismatch between f1 and f2 columns"

show_sample = False
for idx in rows_to_modify:
    if not show_sample:
        print(train_data.loc[idx])
    
    for f1_col, f2_col in zip(f1_columns, f2_columns):
        
        train_data.at[idx, f1_col], train_data.at[idx, f2_col] = train_data.at[idx, f2_col], train_data.at[idx, f1_col]
    
    train_data.at[idx, 'result'] = 1
    if not show_sample:
        print(train_data.loc[idx])
        show_sample = True
    

print(f'Number of rows with result 0 (f1 won) after modification: {train_data[train_data["result"] == 0].shape[0]}')
print(f'Number of rows with result 1 (f2 won) after modification: {train_data[train_data["result"] == 1].shape[0]}')

Number of rows: 7571
Number of rows with result 0 (f1 won): 4913
Number of rows with result 1 (f2 won): 2658
f1_id                                                               3796
f2_id                                                               3723
f1_name                                                     David Teymur
f2_name                                                  Martin Svensson
weight_class                                            Lightweight Bout
f1_age_during                                                       26.0
f2_age_during                                                       30.0
f1_height_cm                                                      175.26
f2_height_cm                                                      185.42
f1_knockdowns                                                          1
f2_knockdowns                                                          0
f1_sig_strike_atts                                                    62
f2_sig_strike_a
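
The TODO in the cell above (deriving `n` from the class counts rather than hard-coding it) could be sketched as follows. `rows_to_flip` is a hypothetical helper, and the toy series stands in for `train_data['result']`, built from the class counts printed above. Flipping half of the class gap leaves both classes equal, or off by one when the gap is odd.

```python
import pandas as pd


def rows_to_flip(results: pd.Series) -> int:
    """Number of result == 0 rows to flip to 1 so both classes end up (almost) equal."""
    n_f1_wins = int((results == 0).sum())
    n_f2_wins = int((results == 1).sum())
    # Each flipped row removes one from class 0 and adds one to class 1,
    # so only half of the gap needs to be flipped.
    return max(n_f1_wins - n_f2_wins, 0) // 2


# Toy stand-in for train_data['result'] with the class counts printed above.
toy_results = pd.Series([0] * 4913 + [1] * 2658)
n = rows_to_flip(toy_results)
print(n)  # 1127
```

In the notebook, the returned value could take the place of the literal passed to `sample(n=..., random_state=42)`.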

## Converting 'weight_class' Section of Data from Categorical to Numerical

Rows whose weight_class is an outdated or irrelevant category, such as 'Ultimate Ultimate '96 Tournament Title Bout', are removed. These categories come from an old UFC match format with unreliable statistics, many of which are not even displayed. Open Weight and Catch Weight fights are kept, even though they are rare. The mapping below also returns None for any bout whose weight class contains "Women's" or "Strawweight", so those rows are removed as well.

In [5]:
def map_weight_class(row):
    if "Women's" in row['weight_class'] or "Strawweight" in row['weight_class']:
        return None
    elif "Flyweight" in row['weight_class']:
        return 0
    elif "Bantamweight" in row['weight_class']:
        return 1
    elif "Featherweight" in row['weight_class']:
        return 2
    # note: code 3 is not assigned to any weight class
    elif "Lightweight" in row['weight_class']:
        return 4
    elif "Welterweight" in row['weight_class']:
        return 5
    elif "Middleweight" in row['weight_class']:
        return 6
    elif "Light Heavyweight" in row['weight_class']:
        return 7
    elif "Heavyweight" in row['weight_class']:
        return 8
    elif "Catch Weight" in row['weight_class']:
        return 9
    elif "Open Weight" in row['weight_class']:
        return 10
    else:
        print(row['weight_class'], row['fights_url'])
        return None
   


train_data['weight_class'] = train_data.apply(map_weight_class, axis=1)
# Remove rows where 'weight_class' is something different, these ufc fights are very old anyways
train_data = train_data[train_data['weight_class'].notna()]

print(train_data.head())

Ultimate Ultimate '96 Tournament Title Bout http://www.ufcstats.com/fight-details/d595d2c36ddba8ee
UFC 10 Tournament Title Bout http://www.ufcstats.com/fight-details/6397fba14ce7f674
UFC Superfight Championship Bout http://www.ufcstats.com/fight-details/6a060498e60756af
UFC 8 Tournament Title Bout http://www.ufcstats.com/fight-details/8d2e99599124a16f
UFC Superfight Championship Bout http://www.ufcstats.com/fight-details/16b4a0b06427f1ac
Ultimate Ultimate '95 Tournament Title Bout http://www.ufcstats.com/fight-details/524b49a676498c6d
UFC 5 Tournament Title Bout http://www.ufcstats.com/fight-details/6ca94b35719eb300
UFC 3 Tournament Title Bout http://www.ufcstats.com/fight-details/323f543eb8abdb36
UFC 2 Tournament Title Bout http://www.ufcstats.com/fight-details/00835554f95fa911
   f1_id  f2_id            f1_name           f2_name  weight_class  \
0   1566    297        Jai Herbert    Rolando Bedoya           4.0   
1   2629   2454  Azamat Murzakanov  Alonzo Menifield           7.0   


## Cleaning Up Null Values
This step cleans up common null values in the training data, such as control time, age, and height. Because only a small number of rows are missing a fighter's age or height, those rows are simply dropped. For control time, the UFC stats page usually shows "--", which I take to mean that control time was either not tracked or was 0, so those nulls are filled with 0.

In [6]:


null_counts_per_column = train_data.isnull().sum()

print("Null values per column:")
print(null_counts_per_column)

# these null values for ctrl time are usually "--" on a ufc stats page

train_data['f1_ctrl_time'] = train_data['f1_ctrl_time'].fillna(0)
train_data['f2_ctrl_time'] = train_data['f2_ctrl_time'].fillna(0)
train_data = train_data.dropna(subset=['f1_age_during', 'f2_age_during', 'f1_height_cm', 'f2_height_cm'])

print("Null values per column after replacement:")
print(train_data.isnull().sum())

print(f"Number of duplicate rows: {train_data.duplicated().sum()}")


Null values per column:
f1_id                    0
f2_id                    0
f1_name                  0
f2_name                  0
weight_class             0
f1_age_during           87
f2_age_during          116
f1_height_cm            11
f2_height_cm             8
f1_knockdowns            0
f2_knockdowns            0
f1_sig_strike_atts       0
f2_sig_strike_atts       0
f1_sig_strikes           0
f2_sig_strikes           0
f1_tot_strike_atts       0
f2_tot_strike_atts       0
f1_tot_strikes           0
f2_tot_strikes           0
f1_takedown_atts         0
f2_takedown_atts         0
f1_takedowns             0
f2_takedowns             0
f1_submissions           0
f2_submissions           0
f1_reversals             0
f2_reversals             0
f1_ctrl_time           136
f2_ctrl_time           136
f1_head_strike_atts      0
f2_head_strike_atts      0
f1_head_strikes          0
f2_head_strikes          0
f1_body_strike_atts      0
f2_body_strike_atts      0
f1_body_strikes          0
f2_b

# Fighter Data Cleaning and Preprocessing

## Median over Average Data
UFC fights vary widely in their outcomes. Some fights end in a matter of seconds, which skews the average significantly, so I use the median instead of the average.
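
A small illustration of why the median is less sensitive to such outliers, using made-up fight durations (in seconds) rather than values from the dataset:

```python
from statistics import mean, median

# Hypothetical fight durations in seconds: four full three-round fights (900 s)
# and one 13-second knockout.
fight_times = [900, 900, 900, 900, 13]

avg_time = mean(fight_times)    # pulled down by the single short fight
med_time = median(fight_times)  # unaffected by the single short fight

print(f"mean={avg_time:.1f}s, median={med_time}s")
```

The single quick finish drags the mean well below the typical fight length, while the median still reports 900 seconds.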

In [7]:



median_fighter_data = pd.read_csv('/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_fighters_median.csv')
avg_fighter_data = pd.read_csv('/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_fighters_avg.csv')



combined_data = pd.merge(avg_fighter_data, median_fighter_data, on='fighter_id', suffixes=('_avg', '_med'))


sampled_data = combined_data.sample(n=5)



comparison_columns = {
    'avg_knockdowns': 'Avg Knockdowns',
    'median_knockdowns': 'Median Knockdowns',
    'avg_sig_strike_atts': 'Avg Significant Strike Attempts',
    'median_sig_strike_atts': 'Median Significant Strike Attempts',
    'avg_sig_strikes': 'Avg Significant Strikes',
    'median_sig_strikes': 'Median Significant Strikes',
    'avg_tot_strike_atts': 'Avg Total Strike Attempts',
    'median_tot_strike_atts': 'Median Total Strike Attempts',
    'avg_tot_strikes': 'Avg Total Strikes',
    'median_tot_strikes': 'Median Total Strikes',
    'avg_takedown_atts': 'Avg Takedown Attempts',
    'median_takedown_atts': 'Median Takedown Attempts',
    'avg_takedowns': 'Avg Takedowns',
    'median_takedowns': 'Median Takedowns',
    'avg_clinch_atts': 'Avg Clinch Attempts',
    'median_clinch_atts': 'Median Clinch Attempts',
    'avg_clinchs': 'Avg Clinchs',
    'median_clinchs': 'Median Clinchs',
    'avg_ctrl_time': 'Avg Control Time',
    'median_ctrl_time': 'Median Control Time',
    'avg_total_fight_time': 'Avg Total Fight Time',
    'median_total_fight_time': 'Median Total Fight Time',
    'avg_submissions': 'Avg Submissions',
    'median_submissions': 'Median Submissions',
    'avg_reversals': 'Avg Reversals',
    'median_reversals': 'Median Reversals',
    'avg_head_strike_atts': 'Avg Head Strike Attempts',
    'median_head_strike_atts': 'Median Head Strike Attempts',
    'avg_head_strikes': 'Avg Head Strikes',
    'median_head_strikes': 'Median Head Strikes',
    'avg_body_strike_atts': 'Avg Body Strike Attempts',
    'median_body_strike_atts': 'Median Body Strike Attempts',
    'avg_body_strikes': 'Avg Body Strikes',
    'median_body_strikes': 'Median Body Strikes', 
    'avg_leg_strike_atts': 'Avg Leg Strike Attempts', 
    'median_leg_strike_atts': 'Median Leg Strike Attempts',
    'avg_leg_strikes': 'Avg Leg Strikes',
    'median_leg_strikes': 'Median Leg Strikes',
    'avg_dist_strike_atts': 'Avg Distance Strike Attempts',
    'median_dist_strike_atts': 'Median Distance Strike Attempts',
    'avg_dist_strikes': 'Avg Distance Strikes',
    'median_dist_strikes': 'Median Distance Strikes',
    'avg_ground_atts': 'Avg Ground Attempts',
    'median_ground_atts': 'Median Ground Attempts',
    'avg_grounds': 'Avg Grounds',
    'median_grounds': 'Median Grounds'
}


existing_columns = [col for col in comparison_columns.keys() if col in sampled_data.columns]
comparison_data = sampled_data[existing_columns].rename(columns=comparison_columns)


print(comparison_data)



      Avg Knockdowns  Median Knockdowns  Avg Significant Strike Attempts  \
363         0.000000                0.0                       152.857143   
3759        0.666667                0.0                        19.333333   
3126        0.272727                0.0                        52.818182   
976         0.000000                0.0                         0.000000   
464         0.000000                0.0                         0.000000   

      Median Significant Strike Attempts  Avg Significant Strikes  \
363                                128.0                65.428571   
3759                                20.0                10.333333   
3126                                33.0                27.727273   
976                                  0.0                 0.000000   
464                                  0.0                 0.000000   

       Median Significant Strikes  Avg Total Strike Attempts  \
363                          53.0                 199.000000   


## Removing the Following Statistics Due to Excessive Null Values: Fighter Reach, Fighter Stance, Fighter Weight (Weight Class Will be Used Instead)
Fighter Date of Birth and Fighter Height also have many null values, but I might fill those with the mean or median age and height instead of dropping them. Age and height are important stats, and enough non-null values exist that computing a mean or median is straightforward.

Fighter Reach had far more null values than the other columns, so I decided to omit it. Fighter Stance is a categorical value such as orthodox or southpaw; it does not provide enough variation to compute meaningful averages and could negatively impact predictions, so I believe it is better to omit that column as well.

In [8]:
fighter_data = pd.read_csv('/kaggle/input/d/thasankakandage/ufc-dataset-2024/ufc_fighters_median.csv')
fighter_data = fighter_data.drop(columns=['fighter_weight_lbs', 'fighter_reach_cm', 'fighter_stance'])
print(fighter_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4227 entries, 0 to 4226
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   fighter_id               4227 non-null   int64  
 1   fighter_name             4227 non-null   object 
 2   fighter_dob              3457 non-null   object 
 3   fighter_height_cm        3856 non-null   float64
 4   fighter_wins             4227 non-null   int64  
 5   fighter_losses           4227 non-null   int64  
 6   fighter_draws            4227 non-null   int64  
 7   fighter_slpm             4227 non-null   float64
 8   fighter_str_acc_%        4227 non-null   float64
 9   fighter_sapm             4227 non-null   float64
 10  fighter_str_def_%        4227 non-null   float64
 11  fighter_td_avg           4227 non-null   float64
 12  fighter_td_acc_%         4227 non-null   float64
 13  fighter_td_def_%         4227 non-null   float64
 14  fighter_sub_avg         

In [9]:

null_counts_per_column = fighter_data.isnull().sum()

print("Null values per column:")
print(null_counts_per_column)

# as in the fight data, null control times are treated as 0
fighter_data['median_ctrl_time'] = fighter_data['median_ctrl_time'].fillna(0)
print("Null values per column after replacement:")
print(fighter_data.isnull().sum())

print(f"Number of duplicate rows: {fighter_data.duplicated().sum()}")


Null values per column:
fighter_id                   0
fighter_name                 0
fighter_dob                770
fighter_height_cm          371
fighter_wins                 0
fighter_losses               0
fighter_draws                0
fighter_slpm                 0
fighter_str_acc_%            0
fighter_sapm                 0
fighter_str_def_%            0
fighter_td_avg               0
fighter_td_acc_%             0
fighter_td_def_%             0
fighter_sub_avg              0
fighter_url                  0
median_knockdowns            0
median_sig_strike_atts       0
median_sig_strikes           0
median_tot_strike_atts       0
median_tot_strikes           0
median_takedown_atts         0
median_takedowns             0
median_clinch_atts           0
median_clinchs               0
median_ctrl_time           127
median_total_fight_time      0
median_submissions           0
median_reversals             0
median_head_strike_atts      0
median_head_strikes          0
median_body_str

## Integrating Some Fighter Career Stats

Some career stats from each fighter's individual stat page will be added to that fighter's stats for each fight during model training.

These career stats are:

- SLpM : Significant Strikes Landed per Minute
- Str. Acc. : Significant Striking Accuracy
- SApM : Significant Strikes Absorbed per Minute
- Str. Def. : Significant Strike Defense (the % of opponents' strikes that did not land)
- TD Avg. : Average Takedowns Landed per 15 minutes
- TD Acc. : Takedown Accuracy
- TD Def. : Takedown Defense (the % of opponents' TD attempts that did not land)
- Sub. Avg. : Average Submissions Attempted per 15 minutes
- Record : Wins-Losses-Draws

Out of these stats, I will only use SLpM, SApM, and the win-loss record. The rest are left out for the following reasons:

- Str. Acc. - [Reasoning](#section1)
- Str. Def. - [Reasoning](#section1)
- TD Avg. - Already calculated with average takedown attempts and average takedowns within the ufc_fighters.csv file
- TD Acc. - [Reasoning](#section1)
- TD Def. - [Reasoning](#section1)
- Sub.  Avg. - Already calculated with average submission attempts and average submissions within the ufc_fighters.csv file
- Draws - Draws are extremely rare in the UFC and there is very little data for them, so the model focuses on binary classification (win or loss)

In [10]:
fighter_data = fighter_data.drop(columns=['fighter_str_acc_%', 'fighter_str_def_%', 'fighter_td_avg', 'fighter_td_acc_%',
                                         'fighter_td_def_%', 'fighter_sub_avg', 'fighter_url'])
print(fighter_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4227 entries, 0 to 4226
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   fighter_id               4227 non-null   int64  
 1   fighter_name             4227 non-null   object 
 2   fighter_dob              3457 non-null   object 
 3   fighter_height_cm        3856 non-null   float64
 4   fighter_wins             4227 non-null   int64  
 5   fighter_losses           4227 non-null   int64  
 6   fighter_draws            4227 non-null   int64  
 7   fighter_slpm             4227 non-null   float64
 8   fighter_sapm             4227 non-null   float64
 9   median_knockdowns        4227 non-null   float64
 10  median_sig_strike_atts   4227 non-null   float64
 11  median_sig_strikes       4227 non-null   float64
 12  median_tot_strike_atts   4227 non-null   float64
 13  median_tot_strikes       4227 non-null   float64
 14  median_takedown_atts    

Appending the individual fighter stats to their row inside the train data

In [11]:

train_data['f1_slpm'] = np.nan
train_data['f2_slpm'] = np.nan
train_data['f1_sapm'] = np.nan
train_data['f2_sapm'] = np.nan
train_data['f1_wins'] = np.nan
train_data['f2_wins'] = np.nan
train_data['f1_losses'] = np.nan
train_data['f2_losses'] = np.nan



f1_stats_list = []
f2_stats_list = []


for index, row in train_data.iterrows():
    f1_id, f2_id = row['f1_id'], row['f2_id']

    f1_stats = fighter_data[fighter_data['fighter_id'] == f1_id]
    f2_stats = fighter_data[fighter_data['fighter_id'] == f2_id]
    
    # Default to NaN so a missing fighter never reuses values from a previous iteration
    f1_slpm = f1_sapm = f1_wins = f1_losses = np.nan
    f2_slpm = f2_sapm = f2_wins = f2_losses = np.nan

    if not f1_stats.empty:
        f1_slpm, f1_sapm, f1_wins, f1_losses = f1_stats[['fighter_slpm', 'fighter_sapm', 'fighter_wins', 'fighter_losses']].values.flatten()

    if not f2_stats.empty:
        f2_slpm, f2_sapm, f2_wins, f2_losses = f2_stats[['fighter_slpm', 'fighter_sapm', 'fighter_wins', 'fighter_losses']].values.flatten()

    f1_stats_list.append([f1_slpm, f1_sapm])
    f2_stats_list.append([f2_slpm, f2_sapm])


    train_data.at[index, 'f1_slpm'] = f1_slpm
    train_data.at[index, 'f2_slpm'] = f2_slpm
    train_data.at[index, 'f1_sapm'] = f1_sapm
    train_data.at[index, 'f2_sapm'] = f2_sapm
    train_data.at[index, 'f1_wins'] = f1_wins
    train_data.at[index, 'f2_wins'] = f2_wins
    train_data.at[index, 'f1_losses'] = f1_losses
    train_data.at[index, 'f2_losses'] = f2_losses

 
print(train_data.head())
print(train_data.isnull().sum())

   f1_id  f2_id            f1_name           f2_name  weight_class  \
0   1566    297        Jai Herbert    Rolando Bedoya           4.0   
1   2629   2454  Azamat Murzakanov  Alonzo Menifield           7.0   
2   2015   3999  Guram Kutateladze    Jordan Vucenic           4.0   
3    107    439       Joel Alvarez      Elves Brener           4.0   
5   1125    649      Tony Ferguson    Michael Chiesa           5.0   

   f1_age_during  f2_age_during  f1_height_cm  f2_height_cm  f1_knockdowns  \
0           36.0           27.0        185.42        180.34              1   
1           35.0           36.0        177.80        182.88              1   
2           32.0           28.0        180.34        177.80              0   
3           31.0           26.0        190.50        177.80              1   
5           40.0           36.0        180.34        185.42              0   

   ...                                         fights_url  \
0  ...  http://www.ufcstats.com/fight-details/d03
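
The row-by-row lookup above can also be expressed as two merges, which is usually faster on a few thousand rows and leaves NaN automatically when a fighter ID is missing. This is only a sketch on toy frames standing in for `train_data` and `fighter_data`; `attach_career_stats` is a hypothetical helper, and it assumes `fighter_id` is unique in the fighter table and that the `f1_`/`f2_` career columns have not been created beforehand.

```python
import pandas as pd

CAREER_COLS = ["fighter_slpm", "fighter_sapm", "fighter_wins", "fighter_losses"]


def attach_career_stats(fights: pd.DataFrame, fighters: pd.DataFrame) -> pd.DataFrame:
    """Add f1_/f2_ career columns to `fights` via left merges on the fighter IDs."""
    out = fights.copy()
    stats = fighters[["fighter_id"] + CAREER_COLS]
    for prefix in ("f1", "f2"):
        # Rename fighter_slpm -> f1_slpm etc., and fighter_id -> f1_id as the join key.
        renamed = stats.rename(
            columns={c: c.replace("fighter_", f"{prefix}_") for c in stats.columns}
        )
        # validate raises if a fighter ID appears more than once in the fighter table.
        out = out.merge(renamed, on=f"{prefix}_id", how="left", validate="many_to_one")
    return out


# Toy data: fighter 99 has no career row, so its stats stay NaN.
fights = pd.DataFrame({"f1_id": [1, 2], "f2_id": [2, 99]})
fighters = pd.DataFrame({
    "fighter_id": [1, 2],
    "fighter_slpm": [3.5, 4.1],
    "fighter_sapm": [2.0, 3.3],
    "fighter_wins": [10, 7],
    "fighter_losses": [2, 5],
})
merged = attach_career_stats(fights, fighters)
print(merged)
```

Note that `merge` returns a frame with a fresh `RangeIndex`, so the original index labels are not preserved the way they are with the loop.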

Moving the fight time, result, and URL columns to the end so the frame is easier to read.

In [12]:

columns_to_move = ['f1_total_fight_time', 'f2_total_fight_time', 'result', 'fights_url', 'event_url']


all_columns = train_data.columns.tolist()

columns_first = [col for col in all_columns if col not in columns_to_move]

new_column_order = columns_first + columns_to_move

train_data = train_data[new_column_order]

print(train_data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 6621 entries, 0 to 7704
Data columns (total 66 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   f1_id                6621 non-null   int64  
 1   f2_id                6621 non-null   int64  
 2   f1_name              6621 non-null   object 
 3   f2_name              6621 non-null   object 
 4   weight_class         6621 non-null   float64
 5   f1_age_during        6621 non-null   float64
 6   f2_age_during        6621 non-null   float64
 7   f1_height_cm         6621 non-null   float64
 8   f2_height_cm         6621 non-null   float64
 9   f1_knockdowns        6621 non-null   int64  
 10  f2_knockdowns        6621 non-null   int64  
 11  f1_sig_strike_atts   6621 non-null   int64  
 12  f2_sig_strike_atts   6621 non-null   int64  
 13  f1_sig_strikes       6621 non-null   int64  
 14  f2_sig_strikes       6621 non-null   int64  
 15  f1_tot_strike_atts   6621 non-null   int64 

In [13]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

y = train_data['result']
print(y.info())
print(y.head())

X = train_data.drop(columns=['f1_id', 'f2_id', 'f1_name', 'f2_name', 'fights_url', 'event_url', 'result'])
print(X.info())
print(X.head())



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)





<class 'pandas.core.series.Series'>
Index: 6621 entries, 0 to 7704
Series name: result
Non-Null Count  Dtype
--------------  -----
6621 non-null   int64
dtypes: int64(1)
memory usage: 361.5 KB
None
0    0
1    0
2    0
3    0
5    1
Name: result, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Index: 6621 entries, 0 to 7704
Data columns (total 59 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   weight_class         6621 non-null   float64
 1   f1_age_during        6621 non-null   float64
 2   f2_age_during        6621 non-null   float64
 3   f1_height_cm         6621 non-null   float64
 4   f2_height_cm         6621 non-null   float64
 5   f1_knockdowns        6621 non-null   int64  
 6   f2_knockdowns        6621 non-null   int64  
 7   f1_sig_strike_atts   6621 non-null   int64  
 8   f2_sig_strike_atts   6621 non-null   int64  
 9   f1_sig_strikes       6621 non-null   int64  
 10  f2_sig_strikes       6621 non-nul

# EDA
While an almost endless range of analyses is possible with UFC stats, I will focus only on the ones I believe to be the most important.

<a id="section2"></a>
## Testing Linear Separability
We can test linear separability by deliberately overfitting a linear SVM on the training data.
1. Start with a large C. 
2. Train the model.
3. Test on the training set.
4. If we get 100% accuracy, the data is linearly separable.

A large C penalizes misclassified points heavily, so the optimizer is pushed toward zero classification error on the training set in order to minimize the loss function. Doing this overfits the data on purpose.

The training accuracy was 89%, so the data does not appear to be linearly separable.

In [14]:

# from sklearn.svm import SVC

# svm_model = SVC(C=2**32, kernel="linear")  # ** is exponentiation; 2^32 would be bitwise XOR (= 34)
# svm_model.fit(X_train, y_train)


# y_pred = svm_model.predict(X_train)

# print(f'Accuracy: {accuracy_score(y_train, y_pred):.5f}')
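
The procedure can be sketched on synthetic data where the answer is known in advance. The two toy point sets below are assumptions made for illustration (they are not UFC data), `train_accuracy_linear_svm` is a hypothetical helper, and `C=1e6` is an arbitrary "large" value. A training accuracy below 100% with a finite C is suggestive rather than a strict proof of non-separability.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC


def train_accuracy_linear_svm(X, y, C=1e6):
    """Fit a linear SVM with a very large C and score it on its own training data."""
    model = SVC(C=C, kernel="linear")
    model.fit(X, y)
    return accuracy_score(y, model.predict(X))


# Separable toy set: two clusters far apart.
X_sep = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_sep = np.array([0, 0, 0, 1, 1, 1])

# Non-separable toy set: the XOR pattern, which no single line can split.
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_xor = np.array([0, 0, 1, 1])

acc_sep = train_accuracy_linear_svm(X_sep, y_sep)
acc_xor = train_accuracy_linear_svm(X_xor, y_xor)
print(f"separable: {acc_sep:.2f}, XOR: {acc_xor:.2f}")
```

The separable clusters reach perfect training accuracy, while the XOR points cannot, which mirrors the reasoning applied to the fight data.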

# Model Choice
For hyperparameter tuning, I used Grid Search and Bayesian Optimization. I did not spend much time on it, however, because it is extremely time-consuming.

## SVM
- **C**: Controls the penalty for misclassified data points. A small value keeps the penalty small, so the model tolerates more misclassified points; this tends to generalize better but can underfit. A large value penalizes misclassified points heavily, so the model fits the training data as closely as possible, which can lead to overfitting and poorer generalization.
- **kernel**: Function that implicitly maps the data into a higher-dimensional space where it can be linearly separated. This lets the SVM find a hyperplane that separates classes that are not linearly separable in the original input space.

The data is not linearly separable, [as seen above](#section2). We can still use the "rbf" kernel to create complex boundaries that separate non-linear data by transforming the input space into a higher-dimensional space.

In [15]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# # parameter grid for Grid Search and Random Search
# param_grid = {
#     'C': [0.1, 1, 10, 100, 1000], 
#     'gamma': [0.001, 0.01, 0.1, 1, 10, 100],  
#     'kernel': ['rbf'] 
# }
# # # Grid Search
# parameter_search = GridSearchCV(estimator=SVC(probability=True), param_grid=param_grid, 
#                            scoring='accuracy', cv=5, verbose=0)


# parameter_search.fit(X_train, y_train)


# best_params = parameter_search.best_params_
# best_model = parameter_search.best_estimator_

# y_pred = best_model.predict(X_test)

# print(f'Best parameters: {best_params}')
# print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
# # Best parameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
# # Accuracy: 0.89



# svm_model = SVC(C=100, kernel='rbf', gamma=0.001, probability = True)  
# svm_model.fit(X_train, y_train)


# y_pred = svm_model.predict(X_test)

# print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')

## Logistic Regression

In [16]:
# from sklearn.linear_model import LogisticRegression


# log_reg_model = LogisticRegression()
# log_reg_model.fit(X_train, y_train)


# y_pred = log_reg_model.predict(X_test)


# print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')

## Random Forest Classifier
- **n_estimators** (number of trees): Number of decision trees built in the forest. More trees generally give more stable and often more accurate predictions, but each extra tree adds computational cost and the gain in accuracy eventually plateaus.
- **max_depth**: Maximum depth of each decision tree from root to leaf, used to control complexity. Deeper trees can fit the training data more closely, but they can also capture noise and small fluctuations, which leads to overfitting, and they need more computing power. Shallower trees base their decisions on fewer splits, so they are less likely to overfit and tend to generalize better, but they can also underfit.
- **min_samples_split**: Minimum number of samples a node must contain before it can be split. A high value stops the creation of nodes that are too specific, which helps prevent overfitting; it also results in fewer splits and therefore shallower trees. A low value lets nodes split even with few samples, creating deeper trees.
- **min_samples_leaf**: Minimum number of samples required in a leaf node. A high value leads to shallow trees because every leaf must hold many samples, which can prevent the trees from capturing patterns and lead to underfitting. A low value lets trees grow deeper with smaller leaf nodes, capturing more detailed patterns but also risking overfitting. An integer value is a number of samples, while a float value is a fraction of the samples.
- **max_features**: Number of features considered when splitting a node in a decision tree. Limiting it makes the trees in the forest more diverse, which helps generalization and reduces overfitting. High values give each split access to most of the features, producing individually accurate but more similar trees, which can reduce generalization and lead to overfitting. Low values mean each split sees only a small subset of features, which increases diversity but can make individual trees less accurate and may cause underfitting. An integer value is a number of features, while a float value is a fraction of the features.
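
The effect of these settings on tree size can be checked directly. The snippet below is a minimal sketch, assuming synthetic data from `make_classification` rather than the UFC dataset; the helper `mean_tree_depth` and the specific parameter values are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, noisy binary data (an assumption; not the UFC dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

def mean_tree_depth(**params):
    """Fit a small forest and return the average depth of its trees."""
    forest = RandomForestClassifier(n_estimators=25, random_state=0, **params)
    forest.fit(X_demo, y_demo)
    return float(np.mean([tree.get_depth() for tree in forest.estimators_]))

depth_default = mean_tree_depth()                          # no limits on growth
depth_capped = mean_tree_depth(max_depth=3)                # hard cap on depth
depth_big_leaves = mean_tree_depth(min_samples_leaf=50)    # every leaf needs >= 50 samples
depth_big_splits = mean_tree_depth(min_samples_split=100)  # nodes need >= 100 samples to split

print(f'default:               {depth_default:.1f}')
print(f'max_depth=3:           {depth_capped:.1f}')
print(f'min_samples_leaf=50:   {depth_big_leaves:.1f}')
print(f'min_samples_split=100: {depth_big_splits:.1f}')
```

On data like this, the restricted forests should come out noticeably shallower than the default one, which is the mechanism behind the overfitting and underfitting trade-offs described in the list.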

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from sklearn.model_selection import cross_val_score

# parameter grid for Grid Search and Random Search
# parameter_grid = {
#     'n_estimators': [50, 100, 150],        
#     'max_depth': [None, 10, 20, 30],       
#     'min_samples_split': [2, 5, 10],    
#     'min_samples_leaf': [1, 2, 4],    
#     'max_features': ['sqrt', 'log2']  
# }


# # Grid Search
# parameter_search = GridSearchCV(estimator=RandomForestClassifier(), 
#                            param_grid=parameter_grid, 
#                            cv=5,                            
#                            verbose=0)                      
# # Best Parameters: {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 150}

# # Random Search
# parameter_search = RandomizedSearchCV(estimator=RandomForestClassifier(),
#     param_distributions=parameter_grid,
#     n_iter=1000,
#     cv=5,
#     verbose=0
# )
# # Best Parameters: {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 150}


# # parameter space for Bayesian Optimization
# parameter_space = {
#     'n_estimators': Integer(10, 200),
#     'max_depth': Integer(1, 30),
#     'min_samples_split': Integer(2, 20),
#     'min_samples_leaf': Integer(1, 20),
#     'max_features': ['sqrt', 'log2']
# }


# # Bayesian Optimization
# parameter_search = BayesSearchCV(
#     estimator=RandomForestClassifier(),
#     search_spaces=parameter_space,
#     n_iter=500,  
#     cv=5,      
#     verbose=0
# )
# # Best Parameters: {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 150}


# parameter_search.fit(X_train, y_train)

# best_params = parameter_search.best_params_
# best_rf_model = parameter_search.best_estimator_

# y_pred = best_rf_model.predict(X_test)

# print(f'Best Parameters: {best_params}')
# print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')







# more general params?
# n_estimators=50,max_depth=10,min_samples_split=20,min_samples_leaf=1,max_features='sqrt', random_state=42)


rf_model = RandomForestClassifier(n_estimators=50,max_depth=10,min_samples_split=20,min_samples_leaf=1,max_features='sqrt', random_state=42)
rf_model.fit(X_train, y_train)


y_pred = rf_model.predict(X_test)


print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')

Accuracy: 0.89
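
The accuracy above comes from a single train/test split. `cross_val_score` is imported in the cell above but not used; the snippet below is a minimal sketch of how it could show the spread across folds. It assumes synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names), so the numbers it prints say nothing about the UFC model itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in the notebook's X and y to use it for real).
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=42)

# Same hyperparameters as rf_model in the cell above.
rf_demo = RandomForestClassifier(
    n_estimators=50, max_depth=10, min_samples_split=20,
    min_samples_leaf=1, max_features='sqrt', random_state=42,
)

# One accuracy value per fold instead of a single number.
cv_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring='accuracy')
print(f'Fold accuracies: {cv_scores.round(2)}')
print(f'Mean: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}')
```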


## XGBoost

In [18]:
# import xgboost as xgb


# xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
# xgb_model.fit(X_train, y_train)


# y_pred = xgb_model.predict(X_test)


# print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')


# Testing

To test, simply type in the two fighters' names and the weight class they will be fighting in.

In [19]:
from datetime import datetime
# DON'T USE THESE NAMES! Each of these names is shared by at least one other fighter
# An age input will be developed later in case a duplicate name is chosen

# Mike Davis 
# Joey Gomez 
# Tony Johnson    
# Michael McDonald  
# Jean Silva    
# Bruno Silva

# these are your three inputs
#################################
fighter1_name = "Chepe Mariscal"
fighter2_name = "Damon Jackson"
weight_class = "Featherweight"
#################################

df = pd.DataFrame({
    'weight_class': [weight_class]
})
df['weight_class_value'] = df.apply(map_weight_class, axis=1)
weight_class_value = df['weight_class_value'].iloc[0]
print(f'Weight class value for "{weight_class}": {weight_class_value}')

today = datetime.now()
if 'fighter_dob' in fighter_data.columns:
    fighter_data['fighter_dob'] = fighter_data['fighter_dob'].apply(
        lambda dob: today.year - pd.to_datetime(dob).year - (
            (today.month, today.day) < (pd.to_datetime(dob).month, pd.to_datetime(dob).day)
        )
    )


    fighter_data = fighter_data.rename(columns={'fighter_dob': 'fighter_age_during'})
print(fighter_data.info())
fighter1_data = fighter_data[fighter_data["fighter_name"] == fighter1_name]
fighter2_data = fighter_data[fighter_data["fighter_name"] == fighter2_name]
training_columns = X.columns.tolist()
print(X.info())

non_average_stats = ['fighter_wins', 'fighter_losses', 'fighter_slpm', 'fighter_sapm', 'fighter_age_during', 'fighter_height_cm']
features = []

for column in training_columns:

    if column.startswith('f1_'): 
        if column.replace("f1_", "fighter_") not in non_average_stats:
            feature = column.replace("f1_", "median_")
            if feature not in features:
                features.append(feature)
        else:
            feature = column.replace("f1_", "fighter_")
            if feature not in features:
                features.append(feature)
    elif column.startswith('f2_'): 
        if column.replace("f2_", "fighter_") not in non_average_stats:
            feature = column.replace("f2_", "median_")
            if feature not in features:
                features.append(feature)
            
        else:
            feature = column.replace("f2_", "fighter_")
            if feature not in features:
                features.append(feature)

fighter1_features = fighter1_data[features].columns.tolist()
fighter2_features = fighter2_data[features].columns.tolist()


fighter1_data_list = [(f'f1_{col_name}', fighter1_data[col_name].values[0]) for col_name in fighter1_features]
fighter2_data_list = [(f'f2_{col_name}', fighter2_data[col_name].values[0]) for col_name in fighter2_features]


interleaved_feature_values_list = [val for pair in zip(fighter1_data_list, fighter2_data_list) for val in pair]


for col_name, value in interleaved_feature_values_list:
    print(f"{col_name}: {value}")


fighter1_data_filtered = fighter1_data[features]
fighter2_data_filtered = fighter2_data[features]

    
fighter1_array = fighter1_data_filtered.values.flatten()
fighter2_array = fighter2_data_filtered.values.flatten()

interleaved_data = np.empty(len(X.columns.tolist()))

interleaved_data[0] = weight_class_value


interleaved_data[1::2] = fighter1_array
interleaved_data[2::2] = fighter2_array
interleaved_data = interleaved_data.reshape(1, -1)

single_input = scaler.transform(interleaved_data)


                                
    
    
    
    
# # SVM Model
# prediction_probs = svm_model.predict_proba(single_input)
# single_pred = svm_model.predict(single_input)
# prediction_probability = prediction_probs[0][single_pred[0]]

# # Logistic Regression Model
# prediction_probs = log_reg_model.predict_proba(single_input)  
# single_pred = log_reg_model.predict(single_input)  
# prediction_probability = prediction_probs[0][single_pred[0]] 

# Random Forest Classifier Model - best so far
prediction_probs = rf_model.predict_proba(single_input)  
print(prediction_probs)
single_pred = rf_model.predict(single_input)  
prediction_probability = prediction_probs[0][single_pred[0]]  

# # XGBoost Model
# prediction_probs = xgb_model.predict_proba(single_input)  
# single_pred = xgb_model.predict(single_input)  
# prediction_probability = prediction_probs[0][single_pred[0]]  



 
    
if single_pred[0] == 0:
    predicted_winner = fighter1_name
elif single_pred[0] == 1:
    predicted_winner = fighter2_name
else:
    print(single_pred[0])
    predicted_winner = "Unknown"

print(f'Predicted Winner: {predicted_winner}')
print(f'Probability: {prediction_probability:.2f}')


Weight class value for "Featherweight": 2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4227 entries, 0 to 4226
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   fighter_id               4227 non-null   int64  
 1   fighter_name             4227 non-null   object 
 2   fighter_age_during       3457 non-null   float64
 3   fighter_height_cm        3856 non-null   float64
 4   fighter_wins             4227 non-null   int64  
 5   fighter_losses           4227 non-null   int64  
 6   fighter_draws            4227 non-null   int64  
 7   fighter_slpm             4227 non-null   float64
 8   fighter_sapm             4227 non-null   float64
 9   median_knockdowns        4227 non-null   float64
 10  median_sig_strike_atts   4227 non-null   float64
 11  median_sig_strikes       4227 non-null   float64
 12  median_tot_strike_atts   4227 non-null   float64
 13  median_tot_strikes       4227 non-nu
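
The interleaving step in the cell above relies on the training columns being ordered as the weight class first, followed by alternating `f1_` and `f2_` columns. The snippet below is a minimal sketch of that layout with a check on the assumption; the column names and values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical training column order: weight class first, then f1_/f2_ pairs.
training_columns_demo = [
    'weight_class',
    'f1_wins', 'f2_wins',
    'f1_sig_strikes', 'f2_sig_strikes',
]

weight_class_value_demo = 2
fighter1_values = np.array([10.0, 35.5])  # made-up values for fighter 1
fighter2_values = np.array([7.0, 41.0])   # made-up values for fighter 2

row = np.empty(len(training_columns_demo))
row[0] = weight_class_value_demo  # slot 0 holds the weight class
row[1::2] = fighter1_values       # odd slots hold fighter 1's stats
row[2::2] = fighter2_values       # even slots from 2 onward hold fighter 2's stats

# Verify the layout assumption before trusting the row.
assert all(c.startswith('f1_') for c in training_columns_demo[1::2])
assert all(c.startswith('f2_') for c in training_columns_demo[2::2])

for name, value in zip(training_columns_demo, row):
    print(f'{name}: {value}')
```

If the real column order ever differs from this pattern, the slice assignment still succeeds but silently puts values under the wrong columns, so a check like the two assertions is cheap insurance.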



# To Add
- Extend the data preprocessing, e.g., check whether certain rows have too many 0's, and add an age input for when the user enters a name shared by multiple fighters
- Test the current models further and try out other models
- Do further data analysis
- Remove very old UFC data; the plan is to drop fights whose format differs from (5-5-5) or (5-5-5-5-5)
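
The planned age input for duplicate names could look something like the sketch below. The helper `select_fighter` is hypothetical, and the demo rows are made up; it only assumes the `fighter_name` and `fighter_age_during` columns that the testing cell already uses.

```python
import pandas as pd

def select_fighter(fighter_data, name, age=None):
    """Return the single row for `name`, using `age` to break ties (hypothetical helper)."""
    matches = fighter_data[fighter_data['fighter_name'] == name]
    if matches.empty:
        raise ValueError(f'No fighter named {name!r} found.')
    if len(matches) > 1:
        if age is None:
            raise ValueError(
                f'{len(matches)} fighters are named {name!r}; pass an age to pick one.'
            )
        matches = matches[matches['fighter_age_during'] == age]
        if len(matches) != 1:
            raise ValueError(f'Age {age} does not identify a unique {name!r}.')
    return matches.iloc[0]

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'fighter_name': ['Fighter A', 'Fighter A', 'Fighter B'],
    'fighter_age_during': [34.0, 30.0, 35.0],
    'fighter_wins': [12, 8, 5],
})

print(select_fighter(demo, 'Fighter B')['fighter_wins'])
print(select_fighter(demo, 'Fighter A', age=30.0)['fighter_wins'])
```

Raising an error for a missing or ambiguous name is one possible design; it avoids silently predicting with the first matching row.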