# Data Preparation

## Why do this step?

- The data we scraped from the UFC website is raw. Every row describes a fight that took place, with details such as how many strikes were thrown and who won.
- To prepare the data for prediction, every row must contain only what each fighter had done in fights before that one. No data recorded during the fight itself can be present in its row.
- Our target variable is `Winner`. The task is to predict the winner from the data available for each fighter up until the fight.
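
To make the "only prior fights" constraint concrete, here is a minimal sketch on made-up numbers. The `toy` frame and its column names are hypothetical and not part of the scraped data. Shifting by one fight before averaging keeps the current fight's stats out of its own row.

```python
import pandas as pd

# Hypothetical toy data: one fighter's significant strikes landed, oldest fight first.
toy = pd.DataFrame({
    "fighter": ["A", "A", "A"],
    "sig_str_landed": [10, 30, 20],
})

# shift(1) drops the current fight, so each row only sees earlier fights.
toy["prior_avg_landed"] = (
    toy.groupby("fighter")["sig_str_landed"]
       .transform(lambda s: s.shift(1).expanding().mean())
)
print(toy)
```

The first fight has no history, so its prior average is NaN rather than a value borrowed from the fight itself.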

## Looking at the data

In [18]:
import pandas as pd
import numpy as np
import math

df = pd.read_csv('../data/raw_total_fight_data.csv', sep=';')
fighter_details = pd.read_csv('../data/raw_fighter_details.csv', index_col='fighter_name')

In [19]:
df.head()

Unnamed: 0,R_fighter,B_fighter,R_KD,B_KD,R_SIG_STR.,B_SIG_STR.,R_SIG_STR_pct,B_SIG_STR_pct,R_TOTAL_STR.,B_TOTAL_STR.,...,B_GROUND,win_by,last_round,last_round_time,Format,Referee,date,location,Fight_type,Winner
0,Justin Gaethje,Rafael Fiziev,1,0,72 of 134,68 of 119,53%,57%,98 of 163,81 of 134,...,5 of 5,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Jason Herzog,"March 08, 2025","Las Vegas, Nevada, USA",Lightweight Bout,Justin Gaethje
1,Amanda Lemos,Iasmin Lucindo,0,0,12 of 21,4 of 16,57%,25%,30 of 42,40 of 56,...,1 of 1,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Keith Peterson,"March 08, 2025","Las Vegas, Nevada, USA",Women's Strawweight Bout,Amanda Lemos
2,Jalin Turner,Ignacio Bahamondes,0,0,7 of 20,10 of 16,35%,62%,11 of 25,17 of 23,...,4 of 5,Submission,1,2:29,3 Rnd (5-5-5),Mark Smith,"March 08, 2025","Las Vegas, Nevada, USA",Lightweight Bout,Ignacio Bahamondes
3,Joshua Van,Rei Tsuruya,0,0,59 of 97,32 of 84,60%,38%,127 of 169,47 of 104,...,0 of 0,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Jason Herzog,"March 08, 2025","Las Vegas, Nevada, USA",Flyweight Bout,Joshua Van
4,Alex Pereira,Magomed Ankalaev,0,0,76 of 137,94 of 180,55%,52%,97 of 159,127 of 224,...,0 of 0,Decision - Unanimous,5,5:00,5 Rnd (5-5-5-5-5),Marc Goddard,"March 08, 2025","Las Vegas, Nevada, USA",UFC Light Heavyweight Title Bout,Magomed Ankalaev


In [20]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8037,8038,8039,8040,8041,8042,8043,8044,8045,8046
R_fighter,Justin Gaethje,Amanda Lemos,Jalin Turner,Joshua Van,Alex Pereira,King Green,Brunno Ferreira,Alex Morono,Djorden Santos,Mairon Santos,...,Patrick Smith,Royce Gracie,Johnny Rhodes,Jason DeLucia,Remco Pardoel,Orlando Wiet,Johnny Rhodes,Patrick Smith,Frank Hamaker,Scott Morris
B_fighter,Rafael Fiziev,Iasmin Lucindo,Ignacio Bahamondes,Rei Tsuruya,Magomed Ankalaev,Mauricio Ruffy,Armen Petrosyan,Carlos Leal,Ozzy Diaz,Francis Marshall,...,Scott Morris,Minoki Ichihara,Fred Ettish,Scott Baker,Alberta Cerra Leon,Robert Lucarelli,David Levicki,Ray Wizard,Thaddeus Luster,Sean Daugherty
R_KD,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
B_KD,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
R_SIG_STR.,72 of 134,12 of 21,7 of 20,59 of 97,76 of 137,4 of 15,27 of 56,40 of 95,131 of 345,58 of 129,...,13 of 17,2 of 4,13 of 29,3 of 5,4 of 6,8 of 12,11 of 17,1 of 1,2 of 3,1 of 1
B_SIG_STR.,68 of 119,4 of 16,10 of 16,32 of 84,94 of 180,4 of 9,46 of 83,79 of 120,135 of 261,39 of 143,...,0 of 0,3 of 7,4 of 7,0 of 2,1 of 3,2 of 6,4 of 5,1 of 1,0 of 0,0 of 4
R_SIG_STR_pct,53%,57%,35%,60%,55%,26%,48%,42%,37%,44%,...,76%,50%,44%,60%,66%,66%,64%,100%,66%,100%
B_SIG_STR_pct,57%,25%,62%,38%,52%,44%,55%,65%,51%,27%,...,---,42%,57%,0%,33%,33%,80%,100%,---,0%
R_TOTAL_STR.,98 of 163,30 of 42,11 of 25,127 of 169,97 of 159,4 of 15,31 of 63,41 of 96,134 of 348,78 of 154,...,19 of 25,110 of 114,21 of 38,20 of 25,20 of 22,11 of 15,74 of 86,1 of 1,14 of 15,2 of 2
B_TOTAL_STR.,81 of 134,40 of 56,17 of 23,47 of 104,127 of 224,4 of 9,47 of 84,81 of 123,136 of 262,64 of 179,...,0 of 0,12 of 16,7 of 11,14 of 23,9 of 11,2 of 6,95 of 102,2 of 2,0 of 0,1 of 5


In [21]:
df.describe()

Unnamed: 0,R_KD,B_KD,R_SUB_ATT,B_SUB_ATT,R_REV,B_REV,last_round
count,8047.0,8047.0,8047.0,8047.0,8047.0,8047.0,8047.0
mean,0.244812,0.181186,0.443271,0.321362,0.136697,0.134709,2.354294
std,0.517374,0.457467,0.883737,0.753561,0.429122,0.42128,1.016084
min,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,3.0
75%,0.0,0.0,1.0,0.0,0.0,0.0,3.0
max,5.0,4.0,10.0,7.0,6.0,5.0,5.0


In [22]:
df.dtypes

R_fighter          object
B_fighter          object
R_KD                int64
B_KD                int64
R_SIG_STR.         object
B_SIG_STR.         object
R_SIG_STR_pct      object
B_SIG_STR_pct      object
R_TOTAL_STR.       object
B_TOTAL_STR.       object
R_TD               object
B_TD               object
R_TD_pct           object
B_TD_pct           object
R_SUB_ATT           int64
B_SUB_ATT           int64
R_REV               int64
B_REV               int64
R_CTRL             object
B_CTRL             object
R_HEAD             object
B_HEAD             object
R_BODY             object
B_BODY             object
R_LEG              object
B_LEG              object
R_DISTANCE         object
B_DISTANCE         object
R_CLINCH           object
B_CLINCH           object
R_GROUND           object
B_GROUND           object
win_by             object
last_round          int64
last_round_time    object
Format             object
Referee            object
date               object
location    

In [23]:
df.columns

Index(['R_fighter', 'B_fighter', 'R_KD', 'B_KD', 'R_SIG_STR.', 'B_SIG_STR.',
       'R_SIG_STR_pct', 'B_SIG_STR_pct', 'R_TOTAL_STR.', 'B_TOTAL_STR.',
       'R_TD', 'B_TD', 'R_TD_pct', 'B_TD_pct', 'R_SUB_ATT', 'B_SUB_ATT',
       'R_REV', 'B_REV', 'R_CTRL', 'B_CTRL', 'R_HEAD', 'B_HEAD', 'R_BODY',
       'B_BODY', 'R_LEG', 'B_LEG', 'R_DISTANCE', 'B_DISTANCE', 'R_CLINCH',
       'B_CLINCH', 'R_GROUND', 'B_GROUND', 'win_by', 'last_round',
       'last_round_time', 'Format', 'Referee', 'date', 'location',
       'Fight_type', 'Winner'],
      dtype='object')

### Column definitions:

- The `R_` and `B_` prefixes signify red and blue corner fighter stats respectively
- `KD` is number of knockdowns
- `SIG_STR` is no. of significant strikes 'landed of attempted'
- `SIG_STR_pct` is significant strikes percentage
- `TOTAL_STR` is total strikes 'landed of attempted'
- `TD` is no. of takedowns
- `TD_pct` is takedown percentages
- `SUB_ATT` is no. of submission attempts
- `PASS` is no. times the guard was passed?
- `REV` is the no. of reversals
- `CTRL` is the time spent with ground control
- `HEAD` is no. of significant strikes to the head 'landed of attempted'
- `BODY` is no. of significant strikes to the body 'landed of attempted'
- `CLINCH` is no. of significant strikes in the clinch 'landed of attempted'
- `GROUND` is no. of significant strikes on the ground 'landed of attempted'
- `win_by` is method of win
- `last_round` is last round of the fight (ex. if it was a KO in 1st, then this will be 1)
- `last_round_time` is when the fight ended in the last round
- `Format` is the format of the fight (3 rounds, 5 rounds etc.)
- `Referee` is the name of the Ref
- `date` is the date of the fight
- `location` is the location in which the event took place
- `Fight_type` is which weight class and whether it's a title bout or not
- `Winner` is the winner of the fight

#### Per fighter career wide stats
- `SLpM` - Significant Strikes Landed per Minute
- `Str_Acc.` - Significant Striking Accuracy
- `SApM` - Significant Strikes Absorbed per Minute
- `Str_Def` - Significant Strike Defense (the % of opponents strikes that did not land)
- `TD_Avg` - Average Takedowns Landed per 15 minutes
- `TD_Acc` - Takedown Accuracy
- `TD_Def` - Takedown Defense (the % of opponents TD attempts that did not land)
- `Sub_Avg` - Average Submissions Attempted per 15 minutes 

## Todo:

- Separate `landed of attempted` to separate columns

- Convert `Fight_type` into two separate columns, `weight_class` and `Title_fight` (True or False)

- Convert `last_round_time` to `total_time_fought` by using `last_round` and `Format`

- Convert `CTRL` to `time_in_CTRL`

- Convert percentages to fractions

- Since each row describes a single fight, we have to convert the data into a format that shows each fighter's compiled stats up until that fight. Every row will look very different from how it looks now.

- Create `current_win_streak`, `current_lose_streak`, `longest_win_streak`, `wins`, `losses`, `draw`

- Create fighter `height`, `reach`, `weight`, `age`

In [24]:
fighter_details.drop(
    columns=["SLpM",
            "Str_Acc",
            "SApM",
            "Str_Def",
            "TD_Avg",
            "TD_Acc",
            "TD_Def",
            "Sub_Avg",
        ], inplace=True)

### Splitting landed of attempted to different columns

In [25]:
columns = ['R_SIG_STR.', 'B_SIG_STR.', 'R_TOTAL_STR.', 'B_TOTAL_STR.',
       'R_TD', 'B_TD', 'R_HEAD', 'B_HEAD', 'R_BODY','B_BODY', 'R_LEG', 'B_LEG', 
        'R_DISTANCE', 'B_DISTANCE', 'R_CLINCH','B_CLINCH', 'R_GROUND', 'B_GROUND']

for column in columns:
    print(f"{column} data type is: {df[column].dtype}")

R_SIG_STR. data type is: object
B_SIG_STR. data type is: object
R_TOTAL_STR. data type is: object
B_TOTAL_STR. data type is: object
R_TD data type is: object
B_TD data type is: object
R_HEAD data type is: object
B_HEAD data type is: object
R_BODY data type is: object
B_BODY data type is: object
R_LEG data type is: object
B_LEG data type is: object
R_DISTANCE data type is: object
B_DISTANCE data type is: object
R_CLINCH data type is: object
B_CLINCH data type is: object
R_GROUND data type is: object
B_GROUND data type is: object


In [26]:
attempt_suffix = '_att'
landed_suffix = '_landed'

for column in columns:
    df[column+attempt_suffix] = df[column].apply(lambda X: int(X.split('of')[1]))
    df[column+landed_suffix] = df[column].apply(lambda X: int(X.split('of')[0]))
    
df.drop(columns, axis=1, inplace=True)
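
As an alternative to the two `apply` calls, the split can be done in one vectorized pass. This is a sketch on a hypothetical two-row sample, assuming every value follows the `N of M` pattern seen in the scraped columns.

```python
import pandas as pd

# Hypothetical two-row sample in the scraped 'landed of attempted' format.
sample = pd.DataFrame({"R_TD": ["2 of 5", "0 of 0"]})

# Capture both numbers at once; str.extract returns one column per regex group.
parts = sample["R_TD"].str.extract(r"(\d+)\s+of\s+(\d+)").astype(int)
sample["R_TD_landed"] = parts[0]
sample["R_TD_att"] = parts[1]
print(sample)
```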

In [27]:
df.columns

Index(['R_fighter', 'B_fighter', 'R_KD', 'B_KD', 'R_SIG_STR_pct',
       'B_SIG_STR_pct', 'R_TD_pct', 'B_TD_pct', 'R_SUB_ATT', 'B_SUB_ATT',
       'R_REV', 'B_REV', 'R_CTRL', 'B_CTRL', 'win_by', 'last_round',
       'last_round_time', 'Format', 'Referee', 'date', 'location',
       'Fight_type', 'Winner', 'R_SIG_STR._att', 'R_SIG_STR._landed',
       'B_SIG_STR._att', 'B_SIG_STR._landed', 'R_TOTAL_STR._att',
       'R_TOTAL_STR._landed', 'B_TOTAL_STR._att', 'B_TOTAL_STR._landed',
       'R_TD_att', 'R_TD_landed', 'B_TD_att', 'B_TD_landed', 'R_HEAD_att',
       'R_HEAD_landed', 'B_HEAD_att', 'B_HEAD_landed', 'R_BODY_att',
       'R_BODY_landed', 'B_BODY_att', 'B_BODY_landed', 'R_LEG_att',
       'R_LEG_landed', 'B_LEG_att', 'B_LEG_landed', 'R_DISTANCE_att',
       'R_DISTANCE_landed', 'B_DISTANCE_att', 'B_DISTANCE_landed',
       'R_CLINCH_att', 'R_CLINCH_landed', 'B_CLINCH_att', 'B_CLINCH_landed',
       'R_GROUND_att', 'R_GROUND_landed', 'B_GROUND_att', 'B_GROUND_landed'],
      dtype

### Replacing Winner NaNs with Draw

In [28]:
for column in df.columns:
    if df[column].isnull().sum() != 0:
        print(f"NaN values in {column} = {df[column].isnull().sum()}")

NaN values in Referee = 26
NaN values in Winner = 145


* 145 missing values in `Winner` and 26 missing values in `Referee`

In [29]:
df[df['Winner'].isnull()]['win_by'].value_counts()

win_by
Overturned              57
Decision - Majority     34
Could Not Continue      29
Decision - Split        17
Decision - Unanimous     6
Other                    2
Name: count, dtype: int64

* Here, `Overturned` means the result was overturned because of a positive drug test, and `Could Not Continue` means there was an illegal blow that did not warrant a disqualification but left the fighter unable to continue.
* The rest are different forms of draw.

* All of these are replaced with `Draw`.

In [30]:
# df['Winner'].fillna('Draw', inplace=True)
df.fillna({'Winner': 'Draw'}, inplace=True)

### Converting percentages to fractions

In [31]:
pct_columns = ['R_SIG_STR_pct','B_SIG_STR_pct', 'R_TD_pct', 'B_TD_pct']

def pct_to_frac(X):
    if X != '---':
        return float(X.replace('%', ''))/100
    else:
        # if '---' means it's taking pct of `0 of 0`. 
        # Taking a call here to consider 0 landed of 0 attempted as 0 percentage
        return 0

for column in pct_columns:
    df[column] = df[column].apply(pct_to_frac)
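
The same conversion can be sketched without a Python-level function. The sample values below are hypothetical. Mapping `'---'` to 0 mirrors the choice made in `pct_to_frac`; mapping it to NaN instead would be an equally defensible assumption.

```python
import pandas as pd

pct = pd.Series(["53%", "---", "100%"])  # hypothetical sample values

# Treat '---' (0 of 0 attempts) as 0%, then strip the sign and rescale to [0, 1].
frac = pct.replace("---", "0%").str.rstrip("%").astype(float) / 100
print(frac.tolist())
```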

In [32]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8037,8038,8039,8040,8041,8042,8043,8044,8045,8046
R_fighter,Justin Gaethje,Amanda Lemos,Jalin Turner,Joshua Van,Alex Pereira,King Green,Brunno Ferreira,Alex Morono,Djorden Santos,Mairon Santos,...,Patrick Smith,Royce Gracie,Johnny Rhodes,Jason DeLucia,Remco Pardoel,Orlando Wiet,Johnny Rhodes,Patrick Smith,Frank Hamaker,Scott Morris
B_fighter,Rafael Fiziev,Iasmin Lucindo,Ignacio Bahamondes,Rei Tsuruya,Magomed Ankalaev,Mauricio Ruffy,Armen Petrosyan,Carlos Leal,Ozzy Diaz,Francis Marshall,...,Scott Morris,Minoki Ichihara,Fred Ettish,Scott Baker,Alberta Cerra Leon,Robert Lucarelli,David Levicki,Ray Wizard,Thaddeus Luster,Sean Daugherty
R_KD,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
B_KD,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
R_SIG_STR_pct,0.53,0.57,0.35,0.6,0.55,0.26,0.48,0.42,0.37,0.44,...,0.76,0.5,0.44,0.6,0.66,0.66,0.64,1.0,0.66,1.0
B_SIG_STR_pct,0.57,0.25,0.62,0.38,0.52,0.44,0.55,0.65,0.51,0.27,...,0.0,0.42,0.57,0.0,0.33,0.33,0.8,1.0,0.0,0.0
R_TD_pct,0.0,0.6,0.0,0.0,0.0,0.0,0.28,0.0,0.33,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0
B_TD_pct,1.0,1.0,0.0,0.19,0.0,0.0,0.0,0.0,0.0,0.6,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
R_SUB_ATT,0,0,0,0,0,0,1,0,0,0,...,0,2,1,5,1,0,0,1,3,1
B_SUB_ATT,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


### Creating a title_bout feature and weight_class

In [33]:
df['Fight_type'].value_counts()

Fight_type
Lightweight Bout               1315
Welterweight Bout              1262
Middleweight Bout              1009
Featherweight Bout              757
Bantamweight Bout               678
                               ... 
UFC 6 Tournament Title Bout       1
UFC 5 Tournament Title Bout       1
UFC 4 Tournament Title Bout       1
UFC 3 Tournament Title Bout       1
UFC 2 Tournament Title Bout       1
Name: count, Length: 113, dtype: int64

In [34]:
df['Fight_type'].value_counts()[df['Fight_type'].value_counts() > 1].index

Index(['Lightweight Bout', 'Welterweight Bout', 'Middleweight Bout',
       'Featherweight Bout', 'Bantamweight Bout', 'Heavyweight Bout',
       'Light Heavyweight Bout', 'Flyweight Bout', 'Women's Strawweight Bout',
       'Women's Flyweight Bout', 'Women's Bantamweight Bout',
       'Open Weight Bout', 'Catch Weight Bout',
       'UFC Light Heavyweight Title Bout', 'UFC Welterweight Title Bout',
       'UFC Heavyweight Title Bout', 'UFC Middleweight Title Bout',
       'UFC Lightweight Title Bout', 'UFC Flyweight Title Bout',
       'UFC Bantamweight Title Bout', 'Women's Featherweight Bout',
       'UFC Featherweight Title Bout', 'UFC Women's Strawweight Title Bout',
       'UFC Women's Bantamweight Title Bout',
       'UFC Women's Flyweight Title Bout',
       'UFC Interim Heavyweight Title Bout',
       'UFC Women's Featherweight Title Bout',
       'UFC Superfight Championship Bout',
       'UFC Interim Featherweight Title Bout',
       'UFC Interim Bantamweight Title Bout',
   

In [35]:
df['title_bout'] = df['Fight_type'].apply(lambda X: True if 'Title Bout' in X else False)

In [36]:
def make_weight_class(X):
    for weight_class in weight_classes:
        if weight_class in X:
            return weight_class
    if X in ('Catch Weight Bout', 'Catchweight Bout'):
        return 'Catch Weight'
    else:
        return 'Open Weight'

In [37]:
weight_classes = ['Women\'s Strawweight', 'Women\'s Bantamweight', 
                  'Women\'s Featherweight', 'Women\'s Flyweight', 'Lightweight', 
                  'Welterweight', 'Middleweight','Light Heavyweight', 
                  'Heavyweight', 'Featherweight','Bantamweight', 'Flyweight', 'Open Weight']

df['weight_class'] = df['Fight_type'].apply(make_weight_class)

In [38]:
df[df['weight_class'].isnull()]['Fight_type'].value_counts()

Series([], Name: count, dtype: int64)
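
A self-contained sketch of the same matching idea, useful for spot-checking a few labels by hand. `CLASSES` is a shortened, hypothetical list and `parse_weight_class` is an illustrative name, not the notebook's function.

```python
# Hypothetical, shortened class list; order matters because 'Heavyweight'
# is a substring of 'Light Heavyweight'.
CLASSES = ["Women's Flyweight", "Light Heavyweight", "Heavyweight",
           "Flyweight", "Open Weight"]

def parse_weight_class(fight_type: str) -> str:
    for name in CLASSES:
        if name in fight_type:
            return name
    # `x == a or b` is always truthy when b is a non-empty string,
    # so membership is tested with `in` instead.
    if fight_type in ("Catch Weight Bout", "Catchweight Bout"):
        return "Catch Weight"
    return "Open Weight"

for label in ["UFC Light Heavyweight Title Bout", "Catch Weight Bout",
              "UFC 2 Tournament Title Bout"]:
    print(label, "->", parse_weight_class(label))
```

Labels that name no weight class at all, such as the early tournament bouts, fall through to `Open Weight`.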

### Creating total_time_fought

In [39]:
df['Format'].value_counts()

Format
3 Rnd (5-5-5)           7100
5 Rnd (5-5-5-5-5)        715
1 Rnd + OT (12-3)         80
3 Rnd + OT (5-5-5-5)      38
No Time Limit             29
1 Rnd (20)                21
1 Rnd + 2OT (15-3-3)      20
2 Rnd (5-5)               14
1 Rnd (15)                 8
1 Rnd (10)                 6
1 Rnd (12)                 4
1 Rnd + OT (30-5)          3
1 Rnd + OT (15-3)          2
1 Rnd (18)                 2
1 Rnd + 2OT (24-3-3)       1
1 Rnd + OT (27-3)          1
1 Rnd + OT (30-3)          1
1 Rnd (30)                 1
1 Rnd + OT (31-5)          1
Name: count, dtype: int64

In [40]:
df['Format'].value_counts().index

Index(['3 Rnd (5-5-5)', '5 Rnd (5-5-5-5-5)', '1 Rnd + OT (12-3)',
       '3 Rnd + OT (5-5-5-5)', 'No Time Limit', '1 Rnd (20)',
       '1 Rnd + 2OT (15-3-3)', '2 Rnd (5-5)', '1 Rnd (15)', '1 Rnd (10)',
       '1 Rnd (12)', '1 Rnd + OT (30-5)', '1 Rnd + OT (15-3)', '1 Rnd (18)',
       '1 Rnd + 2OT (24-3-3)', '1 Rnd + OT (27-3)', '1 Rnd + OT (30-3)',
       '1 Rnd (30)', '1 Rnd + OT (31-5)'],
      dtype='object', name='Format')

In [41]:
time_in_first_round = {'3 Rnd (5-5-5)': 5*60, '5 Rnd (5-5-5-5-5)': 5*60, '1 Rnd + OT (12-3)': 12*60,
       'No Time Limit': 1, '3 Rnd + OT (5-5-5-5)': 5*60, '1 Rnd (20)': 20*60,
       '2 Rnd (5-5)': 5*60, '1 Rnd (15)': 15*60, '1 Rnd (10)': 10*60,
       '1 Rnd (12)':12*60, '1 Rnd + OT (30-5)': 30*60, '1 Rnd (18)': 18*60, '1 Rnd + OT (15-3)': 15*60,
       '1 Rnd (30)': 30*60, '1 Rnd + OT (31-5)': 31*60,
       '1 Rnd + OT (27-3)': 27*60, '1 Rnd + OT (30-3)': 30*60}

exception_format_time = {'1 Rnd + 2OT (15-3-3)': [15*60, 3*60], '1 Rnd + 2OT (24-3-3)': [24*60, 3*60]}

# '1 Rnd + 2OT (15-3-3)' and '1 Rnd + 2OT (24-3-3)' are not included above because their
# three rounds have unequal lengths. They are handled separately via exception_format_time.

In [42]:
# Converting to seconds
df['last_round_time'] = df['last_round_time'].apply(lambda X: int(X.split(':')[0])*60 + int(X.split(':')[1]))

In [43]:
def get_total_time(row):
    if row['Format'] in time_in_first_round.keys():
        return (row['last_round'] - 1) * time_in_first_round[row['Format']] + row['last_round_time']
    elif row['Format'] in exception_format_time.keys():
        if (row['last_round'] - 1) >= 2:
            return exception_format_time[row['Format']][0] + (row['last_round'] - 2) * \
                    exception_format_time[row['Format']][1] + row['last_round_time']
        else:
            return (row['last_round'] - 1) * exception_format_time[row['Format']][0] + row['last_round_time']
    
# So if the fight ended in round 1, we only need last_round_time. 
# If it ended in round 2, we need the full time of round 1 and the last_round_time
# This works for fights with same time in each round and fights with only two rounds.
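
The same arithmetic can be sketched by parsing the round lengths straight out of the `Format` string, which avoids a hand-written lookup table. `elapsed_seconds` is a hypothetical helper, and it assumes the minutes always appear in parentheses separated by hyphens, as in the formats listed above.

```python
import re

def elapsed_seconds(fmt: str, last_round: int, last_round_seconds: int) -> int:
    """Seconds fought, given a Format string such as '1 Rnd + 2OT (15-3-3)'."""
    match = re.search(r"\(([\d-]+)\)", fmt)
    if match is None:  # e.g. 'No Time Limit': a single open-ended round
        return last_round_seconds
    minutes = [int(m) for m in match.group(1).split("-")]
    # Full length of every completed round, plus time elapsed in the final one.
    return sum(minutes[: last_round - 1]) * 60 + last_round_seconds

print(elapsed_seconds("3 Rnd (5-5-5)", 1, 149))
print(elapsed_seconds("5 Rnd (5-5-5-5-5)", 5, 300))
print(elapsed_seconds("1 Rnd + 2OT (15-3-3)", 3, 60))
```

Because every completed round contributes its own length, formats with unequal rounds need no special case.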

In [44]:
df['total_time_fought(seconds)'] = df.apply(get_total_time, axis=1)

In [45]:
def get_no_of_rounds(X):
    if X == 'No Time Limit':
        return 1
    else:
        return len(X.split('(')[1].replace(')', '').split('-'))

df['no_of_rounds'] = df['Format'].apply(get_no_of_rounds)

In [46]:
df

Unnamed: 0,R_fighter,B_fighter,R_KD,B_KD,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,...,B_CLINCH_att,B_CLINCH_landed,R_GROUND_att,R_GROUND_landed,B_GROUND_att,B_GROUND_landed,title_bout,weight_class,total_time_fought(seconds),no_of_rounds
0,Justin Gaethje,Rafael Fiziev,1,0,0.53,0.57,0.0,1.00,0,0,...,14,14,8,5,5,5,False,Lightweight,900,3
1,Amanda Lemos,Iasmin Lucindo,0,0,0.57,0.25,0.6,1.00,0,0,...,2,2,6,4,1,1,False,Women's Strawweight,900,3
2,Jalin Turner,Ignacio Bahamondes,0,0,0.35,0.62,0.0,0.00,0,1,...,0,0,7,2,5,4,False,Lightweight,149,3
3,Joshua Van,Rei Tsuruya,0,0,0.60,0.38,0.0,0.19,0,0,...,1,0,2,1,0,0,False,Flyweight,900,3
4,Alex Pereira,Magomed Ankalaev,0,0,0.55,0.52,0.0,0.00,0,0,...,21,19,0,0,0,0,True,Light Heavyweight,1500,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8042,Orlando Wiet,Robert Lucarelli,0,0,0.66,0.33,0.0,1.00,0,1,...,0,0,9,7,0,0,False,Open Weight,170,1
8043,Johnny Rhodes,David Levicki,0,0,0.64,0.80,1.0,0.00,0,0,...,2,2,15,9,1,1,False,Open Weight,733,1
8044,Patrick Smith,Ray Wizard,0,0,1.00,1.00,0.0,0.00,1,0,...,0,0,0,0,0,0,False,Open Weight,58,1
8045,Frank Hamaker,Thaddeus Luster,0,0,0.66,0.00,1.0,0.00,3,0,...,0,0,2,1,0,0,False,Open Weight,292,1


In [47]:
df.drop(['Format', 'Fight_type', 'last_round_time'], axis = 1, inplace=True)

### Create CTRL_time(seconds)

In [48]:
CTRL_columns = ['R_CTRL','B_CTRL']

def conv_to_sec(X):
    if X != '--':
        return int(X.split(':')[0])*60 + int(X.split(':')[1])
    else:
        # if '--' means there was no time spent on the ground. 
        # Taking a call here to consider this as 0 seconds
        return 0

for column in CTRL_columns:
    df[column+'_time(seconds)'] = df[column].apply(conv_to_sec)
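
A vectorized sketch of the same `mm:ss` conversion, on hypothetical sample values. As in `conv_to_sec`, `'--'` is assumed to mean zero seconds of control time.

```python
import pandas as pd

ctrl = pd.Series(["1:32", "--", "7:58"])  # hypothetical sample values

# Replace the placeholder first, then split minutes and seconds into two columns.
parts = ctrl.replace("--", "0:00").str.split(":", expand=True).astype(int)
ctrl_seconds = parts[0] * 60 + parts[1]
print(ctrl_seconds.tolist())
```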

In [49]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8037,8038,8039,8040,8041,8042,8043,8044,8045,8046
R_fighter,Justin Gaethje,Amanda Lemos,Jalin Turner,Joshua Van,Alex Pereira,King Green,Brunno Ferreira,Alex Morono,Djorden Santos,Mairon Santos,...,Patrick Smith,Royce Gracie,Johnny Rhodes,Jason DeLucia,Remco Pardoel,Orlando Wiet,Johnny Rhodes,Patrick Smith,Frank Hamaker,Scott Morris
B_fighter,Rafael Fiziev,Iasmin Lucindo,Ignacio Bahamondes,Rei Tsuruya,Magomed Ankalaev,Mauricio Ruffy,Armen Petrosyan,Carlos Leal,Ozzy Diaz,Francis Marshall,...,Scott Morris,Minoki Ichihara,Fred Ettish,Scott Baker,Alberta Cerra Leon,Robert Lucarelli,David Levicki,Ray Wizard,Thaddeus Luster,Sean Daugherty
R_KD,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
B_KD,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
R_SIG_STR_pct,0.53,0.57,0.35,0.6,0.55,0.26,0.48,0.42,0.37,0.44,...,0.76,0.5,0.44,0.6,0.66,0.66,0.64,1.0,0.66,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
weight_class,Lightweight,Women's Strawweight,Lightweight,Flyweight,Light Heavyweight,Lightweight,Middleweight,Welterweight,Middleweight,Featherweight,...,Open Weight,Open Weight,Open Weight,Open Weight,Open Weight,Open Weight,Open Weight,Open Weight,Open Weight,Open Weight
total_time_fought(seconds),900,900,149,900,1500,127,567,256,900,900,...,30,308,187,401,591,170,733,58,292,20
no_of_rounds,3,3,3,3,5,3,3,3,3,3,...,1,1,1,1,1,1,1,1,1,1
R_CTRL_time(seconds),92,478,103,50,0,0,72,20,26,2,...,0,0,0,0,0,0,0,0,0,0


In [50]:
df.drop(['R_CTRL','B_CTRL'], axis = 1, inplace=True)

### Create another DataFrame to save the compiled data per fighter (Our Prediction DataFrame)

In [51]:
df.columns

Index(['R_fighter', 'B_fighter', 'R_KD', 'B_KD', 'R_SIG_STR_pct',
       'B_SIG_STR_pct', 'R_TD_pct', 'B_TD_pct', 'R_SUB_ATT', 'B_SUB_ATT',
       'R_REV', 'B_REV', 'win_by', 'last_round', 'Referee', 'date', 'location',
       'Winner', 'R_SIG_STR._att', 'R_SIG_STR._landed', 'B_SIG_STR._att',
       'B_SIG_STR._landed', 'R_TOTAL_STR._att', 'R_TOTAL_STR._landed',
       'B_TOTAL_STR._att', 'B_TOTAL_STR._landed', 'R_TD_att', 'R_TD_landed',
       'B_TD_att', 'B_TD_landed', 'R_HEAD_att', 'R_HEAD_landed', 'B_HEAD_att',
       'B_HEAD_landed', 'R_BODY_att', 'R_BODY_landed', 'B_BODY_att',
       'B_BODY_landed', 'R_LEG_att', 'R_LEG_landed', 'B_LEG_att',
       'B_LEG_landed', 'R_DISTANCE_att', 'R_DISTANCE_landed', 'B_DISTANCE_att',
       'B_DISTANCE_landed', 'R_CLINCH_att', 'R_CLINCH_landed', 'B_CLINCH_att',
       'B_CLINCH_landed', 'R_GROUND_att', 'R_GROUND_landed', 'B_GROUND_att',
       'B_GROUND_landed', 'title_bout', 'weight_class',
       'total_time_fought(seconds)', 'no_of_rounds',

In [52]:
df2 = df.copy()

In [53]:
df2.drop(['R_KD', 'B_KD', 'R_SIG_STR_pct',
       'B_SIG_STR_pct', 'R_TD_pct', 'B_TD_pct', 'R_SUB_ATT', 'B_SUB_ATT',
       'R_CTRL_time(seconds)', 'B_CTRL_time(seconds)', 'R_REV', 'B_REV', 'win_by', 'last_round', 
        'R_SIG_STR._att', 'R_SIG_STR._landed',
       'B_SIG_STR._att', 'B_SIG_STR._landed', 'R_TOTAL_STR._att',
       'R_TOTAL_STR._landed', 'B_TOTAL_STR._att', 'B_TOTAL_STR._landed',
       'R_TD_att', 'R_TD_landed', 'B_TD_att', 'B_TD_landed', 'R_HEAD_att',
       'R_HEAD_landed', 'B_HEAD_att', 'B_HEAD_landed', 'R_BODY_att',
       'R_BODY_landed', 'B_BODY_att', 'B_BODY_landed', 'R_LEG_att',
       'R_LEG_landed', 'B_LEG_att', 'B_LEG_landed', 'R_DISTANCE_att',
       'R_DISTANCE_landed', 'B_DISTANCE_att', 'B_DISTANCE_landed',
       'R_CLINCH_att', 'R_CLINCH_landed', 'B_CLINCH_att', 'B_CLINCH_landed',
       'R_GROUND_att', 'R_GROUND_landed', 'B_GROUND_att', 'B_GROUND_landed',
        'total_time_fought(seconds)'], axis = 1, inplace=True)
df2

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds
0,Justin Gaethje,Rafael Fiziev,Jason Herzog,"March 08, 2025","Las Vegas, Nevada, USA",Justin Gaethje,False,Lightweight,3
1,Amanda Lemos,Iasmin Lucindo,Keith Peterson,"March 08, 2025","Las Vegas, Nevada, USA",Amanda Lemos,False,Women's Strawweight,3
2,Jalin Turner,Ignacio Bahamondes,Mark Smith,"March 08, 2025","Las Vegas, Nevada, USA",Ignacio Bahamondes,False,Lightweight,3
3,Joshua Van,Rei Tsuruya,Jason Herzog,"March 08, 2025","Las Vegas, Nevada, USA",Joshua Van,False,Flyweight,3
4,Alex Pereira,Magomed Ankalaev,Marc Goddard,"March 08, 2025","Las Vegas, Nevada, USA",Magomed Ankalaev,True,Light Heavyweight,5
...,...,...,...,...,...,...,...,...,...
8042,Orlando Wiet,Robert Lucarelli,John McCarthy,"March 11, 1994","Denver, Colorado, USA",Orlando Wiet,False,Open Weight,1
8043,Johnny Rhodes,David Levicki,John McCarthy,"March 11, 1994","Denver, Colorado, USA",Johnny Rhodes,False,Open Weight,1
8044,Patrick Smith,Ray Wizard,John McCarthy,"March 11, 1994","Denver, Colorado, USA",Patrick Smith,False,Open Weight,1
8045,Frank Hamaker,Thaddeus Luster,John McCarthy,"March 11, 1994","Denver, Colorado, USA",Frank Hamaker,False,Open Weight,1


### Compiling Data per fighter

In [54]:
red_fighters = df['R_fighter'].value_counts().index
blue_fighters = df['B_fighter'].value_counts().index

fighters = list(set(red_fighters) | set(blue_fighters))

In [55]:
def get_renamed_winner(row):
    if row['R_fighter'] == row['Winner']:
        return 'Red'
    elif row['B_fighter'] == row['Winner']:
        return 'Blue'
    elif row['Winner'] == 'Draw':
        return 'Draw'

df2['Winner'] = df2[['R_fighter', 'B_fighter', 'Winner']].apply(get_renamed_winner, axis=1)
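
The relabelling can also be sketched with `np.select`, shown here on a hypothetical three-fight frame. One difference to be aware of: this version labels anything that matches neither corner as `Draw`, whereas `get_renamed_winner` returns `None` for a value that is neither a corner name nor `'Draw'`.

```python
import numpy as np
import pandas as pd

# Hypothetical fights: a red win, a blue win and a draw.
fights = pd.DataFrame({
    "R_fighter": ["A", "C", "E"],
    "B_fighter": ["B", "D", "F"],
    "Winner":    ["A", "D", "Draw"],
})

# Conditions are checked in order; rows matching none get the default.
fights["corner"] = np.select(
    [fights["Winner"] == fights["R_fighter"],
     fights["Winner"] == fights["B_fighter"]],
    ["Red", "Blue"],
    default="Draw",
)
print(fights)
```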

In [56]:
df = pd.concat([df,pd.get_dummies(df['win_by'], prefix='win_by')],axis=1)
df.drop(['win_by'],axis=1, inplace=True)

In [57]:
Numerical_columns = ['hero_KD', 'opp_KD', 'hero_SIG_STR_pct',
       'opp_SIG_STR_pct', 'hero_TD_pct', 'opp_TD_pct', 'hero_SUB_ATT', 'opp_SUB_ATT',
        'hero_REV', 'opp_REV', 'hero_SIG_STR._att', 'hero_SIG_STR._landed',
       'opp_SIG_STR._att', 'opp_SIG_STR._landed', 'hero_TOTAL_STR._att',
       'hero_TOTAL_STR._landed', 'opp_TOTAL_STR._att', 'opp_TOTAL_STR._landed',
       'hero_TD_att', 'hero_TD_landed', 'opp_TD_att', 'opp_TD_landed', 'hero_HEAD_att',
       'hero_HEAD_landed', 'opp_HEAD_att', 'opp_HEAD_landed', 'hero_BODY_att',
       'hero_BODY_landed', 'opp_BODY_att', 'opp_BODY_landed', 'hero_LEG_att',
       'hero_LEG_landed', 'opp_LEG_att', 'opp_LEG_landed', 'hero_DISTANCE_att',
       'hero_DISTANCE_landed', 'opp_DISTANCE_att', 'opp_DISTANCE_landed',
       'hero_CLINCH_att', 'hero_CLINCH_landed', 'opp_CLINCH_att', 'opp_CLINCH_landed',
       'hero_GROUND_att', 'hero_GROUND_landed', 'opp_GROUND_att', 'opp_GROUND_landed', 
        'hero_CTRL_time(seconds)', 'opp_CTRL_time(seconds)',
       'total_time_fought(seconds)']

Categorical_columns = ['win_by', 'last_round',
        'Winner', 'title_bout']

For all `Numerical_columns`, we take an exponentially weighted average (`ewm` with `span=3`) of each column over every fight the fighter had up until that point.

For `Categorical_columns`, we have to come up with different ideas for each column:

* Each `win_by` will be a column of its own
* From `last_round` we can get `total_rounds_fought`
* From `total_time_fought` we can get `average_time_fought`
* From `Winner` we get `wins`, `losses`, `draw`, `current_streak`, `longest_streak`
* From `title_bout` we can get `no_of_title_fights`
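
The aggregation cell further down averages the numerical columns with `ewm(span=3, adjust=False)`, which weights recent fights more heavily than a plain mean. A small worked example on hypothetical values shows what that recursion produces.

```python
import pandas as pd

strikes = pd.Series([1.0, 0.0, 2.0])  # hypothetical per-fight values, oldest first

# span=3 gives alpha = 2 / (span + 1) = 0.5; with adjust=False the recursion is
# y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t].
smoothed = strikes.ewm(span=3, adjust=False).mean()
print(smoothed.tolist())
```

The last value is the one kept for a fighter's row, which is why the fights have to be ordered oldest first before smoothing.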

In [58]:
import re

def lreplace(pattern, sub, string):
    """
    Replaces 'pattern' in 'string' with 'sub' if 'pattern' starts 'string'.
    """
    return re.sub('^%s' % pattern, sub, string)

In [59]:
red = df.groupby('R_fighter')
blue = df.groupby('B_fighter')

In [60]:
def get_fighter_red(fighter_name):
    try:
        fighter_red = red.get_group(fighter_name)
    except KeyError:  # this fighter never fought from the red corner
        return None
    rename_columns = {}
    for column in fighter_red.columns:
        if re.search('^R_', column) is not None:
            rename_columns[column] = lreplace('R_', 'hero_', column)
        elif re.search('^B_', column) is not None:
            rename_columns[column] = lreplace('B_', 'opp_', column)
    fighter_red = fighter_red.rename(rename_columns, axis='columns')
    return fighter_red

In [61]:
def get_fighter_blue(fighter_name):
    try:
        fighter_blue = blue.get_group(fighter_name)
    except KeyError:  # this fighter never fought from the blue corner
        return None
    rename_columns = {}
    for column in fighter_blue.columns:
        if re.search('^B_', column) is not None:
            rename_columns[column] = lreplace('B_', 'hero_', column)
        elif re.search('^R_', column) is not None:
            rename_columns[column] = lreplace('R_', 'opp_', column)
    fighter_blue = fighter_blue.rename(rename_columns, axis='columns')
    return fighter_blue
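
The two functions above differ only in which prefix becomes `hero_`. A minimal sketch of that renaming on a hypothetical one-row frame follows; `as_hero` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical single fight with one stat per corner.
row = pd.DataFrame({"R_fighter": ["A"], "B_fighter": ["B"],
                    "R_KD": [1], "B_KD": [0], "date": ["x"]})

def as_hero(frame: pd.DataFrame, hero_prefix: str, opp_prefix: str) -> pd.DataFrame:
    """Rename corner-prefixed columns from one fighter's point of view."""
    mapping = {}
    for col in frame.columns:
        if col.startswith(hero_prefix):
            mapping[col] = "hero_" + col[len(hero_prefix):]
        elif col.startswith(opp_prefix):
            mapping[col] = "opp_" + col[len(opp_prefix):]
    return frame.rename(columns=mapping)

# View the fight from the blue corner: B_* becomes hero_*, R_* becomes opp_*.
blue_view = as_hero(row, hero_prefix="B_", opp_prefix="R_")
print(list(blue_view.columns))
```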

In [62]:
def get_result_stats(result_list):
    # result_list is expected oldest fight first. The caller already sorts
    # fighter_slice that way, so reversing here would walk the fights backwards.
    current_win_streak = 0
    current_lose_streak = 0
    longest_win_streak = 0
    wins = 0
    losses = 0
    draw = 0
    for result in result_list:
        if result == 'hero':
            wins += 1
            current_win_streak += 1
            current_lose_streak = 0
            if longest_win_streak < current_win_streak:
                longest_win_streak += 1
        elif result == 'opp':
            losses += 1
            current_win_streak = 0
            current_lose_streak += 1
        elif result == 'draw':
            draw += 1
            current_lose_streak = 0
            current_win_streak = 0
            
    return current_win_streak, current_lose_streak, longest_win_streak, wins, losses, draw
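
A worked example helps check the streak logic by hand. `streaks` is a hypothetical condensed version of the function above, and the sketch assumes the result labels are ordered oldest fight first.

```python
def streaks(results):
    """results: 'hero' / 'opp' / 'draw' labels, oldest fight first."""
    current_win = current_loss = longest_win = 0
    for r in results:
        if r == "hero":
            current_win += 1
            current_loss = 0
            longest_win = max(longest_win, current_win)
        elif r == "opp":
            current_loss += 1
            current_win = 0
        else:  # a draw ends both streaks
            current_win = current_loss = 0
    return current_win, current_loss, longest_win

# Two wins, a loss, then a win.
print(streaks(["hero", "hero", "opp", "hero"]))
```

Feeding the same list in reverse order changes the current streaks, so the ordering of the input matters.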

In [63]:
win_by_columns = ['win_by_Decision - Majority', 'win_by_Decision - Split',
       'win_by_Decision - Unanimous', 'win_by_KO/TKO','win_by_Submission',
       'win_by_TKO - Doctor\'s Stoppage']

In [64]:
temp_blue_frame = pd.DataFrame()
temp_red_frame = pd.DataFrame()
result_stats = ['current_win_streak', 'current_lose_streak', 'longest_win_streak', 'wins', 'losses', 'draw']

for fighter_name in fighters:
    fighter_red = get_fighter_red(fighter_name)
    fighter_blue = get_fighter_blue(fighter_name)
    fighter_index = None
    
    if fighter_red is None:
        fighter = fighter_blue
        fighter_index = 'blue'
    elif fighter_blue is None:
        fighter = fighter_red
        fighter_index = 'red'
    else:
        fighter = pd.concat([fighter_red, fighter_blue]).sort_index()
    
    fighter['Winner'] = fighter['Winner'].apply(lambda X: 'hero' if X == fighter_name else 'opp')

    for i, index in enumerate(fighter.index):
        fighter_slice = fighter[(i+1):].sort_index(ascending=False)
        s = fighter_slice[Numerical_columns].ewm(span=3, adjust=False).mean().tail(1)
        if len(s) != 0:
            pass
        else:
            s.loc[len(s)] = [np.NaN for _ in s.columns]
        s['total_rounds_fought'] = fighter_slice['last_round'].sum()
        s['total_title_bouts'] = fighter_slice[fighter_slice['title_bout']==True]['title_bout'].count()
        s['hero_fighter'] = fighter_name
        results = get_result_stats(list(fighter_slice['Winner']))
        for result_stat, result in zip(result_stats, results):
            s[result_stat] = result
        win_by_results = fighter_slice[fighter_slice['Winner'] == 'hero'][win_by_columns].sum()
        for win_by_column,win_by_result in zip(win_by_columns, win_by_results):
            s[win_by_column] = win_by_result
        s.index = [index]


        # DataFrame.append was removed in pandas 2.0, so rows are added with pd.concat
        if fighter_index is None:
            if index in fighter_blue.index:
                temp_blue_frame = pd.concat([temp_blue_frame, s])
            elif index in fighter_red.index:
                temp_red_frame = pd.concat([temp_red_frame, s])
        elif fighter_index == 'blue':
            temp_blue_frame = pd.concat([temp_blue_frame, s])
        elif fighter_index == 'red':
            temp_red_frame = pd.concat([temp_red_frame, s])

In [None]:
temp_blue_frame.T

Unnamed: 0,3483,3604,1782,2001,1708,2104,3818,4812,5173,5342,...,5027,5182,5432,5565,2540,3016,250,3301,3513,4266
hero_KD,1,,0,0,0.375095,0.750191,0.0488281,0.390625,0.5,1,...,0,,0,,0.25,,,0.15625,0.3125,0
opp_KD,0,,0.5,0,1.53516,0.0703125,0,0,0,0,...,0,,0,,0.25,,,0.28125,0.5625,2
hero_SIG_STR_pct,0.42,,0.4475,0.485,0.546008,0.602016,0.376074,0.428594,0.475,0.54,...,0.4,,0.385,,0.3925,,,0.491875,0.60375,0.42
opp_SIG_STR_pct,0.43,,0.40875,0.4275,0.536013,0.462026,0.598701,0.429609,0.4475,0.485,...,0.38,,0.425,,0.3925,,,0.479063,0.418125,0.46
hero_TD_pct,1,,0.0825,0.165,0.0369015,0.0738029,0.253555,0.0284375,0.25,0,...,0.205,,0.1,,0.75,,,0.152969,0.305938,0.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
win_by_Decision - Split,0,0,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
win_by_Decision - Unanimous,1,0,1,1,4,4,1,1,0,0,...,1,0,0,0,0,0,0,2,2,0
win_by_KO/TKO,0,0,0,0,4,4,2,2,0,0,...,0,0,2,0,1,0,0,3,3,1
win_by_Submission,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
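
The fractional values in the table above (for example 0.25, 0.5 and 0.15625 in the `hero_KD` row) come from the exponentially weighted mean taken over each fighter's earlier fights. As a minimal sketch, assuming a made-up series of knockdown counts ordered oldest fight first, the snippet below checks `ewm(span=3, adjust=False)` against the recursion it is documented to follow, where `alpha = 2 / (span + 1)`. The numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical knockdown counts for one fighter, oldest fight first.
kd = pd.Series([1.0, 0.0, 0.0])

span = 3
alpha = 2 / (span + 1)  # span=3 gives alpha = 0.5

# Manual recursion used when adjust=False:
#   y[0] = x[0];  y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = [kd.iloc[0]]
for x in kd.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

# Same quantity from pandas; the last value is what .tail(1) keeps.
ewm_last = kd.ewm(span=span, adjust=False).mean().iloc[-1]

print(manual)    # [1.0, 0.5, 0.25]
print(ewm_last)  # 0.25
```

With `span=3` each new fight gets weight 0.5 and everything before it shares the other 0.5, so recent performances dominate the feature.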


### Adding fighter details like height, weight, reach, stance and DOB

In [None]:
fighter_details = fighter_details[fighter_details.index.isin(fighters)]
for col in fighter_details.columns:
    print(f"Number of NaN in {col} : {fighter_details[col].isnull().sum()}")

Number of NaN in Height : 13
Number of NaN in Weight : 10
Number of NaN in Reach : 645
Number of NaN in Stance : 75
Number of NaN in DOB : 141
Number of NaN in SLpM : 0
Number of NaN in Str_Acc : 0
Number of NaN in SApM : 0
Number of NaN in Str_Def : 0
Number of NaN in TD_Avg : 0
Number of NaN in TD_Acc : 0
Number of NaN in TD_Def : 0
Number of NaN in Sub_Avg : 0


In [None]:
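def convert_to_cms(X):
    # Missing values pass through unchanged; pd.isna is a safer check than `is np.NaN`
    if pd.isna(X):
        return X
    elif len(X.split("'")) == 2:
        # Feet and inches separated by an apostrophe
        feet = float(X.split("'")[0])
        inches = int(X.split("'")[1].replace(' ', '').replace('"',''))
        return (feet * 30.48) + (inches * 2.54)
    else:
        # Inches only
        return float(X.replace('"','')) * 2.54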

In [None]:
fighter_details['Height_cms'] = fighter_details['Height'].apply(convert_to_cms)
fighter_details['Reach_cms'] = fighter_details['Reach'].apply(convert_to_cms)
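
As a quick sanity check of the conversion arithmetic, here is a small self-contained sketch. The helper name `height_to_cm` and the input strings `5' 11"` and `72"` are assumptions inferred from how `convert_to_cms` parses its input; they are not copied from the raw file.

```python
def height_to_cm(text):
    """Hypothetical stand-alone version of the feet-and-inches branch."""
    feet, inches = text.split("'")
    inches = inches.replace('"', "").strip()
    # 1 foot = 30.48 cm, 1 inch = 2.54 cm
    return float(feet) * 30.48 + float(inches) * 2.54


def inches_to_cm(text):
    """Hypothetical stand-alone version of the inches-only branch."""
    return float(text.replace('"', "")) * 2.54


height = round(height_to_cm("5' 11\""), 2)
reach = round(inches_to_cm('72"'), 2)
print(height)  # 180.34
print(reach)   # 182.88
```

The first result, 180.34, is the same `Height_cms` value that shows up for several fighters later in this notebook.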

In [None]:
fighter_details['Weight_lbs'] = fighter_details['Weight'].apply(lambda X: X if pd.isna(X) else float(X.replace(' lbs.', '')))

In [None]:
pct_columns = ['Str_Acc','Str_Def', 'TD_Acc', 'TD_Def']

def pct_to_frac(X):
    # NaN never compares equal to anything, so missing values are detected with pd.isna
    if pd.isna(X):
        return 0
    return float(X.replace('%', ''))/100

for column in pct_columns:
    fighter_details[column] = fighter_details[column].apply(pct_to_frac)
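
A short aside on detecting missing values, since it matters for helpers like the ones above: NaN is defined to be unequal to everything, including itself, so `==` and `!=` cannot be used to find it. The sketch below shows the behaviour with a plain float NaN; the variable name `missing` is only for illustration.

```python
import numpy as np
import pandas as pd

missing = np.nan

# Comparisons against NaN never report equality.
neq_result = missing != np.nan   # True, even though both sides are NaN
eq_result = missing == missing   # False

# pd.isna is the reliable test and also handles None and NaT.
isna_result = pd.isna(missing)   # True

print(neq_result, eq_result, isna_result)  # True False True
```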

In [None]:
fighter_details.drop(['Height', 'Weight', 'Reach'], axis=1, inplace=True)

In [None]:
fighter_details.reset_index(inplace=True)
temp_red_frame.reset_index(inplace=True)
temp_blue_frame.reset_index(inplace=True)

In [None]:
temp_blue_frame = temp_blue_frame.merge(fighter_details, left_on='hero_fighter', right_on='fighter_name', how='left')
temp_blue_frame.set_index('index', inplace=True)
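
The reset, merge, then `set_index('index')` sequence is needed because a column-on-column merge discards the left frame's index. A minimal sketch with toy data (fighter names `A` and `B` and all values are made up) shows that `how='left'` keeps every fight row, fills details with NaN when a fighter has none, and that the original fight labels survive only because they were moved into the `index` column first.

```python
import pandas as pd

# Toy per-fight stats; "index" holds the original fight row labels.
stats = pd.DataFrame({"index": [10, 11, 12],
                      "hero_fighter": ["A", "A", "B"]})

# Toy fighter details; fighter "B" has no entry.
details = pd.DataFrame({"fighter_name": ["A"],
                        "Height_cms": [180.34]})

merged = stats.merge(details, left_on="hero_fighter",
                     right_on="fighter_name", how="left")

# Restore the fight labels as the index after the merge.
merged = merged.set_index("index")

print(merged)
```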

In [None]:
temp_blue_frame[['hero_fighter', 'fighter_name', 'Height_cms', 'Weight_lbs', 'DOB']].head(20)

index,hero_fighter,fighter_name,Height_cms,Weight_lbs,DOB
3483,Kazuki Tokudome,Kazuki Tokudome,180.34,155.0,"Mar 04, 1987"
3604,Kazuki Tokudome,Kazuki Tokudome,180.34,155.0,"Mar 04, 1987"
1782,Bec Rawlings,Bec Rawlings,167.64,125.0,"Feb 11, 1989"
2001,Bec Rawlings,Bec Rawlings,167.64,125.0,"Feb 11, 1989"
1708,Patrick Cote,Patrick Cote,180.34,170.0,"Feb 29, 1980"
2104,Patrick Cote,Patrick Cote,180.34,170.0,"Feb 29, 1980"
3818,Patrick Cote,Patrick Cote,180.34,170.0,"Feb 29, 1980"
4812,Patrick Cote,Patrick Cote,180.34,170.0,"Feb 29, 1980"
5173,Patrick Cote,Patrick Cote,180.34,170.0,"Feb 29, 1980"
5342,Patrick Cote,Patrick Cote,180.34,170.0,"Feb 29, 1980"


In [None]:
temp_red_frame = temp_red_frame.merge(fighter_details, left_on='hero_fighter', right_on='fighter_name', how='left')
temp_red_frame.set_index('index', inplace=True)

In [None]:
temp_blue_frame.drop('fighter_name', axis=1, inplace=True)
temp_red_frame.drop('fighter_name', axis=1, inplace=True)

In [None]:
blue_frame = temp_blue_frame.add_prefix('B_')
red_frame = temp_red_frame.add_prefix('R_')

In [None]:
frame = blue_frame.join(red_frame, how='outer')
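
To make the prefix-and-join step concrete, here is a small sketch with toy frames (the `wins` column values and the row labels are made up). `add_prefix` keeps the two corners' columns apart, and an outer join on the index keeps a fight row even if only one corner has a record for it.

```python
import pandas as pd

# Toy per-corner frames indexed by fight row label.
blue = pd.DataFrame({"wins": [3, 1]}, index=[0, 1]).add_prefix("B_")
red = pd.DataFrame({"wins": [5, 2]}, index=[1, 2]).add_prefix("R_")

# Outer join on the index: the union of row labels is kept.
frame = blue.join(red, how="outer")

print(frame)
```

Rows 0 and 2 each appear once with NaN on the side that had no record, while row 1 carries both `B_wins` and `R_wins`.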

In [None]:
rename_cols = {}
for col in frame.columns:
    if 'hero' in col:
        rename_cols[col] = col.replace('_hero_', '_avg_').replace('.', '')
    if 'opp' in col:
        rename_cols[col] = col.replace('_opp_', '_avg_opp_').replace('.', '')
    if 'win_by' in col:
        rename_cols[col] = col.replace(' ', '').replace('-', '_').replace('\'s', '_')

In [None]:
frame.rename(rename_cols, axis='columns', inplace=True)
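
The effect of the renaming rules is easiest to see on a few column names. The sketch below mirrors the three `if` checks in a hypothetical helper called `rename_col` (the later checks override the earlier ones, as in the loop above) and applies it to names built from columns that appear earlier in this notebook.

```python
def rename_col(col):
    """Hypothetical helper mirroring the renaming loop for one column name."""
    new = col
    if "hero" in col:
        new = col.replace("_hero_", "_avg_").replace(".", "")
    if "opp" in col:
        new = col.replace("_opp_", "_avg_opp_").replace(".", "")
    if "win_by" in col:
        new = col.replace(" ", "").replace("-", "_").replace("'s", "_")
    return new


examples = ["B_hero_KD", "R_opp_SIG_STR_pct", "B_hero_fighter",
            "R_win_by_TKO - Doctor's Stoppage"]
renamed = [rename_col(c) for c in examples]
for old, new in zip(examples, renamed):
    print(f"{old!r} -> {new!r}")
```

Note that `B_hero_fighter` becomes `B_avg_fighter`, which is why that name is the one dropped in the next cell.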

In [None]:
frame.drop(['R_avg_fighter','B_avg_fighter'], axis=1, inplace=True)

In [None]:
df2 = df2.join(frame, how='outer')

### Create Age

In [None]:
df2['R_DOB'] = pd.to_datetime(df2['R_DOB'])
df2['B_DOB'] = pd.to_datetime(df2['B_DOB'])
df2['date'] = pd.to_datetime(df2['date'])

In [None]:
def get_age(row):
    B_age = (row['date'] - row['B_DOB']).days
    R_age = (row['date'] - row['R_DOB']).days
    if not np.isnan(B_age):
        B_age = math.floor(B_age/365.25)
    if not np.isnan(R_age):
        R_age = math.floor(R_age/365.25)
    return pd.Series([B_age, R_age], index=['B_age', 'R_age'])

In [None]:
df2[['B_age', 'R_age']]= df2[['date', 'R_DOB', 'B_DOB']].apply(get_age, axis=1)
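
For reference, the same age calculation can also be written without a row-wise `apply`. This is only a sketch of an alternative, not the approach used above: it works on a toy frame (the date and DOB pairing is illustrative, and the second DOB is deliberately missing) and uses the same approximation of dividing elapsed days by 365.25.

```python
import numpy as np
import pandas as pd

# Toy data: one known DOB and one missing DOB.
fights = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-17", "2020-10-17"]),
    "R_DOB": pd.to_datetime(["1987-03-04", None]),
})

# Elapsed whole days; rows with a missing DOB become NaN.
days = (fights["date"] - fights["R_DOB"]).dt.days

# Convert days to completed years; NaN stays NaN.
fights["R_age"] = np.floor(days / 365.25)

print(fights["R_age"].tolist())  # [33.0, nan]
```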

In [None]:
df2.drop(['R_DOB', 'B_DOB'], axis=1, inplace=True)

In [None]:
# df2.drop(df2.index[df2['Winner'] == 'draw'], inplace = True)

In [None]:
df2.to_csv('../data/data.csv', index=False)

In [None]:
df2

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_avg_KD,...,R_Str_Def,R_TD_Avg,R_TD_Acc,R_TD_Def,R_Sub_Avg,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Brian Ortega,Chan Sung Jung,Lukasz Bosacki,2020-10-17,"Abu Dhabi, Abu Dhabi, United Arab Emirates",Red,False,Featherweight,5,0.820312,...,0.52,0.80,0.21,0.56,1.1,172.72,175.26,145.0,33.0,29.0
1,Katlyn Chookagian,Jessica Andrade,Lukasz Bosacki,2020-10-17,"Abu Dhabi, Abu Dhabi, United Arab Emirates",Blue,False,Women's Flyweight,3,0.062622,...,0.62,0.30,0.15,0.48,0.5,175.26,172.72,125.0,29.0,31.0
2,Jimmy Crute,Modestas Bukauskas,Anders Ohlsson,2020-10-17,"Abu Dhabi, Abu Dhabi, United Arab Emirates",Red,False,Light Heavyweight,3,0.000000,...,0.54,4.33,0.75,0.60,2.4,187.96,187.96,205.0,26.0,24.0
3,Claudio Silva,James Krause,Lukasz Bosacki,2020-10-17,"Abu Dhabi, Abu Dhabi, United Arab Emirates",Blue,False,Welterweight,3,0.750000,...,0.43,3.01,0.28,0.66,1.6,180.34,180.34,170.0,34.0,38.0
4,Thomas Almeida,Jonathan Martinez,Anders Ohlsson,2020-10-17,"Abu Dhabi, Abu Dhabi, United Arab Emirates",Blue,False,Featherweight,3,1.625000,...,0.64,0.00,0.00,0.75,0.0,170.18,177.80,145.0,26.0,29.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5797,Orlando Wiet,Robert Lucarelli,John McCarthy,1994-03-11,"Denver, Colorado, USA",Red,False,Open Weight,1,,...,0.00,0.00,0.00,0.00,0.0,177.80,,170.0,,
5798,Frank Hamaker,Thaddeus Luster,John McCarthy,1994-03-11,"Denver, Colorado, USA",Red,False,Open Weight,1,,...,0.00,0.00,0.00,0.00,0.0,,,,,
5799,Johnny Rhodes,David Levicki,John McCarthy,1994-03-11,"Denver, Colorado, USA",Red,False,Open Weight,1,,...,0.00,0.00,0.00,0.00,0.0,182.88,,210.0,,
5800,Patrick Smith,Ray Wizard,John McCarthy,1994-03-11,"Denver, Colorado, USA",Red,False,Open Weight,1,,...,0.00,0.00,0.00,0.00,0.0,187.96,,225.0,,30.0
