<div align=center><h1>Phase IV Appendix</h1>

In [29]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

<h5>Data Collection</h5>
In the code cell below, we load two UFC CSV files and combine them. The upcoming-event CSV contains the statistics from the most recent event, which has not yet been merged with the rest of the data.

In [30]:
# Upload Old UFC Data
old_data = pd.read_csv("ufc-master.csv", low_memory=False)

# Upload UFC Data From Recent Fights
merge_data = pd.read_csv("upcoming-event.csv", low_memory=False)

# Merge Dataframes and Display with Head
frames = [merge_data, old_data]
ufc_data = pd.concat(frames)
ufc_data.head()

Unnamed: 0,R_fighter,B_fighter,R_odds,B_odds,R_ev,B_ev,date,location,country,Winner,...,R_td_attempted_bout,B_td_attempted_bout,R_td_pct_bout,B_td_pct_bout,R_sub_attempts_bout,B_sub_attempts_bout,R_pass_bout,B_pass_bout,R_rev_bout,B_rev_bout
0,Kamaru Usman,Gilbert Burns,-278.0,228.0,35.97122302,228.0,2/13/21,"Las Vegas, Nevada, USA",USA,,...,,,,,,,,,,
1,Maycee Barber,Alexa Grasso,-107.0,-107.0,93.45794393,93.457944,2/13/21,"Las Vegas, Nevada, USA",USA,,...,,,,,,,,,,
2,Kelvin Gastelum,Ian Heinisch,-205.0,174.0,48.7804878,174.0,2/13/21,"Las Vegas, Nevada, USA",USA,,...,,,,,,,,,,
3,Ricky Simon,Brian Kelleher,-253.0,210.0,39.5256917,210.0,2/13/21,"Las Vegas, Nevada, USA",USA,,...,,,,,,,,,,
4,Maki Pitolo,Julian Marquez,145.0,-177.0,145.0,56.497175,2/13/21,"Las Vegas, Nevada, USA",USA,,...,,,,,,,,,,


<h5>Data Cleaning: Removing Columns</h5>
The original dataset has 137 columns. I am removing those that I suspect won't be useful in predicting fight outcomes, such as fight location and date. I also removed some columns that might be useful but are populated for only a small number of fighters. For example, only the top 15 fighters are given a pound-for-pound ranking, so the vast majority of fighters are not ranked in this category. I did not want to consider fighter rankings because they are subjective, only some fighters are ranked, and the rankings can be distorted if there are a lot of inactive fighters in that weight class.

In [31]:
old_col_count = ufc_data.shape
drop_col = ['date', 'location', 'country', 'B_match_weightclass_rank', 'R_match_weightclass_rank', 
            "R_Women's Flyweight_rank", "B_Women's Flyweight_rank", "B_Women's Featherweight_rank", 
            "R_Women's Featherweight_rank", "R_Women's Strawweight_rank", "B_Women's Strawweight_rank",
            "B_Women's Bantamweight_rank", "R_Women's Bantamweight_rank", 'B_Heavyweight_rank', 
            'R_Heavyweight_rank', 'B_Light Heavyweight_rank', 'R_Light Heavyweight_rank',
            'B_Middleweight_rank', 'R_Middleweight_rank', 'B_Welterweight_rank', 'R_Welterweight_rank', 
            'B_Lightweight_rank', 'R_Lightweight_rank', 'B_Featherweight_rank', 'R_Featherweight_rank', 
            'B_Bantamweight_rank', 'R_Bantamweight_rank', 'B_Flyweight_rank', 'R_Flyweight_rank',
            'B_Pound-for-Pound_rank', 'R_Pound-for-Pound_rank']
ufc_data = ufc_data.drop(drop_col, axis=1)
new_col_count = ufc_data.shape
print('Old Column Count: {}'.format(old_col_count[1]))
print('New Column Count: {}'.format(new_col_count[1]))

Old Column Count: 137
New Column Count: 106
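
The list of ranking columns can also be built programmatically instead of typed out. The following is a minimal sketch on a toy frame, assuming the ranking columns can be recognized by a `_rank` suffix. The helper `find_rank_columns` and the toy values are hypothetical, and in the real data any non-ranking column that happens to share the suffix would have to be excluded by hand.

```python
import pandas as pd

def find_rank_columns(df, suffix="_rank"):
    """Return the names of the columns that end with the given suffix."""
    return [col for col in df.columns if col.endswith(suffix)]

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "R_fighter": ["A"],
    "B_fighter": ["B"],
    "R_Heavyweight_rank": [3.0],
    "B_Heavyweight_rank": [None],
    "date": ["2/13/21"],
})

# Collect the ranking columns, then drop them together with 'date'
rank_cols = find_rank_columns(toy)
trimmed = toy.drop(columns=rank_cols + ["date"])
print(list(trimmed.columns))
```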


<h5>Data Cleaning: Removing Rows</h5>
The dataset contains many rows with missing values. I am removing those rows, rather than dropping more columns, because I think the remaining columns can help predict fight outcomes.

In [32]:
# Drop every row that has at least one missing value, then renumber the index from 0
ufc_data = ufc_data.dropna(axis=0, how='any')
ufc_data = ufc_data.reset_index(drop=True)
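
To make the effect of `how='any'` concrete, here is a small sketch on a toy frame (the column names mirror the dataset, but the values are made up). It counts the missing cells per column first, which can help judge how many rows a column costs before dropping anything.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each of two different rows
toy = pd.DataFrame({
    "R_fighter": ["A", "C", "E"],
    "R_odds": [-150.0, np.nan, 120.0],
    "R_kd_bout": [1.0, 0.0, np.nan],
})

# Count missing values per column before dropping anything
missing_per_col = toy.isna().sum()

# how='any' removes a row if at least one of its cells is missing
complete = toy.dropna(axis=0, how="any").reset_index(drop=True)

print(missing_per_col)
print(len(toy) - len(complete), "rows dropped")
```

Only the first row survives here, because each of the other two rows has one empty cell.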

In [33]:
# No empty cells are left in dataframe
empty = np.where(pd.isnull(ufc_data))
print(empty[0])

[]


In [34]:
# How Big is the Dataset After Cleaning
print('Bytes: {}'.format(ufc_data.memory_usage(index=True).sum()))

Bytes: 850672


In [35]:
# view data
print('# of Rows: {}'.format(ufc_data.shape[0]))
print('# of Columns: {}'.format(ufc_data.shape[1]))
ufc_data.head()

# of Rows: 1003
# of Columns: 106


Unnamed: 0,R_fighter,B_fighter,R_odds,B_odds,R_ev,B_ev,Winner,title_bout,weight_class,gender,...,R_td_attempted_bout,B_td_attempted_bout,R_td_pct_bout,B_td_pct_bout,R_sub_attempts_bout,B_sub_attempts_bout,R_pass_bout,B_pass_bout,R_rev_bout,B_rev_bout
0,Petr Yan,Jose Aldo,-215.0,175.0,46.5116,175.0,Red,True,Bantamweight,MALE,...,2.0,1.0,0.5,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,Amanda Ribas,Paige VanZant,-770.0,500.0,12.987,500.0,Red,False,Women's Flyweight,FEMALE,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,Volkan Oezdemir,Jiri Prochazka,-159.0,129.0,62.8931,129.0,Blue,False,Light Heavyweight,MALE,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Makwan Amirkhani,Danny Henry,-215.0,170.0,46.5116,170.0,Red,False,Featherweight,MALE,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,Martin Day,Davey Grant,-162.0,130.0,61.7284,130.0,Blue,False,Bantamweight,MALE,...,1.0,4.0,0.0,0.5,0.0,1.0,0.0,1.0,1.0,0.0


In [36]:
ufc_data.loc[1, 'B_odds']

500.0

#### Data Cleaning For Analysis
Each row in the dataset contains the statistics for both the red and the blue fighter in a bout. I am going to separate the data into a red dataset and a blue dataset and then recombine them so that each row contains only one fighter.

#### Data Changes 
1. Eliminate the columns that are not used in the models. 
2. Some of the statistics capture the differences in characteristics between the two fighters. To keep these difference statistics consistent once each fighter has their own row, the sign has to be flipped for one corner. The code below keeps the original value for the blue fighter and negates it for the red fighter. 
3. I created a new column for the difference between the fighters' average takedowns landed per fifteen minutes because I thought this statistic would be important for predicting fight outcomes. 
4. I replaced the winner column, which denoted whether the red or blue fighter won, with a dummy variable that equals 1 when the fighter wins and 0 otherwise. 
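
The reshaping described in this list can be sketched on a toy bout table. This is a simplified illustration and not the notebook's own code: the helper `one_fighter_per_row` and the toy values are hypothetical, and the difference column is assumed to be stored as blue minus red, so the red fighter's copy is the one that gets negated.

```python
import pandas as pd

# Toy bout table: one row per bout (hypothetical values).
# Assumption: reach_dif is stored as blue minus red.
bouts = pd.DataFrame({
    "r_fighter": ["A", "C"],
    "b_fighter": ["B", "D"],
    "reach_dif": [5.0, -2.0],
    "winner": ["Red", "Blue"],
})

def one_fighter_per_row(bouts):
    """Split each bout into a red row and a blue row and stack them."""
    red = pd.DataFrame({
        "fighter": bouts["r_fighter"],
        "reach_dif": -bouts["reach_dif"],  # flip the sign for the red corner
        "win_dum": (bouts["winner"] == "Red").astype(int),
    })
    blue = pd.DataFrame({
        "fighter": bouts["b_fighter"],
        "reach_dif": bouts["reach_dif"],   # keep the stored sign for blue
        "win_dum": (bouts["winner"] == "Blue").astype(int),
    })
    return pd.concat([red, blue], ignore_index=True)

long_df = one_fighter_per_row(bouts)
print(long_df)
```

After the split, the two rows that belong to the same bout carry opposite signs in `reach_dif`, and exactly one of them has `win_dum` equal to 1.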

In [37]:
# Make all columns lowercase
old_col = ufc_data.columns
new_col = [col.lower() for col in ufc_data.columns]
ufc_data.columns = new_col

In [38]:
flip_columns = ['lose_streak_dif', 'win_streak_dif', 'longest_win_streak_dif', 'win_dif', 'loss_dif', 
                'total_round_dif', 'total_title_bout_dif', 'ko_dif', 'sub_dif', 'height_dif', 
                'reach_dif','age_dif', 'sig_str_dif', 'avg_sub_att_dif', 'avg_td_dif']

In [39]:
# Add one new difference column for each corner
ufc_data['b_td_landed_dif'] = ufc_data['b_avg_td_landed'] - ufc_data['r_avg_td_landed']
ufc_data['r_td_landed_dif'] = ufc_data['r_avg_td_landed'] - ufc_data['b_avg_td_landed']

In [40]:
# Getting uniform results from difference columns and delete old column difference
for col in flip_columns:
    b_col = 'b_' + col
    r_col = 'r_' + col
    ufc_data[b_col] = ufc_data[col]
    ufc_data[r_col] = -ufc_data[col]
    ufc_data = ufc_data.drop(col, axis=1)

In [41]:
# Initialize datasets with shared columns
ufc_red = ufc_data[['empty_arena', 'constant_1', 'finish', 'finish_details', 'finish_round', 'finish_round_time',
           'total_fight_time_secs', 'winner', 'title_bout', 'weight_class', 'gender', 'no_of_rounds']]
ufc_blue = ufc_data[['empty_arena', 'constant_1', 'finish', 'finish_details', 'finish_round', 'finish_round_time',
           'total_fight_time_secs', 'winner', 'title_bout', 'weight_class', 'gender', 'no_of_rounds']]

In [42]:
ufc_blue.shape

(1003, 12)

In [43]:
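# Iterate through columns and assign each column to its corner's dataframe.
# The check looks only at the first letter, so any column whose name starts
# with 'r' or 'b' is assigned, with its first two characters stripped.
for col in ufc_data.columns:
    if col[0] == 'r':
        ufc_red.insert(2, col[2:], ufc_data[col])
    elif col[0] == 'b':
        ufc_blue.insert(2, col[2:], ufc_data[col])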

In [44]:
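# Initialize the win dummy to 0 for both corners. pandas emits a SettingWithCopyWarning
# here because ufc_red and ufc_blue were selected from ufc_data.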
ufc_red.loc[:,'win_dum'] = 0
ufc_blue.loc[:,'win_dum'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
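
One common way to avoid this warning is to take an explicit copy when the subset is first selected. The sketch below shows the idea on a toy frame. The names `source` and `subset` are hypothetical, and this is one possible fix rather than the code that was run in this notebook.

```python
import pandas as pd

# Toy stand-in for the full bout table (hypothetical values)
source = pd.DataFrame({
    "winner": ["Red", "Blue"],
    "finish": ["KO", "SUB"],
    "r_age": [30, 28],
})

# .copy() makes the subset an independent DataFrame, so later assignments
# are unambiguous and do not touch the source frame
subset = source[["winner", "finish"]].copy()

# Add the dummy column and set it to 1 where the red corner won
subset["win_dum"] = 0
subset.loc[subset["winner"] == "Red", "win_dum"] = 1

print(subset)
```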


In [45]:
# Set the dummy to 1 on the rows where that corner won
ufc_red.loc[ufc_red['winner'] == 'Red', 'win_dum'] = 1
ufc_blue.loc[ufc_blue['winner'] == 'Blue', 'win_dum'] = 1

In [46]:
print(ufc_red.columns)

Index(['empty_arena', 'constant_1', 'avg_td_dif', 'avg_sub_att_dif',
       'sig_str_dif', 'age_dif', 'reach_dif', 'height_dif', 'sub_dif',
       'ko_dif', 'total_title_bout_dif', 'total_round_dif', 'loss_dif',
       'win_dif', 'longest_win_streak_dif', 'win_streak_dif',
       'lose_streak_dif', 'td_landed_dif', 'rev_bout', 'pass_bout',
       'sub_attempts_bout', 'td_pct_bout', 'td_attempted_bout',
       'td_landed_bout', 'tot_str_attempted_bout', 'tot_str_landed_bout',
       'sig_str_pct_bout', 'sig_str_attempted_bout', 'sig_str_landed_bout',
       'kd_bout', 'age', 'weight_lbs', 'reach_cms', 'height_cms', 'stance',
       'wins', 'win_by_tko_doctor_stoppage', 'win_by_submission',
       'win_by_ko/tko', 'win_by_decision_unanimous', 'win_by_decision_split',
       'win_by_decision_majority', 'total_title_bouts', 'total_rounds_fought',
       'losses', 'longest_win_streak', 'avg_td_pct', 'avg_td_landed',
       'avg_sub_att', 'avg_sig_str_pct', 'avg_sig_str_landed', 'draw',
    

In [47]:
# Concatenate the Datasets
data = [ufc_red, ufc_blue]
ufc = pd.concat(data, ignore_index = True)

In [48]:
ufc['odds'].shape

(2006,)

#### Add New Column for Win Probability
* Positive odds: divide 100 by (the American odds plus 100), then multiply by 100 to get a percentage. For example, American odds of 150 give (100 / (150 + 100)) * 100 = 40%.
* Negative odds: multiply the American odds by -1, then divide that positive value by (the positive value plus 100) and multiply by 100 to get a percentage. For example, American odds of -300 give (300 / (300 + 100)) * 100 = 75%.
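
The two rules above can be written as one small function. This is a minimal sketch: the function name `american_odds_to_win_prob` is hypothetical, and treating odds of exactly 0 as a 50% matchup is an assumption, since the rules above do not cover that case.

```python
def american_odds_to_win_prob(odds):
    """Convert American odds to an implied win probability in percent."""
    if odds < 0:
        positive = -odds
        return positive / (positive + 100) * 100
    if odds > 0:
        return 100 / (odds + 100) * 100
    # Assumption: odds of exactly 0 are treated as an even matchup
    return 50.0

# Worked examples from the text, printed to one decimal place
for example in (150, -300):
    print(f"{example}: {american_odds_to_win_prob(example):.1f}%")
```

Results are rounded for display because floating-point arithmetic can leave a tiny error in the last digits.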

In [49]:
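# Convert American odds to an implied win probability in percent
for i in ufc.index:
    odds = ufc.loc[i, 'odds']
    if odds < 0:
        odds = -odds
        prob = (odds/(odds+100)) * 100
        ufc.loc[i, 'win_prob'] = prob
    elif odds > 0:
        prob = (100 / (odds + 100)) * 100
        ufc.loc[i, 'win_prob'] = prob
    elif odds == 0:
        # Even odds: use 50 so the value is on the same percent scale as above
        ufc.loc[i, 'win_prob'] = 50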

In [50]:
ufc.loc[1, 'odds']

-770.0

In [51]:
ufc.loc[1, 'win_prob']

88.50574712643679

According to a conversion table, odds of -770 translate to an 88.5% win probability for this fighter, which matches the computed value.

In [52]:
# Label each fighter as underdog, favorite, or even based on the sign of the odds
ufc['status'] = np.select([ufc['odds'] > 0, ufc['odds'] < 0],
                          ['underdog', 'favorite'], default='even')

In [53]:
# Test whether dataframe is the correct size: 68 columns plus the three added.
ufc.shape

(2006, 71)

In [54]:
ufc.to_csv('clean_ufc.csv', index = False)