<center><h1> Machine Learning Project - Predictive Model for UFC</h1></center>

## Insper Instituto de Ensino e Pesquisa
------------------------------------------------

<center><h5> Gabriel Hermida, Luca Mizrahi, and Caio Boa</h5></center>

## Introduction and Objective
<div id="introdução"></div>

Society's progress has been driven by rapid technological change. In sports, this progress has made it possible to apply advanced data analysis and predictive techniques to improve athletes' performance and strategies. Even so, predicting the outcome of sporting competitions remains difficult, because it involves complex and dynamic variables.

Among the sporting competitions with the greatest global interest are the fights of the UFC (Ultimate Fighting Championship), whose unpredictability is shaped by the fighters' behavioral, physical, and strategic factors, such as training, fight history, physical conditioning, and technical and tactical skills. These factors make it hard to predict the winners of these fights accurately.

In this context, the objective of this work is to identify the most suitable classification model for predicting, through data analysis, the winners of UFC fights. By exploring different classification approaches on relevant datasets, we aim to provide information that supports strategy development, betting, and sports management, improving prediction accuracy and the understanding of the factors that determine victory.

Credit: https://www.kaggle.com/aqsaqadir22/ufc-data-analysis-training

Dataset used: https://www.kaggle.com/datasets/rajeevw/ufcdata

## 1. Importing Libraries and Reading the Dataset

#### 1.1 Importing Libraries

In [1]:
import numpy as np 
import pandas as pd
import ydata_profiling
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.tree import export_graphviz
import graphviz
from subprocess import call
from IPython.display import Image
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
import shap 

#### 1.2 Reading the Dataset

In [2]:
data_path = "raw_total_fight_data.csv"
fight_data = pd.read_csv(data_path, sep=";")
data2_path = "raw_fighter_details.csv"
fighter_data = pd.read_csv(data2_path)

## 2. Data Preprocessing
#### 2.1 Exploratory Analysis

In [3]:
# First rows of the fights dataset
print("Fights dataset:")
display(fight_data.head(3))

# First rows of the fighters dataset
print("\n Fighters dataset:")
display(fighter_data.head(3))

Fights dataset:


Unnamed: 0,R_fighter,B_fighter,R_KD,B_KD,R_SIG_STR.,B_SIG_STR.,R_SIG_STR_pct,B_SIG_STR_pct,R_TOTAL_STR.,B_TOTAL_STR.,...,B_GROUND,win_by,last_round,last_round_time,Format,Referee,date,location,Fight_type,Winner
0,Adrian Yanez,Gustavo Lopez,2,0,41 of 103,23 of 51,39%,45%,41 of 103,23 of 51,...,0 of 0,KO/TKO,3,0:27,3 Rnd (5-5-5),Chris Tognoni,"March 20, 2021","Las Vegas, Nevada, USA",Bantamweight Bout,Adrian Yanez
1,Trevin Giles,Roman Dolidze,0,0,27 of 57,32 of 67,47%,47%,43 of 73,75 of 110,...,1 of 2,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Herb Dean,"March 20, 2021","Las Vegas, Nevada, USA",Middleweight Bout,Trevin Giles
2,Tai Tuivasa,Harry Hunsucker,1,0,14 of 18,2 of 6,77%,33%,14 of 18,2 of 6,...,0 of 0,KO/TKO,1,0:49,3 Rnd (5-5-5),Herb Dean,"March 20, 2021","Las Vegas, Nevada, USA",Heavyweight Bout,Tai Tuivasa



 Fighters dataset:


Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB,SLpM,Str_Acc,SApM,Str_Def,TD_Avg,TD_Acc,TD_Def,Sub_Avg
0,Tom Aaron,,155 lbs.,,,"Jul 13, 1978",0.0,0%,0.0,0%,0.0,0%,0%,0.0
1,Papy Abedi,"5' 11""",185 lbs.,,Southpaw,"Jun 30, 1978",2.8,55%,3.15,48%,3.47,57%,50%,1.3
2,Shamil Abdurakhimov,"6' 3""",235 lbs.,"76""",Orthodox,"Sep 02, 1981",2.45,44%,2.45,58%,1.23,24%,47%,0.2


In [4]:
fight_data.info() # Information about the fights dataset
print("")
fighter_data.info() # Information about the fighters dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6012 entries, 0 to 6011
Data columns (total 41 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   R_fighter        6012 non-null   object
 1   B_fighter        6012 non-null   object
 2   R_KD             6012 non-null   int64 
 3   B_KD             6012 non-null   int64 
 4   R_SIG_STR.       6012 non-null   object
 5   B_SIG_STR.       6012 non-null   object
 6   R_SIG_STR_pct    6012 non-null   object
 7   B_SIG_STR_pct    6012 non-null   object
 8   R_TOTAL_STR.     6012 non-null   object
 9   B_TOTAL_STR.     6012 non-null   object
 10  R_TD             6012 non-null   object
 11  B_TD             6012 non-null   object
 12  R_TD_pct         6012 non-null   object
 13  B_TD_pct         6012 non-null   object
 14  R_SUB_ATT        6012 non-null   int64 
 15  B_SUB_ATT        6012 non-null   int64 
 16  R_REV            6012 non-null   int64 
 17  B_REV            6012 non-null   

In [5]:
#display(ydata_profiling.ProfileReport(fight_data))
#display(ydata_profiling.ProfileReport(fighter_data))

### 2.2 Handling Null Values

In [5]:
fight_data = fight_data.replace("---", "0")
fighter_data = fighter_data.replace("---", "0")
fight_data = fight_data.dropna(axis=0)
fighter_data = fighter_data.dropna(axis=0)
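
The effect of these two steps can be checked on a small example. The sketch below uses a toy frame with made-up values (not rows from the real CSV files) and assumes, as the cell above suggests, that `"---"` is the placeholder used in the raw data: the placeholder becomes the string `"0"`, and any row that still has a missing value is dropped.

```python
import pandas as pd

# Toy frame with hypothetical values that mimic the raw placeholders
toy = pd.DataFrame({
    "R_TD_pct": ["50%", "---", "33%"],
    "Referee": ["Herb Dean", "Chris Tognoni", None],
})

# Same two steps as in the notebook: placeholder -> "0", then drop incomplete rows
cleaned = toy.replace("---", "0").dropna(axis=0)

rows_dropped = len(toy) - len(cleaned)
print(cleaned)
print("rows dropped:", rows_dropped)
```

Comparing `len()` before and after `dropna` on the real frames is a quick way to see how much data this step removes.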

### 2.3 Adjusting the Datasets

#### 2.3.1 fight_data

We create the `WinnerColor` column to record the glove color (red or blue) of each fight's winner. This column is the target of the predictive model, which classifies the winner of a UFC fight by glove color.

In [6]:
# Creating a new column with the winner's color
f = ['R_fighter', 'B_fighter', 'Winner']
display(fight_data[f].head(3))

# Apply the lambda to each row of the dataset to create the WinnerColor column
fight_data['WinnerColor'] = fight_data.apply(lambda row: 'R' if row.Winner == row.R_fighter else 'B', axis=1)

f.append('WinnerColor')
display(fight_data[f].head(3))

Unnamed: 0,R_fighter,B_fighter,Winner
0,Adrian Yanez,Gustavo Lopez,Adrian Yanez
1,Trevin Giles,Roman Dolidze,Trevin Giles
2,Tai Tuivasa,Harry Hunsucker,Tai Tuivasa


Unnamed: 0,R_fighter,B_fighter,Winner,WinnerColor
0,Adrian Yanez,Gustavo Lopez,Adrian Yanez,R
1,Trevin Giles,Roman Dolidze,Trevin Giles,R
2,Tai Tuivasa,Harry Hunsucker,Tai Tuivasa,R
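
The same labeling can be written without a row-wise `apply`. The sketch below is one possible vectorized version using `np.where` on a toy frame; the second row uses hypothetical fighter names. As in the lambda above, every row whose winner is not the red fighter is labeled `'B'`.

```python
import numpy as np
import pandas as pd

# Toy data: the first row comes from the preview above, the second is hypothetical
toy = pd.DataFrame({
    "R_fighter": ["Adrian Yanez", "Fighter A"],
    "B_fighter": ["Gustavo Lopez", "Fighter B"],
    "Winner": ["Adrian Yanez", "Fighter B"],
})

# 'R' where the winner is the red fighter, 'B' otherwise
toy["WinnerColor"] = np.where(toy["Winner"] == toy["R_fighter"], "R", "B")
print(toy)
```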


We convert the percentage values to decimals so that they can be used as numeric features.

In [7]:
# Converting percentage values
percent_features = ['R_SIG_STR_pct', 'B_SIG_STR_pct', 'R_TD_pct', 'B_TD_pct']
print("Percentage values before the conversion:")
display(fight_data[percent_features].head(3))

# Apply the lambda to each value of the column to convert percentages to decimals
for feature in percent_features:
    fight_data[feature] = fight_data[feature].apply(lambda x: int(x.split("%")[0])/100)
    
print("\nPercentage values after the conversion to decimal:")
display(fight_data[percent_features].head(3))

Percentage values before the conversion:


Unnamed: 0,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct
0,39%,45%,0,0%
1,47%,47%,50%,33%
2,77%,33%,0,0



Percentage values after the conversion to decimal:


Unnamed: 0,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct
0,0.39,0.45,0.0,0.0
1,0.47,0.47,0.5,0.33
2,0.77,0.33,0.0,0.0
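
A minimal sketch of the same conversion using pandas string methods instead of a lambda, on a toy column. It assumes every value is either a percentage such as `"50%"` or the plain `"0"` left by the placeholder replacement.

```python
import pandas as pd

# Toy column mixing "NN%" values and the "0" left by the placeholder replacement
toy = pd.DataFrame({"R_TD_pct": ["50%", "0", "33%"]})

# Strip the trailing percent sign, convert to float, and scale to the 0-1 range
toy["R_TD_pct"] = toy["R_TD_pct"].str.rstrip("%").astype(float) / 100
print(toy)
```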


We also adjust the landed and attempted strike values. The original format (`41 of 118`), which shows each fighter's success rate, is split into two columns for each strike category evaluated `(e.g. TOTAL, HEAD, BODY)`, to simplify the analysis and the construction of the predictive model.

In [8]:
# Adjusting the landed and attempted values
of_features = ["R_TD", "R_SIG_STR.", "R_TOTAL_STR.", "R_HEAD", "R_BODY", "R_CLINCH", "R_GROUND", "R_LEG", 
        "B_TD", "B_SIG_STR.", "B_TOTAL_STR.", "B_HEAD", "B_BODY", "B_CLINCH", "B_GROUND", "B_LEG"]
print("Landed and attempted values before the adjustment:")
display(fight_data[of_features].head(3))

# Apply the lambdas to each value to split landed and attempted counts into separate columns for each fighter
for feature in of_features:
    fight_data[feature+"_land"] = fight_data[feature].apply(lambda x: int(x.split(" of ")[0]))
    fight_data[feature+"_attempt"] = fight_data[feature].apply(lambda x: int(x.split(" of ")[1]))

print("\nLanded and attempted values after the adjustment:")
display(fight_data.loc[:2,"R_TD_land":])

Landed and attempted values before the adjustment:


Unnamed: 0,R_TD,R_SIG_STR.,R_TOTAL_STR.,R_HEAD,R_BODY,R_CLINCH,R_GROUND,R_LEG,B_TD,B_SIG_STR.,B_TOTAL_STR.,B_HEAD,B_BODY,B_CLINCH,B_GROUND,B_LEG
0,0 of 0,41 of 103,41 of 103,32 of 83,8 of 19,0 of 0,0 of 1,1 of 1,0 of 1,23 of 51,23 of 51,14 of 40,5 of 7,0 of 0,0 of 0,4 of 4
1,1 of 2,27 of 57,43 of 73,22 of 51,4 of 4,4 of 5,8 of 10,1 of 2,1 of 3,32 of 67,75 of 110,10 of 37,7 of 14,3 of 6,1 of 2,15 of 16
2,0 of 0,14 of 18,14 of 18,10 of 14,0 of 0,0 of 0,5 of 8,4 of 4,0 of 0,2 of 6,2 of 6,1 of 5,0 of 0,0 of 0,0 of 0,1 of 1



Landed and attempted values after the adjustment:


Unnamed: 0,R_TD_land,R_TD_attempt,R_SIG_STR._land,R_SIG_STR._attempt,R_TOTAL_STR._land,R_TOTAL_STR._attempt,R_HEAD_land,R_HEAD_attempt,R_BODY_land,R_BODY_attempt,...,B_HEAD_land,B_HEAD_attempt,B_BODY_land,B_BODY_attempt,B_CLINCH_land,B_CLINCH_attempt,B_GROUND_land,B_GROUND_attempt,B_LEG_land,B_LEG_attempt
0,0,0,41,103,41,103,32,83,8,19,...,14,40,5,7,0,0,0,0,4,4
1,1,2,27,57,43,73,22,51,4,4,...,10,37,7,14,3,6,1,2,15,16
2,0,0,14,18,14,18,10,14,0,0,...,1,5,0,0,0,0,0,0,1,1
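
The split of an `"X of Y"` string into two integer columns can also be done in one pass with a regular expression. The sketch below shows this for a single column on toy data taken from the preview above; it assumes every value matches the `"<landed> of <attempted>"` pattern.

```python
import pandas as pd

toy = pd.DataFrame({"R_SIG_STR.": ["41 of 103", "27 of 57", "14 of 18"]})

# Two capture groups: landed strikes, then attempted strikes
parts = toy["R_SIG_STR."].str.extract(r"(\d+) of (\d+)").astype(int)
toy["R_SIG_STR._land"] = parts[0]
toy["R_SIG_STR._attempt"] = parts[1]
print(toy)
```

If a value did not match the pattern, `str.extract` would return missing values and the `astype(int)` call would fail, which makes malformed rows visible early.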


In addition, we copy each fighter's landed and attempted counts into uniformly named total columns, which gives an overall view of each fighter's performance.

In [9]:
# Generating the total features for the landed and attempted variables
# Name used in the new columns for each stat in the source columns
tot_names = {"TD": "TD_tot", "SIG_STR.": "SIGSTR_tot", "TOTAL_STR.": "TOT",
             "GROUND": "GROUND_tot", "HEAD": "HEAD_tot", "BODY": "BODY_tot",
             "CLINCH": "CLINCH_tot", "LEG": "LEG_tot"}

for feature in of_features:
    corner, stat = feature.split("_", 1)  # e.g. "R_SIG_STR." -> ("R", "SIG_STR.")
    new_name = corner + "_" + tot_names[stat]
    fight_data[new_name + "_land"] = fight_data[feature + "_land"]
    fight_data[new_name + "_attempt"] = fight_data[feature + "_attempt"]
        
features = ["R_SIGSTR_tot_land", "R_SIGSTR_tot_attempt", "R_TD_tot_land", "R_TD_tot_attempt"]

display(fight_data[features].head(3))

Unnamed: 0,R_SIGSTR_tot_land,R_SIGSTR_tot_attempt,R_TD_tot_land,R_TD_tot_attempt
0,41,103,0,0
1,27,57,1,2
2,14,18,0,0


In [13]:
# Generating the total fight time feature
print("\nRound and fight-time information in the original dataset:")
display(fight_data.loc[:2, 'last_round':'Format'])

# Collecting the distinct fight formats and preparing the dictionary of round lengths
Rounds_info = list(fight_data['Format'].unique())
Rounds_info.remove("No Time Limit")
Rounds = {}

# Adding the round lengths (in minutes) of each format to the dictionary
for rounds in Rounds_info:
    time_list = rounds.split("(")
    time_list = time_list.pop(1)
    time_list = time_list.split(")")
    time_list = time_list.pop(0)
    time_list = time_list.split("-")
    Rounds[rounds] = [int(t) for t in time_list]

print("\n Round information:")
display(Rounds)

# Apply the lambda to each row of the dataset to create the tot_fight_time_min column
fight_data['tot_fight_time_min'] = fight_data.apply(
    lambda x: sum(Rounds[x['Format']][:x['last_round']-1]) +
                  float(x['last_round_time'].split(":")[0]) +
                  (float(x['last_round_time'].split(":")[1])/60) if x['last_round'] != 1 
    else float(x['last_round_time'].split(":")[0]) +
             (float(x['last_round_time'].split(":")[1])/60), axis=1)

feature = ['last_round', 'last_round_time', 'Format', 'tot_fight_time_min']

print("\nAdding the total fight time feature:")
display(fight_data[feature].head(3))


Round and fight-time information in the original dataset:


Unnamed: 0,last_round,last_round_time,Format
0,3,0:27,3 Rnd (5-5-5)
1,3,5:00,3 Rnd (5-5-5)
2,1,0:49,3 Rnd (5-5-5)



 Round information:


{'3 Rnd (5-5-5)': [5, 5, 5],
 '5 Rnd (5-5-5-5-5)': [5, 5, 5, 5, 5],
 '3 Rnd + OT (5-5-5-5)': [5, 5, 5, 5],
 '2 Rnd (5-5)': [5, 5],
 '1 Rnd + OT (12-3)': [12, 3],
 '1 Rnd + 2OT (15-3-3)': [15, 3, 3],
 '1 Rnd (12)': [12],
 '1 Rnd + OT (15-3)': [15, 3],
 '1 Rnd (15)': [15],
 '1 Rnd + 2OT (24-3-3)': [24, 3, 3],
 '1 Rnd (10)': [10],
 '1 Rnd (18)': [18],
 '1 Rnd + OT (27-3)': [27, 3],
 '1 Rnd + OT (30-5)': [30, 5],
 '1 Rnd (20)': [20],
 '1 Rnd (30)': [30]}


Adding the total fight time feature:


Unnamed: 0,last_round,last_round_time,Format,tot_fight_time_min
0,3,0:27,3 Rnd (5-5-5),10.45
1,3,5:00,3 Rnd (5-5-5),15.0
2,1,0:49,3 Rnd (5-5-5),0.816667
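
The total fight time is the sum of the completed rounds plus the time elapsed in the last round. The sketch below restates that calculation as two small functions; `parse_format` and `total_fight_minutes` are hypothetical helper names, not part of the notebook. One assumption differs from the cell above: a format without round lengths (such as `"No Time Limit"`) is treated as having no completed rounds instead of raising an error.

```python
import re

def parse_format(fmt):
    """Return the round lengths in minutes, e.g. '3 Rnd (5-5-5)' -> [5, 5, 5]."""
    match = re.search(r"\(([\d-]+)\)", fmt)
    if match is None:  # e.g. "No Time Limit": no round lengths available
        return []
    return [int(t) for t in match.group(1).split("-")]

def total_fight_minutes(fmt, last_round, last_round_time):
    """Minutes from completed rounds plus the time elapsed in the last round."""
    minutes, seconds = last_round_time.split(":")
    completed = sum(parse_format(fmt)[:last_round - 1])
    return completed + int(minutes) + int(seconds) / 60

print(total_fight_minutes("3 Rnd (5-5-5)", 3, "0:27"))
print(total_fight_minutes("3 Rnd (5-5-5)", 1, "0:49"))
```

For the first preview row (third round, `0:27`), two completed five-minute rounds plus 27 seconds give 10.45 minutes, which matches the table above.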


In [12]:
# Normalizing the weight-class labels
Fight_types = list(fight_data.Fight_type.unique())
print(Fight_types)

weight_class = ['Fly', 'Bantam', 'Feather', 'Light Heavy', 'Welter', 'Middle', 'Light', 'Heavy', 'Straw', 'Open', 'Catch']
for weight in weight_class:
    fight_data['Fight_type'] = fight_data['Fight_type'].apply(
        lambda x: weight if weight in x else x)

fight_data['Fight_type'] = fight_data['Fight_type'].apply(
    lambda x: x if len(x)<10 else np.nan)

print(list(fight_data.Fight_type.unique()))

['Bantamweight Bout', 'Middleweight Bout', 'Heavyweight Bout', "Women's Strawweight Bout", "Women's Bantamweight Bout", 'Lightweight Bout', 'Welterweight Bout', 'Flyweight Bout', 'Light Heavyweight Bout', 'Featherweight Bout', "Women's Flyweight Bout", 'UFC Bantamweight Title Bout', 'UFC Light Heavyweight Title Bout', "UFC Women's Featherweight Title Bout", 'UFC Welterweight Title Bout', 'Catch Weight Bout', "UFC Women's Flyweight Title Bout", 'UFC Flyweight Title Bout', 'UFC Lightweight Title Bout', 'UFC Middleweight Title Bout', 'UFC Heavyweight Title Bout', 'UFC Featherweight Title Bout', 'UFC Interim Lightweight Title Bout', "UFC Women's Strawweight Title Bout", "Women's Featherweight Bout", "UFC Women's Bantamweight Title Bout", 'UFC Interim Middleweight Title Bout', 'Ultimate Fighter 28 Heavyweight Tournament Title Bout', "Ultimate Fighter 28 Women's Featherweight Tournament Title Bout", 'Ultimate Fighter 27 Featherweight Tournament Title Bout', 'Ultimate Fighter 27 Lightweight T

In [13]:
display(fight_data['Fight_type'].value_counts())
print(sum(fight_data['Fight_type'].isna()))

fight_data = fight_data.dropna()
print(sum(fight_data['Fight_type'].isna()))

Fight_type
Light      1620
Welter     1065
Middle      801
Bantam      603
Heavy       570
Feather     553
Fly         332
Straw       190
Open         86
Catch        38
Name: count, dtype: int64

13
0
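
One detail of the labeling loop is worth noting: the labels are applied in list order, so a bout first labeled `'Light Heavy'` also matches the later `'Light'` check and is relabeled `'Light'`. This is consistent with `'Light Heavy'` being absent from the counts above. The sketch below shows one order-independent alternative that tries longer labels first; `weight_class_of` is a hypothetical helper, and using it would change the counts shown above.

```python
weight_classes = ['Fly', 'Bantam', 'Feather', 'Light Heavy', 'Welter', 'Middle',
                  'Light', 'Heavy', 'Straw', 'Open', 'Catch']

def weight_class_of(fight_type):
    """Return the first matching label, trying longer labels first; None if nothing matches."""
    for label in sorted(weight_classes, key=len, reverse=True):
        if label in fight_type:
            return label
    return None

for name in ["Light Heavyweight Bout", "UFC Interim Lightweight Title Bout",
             "Heavyweight Bout", "Women's Strawweight Bout"]:
    print(name, "->", weight_class_of(name))
```

Returning `None` for unmatched bout types plays the role of the `len(x) < 10` filter in the cell above, without discarding the 11-character `'Light Heavy'` label.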


In [14]:
# Replacing the fight date with its year

fight_data['date'] = fight_data['date'].apply(
    lambda x: int(x.split(" ")[2]))
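
A minimal sketch of the same extraction using `pd.to_datetime`, which also validates the date format. It assumes all dates follow the `"March 20, 2021"` pattern seen in the preview and that month names are parsed under an English locale; the second date is only an illustrative value.

```python
import pandas as pd

dates = pd.Series(["March 20, 2021", "November 12, 1993"])

# Parse "Month day, year" strings and keep only the year
years = pd.to_datetime(dates, format="%B %d, %Y").dt.year
print(years.tolist())
```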

#### 2.3.2 fighter_data

In [15]:
display(fighter_data.head(10))

Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB,SLpM,Str_Acc,SApM,Str_Def,TD_Avg,TD_Acc,TD_Def,Sub_Avg
2,Shamil Abdurakhimov,"6' 3""",235 lbs.,"76""",Orthodox,"Sep 02, 1981",2.45,44%,2.45,58%,1.23,24%,47%,0.2
6,Daichi Abe,"5' 11""",170 lbs.,"71""",Orthodox,"Nov 27, 1991",3.8,33%,4.49,56%,0.33,50%,0%,0.0
8,Klidson Abreu,"6' 0""",205 lbs.,"74""",Orthodox,"Dec 24, 1992",2.05,40%,2.9,55%,0.64,20%,80%,0.0
11,Juan Adams,"6' 5""",265 lbs.,"80""",Orthodox,"Jan 16, 1992",7.09,55%,4.06,34%,0.91,66%,57%,0.0
12,Anthony Adams,"6' 1""",185 lbs.,"76""",Orthodox,"Jan 13, 1988",3.17,41%,5.93,44%,0.0,0%,0%,0.0
14,Israel Adesanya,"6' 4""",185 lbs.,"80""",Switch,"Jul 22, 1989",3.95,49%,2.63,61%,0.0,0%,82%,0.3
15,Zarrukh Adashev,"5' 5""",125 lbs.,"65""",Southpaw,"Jul 29, 1992",1.93,23%,3.35,60%,0.0,0%,0%,0.0
18,Mariya Agapova,"5' 6""",125 lbs.,"68""",Southpaw,"Apr 07, 1997",3.7,50%,4.15,47%,1.23,66%,33%,0.6
21,Jessica Aguilar,"5' 3""",115 lbs.,"63""",Orthodox,"May 08, 1982",4.93,50%,7.19,53%,0.94,25%,50%,0.2
22,Kevin Aguilar,"5' 7""",155 lbs.,"73""",Orthodox,"Sep 07, 1988",3.69,37%,4.47,53%,0.19,25%,87%,0.0


In [16]:
# Converting percentage values

percent_features = ['Str_Acc', 'Str_Def', 'TD_Acc', 'TD_Def']
display(fighter_data[percent_features].head(3))

for feature in percent_features:
    fighter_data[feature] = fighter_data[feature].apply(lambda x: int(x.split("%")[0])/100)
    
display(fighter_data[percent_features].head(3))

Unnamed: 0,Str_Acc,Str_Def,TD_Acc,TD_Def
2,44%,58%,24%,47%
6,33%,56%,50%,0%
8,40%,55%,20%,80%


Unnamed: 0,Str_Acc,Str_Def,TD_Acc,TD_Def
2,0.44,0.58,0.24,0.47
6,0.33,0.56,0.5,0.0
8,0.4,0.55,0.2,0.8


In [17]:
# Converting height from feet and inches to cm

fighter_data['Height'] = fighter_data['Height'].apply(
    lambda x: int(x.split("' ")[0])*30.48 + int(x.split(" ")[1].split('"')[0])*2.54)

# Converting weight from pounds to kg

fighter_data['Weight'] = fighter_data['Weight'].apply(
    lambda x: float(x.split(" ")[0])*0.453592)

# Converting reach from inches to cm

fighter_data['Reach'] = fighter_data['Reach'].apply(
    lambda x: float(x.split('"')[0])*2.54)

# Extracting the year of birth from the date of birth

fighter_data['Born_year'] = fighter_data['DOB'].apply(
    lambda x: int(x.split(" ")[2]))

# Encoding stance as a number (1 = Orthodox, 2 = Southpaw, 3 = Switch, 4 = any other value)
fighter_data['Stance'] = fighter_data['Stance'].apply(
    lambda x: 1 if x=="Orthodox" else (
    2 if x == "Southpaw" else (
    3 if x == "Switch" else 4)))
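
The unit conversions can be checked by hand: one foot is 30.48 cm, one inch is 2.54 cm, and one pound is 0.453592 kg. The sketch below restates them as small named functions (hypothetical helpers, not part of the notebook) and applies them to the values listed for Shamil Abdurakhimov in the preview above.

```python
def height_to_cm(height):
    """Convert a height such as 6' 3" (feet and inches) to centimetres."""
    feet, inches = height.replace('"', "").split("' ")
    return int(feet) * 30.48 + int(inches) * 2.54

def pounds_to_kg(weight):
    """Convert a weight such as '235 lbs.' to kilograms."""
    return float(weight.split(" ")[0]) * 0.453592

def reach_to_cm(reach):
    """Convert a reach such as 76" (inches) to centimetres."""
    return float(reach.rstrip('"')) * 2.54

print(height_to_cm("6' 3\""), pounds_to_kg("235 lbs."), reach_to_cm('76"'))
```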

### 2.4 Generating Features in Fighter_Data from Fight_Data

In [18]:
# Deriving each fighter's history of attempts and landed strikes, per minute of fight time
attempts = []
lands = []

for idx, row in fight_data.iterrows():
    attempts.append({
        'fighter': row['R_fighter'],
        'TD_attempts': row['R_TD_tot_attempt']/row['tot_fight_time_min'],
        'TD_lands': row['R_TD_tot_land']/row['tot_fight_time_min'],
        'SIG_STR_attempts': row['R_SIGSTR_tot_attempt']/row['tot_fight_time_min'],
        'SIG_STR_lands': row['R_SIGSTR_tot_land']/row['tot_fight_time_min'],
        'TOT_STR_attempts': row['R_TOT_attempt']/row['tot_fight_time_min'],
        'TOT_STR_lands': row['R_TOT_land']/row['tot_fight_time_min'],
        'HEAD_attempts': row['R_HEAD_tot_attempt']/row['tot_fight_time_min'],
        'HEAD_lands': row['R_HEAD_tot_land']/row['tot_fight_time_min'],
        'BODY_attempts': row['R_BODY_tot_attempt']/row['tot_fight_time_min'],
        'BODY_lands': row['R_BODY_tot_land']/row['tot_fight_time_min'],
        'CLINCH_attempts': row['R_CLINCH_tot_attempt']/row['tot_fight_time_min'],
        'CLINCH_lands': row['R_CLINCH_tot_land']/row['tot_fight_time_min'],
        'GROUND_attempts': row['R_GROUND_tot_attempt']/row['tot_fight_time_min'],
        'GROUND_lands': row['R_GROUND_tot_land']/row['tot_fight_time_min'],
        'LEG_attempts': row['R_LEG_tot_attempt']/row['tot_fight_time_min'],
        'LEG_lands': row['R_LEG_tot_land']/row['tot_fight_time_min']
    })
    attempts.append({
        'fighter': row['B_fighter'],
        'TD_attempts': row['B_TD_tot_attempt']/row['tot_fight_time_min'],
        'TD_lands': row['B_TD_tot_land']/row['tot_fight_time_min'],
        'SIG_STR_attempts': row['B_SIGSTR_tot_attempt']/row['tot_fight_time_min'],
        'SIG_STR_lands': row['B_SIGSTR_tot_land']/row['tot_fight_time_min'],
        'TOT_STR_attempts': row['B_TOT_attempt']/row['tot_fight_time_min'],
        'TOT_STR_lands': row['B_TOT_land']/row['tot_fight_time_min'],
        'HEAD_attempts': row['B_HEAD_tot_attempt']/row['tot_fight_time_min'],
        'HEAD_lands': row['B_HEAD_tot_land']/row['tot_fight_time_min'],
        'BODY_attempts': row['B_BODY_tot_attempt']/row['tot_fight_time_min'],
        'BODY_lands': row['B_BODY_tot_land']/row['tot_fight_time_min'],
        'CLINCH_attempts': row['B_CLINCH_tot_attempt']/row['tot_fight_time_min'],
        'CLINCH_lands': row['B_CLINCH_tot_land']/row['tot_fight_time_min'],
        'GROUND_attempts': row['B_GROUND_tot_attempt']/row['tot_fight_time_min'],
        'GROUND_lands': row['B_GROUND_tot_land']/row['tot_fight_time_min'],
        'LEG_attempts': row['B_LEG_tot_attempt']/row['tot_fight_time_min'],
        'LEG_lands': row['B_LEG_tot_land']/row['tot_fight_time_min']
    })

# Building a DataFrame with the per-fight attempt and landed-strike rates
attempts_df = pd.DataFrame(attempts)

# Computing each fighter's mean attempts and landed strikes
mean_attempts = attempts_df.groupby('fighter').mean().reset_index()

# Renaming the columns for the merge
mean_attempts.columns = ['fighter_name', 'mean_TD_attempt_history', 'mean_TD_land_history', 'mean_SIG_STR_attempt_history',
                         'mean_SIG_STR_land_history', 'mean_TOT_STR_attempt_history', 'mean_TOT_STR_land_history',
                         'mean_HEAD_attempt_history', 'mean_HEAD_land_history', 'mean_BODY_attempt_history',
                         'mean_BODY_land_history', 'mean_CLINCH_attempt_history', 'mean_CLINCH_land_history',
                         'mean_GROUND_attempt_history', 'mean_GROUND_land_history', 'mean_LEG_attempt_history',
                         'mean_LEG_land_history']

# Merging with fighter_data
fighter_data = fighter_data.merge(mean_attempts, on='fighter_name', how='left')

# Filling possible NaN values in the history columns with 0
history_cols = list(mean_attempts.columns[1:])
fighter_data[history_cols] = fighter_data[history_cols].fillna(0)

# Checking that the columns were added correctly
print(fighter_data[['fighter_name', 'mean_TD_attempt_history', 'mean_TD_land_history']].head())

          fighter_name  mean_TD_attempt_history  mean_TD_land_history
0  Shamil Abdurakhimov                 0.260903              0.051738
1           Daichi Abe                 0.044444              0.022222
2        Klidson Abreu                 0.222222              0.044444
3           Juan Adams                 0.378926              0.045593
4        Anthony Adams                 0.000000              0.000000
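
The history features are per-minute rates averaged over every fight in which a fighter appears, in either corner. A compact sketch of that aggregation for a single statistic, on toy data with hypothetical fighters `A`, `B`, and `C`, stacks the red and blue sides with `pd.concat` instead of iterating over rows:

```python
import pandas as pd

# Toy fights with hypothetical fighters and values
toy = pd.DataFrame({
    "R_fighter": ["A", "A"],
    "B_fighter": ["B", "C"],
    "R_TD_tot_attempt": [3, 1],
    "B_TD_tot_attempt": [0, 2],
    "tot_fight_time_min": [15.0, 5.0],
})

# One long table with a row per (fight, fighter), holding the per-minute rate
sides = []
for corner in ("R", "B"):
    sides.append(pd.DataFrame({
        "fighter_name": toy[corner + "_fighter"],
        "TD_attempts": toy[corner + "_TD_tot_attempt"] / toy["tot_fight_time_min"],
    }))

# Mean rate per fighter across all of their fights
mean_history = (pd.concat(sides, ignore_index=True)
                  .groupby("fighter_name")["TD_attempts"].mean()
                  .rename("mean_TD_attempt_history")
                  .reset_index())
print(mean_history)
```

Fighter `A` attempts 3 takedowns in 15 minutes and 1 in 5 minutes, so both fights contribute a rate of 0.2 per minute and the mean is 0.2.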


## 3. Selecting Features and Merging the Tables

In [19]:
fighter_data

Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB,SLpM,Str_Acc,SApM,Str_Def,...,mean_HEAD_attempt_history,mean_HEAD_land_history,mean_BODY_attempt_history,mean_BODY_land_history,mean_CLINCH_attempt_history,mean_CLINCH_land_history,mean_GROUND_attempt_history,mean_GROUND_land_history,mean_LEG_attempt_history,mean_LEG_land_history
0,Shamil Abdurakhimov,190.50,106.59412,193.04,1,"Sep 02, 1981",2.45,0.44,2.45,0.58,...,4.432514,1.762322,0.749536,0.449223,0.886185,0.551604,0.246791,0.175089,0.560042,0.423069
1,Daichi Abe,180.34,77.11064,180.34,1,"Nov 27, 1991",3.80,0.33,4.49,0.56,...,8.977778,2.266667,0.600000,0.266667,0.444444,0.244444,0.400000,0.311111,1.711111,1.266667
2,Klidson Abreu,182.88,92.98636,187.96,1,"Dec 24, 1992",2.05,0.40,2.90,0.55,...,3.311111,0.844444,0.955556,0.422222,0.311111,0.200000,0.022222,0.000000,0.822222,0.755556
3,Juan Adams,195.58,120.20188,203.20,1,"Jan 16, 1992",7.09,0.55,4.06,0.34,...,8.457191,3.998415,1.069906,0.784472,0.941945,0.796353,1.019706,0.814539,0.558800,0.409953
4,Anthony Adams,185.42,83.91452,193.04,1,"Jan 13, 1988",3.17,0.41,5.93,0.44,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1656,Zhalgas Zhumagulov,162.56,56.69900,167.64,3,"Aug 29, 1988",4.17,0.49,4.00,0.58,...,4.833333,1.366667,2.200000,1.600000,0.133333,0.066667,0.133333,0.033333,1.333333,1.200000
1657,Fares Ziam,185.42,70.30676,190.50,1,"Mar 21, 1997",1.90,0.47,1.37,0.66,...,3.400000,1.333333,0.200000,0.166667,0.400000,0.266667,0.366667,0.200000,0.400000,0.400000
1658,Cat Zingano,167.64,65.77084,172.72,2,"Jul 01, 1982",2.57,0.61,1.63,0.47,...,2.231910,1.149322,1.214003,0.471823,0.543089,0.452775,1.151321,0.791095,1.000560,0.671911
1659,Zhang Tiequan,172.72,70.30676,175.26,1,"Jul 25, 1978",1.23,0.36,2.14,0.51,...,5.810392,1.968876,0.050000,0.033333,0.312500,0.000000,0.016667,0.000000,0.285542,0.218876
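
The merge in this section attaches each fighter's attributes to a fight twice: once for the red corner and once for the blue corner. The sketch below shows that pattern on toy data with hypothetical fighters, using `add_prefix` as a shorter alternative to an explicit rename dictionary; `side_table` is a hypothetical helper name.

```python
import pandas as pd

# Toy tables with hypothetical fighters
fighters = pd.DataFrame({"fighter_name": ["A", "B"], "Height": [180.34, 190.5]})
fights = pd.DataFrame({"R_fighter": ["A"], "B_fighter": ["B"], "WinnerColor": ["R"]})

def side_table(fighters, corner):
    """Prefix every column with the corner and restore the merge key name."""
    prefixed = fighters.add_prefix(corner + "_")
    return prefixed.rename(columns={corner + "_fighter_name": corner + "_fighter"})

# Left merges keep every fight, even when a fighter has no attribute row
merged = (fights
          .merge(side_table(fighters, "R"), on="R_fighter", how="left")
          .merge(side_table(fighters, "B"), on="B_fighter", how="left"))
print(merged)
```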


In [20]:
# Assuming fight_data and fighter_data are already defined and available

# Renaming the fighter_data columns for the merge
fighter_data_red = fighter_data.rename(columns={
    'fighter_name': 'R_fighter', 'Height': 'R_Height', 'Weight': 'R_Weight', 'Reach': 'R_Reach', 'Stance': 'R_Stance',
    'DOB': 'R_DOB', 'SLpM': 'R_SLpM', 'Str_Acc': 'R_Str_Acc', 'SApM': 'R_SApM', 'Str_Def': 'R_Str_Def',
    'TD_Avg': 'R_TD_Avg', 'TD_Acc': 'R_TD_Acc', 'TD_Def': 'R_TD_Def', 'Sub_Avg': 'R_Sub_Avg', 'Born_year': 'R_Born_year',
    'mean_TD_attempt_history': 'R_mean_TD_attempt_history', 'mean_TD_land_history': 'R_mean_TD_land_history',
    'mean_SIG_STR_attempt_history': 'R_mean_SIG_STR_attempt_history', 'mean_SIG_STR_land_history': 'R_mean_SIG_STR_land_history',
    'mean_TOT_STR_attempt_history': 'R_mean_TOT_STR_attempt_history', 'mean_TOT_STR_land_history': 'R_mean_TOT_STR_land_history',
    'mean_HEAD_attempt_history': 'R_mean_HEAD_attempt_history', 'mean_HEAD_land_history': 'R_mean_HEAD_land_history',
    'mean_BODY_attempt_history': 'R_mean_BODY_attempt_history', 'mean_BODY_land_history': 'R_mean_BODY_land_history',
    'mean_CLINCH_attempt_history': 'R_mean_CLINCH_attempt_history', 'mean_CLINCH_land_history': 'R_mean_CLINCH_land_history',
    'mean_GROUND_attempt_history': 'R_mean_GROUND_attempt_history', 'mean_GROUND_land_history': 'R_mean_GROUND_land_history',
    'mean_LEG_attempt_history': 'R_mean_LEG_attempt_history', 'mean_LEG_land_history': 'R_mean_LEG_land_history'
})

fighter_data_blue = fighter_data.rename(columns={
    'fighter_name': 'B_fighter', 'Height': 'B_Height', 'Weight': 'B_Weight', 'Reach': 'B_Reach', 'Stance': 'B_Stance',
    'DOB': 'B_DOB', 'SLpM': 'B_SLpM', 'Str_Acc': 'B_Str_Acc', 'SApM': 'B_SApM', 'Str_Def': 'B_Str_Def',
    'TD_Avg': 'B_TD_Avg', 'TD_Acc': 'B_TD_Acc', 'TD_Def': 'B_TD_Def', 'Sub_Avg': 'B_Sub_Avg', 'Born_year': 'B_Born_year',
    'mean_TD_attempt_history': 'B_mean_TD_attempt_history', 'mean_TD_land_history': 'B_mean_TD_land_history',
    'mean_SIG_STR_attempt_history': 'B_mean_SIG_STR_attempt_history', 'mean_SIG_STR_land_history': 'B_mean_SIG_STR_land_history',
    'mean_TOT_STR_attempt_history': 'B_mean_TOT_STR_attempt_history', 'mean_TOT_STR_land_history': 'B_mean_TOT_STR_land_history',
    'mean_HEAD_attempt_history': 'B_mean_HEAD_attempt_history', 'mean_HEAD_land_history': 'B_mean_HEAD_land_history',
    'mean_BODY_attempt_history': 'B_mean_BODY_attempt_history', 'mean_BODY_land_history': 'B_mean_BODY_land_history',
    'mean_CLINCH_attempt_history': 'B_mean_CLINCH_attempt_history', 'mean_CLINCH_land_history': 'B_mean_CLINCH_land_history',
    'mean_GROUND_attempt_history': 'B_mean_GROUND_attempt_history', 'mean_GROUND_land_history': 'B_mean_GROUND_land_history',
    'mean_LEG_attempt_history': 'B_mean_LEG_attempt_history', 'mean_LEG_land_history': 'B_mean_LEG_land_history'
})

# Merging fight_data with fighter_data for the red and blue fighters
fight_data_merged_red = fight_data.merge(fighter_data_red, on='R_fighter', how='left')
fight_data_merged = fight_data_merged_red.merge(fighter_data_blue, on='B_fighter', how='left')

# Computing each fighter's age from the year of birth and the year of the fight
fight_data_merged['R_age'] = fight_data_merged['date'] - fight_data_merged['R_Born_year']
fight_data_merged['B_age'] = fight_data_merged['date'] - fight_data_merged['B_Born_year']

# Selecting the final columns
fight_features = ['R_fighter', 'B_fighter', 'WinnerColor']
fighter_features = ['R_age', 'B_age', 
                   'R_Height', 'R_Weight', 'R_Reach', 'R_Stance', 
                   'B_Height', 'B_Weight', 'B_Reach', 'B_Stance',
                    'R_SLpM', 'R_Str_Acc', 'R_SApM', 'R_Str_Def',
                    'R_TD_Avg', 'R_TD_Acc', 'R_TD_Def', 'R_Sub_Avg',
                    'B_SLpM', 'B_Str_Acc', 'B_SApM', 'B_Str_Def',
                    'B_TD_Avg', 'B_TD_Acc', 'B_TD_Def', 'B_Sub_Avg',
                   'R_mean_TD_attempt_history', 'R_mean_TD_land_history',
                   'B_mean_TD_attempt_history', 'B_mean_TD_land_history',
                   'R_mean_SIG_STR_attempt_history', 'R_mean_SIG_STR_land_history',
                   'B_mean_SIG_STR_attempt_history', 'B_mean_SIG_STR_land_history',
                   'R_mean_TOT_STR_attempt_history', 'R_mean_TOT_STR_land_history',
                   'B_mean_TOT_STR_attempt_history', 'B_mean_TOT_STR_land_history',
                   'R_mean_HEAD_attempt_history', 'R_mean_HEAD_land_history',
                   'B_mean_HEAD_attempt_history', 'B_mean_HEAD_land_history',
                   'R_mean_BODY_attempt_history', 'R_mean_BODY_land_history',
                   'B_mean_BODY_attempt_history', 'B_mean_BODY_land_history',
                   'R_mean_CLINCH_attempt_history', 'R_mean_CLINCH_land_history',
                   'B_mean_CLINCH_attempt_history', 'B_mean_CLINCH_land_history',
                   'R_mean_GROUND_attempt_history', 'R_mean_GROUND_land_history',
                   'B_mean_GROUND_attempt_history', 'B_mean_GROUND_land_history',
                   'R_mean_LEG_attempt_history', 'R_mean_LEG_land_history',
                   'B_mean_LEG_attempt_history', 'B_mean_LEG_land_history']

features = fight_features + fighter_features

final_data = fight_data_merged[features]

# Preview the final data
final_data.head()


,R_fighter,B_fighter,WinnerColor,R_age,B_age,R_Height,R_Weight,R_Reach,R_Stance,B_Height,...,B_mean_CLINCH_attempt_history,B_mean_CLINCH_land_history,R_mean_GROUND_attempt_history,R_mean_GROUND_land_history,B_mean_GROUND_attempt_history,B_mean_GROUND_land_history,R_mean_LEG_attempt_history,R_mean_LEG_land_history,B_mean_LEG_attempt_history,B_mean_LEG_land_history
0,Adrian Yanez,Gustavo Lopez,R,28.0,32.0,170.18,61.23492,177.8,1.0,165.1,...,1.170007,0.702386,0.047847,0.0,1.104294,0.736196,0.047847,0.047847,0.260925,0.216481
1,Trevin Giles,Roman Dolidze,R,29.0,33.0,182.88,83.91452,187.96,1.0,187.96,...,0.155556,0.088889,1.21821,0.958295,0.69281,0.559477,0.134693,0.125169,1.25098,1.150327
2,Tai Tuivasa,Harry Hunsucker,R,28.0,32.0,187.96,119.748288,190.5,2.0,187.96,...,0.0,0.0,1.38465,0.895102,0.0,0.0,1.430054,1.289705,1.22449,1.22449
3,Cheyanne Buys,Montserrat Conejo,B,26.0,28.0,160.02,52.16308,160.02,3.0,152.4,...,0.0,0.0,0.466667,0.2,0.0,0.0,0.0,0.0,0.333333,0.333333
4,Marion Reneau,Macy Chiasson,B,44.0,30.0,167.64,61.23492,172.72,1.0,180.34,...,1.859529,1.396137,0.608257,0.337497,2.666153,1.663602,0.445182,0.366394,0.226056,0.203834


In [21]:
final_data = final_data.dropna()

In [22]:
final_data.head(10)

,R_fighter,B_fighter,WinnerColor,R_age,B_age,R_Height,R_Weight,R_Reach,R_Stance,B_Height,...,B_mean_CLINCH_attempt_history,B_mean_CLINCH_land_history,R_mean_GROUND_attempt_history,R_mean_GROUND_land_history,B_mean_GROUND_attempt_history,B_mean_GROUND_land_history,R_mean_LEG_attempt_history,R_mean_LEG_land_history,B_mean_LEG_attempt_history,B_mean_LEG_land_history
0,Adrian Yanez,Gustavo Lopez,R,28.0,32.0,170.18,61.23492,177.8,1.0,165.1,...,1.170007,0.702386,0.047847,0.0,1.104294,0.736196,0.047847,0.047847,0.260925,0.216481
1,Trevin Giles,Roman Dolidze,R,29.0,33.0,182.88,83.91452,187.96,1.0,187.96,...,0.155556,0.088889,1.21821,0.958295,0.69281,0.559477,0.134693,0.125169,1.25098,1.150327
2,Tai Tuivasa,Harry Hunsucker,R,28.0,32.0,187.96,119.748288,190.5,2.0,187.96,...,0.0,0.0,1.38465,0.895102,0.0,0.0,1.430054,1.289705,1.22449,1.22449
3,Cheyanne Buys,Montserrat Conejo,B,26.0,28.0,160.02,52.16308,160.02,3.0,152.4,...,0.0,0.0,0.466667,0.2,0.0,0.0,0.0,0.0,0.333333,0.333333
4,Marion Reneau,Macy Chiasson,B,44.0,30.0,167.64,61.23492,172.72,1.0,180.34,...,1.859529,1.396137,0.608257,0.337497,2.666153,1.663602,0.445182,0.366394,0.226056,0.203834
5,Leonardo Santos,Grant Dawson,B,41.0,27.0,182.88,70.30676,190.5,1.0,177.8,...,0.655823,0.482102,1.002996,0.789969,1.390293,0.889016,0.706369,0.606165,0.77472,0.654511
6,Song Kenan,Max Griffin,B,31.0,36.0,182.88,77.11064,180.34,1.0,180.34,...,0.740188,0.439178,8.191087,8.156899,2.451723,1.284736,0.99011,0.713065,0.52028,0.482433
7,Derek Brunson,Kevin Holland,R,37.0,29.0,185.42,83.91452,195.58,2.0,190.5,...,1.172587,1.099174,2.872074,1.69439,2.422779,2.067505,0.923808,0.569649,0.882136,0.805741
8,Montel Jackson,Jesse Strader,R,29.0,30.0,177.8,61.23492,190.5,2.0,170.18,...,0.508475,0.0,1.46968,1.070998,1.016949,1.016949,0.569492,0.536158,5.59322,4.067797
10,Misha Cirkunov,Ryan Spann,B,34.0,30.0,190.5,92.98636,195.58,1.0,195.58,...,0.418369,0.302666,0.536414,0.508891,3.113395,2.37176,0.310047,0.294583,0.089851,0.07874
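
The output above jumps from index 8 to index 10, so `dropna()` discarded at least one fight, presumably because a left merge found no matching fighter row or because one of the statistics was missing. A minimal sketch of how such rows could be counted before dropping them is shown here on toy frames; the names `fights` and `fighters` and all their values are hypothetical stand-ins, not the project's data.

```python
import pandas as pd

# Toy stand-ins for fight_data and fighter_data_red (hypothetical values).
fights = pd.DataFrame({"R_fighter": ["A", "B", "C"], "WinnerColor": ["R", "B", "R"]})
fighters = pd.DataFrame({"R_fighter": ["A", "B"], "R_Reach": [177.8, None]})

# indicator=True adds a `_merge` column telling where each row came from.
merged = fights.merge(fighters, on="R_fighter", how="left", indicator=True)

# Fights whose fighter has no row at all in the fighter table.
unmatched = int((merged["_merge"] == "left_only").sum())

# Fights that dropna() would remove, for any reason.
features = merged.drop(columns="_merge")
rows_with_nan = int(features.isna().any(axis=1).sum())
rows_kept = len(features.dropna())

print(f"unmatched fighters:  {unmatched}")
print(f"rows with any NaN:   {rows_with_nan}")
print(f"rows kept by dropna: {rows_kept}")
```

Comparing `unmatched` with `rows_with_nan` separates fights lost to name mismatches from fights lost to missing statistics, which may call for different remedies.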


In [23]:
y = final_data.iloc[:, 2]
X = final_data.drop('WinnerColor', axis='columns')
X = X.drop('R_fighter', axis='columns')
X = X.drop('B_fighter', axis='columns')

In [24]:
X = X.apply(pd.to_numeric)

In [25]:
y = y.apply(lambda x: 0 if x=="R" else 1)

In [26]:
display(y.head(5))
display(X.head(5))

0    0
1    0
2    0
3    1
4    1
Name: WinnerColor, dtype: int64

,R_age,B_age,R_Height,R_Weight,R_Reach,R_Stance,B_Height,B_Weight,B_Reach,B_Stance,...,B_mean_CLINCH_attempt_history,B_mean_CLINCH_land_history,R_mean_GROUND_attempt_history,R_mean_GROUND_land_history,B_mean_GROUND_attempt_history,B_mean_GROUND_land_history,R_mean_LEG_attempt_history,R_mean_LEG_land_history,B_mean_LEG_attempt_history,B_mean_LEG_land_history
0,28.0,32.0,170.18,61.23492,177.8,1.0,165.1,61.23492,170.18,1.0,...,1.170007,0.702386,0.047847,0.0,1.104294,0.736196,0.047847,0.047847,0.260925,0.216481
1,29.0,33.0,182.88,83.91452,187.96,1.0,187.96,92.98636,193.04,1.0,...,0.155556,0.088889,1.21821,0.958295,0.69281,0.559477,0.134693,0.125169,1.25098,1.150327
2,28.0,32.0,187.96,119.748288,190.5,2.0,187.96,109.315672,190.5,1.0,...,0.0,0.0,1.38465,0.895102,0.0,0.0,1.430054,1.289705,1.22449,1.22449
3,26.0,28.0,160.02,52.16308,160.02,3.0,152.4,52.16308,154.94,2.0,...,0.0,0.0,0.466667,0.2,0.0,0.0,0.0,0.0,0.333333,0.333333
4,44.0,30.0,167.64,61.23492,172.72,1.0,180.34,61.23492,182.88,1.0,...,1.859529,1.396137,0.608257,0.337497,2.666153,1.663602,0.445182,0.366394,0.226056,0.203834


TESTING MODELS

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
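
The classes are not balanced: the constant-Red baseline reported later in this notebook reaches an accuracy of 0.624 on the test set. One option, not used in the split above, is to pass `stratify=y` so that the train and test sets keep the same class proportions. A minimal sketch on synthetic data follows; `X_demo` and `y_demo` are hypothetical and only mimic a roughly 62/38 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))              # synthetic features
y_demo = (rng.random(1000) < 0.38).astype(int)   # about 38% class 1 (Blue), synthetic

# stratify=y_demo keeps the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"class-1 share, train: {y_tr.mean():.3f}")
print(f"class-1 share, test:  {y_te.mean():.3f}")
```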

In [28]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.dummy import DummyRegressor

preprocessing_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ("poly", PolynomialFeatures(
            degree=2,
            include_bias=False)
        )
        ])

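# Trivial baseline: a DummyRegressor that always predicts the constant 0, i.e. Red.
# strategy='constant' is required; with the default strategy ('mean') the
# `constant` argument is ignored. This step is replaced during the grid search.
pipe = Pipeline([('preprocessor', preprocessing_pipeline),
                ('regressor', DummyRegressor(strategy='constant', constant=0))])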

In [29]:
import xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.linear_model import LogisticRegression

num_splits = 50

param_grid = [
    {
        'regressor': [RandomForestClassifier()],
        'regressor__n_estimators': [10, 100],
        'regressor__max_depth': [None, 5, 10],
    },
    {
        'regressor': [LogisticRegression()],
        'regressor__C': np.logspace(-3, 3, 7),
        'regressor__penalty': [None, 'l2'],
    },
    {
        'regressor': [xgboost.XGBClassifier()],
        'regressor__n_estimators': [10, 100],
        'regressor__max_depth': [2, 5],
    },
]

test_fraction = 0.2
num_samples_total = len(y_train)
num_samples_test = int(test_fraction * num_samples_total)
num_samples_train = num_samples_total - num_samples_test

grid = GridSearchCV(
    pipe,
    param_grid,
    cv=ShuffleSplit(
        n_splits=num_splits,
        test_size=num_samples_test,
        random_state=42,
    ),
    n_jobs=-1,
    scoring='roc_auc',
)

grid.fit(X_train, y_train)
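
The parameter grid above relies on a scikit-learn feature that is easy to miss: a pipeline step can itself be a grid parameter, so each dictionary swaps a different estimator into the `regressor` slot and tunes only that estimator's hyperparameters. A reduced sketch of the same pattern on synthetic data is given below; the step name `model`, the data, and the choice of `DecisionTreeClassifier` are illustrative assumptions, not the project's setup.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))                                     # synthetic features
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # synthetic labels

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", DummyClassifier()),  # placeholder, replaced by each grid entry
])

# One dictionary per estimator family; keys prefixed with "model__"
# are forwarded to whichever estimator currently fills the "model" step.
demo_grid = [
    {"model": [LogisticRegression()], "model__C": [0.1, 1.0]},
    {"model": [DecisionTreeClassifier(random_state=0)], "model__max_depth": [2, 4]},
]

search = GridSearchCV(
    demo_pipe,
    demo_grid,
    cv=ShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
best_name = type(search.best_estimator_.named_steps["model"]).__name__
print(n_candidates, best_name, round(search.best_score_, 3))
```

With `scoring="roc_auc"` every per-split score lies between 0 and 1. Negative per-split scores would normally point to a loss-style scorer (one of the `neg_*` scorers) having been active when the results were produced.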

In [None]:
results_df = pd.DataFrame(grid.cv_results_) \
    .sort_values(by='rank_test_score')

results_df = results_df \
    .set_index(
        results_df["params"] \
            .apply(lambda x: "_".join(str(val) for val in x.values()))
    ) \
    .rename_axis("model")

model_scores = results_df.filter(regex=r"split\d*_test_score")
model_scores

model,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,...,split40_test_score,split41_test_score,split42_test_score,split43_test_score,split44_test_score,split45_test_score,split46_test_score,split47_test_score,split48_test_score,split49_test_score
LogisticRegression()_0.001_l2,-0.569064,-0.580334,-0.559885,-0.566783,-0.557567,-0.562194,-0.561041,-0.585888,-0.585888,-0.591389,...,-0.579217,-0.578098,-0.594665,-0.576976,-0.581449,-0.603316,-0.574727,-0.592483,-0.575853,-0.549376
LogisticRegression()_0.01_l2,-0.581449,-0.570201,-0.575853,-0.571336,-0.579217,-0.570201,-0.569064,-0.580334,-0.558727,-0.589195,...,-0.595754,-0.561041,-0.582562,-0.582562,-0.581449,-0.608659,-0.56564,-0.589195,-0.582562,-0.570201
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, device=None, early_stopping_rounds=None,\n enable_categorical=False, eval_metric=None, feature_types=None,\n gamma=None, grow_policy=None, importance_type=None,\n interaction_constraints=None, learning_rate=None, max_bin=None,\n max_cat_threshold=None, max_cat_to_onehot=None,\n max_delta_step=None, max_depth=None, max_leaves=None,\n min_child_weight=None, missing=nan, monotone_constraints=None,\n multi_strategy=None, n_estimators=None, n_jobs=None,\n num_parallel_tree=None, random_state=None, ...)_2_100",-0.574727,-0.574727,-0.558727,-0.601165,-0.561041,-0.552901,-0.570201,-0.586992,-0.586992,-0.589195,...,-0.571336,-0.563345,-0.599006,-0.585888,-0.575853,-0.623379,-0.580334,-0.600086,-0.583672,-0.586992
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, device=None, early_stopping_rounds=None,\n enable_categorical=False, eval_metric=None, feature_types=None,\n gamma=None, grow_policy=None, importance_type=None,\n interaction_constraints=None, learning_rate=None, max_bin=None,\n max_cat_threshold=None, max_cat_to_onehot=None,\n max_delta_step=None, max_depth=None, max_leaves=None,\n min_child_weight=None, missing=nan, monotone_constraints=None,\n multi_strategy=None, n_estimators=None, n_jobs=None,\n num_parallel_tree=None, random_state=None, ...)_2_10",-0.591389,-0.591389,-0.564493,-0.588094,-0.571336,-0.572469,-0.578098,-0.589195,-0.580334,-0.595754,...,-0.571336,-0.578098,-0.584781,-0.590293,-0.592483,-0.602241,-0.582562,-0.594665,-0.581449,-0.574727
RandomForestClassifier()_10_100,-0.601165,-0.581449,-0.566783,-0.585888,-0.566783,-0.551728,-0.570201,-0.594665,-0.594665,-0.590293,...,-0.569064,-0.584781,-0.599006,-0.585888,-0.593575,-0.621297,-0.572469,-0.585888,-0.585888,-0.570201
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, device=None, early_stopping_rounds=None,\n enable_categorical=False, eval_metric=None, feature_types=None,\n gamma=None, grow_policy=None, importance_type=None,\n interaction_constraints=None, learning_rate=None, max_bin=None,\n max_cat_threshold=None, max_cat_to_onehot=None,\n max_delta_step=None, max_depth=None, max_leaves=None,\n min_child_weight=None, missing=nan, monotone_constraints=None,\n multi_strategy=None, n_estimators=None, n_jobs=None,\n num_parallel_tree=None, random_state=None, ...)_5_100",-0.591389,-0.594665,-0.59684,-0.610784,-0.581449,-0.54106,-0.601165,-0.59684,-0.591389,-0.583672,...,-0.562194,-0.573599,-0.606528,-0.605459,-0.610784,-0.625453,-0.590293,-0.602241,-0.590293,-0.574727
RandomForestClassifier()_None_100,-0.578098,-0.586992,-0.570201,-0.594665,-0.571336,-0.559885,-0.590293,-0.594665,-0.605459,-0.590293,...,-0.550553,-0.592483,-0.600086,-0.581449,-0.603316,-0.616063,-0.574727,-0.588094,-0.590293,-0.569064
LogisticRegression()_0.1_l2,-0.606528,-0.589195,-0.588094,-0.574727,-0.589195,-0.584781,-0.593575,-0.580334,-0.580334,-0.602241,...,-0.606528,-0.573599,-0.597924,-0.59684,-0.599006,-0.599006,-0.584781,-0.604388,-0.589195,-0.591389
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, device=None, early_stopping_rounds=None,\n enable_categorical=False, eval_metric=None, feature_types=None,\n gamma=None, grow_policy=None, importance_type=None,\n interaction_constraints=None, learning_rate=None, max_bin=None,\n max_cat_threshold=None, max_cat_to_onehot=None,\n max_delta_step=None, max_depth=None, max_leaves=None,\n min_child_weight=None, missing=nan, monotone_constraints=None,\n multi_strategy=None, n_estimators=None, n_jobs=None,\n num_parallel_tree=None, random_state=None, ...)_5_10",-0.579217,-0.586992,-0.575853,-0.629582,-0.593575,-0.582562,-0.589195,-0.612901,-0.604388,-0.607594,...,-0.562194,-0.583672,-0.599006,-0.607594,-0.594665,-0.622339,-0.609723,-0.604388,-0.593575,-0.592483
RandomForestClassifier()_5_100,-0.603316,-0.584781,-0.586992,-0.584781,-0.586992,-0.564493,-0.588094,-0.601165,-0.604388,-0.604388,...,-0.566783,-0.595754,-0.597924,-0.590293,-0.599006,-0.637758,-0.580334,-0.593575,-0.600086,-0.586992


In [None]:
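# Mean score per model, plus the standard error of that mean across the splits
mean_perf = model_scores.agg(['mean', 'std'], axis=1)
mean_perf['std'] = mean_perf['std'] / np.sqrt(num_splits)  # 'std' now holds the standard error
mean_perf = mean_perf.sort_values('mean', ascending=False)
mean_perf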

model,mean,std
LogisticRegression()_0.001_l2,-0.576871,0.001883
LogisticRegression()_0.01_l2,-0.577126,0.001968
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, device=None, early_stopping_rounds=None,\n enable_categorical=False, eval_metric=None, feature_types=None,\n gamma=None, grow_policy=None, importance_type=None,\n interaction_constraints=None, learning_rate=None, max_bin=None,\n max_cat_threshold=None, max_cat_to_onehot=None,\n max_delta_step=None, max_depth=None, max_leaves=None,\n min_child_weight=None, missing=nan, monotone_constraints=None,\n multi_strategy=None, n_estimators=None, n_jobs=None,\n num_parallel_tree=None, random_state=None, ...)_2_100",-0.581633,0.002021
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, device=None, early_stopping_rounds=None,\n enable_categorical=False, eval_metric=None, feature_types=None,\n gamma=None, grow_policy=None, importance_type=None,\n interaction_constraints=None, learning_rate=None, max_bin=None,\n max_cat_threshold=None, max_cat_to_onehot=None,\n max_delta_step=None, max_depth=None, max_leaves=None,\n min_child_weight=None, missing=nan, monotone_constraints=None,\n multi_strategy=None, n_estimators=None, n_jobs=None,\n num_parallel_tree=None, random_state=None, ...)_2_10",-0.584061,0.001538
RandomForestClassifier()_10_100,-0.584594,0.001984
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, device=None, early_stopping_rounds=None,\n enable_categorical=False, eval_metric=None, feature_types=None,\n gamma=None, grow_policy=None, importance_type=None,\n interaction_constraints=None, learning_rate=None, max_bin=None,\n max_cat_threshold=None, max_cat_to_onehot=None,\n max_delta_step=None, max_depth=None, max_leaves=None,\n min_child_weight=None, missing=nan, monotone_constraints=None,\n multi_strategy=None, n_estimators=None, n_jobs=None,\n num_parallel_tree=None, random_state=None, ...)_5_100",-0.586853,0.002215
RandomForestClassifier()_None_100,-0.587108,0.002024
LogisticRegression()_0.1_l2,-0.590577,0.001786
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, device=None, early_stopping_rounds=None,\n enable_categorical=False, eval_metric=None, feature_types=None,\n gamma=None, grow_policy=None, importance_type=None,\n interaction_constraints=None, learning_rate=None, max_bin=None,\n max_cat_threshold=None, max_cat_to_onehot=None,\n max_delta_step=None, max_depth=None, max_leaves=None,\n min_child_weight=None, missing=nan, monotone_constraints=None,\n multi_strategy=None, n_estimators=None, n_jobs=None,\n num_parallel_tree=None, random_state=None, ...)_5_10",-0.593506,0.001804
RandomForestClassifier()_5_100,-0.593891,0.001995


In [None]:
import numpy as np

from scipy.stats import t


def corrected_std(differences, n_train, n_test):
    """Corrects standard deviation using Nadeau and Bengio's approach.

    Parameters
    ----------
    differences : ndarray of shape (n_samples,)
        Vector containing the differences in the score metrics of two models.
    n_train : int
        Number of samples in the training set.
    n_test : int
        Number of samples in the testing set.

    Returns
    -------
    corrected_std : float
        Variance-corrected standard deviation of the set of differences.
    """
    # kr is the number of times the model was evaluated: k folds times r
    # repetitions in repeated k-fold CV, or the number of ShuffleSplit splits here
    kr = len(differences)
    corrected_var = np.var(differences, ddof=1) * (1 / kr + n_test / n_train)
    corrected_std = np.sqrt(corrected_var)
    return corrected_std


def compute_corrected_ttest(differences, df, n_train, n_test):
    """Computes right-tailed paired t-test with corrected variance.

    Parameters
    ----------
    differences : array-like of shape (n_samples,)
        Vector containing the differences in the score metrics of two models.
    df : int
        Degrees of freedom.
    n_train : int
        Number of samples in the training set.
    n_test : int
        Number of samples in the testing set.

    Returns
    -------
    t_stat : float
        Variance-corrected t-statistic.
    p_val : float
        Variance-corrected p-value.
    """
    mean = np.mean(differences)
    std = corrected_std(differences, n_train, n_test)
    t_stat = mean / std
    p_val = t.sf(np.abs(t_stat), df)  # right-tailed t-test
    return t_stat, p_val

In [None]:
model_1_scores = model_scores.iloc[0].values  # scores of the best model
model_2_scores = model_scores.iloc[1].values  # scores of the second-best model

differences = model_1_scores - model_2_scores

n = differences.shape[0]  # number of test sets
dof = n - 1

t_stat, p_val = compute_corrected_ttest(
    differences,
    dof,
    num_samples_train,
    num_samples_test,
)

print(f"Corrected t-statistic: {t_stat:.3f}")
print(f"Corrected p-value: {p_val:.3f}")

Corrected t-statistic: 0.034
Corrected p-value: 0.486
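
To see why the correction matters, compare the variance factor `1 / kr` of an uncorrected paired t-test with the factor `1 / kr + n_test / n_train` used in `corrected_std`. The sketch below plugs in the settings of this notebook (50 splits, 20% held out per split); the total of 1000 training samples is a hypothetical round number, and only the ratio `n_test / n_train` affects the result.

```python
import math

n_total = 1000                 # hypothetical training-set size
n_test = int(0.2 * n_total)    # mirrors test_fraction = 0.2
n_train = n_total - n_test
kr = 50                        # mirrors num_splits = 50

naive_factor = 1 / kr                          # assumes independent splits
corrected_factor = 1 / kr + n_test / n_train   # Nadeau and Bengio correction

# How much larger the corrected standard deviation is than the naive one.
inflation = math.sqrt(corrected_factor / naive_factor)

print(f"n_test / n_train = {n_test / n_train:.2f}")
print(f"std inflation    = {inflation:.2f}")
```

Under these assumptions the corrected standard deviation is roughly 3.7 times the naive one, so t-statistics shrink by the same factor. This is consistent with the small t-statistic reported above for the two best models.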


In [None]:
# Test all models against the best model.
best_model_scores = model_scores.iloc[0].values

n_comparisons = model_scores.shape[0] - 1

pairwise_t_test = []

for model_i in range(1, len(model_scores)):
    model_i_scores = model_scores.iloc[model_i].values
    differences = model_i_scores - best_model_scores
    t_stat, p_val = compute_corrected_ttest(
        differences,
        dof,
        num_samples_train,
        num_samples_test,
    )

    # Implement Bonferroni correction
    p_val *= n_comparisons

    # Bonferroni can output p-values higher than 1
    p_val = 1 if p_val > 1 else p_val

    pairwise_t_test.append([
        model_scores.index[0],
        model_scores.index[model_i],
        t_stat,
        p_val,
    ])

pairwise_comp_df = pd.DataFrame(
    pairwise_t_test,
    columns=["model_1", "model_2", "t_stat", "p_val"],
).round(3)

In [None]:
print(f'Best model: {model_scores.index[0]}')

Best model: LogisticRegression()_0.001_l2


In [None]:
pairwise_comp_df

,model_1,model_2,t_stat,p_val
0,LogisticRegression()_0.001_l2,LogisticRegression()_0.01_l2,-0.034,1.0
1,LogisticRegression()_0.001_l2,"XGBClassifier(base_score=None, booster=None, c...",-0.71,1.0
2,LogisticRegression()_0.001_l2,"XGBClassifier(base_score=None, booster=None, c...",-1.431,1.0
3,LogisticRegression()_0.001_l2,RandomForestClassifier()_10_100,-1.481,1.0
4,LogisticRegression()_0.001_l2,"XGBClassifier(base_score=None, booster=None, c...",-1.101,1.0
5,LogisticRegression()_0.001_l2,RandomForestClassifier()_None_100,-1.475,1.0
6,LogisticRegression()_0.001_l2,LogisticRegression()_0.1_l2,-1.803,0.892
7,LogisticRegression()_0.001_l2,"XGBClassifier(base_score=None, booster=None, c...",-2.387,0.24
8,LogisticRegression()_0.001_l2,RandomForestClassifier()_5_100,-2.924,0.06
9,LogisticRegression()_0.001_l2,RandomForestClassifier()_5_10,-3.465,0.013
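
The `p_val` column above is Bonferroni-adjusted: each raw p-value is multiplied by the number of comparisons and capped at 1. A small sketch of that step follows, assuming 10 comparisons as in the table; the helper name `bonferroni_adjust` and the raw value 0.0013 are hypothetical, while 0.486 is the raw p-value printed earlier for the two best models.

```python
def bonferroni_adjust(p_value, n_comparisons):
    """Return the Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


n_comparisons = 10  # assumed: one comparison per row of the table

# Raw p-value reported for the best vs. second-best model.
adjusted_top = bonferroni_adjust(0.486, n_comparisons)

# Hypothetical small raw p-value, to show an adjustment that is not capped.
adjusted_small = bonferroni_adjust(0.0013, n_comparisons)

# Equivalent view: compare raw p-values against alpha / n_comparisons.
alpha = 0.05
per_test_threshold = alpha / n_comparisons

print(adjusted_top, adjusted_small, per_test_threshold)
```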


## Testing the best model

In [None]:
# Evaluate the best model on the test set

from sklearn.metrics import accuracy_score

best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)
# Trivial baseline: always predict 0 (Red)
dummy = DummyRegressor(strategy='constant', constant=0)
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy_dummy = accuracy_score(y_test, y_pred_dummy)

print(f'Accuracy: {accuracy:.3f}')
print(f'Dummy accuracy: {accuracy_dummy:.3f}')

from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns the predicted classes (0 = Red, 1 = Blue)
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
conf_matrix_dummy = confusion_matrix(y_test, y_pred_dummy)
print(conf_matrix_dummy)

Accuracy: 0.696
Dummy accuracy: 0.624
[[537  66]
 [228 135]]
[[603   0]
 [363   0]]
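
Accuracy alone hides how unevenly the model treats the two corners. The sketch below derives per-class metrics from the first confusion matrix printed above, assuming scikit-learn's convention of true classes in rows and predicted classes in columns, with 0 = Red and 1 = Blue.

```python
import numpy as np

# Confusion matrix of the best model, copied from the output above.
conf_matrix = np.array([[537, 66],
                        [228, 135]])

# With labels ordered [0, 1], ravel() yields TN, FP, FN, TP for class 1 (Blue).
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tn + tp) / conf_matrix.sum()
recall_red = tn / (tn + fp)        # share of Red wins predicted as Red
recall_blue = tp / (tp + fn)       # share of Blue wins predicted as Blue
precision_blue = tp / (tp + fp)    # share of Blue predictions that were right

print(f"accuracy:       {accuracy:.3f}")
print(f"recall (Red):   {recall_red:.3f}")
print(f"recall (Blue):  {recall_blue:.3f}")
print(f"precision (Blue): {precision_blue:.3f}")
```

The accuracy reproduces the reported 0.696, while the recall for Blue is only about 0.37: the model identifies most Red wins but misses the majority of Blue wins.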


In [None]:
# Plot the ROC curve

import matplotlib
matplotlib.use('TkAgg')  # or 'Qt5Agg'

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Assumes best_model, X_test and y_test are already defined
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()