# Experimental Comparison of Classification Techniques

This work carries out an **experimental comparison** of a predefined set of machine learning techniques for automatic classification, applied to a **supervised classification** problem.

## Techniques Used

The following learning techniques are evaluated:

- **Decision Tree (DT)**
- **K Nearest Neighbors (KNN)**
- **Multi-layer Perceptron (MLP)**
- **Random Forest (RF)**
- **Heterogeneous Boosting (HB)**

## Experimental Procedure

The experiment runs **3 rounds** of nested validation and test cycles, organized as follows:

- **Inner validation:** 4 folds
- **Outer test:** 10 folds
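
The fold structure above can be sketched on synthetic data to see how many splits it produces. This is only an illustration: `X_demo` and `y_demo` are made-up stand-ins, not the dataset used in this work, and the random seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# Synthetic stand-in data (assumption): 200 samples with balanced binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.array([0, 1] * 100)

# Outer cycle: 10 test folds, repeated for 3 rounds
outer_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
# Inner cycle: 4 validation folds, applied to each outer training set
inner_cv = StratifiedKFold(n_splits=4)

outer_splits = list(outer_cv.split(X_demo, y_demo))
n_outer = len(outer_splits)

# Inner folds are drawn only from the training part of one outer split
train_idx, test_idx = outer_splits[0]
inner_splits = list(inner_cv.split(X_demo[train_idx], y_demo[train_idx]))
n_inner = len(inner_splits)

print(n_outer, n_inner, len(train_idx), len(test_idx))
```

Each of the 30 outer splits holds out one tenth of the data for testing, and the remaining nine tenths are split again into 4 folds for validation.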

Hyperparameters are selected by **grid search** in the inner cycle, using the following values for each technique:

### Hyperparameters

```python
# Decision Tree (DT)
{
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15, 25]
}

# K Nearest Neighbors (KNN)
{
    'n_neighbors': [1, 3, 5, 7, 9]
}

# Multi-layer Perceptron (MLP)
{
    'hidden_layer_sizes': [(100,), (10,)],
    'alpha': [0.0001, 0.005],
    'learning_rate': ['constant', 'adaptive']
}

# Random Forest (RF)
{
    'n_estimators': [5, 10, 15, 25],
    'max_depth': [10, None]
}

# Heterogeneous Boosting (HB)
{
    'n_estimators': [5, 10, 15, 25, 50]
}
```
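
To get a feel for the cost of the search, the grids can be collected in one dictionary and counted with scikit-learn's `ParameterGrid`. This is a minimal sketch: the dictionary name `param_grids` and the short keys are illustrative choices, and the fit count ignores the final refit that `GridSearchCV` performs by default.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as listed above, keyed by the technique abbreviation
param_grids = {
    "DT": {"criterion": ["gini", "entropy"], "max_depth": [5, 10, 15, 25]},
    "KNN": {"n_neighbors": [1, 3, 5, 7, 9]},
    "MLP": {
        "hidden_layer_sizes": [(100,), (10,)],
        "alpha": [0.0001, 0.005],
        "learning_rate": ["constant", "adaptive"],
    },
    "RF": {"n_estimators": [5, 10, 15, 25], "max_depth": [10, None]},
    "HB": {"n_estimators": [5, 10, 15, 25, 50]},
}

# Number of hyperparameter combinations per technique
grid_sizes = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}

# Inner fits per technique: combinations x 4 inner folds x 30 outer splits
inner_fits = {name: size * 4 * 30 for name, size in grid_sizes.items()}

print(grid_sizes)
print(inner_fits)
```

The grids are small: 8 combinations for DT, MLP and RF, and 5 for KNN and HB.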


# Imports

In [66]:
# Data handling
import numpy as np
import pandas as pd

# Classification models
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# (Example of Heterogeneous Boosting with Gradient Boosting)
from sklearn.ensemble import GradientBoostingClassifier

# Cross-validation and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score
from scipy import stats

# Preprocessing
from sklearn.preprocessing import MinMaxScaler

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns



# Display settings

In [67]:
# General plot settings
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

# Loading the dataset

In [68]:
data = pd.read_csv("jogosLoL2021.csv")

In [69]:
data

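| Unnamed: 0 | id | result | golddiffat15 | xpdiffat15 | csdiffat15 | killsdiffat15 | assistsdiffat15 | golddiffat10 | xpdiffat10 | csdiffat10 | ... | OPP_EGR | OPP_MLR | OPP_FB% | OPP_FT% | OPP_F3T% | OPP_HLD% | OPP_DRG% | OPP_BN% | OPP_LNE% | OPP_JNG% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10 | 1 | 5018.0 | 4255.0 | 86.0 | 5.0 | 9.0 | 1793.0 | 2365.0 | 65.0 | ... | 23.1 | -23.1 | 0 | 0 | 33 | 50 | 27 | 0 | 49.2 | 43.7 |
| 1 | 22 | 0 | 573.0 | -1879.0 | -49.0 | 1.0 | 4.0 | 759.0 | 171.0 | -8.0 | ... | 77.2 | 22.8 | 100 | 100 | 100 | 58 | 70 | 89 | 50.4 | 53.3 |
| 2 | 34 | 0 | -579.0 | -1643.0 | -40.0 | -1.0 | -5.0 | 73.0 | -1.0 | -24.0 | ... | 77.2 | 22.8 | 100 | 100 | 100 | 58 | 70 | 89 | 50.4 | 53.3 |
| 3 | 106 | 1 | 3739.0 | 1118.0 | 53.0 | 1.0 | 0.0 | 1746.0 | 824.0 | 21.0 | ... | 63.9 | -3.9 | 67 | 67 | 67 | 48 | 60 | 48 | 51.6 | 50.3 |
| 4 | 118 | 0 | -6390.0 | -4569.0 | -47.0 | -10.0 | -17.0 | -3500.0 | -1882.0 | -18.0 | ... | 25.8 | -0.8 | 13 | 25 | 25 | 19 | 20 | 20 | 49.7 | 42.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8147 | 138358 | 1 | 7581.0 | 3246.0 | 6.0 | 9.0 | 17.0 | 1928.0 | 469.0 | -4.0 | ... | 53.9 | 4.9 | 59 | 76 | 59 | 65 | 46 | 65 | 50.4 | 52.1 |
| 8148 | 138370 | 0 | -2828.0 | -2139.0 | -33.0 | -3.0 | -5.0 | -1325.0 | -677.0 | -20.0 | ... | 61.2 | -11.2 | 70 | 60 | 60 | 55 | 53 | 47 | 50.9 | 48.0 |
| 8149 | 138382 | 0 | 1427.0 | -142.0 | -21.0 | 0.0 | -6.0 | 671.0 | 1446.0 | -20.0 | ... | 53.9 | 4.9 | 59 | 76 | 59 | 65 | 46 | 65 | 50.4 | 52.1 |
| 8150 | 138394 | 0 | -1286.0 | -2414.0 | -56.0 | 1.0 | -6.0 | 1002.0 | -72.0 | -40.0 | ... | 51.3 | 8.7 | 40 | 60 | 60 | 70 | 48 | 57 | 48.9 | 54.2 |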
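
The `Unnamed: 0` column in this output is what pandas produces when a CSV has an empty header for its first column. Assuming it is just the row index saved when the file was written (the source does not say), it could be read as the index instead of as a data column. A minimal sketch on a tiny in-memory CSV with made-up values:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the CSV (values are made up for illustration)
csv_text = ",id,result,WR\n0,10,1,0.55\n1,22,0,0.48\n"

# Default read: the unnamed first column becomes a regular column
raw = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Read this way, the row counter would not be passed to the classifiers as a feature.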


# Data preprocessing

**Discard the match identifier** and **rescale the numeric features** to the [0, 1] range (min-max normalization).
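
As a small worked example of min-max normalization, each value x in a column is mapped to (x - min) / (max - min), which is what scikit-learn's `MinMaxScaler` computes with its default settings. The numbers below are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One illustrative feature column (values chosen only for the example)
values = np.array([[23.1], [77.2], [63.9], [25.8]])

# Library result
scaled = MinMaxScaler().fit_transform(values)

# Same computation written out by hand
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
```

The smallest value in the column maps to 0 and the largest to 1.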

The features used are pre-game data, that is, information available before the match starts, such as:

- WR (Win Rate of the Blue Team)
- KD (Kill-to-Death Ratio of the Blue Team)
- GPR (Gold Percent Ratio of the Blue Team)
- GSPD (Average Gold Spent Ratio of the Blue Team)
- EGR (Early-Game Rate of the Blue Team)
- MLR (Mid-Late-Game Rate of the Blue Team)
- FB% (First Blood Rate of the Blue Team)
- FT% (First Tower Rate of the Blue Team)
- F3T% (First-to-Three-Towers Rate of the Blue Team)
- HLD% (Herald Rate of the Blue Team)
- DRG% (Dragon Rate of the Blue Team)
- BN% (Baron Rate of the Blue Team)
- LNE% (Lane Control Rate of the Blue Team)
- JNG% (Jungle Control Rate of the Blue Team)
- OPP_WR (Win Rate of the Red Team)
- OPP_KD (Kill-to-Death Ratio of the Red Team)
- OPP_GPR (Gold Percent Ratio of the Red Team)
- OPP_GSPD (Average Gold Spent Ratio of the Red Team)
- OPP_EGR (Early-Game Rate of the Red Team)
- OPP_MLR (Mid-Late-Game Rate of the Red Team)
- OPP_FB% (First Blood Rate of the Red Team)
- OPP_FT% (First Tower Rate of the Red Team)
- OPP_F3T% (First-to-Three-Towers Rate of the Red Team)
- OPP_HLD% (Herald Rate of the Red Team)
- OPP_DRG% (Dragon Rate of the Red Team)
- OPP_BN% (Baron Rate of the Red Team)
- OPP_LNE% (Lane Control Rate of the Red Team)
- OPP_JNG% (Jungle Control Rate of the Red Team)

The following columns are therefore excluded, since they are only known once the match is under way:

- golddiffat15 (gold difference between the teams at 15 minutes)
- xpdiffat15 (XP difference between the teams at 15 minutes)
- csdiffat15 (creep score difference between the teams at 15 minutes)
- killsdiffat15 (kills difference between the teams at 15 minutes)
- assistsdiffat15 (assists difference between the teams at 15 minutes)
- golddiffat10 (gold difference between the teams at 10 minutes)
- xpdiffat10 (XP difference between the teams at 10 minutes)
- csdiffat10 (creep score difference between the teams at 10 minutes)
- killsdiffat10 (kills difference between the teams at 10 minutes)
- assistsdiffat10 (assists difference between the teams at 10 minutes)

In [70]:
# Drop the match identifier
data = data.drop(columns=['id'])
# Drop the in-game columns (state at 10 and 15 minutes), which are not used
data = data.drop(columns=[
    'golddiffat15', 'xpdiffat15', 'csdiffat15', 'killsdiffat15', 'assistsdiffat15',
    'golddiffat10', 'xpdiffat10', 'csdiffat10', 'killsdiffat10', 'assistsdiffat10',
])
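
Since every excluded in-game column ends in `diffat10` or `diffat15`, the same selection could also be written as a pattern match rather than an explicit list. This is a sketch on a toy frame with made-up values, not the code used in this notebook; the names `demo`, `in_game_cols` and `pre_game` are illustrative.

```python
import pandas as pd

# Toy frame with a few of the real column names and made-up values
demo = pd.DataFrame({
    "id": [10, 22],
    "result": [1, 0],
    "golddiffat15": [5018.0, 573.0],
    "csdiffat10": [65.0, -8.0],
    "WR": [0.6, 0.4],
})

# Column labels ending in 'diffat10' or 'diffat15' are the in-game statistics
in_game_cols = demo.filter(regex=r"diffat(10|15)$").columns

# Drop the identifier and the in-game columns in one step
pre_game = demo.drop(columns=["id", *in_game_cols])

print(list(pre_game.columns))
```

The explicit list used in the cell above is easier to audit, while the pattern is shorter and picks up any further `...diffat10` or `...diffat15` columns automatically.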

In [None]:
# Normalize the numeric columns
# Select the columns to normalize (all except 'result')
cols_to_normalize = data.columns.drop('result')

# Apply MinMaxScaler to rescale each column to the [0, 1] range
scaler = MinMaxScaler()
data[cols_to_normalize] = scaler.fit_transform(data[cols_to_normalize])

# Split into features (X) and target (y)
X = data.drop(columns=['result'])
y = data['result']

| Unnamed: 0 | result | WR | KD | GPR | GSPD | EGR | MLR | FB% | FT% | F3T% | ... | OPP_MLR | OPP_FB% | OPP_FT% | OPP_F3T% | OPP_HLD% | OPP_DRG% | OPP_BN% | OPP_LNE% | OPP_JNG% | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1.000000 | 0.564972 | 0.683521 | 0.711039 | 0.780000 | 0.756633 | 1.00 | 1.00 | 1.00 | ... | 0.336688 | 0.00 | 0.00 | 0.33 | 0.50 | 0.287234 | 0.00 | 0.39 | 0.279863 | 0 |
| 1 | 0 | 0.000000 | 0.045198 | 0.293071 | 0.310065 | 0.210526 | 0.336688 | 0.00 | 0.00 | 0.33 | ... | 0.756633 | 1.00 | 1.00 | 1.00 | 0.58 | 0.744681 | 0.89 | 0.51 | 0.607509 | 1 |
| 2 | 0 | 0.000000 | 0.045198 | 0.293071 | 0.310065 | 0.210526 | 0.336688 | 0.00 | 0.00 | 0.33 | ... | 0.756633 | 1.00 | 1.00 | 1.00 | 0.58 | 0.744681 | 0.89 | 0.51 | 0.607509 | 2 |
| 3 | 1 | 1.000000 | 0.884181 | 1.000000 | 1.000000 | 1.000000 | 0.565416 | 0.75 | 1.00 | 1.00 | ... | 0.512351 | 0.67 | 0.67 | 0.67 | 0.48 | 0.638298 | 0.48 | 0.63 | 0.505119 | 3 |
| 4 | 0 | 0.636364 | 0.223164 | 0.439139 | 0.448052 | 0.411579 | 0.743824 | 0.45 | 0.55 | 0.55 | ... | 0.540714 | 0.13 | 0.25 | 0.25 | 0.19 | 0.212766 | 0.20 | 0.44 | 0.228669 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8147 | 1 | 0.550000 | 0.305085 | 0.527154 | 0.555195 | 0.586316 | 0.513266 | 0.80 | 0.35 | 0.55 | ... | 0.592864 | 0.59 | 0.76 | 0.59 | 0.65 | 0.489362 | 0.65 | 0.51 | 0.566553 | 8147 |
| 8148 | 0 | 0.600000 | 0.265537 | 0.489700 | 0.564935 | 0.507368 | 0.627630 | 0.40 | 0.60 | 0.60 | ... | 0.445563 | 0.70 | 0.60 | 0.60 | 0.55 | 0.563830 | 0.47 | 0.56 | 0.426621 | 8148 |
| 8149 | 0 | 0.550000 | 0.305085 | 0.527154 | 0.555195 | 0.586316 | 0.513266 | 0.80 | 0.35 | 0.55 | ... | 0.592864 | 0.59 | 0.76 | 0.59 | 0.65 | 0.489362 | 0.65 | 0.51 | 0.566553 | 8149 |
| 8150 | 0 | 0.500000 | 0.290960 | 0.493446 | 0.500000 | 0.611579 | 0.445563 | 0.70 | 0.60 | 0.60 | ... | 0.627630 | 0.40 | 0.60 | 0.60 | 0.70 | 0.510638 | 0.57 | 0.36 | 0.638225 | 8150 |
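
The normalization cell fits the scaler on all rows before any train/test split, so the minimum and maximum of the test folds influence the training data. One possible alternative, shown here only as a sketch on synthetic data and not as what this notebook does, is to put the scaler inside a scikit-learn `Pipeline` so that it is refit on the training folds of each split. The step names `scaler` and `clf` are arbitrary, and the grid is shortened for speed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the dataset used in this work
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaling becomes a pipeline step, so it is fit only on training folds
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("clf", DecisionTreeClassifier(random_state=13)),
])

# Grid keys are prefixed with the step name and a double underscore
param_grid = {
    "clf__criterion": ["gini", "entropy"],
    "clf__max_depth": [5, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=4), scoring="accuracy")
grid.fit(X_demo, y_demo)

print(grid.best_params_)
```

Decision trees are largely insensitive to feature scaling, so the difference matters more for distance-based and gradient-based techniques such as KNN and MLP.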


## Decision Tree (DT)

In [75]:
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score
import numpy as np

# Example for a single classifier
model = DecisionTreeClassifier(random_state=13)
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15, 25]
}

# Outer cycle: 10 test folds repeated over 3 rounds (30 splits in total)
outer_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=36854321)
scores = []

for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Inner cycle: grid search with 4 validation folds on the training part only
    inner_cv = StratifiedKFold(n_splits=4)
    grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_cv, scoring='accuracy')
    grid.fit(X_train, y_train)

    # Evaluate the best configuration (refit on the full training part) on the held-out fold
    best_model = grid.best_estimator_
    y_pred = best_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    scores.append(acc)

print(f"Accuracies: {scores}")
print(f"Mean: {np.mean(scores):.3f}, Standard deviation: {np.std(scores):.3f}")


Accuracies: [0.5931372549019608, 0.5661764705882353, 0.5877300613496933, 0.5754601226993865, 0.5901840490797546, 0.588957055214724, 0.5730061349693252, 0.5779141104294478, 0.5901840490797546, 0.5914110429447853, 0.5772058823529411, 0.5674019607843137, 0.5730061349693252, 0.5791411042944785, 0.5791411042944785, 0.5901840490797546, 0.5742331288343558, 0.6073619631901841, 0.5950920245398773, 0.6073619631901841, 0.5674019607843137, 0.5931372549019608, 0.5840490797546012, 0.5914110429447853, 0.5840490797546012, 0.5950920245398773, 0.5766871165644172, 0.5975460122699386, 0.5852760736196319, 0.5865030674846625]
Mean: 0.585, Standard deviation: 0.011
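
Because the same nested loop applies to every technique, it could be wrapped in a small helper and reused with each grid. The function below is a hypothetical sketch (the name `nested_cv_scores` and its parameters are not part of the original notebook), demonstrated on synthetic data with fewer folds so that it runs quickly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def nested_cv_scores(model, param_grid, X, y, outer_splits=10, repeats=3,
                     inner_splits=4, seed=36854321):
    """Return one test accuracy per outer split (hypothetical helper)."""
    X, y = np.asarray(X), np.asarray(y)
    outer_cv = RepeatedStratifiedKFold(
        n_splits=outer_splits, n_repeats=repeats, random_state=seed
    )
    scores = []
    for train_idx, test_idx in outer_cv.split(X, y):
        # Grid search on the training part only, with a fresh copy of the model
        grid = GridSearchCV(
            clone(model), param_grid,
            cv=StratifiedKFold(n_splits=inner_splits), scoring="accuracy",
        )
        grid.fit(X[train_idx], y[train_idx])
        # grid.predict uses the best configuration refit on the training part
        y_pred = grid.predict(X[test_idx])
        scores.append(accuracy_score(y[test_idx], y_pred))
    return np.array(scores)


# Synthetic stand-in data and a reduced setup (assumptions for a quick demo)
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)
knn_scores = nested_cv_scores(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]},
    X_demo, y_demo, outer_splits=5, repeats=2,
)
print(f"Mean: {knn_scores.mean():.3f}, Standard deviation: {knn_scores.std():.3f}")
```

With the default arguments the helper reproduces the 10-fold, 3-round outer cycle and the 4-fold inner cycle used for the Decision Tree above.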
