# Experimental Comparison of Classification Techniques

This work carries out an **experimental comparison** of a predefined set of machine learning techniques applied to a **supervised classification** problem.

## Techniques Used

The following learning techniques will be evaluated:

- **Decision Tree (DT)**
- **K Nearest Neighbors (KNN)**
- **Multi-layer Perceptron (MLP)**
- **Random Forest (RF)**
- **Heterogeneous Boosting (HB)**

## Experimental Procedure

The experiment will run **3 rounds** of nested validation and test cycles, organized as follows:

- **Inner validation:** 4 folds
- **Outer test:** 10 folds
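
As a rough illustration of what these numbers imply, the sketch below counts the splits produced by 3 repetitions of a 10-fold outer loop with a 4-fold inner loop. The synthetic data from `make_classification` and the `_demo` names are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

outer_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
inner_cv = StratifiedKFold(n_splits=4)

n_outer = 0
n_inner = 0
for train_idx, test_idx in outer_cv.split(X_demo, y_demo):
    n_outer += 1
    # Each outer training portion is split again for hyperparameter selection
    n_inner += sum(1 for _ in inner_cv.split(X_demo[train_idx], y_demo[train_idx]))

print(n_outer)  # 30 outer train/test evaluations
print(n_inner)  # 120 inner splits, each evaluated once per hyperparameter candidate
```

Each technique therefore yields 30 test accuracies, one per outer fold.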

Hyperparameters will be selected by **grid search** in the inner loop, using the following values for each technique:

### Hyperparameters

```python
# Decision Tree (DT)
{
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15, 25]
}

# K Nearest Neighbors (KNN)
{
    'n_neighbors': [1, 3, 5, 7, 9]
}

# Multi-layer Perceptron (MLP)
{
    'hidden_layer_sizes': [(100,), (10,)],
    'alpha': [0.0001, 0.005],
    'learning_rate': ['constant', 'adaptive']
}

# Random Forest (RF)
{
    'n_estimators': [5, 10, 15, 25],
    'max_depth': [10, None]
}

# Heterogeneous Boosting (HB)
{
    'n_estimators': [5, 10, 15, 25, 50]
}
```
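
To get a feel for the cost of these grids, the sketch below counts the candidates per technique with scikit-learn's `ParameterGrid`. The `grids` dictionary simply restates the values listed above; the variable names are chosen for this example only.

```python
from sklearn.model_selection import ParameterGrid

# Grids restated from the listing above
grids = {
    "DT": {"criterion": ["gini", "entropy"], "max_depth": [5, 10, 15, 25]},
    "KNN": {"n_neighbors": [1, 3, 5, 7, 9]},
    "MLP": {
        "hidden_layer_sizes": [(100,), (10,)],
        "alpha": [0.0001, 0.005],
        "learning_rate": ["constant", "adaptive"],
    },
    "RF": {"n_estimators": [5, 10, 15, 25], "max_depth": [10, None]},
    "HB": {"n_estimators": [5, 10, 15, 25, 50]},
}

# Number of hyperparameter combinations tried by the grid search
n_candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
print(n_candidates)  # {'DT': 8, 'KNN': 5, 'MLP': 8, 'RF': 8, 'HB': 5}
```

With 4 inner folds, a grid of 8 candidates means 32 fits per outer fold, plus one refit of the best candidate.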


# Imports

In [27]:
# Data manipulation
import numpy as np
import pandas as pd

# Classification models
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# (Example of Heterogeneous Boosting with Gradient Boosting)
from sklearn.ensemble import GradientBoostingClassifier

# Cross-validation and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, RepeatedStratifiedKFold
from sklearn.metrics import accuracy_score
from scipy import stats

# Preprocessing
from sklearn.preprocessing import MinMaxScaler

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Building blocks for the Heterogeneous Boosting estimator
from sklearn.utils import resample
from sklearn.base import BaseEstimator, ClassifierMixin



# Display settings

In [28]:
# General plotting settings
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
warnings.filterwarnings("ignore")

# Loading the dataset

In [29]:
data = pd.read_csv("jogosLoL2021.csv")

In [30]:
data

Unnamed: 0,id,result,golddiffat15,xpdiffat15,csdiffat15,killsdiffat15,assistsdiffat15,golddiffat10,xpdiffat10,csdiffat10,...,OPP_EGR,OPP_MLR,OPP_FB%,OPP_FT%,OPP_F3T%,OPP_HLD%,OPP_DRG%,OPP_BN%,OPP_LNE%,OPP_JNG%
0,10,1,5018.0,4255.0,86.0,5.0,9.0,1793.0,2365.0,65.0,...,23.1,-23.1,0,0,33,50,27,0,49.2,43.7
1,22,0,573.0,-1879.0,-49.0,1.0,4.0,759.0,171.0,-8.0,...,77.2,22.8,100,100,100,58,70,89,50.4,53.3
2,34,0,-579.0,-1643.0,-40.0,-1.0,-5.0,73.0,-1.0,-24.0,...,77.2,22.8,100,100,100,58,70,89,50.4,53.3
3,106,1,3739.0,1118.0,53.0,1.0,0.0,1746.0,824.0,21.0,...,63.9,-3.9,67,67,67,48,60,48,51.6,50.3
4,118,0,-6390.0,-4569.0,-47.0,-10.0,-17.0,-3500.0,-1882.0,-18.0,...,25.8,-0.8,13,25,25,19,20,20,49.7,42.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8147,138358,1,7581.0,3246.0,6.0,9.0,17.0,1928.0,469.0,-4.0,...,53.9,4.9,59,76,59,65,46,65,50.4,52.1
8148,138370,0,-2828.0,-2139.0,-33.0,-3.0,-5.0,-1325.0,-677.0,-20.0,...,61.2,-11.2,70,60,60,55,53,47,50.9,48.0
8149,138382,0,1427.0,-142.0,-21.0,0.0,-6.0,671.0,1446.0,-20.0,...,53.9,4.9,59,76,59,65,46,65,50.4,52.1
8150,138394,0,-1286.0,-2414.0,-56.0,1.0,-6.0,1002.0,-72.0,-40.0,...,51.3,8.7,40,60,60,70,48,57,48.9,54.2


# Data preprocessing

**Drop the match identifier** and **scale the numeric features** to the [0, 1] range (min-max normalization).

The features we will use are pre-game data, that is, information available before the match starts, such as:

- WR (Win Rate of the Blue Team)
- KD (Kill-to-Death Ratio of the Blue Team)
- GPR (Gold Percent Ratio of the Blue Team)
- GSPD (Average Gold Spent Ratio of the Blue Team)
- EGR (Early-Game Rate of the Blue Team)
- MLR (Mid-Late-Game Rate of the Blue Team)
- FB% (First Blood Rate of the Blue Team)
- FT% (First Tower Rate of the Blue Team)
- F3T% (First-to-Three-Towers Rate of the Blue Team)
- HLD% (Rift Herald Rate of the Blue Team)
- DRG% (Dragon Rate of the Blue Team)
- BN% (Baron Rate of the Blue Team)
- LNE% (Lane Control Rate of the Blue Team)
- JNG% (Jungle Control Rate of the Blue Team)
- OPP_WR (Win Rate of the Red Team)
- OPP_KD (Kill-to-Death Ratio of the Red Team)
- OPP_GPR (Gold Percent Ratio of the Red Team)
- OPP_GSPD (Average Gold Spent Ratio of the Red Team)
- OPP_EGR (Early-Game Rate of the Red Team)
- OPP_MLR (Mid-Late-Game Rate of the Red Team)
- OPP_FB% (First Blood Rate of the Red Team)
- OPP_FT% (First Tower Rate of the Red Team)
- OPP_F3T% (First-to-Three-Towers Rate of the Red Team)
- OPP_HLD% (Rift Herald Rate of the Red Team)
- OPP_DRG% (Dragon Rate of the Red Team)
- OPP_BN% (Baron Rate of the Red Team)
- OPP_LNE% (Lane Control Rate of the Red Team)
- OPP_JNG% (Jungle Control Rate of the Red Team)

We will therefore exclude the following columns, which are only known once the match is under way:

- golddiffat15 (gold difference between the teams at 15 minutes)
- xpdiffat15 (XP difference between the teams at 15 minutes)
- csdiffat15 (creep score difference between the teams at 15 minutes)
- killsdiffat15 (kill difference between the teams at 15 minutes)
- assistsdiffat15 (assist difference between the teams at 15 minutes)
- golddiffat10 (gold difference between the teams at 10 minutes)
- xpdiffat10 (XP difference between the teams at 10 minutes)
- csdiffat10 (creep score difference between the teams at 10 minutes)
- killsdiffat10 (kill difference between the teams at 10 minutes)
- assistsdiffat10 (assist difference between the teams at 10 minutes)
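
The split between pre-game and in-game columns follows a naming pattern: every excluded column ends in `at10` or `at15`. The sketch below shows one way to select them by suffix instead of listing them by hand. The tiny `demo` frame is an assumption for illustration: the `WR` and `OPP_WR` values are made up, and only a few columns are included.

```python
import pandas as pd

# Tiny illustrative frame (assumption); column names follow the lists above
demo = pd.DataFrame({
    "id": [10, 22],
    "result": [1, 0],
    "golddiffat15": [5018.0, 573.0],
    "xpdiffat10": [2365.0, 171.0],
    "WR": [0.6, 0.4],        # made-up values
    "OPP_WR": [0.4, 0.6],    # made-up values
})

# In-game statistics are the columns measured at minute 10 or 15
in_game_cols = [c for c in demo.columns if c.endswith(("at10", "at15"))]

# Keep only the target and the pre-game features
pre_game = demo.drop(columns=["id"] + in_game_cols)
print(list(pre_game.columns))  # ['result', 'WR', 'OPP_WR']
```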

In [31]:
# Drop the match identifier
# Note: the 'Unnamed: 0' column shown in the preview above is not dropped here, so it stays among the features
data = data.drop(columns=['id'])
# Drop the in-game columns that will not be used
data = data.drop(columns= ['golddiffat15', 'xpdiffat15', 'csdiffat15', 'killsdiffat15', 'assistsdiffat15', 'golddiffat10', 'xpdiffat10', 'csdiffat10', 'killsdiffat10', 'assistsdiffat10'])

In [32]:
# Normalizing the numeric columns
# Select every column except the target 'result'
cols_to_normalize = data.columns.drop('result')

# Apply MinMaxScaler to rescale each column to the [0, 1] range
scaler = MinMaxScaler()
data[cols_to_normalize] = scaler.fit_transform(data[cols_to_normalize])
X = data.drop(columns=['result'])
y = data['result']
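
The cell above fits the scaler on the full dataset before any cross-validation split. An alternative, shown here only as a sketch and not as the approach used in this notebook, is to place the scaler inside a scikit-learn `Pipeline` so that it is refit on each training fold. The synthetic data and the step names `scaler` and `clf` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# The scaler is a pipeline step, so it only ever sees the training part of each split
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("clf", KNeighborsClassifier()),
])

# Parameters of a step are addressed as <step name>__<parameter>
param_grid = {"clf__n_neighbors": [1, 3, 5, 7, 9]}

grid = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=4), scoring="accuracy")
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```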

## Heterogeneous Boosting implementation

In [33]:
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils import resample
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import mode

class HeterogeneousBoosting(BaseEstimator, ClassifierMixin):
    def __init__(self, n_estimators=10):
        self.n_estimators = n_estimators
        self.base_models = [
            DecisionTreeClassifier(),
            KNeighborsClassifier(),
            MLPClassifier(max_iter=5),
            RandomForestClassifier()
        ]
        self.trained_models = []

    def fit(self, X, y):
        n_samples = X.shape[0]
        sample_weights = np.ones(n_samples) / n_samples

        self.trained_models = []

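        for i in range(self.n_estimators):
            for base_model in self.base_models:
                model = clone(base_model)  # fresh, unfitted copy of the base model

                # Stratified bootstrap sample; note that sample_weights is not used in the draw
                X_resampled, y_resampled = resample(X, y,
                    replace=True,
                    n_samples=n_samples,
                    random_state=i,
                    stratify=y)

                model.fit(X_resampled, y_resampled)
                self.trained_models.append(model)

                # Weighted training error of the model just added
                y_pred = model.predict(X)
                incorrect = (y_pred != np.asarray(y)).astype(float)
                error = np.dot(sample_weights, incorrect)

                # Skip the remaining base models of this round if the error is too high
                if error > 0.5:
                    break

                # Increase the weight of misclassified samples, then renormalize
                sample_weights *= np.exp(incorrect)
                sample_weights /= np.sum(sample_weights)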

    def predict(self, X):
        # Majority vote over the predictions of all trained models
        preds = np.array([model.predict(X) for model in self.trained_models])
        final_pred, _ = mode(preds, axis=0)
        return final_pred.ravel()

    def predict_proba(self, X):
        # Average the class probabilities of the models that can provide them
        proba_preds = []
        for model in self.trained_models:
            if hasattr(model, "predict_proba"):
                proba_preds.append(model.predict_proba(X))
        if proba_preds:
            return np.mean(proba_preds, axis=0)
        else:
            raise ValueError("No model in the ensemble supports `predict_proba`.")
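
In the `fit` method above, the sample weights are updated after each model, but the bootstrap sample itself is drawn without them. The sketch below shows, on a toy example, how such weights could drive the resampling. It is a hypothetical variant for illustration, not the implementation used in this notebook, and the toy labels are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels and predictions (made up): the last two predictions are wrong
y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])
n_samples = len(y_true)

# Start from uniform weights, as in the class above
weights = np.ones(n_samples) / n_samples

# Weighted error of the current model
incorrect = (y_pred != y_true).astype(float)
error = np.dot(weights, incorrect)
print(error)  # 0.25

# Same update rule as above: misclassified samples are scaled by e, then renormalized
weights = weights * np.exp(incorrect)
weights /= weights.sum()

# Draw the next training sample with probability proportional to the weights,
# so misclassified samples are more likely to be picked
idx = rng.choice(n_samples, size=n_samples, replace=True, p=weights)
```

After the update, each misclassified sample is about 2.7 times more likely to be drawn than a correctly classified one.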


## Model training and evaluation function

In [34]:
def train_and_evaluate_model(model, X, y):
    """
    Train and evaluate a model with nested cross-validation.

    `model` is one of "DT", "KNN", "MLP", "RF" or "HB".
    Returns the list of test accuracies, one per outer fold.
    """
    if model == "DT":
        model = DecisionTreeClassifier()
        param_grid = {
            'criterion': ['gini', 'entropy'],
            'max_depth': [5, 10, 15, 25]
        }
    elif model == "KNN":
        model = KNeighborsClassifier()
        param_grid = {
            'n_neighbors': [1, 3, 5, 7, 9]
        }
    elif model == "MLP":
        model = MLPClassifier(max_iter=5)
        param_grid = {
            'hidden_layer_sizes': [(100,), (10,)],
            'alpha': [0.0001, 0.005],
            'learning_rate': ['constant', 'adaptive']
        }
    elif model == "RF":
        model = RandomForestClassifier()
        param_grid = {
            'n_estimators': [5, 10, 15, 25],
            'max_depth': [10, None]
        }
    elif model == "HB":
        model = HeterogeneousBoosting()
        param_grid = {
            'n_estimators': [5, 10, 15, 25, 50]
        }
    else:
        raise ValueError(f"Unknown model name: {model}")

    # Outer loop: 3 repetitions of stratified 10-fold cross-validation
    outer_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=36854321)
    scores = []

    for train_idx, test_idx in outer_cv.split(X, y):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]


        inner_cv = StratifiedKFold(n_splits=4)
        grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_cv, scoring='accuracy')
        grid.fit(X_train, y_train)

        best_model = grid.best_estimator_
        y_pred = best_model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        scores.append(acc)
    return scores
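
For comparison, the same nested scheme can be written more compactly by passing a `GridSearchCV` object to `cross_val_score`. The sketch below is one such formulation on synthetic data; the reduced grid, the `_demo` names and the fixed `random_state` values are assumptions made to keep the example small and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_score)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

inner_cv = StratifiedKFold(n_splits=4)
outer_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)

# Inner loop: grid search refits the best candidate on each outer training portion
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"], "max_depth": [5, 10]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: each score is the accuracy of the tuned model on one held-out fold
scores_demo = cross_val_score(grid, X_demo, y_demo, cv=outer_cv, scoring="accuracy")
print(len(scores_demo))  # 30
```

The explicit loop used in this notebook gives access to each fold's best estimator, which the compact form does not expose.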

## Utility functions

In [35]:
def t_corrigido_nadeau_bengio(data1, data2, X, n_folds_externos):
    """
    Corrected paired t-test (Nadeau and Bengio).

    Parameters:
    data1, data2: lists or arrays with the accuracies of the two models
    X: original dataset
    n_folds_externos: number of folds in the outer loop

    Returns:
    t_stat: value of the t statistic
    p_valor: p-value of the two-sided test
    """
    N = len(X)  # total number of samples in the dataset
    n = len(data1)  # number of runs
    # Sizes of the train/test sets in each outer fold
    n_test = N // n_folds_externos
    n_train = N - n_test
    # t statistic with the corrected standard error
    diffs = np.array(data1) - np.array(data2)
    mean_diff = np.mean(diffs)
    std_diff = np.std(diffs, ddof=1)
    se_corrigido = std_diff * np.sqrt(1/n + n_test/n_train)
    t_stat = mean_diff / se_corrigido
    # `stats` is scipy.stats, imported at the top of the notebook
    p_valor = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 1))
    return t_stat, p_valor

# Usage example kept from the original snippet (`scoresWN` and `iris_X` are not defined in this notebook):
# print('\nCorrected T Test')
# s, p = t_corrigido_nadeau_bengio(scores, scoresWN, iris_X, 10)
# print("t: %0.2f p-value: %0.2f\n" % (s, p))
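
Written out, the function above computes the following statistic, where $\bar{d}$ and $s_d$ are the mean and sample standard deviation of the paired accuracy differences and $n$ is the number of runs:

$$
t = \frac{\bar{d}}{s_d \sqrt{\dfrac{1}{n} + \dfrac{n_{\text{test}}}{n_{\text{train}}}}}
$$

As a worked example, assume the dataset has $N = 8151$ rows (inferred from the last index, 8150, in the preview above). With 10 outer folds and 3 repetitions, $n = 30$, $n_{\text{test}} = 815$ and $n_{\text{train}} = 7336$, so the correction factor is:

$$
\sqrt{\frac{1}{30} + \frac{815}{7336}} \approx \sqrt{0.0333 + 0.1111} \approx 0.380
\qquad \text{versus} \qquad
\sqrt{\frac{1}{30}} \approx 0.183
$$

Under these assumptions, the corrected standard error is roughly twice the uncorrected one, which makes the test more conservative.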

## Decision Tree (DT)

In [36]:
scores = train_and_evaluate_model("DT", X, y)

print(f"Accuracies: {scores}")
print(f"Mean: {np.mean(scores):.3f}, Standard deviation: {np.std(scores):.3f}")


Accuracies: [0.6041666666666666, 0.5625, 0.5828220858895705, 0.5742331288343558, 0.5877300613496933, 0.592638036809816, 0.5730061349693252, 0.5742331288343558, 0.592638036809816, 0.6012269938650306, 0.5808823529411765, 0.5833333333333334, 0.5730061349693252, 0.5803680981595092, 0.5766871165644172, 0.5914110429447853, 0.603680981595092, 0.6073619631901841, 0.5950920245398773, 0.6110429447852761, 0.5698529411764706, 0.6004901960784313, 0.5815950920245399, 0.5950920245398773, 0.5595092024539877, 0.6085889570552148, 0.588957055214724, 0.6, 0.5852760736196319, 0.5865030674846625]
Mean: 0.587, Standard deviation: 0.013


## K Nearest Neighbors (KNN)

In [37]:
scores = train_and_evaluate_model("KNN", X, y)

print(f"Accuracies: {scores}")
print(f"Mean: {np.mean(scores):.3f}, Standard deviation: {np.std(scores):.3f}")

Accuracies: [0.5845588235294118, 0.5784313725490197, 0.5766871165644172, 0.5901840490797546, 0.5668711656441717, 0.596319018404908, 0.5717791411042945, 0.5717791411042945, 0.5779141104294478, 0.6147239263803681, 0.5698529411764706, 0.5588235294117647, 0.6, 0.5717791411042945, 0.5656441717791411, 0.5877300613496933, 0.5852760736196319, 0.6208588957055214, 0.5447852760736196, 0.5914110429447853, 0.5784313725490197, 0.5955882352941176, 0.596319018404908, 0.5852760736196319, 0.5865030674846625, 0.5607361963190184, 0.5828220858895705, 0.5914110429447853, 0.5766871165644172, 0.5730061349693252]
Mean: 0.582, Standard deviation: 0.016


## Multi-layer Perceptron (MLP)

In [38]:
scores = train_and_evaluate_model("MLP", X, y)

print(f"Accuracies: {scores}")
print(f"Mean: {np.mean(scores):.3f}, Standard deviation: {np.std(scores):.3f}")

Accuracies: [0.5882352941176471, 0.5698529411764706, 0.5938650306748466, 0.5803680981595092, 0.5975460122699386, 0.5975460122699386, 0.5865030674846625, 0.5705521472392638, 0.5742331288343558, 0.6049079754601226, 0.5796568627450981, 0.5931372549019608, 0.5803680981595092, 0.588957055214724, 0.6233128834355828, 0.603680981595092, 0.5644171779141104, 0.603680981595092, 0.5865030674846625, 0.5840490797546012, 0.571078431372549, 0.5980392156862745, 0.588957055214724, 0.5865030674846625, 0.6061349693251534, 0.5717791411042945, 0.5815950920245399, 0.592638036809816, 0.588957055214724, 0.6024539877300613]
Mean: 0.589, Standard deviation: 0.013


## Random Forest (RF)

In [39]:
scores = train_and_evaluate_model("RF", X, y)

print(f"Accuracies: {scores}")
print(f"Mean: {np.mean(scores):.3f}, Standard deviation: {np.std(scores):.3f}")

Accuracies: [0.5894607843137255, 0.6151960784313726, 0.6098159509202454, 0.5950920245398773, 0.6380368098159509, 0.6061349693251534, 0.5815950920245399, 0.6024539877300613, 0.5901840490797546, 0.6294478527607362, 0.5821078431372549, 0.5955882352941176, 0.6184049079754601, 0.6085889570552148, 0.6208588957055214, 0.6245398773006134, 0.5975460122699386, 0.6355828220858896, 0.6122699386503068, 0.6282208588957056, 0.5808823529411765, 0.6397058823529411, 0.6024539877300613, 0.596319018404908, 0.5950920245398773, 0.5950920245398773, 0.6171779141104294, 0.6503067484662577, 0.6233128834355828, 0.5791411042944785]
Mean: 0.609, Standard deviation: 0.019
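
To view the four results reported so far side by side, the sketch below collects the printed means and standard deviations into a small pandas table. The numbers are copied from the outputs above; the column names `mean_acc` and `std_acc` are chosen for this example, and HB is left out because no output is shown for it.

```python
import pandas as pd

# Means and standard deviations copied from the outputs reported above
summary = pd.DataFrame(
    {
        "mean_acc": [0.587, 0.582, 0.589, 0.609],
        "std_acc": [0.013, 0.016, 0.013, 0.019],
    },
    index=["DT", "KNN", "MLP", "RF"],
).sort_values("mean_acc", ascending=False)

print(summary)
```

Random Forest has the highest mean accuracy of the four, though the differences are small relative to the standard deviations, so the corrected t-test defined earlier is the natural way to check whether they are significant.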


## Heterogeneous Boosting (HB)

In [None]:
scores = train_and_evaluate_model("HB", X, y)

print(f"Accuracies: {scores}")
print(f"Mean: {np.mean(scores):.3f}, Standard deviation: {np.std(scores):.3f}")

In [None]:
model = HeterogeneousBoosting(n_estimators=10)
model.fit(X, y)