<a href="https://colab.research.google.com/github/SampMark/Machine-Learn/blob/main/Comparison_Models_for_Predicting_Basketball_Tournament_Outcomes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Comparing machine learning models for predicting basketball tournament outcomes**: KNN, Decision Tree, SVM, and Logistic Regression

The goal of this notebook is to apply machine learning algorithms to the `basketball_train.csv` dataset and compare the resulting models.
The analysis involves the following steps:

1. Load the historical basketball season dataset `basketball_train.csv`.
2. Apply the following machine learning algorithms and compare their results:

* [k-Nearest Neighbour (KNN)](https://github.com/SampMark/Machine-Learn/blob/main/K_Nearest_Neighbors.ipynb)
* [Decision Tree](https://github.com/SampMark/Machine-Learn/blob/main/Decision_Trees.ipynb)
* [Support Vector Machine (SVM)](https://github.com/SampMark/Machine-Learn/blob/main/SVM_Support_Vector_Machines.ipynb)
* [Logistic Regression](https://github.com/SampMark/Machine-Learn/blob/main/Logistic_Regression.ipynb)

Each model is evaluated and compared using the following metrics:

* Accuracy
* Jaccard index
* F1-score
* Log loss
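
As a minimal sketch of what these four metrics measure, the example below computes them with scikit-learn on a small hypothetical three-class problem. The labels and probabilities are made up for illustration and do not come from `basketball_train.csv`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss

# Hypothetical labels and predictions (not taken from the dataset)
classes = ["E8", "F4", "S16"]
y_true = ["E8", "F4", "S16", "S16"]
y_pred = ["E8", "S16", "S16", "S16"]

# Hypothetical class probabilities, columns ordered as in `classes`
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.1, 0.8],
    [0.2, 0.2, 0.6],
])

acc = accuracy_score(y_true, y_pred)                   # share of correct predictions
jac = jaccard_score(y_true, y_pred, average="micro")   # TP / (TP + FP + FN)
f1 = f1_score(y_true, y_pred, average="micro")         # harmonic mean of precision and recall
ll = log_loss(y_true, y_proba, labels=classes)         # penalizes confident wrong probabilities

print(acc, jac, f1, round(ll, 4))
```

Three of the four predictions are correct, so accuracy is 0.75 and the micro-averaged Jaccard index is 3 / 5 = 0.6. Unlike the other three metrics, log loss is computed from predicted probabilities rather than predicted labels, and lower is better.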

## **Installing and importing the libraries**

---



In [1]:
!pip install scikit-learn scipy seaborn



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
import itertools
from sklearn import preprocessing
import seaborn as sns

## **About the `basketball_train.csv` dataset**

---


The dataset contains the performance and ranking of several basketball teams, with the following fields:

| Field         | Description                                                                                                 |
|---------------|-------------------------------------------------------------------------------------------------------------|
| TEAM          | The Division I college basketball school                                                                    |
| CONF          | The Athletic Conference in which the school participates (e.g., ACC = Atlantic Coast Conference, etc.)      |
| G             | Number of games played                                                                                      |
| W             | Number of games won                                                                                         |
| ADJOE         | Adjusted Offensive Efficiency (points scored per 100 possessions)                                           |
| ADJDE         | Adjusted Defensive Efficiency (points allowed per 100 possessions)                                          |
| BARTHAG       | Power Rating (Chance of beating an average Division I team)                                                 |
| EFG_O         | Effective Field Goal Percentage Shot                                                                        |
| EFG_D         | Effective Field Goal Percentage Allowed                                                                     |
| TOR           | Turnover Percentage Allowed (Turnover Rate)                                                                 |
| TORD          | Turnover Percentage Committed (Steal Rate)                                                                  |
| ORB           | Offensive Rebound Percentage                                                                                |
| DRB           | Defensive Rebound Percentage                                                                                |
| FTR           | Free Throw Rate (How often the team shoots Free Throws)                                                     |
| FTRD          | Free Throw Rate Allowed                                                                                     |
| 2P_O          | Two-Point Shooting Percentage                                                                               |
| 2P_D          | Two-Point Shooting Percentage Allowed                                                                       |
| 3P_O          | Three-Point Shooting Percentage                                                                             |
| 3P_D          | Three-Point Shooting Percentage Allowed                                                                     |
| ADJ_T         | Adjusted Tempo (possessions per 40 minutes)                                                                 |
| WAB           | Wins Above Bubble (cutoff for NCAA March Madness Tournament)                                                |
| POSTSEASON    | Round where the team was eliminated or season ended (e.g., R64 = Round of 64, etc.)                         |
| SEED          | Seed in the NCAA March Madness Tournament                                                                   |
| YEAR          | Season                                                                                                      |


## **Importing and exploring the `basketball_train.csv` dataset**

---



In [4]:
# Load the data
test_df = pd.read_csv('https://raw.githubusercontent.com/SampMark/files/refs/heads/main/basketball_train.csv')

# Show the shape of the DataFrame and its first rows
print(f"Number of rows and columns in the dataset: {test_df.shape}")
# print(test_df.head().to_markdown(index=False, numalign="left", stralign="left"))
test_df.head()

Number of rows and columns in the dataset: (1757, 24)


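|  | TEAM | CONF | G | W | ADJOE | ADJDE | BARTHAG | EFG_O | EFG_D | TOR | ... | FTRD | 2P_O | 2P_D | 3P_O | 3P_D | ADJ_T | WAB | POSTSEASON | SEED | YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | North Carolina | ACC | 40 | 33 | 123.3 | 94.9 | 0.9531 | 52.6 | 48.1 | 15.4 | ... | 30.4 | 53.9 | 44.6 | 32.7 | 36.2 | 71.7 | 8.6 | 2ND | 1.0 | 2016 |
| 1 | Villanova | BE | 40 | 35 | 123.1 | 90.9 | 0.9703 | 56.1 | 46.7 | 16.3 | ... | 30.0 | 57.4 | 44.1 | 36.2 | 33.9 | 66.7 | 8.9 | Champions | 2.0 | 2016 |
| 2 | Notre Dame | ACC | 36 | 24 | 118.3 | 103.3 | 0.8269 | 54.0 | 49.5 | 15.3 | ... | 26.0 | 52.9 | 46.5 | 37.4 | 36.9 | 65.5 | 2.3 | E8 | 6.0 | 2016 |
| 3 | Virginia | ACC | 37 | 29 | 119.9 | 91.0 | 0.96 | 54.8 | 48.4 | 15.1 | ... | 33.4 | 52.6 | 46.3 | 40.3 | 34.7 | 61.9 | 8.6 | E8 | 1.0 | 2016 |
| 4 | Kansas | B12 | 37 | 32 | 120.9 | 90.4 | 0.9662 | 55.7 | 45.1 | 17.8 | ... | 37.3 | 52.7 | 43.4 | 41.3 | 32.5 | 70.1 | 11.6 | E8 | 1.0 | 2016 |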


In [5]:
# Display summary information about the DataFrame
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1757 entries, 0 to 1756
Data columns (total 24 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   TEAM        1757 non-null   object 
 1   CONF        1757 non-null   object 
 2   G           1757 non-null   int64  
 3   W           1757 non-null   int64  
 4   ADJOE       1757 non-null   float64
 5   ADJDE       1757 non-null   float64
 6   BARTHAG     1757 non-null   float64
 7   EFG_O       1757 non-null   float64
 8   EFG_D       1757 non-null   float64
 9   TOR         1757 non-null   float64
 10  TORD        1757 non-null   float64
 11  ORB         1757 non-null   float64
 12  DRB         1757 non-null   float64
 13  FTR         1757 non-null   float64
 14  FTRD        1757 non-null   float64
 15  2P_O        1757 non-null   float64
 16  2P_D        1757 non-null   float64
 17  3P_O        1757 non-null   float64
 18  3P_D        1757 non-null   float64
 19  ADJ_T       1757 non-null  

In [6]:
# Create the 'windex' column from the condition WAB > 7
test_df['windex'] = np.where(test_df.WAB > 7, 'True', 'False')

# Keep only the rows whose 'POSTSEASON' is 'F4', 'S16' or 'E8'
test_df1 = test_df[test_df['POSTSEASON'].str.contains('F4|S16|E8', na=False)]

# Select the relevant features
test_Feature = test_df1[['G', 'W', 'ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D',
                         'TOR', 'TORD', 'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O',
                         '3P_D', 'ADJ_T', 'WAB', 'SEED', 'windex']].copy()  # explicit copy, avoids SettingWithCopyWarning

# Convert the 'windex' column to integers (0 and 1)
test_Feature['windex'] = test_Feature['windex'].map({'False': 0, 'True': 1}).astype(int)

# Standardize the features (zero mean, unit variance) with StandardScaler
test_X = preprocessing.StandardScaler().fit_transform(test_Feature)

# Show the first 5 rows of the standardized data
print(test_X[0:5])

[[-4.08074446e-01 -1.10135297e+00  3.37365934e-01  2.66479976e+00
  -2.46831661e+00  2.13703245e-01  9.44090550e-01 -1.19216365e+00
  -1.64348924e+00  1.45405982e-02  1.29523097e+00 -6.23533182e-01
  -9.31788560e-01  1.42784371e-01  1.68876201e-01  2.84500844e-01
   1.62625961e+00 -8.36649260e-01 -9.98500539e-01  4.84319174e-01
  -6.77003200e-01]
 [ 3.63958290e-01  3.26326807e-01  7.03145068e-01 -7.13778644e-01
   1.07370841e+00  4.82633172e-01  4.77498943e-01 -1.32975879e+00
  -6.86193316e-02 -7.35448152e-01 -1.35447914e+00 -8.06829025e-01
   3.41737757e-01  4.96641291e-02  9.40576311e-02  1.37214061e+00
   6.93854620e-01 -2.00860931e+00  9.80549967e-01 -1.19401460e+00
   1.47709789e+00]
 [ 3.63958290e-01  1.18293467e+00  9.31757027e-01 -8.78587347e-01
   1.23870131e+00  7.85179340e-01 -9.22275877e-01  5.27775662e-01
  -1.86734575e-01 -1.19385964e-01 -3.17636057e-01  6.82449703e-01
   1.01292055e+00  8.07042098e-02 -9.90811637e-01  1.74718880e+00
  -2.38550367e-01  6.60855252e-01  1.9

In [7]:
test_y = test_df1['POSTSEASON'].values
test_y[0:5]

array(['E8', 'E8', 'E8', 'E8', 'F4'], dtype=object)

In [9]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    test_X, test_y,
    test_size=0.2,    # 20% of the data for validation
    random_state=4,   # fixed seed for reproducibility
    stratify=test_y   # keep class proportions in both splits
)

# Show the shapes of the resulting sets
print(f"Training set: X={X_train.shape}, y={y_train.shape}")
print(f"Validation set: X={X_val.shape}, y={y_val.shape}")

Training set: X=(56, 21), y=(56,)
Validation set: X=(14, 21), y=(14,)
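
In the cells above, `StandardScaler` is fit on all 70 rows before the split, so the validation rows influence the scaling statistics. A minimal sketch of one alternative is to split first and wrap the scaler and the model in a pipeline, so the scaler only sees the training rows. The data here is synthetic (`make_classification`), an assumption standing in for the real features, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 70 x 21 feature matrix (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=70, n_features=21, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=4,
)

# Split first, so no validation row is used to compute means and variances
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only and reuses it at predict time
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)
val_accuracy = pipe.score(X_va, y_va)

print(X_tr.shape, X_va.shape)
```

With 70 rows and `test_size=0.2`, the split has the same 56/14 shape as the one above. Whether this changes the reported scores on the real data is not something this sketch can show.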


# **KNN**

In [10]:
from sklearn.neighbors import KNeighborsClassifier

# Create the KNN model with k = 5 (Minkowski metric with p=2, i.e. Euclidean distance)
knn_model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_val)
y_pred_knn[:5]

array(['S16', 'E8', 'F4', 'E8', 'E8'], dtype=object)
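
The notebook fixes `n_neighbors=5` without comparing other values. One hedged way to choose k is cross-validation on the training split. The sketch below uses synthetic data and an arbitrary candidate range for k, both of which are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 56 training rows (assumption)
X_demo, y_demo = make_classification(
    n_samples=56, n_features=21, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {}
for k in range(1, 12, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Pick the k with the highest mean accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 4))
```

With only 56 training rows the fold scores are noisy, so the selected k should be read as a rough guide rather than a firm optimum.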

# **Decision Tree**

In [11]:
from sklearn.tree import DecisionTreeClassifier

# Create the Decision Tree model
dt_model = DecisionTreeClassifier(criterion="entropy", max_depth=4)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_val)
y_pred_dt[:5]

array(['E8', 'E8', 'F4', 'E8', 'S16'], dtype=object)
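
`DecisionTreeClassifier` randomly permutes the features at each split, so when several splits are equally good, two fits without `random_state` can produce different trees. A minimal sketch on synthetic data (an assumption) showing that a fixed seed makes repeated fits agree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=56, n_features=21, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

def fit_predict(seed):
    """Fit a depth-4 entropy tree with the given seed and return its predictions."""
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=seed)
    return tree.fit(X_demo, y_demo).predict(X_demo)

# Two fits with the same seed give identical predictions
same_seed_agrees = np.array_equal(fit_predict(42), fit_predict(42))
print(same_seed_agrees)
```

Passing `random_state` to the model above would make its metrics repeatable across runs, at the cost of changing the specific numbers shown in this notebook.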

# **SVM**

In [12]:
from sklearn.svm import SVC
svm_model = SVC(kernel='poly', random_state=42)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_val)
y_pred_svm[:5]

array(['S16', 'S16', 'S16', 'S16', 'S16'], dtype=object)
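
`SVC` only exposes `predict_proba` when it is created with `probability=True`, so the model above cannot produce the probabilities that log loss requires. The sketch below, on synthetic data (an assumption), shows how that option makes a log loss value available. scikit-learn obtains these probabilities through an internal cross-validated calibration, so fitting is slower and `predict_proba` may not agree exactly with `predict`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class data (assumption, not the basketball dataset)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=21, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

# probability=True must be set before fit
svm_proba = SVC(kernel="poly", probability=True, random_state=42)
svm_proba.fit(X_tr, y_tr)

proba = svm_proba.predict_proba(X_va)
svm_log_loss = log_loss(y_va, proba, labels=svm_proba.classes_)
print(proba.shape)
```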

# **Logistic Regression**

In [13]:
from sklearn.linear_model import LogisticRegression

# Create the Logistic Regression model with C=0.01
logreg_model = LogisticRegression(
    solver='liblinear',       # efficient solver for small datasets
    random_state=42,          # ensures reproducibility
    max_iter=1000,            # high iteration cap to help convergence
    class_weight='balanced',  # reweights classes to handle imbalance
    C=0.01                    # inverse of regularization strength (smaller = stronger)
)

# Train the model
logreg_model.fit(X_train, y_train)
y_pred_logreg = logreg_model.predict(X_val)
y_pred_logreg[:5]

array(['S16', 'S16', 'F4', 'E8', 'S16'], dtype=object)
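
`C` is the inverse of the regularization strength, so `C=0.01` shrinks the coefficients strongly. A minimal sketch comparing coefficient norms for a small and a large `C`. It rests on two assumptions: synthetic binary data, and scikit-learn's default solver rather than `liblinear`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary data, standardized as in the notebook (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=21, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit the same model with strong (C=0.01) and weak (C=100) regularization
coef_norms = {}
for C in (0.01, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

The smaller `C` yields the smaller coefficient norm, which is the shrinkage effect of the L2 penalty.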

In [14]:
from sklearn.metrics import f1_score, jaccard_score, log_loss, accuracy_score

# Models to evaluate, keyed by display name
modelos = {
    "KNN": knn_model,
    "Decision Tree": dt_model,
    "SVM": svm_model,
    "Logistic Regression": logreg_model
}

# Dictionaries that store the results per model
accuracies = {}
jaccard_scores = {}
f1_scores = {}
log_losses = {}

# Compute the metrics for each model
for nome, modelo in modelos.items():
    # Predict on the validation set
    y_pred = modelo.predict(X_val)

    # Accuracy
    accuracy = accuracy_score(y_val, y_pred)
    accuracies[nome] = accuracy

    # Jaccard index (micro-averaged over the classes)
    jaccard = jaccard_score(y_val, y_pred, average='micro')
    jaccard_scores[nome] = jaccard

    # F1 score (micro-averaged over the classes)
    f1 = f1_score(y_val, y_pred, average='micro')
    f1_scores[nome] = f1

    # Log loss (only for models that expose predict_proba)
    try:
        y_pred_proba = modelo.predict_proba(X_val)
        # Pass the class order explicitly so columns and labels line up
        logloss = log_loss(y_val, y_pred_proba, labels=modelo.classes_)
        log_losses[nome] = logloss
    except AttributeError:
        log_losses[nome] = "N/A"  # model does not support predict_proba

    # Print the results for the current model
    print(f"\n{nome} Model")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Jaccard Score: {jaccard:.4f}")
    print(f"F1 Score: {f1:.4f}")
    if isinstance(log_losses[nome], (int, float)):
        print(f"Log Loss: {log_losses[nome]:.4f}")
    else:
        print(f"Log Loss: {log_losses[nome]}")  # print as-is when it is not a number


KNN Model
Accuracy: 0.4286
Jaccard Score: 0.2727
F1 Score: 0.4286
Log Loss: 3.2479

Decision Tree Model
Accuracy: 0.2857
Jaccard Score: 0.1667
F1 Score: 0.2857
Log Loss: 13.2907

SVM Model
Accuracy: 0.5714
Jaccard Score: 0.4000
F1 Score: 0.5714
Log Loss: N/A

Logistic Regression Model
Accuracy: 0.7143
Jaccard Score: 0.5556
F1 Score: 0.7143
Log Loss: 1.0411
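
In every block of results above, the F1 score equals the accuracy. That is expected: for single-label multiclass problems, micro-averaged F1 reduces to accuracy, and the micro-averaged Jaccard index equals acc / (2 - acc). The sketch below checks this with hypothetical labels (6 correct out of 14, chosen to mirror the KNN accuracy) and also computes macro-averaged F1, one alternative that weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Hypothetical validation labels and predictions: 6 of 14 correct
y_true = ["S16"] * 8 + ["E8"] * 4 + ["F4"] * 2
y_pred = (["S16"] * 4 + ["E8"] * 4      # S16 rows: 4 correct, 4 wrong
          + ["E8"] * 2 + ["S16"] * 2    # E8 rows: 2 correct, 2 wrong
          + ["S16"] * 2)                # F4 rows: 0 correct

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")
jac_micro = jaccard_score(y_true, y_pred, average="micro")

# Macro averaging: per-class F1 averaged with equal weight per class
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"{acc:.4f} {f1_micro:.4f} {jac_micro:.4f}")  # prints "0.4286 0.4286 0.2727"
```

Because micro-averaged F1 carries no information beyond accuracy here, a macro or weighted average may be more informative when the classes are imbalanced.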


In [None]:
# Print the comparison table
print("\nComparison Table:")
print("-" * 65)
print("{:<20} {:<10} {:<10} {:<10} {:<10}".format("Model", "Accuracy", "Jaccard", "F1-Score", "LogLoss"))
print("-" * 65)
for nome in modelos:
    log_loss_value = f"{log_losses[nome]:.4f}" if isinstance(log_losses[nome], (int, float)) else log_losses[nome]
    print("{:<20} {:<10.4f} {:<10.4f} {:<10.4f} {:<10}".format(
        nome, accuracies[nome], jaccard_scores[nome], f1_scores[nome], log_loss_value))
print("-" * 65)


Comparison Table:
-----------------------------------------------------------------
Model                Accuracy   Jaccard    F1-Score   LogLoss   
-----------------------------------------------------------------
KNN                  0.4286     0.2727     0.4286     3.2479    
Decision Tree        0.2143     0.1200     0.2143     15.8653   
SVM                  0.5714     0.4000     0.5714     N/A       
Logistic Regression  0.7143     0.5556     0.7143     1.0411    
-----------------------------------------------------------------

Note that the Decision Tree row in this table (accuracy 0.2143, log loss 15.8653) differs from the per-model output printed earlier (accuracy 0.2857, log loss 13.2907). The tree is created without `random_state`, so its results can change from run to run, and this table appears to come from a different execution than the earlier cell. The other three models report the same values in both places.
