# Data Mining Project - WNBA Playoffs prediction

### Project developed by:
- Adam Nogueira (up202007519)
- Eduardo Silva (up202004999)
- João Félix (up202008867)


### Introduction
This project involves developing a data mining case study, which is described in a separate document provided on Moodle. The main focus of the project is a predictive data mining task, the details of which are outlined in the case study description.

### Bibliography
NumPy developers, NumPy documentation, URL: https://numpy.org/doc/stable/user/index.html#user <br>
pandas development team, pandas documentation, URL: https://pandas.pydata.org/docs/user_guide/index.html#user-guide<br>
Matplotlib Development team, Matplotlib documentation, URL: https://matplotlib.org/stable/index.html <br>
scikit-learn developers, scikit-learn documentation, URL: https://scikit-learn.org/0.18/documentation.html<br>

### Approach

We approached the project as follows:

1. **Data analysis**: We first analyzed the dataset to determine how much pre-processing it needed: we checked the feature histograms, the class distribution, and the presence of missing or null values.
2. **Algorithm implementation**: Following that, we defined the training and test sets with a train/test split, resampled the training set, and applied scikit-learn's algorithms to obtain the first results.
3. **Evaluation and refinement**: After analyzing the first results, we tuned each algorithm with scikit-learn's GridSearchCV to find the parameters that yielded the best overall results, and then evaluated the final results.
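
The tuning step in item 3 could look roughly like the sketch below. It is only an illustration: the data is synthetic (`make_classification`), and the parameter grid is a hypothetical one, since the grids actually used in the project are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the team statistics (assumption: not the real data).
X, y = make_classification(n_samples=142, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Hypothetical search space; the grid used in the project is not shown.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validated search over every combination in the grid.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 3))
```

The search only ever sees the training split, so the held-out test accuracy remains an honest estimate of the tuned model.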

### Used Libraries

- **NumPy**: Provides a fast numerical array structure and helper functions.
- **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
- **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
- **sklearn**: The essential Machine Learning package in Python.
- **seaborn**: Advanced statistical plotting library.
- **pycaret**: Offers streamlined workflows and a wide range of pre-built algorithms and techniques to experiment with different models and compare their performance using different evaluation metrics.


## Data analysis

We start by importing the required libraries and plotting some graphs for initial analysis of the dataset.

### Key Statistics

- Win-Loss Record: This is the most straightforward indicator. Teams with more wins are more likely to make the playoffs. Historically, teams with a win-loss record of around .500 or better tend to have a good chance of qualifying.

- Winning Percentage: Similar to win-loss record, winning percentage (Wins / Total Games) is a fundamental metric used to assess a team's performance.

- Points Per Game (PPG): Teams that score more points on average are often more successful. This statistic reflects a team's offensive efficiency.

- Points Allowed Per Game (PAPG): Teams that allow fewer points per game have a stronger defense. Defensive efficiency is a critical factor in determining playoff success.

- Net Rating: Net rating is the difference between a team's offensive rating (points scored per 100 possessions) and their defensive rating (points allowed per 100 possessions). Teams with positive net ratings are often playoff-bound.

- Field Goal Percentage (FG%): Shooting efficiency is a key factor in a team's offensive performance. A high field goal percentage indicates effective shooting.

- Three-Point Percentage (3P%): The ability to make three-point shots is crucial in modern basketball. Teams with high three-point percentages often perform well.

- Free Throw Percentage (FT%): Teams with good free throw shooting can close out close games more effectively.

- Rebounds Per Game (RPG): Rebounding is a key component of both offense and defense. Teams that dominate the boards tend to have an advantage.

- Assists Per Game (APG): Ball movement and sharing are critical in basketball. Teams with high assist numbers often have a strong offense.

- Steals and Blocks: Defensive statistics such as steals and blocks indicate a team's ability to disrupt the opponent's offense and protect the rim.

- Turnovers: Reducing turnovers is important for maintaining possession and minimizing scoring opportunities for the opposition.
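
Several of the statistics above can be derived from season totals. The sketch below uses a small made-up DataFrame that borrows column names from `teams.csv`; it assumes that the `o_` columns hold the team's own totals and the `d_` columns the opponents' totals. The values are illustrative only, and point differential per game stands in for net rating, which would require possession counts.

```python
import pandas as pd

# Two made-up season rows that reuse column names from teams.csv
# (assumption: the values are illustrative, not taken from the dataset).
teams_example = pd.DataFrame({
    "tmID": ["AAA", "BBB"],
    "won": [20, 12],
    "lost": [14, 22],
    "GP": [34, 34],
    "o_pts": [2550, 2300],
    "d_pts": [2400, 2450],
    "o_fgm": [900, 820],
    "o_fga": [2000, 2050],
    "o_3pm": [180, 130],
    "o_3pa": [500, 400],
    "o_ftm": [570, 530],
    "o_fta": [750, 700],
})

stats = pd.DataFrame({
    "tmID": teams_example["tmID"],
    "win_pct": teams_example["won"] / teams_example["GP"],
    "ppg": teams_example["o_pts"] / teams_example["GP"],
    "papg": teams_example["d_pts"] / teams_example["GP"],
    "fg_pct": teams_example["o_fgm"] / teams_example["o_fga"],
    "three_pct": teams_example["o_3pm"] / teams_example["o_3pa"],
    "ft_pct": teams_example["o_ftm"] / teams_example["o_fta"],
})

# Point differential per game: a simple stand-in for net rating, which
# would need possession counts that are not used in this sketch.
stats["point_diff_pg"] = stats["ppg"] - stats["papg"]
print(stats.round(3))
```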


In [168]:
import data_manip
import os
import warnings
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from enum import Enum
import seaborn as sb
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
#from pycaret.classification import *
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
%pip install imbalanced-learn


# Set the warning filter to "ignore"
warnings.filterwarnings("ignore", category=UserWarning)

awards = pd.read_csv('modified_data/awards_players.csv', na_values=['NA'], delimiter=",")
coaches = pd.read_csv('modified_data/coaches.csv', na_values=['NA'], delimiter=",")
players = pd.read_csv('modified_data/players.csv', na_values=['NA'], delimiter=",")
players_teams = pd.read_csv('modified_data/players_teams.csv', na_values=['NA'], delimiter=",")
series_post = pd.read_csv('modified_data/series_post.csv', na_values=['NA'], delimiter=",")
teams = pd.read_csv('modified_data/teams.csv', na_values=['NA'], delimiter=",")
teams_post = pd.read_csv('modified_data/teams_post.csv', na_values=['NA'], delimiter=",")


print(teams)

     year  lgID tmID franchID confID  rank playoff firstRound semis finals  \
0       9  WNBA  ATL      ATL     EA     7       N        NaN   NaN    NaN   
1      10  WNBA  ATL      ATL     EA     2       Y          L   NaN    NaN   
2       1  WNBA  CHA      CHA     EA     8       N        NaN   NaN    NaN   
3       2  WNBA  CHA      CHA     EA     4       Y          W     W      L   
4       3  WNBA  CHA      CHA     EA     2       Y          L   NaN    NaN   
..    ...   ...  ...      ...    ...   ...     ...        ...   ...    ...   
137     6  WNBA  WAS      WAS     EA     5       N        NaN   NaN    NaN   
138     7  WNBA  WAS      WAS     EA     4       Y          L   NaN    NaN   
139     8  WNBA  WAS      WAS     EA     5       N        NaN   NaN    NaN   
140     9  WNBA  WAS      WAS     EA     6       N        NaN   NaN    NaN   
141    10  WNBA  WAS      WAS     EA     4       Y          L   NaN    NaN  

### Dataset Preparation

In [169]:
# PLAYERS_TEAMS
player_teams_input_path = 'original_data/players_teams.csv'
player_teams_column_pairs = [('fgMade', 'fgAttempted'), ('ftMade', 'ftAttempted'), ('threeMade', 'threeAttempted')]
player_teams_output_path = 'modified_data/players_teams.csv'
convert_columns_to_ratio(player_teams_input_path, player_teams_column_pairs, player_teams_output_path)

player_teams_columns_to_exclude = ['fgMade', 'fgAttempted', 'ftMade', 'ftAttempted', 'threeMade', 'threeAttempted']
exclude_columns(player_teams_output_path, player_teams_columns_to_exclude, player_teams_output_path)

# TEAMS
teams_input_path = 'original_data/teams.csv'
teams_columns_to_exclude = ['confW', 'confL', 'min', 'attend', 'arena', 'tmORB', 'tmDRB', 'tmTRB', 'opptmORB', 'opptmDRB', 'opptmTRB', 'divID', 'seeded']
teams_output_path = 'modified_data/teams.csv'

exclude_columns(teams_input_path, teams_columns_to_exclude, teams_output_path)

# PLAYERS
players_input_path = 'original_data/players.csv'
players_columns_to_exclude = ['firstseason', 'lastseason', 'height', 'weight', 'college', 'collegeOther', 'deathDate']
players_output_path = 'modified_data/players.csv'
exclude_columns(players_input_path, players_columns_to_exclude, players_output_path)

column_mapping = {'bioID': 'playerID'}
rename_columns(players_output_path, column_mapping, players_output_path)


# AWARDS_PLAYERS
players_input_path = 'original_data/awards_players.csv'
players_columns_to_exclude = ['award', 'lgID']
players_output_path = 'modified_data/awards_players.csv'
exclude_columns(players_input_path, players_columns_to_exclude, players_output_path)

# TEAMS_POST
teams_input_path = 'original_data/teams_post.csv'
teams_columns_to_exclude = ['lgID']
teams_output_path = 'modified_data/teams_post.csv'

exclude_columns(teams_input_path, teams_columns_to_exclude, teams_output_path)

Columns fgMade/fgAttempted, ftMade/ftAttempted, threeMade/threeAttempted converted to ratio columns and new file saved as 'modified_data/players_teams.csv'

Columns fgMade, fgAttempted, ftMade, ftAttempted, threeMade, threeAttempted removed and new file saved as 'modified_data/players_teams.csv'

Columns confW, confL, min, attend, arena, tmORB, tmDRB, tmTRB, opptmORB, opptmDRB, opptmTRB, divID, seeded removed and new file saved as 'modified_data/teams.csv'

Columns firstseason, lastseason, height, weight, college, collegeOther, deathDate removed and new file saved as 'modified_data/players.csv'

Columns renamed and new file saved as 'modified_data/players.csv'

Columns award, lgID removed and new file saved as 'modified_data/awards_players.csv'

Columns lgID removed and new file saved as 'modified_data/teams_post.csv'
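
The helper functions called in the cell above (`convert_columns_to_ratio`, `exclude_columns`, `rename_columns`) are not shown in this notebook. The sketch below is one plausible way to write them with pandas, inferred only from how they are called and from the messages they print. The ratio column naming (`<made>_ratio`), the handling of zero attempts, and the temporary demo file are assumptions, not the project's actual implementation.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def convert_columns_to_ratio(input_path, column_pairs, output_path):
    """Add one made/attempted ratio column per pair and save the result."""
    df = pd.read_csv(input_path)
    for made, attempted in column_pairs:
        # Zero attempts give NaN instead of a division-by-zero result.
        attempts = df[attempted].replace(0, np.nan)
        # Hypothetical naming scheme for the new column.
        df[f"{made}_ratio"] = df[made] / attempts
    df.to_csv(output_path, index=False)


def exclude_columns(input_path, columns_to_exclude, output_path):
    """Drop the listed columns and save the result."""
    df = pd.read_csv(input_path)
    df.drop(columns=columns_to_exclude).to_csv(output_path, index=False)


def rename_columns(input_path, column_mapping, output_path):
    """Rename columns according to the mapping and save the result."""
    df = pd.read_csv(input_path)
    df.rename(columns=column_mapping).to_csv(output_path, index=False)


# Tiny demonstration on a temporary file with made-up rows.
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "demo.csv")
pd.DataFrame({
    "bioID": ["a", "b"],
    "fgMade": [5, 0],
    "fgAttempted": [10, 0],
}).to_csv(demo_path, index=False)

convert_columns_to_ratio(demo_path, [("fgMade", "fgAttempted")], demo_path)
exclude_columns(demo_path, ["fgMade", "fgAttempted"], demo_path)
rename_columns(demo_path, {"bioID": "playerID"}, demo_path)

demo_result = pd.read_csv(demo_path)
print(demo_result)
```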



### Dataset

In [170]:
dataset = pd.read_csv('modified_data/teams.csv')

In [171]:
# Custom formatting function
def custom_format(value):
    # Check if the value is a number (int or float)
    if isinstance(value, (int, float)):
        # If it's an integer, format as an integer
        if isinstance(value, int):
            return value
        # If it's a float, format with 2 decimal places
        elif isinstance(value, float):
            return round(value, 2)
    else:
        return value

formatted_df = teams.applymap(custom_format)
formatted_df.describe()

| | year | rank | o_fgm | o_fga | o_ftm | o_fta | o_3pm | o_3pa | o_oreb | o_dreb | ... | d_to | d_blk | d_pts | won | lost | GP | homeW | homeL | awayW | awayL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | ... | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 |
| mean | 5.302817 | 4.084507 | 860.387324 | 2039.683099 | 488.338028 | 651.366197 | 157.161972 | 463.014085 | 330.5 | 730.929577 | ... | 510.450704 | 122.070423 | 2366.260563 | 16.661972 | 16.661972 | 33.323944 | 10.169014 | 6.492958 | 6.492958 | 10.169014 |
| std | 2.917274 | 2.095226 | 86.998969 | 176.879707 | 70.749372 | 86.035246 | 43.73658 | 116.166119 | 41.191432 | 83.378114 | ... | 54.038019 | 20.658537 | 234.615384 | 4.999131 | 4.999131 | 0.949425 | 2.994017 | 2.967308 | 2.702104 | 2.731409 |
| min | 1.0 | 1.0 | 647.0 | 1740.0 | 333.0 | 469.0 | 62.0 | 205.0 | 242.0 | 537.0 | ... | 390.0 | 71.0 | 1788.0 | 4.0 | 4.0 | 32.0 | 1.0 | 0.0 | 1.0 | 3.0 |
| 25% | 3.0 | 2.0 | 794.5 | 1908.5 | 435.25 | 582.75 | 128.25 | 389.0 | 301.25 | 653.25 | ... | 470.25 | 109.0 | 2196.75 | 13.0 | 14.0 | 32.0 | 8.0 | 4.25 | 5.0 | 9.0 |
| 50% | 5.0 | 4.0 | 864.0 | 2025.0 | 483.5 | 650.0 | 157.0 | 459.0 | 333.5 | 724.0 | ... | 503.0 | 123.0 | 2339.5 | 17.0 | 16.0 | 34.0 | 11.0 | 6.0 | 6.0 | 10.0 |
| 75% | 8.0 | 6.0 | 915.0 | 2177.5 | 539.0 | 716.5 | 180.75 | 528.0 | 356.75 | 788.0 | ... | 545.5 | 136.75 | 2522.75 | 20.0 | 20.0 | 34.0 | 12.0 | 8.0 | 8.0 | 12.0 |
| max | 10.0 | 8.0 | 1128.0 | 2485.0 | 668.0 | 882.0 | 283.0 | 802.0 | 452.0 | 931.0 | ... | 649.0 | 206.0 | 3031.0 | 28.0 | 30.0 | 34.0 | 16.0 | 16.0 | 13.0 | 16.0 |


#### Checking for null values

In [172]:
dataset.isna().sum()

year            0
lgID            0
tmID            0
franchID        0
confID          0
rank            0
playoff         0
firstRound     62
semis         104
finals        122
name            0
o_fgm           0
o_fga           0
o_ftm           0
o_fta           0
o_3pm           0
o_3pa           0
o_oreb          0
o_dreb          0
o_reb           0
o_asts          0
o_pf            0
o_stl           0
o_to            0
o_blk           0
o_pts           0
d_fgm           0
d_fga           0
d_ftm           0
d_fta           0
d_3pm           0
d_3pa           0
d_oreb          0
d_dreb          0
d_reb           0
d_asts          0
d_pf            0
d_stl           0
d_to            0
d_blk           0
d_pts           0
won             0
lost            0
GP              0
homeW           0
homeL           0
awayW           0
awayL           0
dtype: int64
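
The null counts above are concentrated in the post-season columns. One way to check whether such gaps are structural (a round a team never played) rather than data errors is to compare the null mask with the `playoff` column. The sketch below does this on a small made-up frame that mirrors the relevant columns; it is an illustration, not a check that was run on the real file.

```python
import numpy as np
import pandas as pd

# Small made-up frame mirroring the relevant columns of teams.csv.
example = pd.DataFrame({
    "playoff": ["N", "Y", "Y", "N"],
    "firstRound": [np.nan, "L", "W", np.nan],
    "semis": [np.nan, np.nan, "L", np.nan],
})

# Which columns contain any missing value at all?
missing_per_column = example.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# firstRound should be empty exactly for the teams that missed the playoffs.
first_round_matches_playoff = bool(
    (example["firstRound"].isna() == (example["playoff"] == "N")).all()
)
print(columns_with_missing, first_round_matches_playoff)
```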

## Data Preprocessing

After examining the dataset, we found it largely consistent. The only missing values are in `firstRound`, `semis`, and `finals` (62, 104, and 122 missing entries respectively), which are empty for teams that did not reach that round; these columns are not used as model inputs. We observed no notable outliers, so the dataset required minimal preprocessing.

In [173]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 48 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   year        142 non-null    int64 
 1   lgID        142 non-null    object
 2   tmID        142 non-null    object
 3   franchID    142 non-null    object
 4   confID      142 non-null    object
 5   rank        142 non-null    int64 
 6   playoff     142 non-null    object
 7   firstRound  80 non-null     object
 8   semis       38 non-null     object
 9   finals      20 non-null     object
 10  name        142 non-null    object
 11  o_fgm       142 non-null    int64 
 12  o_fga       142 non-null    int64 
 13  o_ftm       142 non-null    int64 
 14  o_fta       142 non-null    int64 
 15  o_3pm       142 non-null    int64 
 16  o_3pa       142 non-null    int64 
 17  o_oreb      142 non-null    int64 
 18  o_dreb      142 non-null    int64 
 19  o_reb       142 non-null    int64 
 20  o_asts    

### Train and Test split data

We split the data into inputs and labels for the scikit-learn classifiers. The label is the class column (`playoff`), and the inputs are the remaining columns.

In [174]:
dataset['playoff'] = dataset['playoff'].astype('category')

# Columns that must not be used as model inputs: identifiers, post-season
# outcomes, and the label itself. Leaving 'playoff' among the inputs would
# leak the label and make the feature matrix non-numeric.
excluded_columns = ['name', 'lgID', 'tmID', 'franchID', 'confID',
                    'firstRound', 'semis', 'finals', 'playoff']
col_names = [col for col in dataset.columns if col not in excluded_columns]

inputs = dataset[col_names].values
labels = dataset['playoff'].values

In short, we split the data into training and test sets while preserving the original class distribution, holding out 1/4 of the original dataset for testing.

- stratify - preserves the class distribution in both sets
- train_in - stores the training features
- test_in - stores the test features
- train_classes - stores the classes of the training data
- test_classes - stores the classes of the test data
- random_state - fixes the random seed so that the split is reproducible


In [175]:
from sklearn.model_selection import train_test_split

(train_in,
 test_in,
 train_classes,
 test_classes) = train_test_split(inputs, labels, test_size=0.25, random_state=1, stratify=labels)


The data analysis showed that our working dataset is slightly unbalanced. We implemented both undersampling and oversampling. Undersampling removes samples from the majority class, while oversampling adds samples to the minority class, either by duplicating existing ones or, as SMOTE does, by synthesizing new ones. Oversampling is often preferred on small datasets because it does not discard training data.

The next cell counts the occurrences of each class in the training and test sets, giving an overview of the class distribution in each set.

In [176]:
from collections import Counter

print("---Train Set---")
print(Counter(train_classes))
print("\n---Test Set---")
print(Counter(test_classes))

---Train Set---
Counter({'Y': 60, 'N': 46})

---Test Set---
Counter({'Y': 20, 'N': 16})
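
To see what `stratify` contributes, the share of each class can be compared between the two sets. The sketch below reuses only the class counts reported in this notebook (80 `Y`, 62 `N`); the features are placeholder row indices, and `share_of_y` is a helper name introduced here for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Labels with the class counts reported in this notebook (80 'Y', 62 'N');
# the features are placeholder row indices, not real statistics.
labels = ["Y"] * 80 + ["N"] * 62
features = [[i] for i in range(len(labels))]

_, _, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=1, stratify=labels
)


def share_of_y(values):
    """Fraction of samples that belong to class 'Y'."""
    return Counter(values)["Y"] / len(values)


train_share = share_of_y(train_labels)
test_share = share_of_y(test_labels)
print(round(train_share, 3), round(test_share, 3))
```

With stratification the two shares stay close to each other; without it, a small test set like this one could end up noticeably more or less balanced than the training set.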


Undersampling is used to balance the data: it randomly removes samples from the majority class until the desired balance is reached.

In [177]:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()

us_inputs, us_labels = rus.fit_resample(train_in, train_classes)

print(Counter(us_labels))

Counter({'N': 46, 'Y': 46})


The next cell oversamples the training data; the class counts printed afterwards should confirm that the classes are balanced. Oversampling increases the number of minority-class rows until it matches the majority class.

In [178]:
from imblearn.over_sampling import SMOTE

ros = SMOTE()

os_inputs, os_labels = ros.fit_resample(train_in, train_classes)

print(Counter(os_labels))

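ValueError: could not convert string to float: 'N'

The recorded run failed here: SMOTE requires purely numeric features, and the input matrix of that run still contained the string-valued `playoff` column.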
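
SMOTE creates new minority samples by interpolating between existing ones rather than by copying them. The NumPy sketch below illustrates that idea in a deliberately simplified form: it always interpolates towards the single nearest minority neighbour, whereas SMOTE chooses among several nearest neighbours. It is not imbalanced-learn's implementation, and the data is made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up numeric features: 6 majority rows and 3 minority rows.
X_major = rng.normal(loc=0.0, scale=1.0, size=(6, 2))
X_minor = rng.normal(loc=3.0, scale=1.0, size=(3, 2))

n_new = len(X_major) - len(X_minor)  # synthetic rows needed for balance
synthetic = []
for _ in range(n_new):
    # Pick a minority sample and find its nearest minority neighbour ...
    i = rng.integers(len(X_minor))
    distances = np.linalg.norm(X_minor - X_minor[i], axis=1)
    distances[i] = np.inf  # exclude the sample itself
    j = int(np.argmin(distances))
    # ... then place a new point somewhere on the segment between them.
    gap = rng.random()
    synthetic.append(X_minor[i] + gap * (X_minor[j] - X_minor[i]))

synthetic = np.array(synthetic)
X_minor_balanced = np.vstack([X_minor, synthetic])
print(X_major.shape, X_minor_balanced.shape)
```

Because every step is arithmetic on feature vectors, the inputs have to be numeric, which is why a string column in the feature matrix stops the resampling.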

We used the StandardScaler from scikit-learn's preprocessing module to standardize the data. This is necessary for K-Nearest Neighbors and SVM, because both are sensitive to the scale of the features.

Standardization ensures that all features contribute equally to the machine learning models and prevents any single feature from dominating the learning process.

In [None]:
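from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training data only, then reuse the same statistics
# for every other set so that the test set never influences the scaling.
scaler.fit(train_in)
train_in = scaler.transform(train_in)
test_in = scaler.transform(test_in)

# The resampled sets are derived from the training data, so they are scaled
# with the same fitted scaler to stay comparable with test_in.
os_inputs = scaler.transform(os_inputs)
us_inputs = scaler.transform(us_inputs)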
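
One way to make sure the scaling statistics come only from the training data is to bundle the scaler and the classifier in a scikit-learn pipeline. The sketch below shows the pattern on synthetic data; it is an alternative arrangement for illustration, not the setup used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the project's dataset).
X, y = make_classification(n_samples=142, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# The pipeline fits the scaler on the training data only and reuses the
# learned mean and standard deviation when predicting on the test data.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)

fitted_scaler = model.named_steps["standardscaler"]
test_accuracy = model.score(X_test, y_test)
print(np.round(fitted_scaler.mean_[:3], 3), round(test_accuracy, 3))
```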

# Classification

## K-Nearest Neighbors

### Original Dataset

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

classifier = KNeighborsClassifier()
classifier.fit(train_in, train_classes)
y_pred = classifier.predict(test_in)

result = confusion_matrix(test_classes, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(test_classes, y_pred)
print("Classification Report:",)
print (result1) 

knn_og_report = classification_report(test_classes, y_pred,output_dict=True)

### Undersampled Dataset

In [None]:
classifier = KNeighborsClassifier()
classifier.fit(us_inputs, us_labels)
y_pred = classifier.predict(test_in)

knn_confusion_matrix = confusion_matrix(test_classes, y_pred)
print("Confusion Matrix:")
print(knn_confusion_matrix)
result1 = classification_report(test_classes, y_pred)
print("Classification Report:",)
print(result1)

knn_us_report = classification_report(test_classes, y_pred, output_dict=True)

### Oversampled Dataset

In [None]:
classifier = KNeighborsClassifier()
classifier.fit(os_inputs, os_labels)
y_pred = classifier.predict(test_in)

result = confusion_matrix(test_classes, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(test_classes, y_pred)
print("Classification Report:",)
print (result1)

knn_os_report = classification_report(test_classes, y_pred,output_dict=True)

## Decision Tree Classifier

### Original Dataset

Confusion matrix, as laid out by scikit-learn (rows are true classes, columns are predicted classes):
TN FP
FN TP
FP = False Positive - the model predicted the positive class (1), but the true class is 0

Precision - measures the proportion of positive predictions that are correct: TP/(TP+FP)

Accuracy - measures overall correctness: (TP + TN) / (TP + TN + FP + FN).
    - can be misleading on unbalanced data
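
As a small worked example of these definitions, the sketch below computes precision and accuracy by hand from a confusion matrix. The eight labels are made up for illustration, and `Y` is treated as the positive class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Made-up labels for eight teams (assumption: illustrative only).
y_true = ["Y", "Y", "Y", "Y", "N", "N", "N", "Y"]
y_pred = ["Y", "Y", "N", "Y", "N", "Y", "N", "Y"]

# With labels=["N", "Y"], row/column 0 is 'N' and row/column 1 is 'Y',
# so ravel() yields TN, FP, FN, TP with 'Y' as the positive class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["N", "Y"]).ravel()

precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Cross-check the hand-computed values against scikit-learn's helpers.
sk_precision = precision_score(y_true, y_pred, pos_label="Y")
sk_accuracy = accuracy_score(y_true, y_pred)
print(precision, accuracy)
```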

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()

dtc.fit(train_in, train_classes)
dtc_prediction = dtc.predict(test_in)

dtc_classification_report = classification_report(test_classes, dtc_prediction, output_dict=True)

print("--- Original dataset ---\n")
print("Confusion matrix:")
print(f"{confusion_matrix(test_classes, dtc_prediction)}\n")
print(f"Classification report:")
print(f"{classification_report(test_classes, dtc_prediction)}\n")


### Undersampled Dataset

In [None]:
dtc.fit(us_inputs, us_labels)
dtc_prediction = dtc.predict(test_in)

dtc_us_classification_report = classification_report(test_classes, dtc_prediction, output_dict=True)

print("--- Undersampled dataset ---\n")
print(f"Confusion matrix:\n{confusion_matrix(test_classes, dtc_prediction)}\n")
print(f"Classification report:\n{classification_report(test_classes, dtc_prediction)}\n")

### Oversampled Dataset

In [None]:
dtc.fit(os_inputs, os_labels)
dtc_prediction = dtc.predict(test_in)

dtc_os_classification_report = classification_report(test_classes, dtc_prediction, output_dict=True)

print("--- Oversampled dataset ---\n")
print(f"Confusion matrix:\n{confusion_matrix(test_classes, dtc_prediction)}\n")
print(f"Classification report:\n{classification_report(test_classes, dtc_prediction)}\n")

## SVM

### Original Dataset

In [None]:
svc = SVC()

svc.fit(train_in, train_classes)
y_pred = svc.predict(test_in)

result = confusion_matrix(test_classes, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(test_classes, y_pred)
print("Classification Report:",)
print (result1) 

svc_og_report = classification_report(test_classes, y_pred,output_dict=True)

### Undersampled Dataset

In [None]:
from sklearn.svm import SVC

svc = SVC()

svc.fit(us_inputs, us_labels)
y_pred = svc.predict(test_in)

result = confusion_matrix(test_classes, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(test_classes, y_pred)
print("Classification Report:",)
print (result1) 

svc_us_report = classification_report(test_classes, y_pred,output_dict=True)

### Oversampled Dataset

In [None]:
from sklearn.svm import SVC

svc = SVC()

svc.fit(os_inputs, os_labels)
y_pred = svc.predict(test_in)

result = confusion_matrix(test_classes, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(test_classes, y_pred)
print("Classification Report:",)
print (result1) 

svc_os_report = classification_report(test_classes, y_pred,output_dict=True)
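
The cells above repeat the same fit, predict, and report pattern for every classifier and dataset. The sketch below shows how that pattern could be condensed into one helper and a loop. The `evaluate` function is a hypothetical name, and the data is synthetic rather than the project's.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_fit, y_fit, X_eval, y_eval):
    """Fit the model and return its classification report as a dict."""
    model.fit(X_fit, y_fit)
    return classification_report(y_eval, model.predict(X_eval), output_dict=True)


# Synthetic stand-in data (assumption: not the project's dataset).
X, y = make_classification(n_samples=142, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

models = {
    "knn": KNeighborsClassifier(),
    "dtc": DecisionTreeClassifier(random_state=1),
    "svc": SVC(),
}

# One report per model; the same loop could also iterate over the
# original, undersampled, and oversampled training sets.
reports = {
    name: evaluate(model, X_train, y_train, X_test, y_test)
    for name, model in models.items()
}
for name, report in reports.items():
    print(name, round(report["accuracy"], 3))
```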

## Result Analysis

### All algorithms

In [None]:

all_algorithms_data = pd.read_csv('models_comparison.csv', na_values=['NA'], delimiter=",")
all_algorithms_data.set_index("Model", inplace=True)

sb.heatmap(all_algorithms_data, cmap="YlGnBu", annot=True)
plt.xlabel('Results')
plt.ylabel('ML Models')
plt.show()
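
How `models_comparison.csv` was produced is not shown in this notebook. As a rough illustration, the sketch below builds a table with the same metric columns (`Accuracy`, `AUC`, `Recall`, `Precision`, `F1`, `Kappa`, `MCC`) using scikit-learn's metric functions. The data is synthetic and the two models are arbitrary examples, so the numbers have no relation to the project's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the project's dataset).
X, y = make_classification(n_samples=142, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_score),
        "Recall": recall_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "Kappa": cohen_kappa_score(y_test, y_pred),
        "MCC": matthews_corrcoef(y_test, y_pred),
    })

comparison = pd.DataFrame(rows).set_index("Model")
print(comparison.round(3))
```

A frame shaped like `comparison` can be passed directly to `sb.heatmap`, as the cell above does with the CSV contents.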


### Selected algorithms

In [None]:

selected_algorithms_data = pd.read_csv('models_comparison_selected.csv', na_values=['NA'], delimiter=",")
selected_algorithms_data.set_index("Model", inplace=True)

sb.heatmap(selected_algorithms_data, cmap="YlGnBu", annot=True)
plt.xlabel('Results')
plt.ylabel('ML Models')
plt.show()


### Comparing Accuracy of each algorithm

In [None]:
selected_algorithms_data = selected_algorithms_data.sort_values(by=['Accuracy'], ascending=False)

# Create the bar plot
plt.figure(figsize=(10, 6))
sb.barplot(x=selected_algorithms_data.index, y='Accuracy', data=selected_algorithms_data, color='#A7226E')

# Set the plot title and axis labels
plt.title('Accuracy Comparison by Algorithm')
plt.xlabel('Model')
plt.ylabel('Accuracy')

plt.xticks(rotation=90)

# Display the plot
plt.show()

### Comparing AUC (Area Under Curve) of each algorithm

In [None]:
selected_algorithms_data = selected_algorithms_data.sort_values(by=['AUC'], ascending=False)

# Create the bar plot
plt.figure(figsize=(10, 6))
sb.barplot(x=selected_algorithms_data.index, y='AUC', data=selected_algorithms_data, color='#FE4365')

# Set the plot title and axis labels
plt.title('AUC Comparison by Algorithm')
plt.xlabel('Model')
plt.ylabel('AUC')

plt.xticks(rotation=90)

# Display the plot
plt.show()

### Comparing Recall of each algorithm

In [None]:
selected_algorithms_data = selected_algorithms_data.sort_values(by=['Recall'], ascending=False)

# Create the bar plot
plt.figure(figsize=(10, 6))
sb.barplot(x=selected_algorithms_data.index, y='Recall', data=selected_algorithms_data, color='#9DE0AD')

# Set the plot title and axis labels
plt.title('Recall Comparison by Algorithm')
plt.xlabel('Model')
plt.ylabel('Recall')

plt.xticks(rotation=90)

# Display the plot
plt.show()

### Comparing Precision of each algorithm

In [None]:
selected_algorithms_data = selected_algorithms_data.sort_values(by=['Precision'], ascending=False)

# Create the bar plot
plt.figure(figsize=(10, 6))
sb.barplot(x=selected_algorithms_data.index, y='Precision', data=selected_algorithms_data, color='#F7DB4F')

# Set the plot title and axis labels
plt.title('Precision Comparison by Algorithm')
plt.xlabel('Model')
plt.ylabel('Precision')

plt.xticks(rotation=90)

# Display the plot
plt.show()

### Comparing F1-score of each algorithm

In [None]:
selected_algorithms_data = selected_algorithms_data.sort_values(by=['F1'], ascending=False)

# Create the bar plot
plt.figure(figsize=(10, 6))
sb.barplot(x=selected_algorithms_data.index, y='F1', data=selected_algorithms_data, color='#F26B38')

# Set the plot title and axis labels
plt.title('F1-Score Comparison by Algorithm')
plt.xlabel('Model')
plt.ylabel('F1-Score')

plt.xticks(rotation=90)

# Display the plot
plt.show()

### Comparing Kappa of each algorithm

In [None]:
selected_algorithms_data = selected_algorithms_data.sort_values(by=['Kappa'], ascending=False)

# Create the bar plot
plt.figure(figsize=(10, 6))
sb.barplot(x=selected_algorithms_data.index, y='Kappa', data=selected_algorithms_data, color='#2F9599')

# Set the plot title and axis labels
plt.title('Kappa Comparison by Algorithm')
plt.xlabel('Model')
plt.ylabel('Kappa')

plt.xticks(rotation=90)

# Display the plot
plt.show()

### Comparing Matthews Correlation Coefficient of each algorithm

In [None]:
selected_algorithms_data = selected_algorithms_data.sort_values(by=['MCC'], ascending=False)

# Create the bar plot
plt.figure(figsize=(10, 6))
sb.barplot(x=selected_algorithms_data.index, y='MCC', data=selected_algorithms_data, color='#FF4E50')

# Set the plot title and axis labels
plt.title('Matthews Correlation Coefficient Comparison by Algorithm')
plt.xlabel('Model')
plt.ylabel('Matthews Correlation Coefficient')

plt.xticks(rotation=90)

# Display the plot
plt.show()

### Comparing Training Time (sec) of each algorithm

In [None]:
selected_algorithms_data = selected_algorithms_data.sort_values(by=['TT (Sec)'], ascending=False)

# Create the bar plot
plt.figure(figsize=(10, 6))
sb.barplot(x=selected_algorithms_data.index, y='TT (Sec)', data=selected_algorithms_data, color='#9DE0AD')

# Set the plot title and axis labels
plt.title('Training Time comparison by Algorithm')
plt.xlabel('Model')
plt.ylabel('Training Time (Sec)')

plt.xticks(rotation=90)

# Display the plot
plt.show()
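
The bar-plot cells above differ only in the metric name and the color, so they could be generated by one function and a loop. The sketch below shows the idea with plain matplotlib instead of seaborn's `barplot`, on made-up scores. `plot_metric` is a hypothetical helper, and the `Agg` backend is selected only so that the sketch also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up scores (assumption: not the project's results).
scores = pd.DataFrame(
    {"Accuracy": [0.70, 0.65, 0.72], "F1": [0.74, 0.60, 0.71]},
    index=pd.Index(["Model A", "Model B", "Model C"], name="Model"),
)


def plot_metric(data, metric, color):
    """Draw one bar chart of `metric`, sorted from best to worst."""
    ordered = data.sort_values(by=metric, ascending=False)
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(ordered.index, ordered[metric], color=color)
    ax.set_title(f"{metric} Comparison by Algorithm")
    ax.set_xlabel("Model")
    ax.set_ylabel(metric)
    ax.tick_params(axis="x", labelrotation=90)
    return fig, ordered


# One figure per metric, using the colors from the cells above.
figures = {}
for metric, color in {"Accuracy": "#A7226E", "F1": "#F26B38"}.items():
    figures[metric] = plot_metric(scores, metric, color)
```

In a notebook, calling `plt.show()` after the loop (and dropping the `Agg` line) displays the figures inline.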