# Diploma thesis
## Breast cancer classification using machine learning methods
### All the features of the data set

> Lazaros Panitsidis<br />
> Department of Production and Management Engineering <br />
> International Hellenic University <br />
> lazarospanitsidis@outlook.com

## Contents
1. [Useful Python Libraries](#1)
1. [Data Processing](#2)
1. [Gaussian Naive Bayes](#3)
1. [Linear Discriminant Analysis](#4)
1. [Quadratic Discriminant Analysis](#5)
1. [Ridge Classifier](#6)
1. [Decision Tree Classifier](#7)
1. [Random Forest Classifier](#8)
1. [ADA Boost Classifier (Adaptive Boosting)](#9)
1. [C-Support Vector Classification](#10)
1. [Stochastic Gradient Descent Classifier](#11)
1. [eXtreme Gradient Boosting](#12)
1. [Light Gradient Boosting Machine](#13)
1. [K-Nearest Neighbors Classifier](#14)
1. [Multi-layer Perceptron Classifier](#15)
1. [Summary](#16)

<a id='1'></a>
## 1) Useful Python Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization library  
import scipy.stats as stats
import matplotlib.pyplot as plt
import time
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # default='warn'
import warnings
# Suppress all warnings to keep the cell outputs readable
warnings.filterwarnings('ignore')

# some of them are not used in this file
from sklearn.feature_selection import SelectKBest, f_classif, chi2, RFE, RFECV , mutual_info_classif
from sklearn.model_selection import train_test_split, cross_val_score , GridSearchCV , LeaveOneOut,KFold,RandomizedSearchCV
from skopt import BayesSearchCV # https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV , https://scikit-optimize.github.io/stable/auto_examples/bayesian-optimization.html
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score , make_scorer , classification_report
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline , Pipeline # https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn.preprocessing import StandardScaler , LabelEncoder
from xgboost import XGBClassifier , plot_importance
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier , RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis , QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier , AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
import lightgbm as lgbm
from sklearn.neural_network import MLPClassifier
import pygad

<a id='2'></a>
## 2) Data Processing

In [2]:
dataWISC = pd.read_csv('dataWisc.csv')
dataWISC.drop(["id", "Unnamed: 32"], axis = 1, inplace = True)

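# Undersampling function: shrink every class to the size of the smallest one
def make_undersample(_df, column):
  dfs_r = {}                # down-sampled frames, keyed by class label
  dfs_c = {}                # original per-class frames, keyed by class label
  smaller = float('inf')    # size of the smallest class found so far
  ignore = ""               # label of the smallest class (kept untouched)
  for c in _df[column].unique():
    dfs_c[c] = _df[_df[column] == c]
    if dfs_c[c].shape[0] < smaller:
      smaller = dfs_c[c].shape[0]
      ignore = c

  for c in dfs_c:
    if c == ignore:
      continue
    dfs_r[c] = resample(dfs_c[c], 
                        replace=False, # sample without replacement
                        n_samples=smaller,
                        random_state=0)
  return pd.concat([dfs_r[c] for c in dfs_r] + [dfs_c[ignore]])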

dataWISC = make_undersample(dataWISC,'diagnosis')

# Description of the dataset

# Number of cases in the dataset
length = len(dataWISC)
# Number of features in the dataset (every column except diagnosis)
features = dataWISC.shape[1] - 1

# Number of malignant cases
malignant = len(dataWISC[dataWISC['diagnosis'] == 'M'])

# Number of benign cases
benign = len(dataWISC[dataWISC['diagnosis'] == 'B'])

# Percentage of malignant tumors over all cases
rate = (float(malignant) / length) * 100

print("There are {} cases in this dataset".format(length))
print("There are {} features in this dataset".format(features))
print("There are {} cases diagnosed as malignant tumor".format(malignant))
print("There are {} cases diagnosed as benign tumor".format(benign))
print("The percentage of malignant cases is: {:.2f}%".format(rate))

There are 424 cases in this dataset
There are 30 features in this dataset
There are 212 cases diagnosed as malignant tumor
There are 212 cases diagnosed as benign tumor
The percentage of malignant cases is: 50.00%
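
The balanced 212/212 split printed above is the result of random undersampling. A minimal standard-library sketch of the same idea follows; it is not the notebook's `make_undersample` (it returns sorted row indices instead of a concatenated DataFrame), and the 357 benign cases in the unsampled data set are an assumption that the notebook itself does not print.

```python
import random
from collections import defaultdict

def undersample(labels, seed=0):
    """Return sorted row indices in which every class has as many rows as the rarest class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    smallest = min(len(rows) for rows in by_class.values())
    rng = random.Random(seed)
    kept = []
    for rows in by_class.values():
        # Sample without replacement; the rarest class keeps all of its rows.
        kept.extend(rng.sample(rows, smallest))
    return sorted(kept)

# Assumed class sizes before undersampling: 357 benign, 212 malignant.
labels = ['B'] * 357 + ['M'] * 212
kept = undersample(labels)
counts = {c: sum(labels[i] == c for i in kept) for c in ('B', 'M')}
print(len(kept), counts)
```

Discarding benign cases costs training data, but it makes accuracy and macro-averaged F1 directly comparable across classes.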


In [3]:
y = dataWISC.diagnosis                          # M or B 
x = dataWISC.drop('diagnosis',axis = 1 )
target_names=['Benign','Malignant']
# x_scaled = (x - x.mean())/x.std()
le= LabelEncoder()
le.fit(y)
y_le = le.transform(y)
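
`LabelEncoder` assigns integer codes in the sorted order of the class labels, so `'B'` becomes 0 and `'M'` becomes 1, which is also the order used in `target_names`. The sketch below mimics that mapping with the standard library only; `fit_label_mapping` is a hypothetical helper, not part of scikit-learn.

```python
def fit_label_mapping(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

sample = ['M', 'B', 'B', 'M']
mapping = fit_label_mapping(sample)
encoded = [mapping[value] for value in sample]
print(mapping, encoded)
```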

In [4]:
# ALL features
x_new = x
x_new.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
49,13.49,22.3,86.91,561.0,0.08752,0.07698,0.04751,0.03384,0.1809,0.05718,0.2338,1.353,1.735,20.2,0.004455,0.01382,0.02095,0.01184,0.01641,0.001956,15.15,31.82,99.0,698.8,0.1162,0.1711,0.2282,0.1282,0.2871,0.06917
285,12.58,18.4,79.83,489.0,0.08393,0.04216,0.00186,0.002924,0.1697,0.05855,0.2719,1.35,1.721,22.45,0.006383,0.008008,0.00186,0.002924,0.02571,0.002015,13.5,23.08,85.56,564.1,0.1038,0.06624,0.005579,0.008772,0.2505,0.06431
495,14.87,20.21,96.12,680.9,0.09587,0.08345,0.06824,0.04951,0.1487,0.05748,0.2323,1.636,1.596,21.84,0.005415,0.01371,0.02153,0.01183,0.01959,0.001812,16.01,28.48,103.9,783.6,0.1216,0.1388,0.17,0.1017,0.2369,0.06599
391,8.734,16.84,55.27,234.3,0.1039,0.07428,0.0,0.0,0.1985,0.07098,0.5169,2.079,3.167,28.85,0.01582,0.01966,0.0,0.0,0.01865,0.006736,10.17,22.8,64.01,317.0,0.146,0.131,0.0,0.0,0.2445,0.08865
187,11.71,17.19,74.68,420.3,0.09774,0.06141,0.03809,0.03239,0.1516,0.06095,0.2451,0.7655,1.742,17.86,0.006905,0.008704,0.01978,0.01185,0.01897,0.001671,13.01,21.39,84.42,521.5,0.1323,0.104,0.1521,0.1099,0.2572,0.07097


In [5]:
# https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/#:~:text=Given%20the%20improved%20estimate%20of,biased%20estimates%20of%20model%20performance.
# cv = LeaveOneOut()

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
cv=KFold(n_splits=10, shuffle=True, random_state=13)

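# Labels and predictions pooled over all folds; reset both lists before every cross-validation run
originalclass = []
predictedclass = []

def classification_report_with_accuracy_score(y_true, y_pred):
  # Side effect: store this fold's labels and predictions so one report can cover all folds
  originalclass.extend(y_true)
  predictedclass.extend(y_pred)
  #print(classification_report(y_true, y_pred, target_names=target_names)) 
  return accuracy_score(y_true, y_pred)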

def print_best_params(grid_search):
    print("")
    print("Best hyperparameters : ", grid_search.best_params_)
    print("")
    print("Best estimator : ", grid_search.best_estimator_)
    print("")
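
Because every sample lands in exactly one test fold, pooling the per-fold predictions gives one report whose support adds up to all 424 cases. The following sketch shows how 424 shuffled samples split into 10 folds; `kfold_indices` is a hypothetical stand-in, and its shuffle does not reproduce the permutation that scikit-learn's `KFold` draws for `random_state=13`.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Shuffle the indices and cut them into n_splits test folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for k in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if k < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

folds = kfold_indices(424, 10, seed=13)
sizes = [len(fold) for fold in folds]
pooled = sorted(i for fold in folds for i in fold)
print(sizes)
```

Four folds hold 43 test samples and six hold 42, so each model is trained on 381 or 382 cases.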

<a id='3'></a>
## 3) [Gaussian Naive Bayes](<https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB>)

* Default hyperparameters

In [6]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_gnb = Pipeline([('scaler', StandardScaler()), ('gnb', GaussianNB())])
score = cross_val_score(clf_gnb, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.907     0.962     0.934       212
   Malignant      0.960     0.901     0.929       212

    accuracy                          0.932       424
   macro avg      0.933     0.932     0.932       424
weighted avg      0.933     0.932     0.932       424
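
The figures in this report can be reproduced from four confusion counts. The counts below are not printed by the notebook; they are inferred from the reported recalls (204 of 212 benign and 191 of 212 malignant cases classified correctly), so treat them as an assumption.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts: 8 benign cases predicted malignant, 21 malignant cases predicted benign.
benign = per_class_metrics(tp=204, fp=21, fn=8)
malignant = per_class_metrics(tp=191, fp=8, fn=21)
accuracy = (204 + 191) / 424
macro_f1 = (benign[2] + malignant[2]) / 2

print([round(v, 3) for v in benign])
print([round(v, 3) for v in malignant])
print(round(accuracy, 3), round(macro_f1, 3))
```

The macro average weights both classes equally, which on this balanced data set coincides with the weighted average.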



* Hyperparameter tuning using Grid Search

In [7]:
param_grid = { 'gnb__var_smoothing': np.logspace(0,-10, num=100) }

grid_search = GridSearchCV(clf_gnb, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits

Best hyperparameters :  {'gnb__var_smoothing': 0.019179102616724886}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('gnb', GaussianNB(var_smoothing=0.019179102616724886))])



| index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_gnb__var_smoothing | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 18 | 0.005548 | 0.005863 | 0.00435 | 0.00724 | 0.015199 | {'gnb__var_smoothing': 0.01519911082952934} | 0.929624 | 0.905702 | 0.904444 | 0.906926 | 0.976177 | 0.976068 | 0.974437 | 0.854167 | 0.902778 | 0.97551 | 0.930583 | 0.0407 | 1 |
| 17 | 0.004697 | 0.002246 | 0.002002 | 0.000594 | 0.019179 | {'gnb__var_smoothing': 0.019179102616724886} | 0.929624 | 0.905702 | 0.904444 | 0.906926 | 0.976177 | 0.976068 | 0.974437 | 0.854167 | 0.902778 | 0.97551 | 0.930583 | 0.0407 | 1 |
| 50 | 0.003644 | 0.000944 | 0.002408 | 0.000779 | 9e-06 | {'gnb__var_smoothing': 8.902150854450392e-06} | 0.952851 | 0.905702 | 0.904444 | 0.906926 | 0.976177 | 0.952273 | 0.974437 | 0.854167 | 0.902778 | 0.97551 | 0.930526 | 0.039321 | 3 |
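
The candidate values come from `np.logspace(0, -10, num=100)`, that is, 100 points spaced evenly in the exponent between 10^0 and 10^-10. The pure-Python sketch below rebuilds that grid (the `logspace` helper is a hypothetical re-implementation, not NumPy's) and recovers the values at indices 17, 18 and 50 of the results table. Candidates 17 and 18 tie at rank 1; as far as I can tell, `GridSearchCV` reports the first of the tied candidates, which would explain why index 17 is printed as the best hyperparameter.

```python
def logspace(start, stop, num):
    """Powers of ten with exponents spaced evenly from start to stop (inclusive)."""
    step = (stop - start) / (num - 1)
    return [10.0 ** (start + i * step) for i in range(num)]

grid = logspace(0, -10, 100)
print(grid[17], grid[18], grid[50])
```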


* Tuned hyperparameters

In [8]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_gnb = Pipeline(steps=[('scaler', StandardScaler()),
                ('gnb', GaussianNB(var_smoothing=0.019179102616724886))])

score = cross_val_score(clf_gnb, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.903     0.967     0.934       212
   Malignant      0.964     0.896     0.929       212

    accuracy                          0.932       424
   macro avg      0.934     0.932     0.932       424
weighted avg      0.934     0.932     0.932       424



<a id='4'></a>
## 4) [Linear Discriminant Analysis](<https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html>)

* Default hyperparameters

In [9]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_lda = Pipeline([('scaler', StandardScaler()), ('lda', LinearDiscriminantAnalysis())])

score = cross_val_score(clf_lda, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.938     0.995     0.966       212
   Malignant      0.995     0.934     0.964       212

    accuracy                          0.965       424
   macro avg      0.966     0.965     0.965       424
weighted avg      0.966     0.965     0.965       424



* Hyperparameter tuning using Grid Search

In [10]:
param_grid = {
    'lda__solver' : ['svd','lsqr','eigen'],
    'lda__shrinkage':[None,'auto'],
    'lda__tol': [0.0001,0.001,0.01,0.1]
}

grid_search = GridSearchCV(clf_lda, param_grid=param_grid, n_jobs=-1, cv=cv,verbose=4,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 24 candidates, totalling 240 fits

Best hyperparameters :  {'lda__shrinkage': 'auto', 'lda__solver': 'lsqr', 'lda__tol': 0.0001}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('lda',
                 LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr'))])



| index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_lda__shrinkage | param_lda__solver | param_lda__tol | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 23 | 0.008577 | 0.002897 | 0.001895 | 0.000299 | auto | eigen | 0.1 | {'lda__shrinkage': 'auto', 'lda__solver': 'eig... | 0.906522 | 0.976282 | 0.929624 | 0.930081 | 0.976177 | 0.976068 | 0.973668 | 0.975848 | 1.0 | 1.0 | 0.964427 | 0.029841 | 1 |
| 22 | 0.008577 | 0.001954 | 0.001895 | 0.000299 | auto | eigen | 0.01 | {'lda__shrinkage': 'auto', 'lda__solver': 'eig... | 0.906522 | 0.976282 | 0.929624 | 0.930081 | 0.976177 | 0.976068 | 0.973668 | 0.975848 | 1.0 | 1.0 | 0.964427 | 0.029841 | 1 |
| 21 | 0.008976 | 0.003185 | 0.002094 | 0.001041 | auto | eigen | 0.001 | {'lda__shrinkage': 'auto', 'lda__solver': 'eig... | 0.906522 | 0.976282 | 0.929624 | 0.930081 | 0.976177 | 0.976068 | 0.973668 | 0.975848 | 1.0 | 1.0 | 0.964427 | 0.029841 | 1 |
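
The "24 candidates, totalling 240 fits" message follows from the Cartesian product of the grid values multiplied by the 10 folds. A small sketch of that bookkeeping, using only `itertools` and the grid of the cell above:

```python
from itertools import product

param_grid = {
    'lda__solver': ['svd', 'lsqr', 'eigen'],
    'lda__shrinkage': [None, 'auto'],
    'lda__tol': [0.0001, 0.001, 0.01, 0.1],
}

names = sorted(param_grid)
# One dict per combination of values, as an exhaustive grid search would try them.
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]
n_fits = len(candidates) * 10  # 10 cross-validation folds

print(len(candidates), n_fits)
```

The same product rule gives the candidate counts reported for the other grids in this notebook, which is a quick way to estimate the cost of a search before running it.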


* Tuned hyperparameters

In [11]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_lda = Pipeline([('scaler', StandardScaler()), ('lda', LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr'))])

score = cross_val_score(clf_lda, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.942     0.991     0.966       212
   Malignant      0.990     0.939     0.964       212

    accuracy                          0.965       424
   macro avg      0.966     0.965     0.965       424
weighted avg      0.966     0.965     0.965       424



<a id='5'></a>
## 5) [Quadratic Discriminant Analysis](<https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html>)

* Default hyperparameters

In [12]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_qda = Pipeline([('scaler', StandardScaler()), ('qda', QuadraticDiscriminantAnalysis())])

score = cross_val_score(clf_qda, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.952     0.939     0.945       212
   Malignant      0.940     0.953     0.946       212

    accuracy                          0.946       424
   macro avg      0.946     0.946     0.946       424
weighted avg      0.946     0.946     0.946       424



* Hyperparameter tuning using Grid Search

In [13]:
param_grid = {
    'qda__reg_param': np.linspace(0, 1, num=10),
    'qda__tol': [0.0001,0.001,0.01]
}

grid_search = GridSearchCV(clf_qda, param_grid=param_grid, n_jobs=-1, cv=cv,verbose=4,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 30 candidates, totalling 300 fits

Best hyperparameters :  {'qda__reg_param': 0.4444444444444444, 'qda__tol': 0.0001}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('qda',
                 QuadraticDiscriminantAnalysis(reg_param=0.4444444444444444))])



| index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_qda__reg_param | param_qda__tol | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14 | 0.004987 | 0.001093 | 0.002294 | 0.000639 | 0.444444 | 0.01 | {'qda__reg_param': 0.4444444444444444, 'qda__t... | 0.952851 | 0.976282 | 0.928847 | 0.930081 | 1.0 | 0.976068 | 1.0 | 0.951389 | 0.97551 | 1.0 | 0.969103 | 0.025935 | 1 |
| 13 | 0.008179 | 0.006521 | 0.004587 | 0.006152 | 0.444444 | 0.001 | {'qda__reg_param': 0.4444444444444444, 'qda__t... | 0.952851 | 0.976282 | 0.928847 | 0.930081 | 1.0 | 0.976068 | 1.0 | 0.951389 | 0.97551 | 1.0 | 0.969103 | 0.025935 | 1 |
| 12 | 0.005984 | 0.001479 | 0.002693 | 0.00078 | 0.444444 | 0.0001 | {'qda__reg_param': 0.4444444444444444, 'qda__t... | 0.952851 | 0.976282 | 0.928847 | 0.930081 | 1.0 | 0.976068 | 1.0 | 0.951389 | 0.97551 | 1.0 | 0.969103 | 0.025935 | 1 |
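
The selected `reg_param` of 0.444 is the fifth point of `np.linspace(0, 1, num=10)`, i.e. 4/9. As I read the scikit-learn documentation, QDA regularizes each class covariance by transforming its scaled variances as `S2 = (1 - reg_param) * S2 + reg_param`, which pulls them towards 1. The sketch below applies that formula to hypothetical variances; the numbers are illustrative only and do not come from the data set.

```python
def regularize(variances, reg_param):
    """Shrink per-class variances towards 1, following S2 = (1 - r) * S2 + r."""
    return [(1 - reg_param) * v + reg_param for v in variances]

# Same points as np.linspace(0, 1, num=10).
reg_values = [i / 9 for i in range(10)]
best = reg_values[4]

variances = [4.0, 1.0, 0.01]  # hypothetical per-class variances
shrunk = regularize(variances, best)
print(best, shrunk)
```

The very small variance is lifted from 0.01 to about 0.45, which is the kind of stabilization that can help when a class covariance is estimated from few samples on correlated features.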


* Tuned hyperparameters

In [14]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_qda = Pipeline([('scaler', StandardScaler()), ('qda', QuadraticDiscriminantAnalysis(reg_param=0.4444444444444444))])

score = cross_val_score(clf_qda, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.946     0.995     0.970       212
   Malignant      0.995     0.943     0.969       212

    accuracy                          0.969       424
   macro avg      0.971     0.969     0.969       424
weighted avg      0.971     0.969     0.969       424



<a id='6'></a>
## 6) [Ridge Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier>)

* Default hyperparameters

In [15]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_rc = Pipeline([('scaler', StandardScaler()), ('rg', RidgeClassifier())])

score = cross_val_score(clf_rc, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.938     0.991     0.963       212
   Malignant      0.990     0.934     0.961       212

    accuracy                          0.962       424
   macro avg      0.964     0.962     0.962       424
weighted avg      0.964     0.962     0.962       424



* Hyperparameter tuning using Grid Search

In [16]:
param_grid = {
    'rg__alpha' : np.linspace(0, 1, num=10),
    'rg__fit_intercept' : [True,False],
    'rg__copy_X' : [True,False],
    'rg__max_iter' : [None],
    'rg__tol' : [0.001],
    'rg__class_weight' : [None,'balanced'],
    'rg__solver' : ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs'],
    'rg__positive' : [False]
}

grid_search = GridSearchCV(clf_rc, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 560 candidates, totalling 5600 fits

Best hyperparameters :  {'rg__alpha': 0.0, 'rg__class_weight': None, 'rg__copy_X': True, 'rg__fit_intercept': False, 'rg__max_iter': None, 'rg__positive': False, 'rg__solver': 'svd', 'rg__tol': 0.001}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('rg',
                 RidgeClassifier(alpha=0.0, fit_intercept=False,
                                 solver='svd'))])



| index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_rg__alpha | param_rg__class_weight | param_rg__copy_X | param_rg__fit_intercept | param_rg__max_iter | param_rg__positive | param_rg__solver | param_rg__tol | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 63 | 0.017801 | 0.010745 | 0.003602 | 0.001125 | 0.111111 |  | True | False |  | False | svd | 0.001 | {'rg__alpha': 0.1111111111111111, 'rg__class_w... | 0.952851 | 0.976282 | 0.952851 | 0.953261 | 0.952273 | 0.976068 | 0.973668 | 0.951389 | 1.0 | 0.97551 | 0.966415 | 0.015558 | 1 |
| 23 | 0.006084 | 0.001442 | 0.00359 | 0.005128 | 0.0 |  | False | False |  | False | lsqr | 0.001 | {'rg__alpha': 0.0, 'rg__class_weight': None, '... | 0.952851 | 0.976282 | 0.952851 | 0.953261 | 0.952273 | 0.976068 | 0.973668 | 0.951389 | 1.0 | 0.97551 | 0.966415 | 0.015558 | 1 |
| 106 | 0.006896 | 0.00589 | 0.002198 | 0.000756 | 0.111111 | balanced | False | False |  | False | cholesky | 0.001 | {'rg__alpha': 0.1111111111111111, 'rg__class_w... | 0.952851 | 0.976282 | 0.952851 | 0.953261 | 0.952273 | 0.976068 | 0.973668 | 0.951389 | 1.0 | 0.97551 | 0.966415 | 0.015558 | 1 |
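
Not every one of the 560 candidates is necessarily a valid configuration. According to the scikit-learn documentation, the `'lbfgs'` solver can be used only with `positive=True`, while this grid fixes `positive=False`; those candidates presumably fail to fit and receive a NaN score, which goes unnoticed because warnings are suppressed. The sketch below only counts the affected combinations; it is an assumption-based illustration and does not call scikit-learn.

```python
from itertools import product

alphas = [i / 9 for i in range(10)]  # same points as np.linspace(0, 1, num=10)
solvers = ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs']

# (alpha, fit_intercept, copy_X, class_weight, solver); the single-valued entries are omitted.
grid = list(product(alphas, [True, False], [True, False], [None, 'balanced'], solvers))

# Assumption: with positive=False, every 'lbfgs' candidate is rejected at fit time.
lbfgs_candidates = [c for c in grid if c[-1] == 'lbfgs']
usable = [c for c in grid if c[-1] != 'lbfgs']

print(len(grid), len(lbfgs_candidates), len(usable))
```

Dropping `'lbfgs'` from the solver list (or searching it in a separate grid with `positive=True`) would avoid spending one seventh of the fits on candidates that cannot be scored.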


* Tuned hyperparameters

In [17]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_rc = Pipeline([('scaler', StandardScaler()), 
                    ('rg', RidgeClassifier(alpha=0.0,fit_intercept=False,solver='svd'))])

score = cross_val_score(clf_rc, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.942     0.995     0.968       212
   Malignant      0.995     0.939     0.966       212

    accuracy                          0.967       424
   macro avg      0.968     0.967     0.967       424
weighted avg      0.968     0.967     0.967       424



<a id='7'></a>
## 7) [Decision Tree Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>)

* Default hyperparameters

In [18]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_tree = Pipeline([('scaler', StandardScaler()), ('tree', DecisionTreeClassifier(random_state=13))])

score = cross_val_score(clf_tree, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.920     0.925     0.922       212
   Malignant      0.924     0.920     0.922       212

    accuracy                          0.922       424
   macro avg      0.922     0.922     0.922       424
weighted avg      0.922     0.922     0.922       424



* Hyperparameter tuning using Grid Search

In [19]:
param_grid = {
    'tree__criterion' :['gini','entropy'],
    'tree__splitter' : ['best','random'],
    'tree__max_depth': [2,6,10,None],
    'tree__min_samples_split': list(range(2, 4)),
    'tree__min_samples_leaf': [3,5],
    'tree__min_weight_fraction_leaf' : [0.0],
    'tree__max_features': [None, 'sqrt', 'log2'],
    'tree__max_leaf_nodes' : [None,10,50],
    'tree__min_impurity_decrease' : [0.0],
    'tree__class_weight' : [None,'balanced'],
    'tree__ccp_alpha' : [0.0],
    'tree__random_state' : [13]
}

grid_search = GridSearchCV(clf_tree, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 1152 candidates, totalling 11520 fits

Best hyperparameters :  {'tree__ccp_alpha': 0.0, 'tree__class_weight': 'balanced', 'tree__criterion': 'gini', 'tree__max_depth': 10, 'tree__max_features': 'sqrt', 'tree__max_leaf_nodes': None, 'tree__min_impurity_decrease': 0.0, 'tree__min_samples_leaf': 3, 'tree__min_samples_split': 2, 'tree__min_weight_fraction_leaf': 0.0, 'tree__random_state': 13, 'tree__splitter': 'best'}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('tree',
                 DecisionTreeClassifier(class_weight='balanced', max_depth=10,
                                        max_features='sqrt', min_samples_leaf=3,
                                        random_state=13))])



| index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_tree__ccp_alpha | param_tree__class_weight | param_tree__criterion | param_tree__max_depth | param_tree__max_features | param_tree__max_leaf_nodes | param_tree__min_impurity_decrease | param_tree__min_samples_leaf | param_tree__min_samples_split | param_tree__min_weight_fraction_leaf | param_tree__random_state | param_tree__splitter | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 818 | 0.004189 | 0.000598 | 0.002693 | 0.001483 | 0.0 | balanced | gini |  | sqrt |  | 0.0 | 3 | 3 | 0.0 | 13 | best | {'tree__ccp_alpha': 0.0, 'tree__class_weight':... | 0.97591 | 0.906926 | 0.928847 | 0.953261 | 0.928205 | 1.0 | 0.973668 | 0.975848 | 0.97551 | 0.975045 | 0.959322 | 0.027563 | 1 |
| 816 | 0.004487 | 0.00067 | 0.001796 | 0.000599 | 0.0 | balanced | gini |  | sqrt |  | 0.0 | 3 | 2 | 0.0 | 13 | best | {'tree__ccp_alpha': 0.0, 'tree__class_weight':... | 0.97591 | 0.906926 | 0.928847 | 0.953261 | 0.928205 | 1.0 | 0.973668 | 0.975848 | 0.97551 | 0.975045 | 0.959322 | 0.027563 | 1 |
| 746 | 0.004488 | 0.001281 | 0.002892 | 0.002766 | 0.0 | balanced | gini | 10.0 | sqrt |  | 0.0 | 3 | 3 | 0.0 | 13 | best | {'tree__ccp_alpha': 0.0, 'tree__class_weight':... | 0.97591 | 0.906926 | 0.928847 | 0.953261 | 0.928205 | 1.0 | 0.973668 | 0.975848 | 0.97551 | 0.975045 | 0.959322 | 0.027563 | 1 |
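
The grid compares the `gini` and `entropy` split criteria. As a reminder of what the two measure, here is a short sketch of both impurity functions evaluated on class counts; the helper names are illustrative and are not scikit-learn functions.

```python
from math import log2

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

root = [212, 212]  # the balanced data set: 212 benign, 212 malignant
print(gini(root), entropy(root))        # maximal impurity for two classes
print(gini([10, 0]), entropy([10, 0]))  # a pure node
```

Both functions peak on a balanced node and drop to zero on a pure one, so they usually rank candidate splits similarly.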


* Tuned hyperparameters

In [20]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_tree = Pipeline(steps=[('scaler', StandardScaler()),
                ('tree',DecisionTreeClassifier(class_weight='balanced', max_depth=10,
                                        max_features='sqrt', min_samples_leaf=3,
                                        random_state=13))])

score = cross_val_score(clf_tree, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.953     0.967     0.960       212
   Malignant      0.967     0.953     0.960       212

    accuracy                          0.960       424
   macro avg      0.960     0.960     0.960       424
weighted avg      0.960     0.960     0.960       424



<a id='8'></a>
## 8) [Random Forest Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>)

* Default hyperparameters

In [6]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_rf = Pipeline([('scaler', StandardScaler()), ('rf', RandomForestClassifier(random_state=13))])
                       
score = cross_val_score(clf_rf, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.962     0.967     0.965       212
   Malignant      0.967     0.962     0.965       212

    accuracy                          0.965       424
   macro avg      0.965     0.965     0.965       424
weighted avg      0.965     0.965     0.965       424



* Hyperparameter tuning using Grid Search

In [7]:
param_grid = {
    'rf__bootstrap': [True,False],
    'rf__max_depth': [5, 10 , None],
    'rf__n_estimators' : [10,50,100,200,500],
    'rf__max_features': [None, 'sqrt', 'log2'],
    'rf__max_leaf_nodes' : [None,5,10],
    'rf__min_samples_leaf': [1,3,5],
    'rf__min_samples_split': list(range(2, 6)),
    'rf__criterion' :['entropy','gini'],
    'rf__random_state' : [13]
}

grid_search = GridSearchCV(clf_rf, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 6480 candidates, totalling 64800 fits

Best hyperparameters :  {'rf__bootstrap': False, 'rf__criterion': 'gini', 'rf__max_depth': 10, 'rf__max_features': 'sqrt', 'rf__max_leaf_nodes': None, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'rf__n_estimators': 500, 'rf__random_state': 13}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('rf',
                 RandomForestClassifier(bootstrap=False, max_depth=10,
                                        max_features='sqrt', n_estimators=500,
                                        random_state=13))])



| index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_rf__bootstrap | param_rf__criterion | param_rf__max_depth | param_rf__max_features | param_rf__max_leaf_nodes | param_rf__min_samples_leaf | param_rf__min_samples_split | param_rf__n_estimators | param_rf__random_state | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6134 | 0.769523 | 0.067876 | 0.042838 | 0.001962 | False | gini |  | sqrt |  | 1 | 4 | 500 | 13 | {'rf__bootstrap': False, 'rf__criterion': 'gin... | 0.97591 | 0.930081 | 0.928847 | 0.952851 | 1.0 | 0.976177 | 1.0 | 0.975848 | 0.950588 | 1.0 | 0.96903 | 0.025992 | 1 |
| 5594 | 0.70614 | 0.021465 | 0.044082 | 0.002631 | False | gini | 10.0 | sqrt |  | 1 | 4 | 500 | 13 | {'rf__bootstrap': False, 'rf__criterion': 'gin... | 0.97591 | 0.930081 | 0.928847 | 0.952851 | 1.0 | 0.976177 | 1.0 | 0.975848 | 0.950588 | 1.0 | 0.96903 | 0.025992 | 1 |
| 6124 | 1.033853 | 0.219839 | 0.060741 | 0.022799 | False | gini |  | sqrt |  | 1 | 2 | 500 | 13 | {'rf__bootstrap': False, 'rf__criterion': 'gin... | 0.97591 | 0.930081 | 0.928847 | 0.952851 | 1.0 | 0.976177 | 1.0 | 0.975848 | 0.950588 | 1.0 | 0.96903 | 0.025992 | 1 |
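
All top-ranked candidates use `bootstrap=False`, meaning every tree is trained on the full training fold and the diversity between trees comes only from the random feature subsets (`max_features='sqrt'`). For contrast, the sketch below simulates what `bootstrap=True` would do: drawing rows with replacement leaves roughly a third of them out of each tree. The fold size of 382 is an approximation of the training-fold size, and the helper is hypothetical.

```python
import random

def bootstrap_unique_fraction(n_rows, seed):
    """Fraction of distinct rows in one bootstrap sample of size n_rows."""
    rng = random.Random(seed)
    drawn = rng.choices(range(n_rows), k=n_rows)  # sampling with replacement
    return len(set(drawn)) / n_rows

n_rows = 382  # approximate size of one training fold (424 minus one test fold)
fractions = [bootstrap_unique_fraction(n_rows, seed) for seed in range(20)]
mean_fraction = sum(fractions) / len(fractions)
print(round(mean_fraction, 3))
```

The mean settles near 1 - 1/e, about 0.63, so on a data set this small a bootstrapped tree sees noticeably fewer distinct cases, which is one plausible reason the search preferred `bootstrap=False`.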


* Tuned hyperparameters

In [8]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_rf = Pipeline(steps=[('scaler', StandardScaler()),
                ('rf',RandomForestClassifier(bootstrap=False, max_depth=10,
                                        max_features='sqrt', n_estimators=500,
                                        random_state=13))]) 
                       
score = cross_val_score(clf_rf, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.976     0.962     0.969       212
   Malignant      0.963     0.976     0.970       212

    accuracy                          0.969       424
   macro avg      0.969     0.969     0.969       424
weighted avg      0.969     0.969     0.969       424



<a id='9'></a>
## 9) [ADA Boost Classifier (Adaptive Boosting)](<https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#:~:text=An%20AdaBoost%20%5B1%5D%20classifier%20is,focus%20more%20on%20difficult%20cases.>)

* Default hyperparameters

In [21]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_adaboost = Pipeline([('scaler', StandardScaler()), ('adab', AdaBoostClassifier(random_state=13))])

score = cross_val_score(clf_adaboost, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.963     0.972     0.967       212
   Malignant      0.971     0.962     0.967       212

    accuracy                          0.967       424
   macro avg      0.967     0.967     0.967       424
weighted avg      0.967     0.967     0.967       424



* Hyperparameter tuning using Grid Search

In [22]:
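param_grid = {
    # The base learner is the tuned decision tree from section 7.
    # Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2; newer versions need 'adab__estimator'.
    'adab__base_estimator' : [DecisionTreeClassifier(class_weight='balanced', max_depth=10,max_features='sqrt', min_samples_leaf=3,random_state=13)],
    'adab__n_estimators' : [10,50,100,500],
    'adab__learning_rate' : np.power(10, np.arange(-3, 1, dtype=float)),
    'adab__algorithm' : ['SAMME', 'SAMME.R'],
    'adab__random_state' : [13],
}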

grid_search = GridSearchCV(clf_adaboost, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 32 candidates, totalling 320 fits

Best hyperparameters :  {'adab__algorithm': 'SAMME', 'adab__base_estimator': DecisionTreeClassifier(class_weight='balanced', max_depth=10,
                       max_features='sqrt', min_samples_leaf=3,
                       random_state=13), 'adab__learning_rate': 1.0, 'adab__n_estimators': 500, 'adab__random_state': 13}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('adab',
                 AdaBoostClassifier(algorithm='SAMME',
                                    base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                          max_depth=10,
                                                                          max_features='sqrt',
                                                                          min_samples_leaf=3,
                                                                          random_state=13)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_adab__algorithm,param_adab__base_estimator,param_adab__learning_rate,param_adab__n_estimators,param_adab__random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
15,1.248246,0.116359,0.047182,0.01128,SAMME,DecisionTreeClassifier(class_weight='balanced'...,1.0,500,13,"{'adab__algorithm': 'SAMME', 'adab__base_estim...",1.0,0.976541,0.928847,0.929624,0.976177,0.952273,1.0,0.975848,0.97551,1.0,0.971482,0.02542,1
31,1.243333,0.049054,0.06518,0.005561,SAMME.R,DecisionTreeClassifier(class_weight='balanced'...,1.0,500,13,"{'adab__algorithm': 'SAMME.R', 'adab__base_est...",1.0,0.976541,0.928847,0.929624,0.976177,0.952273,0.974437,0.975848,1.0,1.0,0.971375,0.025405,2
30,0.241514,0.008867,0.014312,0.001263,SAMME.R,DecisionTreeClassifier(class_weight='balanced'...,1.0,100,13,"{'adab__algorithm': 'SAMME.R', 'adab__base_est...",1.0,0.976541,0.952851,0.929624,0.976177,0.952273,0.974437,0.975848,0.975045,1.0,0.97128,0.020451,3


* Tuned hyperparameters

In [23]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_adaboost = Pipeline(steps=[('scaler', StandardScaler()),
                ('adab',AdaBoostClassifier(algorithm='SAMME',
                                    base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                          max_depth=10,
                                                                          max_features='sqrt',
                                                                          min_samples_leaf=3,
                                                                          random_state=13),
                                    n_estimators=500, random_state=13))])

score = cross_val_score(clf_adaboost, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.963     0.981     0.972       212
   Malignant      0.981     0.962     0.971       212

    accuracy                          0.972       424
   macro avg      0.972     0.972     0.972       424
weighted avg      0.972     0.972     0.972       424



<a id='10'></a>
## 10) [C-Support Vector Classification](<https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html>)

* Default hyperparameters

In [24]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_svc = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

score = cross_val_score(clf_svc, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.972     0.986     0.979       212
   Malignant      0.986     0.972     0.979       212

    accuracy                          0.979       424
   macro avg      0.979     0.979     0.979       424
weighted avg      0.979     0.979     0.979       424



* Hyperparameter tuning using Grid Search

In [25]:
param_grid = [
    {
        'svc__kernel': ['rbf'], 
        'svc__gamma': [1e-2, 1e-3, 1e-4,'auto','scale'], 
        'svc__C': [1, 10, 100, 1000],
        'svc__decision_function_shape': ['ovo', 'ovr'], # ignored for binary classification, so both values score identically
        'svc__random_state' : [13]
    },
    {
        'svc__kernel': ['linear'], 
        'svc__C': [1, 10, 100, 1000],
        'svc__decision_function_shape': ['ovo', 'ovr'], # ignored for binary classification, so both values score identically
        'svc__random_state' : [13]
    },
]

grid_search = GridSearchCV(clf_svc, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 48 candidates, totalling 480 fits

Best hyperparameters :  {'svc__C': 10, 'svc__decision_function_shape': 'ovo', 'svc__gamma': 'auto', 'svc__kernel': 'rbf', 'svc__random_state': 13}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('svc',
                 SVC(C=10, decision_function_shape='ovo', gamma='auto',
                     random_state=13))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_svc__C,param_svc__decision_function_shape,param_svc__gamma,param_svc__kernel,param_svc__random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
14,0.009902,0.003493,0.003,0.001093,10,ovo,scale,rbf,13,"{'svc__C': 10, 'svc__decision_function_shape':...",1.0,1.0,0.976282,0.953261,1.0,0.952273,0.949519,0.975848,1.0,1.0,0.980718,0.021068,1
18,0.006996,0.001435,0.002947,0.000785,10,ovr,auto,rbf,13,"{'svc__C': 10, 'svc__decision_function_shape':...",1.0,1.0,0.976282,0.953261,1.0,0.952273,0.949519,0.975848,1.0,1.0,0.980718,0.021068,1
13,0.0172,0.019907,0.0049,0.005819,10,ovo,auto,rbf,13,"{'svc__C': 10, 'svc__decision_function_shape':...",1.0,1.0,0.976282,0.953261,1.0,0.952273,0.949519,0.975848,1.0,1.0,0.980718,0.021068,1
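
The log above reports 48 candidates and 480 fits. Before launching a search it can help to check such numbers by counting the combinations in each sub-grid. The following is a minimal sketch in plain Python; `count_candidates` is a hypothetical helper (not part of scikit-learn), and the 10 folds are an assumption taken from the log.

```python
from math import prod

# Same structure as the SVC grid above: a list of sub-grids,
# each mapping a parameter name to its list of candidate values.
param_grid = [
    {
        'svc__kernel': ['rbf'],
        'svc__gamma': [1e-2, 1e-3, 1e-4, 'auto', 'scale'],
        'svc__C': [1, 10, 100, 1000],
        'svc__decision_function_shape': ['ovo', 'ovr'],
        'svc__random_state': [13],
    },
    {
        'svc__kernel': ['linear'],
        'svc__C': [1, 10, 100, 1000],
        'svc__decision_function_shape': ['ovo', 'ovr'],
        'svc__random_state': [13],
    },
]


def count_candidates(grid):
    """Count parameter combinations in a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    # Each sub-grid contributes the product of its value-list lengths.
    return sum(prod(len(values) for values in g.values()) for g in grids)


n_candidates = count_candidates(param_grid)
n_folds = 10  # assumption: cv produces 10 splits, as the log reports
print(n_candidates, n_candidates * n_folds)
```

scikit-learn's `ParameterGrid` should give the same count via `len(ParameterGrid(param_grid))`. The three top-ranked rows differ only in `decision_function_shape` and in `gamma` being `'auto'` or `'scale'`, which suggests that part of the grid is redundant for this binary problem.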


* Tuned hyperparameters

In [26]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_svc = Pipeline(steps=[('scaler', StandardScaler()),
                ('svc',SVC(C=10, decision_function_shape='ovo', gamma='auto',
                     random_state=13))])

score = cross_val_score(clf_svc, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.977     0.986     0.981       212
   Malignant      0.986     0.976     0.981       212

    accuracy                          0.981       424
   macro avg      0.981     0.981     0.981       424
weighted avg      0.981     0.981     0.981       424



<a id='11'></a>
## 11) [Stochastic Gradient Descent Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html>)

* Default hyperparameters

In [27]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_sgd = Pipeline([('scaler', StandardScaler()), ('sgd', SGDClassifier(random_state=13))])

score = cross_val_score(clf_sgd, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.949     0.962     0.956       212
   Malignant      0.962     0.948     0.955       212

    accuracy                          0.955       424
   macro avg      0.955     0.955     0.955       424
weighted avg      0.955     0.955     0.955       424



* Hyperparameter tuning using Grid Search

In [28]:
param_grid = {
    'sgd__average': [True, False],
    'sgd__l1_ratio': np.linspace(0, 1, num=10), # only used when penalty='elasticnet'; with the default penalty='l2' it has no effect
    'sgd__alpha': np.power(10, np.arange(-2, 1, dtype=float)),
    'sgd__random_state' : [13]
}

grid_search = GridSearchCV(clf_sgd, param_grid=param_grid, n_jobs=-1, cv=cv,verbose=4,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 60 candidates, totalling 600 fits

Best hyperparameters :  {'sgd__alpha': 0.1, 'sgd__average': False, 'sgd__l1_ratio': 0.0, 'sgd__random_state': 13}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('sgd',
                 SGDClassifier(alpha=0.1, l1_ratio=0.0, random_state=13))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_sgd__alpha,param_sgd__average,param_sgd__l1_ratio,param_sgd__random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
30,0.007332,0.004497,0.003591,0.001558,0.1,False,0.0,13,"{'sgd__alpha': 0.1, 'sgd__average': False, 'sg...",0.952851,0.976282,0.929624,0.976541,0.976177,0.976068,0.974437,0.975848,1.0,1.0,0.973783,0.019474,1
39,0.004288,0.001612,0.002394,0.000797,0.1,False,1.0,13,"{'sgd__alpha': 0.1, 'sgd__average': False, 'sg...",0.952851,0.976282,0.929624,0.976541,0.976177,0.976068,0.974437,0.975848,1.0,1.0,0.973783,0.019474,1
38,0.004538,0.001349,0.001995,0.000631,0.1,False,0.888889,13,"{'sgd__alpha': 0.1, 'sgd__average': False, 'sg...",0.952851,0.976282,0.929624,0.976541,0.976177,0.976068,0.974437,0.975848,1.0,1.0,0.973783,0.019474,1
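
The three top rows have identical fold scores although `l1_ratio` varies from 0.0 to 1.0. This is expected, because `SGDClassifier` uses `l1_ratio` only when `penalty='elasticnet'`. The sketch below illustrates this on synthetic data. The data set from `make_classification` and the helper `fit_coef` are assumptions made for the illustration and are not part of the thesis pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (assumption): 200 samples, 10 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=13)


def fit_coef(penalty, l1_ratio):
    """Fit an SGDClassifier and return its learned coefficients."""
    clf = SGDClassifier(penalty=penalty, alpha=0.1, l1_ratio=l1_ratio, random_state=13)
    return clf.fit(X_demo, y_demo).coef_


# With the default L2 penalty, changing l1_ratio should not change the model.
same_under_l2 = np.allclose(fit_coef('l2', 0.0), fit_coef('l2', 1.0))

# With the elastic-net penalty, l1_ratio switches between pure L2 (0.0) and pure L1 (1.0).
same_under_enet = np.allclose(fit_coef('elasticnet', 0.0), fit_coef('elasticnet', 1.0))

print(same_under_l2, same_under_enet)
```

To make `l1_ratio` a meaningful search dimension, `'sgd__penalty': ['elasticnet']` would have to be added to the grid.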


* Tuned hyperparameters

In [29]:
originalclass = []
predictedclass = []

# Cross validate
clf_sgd = Pipeline(steps=[('scaler', StandardScaler()),
                ('sgd',SGDClassifier(alpha=0.1, l1_ratio=0.0, random_state=13))])

score = cross_val_score(clf_sgd, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.963     0.986     0.974       212
   Malignant      0.986     0.962     0.974       212

    accuracy                          0.974       424
   macro avg      0.974     0.974     0.974       424
weighted avg      0.974     0.974     0.974       424



<a id='12'></a>
## 12) [eXtreme Gradient Boosting](<https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters>)

* Default hyperparameters

In [9]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_xgb = Pipeline([('scaler', StandardScaler()), ('xgb', XGBClassifier(random_state=13))])

score = cross_val_score(clf_xgb, x_new, y_le, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.962     0.967     0.965       212
   Malignant      0.967     0.962     0.965       212

    accuracy                          0.965       424
   macro avg      0.965     0.965     0.965       424
weighted avg      0.965     0.965     0.965       424



* Hyperparameter tuning using Grid Search

In [10]:
# https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning/notebook
# https://www.cs.cornell.edu/courses/cs4780/2018sp/lectures/lecturenote19.html
# https://medium.com/data-design/xgboost-hi-im-gamma-what-can-i-do-for-you-and-the-tuning-of-regularization-a42ea17e6ab6

param_grid = {
        'xgb__booster' : ['gbtree'],
        'xgb__validate_parameters' : [True],
        'xgb__learning_rate' : [0.05,0.1,0.3,0.5,1],
        'xgb__gamma' : [0,0.01,0.1,0.5,1],
        'xgb__max_depth' : [2,6,10],
        'xgb__min_child_weight' : [1,3,5],
        'xgb__max_delta_step' : [0,2,4],
        'xgb__subsample' : [0.5],
        'xgb__colsample_bylevel' : [1],
        'xgb__colsample_bynode' : [1],
        'xgb__colsample_bytree' : [1],
        'xgb__reg_lambda' : [0,1],
        'xgb__reg_alpha' : [0],
        'xgb__tree_method' : ['exact'],
        'xgb__scale_pos_weight' : [1],
        'xgb__objective' : ['binary:logistic'], # 'multi:softmax' -> same scores as 'binary:logistic'
        #'num_class' : [2],
        'xgb__n_estimators' : [50,100,200,500],
        'xgb__random_state' : [13]
    }

grid_search = GridSearchCV(clf_xgb, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y_le)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 5400 candidates, totalling 54000 fits

Best hyperparameters :  {'xgb__booster': 'gbtree', 'xgb__colsample_bylevel': 1, 'xgb__colsample_bynode': 1, 'xgb__colsample_bytree': 1, 'xgb__gamma': 0.1, 'xgb__learning_rate': 0.1, 'xgb__max_delta_step': 2, 'xgb__max_depth': 6, 'xgb__min_child_weight': 1, 'xgb__n_estimators': 500, 'xgb__objective': 'binary:logistic', 'xgb__random_state': 13, 'xgb__reg_alpha': 0, 'xgb__reg_lambda': 0, 'xgb__scale_pos_weight': 1, 'xgb__subsample': 0.5, 'xgb__tree_method': 'exact', 'xgb__validate_parameters': True}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('xgb',
                 XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, early_stopping_rounds=None,
                               enable_categorical=False, eval_metric=None,
                               gam

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgb__booster,param_xgb__colsample_bylevel,param_xgb__colsample_bynode,param_xgb__colsample_bytree,param_xgb__gamma,param_xgb__learning_rate,param_xgb__max_delta_step,param_xgb__max_depth,param_xgb__min_child_weight,param_xgb__n_estimators,param_xgb__objective,param_xgb__random_state,param_xgb__reg_alpha,param_xgb__reg_lambda,param_xgb__scale_pos_weight,param_xgb__subsample,param_xgb__tree_method,param_xgb__validate_parameters,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
2502,0.378152,0.018146,0.003391,0.000662,gbtree,1,1,1,0.1,0.1,2,10,1,500,binary:logistic,13,0,0,1,0.5,exact,True,"{'xgb__booster': 'gbtree', 'xgb__colsample_byl...",0.97591,0.953261,0.952851,0.929624,1.0,0.976068,1.0,0.975848,1.0,0.97551,0.973907,0.022109,1
2478,0.383389,0.027365,0.004189,0.001657,gbtree,1,1,1,0.1,0.1,2,6,1,500,binary:logistic,13,0,0,1,0.5,exact,True,"{'xgb__booster': 'gbtree', 'xgb__colsample_byl...",0.97591,0.953261,0.952851,0.929624,1.0,0.976068,1.0,0.975848,1.0,0.97551,0.973907,0.022109,1
151,0.40124,0.014719,0.003641,0.001181,gbtree,1,1,1,0.0,0.05,4,2,1,500,binary:logistic,13,0,1,1,0.5,exact,True,"{'xgb__booster': 'gbtree', 'xgb__colsample_byl...",0.97591,0.976541,0.952851,0.952851,0.976177,0.952273,1.0,0.975848,1.0,0.97551,0.973796,0.016549,3


* Tuned hyperparameters

In [11]:
originalclass = []
predictedclass = []

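# Cross validate
# Note: max_depth=10 is used here; in the grid search it tied for rank 1 with max_depth=6, the value reported as best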
clf_xgb = Pipeline(steps=[('scaler', StandardScaler()),
                ('xgb',XGBClassifier(booster='gbtree',gamma=0.1,learning_rate=0.1,max_delta_step=2,max_depth=10,min_child_weight=1,
                                    n_estimators=500,objective='binary:logistic',reg_alpha=0,reg_lambda=0,scale_pos_weight=1,subsample=0.5,
                                    tree_method='exact',validate_parameters=True,random_state=13))])

score = cross_val_score(clf_xgb, x_new, y_le, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.972     0.976     0.974       212
   Malignant      0.976     0.972     0.974       212

    accuracy                          0.974       424
   macro avg      0.974     0.974     0.974       424
weighted avg      0.974     0.974     0.974       424



<a id='13'></a>
## 13) [Light Gradient Boosting Machine](<https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html>)

* Default hyperparameters

In [12]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_lgbm = Pipeline([('scaler', StandardScaler()), ('lgbm', lgbm.LGBMClassifier(random_state=13))])

score = cross_val_score(clf_lgbm, x_new, y_le, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.963     0.976     0.970       212
   Malignant      0.976     0.962     0.969       212

    accuracy                          0.969       424
   macro avg      0.969     0.969     0.969       424
weighted avg      0.969     0.969     0.969       424



* Hyperparameter tuning using Grid Search

In [13]:
# https://neptune.ai/blog/lightgbm-parameters-guide
# https://www.youtube.com/watch?v=5CWwwtEM2TA&ab_channel=PyData & https://github.com/MSusik/newgradientboosting/blob/master/pydata.pdf

param_grid = {
        'lgbm__boosting_type' : ['gbdt','dart'],
        'lgbm__num_leaves' : [10,20,30,40,50],
        'lgbm__max_depth' : [3,6,9,-1],
        'lgbm__learning_rate' : [0.05,0.1,0.3,0.5,1],
        'lgbm__n_estimators' : [50,100,200,500],
        'lgbm__objective' : ['binary'],
        'lgbm__min_child_samples' : [10,20,30],
        'lgbm__subsample' : [0.5], # per the LightGBM docs, takes effect only when subsample_freq > 0 (default is 0)
        'lgbm__reg_lambda' : [0,1],
        'lgbm__reg_alpha' : [0],
        'lgbm__colsample_bytree' : [1],
        'lgbm__scale_pos_weight' : [1],
        'lgbm__random_state' : [13]
    }

grid_search = GridSearchCV(clf_lgbm, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y_le)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 4800 candidates, totalling 48000 fits

Best hyperparameters :  {'lgbm__boosting_type': 'dart', 'lgbm__colsample_bytree': 1, 'lgbm__learning_rate': 1, 'lgbm__max_depth': 9, 'lgbm__min_child_samples': 30, 'lgbm__n_estimators': 500, 'lgbm__num_leaves': 10, 'lgbm__objective': 'binary', 'lgbm__random_state': 13, 'lgbm__reg_alpha': 0, 'lgbm__reg_lambda': 0, 'lgbm__scale_pos_weight': 1, 'lgbm__subsample': 0.5}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('lgbm',
                 LGBMClassifier(boosting_type='dart', colsample_bytree=1,
                                learning_rate=1, max_depth=9,
                                min_child_samples=30, n_estimators=500,
                                num_leaves=10, objective='binary',
                                random_state=13, reg_alpha=0, reg_lambda=0,
                                scale_pos_weight=1, subsample=0.5))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lgbm__boosting_type,param_lgbm__colsample_bytree,param_lgbm__learning_rate,param_lgbm__max_depth,param_lgbm__min_child_samples,param_lgbm__n_estimators,param_lgbm__num_leaves,param_lgbm__objective,param_lgbm__random_state,param_lgbm__reg_alpha,param_lgbm__reg_lambda,param_lgbm__scale_pos_weight,param_lgbm__subsample,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
4790,0.227001,0.061333,0.002693,0.000639,dart,1,1,-1,30,500,10,binary,13,0,0,1,0.5,"{'lgbm__boosting_type': 'dart', 'lgbm__colsamp...",0.97591,0.976541,0.952851,0.952851,1.0,0.976068,1.0,0.975848,1.0,1.0,0.981007,0.017679,1
4670,0.215182,0.033922,0.002693,0.000457,dart,1,1,9,30,500,10,binary,13,0,0,1,0.5,"{'lgbm__boosting_type': 'dart', 'lgbm__colsamp...",0.97591,0.976541,0.952851,0.952851,1.0,0.976068,1.0,0.975848,1.0,1.0,0.981007,0.017679,1
4630,0.228647,0.036535,0.00389,0.001041,dart,1,1,9,20,500,10,binary,13,0,0,1,0.5,"{'lgbm__boosting_type': 'dart', 'lgbm__colsamp...",1.0,0.976541,0.952851,0.906522,1.0,0.976068,1.0,0.975848,1.0,1.0,0.978783,0.028576,3


* Tuned hyperparameters

In [14]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_lgbm = Pipeline(steps=[('scaler', StandardScaler()),
                ('lgbm',lgbm.LGBMClassifier(boosting_type='dart', colsample_bytree=1,
                                learning_rate=1, max_depth=9,
                                min_child_samples=30, n_estimators=500,
                                num_leaves=10, objective='binary',
                                random_state=13, reg_alpha=0, reg_lambda=0,
                                scale_pos_weight=1, subsample=0.5))])

score = cross_val_score(clf_lgbm, x_new, y_le, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.981     0.981     0.981       212
   Malignant      0.981     0.981     0.981       212

    accuracy                          0.981       424
   macro avg      0.981     0.981     0.981       424
weighted avg      0.981     0.981     0.981       424



<a id='14'></a>
## 14) [K-Nearest Neighbors Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html>)

* Default hyperparameters

In [30]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_knn = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

score = cross_val_score(clf_knn, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.946     0.986     0.965       212
   Malignant      0.985     0.943     0.964       212

    accuracy                          0.965       424
   macro avg      0.965     0.965     0.965       424
weighted avg      0.965     0.965     0.965       424



* Hyperparameter tuning using Grid Search

In [34]:
param_grid = {
    'knn__n_neighbors': list(range(2,10)),
    'knn__weights': ['uniform','distance'],
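    'knn__algorithm' : ['ball_tree', 'kd_tree', 'brute'], # exact search in all cases: affects speed, not predictions
    'knn__leaf_size': [10,20,30,40,50], # tree tuning parameter: affects speed and memory, not predictions
    'knn__p': [1,2], # only used by the 'minkowski' metric
    'knn__metric': ['minkowski','manhattan','chebyshev']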
}

grid_search = GridSearchCV(clf_knn, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 1440 candidates, totalling 14400 fits

Best hyperparameters :  {'knn__algorithm': 'ball_tree', 'knn__leaf_size': 10, 'knn__metric': 'minkowski', 'knn__n_neighbors': 6, 'knn__p': 2, 'knn__weights': 'distance'}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('knn',
                 KNeighborsClassifier(algorithm='ball_tree', leaf_size=10,
                                      n_neighbors=6, weights='distance'))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_knn__algorithm,param_knn__leaf_size,param_knn__metric,param_knn__n_neighbors,param_knn__p,param_knn__weights,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
499,0.005099,0.000829,0.004401,0.000928,kd_tree,10,minkowski,6,2,distance,"{'knn__algorithm': 'kd_tree', 'knn__leaf_size'...",1.0,0.976282,0.952851,0.906926,1.0,0.976068,1.0,0.951389,1.0,0.97551,0.973903,0.028589,1
1363,0.00369,0.001732,0.003342,0.002319,brute,50,minkowski,6,2,distance,"{'knn__algorithm': 'brute', 'knn__leaf_size': ...",1.0,0.976282,0.952851,0.906926,1.0,0.976068,1.0,0.951389,1.0,0.97551,0.973903,0.028589,1
787,0.004088,0.0003,0.002992,0.000631,kd_tree,40,minkowski,6,2,distance,"{'knn__algorithm': 'kd_tree', 'knn__leaf_size'...",1.0,0.976282,0.952851,0.906926,1.0,0.976068,1.0,0.951389,1.0,0.97551,0.973903,0.028589,1
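
The top rows share the same scores and differ only in `algorithm` and `leaf_size`. Both parameters control how the neighbors are searched, not which neighbors are found. The sketch below checks this on synthetic data. The data and the train/test split are assumptions for the illustration, and exact agreement is expected only when there are no distance ties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), split into train and test parts.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=13)
X_train, X_test = X_demo[:200], X_demo[200:]
y_train = y_demo[:200]

predictions = {}
# Combinations taken from the tied rows of the results table above.
for algorithm, leaf_size in [('ball_tree', 10), ('kd_tree', 40), ('brute', 50)]:
    knn = KNeighborsClassifier(n_neighbors=6, weights='distance',
                               algorithm=algorithm, leaf_size=leaf_size)
    predictions[algorithm] = knn.fit(X_train, y_train).predict(X_test)

# All search strategies are exact, so the predicted labels should coincide.
all_equal = all(np.array_equal(predictions['brute'], p) for p in predictions.values())
print(all_equal)
```

Removing `knn__algorithm` and `knn__leaf_size` from the grid would reduce the 1440 candidates to 96 and should leave the ranking unchanged.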


* Tuned hyperparameters

In [35]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_knn = Pipeline(steps=[('scaler', StandardScaler()),
                ('knn',KNeighborsClassifier(algorithm='ball_tree', leaf_size=10,
                                      n_neighbors=6, weights='distance'))])

score = cross_val_score(clf_knn, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.955     0.995     0.975       212
   Malignant      0.995     0.953     0.973       212

    accuracy                          0.974       424
   macro avg      0.975     0.974     0.974       424
weighted avg      0.975     0.974     0.974       424



<a id='15'></a>
## 15) [Multi-layer Perceptron Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html>)

* Default hyperparameters

In [36]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_mlp =  Pipeline([('scaler', StandardScaler()), ('mlp', MLPClassifier(random_state=13))])

score = cross_val_score(clf_mlp, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.967     0.967     0.967       212
   Malignant      0.967     0.967     0.967       212

    accuracy                          0.967       424
   macro avg      0.967     0.967     0.967       424
weighted avg      0.967     0.967     0.967       424



* Hyperparameter tuning using Grid Search

In [37]:
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
param_grid = {
    'mlp__hidden_layer_sizes' : [(60,120,)],
    'mlp__activation' : ['tanh','relu'],
    'mlp__solver' : ['sgd','adam'],
    'mlp__alpha' : [0.01, 0, 2], # three separate values: 0.01, 0 and 2
    'mlp__batch_size' : [40,80,'auto'],
    'mlp__learning_rate' : ['invscaling','adaptive'],
    'mlp__learning_rate_init' : np.power(10, np.arange(-3, 0, dtype=float)),
    'mlp__power_t' : [0.5],
    'mlp__max_iter' : [50,100,200,500],
    'mlp__shuffle' : [True],
    'mlp__random_state' : [13]
}

grid_search = GridSearchCV(clf_mlp, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='rank_test_score').head(3)

Fitting 10 folds for each of 864 candidates, totalling 8640 fits

Best hyperparameters :  {'mlp__activation': 'relu', 'mlp__alpha': 2, 'mlp__batch_size': 'auto', 'mlp__hidden_layer_sizes': (60, 120), 'mlp__learning_rate': 'adaptive', 'mlp__learning_rate_init': 0.1, 'mlp__max_iter': 50, 'mlp__power_t': 0.5, 'mlp__random_state': 13, 'mlp__shuffle': True, 'mlp__solver': 'sgd'}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('mlp',
                 MLPClassifier(alpha=2, hidden_layer_sizes=(60, 120),
                               learning_rate='adaptive', learning_rate_init=0.1,
                               max_iter=50, random_state=13, solver='sgd'))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_mlp__activation,param_mlp__alpha,param_mlp__batch_size,param_mlp__hidden_layer_sizes,param_mlp__learning_rate,param_mlp__learning_rate_init,param_mlp__max_iter,param_mlp__power_t,param_mlp__random_state,param_mlp__shuffle,param_mlp__solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
856,0.147206,0.005411,0.002194,0.000399,relu,2,auto,"(60, 120)",adaptive,0.1,50,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",1.0,0.976541,0.929624,0.952851,1.0,0.952273,1.0,0.975848,1.0,1.0,0.978714,0.02461,1
843,0.326111,0.058996,0.001762,0.004658,relu,2,auto,"(60, 120)",adaptive,0.001,100,0.5,13,True,adam,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.976282,1.0,0.929624,0.952851,1.0,0.952273,1.0,0.975848,1.0,1.0,0.978688,0.024612,2
819,0.423367,0.023142,0.002593,0.001017,relu,2,auto,"(60, 120)",invscaling,0.001,100,0.5,13,True,adam,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.976282,1.0,0.929624,0.952851,1.0,0.952273,1.0,0.975848,1.0,1.0,0.978688,0.024612,2


* Tuned hyperparameters

In [38]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_mlp =  Pipeline(steps=[('scaler', StandardScaler()),
                ('mlp',MLPClassifier(alpha=2, hidden_layer_sizes=(60, 120),
                               learning_rate='adaptive', learning_rate_init=0.1,
                               max_iter=50, random_state=13, solver='sgd'))])


score = cross_val_score(clf_mlp, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.981     0.976     0.979       212
   Malignant      0.977     0.981     0.979       212

    accuracy                          0.979       424
   macro avg      0.979     0.979     0.979       424
weighted avg      0.979     0.979     0.979       424



* A larger range of hyperparameters was tried at first, but the search was too time-consuming. The worst-scoring candidates were then identified with the following code, and the hyperparameter values associated with them were removed from the grid.

In [39]:
# print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=True).head(5) # worst 5

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_mlp__activation,param_mlp__alpha,param_mlp__batch_size,param_mlp__hidden_layer_sizes,param_mlp__learning_rate,param_mlp__learning_rate_init,param_mlp__max_iter,param_mlp__power_t,param_mlp__random_state,param_mlp__shuffle,param_mlp__solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
672,0.168259,0.015378,0.002593,0.000488,relu,0.0,auto,"(60, 120)",invscaling,0.001,50,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 0, '...",0.59543,0.338462,0.385714,0.401465,0.514303,0.533333,0.326203,0.585185,0.396825,0.458983,0.45359,0.093562,862
816,0.175231,0.034767,0.002194,0.000399,relu,2.0,auto,"(60, 120)",invscaling,0.001,50,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.59543,0.338462,0.385714,0.401465,0.514303,0.533333,0.326203,0.585185,0.396825,0.458983,0.45359,0.093562,862
528,0.172291,0.017648,0.002249,0.000411,relu,0.01,auto,"(60, 120)",invscaling,0.001,50,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 0.01...",0.59543,0.338462,0.385714,0.401465,0.514303,0.533333,0.326203,0.585185,0.396825,0.458983,0.45359,0.093562,862
820,0.376593,0.091383,0.002394,0.000488,relu,2.0,auto,"(60, 120)",invscaling,0.001,200,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.59543,0.379159,0.419941,0.401465,0.575758,0.533333,0.326203,0.585185,0.396825,0.458983,0.467228,0.092589,853
818,0.370209,0.046181,0.005186,0.006672,relu,2.0,auto,"(60, 120)",invscaling,0.001,100,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.59543,0.379159,0.419941,0.401465,0.575758,0.533333,0.326203,0.585185,0.396825,0.458983,0.467228,0.092589,853
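
The pruning step described above can be sketched as a small helper that counts how often each hyperparameter value appears among the worst-scoring rows of `cv_results_`. This is only an illustrative sketch: the helper name `worst_param_values` and the `toy_results` dictionary are hypothetical, and the toy scores are made up for the example, not taken from the thesis experiments.

```python
import pandas as pd


def worst_param_values(cv_results, param_cols, n_worst=5):
    """Count how often each hyperparameter value occurs in the n_worst rows.

    cv_results: dict or DataFrame shaped like GridSearchCV.cv_results_
    param_cols: names of the 'param_...' columns to inspect
    """
    df = pd.DataFrame(cv_results)
    # Lowest mean_test_score first, same ordering as in the cell above
    worst = df.sort_values(by="mean_test_score", ascending=True).head(n_worst)
    # Cast to str so tuples, floats and strings are counted uniformly
    return {
        col: worst[col].astype(str).value_counts().to_dict()
        for col in param_cols
    }


# Hypothetical toy values, for illustration only (not thesis results)
toy_results = {
    "param_mlp__learning_rate": [
        "invscaling", "invscaling", "adaptive",
        "constant", "invscaling", "adaptive",
    ],
    "param_mlp__learning_rate_init": [0.001, 0.001, 0.1, 0.1, 0.001, 0.01],
    "mean_test_score": [0.45, 0.47, 0.97, 0.96, 0.46, 0.95],
}

counts = worst_param_values(
    toy_results,
    ["param_mlp__learning_rate", "param_mlp__learning_rate_init"],
    n_worst=3,
)
print(counts)
```

A value that dominates the worst rows (as `invscaling` with `learning_rate_init=0.001` does in the output above) is a candidate for removal from the grid, provided it does not also appear among the best configurations.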


<a id='16'></a>
## 16) Summary

* The tables below summarize the results for this feature selection setting (all features).
* The algorithms are listed in descending order of performance.
* All results are average values over a 10-fold cross-validation.
* The columns report the accuracy and the average values of precision, recall and F1 score.
* The data set contains equal numbers of Benign and Malignant samples (212 each), so the weighted average and the macro average coincide.
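
The last point can be checked with a few lines of plain Python. The sketch below uses the per-class precision values from the MLP classification report above (0.981 and 0.977, support 212 each). The imbalanced support is a hypothetical contrast case, and the function names are chosen for this example only.

```python
def macro_average(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean of the per-class scores weighted by class support."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


# Per-class precision from the MLP report above (Benign, Malignant)
precision = [0.981, 0.977]

# Balanced classes, as in this data set: both averages coincide
balanced_support = [212, 212]
macro = macro_average(precision)
weighted = weighted_average(precision, balanced_support)
print(round(macro, 3), round(weighted, 3))

# Hypothetical imbalanced support: the two averages no longer agree
imbalanced_support = [300, 100]
weighted_imbalanced = weighted_average(precision, imbalanced_support)
print(round(macro, 3), round(weighted_imbalanced, 3))
```

With equal supports every class weight is 1/2, which is exactly the weight used by the macro average, so the two summary rows of the report are identical.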

<table>
    <tr>
        <th colspan="5"> All features : Default algorithms</th>
    </tr>
    <tr>
        <th></th>
        <th>precision </th>
        <th>recall</th>
        <th>f1 score</th>
        <th>accuracy</th>  
    </tr>
    <tr>
        <th>SVC</th>
        <td>0.979</td>
        <td>0.979</td>
        <td>0.979</td>
        <td>0.979</td>
    </tr>
    <tr>
        <th>LGBM</th>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
    </tr>
    <tr>
        <th>MLP</th>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
    </tr>
    <tr>
        <th>AdaBoost</th>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
    </tr>
    <tr>
        <th>LDA</th>
        <td>0.966</td>
        <td>0.965</td>
        <td>0.965</td>
        <td>0.965</td>
    </tr>
    <tr>
        <th>KNN</th>
        <td>0.965</td>
        <td>0.965</td>
        <td>0.965</td>
        <td>0.965</td>
    </tr>
    <tr>
        <th>XGBoost</th>
        <td>0.965</td>
        <td>0.965</td>
        <td>0.965</td>
        <td>0.965</td>
    </tr>
    <tr>
        <th>Random Forest</th>
        <td>0.965</td>
        <td>0.965</td>
        <td>0.965</td>
        <td>0.965</td>
    </tr>
    <tr>
        <th>Ridge</th>
        <td>0.964</td>
        <td>0.962</td>
        <td>0.962</td>
        <td>0.962</td>
    </tr>
    <tr>
        <th>SGD</th>
        <td>0.955</td>
        <td>0.955</td>
        <td>0.955</td>
        <td>0.955</td>
    </tr>
    <tr>
        <th>QDA</th>
        <td>0.946</td>
        <td>0.946</td>
        <td>0.946</td>
        <td>0.946</td>
    </tr>
    <tr>
        <th>GNB</th>
        <td>0.933</td>
        <td>0.932</td>
        <td>0.932</td>
        <td>0.932</td>
    </tr>
    <tr>
        <th>Decision Tree</th>
        <td>0.922</td>
        <td>0.922</td>
        <td>0.922</td>
        <td>0.922</td>
    </tr>

</table>

<table>
    <tr>
        <th colspan="5"> All features : Tuned algorithms</th>
    </tr>
    <tr>
        <th></th>
        <th>precision </th>
        <th>recall</th>
        <th>f1 score</th>
        <th>accuracy</th>  
    </tr>
    <tr>
        <th>SVC</th>
        <td>0.981</td>
        <td>0.981</td>
        <td>0.981</td>
        <td>0.981</td>
    </tr>
    <tr>
        <th>LGBM</th>
        <td>0.981</td>
        <td>0.981</td>
        <td>0.981</td>
        <td>0.981</td>
    </tr>
    <tr>
        <th>MLP</th>
        <td>0.979</td>
        <td>0.979</td>
        <td>0.979</td>
        <td>0.979</td>
    </tr>
    <tr>
        <th>KNN</th>
        <td>0.975</td>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
    </tr>
    <tr>
        <th>XGBoost</th>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
    </tr>
    <tr>
        <th>SGD</th>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
    </tr>
    <tr>
        <th>AdaBoost</th>
        <td>0.972</td>
        <td>0.972</td>
        <td>0.972</td>
        <td>0.972</td>
    </tr>
    <tr>
        <th>QDA</th>
        <td>0.971</td>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
    </tr>
    <tr>
        <th>Random Forest</th>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
    </tr>
    <tr>
        <th>Ridge</th>
        <td>0.968</td>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
    </tr>
    <tr>
        <th>LDA</th>
        <td>0.966</td>
        <td>0.965</td>
        <td>0.965</td>
        <td>0.965</td>
    </tr>
    <tr>
        <th>Decision Tree</th>
        <td>0.960</td>
        <td>0.960</td>
        <td>0.960</td>
        <td>0.960</td>
    </tr>
    <tr>
        <th>GNB</th>
        <td>0.934</td>
        <td>0.932</td>
        <td>0.932</td>
        <td>0.932</td>
    </tr>

</table>