# Diploma thesis
## Breast cancer classification using machine learning methods
### Feature selection with Recursive Feature Elimination with cross-validation

> Lazaros Panitsidis<br />
> Department of Production and Management Engineering <br />
> International Hellenic University <br />
> lazarospanitsidis@outlook.com

## Contents
1. [Useful Python Libraries](#1)
1. [Data Processing](#2)
1. [Gaussian Naive Bayes](#3)
1. [Linear Discriminant Analysis](#4)
1. [Quadratic Discriminant Analysis](#5)
1. [Ridge Classifier](#6)
1. [Decision Tree Classifier](#7)
1. [Random Forest Classifier](#8)
1. [AdaBoost Classifier (Adaptive Boosting)](#9)
1. [C-Support Vector Classification](#10)
1. [Stochastic Gradient Descent Classifier](#11)
1. [eXtreme Gradient Boosting](#12)
1. [Light Gradient Boosting Machine](#13)
1. [K-Nearest Neighbors Classifier](#14)
1. [Multi-layer Perceptron Classifier](#15)
1. [Summary](#16)

<a id='1'></a>
## 1) Useful Python Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization library  
import scipy.stats as stats
import matplotlib.pyplot as plt
import time
pd.set_option('display.max_columns', None) # show every column when a DataFrame is displayed
pd.options.mode.chained_assignment = None  # silence SettingWithCopyWarning (default='warn')
import warnings
# Suppress all warnings to keep the cell output readable.
# Note that this also hides convergence, deprecation and failed-fit warnings.
warnings.filterwarnings('ignore')

# Some of the imports below are not used in this file.
from sklearn.feature_selection import SelectKBest, f_classif, chi2, RFE, RFECV , mutual_info_classif
from sklearn.model_selection import train_test_split, cross_val_score , GridSearchCV , LeaveOneOut,KFold,RandomizedSearchCV
from skopt import BayesSearchCV # https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV , https://scikit-optimize.github.io/stable/auto_examples/bayesian-optimization.html
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score , make_scorer , classification_report
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline , Pipeline # https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn.preprocessing import StandardScaler , LabelEncoder
from xgboost import XGBClassifier , plot_importance
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier , RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis , QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier , AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
import lightgbm as lgbm
from sklearn.neural_network import MLPClassifier
import pygad

<a id='2'></a>
## 2) Data Processing

In [2]:
dataWISC = pd.read_csv('dataWisc.csv')
dataWISC.drop(["id", "Unnamed: 32"], axis = 1, inplace = True)

# Undersampling function
def make_undersample(_df, column):
  """Downsample every class in `column` to the size of the smallest class."""
  dfs_r = {}
  dfs_c = {}
  smaller = float('inf')  # size of the smallest class seen so far
  ignore = ""             # label of the smallest class, which is kept in full
  for c in _df[column].unique():
    dfs_c[c] = _df[_df[column] == c]
    if dfs_c[c].shape[0] < smaller:
      smaller = dfs_c[c].shape[0]
      ignore = c

  for c in dfs_c:
    if c == ignore:
      continue
    dfs_r[c] = resample(dfs_c[c], 
                        replace=False, # sample without replacement
                        n_samples=smaller,
                        random_state=0)
  return pd.concat([dfs_r[c] for c in dfs_r] + [dfs_c[ignore]])

dataWISC = make_undersample(dataWISC,'diagnosis')

# Description of the dataset

# Number of cases in the dataset
length = len(dataWISC)
# Number of features, excluding the 'diagnosis' target column
features = dataWISC.shape[1]-1

# Number of malignant cases
malignant = len(dataWISC[dataWISC['diagnosis']=='M'])

# Number of benign cases
benign = len(dataWISC[dataWISC['diagnosis']=='B'])

# Percentage of malignant cases among all cases
rate = (float(malignant)/(length))*100

print("There are {} cases in this dataset".format(length))
print("There are {} features in this dataset".format(features))
print("There are {} cases diagnosed as malignant tumor".format(malignant))
print("There are {} cases diagnosed as benign tumor".format(benign))
print("The percentage of malignant cases is: {:.2f}%".format(rate))

There are 424 cases in this dataset
There are 30 features in this dataset
There are 212 cases diagnosed as malignant tumor
There are 212 cases diagnosed as benign tumor
The percentage of malignant cases is: 50.00%
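
The balancing step performed by `make_undersample` can be illustrated with the standard library alone. The sketch below is only an illustration: `undersample_indices` and the toy labels are hypothetical, and it draws with `random.Random` instead of `sklearn.utils.resample`, so it would not select the same rows as the cell above.

```python
import random
from collections import Counter, defaultdict


def undersample_indices(labels, seed=0):
    """Return sorted row indices that keep every class at the minority-class size."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    smallest = min(len(rows) for rows in by_class.values())
    rng = random.Random(seed)

    kept = []
    for rows in by_class.values():
        # Sample without replacement; the minority class is kept in full.
        kept.extend(rng.sample(rows, smallest))
    return sorted(kept)


# Toy labels: 5 benign, 3 malignant (hypothetical data).
toy_labels = ["B", "B", "M", "B", "M", "B", "B", "M"]
kept = undersample_indices(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept)
print(balanced_counts)
```

Applied to the notebook's data, the same idea leaves both classes with 212 cases, which is why the malignant rate printed above is exactly 50%.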


In [4]:
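y = dataWISC.diagnosis                          # target labels: 'M' (malignant) or 'B' (benign)
x = dataWISC.drop('diagnosis',axis = 1 )        # all 30 features
target_names=['Benign','Malignant']             # report names, in sorted label order ('B', 'M')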
# x_scaled = (x - x.mean())/x.std()
le= LabelEncoder()
le.fit(y)
y_le = le.transform(y)
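
`LabelEncoder` assigns integer codes in the sorted order of the labels, so `'B'` becomes 0 and `'M'` becomes 1, the same order as `target_names`. A minimal pure-Python sketch of that mapping follows; `fit_label_mapping` and the sample list are hypothetical names used only for illustration.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}


diagnoses = ["M", "B", "B", "M"]  # hypothetical sample of the diagnosis column
mapping = fit_label_mapping(diagnoses)
encoded = [mapping[d] for d in diagnoses]
print(mapping, encoded)
```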

In [5]:
# Features selected by RFECV (Recursive Feature Elimination with cross-validation)
x_new = x[['texture_mean', 'area_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'area_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'smoothness_worst', 'concavity_worst',
       'symmetry_worst', 'fractal_dimension_worst']]
x_new.head()

index,texture_mean,area_mean,symmetry_mean,fractal_dimension_mean,area_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,smoothness_worst,concavity_worst,symmetry_worst,fractal_dimension_worst
49,22.3,561.0,0.1809,0.05718,20.2,0.02095,0.01184,0.01641,0.001956,0.1162,0.2282,0.2871,0.06917
285,18.4,489.0,0.1697,0.05855,22.45,0.00186,0.002924,0.02571,0.002015,0.1038,0.005579,0.2505,0.06431
495,20.21,680.9,0.1487,0.05748,21.84,0.02153,0.01183,0.01959,0.001812,0.1216,0.17,0.2369,0.06599
391,16.84,234.3,0.1985,0.07098,28.85,0.0,0.0,0.01865,0.006736,0.146,0.0,0.2445,0.08865
187,17.19,420.3,0.1516,0.06095,17.86,0.01978,0.01185,0.01897,0.001671,0.1323,0.1521,0.2572,0.07097


In [6]:
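# Leave-one-out cross-validation alternative (not used here):
# https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/#:~:text=Given%20the%20improved%20estimate%20of,biased%20estimates%20of%20model%20performance.
# cv = LeaveOneOut()

# 10-fold cross-validation with a fixed shuffle, so every model is evaluated on the same splits.
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
cv=KFold(n_splits=10, shuffle=True, random_state=13)

# True and predicted labels collected across all folds; reset both lists before each evaluation.
originalclass = []
predictedclass = []

def classification_report_with_accuracy_score(y_true, y_pred):
  """Record one fold's labels and predictions, then return that fold's accuracy."""
  originalclass.extend(y_true)
  predictedclass.extend(y_pred)
  #print(classification_report(y_true, y_pred, target_names=target_names)) 
  return accuracy_score(y_true, y_pred)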

def print_best_params(grid_search):
    print("")
    print("Best hyperparameters : ", grid_search.best_params_)
    print("")
    print("Best estimator : ", grid_search.best_estimator_)
    print("")
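
The splitting scheme used by `cv` can be sketched in plain Python. This is an illustration under stated assumptions: `kfold_indices` is a hypothetical helper, and it shuffles with `random.Random`, so the fold membership differs from what `KFold(..., random_state=13)` produces; only the fold sizes and the "each sample is tested exactly once" property carry over.

```python
import random


def kfold_indices(n_samples, n_splits, seed):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The first n_samples % n_splits folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=424, n_splits=10, seed=13))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Because every case lands in exactly one test fold, the pooled `originalclass` and `predictedclass` lists hold 424 entries after a full run, which matches the support of 212 per class in the reports below.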

<a id='3'></a>
## 3) [Gaussian Naive Bayes](<https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB>)

* Default hyperparameters

In [7]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_gnb = Pipeline([('scaler', StandardScaler()), ('gnb', GaussianNB())])
score = cross_val_score(clf_gnb, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.902     0.958     0.929       212
   Malignant      0.955     0.896     0.925       212

    accuracy                          0.927       424
   macro avg      0.928     0.927     0.927       424
weighted avg      0.928     0.927     0.927       424
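
The figures in the report can be retraced from confusion counts. The counts in the sketch below are not notebook output: they are back-calculated from the rounded recall values above (203 of 212 benign and 190 of 212 malignant cases correct), and `binary_report` is a hypothetical helper.

```python
def binary_report(tp, fn, fp):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts inferred from the rounded report (an inference, not notebook output).
benign_correct, benign_missed = 203, 9
malignant_correct, malignant_missed = 190, 22

# A missed malignant case is a false positive for the benign class, and vice versa.
benign = binary_report(tp=benign_correct, fn=benign_missed, fp=malignant_missed)
malignant = binary_report(tp=malignant_correct, fn=malignant_missed, fp=benign_missed)
accuracy = (benign_correct + malignant_correct) / 424

print([round(v, 3) for v in benign])
print([round(v, 3) for v in malignant])
print(round(accuracy, 3))
```

Read this way, the default model misses 22 malignant cases but raises only 9 false alarms, which is the asymmetry behind the lower malignant recall.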



* Hyperparameter tuning using Grid Search

In [8]:
param_grid = { 'gnb__var_smoothing': np.logspace(0,-10, num=100) }

grid_search = GridSearchCV(clf_gnb, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits

Best hyperparameters :  {'gnb__var_smoothing': 0.030538555088334154}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('gnb', GaussianNB(var_smoothing=0.030538555088334154))])



index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_gnb__var_smoothing,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
17,0.00389,0.001042,0.001994,0.000446,0.019179,{'gnb__var_smoothing': 0.019179102616724886},0.952851,0.928847,0.904444,0.906926,0.976177,0.976068,0.899038,0.877551,0.927545,0.97551,0.932496,0.034188,1
16,0.00349,0.000669,0.001995,0.000446,0.024201,{'gnb__var_smoothing': 0.02420128264794382},0.952851,0.928847,0.904444,0.906926,0.976177,0.976068,0.899038,0.877551,0.927545,0.97551,0.932496,0.034188,1
15,0.004189,0.001246,0.002094,0.000829,0.030539,{'gnb__var_smoothing': 0.030538555088334154},0.952851,0.928847,0.904444,0.906926,0.976177,0.976068,0.899038,0.877551,0.927545,0.97551,0.932496,0.034188,1
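
The candidate values of `var_smoothing` come from `np.logspace(0, -10, num=100)`, that is, 100 powers of ten with exponents evenly spaced between 0 and -10. The sketch below rebuilds that grid in plain Python (an approximation of NumPy's result up to floating-point rounding) and locates the value reported as best.

```python
import math

# Pure-Python equivalent of np.logspace(0, -10, num=100).
num = 100
exponents = [0 + i * (-10 - 0) / (num - 1) for i in range(num)]
grid = [10.0 ** e for e in exponents]

reported_best = 0.030538555088334154  # value printed by the grid search above
closest = min(range(num), key=lambda i: abs(grid[i] - reported_best))
print(closest, grid[closest])
```

The reported value sits at position 15 of the grid. Candidates 15, 16 and 17 share the same mean score and rank in the table above, so the reported optimum is one of several tied candidates and should not be read as a sharp optimum.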


* Tuned hyperparameters

In [9]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_gnb = Pipeline(steps=[('scaler', StandardScaler()),
                ('gnb', GaussianNB(var_smoothing=0.030538555088334154))])

score = cross_val_score(clf_gnb, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.904     0.972     0.936       212
   Malignant      0.969     0.896     0.931       212

    accuracy                          0.934       424
   macro avg      0.936     0.934     0.934       424
weighted avg      0.936     0.934     0.934       424



<a id='4'></a>
## 4) [Linear Discriminant Analysis](<https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html>)

* Default hyperparameters

In [10]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_lda = Pipeline([('scaler', StandardScaler()), ('lda', LinearDiscriminantAnalysis())])

score = cross_val_score(clf_lda, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.924     0.981     0.952       212
   Malignant      0.980     0.920     0.949       212

    accuracy                          0.950       424
   macro avg      0.952     0.950     0.950       424
weighted avg      0.952     0.950     0.950       424



* Hyperparameter tuning using Grid Search

In [11]:
param_grid = {
    'lda__solver' : ['svd','lsqr','eigen'],
    'lda__shrinkage':[None,'auto'],
    'lda__tol': [0.0001,0.001,0.01,0.1]
}

grid_search = GridSearchCV(clf_lda, param_grid=param_grid, n_jobs=-1, cv=cv,verbose=4,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 24 candidates, totalling 240 fits

Best hyperparameters :  {'lda__shrinkage': 'auto', 'lda__solver': 'lsqr', 'lda__tol': 0.0001}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('lda',
                 LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr'))])



index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lda__shrinkage,param_lda__solver,param_lda__tol,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
23,0.006682,0.000898,0.002094,0.000698,auto,eigen,0.1,"{'lda__shrinkage': 'auto', 'lda__solver': 'eig...",0.929624,0.902715,0.905702,0.953261,0.976177,0.976068,1.0,0.951389,0.97551,1.0,0.957045,0.033385,1
22,0.013165,0.010841,0.002194,0.001396,auto,eigen,0.01,"{'lda__shrinkage': 'auto', 'lda__solver': 'eig...",0.929624,0.902715,0.905702,0.953261,0.976177,0.976068,1.0,0.951389,0.97551,1.0,0.957045,0.033385,1
21,0.010772,0.00682,0.002493,0.000804,auto,eigen,0.001,"{'lda__shrinkage': 'auto', 'lda__solver': 'eig...",0.929624,0.902715,0.905702,0.953261,0.976177,0.976068,1.0,0.951389,0.97551,1.0,0.957045,0.033385,1
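
The "24 candidates, 240 fits" line follows directly from the grid: a grid search evaluates every combination of the listed values on every fold. The sketch below enumerates the combinations with `itertools.product`. It also counts the combinations that pair the `svd` solver with shrinkage, which, to my understanding of the scikit-learn documentation, is not supported and would therefore fail to fit.

```python
from itertools import product

param_grid = {
    "lda__solver": ["svd", "lsqr", "eigen"],
    "lda__shrinkage": [None, "auto"],
    "lda__tol": [0.0001, 0.001, 0.01, 0.1],
}

# Every combination of the listed values is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 10
total_fits = len(candidates) * n_folds

# Combinations pairing the 'svd' solver with shrinkage (assumed unsupported).
unsupported = [c for c in candidates
               if c["lda__solver"] == "svd" and c["lda__shrinkage"] is not None]

print(len(candidates), total_fits, len(unsupported))
```

The three top rows above differ only in `tol`. The scikit-learn documentation describes `tol` as used only by the `svd` solver, which would explain why the `eigen` candidates score identically regardless of its value.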


* Tuned hyperparameters

In [12]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_lda = Pipeline(steps=[('scaler', StandardScaler()),
                ('lda',LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr'))])
                
score = cross_val_score(clf_lda, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.929     0.991     0.959       212
   Malignant      0.990     0.925     0.956       212

    accuracy                          0.958       424
   macro avg      0.960     0.958     0.958       424
weighted avg      0.960     0.958     0.958       424



<a id='5'></a>
## 5) [Quadratic Discriminant Analysis](<https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html>)

* Default hyperparameters

In [13]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_qda = Pipeline([('scaler', StandardScaler()), ('qda', QuadraticDiscriminantAnalysis())])

score = cross_val_score(clf_qda, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.944     0.953     0.948       212
   Malignant      0.952     0.943     0.948       212

    accuracy                          0.948       424
   macro avg      0.948     0.948     0.948       424
weighted avg      0.948     0.948     0.948       424



* Hyperparameter tuning using Grid Search

In [14]:
param_grid = {
    'qda__reg_param': np.linspace(0, 1, num=10),
    'qda__tol': [0.0001,0.001,0.01]
}

grid_search = GridSearchCV(clf_qda, param_grid=param_grid, n_jobs=-1, cv=cv,verbose=4,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 30 candidates, totalling 300 fits

Best hyperparameters :  {'qda__reg_param': 0.0, 'qda__tol': 0.0001}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('qda', QuadraticDiscriminantAnalysis())])



index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_qda__reg_param,param_qda__tol,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.009075,0.003525,0.003091,0.001132,0.0,0.0001,"{'qda__reg_param': 0.0, 'qda__tol': 0.0001}",0.905702,0.929624,0.952851,0.929624,1.0,0.952273,0.974437,0.90389,0.926531,1.0,0.947493,0.033192,1
2,0.015758,0.020104,0.004089,0.001917,0.0,0.01,"{'qda__reg_param': 0.0, 'qda__tol': 0.01}",0.905702,0.929624,0.952851,0.929624,1.0,0.952273,0.974437,0.90389,0.926531,1.0,0.947493,0.033192,1
1,0.008277,0.004256,0.006682,0.006939,0.0,0.001,"{'qda__reg_param': 0.0, 'qda__tol': 0.001}",0.905702,0.929624,0.952851,0.929624,1.0,0.952273,0.974437,0.90389,0.926531,1.0,0.947493,0.033192,1
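
The search selects `reg_param=0.0`, meaning no regularisation of the per-class covariance estimates. According to the scikit-learn documentation, `reg_param` transforms each class's variance scalings as `(1 - reg_param) * S2 + reg_param`. The sketch below applies that formula to hypothetical variances; `regularise_variances` is an illustrative helper, not part of the library.

```python
def regularise_variances(variances, reg_param):
    """Shrink per-direction variances towards 1 by the factor reg_param."""
    return [(1 - reg_param) * v + reg_param for v in variances]


# Same ten values as np.linspace(0, 1, num=10), up to floating-point rounding.
reg_grid = [i / 9 for i in range(10)]

variances = [4.0, 1.0, 0.01]  # hypothetical variances along three directions
unchanged = regularise_variances(variances, reg_grid[0])
spherical = regularise_variances(variances, reg_grid[-1])
print(unchanged, spherical)
```

At `reg_param=0` the estimates are left as fitted, and at `reg_param=1` every direction has unit variance. The fact that 0 wins suggests the 13 selected features give covariance estimates stable enough not to need shrinkage on this data.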


* Tuned hyperparameters

In [15]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_qda = Pipeline([('scaler', StandardScaler()), ('qda', QuadraticDiscriminantAnalysis())])

score = cross_val_score(clf_qda, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.944     0.953     0.948       212
   Malignant      0.952     0.943     0.948       212

    accuracy                          0.948       424
   macro avg      0.948     0.948     0.948       424
weighted avg      0.948     0.948     0.948       424



<a id='6'></a>
## 6) [Ridge Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier>)

* Default hyperparameters

In [16]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_rc = Pipeline([('scaler', StandardScaler()), ('rg', RidgeClassifier())])

score = cross_val_score(clf_rc, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.929     0.981     0.954       212
   Malignant      0.980     0.925     0.951       212

    accuracy                          0.953       424
   macro avg      0.954     0.953     0.953       424
weighted avg      0.954     0.953     0.953       424



* Hyperparameter tuning using Grid Search

In [17]:
param_grid = {
    'rg__alpha' : np.linspace(0, 1, num=10),
    'rg__fit_intercept' : [True,False],
    'rg__copy_X' : [True,False],
    'rg__max_iter' : [None],
    'rg__tol' : [0.001],
    'rg__class_weight' : [None,'balanced'],
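    # per the scikit-learn documentation, 'lbfgs' can only be used with positive=True,
    # so its candidates are expected to fail to fit in this grid (positive is fixed to False)
    'rg__solver' : ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga', 'lbfgs'],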
    'rg__positive' : [False]
}

grid_search = GridSearchCV(clf_rc, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 560 candidates, totalling 5600 fits

Best hyperparameters :  {'rg__alpha': 0.0, 'rg__class_weight': None, 'rg__copy_X': True, 'rg__fit_intercept': True, 'rg__max_iter': None, 'rg__positive': False, 'rg__solver': 'saga', 'rg__tol': 0.001}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('rg', RidgeClassifier(alpha=0.0, solver='saga'))])



index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rg__alpha,param_rg__class_weight,param_rg__copy_X,param_rg__fit_intercept,param_rg__max_iter,param_rg__positive,param_rg__solver,param_rg__tol,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
393,0.005436,0.001383,0.005984,0.00691,0.777778,,True,True,,False,cholesky,0.001,"{'rg__alpha': 0.7777777777777777, 'rg__class_w...",0.929624,0.902715,0.905702,0.929624,0.976177,0.952273,1.0,0.951389,0.97551,1.0,0.952301,0.033643,1
103,0.009826,0.003057,0.004488,0.007223,0.111111,balanced,False,True,,False,saga,0.001,"{'rg__alpha': 0.1111111111111111, 'rg__class_w...",0.929624,0.902715,0.905702,0.929624,0.976177,0.952273,1.0,0.951389,0.97551,1.0,0.952301,0.033643,1
350,0.005585,0.001017,0.004488,0.006841,0.666667,,False,True,,False,svd,0.001,"{'rg__alpha': 0.6666666666666666, 'rg__class_w...",0.929624,0.902715,0.905702,0.929624,0.976177,0.952273,1.0,0.951389,0.97551,1.0,0.952301,0.033643,1
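
`RidgeClassifier` encodes the two classes as -1 and +1, fits a ridge regression to those targets, and predicts from the sign of the output. A minimal one-feature sketch of that idea follows; `fit_ridge_1d` and the toy data are hypothetical, and the closed form shown is the textbook single-feature ridge solution, not the library's solver code.

```python
def fit_ridge_1d(xs, ys, alpha=1.0):
    """Fit y ~ w * x + b with an L2 penalty alpha * w**2 (intercept unpenalised)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + alpha)
    b = y_mean - w * x_mean
    return w, b


# Hypothetical standardised feature values and labels.
xs = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]
labels = ["B", "B", "B", "M", "M", "M"]

# Encode the two classes as -1 / +1 and regress on them.
targets = [1.0 if label == "M" else -1.0 for label in labels]
w, b = fit_ridge_1d(xs, targets, alpha=1.0)

# The sign of the regression output decides the class.
predicted = ["M" if w * x + b > 0 else "B" for x in xs]
print(w, b, predicted)
```

Setting `alpha=0` in this sketch removes the penalty and gives ordinary least squares, which is the setting the grid search reports as best (`alpha=0.0`), although many other candidates tie with it in the table above.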


* Tuned hyperparameters

In [18]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_rc = Pipeline(steps=[('scaler', StandardScaler()),
                ('rg', RidgeClassifier(alpha=0.0, solver='saga'))])

score = cross_val_score(clf_rc, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.929     0.981     0.954       212
   Malignant      0.980     0.925     0.951       212

    accuracy                          0.953       424
   macro avg      0.954     0.953     0.953       424
weighted avg      0.954     0.953     0.953       424



<a id='7'></a>
## 7) [Decision Tree Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>)

* Default hyperparameters

In [19]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_tree = Pipeline([('scaler', StandardScaler()), ('tree', DecisionTreeClassifier(random_state=13))])

score = cross_val_score(clf_tree, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.927     0.896     0.911       212
   Malignant      0.900     0.929     0.914       212

    accuracy                          0.913       424
   macro avg      0.913     0.913     0.913       424
weighted avg      0.913     0.913     0.913       424



* Hyperparameter tuning using Grid Search

In [20]:
param_grid = {
    'tree__criterion' :['gini','entropy'],
    'tree__splitter' : ['best','random'],
    'tree__max_depth': [2,6,10,None],
    'tree__min_samples_split': list(range(2, 4)),
    'tree__min_samples_leaf': [3,5],
    'tree__min_weight_fraction_leaf' : [0.0],
    'tree__max_features': [None, 'sqrt', 'log2'],
    'tree__max_leaf_nodes' : [None,10,50],
    'tree__min_impurity_decrease' : [0.0],
    'tree__class_weight' : [None,'balanced'],
    'tree__ccp_alpha' : [0.0],
    'tree__random_state' : [13]
}

grid_search = GridSearchCV(clf_tree, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 1152 candidates, totalling 11520 fits

Best hyperparameters :  {'tree__ccp_alpha': 0.0, 'tree__class_weight': 'balanced', 'tree__criterion': 'gini', 'tree__max_depth': 6, 'tree__max_features': 'sqrt', 'tree__max_leaf_nodes': None, 'tree__min_impurity_decrease': 0.0, 'tree__min_samples_leaf': 5, 'tree__min_samples_split': 2, 'tree__min_weight_fraction_leaf': 0.0, 'tree__random_state': 13, 'tree__splitter': 'best'}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('tree',
                 DecisionTreeClassifier(class_weight='balanced', max_depth=6,
                                        max_features='sqrt', min_samples_leaf=5,
                                        random_state=13))])



index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_tree__ccp_alpha,param_tree__class_weight,param_tree__criterion,param_tree__max_depth,param_tree__max_features,param_tree__max_leaf_nodes,param_tree__min_impurity_decrease,param_tree__min_samples_leaf,param_tree__min_samples_split,param_tree__min_weight_fraction_leaf,param_tree__random_state,param_tree__splitter,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
676,0.003748,0.000675,0.001852,0.000322,0.0,balanced,gini,6.0,sqrt,,0.0,5,2,0.0,13,best,"{'tree__ccp_alpha': 0.0, 'tree__class_weight':...",0.952222,0.906926,0.882706,0.928847,0.928531,0.904762,0.974437,0.902778,0.975045,0.975045,0.93313,0.032526,1
822,0.007205,0.004552,0.006092,0.006917,0.0,balanced,gini,,sqrt,,0.0,5,3,0.0,13,best,"{'tree__ccp_alpha': 0.0, 'tree__class_weight':...",0.952222,0.906926,0.882706,0.928847,0.928531,0.904762,0.974437,0.902778,0.975045,0.975045,0.93313,0.032526,1
820,0.004455,0.000763,0.003392,0.003268,0.0,balanced,gini,,sqrt,,0.0,5,2,0.0,13,best,"{'tree__ccp_alpha': 0.0, 'tree__class_weight':...",0.952222,0.906926,0.882706,0.928847,0.928531,0.904762,0.974437,0.902778,0.975045,0.975045,0.93313,0.032526,1
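
The `criterion` parameter searched above decides how a node's impurity is measured when candidate splits are compared. The sketch below computes both measures from class counts; the child-node counts are hypothetical and serve only to show that a purer node scores lower.

```python
import math


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


root = [212, 212]    # the balanced dataset before any split
split = [200, 12]    # hypothetical child node, for illustration only
print(gini(root), entropy(root))
print(round(gini(split), 3), round(entropy(split), 3))
```

On the balanced dataset the root node starts at the maximum impurity for two classes (Gini 0.5, entropy 1 bit), so every useful split lowers it.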


* Tuned hyperparameters

In [21]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_tree = Pipeline(steps=[('scaler', StandardScaler()),
                ('tree',DecisionTreeClassifier(class_weight='balanced', max_depth=6,
                                        max_features='sqrt', min_samples_leaf=5,
                                        random_state=13))])

score = cross_val_score(clf_tree, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.951     0.915     0.933       212
   Malignant      0.918     0.953     0.935       212

    accuracy                          0.934       424
   macro avg      0.935     0.934     0.934       424
weighted avg      0.935     0.934     0.934       424



<a id='8'></a>
## 8) [Random Forest Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>)

* Default hyperparameters

In [22]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_rf = Pipeline([('scaler', StandardScaler()), ('rf', RandomForestClassifier(random_state=13))])
                       
score = cross_val_score(clf_rf, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.972     0.967     0.969       212
   Malignant      0.967     0.972     0.969       212

    accuracy                          0.969       424
   macro avg      0.969     0.969     0.969       424
weighted avg      0.969     0.969     0.969       424



* Hyperparameter tuning using Grid Search

In [45]:
param_grid = {
    'rf__bootstrap': [True,False],
    'rf__max_depth': [5, 10 , None],
    'rf__n_estimators' : [10,50,100,200,500],
    'rf__max_features': [None, 'sqrt', 'log2'],
    'rf__max_leaf_nodes' : [None,5,10],
    'rf__min_samples_leaf': [1,3,5],
    'rf__min_samples_split': list(range(2, 6)),
    'rf__criterion' :['entropy','gini'], 
    'rf__random_state' : [13]
}

grid_search = GridSearchCV(clf_rf, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 6480 candidates, totalling 64800 fits

Best hyperparameters :  {'rf__bootstrap': True, 'rf__criterion': 'gini', 'rf__max_depth': 10, 'rf__max_features': 'sqrt', 'rf__max_leaf_nodes': None, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'rf__n_estimators': 200, 'rf__random_state': 13}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('rf',
                 RandomForestClassifier(max_depth=10, max_features='sqrt',
                                        n_estimators=200, random_state=13))])



index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rf__bootstrap,param_rf__criterion,param_rf__max_depth,param_rf__max_features,param_rf__max_leaf_nodes,param_rf__min_samples_leaf,param_rf__min_samples_split,param_rf__n_estimators,param_rf__random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
2883,0.484187,0.342774,0.048369,0.06519,True,gini,,sqrt,,1,2,200,13,"{'rf__bootstrap': True, 'rf__criterion': 'gini...",1.0,0.953261,0.928847,0.976541,1.0,0.952273,0.974437,0.951945,1.0,0.97551,0.971281,0.023217,1
3068,0.729526,0.491401,0.070481,0.075215,True,gini,,log2,,1,3,200,13,"{'rf__bootstrap': True, 'rf__criterion': 'gini...",1.0,0.953261,0.928847,0.976541,1.0,0.952273,0.974437,0.951945,1.0,0.97551,0.971281,0.023217,1
2888,0.708085,0.482986,0.057033,0.059812,True,gini,,sqrt,,1,3,200,13,"{'rf__bootstrap': True, 'rf__criterion': 'gini...",1.0,0.953261,0.928847,0.976541,1.0,0.952273,0.974437,0.951945,1.0,0.97551,0.971281,0.023217,1
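
Several of the tied rows differ only in `max_features`. With the 13 RFECV-selected columns, `'sqrt'` and `'log2'` resolve to the same number of candidate features per split. The sketch below assumes the rule described in the scikit-learn documentation (square root or base-2 logarithm of the feature count, truncated to an integer); `n_features_per_split` is a hypothetical helper.

```python
import math


def n_features_per_split(setting, n_features):
    """Number of candidate features examined at each split for a given setting."""
    if setting is None:
        return n_features
    if setting == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if setting == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported setting: {setting!r}")


n_selected = 13  # number of RFECV-selected columns in x_new
resolved = {s: n_features_per_split(s, n_selected) for s in (None, "sqrt", "log2")}
print(resolved)
```

Under that assumption both settings examine 3 features per split, so with the same `random_state` they would grow the same forest. This is consistent with rows 3068 and 2888 having identical scores in every fold.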


* Tuned hyperparameters

In [54]:
originalclass = []
predictedclass = []
  
# Cross validate
# max_depth=None is used here; in the grid search it ties at rank 1 with the reported best max_depth=10.
clf_rf = Pipeline(steps=[('scaler', StandardScaler()),
                ('rf',RandomForestClassifier(bootstrap=True,criterion='gini',max_depth=None,max_features='sqrt',max_leaf_nodes=None,
                                        min_samples_leaf=1,min_samples_split=2,n_estimators=200,
                                        random_state=13))])
                       
score = cross_val_score(clf_rf, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.972     0.972     0.972       212
   Malignant      0.972     0.972     0.972       212

    accuracy                          0.972       424
   macro avg      0.972     0.972     0.972       424
weighted avg      0.972     0.972     0.972       424



<a id='9'></a>
## 9) [AdaBoost Classifier (Adaptive Boosting)](<https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#:~:text=An%20AdaBoost%20%5B1%5D%20classifier%20is,focus%20more%20on%20difficult%20cases.>)

* Default hyperparameters

In [23]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_adaboost = Pipeline([('scaler', StandardScaler()), ('adab', AdaBoostClassifier(random_state=13))])

score = cross_val_score(clf_adaboost, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.958     0.967     0.962       212
   Malignant      0.967     0.958     0.962       212

    accuracy                          0.962       424
   macro avg      0.962     0.962     0.962       424
weighted avg      0.962     0.962     0.962       424



* Hyperparameter tuning using Grid Search

In [25]:
param_grid = {
    'adab__base_estimator' : [DecisionTreeClassifier(class_weight='balanced', max_depth=6,max_features='sqrt', min_samples_leaf=5,random_state=13)],
    'adab__n_estimators' : [10,50,100,500],
    'adab__learning_rate' : np.power(10, np.arange(-3, 1, dtype=float)),
    'adab__algorithm' : ['SAMME', 'SAMME.R'],
    'adab__random_state' : [13],
}

grid_search = GridSearchCV(clf_adaboost, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 32 candidates, totalling 320 fits

Best hyperparameters :  {'adab__algorithm': 'SAMME', 'adab__base_estimator': DecisionTreeClassifier(class_weight='balanced', max_depth=6,
                       max_features='sqrt', min_samples_leaf=5,
                       random_state=13), 'adab__learning_rate': 1.0, 'adab__n_estimators': 500, 'adab__random_state': 13}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('adab',
                 AdaBoostClassifier(algorithm='SAMME',
                                    base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                          max_depth=6,
                                                                          max_features='sqrt',
                                                                          min_samples_leaf=5,
                                                                          random_state=13),


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_adab__algorithm,param_adab__base_estimator,param_adab__learning_rate,param_adab__n_estimators,param_adab__random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
15,1.271772,0.191445,0.046178,0.002334,SAMME,DecisionTreeClassifier(class_weight='balanced'...,1.0,500,13,"{'adab__algorithm': 'SAMME', 'adab__base_estim...",1.0,0.929624,0.952851,0.953261,0.976177,0.952273,1.0,0.951945,1.0,0.97551,0.969164,0.023678,1
31,1.09425,0.051077,0.062733,0.004476,SAMME.R,DecisionTreeClassifier(class_weight='balanced'...,1.0,500,13,"{'adab__algorithm': 'SAMME.R', 'adab__base_est...",1.0,0.952851,0.928847,0.953261,1.0,0.952273,0.974437,0.951945,1.0,0.97551,0.968912,0.023762,2
30,0.272437,0.034635,0.019698,0.004308,SAMME.R,DecisionTreeClassifier(class_weight='balanced'...,1.0,100,13,"{'adab__algorithm': 'SAMME.R', 'adab__base_est...",1.0,0.952851,0.952851,0.953261,0.976177,0.928531,0.974437,0.951945,1.0,0.97551,0.966556,0.021642,3


* Tuned hyperparameters

In [26]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_adaboost = Pipeline(steps=[('scaler', StandardScaler()),
                ('adab',AdaBoostClassifier(algorithm='SAMME',
                                    base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                          max_depth=6,
                                                                          max_features='sqrt',
                                                                          min_samples_leaf=5,
                                                                          random_state=13),
                                    n_estimators=500, random_state=13))])

score = cross_val_score(clf_adaboost, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.963     0.976     0.970       212
   Malignant      0.976     0.962     0.969       212

    accuracy                          0.969       424
   macro avg      0.969     0.969     0.969       424
weighted avg      0.969     0.969     0.969       424



<a id='10'></a>
## 10) [C-Support Vector Classification](<https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html>)

* Default hyperparameters

In [27]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_svc = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

score = cross_val_score(clf_svc, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.963     0.972     0.967       212
   Malignant      0.971     0.962     0.967       212

    accuracy                          0.967       424
   macro avg      0.967     0.967     0.967       424
weighted avg      0.967     0.967     0.967       424



* Hyperparameter tuning using Grid Search

In [28]:
param_grid = [
    {
        'svc__kernel': ['rbf'], 
        'svc__gamma': [1e-2, 1e-3, 1e-4,'auto','scale'], 
        'svc__C': [1, 10, 100, 1000],
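        # decision_function_shape is ignored for binary classification, so 'ovo' and 'ovr' score identically
        'svc__decision_function_shape': ['ovo', 'ovr'],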
        'svc__random_state' : [13]
    },
    {
        'svc__kernel': ['linear'], 
        'svc__C': [1, 10, 100, 1000],
        'svc__decision_function_shape': ['ovo', 'ovr'],
        'svc__random_state' : [13]
    },
]

grid_search = GridSearchCV(clf_svc, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 48 candidates, totalling 480 fits

Best hyperparameters :  {'svc__C': 10, 'svc__decision_function_shape': 'ovo', 'svc__gamma': 0.01, 'svc__kernel': 'rbf', 'svc__random_state': 13}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('svc',
                 SVC(C=10, decision_function_shape='ovo', gamma=0.01,
                     random_state=13))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_svc__C,param_svc__decision_function_shape,param_svc__gamma,param_svc__kernel,param_svc__random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
10,0.007081,0.003823,0.012267,0.018439,10,ovo,0.01,rbf,13,"{'svc__C': 10, 'svc__decision_function_shape':...",0.976282,0.952851,0.952851,0.976541,1.0,0.976068,0.974437,0.975848,1.0,1.0,0.978488,0.016555,1
15,0.005585,0.00111,0.004887,0.008022,10,ovr,0.01,rbf,13,"{'svc__C': 10, 'svc__decision_function_shape':...",0.976282,0.952851,0.952851,0.976541,1.0,0.976068,0.974437,0.975848,1.0,1.0,0.978488,0.016555,1
32,0.004488,0.000499,0.002294,0.000457,1000,ovo,0.0001,rbf,13,"{'svc__C': 1000, 'svc__decision_function_shape...",0.906522,0.976282,0.952851,0.976541,0.976177,0.976068,0.974437,0.975848,1.0,1.0,0.971473,0.025126,3


* Tuned hyperparameters

In [29]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_svc = Pipeline(steps=[('scaler', StandardScaler()),
                ('svc',SVC(C=10, decision_function_shape='ovo', gamma=0.01,
                     random_state=13))])

score = cross_val_score(clf_svc, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.972     0.986     0.979       212
   Malignant      0.986     0.972     0.979       212

    accuracy                          0.979       424
   macro avg      0.979     0.979     0.979       424
weighted avg      0.979     0.979     0.979       424



<a id='11'></a>
## 11) [Stochastic Gradient Descent Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html>)

* Default hyperparameters

In [30]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_sgd = Pipeline([('scaler', StandardScaler()), ('sgd', SGDClassifier(random_state=13))])

score = cross_val_score(clf_sgd, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.962     0.962     0.962       212
   Malignant      0.962     0.962     0.962       212

    accuracy                          0.962       424
   macro avg      0.962     0.962     0.962       424
weighted avg      0.962     0.962     0.962       424



* Hyperparameter tuning using Grid Search

In [31]:
param_grid = {
    'sgd__average': [True, False],
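    # l1_ratio only takes effect with penalty='elasticnet'; under the default penalty='l2' it is ignored
    'sgd__l1_ratio': np.linspace(0, 1, num=10),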
    'sgd__alpha': np.power(10, np.arange(-2, 1, dtype=float)),
    'sgd__random_state' : [13]
}

grid_search = GridSearchCV(clf_sgd, param_grid=param_grid, n_jobs=-1, cv=cv,verbose=4,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 60 candidates, totalling 600 fits

Best hyperparameters :  {'sgd__alpha': 0.01, 'sgd__average': False, 'sgd__l1_ratio': 0.0, 'sgd__random_state': 13}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('sgd',
                 SGDClassifier(alpha=0.01, l1_ratio=0.0, random_state=13))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_sgd__alpha,param_sgd__average,param_sgd__l1_ratio,param_sgd__random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
17,0.00359,0.000488,0.001795,0.000399,0.01,False,0.777778,13,"{'sgd__alpha': 0.01, 'sgd__average': False, 's...",0.929624,0.976282,0.952851,0.976541,1.0,0.976068,0.974437,0.951945,1.0,1.0,0.973775,0.022241,1
11,0.007281,0.007151,0.006283,0.011311,0.01,False,0.111111,13,"{'sgd__alpha': 0.01, 'sgd__average': False, 's...",0.929624,0.976282,0.952851,0.976541,1.0,0.976068,0.974437,0.951945,1.0,1.0,0.973775,0.022241,1
19,0.00369,0.000638,0.002194,0.000599,0.01,False,1.0,13,"{'sgd__alpha': 0.01, 'sgd__average': False, 's...",0.929624,0.976282,0.952851,0.976541,1.0,0.976068,0.974437,0.951945,1.0,1.0,0.973775,0.022241,1


* Tuned hyperparameters

In [32]:
originalclass = []
predictedclass = []

# Cross validate
clf_sgd = Pipeline(steps=[('scaler', StandardScaler()),
                ('sgd',SGDClassifier(alpha=0.01, l1_ratio=0.0, random_state=13))])

score = cross_val_score(clf_sgd, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.963     0.986     0.974       212
   Malignant      0.986     0.962     0.974       212

    accuracy                          0.974       424
   macro avg      0.974     0.974     0.974       424
weighted avg      0.974     0.974     0.974       424



<a id='12'></a>
## 12) [eXtreme Gradient Boosting](<https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters>)

* Default hyperparameters

In [48]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_xgb = Pipeline([('scaler', StandardScaler()), ('xgb', XGBClassifier(random_state=13))])

score = cross_val_score(clf_xgb, x_new, y_le, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.967     0.953     0.960       212
   Malignant      0.953     0.967     0.960       212

    accuracy                          0.960       424
   macro avg      0.960     0.960     0.960       424
weighted avg      0.960     0.960     0.960       424



* Hyperparameter tuning using Grid Search

In [46]:
# https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning/notebook
# https://www.cs.cornell.edu/courses/cs4780/2018sp/lectures/lecturenote19.html
# https://medium.com/data-design/xgboost-hi-im-gamma-what-can-i-do-for-you-and-the-tuning-of-regularization-a42ea17e6ab6

param_grid = {
        'xgb__booster' : ['gbtree'],
        'xgb__validate_parameters' : [True],
        'xgb__learning_rate' : [0.05,0.1,0.3,0.5,1],
        'xgb__gamma' : [0,0.01,0.1,0.5,1],
        'xgb__max_depth' : [2,6,10],
        'xgb__min_child_weight' : [1,3,5],
        'xgb__max_delta_step' : [0,2,4],
        'xgb__subsample' : [0.5],
        'xgb__colsample_bylevel' : [1],
        'xgb__colsample_bynode' : [1],
        'xgb__colsample_bytree' : [1],
        'xgb__reg_lambda' : [0,1],
        'xgb__reg_alpha' : [0],
        'xgb__tree_method' : ['exact'],
        'xgb__scale_pos_weight' : [1],
        'xgb__objective' : ['binary:logistic'], # 'multi:softmax' -> same scores as 'binary:logistic'
        #'num_class' : [2],
        'xgb__n_estimators' : [50,100,200,500],
        'xgb__random_state' : [13]
    }

grid_search = GridSearchCV(clf_xgb, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y_le)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 5400 candidates, totalling 54000 fits

Best hyperparameters :  {'xgb__booster': 'gbtree', 'xgb__colsample_bylevel': 1, 'xgb__colsample_bynode': 1, 'xgb__colsample_bytree': 1, 'xgb__gamma': 0.1, 'xgb__learning_rate': 0.1, 'xgb__max_delta_step': 0, 'xgb__max_depth': 6, 'xgb__min_child_weight': 1, 'xgb__n_estimators': 100, 'xgb__objective': 'binary:logistic', 'xgb__random_state': 13, 'xgb__reg_alpha': 0, 'xgb__reg_lambda': 0, 'xgb__scale_pos_weight': 1, 'xgb__subsample': 0.5, 'xgb__tree_method': 'exact', 'xgb__validate_parameters': True}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('xgb',
                 XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, early_stopping_rounds=None,
                               enable_categorical=False, eval_metric=None,
                               gam

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgb__booster,param_xgb__colsample_bylevel,param_xgb__colsample_bynode,param_xgb__colsample_bytree,param_xgb__gamma,param_xgb__learning_rate,param_xgb__max_delta_step,param_xgb__max_depth,param_xgb__min_child_weight,param_xgb__n_estimators,param_xgb__objective,param_xgb__random_state,param_xgb__reg_alpha,param_xgb__reg_lambda,param_xgb__scale_pos_weight,param_xgb__subsample,param_xgb__tree_method,param_xgb__validate_parameters,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
3554,0.276024,0.074177,0.014313,0.028999,gbtree,1,1,1,0.5,0.1,2,6,1,100,binary:logistic,13,0,0,1,0.5,exact,True,"{'xgb__booster': 'gbtree', 'xgb__colsample_byl...",1.0,0.953261,0.952851,0.952851,1.0,0.952273,0.974437,0.951945,1.0,1.0,0.973762,0.022331,1
3578,0.216231,0.070748,0.004188,0.001246,gbtree,1,1,1,0.5,0.1,2,10,1,100,binary:logistic,13,0,0,1,0.5,exact,True,"{'xgb__booster': 'gbtree', 'xgb__colsample_byl...",1.0,0.953261,0.952851,0.952851,1.0,0.952273,0.974437,0.951945,1.0,1.0,0.973762,0.022331,1
2546,0.245253,0.020755,0.003291,0.000457,gbtree,1,1,1,0.1,0.1,4,6,1,100,binary:logistic,13,0,0,1,0.5,exact,True,"{'xgb__booster': 'gbtree', 'xgb__colsample_byl...",1.0,0.953261,0.952851,0.952851,1.0,0.952273,0.974437,0.951945,1.0,1.0,0.973762,0.022331,1


* Tuned hyperparameters

In [53]:
originalclass = []
predictedclass = []

# Cross validate one of the tied rank-1 candidates (row 3554 above: gamma=0.5, max_delta_step=2)
clf_xgb = Pipeline(steps=[('scaler', StandardScaler()),
                ('xgb',XGBClassifier(booster='gbtree',gamma=0.5,learning_rate=0.1,max_delta_step=2,max_depth=6,min_child_weight=1,
                                    n_estimators=100,objective='binary:logistic',reg_alpha=0,reg_lambda=0,scale_pos_weight=1,subsample=0.5,
                                    tree_method='exact',validate_parameters=True,random_state=13))])

score = cross_val_score(clf_xgb, x_new, y_le, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.981     0.967     0.974       212
   Malignant      0.967     0.981     0.974       212

    accuracy                          0.974       424
   macro avg      0.974     0.974     0.974       424
weighted avg      0.974     0.974     0.974       424



<a id='13'></a>
## 13) [Light Gradient Boosting Machine](<https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html>)

* Default hyperparameters

In [42]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_lgbm = Pipeline([('scaler', StandardScaler()), ('lgbm', lgbm.LGBMClassifier(random_state=13))])

score = cross_val_score(clf_lgbm, x_new, y_le, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.958     0.962     0.960       212
   Malignant      0.962     0.958     0.960       212

    accuracy                          0.960       424
   macro avg      0.960     0.960     0.960       424
weighted avg      0.960     0.960     0.960       424



* Hyperparameter tuning using Grid Search

In [43]:
# https://neptune.ai/blog/lightgbm-parameters-guide
# https://www.youtube.com/watch?v=5CWwwtEM2TA&ab_channel=PyData & https://github.com/MSusik/newgradientboosting/blob/master/pydata.pdf

param_grid = {
        'lgbm__boosting_type' : ['gbdt','dart'],
        'lgbm__num_leaves' : [10,20,30,40,50],
        'lgbm__max_depth' : [3,6,9,-1],
        'lgbm__learning_rate' : [0.05,0.1,0.3,0.5,1],
        'lgbm__n_estimators' : [50,100,200,500],
        'lgbm__objective' : ['binary'],
        'lgbm__min_child_samples' : [10,20,30],
        'lgbm__subsample' : [0.5],
        'lgbm__reg_lambda' : [0,1],
        'lgbm__reg_alpha' : [0],
        'lgbm__colsample_bytree' : [1],
        'lgbm__scale_pos_weight' : [1],
        'lgbm__random_state' : [13]
    }

grid_search = GridSearchCV(clf_lgbm, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y_le)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 4800 candidates, totalling 48000 fits

Best hyperparameters :  {'lgbm__boosting_type': 'dart', 'lgbm__colsample_bytree': 1, 'lgbm__learning_rate': 1, 'lgbm__max_depth': 3, 'lgbm__min_child_samples': 10, 'lgbm__n_estimators': 200, 'lgbm__num_leaves': 10, 'lgbm__objective': 'binary', 'lgbm__random_state': 13, 'lgbm__reg_alpha': 0, 'lgbm__reg_lambda': 0, 'lgbm__scale_pos_weight': 1, 'lgbm__subsample': 0.5}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('lgbm',
                 LGBMClassifier(boosting_type='dart', colsample_bytree=1,
                                learning_rate=1, max_depth=3,
                                min_child_samples=10, n_estimators=200,
                                num_leaves=10, objective='binary',
                                random_state=13, reg_alpha=0, reg_lambda=0,
                                scale_pos_weight=1, subsample=0.5))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lgbm__boosting_type,param_lgbm__colsample_bytree,param_lgbm__learning_rate,param_lgbm__max_depth,param_lgbm__min_child_samples,param_lgbm__n_estimators,param_lgbm__num_leaves,param_lgbm__objective,param_lgbm__random_state,param_lgbm__reg_alpha,param_lgbm__reg_lambda,param_lgbm__scale_pos_weight,param_lgbm__subsample,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
4340,0.084377,0.026207,0.003144,0.000998,dart,1,1,3,10,200,10,binary,13,0,0,1,0.5,"{'lgbm__boosting_type': 'dart', 'lgbm__colsamp...",1.0,0.953261,0.952851,0.952851,1.0,0.952273,0.974437,0.975848,1.0,0.97551,0.973703,0.01957,1
4342,0.065576,0.018901,0.002494,0.000805,dart,1,1,3,10,200,20,binary,13,0,0,1,0.5,"{'lgbm__boosting_type': 'dart', 'lgbm__colsamp...",1.0,0.953261,0.952851,0.952851,1.0,0.952273,0.974437,0.975848,1.0,0.97551,0.973703,0.01957,1
4344,0.07535,0.016888,0.005636,0.009605,dart,1,1,3,10,200,30,binary,13,0,0,1,0.5,"{'lgbm__boosting_type': 'dart', 'lgbm__colsamp...",1.0,0.953261,0.952851,0.952851,1.0,0.952273,0.974437,0.975848,1.0,0.97551,0.973703,0.01957,1


* Tuned hyperparameters

In [44]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_lgbm = Pipeline(steps=[('scaler', StandardScaler()),
                ('lgbm',lgbm.LGBMClassifier(boosting_type='dart', colsample_bytree=1,
                                learning_rate=1, max_depth=3,
                                min_child_samples=10, n_estimators=200,
                                num_leaves=10, objective='binary',
                                random_state=13, reg_alpha=0, reg_lambda=0,
                                scale_pos_weight=1, subsample=0.5))])

score = cross_val_score(clf_lgbm, x_new, y_le, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.976     0.972     0.974       212
   Malignant      0.972     0.976     0.974       212

    accuracy                          0.974       424
   macro avg      0.974     0.974     0.974       424
weighted avg      0.974     0.974     0.974       424



<a id='14'></a>
## 14) [K-Nearest Neighbors Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html>)

* Default hyperparameters

In [34]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_knn = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

score = cross_val_score(clf_knn, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.932     0.972     0.952       212
   Malignant      0.970     0.929     0.949       212

    accuracy                          0.950       424
   macro avg      0.951     0.950     0.950       424
weighted avg      0.951     0.950     0.950       424



* Hyperparameter tuning using Grid Search

In [35]:
param_grid = {
    'knn__n_neighbors': list(range(2,10)),
    'knn__weights': ['uniform','distance'],
    'knn__algorithm' : ['ball_tree', 'kd_tree', 'brute'],
    'knn__leaf_size': [10,20,30,40,50],
    'knn__p': [1,2],  # p is only used by the 'minkowski' metric
    'knn__metric': ['minkowski','manhattan','chebyshev']
}

grid_search = GridSearchCV(clf_knn, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=False).head(3)

Fitting 10 folds for each of 1440 candidates, totalling 14400 fits

Best hyperparameters :  {'knn__algorithm': 'ball_tree', 'knn__leaf_size': 10, 'knn__metric': 'minkowski', 'knn__n_neighbors': 5, 'knn__p': 1, 'knn__weights': 'uniform'}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('knn',
                 KNeighborsClassifier(algorithm='ball_tree', leaf_size=10,
                                      p=1))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_knn__algorithm,param_knn__leaf_size,param_knn__metric,param_knn__n_neighbors,param_knn__p,param_knn__weights,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
815,0.003941,0.00212,0.003391,0.001197,kd_tree,40,manhattan,5,2,distance,"{'knn__algorithm': 'kd_tree', 'knn__leaf_size'...",1.0,0.976282,0.928847,0.953261,1.0,0.952273,1.0,0.951945,0.97551,0.97551,0.971363,0.023228,1
813,0.003391,0.000662,0.004588,0.004808,kd_tree,40,manhattan,5,1,distance,"{'knn__algorithm': 'kd_tree', 'knn__leaf_size'...",1.0,0.976282,0.928847,0.953261,1.0,0.952273,1.0,0.951945,0.97551,0.97551,0.971363,0.023228,1
781,0.004289,0.000897,0.004288,0.002277,kd_tree,40,minkowski,5,1,distance,"{'knn__algorithm': 'kd_tree', 'knn__leaf_size'...",1.0,0.976282,0.928847,0.953261,1.0,0.952273,1.0,0.951945,0.97551,0.97551,0.971363,0.023228,1


* Tuned hyperparameters

In [36]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_knn = Pipeline(steps=[('scaler', StandardScaler()),
                ('knn',KNeighborsClassifier(algorithm='ball_tree', leaf_size=10,
                                      p=1))])

score = cross_val_score(clf_knn, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.959     0.986     0.972       212
   Malignant      0.985     0.958     0.971       212

    accuracy                          0.972       424
   macro avg      0.972     0.972     0.972       424
weighted avg      0.972     0.972     0.972       424



<a id='15'></a>
## 15) [Multi-layer Perceptron Classifier](<https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html>)

* Default hyperparameters

In [37]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_mlp =  Pipeline([('scaler', StandardScaler()), ('mlp', MLPClassifier(random_state=13))])

score = cross_val_score(clf_mlp, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.963     0.972     0.967       212
   Malignant      0.971     0.962     0.967       212

    accuracy                          0.967       424
   macro avg      0.967     0.967     0.967       424
weighted avg      0.967     0.967     0.967       424



* Hyperparameter tuning using Grid Search

In [39]:
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
param_grid = {
    'mlp__hidden_layer_sizes' : [(26,52,)],
    'mlp__activation' : ['tanh','relu'],
    'mlp__solver' : ['sgd','adam'],
    'mlp__alpha' : [0.01, 0, 2],  # three values: 0.01, 0 and 2
    'mlp__batch_size' : [40,80,'auto'],
    'mlp__learning_rate' : ['invscaling','adaptive'],
    'mlp__learning_rate_init' : np.power(10, np.arange(-3, 0, dtype=float)),
    'mlp__power_t' : [0.5],
    'mlp__max_iter' : [50,100,200,500],
    'mlp__shuffle' : [True],
    'mlp__random_state' : [13]
}

grid_search = GridSearchCV(clf_mlp, param_grid=param_grid, n_jobs=-1,cv=cv,verbose=1,scoring='f1_macro')
grid_search.fit(x_new, y)

print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='rank_test_score').head(3)

Fitting 10 folds for each of 864 candidates, totalling 8640 fits

Best hyperparameters :  {'mlp__activation': 'tanh', 'mlp__alpha': 2, 'mlp__batch_size': 80, 'mlp__hidden_layer_sizes': (26, 52), 'mlp__learning_rate': 'invscaling', 'mlp__learning_rate_init': 0.001, 'mlp__max_iter': 200, 'mlp__power_t': 0.5, 'mlp__random_state': 13, 'mlp__shuffle': True, 'mlp__solver': 'adam'}

Best estimator :  Pipeline(steps=[('scaler', StandardScaler()),
                ('mlp',
                 MLPClassifier(activation='tanh', alpha=2, batch_size=80,
                               hidden_layer_sizes=(26, 52),
                               learning_rate='invscaling', random_state=13))])



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_mlp__activation,param_mlp__alpha,param_mlp__batch_size,param_mlp__hidden_layer_sizes,param_mlp__learning_rate,param_mlp__learning_rate_init,param_mlp__max_iter,param_mlp__power_t,param_mlp__random_state,param_mlp__shuffle,param_mlp__solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
341,0.599881,0.031228,0.002693,0.002445,tanh,2,80,"(26, 52)",invscaling,0.001,200,0.5,13,True,adam,"{'mlp__activation': 'tanh', 'mlp__alpha': 2, '...",0.976282,0.976282,0.929624,0.976541,1.0,0.976068,0.974437,0.975848,1.0,1.0,0.978508,0.019542,1
374,0.984467,0.100239,0.002293,0.00078,tanh,2,80,"(26, 52)",adaptive,0.01,500,0.5,13,True,sgd,"{'mlp__activation': 'tanh', 'mlp__alpha': 2, '...",0.976282,0.976282,0.929624,0.976541,1.0,0.976068,0.974437,0.975848,1.0,1.0,0.978508,0.019542,1
365,0.721471,0.084561,0.002592,0.001493,tanh,2,80,"(26, 52)",adaptive,0.001,200,0.5,13,True,adam,"{'mlp__activation': 'tanh', 'mlp__alpha': 2, '...",0.976282,0.976282,0.929624,0.976541,1.0,0.976068,0.974437,0.975848,1.0,1.0,0.978508,0.019542,1


* Tuned hyperparameters

In [40]:
originalclass = []
predictedclass = []
  
# Cross validate
clf_mlp =  Pipeline(steps=[('scaler', StandardScaler()),
                ('mlp',MLPClassifier(activation='tanh', alpha=2, batch_size=80,
                               hidden_layer_sizes=(26, 52),
                               learning_rate='invscaling', random_state=13))])


score = cross_val_score(clf_mlp, x_new, y, scoring=make_scorer(classification_report_with_accuracy_score),cv=cv)
print(classification_report(originalclass, predictedclass, target_names=target_names, digits=3))

              precision    recall  f1-score   support

      Benign      0.972     0.986     0.979       212
   Malignant      0.986     0.972     0.979       212

    accuracy                          0.979       424
   macro avg      0.979     0.979     0.979       424
weighted avg      0.979     0.979     0.979       424



* A larger range of hyperparameters was tested at first, but the search was too time-consuming. The worst-performing candidates were then identified with the following code, and the hyperparameter values corresponding to those results were removed from the grid.

In [41]:
# print_best_params(grid_search)
grid_search_results = pd.DataFrame(grid_search.cv_results_)
grid_search_results.sort_values(by='mean_test_score',ascending=True).head(5) # worst 5

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_mlp__activation,param_mlp__alpha,param_mlp__batch_size,param_mlp__hidden_layer_sizes,param_mlp__learning_rate,param_mlp__learning_rate_init,param_mlp__max_iter,param_mlp__power_t,param_mlp__random_state,param_mlp__shuffle,param_mlp__solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
818,0.10442,0.038276,0.002593,0.001278,relu,2.0,auto,"(26, 52)",invscaling,0.001,100,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.577049,0.562217,0.436905,0.692857,0.499687,0.616555,0.534694,0.616555,0.68678,0.475,0.56983,0.081306,861
816,0.121774,0.053503,0.002294,0.001003,relu,2.0,auto,"(26, 52)",invscaling,0.001,50,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.577049,0.562217,0.436905,0.692857,0.499687,0.616555,0.534694,0.616555,0.68678,0.475,0.56983,0.081306,861
820,0.103224,0.034546,0.001995,0.000631,relu,2.0,auto,"(26, 52)",invscaling,0.001,200,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.577049,0.562217,0.436905,0.692857,0.499687,0.616555,0.534694,0.616555,0.68678,0.475,0.56983,0.081306,861
822,0.10851,0.063902,0.002593,0.001111,relu,2.0,auto,"(26, 52)",invscaling,0.001,500,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 2, '...",0.577049,0.562217,0.436905,0.692857,0.499687,0.616555,0.534694,0.616555,0.68678,0.475,0.56983,0.081306,861
528,0.079487,0.017508,0.001994,0.000446,relu,0.01,auto,"(26, 52)",invscaling,0.001,50,0.5,13,True,sgd,"{'mlp__activation': 'relu', 'mlp__alpha': 0.01...",0.59543,0.562217,0.436905,0.692857,0.499687,0.616555,0.534694,0.616555,0.68678,0.475,0.571668,0.081655,853
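
One way to decide which hyperparameter values to drop is to count how often each value occurs among the worst-scoring rows. The sketch below is only an illustration of that idea: `worst_value_counts` is a hypothetical helper, not part of the notebook, and the toy dictionary stands in for `grid_search.cv_results_` with made-up scores.

```python
import pandas as pd

def worst_value_counts(cv_results, n_worst=5, score_col="mean_test_score"):
    """Count how often each hyperparameter value occurs among the n_worst rows.

    Hypothetical helper: `cv_results` is assumed to have the layout of
    GridSearchCV.cv_results_ (one `param_*` column per hyperparameter).
    """
    results = pd.DataFrame(cv_results)
    worst = results.nsmallest(n_worst, score_col)
    param_cols = [c for c in results.columns if c.startswith("param_")]
    counts = {}
    for col in param_cols:
        # Cast to str so that tuple values such as (26, 52) are counted safely
        counts[col] = worst[col].astype(str).value_counts().to_dict()
    return counts

# Toy stand-in for grid_search.cv_results_ (illustrative values only)
toy_results = {
    "param_mlp__solver": ["sgd", "sgd", "adam", "adam", "sgd", "adam"],
    "param_mlp__learning_rate": ["invscaling", "invscaling", "constant",
                                 "constant", "invscaling", "adaptive"],
    "mean_test_score": [0.57, 0.58, 0.97, 0.96, 0.60, 0.95],
}

counts = worst_value_counts(toy_results, n_worst=3)
print(counts)
```

A value that accounts for most of the worst rows is a candidate for removal from the grid. In the output above, for example, all five worst rows share `solver='sgd'` together with `learning_rate='invscaling'`.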


<a id='16'></a>
## 16) Summary

* The tables below summarize the results obtained with this feature selection method (RFECV).
* The algorithms are listed in descending order of performance.
* All results are averages over a 10-fold cross-validation.
* The columns report the accuracy and the averaged precision, recall and F1 score.
* The Benign and Malignant classes contain the same number of samples (212 each), so the weighted average and the macro average are equal.
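
The equality of the macro and weighted averages can be checked with a few lines of arithmetic. The sketch below uses the per-class precision values (0.972 and 0.986) and supports (212 and 212) from the classification report that precedes this section. The unbalanced supports in the second call are hypothetical and serve only as a contrast.

```python
def macro_avg(values):
    """Unweighted mean of the per-class metric values."""
    return sum(values) / len(values)

def weighted_avg(values, supports):
    """Mean of the per-class metric values, weighted by class support."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total

precision = [0.972, 0.986]   # Benign, Malignant (from the report above)
supports = [212, 212]        # balanced classes

macro = macro_avg(precision)
weighted = weighted_avg(precision, supports)
print(round(macro, 3), round(weighted, 3))

# Hypothetical unbalanced supports: the two averages no longer coincide
weighted_unbalanced = weighted_avg(precision, [300, 124])
print(round(weighted_unbalanced, 4))
```

With equal supports every class receives the weight 1/2, which is exactly the weight used by the macro average, so the two values agree for any metric.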

<table>
    <tr>
        <th colspan="5"> RFECV : Default algorithms</th>
    </tr>
    <tr>
        <th></th>
        <th>precision </th>
        <th>recall</th>
        <th>f1 score</th>
        <th>accuracy</th>  
    </tr>
    <tr>
        <th>Random Forest</th>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
    </tr>
    <tr>
        <th>MLP</th>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
    </tr>
    <tr>
        <th>SVC</th>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
        <td>0.967</td>
    </tr>
    <tr>
        <th>AdaBoost</th>
        <td>0.962</td>
        <td>0.962</td>
        <td>0.962</td>
        <td>0.962</td>
    </tr>
    <tr>
        <th>SGD</th>
        <td>0.962</td>
        <td>0.962</td>
        <td>0.962</td>
        <td>0.962</td>
    </tr>
    <tr>
        <th>LGBM</th>
        <td>0.960</td>
        <td>0.960</td>
        <td>0.960</td>
        <td>0.960</td>
    </tr>
    <tr>
        <th>XGBoost</th>
        <td>0.960</td>
        <td>0.960</td>
        <td>0.960</td>
        <td>0.960</td>
    </tr>
    <tr>
        <th>Ridge</th>
        <td>0.954</td>
        <td>0.953</td>
        <td>0.953</td>
        <td>0.953</td>
    </tr>
    <tr>
        <th>LDA</th>
        <td>0.952</td>
        <td>0.950</td>
        <td>0.950</td>
        <td>0.950</td>
    </tr>
    <tr>
        <th>KNN</th>
        <td>0.951</td>
        <td>0.950</td>
        <td>0.950</td>
        <td>0.950</td>
    </tr>
    <tr>
        <th>QDA</th>
        <td>0.948</td>
        <td>0.948</td>
        <td>0.948</td>
        <td>0.948</td>
    </tr>
    <tr>
        <th>GNB</th>
        <td>0.928</td>
        <td>0.927</td>
        <td>0.927</td>
        <td>0.927</td>
    </tr>
    <tr>
        <th>Decision Tree</th>
        <td>0.913</td>
        <td>0.913</td>
        <td>0.913</td>
        <td>0.913</td>
    </tr>

</table>

<table>
    <tr>
        <th colspan="5"> RFECV : Tuned algorithms</th>
    </tr>
    <tr>
        <th></th>
        <th>precision </th>
        <th>recall</th>
        <th>f1 score</th>
        <th>accuracy</th>  
    </tr>
    <tr>
        <th>MLP</th>
        <td>0.979</td>
        <td>0.979</td>
        <td>0.979</td>
        <td>0.979</td>
    </tr>
    <tr>
        <th>SVC</th>
        <td>0.979</td>
        <td>0.979</td>
        <td>0.979</td>
        <td>0.979</td>
    </tr>
    <tr>
        <th>LGBM</th>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
    </tr>
    <tr>
        <th>XGBoost</th>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
    </tr>
    <tr>
        <th>SGD</th>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
        <td>0.974</td>
    </tr>
    <tr>
        <th>KNN</th>
        <td>0.972</td>
        <td>0.972</td>
        <td>0.972</td>
        <td>0.972</td>
    </tr>
    <tr>
        <th>Random Forest</th>
        <td>0.972</td>
        <td>0.972</td>
        <td>0.972</td>
        <td>0.972</td>
    </tr>
    <tr>
        <th>AdaBoost</th>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
        <td>0.969</td>
    </tr>
    <tr>
        <th>LDA</th>
        <td>0.960</td>
        <td>0.958</td>
        <td>0.958</td>
        <td>0.958</td>
    </tr>
    <tr>
        <th>Ridge</th>
        <td>0.954</td>
        <td>0.953</td>
        <td>0.953</td>
        <td>0.953</td>
    </tr>
    <tr>
        <th>QDA</th>
        <td>0.948</td>
        <td>0.948</td>
        <td>0.948</td>
        <td>0.948</td>
    </tr>
    <tr>
        <th>Decision Tree</th>
        <td>0.935</td>
        <td>0.934</td>
        <td>0.934</td>
        <td>0.934</td>
    </tr>
    <tr>
        <th>GNB</th>
        <td>0.936</td>
        <td>0.934</td>
        <td>0.934</td>
        <td>0.934</td>
    </tr>

</table>