# Introduction

This work exercises the concepts covered in class by applying four supervised learning techniques (KNN, Naive Bayes, Decision Tree and Artificial Neural Network) to classification on five public data sets.

The data sets are:

[Credit Approval](https://archive.ics.uci.edu/ml/datasets/Credit+Approval)

[Speaker Accent Recognition](https://archive.ics.uci.edu/ml/datasets/Speaker+Accent+Recognition)

[Breast Cancer Wisconsin (Diagnostic)](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)

[Z-Alizadeh Sani](https://archive.ics.uci.edu/ml/datasets/Z-Alizadeh+Sani)

Stroke (Acidente Vascular Cerebral, AVC), provided by the instructor

# Methodology

## Data Sets

The data sets had to be preprocessed before the classifiers could use them, since every attribute must be numeric.

The `pandas` library was used for this purpose.
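A minimal sketch of this kind of preprocessing, using a small made-up frame (the column names and values are illustrative and not taken from the data sets): rows with a missing-value marker are dropped, categorical strings are mapped to integer codes, and the remaining string columns are cast to numeric types.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: '?' marks a missing value
toy = pd.DataFrame({
    "sex": ["a", "b", "?", "b"],
    "age": ["30.83", "58.67", "24.50", "27.83"],
    "approved": ["+", "-", "+", "-"],
})

# Drop rows that contain a missing-value marker
toy = toy.replace({"?": np.nan}).dropna().reset_index(drop=True)

# Map each categorical column to integer codes
toy["sex"] = toy["sex"].map({"a": 0, "b": 1})
toy["approved"] = toy["approved"].map({"-": 0, "+": 1})

# Cast the remaining string column to numbers
toy["age"] = pd.to_numeric(toy["age"])

# True when every column has a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)
```

Mapping to integers rather than digit strings keeps the frame numeric in pandas itself, so a check such as `all_numeric` can confirm the result before any classifier is fitted.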

## Hyperparameter Tuning

The `sklearn` library was used both to tune the hyperparameters and to instantiate the classifiers.

<br>**KNN:** only the number of neighbors (`n_neighbors`) was tuned.

<br>**Naive Bayes:** no parameter was tuned.

<br>**Decision Tree:** the tuned parameters were the maximum depth, the split criterion and the minimum number of samples required to split a node.

<br>**Artificial Neural Network:** the tuned parameters were the learning-rate schedule (`learning_rate`), the number of epochs, the activation function and the number of neurons in the hidden layers.

# Solution

Imports

In [None]:
import numpy as np
import pandas as pd
import math
from copy import deepcopy
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn import metrics

## Credit Approval Data Set

In [None]:
credit = pd.read_csv('crx.data', header=None)
credit

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,00202,0,+
1,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,00043,560,+
2,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,00280,824,+
3,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,00100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,00120,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,00260,0,-
686,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,00200,394,-
687,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,00200,1,-
688,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,00280,750,-


Treating the data

In [None]:
# Removing rows with missing values
credit = credit.replace({'?': np.nan}).dropna().reset_index(drop=True)


# Convert strings to numbers
credit[0] = credit[0].map({'a': '0',
                           'b': '1'})

credit[3] = credit[3].map({'u': '0',
                           'y': '1',
                           'l': '2',
                           't': '3'})

credit[4] = credit[4].map({'g': '0',
                           'p': '1',
                           'gg': '2'})

credit[5] = credit[5].map({'c': '0',
                           'd': '1',
                           'cc': '2',
                           'i': '3',
                           'j': '4',
                           'k': '5',
                           'm': '6',
                           'r': '7',
                           'q': '8',
                           'w': '9',
                           'x': '10',
                           'e': '11',
                           'aa': '12',
                           'ff': '13'})

credit[6] = credit[6].map({'v': '0',
                           'h': '1',
                           'bb': '2',
                           'j': '3',
                           'n': '4',
                           'z': '5',
                           'dd': '6',
                           'ff': '7',
                           'o': '8'})
credit[12] = credit[12].map({'g': '0',
                             'p': '1',
                             's': '2'})

credit[15] = credit[15].map({'-': '0',
                             '+': '1'})

credit = credit.replace({'f': '0',
                         't': '1'})
                         
credit

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1,30.83,0.000,0,0,9,0,1.25,1,1,1,0,0,00202,0,1
1,0,58.67,4.460,0,0,8,1,3.04,1,1,6,0,0,00043,560,1
2,0,24.50,0.500,0,0,8,1,1.50,1,0,0,0,0,00280,824,1
3,1,27.83,1.540,0,0,9,0,3.75,1,1,5,1,0,00100,3,1
4,1,20.17,5.625,0,0,9,0,1.71,1,0,0,0,2,00120,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
648,1,21.08,10.085,1,1,11,1,1.25,0,0,0,0,0,00260,0,0
649,0,22.67,0.750,0,0,0,0,2.00,0,1,2,1,0,00200,394,0
650,0,25.25,13.500,1,1,13,7,2.00,0,1,1,1,0,00200,1,0
651,1,17.92,0.205,0,0,12,0,0.04,0,0,0,0,0,00280,750,0


Checking imbalance

In [None]:
credit[15].value_counts()

0    357
1    296
Name: 15, dtype: int64
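The counts above can also be read as proportions, which makes the degree of imbalance easier to compare across data sets. A small sketch, assuming a synthetic label series with the same counts as the output above (the helper name `imbalance_ratio` is illustrative):

```python
import pandas as pd

def imbalance_ratio(labels: pd.Series) -> float:
    """Return the majority-class count divided by the minority-class count."""
    counts = labels.value_counts()
    return counts.max() / counts.min()

# Synthetic labels reproducing the Credit Approval counts (357 vs 296)
labels = pd.Series([0] * 357 + [1] * 296)

proportions = labels.value_counts(normalize=True)  # share of each class
ratio = imbalance_ratio(labels)

print(proportions)
print(ratio)
```

A ratio close to 1 indicates roughly balanced classes; much larger values suggest that resampling may be worth considering.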

## Speaker Accent Recognition Data Set

In [None]:
accent = pd.read_csv('accent-mfcc-data-1.csv')
accent

Unnamed: 0,language,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12
0,ES,7.071476,-6.512900,7.650800,11.150783,-7.657312,12.484021,-11.709772,3.426596,1.462715,-2.812753,0.866538,-5.244274
1,ES,10.982967,-5.157445,3.952060,11.529381,-7.638047,12.136098,-12.036247,3.491943,0.595441,-4.508811,2.332147,-6.221857
2,ES,7.827108,-5.477472,7.816257,9.187592,-7.172511,11.715299,-13.847214,4.574075,-1.687559,-7.204041,-0.011847,-6.463144
3,ES,6.744083,-5.688920,6.546789,9.000183,-6.924963,11.710766,-12.374388,6.169879,-0.544747,-6.019237,1.358559,-6.356441
4,ES,5.836843,-5.326557,7.472265,8.847440,-6.773244,12.677218,-12.315061,4.416344,0.193500,-3.644812,2.151239,-6.816310
...,...,...,...,...,...,...,...,...,...,...,...,...,...
324,US,-0.525273,-3.868338,3.548304,1.496249,3.490753,5.849887,-7.747027,9.738836,-11.754543,7.129909,0.209947,-1.946914
325,US,-2.094001,-1.073113,1.217397,-0.550790,2.666547,7.449942,-6.418064,10.907098,-11.134323,6.728373,2.461446,-0.026113
326,US,2.116909,-4.441482,5.350392,3.675396,2.715876,3.682670,-4.500850,11.798565,-12.031005,7.566142,-0.606010,-2.245129
327,US,0.299616,0.324844,3.299919,2.044040,3.634828,6.693840,-5.676224,12.000518,-11.912901,4.664406,1.197789,-2.230275


Treating the data

In [None]:
# Convert strings to numbers
accent['language'] = accent['language'].map({'US': '0',
                                             'UK': '1',
                                             'FR': '2',
                                             'GE': '3',
                                             'IT': '4',
                                             'ES': '5'})

accent

Unnamed: 0,language,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12
0,5,7.071476,-6.512900,7.650800,11.150783,-7.657312,12.484021,-11.709772,3.426596,1.462715,-2.812753,0.866538,-5.244274
1,5,10.982967,-5.157445,3.952060,11.529381,-7.638047,12.136098,-12.036247,3.491943,0.595441,-4.508811,2.332147,-6.221857
2,5,7.827108,-5.477472,7.816257,9.187592,-7.172511,11.715299,-13.847214,4.574075,-1.687559,-7.204041,-0.011847,-6.463144
3,5,6.744083,-5.688920,6.546789,9.000183,-6.924963,11.710766,-12.374388,6.169879,-0.544747,-6.019237,1.358559,-6.356441
4,5,5.836843,-5.326557,7.472265,8.847440,-6.773244,12.677218,-12.315061,4.416344,0.193500,-3.644812,2.151239,-6.816310
...,...,...,...,...,...,...,...,...,...,...,...,...,...
324,0,-0.525273,-3.868338,3.548304,1.496249,3.490753,5.849887,-7.747027,9.738836,-11.754543,7.129909,0.209947,-1.946914
325,0,-2.094001,-1.073113,1.217397,-0.550790,2.666547,7.449942,-6.418064,10.907098,-11.134323,6.728373,2.461446,-0.026113
326,0,2.116909,-4.441482,5.350392,3.675396,2.715876,3.682670,-4.500850,11.798565,-12.031005,7.566142,-0.606010,-2.245129
327,0,0.299616,0.324844,3.299919,2.044040,3.634828,6.693840,-5.676224,12.000518,-11.912901,4.664406,1.197789,-2.230275


Checking imbalance

In [None]:
accent['language'].value_counts()

0    165
1     45
2     30
3     30
4     30
5     29
Name: language, dtype: int64

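The classes are imbalanced (165 samples in the largest class against 29 to 45 in each of the others), so the training set needs to be balanced.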

## Breast Cancer Wisconsin (Diagnostic) Data Set

In [None]:
cancer = pd.read_csv('wdbc.data', header=None)
cancer

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


Treating the data

In [None]:
# Drop ID column
cancer = cancer.drop(columns=cancer.columns[0])
cancer.columns = range(cancer.shape[1])

# Convert strings to numbers
cancer[0] = cancer[0].map({'B': '0',
                           'M': '1'})

cancer

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,1,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,1,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,1,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,1,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,1,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,1,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,1,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


Checking imbalance

In [None]:
cancer[0].value_counts()

0    357
1    212
Name: 0, dtype: int64

## Z-Alizadeh Sani Data Set

In [None]:
coronary = pd.read_excel('Z-Alizadeh sani dataset.xlsx')
coronary

Unnamed: 0,Age,Weight,Length,Sex,BMI,DM,HTN,Current Smoker,EX-Smoker,FH,...,K,Na,WBC,Lymph,Neut,PLT,EF-TTE,Region RWMA,VHD,Cath
0,53,90,175,Male,29.387755,0,1,1,0,0,...,4.7,141,5700,39,52,261,50,0,N,Cad
1,67,70,157,Fmale,28.398718,0,1,0,0,0,...,4.7,156,7700,38,55,165,40,4,N,Cad
2,54,54,164,Male,20.077335,0,0,1,0,0,...,4.7,139,7400,38,60,230,40,2,mild,Cad
3,66,67,158,Fmale,26.838648,0,1,0,0,0,...,4.4,142,13000,18,72,742,55,0,Severe,Normal
4,50,87,153,Fmale,37.165193,0,1,0,0,0,...,4.0,140,9200,55,39,274,50,0,Severe,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,58,84,168,Male,29.761905,0,0,0,0,0,...,4.8,146,8500,34,58,251,45,0,N,Cad
299,55,64,152,Fmale,27.700831,0,0,0,0,0,...,4.0,139,11400,16,80,377,40,0,mild,Normal
300,48,77,160,Fmale,30.078125,0,1,0,0,1,...,4.0,140,9000,35,55,279,55,0,N,Normal
301,57,90,159,Fmale,35.599858,1,0,0,0,0,...,3.8,141,3800,48,40,208,55,0,N,Normal


Treating the data

In [None]:
# Convert strings to numbers
coronary['Sex'] = coronary['Sex'].map({'Fmale': '0',
                                       'Male': '1'})

coronary['BBB'] = coronary['BBB'].map({'N': '0',
                                       'LBBB': '1',
                                       'RBBB': '2'})

coronary['VHD'] = coronary['VHD'].map({'N': '0',
                                       'mild': '1',
                                       'Moderate': '2',
                                       'Severe': '3'})

coronary['Cath'] = coronary['Cath'].map({'Normal': '0',
                                         'Cad': '1'})

coronary = coronary.replace({'N': '0',
                             'Y': '1'})
coronary

Unnamed: 0,Age,Weight,Length,Sex,BMI,DM,HTN,Current Smoker,EX-Smoker,FH,...,K,Na,WBC,Lymph,Neut,PLT,EF-TTE,Region RWMA,VHD,Cath
0,53,90,175,1,29.387755,0,1,1,0,0,...,4.7,141,5700,39,52,261,50,0,0,1
1,67,70,157,0,28.398718,0,1,0,0,0,...,4.7,156,7700,38,55,165,40,4,0,1
2,54,54,164,1,20.077335,0,0,1,0,0,...,4.7,139,7400,38,60,230,40,2,1,1
3,66,67,158,0,26.838648,0,1,0,0,0,...,4.4,142,13000,18,72,742,55,0,3,0
4,50,87,153,0,37.165193,0,1,0,0,0,...,4.0,140,9200,55,39,274,50,0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,58,84,168,1,29.761905,0,0,0,0,0,...,4.8,146,8500,34,58,251,45,0,0,1
299,55,64,152,0,27.700831,0,0,0,0,0,...,4.0,139,11400,16,80,377,40,0,1,0
300,48,77,160,0,30.078125,0,1,0,0,1,...,4.0,140,9000,35,55,279,55,0,0,0
301,57,90,159,0,35.599858,1,0,0,0,0,...,3.8,141,3800,48,40,208,55,0,0,0


Checking imbalance

In [None]:
coronary['Cath'].value_counts()

1    216
0     87
Name: Cath, dtype: int64

The classes are imbalanced (216 Cad against 87 Normal), so the training set needs to be balanced.

## AVC Data Set

In [None]:
avc = pd.read_csv('AVC.csv', header=None)
avc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1,3.0,0,0,1,1,1,95.12,18.0,99,0
1,1,58.0,1,0,2,4,2,87.96,39.2,2,0
2,2,8.0,0,0,1,4,2,110.89,17.6,99,0
3,2,70.0,0,0,2,4,1,69.04,35.9,1,0
4,1,14.0,0,0,1,3,1,161.28,19.1,99,0
...,...,...,...,...,...,...,...,...,...,...,...
43395,2,10.0,0,0,1,1,2,58.64,20.4,2,0
43396,2,56.0,0,0,2,2,2,213.61,55.4,1,0
43397,2,82.0,1,0,2,4,2,91.94,28.9,1,0
43398,1,40.0,0,0,2,4,2,99.16,33.2,2,0


Treating the data

In [None]:
# Removing rows with missing values (coded as 99); note that replace() checks every column
avc = avc.replace({99: np.nan}).dropna().reset_index(drop=True)
avc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1.0,58.0,1.0,0.0,2.0,4.0,2.0,87.96,39.2,2.0,0.0
1,2.0,70.0,0.0,0.0,2.0,4.0,1.0,69.04,35.9,1.0,0.0
2,2.0,52.0,0.0,0.0,2.0,4.0,2.0,77.59,17.7,1.0,0.0
3,2.0,75.0,0.0,1.0,2.0,5.0,1.0,243.53,27.0,2.0,0.0
4,2.0,32.0,0.0,0.0,2.0,4.0,1.0,77.67,32.3,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
29058,2.0,10.0,0.0,0.0,1.0,1.0,2.0,58.64,20.4,2.0,0.0
29059,2.0,56.0,0.0,0.0,2.0,2.0,2.0,213.61,55.4,1.0,0.0
29060,2.0,82.0,1.0,0.0,2.0,4.0,2.0,91.94,28.9,1.0,0.0
29061,1.0,40.0,0.0,0.0,2.0,4.0,2.0,99.16,33.2,2.0,0.0


Checking imbalance

In [None]:
avc[10].value_counts()

0.0    28515
1.0      548
Name: 10, dtype: int64

The classes are heavily imbalanced (28515 negatives against 548 positives), so the training set needs to be balanced.

## Metrics

In [None]:
metrics_dict = {'accuracy': [],
                'sensitivity': [],
                'specificity': [],
                'gmean': []}

Computes the four metrics `n_iter` times for each data set

In [None]:
def computeMetrics(model_type, n_iter):
  # Create one dictionary to hold the metrics for each data set
  credit_metrics = deepcopy(metrics_dict)
  accent_metrics  = deepcopy(metrics_dict)
  cancer_metrics = deepcopy(metrics_dict)
  coronary_metrics = deepcopy(metrics_dict)
  avc_metrics = deepcopy(metrics_dict)

  for i in range(n_iter):
    # Credit Approval
    populateCreditMetrics(credit_metrics, model_type)

    # Speaker Accent Recognition
    populateAccentMetrics(accent_metrics, model_type)

    # Breast Cancer Wisconsin (Diagnostic)
    populateCancerMetrics(cancer_metrics, model_type)

    # Z-Alizadeh Sani
    populateCoronaryMetrics(coronary_metrics, model_type)

    # AVC
    populateAVCMetrics(avc_metrics, model_type)

  return credit_metrics, accent_metrics, cancer_metrics, coronary_metrics, avc_metrics

Utility functions

In [None]:
def getModelMetrics(model_type, X_train, Y_train, X_test, Y_test):
  if model_type == 'knn':
    return getKNNMetrics(X_train, Y_train, X_test, Y_test)
  elif model_type == 'bayes':
    return getBayesMetrics(X_train, Y_train, X_test, Y_test)
  elif model_type == 'dtree':
    return getDTreeMetrics(X_train, Y_train, X_test, Y_test)
  elif model_type == 'mlp':
    return getMLPMetrics(X_train, Y_train, X_test, Y_test)
  else:
    # Raise instead of assert, since assertions are skipped when Python runs with -O
    raise ValueError("Invalid model type! The ones available are: knn, bayes, dtree & mlp")

In [None]:
def getMetricsFromClassifier(classifier, X_test, Y_test):
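  # Accuracy on the held-out set
  acc = classifier.score(X_test, Y_test)

  predicted = classifier.predict(X_test)
  # Rows are true labels, columns are predicted labels (in sorted label order)
  CM = metrics.confusion_matrix(Y_test, predicted)
  # The label at index 1 is the positive class, the label at index 0 the negative one.
  # For data sets with more than two classes (Speaker Accent Recognition),
  # only the first two labels enter these two ratios.
  sens = CM[1,1]/(CM[1,0]+CM[1,1])
  spec = CM[0,0]/(CM[0,0]+CM[0,1])

  # Geometric mean of sensitivity and specificity
  gmean = math.sqrt(sens*spec)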
  
  return acc, sens, spec, gmean
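As a worked check of the formulas in `getMetricsFromClassifier`, the sketch below computes sensitivity and specificity for every class of a confusion matrix by treating each class in turn as the positive one (one-vs-rest). For a binary matrix, the entries for class 1 reduce to the two ratios used above. The helper name `per_class_sens_spec` and the example counts are made up for illustration; the per-class generalization is an assumption added here, not something the notebook computes.

```python
import math
import numpy as np

def per_class_sens_spec(cm):
    """One-vs-rest sensitivity and specificity for each class.

    cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                  # class k predicted as k
    fn = cm.sum(axis=1) - tp          # class k predicted as something else
    fp = cm.sum(axis=0) - tp          # other classes predicted as k
    tn = cm.sum() - tp - fn - fp      # everything not involving class k
    return tp / (tp + fn), tn / (tn + fp)

# Binary example: rows are true classes, columns are predictions
cm = np.array([[50, 10],
               [ 5, 35]])
sens, spec = per_class_sens_spec(cm)

# Class 1 as the positive class matches CM[1,1]/(CM[1,0]+CM[1,1])
# and CM[0,0]/(CM[0,0]+CM[0,1])
sens_pos, spec_pos = sens[1], spec[1]
gmean = math.sqrt(sens_pos * spec_pos)
```

For a six-class problem such as the accent data, the same function returns six sensitivity and six specificity values, which could then be averaged or reported per class.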

In [None]:
def appendMetricsToDict(dt, acc, sens, spec, gmean):
  dt['accuracy'].append(acc)
  dt['sensitivity'].append(sens)
  dt['specificity'].append(spec)
  dt['gmean'].append(gmean)


### Credit Approval

In [None]:
def populateCreditMetrics(credit_metrics, model_type):
  # Split the data into two sets, for training and testing
  msk = np.random.rand(len(credit)) < 0.8
  training = credit[msk]
  testing = credit[~msk]

  Y_train = training[15]
  X_train = training.drop(15, axis='columns')

  Y_test = testing[15]
  X_test = testing.drop(15, axis='columns')

  acc, sens, spec, gmean = getModelMetrics(model_type, X_train, Y_train, X_test, Y_test)

  appendMetricsToDict(credit_metrics, acc, sens, spec, gmean)
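The random mask used above yields test sets whose size and class mix vary from run to run. One possible alternative, sketched here on synthetic data rather than on the notebook's frames, is scikit-learn's `train_test_split` with the `stratify` argument, which keeps the class proportions the same in both parts. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # synthetic features
y = np.array([0] * 80 + [1] * 20)    # synthetic labels with an 80/20 imbalance

# Hold out exactly 20% of the rows, preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fraction of positives in each part
train_pos = y_train.mean()
test_pos = y_test.mean()
```

Passing a different `random_state` on each repetition would still give varied splits across the `n_iter` runs while keeping the class ratio fixed.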

### Speaker Accent Recognition

In [None]:
def populateAccentMetrics(accent_metrics, model_type):
  # Split the data into two sets, for training and testing
  msk = np.random.rand(len(accent)) < 0.8
  training = accent[msk]
  testing = accent[~msk]

  Y_train = training['language']
  X_train = training.drop('language', axis='columns')

  #print('Sampling on Speaker Accent Recognition data set ')
  #print('Original data set shape %s' % Counter(Y_train))

  # Using SMOTE to balance the data set
  SM = SMOTE(random_state=42, k_neighbors=15, n_jobs=-1)
  X_trainSM, Y_trainSM = SM.fit_resample(X_train, Y_train)

  #print('Resampled data set shape %s' % Counter(Y_trainSM))

  Y_test = testing['language']
  X_test = testing.drop('language', axis='columns')

  acc, sens, spec, gmean = getModelMetrics(model_type, X_trainSM, Y_trainSM, X_test, Y_test)

  appendMetricsToDict(accent_metrics, acc, sens, spec, gmean)

### Breast Cancer Wisconsin (Diagnostic)

In [None]:
def populateCancerMetrics(cancer_metrics, model_type):
  # Split the data into two sets, for training and testing
  msk = np.random.rand(len(cancer)) < 0.8
  training = cancer[msk]
  testing = cancer[~msk]

  Y_train = training[0]
  X_train = training.drop(0, axis='columns')

  Y_test = testing[0]
  X_test = testing.drop(0, axis='columns')

  acc, sens, spec, gmean = getModelMetrics(model_type, X_train, Y_train, X_test, Y_test)

  appendMetricsToDict(cancer_metrics, acc, sens, spec, gmean)

### Z-Alizadeh Sani

In [None]:
def populateCoronaryMetrics(coronary_metrics, model_type):
  # Split the data into two sets, for training and testing
  msk = np.random.rand(len(coronary)) < 0.8
  training = coronary[msk]
  testing = coronary[~msk]

  Y_train = training['Cath']
  X_train = training.drop('Cath', axis='columns')

  # Using SMOTE to balance the data set
  SM = SMOTE(random_state=42, k_neighbors=15, n_jobs=-1)
  X_trainSM, Y_trainSM = SM.fit_resample(X_train, Y_train)

  Y_test = testing['Cath']
  X_test = testing.drop('Cath', axis='columns')

  acc, sens, spec, gmean = getModelMetrics(model_type, X_trainSM, Y_trainSM, X_test, Y_test)

  appendMetricsToDict(coronary_metrics, acc, sens, spec, gmean)

### AVC

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler 
from collections import Counter

def populateAVCMetrics(avc_metrics, model_type):
  # Split the data into two sets, for training and testing
  msk = np.random.rand(len(avc)) < 0.8
  training = avc[msk]
  testing = avc[~msk]

  Y_train = training[10]
  X_train = training.drop(10, axis='columns')

  #print('Sampling on AVC data set ')
  #print('Original data set shape %s' % Counter(Y_train))

  #SM = SMOTE(random_state=42, k_neighbors=15, n_jobs=-1)
  #X_trainSM, Y_trainSM = SM.fit_resample(X_train, Y_train)

  randUS = RandomUnderSampler(sampling_strategy={0.0: 500}, random_state=42)

  X_trainSM, Y_trainSM = randUS.fit_resample(X_train, Y_train)

  #print('Resampled data set shape %s' % Counter(Y_trainSM))

  Y_test = testing[10]
  X_test = testing.drop(10, axis='columns')

  acc, sens, spec, gmean = getModelMetrics(model_type, X_trainSM, Y_trainSM, X_test, Y_test)

  appendMetricsToDict(avc_metrics, acc, sens, spec, gmean)
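For reference, the effect of random undersampling can be sketched with pandas alone: keep every row outside the majority class and draw a fixed number of majority rows without replacement. This is a simplified stand-in for illustration, not the `RandomUnderSampler` implementation; the function name, column names and counts are hypothetical.

```python
import pandas as pd

def undersample_majority(df, label_col, majority_label, n_keep, seed=42):
    """Keep n_keep random rows of the majority class and all other rows."""
    majority = df[df[label_col] == majority_label]
    rest = df[df[label_col] != majority_label]
    sampled = majority.sample(n=n_keep, random_state=seed)
    # Shuffle so the two classes are not grouped together
    combined = pd.concat([sampled, rest])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

# Synthetic frame: 1000 negatives and 20 positives
toy = pd.DataFrame({"x": range(1020), "stroke": [0.0] * 1000 + [1.0] * 20})

balanced = undersample_majority(toy, "stroke", majority_label=0.0, n_keep=50)
counts = balanced["stroke"].value_counts()
```

Only the training portion should be resampled in this way, as the function above does, so that the test set keeps the original class distribution.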

## KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

For adjusting the hyperparameters

In [None]:
params_KNN = {
    "n_neighbors": [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
}

num_folds = 5
kfold = StratifiedKFold(n_splits=num_folds)

n_iter = 10
gridKNN = RandomizedSearchCV(
    estimator=KNeighborsClassifier(), 
    param_distributions=params_KNN,
    cv=kfold,
    n_iter=n_iter,
    scoring='accuracy',
    n_jobs=-1
)

KNN Metrics

In [None]:
def getKNNMetrics(X_train, Y_train, X_test, Y_test):
  search = gridKNN.fit(X_train,Y_train)

  param = search.best_params_ # Best param found for the current training set

  knn = KNeighborsClassifier(param['n_neighbors'])
  knn.fit(X_train, Y_train)

  return getMetricsFromClassifier(knn, X_test, Y_test)
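Since `RandomizedSearchCV` refits the best configuration on the whole training set by default (`refit=True`), the fitted search object exposes that model as `best_estimator_`, which avoids building and fitting the classifier a second time. A sketch on synthetic data, with an illustrative search space:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters with 60 points each
X = np.vstack([rng.normal(0, 1, size=(60, 2)),
               rng.normal(4, 1, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)

search = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions={"n_neighbors": list(range(2, 21))},
    cv=StratifiedKFold(n_splits=5),
    n_iter=10,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

knn = search.best_estimator_                  # already fitted on X and y
best_k = search.best_params_["n_neighbors"]   # value chosen by the search
train_acc = knn.score(X, y)
```

Setting `random_state` on the search makes the sampled candidates reproducible, which can help when comparing runs.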

## Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

Naive Bayes Metrics

In [None]:
def getBayesMetrics(X_train, Y_train, X_test, Y_test):
  bayes = GaussianNB()
  bayes.fit(X_train, Y_train)

  return getMetricsFromClassifier(bayes, X_test, Y_test)

## Decision Tree

In [None]:
from sklearn import tree

For adjusting the hyperparameters

In [None]:
dTree = tree.DecisionTreeClassifier(random_state=0)

params_DT = {
    "max_depth": [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    "criterion": ['gini','entropy'],
    "min_samples_split" : [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,25,30]
}

# Cross validation
num_folds = 5
kfold = StratifiedKFold(n_splits=num_folds)

n_iter=10
gridDT = RandomizedSearchCV(
    estimator=dTree,
    param_distributions=params_DT,
    cv=kfold,
    scoring='accuracy',
    n_iter=n_iter,
    n_jobs=-1
)

Decision Tree Metrics

In [None]:
def getDTreeMetrics(X_train, Y_train, X_test, Y_test):
  search = gridDT.fit(X_train,Y_train)

  params = search.best_params_ # Best params found for the current training set

  max_depth = params['max_depth']
  criterion = params['criterion']
  min_samples_split = params['min_samples_split']

  dtree = tree.DecisionTreeClassifier(max_depth=max_depth, criterion=criterion, min_samples_split=min_samples_split, random_state=0)

  dtree.fit(X_train, Y_train)

  return getMetricsFromClassifier(dtree, X_test, Y_test)

## Neural Network (Multi-Layer Perceptron)

In [None]:
from sklearn.neural_network import MLPClassifier

For adjusting the hyperparameters

In [None]:
MLP = MLPClassifier(random_state=0)

params_MLP = {
    "learning_rate": ['constant', 'invscaling', 'adaptive'], # Learning-rate schedule; only used when solver='sgd'
    "max_iter": [150, 200, 250, 300], # Number of epochs
    "activation": ['logistic', 'tanh', 'relu'],
    "hidden_layer_sizes" : [(100,), (30,30,40), (30,40,30)]
}

# Cross validation
num_folds = 3
kfold = StratifiedKFold(n_splits=num_folds)

n_iter=5
gridMLP = RandomizedSearchCV(
    estimator=MLP,
    param_distributions=params_MLP,
    cv=kfold,
    scoring='accuracy',
    n_iter=n_iter,
    n_jobs=-1
)

MLP Metrics

In [None]:
def getMLPMetrics(X_train, Y_train, X_test, Y_test):
  search = gridMLP.fit(X_train,Y_train)

  params = search.best_params_ # Best params found for the current training set
  print(params)

  learning_rate = params['learning_rate']
  max_iter = params['max_iter']
  activation = params['activation']
  hidden_layer_sizes = params['hidden_layer_sizes']

  mlp = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, activation=activation, learning_rate=learning_rate, max_iter=max_iter, random_state=0)

  mlp.fit(X_train, Y_train)

  return getMetricsFromClassifier(mlp, X_test, Y_test)

# Results

Utility functions

In [None]:
def calculateMetricsMean(metrics_df):
  # Column-wise mean of each metric over all runs
  means = metrics_df[['accuracy', 'sensitivity', 'specificity', 'gmean']].mean()

  # DataFrame.append was removed in pandas 2.0; add the means as a row labelled 'mean'
  metrics_df = metrics_df.copy()
  metrics_df.loc['mean'] = means

  return metrics_df
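When comparing classifiers over repeated runs, the spread can be as informative as the mean. A short sketch using `DataFrame.agg`, with made-up metric values that are not results from this notebook:

```python
import pandas as pd

# Hypothetical metrics from three runs
runs = pd.DataFrame({
    "accuracy":    [0.80, 0.90, 0.85],
    "sensitivity": [0.70, 0.80, 0.75],
    "specificity": [0.90, 0.95, 1.00],
    "gmean":       [0.79, 0.87, 0.87],
})

# One row per statistic, one column per metric
summary = runs.agg(["mean", "std"])
print(summary)
```

Here `std` is the sample standard deviation, which is the pandas default (`ddof=1`).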

## KNN

In [None]:
credit_knn, accent_knn, cancer_knn, coronary_knn, avc_knn = computeMetrics('knn', 10)

Credit Approval

In [None]:
df = pd.DataFrame(credit_knn)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.66,0.366197,0.924051,0.581708
1,0.737589,0.596491,0.833333,0.705036
2,0.666667,0.507937,0.811594,0.642058
3,0.625,0.510638,0.691358,0.594167
4,0.630252,0.45,0.813559,0.605063
5,0.637037,0.447761,0.823529,0.607243
6,0.632353,0.409091,0.842857,0.587201
7,0.732824,0.58209,0.890625,0.720016
8,0.71831,0.535714,0.837209,0.669705
9,0.695035,0.454545,0.906667,0.641967


Speaker Accent Recognition

In [None]:
df = pd.DataFrame(accent_knn)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.813953,1.0,0.764706,0.874475
1,0.791667,1.0,0.913043,0.955533
2,0.808824,1.0,0.961538,0.980581
3,0.719298,0.857143,0.869565,0.863332
4,0.829268,0.4,0.969697,0.622799
5,0.75,0.75,0.92,0.830662
6,0.818182,1.0,0.878788,0.937437
7,0.774648,0.875,0.925926,0.900103
8,0.814815,1.0,0.9,0.948683
9,0.775862,0.833333,0.964286,0.896421


Breast Cancer Wisconsin (Diagnostic)

In [None]:
df = pd.DataFrame(cancer_knn)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.973214,0.955556,0.985075,0.970203
1,0.928571,0.916667,0.933333,0.924962
2,0.921875,0.865385,0.960526,0.911715
3,0.929825,0.818182,1.0,0.904534
4,0.920354,0.895833,0.938462,0.9169
5,0.913043,0.822222,0.971429,0.893717
6,0.947826,0.911111,0.971429,0.940787
7,0.966942,0.955556,0.973684,0.964577
8,0.909091,0.822222,0.969231,0.892705
9,0.95082,0.97561,0.938272,0.956759


Z-Alizadeh Sani

In [None]:
df = pd.DataFrame(coronary_knn)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.576271,0.533333,0.714286,0.617213
1,0.516129,0.521739,0.5,0.510754
2,0.454545,0.447368,0.470588,0.458831
3,0.584615,0.659574,0.388889,0.506459
4,0.385714,0.25,0.777778,0.440959
5,0.482759,0.457143,0.521739,0.488374
6,0.515152,0.608696,0.3,0.427327
7,0.509434,0.378378,0.8125,0.554466
8,0.586207,0.604651,0.533333,0.567874
9,0.513158,0.519231,0.5,0.509525


AVC

In [None]:
df = pd.DataFrame(avc_knn)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.727986,0.782178,0.727015,0.754092
1,0.744246,0.747475,0.74419,0.745831
2,0.706585,0.739837,0.705872,0.722655
3,0.710643,0.790909,0.709104,0.74889
4,0.73218,0.803419,0.730694,0.766194
5,0.709957,0.776596,0.708854,0.741952
6,0.740772,0.761905,0.740386,0.751068
7,0.708867,0.705882,0.708922,0.7074
8,0.732741,0.75,0.732397,0.741146
9,0.751622,0.739496,0.751873,0.745659


## Naive Bayes

In [None]:
credit_bayes, accent_bayes, cancer_bayes, coronary_bayes, avc_bayes = computeMetrics('bayes', 10)

Credit Approval

In [None]:
df = pd.DataFrame(credit_bayes)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.808511,0.704918,0.8875,0.790958
1,0.788991,0.577778,0.9375,0.73598
2,0.835938,0.758065,0.909091,0.83015
3,0.762712,0.625,0.925926,0.760726
4,0.78626,0.645161,0.913043,0.767503
5,0.770492,0.653846,0.857143,0.748625
6,0.773723,0.641791,0.9,0.760008
7,0.767606,0.621212,0.894737,0.745534
8,0.848214,0.707317,0.929577,0.810867
9,0.826772,0.659574,0.925,0.781093


Speaker Accent Recognition

In [None]:
df = pd.DataFrame(accent_bayes)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.545455,0.833333,0.789474,0.811107
1,0.506494,1.0,0.774194,0.879883
2,0.569444,1.0,0.88,0.938083
3,0.580645,0.666667,0.863636,0.758787
4,0.451613,0.888889,0.625,0.745356
5,0.586667,0.888889,0.7,0.788811
6,0.589744,0.714286,0.73913,0.726602
7,0.610169,0.833333,0.904762,0.868313
8,0.720588,0.888889,0.857143,0.872872
9,0.625,0.909091,0.833333,0.870388


Breast Cancer Wisconsin (Diagnostic)

In [None]:
df = pd.DataFrame(cancer_bayes)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.948718,0.902439,0.973684,0.937385
1,0.966667,0.962963,0.969697,0.966324
2,0.932203,0.867925,0.984615,0.924431
3,0.947368,0.868421,0.986842,0.92574
4,0.933884,0.826087,1.0,0.908893
5,0.96063,0.894737,0.988764,0.940576
6,0.963303,0.897436,1.0,0.947331
7,0.893443,0.808511,0.946667,0.874866
8,0.938596,0.861111,0.974359,0.915987
9,0.931034,0.878049,0.96,0.91811


Z-Alizadeh Sani

In [None]:
df = pd.DataFrame(coronary_bayes)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.777778,0.833333,0.666667,0.745356
1,0.810345,0.813953,0.8,0.806947
2,0.785714,0.789474,0.777778,0.783604
3,0.678571,0.682927,0.666667,0.674748
4,0.773585,0.820513,0.642857,0.726273
5,0.79661,0.833333,0.705882,0.766965
6,0.825397,0.809524,0.857143,0.832993
7,0.745763,0.804878,0.611111,0.701334
8,0.890625,0.923077,0.84,0.880559
9,0.8,0.833333,0.722222,0.775791


AVC

In [None]:
df = pd.DataFrame(avc_bayes)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.766483,0.685484,0.768254,0.72569
1,0.768608,0.71875,0.769732,0.743804
2,0.802658,0.589147,0.807458,0.689718
3,0.748957,0.747573,0.748982,0.748277
4,0.765907,0.756303,0.766105,0.761188
5,0.77905,0.716814,0.780258,0.747863
6,0.748837,0.690909,0.749956,0.719827
7,0.791408,0.534653,0.795961,0.652352
8,0.772913,0.669725,0.774946,0.720417
9,0.738219,0.685185,0.73923,0.711695


## Decision Tree

In [None]:
credit_dtree, accent_dtree, cancer_dtree, coronary_dtree, avc_dtree = computeMetrics('dtree', 10)

Credit Approval

In [None]:
df = pd.DataFrame(credit_dtree)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.883333,1.0,0.777778,0.881917
1,0.853147,0.934426,0.792683,0.860641
2,0.815385,0.741935,0.882353,0.809104
3,0.883041,0.939759,0.829545,0.882934
4,0.843284,0.811594,0.876923,0.843627
5,0.846154,0.925926,0.789474,0.854982
6,0.792308,0.86,0.75,0.803119
7,0.845528,0.92,0.794521,0.854961
8,0.839695,0.882353,0.8125,0.846706
9,0.851064,0.934426,0.7875,0.857823


Speaker Accent Recognition

In [None]:
df = pd.DataFrame(accent_dtree)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.619048,0.833333,1.0,0.912871
1,0.701754,1.0,0.894737,0.945905
2,0.616667,0.875,0.809524,0.841625
3,0.765625,0.666667,0.928571,0.786796
4,0.62069,0.666667,0.789474,0.725476
5,0.625,1.0,0.952381,0.9759
6,0.676471,0.666667,0.875,0.763763
7,0.587302,0.6,0.75,0.67082
8,0.69697,0.75,0.766667,0.758288
9,0.741935,0.8,0.903226,0.850047


Breast Cancer Wisconsin (Diagnostic)

In [None]:
df = pd.DataFrame(cancer_dtree)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.901786,0.808511,0.969231,0.885231
1,0.926606,0.897959,0.95,0.923613
2,0.932203,0.877551,0.971014,0.923101
3,0.907216,0.9,0.910448,0.905209
4,0.936364,0.85,0.985714,0.915345
5,0.908333,0.891304,0.918919,0.905006
6,0.865385,0.744186,0.95082,0.841182
7,0.944,0.888889,0.975,0.930949
8,0.958333,0.955556,0.96,0.957775
9,0.903509,0.836735,0.953846,0.893373


Z-Alizadeh Sani

In [None]:
df = pd.DataFrame(coronary_dtree)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.765957,0.909091,0.428571,0.624188
1,0.803922,0.891892,0.571429,0.7139
2,0.737705,0.770833,0.615385,0.688737
3,0.806452,0.837209,0.736842,0.785424
4,0.864407,0.897959,0.7,0.792825
5,0.754098,0.74359,0.772727,0.758019
6,0.791045,0.804348,0.761905,0.782839
7,0.785714,0.861111,0.65,0.748146
8,0.836364,0.880952,0.692308,0.780955
9,0.826667,0.807692,0.869565,0.838058


AVC

In [None]:
df = pd.DataFrame(avc_dtree)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.734729,0.765217,0.734115,0.749505
1,0.663032,0.823009,0.659892,0.736951
2,0.756335,0.737374,0.756666,0.746958
3,0.682709,0.788991,0.680604,0.732796
4,0.748042,0.726496,0.74848,0.737406
5,0.620707,0.852459,0.615749,0.7245
6,0.73106,0.698276,0.731711,0.714798
7,0.718584,0.767677,0.717739,0.742288
8,0.725633,0.7,0.726132,0.712946
9,0.75991,0.704762,0.760935,0.73231


## Neural Network (Multi-Layer Perceptron)

In [None]:
credit_mlp, accent_mlp, cancer_mlp, coronary_mlp, avc_mlp = computeMetrics('mlp', 10)

Credit Approval

In [None]:
df = pd.DataFrame(credit_mlp)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.88189,0.84375,0.920635,0.881354
1,0.787611,0.769231,0.803279,0.78607
2,0.835443,0.732394,0.91954,0.82065
3,0.813008,0.673913,0.896104,0.777108
4,0.8125,0.728571,0.891892,0.806106
5,0.835938,0.842105,0.830986,0.836527
6,0.798319,0.734694,0.842857,0.786919
7,0.807143,0.823529,0.791667,0.807441
8,0.792308,0.654545,0.893333,0.764675
9,0.764706,0.690141,0.829268,0.756513


Speaker Accent Recognition

In [None]:
df = pd.DataFrame(accent_mlp)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.833333,1.0,1.0,1.0
1,0.873239,1.0,1.0,1.0
2,0.791045,1.0,0.942857,0.971008
3,0.728571,1.0,1.0,1.0
4,0.831169,1.0,0.9,0.948683
5,0.777778,0.8,1.0,0.894427
6,0.744681,1.0,1.0,1.0
7,0.773333,0.714286,0.9375,0.818317
8,0.794521,1.0,0.9375,0.968246
9,0.716418,0.8,0.96,0.876356


Breast Cancer Wisconsin (Diagnostic)

In [None]:
df = pd.DataFrame(cancer_mlp)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.961905,0.969697,0.958333,0.963998
1,0.918919,0.804878,0.985714,0.890719
2,0.925,0.808511,1.0,0.899172
3,0.918699,0.837838,0.953488,0.893795
4,0.918182,0.87234,0.952381,0.911483
5,0.919643,0.853659,0.957746,0.904206
6,0.938776,0.916667,0.951613,0.933976
7,0.916031,0.836364,0.973684,0.902416
8,0.956522,0.868421,1.0,0.931891
9,0.93,0.97619,0.896552,0.935524


Z-Alizadeh Sani

In [None]:
df = pd.DataFrame(coronary_mlp)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.28,0.0,1.0,0.0
1,0.343284,0.022222,1.0,0.149071
2,0.344262,0.0,1.0,0.0
3,0.23913,0.027778,1.0,0.166667
4,0.724138,0.8,0.555556,0.666667
5,0.253731,0.019608,1.0,0.140028
6,0.68,0.636364,0.764706,0.697589
7,0.383333,0.142857,0.944444,0.367315
8,0.460317,0.25,0.947368,0.486664
9,0.373134,0.0,1.0,0.0


AVC

In [None]:
df = pd.DataFrame(avc_mlp)
df = calculateMetricsMean(df)

df

Unnamed: 0,accuracy,sensitivity,specificity,gmean
0,0.727709,0.761062,0.72705,0.743862
1,0.723783,0.791667,0.722648,0.756371
2,0.781655,0.719008,0.782992,0.750319
3,0.719298,0.773196,0.718393,0.745291
4,0.723819,0.777778,0.722889,0.749832
5,0.722174,0.688073,0.72283,0.705238
6,0.774843,0.780952,0.774733,0.777836
7,0.748711,0.733945,0.748993,0.741431
8,0.701126,0.79,0.699559,0.743406
9,0.75438,0.75,0.754471,0.752232


# Discussion

Because of time constraints, it was not possible to run more iterations when tuning the classifiers' parameters; a larger search would likely have improved their performance.
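As a rough illustration of what "more iterations" means here, the sketch below varies the `n_iter` argument of `RandomizedSearchCV` for a decision tree. The synthetic data, the parameter grid, and the helper `tune_tree` are assumptions made for this example only; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the notebook's datasets are not used here.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search space over the three tree parameters named in the methodology.
param_distributions = {
    "max_depth": [2, 4, 6, 8, None],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5, 10, 20],
}


def tune_tree(X, y, n_iter, seed=0):
    """Run a randomized search that samples n_iter parameter combinations."""
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=seed),
        param_distributions=param_distributions,
        n_iter=n_iter,  # number of sampled candidates: the tuning budget
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
        scoring="accuracy",
        random_state=seed,
    )
    search.fit(X, y)
    return search


small = tune_tree(X, y, n_iter=5)   # small budget
large = tune_tree(X, y, n_iter=20)  # larger budget, four times the fitting cost

print(small.best_params_, small.best_score_)
print(large.best_params_, large.best_score_)
```

A larger `n_iter` evaluates more candidates from the same search space, so it can only find a better configuration if one exists among the extra samples; it does not guarantee a higher score, and the cost grows linearly with the budget.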

For some imbalanced datasets, balancing the data did not substantially improve the performance of certain classifiers, for example KNN and Neural Network on the Z-Alizadeh Sani dataset.
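The minimal sketch below shows how the four reported metrics can be derived from a binary confusion matrix, and why a classifier that collapses to a single class scores a G-mean of zero. It assumes that G-mean is the square root of sensitivity times specificity, which is consistent with the values in the tables above. The helper `binary_metrics` and the labels are hypothetical, built only to reproduce that pattern; they are not taken from the notebook's `computeMetrics`.

```python
import math

from sklearn.metrics import confusion_matrix


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and G-mean for 0/1 labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "gmean": math.sqrt(sensitivity * specificity),
    }


# Hypothetical fold: 18 positives, 7 negatives, and a model that always predicts 0.
y_true = [1] * 18 + [0] * 7
y_pred = [0] * 25

collapsed = binary_metrics(y_true, y_pred)
print(collapsed)
```

In this constructed case every negative is classified correctly (specificity 1.0) while no positive is found (sensitivity 0.0), so the G-mean is 0.0 even though accuracy equals the share of negatives in the fold.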