# Stage 3 - Machine Learning

In stage 3, the data from the binary classification DataFrame and from the multiclass classification DataFrames are used to build several Machine Learning (ML) models. The goal is to select the model best suited to predicting the data under study.
This stage also generates attributes from an encoding of the sequences, used to determine the classes of transporter proteins. The encoding used is one-hot encoding, which converts each categorical variable (each amino acid in the sequence) into a binary vector that identifies that amino acid.
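As a minimal sketch of the idea, the snippet below one-hot encodes a short toy peptide in plain Python. The 21-symbol alphabet, with `X` at index 0 doubling as the padding symbol, is the one used by `encode_sequence` in this notebook; the helper name `one_hot_flat` and the toy sequence are illustrative assumptions only.

```python
ALPHABET = "XARNDCEQGHILKMFPSTWYV"  # 'X' (index 0) is also used as the padding symbol

def one_hot_flat(seq, seq_len):
    """Return the flattened one-hot encoding of seq, padded/truncated to seq_len."""
    padded = seq[:seq_len].ljust(seq_len, "X")  # 'post' truncation and padding
    flat = []
    for char in padded:
        vector = [0] * len(ALPHABET)        # one binary vector per position
        vector[ALPHABET.index(char)] = 1    # a single 1 marks the amino acid
        flat.extend(vector)
    return flat

encoded = one_hot_flat("ARN", seq_len=5)
print(len(encoded))   # 5 positions x 21 symbols
print(encoded[:21])   # first position: 'A' is index 1 in the alphabet
```

Each position contributes exactly one 1, so padded positions are also marked (in the `X` column) rather than left as all zeros.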


To that end, three types of classification are analysed: binary classification; balanced multiclass classification; multiclass classification of transporters. For each one, models are built from the 100 best physicochemical features (those used in the unsupervised learning of stage 2) and from the sequences encoded with one-hot encoding.


The models were built with the **ShallowML** class from the Propythia package. This class builds Random Forest, Support Vector Machine (SVM) and k-nearest neighbors (KNN) prediction models for the data in the different dataframes.
The `__init__` of the class first needs to receive the data used for training and the data used for testing the model. This split was made with the **train_test_split** function from Sklearn, assigning 75% of the dataset to training and 25% to testing. In addition, the split was stratified (so that every class is present in both the training and the test set) and always used the same random_state (so that the same entries are assigned to training and testing for the feature-based model and for the **one-hot-encoding** model).
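The library-free sketch below illustrates why stratifying and fixing the seed matter. It is not the Sklearn implementation: the helper `stratified_split` and the toy labels are hypothetical, and only the 25% test fraction and the seed value 42 are taken from the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx), sampling test rows separately within each class."""
    rng = random.Random(seed)                 # fixed seed -> reproducible split
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_size))  # every class reaches the test set
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 12 + [1] * 8                  # toy labels, not the real dataset
split_a = stratified_split(labels)           # e.g. rows of the feature matrix
split_b = stratified_split(labels)           # e.g. rows of the one-hot matrix
print(split_a == split_b)                    # same seed and labels -> same rows
```

Because the row indices depend only on the labels and the seed, two different feature representations of the same entries end up with the same train/test partition.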

Since the main goal of this stage is to predict different classifications (binary or multiclass), the Machine Learning models chosen for each classification were:

•	Random Forest - A supervised Machine Learning model that builds multiple decision trees and combines them to obtain a better and more stable prediction. It is used for both regression and classification problems.

•	SVM - “Support Vector Machine” is a supervised Machine Learning model for regression and classification problems, most commonly used for classification. Each sample is represented as a point in an n-dimensional space, where each feature value is one coordinate. The prediction is obtained from the hyperplane that best separates the different classes.

•	KNN - “K-nearest neighbors” is a supervised Machine Learning model used for classification and regression problems. It assigns each new sample the class most common among its K closest training samples, according to a distance measure. It is considered a “lazy learner algorithm” because the training set is not used to fit a model up front; it is simply stored and only used when a new sample has to be classified.
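To make the “lazy learner” behaviour concrete, here is a minimal plain-Python sketch of KNN classification by majority vote. It is an illustration only, not the Sklearn `KNeighborsClassifier` used through ShallowML; the function `knn_predict` and the toy points are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training samples."""
    # "lazy" learning: nothing is fitted beforehand, distances are computed on demand
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# toy 2D points in two well-separated groups (not real protein features)
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.15, 0.15)))
print(knn_predict(train_X, train_y, (0.95, 1.0)))
```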

To obtain the most effective model, this class implements a hyperparameter optimisation method, **Grid-Search**, available in ProPythia.
Grid search evaluates every combination of a predefined set of hyperparameter values in order to optimise the hyperparameters used when building Machine Learning models. This can be computationally expensive, because for every possible combination of hyperparameters Grid-Search builds several models through cross-validation, using only the training data. During all iterations, the hyperparameter combinations and the resulting models are kept, so that at the end the model best able to classify the data correctly can be identified. That model is finally evaluated on the test dataset, whose inputs were never used during the Grid-Search training.
The Grid-Search used the sets of hyperparameters predefined in the package, except where a grid is given explicitly (as for the SVM models).
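The enumeration step of a grid search can be sketched in a few lines of plain Python. The two grids below are copied from the SVM and KNN cells of this notebook; the helpers `expand_grid` and `grid_search`, and especially `toy_score`, are hypothetical stand-ins and do not reproduce what ProPythia or Sklearn do internally.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one dict per hyperparameter combination in the grid."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(param_grid, mean_cv_score):
    """Return the combination with the highest value of mean_cv_score(params)."""
    return max(expand_grid(param_grid), key=mean_cv_score)

svm_grid = {'clf__C': [0.1, 1.0, 10], 'clf__kernel': ['linear', 'rbf']}
knn_grid = {'clf__leaf_size': [15, 30, 60],
            'clf__n_neighbors': [2, 5, 10, 15],
            'clf__weights': ['uniform', 'distance']}
print(len(list(expand_grid(svm_grid))))   # number of SVM candidates
print(len(list(expand_grid(knn_grid))))   # number of KNN candidates

# toy stand-in for cross-validation: NOT a real score, it only exercises the search
def toy_score(params):
    return params['clf__C'] if params['clf__kernel'] == 'rbf' else 0.0

print(grid_search(svm_grid, toy_score))
```

In a real run each candidate is fitted once per cross-validation fold, which is why the cost grows with both the grid size and the value of `cv`.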


First, the required packages were imported and the function **encode_sequence** was defined to one-hot encode the sequences of the datasets. For each sequence it produces a single row with 21,000 entries (the flattened one-hot-encoding matrix: 1000 positions × 21 symbols).




In [1]:
import pandas as pd
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer
from sklearn.metrics import matthews_corrcoef
from propythia.shallow_ml import ShallowML
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'

def encode_sequence(sequences, seq_len, padding_truncating='post'):
    """One-hot encode protein sequences into one flat row per sequence."""
    # map each character to an integer; index 0 ('X') doubles as the padding symbol
    alphabet = "XARNDCEQGHILKMFPSTWYV"
    char_to_int = {c: i for i, c in enumerate(alphabet)}

    sequences_integer_encoded = []
    for seq in sequences:
        seq = seq.replace('X', '')  # unknown character eliminated
        # integer encode input data
        sequences_integer_encoded.append([char_to_int[char] for char in seq])
    # pad or truncate every sequence to exactly seq_len integers
    list_of_sequences_integer = pad_sequences(sequences_integer_encoded, maxlen=seq_len, dtype='int32',
                                              padding=padding_truncating, truncating=padding_truncating, value=0)

    # one hot encoding
    shape_hot = len(alphabet) * seq_len  # 21 * 1000 = 21000 when seq_len=1000
    # fix num_classes so the width does not depend on which residues occur in the data
    fps_x_2d = to_categorical(list_of_sequences_integer, num_classes=len(alphabet))  # shape (samples, seq_len, 21)
    fps_x_1d = pd.DataFrame(fps_x_2d.reshape(fps_x_2d.shape[0], shape_hot))  # shape (samples, seq_len * 21)
    return fps_x_1d

# Binary classification
This section covers the binary classification, transporter versus non-transporter, using the dataset generated in stage 1.
## Physicochemical features
First, the RF, SVM and KNN models were built from the 100 features selected in the previous stage. The dataset df_tra was imported and the labels were encoded as 0 - non-transporter and 1 - transporter. The distribution of the two classes can also be checked.

In [2]:
df_tra = pd.read_csv('df_Tra.csv', sep= ',')
df_tra = df_tra.drop(columns='Unnamed: 0')
dici = {'NonTra':0, 'Tra':1}
df_tra.transporter = df_tra.transporter.apply(lambda x: dici[x])
print(df_tra.groupby('transporter').size())
df_tra

transporter
0    16457
1    14048
dtype: int64


Unnamed: 0,sequence,transporter,_ChargeD2100,bomanindex,Gravy,_HydrophobicityD3100,_ChargeC2,_HydrophobicityD1100,_SolventAccessibilityD2100,_ChargeD2075,...,WC,HY,FC,MH,TP,_NormalizedVDWVC1,YK,KY,CN,QY
0,MSYKPIAPAPSSTPGSSTPGPGTPVPTGSVPSPSGSVPGAGAPFRP...,0,-0.663266,0.224854,-0.728483,2.080290,0.656979,0.738313,0.738313,-1.628206,...,-0.159045,-0.387705,-0.337443,-0.308190,5.663457,3.035941,2.249176,-0.551078,-0.255895,-0.479138
1,MSDDLPIDIHSSKLLDWLVSRRHCNKDWQKSVVAIREKIKHAILDM...,0,1.572972,0.917296,-0.899206,0.433034,-1.479925,-0.903760,-0.903760,1.628170,...,-0.159045,-0.387705,-0.337443,-0.308190,-0.746678,-0.726321,0.890937,0.731924,0.681034,1.386198
2,MPFDPAASPLSPSQARVLATLMEKARTVPDSYPMSLNGLLTGCNQK...,0,-0.436835,0.281929,-0.414872,0.694912,0.401152,-0.130083,-0.130083,-0.094844,...,-0.159045,-0.387705,-0.337443,-0.308190,0.375096,1.411704,-0.536955,-0.551078,1.711655,-0.479138
3,MIHFTKMHGLGNDFMVVDGVTQNVFFSPEQIRRLADRNFGIGFDQL...,0,-0.117689,0.280997,-0.368043,-0.083867,0.055034,-0.488394,-0.488394,0.020678,...,-0.159045,1.676105,-0.337443,2.207031,0.214843,-0.394844,-0.536955,-0.551078,-0.255895,-0.479138
4,MGSSTTEPDVGTTSNIETTTTLQNKNVNEVDQNKKSEQSNPSFKEV...,0,-0.613676,1.574354,-1.987419,1.986828,0.611833,-0.916583,-0.916583,-0.260411,...,-0.159045,2.306713,-0.337443,-0.308190,0.508641,-0.427992,-0.536955,-0.551078,0.868420,0.612766
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30500,MPEGKFCNRKPVNTEEDLKALLGDKGGAQYYKEMEELEVDQEALWA...,1,0.607386,0.510703,-0.801156,0.064474,-0.727494,-0.557507,-0.557507,0.611736,...,-0.159045,2.077401,0.722249,1.228890,0.401805,-0.792616,0.960590,-0.551078,2.789123,1.477190
30501,MQHANTNKSLMTPGNIITGIILVMGLVLTVLRFTKGIGAVSNLDDN...,1,-1.315992,-1.581525,1.513113,-1.290844,1.469604,1.822963,1.822963,-1.377985,...,-0.159045,1.102824,-0.337443,1.508359,-0.052246,-0.461140,0.368538,-0.551078,-0.255895,-0.479138
30502,MRFGVVVLAIILLTGCSAMSAISDLLPSKDGIEATAQAGESNQKTG...,1,-0.604466,-0.090436,0.218970,0.259392,0.596784,0.074191,0.074191,-0.236207,...,-0.159045,-0.387705,-0.337443,-0.308190,-0.746678,1.743181,-0.536955,-0.551078,-0.255895,-0.479138
30503,MILLTQSRFFSQKARCYITDNKRLFLPLLILIALVVPATRGFTLQA...,1,-1.496199,-1.693440,1.827002,-1.129750,1.710382,1.309433,1.309433,-1.234283,...,-0.159045,-0.387705,-0.337443,-0.308190,-0.052246,0.102371,0.368538,-0.551078,-0.255895,-0.479138


### Preparation for the dataset split
For the 30505 entries, the 100 selected features were kept, giving an input matrix with 100 columns and 30505 rows (one per sequence). The labels were also extracted from the dataset.

In [3]:
# take the 100 feature columns (positions 2 to 101) for all rows at once
fps_x_traf = df_tra.iloc[:, 2:102].to_numpy()
print(fps_x_traf.shape)
fps_x_traf

(30505, 100)


array([[-0.66326571,  0.22485399, -0.72848277, ...,  0.36082671,
        -0.57862866,  0.41571143],
       [ 1.572972  ,  0.91729598, -0.89920574, ...,  0.14714935,
        -0.22572944,  0.14744127],
       [-0.4368348 ,  0.28192886, -0.41487247, ...,  0.7555982 ,
        -0.42612783,  0.84436049],
       ...,
       [-0.60446629, -0.09043552,  0.21897018, ..., -0.79296243,
         0.50394728, -0.86403931],
       [-1.49619851, -1.69343959,  1.82700193, ..., -0.67360493,
         0.46107586, -0.74594211],
       [-0.82177621, -0.06858006, -0.01361906, ...,  0.50865165,
        -0.21208589,  0.59905367]])

In [4]:
fps_y_traf =df_tra['transporter'].to_numpy().flatten()
print(fps_y_traf.shape)
fps_y_traf

(30505,)


array([0, 0, 0, ..., 1, 1, 1], dtype=int64)

### Dataset split
The split was made as described in the introduction.

In [5]:
X_train_traf, X_test_traf, y_train_traf, y_test_traf = train_test_split(fps_x_traf, fps_y_traf, test_size=0.25, random_state = 42,stratify=fps_y_traf)
print(X_train_traf.shape, y_train_traf.shape)
print(X_test_traf.shape, y_test_traf.shape)

(22878, 100) (22878,)
(7627, 100) (7627,)


The models use 22878 entries for training and 7627 entries for testing.
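These sizes follow from the 25% test fraction. The short sketch below reproduces the arithmetic, assuming that the test count is rounded up and the remainder goes to training (which is our understanding of how Sklearn treats a fractional `test_size`); the helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) with the test count rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

print(split_sizes(30505))
```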
### RF
The first model trained was the Random Forest (RF), with the parameters predefined for the Grid Search.

In [6]:
RF = ShallowML(X_train_traf, X_test_traf, y_train_traf, y_test_traf, report_name='rfx_traf', columns_names = list(df_tra.columns))
best_rf_model = RF.train_best_model(model_name='rf',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=5,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = RF.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 224.51 seconds for 6 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        RandomForestClassifier(random_state=1))]),
             n_jobs=5,
             param_grid=[{'clf__bootstrap': [True], 'clf__criterion': ['gini'],
                          'clf__max_features': ['sqrt', 'log2'],
                          'clf__n_estimators': [10, 100, 500]}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.715 (std: 0.006)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}
 

Model with rank: 2
 Mean validation score: 0.709 (std: 0.006)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'log2', 'clf__n_estimators': 500}
 

Model with rank: 3
 Mean validation score: 0.708 

The best model obtained reached an MCC score of 0.708 and an accuracy of 85.2%, with the following hyperparameter dictionary: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}
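All models in this notebook are scored with the Matthews correlation coefficient (MCC), passed as `make_scorer(matthews_corrcoef)`. As a reminder of what that number measures, the sketch below computes the binary MCC from confusion-matrix counts. The function `mcc_binary` is a hypothetical helper and the counts are toy values, not results from this notebook.

```python
import math

def mcc_binary(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(mcc_binary(tp=50, tn=50, fp=0, fn=0))    # perfect prediction
print(mcc_binary(tp=25, tn=25, fp=25, fn=25))  # no better than chance
print(mcc_binary(tp=0, tn=0, fp=50, fn=50))    # every prediction wrong
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are not equally frequent.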
### SVM
For all SVM models, the hyperparameter grid tested the parameter C with the values 0.1, 1.0 and 10, and the kernel as either linear or Radial basis function (rbf).

In [7]:
SVM = ShallowML(X_train_traf, X_test_traf, y_train_traf, y_test_traf, report_name='svmx_traf', columns_names=list(df_tra.columns))
param_grid ={'clf__C': [0.1,1.0,10],
                        'clf__kernel': ['linear', 'rbf']}
best_svm_model = SVM.train_best_model(model_name='svm',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=3, optType='gridSearch', param_grid=param_grid,
                         n_jobs=5,random_state=1, n_iter=15, refit=True, probability=True)

print(best_svm_model)
scores, scores_per_class, cm, cm2 = SVM.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 1196.63 seconds for 6 candidate parameter settings.
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        SVC(probability=True,
                                            random_state=1))]),
             n_jobs=5,
             param_grid={'clf__C': [0.1, 1.0, 10],
                         'clf__kernel': ['linear', 'rbf']},
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.737 (std: 0.001)
 Parameters: {'clf__C': 10, 'clf__kernel': 'rbf'}
 

Model with rank: 2
 Mean validation score: 0.719 (std: 0.002)
 Parameters: {'clf__C': 1.0, 'clf__kernel': 'rbf'}
 

Model with rank: 3
 Mean validation score: 0.692 (std: 0.002)
 Parameters: {'clf__C': 10, 'clf__kernel': 'linear'}
 

make_scorer(matthews_corrcoef)
3
Best score (scorer: make_scorer(matthews_corrcoef)) and parameters from a 3-fold cross val

The best model obtained reached an MCC score of 0.737 and an accuracy of 87.5%, with the following hyperparameter dictionary: {'clf__C': 10, 'clf__kernel': 'rbf'}
### KNN

In [8]:
KNN = ShallowML(X_train_traf, X_test_traf, y_train_traf, y_test_traf, report_name='knnx_traf', columns_names = list(df_tra.columns))
best_knn_model = KNN.train_best_model(model_name='knn',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=4,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = KNN.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 91.26 seconds for 24 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf', KNeighborsClassifier())]),
             n_jobs=4,
             param_grid=[{'clf__leaf_size': [15, 30, 60],
                          'clf__n_neighbors': [2, 5, 10, 15],
                          'clf__weights': ['uniform', 'distance']}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.707 (std: 0.005)
 Parameters: {'clf__leaf_size': 15, 'clf__n_neighbors': 10, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.707 (std: 0.005)
 Parameters: {'clf__leaf_size': 30, 'clf__n_neighbors': 10, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.707 (std: 0.005)
 Parameters: {'clf__leaf_size': 60, 'clf__n_neighbors': 10, 'clf__weights': 'distance'}
 

make_scorer(matthews_corrcoef)
5
Be

The best model obtained reached an MCC score of 0.707 and an accuracy of 84.7%, with the following hyperparameter dictionary: {'clf__leaf_size': 15, 'clf__n_neighbors': 10, 'clf__weights': 'distance'}

In summary, the best score obtained from the features for the binary classification came from the SVM model, with an MCC of 0.737 and an accuracy of 87.5%.

## One-Hot-Encoding
### Preparation for the dataset split
The 30 505 sequences were transformed with the one-hot-encoding technique, giving an input matrix with 21 000 columns and 30 505 rows (one per sequence). The labels were also extracted from the dataset.
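Each of the 21 000 columns corresponds to one (sequence position, symbol) pair. The sketch below shows the mapping, assuming the row-major flattening that a default NumPy `reshape` produces on a (1000, 21) matrix; the helper `column_to_residue` is hypothetical.

```python
ALPHABET = "XARNDCEQGHILKMFPSTWYV"   # 21 symbols, as in encode_sequence
SEQ_LEN = 1000

def column_to_residue(column):
    """Map a flat one-hot column index to (sequence position, symbol)."""
    position, symbol_index = divmod(column, len(ALPHABET))
    return position, ALPHABET[symbol_index]

print(SEQ_LEN * len(ALPHABET))    # total number of columns
print(column_to_residue(0))       # first column
print(column_to_residue(20999))   # last column
```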

In [9]:
fps_x_traH = encode_sequence(sequences = df_tra['sequence'], seq_len=1000, padding_truncating='post')
print(fps_x_traH.shape)
fps_y_traH = df_tra['transporter']
print(fps_y_traH.shape)
fps_x_traH

(30505, 21000)
(30505,)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20990,20991,20992,20993,20994,20995,20996,20997,20998,20999
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30500,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30502,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30503,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Dataset split
The split was made as described in the introduction.

In [10]:
X_train_traH, X_test_traH, y_train_traH, y_test_traH = train_test_split(fps_x_traH, fps_y_traH, test_size=0.25, random_state = 42,stratify=fps_y_traH)
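colnames = []
for i in range(1000):  # sequence position (seq_len used in encode_sequence)
    for j in range(21):  # index into the 21-symbol alphabet
        colnames.append(str(i)+'_'+str(j))  # 21000 names, one per one-hot column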
print(X_train_traH.shape, y_train_traH.shape)
print(X_test_traH.shape, y_test_traH.shape)


(22878, 21000) (22878,)
(7627, 21000) (7627,)


The models use 22878 encoded sequences for training and 7627 encoded sequences for testing. Using the same random_state is intended to ensure that the same entries (features / encoded sequence) are selected for training in both cases.

### RF
The first model trained was the Random Forest (RF), with the parameters predefined for the Grid Search.

In [11]:
RF = ShallowML(X_train_traH, X_test_traH, y_train_traH, y_test_traH, report_name='rfx_traH', columns_names = colnames)
best_rf_model = RF.train_best_model(model_name='rf',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=5,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = RF.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 1127.70 seconds for 6 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        RandomForestClassifier(random_state=1))]),
             n_jobs=5,
             param_grid=[{'clf__bootstrap': [True], 'clf__criterion': ['gini'],
                          'clf__max_features': ['sqrt', 'log2'],
                          'clf__n_estimators': [10, 100, 500]}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.482 (std: 0.018)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}
 

Model with rank: 2
 Mean validation score: 0.454 (std: 0.005)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 100}
 

Model with rank: 3
 Mean validation score: 0.371

The best model obtained reached an MCC score of 0.482 and an accuracy of 74.8%, with the following hyperparameter dictionary: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}. This hyperparameter dictionary is the same as the one obtained for the features.

### SVM

In [12]:
SVM = ShallowML(X_train_traH, X_test_traH, y_train_traH, y_test_traH, report_name='svmx_traH', columns_names=colnames)
best_svm_model = SVM.train_best_model(model_name='svm',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=3, optType='gridSearch', param_grid=param_grid,
                         n_jobs=5,random_state=1, n_iter=15, refit=True, probability=True)


print(best_svm_model)
scores, scores_per_class, cm, cm2 = SVM.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 77121.09 seconds for 6 candidate parameter settings.
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        SVC(probability=True,
                                            random_state=1))]),
             n_jobs=5,
             param_grid={'clf__C': [0.1, 1.0, 10],
                         'clf__kernel': ['linear', 'rbf']},
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.615 (std: 0.003)
 Parameters: {'clf__C': 1.0, 'clf__kernel': 'rbf'}
 

Model with rank: 2
 Mean validation score: 0.589 (std: 0.006)
 Parameters: {'clf__C': 10, 'clf__kernel': 'rbf'}
 

Model with rank: 3
 Mean validation score: 0.476 (std: 0.005)
 Parameters: {'clf__C': 0.1, 'clf__kernel': 'linear'}
 

make_scorer(matthews_corrcoef)
3
Best score (scorer: make_scorer(matthews_corrcoef)) and parameters from a 3-fold cross v

The best model obtained reached an MCC score of 0.631 and an accuracy of 81.5%, with the following hyperparameter dictionary: {'clf__C': 1.0, 'clf__kernel': 'rbf'}.
### KNN

In [13]:
KNN = ShallowML(X_train_traH, X_test_traH, y_train_traH, y_test_traH, report_name='knnx_traH', columns_names = colnames)
best_knn_model = KNN.train_best_model(model_name='knn',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=4,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = KNN.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 2387.47 seconds for 24 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf', KNeighborsClassifier())]),
             n_jobs=4,
             param_grid=[{'clf__leaf_size': [15, 30, 60],
                          'clf__n_neighbors': [2, 5, 10, 15],
                          'clf__weights': ['uniform', 'distance']}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.528 (std: 0.004)
 Parameters: {'clf__leaf_size': 15, 'clf__n_neighbors': 10, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.528 (std: 0.004)
 Parameters: {'clf__leaf_size': 30, 'clf__n_neighbors': 10, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.528 (std: 0.004)
 Parameters: {'clf__leaf_size': 60, 'clf__n_neighbors': 10, 'clf__weights': 'distance'}
 

make_scorer(matthews_corrcoef)
5


The best model obtained reached an MCC score of 0.551 and an accuracy of 77.5%, with the following hyperparameter dictionary: {'clf__leaf_size': 15, 'clf__n_neighbors': 10, 'clf__weights': 'distance'}, the same set of hyperparameters obtained for the features.

It is worth noting that, in this process, the models trained on the features always obtained better scores than the models trained on the encoded sequence information, the best model being the SVM trained on the features.
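The comparison can be checked directly from the MCC values reported in the text of this section. The snippet below only collects those reported numbers in a dictionary; the variable names are illustrative.

```python
# MCC values as reported in the text above, per model and input representation
reported_mcc = {
    'RF':  {'features': 0.708, 'one_hot': 0.482},
    'SVM': {'features': 0.737, 'one_hot': 0.631},
    'KNN': {'features': 0.707, 'one_hot': 0.551},
}

# does every model score higher on the features than on the one-hot encoding?
features_always_better = all(
    scores['features'] > scores['one_hot'] for scores in reported_mcc.values()
)
# which model has the highest feature-based MCC?
best_model = max(reported_mcc, key=lambda name: reported_mcc[name]['features'])
print(features_always_better, best_model)
```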


# Balanced multiclass classification
This section covers the multiclass classification of the transporter classes, using the balanced dataset generated in stage 1.
## Physicochemical features
This dataframe contains the amino acid sequences from the original dataset together with the 100 most relevant physicochemical properties found in the Feature Selection analysis (stage 2).
RF, SVM and KNN models are built from these properties in order to determine the transporter classes. The label of each class was encoded so that class 0 (cls0) = 0, class 1 (cl1) = 1, class 2 (cl2) = 2, class 3 (cl3) = 3, class 4 (cl4) = 4, class 5 (cl5) = 5 and class 8 (cl8) = 8.

In [14]:
df_TCDB_blc = pd.read_csv('df_TCDB_blc.csv', sep= ',')
df_TCDB_blc = df_TCDB_blc.drop(columns='Unnamed: 0')
dici = {'cls0':0, 'cl1':1, 'cl2':2 ,'cl3':3,'cl4':4,'cl5':5,'cl8':8 }
df_TCDB_blc.TCDB_ID = df_TCDB_blc.TCDB_ID.apply(lambda x: dici[x])
print(df_TCDB_blc.groupby('TCDB_ID').size())
df_TCDB_blc

TCDB_ID
0    4000
1    4631
2    4026
3    3902
4     409
5     202
8     878
dtype: int64


Unnamed: 0,sequence,TCDB_ID,_ChargeD2100,bomanindex,_HydrophobicityC1,_SolventAccessibilityD2100,_HydrophobicityD1100,_ChargeC2,_SolventAccessibilityC2,Gravy,...,TH,NR,WC,NM,HK,HH,HW,WH,HM,HY
0,MDKLLQWSIAQQSGDKEAIQKLGQPDPKMLEQLFGGPDEPTLMKQA...,0,0.756031,0.618030,1.444669,-1.016714,-1.016714,-0.801670,1.444669,-0.846988,...,-0.488784,-0.599495,-0.160959,1.450868,1.288709,-0.297108,-0.250314,-0.244239,-0.290204,1.843871
1,MNLKPLADRVIVKPAPAEEKTKGGLYIPDTGKEKPQYGEIVAVGTG...,0,1.022890,0.210428,0.976942,-0.811177,-0.811177,-1.162786,0.976942,-0.685853,...,-0.488784,-0.599495,-0.160959,-0.437560,-0.412516,-0.297108,-0.250314,-0.244239,-0.290204,-0.390558
2,MTTITTAAIIGAGLAGCECALRLARAGVRVTLFEMKPAAFSPAHSN...,0,0.409640,0.345211,0.028846,-0.269359,-0.269359,-0.470647,0.028846,-0.202590,...,-0.488784,0.887946,-0.160959,-0.437560,-0.412516,-0.297108,-0.250314,-0.244239,-0.290204,-0.390558
3,MLIIDIKGTEISQQEIEILSHPLVAGLILFSRNFVDKAQLTALIKE...,0,0.081658,0.335761,0.774681,-0.734350,-0.734350,-0.139624,0.774681,-0.566868,...,0.920638,-0.599495,-0.160959,-0.437560,0.997070,-0.297108,-0.250314,-0.244239,-0.290204,-0.390558
4,MATFADLPDSVLLEIFSYLPVRDRIRISRVCHHWKKLVDDRWLWRH...,0,0.204732,0.239145,-0.186056,-0.119168,-0.119168,-0.275042,-0.186056,0.005621,...,-0.488784,-0.599495,-0.160959,-0.437560,1.094283,1.426374,2.334925,2.539794,-0.290204,-0.390558
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18043,MAKSDTSNLGYAYHLPSGGCLMRIRNALSQFSDLFSEFNRYIAPAY...,8,-0.315822,0.026902,-0.236621,-0.055281,-0.055281,0.251586,-0.236621,0.043334,...,0.434630,1.383760,-0.160959,-0.437560,2.309444,0.759220,-0.250314,-0.244239,-0.290204,0.822418
18044,MWLSVFALQCALFAGFCFARDIRYSTGVSITLRNDAWTSPKKEPHR...,8,1.366520,1.455074,1.128637,-0.927649,-0.927649,-1.343345,1.128637,-1.163155,...,0.094425,-0.599495,-0.160959,-0.437560,1.385921,0.370047,-0.250314,-0.244239,-0.290204,1.205462
18045,MTATAKGINVMNTPLSTSQEPPIQFSTIASEFLHQQTDDVQPSGFQ...,8,0.342350,1.221102,0.686192,-0.673203,-0.673203,-0.410461,0.686192,-1.160927,...,0.240227,0.958777,-0.160959,1.127137,-0.412516,0.536835,-0.250314,-0.244239,-0.290204,0.567054
18046,MSMFNILKQVVNLNKVQLCQKSFQVNSKSFAQYSYRLNSSYILNNS...,8,-0.032946,0.349572,0.622986,-0.638882,-0.638882,-0.079438,0.622986,-0.443206,...,1.358044,-0.599495,-0.160959,-0.437560,-0.412516,-0.297108,-0.250314,-0.244239,-0.290204,-0.390558


### Preparation for the dataset split
For the 18 048 sequences, the 100 selected physicochemical features were kept, giving an input matrix with 100 columns and 18 048 rows (one per sequence). The labels were also extracted from the dataset.

In [15]:
# take the 100 feature columns (positions 2 to 101) for all rows at once
fps_x_blcF = df_TCDB_blc.iloc[:, 2:102].to_numpy()
print(fps_x_blcF.shape)
fps_x_blcF

(18048, 100)


array([[ 0.7560312 ,  0.61802955,  1.44466909, ..., -0.37583192,
         0.0395254 , -1.14909945],
       [ 1.02288988,  0.21042831,  0.97694169, ..., -1.29186065,
         0.920056  , -0.35579044],
       [ 0.4096396 ,  0.34521134,  0.02884563, ...,  0.48019607,
        -0.37473067, -0.87641082],
       ...,
       [ 0.3423496 ,  1.22110199,  0.68619223, ...,  1.57949352,
        -0.48751615,  0.75101528],
       [-0.03294632,  0.34957184,  0.62298583, ..., -0.42897415,
         0.10691294,  1.39904754],
       [-0.14488152,  0.1697216 , -0.16077359, ...,  2.39249438,
        -0.64617244, -0.37719201]])

In [16]:
fps_y_blcF = df_TCDB_blc['TCDB_ID'].to_numpy().flatten()
print(fps_y_blcF.shape)
fps_y_blcF


(18048,)


array([0, 0, 0, ..., 8, 8, 8], dtype=int64)

### Dataset split

In [17]:
X_train_blc, X_test_blc, y_train_blc, y_test_blc = train_test_split(fps_x_blcF, fps_y_blcF, test_size=0.25, random_state = 42,stratify=fps_y_blcF)
print(X_train_blc.shape, y_train_blc.shape)
print(X_test_blc.shape, y_test_blc.shape)

(13536, 100) (13536,)
(4512, 100) (4512,)


The training dataset contains 13 536 entries and the test dataset 4512 entries, each holding the physicochemical features of the sequences.

### RF
The first model is the Random Forest (RF), using the hyperparameters predefined for RF in the **train_best_model** method of the **ShallowML** class.

In [18]:
RF = ShallowML(X_train_blc, X_test_blc, y_train_blc, y_test_blc, report_name='rfx_blc', columns_names = list(df_TCDB_blc.columns))
best_rf_model = RF.train_best_model(model_name='rf',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=5,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = RF.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 127.33 seconds for 6 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        RandomForestClassifier(random_state=1))]),
             n_jobs=5,
             param_grid=[{'clf__bootstrap': [True], 'clf__criterion': ['gini'],
                          'clf__max_features': ['sqrt', 'log2'],
                          'clf__n_estimators': [10, 100, 500]}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.581 (std: 0.006)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}
 

Model with rank: 2
 Mean validation score: 0.579 (std: 0.006)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'log2', 'clf__n_estimators': 500}
 

Model with rank: 3
 Mean validation score: 0.575 

The best RF model for the balanced multiclass classification reaches, approximately, an MCC score of 0.579 and an accuracy of 67.2%, with the hyperparameters {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}.
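
Since every model in this step is reported with both an MCC score and an accuracy, it may help to see how the two can diverge. The following is a minimal pure-Python sketch of the multiclass Matthews correlation coefficient, computed from per-class counts. It is not the code used by ShallowML or scikit-learn, and the toy labels are invented for illustration.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Matthews correlation coefficient for any number of classes."""
    s = len(y_true)                                   # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly predicted samples
    t_k = Counter(y_true)                             # how often each class truly occurs
    p_k = Counter(y_pred)                             # how often each class is predicted
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in labels)
    denominator = sqrt((s * s - sum(v * v for v in p_k.values()))
                       * (s * s - sum(v * v for v in t_k.values())))
    # Convention: return 0.0 when the score is undefined (e.g. one class only).
    return numerator / denominator if denominator else 0.0


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced example (invented): always predict the majority class.
y_true = [1] * 8 + [2] * 2
y_pred = [1] * 10
print(accuracy(y_true, y_pred))        # 0.8
print(multiclass_mcc(y_true, y_pred))  # 0.0
```

In this toy case the accuracy looks reasonable while the MCC is zero, which is one reason to select models by MCC when classes are not equally easy to predict.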

### SVM

In [19]:
SVM = ShallowML(X_train_blc, X_test_blc, y_train_blc, y_test_blc, report_name='svmx_blcF', columns_names=list(df_TCDB_blc.columns))
param_grid ={'clf__C': [0.1,1.0,10],
                        'clf__kernel': ['linear', 'rbf']}
best_svm_model = SVM.train_best_model(model_name='svm',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=3, optType='gridSearch', param_grid=param_grid,
                         n_jobs=5,random_state=1, n_iter=15, refit=True, probability=True)


print(best_svm_model)
scores, scores_per_class, cm, cm2 = SVM.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 511.66 seconds for 6 candidate parameter settings.
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        SVC(probability=True,
                                            random_state=1))]),
             n_jobs=5,
             param_grid={'clf__C': [0.1, 1.0, 10],
                         'clf__kernel': ['linear', 'rbf']},
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.605 (std: 0.007)
 Parameters: {'clf__C': 10, 'clf__kernel': 'rbf'}
 

Model with rank: 2
 Mean validation score: 0.556 (std: 0.005)
 Parameters: {'clf__C': 1.0, 'clf__kernel': 'rbf'}
 

Model with rank: 3
 Mean validation score: 0.506 (std: 0.008)
 Parameters: {'clf__C': 10, 'clf__kernel': 'linear'}
 

make_scorer(matthews_corrcoef)
3
Best score (scorer: make_scorer(matthews_corrcoef)) and parameters from a 3-fold cross vali

<Figure size 432x288 with 0 Axes>

The best SVM model for the balanced multiclass classification reaches, approximately, an MCC score of 0.621 and an accuracy of 70.5%. It was selected from the grid {'clf__C': [0.1, 1.0, 10], 'clf__kernel': ['linear', 'rbf']}, and the best hyperparameters are {'clf__C': 10, 'clf__kernel': 'rbf'}.
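
The "6 candidate parameter settings" reported by the grid search correspond to the Cartesian product of the values in `param_grid`. The sketch below enumerates them with the standard library only. The helper name `expand_grid` is hypothetical and is not part of Propythia or scikit-learn.

```python
from itertools import product

# Grid copied from the SVM cell above.
param_grid = {'clf__C': [0.1, 1.0, 10],
              'clf__kernel': ['linear', 'rbf']}


def expand_grid(grid):
    """Return every hyperparameter combination in the grid as a dict."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


candidates = expand_grid(param_grid)
cv_folds = 3  # value of cv used for the SVM runs

print(len(candidates))             # 6 candidate settings
print(len(candidates) * cv_folds)  # 18 cross-validation fits
for candidate in candidates:
    print(candidate)
```

With `refit=True`, one additional fit on the full training set follows the cross-validation fits. This helps explain why the SVM searches are the slowest ones in this step.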

### KNN
The third model is K-Nearest Neighbors (KNN), using the default hyperparameter grid that **train_best_model** of the **ShallowML** class applies to KNN.

In [20]:
KNN = ShallowML(X_train_blc, X_test_blc, y_train_blc, y_test_blc, report_name='knn_blc', columns_names = list(df_TCDB_blc.columns))
best_knn_model = KNN.train_best_model(model_name='knn',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=4,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = KNN.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 30.64 seconds for 24 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf', KNeighborsClassifier())]),
             n_jobs=4,
             param_grid=[{'clf__leaf_size': [15, 30, 60],
                          'clf__n_neighbors': [2, 5, 10, 15],
                          'clf__weights': ['uniform', 'distance']}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.558 (std: 0.005)
 Parameters: {'clf__leaf_size': 15, 'clf__n_neighbors': 5, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.558 (std: 0.005)
 Parameters: {'clf__leaf_size': 30, 'clf__n_neighbors': 5, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.558 (std: 0.005)
 Parameters: {'clf__leaf_size': 60, 'clf__n_neighbors': 5, 'clf__weights': 'distance'}
 

make_scorer(matthews_corrcoef)
5
Best 

<Figure size 432x288 with 0 Axes>

The best KNN model for the balanced multiclass classification reaches, approximately, an MCC score of 0.573 and an accuracy of 66.6%. The selected hyperparameters are {'clf__leaf_size': X, 'clf__n_neighbors': 5, 'clf__weights': 'distance'}, where 'clf__leaf_size' may be 15, 30 or 60, since the three values tie in the grid search.
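
The three top-ranked KNN settings in the grid search output differ only in `leaf_size`. This is consistent with `leaf_size` affecting the speed and memory of the tree-based neighbour search rather than which neighbours are found. The option that does change predictions is `weights='distance'`. The sketch below is a simplified brute-force illustration of distance-weighted voting on invented 2-D points, not the scikit-learn implementation.

```python
from collections import defaultdict
from math import dist


def knn_predict(train_X, train_y, query, k=5):
    """Predict a label by distance-weighted voting among the k nearest points."""
    # Brute-force neighbour search: sort training points by Euclidean distance.
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: dist(pair[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbours:
        d = dist(point, query)
        if d == 0:
            return label          # simplification: an exact match decides the vote
        votes[label] += 1.0 / d   # closer neighbours weigh more
    return max(votes, key=votes.get)


# Toy data (invented): two 'a' points near the query, three 'b' points far away.
train_X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (1.0, 1.2)]
train_y = ['a', 'a', 'b', 'b', 'b']

print(knn_predict(train_X, train_y, (0.2, 0.1), k=5))  # a
```

A uniform vote over the same five neighbours would return 'b' (three votes against two), so the weighting scheme alone can flip the prediction.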

As in the previous feature-based classification, the SVM again obtains the best MCC score and accuracy.

## One-Hot-Encoding
### Pre-processing for the dataset split
For the balanced **one-hot encoding** dataset, the 18048 amino acid sequences in df_TCDB_blc were **one-hot encoded** to a maximum length of 1000 residues, over 20 standard amino acids plus one undefined symbol. The input is therefore a matrix with 21000 columns (21 * 1000) and 18048 rows (one per sequence). The labels were also extracted from the dataset.
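
To make the 21 * 1000 = 21000 column count concrete, here is a minimal pure-Python sketch of a flattened one-hot encoding. It is not Propythia's `encode_sequence`. The alphabet order, the use of 'X' for undefined residues, and the all-zero padding are assumptions made for illustration.

```python
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'   # 20 standard residues (order is an assumption)
ALPHABET = AMINO_ACIDS + 'X'           # plus one symbol for anything undefined
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_flat(sequence, seq_len=1000):
    """Flatten a sequence into seq_len * 21 binary values.

    Sequences longer than seq_len are truncated at the end ('post');
    shorter ones are padded at the end with all-zero positions (assumption).
    """
    vector = [0] * (seq_len * len(ALPHABET))
    for position, residue in enumerate(sequence[:seq_len]):
        column = INDEX.get(residue, INDEX['X'])  # unknown residues map to 'X'
        vector[position * len(ALPHABET) + column] = 1
    return vector


encoded = one_hot_flat('MDSIRPATFQ')
print(len(encoded))   # 21000
print(sum(encoded))   # 10, one active bit per residue
```

Each residue activates exactly one of the 21 columns reserved for its position, so most of the 21000 values are zero for short sequences.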

In [21]:
fps_x_enco_blc = encode_sequence(sequences = df_TCDB_blc['sequence'], seq_len=1000, padding_truncating='post')
print(fps_x_enco_blc.shape)
fps_y_enco_blc = df_TCDB_blc['TCDB_ID']
print(fps_y_enco_blc.shape)

(18048, 21000)
(18048,)


### Dataset split

In [22]:
X_train_enco_blc, X_test_enco_blc, y_train_enco_blc, y_test_enco_blc = train_test_split(fps_x_enco_blc, fps_y_enco_blc, test_size=0.25, random_state = 42,stratify=fps_y_enco_blc)
print(X_train_enco_blc.shape, y_train_enco_blc.shape)
print(X_test_enco_blc.shape, y_test_enco_blc.shape)

(13536, 21000) (13536,)
(4512, 21000) (4512,)


After train_test_split, the training set comprises 13536 one-hot encoded entries and the test set 4512 one-hot encoded entries.

### RF
For the Random Forest (RF) model, the hyperparameters are the default grid that **train_best_model** of the **ShallowML** class applies to RF.


In [23]:
RF = ShallowML(X_train_enco_blc, X_test_enco_blc, y_train_enco_blc, y_test_enco_blc, report_name='rfx_enco_blc', columns_names = colnames)
best_rf_model = RF.train_best_model(model_name='rf',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=5,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = RF.score_testset()
print(scores)
print(scores_per_class)
print(cm)

RFN
performing gridSearch...
GridSearchCV took 715.42 seconds for 6 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        RandomForestClassifier(random_state=1))]),
             n_jobs=5,
             param_grid=[{'clf__bootstrap': [True], 'clf__criterion': ['gini'],
                          'clf__max_features': ['sqrt', 'log2'],
                          'clf__n_estimators': [10, 100, 500]}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.292 (std: 0.014)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}
 

Model with rank: 2
 Mean validation score: 0.281 (std: 0.016)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 100}
 

Model with rank: 3
 Mean validation score: 0.

<Figure size 432x288 with 0 Axes>

The best RF model for the balanced multiclass classification reaches, approximately, an MCC score of 0.298 and an accuracy of 45.7%, with the hyperparameters {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}. This is the same set of hyperparameters selected for the RF models trained previously.

### SVM

In [24]:
SVM = ShallowML(X_train_enco_blc, X_test_enco_blc, y_train_enco_blc, y_test_enco_blc, report_name='svm_enco_blc', columns_names=colnames)
best_svm_model = SVM.train_best_model(model_name='svm',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=3, optType='gridSearch', param_grid=param_grid,
                         n_jobs=5,random_state=1, n_iter=15, refit=True, probability=True)

print(best_svm_model)
scores, scores_per_class, cm, cm2 = SVM.score_testset()
print(scores)
print(scores_per_class)
print(cm)

SVM
performing gridSearch...
GridSearchCV took 37102.07 seconds for 6 candidate parameter settings.
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        SVC(probability=True,
                                            random_state=1))]),
             n_jobs=5,
             param_grid={'clf__C': [0.1, 1.0, 10],
                         'clf__kernel': ['linear', 'rbf']},
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.403 (std: 0.008)
 Parameters: {'clf__C': 10, 'clf__kernel': 'rbf'}
 

Model with rank: 2
 Mean validation score: 0.388 (std: 0.006)
 Parameters: {'clf__C': 1.0, 'clf__kernel': 'rbf'}
 

Model with rank: 3
 Mean validation score: 0.331 (std: 0.008)
 Parameters: {'clf__C': 0.1, 'clf__kernel': 'linear'}
 

make_scorer(matthews_corrcoef)
3
Best score (scorer: make_scorer(matthews_corrcoef)) and parameters from a 3-fold cro

<Figure size 432x288 with 0 Axes>

The best SVM model for the balanced multiclass classification reaches, approximately, an MCC score of 0.429 and an accuracy of 55.9%, with the hyperparameters {'clf__C': 10, 'clf__kernel': 'rbf'}.

### KNN

In [25]:
KNN = ShallowML(X_train_enco_blc, X_test_enco_blc, y_train_enco_blc, y_test_enco_blc, report_name='knn_enco_blc', columns_names = colnames)
best_knn_model = KNN.train_best_model(model_name='knn',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=4,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = KNN.score_testset()
print(scores)
print(scores_per_class)
print(cm)

KNN
performing gridSearch...
GridSearchCV took 754.80 seconds for 24 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf', KNeighborsClassifier())]),
             n_jobs=4,
             param_grid=[{'clf__leaf_size': [15, 30, 60],
                          'clf__n_neighbors': [2, 5, 10, 15],
                          'clf__weights': ['uniform', 'distance']}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.372 (std: 0.010)
 Parameters: {'clf__leaf_size': 15, 'clf__n_neighbors': 2, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.372 (std: 0.010)
 Parameters: {'clf__leaf_size': 30, 'clf__n_neighbors': 2, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.372 (std: 0.010)
 Parameters: {'clf__leaf_size': 60, 'clf__n_neighbors': 2, 'clf__weights': 'distance'}
 

make_scorer(matthews_corrcoef)
5


<Figure size 432x288 with 0 Axes>

The best KNN model for the balanced multiclass classification reaches, approximately, an MCC score of 0.417 and an accuracy of 54.4%. The selected hyperparameters are {'clf__leaf_size': X, 'clf__n_neighbors': 2, 'clf__weights': 'distance'}, where 'clf__leaf_size' may be 15, 30 or 60.


As in the previous procedures, all models trained on the physico-chemical features obtained better results than their one-hot encoded counterparts, and the feature-based SVM was again the best model, with an MCC score of 0.62.

# Multiclass classification of transporters
This section covers the multiclass classification of the transporter classes, using the dataset that excludes the non-transporter class, as described in step 1.
## Physico-chemical features
The Machine Learning models described above are trained on the 100 best features from step 2. The dataset *df_TCDB_nz*, created in step 1, was imported and the labels were defined according to the transporter class
(cl1 = 1 ; cl2 = 2 ; cl3 = 3 ; cl4 = 4 ; cl5 = 5 ; cl8 = 8)

In [26]:
df_TCDB_nz = pd.read_csv('df_TCDB_nz.csv', sep= ',')
df_TCDB_nz = df_TCDB_nz.drop(columns='Unnamed: 0')
dici = {'cl1':1, 'cl2':2 ,'cl3':3,'cl4':4,'cl5':5,'cl8':8 }
df_TCDB_nz.TCDB_ID = df_TCDB_nz.TCDB_ID.apply(lambda x: dici[x])
print(df_TCDB_nz.groupby('TCDB_ID').size())
df_TCDB_nz

TCDB_ID
1    4631
2    4026
3    3902
4     409
5     202
8     878
dtype: int64


Unnamed: 0,sequence,TCDB_ID,_HydrophobicityD1100,_SolventAccessibilityD2100,_ChargeD2100,_HydrophobicityC1,Gravy,hydrophobic_ratio,_SolventAccessibilityC2,_ChargeD3100,...,LH,RP,HS,CW,QW,TH,HW,WQ,RM,MH
0,MDSIRPATFQIPAAVRELGWAALLLFFVLLSVHEWFSPPGWFGLLA...,4,0.963098,0.963098,-0.891240,-1.053496,0.914927,1.151802,-1.053496,0.770825,...,-0.621004,0.711445,-0.488216,-0.181722,-0.341125,-0.496290,-0.257673,2.264428,-0.472387,-0.293626
1,MSPSRTARLYFLLVLDLLFFVLEISIGYAVGSLALVADSFHMLNDV...,2,0.308846,0.308846,-0.396637,-0.597386,-0.265547,-0.544895,-0.597386,-0.216299,...,1.732211,0.066580,3.035088,-0.181722,-0.341125,-0.496290,-0.257673,0.972632,-0.472387,-0.293626
2,MHFGLNDRPEQVASASHSIFSSDDNKLRLSASLPDTAVTDLRRLGR...,2,-0.540370,-0.540370,0.471229,0.393026,-0.430169,-0.322973,0.393026,-0.605204,...,0.289918,0.163309,1.574206,-0.181722,-0.341125,0.769282,-0.257673,-0.383755,0.716247,1.573704
3,MFPLSALPRCVALRSKHGNSYLRSVHDKSQGGNFVELSADNDGGVM...,1,-0.779495,-0.779495,0.905993,0.823074,-1.142603,-1.400136,0.823074,-0.701986,...,2.567222,-0.610528,0.414094,-0.181722,-0.341125,0.611086,-0.257673,-0.383755,-0.472387,-0.293626
4,FGFKDIIRAIRRIAVPVVSTLFPPAAPLAHAIGEGVDYLLGDEAQA,1,0.104314,0.104314,0.179309,-0.467068,0.673374,1.113481,-0.467068,-0.628210,...,-0.621004,-0.610528,-0.488216,-0.181722,-0.341125,-0.496290,-0.257673,-0.383755,-0.472387,-0.293626
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14043,MPEGKFCNRKPVNTEEDLKALLGDKGGAQYYKEMEELEVDQEALWA...,5,-0.908812,-0.908812,1.340654,1.109772,-1.259992,-0.971821,1.109772,-0.880588,...,1.011064,0.098823,-0.488216,1.587166,-0.341125,-0.496290,-0.257673,-0.383755,-0.472387,1.418093
14044,MQHANTNKSLMTPGNIITGIILVMGLVLTVLRFTKGIGAVSNLDDN...,5,1.108509,1.108509,-0.915757,-1.144719,0.910925,0.752002,-1.144719,0.611843,...,1.352660,0.227796,0.628929,-0.181722,-0.341125,0.874746,1.884899,1.295581,0.815300,1.729314
14045,MRFGVVVLAIILLTGCSAMSAISDLLPSKDGIEATAQAGESNQKTG...,1,-0.373482,-0.373482,-0.081031,0.106328,-0.303055,-0.683305,0.106328,-0.665177,...,-0.621004,-0.610528,2.734318,-0.181722,-0.341125,-0.496290,-0.257673,-0.383755,-0.472387,-0.293626
14046,MILLTQSRFFSQKARCYITDNKRLFLPLLILIALVVPATRGFTLQA...,2,0.673320,0.673320,-1.127166,-1.027433,1.205370,1.200391,-1.027433,0.762407,...,-0.621004,1.033877,0.628929,-0.181722,-0.341125,0.874746,-0.257673,-0.383755,-0.472387,-0.293626


### Pre-processing for the dataset split
The dataset contains 14048 entries, since the non-transporter class is excluded, unlike in the previous datasets.
After selecting the 100 features resulting from step 2, a matrix of 14048 rows (one per sequence) by 100 columns (features) was created. The labels were also extracted from the dataset.
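
The class counts printed above are far from uniform, which matters when reading accuracy values for this dataset. As a small illustration, the snippet below uses only the counts from that output. It computes the share of each class and the accuracy that a trivial classifier always predicting the largest class would reach. That baseline is a reference point introduced here, not a model trained in this step.

```python
# Class counts copied from the groupby output above.
class_counts = {1: 4631, 2: 4026, 3: 3902, 4: 409, 5: 202, 8: 878}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

print(total)  # 14048
for label, share in proportions.items():
    print(label, f'{share:.1%}')
print(majority_label, f'{majority_baseline:.1%}')
```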

In [27]:
# Columns 2..101 hold the 100 selected features (0 = sequence, 1 = TCDB_ID)
fps_x_nzF = df_TCDB_nz.iloc[:, 2:102].to_numpy()
print(fps_x_nzF.shape)
fps_x_nzF

(14048, 100)


array([[ 0.96309838,  0.96309838, -0.89123991, ..., -1.02850769,
         1.17654195,  0.93982451],
       [ 0.30884552,  0.30884552, -0.3966372 , ...,  0.85924138,
         0.08238807,  0.91387199],
       [-0.54037027, -0.54037027,  0.47122903, ...,  0.50390038,
         0.33883039, -0.68161785],
       ...,
       [-0.37348215, -0.37348215, -0.08103059, ...,  1.85863795,
         1.3475035 ,  0.24424889],
       [ 0.67331979,  0.67331979, -1.12716572, ..., -0.517705  ,
        -0.02507347,  0.07946938],
       [ 0.12617228,  0.12617228, -0.33596759, ...,  1.48108813,
        -0.22045809,  1.17803945]])

In [28]:
fps_y_nzF = df_TCDB_nz['TCDB_ID'].to_numpy().flatten()
print(fps_y_nzF.shape)
fps_y_nzF

(14048,)


array([4, 2, 2, ..., 1, 2, 1], dtype=int64)

### Dataset split

In [29]:
X_train_nz, X_test_nz, y_train_nz, y_test_nz = train_test_split(fps_x_nzF, fps_y_nzF, test_size=0.25, random_state = 42,stratify = fps_y_nzF)
print(X_train_nz.shape, y_train_nz.shape)
print(X_test_nz.shape, y_test_nz.shape)

(10536, 100) (10536,)
(3512, 100) (3512,)


As described in the introduction, the dataset was split into 75% for training and 25% for testing, giving 10536 training entries and 3512 test entries.
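
The idea behind the `stratify` argument can be sketched without scikit-learn: sample the test fraction separately inside every class, so that even the small classes appear in both sets. The function below is a simplified illustration. It does not reproduce the exact allocation or shuffling of `train_test_split`, and the name `stratified_split` is hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_idx, test_idx), sampling test_fraction of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Rounded to the nearest integer, with at least one test sample per class.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Labels rebuilt from the class counts of df_TCDB_nz shown earlier.
counts = {1: 4631, 2: 4026, 3: 3902, 4: 409, 5: 202, 8: 878}
labels = [label for label, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
for label, n in counts.items():
    print(label, f'{test_counts[label] / n:.3f}')  # close to 0.250 for every class
```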

### RF

The first model was Random Forest (RF), using the standard RF hyperparameter grid in the grid search, as described earlier.

In [30]:
RF = ShallowML(X_train_nz, X_test_nz, y_train_nz, y_test_nz, report_name='rfx_nz', columns_names = list(df_TCDB_nz.columns))
best_rf_model = RF.train_best_model(model_name='rf',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=5,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = RF.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 90.85 seconds for 6 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        RandomForestClassifier(random_state=1))]),
             n_jobs=5,
             param_grid=[{'clf__bootstrap': [True], 'clf__criterion': ['gini'],
                          'clf__max_features': ['sqrt', 'log2'],
                          'clf__n_estimators': [10, 100, 500]}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.585 (std: 0.016)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}
 

Model with rank: 2
 Mean validation score: 0.583 (std: 0.012)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'log2', 'clf__n_estimators': 500}
 

Model with rank: 3
 Mean validation score: 0.581 (

<Figure size 432x288 with 0 Axes>

The best Random Forest model obtained an MCC score of 0.61 and an accuracy of 72.3%, with the parameter dictionary {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}.

### SVM
The second model was the SVM, once again searched over the same param_grid defined for the earlier SVM runs.

In [31]:
SVM = ShallowML(X_train_nz, X_test_nz, y_train_nz, y_test_nz, report_name='svm_nz', columns_names=list(df_TCDB_nz.columns))
best_svm_model = SVM.train_best_model(model_name='svm',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=3, optType='gridSearch', param_grid=param_grid,
                         n_jobs=1, n_iter=15, refit=True, probability=True)

print(best_svm_model)
scores, scores_per_class, cm, cm2 = SVM.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 903.22 seconds for 6 candidate parameter settings.
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        SVC(probability=True,
                                            random_state=1))]),
             n_jobs=1,
             param_grid={'clf__C': [0.1, 1.0, 10],
                         'clf__kernel': ['linear', 'rbf']},
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.615 (std: 0.006)
 Parameters: {'clf__C': 10, 'clf__kernel': 'rbf'}
 

Model with rank: 2
 Mean validation score: 0.554 (std: 0.004)
 Parameters: {'clf__C': 1.0, 'clf__kernel': 'rbf'}
 

Model with rank: 3
 Mean validation score: 0.496 (std: 0.006)
 Parameters: {'clf__C': 10, 'clf__kernel': 'linear'}
 

make_scorer(matthews_corrcoef)
3
Best score (scorer: make_scorer(matthews_corrcoef)) and parameters from a 3-fold cross vali

<Figure size 432x288 with 0 Axes>

The best SVM model obtained an MCC score of 0.641 and an accuracy of 74.3%, with the parameter dictionary {'clf__C': 10, 'clf__kernel': 'rbf'}.

### KNN
The last model used for the multiclass classification without the non-transporter class was KNN.

In [32]:
KNN = ShallowML(X_train_nz, X_test_nz, y_train_nz, y_test_nz, report_name='knn_nz', columns_names = list(df_TCDB_nz.columns))
best_knn_model = KNN.train_best_model(model_name='knn',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=4,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = KNN.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 25.10 seconds for 24 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf', KNeighborsClassifier())]),
             n_jobs=4,
             param_grid=[{'clf__leaf_size': [15, 30, 60],
                          'clf__n_neighbors': [2, 5, 10, 15],
                          'clf__weights': ['uniform', 'distance']}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.593 (std: 0.009)
 Parameters: {'clf__leaf_size': 15, 'clf__n_neighbors': 5, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.593 (std: 0.009)
 Parameters: {'clf__leaf_size': 30, 'clf__n_neighbors': 5, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.593 (std: 0.009)
 Parameters: {'clf__leaf_size': 60, 'clf__n_neighbors': 5, 'clf__weights': 'distance'}
 

make_scorer(matthews_corrcoef)
5
Best 

<Figure size 432x288 with 0 Axes>

The best KNN model obtained an MCC score of 0.62 and an accuracy of 72.7%, with the parameter dictionary {'clf__leaf_size': 15, 'clf__n_neighbors': 5, 'clf__weights': 'distance'}.

Comparing the three models, the best one for the multiclass classification without the non-transporter class was the SVM, with a test MCC score of 0.641 and an accuracy of 74.3% (mean cross-validation MCC of 0.615).

## One-Hot-Encoding
### Pre-processing for the dataset split
For this section, the sequences of the *df_TCDB_nz* dataset were transformed by *one-hot encoding*, as for the previous datasets, resulting in a matrix with 21 000 columns and 14 048 rows (sequences). The labels were also extracted from the dataset.

In [33]:
fps_x_enco_nz = encode_sequence(sequences = df_TCDB_nz['sequence'], seq_len=1000, padding_truncating='post')
print(fps_x_enco_nz.shape)
fps_y_enco_nz = df_TCDB_nz['TCDB_ID']
print(fps_y_enco_nz.shape)

(14048, 21000)
(14048,)


### Dataset split

In [34]:
X_train_enco_nz, X_test_enco_nz, y_train_enco_nz, y_test_enco_nz = train_test_split(fps_x_enco_nz, fps_y_enco_nz, test_size=0.25, random_state = 42,stratify=fps_y_enco_nz)
print(X_train_enco_nz.shape, y_train_enco_nz.shape)
print(X_test_enco_nz.shape, y_test_enco_nz.shape)

(10536, 21000) (10536,)
(3512, 21000) (3512,)


The dataset was split in the same way as in all previous sections, giving 10536 training entries and 3512 test entries.

### RF
The first model was Random Forest (RF), using the standard hyperparameter grid, as described earlier.

In [35]:

RF = ShallowML(X_train_enco_nz, X_test_enco_nz, y_train_enco_nz, y_test_enco_nz, report_name='rf_enco_nz', columns_names = colnames)
best_rf_model = RF.train_best_model(model_name='rf',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=5,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = RF.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 573.83 seconds for 6 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        RandomForestClassifier(random_state=1))]),
             n_jobs=5,
             param_grid=[{'clf__bootstrap': [True], 'clf__criterion': ['gini'],
                          'clf__max_features': ['sqrt', 'log2'],
                          'clf__n_estimators': [10, 100, 500]}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.306 (std: 0.021)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}
 

Model with rank: 2
 Mean validation score: 0.305 (std: 0.015)
 Parameters: {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 100}
 

Model with rank: 3
 Mean validation score: 0.266 

<Figure size 432x288 with 0 Axes>

The best Random Forest model obtained an MCC score of 0.312 and an accuracy of 51.5%, with the parameter dictionary {'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}. These parameters coincide with those selected for the Random Forest trained on the feature dataset.

### SVM

In [36]:
SVM = ShallowML(X_train_enco_nz, X_test_enco_nz, y_train_enco_nz, y_test_enco_nz, report_name='svm_enco_nz', columns_names=colnames)
best_svm_model = SVM.train_best_model(model_name='svm',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=3, optType='gridSearch', param_grid=param_grid,
                         n_jobs=5,random_state=1, n_iter=15, refit=True, probability=True)

print(best_svm_model)
scores, scores_per_class, cm, cm2 = SVM.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 28429.89 seconds for 6 candidate parameter settings.
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf',
                                        SVC(probability=True,
                                            random_state=1))]),
             n_jobs=5,
             param_grid={'clf__C': [0.1, 1.0, 10],
                         'clf__kernel': ['linear', 'rbf']},
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.408 (std: 0.008)
 Parameters: {'clf__C': 10, 'clf__kernel': 'rbf'}
 

Model with rank: 2
 Mean validation score: 0.377 (std: 0.006)
 Parameters: {'clf__C': 1.0, 'clf__kernel': 'rbf'}
 

Model with rank: 3
 Mean validation score: 0.324 (std: 0.004)
 Parameters: {'clf__C': 0.1, 'clf__kernel': 'linear'}
 

make_scorer(matthews_corrcoef)
3
Best score (scorer: make_scorer(matthews_corrcoef)) and parameters from a 3-fold cross v

<Figure size 432x288 with 0 Axes>

The best SVM model obtained an MCC score of 0.43 and an accuracy of 59.7%, with the parameter dictionary {'clf__C': 10, 'clf__kernel': 'rbf'}.
This parameter dictionary coincides with the one selected for the SVM trained on the feature dataset.

### KNN

In [37]:
KNN = ShallowML(X_train_enco_nz, X_test_enco_nz, y_train_enco_nz, y_test_enco_nz, report_name='knn_enco_nz', columns_names = colnames)
best_knn_model = KNN.train_best_model(model_name='knn',model=None, scaler=None,score=make_scorer(matthews_corrcoef),
                         cv=5, optType='gridSearch', param_grid=None,
                         n_jobs=4,random_state=1, n_iter=15, refit=True)

scores, scores_per_class, cm, cm2 = KNN.score_testset()
print(scores)
print(scores_per_class)
print(cm)

performing gridSearch...
GridSearchCV took 461.19 seconds for 24 candidate parameter settings.
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scl', None),
                                       ('clf', KNeighborsClassifier())]),
             n_jobs=4,
             param_grid=[{'clf__leaf_size': [15, 30, 60],
                          'clf__n_neighbors': [2, 5, 10, 15],
                          'clf__weights': ['uniform', 'distance']}],
             scoring=make_scorer(matthews_corrcoef))
Model with rank: 1
 Mean validation score: 0.394 (std: 0.009)
 Parameters: {'clf__leaf_size': 15, 'clf__n_neighbors': 2, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.394 (std: 0.009)
 Parameters: {'clf__leaf_size': 30, 'clf__n_neighbors': 2, 'clf__weights': 'distance'}
 

Model with rank: 1
 Mean validation score: 0.394 (std: 0.009)
 Parameters: {'clf__leaf_size': 60, 'clf__n_neighbors': 2, 'clf__weights': 'distance'}
 

make_scorer(matthews_corrcoef)
5
Best

<Figure size 432x288 with 0 Axes>

The best KNN model obtained an MCC score of 0.42 and an accuracy of 58.5%, with the parameter dictionary {'clf__leaf_size': 15/30/60, 'clf__n_neighbors': 2, 'clf__weights': 'distance'}.
Unlike the RF and SVM cases, this dictionary does not fully coincide with the one obtained on the feature dataset: 'clf__weights' is again 'distance', but 'clf__n_neighbors' is 2 here versus 5 with the features.
Comparing the three models, the best one for the multiclass classification without the non-transporter class, using *one-hot encoding*, was the SVM, with an MCC score of 0.43 and an accuracy of 59.7%.

Comparing the feature-based and one-hot encoded models for the multiclass classification without the non-transporter class, the SVM trained on the features was the best model, with an MCC score of 0.64, as observed in the previous classifications.

# Conclusion

The results obtained for the models trained throughout this work show that:
- For feature-based classification, the SVM obtained the best MCC. RF and KNN followed with similar scores: KNN was lowest on the balanced multiclass dataset (0.573 versus 0.579 for RF), while on the transporter-only dataset it was marginally ahead of RF (0.62 versus 0.61).
- For classification based on one-hot encoding, the SVMs obtained the best MCC, followed by KNN and finally RF.
- In every case, the models built on the features performed better than the corresponding models built on one-hot encoding.
- The binary classification obtained better scores than the multiclass classifications tested, corroborating the results of step 2, where the separation between transporter and non-transporter was clearer than the separation among the transporter classes.
- In the confusion matrices for the multiclass classifications, the dispersion is always greatest among the 4 most represented classes.
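
The dispersion mentioned in the last point is read from the off-diagonal entries of a confusion matrix. The sketch below shows the mechanics on invented toy predictions. It does not reproduce the matrices returned by `score_testset`, and the helper names are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Invented toy predictions, only to show the mechanics.
labels = [1, 2, 3]
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 3, 2, 2, 1, 3, 3, 3]

cm = confusion_matrix(y_true, y_pred, labels)
for row in cm:
    print(row)
print(per_class_recall(cm))
```

Rows with large off-diagonal counts correspond to classes whose samples are spread over several predicted labels.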


The gap between one-hot encoding and the feature-based approach is explained by the dependence of traditional ML models on feature engineering, that is, transforming the data so that the model can obtain better results downstream. One-hot encoding is a way of encoding the sequence, not a feature engineering method such as the calculation of physico-chemical descriptors. It therefore fails to capture the information that traditional methods need to reach optimal results.
In the next step (step 4), Deep Learning algorithms perform this feature engineering themselves, so those models are expected to handle the one-hot encoded information better.
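
One way to see why a raw positional encoding carries less directly usable signal than a descriptor is to compare both on a sequence and a rotated copy of it. The sketch below is only an illustration of that intuition, not an analysis of the models trained here. It takes the first 20 residues of the first sequence shown in the *df_TCDB_nz* table, and uses amino acid composition as a stand-in for a position-independent descriptor.

```python
from collections import Counter

ALPHABET = 'ACDEFGHIKLMNPQRSTVWYX'   # 20 residues plus 'X' (assumed order)


def one_hot_positions(sequence, seq_len):
    """Set of active (position, residue) bits of a one-hot encoding."""
    return {(i, aa) for i, aa in enumerate(sequence[:seq_len])}


def composition(sequence):
    """Position-independent descriptor: relative frequency of each residue."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in ALPHABET]


def jaccard(a, b):
    """Share of active bits that two encodings have in common."""
    return len(a & b) / len(a | b)


seq_a = 'MDSIRPATFQIPAAVRELGW'
seq_b = seq_a[-1] + seq_a[:-1]       # same residues, rotated by one position

overlap = jaccard(one_hot_positions(seq_a, 20), one_hot_positions(seq_b, 20))
print(f'{overlap:.3f}')                          # small: almost no shared bits
print(composition(seq_a) == composition(seq_b))  # True
```

The two sequences are indistinguishable by composition, yet their one-hot encodings share almost no active bits. A distance- or split-based model therefore sees them as very different inputs.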