# **Bioinformatics Project - Computational Drug Discovery [Part 5] Model inferences**

Juan Oliveira

In **Part 2**, model inference refers to the conclusions or predictions we can draw from a trained model. These can include predicting values for new data, identifying patterns, or assessing uncertainty. In short, inference is the practical use of the model to obtain relevant information.

Part 1 - Creating a model

1. Importing libraries

In [1]:
! pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [2]:
import pandas as pd # pandas for data manipulation
import seaborn as sns # seaborn for data visualization
import numpy as np # numpy for numerical operations
import csv # csv module for handling CSV files
import math # math module for mathematical operations
import lazypredict # lazypredict for quick machine learning benchmarks
import os # os module for operating system operations
from numpy import NaN # NaN constant from NumPy for missing values
from pandas import read_excel # read_excel function from pandas to read Excel files
from datetime import datetime # datetime class to work with dates and times
from sklearn.model_selection import train_test_split # train_test_split function to split the data into training and test sets
from lazypredict.Supervised import LazyRegressor # LazyRegressor class from lazypredict for regression
from sklearn import datasets # datasets module from sklearn to load example datasets
from matplotlib import pyplot # pyplot module from matplotlib to create plots
from numpy import set_printoptions # set_printoptions function from numpy to configure display options
from sklearn.ensemble import RandomForestRegressor # RandomForestRegressor class from sklearn for regression
from sklearn.tree import DecisionTreeRegressor # DecisionTreeRegressor class from sklearn for regression
from sklearn.feature_selection import SelectKBest # SelectKBest class from sklearn for feature selection
from sklearn.feature_selection import f_classif # f_classif function from sklearn (ANOVA F-test)
from google.colab import drive # Google Drive integration for Colab

2. Reading the dataset

In [3]:
from google.colab import drive
drive.mount('/content/gdrive') # Mount the Google Drive files

proj_path = '/content/gdrive/MyDrive/Colab Notebooks/' # Folder where the data is stored

file_name7 = 'VHC_07_bioactivity_data_3class_pIC50_pubchem_fp.csv' # Name of the input file
df1 = pd.read_csv(proj_path + file_name7)
n_row1 = df1.shape[0] # Number of rows
print('Numero Registros:',n_row1)
df1 # Display the DataFrame

Mounted at /content/gdrive
Numero Registros: 572


Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,PubchemFP0,PubchemFP1,PubchemFP2,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,CHEMBL32704,CCCC(NC(=O)[C@@H]1C[C@@H](OC(=O)N2CCc3ccccc3C2...,active,824.98,2.75,4.00,10.00,1,1,1,...,0,0,0,0,0,0,0,0,0,6.28
1,CHEMBL33248,CCCC(NC(=O)[C@@H]1C[C@@H](OC(=O)N2CCc3ccccc3C2...,active,810.95,2.71,4.00,10.00,1,1,1,...,0,0,0,0,0,0,0,0,0,6.66
2,CHEMBL285069,CCCC(NC(=O)[C@@H]1C[C@@H](OC(=O)N2CCc3ccccc3C2...,active,824.98,3.27,4.00,10.00,1,1,1,...,0,0,0,0,0,0,0,0,0,6.66
3,CHEMBL286124,CCCC(NC(=O)[C@@H]1C[C@@H](OC(=O)N2CCc3ccccc3C2...,active,760.89,1.67,4.00,10.00,1,1,1,...,0,0,0,0,0,0,0,0,0,6.38
4,CHEMBL285686,CCCC(NC(=O)[C@@H]1C[C@@H](OC(=O)N2CCc3ccccc3C2...,active,781.91,3.27,3.00,10.00,1,1,1,...,0,0,0,0,0,0,0,0,0,7.59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
567,CHEMBL3126842,COc1ccc2c(c1)[C@@H]1C[C@]1(C(=O)N1C3CCC1CN(C)C...,active,659.85,4.69,1.00,7.00,1,1,1,...,0,0,0,0,0,0,0,0,0,6.79
568,CHEMBL3126842,COc1ccc2c(c1)[C@@H]1C[C@]1(C(=O)N1C3CCC1CN(C)C...,active,659.85,4.69,1.00,7.00,1,1,1,...,0,0,0,0,0,0,0,0,0,8.74
569,CHEMBL3126842,COc1ccc2c(c1)[C@@H]1C[C@]1(C(=O)N1C3CCC1CN(C)C...,active,659.85,4.69,1.00,7.00,1,1,1,...,0,0,0,0,0,0,0,0,0,7.70
570,CHEMBL3126842,COc1ccc2c(c1)[C@@H]1C[C@]1(C(=O)N1C3CCC1CN(C)C...,active,659.85,4.69,1.00,7.00,1,1,1,...,0,0,0,0,0,0,0,0,0,8.32


3. Cleaning the data

In [4]:
# Drop the 'molecule_chembl_id', 'canonical_smiles' and 'bioactivity_class' columns
df1 = df1.drop(columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class'])

# Drop rows with missing values
df1 = df1.dropna()

# Display the resulting DataFrame
print(df1)

        MW  LogP  NumHDonors  NumHAcceptors  PubchemFP0  PubchemFP1  \
0   824.98  2.75        4.00          10.00           1           1   
1   810.95  2.71        4.00          10.00           1           1   
2   824.98  3.27        4.00          10.00           1           1   
3   760.89  1.67        4.00          10.00           1           1   
4   781.91  3.27        3.00          10.00           1           1   
..     ...   ...         ...            ...         ...         ...   
567 659.85  4.69        1.00           7.00           1           1   
568 659.85  4.69        1.00           7.00           1           1   
569 659.85  4.69        1.00           7.00           1           1   
570 659.85  4.69        1.00           7.00           1           1   
571 659.85  4.69        1.00           7.00           1           1   

     PubchemFP2  PubchemFP3  PubchemFP4  PubchemFP5  ...  PubchemFP872  \
0             1           1           0           0  ...             0   

4. Splitting the DataFrame into X and Y

In [5]:
X = df1.drop('pIC50', axis=1) # Use every column except 'pIC50' as the features
Y = df1.pIC50 # Use the 'pIC50' column as the target (a pandas Series)
X


Unnamed: 0,MW,LogP,NumHDonors,NumHAcceptors,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,824.98,2.75,4.00,10.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,810.95,2.71,4.00,10.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,824.98,3.27,4.00,10.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,760.89,1.67,4.00,10.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,781.91,3.27,3.00,10.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
567,659.85,4.69,1.00,7.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
568,659.85,4.69,1.00,7.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
569,659.85,4.69,1.00,7.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
570,659.85,4.69,1.00,7.00,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
Y.shape # Shape of the Y Series (not echoed: a notebook cell only displays its last expression)
X.shape # Shape of the X DataFrame (not echoed either)
Y

0     6.28
1     6.66
2     6.66
3     6.38
4     7.59
      ... 
567   6.79
568   8.74
569   7.70
570   8.32
571   7.21
Name: pIC50, Length: 572, dtype: float64

5. Selecting the best features

5.1 Removing low-variance features

Removing low-variance features is a data preprocessing step: it eliminates variables whose values are almost constant, or change very little, across the dataset as a whole.

Removing low-variance features matters for several reasons:

Computational efficiency: Low-variance variables carry little useful information and add computational cost to the analysis without contributing meaningfully to the results.

Noise reduction: Low-variance variables can be treated as "noise" in the data, since they provide little discriminative information for identifying patterns or important relationships.

Better generalization: In machine learning, removing low-variance features can help avoid overfitting and lead to models that generalize better.

Easier interpretation: A cleaner dataset, free of low-variance features, makes results and insights easier to interpret.
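
To make the idea concrete, here is a minimal sketch on a hypothetical toy fingerprint table (the column names and values are invented for illustration, not taken from the dataset). For a binary feature that equals 1 in a fraction p of the rows, the variance is p(1 - p), so a threshold of 0.8 * (1 - 0.8) = 0.16 keeps only bits whose variance exceeds that value.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy table: 10 molecules, 3 binary fingerprint bits.
toy = pd.DataFrame({
    "bit_constant": [1] * 10,           # always 1        -> variance 0.00
    "bit_rare":     [1] * 9 + [0],      # 90% ones        -> variance 0.9 * 0.1 = 0.09
    "bit_balanced": [1] * 5 + [0] * 5,  # 50% ones        -> variance 0.5 * 0.5 = 0.25
})

p = 0.8
threshold = p * (1 - p)  # 0.16, the variance of a binary feature with p = 0.8

selector = VarianceThreshold(threshold=threshold)
selector.fit(toy)

# Variance measured for each column, and the columns that pass the threshold.
variances = dict(zip(toy.columns, selector.variances_))
kept = list(toy.columns[selector.get_support()])

print(variances)
print(kept)
```

Only `bit_balanced` survives in this toy case. Continuous columns such as MW or LogP are compared against the same 0.16 threshold, so whether they are kept depends on their scale rather than on any 80% rule.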

In [7]:
from sklearn.feature_selection import VarianceThreshold # VarianceThreshold class from sklearn to remove low-variance features
df = X # Replace 'X' with the name of your feature DataFrame
selection = VarianceThreshold(threshold=(.8 * (1 - .8))) # Threshold 0.8 * (1 - 0.8) = 0.16: drops binary features that take the same value in more than 80% of the samples
# Fit and transform the data
X_transformed = selection.fit_transform(df)

# Get the names of the columns that survive the transformation
selected_columns = df.columns[selection.get_support(indices=True)]

# Convert the NumPy array back to a DataFrame
X = pd.DataFrame(X_transformed, columns=selected_columns)

print(selected_columns)

file_name1='VHC_selected_features.txt' # Name of the output text file
np.savetxt(proj_path + file_name1, [selected_columns], fmt='%s', delimiter=',') # Save the selected column names to a text file
X

Index(['MW', 'LogP', 'NumHDonors', 'NumHAcceptors', 'PubchemFP3',
       'PubchemFP21', 'PubchemFP23', 'PubchemFP24', 'PubchemFP115',
       'PubchemFP116',
       ...
       'PubchemFP755', 'PubchemFP756', 'PubchemFP758', 'PubchemFP777',
       'PubchemFP779', 'PubchemFP797', 'PubchemFP805', 'PubchemFP818',
       'PubchemFP819', 'PubchemFP821'],
      dtype='object', length=203)


Unnamed: 0,MW,LogP,NumHDonors,NumHAcceptors,PubchemFP3,PubchemFP21,PubchemFP23,PubchemFP24,PubchemFP115,PubchemFP116,...,PubchemFP755,PubchemFP756,PubchemFP758,PubchemFP777,PubchemFP779,PubchemFP797,PubchemFP805,PubchemFP818,PubchemFP819,PubchemFP821
0,824.98,2.75,4.00,10.00,1.00,1.00,0.00,0.00,0.00,0.00,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00
1,810.95,2.71,4.00,10.00,1.00,1.00,0.00,0.00,0.00,0.00,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00
2,824.98,3.27,4.00,10.00,1.00,1.00,0.00,0.00,1.00,1.00,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00
3,760.89,1.67,4.00,10.00,1.00,1.00,0.00,0.00,0.00,0.00,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00
4,781.91,3.27,3.00,10.00,1.00,1.00,0.00,0.00,0.00,0.00,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
567,659.85,4.69,1.00,7.00,1.00,0.00,0.00,0.00,1.00,1.00,...,1.00,0.00,1.00,1.00,0.00,0.00,0.00,1.00,0.00,1.00
568,659.85,4.69,1.00,7.00,1.00,0.00,0.00,0.00,1.00,1.00,...,1.00,0.00,1.00,1.00,0.00,0.00,0.00,1.00,0.00,1.00
569,659.85,4.69,1.00,7.00,1.00,0.00,0.00,0.00,1.00,1.00,...,1.00,0.00,1.00,1.00,0.00,0.00,0.00,1.00,0.00,1.00
570,659.85,4.69,1.00,7.00,1.00,0.00,0.00,0.00,1.00,1.00,...,1.00,0.00,1.00,1.00,0.00,0.00,0.00,1.00,0.00,1.00


In [8]:
## Optional univariate feature selection (disabled)
## Note: f_classif is an ANOVA F-test meant for categorical targets; pIC50 is continuous
#names=list(X.columns)
## Feature selection
#n_selected=201 # Set the number of features to select here
#test = SelectKBest(score_func=f_classif, k=n_selected)
#fit = test.fit(X, Y)
## summarize scores
#set_printoptions(precision=3)
#features = fit.transform(X)
## list(data) or
#performance_list = pd.DataFrame(
#    {'Attribute': names,
#     'Value': fit.scores_
#    })
#performance_list=performance_list.sort_values(by=['Value'], ascending=False)
#names_selected=performance_list.values[0:n_selected,0]

#XX = pd.DataFrame (X, columns = names_selected)
#X=XX
#X
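
The disabled cell above ranks features with `SelectKBest`. As a hedged sketch of how that ranking could look for a continuous target, the example below swaps in `f_regression` (my substitution, not the notebook's choice) and runs on synthetic data. `X_demo`, `y_demo` and the `feat_*` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data: 200 samples, 5 features, target driven by feat_1 and feat_3.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame(rng.normal(size=(n, 5)),
                      columns=[f"feat_{i}" for i in range(5)])
y_demo = 3.0 * X_demo["feat_1"] - 2.0 * X_demo["feat_3"] + rng.normal(scale=0.1, size=n)

# Score each feature against the continuous target and keep the two best.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_demo, y_demo)

# Ranking table in the same spirit as 'performance_list' in the disabled cell.
ranking = (pd.DataFrame({"Attribute": X_demo.columns, "Value": selector.scores_})
           .sort_values(by="Value", ascending=False))
names_selected = list(X_demo.columns[selector.get_support()])

print(ranking)
print(names_selected)
```

`f_regression` scores each feature by its linear association with the target, so it can miss features that only matter in combination with others.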

6. Splitting the dataset into training and test sets

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42) # test_size=0.3 holds out 30% of the 572 rows (172) for testing and keeps the remaining 70% (400) for training; random_state is the seed for the random shuffle
## Save the split datasets to disk for later checking
#df_X_train = pd.DataFrame(X_train)
#df_Y_train = pd.DataFrame(Y_train)
#df_X_test = pd.DataFrame(X_test)
#df_Y_test = pd.DataFrame(Y_test)
#df_X_train.to_csv('01_X_train.csv', index=False)
#df_Y_train.to_csv('01_Y_train.csv', index=False)
#df_X_test.to_csv('01_X_test.csv', index=False)
#df_Y_test.to_csv('01_Y_test.csv', index=False)
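
As a quick check of the split sizes, the sketch below uses placeholder arrays with the same shape as the data in this notebook (572 rows, 203 selected features). The array contents are dummy values, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the notebook's dimensions: 572 rows, 203 features.
X_demo = np.zeros((572, 203))
y_demo = np.arange(572, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same seed reproduces the same partition on a second call.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(np.array_equal(y_te, y_te_again))
```

The 400 training rows agree with the "Number of data points in the train set: 400" line that LightGBM logs during the benchmark.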

7. Comparing ML algorithms with LazyRegressor (benchmark)

In [10]:
reg = LazyRegressor(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, Y_train, Y_test)

print(models)

100%|██████████| 42/42 [00:24<00:00,  2.74it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001080 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 849
[LightGBM] [Info] Number of data points in the train set: 400, number of used features: 203
[LightGBM] [Info] Start training from score 7.577446


100%|██████████| 42/42 [00:24<00:00,  1.73it/s]

                                                              Adjusted R-Squared  \
Model                                                                              
Lars                          26577761549961861873241417146578803835492806220...   
RANSACRegressor                                       88730448808195529900032.00   
GaussianProcessRegressor                                                 5283.87   
KernelRidge                                                               123.13   
LassoLars                                                                   6.36   
Lasso                                                                       6.36   
DummyRegressor                                                              6.36   
LinearRegression                                                            6.04   
TransformedTargetRegressor                                                  6.04   
ElasticNet                                                                  




In [11]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
Lars,26577761549961861873241417146578803835492806220...,-4973616196484090770235324840030057112602439831...,1.1349270141516478e+42,0.19
RANSACRegressor,88730448808195529900032.00,-16604528431943022280704.00,207369591530.24,1.76
GaussianProcessRegressor,5283.87,-987.61,50.6,0.27
KernelRidge,123.13,-21.86,7.69,0.08
LassoLars,6.36,-0.00,1.61,0.06
Lasso,6.36,-0.00,1.61,0.04
DummyRegressor,6.36,-0.00,1.61,0.04
LinearRegression,6.04,0.06,1.56,0.04
TransformedTargetRegressor,6.04,0.06,1.56,0.06
ElasticNet,4.93,0.26,1.38,0.05


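To evaluate a regression model we look at R²; for a classification model we would use AUC or accuracy. The Adjusted R-Squared column is not informative here: the test set has fewer rows (172) than there are features (203), so the correction term changes sign and the rows that sort first by Adjusted R-Squared are in fact among the worst by R-Squared. R-Squared and RMSE are the columns to compare.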
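
The arithmetic behind those Adjusted R-Squared values can be sketched with the standard formula. I am assuming LazyRegressor applies this formula using the number of test rows and the number of features; under that assumption the rounded R-Squared values from the table reproduce the reported numbers to within rounding.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Standard adjusted R²: 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Assumed sizes: 172 test rows (30% of 572) and 203 selected features.
n_test, n_feat = 172, 203

# n - p - 1 = -32 is negative, so the correction flips sign.
print(n_test - n_feat - 1)

# Rounded R² values taken from the benchmark table above.
for name, r2 in [("LinearRegression", 0.06), ("ElasticNet", 0.26)]:
    print(name, round(adjusted_r2(r2, n_test, n_feat), 2))

# With many more rows than features the adjustment is a small penalty instead.
print(round(adjusted_r2(0.26, 1000, 10), 3))
```

With the denominator negative, a lower R² yields a higher adjusted value, which is why the ranking by Adjusted R-Squared is inverted in this benchmark.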

9.1 Optimizing the best model: REGRESSION

In [12]:

from sklearn.metrics import mean_squared_error # mean_squared_error function from sklearn to compute the mean squared error
from sklearn import svm # sklearn's SVM module
from sklearn.tree import export_graphviz # export_graphviz function from sklearn to export a decision-tree plot
from sklearn.model_selection import GridSearchCV # GridSearchCV class from sklearn to tune hyperparameters
from sklearn.model_selection import cross_val_score # cross_val_score function from sklearn to evaluate model performance
import pickle

# If 'Y' holds continuous values, SVR (Support Vector Regression) is an option for regression.
from sklearn.svm import SVR # SVR for regression tasks

#parametros= [{'kernel' : ['linear', 'poly', 'rbf', 'sigmoid'],'C' : [1,5,10]},
#         {'gamma' : ('auto','scale')}]

# Two grids: 4 kernels x 3 values of C (12 candidates), plus 2 degrees x 2 gammas (4 candidates) with SVR's default 'rbf' kernel.
# Note: 'degree' only affects the 'poly' kernel, so it has no effect in the second grid.
param = [{'kernel' : ['linear', 'poly', 'rbf', 'sigmoid'],'C' : [1,5,10]},
         {'degree' : [2,8],'gamma' : ('auto','scale')}]

# Use SVR (Support Vector Regression) rather than SVC (Support Vector Classification) for continuous targets.
model = GridSearchCV(estimator = SVR(), param_grid = param, cv = 5, n_jobs = -1, verbose = 2)

model.fit(X, Y.values.ravel()) # Fit on the full dataset; Y is flattened to a 1-D array

feature_names = X.columns

# Save the model to disk
filename2 = 'finalized_model1.sav'
with open(proj_path + filename2, 'wb') as f:
    pickle.dump(model, f)

score = model.score(X, Y) # R² on the same data used for fitting, not on the held-out test set
print("Score: ", score)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
Score:  0.7571828878441407
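
Since the model is pickled for later inference, here is a minimal sketch of the save, reload and predict round trip. It uses a small stand-in `SVR` fitted on synthetic data and a temporary file. The names `demo_model.sav`, `X_demo` and `new_molecules` are hypothetical and are not the notebook's model or data.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVR

# Small stand-in model fitted on synthetic data (not the notebook's model).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=50)
fitted = SVR(kernel="rbf", C=1).fit(X_demo, y_demo)

# Save to a temporary location; the notebook writes to Google Drive instead.
path = os.path.join(tempfile.mkdtemp(), "demo_model.sav")
with open(path, "wb") as fh:
    pickle.dump(fitted, fh)

# Later (for example in another notebook): reload and predict on new rows.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

# New rows must have the same columns, in the same order, as the training data.
new_molecules = rng.normal(size=(3, 4))
preds = restored.predict(new_molecules)
print(preds)
```

In the real workflow this is where the saved list of selected feature names becomes useful: new fingerprints have to be reduced to those same columns before calling `predict`.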


The **score of 0.7571828878441407** measures the model's performance. For a regressor, `score` returns the **coefficient of determination (R²)**.

- **R² (coefficient of determination)**:
  - Is at most 1 and can be negative when a model fits worse than always predicting the mean, as several models in the benchmark table do.
  - The closer to 1, the better the model fits the data.
  - An R² of 0.7571 means that roughly 75.71% of the variance in pIC50 is explained by the model.
  - Because the score was computed on the same data used to fit the model (`model.fit(X, Y)` followed by `model.score(X, Y)`), it is a training score and probably overestimates performance on unseen molecules; fitting on `X_train` and scoring on `X_test` would give a fairer estimate.
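
To illustrate the difference between a training score and a held-out score, the sketch below runs a grid search on the training portion only and then reports three R² values. It uses synthetic data and a reduced grid so it runs quickly. `X_demo`, `y_demo` and the grid values are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic regression data standing in for the fingerprint matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.3 * rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Reduced grid: 2 linear candidates + 4 rbf candidates = 6 candidates.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": ["scale", "auto"]},
]
search = GridSearchCV(SVR(), param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)  # tune on the training rows only

train_r2 = search.score(X_tr, y_tr)  # optimistic: data seen during fitting
cv_r2 = search.best_score_           # mean cross-validated R² of the best candidate
test_r2 = search.score(X_te, y_te)   # held-out estimate

print(search.best_params_)
print(train_r2, cv_r2, test_r2)
```

`best_score_` and the test-set score are the figures to report; the training score is mainly useful for spotting overfitting when it sits far above the other two.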