<a href="https://colab.research.google.com/github/KagakuAI/QSARcons/blob/main/colab/Notebook_1_QSARcons_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction to QSARcons

**Purpose:**  
``QSARcons`` is a package designed to identify the optimal consensus of Quantitative Structure–Activity Relationship (QSAR) models. It leverages various chemical descriptors and machine learning methods to combine multiple QSAR models.

**Overview:**  
QSARcons offers three primary consensus search strategies:

- **Random search**: Explores random combinations of N QSAR models.  
- **Systematic search**: Sorts all models by their accuracy metric on the validation set and selects the top N.  
- **Genetic search**: Uses a genetic algorithm to evolve and select the best-performing combinations of N models.
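
As a rough illustration of the random and systematic strategies, the sketch below uses plain Python on made-up validation predictions. It assumes that a consensus prediction is the arithmetic mean of its member models' predictions and that R² is the accuracy metric; neither is stated in the description above. The function names (`consensus`, `systematic_search`, `random_search`) and the toy models are hypothetical and are not the QSARcons API.

```python
import random

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def consensus(preds, names):
    """Assumed consensus rule: average the member predictions per compound."""
    columns = [preds[name] for name in names]
    return [sum(values) / len(columns) for values in zip(*columns)]

def systematic_search(preds, y_true, n):
    """Rank single models by validation R2 and keep the top n."""
    ranked = sorted(preds, key=lambda name: r2(y_true, preds[name]), reverse=True)
    return ranked[:n]

def random_search(preds, y_true, n, n_iter=200, seed=0):
    """Score n_iter random subsets of size n and keep the best-scoring one."""
    rng = random.Random(seed)
    names = sorted(preds)
    best, best_score = None, float("-inf")
    for _ in range(n_iter):
        subset = rng.sample(names, n)
        score = r2(y_true, consensus(preds, subset))
        if score > best_score:
            best, best_score = subset, score
    return best

# Toy validation data: four hypothetical models, five compounds.
y_val = [1.0, 2.0, 3.0, 4.0, 5.0]
val_preds = {
    "model_a": [1.2, 2.1, 2.9, 4.3, 4.8],  # small errors
    "model_b": [0.6, 1.7, 3.4, 3.6, 5.5],  # larger errors, mostly opposite in sign
    "model_c": [3.0, 3.0, 3.0, 3.0, 3.0],  # always predicts the mean
    "model_d": [5.0, 4.0, 3.0, 2.0, 1.0],  # anti-correlated with the truth
}

top2 = systematic_search(val_preds, y_val, n=2)
rand2 = random_search(val_preds, y_val, n=2)
print("Systematic:", top2, r2(y_val, consensus(val_preds, top2)))
print("Random:    ", sorted(rand2), r2(y_val, consensus(val_preds, rand2)))
```

On this toy data the two-model consensus scores higher than either of its members, because the errors of `model_a` and `model_b` partly cancel when averaged.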

In [1]:
!pip install qsarcons



In [2]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

from qsarcons.lazy import LazyML
from qsarcons.consensus import RandomSearch, SystematicSearch, GeneticSearch

### 1. Load data

In [3]:
data_train = pd.read_csv("https://raw.githubusercontent.com/molecularinformatics/Computational-ADME/main/MPNN/ADME_rPPB_train.csv")
data_test = pd.read_csv("https://raw.githubusercontent.com/molecularinformatics/Computational-ADME/main/MPNN/ADME_rPPB_test.csv")
data_train

Unnamed: 0,smiles,activity
0,Cc1cnc(C(=O)NCCc2ccc(S(=O)(=O)NC(=O)NC3CCCCC3)...,0.350248
1,CC(C)[C@@](C)(O)[C@@H]1CN(c2nc(-c3[nH]nc4ncccc...,0.622421
2,CC(C)(C)NC(=O)NCCN1CCC(CNC(=O)c2cc(Cl)cc(Cl)c2)C1,1.144574
3,Cc1ccc(OCC(O)C(C)NC(C)C)c2c1CCC2,1.334253
4,O=C(Nc1cccnc1)c1ccnc(NC(=O)C2CC2)c1,1.615287
...,...,...
703,Cc1ccc(S(=O)(=O)Nc2c(C(=O)NC(C)C(C)(C)C)c(C)nn...,0.659916
704,CN(C)C(=O)C1(Cc2ccccc2-c2ccccc2)CCN(C(=O)c2cnn...,1.146841
705,Cc1ccc(S(=O)(=O)Nc2c(C(=O)NCC(C)(C)C)c(C)nn2-c...,0.525045
706,CC(C)NCC(O)COc1cccc2[nH]ccc12,1.767527


In [4]:
# Hold out 20% of the training data as a validation set
data_train, data_val = train_test_split(data_train, test_size=0.2, random_state=42)

### 2. Build multiple 2D models

**For demonstration purposes, hyperparameter optimization is disabled (``hopt=False``) to speed up the pipeline. Enabling it is recommended when more computational resources are available.**

In [5]:
output_folder = "adme_bench"
lazy_ml = LazyML(task="regression", hopt=False, output_folder=output_folder, verbose=True)
lazy_ml.run(data_train, data_val, data_test)

[1/133] Running model: avalon|RidgeRegression
  ↳ Finished in 0.02 min | Memory usage: 0.780 GB
[2/133] Running model: avalon|PLSRegression
  ↳ Finished in 0.02 min | Memory usage: 0.780 GB
[3/133] Running model: avalon|LinearSVR
  ↳ Finished in 0.09 min | Memory usage: 0.780 GB
[4/133] Running model: avalon|MLPRegressor
  ↳ Finished in 0.25 min | Memory usage: 0.780 GB
[5/133] Running model: avalon|RandomForestRegressor
  ↳ Finished in 0.17 min | Memory usage: 0.780 GB
[6/133] Running model: avalon|XGBRegressor
  ↳ Finished in 0.03 min | Memory usage: 0.780 GB
[7/133] Running model: avalon|CatBoostRegressor
  ↳ Finished in 0.92 min | Memory usage: 0.781 GB
[8/133] Running model: rdkit|RidgeRegression
  ↳ Finished in 0.01 min | Memory usage: 0.818 GB
[9/133] Running model: rdkit|PLSRegression
  ↳ Finished in 0.01 min | Memory usage: 0.799 GB
[10/133] Running model: rdkit|LinearSVR
  ↳ Finished in 0.08 min | Memory usage: 0.799 GB
[11/133] Running model: rdkit|MLPRegressor
  ↳ Finished 

### 3. Build model consensus

In [12]:
metric = "auto"
cons_size = "auto"

In [13]:
cons_methods = [
    ("Best", SystematicSearch(cons_size=1, metric=metric)),
    ("Random", RandomSearch(cons_size=cons_size, n_iter=1000, metric=metric)),
    ("Systematic", SystematicSearch(cons_size=cons_size, metric=metric)),
    ("Genetic", GeneticSearch(cons_size=cons_size, n_iter=50, pop_size=50, mut_prob=0.2, metric=metric))
]
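
The `n_iter`, `pop_size`, and `mut_prob` arguments passed to `GeneticSearch` above are the usual knobs of a genetic algorithm: number of generations, population size, and mutation probability. The following is a minimal, dependency-free sketch of how such a search over fixed-size model subsets could work. It is not the QSARcons implementation: the selection scheme, the crossover rule, the mean-based consensus, and all names (`genetic_search`, `fitness`, the toy models) are assumptions made for illustration.

```python
import random

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fitness(subset, preds, y_true):
    """Validation R2 of the averaged predictions of the models in subset."""
    columns = [preds[name] for name in subset]
    averaged = [sum(values) / len(columns) for values in zip(*columns)]
    return r2(y_true, averaged)

def genetic_search(preds, y_true, size, n_iter=50, pop_size=50, mut_prob=0.2, seed=0):
    rng = random.Random(seed)
    names = sorted(preds)
    # Initial population: random subsets of the requested size.
    population = [tuple(sorted(rng.sample(names, size))) for _ in range(pop_size)]
    for _ in range(n_iter):
        # Selection: the fitter half survives unchanged and acts as parents.
        population.sort(key=lambda s: fitness(s, preds, y_true), reverse=True)
        parents = population[: max(2, pop_size // 2)]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: draw the child from the union of both parents.
            pool = sorted(set(mother) | set(father))
            child = rng.sample(pool, size)
            # Mutation: with probability mut_prob, swap one member for an unused model.
            unused = [name for name in names if name not in child]
            if unused and rng.random() < mut_prob:
                child[rng.randrange(size)] = rng.choice(unused)
            children.append(tuple(sorted(child)))
        population = parents + children
    return max(population, key=lambda s: fitness(s, preds, y_true))

# Toy validation data: five hypothetical models, five compounds.
y_val = [1.0, 2.0, 3.0, 4.0, 5.0]
val_preds = {
    "model_a": [1.2, 2.1, 2.9, 4.3, 4.8],
    "model_b": [0.6, 1.7, 3.4, 3.6, 5.5],
    "model_c": [3.0, 3.0, 3.0, 3.0, 3.0],
    "model_d": [5.0, 4.0, 3.0, 2.0, 1.0],
    "model_e": [1.5, 1.5, 3.5, 3.5, 5.5],
}

best = genetic_search(val_preds, y_val, size=2, n_iter=10, pop_size=10)
print(best, fitness(best, val_preds, y_val))
```

Because the fitter half of each generation is carried over unchanged, the best subset found so far is never lost between generations.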

In [14]:
# load model predictions
df_val = pd.read_csv(f"{output_folder}/val.csv")
df_test = pd.read_csv(f"{output_folder}/test.csv")

# skip first two columns (smiles and true property value)
x_val, true_val = df_val.iloc[:, 2:], df_val.iloc[:, 1]
x_test = df_test.iloc[:, 2:]

In [15]:
for name, cons_searcher in cons_methods:

    # run search
    best_cons = cons_searcher.run(x_val, true_val)
    print(name)
    print(best_cons)

    # make val and test predictions
    pred_val = cons_searcher.predict_cons(x_val[best_cons])
    pred_test = cons_searcher.predict_cons(x_test[best_cons])

    # store the consensus predictions as new columns for later scoring
    df_val[name] = pred_val
    df_test[name] = pred_test

Best
Index(['desc2D|CatBoostRegressor'], dtype='object')
Random
['maccs|CatBoostRegressor', 'secfp|RidgeRegression', 'topological|PLSRegression', 'desc2D|CatBoostRegressor']
Systematic
['desc2D|CatBoostRegressor', 'desc2D|RandomForestRegressor', 'desc2D|RidgeRegression', 'maccs|CatBoostRegressor']
Genetic
['maccs|CatBoostRegressor', 'desc2D|CatBoostRegressor', 'desc2D|LinearSVR']


### 4. Summarize results

Validation set performance

In [16]:
res = pd.DataFrame()
for model in df_val.columns[2:]:
    res.loc[model, "R2"] = r2_score(df_val["Y_TRUE"], df_val[model])
res.sort_values(by="R2", ascending=False)

Unnamed: 0,R2
Genetic,0.542244
Systematic,0.536236
Best,0.525959
desc2D|CatBoostRegressor,0.525959
Random,0.504566
...,...
pharm2D-gobbi|LinearSVR,-0.481040
atompair-count|LinearSVR,-0.505303
topological|LinearSVR,-0.716816
avalon|RidgeRegression,-0.928701


Test set performance

In [17]:
res = pd.DataFrame()
for model in df_test.columns[2:]:
    res.loc[model, "R2"] = r2_score(df_test["Y_TRUE"], df_test[model])
res.sort_values(by="R2", ascending=False)

Unnamed: 0,R2
Best,0.549745
desc2D|CatBoostRegressor,0.549745
Genetic,0.537987
Systematic,0.536988
desc2D|RidgeRegression,0.502346
...,...
pharm2D-gobbi|LinearSVR,-0.225169
topological|LinearSVR,-0.256672
pharm2D-pmapper|LinearSVR,-0.278695
avalon|RidgeRegression,-0.643586
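
The negative scores at the bottom of both tables follow from the definition of the coefficient of determination that `r2_score` computes, where $y_i$ are the true values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the true values:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A model that always predicts $\bar{y}$ scores 0, so a negative $R^2$ means the model's squared error on that set is larger than that of this constant baseline.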
