### Introduction to QSARcons

**Purpose:**  
``QSARcons`` is a package for finding the best consensus of Quantitative Structure–Activity Relationship (QSAR) models. It builds individual models from combinations of chemical descriptors and machine learning methods, then searches for the subset of models whose combined predictions perform best.

**Overview:**  
QSARcons offers three primary consensus search strategies:

- **Random search**: explores random combinations of N QSAR models.
- **Systematic search**: sorts all models by their accuracy metric on the validation set and selects the top N.
- **Genetic search**: uses a genetic algorithm to evolve and select the best-performing combination of N models.
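
To make the first two strategies concrete, here is a minimal sketch on synthetic data. It is not the QSARcons implementation: it assumes that a consensus prediction is the plain mean of the selected models' predictions and that the metric is R2, and all names (`preds`, `consensus_score`, `systematic_search`, `random_search`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Toy validation set: true values plus predictions from 6 hypothetical models,
# each one being the truth corrupted by a different amount of noise.
y_true = pd.Series(rng.normal(size=200))
noise_levels = [0.3, 0.5, 0.7, 0.9, 1.1, 1.3]
preds = pd.DataFrame(
    {f"model_{i}": y_true + rng.normal(scale=s, size=200) for i, s in enumerate(noise_levels)}
)

def consensus_score(columns):
    # Assumption: the consensus is the mean of the selected models' predictions.
    return r2_score(y_true, preds[list(columns)].mean(axis=1))

def systematic_search(n):
    # Rank models by their individual validation R2 and keep the top n.
    ranked = sorted(preds.columns, key=lambda c: r2_score(y_true, preds[c]), reverse=True)
    return ranked[:n]

def random_search(n, n_iter=200):
    # Sample random subsets of size n and keep the best-scoring one.
    names = preds.columns.to_numpy()
    best, best_score = None, -np.inf
    for _ in range(n_iter):
        candidate = list(rng.choice(names, size=n, replace=False))
        score = consensus_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

top3 = systematic_search(3)
rand3 = random_search(3)
print("systematic:", top3, round(consensus_score(top3), 3))
print("random:    ", rand3, round(consensus_score(rand3), 3))
```

The systematic strategy only looks at each model in isolation, whereas the random strategy scores whole subsets, so it can pick up combinations whose errors partly cancel.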

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

from qsarcons.lazy import LazyML
from qsarcons.consensus import RandomSearch, SystematicSearch, GeneticSearch

### 1. Load data

In [2]:
data_train = pd.read_csv("https://raw.githubusercontent.com/molecularinformatics/Computational-ADME/main/MPNN/ADME_RLM_train.csv")
data_test = pd.read_csv("https://raw.githubusercontent.com/molecularinformatics/Computational-ADME/main/MPNN/ADME_RLM_test.csv")
data_train

Unnamed: 0,smiles,activity
0,Cc1nc(C2CCNCC2)cc(N(C)C)n1,1.027920
1,Cc1nccc(N(C)Cc2ccc(N)nc2)n1,1.027920
2,CC(=O)N1CCN([C@H]2CCN(C(=O)c3cc4cc(C)ccc4o3)CC...,2.183557
3,c1ccc(CCn2cnc3c2CCN(c2ncccn2)C3)cc1,3.399640
4,CCc1nc2cc(-c3c(OC)cccc3OC)ccc2o1,3.575703
...,...,...
2438,COc1ccc(-c2cc(C(C)(C)C)nc(-c3ncccn3)n2)cc1,1.877740
2439,Cc1nc(-c2ccc(Cl)c(Cl)c2)n(CC(C)(C)O)n1,1.027920
2440,Cc1ccccc1CNC(=O)c1ccc(-n2ccnc2)nc1,1.548119
2441,NC(=O)c1csc(CN2CCc3cc(F)ccc3C2)c1,2.400699


In [3]:
# Hold out 20% of the training data as a validation set
data_train, data_val = train_test_split(data_train, test_size=0.2, random_state=42)

### 2. Build multiple 2D models

In [None]:
lazy_ml = LazyML(task="regression", hopt=True, output_folder="adme_bench", verbose=True)
lazy_ml.run(data_train, data_val, data_test)

[1/133] Running model: avalon|RidgeRegression
  ↳ Finished in 0.01 min | Memory usage: 0.537 GB
[2/133] Running model: avalon|PLSRegression
  ↳ Finished in 0.02 min | Memory usage: 0.520 GB
[3/133] Running model: avalon|LinearSVR
  ↳ Finished in 0.03 min | Memory usage: 0.520 GB
[4/133] Running model: avalon|MLPRegressor


### 3. Build model consensus

In [None]:
metric = "auto"
cons_size = "auto"

In [None]:
cons_methods = [
    ("Best", SystematicSearch(cons_size=1, metric=metric)),
    ("Random", RandomSearch(cons_size=cons_size, n_iter=1000, metric=metric)),
    ("Systematic", SystematicSearch(cons_size=cons_size, metric=metric)),
    ("Genetic", GeneticSearch(cons_size=cons_size, n_iter=50, pop_size=50, mut_prob=0.2, metric=metric))
]
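
The `n_iter`, `pop_size` and `mut_prob` arguments above suggest a conventional genetic algorithm. The sketch below shows one common way such a search over fixed-size model subsets can work. It is an illustration under stated assumptions, not the `GeneticSearch` implementation: it assumes `n_iter` is the number of generations, `pop_size` the number of candidate subsets per generation, `mut_prob` the per-position mutation probability, and mean-of-predictions as the consensus rule. All function names are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Toy validation data: 12 hypothetical models of decreasing quality.
n_samples, n_models = 200, 12
y_true = rng.normal(size=n_samples)
preds = y_true[:, None] + rng.normal(
    scale=np.linspace(0.3, 1.5, n_models), size=(n_samples, n_models)
)

def fitness(subset):
    # Assumption: consensus = mean of the selected models, scored with R2.
    return r2_score(y_true, preds[:, subset].mean(axis=1))

def mutate(subset, mut_prob):
    # With probability mut_prob, swap each member for a model not in the subset.
    subset = list(subset)
    for i in range(len(subset)):
        if rng.random() < mut_prob:
            unused = [m for m in range(n_models) if m not in subset]
            subset[i] = int(rng.choice(unused))
    return sorted(subset)

def crossover(a, b, size):
    # The child draws its members from the union of both parents.
    pool = sorted(set(a) | set(b))
    return sorted(int(m) for m in rng.choice(pool, size=size, replace=False))

def genetic_search(cons_size=3, n_iter=20, pop_size=20, mut_prob=0.2):
    population = [
        sorted(int(m) for m in rng.choice(n_models, size=cons_size, replace=False))
        for _ in range(pop_size)
    ]
    for _ in range(n_iter):
        # Keep the fitter half unchanged and refill with mutated children.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            child = crossover(parents[i], parents[j], cons_size)
            children.append(mutate(child, mut_prob))
        population = parents + children
    return max(population, key=fitness)

best = genetic_search()
print(best, round(fitness(best), 3))
```

Because the fitter half of each generation is carried over unchanged, the best score in the population never decreases from one generation to the next.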

In [None]:
# load model predictions
df_val = pd.read_csv("adme_bench/val.csv")
df_test = pd.read_csv("adme_bench/test.csv")

# skip first two columns (smiles and true property value)
x_val, true_val = df_val.iloc[:, 2:], df_val.iloc[:, 1]
x_test = df_test.iloc[:, 2:]

In [None]:
for name, cons_searcher in cons_methods:
    
    # run search
    best_cons = cons_searcher.run(x_val, true_val)
    
    # make val and test predictions
    pred_val = cons_searcher.predict_cons(x_val[best_cons])
    pred_test = cons_searcher.predict_cons(x_test[best_cons])
    
    # store the consensus predictions as new columns
    df_val[name] = pred_val
    df_test[name] = pred_test

### 4. Summarize results

In [None]:
# R2 of every individual model and consensus on the validation set
res = pd.DataFrame()
for model in df_val.columns[2:]:
    res.loc[model, "R2"] = r2_score(df_val["Y_TRUE"], df_val[model])
res.sort_values(by="R2", ascending=False)

In [None]:
# R2 of every individual model and consensus on the test set
res = pd.DataFrame()
for model in df_test.columns[2:]:
    res.loc[model, "R2"] = r2_score(df_test["Y_TRUE"], df_test[model])
res.sort_values(by="R2", ascending=False)
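
Since the consensus is selected on the validation set, it can help to view validation and test scores side by side. The sketch below does this on a tiny made-up table. It assumes the same layout as the prediction files above (structure column, then `Y_TRUE`, then one column per model); the helper name `r2_table` and the toy values are hypothetical.

```python
import pandas as pd
from sklearn.metrics import r2_score

def r2_table(df, true_col="Y_TRUE", first_pred_col=2):
    # One R2 score per prediction column, starting at first_pred_col.
    scores = {col: r2_score(df[true_col], df[col]) for col in df.columns[first_pred_col:]}
    return pd.Series(scores, name="R2")

# Made-up predictions for illustration only.
toy_val = pd.DataFrame({
    "SMILES": ["C", "CC", "CCC", "CCCC"],
    "Y_TRUE": [1.0, 2.0, 3.0, 4.0],
    "model_a": [1.1, 1.9, 3.2, 3.8],
    "model_b": [2.0, 2.0, 3.0, 3.0],
})
toy_test = pd.DataFrame({
    "SMILES": ["CO", "CCO", "CCCO", "CCCCO"],
    "Y_TRUE": [1.0, 2.0, 3.0, 4.0],
    "model_a": [1.5, 1.5, 3.5, 3.5],
    "model_b": [1.0, 2.0, 3.0, 3.0],
})

summary = pd.concat(
    [r2_table(toy_val).rename("R2_val"), r2_table(toy_test).rename("R2_test")],
    axis=1,
).sort_values("R2_val", ascending=False)
print(summary)
```

A large gap between the two columns for a given row is a hint that the corresponding model or consensus was overfitted to the validation set.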