### Introduction to QSARcons

**Purpose:**  
QSARcons is a Python package for finding the optimal consensus of Quantitative Structure–Activity Relationship (QSAR) models. It builds individual models from a range of chemical descriptors and machine learning methods, then searches for the combination of models whose joint predictions are more reliable and interpretable than those of any single model.

**Overview:**  
QSARcons offers three primary consensus search strategies:

- **Random Consensus Search**: Explores random combinations of QSAR models to identify effective ensembles.  
- **Systematic Consensus Search**: Evaluates all possible combinations to find the optimal model ensemble.  
- **Genetic Consensus Search**: Utilizes genetic algorithms to evolve and select the best-performing model combinations.
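To make the idea of a consensus search concrete, here is a minimal, library-free sketch of the systematic and random strategies. It assumes a consensus prediction is the plain mean of the member models' predictions and that candidates are scored by R2; the function names, the toy data, and both assumptions are illustrative and are not taken from the QSARcons source, which may work differently.

```python
import itertools
import random


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def consensus(preds, members):
    """Average the predictions of the selected models, sample by sample (assumed rule)."""
    columns = [preds[m] for m in members]
    return [sum(vals) / len(vals) for vals in zip(*columns)]


def systematic_search(preds, y_true, size):
    """Score every combination of `size` models and return the best one."""
    combos = itertools.combinations(sorted(preds), size)
    return max(combos, key=lambda c: r2(y_true, consensus(preds, c)))


def random_search(preds, y_true, size, n_iter, seed=0):
    """Score `n_iter` randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    names = sorted(preds)
    combos = [tuple(sorted(rng.sample(names, size))) for _ in range(n_iter)]
    return max(combos, key=lambda c: r2(y_true, consensus(preds, c)))


# Hypothetical validation data: two models with opposite biases and a constant one.
y_val = [1.0, 2.0, 3.0, 4.0]
toy_preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],  # always 0.5 too high
    "model_b": [0.5, 1.5, 2.5, 3.5],  # always 0.5 too low
    "model_c": [2.5, 2.5, 2.5, 2.5],  # predicts the mean everywhere
}

best_sys = systematic_search(toy_preds, y_val, size=2)
best_rnd = random_search(toy_preds, y_val, size=2, n_iter=20)
print(best_sys)  # ('model_a', 'model_b')
```

In this toy case the errors of `model_a` and `model_b` cancel, so their average matches the targets exactly. A random search can only match the systematic result, never beat it, but it stays affordable when the number of possible combinations is too large to enumerate.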


In [1]:
import polaris
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

from qsarcons.lazy import LazyML
from qsarcons.consensus import RandomSearchRegressor, SystematicSearchRegressor, GeneticSearchRegressor


### 1. Load data

In [2]:
# Load the benchmark from the Hub
benchmark = polaris.load_benchmark("polaris/adme-fang-solu-1")

# Get the train and test data-loaders
data_train, data_test = benchmark.get_train_test_split()

# Convert both splits to pandas DataFrames
data_train, data_test = data_train.as_dataframe(), data_test.as_dataframe()
smi_train, prop_train = data_train["smiles"].to_list(), data_train["LOG_SOLUBILITY"].to_list()

# Hold out 20% of the training data as a validation set for the consensus search
data_train, data_val = train_test_split(data_train, test_size=0.2, random_state=42)

### 2. Build multiple 2D models

In [None]:
# Placeholder target column for the test set (the true values are held by the benchmark)
data_test["LogS"] = 0

lazy_ml = LazyML(task="regression", hopt=False, output_folder="logs_bench", verbose=True)
lazy_ml.run(data_train, data_val, data_test)

### 3. Build model consensus

In [None]:
metric = "auto"
cons_size = "auto"

In [None]:
cons_methods = [
    # Baseline: a "consensus" of size 1, i.e. the single best model
    ("Best", SystematicSearchRegressor(cons_size=1, metric=metric)),
    ("Random", RandomSearchRegressor(cons_size=cons_size, n_iter=1000, metric=metric)),
    ("Systematic", SystematicSearchRegressor(cons_size=cons_size, metric=metric)),
    ("Genetic", GeneticSearchRegressor(cons_size=cons_size, n_iter=50, pop_size=50, mut_prob=0.2, metric=metric))
]
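The genetic searcher above takes `n_iter`, `pop_size`, and `mut_prob`. As a rough illustration of how such parameters typically interact, the following library-free sketch evolves fixed-size model subsets. It is an assumption-laden toy, not the QSARcons implementation: the selection, crossover, and mutation rules, the negative-MSE fitness, and all names and data are hypothetical.

```python
import random


def neg_mse(preds, y_true, members):
    """Fitness: negative mean squared error of the averaged predictions."""
    n = len(y_true)
    avg = [sum(preds[m][i] for m in members) / len(members) for i in range(n)]
    return -sum((t - p) ** 2 for t, p in zip(y_true, avg)) / n


def genetic_search(preds, y_true, cons_size, n_iter, pop_size, mut_prob, seed=0):
    """Evolve subsets of `cons_size` models for `n_iter` generations (needs pop_size >= 2)."""
    rng = random.Random(seed)
    names = sorted(preds)

    def score(individual):
        return neg_mse(preds, y_true, individual)

    # Initial population: random subsets of model names.
    population = [tuple(sorted(rng.sample(names, cons_size))) for _ in range(pop_size)]
    best = max(population, key=score)

    for _ in range(n_iter):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=score, reverse=True)
        parents = population[: max(2, pop_size // 2)]

        children = []
        while len(children) < pop_size:
            # Crossover: draw the child's members from the union of two parents.
            a, b = rng.sample(parents, 2)
            child = rng.sample(sorted(set(a) | set(b)), cons_size)
            # Mutation: with probability mut_prob, swap one member for an outside model.
            if rng.random() < mut_prob:
                outside = [m for m in names if m not in child]
                if outside:
                    child[rng.randrange(cons_size)] = rng.choice(outside)
            children.append(tuple(sorted(child)))

        population = children
        # Elitism: never lose the best subset seen so far.
        best = max(population + [best], key=score)

    return best


# Hypothetical validation targets and model predictions.
y_val = [1.0, 2.0, 3.0, 4.0]
toy_preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],
    "model_b": [0.5, 1.5, 2.5, 3.5],
    "model_c": [1.0, 1.0, 1.0, 1.0],
    "model_d": [4.0, 3.0, 2.0, 1.0],
}

best = genetic_search(toy_preds, y_val, cons_size=2, n_iter=10, pop_size=6, mut_prob=0.2)
print(best, neg_mse(toy_preds, y_val, best))
```

In a scheme like this, a larger `pop_size` and `n_iter` explore more subsets at higher cost, while `mut_prob` controls how often a model from outside the parents is tried.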

In [None]:
# load model predictions
df_val = pd.read_csv("logs_bench/val.csv")
df_test = pd.read_csv("logs_bench/test.csv")

# skip first two columns (smiles and true property value)
x_val, true_val = df_val.iloc[:, 2:], df_val.iloc[:, 1]
x_test = df_test.iloc[:, 2:]
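The slicing above relies on a fixed column layout: identifier first, true value second, one column per model after that. The sketch below reproduces that split on a tiny in-memory CSV using only the standard library. The `Y_TRUE` header matches the column used later in this notebook; the model column names and all values are made up for illustration, and the real files written by `LazyML` may contain different headers.

```python
import csv
import io

# Hypothetical prediction table in the assumed layout:
# column 0 = SMILES, column 1 = true value, columns 2+ = one prediction per model.
toy_csv = io.StringIO(
    "smiles,Y_TRUE,model_a,model_b\n"
    "CCO,0.5,0.4,0.7\n"
    "c1ccccc1,-1.2,-1.0,-1.5\n"
)

reader = csv.reader(toy_csv)
header = next(reader)
rows = list(reader)

# Equivalent of df.iloc[:, 1]: the true property values.
y_true = [float(row[1]) for row in rows]

# Equivalent of df.iloc[:, 2:]: every model's predictions, keyed by model name.
model_names = header[2:]
x = {name: [float(row[i + 2]) for row in rows] for i, name in enumerate(model_names)}

print(model_names)  # ['model_a', 'model_b']
```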

In [None]:
for name, cons_searcher in cons_methods:
    
    # run search
    best_cons = cons_searcher.run(x_val, true_val)
    
    # make val and test predictions
    pred_val = cons_searcher._consensus_predict(x_val[best_cons])
    pred_test = cons_searcher._consensus_predict(x_test[best_cons])
    
    # store the consensus predictions as new columns
    df_val[name] = pred_val
    df_test[name] = pred_test

### 4. Summarize results

In [None]:
# Validation R2 for every individual model and every consensus column
res = pd.DataFrame()
for model in df_val.columns[2:]:
    res.loc[model, "R2"] = r2_score(df_val["Y_TRUE"], df_val[model])

In [None]:
res.sort_values(by="R2", ascending=False)

In [None]:
# Score the genetic consensus on the benchmark test set
y_pred = df_test["Genetic"].to_list()
results = benchmark.evaluate(y_pred)
results