# Pipelining the scikit-mol transformer

One of the most useful features of scikit-learn is its pipelines. With pipelines, different scikit-learn transformers can be stacked and operated on as a single model object. In this example we will build a simple model that predicts directly on RDKit molecules and then expand it to one that predicts directly on SMILES strings.

First, some imports and a dataset.

In [1]:
import os
import rdkit
from rdkit import Chem
from rdkit.Chem import PandasTools
import pandas as pd
import matplotlib.pyplot as plt
from time import time
import numpy as np

In [2]:
csv_file = "../tests/data/SLC6A4_active_excapedb_subset.csv" # Hmm, maybe better to download directly
data = pd.read_csv(csv_file)

The dataset is a subset of the SLC6A4 actives from ExcapeDB. The compounds were hand-selected to give some test set performance despite the small size. They are provided as example data only and should not be used to build serious QSAR models.

We add RDKit mol objects to the dataframe with PandasTools and check that all conversions went well.

In [3]:
PandasTools.AddMoleculeColumnToFrame(data, smilesCol="SMILES")
print(f"{data.ROMol.isna().sum()} out of {len(data)} SMILES failed in conversion")

0 out of 200 SMILES failed in conversion
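The check above works because RDKit returns `None` for a SMILES string it cannot parse, and pandas counts `None` as missing. A minimal sketch of what a failed conversion looks like and how such rows could be dropped is shown here. The toy dataframe, the deliberately broken string and the names `toy`, `n_failed` and `valid` are made up for illustration, and RDKit will log a parse error for the bad entry.

```python
import pandas as pd
from rdkit import Chem

# Toy frame, invented for illustration; the last entry is not valid SMILES.
toy = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "not_a_smiles"]})

# Chem.MolFromSmiles returns None when parsing fails.
toy["ROMol"] = toy["SMILES"].map(Chem.MolFromSmiles)

# Count the failures the same way as in the cell above.
n_failed = int(toy["ROMol"].isna().sum())
print(f"{n_failed} out of {len(toy)} SMILES failed in conversion")

# Keep only the rows that converted successfully.
valid = toy[toy["ROMol"].notna()]
```

Dropping the failed rows before the train/test split keeps `None` entries out of the fingerprint transformer.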


Then, let's import some tools from scikit-learn and two transformers from scikit-mol

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scikit_mol.fingerprints import MorganFingerprintTransformer
from scikit_mol.conversions import SmilesToMolTransformer

In [5]:
mol_list_train, mol_list_test, y_train, y_test = train_test_split(data.ROMol, data.pXC50, random_state=0)

After a split into train and test, we'll build the first pipeline

In [6]:
pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
print(pipe)

Pipeline(steps=[('mol_transformer', MorganFingerprintTransformer()),
                ('Regressor', Ridge())])


We can do the fit by simply providing the list of RDKit molecule objects

In [7]:
pipe.fit(mol_list_train, y_train)
print(f"Train score is :{pipe.score(mol_list_train,y_train):0.2F}")
print(f"Test score is  :{pipe.score(mol_list_test, y_test):0.2F}")

Train score is :1.00
Test score is  :0.55
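The numbers printed by `pipe.score` are coefficients of determination (R²), because a scikit-learn pipeline delegates `score` to its final step and regressors such as `Ridge` report R². A small sketch of that equivalence follows. To stay independent of the molecule data it assumes a synthetic numeric dataset and a `StandardScaler` in place of the fingerprint transformer, so the names `toy_pipe`, `X` and `y` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

# A stand-in pipeline: any transformer followed by a regressor behaves the same way.
toy_pipe = Pipeline([("scaler", StandardScaler()), ("Regressor", Ridge())])
toy_pipe.fit(X, y)

# Pipeline.score runs the transformers and then calls the regressor's score (R^2).
score_from_pipe = toy_pipe.score(X, y)
score_from_metric = r2_score(y, toy_pipe.predict(X))
```

A train score near 1.00 next to a clearly lower test score, as seen above, is the usual sign of overfitting on a small dataset.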


Never mind the performance or the exact value of the prediction; this is for demonstration purposes. We can easily predict on lists of molecules.

In [8]:
pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)[OH]')])

array([6.00400299])

We can also expand the already fitted pipeline. How about creating a pipeline that can predict directly from SMILES? With scikit-mol that is easy!

In [9]:
smiles_pipe = Pipeline([('smiles_transformer', SmilesToMolTransformer()), ('pipe', pipe)])
print(smiles_pipe)

Pipeline(steps=[('smiles_transformer', SmilesToMolTransformer()),
                ('pipe',
                 Pipeline(steps=[('mol_transformer',
                                  MorganFingerprintTransformer()),
                                 ('Regressor', Ridge())]))])


In [10]:
smiles_pipe.predict(['c1ccccc1C(=O)[OH]'])

array([6.00400299])

From here, the pipelines could be pickled and later loaded for easy prediction on RDKit molecule objects or SMILES in other scripts. The transformation with the MorganFingerprintTransformer will be the same as during fitting, so there is no need to remember whether radius 2 or 3 was used for this or that model, as it is already in the pipeline itself. If we need to see the parameters for a particular pipeline or model, we can always get the non-default settings via print, or all settings with .get_params().

In [11]:
smiles_pipe.get_params()

{'memory': None,
 'steps': [('smiles_transformer', SmilesToMolTransformer()),
  ('pipe',
   Pipeline(steps=[('mol_transformer', MorganFingerprintTransformer()),
                   ('Regressor', Ridge())]))],
 'verbose': False,
 'smiles_transformer': SmilesToMolTransformer(),
 'pipe': Pipeline(steps=[('mol_transformer', MorganFingerprintTransformer()),
                 ('Regressor', Ridge())]),
 'smiles_transformer__parallel': False,
 'smiles_transformer__safe_inference_mode': False,
 'pipe__memory': None,
 'pipe__steps': [('mol_transformer', MorganFingerprintTransformer()),
  ('Regressor', Ridge())],
 'pipe__verbose': False,
 'pipe__mol_transformer': MorganFingerprintTransformer(),
 'pipe__Regressor': Ridge(),
 'pipe__mol_transformer__fpSize': 2048,
 'pipe__mol_transformer__parallel': False,
 'pipe__mol_transformer__radius': 2,
 'pipe__mol_transformer__safe_inference_mode': False,
 'pipe__mol_transformer__useBondTypes': True,
 'pipe__mol_transformer__useChirality': False,
 'pipe__mol_t