Copyright (c) 2021 Pharmacelera S.L.

All rights reserved.

Description: Example of interpretation by atom/fragment coloring from scratch

Usage: Define a molecule featurization function and a model prediction function, then run this script

In [4]:
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd
import joblib
from interpret import Interpret_image

Define a function to featurize molecules. It should take RDKit Mol objects as input.

In [2]:
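def smiles2mols(smiles):
    # Accept a single SMILES string or a list/array of them
    if not isinstance(smiles, (list, np.ndarray)):
        smiles = [smiles]
    # Note: Chem.MolFromSmiles returns None for SMILES it cannot parse
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    return mols


def featurized(mols):
    # Accept a single Mol or a list/array of them
    if not isinstance(mols, (list, np.ndarray)):
        mols = [mols]
    # 2048-bit Morgan fingerprints with radius 2, one row per molecule
    fps = [Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048) for m in mols]
    return np.array(fps)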

Read the first 1001 records from a solubility dataset (.loc[:1000] includes the end label, so rows 0 through 1000 are kept)

In [5]:
soldata=pd.read_csv('./curated-solubility-dataset.csv').loc[:1000,:]
soldata.head(2)

Unnamed: 0,ID,Name,InChI,InChIKey,SMILES,Solubility,SD,Ocurrences,Group,MolWt,...,NumRotatableBonds,NumValenceElectrons,NumAromaticRings,NumSaturatedRings,NumAliphaticRings,RingCount,TPSA,LabuteASA,BalabanJ,BertzCT
0,A-3,"N,N,N-trimethyloctadecan-1-aminium bromide",InChI=1S/C21H46N.BrH/c1-5-6-7-8-9-10-11-12-13-...,SZEMGTQCPRNXEG-UHFFFAOYSA-M,[Br-].CCCCCCCCCCCCCCCCCC[N+](C)(C)C,-3.616127,0.0,1,G1,392.51,...,17.0,142.0,0.0,0.0,0.0,0.0,0.0,158.520601,0.0,210.377334
1,A-4,Benzo[cd]indol-2(1H)-one,InChI=1S/C11H7NO/c13-11-8-5-1-3-7-4-2-6-9(12-1...,GPYLCFQEKPUWLD-UHFFFAOYSA-N,O=C1Nc2cccc3cccc1c23,-3.254767,0.0,1,G1,169.183,...,0.0,62.0,2.0,0.0,1.0,3.0,29.1,75.183563,2.582996,511.229248


Featurize the molecules and train a random forest regressor

In [6]:
X=featurized(smiles2mols(soldata['SMILES'].values))
Y=soldata['Solubility'].values
model=RandomForestRegressor(n_estimators=200)
model.fit(X,Y)

RandomForestRegressor(n_estimators=200)

Initialize the interpretation class Interpret_image()

This class takes four parameters:

predict: the model's prediction function.

featurize_mol: a function that featurizes molecules containing no dummy atoms.

featurize_mol_dummy: a function that featurizes molecules containing dummy atoms, if needed. Default is None.

scaler: a scaler for the descriptors. Default is None.

In [7]:
inter_ex=Interpret_image(predict=model.predict,featurize_mol=featurized)
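
The contract between the three callables can be illustrated with a minimal sketch. It is not taken from the interpret module: make_scorer, the toy featurizer, the toy model, and DoubleScaler are hypothetical names, and the assumption that the scaler exposes a transform() method applied between featurization and prediction follows scikit-learn conventions rather than anything stated in this notebook.

```python
def make_scorer(predict, featurize_mol, scaler=None):
    """Chain featurize -> (optional) scale -> predict into one function.

    Hypothetical helper; assumes the scaler has a transform() method.
    """
    def score(mols):
        features = featurize_mol(mols)
        if scaler is not None:
            features = scaler.transform(features)
        return predict(features)
    return score


# Toy stand-ins so the sketch runs without RDKit or scikit-learn
def toy_featurize(mols):
    # One feature per "molecule": the length of its string representation
    return [[len(m)] for m in mols]


def toy_predict(features):
    # One prediction per feature row
    return [sum(row) for row in features]


class DoubleScaler:
    def transform(self, features):
        return [[2 * x for x in row] for row in features]


plain = make_scorer(toy_predict, toy_featurize)
scaled = make_scorer(toy_predict, toy_featurize, scaler=DoubleScaler())
print(plain(["CCO", "C"]), scaled(["CCO", "C"]))
```

The point of the sketch is only that featurize_mol must return one feature row per molecule in a form that predict accepts, and that a scaler, when given, has to be the one fitted on the training descriptors.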

Interpret one molecule by calling the get_image() function

Parameters:

smiles: a single SMILES string to interpret.

label: label of the molecule.

im_path: path where images are saved. Default is the current path.

level: level of interpretation. atom: atom level. frag: fragment level, with fragments defined by the Get_Fragment_lst method. self-defined: fragments given by a user-defined list. Default is 'frag'.

frag_lst: list of fragments, required when level is 'self-defined', e.g. [[1],[2],[3,4,5],[6,7]]. Default is None.
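
Before passing a self-defined frag_lst, it may help to check it for out-of-range or duplicated atom indices. The helper below is a hypothetical sketch, not part of the interpret module. Whether get_image() expects 0-based or 1-based atom indices is not stated here, so the first legal index is left as a parameter.

```python
def check_frag_lst(frag_lst, n_atoms, first_index=0):
    """Validate a user-defined fragment list (hypothetical helper).

    Raises ValueError for empty fragments, out-of-range indices, or an atom
    assigned to more than one fragment. Returns the sorted list of atom
    indices that belong to no fragment.
    """
    seen = set()
    last_index = first_index + n_atoms - 1
    for frag in frag_lst:
        if not frag:
            raise ValueError("empty fragment")
        for idx in frag:
            if not isinstance(idx, int) or not first_index <= idx <= last_index:
                raise ValueError(f"atom index {idx!r} is out of range")
            if idx in seen:
                raise ValueError(f"atom index {idx} appears in more than one fragment")
            seen.add(idx)
    return sorted(set(range(first_index, last_index + 1)) - seen)


# The example list from the parameter description, checked against a
# molecule assumed to have 8 atoms indexed from 0
unassigned = check_frag_lst([[1], [2], [3, 4, 5], [6, 7]], n_atoms=8)
print(unassigned)
```

For a real molecule, n_atoms could be taken from mol.GetNumAtoms() on the RDKit Mol object.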

In [8]:
smiles=soldata.loc[1,'SMILES']
ID=soldata.loc[1,'ID']
inter_ex.get_image(smiles=smiles,label=ID,im_path='./example',level='frag')

Predicted value of A-4 : -3.66 
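
This notebook does not describe how Interpret_image derives the colors. One common way to obtain per-fragment weights, and one that would be consistent with the dummy-atom featurizer mentioned above, is occlusion: mask a fragment, predict again, and take the difference from the unmasked prediction. The sketch below illustrates that general idea only, under that assumption. It works on a toy token list instead of RDKit molecules, and all names in it are hypothetical.

```python
def occlusion_weights(tokens, fragments, featurize, predict):
    """Weight of each fragment = full prediction - prediction with it masked."""
    baseline = predict(featurize(tokens))
    weights = []
    for frag in fragments:
        # Replace every atom of the fragment with a dummy token "*"
        masked = ["*" if i in frag else t for i, t in enumerate(tokens)]
        weights.append(baseline - predict(featurize(masked)))
    return baseline, weights


# Toy stand-ins: features are element counts, the model is a fixed linear rule
def toy_featurize(tokens):
    return [tokens.count("C"), tokens.count("O"), tokens.count("N")]


def toy_predict(features):
    n_c, n_o, n_n = features
    return -0.5 * n_c + 1.0 * n_o + 0.75 * n_n


tokens = ["C", "C", "O", "N"]
fragments = [[0, 1], [2], [3]]
baseline, weights = occlusion_weights(tokens, fragments, toy_featurize, toy_predict)
print(baseline, weights)
```

In this toy setting a positive weight means the fragment raises the predicted value and a negative weight means it lowers it, which is the kind of signal a coloring scheme can map onto atoms or fragments.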
