# SMILES sanitization
Sometimes we are faced with datasets containing SMILES that RDKit refuses to sanitize. This can be caused by human entry errors, or by differences between RDKit's stricter sanitization and the parsers of other toolkits. For example, RDKit will not accept an uncharged tetravalent nitrogen, whereas other toolkits may simply build the graph anyway, either disregarding the valence rules or guessing that the nitrogen should carry a charge, when the real mistake could just as well be one methyl group too many.
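The nitrogen case can be reproduced directly in RDKit. The snippet below is a small illustration, not part of the scikit-mol workflow; the three SMILES strings are chosen here for the demonstration. It compares an uncharged tetravalent nitrogen, the same skeleton with an explicit positive charge, and the uncharged form parsed with sanitization switched off.

```python
from rdkit import Chem

# Four bonds on an uncharged nitrogen: RDKit logs a valence error and returns None.
neutral = Chem.MolFromSmiles("CN(C)(C)C")

# Same skeleton with an explicit positive charge: a valid tetramethylammonium ion.
charged = Chem.MolFromSmiles("C[N+](C)(C)C")

# With sanitize=False the graph is built without any valence checking.
unchecked = Chem.MolFromSmiles("CN(C)(C)C", sanitize=False)

print("neutral parsed:  ", neutral is not None)
print("charged parsed:  ", charged is not None)
print("unchecked atoms: ", unchecked.GetNumAtoms())
```

Building the graph with `sanitize=False` shows roughly what a more permissive toolkit does: the molecule object exists, but nothing has verified that it is chemically sensible.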

In [1]:
import pandas as pd
from rdkit.Chem import PandasTools

csv_file = "../tests/data/SLC6A4_active_excapedb_subset.csv" # Hmm, maybe better to download directly
data = pd.read_csv(csv_file)



This example dataset contains only sanitizable SMILES, so for demonstration purposes we will corrupt one of them.

In [2]:
data.loc[1,'SMILES'] = 'CN(C)(C)(C)'

In [3]:

PandasTools.AddMoleculeColumnToFrame(data, smilesCol="SMILES")
print(f'Dataset contains {data.ROMol.isna().sum()} unparsable mols')

Dataset contains 1 unparsable mols


[18:05:29] Explicit valence for atom # 1 N, 4, is greater than permitted
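If the rejected rows are not needed, one simple option is to filter the frame on the ROMol column, since it is empty wherever parsing failed. The sketch below uses a small stand-in frame with made-up values rather than the real dataset, and the names `demo`, `clean` and `rejected` are only illustrative.

```python
import pandas as pd

# Stand-in for the frame above: ROMol is None where RDKit could not parse the SMILES.
# The pXC50 values and the "<mol>" placeholders are invented for this illustration.
demo = pd.DataFrame(
    {
        "SMILES": ["CCO", "CN(C)(C)(C)", "c1ccccc1"],
        "pXC50": [6.1, 7.2, 5.4],
        "ROMol": ["<mol>", None, "<mol>"],
    }
)

parsed = demo["ROMol"].notna()                 # True where a molecule was built
clean = demo[parsed].reset_index(drop=True)    # rows that are safe to pass on
rejected = demo[~parsed]                       # rows to inspect or repair by hand

print(f"kept {len(clean)} rows, rejected {len(rejected)}")
print(rejected["SMILES"].tolist())
```

Keeping the rejected rows in a separate frame makes it easy to review them before deciding whether they should be fixed or discarded.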


If we used these SMILES in a scikit-learn pipeline, we would get an error, so we need to check and clean the dataset first. The CheckSmilesSanitazion class can help us with that.

In [4]:
from scikit_mol.sanitizer import CheckSmilesSanitazion
smileschecker = CheckSmilesSanitazion()

smiles_list_valid, y_valid, smiles_errors, y_errors = smileschecker.sanitize(list(data.SMILES), list(data.pXC50))

Error in parsing 1 SMILES. Unparsable SMILES can be found in self.errors


[18:05:29] Explicit valence for atom # 1 N, 4, is greater than permitted


Now smiles_list_valid contains only valid SMILES, and y_valid has been filtered to match. The errors are returned, but are also accessible in the .errors attribute after the call to .sanitize().

In [5]:
smileschecker.errors

Unnamed: 0,SMILES,y
0,CN(C)(C)(C),7.18046


The checker can also be used on X alone, without y values.

In [6]:
smiles_list_valid, X_errors = smileschecker.sanitize(list(data.SMILES))
smileschecker.errors

Error in parsing 1 SMILES. Unparsable SMILES can be found in self.errors


[18:05:29] Explicit valence for atom # 1 N, 4, is greater than permitted


Unnamed: 0,SMILES
0,CN(C)(C)(C)
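To make the two return shapes concrete, here is a minimal sketch of the kind of split that the .sanitize() calls above perform. It is not scikit-mol's implementation: `split_by_validity` and its `is_valid` parameter are hypothetical names introduced for illustration, and the default validity check (RDKit returning a molecule) is an assumption about what "valid" means here.

```python
def split_by_validity(smiles, y=None, is_valid=None):
    """Split SMILES (and optionally y) into valid and rejected parts.

    Hypothetical helper that mirrors the return order seen above:
    (valid, errors) without y, (valid, y_valid, errors, y_errors) with y.
    """
    if is_valid is None:
        # Assumed default: a SMILES is valid if RDKit returns a molecule for it.
        from rdkit import Chem

        def is_valid(smi):
            return Chem.MolFromSmiles(smi) is not None

    flags = [is_valid(smi) for smi in smiles]
    valid = [s for s, ok in zip(smiles, flags) if ok]
    errors = [s for s, ok in zip(smiles, flags) if not ok]
    if y is None:
        return valid, errors

    y_valid = [v for v, ok in zip(y, flags) if ok]
    y_errors = [v for v, ok in zip(y, flags) if not ok]
    return valid, y_valid, errors, y_errors


# Toy check and made-up y values so the example runs without RDKit.
def toy_check(smi):
    return smi != "CN(C)(C)(C)"


toy_smiles = ["CCO", "CN(C)(C)(C)", "c1ccccc1"]
toy_y = [6.1, 7.2, 5.4]

valid, y_valid, errors, y_errors = split_by_validity(toy_smiles, toy_y, is_valid=toy_check)
valid_only, errors_only = split_by_validity(toy_smiles, is_valid=toy_check)
print(valid, y_valid, errors, y_errors)
```

Computing the validity flags once and reusing them for both X and y keeps the two lists aligned, which is the property that matters when the filtered data is later passed to a model.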
