# Step Forward Cross Validation for Bioactivity Prediction

# Downloading the Datasets
Based on the dataset provided by [Landrum & Riniker](https://pubs.acs.org/doi/10.1021/acs.jcim.4c00049) in "Combining IC50 or Ki Values from Different Sources Is a Source of Significant Noise".

The [dataset](https://github.com/rinikerlab/overlapping_assays) can be downloaded from GitHub.

In [1]:
%%bash

rm -rf ../benchmark/landrum/
git clone https://github.com/rinikerlab/overlapping_assays.git ../benchmark/landrum/
mkdir -p ../benchmark/data/raw
mv ../benchmark/landrum/datasets/IC50_datasets.yaml ../benchmark/data/
gzip -d ../benchmark/landrum/datasets/source_data/*IC50.csv.gz
mv ../benchmark/landrum/datasets/source_data/*IC50.csv ../benchmark/data/raw/
rm -rf ../benchmark/landrum/

Cloning into '../benchmark/landrum'...
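
As a quick sanity check after the download, the sketch below lists the CSV files that ended up in the raw data directory. The helper name `list_raw_csvs` is hypothetical (it is not part of the original notebook), and the default path assumes the directory layout created by the cell above.

```python
from pathlib import Path


def list_raw_csvs(raw_dir="../benchmark/data/raw"):
    """Return the sorted names of the .csv files found in raw_dir.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    raw_path = Path(raw_dir)
    if not raw_path.is_dir():
        # Directory is missing, e.g. the download cell has not been run yet.
        return []
    return sorted(p.name for p in raw_path.glob("*.csv"))


raw_files = list_raw_csvs()
print(f"{len(raw_files)} raw CSV files found")
```

An empty result most likely means the clone or the `gzip -d` step did not complete.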


# Standardizing SMILES
Standardize the SMILES and report canonical SMILES, remove SMILES that fail standardization, deduplicate the rest, and save the results in `benchmark/data/standardized`.

In [2]:
import os

import molvs
import pandas as pd
from rdkit import Chem, RDLogger
from tqdm import tqdm

In [3]:
RDLogger.DisableLog("rdApp.*")

In [4]:
os.makedirs('../benchmark/data/standardized/', exist_ok=True)

In [5]:
md = molvs.metal.MetalDisconnector()
lfc = molvs.fragment.LargestFragmentChooser()
uc = molvs.charge.Uncharger()

## Note on Standardization of SMILES

Standardization starts by checking that the input SMILES (from the dataset) is valid. It then disconnects any metals, chooses the largest organic fragment (here, organic means a fragment with at least one carbon), and finally uncharges the molecule. If the resulting molecule is valid, it is converted to canonical SMILES and returned as a string. If there is any error in this process, `None` is returned.

In [6]:
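def standardize_smiles(smiles):
    """Return the standardized canonical SMILES, or None if any step fails."""
    try:
        std_smiles = molvs.standardize.standardize_smiles(smiles)
        std_mol = Chem.MolFromSmiles(std_smiles)
        std_mol = md.disconnect(std_mol)  # disconnect metals
        std_mol = lfc.choose(std_mol)  # keep the largest organic fragment
        std_mol = uc.uncharge(std_mol)  # neutralize charges where possible
        std_smi = Chem.MolToSmiles(std_mol)
    except Exception:
        return None
    # validate_smiles returns a list of log messages; an empty list means no issues.
    if not molvs.validate.validate_smiles(std_smi):
        # Tautomer canonicalization is skipped because it is too slow:
        # std_smi = molvs.standardize.canonicalize_tautomer_smiles(std_smi)
        return std_smi
    return None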

In [7]:
for fname in tqdm(os.listdir('../benchmark/data/raw/'), desc="standardize"):
    if fname.endswith('.csv'):
        df = pd.read_csv(f'../benchmark/data/raw/{fname}')
        df["standardized_smiles"] = df["canonical_smiles"].apply(standardize_smiles)
        df.dropna(subset=["standardized_smiles"], inplace=True)
        df.drop_duplicates(subset=["standardized_smiles"], inplace=True)
        df.to_csv(f'../benchmark/data/standardized/{fname}', index=False)

standardize: 100%|██████████| 67/67 [01:24<00:00,  1.26s/it]
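
To see how many rows survive the drop and deduplication steps, raw and standardized row counts can be compared per file. The following is a minimal sketch using only the standard library; the helper names `count_rows` and `retention_report` are hypothetical and not part of the original notebook, and it assumes each CSV has a single header row.

```python
import csv
from pathlib import Path


def count_rows(csv_path):
    """Count data rows (excluding the header) in a CSV file."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if present
        return sum(1 for _ in reader)


def retention_report(raw_dir, std_dir):
    """Map each file name present in both directories to (raw_rows, kept_rows)."""
    report = {}
    for raw_file in sorted(Path(raw_dir).glob("*.csv")):
        std_file = Path(std_dir) / raw_file.name
        if std_file.exists():
            report[raw_file.name] = (count_rows(raw_file), count_rows(std_file))
    return report
```

Calling `retention_report('../benchmark/data/raw/', '../benchmark/data/standardized/')` would return one `(raw_rows, kept_rows)` pair per dataset, which makes it easy to spot files that lost an unexpectedly large share of their rows.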