# Pipeline for building ChEMBL Database

ChEMBL is a database of bioactive molecules with drug-like properties. It contains information about how small molecules interact with their protein targets, how these compounds affect cells and whole organisms, etc. 

We will use SMILES (Simplified Molecular Input Line Entry System): a notation used to represent chemical structures (arrangement of atoms and bonds in a molecule) as a string.

Filter the SMILES in the ChEMBL database (disregarding all other information) for single molecules. Hint: find an entry that contains more than one molecule and break down its syntax to work out what to filter for.

Then filter for organic substances only; we don't want metal ions and the like.

In [1]:
pip install chembl-webresource-client

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Investigate ChEMBL Database

The following information is provided for each molecule in ChEMBL:

In [3]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
molecule[1]

{'atc_classifications': [],
 'availability_type': -1,
 'biotherapeutic': None,
 'chemical_probe': 0,
 'chirality': -1,
 'cross_references': [],
 'dosed_ingredient': False,
 'first_approval': None,
 'first_in_class': -1,
 'helm_notation': None,
 'inorganic_flag': -1,
 'max_phase': None,
 'molecule_chembl_id': 'CHEMBL6328',
 'molecule_hierarchy': {'active_chembl_id': 'CHEMBL6328',
  'molecule_chembl_id': 'CHEMBL6328',
  'parent_chembl_id': 'CHEMBL6328'},
 'molecule_properties': {'alogp': '1.33',
  'aromatic_rings': 3,
  'full_molformula': 'C18H12N4O3',
  'full_mwt': '332.32',
  'hba': 6,
  'hbd': 1,
  'heavy_atoms': 25,
  'mw_freebase': '332.32',
  'np_likeness_score': '-1.59',
  'num_ro5_violations': 0,
  'psa': '108.61',
  'qed_weighted': '0.73',
  'ro3_pass': 'N',
  'rtb': 3},
 'molecule_structures': {'canonical_smiles': 'Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1',
  'molfile': '\n     RDKit          2D\n\n 25 27  0  0  0  0  0  0  0  0999 V2000\n    5.2792   -2.0500    0.0000 C
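
Since only the SMILES string is needed downstream, the rest of each record can be discarded. The following is a minimal sketch of pulling `canonical_smiles` out of a record shaped like the dictionary above. `get_canonical_smiles` is a hypothetical helper name, and the sketch assumes that an entry without a structure may carry `None` under `molecule_structures`.

```python
def get_canonical_smiles(record):
    """Return the canonical SMILES from a ChEMBL-style molecule record, or None."""
    # `or {}` guards against records whose 'molecule_structures' is None (assumed possible)
    structures = record.get("molecule_structures") or {}
    return structures.get("canonical_smiles")


# Trimmed-down copy of the record shown above
record = {
    "molecule_chembl_id": "CHEMBL6328",
    "molecule_structures": {
        "canonical_smiles": "Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1",
    },
}

print(get_canonical_smiles(record))
```

Returning `None` instead of raising keeps the helper usable in a list comprehension over many records, where missing structures can be dropped afterwards.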

In [3]:
pip install deepchem

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd #used to view and analyse data
import requests #used to access json files
import tensorflow as tf
import deepchem as dc
import sqlite3

2026-02-16 21:26:22.531903: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
No normalization for SPS. Feature removed!
No normalization for AvgIpc. Feature removed!
No normalization for NumAmideBonds. Feature removed!
No normalization for NumAtomStereoCenters. Feature removed!
No normalization for NumBridgeheadAtoms. Feature removed!
No normalization for NumHeterocycles. Feature removed!
No normalization for NumSpiroAtoms. Feature removed!
No normalization for NumUnspecifiedAtomStereoCenters. Feature removed!
No normalization for Phi. Feature removed!


Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/users/k2587772/.local/lib/python3.9/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'lightning'
Skipped loading some Jax models, missing a dependency. No module named 'jax'


## Extract single organic molecules as SMILES

In [5]:
#Define path on HPC that contains file with all SMILES
path = '/scratch/prj/rcmb_genai_transition/chembl_36/chembl_36_sqlite/chembl_36_smiles.csv' 

# Load CSV
df = pd.read_csv(path)

# Check the first few rows
print("Total SMILES:", len(df))
print(df.head())


Total SMILES: 2854815
                     canonical_smiles
0               B.CC(=O)OC1CN2CCC1CC2
1           B.CP(c1ccccc1)c1ccc(O)cc1
2    B.Oc1ccc(P(c2ccccc2)c2ccccc2)cc1
3  BC#N.BC#N.CN(C)CCCCCCCCCCCCCCN(C)C
4    BC#N.BC#N.CN(C)CCCCCCCCCCCCN(C)C


Find SMILES that do not describe a single molecule, for example ion pairs (`[Na+].[Cl-]`), salts (`CCO.Cl`), and mixtures (`CCO.CCO`).
In SMILES, a `.` separates disconnected components, so these entries can be recognised by the presence of a dot.
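
To make the dot rule concrete, here is a small sketch in plain Python that splits a SMILES string into its disconnected components. The helper names `split_components` and `is_single_molecule` are made up for illustration, and the sketch assumes canonical SMILES in which `.` is only used to separate components.

```python
def split_components(smiles):
    """Split a SMILES string into its dot-separated components."""
    return smiles.split(".")


def is_single_molecule(smiles):
    """True if the SMILES string has exactly one component (no dot)."""
    return "." not in smiles


examples = ["[Na+].[Cl-]", "CCO.Cl", "CCO.CCO", "B.CC(=O)OC1CN2CCC1CC2", "CCO"]
for smi in examples:
    print(smi, split_components(smi), is_single_molecule(smi))
```

Only the last example, `CCO`, passes the single-molecule check; every other entry splits into two components.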

In [7]:
multi_molecule = df[df['canonical_smiles'].str.contains('.', regex=False)]  # literal dot, not a regex wildcard
print("Total number of multi-molecules:", len(multi_molecule))
print(multi_molecule.head())


Total number of multi-molecules: 118896
                     canonical_smiles
0               B.CC(=O)OC1CN2CCC1CC2
1           B.CP(c1ccccc1)c1ccc(O)cc1
2    B.Oc1ccc(P(c2ccccc2)c2ccccc2)cc1
3  BC#N.BC#N.CN(C)CCCCCCCCCCCCCCN(C)C
4    BC#N.BC#N.CN(C)CCCCCCCCCCCCN(C)C


Keep only SMILES without a dot (keep single molecules)

In [8]:
df_single = df[~df['canonical_smiles'].str.contains('.', regex=False)].copy() # ~ inverts the booleans; regex=False matches a literal dot
print("After filtering single molecules:", len(df_single))
pd.set_option('display.max_colwidth', None) # Increase column width to display the full SMILES output
print(df_single.head())


After filtering single molecules: 2735919
                                          canonical_smiles
28                          BOP(=O)(O)COCCn1cnc2c(N)ncnc21
29                   BOP(=O)(O)CO[C@H](C)Cn1cnc2c(N)ncnc21
30         BP(=O)(COCCn1cnc2c(N)ncnc21)OP(=O)(O)OP(=O)(O)O
31  BP(=O)(CO[C@H](C)Cn1cnc2c(N)ncnc21)OP(=O)(O)OP(=O)(O)O
32      BP(=O)(O)CC[C@@H]1C=C[C@H](n2cc(C)c(=O)[nH]c2=O)O1


Filter for organic molecules (here: any molecule that contains at least one carbon atom):

In [11]:
from rdkit import Chem

def is_organic(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False  # invalid SMILES (RDKit could not parse or sanitize it)
    # Treat a molecule as organic if it contains at least one carbon atom
    return any(atom.GetAtomicNum() == 6 for atom in mol.GetAtoms())
 
# Apply filter
df_organic = df_single[df_single['canonical_smiles'].apply(is_organic)].copy()

print("Number of single organic molecules:", len(df_organic))
print(df_organic.head())

[11:34:19] Can't kekulize mol.  Unkekulized atoms: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
[11:37:26] Can't kekulize mol.  Unkekulized atoms: 1 2 3 4 5 6 10 11 15 16 17 19 20 21
[11:38:24] Explicit valence for atom # 1 As, 7, is greater than permitted
[11:38:54] Explicit valence for atom # 3 Ar, 1, is greater than permitted


Number of single organic molecules: 2735766
                                          canonical_smiles
28                          BOP(=O)(O)COCCn1cnc2c(N)ncnc21
29                   BOP(=O)(O)CO[C@H](C)Cn1cnc2c(N)ncnc21
30         BP(=O)(COCCn1cnc2c(N)ncnc21)OP(=O)(O)OP(=O)(O)O
31  BP(=O)(CO[C@H](C)Cn1cnc2c(N)ncnc21)OP(=O)(O)OP(=O)(O)O
32      BP(=O)(O)CC[C@@H]1C=C[C@H](n2cc(C)c(=O)[nH]c2=O)O1
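
The carbon check above keeps any molecule with at least one carbon atom, so a compound that also contains a metal would still pass. As a hedged sketch of a stricter pre-filter, the code below lists the elements in a SMILES string at the string level and compares them with a whitelist. The regular expressions, the `ALLOWED_ELEMENTS` set and both function names are assumptions made for illustration; this is a rough text-level check, not a replacement for parsing with RDKit.

```python
import re

# One atom token: a bracket atom, a two-letter halogen, or a bare organic-subset atom
ATOM_TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope number, then the element symbol
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2}|\*)")

# Assumed whitelist of elements considered acceptable in an organic molecule
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}


def elements_in_smiles(smiles):
    """Return the set of element symbols that appear in a SMILES string."""
    elements = set()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            symbol = match.group(1) if match else "*"
        else:
            symbol = token
        # Aromatic atoms are lowercase in SMILES (c, n, se); normalise to C, N, Se
        elements.add(symbol.capitalize())
    return elements


def looks_organic(smiles, allowed=ALLOWED_ELEMENTS):
    """True if the string contains carbon and only whitelisted elements."""
    elements = elements_in_smiles(smiles)
    return "C" in elements and elements <= allowed


for smi in ["CCO", "[Na+]", "O", "C[Hg]C", "BOP(=O)(O)COCCn1cnc2c(N)ncnc21"]:
    print(smi, sorted(elements_in_smiles(smi)), looks_organic(smi))
```

Whether boron, silicon or selenium belong in the whitelist depends on the downstream use, so the set is best treated as a parameter rather than a fixed definition.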


In [None]:
#DB still includes very large, highly charged molecules.
#optional if needed: remove large molecules

#from rdkit.Chem import Descriptors

#def mw_ok(smiles, max_mw=800):
    #mol = Chem.MolFromSmiles(smiles)
    #return mol and Descriptors.MolWt(mol) <= max_mw

# Apply filter
#df_mw = df_organic[df_organic['canonical_smiles'].apply(mw_ok)].copy()

#print("Number of small organic molecules:", len(df_mw))
#print(df_mw.head())

In [12]:
# Save filtered datasets to a file
import pickle

with open('my_saved_vars.pkl', 'wb') as f:
    pickle.dump({'df_organic': df_organic, 'df_single': df_single}, f)


In [9]:
#Load saved datasets
import pickle

with open('my_saved_vars.pkl', 'rb') as f:
    data = pickle.load(f)

df_organic = data['df_organic']
df_single = data['df_single']

## Convert SMILES into 3D RDKit molecules with coordinates

For each SMILES string (which encodes only the atoms and their connectivity), compute the 3D coordinates and features using RDKit.

In [23]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem import rdmolops
import py3Dmol

### Function that converts a SMILES string into a 3D RDKit molecule with coordinates

In [21]:
def smiles_to_3d(smiles, optimize=True, forcefield="MMFF", random_seed=42):
    # optimize: bool, whether to run a force-field geometry optimization after embedding
    # Parse the SMILES text into RDKit's internal molecular graph representation
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Invalid SMILES string")

    # Add explicit hydrogens
    mol = Chem.AddHs(mol)

    # Generate 3D coordinates with ETKDGv3, using a fixed seed for reproducibility
    params = AllChem.ETKDGv3()
    params.randomSeed = random_seed
    res = AllChem.EmbedMolecule(mol, params)  # returns the conformer ID (0) on success, -1 on failure

    if res != 0:
        raise RuntimeError("3D embedding failed")

    # Optimize geometry
    if optimize:
        if forcefield.upper() == "MMFF" and AllChem.MMFFHasAllMoleculeParams(mol):
            AllChem.MMFFOptimizeMolecule(mol)
        else:
            AllChem.UFFOptimizeMolecule(mol)

    return mol

### Function that extracts 3D coordinates

In [19]:
def get_coordinates(mol):
    conf = mol.GetConformer()
    coords = []

    for atom in mol.GetAtoms():
        pos = conf.GetAtomPosition(atom.GetIdx())
        coords.append((atom.GetSymbol(), pos.x, pos.y, pos.z))

    return coords
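
The list of `(symbol, x, y, z)` tuples returned by `get_coordinates` maps directly onto the plain-text XYZ format (atom count, comment line, then one atom per line). A minimal sketch of that conversion follows; `coords_to_xyz` is a hypothetical helper, and the water-like coordinates are illustrative values rather than output from this notebook.

```python
def coords_to_xyz(coords, comment=""):
    """Format (symbol, x, y, z) tuples as an XYZ-format string."""
    lines = [str(len(coords)), comment]
    for symbol, x, y, z in coords:
        lines.append(f"{symbol:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    return "\n".join(lines) + "\n"


# Illustrative coordinates only (not produced by RDKit here)
demo_coords = [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)]
xyz_text = coords_to_xyz(demo_coords, comment="demo molecule")
print(xyz_text)
```

With a real molecule, the call would be `coords_to_xyz(get_coordinates(mol))`, and the result can be written to a `.xyz` file for inspection in any molecular viewer.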

#### Test using a single SMILES

In [24]:
test_smiles1 = "c1ccc2c(c1)c(c[nH]2)C[C@@H](N)O" 
test_smiles2 = "N=c1ccc2c(-c3ccccc3)c3ccc(N)cc3sc-2c1"

#Check if embedding was successful and if RDKit created a 3D conformer
mol1 = smiles_to_3d(test_smiles1)
mol2 = smiles_to_3d(test_smiles2)

print(mol1.GetNumConformers())
print(mol2.GetNumConformers())

#Check 3D coordinates
coords1 = get_coordinates(mol1)
coords2 = get_coordinates(mol2)

print(coords1)
print(coords2)


#3D visualization
def show_mol_3d(mol, style="stick"):
    mb = Chem.MolToMolBlock(mol)
    viewer = py3Dmol.view(width=400, height=400)
    viewer.addModel(mb, "sdf")
    viewer.setStyle({style: {}})
    viewer.zoomTo()
    return viewer

show_mol_3d(mol1)

1
1
[('C', 2.452993002966186, -1.6351405026979, -0.8657310609726656), ('C', 3.375461901318738, -0.9602845764767858, -0.07024444455881501), ('C', 3.0383164502782614, 0.23980896516333097, 0.5623571618253148), ('C', 1.7437877064811582, 0.7360045470522322, 0.36582963046635814), ('C', 0.7971984589962373, 0.08603974516984894, -0.42675747941820724), ('C', 1.16333062830393, -1.1241038299703066, -1.051652622076133), ('C', -0.402332341630369, 0.8700425681695013, -0.39632602668830175), ('C', -0.14058573617860845, 1.9650704984988685, 0.40498911715230934), ('N', 1.1466314890753815, 1.8726601319320566, 0.8573890362428366), ('C', -1.695966829160974, 0.575019318417071, -1.1064772600613875), ('C', -2.730692723142108, -0.22510397271259983, -0.28466324882292915), ('N', -2.244515107072717, -1.493228121122459, 0.21165987500019998), ('O', -3.1836166067705376, 0.5306286234425203, 0.818632075223322), ('H', 2.7334666196387505, -2.568527509381819, -1.347142576245887), ('H', 4.372987386181327, -1.372642840567096

<py3Dmol.view at 0x742b74089ee0>

#### Test using 100 extracted SMILES

In [26]:
df_reset = df_organic.reset_index(drop=True) #Reset index to 0...N-1
smiles_100 = df_reset.loc[0:99, "canonical_smiles"].tolist() #convert dataframe to a list of SMILES
print(len(smiles_100))

mols_3d = []

#Check if embedding was successful and if RDKit created a 3D conformer
#Loop over each SMILES
for i, smi in enumerate(smiles_100):
    try:
        # Generate 3D molecule
        mol = smiles_to_3d(smi)
        # Append only successful embeddings
        mols_3d.append(mol)
    except Exception as e:
        # Skip failed embeddings
        print(f"Skipped molecule {i}: {smi} | Reason: {e}")

print(f"Successfully embedded molecules: {len(mols_3d)} out of {len(smiles_100)}")

#Check 3D coordinates
coords50 = get_coordinates(mols_3d[50])
print(coords50)

#3D visualization
show_mol_3d(mols_3d[50])


100
Successfully embedded molecules: 100 out of 100
[('B', 0.9684506699861042, -3.558689786127131, -1.3443121363220345), ('P', 2.2941689912812984, -2.247922939564742, -1.0446558701284314), ('O', 2.2932881200309847, -1.879092934488863, 0.4198778254713094), ('O', 3.7919096621387856, -2.9786057231948013, -1.411427557365782), ('C', 4.776736839786339, -1.9797644654753874, -1.545986574201239), ('C', 6.1570916550663295, -2.599343479484326, -1.2710643343265657), ('O', 7.16303873052698, -1.6689236393132547, -1.6179278388864844), ('C', 7.921570092142633, -1.3823976976755037, -0.458503831853596), ('N', 8.220180777586837, 0.049189283185618195, -0.38117506727296613), ('C', 7.6955807929577755, 1.058972131105568, -1.129159861439951), ('N', 8.210209243916383, 2.2678326020815907, -0.7857855939456727), ('C', 7.90494000503686, 3.558720907501572, -1.399617663865956), ('C', 9.087380060179921, 1.9982866453062533, 0.2057272400959066), ('C', 9.913989723738158, 2.8361734765785305, 0.9505975216308474), ('O', 9.

<py3Dmol.view at 0x742cf2806250>

## Extract features using RDKit and parse them into .chem format

To start with, the following features should be extracted: all hydrogen-bond donors and acceptors, hydrophobicity (for which there are various measures), and aromatic systems.

In [27]:
from rdkit import Chem
from rdkit.Chem import ChemicalFeatures
from rdkit import RDConfig
import os

In [28]:
def get_feature_factory():
    fdef_name = os.path.join(RDConfig.RDDataDir, 'BaseFeatures.fdef')
    return ChemicalFeatures.BuildFeatureFactory(fdef_name)

feature_factory = get_feature_factory()

In [32]:
def extract_features(mol):
    """
    Extract pharmacophore-like features with 3D coordinates
    """
    conf = mol.GetConformer()  # fails early if the molecule has no 3D conformer
    features = feature_factory.GetFeaturesForMol(mol)

    extracted = []

    for feat in features:
        feat_type = feat.GetType()              # feature type, e.g. SingleAtomDonor (feat.GetFamily() gives the family, e.g. Donor)
        atom_ids = feat.GetAtomIds()             # tuple of atom indices
        pos = feat.GetPos()                      # RDGeom.Point3D

        extracted.append({
            "type": feat_type,
            "atom_ids": atom_ids,
            "x": pos.x,
            "y": pos.y,
            "z": pos.z
        })

    return extracted


In [34]:
features50 = extract_features(mols_3d[50])  # features of molecule 50 from the 100-molecule test set
for f in features50:
    print(f)


{'type': 'SingleAtomDonor', 'atom_ids': (14,), 'x': 9.94757872404973, 'y': 4.211326893436957, 'z': 0.7364034042766434}
{'type': 'SingleAtomDonor', 'atom_ids': (17,), 'x': 11.522912507182275, 'y': 0.3344615154559938, 'z': 3.13849424855552}
{'type': 'SingleAtomDonor', 'atom_ids': (21,), 'x': 7.994378570675755, 'y': -2.2486516930976967, 'z': 1.7944626423324168}
{'type': 'SingleAtomDonor', 'atom_ids': (23,), 'x': 6.966186599050679, 'y': -4.2233101935125505, 'z': 0.34760211809441177}
{'type': 'SingleAtomDonor', 'atom_ids': (47,), 'x': -11.863476119696248, 'y': 2.048669710732983, 'z': 0.26523383543242074}
{'type': 'SingleAtomDonor', 'atom_ids': (50,), 'x': -7.558269317641036, 'y': 3.5810281571836997, 'z': -1.1716975070897881}
{'type': 'SingleAtomDonor', 'atom_ids': (54,), 'x': -6.25432674775868, 'y': -3.393340113627674, 'z': -0.8497428495666361}
{'type': 'SingleAtomDonor', 'atom_ids': (56,), 'x': -4.5161892108597, 'y': -1.8774595051446779, 'z': -2.135396569214644}
{'type': 'SingleAtomAccepto
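
Printing every feature quickly becomes hard to read for larger molecules. As a sketch, the dictionaries returned by `extract_features` can be summarised per feature type with the standard library. `summarise_features` is a hypothetical helper; the two donor entries reuse atom indices from the output above, while the `Arom6` entry is an illustrative example and coordinates are omitted for brevity.

```python
from collections import Counter, defaultdict


def summarise_features(features):
    """Count features per type and collect the atom indices involved in each type."""
    counts = Counter(f["type"] for f in features)
    atoms_by_type = defaultdict(set)
    for f in features:
        atoms_by_type[f["type"]].update(f["atom_ids"])
    return counts, dict(atoms_by_type)


# Sample input in the same shape as the extract_features output
sample_features = [
    {"type": "SingleAtomDonor", "atom_ids": (14,)},
    {"type": "SingleAtomDonor", "atom_ids": (17,)},
    {"type": "Arom6", "atom_ids": (0, 1, 2, 3, 4, 5)},  # illustrative entry
]

counts, atoms_by_type = summarise_features(sample_features)
print(counts)
print(atoms_by_type)
```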

## Convert molecules into .chem format for SMIFer tool

In [5]:
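def mol_to_chem(mol, resname="TSC"):
    # NOTE: get_ring_atoms, atom_hphobicity, get_acceptor_pivot and
    # needs_geometry_constraint are not defined in this notebook; they must be
    # provided elsewhere before this function can run.
    stacking = []
    hba = []
    hbd = []
    hphob = {}

    # Match on the feature family (Aromatic, Hydrophobe, Acceptor, Donor).
    # GetType() returns subtypes such as 'SingleAtomDonor', which never equal these names.
    for feat in feature_factory.GetFeaturesForMol(mol):
        family = feat.GetFamily()
        atom_ids = feat.GetAtomIds()

        if family == "Aromatic":
            ring = get_ring_atoms(mol, atom_ids)
            stacking.append(ring)

        elif family == "Hydrophobe":
            for idx in atom_ids:
                atom = mol.GetAtomWithIdx(idx)
                hphob[idx] = atom_hphobicity(atom, True)

        elif family == "Acceptor":
            tail, head = get_acceptor_pivot(mol, atom_ids[0])
            hba.append((tail, head))

        elif family == "Donor":
            # The donor branch reuses get_acceptor_pivot, as in the original code
            tail, head = get_acceptor_pivot(mol, atom_ids[0])
            constrained = needs_geometry_constraint(
                mol.GetAtomWithIdx(head)
            )
            hbd.append((tail, head, constrained))

    return stacking, hphob, hba, hbd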


In [7]:
#Identify the number of heavy atoms for each SMILES
import numpy as np
from rdkit import Chem
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

# Work on a copy
df_stats = df_organic.copy()

def count_heavy_atoms(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return np.nan  # unparsable SMILES
    return mol.GetNumHeavyAtoms()

# Add heavy atom count column
df_stats["num_heavy_atoms"] = df_stats["canonical_smiles"].progress_apply(count_heavy_atoms)


  1%|          | 14529/2735766 [00:02<06:33, 6913.11it/s]


KeyboardInterrupt: 

In [11]:
df_stats[1:10]

|    | canonical_smiles | num_heavy_atoms |
|----|------------------|-----------------|
| 29 | BOP(=O)(O)CO[C@H](C)Cn1cnc2c(N)ncnc21 | 20 |
| 30 | BP(=O)(COCCn1cnc2c(N)ncnc21)OP(=O)(O)OP(=O)(O)O | 26 |
| 31 | BP(=O)(CO[C@H](C)Cn1cnc2c(N)ncnc21)OP(=O)(O)OP... | 27 |
| 32 | BP(=O)(O)CC[C@@H]1C=C[C@H](n2cc(C)c(=O)[nH]c2=... | 20 |
| 33 | BP(=O)(O)CC[C@H]1O[C@@H](n2cc(C)c(=O)[nH]c2=O)... | 23 |
| 34 | BP(=O)(OC[C@@H]1C=C[C@H](n2cc(C)c(=O)[nH]c2=O)... | 30 |
| 35 | BP(=O)(OC[C@@H]1C=C[C@H](n2cnc3c(O)nc(N)nc32)C... | 32 |
| 36 | BP(=O)(OC[C@@H]1CC[C@H](n2cc(C)c(=O)[nH]c2=O)O... | 30 |
| 37 | BP(=O)(OC[C@@H]1CC[C@H](n2ccc(=O)[nH]c2=O)O1)O... | 29 |


In [15]:
# Save df_stats
df_stats.to_pickle("df_stats.pkl")

In [8]:
#Load df_stats
df_stats = pd.read_pickle("df_stats.pkl")

In [9]:
df_reset = df_organic.reset_index(drop=True) #Reset index to 0...N-1
smiles_all = df_reset["canonical_smiles"].tolist() #convert dataframe to a list of SMILES
print(len(smiles_all))

2735766


3D embedding fails for very large molecules. These molecules can therefore be filtered out from the ChEMBL database. 

In [10]:
#Compute statistics on that to see what the average and median sizes are to determine a cutoff
mean_atoms   = df_stats["num_heavy_atoms"].mean()
median_atoms = df_stats["num_heavy_atoms"].median()
min_atoms    = df_stats["num_heavy_atoms"].min()
max_atoms    = df_stats["num_heavy_atoms"].max()

print(f"Mean heavy atoms   : {mean_atoms:.1f}")
print(f"Median heavy atoms : {median_atoms:.1f}")
print(f"Min heavy atoms    : {min_atoms}")
print(f"Max heavy atoms    : {max_atoms}")

df_stats["num_heavy_atoms"].describe()


Mean heavy atoms   : 31.1
Median heavy atoms : 28.0
Min heavy atoms    : 1
Max heavy atoms    : 681


count    2.735766e+06
mean     3.105954e+01
std      1.818822e+01
min      1.000000e+00
25%      2.300000e+01
50%      2.800000e+01
75%      3.400000e+01
max      6.810000e+02
Name: num_heavy_atoms, dtype: float64

Most molecules are small: 75% of the molecules in the dataset have ≤ 34 heavy atoms.

The mean (31.1) is greater than the median (28.0), meaning the distribution is right-skewed.

I will attempt to use 34 heavy atoms (the 75th percentile) as the main cutoff to exclude very large molecules.
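
The reasoning behind a percentile-based cutoff can be sketched with the standard library on a toy list of heavy-atom counts. The numbers below are made up to mimic a right-skewed distribution with one very large outlier; they are not taken from ChEMBL, and `summarise_sizes` and `keep_up_to` are hypothetical helper names.

```python
import statistics


def summarise_sizes(heavy_atom_counts):
    """Return the mean, median and 75th percentile of a list of heavy-atom counts."""
    # method="inclusive" interpolates linearly between the observed values
    _, _, q3 = statistics.quantiles(heavy_atom_counts, n=4, method="inclusive")
    return {
        "mean": statistics.mean(heavy_atom_counts),
        "median": statistics.median(heavy_atom_counts),
        "q3": q3,
    }


def keep_up_to(values, cutoff):
    """Keep only the values that do not exceed the cutoff."""
    return [v for v in values if v <= cutoff]


toy_counts = [10, 20, 23, 25, 28, 30, 34, 40, 681]  # toy data with one huge outlier
stats = summarise_sizes(toy_counts)
kept = keep_up_to(toy_counts, stats["q3"])

print(stats)
print(f"Kept {len(kept)} out of {len(toy_counts)} molecules")
```

A single outlier pulls the mean far above the median, which is the same signature of right skew seen in the real statistics, while the 75th percentile is barely affected by it.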

In [11]:
df_filtered = df_stats[df_stats["num_heavy_atoms"] <= 34]

print(f"Kept {len(df_filtered):,} out of {len(df_stats):,} molecules")

smiles_filtered = df_filtered["canonical_smiles"].tolist()


Kept 2,073,960 out of 2,735,766 molecules
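
The batch script further down reads its input from `smiles_filtered.pkl`, so the filtered list has to be written to disk first. A minimal sketch of that step is shown here with a pickle round trip; `save_smiles_list` and `load_smiles_list` are hypothetical helpers, and the demo writes to a temporary directory so that it has no side effects.

```python
import os
import pickle
import tempfile


def save_smiles_list(smiles, path):
    """Pickle a list of SMILES strings to `path`."""
    with open(path, "wb") as f:
        pickle.dump(list(smiles), f)
    return path


def load_smiles_list(path):
    """Load a pickled list of SMILES strings from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


demo_smiles = ["CCO", "c1ccccc1", "BOP(=O)(O)COCCn1cnc2c(N)ncnc21"]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = save_smiles_list(demo_smiles, os.path.join(tmp_dir, "smiles_filtered.pkl"))
    restored = load_smiles_list(demo_path)

print(restored == demo_smiles)
```

In the notebook itself the call would be `save_smiles_list(smiles_filtered, "smiles_filtered.pkl")`, run once after the cutoff has been applied.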


In [None]:
import os
import pickle
from tqdm import tqdm
import multiprocessing as mp
from rdkit import Chem
from rdkit.Chem import AllChem

smiles_file = "smiles_filtered.pkl"  # list of SMILES
output_dir = "mols_3d_batches"       # output folder to save batches
batch_size = 50000                    # molecules per batch
n_cores = 16                          # HPC cores to use

os.makedirs(output_dir, exist_ok=True)

# Load SMILES
with open(smiles_file, "rb") as f:
    smiles_filtered = pickle.load(f)

# ETKDG embedding function
def embed_smiles(smi):
    try:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            return None
        mol = Chem.AddHs(mol)  # add hydrogens
        # ETKDG embedding, no minimization
        res = AllChem.EmbedMolecule(mol, useExpTorsionAngles=True, useRandomCoords=True)
        if res != 0:
            return None  # embedding failed (-1), so the molecule has no conformer
        return mol
    except Exception:
        return None


# Batch processing function
def process_batch(batch_smiles, batch_idx):
    mols = []
    for smi in tqdm(batch_smiles, desc=f"Batch {batch_idx}", ncols=100):
        mol = embed_smiles(smi)
        if mol is not None:
            mols.append(mol)
    # Save batch to disk
    batch_file = os.path.join(output_dir, f"mols_3d_batch{batch_idx}.pkl")
    with open(batch_file, "wb") as f:
        pickle.dump(mols, f)
    print(f"Saved {len(mols)} molecules to {batch_file}")
    return len(mols)

# Split SMILES into batches
batches = [smiles_filtered[i:i+batch_size] 
           for i in range(0, len(smiles_filtered), batch_size)]

# Multiprocessing
def worker(batch_args):
    return process_batch(*batch_args)

if __name__ == "__main__":
    batch_args_list = [(batches[i], i) for i in range(len(batches))]
    with mp.Pool(n_cores) as pool:
        results = pool.map(worker, batch_args_list)

    print(f"Total successfully embedded molecules: {sum(results)}")
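
Long embedding runs can be interrupted, so it helps if a restart only processes batches that have no output file yet. The sketch below shows one way to do that bookkeeping with the standard library. `make_batches` and `pending_batches` are hypothetical helpers, the file name pattern mirrors the `mols_3d_batch{idx}.pkl` naming used above, and the demo uses a temporary directory instead of the real output folder.

```python
import os
import tempfile


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def pending_batches(batches, output_dir, name_pattern="mols_3d_batch{}.pkl"):
    """Return (batch, index) pairs whose output file does not exist yet."""
    todo = []
    for idx, batch in enumerate(batches):
        out_file = os.path.join(output_dir, name_pattern.format(idx))
        if not os.path.exists(out_file):
            todo.append((batch, idx))
    return todo


smiles_demo = ["C", "CC", "CCC", "CCCC", "CCCCC"]
batches_demo = make_batches(smiles_demo, 2)  # three batches of sizes 2, 2 and 1

with tempfile.TemporaryDirectory() as tmp_dir:
    # Pretend that batch 0 was completed in an earlier run
    open(os.path.join(tmp_dir, "mols_3d_batch0.pkl"), "wb").close()
    todo = pending_batches(batches_demo, tmp_dir)

todo_indices = [idx for _, idx in todo]
print(todo_indices)
```

The returned pairs have the same `(batch, index)` shape as `batch_args_list`, so they could be passed to `pool.map(worker, ...)` unchanged. A file left half-written by a crash would be skipped by this check, so writing to a temporary name and renaming on completion is a safer variant.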


In [20]:
mols_3d = []

#Loop over each SMILES
for i, smi in enumerate(smiles_filtered):  
    try:
        # Generate 3D molecule
        mol = smiles_to_3d(smi)
        
        mols_3d.append(mol)
    
    except Exception as e:
        # Skip failed embeddings
        print(f"Skipped molecule {i}: {smi} | Reason: {e}")

print(f"Successfully embedded molecules: {len(mols_3d)} out of {len(smiles_filtered)}")




[10:58:29] UFFTYPER: Unrecognized charge state for atom: 5
[10:58:31] UFFTYPER: Unrecognized charge state for atom: 11
[10:58:32] UFFTYPER: Unrecognized charge state for atom: 4
[10:58:32] UFFTYPER: Unrecognized charge state for atom: 3
[10:58:43] UFFTYPER: Unrecognized atom type: Se2+2 (7)
[10:58:43] UFFTYPER: Unrecognized atom type: Se2+2 (7)
[10:58:43] UFFTYPER: Unrecognized atom type: Se2+2 (7)
[10:58:43] UFFTYPER: Unrecognized atom type: Se2+2 (7)
[10:58:58] UFFTYPER: Unrecognized atom type: Se2+2 (15)
[10:58:58] UFFTYPER: Unrecognized atom type: Se2+2 (15)
[10:59:09] UFFTYPER: Unrecognized atom type: Se2+2 (17)
[10:59:09] UFFTYPER: Unrecognized atom type: Se2+2 (17)
[10:59:13] UFFTYPER: Unrecognized charge state for atom: 6


Skipped molecule 1634: Brc1ccc([C@@H]2C[C@@H]3Cc4ccc5ccccc5c4N2O3)cc1 | Reason: 3D embedding failed


[10:59:41] Interrupted, cancelling conformer generation


KeyboardInterrupt: Embedding cancelled