# 03 – Substructure searching with RDKit

Authored by: *Fredrik Svensson, Oliver Scott*  
Edited by: *Florion Peni*

You can use the links in the [Contents](#Contents) section to navigate the notebook.

Please ensure you complete [`02_rdkit_introduction.ipynb`](02_rdkit_introduction.ipynb) before proceeding with the content in this notebook. If you are not familiar with the Python basics, make sure to go through [`01_python_introduction.ipynb`](01_python_introduction.ipynb) first.

⚠️ **Please run the code cells in order**. Skipping cells may result in errors due to missing variables or imports.

This notebook introduces the principles of substructure searching and filtering using RDKit.

### JupyterLab / Google Colab

**JupyterLab** is an open-source interface for running and sharing notebooks with code and text.  
**Google Colab** is a cloud-based version of Jupyter that runs in your browser and connects to free GPUs and TPUs.

> *When working in Colab, any files saved in the runtime (e.g. downloaded datasets or generated outputs) are temporary and will be deleted when you close the tab or your session times out.*  
You can view runtime files by clicking the folder icon in the left-hand toolbar. If you are working with important files, make sure to download them to your local machine or save them to Google Drive before ending your session.

> If you make edits to this notebook and would like to save them you can save your own copy by clicking on **File > Save a copy in Drive** in the top menu. This ensures your work is preserved even after you close the runtime.

### Writing and running code

You can write and run Python code inside **code cells**.

To run a cell, first select it by clicking on it. Then:
- Click the run/play button located inside the cell (in Colab) or in the toolbar above (in JupyterLab), or
- Use these keyboard shortcuts:
  - Windows: `Ctrl` + `Enter`
  - macOS: `⌘ Command` + `Enter`

> When you first run code on this notebook in Google Colab, you may see a message like:  
*“This notebook was not authored by Google. It may request access to your data…”*  
This is expected for notebooks loaded from GitHub. Click the *“Run anyway”* button when prompted to start running the code.

## Contents

> Internal markdown links (like the ones below) may not work reliably in Colab. To navigate the notebook, use the **Table of contents** panel: click the first icon in the left-hand toolbar.

* [Substructure](#Substructure)
* [SMARTS](#SMARTS)
* [Basic SMARTS queries](#Basic-SMARTS-queries)
* [Substructure searching on large amounts of data](#Substructure-searching-on-large-amounts-of-data)
* [Substructure filtering](#Substructure-filtering)
* [Discussion](#Discussion)

---

## Substructure

Substructure searching and filtering are fundamental tools in cheminformatics. Both involve identifying a specific pattern (*subgraph*) within a molecule (*graph*). They are widely used in cheminformatics applications that work with digital representations of molecules, including:

- **depiction**: highlighting functional groups within a molecule,
- **drug design**: searching databases and performing Structure-Activity Relationship (SAR) analyses,
- and **analytical chemistry**: searching for previously characterised structures and comparing data to that of an unknown molecule.

Substructure searching is ubiquitous and integral to cheminformatics workflows. For example, you have likely used substructure searching in online databases such as [ChEMBL](https://www.ebi.ac.uk/chembl/), [DrugBank](https://go.drugbank.com/), and [PubChem](https://pubchem.ncbi.nlm.nih.gov/).
<a href="https://pharmanalytics.medium.com/using-graph-cliques-to-compute-combined-2d-3d-molecule-similarity-e0608595438b">
    <img src="https://miro.medium.com/v2/resize:fit:625/1*qx5GulpkdJf8wQT-NQJlYQ.png" alt="Using Graph Cliques to Compute combined 2D & 3D Molecule similarity" style="display: block; margin-left: auto; margin-right: auto;">
    <figcaption style="text-align: center;">Example: Maximum common subgraph of a molecule set</figcaption>
</a>

[Image source](https://miro.medium.com/v2/resize:fit:625/1*qx5GulpkdJf8wQT-NQJlYQ.png)

---

## SMARTS

**SMiles ARbitrary Target Specification (SMARTS)** is an extension of SMILES that allows users to define substructural patterns within molecules. Its expressive syntax supports precise and transparent substructure specification and atom typing.

**Most SMILES strings are also valid SMARTS strings.** The key difference between SMILES and SMARTS lies in the introduction of logical operators (e.g., `!`, `&`, `,`, `;`) and special atomic and bond symbols (e.g., `*`, `~`). These additions allow SMARTS atoms and bonds to be more general.

For example, the SMARTS atomic symbol `[C,N]` represents an atom that can be either aliphatic carbon (C) or aliphatic nitrogen (N).
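
As a minimal sketch of how the `,` (OR) operator and the aliphatic/aromatic distinction behave, the snippet below counts matches for `[C,N]` and for its aromatic counterpart `[c,n]`. The two test molecules (ethylamine and pyridine) and the variable names are illustrative choices, not part of the original tutorial.

```python
from rdkit import Chem

# '[C,N]' matches aliphatic carbon OR aliphatic nitrogen
aliphatic_c_or_n = Chem.MolFromSmarts('[C,N]')
# '[c,n]' matches aromatic carbon OR aromatic nitrogen
aromatic_c_or_n = Chem.MolFromSmarts('[c,n]')

ethylamine = Chem.MolFromSmiles('CCN')      # three aliphatic heavy atoms
pyridine = Chem.MolFromSmiles('c1ccncc1')   # six aromatic heavy atoms

# Each match of a single-atom query is one atom, so the number of matches is an atom count
n_ethylamine_aliphatic = len(ethylamine.GetSubstructMatches(aliphatic_c_or_n))
n_pyridine_aliphatic = len(pyridine.GetSubstructMatches(aliphatic_c_or_n))
n_pyridine_aromatic = len(pyridine.GetSubstructMatches(aromatic_c_or_n))

print(n_ethylamine_aliphatic, n_pyridine_aliphatic, n_pyridine_aromatic)
# Output: 3 0 6
```

Pyridine gives no matches for `[C,N]` because all of its ring atoms are aromatic.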

### Applications

Using SMARTS patterns, we can define key molecular features like hydrogen bond (HB) donors and acceptors. This is helpful for applying Lipinski's Rule of Five (Ro5).

Uppercase atomic symbols represent aliphatic atoms, whereas lowercase atomic symbols represent aromatic atoms. The notation `#Z` matches any atom with the atomic number *Z*, regardless of its chemical environment.

- **HB donors**: Nitrogen or oxygen atoms with at least one directly bonded hydrogen atom.  
  >`[N,n,O;!H0]` or `[#7,#8;!H0]`
   
- **HB acceptors**: Nitrogen or oxygen atoms (aliphatic or aromatic).  
  >`[N,n,O,o]` or `[#7,#8]`
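
The two definitions above can be tried on a single molecule. The sketch below counts donors and acceptors in paracetamol using the atomic-number forms of the patterns; the choice of molecule and the variable names are assumptions made for illustration only.

```python
from rdkit import Chem

# Simple HB donor and acceptor definitions taken from the list above
donor = Chem.MolFromSmarts('[#7,#8;!H0]')   # N or O carrying at least one hydrogen
acceptor = Chem.MolFromSmarts('[#7,#8]')    # any N or O

paracetamol = Chem.MolFromSmiles('CC(=O)Nc1ccc(O)cc1')

# Each match is a one-atom tuple, so the number of matches is the feature count
n_donors = len(paracetamol.GetSubstructMatches(donor))
n_acceptors = len(paracetamol.GetSubstructMatches(acceptor))

print('HB donors:', n_donors)        # the amide N-H and the phenol O-H
print('HB acceptors:', n_acceptors)  # the amide N, the carbonyl O and the phenol O
# Output:
# HB donors: 2
# HB acceptors: 3
```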

These patterns can be made more complex for more precise matching, depending on the application. For more detailed information, consult the following resources:

- A comprehensive guide to SMARTS is available on the [Daylight website](https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html).
- The [SMARTS Plus](https://smartsview.zbh.uni-hamburg.de/) tool is useful for exploring and visualising SMARTS patterns.

### Substructure matching in RDKit

The code below demonstrates how to read a SMARTS string and perform substructure matching using RDKit.

In [None]:
# Checks if running in Google Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Install RDKit if in Colab
if IN_COLAB:
    print("Installing RDKit for Colab. May take a few seconds...")
    !pip install rdkit -q
    print("Installation complete.")

from rdkit.Chem import Draw  # We will use Draw to visualise molecules
from rdkit import Chem  # We will again require the Chem module

# Define a molecule to perform substructure matching on
smiles = 'OB1OCC2=C1C=CC(OC1=CC=C(C=C1)C#N)=C2'  # This is the SMILES for Crisaborole
molecule = Chem.MolFromSmiles(smiles)
molecule

Let's try matching the nitrile group (−C≡N).  
First, we should define a SMARTS substructure pattern to describe the triple bond between a nitrogen and a carbon atom.

`[NX1]#[CX2]`
* `N` matches a nitrogen atom, whereas `X1` specifies that the nitrogen atom has exactly 1 connection, i.e. a triple bond.
* `#` represents a triple bond between two atoms.
* `CX2` matches a carbon atom that has exactly 2 connections, i.e. a triple bond and a single bond.

In [None]:
# Define a SMARTS pattern
nitrile_smarts = '[NX1]#[CX2]'

# Use RDKit to read the SMARTS pattern
nitrile = Chem.MolFromSmarts(nitrile_smarts)

# Visualise the SMARTS pattern (the carbon atom is not shown by default)
nitrile

Now we can use RDKit to apply this SMARTS pattern and identify nitrile groups within molecules.

In [None]:
# Check if the molecule contains the nitrile substructure
print(molecule.HasSubstructMatch(nitrile))  # Method returns a Boolean

# Get the atom indexes of the substructure match
atom_indexes = molecule.GetSubstructMatch(nitrile)  # Method returns a tuple of integers
print(atom_indexes)

# Highlight the matching atoms when visualising the molecule
img = Draw.MolToImage(molecule,
                      highlightAtoms=atom_indexes,
                      legend='C and N atoms of nitrile group in Crisaborole')
img

If we wanted to get multiple matches for a substructure in a single molecule we would need to use the `.GetSubstructMatches()` method as opposed to `.GetSubstructMatch()`.
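
A short sketch of the difference between the two methods, using malononitrile (a molecule with two nitrile groups) as an illustrative example that is not part of the original notebook:

```python
from rdkit import Chem

nitrile = Chem.MolFromSmarts('[NX1]#[CX2]')
malononitrile = Chem.MolFromSmiles('N#CCC#N')  # contains two nitrile groups

# GetSubstructMatch returns only one match, as a tuple of atom indexes
first_match = malononitrile.GetSubstructMatch(nitrile)

# GetSubstructMatches returns every match, as a tuple of tuples
all_matches = malononitrile.GetSubstructMatches(nitrile)

print(first_match)
print(all_matches)
print('Number of nitrile groups:', len(all_matches))
```

Each inner tuple lists the atom indexes in the order of the query atoms, so here the nitrogen index comes first and the carbon index second.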

----

## Basic SMARTS queries

Now that we know how to use RDKit for substructure matching, we can extend this concept to find substructure matches within a collection of molecules.

In the code below, we define a set of aromatic compounds.

In [None]:
# Define some aromatic compounds using SMILES
naphthalene = Chem.MolFromSmiles('c12ccccc1cccc2')
benzoxazole = Chem.MolFromSmiles('n1c2ccccc2oc1')
indane = Chem.MolFromSmiles('c1ccc2c(c1)CCC2')
skatole = Chem.MolFromSmiles('CC1=CNC2=CC=CC=C12')
benzene = Chem.MolFromSmiles('c1ccccc1')
quinoline = Chem.MolFromSmiles('n1cccc2ccccc12')

# Define a list holding the RDKit `Mol` objects
my_molecules = [
    naphthalene,
    benzoxazole,
    indane,
    skatole,
    benzene,
    quinoline
]

# Define a separate list holding the names of the aromatic compounds
labels = [
    'Naphthalene',
    'Benzoxazole',
    'Indane',
    'Skatole',
    'Benzene',
    'Quinoline'
]

# Visualise the aromatic compounds
img = Draw.MolsToGridImage(my_molecules, legends=labels)
img

We can use specific ring patterns to perform substructure matching. The SMARTS patterns in the code cell below are used to match five- and six-membered rings fused to a benzene.

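- `[*r5R1]` matches any atom (`*`) whose smallest ring is five-membered (`r5`) and that belongs to exactly one ring (`R1`).
- `[*r6R1]` matches any atom whose smallest ring is six-membered and that belongs to exactly one ring.
- `[cR1]` and `[cR2]` match aromatic carbons that belong to exactly one ring and exactly two rings, respectively. The two `[cR2]` atoms are the ones shared by the fused rings.
- `1` and `2` are ring-closure digits; together they close the two fused rings.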
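
To see what the individual ring primitives select, they can be applied one at a time to a single molecule. The sketch below counts the atoms of indane that match each primitive; the dictionary name `counts` is an illustrative choice.

```python
from rdkit import Chem

indane = Chem.MolFromSmiles('c1ccc2c(c1)CCC2')

# Single-atom SMARTS queries built from the ring primitives described above
primitives = ['[*r5R1]', '[cR1]', '[cR2]']

# Count how many atoms in indane match each primitive
counts = {}
for smarts in primitives:
    query = Chem.MolFromSmarts(smarts)
    counts[smarts] = len(indane.GetSubstructMatches(query))

print(counts)
# Output: {'[*r5R1]': 3, '[cR1]': 4, '[cR2]': 2}
```

The three `[*r5R1]` atoms are the CH2 carbons of the five-membered ring, the four `[cR1]` atoms are the benzene CH carbons, and the two `[cR2]` atoms are the carbons shared by both rings.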

The `Draw.MolsToGridImage` function is used to visualise both SMARTS patterns side by side, with appropriate labels for each. This approach allows easy comparison of the patterns.

In [None]:
# Define SMARTS queries for five- and six-membered rings fused to a benzene
benzo_five = Chem.MolFromSmarts('[*r5R1]1[cR2]2[cR1][cR1][cR1][cR1][cR2]2[*r5R1][*r5R1]1')
benzo_six = Chem.MolFromSmarts('[*r6R1]1[cR2]2[cR1][cR1][cR1][cR1][cR2]2[*r6R1][*r6R1][*r6R1]1')

# Visualise the SMARTS patterns
img = Draw.MolsToGridImage([benzo_five, benzo_six], legends=['Benzo-5', 'Benzo-6'])
img

Let's find molecules matching the `benzo_five` pattern.

In [None]:
# Create an empty list to store the atom indexes of substructure matches
atom_indexes = []

# Loop through each molecule in the collection
for mol in my_molecules:
    # Find the atom indexes of the first substructure match for the SMARTS pattern `benzo_five`
    match = mol.GetSubstructMatch(benzo_five)
    # Append the match (atom indexes) to the list; an empty tuple means no match
    atom_indexes.append(match)

# Visualise the molecules in a grid with matches highlighted
img = Draw.MolsToGridImage(my_molecules, legends=labels, highlightAtomLists=atom_indexes)
img
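
If only the names of the matching compounds are needed, rather than a picture, `HasSubstructMatch` can be combined with a list comprehension. This is a self-contained sketch that redefines the same six molecules; the names `smiles_by_name` and `matching_names` are illustrative.

```python
from rdkit import Chem

# The same six compounds as above, keyed by name
smiles_by_name = {
    'Naphthalene': 'c12ccccc1cccc2',
    'Benzoxazole': 'n1c2ccccc2oc1',
    'Indane': 'c1ccc2c(c1)CCC2',
    'Skatole': 'CC1=CNC2=CC=CC=C12',
    'Benzene': 'c1ccccc1',
    'Quinoline': 'n1cccc2ccccc12',
}

# Five-membered ring fused to a benzene ring
benzo_five = Chem.MolFromSmarts('[*r5R1]1[cR2]2[cR1][cR1][cR1][cR1][cR2]2[*r5R1][*r5R1]1')

# Keep the name of every compound that contains the pattern
matching_names = [
    name
    for name, smiles in smiles_by_name.items()
    if Chem.MolFromSmiles(smiles).HasSubstructMatch(benzo_five)
]

print(matching_names)
# Output: ['Benzoxazole', 'Indane', 'Skatole']
```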

#### Exercise 1: Basic SMARTS queries

Using the `my_molecules` list and the `benzo_six` SMARTS pattern, complete the following tasks:

1. Determine which heterocycle(s) in the list match the pattern.
2. Highlight the matched atoms in the molecule(s) and visualise the result.
3. If you're up for a challenge, create a SMARTS pattern that matches a benzene ring while excluding other aromatic systems. You can use the [Daylight website](https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html) as a reference to help refine your pattern.

In [None]:
# Add your solution code here

----

## Substructure searching on large amounts of data

Can you guess the class of drugs defined by the SMARTS query in the code cell below, without executing the code?

<details>
<summary>Hint</summary>
Sulfamethoxazole, an antibiotic, and sulfasalazine, used for treatment of inflammatory diseases, are both representatives of this class. What functional group do they have in common?
<br>
    <div style="text-align: center;">
    <img src="https://github.com/MEDC0080/RDKitTutorial/blob/main/data/sulfonamide_drugs.png?raw=1" alt="Sulfamethoxazole and Sulfasalazine" width="500"/>
</div>
</details>

In [None]:
query = Chem.MolFromSmarts('Nc1ccc(S(=O)(=O)-[*])cc1')
query

We will use this SMARTS query to search for matches in the `approved_drugs.sdf` file.
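
Before searching a whole file, it can help to sanity-check a query on a few hand-written SMILES. The sketch below does this for three well-known drugs; the SMILES strings and the names `test_smiles` and `hits` are illustrative and are not taken from `approved_drugs.sdf`.

```python
from rdkit import Chem

query = Chem.MolFromSmarts('Nc1ccc(S(=O)(=O)-[*])cc1')

# A few hand-written test molecules (illustrative, not read from the SDF file)
test_smiles = {
    'Sulfanilamide': 'Nc1ccc(S(N)(=O)=O)cc1',
    'Sulfamethoxazole': 'Cc1cc(NS(=O)(=O)c2ccc(N)cc2)no1',
    'Aspirin': 'CC(=O)Oc1ccccc1C(=O)O',
}

hits = []
for name, smiles in test_smiles.items():
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        continue  # Skip SMILES that RDKit cannot parse
    if mol.HasSubstructMatch(query):
        hits.append(name)

print(hits)
# Output: ['Sulfanilamide', 'Sulfamethoxazole']
```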

In [None]:
import os

# Check if running on Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Set file path based on environment
if IN_COLAB:
    approved_drugs_file = 'data/approved_drugs.sdf'
else:
    approved_drugs_file = '../data/approved_drugs.sdf'

# If running in Colab, download the file
if IN_COLAB and not os.path.exists(approved_drugs_file):
    os.makedirs('data', exist_ok=True)
    !wget -q https://raw.githubusercontent.com/MEDC0080/RDKitTutorial/main/data/approved_drugs.sdf -O {approved_drugs_file}

# Read the file into a supplier of `Mol` objects using the `Chem.SDMolSupplier()` function
# Each molecule in the SDF file is read as an RDKit `Mol` object
supplier = Chem.SDMolSupplier(approved_drugs_file)

# Initialise empty lists to store matched `Mol` objects and their labels
matches = []          # Stores matched `Mol` objects
matches_labels = []   # Stores labels (e.g., names or primary synonyms) of matched molecules

# Perform substructure matching
for molecule in supplier:
    if molecule is None:
        continue  # Skip invalid molecules
    if molecule.HasSubstructMatch(query):
        # Retrieve the ChEMBL ID stored in the '_Name' property
        cid = molecule.GetProp('_Name')
        # Retrieve the synonyms stored in the 'SYNONYMS' property
        syn = molecule.GetProp('SYNONYMS')
        # Print match information
        print('MATCH:', cid, syn)
        # Append the matched `Mol` object and its primary name to the lists
        matches.append(molecule)
        matches_labels.append(syn.split(' (')[0])  # Extract primary name before parentheses

print('\nNumber of matches:', len(matches)) # Compute and display the number of matches
print(matches_labels)

In the code above, you may have noticed the use of the `.split()` method on a string. This method splits a string into a list of substrings at each occurrence of a specified delimiter. Here, we split the `syn` string on the delimiter `' ('` (a space followed by an opening parenthesis) and keep only the first substring, located at index `0` of the resulting list, which is the primary name. The general behaviour of `.split()` is shown below.

```python
my_string = 'green, yellow, red'
substrings = my_string.split(', ')
print(substrings)

# Output: ['green', 'yellow', 'red']
```

We have now found matches for our substructure query. We can visualise some of the molecules to see if they contain the **sulfonamide** functional group.  
You can try modifying the number of molecules displayed in the visualisation by adjusting the indexing.

In [None]:
from rdkit.Chem import AllChem

# The matched molecules currently have 3D coordinates
# To ensure a clear and consistent 2D representation for drawing, we compute 2D coordinates
for match in matches:
    AllChem.Compute2DCoords(match)

# Draw a grid of the first 9 matched molecules, slicing the labels to the same length
img = Draw.MolsToGridImage(matches[0:9], legends=matches_labels[0:9])
img

#### Exercise 2: Substructure searching on large amounts of data

Using the SMARTS pattern `[*X1]=[CR1]1[CR1][NR1]=[CR1;X3](c2ccccc2)c2ccccc2[NR1]1`, you will need to complete the following tasks:

1. Identify the number of molecules in `approved_drugs.sdf` that contain this substructure.
2. Find out what substructure the SMARTS pattern represents.
3. Visualise the molecules containing this substructure.
4. Define your own SMARTS pattern and repeat the substructure matching process to explore a different substructure.

<details>
<summary>Example solution</summary>

<br>1. This follows the same pattern as the code above; only the SMARTS query changes.<br>
    
```python
supplier = Chem.SDMolSupplier(approved_drugs_file)

smarts_pattern = '[*X1]=[CR1]1[CR1][NR1]=[CR1;X3](c2ccccc2)c2ccccc2[NR1]1'
query = Chem.MolFromSmarts(smarts_pattern)

matches = []
matches_labels = []

for molecule in supplier:
    if molecule is None:
        continue
    if molecule.HasSubstructMatch(query):
        cid = molecule.GetProp('_Name')
        syn = molecule.GetProp('SYNONYMS')
        print('MATCH:', cid, syn)
        matches.append(molecule)
        matches_labels.append(syn.split(' (')[0])  # Keep the primary name, without a trailing space

print('\nNumber of matches:', len(matches))
```
<br>2. Drawing the query shows a seven-membered ring that contains two nitrogen atoms and a carbon-nitrogen double bond (a cyclic imine), fused to one benzene ring and carrying a phenyl substituent on the imine carbon. This is the 5-phenyl-1,4-benzodiazepine scaffold, with a doubly bonded terminal atom (for example a carbonyl oxygen) at position 2. Diazepam is a well-known drug built on this scaffold.<br>
```python
query = Chem.MolFromSmarts(smarts_pattern)

img1 = Draw.MolToImage(query, legend="SMARTS Pattern")
img1
```
<br>3. To show all matched molecules, you can execute the code shown below in a new cell.<br>
```python
for match in matches:
    AllChem.Compute2DCoords(match)
img2 = Draw.MolsToGridImage(matches, legends=matches_labels)
img2
```
<br>4. For example, to match a hydroxyl group attached to an aromatic ring, you could use the SMARTS pattern shown below. Try re-executing the same code but with `custom_query` as your SMARTS query.<br>
```python
custom_smarts = '[c][OH]'
custom_query = Chem.MolFromSmarts(custom_smarts)
```
</details>

In [None]:
# Write your solution code here, adding extra code cells if necessary

----

## Substructure filtering

While identifying molecules with specific substructures is often valuable, there are cases where we might want to **remove molecules** containing certain unwanted substructures from a dataset.

Some substructures can be undesirable due to properties like **toxicity** or **high reactivity**. With the advent of high-throughput screening (HTS), the need to filter out problematic compounds from screening libraries has grown.

### Pan-assay interference compounds (PAINS)

**PAINS** are chemical compounds that frequently produce false-positive results in HTS. These compounds tend to react non-specifically with a wide range of biological targets rather than selectively affecting the desired target. When designing screening libraries, it is beneficial to filter out PAINS to improve the reliability of screening results.

<a href="https://en.wikipedia.org/wiki/Pan-assay_interference_compounds" style="text-decoration: none;">
    <div style="text-align: center;">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/PAINS_Figure.tif/lossy-page1-1920px-PAINS_Figure.tif.jpg"
             alt="PAINS Figure"
             style="width: 70%; max-width: 700px; height: auto;">
    </div>
</a>

[Image source](https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/PAINS_Figure.tif/lossy-page1-1920px-PAINS_Figure.tif.jpg)

### Filtering PAINS with RDKit

RDKit provides a toolset for filtering compounds using multiple SMARTS patterns, making it easier to identify and remove problematic molecules. The `FilterCatalog` module in RDKit simplifies the process of substructure matching across multiple SMARTS patterns.

RDKit also includes a **library of PAINS patterns** that can be used directly to filter these compounds from your dataset. While this could be achieved by manually iterating through a list of SMARTS patterns, `FilterCatalog` provides a more efficient and structured approach.

In [None]:
# Import multiple classes from a module using commas
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Create an instance of the `FilterCatalogParams` class to hold the filter settings
params = FilterCatalogParams()

# Add PAINS patterns to the parameters
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)

# Create a filter catalogue using the defined parameters
catalog = FilterCatalog(params)

# Print the number of PAINS patterns in the catalogue
print('Number of PAINS patterns:', catalog.GetNumEntries())

<details>
<summary>Detailed explanation</summary>

```python
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams
```

- This line imports the `FilterCatalog` and `FilterCatalogParams` classes from RDKit's `FilterCatalog` module.
    - `FilterCatalogParams`: A class used to define the parameters for building a filter catalogue.
    - `FilterCatalog`: A class used to create a catalogue of filters based on the specified parameters.

```python
params = FilterCatalogParams()
```

- Creates an instance of the `FilterCatalogParams` class, which is used to configure the filter catalogue.
    - The `params` object acts as a container for the filtering rules that will be added, i.e., PAINS patterns.

```python
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
```

- Ensures that the PAINS rules are included in the filter catalogue configuration.
    - `AddCatalog`: Adds a specific catalogue of filtering rules to the parameters object.
    - `FilterCatalogParams.FilterCatalogs.PAINS`: Refers to the PAINS filtering rules provided by RDKit.

```python
catalog = FilterCatalog(params)
```

- Creates a `FilterCatalog` object using the `params` configuration defined earlier.
    - The `catalog` object contains the PAINS filtering rules and can be used to screen molecules for PAINS substructures.

```python
print('Number of PAINS patterns:', catalog.GetNumEntries())
```

- This line outputs the number of PAINS substructures included in the catalogue.
    - `.GetNumEntries()`: A method of the `FilterCatalog` class that returns the total number of PAINS patterns (filtering rules) in the catalogue.
</details>
<br>
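
As a quick check of how a PAINS catalogue responds to individual molecules, the sketch below screens two hand-written SMILES. The first is a hydrazone-containing phenol chosen here as an illustrative compound that is expected to be flagged; the second is ethanol. The variable names are assumptions made for this example.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalogue containing the PAINS patterns
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)

# Two illustrative molecules: one expected to be flagged, one expected to pass
flagged = Chem.MolFromSmiles('O=C(Cn1cnc2c1c(=O)n(C)c(=O)n2C)N/N=C/c1c(O)cc(O)cc1')
clean = Chem.MolFromSmiles('CCO')

for label, mol in [('flagged', flagged), ('clean', clean)]:
    # GetMatches returns one entry per matching PAINS pattern
    entries = catalog.GetMatches(mol)
    descriptions = [entry.GetDescription() for entry in entries]
    print(label, catalog.HasMatch(mol), descriptions)
```

`HasMatch` answers the yes/no question, `GetFirstMatch` returns a single entry (or `None`), and `GetMatches` returns all matching entries, which is useful when a molecule triggers more than one pattern.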

In the code below, we perform **PAINS filtering** on a dataset of molecules represented as SMILES strings.

The dataset we are using, `MAPK_compounds.csv`, contains **mitogen-activated protein kinase (MAPK) compounds**: molecules known or thought to interact with the MAPK pathway, which plays a crucial role in regulating cellular processes (e.g. proliferation, differentiation, and apoptosis). The compounds are stored in a **CSV file**, a common and efficient format for tabular data.

The code reads the file, filters the compounds using RDKit's PAINS filter catalogue, and splits the molecules into two categories:
1. `filtered_mols`: those containing substructures matching a PAINS pattern, and
2. `kept_mols`: those without any PAINS substructures.

Additionally, the code extracts descriptions of the PAINS patterns for molecules that match, and visualises the first nine filtered molecules along with their matched PAINS descriptions.

By running this code, we can *remove PAINS compounds from the dataset*, ensuring a cleaner library of molecules for *further analysis* or *drug discovery efforts*.

In [None]:
# Set file path based on environment
if IN_COLAB:
    csv_file = 'data/MAPK_compounds.csv'
else:
    csv_file = '../data/MAPK_compounds.csv'

# If running in Colab, download the file
if IN_COLAB and not os.path.exists(csv_file):
    os.makedirs('data', exist_ok=True)
    !wget -q https://raw.githubusercontent.com/MEDC0080/RDKitTutorial/main/data/MAPK_compounds.csv -O {csv_file}

# Read the file into a supplier of `Mol` objects using the `Chem.SmilesMolSupplier()` function
# Each molecule in the CSV file is read as an RDKit `Mol` object
# SMILES strings are in the 5th column, molecule names are in the 2nd column
supplier = Chem.SmilesMolSupplier(csv_file, delimiter=',', smilesColumn=4, nameColumn=1)

# Initialise lists to store filtered and retained molecules
filtered_mols = []   # Molecules containing PAINS substructures
filtered_pains = []  # Descriptions of the PAINS patterns for filtered molecules
kept_mols = []       # Molecules that do not contain PAINS substructures

# Perform PAINS filtering
for molecule in supplier:
    if molecule is None:
        continue  # Skip invalid or unreadable molecules
    # Check if the molecule matches any PAINS pattern using the catalogue
    match = catalog.GetFirstMatch(molecule)
    if match:  # If a PAINS pattern is matched
        AllChem.Compute2DCoords(molecule)  # Ensure 2D coordinates for drawing
        filtered_mols.append(molecule)  # Add molecule to the filtered list
        filtered_pains.append(match.GetDescription())  # Save the PAINS description
    else:  # If no PAINS patterns match
        kept_mols.append(molecule)  # Add molecule to the `kept_mols` list

# Output the number of molecules filtered out
print('Number of filtered PAINS:', len(filtered_mols))

# Output the number of remaining molecules
print('Remaining molecules:', len(kept_mols))

# Visualise the first 9 filtered molecules with their PAINS descriptions
img = Draw.MolsToGridImage(filtered_mols[0:9], legends=filtered_pains[0:9])
img

---

## Discussion

That concludes our introduction to substructure searching with RDKit. Feel free to add more code cells and experiment with the concepts learned.

With a solid understanding of substructure searching, you are now ready to explore more advanced topics. In the next notebook, [`04_rdkit_similarity.ipynb`](04_rdkit_similarity.ipynb), we will delve into **similarity searching using molecular fingerprints** with RDKit.

You can click [here](#Contents) to return to the beginning and review any topics as needed.