# 1.2 Filtering compounds
## Background
For a particular use case, you may be interested in predicting the activity of a specific category of compounds or of compounds with certain characteristics. You would therefore want to limit your dataset to those compounds of interest when training your model. The filtering package in the toolkit is designed to filter compounds by user-specified physicochemical properties.  

Additionally, we (OpenADMET) use these tools to filter compounds from libraries for physical property assays.

## Requirements
For this work process, you will need:
- An external dataset that is in a file format readable by `pandas`, e.g. `.csv`, `.parquet`, `.xlsx`, `.xls`. The dataset must, at minimum, include a column of SMILES strings  
**OR**  
- A dataset that has already been processed by [1.1 Curating external datasets](https://github.com/OpenADMET/openadmet-models/blob/5102e2702224deac2f8a58bc01fa12e5cb82165a/demos/1.1_Curating_external_datasets.ipynb) and therefore contains a column called `OPENADMET_CANONICAL_SMILES`.
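
A minimal sketch of checking these requirements before you start, assuming a hypothetical helper `load_dataset` (it is not part of the toolkit). It picks a `pandas` reader from the file extension and confirms that the SMILES column you name is present.

```python
from pathlib import Path

import pandas as pd

# Map file extensions to pandas readers (illustrative, not exhaustive)
READERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
}


def load_dataset(path, smiles_column):
    """Hypothetical helper: load a table and confirm it has a SMILES column."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    df = READERS[suffix](path)
    if smiles_column not in df.columns:
        raise KeyError(f"Column {smiles_column!r} not found in {path}")
    return df
```

Failing early on a missing column is a design choice here: it surfaces a mislabeled file before any filter is run.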
## 1. Overview
In this demo notebook, we'll show you how to filter the processed dataset for human pregnane X receptor from [PubChem](https://pubchem.ncbi.nlm.nih.gov/bioassay/1347033) for a variety of physicochemical properties that may be of interest.  

This dataset is provided in `./processed_data/processed_PXR_pubchem.parquet`.

In [1]:
import datamol as dm
from openadmet.toolkit.filtering.physchem_filters import SMARTSFilter, pKaFilter, DatamolFilter, ProximityFilter
import pandas as pd


## 2. Read in the data
Read in your data file and ensure that your canonicalized SMILES column contains no `NaN` values.


In [2]:
# Read in the data
df = pd.read_parquet("./processed_data/processed_PXR_pubchem.parquet")

# Double check that there are no NaN SMILES
df['OPENADMET_CANONICAL_SMILES'].isna().value_counts()

OPENADMET_CANONICAL_SMILES
False    2348
Name: count, dtype: int64
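
If this check had reported any `True` values, one option is to drop those rows before filtering. A small sketch on a toy frame (the SMILES strings are illustrative and are not taken from the PXR dataset):

```python
import pandas as pd

# Toy frame with one missing SMILES (illustrative values only)
toy = pd.DataFrame(
    {"OPENADMET_CANONICAL_SMILES": ["CCO", None, "c1ccccc1"]}
)

# Keep only rows that have a SMILES string, then renumber the index
clean = toy.dropna(subset=["OPENADMET_CANONICAL_SMILES"]).reset_index(drop=True)

n_dropped = len(toy) - len(clean)
print(f"Dropped {n_dropped} row(s) with missing SMILES")
```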

## 3. Filtering by properties

There are a variety of filters available to you. This notebook walks through how to use each one, and new filters will be added here as they become available.  

### 3.1 SMARTS filtering
SMARTS stands for **S**MILES **a**rbitrary **t**arget **s**pecification and is a way of specifying substructural patterns in molecules. Substructural patterns can include protonation state, hydrogen count, formal charge, isotopic weight, bond order, etc., which can be used to determine characteristics such as a compound's reactivity or toxicity.  

In the example below, we filter for a chromophore substructure (SMARTS `[n,o,c][c,n,o]cc`) that will give a strong IR signal.

In [3]:
# Initialize the filter
smarts_filter = SMARTSFilter(
    smarts_list=["[n,o,c][c,n,o]cc"], # Give the SMARTS strings you want to filter on as a list of strings
    names_list=["chromophore"], # Give the name of what you want to call the SMARTS filter
    include=True,
    names_column="SMARTS") # Name of the column that will contain all the SMARTS filters you specify

# Filter the compounds and return dataframe
smarts_df = smarts_filter.filter(
    df,
    smiles_column="OPENADMET_CANONICAL_SMILES",
    mode="remove") # With mode, you can either mark OR remove the rows that pass/fail your filter

smarts_df

[15:08:44] Unusual charge on atom 0 number of radical electrons set to zero
2025-07-01 15:08:48.048 | INFO     | openadmet.toolkit.filtering.filter_base:set_mol_column:55 - mol column OPENADMET_CANONICAL_SMILES already present, skipping


Unnamed: 0,INCHIKEY,OPENADMET_LOGAC50,OPENADMET_CANONICAL_SMILES,mol,SMARTS
0,AADCDMQTJNYOSS-LBPRGKRZSA-N,4.725066,CCC1=CC(Cl)=C(OC)C(C(=O)NC[C@@H]2CCCN2CC)=C1O,<rdkit.Chem.rdchem.Mol object at 0x161416a40>,[chromophore]
2,AAKJLRGGTJKAMG-UHFFFAOYSA-N,4.991733,C#CC1=CC=CC(NC2=NC=NC3=CC(OCCOC)=C(OCCOC)C=C23...,<rdkit.Chem.rdchem.Mol object at 0x161416810>,[chromophore]
3,AAOVKJBEBIDNHE-UHFFFAOYSA-N,4.499446,CN1C(=O)CN=C(C2=CC=CC=C2)C2=CC(Cl)=CC=C21,<rdkit.Chem.rdchem.Mol object at 0x161416ab0>,[chromophore]
5,ABJKWBDEJIDSJZ-UHFFFAOYSA-N,4.750066,CN(CC1=CC=C(C(C)(C)C)C=C1)CC1=CC=CC2=CC=CC=C12,<rdkit.Chem.rdchem.Mol object at 0x161416c00>,[chromophore]
7,ACFHHZJOAUZSJU-XBXARRHUSA-N,5.075066,CC(=O)N/N=C/C1=CC=C([N+](=O)[O-])O1,<rdkit.Chem.rdchem.Mol object at 0x161416ce0>,[chromophore]
...,...,...,...,...,...
2342,ZZIALNLLNHEQPJ-UHFFFAOYSA-N,4.182172,O=C1OC2=CC(O)=CC=C2C2=C1C1=CC=C(O)C=C1O2,<rdkit.Chem.rdchem.Mol object at 0x161c53300>,[chromophore]
2343,ZZMVLMVFYMGSMY-UHFFFAOYSA-N,5.004638,CC(C)CC(C)NC1=CC=C(NC2=CC=CC=C2)C=C1,<rdkit.Chem.rdchem.Mol object at 0x161c53370>,[chromophore]
2344,ZZVUWRFHKOJYTH-UHFFFAOYSA-N,4.323159,CN(C)CCOC(C1=CC=CC=C1)C1=CC=CC=C1,<rdkit.Chem.rdchem.Mol object at 0x161c533e0>,[chromophore]
2346,ZZYSLNWGKKDOML-UHFFFAOYSA-N,4.647075,CCC1=NN(C)C(C(=O)NCC2=CC=C(C(C)(C)C)C=C2)=C1Cl,<rdkit.Chem.rdchem.Mol object at 0x161c534c0>,[chromophore]
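
The difference between the two `mode` values can be pictured with plain `pandas`. This is a sketch of the idea only, assuming that a filter produces one boolean pass/fail value per row. It is not the toolkit's implementation, and `apply_mode`, the `passed_filter` column name, and the toy identifiers are all hypothetical.

```python
import pandas as pd


def apply_mode(df, passed, mode, flag_column="passed_filter"):
    """Hypothetical helper: mark rows with a pass/fail flag, or remove failures."""
    out = df.copy()
    out[flag_column] = passed
    if mode == "mark":
        # Every row is kept and the flag column records the outcome
        return out
    if mode == "remove":
        # Rows that failed are dropped, so the flag is no longer needed
        return out[out[flag_column]].drop(columns=flag_column)
    raise ValueError(f"Unknown mode: {mode!r}")


# Toy identifiers, illustrative only
toy = pd.DataFrame({"INCHIKEY": ["A", "B", "C"]})
passed = [True, False, True]

marked = apply_mode(toy, passed, mode="mark")
removed = apply_mode(toy, passed, mode="remove")
```

Marking is the safer choice while you are still exploring thresholds, because no rows are lost and you can inspect which compounds would have been removed.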


### 3.2 pKa filtering
pKa is the negative base-10 logarithm of the acid dissociation constant (Ka) and measures the strength of an acid in solution: the lower the pKa, the stronger the acid. In the context of ADMET, the pKa values of compounds are useful for understanding lipophilicity, how coordination complexes form, potential interactions with protein targets, etc.  
$$\mathrm{HA \rightleftharpoons A^- + H^+}$$
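
For reference, the standard definitions behind this equilibrium are written out here. The second line (the Henderson-Hasselbalch relation) follows from taking the negative logarithm of the $K_a$ expression and rearranging.

$$K_a = \frac{[\mathrm{A^-}][\mathrm{H^+}]}{[\mathrm{HA}]}, \qquad \mathrm{p}K_a = -\log_{10} K_a$$

$$\mathrm{pH} = \mathrm{p}K_a + \log_{10}\frac{[\mathrm{A^-}]}{[\mathrm{HA}]}$$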

In the example below, we filter for compounds with pKa values between 3 and 11, the bounds set by `min_pka` and `max_pka`.
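
To see why a pKa window matters, it helps to compare a pKa with the pH of interest. The sketch below uses the Henderson-Hasselbalch relation for a simple monoprotic acid. The function name `fraction_deprotonated` and the pH of 7.4 are illustrative choices and are not part of the toolkit.

```python
def fraction_deprotonated(pka, ph):
    """Fraction of an acid HA present as A- at a given pH (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# An acid whose pKa equals the pH is half deprotonated
print(fraction_deprotonated(pka=7.4, ph=7.4))  # 0.5

# Each pKa unit below the pH shifts the ratio [A-]/[HA] by a factor of 10
print(round(fraction_deprotonated(pka=4.4, ph=7.4), 4))
```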

In [4]:
# First, calculate the pKa with MolGpKa

In [5]:
# Initialize filter
# pka_filter = pKaFilter(
#     min_pka=3.0,
#     max_pka=11,
#     min_unit_sep=1,
#     pka_column="<name-of-column-containing-pkas>",
# )

# Filter the compounds and return dataframe
# pka_df = pka_filter.filter(
#     df,
#     smiles_column="OPENADMET_CANONICAL_SMILES",
#     mode="mark")

### 3.3 Datamol filtering
The Datamol filter incorporates many of `datamol`'s physchem filters. All the available options are given below:

| Filter argument | Description |
|----------|----------|
| `mw` | Filter by molecular weight |
| `fsp3` | Filter by fraction of sp3 carbons |
| `n_hba` | Filter by number of hydrogen bond acceptors |
| `n_hbd` | Filter by number of hydrogen bond donors |
| `n_rings` | Filter by number of rings |
| `n_hetero_atoms` | Filter by number of heteroatoms (non-carbon and non-hydrogen atoms) |
| `n_heavy_atoms` | Filter by number of heavy atoms |
| `n_rotatable_bonds` | Filter by number of rotatable bonds |
| `n_aliphatic_rings` | Filter by number of aliphatic rings |
| `n_aromatic_rings` | Filter by number of aromatic rings |
| `n_saturated_rings` | Filter by number of saturated rings |
| `n_radical_electrons` | Filter by number of radical electrons |
| `tpsa` | Filter by topological polar surface area |
| `qed` | Filter by quantitative estimate of drug-likeness |
| `clogp` | Filter by calculated logP (partition coefficient) |
| `sas` | Filter by synthetic accessibility score |
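
Each of these filters amounts to a range check on a computed descriptor. A minimal `pandas` sketch of that idea follows, assuming inclusive bounds. `in_range` is a hypothetical helper, not the toolkit's implementation, and the three molecular weights are copied from the outputs shown in this notebook.

```python
import pandas as pd

# Three molecular weights taken from the outputs shown in this notebook
toy = pd.DataFrame({"mw": [340.155370, 241.136117, 419.811796]})


def in_range(values, min_value=None, max_value=None):
    """Hypothetical helper: True where min_value <= value <= max_value."""
    passed = pd.Series(True, index=values.index)
    if min_value is not None:
        passed &= values >= min_value
    if max_value is not None:
        passed &= values <= max_value
    return passed


toy["passed_mw_filter"] = in_range(toy["mw"], min_value=10, max_value=400)
print(toy)
```

Leaving either bound as `None` turns the check into a one-sided threshold.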


In [6]:
# Initialize filter
mw_filter = DatamolFilter(
    name='mw',
    min_value=10,
    max_value=400,
    data_column="mw")
    
# Filter the compounds and return dataframe
mw_df = mw_filter.filter(
    df,
    smiles_column="OPENADMET_CANONICAL_SMILES",
    mode="remove")

mw_df

2025-07-01 15:08:48.083 | INFO     | openadmet.toolkit.filtering.filter_base:set_mol_column:55 - mol column OPENADMET_CANONICAL_SMILES already present, skipping
2025-07-01 15:08:48.218 | INFO     | openadmet.toolkit.filtering.filter_base:set_mol_column:55 - mol column OPENADMET_CANONICAL_SMILES already present, skipping


Unnamed: 0,INCHIKEY,OPENADMET_LOGAC50,OPENADMET_CANONICAL_SMILES,mol,SMARTS,passed_smarts_filter,mw
0,AADCDMQTJNYOSS-LBPRGKRZSA-N,4.725066,CCC1=CC(Cl)=C(OC)C(C(=O)NC[C@@H]2CCCN2CC)=C1O,<rdkit.Chem.rdchem.Mol object at 0x161416a40>,[chromophore],True,340.155370
1,AAEVYOVXGOFMJO-UHFFFAOYSA-N,4.548181,CSC1=NC(NC(C)C)=NC(NC(C)C)=N1,<rdkit.Chem.rdchem.Mol object at 0x106705310>,[],False,241.136117
2,AAKJLRGGTJKAMG-UHFFFAOYSA-N,4.991733,C#CC1=CC=CC(NC2=NC=NC3=CC(OCCOC)=C(OCCOC)C=C23...,<rdkit.Chem.rdchem.Mol object at 0x161416810>,[chromophore],True,393.168856
3,AAOVKJBEBIDNHE-UHFFFAOYSA-N,4.499446,CN1C(=O)CN=C(C2=CC=CC=C2)C2=CC(Cl)=CC=C21,<rdkit.Chem.rdchem.Mol object at 0x161416ab0>,[chromophore],True,284.071641
5,ABJKWBDEJIDSJZ-UHFFFAOYSA-N,4.750066,CN(CC1=CC=C(C(C)(C)C)C=C1)CC1=CC=CC2=CC=CC=C12,<rdkit.Chem.rdchem.Mol object at 0x161416c00>,[chromophore],True,317.214350
...,...,...,...,...,...,...,...
2343,ZZMVLMVFYMGSMY-UHFFFAOYSA-N,5.004638,CC(C)CC(C)NC1=CC=C(NC2=CC=CC=C2)C=C1,<rdkit.Chem.rdchem.Mol object at 0x161c53370>,[chromophore],True,268.193949
2344,ZZVUWRFHKOJYTH-UHFFFAOYSA-N,4.323159,CN(C)CCOC(C1=CC=CC=C1)C1=CC=CC=C1,<rdkit.Chem.rdchem.Mol object at 0x161c533e0>,[chromophore],True,255.162314
2345,ZZYASVWWDLJXIM-UHFFFAOYSA-N,4.608399,CC(C)(C)C1=CC(=O)C(C(C)(C)C)=CC1=O,<rdkit.Chem.rdchem.Mol object at 0x161c53450>,[],False,220.146330
2346,ZZYSLNWGKKDOML-UHFFFAOYSA-N,4.647075,CCC1=NN(C)C(C(=O)NCC2=CC=C(C(C)(C)C)C=C2)=C1Cl,<rdkit.Chem.rdchem.Mol object at 0x161c534c0>,[chromophore],True,333.160790


### 3.4 Proximity filtering

The proximity filter selects compounds based on how close a match for one SMARTS pattern is to a match for another SMARTS pattern within the same compound.
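
One plausible way to measure this kind of proximity is the smallest number of bonds separating any atom in one match from any atom in the other. The sketch below computes that with a breadth-first search over a toy molecular graph. It is an assumption about the general idea, not a description of how the toolkit computes its distances, and both function names are hypothetical.

```python
from collections import deque


def bond_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def min_match_distance(adjacency, match_a, match_b):
    """Smallest bond count between any atom of match_a and any atom of match_b."""
    best = None
    for a in match_a:
        dist = bond_distances(adjacency, a)
        for b in match_b:
            if b in dist and (best is None or dist[b] < best):
                best = dist[b]
    return best


# Toy molecular graph: a five-atom chain 0-1-2-3-4 (atom indices are illustrative)
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(min_match_distance(chain, match_a=(0, 1), match_b=(4,)))  # 3
print(min_match_distance(chain, match_a=(0, 1), match_b=(1, 2)))  # 0, the matches share atom 1
```

Under this definition, two matches that share an atom are at distance 0, so overlapping patterns need a minimum distance of 0 to pass.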

In [7]:
# Initialize filter
distance_filter = ProximityFilter(
    smarts_list_a=["[n,o,c][c,n,o]cc"], # SMARTS patterns for the first group
    smarts_list_b=["[n,o]"], # SMARTS patterns for the second group (aromatic N or O)
    names_list_a=["chromophore"],
    names_list_b=["fluoro-aromatic"],
    smarts_column_a="chromophore",
    smarts_column_b="fluoro",
    min_dist=1,
    max_dist=3)

# Filter the compounds and return the dataframe
distance_df = distance_filter.filter(
    df,
    smiles_column="OPENADMET_CANONICAL_SMILES", 
    mode="mark")

distance_df

2025-07-01 15:08:48.258 | INFO     | openadmet.toolkit.filtering.filter_base:set_mol_column:55 - mol column OPENADMET_CANONICAL_SMILES already present, skipping
2025-07-01 15:08:48.493 | INFO     | openadmet.toolkit.filtering.filter_base:set_mol_column:55 - mol column OPENADMET_CANONICAL_SMILES already present, skipping


Unnamed: 0,INCHIKEY,OPENADMET_LOGAC50,OPENADMET_CANONICAL_SMILES,mol,SMARTS,passed_smarts_filter,mw,passed_mw_filter,chromophore,fluoro,inter_distances,passed_proximity_filter
0,AADCDMQTJNYOSS-LBPRGKRZSA-N,4.725066,CCC1=CC(Cl)=C(OC)C(C(=O)NC[C@@H]2CCCN2CC)=C1O,<rdkit.Chem.rdchem.Mol object at 0x161416a40>,[chromophore],True,340.155370,True,"{'chromophore': [(2, 3, 4, 6), (2, 21, 9, 6), ...",{},,False
1,AAEVYOVXGOFMJO-UHFFFAOYSA-N,4.548181,CSC1=NC(NC(C)C)=NC(NC(C)C)=N1,<rdkit.Chem.rdchem.Mol object at 0x106705310>,[],False,241.136117,True,{},"{'fluoro-aromatic': [(3,), (9,), (15,)]}",,False
2,AAKJLRGGTJKAMG-UHFFFAOYSA-N,4.991733,C#CC1=CC=CC(NC2=NC=NC3=CC(OCCOC)=C(OCCOC)C=C23...,<rdkit.Chem.rdchem.Mol object at 0x161416810>,[chromophore],True,393.168856,True,"{'chromophore': [(2, 3, 4, 5), (2, 28, 6, 5), ...","{'fluoro-aromatic': [(9,), (11,)]}",0.0,True
3,AAOVKJBEBIDNHE-UHFFFAOYSA-N,4.499446,CN1C(=O)CN=C(C2=CC=CC=C2)C2=CC(Cl)=CC=C21,<rdkit.Chem.rdchem.Mol object at 0x161416ab0>,[chromophore],True,284.071641,True,"{'chromophore': [(7, 8, 9, 10), (7, 12, 11, 10...",{},,False
4,AAPVQEMYVNZIOO-UHFFFAOYSA-N,4.663742,O=S1(=O)OCC2C(CO1)C1(Cl)C(Cl)=C(Cl)C2(Cl)C1(Cl)Cl,<rdkit.Chem.rdchem.Mol object at 0x161416b90>,[],False,419.811796,False,{},{},,False
...,...,...,...,...,...,...,...,...,...,...,...,...
2343,ZZMVLMVFYMGSMY-UHFFFAOYSA-N,5.004638,CC(C)CC(C)NC1=CC=C(NC2=CC=CC=C2)C=C1,<rdkit.Chem.rdchem.Mol object at 0x161c53370>,[chromophore],True,268.193949,True,"{'chromophore': [(7, 8, 9, 10), (7, 19, 18, 10...",{},,False
2344,ZZVUWRFHKOJYTH-UHFFFAOYSA-N,4.323159,CN(C)CCOC(C1=CC=CC=C1)C1=CC=CC=C1,<rdkit.Chem.rdchem.Mol object at 0x161c533e0>,[chromophore],True,255.162314,True,"{'chromophore': [(7, 8, 9, 10), (7, 12, 11, 10...",{},,False
2345,ZZYASVWWDLJXIM-UHFFFAOYSA-N,4.608399,CC(C)(C)C1=CC(=O)C(C(C)(C)C)=CC1=O,<rdkit.Chem.rdchem.Mol object at 0x161c53450>,[],False,220.146330,True,{},{},,False
2346,ZZYSLNWGKKDOML-UHFFFAOYSA-N,4.647075,CCC1=NN(C)C(C(=O)NCC2=CC=C(C(C)(C)C)C=C2)=C1Cl,<rdkit.Chem.rdchem.Mol object at 0x161c534c0>,[chromophore],True,333.160790,True,"{'chromophore': [(3, 2, 21, 6), (3, 4, 6, 21),...","{'fluoro-aromatic': [(3,), (4,)]}",0.0,True


## 4. Save your filtered data

When you have finished filtering your data to your liking, save the newly curated data file. Your data is now ready for model training with Anvil!

In [8]:
# Drop the mol column (RDKit Mol objects) before writing to Parquet
mw_df = mw_df.drop(columns="mol")

mw_df.to_parquet("./processed_data/filtered_PXR_pubchem.parquet", index=False)

✨✨✨✨✨✨✨