# CafChem tools: docking and rescoring with the UMA MLIP

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MauricioCafiero/CafChem/blob/main/notebooks/Rescore_Docking_UMA_CafChem.ipynb)

## This notebook allows you to:
- Dock a single SMILES string, a list of SMILES strings, or a CSV file with SMILES in one column.
- Save poses as SDF files.
- Calculate the interaction energy between the ligand and the protein using Meta's UMA MLIP.

## Requirements:
- This notebook installs DeepChem, dockstring, Open Babel, fairchem-core, and py3Dmol, along with their dependencies.
- It pulls the CafChem tools from GitHub.
- You need an HF_TOKEN set as a secret to access the UMA MLIP.
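The token requirement above can be checked before the model download starts. The following is a minimal sketch, not part of the CafChem tools: `get_hf_token` is a hypothetical helper that first tries the Colab secrets store (`google.colab.userdata`) and otherwise falls back to the `HF_TOKEN` environment variable. Whether your session needs the token exported to the environment is an assumption to verify for your setup.

```python
import os


def get_hf_token():
    """Return the Hugging Face token, or None if it cannot be found.

    Hypothetical helper: tries the Colab secret named HF_TOKEN first and
    falls back to the HF_TOKEN environment variable outside Colab.
    """
    try:
        # Only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook
        return os.environ.get("HF_TOKEN")


token = get_hf_token()
if token is None:
    print("HF_TOKEN not found: add it as a Colab secret before loading UMA.")
else:
    # Export it so libraries that read HF_TOKEN from the environment can see it
    os.environ["HF_TOKEN"] = token
    print("HF_TOKEN found.")
```

Running this right after the installs gives an early, readable message instead of a failed checkpoint download later on.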

# Set-up

This block:

- Installs and loads all needed modules/libraries

### Install a few libraries

In [1]:
! pip install deepchem
! pip install dockstring
! pip install openbabel-wheel

Collecting deepchem
  Downloading deepchem-2.8.0-py3-none-any.whl.metadata (2.0 kB)
Collecting rdkit (from deepchem)
  Downloading rdkit-2025.3.3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.0 kB)
Downloading deepchem-2.8.0-py3-none-any.whl (1.0 MB)
Downloading rdkit-2025.3.3-cp311-cp311-manylinux_2_28_x86_64.whl (34.9 MB)
Installing collected packages: rdkit, deepchem
Successfully installed deepchem-2.8.0 rdkit-2025.3.3
Collecting dockstring
  Downloading dockstring-0.3.4-py3-none-any.whl.metadata (19 kB)
Downloading dockstring-0.3.4-py3-none-any.whl (4.4 MB)
Installing collected packages: dockstring
Successfully ins

In [2]:
! pip install py3Dmol
! pip install fairchem-core

Collecting py3Dmol
  Downloading py3dmol-2.5.0-py2.py3-none-any.whl.metadata (2.1 kB)
Downloading py3dmol-2.5.0-py2.py3-none-any.whl (7.2 kB)
Installing collected packages: py3Dmol
Successfully installed py3Dmol-2.5.0
Collecting fairchem-core
  Downloading fairchem_core-2.2.0-py3-none-any.whl.metadata (9.3 kB)
Collecting ase-db-backends>=0.10.0 (from fairchem-core)
  Downloading ase_db_backends-0.10.0-py3-none-any.whl.metadata (600 bytes)
Collecting ase>=3.25.0 (from fairchem-core)
  Downloading ase-3.25.0-py3-none-any.whl.metadata (4.2 kB)
Collecting e3nn>=0.5 (from fairchem-core)
  Downloading e3nn-0.5.6-py3-none-any.whl.metadata (5.4 kB)
Collecting hydra-core (from fairchem-core)
  Downloading hydra_core-1.3.2-py3-none-any.whl.metadata (5.5 kB)
Collecting lmdb (from fairchem-core)
  Downloading lmdb-1.6.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.1 kB)
Collecting numba>=0.61.2 (from fairchem-core)
  Downloading numba-0.61.2-cp311-cp311-manylinux2014_x86_

### Import libraries, pull CafChem from Github

In [3]:
!git clone https://github.com/MauricioCafiero/CafChem.git

Cloning into 'CafChem'...
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (80/80), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 80 (delta 35), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (80/80), 1.38 MiB | 7.15 MiB/s, done.
Resolving deltas: 100% (35/35), done.


In [5]:
import torch
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import files
from fairchem.core import FAIRChemCalculator, pretrained_mlip
import CafChem.CafChemReDock as ccr

cpuCount = os.cpu_count()
print(cpuCount)

2


## Set-up Fairchem
- Must have HF_TOKEN saved as a secret

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"

predictor = pretrained_mlip.get_predict_unit("uma-s-1", device=device)
calculator = FAIRChemCalculator(predictor, task_name="omol")
model = "UMA-OMOL"

checkpoints/uma-s-1.pt:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

# Calculations

## Dock molecules
- Available tools: ccr.dock_dataframe, ccr.dock_list, and ccr.dock_smiles.
- Each takes the SMILES input (a CSV filename, a list, or a single SMILES string, respectively), the target protein, and the number of CPU cores to use. ccr.dock_dataframe also takes the key of the SMILES column in the CSV file.
- XYZ structures can be visualized with the ccr.visualize_molecule tool, which accepts an XYZ string as its argument. The string can be read from an XYZ file, as seen below.
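For the CSV route, the input file only needs a header row naming the SMILES column. The sketch below writes and reads back such a file with the standard library. The filename `example_ligands.csv` and the two SMILES strings are illustrative assumptions, and the column name `smiles` matches the key passed to ccr.dock_dataframe in this notebook.

```python
import csv

# Illustrative ligands (ethanol and benzene); replace with your own SMILES
example_smiles = ["CCO", "c1ccccc1"]

# Write one SMILES per row under a "smiles" header
with open("example_ligands.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["smiles"])
    for smi in example_smiles:
        writer.writerow([smi])

# Read the column back to confirm the layout
with open("example_ligands.csv", newline="") as handle:
    loaded = [row["smiles"] for row in csv.DictReader(handle)]

print(loaded)
```

The same column key is what you would pass as the last argument of ccr.dock_dataframe.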

In [None]:
scores = ccr.dock_dataframe("file.csv", "HMGCR", cpuCount, "smiles")
print(scores)

Docking 1 molecules in HMGCR.
Docking molecule 1.
SDF file written for score -4.5
[-4.5]


In [None]:
df = pd.read_csv("file.csv")
smiles_list = df["smiles"].tolist()
scores = ccr.dock_list(smiles_list,"HMGCR",cpuCount)
print(scores)

Docking 1 molecules in HMGCR.
Docking molecule 1.
SDF file written for score -4.5
[-4.5]


In [7]:
# raw string so the backslashes in the SMILES are not treated as escape sequences
statin = r"OC(=O)C[C@H](O)C[C@H](O)\C=C\c1c(C(C)C)nc(N(C)S(=O)(=O)C)nc1c2ccc(F)cc2"
score = ccr.dock_smiles(statin,"HMGCR",cpuCount)
print(f"score: {score}")

Docking molecule.
SDF file written for score -8.1
score: -8.1


## Calculate interaction energies between a docking pose and the protein using Meta's UMA MLIP
- If CafChem has an XYZ QM active site prepared for the protein, then the interaction between a ligand (SDF file) and the protein active site (from the library) may be calculated using Meta's UMA MLIP.
- Supply as arguments the name of the SDF file (without .sdf), the protein information (in the form ccr.[your protein]_data), the ASE calculator, the charge and spin multiplicity of the ligand, and a boolean that turns geometry optimization on or off.
- Returns a list of XYZ strings for the ligands in the input SDF files.
- The XYZ strings may be visualized with the ccr.visualize_molecule tool, which accepts the XYZ string as its argument.
- The complex XYZ file can be transformed into a G16 counterpoise input file using ccr.complexG16, which takes as its arguments the complex XYZ file, the target object, the ligand charge, and the ligand spin multiplicity.
- Test data: docking rosuvastatin ("OC(=O)C[C@H](O)C[C@H](O)\C=C\c1c(C(C)C)nc(N(C)S(=O)(=O)C)nc1c2ccc(F)cc2") should give a score of -8.1. Passing that SDF into the uma_interaction function with optimization on should give an interaction energy of -285 kcal/mol. Making a G16 file and running it as is (wB97XD/def2-TZVPP) should give a CP-corrected interaction energy of -275 kcal/mol, a difference of only about 3.5%.

In [None]:
total_xyz = ccr.uma_interaction("trial_S", ccr.HMGCR_data, calculator, -1, 1, False)

The size of the complex is: 391
Energy of complex is: -9727.851 ha
The size of the ligand is: 60
Energy of ligand is: -1968.269 ha
The size of the active site is: 331
Energy of active site is: -7759.170 ha
Energy difference is: -258.817 kcal/mol
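The printed "Energy difference" appears to be the usual supermolecular interaction energy, E(complex) - E(ligand) - E(active site), converted from hartree to kcal/mol. The sketch below redoes that arithmetic from the three energies printed above. It is an assumption about what the function reports, not CafChem's own code, and the conversion factor of 627.5095 kcal/mol per hartree is the standard value.

```python
# Energies in hartree, copied from the output above
e_complex = -9727.851
e_ligand = -1968.269
e_site = -7759.170

HARTREE_TO_KCAL = 627.5095  # kcal/mol per hartree

# Interaction energy: complex minus its two separated fragments
e_int_hartree = e_complex - e_ligand - e_site
e_int_kcal = e_int_hartree * HARTREE_TO_KCAL

print(f"Interaction energy: {e_int_hartree:.3f} ha = {e_int_kcal:.1f} kcal/mol")
```

The result lands within a fraction of a kcal/mol of the reported -258.817; the small gap is presumably because the printed energies are rounded to three decimals.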


In [8]:
total_xyz = ccr.uma_interaction("trial_1", ccr.HMGCR_data, calculator, -1, 1, True)

The size of the complex is: 391
      Step     Time          Energy          fmax
BFGS:    0 19:14:47  -264708.476048       23.289714
BFGS:    1 19:14:48  -264716.883977        3.016549
BFGS:    2 19:14:49  -264719.107904        2.636233
BFGS:    3 19:14:50  -264721.137780        1.891752
BFGS:    4 19:14:51  -264722.848358        1.699315
BFGS:    5 19:14:52  -264723.877274        1.298434
BFGS:    6 19:14:54  -264724.565382        1.454350
BFGS:    7 19:14:56  -264725.092390        1.539927
BFGS:    8 19:14:57  -264725.797421        2.001561
BFGS:    9 19:14:58  -264726.766555        2.980357
BFGS:   10 19:14:59  -264728.034142        5.362298
BFGS:   11 19:15:00  -264730.205465        6.490514
BFGS:   12 19:15:01  -264731.983682        3.913967
BFGS:   13 19:15:03  -264733.616760        3.754091
BFGS:   14 19:15:04  -264735.337526        2.116248
BFGS:   15 19:15:05  -264736.365909        2.891035
BFGS:   16 19:15:06  -264736.980214        1.363588
BFGS:   17 19:15:08  -264737.30325

In [15]:
ccr.complexG16("optimized_complex.xyz",ccr.HMGCR_data,-1,1)

In [None]:
ccr.visualize_molecule(total_xyz[0])

In [None]:
# read the whole XYZ file into one string
with open("/content/optimized_complex.xyz", "r") as f:
    structure = f.read()

ccr.visualize_molecule(structure)

## Generate a constraints list

In [None]:
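# Read the QM active-site PDB file
with open("/content/HMGCR_dude_QM_site.pdb", "r") as f:
    lines = f.readlines()

# Collect zero-based indices of the atoms named "CA"
constraints = []
for line in lines:
  parts = line.split()
  # need at least three fields before indexing parts[2]
  if len(parts) > 2 and parts[2] == "CA":
    constraints.append(int(parts[1])-1)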

print(constraints)

[1, 11, 16, 24, 33, 41, 54, 60, 72, 83, 92, 98, 107, 124, 132, 140, 148, 159, 168, 181]


In [None]:
print(len(constraints))

20
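Splitting on whitespace works for this file, but PDB is a fixed-column format and adjacent fields can run together in larger structures. As a hedged alternative, the sketch below reads the atom serial number from columns 7-11 and the atom name from columns 13-16 of ATOM records. `alpha_carbon_indices` and `make_atom_line` are hypothetical helpers, and the sample lines are synthetic rather than taken from the HMGCR site file.

```python
def make_atom_line(serial, name, resname, resseq):
    """Build a synthetic, column-aligned PDB ATOM record (illustration only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {resname:>3} A{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}"
    )


def alpha_carbon_indices(pdb_lines):
    """Return zero-based indices of atoms named CA in ATOM records."""
    indices = []
    for line in pdb_lines:
        # Columns 13-16 hold the atom name, columns 7-11 the serial number
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            indices.append(int(line[6:11]) - 1)
    return indices


sample = [
    make_atom_line(1, " N", "ALA", 1),
    make_atom_line(2, " CA", "ALA", 1),
    make_atom_line(3, " C", "ALA", 1),
    make_atom_line(12, " CA", "GLY", 2),
    "END",
]

print(alpha_carbon_indices(sample))
```

Restricting the match to ATOM records also skips HETATM calcium ions, which share the name CA; drop that check if your site file stores residues differently.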
