Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error parsing PDBQT to Mol: Element 'G' not found #20

Open
linminhtoo opened this issue Sep 7, 2022 · 4 comments
Open

Error parsing PDBQT to Mol: Element 'G' not found #20

linminhtoo opened this issue Sep 7, 2022 · 4 comments

Comments

@linminhtoo
Copy link

linminhtoo commented Sep 7, 2022

hello Dr Pavel,
I chanced across your repo and found it useful to parse docked pdbqt files back to RDKit mol for further analysis. However, sometimes the pdbqt2mol() function fails as RDKit is not happy with the "G" atom type.

def pdbqt2molblock(pdbqt_block, smi, mol_id):

giving:

****
Post-condition Violation
Element 'G' not found
Violation occurred on line 93 in file /project/build/temp.linux-x86_64-cpython-310/rdkit/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****

This "G" atom type is generated by meeko at macrocycles during ligand preparation for docking with Autodock Vina, see: forlilab/Meeko#10

I saw in your in-line code comments about this issue, but I didn't manage to modify the fix_pdbqt() function successfully:

atom_pdbqt_type = re.sub('D|A', '', line[77:79]).strip() # can add meeko macrocycle types (G and \d (CG0 etc) in the sub expression if will be going to use it

Please note that I didn't use your script to run docking; it was prepared slightly differently. I am just using your code to parse back the docked .pdbqt files
I also saw your comment on build_macrocycle=:

# can do it True, but there is some problem with >=7-chains mols

Do you have any idea how to fix this issue?
I've attached a sample offending .pdbqt file: https://gist.github.com/linminhtoo/5949437ae066fdd136709971dcc36220#file-bad-pdbqt-L26-L27
As you can see, some lines have either "G" (macrocycle), and also "CG0" (not sure if this will also cause problems). I tried brute forcing by replacing "G" with "C" but the template bond order assignment step failed (seems RDKit fails to parse the ring correctly)

Thanks,
Min Htoo

@avnikonenko
Copy link
Contributor

avnikonenko commented Sep 8, 2022

Hello!

try this:

def fix_pdbqt(pdbqt_block):
    pdbqt_fixed = []
    for line in pdbqt_block.split('\n'):
        if not line.startswith('HETATM') and not line.startswith('ATOM'):
            pdbqt_fixed.append(line)
            continue
        atom_type = line[12:16].strip()
        if atom_type == 'G':
            atom_type = 'C'
        # autodock vina types
        if 'CA' in line[77:79]: #Calcium is exception
            atom_pdbqt_type = 'CA'
        else:
            atom_pdbqt_type = re.sub('D|A', '', line[77:80]).strip() # can add meeko macrocycle types (G and \d (CG0 etc) in the sub expression if will be going to use it
            if 'G' in atom_pdbqt_type: # in ['G','CG0','G0']:
                atom_pdbqt_type = 'A'

        if re.search('\d', atom_type[0]) or len(atom_pdbqt_type) == 2: #1HG or two-letter atom names such as CL,FE starts with 13
            atom_format_type = '{:<4s}'.format(atom_type)
        else: # starts with 14
            atom_format_type = ' {:<3s}'.format(atom_type)
        line = line[:12] + atom_format_type + line[16:]
        pdbqt_fixed.append(line)

    return '\n'.join(pdbqt_fixed)
    
pdbqt_fixed = fix_pdbqt(pdbqt_block)
mol = Chem.MolFromPDBBlock('\n'.join([i[:66] for i in pdbqt_fixed.split('MODEL')[1].split('\n')]), removeHs=False)

To parse PDBQT block by RDKit you actually don't need atom_pdbqt_type (symbols which start from 77 position) since we only need up to 66 symbols to read PDBQT as PDB block ( Chem.MolFromPDBBlock('\n'.join([i[:66] for i in pdbqt_fixed.split('MODEL')[1].split('\n')]), removeHs=False, sanitize=False)). So in your case you can easily just ignore the symbols from 77 position and change G to C in the atom type column ([12:16]).
In our code we use atom_pdbqt_type only to properly recognize atom symbol (atom name can contain different symbols or numbers) to get the correct formatting because atom type of elements such as Fe, Cu starts from 13 position and others does from 14 but for some reason OpenBabel ignores sometimes such rule (by the way it might be that authors of meeko have already fixed this issue forlilab/Meeko#3 but when we tried it the above mentioned solution was just easy to use).

I hope it will help you!

@linminhtoo
Copy link
Author

Hello @avnikonenko

thanks for your prompt response.

I tried what you suggested, and now RDKit can parse the molecule from PDBBlock. Still, it fails when I try to assign the bond orders (see below for the error message). I found out that in the parsed molecule, a 7-membered ring broke incorrectly.

This is the parsed molecule before assigning bond orders, note the 7-membered ring on the left has broken:
image
Here's the template mol, parsed from the SMILES, note the 7-membered ring on the right.
image

error message:

Traceback (most recent call last):
  File "/tmp/nix-shell.D8HiWB/ipykernel_75125/101573197.py", line 22, in <cell line: 11>
    mol = AllChem.AssignBondOrdersFromTemplate(template_mol, mol)
  File "/nix/store/rcsd28zlf6mwfw5yx5a921id2hzmj4ab-python3.10-rdkit-pypi-2022.3.5/lib/python3.10/site-packages/rdkit/Chem/AllChem.py", line 438, in AssignBondOrdersFromTemplate
    raise ValueError("No matching found")
ValueError: No matching found

@linminhtoo
Copy link
Author

linminhtoo commented Sep 9, 2022

I also took a look at the meeko issue you linked, and actually, their method is much more robust, I think. It actually worked for me :D

from meeko import PDBQTMolecule
pdbqt_mol = PDBQTMolecule(top_pose_pdbqt, is_dlg=False, skip_typing=True, poses_to_read=1)
rdkit_mol = pdbqt_mol[0].export_rdkit_mol()

image
as you can see, the molecule matches the template mol.

So, I would suggest we use meeko's way moving forward. They only import 3d atom coordinates from the docked pose .pdbqt file, and map those 3d coordinates onto the original rdkit molecule parsed from the SMILES. This way, we avoid dealing with bond orders which are problematic.

I am happy to help open a PR if you think this is an acceptable solution.

@DrrDom
Copy link
Owner

DrrDom commented Sep 20, 2022

This would be a valuable fix for the script. However, this vina_dock.py script is obsolete and it will be removed in some future revisions of the repository. Now I suggest to use this repo https://github.com/ci-lab-cz/docking-scripts, where we implemented support not only VIna2 but also gnina and smina (smina is invoked from gnina code). We continually support this repo.

If you will be able to fix issue there it will be helpful (ci-lab-cz/easydock#8). However, it will require more changes, because meeko changed interface of some of its functions we use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants