# **The Molecular Treasure Hunt: Part 2: Docking a ligand library against each of your protein receptors**

*An Ångstrom-sized adventure by Sarah Harris (Leeds Physics) and Geoff Wells (UCL Pharmacy)*

"*You cannot pass! I am a servant of the Secret Fire, wielder of the flame of Anor. The dark fire will not avail you, flame of Udûn.*" J.R.R. Tolkien, The Fellowship of the Ring.

This procedure for understanding the amazing phenomenon of molecular recognition between biomolecules has been designed with a strong agenda. We believe that including the effect of biomolecular dynamics is key to improving our ability to make predictions about how biomolecules will interact. We can see these multiple conformational states either using NMR or cryo-EM experiments, or assessing the diversity of X-ray crystal structures, or using theoretical tools such as atomistic molecular dynamics. You will therefore (we hope!!) include a number of protein structures in your "receptor" directory.

How drugs bind to multiple protein conformations depends upon the underlying physics and chemistry of proteins. In a real biological context (*in vivo*), of course, this will also depend on the environment of the protein, e.g. whether it is located in a large complex or in a membrane environment. For molecules to be orally bioavailable drugs, there are more stringent criteria (e.g. Lipinski's rule of 5).

Thermal fluctuations (physics) drive conformational changes, but within the constraints imposed by the chemical structure of the proteins and the ligands. The requirement for the drug to reach the target site imposes additional criteria which depend on its physicochemical properties (e.g. clogP). Other considerations, such as the specificity of the compound, its toxicity (related to specificity), synthetic accessibility (e.g. can a chemist make it?) and ease of formulation (e.g. can this be made into a safe and effective medicine, and is it feasible to make large quantities of it?) are essential in practice, and can be non-trivial to overcome.
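
As a rough illustration of the kind of physicochemical filter mentioned above, the sketch below checks Lipinski's rule of 5 using its usual thresholds (molecular weight no more than 500 Da, logP no more than 5, no more than 5 hydrogen bond donors and no more than 10 hydrogen bond acceptors). The function names and the example property values are made up for illustration and are not part of the Treasure Hunt code. Allowing one violation is a common convention that we assume here.

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count how many of Lipinski's four criteria a compound breaks."""
    checks = [
        mw > 500,          # molecular weight above 500 Da
        logp > 5,          # (calculated) logP above 5
        h_donors > 5,      # more than 5 hydrogen bond donors
        h_acceptors > 10,  # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)


def passes_rule_of_5(mw, logp, h_donors, h_acceptors, max_violations=1):
    """Assumed convention: tolerate at most one violation."""
    return lipinski_violations(mw, logp, h_donors, h_acceptors) <= max_violations


# Hypothetical property values, for illustration only
print(passes_rule_of_5(mw=350.4, logp=2.1, h_donors=2, h_acceptors=5))
print(passes_rule_of_5(mw=720.9, logp=6.3, h_donors=6, h_acceptors=12))
```

Passing a filter like this says nothing about how well a compound binds; it only flags compounds whose properties make oral bioavailability less likely.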

In this part we also take our docking results and do a few further things:
- We calculate interactions between the ligands and the protein
- We produce some .sdf files that we can use with other chemoinformatics software (e.g. DataWarrior)

There are a number of features to protein-ligand interactions that we can quantify and use to identify our 'best' or most promising compounds. Here is a list of the features that we calculate:
- Close contacts
- Hydrophobic interactions
- Hydrogen bonds
- Salt bridges
- Halogen bonds
- Pi-stacking interactions (parallel and perpendicular)
- Pi-cation interactions

The importance of each of these will depend on the chemistry of your ligands and their interaction with the protein. For example, halogen bonds are only relevant to compounds that contain halogens (F, Cl, Br, I) and salt bridge interactions are only relevant to ligands with charges complementary to the protein within the binding site.
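
To make the idea of counting interactions concrete, here is a minimal, library-free sketch of the simplest feature in the list, close contacts: it counts protein-ligand atom pairs that lie within a distance cutoff. This is not how ODDT implements it, the function name and coordinates are invented for illustration, and the 3 Å cutoff simply mirrors the value used later in this notebook.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def count_close_contacts(protein_xyz, ligand_xyz, cutoff=3.0):
    """Count protein-ligand atom pairs separated by at most `cutoff` Angstroms."""
    return sum(
        1
        for prot_atom in protein_xyz
        for lig_atom in ligand_xyz
        if dist(prot_atom, lig_atom) <= cutoff
    )


# Made-up heavy-atom coordinates (in Angstroms), for illustration only
protein_atoms = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
ligand_atoms = [(2.5, 0.0, 0.0), (4.0, 0.0, 0.0)]

print(count_close_contacts(protein_atoms, ligand_atoms))
```

The other features in the list add geometric or chemical conditions on top of distance (for example, angles for hydrogen bonds or ring orientation for pi-stacking), which is why we leave them to ODDT.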

Enjoy your Treasure Hunt!! Your search space encompasses the vast possibilities offered by the combinatorics of multiple protein conformations and enormous chemical diversity, particularly if you have embarked on a fragment library dock. The limits imposed by pharmaceutical requirements, though, are stringent and in many ways unknown. Good luck!!

We start by importing a number of python modules to run our calculations:

In [None]:
import sys
import os
import subprocess
#Importing glob is important for recursive filename searches - we use this to find all of our protein and ligand files!
from glob import glob
from pathlib import Path
from termcolor import colored
import multiprocessing as mp

import numpy as np

import oddt
from oddt.interactions import (close_contacts,
                               hbonds,
                               distance,
                               halogenbonds,
                               halogenbond_acceptor_halogen,
                               pi_stacking,
                               salt_bridges,
                               pi_cation,
                               hydrophobic_contacts)

# Setting your ligand library and the protein receptors for your docking
Your parent folder is the directory that contains the various docking folders and your receptor folder. The Jupyter notebooks are opened in each of the docking folders. Your proteins should be located in the "receptor" folder. This notebook will dock every single ligand in your chosen compound library against all of the protein structures (e.g. pdb files) that are in the "receptor" directory.

### Protein and ligand files
Before you start your treasure hunt, you must make sure that your proteins are **aligned** to each other, that they are **protonated** correctly and that you are happy with the position of the **grid box** that defines your receptor site for each protein structure. Check this by visualising in Chimera!!
Your ligand files need to have 3D coordinates and be saved as individual molecule files in pdbqt format. They should have hydrogens added and be in the correct ionisation state. The compound libraries provided are a good template for the format and naming conventions of the ligand files.
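
The grid box is defined in the Vina configuration file, which this notebook expects to find at ../receptor/conf.txt. As a hedged sketch of what such a file might contain, the helper below (a hypothetical name, not part of the notebook) writes the standard AutoDock Vina box keys center_x/y/z and size_x/y/z, plus num_modes. The coordinates and sizes are placeholders, not values for any real protein; take yours from the box you checked in Chimera.

```python
def make_vina_config(center, size, num_modes=1):
    """Return the text of a minimal AutoDock Vina config file.

    center and size are (x, y, z) tuples in Angstroms.
    """
    lines = []
    for axis, value in zip("xyz", center):
        lines.append(f"center_{axis} = {value}")
    for axis, value in zip("xyz", size):
        lines.append(f"size_{axis} = {value}")
    # num_modes = 1 asks Vina to write out only the top-scoring pose
    lines.append(f"num_modes = {num_modes}")
    return "\n".join(lines) + "\n"


# Placeholder box: replace with the centre and size of your own binding site
config_text = make_vina_config(center=(10.0, 12.5, -3.0), size=(20.0, 20.0, 20.0))
print(config_text)
```

Because all of the receptor structures share one configuration file, the same box is used for every conformer, which is why the proteins must be aligned first.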

In [None]:
#This defines our receptor structures (all of these files happen to have the same prefix; the starting (crystal) structure is xxx_00.pdb)
receptor_filenames = glob('../receptor/*.pdb')
#This defines where the ligand files are found
ligand_filenames = glob('/home/geoff/compounds/FtsZ/pharmit/pocket3/*.pdbqt')

#The sort command is *really* important for organising the output in the right order (the protein files are mixed up otherwise)
receptor_filenames.sort()
print(colored('Receptor filenames', 'blue', attrs=['bold']))
print(receptor_filenames)
print("Number of proteins is", len(receptor_filenames))

ligand_filenames.sort()
print("Number of ligands is", len(ligand_filenames))

### Number of processors and number of docking steps per processor
This tells us how many processors we have available (or lets us set that number) and helps us to estimate how long this step of the Molecular Treasure Hunt will take to run. Larger molecules are slower to dock (e.g. the FDA library compared to fragment libraries) because they have more rotatable bonds, but the number of docking steps per cpu still gives a rough guide to how long the docking will take.

In [None]:
num_cpu = mp.cpu_count()
print("Number of processors:", num_cpu)
#If necessary we can set the number of processors to be less than the mp.cpu_count()
#by commenting out the top line (by adding a # at the start of the line)
#and uncommenting the line below (by removing the # from the line) using a number you chose.
#only one of the two lines (EITHER the top or the bottom one) should be uncommented.
#num_cpu = 4

docking_steps = len(ligand_filenames) * len(receptor_filenames)
print("Number of individual docking_steps is", docking_steps)
docking_steps_per_cpu = round((docking_steps / num_cpu), 2)
print("Number of docking steps per cpu is ", docking_steps_per_cpu)
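
To turn the number of docking steps per cpu into a rough wall-clock estimate, multiply it by the average time that one docking takes. The sketch below uses 30 seconds per docking purely as a placeholder; time a few of your own ligands to get a realistic number. The function name and all of the example numbers are ours and are not part of the notebook.

```python
def estimate_runtime_hours(n_ligands, n_receptors, n_cpu, seconds_per_dock):
    """Rough wall-clock estimate, assuming the work is spread evenly over the cpus."""
    docking_steps = n_ligands * n_receptors
    steps_per_cpu = docking_steps / n_cpu
    return steps_per_cpu * seconds_per_dock / 3600


# Hypothetical numbers: 1000 ligands, 5 protein conformers, 4 cpus, 30 s per docking
hours = estimate_runtime_hours(1000, 5, 4, seconds_per_dock=30)
print(f"Estimated run time: {hours:.1f} hours")
```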

### Results folder location
Here we create a folder to store the results in. You can change its name, although it is best not to! (Do not use spaces, or punctuation other than underscores or hyphens, in the folder name.)

In [None]:
Results_Folder = 'MTH_Results'
rf = Path(Results_Folder)
try:
    rf.mkdir()
except FileExistsError as exc:
    print(colored('The results folder is already present - check that it is empty before starting!', 'blue',  attrs=['bold']))

## Making the protein files that we need for the docking calculations
### Preparing the protein .pdbqt files
The structures that we start with and align are .pdb files - these contain information on the atoms that are in the protein, the residue names and the x, y and z coordinates of each atom. For the docking calculations a modified version of the .pdb file is necessary that contains additional atom charge (q) and atom type (t) information - this is called a .pdbqt file. **Note**: the .pdbqt files do not contain non-polar hydrogens (this is a requirement for AutoDock Vina-based docking calculations).
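
To see where the extra "q" and "t" information lives, the sketch below reads the partial charge and atom type from the end of a single PDBQT atom line. The function name is hypothetical and the atom record is made up; real files are written for you by the code in the next cell.

```python
def pdbqt_charge_and_type(line):
    """Return (partial_charge, atom_type) from a PDBQT atom line, else None."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    # The last two whitespace-separated fields are the charge (q) and the type (t)
    *_, charge, atom_type = line.split()
    return float(charge), atom_type


# A made-up atom record, for illustration only
sample = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  0.00    -0.066 N "
print(pdbqt_charge_and_type(sample))
print(pdbqt_charge_and_type("REMARK  this is not an atom line"))
```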

The next piece of code generates a .pdbqt file for each of your protein .pdb files. As long as your proteins don't contain very weird atoms or residues, this should work!!

In [None]:
#Let's use oddt to make the pdbqt files that we need...
print(colored('Making pdbqt files', 'blue', attrs=['bold']))
for receptor_filename in receptor_filenames:
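    #'../receptor/xxx_00.pdb'.split(".") gives ['', '', '/receptor/xxx_00', 'pdb'], so element [2] is
    #'/receptor/xxx_00' (this relies on the only dots in the path being those in '../' and '.pdb')
    receptor_prefix = receptor_filename.split(".")[2]
    #'/receptor/xxx_00'.split("/") gives ['', 'receptor', 'xxx_00'], so element [2] is the protein name
    receptor_name = receptor_prefix.split("/")[2]
    receptor_pdbqt_file = '..' + receptor_prefix + '.pdbqt'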
    print(receptor_name)
    rec = next(oddt.toolkit.readfile('pdb', receptor_filename))
    rec.write('pdbqt', receptor_pdbqt_file, overwrite=True, opt={'r': None, 'c': None, 'h': None})
print(colored('Finished making pdbqt files', 'blue', attrs=['bold']))
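
The string splitting above works for paths of the form ../receptor/name.pdb, but it would break if a folder or file name contained an extra dot. As an alternative sketch (not what the notebook uses), pathlib can derive the same names without counting dots. PurePosixPath is used here so that the example behaves the same on any operating system, and the file name is the placeholder used earlier.

```python
from pathlib import PurePosixPath

receptor_filename = "../receptor/xxx_00.pdb"  # placeholder name
receptor_path = PurePosixPath(receptor_filename)

# File name without its suffix
receptor_name = receptor_path.stem
# Same folder and name, but with the .pdbqt suffix
receptor_pdbqt_file = str(receptor_path.with_suffix(".pdbqt"))

print(receptor_name)
print(receptor_pdbqt_file)
```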

This defines a procedure for docking a series of ligands to a protein. It's horribly complicated - so only change this if you are feeling python-tastic!!

In [None]:
def dock_ligands(filename):
    #This line removes the filename suffix (.pdbqt)
    dock_input = filename.split(".")[0]
    #This line removes the filename prefix 
    dock_name = dock_input.split("/")[-1]
    vina_receptor = '..' + PDBFile + '.pdbqt'
    vina_config = "../receptor/conf.txt"
    vina_ligand = dock_input + '.pdbqt'
    vina_out = dock_name + '_out.pdbqt'
    #The next line runs a local docking experiment...
    vina_run = subprocess.run(["vina", "--receptor", vina_receptor, "--config", vina_config, "--ligand", vina_ligand, "--out", vina_out], stdout=subprocess.DEVNULL)
    
    #The input file for vina "conf.txt" contains instructions to only output the top docked pose. However, it is 
    #also necessary to run the "vina_split" command to produce a single ligand conformer, because of the flags that 
    #are present at the start and end of the pdbqt output file.

    try:
        vina_split = subprocess.run(["vina_split", "--input", vina_out], stdout=subprocess.DEVNULL)
        
        #This extracts the Vina binding energy from the output file
        with open(vina_out, 'r') as contents:
            for line in contents:
                if line.startswith("REMARK VINA RESULT:"):
                    dock_energy = round(float(line.split()[3]),4)
        #(there is no need to close the file: the "with" block does that for us)
        
        #We use oddt to process the results files and calculate interactions between the ligand and protein
        top_pose = next(oddt.toolkit.readfile('pdbqt', dock_name + '_out.pdbqt'))
        top_pose_out = top_pose.write('mol2')
        mol = top_pose
        protein = '..' + PDBFile + '.pdb'
        rec = next(oddt.toolkit.readfile('pdb', protein))
        rec.protein=True
        #Count the ligand atoms and use oddt to calculate the molecular weight and clogP
        nonHatoms = int(len(mol.atoms))
        properties = mol.calcdesc(["MW", "logP"])
        MW = round(float(properties["MW"]),3)
        clogP = round(float(properties["logP"]),3)
        #Then use these properties to calculate some ligand efficiencies
        LE = round(float(dock_energy/nonHatoms),3)
        LipE = round(float(dock_energy-clogP),3)
        #Now use oddt to calculate the interactions
        cc_count = int(len(np.asarray(close_contacts(rec.atom_dict[rec.atom_dict['atomicnum'] != 1], mol.atom_dict[mol.atom_dict['atomicnum'] != 1], cutoff=3)[0])))
        hbonds_count = int(hbonds(rec, mol, cutoff=3.5, tolerance=30)[2].sum())
        pi1, pi2, strict_parallel, strict_perpendicular = pi_stacking(rec, mol, tolerance=30)
        pipi_par=str(strict_parallel.sum())
        pipi_perp=str(strict_perpendicular.sum())
        pi_cation_count = int(len(pi_cation(rec, mol)[2]))
        #Need to check whether the pi cation count is reversible or in order...
        halogenbonds_count = int(len(halogenbonds(rec, mol)[2]))
        salt_bridges_count = int(len(salt_bridges(rec, mol)[0]))
        hyd_contacts_count = int(len(hydrophobic_contacts(rec, mol)[0]))
        
        print("AutoDock Vina Score =", dock_energy, " kcal/mol,", 'Hydrogen bonds =', hbonds_count)
        
        #This writes an output file for each ligand
        with open(PDBFile_name + "_" + dock_name + ".mol2",'w') as contents:
          contents.write("########## Name: " + dock_name + '\n' + 
                         "########## Vina_BE: " + str(dock_energy) + '\n' + 
                         "########## MW: " + str(MW) + '\n' + 
                         "########## clogP: " + str(clogP) + '\n' + 
                         "########## nonHatoms: " + str(nonHatoms) + '\n' + 
                         "########## Vina_LE: " + str(LE) + '\n' + 
                         "########## Vina_LipE: " + str(LipE) + '\n' + 
                         "########## Close_contacts: " + str(cc_count) + '\n' + 
                         "########## Hydrophob_contacts: " + str(hyd_contacts_count) + '\n' + 
                         "########## HBonds: " + str(hbonds_count) + '\n' + 
                         "########## Salt_Bridges: " + str(salt_bridges_count) + '\n' + 
                         "########## Pi-Pi_Parallel: " + pipi_par + '\n' + 
                         "########## Pi-Pi_Perpendicular: " + pipi_perp + '\n' + 
                         "########## Pi-Cation: " + str(pi_cation_count) + '\n' + 
                         "########## Halogen_Bonds: " + str(halogenbonds_count) + '\n' + 
                         top_pose_out)
    except Exception as e:
        print(f"Warning: Skipping vina_split and processing due to an error: {e}")
        
    !'/bin/rm' '{dock_name}'_out*
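
Two of the steps inside dock_ligands are easy to try in isolation: pulling the score out of the Vina output, and turning it into the efficiency metrics. The sketch below follows the definitions used in the function above (score divided by atom count, and score minus clogP). The function names, the example output text and the property values are invented for illustration.

```python
def top_vina_score(pdbqt_text):
    """Return the score from the first 'REMARK VINA RESULT:' line, or None."""
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # Fields after splitting: REMARK, VINA, RESULT:, score, then two RMSD values
            return float(line.split()[3])
    return None


def vina_efficiencies(score, n_atoms, clogp):
    """Ligand efficiency and lipophilic efficiency as defined in this notebook."""
    le = round(score / n_atoms, 3)
    lipe = round(score - clogp, 3)
    return le, lipe


# Made-up fragment of a Vina output file
example = "MODEL 1\nREMARK VINA RESULT:      -7.5      0.000      0.000\nENDMDL\n"
score = top_vina_score(example)
print(score, vina_efficiencies(score, n_atoms=25, clogp=2.0))
```

More negative values are better for both metrics here, because they are built from the Vina score rather than from a measured potency.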

Here we are performing the dock. We have the option to dock against multiple protein conformers, because we know that dynamics is important!!
You should not need to change this section, unless you want to do something special. 

In [None]:
#This runs a series of docking calculations on protein conformations found in the receptor folder
for receptor_filename in receptor_filenames:
    PDBFile = receptor_filename.split(".")[2]
    PDBFile_name = PDBFile.split("/")[2]
    print(colored('Docking for protein started', 'blue', attrs=['bold']))
    print(PDBFile_name)
   
    #Here we run the docking procedure for every ligand in our list, in parallel.
    #We use num_cpu (set earlier) so that any limit you chose on the number of processors is respected.
    with mp.Pool(num_cpu) as pool:
        pool.map(dock_ligands, ligand_filenames)
    print('Docking is complete for:',PDBFile_name)
    #This combines our output files together for each protein conformer and deletes the intermediate files
    !'/bin/cat' '{PDBFile_name}'*.mol2 >> '{Results_Folder}'/VinaResults_'{PDBFile_name}'.mol2
    !'/bin/rm' '{PDBFile_name}'*.mol2

print(colored('All of the docking is complete!', 'blue', attrs=['bold']))


# What to do next
Your new Treasure Hunt results are stored in the MTH_Results folder (unless you renamed it). You have a set of .mol2 files and .sdf files for each protein conformer. These have a VinaResults prefix. They contain the docked poses of all of the ligands in your library for each separate protein receptor configuration, along with the scores and interactions.
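
Each ligand record in a VinaResults .mol2 file starts with header lines of the form "########## Key: value", written by the docking function above. As a minimal sketch of how you might pull these into Python for a quick ranking (the next notebook does the real analysis), the hypothetical helper below collects the headers into one dictionary per ligand. The example text is invented.

```python
def read_result_headers(text):
    """Collect the '########## Key: value' headers into one dict per ligand."""
    records = []
    for line in text.splitlines():
        if not line.startswith("##########"):
            continue  # skip the mol2 coordinate section
        key, _, value = line.lstrip("#").strip().partition(": ")
        if key == "Name":
            records.append({})  # a Name header starts a new ligand record
        if records:
            records[-1][key] = value
    return records


# Invented two-ligand example in the same header format
example = (
    "########## Name: ligand_A\n########## Vina_BE: -6.2\n@<TRIPOS>MOLECULE\n"
    "########## Name: ligand_B\n########## Vina_BE: -8.1\n@<TRIPOS>MOLECULE\n"
)
# Sort so that the most negative (best) Vina score comes first
ranked = sorted(read_result_headers(example), key=lambda r: float(r["Vina_BE"]))
print([r["Name"] for r in ranked])
```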

You can analyse the data using the next Jupyter notebook, Chimera (for the .mol2 files) and/or DataWarrior (for the .sdf files).
## Opening the results in Chimera:
- File > Open > Choose the correct protein/receptor .pdb file. You should have these files on your own computer from when you aligned the proteins in Part 1. The protein file has the same name as the end of the ligand .mol2 file name, e.g. if you load in VinaResults_xxx_00.mol2 then the protein file would be xxx_00.pdb.

- Tools > Surface/Binding Analysis > ViewDock > Choose the VinaResults_xxx.mol2 file (the file is a Dock 4,5,6 file type)

- This opens a new window. You can select columns to view from the Columns > Show menu and pick the Vina_BE or other scoring/properties lists.

- You can scroll through the ligands in the list and examine how they interact with the protein.

- The presets menu allows you to view the protein as a surface coloured by hydrophobicity.

- The Chimera menus provide a number of other ways to adapt the representation.


## Opening the results in DataWarrior:
- The .sdf files can be viewed in DataWarrior, where you can explore the chemoinformatics of your results (if that is your thing!!). 
- You can also visualise the bound conformations with PyMOL using these .sdf files (but you will need to download and install PyMOL on the VirtualBox yourself!).

# Congratulations!!!
When you have these files you have made a big step on your quest - you are now ready for the next stage - have fun!

"*None shall remember the deeds that are done in the last defence of your homes. Yet the deeds will not be less valiant because they are unpraised.*" J.R.R. Tolkien, The Return of the King.

Sarah and Geoff

(an "out-of-office studios" $O^{3}S$ production)