# **The Molecular Treasure Hunt: Part 2: Docking a ligand library against each of your protein receptors**

## **This version uses the Vina external executable for the docking**

*An Ångström-sized adventure by Sarah Harris (Leeds Physics) and Geoff Wells (UCL Pharmacy)*

# Running the Molecular Treasure Hunt!!

"*You cannot pass [your project]! I am a servant of the Secret Fire, wielder of the flame of Anor. The dark fire will not avail you, flame of Udûn [until you have submitted your report].*"

This procedure for understanding the amazing phenomenon of molecular recognition between biomolecules has been designed with a strong agenda. We believe that including the effect of biomolecular dynamics is key to improving our ability to make predictions about how biomolecules will interact. We can see these multiple conformational states either using NMR experiments, or assessing the diversity of X-ray crystal structures, or using theoretical tools such as atomistic molecular dynamics. You will therefore (we hope!!) include a number of protein structures in your "receptor" directory.

How drugs bind to multiple protein conformations depends upon the underlying physics and chemistry of proteins. In vivo, of course, this will also depend on the environment of the protein, e.g. whether it is located in a large complex or in a membrane environment. For molecules to be orally bioavailable drugs, there are more stringent criteria (e.g. Lipinski's rule of 5).

Thermal fluctuations (physics) drive conformational changes, but within the constraints imposed by the chemical structure of the proteins and the ligands. The requirement for the drug to reach the target site imposes additional criteria, which depend on its physicochemical properties (e.g. clogP). Other considerations, such as the specificity of the compound for its precise target, its toxicity (related to specificity), synthetic accessibility (e.g. can the chemists make it) and ease of formulation (e.g. can this be made into a safe and effective medicine, and is it feasible to make large quantities of it), are essential in practice, and can be non-trivial to overcome.
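
To make the "rule of 5" mentioned above concrete, here is a minimal sketch of how it could be checked once you already have the four descriptors for a compound. The function names and the example values are hypothetical, and allowing one violation is a common reading of the rule rather than something this notebook enforces.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of Lipinski's four criteria a compound breaks."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # calculated logP above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(mol_weight, logp, h_donors, h_acceptors):
    """Assumption: a compound passes if it breaks at most one criterion."""
    return lipinski_violations(mol_weight, logp, h_donors, h_acceptors) <= 1


# Hypothetical descriptor values, for illustration only
print(passes_rule_of_five(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4))
```

The descriptors themselves would come from a cheminformatics toolkit; this sketch only shows the thresholds.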

Enjoy your Treasure Hunt!! Your search space encompasses the vast possibilities offered by the combinatorics of multiple protein conformations and enormous chemical diversity, particularly if you have embarked on a fragment library dock. The limits imposed by pharmaceutical requirements, though, are stringent, and in many ways unknown. Good luck!!

We start by importing a number of Python modules to run our calculations:

In [1]:
import sys
import os
import subprocess
#Importing glob is important for recursive filename searches - we use this to find all of our protein and ligand files!
from glob import glob
from pathlib import Path
from termcolor import colored
import multiprocessing as mp

import oddt

# Setting your ligand library and the protein receptors for your docking
Your working directory is the directory in which you opened the Jupyter notebook. Your proteins should be located in a folder called "receptor" within this directory. This notebook will dock every single ligand in your library against all of the protein structures (e.g. pdb files) that are in the "receptor" directory.

Before you start your treasure hunt, you must make sure that your proteins are **aligned** to each other, that they are **protonated** correctly and that you are happy with the position of the **grid box** that defines your receptor site for each protein structure. Check this by visualising in Chimera!!
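
Visual inspection in Chimera is the real check, but a quick sanity test of the grid box settings can catch a missing entry before a long run. The docking cell further down reads its settings from "receptor/conf.txt"; the sketch below assumes that file uses the usual Vina `key = value` layout with `center_x/y/z` and `size_x/y/z` entries. The helper names and the example numbers are hypothetical placeholders.

```python
def parse_vina_config(text):
    """Turn 'key = value' lines of a Vina-style config file into a dictionary."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # ignore anything after a '#'
        if "=" not in line:
            continue                            # skip blank or malformed lines
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_box_keys(settings):
    """Return the grid box keys that are absent from the parsed settings."""
    required = ["center_x", "center_y", "center_z", "size_x", "size_y", "size_z"]
    return [key for key in required if key not in settings]


# Hypothetical config contents; the numbers are placeholders, not a real binding site
example = """
center_x = 10.0
center_y = -4.5
center_z = 22.0
size_x = 20
size_y = 20
size_z = 20
num_modes = 1
"""
settings = parse_vina_config(example)
print(missing_box_keys(settings))
```

In the notebook you could pass in `Path("receptor/conf.txt").read_text()` instead of the example string. An empty list means all six box entries are present; it says nothing about whether the box is in the right place.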

### Protein files

In [2]:
#This defines our receptor structures (all of these files happen to have the same prefix, the starting structure (crystal structure) is xxx_00.pdb)
receptor_filenames = glob('receptor/*.pdb')
#The sort command is *really* important for organising the output in the right order (the protein files are mixed up otherwise)
receptor_filenames.sort()
print(colored('Receptor filenames', 'blue', attrs=['bold']))
print(receptor_filenames)
print("Number of proteins is", len(receptor_filenames))

Receptor filenames
['receptor/5rlh_b.pdb', 'receptor/6zsl_prep.pdb', 'receptor/7rdx_prep.pdb', 'receptor/7rdy_prep.pdb']
Number of proteins is 4
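
Later cells turn each of these receptor paths into a short name and a matching .pdbqt path by splitting the string on "." and "/". That works for the paths shown here, but it would break if a directory name contained a dot. As a sketch of a more robust alternative (the helper name `receptor_paths` is hypothetical), `pathlib` can do the same job:

```python
from pathlib import Path


def receptor_paths(pdb_filename):
    """Return (name, pdbqt_path) for a receptor .pdb file."""
    pdb_path = Path(pdb_filename)
    name = pdb_path.stem                          # file name without directory or suffix
    pdbqt_path = pdb_path.with_suffix(".pdbqt")   # same location, new extension
    return name, pdbqt_path.as_posix()


print(receptor_paths("receptor/5rlh_b.pdb"))
```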


This defines our ligand list as a global variable within the notebook...

### Ligand files

In [3]:
ligand_filenames = glob('/home/geoff/compounds/enamine_fragments/Minifrag/*.pdbqt')
ligand_filenames.sort()
print("Number of ligands is", len(ligand_filenames))

Number of ligands is 78


This tells us how many processors we have available...

### Number of processors

In [4]:
num_cpu = mp.cpu_count()
#If necessary we can set the number of processors to be less than mp.cpu_count():
#uncomment the line below (by removing the # from the line) and use a number you choose.
#It overrides the value set on the top line.
#num_cpu = 4
print("Number of processors:", num_cpu)

Number of processors: 12


Here we create a folder to store the results in. You can change the name of this, although it is best not to! (Do not use spaces, or punctuation other than underscores or hyphens, in your folder name!)

### Results folder location

In [5]:
Results_Folder = 'MTH_Results'
rf = Path(Results_Folder)
try:
    rf.mkdir()
except FileExistsError as exc:
    print(colored('The results folder is already present - check that it is empty before starting!', 'blue',  attrs=['bold']))
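
If you do rename the results folder, a small check like the one below can confirm that the name sticks to letters, digits, underscores and hyphens. This is only a sketch; `is_safe_folder_name` is a hypothetical helper, not part of the notebook.

```python
import re


def is_safe_folder_name(name):
    """True if the name uses only letters, digits, underscores and hyphens."""
    return re.fullmatch(r"[A-Za-z0-9_-]+", name) is not None


print(is_safe_folder_name("MTH_Results"))
print(is_safe_folder_name("MTH Results"))
```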

## Making the protein files that we need for the docking calculations
### Preparing the protein .pdbqt files
The structures that we start with and align are .pdb files - these contain information on the atoms that are in the protein, the residue names and the x, y and z coordinates of each atom. For the docking calculations a modified version of the .pdb file is necessary that contains additional atom charge (q) and atom type (t) information - this is called a .pdbqt file. **Note**: the .pdbqt files do not contain non-polar hydrogens (this is a requirement for Autodock Vina-based docking calculations).
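
To see where the extra "q" and "t" information lives, the sketch below pulls the partial charge and atom type out of a single atom record. The example line is made up for illustration. Real .pdbqt files use fixed columns, so splitting on whitespace as done here is a simplification that assumes the charge and atom type are the last two fields on the line.

```python
def charge_and_type(pdbqt_line):
    """Return (partial_charge, atom_type) from one ATOM/HETATM record."""
    fields = pdbqt_line.split()
    # Assumption: the last two whitespace-separated fields are charge and type
    return float(fields[-2]), fields[-1]


# Made-up atom record, for illustration only
example = "ATOM      5  O   GLY A   2      12.345  -3.210   8.900  1.00  0.00    -0.271 OA"
print(charge_and_type(example))
```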

The next piece of code generates a .pdbqt file for each of your protein .pdb files. As long as your proteins don't contain very weird atoms or residues, this should work!! *This piece of code is complicated because we are subverting a process that Chimera usually hides from you! Chimera is running some of the code from MGLTools to execute these tasks, this is python2-based code, so for sustainability we have avoided the need to execute python2 code directly.*

In [6]:
#Let's use oddt to make the pdbqt files that we need...
#Read each aligned .pdb file with oddt and write out the corresponding .pdbqt file
print(colored('Making pdbqt files', 'blue', attrs=['bold']))
for receptor_filename in receptor_filenames:
    receptor_prefix = receptor_filename.split(".")[0]
    receptor_name = receptor_prefix.split("/")[1]
    receptor_pdbqt_file = receptor_prefix + '.pdbqt'
    print(receptor_name)
    rec = next(oddt.toolkit.readfile('pdb', receptor_filename))
    #oddt.docking.AutodockVina.write_vina_pdbqt(rec, receptor_pdbqt_file, flexible=False)
    rec.write('pdbqt', receptor_pdbqt_file, overwrite=True, opt={'r': None, 'c': None, 'h': None})
print(colored('Finished making pdbqt files', 'blue', attrs=['bold']))

Making pdbqt files
5rlh_b
6zsl_prep
7rdx_prep
7rdy_prep
Finished making pdbqt files


And here we get an estimate of how long the first step of the Molecular Treasure Hunt will take to run. As a very rough guess, you should get one dock every two minutes per CPU, so if your number of docking steps per CPU is 10, this should take around 20 minutes on a current (2020) workstation. Larger molecules are slower to dock (e.g. the FDA library compared to the fragment library) because they have more rotatable bonds.

In [7]:
docking_steps = len(ligand_filenames) * len(receptor_filenames)
print("Number of individual docking_steps is", docking_steps)
docking_steps_per_cpu = round((docking_steps / num_cpu), 2)
print("Number of docking steps per cpu is ", docking_steps_per_cpu)

Number of individual docking_steps is 312
Number of docking steps per cpu is  26.0
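
The rough timing rule can be written down as a short calculation. This is a sketch only: `estimated_minutes` is a hypothetical helper, and the default of two minutes per dock is the very rough guess quoted above, not a measurement. Small fragments may well dock faster.

```python
import math


def estimated_minutes(n_ligands, n_receptors, n_cpu, minutes_per_dock=2.0):
    """Rough wall-clock estimate, assuming every dock takes the same time."""
    steps = n_ligands * n_receptors
    rounds = math.ceil(steps / n_cpu)   # docks that each CPU runs one after another
    return rounds * minutes_per_dock


# 78 ligands, 4 receptors and 12 processors, as in the run shown in this notebook
print(estimated_minutes(78, 4, 12))
```

With the numbers from this run (312 docking steps on 12 processors, so 26 steps per CPU) the rule of thumb gives about 52 minutes.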


This defines a procedure for docking a series of ligands to a protein. It's horribly complicated - so only change this if you are feeling python-tastic!!

In [8]:
def dock_ligands(filename):
    #This line removes the filename suffix (.pdbqt)
    dock_input = filename.split(".")[0]
    #This line removes the filename prefix 
    dock_name = dock_input.split("/")[-1]
    #The next line runs a local docking experiment...
    #subprocess.run(["ls", "-l"])
    vina_receptor = PDBFile + '.pdbqt'
    vina_config = "receptor/conf.txt"
    vina_ligand = dock_input + '.pdbqt'
    vina_out = dock_name + '_out.pdbqt'
    vina_run = subprocess.run(["vina", "--receptor", vina_receptor, "--config", vina_config, "--ligand", vina_ligand, "--out", vina_out], stdout=subprocess.DEVNULL)
    
    #The Vina config file "conf.txt" contains instructions to only output the top docked pose. However, it is 
    #also necessary to run the "vina_split" command to produce a single ligand conformer, because of the flags that 
    #are present at the start and end of the pdbqt output file.
    
    vina_split = subprocess.run(["vina_split", "--input", vina_out], stdout=subprocess.DEVNULL)
    
    #Read the score of the top pose (the first "REMARK VINA RESULT:" line in the output file)
    with open(vina_out, 'r') as contents:
        for line in contents:
            if line.startswith("REMARK VINA RESULT:"):
                dock_energy = round(float(line.split()[3]),4)
                break
    print("Autodock Vina Score =", dock_energy, " KCal/mol")
    #Convert the top pose to mol2 format
    top_pose = next(oddt.toolkit.readfile('pdbqt', vina_out))
    top_pose_out = top_pose.write('mol2')
    
    #Write the pose with two header lines holding the ligand name and the Vina score
    with open(PDBFile_name + "_" + dock_name + ".mol2",'w') as contents:
        contents.write("########## Name: " + dock_name + '\n' + 
                       "########## Vina_BE: " + str(dock_energy) + '\n' + 
                       top_pose_out)

    !'/bin/rm' '{dock_name}'_out*
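
The score extraction inside `dock_ligands` can be tried out on its own without running Vina. The sketch below uses a hypothetical helper, `best_vina_score`, on made-up output text. It assumes, as the function above does, that the energy is the fourth whitespace-separated field of a "REMARK VINA RESULT:" line, and that the first such line belongs to the top-ranked pose.

```python
def best_vina_score(pdbqt_text):
    """Return the energy from the first 'REMARK VINA RESULT:' line, or None."""
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            return float(line.split()[3])   # fields: REMARK, VINA, RESULT:, energy, ...
    return None                             # no docking result found


# Made-up output text, for illustration only
example = """MODEL 1
REMARK VINA RESULT:      -3.806      0.000      0.000
ENDMDL
MODEL 2
REMARK VINA RESULT:      -3.512      1.204      2.331
ENDMDL
"""
print(best_vina_score(example))
```

Returning `None` when no result line is present makes a failed dock easy to spot, rather than it surfacing later as an undefined variable.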

Here we are performing the dock. We have the option to dock against multiple protein conformers, because we know that dynamics is important!!
You should not need to change this section, unless you want to do something special. 

In [9]:
#This runs a series of docking calculations on protein conformations found in the receptor folder
for receptor_filename in receptor_filenames:
    PDBFile = receptor_filename.split(".")[0]
    PDBFile_name = PDBFile.split("/")[1]
    print(colored('Docking for protein started', 'blue', attrs=['bold']))
    print(PDBFile_name)
   
    #Here we run the script that executes the docking procedure for our ligand list...
    #for ligand_filename in ligand_filenames:
    #Use num_cpu worker processes (the value set in the "Number of processors" cell)
    pool = mp.Pool(num_cpu)
    pool.map(dock_ligands, ligand_filenames)
    pool.close()
    pool.join()
    print('Docking is complete for:',PDBFile_name)
    !'/bin/cat' '{PDBFile_name}'*.mol2 >> '{Results_Folder}'/VinaResults_'{PDBFile_name}'.mol2
    !'/bin/rm' '{PDBFile_name}'*.mol2

print(colored('All of the docking is complete!', 'blue', attrs=['bold']))


Docking for protein started
5rlh_b
Autodock Vina Score = -2.35  KCal/mol
Autodock Vina Score = -2.539  KCal/mol
Autodock Vina Score = -3.066  KCal/mol
Autodock Vina Score = -3.07  KCal/mol
Autodock Vina Score = -2.69  KCal/mol
Autodock Vina Score = -3.302  KCal/mol
Autodock Vina Score = -2.851  KCal/mol
Autodock Vina Score = -3.806  KCal/mol
Autodock Vina Score = -2.789  KCal/mol
Autodock Vina Score = -3.155  KCal/mol
Autodock Vina Score = -2.997  KCal/mol
Autodock Vina Score = -2.933  KCal/mol
Autodock Vina Score = -2.66  KCal/mol
Autodock Vina Score = -3.255  KCal/mol
Autodock Vina Score = -3.627  KCal/mol
Autodock Vina Score = -2.372  KCal/mol
Autodock Vina Score = -3.006  KCal/mol
Autodock Vina Score = -3.442  KCal/mol
Autodock Vina Score = -3.484  KCal/mol
Autodock Vina Score = -3.227  KCal/mol
Autodock Vina Score = -3.254  KCal/mol
Autodock Vina Score = -3.278  KCal/mol
Autodock Vina Score = -3.609  KCal/mol
Autodock Vina Score = -3.296  KCal/mol
[output truncated]
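
Because every worker prints its own score, text from different processes can arrive on the same line when several docks finish together. One way around this is to have each worker return its result and do all the printing in the main process. The sketch below shows that pattern with a stand-in function in place of the real docking call. It uses `ThreadPool` only to stay small and self-contained; its `map` method behaves like the `mp.Pool` one used in this notebook. The ligand names and scores are placeholders.

```python
from multiprocessing.pool import ThreadPool


def fake_dock(ligand_name):
    """Stand-in for a docking call: returns (ligand, score) instead of printing."""
    score = -0.5 * len(ligand_name)   # placeholder number, not a real energy
    return ligand_name, score


ligands = ["frag_a", "frag_bb", "frag_ccc"]   # hypothetical ligand names

with ThreadPool(2) as pool:
    results = pool.map(fake_dock, ligands)    # results come back in input order

# All printing happens here, in one place, so lines cannot interleave
for name, score in results:
    print(f"{name}: {score} kcal/mol")
```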

# Congratulations!!!
Your Treasure Hunt results are stored in the MTH_Results folder (unless you renamed it). You have one .mol2 file for each protein conformer, named with a VinaResults prefix. Each file contains the docked poses of all of the ligands in your library for that protein receptor configuration.
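
Each pose in a results file is preceded by the two header lines written during docking ("########## Name:" and "########## Vina_BE:"), so a ranked list of ligands can be pulled out with a few lines of Python. This is a minimal sketch: `read_scores` is a hypothetical helper and the example text is made up.

```python
def read_scores(mol2_text):
    """Collect (name, score) pairs from the header lines of a results file."""
    scores = []
    name = None
    for line in mol2_text.splitlines():
        if line.startswith("########## Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("########## Vina_BE:") and name is not None:
            scores.append((name, float(line.split(":", 1)[1])))
            name = None
    # Most negative (most favourable) score first
    return sorted(scores, key=lambda pair: pair[1])


# Made-up results text, for illustration only
example = """########## Name: frag_001
########## Vina_BE: -2.35
@<TRIPOS>MOLECULE
########## Name: frag_002
########## Vina_BE: -3.806
@<TRIPOS>MOLECULE
"""
print(read_scores(example))
```

For a real run you could pass in, for example, `Path("MTH_Results/VinaResults_5rlh_b.mol2").read_text()`.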

"*None shall remember the deeds that are done in the last defence of your homes. Yet the deeds will not be less valiant because they are unpraised.*"

Once you have these files, you are ready for the next stage. Have fun!

Sarah and Geoff

(an "out-of-office" $O^{3}P$ production)