# **The Molecular Treasure Hunt: Part 2: Docking a ligand library against each of your protein receptors**

## **This version uses the Python version of Autodock Vina**

*An Ångstrom-sized adventure by Sarah Harris (Leeds Physics) and Geoff Wells (UCL Pharmacy)*

# Running the Molecular Treasure Hunt!!

"*You cannot pass! I am a servant of the Secret Fire, wielder of the flame of Anor. The dark fire will not avail you, flame of Udûn. JRRT*"

This procedure for understanding the amazing phenomenon of molecular recognition between biomolecules has been designed with a strong agenda. We believe that including the effect of biomolecular dynamics is key to improving our ability to make predictions about how biomolecules will interact. We can see these multiple conformational states either using NMR experiments, or assessing the diversity of X-ray crystal structures, or using theoretical tools such as atomistic molecular dynamics. You will therefore (we hope!!) include a number of protein structures in your "receptor" directory.

How drugs bind to multiple protein conformations depends upon the underlying physics and chemistry of proteins. *In vivo*, of course, this will also depend on the environment of the protein - e.g. whether it is located in a large complex or in a membrane environment. For molecules to be orally bioavailable drugs, there are more stringent criteria (e.g. Lipinski's rule of 5).

Thermal fluctuations (physics) drive conformational changes, but within the constraints imposed by the chemical structure of the proteins and the ligands. The requirement for the drug to reach the target site imposes additional criteria which depend on its physicochemical properties (e.g. clogP). Other considerations, such as the specificity of the compound for its precise target, its toxicity (related to specificity), synthetic accessibility (e.g. can the chemists make it) and ease of formulation (e.g. can this be made into a safe and effective medicine, and is it feasible to make large quantities of it) are essential in practice, and can be non-trivial to overcome.

Enjoy your Treasure Hunt!! Your search space encompasses the vast possibilities offered by the combinatorics of multiple protein conformations and enormous chemical diversity, particularly if you have embarked on a fragment library dock. The limits imposed by pharmaceutical requirements, though, are stringent and in many ways unknown. Good luck!!

We start by importing a number of python modules to run our calculations:

In [1]:
#Importing glob is important for recursive filename searches - we use this to find all of our protein and ligand files!
from glob import glob
from pathlib import Path
from termcolor import colored
import multiprocessing as mp
from vina import Vina
import oddt

This tells us how many processors we have available...

In [2]:
num_cpu = mp.cpu_count()
print("Number of processors:", num_cpu)

Number of processors: 24


# Setting your ligand library and the protein receptors for your docking
Your working directory is the directory in which you opened the Jupyter notebook. Your proteins should be located in a folder called "receptor" within this directory. This notebook will dock every single ligand in your library against all of the protein structures (e.g. pdb files) that are in the "receptor" directory.

Before you start your treasure hunt, you must make sure that your proteins are **aligned** to each other, that they are **protonated** correctly and that you are happy with the position of the **grid box** that defines your receptor site for each protein structure. Check this by visualising in Chimera!!
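
One way to sanity-check the grid box before opening Chimera is to print the coordinates of its corners and compare them with where your binding site sits. The sketch below assumes the box is defined by Vina-style lines such as `center_x = 10.0` (the layout read from `receptor/conf.txt` later in this notebook). The helper names `parse_box` and `box_corners` and the numbers in `example_conf` are made up for illustration only.

```python
def parse_box(conf_text):
    """Collect center_* and size_* values from Vina-style 'key = value' lines."""
    box = {}
    for line in conf_text.splitlines():
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith(("center_", "size_")):
            box[key] = float(value)
    return box


def box_corners(box):
    """Return the (minimum, maximum) corner of the box along x, y and z."""
    lower = tuple(box["center_" + a] - box["size_" + a] / 2 for a in "xyz")
    upper = tuple(box["center_" + a] + box["size_" + a] / 2 for a in "xyz")
    return lower, upper


# Made-up numbers purely for illustration; use the contents of your own conf.txt
example_conf = """
center_x = 10.0
center_y = -4.0
center_z = 2.5
size_x = 20.0
size_y = 18.0
size_z = 22.0
"""
lower, upper = box_corners(parse_box(example_conf))
print("Box spans", lower, "to", upper)
```

If the residues lining your binding site fall outside this range in any of the aligned structures, the box probably needs moving or enlarging.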

In [3]:
#This defines our receptor structures (all of these files happen to have the same prefix; the starting (crystal) structure is xxx_00.pdb)
receptor_filenames = glob('receptor/*.pdb')
#The sort command is *really* important for organising the output in the right order (the protein files are mixed up otherwise)
receptor_filenames.sort()
print(colored('Receptor filenames', 'blue', attrs=['bold']))
for receptor_filename in receptor_filenames:
    receptor_filename = receptor_filename.split("/")[-1]
    print(receptor_filename)
print("Number of proteins is", len(receptor_filenames))

[1m[34mReceptor filenames[0m
2FLU-wt_H_01.pdb
2FLU-wt_H_02.pdb
2FLU-wt_H_03.pdb
2FLU-wt_H_04.pdb
2FLU-wt_H_05.pdb
2FLU-wt_H_06.pdb
2FLU-wt_H_07.pdb
2FLU-wt_H_08.pdb
2FLU-wt_H_09.pdb
2FLU-wt_H_10.pdb
Number of proteins is 10
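
The sort used above is alphabetical, so it only agrees with numerical order when the frame numbers are zero-padded (as in `_01` to `_10` here). The sketch below uses made-up filenames to show what happens without padding, plus one possible workaround. `frame_number` is a hypothetical helper and assumes the frame number is the last underscore-separated part of the file name.

```python
from pathlib import Path

padded = ["receptor/prot_10.pdb", "receptor/prot_02.pdb", "receptor/prot_01.pdb"]
unpadded = ["receptor/prot_10.pdb", "receptor/prot_2.pdb", "receptor/prot_1.pdb"]

# Alphabetical sorting gives the expected order only for zero-padded numbers
print(sorted(padded))
print(sorted(unpadded))


def frame_number(filename):
    """Hypothetical helper: read the integer after the last underscore in the file stem."""
    return int(Path(filename).stem.rsplit("_", 1)[-1])


# Sorting on the integer frame number restores numerical order
numeric_order = sorted(unpadded, key=frame_number)
print(numeric_order)
```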


## Making the protein files that we need for the docking calculations
### Preparing the protein .pdbqt files
The structures that we start with and align are .pdb files - these contain information on the atoms that are in the protein, the residue names and the x, y and z coordinates of each atom. For the docking calculations a modified version of the .pdb file is necessary that contains additional atom charge (q) and atom type (t) information - this is called a .pdbqt file. **Note**: the .pdbqt files do not contain non-polar hydrogens (this is a requirement for Autodock Vina-based docking calculations).
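
To see where the extra charge and type information lives, the sketch below pulls the last two whitespace-separated fields from an atom record. The example line is hand-written for illustration (it is not taken from the 2FLU files), and `charge_and_type` is a hypothetical helper. It assumes that the partial charge and the AutoDock atom type are the final two fields of each ATOM/HETATM line, which is how .pdbqt records are normally laid out.

```python
def charge_and_type(pdbqt_line):
    """Return (partial charge, atom type) from the end of a PDBQT atom record."""
    fields = pdbqt_line.split()
    return float(fields[-2]), fields[-1]


# Hand-written example record: the PDB-style columns come first, then q and t
example_line = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614"
    "  1.00  0.00    -0.066 N "
)
charge, atom_type = charge_and_type(example_line)
print("charge (q):", charge, "| atom type (t):", atom_type)
```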

The next piece of code generates a .pdbqt file for each of your protein .pdb files. As long as your proteins don't contain very weird atoms or residues, this should work!!

In [4]:
#Let's use oddt to make the pdbqt files that we need...
print(colored('Making pdbqt files', 'blue', attrs=['bold']))
for receptor_filename in receptor_filenames:
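    #Path handles the folder prefix and the .pdb suffix, even if the file name contains extra dots
    receptor_prefix = str(Path(receptor_filename).with_suffix(''))
    receptor_name = Path(receptor_filename).stem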
    receptor_pdbqt_file = receptor_prefix + '.pdbqt'
    print(receptor_name)
    rec = next(oddt.toolkit.readfile('pdb', receptor_filename))
    rec.write('pdbqt', receptor_pdbqt_file, overwrite=True, opt={'r': None, 'c': None, 'h': None})
print(colored('Finished making pdbqt files', 'blue', attrs=['bold']))

[1m[34mMaking pdbqt files[0m
2FLU-wt_H_01
2FLU-wt_H_02
2FLU-wt_H_03
2FLU-wt_H_04
2FLU-wt_H_05
2FLU-wt_H_06
2FLU-wt_H_07
2FLU-wt_H_08
2FLU-wt_H_09
2FLU-wt_H_10
[1m[34mFinished making pdbqt files[0m


This defines our ligand list as a global variable within the notebook...

In [6]:
ligand_filenames = glob('ligand_testset/*.pdbqt')
ligand_filenames.sort()
print("Number of ligands is", len(ligand_filenames))

Number of ligands is 41


Here we create a folder to store the results in - you can change the name of this (do not use spaces or punctuation other than underscores or hyphens in your folder name)! **Also** remember that you will need to make changes to other notebooks if you change the default name!

In [7]:
Results_Folder = 'MTH_Results'
rf = Path(Results_Folder)
try:
    rf.mkdir()
except FileExistsError as exc:
    print(colored('The results folder is already present - check that it is empty before starting!', 'blue',  attrs=['bold']))
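
If you do rename the folder, the naming rule above can be checked before any docking starts. This is a minimal sketch, not part of the original workflow: `is_safe_folder_name` is a hypothetical helper that accepts only letters, digits, underscores and hyphens.

```python
import re

# Letters, digits, underscores and hyphens only (no spaces or other punctuation)
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_safe_folder_name(name):
    """Hypothetical helper: True if the name follows the rule described above."""
    return SAFE_NAME.fullmatch(name) is not None


for candidate in ["MTH_Results", "MTH-Results-2", "MTH Results", "results.v2"]:
    print(candidate, "->", is_safe_folder_name(candidate))
```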

And here we get an estimate of how long the first step of the Molecular Treasure Hunt will take to run. As a very rough guess, each cpu should complete one dock every two minutes, so if your number of docking steps per cpu is 10, this should take around 20 minutes on a current (2020) workstation. Larger molecules are slower to dock (e.g. the FDA library compared to the fragment library) because they have more rotatable bonds.

In [8]:
docking_steps = len(ligand_filenames) * len(receptor_filenames)
print("Number of individual docking_steps is", docking_steps)
docking_steps_per_cpu = round((docking_steps / num_cpu), 2)
print("Number of docking steps per cpu is ", docking_steps_per_cpu)

Number of individual docking_steps is 410
Number of docking steps per cpu is  17.08
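
The rough guess of one dock every two minutes can be turned into a run-time estimate with a single multiplication. The sketch below does this for the numbers printed above. `estimate_minutes` is a hypothetical helper, and the two-minute figure is only the rule of thumb from the text, so treat the result as an order-of-magnitude guide.

```python
def estimate_minutes(n_ligands, n_receptors, n_cpu, minutes_per_dock=2.0):
    """Rough wall-clock estimate, assuming each cpu finishes one dock per minutes_per_dock."""
    docks_per_cpu = n_ligands * n_receptors / n_cpu
    return docks_per_cpu * minutes_per_dock


# Values from this notebook: 41 ligands, 10 receptors, 24 processors
estimate = estimate_minutes(41, 10, 24)
print("Estimated run time:", round(estimate, 1), "minutes")
```

For this test set that works out at a little over half an hour.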


This defines a procedure for docking a series of ligands to a protein. It's horribly complicated - so only change this if you are feeling python-tastic!!

In [9]:
def dock_ligands(filename):
    #This line removes the folder prefix and the filename suffix (.pdbqt), leaving just the ligand name
    dock_name = Path(filename).stem
    #The next line runs a local docking experiment...
    v.set_ligand_from_file(filename)
    v.dock(exhaustiveness=24, n_poses=1)
    top_pose = v.poses(n_poses=1, energy_range=3.0, coordinates_only=False)
    score = v.score()
    dock_energy = score[0]
    print('Score: %.3f (kcal/mol)' % score[0])
    #Read the pdbqt file string into an ODDT object
    top_pose1 = oddt.toolkit.readstring('pdbqt', top_pose)
    #Write out the output including the compound name and Vina binding energy
    top_pose_mol2 = top_pose1.write('mol2')
    #This variable will be the string that is written to file when the function has executed on all ligands
    top_pose_out = "########## Name: " + dock_name + '\n' + "########## Vina_BE: " + str(dock_energy) + '\n' + top_pose_mol2
    return top_pose_out

Here we are performing the dock. We have the option to dock against multiple protein conformers, because we know that dynamics is important!!
You should not need to change this section, unless you want to do something special. 

There is an important technical consideration when we are using parallel processing: we need to ensure that all multiprocessor jobs have finished before we output the data to a file, otherwise we risk multiple write processes happening at the same time. This would result in a mangled file, and a disappointing end to your quest! This is why we output the results outside of the docking function (that runs in parallel).
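
The "compute in parallel, write once" pattern described above can be shown with a toy example. This is a sketch under stated assumptions: `fake_dock` is a made-up stand-in for a real docking call, and a thread pool (`multiprocessing.pool.ThreadPool`) is used instead of the process pool in the notebook so that the example runs anywhere. Both pools share the same `map` interface.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool


def fake_dock(ligand_name):
    """Stand-in for a docking call: returns a text record instead of writing a file."""
    return "########## Name: " + ligand_name + "\n"


ligands = ["lig_a", "lig_b", "lig_c", "lig_d"]

# Workers only compute and return strings; none of them touches the output file
with ThreadPool(2) as pool:
    records = pool.map(fake_dock, ligands)

# map() has returned, so every job has finished; a single writer now saves the results
out_path = os.path.join(tempfile.mkdtemp(), "results.txt")
with open(out_path, "w") as handle:
    handle.writelines(records)

with open(out_path) as handle:
    written = handle.read()
print(written)
```

Because `map` returns its results in the same order as its inputs, the records in the file follow the order of the ligand list even if the jobs finish in a different order.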

In [10]:
#First we extract the relevant information from our configuration file (conf.txt)
#Each line is expected to look like "center_x = 1.23", so the value is the third whitespace-separated field
with open("receptor/conf.txt", 'r') as contents:
    for line in contents:
        if "center_x" in line:
            center_x = float(line.split()[2])
        if "center_y" in line:
            center_y = float(line.split()[2])
        if "center_z" in line:
            center_z = float(line.split()[2])
        if "size_x" in line:
            size_x = float(line.split()[2])
        if "size_y" in line:
            size_y = float(line.split()[2])
        if "size_z" in line:
            size_z = float(line.split()[2])
#The with-block closes the file for us, so no explicit close() is needed

print(colored('Docking for protein started', 'blue', attrs=['bold']))
#This runs a series of docking calculations on protein conformations found in the receptor folder
for receptor_filename in receptor_filenames:
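    #PDBFile keeps the folder prefix but drops the .pdb suffix; PDBFile_name is just the file stem
    PDBFile = str(Path(receptor_filename).with_suffix(''))
    PDBFile_name = Path(receptor_filename).stem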

    print('Receptor docking:', PDBFile_name)
    v = Vina(sf_name='vina', verbosity=0)
    v.set_receptor(PDBFile + ".pdbqt")
    v.compute_vina_maps(center=[center_x, center_y, center_z], box_size=[size_x, size_y, size_z])

    #Here we run the script that executes the docking procedure for our ligand list...
    #Using the pool as a context manager cleans up the worker processes once the map has finished
    with mp.Pool(mp.cpu_count()) as pool:
        #pool.map returns a list (top_poses) holding one output string per ligand, in ligand order
        top_poses = pool.map(dock_ligands, ligand_filenames)
    print('Docking is complete for:', PDBFile_name)
    print(colored('Docking for protein finished', 'blue'))
    #Now we write each string to the results file for this receptor
    #The file is opened once, in append mode, and closed automatically by the with-block
    with open(Results_Folder + "/VinaResults_" + PDBFile_name + ".mol2", 'a') as contents:
        for top_pose in top_poses:
            contents.write(str(top_pose))
    
print(colored('All of the docking is complete!', 'blue', attrs=['bold']))


[1m[34mDocking for protein started[0m
Receptor docking: 2FLU-wt_H_01
Score: -4.053 (kcal/mol)
Score: -5.034 (kcal/mol)
Score: -4.789 (kcal/mol)
Score: -3.910 (kcal/mol)
Score: -5.478 (kcal/mol)
Score: -6.265 (kcal/mol)
Score: -5.141 (kcal/mol)
Score: -5.486 (kcal/mol)
Score: -6.437 (kcal/mol)
Score: -3.845 (kcal/mol)
Score: -5.984 (kcal/mol)
Score: -5.983 (kcal/mol)
Score: -5.540 (kcal/mol)
Score: -4.835 (kcal/mol)
Score: -5.358 (kcal/mol)
Score: -4.032 (kcal/mol)
Score: -5.599 (kcal/mol)
Score: -5.366 (kcal/mol)
Score: -5.635 (kcal/mol)
Score: -6.132 (kcal/mol)
Score: -4.093 (kcal/mol)
Score: -5.541 (kcal/mol)
Score: -5.878 (kcal/mol)
Score: -5.385 (kcal/mol)
Score: -5.458 (kcal/mol)
Score: -5.911 (kcal/mol)
Score: -5.781 (kcal/mol)
Score: -6.553 (kcal/mol)
Score: -5.909 (kcal/mol)
Score: -3.187 (kcal/mol)
Score: -4.809 (kcal/mol)
Score: -5.095 (kcal/mol)
Score: -5.668 (kcal/mol)
Score: -5.654 (kcal/mol)
Score: -6.525 (kcal/mol)
Score: -5.376 (kcal/mol)
Score: -5.739 (kcal/mol)
Sco

# What to do next
You have a set of .mol2 files for each protein conformer from your docking calculation. These contain the poses for each docked ligand for the separate protein conformers. 
You should move the folder that contains your results back to your own computer (if necessary) so that you can analyse them using the next Jupyter notebooks (parts 3, 4, 5 and 6) and Chimera or PyMOL at your leisure!
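
For a quick look at the scores before moving on, the header lines that this notebook writes in front of each pose (`########## Name:` and `########## Vina_BE:`) can be read back and ranked. The sketch below is illustrative only: `read_scores` is a hypothetical helper and the two-ligand `example` text is invented, with the real mol2 records abbreviated to a single line.

```python
def read_scores(mol2_text):
    """Pair each '########## Name:' header with the '########## Vina_BE:' line that follows it."""
    scores = []
    name = None
    for line in mol2_text.splitlines():
        if line.startswith("########## Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("########## Vina_BE:") and name is not None:
            scores.append((name, float(line.split(":", 1)[1])))
            name = None
    return scores


# Invented two-ligand example; real files also contain the full mol2 record for each pose
example = (
    "########## Name: ligand_A\n"
    "########## Vina_BE: -5.034\n"
    "@<TRIPOS>MOLECULE\n"
    "########## Name: ligand_B\n"
    "########## Vina_BE: -6.265\n"
    "@<TRIPOS>MOLECULE\n"
)
# Most negative (most favourable) score first
ranked = sorted(read_scores(example), key=lambda pair: pair[1])
for name, energy in ranked:
    print(name, energy)
```

To use it on your own results, read one of the VinaResults .mol2 files into a string and pass that to `read_scores`.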

## Opening the results in Chimera:
- File > Open > Choose the correct protein/receptor .pdb file (you should have these files on your own computer from when you aligned the proteins in Part 1). This will have the same 'number' as the ligand .mol2 file, e.g. if you load in VinaResults_xxx_00.mol2 then the protein file would be xxx_00.pdb.

- Tools > Surface/Binding Analysis > ViewDock > Choose the VinaResults_xxx.mol2 file (the file is a Dock 4,5,6 file type)

- This opens a new window. You can select columns to view from the Columns > Show menu and pick the Vina_BE or other scoring/properties lists.

- You can scroll through the ligands in the list and examine how they interact with the protein.

- The presets menu allows you to view the protein as a surface coloured by hydrophobicity.

- The Chimera menus provide a number of other ways to adapt the representation.

# Congratulations!!!
Your Treasure Hunt results are stored in the MTH_Results folder (unless you renamed it). You have a set of .mol2 files for each protein conformer. These have a VinaResults prefix. They contain the docked poses of all of your ligands in your library for each separate protein receptor configuration. 

"*None shall remember the deeds that are done in the last defence of your homes. Yet the deeds will not be less valiant because they are unpraised. JRRT*"

When you have these files - you are now ready for the next stage - have fun!

Sarah and Geoff

(an $O^{3}S$ production)