# **The Molecular Treasure Hunt: Part 3: Calculating properties and rescoring**

*An Ångstrom-sized adventure by Sarah Harris (Leeds Physics) and Geoff Wells (UCL Pharmacy)*

# Running the Molecular Treasure Hunt!!

"*It is known that there are an infinite number of worlds, simply because there is an infinite amount of space for them to be in. However, not every one of them is inhabited.*"

In this part, we take our docking results from Part 2 and do a few further things:
- We calculate interactions between the ligands and the protein
- We rescore the protein-ligand interactions
- We produce some .sdf files that we can use with other chemoinformatics software (e.g. DataWarrior)

We discussed the (re)scoring functions in the prePart3 notebook, so we will comment mainly on the interaction calculations here. There are a number of features of protein-ligand interactions that we can quantify and use to identify our 'best' or most promising compounds. Here is a list of the features that we calculate:
- Close contacts
- Hydrophobic interactions
- Hydrogen bonds
- Salt bridges
- Halogen bonds
- Pi-stacking interactions (parallel and perpendicular)
- Pi-cation interactions

The importance of each of these will depend on the chemistry of your ligands and their interaction with the protein. For example, halogen bonds are only relevant to compounds that contain halogens (F, Cl, Br, I) and salt bridge interactions are only relevant to ligands with complementary charges within the binding site.
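
To give a feel for what the simplest of these features measures, here is a minimal sketch of a close-contact count: the number of protein-ligand atom pairs that lie within a distance cutoff of each other. It is not the ODDT implementation. The function name `count_close_contacts` and the coordinates are invented for illustration, and the 3 Å cutoff is simply the value passed to `close_contacts` later in this notebook.

```python
from math import dist

def count_close_contacts(protein_coords, ligand_coords, cutoff=3.0):
    """Count protein-ligand atom pairs separated by at most `cutoff` Angstroms."""
    return sum(
        1
        for p in protein_coords
        for l in ligand_coords
        if dist(p, l) <= cutoff
    )

# Hypothetical heavy-atom coordinates in Angstroms (not from any real structure)
protein_atoms = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
ligand_atoms = [(2.5, 0.0, 0.0), (0.0, 0.0, 2.9)]

n_contacts = count_close_contacts(protein_atoms, ligand_atoms)
print(n_contacts)
```

The other features add geometric or chemical conditions on top of a distance test (for example, an angle tolerance for hydrogen bonds), which is why they are best left to a dedicated library.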

Remember that we do not redock the ligands. Rather, we use three different empirical scoring functions in addition to the Vina Binding Energy to rank the ligands and then assess the docked conformation and determine the interactions made between the ligand and the protein.
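
One simple way to combine several scores into a single ranking is to average each ligand's rank over all of the scoring functions. The sketch below is only an illustration and is not part of the Treasure Hunt code: the ligand names and scores are invented, and `mean_rank` is a hypothetical helper. It assumes that a more negative Vina binding energy is better, and that the rescoring functions predict a pK-like affinity where higher is better; check this for your own trained functions.

```python
def mean_rank(scores_by_function, higher_is_better):
    """Average each ligand's rank (1 = best) over several scoring functions."""
    ligands = list(next(iter(scores_by_function.values())))
    totals = {name: 0 for name in ligands}
    for func, scores in scores_by_function.items():
        # Sort best-first; the direction depends on the scoring function
        ordered = sorted(ligands, key=lambda n: scores[n], reverse=higher_is_better[func])
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    n_functions = len(scores_by_function)
    return {name: totals[name] / n_functions for name in ligands}

# Invented scores for three hypothetical ligands
scores = {
    'Vina_BE':    {'lig_A': -9.1, 'lig_B': -7.4, 'lig_C': -8.2},
    'RFScore_v3': {'lig_A': 7.1,  'lig_B': 6.8,  'lig_C': 5.9},
    'NNScore':    {'lig_A': 7.0,  'lig_B': 6.2,  'lig_C': 6.5},
}
# Assumed directions: lower (more negative) is better for Vina, higher for the others
direction = {'Vina_BE': False, 'RFScore_v3': True, 'NNScore': True}

consensus = mean_rank(scores, direction)
ranking = sorted(consensus, key=consensus.get)
print(ranking)
```

A ligand that ranks well under every function is a more convincing candidate than one that is favoured by a single score.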

We start by importing a number of Python modules to run our calculations:

In [1]:
#Importing glob is important for recursive filename searches - we use this to find all of our protein and ligand files!
from glob import glob
from pathlib import Path
import multiprocessing as mp
from types import GeneratorType
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings("ignore")

from termcolor import colored
import numpy as np

import oddt
from oddt.interactions import (close_contacts,
                               hbonds,
                               distance,
                               halogenbonds,
                               halogenbond_acceptor_halogen,
                               pi_stacking,
                               salt_bridges,
                               pi_cation,
                               hydrophobic_contacts)

from oddt.scoring import scorer, ensemble_descriptor, ensemble_model
from oddt.scoring.descriptors import (autodock_vina_descriptor,
                                      fingerprints,
                                      oddt_vina_descriptor)
from oddt.scoring.models.classifiers import neuralnetwork
from oddt.scoring.models import regressors
from oddt.scoring.functions import rfscore, nnscore, PLECscore

In [6]:
#Location of pickle files for rescoring functions
pickle_dir = '/home/geoff/software/docking_rescoring_files/'

#This is where the results from the docking are found. You will need to edit this if you changed it in Part 2.
results_filenames = glob('MTH_Results/VinaResults_*.mol2')
print("Number of results files is", len(results_filenames))

#Make a new folder for the rescored results
Results_folder = Path('MTH_Results_Rescored')
try:
    Results_folder.mkdir()
except FileExistsError as exc:
    print(colored('The rescored results folder is already present - check that it is empty before starting!', 'green',  attrs=['bold']))
    
#Set the number of processors for rescoring
#Each results file is rescored serially on one processor, so use either the number of results files
#OR the number of processors (whichever is smaller)
max_cpu = mp.cpu_count()
print('Maximum number of CPUs is', max_cpu)

num_cpu = min(max_cpu, len(results_filenames))
print('Number of CPUs in use is', num_cpu)

Number of results files is 4
The rescored results folder is already present - check that it is empty before starting!
Maximum number of CPUs is 12
Number of CPUs in use is 4


We have several rescoring functions available to us (at least 7!), but for now we will only include RFScore version 3, NNScore and the random forest version of PLECScore (PLEC_RF). It is of course possible to substitute other scoring functions, but this will require you to change the Python code and to understand how Open Drug Discovery Tools (ODDT) implements the various scoring functions. Any additional scoring functions will also require training (i.e. adding them to the prePart3 notebook). We would strongly encourage you to explore these possibilities, or any of the other analysis options that you consider valuable during the Molecular Treasure Hunt!

Make sure that the functions have been 'trained' first and that there is a '.pickle' file in your folder for each scoring function - these contain information relating to the trained function...
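
A quick check along these lines can tell you which pickle files are missing before you launch a long rescoring run. This is a minimal sketch rather than part of the original notebook: `missing_pickles` is a hypothetical helper, the file names are the ones used in the rescoring cell, and the temporary folder is only there to make the example self-contained.

```python
import tempfile
from pathlib import Path

# File names as used in the rescoring cell of this notebook
EXPECTED_PICKLES = [
    'RFScore_v3_pdbbind2016.pickle',
    'NNScore_pdbbind2016.pickle',
    'PLECrf_p5_l1_pdbbind2016_s65536.pickle',
]

def missing_pickles(folder, names=EXPECTED_PICKLES):
    """Return the expected pickle file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]

# Demonstration on a throwaway folder that contains only one of the three files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'NNScore_pdbbind2016.pickle').touch()
    still_missing = missing_pickles(tmp)

print(still_missing)
```

In the notebook itself you would call `missing_pickles(pickle_dir)` and retrain any function that is reported as missing.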

In [7]:
#This function calculates additional interactions between each protein and the docked ligands and rescores them using three alternative scoring methods
def rescoring_function(filename):
    #Split up the file name into variables that identify the receptor and ligand
    PDBFile_name = Path(filename).stem
    PDBFile = PDBFile_name.split("VinaResults_")[1]
    protein = 'receptor/' + PDBFile + '.pdb'
    print(colored('Protein:' + PDBFile_name + '\n', 'green', attrs=['bold']))
    rec = next(oddt.toolkit.readfile('pdb', protein))
    rec.protein = True
    mols = list(oddt.toolkit.readfile('mol2', filename))
    #The next line (commented out) adds polar hydrogens if required. This may be necessary for some protein pdb files too
    #list(map(lambda x: x.addh(only_polar=True), mols))
    rescored_out = 'MTH_Results_Rescored/' + PDBFile_name + '_rescored.mol2'

    #Load each rescoring function once per receptor, rather than once per ligand
    rf3 = rfscore.load(filename=pickle_dir + 'RFScore_v3_pdbbind2016.pickle', version=3, pdbbind_version=2016)
    rf3.set_protein(rec)
    nn = nnscore.load(filename=pickle_dir + 'NNScore_pdbbind2016.pickle', pdbbind_version=2016)
    nn.set_protein(rec)
    plec3 = PLECscore(version='rf').load(filename=pickle_dir + 'PLECrf_p5_l1_pdbbind2016_s65536.pickle', version='rf', pdbbind_version=2016, depth_protein=5, depth_ligand=1, size=2048)
    plec3.set_protein(rec)

    #Heavy (non-hydrogen) atoms of the receptor, used for the close contacts
    rec_heavy = rec.atom_dict[rec.atom_dict['atomicnum'] != 1]

    #Open the output file once; 'w' overwrites any previous results for this receptor
    with open(rescored_out, 'w') as contents:
        for mol in mols:
            #Calculate molecule properties using oddt(Pybel)
            #Count heavy (non-hydrogen) atoms only, so that any hydrogens present do not change the ligand efficiency
            lig_heavy = mol.atom_dict[mol.atom_dict['atomicnum'] != 1]
            nonHatoms = len(lig_heavy)
            properties = mol.calcdesc(["MW", "logP"])
            MW = round(float(properties["MW"]), 3)
            clogP = round(float(properties["logP"]), 3)
            #Then use these properties to calculate some ligand efficiencies
            vina_be = float(mol.data['Vina_BE'])
            LE = round(vina_be/nonHatoms, 3)
            LipE = round(vina_be - clogP, 3)
            #Now use oddt to calculate the interactions
            cc_count = len(close_contacts(rec_heavy, lig_heavy, cutoff=3)[0])
            hbonds_count = int(hbonds(rec, mol, cutoff=3.5, tolerance=30)[2].sum())
            pi1, pi2, strict_parallel, strict_perpendicular = pi_stacking(rec, mol, tolerance=30)
            pipi_par = str(strict_parallel.sum())
            pipi_perp = str(strict_perpendicular.sum())
            #As far as we can tell, pi_cation pairs rings from its first argument with cations from its second,
            #so this counts protein rings near ligand cations; swapping the arguments should give the reverse pairing (worth checking!)
            pi_cation_count = len(pi_cation(rec, mol)[2])
            halogenbonds_count = len(halogenbonds(rec, mol)[2])
            salt_bridges_count = len(salt_bridges(rec, mol)[0])
            hyd_contacts_count = len(hydrophobic_contacts(rec, mol)[0])

            #Now we add in the rescoring functions (each one stores its score in mol.data)
            rf3.predict_ligand(mol)
            nn.predict_ligand(mol)
            plec3.predict_ligand(mol)

            #Convert the molecule to an output string
            write_out = mol.write('mol2')

            #Then write the header lines followed by the molecule to the output file
            header = [("Name", mol.data['Name']),
                      ("Vina_BE", mol.data['Vina_BE']),
                      ("MW", MW),
                      ("clogP", clogP),
                      ("nonHatoms", nonHatoms),
                      ("Vina_LE", LE),
                      ("Vina_LipE", LipE),
                      ("Close_contacts", cc_count),
                      ("Hydrophob_contacts", hyd_contacts_count),
                      ("HBonds", hbonds_count),
                      ("Salt_Bridges", salt_bridges_count),
                      ("Pi-Pi_Parallel", pipi_par),
                      ("Pi-Pi_Perpendicular", pipi_perp),
                      ("Pi-Cation", pi_cation_count),
                      ("Halogen_Bonds", halogenbonds_count),
                      ("RFScore_v3", round(float(mol.data['rfscore_v3']), 3)),
                      ("NNScore", round(float(mol.data['nnscore']), 3)),
                      ("PLECScore_rf", round(float(mol.data['PLECrf_p5_l1_s65536']), 3))]
            for key, value in header:
                contents.write("########## " + key + ": " + str(value) + '\n')
            contents.write(write_out)
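
The two efficiency metrics calculated in the function above are simple arithmetic, as this small worked example shows. It follows the definitions used in this notebook (Vina binding energy divided by the number of heavy atoms, and Vina binding energy minus clogP); the helper name `ligand_efficiencies` and the numbers are invented for illustration.

```python
def ligand_efficiencies(vina_be, heavy_atoms, clogp):
    """Return (LE, LipE) as defined in this notebook, rounded to 3 decimal places."""
    le = round(vina_be / heavy_atoms, 3)   # binding energy per heavy atom
    lipe = round(vina_be - clogp, 3)       # binding energy penalised by lipophilicity
    return le, lipe

# A hypothetical ligand: Vina_BE of -8.4 kcal/mol, 24 heavy atoms, clogP of 2.5
le, lipe = ligand_efficiencies(vina_be=-8.4, heavy_atoms=24, clogp=2.5)
print(le, lipe)
```

Because the Vina binding energy is negative for favourable binding, more negative values of both metrics indicate a more efficient ligand under these definitions.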

The next cell executes the rescoring function using parallel processing on your machine. The parallelisation is achieved by rescoring each protein conformer on a separate processor until they are all complete. This is possible because the rescoring of each protein conformer is independent of every other; such processes are called *embarrassingly parallel*. However, you may see that the protein conformers appear in an unexpected order as each parallel calculation is launched - **DON'T PANIC**!
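
If the unexpected order worries you, the toy example below shows the same effect on a small scale. It is only a sketch: it uses threads from `concurrent.futures` as a lightweight stand-in for the `multiprocessing` pool used in this notebook, `fake_rescore` is a hypothetical placeholder for the real rescoring, and the durations are invented. The tasks can finish in any order, but the list returned by `map` always follows the order of the inputs.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

finish_order = []          # filled in as each task completes
finish_lock = Lock()

def fake_rescore(job):
    """Pretend to rescore one results file; `seconds` mimics an uneven workload."""
    name, seconds = job
    time.sleep(seconds)
    with finish_lock:
        finish_order.append(name)
    return name + '_rescored'

# Receptor names from this notebook; the durations are invented for illustration
jobs = [('7rdx_prep', 0.3), ('7rdy_prep', 0.1), ('6zsl_prep', 0.2), ('5rlh_b', 0.05)]

with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    results = list(pool.map(fake_rescore, jobs))

print('Finished in this order:', finish_order)   # depends on timing
print('Results returned as:   ', results)        # follows the input order
```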

In [8]:

#Here we run the rescoring function on each of our docking results files...
#We limit the number of processors because the memory demand of each process is quite high...
#The memory requirement depends on how many ligands you have docked, so it may be best to use a fraction of your available CPU initially.
with mp.Pool(num_cpu) as pool:
    pool.map(rescoring_function, results_filenames)

Protein:VinaResults_7rdx_prep
Protein:VinaResults_7rdy_prep
Protein:VinaResults_6zsl_prep
Protein:VinaResults_5rlh_b





In [9]:
#Write sdf files for use in DataWarrior etc.
mol2_files = glob('MTH_Results_Rescored/VinaResults_*.mol2')
#print(len(mol2_files))
for mol2_file in mol2_files:
    sdf_out = Path(mol2_file).with_suffix('.sdf')
    #Open each sdf file once and write every docked pose to it
    with open(sdf_out, 'w') as contents:
        for mol in oddt.toolkit.readfile('mol2', mol2_file):
            contents.write(mol.write('sdf'))
 

# What to do next
Your new Treasure Hunt results are stored in the MTH_Results_Rescored folder (unless you renamed it). You have a set of .mol2 files and .sdf files for each protein conformer. These have a VinaResults prefix. They contain the docked poses of every ligand in your library for each protein receptor configuration, together with the scores and interactions.
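
If you would like to inspect the scores without opening a molecular viewer, the `##########` header lines that the rescoring function writes above each pose in the .mol2 files are easy to read back. The parser below is a minimal sketch, not part of the original workflow: `parse_rescored_headers` is a hypothetical helper, the example text is invented and heavily abbreviated, and it assumes that every pose begins with its `Name` line, as it does in the files written above.

```python
def parse_rescored_headers(text):
    """Collect the '########## key: value' lines, one dict per docked pose."""
    records = []
    for line in text.splitlines():
        if not line.startswith('##########'):
            continue
        key, _, value = line.lstrip('#').strip().partition(':')
        key, value = key.strip(), value.strip()
        if key == 'Name':          # each pose block starts with its Name line
            records.append({})
        if records:
            records[-1][key] = value
    return records

# Invented, abbreviated stand-in for the contents of a rescored .mol2 file
example = """\
########## Name: ligand_001
########## Vina_BE: -8.4
########## HBonds: 2
@<TRIPOS>MOLECULE
ligand_001
########## Name: ligand_002
########## Vina_BE: -7.1
########## HBonds: 0
@<TRIPOS>MOLECULE
ligand_002
"""

poses = parse_rescored_headers(example)
best = min(poses, key=lambda pose: float(pose['Vina_BE']))
print(best['Name'])
```

All values are kept as strings, so convert them with `float` or `int` before sorting or plotting.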

You should move the folder that contains your results back to your own computer (if necessary) so that you can analyse them using this Jupyter notebook and Chimera (for the .mol2 files), DataWarrior (.sdf files) or PyMOL (.mol2 or .sdf files) at your leisure!

## Opening the results in Chimera:
- File > Open > Choose the correct protein/receptor .pdb file (you should have these files on your own computer from when you aligned the proteins in Part 1). This file will have the same 'number' as the ligand .mol2 file, e.g. if you load VinaResultsxxx00.mol2 then the protein file would be proteinxxx00.pdb.

- Tools > Surface/Binding Analysis > ViewDock > Choose the VinaResultsxxx.mol2 file (the file is a Dock 4,5,6 file type)

- This opens a new window. You can select columns to view from the Columns > Show menu and pick the Vina_BE or other scoring/properties lists.

- You can scroll through the ligands in the list and examine how they interact with the protein.

- The Presets menu allows you to view the protein as a surface coloured by hydrophobicity.

- The Chimera menus provide a number of other ways to adapt the representation.


## Opening the results in DataWarrior or Pymol:
- The .sdf files can be viewed in PyMOL and in DataWarrior, where you can explore the chemoinformatics of your results (if that is your thing!!).

# Congratulations!!!

"*since every piece of matter in the Universe is affected by every other piece of matter in the Universe, it is in theory possible to extrapolate the whole of creation*"

You are now ready for the next stage - have fun!

Sarah and Geoff

(an "out-of-office" $O^{3}P$ production)