# **The Molecular Treasure Hunt: Part pre3: 'The threequel' - training rescoring functions**

*An Ångstrom-sized adventure by Sarah Harris (Leeds Physics) and Geoff Wells (UCL Pharmacy)*

## Training your rescoring functions!!
"*The chances of finding out what's really going on in the universe are so remote, the only thing to do is hang the sense of it and keep yourself occupied. DA*"

Unfortunately, there is no established relationship between the atomic structure of a complex and its binding affinity (yet). Therefore, a range of empirical scoring functions have been derived to estimate binding affinities (or rather free energies). Since we have no clear reason to favour one scoring function over the others, we have chosen to use several in our analysis. Trends that show up across several scoring functions are likely to be more robust, and so to warrant further investigation.
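
One simple way to look for trends shared by several scoring functions is to rank the ligands under each function and average the ranks. The sketch below illustrates that idea only; it is not necessarily what the next notebook does, and the ligand names and predicted affinities are made-up values for illustration.

```python
import numpy as np

# Hypothetical predicted affinities (higher = tighter binding) for four
# made-up ligands from three scoring functions. These are NOT real results.
ligands = ["lig_A", "lig_B", "lig_C", "lig_D"]
scores = {
    "PLECscore": np.array([6.1, 7.3, 5.2, 6.8]),
    "NNScore": np.array([5.9, 7.0, 5.5, 7.1]),
    "RFScore": np.array([6.0, 6.9, 5.1, 6.5]),
}


def rank_descending(values):
    """Rank values so that the highest score gets rank 1 (assumes no ties)."""
    order = np.argsort(-values)          # indices from best to worst
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


# Average the per-function ranks to get a simple consensus rank per ligand.
all_ranks = np.vstack([rank_descending(s) for s in scores.values()])
mean_rank = all_ranks.mean(axis=0)

for name, rank in sorted(zip(ligands, mean_rank), key=lambda pair: pair[1]):
    print(f"{name}: mean rank {rank:.2f}")
```

A ligand that ranks well under every function ends up with a low mean rank, whereas one favoured by a single function does not.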

In this notebook we use the Open Drug Discovery Toolkit (ODDT) to train the scoring functions needed for the next notebook. Training uses a database of known protein-ligand complexes and their affinities (the PDBbind 2016 database in this case). You only need to do this once: if you already have a copy of the trained scoring functions on your computer you can skip this notebook, which is a good thing because training takes some time! If you are feeling adventurous/python-tastic and have experimental data for your own system(s), it is possible to use these to train the scoring functions too. Please note that this is not a trivial exercise, and some understanding of the underlying functions and Python structure is essential!

The NNScore is described in:
JD Durrant, JA McCammon. NNScore 2.0: a neural-network receptor-ligand scoring function. J Chem Inf Model. 2011;51: 2897-2903. doi:10.1021/ci2003889
JD Durrant, JA McCammon. BINANA: a novel algorithm for ligand-binding characterization. J Mol Graph Model. 2011;29: 888-893. doi:10.1016/j.jmgm.2011.01.004

The PLECScoring methods are described in:
M Wójcikowski, M Kukiełka, MM Stepniewska-Dziubinska, P Siedlecki. Development of a protein–ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions. Bioinformatics. 2019;35: 1334–1341. doi:10.1093/bioinformatics/bty757

The RFScoring methods are described in:
PJ Ballester, JBO Mitchell. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics. 2010;26: 1169-1175. doi:10.1093/bioinformatics/btq112

We start by importing a number of python modules to run our calculations:

In [1]:
import sys
import os
#Importing glob is important for recursive filename searches - we use this to find all of our protein and ligand files!
from glob import glob
from pathlib import Path
from types import GeneratorType
from sklearn.metrics import r2_score

from termcolor import colored
import numpy as np

import oddt
from oddt.interactions import (close_contacts,
                               hbonds,
                               distance,
                               halogenbonds,
                               halogenbond_acceptor_halogen,
                               pi_stacking,
                               salt_bridges,
                               pi_cation,
                               hydrophobic_contacts)

from oddt.scoring import scorer, ensemble_descriptor, ensemble_model
from oddt.scoring.descriptors import (autodock_vina_descriptor,
                                      fingerprints,
                                      oddt_vina_descriptor)
from oddt.scoring.models.classifiers import neuralnetwork
from oddt.scoring.models import regressors
from oddt.scoring.functions import rfscore, nnscore, PLECscore

In [2]:
#We can train the rescoring models first (only need to do this if it has not been done before)...
#The file size for the PLECscore_nn model is very large
#n_jobs=-1 uses all available CPU cores
#Executing this on a small number of CPUs will take a long time!!!
models = ([PLECscore(n_jobs=-1, version=v, size=65536)
           for v in ['rf']] +
          [nnscore(n_jobs=-1)] +
          [rfscore(version=v, n_jobs=-1) for v in [3]])
for model in models:
    model.train()

Training PLECscore rf with depths P5 L1 on PDBBind v2016
Test set:	R2_score: 0.5649	Rp: 0.7922	RMSE: 1.4335	SD: 1.3284
Train set:	R2_score: 0.9327	Rp: 0.9797	RMSE: 0.4882	SD: 0.3770
OOB set:	R2_score: 0.5090	Rp: 0.7193	RMSE: 1.3182	SD: 1.3069
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done  92 out of 1000 | elapsed:   48.7s remaining:  8.0min
[Parallel(n_jobs=-1)]: Done 193 out of 1000 | elapsed:  1.6min remaining:  6.5min
[Parallel(n_jobs=-1)]: Done 294 out of 1000 | elapsed:  2.3min remaining:  5.6min
[Parallel(n_jobs=-1)]: Done 395 out of 1000 | elapsed:  3.1min remaining:  4.7min
[Parallel(n_jobs=-1)]: Done 496 out of 1000 | elapsed:  3.8min remaining:  3.9min
[Parallel(n_jobs=-1)]: Done 597 out of 1000 | elapsed:  4.6min remaining:  3.1min
[Parallel(n_jobs=-1)]: Done 698 out of 1000 | elapsed:  5.3min remaining:  2.3min
[Parallel(n_jobs=-1)]: Done 799 out of 1000 | elapsed:  6.1min remaining:  1.5min
[Parallel(n_jobs=-1)]: D
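
The training output above reports R2_score, Rp and RMSE for the test, train and OOB sets. The sketch below shows how these three numbers can be computed for any set of predictions, assuming that R2_score is the coefficient of determination, Rp the Pearson correlation coefficient and RMSE the root-mean-square error. The SD column is not reproduced here, and the affinities are made-up values, not PDBbind data.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R2, Pearson correlation and RMSE for predicted vs measured values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "R2_score": 1.0 - ss_res / ss_tot,
        "Rp": np.corrcoef(y_true, y_pred)[0, 1],
        "RMSE": np.sqrt(np.mean(residuals ** 2)),
    }


# Hypothetical measured and predicted affinities, for illustration only.
measured = [4.2, 6.5, 7.1, 5.3, 8.0]
predicted = [4.8, 6.1, 6.6, 5.9, 7.2]

for name, value in regression_metrics(measured, predicted).items():
    print(f"{name}: {value:.4f}")
```

A model that fits the train set far better than the test set, as in the output above, is describing its training data more closely than it predicts unseen complexes, so the test and OOB figures are the ones to trust.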

After this you should have a series of .pickle files containing the trained scoring functions that will be used in the following steps. At the moment notebook 3 uses the PLECscore rf model, the NNScore and the RFScore version 3. Again, if you already have these files then you don't need to run this notebook!
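
If you are not sure whether the trained models are already on your computer, a quick check such as the sketch below can save a long training run. The filename patterns are assumptions made for illustration: look at the .pickle files that ODDT actually writes on your machine and adjust them to match.

```python
from pathlib import Path

# Hypothetical filename patterns, one per scoring function. Compare them with
# the .pickle files ODDT really produces and edit as needed.
EXPECTED_PATTERNS = ["PLEC*.pickle", "NNScore*.pickle", "RFScore*.pickle"]


def missing_models(directory=".", patterns=EXPECTED_PATTERNS):
    """Return the patterns that do not match any file in `directory`."""
    folder = Path(directory)
    return [pattern for pattern in patterns if not any(folder.glob(pattern))]


missing = missing_models()
if missing:
    print("No files found for:", ", ".join(missing))
    print("Run the training cell above to create them.")
else:
    print("All trained scoring functions appear to be present.")
```

The check looks only in the chosen directory (the current one by default), so run it from wherever the notebooks keep their .pickle files.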

"*Forty-two, said Deep Thought, with infinite majesty and calm. DA*"

Sarah and Geoff

(an $O^{3}S$ production)