# **The Molecular Treasure Hunt: Part 5: Selecting your "hits_list"**

*An Ångstrom-sized adventure by Sarah Harris (Leeds Physics) and Geoff Wells (UCL Pharmacy)*

"*Will you look into the mirror?
What will I see?
Even the wisest cannot tell. For the mirror shows many things. Things that were, things that are, and some things... that have not yet come to pass.*"

# Selecting and refining our list of 'hit' compounds
Now it is time to make decisions about which ligands you think are the best, based on your physical, chemical and pharmacological understanding of the results. We have made a few suggestions, but now it's up to you to make your own choices.

We recommend that you identify criteria that exclude all but the "best" ~1% of unique ligands (which here would be ~20 from a 2000 ligand data set, assuming a single protein receptor conformer). This quantity is very subjective.
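
As a minimal sketch of what a "best ~1%" cut-off means in practice, the cell below keeps a chosen fraction of ligands ranked by one score. The helper name `top_fraction` is our own invention for illustration. The five rows are copied from the averaged table shown later in this notebook, and the fraction is set to 50% only because the example is so small.

```python
import math
import pandas as pd

def top_fraction(df, column, fraction=0.01):
    """Return the rows with the lowest (most negative) values of `column`.

    Keeps ceil(fraction * number_of_rows) rows, and always at least one.
    (Hypothetical helper, not part of the treasure hunt scripts.)
    """
    n_keep = max(1, math.ceil(fraction * len(df)))
    return df.nsmallest(n_keep, column)

# Five per-ligand mean values copied from the averaged table in this notebook
example = pd.DataFrame({
    "Ligand_Name": ["MFCD10004515", "MFCD10566077", "MFCD10687403",
                    "MFCD10698048", "MFCD10699119"],
    "Vina_BE": [-5.0961, -6.2365, -5.7386, -4.3308, -5.3713],
})

# 50% of 5 rows rounds up to 3 rows; for a real library you would use fraction=0.01
best = top_fraction(example, "Vina_BE", fraction=0.5)
print(best)
```

With one row per unique ligand and 2000 ligands, `fraction=0.01` would keep 20 rows, which matches the ~20 compounds suggested above.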

How you decide to proceed depends upon the ligand library you have docked. If you have docked FDA approved drugs, then you have limited chemical freedom to modify these, and your decisions should reflect this. Conversely, if you have docked a fragment library, your chemical opportunities are immense. You are likely to find a strong correlation between Vina_BE/Cyscore and the size of your ligand (e.g. MW or non-H atoms). However, because you want to use your fragment dock to inspire new molecular design, you may be best choosing the most **efficient** fragments (e.g. those with the best ligand efficiency (Vina_LE)), rather than those with the best absolute docking score.
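
To make the idea of efficiency concrete, the sketch below recomputes two of the columns for three rows copied from the results table in this notebook. The definitions used here (Vina_LE as Vina_BE divided by the number of non-H atoms, and Vina_LipE as Vina_BE minus clogP) are an assumption inferred from the tabulated numbers, not taken from the rescoring scripts themselves.

```python
import pandas as pd

# Three rows copied from the results table shown in this notebook
rows = pd.DataFrame({
    "Ligand_Name": ["MFCD10004515", "MFCD10566077", "MFCD10698048"],
    "Vina_BE": [-5.385, -6.132, -4.053],
    "nonHatoms": [15, 15, 8],
    "clogP": [0.475, 2.411, 1.783],
    "Vina_LE": [-0.359, -0.409, -0.507],
    "Vina_LipE": [-5.860, -8.543, -5.836],
})

# Assumed definitions, inferred from the numbers in the table
rows["LE_check"] = (rows["Vina_BE"] / rows["nonHatoms"]).round(3)
rows["LipE_check"] = (rows["Vina_BE"] - rows["clogP"]).round(3)

print(rows[["Ligand_Name", "Vina_LE", "LE_check", "Vina_LipE", "LipE_check"]])

# The smallest fragment has the weakest Vina_BE but the best Vina_LE
print("Best Vina_BE:", rows.sort_values("Vina_BE").iloc[0]["Ligand_Name"])
print("Best Vina_LE:", rows.sort_values("Vina_LE").iloc[0]["Ligand_Name"])
```

The two rankings disagree, which is exactly why the choice between absolute score and efficiency matters for a fragment library.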

In [1]:
#This imports the python modules that we need for the notebook...
import pandas as pd

# Recreating the 'pandas dataframes' from our docking data
We need our pandas dataframe again to help us select molecules to save.

In [2]:
#Read in the pandas dataframe from the csv file created in the previous notebook
total_dataframes = pd.read_csv("rescored_docking_results.csv")

In [3]:
#This lists all of our results...
total_dataframes

Unnamed: 0,Protein_Conf,Ligand_Name,Vina_BE,Vina_Ki,MW,nonHatoms,clogP,Vina_LE,Vina_LipE,Close_Contacts,Hphob_Contacts,HBonds,Salt_Bridges,Pi-Pi_Par,Pi-Pi_Perp,Pi-Cation,Hal_Bonds,RFScore_v3,NNScore,PLECScore_rf
0,2FLU-wt_H_01_rescored,MFCD10004515,-5.385,119.317903,168.128,15,0.475,-0.359,-5.860,2,6,5,0,0,0,0,0,4.411,4.118,3.952
1,2FLU-wt_H_01_rescored,MFCD10566077,-6.132,34.077793,201.221,15,2.411,-0.409,-8.543,1,12,1,0,0,0,1,0,4.488,5.288,3.647
2,2FLU-wt_H_01_rescored,MFCD10687403,-5.540,91.998097,208.090,14,0.603,-0.396,-6.143,1,2,1,0,0,0,1,2,4.543,4.790,4.117
3,2FLU-wt_H_01_rescored,MFCD10698048,-4.053,1114.672359,148.978,8,1.783,-0.507,-5.836,0,2,0,0,0,0,0,1,3.877,5.120,3.612
4,2FLU-wt_H_01_rescored,MFCD10699119,-5.141,179.669433,179.072,13,1.298,-0.395,-6.439,1,2,2,0,0,0,1,2,4.013,4.773,3.957
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
405,2FLU-wt_H_10_rescored,MFCD13619891,-6.332,24.364612,168.128,15,0.564,-0.422,-6.896,1,18,1,0,0,0,0,0,4.204,4.271,3.910
406,2FLU-wt_H_10_rescored,MFCD13619893,-5.926,48.145397,192.171,15,1.324,-0.395,-7.250,3,4,4,0,0,0,0,0,4.909,5.137,4.500
407,2FLU-wt_H_10_rescored,MFCD13619961,-5.280,142.299832,241.192,18,1.193,-0.293,-6.473,1,1,1,0,0,0,0,5,4.855,6.277,3.823
408,2FLU-wt_H_10_rescored,MFCD16090013,-3.536,2653.460552,111.145,10,-1.001,-0.354,-2.535,0,0,3,0,0,0,0,0,4.233,4.102,3.720


### Selections based on means
Here we calculate the mean of each quantity we are interested in across all protein conformers, so that we can rank the results (for example by the mean Vina_BE). This tells us about the binding of each compound averaged over all protein conformations. The means are only useful in the context of the underlying distributions: for example, a single pose cannot have a non-integral number of hydrogen bonds, so a mean of 1.5 HBonds describes how often hydrogen bonds form across the conformers, not any one pose. So use with care!!

In [4]:
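#Average the numeric columns for each ligand; numeric_only=True skips the Protein_Conf text column,
#which recent versions of pandas will not average
mean_over_conformers = total_dataframes.groupby(['Ligand_Name']).mean(numeric_only=True).reset_index()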
mean_over_conformers

Unnamed: 0,Ligand_Name,Vina_BE,Vina_Ki,MW,nonHatoms,clogP,Vina_LE,Vina_LipE,Close_Contacts,Hphob_Contacts,HBonds,Salt_Bridges,Pi-Pi_Par,Pi-Pi_Perp,Pi-Cation,Hal_Bonds,RFScore_v3,NNScore,PLECScore_rf
0,MFCD10004515,-5.0961,199.122809,168.128,15.0,0.475,-0.3396,-5.5711,1.3,8.3,3.8,0.0,0.0,0.0,0.2,0.0,4.409,3.8533,3.8295
1,MFCD10566077,-6.2365,31.038347,201.221,15.0,2.411,-0.4158,-8.6475,1.1,12.0,2.5,0.0,0.0,0.0,0.2,0.0,4.7613,4.7063,3.6676
2,MFCD10687403,-5.7386,72.434917,208.09,14.0,0.603,-0.4099,-6.3416,1.4,4.1,1.0,0.3,0.0,0.0,0.9,2.1,4.5058,4.3777,3.9206
3,MFCD10698048,-4.3308,752.840325,148.978,8.0,1.783,-0.5413,-6.1138,0.4,1.3,0.0,0.0,0.0,0.0,0.2,1.7,3.9067,5.3396,3.5663
4,MFCD10699119,-5.3713,131.187076,179.072,13.0,1.298,-0.4131,-6.6693,1.0,2.9,1.6,0.0,0.0,0.0,0.4,3.1,3.9259,4.3848,3.9039
5,MFCD10700048,-4.1157,1061.563425,136.151,11.0,0.818,-0.3741,-4.9337,0.8,3.0,2.3,0.0,0.0,0.0,0.2,0.0,3.9686,4.4761,3.8135
6,MFCD11055264,-5.8814,54.965532,193.201,15.0,1.668,-0.392,-7.5494,0.6,18.7,1.8,0.0,0.0,0.0,0.0,0.0,4.8373,5.3116,3.7122
7,MFCD11096852,-5.8354,62.548469,207.207,16.0,1.604,-0.3647,-7.4394,1.3,11.9,1.5,0.0,0.0,0.0,0.0,0.0,4.6185,5.3115,3.8388
8,MFCD11109313,-5.0178,238.262877,128.108,12.0,0.913,-0.4182,-5.9308,1.2,8.0,2.2,0.0,0.0,0.0,0.3,0.0,3.8549,4.0195,3.933
9,MFCD11109316,-5.6685,79.468573,188.164,17.0,-1.446,-0.3336,-4.2225,0.3,9.9,1.2,0.0,0.0,0.0,0.5,0.0,5.1097,5.3317,3.6237
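
The warning about averaged counts can be illustrated with a small sketch. The ligand names, conformer names and HBond counts below are invented toy values (they are not from the docking results), chosen so that two ligands share the same mean number of HBonds for very different reasons.

```python
import pandas as pd

# Invented toy data: one row per (ligand, protein conformer) pair
toy = pd.DataFrame({
    "Protein_Conf": ["conf_01", "conf_02", "conf_03", "conf_04"] * 2,
    "Ligand_Name": ["LIG_A"] * 4 + ["LIG_B"] * 4,
    "HBonds": [1, 1, 1, 1, 4, 0, 0, 0],
})

# Summarise the distribution, not just the mean
summary = toy.groupby("Ligand_Name")["HBonds"].agg(
    mean="mean",
    minimum="min",
    maximum="max",
    occupancy=lambda counts: (counts > 0).mean(),  # fraction of conformers with at least one HBond
)
print(summary)
```

Both toy ligands average one hydrogen bond, but one forms it in every conformer and the other in only a quarter of them. Looking at the minimum, maximum and occupancy alongside the mean helps to tell these cases apart.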


Arguably the simplest criterion for selecting compounds would be one of the docking 'scores' (Vina_BE or Cyscore), but this often gives us compounds that are large and that form few specific interactions (e.g. HBonds) with the protein receptor target.

This is how we display the top 10 compounds based on our selected parameter (in this case mean Vina_BE).

In [5]:
#numeric_only=True skips the Protein_Conf text column when averaging
Vina_BE_mean_over_conformers = total_dataframes.groupby(['Ligand_Name']).mean(numeric_only=True).sort_values('Vina_BE').reset_index()
Vina_BE_mean_over_conformers_top10 = Vina_BE_mean_over_conformers.head(10)
Vina_BE_mean_over_conformers_top10

Unnamed: 0,Ligand_Name,Vina_BE,Vina_Ki,MW,nonHatoms,clogP,Vina_LE,Vina_LipE,Close_Contacts,Hphob_Contacts,HBonds,Salt_Bridges,Pi-Pi_Par,Pi-Pi_Perp,Pi-Cation,Hal_Bonds,RFScore_v3,NNScore,PLECScore_rf
0,MFCD11847114,-6.4599,21.61144,199.109,16.0,1.0762,-0.4038,-7.5361,1.1,7.9,2.3,0.0,0.0,0.0,0.8,1.7,4.337,4.727,3.8481
1,MFCD12198116,-6.303,27.667492,217.264,15.0,1.702,-0.4201,-8.005,0.7,18.6,1.3,0.2,0.0,0.0,0.0,0.0,4.9846,5.3865,3.8586
2,MFCD10566077,-6.2365,31.038347,201.221,15.0,2.411,-0.4158,-8.6475,1.1,12.0,2.5,0.0,0.0,0.0,0.2,0.0,4.7613,4.7063,3.6676
3,MFCD12198122,-6.1111,38.305815,177.177,13.0,0.375,-0.47,-6.4861,0.8,13.7,1.5,0.5,0.0,0.0,0.0,0.0,4.2354,4.0958,3.8462
4,MFCD13619891,-6.0905,39.371305,168.128,15.0,0.564,-0.4059,-6.6545,1.2,13.9,1.9,0.0,0.0,0.0,0.0,0.0,4.3,4.0534,3.8931
5,MFCD12022271,-6.0417,43.966065,168.128,15.0,0.564,-0.4028,-6.6057,1.1,10.8,2.6,0.0,0.0,0.0,0.0,0.0,4.2859,4.249,3.9365
6,MFCD11841074,-6.024,43.554875,192.176,19.0,-1.851,-0.317,-4.173,1.0,11.2,2.0,0.0,0.0,0.0,0.9,0.0,4.8786,5.4382,3.6181
7,MFCD11055264,-5.8814,54.965532,193.201,15.0,1.668,-0.392,-7.5494,0.6,18.7,1.8,0.0,0.0,0.0,0.0,0.0,4.8373,5.3116,3.7122
8,MFCD11096852,-5.8354,62.548469,207.207,16.0,1.604,-0.3647,-7.4394,1.3,11.9,1.5,0.0,0.0,0.0,0.0,0.0,4.6185,5.3115,3.8388
9,MFCD13619893,-5.769,67.292391,192.171,15.0,1.324,-0.3847,-7.093,2.1,8.4,3.3,0.0,0.0,0.0,0.0,0.0,4.8404,4.5548,4.0221


We can also rank the results based on the ligand efficiency. This scales our Vina_BE by the size of the molecule, so we prioritise more *efficient* molecules in which more of the atoms that compose the compound contribute to the binding. This can be an important consideration when developing fragment-based 'hit' compounds, because as drugs go through a development process their size increases and their ligand efficiency reduces (so starting with ligand efficient molecules is generally regarded as a 'good thing').

Here we select our top 5 compounds by choosing those with the best Vina_LE averaged across all protein conformers.

In [6]:
#numeric_only=True skips the Protein_Conf text column when averaging
Vina_LE_mean_over_conformers = total_dataframes.groupby(['Ligand_Name']).mean(numeric_only=True).sort_values('Vina_LE').reset_index()
Vina_LE_mean_over_conformers_top5 = Vina_LE_mean_over_conformers.head(5)
Vina_LE_mean_over_conformers_top5

Unnamed: 0,Ligand_Name,Vina_BE,Vina_Ki,MW,nonHatoms,clogP,Vina_LE,Vina_LipE,Close_Contacts,Hphob_Contacts,HBonds,Salt_Bridges,Pi-Pi_Par,Pi-Pi_Perp,Pi-Cation,Hal_Bonds,RFScore_v3,NNScore,PLECScore_rf
0,MFCD10698048,-4.3308,752.840325,148.978,8.0,1.783,-0.5413,-6.1138,0.4,1.3,0.0,0.0,0.0,0.0,0.2,1.7,3.9067,5.3396,3.5663
1,MFCD12198134,-4.276,798.169873,128.517,9.0,-0.331,-0.475,-3.945,1.1,1.7,2.3,0.0,0.0,0.0,0.4,0.4,3.9303,4.6942,3.8576
2,MFCD12198122,-6.1111,38.305815,177.177,13.0,0.375,-0.47,-6.4861,0.8,13.7,1.5,0.5,0.0,0.0,0.0,0.0,4.2354,4.0958,3.8462
3,MFCD12022285,-5.122,194.881215,147.177,11.0,1.277,-0.4657,-6.399,0.3,8.0,0.9,0.0,0.2,0.0,1.0,0.0,4.3471,4.2151,3.6574
4,MFCD17171427,-5.4786,108.485827,161.138,12.0,-0.302,-0.4565,-5.1766,0.7,6.7,1.3,0.2,0.0,0.0,1.0,0.0,4.2979,4.1109,3.6659


In this cell we select compounds based on a series of average criteria calculated across all of the conformers. We use three criteria here, but it could be more (or fewer): the mean Vina_LE, the mean Vina_BE and the mean number of HBonds (a reflection of HBond occupancy across the protein conformers). Initially we don't know how many ligands this will give us, but we can tune our selection criteria to give us a manageable number.

In [7]:
Vina_LE_Vina_BE_HBonds_mean_over_conformers = mean_over_conformers[(mean_over_conformers.Vina_LE < -0.4) & (mean_over_conformers.Vina_BE < -5) & (mean_over_conformers.HBonds > 0.8)]
Vina_LE_Vina_BE_HBonds_mean_over_conformers

Unnamed: 0,Ligand_Name,Vina_BE,Vina_Ki,MW,nonHatoms,clogP,Vina_LE,Vina_LipE,Close_Contacts,Hphob_Contacts,HBonds,Salt_Bridges,Pi-Pi_Par,Pi-Pi_Perp,Pi-Cation,Hal_Bonds,RFScore_v3,NNScore,PLECScore_rf
1,MFCD10566077,-6.2365,31.038347,201.221,15.0,2.411,-0.4158,-8.6475,1.1,12.0,2.5,0.0,0.0,0.0,0.2,0.0,4.7613,4.7063,3.6676
2,MFCD10687403,-5.7386,72.434917,208.09,14.0,0.603,-0.4099,-6.3416,1.4,4.1,1.0,0.3,0.0,0.0,0.9,2.1,4.5058,4.3777,3.9206
4,MFCD10699119,-5.3713,131.187076,179.072,13.0,1.298,-0.4131,-6.6693,1.0,2.9,1.6,0.0,0.0,0.0,0.4,3.1,3.9259,4.3848,3.9039
8,MFCD11109313,-5.0178,238.262877,128.108,12.0,0.913,-0.4182,-5.9308,1.2,8.0,2.2,0.0,0.0,0.0,0.3,0.0,3.8549,4.0195,3.933
18,MFCD11847114,-6.4599,21.61144,199.109,16.0,1.0762,-0.4038,-7.5361,1.1,7.9,2.3,0.0,0.0,0.0,0.8,1.7,4.337,4.727,3.8481
20,MFCD12022268,-5.3068,148.677429,174.199,13.0,1.452,-0.4082,-6.7588,0.3,6.3,1.0,0.0,0.0,0.0,0.7,0.0,4.0366,4.4897,3.7912
22,MFCD12022271,-6.0417,43.966065,168.128,15.0,0.564,-0.4028,-6.6057,1.1,10.8,2.6,0.0,0.0,0.0,0.0,0.0,4.2859,4.249,3.9365
25,MFCD12022285,-5.122,194.881215,147.177,11.0,1.277,-0.4657,-6.399,0.3,8.0,0.9,0.0,0.2,0.0,1.0,0.0,4.3471,4.2151,3.6574
28,MFCD12198116,-6.303,27.667492,217.264,15.0,1.702,-0.4201,-8.005,0.7,18.6,1.3,0.2,0.0,0.0,0.0,0.0,4.9846,5.3865,3.8586
31,MFCD12198122,-6.1111,38.305815,177.177,13.0,0.375,-0.47,-6.4861,0.8,13.7,1.5,0.5,0.0,0.0,0.0,0.0,4.2354,4.0958,3.8462
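
One way to tune the criteria is to count how many ligands survive as a threshold is tightened. The sketch below does this for the Vina_LE cut-off, using the mean values copied from the first ten rows of the averaged table in this notebook. The helper `select_hits` and the particular cut-off values scanned are our own choices for illustration.

```python
import pandas as pd

# Mean values copied from the first ten rows of the averaged table in this notebook
means = pd.DataFrame({
    "Ligand_Name": ["MFCD10004515", "MFCD10566077", "MFCD10687403", "MFCD10698048",
                    "MFCD10699119", "MFCD10700048", "MFCD11055264", "MFCD11096852",
                    "MFCD11109313", "MFCD11109316"],
    "Vina_BE": [-5.0961, -6.2365, -5.7386, -4.3308, -5.3713,
                -4.1157, -5.8814, -5.8354, -5.0178, -5.6685],
    "Vina_LE": [-0.3396, -0.4158, -0.4099, -0.5413, -0.4131,
                -0.3741, -0.392, -0.3647, -0.4182, -0.3336],
    "HBonds": [3.8, 2.5, 1.0, 0.0, 1.6, 2.3, 1.8, 1.5, 2.2, 1.2],
})

def select_hits(df, le_max, be_max, hbonds_min):
    """Keep rows that pass all three criteria (hypothetical helper)."""
    mask = (df["Vina_LE"] < le_max) & (df["Vina_BE"] < be_max) & (df["HBonds"] > hbonds_min)
    return df[mask]

# Scan the Vina_LE cut-off while holding the other two criteria fixed
for le_max in (-0.38, -0.40, -0.41):
    hits = select_hits(means, le_max=le_max, be_max=-5, hbonds_min=0.8)
    print(f"Vina_LE < {le_max}: {len(hits)} ligands")
```

The same loop can be run over the Vina_BE or HBonds thresholds to see which criterion is doing most of the filtering.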


### Selections based on the best overall properties
However, we don't have to deal with the averaged properties of our ligands. We can instead select those with the best absolute 'activity' from *any* protein conformer and apply similar selection criteria. This might be important if some protein conformers are outliers (i.e. if they demonstrate particularly low or high binding scores for the ligand set).
In this example we use three criteria to select our compounds (Vina_BE, number of HBonds and Vina_LE).

In [8]:
sort_by_Ligand_Name = total_dataframes.sort_values('Ligand_Name')
Vina_BE_HBonds_best_overall = sort_by_Ligand_Name[(sort_by_Ligand_Name.Vina_BE < -4) & (sort_by_Ligand_Name.HBonds > 0) & (sort_by_Ligand_Name.Vina_LE < -0.5)]
Vina_BE_HBonds_best_overall

Unnamed: 0,Protein_Conf,Ligand_Name,Vina_BE,Vina_Ki,MW,nonHatoms,clogP,Vina_LE,Vina_LipE,Close_Contacts,Hphob_Contacts,HBonds,Salt_Bridges,Pi-Pi_Par,Pi-Pi_Perp,Pi-Cation,Hal_Bonds,RFScore_v3,NNScore,PLECScore_rf
31,2FLU-wt_H_01_rescored,MFCD12198122,-6.553,16.816972,177.177,13,0.375,-0.504,-6.928,3,14,1,0,0,0,0,0,3.983,4.447,3.843
240,2FLU-wt_H_06_rescored,MFCD12198134,-4.612,436.397564,128.517,9,-0.331,-0.512,-4.281,1,0,4,0,0,0,0,1,4.122,5.502,4.214
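
A related approach, sketched below, is to pick out the single best-scoring pose for each ligand and compare it with the ligand's mean score. The ligand names, conformer names and scores are invented toy values (not docking results), chosen to show that ranking by the mean and ranking by the best pose can disagree.

```python
import pandas as pd

# Invented toy data: one row per (ligand, protein conformer) pair
toy = pd.DataFrame({
    "Protein_Conf": ["conf_01", "conf_02", "conf_03"] * 2,
    "Ligand_Name": ["LIG_A"] * 3 + ["LIG_B"] * 3,
    "Vina_BE": [-5.0, -6.5, -5.2, -5.6, -5.5, -5.7],
    "HBonds": [1, 2, 0, 1, 1, 1],
})

# idxmin gives the row label of the most negative Vina_BE within each ligand
best_rows = toy.loc[toy.groupby("Ligand_Name")["Vina_BE"].idxmin()]
mean_BE = toy.groupby("Ligand_Name")["Vina_BE"].mean()

print(best_rows)
print(mean_BE)
```

Here LIG_A has the best single pose because one conformer suits it particularly well, while LIG_B has the better mean. Which of these you prefer depends on whether you trust that one conformer or regard it as an outlier.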


In [9]:
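#Find the best results for a specific protein conformer.
#The name must match a value in the Protein_Conf column exactly. 'helicaseEapo_00_rescored' does not
#occur in the data set shown above (which uses names like '2FLU-wt_H_01_rescored'), so this returns no rows.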
sort_by_Ligand_Name = total_dataframes.sort_values('Ligand_Name')
Vina_BE_HBonds_best_conf00 = sort_by_Ligand_Name[(sort_by_Ligand_Name.Protein_Conf == 'helicaseEapo_00_rescored') & (sort_by_Ligand_Name.Vina_BE < -5) & (sort_by_Ligand_Name.HBonds > 0) & (sort_by_Ligand_Name.Vina_LE < -0.4)]
Vina_BE_HBonds_best_conf00

Unnamed: 0,Protein_Conf,Ligand_Name,Vina_BE,Vina_Ki,MW,nonHatoms,clogP,Vina_LE,Vina_LipE,Close_Contacts,Hphob_Contacts,HBonds,Salt_Bridges,Pi-Pi_Par,Pi-Pi_Perp,Pi-Cation,Hal_Bonds,RFScore_v3,NNScore,PLECScore_rf
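
An empty table like the one above usually means the conformer name does not match anything in the Protein_Conf column. The sketch below checks the name before filtering. The helper `select_conformer` is our own illustration, and the three rows are copied from the results table in this notebook.

```python
import pandas as pd

# Three rows copied from the results table shown in this notebook
rows = pd.DataFrame({
    "Protein_Conf": ["2FLU-wt_H_01_rescored", "2FLU-wt_H_01_rescored", "2FLU-wt_H_10_rescored"],
    "Ligand_Name": ["MFCD10004515", "MFCD10566077", "MFCD13619891"],
    "Vina_BE": [-5.385, -6.132, -6.332],
})

def select_conformer(df, conformer):
    """Return the rows for one protein conformer, or fail loudly if the name is unknown.

    (Hypothetical helper, not part of the treasure hunt scripts.)
    """
    available = sorted(df["Protein_Conf"].unique())
    if conformer not in available:
        raise ValueError(f"{conformer!r} not found; available conformers: {available}")
    return df[df["Protein_Conf"] == conformer]

print(select_conformer(rows, "2FLU-wt_H_01_rescored"))

try:
    select_conformer(rows, "helicaseEapo_00_rescored")
except ValueError as error:
    print(error)
```

Listing `total_dataframes['Protein_Conf'].unique()` in your own notebook is a quick way to see which names are valid.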


Now that we have decided on some criteria for selecting our ligands, we write out a series of files that contain the bound conformation of the ligand for each of the protein structures.
**Make sure that you execute the cell that defines your chosen selection (e.g. 'Vina_BE_HBonds_best_overall') directly before executing the 'hit_ligands' cell below!!**

**Bear in mind that the size of the results folder will be *very* large if you select too many ligands (e.g. more than 20) - this could rapidly fill up your computer hard drive.**

Here we print out the list of "hit_ligands" you have chosen for visual analysis in Chimera. Be careful to check that these are the ones you really want to keep!!

In [10]:
#To use one of our other selection sets from above we would replace the 'Vina_BE_HBonds_best_overall'
#with the corresponding variable name.
selection_set = Vina_BE_HBonds_best_overall
#The next lines make your hit list and print it so that you can check the number of compounds
#before making the results files in the next step.
hit_ligands = selection_set.drop_duplicates(subset='Ligand_Name', keep='first')
Ligand_Names = hit_ligands['Ligand_Name'].tolist()
print('Number of ligands:', len(Ligand_Names))
print(hit_ligands[['Ligand_Name']])
with open('hit_ligands.txt', 'w') as f:
    for item in Ligand_Names:
        f.write("%s\n" % item)

Number of ligands: 2
      Ligand_Name
31   MFCD12198122
240  MFCD12198134


In the cell above we wrote out a list file called "hit_ligands.txt". Move this file to the same computer that you used for the docking calculations in Part 2. It tells the next part which results files you want to save. The next part will produce two types of result file for you:
- A series of .mol2 and .sdf files for each protein conformer that contain only the 'hit' compounds you define here. These files will be in a folder called "MTH_Hits".
- A series of .pdb files for each ligand that you define here with its corresponding docked pose for each protein conformer. These files will be in a folder called "MTH_Protein_Hits". You can use these files to see how the ligand pose changes as the protein conformation changes - a movie!
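
Before moving the file, it can be worth reading it back to confirm that it contains one ligand name per line. The sketch below writes the two hits printed above to a temporary folder and checks them. In your own notebook you would read your real "hit_ligands.txt" instead, and the temporary folder is used here only so that the example does not overwrite it.

```python
import tempfile
from pathlib import Path

# The two hits printed by the selection cell in this notebook
ligand_names = ["MFCD12198122", "MFCD12198134"]

with tempfile.TemporaryDirectory() as tmp:
    hit_file = Path(tmp) / "hit_ligands.txt"

    # Same layout as the notebook cell: one ligand name per line
    hit_file.write_text("".join(f"{name}\n" for name in ligand_names))

    # Read the file back, ignoring any blank lines
    read_back = [line.strip() for line in hit_file.read_text().splitlines() if line.strip()]

print("Ligands in file:", read_back)
print("Matches selection:", read_back == ligand_names)
print("Duplicates present:", len(read_back) != len(set(read_back)))
```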

# Congratulations!!
You have completed this part of your treasure hunt, and you are soon to open your first boxes of treasure! Please do not be disheartened if on your first attempt you find that your box is empty, or that the contents are disappointing. You will be luckier next time, or the time after!! Perseverance is key to any successful quest.

Also beware of false optimism! Even though your results may look beautifully convincing, you should assess them with a skeptical eye. Others will scrutinise them carefully for their true value when you present them in the future. **Ultimately all predictions of this type need to be tested by experiment.**

"*All that is gold does not glitter, not all those who wander are lost.*"

Good luck!!

Sarah and Geoff

(an "out-of-office" $O^{3}P$ production)