# **The Molecular Treasure Hunt: Part 4: Selecting your "hits_list"**

*An Ångstrom sized adventure by Sarah Harris (Leeds Physics) and Geoff Wells (UCL Pharmacy)*

"*The chances of finding out what's really going on in the universe are so remote, the only thing to do is hang the sense of it and keep yourself occupied.*" Douglas Adams, The Hitchhiker's Guide to the Galaxy

# Selecting and refining our list of 'hit' compounds
Now it is time to make decisions about which ligands you think are the best, based on your physical, chemical and pharmacological understanding of the results. We have made a few suggestions, but now it's up to you to make your own choices.

We recommend that you identify criteria that exclude all but the "best" ~1% of unique ligands (which here would be ~20 from a 2000 ligand data set, assuming a single protein receptor conformer). This quantity is very subjective.

How you decide to proceed depends upon the ligand library you have docked. If you have docked FDA approved drugs, then you have limited chemical freedom to modify these, and your decisions should reflect this. Conversely, if you have docked a fragment library, your chemical opportunities are immense. You are likely to find a strong correlation between Vina_BE and the size of your ligand (e.g. MW or non-H atoms). However, because you want to use your fragment dock to inspire new molecular design, you may be best choosing the most **efficient** fragments (e.g. those with the best ligand efficiency (Vina_LE)), rather than those with the best absolute docking score.
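
To make the difference between the two rankings concrete, here is a minimal sketch using made-up numbers. It assumes that Vina_LE is simply Vina_BE divided by the number of non-hydrogen atoms (a common definition of ligand efficiency, but check how it was calculated in your own results). The `Heavy_Atoms` column name and all of the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy values for illustration only (not real docking results)
toy = pd.DataFrame({
    "Ligand_Name": ["frag_A", "frag_B", "drug_C"],
    "Vina_BE": [-5.0, -6.0, -9.0],   # more negative = better docking score
    "Heavy_Atoms": [10, 15, 30],     # assumed column name: number of non-H atoms
})

# Assumed definition: docking score per non-hydrogen atom
toy["Vina_LE"] = toy["Vina_BE"] / toy["Heavy_Atoms"]

# Ascending sort puts the most negative (best) value first
best_by_BE = toy.sort_values("Vina_BE").iloc[0]["Ligand_Name"]
best_by_LE = toy.sort_values("Vina_LE").iloc[0]["Ligand_Name"]

print("Best by Vina_BE:", best_by_BE)
print("Best by Vina_LE:", best_by_LE)
```

In this toy set the largest molecule wins on absolute score, while the smallest fragment wins on efficiency, which is the situation you are likely to meet with a fragment library.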

In [None]:
#This imports the python modules that we need for the notebook...
import pandas as pd

# Recreating the 'pandas dataframes' from our docking data
We need our pandas dataframe again to help us select molecules to save.

In [None]:
#Read in the pandas dataframe from the csv file created in the previous notebook
total_dataframes = pd.read_csv("rescored_docking_results.csv")

In [None]:
#This lists all of our results...
total_dataframes

### Selections based on means
Here we calculate the mean of each quantity we are interested in across all protein conformers. We can then rank the results by the (mean) Vina_BE, which tells us about the binding of the compounds averaged across all protein conformations. These means are only useful in the context of the distributions you observed in Part 3. For example, it is not possible to have a non-integral number of hydrogen bonds, so a mean HBonds value reflects how often hydrogen bonds form across the conformers rather than a real pose. So use with care!!

In [None]:
mean_over_conformers = total_dataframes.groupby(['Ligand_Name']).mean(numeric_only=True).reset_index()
mean_over_conformers
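
As a reminder of why the means need care, here is a small sketch with hypothetical values in which two ligands have exactly the same mean Vina_BE but behave very differently across conformers. It uses the same column names as our results file; the ligand names, conformer names and numbers are invented.

```python
import pandas as pd

# Hypothetical toy values for illustration only
toy = pd.DataFrame({
    "Ligand_Name": ["lig_1", "lig_1", "lig_2", "lig_2"],
    "Protein_Conf": ["conf_a", "conf_b", "conf_a", "conf_b"],
    "Vina_BE": [-6.0, -6.0, -9.0, -3.0],
    "HBonds": [1, 1, 2, 0],
})

# Report the spread alongside the mean for each ligand
summary = toy.groupby("Ligand_Name").agg(
    Vina_BE_mean=("Vina_BE", "mean"),
    Vina_BE_min=("Vina_BE", "min"),
    Vina_BE_std=("Vina_BE", "std"),
    HBonds_mean=("HBonds", "mean"),
    HBonds_min=("HBonds", "min"),
).reset_index()

print(summary)
```

Both ligands share the same mean score, but only the minimum and standard deviation reveal that one of them binds well to a single conformer and poorly to the other.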

Arguably the simplest criterion for selecting compounds is the docking score (Vina_BE), but this can often give us compounds that are large and that form few specific interactions (e.g. HBonds) with the protein receptor target.

This is how we display the top 10 compounds based on our selected parameter (in this case mean Vina_BE).

In [None]:
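#numeric_only=True skips text columns (e.g. Protein_Conf), which cannot be averaged
Vina_BE_mean_over_conformers = total_dataframes.groupby(['Ligand_Name']).mean(numeric_only=True).sort_values('Vina_BE').reset_index()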
Vina_BE_mean_over_conformers_top10 = Vina_BE_mean_over_conformers.head(10)
Vina_BE_mean_over_conformers_top10

We can also rank the results based on the ligand efficiency. This scales our Vina_BE by the size of the molecule, so we prioritise more *efficient* molecules in which more of the atoms that compose the compound contribute to the binding. This can be an important consideration when developing fragment-based 'hit' compounds, because as drugs go through a development process their size increases and their ligand efficiency reduces (so starting with ligand efficient molecules is generally regarded as a 'good thing').

Here we select our top 5 compounds by choosing those with the best Vina_LE averaged across all protein conformers.

In [None]:
Vina_LE_mean_over_conformers = total_dataframes.groupby(['Ligand_Name']).mean(numeric_only=True).sort_values('Vina_LE').reset_index()
Vina_LE_mean_over_conformers_top5 = Vina_LE_mean_over_conformers.head(5)
Vina_LE_mean_over_conformers_top5

### This is the part where you make some choices and need to edit the python code!
In this cell we select compounds based on a series of average criteria calculated across all of the conformers. We use three criteria in this case, but it could be more (or fewer). Below we have used the mean Vina_LE, mean Vina_BE and mean number of HBonds (a reflection of HBond occupancy across the protein conformers). Initially we don't know how many ligand results this will give us, but we can tune our selection criteria (by changing the numbers in the python code below) to give us a manageable number of ligands. You can look at the dataframes above and the graphs in Part 3 to give you an idea of which values to use here.

In [None]:
Vina_LE_Vina_BE_HBonds_mean_over_conformers = mean_over_conformers[(mean_over_conformers.Vina_LE < -0.4) & (mean_over_conformers.Vina_BE < -4) & (mean_over_conformers.HBonds > 0.8)]
Vina_LE_Vina_BE_HBonds_mean_over_conformers
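
If you are unsure which numbers to use, one option is to count how many ligands survive for a few candidate cut-offs before committing to one. The sketch below does this on a small table of made-up per-ligand means; `count_hits` is a hypothetical helper written for this example, and the cut-off values are arbitrary.

```python
import pandas as pd

# Hypothetical per-ligand means for illustration only
toy_means = pd.DataFrame({
    "Ligand_Name": ["lig_1", "lig_2", "lig_3", "lig_4"],
    "Vina_LE": [-0.50, -0.45, -0.35, -0.42],
    "Vina_BE": [-6.0, -4.5, -7.0, -3.5],
    "HBonds": [1.5, 0.9, 2.0, 1.0],
})

def count_hits(df, le_cut, be_cut, hb_cut):
    """Count the ligands that pass all three cut-offs (hypothetical helper)."""
    mask = (df.Vina_LE < le_cut) & (df.Vina_BE < be_cut) & (df.HBonds > hb_cut)
    return int(mask.sum())

# Tighten the Vina_BE cut-off while holding the other two criteria fixed
counts = {be_cut: count_hits(toy_means, -0.4, be_cut, 0.8)
          for be_cut in (-4.0, -5.0, -6.5)}

for be_cut, n in counts.items():
    print("Vina_BE <", be_cut, "->", n, "ligands")
```

To try this on your own data you could pass `mean_over_conformers` in place of `toy_means` and choose the cut-off that leaves a manageable number of ligands.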

### Selections based on the best overall properties
However, we don't have to deal with the averaged properties of our ligands. We can instead select those with the best absolute 'activity' from *any* protein conformer and apply similar selection criteria. This might be important if some protein conformers are outliers (i.e. if they demonstrate particularly low or high binding scores for the ligand set).
In this example we use three criteria to select our compounds (Vina_BE, number of HBonds and Vina_LE).

You can change these numbers by editing the python code to increase or decrease the number of molecules you select. You can also add or substitute the properties that you use for your selection by editing the python code.

In [None]:
sort_by_Ligand_Name = total_dataframes.sort_values('Ligand_Name')
Vina_BE_HBonds_best_overall = sort_by_Ligand_Name[(sort_by_Ligand_Name.Vina_BE < -5.2) & (sort_by_Ligand_Name.HBonds > 0) & (sort_by_Ligand_Name.Vina_LE < -0.4)]
Vina_BE_HBonds_best_overall
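
The selection above keeps every ligand/conformer pair that passes the cut-offs, so the same ligand can appear several times. If you also want to know *which* conformer gave each ligand its best score, a sketch along these lines may help. The data are hypothetical, and this is an alternative view of the results rather than part of the workflow above.

```python
import pandas as pd

# Hypothetical toy values for illustration only
toy = pd.DataFrame({
    "Ligand_Name": ["lig_1", "lig_1", "lig_2", "lig_2"],
    "Protein_Conf": ["conf_a", "conf_b", "conf_a", "conf_b"],
    "Vina_BE": [-5.5, -6.1, -7.0, -4.0],
    "HBonds": [1, 0, 2, 1],
})

# idxmin gives the row label of the most negative Vina_BE for each ligand;
# .loc then pulls out the complete row, including the conformer name
best_rows = toy.loc[toy.groupby("Ligand_Name")["Vina_BE"].idxmin()]
best_rows = best_rows.reset_index(drop=True)

print(best_rows)
```

Note that the best-scoring pose is not necessarily the one with the most hydrogen bonds, so it is still worth applying your other criteria afterwards.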

In this example we identify a particular protein conformer and find the best ligands for this form of the protein. Again, you will need to adjust the parameters to select a small subset of the ligands.

In [None]:
#Find the best results for a specific protein conformer
Protein_conf_name = 'eg5_adp_mg_02000' #insert name of conformer after the '='

sort_by_Ligand_Name = total_dataframes.sort_values('Ligand_Name')
Vina_BE_HBonds_best_conf00 = sort_by_Ligand_Name[(sort_by_Ligand_Name.Protein_Conf == Protein_conf_name) & (sort_by_Ligand_Name.Vina_BE < -4.9) & (sort_by_Ligand_Name.HBonds > 0) & (sort_by_Ligand_Name.Vina_LE < -0.2)]
Vina_BE_HBonds_best_conf00

## Writing the output files for our 'best' ligands
Now that we have decided on some criteria for selecting our ligands, we write out a series of files that contain the bound conformation of the ligand for each of the protein structures.
**Make sure that you execute the cell that creates your chosen selection (e.g. 'Vina_BE_HBonds_best_overall') directly before executing the 'hit_ligands' selection below!!**

**Bear in mind that the size of the results folder will be *very* large if you select too many ligands (e.g. more than 20) - this could rapidly fill up your computer hard drive.**

Here we print out the list of "hit_ligands" you have chosen for visual analysis in Chimera. Be careful to check that these are the ones you really want to keep!!

To use one of our other selection sets from above we would replace the 'Vina_BE_HBonds_best_overall' with the corresponding variable name. e.g. we could replace 'selection_set = Vina_BE_HBonds_best_overall' with 'selection_set = Vina_LE_Vina_BE_HBonds_mean_over_conformers'

In [None]:
selection_set = Vina_BE_HBonds_best_overall
#The next lines make your hit list and print the output so that you can check the number of compounds
#before making the results files in the next step.
hit_ligands = selection_set.drop_duplicates(subset='Ligand_Name', keep='first')
Ligand_Names = hit_ligands['Ligand_Name'].tolist()
print('Number of ligands:', len(Ligand_Names))
print(hit_ligands[['Ligand_Name']])
with open('hit_ligands.txt', 'w') as f:
    for item in Ligand_Names:
        f.write("%s\n" % item)

In the cell above we wrote out a list file called "hit_ligands.txt". This will tell the next part which results files you want to save. It will produce two types of result file for you:
- A series of .mol2 and .sdf files for each protein conformer that contain only the 'hit' compounds you define here. These files will be in a folder called "MTH_Hits".
- A series of .pdb files for each ligand that you define here with its corresponding docked pose for each protein conformer. These files will be in a folder called "MTH_Protein_Hits". You can use these files to see how the ligand pose changes as the protein conformation changes - a movie!
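
Before moving on, it can be worth reading the list file back in to confirm that it contains what you expect. The sketch below writes and re-reads a list in the same one-name-per-line format, using a temporary folder so that it does not touch your real "hit_ligands.txt". The ligand names are hypothetical, and the limit of 20 comes from the warning about folder size above.

```python
import os
import tempfile

# Hypothetical ligand names for illustration only
example_names = ["lig_1", "lig_2", "lig_3"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "hit_ligands.txt")

    # Write one ligand name per line, as in the cell above
    with open(path, "w") as f:
        for name in example_names:
            f.write("%s\n" % name)

    # Read the names back, ignoring any blank lines
    with open(path) as f:
        names_read = [line.strip() for line in f if line.strip()]

print(len(names_read), "ligands listed")
if len(names_read) > 20:
    print("Warning: more than 20 ligands selected, the results folder may be very large")
```

To check your real list, you could open "hit_ligands.txt" directly in place of the temporary file.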

# Congratulations!!
You have completed the fourth part of your treasure hunt, and you are soon to open your first boxes of treasure! Please do not be disheartened if on your first attempt you find that your box is empty, or that the contents are disappointing. You will be luckier next time, or the time after!! Perseverance is key to any successful quest.

Also beware of false optimism! Even though your results may look beautifully convincing, you should assess them with a skeptical eye. Others will scrutinise them carefully for their true value when you present them in the future. **Ultimately all predictions of this type need to be tested by experiment.**

"*All that is gold does not glitter, not all those who wander are lost.*" J.R.R. Tolkien, The Fellowship of the Ring

Good luck!!

Sarah and Geoff

(an "out-of-office studios" $O^{3}S$ production)