# Talktorial 4

# Ligand-based screening: compound similarity

#### Developed in the CADD seminars 2017 and 2018, AG Volkamer, Charité/FU Berlin 

Andrea Morger and Franziska Fritz, adapted by Gautier Peyrat

## Aim of this talktorial

In this talktorial, we get familiar with different approaches to encode (descriptors, fingerprints) and compare (similarity measures) molecules. Furthermore, we perform a virtual screening in the form of a similarity search for the EGFR inhibitor Gefitinib against our dataset of EGFR-tested compounds from the ChEMBL database filtered by Lipinski's rule of five (see **talktorial 2**). 

## Learning goals

### Theory

* Molecular similarity
* Molecular descriptors
* Molecular fingerprints
  * Substructure-based fingerprints
  * MACCS fingerprints
  * Morgan fingerprints, circular fingerprints
* Molecular similarity measures
  * Tanimoto coefficient
  * Dice coefficient
* Virtual screening
  * Virtual screening using similarity search

### Practical

* Import and draw molecules
* Calculate molecular descriptors
  * 1D molecular descriptors: Molecular weight
  * 2D molecular descriptors: MACCS fingerprint
  * 2D molecular descriptors: Morgan fingerprints
* Calculate molecular similarity
  * MACCS fingerprints: Tanimoto and Dice similarity
  * Morgan fingerprints: Tanimoto and Dice similarity
* Virtual screening using similarity search
  * Compare query compound to all compounds in a data set
  * Distribution of similarity values
  * Visualize most similar molecules
  * Generate enrichment plots

## References

* Review on "Molecular similarity in medicinal chemistry" ([<i>J. Med. Chem.</i> (2014), <b>57</b>, 3186-3204](http://pubs.acs.org/doi/abs/10.1021/jm401411z))
* Morgan fingerprints with RDKit ([RDKit tutorial on Morgan fingerprints](http://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints))
* ECFP - extended-connectivity fingerprints ([<i>J. Chem. Inf. Model.</i> (2010), <b>50</b>,742-754](https://pubs.acs.org/doi/abs/10.1021/ci100050t))
* Chemical space ([<i>ACS Chem. Neurosci.</i> (2012), <b>19</b>, 649-57](https://www.ncbi.nlm.nih.gov/pubmed/23019491))
* List of molecular descriptors in RDKit ([RDKit documentation: Descriptors](https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors))
* List of fingerprints in RDKit ([RDKit documentation: Fingerprints](https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-fingerprints))
* Enrichment plots ([Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), **1**, 313-31](https://onlinelibrary.wiley.com/doi/10.1002/9783527806539.ch6h))

_____________________________________________________________________________________________________________________


## Theory

### Molecular similarity

Molecular similarity is a well-known and widely used concept in chemical informatics. Comparing compounds and their properties can help us identify new compounds with desired properties and biological activity.

The assumption that structurally similar molecules have similar properties and, thus, similar biological activity is expressed in the similar property principle (SPP) as well as in the structure-activity relationship (SAR). 
In this context, virtual screening follows the idea that, given a set of molecules with known binding affinity, we can look for further molecules that resemble them.

### Molecular descriptors

Similarity can be assessed in many different ways depending on the application (see <a href="http://pubs.acs.org/doi/abs/10.1021/jm401411z"><i>J. Med. Chem.</i> (2014), <b>57</b>, 3186-3204</a>):

* **1D molecular descriptor**: Solubility, logP, molecular weight, melting point etc.
    * Global descriptor: only one value representing the whole compound
    * Usually not enough characteristics specifying a molecule to apply machine learning (ML) 
    * Can be added to 2D fingerprints to improve molecular encoding for ML
* **2D molecular descriptors**: Molecular graphs, paths, fragments, atom environments
    * Detailed representation of individual parts of the molecule 
    * Many features/bits per molecule called fingerprints
    * Very often used in similarity search and ML
* **3D molecular descriptors**: Shape, stereochemistry
    * Chemists are usually trained for 2D representations
    * Less robust than 2D representations because of compound flexibility (what is the "right" conformation of a compound?)
* **Biological similarity**
    * Biological fingerprint, e.g. individual bits represent bioactivity measure against different targets
    * Independent of molecular structure
    * Requires experimental (or predicted) data


We already learned how to calculate 1D physicochemical parameters, such as molecular weight and logP in **talktorial 2**. More about such descriptors in RDKit can be found in the [RDKit documentation: Descriptors](https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors). 

In the following, we focus on the definition of 2D (or 3D) molecular descriptors. Because they are, in most cases, unique to a molecule, these descriptors are also called fingerprints.

### Molecular fingerprints

#### Substructure-based fingerprints

Molecular fingerprints are a computational representation of molecules that encodes chemical and molecular features in the form of bitstrings, bitvectors or arrays. Each bit corresponds to a predefined molecular feature or environment, where "1" represents the presence and "0" the absence of a feature. Note that some implementations are count-based, i.e. they count how often a specific feature is present.

There are multiple ways to design fingerprints.
Here, we introduce MACCS keys and Morgan fingerprint as two commonly used 2D fingerprints. 
As can be seen in the [RDKit documentation: Fingerprints](https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-fingerprints), RDKit also offers multiple alternate fingerprints. 

#### MACCS fingerprints

Molecular ACCess System (MACCS) fingerprints, also termed MACCS structural keys, consist of 166 predefined structural fragments. Each position queries the presence or absence of one particular structural fragment or key. 
The individual keys were empirically defined by medicinal chemists and are simple to use and interpret ([RDKit documentation: MACCS keys](http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-module.html)).

<img src="images/maccs_fp.png" align="above" alt="Image cannot be shown" width="250">
<div align="center"> Figure 2: Illustration of MACCS fingerprint (figure by Andrea Morger).</div>
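
The idea behind a structural-key fingerprint can be sketched in plain Python. This is only an illustration: the key list and feature names below are hypothetical placeholders, not the real 166 MACCS definitions, and the step that detects features in a molecule is assumed to have happened already.

```python
# Hypothetical, much shortened key list; real MACCS keys use 166 substructure queries
KEYS = ["aromatic ring", "carboxylic acid", "amide", "halogen", "sulfonamide", "hydroxyl"]


def structural_key_fingerprint(features_present, keys=KEYS):
    """Return a bitstring with one position per predefined key."""
    return "".join("1" if key in features_present else "0" for key in keys)


# Features assumed to have been detected in two illustrative molecules
mol_a_features = {"aromatic ring", "amide", "hydroxyl"}
mol_b_features = {"aromatic ring", "halogen", "sulfonamide"}

fp_a = structural_key_fingerprint(mol_a_features)
fp_b = structural_key_fingerprint(mol_b_features)
print(fp_a)  # 101001
print(fp_b)  # 100110
```

Because every position has a fixed meaning, two such bitstrings can be compared position by position, which is what makes key-based fingerprints easy to interpret.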

#### Morgan fingerprints and circular fingerprints 

This family of fingerprints is based on the Morgan algorithm. 
The bits correspond to the circular environments of each atom in a molecule. 
The number of neighboring bonds and atoms to consider is set by the radius. 
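The length of the bit string can also be chosen: identifiers of atom environments that exceed this length are folded onto it with a modulo operation. 
Therefore, the Morgan fingerprint is not limited to a predefined set of features, although a shorter bit string increases the chance that different environments share the same bit. 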
More about the Morgan fingerprint can be found in the 
[RDKit documentation: Morgan fingerprints](http://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints). 
Extended connectivity fingerprints (ECFP) are also commonly used fingerprints that are derived using a variant of the Morgan algorithm, see ([<i>J. Chem. Inf. Model.</i> (2010), <b>50</b>,742-754](https://pubs.acs.org/doi/abs/10.1021/ci100050t)) for further information. 

<img src="images/morgan_fp.png" align="above" alt="Image cannot be shown" width="270">
<div align="center">Figure 3: Illustration of Morgan circular fingerprint (figure by Andrea Morger).</div>
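
The folding step can be illustrated with a small sketch in plain Python. The environment identifiers are made-up integers (real identifiers are large hashed values), and `fold_features` is a hypothetical helper, not an RDKit function.

```python
def fold_features(feature_ids, n_bits):
    """Fold integer feature identifiers into a bit vector of length n_bits."""
    bits = [0] * n_bits
    for feature_id in feature_ids:
        bits[feature_id % n_bits] = 1  # position = identifier modulo vector length
    return bits


# Hypothetical atom-environment identifiers
environment_ids = [3, 17, 42, 58, 1029]

fp_long = fold_features(environment_ids, 2048)
fp_short = fold_features(environment_ids, 16)

print(sum(fp_long))   # 5 bits set, no collisions
print(sum(fp_short))  # 4 bits set, because 42 and 58 both map to position 10
```

The short vector loses information through the bit collision, which is the trade-off behind choosing the fingerprint length.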

### Molecular similarity measures 

Once the descriptors/fingerprints are calculated, they can be compared to assess the similarity between two molecules. Molecular similarity can be quantified with a number of different similarity coefficients; two common similarity measures are the Tanimoto and Dice index ([<i>J. Med. Chem.</i> (2014), <b>57</b>, 3186-3204](http://pubs.acs.org/doi/abs/10.1021/jm401411z)).

#### Tanimoto coefficient

$$T _{c}(A,B) = \frac{c}{a+b-c}$$

a: number of features present in compound A <br>
b: number of features present in compound B <br>
c: number of features shared by compounds A and B

#### Dice coefficient

$$D_{c}(A,B) = \frac{c}{\frac{1}{2}(a+b)}$$

a: number of features present in compound A <br>
b: number of features present in compound B <br>
c: number of features shared by compounds A and B

The similarity measures usually consider the number of positive bits (1's) present in each fingerprint and the number of positive bits that both have in common. 
Dice similarity is always greater than or equal to Tanimoto similarity: the number of shared features c cannot exceed the mean of a and b, so the Tanimoto denominator is at least as large as the Dice denominator:

$$\frac{c}{a+b-c} \leq \frac{c}{\frac{1}{2}(a+b)}$$
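
As a minimal sketch, both coefficients can be computed in plain Python when each fingerprint is represented by the set of its on-bit positions. The two example fingerprints are invented for illustration, and the functions assume that neither fingerprint is empty.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto coefficient c / (a + b - c) for two sets of on-bit positions."""
    a, b = len(on_bits_a), len(on_bits_b)
    c = len(on_bits_a & on_bits_b)
    return c / (a + b - c)


def dice(on_bits_a, on_bits_b):
    """Dice coefficient c / (0.5 * (a + b)) for two sets of on-bit positions."""
    a, b = len(on_bits_a), len(on_bits_b)
    c = len(on_bits_a & on_bits_b)
    return c / (0.5 * (a + b))


# Invented fingerprints: a = 6, b = 8, shared bits c = 4 (positions 1, 4, 9, 15)
fp_a = {1, 4, 7, 9, 12, 15}
fp_b = {1, 4, 8, 9, 15, 20, 22, 30}

print(round(tanimoto(fp_a, fp_b), 3))  # 0.4
print(round(dice(fp_a, fp_b), 3))      # 0.571
```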


### Virtual screening 

The challenge in the early stages of drug discovery is to narrow down the large existing chemical space to a set of small molecules (compounds) that potentially bind to the target under investigation. Note that this chemical space is vast: Small molecules can be made of 10<sup>20</sup> combinations of chemical moieties ([<i>ACS Chem. Neurosci.</i> (2012), <b>19</b>, 649-57](https://www.ncbi.nlm.nih.gov/pubmed/23019491)). 

Since experimental high-throughput screening (HTS) for the activity of all those small molecules against the target of interest is cost and time intensive, computer-aided methods are invoked to propose a focused list of small molecules to be tested. This process is called virtual (high-throughput) screening: a large library of small molecules is filtered by rules and/or patterns, in order to identify those small molecules that are most likely to bind a target under investigation.

#### Virtual screening using similarity search

Comparing a set of novel molecules against a (or several) known active molecule(s) to find the most similar ones can be used as a simple way of virtual screening. 
Given the similar property principle, we can assume that the most similar molecules, e.g. to a known inhibitor, also have similar effects. Requirements for a similarity search are the following (as discussed in detail above):

* A representation that encodes chemical/molecular features
* A potential weighting of features (optional)
* A similarity measurement

A similarity search can be performed by calculating the similarity between one compound and all compounds in a specific database. Ranking the compounds of the database by their similarity coefficient yields the most similar molecules at the top.
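
A similarity search of this kind can be sketched in a few lines of plain Python. The query, the compound names and the fingerprints (given as sets of on-bit positions) are invented for illustration.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto coefficient for two non-empty sets of on-bit positions."""
    shared = len(on_bits_a & on_bits_b)
    return shared / (len(on_bits_a) + len(on_bits_b) - shared)


# Invented query fingerprint and a toy "database" of fingerprints
query = {1, 2, 3, 4, 5}
library = {
    "cpd_1": {1, 2, 3, 4, 5},
    "cpd_2": {1, 2, 3, 8},
    "cpd_3": {6, 7, 8, 9},
    "cpd_4": {1, 2, 3, 4, 9, 10},
}

# Score every compound against the query and rank by decreasing similarity
ranked = sorted(
    ((name, tanimoto(query, fp)) for name, fp in library.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

The compound identical to the query ends up at rank 1, and the compound without any shared bit ends up last.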

#### Enrichment plots

Enrichment plots are used to validate virtual screening results. They display the ratio of active compounds detected in the top x% of the ranked list, i.e.: 
* the ratio of top-ranked molecules (x-axis) from the whole dataset vs. 
* the ratio of active molecules (y-axis) from the whole dataset.

<img src="images/enrichment_plot.png" align="above" alt="Image cannot be shown" width="270">
<div align="center">Figure 4: Example of enrichment plot for virtual screening results.</div>
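
How the points of such a curve are obtained can be sketched in plain Python. The ranked activity labels are invented, `enrichment_curve` is a hypothetical helper, and the sketch assumes that the list contains at least one active compound.

```python
def enrichment_curve(ranked_is_active):
    """Return x (fraction of ranked dataset) and y (fraction of actives found)."""
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    x_values, y_values = [], []
    actives_found = 0
    for rank, is_active in enumerate(ranked_is_active, start=1):
        actives_found += is_active
        x_values.append(rank / n_total)
        y_values.append(actives_found / n_actives)
    return x_values, y_values


# Invented screening result: True = active, ordered from most to least similar
ranked_labels = [True, True, False, True, False, False, False, True, False, False]

x, y = enrichment_curve(ranked_labels)
print(x[2], y[2])  # after the top 30% of the list, 50% of the actives are found
```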

## Practical

In the first part of this practical section, we will use RDKit to encode molecules (molecular fingerprints) and compare them in order to calculate their similarity (molecular similarity measures), as discussed in the theory section above.

In the second part, we will use these encoding and comparison techniques to conduct a similarity search (virtual screening): 
We use the known EGFR inhibitor Gefitinib as query and search for similar compounds in our data set of compounds tested on EGFR, which we collected from the ChEMBL database in **talktorial 1** and filtered by Lipinski's rule of five in **talktorial 2**.

### Import and draw molecules

First, we define and draw eight example molecules, which we will encode and compare later on. 
The molecules in SMILES format are converted to RDKit molecule objects and visualized with the RDKit `Draw` module.

In [None]:
# Import relevant Python packages
# The majority of the basic molecular functionality is found in module rdkit.Chem
from rdkit import Chem
# Drawing related
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem
from rdkit.Chem import MACCSkeys
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs

import math
import numpy as np
import pandas as pd
from rdkit.Chem import PandasTools
import matplotlib.pyplot as plt

In [None]:
# Look up the SMILES of these 8 molecules in ChEMBL
doxycycline =  # Doxycycline Smiles
amoxicilline =  # Amoxicilline Smiles
furosemide = # Furosemide Smiles
glycol_dilaurate =  # Glycol dilaurate Smiles
hydrochlorothiazide =  # Hydrochlorothiazide Smiles
isotretinoine =  # Isotretinoine Smiles
tetracycline =  # Tetracycline Smiles
hemi_cycline_D =  # Hemi-cycline D Smiles

# Get all the SMILES in one list "L_smiles"

# Convert the SMILES into a list of Molecule objects (from RDKit) called "mols" (this name is used in later cells)

# Construct a list of molecule names called "mol_names"

# Draw molecules to a grid image with 2 molecules per row, and add 2 arguments :
# subImgSize=(450,150), legends=mol_names


### Calculate molecular descriptors

We extract and generate 1D and 2D molecular descriptors to compare our molecules. 
For 2D descriptors, different types of fingerprints are generated to be used later for the calculation of the molecular similarity.

#### 1D molecular descriptors: molecular weight

We calculate the molecular weight of our example structures.

In [None]:
# Get molecular weight for all molecules and store them in a list called "mol_weights"

We draw our molecular structures with their similar molecular weight for visual comparison: is the molecular weight a feasible descriptor for compound similarity?

In [None]:
# Generate a DataFrame "sim_mw_df" with the columns "smiles", "name", "mw" and "Mol"


# Sort by molecular weight in descending order

# Display only the columns "smiles", "name" and "mw"
# The syntax is : dataframe_name[["colname1", "colname2", "colname3"]]

In [None]:
# Draw the molecules and display their names and their molecular weights in legends
# The zip() function lets you iterate over 2 iterables in the same loop:
# for i,j in zip(sim_mw_df["name"], sim_mw_df["mw"])
# A string can be formatted with variable values using this syntax: f"{i}: {str(round(j, 2))} Da"
# Use 2 molecules per row, a width of 450 px and a height of 150 px


As we can see, molecules with similar molecular weight can have a similar structure (e.g. Doxycycline/Tetracycline), however they can also have a similar number of atoms in completely different arrangements (e.g. Doxycycline/Glycol dilaurate or Hydrochlorothiazide/Isotretinoine).

In order to account for more detailed properties of a molecule, we now take a look at 2D molecular descriptors.

#### 2D molecular descriptors: MACCS fingerprint

MACCS fingerprints can be easily generated using RDKit. As explicit bitvectors are not human-readable, we will further transform them to bitstrings.

In [None]:
# Generate MACCS fingerprint 
maccs_fp1 = MACCSkeys.GenMACCSKeys(mols[0])  # Doxycycline
maccs_fp1

In [None]:
# Generate fingerprint "maccs_fp2" for Amoxicilline


In [None]:
# list all attributes and methods of the object maccs_fp2


In [None]:
# Print fingerprint maccs_fp2 as bitstring


In [None]:
# Create an empty list "maccs_fp_list"

# Generate MACCS fingerprints for all molecules and store them in the list maccs_fp_list



#### 2D molecular descriptors: Morgan fingerprints

We also calculate the circular Morgan fingerprints with RDKit. 
Depending on the function used, the Morgan fingerprint is returned either as a count (int) vector or as a bit vector.

In [None]:
# Generate Morgan fingerprint (as int vector), by default the radius is 2 and the vector is 2048 long
circ_fp1 = rdFingerprintGenerator.GetCountFPs([mols[0]])[0]
circ_fp1

In [None]:
# Generate Morgan fingerprint (as int vector) for the second molecule store it in "circ_fp2"


In [None]:
# Look at the values that are set:
circ_fp1.GetNonzeroElements()

In [None]:
# Generate Morgan fingerprint (as bit vector) with GetFPs instead of GetCountFPs
# by default the radius is 2 and the vector is 2048 long


In [None]:
# Print fingerprint as bitstring


In [None]:
# Generate Morgan fingerprint for all molecules


### Calculate molecular similarity

In the following, we will apply two similarity measures, i.e. **Tanimoto** and **Dice**, to our two fingerprint types, i.e. **MACCS** and **Morgan** fingerprints.

Example: Two MACCS fingerprints compared with the Tanimoto similarity.

In [None]:
# List all methods and attributes of "DataStructs"


In [None]:
# Calculate Tanimoto similarity coefficient between the same molecule (maccs_fp1 and maccs_fp1)


In [None]:
# Calculate Tanimoto coefficient between 2 molecules (maccs_fp1 and maccs_fp2)


In the following, we want to compare a query compound with our molecule list. 
Therefore, we use the RDKit functions ```BulkTanimotoSimilarity``` and ```BulkDiceSimilarity``` that calculate the similarity of a query fingerprint with a list of fingerprints, based on a similarity measure, i.e. either the Tanimoto or Dice similarity. 

After calculating the similarity, we want to draw our ranked molecules with the following function:

In [None]:
def draw_ranked_molecules(sim_df_sorted, sorted_column):
    """
    Function that draws molecules from a (sorted) DataFrame.
    """
    # We will define labels: first molecule is query, following molecules start from rank 1
    # Create a list "rank" with all the labels: "#0: ", "#1: ", "#2: ", ...
    # Then, replace the first element of the list with "Query: "

    # Get the list "top_smiles" with SMILES from the DataFrame (use the method .tolist())
    # Convert the molecules to RDKit objects and collect them in a list "top_mols"
    top_names = [f"{i}{j} ({str(round(k, 2))})" for i, j, k in zip(rank, sim_df_sorted["name"].tolist(), 
                                                                  sim_df_sorted[sorted_column])]

    return # return the molecules top_mols drawn to grid image, with their top_names as legends.

In the following, we will investigate all combinations of MACCS/Morgan fingerprint comparisons based on the Tanimoto/Dice similarity measure. Therefore, we create a DataFrame that will summarize our results:

In [None]:
# Generate DataFrame "sim_df" with the columns "smiles" and "name"


#### MACCS fingerprints: Tanimoto similarity

In [None]:
# Add similarity scores to DataFrame
sim_df['tanimoto_MACCS'] = DataStructs.BulkTanimotoSimilarity(maccs_fp1,maccs_fp_list)

In [None]:
# DataFrame sorted by Tanimoto similarity of MACCS fingerprints
sim_df_sorted_t_ma = sim_df.copy()
sim_df_sorted_t_ma.sort_values(['tanimoto_MACCS'], ascending=False, inplace=True)
sim_df_sorted_t_ma

In [None]:
# Draw molecules ranked by Tanimoto similarity of MACCS fingerprints
draw_ranked_molecules(sim_df_sorted_t_ma, "tanimoto_MACCS")

With MACCS fingerprints, Tetracycline is the most similar molecule (high score), followed by Amoxicilline. In contrast to the 1D descriptor molecular weight, the linear molecule Glycol dilaurate is recognized as dissimilar (last rank).

#### MACCS fingerprints: Dice similarity

In [None]:
# Add similarity scores "dice_MACCS" to DataFrame "sim_df"


In [None]:
# Sort a copy "sim_df_sorted_d_ma" of the dataFrame "sim_df" by Dice similarity of MACCS fingerprints




Tanimoto and Dice similarity measures by definition result in the same ranking of molecules, with higher values for the Dice similarity (see Tanimoto and Dice equations in the theory section of this talktorial).
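
The shared ranking follows directly from the two definitions. Solving the Tanimoto equation for a + b and inserting the result into the Dice equation gives (for T<sub>c</sub> > 0):

$$a + b = c\,\frac{1 + T_{c}}{T_{c}} \quad \Rightarrow \quad D_{c} = \frac{2c}{a+b} = \frac{2\,T_{c}}{1 + T_{c}}$$

Since this expression increases monotonically with T<sub>c</sub> on the interval [0, 1], sorting by either coefficient yields the same order, and since 2/(1 + T<sub>c</sub>) is at least 1 on that interval, the Dice value is never smaller than the Tanimoto value.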

#### Morgan fingerprints: Tanimoto similarity

In [None]:
# Add similarity scores "tanimoto_morgan" and "dice_morgan" to DataFrame "sim_df"



In [None]:
# Sort a copy of the dataFrame "sim_df_sorted_t_mo" by Tanimoto similarity of Morgan fingerprints




In [None]:
# Draw molecules ranked by Tanimoto similarity of Morgan fingerprints


Compare the MACCS and Morgan similarities by plotting Tanimoto(Morgan) vs. Tanimoto(MACCS).

In [None]:
fig, axes = plt.subplots(figsize=(6,6), nrows=1, ncols=1)
sim_df_sorted_t_mo.plot('tanimoto_MACCS','tanimoto_morgan',kind='scatter',ax=axes)
plt.plot([0,1],[0,1],'k--')
axes.set_xlabel("MACCS")
axes.set_ylabel("Morgan")
plt.show()

Usage of different fingerprints (here: MACCS and Morgan fingerprints) results in different similarity values (here: Tanimoto similarity) and thus potentially also in different rankings of molecule similarity as shown here. 

Morgan fingerprints also recognize Tetracycline as the most similar compound to Doxycycline (but with a lower score), and Glycol dilaurate as the most dissimilar. However, ranked second is Hemi-cycline D, a structural part of cyclines, possibly because of the atom environment-based algorithm of Morgan fingerprints (whereas MACCS fingerprints rather ask for the occurrence of certain properties). 

### Virtual screening using similarity search

Now that we have learned how to calculate fingerprints and the similarity between them, we can apply this knowledge to a similarity search of a query compound against a full data set of compounds. 

We use the known EGFR inhibitor Gefitinib as query and search for similar compounds in our data set of compounds tested on EGFR, which we collected from the ChEMBL database in **talktorial 1** and filtered by Lipinski's rule of five in **talktorial 2**.

#### Compare query compound to all compounds in the data set

We import compounds from a *csv* file containing the filtered EGFR-tested compounds from the ChEMBL database as provided by **talktorial 2**. Given one query compound (here Gefitinib) we screen that data set for similar compounds.

In [None]:
# Import data from csv file containing compounds in SMILES format
filtered_df = pd.read_csv('../data/T2/EGFR_compounds_lipinski.csv', delimiter=';', usecols=['molecule_chembl_id', 'smiles', 'pIC50'])
filtered_df.head() 

In [None]:
# Generate Mol object from SMILES of query compound (Gefitinib), you can find it on http://www.icoa.fr/pkidb/
query = # Molecule Gefitinib

In [None]:
# Generate MACCS "maccs_fp_query" and Morgan "circ_fp_query" fingerprints for query compound (Gefitinib)



In [None]:
# Generate MACCS and Morgan fingerprints for all molecules in file
# Firstly, get all mol objects from smiles of the "filtered_df" in a list "ms"
# Then, calculate all fingerprints "circ_fp_list" and "maccs_fp_list" from "ms"



In [None]:
# Calculate Tanimoto similarity for query compound (Gefitinib) and all molecules in file (MACCS, Morgan)
tanimoto_maccs = 
tanimoto_circ = 

In [None]:
# Calculate Dice similarity for query compound (Gefitinib) and all molecules in file (MACCS, Morgan)
dice_maccs = 
dice_circ = 

In [None]:
# Create a dataframe "similarity_df" with the columns "ChEMBL_ID", "bioactivity" (the pIC50 values), "tanimoto_MACCS",
# "tanimoto_morgan", "dice_MACCS", "dice_morgan" and "smiles", holding the similarity of each compound to Gefitinib


In [None]:
# Show the first lines of the DataFrame


#### Distribution of similarity values

As mentioned in the theory section, for the same fingerprint (e.g. MACCS fingerprints), Tanimoto similarity values are lower than Dice similarity values.
Also, when two different fingerprints are compared (e.g. MACCS and Morgan fingerprints), the values of the same similarity measure (e.g. Tanimoto similarity) vary. 

We can have a look at the distributions by plotting histograms.

In [None]:
# Plot distributions of Tanimoto and Dice similarity for MACCS and Morgan fingerprints
%matplotlib inline
fig, axes = plt.subplots(figsize=(10,6), nrows=2, ncols=2)
similarity_df.hist(["tanimoto_MACCS"], ax=axes[0,0])
similarity_df.hist(["tanimoto_morgan"], ax=axes[0,1])
similarity_df.hist(["dice_MACCS"], ax=axes[1,0])
similarity_df.hist(["dice_morgan"], ax=axes[1,1])
axes[1,0].set_xlabel("similarity value")
axes[1,0].set_ylabel("# molecules")
plt.show()

We can compare similarities here too. This time, let's directly compare Tanimoto and Dice similarities for the two fingerprints.

In [None]:
fig, axes = plt.subplots(figsize=(12,6), nrows=1, ncols=2)

similarity_df.plot('tanimoto_MACCS','dice_MACCS',kind='scatter',ax=axes[0])
axes[0].plot([0,1],[0,1],'k--')
axes[0].set_xlabel("Tanimoto(MACCS)")
axes[0].set_ylabel("Dice(MACCS)")

similarity_df.plot('tanimoto_morgan','dice_morgan',kind='scatter',ax=axes[1])
axes[1].plot([0,1],[0,1],'k--')
axes[1].set_xlabel("Tanimoto(Morgan)")
axes[1].set_ylabel("Dice(Morgan)")

plt.show()

Similarity distributions are important to interpret similarity values, e.g. a value of 0.6 needs to be evaluated differently for MACCS or Morgan fingerprints, as well as Tanimoto or Dice similarities.

In the following, we draw the most similar molecules for the Tanimoto similarity based on Morgan fingerprints.

#### Visualize most similar molecules

We visually inspect the structure of Gefitinib in comparison to the most similar molecules in our ranking, including the information about their bioactivity (pIC50 derived from ChEMBL in **talktorial 1**).

In [None]:
# Sort DataFrame "similarity_df" by "tanimoto_morgan" in descending order
# show the first lines

In [None]:
# Use PandasTools to add a structural representation of the SMILES strings (RDKit object Mol) to the DataFrame


In [None]:
# Draw query and top molecules (+ bioactivity data)
sim_mols = [Chem.MolFromSmiles(i) for i in similarity_df.smiles][:11]

legend = [f"#{str(a)} {b} ({str(round(c,2))})" for a, b, c in zip(range(1,len(sim_mols)+1),
                                                                               similarity_df.ChEMBL_ID, 
                                                                               similarity_df.bioactivity)]
Chem.Draw.MolsToGridImage(mols = [query] + sim_mols[:11], 
                          legends = (['Gefitinib'] + legend), 
                          molsPerRow = 4)

The top-ranked molecules for Gefitinib are, first, the Gefitinib entries in our dataset (ranks 1 and 2), followed by variations of Gefitinib, e.g. with different benzene substitution patterns. 
Note: ChEMBL contains the complete structure-activity relationship analysis for Gefitinib (being a well-studied compound); therefore, it is not surprising to have that many Gefitinib-like compounds in our dataset.

We now check how well the similarity search is able to distinguish between active and inactive compounds based on our dataset. Therefore, we use the bioactivity values, which we collected from ChEMBL for each compound (bioactivity against EGFR) in **talktorial 1**.

#### Generate enrichment plots

In order to validate our virtual screening and see the ratio of active compounds detected, we generate an enrichment plot. 

Enrichment plots show 
* the ratio of top-ranked molecules (x-axis) from the whole dataset vs. 
* the ratio of active molecules (y-axis) from the whole dataset. 

We compare the Tanimoto similarity for MACCS and Morgan fingerprints. 

In order to decide whether we treat a molecule as active or inactive, we apply the commonly used pIC50 cut-off value of 6.3. The literature suggests pIC50 cut-off values ranging from 5 to 7, or even an exclusion range in which data points are discarded, but we consider this cut-off reasonable. 
The same cut-off will be used for machine learning in **talktorial 10**.
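
To get a feeling for this cut-off, it helps to translate it back into a concentration. The sketch below uses the standard definition pIC50 = -log10(IC50 in mol/L) and assumes that IC50 values are given in nM; the helper names are hypothetical.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 value in nM to pIC50 (1 M = 1e9 nM)."""
    return 9 - math.log10(ic50_nm)


def ic50_nm_from_pic50(pic50):
    """Convert a pIC50 value back to an IC50 value in nM."""
    return 10 ** (9 - pic50)


print(round(ic50_nm_from_pic50(6.3)))  # 501
print(pic50_from_ic50_nm(1000))        # 6.0
```

A pIC50 of 6.3 thus corresponds to an IC50 of roughly 500 nM: compounds that inhibit at lower concentrations are treated as active.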

In [None]:
# pIC50 cut-off value used to discriminate active and inactive compounds
threshold = 6.3

In [None]:
similarity_df.head()

In [None]:
def get_enrichment_data(similarity_df, similarity_measure, threshold):
    """
    This function calculates x and y values for enrichment plot:
    x - % ranked dataset
    y - % true actives identified
    """
    
    # Get number of molecules in the data set (similarity_df)
    mols_all = 
    # Get total number of active compounds in data set (by comparing their bioactivities to the threshold)
    actives_all = 
    # Initialize a list "actives_counter_list" that will hold the counter for actives and compounds while iterating through our dataset
    
    # Initialize counter "actives_counter" for active molecules (to 0)
    
    # Data must be ranked for enrichment plots, sort compounds by similarity measures, in descending order

    # Iterate over the ranked dataset and check each compound if active (by checking bioactivity)
    for value in similarity_df.bioactivity:
        if value >= threshold:
            actives_counter += 1
        actives_counter_list.append(actives_counter)

    # Transform number of molecules into % ranked dataset
    mols_perc_list = [i/mols_all for i in list(range(1, mols_all+1))]

    # Transform number of actives into % true actives identified
    actives_perc_list = [i/actives_all for i in actives_counter_list]

    # Generate DataFrame with x and y values as well as label 
    enrich_df = pd.DataFrame({'% ranked dataset':mols_perc_list, 
                              '% true actives identified':actives_perc_list,
                              'similarity_measure': similarity_measure})
    
    return enrich_df

In [None]:
# Define similarity measures to be plotted
sim_measures = ['tanimoto_MACCS', 'tanimoto_morgan']

# Generate a list of DataFrames containing enrichment plot data for the 2 similarity measures
enrich_data = 


In [None]:
# Prepare data set for plotting:
# Concatenate per-similarity measure DataFrames to one DataFrame
# - different similarity measures are still distinguishable by the "similarity_measure" column
enrich_df = pd.concat(enrich_data)

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

fontsize = 20

for key, grp in enrich_df.groupby('similarity_measure'):
    ax = grp.plot(ax = ax,
                  x = '% ranked dataset',
                  y = '% true actives identified',
                  label=key,
                  alpha=0.5, linewidth=4)
ax.set_ylabel('% True actives identified', size=fontsize)
ax.set_xlabel('% Ranked dataset', size=fontsize)

# Ratio of actives in dataset
ratio = sum(similarity_df.bioactivity >= threshold) / len(similarity_df)

# Plot optimal curve
ax.plot([0,ratio,1], [0,1,1], label="Optimal curve", color="black", linestyle="--")

# Plot random curve
ax.plot([0,1], [0,1], label="Random curve", color="grey", linestyle="--")

plt.tick_params(labelsize=16)
plt.legend(labels=['MACCS', 'Morgan', "Optimal", "Random"], loc=(.5, 0.08), 
           fontsize=fontsize, labelspacing=0.3)

# Save plot - use bbox_inches to include text boxes:
# https://stackoverflow.com/questions/44642082/text-or-legend-cut-from-matplotlib-figure-on-savefig?rq=1
plt.savefig("../data/T4/enrichment_plot.png", dpi=300, bbox_inches="tight", transparent=True)

plt.show()

Enrichment plots show a slightly better performance for fingerprint comparison based on Morgan fingerprints than based on MACCS fingerprints.

In [None]:
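# Get EF for x% of ranked dataset
# Note: if enrich_df holds several similarity measures, the last matching row (last measure) is reported
def print_data_ef(perc_ranked_dataset, enrich_df):
    # The column '% ranked dataset' holds fractions (0-1), so convert the percentage before comparing
    data_ef = enrich_df[enrich_df['% ranked dataset'] <= perc_ranked_dataset / 100].tail(1)
    # Convert the fraction of actives found back to a percentage
    data_ef = round(float(data_ef['% true actives identified'].iloc[0]) * 100, 1)
    print("Experimental EF for ", perc_ranked_dataset, "% of ranked dataset: ", data_ef, "%", sep="")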

# Get random EF for x% of ranked dataset
def print_random_ef(perc_ranked_dataset):
    random_ef = round(float(perc_ranked_dataset), 1)
    print("Random EF for ", perc_ranked_dataset, "% of ranked dataset:       ", random_ef, "%", sep="")

# Get optimal EF for x% of ranked dataset
def print_optimal_ef(perc_ranked_dataset, similarity_df, threshold):
    ratio = sum(similarity_df.bioactivity >= threshold) / len(similarity_df) * 100
    if perc_ranked_dataset <= ratio:
        optimal_ef = round(100/ratio * perc_ranked_dataset, 1)
    else:
        optimal_ef = round(float(100), 2)
    print("Optimal EF for ", perc_ranked_dataset, "% of ranked dataset:      ", optimal_ef, "%", sep="")

In [None]:
# Choose percentage
perc_ranked_list = 5

# Get EF data
print_data_ef(perc_ranked_list, enrich_df)
print_random_ef(perc_ranked_list)
print_optimal_ef(perc_ranked_list, similarity_df, threshold)

## Discussion

We have performed our virtual screening using the Tanimoto similarity. Of course, this could also be done using Dice or any other similarity measure. 

A drawback of a similarity search with molecular fingerprints is that it favors compounds that resemble the query and is therefore unlikely to yield structurally novel compounds. Another challenge when working with molecular similarity is posed by so-called activity cliffs: a small change in a functional group of a molecule may cause a jump in bioactivity. 