# Talktorial 2

# Molecular filtering: ADME and lead-likeness criteria

#### Developed in the CADD seminars 2017 and 2018, AG Volkamer, Charité/FU Berlin 

Michele Ritschel and Mathias Wajnberg

## Aim of this talktorial

The compounds acquired from ChEMBL (**talktorial 1**) will be filtered by lead-likeness criteria in order to remove less drug-like molecules from our screening library.

* Calculate molecular parameters related to bioavailability of compounds (Lipinski's rule of five)
* Filter compounds collected from ChEMBL by rule of five criteria
* Plot parameters in form of radar chart

## Learning goals

### Theory
* ADME - absorption, distribution, metabolism and excretion
* Lead-likeness and Lipinski's rule of five
* Variations and interpretation of radar charts in the context of lead-likeliness

### Practical
* Calculate physicochemical parameters for example compounds
* Generate bar plots to compare individual physicochemical parameters for multiple molecules
* Write a function to check compliance with rule of five
* Apply rule of five to whole dataset retrieved from ChEMBL
* Generate a radar chart of our dataset filtered by the rule of five. This helps to visualize the properties in context of the rule of five criteria in one plot.

## References

* ADME criteria: ADME description (https://en.wikipedia.org/wiki/ADME) and ([<i>Mol Pharm.</i> (2010), <b>7(5)</b>, 1388-1405](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025274/))
* SwissADME (http://www.swissadme.ch/)
* Lead compounds: (https://en.wikipedia.org/wiki/Lead_compound)
* LogP (https://en.wikipedia.org/wiki/Partition_coefficient)
* Lipinski, Christopher A., et al. "Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings." ([<i>Adv. Drug Deliv. Rev.</i> (1997), <b>23</b>, 3-25](https://www.sciencedirect.com/science/article/pii/S0169409X96004231))
* Ritchie et al. "Graphical representation of ADME-related molecule properties for medicinal chemists" ([<i>Drug. Discov. Today</i> (2011), <b>16</b>, 65-72](https://www.ncbi.nlm.nih.gov/pubmed/21074634))

_____________________________________________________________________________________________________________________


## Theory

In a virtual screening we can predict whether a compound might bind to and interact with a specific target. However, if we want to identify a new drug, it is also important that this compound reaches the target and is eventually removed from the body in a favorable way. Therefore, we should also consider whether a compound is actually taken up into the body and whether it is able to cross certain barriers in order to reach its target. Is it metabolically stable and how will it be excreted once it is not acting at the target anymore? These processes are investigated in the field of pharmacokinetics. In contrast to pharmacodynamics ('What does the drug do to our body?'), pharmacokinetics deals with the question **'What happens to the drug in our body?'**. 

### ADME

Pharmacokinetics are mainly divided into four steps: 
<strong>A</strong>bsorption, 
<strong>D</strong>istribution, 
<strong>M</strong>etabolism, and 
<strong>E</strong>xcretion. 
These are summarized as <strong>ADME</strong>. Sometimes, ADME(T) also includes <strong>T</strong>oxicology. 
Below, the ADME steps are discussed in more detail.  
([ADME wikipedia](https://en.wikipedia.org/wiki/ADME) and [<i>Mol Pharm.</i> (2010), <b>7(5)</b>, 1388-1405](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025274/))


<img src="images/adme.png" class="center" align="right" width="260"> 


* **Absorption**: The amount of a substance that is taken up into the body, and the time this takes, depend on multiple factors which can vary between individuals and their conditions as well as on the properties of the substance. Factors such as (poor) compound solubility, gastric emptying time, intestinal transit time, chemical (in-)stability in the stomach, and (in-)ability to permeate the intestinal wall can all influence the extent to which a drug is absorbed after e.g. oral administration, inhalation, or skin contact.
<br><br>
* **Distribution**: The distribution of an absorbed substance, i.e. within the body, between blood and different tissues, and crossing of the blood-brain barrier are affected by regional blood flow rates, molecular size and polarity of the compound, and binding to serum proteins and transporter enzymes. Critical effects in toxicology can be accumulation of highly apolar substances in fatty tissue, or crossing of the blood-brain barrier.
<br><br>
* **Metabolism**: As soon as a compound enters the body, it usually starts to be metabolized. This means that only part of this compound will actually reach its target. Mainly liver and kidney enzymes are responsible for the breakdown of xenobiotics (substances that are extrinsic to the body). Reducing the amount of an absorbed substance can be favorable if a toxic compound is removed. On the other hand, transformation of a chemical could even yield new toxic metabolites. 
<br><br>
* **Excretion**: Compounds and their metabolites need to be removed from the body via excretion, usually through the kidneys (urine) or in the feces. Incomplete excretion can result in accumulation of foreign substances or adverse interference with normal metabolism.

<div align="right" width="250">Figure 1: ADME processes in the human body <br>
    (figure taken from openclipart.org and adapted) </div>

###  Lead-likeness and Lipinski's rule of five

[<strong>Lead</strong> compounds](https://en.wikipedia.org/wiki/Lead_compound) are developmental drug candidates with promising properties. They are used as starting structures and modified with the aim of finding the desired drugs. Besides bioactivity (*'Compound binds to the target of interest.'*), favorable ADME properties are also important criteria for the design of efficient drugs. 

Bioavailability is an important ADME property. Lipinski's rule of five was introduced to estimate it based solely on a compound's structure. It is a rule of thumb that helps to estimate the oral bioavailability of a compound.

According to the rule of five, a substance is most likely not orally bioavailable if it violates more than one of the following rules:

* Molecular weight is less than or equal to 500 Daltons
* Not more than 10 hydrogen bond acceptors
* Not more than 5 hydrogen bond donors
* LogP (octanol-water partition coefficient) is less than or equal to 5

[LogP](https://en.wikipedia.org/wiki/Partition_coefficient) is also called partition coefficient or octanol-water coefficient. It measures the distribution of a compound, usually between a hydrophobic (e.g. 1-octanol) and a hydrophilic (e.g. water) phase. 
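
As a small worked example of this definition: logP is the base-10 logarithm of the ratio of a compound's concentrations in the two phases at equilibrium. The sketch below uses made-up concentrations; the function name `log_p` and the numbers are assumptions for illustration, not measured values.

```python
from math import log10


def log_p(conc_octanol, conc_water):
    """Return log10 of the octanol/water concentration ratio (same units for both)."""
    return log10(conc_octanol / conc_water)


# Hypothetical concentrations, chosen only to illustrate the scale
print(log_p(100.0, 1.0))  # compound prefers octanol 100-fold
print(log_p(1.0, 10.0))   # compound prefers water 10-fold
```

A positive value indicates a preference for the hydrophobic phase, a negative value a preference for water.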

Hydrophobic molecules might have a reduced solubility in water, while more hydrophilic molecules (e.g. high number of hydrogen bond acceptors and donors) or large molecules (high molecular weight) might have more difficulties in passing phospholipid membranes.

As for the rule of five, note that all numbers are multiples of five; this is the origin of the rule's name.

([<i>Adv. Drug Deliv. Rev.</i> (1997), <b>23</b>, 3-25](https://www.sciencedirect.com/science/article/pii/S0169409X96004231))
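
The rule as stated above can be sketched in a few lines of plain Python. The function names `count_ro5_violations` and `passes_ro5` as well as the property values are hypothetical and serve only as an illustration; in the practical part the properties are calculated from the molecular structure instead.

```python
def count_ro5_violations(mw, hba, hbd, logp):
    """Count how many of the four rule of five criteria are violated."""
    conditions = [mw <= 500, hba <= 10, hbd <= 5, logp <= 5]
    return conditions.count(False)


def passes_ro5(mw, hba, hbd, logp):
    """A compound passes if it violates at most one criterion."""
    return count_ro5_violations(mw, hba, hbd, logp) <= 1


# Illustrative (made-up) property values, not computed from real structures
small_polar = dict(mw=320.0, hba=4, hbd=2, logp=2.1)
large_greasy = dict(mw=640.0, hba=3, hbd=0, logp=8.5)

print(passes_ro5(**small_polar))
print(passes_ro5(**large_greasy))
```

Counting violations first and comparing to one afterwards keeps the "more than one violation" wording of the rule explicit.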

### Radar charts

<img src="images/radarplot.png" class="center" align="right" width="250"> 

After calculating the molecular properties related to the rule of five, it can be helpful to visualize them. Ritchie et al. ([<i>Drug. Discov. Today</i> (2011), <b>16(1-2)</b>, 65-72](https://www.ncbi.nlm.nih.gov/pubmed/21074634)) provided an overview of graphical representations of ADME-related properties: 
There are multiple ways (e.g. Craig plots, flower plots, or the golden triangle) to visualize molecular properties and, thus, to support the interpretation by medicinal chemists. 

In this tutorial, you will learn how to generate a radar plot using the Python plotting library `matplotlib`.
Due to their appearance, radar charts ([radar charts wikipedia](https://en.wikipedia.org/wiki/Radar_chart)) are sometimes also called ‘spider’ or ‘cobweb’ plots. 
They are arranged circularly in 360 degrees and have one axis, starting in the center, for each condition. The values for each parameter are plotted on the axis and connected with a line. 
A shaded area can indicate the region where the parameters meet the conditions.

<div align="right" width="250">Figure 2: Radar plot displaying physico- <br> chemical properties of a compound dataset </div>
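
To make the geometry concrete, the following sketch computes the axis angles for a radar chart with one axis per property and closes the polygon by repeating the first point. The variable names and the values are illustrative assumptions.

```python
from math import pi

properties = ["HBD", "HBA", "MW", "LogP"]  # one axis per property
values = [2.0, 3.5, 4.1, 3.0]              # made-up, already scaled values

n_axes = len(properties)
# Spread the axes evenly over the full circle (2*pi radians = 360 degrees)
angles = [i / n_axes * 2 * pi for i in range(n_axes)]

# Repeat the first point at the end so the connecting line forms a closed shape
closed_angles = angles + angles[:1]
closed_values = values + values[:1]

for angle, value in zip(closed_angles, closed_values):
    print(f"{angle * 180 / pi:6.1f} degrees -> {value}")
```

The closed lists can be passed directly to a polar plot, which then draws the connecting line back to the starting axis.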


## Practical

### Define example molecules and visualize them

Before working with the whole dataset retrieved from ChEMBL, we pick four example compounds to investigate their chemical properties.
We import the necessary libraries, start from the SMILES of four example molecules and draw them.

In [None]:
import sys
sys.path.insert(1, '../corrections/exercices')

In [None]:
from rdkit import Chem
from rdkit.Chem import Descriptors
import pandas as pd
from rdkit.Chem import Draw
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.lines import Line2D
from math import pi

In [None]:
smiles_1 = 'CCC1C(=O)N(CC(=O)N(C(C(=O)NC(C(=O)N(C(C(=O)NC(C(=O)NC(C(=O)N(C(C(=O)N(C(C(=O)N(C(C(=O)N(C(C(=O)N1)C(C(C)CC=CC)O)C)C(C)C)C)CC(C)C)C)CC(C)C)C)C)C)CC(C)C)C)C(C)C)CC(C)C)C)C' # Cyclosporine
smiles_2 = 'CN1CCN(CC1)C2=C3C=CC=CC3=NC4=C(N2)C=C(C=C4)C' # Clozapine
smiles_3 = 'CC1=C(C(CCC1)(C)C)C=CC(=CC=CC(=CC=CC=C(C)C=CC=C(C)C=CC2=C(CCCC2(C)C)C)C)C' # Beta-carotene
smiles_4 = 'CCCCCC1=CC(=C(C(=C1)O)C2C=C(CCC2C(=C)C)C)O' # Cannabidiol
smiles_dict = {'cyclosporine' : smiles_1, 'clozapine' : smiles_2, 'beta-carotene' : smiles_3, 'cannabidiol' : smiles_4}
mol_dict = {name : Chem.MolFromSmiles(smiles) for name, smiles in smiles_dict.items()}

Draw.MolsToGridImage([mol for mol in mol_dict.values()], legends=[name for name in mol_dict.keys()], molsPerRow=4)

### Calculate rule of five molecular properties and plot them

The chemical properties relevant for the rule of five are calculated and visually compared:

* Calculate molecular weight, number of h-bond acceptors and donors, and logP.
*  Using the predefined functions in the [rdkit descriptor library](http://www.rdkit.org/docs/GettingStartedInPython.html#descriptor-calculation)

In [None]:
MWs = [Descriptors.ExactMolWt(mol) for mol in mol_dict.values()]
HBAs = [Descriptors.NumHAcceptors(mol) for mol in mol_dict.values()]
HBDs = [Descriptors.NumHDonors(mol) for mol in mol_dict.values()]
LogPs = [Descriptors.MolLogP(mol) for mol in mol_dict.values()]
parameters = [MWs, HBAs, HBDs, LogPs]
print('Molecular weight of the four compounds:',MWs)

* Plot the properties per molecule as bar plots.

In [None]:
# Start 2x2 plot frame
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2)
axes = [ax1, ax2, ax3, ax4]
x = np.arange(1, len(mol_dict.values())+1)
colors = ['red', 'green', 'blue', 'cyan']

# Create subplots
for index in x-1:
    axes[index].bar(x, parameters[index], color=colors)

# Add rule of five thresholds as dashed lines
ax1.axhline(y=500, color="black", linestyle="dashed")
ax1.set_title("molecular weight (Da)")
ax2.axhline(y=10, color="black", linestyle="dashed")
ax2.set_title("# h-bond acceptors")
ax3.axhline(y=5, color="black", linestyle="dashed")
ax3.set_title("# h-bond donors")
ax4.axhline(y=5, color="black", linestyle="dashed")
ax4.set_title("logP")

# Add legend
# Label each colored patch with the molecule name (dictionary keys, not the Mol objects)
legend_elements = [mpatches.Patch(color=colors[ids], label=name) for ids, name in enumerate(mol_dict.keys())]
legend_elements.append(Line2D([0], [0], color="black", ls="dashed", label="Threshold"))
# The handles already carry their labels, so the threshold entry is kept in the legend
fig.legend(handles=legend_elements, bbox_to_anchor=(1.25, 0.5))

# Fit subplots and legend into figure
plt.tight_layout()

plt.show()

In the above bar chart we compared the rule of five properties (molecular weight, number of hydrogen bond donors and acceptors, LogP) for four example molecules. We can see that the four example drug molecules have different properties. In the next steps, we will investigate for each compound individually whether it violates the rule of five.

### Investigate compliance with Lipinski's rule of five

A function is defined to investigate whether a compound violates the rule of five and is applied to our example compounds.

In [None]:
from ADME import exo_rule_of_five

In [None]:
exo_rule_of_five.example(2)

In [None]:
## Write a function that checks whether a molecule complies with Lipinski's rule of five.
## The function takes a dictionary of SMILES as argument and returns "True" or "False".

def rule_of_five(smiles_dict):
    ## Iterate over the dictionary "smiles_dict".
    
    ## Convert each SMILES into a molecule using rdkit.
    
    ## Calculate the 4 descriptors using the methods of the rdkit "Descriptors" module.
    ## Store the descriptors in the variables "MW", "HBA", "HBD" and "LogP".
    
    ## Set "True" if at least 3 of the Lipinski conditions are fulfilled.
    
    ## Return a list of tuples containing the molecule name and "True" or "False".
    return

In [None]:
exo_rule_of_five.correction(rule_of_five)

In [None]:
# Each tuple holds the molecule name and whether it passes the rule of five
for name, passed in rule_of_five(smiles_dict):
    print(f"Rule of five accepted for {name}: {passed}")

Our `rule_of_five` function shows that two of the four example molecules do not pass the rule of five. From this we can infer that cyclosporine and beta-carotene are most likely not orally bioavailable. As all four compounds are available on the market as drugs, they must reach their target somehow. They could be exceptions to the rule, or they might be administered via a route other than oral administration. 

### Apply rule of five to the EGFR dataset

The `rule_of_five` function can be used to filter the main dataset by compliance with Lipinski's rule of five.

* Adjust the function to return all chemical parameters related to the rule of five
* Load main dataframe (`ChEMBL_df`)
* Apply rule of five function to `ChEMBL_df`
* Filter `ChEMBL_df` to remove compounds that violate more than one rule
* Save filtered dataframe

In [None]:
from ADME import exo_df_rule_of_five

In [None]:
exo_df_rule_of_five.example()

In [None]:
## Calculate the 4 descriptors and the compliance with Lipinski's rule from the "smiles" column of a dataframe.
## Return a dataframe with 5 columns containing the 4 descriptors and "yes" or "no" (compliance with Lipinski's rule).
def df_rule_of_five(df):
    return

In [None]:
exo_df_rule_of_five.correction(df_rule_of_five)

In [None]:
ChEMBL_df = pd.read_csv('../data/T1/EGFR_compounds.csv', index_col=0)
print(ChEMBL_df.shape)
ChEMBL_df.head()

In [None]:
ChEMBL_df = ChEMBL_df.join(df_rule_of_five(ChEMBL_df))

In [None]:
ChEMBL_df.head(15)

In [None]:
# Keep only compounds that comply with the rule of five
filtered_df = ChEMBL_df[ChEMBL_df['rule_of_5']=='yes']

In [None]:
# Info about data
print('# of compounds in unfiltered data set:', len(ChEMBL_df))
print('# of compounds in filtered data set:', len(filtered_df))
print("# of compounds not compliant with Lipinski's rule of five:", (len(ChEMBL_df)-len(filtered_df)))

# Save filtered data 
filtered_df.to_csv('../data/T2/EGFR_compounds_lipinski.csv', sep=';') 
filtered_df.head(15)

### Radar plot for visualization of rule of five properties

First, we define a function that calculates the mean and standard deviation of each property in a dataset. 

These statistics will later be used to plot the parameters related to Lipinski's rule of five for a dataset.
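
As a reminder of what these two statistics measure, here is a minimal sketch using Python's standard `statistics` module on a made-up list of molecular weights. The exercise below works on a dataframe instead; the list and its values are assumptions for illustration only.

```python
from statistics import mean, stdev

# Hypothetical molecular weights (Da) of five compounds
mw_values = [310.0, 350.0, 420.0, 455.0, 465.0]

mw_mean = mean(mw_values)  # arithmetic mean
mw_std = stdev(mw_values)  # sample standard deviation (n - 1 in the denominator)

print(f"mean = {mw_mean:.1f}, std = {mw_std:.1f}")
```

Note that `stdev` computes the sample standard deviation, which is also the default behavior of the `std` method in pandas.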

In [None]:
from ADME import exo_get_properties_stats

In [None]:
exo_get_properties_stats.example()

In [None]:
## Write a function that calculates the mean and the standard deviation of these physicochemical properties.
## It takes a dataframe as argument and returns a dataframe with the rows "HBD", "HBA", "MW" and "LogP".
## The columns are the mean ("mean") and the standard deviation ("std").
def get_properties_stats(data_df):
    return

In [None]:
exo_get_properties_stats.correction(get_properties_stats)

We calculate the statistics for the dataset of compounds compliant with Lipinski's rule of five (filtered dataset).

In [None]:
stats_rof = get_properties_stats(filtered_df)
stats_rof

We calculate the statistics for the dataset of compounds NOT compliant with Lipinski's rule of five.

In [None]:
stats_not_rof = get_properties_stats(ChEMBL_df[ChEMBL_df['rule_of_5']=='no'])
stats_not_rof

We create a function to visualize the compound properties with a radar chart. For this, we follow a [tutorial on stackoverflow](https://stackoverflow.com/questions/42227409/tutorial-for-python-radar-chart-plot).

In [None]:
def plot_radarplot(data_stats, output_path):
    """
    Function that plots a radar plot based on the mean and std of 4 physicochemical properties (HBD, HBA, MW and LogP).
    
    Input: 
    Dataframe with mean and std (columns) for each physicochemical property (rows).
    
    Output:
    Radar plot (saved as file and shown in Jupyter notebook).
    """

    # Get data points for lines
    std_1 = [data_stats["mean"]["HBD"] + data_stats["std"]["HBD"], 
             (data_stats["mean"]["HBA"]/2) + (data_stats["std"]["HBA"]/2), 
             (data_stats["mean"]["MW"]/100) + (data_stats["std"]["MW"]/100), 
             data_stats["mean"]["LogP"] + data_stats["std"]["LogP"]]
    std_2 = [data_stats["mean"]["HBD"] - data_stats["std"]["HBD"], 
             (data_stats["mean"]["HBA"]/2) - (data_stats["std"]["HBA"]/2), 
             (data_stats["mean"]["MW"]/100) - (data_stats["std"]["MW"]/100), 
             data_stats["mean"]["LogP"] - data_stats["std"]["LogP"]]
    mean_val = [data_stats["mean"]["HBD"], (data_stats["mean"]["HBA"]/2), 
                (data_stats["mean"]["MW"]/100), data_stats["mean"]["LogP"]]

    # Get data points for (filled) area (rule of five)
    rule_conditions = [5, (10/2), (500/100), 5]
    
    # Define property names
    parameters = ['# H-bond donors', '# H-bond acceptors/2', 'Molecular weight (Da)/100', 'LogP']

    # Number of properties (one radial axis per property)
    N = len(rule_conditions)

    # Set font size
    fontsize = 16

    # Angles for the condition axes
    x_as = [n / float(N) * 2 * pi for n in range(N)]

    # Since our chart will be circular we need to append a copy of the first
    # Value of each list at the end of each list with data
    std_1 += std_1[:1]
    std_2 += std_2[:1]
    mean_val += mean_val[:1]
    rule_conditions += rule_conditions[:1]
    x_as += x_as[:1]

    # Set figure size
    plt.figure(figsize=(8,8))

    # Set color of axes
    plt.rc('axes', linewidth=2, edgecolor="#888888")

    # Create polar plot
    ax = plt.subplot(111, polar=True)

    # Start with the first axis at the top and rotate clockwise
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)

    # Set position of y-labels
    ax.set_rlabel_position(0)

    # Set color and linestyle of grid
    ax.xaxis.grid(True, color="#888888", linestyle='solid', linewidth=2)
    ax.yaxis.grid(True, color="#888888", linestyle='solid', linewidth=2)

    # Set number of radial axes and remove labels
    plt.xticks(x_as[:-1], [])

    # Set yticks
    plt.yticks([1, 3, 5, 7], ["1", "3", "5","7"], size=fontsize,)

    # Set axes limits
    plt.ylim(0, 7)

    # Plot data
    # Mean values
    ax.plot(x_as, mean_val, 'b', linewidth=3, linestyle='solid', zorder=3)

    # Standard deviation
    # Colors are set via the color keyword only (no conflicting format string)
    ax.plot(x_as, std_1, linewidth=2, linestyle='dashed', zorder=3, color='#111111')
    ax.plot(x_as, std_2, linewidth=2, linestyle='dashed', zorder=3, color='#333333')

    # Fill area
    ax.fill(x_as, rule_conditions, "#3465a4", alpha=0.2)

    # Draw the property labels outside the plot so that they fit properly
    for i in range(N):
        angle_rad = i / float(N) * 2 * pi
        if angle_rad == 0:
            ha, distance_ax = "center", 1
        elif 0 < angle_rad < pi:
            ha, distance_ax = "left", 1
        elif angle_rad == pi:
            ha, distance_ax = "center", 1
        else:
            ha, distance_ax = "right", 1
        ax.text(angle_rad, 8 + distance_ax, parameters[i], size=fontsize,
                horizontalalignment=ha, verticalalignment="center")

    # Add legend to the right of the plot (once, after all labels are drawn)
    labels = ('Mean', 'Mean + std', 'Mean - std', 'Rule of five area')
    legend = ax.legend(labels, loc=(1.1, .7),
                       labelspacing=0.3, fontsize=fontsize)
    plt.tight_layout()

    # Save plot - use bbox_inches to include text boxes:
    # https://stackoverflow.com/questions/44642082/text-or-legend-cut-from-matplotlib-figure-on-savefig?rq=1
    plt.savefig(output_path, dpi=300, bbox_inches="tight", transparent=True)

    # Show polar plot
    plt.show()

First, we plot the dataset filtered by the rule of five.

In [None]:
plot_radarplot(stats_rof, "../data/T2/radarplot_rof.png")

In the radar chart above, the blue square marks the area in which the physicochemical properties comply with the rule of five. The blue line connects the mean values of our filtered dataset, while the dashed lines show the mean plus and minus one standard deviation. The mean values do not violate any of Lipinski's rules. However, the upper standard deviation line exceeds the threshold for some properties. This is acceptable, because a compound may violate one of the four rules and still pass the filter.
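
The rule of five area appears as a square because `plot_radarplot` rescales each property so that all four thresholds land at the same radial value of 5. Below is a minimal sketch of that scaling; the mean values are made up and serve only as an illustration.

```python
# Divisors used in plot_radarplot to bring every threshold to 5
scaling = {"HBD": 1, "HBA": 2, "MW": 100, "LogP": 1}
thresholds = {"HBD": 5, "HBA": 10, "MW": 500, "LogP": 5}

# Hypothetical mean property values of a dataset
means = {"HBD": 1.9, "HBA": 6.2, "MW": 414.0, "LogP": 4.1}

scaled_thresholds = {name: thresholds[name] / scaling[name] for name in thresholds}
scaled_means = {name: means[name] / scaling[name] for name in means}

for name in means:
    inside = scaled_means[name] <= scaled_thresholds[name]
    print(f"{name}: scaled mean {scaled_means[name]:.2f}, inside rule of five area: {inside}")
```

Without this rescaling, the molecular weight axis would dominate the chart and the other three properties would be hard to read.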

Second, we take a look at the compounds that violate the rule of five.

In [None]:
plot_radarplot(stats_not_rof, "../data/T2/radarplot_not_rof.png")

We see that compounds mostly violate the rule of five based on their logP value and their molecular weight.

## Discussion
Lipinski's rule of five focuses on oral bioavailability. Drugs can also be administered via alternative routes, e.g. inhalation, skin penetration, or injection. Keep in mind that the rule of five is only a guide for estimating oral bioavailability, and there are exceptions in both directions. With bioavailability, we have looked at only one of several ADME properties. 

There are webservers/programmes available to get a whole picture of ADME properties, e.g. [SwissADME](http://www.swissadme.ch/).

## Quiz

In [None]:
from nbautoeval import run_yaml_quiz

In [None]:
run_yaml_quiz("../corrections/quiz/ADME.yaml", "theoric-quiz")

In [None]:
run_yaml_quiz("../corrections/quiz/ADME.yaml", "code-quiz")