# BindingDB and DTI tutorial

BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules. Such an interaction is called a Drug-Target Interaction (DTI).  
Two main applications of DTI models are:

1. Drug screening - Identify ligand candidates that can bind to a protein of interest.
2. Drug repurposing - Find new therapeutic purposes (protein targets) for an existing drug.

BindingDB contains about 2.4M binding reaction data samples. Each sample contains information about the drug target (usually a protein: a long sequence built from the 20 standard amino acids), the ligand (a small molecule that drug designers want to bind to the target), and several possible measurements of the "binding capability". Typically the binding is *noncovalent*, i.e. reversible: the binding atoms don't share electrons, and the binding instead occurs at the level of electromagnetic interactions.
 
In total, these 2.4M reaction samples consist of ~8,800 protein targets and 1M small molecules.

A straightforward representation of this data is a single CSV/TSV file with 2.4M rows. Each row has columns for a SMILES string representation of the ligand, a string representation (amino acid code sequence) of the target protein, binding affinity measurements, and additional information such as identifiers of the interaction or of the compounds in other databases.
This is the representation that we explore in this tutorial. Note that BindingDB also offers 2D and 3D representations of the compounds for download.
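
As a rough sketch of this row layout, the snippet below parses a tiny, made-up TSV with Python's standard `csv` module. The column names (`Ligand SMILES`, `Target Sequence`, `Kd (nM)`, `IC50 (nM)`) and all values are illustrative assumptions, not the exact BindingDB headers, and the real file has many more columns.

```python
import csv
import io

# Made-up rows for illustration only; the sequences are short placeholder strings.
TOY_TSV = (
    "Ligand SMILES\tTarget Sequence\tKd (nM)\tIC50 (nM)\n"
    "CCO\tMSNVPHKSSL\t24000\t\n"
    "c1ccccc1\tMKTAYIAKQR\t\t350\n"
    "CC(=O)O\tMSNVPHKSSL\t>100000\t\n"
)


def read_rows(tsv_text):
    """Parse TSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def count_measured(rows, column):
    """Count rows in which the given measurement column is non-empty."""
    return sum(1 for row in rows if row[column].strip())


rows = read_rows(TOY_TSV)
print(len(rows), "rows")
print("rows with Kd:", count_measured(rows, "Kd (nM)"))
print("rows with IC50:", count_measured(rows, "IC50 (nM)"))
```

Most rows carry only one type of measurement, which is why the choice of label matters so much for the size of the resulting dataset.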

## Measuring binding capability
1. **Dissociation constant - $K_{D}$**  
    Consider a solution with fixed concentrations of dissolved ligands and proteins that has been given enough time to reach equilibrium. Let $[P]$, $[L]$ and $[PL]$ denote the equilibrium concentrations of free protein, free ligand, and protein-ligand complex, respectively. The dissociation constant is defined as $K_{D}=\frac{[P][L]}{[PL]}$, or equivalently $\frac{K_{D}}{[L]}=\frac{[P]}{[PL]}$. 
    So when $[L]=K_{D}$, we get $[P]=[PL]$, and therefore $\frac{[PL]}{[P]+[PL]}=\frac{1}{2}$.
    In other words, $K_{D}$ is the free-ligand concentration at which half of the total protein is bound to a ligand.  
    A small $K_{D}$ means that less ligand is needed to reach this point, so a smaller value indicates stronger binding affinity. In practice, a smaller dose of the candidate drug is needed to have an effect, which generally means fewer side effects.

2. **$IC50$**  
    This measure of binding affinity is used in enzyme inhibition assays. Enzymes are one type of protein target, and the goal is to find small molecules that bind to and *inhibit* them. Such molecules/drugs are called "enzyme inhibitors". For enzymes, $K_{D}$ is usually termed $K_{i}$ (inhibition constant), but studies often report $IC50$ instead: the ligand concentration that reduces enzyme activity by 50%. This sounds similar to the definition of $K_{D}$, but it differs because in a typical enzymatic assay the inhibitor is not the only molecule trying to bind the enzyme's active site; it competes with the enzyme's physiological substrate. So if the substrate concentration is very low, $IC50$ approximates $K_{D}$; otherwise it is greater (more ligand is required to reach 50% inhibition).  

3. **$EC50$**  
    This is another measure, similar in principle, of the concentration that gives a half-maximal response. It is used in another type of assay, in which a protein is expressed in a cell in such a way that its level of activation as a result of ligand binding can be detected. 
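
The relations above can be made concrete with a short sketch. The first function follows directly from the definition of $K_{D}$: rearranging gives a bound fraction of $\frac{[L]}{K_{D}+[L]}$. The second uses the Cheng-Prusoff relation for a purely competitive inhibitor, $IC50 = K_{i}\left(1+\frac{[S]}{K_{m}}\right)$, which is not stated in the text above and is brought in here as an assumption about the assay type; $[S]$ is the substrate concentration and $K_{m}$ the Michaelis constant. The function names are hypothetical.

```python
def fraction_bound(ligand_conc, kd):
    """Fraction of protein bound at equilibrium: [PL] / ([P] + [PL]).

    ligand_conc is the free ligand concentration [L], in the same units as kd.
    """
    return ligand_conc / (kd + ligand_conc)


def ic50_competitive(ki, substrate_conc, km):
    """IC50 of a competitive inhibitor via the Cheng-Prusoff relation.

    Assumes simple competitive inhibition; all concentrations share the same units.
    """
    return ki * (1.0 + substrate_conc / km)


# At [L] = K_D, half of the protein is bound.
print(fraction_bound(ligand_conc=10.0, kd=10.0))

# With no substrate, IC50 equals K_i; with [S] = K_m, IC50 is twice K_i.
print(ic50_competitive(ki=10.0, substrate_conc=0.0, km=5.0))
print(ic50_competitive(ki=10.0, substrate_conc=5.0, km=5.0))
```

This matches the qualitative statement above: as the substrate concentration goes to zero, $IC50$ approaches the inhibition constant, and it grows as more substrate competes for the active site.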

## Reading data from BindingDB
The [DeepPurpose](https://github.com/kexinhuang12345/DeepPurpose/) open source library has a ready-made helper function that reads data from BindingDB and prepares it for ML.
We manually downloaded from the BindingDB website a single TSV file, `BindingDB_All.tsv`, containing the \~2.4M samples mentioned earlier, and manually converted it to Pickle format for faster reading (\~20 seconds instead of \~60). 
This file, like the other data subsets and the 2D/3D representations on BindingDB, is updated periodically, so for the sake of reproducibility we note that the one used here was downloaded on April 26, 2022.

```python
from DeepPurpose import dataset
from tutorials.utils import BindingDB
import pandas as pd

# download data
data_dir = "./data"
BindingDB.download(data_dir)

# load data
data_path = "./data/BindingDB_All.pkl"
df = pd.read_pickle(data_path)
X_drugs, X_targets, y = dataset.process_BindingDB(
    path=data_path, df=df, y="Kd", binary=False, convert_to_log=True, threshold=30
)
```



```text
Downloading zip file:
Downloading zip file: DONE
Extracting zip:
Extracting zip: DONE
Pickling data:
Pickling data: DONE
Loading Dataset from the pandas input...
Beginning Processing...
There are 82809 drug target pairs.
Default set to logspace (nM -> p) for easier regression
```


This function reads the file and, after some processing, returns three arrays of the same length. `X_drugs` contains ligand SMILES strings, `X_targets` contains target sequence (amino acid code sequence) strings, and `y` contains the labels, in this case $K_{D}$ values (in nM) converted to a logarithmic scale: $y = -\log_{10}\left( K_{D} \cdot 10^{-9} \right)$.

The processing function of DeepPurpose is simple. Here is most of what it does after reading the whole file of \~2.4M rows:
1. Remove targets with more than one protein chain (multichain complexes). This leaves \~2.3M rows.
2. Keep only rows in which the $K_{D}$ measurement exists. This leaves only \~94,000 rows. For comparison, choosing $IC50$ would leave \~1.4M rows, $K_{i}$ would leave \~500k, and $EC50$ would leave \~200k.
    How to correctly combine more than one measurement type remains an open question.
3. Some $K_{D}$ values contain a '<' or '>' sign. The function simply removes the sign and keeps the number. (It is not clear that this is appropriate, since such values are bounds rather than exact measurements.)
4. Remove samples with $K_{D}$ larger than $10^7$ nM. (Presumably such weak binders are not good drug candidates.) This leaves \~83k rows.
5. Convert units from nM to p (negative log molar):
    $y = -\log_{10}\left( K_{D} \cdot 10^{-9} \right)$
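
The steps above can be sketched in plain Python on a handful of toy records. This is not DeepPurpose's actual implementation (which operates on a pandas DataFrame); the record keys `n_chains` and `Kd_nM` and the function name `process_records` are hypothetical, chosen only to mirror the five steps.

```python
import math


def process_records(records, max_nM=1e7):
    """Apply the five filtering/conversion steps to toy records.

    Each record is a dict with hypothetical keys:
    'n_chains' (int) and 'Kd_nM' (str or None).
    """
    processed = []
    for rec in records:
        # Step 1: drop multichain targets.
        if rec["n_chains"] != 1:
            continue
        # Step 2: keep only rows that have a Kd measurement.
        raw = rec["Kd_nM"]
        if raw is None or not raw.strip():
            continue
        # Step 3: strip '<' / '>' and keep the number (assumed to be positive).
        kd_nM = float(raw.replace("<", "").replace(">", ""))
        # Step 4: drop very weak binders.
        if kd_nM > max_nM:
            continue
        # Step 5: convert nM to p, i.e. -log10 of the molar concentration.
        label = -math.log10(kd_nM * 1e-9)
        processed.append({**rec, "Kd_nM": kd_nM, "label": label})
    return processed


toy_records = [
    {"n_chains": 1, "Kd_nM": "10"},          # kept, label ~ 8
    {"n_chains": 2, "Kd_nM": "5"},           # dropped in step 1
    {"n_chains": 1, "Kd_nM": None},          # dropped in step 2
    {"n_chains": 1, "Kd_nM": ">100000000"},  # dropped in step 4
    {"n_chains": 1, "Kd_nM": "<1000"},       # kept after stripping '<', label ~ 6
]

result = process_records(toy_records)
for rec in result:
    print(rec["Kd_nM"], round(rec["label"], 3))
```

Note how the `<1000` record ends up with the same label as an exact measurement of 1000 nM, which is the concern raised in step 3.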

```python
print(f"length of X_target: {len(X_targets)}, X_drugs: {len(X_drugs)}, y: {len(y)}")
print(f"minimum label: {y.min()}, maximum label: {y.max()}")

# Look at a single example (index 200, chosen arbitrarily)
print("Example (index 200):")
print(f"Target sequence:\n {X_targets[200]}")
print(f"Ligand SMILES string:\n {X_drugs[200]}")
print(f"Label (binding affinity), -log10(Kd*1e-9): {y[200]}")
```

```text
length of X_target: 82809, X_drugs: 82809, y: 82809
minimum label: 2.0, maximum label: 15.0
Example (index 200):
Target sequence:
 MSNVPHKSSLPEGIRPGTVLRIRGLVPPNASRFHVNLLCGEEQGSDAALHFNPRLDTSEVVFNSKEQGSWGREERGPGVPFQRGQPFEVLIIASDDGFKAVVGDAQYHHFRHRLPLARVRLVEVGGDVQLDSVRIF
Ligand SMILES string:
 CO[C@@H]1O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@H](O)[C@H](NC(=S)NC3CCCCC3)[C@H]2O)[C@H](O)[C@H]1NC(C)=O
Label (binding affinity), -log10(Kd*1e-9): 4.619788758288394
```
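
To read such a label in more familiar units, the log conversion can be inverted: $K_{D}\,[nM] = 10^{9-y}$. The helper below is a small sketch (the name `p_to_nM` is ours, not part of DeepPurpose), applied to the label of the example printed above.

```python
def p_to_nM(p_value):
    """Invert y = -log10(Kd_nM * 1e-9) and return Kd in nM."""
    return 10 ** (9 - p_value)


label = 4.619788758288394  # label of the example at index 200
kd_nM = p_to_nM(label)
print(f"Kd = {kd_nM:.0f} nM = {kd_nM / 1000:.1f} uM")
```

For this example the result is about 24,000 nM (24 µM), a fairly weak binder. The minimum label of 2.0 corresponds to $10^7$ nM, the cutoff applied during processing.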


## DTI prediction
[This](https://github.com/kexinhuang12345/DeepPurpose/blob/master/Tutorial_1_DTI_Prediction.ipynb) tutorial by DeepPurpose demonstrates how the library can be used to train affinity prediction models, given data in the format above (drug-target string sequence pairs with corresponding affinity labels), and use them for drug screening and repurposing.

## Benchmarks / Leaderboards

[Therapeutics Data Commons](https://tdcommons.ai) defines a [benchmark](https://tdcommons.ai/benchmark/dti_dg_group/bindingdb_patent/) for Drug-Target Interaction (DTI) based on BindingDB. Its authors point out a problem with how existing ML models are typically evaluated: the test set contains unseen compound-target pairs, but the individual targets and compounds have already been seen during training.  
In practice, pharma companies screen new targets and compounds over the years, so it is desirable that models generalize to this shift.  
To capture this, the benchmark uses patented DTI data, with years 2013-2018 for training and 2019-2021 for testing.
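
A year-based split of this kind can be sketched on toy records as follows. This is not the Therapeutics Data Commons API; the record fields (`year`, `target`, `drug`) and the function names are hypothetical and only illustrate the idea of training on earlier years and testing on later ones.

```python
def temporal_split(records, train_years=(2013, 2018), test_years=(2019, 2021)):
    """Split records by year into (train, test); both year ranges are inclusive."""
    train = [r for r in records if train_years[0] <= r["year"] <= train_years[1]]
    test = [r for r in records if test_years[0] <= r["year"] <= test_years[1]]
    return train, test


def unseen_fraction(train, test, key):
    """Fraction of test records whose `key` value never appears in train."""
    seen = {r[key] for r in train}
    if not test:
        return 0.0
    return sum(1 for r in test if r[key] not in seen) / len(test)


# Made-up records for illustration.
toy = [
    {"year": 2014, "target": "T1", "drug": "D1"},
    {"year": 2017, "target": "T2", "drug": "D2"},
    {"year": 2019, "target": "T1", "drug": "D3"},
    {"year": 2021, "target": "T3", "drug": "D4"},
    {"year": 2012, "target": "T4", "drug": "D5"},  # outside both ranges
]

train, test = temporal_split(toy)
print(len(train), "train /", len(test), "test")
print("unseen targets in test:", unseen_fraction(train, test, "target"))
print("unseen drugs in test:", unseen_fraction(train, test, "drug"))
```

Unlike a random split over pairs, a split like this naturally puts targets and compounds into the test set that never occurred in training, which is the shift the benchmark is designed to measure.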

