# Compound Data Acquisition

Welcome to the practical course Molecular Modelling!

This course focuses on computational methods for modeling molecules and builds on the expertise you have already gained in the lecture. You will try out different methods and data visualization techniques throughout the course, which ends with a poster presentation. As this is a practical course, you will need to do additional research on your target, the methods you use, and so on. If something is unclear, feel free to ask our tutors, Robert Wild and Florian Wedl!

In this notebook, we will learn more about the ChEMBL database and how to extract data from ChEMBL, i.e. (compound, activity data) pairs for a target of interest. These data sets can be used for many cheminformatics tasks, such as similarity search, clustering or machine learning.

Our work here will include finding compounds that were tested against your target and filtering the available bioactivity data.

Goal: Get a list of compounds with bioactivity data for your target

## ChEMBL database

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs. See https://www.ebi.ac.uk/chembl/ for more information.

Current data content (as of 09.2020, ChEMBL 27):

 - more than 1.9 million distinct compounds
 - more than 16 million activity values
 - assays mapped to ~13,000 targets

## Compound activity measures

How is the potency of a drug usually measured?

Check the following reference: https://en.wikipedia.org/wiki/IC50
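Because IC50 values often span several orders of magnitude, they are commonly compared on a negative logarithmic scale (pIC50). The following is a minimal sketch of that conversion, assuming the IC50 is given in nM; the helper name `ic50_nm_to_pic50` is hypothetical and not part of the course material.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9 - math.log10(ic50_nm)


# 540 nM is used here only as an illustrative input value
pic50 = ic50_nm_to_pic50(540.0)
print(f"pIC50 = {pic50:.2f}")
```

A lower IC50 corresponds to a higher pIC50, so more potent compounds get larger values on this scale.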

## Exercise 1

As you might already know from the hands-on session of the lecture, we need certain packages and Python libraries to run specific code.

In [1]:
import math
from pathlib import Path
from zipfile import ZipFile
from tempfile import TemporaryDirectory

import numpy as np
import pandas as pd
from rdkit.Chem import PandasTools
from chembl_webresource_client.new_client import new_client
from tqdm.auto import tqdm
import os

ModuleNotFoundError: No module named 'rdkit'

Next, we change into the directory where we want to store our generated data. We name this directory 'data'.

In [2]:
# Adjust this path to the 'data' directory on your own machine
os.chdir('/media/storage_6/afi/Molecular_modelling_PR_23/test_run_afi/data')
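The hard-coded path above only exists on one particular machine. A more portable way to set up the 'data' directory could look like the following minimal sketch; the variable name `DATA` and the choice of the current working directory as the parent folder are assumptions, not part of the course setup.

```python
from pathlib import Path

# Assumption: the data folder should live inside the current working directory
DATA = Path.cwd() / "data"

# Create the folder if it is missing; do nothing if it already exists
DATA.mkdir(parents=True, exist_ok=True)

print(f"Generated data will be stored in: {DATA}")
```

Files can then be addressed as `DATA / "some_file.csv"` without changing the working directory.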

Next, we create resource objects for API access. For background on APIs, see https://en.wikipedia.org/wiki/API

In [3]:
targets_api = new_client.target
compounds_api = new_client.molecule
bioactivities_api = new_client.activity

type(targets_api)

chembl_webresource_client.query_set.QuerySet

Get the UniProt ID of your target of interest from the UniProt website: https://www.uniprot.org/

In [29]:
uniprot_id = "Q13936"

Now we fetch the target data from ChEMBL.

In [30]:
# Get target information from ChEMBL but restrict it to specified values only
targets = targets_api.get(target_components__accession=uniprot_id).only(
    "target_chembl_id", "organism", "pref_name", "target_type"
)
print(f'The type of the targets is "{type(targets)}"')

The type of the targets is "<class 'chembl_webresource_client.query_set.QuerySet'>"



To use the target data, we need to download it from ChEMBL.

The results of the query are stored in targets, a QuerySet; that is, the results are not fetched from ChEMBL until we ask for them (here using pandas.DataFrame.from_records).


In [31]:
targets = pd.DataFrame.from_records(targets)
targets

|   | organism | pref_name | target_chembl_id | target_type |
|---|---|---|---|---|
| 0 | Homo sapiens | Voltage-gated L-type calcium channel alpha-1C ... | CHEMBL1940 | SINGLE PROTEIN |
| 1 | Homo sapiens | Voltage-gated L-type calcium channel alpha-1C ... | CHEMBL1940 | SINGLE PROTEIN |
| 2 | Homo sapiens | Voltage-gated L-type calcium channel | CHEMBL2095229 | PROTEIN FAMILY |
| 3 | Homo sapiens | Voltage-gated calcium channel | CHEMBL2363032 | PROTEIN COMPLEX GROUP |
| 4 | Homo sapiens | Voltage-dependent L-type calcium channel alpha... | CHEMBL3988638 | PROTEIN COMPLEX |
| 5 | Homo sapiens | L-type calcium channel alpha-1c/beta-2/alpha2d... | CHEMBL4106164 | PROTEIN COMPLEX |
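The idea of a lazily evaluated query can be illustrated without any network access. The sketch below uses a plain Python generator as a hypothetical stand-in for a QuerySet (it is not the ChEMBL client itself); the records and the names `lazy_records` and `fetch_log` are made up for illustration.

```python
import pandas as pd

fetch_log = []  # records when the simulated "download" actually runs


def lazy_records():
    """Hypothetical stand-in for a lazy query: the body runs only on iteration."""
    fetch_log.append("fetched")
    yield {"target_chembl_id": "CHEMBL_A", "target_type": "SINGLE PROTEIN"}
    yield {"target_chembl_id": "CHEMBL_B", "target_type": "PROTEIN FAMILY"}


query = lazy_records()  # creating the object does not fetch anything yet
fetched_before = len(fetch_log)

# Building the DataFrame iterates over the records and triggers the fetch
targets_demo = pd.DataFrame.from_records(query)
fetched_after = len(fetch_log)

print(f"Fetches before: {fetched_before}, after: {fetched_after}")
print(targets_demo)
```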


After checking the entries, select your target of interest.

In [32]:
target = targets.iloc[0]
target

organism                                                 Homo sapiens
pref_name           Voltage-gated L-type calcium channel alpha-1C ...
target_chembl_id                                           CHEMBL1940
target_type                                            SINGLE PROTEIN
Name: 0, dtype: object
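Picking the row by position works here, but it relies on the order of the results. As a hedged alternative, one could select by target type instead; the sketch below uses a small toy table with made-up IDs and names, and assumes that a single protein target is what you are looking for.

```python
import pandas as pd

# Toy table mirroring the columns used above; all values are illustrative only
targets_demo = pd.DataFrame(
    {
        "pref_name": ["Channel subunit", "Channel subunit", "Channel family"],
        "target_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B"],
        "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "PROTEIN FAMILY"],
    }
)

# Keep single-protein targets and drop repeated rows before picking one
single_proteins = targets_demo[
    targets_demo["target_type"] == "SINGLE PROTEIN"
].drop_duplicates("target_chembl_id")

target_demo = single_proteins.iloc[0]
print(target_demo["target_chembl_id"])
```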

Great! You found your target. Now you should save its ChEMBL ID.

In [33]:
# Get the ChEMBL ID from the DataFrame row
chembl_id = target['target_chembl_id']
chembl_id

'CHEMBL1940'

### Get bioactivity data

Now, we want to fetch the bioactivity data for the target of interest.

In this step, we fetch the bioactivity data and filter it to only consider:

 - human proteins (implied here, since the selected target is a Homo sapiens entry),
 - bioactivity type IC50,
 - exact measurements (relation '='), and
 - binding data (assay type 'B').

In [34]:
bioactivities = bioactivities_api.filter(
    target_chembl_id=chembl_id, type="IC50", relation="=", assay_type="B"
).only(
    "activity_id",
    "assay_chembl_id",
    "assay_description",
    "assay_type",
    "molecule_chembl_id",
    "type",
    "standard_units",
    "relation",
    "standard_value",
    "target_chembl_id",
    "target_organism",
)

print(f"Length and type of bioactivities object: {len(bioactivities)}, {type(bioactivities)}")

Length and type of bioactivities object: 107, <class 'chembl_webresource_client.query_set.QuerySet'>
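The four filter criteria can also be read as boolean conditions on a table. The following sketch applies them locally to a toy DataFrame with made-up records; it is not how the ChEMBL client filters on the server side, only an illustration of what the criteria mean.

```python
import pandas as pd

# Toy records; only the first one satisfies all four criteria
records = pd.DataFrame(
    {
        "molecule_chembl_id": ["M1", "M2", "M3", "M4"],
        "type": ["IC50", "Ki", "IC50", "IC50"],
        "relation": ["=", "=", ">", "="],
        "assay_type": ["B", "B", "B", "F"],
        "target_organism": ["Homo sapiens"] * 4,
    }
)

mask = (
    (records["target_organism"] == "Homo sapiens")  # human proteins
    & (records["type"] == "IC50")  # bioactivity type IC50
    & (records["relation"] == "=")  # exact measurements
    & (records["assay_type"] == "B")  # binding data
)
filtered = records[mask]
print(filtered)
```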


In our bioactivity set, each entry holds important information. Try to find out the molecule_chembl_id of the first entry of the set.

In [35]:
ligand_chembl_id = bioactivities[0]['molecule_chembl_id']
ligand_chembl_id
## TODO: check the bioactivity set

'CHEMBL343771'

Now we download the bioactivity data from ChEMBL, similar to what we did for your target, so that we can inspect and filter it as a pandas DataFrame.

In [36]:
## TODO: download the bioactivity data and check the content of your df! use a similar procedure
bioactivities_df = pd.DataFrame.from_records(bioactivities)
print(f"DataFrame shape: {bioactivities_df.shape}")
bioactivities_df.head()

DataFrame shape: (108, 13)


|   | activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | standard_units | standard_value | target_chembl_id | target_organism | type | units | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 439718 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL343771 | = | nM | 540.0 | CHEMBL1940 | Homo sapiens | IC50 | uM | 0.54 |
| 1 | 439718 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL343771 | = | nM | 540.0 | CHEMBL1940 | Homo sapiens | IC50 | uM | 0.54 |
| 2 | 447874 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL138302 | = | nM | 390.0 | CHEMBL1940 | Homo sapiens | IC50 | uM | 0.39 |
| 3 | 458410 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL138302 | = | nM | 260.0 | CHEMBL1940 | Homo sapiens | IC50 | uM | 0.26 |
| 4 | 458417 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL6966 | = | nM | 150.0 | CHEMBL1940 | Homo sapiens | IC50 | uM | 0.15 |


In [37]:
len(bioactivities_df)

108

Note that the first two rows describe the same bioactivity entry; we will remove such artifacts later during the deduplication step. 

Note also that we have columns for standard_units/units and standard_value/value; in the following, we will use the standardized columns (standardization by ChEMBL), and thus we drop the other two columns.

So first find the unique elements of the units column to see which units were originally reported, then drop the units and value columns.


In [None]:
#TODO: drop the units and value columns
bioactivities_df.drop(["units", "value"], axis=1, inplace=True)
bioactivities_df.head()

|   | activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | standard_units | standard_value | target_chembl_id | target_organism | type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 439718 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL343771 | = | nM | 540.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 1 | 439718 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL343771 | = | nM | 540.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 2 | 447874 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL138302 | = | nM | 390.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 3 | 458410 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL138302 | = | nM | 260.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 4 | 458417 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL6966 | = | nM | 150.0 | CHEMBL1940 | Homo sapiens | IC50 |
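Inspecting the originally reported units before dropping them can be sketched as follows. The toy table below only mimics the four relevant columns, and its values are made up for illustration.

```python
import pandas as pd

# Toy data: the standardized columns are in nM, the reported ones vary
demo_df = pd.DataFrame(
    {
        "standard_units": ["nM", "nM", "nM"],
        "standard_value": ["540.0", "390.0", "12.0"],
        "units": ["uM", "uM", "nM"],
        "value": ["0.54", "0.39", "12.0"],
    }
)

# Which units were reported before ChEMBL standardized them?
reported_units = demo_df["units"].unique()
print(f"Reported units: {reported_units}")

# Keep only the standardized columns
demo_df = demo_df.drop(columns=["units", "value"])
print(list(demo_df.columns))
```

Using `drop(columns=...)` and reassigning the result is equivalent to the `axis=1, inplace=True` form used in the notebook cell.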


Now we filter and preprocess our bioactivity data.
We want to do the following:

1. Convert standard_value’s datatype from object to float
2. Delete entries with missing values
3. Keep only entries with standard_units == nM
4. Delete duplicate molecules
5. Reset DataFrame index
6. Rename columns
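The six steps above can be sketched as one function applied to a toy table. This is only an illustration under assumptions: the helper name `preprocess_bioactivities` is hypothetical, the data is made up, and `pd.to_numeric` is used in place of `.astype()` so that unparsable values become NaN and are removed in step 2.

```python
import pandas as pd


def preprocess_bioactivities(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the six preprocessing steps to a copy of df (hypothetical helper)."""
    out = df.copy()
    # 1. Convert standard_value to float (unparsable values become NaN)
    out["standard_value"] = pd.to_numeric(out["standard_value"], errors="coerce")
    # 2. Delete entries with missing values
    out = out.dropna(axis=0, how="any")
    # 3. Keep only entries measured in nM
    out = out[out["standard_units"] == "nM"]
    # 4. Delete duplicate molecules, keeping the first entry
    out = out.drop_duplicates("molecule_chembl_id", keep="first")
    # 5. Reset the DataFrame index
    out = out.reset_index(drop=True)
    # 6. Rename columns
    return out.rename(columns={"standard_value": "IC50", "standard_units": "units"})


# Toy input: one duplicate (M1), one non-nM entry (M3), one missing value (M4)
raw = pd.DataFrame(
    {
        "molecule_chembl_id": ["M1", "M1", "M2", "M3", "M4"],
        "standard_units": ["nM", "nM", "nM", "ug.mL-1", "nM"],
        "standard_value": ["540.0", "540.0", "390.0", "1.5", None],
    }
)
clean = preprocess_bioactivities(raw)
print(clean)
```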

In [None]:
##TODO: Convert standard_value’s datatype in bioactivities_df from object to float (hint: use .astype())
bioactivities_df = bioactivities_df.astype({"standard_value": "float64"})
bioactivities_df.dtypes

activity_id             int64
assay_chembl_id        object
assay_description      object
assay_type             object
molecule_chembl_id     object
relation               object
standard_units         object
standard_value        float64
target_chembl_id       object
target_organism        object
type                   object
dtype: object

In [None]:
##TODO: Delete entries with missing values (hint: use .dropna())
bioactivities_df.dropna(axis=0, how="any", inplace=True)
print(f"DataFrame shape: {bioactivities_df.shape}")

DataFrame shape: (108, 11)


In [None]:
##TODO: Keep only entries with standard_units == nM (which means that you should delete entries with standard_units != nM)
print(f"Units in downloaded data: {bioactivities_df['standard_units'].unique()}")
print(
    f"Number of non-nM entries:\
    {bioactivities_df[bioactivities_df['standard_units'] != 'nM'].shape[0]}"
)
#bioactivities_df = bioactivities_df[bioactivities_df["standard_units"] == "nM"]

Units in downloaded data: ['nM']
Number of non-nM entries:    0


In [None]:
bioactivities_df = bioactivities_df[bioactivities_df["standard_units"] == "nM"]
print(f"Units after filtering: {bioactivities_df['standard_units'].unique()}")

Units after filtering: ['nM']


In [None]:
##TODO: Delete duplicate molecules (hint: use .drop_duplicates()) and make sure that you keep the first entry of each duplicate!
bioactivities_df.drop_duplicates("molecule_chembl_id", keep="first", inplace=True)
print(f"DataFrame shape: {bioactivities_df.shape}")

DataFrame shape: (97, 11)


In [None]:
##TODO: reset data frame index, in order to be able to iterate over the index later
bioactivities_df.reset_index(drop=True, inplace=True)
bioactivities_df.head()

|   | activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | standard_units | standard_value | target_chembl_id | target_organism | type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 439718 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL343771 | = | nM | 540.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 1 | 447874 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL138302 | = | nM | 390.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 2 | 458417 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL6966 | = | nM | 150.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 3 | 1273171 | CHEMBL660543 | Binding affinity for DHP (Dihydropyridine) sit... | B | CHEMBL3392229 | = | nM | 270.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 4 | 1283431 | CHEMBL660543 | Binding affinity for DHP (Dihydropyridine) sit... | B | CHEMBL3392227 | = | nM | 250.0 | CHEMBL1940 | Homo sapiens | IC50 |


In [None]:
##TODO rename columns standard_value to IC50 and standard_units to units!
bioactivities_df.rename(
    columns={"standard_value": "IC50", "standard_units": "units"}, inplace=True
)
bioactivities_df.head()

|   | activity_id | assay_chembl_id | assay_description | assay_type | molecule_chembl_id | relation | units | IC50 | target_chembl_id | target_organism | type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 439718 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL343771 | = | nM | 540.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 1 | 447874 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL138302 | = | nM | 390.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 2 | 458417 | CHEMBL656260 | Inhibition of (-)-[3H]- D-888 binding to L-typ... | B | CHEMBL6966 | = | nM | 150.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 3 | 1273171 | CHEMBL660543 | Binding affinity for DHP (Dihydropyridine) sit... | B | CHEMBL3392229 | = | nM | 270.0 | CHEMBL1940 | Homo sapiens | IC50 |
| 4 | 1283431 | CHEMBL660543 | Binding affinity for DHP (Dihydropyridine) sit... | B | CHEMBL3392227 | = | nM | 250.0 | CHEMBL1940 | Homo sapiens | IC50 |


Now, try to find out how large your set is!
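There are several ways to check the size of a DataFrame. The sketch below shows three of them on a toy table with made-up values; the variable names are chosen for illustration only.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["M1", "M2", "M3"],
        "IC50": [540.0, 390.0, 150.0],
    }
)

n_rows = len(demo_df)  # number of rows
n_rows_from_shape, n_cols = demo_df.shape  # (rows, columns)
n_molecules = demo_df["molecule_chembl_id"].nunique()  # distinct molecules

print(f"{n_rows} rows, {n_cols} columns, {n_molecules} unique molecules")
```

After deduplication, the number of rows and the number of unique molecules should agree.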

### Get compound data

We now have a proper DataFrame containing all molecules that were tested against your target (with measured bioactivity data).
Now we want to get the molecular structures of the molecules that are linked to the respective bioactivity ChEMBL IDs.

Let’s have a look at the compounds from ChEMBL for which we have bioactivity data: we fetch compound ChEMBL IDs and structures for the compounds linked to our filtered bioactivity data.

In [None]:
compounds_provider = compounds_api.filter(
    molecule_chembl_id__in=list(bioactivities_df["molecule_chembl_id"])
).only("molecule_chembl_id", "molecule_structures")

Again, we want to export the QuerySet object into a pandas.DataFrame. 
Given the data volume, this can take some time, so feel free to go for a coffee.
For that reason, we will first obtain the list of records through tqdm, so we get a nice progress bar and some ETAs.
We can then pass the list of compounds to the DataFrame.

In [47]:
compounds = list(tqdm(compounds_provider))

100%|██████████| 97/97 [00:01<00:00, 70.82it/s]


In [48]:
## now, fetch the data of the compounds into a df!
compounds_df = pd.DataFrame.from_records(
    compounds,
)
print(f"DataFrame shape: {compounds_df.shape}")

DataFrame shape: (97, 2)


Wow, SMILES! Before we look at the structures of our molecules, we preprocess and filter the data.

1. Remove entries with missing molecule structures

2. Delete duplicate molecules (by molecule_chembl_id)

3. Get molecules with canonical SMILES

In [49]:
##TODO: Remove entries with missing molecule structure entry... that sounds familiar
compounds_df.dropna(axis=0, how="any", inplace=True)
print(f"DataFrame shape: {compounds_df.shape}")

DataFrame shape: (97, 2)


In [50]:
##TODO: Delete duplicate molecules... how the hell did that work before...
compounds_df.drop_duplicates("molecule_chembl_id", keep="first", inplace=True)
print(f"DataFrame shape: {compounds_df.shape}")

DataFrame shape: (97, 2)


So far, we have multiple different molecular structure representations. We only want to keep the canonical SMILES.

In [51]:
compounds_df.iloc[0].molecule_structures.keys()

dict_keys(['canonical_smiles', 'molfile', 'standard_inchi', 'standard_inchi_key'])

In [52]:
canonical_smiles = []

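# Use a separate loop variable so the `compounds` list is not overwritten
for _, compound in compounds_df.iterrows():
    try:
        canonical_smiles.append(compound["molecule_structures"]["canonical_smiles"])
    except KeyError:
        # No canonical SMILES available for this compound
        canonical_smiles.append(None)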

compounds_df["smiles"] = canonical_smiles
compounds_df.drop("molecule_structures", axis=1, inplace=True)
print(f"DataFrame shape: {compounds_df.shape}")

DataFrame shape: (97, 2)
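The same extraction can be written without an explicit loop. The following is a sketch on toy data, assuming each `molecule_structures` entry is either a dictionary or missing; the helper name `get_canonical_smiles` and all values are made up for illustration.

```python
import pandas as pd

# Toy data: one complete entry, one without SMILES, one without structures
demo_df = pd.DataFrame(
    {
        "molecule_chembl_id": ["M1", "M2", "M3"],
        "molecule_structures": [
            {"canonical_smiles": "CCO", "standard_inchi_key": "KEY1"},
            {"standard_inchi_key": "KEY2"},
            None,
        ],
    }
)


def get_canonical_smiles(structures):
    """Return the canonical SMILES, or None if it is not available."""
    if isinstance(structures, dict):
        return structures.get("canonical_smiles")
    return None


demo_df["smiles"] = demo_df["molecule_structures"].apply(get_canonical_smiles)

# Drop the nested column and all molecules without a SMILES string
demo_df = demo_df.drop(columns="molecule_structures").dropna(subset=["smiles"])
print(demo_df)
```

Using `dict.get` avoids the `KeyError` handling, and the `isinstance` check also covers rows where no structure record exists at all.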


Sanity check: Remove all molecules without a canonical SMILES string.

In [53]:
# TODO: do the sanity check
compounds_df.dropna(axis=0, how="any", inplace=True)
print(f"DataFrame shape: {compounds_df.shape}")

DataFrame shape: (97, 2)


### Merge both datasets

Merge values of interest from bioactivities_df and compounds_df in an output_df based on the compounds’ ChEMBL IDs (molecule_chembl_id), keeping the following columns:

> ChEMBL IDs: molecule_chembl_id

> SMILES: smiles

> units: units

> IC50: IC50

In [54]:
# Merge DataFrames
output_df = pd.merge(
    bioactivities_df[["molecule_chembl_id", "IC50", "units"]],
    compounds_df,
    on="molecule_chembl_id",
)
# Reset row indices
output_df.reset_index(drop=True, inplace=True)

print(f"Dataset with {output_df.shape[0]} entries.")

Dataset with 97 entries.
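To see what the merge does with molecules that appear in only one of the two tables, consider the following sketch on toy data. The explicit `how="inner"` and `validate="one_to_one"` arguments are additions for illustration (inner is the default of `pd.merge`); the table contents are made up.

```python
import pandas as pd

bioactivities_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["M1", "M2", "M3"],
        "IC50": [540.0, 390.0, 150.0],
        "units": ["nM", "nM", "nM"],
    }
)
compounds_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["M1", "M2", "M4"],
        "smiles": ["CCO", "CCN", "CCC"],
    }
)

# Inner merge keeps only IDs present in both tables; validate raises an
# error if an ID occurs more than once on either side
merged = pd.merge(
    bioactivities_demo,
    compounds_demo,
    on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)
print(merged)
```

Here M3 (no structure) and M4 (no bioactivity) are dropped, which is why the merged table can be smaller than either input.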


In [55]:
# Show the first 10 rows of the merged dataset
output_df.head(10)

|   | molecule_chembl_id | IC50 | units | smiles |
|---|---|---|---|---|
| 0 | CHEMBL343771 | 540.0 | nM | COc1ccc(CCNC2CCCC(C(C#N)(c3ccc(OC)c(OC)c3)C(C)... |
| 1 | CHEMBL138302 | 390.0 | nM | COc1ccc(CCN(C)C2CCCC(C(C#N)(c3ccc(OC)c(OC)c3)C... |
| 2 | CHEMBL6966 | 150.0 | nM | COc1ccc(CCN(C)CCCC(C#N)(c2ccc(OC)c(OC)c2)C(C)C... |
| 3 | CHEMBL3392229 | 270.0 | nM | CCOC(=O)C1=C(C)NC(C)=C(C#N)[C@@H]1c1cnccc1-c1c... |
| 4 | CHEMBL3392227 | 250.0 | nM | CCOC(=O)C1=C(C)NC(C)=C(C#N)[C@@H]1c1cnccc1-c1c... |
| 5 | CHEMBL3392230 | 100.0 | nM | CCOC(=O)C1=C(C)NC(C)=C(C#N)[C@@H]1c1cnccc1-c1c... |
| 6 | CHEMBL247828 | 3500.0 | nM | CC(C)N1C(=O)[C@H](NC(=O)[C@@H](Cc2ccccc2OC(F)(... |
| 7 | CHEMBL247829 | 1725.0 | nM | CC(C)N1C(=O)[C@@H](NC(=O)[C@@H](Cc2ccccc2OC(F)... |
| 8 | CHEMBL493677 | 14900.0 | nM | CC(C)(C)CCN1CC[C@H](CNC(=O)c2cc(Cl)cc(Cl)c2)[C... |
| 9 | CHEMBL495792 | 28000.0 | nM | Cn1nccc1-c1ccc2c(c1)C(=O)CC1(CCN(C(=O)N[C@H]3C... |
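Since the 'data' directory is meant to hold the generated data, the merged table could be written to disk for later use. The sketch below is an assumption about how that might look: it writes a toy table to a temporary directory so that it runs anywhere, and the file name `compounds_demo.csv` is hypothetical.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

import pandas as pd

output_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["M1", "M2"],
        "IC50": [540.0, 390.0],
        "units": ["nM", "nM"],
        "smiles": ["CCO", "CCN"],
    }
)

with TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / "compounds_demo.csv"  # hypothetical file name
    # index=False avoids writing the row index as an extra unnamed column
    output_demo.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

print(reloaded)
```

In the notebook itself, the temporary directory would be replaced by your own data folder.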
