# chembl_webresource_client demo

### ChEMBL Group, EMBL-EBI

## Introduction

This is the only official Python client library developed and supported by the ChEMBL group.

The library helps you access ChEMBL data and cheminformatics tools from Python. You don't need to know how to write SQL. You don't need to know how to interact with REST APIs. You don't need to compile or install any cheminformatics frameworks. Results are cached.

The client handles interaction with the HTTPS protocol and caches all results in the local file system for faster retrieval. By abstracting away all network-related tasks, the client provides the end user with a convenient interface, giving the impression of working with a local resource. The design is based on the Django QuerySet interface. The client also implements lazy evaluation of results, which means it will only evaluate a request for data when a value is required. This approach reduces the number of network requests and improves performance.
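To make the idea of lazy evaluation with caching concrete, here is a minimal, purely local sketch. The `LazyResult` class and the `fake_request` function are hypothetical names invented for this illustration; they are not part of the client and do not reflect how it is implemented internally.

```python
class LazyResult:
    """Toy stand-in for a lazily evaluated, cached result set (hypothetical)."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that performs the expensive request
        self._cache = None       # filled on first access
        self.fetch_count = 0     # how many times the backend was actually hit

    def _load(self):
        # Only call the backend the first time a value is needed.
        if self._cache is None:
            self._cache = list(self._fetch())
            self.fetch_count += 1
        return self._cache

    def __len__(self):
        return len(self._load())

    def __getitem__(self, index):
        return self._load()[index]


def fake_request():
    # Stands in for a network call; returns made-up rows for illustration only.
    return [{"id": i} for i in range(5)]


results = LazyResult(fake_request)
print(results.fetch_count)   # 0: creating the object fetched nothing
print(len(results))          # 5: the first access triggers the fetch
print(results[2])            # {'id': 2}: served from the cache
print(results.fetch_count)   # 1: the backend was hit only once
```

The real client applies the same principle to network requests: building a query costs nothing until a length, an element or an iteration is requested.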

Please note that the code below attempts to balance clarity and brevity, and is not intended to be a template for production code: error checking, for example, should be much more thorough in practice. 

## Configuration and setup

In [1]:
import logging
from operator import itemgetter
from IPython.display import display, SVG
import pandas as pd

In [3]:
# Python modules used for API access...
from chembl_webresource_client.new_client import new_client

## List of available resources
It's easy to get a list of available resources by invoking:

In [None]:
available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]
print(available_resources)
print(len(available_resources))

This means there are 33 different types of resources available _via_ the web services. Only the most important of these are covered in this notebook.

## Molecules

Molecule records may be retrieved in a number of ways, such as lookup of single molecules using various identifiers or searching for compounds _via_ substructure or similarity.

In [None]:
# Get a molecule-handler object for API access and check the connection to the database...
molecule = new_client.molecule
molecule.set_format('json')
print("%s molecules available in ChEMBL" % len(molecule.all()))

### Getting a single molecule

In order to retrieve a single molecule from the web services, you need to know its unique and unambiguous identifier. In the case of the molecule resource, this can be one of three types:

 1. ChEMBL_ID
 2. InChI Key
 3. Canonical SMILES (non-canonical SMILES will be covered later in this notebook)

In [None]:
# so this:
# 1.
m1 = molecule.get('CHEMBL25')
# 2.
m2 = molecule.get('BSYNRYMUTXBXSQ-UHFFFAOYSA-N')
# 3.
m3 = molecule.get('CC(=O)Oc1ccccc1C(=O)O')
# will return the same data:
m1 == m2 == m3

### ChEMBL ID

All the main entities in the ChEMBL database have a ChEMBL ID. It is a stable identifier designed for straightforward lookup of data.

In [None]:
# Lapatinib, the bioactive component of the anti-cancer drug Tykerb
chembl_id = "CHEMBL554" 

In [None]:
# Get compound record using client...
record_via_client = molecule.get(chembl_id)
record_via_client

### InChIKey

Compound records may also be retrieved _via_ InChI Key lookup.

In [None]:
# InChI Key for Lapatinib
inchi_key = "BCFGMOOMADDAQU-UHFFFAOYSA-N"

# getting molecule via client
molecule.set_format('json')
record_via_client = molecule.get(inchi_key)
record_via_client

### SMILES

Compound records may also be retrieved _via_ SMILES lookup.

The purpose of the `get` method is to return objects identified by their unique and unambiguous properties.
This is why SMILES provided as arguments to the `get` method need to be canonical.
You can still search for molecules using non-canonical SMILES; this functionality will be covered later in this notebook.

In [None]:
# Canonical SMILES for Lapatinib
canonical_smiles = "CS(=O)(=O)CCNCc1ccc(-c2ccc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3c2)o1"

# getting molecule via client
molecule.set_format('json')
record_via_client = molecule.get(canonical_smiles)
record_via_client

### Batch queries

Multiple records may be requested at once. The `get` method can accept a list of homogeneous identifiers.

In [None]:
records1 = molecule.get(['CHEMBL6498', 'CHEMBL6499', 'CHEMBL6505'])
records2 = molecule.get(['XSQLHVPPXBBUPP-UHFFFAOYSA-N', 'JXHVRXRRSSBGPY-UHFFFAOYSA-N', 'TUHYVXGNMOGVMR-GASGPIRDSA-N'])
records3 = molecule.get(['CNC(=O)c1ccc(cc1)N(CC#C)Cc2ccc3nc(C)nc(O)c3c2',
            'Cc1cc2SC(C)(C)CC(C)(C)c2cc1\\N=C(/S)\\Nc3ccc(cc3)S(=O)(=O)N',
            'CC(C)C[C@H](NC(=O)[C@@H](NC(=O)[C@H](Cc1c[nH]c2ccccc12)NC(=O)[C@H]3CCCN3C(=O)C(CCCCN)CCCCN)C(C)(C)C)C(=O)O'])
records1 == records2

Please note that the length of a URL can't be more than 4000 characters. This is why the URL-based approach should not be used for very long lists of identifiers. The `molecule.get` call also needs to be modified slightly in that case.

In [None]:
# Generate a list of 300 ChEMBL IDs (N.B. not all will be valid)...
chembl_ids = ['CHEMBL{}'.format(x) for x in range(1, 301)]

# Get compound records, note `molecule_chembl_id` named parameter.
# Named parameters should always be used for longer lists
records = molecule.get(molecule_chembl_id=chembl_ids)
len(records)

Note that we expect to see a number that is less than 300. This is because some identifiers in the range `(CHEMBL1, ..., CHEMBL300)` have no molecule mapped to them.
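When a list of identifiers is very long, one option is to split it into smaller batches before querying. The sketch below shows only the batching step in plain Python; the `chunked` helper and the batch size of 50 are assumptions made for this illustration, not something the client provides or requires.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


chembl_ids = ["CHEMBL{}".format(x) for x in range(1, 301)]
batches = list(chunked(chembl_ids, 50))   # 50 is an arbitrary, illustrative size
print(len(batches), len(batches[0]), batches[-1][-1])   # 6 50 CHEMBL300

# Each batch could then be passed to the client, for example:
# records = []
# for batch in batches:
#     records.extend(molecule.get(molecule_chembl_id=batch))
```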

### Filtering
All resources available through ChEMBL web services can be filtered.
Some examples of filtering applied to molecules:

1. Get all approved drugs
2. Get all molecules in ChEMBL with no Rule-of-Five violations
3. Get all biotherapeutic molecules
4. Return molecules with molecular weight <= 300
5. Return molecules with molecular weight <= 300 AND pref_name ends with -nib

In [None]:
# 1. Get all approved drugs
approved_drugs = molecule.filter(max_phase=4)

# 2. Get all molecules in ChEMBL with no Rule-of-Five violations
no_violations = molecule.filter(molecule_properties__num_ro5_violations=0)

# 3. Get all biotherapeutic molecules
biotherapeutics = molecule.filter(biotherapeutic__isnull=False)

# 4. Return molecules with molecular weight <= 300
light_molecules = molecule.filter(molecule_properties__mw_freebase__lte=300)

# 5. Return molecules with molecular weight <= 300 AND pref_name ends with nib
light_nib_molecules = molecule.filter(molecule_properties__mw_freebase__lte=300).filter(pref_name__iendswith="nib")
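The filter names above follow the Django-style convention of joining nested field names and an optional operator with double underscores. The sketch below only illustrates how such a name can be read; `split_lookup`, `resolve` and the `KNOWN_OPERATORS` set are hypothetical helpers written for this example, the sample record is made up, and the actual filtering is performed by the web services rather than by code like this.

```python
# Operators that appear in this notebook; the services may support others.
KNOWN_OPERATORS = {"exact", "lte", "gt", "in", "isnull", "iendswith", "flexmatch"}


def split_lookup(lookup):
    """Split 'a__b__op' into (['a', 'b'], 'op'); the default operator is 'exact'."""
    parts = lookup.split("__")
    if len(parts) > 1 and parts[-1] in KNOWN_OPERATORS:
        return parts[:-1], parts[-1]
    return parts, "exact"


def resolve(record, path):
    """Follow a list of keys through nested dicts; return None if a level is missing."""
    value = record
    for key in path:
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


path, op = split_lookup("molecule_properties__mw_freebase__lte")
print(path, op)   # ['molecule_properties', 'mw_freebase'] lte

# A made-up record shaped like the ones returned for molecules.
record = {"pref_name": "EXAMPLE", "molecule_properties": {"mw_freebase": "180.16"}}
print(resolve(record, path))   # 180.16
```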

With the client-generated results, we do not have to worry about pagination:

In [None]:
# The QuerySet object returned by the client is a lazily-evaluated iterator
# This means that it's ready to use and it will try to reduce the amount of server requests
# All results are cached as well so they are fetched from server only once.
approved_drugs = molecule.filter(max_phase=4)

# Getting the length of the whole result set is easy:
print(len(approved_drugs))

# So is getting a single element:
print(approved_drugs[123])

# Or a chunk of elements:
print(approved_drugs[2:5])

# Or using it in loops or list comprehensions:
for drug in approved_drugs[0:20]:
    if drug['molecule_structures']:
        print(drug['molecule_structures']['canonical_smiles'])
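The loop above guards against records whose `molecule_structures` entry is empty. A small helper can make that check reusable; the sketch below runs on made-up records shaped like molecule results, and `smiles_of` is a hypothetical name used only for this illustration.

```python
def smiles_of(record):
    """Return the canonical SMILES of a molecule record, or None if absent."""
    structures = record.get("molecule_structures") or {}
    return structures.get("canonical_smiles")


# Made-up records shaped like molecule results; the second has no structure.
sample = [
    {"molecule_chembl_id": "CHEMBL25",
     "molecule_structures": {"canonical_smiles": "CC(=O)Oc1ccccc1C(=O)O"}},
    {"molecule_chembl_id": "EXAMPLE_NO_STRUCTURE", "molecule_structures": None},
]
print([smiles_of(r) for r in sample])
```

With real results, the same helper could be applied as `[smiles_of(d) for d in approved_drugs[0:20]]`.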

### Ordering results
As with filtering, it is also possible to order the result set. The `order_by` method is responsible for ordering:

In [None]:
# Sort approved drugs by molecular weight ascending (from lightest to heaviest) and get the first (lightest) element
lightest_drug = molecule.filter(max_phase=4).order_by('molecule_properties__mw_freebase')[0]
lightest_drug['pref_name']

In [None]:
# Sort approved drugs by molecular weight descending (from heaviest to lightest) and get the first (heaviest) element
heaviest_drug = molecule.filter(max_phase=4).order_by('-molecule_properties__mw_freebase')[0]
heaviest_drug['pref_name']
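The only difference between the two calls above is the leading `-`, which switches the sort direction. The sketch below imitates that convention on a few made-up local records so the effect can be seen without a network connection; `order_locally` is a hypothetical helper, the weights are invented, and in practice the ordering is done by the web services.

```python
def order_locally(records, order_by):
    """Sort local records by a 'a__b' path; a leading '-' means descending."""
    descending = order_by.startswith("-")
    path = order_by.lstrip("-").split("__")

    def sort_key(record):
        value = record
        for key in path:
            value = value[key]
        return float(value)   # property values are treated as numeric strings here

    return sorted(records, key=sort_key, reverse=descending)


# Made-up records; names and weights are for illustration only.
sample = [
    {"pref_name": "EXAMPLE_MID", "molecule_properties": {"mw_freebase": "250.0"}},
    {"pref_name": "EXAMPLE_LIGHT", "molecule_properties": {"mw_freebase": "18.0"}},
    {"pref_name": "EXAMPLE_HEAVY", "molecule_properties": {"mw_freebase": "900.5"}},
]
print(order_locally(sample, "molecule_properties__mw_freebase")[0]["pref_name"])
print(order_locally(sample, "-molecule_properties__mw_freebase")[0]["pref_name"])
```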

### Filtering molecules using SMILES
It is possible to filter molecules by SMILES.

In [None]:
# Atorvastatin...
smiles = "CC(C)c1c(C(=O)Nc2ccccc2)c(-c2ccccc2)c(-c2ccc(F)cc2)n1CC[C@@H](O)C[C@@H](O)CC(=O)O"

# By default, the type of search used is 'exact search', which means that only compounds with exactly the same SMILES string will be picked:
result = molecule.filter(molecule_structures__canonical_smiles=smiles)
print(len(result))

# This is equivalent to:
result1 = molecule.filter(molecule_structures__canonical_smiles__exact=smiles)
print(len(result1))

# For convenience, we have a shortcut call:
result2 = molecule.filter(smiles=smiles)
print(len(result2))

# Checking if they are all the same: 
print(result[0]['pref_name'] == result1[0]['pref_name'] == result2[0]['pref_name'])

# And because SMILES strings are unique in ChEMBL, this is similar to:
result3 = molecule.get(smiles)
print(result[0]['pref_name'] == result3['pref_name'])

There are, however, other filtering operators that can be applied to SMILES. The most important one is called `flexmatch`, which will return all structures described by the given SMILES string, even if it is a non-canonical SMILES.

In [None]:
# Flexmatch will look for structures that match given SMILES, ignoring stereo:
records = molecule.filter(molecule_structures__canonical_smiles__flexmatch=smiles)
print(len(records))

for record in records:
    print("{:15s} : {}".format(record["molecule_chembl_id"], record['molecule_structures']['canonical_smiles']))

Unlike with the exact string match, it is possible to retrieve multiple records when a SMILES is used for the `flexmatch` lookup (_i.e._ it is potentially one-to-many instead of one-to-one as the ID lookups are). This is due to the nature of `flexmatch`.

In our case two structures are returned, CHEMBL1487 (Atorvastatin) and CHEMBL1207181, which is the same structure as the former but with one of the two stereocentres undefined.

### Substructure-searching

As well as ID lookups, the web services may also be used to perform substructure searches. Currently, only SMILES-based searches are supported, although this could change if there is a need for more powerful search abilities (_e.g._ SMARTS searching).

In [None]:
# Lapatinib contains the following core...
query = "c4ccc(Nc2ncnc3ccc(c1ccco1)cc23)cc4"

In [None]:
# Perform substructure search on query using client
substructure = new_client.substructure
records = substructure.filter(smiles=query)
records

### Similarity searching

The web services may also be used to perform SMILES-based similarity searches.

In [None]:
# Lapatinib
smiles = "CS(=O)(=O)CCNCc1oc(cc1)c2ccc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3c2"

In [None]:
# Note that a percentage similarity must be supplied.
similarity = new_client.similarity
res = similarity.filter(smiles=smiles, similarity=85)
len(res)

In [None]:
res
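Similarity hits are often easier to read when ranked by score. The sketch below sorts made-up records locally; it assumes that each hit carries a `similarity` field holding a numeric string, which should be checked against the actual records, and `rank_by_similarity` is a hypothetical helper rather than a client method.

```python
def rank_by_similarity(records, limit=None):
    """Sort records by their 'similarity' value, highest first."""
    ranked = sorted(records, key=lambda r: float(r["similarity"]), reverse=True)
    return ranked if limit is None else ranked[:limit]


# Made-up hits; identifiers and scores are for illustration only.
sample = [
    {"molecule_chembl_id": "EXAMPLE_A", "similarity": "87.5"},
    {"molecule_chembl_id": "EXAMPLE_B", "similarity": "100"},
    {"molecule_chembl_id": "EXAMPLE_C", "similarity": "91.2"},
]
for rec in rank_by_similarity(sample, limit=2):
    print(rec["molecule_chembl_id"], rec["similarity"])
```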

### Versions for a parent structure

The versions (_e.g._ salt forms) for a parent compound may be retrieved for a ChEMBL ID. Keep in mind that a parent structure is one that has had salt/solvate components removed; it corresponds to the bioactive moiety and its use facilitates structure searching, comparison _etc_. A compound without salt/solvate components is its own parent.

In [None]:
# Neostigmine (a parent)...
chembl_id = "CHEMBL278020" 

In [None]:
records = new_client.molecule_form.get(chembl_id)['molecule_forms']
records

The ChEMBL ID lookup service may now be used to get the full records for the salt forms...

In [None]:
for chembl_id in [x["molecule_chembl_id"] for x in records if x["is_parent"] == False]:
    record = new_client.molecule.get(chembl_id)          
    print("{:10s} : {}".format(chembl_id, record['molecule_structures']['canonical_smiles']))

### Drug mechanism(s) of action

The mechanisms of action of marketed drugs may be retrieved.

Note that this data may not be recorded for the parent structure, but rather for one of its versions. For example, the marketed drug, Tykerb, containing the active ingredient Lapatinib (CHEMBL554) is actually the ditosylate monohydrate (CHEMBL1201179).

In [None]:
# Molecule forms for Lapatinib are used here...
for chembl_id in (x["molecule_chembl_id"] for x in new_client.molecule_form.get("CHEMBL554")['molecule_forms']):
    print("The recorded mechanisms of action of '{}' are...".format(chembl_id))
    mechanism_records = new_client.mechanism.filter(molecule_chembl_id=chembl_id)
    if mechanism_records:
        for mech_rec in mechanism_records:
            print("{:10s} : {}".format(mech_rec["molecule_chembl_id"], mech_rec["mechanism_of_action"]))
    print("-" * 50)

### Image query

The webservice may be used to obtain an SVG image of a compound.

In [None]:
# Lapatinib ditosylate monohydrate (Tykerb)
chembl_id = "CHEMBL1201179" 

In [None]:
image = new_client.image
image.set_format('svg')
svg = image.get(chembl_id)
SVG(svg)
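Outside a notebook, it may be more convenient to write the image to disk. The sketch below is a minimal example using placeholder markup instead of a real response; `save_svg` is a hypothetical helper, and it accepts both `str` and `bytes` because the type returned by the image resource is not assumed here.

```python
import tempfile
from pathlib import Path


def save_svg(svg, path):
    """Write SVG markup to `path`, accepting either str or bytes."""
    target = Path(path)
    if isinstance(svg, bytes):
        target.write_bytes(svg)
    else:
        target.write_text(svg, encoding="utf-8")
    return target


# Placeholder markup so the sketch runs without a network connection.
placeholder = '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10"></svg>'
out = save_svg(placeholder, Path(tempfile.gettempdir()) / "example_molecule.svg")
print(out.name, out.stat().st_size > 0)
```

With a real response, the call would be `save_svg(svg, "CHEMBL1201179.svg")`, where `svg` is the value returned by `image.get(chembl_id)`.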

### Bioactivities

All bioactivity records for a compound may be retrieved _via_ its ChEMBL ID.


In [None]:
# Lapatinib
chembl_id = "CHEMBL554" 

In [None]:
records = new_client.activity.filter(molecule_chembl_id=chembl_id)
len(records), records[:2]
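A quick way to get an overview of a set of activity records is to count them by measurement type. The sketch below does this on made-up records using the `standard_type` field of activity records; `count_by_type` is a hypothetical helper and the values are invented for illustration.

```python
from collections import Counter


def count_by_type(records):
    """Count activity records per 'standard_type'; missing types become 'unknown'."""
    return Counter(r.get("standard_type") or "unknown" for r in records)


# Made-up activity records; values are for illustration only.
sample = [
    {"standard_type": "IC50", "standard_value": "10.0", "standard_units": "nM"},
    {"standard_type": "Ki", "standard_value": "3.2", "standard_units": "nM"},
    {"standard_type": "IC50", "standard_value": "25.0", "standard_units": "nM"},
    {"standard_type": None, "standard_value": None, "standard_units": None},
]
for activity_type, count in count_by_type(sample).most_common():
    print(activity_type, count)
```

With real data, the same function could be applied directly to the result set, for example `count_by_type(records)`.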

## Targets

The webservices may also be used to obtain information on biological targets, _i.e._ the entities, such as proteins, cells or organisms, with which compounds interact.


In [None]:
# Like with any other resource type, a complete list of targets can be requested using the client:
records = new_client.target.all()
len(records)

In [None]:
records[:4]

### ChEMBL ID

Data on any target type may be obtained _via_ a lookup of its ChEMBL ID.


In [None]:
# Receptor protein-tyrosine kinase erbB-2
chembl_id = "CHEMBL1824"

In [None]:
record = new_client.target.get(chembl_id)
record

Remember that all targets have ChEMBL IDs, not just proteins...

In [None]:
# SK-BR-3, a cell line over-expressing erbB-2
chembl_id = "CHEMBL613834" 

In [None]:
record = new_client.target.get(chembl_id)
record

### UniProt ID

Data on protein targets may also be obtained using a UniProt ID.

In [None]:
# UniProt ID for erbB-2, a target of Lapatinib
uniprot_id = "O15648"

In [None]:
records = new_client.target.filter(target_components__accession=uniprot_id)
print([(x['target_chembl_id'], x['pref_name']) for x in records])

### Bioactivities

All bioactivities for a target may be retrieved.

In [4]:
# Receptor protein-tyrosine kinase erbB-2
chembl_id = "CHEMBL5686"
#O15648

In [5]:
records = new_client.activity.filter(target_chembl_id=chembl_id)
len(records)

7870

In [6]:
type(records)

chembl_webresource_client.query_set.QuerySet

In [7]:
len(records)

7870

In [8]:
import modin.pandas as pd

In [13]:
Bio = records[0:7870]

In [14]:
len(Bio)

7870

In [15]:
# Build a DataFrame from the sliced activity records
df = pd.DataFrame(list(Bio))

In [16]:
df.info()

<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 7870 entries, 0 to 7869
Data columns (total 43 columns):
 #   Column                     Non-Null Count  Dtype  
---  -------------------------  --------------  -----  
 0   activity_comment           7719 non-null   object
 1   activity_id                7870 non-null   int64
 2   activity_properties        7870 non-null   object
 3   assay_chembl_id            7870 non-null   object
 4   assay_description          7870 non-null   object
 5   assay_type                 7870 non-null   object
 6   bao_endpoint               7870 non-null   object
 7   bao_format                 7870 non-null   object
 8   bao_label                  7870 non-null   object
 9   canonical_smiles           7835 non-null   object
 10  data_validity_comment      4 non-null      object
 11  data_validity_description  4 non-null      object
 12  document_chembl_id         7870 non-null   object
 13  document_journal           151 non-null    object
 14  do

### Approved Drugs

The approved drugs for a target may be retrieved.

In [None]:
# Receptor protein-tyrosine kinase erbB-2
chembl_id = "CHEMBL1824"

In [None]:
activities = new_client.mechanism.filter(target_chembl_id=chembl_id)
compound_ids = [x['molecule_chembl_id'] for x in activities]
approved_drugs = new_client.molecule.filter(molecule_chembl_id__in=compound_ids).filter(max_phase=4)

for record in approved_drugs:
    print("{:10s} : {}".format(record["molecule_chembl_id"], record["pref_name"]))

### Assay details

Details of an assay may be retrieved _via_ its ChEMBL ID.

In [None]:
# Inhibitory activity against epidermal growth factor receptor
chembl_id = "CHEMBL674106"

In [None]:
record = new_client.assay.get(chembl_id)
record

### Bioactivities

All bioactivity records for an assay may be requested.

In [None]:
records = new_client.activity.filter(assay_chembl_id=chembl_id)
len(records), records[:2]

## Other resources

As noted previously, there are many other resources that can be useful. They won't be covered in great detail in this document, but some examples may be helpful.

In [None]:
# Documents - retrieve all publications published after 1985 in volume 5.
print(new_client.document.filter(doc_type='PUBLICATION').filter(year__gt=1985).filter(volume=5))

In [None]:
# Cell lines:
print(new_client.cell_line.get('CHEMBL3307242'))

In [None]:
# Protein class:
print(new_client.protein_class.filter(l6="CAMK protein kinase AMPK subfamily"))

In [None]:
# Source:
print(new_client.source.filter(src_short_name="ATLAS"))

In [None]:
# Target component:
print(new_client.target_component.get(375))

In [None]:
# ChEMBL ID Lookup: check if CHEMBL1 is a molecule, assay or target:
print(new_client.chembl_id_lookup.get("CHEMBL1")['entity_type'])

In [None]:
# ATC class:
print(new_client.atc_class.get('H03AA03'))