<a href="https://colab.research.google.com/github/JLee823/2023-1st-AI-assisted-drug-discovery-SNU/blob/main/Week6_0_Intro_to_ChEMBL_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to ChEMBL API
-------


In this notebook, we briefly review how to download data from ChEMBL using the ChEMBL API.

The examples are adapted from https://github.com/chembl/chembl_webresource_client

You can find more useful tips in that repository.

In [None]:
pip install chembl_webresource_client


## What types of information can be downloaded from ChEMBL
------

In [None]:
from chembl_webresource_client.new_client import new_client

available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]
print(available_resources)



## Available filters

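The design of the client is based on the Django QuerySet API (https://docs.djangoproject.com/en/1.11/ref/models/querysets), and the most important lookup types are supported. A lookup is appended to a field name with a double underscore, as in `pref_name__iexact`. The supported lookups are: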

- exact
- iexact
- contains
- icontains
- in
- gt
- gte
- lt
- lte
- startswith
- istartswith
- endswith
- iendswith
- range
- isnull
- regex
- iregex
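
To make the `field__lookup` convention concrete, here is a minimal offline sketch that applies a few of the lookups above to plain dictionaries. The `matches` helper and the `LOOKUPS` table are hypothetical names introduced only for illustration. They do not reflect how the client is implemented, since the real client forwards the lookups to the ChEMBL web services. The sample records reuse IDs and names that appear later in this notebook.

```python
# Hypothetical, simplified imitation of Django-style lookups on flat dicts.
# Only a subset of the lookups listed above is covered.
LOOKUPS = {
    "exact": lambda value, arg: value == arg,
    "iexact": lambda value, arg: str(value).lower() == str(arg).lower(),
    "icontains": lambda value, arg: str(arg).lower() in str(value).lower(),
    "iendswith": lambda value, arg: str(value).lower().endswith(str(arg).lower()),
    "in": lambda value, arg: value in arg,
    "gte": lambda value, arg: value >= arg,
    "lte": lambda value, arg: value <= arg,
    "isnull": lambda value, arg: (value is None) == arg,
}


def matches(record, **conditions):
    """Return True if a flat record satisfies every field__lookup condition."""
    for key, arg in conditions.items():
        field, _, lookup = key.partition("__")
        lookup = lookup or "exact"  # a bare field name means an exact match
        value = record.get(field)
        if value is None and lookup != "isnull":
            return False  # missing values only match isnull lookups
        if not LOOKUPS[lookup](value, arg):  # unknown lookups raise KeyError
            return False
    return True


records = [
    {"molecule_chembl_id": "CHEMBL25", "pref_name": "ASPIRIN"},
    {"molecule_chembl_id": "CHEMBL27", "pref_name": "PROPRANOLOL"},
    {"molecule_chembl_id": "CHEMBL192", "pref_name": "SILDENAFIL"},
]

# Several conditions in one call are combined with AND.
selected = [
    r["molecule_chembl_id"]
    for r in records
    if matches(r, molecule_chembl_id__in=["CHEMBL25", "CHEMBL192"],
               pref_name__iendswith="fil")
]
print(selected)
```

Nested fields such as `molecule_properties__mw_freebase__lte` are deliberately out of scope for this sketch.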

# Molecules
-------

Molecule records may be retrieved in a number of ways, such as lookup of single molecules using various identifiers or searching for compounds via similarity.

## Find molecules using preferred name (pref_name)

In [6]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
mols = molecule.filter(pref_name__iexact='aspirin')
mols



## Find a molecule by its synonyms

- Useful when the molecule is not found by `pref_name`
- Use the `only` method to specify which fields to include in the response

In [13]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
mols = molecule.filter(pref_name='viagra').only('molecule_chembl_id')
mols


[]

In [14]:
mols = molecule.filter(molecule_synonyms__molecule_synonym__iexact='viagra').only('molecule_chembl_id')
mols

[{'molecule_chembl_id': 'CHEMBL192'}, {'molecule_chembl_id': 'CHEMBL1737'}]

## Get a single molecule by ChEMBL id

All the main entities in the ChEMBL database have a ChEMBL ID. It is a stable identifier designed for straightforward lookup of data.

In [15]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
m1 = molecule.filter(chembl_id='CHEMBL192').only(['molecule_chembl_id', 'pref_name', 'molecule_structures'])
m1

[{'molecule_chembl_id': 'CHEMBL192', 'molecule_structures': {'canonical_smiles': 'CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12', 'molfile': '\n     RDKit          2D\n\n 33 36  0  0  0  0  0  0  0  0999 V2000\n    2.1000   -0.0042    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.1000    0.7000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.5375   -0.0042    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0\n    1.4917   -0.3667    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n    0.8792   -0.0042    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.8042    0.9083    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n    1.4917    1.0625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.8792    0.6833    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n    3.2042    0.3458    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n    2.8042   -0.2417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.2875   -0.3750    0.0000 C   0  0  0  0  0  0  0  0  0  

## Get many molecules by id

In [18]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
mols = molecule.filter(molecule_chembl_id__in=['CHEMBL25', 'CHEMBL192', 'CHEMBL27'])
mols



In [20]:
mols = molecule.filter(molecule_chembl_id__in=['CHEMBL25', 'CHEMBL192', 'CHEMBL27']).only(['molecule_chembl_id', 'pref_name'])
mols

[{'molecule_chembl_id': 'CHEMBL25', 'pref_name': 'ASPIRIN'}, {'molecule_chembl_id': 'CHEMBL27', 'pref_name': 'PROPRANOLOL'}, {'molecule_chembl_id': 'CHEMBL192', 'pref_name': 'SILDENAFIL'}]
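
Note that the records above come back in a different order from the IDs passed to `__in`. If the request order matters, one option is to index the results by ID and rebuild the list. The sketch below works on the output shown above, copied into a plain list, so it runs without network access. The variable names are illustrative.

```python
requested = ['CHEMBL25', 'CHEMBL192', 'CHEMBL27']

# Output of the query above, copied into a plain list.
returned = [
    {'molecule_chembl_id': 'CHEMBL25', 'pref_name': 'ASPIRIN'},
    {'molecule_chembl_id': 'CHEMBL27', 'pref_name': 'PROPRANOLOL'},
    {'molecule_chembl_id': 'CHEMBL192', 'pref_name': 'SILDENAFIL'},
]

# Index by ChEMBL ID, then walk the requested IDs in their original order.
by_id = {m['molecule_chembl_id']: m for m in returned}
ordered = [by_id[cid] for cid in requested if cid in by_id]

for m in ordered:
    print(m['molecule_chembl_id'], m['pref_name'])
```

IDs that are missing from the response are skipped rather than raising an error.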

## Find compounds similar to given SMILES query with similarity threshold of 70%

In [21]:
from chembl_webresource_client.new_client import new_client

similarity = new_client.similarity
# Raw string: the SMILES contains a backslash (\C), which is an invalid escape in a normal string
res = similarity.filter(smiles=r"CO[C@@H](CCC#C\C=C/CCCC(C)CCCCC=C)C(=O)[O-]", similarity=70).only(['molecule_chembl_id', 'similarity'])
for i in res:
    print(i)

{'molecule_chembl_id': 'CHEMBL478779', 'similarity': '85.4166686534881591796875'}
{'molecule_chembl_id': 'CHEMBL477889', 'similarity': '85.4166686534881591796875'}
{'molecule_chembl_id': 'CHEMBL477888', 'similarity': '85.4166686534881591796875'}
{'molecule_chembl_id': 'CHEMBL2304268', 'similarity': '70.1754391193389892578125'}
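
The `similarity` values in the output above are strings, not numbers, so they should be converted before comparing or sorting. The sketch below does this on a copy of the printed results. The 80% cut-off is an arbitrary value chosen for illustration.

```python
# Results printed above, copied into a plain list.
hits = [
    {'molecule_chembl_id': 'CHEMBL478779', 'similarity': '85.4166686534881591796875'},
    {'molecule_chembl_id': 'CHEMBL477889', 'similarity': '85.4166686534881591796875'},
    {'molecule_chembl_id': 'CHEMBL477888', 'similarity': '85.4166686534881591796875'},
    {'molecule_chembl_id': 'CHEMBL2304268', 'similarity': '70.1754391193389892578125'},
]

# Convert the similarity field to float while keeping the other fields.
parsed = [{**h, 'similarity': float(h['similarity'])} for h in hits]

# Keep only hits at or above an illustrative 80% cut-off.
strong = [h['molecule_chembl_id'] for h in parsed if h['similarity'] >= 80]
print(strong)
```

Comparing the raw strings would be misleading: as text, `'100'` sorts before `'85.4...'`.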


## Find compounds similar to aspirin (CHEMBL25) with similarity threshold of 70%


In [None]:
from chembl_webresource_client.new_client import new_client

similarity = new_client.similarity
res = similarity.filter(chembl_id='CHEMBL25', similarity=70).only(['molecule_chembl_id', 'pref_name', 'similarity'])
res

[{'molecule_chembl_id': 'CHEMBL2296002', 'pref_name': None, 'similarity': '100'}, {'molecule_chembl_id': 'CHEMBL1697753', 'pref_name': 'ASPIRIN DL-LYSINE', 'similarity': '100'}, {'molecule_chembl_id': 'CHEMBL3833325', 'pref_name': 'CARBASPIRIN CALCIUM', 'similarity': '88.8888895511627197265625'}, {'molecule_chembl_id': 'CHEMBL3833404', 'pref_name': 'CARBASPIRIN', 'similarity': '88.8888895511627197265625'}, '...(remaining elements truncated)...']

## Find compounds with the same connectivity

In [None]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
res = molecule.filter(molecule_structures__canonical_smiles__connectivity='CN(C)C(=N)N=C(N)N').only(['molecule_chembl_id', 'pref_name'])
for i in res:
    print(i)

{'molecule_chembl_id': 'CHEMBL1431', 'pref_name': 'METFORMIN'}
{'molecule_chembl_id': 'CHEMBL1703', 'pref_name': 'METFORMIN HYDROCHLORIDE'}
{'molecule_chembl_id': 'CHEMBL3094198', 'pref_name': None}


## Get all approved drugs

Use `order_by` to sort them by molecular weight.

In [22]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
approved_drugs = molecule.filter(max_phase=4).order_by('molecule_properties__mw_freebase')
approved_drugs



In [23]:
print(len(approved_drugs))

4194


## Get approved drugs for lung cancer
-----
EFO (Experimental Factor Ontology) terms: https://www.ebi.ac.uk/efo/

In [26]:
from chembl_webresource_client.new_client import new_client

drug_indication = new_client.drug_indication
molecules = new_client.molecule

lung_cancer_ind = drug_indication.filter(efo_term__icontains="LUNG CARCINOMA")
lung_cancer_mols = molecules.filter(
    molecule_chembl_id__in=[x['molecule_chembl_id'] for x in lung_cancer_ind])

len(lung_cancer_mols)

716
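
The list comprehension above may pass the same ChEMBL ID several times if a molecule has more than one matching indication record. A small sketch of removing duplicates before building the `__in` filter follows. The records and IDs are made-up placeholders, not real ChEMBL data.

```python
# Hypothetical drug_indication records; only the field used here is shown,
# and the IDs are placeholders rather than real ChEMBL identifiers.
indications = [
    {'molecule_chembl_id': 'CHEMBL_EXAMPLE_1'},
    {'molecule_chembl_id': 'CHEMBL_EXAMPLE_2'},
    {'molecule_chembl_id': 'CHEMBL_EXAMPLE_1'},  # same molecule, second indication
]

# dict.fromkeys keeps the first occurrence of each ID and preserves order.
unique_ids = list(dict.fromkeys(x['molecule_chembl_id'] for x in indications))
print(unique_ids)
```

The resulting `unique_ids` list could then be passed as the value of `molecule_chembl_id__in`.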

## Get molecules with molecular weight <= 300

In [27]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
light_molecules = molecule.filter(molecule_properties__mw_freebase__lte=300)

len(light_molecules)

418451

## Filter drugs by approval year and name

* USAN: https://en.wikipedia.org/wiki/United_States_Adopted_Name


In [25]:
from chembl_webresource_client.new_client import new_client

drug = new_client.drug
res = drug.filter(first_approval__gte=1980).filter(usan_stem="-azosin")
res



## Get molecules with molecular weight <= 300 AND pref_name ending with nib


In [28]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
light_nib_molecules = molecule.filter(molecule_properties__mw_freebase__lte=300, pref_name__iendswith="nib").only(['molecule_chembl_id', 'pref_name'])

light_nib_molecules

[{'molecule_chembl_id': 'CHEMBL276711', 'pref_name': 'SEMAXANIB'}, {'molecule_chembl_id': 'CHEMBL4594348', 'pref_name': 'ELSUBRUTINIB'}]

## Get all molecules in ChEMBL with no Rule-of-Five violations
-------
Rule of Five (Lipinski's rule of five) criteria:
1. No more than 5 hydrogen bond donors (the total number of nitrogen–hydrogen and oxygen–hydrogen bonds)
2. No more than 10 hydrogen bond acceptors (all nitrogen or oxygen atoms)
3. A molecular mass less than 500 daltons
4. A calculated octanol-water partition coefficient (Clog P) that does not exceed 5
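
The four criteria above can be written as a small counting function. This is a minimal sketch that follows the wording of the list. It is not ChEMBL's own calculation of `num_ro5_violations`, and the function name, parameter names, and input values are assumptions made for illustration.

```python
def count_ro5_violations(h_donors, h_acceptors, mol_weight, clogp):
    """Count how many of the four rule-of-five criteria a compound breaks."""
    violations = 0
    if h_donors > 5:        # more than 5 hydrogen bond donors
        violations += 1
    if h_acceptors > 10:    # more than 10 hydrogen bond acceptors
        violations += 1
    if mol_weight >= 500:   # molecular mass not below 500 daltons
        violations += 1
    if clogp > 5:           # calculated logP above 5
        violations += 1
    return violations


# Illustrative, made-up property values (not taken from ChEMBL).
small_compound = count_ro5_violations(h_donors=1, h_acceptors=4, mol_weight=180.2, clogp=1.2)
large_compound = count_ro5_violations(h_donors=6, h_acceptors=11, mol_weight=600.0, clogp=6.0)
print(small_compound, large_compound)
```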

In [29]:
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
no_violations = molecule.filter(molecule_properties__num_ro5_violations=0)
len(no_violations)

1631753

# Activities

## Get all IC50 activities related to the hERG target

In [None]:
from chembl_webresource_client.new_client import new_client

target = new_client.target
activity = new_client.activity
herg = target.filter(pref_name__iexact='hERG').only('target_chembl_id')[0]
herg_activities = activity.filter(target_chembl_id=herg['target_chembl_id']).filter(standard_type="IC50")

len(herg_activities)

13200

## Get all activities for a specific target with assay type B (binding):


### Assay types in ChEMBL: 
* Binding (B) - Data measuring binding of compound to a molecular target, e.g. Ki, IC50, Kd.
* Functional (F) - Data measuring the biological effect of a compound, e.g. % cell death in a cell line, rat weight.
* ADMET (A) - ADME data e.g. t1/2, oral bioavailability.
* Toxicity (T) - Data measuring toxicity of a compound, e.g., cytotoxicity.
* Physicochemical (P) - Assays measuring physicochemical properties of the compounds in the absence of biological material e.g., chemical stability, solubility.
* Unclassified (U) - A small proportion of assays cannot be classified into one of the above categories e.g., ratio of binding vs efficacy.
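
When summarizing downloaded records, it can help to translate the one-letter codes above into readable names. The sketch below tallies records per assay type. The `ASSAY_TYPES` mapping mirrors the list above, while the helper name and the example records are hypothetical.

```python
from collections import Counter

# One-letter assay type codes as listed above.
ASSAY_TYPES = {
    'B': 'Binding',
    'F': 'Functional',
    'A': 'ADMET',
    'T': 'Toxicity',
    'P': 'Physicochemical',
    'U': 'Unclassified',
}


def tally_assay_types(records):
    """Count records per assay type, keyed by the full type name."""
    counts = Counter(r.get('assay_type') for r in records)
    return {ASSAY_TYPES.get(code, 'Unknown'): n for code, n in counts.items()}


# Hypothetical records for illustration; real ones would come from the API.
example = [{'assay_type': 'B'}, {'assay_type': 'F'}, {'assay_type': 'B'}]
summary = tally_assay_types(example)
print(summary)
```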

Retrieving only binding assay results

In [None]:
from chembl_webresource_client.new_client import new_client

activity = new_client.activity
res = activity.filter(target_chembl_id='CHEMBL3938', assay_type='B')

len(res)

860

In [40]:
res[0]

{'activity_comment': 'Not Active',
 'activity_id': 1650747,
 'activity_properties': [],
 'assay_chembl_id': 'CHEMBL860783',
 'assay_description': 'Average Binding Constant for STK16; NA=Not Active at 10 uM',
 'assay_type': 'B',
 'assay_variant_accession': None,
 'assay_variant_mutation': None,
 'bao_endpoint': 'BAO_0000034',
 'bao_format': 'BAO_0000357',
 'bao_label': 'single protein format',
 'canonical_smiles': 'CS(=O)(=O)CCNCc1ccc(-c2ccc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3c2)o1',
 'data_validity_comment': None,
 'data_validity_description': None,
 'document_chembl_id': 'CHEMBL1144455',
 'document_journal': 'Nat Biotechnol',
 'document_year': 2005,
 'ligand_efficiency': None,
 'molecule_chembl_id': 'CHEMBL554',
 'molecule_pref_name': 'LAPATINIB',
 'parent_molecule_chembl_id': 'CHEMBL554',
 'pchembl_value': None,
 'potential_duplicate': 0,
 'qudt_units': None,
 'record_id': 405809,
 'relation': None,
 'src_id': 1,
 'standard_flag': 0,
 'standard_relation': None,
 'standard_text_value

Retrieving only binding assay results measured as IC50 or Kd values

In [34]:
from chembl_webresource_client.new_client import new_client

activity = new_client.activity
res = activity.filter(target_chembl_id='CHEMBL3938', assay_type='B').filter(standard_type__in=["IC50", "Kd"])
len(res)

416

In [38]:
res[0]

{'activity_comment': 'Not Active',
 'activity_id': 1650747,
 'activity_properties': [],
 'assay_chembl_id': 'CHEMBL860783',
 'assay_description': 'Average Binding Constant for STK16; NA=Not Active at 10 uM',
 'assay_type': 'B',
 'assay_variant_accession': None,
 'assay_variant_mutation': None,
 'bao_endpoint': 'BAO_0000034',
 'bao_format': 'BAO_0000357',
 'bao_label': 'single protein format',
 'canonical_smiles': 'CS(=O)(=O)CCNCc1ccc(-c2ccc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3c2)o1',
 'data_validity_comment': None,
 'data_validity_description': None,
 'document_chembl_id': 'CHEMBL1144455',
 'document_journal': 'Nat Biotechnol',
 'document_year': 2005,
 'ligand_efficiency': None,
 'molecule_chembl_id': 'CHEMBL554',
 'molecule_pref_name': 'LAPATINIB',
 'parent_molecule_chembl_id': 'CHEMBL554',
 'pchembl_value': None,
 'potential_duplicate': 0,
 'qudt_units': None,
 'record_id': 405809,
 'relation': None,
 'src_id': 1,
 'standard_flag': 0,
 'standard_relation': None,
 'standard_text_value

## Get all activities with a pChEMBL value for a molecule
-----

The pChEMBL value is currently defined as follows: 

−log10 (molar IC50, XC50, EC50, AC50, Ki, Kd or Potency). 

(ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965067/)
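
As a worked example of the definition above, the sketch below converts a potency value in nanomolar to a pChEMBL-style value and back. It only illustrates the arithmetic and does not reproduce any curation rules ChEMBL may apply before assigning a pChEMBL value. The function names are hypothetical.

```python
import math


def pchembl_from_nanomolar(value_nm):
    """Return -log10 of a potency given in nanomolar, after converting to molar."""
    if value_nm <= 0:
        raise ValueError("potency must be positive")
    molar = value_nm * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(molar)


def nanomolar_from_pchembl(pchembl):
    """Invert the definition: molar = 10 ** -pChEMBL, reported in nanomolar."""
    return 10 ** (-pchembl) * 1e9


# 10 nM corresponds to a value of about 8, and 1000 nM (1 uM) to about 6.
print(round(pchembl_from_nanomolar(10), 2))
print(round(pchembl_from_nanomolar(1000), 2))
```

Each tenfold drop in potency lowers the value by one unit, which is why higher pChEMBL values indicate more potent compounds.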

In [42]:
from chembl_webresource_client.new_client import new_client

activities = new_client.activity
res = activities.filter(molecule_chembl_id="CHEMBL25", pchembl_value__isnull=False)

len(res)

145

In [43]:
res[0]

{'activity_comment': None,
 'activity_id': 88326,
 'activity_properties': [],
 'assay_chembl_id': 'CHEMBL762032',
 'assay_description': 'Inhibitory concentration required against Arachidonic acid (100 uM) induced platelet aggregatory activity',
 'assay_type': 'F',
 'assay_variant_accession': None,
 'assay_variant_mutation': None,
 'bao_endpoint': 'BAO_0000190',
 'bao_format': 'BAO_0000019',
 'bao_label': 'assay format',
 'canonical_smiles': 'CC(=O)Oc1ccccc1C(=O)O',
 'data_validity_comment': None,
 'data_validity_description': None,
 'document_chembl_id': 'CHEMBL1146621',
 'document_journal': 'Bioorg Med Chem Lett',
 'document_year': 2003,
 'ligand_efficiency': None,
 'molecule_chembl_id': 'CHEMBL25',
 'molecule_pref_name': 'ASPIRIN',
 'parent_molecule_chembl_id': 'CHEMBL25',
 'pchembl_value': '4.46',
 'potential_duplicate': 0,
 'qudt_units': 'http://www.openphacts.org/units/Nanomolar',
 'record_id': 174376,
 'relation': '=',
 'src_id': 1,
 'standard_flag': 1,
 'standard_relation': '=',

## Search for ADMET-related inhibitor assays (type A)

In [None]:
from chembl_webresource_client.new_client import new_client
assay = new_client.assay
res = assay.filter(description__icontains='inhibit', assay_type='A')
res

[{'assay_category': None, 'assay_cell_type': None, 'assay_chembl_id': 'CHEMBL884521', 'assay_classifications': [], 'assay_organism': 'Rattus norvegicus', 'assay_parameters': [], 'assay_strain': None, 'assay_subcellular_fraction': None, 'assay_tax_id': 10116, 'assay_test_type': None, 'assay_tissue': None, 'assay_type': 'A', 'assay_type_description': 'ADME', 'bao_format': 'BAO_0000357', 'bao_label': 'single protein format', 'cell_chembl_id': None, 'confidence_description': 'Direct single protein target assigned', 'confidence_score': 9, 'description': 'Inhibition of cytochrome P450 progesterone 15-alpha hydroxylase', 'document_chembl_id': 'CHEMBL1125500', 'relationship_description': 'Direct protein target assigned', 'relationship_type': 'D', 'src_assay_id': None, 'src_id': 1, 'target_chembl_id': 'CHEMBL3705', 'tissue_chembl_id': None, 'variant_sequence': None}, {'assay_category': None, 'assay_cell_type': None, 'assay_chembl_id': 'CHEMBL615148', 'assay_classifications': [], 'assay_organism