# Data Extraction from ChEMBL Database

Load the dataset from the ChEMBL database, which contains bioactivity data for more than 2 million compounds across 15,598 targets.

In [1]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for All Targets (Alzheimer Disease)

The search returns a total of 6 targets.

In [2]:
target_result = new_client.target.search('Alzheimer')
targets = pd.DataFrame(target_result)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Nucleosome-remodeling factor subunit BPTF,13.0,False,CHEMBL3085621,"[{'accession': 'Q12830', 'component_descriptio...",SINGLE PROTEIN,9606
1,"[{'xref_id': 'Q92542', 'xref_name': None, 'xre...",Homo sapiens,Nicastrin,11.0,False,CHEMBL3418,"[{'accession': 'Q92542', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Homo sapiens,Gamma-secretase,11.0,False,CHEMBL2094135,"[{'accession': 'Q96BI3', 'component_descriptio...",PROTEIN COMPLEX,9606
3,[],Rattus norvegicus,Amyloid-beta A4 protein,9.0,False,CHEMBL3638365,"[{'accession': 'P08592', 'component_descriptio...",SINGLE PROTEIN,10116
4,[],Mus musculus,Amyloid-beta A4 protein,8.0,False,CHEMBL4523942,"[{'accession': 'P12023', 'component_descriptio...",SINGLE PROTEIN,10090
5,"[{'xref_id': 'P05067', 'xref_name': None, 'xre...",Homo sapiens,Amyloid-beta A4 protein,7.0,False,CHEMBL2487,"[{'accession': 'P05067', 'component_descriptio...",SINGLE PROTEIN,9606


Select the human Amyloid-beta A4 protein (CHEMBL2487) as the target and retrieve the bioactivity records for it that are reported as IC50 values. We are interested in the `standard_value` column, which holds the standardized IC50 and serves as our measure of potency (a lower value means a more potent compound).

In [3]:
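# Row 5 of the search results is the human Amyloid-beta A4 protein (CHEMBL2487).
selected = targets.target_chembl_id[5]
# Keep only the activity records for that target that are reported as IC50.
bio_activity_data = new_client.activity.filter(target_chembl_id=selected).filter(standard_type='IC50')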
df = pd.DataFrame(bio_activity_data)
df.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,357577,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,5.0
1,,,357580,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,2.7
2,,,358965,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.8
3,,,368887,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,11.0
4,,,375954,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
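
Picking the target with `targets.target_chembl_id[5]` depends on the order in which the search happens to return its rows. As a minimal sketch, the same target could be selected by name and organism instead. The frame `targets_demo` (a hand-built copy of three columns from the search output) and the helper `pick_target` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hand-built stand-in for the search results (subset of columns only).
targets_demo = pd.DataFrame(
    {
        "organism": ["Homo sapiens", "Homo sapiens", "Homo sapiens",
                     "Rattus norvegicus", "Mus musculus", "Homo sapiens"],
        "pref_name": ["Nucleosome-remodeling factor subunit BPTF", "Nicastrin",
                      "Gamma-secretase", "Amyloid-beta A4 protein",
                      "Amyloid-beta A4 protein", "Amyloid-beta A4 protein"],
        "target_chembl_id": ["CHEMBL3085621", "CHEMBL3418", "CHEMBL2094135",
                             "CHEMBL3638365", "CHEMBL4523942", "CHEMBL2487"],
    }
)


def pick_target(targets, pref_name, organism):
    """Return the single target_chembl_id matching both name and organism."""
    mask = (targets["pref_name"] == pref_name) & (targets["organism"] == organism)
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly rather than silently picking the wrong target.
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


selected_demo = pick_target(targets_demo, "Amyloid-beta A4 protein", "Homo sapiens")
print(selected_demo)
```

Raising an error when the match is not unique is a design choice for this sketch: it surfaces a changed search result immediately instead of letting a different protein slip through.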


Drop the rows that have a missing standard value.

In [4]:
df = df[df.standard_value.notna()]
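
`notna()` only removes entries that are actually missing. As a hedged sketch, assuming the API may deliver `standard_value` as text (the later `float(value)` call hints at this), the column can be converted to numbers first so that empty or unparsable strings are dropped as well. The frame `raw_demo` and its `DEMO…` identifiers are made-up illustration data.

```python
import pandas as pd

# Made-up rows: two valid values, one missing entry, one empty string.
raw_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["DEMO1", "DEMO2", "DEMO3", "DEMO4"],
        "standard_value": ["5000.0", None, "", "2700.0"],
    }
)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
raw_demo["standard_value"] = pd.to_numeric(raw_demo["standard_value"], errors="coerce")

# Drop every row whose standard_value is NaN after the conversion.
clean_demo = raw_demo.dropna(subset=["standard_value"])
print(clean_demo)
```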

## Data Preprocessing

Select the three columns we need; a column for the bioactivity level is added in a later step.

In [5]:
df1 = df[['molecule_chembl_id', 'canonical_smiles', 'standard_value']]
df1.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0


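Define a function that classifies the standard value (reported in nM) into three levels: active (below 1,000), intermediate (from 1,000 up to, but not including, 10,000), and inactive (10,000 and above).
Then add a column holding the bioactivity level.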

In [6]:
def classify_standard_value(value):
    value = float(value)
    if value < 1000:
        return 'active'
    elif value < 10000:
        return 'intermediate'
    else:
        return 'inactive'

In [7]:
# Work on an explicit copy so the assignment below does not modify a slice of df.
df1 = df1.copy()
df1.loc[:, 'bioactivity_level'] = df1['standard_value'].apply(classify_standard_value)
df1.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_level
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0,intermediate
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0,intermediate
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0,intermediate
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0,inactive
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0,inactive
5,CHEMBL563,CC(C(=O)O)c1ccc(-c2ccccc2)c(F)c1,305000.0,inactive
6,CHEMBL196279,CC(C(=O)O)c1ccc(-c2ccc(Cl)c(Cl)c2)c(F)c1,75000.0,inactive
8,CHEMBL195970,CC(C(=O)O)c1ccc(-c2cc(Cl)cc(Cl)c2)c(F)c1,77000.0,inactive
9,CHEMBL195970,CC(C(=O)O)c1ccc(-c2cc(Cl)cc(Cl)c2)c(F)c1,94000.0,inactive
13,CHEMBL264006,CC(C(=O)O)c1ccc(-c2ccc(C3CCCCC3)cc2)c(F)c1,21000.0,inactive
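
The same three-way split can also be expressed without a row-by-row `apply`. The following is a sketch of one alternative using `pd.cut`, not the method used in this notebook; the sample values in `values` are chosen to sit on and around the two thresholds.

```python
import pandas as pd

# Sample standard values placed on and around the thresholds 1000 and 10000.
values = pd.Series([999.0, 1000.0, 5000.0, 10000.0, 305000.0])

levels = pd.cut(
    values,
    bins=[float("-inf"), 1000, 10000, float("inf")],
    labels=["active", "intermediate", "inactive"],
    right=False,  # bins are [low, high), matching the `<` comparisons in the function
)
print(levels.tolist())
```

Setting `right=False` matters here: with the default right-closed bins, a value of exactly 1000 would be labeled active and 10000 intermediate, which differs from `classify_standard_value`.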


Save the DataFrame to a CSV file for future use.

In [8]:
df1.to_csv('bioactivity_data.csv', index = False)
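
To illustrate what `index=False` does, here is a small round-trip sketch: because the row index is not written, reading the file back yields only the data columns, with no extra `Unnamed: 0` column. The frame `demo` reuses two rows from the output above, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the classified output, without the SMILES column.
demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL311039", "CHEMBL74874"],
        "standard_value": [5000.0, 11000.0],
        "bioactivity_level": ["intermediate", "inactive"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bioactivity_data.csv")
    demo.to_csv(path, index=False)   # write without the row index
    reloaded = pd.read_csv(path)     # read back while the file still exists

print(reloaded.columns.tolist())
```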