# **Bioinformatics Project - Computational Drug Discovery [Part 1] Download Bioactivity Data**

AbdulMuiz Shaikh

In this Jupyter notebook, we will build a real-life **data science project** in bioinformatics. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.


---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.


## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-25.2.0-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-25.2.0-py3-none-any.whl (70 kB)

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for HERG**

In [3]:
# Target search for HERG
target = new_client.target
target_query = target.search('herg')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,13.0,False,CHEMBL240,"[{'accession': 'Q12809', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Melanin-concentrating hormone receptor 2/HERG,12.0,False,CHEMBL4106188,"[{'accession': 'Q12809', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9.0,False,CHEMBL2363011,"[{'accession': 'Q9H252', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Homo sapiens,Voltage-gated potassium channel,2.0,False,CHEMBL2362996,"[{'accession': 'P51787', 'component_descriptio...",PROTEIN FAMILY,9606


### **Select and retrieve bioactivity data for *Voltage-gated inwardly rectifying potassium channel KCNH2* (first entry, index 0)**

We assign the first entry (index 0, which corresponds to the target protein *Voltage-gated inwardly rectifying potassium channel KCNH2*) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL240'
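
Selecting row 0 relies on the search ranking staying the same. As a hedged alternative, the sketch below picks the target by explicit criteria instead of by position. It uses a small hand-built table that copies four columns from the search output above; the helper name `pick_target` is hypothetical and not part of the ChEMBL client.

```python
import pandas as pd

# Toy copy of four columns from the search result shown above.
targets_demo = pd.DataFrame({
    "target_chembl_id": ["CHEMBL240", "CHEMBL4106188", "CHEMBL2363011", "CHEMBL2362996"],
    "target_type": ["SINGLE PROTEIN", "SELECTIVITY GROUP", "SINGLE PROTEIN", "PROTEIN FAMILY"],
    "organism": ["Homo sapiens"] * 4,
    "score": [13.0, 12.0, 9.0, 2.0],
})

def pick_target(df, target_type="SINGLE PROTEIN", organism="Homo sapiens"):
    """Return the ChEMBL ID of the highest-scoring row matching the criteria."""
    matches = df[(df["target_type"] == target_type) & (df["organism"] == organism)]
    if matches.empty:
        raise ValueError("no target matches the requested type and organism")
    return matches.sort_values("score", ascending=False).iloc[0]["target_chembl_id"]

selected_demo = pick_target(targets_demo)
print(selected_demo)  # CHEMBL240
```

With the real search result, the same call could be made on `targets`, assuming those columns are present as shown in the output above.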

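Here, we retrieve only the bioactivity data for *Voltage-gated inwardly rectifying potassium channel KCNH2* (CHEMBL240) that are reported as IC$_{50}$ values, which ChEMBL typically standardises to nM (nanomolar). The filter below selects on `standard_type` only and does not itself restrict the units.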

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
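
The later cells operate on a DataFrame named `df`. One way such a frame could be built from `res` is sketched below, assuming `res` can be iterated as a sequence of dict-like records, as the target search result was in the cell above. The helper `records_to_frame` and the sample records are hypothetical; they stand in for the live query so the snippet runs offline.

```python
from itertools import islice

import pandas as pd

def records_to_frame(records, limit=None):
    """Collect up to `limit` dict-like records from any iterable into a DataFrame."""
    rows = list(islice(records, limit))  # limit=None consumes the whole iterable
    return pd.DataFrame.from_records(rows)

# Stand-in records (assumed shape); with the real client one might pass `res`.
fake_res = [
    {"molecule_chembl_id": "CHEMBL12713", "standard_type": "IC50", "standard_value": "14.0"},
    {"molecule_chembl_id": "CHEMBL1108", "standard_type": "IC50", "standard_value": "32.2"},
]

df_demo = records_to_frame(fake_res, limit=1)
print(df_demo.shape)
```

Capping the number of records with `limit` keeps the download time bounded for targets with many measurements.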

In [11]:
df_sample = df.sample(n=10000, random_state=42)


In [12]:
df.head(3)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,305156,[],CHEMBL841079,Inhibition of hERG currents Kv11.1,T,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,IC50,nM,UO_0000065,,14.0
1,,,305157,[],CHEMBL841078,Inhibitory concentration against hERG currents...,F,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,IC50,nM,UO_0000065,,3.0
2,,,305244,[],CHEMBL691014,K+ channel blocking activity in human embryoni...,F,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,IC50,nM,UO_0000065,,32.2
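
Because the query above filters on `standard_type` only, it can be worth checking which units the standardised values actually use before treating every `standard_value` as nM. The sketch below does this on toy rows; the values are made up for illustration, and `standard_units` is assumed to be present among the columns elided in the output above.

```python
import pandas as pd

# Toy rows with assumed values; not taken from the real download.
activities_demo = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "ug.mL-1", "nM"],
    "standard_value": ["14.0", "2.5", "5950.0"],
})

# How many rows use each unit (missing units are counted too).
unit_counts = activities_demo["standard_units"].value_counts(dropna=False)
print(unit_counts)

# Keep only nM rows and make sure the values are numeric.
nm_only = activities_demo[activities_demo["standard_units"] == "nM"].copy()
nm_only["standard_value"] = pd.to_numeric(nm_only["standard_value"], errors="coerce")
```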


Next, we save the resulting bioactivity data to the CSV file **HERG_bioactivity_data_raw.csv**.

In [14]:
df.to_csv('HERG_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [18]:
# Keep rows that have a standard_value
df2 = df[df.standard_value.notna()]
# Build the second mask from df2 itself so the mask and the frame share an index
df2 = df2[df2.canonical_smiles.notna()]
df2


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,305156,[],CHEMBL841079,Inhibition of hERG currents Kv11.1,T,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,IC50,nM,UO_0000065,,14.0
1,,,305157,[],CHEMBL841078,Inhibitory concentration against hERG currents...,F,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,IC50,nM,UO_0000065,,3.0
2,,,305244,[],CHEMBL691014,K+ channel blocking activity in human embryoni...,F,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,IC50,nM,UO_0000065,,32.2
3,,,305245,[],CHEMBL691013,K+ channel blocking activity in human embryoni...,F,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,IC50,nM,UO_0000065,,5950.0
4,,,306561,[],CHEMBL691014,K+ channel blocking activity in human embryoni...,F,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,IC50,nM,UO_0000065,,143.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,,,16880542,[],CHEMBL3880847,Inhibition of human ERG expressed in CHO cells...,B,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,pIC50,,UO_0000065,,4.5
9996,,,16880543,[],CHEMBL3880847,Inhibition of human ERG expressed in CHO cells...,B,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,pIC50,,UO_0000065,,4.5
9997,,,16880544,[],CHEMBL3880847,Inhibition of human ERG expressed in CHO cells...,B,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,pIC50,,UO_0000065,,4.5
9998,,,16880545,[],CHEMBL3880847,Inhibition of human ERG expressed in CHO cells...,B,,,BAO_0000190,...,Homo sapiens,Voltage-gated inwardly rectifying potassium ch...,9606,,,pIC50,,UO_0000065,,4.5


In [19]:
print(df2['standard_value'].isnull().sum())
print(df2['canonical_smiles'].isnull().sum())


0
0


The zeros confirm that `df2` no longer contains missing values in either column. The same cells can be reused for the bioactivity data of other target proteins.
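
The same clean-up can also be written with `DataFrame.dropna`, which makes the required columns explicit. The sketch below runs on a toy frame with made-up identifiers and values; counting missing entries before dropping shows how many rows were actually affected.

```python
import pandas as pd

# Toy frame with made-up values; two rows have a missing required field.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [14.0, 3.0, None, 5950.0],
})

required = ["standard_value", "canonical_smiles"]

# Count missing entries per required column before dropping anything.
missing_before = demo[required].isna().sum()
print(missing_before)

# Drop rows missing either field; resetting the index is optional.
clean = demo.dropna(subset=required).reset_index(drop=True)
print(clean)
```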

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values expressed in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those with values between 1,000 and 10,000 nM are labeled **intermediate**.

In [20]:
# Assign exactly one label per row so the list stays the same length as df2
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) and bioactivity_class into a DataFrame**

In [21]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL12713,O=C1NCCN1CCN1CCC(c2cn(-c3ccc(F)cc3)c3ccc(Cl)cc...,14.0
1,CHEMBL12713,O=C1NCCN1CCN1CCC(c2cn(-c3ccc(F)cc3)c3ccc(Cl)cc...,3.0
2,CHEMBL1108,O=C(CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1)c1ccc(F...,32.2
3,CHEMBL2368925,O=C(O[C@@H]1C[C@@H]2C[C@H]3C[C@H](C1)N2CC3=O)c...,5950.0
4,CHEMBL6966,COc1ccc(CCN(C)CCCC(C#N)(c2ccc(OC)c(OC)c2)C(C)C...,143.0
...,...,...,...
9995,CHEMBL3917472,CC(C)(C)NS(=O)(=O)CCCN1CC2CN(CCCOc3ccc(C#N)cc3...,31622.78
9996,CHEMBL3980357,CC(C)(C)OC(=O)NCCN1CC2CN(CCNS(=O)(=O)c3ccc(F)c...,31622.78
9997,CHEMBL3976526,O=S(=O)(NCCN1CC2CN(Cc3ccccc3)CC(C1)O2)c1ccc(F)cc1,31622.78
9998,CHEMBL3905698,N#Cc1ccc(OCCCN2CC3CN(CCNS(=O)(=O)Cc4ccccc4)CC(...,31622.78


In [22]:
# Reuse df3's index so each label lines up with its own row
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class', index=df3.index)
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL12713,O=C1NCCN1CCN1CCC(c2cn(-c3ccc(F)cc3)c3ccc(Cl)cc...,14.0,active
1,CHEMBL12713,O=C1NCCN1CCN1CCC(c2cn(-c3ccc(F)cc3)c3ccc(Cl)cc...,3.0,active
2,CHEMBL1108,O=C(CCCN1CC=C(n2c(=O)[nH]c3ccccc32)CC1)c1ccc(F...,32.2,active
3,CHEMBL2368925,O=C(O[C@@H]1C[C@@H]2C[C@H]3C[C@H](C1)N2CC3=O)c...,5950.0,intermediate
4,CHEMBL6966,COc1ccc(CCN(C)CCCC(C#N)(c2ccc(OC)c(OC)c2)C(C)C...,143.0,active
...,...,...,...,...
9995,CHEMBL3917472,CC(C)(C)NS(=O)(=O)CCCN1CC2CN(CCCOc3ccc(C#N)cc3...,31622.78,inactive
9996,CHEMBL3980357,CC(C)(C)OC(=O)NCCN1CC2CN(CCNS(=O)(=O)c3ccc(F)c...,31622.78,inactive
9997,CHEMBL3976526,O=S(=O)(NCCN1CC2CN(Cc3ccccc3)CC(C1)O2)c1ccc(F)cc1,31622.78,inactive
9998,CHEMBL3905698,N#Cc1ccc(OCCCN2CC3CN(CCNS(=O)(=O)Cc4ccccc4)CC(...,31622.78,inactive
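
The labelling rule can also be applied without an explicit loop. The sketch below uses `numpy.select` with the same comparisons as the loop above (1,000 nM or less is active, 10,000 nM or more is inactive, anything else is intermediate). The function name `label_bioactivity` and its parameter names are hypothetical. It assumes missing values have already been dropped, since a missing value would fall through to the default label.

```python
import numpy as np
import pandas as pd

def label_bioactivity(values, active_max=1000.0, inactive_min=10000.0):
    """Label IC50 values (nM) as active, inactive or intermediate."""
    vals = pd.to_numeric(pd.Series(values), errors="coerce")
    labels = np.select(
        [vals <= active_max, vals >= inactive_min],
        ["active", "inactive"],
        default="intermediate",
    )
    # Keep the input index so the labels align with the source rows.
    return pd.Series(labels, index=vals.index, name="bioactivity_class")

demo_labels = label_bioactivity([14.0, 5950.0, 31622.78])
print(demo_labels.tolist())  # ['active', 'intermediate', 'inactive']
```

Because the returned Series keeps the index of its input, it can be attached to the source DataFrame as a new column without any positional bookkeeping.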


Save the DataFrame to a CSV file.

In [23]:
df4.to_csv('HERG_bioactivity_data_preprocessed.csv', index=False)

In [24]:
! ls -l

total 10476
-rw-r--r-- 1 root root 5001055 Sep 18 21:02 bioactivity_data_raw.csv
-rw-r--r-- 1 root root  719576 Sep 18 21:09 HERG_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 5001055 Sep 18 21:02 HERG_bioactivity_data_raw.csv
drwxr-xr-x 1 root root    4096 Sep 16 13:40 sample_data


---