# **Molecular Modeling Project - Computational Drug Discovery [Part 1] Download Bioactivity Data**


In **Part 1**, data collection and pre-processing are performed on bioactivity data from the ChEMBL Database.



## **ChEMBL Database**

The ChEMBL Database contains curated bioactivity data for more than 2.4 million compounds. It is compiled from more than 89,900 documents and 1.6 million assays, and the data span 15,600 targets, 2,000 cell lines, 782 tissues, 1,400 drug warnings, 15,500 drugs, 6,900 drug mechanisms and 48,800 drug indications. [Data as of November 26, 2024; ChEMBL version 34].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-24.1.2-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl.metadata (3.1 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-24.1.2-py3-none-any.whl (66 kB)

## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for the target protein Serine/threonine-protein kinase PknB in Mycobacterium tuberculosis**

### **Target search for Mycobacterium tuberculosis**

In [None]:
# Target search for Mycobacterium tuberculosis
target = new_client.target
target_query = target.search('Mycobacterium tuberculosis')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mycobacterium tuberculosis,Mycobacterium tuberculosis,27.0,False,CHEMBL360,[],ORGANISM,1773
1,[],Mycobacterium tuberculosis H37Rv,Mycobacterium tuberculosis H37Rv,24.0,False,CHEMBL2111188,[],ORGANISM,83332
2,[],Mycobacterium tuberculosis variant bovis,Mycobacterium tuberculosis variant bovis,21.0,False,CHEMBL613086,[],ORGANISM,1765
3,[],Mycobacterium tuberculosis variant microti,Mycobacterium tuberculosis variant microti,21.0,False,CHEMBL612960,[],ORGANISM,1806
4,[],Mycobacterium tuberculosis variant bovis BCG,Mycobacterium tuberculosis variant bovis BCG,19.0,False,CHEMBL615052,[],ORGANISM,33892
...,...,...,...,...,...,...,...,...,...
106,[],Mycobacterium tuberculosis,Type II NADH:quinone oxidoreductase NdhA,7.0,False,CHEMBL5169229,"[{'accession': 'P95200', 'component_descriptio...",SINGLE PROTEIN,1773
107,[],Mycobacterium tuberculosis,Trehalose-binding lipoprotein LpqY,7.0,False,CHEMBL5169230,"[{'accession': 'P9WGU9', 'component_descriptio...",SINGLE PROTEIN,1773
108,[],Mycobacterium tuberculosis,Pup--protein ligase,7.0,False,CHEMBL5169231,"[{'accession': 'P9WNU7', 'component_descriptio...",SINGLE PROTEIN,1773
109,[],Mycobacterium tuberculosis,Putative citrate synthase 2,7.0,False,CHEMBL5291952,"[{'accession': 'P9WPD3', 'component_descriptio...",SINGLE PROTEIN,83332


### **Selection and retrieval of bioactivity data for the target protein Serine/threonine-protein kinase pknB (index 90)**

The entry at index 90 (which corresponds to the target protein, *PknB*) is assigned to the ***selected_target*** variable.

In [None]:
df = pd.DataFrame.from_dict(target_query)

In [None]:
df.head(95)

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mycobacterium tuberculosis,Mycobacterium tuberculosis,27.0,False,CHEMBL360,[],ORGANISM,1773
1,[],Mycobacterium tuberculosis H37Rv,Mycobacterium tuberculosis H37Rv,24.0,False,CHEMBL2111188,[],ORGANISM,83332
2,[],Mycobacterium tuberculosis variant bovis,Mycobacterium tuberculosis variant bovis,21.0,False,CHEMBL613086,[],ORGANISM,1765
3,[],Mycobacterium tuberculosis variant microti,Mycobacterium tuberculosis variant microti,21.0,False,CHEMBL612960,[],ORGANISM,1806
4,[],Mycobacterium tuberculosis variant bovis BCG,Mycobacterium tuberculosis variant bovis BCG,19.0,False,CHEMBL615052,[],ORGANISM,33892
...,...,...,...,...,...,...,...,...,...
90,[],Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,7.0,False,CHEMBL1908385,"[{'accession': 'P9WI81', 'component_descriptio...",SINGLE PROTEIN,1773
91,[],Mycobacterium tuberculosis,Phospho-N-acetylmuramoyl-pentapeptide-transferase,7.0,False,CHEMBL1921665,"[{'accession': 'P9WMW7', 'component_descriptio...",SINGLE PROTEIN,1773
92,[],Mycobacterium tuberculosis,dTDP-4-dehydrorhamnose reductase,7.0,False,CHEMBL1938225,"[{'accession': 'P9WH09', 'component_descriptio...",SINGLE PROTEIN,1773
93,[],Mycobacterium tuberculosis,Serine/threonine-protein kinase pknF,7.0,False,CHEMBL2016432,"[{'accession': 'P9WI75', 'component_descriptio...",SINGLE PROTEIN,1773


In [None]:
selected_target = targets.target_chembl_id[90]
selected_target

'CHEMBL1908385'
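
Selecting the target by its row position (90) relies on the order of the search results, which can shift if the results change. A minimal sketch of a name-based lookup, assuming the `pref_name` and `target_chembl_id` columns shown above; the small `targets_demo` frame and the `find_target_id` helper are hypothetical stand-ins, not part of the ChEMBL client:

```python
import pandas as pd

# Hypothetical stand-in for the search result, built from three rows of the output above.
targets_demo = pd.DataFrame(
    {
        "pref_name": [
            "Serine/threonine-protein kinase pknB",
            "Phospho-N-acetylmuramoyl-pentapeptide-transferase",
            "Serine/threonine-protein kinase pknF",
        ],
        "target_type": ["SINGLE PROTEIN"] * 3,
        "target_chembl_id": ["CHEMBL1908385", "CHEMBL1921665", "CHEMBL2016432"],
    },
    index=[90, 91, 93],
)


def find_target_id(targets: pd.DataFrame, name: str) -> str:
    """Return the ChEMBL ID of the single target whose pref_name matches (case-insensitive)."""
    mask = targets["pref_name"].str.lower() == name.lower()
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {name!r}, found {len(matches)}")
    return matches.iloc[0]


selected_target_demo = find_target_id(targets_demo, "Serine/threonine-protein kinase pknB")
print(selected_target_demo)
```

The helper raises an error when the name matches zero or several rows, so a silent mis-selection is less likely than with a fixed index.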

Here, only bioactivity data for *Serine/threonine-protein kinase pknB* (CHEMBL1908385) that are reported as IC$_{50}$ values are retrieved; the standardized values (the `standard_value` column) are in nM (nanomolar).

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df.head(100)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,10861355,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,1.35
1,,,10861356,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.322
2,,,10861357,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.453
3,,,10861358,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,1.162
4,,,10861359,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.223
5,,,10861360,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.44
6,,,10861361,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.107
7,,,10861362,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.428
8,,,10861363,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.375
9,,,10861364,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.384
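
The `value` and `units` columns above hold the measurement as reported (for example 1.35 uM), while `standard_value`, used in the later steps, holds the same measurement in nM (1350.0). A minimal sketch of that conversion; the `TO_NM` table and the `to_nanomolar` helper are hypothetical illustrations and not part of the ChEMBL client:

```python
# Hypothetical conversion factors from common concentration units to nanomolar.
TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}


def to_nanomolar(value: float, units: str) -> float:
    """Convert a concentration to nM; raises KeyError for an unknown unit."""
    return value * TO_NM[units]


# Reported values (uM) from the first three rows of the table above.
reported_um = [1.35, 0.322, 0.453]
# Round to suppress floating-point noise from the multiplication.
converted_nm = [round(to_nanomolar(v, "uM"), 3) for v in reported_um]
print(converted_nm)
```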


Finally, the resulting bioactivity data are saved to the CSV file **bioactivity_data_raw_drug.csv**.

In [None]:
df.to_csv('bioactivity_data_raw_drug.csv', index=False)

## **Handling missing data**
If any compound has a missing value in the **standard_value** column, drop it.

In [None]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,10861355,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,1.35
1,,,10861356,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.322
2,,,10861357,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.453
3,,,10861358,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,1.162
4,,,10861359,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.223
5,,,10861360,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.44
6,,,10861361,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.107
7,,,10861362,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.428
8,,,10861363,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.375
9,,,10861364,[],CHEMBL2020901,Inhibition of Mycobacterium tuberculosis GST-f...,B,,,BAO_0000190,...,Mycobacterium tuberculosis,Serine/threonine-protein kinase pknB,1773,,,IC50,uM,UO_0000065,,0.384


For this dataset there appear to be no missing values, but the code cell above can be reused for bioactivity data of other target proteins.
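
To see what the filter does when values are missing, here is a small sketch on a made-up frame (the IDs and values are invented for illustration and are not ChEMBL data). `dropna(subset=...)` is shown as an equivalent spelling of the `notna()` mask used above:

```python
import pandas as pd

# Invented toy data: the second row has no standard_value.
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
        "standard_value": [1350.0, None, 322.0],
    }
)

# Boolean-mask form, as in the notebook cell above.
kept_mask = toy[toy["standard_value"].notna()]
# Equivalent form using dropna restricted to one column.
kept_dropna = toy.dropna(subset=["standard_value"])

print(len(toy) - len(kept_mask), "row(s) dropped")
print(kept_mask.index.tolist())
```

Note that the surviving rows keep their original index labels, so the index is no longer contiguous after filtering.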

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those with values between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
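# Label each compound by its IC50 (standard_value, in nM)
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:    # 10,000 nM or more
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:   # 1,000 nM or less
    bioactivity_class.append("active")
  else:                    # strictly between the two thresholds
    bioactivity_class.append("intermediate")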
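
The same thresholds can be applied without an explicit loop. The sketch below is an alternative, not the notebook's method: it uses `numpy.select` on four `standard_value` entries from this dataset plus three invented values (1000, 10000 and 25000) that exercise the boundaries and the inactive class.

```python
import numpy as np
import pandas as pd

# Four values from this dataset, followed by three invented values for the boundaries.
standard_value = pd.Series([1350.0, 322.0, 1162.0, 107.0, 1000.0, 10000.0, 25000.0])

values = standard_value.astype(float)
labels = np.select(
    [values >= 10000, values <= 1000],  # conditions are checked in order, like if/elif
    ["inactive", "active"],
    default="intermediate",
)
bioactivity_class_demo = pd.Series(
    labels, index=standard_value.index, name="bioactivity_class"
)
print(bioactivity_class_demo.tolist())
```

Building the result on the index of the input Series keeps each label attached to its own row.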

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) and bioactivity_class into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL549536,Cc1cc(Nc2nc(-c3ccncc3)nc3ccccc23)n[nH]1,1350.0
1,CHEMBL2017707,c1ccc2c(Nc3cc(C4CC4)[nH]n3)nc(-c3ccncc3)nc2c1,322.0
2,CHEMBL2017708,Fc1ccc(-c2nc(Nc3cc[nH]n3)c3ccccc3n2)cc1,453.0
3,CHEMBL2017709,Cc1cc(Nc2nc(-c3ccc(F)cc3)nc3ccccc23)n[nH]1,1162.0
4,CHEMBL2017710,Fc1ccc(-c2nc(Nc3cc(C4CC4)[nH]n3)c3ccccc3n2)cc1,223.0
5,CHEMBL2017711,c1ccc(-c2nc(Nc3cc(C4CC4)[nH]n3)c3ccccc3n2)cc1,440.0
6,CHEMBL2017712,c1ccc2c(Nc3cc(C4CC4)[nH]n3)nc(NC3CCCCC3)nc2c1,107.0
7,CHEMBL2017713,c1ccc2c(Nc3cc(C4CC4)[nH]n3)nc(NCC3CC3)nc2c1,428.0
8,CHEMBL2017714,Fc1ccc(-c2nccc(Nc3cc(C4CC4)[nH]n3)n2)cc1,375.0
9,CHEMBL2017715,NC(=O)c1cccc(-c2nccc(Nc3cc(C4CC4)[nH]n3)n2)c1,384.0


In [None]:
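# Reuse df3's index so the labels stay aligned with their rows even if rows were dropped earlier
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class', index=df3.index)
df4 = pd.concat([df3, bioactivity_class], axis=1)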
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL549536,Cc1cc(Nc2nc(-c3ccncc3)nc3ccccc23)n[nH]1,1350.0,intermediate
1,CHEMBL2017707,c1ccc2c(Nc3cc(C4CC4)[nH]n3)nc(-c3ccncc3)nc2c1,322.0,active
2,CHEMBL2017708,Fc1ccc(-c2nc(Nc3cc[nH]n3)c3ccccc3n2)cc1,453.0,active
3,CHEMBL2017709,Cc1cc(Nc2nc(-c3ccc(F)cc3)nc3ccccc23)n[nH]1,1162.0,intermediate
4,CHEMBL2017710,Fc1ccc(-c2nc(Nc3cc(C4CC4)[nH]n3)c3ccccc3n2)cc1,223.0,active
5,CHEMBL2017711,c1ccc(-c2nc(Nc3cc(C4CC4)[nH]n3)c3ccccc3n2)cc1,440.0,active
6,CHEMBL2017712,c1ccc2c(Nc3cc(C4CC4)[nH]n3)nc(NC3CCCCC3)nc2c1,107.0,active
7,CHEMBL2017713,c1ccc2c(Nc3cc(C4CC4)[nH]n3)nc(NCC3CC3)nc2c1,428.0,active
8,CHEMBL2017714,Fc1ccc(-c2nccc(Nc3cc(C4CC4)[nH]n3)n2)cc1,375.0,active
9,CHEMBL2017715,NC(=O)c1cccc(-c2nccc(Nc3cc(C4CC4)[nH]n3)n2)c1,384.0,active
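
`pd.concat(..., axis=1)` aligns rows by index label, not by position. That makes no difference for this dataset because no rows were dropped, but after a missing-value filter that removes rows, a Series with a fresh 0, 1, 2, ... index would no longer line up with the filtered frame. A small sketch with invented data showing the effect and one way to avoid it:

```python
import pandas as pd

# Invented frame that mimics a filtered result: the row labeled 1 was dropped.
frame = pd.DataFrame({"standard_value": [1350.0, 322.0]}, index=[0, 2])
labels = ["intermediate", "active"]

# Default index (0, 1): label 1 has no partner in `frame`, and row 2 gets no class.
misaligned = pd.concat([frame, pd.Series(labels, name="bioactivity_class")], axis=1)

# Reusing the frame's index keeps each label on its own row.
aligned = pd.concat(
    [frame, pd.Series(labels, index=frame.index, name="bioactivity_class")], axis=1
)

print(misaligned.shape, aligned.shape)
print(int(misaligned.isna().sum().sum()), "missing cell(s) introduced by misalignment")
```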


Save the DataFrame to a CSV file.

In [None]:
df4.to_csv('bioactivity_data_drug_preprocessed.csv', index=False)
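
A quick read-back can confirm that a file written this way has the expected columns and values. The sketch below is an illustration under stated assumptions: it writes two rows copied from the table above to a temporary directory, and the file name `demo_preprocessed.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the pre-processed table above.
demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL549536", "CHEMBL2017707"],
        "canonical_smiles": [
            "Cc1cc(Nc2nc(-c3ccncc3)nc3ccccc23)n[nH]1",
            "c1ccc2c(Nc3cc(C4CC4)[nH]n3)nc(-c3ccncc3)nc2c1",
        ],
        "standard_value": [1350.0, 322.0],
        "bioactivity_class": ["intermediate", "active"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_preprocessed.csv")  # hypothetical file name
    demo.to_csv(path, index=False)  # index=False: do not write the row labels
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```

Because `index=False` is used when writing, the reloaded frame has no extra `Unnamed: 0` column.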

In [None]:
! ls -l

total 56
-rw-r--r-- 1 root root  4558 Nov 16 18:58 bioactivity_data_drug_preprocessed.csv
-rw-r--r-- 1 root root 44290 Nov 16 18:56 bioactivity_data_raw_drug.csv
drwxr-xr-x 1 root root  4096 Nov 14 14:25 sample_data


---