# **Bioinformatics Project - Computational Drug Discovery [Part 1] Download Bioactivity Data (Concise version)**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 1**, we will perform data collection and pre-processing using the ChEMBL Database.

Notes for this concise version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications
(data as of March 25, 2020; ChEMBL version 26).

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client



## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for coronavirus**

In [None]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('SARS-CoV-2')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Severe acute respiratory syndrome coronavirus 2,SARS-CoV-2,34.0,False,CHEMBL4303835,[],ORGANISM,2697049.0
1,[],Severe acute respiratory syndrome-related coro...,SARS-CoV,34.0,False,CHEMBL4303836,[],ORGANISM,694009.0
2,[],Homo sapiens,"Serine--tRNA ligase, cytoplasmic",15.0,False,CHEMBL4523232,"[{'accession': 'P49591', 'component_descriptio...",SINGLE PROTEIN,9606.0
3,[],Homo sapiens,Thromboxane-A synthase,14.0,False,CHEMBL1835,"[{'accession': 'P24557', 'component_descriptio...",SINGLE PROTEIN,9606.0
4,[],Rattus norvegicus,Thromboxane-A synthase,14.0,False,CHEMBL4028,"[{'accession': 'P49430', 'component_descriptio...",SINGLE PROTEIN,10116.0
...,...,...,...,...,...,...,...,...,...
3018,[],Mus musculus,Glutamate NMDA receptor,0.0,False,CHEMBL3832634,"[{'accession': 'P35436', 'component_descriptio...",PROTEIN COMPLEX GROUP,10090.0
3019,[],Mus musculus,L-type calcium channel,0.0,False,CHEMBL3988632,"[{'accession': 'Q01815', 'component_descriptio...",PROTEIN FAMILY,10090.0
3020,[],Rattus norvegicus,Voltage-gated sodium channel,0.0,False,CHEMBL3988641,"[{'accession': 'O88457', 'component_descriptio...",PROTEIN FAMILY,10116.0
3021,[],Homo sapiens,UDP-glucuronosyltransferases (UGTs),0.0,False,CHEMBL4523985,"[{'accession': 'P22310', 'component_descriptio...",PROTEIN FAMILY,9606.0


In [None]:
targets.shape

(3023, 9)
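
With 3,023 hits, it can help to narrow the table by column values rather than by row position. The following is a minimal sketch, not part of the original notebook: `targets_demo`, `mask` and `selected_by_name` are illustrative names, and the toy table copies a few rows from the search output above.

```python
import pandas as pd

# Toy stand-in for the `targets` table; rows copied from the search output above.
targets_demo = pd.DataFrame({
    "organism": [
        "Severe acute respiratory syndrome coronavirus 2",
        "Homo sapiens",
        "Homo sapiens",
        "Rattus norvegicus",
    ],
    "pref_name": [
        "SARS-CoV-2",
        "Serine--tRNA ligase, cytoplasmic",
        "Thromboxane-A synthase",
        "Thromboxane-A synthase",
    ],
    "target_chembl_id": ["CHEMBL4303835", "CHEMBL4523232", "CHEMBL1835", "CHEMBL4028"],
    "target_type": ["ORGANISM", "SINGLE PROTEIN", "SINGLE PROTEIN", "SINGLE PROTEIN"],
})

# Keep single-protein targets with an exact name and organism match.
mask = (
    (targets_demo["target_type"] == "SINGLE PROTEIN")
    & (targets_demo["pref_name"] == "Thromboxane-A synthase")
    & (targets_demo["organism"] == "Rattus norvegicus")
)
selected_by_name = targets_demo.loc[mask, "target_chembl_id"].iloc[0]
print(selected_by_name)
```

Selecting by name and organism makes the intended target explicit in the code, which positional indexing does not.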

### **Select and retrieve bioactivity data for *Thromboxane-A synthase* (fifth entry)**

We will assign the fifth entry (index 4) to the ***selected_target*** variable. In the search results above, this entry is *Thromboxane-A synthase* from *Rattus norvegicus* (CHEMBL4028).

In [None]:
selected_target = targets.target_chembl_id[4]
selected_target

'CHEMBL4028'

Here, we will retrieve only the bioactivity data for the selected target (CHEMBL4028) that are reported as IC$_{50}$ values in nM (nanomolar) units.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df.head(3)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,192051,[],CHEMBL809686,Tested in vitro against TXA2 synthetase inhibi...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,uM,UO_0000065,,0.3
1,,,192052,[],CHEMBL880941,Tested in vitro against TXA2 synthetase inhibi...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,uM,UO_0000065,,3.3
2,,,200255,[],CHEMBL809686,Tested in vitro against TXA2 synthetase inhibi...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,uM,UO_0000065,,0.2


In [None]:
df.shape

(205, 46)
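
The query above filters on `standard_type` only, so the unit is not enforced by the filter itself. The following is a minimal sketch of checking `standard_units` before relying on nM values. The toy table, the identifier `CHEMBL_EXAMPLE` and the non-nM row are made up for illustration; only the column names come from the activity table.

```python
import pandas as pd

# Toy rows shaped like the activity table; the third row is hypothetical.
activity_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL130453", "CHEMBL11662", "CHEMBL_EXAMPLE"],
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
    "standard_value": ["300.0", "200.0", "5.0"],
})

# Count rows per unit so that unexpected units become visible.
unit_counts = activity_demo["standard_units"].value_counts(dropna=False)
print(unit_counts)

# Keep only the rows that are already expressed in nM.
nm_only = activity_demo[activity_demo["standard_units"] == "nM"]
print(nm_only.shape)
```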

Finally, we will save the resulting bioactivity data to the CSV file **bioactivity_data_raw.csv**.

In [None]:
df.to_csv('bioactivity_data_raw.csv', index=False)

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [None]:
! mkdir "/content/gdrive/My Drive/Colab Notebooks/data2"

mkdir: cannot create directory ‘/content/gdrive/My Drive/Colab Notebooks/data2’: File exists


In [None]:
! cp bioactivity_data_raw.csv "/content/gdrive/My Drive/Colab Notebooks/data2"

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data2"

total 95
-rw------- 1 root root 96462 Jan 20 22:14 bioactivity_data_raw.csv


In [None]:
! ls


bioactivity_data_preprocessed.csv  bioactivity_data_raw.csv  gdrive  sample_data


In [None]:
! head bioactivity_data_raw.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,,192051,[],CHEMBL809686,Tested in vitro against TXA2 synthetase inhibitory activity in rat platelet,B,,,BAO_0000190,BAO_0000019,assay format,Cn1nc(-c2cncs2)c2ccccc2c1=O,,,CHEMBL1126557,J Med Chem,1993,"{'bei': '26.81', 'le': '0.52', 'lle': '4.46', 'sei': '13.65'}",CHEMBL130453,,CHEMBL130453,6.52,0,http://www

## **Handling missing data**
If any compound has a missing value in the **standard_value** column, drop it.

In [None]:
df.isnull().sum()

Unnamed: 0,0
action_type,205
activity_comment,205
activity_id,0
activity_properties,0
assay_chembl_id,0
assay_description,0
assay_type,0
assay_variant_accession,205
assay_variant_mutation,205
bao_endpoint,0


In [None]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,192051,[],CHEMBL809686,Tested in vitro against TXA2 synthetase inhibi...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,uM,UO_0000065,,0.3
1,,,192052,[],CHEMBL880941,Tested in vitro against TXA2 synthetase inhibi...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,uM,UO_0000065,,3.3
2,,,200255,[],CHEMBL809686,Tested in vitro against TXA2 synthetase inhibi...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,uM,UO_0000065,,0.2
3,,,200256,[],CHEMBL880941,Tested in vitro against TXA2 synthetase inhibi...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,uM,UO_0000065,,0.2
4,,,218108,[],CHEMBL809686,Tested in vitro against TXA2 synthetase inhibi...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,uM,UO_0000065,,1.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,,,900764,[],CHEMBL814434,Tested for inhibition of thromboxane synthetas...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,M,UO_0000065,,3E-8
201,,,900766,[],CHEMBL814434,Tested for inhibition of thromboxane synthetas...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,M,UO_0000065,,0.000001
202,,,900768,[],CHEMBL814434,Tested for inhibition of thromboxane synthetas...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,M,UO_0000065,,9E-8
203,,,900770,[],CHEMBL814434,Tested for inhibition of thromboxane synthetas...,B,,,BAO_0000190,...,Rattus norvegicus,Thromboxane-A synthase,10116,,,IC50,M,UO_0000065,,3.4E-7


For this dataset there appears to be no missing data, but the code cell above can be reused for the bioactivity data of other target proteins.
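
The boolean-mask filter used above can also be written with `DataFrame.dropna`. The following is a minimal sketch on a made-up three-row table (`demo`, `kept_value` and `kept_both` are illustrative names). Requiring a SMILES string as well is an optional extra, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up rows: one is missing a SMILES string, one is missing a standard_value.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C"],
    "canonical_smiles": ["CCO", None, "c1ccccc1"],
    "standard_value": ["300.0", "200.0", np.nan],
})

# Same rows as demo[demo.standard_value.notna()].
kept_value = demo.dropna(subset=["standard_value"])

# Stricter variant: also require a SMILES string.
kept_both = demo.dropna(subset=["standard_value", "canonical_smiles"])

print(len(kept_value), len(kept_both))
```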

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. In the three-class scheme, compounds with values of less than 1,000 nM are considered **active**, those greater than 10,000 nM are considered **inactive**, and those between 1,000 and 10,000 nM are referred to as **intermediate**. The code cell below applies a simpler two-class rule: values of 1,000 nM or less are labeled **active** and all others **inactive**.

In [None]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("inactive")
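
The loop above can also be expressed without explicit iteration. The following is a minimal sketch with NumPy: `np.where` reproduces the two-class rule, and `np.select` gives a three-class variant using the 1,000 and 10,000 nM cutoffs. How values of exactly 1,000 or 10,000 nM are assigned in the three-class variant is an assumption, and `values`, `two_class` and `three_class` are illustrative names.

```python
import numpy as np
import pandas as pd

# The loop calls float(i), so the values are converted to float here as well.
values = pd.Series(["300.0", "3300.0", "1000.0", "20000.0"]).astype(float)

# Two-class rule, equivalent to the loop: <= 1000 nM is active, else inactive.
two_class = np.where(values <= 1000, "active", "inactive")

# Three-class variant; the handling of the exact boundary values is an assumption.
three_class = np.select(
    [values <= 1000, values >= 10000],
    ["active", "inactive"],
    default="intermediate",
)

print(two_class.tolist())
print(three_class.tolist())
```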

**Collect the molecule_chembl_id values into a list**

In [None]:
mol_cid =[]
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

In [None]:
mol_cid

['CHEMBL130453',
 'CHEMBL130453',
 'CHEMBL11662',
 'CHEMBL11662',
 'CHEMBL335687',
 'CHEMBL335687',
 'CHEMBL151732',
 'CHEMBL9088',
 'CHEMBL8515',
 'CHEMBL8778',
 'CHEMBL150316',
 'CHEMBL8976',
 'CHEMBL8887',
 'CHEMBL267978',
 'CHEMBL406693',
 'CHEMBL417603',
 'CHEMBL8870',
 'CHEMBL8735',
 'CHEMBL267473',
 'CHEMBL147183',
 'CHEMBL147413',
 'CHEMBL268645',
 'CHEMBL355929',
 'CHEMBL8648',
 'CHEMBL8489',
 'CHEMBL148928',
 'CHEMBL8879',
 'CHEMBL8819',
 'CHEMBL8986',
 'CHEMBL267473',
 'CHEMBL8957',
 'CHEMBL358777',
 'CHEMBL345310',
 'CHEMBL8987',
 'CHEMBL8679',
 'CHEMBL333219',
 'CHEMBL93',
 'CHEMBL116719',
 'CHEMBL23942',
 'CHEMBL267473',
 'CHEMBL22542',
 'CHEMBL123307',
 'CHEMBL330837',
 'CHEMBL120250',
 'CHEMBL280728',
 'CHEMBL121306',
 'CHEMBL444856',
 'CHEMBL120249',
 'CHEMBL120904',
 'CHEMBL118619',
 'CHEMBL122181',
 'CHEMBL120181',
 'CHEMBL121307',
 'CHEMBL118184',
 'CHEMBL120041',
 'CHEMBL118568',
 'CHEMBL331218',
 'CHEMBL11662',
 'CHEMBL121289',
 'CHEMBL333236',
 'CHEMBL121043',
 ...]

**Collect the canonical_smiles values into a list**

In [None]:
canonical_smiles =[]
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

In [None]:
canonical_smiles

['Cn1nc(-c2cncs2)c2ccccc2c1=O',
 'Cn1nc(-c2cncs2)c2ccccc2c1=O',
 'O=C(O)/C=C/c1ccc(Cn2ccnc2)cc1',
 'O=C(O)/C=C/c1ccc(Cn2ccnc2)cc1',
 'CCn1nc(-c2cccnc2)c2ccccc2c1=O',
 'CCn1nc(-c2cccnc2)c2ccccc2c1=O',
 'O=C(NCCCCn1ccnc1)c1ccc(Cl)cc1',
 'O=C(NCCCCCCCCn1ccnc1)c1ccc(Cl)s1',
 'CC(CCNC(=O)c1ccc(Cl)s1)n1ccnc1',
 'O=C(NCCCCn1ccnc1)c1cccs1',
 'O=C(NCCCn1ccnc1)c1ccc(I)cc1',
 'CC(CCNC(=O)c1ccc(Cl)nc1)n1ccnc1',
 'O=C(NCCCCCn1ccnc1)c1cc2cc(Cl)ccc2[nH]1',
 'O=C(NCCCCCCCCn1ccnc1)c1cc2cc(Cl)ccc2[nH]1',
 'CC(CCNC(=O)c1cc2cc(Cl)ccc2[nH]1)n1ccnc1',
 'O=C(NCCCCCn1ccnc1)c1ccc(Cl)s1',
 'O=C(NCCCn1ccnc1)c1cc2cc(Cl)ccc2s1',
 'O=C(NCCCCn1ccnc1)c1cc2ccccc2[nH]1',
 'O=C(O)c1ccc(OCCn2ccnc2)cc1',
 'CSc1ccc(C(=O)NCCCn2ccnc2)cc1',
 'O=C(NCCCCCCn1ccnc1)c1ccc(Cl)cc1',
 'O=C(NCCCCCn1ccnc1)c1ccc(Cl)nc1',
 'O=C(NCCCCCn1ccnc1)c1ccc(Cl)cc1',
 'O=C(NCCCCn1ccnc1)c1ccc(Cl)s1',
 'O=C(NCCCCCCn1ccnc1)c1ccc(Cl)s1',
 'O=C(NCCCCCn1ccnc1)c1ccc(C(F)(F)F)cc1',
 'O=C(NCCCn1ccnc1)c1cc(Br)c(Br)s1',
 'O=C(NCCCn1ccnc1)c1sc2ccccc2c1Cl',
 ...]

**Collect the standard_value values into a list**

In [None]:
standard_value =[]
for i in df2.standard_value:
  standard_value.append(i)
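
Each of the three loops above copies one column into a Python list. The following is a minimal sketch of the one-call equivalent, `Series.tolist()`, on a two-row table copied from the data (`demo`, `looped` and `direct` are illustrative names).

```python
import pandas as pd

demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL130453", "CHEMBL11662"],
    "canonical_smiles": ["Cn1nc(-c2cncs2)c2ccccc2c1=O", "O=C(O)/C=C/c1ccc(Cn2ccnc2)cc1"],
    "standard_value": ["300.0", "200.0"],
})

# Loop version, as in the cells above.
looped = []
for i in demo.molecule_chembl_id:
    looped.append(i)

# One-call equivalent.
direct = demo["molecule_chembl_id"].tolist()

print(direct == looped)
```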

### **Combine the 3 columns (molecule_chembl_id,canonical_smiles,standard_value) and bioactivity_class into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL130453,Cn1nc(-c2cncs2)c2ccccc2c1=O,300.0
1,CHEMBL130453,Cn1nc(-c2cncs2)c2ccccc2c1=O,3300.0
2,CHEMBL11662,O=C(O)/C=C/c1ccc(Cn2ccnc2)cc1,200.0
3,CHEMBL11662,O=C(O)/C=C/c1ccc(Cn2ccnc2)cc1,200.0
4,CHEMBL335687,CCn1nc(-c2cccnc2)c2ccccc2c1=O,1300.0
...,...,...,...
200,CHEMBL73065,CC(Cn1ccnc1)Cn1c(=O)[nH]c2ccccc2c1=O,30.0
201,CHEMBL75344,CC(CCn1c(=O)[nH]c2c(Cl)cccc2c1=O)n1ccnc1,1000.0
202,CHEMBL73748,CC(Cn1ccnc1)Cn1cnc2ccc(Cl)cc2c1=O,90.0
203,CHEMBL74915,Cc1ccc(N)c(C(=O)NCCCCn2ccnc2)c1,340.0


In [None]:
# Give the labels the same index as df3 so that concat aligns them row by row
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class', index=df3.index)
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL130453,Cn1nc(-c2cncs2)c2ccccc2c1=O,300.0,active
1,CHEMBL130453,Cn1nc(-c2cncs2)c2ccccc2c1=O,3300.0,inactive
2,CHEMBL11662,O=C(O)/C=C/c1ccc(Cn2ccnc2)cc1,200.0,active
3,CHEMBL11662,O=C(O)/C=C/c1ccc(Cn2ccnc2)cc1,200.0,active
4,CHEMBL335687,CCn1nc(-c2cccnc2)c2ccccc2c1=O,1300.0,inactive
...,...,...,...,...
200,CHEMBL73065,CC(Cn1ccnc1)Cn1c(=O)[nH]c2ccccc2c1=O,30.0,active
201,CHEMBL75344,CC(CCn1c(=O)[nH]c2c(Cl)cccc2c1=O)n1ccnc1,1000.0,active
202,CHEMBL73748,CC(Cn1ccnc1)Cn1cnc2ccc(Cl)cc2c1=O,90.0,active
203,CHEMBL74915,Cc1ccc(N)c(C(=O)NCCCCn2ccnc2)c1,340.0,active
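
`pd.concat` with `axis=1` matches rows by index label, not by position, so the label Series has to carry the same index as the DataFrame it is joined to. The following is a minimal sketch of what can happen when rows were dropped earlier and the indexes no longer match. The two-row table and the names `df3_demo`, `misaligned` and `aligned` are illustrative.

```python
import pandas as pd

# Index [0, 2] mimics a table from which row 1 was dropped earlier.
df3_demo = pd.DataFrame({"standard_value": [300.0, 200.0]}, index=[0, 2])
labels = ["active", "active"]

# A Series built from a plain list gets the index 0, 1, which does not match.
misaligned = pd.concat(
    [df3_demo, pd.Series(labels, name="bioactivity_class")], axis=1
)

# Reusing the DataFrame's index keeps every label on its own row.
aligned = pd.concat(
    [df3_demo, pd.Series(labels, name="bioactivity_class", index=df3_demo.index)],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

When no rows have been dropped, both forms give the same table, which matches the output shown above.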


Save the DataFrame to a CSV file.

In [None]:
df4.to_csv('bioactivity_data_preprocessed.csv', index=False)

In [None]:
df4.shape

(205, 4)
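
A quick way to check the saved file is to read it back and compare its shape and columns. The following is a minimal sketch that writes a two-row table to a temporary directory, so it does not touch the notebook's files. `demo`, `path` and `reloaded` are illustrative names.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL130453", "CHEMBL11662"],
    "canonical_smiles": ["Cn1nc(-c2cncs2)c2ccccc2c1=O", "O=C(O)/C=C/c1ccc(Cn2ccnc2)cc1"],
    "standard_value": [300.0, 200.0],
    "bioactivity_class": ["active", "active"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bioactivity_data_preprocessed.csv")
    # index=False keeps the row index out of the file.
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```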

In [None]:
! ls -l

total 120
-rw-r--r-- 1 root root 13773 Jan 20 22:29 bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 96462 Jan 20 22:14 bioactivity_data_raw.csv
drwx------ 6 root root  4096 Jan 20 22:14 gdrive
drwxr-xr-x 1 root root  4096 Jan 16 14:29 sample_data


In [None]:
! cp bioactivity_data_preprocessed.csv "/content/gdrive/My Drive/Colab Notebooks/data2"

In [None]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data2"

bioactivity_data_preprocessed.csv  bioactivity_data_raw.csv


---