<a href="https://colab.research.google.com/github/GolDRoger69/Drug-Discovery-using-ML/blob/main/Python%5CPB_project_bioactivity_Data_Set_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) is a database that contains curated bioactivity data of more than 2 million compounds. It is compiled from more than 76,000 documents, 1.2 million assays and the data spans 13,000 targets and 1,800 cells and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-24.1.3-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m55.2/55.2 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.4/61.4 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading cattrs-24.1.3-py3-none-any.whl (66 kB)
[2K   [90m━━━━━━━━━━━━━━

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

In [3]:
# Target search for Dengue
target = new_client.target
target_query = target.search('dengue')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Dengue virus,Dengue virus,14.0,False,CHEMBL613757,[],ORGANISM,12637
1,[],dengue virus type 4,dengue virus type 4,11.0,False,CHEMBL613728,[],ORGANISM,11070
2,[],dengue virus type 1,dengue virus type 1,11.0,False,CHEMBL613360,[],ORGANISM,11053
3,[],dengue virus type 2,dengue virus type 2,11.0,False,CHEMBL613966,[],ORGANISM,11060
4,[],dengue virus type 3,dengue virus type 3,11.0,False,CHEMBL612717,[],ORGANISM,11069
5,[],Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,9.0,False,CHEMBL5980,"[{'accession': 'P29990', 'component_descriptio...",SINGLE PROTEIN,31634


### **Select and retrieve bioactivity data for Dengue virus type 2 NS3 protein	(sixth entry)**

In [4]:
selected_target = targets.target_chembl_id[5]
selected_target

'CHEMBL5980'

Here, we will retrieve only bioactivity data for *Dengue virus type 2 NS3 protein	* (CHEMBL5980) that are reported as IC$_{50}$ values in nM (nanomolar) unit.

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [6]:
df = pd.DataFrame.from_dict(res)

In [7]:
df.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,active,7018430,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,100.0
1,,active,7018431,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,12.31
2,,active,7018432,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,100.0
3,,active,7018433,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,50.97
4,,active,7018434,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,100.0


In [8]:
df.standard_type.unique()

array(['IC50'], dtype=object)

In [9]:
df.to_csv('bioactivity_data.csv', index=False)

## **Copying files to Google Drive**

In [10]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)


Mounted at /content/gdrive/


In [11]:
! mkdir "/content/gdrive/My Drive/Colab Notebooks/data"

In [13]:
! cp bioactivity_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [14]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 831
-rw------- 1 root root 850407 May  3 09:19 bioactivity_data.csv


In [15]:
! ls

bioactivity_data.csv  gdrive  sample_data


In [16]:
! head bioactivity_data.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,active,7018430,[],CHEMBL1794550,"PUBCHEM_BIOASSAY: Primary and Confirmatory Screening for Flavivirus Genomic Capping Enzyme Inhibition. (Class of assay: confirmatory) [Related pubchem assays (depositor defined):AID588708, AID588742]",F,,,BAO_0000190,BAO_0000019,assay format,COc1ccc2nc3cccc(OC)c3nc2c1,,,CHEMB

## **Handling missing data**
If any compounds has missing value for the **standard_value** column then drop it

In [17]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,active,7018430,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,100.0
1,,active,7018431,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,12.31
2,,active,7018432,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,100.0
3,,active,7018433,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,50.97
4,,active,7018434,[],CHEMBL1794550,PUBCHEM_BIOASSAY: Primary and Confirmatory Scr...,F,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1314,"{'action_type': 'INHIBITOR', 'description': 'N...",,25562102,[],CHEMBL5348989,Inhibition of DENV-2 NS2B-NS3 protease by bioc...,B,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,0.2
1315,"{'action_type': 'INHIBITOR', 'description': 'N...",,25562103,[],CHEMBL5348989,Inhibition of DENV-2 NS2B-NS3 protease by bioc...,B,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,0.12
1316,"{'action_type': 'INHIBITOR', 'description': 'N...",,25562104,[],CHEMBL5348989,Inhibition of DENV-2 NS2B-NS3 protease by bioc...,B,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,0.089
1317,"{'action_type': 'INHIBITOR', 'description': 'N...",,25584263,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5354448,Inhibition of DENV2 NS2B (48 to 100 residues)/...,B,,,BAO_0000190,...,Dengue virus type 2 (strain Thailand/16681/198...,Dengue virus type 2 NS3 protein,31634,,,IC50,uM,UO_0000065,,8.46


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data is in the IC50 unit. Compounds having values of less than 1000 nM will be considered to be **active** while those greater than 10,000 nM will be considered to be **inactive**. As for those values in between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [18]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")

### **DataFrame**

In [29]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]

df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL1401841,COc1ccc2nc3cccc(OC)c3nc2c1,100000.0
1,CHEMBL1608853,O=C(O)c1ccc2c(c1)C(=O)/C(=C\c1ccco1)C2=O,12310.0
2,CHEMBL1429799,O=C1NN(c2ccccc2)C(=O)/C1=C\c1ccccc1OCC(=O)N1CC...,100000.0
3,CHEMBL246446,O=C(O)c1ccc2nc(-c3ccco3)c(-c3ccco3)nc2c1,50970.0
4,CHEMBL1383455,CCn1nc([N+](=O)[O-])c(C(C#N)c2nc3ccccc3n2C)c(C...,100000.0
...,...,...,...
1314,CHEMBL5421390,CN([C@@H](Cc1ccc(C(=N)N)cc1)C(=O)N[C@@H](CCCCN...,200.0
1315,CHEMBL5421971,CN([C@@H](Cc1ccc(C(=N)N)cc1)C(=O)N[C@@H](CCCCN...,120.0
1316,CHEMBL5413511,CN([C@@H](Cc1ccc(C(=N)N)cc1)C(=O)N[C@@H](CCCCN...,89.0
1317,CHEMBL164,O=c1c(O)c(-c2cc(O)c(O)c(O)c2)oc2cc(O)cc(O)c12,8460.0


In [31]:
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class')
df3 = pd.concat([df3, bioactivity_class], axis=1)
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class,bioactivity_class.1
0,CHEMBL1401841,COc1ccc2nc3cccc(OC)c3nc2c1,100000.0,inactive,inactive
1,CHEMBL1608853,O=C(O)c1ccc2c(c1)C(=O)/C(=C\c1ccco1)C2=O,12310.0,inactive,inactive
2,CHEMBL1429799,O=C1NN(c2ccccc2)C(=O)/C1=C\c1ccccc1OCC(=O)N1CC...,100000.0,inactive,inactive
3,CHEMBL246446,O=C(O)c1ccc2nc(-c3ccco3)c(-c3ccco3)nc2c1,50970.0,inactive,inactive
4,CHEMBL1383455,CCn1nc([N+](=O)[O-])c(C(C#N)c2nc3ccccc3n2C)c(C...,100000.0,inactive,inactive
...,...,...,...,...,...
1195,,,,intermediate,intermediate
1197,,,,active,active
1199,,,,intermediate,intermediate
1201,,,,active,active


### ***Saving Data frame to CSV file***

In [32]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)

In [33]:
! ls -l

total 960
-rw-r--r-- 1 root root 850407 May  3 09:15 bioactivity_data.csv
-rw-r--r-- 1 root root 122117 May  3 09:37 bioactivity_preprocessed_data.csv
drwx------ 6 root root   4096 May  3 09:17 gdrive
drwxr-xr-x 1 root root   4096 Apr 30 13:37 sample_data


In [34]:
! cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [35]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data.csv  bioactivity_preprocessed_data.csv
