## **COMPUTATIONAL DRUG DISCOVERY**

**PART 1 (DOWNLOAD THE BIOACTIVITY DATA)**

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and its data spans 13,000 targets, 1,800 cells and 33,000 indications.


# **Installing Libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [12]:
! pip install chembl_webresource_client



# **Importing Libraries**

In [13]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

# **Search for Target Protein**

**Target search for cancer (query: PHD fingers)**

In [14]:
# Search the ChEMBL database for targets matching the query 'Phd fingers'
target = new_client.target
target_query = target.search('Phd fingers')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,PHD finger protein 13,16.0,False,CHEMBL1764945,"[{'accession': 'Q86YI8', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,PHD finger protein 23,14.0,False,CHEMBL2424508,"[{'accession': 'Q9BUL5', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Homo sapiens,Bromodomain and PHD finger-containing protein 3,13.0,False,CHEMBL3108644,"[{'accession': 'Q9ULD4', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Homo sapiens,Peregrin,13.0,False,CHEMBL3132741,"[{'accession': 'P55201', 'component_descriptio...",SINGLE PROTEIN,9606
4,[],Homo sapiens,Lysine-specific demethylase PHF2,12.0,False,CHEMBL4295672,"[{'accession': 'O75151', 'component_descriptio...",SINGLE PROTEIN,9606
...,...,...,...,...,...,...,...,...,...
161,[],Homo sapiens,Baculoviral IAP repeat-containing protein 2/Al...,2.0,False,CHEMBL4802032,"[{'accession': 'P15121', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
162,[],Homo sapiens,E3 ubiquitin-protein ligase CBL-B,2.0,False,CHEMBL4879459,"[{'accession': 'Q13191', 'component_descriptio...",SINGLE PROTEIN,9606
163,[],Homo sapiens,CRL4(CRBN) E3 ubiquitin ligase,1.0,False,CHEMBL3833061,"[{'accession': 'Q96SW2', 'component_descriptio...",PROTEIN COMPLEX,9606
164,[],Homo sapiens,Baculoviral IAP repeat-containing protein 2/BC...,1.0,False,CHEMBL4296119,"[{'accession': 'P00519', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606


## Select and retrieve bioactivity data for PHD finger protein 23

**We will select the second entry (index 1), which corresponds to PHD finger protein 23**

In [15]:
selected_target = targets.target_chembl_id[1]
selected_target

'CHEMBL2424508'
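
Selecting by position depends on the order in which the search returns its hits. A minimal sketch of selecting by name instead is shown below. It assumes a small stand-in DataFrame whose rows are copied from the search output above; `targets_demo` and `select_target_id` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for the search result; rows copied from the output above.
targets_demo = pd.DataFrame({
    "pref_name": ["PHD finger protein 13", "PHD finger protein 23", "Peregrin"],
    "target_chembl_id": ["CHEMBL1764945", "CHEMBL2424508", "CHEMBL3132741"],
    "target_type": ["SINGLE PROTEIN"] * 3,
})

def select_target_id(frame, pref_name):
    """Return the ChEMBL ID of the single row whose pref_name matches exactly."""
    matches = frame.loc[frame["pref_name"] == pref_name, "target_chembl_id"]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one match for {pref_name!r}, found {len(matches)}"
        )
    return matches.iloc[0]

selected_demo = select_target_id(targets_demo, "PHD finger protein 23")
print(selected_demo)
```

Raising an error when the name matches zero or several rows makes a changed search result visible immediately instead of silently picking a different target.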

**Now we will retrieve only the bioactivity data for PHD finger protein 23 (CHEMBL2424508) that are reported as IC50 values. The standard_value column holds these values in nM (nanomolar) units.**

In [16]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type = "IC50")

In [17]:
df = pd.DataFrame.from_dict(res)


In [18]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,13444773,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
1,,,13444774,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
2,,,13444775,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
3,,,13444776,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
4,,,13444777,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,,,13965841,[],CHEMBL3136677,Inhibition of GST-tagged PHF23 (unknown origin...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,30.0
77,,,13965842,[],CHEMBL3136677,Inhibition of GST-tagged PHF23 (unknown origin...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,30.0
78,,,13965843,[],CHEMBL3136677,Inhibition of GST-tagged PHF23 (unknown origin...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,30.0
79,,,13965844,[],CHEMBL3136677,Inhibition of GST-tagged PHF23 (unknown origin...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,30.0


In [19]:
df.head(3)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,13444773,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
1,,,13444774,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
2,,,13444775,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0


In [20]:
df.standard_type.unique()

array(['IC50'], dtype=object)
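
The filter only restricts `standard_type`, so it can be worth confirming the units as well before applying any thresholds. The sketch below uses toy rows that mirror the 10 uM and 30 uM records in this notebook. It assumes that `standard_units` is `nM` for these records and that the values arrive as strings; both are assumptions to verify against the real `df`, and `demo` is a hypothetical name.

```python
import pandas as pd

# Toy rows; the values mirror records shown in this notebook (10 uM and 30 uM).
# The standard_units entries and the string typing are assumptions.
demo = pd.DataFrame({
    "value": ["10.0", "30.0"],
    "units": ["uM", "uM"],
    "standard_value": ["10000.0", "30000.0"],
    "standard_units": ["nM", "nM"],
})

# Convert the string column to floats; unparseable entries become NaN.
demo["standard_value"] = pd.to_numeric(demo["standard_value"], errors="coerce")

# List the units present so a mixed-unit dataset is noticed early.
units_found = demo["standard_units"].unique().tolist()
print(units_found)

# 1 uM = 1000 nM, so the reported and standardized values should agree.
reported_nm = pd.to_numeric(demo["value"]) * 1000
consistent = bool((reported_nm == demo["standard_value"]).all())
print(consistent)
```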

**Finally, we will save the resulting bioactivity data to a CSV file, "bioactivity_data.csv".**

In [21]:
df.to_csv('bioactivity_data.csv', index=False)

# **What we have so far in the CSV file**

In [22]:
!ls

bioactivity_data.csv  sample_data


***See what I have extracted so far***

In [23]:
! head bioactivity_data.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,,13444773,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 mins by AlphaScreen assay,B,,,BAO_0000190,BAO_0000357,single protein format,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)c(Nc4ccccc4)c3)cc2C1,,,CHEMBL2424619,J Med Chem,2013,,CHEMBL2424677,,CHEMBL2424677,,0,http://www.openphacts.org/un

# **Handling Missing Data**

If any compound has a missing value in the 'standard_value' column, drop it.

In [24]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,13444773,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
1,,,13444774,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
2,,,13444775,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
3,,,13444776,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
4,,,13444777,[],CHEMBL2429071,Inhibition of PHF23 (unknown origin) after 30 ...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,,,13965841,[],CHEMBL3136677,Inhibition of GST-tagged PHF23 (unknown origin...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,30.0
77,,,13965842,[],CHEMBL3136677,Inhibition of GST-tagged PHF23 (unknown origin...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,30.0
78,,,13965843,[],CHEMBL3136677,Inhibition of GST-tagged PHF23 (unknown origin...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,30.0
79,,,13965844,[],CHEMBL3136677,Inhibition of GST-tagged PHF23 (unknown origin...,B,,,BAO_0000190,...,Homo sapiens,PHD finger protein 23,9606,,,IC50,uM,UO_0000065,,30.0


**This dataset appears to have no missing data, but it is important to keep this cell in case some inputs lack a standard_value.**
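
As an illustration of what the filter does when gaps are present, here is a small sketch on made-up rows. The molecule IDs are hypothetical, and also requiring `canonical_smiles` is an extra precaution added for this example, not something the cell above does.

```python
import pandas as pd

# Made-up rows: MOL_B lacks a structure, MOL_C lacks a measurement.
demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "canonical_smiles": ["CCO", None, "c1ccccc1"],
    "standard_value": ["10000.0", "30000.0", None],
})

# Keep only rows where both required columns are present.
cleaned = demo.dropna(subset=["standard_value", "canonical_smiles"])
cleaned = cleaned.reset_index(drop=True)

# Report how many rows were removed so the filtering is not silent.
dropped = len(demo) - len(cleaned)
print(f"dropped {dropped} of {len(demo)} rows")
```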

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data consists of IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those with values strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [25]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
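
The same rule can also be written without an explicit loop. The sketch below is one possible vectorized version using `numpy.select`, with the same boundaries as the loop above (10,000 nM and above is inactive, 1,000 nM and below is active). `label_bioactivity` is a hypothetical helper name, and the sketch assumes missing values have already been dropped, since a NaN would otherwise fall through to "intermediate".

```python
import numpy as np
import pandas as pd

def label_bioactivity(values_nm):
    """Label IC50 values (in nM) as active, intermediate or inactive."""
    # Accept strings or numbers; raise if something cannot be parsed.
    values = pd.to_numeric(pd.Series(values_nm), errors="raise")
    # Conditions are checked in order, mirroring the if/elif/else chain.
    conditions = [values >= 10000, values <= 1000]
    labels = np.select(conditions, ["inactive", "active"], default="intermediate")
    return labels.tolist()

demo_labels = label_bioactivity(["500", "1000", "5000", "10000", "30000.0"])
print(demo_labels)
```

The boundary values 1000 and 10000 are included in the input on purpose, because they are where the loop and a "less than / greater than" reading of the rule would disagree.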

**Copy the 'molecule_chembl_id', 'canonical_smiles' and 'standard_value' columns into lists**

In [26]:
# Copy the molecule IDs into a plain Python list
mol_cid = df2.molecule_chembl_id.tolist()

In [27]:
# Copy the SMILES strings into a plain Python list
canonical_smiles = df2.canonical_smiles.tolist()

In [28]:
# Copy the standardized IC50 values into a plain Python list
standard_value = df2.standard_value.tolist()

# **Combine the lists into a dataframe**

In [29]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class,standard_value))
df3 = pd.DataFrame(data_tuples, columns = ['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])

In [30]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,CHEMBL2424677,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,inactive,10000.0
1,CHEMBL2426376,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,inactive,10000.0
2,CHEMBL2426375,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,inactive,10000.0
3,CHEMBL2426374,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CC5CCC(C4)C5N4CC...,inactive,10000.0
4,CHEMBL2426373,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(N5CCCC5)CC4)...,inactive,10000.0
...,...,...,...,...
76,CHEMBL3134132,c1cc(N2CCC(N3CCCCC3)CC2)ccc1CCN1CCCC1,inactive,30000.0
77,CHEMBL2426364,O=C(c1ccc(C(=O)N2CCC(N3CCCC3)CC2)c(Nc2ccccc2)c...,inactive,30000.0
78,CHEMBL2426474,O=C(c1ccc(C(=O)N2CCC(N3CCCCC3)CC2)cc1)N1CCC(N2...,inactive,30000.0
79,CHEMBL1235119,O=C(c1cncc(Br)c1)N1CCC(N2CCCC2)CC1,inactive,30000.0
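
An equivalent table can also be built by selecting the columns directly rather than going through intermediate lists. The sketch below shows this on two rows taken from this notebook; `demo`, `demo_classes` and `compact` are hypothetical names, and the labels are assumed to have been computed already in row order.

```python
import pandas as pd

# Two rows from this notebook plus one extra column that should be left out.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL2424677", "CHEMBL1235119"],
    "canonical_smiles": [
        "CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)c(Nc4ccccc4)c3)cc2C1",
        "O=C(c1cncc(Br)c1)N1CCC(N2CCCC2)CC1",
    ],
    "standard_value": ["10000.0", "30000.0"],
    "assay_type": ["B", "B"],
})
demo_classes = ["inactive", "inactive"]  # labels in the same row order

# Select the three columns of interest and copy to avoid modifying `demo`.
selection = ["molecule_chembl_id", "canonical_smiles", "standard_value"]
compact = demo[selection].copy()

# Insert the class labels before standard_value to match the column order above.
compact.insert(2, "bioactivity_class", demo_classes)
print(compact.columns.tolist())
```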


# **Save the DataFrame to CSV file**

In [31]:
df3.to_csv('bioactivity_pre-processed_data.csv' , index=False)

In [32]:
!ls

bioactivity_data.csv  bioactivity_pre-processed_data.csv  sample_data


In [33]:
! head bioactivity_pre-processed_data.csv

molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
CHEMBL2424677,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)c(Nc4ccccc4)c3)cc2C1,inactive,10000.0
CHEMBL2426376,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)cc3Nc3ccccc3)cc2C1,inactive,10000.0
CHEMBL2426375,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)cc3)cc2C1,inactive,10000.0
CHEMBL2426374,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CC5CCC(C4)C5N4CCCC4)cc3)cc2C1,inactive,10000.0
CHEMBL2426373,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(N5CCCC5)CC4)cc3)cc2C1,inactive,10000.0
CHEMBL2426368,O=C(c1ccc(C(=O)N2CCC(N3CCCC3)CC2)c(-c2ccccc2)c1)N1CCC(N2CCCC2)CC1,inactive,10000.0
CHEMBL2426367,O=C(c1ccc(C(=O)N2CCC(N3CCCC3)CC2)c(Oc2ccccc2)c1)N1CCC(N2CCCC2)CC1,inactive,10000.0
CHEMBL2426366,O=C(c1ccc(C(=O)N2CCC(N3CCCC3)CC2)c(Cc2ccccc2)c1)N1CCC(N2CCCC2)CC1,inactive,10000.0
CHEMBL2426365,O=C(c1ccc(C(=O)N2CCC(N3CCCC3)CC2)c(NCc2ccccc2)c1)N1CCC(N2CCCC2)CC1,inactive,10000.0
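
A quick way to check the saved file is to load it back and count the compounds in each class. The sketch below does this on the first three data rows shown above, read from an in-memory string so that it runs without the file; with the real data, `pd.read_csv('bioactivity_pre-processed_data.csv')` would take its place.

```python
import io
import pandas as pd

# First three data rows of the pre-processed CSV, copied from the output above.
csv_text = """molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
CHEMBL2424677,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)c(Nc4ccccc4)c3)cc2C1,inactive,10000.0
CHEMBL2426376,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)cc3Nc3ccccc3)cc2C1,inactive,10000.0
CHEMBL2426375,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)cc3)cc2C1,inactive,10000.0
"""

reloaded = pd.read_csv(io.StringIO(csv_text))

# Count the compounds per class to spot class imbalance early.
class_counts = reloaded["bioactivity_class"].value_counts().to_dict()
print(class_counts)
```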
