# **Bioinformatics Project - Part 1: Computational Drug Discovery Data (Concise Version)**

In this notebook I collect and pre-process bioactivity data from the ChEMBL Database.

Notes for this concise version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [3]:
! pip install chembl_webresource_client



## **Importing libraries**

In [4]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

Directory to save data

In [5]:
data_folder = "data"
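The later `to_csv` calls write into `data_folder`, and pandas does not create missing directories on its own, so it may help to create the folder up front. A minimal sketch using only the standard library:

```python
import os

data_folder = "data"  # same folder name as in the cell above

# Create the folder if needed; exist_ok=True makes re-running the cell harmless
os.makedirs(data_folder, exist_ok=True)
```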

## **Search for Target protein**

### **Target search for pancreas**

In [6]:
# Target search 
search_label = 'pancreas'
target = new_client.target
target_query = target.search(search_label)
targets = pd.DataFrame.from_dict(target_query)
targets.to_csv(data_folder + '/' + search_label + '_targets.csv', index=False)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Pancreas,19.0,False,CHEMBL613587,[],TISSUE,10090
1,[],Rattus norvegicus,Pancreas,19.0,False,CHEMBL613650,[],TISSUE,10116
2,[],Homo sapiens,Carboxypeptidase B,18.0,False,CHEMBL2552,"[{'accession': 'P15086', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Homo sapiens,Kallikrein 1,17.0,False,CHEMBL2319,"[{'accession': 'P06870', 'component_descriptio...",SINGLE PROTEIN,9606
4,[],Homo sapiens,Protein disulfide-isomerase A2,17.0,False,CHEMBL4739853,"[{'accession': 'Q13087', 'component_descriptio...",SINGLE PROTEIN,9606
...,...,...,...,...,...,...,...,...,...
103,[],Homo sapiens,Hepatocyte growth factor activator,5.0,False,CHEMBL3351190,"[{'accession': 'Q04756', 'component_descriptio...",SINGLE PROTEIN,9606
104,[],Mus musculus,Suppressor of tumorigenicity 14 protein homolog,5.0,False,CHEMBL3745587,"[{'accession': 'P56677', 'component_descriptio...",SINGLE PROTEIN,10090
105,[],Homo sapiens,Urokinase-type plasminogen activator/surface r...,5.0,False,CHEMBL3883324,"[{'accession': 'P00749', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
106,[],Homo sapiens,Coagulation factor IX/VIII,5.0,False,CHEMBL4296076,"[{'accession': 'P00740', 'component_descriptio...",PROTEIN COMPLEX,9606


### **Select and retrieve bioactivity data for *Human Kallikrein 1* (fourth entry)**

We assign the fourth entry (index 3, which corresponds to the target protein *Human Kallikrein 1*) to the ***selected_target*** variable.

In [7]:
target_index = 3
selected_target = targets.target_chembl_id[target_index]
selected_target

'CHEMBL2319'

In [8]:
selected_target_name = targets.pref_name[target_index]
formatted_target_name = selected_target_name.replace(" ", "_")
selected_target_name

'Kallikrein 1'
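Selecting by position (`target_index = 3`) depends on the order of the search results, which may change between ChEMBL releases. As a hedged alternative, the row could be looked up by name and organism instead. The sketch below uses a small hand-built frame with values copied from the search output above; `toy_targets`, `match`, `selected_id` and `selected_name` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the `targets` DataFrame (values copied from the output above)
toy_targets = pd.DataFrame({
    "organism": ["Mus musculus", "Rattus norvegicus", "Homo sapiens", "Homo sapiens"],
    "pref_name": ["Pancreas", "Pancreas", "Carboxypeptidase B", "Kallikrein 1"],
    "target_chembl_id": ["CHEMBL613587", "CHEMBL613650", "CHEMBL2552", "CHEMBL2319"],
    "target_type": ["TISSUE", "TISSUE", "SINGLE PROTEIN", "SINGLE PROTEIN"],
})

# Select by content rather than by row position
match = toy_targets[
    (toy_targets["pref_name"] == "Kallikrein 1")
    & (toy_targets["organism"] == "Homo sapiens")
]

selected_id = match["target_chembl_id"].iloc[0]
selected_name = match["pref_name"].iloc[0].replace(" ", "_")
print(selected_id, selected_name)
```

If no row matches, `.iloc[0]` raises an `IndexError`, which surfaces a changed search result immediately instead of silently picking a different target.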

Here, we retrieve only the bioactivity data for *Human Kallikrein 1* (CHEMBL2319) that are reported as IC50 values.

In [9]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,77433,[],CHEMBL701597,Inhibitory activity against kallikrein was det...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,2.1
1,,,181220,[],CHEMBL701595,In vitro inhibitory activity against kallikrei...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,5.8
2,,,265265,[],CHEMBL701763,Tested in vitro for the inhibitory activity ag...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,2.1
3,,,284942,[],CHEMBL701763,Tested in vitro for the inhibitory activity ag...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,0.25
4,,,318861,[],CHEMBL703574,Compound was evaluated for inhibitory activity...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,9.89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349,,,25003710,[],CHEMBL5234025,Inhibition of human KLK1 using H-DVal-Leu-Arg-...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,1.0
350,,,25003711,[],CHEMBL5234025,Inhibition of human KLK1 using H-DVal-Leu-Arg-...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,10.0
351,"{'action_type': 'INHIBITOR', 'description': 'N...",,25003712,[],CHEMBL5234025,Inhibition of human KLK1 using H-DVal-Leu-Arg-...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,8.4
352,"{'action_type': 'INHIBITOR', 'description': 'N...",,25003713,[],CHEMBL5234025,Inhibition of human KLK1 using H-DVal-Leu-Arg-...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,6.1


Finally, we save the resulting bioactivity data to a CSV file named after the target, here **Kallikrein_1_01_bioactivity_data_raw.csv**.

In [10]:
df.to_csv(data_folder + '/' + formatted_target_name + '_01_bioactivity_data_raw.csv', index=False)
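The same path pattern (`data_folder`, target name, stage suffix) is rebuilt by string concatenation for each saved file. One possible way to keep it in a single place is a small helper; `output_path` is a hypothetical name introduced here for illustration, and the two variables are re-declared only so the sketch runs on its own.

```python
import os

data_folder = "data"
formatted_target_name = "Kallikrein_1"

def output_path(suffix):
    """Hypothetical helper: build '<data_folder>/<target>_<suffix>.csv'."""
    return os.path.join(data_folder, f"{formatted_target_name}_{suffix}.csv")

raw_path = output_path("01_bioactivity_data_raw")
print(raw_path)
```

With such a helper, the save step would read `df.to_csv(output_path("01_bioactivity_data_raw"), index=False)`.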

## **Handling missing data**
If a compound is missing a value in the **standard_value** column (the potency of the compound; the lower the value, the more potent) or in the **canonical_smiles** column, drop it.

In [11]:
df2 = df.dropna(subset=['canonical_smiles', 'standard_value'])
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,77433,[],CHEMBL701597,Inhibitory activity against kallikrein was det...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,2.1
1,,,181220,[],CHEMBL701595,In vitro inhibitory activity against kallikrei...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,5.8
2,,,265265,[],CHEMBL701763,Tested in vitro for the inhibitory activity ag...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,2.1
3,,,284942,[],CHEMBL701763,Tested in vitro for the inhibitory activity ag...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,0.25
4,,,318861,[],CHEMBL703574,Compound was evaluated for inhibitory activity...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,9.89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349,,,25003710,[],CHEMBL5234025,Inhibition of human KLK1 using H-DVal-Leu-Arg-...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,1.0
350,,,25003711,[],CHEMBL5234025,Inhibition of human KLK1 using H-DVal-Leu-Arg-...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,10.0
351,"{'action_type': 'INHIBITOR', 'description': 'N...",,25003712,[],CHEMBL5234025,Inhibition of human KLK1 using H-DVal-Leu-Arg-...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,8.4
352,"{'action_type': 'INHIBITOR', 'description': 'N...",,25003713,[],CHEMBL5234025,Inhibition of human KLK1 using H-DVal-Leu-Arg-...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,6.1


In [12]:
len(df2.canonical_smiles.unique())

337

In [13]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,77433,[],CHEMBL701597,Inhibitory activity against kallikrein was det...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,2.1
1,,,181220,[],CHEMBL701595,In vitro inhibitory activity against kallikrei...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,5.8
2,,,265265,[],CHEMBL701763,Tested in vitro for the inhibitory activity ag...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,2.1
3,,,284942,[],CHEMBL701763,Tested in vitro for the inhibitory activity ag...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,0.25
4,,,318861,[],CHEMBL703574,Compound was evaluated for inhibitory activity...,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,9.89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
345,,,25003639,[],CHEMBL5234005,Inhibition of human KLK1,B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,40.0
346,,,25003640,[],CHEMBL5234006,Inhibition of KLK1 (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,4.0
347,,,25003641,[],CHEMBL5234006,Inhibition of KLK1 (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,uM,UO_0000065,,40.0
348,"{'action_type': 'INHIBITOR', 'description': 'N...",,25003642,[],CHEMBL5234006,Inhibition of KLK1 (unknown origin),B,,,BAO_0000190,...,Homo sapiens,Kallikrein 1,9606,,,IC50,nM,UO_0000065,,2.9
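The two cleaning steps above (dropping rows with missing values, then removing duplicate SMILES) can be seen more easily on a toy frame. The compound IDs and values below are made up for illustration and are not ChEMBL data.

```python
import pandas as pd

# Made-up rows: B repeats A's SMILES, C lacks a SMILES, D lacks a value
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D", "E"],
    "canonical_smiles": ["CCO", "CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [2100.0, 250.0, 5800.0, None, 9890.0],
})

# Drop rows missing either column (dropna's default is how="any")
cleaned = toy.dropna(subset=["canonical_smiles", "standard_value"])

# Keep the first row seen for each SMILES string; later duplicates are discarded
unique = cleaned.drop_duplicates(["canonical_smiles"])
print(unique)
```

Because the first occurrence wins, the value retained for a compound that was measured several times depends on row order. Both calls also keep the original index labels, so the index of the result can have gaps until it is reset or the frame is reloaded from CSV.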


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [14]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL174750,NS(=O)(=O)c1ccccc1-c1ccc(C(=O)Nc2ccccc2C(=O)Nc...,2100.0
1,CHEMBL294121,CN1CCN=C1c1ccc(C(=O)N2CCN(S(=O)(=O)c3cc4cc(Cl)...,5800.0
2,CHEMBL432306,N=C(N)c1cccc(OCCNC(=O)c2ccc(-c3ccccc3S(N)(=O)=...,2100.0
3,CHEMBL297835,N=C(N)c1cccc(Oc2ccccc2NC(=O)c2ccc(-c3ccccc3S(N...,250.0
4,CHEMBL10378,O=c1oc(-c2ccccc2I)nc2ccccc12,9890.0
...,...,...,...
345,CHEMBL3897372,Cc1cc(C)n(Cc2ccc(Cn3cc(C(=O)NCc4c(C)cc(N)nc4C)...,40000.0
346,CHEMBL5204354,COc1ccnc(CNC(=O)c2cn(Cc3ccc(Cn4ccccc4=O)cc3)nc...,4000.0
347,CHEMBL5095064,COCc1nn(Cc2ccc(Cn3cc(F)ccc3=O)cc2)cc1C(=O)NCc1...,40000.0
348,CHEMBL5276515,Nc1nn(Cc2ccc(Cn3ccccc3=O)cc2)cc1C(=O)NCCOc1ccc...,2.9


Save the DataFrame to a CSV file.

In [15]:
df3.to_csv(data_folder + '/' + formatted_target_name + '_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values, and the thresholds below are applied to **standard_value**, which here is expressed in nM. Compounds with values of 1,000 nM or less are considered **active**, while those with values of 10,000 nM or more are considered **inactive**. Values between 1,000 and 10,000 nM are labeled **intermediate**.

In [16]:
df4 = pd.read_csv(data_folder + '/' + formatted_target_name + '_02_bioactivity_data_preprocessed.csv')

In [17]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [18]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL174750,NS(=O)(=O)c1ccccc1-c1ccc(C(=O)Nc2ccccc2C(=O)Nc...,2100.0,intermediate
1,CHEMBL294121,CN1CCN=C1c1ccc(C(=O)N2CCN(S(=O)(=O)c3cc4cc(Cl)...,5800.0,intermediate
2,CHEMBL432306,N=C(N)c1cccc(OCCNC(=O)c2ccc(-c3ccccc3S(N)(=O)=...,2100.0,intermediate
3,CHEMBL297835,N=C(N)c1cccc(Oc2ccccc2NC(=O)c2ccc(-c3ccccc3S(N...,250.0,active
4,CHEMBL10378,O=c1oc(-c2ccccc2I)nc2ccccc12,9890.0,intermediate
...,...,...,...,...
332,CHEMBL3897372,Cc1cc(C)n(Cc2ccc(Cn3cc(C(=O)NCc4c(C)cc(N)nc4C)...,40000.0,inactive
333,CHEMBL5204354,COc1ccnc(CNC(=O)c2cn(Cc3ccc(Cn4ccccc4=O)cc3)nc...,4000.0,intermediate
334,CHEMBL5095064,COCc1nn(Cc2ccc(Cn3cc(F)ccc3=O)cc2)cc1C(=O)NCc1...,40000.0,inactive
335,CHEMBL5276515,Nc1nn(Cc2ccc(Cn3ccccc3=O)cc2)cc1C(=O)NCCOc1ccc...,2.9,active
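The labeling loop above could also be written without an explicit loop. The following is a sketch using `numpy.select` with the same thresholds and boundary handling (1,000 counts as active, 10,000 as inactive); the `toy` frame and its values are illustrative, and `pd.to_numeric` is included on the assumption that `standard_value` may arrive as strings in some workflows.

```python
import numpy as np
import pandas as pd

# Illustrative standard_value entries in nM, including both boundary values
toy = pd.DataFrame(
    {"standard_value": [250.0, 2100.0, 9890.0, 40000.0, 1000.0, 10000.0]}
)

# Coerce to numbers in case the column holds strings
values = pd.to_numeric(toy["standard_value"], errors="coerce")

# Conditions are checked in order; rows matching neither get the default label
conditions = [values >= 10000, values <= 1000]
labels = ["inactive", "active"]
toy["class"] = np.select(conditions, labels, default="intermediate")
print(toy)
```

Note that a missing value would match neither condition and fall through to "intermediate", so this form relies on the earlier `dropna` step just as the loop does.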


Save the DataFrame to a CSV file.

In [19]:
df5.to_csv(data_folder + '/' + formatted_target_name + '_03_bioactivity_data_curated.csv', index=False)

In [20]:
# ! zip formatted_target_name.zip *.csv

In [21]:
# ! ls -l

---