# **Bioinformatics Project - Computational Drug Discovery**


In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

This part covers data collection and pre-processing of bioactivity data from the ChEMBL Database.

Notes for this concise version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [30]:
! pip install chembl_webresource_client



## **Importing libraries**

In [31]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for pancreas**

In [32]:
# Target search for pancreas
target = new_client.target
target_query = target.search('pancreas')
targets = pd.DataFrame.from_dict(target_query)
targets.head(10)

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Pancreas,19.0,False,CHEMBL613587,[],TISSUE,10090
1,[],Rattus norvegicus,Pancreas,19.0,False,CHEMBL613650,[],TISSUE,10116
2,[],Homo sapiens,Carboxypeptidase B,18.0,False,CHEMBL2552,"[{'accession': 'P15086', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Homo sapiens,Kallikrein 1,17.0,False,CHEMBL2319,"[{'accession': 'P06870', 'component_descriptio...",SINGLE PROTEIN,9606
4,[],Homo sapiens,Protein disulfide-isomerase A2,17.0,False,CHEMBL4739853,"[{'accession': 'Q13087', 'component_descriptio...",SINGLE PROTEIN,9606
5,[],Homo sapiens,Tryptase,6.0,False,CHEMBL2095193,"[{'accession': 'Q9BZJ3', 'component_descriptio...",PROTEIN FAMILY,9606
6,[],Homo sapiens,Trypsin,6.0,False,CHEMBL2095204,"[{'accession': 'P07477', 'component_descriptio...",PROTEIN FAMILY,9606
7,[],Sus scrofa,Pancreatic elastase,6.0,False,CHEMBL2096984,"[{'accession': 'P08419', 'component_descriptio...",PROTEIN FAMILY,9823
8,[],Homo sapiens,Thrombin & trypsin,6.0,False,CHEMBL2096988,"[{'accession': 'P00734', 'component_descriptio...",SELECTIVITY GROUP,9606
9,[],Homo sapiens,Coagulation factor VII and X,6.0,False,CHEMBL2111412,"[{'accession': 'P00742', 'component_descriptio...",SELECTIVITY GROUP,9606
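
Picking a target by its row position depends on the search ranking, which may differ between ChEMBL releases. The sketch below shows one possible way to look a target up by name and organism instead. It assumes the column names shown in the output above; `sample_targets` is a hand-copied subset of those rows so that the example runs offline, and `find_target_id` is a hypothetical helper, not part of the ChEMBL client.

```python
import pandas as pd

# Hand-copied subset of the search results shown above (offline stand-in
# for the `targets` DataFrame returned by the live query).
sample_targets = pd.DataFrame(
    {
        "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens", "Sus scrofa"],
        "pref_name": ["Pancreas", "Carboxypeptidase B", "Trypsin", "Pancreatic elastase"],
        "target_chembl_id": ["CHEMBL613587", "CHEMBL2552", "CHEMBL2095204", "CHEMBL2096984"],
        "target_type": ["TISSUE", "SINGLE PROTEIN", "PROTEIN FAMILY", "PROTEIN FAMILY"],
    }
)


def find_target_id(targets, pref_name, organism="Homo sapiens"):
    """Return the ChEMBL ID of the single row matching name and organism.

    Hypothetical helper: raises ValueError when zero or several rows match,
    so an ambiguous search result is never selected silently.
    """
    mask = (targets["pref_name"] == pref_name) & (targets["organism"] == organism)
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 match for {pref_name!r}, found {len(matches)}")
    return matches.iloc[0]


# Keep only human entries, then resolve one target by its preferred name.
human_targets = sample_targets[sample_targets["organism"] == "Homo sapiens"]
trypsin_id = find_target_id(sample_targets, "Trypsin")
print(trypsin_id)
```

With the live `targets` DataFrame, the same helper could be called as `find_target_id(targets, "Trypsin")`, which makes the intended target explicit in the code.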


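### **Select and retrieve bioactivity data for the seventh entry (*Trypsin*)**

We will assign the seventh entry (index 6), which the search results above list as *Trypsin* (*Homo sapiens*, protein family), to the ***selected_target*** variable. Note that this entry is not pancreatic lipase, even though the file names used later in this notebook keep the `pancreaticlipase_` prefix.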

In [33]:
selected_target = targets.target_chembl_id[6]
selected_target

'CHEMBL2095204'

Here, we will retrieve only bioactivity data for the selected target (CHEMBL2095204, listed above as *Trypsin*) that are reported as IC50 values.

In [34]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [35]:
df = pd.DataFrame.from_dict(res)

KeyboardInterrupt: 

In [None]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,34309,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
1,,,34315,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
2,,,37941,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
3,,,40292,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
4,,,40298,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2217,,,25655408,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5372071,Inhibition of trypsin (unknown origin) at 50 u...,B,,,BAO_0000201,...,Homo sapiens,Trypsin,9606,,,INH,%,UO_0000187,,5.0
2218,,,25655409,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5372071,Inhibition of trypsin (unknown origin) at 50 u...,B,,,BAO_0000201,...,Homo sapiens,Trypsin,9606,,,INH,%,UO_0000187,,5.0
2219,"{'action_type': 'INHIBITOR', 'description': 'N...",,25702957,[],CHEMBL5385448,Binding affinity to trypsin (unknown origin) a...,B,,,BAO_0000192,...,Homo sapiens,Trypsin,9606,,,Ki,uM,UO_0000065,,0.18
2220,,,25702988,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5385455,Inhibition of human recombinant trypsin using ...,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,30.0
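
The `type` column in the output above contains INH and Ki rows alongside IC50, which suggests the frame may hold more than IC50 measurements. A minimal sketch of a local check and filter follows. It assumes the retrieved records carry `standard_type`, `standard_units` and `standard_value` columns (they are hidden behind the `...` in the display above); the toy frame and its `CHEMBL_A`-style IDs are placeholders, not real data.

```python
import pandas as pd

# Toy stand-in for `df`. The IDs are placeholders and the column names
# are assumed to match those of the retrieved activity records.
sample_activities = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C", "CHEMBL_D"],
        "standard_type": ["IC50", "Ki", "INH", "IC50"],
        "standard_units": ["nM", "nM", "%", "nM"],
        "standard_value": ["10000.0", "180.0", "5.0", "30000.0"],
    }
)

# Inspect which measurement types are present before relying on them.
print(sample_activities["standard_type"].value_counts())

# Keep only IC50 rows expressed in nM so later thresholds are comparable.
is_ic50_nm = (sample_activities["standard_type"] == "IC50") & (
    sample_activities["standard_units"] == "nM"
)
ic50_only = sample_activities[is_ic50_nm].reset_index(drop=True)
print(ic50_only)
```

Filtering locally like this does not replace the server-side filter; it only guards the later labeling step against mixed measurement types.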


Finally, we will save the resulting bioactivity data to the CSV file **pancreaticlipase_01_bioactivity_data_raw.csv**.

In [None]:
df.to_csv('pancreaticlipase_01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [None]:
df2 = df.dropna(subset=['canonical_smiles', 'standard_value'])
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,34309,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
1,,,34315,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
2,,,37941,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
3,,,40292,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
4,,,40298,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2217,,,25655408,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5372071,Inhibition of trypsin (unknown origin) at 50 u...,B,,,BAO_0000201,...,Homo sapiens,Trypsin,9606,,,INH,%,UO_0000187,,5.0
2218,,,25655409,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5372071,Inhibition of trypsin (unknown origin) at 50 u...,B,,,BAO_0000201,...,Homo sapiens,Trypsin,9606,,,INH,%,UO_0000187,,5.0
2219,"{'action_type': 'INHIBITOR', 'description': 'N...",,25702957,[],CHEMBL5385448,Binding affinity to trypsin (unknown origin) a...,B,,,BAO_0000192,...,Homo sapiens,Trypsin,9606,,,Ki,uM,UO_0000065,,0.18
2220,,,25702988,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5385455,Inhibition of human recombinant trypsin using ...,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,30.0
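
`dropna` only removes entries that pandas already recognizes as missing. If **standard_value** arrives as text, a non-numeric entry would survive the drop. The sketch below shows one possible guard on a toy frame (the SMILES strings and values are made up for illustration): coerce the column to numbers first, then drop.

```python
import pandas as pd

# Toy frame: one clean row, one missing SMILES, one missing value,
# and one non-numeric value stored as text.
sample = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
        "standard_value": ["10000.0", "250.0", None, "not reported"],
    }
)

cleaned = sample.copy()
# Unparseable entries become NaN instead of raising an error.
cleaned["standard_value"] = pd.to_numeric(cleaned["standard_value"], errors="coerce")
# Same dropna call as in the notebook, now also catching the coerced NaN.
cleaned = cleaned.dropna(subset=["canonical_smiles", "standard_value"])
print(cleaned)
```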


In [None]:
len(df2.canonical_smiles.unique())

1610

In [None]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,34309,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
1,,,34315,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
2,,,37941,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
3,,,40292,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
4,,,40298,[],CHEMBL815474,Binding affinity against human trypsin,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2198,"{'action_type': 'INHIBITOR', 'description': 'N...",,25655384,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5372066,Inhibition of trypsin (unknown origin) assesse...,B,,,BAO_0000193,...,Homo sapiens,Trypsin,9606,,,Ratio,,,,1.0
2199,"{'action_type': 'INHIBITOR', 'description': 'N...",,25655385,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5372066,Inhibition of trypsin (unknown origin) assesse...,B,,,BAO_0000193,...,Homo sapiens,Trypsin,9606,,,Ratio,,,,1.0
2219,"{'action_type': 'INHIBITOR', 'description': 'N...",,25702957,[],CHEMBL5385448,Binding affinity to trypsin (unknown origin) a...,B,,,BAO_0000192,...,Homo sapiens,Trypsin,9606,,,Ki,uM,UO_0000065,,0.18
2220,,,25702988,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5385455,Inhibition of human recombinant trypsin using ...,B,,,BAO_0000190,...,Homo sapiens,Trypsin,9606,,,IC50,uM,UO_0000065,,30.0
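
`drop_duplicates` keeps the first row for each SMILES string and discards the others, so the retained **standard_value** depends on row order. As a hedged alternative, repeated measurements could be summarized instead, for example with the median. The sketch below compares both on a toy frame with placeholder IDs and made-up values; it is not the approach used in this notebook.

```python
import pandas as pd

# Toy frame: compound A was measured twice, compound B once.
sample = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B"],
        "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
        "standard_value": [100.0, 300.0, 5000.0],
    }
)

# Notebook approach: keep the first occurrence of each SMILES.
first_kept = sample.drop_duplicates(["canonical_smiles"])

# Alternative: one row per compound, holding the median of its measurements.
median_per_compound = sample.groupby(
    ["molecule_chembl_id", "canonical_smiles"], as_index=False
)["standard_value"].median()

print(first_kept)
print(median_per_compound)
```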


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL337921,Cc1cc(NC(=O)Cc2ccc3[nH]c(-c4ccc(Cl)s4)nc3c2)cc...,10000.0
1,CHEMBL340500,Cc1cc(NC(=O)Cc2ccc3[nH]c(-c4ccc(Cl)s4)nc3c2)cc...,10000.0
2,CHEMBL124846,O=C(Cc1ccc2[nH]c(-c3ccc(Cl)s3)nc2c1)Nc1ccc(N2C...,10000.0
3,CHEMBL331231,CS(=O)(=O)c1ccccc1-c1ccc(NC(=O)Cc2ccc3[nH]c(-c...,10000.0
4,CHEMBL124705,O=C(Cc1ccc2[nH]c(-c3ccc(Cl)s3)nc2c1)Nc1ccc(N2C...,10000.0
...,...,...,...
2198,CHEMBL5422704,CC(C)(C)OC(=O)NCCCC[C@H](NC(=O)OCC1c2ccccc2-c2...,1.0
2199,CHEMBL5414967,CC(C)(C)OC(=O)n1cc(C[C@H](NC(=O)OCC2c3ccccc3-c...,1.0
2219,CHEMBL3901198,CCCc1c(-c2csc(-c3cc(C(=N)N)sc3SC)n2)cnn1-c1ccccn1,180.0
2220,CHEMBL5431053,Cc1ccccc1C(=O)N1CCN(c2ccc3c(N)nncc3c2)CC1,30000.0


Save the DataFrame to a CSV file.

In [None]:
df3.to_csv('pancreaticlipase_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The **standard_value** column is treated as IC50 in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values strictly between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [None]:
df4 = pd.read_csv('pancreaticlipase_02_bioactivity_data_preprocessed.csv')

In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
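
The loop above can also be written without explicit iteration. The sketch below uses `numpy.select` with the same two cut-offs and the same inclusive comparisons; the five test values are made up to exercise both boundaries.

```python
import numpy as np
import pandas as pd

# Made-up IC50 values (nM) covering both boundaries and each class.
values = pd.Series([500.0, 1000.0, 5000.0, 10000.0, 30000.0], name="standard_value")

# Conditions are checked in order, mirroring the if/elif/else chain.
conditions = [values >= 10000, values <= 1000]
labels = ["inactive", "active"]
classes = pd.Series(
    np.select(conditions, labels, default="intermediate"), name="class"
)
print(classes.tolist())
```

Applied to the notebook's data, `values` would be `df4.standard_value`, and `classes` would play the role of `bioactivity_class`.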

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL337921,Cc1cc(NC(=O)Cc2ccc3[nH]c(-c4ccc(Cl)s4)nc3c2)cc...,10000.0,inactive
1,CHEMBL340500,Cc1cc(NC(=O)Cc2ccc3[nH]c(-c4ccc(Cl)s4)nc3c2)cc...,10000.0,inactive
2,CHEMBL124846,O=C(Cc1ccc2[nH]c(-c3ccc(Cl)s3)nc2c1)Nc1ccc(N2C...,10000.0,inactive
3,CHEMBL331231,CS(=O)(=O)c1ccccc1-c1ccc(NC(=O)Cc2ccc3[nH]c(-c...,10000.0,inactive
4,CHEMBL124705,O=C(Cc1ccc2[nH]c(-c3ccc(Cl)s3)nc2c1)Nc1ccc(N2C...,10000.0,inactive
...,...,...,...,...
1605,CHEMBL5422704,CC(C)(C)OC(=O)NCCCC[C@H](NC(=O)OCC1c2ccccc2-c2...,1.0,active
1606,CHEMBL5414967,CC(C)(C)OC(=O)n1cc(C[C@H](NC(=O)OCC2c3ccccc3-c...,1.0,active
1607,CHEMBL3901198,CCCc1c(-c2csc(-c3cc(C(=N)N)sc3SC)n2)cnn1-c1ccccn1,180.0,active
1608,CHEMBL5431053,Cc1ccccc1C(=O)N1CCN(c2ccc3c(N)nncc3c2)CC1,30000.0,inactive
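
Before training a model on these labels, it can help to check how the three classes are balanced. A minimal sketch on a toy **class** column (the labels below are made up, not taken from the curated data):

```python
import pandas as pd

# Toy label column standing in for df5["class"].
sample = pd.DataFrame(
    {"class": ["inactive", "inactive", "active", "intermediate", "active", "inactive"]}
)

# Absolute counts and relative frequencies per class.
counts = sample["class"].value_counts()
fractions = sample["class"].value_counts(normalize=True)
print(counts)
print(fractions)
```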


Save the DataFrame to a CSV file.

In [None]:
df5.to_csv('pancreaticlipase_03_bioactivity_data_curated.csv', index=False)

In [None]:
! zip pancreaticlipase.zip *.csv

updating: pancreaticlipase_01_bioactivity_data_raw.csv (deflated 92%)
updating: pancreaticlipase_02_bioactivity_data_preprocessed.csv (deflated 84%)
updating: pancreaticlipase_03_bioactivity_data_curated.csv (deflated 85%)


In [None]:
! ls -l

total 5408
-rw-r--r--  1 clara  staff    99581 Jan 15 09:51 CDD_ML_Pancreaticlipase_Bioactivity_Data_Concised.ipynb
-rw-r--r--  1 clara  staff       25 Jan 14 15:12 README.md
-rw-r--r--  1 clara  staff      804 Jan 15 08:33 main.py
-rw-r--r--  1 clara  staff   153674 Jan 15 09:52 pancreaticlipase.zip
-rw-r--r--  1 clara  staff  1257563 Jan 15 09:52 pancreaticlipase_01_bioactivity_data_raw.csv
-rw-r--r--  1 clara  staff   150266 Jan 15 09:52 pancreaticlipase_02_bioactivity_data_preprocessed.csv
-rw-r--r--  1 clara  staff   165700 Jan 15 09:52 pancreaticlipase_03_bioactivity_data_curated.csv


---