# **Bioinformatics Project - Computational Drug Discovery**

In this Jupyter notebook, I will be building a **data science project**. Particularly, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.


The first code below informs you of the directory you are working in.

In [1]:
import os 
os.getcwd()

'C:\\Users\\Mychael\\Downloads'

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) is a database that contains curated bioactivity data of more than 2 million compounds. It is compiled from more than 76,000 documents, 1.2 million assays and the data spans 13,000 targets and 1,800 cells and 33,000 indications.
[Data as of March 2024; ChEMBL version 34].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [3]:
! pip install chembl_webresource_client



## **Importing libraries**

In [6]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Esthrogen Receptor Alpha**

In [10]:
target = new_client.target
target_query = target.search('CHEMBL206')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P19785', 'xref_name': None, 'xre...",Mus musculus,Estrogen receptor alpha,15.0,False,CHEMBL3065,"[{'accession': 'P19785', 'component_descriptio...",SINGLE PROTEIN,10090
1,[],Mus musculus,Estrogen receptor,12.0,False,CHEMBL2094113,"[{'accession': 'O08537', 'component_descriptio...",PROTEIN FAMILY,10090


### **Select and retrieve bioactivity data for *Human Esthrogen Receptor Alpha* (first entry)**

We will assign the first entry (which corresponds to the target protein, *Human Esthrogen Receptor Alpha*) to the ***selected_target*** variable 

In [14]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL3065'

Here, we will retrieve only bioactivity data for *Human Esthrogen Receptor Alpha* (CHEMBL206) that are reported as pChEMBL values.

In [17]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [19]:
df = pd.DataFrame.from_dict(res)

In [20]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1468095,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,1335.0
1,,,1468243,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,1629.0
2,,,1468509,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,132.0
3,,,1468524,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,199.0
4,,,1468538,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,423.0
5,,,1468787,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,2601.0
6,,,1469068,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,746.0
7,,,1469708,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,1089.0
8,,,1469979,[],CHEMBL829541,Inhibition of [3H]17-beta-estradiol binding to...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,2.2
9,,,18083052,[],CHEMBL4018000,Antagonist activity at estrogen receptor in mo...,B,,,BAO_0000190,...,Mus musculus,Estrogen receptor alpha,10090,,,IC50,nM,UO_0000065,,33.2


In [25]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
dfm = df[selection]
dfm.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL187673,Oc1ccc(-c2nc3ccc(O)cc3o2)cc1,1335.0
1,CHEMBL187392,Oc1ccc(-c2nc3cc(O)ccc3o2)cc1,1629.0
2,CHEMBL188528,Oc1ccc(-c2nc3cc(O)cc(Br)c3o2)cc1,132.0
3,CHEMBL187207,Oc1cc(Br)c2oc(-c3ccc(O)c(F)c3)nc2c1,199.0
4,CHEMBL186513,N#Cc1cc(O)cc2nc(-c3ccc(O)cc3)oc12,423.0
5,CHEMBL188230,O=Cc1cc(O)cc2nc(-c3ccc(O)cc3)oc12,2601.0
6,CHEMBL450940,C=Cc1cc(O)cc2nc(-c3ccc(O)c(F)c3)oc12,746.0
7,CHEMBL188882,Oc1ccc(-c2noc3cc(O)ccc23)cc1,1089.0
8,CHEMBL135,C[C@]12CC[C@@H]3c4ccc(O)cc4CC[C@H]3[C@@H]1CC[C...,2.2
9,CHEMBL2137046,CCC(=C(c1ccc(O)cc1)c1ccc(OCCN(C)C)cc1)c1ccccc1,33.2


Finally we will save the resulting bioactivity data to a CSV file **bioactivity_data.csv**.

In [8]:
df.to_csv('tk_01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If any compounds has missing value for the **standard_value** and **canonical_smiles** column then drop it.

In [9]:
df2 = df[df.standard_value.notna()]
df2 = df2[df.canonical_smiles.notna()]
df2

  df2 = df2[df.canonical_smiles.notna()]


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3534,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056466,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,237.8
3535,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056467,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1100.0
3536,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056468,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,2531.0
3537,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056469,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,252.4


In [10]:
len(df2.canonical_smiles.unique())

2597

In [11]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3532,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056464,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249083,Inhibition of recombinant human aromatase prei...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,13.0
3533,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056465,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249083,Inhibition of recombinant human aromatase prei...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,13.0
3534,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056466,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,237.8
3535,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056467,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1100.0


## **Data pre-processing of the bioactivity data**

### **Combine the 3 columns (molecule_chembl_id,canonical_smiles,standard_value) and bioactivity_class into a DataFrame**

In [12]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,7100.0
1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,50000.0
2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,238.0
3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,57.0
4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,54.0
...,...,...,...
3532,CHEMBL5285636,COc1ccc(C(=O)c2cccc(Cn3ccnc3)c2)cc1,13.0
3533,CHEMBL5266533,O=C(c1ccc(O)cc1)c1cccc(Cn2ccnc2)c1,13.0
3534,CHEMBL5278229,COc1ccc(C(=O)c2ccc(Cn3ccnc3)cc2)cc1,237.8
3535,CHEMBL5275747,O=C(c1ccc(O)cc1)c1ccc(Cn2ccnc2)cc1,1100.0


Saves dataframe to CSV file

In [13]:
df3.to_csv('tk_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data is in the IC50 unit. Compounds having values of less than 1000 nM will be considered to be **active** while those greater than 10,000 nM will be considered to be **inactive**. As for those values in between 1,000 and 10,000 nM will be referred to as **intermediate**. 

In [14]:
df4 = pd.read_csv('kaa_02_bioactivity_data_preprocessed.csv')

In [15]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [16]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL192928,CC(C)(C)c1ccc(C(=O)Nc2[nH]nc3c2CN(C(=O)Cc2ccsc...,5.0,active
1,CHEMBL191816,CC(C)(C)c1ccc(C(=O)Nc2n[nH]c3c2CN(C(=O)c2ccco2...,41.0,active
2,CHEMBL192575,CC(C)(C)c1ccc(C(=O)Nc2n[nH]c3c2CN(C(=O)Cc2cccc...,130.0,active
3,CHEMBL191402,CCc1cccc(CC)c1NC(=O)N1Cc2[nH]nc(NC(=O)c3ccc(C(...,100.0,active
4,CHEMBL363857,CN1CCN(c2ccc(C(=O)Nc3[nH]nc4c3CN(C(=O)Cc3ccsc3...,65.0,active
...,...,...,...,...
3186,CHEMBL5274495,CC(C)(C)c1cc(NC(=O)Nc2ncc(CCNc3ncnc4[nH]c(-c5c...,15.0,active
3187,CHEMBL5286283,Cc1cc(S(=O)(=O)Nc2ccc(/C=C3\SC(=O)N(CC4(C)CCCC...,35.0,active
3188,CHEMBL5278077,NC(=O)c1ccc(Nc2ncc(F)c(Nc3ccccc3)n2)cc1,5.4,active
3189,CHEMBL5271559,NC(=O)c1cnc(Nc2cccc(O)n2)cc1Nc1ccccc1Cl,2500.0,intermediate


Saves dataframe to CSV file

In [17]:
df5.to_csv('tk_03_bioactivity_data_curated.csv', index=False)

In [18]:
import os
from zipfile import ZipFile

# Define the folder containing the files to be zipped
folder = 'C:\\Users\\Mychael\\Downloads'

# Create a ZipFile object
with ZipFile('output.zip', 'w') as myzip:
    # Iterate over the files in the folder
    for file in os.listdir(folder):
        # Check if the file ends with .csv or .pdf
        if file.endswith(".csv") or file.endswith(".pdf"):
            # Write the file to the zip archive
            myzip.write(os.path.join(folder, file), file)