# **Part 1 - Download Bioactivity Data**

## **Computational Drug Discovery Bioinformatics Project**

In this part, we will collect bioactivity data from the ChEMBL Database and pre-process it.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than two million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.0-py3-none-any.whl (61 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-23.2.3-py3-none-any.whl (57 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Installing col

## **Import Libraries**

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Target Protein - Aromatase**

For this project, we will choose aromatase as the target protein for drug discovery. This enzyme is a member of the cytochrome P450 superfamily and is responsible for a key step in the biosynthesis of estrogens.

Studies have shown that high levels of these hormones are linked to an increased risk of breast cancer. As such, the goal of our drug discovery efforts is to find molecules or compounds that are able to inhibit the function of the aromatase enzyme, thereby giving us insights into preventing and curing breast cancer.

### **Search for Target Protein**

Let's create a query to look for aromatase in the ChEMBL database.

In [None]:
target = new_client.target
target_query = target.search('aromatase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P11511', 'xref_name': None, 'xre...",Homo sapiens,Cytochrome P450 19A1,20.0,False,CHEMBL1978,"[{'accession': 'P11511', 'component_descriptio...",SINGLE PROTEIN,9606
1,"[{'xref_id': 'P22443', 'xref_name': None, 'xre...",Rattus norvegicus,Cytochrome P450 19A1,20.0,False,CHEMBL3859,"[{'accession': 'P22443', 'component_descriptio...",SINGLE PROTEIN,10116


### **Select and Retrieve Bioactivity Data**

We will assign the first entry (which corresponds to the target protein, *Human Aromatase*) to the `selected_target` variable.

In [None]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL1978'
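
Selecting by position works here because the human entry happens to be listed first. As a minimal sketch, assuming the search results keep the `organism`, `target_type` and `target_chembl_id` columns shown above, the target can also be picked by its attributes. The helper name `pick_human_single_protein` and the toy DataFrame are illustrative only; the rows are deliberately reordered to show that the result does not depend on position.

```python
import pandas as pd

# Toy stand-in for the search results (only the columns used here),
# with the rat entry placed first on purpose.
targets_demo = pd.DataFrame({
    "organism": ["Rattus norvegicus", "Homo sapiens"],
    "pref_name": ["Cytochrome P450 19A1", "Cytochrome P450 19A1"],
    "target_chembl_id": ["CHEMBL3859", "CHEMBL1978"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN"],
})


def pick_human_single_protein(targets: pd.DataFrame) -> str:
    """Return the ChEMBL ID of the first human single-protein target."""
    mask = (targets["organism"] == "Homo sapiens") & (
        targets["target_type"] == "SINGLE PROTEIN"
    )
    matches = targets.loc[mask, "target_chembl_id"]
    if matches.empty:
        raise ValueError("no human single-protein target in the search results")
    return matches.iloc[0]


selected_target_demo = pick_human_single_protein(targets_demo)
print(selected_target_demo)
```

Raising an error when nothing matches makes a changed or empty search result visible immediately instead of silently selecting the wrong organism.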

Here, we will retrieve only the bioactivity data for *Human Aromatase* (CHEMBL1978) that are reported as $IC_{50}$ values.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)

In [None]:
df.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054


Finally, we will save the resulting bioactivity data to a CSV file.

In [None]:
df.to_csv('aromatase_01_bioactivity_data_raw.csv', index=False)

## **Handle Missing Data**

We will drop any compounds that have missing values in the `standard_value` or `canonical_smiles` columns.

In [None]:
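# Keep only rows that have both a standard value and a SMILES string.
df2 = df[df.standard_value.notna()]
# Build the second mask from df2 (not df) so it matches the frame being filtered.
df2 = df2[df2.canonical_smiles.notna()]
df2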


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3534,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056466,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,237.8
3535,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056467,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1100.0
3536,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056468,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,2531.0
3537,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056469,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,252.4
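
The two filtering steps above can also be written as a single `dropna` call. The following is a small sketch on a toy DataFrame; the molecule IDs are hypothetical placeholders, and only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy records: one complete row, one without SMILES, one without a value.
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],  # hypothetical IDs
    "canonical_smiles": ["CCO", None, "c1ccccc1"],
    "standard_value": ["7100.0", "50.0", np.nan],
})

# Drop a row if either of the listed columns is missing.
df_clean = df_demo.dropna(subset=["standard_value", "canonical_smiles"])

print(len(df_demo), "rows before,", len(df_clean), "rows after")
```

Because the mask is computed on the same frame that is being filtered, this form avoids the index-alignment warning that appears when a mask from one DataFrame is applied to another.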


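Next, we keep a single record per unique compound by dropping rows with duplicate `canonical_smiles` entries.

In [None]: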
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3532,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056464,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249083,Inhibition of recombinant human aromatase prei...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,13.0
3533,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056465,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249083,Inhibition of recombinant human aromatase prei...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,13.0
3534,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056466,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,237.8
3535,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056467,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1100.0
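
Note that `drop_duplicates` keeps the first row of each duplicated SMILES by default (`keep="first"`), so when a compound has several measurements, only the earliest one in the table survives. The sketch below illustrates this on toy data with hypothetical IDs and values; it also shows how `duplicated` can be used to inspect the rows that would be discarded.

```python
import pandas as pd

# Toy data: the same compound measured twice with different results.
dup_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B"],  # hypothetical IDs
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
    "standard_value": [100.0, 900.0, 13.0],
})

# Same call as in the notebook; keep="first" is the pandas default.
first_only = dup_demo.drop_duplicates(["canonical_smiles"])

# Rows flagged as repeats of an earlier SMILES, i.e. the ones removed above.
discarded = dup_demo[dup_demo.duplicated(["canonical_smiles"])]

print(first_only)
print(discarded)
```

Whether keeping the first measurement is appropriate depends on the project; aggregating replicate values is another option, but that would be a change to the workflow described here.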


## **Data Preprocessing**

### **Combine Relevant Columns**

In [None]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,7100.0
1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,50000.0
2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,238.0
3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,57.0
4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,54.0
...,...,...,...
3532,CHEMBL5285636,COc1ccc(C(=O)c2cccc(Cn3ccnc3)c2)cc1,13.0
3533,CHEMBL5266533,O=C(c1ccc(O)cc1)c1cccc(Cn2ccnc2)c1,13.0
3534,CHEMBL5278229,COc1ccc(C(=O)c2ccc(Cn3ccnc3)cc2)cc1,237.8
3535,CHEMBL5275747,O=C(c1ccc(O)cc1)c1ccc(Cn2ccnc2)cc1,1100.0
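
The `standard_value` column holds ChEMBL's standardized measurement; for example, the first record's raw `value` of 7.1 uM appears here as 7100.0. Since the unit column is not part of the selection, it may be worth confirming the units before discarding it. The sketch below assumes the activity records include a `standard_units` column, as ChEMBL activity records normally do; the toy rows and the non-nM unit string are made up for illustration.

```python
import pandas as pd

# Toy records; IDs, values and the odd unit are illustrative only.
units_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "standard_value": [7100.0, 57.0, 2.5],
    "standard_units": ["nM", "nM", "ug.mL-1"],
})

# Flag rows whose standardized unit is nanomolar.
in_nm = units_demo["standard_units"] == "nM"

# Summarize anything that is not in nM so it can be reviewed.
print(units_demo.loc[~in_nm, "standard_units"].value_counts())

# Keep only rows that are directly comparable on an nM scale.
nm_only = units_demo[in_nm]
print(nm_only)
```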


Let's save the data to a CSV file.

In [None]:
df3.to_csv('aromatase_02_bioactivity_data_preprocessed.csv', index=False)

### **Label Compounds**

The bioactivity data are reported as half-maximal inhibitory concentration ($IC_{50}$) values. The $IC_{50}$ measures how potently a substance inhibits a specific biological or biochemical function, which in our case is the activity of the aromatase enzyme. A lower $IC_{50}$ means a more potent inhibitor.

Compounds with values of $1000$ nM or less will be considered **active**, while those with values of $10,000$ nM or more will be considered **inactive**. Values strictly between $1000$ nM and $10,000$ nM will be referred to as **intermediate**.

In [None]:
df4 = pd.read_csv('aromatase_02_bioactivity_data_preprocessed.csv')

In [None]:
# Assign each compound a class label based on its IC50 (standard_value, in nM).
bioactivity_threshold = []

for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,7100.0,intermediate
1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,50000.0,inactive
2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,238.0,active
3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,57.0,active
4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,54.0,active
...,...,...,...,...
2592,CHEMBL5285636,COc1ccc(C(=O)c2cccc(Cn3ccnc3)c2)cc1,13.0,active
2593,CHEMBL5266533,O=C(c1ccc(O)cc1)c1cccc(Cn2ccnc2)c1,13.0,active
2594,CHEMBL5278229,COc1ccc(C(=O)c2ccc(Cn3ccnc3)cc2)cc1,237.8,active
2595,CHEMBL5275747,O=C(c1ccc(O)cc1)c1ccc(Cn2ccnc2)cc1,1100.0,intermediate
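
The labeling loop above can also be expressed without an explicit Python loop. This is a sketch of a vectorized alternative using `numpy.select`, with the same cutoffs as the loop (10000 nM and above is inactive, 1000 nM and below is active). The function name `label_bioactivity` and the sample values are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def label_bioactivity(values: pd.Series) -> pd.Series:
    """Label IC50 values (nM) as active, inactive, or intermediate."""
    v = pd.to_numeric(values).astype(float)
    labels = np.select(
        [v >= 10000, v <= 1000],      # conditions, checked in order
        ["inactive", "active"],       # label for each condition
        default="intermediate",       # everything in between
    )
    return pd.Series(labels, index=values.index, name="class")


# Sample values, including both boundary cases (1000 and 10000).
demo = pd.Series([7100.0, 50000.0, 238.0, 1000.0, 10000.0], name="standard_value")
classes = label_bioactivity(demo)

print(pd.concat([demo, classes], axis=1))
print(classes.value_counts())
```

Returning a Series that shares the input's index means it can be joined back with `pd.concat(..., axis=1)`, as in the notebook, and `value_counts` gives a quick view of how balanced the three classes are.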


Let's save this data to a CSV file as well.

In [None]:
df5.to_csv('aromatase_03_bioactivity_data_curated.csv', index=False)