<div style="padding: 0.5em; background-color: #1876d1; color: #fff;">
    
### **[Part 1] Computational Drug Discovery - Download Bioactivity Data**

</div>
This project builds a machine learning model from ChEMBL bioactivity data. **Part 1** covers data collection and pre-processing of the data retrieved from the ChEMBL Database.

Note:
* Target enzyme: aromatase, which is implicated in breast cancer
* Objective: find compounds that inhibit aromatase function

---
<b># Bioinformatics Project </b>

Bioinformatics from scratch series from Prof Chanin Nantasenamat ([YouTube channel](http://youtube.com/dataprofessor))

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

In [1]:
# Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.
!pip install chembl_webresource_client



## **Importing libraries**

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

A target is the protein or organism that a drug acts on. Biologically, a compound comes into contact with its target protein and modulates its activity, either activating or inhibiting it.

<div style="padding: 0.5em; background-color: #1876d1; color: #fff;">
    Aromatase is an enzyme that converts androgens (which are produced by the adrenal glands) into estrogens, allowing the body to continue producing estrogens in postmenopausal women. These estrogens can fuel <span style="font-weight: bold; color: #f37627;">breast cancer, particularly in postmenopausal women</span>. Therefore, the <b>goal of drug discovery</b> efforts is to <b>find a compound or molecule that can inhibit the function of aromatase</b>.
</div>
<br>

In [3]:
target = new_client.target
target_query = target.search('aromatase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P11511', 'xref_name': None, 'xre...",Homo sapiens,Cytochrome P450 19A1,20.0,False,CHEMBL1978,"[{'accession': 'P11511', 'component_descriptio...",SINGLE PROTEIN,9606
1,"[{'xref_id': 'P22443', 'xref_name': None, 'xre...",Rattus norvegicus,Cytochrome P450 19A1,20.0,False,CHEMBL3859,"[{'accession': 'P22443', 'component_descriptio...",SINGLE PROTEIN,10116


We can observe that this enzyme exists in both humans and rats. We will focus on humans, targeting the organism *Homo sapiens*.

<div>
    <div style="display:flex; align-items: center !important;">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/Human.svg/langfr-480px-Human.svg.png" alt="Human" width="150"/>
    </div>
    <div style="display:flex; align-items: center;">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Ratte-Vache.jpeg/440px-Ratte-Vache.jpeg" alt="Rat" width="150"/>
    </div>
</div>

### **Select and retrieve bioactivity data for Homo sapiens**

In [4]:
# Row 0 of the search results above is the human (Homo sapiens) aromatase entry.
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL1978'
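
Selecting row 0 works here because the human entry happens to come first in the search results. As a minimal sketch, assuming the result table keeps the `organism`, `target_type` and `target_chembl_id` columns shown above, the same ID could be picked by organism name instead of by position. The helper `pick_target` and the toy frame `targets_demo` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-in for the search results shown above (only the columns used here).
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens", "Rattus norvegicus"],
    "pref_name": ["Cytochrome P450 19A1", "Cytochrome P450 19A1"],
    "target_chembl_id": ["CHEMBL1978", "CHEMBL3859"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN"],
})


def pick_target(targets, organism="Homo sapiens", target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the first row matching organism and target type."""
    mask = (targets["organism"] == organism) & (targets["target_type"] == target_type)
    matches = targets.loc[mask, "target_chembl_id"]
    if matches.empty:
        raise ValueError(f"no {target_type} target found for {organism}")
    return matches.iloc[0]


selected_demo = pick_target(targets_demo)
print(selected_demo)
```

Filtering by name keeps the selection stable even if the order of the search results changes.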

Here, we will retrieve only the bioactivity data for this target that are reported as IC50 values.

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [6]:
df = pd.DataFrame.from_dict(res)

In [7]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3534,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056466,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,237.8
3535,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056467,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1100.0
3536,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056468,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,2531.0
3537,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056469,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,252.4


Let's take a look at the DataFrame columns.

In [8]:
df.columns

Index(['action_type', 'activity_comment', 'activity_id', 'activity_properties',
       'assay_chembl_id', 'assay_description', 'assay_type',
       'assay_variant_accession', 'assay_variant_mutation', 'bao_endpoint',
       'bao_format', 'bao_label', 'canonical_smiles', 'data_validity_comment',
       'data_validity_description', 'document_chembl_id', 'document_journal',
       'document_year', 'ligand_efficiency', 'molecule_chembl_id',
       'molecule_pref_name', 'parent_molecule_chembl_id', 'pchembl_value',
       'potential_duplicate', 'qudt_units', 'record_id', 'relation', 'src_id',
       'standard_flag', 'standard_relation', 'standard_text_value',
       'standard_type', 'standard_units', 'standard_upper_value',
       'standard_value', 'target_chembl_id', 'target_organism',
       'target_pref_name', 'target_tax_id', 'text_value', 'toid', 'type',
       'units', 'uo_units', 'upper_value', 'value'],
      dtype='object')
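
The `value` and `units` columns hold the numbers as reported in the source (uM for some records, nM for others), whereas the rest of this notebook works with `standard_value`. For the first record, 7.1 uM in `value` appears as 7100.0 in `standard_value` in the preprocessed table later in this notebook, which suggests that `standard_value` is normalized to nM. The following is a minimal sketch of how one might check this before applying nM thresholds. It uses toy rows modeled on the output above; the `standard_units` entries and the `TO_NM` conversion table are assumptions made for illustration.

```python
import pandas as pd

# Toy rows modeled on the records above; the standard_units values are assumed.
demo = pd.DataFrame({
    "value": [7.1, 50.0, 237.8],
    "units": ["uM", "uM", "nM"],
    "standard_value": [7100.0, 50000.0, 237.8],
    "standard_units": ["nM", "nM", "nM"],
})

# Conversion factors from the reported unit to nM (hypothetical helper table).
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}

# Convert the reported values to nM and compare them with standard_value.
converted = demo["value"] * demo["units"].map(TO_NM)
units_ok = bool(demo["standard_units"].eq("nM").all())
values_match = bool(((converted - demo["standard_value"]).abs() < 1e-6).all())
print(units_ok, values_match)
```

If either check fails on the real data, the affected rows would need a closer look before their values are compared against nM cutoffs.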

Next, we save the raw bioactivity data to the CSV file **data/01-bioactivity_data_raw.csv**.

In [9]:
df.to_csv('data/01-bioactivity_data_raw.csv', index=False)
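
`to_csv` does not create missing folders, so the call above fails if the `data/` directory does not exist yet. A minimal sketch of a helper that creates the parent directory first is shown below. The name `save_csv` is hypothetical, and the demonstration writes into a temporary directory rather than the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(frame, path):
    """Create the parent directory if needed, then write the frame as CSV."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Demonstrate in a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(pd.DataFrame({"a": [1, 2]}), Path(tmp) / "data" / "demo.csv")
    written = out.exists()
    round_trip = pd.read_csv(out)

print(written, len(round_trip))
```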

## **Handling missing data**
If a compound has a missing value in either the **standard_value** or the **canonical_smiles** column, drop it.

In [10]:
df2 = df[df.standard_value.notna() & df.canonical_smiles.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3534,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056466,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,237.8
3535,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056467,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1100.0
3536,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056468,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,2531.0
3537,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056469,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,252.4
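
The boolean mask used above keeps a row only when both columns are present. As a small sketch on toy data (the IDs and values are made up), the same filter can also be written with `dropna` and its `subset` argument:

```python
import numpy as np
import pandas as pd

# Toy data with one missing SMILES string and one missing standard_value.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],  # hypothetical IDs
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [10.0, 20.0, np.nan, 30.0],
})

# The notebook's approach: keep rows where both columns are non-missing.
by_mask = demo[demo.standard_value.notna() & demo.canonical_smiles.notna()]

# Equivalent form: drop rows with a missing value in either column.
by_dropna = demo.dropna(subset=["standard_value", "canonical_smiles"])

print(by_mask.equals(by_dropna), len(demo) - len(by_dropna))
```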


In [11]:
len(df2.canonical_smiles.unique())

2597

In [12]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3532,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056464,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249083,Inhibition of recombinant human aromatase prei...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,13.0
3533,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056465,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249083,Inhibition of recombinant human aromatase prei...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,13.0
3534,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056466,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,237.8
3535,"{'action_type': 'INHIBITOR', 'description': 'N...",,25056467,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5249084,Inhibition of human placental microsome aromat...,B,,,BAO_0000190,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1100.0
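
`drop_duplicates` keeps the first row it encounters for each SMILES string, so when a compound has several IC50 measurements, only the first one survives. The sketch below contrasts this with one possible alternative, taking the median value per compound. The alternative is not what this notebook does, and the toy IDs and values are made up for illustration.

```python
import pandas as pd

# Toy data: compound M1 was measured twice, compound M2 once.
demo = pd.DataFrame({
    "molecule_chembl_id": ["M1", "M1", "M2"],  # hypothetical IDs
    "canonical_smiles": ["CCO", "CCO", "CCN"],
    "standard_value": [100.0, 300.0, 50.0],
})

# What the notebook does: keep the first row seen for each SMILES string.
first_only = demo.drop_duplicates(["canonical_smiles"])

# Alternative (not the notebook's method): one median value per compound.
median_per_smiles = (
    demo.groupby(["molecule_chembl_id", "canonical_smiles"], as_index=False)["standard_value"]
    .median()
)

print(first_only.standard_value.tolist())
print(median_per_smiles.standard_value.tolist())
```

Which strategy is appropriate depends on how much the repeated measurements for a compound disagree.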


## **Data pre-processing of the bioactivity data**

Select the 3 columns `molecule_chembl_id`, `canonical_smiles` and `standard_value` into a DataFrame; the bioactivity class is added in the next section. `molecule_chembl_id` is the ID of the compound in the ChEMBL database, `canonical_smiles` is the SMILES string that represents the compound's chemical structure, and `standard_value` is the measured IC50, which reflects the potency of the compound (the lower the value, the more potent the compound).

In [13]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,7100.0
1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,50000.0
2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,238.0
3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,57.0
4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,54.0
...,...,...,...
3532,CHEMBL5285636,COc1ccc(C(=O)c2cccc(Cn3ccnc3)c2)cc1,13.0
3533,CHEMBL5266533,O=C(c1ccc(O)cc1)c1cccc(Cn2ccnc2)c1,13.0
3534,CHEMBL5278229,COc1ccc(C(=O)c2ccc(Cn3ccnc3)cc2)cc1,237.8
3535,CHEMBL5275747,O=C(c1ccc(O)cc1)c1ccc(Cn2ccnc2)cc1,1100.0


Save the DataFrame to a CSV file.

In [14]:
df3.to_csv('data/02-bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values expressed in nM (nanomolar). Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [15]:
df4 = pd.read_csv('data/02-bioactivity_data_preprocessed.csv')

In [16]:
# Assign a class label to each compound from its IC50 (standard_value, in nM).
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [17]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,7100.0,intermediate
1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,50000.0,inactive
2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,238.0,active
3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,57.0,active
4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,54.0,active
...,...,...,...,...
2592,CHEMBL5285636,COc1ccc(C(=O)c2cccc(Cn3ccnc3)c2)cc1,13.0,active
2593,CHEMBL5266533,O=C(c1ccc(O)cc1)c1cccc(Cn2ccnc2)c1,13.0,active
2594,CHEMBL5278229,COc1ccc(C(=O)c2ccc(Cn3ccnc3)cc2)cc1,237.8,active
2595,CHEMBL5275747,O=C(c1ccc(O)cc1)c1ccc(Cn2ccnc2)cc1,1100.0,intermediate
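
The labeling loop above can also be written in vectorized form. The following is a minimal sketch using `numpy.select` with the same cutoffs (at least 10,000 nM is inactive, at most 1,000 nM is active, anything else is intermediate). The function name `label_bioactivity` and its parameters are hypothetical, and the input is assumed to be a pandas Series with no missing values, as is the case after the filtering step earlier.

```python
import numpy as np
import pandas as pd


def label_bioactivity(values, active_max=1000.0, inactive_min=10000.0):
    """Label IC50 values (nM) with the same cutoffs as the loop above."""
    v = pd.to_numeric(values, errors="coerce")
    labels = np.select(
        [v >= inactive_min, v <= active_max],  # conditions are checked in order
        ["inactive", "active"],
        default="intermediate",
    )
    return pd.Series(labels, index=values.index, name="class")


# Toy values, including both boundary cases (1000 and 10000 nM).
demo = pd.Series([7100.0, 50000.0, 238.0, 1000.0, 10000.0], name="standard_value")
print(label_bioactivity(demo).tolist())
```

Because the returned Series shares the index of its input, it can be joined to the source DataFrame with `pd.concat(..., axis=1)` in the same way as `bioactivity_class`.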


Save the DataFrame to a CSV file.

In [18]:
df5.to_csv('data/03-bioactivity_data_curated.csv', index=False)
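
As a quick sanity check on the curated file, one might read it back and count the compounds per class. The sketch below does this on a few rows copied from the table above, held in an in-memory CSV string so that no file is needed; on the real data the same `value_counts` call would be applied to the `class` column of the saved file.

```python
import io

import pandas as pd

# A few rows copied from the curated table above, in CSV form.
csv_text = """molecule_chembl_id,canonical_smiles,standard_value,class
CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,7100.0,intermediate
CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,238.0,active
CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,57.0,active
CHEMBL5275747,O=C(c1ccc(O)cc1)c1ccc(Cn2ccnc2)cc1,1100.0,intermediate
"""

curated = pd.read_csv(io.StringIO(csv_text))

# Number of compounds in each bioactivity class.
counts = curated["class"].value_counts().to_dict()
print(counts)
```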

---