# **Computational Drug Discovery [Part 1] Download Bioactivity Data**

Nickolas Winters

**Part 1:** Data collection and pre-processing of bioactivity data from the ChEMBL Database.

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Importing libraries**

In [2]:
# import required libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Alzheimer**

In [6]:
# Target search for Alzheimer
target = new_client.target
target_query = target.search('alzheimer')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Nucleosome-remodeling factor subunit BPTF,13.0,False,CHEMBL3085621,"[{'accession': 'Q12830', 'component_descriptio...",SINGLE PROTEIN,9606
1,"[{'xref_id': 'Q92542', 'xref_name': None, 'xre...",Homo sapiens,Nicastrin,11.0,False,CHEMBL3418,"[{'accession': 'Q92542', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Homo sapiens,Gamma-secretase,11.0,False,CHEMBL2094135,"[{'accession': 'Q96BI3', 'component_descriptio...",PROTEIN COMPLEX,9606
3,[],Rattus norvegicus,Amyloid-beta A4 protein,9.0,False,CHEMBL3638365,"[{'accession': 'P08592', 'component_descriptio...",SINGLE PROTEIN,10116
4,[],Mus musculus,Amyloid-beta A4 protein,8.0,False,CHEMBL4523942,"[{'accession': 'P12023', 'component_descriptio...",SINGLE PROTEIN,10090
5,"[{'xref_id': 'P05067', 'xref_name': None, 'xre...",Homo sapiens,Amyloid-beta A4 protein,7.0,False,CHEMBL2487,"[{'accession': 'P05067', 'component_descriptio...",SINGLE PROTEIN,9606


### **Select and retrieve bioactivity data for *Human Amyloid-beta A4 protein* (sixth entry)**

The sixth entry (which corresponds to the target protein, *Human Amyloid-beta A4 protein*) was assigned to the ***selected_target*** variable 

In [8]:
selected_target = targets.target_chembl_id[5]
selected_target

'CHEMBL2487'
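
Picking the target by position (`[5]`) depends on the order of the search results, which may differ between ChEMBL releases. A minimal sketch of selecting by organism and preferred name instead, using a hand-built stand-in for the search results above (the names `targets_demo` and `selected_target_demo` and the reduced column set are assumptions made for illustration):

```python
import pandas as pd

# Stand-in for the search results shown above (three of the six rows, reduced columns).
targets_demo = pd.DataFrame({
    "organism": ["Rattus norvegicus", "Mus musculus", "Homo sapiens"],
    "pref_name": ["Amyloid-beta A4 protein"] * 3,
    "target_chembl_id": ["CHEMBL3638365", "CHEMBL4523942", "CHEMBL2487"],
})

# Keep only the human Amyloid-beta A4 protein entry.
mask = (targets_demo["organism"] == "Homo sapiens") & (
    targets_demo["pref_name"] == "Amyloid-beta A4 protein"
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the search does not return exactly one such target.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

The same mask could be applied to the real `targets` DataFrame, since it carries the `organism`, `pref_name` and `target_chembl_id` columns.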

Only the bioactivity data for *Human Amyloid-beta A4 protein* (CHEMBL2487) that are reported as IC50 values were retrieved.

In [9]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [10]:
df = pd.DataFrame.from_dict(res)

In [11]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,357577,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,5.0
1,,,357580,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,2.7
2,,,358965,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.8
3,,,368887,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,11.0
4,,,375954,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1598,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079667,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.03
1599,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079668,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,0.99
1600,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079669,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,0.1
1601,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079670,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.56


The resulting bioactivity data was saved to a CSV file **amyloid_01_bioactivity_data_raw.csv**.

In [12]:
df.to_csv('amyloid_01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
Compounds with missing values for the **standard_value** and **canonical_smiles** columns were dropped.

In [13]:
# Keep rows that have a potency value
df_nona = df[df.standard_value.notna()]
# Build the second mask from df_nona itself so its index matches the frame being filtered
df_nona = df_nona[df_nona.canonical_smiles.notna()]
df_nona


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,357577,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,5.0
1,,,357580,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,2.7
2,,,358965,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.8
3,,,368887,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,11.0
4,,,375954,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1598,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079667,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.03
1599,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079668,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,0.99
1600,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079669,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,0.1
1601,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079670,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.56
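
The two filtering steps can also be written as a single `DataFrame.dropna` call with the `subset` argument. A minimal sketch on toy data (the `toy` table and its values are invented for illustration and do not come from ChEMBL):

```python
import pandas as pd

# Toy stand-in for the activity table; the values are invented for illustration.
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "CCN", "CCC"],
    "standard_value": ["5000.0", "2700.0", None, "100.0"],
})

# Drop every row that lacks either a potency value or a structure.
toy_nona = toy.dropna(subset=["standard_value", "canonical_smiles"])
print(toy_nona["molecule_chembl_id"].tolist())
```

Only rows with both fields present survive, which is the same outcome as chaining two `notna()` masks.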


In [14]:
len(df_nona.canonical_smiles.unique())

1188

In [16]:
df_nd = df_nona.drop_duplicates(['canonical_smiles'])
df_nd

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,357577,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,5.0
1,,,357580,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,2.7
2,,,358965,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.8
3,,,368887,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,11.0
4,,,375954,[],CHEMBL678443,Inhibition of A-beta-42 production by inhibiti...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1598,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079667,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.03
1599,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079668,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,0.99
1600,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079669,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,0.1
1601,"{'action_type': 'INHIBITOR', 'description': 'N...",,25079670,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5255354,Inhibition of Amyloid beta (1 to 42) (unknown ...,B,,,BAO_0000190,...,Homo sapiens,Amyloid-beta A4 protein,9606,,,IC50,uM,UO_0000065,,1.56
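
`drop_duplicates` keeps the first row for each SMILES string and preserves the original index labels, which is why the output above still ends at label 1601 after de-duplication. A minimal sketch on toy data showing the retained labels and an optional `reset_index(drop=True)` (the toy values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "CCN", "CCO", "CCC"],
    "standard_value": [5000.0, 4000.0, 100.0, 6000.0, 20000.0],
})

# keep="first" is the default: the first measurement per structure survives.
toy_nd = toy.drop_duplicates(["canonical_smiles"])
print(toy_nd.index.tolist())  # original labels are preserved, with gaps

# Optional: renumber the rows 0..n-1 for positional work later on.
toy_nd_reset = toy_nd.reset_index(drop=True)
print(toy_nd_reset.index.tolist())
```

Keeping the first measurement is one choice among several. Averaging or taking the median of repeated measurements is a common alternative, but it is not what this notebook does.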


## **Data pre-processing of the bioactivity data**

### **The 3 columns (molecule_chembl_id, canonical_smiles, standard_value) were selected into a DataFrame**

In [17]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df_clean = df_nd[selection]
df_clean

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0
...,...,...,...
1598,CHEMBL5274298,COC(=O)c1cc(O)cc(OC)c1C(=O)c1c(O)cc(CO[C@H]2O[...,1030.0
1599,CHEMBL5283067,COC(=O)c1c(Sc2c(O)cc(OC)c(C(=O)c3c(O)cc(C)cc3O...,990.0
1600,CHEMBL5273520,COC(=O)c1c(O)cc(C)c(Sc2c(O)cc(OC)c(Oc3c(O)cc(C...,100.0
1601,CHEMBL5282081,COC(=O)c1c(O)cc(C)cc1C(=O)c1cc(O)cc2oc3cc4c(c(...,1560.0


The DataFrame was saved to a CSV file, **amyloid_02_bioactivity_data_preprocessed.csv**.

In [18]:
df_clean.to_csv('amyloid_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, while those with values of 10,000 nM or more are considered **inactive**. Values strictly between 1,000 and 10,000 nM are labeled **intermediate**.
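
The labeling rule can be sketched as a small standalone function. The name `classify_ic50` is hypothetical, and the boundary handling follows the loop used in this notebook: a value of exactly 1,000 nM counts as active and a value of exactly 10,000 nM counts as inactive.

```python
def classify_ic50(value_nm: float) -> str:
    """Label an IC50 value (in nM) as active, inactive or intermediate."""
    value_nm = float(value_nm)
    if value_nm >= 10000:
        return "inactive"
    if value_nm <= 1000:
        return "active"
    return "intermediate"


# Try values on both sides of each cut-off.
for v in (990.0, 1000.0, 1030.0, 10000.0, 11000.0):
    print(v, classify_ic50(v))
```

Writing the rule as a function makes the cut-offs easy to check in isolation before applying them to the whole `standard_value` column.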

In [26]:
df2 = pd.read_csv('amyloid_02_bioactivity_data_preprocessed.csv')
df2

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0
...,...,...,...
1183,CHEMBL5274298,COC(=O)c1cc(O)cc(OC)c1C(=O)c1c(O)cc(CO[C@H]2O[...,1030.0
1184,CHEMBL5283067,COC(=O)c1c(Sc2c(O)cc(OC)c(C(=O)c3c(O)cc(C)cc3O...,990.0
1185,CHEMBL5273520,COC(=O)c1c(O)cc(C)c(Sc2c(O)cc(OC)c(Oc3c(O)cc(C...,100.0
1186,CHEMBL5282081,COC(=O)c1c(O)cc(C)cc1C(=O)c1cc(O)cc2oc3cc4c(c(...,1560.0


In [27]:
bioactivity_threshold = []

for i in df2.standard_value:
    if float(i) >= 10000:
        bioactivity_threshold.append("inactive")
    elif float(i) <= 1000:
        bioactivity_threshold.append("active")
    else:
        bioactivity_threshold.append("intermediate")
    

In [28]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
bioactivity_class

0       intermediate
1       intermediate
2       intermediate
3           inactive
4           inactive
            ...     
1183    intermediate
1184          active
1185          active
1186    intermediate
1187          active
Name: class, Length: 1188, dtype: object

In [29]:
df_class = pd.concat([df2, bioactivity_class], axis=1)
df_class

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0,intermediate
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0,intermediate
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0,intermediate
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0,inactive
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0,inactive
...,...,...,...,...
1183,CHEMBL5274298,COC(=O)c1cc(O)cc(OC)c1C(=O)c1c(O)cc(CO[C@H]2O[...,1030.0,intermediate
1184,CHEMBL5283067,COC(=O)c1c(Sc2c(O)cc(OC)c(C(=O)c3c(O)cc(C)cc3O...,990.0,active
1185,CHEMBL5273520,COC(=O)c1c(O)cc(C)c(Sc2c(O)cc(OC)c(Oc3c(O)cc(C...,100.0,active
1186,CHEMBL5282081,COC(=O)c1c(O)cc(C)cc1C(=O)c1cc(O)cc2oc3cc4c(c(...,1560.0,intermediate
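
`pd.concat(..., axis=1)` pairs rows by index label, not by position. Here `df2` was read back from CSV, so it carries a fresh 0..n-1 index that matches the new `Series`. A minimal sketch on toy data of what would happen with a gapped index, such as the one left behind by `drop_duplicates` (all values are invented for illustration):

```python
import pandas as pd

# Frame with gapped labels, as left behind by filtering or de-duplication.
gapped = pd.DataFrame({"standard_value": [5000.0, 100.0, 20000.0]}, index=[0, 2, 4])
# A new Series gets the default labels 0, 1, 2.
labels = pd.Series(["intermediate", "active", "inactive"], name="class")

# Label-based alignment: rows are matched on index, leaving NaN where labels differ.
misaligned = pd.concat([gapped, labels], axis=1)
print(misaligned)

# Resetting the index first restores a row-by-row pairing.
aligned = pd.concat([gapped.reset_index(drop=True), labels], axis=1)
print(aligned)
```

In the misaligned result the class labels end up attached to the wrong compounds, so it is worth confirming that both objects share the same index before concatenating.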


The DataFrame was saved to a CSV file, **amyloid_03_bioactivity_data_classed.csv**.

In [30]:
df_class.to_csv('amyloid_03_bioactivity_data_classed.csv', index=False)

---