# ML Project - Computational Drug Discovery [Part 1] Download Bioactivity Data 


Alzheimer's disease is a neurological condition in which the death of brain cells causes memory loss and cognitive decline. It is the most common type of dementia, accounting for around 60–80% of dementia cases.
Acetylcholinesterase is an enzyme in the nervous system that breaks down the neurotransmitter acetylcholine, and inhibiting it is an established way of treating the symptoms of Alzheimer's disease.
We therefore need to find compounds (candidate medications) that inhibit acetylcholinesterase.

In Part 1, we collect and pre-process bioactivity data from the ChEMBL database.

In [2]:
! which python3

/Users/Tejasv/opt/anaconda3/bin/python3


In [10]:
! which pip



/Users/Tejasv/opt/anaconda3/bin/pip


In [4]:
! python3 -m pip install chembl_webresource_client

Collecting chembl_webresource_client
  Using cached chembl-webresource-client-0.10.4.tar.gz (51 kB)
Collecting requests-cache>=0.6.0
  Using cached requests_cache-0.6.4-py2.py3-none-any.whl (28 kB)
Collecting easydict
  Using cached easydict-1.9.tar.gz (6.4 kB)
Collecting url-normalize>=1.4
  Using cached url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Building wheels for collected packages: chembl-webresource-client, easydict
  Building wheel for chembl-webresource-client (setup.py) ... done
  Created wheel for chembl-webresource-client: filename=chembl_webresource_client-0.10.4-py3-none-any.whl size=55639 sha256=6597d29a139e58dfbcdfb8f446259b1729a3f45d36fe7c1d56342996b1152990
  Stored in directory: /Users/Tejasv/Library/Caches/pip/wheels/af/3e/8f/7cd07c7ad14df6ecf2b3deeddd8255d3f5c3688287866dbfed
  Building wheel for easydict (setup.py) ... done
  Created wheel for easydict: filename=easydict-1.9-py3-none-any.whl size=6349 sha256=fcead69272d447187baae8557322

In [11]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client


# Search For Target Protein 


In [2]:
target = new_client.target
target_search = target.search('acetylcholinesterase')
targets = pd.DataFrame.from_dict(target_search)
targets



### Select and retrieve bioactivity data for SARS coronavirus 3C-like proteinase (fifth entry)

We assign the fifth entry (index 4), which corresponds to the target protein SARS coronavirus 3C-like proteinase, to the selected_target variable.

In [14]:
selected_target = targets['target_chembl_id'][4]
selected_target

'CHEMBL3927'
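Selecting a target by position depends on the order in which the search results come back. A minimal sketch of selecting by name instead is shown below, assuming the search results can be treated as a list of dictionaries with `pref_name` and `target_chembl_id` keys. The `records` list is a hypothetical stand-in (only the CHEMBL3927 entry comes from this notebook), and `pick_target` is a hypothetical helper, not part of the ChEMBL client.

```python
# Hypothetical stand-in for the records returned by a target search.
# Only the CHEMBL3927 entry is taken from this notebook; the other
# entry is a placeholder with a made-up ID.
records = [
    {"pref_name": "Placeholder target", "target_chembl_id": "CHEMBL0000000"},
    {"pref_name": "SARS coronavirus 3C-like proteinase", "target_chembl_id": "CHEMBL3927"},
]


def pick_target(records, pref_name):
    """Return the ChEMBL ID of the first record whose pref_name matches exactly."""
    for record in records:
        if record.get("pref_name") == pref_name:
            return record["target_chembl_id"]
    raise LookupError(f"no target named {pref_name!r}")


selected_target = pick_target(records, "SARS coronavirus 3C-like proteinase")
print(selected_target)
```

With real search results, the same idea could be applied to the targets DataFrame by filtering on its name column rather than indexing with `[4]`.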

Here, we retrieve only the bioactivity data for coronavirus 3C-like proteinase (CHEMBL3927) that are reported as IC$_{50}$ values. The standardized values are expressed in nM (nanomolar).

In [15]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [16]:
df = pd.DataFrame.from_dict(res)

In [17]:
df.head(3)

| | activity_comment | activity_id | activity_properties | assay_chembl_id | assay_description | assay_type | assay_variant_accession | assay_variant_mutation | bao_endpoint | bao_format | ... | target_organism | target_pref_name | target_tax_id | text_value | toid | type | units | uo_units | upper_value | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 1480935 | [] | CHEMBL829584 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | BAO_0000357 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 7.2 |
| 1 | | 1480936 | [] | CHEMBL829584 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | BAO_0000357 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 9.4 |
| 2 | | 1481061 | [] | CHEMBL830868 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | BAO_0000357 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 13.5 |


The 'standard_value' column holds the IC50 of each compound, a measure of its potency: the lower the value, the more potent the compound.
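The rows above report a value of 7.2 with units uM, while the processed table later in this notebook lists 7200.0 for the same compound. A minimal sketch of that relationship follows, assuming standard_value is simply the reported value converted to nM. `TO_NM` and `to_nanomolar` are hypothetical names used only for illustration.

```python
# Conversion factors from common concentration units to nanomolar (nM).
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def to_nanomolar(value, units):
    """Convert a concentration to nM; raises KeyError for unknown units."""
    return float(value) * TO_NM[units]


# The three values shown in the first rows of the table, all in uM.
for raw in (7.2, 9.4, 13.5):
    print(raw, "uM =", to_nanomolar(raw, "uM"), "nM")
```

Working in a single unit keeps the activity thresholds used later comparable across assays.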

In [18]:
df.columns

Index(['activity_comment', 'activity_id', 'activity_properties',
       'assay_chembl_id', 'assay_description', 'assay_type',
       'assay_variant_accession', 'assay_variant_mutation', 'bao_endpoint',
       'bao_format', 'bao_label', 'canonical_smiles', 'data_validity_comment',
       'data_validity_description', 'document_chembl_id', 'document_journal',
       'document_year', 'ligand_efficiency', 'molecule_chembl_id',
       'molecule_pref_name', 'parent_molecule_chembl_id', 'pchembl_value',
       'potential_duplicate', 'qudt_units', 'record_id', 'relation', 'src_id',
       'standard_flag', 'standard_relation', 'standard_text_value',
       'standard_type', 'standard_units', 'standard_upper_value',
       'standard_value', 'target_chembl_id', 'target_organism',
       'target_pref_name', 'target_tax_id', 'text_value', 'toid', 'type',
       'units', 'uo_units', 'upper_value', 'value'],
      dtype='object')

Next, we save the raw bioactivity data to a CSV file, bioactivity_data.csv.

In [20]:
df.to_csv('bioactivity_data.csv', index=False)

# Handling Missing Values

If any compound has a missing value in the standard_value column, drop it.

In [28]:
df2 = df[df.standard_value.notna()]
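Boolean filtering keeps the original row labels, so the index has a gap wherever a row was dropped. This matters when the result is later combined with a freshly built Series, because `pd.concat` with `axis=1` aligns rows by index label. A minimal sketch on toy data (the IDs are hypothetical) shows the gap and one way to close it:

```python
import pandas as pd

# Toy data with hypothetical IDs; the middle row has no standard_value.
toy = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "standard_value": ["7200.0", None, "13500.0"],
})

# Same filter as in the cell above: keep rows with a standard_value.
kept = toy[toy.standard_value.notna()]
index_before = list(kept.index)  # original labels survive the filter

# Renumber the rows from zero so positions and labels agree again.
kept = kept.reset_index(drop=True)
index_after = list(kept.index)

print(index_before, index_after)
```

If no rows are dropped, the reset is a no-op, so it is cheap to apply defensively.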

# Data pre-processing of the bioactivity data

# Labeling compounds as either being active, inactive or intermediate

The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered active, while those with values of 10,000 nM or more are considered inactive. Compounds with values between 1,000 and 10,000 nM are referred to as intermediate.

In [29]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
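The same thresholds can be wrapped in a small function, which makes the boundary cases easy to check in isolation. This is a sketch, not the notebook's own method: `classify_ic50` is a hypothetical helper that mirrors the loop above, and the example values are taken from the processed table in this notebook plus one boundary value.

```python
def classify_ic50(ic50_nm):
    """Label an IC50 value (in nM) using the thresholds from the loop above."""
    value = float(ic50_nm)  # ChEMBL values may arrive as strings
    if value >= 10000:
        return "inactive"
    if value <= 1000:
        return "active"
    return "intermediate"


examples = ["7200.0", 13500.0, 2000.0, 1000]
labels = [classify_ic50(v) for v in examples]
print(labels)
```

With a function like this, the labels could also be produced with a list comprehension over df2.standard_value instead of an explicit loop.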

In [34]:
# molecule_chembl_id is the unique ID of a compound; this dataset contains several compounds, and a single compound may appear more than once
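To see which compounds are repeated, one option is to count how often each ID occurs. The sketch below uses two IDs from this notebook; the repeated entry is hypothetical and added only to illustrate the check.

```python
from collections import Counter

# IDs taken from this notebook's data; the third entry is a
# hypothetical repeat added for illustration.
ids = ["CHEMBL187579", "CHEMBL188487", "CHEMBL187579"]

# Count occurrences and keep only IDs that appear more than once.
counts = Counter(ids)
repeated = {mol_id: n for mol_id, n in counts.items() if n > 1}
print(repeated)
```

On the real data, the list of IDs would come from the molecule_chembl_id column.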

In [35]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]

In [46]:
# Reset the index so rows pair up by position with the new label Series
data = pd.concat([df3.reset_index(drop=True), pd.Series(bioactivity_class)], axis=1)

In [47]:
data

| | molecule_chembl_id | canonical_smiles | standard_value | 0 |
|---|---|---|---|---|
| 0 | CHEMBL187579 | `Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21` | 7200.0 | intermediate |
| 1 | CHEMBL188487 | `O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21` | 9400.0 | intermediate |
| 2 | CHEMBL185698 | `O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21` | 13500.0 | inactive |
| 3 | CHEMBL426082 | `O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21` | 13110.0 | inactive |
| 4 | CHEMBL187717 | `O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-]` | 2000.0 | intermediate |
| ... | ... | ... | ... | ... |
| 128 | CHEMBL2146517 | `COC(=O)[C@@]1(C)CCCc2c1ccc1c2C(=O)C(=O)c2c(C)c...` | 10600.0 | inactive |
| 129 | CHEMBL187460 | `C[C@H]1COC2=C1C(=O)C(=O)c1c2ccc2c1CCCC2(C)C` | 10100.0 | inactive |
| 130 | CHEMBL363535 | `Cc1coc2c1C(=O)C(=O)c1c-2ccc2c(C)cccc12` | 11500.0 | inactive |
| 131 | CHEMBL227075 | `Cc1cccc2c3c(ccc12)C1=C(C(=O)C3=O)[C@@H](C)CO1` | 10700.0 | inactive |
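The label column in the table above is headed 0 because the Series holding the labels has no name. A minimal sketch of giving it one is shown below, using two rows from the table; the column name `bioactivity_class` is an assumption chosen to match the list variable, not something the notebook defines.

```python
import pandas as pd

# Two rows taken from the table above (SMILES column omitted for brevity).
df3 = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL187579", "CHEMBL185698"],
    "standard_value": ["7200.0", "13500.0"],
})

# Naming the Series sets the header of the new column after concatenation.
labels = pd.Series(["intermediate", "inactive"], name="bioactivity_class")
data = pd.concat([df3, labels], axis=1)
print(list(data.columns))
```

A named column also gives the saved CSV a self-describing header.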


Save the DataFrame to a CSV file.

In [43]:
data.to_csv('bioactivity_preprocessed_data.csv', index=False)

In [44]:
! ls

Untitled.ipynb                    chembl_webresource_client
bioactivity_data.csv              untitled folder
bioactivity_preprocessed_data.csv
