# **Machine Learning (ML) For Drug Discovery - Part 1**
This is a complete crash course that gives you a fair idea of how ML works and how you can use it in a drug discovery project. I have tried my best to keep things simple so that you do not need to write code yourself, which makes it well suited to biologists. I want to acknowledge and refer you to the materials of Dr. Chanin Nantasenamat, an Associate Professor in Bioinformatics. These materials were compiled by **Dr. Ashfaq Ahmad**.

## **ChEMBL Database**

(https://www.ebi.ac.uk/chembl/)

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemicals, their bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

(**Citation:** *Nucleic Acids Res. 2019; 47(D1):D930-D940. doi: 10.1093/nar/gky1075*)

In this part of the training, I will show you how to use the ChEMBL database and why it matters. I will also teach you how to pick and process data from it.

You will need to execute the cells by clicking the ***Play*** button, making changes where needed to match your project details.


### **Step 1 - Installing and Importing Required Libraries**

Running the following cell installs the ChEMBL web resource client that the project needs.
Citation: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4489283/

In [27]:
! pip install chembl_webresource_client



In [28]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

### **Step 2 - Picking your desired Target from ChEMBL**

Here we search for the desired target protein or enzyme. To avoid mistakes, you are strongly advised to also confirm your ChEMBL ID by searching the ChEMBL database directly.

In [29]:
# As an example I will use P2Y purinoceptor 1 protein (UniProt Accession P47900)
target = new_client.target
target_query = target.search('CHEMBL4315')
targets = pd.DataFrame.from_dict(target_query)
targets

| | cross_references | organism | pref_name | score | species_group_flag | target_chembl_id | target_components | target_type | tax_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | Homo sapiens | Purinergic receptor P2Y1 | 16.0 | False | CHEMBL4315 | `[{'accession': 'P47900', 'component_descriptio...` | SINGLE PROTEIN | 9606 |
| 1 | [] | Homo sapiens | P2Y receptor | 7.0 | False | CHEMBL4524011 | `[{'accession': 'Q9H244', 'component_descriptio...` | PROTEIN FAMILY | 9606 |


### **Step 3 - Selection and Retrieval of Bioactivity Data**

If the search returns multiple rows, choose the desired target by using its index number in the cell below.

In [30]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL4315'
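
Picking the row by position works here because the desired target happens to come first. As a minimal sketch, assuming the `targets` table carries the `target_type`, `target_chembl_id`, and `target_components` columns shown in the output above, the row can instead be chosen by its UniProt accession. The `toy_targets` frame and the `has_accession` helper are hypothetical stand-ins so that the example runs without a network connection.

```python
import pandas as pd

# Toy stand-in for the `targets` table returned by the ChEMBL search above
toy_targets = pd.DataFrame({
    "pref_name": ["Purinergic receptor P2Y1", "P2Y receptor"],
    "target_chembl_id": ["CHEMBL4315", "CHEMBL4524011"],
    "target_type": ["SINGLE PROTEIN", "PROTEIN FAMILY"],
    "target_components": [
        [{"accession": "P47900"}],
        [{"accession": "Q9H244"}],
    ],
})

def has_accession(components, accession):
    """Hypothetical helper: True if any component carries the given UniProt accession."""
    return any(c.get("accession") == accession for c in components)

# Keep single-protein targets whose components include UniProt P47900
mask = (
    (toy_targets["target_type"] == "SINGLE PROTEIN")
    & toy_targets["target_components"].apply(lambda comps: has_accession(comps, "P47900"))
)
selected = toy_targets.loc[mask, "target_chembl_id"].iloc[0]
print(selected)
```

Selecting by accession keeps working even if the order of the search results changes between ChEMBL releases.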

If you look carefully at the data columns, you will find a variety of variables. Here we are interested in analysing antagonists, so I will retrieve only the IC50 records. Please read about IC50 values in case you are not familiar with them.

In [31]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)

In [32]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33943,[],CHEMBL751769,The compound was evaluated for antagonistic ac...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,1.0
1,,,68995,[],CHEMBL751769,The compound was evaluated for antagonistic ac...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,3.0
2,,,989362,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,377.0
3,,,989364,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,452.0
4,,,993873,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,356.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352,"{'action_type': 'INHIBITOR', 'description': 'N...",,25089840,[],CHEMBL5257728,Inhibition of human P2Y1,B,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,2.1
353,"{'action_type': 'AGONIST', 'description': 'Bin...",,25089842,[],CHEMBL5257729,Agonist activity at human P2Y1,B,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,5.8
354,"{'action_type': 'AGONIST', 'description': 'Bin...",,25089843,[],CHEMBL5257729,Agonist activity at human P2Y1,B,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,0.84
355,"{'action_type': 'INHIBITOR', 'description': 'N...",,25089889,[],CHEMBL5257745,Inhibition of human P2Y1 by FLIPR assay,B,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,0.12


Let's explore the data above, which contains too many rows and columns to read at once. Running the command below shows only the first four rows.

In [33]:
df.head(4)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33943,[],CHEMBL751769,The compound was evaluated for antagonistic ac...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,1.0
1,,,68995,[],CHEMBL751769,The compound was evaluated for antagonistic ac...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,3.0
2,,,989362,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,377.0
3,,,989364,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,452.0


The standard_type column in the data should contain only IC50. To validate this, list the unique values in standard_type.

In [34]:
df.standard_type.unique()

array(['IC50'], dtype=object)
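
The same kind of check is useful for units, because the `value` and `units` columns in the output above mix uM and nM. The following is a minimal sketch, assuming the retrieved records also carry a `standard_units` column alongside `standard_value`, and that numbers may arrive as text. The `toy` frame is a hypothetical stand-in built from two of the rows shown above.

```python
import pandas as pd

# Toy stand-in for two activity rows: reported value/units and standardized value/units
toy = pd.DataFrame({
    "value": ["1.0", "377.0"],
    "units": ["uM", "nM"],
    "standard_value": ["1000.0", "377.0"],
    "standard_units": ["nM", "nM"],
})

# Convert text to numbers; anything unparsable becomes NaN
toy["standard_value"] = pd.to_numeric(toy["standard_value"], errors="coerce")

# All standardized rows should share one unit before values are compared
print(toy["standard_units"].unique())

# Keep only rows expressed in nM
nm_only = toy[toy["standard_units"] == "nM"]
print(nm_only[["standard_value", "standard_units"]])
```

Here 1.0 uM appears as 1000.0 nM in `standard_value`, which is why the later steps work with `standard_value` rather than `value`.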

Next, save your data to Google Drive. To do so, first connect your Google Drive to this notebook.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Once the drive is mounted, make a new folder named data where you can save your work directly.

In [None]:
! ls /content/gdrive/MyDrive/

In [None]:
! mkdir -p "/content/gdrive/MyDrive/Colab Notebooks/data"

We are now in a position to save our data in a CSV file. I will use the name P2Y1-01-bioactivity_data_raw.csv.

In [None]:
df.to_csv('P2Y1-01-bioactivity_data_raw.csv', index=False)

In [None]:
! cp P2Y1-01-bioactivity_data_raw.csv "/content/gdrive/MyDrive/Colab Notebooks/data"

In [None]:
ls "/content/gdrive/MyDrive/Colab Notebooks/data"
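
The save and copy steps can also be combined by writing the CSV straight into the target folder. This is a minimal sketch under stated assumptions: a temporary folder stands in for the Google Drive path, and `toy_df` is a hypothetical one-row stand-in for `df`.

```python
import os
import tempfile
import pandas as pd

# A temporary folder stands in for "/content/gdrive/MyDrive/Colab Notebooks/data"
data_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(data_dir, exist_ok=True)  # no error if the folder already exists

# Hypothetical one-row stand-in for the bioactivity data frame
toy_df = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL133572"],
    "standard_value": [1000.0],
})

# Write the file directly into the folder, without a separate copy step
out_path = os.path.join(data_dir, "P2Y1-01-bioactivity_data_raw.csv")
toy_df.to_csv(out_path, index=False)

print(os.listdir(data_dir))
```

In the notebook itself, `data_dir` would be the mounted Drive folder and `toy_df` would be `df`.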

If you want to see how the data looks, take a look at the first lines of the saved file.

In [None]:
! head P2Y1-01-bioactivity_data_raw.csv

### **Step 4 - Treating Missing Data**

Some ChEMBL compounds may have missing values for **standard_value** (the IC50) or **canonical_smiles**. We therefore need to remove such entries from the data.

In [35]:
df2 = df[df.standard_value.notna()]

In [36]:
# Filter df2 (not df) so that the standard_value filter from the previous cell is kept
df2 = df2[df2.canonical_smiles.notna()]

In [37]:
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33943,[],CHEMBL751769,The compound was evaluated for antagonistic ac...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,1.0
1,,,68995,[],CHEMBL751769,The compound was evaluated for antagonistic ac...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,3.0
2,,,989362,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,377.0
3,,,989364,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,452.0
4,,,993873,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,356.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352,"{'action_type': 'INHIBITOR', 'description': 'N...",,25089840,[],CHEMBL5257728,Inhibition of human P2Y1,B,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,2.1
353,"{'action_type': 'AGONIST', 'description': 'Bin...",,25089842,[],CHEMBL5257729,Agonist activity at human P2Y1,B,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,5.8
354,"{'action_type': 'AGONIST', 'description': 'Bin...",,25089843,[],CHEMBL5257729,Agonist activity at human P2Y1,B,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,0.84
355,"{'action_type': 'INHIBITOR', 'description': 'N...",,25089889,[],CHEMBL5257745,Inhibition of human P2Y1 by FLIPR assay,B,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,0.12
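
Both missing-value filters can also be applied in one call. The following is a minimal sketch using pandas `dropna` with its `subset` argument; the `toy` frame and its placeholder IDs are hypothetical and only illustrate which rows survive.

```python
import pandas as pd

# Toy stand-in: row B lacks a SMILES string, row C lacks an IC50 value
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],  # placeholder IDs
    "canonical_smiles": ["CCO", None, "CCN", "CCC"],
    "standard_value": ["1000.0", "377.0", None, "452.0"],
})

# Drop a row if either of the two required columns is missing
clean = toy.dropna(subset=["standard_value", "canonical_smiles"])

print(len(toy), "->", len(clean))
print(clean["molecule_chembl_id"].tolist())
```

Only rows that have both a structure and a measured value remain, which is what the later modelling steps need.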


In [None]:
len(df2.canonical_smiles.unique())

In [38]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])

In [39]:
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33943,[],CHEMBL751769,The compound was evaluated for antagonistic ac...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,1.0
1,,,68995,[],CHEMBL751769,The compound was evaluated for antagonistic ac...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,3.0
2,,,989362,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,377.0
3,,,989364,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,452.0
4,,,993873,[],CHEMBL750159,Antagonistic activity against P2Y purinoceptor 1,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,nM,UO_0000065,,356.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
343,"{'action_type': 'ANTAGONIST', 'description': '...",,24773444,[],CHEMBL5127323,Antagonist activity at P2Y1 receptor (unknown ...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,0.48
344,"{'action_type': 'ANTAGONIST', 'description': '...",,24773445,[],CHEMBL5127323,Antagonist activity at P2Y1 receptor (unknown ...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,1.6
345,"{'action_type': 'ANTAGONIST', 'description': '...",,24773446,[],CHEMBL5127323,Antagonist activity at P2Y1 receptor (unknown ...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,1.09
346,"{'action_type': 'ANTAGONIST', 'description': '...",,24773447,[],CHEMBL5127323,Antagonist activity at P2Y1 receptor (unknown ...,F,,,BAO_0000190,...,Homo sapiens,Purinergic receptor P2Y1,9606,,,IC50,uM,UO_0000065,,0.84
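
Note that `drop_duplicates` keeps the first row it meets for each SMILES string and discards the rest, whatever their IC50 values. The sketch below shows this on a hypothetical `toy` frame and, purely as an alternative that this notebook does not use, how repeated measurements could be summarised by their median instead.

```python
import pandas as pd

# Toy stand-in: the structure "CCO" was measured twice
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "CCN"],
    "standard_value": [377.0, 452.0, 1000.0],
})

# What the notebook does: keep the first row per structure
first_only = toy.drop_duplicates(["canonical_smiles"])
print(first_only)

# Alternative (not used in this notebook): one median value per structure
median_per_smiles = (
    toy.groupby("canonical_smiles", as_index=False)["standard_value"].median()
)
print(median_per_smiles)
```

Which option is appropriate depends on the project; keeping the first row is simpler, while the median is less sensitive to the order of the records.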


### **Step 5 - Data Preprocessing**

We will keep three columns: **molecule_chembl_id**, **canonical_smiles**, and **standard_value**.

In [40]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]

In [41]:
df3

| | molecule_chembl_id | canonical_smiles | standard_value |
|---|---|---|---|
| 0 | CHEMBL133572 | `Nc1c(S(=O)(=O)[O-])cc(Nc2ccc(Nc3nc(Cl)nc(Cl)n3...` | 1000.0 |
| 1 | CHEMBL133576 | `Cc1cc(C)c(Nc2cc(S(=O)(=O)[O-])c(N)c3c2C(=O)c2c...` | 3000.0 |
| 2 | CHEMBL356041 | `CNc1nc([Se]C)nc2c1ncn2C1CC(OP(=O)(O)O)[C@]2(CO...` | 377.0 |
| 3 | CHEMBL2112869 | `CCCCCCc1nc(NC)c2ncn([C@H]3C[C@H](OP(=O)(O)O)[C...` | 452.0 |
| 4 | CHEMBL2112868 | `CNc1nc(F)nc2c1ncn2[C@H]1C[C@H](OP(=O)(O)O)[C@]...` | 356.0 |
| ... | ... | ... | ... |
| 343 | CHEMBL432028 | `CNc1ncnc2c1ncn2CCC(COP(=O)(O)O)COP(=O)(O)O` | 480.0 |
| 344 | CHEMBL108166 | `CNc1nc(Cl)nc2c1ncn2CCC(COP(=O)(O)O)COP(=O)(O)O` | 1600.0 |
| 345 | CHEMBL320924 | `CNc1ncnc2c1ncn2CC(COP(=O)(O)O)COP(=O)(O)O` | 1090.0 |
| 346 | CHEMBL104784 | `CNc1nc(Cl)nc2c1ncn2CC(COP(=O)(O)O)COP(=O)(O)O` | 840.0 |


Now we save the data frame df3 to a CSV file.

In [None]:
df3.to_csv('P2Y1-bioactivity-data-02-preprocessed.csv', index=False)

### **Step 6 - Labeling Compounds into Different Categories**

Here we categorize the compounds into classes based on their IC50 value (**standard_value**). We use three groups: **Active**, **Inactive**, and **Intermediate**. Compounds with an IC50 of 1,000 nM or less are considered active, those with an IC50 of 10,000 nM or more are considered inactive, and those with an IC50 between 1,000 and 10,000 nM are termed intermediate.

In [42]:
df4 = pd.read_csv('P2Y1-bioactivity-data-02-preprocessed.csv')

In [43]:
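bioactivity_threshold = []  # one class label per compound
for i in df4.standard_value:
  if float(i) >= 10000:       # IC50 of 10,000 nM or more
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:      # IC50 of 1,000 nM or less
    bioactivity_threshold.append("active")
  else:                       # between 1,000 and 10,000 nM
    bioactivity_threshold.append("intermediate")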

In [44]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
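
The labeling loop can also be written without an explicit loop. As a minimal sketch, assuming the same cut-offs as the loop (10,000 and above is inactive, 1,000 and below is active), `numpy.select` assigns the labels in one step. The `toy` frame is a hypothetical stand-in for `df4`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df4, with values on and around both cut-offs (in nM)
toy = pd.DataFrame({"standard_value": [377.0, 1000.0, 3000.0, 10000.0, 50000.0]})

# Conditions are checked in order; rows matching none get the default label
conditions = [
    toy["standard_value"] >= 10000,
    toy["standard_value"] <= 1000,
]
labels = ["inactive", "active"]
toy["class"] = np.select(conditions, labels, default="intermediate")

print(toy)
```

Including values exactly at 1,000 and 10,000 in the toy data makes it easy to confirm which class the boundary cases fall into.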

In [45]:
df5

| | molecule_chembl_id | canonical_smiles | standard_value | units | class |
|---|---|---|---|---|---|
| 0 | CHEMBL152758 | `CC(N[C@H](C)C(=O)N1CCC[C@H]1C(=O)O)C(=O)O` | 90.0 | uM | active |
| 1 | CHEMBL291381 | `O=C(CCC(=O)N1CCCC1C(=O)O)NO` | 50000.0 | uM | inactive |
| 2 | CHEMBL358439 | `C[C@@H](NCC(=O)O)C(=O)N1CCC[C@H]1C(=O)O` | 2400.0 | uM | intermediate |
| 3 | CHEMBL1237 | `NCCCC[C@H](N[C@@H](CCc1ccccc1)C(=O)O)C(=O)N1CC...` | 1.2 | nM | active |
| 4 | CHEMBL293213 | `CC(CCC(=O)N1CCCC1C(=O)O)C(=O)O` | 260000.0 | uM | inactive |
| ... | ... | ... | ... | ... | ... |
| 750 | CHEMBL5429562 | `CC[C@H](C)[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)...` | 40.0 | nM | active |
| 751 | CHEMBL5397691 | `CC[C@H](C)[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)...` | 40.0 | nM | active |
| 752 | CHEMBL5415414 | `CC[C@H](C)[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)...` | 40.0 | nM | active |
| 753 | CHEMBL5400083 | `CC[C@H](C)[C@@H]1NC(=O)[C@H](CCCCNC(=N)N)NC(=O...` | 40.0 | nM | active |


Now we save the data frame df5 to a CSV file.

In [None]:
df5.to_csv('P2Y1_03_bioactivity_data_curated.csv', index=False)

In [None]:
! zip P2Y1-part1-data.zip *.csv

In [None]:
! cp P2Y1_03_bioactivity_data_curated.csv "/content/gdrive/MyDrive/Colab Notebooks/data"

In [None]:
ls "/content/gdrive/MyDrive/Colab Notebooks/data"