# PLK1 (Machine Learning for Predictive Modeling of Bioactivity)


* In recent years, the availability of large volumes of biological and chemical data has led to the development of various computational approaches for drug discovery. One such approach is the use of machine learning techniques to analyze bioactivity data.

* In this project, we aim to develop a predictive model using machine learning algorithms to accurately predict the bioactivity of small molecules against a specific target (PLK1), and to evaluate the performance of different models using various evaluation metrics such as R-squared score, mean squared error, and root mean squared error. Our goal is to provide a useful tool for drug discovery researchers to accelerate the process of identifying potential lead compounds with high bioactivity.
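
The three evaluation metrics named above can be sketched in plain Python. This is only an illustration: the helper name `regression_metrics` and the sample values are assumptions, not part of this project's code or results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R-squared, MSE and RMSE for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the true values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse)}


# Hypothetical true and predicted values, for illustration only.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.5, 7.5, 7.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE is the square root of MSE, so it is expressed in the same units as the predicted quantity, which makes it easier to read than MSE alone.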

# PLK1 Pre-Processing Data

* ChEMBL is a curated database of bioactivity data covering more than 2.4 million compounds. As of 4/6/2023, it is compiled from more than 86,000 documents and 1.5 million assays, and spans 15,000 targets, 2,000 cells, and 45,000 indications.
* This notebook prepares pre-processed data for a machine learning model from the ChEMBL bioactivity data for PLK1 (Polo-like kinase 1).


* The chembl_webresource_client library (installed below) provides access to ChEMBL data and cheminformatics tools from Python.


* The PLK1 biological activity data is retrieved from the ChEMBL database. After pre-processing, the data set retains the molecule IDs, their corresponding SMILES notation, and a class label (active, intermediate, or inactive) for each compound. This chemical structure information is needed in part 2 to compute the molecular descriptors.


In [None]:
! pip install chembl_webresource_client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.8-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting attrs<22.0,>=21.2
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Installing collected packages: url-normalize, attrs, requests-cache, chembl_webresource_client
  Attempting uninstall: attrs
    Found existing installation: attrs 22.2.0
    Uninstalling attrs-22.2.0:
      Successfully uninstalled attrs-22.2.0
Successfully installed 

In [None]:
# Importing libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client


Polo-like kinase 1 (Plk1) regulates mitotic processes that are crucial for cellular growth. Plk1 overexpression is strongly linked to the development of various cancers in humans, and a vast body of data shows that Plk1 is an appealing target for anticancer treatment development. Drugs targeting Plk1 may be aimed at two separate sites: the N-terminal catalytic domain, which phosphorylates substrates, and the C-terminal polo-box domain, which is required for protein-protein interactions.

Shakeel I, Basheer N, Hasan GM, Afzal M, Hassan MI. Polo-like Kinase 1 as an emerging drug target: structure, function and therapeutic implications. J Drug Target. 2021 Feb;29(2):168-184. doi: 10.1080/1061186X.2020.1818760. Epub 2020 Sep 14. PMID: 32886539.

In [None]:
# Target search for PLK1
target = new_client.target
target_query = target.search('PLK1')
targets = pd.DataFrame.from_dict(target_query)
targets     

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P70032', 'xref_name': None, 'xre...",Xenopus laevis,Serine/threonine-protein kinase PLK1,28.0,False,CHEMBL4519,"[{'accession': 'P70032', 'component_descriptio...",SINGLE PROTEIN,8355
1,"[{'xref_id': 'P53350', 'xref_name': None, 'xre...",Homo sapiens,Serine/threonine-protein kinase PLK1,18.0,False,CHEMBL3024,"[{'accession': 'P53350', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Homo sapiens,Cereblon/Serine/threonine-protein kinase PLK1,18.0,False,CHEMBL4742280,"[{'accession': 'P53350', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
3,[],Homo sapiens,Mitotic interactor and substrate of PLK1,15.0,False,CHEMBL4295893,"[{'accession': 'Q8IVT2', 'component_descriptio...",SINGLE PROTEIN,9606


In [None]:
# Select the Homo sapiens PLK1 target (Serine/threonine-protein kinase PLK1)
selected_target = targets.target_chembl_id[1]
selected_target


'CHEMBL3024'
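
Selecting by position (`[1]`) depends on the order in which the search returns its results. A minimal sketch of selecting by column values instead is shown here; the small DataFrame is rebuilt by hand from the search output above, and the choice of filter columns is an assumption about what identifies the intended target.

```python
import pandas as pd

# Hand-built copy of the relevant columns from the search result above.
targets = pd.DataFrame({
    "organism": ["Xenopus laevis", "Homo sapiens", "Homo sapiens", "Homo sapiens"],
    "pref_name": [
        "Serine/threonine-protein kinase PLK1",
        "Serine/threonine-protein kinase PLK1",
        "Cereblon/Serine/threonine-protein kinase PLK1",
        "Mitotic interactor and substrate of PLK1",
    ],
    "target_type": [
        "SINGLE PROTEIN",
        "SINGLE PROTEIN",
        "PROTEIN-PROTEIN INTERACTION",
        "SINGLE PROTEIN",
    ],
    "target_chembl_id": ["CHEMBL4519", "CHEMBL3024", "CHEMBL4742280", "CHEMBL4295893"],
})

# Keep only the human, single-protein PLK1 entry.
mask = (
    (targets["organism"] == "Homo sapiens")
    & (targets["target_type"] == "SINGLE PROTEIN")
    & (targets["pref_name"] == "Serine/threonine-protein kinase PLK1")
)
matches = targets.loc[mask, "target_chembl_id"]

# Fail loudly if the search ever returns zero or several candidates.
assert len(matches) == 1
selected_target = matches.iloc[0]
print(selected_target)  # CHEMBL3024
```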

This will use the ChEMBL web services client to query the ChEMBL database for bioactivity data related to a specific target (specified by the selected_target variable).

The activity.filter() method filters the bioactivity data on specific criteria. Here, target_chembl_id=selected_target restricts the records to the selected target, and standard_type="IC50" keeps only records whose standard type is IC50, the half-maximal inhibitory concentration (the concentration of a compound needed to inhibit the target's activity by 50%).

The retrieved bioactivity data is stored in the res variable.
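
To make the effect of the two chained criteria concrete, the sketch below mimics them on a few hand-made records, with no network call. The helper `filter_records` and all record values are hypothetical; it imitates the selection logic only and is not the client's implementation.

```python
# Hand-made records, for illustration only (not real ChEMBL rows).
records = [
    {"target_chembl_id": "CHEMBL3024", "standard_type": "IC50", "standard_value": "10000.0"},
    {"target_chembl_id": "CHEMBL3024", "standard_type": "Ki", "standard_value": "35.0"},
    {"target_chembl_id": "CHEMBL4519", "standard_type": "IC50", "standard_value": "120.0"},
]


def filter_records(rows, **criteria):
    """Keep the rows whose fields equal every given criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


# Chaining two filters narrows the result step by step (logical AND).
by_target = filter_records(records, target_chembl_id="CHEMBL3024")
by_target_and_type = filter_records(by_target, standard_type="IC50")
print(len(by_target), len(by_target_and_type))  # 2 1
```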

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,750655,[],CHEMBL763908,Inhibition of PLK1 kinase,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,nM,UO_0000065,,10000.0
1,,1662506,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
2,,1662531,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,20.0
3,,1662532,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
4,,1662533,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1413,,24404539,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,0.2
1414,,24404540,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,1.65
1415,,24404541,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,1.23
1416,,24404542,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,0.83


# Saving csv file 

In [None]:
df.to_csv('plk1_01_bioactivity_data_raw.csv', index=False)

# Pre-Processing Steps

Data cleaning: This involves removing any irrelevant or duplicated data, correcting errors, and dealing with missing data.

Data normalization: This involves transforming the data to a common scale so that different features can be compared.

Feature selection: This involves selecting the most relevant features from the dataset based on their importance and relevance to the target variable.
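
For IC50 data, bringing values to a common scale starts with units. In the rows shown above, the raw value column mixes nM and uM, while standard_value is expressed in nM (for example, 100 uM appears as 100000.0). A minimal sketch of that conversion follows; the helper `to_nanomolar` and the factor table are assumptions made for illustration, since ChEMBL already supplies the standardized value.

```python
# Conversion factors to nanomolar (illustrative subset of units).
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def to_nanomolar(value, units):
    """Convert a concentration to nM, rejecting units we do not know."""
    if units not in TO_NM:
        raise ValueError(f"unsupported unit: {units!r}")
    return float(value) * TO_NM[units]


print(to_nanomolar(100.0, "uM"))    # 100000.0
print(to_nanomolar(10000.0, "nM"))  # 10000.0
```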

# Removing missing values
Rows with a missing standard_value or canonical_smiles are dropped here.

In [None]:
# Keep only rows that have both a standard_value and a canonical_smiles entry
df2 = df.dropna(subset=['standard_value', 'canonical_smiles'])
df2


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,750655,[],CHEMBL763908,Inhibition of PLK1 kinase,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,nM,UO_0000065,,10000.0
1,,1662506,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
2,,1662531,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,20.0
3,,1662532,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
4,,1662533,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1413,,24404539,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,0.2
1414,,24404540,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,1.65
1415,,24404541,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,1.23
1416,,24404542,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,0.83


In [None]:
len(df2.canonical_smiles.unique())


1311

In [None]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,750655,[],CHEMBL763908,Inhibition of PLK1 kinase,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,nM,UO_0000065,,10000.0
1,,1662506,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
2,,1662531,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,20.0
3,,1662532,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
4,,1662533,[],CHEMBL864552,Inhibitory activity against Plk1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1413,,24404539,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,0.2
1414,,24404540,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,1.65
1415,,24404541,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,1.23
1416,,24404542,[],CHEMBL5044640,Inhibition of Plk1-PBD in HEK293A cells expres...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Serine/threonine-protein kinase PLK1,9606,,,IC50,uM,UO_0000065,,0.83
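
drop_duplicates keeps the first row for each SMILES string by default, so when a compound has several IC50 measurements, the value that survives depends on row order. The sketch below, on hand-made data (the SMILES strings and values are illustrative, not taken from ChEMBL, and standard_value is assumed to be numeric), contrasts this with taking the median per compound. The median is an alternative shown for comparison and is not what this notebook does.

```python
import pandas as pd

# Toy data: the first compound has two measurements.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
    "standard_value": [500.0, 1500.0, 20000.0],
})

# Same call as in the notebook: the first row per SMILES is kept.
first_kept = toy.drop_duplicates(["canonical_smiles"])

# Alternative (not used in this notebook): median IC50 per compound.
median_per_compound = toy.groupby(
    "canonical_smiles", as_index=False
)["standard_value"].median()

print(first_kept)
print(median_per_compound)
```

With the first-row rule, the toy compound measured at 500 nM and 1500 nM keeps 500 nM; with the median it would be assigned 1000 nM.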


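# Pre-Processing bioactivity data
Only the columns needed downstream are kept: molecule_chembl_id, canonical_smiles, and standard_value.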

In [None]:

selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,10000.0
1,CHEMBL200586,COC(=O)c1cc2c(C)n[nH]c2s1,100000.0
2,CHEMBL199996,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)Nc3ccccc3)c12,20000.0
3,CHEMBL199658,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccc(Cl)cc3)c12,100000.0
4,CHEMBL199657,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3cccc(Cl)c3)c12,100000.0
...,...,...,...
1413,CHEMBL5082952,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,200.0
1414,CHEMBL5075075,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,1650.0
1415,CHEMBL5093970,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,1230.0
1416,CHEMBL5078060,COCCOCCOCCOCCOc1cc(OCCOCCOCCOCCOC)cc(C(=O)N[C@...,830.0


In [None]:
# Save the selected columns to CSV
df3.to_csv('plk1_02_bioactivity_data_preprocessed.csv', index=False)


# Compound labeling
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered active, those with values of 10,000 nM or more are considered inactive, and those with values between 1,000 and 10,000 nM are labeled intermediate.

In [None]:
df4 = pd.read_csv('plk1_02_bioactivity_data_preprocessed.csv')


This creates a list bioactivity_threshold that categorizes the bioactivity of the compounds based on their standard_value in the df4 DataFrame.

The criteria for categorization is as follows:

* "inactive" if the standard_value is greater than or equal to 10,000 nM
* "active" if the standard_value is less than or equal to 1,000 nM
* "intermediate" if the standard_value is between 1,000 and 10,000 nM.

The for loop iterates through each value in the standard_value column of df4, converts it to a float, and compares it to the activity thresholds. The corresponding category is appended to the bioactivity_threshold list.
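
The criteria above can also be written as a small function, which makes the boundary cases easy to check: exactly 1,000 nM counts as active and exactly 10,000 nM counts as inactive. The function name `classify_ic50` is hypothetical; the thresholds are the ones stated in this notebook.

```python
def classify_ic50(standard_value_nm):
    """Label an IC50 value (in nM) as active, intermediate or inactive."""
    value = float(standard_value_nm)  # accepts numbers or numeric strings
    if value >= 10000:
        return "inactive"
    if value <= 1000:
        return "active"
    return "intermediate"


# A few values, including both boundaries and one numeric string.
for v in (830.0, 1000, 1650.0, 10000.0, "100000.0"):
    print(v, classify_ic50(v))
```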






In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")


This builds a pandas Series named 'class' from the bioactivity_threshold list generated earlier, then concatenates df4 and the bioactivity_class Series column-wise using concat with axis=1. The resulting DataFrame, which carries the new 'class' column, is assigned to df5.
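
One detail worth noting: pd.concat with axis=1 aligns rows by index label, not by position. In this notebook df4 is freshly read from CSV, so its index runs from 0 upward and matches the Series. The sketch below, on hand-made data (the values and the gapped index are illustrative assumptions), shows what would happen if the DataFrame still carried the gaps left by filtering and de-duplication.

```python
import pandas as pd

# A frame whose index has gaps, as left behind by row filtering.
gapped = pd.DataFrame({"standard_value": [200.0, 1650.0, 830.0]}, index=[0, 2, 5])
labels = pd.Series(["active", "intermediate", "active"], name="class")

# Rows are matched by index label, so the labels land on the wrong rows
# and unmatched labels produce extra rows filled with NaN.
misaligned = pd.concat([gapped, labels], axis=1)

# Resetting the index first restores a clean positional match.
aligned = pd.concat([gapped.reset_index(drop=True), labels], axis=1)

print(misaligned)
print(aligned)
```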

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL115220,O=C(Cc1ccc2ccccc2c1)Nc1cc(C2CC2)n[nH]1,10000.0,inactive
1,CHEMBL200586,COC(=O)c1cc2c(C)n[nH]c2s1,100000.0,inactive
2,CHEMBL199996,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)Nc3ccccc3)c12,20000.0,inactive
3,CHEMBL199658,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3ccc(Cl)cc3)c12,100000.0,inactive
4,CHEMBL199657,Cc1n[nH]c2sc(C(N)=O)c(NC(=O)c3cccc(Cl)c3)c12,100000.0,inactive
...,...,...,...,...
1306,CHEMBL5082952,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,200.0,active
1307,CHEMBL5075075,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,1650.0,intermediate
1308,CHEMBL5093970,COC(=O)[C@H](CC(C)C)NC(=O)[C@@H]1CC[C@H]2OCCNC...,1230.0,intermediate
1309,CHEMBL5078060,COCCOCCOCCOCCOc1cc(OCCOCCOCCOCCOC)cc(C(=O)N[C@...,830.0,active


Storing curated csv for Exploratory Data Analysis 

In [None]:
df5.to_csv('plk1_03_bioactivity_data_curated.csv', index=False)


In [None]:
! zip plk1.zip *.csv

updating: plk1_01_bioactivity_data_raw.csv (deflated 91%)
updating: plk1_02_bioactivity_data_preprocessed.csv (deflated 82%)
updating: plk1_03_bioactivity_data_curated.csv (deflated 83%)


In [None]:
! ls -l


total 1188
-rw-r--r-- 1 root root 847517 Apr 18 20:52 plk1_01_bioactivity_data_raw.csv
-rw-r--r-- 1 root root 105934 Apr 18 20:53 plk1_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 117615 Apr 18 20:53 plk1_03_bioactivity_data_curated.csv
-rw-r--r-- 1 root root 135567 Apr 18 20:53 plk1.zip
drwxr-xr-x 1 root root   4096 Apr 14 13:35 sample_data
