# 1. Introduction

In recent years, computational methods have emerged as indispensable tools in the field of drug discovery, offering unprecedented opportunities to expedite the identification and development of novel therapeutic agents. This project seeks to harness the power of machine learning and computational biology to predict the biological activity of chemical compounds, with a particular focus on targeting the protein Acetylcholinesterase (AChE) and its significance in Alzheimer's disease treatment.

# Objectives and Scope:
The primary objective of this project is to develop predictive models that can accurately forecast the biological activity of compounds targeting AChE. By leveraging machine learning algorithms and molecular descriptors, we aim to identify potential drug candidates with inhibitory activity against AChE, paving the way for the discovery of new treatments for Alzheimer's disease. The scope of the project encompasses data collection, preprocessing, model building, and evaluation, with a focus on translating computational insights into actionable strategies for drug discovery research.

# Importance of Predicting Biological Activity:
Predicting the biological activity of chemical compounds is a crucial step in the drug discovery process, as it allows researchers to rank and prioritize candidate molecules for further experimental validation. AChE, a key enzyme involved in neurotransmission, has been implicated in the pathogenesis of Alzheimer's disease, making it an attractive target for therapeutic intervention. By accurately predicting the biological activity of compounds targeting AChE, we can expedite the drug discovery process, potentially leading to the development of more effective treatments for Alzheimer's disease and other neurological disorders.



In [1]:
# Install the ChEMBL web resource client to access the bioactivity data
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.0-py3-none-any.whl (61 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-23.2.3-py3-none-any.whl (57 kB)
Collecting url-normalize

# Data Collection
**Description of the ChEMBL Database:**

The ChEMBL database is a comprehensive repository of bioactivity data, focusing primarily on small molecules and their interactions with biological targets. It contains a wealth of information sourced from scientific literature, patents, and other publicly available sources, making it a valuable resource for drug discovery research. ChEMBL provides curated data on compound-target interactions, including binding affinities, inhibition constants, and potency measurements, among others.

**Process of Retrieving Data for Compounds Targeting AChE**

To retrieve biological activity data for compounds targeting Acetylcholinesterase (AChE), we accessed the ChEMBL database programmatically through the chembl_webresource_client Python library. We searched the target table with the keyword "acetylcholinesterase", selected the human AChE entry (CHEMBL220) from the results, and extracted its compound bioactivity records, restricting them to potency measurements reported as IC50 values.
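
The target selection step can be made explicit rather than relying on the human entry being the first search result. The following is a minimal sketch in plain Python, assuming the search results are available as a list of dictionaries with the `organism`, `target_type`, and `target_chembl_id` fields seen in the search output; the sample rows and the helper name `pick_target` are illustrative, not part of the ChEMBL client.

```python
# Hypothetical rows shaped like the target search output in this notebook.
search_results = [
    {"pref_name": "Acetylcholinesterase", "organism": "Homo sapiens",
     "target_type": "SINGLE PROTEIN", "target_chembl_id": "CHEMBL220"},
    {"pref_name": "Cholinesterases; ACHE & BCHE", "organism": "Homo sapiens",
     "target_type": "SELECTIVITY GROUP", "target_chembl_id": "CHEMBL2095233"},
    {"pref_name": "Acetylcholinesterase", "organism": "Drosophila melanogaster",
     "target_type": "SINGLE PROTEIN", "target_chembl_id": "CHEMBL2242744"},
]


def pick_target(rows, organism="Homo sapiens", target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the first row matching organism and target type."""
    for row in rows:
        if row["organism"] == organism and row["target_type"] == target_type:
            return row["target_chembl_id"]
    raise LookupError(f"no {target_type} target found for {organism}")


selected = pick_target(search_results)
print(selected)
```

Filtering on organism and target type keeps the selection stable even if the ranking of the search results changes.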

**Details of the Dataset:**

The dataset comprises compounds targeting AChE retrieved from the ChEMBL database, encompassing a diverse range of chemical structures and biological activities. It includes the following details:

Biological Activity Values:
The primary measure of biological activity in the dataset is the half-maximal inhibitory concentration (IC50), representing the concentration of a compound required to inhibit AChE activity by 50%. Because the query filters on standard_type="IC50", other bioactivity measurement types are not retrieved.

Molecular Properties:
 Alongside bioactivity data, the dataset includes molecular properties and descriptors for each compound. These properties may encompass chemical features such as molecular weight, lipophilicity, hydrogen bonding capacity, and structural fingerprints, among others. Molecular descriptors play a crucial role in characterizing the physicochemical properties and biological activities of compounds, facilitating the modeling and prediction of bioactivity.
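
Because IC50 values span several orders of magnitude, they are often expressed on a negative logarithmic scale, pIC50 = -log10(IC50 in mol/L). The sketch below shows this conversion under the assumption that the IC50 is given in nanomolar; the function name `ic50_nm_to_pic50` is illustrative and not taken from this notebook.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


for ic50 in (1000.0, 10000.0):
    print(ic50, round(ic50_nm_to_pic50(ic50), 2))
```

On this scale a lower IC50 (a more potent compound) gives a higher pIC50: 1000 nM corresponds to a pIC50 of about 6, and 10000 nM to about 5.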

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [3]:
# Search for the target protein
target_protein = new_client.target

# Target search for Acetylcholinesterase
target_query = target_protein.search('acetylcholinesterase')

# Convert the target_query results into a dataframe
targets = pd.DataFrame.from_dict(target_query)

# Preview of the data set
targets.head()

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P22303', 'xref_name': None, 'xre...",Homo sapiens,Acetylcholinesterase,27.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Cholinesterases; ACHE & BCHE,27.0,False,CHEMBL2095233,"[{'accession': 'P06276', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Drosophila melanogaster,Acetylcholinesterase,18.0,False,CHEMBL2242744,"[{'accession': 'P07140', 'component_descriptio...",SINGLE PROTEIN,7227
3,[],Bemisia tabaci,AChE2,16.0,False,CHEMBL2366409,"[{'accession': 'B3SST5', 'component_descriptio...",SINGLE PROTEIN,7038
4,[],Leptinotarsa decemlineata,Acetylcholinesterase,16.0,False,CHEMBL2366490,"[{'accession': 'Q27677', 'component_descriptio...",SINGLE PROTEIN,7539


In [4]:
# Select Human Acetylcholinesterase (first entry in the search results, CHEMBL220)
selected_target_protein = targets.target_chembl_id[0]
# Retrieve only the bioactivity data for the selected target that are reported as IC50 values
bio_activity = new_client.activity
bio_activity_res = bio_activity.filter(target_chembl_id=selected_target_protein).filter(standard_type="IC50")

In [5]:
# Convert the bio_activity_res data into a dataframe
df = pd.DataFrame.from_dict(bio_activity_res)
# Preview the first few rows of the dataframe
df.head(5)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8


In [6]:
# Save the resulting bioactivity data to a CSV file, bioactivity_data_raw.csv
df.to_csv('bioactivity_data_raw.csv', index=False)

In [8]:
# Check for missing values in the dataset
print("\nMissing values in the dataset:")
print(df.isnull().sum())
# Drop compounds that have a missing value in the standard_value or canonical_smiles column
df2 = df[df.standard_value.notna()]
# Build the mask from df2 (not df) so that the mask and the frame share the same index
second_df = df2[df2.canonical_smiles.notna()]


Missing values in the dataset:
action_type                  7767
activity_comment             7493
activity_id                     0
activity_properties             0
assay_chembl_id                 0
assay_description               0
assay_type                      0
assay_variant_accession      8832
assay_variant_mutation       8832
bao_endpoint                    0
bao_format                      0
bao_label                       0
canonical_smiles               35
data_validity_comment        8202
data_validity_description    8202
document_chembl_id              0
document_journal              937
document_year                 873
ligand_efficiency            2595
molecule_chembl_id              0
molecule_pref_name           6991
parent_molecule_chembl_id       0
pchembl_value                2493
potential_duplicate             0
qudt_units                   1272
record_id                       0
relation                     1283
src_id                          0
standard_flag   



In [9]:
# Count the unique canonical_smiles values
len(second_df.canonical_smiles.unique())

6157

In [10]:
# check the length of the standard_value unique values
len(second_df.standard_value.unique())

3106

In [11]:
# Drop rows with duplicate canonical_smiles values, keeping the first occurrence
df2_new = second_df.drop_duplicates(['canonical_smiles'])
df2_new

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8825,"{'action_type': 'INHIBITOR', 'description': 'N...",,24963372,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5216438,Binding affinity to AChE (unknown origin) asse...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,nM,UO_0000065,,0.209
8827,"{'action_type': 'INHIBITOR', 'description': 'N...",,24963385,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5216448,Inhibition of recombinant human AChE using ace...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,nM,UO_0000065,,274.0
8828,"{'action_type': 'INHIBITOR', 'description': 'N...",,24965328,[],CHEMBL5217010,Inhibition of human recombinant AChE using S-a...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,76.2
8829,"{'action_type': 'INHIBITOR', 'description': 'N...",,24965329,[],CHEMBL5217010,Inhibition of human recombinant AChE using S-a...,B,,,BAO_0000190,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,55.0
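
The cleaning steps so far (dropping records with a missing standard_value or canonical_smiles, then keeping the first record per SMILES string) can be sketched in plain Python. The records and the helper name `clean` below are hypothetical and only illustrate the logic; the notebook itself performs these steps with pandas.

```python
# Hypothetical records; only the two fields used for cleaning are shown.
records = [
    {"canonical_smiles": "CCO", "standard_value": "750.0"},
    {"canonical_smiles": None, "standard_value": "100.0"},    # missing structure
    {"canonical_smiles": "CCN", "standard_value": None},      # missing activity
    {"canonical_smiles": "CCO", "standard_value": "900.0"},   # duplicate structure
    {"canonical_smiles": "c1ccccc1", "standard_value": "50000.0"},
]


def clean(rows):
    """Drop rows with missing fields, then keep the first row per SMILES."""
    seen = set()
    kept = []
    for row in rows:
        if row["canonical_smiles"] is None or row["standard_value"] is None:
            continue  # incomplete record
        if row["canonical_smiles"] in seen:
            continue  # a record for this structure was already kept
        seen.add(row["canonical_smiles"])
        kept.append(row)
    return kept


cleaned = clean(records)
print([r["canonical_smiles"] for r in cleaned])
```

Keeping the first record discards any other measurements reported for the same compound. Aggregating repeated measurements (for example, taking their median) is a possible alternative, but it is not what this notebook does.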


In [12]:
# Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a new DataFrame; the bioactivity class is added in a later step
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']  # required columns
combined_bioactivity_df = df2_new[selection]
combined_bioactivity_df

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.0
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.0
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.0
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.0
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.0
...,...,...,...
8825,CHEMBL5219841,COc1cccc2c1CCCC2NS(=O)(=O)NC(=O)OCc1ccccc1,0.209
8827,CHEMBL5219046,CC[C@@]1(c2cccc(OC(=O)Nc3ccccc3)c2)CCCCN(C)C1,274.0
8828,CHEMBL5219594,O=c1[nH]c2ccc(OCc3ccc(F)cc3)cc2c(=O)o1,76200.0
8829,CHEMBL5219958,CC(C)c1ccc(COc2ccc3[nH]c(=O)oc(=O)c3c2)cc1,55000.0


In [20]:
#Save dataframe to CSV file
combined_bioactivity_df.to_csv('acetylcholinesterase_02_bioactivity_data_preprocessed.csv', index=False)

In [21]:
combined_bioactivity_df2 = pd.read_csv('acetylcholinesterase_02_bioactivity_data_preprocessed.csv')

In [22]:
# Label compounds as active (IC50 <= 1000 nM), inactive (IC50 >= 10000 nM), or intermediate (in between)
bioactivity_threshold = []
for i in combined_bioactivity_df2.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [24]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
combined_bioactivity_df3 = pd.concat([combined_bioactivity_df2, bioactivity_class], axis=1)
combined_bioactivity_df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.000,active
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.000,active
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.000,inactive
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.000,active
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.000,active
...,...,...,...,...
6152,CHEMBL5219841,COc1cccc2c1CCCC2NS(=O)(=O)NC(=O)OCc1ccccc1,0.209,active
6153,CHEMBL5219046,CC[C@@]1(c2cccc(OC(=O)Nc3ccccc3)c2)CCCCN(C)C1,274.000,active
6154,CHEMBL5219594,O=c1[nH]c2ccc(OCc3ccc(F)cc3)cc2c(=O)o1,76200.000,inactive
6155,CHEMBL5219958,CC(C)c1ccc(COc2ccc3[nH]c(=O)oc(=O)c3c2)cc1,55000.000,inactive
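
The labelling rule used above can also be written as a small reusable function, which makes the boundary behaviour easy to check. This is a sketch under the assumption that standard_value is in nM; the function name `classify_ic50` and its parameter names are illustrative.

```python
def classify_ic50(ic50_nm, active_max=1000.0, inactive_min=10000.0):
    """Label an IC50 (in nM) using the same thresholds as the loop above."""
    value = float(ic50_nm)
    if value >= inactive_min:
        return "inactive"
    if value <= active_max:
        return "active"
    return "intermediate"


for value in (750.0, 5000.0, 50000.0):
    print(value, classify_ic50(value))
```

Both thresholds are inclusive: exactly 1000 nM is labelled active and exactly 10000 nM is labelled inactive. With pandas, the same function could be applied column-wise, for example with `Series.apply`.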


In [25]:
#Save dataframe to CSV file
combined_bioactivity_df3.to_csv('acetylcholinesterase_03_bioactivity_data_curated.csv', index=False)

In [26]:
! zip acetylcholinesterase.zip *.csv

updating: acetylcholinesterase_02_bioactivity_data_preprocessed.csv (deflated 80%)
updating: bioactivity_data_raw.csv (deflated 91%)
  adding: acetylcholinesterase_03_bioactivity_data_curated.csv (deflated 82%)


In [27]:
df = pd.read_csv('acetylcholinesterase_03_bioactivity_data_curated.csv')
df.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.0,active
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.0,active
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.0,inactive
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.0,active
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.0,active


In [28]:
df_no_smiles = df.drop(columns='canonical_smiles')
df_no_smiles

Unnamed: 0,molecule_chembl_id,standard_value,class
0,CHEMBL133897,750.000,active
1,CHEMBL336398,100.000,active
2,CHEMBL131588,50000.000,inactive
3,CHEMBL130628,300.000,active
4,CHEMBL130478,800.000,active
...,...,...,...
6152,CHEMBL5219841,0.209,active
6153,CHEMBL5219046,274.000,active
6154,CHEMBL5219594,76200.000,inactive
6155,CHEMBL5219958,55000.000,inactive


In [29]:
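# Keep only the longest dot-separated SMILES fragment of each compound,
# which drops shorter components such as salts and counterions
smiles = []

for i in df.canonical_smiles.tolist():
  cpd = str(i).split('.')
  cpd_longest = max(cpd, key = len)
  smiles.append(cpd_longest)

smiles = pd.Series(smiles, name = 'canonical_smiles')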
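
The loop above treats each dot-separated part of a SMILES string as a separate fragment and keeps the longest one. A minimal standalone sketch of that rule follows; the function name `largest_fragment` and the example strings are illustrative.

```python
def largest_fragment(smiles):
    """Return the longest dot-separated component of a SMILES string."""
    return max(str(smiles).split("."), key=len)


print(largest_fragment("CCN(CC)CC.Cl"))       # amine with a chloride counterion
print(largest_fragment("[Na+].CC(=O)[O-]"))   # sodium acetate
print(largest_fragment("CCCC.[Na+]"))         # string length is only a proxy
```

String length is a proxy for fragment size, not an atom count. In the last example the rule keeps `[Na+]` (5 characters, one atom) over `CCCC` (4 characters, four atoms), so very small parent molecules with bracketed counterions may need a check based on atom counts instead.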
