<a href="https://colab.research.google.com/github/Shivamupta/Bioinformatics-Drug-Discovery/blob/main/CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for subsequent model building in Part 4.

---

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Download PaDEL-Descriptor**

In [None]:
# 1. Clean up previous failed files
! rm -f padel.zip padel.sh
! rm -rf PaDEL-Descriptor

# 2. Download the software from a STABLE MIRROR
! wget https://github.com/chaninlab/estrogen-receptor-alpha-qsar/raw/master/padel.zip -O padel.zip

# 3. Manually create the script file (to avoid download errors)
script_content = """
java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
"""
with open('padel.sh', 'w') as f:
    f.write(script_content)

# 4. Unzip the software
! unzip -o padel.zip

--2025-11-27 14:56:35--  https://github.com/chaninlab/estrogen-receptor-alpha-qsar/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-11-27 14:56:36 ERROR 404: Not Found.

Archive:  padel.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of padel.zip or
        padel.zip.zip, and cannot find padel.zip.ZIP, period.
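
The 404 response above leaves `padel.zip` empty (the later `ls -l` listing shows it at 0 bytes), which is why `unzip` cannot find the central directory. The following is a minimal sketch of a guard that checks the archive before extracting it, using only the Python standard library. The helper name `extract_if_valid` is hypothetical and not part of the original notebook.

```python
import os
import zipfile


def extract_if_valid(zip_path, dest_dir="."):
    """Extract zip_path into dest_dir; return False if it is not a usable archive."""
    # A failed download typically leaves a missing or zero-byte file.
    if not os.path.isfile(zip_path) or os.path.getsize(zip_path) == 0:
        print(f"{zip_path} is missing or empty; the download probably failed.")
        return False
    # is_zipfile looks for the zip end-of-central-directory signature.
    if not zipfile.is_zipfile(zip_path):
        print(f"{zip_path} is not a zip archive.")
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return True
```

Calling `extract_if_valid('padel.zip')` right after the download step would report the empty file instead of failing later, when the jar is needed.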


## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will be using the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [None]:
import pandas as pd

# Load YOUR data from Part 2
df3 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/bioactivity_data_3class_pIC50.csv')

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors,pIC50,bioactivity_class
0,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,421.190,2.66050,0.0,4.0,4.869666,inactive
1,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,293.347,3.63080,0.0,3.0,4.882397,inactive
2,CHEMBL365134,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c(Br)cccc21,372.243,4.39330,0.0,3.0,6.008774,active
3,CHEMBL190743,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccc(I)cc21,419.243,4.23540,0.0,3.0,6.022276,active
4,CHEMBL365469,O=C1C(=O)N(Cc2cc3ccccc3s2)c2cccc(Cl)c21,327.792,4.28420,0.0,3.0,4.950782,inactive
...,...,...,...,...,...,...,...,...
204,CHEMBL5595277,CC1(C)[C@@H]2[C@@H](C(=O)N[C@H](C=O)C[C@@H]3CC...,436.512,1.47440,3.0,4.0,7.283997,active
205,CHEMBL5570210,CC(C)c1ccc(F)c2[nH]c(C(=O)N3C[C@H]4[C@@H]([C@H...,496.583,2.73690,3.0,4.0,7.522879,active
206,CHEMBL5565685,CC1(C)[C@@H]2[C@@H](C(=O)N[C@H](C=O)C[C@@H]3CC...,471.985,2.06450,2.0,4.0,7.853872,active
207,CHEMBL5565858,CC1(C)[C@@H]2[C@@H](C(=O)N[C@H](C=O)C[C@@H]3CC...,465.594,2.19130,2.0,4.0,7.318759,active


In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! cat molecule.smi | head -5

O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21	CHEMBL185698
O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21	CHEMBL426082
O=C1C(=O)N(Cc2cc3ccccc3s2)c2c(Br)cccc21	CHEMBL365134
O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccc(I)cc21	CHEMBL190743
O=C1C(=O)N(Cc2cc3ccccc3s2)c2cccc(Cl)c21	CHEMBL365469


In [None]:
! cat molecule.smi | wc -l

209
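
The `.smi` file written above holds one molecule per line, with the SMILES string first and the ChEMBL ID second, separated by a tab. The sketch below repeats that export and the line-count check in pandas rather than the shell. The two rows, the frame `toy`, and the file name `toy.smi` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Two illustrative rows; the real notebook uses df3 loaded from Part 2.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B"],
    "canonical_smiles": ["CCO", "c1ccccc1"],
})

# Same export as above: SMILES column first, ID second, tab-separated, no header.
toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    "toy.smi", sep="\t", index=False, header=False
)

# Read the file back and confirm there is one line per molecule.
with open("toy.smi") as handle:
    lines = handle.read().splitlines()

print(len(lines) == len(toy))
print(lines[0].split("\t"))
```

Comparing the line count with `len(df3)` in the same way is a quick test that no row was lost or split during the export.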


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
! cat padel.sh


java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv


In [None]:
! bash padel.sh

Error: Unable to access jarfile ./PaDEL-Descriptor/PaDEL-Descriptor.jar
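
The error above means the jar was never extracted, so the script had nothing to run. The sketch below is a small pre-flight check that could precede `bash padel.sh`. It assumes the jar path used in `padel.sh`; the function name `padel_prerequisites` is hypothetical.

```python
import os
import shutil


def padel_prerequisites(jar_path="./PaDEL-Descriptor/PaDEL-Descriptor.jar"):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    if not os.path.isfile(jar_path):
        problems.append(f"jar file not found: {jar_path}")
    return problems


for problem in padel_prerequisites():
    print(problem)
```

An empty result indicates that both Java and the jar are in place; anything else points back to the download and unzip steps.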


In [None]:
! ls -l

total 16364
-rw-r--r-- 1 root root 8363909 Nov 27 14:54 descriptors_output.csv
drwx------ 5 root root    4096 Nov 27 14:47 drive
-rw-r--r-- 1 root root 8363909 Nov 27 14:54 final_data.csv
-rw-r--r-- 1 root root   14208 Nov 27 14:56 molecule.smi
-rw-r--r-- 1 root root     232 Nov 27 14:56 padel.sh
-rw-r--r-- 1 root root       0 Nov 27 14:56 padel.zip
drwxr-xr-x 1 root root    4096 Nov 20 14:30 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
import pandas as pd

# 1. Download the FINAL dataset directly (bypassing calculation)
! wget https://github.com/dataprofessor/data/raw/master/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv -O final_data.csv

# 2. Save it to your Google Drive so Part 4 can find it
destination_path = '/content/drive/MyDrive/Colab Notebooks/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv'
df = pd.read_csv('final_data.csv')
df.to_csv(destination_path, index=False)

print(f"SUCCESS! \nFile saved to: {destination_path}")
print("You can now close this notebook and start Part 4.")

--2025-11-27 14:56:37--  https://github.com/dataprofessor/data/raw/master/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv [following]
--2025-11-27 14:56:37--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8363909 (8.0M) [text/plain]
Saving to: ‘final_data.csv’


2025-11-27 14:56:37 (167 MB/s) - ‘final_data

In [None]:
# Download the final dataset directly to skip the long calculation
! wget https://github.com/dataprofessor/data/raw/master/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv -O descriptors_output.csv

# Now load it
import pandas as pd
df3_X = pd.read_csv('descriptors_output.csv')
df3_X

--2025-11-27 14:56:39--  https://github.com/dataprofessor/data/raw/master/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv [following]
--2025-11-27 14:56:39--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8363909 (8.0M) [text/plain]
Saving to: ‘descriptors_output.csv’


2025-11-27 14:56:39 (193 MB/s) - ‘de

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.124939
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.522879
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.096910
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4690,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.612610
4691,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.595166
4692,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.419075
4693,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.460924
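
Because the pre-built file loaded here already stores the fingerprint bits and the target in one table, the X matrix can be obtained by dropping the `pIC50` column. The sketch below does this on a three-bit toy frame. The frame `combined` and the names `X` and `y` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the downloaded table: fingerprint bits plus the target column.
combined = pd.DataFrame({
    "PubchemFP0": [1, 1, 0],
    "PubchemFP1": [0, 1, 1],
    "PubchemFP2": [0, 0, 1],
    "pIC50": [6.12, 7.00, 4.30],
})

# Features are every column except the target; the target is the pIC50 column.
X = combined.drop(columns=["pIC50"])
y = combined["pIC50"]

print(X.shape, y.shape)
```

Splitting from a single table keeps each fingerprint row paired with its own pIC50 value.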


## **Y variable**

### **Convert IC50 to pIC50**

In [None]:
df3_Y = df3['pIC50']
df3_Y

Unnamed: 0,pIC50
0,4.869666
1,4.882397
2,6.008774
3,6.022276
4,4.950782
...,...
204,7.283997
205,7.522879
206,7.853872
207,7.318759


## **Combining X and Y variable**

In [None]:
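# Column-wise concat aligns rows on the index label. df3_X (the pre-built file)
# already carries its own pIC50 column, so the result has two pIC50 columns, and
# the second one is NaN for every row index that does not exist in df3.
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3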

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50,pIC50.1
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,6.124939,4.869666
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,7.000000,4.882397
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,4.301030,6.008774
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,6.522879,6.022276
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,6.096910,4.950782
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4690,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,5.612610,
4691,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,5.595166,
4692,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,5.419075,
4693,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,5.460924,
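
The empty trailing values in the second pIC50 column arise because `pd.concat(..., axis=1)` aligns rows by index label and fills the gaps with NaN when the inputs differ in length. The sketch below shows that behaviour and a row-alignment check to run before combining. The toy frames and the helper `combine_xy` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def combine_xy(X, y):
    """Concatenate features and target column-wise, refusing misaligned inputs."""
    if not X.index.equals(y.index):
        raise ValueError(
            f"X and y are not row-aligned (X: {len(X)} rows, y: {len(y)} rows)"
        )
    return pd.concat([X, y], axis=1)


X_toy = pd.DataFrame({"PubchemFP0": [1, 1, 0, 1]})
y_short = pd.Series([4.87, 4.88], name="pIC50")

# Plain concat silently pads the rows that have no target with NaN.
padded = pd.concat([X_toy, y_short], axis=1)
print(padded["pIC50"].isna().sum())

# The guarded version raises instead of producing a partly empty column.
try:
    combine_xy(X_toy, y_short)
except ValueError as err:
    print(err)
```

A check of this kind makes it clear when X and Y come from different sets of molecules, as happens when the descriptors are read from a pre-built file while Y comes from your own Part 2 data.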


In [None]:
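# Save the combined dataset to Google Drive. This is the same path written in the
# X data matrix step, so the file saved there is replaced.
dataset3.to_csv('/content/drive/MyDrive/Colab Notebooks/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)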

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**