<a href="https://colab.research.google.com/github/Dannie55/AI_and_Drug_Discovery_Course_2026/blob/main/QSAR_Part_3_Pubchem_Fingerprint_Calculation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **AI And Biotechnology/Bioinformatics**

## **AI and Drug Discovery Course: QSAR Modeling**
This notebook calculates PaDEL molecular fingerprints for the preprocessed ChEMBL bioactivity data and assembles the dataset used for QSAR modeling.

# **Part 3: Descriptor Calculation**

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


PaDELPy is a Python wrapper for PaDEL-Descriptor, a program that calculates molecular descriptors and fingerprints.

It provides the following descriptors and fingerprints:
* 1444 - 1D and 2D Descriptors
* 431 - 3D Descriptors
* 881 bits - PubChem Fingerprints (one of 12 available fingerprint types)

## **Install PaDELPy**

In [4]:
!pip install padelpy



## **Import libraries**

In [5]:
import pandas as pd
import numpy as np
from google.colab import files
from padelpy import padeldescriptor

## **Load dataset**

In [33]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/bioactivity_with_descriptors.csv')
df.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class,ROMol,MW,logP,NumHacceptors,NumHdonors
0,CHEMBL191334,COc1ccc(C2=N[C@@H](c3ccc(Cl)cc3)[C@@H](c3ccc(C...,1390.0,intermediate,<rdkit.Chem.rdchem.Mol object at 0x7834426e6b20>,580.164411,5.8858,5,1
1,CHEMBL583413,Cc1ccc2c(c1)C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)...,3000.0,intermediate,<rdkit.Chem.rdchem.Mol object at 0x7834487d2ff0>,562.000385,6.69572,4,0
2,CHEMBL584512,Cc1cc2c(cc1F)OC(c1ccc(Br)cc1)C1=C2N(C)c2ncnn2C...,11400.0,inactive,<rdkit.Chem.rdchem.Mol object at 0x7834487d2c00>,579.990964,6.83482,4,0
3,CHEMBL567403,CN1C2=C(C(c3ccc(Br)cc3)Oc3ccc(F)cc32)C(c2ccc(B...,44300.0,inactive,<rdkit.Chem.rdchem.Mol object at 0x7834487d2f10>,565.975314,6.5264,4,0
4,CHEMBL583626,CCOc1cccc2c1C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)...,6500.0,intermediate,<rdkit.Chem.rdchem.Mol object at 0x7834487d2f80>,592.01095,6.786,5,0


In [34]:
data = df[['canonical_smiles', 'molecule_chembl_id']]
data.head()

Unnamed: 0,canonical_smiles,molecule_chembl_id
0,COc1ccc(C2=N[C@@H](c3ccc(Cl)cc3)[C@@H](c3ccc(C...,CHEMBL191334
1,Cc1ccc2c(c1)C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)...,CHEMBL583413
2,Cc1cc2c(cc1F)OC(c1ccc(Br)cc1)C1=C2N(C)c2ncnn2C...,CHEMBL584512
3,CN1C2=C(C(c3ccc(Br)cc3)Oc3ccc(F)cc32)C(c2ccc(B...,CHEMBL567403
4,CCOc1cccc2c1C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)...,CHEMBL583626


## **Convert to .smi format**

In [36]:
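# Write one SMILES string per line (no header, no index).
# to_csv returns None when given a path, so the result is not assigned.
data['canonical_smiles'].to_csv('smiles_chembl.smi', index=False, header=False)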

In [37]:
!head smiles_chembl.smi

COc1ccc(C2=N[C@@H](c3ccc(Cl)cc3)[C@@H](c3ccc(Cl)cc3)N2C(=O)N2CCNC(=O)C2)c(OC(C)C)c1
Cc1ccc2c(c1)C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)cc2)n2ncnc2N1C
Cc1cc2c(cc1F)OC(c1ccc(Br)cc1)C1=C2N(C)c2ncnn2C1c1ccc(Br)cc1
CN1C2=C(C(c3ccc(Br)cc3)Oc3ccc(F)cc32)C(c2ccc(Br)cc2)n2ncnc21
CCOc1cccc2c1C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)cc2)n2ncnc2N1C
CN1C2=C(C(c3ccc(Cl)cc3)Oc3ccccc32)C(c2ccc(Br)cc2)n2ncnc21
CN1C2=C(C(c3ccc(Cl)cc3)Oc3ccccc32)C(c2ccc(Cl)cc2)n2ncnc21
COc1ccc(C2=N[C@@H](c3ccc(Cl)cc3)[C@@H](c3ccc(Cl)cc3)N2C(=O)N2CCNC(=O)C2)c(OC(C)C)c1
Cc1ccc2c(c1)C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)cc2)n2ncnc2N1C
CC[C@@H](C(=O)OC(C)(C)C)N1C(=O)[C@@](C)(CC(=O)O)C[C@H](c2cccc(Cl)c2)[C@H]1c1ccc(Cl)cc1
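
The file above contains only SMILES strings, so PaDEL has no identifiers to attach to each molecule and labels its output rows with auto-generated names (the `AUTOGEN_...` values in the `Name` column). A common alternative, sketched here under the assumption that PaDEL reads the second whitespace-separated field of a `.smi` line as the molecule name, is to write the ChEMBL ID next to each SMILES string. The helper `write_smi` and the file name `smiles_with_ids.smi` are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def write_smi(records, path):
    """Write (smiles, molecule_id) pairs as tab-separated lines, one molecule per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for smiles, mol_id in records:
            handle.write(f"{smiles}\t{mol_id}\n")

# Two records copied from the dataset shown above
records = [
    ("Cc1ccc2c(c1)C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)cc2)n2ncnc2N1C", "CHEMBL583413"),
    ("CN1C2=C(C(c3ccc(Br)cc3)Oc3ccc(F)cc32)C(c2ccc(Br)cc2)n2ncnc21", "CHEMBL567403"),
]

# Hypothetical output location; a temp directory keeps the sketch side-effect free
smi_path = os.path.join(tempfile.gettempdir(), "smiles_with_ids.smi")
write_smi(records, smi_path)

# Read the file back to confirm the layout: SMILES, tab, ID
with open(smi_path, encoding="utf-8") as handle:
    smi_lines = handle.read().splitlines()
print(smi_lines[0])
```

With the notebook's own `data` frame, which already lists the SMILES column first, the pandas equivalent would be `data.to_csv('smiles_with_ids.smi', sep='\t', index=False, header=False)`.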


## **Calculate PubChem fingerprints using the `padeldescriptor` function**


In [38]:
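padeldescriptor(mol_dir='smiles_chembl.smi',
                d_file='pubchem_fingerprints.csv',
                fingerprints=True,   # no descriptortypes XML is given, so PubChem fingerprints are computed
                retainorder=True,    # keep output rows in the same order as the input file
                # removesalt=True, standardizetautomers=True, standardizenitro=True
                )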

In [39]:
!ls -lh pubchem_fingerprints.csv

-rw-r--r-- 1 root root 62K Feb 20 01:22 pubchem_fingerprints.csv


In [40]:
df_fingerprint = pd.read_csv("pubchem_fingerprints.csv")
df_fingerprint.head()

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,AUTOGEN_smiles_chembl_1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,AUTOGEN_smiles_chembl_2,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,AUTOGEN_smiles_chembl_3,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,AUTOGEN_smiles_chembl_4,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,AUTOGEN_smiles_chembl_5,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
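
Because the input file held only SMILES strings, the `Name` column contains placeholders such as `AUTOGEN_smiles_chembl_1`. With `retainorder=True` the output rows follow the input order, so the ChEMBL IDs can be restored by position. The sketch below illustrates the idea on a small in-memory stand-in for the CSV; the helper name `restore_names` and the three-bit sample are hypothetical, and the approach assumes that no rows were added or lost during the calculation.

```python
import csv
import io

def restore_names(fingerprint_rows, molecule_ids):
    """Replace placeholder names with molecule IDs, matching rows by position."""
    if len(fingerprint_rows) != len(molecule_ids):
        raise ValueError("row count differs from ID count; positional matching is unsafe")
    return [dict(row, Name=mol_id) for row, mol_id in zip(fingerprint_rows, molecule_ids)]

# Stand-in for pubchem_fingerprints.csv, truncated to three bits for brevity
sample_csv = """Name,PubchemFP0,PubchemFP1,PubchemFP2
AUTOGEN_smiles_chembl_1,1,1,1
AUTOGEN_smiles_chembl_2,1,1,1
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
ids = ["CHEMBL191334", "CHEMBL583413"]

named_rows = restore_names(rows, ids)

# Sanity check: every fingerprint value should be a 0/1 bit
all_binary = all(
    value in {"0", "1"}
    for row in named_rows
    for key, value in row.items()
    if key != "Name"
)
print(named_rows[0]["Name"], all_binary)
```

In the notebook itself, the equivalent step would be `df_fingerprint['Name'] = data['molecule_chembl_id'].to_numpy()`, again assuming both tables have the same number of rows.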


## **Prepare Dataset for ML**

In [41]:
df.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class,ROMol,MW,logP,NumHacceptors,NumHdonors
0,CHEMBL191334,COc1ccc(C2=N[C@@H](c3ccc(Cl)cc3)[C@@H](c3ccc(C...,1390.0,intermediate,<rdkit.Chem.rdchem.Mol object at 0x7834426e6b20>,580.164411,5.8858,5,1
1,CHEMBL583413,Cc1ccc2c(c1)C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)...,3000.0,intermediate,<rdkit.Chem.rdchem.Mol object at 0x7834487d2ff0>,562.000385,6.69572,4,0
2,CHEMBL584512,Cc1cc2c(cc1F)OC(c1ccc(Br)cc1)C1=C2N(C)c2ncnn2C...,11400.0,inactive,<rdkit.Chem.rdchem.Mol object at 0x7834487d2c00>,579.990964,6.83482,4,0
3,CHEMBL567403,CN1C2=C(C(c3ccc(Br)cc3)Oc3ccc(F)cc32)C(c2ccc(B...,44300.0,inactive,<rdkit.Chem.rdchem.Mol object at 0x7834487d2f10>,565.975314,6.5264,4,0
4,CHEMBL583626,CCOc1cccc2c1C1=C(C(c3ccc(Br)cc3)O2)C(c2ccc(Br)...,6500.0,intermediate,<rdkit.Chem.rdchem.Mol object at 0x7834487d2f80>,592.01095,6.786,5,0


In [44]:
# Convert standard_value (IC50 in nM) to pIC50
# First, ensure standard_value is numeric; unparseable entries become NaN
df['standard_value'] = pd.to_numeric(df['standard_value'], errors='coerce')

# Keep only positive values: NaN, zero, and negative values cannot be log-transformed.
# The fingerprints were computed for every row of df in the same order, so the
# same mask is applied to both tables to keep them aligned.
valid = (df['standard_value'] > 0).to_numpy()
df = df[valid].copy()
df_fingerprint = df_fingerprint[valid]

# Convert nM to M and then calculate pIC50
df['pIC50'] = -np.log10(df['standard_value'] * 1e-9)

# Select only the columns we need for ML
meta_cols = df[['molecule_chembl_id', 'bioactivity_class', 'pIC50']]

# Reset index so that concat pairs rows by position
meta_cols = meta_cols.reset_index(drop=True)
df_fingerprint = df_fingerprint.reset_index(drop=True)

# Combine metadata with fingerprints, dropping the 'Name' column of placeholder IDs
combined_df = pd.concat([meta_cols, df_fingerprint.drop(columns='Name')], axis=1)

# Inspect the first few rows
combined_df.head()

Unnamed: 0,molecule_chembl_id,bioactivity_class,pIC50,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL191334,intermediate,5.856985,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL583413,intermediate,5.522879,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL584512,inactive,4.943095,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL567403,inactive,4.353596,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL583626,intermediate,5.187087,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
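
The conversion follows from the definition of pIC50 as the negative base-10 logarithm of the IC50 expressed in mol/L; for a value given in nM this is the same as `9 - log10(standard_value)`. A minimal worked check against two rows of the table above (the function name `pic50_from_nm` is hypothetical):

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 (negative log10 of the molar value)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive to take a logarithm")
    return -math.log10(ic50_nm * 1e-9)

# standard_value entries taken from the dataset shown earlier
for chembl_id, ic50_nm in [("CHEMBL191334", 1390.0), ("CHEMBL567403", 44300.0)]:
    print(chembl_id, round(pic50_from_nm(ic50_nm), 6))
```

A higher pIC50 means a more potent compound: each tenfold decrease in IC50 raises pIC50 by one unit.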


## **Save and download the dataset**

In [45]:
# Save as CSV
combined_df.to_csv("QSAR_dataset.csv", index=False)
print("Combined dataset saved as QSAR_dataset.csv")

# Download file in Colab
files.download("QSAR_dataset.csv")

Combined dataset saved as QSAR_dataset.csv



# **Calculate other fingerprints**

## **Download XML files from GitHub**

In [46]:
!wget https://github.com/AI-Biotechnology-Bioinformatics/Drug_Discovery_AI_Course_2026/raw/main/padel_descriptors_xml.zip

--2026-02-20 01:24:57--  https://github.com/AI-Biotechnology-Bioinformatics/Drug_Discovery_AI_Course_2026/raw/main/padel_descriptors_xml.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/AI-Biotechnology-Bioinformatics/Drug_Discovery_AI_Course_2026/main/padel_descriptors_xml.zip [following]
--2026-02-20 01:24:57--  https://raw.githubusercontent.com/AI-Biotechnology-Bioinformatics/Drug_Discovery_AI_Course_2026/main/padel_descriptors_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘padel_descriptors_xml.zip’


2026-02-20 01:24:57 (102 MB/s) - ‘pad

## **Unzip all files**

In [47]:
!unzip padel_descriptors_xml.zip

Archive:  padel_descriptors_xml.zip
  inflating: AtomPairs2DFingerprintCount.xml  
  inflating: AtomPairs2DFingerprinter.xml  
  inflating: EStateFingerprinter.xml  
  inflating: ExtendedFingerprinter.xml  
  inflating: Fingerprinter.xml       
  inflating: GraphOnlyFingerprinter.xml  
  inflating: KlekotaRothFingerprintCount.xml  
  inflating: KlekotaRothFingerprinter.xml  
  inflating: MACCSFingerprinter.xml  
  inflating: PubchemFingerprinter.xml  
  inflating: SubstructureFingerprintCount.xml  
  inflating: SubstructureFingerprinter.xml  


## **Calculate Fingerprints**

In [48]:
# Specify the XML file for SubstructureFingerprinter directly
Substruc_fp = "SubstructureFingerprinter.xml"

# Calculate Substructure fingerprints
padeldescriptor(
    mol_dir='smiles_chembl.smi',
    d_file='Substructure_fingerprints.csv',
    fingerprints=True,
    descriptortypes=Substruc_fp,
    retainorder=True
    # removesalt=True, standardizetautomers=True
)
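
The same call pattern extends to the other XML files unpacked above. As a sketch, the snippet below derives an output file name for each fingerprint type and collects the keyword arguments for each run. The helper `build_jobs` and the `<type>_fingerprints.csv` naming scheme are assumptions made for illustration; the keyword names mirror the `padeldescriptor` call used in this notebook.

```python
from pathlib import Path

# File names as listed in the unzip output above
xml_files = [
    "AtomPairs2DFingerprintCount.xml",
    "AtomPairs2DFingerprinter.xml",
    "EStateFingerprinter.xml",
    "ExtendedFingerprinter.xml",
    "Fingerprinter.xml",
    "GraphOnlyFingerprinter.xml",
    "KlekotaRothFingerprintCount.xml",
    "KlekotaRothFingerprinter.xml",
    "MACCSFingerprinter.xml",
    "PubchemFingerprinter.xml",
    "SubstructureFingerprintCount.xml",
    "SubstructureFingerprinter.xml",
]

def build_jobs(xml_paths, smi_file="smiles_chembl.smi"):
    """Return one dict of padeldescriptor keyword arguments per fingerprint XML file."""
    jobs = {}
    for xml_path in xml_paths:
        name = Path(xml_path).stem  # e.g. "MACCSFingerprinter"
        jobs[name] = {
            "mol_dir": smi_file,
            "d_file": f"{name}_fingerprints.csv",  # assumed naming scheme
            "descriptortypes": xml_path,
            "fingerprints": True,
            "retainorder": True,
        }
    return jobs

jobs = build_jobs(xml_files)
print(len(jobs), jobs["MACCSFingerprinter"]["d_file"])

# In the notebook, each job could then be run with:
# for kwargs in jobs.values():
#     padeldescriptor(**kwargs)
```

Keeping the arguments in a dictionary makes it easy to run a single fingerprint type, for example `padeldescriptor(**jobs["MACCSFingerprinter"])`, without repeating the call by hand.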