<a href="https://colab.research.google.com/github/Pavalya-Periyasamy05/Machine-Learning-and-AI/blob/main/Compute_Molecular_Descriptor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Computing Molecular Descriptors and Fingerprints using padelpy**

## **Install padelpy (Python wrapper for PaDEL-Descriptor)**

In [None]:
! pip install padelpy
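PaDEL-Descriptor is a Java program, so padelpy needs a working Java runtime in addition to the Python package. The sketch below is one way to check for it before running any calculation; the helper name `java_available` is hypothetical and not part of padelpy.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable is on PATH and starts correctly."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    try:
        # `java -version` exits with status 0 when the runtime works.
        result = subprocess.run([java_path, "-version"], capture_output=True)
    except OSError:
        return False
    return result.returncode == 0


has_java = java_available()
print("Java found:", has_java)
```

If this prints `Java found: False`, install a Java runtime before calling `padeldescriptor`.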



## **Prepare fingerprint xml**

### PaDEL reads the fingerprint type to calculate from an XML configuration file. The archive below contains one XML file for each of the 12 fingerprint types.

In [None]:
! wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip


--2026-01-01 06:43:41--  https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip [following]
--2026-01-01 06:43:42--  https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘fingerprints_xml.zip.1’


2026-01-01 06:43:42 (100 MB/s) - ‘fingerprints_xml.zip.1’ saved [10871/10871]



In [3]:
! unzip -o fingerprints_xml.zip   # -o overwrites existing files without prompting

Archive:  fingerprints_xml.zip
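Re-running the unzip cell in a directory that already holds the XML files can stall on an interactive prompt. As an alternative, here is a minimal sketch that extracts the archive from Python, where existing files are overwritten silently. The helper name `extract_xml_files` is hypothetical, and the demo uses a small stand-in archive created on the fly rather than the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_xml_files(zip_path, dest_dir="."):
    """Extract every .xml member of the archive, overwriting existing files."""
    with zipfile.ZipFile(zip_path) as archive:
        xml_members = [m for m in archive.namelist() if m.endswith(".xml")]
        archive.extractall(Path(dest_dir), members=xml_members)
    return sorted(xml_members)


# Demo on a throwaway archive so the sketch runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("MACCSFingerprinter.xml", "<Root/>")
        archive.writestr("notes.txt", "not an XML file")
    extracted = extract_xml_files(demo_zip, tmp)
    extracted_exists = (Path(tmp) / "MACCSFingerprinter.xml").exists()

print(extracted)
```

In this notebook the call would be `extract_xml_files("fingerprints_xml.zip")`.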

In [4]:
! ls -l

total 992
-rwxr-xr-x 1 root root   4645 Mar 27  2018 AtomPairs2DFingerprintCount.xml
-rwxr-xr-x 1 root root   4645 Mar 27  2018 AtomPairs2DFingerprinter.xml
-rw-r--r-- 1 root root 520310 Jan  1 05:41 Bioactivity_3Classes_Data.csv
-rwxr-xr-x 1 root root   4645 Mar 27  2018 EStateFingerprinter.xml
-rwxr-xr-x 1 root root   4645 Mar 27  2018 ExtendedFingerprinter.xml
-rwxr-xr-x 1 root root   4645 Mar 27  2018 Fingerprinter.xml
-rw-r--r-- 1 root root  10871 Jan  1 05:11 fingerprints_xml.zip
-rw-r--r-- 1 root root  10871 Jan  1 06:43 fingerprints_xml.zip.1
-rw-r--r-- 1 root root  10871 Jan  1 06:44 fingerprints_xml.zip.2
-rwxr-xr-x 1 root root   4645 Mar 27  2018 GraphOnlyFingerprinter.xml
-rwxr-xr-x 1 root root   4645 Mar 27  2018 KlekotaRothFingerprintCount.xml
-rwxr-xr-x 1 root root   4645 Mar 27  2018 KlekotaRothFingerprinter.xml
-rwxr-xr-x 1 root root   4645 Mar 27  2018 MACCSFingerprinter.xml
-rw-r--r-- 1 root root 331964 Jan  1 05:51 molecule.smi
-rw-r--r-- 1 root root  11348 Jan  1 0

## **List and sort fingerprint XML files**

In [5]:
import glob
xml_files = glob.glob("*.xml")
xml_files.sort()
xml_files

['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']

### Define short names to use as dictionary keys in the next step. Their order must match the sorted `xml_files` list above.

In [6]:
FP_list = ["AtomPairs2DCount",
           "AtomPairs2D",
           "EState",
           "CDKextended",
           "CDK",
           "CDKgraphOnly",
           "KlekotaRothCount",
           "KlekotaRoth",
           "MACCS",
           "Pubchem",
           "SubstructureCount",
           "Substructure"]

## **Create a dictionary**

In [7]:
# zip() pairs elements from two (or more) lists together, element by element.
# dict(zip(...)) turns those pairs into a dictionary: the first list supplies the keys, the second the values.
fp = dict(zip(FP_list, xml_files))
fp

{'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'EState': 'EStateFingerprinter.xml',
 'CDKextended': 'ExtendedFingerprinter.xml',
 'CDK': 'Fingerprinter.xml',
 'CDKgraphOnly': 'GraphOnlyFingerprinter.xml',
 'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
 'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
 'MACCS': 'MACCSFingerprinter.xml',
 'Pubchem': 'PubchemFingerprinter.xml',
 'SubstructureCount': 'SubstructureFingerprintCount.xml',
 'Substructure': 'SubstructureFingerprinter.xml'}
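Note that `zip()` stops at the shorter list without warning, so a missing XML file would silently drop a fingerprint type from the dictionary. The following sketch adds a length check before pairing; the helper name `pair_names_with_files` and the two-item demo lists are hypothetical.

```python
def pair_names_with_files(names, files):
    """Pair short names with sorted XML file names, refusing unequal lengths."""
    if len(names) != len(files):
        raise ValueError(f"{len(names)} names but {len(files)} XML files")
    # Sorting here mirrors xml_files.sort() in the notebook.
    return dict(zip(names, sorted(files)))


demo_names = ["EState", "MACCS"]
demo_files = ["MACCSFingerprinter.xml", "EStateFingerprinter.xml"]
demo_fp = pair_names_with_files(demo_names, demo_files)
print(demo_fp)
```

The check only guards against a length mismatch. The names must still be listed in the same order as the sorted file names.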

### Look up an XML file by its short name

In [8]:
fp["AtomPairs2DCount"] # get the value

'AtomPairs2DFingerprintCount.xml'

## **Load bioactivity data**

In [9]:
import pandas as pd

In [10]:
df_3Classes = pd.read_csv("Bioactivity_3Classes_Data.csv")

In [11]:
df_3Classes.head(2)

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL2396992,Cc1[nH]c2cc(Cl)cc(Cl)c2c1CCN,inactive,243.137,3.28432,2.0,1.0,3.809668
1,CHEMBL3218635,CC(C)[C@@H]1NC(=O)[C@@H](CC(N)=O)NC(=O)[C@H](C...,intermediate,1396.591,-1.92556,17.0,17.0,5.920819


### The `bioactivity_class` column was assigned using pIC50 thresholds of 5 and 6: values greater than 6 are classified as active, values below 5 as inactive, and values in between as intermediate.
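The thresholding can be sketched as a small function. This is an illustration only: the function name `classify_pic50` is hypothetical, and treating values of exactly 5 or 6 as intermediate is an assumption, since the notebook does not show how the boundaries were handled.

```python
def classify_pic50(pic50, inactive_below=5.0, active_above=6.0):
    """Map a pIC50 value to a class label using two thresholds."""
    if pic50 > active_above:
        return "active"
    if pic50 < inactive_below:
        return "inactive"
    # Assumption: boundary values fall into the intermediate class.
    return "intermediate"


# pIC50 values taken from the first rows of the bioactivity table.
example_values = [3.809668, 5.920819, 6.698970]
example_labels = [classify_pic50(v) for v in example_values]
print(example_labels)
```

The first two labels agree with the `bioactivity_class` values shown in `df_3Classes.head(2)`.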

In [12]:
# Sanity check
print(len(df_3Classes))
print(df_3Classes.index.min(), df_3Classes.index.max())

2751
0 2750


## **Prepare the input file for the PaDEL descriptor calculation**

In [13]:
selection = ["canonical_smiles", "molecule_chembl_id"]
df_3Classes_selection = df_3Classes[selection]
# Write a tab-separated SMILES file: SMILES first, molecule ID second, no header or index.
df_3Classes_selection.to_csv("molecule.smi", sep="\t", index=False, header=False)

# Equivalent alternative:
# df_3Classes_selection = pd.concat([df_3Classes["canonical_smiles"], df_3Classes["molecule_chembl_id"]], axis=1)
# df_3Classes_selection.to_csv("molecule.smi", sep="\t", index=False, header=False)



In [14]:
df_3Classes_selection

Unnamed: 0,canonical_smiles,molecule_chembl_id
0,Cc1[nH]c2cc(Cl)cc(Cl)c2c1CCN,CHEMBL2396992
1,CC(C)[C@@H]1NC(=O)[C@@H](CC(N)=O)NC(=O)[C@H](C...,CHEMBL3218635
2,CCCC[C@H]1NC(=O)[C@@H](Cc2ccc3ccccc3c2)NC(=O)[...,CHEMBL3218636
3,CC1(C)COC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CCCNC...,CHEMBL3218637
4,CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H...,CHEMBL3218638
...,...,...
2746,C=CC(=O)N1CC2(CC(n3nc(-c4cccnc4C)c(-c4c(Cl)ccc...,CHEMBL5807729
2747,C=CC(=O)N1CC2(CC(n3nc(-c4cnc(OC)nc4)c(-c4c(Cl)...,CHEMBL6033395
2748,C=CC(=O)N1CC2(CC(n3nc(-c4cccc(=O)n4C)c(-c4c(Cl...,CHEMBL5768373
2749,C=CC(=O)N1CC2(CC(n3nc(-c4cc(C)nn4CCN(C)C)c(-c4...,CHEMBL6000934


### Look at the file using "Bash"

In [15]:
! head -5 molecule.smi

Cc1[nH]c2cc(Cl)cc(Cl)c2c1CCN	CHEMBL2396992
CC(C)[C@@H]1NC(=O)[C@@H](CC(N)=O)NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](Cc2ccc(F)cc2)NC(=O)[C@H](Cc2ccc(O)cc2)NC(=O)[C@H](CCC(N)=O)NC(=O)C[C@@H](CCc2ccccc2)NC(=O)[C@@H]2CCCCN2C(=O)C(=O)C(C)(C)COC1=O	CHEMBL3218635
CCCC[C@H]1NC(=O)[C@@H](Cc2ccc3ccccc3c2)NC(=O)[C@@H](Cc2ccc3ccccc3c2)NC(=O)[C@@H](CCCCN)NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](Cc2ccc(F)cc2)NC(=O)[C@H](CCC(N)=O)NC(=O)C[C@@H](CCc2ccccc2)NC(=O)[C@@H]2CCCCN2C(=O)C(=O)C(C)(C)COC1=O	CHEMBL3218636
CC1(C)COC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CCCNC(=N)N)NC(=O)CNC(=O)[C@H](Cc2ccc(O)cc2)NC(=O)[C@@H](Cc2ccc3ccccc3c2)NC(=O)[C@H](CCCN)NC(=O)[C@H](CCC(N)=O)NC(=O)C[C@@H](CCc2ccccc2)NC(=O)[C@@H]2CCCCN2C(=O)C1=O	CHEMBL3218637
CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](Cc2ccc(O)cc2)NC(=O)[C@H](Cc2ccc(F)cc2)NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H](CCC(N)=O)NC(=O)C[C@@H](CCc2ccccc2)NC(=O)[C@@H]2CCCCN2C(=O)C(=O)C(C)(C)COC1=O	CHEMBL3218638


In [16]:
! cat molecule.smi | wc -l

2751
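The same inspection can be done from Python. The sketch below counts the records and also checks that every line has the `SMILES<TAB>ID` layout; the helper name `check_smi_file` is hypothetical, and the demo writes a two-line stand-in file instead of reading the real `molecule.smi`.

```python
import os
import tempfile


def check_smi_file(path):
    """Return the number of records; raise if a line is not `SMILES<TAB>ID`."""
    count = 0
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2 or not all(field.strip() for field in fields):
                raise ValueError(f"line {line_number}: expected 2 non-empty fields")
            count += 1
    return count


# Demo on a small stand-in file with two records from the table above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.smi")
    with open(demo_path, "w") as handle:
        handle.write("Cc1[nH]c2cc(Cl)cc(Cl)c2c1CCN\tCHEMBL2396992\n")
        handle.write("CCO\tDEMO_ID\n")  # hypothetical second record
    record_count = check_smi_file(demo_path)

print(record_count)
```

For this notebook, `check_smi_file("molecule.smi")` should return the same count as `wc -l`.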


## **Calculate molecular descriptors and fingerprints**

### PaDEL provides 12 fingerprint types. To calculate a different one, set the `descriptortypes` argument to the corresponding XML file from the `fp` dictionary shown above. To calculate all 12, repeat the calculation once per dictionary entry.

In [17]:
fp

{'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'EState': 'EStateFingerprinter.xml',
 'CDKextended': 'ExtendedFingerprinter.xml',
 'CDK': 'Fingerprinter.xml',
 'CDKgraphOnly': 'GraphOnlyFingerprinter.xml',
 'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
 'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
 'MACCS': 'MACCSFingerprinter.xml',
 'Pubchem': 'PubchemFingerprinter.xml',
 'SubstructureCount': 'SubstructureFingerprintCount.xml',
 'Substructure': 'SubstructureFingerprinter.xml'}

In [None]:
fp["Pubchem"]

'PubchemFingerprinter.xml'

In [20]:
from padelpy import padeldescriptor

fingerprint = "Pubchem"

fingerprint_output_file = f"{fingerprint}.csv"   # Resolves to "Pubchem.csv"
fingerprint_descriptortypes = fp[fingerprint]    # Resolves to "PubchemFingerprinter.xml"

padeldescriptor(mol_dir="molecule.smi",                         # Input molecules (here a single SMILES file)
                d_file=fingerprint_output_file,                 # Output CSV for the calculated fingerprints
                descriptortypes=fingerprint_descriptortypes,    # XML file selecting the fingerprint type
                detectaromaticity=True,                         # Detect aromaticity before calculation
                standardizenitro=True,                          # Standardize nitro groups (-NO2) to a consistent form
                standardizetautomers=True,                      # Convert tautomers to a standard form
                threads=2,                                      # Number of threads to use
                removesalt=True,                                # Remove salts/ions from molecules before calculation
                log=True,                                       # Create a log file of the calculation
                fingerprints=True)                              # Calculate fingerprints
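To calculate all 12 fingerprint types rather than one, the same call can be repeated for every entry of the `fp` dictionary. The sketch below is one possible way to do that; the helper name `compute_all_fingerprints` and its `compute` parameter are hypothetical. The `compute` parameter exists only so the loop can be dry-run with a stand-in function, and it defaults to padelpy's `padeldescriptor`.

```python
def compute_all_fingerprints(fp, mol_file, compute=None, **options):
    """Run one fingerprint calculation per dictionary entry.

    Returns a dict mapping each short name to its output CSV file name.
    """
    if compute is None:
        # Imported lazily so the dry run below works without padelpy or Java.
        from padelpy import padeldescriptor
        compute = padeldescriptor

    outputs = {}
    for name, xml_file in fp.items():
        output_file = f"{name}.csv"
        compute(mol_dir=mol_file,
                d_file=output_file,
                descriptortypes=xml_file,
                fingerprints=True,
                **options)
        outputs[name] = output_file
    return outputs


# Dry run: the stand-in only records which jobs would be requested.
recorded_calls = []
demo_outputs = compute_all_fingerprints(
    {"MACCS": "MACCSFingerprinter.xml", "EState": "EStateFingerprinter.xml"},
    "molecule.smi",
    compute=lambda **kwargs: recorded_calls.append(kwargs),
    removesalt=True,
)
print(demo_outputs)
```

For a real run, drop the `compute` argument and pass the options used above, for example `compute_all_fingerprints(fp, "molecule.smi", detectaromaticity=True, standardizenitro=True, standardizetautomers=True, removesalt=True, threads=2, log=True)`. Expect this to take considerably longer than a single fingerprint type.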

## **Load the calculated fingerprints**

In [27]:
df_X_fingerprints = pd.read_csv(fingerprint_output_file)


## **Preparing the X and Y Data Matrices**

## **X data matrix**

### The X data matrix holds the molecular descriptors, which here are the PubChem fingerprints.

In [28]:
df_X_fingerprints

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL2396992,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL3218635,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL3218636,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL3218637,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL3218638,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2746,CHEMBL5807729,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2747,CHEMBL6033395,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2748,CHEMBL5768373,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2749,CHEMBL6000934,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
df_X_fingerprints = df_X_fingerprints.drop(columns=["Name"])
df_X_fingerprints

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2746,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2747,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2748,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2749,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y data matrix**

In [31]:
df_Y = df_3Classes["pIC50"]
df_Y

Unnamed: 0,pIC50
0,3.809668
1,5.920819
2,5.000000
3,5.744727
4,6.698970
...,...
2746,5.638272
2747,5.455932
2748,5.275724
2749,5.853872


## **Combining X and Y variable**

In [32]:
dataset_ML_model = pd.concat([df_X_fingerprints, df_Y], axis=1)
dataset_ML_model

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.809668
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.920819
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.744727
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.698970
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2746,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.638272
2747,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.455932
2748,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.275724
2749,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.853872
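`pd.concat(..., axis=1)` aligns rows by index label, which here is simply the row position, so it assumes the PaDEL output lists the molecules in the same order as `df_3Classes`. A more defensive alternative is to join the two tables on the molecule ID before the `Name` column is dropped. The sketch below illustrates this on toy data; the helper name `combine_by_id` and the toy values are hypothetical, while the column names `Name`, `molecule_chembl_id` and `pIC50` come from the tables above.

```python
import pandas as pd


def combine_by_id(df_fingerprints, df_bioactivity,
                  id_fp="Name", id_bio="molecule_chembl_id", target="pIC50"):
    """Attach the target column to the fingerprints by matching molecule IDs."""
    merged = df_fingerprints.merge(
        df_bioactivity[[id_bio, target]],
        left_on=id_fp,
        right_on=id_bio,
        how="inner",
        validate="one_to_one",  # raises if an ID is repeated on either side
    )
    # Drop both ID columns so only features and the target remain.
    return merged.drop(columns=[id_fp, id_bio])


# Toy tables whose rows are deliberately in a different order.
toy_fp = pd.DataFrame({"Name": ["B", "A"], "PubchemFP0": [0, 1]})
toy_bio = pd.DataFrame({"molecule_chembl_id": ["A", "B"], "pIC50": [6.5, 4.2]})
combined = combine_by_id(toy_fp, toy_bio)
print(combined)
```

With the notebook's data, this would be called on the fingerprint table as loaded from the CSV (still containing `Name`) together with `df_3Classes`. If the dataset contains repeated ChEMBL IDs, the `validate` check will raise and the duplicates need to be resolved first.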


## **Output the dataset to a CSV file**

In [33]:
dataset_ML_model.to_csv("Bioactivity_data_3class_pIC50_pubchem_fp.csv", index=False)