# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**



In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for the model building in Part 4.

---

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **Butyrylcholinesterase_04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [1]:
import pandas as pd

In [2]:
df3 = pd.read_csv('Butyrylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

In [3]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,active,312.325,2.8032,0.0,6.0,-2.963788
1,1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,active,376.913,4.5546,0.0,5.0,-2.954243
2,2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,inactive,426.851,5.3574,0.0,5.0,-4.698970
3,3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,active,404.845,4.7069,0.0,5.0,-3.000000
4,4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,active,346.334,3.0953,0.0,6.0,-2.301030
...,...,...,...,...,...,...,...,...,...
3847,3847,CHEMBL5270598,O=C1CC2CCc3ccsc3C2=NN1CN1CCN(Cc2ccccc2)CC1,intermediate,394.544,3.0222,0.0,5.0,-3.746634
3848,3848,CHEMBL5285633,O=C1CC2CCCc3ccsc3C2=NN1CCCCN1CCN(Cc2ccccc2)CC1,intermediate,450.652,4.2350,0.0,5.0,-3.615950
3849,3849,CHEMBL5283494,O=C1CC2CCc3ccsc3C2=NN1CCN1CCN(Cc2ccccc2)CC1,intermediate,408.571,3.0647,0.0,5.0,-3.708421
3850,3850,CHEMBL5283557,O=C1CC2CCc3ccsc3C2=NN1Cc1ccc(CN2CCN(Cc3ccccc3)...,intermediate,484.669,4.7649,0.0,5.0,-3.986772


In [4]:
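# Keep only the SMILES string and the ChEMBL ID of each compound
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
# Write a tab-separated .smi file (SMILES first, then ID) without header or index
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)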

In [5]:
# Display the top 5 elements of the selected DataFrame
print("Top 5 elements of df3_selection:")
print(df3_selection.head())

Top 5 elements of df3_selection:
                                    canonical_smiles molecule_chembl_id
0              CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1       CHEMBL133897
1         O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1       CHEMBL336398
2  CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1       CHEMBL131588
3      O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F       CHEMBL130628
4          CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C       CHEMBL130478


In [6]:
# Check the number of rows in the DataFrame
num_rows = df3_selection.shape[0] 
print(f"Number of rows in df3_selection: {num_rows}")

Number of rows in df3_selection: 3852


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [9]:
# Specify the path to your .sh file
file_path = 'c:/Users/sumee/Downloads/Padelpubchem/padel.sh'

# Open the file and read its contents
with open(file_path, 'r') as file:
    content = file.read()

# Display the contents of the .sh file
print(content)


java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
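The cell above only prints the contents of `padel.sh`; the command itself still has to be executed to produce `descriptors_output.csv`. As a minimal sketch, the same command could be assembled and launched from Python with `subprocess`. The helper name `build_padel_command` and its parameters are hypothetical, and the sketch assumes that Java is on the `PATH` and that the `PaDEL-Descriptor` folder and `molecule.smi` sit in the working directory.

```python
import subprocess

def build_padel_command(padel_dir="./PaDEL-Descriptor",
                        input_dir="./",
                        output_file="descriptors_output.csv"):
    """Hypothetical helper: return the java command from padel.sh as an argument list."""
    return [
        "java", "-Xms1G", "-Xmx1G", "-Djava.awt.headless=true",
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", f"{padel_dir}/PubchemFingerprinter.xml",
        "-dir", input_dir,
        "-file", output_file,
    ]

cmd = build_padel_command()
print(" ".join(cmd))

# Uncomment to actually run PaDEL (assumption: Java is installed and the paths exist).
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues, and `check=True` would raise an error if PaDEL exits with a non-zero status.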



## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [10]:
df3_X = pd.read_csv("C:/Users/sumee/Downloads/Padelpubchem/descriptors_output.csv")

In [11]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL336538,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL130098,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL106126,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL130478,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL339995,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3847,CHEMBL5288465,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3848,CHEMBL5278371,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3849,CHEMBL5268097,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3850,CHEMBL5271548,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3847,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3848,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3849,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3850,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pIC50 values are already present in the file loaded above, so here we only select the `pIC50` column as the Y variable.

In [13]:
df3_Y = df3['pIC50']
df3_Y

0      -2.963788
1      -2.954243
2      -4.698970
3      -3.000000
4      -2.301030
          ...   
3847   -3.746634
3848   -3.615950
3849   -3.708421
3850   -3.986772
3851   -3.518514
Name: pIC50, Length: 3852, dtype: float64
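All of the pIC50 values shown above are negative. By definition, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so a negative value would correspond to an IC50 above 1 M. One thing that would produce numbers like these is applying `-log10` to IC50 values still expressed in nM (for example, -2.963788 is `-log10` of roughly 920); whether that is what happened in Part 2 cannot be determined from this notebook. The sketch below only illustrates the unit conversion on toy values; the function name `pic50_from_nm` is hypothetical.

```python
import numpy as np
import pandas as pd

def pic50_from_nm(ic50_nm):
    """Hypothetical helper: convert IC50 in nM to pIC50 = -log10(IC50 in M)."""
    ic50_molar = np.asarray(ic50_nm, dtype=float) * 1e-9  # nM -> M
    return -np.log10(ic50_molar)

# Toy IC50 values in nM (illustrative only, not taken from the dataset)
toy_ic50_nm = pd.Series([1000.0, 10.0], name="IC50_nM")
toy_pic50 = pic50_from_nm(toy_ic50_nm)
print(toy_pic50)
```

With the conversion to molar in place, an IC50 of 1000 nM maps to a pIC50 of about 6, and more potent compounds get larger values.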

## **Combining the X and Y variables**

In [14]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-2.963788
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-2.954243
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-4.698970
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-3.000000
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-2.301030
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3847,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-3.746634
3848,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-3.615950
3849,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-3.708421
3850,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,-3.986772
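Note that `pd.concat(..., axis=1)` pairs rows by position. The `Name` column of `descriptors_output.csv` shown earlier starts with CHEMBL336538, while `molecule_chembl_id` in `df3` starts with CHEMBL133897, so the descriptor rows do not appear to be in the same order as the bioactivity rows, and a positional concat may attach a pIC50 value to the wrong fingerprint. A safer pattern is to join on the compound ID before dropping `Name`. The sketch below shows this on toy data; the IDs and values are made up for illustration, and it assumes each ID occurs once in both tables.

```python
import pandas as pd

# Toy stand-ins for df3 (bioactivity data) and the PaDEL output; values are illustrative
bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "pIC50": [5.0, 6.0, 7.0],
})
fp = pd.DataFrame({
    "Name": ["CHEMBL_C", "CHEMBL_A", "CHEMBL_B"],  # different row order than bio
    "PubchemFP0": [1, 0, 1],
    "PubchemFP1": [0, 0, 1],
})

# Join fingerprints to pIC50 by compound ID instead of by row position.
# validate="one_to_one" raises an error if an ID is duplicated on either side.
merged = fp.merge(
    bio[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Drop the ID columns only after the rows have been matched
aligned = merged.drop(columns=["Name", "molecule_chembl_id"])
print(aligned)
```

If the real data contains repeated ChEMBL IDs, the `validate` check will fail, and the duplicates would need to be resolved before merging.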


In [16]:
# Check for NaN values in dataset3
nan_counts = dataset3.isna().sum()
print(nan_counts)


PubchemFP0      0
PubchemFP1      0
PubchemFP2      0
PubchemFP3      0
PubchemFP4      0
               ..
PubchemFP877    0
PubchemFP878    0
PubchemFP879    0
PubchemFP880    0
pIC50           1
Length: 882, dtype: int64


In [17]:
# Find the indices where pIC50 is NaN
nan_indices = dataset3[dataset3['pIC50'].isna()].index
print(nan_indices)


Index([80], dtype='int64')


In [18]:
dataset3.dropna(inplace=True)
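`dropna` with no arguments removes every row that has a missing value in any column. Since the check above found the only missing value in `pIC50`, a slightly more explicit variant restricts the drop to that column and reports how many rows were removed. This is a sketch on toy data, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy dataset with one missing target value (illustrative only)
toy = pd.DataFrame({
    "PubchemFP0": [1, 0, 1],
    "pIC50": [5.2, np.nan, 6.1],
})

rows_before = len(toy)
# Drop rows only where the target is missing, then renumber the index
cleaned = toy.dropna(subset=["pIC50"]).reset_index(drop=True)
print(f"Dropped {rows_before - len(cleaned)} row(s) with missing pIC50")
```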


In [19]:
dataset3.to_csv('Butyrylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 3B (Model Building).**