<a href="https://colab.research.google.com/github/TanmayeeKolli/Drug-Discovery-Model-Ovarian-Cancer/blob/main/Pt3_Drug_Discovery_for_Ovarian_Cancer_Part_3_Descriptor_Calculation_and_Dataset_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Computational Drug Discovery for Ovarian Cancer [Part 3] Descriptor Calculation and Dataset Preparation**

*Tanmayee Kolli*




In **Part 3**, I calculated molecular descriptors that "quantitatively describe compounds in the dataset" (Nantasenamat). The specific molecular descriptors I used are PubChem fingerprints. This gives us a standardized dataset on which we can build a model in Part 4.

Reference: [*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor) by Chanin Nantasenamat

---

## **Download PaDEL-Descriptor**

Here, I downloaded the PaDEL-Descriptor software from the Data Professor GitHub repository. PaDEL-Descriptor calculates molecular descriptors and fingerprints for molecules. The download consists of two files: padel.zip, which contains the software, and padel.sh, a shell script with the command that runs the descriptor calculation.

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip -q
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh -q

In [None]:
! unzip -q -o padel.zip


## **Load bioactivity data**

I loaded the bioactivity data from Part 2, which contains all three bioactivity classes (active, inactive, intermediate) and the pIC50 values that will be used to build a regression model.

In [None]:
import pandas as pd

In [5]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

bdata = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/Drug Discovery Ovarian Cancer project/data/bioactivity_data_with_intermediates_PARP.csv')
bdata.head(4)

Mounted at /content/gdrive/


Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL521686,O=C(c1cc(Cc2n[nH]c(=O)c3ccccc23)ccc1F)N1CCN(C(...,active,434.471,2.3474,1.0,4.0,9.0
1,CHEMBL558845,O=C1NCCc2c1ccc1[nH]cc(CCNC(=O)N3CCNCC3)c21,active,341.415,0.6111,4.0,3.0,7.49485
2,CHEMBL560790,O=C1NCCc2c1ccc1[nH]cc(CCNC(=O)N3CCCNCC3)c21,active,355.442,1.0012,4.0,3.0,7.769551
3,CHEMBL595018,O=C1NCC2c3c(cccc31)CCN2C(=O)Cc1cccnc1,active,307.353,1.4935,1.0,3.0,9.0


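I selected the **canonical_smiles** and **molecule_chembl_id** columns from bdata and stored the subset in df3_selection. The canonical_smiles column encodes the **chemical structure** of each molecule as a text string. The subset is then written without a header to molecule.smi (tab-separated) and Molecule.txt (space-separated).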

In [6]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = bdata[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
df3_selection.to_csv('Molecule.txt', sep=' ', index=False, header=False)

The first five lines of molecule.smi show each SMILES string followed by the ChEMBL ID of that molecule.

In [7]:
! cat molecule.smi | head -5

O=C(c1cc(Cc2n[nH]c(=O)c3ccccc23)ccc1F)N1CCN(C(=O)C2CC2)CC1	CHEMBL521686
O=C1NCCc2c1ccc1[nH]cc(CCNC(=O)N3CCNCC3)c21	CHEMBL558845
O=C1NCCc2c1ccc1[nH]cc(CCNC(=O)N3CCCNCC3)c21	CHEMBL560790
O=C1NCC2c3c(cccc31)CCN2C(=O)Cc1cccnc1	CHEMBL595018
O=C1NCC2c3c(cccc31)CCN2C(=O)CCc1cnccn1	CHEMBL609002
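
The layout shown above is one molecule per line: the SMILES string, a tab, then the identifier. As a minimal sketch, the same `to_csv` arguments can be tried on two rows copied from the output above, writing to an in-memory buffer instead of a file. The names `toy`, `buffer`, and `parsed` are illustrative only and are not part of the notebook.

```python
import io
import pandas as pd

# Two rows copied from the notebook output above; `toy` is an illustrative name.
toy = pd.DataFrame({
    'canonical_smiles': [
        'O=C1NCCc2c1ccc1[nH]cc(CCNC(=O)N3CCNCC3)c21',
        'O=C1NCC2c3c(cccc31)CCN2C(=O)Cc1cccnc1',
    ],
    'molecule_chembl_id': ['CHEMBL558845', 'CHEMBL595018'],
})

# Same to_csv arguments as the notebook cell, but written to a string buffer.
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False, header=False)
smi_lines = buffer.getvalue().splitlines()

# Each line splits into exactly two fields: SMILES first, ChEMBL ID second.
parsed = [line.split('\t') for line in smi_lines]
for smiles, chembl_id in parsed:
    print(chembl_id, '->', smiles)
```

Because SMILES strings contain no tabs, splitting on the tab character recovers both fields unambiguously.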


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

The cell below shows the contents of padel.sh, the command that generates the molecular descriptors, which in this case are PubChem fingerprints. The `-removesalt` and `-standardizenitro` options also clean each structure before the calculation by stripping salts and standardizing nitro groups. Running `bash padel.sh` performs this calculation for all 744 molecules.

In [9]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
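
To see which parts of this command are Java settings and which are PaDEL-Descriptor options, the command line can be tokenized. The sketch below only splits the string printed above; it does not run PaDEL. `split_padel_args` is a hypothetical helper written for this illustration.

```python
import shlex

# The command line exactly as printed by `cat padel.sh` above.
PADEL_COMMAND = (
    "java -Xms1G -Xmx1G -Djava.awt.headless=true "
    "-jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar "
    "-removesalt -standardizenitro -fingerprints "
    "-descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml "
    "-dir ./ -file descriptors_output.csv"
)

def split_padel_args(command):
    """Split the command into JVM options and PaDEL options (hypothetical helper)."""
    tokens = shlex.split(command)
    jar_pos = tokens.index('-jar')
    jvm_options = tokens[1:jar_pos]        # everything between `java` and `-jar`
    padel_options = tokens[jar_pos + 2:]   # everything after the jar path
    return jvm_options, padel_options

jvm_options, padel_options = split_padel_args(PADEL_COMMAND)
print('JVM options:  ', jvm_options)
print('PaDEL options:', padel_options)
```

The JVM options set the initial and maximum heap size to 1 GB and run Java without a display. As I understand the PaDEL-Descriptor command line, `-fingerprints` requests fingerprint calculation, `-descriptortypes` points to the XML file that selects the fingerprint type, `-dir` gives the folder containing the input structures, and `-file` names the output CSV.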


In [11]:
# Install unzip if not available
!apt-get install unzip

# Extract the padel.zip file (-o overwrites without prompting if it was already extracted)
!unzip -o padel.zip

# Run the padel.sh script (it takes no arguments)
!bash padel.sh

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
unzip is already the newest version (6.0-26ubuntu3.2).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDE

## **Preparing the X and Y Data Matrices**

### **X data matrix**

Here I created the X data matrix for our ML model in Part 4. The X data matrix contains our **input features**, where each column represents a different feature. The file descriptors_output.csv contains the PubChem fingerprints for all the molecules. The fingerprints encode the **structural features** of each compound as 1s and 0s, where each bit represents a particular substructure, such as certain atoms, ring systems, or bond patterns. A bit is set to 1 if that substructure is present in the molecule and 0 if it is absent. Each of the 881 bits (PubchemFP0 to PubchemFP880) is one input feature.

In [12]:
df3_X = pd.read_csv('descriptors_output.csv')

In [13]:
df3_X.head(4)

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL558845,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL521686,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL560790,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL595018,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
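
Before using a fingerprint table as model input, it can help to confirm its shape and that every bit really is 0 or 1. The following is a minimal sketch on a toy table with the same layout as descriptors_output.csv, cut down to three bits. The bit values are made up for illustration, and `check_fingerprints` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def check_fingerprints(df, id_column='Name'):
    """Return (n_molecules, n_bits, is_binary) for a fingerprint table.

    Hypothetical helper: assumes every column except `id_column` is a bit.
    """
    bits = df.drop(columns=[id_column])
    is_binary = bool(bits.isin([0, 1]).all().all())
    return len(bits), bits.shape[1], is_binary

# Toy table in the layout of descriptors_output.csv; bit values are invented.
toy_fp = pd.DataFrame({
    'Name': ['CHEMBL558845', 'CHEMBL521686'],
    'PubchemFP0': [1, 1],
    'PubchemFP1': [1, 0],
    'PubchemFP2': [0, 0],
})

n_molecules, n_bits, is_binary = check_fingerprints(toy_fp)
print(n_molecules, n_bits, is_binary)  # 2 3 True
```

Applied to the real table, `n_bits` should come out as 881, matching the columns PubchemFP0 through PubchemFP880.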


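PaDEL does not necessarily write the rows in the same order as the input file: the Name column above starts with CHEMBL558845, while bdata starts with CHEMBL521686. I therefore realigned the rows to the order of bdata and then dropped the Name column to make it easier to build models.

In [14]:
# Put the fingerprint rows in bdata's order, then discard the Name labels
df3_X = df3_X.set_index('Name').reindex(bdata['molecule_chembl_id']).reset_index(drop=True)
df3_X.head(4)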

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The Y variable is the target variable for the ML model: the value we want to predict from the input features in the X data matrix. In this case, the target is the pIC50 value, so the model in Part 4 will be a regression model.

In [16]:
df3_Y = bdata['pIC50']
df3_Y.head(4)

Unnamed: 0,pIC50
0,9.0
1,7.49485
2,7.769551
3,9.0
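
For readers who want the link between IC50 and pIC50 spelled out: pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes IC50 values given in nanomolar. The pIC50 column itself comes from Part 2, and `pic50_from_nM` is a hypothetical helper used only for illustration.

```python
import math

def pic50_from_nM(ic50_nM):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = ic50_nM * 1e-9
    return -math.log10(ic50_molar)

print(round(pic50_from_nM(1.0), 5))   # 9.0
print(round(pic50_from_nM(32.0), 5))  # 7.49485
```

A higher pIC50 therefore means a more potent compound: each additional unit corresponds to a tenfold lower IC50.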


## **Combining X and Y variable**

In [19]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3.head(4)

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.49485
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.769551
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.0
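
Note that `pd.concat(..., axis=1)` pairs rows by index position, not by molecule ID, so the fingerprint rows and the pIC50 values must be in the same order. The toy sketch below shows what happens when they are not, and how a merge on the ID column pairs them correctly. The IDs and pIC50 values are taken from the outputs above, the bit values are made up, and all variable names are illustrative.

```python
import pandas as pd

# IDs and pIC50 values as in bdata.head(); `toy_bdata` is an illustrative name.
toy_bdata = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL521686', 'CHEMBL558845'],
    'pIC50': [9.0, 7.49485],
})
# Descriptor rows in a different order; the bit values are invented.
toy_X = pd.DataFrame({
    'Name': ['CHEMBL558845', 'CHEMBL521686'],
    'PubchemFP0': [1, 0],
})

# Positional concat: row 0 pairs CHEMBL558845's bits with CHEMBL521686's pIC50.
by_position = pd.concat([toy_X, toy_bdata['pIC50']], axis=1)

# Merge on the ID: each fingerprint row gets the pIC50 of its own molecule.
by_id = toy_X.merge(
    toy_bdata, left_on='Name', right_on='molecule_chembl_id', validate='one_to_one'
).drop(columns=['molecule_chembl_id'])

print(by_position)
print(by_id)
```

The `validate='one_to_one'` argument makes the merge raise an error if an ID appears more than once on either side, which is a useful safeguard before the ID column is discarded.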


We will use the combined dataset of X and Y variables in Part 4, so I saved it to a CSV file.

In [18]:
dataset3.to_csv('PARP_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)