# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble these into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [12]:
!pip install wget



In [10]:
import wget

wget.download("https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip")
wget.download("https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh")

HTTPError: HTTP Error 404: Not Found

In [7]:
! unzip padel.zip

unzip:  cannot find or open padel.zip, padel.zip.zip or padel.zip.ZIP.


## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [13]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

zsh:1: command not found: wget
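
The `wget` command-line tool is not installed in this shell, so the cell above fails. As a hedged alternative, the sketch below downloads a file using only the Python standard library. The helper name `download_if_missing` is hypothetical and not part of the original notebook; the example URL is the one from the cell above.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless the file already exists; return its path."""
    path = Path(dest)
    if not path.exists():
        # urlretrieve raises urllib.error.HTTPError for responses such as 404.
        urllib.request.urlretrieve(url, str(path))
    return path


# Example call (needs network access), using the URL from the cell above:
# download_if_missing(
#     "https://raw.githubusercontent.com/dataprofessor/data/master/"
#     "acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv",
#     "acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv",
# )
```

Skipping the download when the file already exists means the cell can be re-run safely without overwriting a local copy.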


In [14]:
import pandas as pd

In [15]:
df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

In [16]:
df3

| Unnamed: 0.1 | Unnamed: 0 | molecule_chembl_id | canonical_smiles | class | MW | LogP | NumHDonors | NumHAcceptors | pIC50 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | CHEMBL463210 | CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl | intermediate | 350.591 | 4.7181 | 0.0 | 5.0 | 5.737549 |
| 1 | 1 | CHEMBL2252723 | CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O | inactive | 455.557 | 6.3177 | 0.0 | 6.0 | 3.947999 |
| 2 | 2 | CHEMBL2252722 | CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O | inactive | 441.53 | 5.9276 | 0.0 | 6.0 | 4.425969 |
| 3 | 3 | CHEMBL2252721 | CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O | intermediate | 427.503 | 5.5375 | 0.0 | 6.0 | 5.346787 |
| 4 | 4 | CHEMBL2252851 | CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O | intermediate | 413.476 | 5.1474 | 0.0 | 6.0 | 5.735182 |
| 5 | 5 | CHEMBL2252850 | CCOP(=O)(OCC)SCCCCCCN1C(=O)c2ccccc2C1=O | intermediate | 399.449 | 4.7573 | 0.0 | 6.0 | 5.419075 |
| 6 | 6 | CHEMBL2252849 | CCOP(=O)(OCC)SCCCCCN1C(=O)c2ccccc2C1=O | inactive | 385.422 | 4.3672 | 0.0 | 6.0 | 4.908685 |
| 7 | 7 | CHEMBL2252848 | CCOP(=O)(OCC)SCCCCN1C(=O)c2ccccc2C1=O | intermediate | 371.395 | 3.9771 | 0.0 | 6.0 | 5.003488 |
| 8 | 8 | CHEMBL2252847 | CCOP(=O)(OCC)SCCCN1C(=O)c2ccccc2C1=O | intermediate | 357.368 | 3.587 | 0.0 | 6.0 | 5.081445 |
| 9 | 9 | CHEMBL2252846 | CCOP(=O)(OCC)SCCCCCCCCCCSP(=O)(OCC)OCC | intermediate | 478.594 | 7.9358 | 0.0 | 8.0 | 5.754487 |


In [17]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [18]:
! cat molecule.smi | head -5

CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl	CHEMBL463210
CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252723
CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252722
CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252721
CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252851


In [26]:
! cat molecule.smi | wc -l

      18
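
The `molecule.smi` file written above holds one molecule per line: the SMILES string, a tab, then the ChEMBL ID. A minimal, portable way to check that layout from Python (instead of `cat`, `head` and `wc`) might look like the sketch below. The function name `read_smi` is hypothetical and introduced only for illustration.

```python
from pathlib import Path


def read_smi(path):
    """Return a list of (smiles, molecule_id) pairs from a tab-separated .smi file."""
    records = []
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            # Each line must hold exactly two non-empty fields: SMILES and ID.
            if len(fields) != 2 or not all(fields):
                raise ValueError(f"line {line_no}: expected SMILES<TAB>ID, got {line!r}")
            records.append((fields[0], fields[1]))
    return records


# Only runs if the file from the previous cell is present in the working directory.
if Path("molecule.smi").exists():
    records = read_smi("molecule.smi")
    print(len(records), "molecules; first record:", records[:1])
```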


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

This step cannot be run here because `padel.sh` is not available: the download at the top of this notebook failed with an HTTP 404 error.

In [35]:
! cat padel.sh

cat: padel.sh: No such file or directory


In [36]:
! bash padel.sh

bash: padel.sh: No such file or directory
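
Because the script is missing, every later cell that depends on its output fails as well. One way to make that dependency explicit is to guard the step, as in the sketch below. The helper `run_script_if_present` is hypothetical, and the sketch assumes `bash` is on the PATH; it does not reproduce the contents of `padel.sh`.

```python
import subprocess
from pathlib import Path


def run_script_if_present(script="padel.sh"):
    """Run a shell script with bash if it exists; return True if it ran successfully."""
    path = Path(script)
    if not path.is_file():
        print(f"Skipping: {path} not found, so descriptors_output.csv will not be produced.")
        return False
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    subprocess.run(["bash", str(path)], check=True)
    return True


ran = run_script_if_present("padel.sh")
```

The returned flag can then be used to skip the X-matrix cells instead of letting them raise `FileNotFoundError` and `NameError`.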


In [37]:
! ls -l

total 33760
-rw-r--r--@ 1 sandrapepkolaj  staff   154554 Oct 15 13:49 CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--@ 1 sandrapepkolaj  staff   269174 Oct 22 12:24 CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb
-rw-r--r--  1 sandrapepkolaj  staff   107989 Oct 22 13:17 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
-rw-r--r--  1 sandrapepkolaj  staff   100076 Oct 15 12:46 CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb
-rw-r--r--  1 sandrapepkolaj  staff   230778 Oct 15 12:46 CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipynb
-rw-r--r--  1 sandrapepkolaj  staff       29 Oct 15 12:46 README.md
-rw-r--r--@ 1 sandrapepkolaj  staff   873617 Oct 15 13:48 acetylcholinesterase.zip
-rw-r--r--@ 1 sandrapepkolaj  staff     9828 Oct 15 13:46 acetylcholinesterase_01_bioactivity_data_raw.csv
-rw-r--r--@ 1 sandrapepkolaj  staff     1093 Oct 15 13:48 acetylcholinesterase_02_bioactivity_data_preproces

## **Preparing the X and Y Data Matrices**

The X-matrix cells below cannot be run because `descriptors_output.csv` is the output of `padel.sh`, which is unavailable.

### **X data matrix**

In [38]:
df3_X = pd.read_csv('descriptors_output.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'descriptors_output.csv'

In [39]:
df3_X

NameError: name 'df3_X' is not defined

In [40]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

NameError: name 'df3_X' is not defined

## **Y variable**

### **Select the pIC50 column**

In [41]:
df3_Y = df3['pIC50']
df3_Y

0     5.737549
1     3.947999
2     4.425969
3     5.346787
4     5.735182
5     5.419075
6     4.908685
7     5.003488
8     5.081445
9     5.754487
10    5.844664
11    5.315155
12    4.991400
13    6.060481
14    4.908685
15    5.093126
16    5.785156
17    1.397940
Name: pIC50, dtype: float64
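
The pIC50 values here are read directly from the pre-processed file; no conversion happens in this notebook. As a reminder of what the column means, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar. The sketch below assumes the IC50 is given in nanomolar (an assumption, since the raw units are not shown here), and the function name `ic50_nm_to_pic50` is hypothetical.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm).
    return 9.0 - math.log10(ic50_nm)


print(ic50_nm_to_pic50(1000.0))  # 1 micromolar, i.e. a pIC50 of about 6
```

A larger pIC50 therefore corresponds to a more potent compound, which is why it is a convenient regression target.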

## **Combining X and Y variables**

In [42]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

NameError: name 'df3_X' is not defined

In [43]:
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

NameError: name 'dataset3' is not defined
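
Since the real descriptor matrix is unavailable, the combining step can only be illustrated on toy data. The sketch below shows the same `pd.concat(..., axis=1)` pattern with an added row-count check. The helper `combine_xy` and the fingerprint columns `FP0` and `FP1` are made up for illustration; the three pIC50 values are copied from the output above.

```python
import pandas as pd


def combine_xy(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Join descriptors and target column-wise after checking that the rows line up."""
    if len(X) != len(y):
        raise ValueError(f"row mismatch: X has {len(X)} rows, y has {len(y)}")
    # pd.concat(axis=1) aligns on the index, so reset both to 0..n-1 first.
    return pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1)


# Toy stand-ins for df3_X and df3_Y (made-up fingerprint bits, real pIC50 values).
toy_X = pd.DataFrame({"FP0": [1, 0, 1], "FP1": [0, 0, 1]})
toy_Y = pd.Series([5.737549, 3.947999, 4.425969], name="pIC50")
toy_dataset = combine_xy(toy_X, toy_Y)
print(toy_dataset)
```

The length check matters because `pd.concat` aligns on the index and would otherwise silently fill unmatched rows with `NaN`.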

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**