# Exploratory Data Analysis

In this section, we add molecular descriptors to the dataset as new features using RDKit and perform some exploratory data analysis.

Import the libraries needed to calculate the descriptors:

In [1]:
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

# Load the dataset produced in the previous notebook
df = pd.read_csv('bioactivity_data.csv')

## Lipinski Descriptors

Lipinski descriptors are the molecular properties used in Lipinski's Rule of Five, a set of empirical rules for evaluating the druglikeness of a compound based on its Absorption, Distribution, Metabolism and Excretion (ADME) profile. The rules predict whether a chemical compound with a certain pharmacological or biological activity has properties that would make it a likely orally active drug in humans. These rules are:

- No more than 5 hydrogen bond donors (the sum of OHs and NHs).
- No more than 10 hydrogen bond acceptors (the sum of Os and Ns).
- A molecular weight less than 500 daltons.
- A partition coefficient (LogP) not greater than 5.
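
As a quick illustration, the four rules can be turned into a small check that counts how many of them a compound breaks. This is a minimal sketch and not part of the original notebook: the function name `count_lipinski_violations` and its argument names are hypothetical, the thresholds are taken directly from the rules listed above, and the example values are made up.

```python
def count_lipinski_violations(mol_weight, log_p, num_h_donors, num_h_acceptors):
    """Count how many of the four rules listed above a compound breaks."""
    violations = 0
    if num_h_donors > 5:        # more than 5 hydrogen bond donors
        violations += 1
    if num_h_acceptors > 10:    # more than 10 hydrogen bond acceptors
        violations += 1
    if mol_weight >= 500:       # molecular weight is not below 500 daltons
        violations += 1
    if log_p > 5:               # LogP is greater than 5
        violations += 1
    return violations


# Illustrative, made-up descriptor values (not taken from the dataset)
print(count_lipinski_violations(350.0, 3.2, 1, 2))   # breaks none of the rules
print(count_lipinski_violations(520.0, 5.6, 1, 2))   # breaks the weight and LogP rules
```

Returning a count rather than a pass/fail flag keeps the information about how far a compound is from satisfying all four rules.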


## Calculate descriptors

In [2]:
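def calculate_lipinski(smiles):
    """Return (molecular weight, LogP, H-bond donors, H-bond acceptors) for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the SMILES cannot be parsed
        return np.nan, np.nan, np.nan, np.nan
    mol_weight = Descriptors.MolWt(mol)
    log_p = Descriptors.MolLogP(mol)  # Wildman-Crippen LogP estimate
    num_h_donors = Descriptors.NumHDonors(mol)
    num_h_acceptors = Descriptors.NumHAcceptors(mol)
    return mol_weight, log_p, num_h_donors, num_h_acceptors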

Add these Lipinski descriptors to the dataset as new features:

In [3]:
df[['mol_weight','log_p','num_h_donors','num_h_acceptors']] = df['canonical_smiles'].apply(lambda x: pd.Series(calculate_lipinski(x)))
df.head(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_level,mol_weight,log_p,num_h_donors,num_h_acceptors
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0,intermediate,311.422,3.3188,1.0,2.0
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0,intermediate,299.461,3.2412,1.0,3.0
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0,intermediate,311.422,3.3188,1.0,2.0
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0,inactive,327.877,3.8331,1.0,2.0
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0,inactive,372.328,3.9422,1.0,2.0
5,CHEMBL563,CC(C(=O)O)c1ccc(-c2ccccc2)c(F)c1,305000.0,inactive,244.265,3.6808,1.0,1.0
6,CHEMBL196279,CC(C(=O)O)c1ccc(-c2ccc(Cl)c(Cl)c2)c(F)c1,75000.0,inactive,313.155,4.9876,1.0,1.0
7,CHEMBL195970,CC(C(=O)O)c1ccc(-c2cc(Cl)cc(Cl)c2)c(F)c1,77000.0,inactive,313.155,4.9876,1.0,1.0
8,CHEMBL195970,CC(C(=O)O)c1ccc(-c2cc(Cl)cc(Cl)c2)c(F)c1,94000.0,inactive,313.155,4.9876,1.0,1.0
9,CHEMBL264006,CC(C(=O)O)c1ccc(-c2ccc(C3CCCCC3)cc2)c(F)c1,21000.0,inactive,326.411,5.7285,1.0,1.0


## Calculate pIC50 values

Convert the IC50 standard value from nM to M by multiplying it by 10**-9, then take the negative base-10 logarithm:

pIC50 = -log10(IC50 in M)

Add the new column to the dataframe:

In [4]:
df['pIC50'] = -np.log10(df['standard_value']*(10**-9))
df.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_level,mol_weight,log_p,num_h_donors,num_h_acceptors,pIC50
0,CHEMBL311039,CC12CCC(C1)C(C)(C)C2NS(=O)(=O)c1ccc(F)cc1,5000.0,intermediate,311.422,3.3188,1.0,2.0,5.30103
1,CHEMBL450926,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1cccs1,2700.0,intermediate,299.461,3.2412,1.0,3.0,5.568636
2,CHEMBL310242,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,1800.0,intermediate,311.422,3.3188,1.0,2.0,5.744727
3,CHEMBL74874,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,11000.0,inactive,327.877,3.8331,1.0,2.0,4.958607
4,CHEMBL75183,CC12CC[C@@H](C1)C(C)(C)[C@@H]2NS(=O)(=O)c1ccc(...,10000.0,inactive,372.328,3.9422,1.0,2.0,5.0
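
The conversion can be checked by hand for a single value. The sketch below uses only the standard library; the helper name `pic50_from_nm` is hypothetical and is not part of the notebook. As an added assumption, it rejects non-positive IC50 values, for which the logarithm is undefined.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in M)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive to take its logarithm")
    ic50_molar = ic50_nm * 1e-9   # nM -> M
    return -math.log10(ic50_molar)


# Row 0 of the table above has standard_value = 5000.0 nM
print(round(pic50_from_nm(5000.0), 5))
```

Rounded to five decimals, the result agrees with the pIC50 of 5.30103 shown for that row.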


Take a look at the summary statistics for pIC50:

In [5]:
df.pIC50.describe()

count    1425.000000
mean        5.711224
std         1.303668
min         2.781202
25%         4.782516
50%         5.320572
75%         6.568636
max         9.522879
Name: pIC50, dtype: float64
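
To read these statistics in the original units, a pIC50 value can be converted back to an IC50 in nM by inverting the formula used above. The helper below is a minimal sketch; the name `ic50_nm_from_pic50` is hypothetical and is not part of the notebook.

```python
def ic50_nm_from_pic50(pic50):
    """Invert pIC50 = -log10(IC50 in M) and return the IC50 in nM."""
    ic50_molar = 10 ** (-pic50)   # M
    return ic50_molar * 1e9       # M -> nM


# Median pIC50 from the summary above, expressed as an IC50 in nM
print(ic50_nm_from_pic50(5.320572))
```

Because pIC50 is a negative logarithm, larger pIC50 values correspond to smaller IC50 values.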