# ADN T002 ~ Molecular Descriptors and Fingerprints

Authors:

* Adnane Aouidate, 2022, Structural Bioinformatics and Chemoinformatics, Institute of Organic and Analytical Chemistry (ICOA), Orléans, France.
* Update, 2023, Ait Melloul Faculty of Applied Sciences, Ibn Zohr University, Agadir, Morocco.

## Aim of this talktorial

In this tutorial, we will learn how to use Python combined with RDKit to calculate different molecular descriptors, using built-in *RDKit* functions as well as third-party software such as *Mordred*.

Definition: a **molecular descriptor (MD)** is “… the final result of a logical and mathematical procedure that transforms chemical information of a molecule, such as structural features, into useful numbers or the result of standardized experiments.”

Utility:

Molecular descriptors can be used in molecular similarity analysis to compare molecules (and/or their properties), which helps to identify new molecules with desired properties and biological activity.
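
One widely used way to turn this kind of comparison into a number is the Tanimoto coefficient between two fingerprints: the number of bits set in both, divided by the number of bits set in either. The following is a minimal sketch in plain Python, assuming each molecule is represented by the set of its "on" bit positions. The two bit sets and the function name `tanimoto` are hypothetical and chosen only for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Hypothetical fingerprints, not derived from real molecules
fp_query = {1, 4, 7, 9, 12}
fp_candidate = {1, 4, 8, 9, 15}

similarity = tanimoto(fp_query, fp_candidate)
print(f"Tanimoto similarity: {similarity:.3f}")
```

For real RDKit bit vectors, `DataStructs.TanimotoSimilarity(fp1, fp2)` computes the same quantity directly.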

## Molecular descriptors
*Similarity can be assessed in many different ways depending on the application (see J. Med. Chem. (2014), 57, 3186-3204):*

* 1D molecular descriptors: Solubility, logP, molecular weight, melting point.
    * Global descriptor: only one value represents the whole molecule
    * Usually do not contain enough information to be applied to machine learning (ML)
    * Can be added to 2D fingerprints to improve molecular encoding for ML
* 2D molecular descriptors: Molecular graphs, paths, fragments, atom environments
    * Detailed representation of individual parts of the molecule
    * Contains many features/bits per molecule called fingerprints
    * Very often used in similarity search and ML
* 3D molecular descriptors: Shape, stereochemistry
    * Less robust than 2D representations because of molecule flexibility (what is the "right" conformation of a molecule?)
* Biological similarity
    * Biological fingerprint, e.g. individual bits represent bioactivity measure against different targets
    * Independent of molecular structure
    * Requires experimental (or predicted) data
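
The idea that global (1D) descriptors can be appended to a 2D fingerprint can be sketched as follows. This is only an illustration under stated assumptions: ethanol is an arbitrary example molecule, and the choice of two descriptors plus MACCS keys is ours, not a recommendation from the text above.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys

mol = Chem.MolFromSmiles("CCO")  # ethanol, chosen only as a small example

# 1D (global) descriptors: one value for the whole molecule
global_desc = np.array([Descriptors.MolWt(mol), Descriptors.MolLogP(mol)])

# 2D fingerprint: many bits per molecule (RDKit MACCS keys are 167 bits long)
maccs_bits = np.array(list(MACCSkeys.GenMACCSKeys(mol)))

# Concatenate both into a single feature vector that could be fed to an ML model
features = np.concatenate([global_desc, maccs_bits])
print(features.shape)
```

In practice the continuous descriptors would usually be scaled before being mixed with binary fingerprint bits.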

In [1]:
import pandas as pd
import numpy as np
from rdkit import Chem,DataStructs
from rdkit.Chem import AllChem, Descriptors, MACCSkeys
from rdkit.Chem.Draw import IPythonConsole
from mordred import Calculator, descriptors


Let's first read the SDF file.

In [47]:
file = 'data-test.sdf'
mols = []
for m in Chem.SDMolSupplier(file):
    if m is not None:
        mols.append(m)

In [48]:
len(mols)

50

### Molecular descriptors using *RDKit*

In [4]:
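desc = []

# Compute a small set of RDKit descriptors for every molecule
for m in mols:
    desc.append([
        Descriptors.MolLogP(m),         # Wildman-Crippen logP
        Descriptors.HeavyAtomCount(m),  # number of heavy atoms
        Descriptors.MolMR(m),           # Wildman-Crippen molar refractivity
        Descriptors.MolWt(m),           # average molecular weight
        Descriptors.NumHAcceptors(m),   # number of H-bond acceptors
        Descriptors.NumHDonors(m),      # number of H-bond donors
    ])
# Column names follow the order above (NHA = H-bond acceptors, NHB = H-bond donors)
columns = ['logp', 'HAC', 'MMR', 'MW', 'NHA', 'NHB']
desc = pd.DataFrame(desc, columns=columns)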

In [5]:
desc

Unnamed: 0,logp,HAC,MMR,MW,NHA,NHB
0,2.1436,12,47.468,156.188,2,0
1,0.3995,10,35.372,138.174,4,0
2,-0.2663,6,25.5104,86.158,0,1
3,-0.4891,17,68.2684,236.403,0,2
4,2.783,14,57.044,180.21,2,0
5,2.1612,25,95.9625,344.519,2,2
6,2.1612,25,95.9625,344.519,2,2
7,-2.9381,56,186.3958,804.88,18,11
8,-5.1139,67,219.0292,967.021,23,14
9,-2.9381,56,186.3958,804.88,18,11
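
Instead of listing descriptor functions by hand, all built-in RDKit descriptors can be computed in one pass. The sketch below relies on `Descriptors._descList`, a list of `(name, function)` pairs. It is underscore-prefixed, so treat it as a convenience that may change between RDKit versions. The three SMILES strings are illustrative and are not taken from `data-test.sdf`.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = ["CCO", "c1ccccc1", "CC(=O)O"]  # illustrative molecules only
example_mols = [Chem.MolFromSmiles(s) for s in smiles]

# One dict per molecule: descriptor name -> computed value
rows = [
    {name: func(m) for name, func in Descriptors._descList}
    for m in example_mols
]
all_desc = pd.DataFrame(rows, index=smiles)

print(all_desc.shape)
print(all_desc[["MolWt", "MolLogP", "NumHDonors"]])
```

The number of columns depends on the installed RDKit version, so it is not hard-coded here.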


### Molecular descriptors using *Mordred*

Let's first read the database that we obtained in the Jupyter notebook ADN_T000 of this series.

In [21]:
df = pd.read_csv("./databases/acetylcholinesterase_Ki_pKi_bioactivity_data_curated.csv", index_col="molecule_chembl_id")

In [22]:
df.head()

molecule_chembl_id,units,Ki,smiles,pKi
CHEMBL11805,nM,0.104,COc1ccccc1CN(C)CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)...,9.982967
CHEMBL208599,nM,0.026,CCC1=CC2Cc3nc4cc(Cl)ccc4c(N)c3[C@@H](C1)C2,10.585027
CHEMBL60745,nM,1.63,CC[N+](C)(C)c1cccc(O)c1.[Br-],8.787812
CHEMBL95,nM,151.0,Nc1c2c(nc3ccccc13)CCCC2,6.821023
CHEMBL173309,nM,12.2,CCN(CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)CCCCCN(CC)C...,7.91364
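
The `pKi` column appears to be the negative base-10 logarithm of Ki expressed in mol/L. This is an assumption inferred from the values rather than something stated in the notebook, but it reproduces the rows shown above. A minimal sketch (the helper name `ki_nm_to_pki` is hypothetical):

```python
import math


def ki_nm_to_pki(ki_nm):
    """Convert a Ki value given in nM to pKi = -log10(Ki in mol/L)."""
    return -math.log10(ki_nm * 1e-9)


# Two rows from the table above: (ChEMBL ID, Ki in nM)
for chembl_id, ki in [("CHEMBL11805", 0.104), ("CHEMBL95", 151.0)]:
    print(chembl_id, round(ki_nm_to_pki(ki), 6))
```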


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 472 entries, 0 to 471
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   molecule_chembl_id  472 non-null    object 
 1   units               472 non-null    object 
 2   Ki                  472 non-null    float64
 3   smiles              472 non-null    object 
 4   pKi                 472 non-null    float64
dtypes: float64(2), object(3)
memory usage: 18.6+ KB


Here, we convert all SMILES strings into RDKit `Mol` objects.

In [10]:
mols = [Chem.MolFromSmiles(smi) for smi in df.smiles]
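
`Chem.MolFromSmiles` returns `None` when a SMILES string cannot be parsed, and a `None` entry would break later descriptor calculations. The sketch below shows one way to separate valid from failed entries. The SMILES list is made up for illustration, with the second entry deliberately invalid.

```python
from rdkit import Chem

# Illustrative input; "not_a_smiles" is deliberately invalid
smiles_list = ["CCO", "not_a_smiles", "c1ccccc1"]

parsed = [(smi, Chem.MolFromSmiles(smi)) for smi in smiles_list]
valid = [(smi, m) for smi, m in parsed if m is not None]
failed = [smi for smi, m in parsed if m is None]

print(f"{len(valid)} parsed, {len(failed)} failed: {failed}")
```

If any molecules are dropped this way, the corresponding rows should also be dropped from `df`, so that the descriptor table can still be aligned with `df.index` later on.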

### Create descriptor calculator

In [11]:
Calc = Calculator(descriptors, ignore_3D= True)

In [6]:
len(Calc.descriptors)

1613

### Calculate descriptors and store them in a DataFrame

In [14]:
df1 = Calc.pandas(mols)

100%|█████████████████████████████████████████| 472/472 [02:31<00:00,  3.11it/s]


In [20]:
df1.head()

Unnamed: 0,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
0,35.142811,24.264889,0,2,60.122409,2.279705,4.55941,60.122409,1.25255,4.737539,...,10.238852,84.598254,666.508407,5.846565,14630,63,218.0,238.0,16.611111,11.361111
1,16.987142,12.790496,0,0,27.467293,2.533089,5.066178,27.467293,1.307966,4.015374,...,10.293467,55.839709,298.123676,7.453092,842,39,120.0,147.0,6.25,4.472222
2,8.850899,8.508709,1,1,multiple fragments (SpAbs_A/SpAbs),multiple fragments (SpMax_A/SpMax),multiple fragments (SpDiam_A/SpDiam),multiple fragments (SpAD_A/SpAD),multiple fragments (SpMAD_A/SpMAD),multiple fragments (LogEE_A/LogEE),...,9.303375,43.773162,245.041526,8.449708,1200000190,16,58.0,65.0,divide by zero encountered in power (mZagreb1),2.708333
3,11.968445,9.625522,0,0,20.264831,2.459954,4.919908,20.264831,1.350989,3.67295,...,9.827416,47.796305,198.115698,6.831576,326,25,82.0,99.0,3.805556,3.277778
4,36.338245,25.499176,0,2,63.382841,2.287195,4.57439,63.382841,1.267657,4.776866,...,10.283053,86.794615,694.539707,5.787831,16085,67,226.0,248.0,17.111111,12.027778


### Let's set the index to the ChEMBL IDs of the molecules

In [23]:
df1.set_index(df.index, inplace=True)

In [24]:
df1.head()

molecule_chembl_id,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,LogEE_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
CHEMBL11805,35.142811,24.264889,0,2,60.122409,2.279705,4.55941,60.122409,1.25255,4.737539,...,10.238852,84.598254,666.508407,5.846565,14630,63,218.0,238.0,16.611111,11.361111
CHEMBL208599,16.987142,12.790496,0,0,27.467293,2.533089,5.066178,27.467293,1.307966,4.015374,...,10.293467,55.839709,298.123676,7.453092,842,39,120.0,147.0,6.25,4.472222
CHEMBL60745,8.850899,8.508709,1,1,multiple fragments (SpAbs_A/SpAbs),multiple fragments (SpMax_A/SpMax),multiple fragments (SpDiam_A/SpDiam),multiple fragments (SpAD_A/SpAD),multiple fragments (SpMAD_A/SpMAD),multiple fragments (LogEE_A/LogEE),...,9.303375,43.773162,245.041526,8.449708,1200000190,16,58.0,65.0,divide by zero encountered in power (mZagreb1),2.708333
CHEMBL95,11.968445,9.625522,0,0,20.264831,2.459954,4.919908,20.264831,1.350989,3.67295,...,9.827416,47.796305,198.115698,6.831576,326,25,82.0,99.0,3.805556,3.277778
CHEMBL173309,36.338245,25.499176,0,2,63.382841,2.287195,4.57439,63.382841,1.267657,4.776866,...,10.283053,86.794615,694.539707,5.787831,16085,67,226.0,248.0,17.111111,12.027778


### Check for missing values

In [27]:
df1.isnull().sum().sum()

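0

No null values are reported, yet the column dtypes below show that some columns (e.g. `SpAbs_A`, `mZagreb1`) have dtype `object`: as seen in the table above, Mordred stores an error message in those cells instead of a number, so they are not counted as null.

ABC         float64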
ABCGG       float64
nAcid         int64
nBase         int64
SpAbs_A      object
             ...   
WPol          int64
Zagreb1     float64
Zagreb2     float64
mZagreb1     object
mZagreb2    float64
Length: 1613, dtype: object
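
Before such a table is used for modelling, the non-numeric entries usually need to be turned into real missing values. The sketch below uses a small hand-made frame in which plain strings stand in for Mordred's error entries. Real Mordred output may need the same treatment, but that is an assumption to verify on `df1` itself.

```python
import pandas as pd

# Toy frame mimicking the pattern above; strings stand in for Mordred's error entries
toy = pd.DataFrame(
    {
        "ABC": [35.142811, 8.850899],
        "SpAbs_A": [60.122409, "multiple fragments (SpAbs_A/SpAbs)"],
        "mZagreb1": [16.611111, "divide by zero encountered in power (mZagreb1)"],
    },
    index=["CHEMBL11805", "CHEMBL60745"],
)

# Coerce every column to numeric; anything that is not a number becomes NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")
print(numeric.isnull().sum().sum())

# Keep only the descriptors that were computed for every molecule
clean = numeric.dropna(axis=1)
print(list(clean.columns))
```

Dropping whole columns is the simplest choice. Dropping the affected molecules or imputing values are alternatives, depending on how many entries are missing.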

### Molecular fingerprints
Will be calculated soon!

In [41]:
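def get_fingerprints(mols, method='maccs', n_bits=2048):
    """Return a DataFrame with one fingerprint (row of bits) per molecule."""
    fps = []
    for mol in mols:
        if mol is None:
            continue  # skip molecules that could not be parsed
        if method == 'maccs':
            fp = MACCSkeys.GenMACCSKeys(mol)
        elif method == 'morgan2':
            # Morgan fingerprint with radius 2, folded to n_bits
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
        else:
            raise ValueError(f"Unknown fingerprint method: {method}")
        fps.append(np.array(list(fp)))
    return pd.DataFrame(fps)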