# ADN T2 ~ Molecular Descriptors and Fingerprints

In this tutorial we will learn how to use Python together with RDKit to calculate different molecular descriptors, using built-in *RDKit* functions as well as third-party software such as *Mordred*.

By definition, a **molecular descriptor (MD)** is "… the final result of a logical and mathematical procedure that transforms chemical information of a molecule, such as structural features, into useful numbers or the result of standardized experiments."

Utility:

Molecular descriptors can be used in molecular similarity analysis to compare molecules (and/or their properties), which helps to identify new molecules with desired properties and biological activity.

## Molecular descriptors
*Similarity can be assessed in many different ways depending on the application (see J. Med. Chem. (2014), 57, 3186-3204):*

* 1D molecular descriptors: Solubility, logP, molecular weight, melting point.
    * Global descriptor: only one value represents the whole molecule
    * Usually do not contain enough information to be applied to machine learning (ML)
    * Can be added to 2D fingerprints to improve molecular encoding for ML
* 2D molecular descriptors: Molecular graphs, paths, fragments, atom environments
    * Detailed representation of individual parts of the molecule
    * Contains many features/bits per molecule called fingerprints
    * Very often used in similarity search and ML
* 3D molecular descriptors: Shape, stereochemistry
    * Less robust than 2D representations because of molecule flexibility (what is the "right" conformation of a molecule?)
* Biological similarity
    * Biological fingerprint, e.g. individual bits represent bioactivity measured against different targets
    * Independent of molecular structure
    * Requires experimental (or predicted) data
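
To make the difference between global descriptors and fingerprints concrete, the sketch below computes two 1D descriptors and one 2D fingerprint for a single molecule. The aspirin SMILES and the helper name `describe` are illustrative assumptions and are not part of the tutorial data set.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys


def describe(smiles):
    """Return global descriptors and a MACCS fingerprint for one SMILES (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # Global (1D) descriptors: one number describes the whole molecule
    global_desc = {
        "MW": Descriptors.MolWt(mol),
        "logP": Descriptors.MolLogP(mol),
    }
    # 2D fingerprint: many bits, each one encoding a structural feature
    fingerprint = MACCSkeys.GenMACCSKeys(mol)
    return global_desc, fingerprint


global_desc, fingerprint = describe("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, illustrative only
print(global_desc)
print(fingerprint.GetNumBits(), fingerprint.GetNumOnBits())
```

The two descriptors collapse the molecule into two numbers, whereas the MACCS fingerprint keeps one bit per predefined structural key, which is why fingerprints are the more common input for similarity searches.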

In [2]:
import pandas as pd
import numpy as np
from rdkit import Chem,DataStructs
from rdkit.Chem import AllChem, Descriptors, MACCSkeys
from rdkit.Chem.Draw import IPythonConsole
from mordred import Calculator, descriptors


In [47]:
file = 'data-test.sdf'
mols = []
for m in Chem.SDMolSupplier(file):
    if m is not None:
        mols.append(m)

In [48]:
len(mols)

50

### Molecular descriptors using *RDKit*

In [4]:
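desc = []

for m in mols:
    # One row of global (1D) descriptors per molecule
    desc.append([Descriptors.MolLogP(m),         # Wildman-Crippen logP
                 Descriptors.HeavyAtomCount(m),  # number of heavy atoms
                 Descriptors.MolMR(m),           # Wildman-Crippen molar refractivity
                 Descriptors.MolWt(m),           # average molecular weight
                 Descriptors.NumHAcceptors(m),   # H-bond acceptors
                 Descriptors.NumHDonors(m)])     # H-bond donors
# NHA = number of H-bond acceptors, NHB = number of H-bond donors
columns = ['logp', 'HAC', 'MMR', 'MW', 'NHA', 'NHB']
desc = pd.DataFrame(desc, columns=columns)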

In [5]:
desc

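| | logp | HAC | MMR | MW | NHA | NHB |
|---|---|---|---|---|---|---|
| 0 | 2.1436 | 12 | 47.468 | 156.188 | 2 | 0 |
| 1 | 0.3995 | 10 | 35.372 | 138.174 | 4 | 0 |
| 2 | -0.2663 | 6 | 25.5104 | 86.158 | 0 | 1 |
| 3 | -0.4891 | 17 | 68.2684 | 236.403 | 0 | 2 |
| 4 | 2.783 | 14 | 57.044 | 180.21 | 2 | 0 |
| 5 | 2.1612 | 25 | 95.9625 | 344.519 | 2 | 2 |
| 6 | 2.1612 | 25 | 95.9625 | 344.519 | 2 | 2 |
| 7 | -2.9381 | 56 | 186.3958 | 804.88 | 18 | 11 |
| 8 | -5.1139 | 67 | 219.0292 | 967.021 | 23 | 14 |
| 9 | -2.9381 | 56 | 186.3958 | 804.88 | 18 | 11 |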


### Molecular descriptors using *Mordred*

In [6]:
Calc = Calculator(descriptors, ignore_3D= True)

In [8]:
len(Calc.descriptors)

1613

In [7]:
df = Calc.pandas(mols)

100%|████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00,  3.39it/s]


In [9]:
df

| | ABC | ABCGG | nAcid | nBase | SpAbs_A | SpMax_A | SpDiam_A | SpAD_A | SpMAD_A | LogEE_A | ... | SRW10 | TSRW10 | MW | AMW | WPath | WPol | Zagreb1 | Zagreb2 | mZagreb1 | mZagreb2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.151948 | 7.830014 | 0 | 0 | 16.383377 | 2.278414 | 4.556827 | 16.383377 | 1.365281 | 3.412246 | ... | 9.076466 | 42.338794 | 156.068748 | 7.803437 | 198 | 14 | 58.0 | 65.0 | 2.722222 | 2.777778 |
| 1 | 7.737734 | 7.200149 | 2 | 0 | 13.363517 | 2.310277 | 4.405571 | 13.363517 | 1.336352 | 3.252036 | ... | 8.817742 | 53.260053 | 138.090546 | 6.904527 | 107 | 13 | 50.0 | 57.0 | 2.222222 | 2.277778 |
| 2 | 4.242641 | 4.0 | 0 | 1 | 8.0 | 2.0 | 4.0 | 8.0 | 1.333333 | 2.687624 | ... | 7.627057 | 30.941317 | 86.096426 | 4.783135 | 27 | 3 | 24.0 | 24.0 | 1.5 | 1.5 |
| 3 | 13.980375 | 10.32463 | 0 | 2 | 23.542704 | 2.501909 | 5.003819 | 23.542704 | 1.384865 | 3.82132 | ... | 10.066499 | 50.79383 | 236.224152 | 5.249426 | 458 | 30 | 98.0 | 120.0 | 3.416667 | 3.611111 |
| 4 | 11.192388 | 8.912312 | 0 | 0 | 19.448251 | 2.434764 | 4.869528 | 19.448251 | 1.389161 | 3.607247 | ... | 9.70522 | 46.333857 | 180.068748 | 8.184943 | 271 | 22 | 76.0 | 91.0 | 2.944444 | 3.083333 |
| 5 | 21.099924 | 16.659435 | 0 | 1 | 33.785723 | 2.731855 | 5.416407 | 33.785723 | 1.351429 | 4.240872 | ... | 10.964761 | 78.357822 | 344.258406 | 5.834888 | 1133 | 61 | 162.0 | 215.0 | 6.965278 | 5.041667 |
| 6 | 21.099924 | 16.510656 | 0 | 1 | 33.797467 | 2.722545 | 5.399436 | 33.797467 | 1.351899 | 4.240865 | ... | 10.958862 | 78.350114 | 344.258406 | 5.834888 | 1160 | 61 | 162.0 | 215.0 | 6.965278 | 5.041667 |
| 7 | 44.506807 | 32.070274 | 0 | 0 | 71.425497 | 2.691428 | 5.320516 | 71.425497 | 1.275455 | 4.976051 | ... | 11.412165 | 110.845479 | 804.377965 | 6.934293 | 14019 | 122 | 322.0 | 405.0 | 21.861111 | 11.916667 |
| 8 | 52.939717 | 38.541609 | 0 | 0 | 85.519591 | 2.691932 | 5.321309 | 85.519591 | 1.276412 | 5.151137 | ... | 11.563989 | 122.662721 | 966.430789 | 7.054239 | 21164 | 146 | 382.0 | 480.0 | 26.166667 | 14.444444 |
| 9 | 44.397417 | 32.698182 | 0 | 0 | 71.454186 | 2.691792 | 5.321068 | 71.454186 | 1.275968 | 4.976124 | ... | 11.420536 | 110.861358 | 804.377965 | 6.934293 | 13227 | 123 | 322.0 | 406.0 | 21.861111 | 12.0 |
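
Large descriptor tables like this one usually need cleaning before they are used for ML: entries can be non-numeric where a descriptor could not be computed, and constant columns carry no information. Below is a minimal sketch of such a cleaning step. It uses a small hand-made frame (`toy`, with hypothetical values) in place of `df` and assumes that failed calculations appear as non-numeric objects in the table. The helper name `clean_descriptors` is an assumption of this sketch.

```python
import pandas as pd


def clean_descriptors(table):
    """Keep only fully numeric, non-constant descriptor columns (hypothetical helper)."""
    numeric = table.apply(pd.to_numeric, errors="coerce")  # non-numeric entries become NaN
    numeric = numeric.dropna(axis=1)                       # drop columns with any failed value
    return numeric.loc[:, numeric.nunique() > 1]           # drop constant columns


toy = pd.DataFrame({
    "MW": [156.07, 138.09, 86.10],
    "nAcid": [0, 0, 0],                      # constant column
    "Failing": [1.2, "missing value", 3.4],  # stands in for a failed calculation
})
cleaned = clean_descriptors(toy)
print(list(cleaned.columns))
```

Dropping every column that contains a failed value is the strictest option. Imputing the missing entries instead is a reasonable alternative when too many descriptors would otherwise be lost.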


### Molecular fingerprints

In [41]:
def get_fingerprints(mols, method='MACCS'):
    """Return one RDKit bit-vector fingerprint per valid molecule."""
    fps = []
    for mol in mols:
        if mol is None:
            continue  # skip molecules that failed to parse
        if method == 'MACCS':
            fps.append(MACCSkeys.GenMACCSKeys(mol))
        elif method == 'morgan2':
            fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))
        else:
            raise ValueError(f"Unknown fingerprint method: {method}")
    return fps

In [52]:
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols if mol is not None]

In [72]:
Dfps = pd.DataFrame(fps, index= [mol.GetProp('_Name') for mol in mols ], columns= ['Morgan2'])

In [76]:
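# Tanimoto similarity of every molecule to the first one (the query)
similarity = []  # created once, before the loop, so every value is kept
for i in range(len(Dfps)):
    sim = DataStructs.TanimotoSimilarity(Dfps.Morgan2.iloc[0], Dfps.Morgan2.iloc[i])
    similarity.append(sim)
similarity[-1]

0.03125

In [77]:
len(similarity)

50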
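
The Tanimoto coefficient used above compares two fingerprints A and B as c / (a + b - c), where a and b are the numbers of on-bits in each fingerprint and c is the number of on-bits they share. A minimal pure-Python sketch on hand-made sets of on-bit indices follows. The sets are illustrative and are not taken from the data set, and the helper name `tanimoto` is an assumption of this sketch.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    shared = len(on_bits_a & on_bits_b)                  # c: bits set in both
    total = len(on_bits_a) + len(on_bits_b) - shared     # a + b - c: bits set in either
    # Returning 0.0 for two empty fingerprints is a convention chosen for this sketch
    return shared / total if total else 0.0


fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
print(tanimoto(fp_a, fp_b))  # 2 shared bits out of 5 distinct bits -> 0.4
```

For an RDKit bit vector `fp`, `set(fp.GetOnBits())` gives a set of on-bit indices of this kind, so the sketch can be compared against `DataStructs.TanimotoSimilarity` on real fingerprints.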