# Introduction to cheminformatics

Andrea Volkamer - adapted by Gautier Peyrat

#### Basic handling of molecules

* Reading & writing of molecules
* Molecular descriptors & fingerprints
* Molecular similarity

#### Using RDKit: open source cheminformatics software

More information can be found here:

* http://www.rdkit.org/docs/index.html
* http://www.rdkit.org/docs/api/index.html

In [None]:
# Most of the basic molecular functionality lives in the rdkit.Chem module
from rdkit import Chem
from rdkit.Chem import AllChem

## Representation of molecules

### SMILES (Simplified Molecular Input Line Entry Specification)

* Atoms are represented by atomic symbols: C, N, O, F, S, Cl, Br, I
* Double bonds are `=`, triple bonds are `#`
* Branching is indicated by parentheses
* Ring closures are indicated by pairs of matching digits

More information can be found here: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
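The rules above can be checked directly in RDKit. The following is a minimal sketch; the small molecules (isopropanol, cyclohexane, hydrogen cyanide) are illustrative choices that do not appear elsewhere in this notebook.

```python
from rdkit import Chem

# Branching: the methyl group in parentheses hangs off the central carbon
isopropanol = Chem.MolFromSmiles('CC(C)O')
n_heavy = isopropanol.GetNumAtoms()  # hydrogens are implicit, so only heavy atoms are counted

# Ring closure: the two matching "1" digits bond the first and the last carbon
cyclohexane = Chem.MolFromSmiles('C1CCCCC1')
n_rings = cyclohexane.GetRingInfo().NumRings()

# Triple bond: "#" between carbon and nitrogen
hcn = Chem.MolFromSmiles('C#N')
bond_type = hcn.GetBondWithIdx(0).GetBondType()

# An unclosed ring is a syntax error: RDKit logs a message and returns None
broken = Chem.MolFromSmiles('C1CCC')

print(n_heavy, n_rings, bond_type, broken)
```

Because `MolFromSmiles` returns `None` instead of raising an exception, it is worth checking the result before using it.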

In [None]:
# Individual molecules can be constructed using a variety of approaches
# FDA-approved EGFR inhibitors: gefitinib (mol1) and erlotinib (mol2)

mol1 = Chem.MolFromSmiles('COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1')
mol2 = Chem.MolFromSmiles('C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1')

#### Drawing molecules

In [None]:
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

In [None]:
# Single molecule
mol1

In [None]:
# List of molecules
Draw.MolsToGridImage([mol1,mol2], useSVG=True)
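Outside a notebook there is no inline rendering, so a depiction usually has to be produced explicitly. The sketch below assumes that an SVG string is an acceptable output and uses pyridine purely as a small example molecule.

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

mol = Chem.MolFromSmiles('c1cccnc1')  # pyridine, used here only as a small example

drawer = rdMolDraw2D.MolDraw2DSVG(300, 300)   # canvas width and height in pixels
rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol)  # computes 2D coordinates, then draws
drawer.FinishDrawing()
svg = drawer.GetDrawingText()  # the whole SVG document as a string

print(svg[:60])
```

The resulting string can be written to a `.svg` file with a plain `open(...).write(svg)`.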

In [None]:
# A single molecule can be written as several different, equally valid SMILES
mol3 = Chem.MolFromSmiles('C1=CC=CN=C1')
mol4 = Chem.MolFromSmiles('c1cccnc1')
mol5 = Chem.MolFromSmiles('n1ccccc1')
Chem.Draw.MolsToGridImage([mol3, mol4, mol5])

In [None]:
# By default, MolToSmiles returns the canonical SMILES, so the three inputs print the same string
print(Chem.MolToSmiles(mol3))
print(Chem.MolToSmiles(mol4))
print(Chem.MolToSmiles(mol5))
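Canonical SMILES are handy for detecting duplicates. As a minimal sketch, the three pyridine spellings used above can be collapsed into a set; the variable names are illustrative.

```python
from rdkit import Chem

# Three different spellings of pyridine
inputs = ['C1=CC=CN=C1', 'c1cccnc1', 'n1ccccc1']

# Canonicalise each input; identical molecules give identical strings
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in inputs}
same_molecule = len(canonical) == 1

print(canonical, same_molecule)
```

Note that canonical SMILES are only guaranteed to be comparable when they are generated by the same toolkit.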

#### Isomerism

Cis/trans (E/Z) isomerism around a double bond is encoded in the SMILES with the `/` and `\` symbols.

In [None]:
mol6 = Chem.MolFromSmiles("O/C=C/Cl") # E or trans isomer
mol6

In [None]:
mol7 = Chem.MolFromSmiles(r'O/C=C\Cl') # Z or cis isomer (raw string, so the backslash is not read as an escape)
mol7
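One way to see what the `/` and `\` marks do is to compare the canonical SMILES with and without stereochemistry. This is a minimal sketch; the `isomericSmiles=False` option drops the stereo information on output.

```python
from rdkit import Chem

trans = Chem.MolFromSmiles(r'O/C=C/Cl')      # E isomer
cis = Chem.MolFromSmiles(r'O/C=C\Cl')        # Z isomer
trans_alt = Chem.MolFromSmiles(r'O\C=C\Cl')  # both marks flipped: still the E isomer

# With stereochemistry, the two isomers get different canonical SMILES
iso_trans = Chem.MolToSmiles(trans)
iso_cis = Chem.MolToSmiles(cis)
iso_trans_alt = Chem.MolToSmiles(trans_alt)

# Without stereochemistry, both collapse to the same string
flat_trans = Chem.MolToSmiles(trans, isomericSmiles=False)
flat_cis = Chem.MolToSmiles(cis, isomericSmiles=False)

print(iso_trans, iso_cis, flat_trans)
```

Only the relative orientation of the two marks matters, which is why flipping both of them describes the same isomer.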

#### Chirality (stereochemistry)

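Chirality at a tetrahedral center is encoded with the `@` and `@@` symbols (counterclockwise and clockwise neighbor order, respectively).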

In [None]:
# The two stereocenters are left unspecified
mol8 = Chem.MolFromSmiles('Oc1ccc(cc1)/C=C/c1cc(O)cc2c1C(c1cc(O)cc(c1)O)C(O2)c1ccc(cc1)O')
mol8

In [None]:
# Same molecule with both stereocenters specified
mol9 = Chem.MolFromSmiles('Oc1ccc(cc1)/C=C/c1cc(O)cc2c1[C@@H](c1cc(O)cc(c1)O)[C@@H](O2)c1ccc(cc1)O')
mol9
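Stereocenters can also be listed programmatically. The sketch below uses alanine, a smaller molecule that is not part of the original notebook, and `Chem.FindMolChiralCenters` to report the R/S label of each center.

```python
from rdkit import Chem

# The two enantiomers of alanine, plus a version without stereo annotation
ala_a = Chem.MolFromSmiles('C[C@H](N)C(=O)O')
ala_b = Chem.MolFromSmiles('C[C@@H](N)C(=O)O')
ala_unspecified = Chem.MolFromSmiles('CC(N)C(=O)O')

# Each entry is a (atom index, label) tuple; unassigned centers are labelled '?'
centers_a = Chem.FindMolChiralCenters(ala_a, includeUnassigned=True)
centers_b = Chem.FindMolChiralCenters(ala_b, includeUnassigned=True)
centers_u = Chem.FindMolChiralCenters(ala_unspecified, includeUnassigned=True)

print(centers_a, centers_b, centers_u)
```

Swapping `@` for `@@` on the same atom, with the neighbors written in the same order, yields the opposite enantiomer.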

#### Molecular formats

In [None]:
# InChI
print(Chem.MolToInchi(mol1))

In [None]:
# InChIKey
print(Chem.MolToInchiKey(mol1))

In [None]:
# InChIKeys of two molecules that differ only in their stereochemistry annotation

print(Chem.MolToInchiKey(mol8))
print(Chem.MolToInchiKey(mol9))
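A standard InChIKey has three hyphen-separated blocks: the first 14 characters hash the connectivity, while stereochemistry is hashed into the second block. The sketch below illustrates this with the two enantiomers of alanine (an illustrative choice) and assumes an RDKit build with InChI support.

```python
from rdkit import Chem

key_a = Chem.MolToInchiKey(Chem.MolFromSmiles('C[C@H](N)C(=O)O'))
key_b = Chem.MolToInchiKey(Chem.MolFromSmiles('C[C@@H](N)C(=O)O'))

# Split off the first block (connectivity) from the rest of the key
skeleton_a, rest_a = key_a.split('-', 1)
skeleton_b, rest_b = key_b.split('-', 1)

print(key_a)
print(key_b)
print('same connectivity:', skeleton_a == skeleton_b)
```

Comparing only the first block is a common way to group stereoisomers of the same compound.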

In [None]:
# Neutral and charged forms of the same molecule (proline):
mol10 = Chem.MolFromSmiles('[H]N1CCCC1C(O)=O')
mol11 = Chem.MolFromSmiles('[H]N1CCCC1C([O-])=O')
mol12 = Chem.MolFromSmiles('[NH2+]1CCCC1C([O-])=O')
mol13 = Chem.MolFromSmiles('[NH2+]1CCCC1C(O)=O')

Chem.Draw.MolsToGridImage([mol10, mol11, mol12, mol13], molsPerRow=4)

In [None]:
for mol in [mol10, mol11, mol12, mol13]:
    print(Chem.MolToInchiKey(mol))

In [None]:
# MolBlock
print(Chem.MolToMolBlock(mol1))

For more details about the SDF file format, see:

https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_(2019)/2._Representing_Small_Molecules_on_Computers/2.5%3A_Structural_Data_Files
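A MolBlock can be read back with `Chem.MolFromMolBlock`. The following minimal sketch does a round trip on paracetamol and reads the atom and bond counts from the fourth line of the block, assuming the default V2000 layout.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CC(=O)Nc1ccc(O)cc1')  # paracetamol
block = Chem.MolToMolBlock(mol)

# V2000 layout: three header lines, then the counts line
counts_line = block.splitlines()[3]
n_atoms = int(counts_line[0:3])   # columns 1-3: number of atoms
n_bonds = int(counts_line[3:6])   # columns 4-6: number of bonds

# Parse the block back into a molecule and compare canonical SMILES
restored = Chem.MolFromMolBlock(block)
round_trip_ok = Chem.MolToSmiles(restored) == Chem.MolToSmiles(mol)

print(n_atoms, n_bonds, round_trip_ok)
```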

### Generating 3D coordinates

In [None]:
m_3D = Chem.AddHs(mol1)
AllChem.EmbedMolecule(m_3D)
# AllChem.UFFOptimizeMolecule(m_3D)  # Optional force-field clean-up; usually unnecessary since RDKit 2018.09, where ETKDG became the default embedding method
Draw.MolsToGridImage([mol1,m_3D])

In [None]:
print(Chem.MolToMolBlock(m_3D))
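Embedding is stochastic and can fail, so it may help to fix the random seed and check the return value. The sketch below uses ethanol as an illustrative molecule; the seed value is arbitrary.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles('CCO'))  # explicit hydrogens are needed for sensible 3D geometry

# Returns the new conformer id, or -1 if the embedding failed
conf_id = AllChem.EmbedMolecule(mol, randomSeed=42)
if conf_id == -1:
    raise RuntimeError('3D embedding failed')

# Optional UFF clean-up: 0 means converged, 1 means more iterations are needed
status = AllChem.UFFOptimizeMolecule(mol)

conf = mol.GetConformer(conf_id)
is_3d = conf.Is3D()
first_atom = conf.GetAtomPosition(0)

print(conf_id, status, is_3d, (first_atom.x, first_atom.y, first_atom.z))
```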

### Writing molecules to *sdf* (structure data files)

In [None]:
w = Chem.SDWriter('./data/mytest_mol3D.sdf')
w.write(m_3D)
w.close()
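`Chem.SDWriter` does not create missing directories, so the target folder has to exist. The sketch below writes to a temporary directory instead of `./data`, attaches a name and a property (both hypothetical values), and reads the file back with `Chem.SDMolSupplier`.

```python
import os
import tempfile

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles('CCO'))
AllChem.EmbedMolecule(mol, randomSeed=42)
mol.SetProp('_Name', 'ethanol')    # written as the title line of the record
mol.SetProp('Source', 'example')   # written as a data field (hypothetical value)

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'example.sdf')

writer = Chem.SDWriter(path)
writer.write(mol)
writer.close()  # flushes the file to disk

# removeHs=False keeps the explicit hydrogens that were written
supplier = Chem.SDMolSupplier(path, removeHs=False)
restored = [m for m in supplier if m is not None]

print(len(restored), restored[0].GetProp('_Name'), restored[0].GetNumAtoms())
```

The supplier yields `None` for records it cannot parse, hence the filter in the list comprehension.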

### Get information on molecules

In [None]:
mol2

Number of heavy atoms (every atom except hydrogen: C, N, O, F, Cl, ...)

In [None]:
mol2.GetNumHeavyAtoms()

Number of bonds

In [None]:
mol2.GetNumBonds()
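Beyond atom and bond counts, a few other quantities are available in the same way. The following sketch collects some of them for aspirin; `Descriptors.MolWt` and `rdMolDescriptors.CalcMolFormula` are used here as examples and are not called elsewhere in this notebook.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

aspirin = Chem.MolFromSmiles('CC(=O)OC1=CC=CC=C1C(=O)O')

summary = {
    'heavy_atoms': aspirin.GetNumHeavyAtoms(),
    'bonds': aspirin.GetNumBonds(),           # bonds between heavy atoms only
    'rings': aspirin.GetRingInfo().NumRings(),
    'formula': rdMolDescriptors.CalcMolFormula(aspirin),
    'mol_wt': round(Descriptors.MolWt(aspirin), 2),
}

print(summary)
```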

### Pandas Dataframe

Pandas is a fast, flexible, open source Python library for data analysis and manipulation.
It lets you:
 - Manipulate labeled data tables, called DataFrames (similar to data frames in R), with variables as columns and individuals as rows.
 - Read and write DataFrames from or to tabular files.
 - Plot graphs from DataFrames with matplotlib.

Here we combine Pandas dataframes and RDKit to display molecules in tables.

In [None]:
import pandas as pd   # "as pd" defines a short alias for the pandas library
from rdkit.Chem.PandasTools import RenderImagesInAllDataFrames
RenderImagesInAllDataFrames(images=True)

In [None]:
# Get the SMILES of three drugs (aspirin, paracetamol and ibuprofen) and convert them to RDKit molecules
aspirin = Chem.MolFromSmiles('CC(=O)OC1=CC=CC=C1C(=O)O')
paracetamol = Chem.MolFromSmiles('CC(=O)NC1=CC=C(C=C1)O')
ibuprofen = Chem.MolFromSmiles('CC(C)CC1=CC=C(C=C1)C(C)C(=O)O')
# Create a list Lmeds containing the three drugs
Lmeds = [aspirin, paracetamol, ibuprofen]

In [None]:
df_meds=pd.DataFrame()
df_meds['ID'] = ['Aspirin', 'Paracetamol', 'Ibuprofen']
df_meds['Molecule'] = Lmeds
df_meds

In [None]:
# Collect the number of heavy atoms and the number of bonds for each row of the dataframe
L_heavyatoms = []
L_bonds = []
for mol in df_meds.Molecule:
    L_heavyatoms.append(mol.GetNumHeavyAtoms())
    L_bonds.append(mol.GetNumBonds())

In [None]:
df_meds['Heavy Atoms'] = L_heavyatoms
df_meds['Bonds'] = L_bonds
df_meds

Another way to add a new column is to use "apply" directly on a dataframe column:

In [None]:
from rdkit.Chem import Fragments

In [None]:
# apply removes the need for an explicit for loop and an intermediate list when adding a new column
df_meds['Carboxylic_acid'] = df_meds.Molecule.apply(Fragments.fr_COO)
df_meds
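The same pattern works for any per-molecule function, including a `lambda` that wraps a method call. The sketch below rebuilds a small table from SMILES strings so that it runs on its own; the dataframe name `df` and the column names are illustrative.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Fragments

smiles = {
    'Aspirin': 'CC(=O)OC1=CC=CC=C1C(=O)O',
    'Paracetamol': 'CC(=O)NC1=CC=C(C=C1)O',
    'Ibuprofen': 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O',
}

df = pd.DataFrame({'ID': list(smiles), 'SMILES': list(smiles.values())})

# One apply call per new column: no explicit loop, no intermediate list
df['Molecule'] = df['SMILES'].apply(Chem.MolFromSmiles)
df['Heavy Atoms'] = df['Molecule'].apply(lambda m: m.GetNumHeavyAtoms())
df['Bonds'] = df['Molecule'].apply(lambda m: m.GetNumBonds())
df['Carboxylic_acid'] = df['Molecule'].apply(Fragments.fr_COO)

print(df.drop(columns='Molecule'))
```

A function such as `Fragments.fr_COO` can be passed directly because it takes the molecule as its only required argument, whereas methods like `GetNumBonds` need the small `lambda` wrapper.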

# Quiz

In [None]:
from nbautoeval import run_yaml_quiz

In [None]:
run_yaml_quiz("../corrections/quiz/intro.yaml", "theoric-quiz_cheminf")

In [None]:
run_yaml_quiz("../corrections/quiz/intro.yaml", "code-quiz_cheminf")

In [None]:
from nbautoeval.storage import storage_clear

In [None]:
storage_clear("quiz-intro-02")
storage_clear("quiz-intro-04")