# Introduction to cheminformatics

Andrea Volkamer - adapted by Gautier Peyrat

## Learning objectives

### Basic handling of molecules

* Reading & writing of molecules
* Molecular descriptors & fingerprints (a first taste; we will cover them in more detail later)

### Using `RDKit`: open source cheminformatics software

More information can be found here:

* http://www.rdkit.org/docs/index.html
* http://www.rdkit.org/docs/api/index.html

In [None]:
# The majority of the basic molecular functionality is found in module rdkit.Chem
from rdkit import Chem
from rdkit.Chem import AllChem

## Representation of molecules

You have probably encountered several ways of representing chemicals. Here are a few:

- Trivial Names (Aspirin)
- Systematic Names (2-acetyloxybenzoic acid)
- Formula (C<sub>9</sub>H<sub>8</sub>O<sub>4</sub>)
- Images (https://en.wikipedia.org/wiki/File:Aspirin-B-3D-balls.png)

But how do **chemists** communicate with **computers**?

There are two main methods:
- molecular graph (connection tables)
- line notation

We will focus on the **line notation**, because:

1. Many computational processes operate more effectively on data structured as linear strings than on data structured as tables.
2. Line notations can be reasonably legible to the human chemists who design functions with these tools.
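
To make the contrast concrete, here is a minimal sketch of the other option, the molecular graph, listed as a small connection table. It assumes ethanol (`CCO`) as an illustrative molecule, which is not part of the original examples; the atom and bond accessors are standard RDKit calls.

```python
from rdkit import Chem

# Illustrative molecule (assumption): ethanol, hydrogens left implicit
mol = Chem.MolFromSmiles("CCO")

# Atom block: index and element symbol of each node in the graph
atoms = [(atom.GetIdx(), atom.GetSymbol()) for atom in mol.GetAtoms()]

# Bond block: the two atom indices joined by each edge, plus the bond type
bonds = [
    (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), str(bond.GetBondType()))
    for bond in mol.GetBonds()
]

print(atoms)
print(bonds)
```

The same information fits in the three characters `CCO`, which is why line notations are convenient to type, store and search.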

One of the most frequently used line notations in chemistry is:

### SMILES (Simplified Molecular Input Line Entry Specification)

* Atoms are represented by atomic symbols: C, N, O, F, S, Cl, Br, I
* Double bonds are `=`, triple bonds are `#`
* Branching is indicated by parentheses
* Ring closures are indicated by pairs of matching digits

More information can be found here: http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
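
The rules above can be tried on a few small molecules. This is only a sketch: the five molecules are illustrative choices, not taken from the original material. The benzene entry also shows the lowercase convention for aromatic atoms, which the SMILES in the following cells rely on.

```python
from rdkit import Chem

# Illustrative SMILES (assumptions), one per rule
examples = {
    "ethanol": "CCO",           # single bonds are implicit
    "acetic acid": "CC(=O)O",   # parentheses open a branch, '=' is a double bond
    "acetonitrile": "CC#N",     # '#' is a triple bond
    "cyclohexane": "C1CCCCC1",  # the two '1' digits close a ring
    "benzene": "c1ccccc1",      # lowercase symbols mark aromatic atoms
}

heavy_atom_counts = {}
ring_counts = {}
for name, smiles in examples.items():
    mol = Chem.MolFromSmiles(smiles)
    heavy_atom_counts[name] = mol.GetNumAtoms()  # hydrogens are implicit
    ring_counts[name] = mol.GetRingInfo().NumRings()
    print(name, smiles, heavy_atom_counts[name], ring_counts[name])
```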

In [None]:
# Individual molecules can be constructed using a variety of approaches
# FDA approved EGFR inhibitors: Gefitinib, Erlotinib

mol1 = Chem.MolFromSmiles('COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1')
mol2 = Chem.MolFromSmiles('C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1')

The cell above creates what are often called "RDKit molecule objects" from the SMILES strings.  
We can check this with the `type` function:

In [None]:
type(mol1)
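
One practical detail: `Chem.MolFromSmiles` does not raise an exception on an invalid SMILES. It returns `None` and logs an error message. The sketch below wraps the call in a small helper; `parse_smiles` is a hypothetical name introduced here for illustration, not an RDKit function.

```python
from rdkit import Chem


def parse_smiles(smiles):
    """Return an RDKit molecule, or raise if the SMILES cannot be parsed.

    Hypothetical helper for illustration; not part of RDKit.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return mol


valid = parse_smiles("c1ccccc1")           # benzene
broken = Chem.MolFromSmiles("c1ccccc")     # ring opened but never closed
print(type(valid).__name__, broken)
```

Checking for `None` right after parsing avoids confusing attribute errors later in a notebook.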

#### Drawing molecules

In [None]:
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

In [None]:
# Single molecule
mol1

In [None]:
# List of molecules
Draw.MolsToGridImage([mol1,mol2], useSVG=True)

In [None]:
# A single molecule can be written as several different SMILES
mol3 = Chem.MolFromSmiles('C1=CC=CN=C1')
mol4 = Chem.MolFromSmiles('c1cccnc1')
mol5 = Chem.MolFromSmiles('n1ccccc1')
Chem.Draw.MolsToGridImage([mol3, mol4, mol5])

In [None]:
# By default, RDKit returns the canonical SMILES
print(Chem.MolToSmiles(mol3))
print(Chem.MolToSmiles(mol4))
print(Chem.MolToSmiles(mol5))
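
Because the canonical SMILES is the same whatever the input spelling, it can serve as a simple duplicate check. A minimal sketch, reusing the three pyridine SMILES from the cell above:

```python
from rdkit import Chem

# Three spellings of the same molecule (pyridine)
inputs = ["C1=CC=CN=C1", "c1cccnc1", "n1ccccc1"]

# A set keeps only distinct canonical strings
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(smiles)) for smiles in inputs}

print(canonical)
print("All inputs describe the same molecule:", len(canonical) == 1)
```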

#### Stereochemistry

[Wikipedia](https://en.wikipedia.org/wiki/Stereochemistry)  
The famous image of different types of isomers:
![The famous image of different types of isomers](https://upload.wikimedia.org/wikipedia/commons/0/04/Isomerism.svg)

##### Cis–trans isomerism

Example: Geraniol and Nerol

https://en.wikipedia.org/wiki/Geraniol  
https://en.wikipedia.org/wiki/Nerol

https://pubs.acs.org/doi/10.1021/acs.jafc.6b04534

    Geraniol, or (2E)-3,7-dimethylocta-2,6-dien-1-ol, and its Z-isomer, nerol, are fragrant substances of great value.  
    The smell of geraniol has previously been described as sweet, fruity, and berry-like, whereas nerol has been reported as having floral, citrus-like, and sweet-smelling properties. 

Cis–trans configuration is encoded in SMILES with the `/` and `\` symbols:

In [None]:
mol6 = Chem.MolFromSmiles("O/C=C/Cl") # E or trans isomer
mol6

In [None]:
mol7 = Chem.MolFromSmiles(r'O/C=C\Cl') # Z or cis isomer (raw string, so the backslash is not treated as an escape)
mol7
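
Besides looking at the drawings, the configuration RDKit perceived can be read from the bond objects. This sketch assumes RDKit's default stereo perception and reuses the two SMILES above; `double_bond_stereo` is a hypothetical helper name.

```python
from rdkit import Chem


def double_bond_stereo(smiles):
    """List the stereo label of every double bond (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        bond.GetStereo()
        for bond in mol.GetBonds()
        if bond.GetBondType() == Chem.BondType.DOUBLE
    ]


# Raw strings keep Python from interpreting the backslash as an escape
stereo_e = double_bond_stereo(r"O/C=C/Cl")
stereo_z = double_bond_stereo(r"O/C=C\Cl")

print("O/C=C/Cl  ->", stereo_e)
print("O/C=C\\Cl ->", stereo_z)
```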

##### Chirality

https://en.wikipedia.org/wiki/Chirality

Example: https://en.wikipedia.org/wiki/Thalidomide

Chirality is encoded with the `@` or `@@` symbol:

In [None]:
mol8 = Chem.MolFromSmiles('Oc1ccc(cc1)/C=C/c1cc(O)cc2c1C(c1cc(O)cc(c1)O)C(O2)c1ccc(cc1)O')
mol8

In [None]:
mol9 = Chem.MolFromSmiles('Oc1ccc(cc1)/C=C/c1cc(O)cc2c1[C@@H](c1cc(O)cc(c1)O)[C@@H](O2)c1ccc(cc1)O')
mol9
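
To see what `@` and `@@` change, one can ask RDKit for the stereocenters it finds. This is a sketch on alanine, chosen here as a small illustrative molecule that is not part of the original examples. The R/S labels are printed rather than stated, and `includeUnassigned=True` also reports centers whose configuration was not specified.

```python
from rdkit import Chem

# Illustrative molecule (assumption): alanine, written three ways
variants = {
    "with @": "C[C@H](N)C(=O)O",
    "with @@": "C[C@@H](N)C(=O)O",
    "unspecified": "CC(N)C(=O)O",
}

centers = {}
for label, smiles in variants.items():
    mol = Chem.MolFromSmiles(smiles)
    # Each entry is (atom index, 'R' / 'S' / '?')
    centers[label] = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    print(label, centers[label])
```

The two specified variants get opposite labels, while the unspecified one is reported with `?`.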

Remark:

During the 2023-2024 session, some students found that the image above did not show the two stereo bonds convincingly.  
This is NOT a problem with the SMILES definition, but with the low resolution of the images rendered in the notebook.  
If you increase the resolution of the rendered image, the stereo bonds become more visible.  

Just execute the next cell; you should then see the stereochemistry of the two bonds we specified more clearly.

In [None]:
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw import IPythonConsole

# add atom indices and stereo annotation on the visual molecules
# IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.drawOptions.addStereoAnnotation = True
# molecule draw size
IPythonConsole.molSize = 500,500 

mol9
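
The same stereo annotation can be produced outside the notebook display machinery, for instance to save a figure. The sketch below uses RDKit's SVG drawer; the 500x500 canvas and the small test molecule are arbitrary choices for illustration.

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

# Small illustrative molecule with one stereo double bond
mol = Chem.MolFromSmiles(r"O/C=C\Cl")

# Draw to an SVG canvas of 500 x 500 pixels with stereo labels switched on
drawer = rdMolDraw2D.MolDraw2DSVG(500, 500)
drawer.drawOptions().addStereoAnnotation = True
rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol)
drawer.FinishDrawing()

svg_text = drawer.GetDrawingText()
print(len(svg_text), "characters of SVG")
```

The resulting `svg_text` string can be written to a `.svg` file and opened in any browser, where it scales without losing resolution.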

#### Other representations: InChI/InChIKey/SDF

In [None]:
# InChI
print(Chem.MolToInchi(mol1))

In [None]:
# InChIKey
print(Chem.MolToInchiKey(mol1))

In [None]:
# InChIKeys of two molecules that differ only in their stereochemistry

print(Chem.MolToInchiKey(mol8))
print(Chem.MolToInchiKey(mol9))
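
When comparing such keys, it helps to know that an InChIKey has three hyphen-separated blocks. The first 14 characters hash the connectivity, and stereochemistry is hashed into the second block. The sketch below uses the two alanine enantiomers, an illustrative choice not taken from the original, and assumes an RDKit build with InChI support.

```python
from rdkit import Chem

# Illustrative pair (assumption): the two enantiomers of alanine
key_a = Chem.MolToInchiKey(Chem.MolFromSmiles("C[C@H](N)C(=O)O"))
key_b = Chem.MolToInchiKey(Chem.MolFromSmiles("C[C@@H](N)C(=O)O"))

# Split off the connectivity block (first 14 characters)
skeleton_a, details_a = key_a.split("-", 1)
skeleton_b, details_b = key_b.split("-", 1)

print(key_a)
print(key_b)
print("Same connectivity block:", skeleton_a == skeleton_b)
print("Same full key:", key_a == key_b)
```

Comparing only the first block is a quick way to group stereoisomers of the same compound.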

In [None]:
# Neutral and charged molecules:
mol10 = Chem.MolFromSmiles('[H]N1CCCC1C(O)=O')
mol11 = Chem.MolFromSmiles('[H]N1CCCC1C([O-])=O')
mol12 = Chem.MolFromSmiles('[NH2+]1CCCC1C([O-])=O')
mol13 = Chem.MolFromSmiles('[NH2+]1CCCC1C(O)=O')

Chem.Draw.MolsToGridImage([mol10, mol11, mol12, mol13], molsPerRow=4)

Questions for biologists and chemists:
- What is the name of the molecule shown in the output cell above? (Proline/Pro)
- Why does it have 4 different charged/uncharged forms? ()
- What is the technical term for this? (protonation state)
- Which factor influences the form it takes? (pH)

In [None]:
for mol in [mol10, mol11, mol12, mol13]:
    print(Chem.MolToInchiKey(mol))

In [None]:
# MolBlock
print(Chem.MolToMolBlock(mol1))

For more details about the definition of MolBlock of an SDF file:

https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics/02%3A_Representing_Small_Molecules_on_Computers/2.05%3A_Structural_Data_Files
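
As a small illustration of the layout, the sketch below reads the counts line of a V2000 MolBlock and converts the block back into a molecule. Acetic acid is an arbitrary example not taken from the original, and the fixed-width slicing assumes the default V2000 output.

```python
from rdkit import Chem

# Illustrative molecule (assumption): acetic acid
mol = Chem.MolFromSmiles("CC(=O)O")
block = Chem.MolToMolBlock(mol)

# Lines 1-3 are the header; line 4 is the counts line.
# In V2000, the first two 3-character fields are atom and bond counts.
counts_line = block.splitlines()[3]
n_atoms = int(counts_line[0:3])
n_bonds = int(counts_line[3:6])
print("atoms:", n_atoms, "bonds:", n_bonds)

# Round trip: the MolBlock can be parsed back into a molecule
restored = Chem.MolFromMolBlock(block)
print(Chem.MolToSmiles(restored) == Chem.MolToSmiles(mol))
```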

### Generating 3D coordinates

In [None]:
m_3D = Chem.AddHs(mol1)
AllChem.EmbedMolecule(m_3D)
#AllChem.UFFOptimizeMolecule(m_3D) # Improves the quality of the conformation; this step should not be necessary since v2018.09: default conformations use ETKDG
Draw.MolsToGridImage([mol1,m_3D])

In [None]:
print(Chem.MolToMolBlock(m_3D))
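
Embedding can fail for difficult molecules, in which case `EmbedMolecule` returns `-1` instead of a conformer ID. The sketch below checks that return value and reads one atom position. Ethanol and the seed value 42 are arbitrary illustrative choices; fixing `randomSeed` only makes the result reproducible.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative molecule (assumption): ethanol with explicit hydrogens
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))

# Returns the new conformer ID, or -1 if no conformation was found
conf_id = AllChem.EmbedMolecule(mol, randomSeed=42)
if conf_id == -1:
    raise RuntimeError("3D embedding failed")

conformer = mol.GetConformer(conf_id)
position = conformer.GetAtomPosition(0)  # coordinates of atom 0
print("3D conformer:", conformer.Is3D())
print("Atom 0 at", position.x, position.y, position.z)
```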

### Writing molecules to *sdf* (structure data files)

In [None]:
w = Chem.SDWriter('./data/mytest_mol3D.sdf')
w.write(m_3D)
w.close()
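
Reading an SDF file back works with `Chem.SDMolSupplier`. The sketch below writes to a temporary directory so that it does not depend on a `./data` folder existing. The molecule name and the `source` property are made up for the example.

```python
import os
import tempfile

from rdkit import Chem

# Illustrative molecule (assumption) with a title and one data field
mol = Chem.MolFromSmiles("CCO")
mol.SetProp("_Name", "ethanol")   # written as the title line of the record
mol.SetProp("source", "example")  # written as an SDF data field

with tempfile.TemporaryDirectory() as tmp_dir:
    sdf_path = os.path.join(tmp_dir, "example.sdf")

    writer = Chem.SDWriter(sdf_path)
    writer.write(mol)
    writer.close()

    # The supplier yields None for records it cannot parse, so filter them out
    supplier = Chem.SDMolSupplier(sdf_path)
    read_back = [m for m in supplier if m is not None]
    del supplier  # release the file before the directory is removed

for m in read_back:
    print(m.GetProp("_Name"), m.GetProp("source"), Chem.MolToSmiles(m))
```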

### Get information on molecules

In [None]:
mol2

Number of heavy atoms (C, O, N, F, Cl, ...), i.e. all atoms except H

In [None]:
mol2.GetNumHeavyAtoms()

Number of bonds

In [None]:
mol2.GetNumBonds()
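
Beyond simple counts, RDKit offers ready-made descriptor functions, which give a first taste of the descriptors mentioned in the learning objectives. A minimal sketch on aspirin, whose formula C9H8O4 was given earlier; the selection of descriptors is an arbitrary illustration.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# A few basic properties collected in a dictionary
summary = {
    "formula": rdMolDescriptors.CalcMolFormula(aspirin),
    "heavy_atoms": aspirin.GetNumHeavyAtoms(),
    "bonds": aspirin.GetNumBonds(),  # bonds between heavy atoms only
    "rings": rdMolDescriptors.CalcNumRings(aspirin),
    "mol_wt": round(Descriptors.MolWt(aspirin), 2),  # average molecular weight
}

for name, value in summary.items():
    print(f"{name}: {value}")
```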

### Pandas Dataframe

`Pandas` is a fast, powerful, flexible and easy-to-use open source Python tool for data analysis and manipulation.
It allows you to:
 - Manipulate data tables with labels for variables (columns) and individuals (rows). These tables are called DataFrames, similar to data frames in R.
 - Read and write these DataFrames from or to a tabulated file.
 - Plot graphs from these DataFrames with `matplotlib` (or other similar packages).

Here we combine Pandas dataframes and RDKit to display molecules in tables.

In [None]:
import pandas as pd   # "as pd" creates an alias to simplify the call of pandas library
from rdkit.Chem.PandasTools import RenderImagesInAllDataFrames
RenderImagesInAllDataFrames(images=True)

Exercise for students:

In [None]:
# Get the SMILES of 3 drugs (Aspirin, Paracetamol and Ibuprofen) and convert them to RDKit molecules
aspirin = Chem.MolFromSmiles('CC(=O)OC1=CC=CC=C1C(=O)O')
paracetamol = Chem.MolFromSmiles('CC(=O)NC1=CC=C(C=C1)O')
ibuprofen = Chem.MolFromSmiles('CC(C)CC1=CC=C(C=C1)C(C)C(=O)O')
# Create a list named "Lmeds" containing the molecule objects of these three drugs
Lmeds = [aspirin, paracetamol, ibuprofen]

Create a dataframe with the information in above list:

In [None]:
df_meds = pd.DataFrame()
df_meds['ID'] = ['Aspirin', 'Paracetamol', 'Ibuprofen']
df_meds['Molecule'] = Lmeds
df_meds

In [None]:
# Add heavy atom number to each row of the dataframe
# Firstly, create an empty list called L_heavyatoms
L_heavyatoms = []

# Then iterate through the 'Molecule' column of 'df_meds', calculate the number of heavy atoms
# using RDKit molecule GetNumHeavyAtoms() method, then append the result
# to the list created previously
for mol in df_meds.Molecule:
    L_heavyatoms.append(mol.GetNumHeavyAtoms())

Add the new column to the dataframe:

In [None]:
df_meds['Heavy Atoms'] = L_heavyatoms 
df_meds

Another way to add a new column is to use "apply" directly on the dataframe :

In [None]:
from rdkit.Chem import Fragments

In [None]:
# Using apply avoids writing a for loop and storing the results in a list before adding the new column
df_meds['Carboxylic_acid'] = df_meds.Molecule.apply(Fragments.fr_COO)
df_meds
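
`apply` also accepts a `lambda`, which is handy when the value comes from a method of the molecule rather than from a standalone function. The sketch below rebuilds a small table from SMILES so that it runs on its own; the `SMILES` and `MolWt` column names are illustrative choices.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Stand-alone copy of the three drugs used above
df = pd.DataFrame({
    "ID": ["Aspirin", "Paracetamol", "Ibuprofen"],
    "SMILES": [
        "CC(=O)OC1=CC=CC=C1C(=O)O",
        "CC(=O)NC1=CC=C(C=C1)O",
        "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
    ],
})

# A function can be passed directly ...
df["Molecule"] = df["SMILES"].apply(Chem.MolFromSmiles)
df["MolWt"] = df["Molecule"].apply(Descriptors.MolWt).round(2)

# ... while a method call needs a small lambda wrapper
df["Heavy Atoms"] = df["Molecule"].apply(lambda mol: mol.GetNumHeavyAtoms())

print(df[["ID", "Heavy Atoms", "MolWt"]])
```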

# Quiz

In [None]:
from nbautoeval import run_yaml_quiz

In [None]:
run_yaml_quiz("../corrections/quiz/intro.yaml", "theoric-quiz_cheminf")

In [None]:
run_yaml_quiz("../corrections/quiz/intro.yaml", "code-quiz_cheminf")

In [None]:
from nbautoeval.storage import storage_clear

In [None]:
storage_clear("quiz-intro-02")
storage_clear("quiz-intro-04")