In [None]:
# apply Jupyter notebook style
from IPython.core.display import HTML

from custom.styles import style_string

HTML(style_string)


# Molecules from file using RDKit

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* What are MOL and SDF files?
* How can they be used to create Mol objects in RDKit?
* What is sanitization, and how is it modified?

Objectives:

* Read a MOL file.
* Create Mol objects from MOL and SDF files.
* Carry out a partial sanitization of a Mol object to preserve molecular features.

</div>


<div class="attention admonition"> 
<p class="admonition-title">Note</p>

<p>Note about RDKit objects: RDKit syntax can be unusual. Although RDKit has a Python interface, its objects and core functionality are written in C++. Because of the way Python and C++ are bound together, objects returned by an RDKit function must sometimes go through an additional conversion step if we want the data available as a Python list.</p>

</div>



## Creating Mol Objects from Files

Mol object instances can be created in a variety of ways.  We've seen some examples of Mol objects being created from SMILES strings; now we'll learn some additional ways a Mol can be created.

A SMILES string cannot carry the data for a non-ideal system, whether it is a real system studied experimentally or an optimized geometry from a QM method.  In these situations, more detailed specifications are often needed, such as an accurate 3D geometry with explicit hydrogens.  Part of the utility of RDKit lies in its ability to create Mol objects from a variety of input formats, so that we can take molecular data from other sources or research processes and generate additional properties and descriptors.

The right method depends on the format of the molecule's specification, which in turn depends on what is being passed between researchers, applications, or modules.  Two common and versatile formats that we'll cover today are the MOL file and the related structure-data file (SDF).

In [None]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

### Mol from Mol File

The MOL file is a format first created by MDL internally and published by [Dalby et al. in 1992](https://pubs.acs.org/doi/10.1021/ci00007a012).  It was created specifically for cheminformatics applications and is the most straightforward way of passing detailed structural data and metadata into RDKit.  The formatting is very precise and adhered to strictly by RDKit, so a link to the original paper has been included above.

#### Previewing the File
Before loading the file with RDKit, let's use Python to take a look at it.

In [None]:
pro_file = "data/amino_acids/pro.mol"

with open(pro_file) as file:
    print(file.read())

#### Loading the File

To load the file, we can use the `MolFromMolFile` function in RDKit. We add the argument `strictParsing=False` in case there are any slight problems with our MOL file.

In [None]:
pro = Chem.MolFromMolFile(pro_file, strictParsing=False)
pro

Some information from the MOL file, such as the molecule name and references, is stored in the RDKit molecule object.
For example, the first 3 lines in the MOL file are the name followed by any reference information.  The Mol object RDKit creates will store this data under the **private properties** "_Name" and "_MolFileInfo" (private property names begin with an underscore).  All of the built-in private properties can be accessed with the syntax:

```python
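mol = Chem.MolFromMolFile('path/to/file.mol')    # create an instance of a Mol object
rdk_properties = mol.GetPropNames(includePrivate=True)    # returned as a wrapped C++ container
rdk_properties_list = list(rdk_properties)    # convert to a Python list
print(rdk_properties_list)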
```

The remaining lines until 'M  END' are the structural data for the molecule, beginning with the counts line followed by the coordinates and then the connectivity.  There can be additional metadata after the connectivity, but that is beyond the scope of this lesson.
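To make that layout concrete, here is a minimal sketch in plain Python that reads the header, the counts line, the atom block, and the bond block of a V2000 MOL block by fixed column positions. The MOL block is hand-written for illustration (the coordinates are made up), and `parse_mol_block` is a hypothetical helper, not part of RDKit. In practice RDKit does this parsing for us.

```python
# Hand-written V2000 MOL block for illustration only; the coordinates are made up.
mol_block = "\n".join([
    "ethanol (heavy atoms only)",               # line 1: molecule name
    "  hand-written example",                   # line 2: program / reference info
    "",                                         # line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",  # counts line: 3 atoms, 2 bonds
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2000    1.2000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",                             # bond: atom 1, atom 2, bond order 1
    "  2  3  1  0",
    "M  END",
])


def parse_mol_block(block):
    """Hypothetical helper: return (name, atoms, bonds) from a V2000 MOL block."""
    lines = block.splitlines()
    name = lines[0].strip()
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three 10-character fields; the element symbol follows.
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((line[31:34].strip(), x, y, z))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Two 1-based atom indices and a bond type, 3 characters each.
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return name, atoms, bonds


name, atoms, bonds = parse_mol_block(mol_block)
print(name)
print(atoms)
print(bonds)
```

Because the fields are located by column rather than by splitting on whitespace, a misaligned line changes what is read, which is why the formatting has to be exact.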

### Mol from SDF

An SDF file is a collection of **MOL file** style string blocks separated by a blank line and a line composed of 4 '$' characters.  It's a very convenient way of passing sets of related molecules, grouped for example by similarity or by research project, between users or machines.  Since each molecule's specification is already in MOL format, creating Mol objects is very straightforward: RDKit reads the file through a supplier object that provides one Mol per record.
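As a rough illustration of that record structure, the sketch below splits SDF text into its individual blocks using only plain Python. The two records are empty placeholder molecules written by hand, and `split_sdf_records` is a hypothetical helper, not an RDKit function.

```python
def split_sdf_records(sdf_text):
    """Hypothetical helper: split SDF text into records at the '$$$$' lines."""
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


# Two hand-written placeholder records (zero atoms, zero bonds), for illustration only.
sdf_text = "\n".join([
    "mol_A", "", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END", "", "$$$$",
    "mol_B", "", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END", "", "$$$$",
])

records = split_sdf_records(sdf_text)
record_names = [record.splitlines()[0] for record in records]
print(len(records), record_names)
```

The first line of each record is the molecule name, just as in a standalone MOL file.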

The syntax to read the Mol objects from an SDF file is:

```python
supplier = Chem.SDMolSupplier('path/to/file.sdf')
```

This will create an RDKit `SDMolSupplier` object, which provides the molecules one at a time. If we would like a list of RDKit `Mol` objects, we can cast the variable as a list with `list(supplier)`.
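The following is a small, self-contained sketch of that conversion. It first writes a throwaway SDF with `Chem.SDWriter` so that it does not depend on the lesson's data files; the file name and the two molecules are arbitrary choices made for this example. A supplier returns `None` for any record it cannot parse, so the list comprehension filters those out.

```python
import os
import tempfile

from rdkit import Chem

# Write a small throwaway SDF so the example does not need any data files.
tmp_dir = tempfile.mkdtemp()
sdf_path = os.path.join(tmp_dir, "example.sdf")  # arbitrary file name for this sketch
writer = Chem.SDWriter(sdf_path)
for mol_name, smiles in [("glycine", "NCC(=O)O"), ("alanine", "CC(N)C(=O)O")]:
    mol = Chem.MolFromSmiles(smiles)
    mol.SetProp("_Name", mol_name)  # written as the first line of the record
    writer.write(mol)
writer.close()

# Read the file back and convert the supplier to a plain Python list.
supplier = Chem.SDMolSupplier(sdf_path)
mols = [mol for mol in supplier if mol is not None]  # skip records that failed to parse
names = [mol.GetProp("_Name") for mol in mols]
print(len(mols), names)
```

Holding the molecules in a list is also useful when they will be modified later, since the list keeps the same Mol objects around between loops.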

In [None]:
nat20_sdf = Chem.SDMolSupplier(
    "data/amino_acids/amino_acids-nat20.sdf", 
    sanitize=False, 
    removeHs=False, 
    strictParsing=False
)

# Loop over the SDF supplier, which instantiates each Mol object, and collect the molecule names
mol_list = []
for mol in nat20_sdf:
    mol_name = mol.GetProp('_Name')
    mol_list.append(mol_name)

### Viewing multiple molecules

Multiple molecules can be placed in the same image in an array.  To do this, specify the set of molecules to visualize, the number of sub-images per row, and the size of the sub-images.  In the example below, the list of molecule names generated while instantiating the Mol objects is used as a legend.

In [None]:
# Visualize the molecules
Draw.MolsToGridImage(nat20_sdf, molsPerRow=5, subImgSize=(200, 200), legends=mol_list)

## Molecule Sanitization

Sanitization in RDKit is a process in which RDKit uses the molecular connectivity graph and element list to assign certain molecular features in a standardized way.  These steps include finding radicals, checking atom valences, and determining aromaticity, conjugation, and hybridization.  This process can also be used to find common problems in a hypothetical molecule, such as a carbon atom with too many bonds (a hypervalent carbon).

When working with SMILES strings or MOL files, sanitization can largely be ignored and left to run with its default parameters.  However, there are a few cases in which one may wish to control the sanitization steps more explicitly.  For example, in the amino acid files used in this lesson, full sanitization of aromatic heterocycles such as the imidazole ring in histidine (His) or the indole in tryptophan (Trp) will cause RDKit to miss the aromaticity of these rings.

For a list of sanitization flags, see the [rdmolops module](https://www.rdkit.org/docs/source/rdkit.Chem.rdmolops.html) in the RDKit documentation.

To control the sanitization, the chosen flags are combined with the bitwise OR operator (`|`) and stored in a variable, which is then passed along with the molecule to the `SanitizeMol` function through the `sanitizeOps` parameter.  The syntax is as follows:
    
    s_flags = Chem.rdmolops.SanitizeFlags.[FLAG1]|Chem.rdmolops.SanitizeFlags.[FLAG2]|...
    Chem.SanitizeMol(mol, sanitizeOps=s_flags, ...)
    
When we created the amino acid molecules from the SDF file, sanitization was turned off.  Below is an example of a limited sanitization that preserves the aromaticity of the heterocycles.

In [None]:
# Set flags for a limited sanitization (full sanitization results in bad aromaticity detection).
flags = Chem.rdmolops.SanitizeFlags
s_flags = (
    flags.SANITIZE_CLEANUP | flags.SANITIZE_FINDRADICALS | flags.SANITIZE_CLEANUPCHIRALITY
    | flags.SANITIZE_PROPERTIES | flags.SANITIZE_ADJUSTHS | flags.SANITIZE_SETHYBRIDIZATION
)

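# The supplier builds new Mol objects each time it is looped over, so store
# the molecules in a list to keep the sanitized versions.
nat20_mols = list(nat20_sdf)
for mol in nat20_mols:
    Chem.SanitizeMol(mol, sanitizeOps=s_flags)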
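
To see what leaving a step out of the flag set does, the sketch below applies the same limited flag set to a small test molecule and compares it with a full sanitization. The molecule (benzene written as a Kekulé SMILES string) is chosen only for this illustration and is not part of the lesson data. Because `SANITIZE_SETAROMATICITY` is not among the flags, RDKit does not run its own aromaticity perception on the partially sanitized copy.

```python
from rdkit import Chem

flags = Chem.rdmolops.SanitizeFlags

# The same limited flag set as above, combined with the bitwise OR operator.
s_flags = (
    flags.SANITIZE_CLEANUP | flags.SANITIZE_FINDRADICALS | flags.SANITIZE_CLEANUPCHIRALITY
    | flags.SANITIZE_PROPERTIES | flags.SANITIZE_ADJUSTHS | flags.SANITIZE_SETHYBRIDIZATION
)

# Benzene as a Kekulé SMILES string, parsed without any sanitization.
raw = Chem.MolFromSmiles("C1=CC=CC=C1", sanitize=False)

# Copy 1: limited sanitization (no aromaticity perception step).
partial = Chem.Mol(raw)
Chem.SanitizeMol(partial, sanitizeOps=s_flags)

# Copy 2: full sanitization with the default flags.
full = Chem.Mol(raw)
Chem.SanitizeMol(full)

partial_aromatic = any(atom.GetIsAromatic() for atom in partial.GetAtoms())
full_aromatic = all(atom.GetIsAromatic() for atom in full.GetAtoms())
print("aromatic atoms after limited sanitization:", partial_aromatic)
print("all atoms aromatic after full sanitization:", full_aromatic)
```

The partially sanitized copy keeps the alternating single and double bonds it was read with, while the fully sanitized copy has its ring atoms flagged as aromatic. The same principle is what lets a limited sanitization leave the aromaticity information of a molecule untouched.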