In [None]:
# apply Jupyter notebook style
from IPython.core.display import HTML

from custom.styles import style_string

HTML(style_string)


# Molecules from files using RDKit

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* What are MOL and SDF files?
* How can they be used to create Mol objects in RDKit?
* How can I work with RDKit molecule objects in a pandas DataFrame?

Objectives:

* Read a MOL file.
* Create Mol objects from MOL and SDF files.
    
</div>


## Creating Mol Objects from Files

Mol objects can be created in several ways.  We have already seen Mol objects created from SMILES strings; now we'll learn some additional ways to create them.

Data for real, non-ideal systems, from experimentally determined structures to geometries optimized with QM methods, cannot be fully communicated by a SMILES string.  In these situations a more detailed specification is often needed, such as an accurate 3D geometry with explicit hydrogens.  Part of the utility of RDKit lies in its ability to create Mol objects from a variety of input formats, so that we can take molecular data from other sources or research processes and generate additional properties and descriptors.

The method to use depends on the format of the molecule's specification, which in turn depends on what is being passed around between researchers, applications, or modules.  Two common and versatile formats that we'll cover today are the MOL file and the related structure-data file (SDF).
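
As a small illustration of why a SMILES string alone is not enough, the sketch below builds proline from a SMILES string and checks what the resulting Mol carries. The SMILES string is written for this example and is not taken from the lesson's data files.

```python
from rdkit import Chem

# Proline written as a SMILES string (chosen for illustration).
mol = Chem.MolFromSmiles("OC(=O)C1CCCN1")

# A Mol built from SMILES knows its heavy atoms and bonds ...
num_atoms = mol.GetNumAtoms()

# ... but it stores no coordinates (no conformers), and hydrogens are implicit.
num_conformers = mol.GetNumConformers()
num_explicit_h = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 1)

print(num_atoms, num_conformers, num_explicit_h)
```

A file format that stores coordinates and explicit hydrogens fills exactly these gaps.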

In [None]:
from rdkit import Chem

### Mol from Mol File

The MOL file is a format first created internally by MDL and published by [Dalby et al. in 1992](https://pubs.acs.org/doi/10.1021/ci00007a012).  It was created specifically for cheminformatics applications and is the most straightforward way of passing detailed structural data and metadata into RDKit.  The format is very precise and RDKit adheres to it strictly, so a link to the original paper is included above.
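
To make the layout concrete, the sketch below generates a MOL block in memory with `Chem.MolToMolBlock` and picks apart its header. The ethanol molecule and its name are illustrative choices, and the block RDKit writes here uses the V2000 flavor of the format.

```python
from rdkit import Chem

# Ethanol, built from a SMILES string chosen for illustration.
mol = Chem.MolFromSmiles("CCO")
mol.SetProp("_Name", "ethanol")  # becomes the first header line

mol_block = Chem.MolToMolBlock(mol)
lines = mol_block.splitlines()

title_line = lines[0]   # line 1: molecule name
counts_line = lines[3]  # line 4: atom count, bond count, ..., format version

# In the V2000 format the counts are fixed-width, three characters each.
num_atoms = int(counts_line[0:3])
num_bonds = int(counts_line[3:6])

print(mol_block)
print(title_line, num_atoms, num_bonds)
```

Because the fields are read by column position, a stray space in a hand-edited file can be enough to make parsing fail.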

#### Previewing the File
Before loading the file with RDKit, let's use Python to take a look at it.

In [None]:
pro_file = "data/amino_acids/pro.mol"

with open(pro_file) as file:
    print(file.read())

#### Loading the File

To load the file, we can use the `MolFromMolFile` function in RDKit. We add the argument `strictParsing=False` so that RDKit tolerates slight formatting problems in our MOL file.

In [None]:
pro = Chem.MolFromMolFile(pro_file, strictParsing=False)
pro
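
When RDKit cannot parse the contents of a file, `MolFromMolFile` returns `None` instead of a Mol, so it is worth checking the result before using it. The sketch below also shows the `removeHs` argument, which controls whether hydrogens written in the file are kept. The molecule and the temporary file are assumptions made for this example and are not part of the lesson data.

```python
import os
import tempfile

from rdkit import Chem

# Build a small molecule with explicit hydrogens and save it as a MOL file.
ethanol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
path = os.path.join(tempfile.mkdtemp(), "ethanol.mol")
Chem.MolToMolFile(ethanol, path)

# By default, hydrogens are removed on reading.
mol_default = Chem.MolFromMolFile(path)

# removeHs=False keeps the hydrogens that are written in the file.
mol_with_h = Chem.MolFromMolFile(path, removeHs=False)

for mol in (mol_default, mol_with_h):
    if mol is None:
        print("RDKit could not parse", path)
    else:
        print(mol.GetNumAtoms(), "atoms")
```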

### Mol from SDF

An SDF file is a collection of **MOL file** style blocks, each optionally followed by data fields and terminated by a line composed of four '$' characters (`$$$$`).  It's a very convenient way of passing sets of molecules that are related in some way, such as by similarity or research project, between users or machines.  Since each molecule's specification is already in MOL format, creating Mol objects is very straightforward: each record in the file becomes one Mol object.
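
One way to see this structure is to write a tiny SDF file and print it. The sketch below uses `Chem.SDWriter` with two illustrative molecules; the names and the `Formula` data field are assumptions made for this example.

```python
import os
import tempfile

from rdkit import Chem

# Two illustrative molecules: (name, SMILES, formula).
records = [("ethanol", "CCO", "C2H6O"), ("benzene", "c1ccccc1", "C6H6")]

path = os.path.join(tempfile.mkdtemp(), "example.sdf")
writer = Chem.SDWriter(path)
for name, smiles, formula in records:
    mol = Chem.MolFromSmiles(smiles)
    mol.SetProp("_Name", name)       # written as the first line of the MOL block
    mol.SetProp("Formula", formula)  # written as a data field after "M  END"
    writer.write(mol)
writer.close()

with open(path) as handle:
    text = handle.read()

print(text)

# Every record ends with a "$$$$" line.
num_records = sum(1 for line in text.splitlines() if line.strip() == "$$$$")
print(num_records)
```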

Although there are a few ways to work with SDF files in RDKit, we will highlight just one in this lesson. 
We will use the RDKit module called `PandasTools` to load all of the molecules from an SDF file into a pandas dataframe.

`PandasTools` is a module in `rdkit.Chem`, and we must first import it. 
Next, we tell RDKit that we want to see our molecules as images by calling `PandasTools.RenderImagesInAllDataFrames(True)`.

In [None]:
from rdkit.Chem import PandasTools

PandasTools.RenderImagesInAllDataFrames(True)

df = PandasTools.LoadSDF("data/amino_acids/amino_acids-nat20.sdf", strictParsing=False, includeFingerprints=True)

In [None]:
# View the first 5 rows
df.head()


<div class="attention admonition"> 
<p class="admonition-title">Another way to load SDF Files</p>
<p> You might also see RDKit molecule objects loaded using the `SDMolSupplier` class. The syntax to read an SDF file this way is:</p>


```python
supplier = Chem.SDMolSupplier('path/to/file.sdf')
```


<p>This will create an RDKit `SDMolSupplier` object, which yields one `Mol` object per record in the file. If we would like a list of RDKit `Mol` objects, we can pass the supplier to `list()`.</p>
</div>
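
A minimal sketch of the `SDMolSupplier` pattern is shown below. It writes its own small SDF file so that it can run on its own; the molecules, names, and temporary file are assumptions made for this example. Records that RDKit cannot parse are returned as `None`, so the sketch filters them out while building the list.

```python
import os
import tempfile

from rdkit import Chem

# Write a small illustrative SDF file to read back.
path = os.path.join(tempfile.mkdtemp(), "example.sdf")
writer = Chem.SDWriter(path)
for name, smiles in [("ethanol", "CCO"), ("benzene", "c1ccccc1")]:
    mol = Chem.MolFromSmiles(smiles)
    mol.SetProp("_Name", name)
    writer.write(mol)
writer.close()

supplier = Chem.SDMolSupplier(path)

# Keep only the records that were parsed successfully.
mols = [mol for mol in supplier if mol is not None]
names = [mol.GetProp("_Name") for mol in mols]
print(names)
```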




### Viewing multiple molecules

Multiple molecules can be placed in the same image in a grid.  To do this, specify the set of molecules to visualize, the number of sub-images per row, and, optionally, the size of the sub-images.  In the example below, the molecule names from the `ID` column of the dataframe are used as legends.

In [None]:
# Visualize the molecules
from rdkit.Chem import Draw

mol_list = df["ROMol"].to_list()
mol_names = df["ID"].to_list()

Draw.MolsToGridImage(mol_list, molsPerRow=5, legends=mol_names)
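
The size of each sub-image can be set with the `subImgSize` argument. The sketch below uses a few illustrative molecules in place of the dataframe columns; the molecules, names, and pixel sizes are assumptions chosen for this example.

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Illustrative molecules and names, standing in for the dataframe columns.
names = ["ethanol", "benzene", "acetic acid"]
mols = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]

# subImgSize is the (width, height) of each sub-image in pixels.
img = Draw.MolsToGridImage(
    mols,
    molsPerRow=2,
    subImgSize=(250, 200),
    legends=names,
)
img
```

With three molecules and two per row, the grid has two rows and the last cell is left empty.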

## Building a Dataset

We can add values for some molecular descriptors to our dataframe using the `apply` method in pandas.

In [None]:
from rdkit.Chem import Descriptors

df["MolWt"] = df["ROMol"].apply(Descriptors.MolWt)
df["TPSA"] = df["ROMol"].apply(Descriptors.TPSA)

In [None]:
df.head()
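
The same `apply` pattern works with any function that takes a Mol and returns a value. The sketch below repeats the idea on a small stand-in dataframe so that it can run on its own; the molecules and the `HeavyAtoms` column are assumptions made for this example.

```python
import pandas as pd

from rdkit import Chem
from rdkit.Chem import Descriptors

# A small illustrative dataframe, standing in for the one loaded from the SDF file.
data = pd.DataFrame({"ID": ["ethanol", "benzene"], "SMILES": ["CCO", "c1ccccc1"]})
data["ROMol"] = data["SMILES"].apply(Chem.MolFromSmiles)

# Built-in descriptor functions can be passed to apply directly.
data["MolWt"] = data["ROMol"].apply(Descriptors.MolWt)
data["TPSA"] = data["ROMol"].apply(Descriptors.TPSA)

# Any function that takes a Mol works too, including a lambda.
data["HeavyAtoms"] = data["ROMol"].apply(lambda mol: mol.GetNumHeavyAtoms())

print(data[["ID", "MolWt", "TPSA", "HeavyAtoms"]])
```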