In [None]:
# apply Jupyter notebook style
from IPython.display import HTML

from custom.styles import style_string

HTML(style_string)

# Molecular Datasets using RDKit and Pandas

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* What are MOL and SDF files?
* How can they be used to create Mol objects in RDKit?
* How can I work with RDKit molecule objects in a pandas DataFrame?

Objectives:

* Read a MOL file.
* Create Mol objects from MOL and SDF files.
    
</div>


## Creating Mol Objects from Files

Mol objects can be created in a variety of ways.  We've seen Mol objects created from SMILES strings; now we'll learn some additional ways to create an RDKit Mol object.

Data for real, non-ideal systems, from experimentally determined structures to geometries optimized with QM methods, cannot be fully communicated via a SMILES string.  In these situations, more detailed specifications are often needed, such as an accurate 3D geometry with explicit hydrogens.  Part of the utility of RDKit lies in its ability to create Mol objects from a variety of input formats, so that we can take molecular data from other sources or research processes and generate additional properties and descriptors.

Different methods are called for depending on the format in which the molecule is specified, which in turn depends on what is being passed between researchers, applications, or modules.  Two common and versatile formats that we'll cover today are the MOL file and the related Structure-Data File (SDF).

In [None]:
from rdkit import Chem

### Mol from Mol File

The MOL file is a format first created internally at MDL and published by [Dalby et al. in 1992](https://pubs.acs.org/doi/10.1021/ci00007a012).  It was created specifically for cheminformatics applications and is the most straightforward way of passing detailed structural data and metadata into RDKit.  The format is column-based and RDKit adheres to it strictly, so a link to the original paper has been included above.

#### Previewing the File
Before loading the file with RDKit, let's use Python to take a look at it.

In [None]:
pro_file = "data/amino_acids/pro.mol"

with open(pro_file) as file:
    print(file.read())
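
To make the layout of the printed file easier to read, here is a minimal pure-Python sketch that pulls the atoms and bonds out of a V2000 MOL block by column position. The two-atom `mol_block` (methanol without hydrogens) and the helper `parse_mol_block` are hypothetical and written only for illustration; in practice RDKit does this parsing for you.

```python
# Hypothetical two-atom MOL block (methanol, heavy atoms only), for illustration.
mol_block = "\n".join([
    "methanol",                                  # line 1: molecule name
    "  hand-written example",                    # line 2: program / timestamp
    "",                                          # line 3: comment
    "  2  1  0  0  0  0  0  0  0  0999 V2000",   # line 4: counts line
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",                              # bond: atom 1 to atom 2, single
    "M  END",
])


def parse_mol_block(block):
    """Read atoms and bonds from a V2000 MOL block using fixed column positions."""
    lines = block.splitlines()
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    # Atom block: x, y, z in three 10-character columns, then the element symbol.
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))

    # Bond block: first atom index, second atom index, bond order (1-based indices).
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    return lines[0], atoms, bonds


name, atoms, bonds = parse_mol_block(mol_block)
print(name, atoms, bonds)
```

Because every value is located by its column, a single missing or extra space can shift a field and break parsing, which is why the format has to be followed so precisely.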

#### Loading the File

To load the file, we can use the `MolFromMolFile` function in RDKit. We add the argument `strictParsing=False` so that RDKit tolerates slight formatting problems in our MOL file.

In [None]:
pro = Chem.MolFromMolFile(pro_file, strictParsing=False)
pro
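
When the contents of a MOL file cannot be parsed, RDKit returns `None` instead of a Mol object, so it is worth checking the result before using it. The sketch below shows this with `Chem.MolFromMolBlock`, which parses MOL-formatted text held in a string. The valid block is generated from a SMILES string only so that the example does not depend on a file, and `broken_block` is a deliberately invalid, hypothetical input.

```python
from rdkit import Chem

# Build a valid MOL block in memory (methanol) so no file is needed.
good_block = Chem.MolToMolBlock(Chem.MolFromSmiles("CO"))
broken_block = "this is not a MOL block"  # deliberately invalid input

results = {}
for label, block in [("good", good_block), ("broken", broken_block)]:
    mol = Chem.MolFromMolBlock(block, strictParsing=False)
    results[label] = mol
    if mol is None:
        # RDKit logs a parse error and hands back None.
        print(f"{label}: could not be parsed")
    else:
        print(f"{label}: parsed a molecule with {mol.GetNumAtoms()} atoms")
```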

<div class="exercise admonition">
<p class="admonition-title">Check Your Understanding</p>
<p> Load and visualize the histidine molecule from the mol file provided in `data/amino_acids/his.mol`. </p>
    <p><strong>Challenge</strong> - Print the stereochemistry of each molecule. Then, use a substructure search to highlight the amine group on the two amino acids. </p>
</div>

## Mol from SDF

An SDF file is a collection of **MOL file** style blocks, each of which may be followed by named data fields, with every record terminated by a line composed of 4 '$' characters.  It's a very convenient way of passing sets of molecules that are related in some way, such as by similarity or research project, between users or machines.  Since each molecule's specification is already in MOL format, creating Mol objects is very straightforward: each record in the file can be read into its own Mol object.
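
The record layout can be seen in a minimal pure-Python sketch that splits SDF text on the `$$$$` lines. The two single-atom records and the helpers `make_record` and `read_sdf_records` are hypothetical and only illustrate the structure of the file; they are not how RDKit reads SDF files, and the sketch only handles single-line data fields.

```python
def make_record(name, symbol):
    """Build one single-atom SDF record: MOL block, one data field, then $$$$."""
    return "\n".join([
        name,                                    # title line of the MOL block
        "  hand-written example",
        "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        f"    0.0000    0.0000    0.0000 {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END",                                # end of the MOL block
        "> <NAME>",                              # data field header
        name,                                    # data field value
        "",                                      # blank line closes the field
        "$$$$",                                  # end of the record
    ])


sdf_text = "\n".join([make_record("water", "O"), make_record("ammonia", "N")]) + "\n"


def read_sdf_records(text):
    """Split SDF text into dicts holding the MOL block and the data fields."""
    records, lines = [], []
    for line in text.splitlines():
        if line.strip() != "$$$$":
            lines.append(line)
            continue
        end = lines.index("M  END")              # last line of the MOL block
        fields, key = {}, None
        for data_line in lines[end + 1:]:
            if data_line.startswith(">"):        # header such as "> <NAME>"
                key = data_line.split("<", 1)[1].split(">", 1)[0]
            elif data_line.strip() and key is not None:
                fields[key] = data_line.strip()  # single-line values only
        records.append({"mol_block": "\n".join(lines[:end + 1]), "fields": fields})
        lines = []
    return records


records = read_sdf_records(sdf_text)
print([record["fields"]["NAME"] for record in records])
```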

Although there are a few ways to work with SDF files in RDKit, we will highlight just one in this lesson. 
We will use the RDKit module called `PandasTools` to load all of the molecules from an SDF file into a pandas dataframe.
If you have not worked with pandas before or if you need a refresher, see notebook `03_python_data_science.ipynb`.

`PandasTools` is a module in `rdkit.Chem` that allows you to easily work with RDKit molecules and pandas dataframes.
To use it, we must first import it. 
Next, we tell RDKit that we want to see our molecules as images in the dataframe by setting `PandasTools.RenderImagesInAllDataFrames(True)`. 
This setting only needs to be applied once per notebook.

The `PandasTools.LoadSDF` function loads molecules from an SDF file into a dataframe. 

In [None]:
from rdkit.Chem import PandasTools

PandasTools.RenderImagesInAllDataFrames(True)

df = PandasTools.LoadSDF("data/amino_acids/amino_acids-nat20.sdf", strictParsing=False)

In [None]:
# View the first 3 rows
df.head(3)

## Options for loading SDF files

The SDF loader in RDKit supports automatically adding the SMILES string to the dataframe when the data is loaded.
We can reload our SDF and get the isomeric SMILES (which include stereochemistry information) at load time by adding the options
`smilesName='SMILES', isomericSmiles=True` to the function call after the `strictParsing=False` argument. We can also add a molecular 
fingerprint to each molecule using the argument `includeFingerprints=True`. This will store a fingerprint on each molecule object
and make substructure searches faster.

You can see the [documentation for the LoadSDF function](http://rdkit.org/docs/source/rdkit.Chem.PandasTools.html#rdkit.Chem.PandasTools.LoadSDF) for more information.

<div class="exercise admonition">
<p class="admonition-title">Check Your Understanding</p>

    Add the arguments `smilesName='SMILES', isomericSmiles=True` and `includeFingerprints=True` to your `LoadSDF` function call. Save your result in the `df`
    variable and view the first three rows. How is the dataframe different from the dataframe loaded without these arguments?
</div>


<div class="attention admonition"> 
<p class="admonition-title">Another way to load SDF Files</p>
<p> You might also see RDKit molecule objects loaded from an SDF file using `Chem.SDMolSupplier`. The syntax is:</p>


```python
mol_list = Chem.SDMolSupplier('path/to/file.sdf')
```


<p>This will create an RDKit `SDMolSupplier` object, which can be iterated over one molecule at a time. If we would like an actual list of RDKit `Mol` objects, we can convert the supplier with `list()`.</p>
</div>
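
As a minimal sketch of the supplier approach, the example below writes a small throwaway SDF with `Chem.SDWriter` and reads it back with `Chem.SDMolSupplier`. The file name `example.sdf` and the two molecules are hypothetical and are used only so that the example does not depend on the lesson data.

```python
import os
import tempfile

from rdkit import Chem

# Write a small throwaway SDF (hypothetical file, not part of the lesson data).
sdf_path = os.path.join(tempfile.mkdtemp(), "example.sdf")
writer = Chem.SDWriter(sdf_path)
for name, smiles in [("glycine", "NCC(=O)O"), ("alanine", "CC(N)C(=O)O")]:
    mol = Chem.MolFromSmiles(smiles)
    mol.SetProp("_Name", name)  # "_Name" is written as the title line of the record
    writer.write(mol)
writer.close()

# Iterating over the supplier yields one Mol per record. Records that fail to
# parse come back as None, so filter them out while building the list.
supplier = Chem.SDMolSupplier(sdf_path)
mols = [mol for mol in supplier if mol is not None]
names = [mol.GetProp("_Name") for mol in mols]
print(names)
```

Filtering out `None` is a common precaution with larger SDF files, where a single malformed record would otherwise cause errors later in the analysis.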




### Viewing multiple molecules

Multiple molecules can be placed in the same image as a grid.  To do this, specify the set of molecules to visualize, the number of sub-images per row (`molsPerRow`), and, optionally, the size of each sub-image (`subImgSize`).  In the example below, the molecule names from the `ID` column of the dataframe are used as legends. This allows us to view all of the molecules in our dataframe at once.

In [None]:
# Visualize the molecules
from rdkit.Chem import Draw

mol_list = df["ROMol"].to_list()
mol_names = df["ID"].to_list()

Draw.MolsToGridImage(mol_list, molsPerRow=5, legends=mol_names)

### Substructure Searches using pandas

When you have a molecule column in a pandas dataframe, you can use `>=` to do substructure searches. 
The syntax for this is

```python
df[MOL_COLUMN] >= substructure
```

This will return a pandas Series of `True` or `False` values, one per row, indicating whether the substructure is present in each molecule.


In [None]:
phenyl = Chem.MolFromSmarts("c1ccccc1") # first create the mol from the smarts string
phenyl

In [None]:
df["ROMol"] >= phenyl

You can use this as a filter to show the molecules that have this substructure.

In [None]:
df[df["ROMol"] >= phenyl]
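
The `>=` comparison is a shortcut that `PandasTools` sets up for molecule columns. A rough equivalent, sketched below under the assumption that every entry in the column is a valid Mol object, calls `HasSubstructMatch` on each molecule with `apply`. The three-row `df_demo` dataframe is hypothetical and built from SMILES strings (without stereochemistry) so that the example runs without the lesson data.

```python
import pandas as pd
from rdkit import Chem

# Hypothetical three-row dataframe standing in for the amino acid data.
df_demo = pd.DataFrame({
    "ID": ["phenylalanine", "alanine", "tyrosine"],
    "SMILES": ["NC(Cc1ccccc1)C(=O)O", "CC(N)C(=O)O", "NC(Cc1ccc(O)cc1)C(=O)O"],
})
df_demo["ROMol"] = [Chem.MolFromSmiles(smiles) for smiles in df_demo["SMILES"]]

phenyl = Chem.MolFromSmarts("c1ccccc1")

# One True/False value per row: does the molecule contain the pattern?
has_phenyl = df_demo["ROMol"].apply(lambda mol: mol.HasSubstructMatch(phenyl))

# Use the boolean Series as a filter, just like the `>=` version.
matches = df_demo.loc[has_phenyl, "ID"].tolist()
print(matches)
```

Writing the search out this way makes it easy to swap in other checks, for example counting matches with `GetSubstructMatches` instead of only testing for their presence.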

<div class="exercise admonition"> 
<p class="admonition-title">Check Your Understanding</p>
<p>Do a substructure search for all amino acids that contain a methyl group (SMARTS `[CH3]`).</p>

</div>

## Building and Saving a Dataset using pandas

We can add values for some molecular descriptors to our dataframe using the `apply` method in pandas.
The `apply` method applies a function to every value in a column and returns the results. 
We can then save this as a new column in the dataframe by setting `df[new_column_name]` equal to what is returned.

In [None]:
from rdkit.Chem import Descriptors

df["MolWt"] = df["ROMol"].apply(Descriptors.MolWt) # Saves a new column called "MolWt"
df["TPSA"] = df["ROMol"].apply(Descriptors.TPSA) # Saves a new column called TPSA
df["NumHeavyAtoms"] = df["ROMol"].apply(Descriptors.HeavyAtomCount)

In [None]:
df.head()

If we want to save our dataframe so we can load it later, we need a column that will allow us to reconstruct the RDKit Mol
objects. A SMILES column serves this purpose; it is present if the SDF was loaded with the `smilesName='SMILES'` option described above. 
Since Mol objects cannot be stored meaningfully in a text file, you can discard the `ROMol` column before saving your dataframe as a CSV. 

In [None]:
# Drop the Mol object column: a CSV is a text file, so the Mol objects
# themselves would not survive being written to it.

df_noMol = df.drop("ROMol", axis=1) # make a new dataframe without the mol column
df_noMol.to_csv("data/amino_acids.csv", index=False) # save the dataframe to a file.
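
To get Mol objects back after reloading a saved CSV, the SMILES column can be converted again. The following is a minimal sketch, assuming the saved file has a `SMILES` column; the two-row `csv_text` is hypothetical and stands in for the contents of the saved file so that the example runs on its own.

```python
import io

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical CSV contents standing in for a previously saved dataset.
csv_text = "ID,SMILES\nglycine,NCC(=O)O\nalanine,CC(N)C(=O)O\n"

# Read the text back into a dataframe (use a file path instead of StringIO
# when loading a real file).
df_loaded = pd.read_csv(io.StringIO(csv_text))

# Rebuild the Mol objects from the SMILES strings.
df_loaded["ROMol"] = df_loaded["SMILES"].apply(Chem.MolFromSmiles)

# Descriptors can then be computed exactly as before.
df_loaded["NumHeavyAtoms"] = df_loaded["ROMol"].apply(Descriptors.HeavyAtomCount)
print(df_loaded[["ID", "NumHeavyAtoms"]])
```

Note that Mol objects rebuilt from SMILES carry connectivity and stereochemistry but not the 3D coordinates from the original SDF, so keep the SDF itself if the geometry matters.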

<div class="exercise admonition"> 
<p class="admonition-title">Exercise</p>
<p>You have another SDF in your data folder called 'amino_acids-extended.sdf'. Load this SDF into a dataframe, perform substructure searches, and create a dataset of molecular descriptors.
    </p>

</div>