In [1]:
# Apply the custom Jupyter notebook style.
# HTML is imported from IPython.display, the public display API.
from IPython.display import HTML

from custom.styles import style_string

HTML(style_string)

<div style="text-align:center;">
  <img src="custom/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>


# Digital Representation of Molecules

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* How are molecules represented on computers?

* What is a SMILES string?
    
* What are common molecular file formats?

Objectives:

* Convert molecules from chemical formulas and structures to SMILES strings.

</div>

To work with molecules programmatically, we need a way to represent them on a computer.
This can be achieved in many ways.
One way is a SMILES string, which encodes a molecule as a single line of text.
Another is a molecular file format.
Examples of file formats include ``pdb``, ``mol``, ``mol2``, ``cif``, and ``mmcif``.

## SMILES Strings

SMILES stands for "Simplified Molecular-Input Line-Entry System" and is a way to represent molecules as a string of characters.

You can read more about SMILES in [this tutorial](https://archive.epa.gov/med/med_archive_03/web/html/smiles.html); the rules for atoms, bonds, and branches are summarized below.

### Atoms
SMILES supports all elements in the periodic table. Each atom is represented by its atomic symbol. Upper case letters refer to non-aromatic atoms; lower case letters refer to aromatic atoms. If the atomic symbol has more than one letter, the second letter must be lower case (for example, `Cl` for chlorine).
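
To make the atom rules concrete, here is a minimal sketch in plain Python that splits a simple SMILES string into atom symbols and labels each one as aromatic or non-aromatic. The helper name `list_atoms` and the regular expression are illustrative assumptions, not part of any library. They only cover unbracketed atoms such as C, N, O, Cl, and Br, so this is no substitute for a real parser such as RDKit.

```python
import re

# Two-letter symbols (Cl, Br) are tried before one-letter symbols,
# otherwise "Cl" would be read as a carbon followed by a stray "l".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def list_atoms(smiles):
    """Return (symbol, is_aromatic) pairs for the unbracketed atoms in a SMILES string."""
    # A symbol written entirely in lower case marks an aromatic atom.
    return [(symbol, symbol.islower()) for symbol in ATOM_PATTERN.findall(smiles)]


print(list_atoms("CC(=O)C"))     # four non-aromatic atoms
print(list_atoms("c1ccccc1Cl"))  # six aromatic carbons and one chlorine
```

Characters that are not atom symbols (bond symbols, parentheses, ring digits) are simply skipped by this sketch.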

### Bonds
```
-	Single bond
=	Double bond
#	Triple bond
:	Aromatic bond
.	Disconnected structures
```
Single bonds are the default and therefore need not be written. For example, `CC` means a non-aromatic carbon attached to another non-aromatic carbon by a single bond, which a program would identify as ethane. The bond between two lower case atom symbols is likewise assumed to be aromatic. A blank (space) terminates the SMILES string.
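
The following sketch illustrates two of the rules above in plain Python: counting the bond symbols that are written out explicitly, and splitting a string into disconnected structures at each `.`. The function names are hypothetical, and the counter only covers the single, double, and triple bond symbols. Text inside square brackets is skipped, because a `-` there denotes a charge rather than a bond.

```python
BOND_NAMES = {"-": "single", "=": "double", "#": "triple"}


def split_fragments(smiles):
    """Split a SMILES string into its disconnected structures at each '.'."""
    return smiles.split(".")


def count_explicit_bonds(smiles):
    """Count bond symbols written in the string; implicit single bonds are not counted."""
    counts = {}
    in_bracket = False  # True while reading a bracket atom such as [Cl-]
    for char in smiles:
        if char == "[":
            in_bracket = True
        elif char == "]":
            in_bracket = False
        elif char in BOND_NAMES and not in_bracket:
            name = BOND_NAMES[char]
            counts[name] = counts.get(name, 0) + 1
    return counts


print(count_explicit_bonds("CC"))       # no symbols: the single bond is implied
print(count_explicit_bonds("CC(=O)C"))  # one explicit double bond
print(split_fragments("[Na+].[Cl-]"))   # two disconnected ions
```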

### Branches

A branch from a chain is specified by placing the SMILES symbol(s) for the branch in parentheses. Some examples:

```

CC(O)C	2-Propanol
CC(=O)C	2-Propanone
CC(CC)C	2-Methylbutane
CC(C)CC(=O)	3-Methylbutanal
c1c(N(=O)=O)cccc1	Nitrobenzene
CC(C)(C)CC	2,2-Dimethylbutane
```
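
One way to see how parentheses encode branching is to read the string with a stack: an opening parenthesis remembers the atom where the branch starts, and a closing parenthesis returns to it. The sketch below is a simplified illustration under stated assumptions, not how RDKit or any other toolkit parses SMILES. The name `parse_branches` is hypothetical, bond orders are ignored, and ring closures, aromatic atoms, and bracket atoms are not handled.

```python
import re

# Tokens: two-letter atoms first, then one-letter atoms, then parentheses and bond symbols.
TOKEN_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[()=#-]")


def parse_branches(smiles):
    """Return (atoms, bonds) for an acyclic SMILES string without bracket atoms.

    bonds is a list of (index_a, index_b) pairs into atoms; bond order is ignored.
    """
    atoms, bonds = [], []
    previous = None  # index of the atom that the next atom will bond to
    stack = []       # branch points to return to when a ")" is reached
    for token in TOKEN_PATTERN.findall(smiles):
        if token == "(":
            stack.append(previous)  # remember where the branch starts
        elif token == ")":
            previous = stack.pop()  # resume the chain at the branch point
        elif token in {"-", "=", "#"}:
            continue                # bond order is not tracked in this sketch
        else:
            atoms.append(token)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current))
            previous = current
    return atoms, bonds


atoms, bonds = parse_branches("CC(C)(C)CC")  # 2,2-dimethylbutane
print(atoms)
print(bonds)
```

In the 2,2-dimethylbutane example, the atom at index 1 appears in four of the bond pairs, which matches the central carbon carrying two methyl branches.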

Most of the time, you will not need to write a SMILES string by hand. You will be able to look up a molecule's SMILES string from a web database like [PubChem](https://pubchem.ncbi.nlm.nih.gov/).

You can also use tools like this [molecule sketcher](https://pubchem.ncbi.nlm.nih.gov//edit3/index.html) to draw molecules and get their SMILES strings.

<div class="exercise admonition">
<p class="admonition-title">Check Your Understanding</p>
<p>Based on what you've learned about SMILES strings, answer the following questions:</p>
<ul>
    <li>What is the SMILES string for ethanol?</li>
    <li>What is the SMILES string for water?</li>
    <li>What molecule is represented by the SMILES string O=C=O?</li>
</ul>
<p>Check your answers to the questions above using the PubChem molecule sketcher. Notice that you can type in a SMILES string and have the sketcher draw the molecule for you.</p>
</div>

### Fill in your answers here:

1.
2.
3.


## Molecule File Formats

Information about molecules can also be stored in text files. Some text file formats include:

* [MOL (MDL Molfile)](https://en.wikipedia.org/wiki/Chemical_table_file#:~:text=specifications.%5B3%5D-,Molfile,-%5Bedit%5D): The MOL file format is a widely used standard developed by Molecular Design Limited (MDL) for storing 2D and 3D molecular structures. It contains information on atoms, bonds, and connectivity.

* [MOL2 (Sybyl Mol2)](http://chemyang.ccnu.edu.cn/ccb/server/AIMMS/mol2.pdf): The MOL2 file format was developed by Tripos Inc. for the SYBYL modeling software and, despite the similar name, is a separate format from MDL's MOL. It is mainly used in molecular modeling and computational chemistry. It supports features such as force-field atom types, partial atomic charges, and multiple structures in a single file.

* [PDB (Protein Data Bank)](https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html): The PDB format is a standard for storing 3D structures of biological macromolecules, such as proteins and nucleic acids. PDB files contain information on atom types, positions, connectivity, and secondary structure elements. PDB files are widely used in structural biology and bioinformatics.

* [SDF (Structure Data File)](https://en.wikipedia.org/wiki/Chemical_table_file#:~:text=or%20model%20regno-,SDF,-%5Bedit%5D): SDF is another file format developed by MDL, used primarily for storing and exchanging large sets of 2D and 3D chemical structures. It is an extension of the MOL file format, where multiple MOL files can be concatenated together with additional metadata in a single file.

* [XYZ](https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/xyz.html): A simple, human-readable file format for storing atomic coordinates. It does not include information on atomic connectivity.

* [CIF (Crystallographic Information File)](https://www.iucr.org/resources/cif/spec/version1.1): A file format used in crystallography to store crystal structures, including atomic coordinates, unit cell parameters, and symmetry information. It is the data exchange standard maintained by the International Union of Crystallography (IUCr).

* [PDBx/mmCIF (Macromolecular Crystallographic Information File)](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/beginner%E2%80%99s-guide-to-pdb-structures-and-the-pdbx-mmcif-format): A file format for storing crystallographic information about macromolecular structures such as proteins, nucleic acids, and their complexes. mmCIF is an extension of the CIF format, which is used for small-molecule crystal structures. Developed by the International Union of Crystallography (IUCr) and the Research Collaboratory for Structural Bioinformatics (RCSB), mmCIF was created to address the limitations of the PDB file format.

Since these are all machine-readable text files, they can all be processed and analyzed using Python.
Most often, you will use a Python library specialized to work with each file format.
In this workshop, we will be learning about the Python library [RDKit](https://www.rdkit.org/docs/index.html). 
RDKit is a library commonly used in cheminformatics for working with molecules.
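
Before turning to a specialized library, it can help to see how little is needed to read the simplest of these formats with plain Python. An XYZ file starts with a line giving the number of atoms, followed by a comment line, followed by one line per atom with the element symbol and its x, y, and z coordinates. The sketch below parses that layout; the function name `parse_xyz` is hypothetical, and the water coordinates are approximate, illustrative values rather than data from this workshop.

```python
def parse_xyz(text):
    """Parse XYZ-format text into a comment string and a list of (element, x, y, z) tuples."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])  # line 1: number of atoms
    comment = lines[1]       # line 2: free-form comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        element, x, y, z = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    return comment, atoms


# Approximate, illustrative coordinates for water in angstroms (an assumption for this example).
water_xyz = """3
water
O   0.000   0.000   0.117
H   0.000   0.757  -0.469
H   0.000  -0.757  -0.469
"""

comment, atoms = parse_xyz(water_xyz)
print(comment, len(atoms))
for element, x, y, z in atoms:
    print(element, x, y, z)
```

Formats that also record bonds, charges, or metadata have more involved layouts, which is where a dedicated library becomes worthwhile.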

Below is a preview of a MOL file for histidine. We will see how to use RDKit to load files like this in the next few lessons.

In [None]:
with open("data/amino_acids/his.mol") as f:
    data = f.read()

# Print the first 2000 characters in the file.
print(data[:2000])