# Convert SMILES to SDF

## Molecular Representations of Biological Molecules for Computing

In bioinformatics, if you work with genomics, you are accustomed to working with DNA sequences. For example, this is a partial DNA sequence for human dihydrofolate reductase, an important enzyme in nucleotide metabolism.

```
GAATTCATGAAAACGTAGCTCGTCCTCAAAAAAAACAGAAGAGGAGTAATCATTTTAAGGGAGAAATATA
TACGAAAGGAACAAGATTTTGAAGCACCCAAGCTGCCACCTACATTAAAACACGGTAGGTGGCTAAACAC
CAGTCTTCAATGCCCTTCCACAGCCTCAGTCTGAAAAATACTGTGCAGGTGACCCAAGTGAGGGGTCACC
CTTGGGCTTTTCCTGTGGCAGTATCTCTGGTTTAAAAACAAACAAACGTACTTATTGCGTTGAAGGACGG
CAACAGGAAGGACTCCATGATTAGTCACATCTATACCATCCTAAGAAACTTTATCCACCCAAACTGTATT
TCAGACTTTATAATCTAAACTACAAAAAGTGTTCACTGGGGAACTGCACAATATGACTGCTTTTAACCGT
```

The DNA sequence shown here is a simplified representation of a very complex 3D structure that is part of a chromosome, itself an enormously complex structure. The sequence that represents this gene can be handled as a string in code, and computation with strings is much faster than computation with 3D structures. We can still learn a great deal about this gene simply by exploring the sequence. Likewise, we can represent the RNA transcribed from this sequence as a string of characters in which T is replaced by U.
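
As a small illustration of treating a sequence as a string, the sketch below transcribes a DNA fragment by replacing T with U and counts its bases. The fragment is the first 18 bases of the sequence above; the function name `transcribe` is made up for this example.

```python
from collections import Counter

# First 18 bases of the DHFR fragment shown above
dna = "GAATTCATGAAAACGTAG"

def transcribe(dna_seq: str) -> str:
    """Return the RNA string with the same sequence as dna_seq (T replaced by U)."""
    return dna_seq.upper().replace("T", "U")

rna = transcribe(dna)
base_counts = Counter(dna)  # how many times each base occurs

print(rna)
print(dict(base_counts))
```

Both operations are plain string methods, which is why sequence-level questions can be answered without any 3D structural model.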

If you study proteins or proteomics, you know that protein function depends on protein structure. Proteins are built from 20 (or more) amino acid building blocks, so their sequences use a larger alphabet than DNA, but the principle of representing the protein as a simple string for ease of computing still applies. Here is the sequence of the dihydrofolate reductase protein encoded by this gene.

```
MVGSLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSIPEKNRPLKG
RINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPGHLKLFVTRIM
QDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEEKGIKYKFEVYEKND
```

As we move into cheminformatics, we often want to convert small-molecule structures, like the aspirin shown here, into strings for ease of computing.

![Aspirin.png](https://drive.google.com/uc?export=view&id=1hcWaacd-pIb09Wi9dceVSQIfkXgWC-vD)

There are three well-known string representations for small molecules: SMILES, InChI, and InChIKey. Here are the SMILES, InChI, and InChIKey strings for aspirin.

SMILES: CC(=O)OC1=CC=CC=C1C(=O)O

InChI: InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

InChIKey: BSYNRYMUTXBXSQ-UHFFFAOYSA-N
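
These three representations can be generated from one another in code. The sketch below assumes RDKit is available (this notebook installs it in a later cell) and that the RDKit build includes InChI support, which is the case for the standard pip and conda packages as far as we know. It starts from the aspirin SMILES above and asks RDKit for the InChI, the InChIKey, and RDKit's own canonical SMILES.

```python
from rdkit import Chem

aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"

# MolFromSmiles returns None when the SMILES string cannot be parsed
mol = Chem.MolFromSmiles(aspirin_smiles)

inchi = Chem.MolToInchi(mol)
inchi_key = Chem.MolToInchiKey(mol)
canonical_smiles = Chem.MolToSmiles(mol)  # RDKit's canonical form; the text may differ from the input

print(inchi)
print(inchi_key)
print(canonical_smiles)
```

A molecule can have many valid SMILES strings, so the canonical SMILES printed here need not match the input character for character, while the InChIKey should match the one listed above.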

In this workshop we will use SMILES strings. SMILES syntax is explained below.


## Simplified Molecular-Input Line-entry System: SMILES

SMILES stands for "Simplified Molecular-Input Line-Entry System" and is a way to represent molecules as a string of characters.

Consider the molecule ethanol. The image below shows a representation that we are used to seeing in chemistry:

![ethanol](https://drive.google.com/uc?export=view&id=1pBnnNujVdkw43xpDOM27nzICgnn7EqJj)

However, the SMILES representation of this molecule would be "CCO".

You can read more about SMILES at [this tutorial](https://archive.epa.gov/med/med_archive_03/web/html/smiles.html), but rules for atoms and bonds are also repeated below.

### Atoms
SMILES supports all elements in the periodic table. An atom is represented using its respective atomic symbol. Upper case letters refer to non-aromatic atoms; lower case letters refer to aromatic atoms. If the atomic symbol has more than one letter the second letter must be lower case.
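
The case rule can be illustrated with a few lines of plain Python. This is only a sketch, not a SMILES parser: it assumes the string uses just the common "organic subset" symbols (B, C, N, O, P, S, F, Cl, Br, I and their lower-case aromatic forms) and ignores atoms written in square brackets. The names `ATOM_PATTERN` and `classify_atoms` are made up for this example.

```python
import re
from collections import Counter

# Two-letter symbols come first so that "Cl" is not read as "C" followed by "l".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def classify_atoms(smiles: str) -> Counter:
    """Count atoms as (symbol, 'aromatic' or 'non-aromatic') based on letter case."""
    counts = Counter()
    for symbol in ATOM_PATTERN.findall(smiles):
        kind = "aromatic" if symbol.islower() else "non-aromatic"
        counts[(symbol, kind)] += 1
    return counts

pyridine = classify_atoms("n1ccccc1")
aspirin = classify_atoms("CC(=O)OC1=CC=CC=C1C(=O)O")

print(pyridine)
print(aspirin)
```

Note that the aspirin SMILES used in this notebook writes the benzene ring with alternating double bonds and upper-case carbons, so this simple check labels those carbons as non-aromatic even though a toolkit such as RDKit would perceive the ring as aromatic.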

### Bonds
```
-	Single bond
=	Double bond
#	Triple bond
:	Aromatic bond
.	Disconnected structures
```
Single bonds are the default and therefore need not be entered. For example, 'CC' would mean that there is a non-aromatic carbon attached to another non-aromatic carbon by a single bond, and the computer would identify the structure as the chemical ethane. It is also assumed that the bond between two lower case atom symbols is aromatic. A blank terminates the SMILES string.
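
The bond symbols can be inspected with simple string operations. The sketch below splits a SMILES string into its disconnected parts at each `.` and counts the explicit double and triple bond symbols. It is an illustration only: the helper names are hypothetical, single bonds are not counted because they are usually implicit, and the split assumes no ring-closure number is shared across a `.`.

```python
def split_components(smiles: str) -> list:
    """Split a SMILES string into its disconnected parts."""
    return smiles.split(".")

def count_bond_symbols(smiles: str) -> dict:
    """Count explicitly written double (=) and triple (#) bond symbols."""
    return {"double": smiles.count("="), "triple": smiles.count("#")}

salt_parts = split_components("[Na+].[Cl-]")   # sodium chloride written as two ions
acetone_bonds = count_bond_symbols("CC(=O)C")  # 2-propanone
nitrile_bonds = count_bond_symbols("CC#N")     # acetonitrile

print(salt_parts)
print(acetone_bonds)
print(nitrile_bonds)
```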

### Branches

A branch from a chain is specified by placing the SMILES symbol(s) for the branch in parentheses. Some examples:

```
CC(O)C	2-Propanol
CC(=O)C	2-Propanone
```
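
To see how parentheses mark branches, the following sketch walks through a SMILES string, tracks the nesting depth, and collects the text inside each outermost pair of parentheses. The function name `top_level_branches` is hypothetical, and the code only checks that parentheses are balanced; it does not validate the chemistry.

```python
def top_level_branches(smiles: str) -> list:
    """Return the text inside each outermost pair of parentheses."""
    branches = []
    depth = 0      # how many parentheses are currently open
    start = None   # index where the current outermost branch begins
    for i, ch in enumerate(smiles):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched ')' in SMILES")
            if depth == 0:
                branches.append(smiles[start:i])
    if depth != 0:
        raise ValueError("unmatched '(' in SMILES")
    return branches

propanol = top_level_branches("CC(O)C")
propanone = top_level_branches("CC(=O)C")
nested = top_level_branches("CC(C(C)C)C")  # a branch that itself contains a branch

print(propanol, propanone, nested)
```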

### Rings

A ring is specified by placing a number directly after the SMILES symbol where the ring closure occurs. This number acts as a marker, indicating that the atoms with the same number are connected, thus forming a ring. For instance:

```
C1CCCC1   Cyclopentane
n1ccccc1  Pyridine
```
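
The matching of ring-closure numbers can be sketched in plain Python. The code below numbers the atoms in the order they appear (starting at 0) and pairs up the two atoms that carry the same digit. It is a simplified illustration with hypothetical names: it handles single-digit closures only (not the two-digit `%10` form) and recognizes only bracket atoms and the common organic-subset symbols.

```python
import re

# An atom is a bracket expression, a two-letter halogen, or a one-letter symbol;
# a ring marker is a single digit that follows an atom.
TOKEN_PATTERN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops])|(?P<ring>\d)"
)

def ring_closures(smiles: str) -> list:
    """Return (atom_index, atom_index) pairs joined by ring-closure digits."""
    open_rings = {}   # digit -> index of the atom that opened the ring
    closures = []
    atom_index = -1
    for match in TOKEN_PATTERN.finditer(smiles):
        if match.group("atom"):
            atom_index += 1
            continue
        digit = match.group("ring")
        if digit in open_rings:
            closures.append((open_rings.pop(digit), atom_index))
        else:
            open_rings[digit] = atom_index
    if open_rings:
        raise ValueError(f"unclosed ring digit(s): {sorted(open_rings)}")
    return closures

cyclopentane = ring_closures("C1CCCC1")
pyridine_ring = ring_closures("n1ccccc1")
aspirin_ring = ring_closures("CC(=O)OC1=CC=CC=C1C(=O)O")

print(cyclopentane, pyridine_ring, aspirin_ring)
```

For cyclopentane the result says that atom 0 and atom 4 are bonded, which closes the five-membered ring.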

### SMILES Examples

![SMILES Example 1](https://drive.google.com/uc?export=view&id=1-MFSoAGwqOPiqIUD06reOkBPx4BTMhGC)

![SMILES Example 2](https://drive.google.com/uc?export=view&id=18Ub9L98y8cL_lDLF9wl6pLQxCkt8JFqu)

### Using Online Resources
Most of the time, you will not need to write a SMILES string by hand. You will be able to look up a molecule's SMILES string from a web database like [PubChem](https://pubchem.ncbi.nlm.nih.gov/).

You can also use tools like this [molecule sketcher from the Protein Data Bank](https://www.rcsb.org/chemical-sketch) to draw molecules and get their SMILES strings.

In [2]:
# Install RDKit and pandas
!pip install rdkit-pypi pandas

Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.9.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Downloading rdkit_pypi-2022.9.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.4 MB)
Installing collected packages: rdkit-pypi
Successfully installed rdkit-pypi-2022.9.5


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
from google.colab import files
uploaded = files.upload()

Saving filtered_km_serine_proteases.csv to filtered_km_serine_proteases.csv


In [8]:
import os
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem

# Load the CSV file
df = pd.read_csv("filtered_km_serine_proteases.csv")

# Create output directory if it doesn't exist
output_dir = "ligand_sdfs"
os.makedirs(output_dir, exist_ok=True)

for i, row in df.head(10).iterrows():
    smiles = row['substrate_smiles']  # Adjust if the column has a different name
    ligand_id = row.get('ligand_id', f"ligand_{i}")  # Fallback to generic ID if not available

    # MolFromSmiles returns None when the SMILES string cannot be parsed
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        mol.SetProp("_Name", str(ligand_id))
        mol = Chem.AddHs(mol)

        # Try standard embedding
        result = AllChem.EmbedMolecule(mol, randomSeed=0xf00d)
        if result != 0:
            # If standard embedding fails, try random coordinate embedding
            result = AllChem.EmbedMolecule(mol, useRandomCoords=True, randomSeed=0xf00d)

        if result == 0:  # Success
            try:
                AllChem.UFFOptimizeMolecule(mol)
                # Define the output file path
                filename = f"{ligand_id}.sdf"
                filepath = os.path.join(output_dir, filename)
                # Write the molecule to an SDF file
                writer = Chem.SDWriter(filepath)
                writer.write(mol)
                writer.close()
                print(f"✅ Saved {filename}")
            except Exception as e:
                print(f"⚠️ Optimization failed for {ligand_id}: {e}")
        else:
            print(f"❌ Embedding failed for {ligand_id} (SMILES: {smiles})")
    else:
        print(f"❌ Could not parse SMILES for {ligand_id} (SMILES: {smiles})")

print(f"🎉 All individual SDF files are saved in the '{output_dir}' directory.")

✅ Saved ligand_0.sdf
✅ Saved ligand_1.sdf
✅ Saved ligand_2.sdf
✅ Saved ligand_3.sdf
✅ Saved ligand_4.sdf
✅ Saved ligand_5.sdf
✅ Saved ligand_6.sdf
✅ Saved ligand_7.sdf
✅ Saved ligand_8.sdf
✅ Saved ligand_9.sdf
🎉 All individual SDF files are saved in the 'ligand_sdfs' directory.
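
One way to sanity-check the conversion is a round trip on a single small molecule. The sketch below is not part of the original workflow: it follows the same steps as the cell above for ethanol (add hydrogens, embed 3D coordinates, optimize with UFF), writes the result to a mol block in memory, which is the same text that `SDWriter` stores in each SDF record, and reads it back. The seed value and variable names are arbitrary choices for this example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Ethanol with explicit hydrogens, as in the conversion loop
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
mol.SetProp("_Name", "ethanol")

status = AllChem.EmbedMolecule(mol, randomSeed=42)  # 0 means a 3D conformer was generated
AllChem.UFFOptimizeMolecule(mol)                    # relax the geometry with the UFF force field

# The mol block holds the name, atom coordinates, and bond table
mol_block = Chem.MolToMolBlock(mol)

# Read it back, keeping the hydrogens that were written out
restored = Chem.MolFromMolBlock(mol_block, removeHs=False)

print(mol_block.splitlines()[0])
print(restored.GetNumAtoms(), restored.GetConformer().Is3D())
```

If the round trip worked, the first line of the block is the molecule name and the restored molecule has nine atoms (two carbons, one oxygen, six hydrogens) with 3D coordinates.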


In [9]:
# Zip the folder
!zip -r ligand_sdfs.zip ligand_sdfs

# Download the zip
from google.colab import files
files.download("ligand_sdfs.zip")

  adding: ligand_sdfs/ (stored 0%)
  adding: ligand_sdfs/ligand_23.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_10.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_2.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_22.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_12.sdf (deflated 79%)
  adding: ligand_sdfs/ligand_21.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_7.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_9.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_5.sdf (deflated 75%)
  adding: ligand_sdfs/ligand_8.sdf (deflated 79%)
  adding: ligand_sdfs/ligand_15.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_13.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_28.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_1.sdf (deflated 79%)
  adding: ligand_sdfs/ligand_34.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_32.sdf (deflated 78%)
  adding: ligand_sdfs/ligand_16.sdf (deflated 79%)
  adding: ligand_sdfs/ligand_6.sdf (deflated 75%)
  adding: ligand_sdfs/ligand_27.sdf (deflated 78%)
  a
