<a href="https://colab.research.google.com/github/Mark12481632/Imperial_MSc_Project/blob/main/code/Data-Preperation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## MSc Project: Data Preparation

My research project intends to use a Graph Neural Network (GNN) to predict the solubility, in water, of small organic molecules.
<BR><BR>


**This script was run in Google's Colab environment.**


**Section 1: Install packages/libraries**<BR>

- Install the required libraries/packages in Colab.
- Import the packages.

In [None]:
# Install the required packages/libraries.
# These need to be installed in Colab.
## %%capture

# Install condacolab.
# Note: condacolab.install() restarts the Colab kernel once it finishes.
!pip install -q condacolab
import condacolab
condacolab.install()

# Install RDKit (-y skips the interactive confirmation prompt)
!conda install -y -c rdkit rdkit

# Install pytorch-geometric
!pip install torch_geometric

In [6]:
# Import libraries:

import numpy as np
import pandas as pd

from rdkit import Chem



---


**Section 2: Loading and examining the ESOL dataset**<BR>

- The ESOL dataset has already been downloaded into the GitHub repository, so we
  can load it using pandas.
- Remove the column "ESOL predicted log solubility in mols per litre", as it holds the output of another regression model rather than a measurement.
<BR><BR>

The column descriptions follow:

1.   **Compound ID:** Name of the compound.
2.   **ESOL predicted log solubility in mols per litre:** Predicted solubility using a regression model - we will remove this.
3.   **Minimum Degree:** The minimum number of bonds that any atom in the molecule has with other atoms.
4.   **Molecular Weight:** The molecular weight of the molecule.
5.   **Number of H-Bond Donors:** A count of the number of hydrogen bond donor groups in the molecule. H-bond donors are groups (typically N-H or O-H) whose hydrogen can take part in a hydrogen bond with an acceptor atom.
6.   **Number of Rings:** Number of "rings", where a "ring" represents a closed cycle of atoms.
7.   **Number of Rotatable Bonds:** A rotatable bond is normally a single bond about which rotation allows the molecule to take on different "forms" (conformations).
8.   **Polar Surface Area:** A measure of the exposed polar area of a molecule, which can provide insight into the molecule's polarity.
9.   **measured log solubility in mols per litre:** The logarithm of the measured solubility of the molecule (in mols/litre).
10.  **smiles:** The SMILES representation of the molecule - see references above.
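
To make the target column more concrete, the sketch below converts a log solubility value back into grams per litre. It assumes the logarithm is base 10 and that the molecular weight is in g/mol; neither is stated in the column descriptions. The function name `log_solubility_to_grams_per_litre` is hypothetical, and the input values are those of the Thiophene row in the dataset.

```python
def log_solubility_to_grams_per_litre(log_s, molecular_weight):
    """Convert a log solubility (mol/L) into a solubility in g/L.

    Assumption: log_s is a base-10 logarithm and molecular_weight is in g/mol.
    """
    mols_per_litre = 10.0 ** log_s
    return mols_per_litre * molecular_weight


# Thiophene: measured log solubility -1.33, molecular weight 84.143
thiophene_g_per_l = log_solubility_to_grams_per_litre(-1.33, 84.143)
print(f"Thiophene: {thiophene_g_per_l:.2f} g/L")
```

Under these assumptions a log solubility of -1.33 corresponds to roughly 0.047 mol/L, or about 3.9 g/L for Thiophene.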



In [7]:
# 2.1. Load the "ESOL" dataset and show attributes.

github_esol_url = "https://raw.githubusercontent.com/Mark12481632/Imperial_MSc_Project/main/chem_data/esol_raw.csv"
esol_data_orig = pd.read_csv(github_esol_url)

# Display sample from data:
esol_data_orig.head()

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
0,Amigdalin,-0.974,1,457.432,7,3,7,202.32,-0.77,OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...
1,Fenfuram,-2.885,1,201.225,1,2,2,42.24,-3.3,Cc1occc1C(=O)Nc2ccccc2
2,citral,-2.579,1,152.237,0,0,4,17.07,-2.06,CC(C)=CCCC(C)=CC(=O)
3,Picene,-6.618,2,278.354,0,5,0,0.0,-7.87,c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43
4,Thiophene,-2.232,2,84.143,0,1,0,0.0,-1.33,c1ccsc1


In [8]:
# 2.2. Remove column "ESOL predicted log solubility in mols per litre"
esol_data = esol_data_orig.copy()
del esol_data['ESOL predicted log solubility in mols per litre']
esol_data.head()

Unnamed: 0,Compound ID,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
0,Amigdalin,1,457.432,7,3,7,202.32,-0.77,OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...
1,Fenfuram,1,201.225,1,2,2,42.24,-3.3,Cc1occc1C(=O)Nc2ccccc2
2,citral,1,152.237,0,0,4,17.07,-2.06,CC(C)=CCCC(C)=CC(=O)
3,Picene,2,278.354,0,5,0,0.0,-7.87,c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43
4,Thiophene,2,84.143,0,1,0,0.0,-1.33,c1ccsc1
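
Beyond looking at the first few rows, a few quick checks can help when examining the data: missing values, duplicated SMILES strings, and the range of the target column. The following is only a sketch of such checks, run on four rows copied from the sample output above so that it is self-contained. In the notebook the same calls would presumably be applied to `esol_data`.

```python
import io
import pandas as pd

# Four rows copied from the sample output above; the Amigdalin row is left out
# because its SMILES string is truncated in the display.
sample_csv = io.StringIO(
    "Compound ID,Molecular Weight,measured log solubility in mols per litre,smiles\n"
    "Fenfuram,201.225,-3.3,Cc1occc1C(=O)Nc2ccccc2\n"
    "citral,152.237,-2.06,CC(C)=CCCC(C)=CC(=O)\n"
    "Picene,278.354,-7.87,c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43\n"
    "Thiophene,84.143,-1.33,c1ccsc1\n"
)
sample = pd.read_csv(sample_csv)

target = "measured log solubility in mols per litre"

# Count missing values in each column.
missing_per_column = sample.isna().sum()

# Count SMILES strings that repeat an earlier row.
duplicate_smiles = int(sample["smiles"].duplicated().sum())

print(missing_per_column)
print("Duplicate SMILES:", duplicate_smiles)
print(sample[target].describe())
```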


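**Section 3: Parse a SMILES string into atom and bond maps**<BR>

Create a function that takes a molecule name and its SMILES encoding and returns a map which contains:
  - (key: `name`) the molecule name.
  - (key: `atoms`) a map from atom index to `[implicit H count, atomic number, element symbol, total degree]`.
  - (key: `bonds`) a map from `(begin atom index, end atom index)` to the bond type as a number (1.0 single, 1.5 aromatic, 2.0 double).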

In [26]:
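def parse_smile_molecule(molecule_name, smiles):
  """
  Parse a SMILES string into a map describing the molecule.

  Returns a map with keys:
    'name'  - the molecule name passed in.
    'atoms' - atom index -> [implicit H count, atomic number, symbol, total degree].
    'bonds' - (begin atom index, end atom index) -> bond type as a float.

  Raises ValueError if RDKit cannot parse the SMILES string.
  """

  atoms_map = {}
  bonds_map = {}

  full_structure_map = {}
  full_structure_map['atoms'] = atoms_map
  full_structure_map['bonds'] = bonds_map
  full_structure_map['name'] = molecule_name

  molecule = Chem.MolFromSmiles(smiles)
  if molecule is None:
    # MolFromSmiles returns None (rather than raising) for invalid SMILES.
    raise ValueError(f"Could not parse SMILES: {smiles}")

  # Add nodes - each atom is a node
  for atom in molecule.GetAtoms():
    atom_id = atom.GetIdx()
    atom_attributes = [atom.GetNumImplicitHs(),
                       atom.GetAtomicNum(),
                       atom.GetSymbol(),
                       atom.GetTotalDegree()]
    atoms_map[atom_id] = atom_attributes

  # Add edges - each bond is an edge, keyed by its two atom indices
  for bond in molecule.GetBonds():
    key = (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx())
    bonds_map[key] = bond.GetBondTypeAsDouble()

  return full_structure_map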

# Ensure that graph looks the same each time this cell is run
np.random.seed(291)

#graph = create_graph_from_smiles('Mark Roberts', 'CC(N)C(=O)O')
mol_struct = parse_smile_molecule('Mark Roberts', 'CC(N)C(=O)O')

print(mol_struct['name'])
print(mol_struct['atoms'])
print(mol_struct['bonds'])

Mark Roberts
{0: [3, 6, 'C', 4], 1: [1, 6, 'C', 4], 2: [2, 7, 'N', 3], 3: [0, 6, 'C', 3], 4: [0, 8, 'O', 1], 5: [1, 8, 'O', 2]}
{(0, 1): 1.0, (1, 2): 1.0, (1, 3): 1.0, (3, 4): 2.0, (3, 5): 1.0}
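
A possible next step towards a GNN input is to flatten these maps into a node feature list and an edge list. The sketch below is not part of the original notebook. The helper name `to_graph_lists` is hypothetical, and it assumes the edge layout that PyTorch Geometric conventionally uses for undirected graphs: two parallel lists of source and target indices, with every bond stored once in each direction. It uses plain Python lists and the atom and bond maps printed above, so it runs without RDKit or PyTorch.

```python
# Atom and bond maps copied from the output printed above for 'CC(N)C(=O)O'.
atoms = {0: [3, 6, 'C', 4], 1: [1, 6, 'C', 4], 2: [2, 7, 'N', 3],
         3: [0, 6, 'C', 3], 4: [0, 8, 'O', 1], 5: [1, 8, 'O', 2]}
bonds = {(0, 1): 1.0, (1, 2): 1.0, (1, 3): 1.0, (3, 4): 2.0, (3, 5): 1.0}


def to_graph_lists(atoms, bonds):
    """Hypothetical helper: flatten the maps into node and edge lists."""
    # Numeric node features in atom-index order. The element symbol is
    # dropped here because the atomic number carries the same information.
    node_features = [[num_h, atomic_num, degree]
                     for _, (num_h, atomic_num, _symbol, degree)
                     in sorted(atoms.items())]

    sources, targets, edge_features = [], [], []
    for (begin, end), bond_order in bonds.items():
        # Store each bond in both directions so the graph is undirected.
        for src, dst in ((begin, end), (end, begin)):
            sources.append(src)
            targets.append(dst)
            edge_features.append([bond_order])

    return node_features, [sources, targets], edge_features


node_features, edge_index, edge_features = to_graph_lists(atoms, bonds)
print(len(node_features), "nodes,", len(edge_index[0]), "directed edges")
```

For this molecule the result has 6 nodes and 10 directed edges (5 bonds, each stored twice). Whether to keep the element symbol, one-hot encode the atomic number, or add further atom features is a modelling choice left open here.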


