# Synopsis

The purpose of this kernel is to provide a simple *annotated* visualization baseline for molecules.

Additional clickable 3D molecule views have been added. They use the Python `ase` library, which can be installed with pip; see [How To: Easy Visualization of Molecules](https://www.kaggle.com/borisdee/how-to-easy-visualization-of-molecules/comments).

## Changelog

* v7: Support for ionized atom groups: `NH3+`, and 1.5-order bonds in `COO-` groups.
* v4: Bonds have been added using [Dataset with number of bonds between atoms](https://www.kaggle.com/asauve/dataset-with-number-of-bonds-between-atoms)

# General information

![](https://storage.googleapis.com/kaggle-competitions/kaggle/14313/logos/thumb76_76.png?t=2019-05-16-16-56-19)

This kernel uses data from [Predicting Molecular Properties](https://www.kaggle.com/c/champs-scalar-coupling), a competition about predicting interactions between atoms in the domain of Nuclear Magnetic Resonance (NMR). More precisely, the quantity to predict is the scalar coupling constant between pairs of atoms.

As this challenge is based upon molecular topological properties, it can be useful to have an appropriate way of representing molecules.
These visualizations can then be used to infer useful hints to understand the coupling properties, engineer appropriate features and debug
prediction failures.



# Load data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from ase import Atoms
import ase.visualize  # clickable 3D molecule viewer    # pip install ase

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import os
!ls -al --color ../input

In [None]:
def load_dir_csv(directory):
    csv_files = sorted( [ f for f in os.listdir(directory) if f.endswith(".csv") ])    
    csv_vars  = [ filename[:-4] for filename in csv_files ]
    gdict = globals()
    for filename, var in zip( csv_files, csv_vars ):
        print(f"{var:32s} = pd.read_csv({directory}/{filename})")
        gdict[var] = pd.read_csv( f"{directory}/{filename}" )
        print(f"{'nb of rows ':32s} = " + str(len(gdict[var])))
        display(gdict[var].head())

load_dir_csv("../input/champs-scalar-coupling")
load_dir_csv("../input/predicting-molecular-properties-bonds/")
                       

# Structure of molecules

## Molecule geometry

To begin with, the location of atoms can be found in `structures.csv`. The columns of interest are
* atom : a letter for the atom type
* x, y, z: 3D coordinates of each atom
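
As a minimal sketch of how these columns can be used, the snippet below computes every pairwise distance inside one molecule with the standard library only. The rows are hypothetical placeholders chosen for easy arithmetic, not values from `structures.csv`, and `pairwise_distances` is an illustrative helper name.

```python
import math
from itertools import combinations

# Hypothetical rows in an (atom_index, atom, x, y, z) layout.
# The coordinates are placeholders, not real data.
toy_atoms = [
    (0, "C", 0.0, 0.0, 0.0),
    (1, "H", 3.0, 4.0, 0.0),
    (2, "H", 0.0, 0.0, 2.0),
]

def pairwise_distances(atoms):
    """Return {(i, j): euclidean distance} for every unordered atom pair."""
    distances = {}
    for a, b in combinations(atoms, 2):
        # a[2:] and b[2:] are the (x, y, z) parts of each row
        distances[(a[0], b[0])] = math.dist(a[2:], b[2:])
    return distances

for pair, d in pairwise_distances(toy_atoms).items():
    print(pair, round(d, 3))
```

The same idea carries over to the real table by iterating over the rows of a single `molecule_name`.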

## Atoms types

In [None]:
structures.atom.unique()

OK, there are only five different types of atoms (`C`, `H`, `N`, `O` and `F`). 
Let's use the standard [CPK](https://fr.wikipedia.org/wiki/Code_de_couleurs_CPK) color code for their representation:  

<img src="https://images-na.ssl-images-amazon.com/images/I/610rjiH9f5L._SL1500_.jpg" width="50%" />

## Coupling types

In [None]:
train.type.unique()

There are only 8 different coupling types. Each type starts with a number `1`, `2` or `3`, which is the number of bonds between the two coupled atoms.
Couplings will be plotted with
* a black (transparent) line for 1-bond couplings
* a green dashed (transparent) line for 2-bond couplings
* a red dotted (transparent) line for 3-bond couplings
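
The mapping from a coupling type to its line style can be sketched as follows. This is an illustration only: `parse_coupling_type` and `COUPLING_STYLE` are hypothetical names, and the `<n>J<atom><atom>` layout is assumed from type names such as `2JHC`.

```python
# Line style per number of bonds, mirroring the list above.
COUPLING_STYLE = {
    1: {"linestyle": "-", "color": "black"},
    2: {"linestyle": "--", "color": "green"},
    3: {"linestyle": "dotted", "color": "red"},
}

def parse_coupling_type(coupling_type):
    """Split a type such as '2JHC' into (n_bonds, first_atom, second_atom)."""
    # Assumed layout: one digit, the letter J, then two one-letter atom symbols.
    if len(coupling_type) != 4 or coupling_type[1] != "J":
        raise ValueError(f"unexpected coupling type: {coupling_type!r}")
    return int(coupling_type[0]), coupling_type[2], coupling_type[3]

n_bonds, atom_0, atom_1 = parse_coupling_type("2JHC")
print(n_bonds, atom_0, atom_1, COUPLING_STYLE[n_bonds])
```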


## Scalar coupling constant

This is the value which has to be predicted; it can be found in `train.scalar_coupling_constant`.
* The coupling is shown as a transparent connection between the coupled atoms, whose thickness is proportional to the absolute value of the constant

## Bonds

A dedicated dataset has been computed from the molecule topology and the most probable covalent bonds in [Dataset with number of bonds between atoms](https://www.kaggle.com/asauve/dataset-with-number-of-bonds-between-atoms).

Bonds are shown as
* thick black lines for 1-bonds
* thick*1.5 dark green for 1.5-bonds found in COO- groups
* thick*2 green lines for 2-bonds
* thick*3 red lines for 3-bonds
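
The bond styling rules above can be summarized in a small lookup. This is a sketch under assumptions: `bond_style` is a hypothetical helper, the label layout (three characters for the bond order, then the atom pair) is inferred from the helper functions in the visualization code, and labels such as `1.5CO` are illustrative.

```python
# Color per bond order, as listed above.
BOND_COLORS = {"1.0": "black", "1.5": "darkgreen", "2.0": "green", "3.0": "red"}
BASE_WIDTH = 2  # line width of a single bond; other orders scale linearly

def bond_style(bond_type):
    """Map a bond_type label such as '1.5CO' to (atom_pair, color, linewidth)."""
    order, pair = bond_type[:3], bond_type[3:]
    return pair, BOND_COLORS[order], BASE_WIDTH * float(order)

print(bond_style("3.0CC"))
print(bond_style("1.5CO"))
```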

## Visualization code

In [None]:
def view3d_molecule(name, xsize="200px", ysize="200px"):
    """Mouse-clickable 3D view"""
    m = structures[structures.molecule_name == name]
    positions = m[['x','y','z']].values
    v = ase.visualize.view(Atoms(positions=positions, symbols=m.atom.values), 
                           viewer="x3d") 
    return v

cpk = { 
    'C': ("black", 2),
    'H': ("white", 1),
    'O': ("red",   2),
    'N': ("dodgerblue", 2),
    'F': ("green", 2) }

bond_colors = {'1.0':'black', '1.5':'darkgreen', '2.0':'green', '3.0':'red'}

def bond_type_to_pair(bond_type):
    return bond_type[3:]
def bond_type_to_n(bond_type):
    return bond_type[0:3]

def plot_molecule(name, ax=None, bonds=None, charges=None, elev=0, azim=-60):
    """Plot an annotated 3D view of the molecule `name`.

    bonds = if provided, add bonds display from the bond table dataset in https://www.kaggle.com/asauve/predicting-molecular-properties-bonds
    charges = if provided, per-atom charge table; non-zero charges are displayed next to the atoms
    elev = 3D elevation angle [degree] for the molecule view
    azim = 3D azimuth angle [degree]
    """
    if ax is None:
        fig = plt.figure()
        ax = fig.add_subplot(111, projection='3d')
    if (elev != 0) or (azim != -60):
        ax.view_init(elev=elev, azim=azim)
    
    # atoms location
    m = structures[structures.molecule_name == name].sort_values(by='atom_index')
    if (charges is not None):
        charges = charges[charges.molecule_name == name].sort_values(by='atom_index')
        if len(charges) != len(m):
            print(f"Warning bad charges data for molecule {name}")
    
    # formula
    acount = {a : 0 for a in cpk}
    for a in m.atom:
        acount[a] += 1
    formula = ""
    for a in acount:
        if acount[a] == 1:
            formula += a
        elif acount[a] > 1:
            formula += "%s_{%d}" % (a, acount[a])

    ax.set_title(f'{name} ${formula}$')
    
    # display couplings (couplings are not molecular bonds!)
    couples = train[train.molecule_name == name][['atom_index_0', 'atom_index_1', 'type', 'scalar_coupling_constant']]
    for c in couples.itertuples():
        # .iloc[0] picks the single matching row so that x, y, z are scalars
        m0 = m[m.atom_index == c.atom_index_0].iloc[0]
        m1 = m[m.atom_index == c.atom_index_1].iloc[0]
        ax.plot([m0.x, m1.x], [m0.y, m1.y], [m0.z, m1.z],
                linestyle = ['', '-', '--', 'dotted'][int(c.type[0])],
                color     = ['', 'black', 'green', 'red' ][int(c.type[0])],
                linewidth = abs(float(c.scalar_coupling_constant))/5,
                alpha=0.2)
    if bonds is not None:
        for b in bonds[bonds.molecule_name == name].itertuples():
            m0 = m[m.atom_index == b.atom_index_0].iloc[0]
            m1 = m[m.atom_index == b.atom_index_1].iloc[0]
            # the color is given once, through the keyword argument only
            ax.plot([m0.x, m1.x], [m0.y, m1.y], [m0.z, m1.z],
                    linewidth=2*float(b.nbond),
                    color=bond_colors[bond_type_to_n(b.bond_type)])
            
    # display atoms
    ax.scatter(m.x, m.y, m.z, c=[cpk[a][0] for a in m.atom], s=[100*cpk[a][1] for a in m.atom], edgecolor='black')
        
    # display atom index and charges
    for row in m.itertuples():
        # shift the index label towards the lower corner, unless it would leave the axes
        x = float(row.x) - 0.15 if row.x > ax.get_xlim()[0] + 0.15 else row.x
        y = float(row.y) - 0.15 if row.y > ax.get_ylim()[0] + 0.15 else row.y
        z = float(row.z) - 0.15 if row.z > ax.get_zlim()[0] + 0.15 else row.z
        ax.text(x, y, z, str(row.atom_index), color='darkviolet')
        if charges is not None:
            ch = float(charges[charges.atom_index == row.atom_index].charge.iloc[0])
            if ch != 0:
                # shift the charge label towards the upper corner, unless it would leave the axes
                x = float(row.x) + 0.15 if row.x < ax.get_xlim()[1] - 0.15 else row.x
                y = float(row.y) + 0.15 if row.y < ax.get_ylim()[1] - 0.15 else row.y
                z = float(row.z) + 0.15 if row.z < ax.get_zlim()[1] - 0.15 else row.z
                ax.text(x, y, z, f"{ch:+.1f}", color='orangered' if ch > 0 else 'blue',
                        bbox=dict(boxstyle='round', facecolor='white', alpha=0.5, 
                                  edgecolor='black'))
    return ax

ax = plot_molecule("dsgdb9nsd_000007", bonds=train_bonds)

This example shows the three kinds of *scalar coupling*.   
The 1-bond couplings have the largest coupling constants (visible as the thickness of the lines).

In [None]:
view3d_molecule("dsgdb9nsd_000007")

This is the very same molecule, but with the non-annotated `ase` 3D view.   
Both visualizations are complementary for complex molecules.

# Visualization of the 20 first molecules

Now let's show the first 20 molecules. Most of them are common.

* dsgdb9nsd_000001: $CH_4$ Methane 
* dsgdb9nsd_000002: $NH_3$ Ammonia
* dsgdb9nsd_000003: $H_2O$ Water :-) 
* dsgdb9nsd_000005: $HCN$ Hydrogen cyanide (the active agent of the infamous Zyklon B) contains a triple `3CN` bond and exhibits a very high scalar coupling for the `H` on the other side
* dsgdb9nsd_000007: $C_2H_6$ Ethane 
* dsgdb9nsd_000008: $CH_3OH$ Methanol
* dsgdb9nsd_000009: $C_3H_4$ Propyne contains a triple `3CC` bond and also has a very high scalar coupling for the `H` on the other side
* dsgdb9nsd_000010: $CH_3CN$ Acetonitrile
* dsgdb9nsd_000011: $CH_3CHO$ Ethanal or Acetaldehyde
* dsgdb9nsd_000012: $HCONH_2$ Formamide or Methanamide
* dsgdb9nsd_000013: $C_3H_8$ Propane
* dsgdb9nsd_000014: $C_2H_5OH$ Ethanol (alcohol) :-) 
* dsgdb9nsd_000017: $C_2H_4O$ Ethylene oxide or oxirane contains a CCO ring
* dsgdb9nsd_000018: $C_3H_6O$ Acetone
* dsgdb9nsd_000019: $CH_3CONH_2$ Acetamide or Ethanamide
* dsgdb9nsd_000021: $C_4H_{10}$ Butane
* dsgdb9nsd_000023: $C_4H_2$ Diacetylene or Butadiyne has a very uncommon linear structure with two triple `3CC` bonds and a very high scalar coupling
* dsgdb9nsd_000026: $HC_2CHO$ Propynal also contains a triple bond and has a very high scalar coupling
* dsgdb9nsd_000027: $HCOCN$ Formyl cyanide has an uncommon triple bond `3CN` and also has a particularly high 2-bond scalar coupling `2JHC`
* dsgdb9nsd_000028: $OCHCHO$ Glyoxal


In [None]:
nrow = 5
ncol = 4
fig = plt.figure(figsize=(20, 20))
molecules = train.molecule_name.unique()
for i in range(nrow*ncol):
    ax = fig.add_subplot(nrow, ncol, 1+i, projection='3d')
    plot_molecule(molecules[i], ax=ax, bonds=train_bonds)

Butadiyne (see below) is an interesting case because of its linear structure and high coupling. The effect of the triple bonds is visible because the `C` atoms are closer together in `3CC` bonds.
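
One way to check this kind of observation numerically is to average the distance between bonded atoms for each bond order. The sketch below does this in plain Python on a hypothetical linear chain: the coordinates are placeholders rather than measured values, and `mean_length_by_order` is an illustrative helper name.

```python
import math
from collections import defaultdict

# Hypothetical linear chain along the x axis; placeholder coordinates, not real data.
positions = {
    0: (0.0, 0.0, 0.0),
    1: (1.2, 0.0, 0.0),
    2: (2.6, 0.0, 0.0),
    3: (3.8, 0.0, 0.0),
}
# (atom_index_0, atom_index_1, bond order): triple, single, triple.
toy_bonds = [(0, 1, 3.0), (1, 2, 1.0), (2, 3, 3.0)]

def mean_length_by_order(positions, bonds):
    """Average the distance between bonded atoms, grouped by bond order."""
    lengths = defaultdict(list)
    for i, j, order in bonds:
        lengths[order].append(math.dist(positions[i], positions[j]))
    return {order: sum(values) / len(values) for order, values in lengths.items()}

print(mean_length_by_order(positions, toy_bonds))
```

On the real data, the positions would come from `structures` and the bond list from the bonds dataset.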

In [None]:
view3d_molecule("dsgdb9nsd_000023")

# Ionized molecules

A small fraction of the molecules (about 3% in each set) are ionized. Some examples of these beasts are plotted in this section.
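
A fraction like this can be estimated by counting the molecules that carry at least one non-zero atomic charge. The snippet is a minimal sketch on hypothetical records: the molecule names and charges are placeholders, and it assumes that the charge table contains one row per atom for every molecule.

```python
# Hypothetical (molecule_name, atom_index, charge) records; placeholder values.
toy_charges = [
    ("mol_a", 0, 0.0), ("mol_a", 1, 0.0),
    ("mol_b", 0, 1.0), ("mol_b", 1, -0.5), ("mol_b", 2, -0.5),
    ("mol_c", 0, 0.0),
    ("mol_d", 0, 0.0),
]

def ionized_fraction(charge_rows):
    """Fraction of molecules that carry at least one non-zero atomic charge."""
    all_molecules = {name for name, _, _ in charge_rows}
    ionized = {name for name, _, charge in charge_rows if charge != 0}
    return len(ionized) / len(all_molecules)

print(f"{ionized_fraction(toy_charges):.0%} of the toy molecules are ionized")
```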

In [None]:
ionized = train_charges[(train_charges.charge != 0)].molecule_name.unique()

# filter out molecules with failed bonding
errors  = set(train_bonds[train_bonds.error == 1].molecule_name.unique())  # set for fast lookup
ionized = [ name for name in ionized if name not in errors ]

nrow = 3
ncol = 3
fig = plt.figure(figsize=(18, 18))
for i in range(nrow*ncol):
    ax = fig.add_subplot(nrow, ncol, 1+i, projection='3d')
    plot_molecule(ionized[i], ax=ax, bonds=train_bonds, charges=train_charges)

Currently the two supported ionized groups are `COO-` and `NH3+`. For now, the negative charge is assumed to be shared by both `O` atoms of the `COO-` group.

One of these molecules is shown below to illustrate the absence of an `H` atom on the `COO-` group.
There is one extra `H` on the `NH3+` group, which is farther away than the other two `H` (supposedly forming regular bonds).

In [None]:
view3d_molecule("dsgdb9nsd_076394")

# What to do next?

The first run is really instructive and shows, for example, that linear HCCH structures with `3CC` triple bonds produce a strong scalar coupling. 
Hence potentially interesting directions would be to
* study the relation of the scalar constant with the number of bonds on the nearby atoms
* build space oriented features
* study cycles
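
As a starting point for the first two directions, the sketch below computes two simple candidate features in plain Python: the number of bonds attached to each atom, and the angle formed by three atoms. Both helpers (`bonds_per_atom`, `angle_deg`) and the toy inputs are hypothetical illustrations, not part of the datasets used above.

```python
import math
from collections import Counter

def bonds_per_atom(bonds):
    """Count how many bonds touch each atom index."""
    counts = Counter()
    for i, j in bonds:
        counts[i] += 1
        counts[j] += 1
    return counts

def angle_deg(a, b, c):
    """Angle at vertex b, in degrees, formed by the points a-b-c."""
    u = [p - q for p, q in zip(a, b)]
    v = [p - q for p, q in zip(c, b)]
    cos = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    # clamp to [-1, 1] to protect acos against rounding errors
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

# Hypothetical ethane-like topology: atoms 0 and 1 are carbons, 2..7 are hydrogens.
toy_bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6), (1, 7)]
print(bonds_per_atom(toy_bonds))

# Three collinear placeholder points, as in a linear chain.
print(angle_deg((0, 0, 0), (1, 0, 0), (2, 0, 0)))
```

Features like these could then be joined to each coupling row through its two atom indices.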

Have fun!
