# Intro
Welcome to the [Bristol-Myers Squibb – Molecular Translation](https://www.kaggle.com/c/bms-molecular-translation/overview) Competition:

![](https://storage.googleapis.com/kaggle-competitions/kaggle/22422/logos/header.png)

For information about the International Chemical Identifier (InChI), we recommend this [Wikipedia article](https://en.wikipedia.org/wiki/International_Chemical_Identifier).

<span style="color: royalblue;">Please vote the notebook up if it helps you. Feel free to leave a comment above the notebook. Thank you. </span>

# Libraries

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import cv2
import random
import re

import warnings
warnings.filterwarnings("ignore")

# Path

In [None]:
path = '/kaggle/input/bms-molecular-translation/'
os.listdir(path)

# Functions
We define some helper functions for visualizations.

In [None]:
def image_path(image_id, folder='train'):
    # Images are stored in nested folders named after the first three characters of the image id
    return os.path.join(path, folder, image_id[0], image_id[1], image_id[2], image_id + '.png')

def plot_example(image_id):
    # image_id is the row index in train_data, not the image file name
    fig = plt.figure(figsize=(12, 7))
    ax = fig.add_subplot(111)
    filename = train_data.loc[image_id, 'image_id']
    img = cv2.imread(image_path(filename))
    ax.imshow(img)
    ax.set_title(train_data.loc[image_id, 'InChI'])
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    plt.show()

def plot_examples(list_IDs):
    # Plot up to 25 images in a 5x5 grid, titled with the first 20 characters of the InChI
    fig, axs = plt.subplots(5, 5, figsize=(25, 14))
    fig.subplots_adjust(hspace = .2, wspace=.2)
    axs = axs.ravel()
    for ax, row_id in zip(axs, list_IDs):
        filename = train_data.loc[row_id, 'image_id']
        img = cv2.imread(image_path(filename))
        ax.imshow(img)
        ax.set_title(train_data.loc[row_id, 'InChI'][0:20]+'...')
        ax.set_xticklabels([])
        ax.set_yticklabels([])
    plt.show()

# Load Data

In [None]:
train_data = pd.read_csv(path+'train_labels.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')

# Overview

In [None]:
print('Number train samples:', len(train_data.index))
print('Number submission samples:', len(samp_subm.index))

In [None]:
train_data.head()

# Exploratory Data Analysis

## Images
We plot some images, each with the beginning of its InChI string as the title:

In [None]:
list_IDs = random.sample(list(train_data.index), 25)
plot_examples(list_IDs)

## Labels
We count the length of the labels. Every label starts with the string `InChI=`, so we subtract 6 from the total label length:

In [None]:
def length_label(s):
    return len(s)-6

train_data['length_label'] = train_data['InChI'].apply(length_label)

In [None]:
fig = plt.figure(figsize=(8, 5))
train_data['length_label'].hist(bins=100)
plt.title('Distribution of label length', loc='left')
plt.xlabel('Length of label')
plt.ylabel('Frequency')
plt.show()

The labels consist of layers and sublayers, which are separated by the delimiter "/" and start with a characteristic prefix letter.

The six layers with important sublayers are:

1) Main layer

* Chemical formula (no prefix). This is the only sublayer that must occur in every InChI.
* Atom connections (prefix: "c"). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones.
* Hydrogen atoms (prefix: "h"). Describes how many hydrogen atoms are connected to each of the other atoms.

2) Charge layer

* Charge sublayer (prefix: "q")
* Proton sublayer (prefix: "p" for "protons")

3) Stereochemical layer

* Double bonds and cumulenes (prefix: "b")
* Tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
* Type of stereochemistry information (prefix: "s")

4) Isotopic layer (prefixes: "i", "h", as well as "b", "t", "m", "s" for isotopic stereochemistry)

5) Fixed-H layer (prefix: "f"); contains some or all of the above types of layers except atom connections; may end with "o" sublayer; never included in standard InChI

6) Reconnected layer (prefix: "r"); contains the whole InChI of a structure with reconnected metal atoms; never included in standard InChI
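
The structure described above can be made concrete with a small parsing sketch. It is a minimal illustration, not part of the original notebook: the function name `split_sublayers` is hypothetical, the example string is assembled from the sublayers of the first training sample listed further down, and the `InChI=1S` version prefix as well as the presence of a formula as the first part are assumptions.

```python
def split_sublayers(inchi):
    """Return (prefix, content) pairs for every '/'-separated part of an InChI."""
    version, *segments = inchi.split('/')
    pairs = [('version', version)]
    for position, segment in enumerate(segments):
        if position == 0:
            # Assumption: the first part after the version is the chemical formula
            pairs.append(('formula', segment))
        elif segment:
            # All other sublayers start with a one-letter prefix
            pairs.append((segment[0], segment[1:]))
    return pairs

# Assembled from the sublayers of the first training sample (assumed version prefix: InChI=1S)
example = 'InChI=1S/C13H20OS/c1-9(2)8-15-13-6-5-10(3)7-12(13)11(4)14/h5-7,9,11,14H,8H2,1-4H3'
for prefix, content in split_sublayers(example):
    print(prefix, content)
```

A list of pairs is used instead of a dictionary because a prefix such as "t" or "m" can occur more than once in the same InChI.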

In [None]:
def number_layer(s):
    return len(s.split('/'))-1

train_data['number_layer'] = train_data['InChI'].apply(number_layer)

In [None]:
fig = plt.figure(figsize=(8, 5))
train_data['number_layer'].hist(bins=12)
plt.title('Distribution of the number of layers')
plt.xlabel('Number of layers')
plt.ylabel('Frequency')
plt.show()
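
Besides the number of layers, it can be useful to know which sublayer prefixes occur how often. The following sketch counts them with `collections.Counter`. It is an illustration only: the helper name `sublayer_prefixes` is hypothetical, and the two strings are assembled from the first two examples discussed in the next section, with an assumed `InChI=1S` version prefix.

```python
from collections import Counter

def sublayer_prefixes(inchi):
    """Return the one-letter prefixes of all sublayers after the formula."""
    segments = inchi.split('/')[2:]  # skip the version part and the formula
    return [segment[0] for segment in segments if segment]

# Two strings assembled from sublayers listed in this notebook (assumed version prefix: InChI=1S)
examples = [
    'InChI=1S/C13H20OS/c1-9(2)8-15-13-6-5-10(3)7-12(13)11(4)14/h5-7,9,11,14H,8H2,1-4H3',
    'InChI=1S/C17H10BrN3O/c18-14-7-3-1-5-11(14)9-12(10-19)16-20-15-8-4-2-6-13(15)17(22)21-16/h1-9,13H/b12-9+',
]
prefix_counts = Counter(p for inchi in examples for p in sublayer_prefixes(inchi))
print(prefix_counts)
```

On the full data, the same helper could be applied to the `InChI` column of `train_data` instead of the two hand-written strings.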

## Focus On Examples
For information about individual molecules, we recommend [PubChem](https://pubchem.ncbi.nlm.nih.gov/). Next we consider three examples with different numbers of layers.

### 1-[5-methyl-2-(2-methylpropylsulfanyl)phenyl]ethanol
We focus on the first training sample. For more information, look [here](https://pubchem.ncbi.nlm.nih.gov/compound/82265033).

In [None]:
plot_example(0)

In [None]:
train_data.loc[0, 'InChI']

This is the layer structure of the example: <br>
1) Main Layer<br>
* Formula: C<sub>13</sub>H<sub>20</sub>OS
* Atom connections: c1-9(2)8-15-13-6-5-10(3)7-12(13)11(4)14
* Hydrogen atoms: h5-7,9,11,14H,8H2,1-4H3
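
As a worked check, the hydrogen sublayer can be expanded and compared with the formula. In `h5-7,9,11,14H,8H2,1-4H3`, atoms 5 to 7, 9, 11 and 14 carry one hydrogen each, atom 8 carries two, and atoms 1 to 4 carry three, which gives 6 + 2 + 12 = 20 hydrogens, matching H<sub>20</sub>. The sketch below implements this count. The function name `count_fixed_hydrogens` is hypothetical, and it deliberately does not handle mobile-hydrogen groups such as `(H,20,21)`.

```python
import re

def count_fixed_hydrogens(h_sublayer):
    """Count hydrogens in an 'h' sublayer (without the prefix and without mobile-H groups)."""
    total = 0
    pending = 0  # number of atoms collected since the last 'H' marker
    for token in h_sublayer.split(','):
        match = re.fullmatch(r'(\d+)(?:-(\d+))?(?:H(\d*))?', token)
        if match is None:
            raise ValueError(f'unsupported token: {token!r}')
        start, end, h_count = match.groups()
        # A token is either a single atom number or a range such as 5-7
        pending += int(end) - int(start) + 1 if end else 1
        if h_count is not None:
            # 'H' means one hydrogen per collected atom, 'H2' two, and so on
            total += pending * (int(h_count) if h_count else 1)
            pending = 0
    return total

print(count_fixed_hydrogens('5-7,9,11,14H,8H2,1-4H3'))
```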

### (E)-3-(2-Bromophenyl)-2-(4-oxo-4aH-quinazolin-2-yl)prop-2-enenitrile
We focus on the example at index 6. For more information, look [here](https://pubchem.ncbi.nlm.nih.gov/compound/133560351).

In [None]:
plot_example(6)

In [None]:

train_data.loc[6, 'InChI']

This is the layer structure of the example: <br>
1) Main Layer<br>
* Formula: C<sub>17</sub>H<sub>10</sub>BrN<sub>3</sub>O
* Atom connections: c18-14-7-3-1-5-11(14)9-12(10-19)16-20-15-8-4-2-6-13(15)17(22)21-16
* Hydrogen atoms: h1-9,13H

3) Stereochemical layer
   
* double bonds and cumulenes: b12-9+

### N-[(1R,2Z)-2-[(3Ar,5S,6aR)-5-[(4R)-2,2-dimethyl-1,3-dioxolan-4-yl]-2,2-dimethyl-3a,6a-dihydrofuro[2,3-d][1,3]dioxol-6-ylidene]-1-deuterioethyl]-2,2,2-trichloroacetamide

We focus on the example at index 774,948, which has 10 layers. For more information, look [here](https://pubchem.ncbi.nlm.nih.gov/compound/134870524).

In [None]:
plot_example(774948)

In [None]:
train_data.loc[774948, 'InChI']

This is the layer structure of the example: <br>
1) Main Layer<br>
* Formula: C<sub>16</sub>H<sub>22</sub>Cl<sub>3</sub>NO<sub>6</sub>
* Atom connections: c1-14(2)22-7-9(24-14)10-8(5-6-20-13(21)16(17,18)19)11-12(23-10)26-15(3,4)25-11
* Hydrogen atoms: h5,9-12H,6-7H2,1-4H3,(H,20,21)

3) Stereochemical layer
   
* double bonds and cumulenes: b8-5-
* tetrahedral stereochemistry of atoms and allenes (prefix "t"): t9-,10+,11-,12-
* tetrahedral stereochemistry of atoms and allenes (prefix "m"): m1
* type of stereochemistry information: s1

4) Isotopic layer

* isotopic atoms: i6D
* isotopic stereochemistry (prefix "t"): t6-,9+,10-,11+,12+
* isotopic stereochemistry (prefix "m"): m0
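
Because the prefixes "t" and "m" appear twice in this example, a lookup by prefix alone is ambiguous. One way to resolve this is to assign every sublayer after the first "i" prefix to the isotopic layer. The sketch below does that. It is an assumption-laden illustration: the name `group_layers` is hypothetical, the string is assembled from the sublayers listed above with an assumed `InChI=1S` version prefix, and a formula is assumed to be present.

```python
MAIN_PREFIXES = {'c', 'h'}
CHARGE_PREFIXES = {'q', 'p'}
STEREO_PREFIXES = {'b', 't', 'm', 's'}

def group_layers(inchi):
    """Group the '/'-separated sublayers of an InChI into named layers."""
    version, formula, *segments = inchi.split('/')
    layers = {'main': [formula], 'charge': [], 'stereo': [], 'isotopic': [], 'other': []}
    in_isotopic = False
    for segment in segments:
        prefix = segment[0]
        if prefix == 'i':
            in_isotopic = True
        if in_isotopic:
            # Everything from the 'i' sublayer onwards is treated as isotopic
            layers['isotopic'].append(segment)
        elif prefix in MAIN_PREFIXES:
            layers['main'].append(segment)
        elif prefix in CHARGE_PREFIXES:
            layers['charge'].append(segment)
        elif prefix in STEREO_PREFIXES:
            layers['stereo'].append(segment)
        else:
            layers['other'].append(segment)
    return layers

# Assembled from the sublayers listed above (assumed version prefix: InChI=1S)
example = ('InChI=1S/C16H22Cl3NO6/c1-14(2)22-7-9(24-14)10-8(5-6-20-13(21)16(17,18)19)'
           '11-12(23-10)26-15(3,4)25-11/h5,9-12H,6-7H2,1-4H3,(H,20,21)'
           '/b8-5-/t9-,10+,11-,12-/m1/s1/i6D/t6-,9+,10-,11+,12+/m0')
for name, sublayers in group_layers(example).items():
    print(name, sublayers)
```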

# Main Layer - Formula
The formula sublayer must occur in every InChI. To focus on the formula, we extract the first sublayer after the version prefix.

In [None]:
def split_formular_layer(s):
    return s.split('/')[1]

train_data['formular'] = train_data['InChI'].apply(split_formular_layer)

There are a lot of duplicates:

In [None]:
train_data['formular'].value_counts()

In total there are 329,768 different formulas.

The next step is to split each formula into its atoms by their symbol and count.

In [None]:
def split_formular(formular):
    dict_formular = {k: int(v) if v else 1 for k,v in
                     re.findall(r"([A-Z][a-z]?)(\d+)?", formular)}
    return dict_formular

# Test the function
test_formular = train_data.loc[0, 'formular']
print(test_formular)
split_formular(test_formular)

In [None]:
# Count the atoms per element for the 100 most frequent formulas
elements = ['C', 'H', 'Br', 'Cl', 'I', 'F', 'N', 'O', 'S', 'Si', 'P']
list_formulars = list(train_data['formular'].value_counts().keys())
rows = [split_formular(formular) for formular in list_formulars[0:100]]
df_formular = pd.DataFrame(rows, index=list_formulars[0:100])
# Put the expected elements first, keep any additional ones, and set absent elements to 0
extra = [c for c in df_formular.columns if c not in elements]
df_formular = df_formular.reindex(columns=elements + extra).fillna(0).astype(int)

In [None]:
df_formular.head()

To be continued ...

# Data Generator
We define a data generator to load the image data on demand.

*Coming Soon*
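
Until then, the idea of loading data on demand can be sketched with a plain Python generator. This is not the author's generator, only a minimal illustration under stated assumptions: the names `batch_generator` and `load_image` are hypothetical, and the loader is passed in as a function so that the sketch stays independent of any image library.

```python
import random

def batch_generator(image_ids, load_image, batch_size=32, shuffle=False, seed=None):
    """Yield (batch_ids, batch_images) pairs, loading images only when a batch is requested."""
    ids = list(image_ids)
    if shuffle:
        # A seeded Random instance keeps the shuffle reproducible
        random.Random(seed).shuffle(ids)
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        # load_image is called here, so only one batch is held in memory at a time
        yield batch_ids, [load_image(image_id) for image_id in batch_ids]

# Toy usage with a stand-in loader; a real loader could read the PNG file for each id
for batch_ids, batch_images in batch_generator(['a', 'b', 'c'], str.upper, batch_size=2):
    print(batch_ids, batch_images)
```

In this notebook, `load_image` could for instance wrap `cv2.imread` together with the folder layout used in the plotting helpers.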

# Export

In [None]:
output = samp_subm
output.to_csv('submission.csv', index=False)

In [None]:
output.head()