<div style="background-color:orange;">
    <center><h1>Bristol-Myers Squibb – Molecular Translation</h1></center>
</div>


<center><img src="https://storage.googleapis.com/kaggle-competitions/kaggle/22422/logos/header.png?t=2021-02-03-02-05-31"> </center>

<div style="background-color:orange">
    <center><h2>About the Competition and What Is InChI? 🤷‍♂️</h2></center>
</div>

<div>
Before looking at the competition itself, we need to understand what InChI is.

InChI (pronounced "en-chee") stands for International Chemical Identifier.<br/>
It is a computer-readable standard for representing chemical structures, created by chemists around the world under IUPAC.

In other words, InChI is a rule-based text key that can be used to look up chemical structures.<br/>
The key is divided into layers that describe the following:<br/>

1. The chemical formula.
2. How the atoms in the structure are connected.
3. The hydrogen atoms in the structure.
4. Various other properties, such as those of the bonds.

See the video below to understand InChI better.
</div>

<iframe width="853" height="480" src="https://www.youtube.com/embed/rAnJ5toz26c" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

As you can see in the video, every molecular structure can be represented as an InChI.<br/>
When a new compound is created, its InChI is created along with it.<br/>
However, millions of previously created structures have no InChI,<br/>
so creating one for each of them manually is not feasible.

So, we have to build a machine learning model that generates the InChI from an image of the structure.

<center style="margin-left:25%;width:50%;font-family:Source Code Pro;background-color:yellow;color:black;">
    <h3>Example of an InChI for a structure</h3>
</center>
<img src="https://www.inchi-trust.org/wp/wp-content/uploads/2014/05/techfaq_4.2.png">
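The layers described above can be pulled apart with plain string handling. The following is a minimal sketch, not an official parser: `split_inchi_layers` is a hypothetical helper, and it assumes a standard InChI in which the first part is the version, the second is the chemical formula, and every later layer starts with a one-letter prefix such as `c` (connections) or `h` (hydrogens).

```python
def split_inchi_layers(inchi):
    """Split an InChI string into its layers (illustrative helper, not a full parser)."""
    # Drop the "InChI=" prefix, then split the rest on the layer separator
    _, _, body = inchi.partition('=')
    parts = body.split('/')
    # Assumption: parts[0] is the version and parts[1] is the chemical formula
    layers = {'version': parts[0], 'formula': parts[1] if len(parts) > 1 else ''}
    # Remaining layers are keyed by their one-letter prefix
    for part in parts[2:]:
        if part:
            layers[part[0]] = part[1:]
    return layers

layers = split_inchi_layers('InChI=1S/C3H6O/c1-3(2)4/h1-2H3')
print(layers)
```

For this string, the formula layer is `C3H6O`, the `c` layer holds the connections and the `h` layer holds the hydrogen atoms.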

In [None]:
!pip install py3Dmol

### Importing Libraries 📗

In [None]:
import os
import gc
import cv2
import sys
import time
import tqdm
import random
import numpy as np
import pandas as pd
import seaborn as sns
import datatable as dt
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import py3Dmol 

import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff

from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.manifold import TSNE

from colorama import Fore, Back, Style
y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
c_ = Fore.CYAN
sr_ = Style.RESET_ALL

import warnings
warnings.filterwarnings('ignore')

import lightgbm as lgb

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import KFold, StratifiedKFold

<div style="background-color:orange">
    <center><h2>1. Given Data  💽</h2></center>
</div>


There are two folders, "train" and "test", each containing images nested in three levels of subfolders.<br/>

train_labels.csv has two columns: image_id and the InChI string for that image.<br/>
The first three characters of the image id give the path of the folder in which the image is stored.<br/>

**Note: some images have some form of corruption**

For example, the image with image_id 000011a64c74 has the path train/0/0/0/000011a64c74.png.<br/>
The same applies to images in the test data.

sample_submission.csv contains the same two columns, image_id and InChI.

In [None]:
train_path = '../input/bms-molecular-translation/train/'
test_path  = '../input/bms-molecular-translation/test/'

train_data= dt.fread('../input/bms-molecular-translation/train_labels.csv').to_pandas()
sample = dt.fread('../input/bms-molecular-translation/sample_submission.csv').to_pandas()

In [None]:
print("{0}Number of rows in train data: {1}{2}\n{0}Number of columns in train data: {1}{3}".format(y_,r_,train_data.shape[0],train_data.shape[1]))
print("{0}Number of rows in sample data: {1}{2}\n{0}Number of columns in sample data: {1}{3}".format(c_,r_,sample.shape[0],sample.shape[1]))

In [None]:
train_data.head()

<div style="background-color:orange">
    <center><h2>2. Understanding Metrics 📐 </h2></center>
</div>

In this competition, submissions are evaluated on the mean [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between the submitted and the true InChI strings.

It measures the difference between two strings.

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to change one string into the other.

<center style="margin: 1em 0 .6em 0;
	padding: 0 0 0 20px;
	font-weight: normal;
	color: white;
	font-family: 'Hammersmith One', sans-serif;
	text-shadow: 0 -1px 0 rgba(0,0,0,0.4);
	position: relative;
	font-size: 14px;
	line-height: 40px;">
    <h3>Formula for Levenshtein distance </h3>
</center>

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/10554aecc5e56da9be4657acd75b9a67b5e8b394">

where the tail of x is the string of all but the first character of x, and x[n] is the nth character of x.
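The recursive definition can be written down almost literally. The sketch below is only for illustration on short strings: `lev_recursive` is a hypothetical name, it uses start offsets `i` and `j` in place of building tail strings, and it relies on `functools.lru_cache` to avoid recomputing subproblems. Long InChI strings could hit Python's recursion limit, so a table-based version is better suited to real data.

```python
from functools import lru_cache

def lev_recursive(a, b):
    """Levenshtein distance following the recursive definition (short strings only)."""
    @lru_cache(maxsize=None)
    def lev(i, j):
        # a[i:] and b[j:] play the role of the remaining tails
        if i == len(a):
            return len(b) - j          # a is exhausted: insert the rest of b
        if j == len(b):
            return len(a) - i          # b is exhausted: delete the rest of a
        if a[i] == b[j]:
            return lev(i + 1, j + 1)   # first characters match: no edit needed
        return 1 + min(lev(i + 1, j),      # deletion
                       lev(i, j + 1),      # insertion
                       lev(i + 1, j + 1))  # substitution
    return lev(0, 0)

print(lev_recursive('kitten', 'sitting'))
```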

In [None]:
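def levenshtein_distance(s, t):
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Base cases: converting a prefix to or from the empty string costs its length
    distance[:, 0] = np.arange(rows)
    distance[0, :] = np.arange(cols)

    for col in range(1, cols):
        for row in range(1, rows):
            # Substitution is free when the characters already match
            cost = 0 if s[row-1] == t[col-1] else 1
            distance[row][col] = min(distance[row-1][col] + 1,       # deletion
                                     distance[row][col-1] + 1,       # insertion
                                     distance[row-1][col-1] + cost)  # substitution

    # The bottom-right cell holds the distance between the full strings;
    # indexing it directly also works when s or t is empty
    return distance[-1][-1]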

In [None]:
levenshtein_distance('InChI=1S/C13H20OS/c1-9(2)8-15-13-6-5-10(3)7-12(13)11(4)14/h5-7,9,11,14H,8H2,1-4H3',
                    'InChI=1S/C13H20OS/c1-9(2)8-15-13-6-5-10(3)7-12(13)11(4)14/h5-7,9,11,4H,8H2,1-4H3')

Can you find that one difference? 😉
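Since the score is the mean distance over all images, it helps to see that averaging step spelled out. This is a rough sketch under assumptions: `edit_distance` and `mean_levenshtein` are hypothetical helper names, the two lists are assumed to be aligned and of equal length, and the distance is computed in pure Python with two rolling rows so the block runs without NumPy.

```python
def edit_distance(s, t):
    """Levenshtein distance using two rolling rows of the dynamic-programming table."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]

def mean_levenshtein(y_true, y_pred):
    """Average edit distance over aligned lists of true and predicted strings."""
    distances = [edit_distance(a, b) for a, b in zip(y_true, y_pred)]
    return sum(distances) / len(distances)

score = mean_levenshtein(['abc', 'kitten'], ['abc', 'sitting'])
print(score)
```

A lower score is better; a score of 0 means every predicted string matches its target exactly.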

<div style="background-color:orange">
    <center><h2>3. Let's Look at a Sample Image</h2></center>
</div>

In [None]:
def get_image_path(image_id,train=True):
    path = '../input/bms-molecular-translation/{}/{}/{}/{}/{}.png'
    if train:
        return path.format('train',image_id[0],image_id[1],image_id[2],image_id)
    else:
        return path.format('test',image_id[0],image_id[1],image_id[2],image_id)

def show_image(image_id,InChi):
    path = get_image_path(image_id)
    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Recolor for display: pure-black strokes become green, then pixels
    # with no zero channel (the light background) become dark blue
    image[np.where((image==[0,0,0]).all(axis=2))] = [0,255,0]
    image[np.where((image!=[0,0,0]).all(axis=2))] = [0,38,87]
    
    plt.imshow(image)
    plt.title(f"{InChi[:40]}...",size=8)
    plt.axis('off')

In [None]:
show_image(train_data['image_id'][1],train_data['InChI'][1])

<div style="background-color:orange">
    <center><h2>4. Look at 20 random samples</h2></center>
</div>

In [None]:
def show_some_images(df=train_data):
    n = np.random.choice(range(len(df)),size=20)
    plt.figure(figsize=(20,20))
    for i,j in enumerate(n):
        plt.subplot(5,4,i+1)
        show_image(df['image_id'][j],df['InChI'][j])
        plt.axis('off')

In [None]:
show_some_images()

<div style="background-color:orange">
    <center><h2>5. Distribution of shape of images</h2></center>
</div>

In [None]:
df = train_data.sample(1000)
heights, widths, percentages = [], [], []

for image_id in df['image_id']:
    path = get_image_path(image_id)
    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Share of pure-black pixels in the image, as a percentage of all pixels
    black = (image==[0,0,0]).all(axis=2)
    percentage = 100 * black.mean()
    heights.append(image.shape[0])
    widths.append(image.shape[1])
    percentages.append(percentage)

df['height'] = heights
df['width'] = widths
df['black_pixels_%']= percentages

In [None]:
def distribution1(feature,color1,color2,df=train_data):
    plt.figure(figsize=(15,7))
    
    plt.subplot(121)
    # histplot replaces the deprecated distplot; density bars plus a KDE curve
    dist = sns.histplot(df[feature],color=color1,kde=True,stat='density')
    a = dist.patches
    xy = [(a[i].get_x() + a[i].get_width() / 2,a[i].get_height()) \
          for i in range(1,len(a)-1) if (a[i].get_height() > a[i-1].get_height() and a[i].get_height() > a[i+1].get_height())]
    
    for i,j in xy:
        dist.annotate(
            f"{i:.3f}",  # label text, passed positionally (the `s` keyword was renamed in newer matplotlib)
            xy=(i,j), 
            xycoords='data',
            ha='center', 
            va='center', 
            fontsize=11, 
            color='black',
            xytext=(0,7), 
            textcoords='offset points',
        )
    
    qnt = df[feature].quantile([.25, .5, .75]).reset_index(level=0).to_numpy()
    plt.subplot(122)
    box = sns.boxplot(x=df[feature],color=color2)
    for i,j in qnt:
        box.annotate(str(j)[:4],xy= (j-.05,-0.01),horizontalalignment='center')
        
    print("{}Max value of {} is: {} {:.2f} \n{}Min value of {} is: {} {:.2f}\n{}Mean of {} is: {}{:.2f}\n{}Standard Deviation of {} is:{}{:.2f}"\
      .format(y_,feature,r_,df[feature].max(),g_,feature,r_,df[feature].min(),b_,feature,r_,df[feature].mean(),m_,feature,r_,df[feature].std()))

<div style="background-color:pink">
    <center><h3>5.1 Distribution of height of images</h3></center>
</div>

In [None]:
distribution1('height','green','red',df=df)

<div style="background-color:pink">
    <center><h3>5.2 Distribution of width of images</h3></center>
</div>

In [None]:
distribution1('width','yellow','lightblue',df=df)

<div style="background-color:pink">
    <center><h3>5.3 Distribution of % of black pixels in images</h3></center>
</div>

In [None]:
distribution1('black_pixels_%','red','orange',df=df)

<div style="background-color:orange">
    <center><h2>6. Analyse Main layer</h2></center>
</div>

In [None]:
%%time
import re
from collections import Counter

temp_df = train_data.sample(n=100000)

# Running total of atoms per element across all sampled formulas
element_counts = Counter()

def get_main_layer(row):
    # The chemical formula is the first layer after the "InChI=1S" prefix
    return row['InChI'].split('/')[1]

def get_element_counts(row):
    elements = list()
    numbers = list()
    # An element symbol is one uppercase letter plus an optional lowercase letter;
    # a missing count means a single atom
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', row['MainLayer']):
        count = int(number) if number else 1
        elements.append(element)
        numbers.append(count)
        element_counts[element] += count
    row['elements'] = elements
    row['counts'] = numbers
    return row
    
temp_df['MainLayer'] = temp_df.apply(get_main_layer,axis=1)
temp_df = temp_df.apply(get_element_counts,axis=1)

element_counts = pd.DataFrame({"elements":list(element_counts.keys()),'counts':list(element_counts.values())})

<div style="background-color:pink">
    <center><h3>6.1 Count of elements in 100000 samples</h3></center>
</div>

In [None]:
fig = px.bar(element_counts,x='elements',y='counts')
fig.show()

Obviously, C, H, O, N, B, F and S are the most common elements in the structures.

<div style="background-color:orange">
    <center><h2>7. 3D models</h2></center>
</div>

You need the CID (PubChem compound ID) of a structure to get its 3D model, which you can look up [here](https://pubchem.ncbi.nlm.nih.gov/#query=InChI%3D1S%2FC3H6O%2Fc1-3(2)4%2Fh1-2H3).
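A search link like the one above can be built for any InChI string by percent-encoding it. This is a small sketch that assumes the PubChem search page keeps accepting the `#query=` format used in that link; `pubchem_search_url` is a hypothetical helper name, and the CID still has to be read off the results page by hand.

```python
from urllib.parse import quote

def pubchem_search_url(inchi):
    """Build a PubChem search link for an InChI string (URL format taken from the link above)."""
    # Percent-encode characters such as '=' and '/'; parentheses are kept as they are
    return 'https://pubchem.ncbi.nlm.nih.gov/#query=' + quote(inchi, safe='()')

url = pubchem_search_url('InChI=1S/C3H6O/c1-3(2)4/h1-2H3')
print(url)
```

In the notebook, the same helper could be called with a value from the training labels, for example `pubchem_search_url(train_data['InChI'][3])`.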

In [None]:
train_data['InChI'][3]

In [None]:
def show_3d_models(cid):
    view = py3Dmol.view(width=600, height=1000, query=cid, viewergrid=(2,1), linked=False)
    view.setStyle({'stick': {}}, viewer=(0,0))
    view.setStyle({'sphere': {}}, viewer=(1,0))
    view.setBackgroundColor('#1AD40D', viewer=(0,0))
    view.setBackgroundColor('#1AD40D', viewer=(1,0))
    view.show()

In [None]:
show_image(train_data['image_id'][3],train_data['InChI'][3])

In [None]:
cid = 'cid:120539154'

show_3d_models(cid)

## 🚧 Work in Progress 🚧