# Introduction
I share a simple example that uses SchNet to predict scalar coupling constants.
I hope this kernel helps beginners in deep learning and can be used as a starter kit.

The core idea is the same as Heng's: a GNN is used as a feature extractor.
The feature vectors of the two atoms in a pair are concatenated and fed into a regression head.
More details of his idea and the related discussion can be found on the page below.

* Which graph CNN is the best (with starter kit at LB -1.469)?<br>https://www.kaggle.com/c/champs-scalar-coupling/discussion/93972#latest-591759

Due to the runtime limits of Kaggle kernels, the model is not trained to convergence in this kernel.
You should be able to get a better score by training for longer.

# Install packages
In this example, I use Chainer Chemistry, which offers an implementation of SchNet.
The library can be installed with pip.
* Chainer Chemistry: A Library for Deep Learning in Biology and Chemistry<br>https://github.com/pfnet-research/chainer-chemistry

In [1]:
# !pip uninstall -y tensorflow
# !pip install chainer-chemistry==0.5.0

# Import packages
Next, I import the main packages. Other submodules are imported later, where they are used.

In [1]:
import random
import numpy as np
import pandas as pd
import chainer
import chainer_chemistry
from IPython.display import display

# Load dataset
In this example, 90% of the training molecules are used for actual training, and the remaining 10% are used for validation.
Each dataset is then grouped by `molecule_name` for the steps that follow.

In [2]:
def load_dataset():

    train = pd.merge(pd.read_csv('../input/train.csv'),
                     pd.read_csv('../input/scalar_coupling_contributions.csv'))

    test = pd.read_csv('../input/test.csv')

    counts = train['molecule_name'].value_counts()
    moles = list(counts.index)

    random.shuffle(moles)

    num_train = int(len(moles) * 0.9)
    train_moles = sorted(moles[:num_train])
    valid_moles = sorted(moles[num_train:])
    test_moles = sorted(list(set(test['molecule_name'])))

    valid = train.query('molecule_name not in @train_moles')
    train = train.query('molecule_name in @train_moles')

    train.sort_values('molecule_name', inplace=True)
    valid.sort_values('molecule_name', inplace=True)
    test.sort_values('molecule_name', inplace=True)

    return train, valid, test, train_moles, valid_moles, test_moles

train, valid, test, train_moles, valid_moles, test_moles = load_dataset()

train_gp = train.groupby('molecule_name')
valid_gp = valid.groupby('molecule_name')
test_gp = test.groupby('molecule_name')

structures = pd.read_csv('../input/structures.csv')
structures_groups = structures.groupby('molecule_name')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
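
The `SettingWithCopyWarning` above is raised because `sort_values(..., inplace=True)` is called on frames returned by `query`. The split itself still works, but the warning can be avoided. Below is a minimal sketch of one way to do that, assuming a hypothetical helper name `split_by_molecule` and a fixed seed, neither of which is in the original notebook. It is run on a toy frame so that it does not depend on the competition files.

```python
import random

import pandas as pd


def split_by_molecule(df, train_frac=0.9, seed=0):
    """Split rows into train/valid so that no molecule appears in both.

    Hypothetical helper, not part of the original notebook.
    """
    moles = sorted(df['molecule_name'].unique())
    random.Random(seed).shuffle(moles)  # seeded for a reproducible split

    num_train = int(len(moles) * train_frac)
    is_train = df['molecule_name'].isin(moles[:num_train])

    # Boolean indexing followed by a non-inplace sort returns new frames,
    # so pandas has no ambiguous view/copy to warn about.
    train = df[is_train].sort_values('molecule_name')
    valid = df[~is_train].sort_values('molecule_name')
    return train, valid


# Toy data: 10 molecules with 2 rows each.
toy = pd.DataFrame({
    'molecule_name': ['m{:02d}'.format(i // 2) for i in range(20)],
    'scalar_coupling_constant': range(20),
})
toy_train, toy_valid = split_by_molecule(toy)
print(toy_train['molecule_name'].nunique(), toy_valid['molecule_name'].nunique())
```

Splitting on molecule names rather than on rows keeps all couplings of one molecule on the same side of the split, which is also what `load_dataset` does.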


## train data

In [3]:
display(train.head())

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant,fc,sd,pso,dso
0,0,dsgdb9nsd_000001,1,0,1JHC,84.8076,83.0224,0.254579,1.25862,0.27201
1,1,dsgdb9nsd_000001,1,2,2JHH,-11.257,-11.0347,0.352978,2.85839,-3.4336
2,2,dsgdb9nsd_000001,1,3,2JHH,-11.2548,-11.0325,0.352944,2.85852,-3.43387
3,3,dsgdb9nsd_000001,1,4,2JHH,-11.2543,-11.0319,0.352934,2.85855,-3.43393
4,4,dsgdb9nsd_000001,2,0,1JHC,84.8074,83.0222,0.254585,1.25861,0.272013


## validation data

In [4]:
display(valid.head())

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant,fc,sd,pso,dso
187,187,dsgdb9nsd_000018,4,0,1JHC,82.1639,80.284,0.177421,1.06786,0.634598
210,210,dsgdb9nsd_000018,9,2,1JHC,82.162,80.282,0.177379,1.06795,0.634608
209,209,dsgdb9nsd_000018,9,1,2JHC,-2.59645,-2.64044,0.151414,-0.092291,-0.015137
207,207,dsgdb9nsd_000018,8,9,2JHH,-10.6034,-10.614,0.340467,2.40096,-2.73088
206,206,dsgdb9nsd_000018,8,2,1JHC,88.4762,86.8365,0.218369,0.794887,0.626462


## test data

In [5]:
display(test.head())

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type
0,4658147,dsgdb9nsd_000004,2,0,2JHC
1,4658148,dsgdb9nsd_000004,2,1,1JHC
2,4658149,dsgdb9nsd_000004,2,3,3JHH
3,4658150,dsgdb9nsd_000004,3,0,1JHC
4,4658151,dsgdb9nsd_000004,3,1,2JHC


## structures

In [6]:
display(structures.head())

Unnamed: 0,molecule_name,atom_index,atom,x,y,z
0,dsgdb9nsd_000001,0,C,-0.012698,1.085804,0.008001
1,dsgdb9nsd_000001,1,H,0.00215,-0.006031,0.001976
2,dsgdb9nsd_000001,2,H,1.011731,1.463751,0.000277
3,dsgdb9nsd_000001,3,H,-0.540815,1.447527,-0.876644
4,dsgdb9nsd_000001,4,H,-0.523814,1.437933,0.906397


# Preprocessing
I implemented a class named `Graph`; each instance represents one molecule.
The pairwise distances between atoms are calculated in the initializer of this class, and atoms closer than 1.5 are treated as bonded.
## Define Graph class

In [7]:
from scipy.spatial import distance


class Graph:

    def __init__(self, points_df, list_atoms):

        self.points = points_df[['x', 'y', 'z']].values

        self._dists = distance.cdist(self.points, self.points)

        self.adj = self._dists < 1.5
        self.num_nodes = len(points_df)

        self.atoms = points_df['atom']
        dict_atoms = {at: i for i, at in enumerate(list_atoms)}

        atom_index = [dict_atoms[atom] for atom in self.atoms]
        one_hot = np.identity(len(dict_atoms))[atom_index]

        bond = np.sum(self.adj, 1) - 1
        bonds = np.identity(len(dict_atoms))[bond - 1]

        self._array = np.concatenate([one_hot, bonds], axis=1).astype(np.float32)

    @property
    def input_array(self):
        return self._array

    @property
    def dists(self):
        return self._dists.astype(np.float32)
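
One detail of the initializer is worth noting: `np.identity(n)[bond - 1]` maps a neighbour count of 0 to index -1 (the last column) and raises `IndexError` for counts above `n`. The sketch below is one possible alternative, not the method used in this kernel. The helper name and the `max_neighbors` cap are assumptions, and it produces one more column than the original features because it keeps a separate column for "no neighbours".

```python
import numpy as np


def one_hot_neighbor_count(adj, max_neighbors=4):
    """One-hot encode the number of neighbours of each atom.

    Hypothetical helper. Column k means "k neighbours"; counts above
    max_neighbors are clipped into the last column.
    """
    counts = np.sum(adj, axis=1) - 1          # subtract the self-match on the diagonal
    counts = np.clip(counts, 0, max_neighbors)
    return np.identity(max_neighbors + 1, dtype=np.float32)[counts]


# Toy adjacency: atoms 0 and 1 are bonded, atom 2 is isolated.
toy_adj = np.array([[True, True, False],
                    [True, True, False],
                    [False, False, True]])
toy_bonds = one_hot_neighbor_count(toy_adj)
print(toy_bonds)
```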

In [8]:
points_df = structures_groups.get_group('dsgdb9nsd_000055')
points_df

Unnamed: 0,molecule_name,atom_index,atom,x,y,z
450,dsgdb9nsd_000055,0,C,-0.00856,1.542701,0.001527
451,dsgdb9nsd_000055,1,C,0.005068,0.007026,0.018562
452,dsgdb9nsd_000055,2,C,0.761377,-0.518392,1.247664
453,dsgdb9nsd_000055,3,C,-1.420128,-0.549332,-0.02116
454,dsgdb9nsd_000055,4,O,0.625905,-0.483637,-1.176571
455,dsgdb9nsd_000055,5,H,1.013614,1.940208,0.002811
456,dsgdb9nsd_000055,6,H,-0.514597,1.907025,-0.897058
457,dsgdb9nsd_000055,7,H,-0.522148,1.948825,0.879183
458,dsgdb9nsd_000055,8,H,0.281306,-0.201655,2.179436
459,dsgdb9nsd_000055,9,H,0.799655,-1.611317,1.230225


In [9]:
list_atoms = list(set(structures['atom']))
list_atoms

['N', 'H', 'F', 'C', 'O']

In [10]:
train_gp.get_group('dsgdb9nsd_000001')

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant,fc,sd,pso,dso
0,0,dsgdb9nsd_000001,1,0,1JHC,84.8076,83.0224,0.254579,1.25862,0.27201
1,1,dsgdb9nsd_000001,1,2,2JHH,-11.257,-11.0347,0.352978,2.85839,-3.4336
2,2,dsgdb9nsd_000001,1,3,2JHH,-11.2548,-11.0325,0.352944,2.85852,-3.43387
3,3,dsgdb9nsd_000001,1,4,2JHH,-11.2543,-11.0319,0.352934,2.85855,-3.43393
4,4,dsgdb9nsd_000001,2,0,1JHC,84.8074,83.0222,0.254585,1.25861,0.272013
5,5,dsgdb9nsd_000001,2,3,2JHH,-11.2541,-11.0317,0.352932,2.85856,-3.43395
6,6,dsgdb9nsd_000001,2,4,2JHH,-11.2548,-11.0324,0.352943,2.85853,-3.43387
7,7,dsgdb9nsd_000001,3,0,1JHC,84.8093,83.0241,0.254634,1.25856,0.272012
8,8,dsgdb9nsd_000001,3,4,2JHH,-11.2543,-11.0319,0.352943,2.85856,-3.43393
9,9,dsgdb9nsd_000001,4,0,1JHC,84.8095,83.0243,0.254628,1.25856,0.272012


In [11]:
points = points_df[['x', 'y', 'z']].values
points

array([[-8.55999150e-03,  1.54270147e+00,  1.52716990e-03],
       [ 5.06779760e-03,  7.02634880e-03,  1.85616025e-02],
       [ 7.61377311e-01, -5.18391961e-01,  1.24766443e+00],
       [-1.42012774e+00, -5.49331833e-01, -2.11600848e-02],
       [ 6.25904769e-01, -4.83637453e-01, -1.17657057e+00],
       [ 1.01361406e+00,  1.94020776e+00,  2.81128420e-03],
       [-5.14596692e-01,  1.90702466e+00, -8.97057806e-01],
       [-5.22148149e-01,  1.94882514e+00,  8.79183439e-01],
       [ 2.81306127e-01, -2.01655480e-01,  2.17943598e+00],
       [ 7.99655059e-01, -1.61131746e+00,  1.23022549e+00],
       [ 1.79096916e+00, -1.40729353e-01,  1.26083003e+00],
       [-1.93854829e+00, -1.99107237e-01, -9.18282812e-01],
       [-1.39937632e+00, -1.64259062e+00, -4.56383289e-02],
       [-1.98850725e+00, -2.30596561e-01,  8.57265204e-01],
       [ 1.53111188e+00, -1.54880460e-01, -1.19210222e+00]])

In [12]:
_dists = distance.cdist(points, points)

In [13]:
_dists

array([[0.        , 1.53583006, 2.52859005, 2.52381488, 2.42827289,
        1.09674641, 1.09373648, 1.09498391, 2.80536813, 3.48004979,
        2.7673251 , 2.75768069, 3.47601501, 2.79232245, 2.58401335],
       [1.53583006, 0.        , 1.53582624, 1.53045567, 1.43336332,
        2.18050545, 2.17218868, 2.18842618, 2.18843181, 2.1722189 ,
        2.18048304, 2.16744366, 2.16744575, 2.17582855, 1.95466909],
       [2.52859005, 1.53582624, 0.        , 2.52385365, 2.42826605,
        2.76730832, 3.48002714, 2.80541905, 1.09498336, 1.09373463,
        1.09675055, 3.47604207, 2.75777282, 2.79232929, 2.58400721],
       [2.52381488, 1.53045567, 2.52385365, 0.        , 2.35064633,
        3.48159176, 2.76059242, 2.80317214, 2.80328006, 2.76065481,
        3.48160853, 1.09373046, 1.09372967, 1.09374514, 3.19945506],
       [2.42827289, 1.43336332, 2.42826605, 2.35064633, 0.        ,
        2.72328582, 2.66348199, 3.38541346, 3.38541607, 2.66355375,
        2.72321183, 2.59308494, 2.59305682, 

In [14]:
adj = _dists < 1.5
adj

array([[ True, False, False, False, False,  True,  True,  True, False,
        False, False, False, False, False, False],
       [False,  True, False, False,  True, False, False, False, False,
        False, False, False, False, False, False],
       [False, False,  True, False, False, False, False, False,  True,
         True,  True, False, False, False, False],
       [False, False, False,  True, False, False, False, False, False,
        False, False,  True,  True,  True, False],
       [False,  True, False, False,  True, False, False, False, False,
        False, False, False, False, False,  True],
       [ True, False, False, False, False,  True, False, False, False,
        False, False, False, False, False, False],
       [ True, False, False, False, False, False,  True, False, False,
        False, False, False, False, False, False],
       [ True, False, False, False, False, False, False,  True, False,
        False, False, False, False, False, False],
       [False, False,  T

In [15]:
num_nodes = len(points_df)
num_nodes

15

In [16]:
atoms = points_df['atom']
atoms

450    C
451    C
452    C
453    C
454    O
455    H
456    H
457    H
458    H
459    H
460    H
461    H
462    H
463    H
464    H
Name: atom, dtype: object

In [17]:
dict_atoms = {at: i for i, at in enumerate(list_atoms)}
dict_atoms

{'N': 0, 'H': 1, 'F': 2, 'C': 3, 'O': 4}

In [18]:
atom_index = [dict_atoms[atom] for atom in atoms]
atom_index

[3, 3, 3, 3, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [19]:
one_hot = np.identity(len(dict_atoms))[atom_index]
one_hot

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [20]:
bond = np.sum(adj, 1) - 1
bond

array([3, 1, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [21]:
bonds = np.identity(len(dict_atoms))[bond - 1]
bonds

array([[0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.]])

In [22]:
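# Columns: one-hot atom type in list_atoms order (N, H, F, C, O in this run),
# followed by a one-hot neighbour count (1 to 5 neighbours, because the index is bond - 1).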
_array = np.concatenate([one_hot, bonds], axis=1).astype(np.float32)
_array

array([[0., 0., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.]], dtype=float32)

In [23]:
array_df = pd.DataFrame(_array, columns=list_atoms + ['1Bond', '2Bond', '3Bond', '4Bond', '5Bond'])
array_df

Unnamed: 0,N,H,F,C,O,1Bond,2Bond,3Bond,4Bond,5Bond
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
5,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
6,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
8,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [24]:
dists_df = pd.DataFrame(_dists)
dists_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0.0,1.53583,2.52859,2.523815,2.428273,1.096746,1.093736,1.094984,2.805368,3.48005,2.767325,2.757681,3.476015,2.792322,2.584013
1,1.53583,0.0,1.535826,1.530456,1.433363,2.180505,2.172189,2.188426,2.188432,2.172219,2.180483,2.167444,2.167446,2.175829,1.954669
2,2.52859,1.535826,0.0,2.523854,2.428266,2.767308,3.480027,2.805419,1.094983,1.093735,1.096751,3.476042,2.757773,2.792329,2.584007
3,2.523815,1.530456,2.523854,0.0,2.350646,3.481592,2.760592,2.803172,2.80328,2.760655,3.481609,1.09373,1.09373,1.093745,3.199455
4,2.428273,1.433363,2.428266,2.350646,0.0,2.723286,2.663482,3.385413,3.385416,2.663554,2.723212,2.593085,2.593057,3.321998,0.963183
5,1.096746,2.180505,2.767308,3.481592,2.723286,0.0,1.773779,1.768239,3.14031,3.763729,2.552879,3.760365,4.319874,3.802001,2.466783
6,1.093736,2.172189,3.480027,2.760592,2.663482,1.773779,0.0,1.776749,3.813765,4.316402,3.763696,2.542416,3.755998,3.133606,2.919491
7,1.094984,2.188426,2.805419,2.803172,3.385413,1.768239,1.776749,0.0,2.638326,3.813792,3.140446,3.138581,3.810917,2.626893,3.596064
8,2.805368,2.188432,1.094983,2.80328,3.385416,3.14031,3.813765,2.638326,0.0,1.776748,1.768228,3.810987,3.138781,2.62698,3.596036
9,3.48005,2.172219,1.093735,2.760655,2.663554,3.763729,4.316402,3.813792,1.776748,0.0,1.773773,3.756086,2.542547,3.133582,2.919573


## Convert into graph object
Each dataset is represented as a list of Graphs and prediction targets.

In [25]:
list_atoms = list(set(structures['atom']))
print('list of atoms')
print(list_atoms)
    
train_graphs = list()
train_targets = list()
print('preprocess training molecules ...')
for mole in train_moles:
    train_graphs.append(Graph(structures_groups.get_group(mole), list_atoms))
    train_targets.append(train_gp.get_group(mole))

valid_graphs = list()
valid_targets = list()
print('preprocess validation molecules ...')
for mole in valid_moles:
    valid_graphs.append(Graph(structures_groups.get_group(mole), list_atoms))
    valid_targets.append(valid_gp.get_group(mole))

test_graphs = list()
test_targets = list()
print('preprocess test molecules ...')
for mole in test_moles:
    test_graphs.append(Graph(structures_groups.get_group(mole), list_atoms))
    test_targets.append(test_gp.get_group(mole))

list of atoms
['N', 'H', 'F', 'C', 'O']
preprocess training molecules ...
preprocess validation molecules ...
preprocess test molecules ...


In [26]:
mole

'dsgdb9nsd_133885'

In [27]:
test_graphs[-1]._dists

array([[0.        , 1.55442597, 2.13359743, 3.41077651, 3.60792696,
        2.87300638, 1.5406667 , 2.13358977, 3.41077339, 1.09182613,
        1.09182626, 2.53265842, 4.3791521 , 4.64982494, 2.53264506,
        4.37914813],
       [1.55442597, 0.        , 1.56309761, 2.5996242 , 3.32117236,
        3.24016526, 1.91136571, 1.5630973 , 2.59962957, 2.22595914,
        2.22595953, 2.2482947 , 3.42424991, 4.39379221, 2.24829538,
        3.42425823],
       [2.13359743, 1.56309761, 0.        , 1.50327185, 2.37662789,
        2.42805394, 1.53118284, 2.05227375, 2.31911581, 2.48109382,
        3.1421087 , 1.08820409, 2.31557336, 3.37896524, 3.09389518,
        3.31353601],
       [3.41077651, 2.5996242 , 1.50327185, 0.        , 1.53773905,
        2.46737812, 2.28845633, 2.319112  , 1.51951589, 3.93768831,
        4.27287504, 2.26444214, 1.08048233, 2.32454799, 3.36722538,
        2.30797827],
       [3.60792696, 3.32117236, 2.37662789, 1.53773905, 0.        ,
        1.43225399, 2.10295922, 

In [28]:
test_graphs[-1]._array

array([[0., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0.]], dtype=float32)

In [29]:
test_gp.get_group('dsgdb9nsd_133885')

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type
2505526,7163673,dsgdb9nsd_133885,14,0,3JHC
2505525,7163672,dsgdb9nsd_133885,13,15,3JHH
2505524,7163671,dsgdb9nsd_133885,13,8,2JHC
2505523,7163670,dsgdb9nsd_133885,13,7,3JHC
2505520,7163667,dsgdb9nsd_133885,13,3,2JHC
2505521,7163668,dsgdb9nsd_133885,13,4,1JHC
2505519,7163666,dsgdb9nsd_133885,13,2,3JHC
2505518,7163665,dsgdb9nsd_133885,12,15,3JHH
2505517,7163664,dsgdb9nsd_133885,12,13,3JHH
2505522,7163669,dsgdb9nsd_133885,13,6,3JHC


## Convert into chainer's dataset
This type of dataset can be handled by `DictDataset`.
Graph objects and prediction targets are merged as a `DictDataset`.

In [30]:
from chainer.datasets.dict_dataset import DictDataset

train_dataset = DictDataset(graphs=train_graphs, targets=train_targets)
valid_dataset = DictDataset(graphs=valid_graphs, targets=valid_targets)
test_dataset = DictDataset(graphs=test_graphs, targets=test_targets)

# Model
## Build SchNet model
The prediction model is implemented as follows.
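First, a fully connected layer is applied to the input arrays to align their dimensions.
Next, SchNet layers are applied for feature extraction.
Finally, the feature vectors of the two atoms in each pair are concatenated with their distance and fed into a three-layer MLP, which outputs the four coupling contributions (`fc`, `sd`, `pso`, `dso`).
I also add batch-normalization layers to the residual updates, as in ResNet.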
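
SchNet's continuous-filter convolution does not use raw distances directly: in the SchNet paper they are first expanded on a grid of Gaussian radial basis functions. A rough NumPy sketch of that expansion follows. The function name is hypothetical, and the grid (centres every 0.1 Å up to 30 Å, gamma = 10) follows the values reported in the paper, which may differ from the defaults used inside `chainer_chemistry`.

```python
import numpy as np


def rbf_expand(dists, cutoff=30.0, resolution=0.1, gamma=10.0):
    """Expand distances as exp(-gamma * (d - mu_k)^2) over a grid of centres mu_k."""
    num_centres = int(round(cutoff / resolution))
    centres = np.arange(num_centres) * resolution          # shape (K,)
    diff = dists[..., np.newaxis] - centres                # shape (..., K)
    return np.exp(-gamma * diff ** 2).astype(np.float32)


# Two atoms 1.09 Å apart, similar to the C-H distances in the tables above.
toy_dists = np.array([[0.0, 1.09],
                      [1.09, 0.0]])
toy_rbf = rbf_expand(toy_dists)
print(toy_rbf.shape)
```

Each distance becomes a smooth vector that peaks at the nearest centre, which gives the filter-generating network a richer input than a single scalar.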

In [31]:
# !pip install cupy

In [32]:
# !pip install chainer

In [33]:
# !pip install cupy-cuda100

In [34]:
from chainer import reporter
from chainer import functions as F
from chainer import links as L
from chainer_chemistry.links import SchNetUpdate
from chainer_chemistry.links import GraphLinear, GraphBatchNormalization

class SchNetUpdateBN(SchNetUpdate):

    def __init__(self, *args, **kwargs):
        super(SchNetUpdateBN, self).__init__(*args, **kwargs)
        with self.init_scope():
            self.bn = GraphBatchNormalization(args[0])

    def __call__(self, h, adj, **kwargs):
        v = self.linear[0](h)
        v = self.cfconv(v, adj)
        v = self.linear[1](v)
        v = F.softplus(v)
        v = self.linear[2](v)
        return h + self.bn(v)

class SchNet(chainer.Chain):

    def __init__(self, num_layer=3):
        super(SchNet, self).__init__()

        self.num_layer = num_layer

        with self.init_scope():
            self.gn = GraphLinear(512)
            for l in range(self.num_layer):
                self.add_link('sch{}'.format(l), SchNetUpdateBN(512))

            self.interaction1 = L.Linear(128)
            self.interaction2 = L.Linear(128)
            self.interaction3 = L.Linear(4)

    def __call__(self, input_array, dists, pairs_index, targets):

        out = self.predict(input_array, dists, pairs_index)
        loss = F.mean_absolute_error(out, targets)
        reporter.report({'loss': loss}, self)
        return loss

    def predict(self, input_array, dists, pairs_index, **kwargs):

        h = self.gn(input_array)

        for l in range(self.num_layer):
            h = self['sch{}'.format(l)](h, dists)

        h = F.concat((h, input_array), axis=2)

        concat = F.concat([
            h[pairs_index[:, 0], pairs_index[:, 1], :],
            h[pairs_index[:, 0], pairs_index[:, 2], :],
            F.expand_dims(dists[pairs_index[:, 0],
                                pairs_index[:, 1],
                                pairs_index[:, 2]], 1)
        ], axis=1)

        h1 = F.leaky_relu(self.interaction1(concat))
        h2 = F.leaky_relu(self.interaction2(h1))
        out = self.interaction3(h2)

        return out

model = SchNet(num_layer=3)
model.to_gpu(device=1)

<__main__.SchNet at 0x7fdcd4753438>
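
The indexing inside `predict` can be easier to follow on small arrays. This is a NumPy-only sketch with made-up shapes and random values, not the Chainer model itself. Each row of `pairs_index` is (sample in batch, atom 0, atom 1), and the regression head receives both atoms' feature vectors plus their distance.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_atoms, num_feat = 2, 5, 4           # toy sizes, chosen arbitrarily

h = rng.normal(size=(batch, num_atoms, num_feat)).astype(np.float32)
dists = rng.uniform(1.0, 3.0, size=(batch, num_atoms, num_atoms)).astype(np.float32)

# Columns: sample index within the batch, atom_index_0, atom_index_1.
pairs_index = np.array([[0, 1, 0],
                        [0, 2, 3],
                        [1, 4, 0]])

pair_features = np.concatenate([
    h[pairs_index[:, 0], pairs_index[:, 1], :],               # features of atom 0
    h[pairs_index[:, 0], pairs_index[:, 2], :],               # features of atom 1
    dists[pairs_index[:, 0], pairs_index[:, 1], pairs_index[:, 2]][:, np.newaxis],
], axis=1)

print(pair_features.shape)  # (3, 9): one row per pair, 2 * num_feat + 1 columns
```

Because the gather is done per pair, one forward pass over a molecule yields predictions for all of its couplings at once.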

# Training preparation
## Make samplers
For mini-batch training, I implemented a sampler named `SameSizeSampler`.
Molecules with the same number of atoms are placed in the same mini-batch.

In [35]:
from chainer.iterators import OrderSampler

class SameSizeSampler(OrderSampler):

    def __init__(self, structures_groups, moles, batch_size,
                 random_state=None, use_remainder=False):

        self.structures_groups = structures_groups
        self.moles = moles
        self.batch_size = batch_size
        if random_state is None:
            random_state = np.random.random.__self__
        self._random = random_state
        self.use_remainder = use_remainder

    def __call__(self, current_order, current_position):

        batches = list()

        atom_counts = pd.DataFrame()
        atom_counts['mol_index'] = np.arange(len(self.moles))
        atom_counts['molecular_name'] = self.moles
        atom_counts['num_atom'] = [len(self.structures_groups.get_group(mol))
                                   for mol in self.moles]

        num_atom_counts = atom_counts['num_atom'].value_counts()

        for count, num_mol in num_atom_counts.to_dict().items():
            if self.use_remainder:
                num_batch_for_this = -(-num_mol // self.batch_size)
            else:
                num_batch_for_this = num_mol // self.batch_size

            target_mols = atom_counts.query('num_atom==@count')['mol_index'].values
            random.shuffle(target_mols)

            devider = np.arange(0, len(target_mols), self.batch_size)
            devider = np.append(devider, 99999)

            if self.use_remainder:
                target_mols = np.append(
                    target_mols,
                    np.repeat(target_mols[-1], -len(target_mols) % self.batch_size))

            for b in range(num_batch_for_this):
                batches.append(target_mols[devider[b]:devider[b + 1]])

        random.shuffle(batches)
        batches = np.concatenate(batches).astype(np.int32)

        return batches

batch_size = 8
train_sampler = SameSizeSampler(structures_groups, train_moles, batch_size)
valid_sampler = SameSizeSampler(structures_groups, valid_moles, batch_size,
                                use_remainder=True)
test_sampler = SameSizeSampler(structures_groups, test_moles, batch_size,
                               use_remainder=True)
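
Arrays can only be stacked into one mini-batch tensor when they share a shape, which is why molecules are grouped by atom count. The following is a simplified NumPy sketch of that grouping, not a drop-in replacement for `SameSizeSampler`; the function name and the seed argument are assumptions.

```python
import numpy as np


def same_size_batches(num_atoms, batch_size, seed=0):
    """Group molecule indices into batches whose members share an atom count.

    Hypothetical helper; incomplete trailing batches are dropped, which
    corresponds to use_remainder=False in the sampler above.
    """
    rng = np.random.default_rng(seed)
    num_atoms = np.asarray(num_atoms)

    batches = []
    for size in np.unique(num_atoms):
        members = np.flatnonzero(num_atoms == size)
        rng.shuffle(members)
        for start in range(0, len(members) - batch_size + 1, batch_size):
            batches.append(members[start:start + batch_size])

    # Shuffle the batch order so that sizes are interleaved within an epoch.
    order = rng.permutation(len(batches))
    return [batches[i] for i in order]


toy_num_atoms = [5, 5, 5, 5, 7, 7, 9]
toy_batches = same_size_batches(toy_num_atoms, batch_size=2)
for b in toy_batches:
    print(b, [toy_num_atoms[i] for i in b])
```

In this toy run the single 9-atom molecule is dropped because it cannot fill a batch. With `use_remainder=True` the sampler above pads the last batch by repeating a molecule instead.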

## Make iterators
The iterators that feed data to the model are created as below.

In [36]:
train_iter = chainer.iterators.SerialIterator(
    train_dataset, batch_size, order_sampler=train_sampler)

valid_iter = chainer.iterators.SerialIterator(
    valid_dataset, batch_size, repeat=False, order_sampler=valid_sampler)

test_iter = chainer.iterators.SerialIterator(
    test_dataset, batch_size, repeat=False, order_sampler=test_sampler)

## Make optimizer
Adam is used as an optimizer.

In [37]:
from chainer import optimizers
optimizer = optimizers.Adam(alpha=1e-3)
optimizer.setup(model)

<chainer.optimizers.adam.Adam at 0x7fdcd4504860>

## Make updater
Since the model receives several input arrays separately, I implemented a custom converter.
`input_array` and `dists` are extracted from the `Graph` objects, while `pairs_index` and `targets` are extracted from the `targets` DataFrames.
`targets` is added only when the label columns are available, that is, for training and validation data.
When this converter is used on test data, `targets` is not added.

In [38]:
from chainer import training
from chainer.dataset import to_device

def coupling_converter(batch, device):

    list_array = list()
    list_dists = list()
    list_targets = list()
    list_pairs_index = list()

    with_target = 'fc' in batch[0]['targets'].columns

    for i, d in enumerate(batch):
        list_array.append(d['graphs'].input_array)
        list_dists.append(d['graphs'].dists)
        if with_target:
            list_targets.append(
                d['targets'][['fc', 'sd', 'pso', 'dso']].values.astype(np.float32))

        sample_index = np.full((len(d['targets']), 1), i)
        atom_index = d['targets'][['atom_index_0', 'atom_index_1']].values

        list_pairs_index.append(np.concatenate([sample_index, atom_index], axis=1))

    input_array = to_device(device, np.stack(list_array))
    dists = to_device(device, np.stack(list_dists))
    pairs_index = np.concatenate(list_pairs_index)

    array = {'input_array': input_array, 'dists': dists, 'pairs_index': pairs_index}

    if with_target:
        array['targets'] = to_device(device, np.concatenate(list_targets))

    return array

updater = training.StandardUpdater(train_iter, optimizer,
                                   converter=coupling_converter, device=1)
trainer = training.Trainer(updater, (200, 'epoch'), out="result")
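
To make the role of `pairs_index` concrete, here is a small NumPy sketch of how the converter builds it. The atom indices are toy values for illustration; only the construction mirrors `coupling_converter`.

```python
import numpy as np

# Per-molecule (atom_index_0, atom_index_1) pairs; toy values for illustration.
pairs_per_molecule = [
    np.array([[1, 0], [1, 2]]),          # molecule at batch position 0: 2 pairs
    np.array([[3, 0], [3, 4], [4, 0]]),  # molecule at batch position 1: 3 pairs
]

list_pairs_index = []
for i, atom_index in enumerate(pairs_per_molecule):
    # Prepend the position of the molecule within the batch to every pair.
    sample_index = np.full((len(atom_index), 1), i)
    list_pairs_index.append(np.concatenate([sample_index, atom_index], axis=1))

pairs_index = np.concatenate(list_pairs_index)
print(pairs_index)
```

Molecules in a batch may have different numbers of couplings, so the pairs are concatenated into one long list rather than stacked, and the first column records which molecule each pair belongs to.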

# Training extensions
## Evaluator
I implemented an Evaluator that measures the validation score during training.
Predictions for the test data are also calculated by this evaluator, and the submission file is generated from them.

In [39]:
from chainer.training.extensions import Evaluator
from chainer import cuda

class TypeWiseEvaluator(Evaluator):

    def __init__(self, iterator, target, converter, device, name,
                 is_validate=False, is_submit=False):

        super(TypeWiseEvaluator, self).__init__(
            iterator, target, converter=converter, device=device)

        self.is_validate = is_validate
        self.is_submit = is_submit
        self.name = name

    def calc_score(self, df_truth, pred):

        target_types = list(set(df_truth['type']))

        diff = df_truth['scalar_coupling_constant'] - pred

        scores = 0
        metrics = {}

        for target_type in target_types:

            target_pair = df_truth['type'] == target_type
            score_exp = np.mean(np.abs(diff[target_pair]))
            log_mae = np.log(score_exp)
            scores += log_mae
            # Report the per-type log MAE so the value matches the metric name.
            metrics[f'LogMAE_{target_type}'] = log_mae

        metrics['ALL_LogMAE'] = scores / len(target_types)
        # print(metrics)

        observation = {}
        with reporter.report_scope(observation):
            reporter.report(metrics, self._targets['main'])

        return observation

    def evaluate(self):
        iterator = self._iterators['main']
        eval_func = self._targets['main']

        iterator.reset()
        it = iterator

        y_total = []
        t_total = []

        for batch in it:
            in_arrays = self.converter(batch, self.device)
            with chainer.no_backprop_mode(), chainer.using_config('train', False):
                y = eval_func.predict(**in_arrays)

            y_data = cuda.to_cpu(y.data)
            y_total.append(y_data)
            t_total.extend([d['targets'] for d in batch])

        df_truth = pd.concat(t_total, axis=0)
        y_pred = np.sum(np.concatenate(y_total), axis=1)

        if self.is_submit:
            submit = pd.DataFrame()
            submit['id'] = df_truth['id']
            submit['scalar_coupling_constant'] = y_pred
            submit.drop_duplicates(subset='id', inplace=True)
            submit.sort_values('id', inplace=True)
            submit.to_csv('kernel_schnet.csv', index=False)

        if self.is_validate:
            return self.calc_score(df_truth, y_pred)

        return {}

trainer.extend(
    TypeWiseEvaluator(iterator=valid_iter, target=model, converter=coupling_converter, 
                      name='valid', device=1, is_validate=True))
trainer.extend(
    TypeWiseEvaluator(iterator=test_iter, target=model, converter=coupling_converter,
                      name='test', device=1, is_submit=True))
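
The score computed in `calc_score` is the mean, over coupling types, of the log of the per-type mean absolute error. A standalone NumPy sketch is shown below for checking predictions outside the trainer. The function name is hypothetical, and the `floor` argument is an assumption added here to guard against `log(0)`; it is not in the evaluator above.

```python
import numpy as np


def group_log_mae(y_true, y_pred, types, floor=1e-9):
    """Mean over coupling types of log(MAE within that type).

    The floor guards against log(0); its value is an assumption here.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    types = np.asarray(types)

    log_maes = []
    for t in np.unique(types):
        mask = types == t
        mae = np.mean(np.abs(y_true[mask] - y_pred[mask]))
        log_maes.append(np.log(max(mae, floor)))
    return float(np.mean(log_maes))


toy_score = group_log_mae(
    y_true=[84.8, 84.9, -11.2, -11.3],
    y_pred=[84.3, 85.4, -11.2, -11.5],
    types=['1JHC', '1JHC', '2JHH', '2JHH'],
)
print(toy_score)
```

Because the log is taken per type before averaging, a type with small couplings and small errors counts as much as a type with large ones.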

## Other extensions
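`ExponentialShift` is used as the learning rate scheduler.
An extension that turns off training mode after the first epoch is also registered, so that the batch-normalization layers use their accumulated statistics instead of per-batch statistics from the second epoch onward.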

Log options are set to report the metrics.
This helps us to analyze the result of training.

In [40]:
trainer.extend(training.extensions.ExponentialShift('alpha', 0.99999))

from chainer.training import make_extension

def stop_train_mode(trigger):
    @make_extension(trigger=trigger)
    def _stop_train_mode(_):
        chainer.config.train = False
    return _stop_train_mode

trainer.extend(stop_train_mode(trigger=(1, 'epoch')))

trainer.extend(
    training.extensions.observe_value(
        'alpha', lambda tr: tr.updater.get_optimizer('main').alpha))

trainer.extend(training.extensions.LogReport())
trainer.extend(training.extensions.PrintReport(
    ['epoch', 'elapsed_time', 'main/loss', 'valid/main/ALL_LogMAE', 'LogMAE_1JHN', 'alpha']))
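
With a rate of 0.99999 applied on every update, the learning rate after `t` updates is `alpha0 * 0.99999 ** t`. The short sketch below works through that arithmetic. The figure of roughly 9,549 updates per epoch is my inference from the `alpha` column of the training log in this notebook, not a value stated anywhere in the code.

```python
def exponential_shift(alpha0, rate, iterations):
    """Learning rate after a number of updates when it is multiplied by rate each update."""
    return alpha0 * rate ** iterations


# Inferred from the logged alpha, not read from the notebook (assumption).
updates_per_epoch = 9549

alpha_epoch1 = exponential_shift(1e-3, 0.99999, updates_per_epoch)
alpha_epoch2 = exponential_shift(1e-3, 0.99999, 2 * updates_per_epoch)
print(alpha_epoch1, alpha_epoch2)
```

Under this assumption the learning rate shrinks by a factor of about 0.909 per epoch, so the decay rate may be worth revisiting if you train for many more epochs.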

# Training
## Run
I tuned the number of epochs to prevent a timeout.
SchNet tends to underfit in this setting, so longer training generally makes the model better.

In [None]:
chainer.config.train = True
# For using device 1
with chainer.using_device('@cupy:1'):
    trainer.run()

epoch       elapsed_time  main/loss   valid/main/ALL_LogMAE  LogMAE_1JHN  alpha     
1           517.62        0.831334    0.408113                            0.000908927  
2           1054.1        0.432823    -0.119311                           0.00082614  


## Check output

In [None]:
submit = pd.read_csv('kernel_schnet.csv')
display(submit.head())
print('shape: {}'.format(submit.shape))

# For more improvement
This example can be improved in the following ways.
* Train longer
* Tune hyperparameters
* Add your own features
* Try different GNNs
* Blend with Gradient Boosting Machines

You can start with this kernel, and don't forget to upvote :)

# References

* SchNet: A continuous-filter convolutional neural network for modeling quantum interactions<br>https://arxiv.org/abs/1706.08566
* SchNet - a deep learning architecture for molecules and materials<br>https://arxiv.org/abs/1712.06113
