# Summary

[Raphaël CC's kernel](https://www.kaggle.com/daijin12/coulomb-interaction-high-perf-no-loop-almost) computes the Coulomb interaction with the 5 nearest `H`, `C`, `O`, `N`, and `F` atoms for all molecules in just under 8 hours.

However, if you want to compute the 8 or 9 nearest atoms, that kernel times out, since some molecules contain more than 10 `H` and `C` atoms. The main reason is that `pandas` uses only 1 CPU core when manipulating dataframes, which is wasteful given the 4 cores at our disposal. So this kernel implements a parallel version of the above computation, which finishes an even more complicated computation, such as Yukawa potentials, for all molecules in under 4 hours.

To use the output file, simply concatenate or merge it with the original structures, then merge the result with the train and test dataframes, and you are good to go.

### References:

* [coulomb_interaction - speed up!](https://www.kaggle.com/rio114/coulomb-interaction/notebook)
* [coulomb_interaction - Parallelized](https://www.kaggle.com/brandenkmurray/coulomb-interaction-parallelized/notebook)
* [Coulomb interaction - High perf, no loop (almost)](https://www.kaggle.com/daijin12/coulomb-interaction-high-perf-no-loop-almost)

In [None]:
import numpy as np
from tqdm import tqdm_notebook as tqdm
import pandas as pd 
import multiprocessing as mp
import warnings
warnings.filterwarnings("ignore")

In [None]:
structures = pd.read_csv('../input/structures.csv')

In [None]:
nuclear_charge = {'H':1.0, 'C':6, 'N':7, 'O':8, 'F':9}
structures['nuclear_charge'] = [nuclear_charge[x] for x in structures['atom'].values]

## Yukawa potential

I am not a chemist, but from [the Wikipedia entry](https://en.wikipedia.org/wiki/Yukawa_potential):
$$
V_{\text{Yukawa}}(r) = -\frac{q_{1}q_{2}}{4\pi \epsilon_{0}} \frac{e^{-\alpha m r}}{r},
$$
where $r$ is the radial distance to the particle (atom) and $\alpha$ is a scaling constant, so that $1/(\alpha m)$ is the range. Here I make a simplification and set the range to 0.5 times the maximum distance from the atom of interest to any other atom in the molecule. The constant prefactor and the sign are dropped in the code, since they only rescale the feature.
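
To make the simplification concrete, here is a minimal sketch of the screened term for a single atom. The helper name `yukawa_terms` and the toy charges and distances are assumptions made for illustration; as in the code cell below, the constant prefactor and the sign are dropped and the range is 0.5 times the largest distance.

```python
import numpy as np

def yukawa_terms(q_center, q_others, r_others):
    """Screened Coulomb terms between one atom and the other atoms of a molecule.

    Assumption: range = 0.5 * max distance; prefactor and sign are dropped.
    """
    q_others = np.asarray(q_others, dtype=float)
    r_others = np.asarray(r_others, dtype=float)
    screening_range = 0.5 * r_others.max()
    return q_center * q_others * np.exp(-r_others / screening_range) / r_others

# Toy example: a carbon (charge 6) with two hydrogens at distances 1.0 and 2.0.
terms = yukawa_terms(6.0, [1.0, 1.0], [1.0, 2.0])
print(terms)
```

With this choice the farthest atom is always damped by a factor of $e^{-2}$ on top of the usual $1/r$ decay.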

In [None]:
def compute_all_yukawa(x):
    # Apply compute_yukawa_matrix to each atom (row) of the molecule
    return x.apply(compute_yukawa_matrix,axis=1,x2=x)

def compute_yukawa_matrix(x,x2):
    # atoms in the molecule which are not the processed one
    notatom = x2[(x2.atom_index != x["atom_index"])].reset_index(drop=True) 
    # processed atom
    atom = x[["x","y","z"]]
    charge = x[['nuclear_charge']]
    
    # compute the distance from the processed atom to every other atom
    notatom['dist'] = ((notatom[["x","y","z"]].values - atom.values)**2).sum(axis=1)
    notatom['dist'] = np.sqrt(notatom['dist'].astype(np.float32))
    # overwrite 'dist' with the Yukawa term q1*q2*exp(-r/range)/r, where range = 0.5*max(r)
    notatom['dist'] = charge.values*notatom[['nuclear_charge']].values.reshape(-1)\
                    *np.exp(-2*notatom['dist']/notatom['dist'].max())/notatom['dist']

    # sort atoms from nearest to farthest (highest Yukawa term first) within each group of C/H/N...
    s = notatom.groupby("atom")["dist"].transform(lambda x : x.sort_values(ascending=False))
    
    # keep only the five nearest atoms per group of C/H/N...
    index0, index1=[],[]
    for i in notatom.atom.unique():
        for j in range(notatom[notatom.atom == i].shape[0]):
            if j < 5:
                index1.append("dist_" + i + "_" + str(j))
            index0.append(j)
    s.index = index0
    s = s[s.index < 5]
    s.index = index1
    
    return s
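
To make the output layout concrete, here is a small sketch of the "top terms per element" step on toy data. It is not the kernel's implementation: the helper name `top_k_per_element` and the sample values are hypothetical, and the sketch sorts each group explicitly instead of relying on `transform`. Within one element the charge product is constant, so the largest term belongs to the nearest atom.

```python
import pandas as pd

def top_k_per_element(elements, values, k=5):
    """Return the k largest values per element, labeled dist_<element>_<rank>."""
    df = pd.DataFrame({"atom": elements, "value": values})
    out = {}
    # sort=False keeps the elements in order of first appearance
    for element, group in df.groupby("atom", sort=False):
        largest = group["value"].sort_values(ascending=False).head(k)
        for rank, v in enumerate(largest):
            out[f"dist_{element}_{rank}"] = v
    return pd.Series(out)

# Toy neighbors of one atom: two carbons and three hydrogens, keeping 2 per element.
features = top_k_per_element(["C", "H", "H", "C", "H"], [3.0, 0.5, 1.5, 4.0, 1.0], k=2)
print(features)
```

Elements with fewer than `k` atoms simply produce fewer entries, which is why the combined dataframe contains `NaN` for molecules that lack a given element.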

## Benchmark using first 100 molecules

In [None]:
small_idx = structures.molecule_name.isin(structures.molecule_name.unique()[:100])
_smallstruct = structures[small_idx]

The current [fastest way](https://www.kaggle.com/daijin12/coulomb-interaction-high-perf-no-loop-almost) takes about 20 seconds on these 100 molecules.

In [None]:
%%time
smallstruct1 = _smallstruct.groupby("molecule_name").apply(compute_all_yukawa)

In [None]:
smallstruct1.head(10)

## Multiprocessing

The following approach uses `groupby` to get an iterator over the molecules, so that each molecule can be handed to a `multiprocessing` worker. This saves more than 50% of the computation time.

Reference: [Parallel operations over a Pandas DF](https://www.kaggle.com/gvyshnya/parallel-operations-over-a-pandas-df)
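
The pattern can be tried on toy data first. The sketch below is only an illustration under stated assumptions: `center_group`, `parallel_group_apply`, and the toy dataframe are hypothetical names, and the worker count of 2 is arbitrary. A thread-backed `ThreadPool` exposes the same `apply_async` API, so it is used here for a quick check; the process pool runs under the `__main__` guard.

```python
import multiprocessing as mp
from multiprocessing.pool import ThreadPool

import pandas as pd

def center_group(group):
    """Toy per-molecule transform: subtract the group's mean x coordinate."""
    out = group.copy()
    out["x_centered"] = out["x"] - out["x"].mean()
    return out

def parallel_group_apply(df, key, func, pool):
    """Submit one task per group, then collect the results in submission order."""
    pending = [pool.apply_async(func, (group,)) for _, group in df.groupby(key)]
    return pd.concat([task.get(timeout=120) for task in pending])

toy = pd.DataFrame({
    "molecule_name": ["m1", "m1", "m2", "m2"],
    "x": [0.0, 2.0, 1.0, 5.0],
})

# Quick check with threads (same API, no pickling of the function needed).
with ThreadPool(2) as pool:
    threaded = parallel_group_apply(toy, "molecule_name", center_group, pool)

if __name__ == "__main__":
    # Process-based pool, as used in this kernel.
    with mp.Pool(2) as pool:
        parallel = parallel_group_apply(toy, "molecule_name", center_group, pool)
    print(parallel.equals(threaded))
```

Collecting the results in submission order keeps the rows in the same order as a serial `groupby(...).apply(...)` would produce.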

In [None]:
%%time
chunk_iter = _smallstruct.groupby(['molecule_name'])
pool = mp.Pool(4) # use 4 processes

funclist = []
for df in tqdm(chunk_iter):
    # process each data frame
    f = pool.apply_async(compute_all_yukawa,[df[1]])
    funclist.append(f)

result = []
for f in tqdm(funclist):
    result.append(f.get(timeout=120)) # timeout in 120 seconds = 2 mins

# release the worker processes before a new pool is created below
pool.close()
pool.join()

# combine chunks with transformed data into a single structure file
smallstruct2 = pd.concat(result)

In [None]:
smallstruct2.head(10)

Just to make sure the two methods give the same result:

In [None]:
np.allclose(smallstruct2.fillna(0), smallstruct1.fillna(0))
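
Note that `np.allclose` compares values by position only. As a hedged alternative, `pd.testing.assert_frame_equal` also checks the index and column labels and treats `NaN` in matching positions as equal. The sketch below uses made-up toy frames (`a`, `b`, `c`) rather than the real outputs.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"dist_C_0": [1.0, np.nan], "dist_H_0": [0.5, 0.25]})
b = a.copy()
c = b.rename(columns={"dist_H_0": "dist_N_0"})  # same values, different label

# Passes silently: values, index, and columns all agree.
pd.testing.assert_frame_equal(a, b, check_dtype=False)

# A value-only comparison does not notice the renamed column.
values_match = np.allclose(a.fillna(0).to_numpy(), c.fillna(0).to_numpy())

# The label-aware comparison does.
try:
    pd.testing.assert_frame_equal(a, c, check_dtype=False)
    labels_match = True
except AssertionError:
    labels_match = False

print(values_match, labels_match)
```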

## Compute Yukawa interaction for all molecules

Without the parallelization, it takes about 11 hours to run.

In [None]:
chunk_iter = structures.groupby(['molecule_name'])
pool = mp.Pool(4) # use 4 CPU cores

funclist = []
for df in tqdm(chunk_iter):
    # process each data frame
    f = pool.apply_async(compute_all_yukawa,[df[1]])
    funclist.append(f)

result = []
for f in tqdm(funclist):
    result.append(f.get()) 

# release the worker processes
pool.close()
pool.join()

# combine chunks with transformed data into a single structures dataframe
structures_yukawa = pd.concat(result)

In [None]:
structures_yukawa.to_csv('structures_yukawa.csv',index=False)

## How to use this file in CHAMPS competition

Simply concatenate it with the existing `structures` dataframe:
```python
structures = pd.concat([structures, structures_yukawa], axis=1)
```
Then merge it with the train or test dataframe using a modified version of Andrew's routine:

```python
def map_atom_info(df_1, df_2, atom_idx):
    df = pd.merge(df_1, df_2, how = 'left',
                  left_on  = ['molecule_name', f'atom_index_{atom_idx}'],
                  right_on = ['molecule_name',  'atom_index'])
    
    df = df.drop('atom_index', axis=1)
    return df

for atom_idx in [0,1]:
    train = map_atom_info(train, structures, atom_idx)
    train = train.rename(columns={'atom': f'atom_{atom_idx}',
                    'x': f'x_{atom_idx}',
                    'y': f'y_{atom_idx}',
                    'z': f'z_{atom_idx}'})
```
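
The loop above renames only `atom`, `x`, `y`, and `z`, so on the second merge pandas would disambiguate the repeated `dist_*` columns with its default `_x`/`_y` suffixes. One possible variation, sketched here on made-up toy data, suffixes every per-atom column with the atom slot before merging. The helper name `map_atom_info_suffixed` and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are made up).
structures = pd.DataFrame({
    "molecule_name": ["m1", "m1"],
    "atom_index": [0, 1],
    "atom": ["C", "H"],
    "x": [0.0, 1.0], "y": [0.0, 0.0], "z": [0.0, 0.0],
})
structures_yukawa = pd.DataFrame({
    "dist_C_0": [float("nan"), 2.2],
    "dist_H_0": [2.2, float("nan")],
})
train = pd.DataFrame({"molecule_name": ["m1"], "atom_index_0": [1], "atom_index_1": [0]})

# Column-wise concatenation aligns on the shared row index.
structures = pd.concat([structures, structures_yukawa], axis=1)

def map_atom_info_suffixed(df, structures, atom_idx):
    """Merge per-atom columns into df, suffixing each with the atom slot."""
    keys = ["molecule_name", "atom_index"]
    renamed = structures.rename(
        columns={c: f"{c}_{atom_idx}" for c in structures.columns if c not in keys}
    )
    merged = pd.merge(df, renamed, how="left",
                      left_on=["molecule_name", f"atom_index_{atom_idx}"],
                      right_on=keys)
    return merged.drop("atom_index", axis=1)

for atom_idx in [0, 1]:
    train = map_atom_info_suffixed(train, structures, atom_idx)

print(train.columns.tolist())
```

The same loop can be applied to the test dataframe.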