# Adding Structures in Parallel

In [1]:
import numpy as np
import json
from monty.serialization import loadfn
from pymatgen.core import Structure
from smol.cofe import ClusterSubspace, StructureWrangler

### 1) Preparing a `StructureWrangler`
When adding large structures, or structures that relaxed considerably relative to the primitive structure, to a `StructureWrangler`, matching each structure to compute its correlation vector for the feature matrix can be time-consuming. In this case it can be very helpful (and easy!) to add the structures in a dataset in parallel.

First, we'll prepare the cluster subspace and structure wrangler as before.

In [2]:
# Load the raw data

# load the prim structure
lmof_prim = loadfn('data/lmof_prim.json')

# load the fitting data
lmof_data = loadfn('data/lmof_fitting_data.json')

# create a cluster subspace
subspace = ClusterSubspace.from_radii(lmof_prim,
                                      radii={2: 7, 3: 5},
                                      basis='sinusoid',
                                      supercell_size=('O2-', 'F-'),
                                      ltol=0.15, stol=0.2,
                                      angle_tol=15)

# create the structure wrangler
wrangler = StructureWrangler(subspace)

### 2) Add structures in parallel
Since adding structures is an embarrassingly parallel operation,
all we need to do is run a parallel loop. There are a few ways to
do this in Python. Here we will use the `joblib` library, but using
`multiprocessing` would be very similar.
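For comparison, here is a minimal sketch of the same pattern with the standard-library `multiprocessing` module. It does not use `smol`: `process_entry` is a hypothetical stand-in for `wrangler.process_structure` that returns `None` when an input "fails to match", and `drop_failed` and `process_in_parallel` are illustrative names, not part of any library.

```python
from multiprocessing import Pool, cpu_count


def process_entry(value, properties):
    """Hypothetical stand-in for wrangler.process_structure.

    Returns a dict for inputs that "match" and None for those that do not.
    """
    if value < 0:  # pretend that negative values fail to match
        return None
    return {'value': value, 'properties': properties}


def drop_failed(items):
    """Remove the None placeholders left by failed matches."""
    return [item for item in items if item is not None]


def process_in_parallel(data, nprocs=None):
    """Process (value, energy) pairs with a pool of worker processes."""
    args = [(value, {'energy': energy}) for value, energy in data]
    with Pool(processes=nprocs or cpu_count()) as pool:
        # starmap unpacks each tuple into positional arguments
        items = pool.starmap(process_entry, args)
    return drop_failed(items)
```

In a script, call `process_in_parallel(data)` from under an `if __name__ == '__main__':` guard so that worker processes do not re-run the pool setup on import. With a real wrangler, the stand-in would presumably be replaced by `wrangler.process_structure`, assuming the wrangler can be pickled and sent to the workers.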

In [4]:
from time import time
from joblib import Parallel, delayed, cpu_count

print(f'This computer has {cpu_count()} cpus.')

nprocs = cpu_count()  # setting this to -1 also uses all cpus

# setting a batch size usually improves speed
batch_size = 'auto'  # len(lmof_data)//nprocs

start = time()

# we need to add the data a bit differently to avoid having to use
# shared memory between processes: the workers only process the structures,
# and the resulting items are appended to the wrangler afterwards
with Parallel(n_jobs=nprocs, batch_size=batch_size, verbose=True) as parallel:
    items = parallel(delayed(wrangler.process_structure)(struct, {'energy': energy})
                     for struct, energy in lmof_data)

# unpack the items and remove Nones from structures that failed to match
data_items = [item for item in items if item is not None]
wrangler.append_data_items(data_items)

print(f'Parallel finished in {time()-start} seconds.')
print(f'Matched {wrangler.num_structures}/{len(lmof_data)} structures.')

This computer has 8 cpus.


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.


Parallel finished in 25.01846957206726 seconds.
Matched 17/26


[Parallel(n_jobs=8)]: Done  26 out of  26 | elapsed:   25.0s finished


### 2.1) Compare with serial code

In [5]:
wrangler.remove_all_data()

start = time()

for s, e in lmof_data:
    wrangler.add_data(s, {'energy': e}, verbose=True)

print(f'Serial finished in {time()-start} seconds.')
print(f'Matched {wrangler.num_structures}/{len(lmof_data)} structures.')

Unable to match Li+6 Mn3+6 Mn4+3 O2-18 with properties {'energy': -363.8585} to supercell_structure. Throwing out.
 Error Message: Mapping could not be found from structure.
Unable to match Mn3+32 O2-48 with properties {'energy': -1057.1584} to supercell_structure. Throwing out.
 Error Message: Supercell could not be found from structure
Unable to match Li+8 Mn2+4 Mn4+12 O2-32 with properties {'energy': -636.38575} to supercell_structure. Throwing out.
 Error Message: Mapping could not be found from structure.
Unable to match Li+9 Mn3+5 Mn4+2 O2-16 with properties {'energy': -321.99251} to supercell_structure. Throwing out.
 Error Message: Mapping could not be found from structure.
Unable to match Li+7 Mn3+6 Mn4+4 Mn2+1 O2-19 F-5 with properties {'energy': -453.85747} to supercell_structure. Throwing out.
 Error Message: Mapping could not be found from structure.
Unable to match Li+7 Mn3+7 Mn4+3 Mn2+1 O2-18 F-6 with properties {'energy': -452.20166} to supercell_structure. Throwing out