# Training Data Preparation

In [2]:
import numpy as np
import json
from monty.serialization import loadfn
from pymatgen.core import Structure
from smol.cofe import ClusterSubspace, StructureWrangler

### 1) Preparing a `StructureWrangler`
Training structures and target data are handled by the `StructureWrangler` class. The class obtains the features and corresponding feature matrix based on the underlying `ClusterSubspace` provided.

In the simplest setting we just use the feature matrix and the supplied total energies from DFT to fit a cluster expansion. But in many cases we may want to improve the fit quality or reduce the model complexity by modifying the target property (e.g. using a reference energy or the energy of mixing) and/or by weighting structures based on some importance metric (e.g. energy above hull). Using the `StructureWrangler` we can create this modified fitting data.
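
As a rough illustration of what weighting structures does to a fit, the sketch below uses only NumPy and made-up numbers. The feature matrix, energies, and energies above hull are all hypothetical, and the Boltzmann-style weights with the energy scale `kT` are an assumed choice, not necessarily what `smol` does internally. It compares an ordinary least-squares fit with a weighted one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 8 structures, 3 correlation features.
feature_matrix = rng.normal(size=(8, 3))
true_coefs = np.array([-1.0, 0.5, 0.2])
energies = feature_matrix @ true_coefs  # noise-free toy target

# Assumed importance metric: energy above hull in eV per prim (made up).
e_above_hull = np.linspace(0.0, 0.3, 8)
kT = 0.1  # assumed energy scale for the weights
weights = np.exp(-e_above_hull / kT)

# Ordinary least squares: every structure counts equally.
coefs_ols, *_ = np.linalg.lstsq(feature_matrix, energies, rcond=None)

# Weighted least squares: scale each row by sqrt(weight) so that
# low-energy structures contribute more to the squared error.
sqrt_w = np.sqrt(weights)
coefs_wls, *_ = np.linalg.lstsq(feature_matrix * sqrt_w[:, None],
                                energies * sqrt_w, rcond=None)
```

Because the toy target is noise-free, both fits recover the same coefficients; with real DFT data the weighted fit would trade accuracy on high-energy structures for accuracy near the hull.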

In [3]:
# Load the raw data
# load the prim structure
lno_prim = loadfn('data/lno_prim.json')
    
# load the fitting data
with open('data/lno_fitting_data.json', 'r') as f:
    lno_data = [(Structure.from_dict(x['s']), x['toten']) for x in json.load(f)]

    
# create a cluster subspace
subspace = ClusterSubspace.from_radii(lno_prim,
                                      radii={2: 5, 3: 4.1},
                                      basis='sinusoid',
                                      supercell_size='O2-')

# create the structure wrangler
wrangler = StructureWrangler(subspace)

# add the raw data
for structure, tot_energy in lno_data:
    wrangler.add_data(structure,
                      properties={'total_energy': tot_energy})

print(f'\nTotal structures that match {wrangler.num_structures}/{len(lno_data)}')


Total structures that match 27/31


### 2) Modifying and adding new target properties

Now that we have access to the structures that match our cluster subspace, and to the raw and normalized target properties, we can easily create new modified target properties to fit to.

As a simple example, say we want to set the minimum energy in our data as a new reference point.

In [9]:
# obtain the minimum energy. Calling get_property_vector
# will by default give you the property normalized per prim
# (you should always use consistently normalized data when fitting)
min_energy = min(wrangler.get_property_vector('total_energy'))

# create a new re-referenced energy vector
reref_energy_vect = wrangler.get_property_vector('total_energy') - min_energy

# add it as a new property to the wrangler
# in this case, since the reref energy is a normalized
# quantity, we need to explicitly tell the wrangler
wrangler.add_properties('rereferenced_energy',
                        reref_energy_vect,
                        normalized=True)

# Now we have two properties in the wrangler that we can
# use to fit a cluster expansion: the total energy
# and the rereferenced energy
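
To make the per-prim normalization mentioned in the comments above concrete, here is a minimal NumPy-only sketch. The energies and supercell sizes are invented for illustration, and the names `total_energies` and `num_prims` are hypothetical rather than part of the `smol` API.

```python
import numpy as np

# Hypothetical total energies (eV) of three supercells and the number
# of primitive cells each supercell contains.
total_energies = np.array([-40.0, -82.0, -121.5])
num_prims = np.array([2, 4, 6])

# Normalize per primitive cell so different supercell sizes are comparable.
energy_per_prim = total_energies / num_prims

# Re-reference to the lowest normalized energy.
reref = energy_per_prim - energy_per_prim.min()
print(reref)
```

Taking the minimum of the raw totals instead would simply pick out the largest supercell, which is why the reference is chosen after normalizing.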

### 2.1) Another example of modifying target properties

We can also do more complex modifications.
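
One such modification is an energy of mixing. The sketch below is only an assumption about how that could look: it uses NumPy with made-up compositions and energies, and defines the mixing energy as the deviation from a linear interpolation between the two end members. The variables `x` and `energy` are hypothetical and are not obtained from the wrangler here.

```python
import numpy as np

# Hypothetical data: composition x in [0, 1] and energy per prim (eV).
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
energy = np.array([-10.0, -10.6, -11.2, -11.4, -12.0])

# End-member energies at the lowest and highest composition.
e0 = energy[np.argmin(x)]
e1 = energy[np.argmax(x)]

# Mixing energy: subtract the linear interpolation between end members.
mixing_energy = energy - ((1 - x) * e0 + x * e1)
```

A vector like `mixing_energy` could then be added to the wrangler in the same way as the re-referenced energy, provided it is normalized consistently with the other properties.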


### 3) Obtaining and adding weights

### 4) Filtering structures