# 4. Customized dataset

## Data list dataset

Suppose your dataset can be represented as a list, and each data point can 
be accessed separately with some function.
The list dataset descriptor helps you transform your reader function into 
a dataset loader, with a handy option to split your dataset.
The list can be a list of structure filenames, or of identifiers used to 
retrieve your data points, e.g. IDs from some online database.
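
To make the idea concrete, here is a minimal pure-Python sketch of what a reader-to-loader decorator could look like. It is not PiNN's implementation: `simple_list_loader`, its `split` and `seed` arguments, and the stand-in reader `read_length` are all hypothetical names, and the sketch returns plain Python lists rather than TensorFlow datasets.

```python
import random


def simple_list_loader(split=None, seed=0):
    """Hypothetical decorator: wrap a one-item reader into a loader
    that returns a dictionary of named splits."""
    split = split or {'train': 8, 'test': 2}  # assumed relative weights

    def decorator(reader):
        def loader(datalist, **kwargs):
            items = list(datalist)
            # Shuffle reproducibly so the splits are random but repeatable.
            random.Random(seed).shuffle(items)
            total = sum(split.values())
            result, start, acc = {}, 0, 0
            for name, weight in split.items():
                acc += weight
                stop = round(len(items) * acc / total)
                # The reader is only ever called on a single list element.
                result[name] = [reader(item, **kwargs)
                                for item in items[start:stop]]
                start = stop
            return result
        return loader
    return decorator


@simple_list_loader(split={'train': 8, 'test': 2})
def read_length(name):
    # Stand-in reader: a real one would open the file or query a database.
    return {'name': name, 'length': len(name)}


splits = read_length(['a.xyz', 'bb.xyz', 'ccc.xyz', 'dddd.xyz', 'eeeee.xyz'])
print({name: len(items) for name, items in splits.items()})
```

The point of the sketch is the division of labour: the reader knows how to load one data point, while the wrapper handles iteration and splitting.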

The advantage of this approach is that you only need to write the reader for 
one data point, 
and you get TensorFlow dataset objects with reasonably optimized IO.
Later, it's also easy to convert your dataset into the TFRecord format 
if you need to train on the cloud or further improve the IO.

We'll demonstrate this with a list of ASE atoms.

In [1]:
from ase import Atoms
datalist = [Atoms(elem) for elem in ['Cu', 'Ag', 'Au']]

For the purpose of training ANN potentials, you typically need to provide the 
elements, coordinates and potential energy of a structure. 
In addition, you need to specify the maximum number of atoms in one structure
in advance.

Your reader function should take one list element as input, 
and return a dictionary consisting of:

- `'atoms'`: the elements of shape [n_atoms]
- `'coord'`: the coordinates of shape [n_atoms, 3]
- `'e_data'`: the potential energy, a single number
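
As a quick illustration of these shapes, the sketch below builds such a dictionary with plain NumPy for a three-atom structure padded to `n_atoms=4`. The helper `make_datum`, the coordinates and the zero energy are illustrative assumptions, not part of PiNN or real data.

```python
import numpy as np


def make_datum(numbers, positions, energy, n_atoms):
    """Hypothetical helper: assemble one data point, zero-padded to n_atoms."""
    numbers = np.asarray(numbers, dtype=np.int32)
    positions = np.asarray(positions, dtype=np.float32).reshape(-1, 3)
    to_pad = n_atoms - len(numbers)
    if to_pad < 0:
        raise ValueError('structure has more atoms than n_atoms')
    return {
        # Pad only at the end of the atom axis.
        'atoms': np.pad(numbers, [0, to_pad], 'constant'),
        'coord': np.pad(positions, [[0, to_pad], [0, 0]], 'constant'),
        'e_data': np.float32(energy),
    }


# Illustrative water-like geometry; the numbers are placeholders.
datum = make_datum(
    numbers=[8, 1, 1],
    positions=[[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
    energy=0.0,
    n_atoms=4,
)
for key, value in datum.items():
    print(key, np.shape(value))
```

The padded slot shows up as element number 0 with coordinates at the origin.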

Once you have your reader function, decorate it with the `list_loader`
decorator to transform it into a dataset loader.

In [2]:
from pinn.datasets.base import list_loader

@list_loader()
def load_ase_list(atoms, n_atoms):
    import numpy as np
    coord = atoms.positions
    elems = atoms.numbers
    # Currently, all imported structures must have the same shape.
    # If structures have different numbers of atoms, we pad them with zeros.
    to_pad = n_atoms - len(atoms)
    elems = np.pad(elems, [0,to_pad], 'constant')
    coord = np.pad(coord, [[0,to_pad], [0,0]], 'constant')

    data = {'atoms': elems,
            'coord': coord,
            'e_data': 0.0}
    return data

That's it, you've got your customized dataset!

In [3]:
import tensorflow as tf

dataset = load_ase_list(datalist, n_atoms=1)['train']
d = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    print(sess.run(d))

{'atoms': array([29], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0}


If you need to import a more complex dataset, say, to include more 
features or labels, consider writing a formatter function 
to define your dataset structure.

For example, we add a molecular weight entry to the dataset here.

In [4]:
def my_formater(n_atoms):
    import tensorflow as tf
    format_dict = {
        'atoms': {'dtype': tf.int32, 'shape': [n_atoms]},
        'coord': {'dtype': tf.float32, 'shape': [n_atoms, 3]},
        'e_data': {'dtype': tf.float32, 'shape': []},
        'mw_data': {'dtype': tf.float32, 'shape': []},
    }
    return format_dict

@list_loader(formater=my_formater)
def load_ase_list(atoms, n_atoms):
    import numpy as np
    coord = atoms.positions
    elems = atoms.numbers

    to_pad = n_atoms - len(atoms)
    elems = np.pad(elems, [0,to_pad], 'constant')
    coord = np.pad(coord, [[0,to_pad], [0,0]], 'constant')

    data = {'atoms': elems,
            'coord': coord,
            'e_data': 0.0,
            'mw_data': atoms.get_masses().sum()}
    return data

In [5]:
dataset = load_ase_list(datalist, n_atoms=1)['train']
d = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    print(sess.run(d))

{'atoms': array([29], dtype=int32), 'coord': array([[0., 0., 0.]], dtype=float32), 'e_data': 0.0, 'mw_data': 63.546}
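
When writing your own formatter, the keys and shapes it declares have to agree with what the reader returns. The sketch below is one way to check this before building the dataset. It is not part of PiNN: `check_against_format` is a hypothetical helper, and NumPy dtypes stand in for the TensorFlow ones so that the snippet runs without TensorFlow.

```python
import numpy as np


def check_against_format(data, format_dict):
    """Hypothetical helper: list mismatches between a data point and a format."""
    problems = []
    for key in sorted(set(format_dict) | set(data)):
        if key not in data:
            problems.append(f'missing key: {key}')
        elif key not in format_dict:
            problems.append(f'unexpected key: {key}')
        elif list(np.shape(data[key])) != list(format_dict[key]['shape']):
            problems.append(f'shape mismatch for {key}')
    return problems


n_atoms = 1
# Same layout as the formatter above, with NumPy dtypes as stand-ins.
format_dict = {
    'atoms': {'dtype': np.int32, 'shape': [n_atoms]},
    'coord': {'dtype': np.float32, 'shape': [n_atoms, 3]},
    'e_data': {'dtype': np.float32, 'shape': []},
    'mw_data': {'dtype': np.float32, 'shape': []},
}
data = {
    'atoms': np.array([29]),
    'coord': np.zeros((1, 3)),
    'e_data': 0.0,
    'mw_data': 63.546,
}

problems = check_against_format(data, format_dict)
# Dropping an entry that the format declares should be reported.
missing = check_against_format(
    {k: v for k, v in data.items() if k != 'mw_data'}, format_dict)
print(problems, missing)
```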
