# Installation

In a new conda environment, install the following packages in the order
listed:

1. `pip install git+https://github.com/luthaf/rascaline.git@clebsch_gordan`
2. `pip install git+https://github.com/lab-cosmo/metatensor.git`
3. Optional but nice: `pip install chemiscope` (allows you to visualize the
   dataset)


Also required: `ase` and `numpy`.

In [None]:
%load_ext autoreload
%autoreload 2

import os
import ase.io
import numpy as np

import chemiscope
import metatensor
from metatensor import Labels, TensorBlock, TensorMap

import rascaline
import clebsch_gordan
# from rascaline.utils import clebsch_gordan

# Read frames and visualize with `chemiscope`

In [None]:
frames = ase.io.read("combined_magres_spherical.xyz", ":")
chemiscope.show(frames, mode="structure")

# Convert target property to `metatensor` format

In [None]:
# We will create a TensorMap for each frame, then combine them into a single
# TensorMap at the end
structure_tms = []
for frame_i, frame in enumerate(frames):
    # Get the number of atoms in the frame
    n_atoms = frame.get_global_number_of_atoms()

    # Store the target data by l value and chemical species (i.e. atomic number)
    data_dict = {}
    for atom_i, atomic_number in enumerate(frame.get_atomic_numbers()):
        # Read the target arrays from the current frame (not from frames[0])
        for l, data in zip(
            [0, 2], [frame.arrays["efg_L0"], frame.arrays["efg_L2"]]
        ):
            key = (l, atomic_number)
            data_arr = data[atom_i]
            if isinstance(data_arr, float):
                data_arr = np.array([data_arr])
            # Store the data array
            if data_dict.get(key) is None:
                data_dict[key] = {atom_i: data_arr}
            else:
                data_dict[key][atom_i] = data_arr

    # Build the keys of the resulting TensorMap
    keys = Labels(
        names=["spherical_harmonics_l", "species_center"],
        values=np.array([[l, species_center] for l, species_center in data_dict.keys()]),
    )

    # Construct the TensorMap blocks for each of these keys
    blocks = []
    for l, species_center in keys.values:
        # Retrieve the raw block data
        data = data_dict[(l, species_center)]

        # Get a list of sorted samples (i.e. atom indices) for the block 
        n_atoms_block = len(data)
        ordered_atom_idxs = sorted(data.keys())

        # Sort the raw block data
        block_data = np.array([data[atom_i] for atom_i in ordered_atom_idxs]).reshape(
            n_atoms_block, 2 * l + 1, 1
        )

        # Construct a TensorBlock, where the raw data is labelled with metadata
        # Note here that we keep track of the structure index - this is
        # important for later when we join the TensorMaps
        block = TensorBlock(
            values=block_data,
            samples=Labels(
                names=["structure", "center"],
                values=np.array([[frame_i, atom_i] for atom_i in ordered_atom_idxs]),
            ),
            components=[
                Labels(
                    names=["spherical_harmonics_m"],
                    values=np.arange(-l, l + 1).reshape(-1, 1),
                )
            ],
            properties=Labels(
                names=["efg"],
                values=np.array([[0]]).reshape(-1, 1),
            ),
        )
        # Store the block
        blocks.append(block)

    # Construct a TensorMap for this structure from the keys and blocks
    structure_tms.append(TensorMap(keys=keys, blocks=blocks))

# Now join the structure-based TensorMaps into a single TensorMap. We want to
# join along the "samples" axis
efg = metatensor.join(structure_tms, axis="samples", remove_tensor_name=True)

# Save the TensorMap to file
metatensor.save("efg.npz", efg)

The TensorMap consists of 6 blocks, each corresponding to a different
combination of l channel and chemical species:

In [None]:
efg

In [None]:
efg.keys

We can pick out all the invariant blocks using `TensorMap.blocks()`:

In [None]:
efg.blocks(spherical_harmonics_l=0)

Or just a single block using `TensorMap.block()`, e.g. for l = 2 and titanium:

In [None]:
block = efg.block(spherical_harmonics_l=2, species_center=22)
block

Each TensorBlock is a tensor of values, wrapped with metadata. The "samples" are
always the first axis, the "properties" always the last axis, and all intermediate
axes are "components". Here the samples track the atom indices and which
structure they belong to. The components track the symmetry of the target
property. In this case we're representing the data in the spherical basis, so
there is a single component axis that tracks the $m$ component of the
irreducible spherical component (ISC) vector. As we only have a single property
per atom the properties axis has size one, and is just labelled with "efg" here. 

In [None]:
block.values.shape
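
To make the axis layout concrete, here is a minimal numpy-only sketch. It does
not use `metatensor`; the three atoms and their values are made up for
illustration, and only the (samples, components, properties) ordering mirrors
the block above.

```python
import numpy as np

l = 2
n_atoms_block = 3  # hypothetical number of atoms of one species

# One (2l + 1)-component vector per atom (made-up values)
per_atom = [np.arange(2 * l + 1, dtype=float) + 10 * i for i in range(n_atoms_block)]

# Samples on the first axis, a single component axis, properties on the last
values = np.array(per_atom).reshape(n_atoms_block, 2 * l + 1, 1)

# Labels of the component axis: m = -l, ..., +l
m_labels = np.arange(-l, l + 1)

# Pick out the m = 0 entry for every atom
m0_index = int(np.where(m_labels == 0)[0][0])
m0 = values[:, m0_index, 0]

print(values.shape)
print(m0)
```

The size-one properties axis is kept on purpose, so that the target has the
same number of axes as a descriptor block with many properties.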

The raw data is stored in this case as a numpy array:

In [None]:
type(block.values)

But we can also convert, for instance, to a torch backend. (This is commented
out in case you don't have torch installed.)

In [None]:
# import torch

# efg_torch = metatensor.to(efg, backend="torch")
# type(efg_torch.block(0).values)

# Generate $\lambda$-SOAP descriptor

First we need to generate a $\nu=1$ order spherical expansion (an
atom-centered density correlation) using the `SphericalExpansion` calculator in
rascaline. From there we combine it with itself using Clebsch-Gordan iterations
to generate the $\nu=2$ order descriptor, i.e. $\lambda$-SOAP.

In [None]:
# Define hyperparameters for generating the rascaline SphericalExpansion
rascal_hypers = {
    "cutoff": 3.0,  # Angstrom
    "max_radial": 6,  # Exclusive
    "max_angular": 5,  # Inclusive
    "atomic_gaussian_width": 0.2,
    "radial_basis": {"Gto": {}},
    "cutoff_function": {"ShiftedCosine": {"width": 0.5}},
    "center_atom_weight": 1.0,
}
calculator = rascaline.SphericalExpansion(**rascal_hypers)
nu_1_tensor = calculator.compute(frames)
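
As a rough guide to the size of the expansion, the sketch below counts the
(n, l, m) coefficients implied by the "Exclusive" and "Inclusive" comments on
`max_radial` and `max_angular`. It is plain Python, does not call rascaline,
and assumes the count is per atomic center and per neighbor species.

```python
max_radial = 6   # exclusive: n = 0, ..., max_radial - 1
max_angular = 5  # inclusive: l = 0, ..., max_angular

radial_channels = list(range(max_radial))
angular_channels = list(range(max_angular + 1))

# Each l channel carries 2l + 1 values of m, for every radial channel n
n_coefficients = sum(len(radial_channels) * (2 * l + 1) for l in angular_channels)

print(radial_channels)
print(angular_channels)
print(n_coefficients)
```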

In [None]:
# Define target lambda channels - we only want the l=0 (invariant) and l=2
# channels, to match the target property (EFG)
angular_selection = [0, 2]

# Now generate the lambda-SOAP vector
lsoap = clebsch_gordan.lambda_soap_vector(
    nu_1_tensor=nu_1_tensor,
    angular_selection=angular_selection,
    parity_selection=+1,
)

# Save
metatensor.save("lsoap.npz", lsoap)

lsoap

A note on the `angular_cutoff` argument, which is available but not used above:
it sets the maximum intermediate value of lambda used when performing CG
iterations. The largest lambda that can still give non-zero combinations is
`target_body_order * rascal_hypers["max_angular"]`.

`target_body_order` is the target body order of the descriptor. When using the
public function `clebsch_gordan.lambda_soap_vector` as above,
`target_body_order` is 2 by definition, so it does not need to be specified.
To generate descriptors of higher body order with the public function
`clebsch_gordan.combine_single_center_to_body_order`, this argument can be
specified.

In some cases, particularly for combinations to high body order (beyond
lambda-SOAP), combining at each iteration up to the theoretical maximum can lead
to memory blow-up, so `angular_cutoff` needs to be tailored to each case.
Setting this cutoff below the theoretical maximum leads to some information
loss. In the function call above I didn't specify `angular_cutoff`, so it takes
the maximum by default.
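
The growth of lambda with body order can be sketched with the triangle rule
alone: combining channels l1 and l2 gives lambda between |l1 - l2| and l1 + l2.
The helper `reachable_lambdas` below is hypothetical and is not part of
`clebsch_gordan`; it only tracks which angular channels can appear after each
iteration, and assumes each iteration combines the current tensor with the
$\nu=1$ tensor.

```python
def reachable_lambdas(max_angular, target_body_order, angular_cutoff=None):
    """Angular channels reachable after each CG iteration (triangle rule only)."""
    nu_1 = set(range(max_angular + 1))
    current = set(nu_1)
    history = [sorted(current)]
    # Body order nu requires nu - 1 combinations with the nu = 1 tensor
    for _ in range(target_body_order - 1):
        combined = set()
        for l1 in current:
            for l2 in nu_1:
                for lam in range(abs(l1 - l2), l1 + l2 + 1):
                    if angular_cutoff is None or lam <= angular_cutoff:
                        combined.add(lam)
        current = combined
        history.append(sorted(current))
    return history


# lambda-SOAP: one iteration, max lambda = 2 * 5
print(max(reachable_lambdas(5, 2)[-1]))
# Body order 3 without a cutoff: max lambda = 3 * 5
print(max(reachable_lambdas(5, 3)[-1]))
# Body order 3 with a cutoff applied at every iteration
print(reachable_lambdas(5, 3, angular_cutoff=6)[-1])
```

The number of channels, and hence blocks, grows with every iteration unless a
cutoff is applied, which is where the memory cost comes from.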

The same logic also applies to the use of `angular_selection` and
`parity_selection`. Applying selection filters on intermediate iterations (i.e.
not the final one) can lead to some information loss, but reduces memory
consumption, particularly at high body orders. `angular_cutoff` is essentially a
way of applying a global maximum cutoff to all iterations, whereas
`angular_selection` allows control of which specific angular channels are
constructed at each iteration. In our case, we only perform one iteration to
form our lambda-SOAP descriptor. As there are technically no intermediate
iterations (these only exist when the number of iterations is > 1), applying
angular and parity selections on the final (and only) iteration results in no
information loss.
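
As an illustration of what these selections keep, the sketch below lists the
(l1, l2) pairs that can contribute to a given lambda in a single combination.
It assumes the common convention that the parity of a $\nu=2$ term is
sigma = (-1)^(l1 + l2 + lambda); the helper `nu_2_pairs` is hypothetical and
not part of `clebsch_gordan`. It counts ordered pairs, whereas an actual
implementation may skip symmetric duplicates.

```python
def nu_2_pairs(max_angular, lam, sigma):
    """Ordered (l1, l2) pairs contributing to channel `lam` with parity `sigma`."""
    pairs = []
    for l1 in range(max_angular + 1):
        for l2 in range(max_angular + 1):
            # Triangle rule: |l1 - l2| <= lam <= l1 + l2
            if not abs(l1 - l2) <= lam <= l1 + l2:
                continue
            # Assumed parity convention for a nu = 2 term
            if (-1) ** (l1 + l2 + lam) == sigma:
                pairs.append((l1, l2))
    return pairs


for lam in [0, 2]:
    for sigma in [+1, -1]:
        print(lam, sigma, len(nu_2_pairs(5, lam, sigma)))
```

Under this convention the lambda = 0 channel has no sigma = -1 contributions
at all, so for the invariants a parity selection of +1 discards nothing.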

Inspect the $\lambda$-SOAP descriptor:

In [None]:
lsoap.block(spherical_harmonics_l=2, species_center=22)

The samples and components metadata are equivalent to those of the
corresponding block of the EFG tensor, but the properties aren't, as the
mapping from descriptor properties to target properties is what we want to
machine learn.

It is important that all other metadata, except for the properties of each
block, agrees. More specifically, there should be a one-to-one mapping of
blocks, indexed by the same set of keys. For each of these blocks, the samples
and components should be the same and in the same order.

This can be checked with the following function in `metatensor`. Here we toggle a
check for samples and components:

In [None]:
assert metatensor.equal_metadata(lsoap, efg, check=["samples", "components"])

But checking for properties too fails the test, as expected:

In [None]:
assert metatensor.equal_metadata(lsoap, efg, check=["samples", "components", "properties"])
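
To see why the order of the samples matters as well as their content, consider
this numpy-only sketch. The sample arrays are made up and no `metatensor` call
is involved; it only illustrates that two label sets with the same rows in a
different order do not line up row by row until one of them is sorted.

```python
import numpy as np

# Hypothetical (structure, center) sample labels for two blocks
samples_a = np.array([[0, 0], [0, 1], [1, 0]])
samples_b = np.array([[0, 1], [0, 0], [1, 0]])

# Same set of rows...
same_rows = {tuple(r) for r in samples_a.tolist()} == {
    tuple(r) for r in samples_b.tolist()
}
# ...but not in the same order
same_order = np.array_equal(samples_a, samples_b)

# Sort b by structure, then by center, so that it lines up with a
order = np.lexsort(samples_b.T[::-1])
samples_b_sorted = samples_b[order]

print(same_rows, same_order)
print(np.array_equal(samples_a, samples_b_sorted))
```

Any reordering of the labels has to be applied to the corresponding rows of the
values as well, otherwise the data and metadata no longer match.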

There are loads of other operations and convenience functions for checking and
manipulating both the data and metadata of TensorMaps. Check out the
[docs](https://lab-cosmo.github.io/metatensor/latest/) for more info! If there's
an operation we don't have but you think would be useful, please [open an
issue](https://github.com/lab-cosmo/metatensor/issues/new/choose) :)