# General IO with Shnitsel tools

## Reading input data

Shnitsel tools is available in the ```shnitsel``` package.
Within the ```shnitsel.io``` module, we offer the handy ```read()``` function to read in a multitude of different formats. 
Currently, we support the following file types:
- SHARC outputs in both ICOND and TRAJ formats, tested with outputs from versions 2.0, 2.1 and 3.0.
- PyrAI2md outputs. Reading of NACs and SOCs is currently still limited; reading has been tested on version 2.4 outputs and is expected to work on version 2.5 outputs as well.
- NewtonX outputs, tested up to version 2.2.

We plan to support ASE database files soon.

During the call to `read()`, all input data will be converted into standard units as documented by the `shnitsel-tools` package, e.g. times are converted to `fs`, lengths to `Bohr`, forces to `Hartree/Bohr` and energies to `Hartree`. This allows for standardized and comparable processing independent of the input format, where different unit conventions are common.
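
To make the idea of a unit conversion concrete, here is a minimal sketch of how values in Ångström and eV map onto the standard units `Bohr` and `Hartree`. The table `TO_STANDARD` and the helper `to_standard()` are hypothetical names chosen for this illustration and are not part of the `shnitsel` API; only the conversion factors (CODATA 2018) are fixed facts.

```python
# Illustrative conversion factors (CODATA 2018). The helper below is a
# hypothetical sketch and NOT the conversion code used by shnitsel.
BOHR_PER_ANGSTROM = 1.0 / 0.529177210903  # 1 Bohr = 0.529177210903 Angstrom
HARTREE_PER_EV = 1.0 / 27.211386245988    # 1 Hartree = 27.211386245988 eV

# Factors that map a source unit onto the standard unit of each quantity.
TO_STANDARD = {
    "length": {"Bohr": 1.0, "Angstrom": BOHR_PER_ANGSTROM},
    "energy": {"Hartree": 1.0, "eV": HARTREE_PER_EV},
}


def to_standard(value, quantity, unit):
    """Convert `value` given in `unit` to the standard unit of `quantity`."""
    return value * TO_STANDARD[quantity][unit]


# A bond length of 1.34 Angstrom expressed in Bohr
bond_bohr = to_standard(1.34, "length", "Angstrom")
# An energy gap of 5.0 eV expressed in Hartree
gap_hartree = to_standard(5.0, "energy", "eV")
print(f"{bond_bohr:.4f} Bohr, {gap_hartree:.4f} Hartree")
```

Because every reader applies such a conversion before returning data, results from SHARC, NewtonX and PyrAI2md inputs can be compared directly afterwards.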

To use Shnitsel tools, we import the shnitsel package:

In [None]:
import shnitsel as st

The ```st.io.read()``` function handles all of the details of input of the different formats. 
Its only essential requirement is a ```path``` to the input that is supposed to be read.
Here, we have multiple options:
- ```path``` can point to the directory of a single trajectory, file or initial condition. Then only this one trajectory will be read.

In [None]:
# This call loads a single initial condition configuration
single_icond_butene = st.io.read(path='./test_data/sharc/iconds_butene/ICOND_00000')

- Alternatively, ```path``` can point to a directory of multiple directories or files containing trajectories or initial conditions. In that case, ```st.io.read(path)``` will iterate over relevant subdirectories and load them in parallel. In the end, the individual trajectories are combined into a single object.

In [None]:
# This call loads all initial conditions within the `./test_data/sharc/iconds_butene` directory, because the directory itself is not a trajectory.
iconds_butene = st.io.read(path='./test_data/sharc/iconds_butene')

# This call loads all trajectories within `./test_data/sharc/traj_butene` and combines them into a single object
trajectories_butene_sharc = st.io.read(path='./test_data/sharc/traj_butene')

# We can similarly load a collection of NewtonX trajectories
trajectories_butene_newtonx = st.io.read(path='./test_data/newtonx/test_trajectory')

When loading data from different formats, Shnitsel tools automatically detects the input type in order to apply the correct import logic. 
If the detection fails, e.g. because multiple formats match the input, the ```read()``` function offers the parameter ```kind```, which can be set to ```sharc```, ```newtonx```, or ```pyrai2md``` to select the input format explicitly without performing type detection:

In [None]:
# Here we tell shnitsel tools explicitly to load a sharc-type input
trajectories_butene_sharc = st.io.read(path='./test_data/sharc/traj_butene', kind='sharc')

# If we specify the wrong format, it will result in an error and the call will yield no result:
wrong_type = st.io.read(path='./test_data/newtonx/test_trajectory', kind='pyrai2md')

### Advanced `read()` configuration

While ```st.io.read()``` will generally try to configure the parsing routines in a way that fits most users, there are several settings that can be adapted:
- ```sub_pattern```: This is a path pattern against which file or subdirectory names in the ```path``` location are matched. Assume we have the directories ```TRAJ_0, TRAJ_1, ..., TRAJ_99``` and ```ICOND_0, ..., ICOND_4``` within the same folder, but we only want to load the ```ICOND_*``` entries. Then we can set ```sub_pattern='ICOND_*'``` and ```read()``` will only consider the matching entries in `path`.
- `input_units`: This is an optional setting to specify the units in which the different input variables are provided, e.g. `atXYZ` is the variable name for positional data, `forces` is the variable name for forces and `energy` is the variable name for the absolute system energy. The units are provided in a `{variable_name: 'unit_string'}` fashion, where the variable names can be taken from the declaration of the standard Shnitsel format and the unit values for the different unit types can be found in `shnitsel.units.definitions`.
- `input_state_types` and `input_state_names`: For various formats, it is not always possible to derive the multiplicity/type of the states and a readable state name label from the trajectory format. `read()` will make a best effort to apply reasonable state names and state multiplicities based on the input information. If the resulting state types (or names) do not match your expectations, you can set them by passing a list of types or names to `input_state_types` and `input_state_names`, respectively. Alternatively, either parameter accepts a function that takes the already parsed `xarray.Dataset` and sets the `ds.state_types` or `ds.state_names` values based on additional information from the dataset. Note that Shnitsel tools always applies state types before state names, and it uses the values of `input_state_types` and `input_state_names` in this order as well, if they are provided.
- ```concat_method='db'```: When multiple trajectories are loaded in parallel, they need to be combined into a single result. The default behavior is to pack them into a ShnitselDB structure (based on `xarray.DataTree`) to preserve all metadata of the individual trajectories, allowing for advanced filtering and storage according to the FAIR principles. Alternatives are:
  - `'concat'`, which attaches the trajectories along the `time` dimension, which is then referred to as `frame`, effectively building one long, concatenated trajectory. Metadata that differs between the concatenated trajectories is lost in the process.
  - `'layers'`, which packs all trajectories into a shared `xarray.Dataset` object that has a new `trajid` dimension (meaning: `trajectory id`), along which the different trajectories can be indexed. As with `'concat'`, individual differences in the metadata of the trajectories are lost on the combined object.
  - `'list'`, which simply returns the trajectories as a list of `xarray.Dataset` objects.
- `input_trajectory_id_maps`: This option allows for controlled setting of trajectory ids for individual datasets. If a dict is provided here, the keys are meant to be absolute paths in posix format (specifically of type `str`) with the associated values being the resulting ids. Alternatively, a function can be provided that receives the `pathlib.Path` object of the input path of the trajectory and is supposed to yield an integer id that is distinct from all other trajectories loaded with the same call to `read()`.
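
The following sketch illustrates how values for `sub_pattern` and `input_trajectory_id_maps` could be prepared. It assumes that the pattern is matched shell-style against entry names, which is emulated here with `fnmatch`, and it operates on a hypothetical directory listing. The helper `trajectory_id_from_path()` is an illustrative name, not a function provided by `shnitsel`.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical directory listing with both trajectories and initial conditions
entries = [
    Path("/data/butene/TRAJ_00001"),
    Path("/data/butene/TRAJ_00002"),
    Path("/data/butene/ICOND_00000"),
    Path("/data/butene/ICOND_00001"),
]

# Assumption: `sub_pattern` is matched shell-style against the entry names
sub_pattern = "ICOND_*"
selected = [p for p in entries if fnmatch(p.name, sub_pattern)]


def trajectory_id_from_path(path: Path) -> int:
    """Derive an integer id from the trailing number of the directory name."""
    return int(path.name.rsplit("_", 1)[1])


# Dict variant: keys are absolute posix-style paths of type `str`
id_map = {p.as_posix(): trajectory_id_from_path(p) for p in selected}
print(id_map)

# Either variant could then be passed on to `read()` (not executed here):
# st.io.read(path="/data/butene", sub_pattern=sub_pattern,
#            input_trajectory_id_maps=trajectory_id_from_path)
```

The function variant is convenient when the ids are encoded in the directory names, while the dict variant gives full control over individual entries.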
 
### More technical `read()` configuration options

- ```multiple=True```: This flag determines whether `read()` should automatically look for entries within the `path` folder to load as input datasets. If set to `False`, `read()` only attempts to load `path` itself as a single dataset.
- `parallel=True`: A flag that by default allows parallel processing of input in multiple child processes to accelerate multi-trajectory input.
- `error_reporting='log'`: A setting to determine whether errors should be thrown as an exception or just logged within the read routine. If parallel reading is enabled, only `log` is supported. 
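
As a rough illustration of log-style error reporting during parallel loading, the sketch below catches failures per entry, logs them, and continues with the remaining entries. It is not the `shnitsel` implementation: `load_one()` is a hypothetical stand-in for a format parser, and threads are used instead of child processes to keep the example short.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("read_sketch")


def load_one(name):
    """Hypothetical stand-in for parsing a single trajectory directory."""
    if "broken" in name:
        raise ValueError(f"cannot parse {name}")
    return {"name": name}


def load_logged(name):
    """Log failures instead of raising, so one bad entry cannot abort the batch."""
    try:
        return load_one(name)
    except Exception as exc:
        log.error("Skipping %s: %s", name, exc)
        return None


names = ["TRAJ_00001", "TRAJ_broken", "TRAJ_00003"]
with ThreadPoolExecutor(max_workers=2) as pool:
    # `map` preserves the input order, so results line up with `names`
    results = list(pool.map(load_logged, names))

loaded = [r for r in results if r is not None]
print(f"{len(loaded)} of {len(names)} entries loaded")
```

With this pattern a single unreadable trajectory shows up in the log but does not prevent the remaining trajectories from being combined.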


## Storing data on disk in Shnitsel/NetCDF format

Within the `shnitsel.io` module, we provide the function `write_shnitsel_file(dataset, savepath)` to write Shnitsel-format files using the NetCDF4 format. 
Writing is performed using the `to_netcdf()` functionality that the `xarray` package provides for its data types. The compression can be controlled with the option `complevel` (default: `complevel=9`), which sets the compression level on a scale from `0` (not compressed) to `9` (maximum compression).
Higher `complevel` values result in smaller output files but take longer to write. 
Lower compression levels are therefore preferable if write latency is relevant to the application.
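
The trade-off between compression level and output size can be illustrated with Python's built-in `zlib` module, assuming a deflate-style compression as commonly used for NetCDF4 files. This is a stand-alone sketch on synthetic data and does not call `write_shnitsel_file()`; actual ratios depend on the stored data.

```python
import zlib

# Highly repetitive synthetic payload standing in for trajectory data
payload = b"0.529177 -1.234567 2.345678\n" * 20_000

# Compressed size for no, fast and maximum compression
sizes = {}
for level in (0, 1, 9):
    sizes[level] = len(zlib.compress(payload, level))

for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(payload):.1%} of input)")
```

Level `0` only wraps the data without shrinking it, whereas higher levels spend more CPU time to produce smaller output.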

In [None]:
# Import initial conditions as a ShnitselDB.
iconds_butene = st.io.read(path='./test_data/sharc/iconds_butene', concat_method='db')

assert iconds_butene is not None, "Loading of initial conditions failed"
assert not isinstance(iconds_butene, list), "Format should be ShnitselDB"

# Write the entire set of initial conditions to `savepath` using the default (maximum) compression level
st.io.write_shnitsel_file(iconds_butene, savepath="./test_data/sharc/iconds_butene.nc")

# Import a single initial condition to demonstrate that this also works with individual datasets
iconds_butene_single = st.io.read(path='./test_data/sharc/iconds_butene/ICOND_00000')
assert iconds_butene_single is not None, "Loading of initial conditions failed"
assert not isinstance(iconds_butene_single, list), "Format should be Trajectory/xr.Dataset"

# Write the dataset without compression (complevel=0), yielding faster write times but larger output files.
st.io.write_shnitsel_file(iconds_butene_single, savepath="./test_data/sharc/iconds_butene_single_low_compression.nc", complevel=0)

## TL;DR: Converting trajectories to Shnitsel-format

To simplify the process of converting trajectories to Shnitsel format for publication, we provide a simple command line script.
The script `convert_to_shnitsel_file` is installed with the `shnitsel-tools` package (if installed in a virtual environment, remember to activate that environment).

It can be called like this to convert a set of trajectories into a shnitsel-db format:

In [None]:
# The convert_to_shnitsel_file script reads the input path and applies the default unit conversion.
# Optionally, it sets the compound name of the loaded data (`-c` option) and an optional group name (`-g` option).
# With the mandatory est_level parameter (`-est` or `--est_level`), the user specifies the est_level used to simulate the data. 
# With the mandatory basis set parameter (`-basis` or `--basis_set`) the user specifies the basis set of the QM calculations.
# These two settings will automatically be stored in the metadata of all loaded trajectories. 
# Please convert trajectories with different est levels or basis sets through different calls.
%sx convert_to_shnitsel_file ./test_data/sharc/iconds_butene/ -o ./test_data/converted_iconds_butene.nc -k sharc -c butene -g iconds -est CASSCF -basis cc-pVDZ

# Similarly to the call to `st.io.read()`, this script also provides the option to specify a pattern to filter subdirectories (option `-p`)
# and to set the log level during conversion (option `-log`).
%sx convert_to_shnitsel_file ./test_data/sharc/traj_butene/ -o ./test_data/converted_traj_butene.nc -k sharc -c butene -g trajc -est CASSCF -basis cc-pVDZ -p TRAJ_* -log info



### Merging multiple Shnitsel-format files into a single file

You can merge multiple converted Shnitsel files with the command `merge_shnitsel_files`, which is also installed with the `shnitsel-tools` package.
Provide the input files as a list of paths, followed by the output path with the `-o` option, and optionally set the log level for debugging using the `-log` option:

In [None]:
# Merge butene iconds and trajectories into a mixed-format dataset
%sx merge_shnitsel_files ./test_data/converted_iconds_butene.nc ./test_data/converted_traj_butene.nc -o ./test_data/combined_butene.nc

Note that compound names and group names applied in the conversion (or manually, if the conversion is done in code) are preserved when merging two or more files.
If two files contain the same compounds, their compound-data will be merged. 
If two compounds that are being merged contain a group of the same name, that group will also be merged. 

Further note that no identity check is performed on trajectories while merging groups or compounds. 
This can lead to the same trajectory dataset existing within the same dataset multiple times if you accidentally made it a part of multiple input files.
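
The merge semantics described above can be sketched with plain nested dictionaries of the form `{compound: {group: [trajectory, ...]}}`. This layout and the helper `merge_trees()` are simplifications made up for this illustration; they are not the data structures used by `merge_shnitsel_files`.

```python
def merge_trees(*trees):
    """Merge {compound: {group: [trajectory, ...]}} mappings by name."""
    merged = {}
    for tree in trees:
        for compound, groups in tree.items():
            target = merged.setdefault(compound, {})
            for group, trajectories in groups.items():
                # Same-named groups are merged; no identity check is performed
                target.setdefault(group, []).extend(trajectories)
    return merged


# Two hypothetical input files that both contain ICOND_00001
file_a = {"butene": {"iconds": ["ICOND_00000", "ICOND_00001"]}}
file_b = {"butene": {"iconds": ["ICOND_00001"], "traj": ["TRAJ_00001"]}}

combined = merge_trees(file_a, file_b)
print(combined)

# Counting occurrences reveals entries that ended up in the result twice
all_entries = [t for groups in combined.values() for ts in groups.values() for t in ts]
duplicates = sorted({t for t in all_entries if all_entries.count(t) > 1})
print("duplicates:", duplicates)
```

A check of this kind on the list of input directories before conversion is one way to avoid duplicated trajectories in the merged file.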

## Reading the Shnitsel/NetCDF format
The Shnitsel-format files are easy to import back into shnitsel-tools. This uses the same call to `st.io.read()` as for the other input formats. 
If you want to specify the input format, you can set `kind='shnitsel'`. 
All other options of `read()` are also supported for shnitsel-format inputs.

In [None]:
# Reading a shnitsel file with automatic type detection
shnitsel_input_butene = st.io.read("./test_data/sharc/iconds_butene.nc")

# Specifying the type explicitly
shnitsel_input_butene = st.io.read("./test_data/sharc/iconds_butene.nc", kind="shnitsel")

Shnitsel-files have a version attribute to allow for backward-compatible loading of shnitsel-files in later versions of shnitsel-tools. 
Please be aware that loading an old shnitsel-format file can result in warnings due to missing metadata in the earliest versions of the shnitsel-format.
If there are issues with loading a shnitsel file, it may be that you are using an older version of `shnitsel-tools` than the one that was used to write that file. 
In that case, please update `shnitsel-tools` to the latest version.

## Reading and writing ASE/SPaiNN/SchNet format databases

We are currently working on the support of ASE/SPaiNN/SchNet format input data. 
The input will be enabled through `st.io.read()` again with the input format flag `kind='ase'`.

In [None]:
# Reading an ase-style sql database input:
ase_input = st.io.read("./test_data/ase/spainn_sh2nh2+.db")

We plan to support the output of shnitsel-style datasets to a spainn or schnet database via the method `st.io.write_ase_db()`:

In [None]:
# Writing in spainn format
st.io.write_ase_db(ase_input, "./test_data/ase/spainn_conversion.db", kind="spainn")

# Writing in schnarc format (to a separate file, so that the spainn output is not overwritten)
st.io.write_ase_db(ase_input, "./test_data/ase/schnarc_conversion.db", kind="schnarc")

## Working with trajectories

Here, we read the data of two butene trajectories, recorded upon excitation to the $\mathrm S_1$ state, from the folder ```test_data/sharc/traj_butene```.

In the resulting xarray object, we can see that we have information available on the energies, forces, NACs, transition and permanent dipoles and the phase of the wavefunctions for all four configurations of butene in three electronic states.


In [None]:

# Read the butene trajectory in sharc format from the sample data 
trajectories_butene_sharc = st.io.read(path='./test_data/sharc/traj_butene', kind='sharc')
trajectories_butene_sharc

# Selecting data is done with various xarray functionalities

# E.g. data at a certain time-step (identified by its index 300)
trajectories_butene_sharc.isel(time=300)

# Read the butene trajectory in sharc format from the sample data but this time with the layering option to introduce a trajid coordinate:
trajectories_butene_sharc = st.io.read(path='./test_data/sharc/traj_butene', kind='sharc', concat_method="layers")

# Data for a certain trajectory (selected by its trajid)
trajectories_butene_sharc.sel(trajid=2)