# Awkward Target for Kaitai Struct
Data formats for scientific data often differ across experiments due to hardware design and availability constraints. To interact with these unique data formats, researchers have to develop, document, and maintain specific analysis software that is often tightly coupled to a particular data format.

This proliferation of custom data formats has been a prominent challenge for the Nuclear and High Energy Physics (NHEP) community. Within the Large Hadron Collider (LHC) experiments, this problem has largely been mitigated with the widespread adoption of `ROOT`.

However, not all experiments in the NHEP community use `ROOT` for their data formats. Experiments such as the Cryogenic Dark Matter Search (CDMS) continue to use custom data formats to meet specific research needs. Simplifying the path from a unique data format to analysis code, without requiring any changes to the existing format, therefore still holds immense value for the broader NHEP community.

<div>
    <img src="img/Kaitai%20logo.png" width="40" style="float: left; margin-right: 10px;"/>
    <h2>Kaitai Struct YAML</h2>
</div> <br>

Kaitai Struct is a declarative language that uses a YAML-like description of a binary data structure to generate code, in any supported language, for reading a raw data file.

The main idea is that a particular format is described in the `Kaitai Struct YAML (KSY)` language and then compiled with the `Kaitai Struct Compiler (ksc)` into source files in one of the supported programming languages. These modules contain generated code for a parser that can read the described data structure from a file or stream, and they can be included as a library in any external code.
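As a rough sketch of that compile step, the helper below only assembles the compiler command line. `ksc_command` is a hypothetical name, and the sketch assumes the compiler is installed under the name `kaitai-struct-compiler` and accepts its `--target` and `--outdir` options.

```python
import shlex


def ksc_command(ksy_path, target="python", outdir="."):
    """Build the argument list for compiling a .ksy file (hypothetical helper)."""
    return ["kaitai-struct-compiler", "--target", target, "--outdir", outdir, ksy_path]


# Assumed file and directory names, for illustration only
cmd = ksc_command("animal.ksy", target="python", outdir="generated")
print(shlex.join(cmd))
```

If the compiler is on the `PATH`, passing `cmd` to `subprocess.run(cmd, check=True)` would run it.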

<p style="text-align: center;">
    <img src="img/Kaitai%20workflow.png" width="800">
</p>

### Kaitai Example

Consider a simple binary data structure that describes an animal's species, age, and weight, which you want to analyze in Python. The structure can be described in a `KSY` file. This file is then passed to the Kaitai Struct compiler, which generates a Python script that can be imported into any other Python code.


<p style="float: right;">
    <img src="img/animal.png" width="600" style="margin-bottom: 20px;"> <br>
    <img src="img/animal%20python.png" width="500" style="margin-top: 20px;">
</p>

```yaml
meta:
  id: animal
  endian: le
  license: CC0-1.0
  ks-version: 0.8

seq:

  - id: entry
    type: animal_entry
    repeat: eos

types:

  animal_entry:
    seq:
      - id: str_len
        type: u1

      - id: species
        type: str
        size: str_len
        encoding: UTF-8

      - id: age
        type: u1

      - id: weight
        type: u2
```

Generated Python script `animal.py`

```python
import kaitaistruct
from kaitaistruct import KaitaiStruct, KaitaiStream, BytesIO


if getattr(kaitaistruct, 'API_VERSION', (0, 9)) < (0, 9):
    raise Exception("Incompatible Kaitai Struct Python API: 0.9 or later is required, but you have %s" % (kaitaistruct.__version__))

class Animal(KaitaiStruct):
    def __init__(self, _io, _parent=None, _root=None):
        self._io = _io
        self._parent = _parent
        self._root = _root if _root else self
        self._read()

    def _read(self):
        self.entry = []
        i = 0
        while not self._io.is_eof():
            self.entry.append(Animal.AnimalEntry(self._io, self, self._root))
            i += 1


    class AnimalEntry(KaitaiStruct):
        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.str_len = self._io.read_u1()
            self.species = (self._io.read_bytes(self.str_len)).decode(u"UTF-8")
            self.age = self._io.read_u1()
            self.weight = self._io.read_u2le()
```
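To make the byte layout concrete, the following standard-library sketch packs two made-up animal entries and reads them back by hand, mirroring the field order in `AnimalEntry._read` above. `pack_animal` and `parse_animals` are illustrative helpers, not part of the generated module, and the sample values are invented.

```python
import io
import struct


def pack_animal(species, age, weight):
    """Encode one entry: u1 length, UTF-8 species, u1 age, u2 little-endian weight."""
    name = species.encode("utf-8")
    return struct.pack("<B", len(name)) + name + struct.pack("<BH", age, weight)


def parse_animals(data):
    """Read entries until the end of the stream, like `repeat: eos` in the KSY file."""
    stream = io.BytesIO(data)
    entries = []
    while True:
        head = stream.read(1)
        if not head:  # nothing left to read: end of stream
            break
        str_len = head[0]
        species = stream.read(str_len).decode("utf-8")
        age, weight = struct.unpack("<BH", stream.read(3))
        entries.append({"species": species, "age": age, "weight": weight})
    return entries


# Invented sample data
raw = pack_animal("cat", 5, 12) + pack_animal("dog", 3, 43)
animals = parse_animals(raw)
print(animals)
```

With the generated `animal.py` and the `kaitaistruct` runtime installed, `Animal.from_bytes(raw).entry` should expose the same fields as attributes of each entry.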

<div>
    <img src="img/Awkward%20logo.png" width="200" style="align=left;"/>
</div>

When dealing with large files and complicated scientific data structures that consist of nested arrays, even efficient Python code can be slow and memory-hungry.

<p style="text-align: center;">
    <strong>MIDAS Event Format</strong> <br>
    <img src="img/MIDAS.jpg" width="700">
</p>


Awkward Array is a Scikit-HEP library that uses NumPy-like idioms to create arrays of nested records, variable-length lists, mixed types, and missing data. It offers a dynamic and memory-efficient way to represent complex data structures as NumPy-like arrays. Therefore, we propose adding Awkward Array as one of the target languages for Kaitai Struct using the header-only `LayoutBuilder`.

Refer to Ianna Osborne's talk to learn more about [LayoutBuilder](https://github.com/ianna/PyHEP-Users-Workshop-2023/blob/main/slides/Awkward_LayoutBuilder.ipynb).
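The memory efficiency comes from storing nested data as a few flat buffers rather than many small Python objects. The pure-Python sketch below illustrates the idea for a list of variable-length lists, using an offsets buffer and a flat content buffer. It is a conceptual illustration with hypothetical helper names, not the Awkward Array API.

```python
def to_offsets_and_content(nested):
    """Flatten a list of lists into (offsets, content) buffers."""
    offsets = [0]
    content = []
    for sublist in nested:
        content.extend(sublist)
        offsets.append(len(content))  # end position of this sublist
    return offsets, content


def get_sublist(offsets, content, i):
    """Recover the i-th sublist by slicing the flat content buffer."""
    return content[offsets[i]:offsets[i + 1]]


offsets, content = to_offsets_and_content([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(offsets)                           # [0, 3, 3, 5]
print(content)                           # [1.1, 2.2, 3.3, 4.4, 5.5]
print(get_sublist(offsets, content, 2))  # [4.4, 5.5]
```

The empty middle list costs only one repeated offset, which is why this layout handles ragged data compactly.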

## Awkward Target for Kaitai: How does it work?

Users describe their data format once, in a `.ksy` file. This KSY file is then converted into a compiled Python module (C++ files wrapped with pybind11 and built with scikit-build) that reads the raw data and converts it into Awkward Arrays.

<p style="text-align: center;">
    <img src="img/Awkward%20Kaitai%20workflow.png" width="1000">
</p>

### Example

Consider this simple `fake.ksy` file:

```yaml
meta:
  id: fake
  file-extension: raw
  endian: le

seq:
  - id: points
    type: point
    repeat: eos

types:
  point:
    seq:
      - id: x
        type: u4
      - id: y
        type: u4
      - id: z
        type: u4
```

#### Steps

1. Clone the `kaitai_awkward_runtime` repository and change into its directory

In [1]:
!git clone --recursive https://github.com/ManasviGoyal/kaitai_awkward_runtime.git
%cd kaitai_awkward_runtime

Cloning into 'kaitai_awkward_runtime'...
remote: Enumerating objects: 242, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (165/165), done.
remote: Total 242 (delta 127), reused 176 (delta 69), pack-reused 0
Receiving objects: 100% (242/242), 20.83 MiB | 25.55 MiB/s, done.
Resolving deltas: 100% (127/127), done.
Submodule 'kaitai_struct_compiler' (https://github.com/ManasviGoyal/kaitai_struct_compiler.git) registered for path 'kaitai_struct_compiler'
Submodule 'kaitai_struct_cpp_stl_runtime' (https://github.com/kaitai-io/kaitai_struct_cpp_stl_runtime.git) registered for path 'kaitai_struct_cpp_stl_runtime'
Cloning into '/home/jovyan/kaitai_awkward_runtime/kaitai_struct_compiler'...
remote: Enumerating objects: 24374, done.
remote: Counting objects: 100% (534/534), done.
remote: Compressing objects: 100% (159/159), done.
remote: Total 24374 (delta 174), reused 483 (delta 150), pack-reused 23840
Receiving

2. Install the Python module via `pip`

In [2]:
%pip install . --config-settings 'cmake.define.KSY=schemas/fake.ksy'   

Processing /home/jovyan/kaitai_awkward_runtime
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting awkward>2.4.3 (from kaitai-awkward-runtime==0.1.dev54+g8cf7586)
  Downloading awkward-2.4.5-py3-none-any.whl (708 kB)
Collecting awkward-cpp==24 (from awkward>2.4.3->kaitai-awkward-runtime==0.1.dev54+g8cf7586)
  Downloading awkward_cpp-24-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting numpy>=1.18.0 (from awkward>2.4.3->kaitai-awkward-runtime==0.1.dev54+g8cf7586)
  Downloading numpy-1.26.0-cp310-cp310-manylinux_2_17_x86_64.man

Along with the `fake.h` and `fake.cpp` files, which contain the code for parsing the binary format, a `fake_main.cpp` file is generated to create the compiled Python module.

<p style="text-align: center;">
    <img src="img/fake%20awkward.png" width="500">
</p>

```C++
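#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <cstdint>   // uint8_t
#include <fstream>
#include <iostream>
#include <map>       // std::map used for the buffer tables below
#include <string>
#include "awkward/LayoutBuilder.h"
#include "fake.h"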
namespace py = pybind11;
/**
 * Create a snapshot of the given builder, and return an `ak.Array` pyobject
 * @tparam T type of builder
 * @param builder builder
 * @return pyobject of Awkward Array
 */

template<typename T>
py::object snapshot_builder(const T &builder) {
    // How much memory to allocate?
    std::map <std::string, size_t> names_nbytes = {};
    builder.buffer_nbytes(names_nbytes);

    // Allocate memory
    std::map<std::string, void *> buffers = {};
    for (auto it: names_nbytes) {
        uint8_t *ptr = new uint8_t[it.second];
        buffers[it.first] = (void *) ptr;
    }

    // Write non-contiguous contents to memory
    builder.to_buffers(buffers);
    auto from_buffers = py::module::import("awkward").attr("from_buffers");

    // Build Python dictionary containing arrays
    py::dict container;
    for (auto it: buffers) {
        py::capsule free_when_done(it.second, [](void *data) {
            uint8_t *dataPtr = reinterpret_cast<uint8_t *>(data);
            delete[] dataPtr;
        });

        uint8_t *data = reinterpret_cast<uint8_t *>(it.second);
        container[py::str(it.first)] = py::array_t<uint8_t>(
                {names_nbytes[it.first]},
                {sizeof(uint8_t)},
                data,
                free_when_done
        );
    }
    return from_buffers(builder.form(), builder.length(), container);
}

/**
 * Pass the file path of binary data to create an array and return its snapshot
 * @param file_path of binary data file
 * @return pyobject of Awkward Array
 */
py::object load(std::string file_path) {
    std::ifstream infile(file_path, std::ifstream::binary);
    kaitai::kstream ks(&infile);
    fake_t obj = fake_t(&ks);
    return snapshot_builder(obj.fake_builder);
}

PYBIND11_MODULE(awkward_fake, m) {
    m.def("load", &load);
}
```

Print the returned `ak.Array`

In [64]:
# Module name taken from PYBIND11_MODULE(awkward_fake, m) in fake_main.cpp;
# adjust it if your build installs the extension under a different name.
import awkward_fake
awkward_arrays = awkward_fake.load("data/fake.raw")
print(awkward_arrays)

### Future Plans
- Improving the User Interface by making it simpler and more customizable
- Adding support for `IndexedOption` Arrays

For any questions or suggestions, feel free to contact:
- Manasvi Goyal `mg.manasvi@gmail.com`
- Amy Roberts `amy.roberts@ucdenver.edu`

### Acknowledgements

<div>
    <img src="img/NSF%20logo.png" width="60" style="float: left;">
    Support for this work was provided by NSF cooperative agreements <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=1836650">OAC-1836650</a> and <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2323298">PHY-2323298</a> (IRIS-HEP) and grants <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2104003">OAC-2104003</a> (PONDD) and <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2103945">OAC-2103945</a> (Awkward Array).
</div>