# Awkward Target for Kaitai Struct
Data formats for scientific data often differ across experiments due to hardware design and availability constraints. To work with these unique formats, researchers have to develop, document, and maintain dedicated analysis software, which is often tightly coupled to a particular data format.

This proliferation of custom data formats has been a prominent challenge for the Nuclear and High Energy Physics (NHEP) community. Within the Large Hadron Collider (LHC) experiments, this problem has largely been mitigated with the widespread adoption of `ROOT`.

However, not all experiments in the NHEP community use `ROOT` for their data formats. Experiments such as the Cryogenic Dark Matter Search (CDMS) continue to use custom data formats to meet specific research needs. Simplifying the path from a unique data format to analysis code, without requiring changes to existing data formats, therefore still holds immense value for the broader NHEP community.

<div>
    <img src="img/Kaitai%20logo.png" width="40" style="float: left; margin-right: 10px;"/>
    <h2>Kaitai Struct YAML</h2>
</div> <br>
Kaitai Struct is a declarative language that uses a YAML-like description of a binary data structure to generate code, in any supported language, for reading a raw data file.

The main idea is that a particular format is described in the `Kaitai Struct YAML (KSY)` language and can then be compiled with the `Kaitai Struct Compiler (ksc)` into source files in one of the supported programming languages. These source files contain generated code for a parser that can read the described data structure from a file or stream, and they can be included as a library in any external code.

<p style="text-align: center;">
    <img src="img/Kaitai%20workflow.png" width="800">
</p>

### Kaitai Example

Consider a simple binary data structure that describes an animal's species, age, and weight. Suppose you want to analyze this animal data in Python. The data structure can be described in the `KSY` format, and the resulting file can be passed to the Kaitai Struct Compiler, which generates a Python module that can be imported into any other Python code.


<p style="float: right;">
    <img src="img/animal.png" width="600" style="margin-bottom: 10px;"> <br>
    <img src="img/animal%20data%20format.png" width="600" style="margin-top: 10px; margin-bottom: 10px;"> <br>
    <img src="img/animal%20python.png" width="550" style="margin-top: 10px;">
</p>

```yaml
meta:
  id: animal
  endian: le
  license: CC0-1.0
  ks-version: 0.8

seq:

  - id: entry
    type: animal_entry
    repeat: eos

types:

  animal_entry:
    seq:
      - id: str_len
        type: u1

      - id: species
        type: str
        size: str_len
        encoding: UTF-8

      - id: age
        type: u1

      - id: weight
        type: u2
```
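
To make the byte layout concrete, here is a minimal hand-written sketch of what the `animal` schema describes, using only Python's standard `struct` module. The helper names `pack_entry` and `parse_entries` and the sample values are illustrative assumptions, not part of Kaitai Struct; in practice the compiler-generated parser does this work.

```python
import struct


def pack_entry(species: str, age: int, weight: int) -> bytes:
    """Encode one animal_entry: u1 length, UTF-8 species, u1 age, u2 (little-endian) weight."""
    name = species.encode("utf-8")
    return struct.pack("<B", len(name)) + name + struct.pack("<BH", age, weight)


def parse_entries(data: bytes) -> list:
    """Hand-written equivalent of `repeat: eos` over animal_entry (illustrative only)."""
    entries, pos = [], 0
    while pos < len(data):
        (str_len,) = struct.unpack_from("<B", data, pos)
        pos += 1
        species = data[pos:pos + str_len].decode("utf-8")
        pos += str_len
        age, weight = struct.unpack_from("<BH", data, pos)
        pos += 3  # 1 byte for age + 2 bytes for weight
        entries.append({"species": species, "age": age, "weight": weight})
    return entries


# Hypothetical sample data
raw = b"".join(
    pack_entry(*animal)
    for animal in [("cat", 5, 12), ("dog", 3, 43), ("turtle", 10, 5)]
)
animals = parse_entries(raw)
for a in animals:
    print(f"Species: {a['species']}\nAge: {a['age']}\nWeight: {a['weight']}")
```

Writing such a parser by hand for every format is exactly the maintenance burden that a declarative description is meant to remove.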

#### Generated Python script `animal.py`

```python
import kaitaistruct
from kaitaistruct import KaitaiStruct, KaitaiStream, BytesIO


if getattr(kaitaistruct, 'API_VERSION', (0, 9)) < (0, 9):
    raise Exception("Incompatible Kaitai Struct Python API: 0.9 or later is required, but you have %s" % (kaitaistruct.__version__))

class Animal(KaitaiStruct):
    def __init__(self, _io, _parent=None, _root=None):
        self._io = _io
        self._parent = _parent
        self._root = _root if _root else self
        self._read()

    def _read(self):
        self.entry = []
        i = 0
        while not self._io.is_eof():
            self.entry.append(Animal.AnimalEntry(self._io, self, self._root))
            i += 1


    class AnimalEntry(KaitaiStruct):
        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.str_len = self._io.read_u1()
            self.species = (self._io.read_bytes(self.str_len)).decode(u"UTF-8")
            self.age = self._io.read_u1()
            self.weight = self._io.read_u2le()
```

#### Output
```text
Species: cat
Age: 5
Weight: 12
Species: dog
Age: 3
Weight: 43
Species: turtle
Age: 10
Weight: 5
```

<div>
    <img src="img/Awkward%20logo.png" width="200" style="align=left;"/>
</div>

When dealing with large files and complicated scientific data structures that consist of nested arrays, even the most efficient Python code can be quite time- and resource-intensive.

<p style="text-align: center;">
    <strong>MIDAS Event Format</strong> <br> <br>
    <img src="img/MIDAS.jpg" width="800">
</p>


Awkward Array is a Scikit-HEP library that uses NumPy-like idioms to create arrays of nested records, variable-length lists, mixed types, and missing data. It offers a dynamic and memory-efficient way to represent complex data structures as NumPy-like arrays.

```python
import numpy as np
import awkward as ak

array = ak.Array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
])

array
```
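
As a rough illustration of why this representation is memory efficient: a variable-length list array can be stored as one flat `content` buffer plus an `offsets` buffer marking where each list starts and ends. The pure-Python sketch below mimics that idea for the `x` values of the array above. It is a simplified model, not Awkward Array's actual implementation, and the names `to_offsets_and_content` and `get_list` are hypothetical.

```python
def to_offsets_and_content(nested):
    """Flatten a list of lists into (offsets, content), columnar style."""
    offsets, content = [0], []
    for sublist in nested:
        content.extend(sublist)
        offsets.append(len(content))  # end of this list == start of the next
    return offsets, content


def get_list(offsets, content, i):
    """Recover the i-th variable-length list from the flat buffers."""
    return content[offsets[i]:offsets[i + 1]]


# The "x" values from the nested array above, grouped by outer list
x_nested = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
offsets, content = to_offsets_and_content(x_nested)

print(offsets)  # [0, 3, 3, 5]
print(content)  # [1.1, 2.2, 3.3, 4.4, 5.5]
print(get_list(offsets, content, 2))  # [4.4, 5.5]
```

Because the data lives in a few flat buffers instead of many small Python objects, operations over it can be vectorized.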

Therefore, we propose adding Awkward Array as one of the target languages for Kaitai Struct using the header-only `LayoutBuilder`.

Refer to Ianna Osborne's talk to learn more about [LayoutBuilder](https://github.com/ianna/PyHEP-Users-Workshop-2023/blob/main/slides/Awkward_LayoutBuilder.ipynb).

## Awkward Target for Kaitai: How It Works

Users describe their data format in a `.ksy` file just once. This KSY file is then converted into a compiled Python module that reads the raw data and converts it into Awkward Arrays.

<p style="text-align: center;">
    <img src="img/Awkward%20Kaitai%20workflow.png" width="1000">
</p>

### Example

Consider this simple `fake.ksy` file:

```yaml
meta:
  id: fake
  file-extension: raw
  endian: le

seq:
  - id: points
    type: point
    repeat: eos

types:
  point:
    seq:
      - id: x
        type: u4
      - id: y
        type: u4
      - id: z
        type: u4
```
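
To experiment with this schema, one could generate a small matching input file. The sketch below writes three little-endian `u4` values per point using Python's standard `struct` module. The sample points and the temporary output path are assumptions for illustration; they are not the contents of the repository's own `data/fake.raw`.

```python
import struct
import tempfile
from pathlib import Path

# Hypothetical sample points (x, y, z); each value is an unsigned 32-bit integer
points = [(1, 2, 3), (4, 5, 6), (11, 22, 33)]

# "<III" = little-endian, three u4 fields, matching `endian: le` and `type: u4`
raw = b"".join(struct.pack("<III", *p) for p in points)

# Write to a temporary location so the sketch does not touch real data files
out_path = Path(tempfile.mkdtemp()) / "fake.raw"
out_path.write_bytes(raw)

# Read the file back to confirm the layout round-trips
decoded = list(struct.iter_unpack("<III", out_path.read_bytes()))
print(decoded)
```

Each point occupies 12 bytes, so the file size divided by 12 gives the number of `point` entries that `repeat: eos` would produce.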

### Steps

#### 1. Clone the *kaitai_awkward_runtime* and change the directory

```bash
git clone --recursive https://github.com/ManasviGoyal/kaitai_awkward_runtime.git
cd kaitai_awkward_runtime
```

#### 2. Install the Python module via *pip*

```bash
pip install . --config-settings 'cmake.define.KSY=schemas/fake.ksy'
```

<p style="text-align: center;">
    <img src="img/fake%20awkward.png" width="500">
</p>

#### Scala code snippet: generating the C++ string for the builder declaration
``` scala
case class RecordBuilder(
  fields: ListBuffer[String],
  contents: ListBuffer[LayoutBuilder]
) extends LayoutBuilder {

  override def printStructure(indent: Int, builderName: String): String = {
    UserDefinedMap(builderName)
    val fieldStrings = fields.zip(contents).zipWithIndex.map { case ((field, content), i) =>
      s"${"\t" * (indent + 2)}RecordField<Field_${builderName}::$field, ${content.printStructure(indent + 1, s"${field}")}>"
    }
    s"${if (indent == 0) "\nusing " + builderName.capitalize + "BuilderType =\n\t" else ""}" +
    s"RecordBuilder<\n${fieldStrings.mkString(",\n")}\n${"\t" * (indent)}\t>"
  }

  def UserDefinedMap(builderName: String): Unit = {
    val mapStrings = fields.zipWithIndex.map { case (field, i) =>
      s"""{Field_${builderName}::$field, "$field"}"""
    }
    outHdrAwkward.puts(s"enum Field_${builderName} : std::size_t {${fields.mkString(", ")}};")
    outSrcAwkward.puts
    outSrcAwkward.puts(s"UserDefinedMap ${builderName}_fields_map({\n\t${mapStrings.mkString(",\n\t")}});")
  }
}
```
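
The recursion in `printStructure` may be easier to follow in a stripped-down form. The Python sketch below mirrors the string-building logic for the `fake` schema. The `Numpy` and `ListOffset` classes are assumptions (only `RecordBuilder` is shown above), the `UserDefinedMap` side effect is left out, and the leading newline before `using` is dropped for brevity.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Numpy:
    ctype: str  # C++ primitive type name, e.g. "uint32_t"

    def render(self, indent: int, name: str) -> str:
        return f"NumpyBuilder<{self.ctype}>"


@dataclass
class ListOffset:
    content: Any  # assumed wrapper for repeated fields

    def render(self, indent: int, name: str) -> str:
        return f"ListOffsetBuilder<int64_t, {self.content.render(indent, name)}>"


@dataclass
class Record:
    fields: List[str]
    contents: List[Any]

    def render(self, indent: int, name: str) -> str:
        tabs = "\t" * (indent + 2)
        # Each field recurses one level deeper, using its own name as the enum suffix
        rows = [
            f"{tabs}RecordField<Field_{name}::{field}, {content.render(indent + 1, field)}>"
            for field, content in zip(self.fields, self.contents)
        ]
        # Only the outermost record emits the `using ... =` alias header
        head = f"using {name[:1].upper() + name[1:]}BuilderType =\n\t" if indent == 0 else ""
        return head + "RecordBuilder<\n" + ",\n".join(rows) + "\n" + "\t" * indent + "\t>"


point = Record(["x", "y", "z"], [Numpy("uint32_t")] * 3)
fake = Record(["points"], [ListOffset(point)])
text = fake.render(0, "fake")
print(text)
```

Under these assumptions, the printed text has the same shape as the `FakeBuilderType` alias in the generated header, minus the trailing semicolon.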

#### Generated C++ libraries with awkward code

<p style="text-align: center;">
    <strong><h4>fake.h</h4></strong>
</p>

```C++
#ifndef FAKE_H_
#define FAKE_H_

// This is a generated file! Please edit source .ksy file and use kaitai-struct-compiler to rebuild

#include "kaitai/kaitaistruct.h"
#include <stdint.h>
#include "awkward/LayoutBuilder.h"
#include <vector>

using UserDefinedMap = std::map<std::size_t, std::string>;
template<class... BUILDERS>
using RecordBuilder = awkward::LayoutBuilder::Record<UserDefinedMap, BUILDERS...>;
template<std::size_t field_name, class BUILDER>
using RecordField = awkward::LayoutBuilder::Field<field_name, BUILDER>;
template<class PRIMITIVE, class BUILDER>
using ListOffsetBuilder = awkward::LayoutBuilder::ListOffset<PRIMITIVE, BUILDER>;
template<class PRIMITIVE>
using NumpyBuilder = awkward::LayoutBuilder::Numpy<PRIMITIVE>;

enum Field_fake : std::size_t {points};
enum Field_points : std::size_t {x, y, z};

using FakeBuilderType =
	RecordBuilder<
		RecordField<Field_fake::points, ListOffsetBuilder<int64_t, RecordBuilder<
			RecordField<Field_points::x, NumpyBuilder<uint32_t>>,
			RecordField<Field_points::y, NumpyBuilder<uint32_t>>,
			RecordField<Field_points::z, NumpyBuilder<uint32_t>>
		>>>
	>;
using PointsBuilderType = decltype(std::declval<FakeBuilderType>().content<Field_fake::points>().content());


#if KAITAI_STRUCT_VERSION < 9000L
#error "Incompatible Kaitai Struct C++/STL API: version 0.9 or later is required"
#endif

class fake_t : public kaitai::kstruct {

public:
    class point_t;

    fake_t(kaitai::kstream* p__io, kaitai::kstruct* p__parent = 0, fake_t* p__root = 0);

private:
    void _read();
    void _clean_up();

public:
    ~fake_t();

    class point_t : public kaitai::kstruct {

    public:

        point_t(kaitai::kstream* p__io, fake_t* p__parent = 0, fake_t* p__root = 0);

    private:
        void _read();
        void _clean_up();

    public:
        ~point_t();

    private:
        uint32_t m_x;
        uint32_t m_y;
        uint32_t m_z;
        fake_t* m__root;
        fake_t* m__parent;

    public:
        uint32_t x() const { return m_x; }
        uint32_t y() const { return m_y; }
        uint32_t z() const { return m_z; }
        fake_t* _root() const { return m__root; }
        fake_t* _parent() const { return m__parent; }
        PointsBuilderType points_builder;
    };

private:
    std::vector<point_t*>* m_points;
    fake_t* m__root;
    kaitai::kstruct* m__parent;

public:
    std::vector<point_t*>* points() const { return m_points; }
    fake_t* _root() const { return m__root; }
    kaitai::kstruct* _parent() const { return m__parent; }
    FakeBuilderType fake_builder;
};

#endif  // FAKE_H_

```


<p style="text-align: center;">
    <strong><h4>fake.cpp</h4></strong>
</p>

```C++
// This is a generated file! Please edit source .ksy file and use kaitai-struct-compiler to rebuild

#include "fake.h"

UserDefinedMap fake_fields_map({
	{Field_fake::points, "points"}});

UserDefinedMap points_fields_map({
	{Field_points::x, "x"},
	{Field_points::y, "y"},
	{Field_points::z, "z"}});

fake_t::fake_t(kaitai::kstream* p__io, kaitai::kstruct* p__parent, fake_t* p__root) : kaitai::kstruct(p__io) {
    m__parent = p__parent;
    m__root = this;
    m_points = 0;

    fake_builder.set_fields(fake_fields_map);

    try {
        _read();
    } catch(...) {
        _clean_up();
        throw;
    }
}

void fake_t::_read() {
    m_points = new std::vector<point_t*>();
    {
        int i = 0;
        auto& sub_points_builder = fake_builder.content<Field_fake::points>();
        while (!m__io->is_eof()) {
            sub_points_builder.begin_list();
            m_points->push_back(new point_t(m__io, this, m__root));
            sub_points_builder.end_list();
            i++;
        }
    }
}

fake_t::~fake_t() {
    _clean_up();
}

void fake_t::_clean_up() {
    if (m_points) {
        for (std::vector<point_t*>::iterator it = m_points->begin(); it != m_points->end(); ++it) {
            delete *it;
        }
        delete m_points; m_points = 0;
    }
}

fake_t::point_t::point_t(kaitai::kstream* p__io, fake_t* p__parent, fake_t* p__root) : kaitai::kstruct(p__io),
	points_builder(p__parent->fake_builder.content<Field_fake::points>().content()) {
    m__parent = p__parent;
    m__root = p__root;

    points_builder.set_fields(points_fields_map);

    try {
        _read();
    } catch(...) {
        _clean_up();
        throw;
    }
}

void fake_t::point_t::_read() {
    m_x = m__io->read_u4le();
    auto& x_builder = points_builder.content<Field_points::x>();
    x_builder.append(m_x);
    m_y = m__io->read_u4le();
    auto& y_builder = points_builder.content<Field_points::y>();
    y_builder.append(m_y);
    m_z = m__io->read_u4le();
    auto& z_builder = points_builder.content<Field_points::z>();
    z_builder.append(m_z);
}

fake_t::point_t::~point_t() {
    _clean_up();
}

void fake_t::point_t::_clean_up() {
}
```
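
The effect of the builder calls in `fake.cpp` can be approximated in plain Python: each parsed point appends to three flat buffers, and each `begin_list()`/`end_list()` pair closes one list in an offsets buffer. This is a simplified model of the control flow, assuming a hypothetical two-point input and hypothetical helper names `build_columns` and `to_rows`. It does not use the real `LayoutBuilder` API.

```python
import struct


def build_columns(raw: bytes):
    """Mimic fake_t::_read(): one list per parsed point, fields stored column-wise."""
    offsets = [0]
    columns = {"x": [], "y": [], "z": []}
    for x, y, z in struct.iter_unpack("<III", raw):
        # begin_list() -> point_t::_read() appends x, y, z -> end_list()
        columns["x"].append(x)
        columns["y"].append(y)
        columns["z"].append(z)
        offsets.append(len(columns["x"]))
    return offsets, columns


def to_rows(offsets, columns):
    """Rebuild the nested record view from the flat buffers."""
    rows = []
    for start, stop in zip(offsets[:-1], offsets[1:]):
        pts = [{k: columns[k][j] for k in ("x", "y", "z")} for j in range(start, stop)]
        rows.append({"points": pts})
    return rows


# Hypothetical input: two points, each three little-endian u4 values
raw = struct.pack("<6I", 1, 2, 3, 11, 22, 33)
offsets, columns = build_columns(raw)
rows = to_rows(offsets, columns)
print(rows)
```

Since the generated loop wraps every point in its own list, each top-level record in this model holds a single-element `points` list.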

#### 3. Print the returned `ak.Array`

```python
import kaitai_awkward_runtime
awkward_arrays = kaitai_awkward_runtime.load("data/fake.raw")
print(awkward_arrays)

# Printed output:
# [{points: [{x: 1, y: 2, z: 3}]}, {...}, ..., {points: [{x: 11, y: 22, ...}]}]
```

### Future Plans
- Improving the User Interface by making it simpler and more customizable
- Adding support for `IndexedOption` Arrays

For any questions or suggestions, feel free to contact:
- Manasvi Goyal - `mg.manasvi@gmail.com`
- Amy Roberts - `amy.roberts@ucdenver.edu`

### Acknowledgements

<div>
    <img src="img/NSF%20logo.png" width="60" style="float: left;">
    Support for this work was provided by NSF cooperative agreements <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=1836650">OAC-1836650</a> and <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2323298">PHY-2323298</a> (IRIS-HEP) and grants <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2104003">OAC-2104003</a> (PONDD) and <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2103945">OAC-2103945</a> (Awkward Array).
</div>