# Theme

**Table of contents:**
 - [Problem formulation](#Problem-formulation)
 - [Passing arguments to our programs](#Passing-arguments-to-our-programs)
 - [Creating sample paths](#Creating-sample-paths)
 - [Writing text files](#Writing-text-files)
 - [Writing binary files](#Writing-binary-files)
 - [Writing HDF5 files](#Writing-HDF5-files)
 - [Visualization](#Visualization)
 - [Discussion](#Discussion)

## Problem formulation

Think of a device that over a certain epoch produces time series of measurements. The device could be a physical sensor or a numerical device such as a simulation, or it could be the price of a financial instrument such as a stock. Let's assume we want to capture such time series over many epochs and that each epoch has a fixed number of time steps. A visualization of five such series is shown in the figure (**Source:** __[Wikimedia](https://commons.wikimedia.org/wiki/File:Ornstein-Uhlenbeck-5traces.svg)__) below.

<img src="https://upload.wikimedia.org/wikipedia/commons/6/60/Ornstein-Uhlenbeck-5traces.svg" title="Five sampled traces of an Ornstein-Uhlenbeck process." />

Assuming that an epoch of length 10 was sampled over 1,000 equidistant time steps, we could conveniently store the underlying data (values of `X`) in a `5 x 1000` two-dimensional array of floating-point values, with the series number and the time step as the row and column index, respectively.

For our own benefit and that of the researchers with whom we might want to share our observations, we should also record relevant information about the context in which the data was collected. This might include, for example, the unit of the quantity `X`, the length of an individual time step, the date, time, and location of the recording (if relevant), and any additional information about the calibration of the generating mechanism. To be specific, in this tutorial, we will use numerical samples of a well-known stochastic process, the so-called __[Ornstein–Uhlenbeck process](https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process)__, as our data source. The sample paths shown in the figure were generated from a numerical experiment with such a process.

<div class="alert alert-block alert-success">
<b>Up to you:</b> To follow this tutorial, it is <i>not</i> necessary to understand what a stochastic process is or how to simulate one. It is OK to think about it as a sophisticated pseudo-random number generator, which merely serves as our source of series data.
</div>

Our problem formulation is thus: Store the floating-point values of `path_count` series with `step_count` values each in a two-dimensional array of `path_count` rows and `step_count` columns. In addition, store the following floating-point calibrations, which are the same for all series:
- the time step `dt`
- the long-term process mean `mu`
- the reversion rate to the mean `theta`
- the volatility of the process `sigma`.
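
As a minimal sketch of this layout, the series can live in one contiguous row-major buffer next to the four calibration values. The `OuDataset` struct and its `at()` helper are hypothetical names used only for illustration; the programs in this tutorial pass these values around as separate variables.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for illustration; not part of the tutorial sources.
struct OuDataset
{
    std::size_t path_count = 0;  // number of series (rows)
    std::size_t step_count = 0;  // values per series (columns)
    double dt = 0.0;             // time step
    double mu = 0.0;             // long-term process mean
    double theta = 0.0;          // reversion rate to the mean
    double sigma = 0.0;          // volatility of the process
    std::vector<double> values;  // row-major: path_count * step_count entries

    OuDataset(std::size_t paths, std::size_t steps)
        : path_count(paths), step_count(steps), values(paths * steps, 0.0) {}

    // Element (path, step) lives at offset path * step_count + step.
    double& at(std::size_t path, std::size_t step)
    {
        assert(path < path_count && step < step_count);
        return values[path * step_count + step];
    }
};
```

Keeping all rows in one buffer means the whole array can later be handed to a writer as a single pointer plus a length.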

Before thinking about how to store this data, we will quickly go over how to pass the parameters to our simulation. After all, we want to control how much or how little data to generate! For completeness, we take a quick look at how the series data is generated. (It won't come as a shock that pseudo-random numbers play a big part!) With that out of the way, we look at storing the data in plain text files, as unformatted byte streams, and in HDF5 files. We conclude with a discussion of the advantages and challenges of the three approaches.

## Passing arguments to our programs

We need two helper functions to pass and parse the relevant arguments to our programs.

We use Pranav Srinivas Kumar's [`argparse` library](https://github.com/p-ranav/argparse), a widely used header-only argument parser for C++17.

In [None]:
%%writefile src/parse_arguments.hpp
#ifndef PARSE_ARGUMENTS_HPP
#define PARSE_ARGUMENTS_HPP

#include "argparse.hpp"

// Sets the options for which we are looking
extern void set_options(argparse::ArgumentParser& program);

// Tests the options and retrieves the arguments
extern int get_arguments
(
    const argparse::ArgumentParser& program,
    size_t&                         path_count,
    size_t&                         step_count,
    double&                         dt,
    double&                         theta,
    double&                         mu,
    double&                         sigma
);

#endif

The implementation of our functions is rather unsurprising and speaks to the quality (usability) of `argparse`.

In `set_options()`, we inform `argparse` about the program options we want to support. For each argument, we tell `argparse` the short and long forms of the command line argument, a synopsis from which a help message will be created, and the default type and value of the argument. This information guides how `argparse` processes the `main` function's arguments, `argc` and `argv`.

In `get_arguments()`, we retrieve the value of each option (the one supplied on the command line or, failing that, the default) and validate it.
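
The positivity checks in the implementation below compare against `DBL_MIN`, the smallest positive normalized `double`. The following minimal sketch shows what that comparison accepts and rejects. The helper names `passes_dbl_min_check` and `is_finite_positive` are hypothetical. A NaN would pass the first test because every ordered comparison with NaN is false; whether a NaN can actually reach the check depends on how the input is parsed, which this sketch does not assume.

```cpp
#include <cfloat>
#include <cmath>

// Hypothetical helper mirroring the `value < DBL_MIN` rejection test.
inline bool passes_dbl_min_check(double value)
{
    return !(value < DBL_MIN);
}

// A stricter variant (an assumption, not the tutorial's code) that also
// rejects NaN and infinities.
inline bool is_finite_positive(double value)
{
    return std::isfinite(value) && value >= DBL_MIN;
}
```

Zero, negative values, and subnormal values are rejected by both helpers; only the second one also rejects NaN and infinities.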

In [None]:
%%writefile src/parse_arguments.cpp

#include "parse_arguments.hpp"
#include <cfloat>
#include <iostream>

using namespace std;

void set_options(argparse::ArgumentParser& program)
{
    program.add_argument("-p", "--paths")  // command line arguments
    .help("chooses the number of paths")   // synopsis
    .default_value(size_t{100})            // default value
    .scan<'u', size_t>();                  // expected type

    program.add_argument("-s", "--steps")
    .help("chooses the number of steps")
    .default_value(size_t{1000})
    .scan<'u', size_t>();

    program.add_argument("-d", "--dt")
    .help("chooses the time step")
    .default_value(double{0.01})
    .scan<'f', double>();

    program.add_argument("-t", "--theta")
    .help("chooses the rate of reversion to the mean")
    .default_value(double{1.0})
    .scan<'f', double>();

    program.add_argument("-m", "--mu")
    .help("chooses the long-term mean of the process")
    .default_value(double{0.0})
    .scan<'f', double>();

    program.add_argument("-g", "--sigma")
    .help("chooses the volatility of the process")
    .default_value(double{0.1})
    .scan<'f', double>();
}

int get_arguments
(
    const argparse::ArgumentParser& program,
    size_t&                         path_count,
    size_t&                         step_count,
    double&                         dt,
    double&                         theta,
    double&                         mu,
    double&                         sigma
)
{
    path_count = program.get<size_t>("--paths");
    if (path_count == 0) {
        cerr << "Number of paths must be greater than zero" << endl;
        return -1;
    }
    step_count = program.get<size_t>("--steps");
    if (step_count == 0) {
        cerr << "Number of steps must be greater than zero" << endl;
        return -1;
    }
    dt = program.get<double>("--dt");
    if (dt < DBL_MIN) {
        cerr << "Time step must be greater than zero" << endl;
        return -1;
    }
    theta = program.get<double>("--theta");
    if (theta < DBL_MIN) {
        cerr << "Reversion rate must be greater than zero" << endl;
        return -1;
    }
    mu = program.get<double>("--mu");
    sigma = program.get<double>("--sigma");
    if (sigma < DBL_MIN) {
        cerr << "Volatility must be greater than zero" << endl;
        return -1;
    }

    return 0;
}

Let's make sure the compiler is happy!

In [None]:
%%bash
g++ -std=c++17 -Wall -pedantic -I./include -c ./src/parse_arguments.cpp -o ./build/parse_arguments.o

## Creating sample paths

We create `path_count` sample paths of `step_count` time steps each from the given parameters and store them in a single C++ `vector`.
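
The sampler below advances each path with an Euler-Maruyama-style update, $X_{k+1} = X_k + \theta(\mu - X_k)\,\Delta t + \sigma\,\Delta W_k$ with $\Delta W_k \sim \mathcal{N}(0, \Delta t)$. As a minimal single-path sketch of that update, the hypothetical function `ou_path` takes an explicit seed. The seed is an assumption made here so that repeated runs against the same standard library reproduce the same path; the tutorial's sampler seeds from `random_device` instead.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical single-path variant of the update used by the sampler.
std::vector<double> ou_path(std::size_t step_count, double dt, double theta,
                            double mu, double sigma, unsigned seed)
{
    std::vector<double> path(step_count, 0.0);  // path starts at x = 0
    std::mt19937 generator(seed);               // fixed seed instead of random_device
    // Increments have standard deviation sqrt(dt), i.e. variance dt
    std::normal_distribution<double> dist(0.0, std::sqrt(dt));

    for (std::size_t j = 1; j < step_count; ++j)
    {
        const double dW = dist(generator);
        path[j] = path[j - 1] + theta * (mu - path[j - 1]) * dt + sigma * dW;
    }
    return path;
}
```

With `sigma = 0` the noise term vanishes and the path relaxes deterministically toward `mu`, which is a convenient sanity check for the drift term.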

In [None]:
%%writefile src/ou_sampler.hpp
#ifndef OU_SAMPLER_HPP
#define OU_SAMPLER_HPP

#include <vector>

// Creates `path_count` sample paths of length `step_count` with parameters
// `dt`, `theta`, `mu`, and `sigma`
extern void ou_sampler
(
    std::vector<double>& ou_process,
    const size_t&        path_count,
    const size_t&        step_count,
    const double&        dt,
    const double&        theta,
    const double&        mu,
    const double&        sigma
);

#endif

In [None]:
%%writefile src/ou_sampler.cpp

#include "ou_sampler.hpp"

#include <cmath>
#include <random>
#include <vector>

using namespace std;

void ou_sampler
(
    vector<double>& ou_process,
    const size_t&   path_count,
    const size_t&   step_count,
    const double&   dt,
    const double&   theta,
    const double&   mu,
    const double&   sigma
)
{
    // Store sample paths in one contiguous buffer
    ou_process.clear();
    ou_process.resize(path_count * step_count);

    random_device rd;
    mt19937 generator(rd());
    normal_distribution<double> dist(0.0, sqrt(dt));

    for (size_t i = 0; i < path_count; ++i)
    {
        ou_process[i * step_count] = 0; // sample paths start at x = 0
        for (size_t j = 1; j < step_count; ++j)
        {
            auto dW = dist(generator);
            auto pos = i * step_count + j;
            ou_process[pos] = ou_process[pos - 1] + theta * (mu - ou_process[pos - 1]) * dt + sigma * dW;
        }
    }
}

In [None]:
%%bash
g++ -std=c++17 -Wall -pedantic -c ./src/ou_sampler.cpp -o ./build/ou_sampler.o

## Writing text files

The easiest way to store the sample paths is a text file. We include a header that documents the parameters that were used in the generation. This is adequate as long as the number of sample paths and time steps is relatively small. For large numbers, this is clearly cumbersome and unmanageable.
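
One detail worth checking with text output is precision: by default, stream insertion writes only six significant digits, so values read back from the file generally differ from the ones in memory. A minimal sketch, assuming full round-trip fidelity is wanted, uses `std::numeric_limits<double>::max_digits10`. The helper `to_text` and the constant `round_trip_digits` are hypothetical names for illustration.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: formats `value` with `digits` significant digits.
std::string to_text(double value, int digits)
{
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return out.str();
}

// Digits needed so that text -> double reproduces the original value.
constexpr int round_trip_digits = std::numeric_limits<double>::max_digits10;
```

In a program like the one below, this would amount to applying `std::setprecision(round_trip_digits)` to the output stream once before the data loop, at the cost of a noticeably larger file.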

In [None]:
%%writefile src/ou_text.cpp
#include "parse_arguments.hpp"
#include "ou_sampler.hpp"

#include <fstream>
#include <iostream>
#include <vector>

using namespace std;

int main(int argc, char *argv[])
{
    size_t path_count, step_count;
    double dt, theta, mu, sigma;

    argparse::ArgumentParser program("ou_text");
    set_options(program);
    program.parse_args(argc, argv);
    // Stop if any argument failed validation
    if (get_arguments(program, path_count, step_count, dt, theta, mu, sigma) != 0)
        return 1;

    cout << "Running with parameters:"
         << " paths=" << path_count << " steps=" << step_count
         << " dt=" << dt << " theta=" << theta << " mu=" << mu << " sigma=" << sigma << endl;

    vector<double> ou_process;
    ou_sampler(ou_process, path_count, step_count, dt, theta, mu, sigma);
    
    // Write the sample paths to a text file
    ofstream file("ou_process.txt");
    
    file << "# paths steps dt theta mu sigma" << endl;
    file << path_count << " " << step_count << " " << dt << " " << theta << " " << mu << " " << sigma << endl;
    file << "# data" << endl;
    
    for (size_t i = 0; i < path_count; ++i)
    {
        for (size_t j = 0; j < step_count; ++j)
        {
            auto pos = i * step_count + j;
            file << ou_process[pos] << " ";
        }
        file << endl;
    }

    file.close();

    return 0;
}

In [None]:
%%bash
g++ -std=c++17 -Wall -pedantic -I./include ./src/ou_text.cpp ./build/parse_arguments.o ./build/ou_sampler.o -o ./build/ou_text
./build/ou_text
ls -iks ou_process.txt

## Writing binary files

To reduce the file size, instead of using text, we can store the sample paths as an unformatted binary stream. We would need to maintain separate documentation to tell consumers of the data how to parse and interpret this byte stream, which is a minor inconvenience. A major headache is that unformatted binary streams are platform (processor) dependent, which opens the door for misinterpretation when the data is copied to an incompatible platform and not adjusted for this change.

In [None]:
%%writefile src/ou_binary.cpp
#include "parse_arguments.hpp"
#include "ou_sampler.hpp"

#include <fstream>
#include <iostream>
#include <vector>

using namespace std;

int main(int argc, char *argv[])
{
    size_t path_count, step_count;
    double dt, theta, mu, sigma;
    
    argparse::ArgumentParser program("ou_binary");
    set_options(program);
    program.parse_args(argc, argv);
    // Stop if any argument failed validation
    if (get_arguments(program, path_count, step_count, dt, theta, mu, sigma) != 0)
        return 1;

    cout << "Running with parameters:"
         << " paths=" << path_count << " steps=" << step_count
         << " dt=" << dt << " theta=" << theta << " mu=" << mu << " sigma=" << sigma << endl;

    vector<double> ou_process;
    ou_sampler(ou_process, path_count, step_count, dt, theta, mu, sigma);
    
    // Write the sample paths to an unformatted binary file

    ofstream file("ou_process.bin", ios::out | ios::binary);
    file.write((char *)&path_count, sizeof(path_count));
    file.write((char *)&step_count, sizeof(step_count));
    file.write((char *)&dt, sizeof(dt));
    file.write((char *)&theta, sizeof(theta));
    file.write((char *)&mu, sizeof(mu));
    file.write((char *)&sigma, sizeof(sigma));
    file.write((char *)ou_process.data(), sizeof(double) * ou_process.size());
    file.close();

    return 0;
}

In [None]:
%%bash
g++ -std=c++17 -Wall -pedantic -I./include ./src/ou_binary.cpp ./build/parse_arguments.o ./build/ou_sampler.o -o ./build/ou_binary
./build/ou_binary
ls -iks ou_process.bin

## Writing HDF5 files

In [None]:
%%writefile src/ou_hdf5.cpp
#include "parse_arguments.hpp"
#include "ou_sampler.hpp"

#include "hdf5.h"
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main(int argc, char *argv[])
{
    size_t path_count, step_count;
    double dt, theta, mu, sigma;

    argparse::ArgumentParser program("ou_hdf5");
    set_options(program);
    program.parse_args(argc, argv);
    // Stop if any argument failed validation
    if (get_arguments(program, path_count, step_count, dt, theta, mu, sigma) != 0)
        return 1;

    cout << "Running with parameters:"
         << " paths=" << path_count << " steps=" << step_count
         << " dt=" << dt << " theta=" << theta << " mu=" << mu << " sigma=" << sigma << endl;

    vector<double> ou_process;
    ou_sampler(ou_process, path_count, step_count, dt, theta, mu, sigma);
    
    //
    // Write the sample paths to an HDF5 file using the HDF5 C-API!
    //

    auto file = H5Fcreate("ou_process.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    { // create & write the dataset
        hsize_t dimsf[] = {(hsize_t)path_count, (hsize_t)step_count};
        auto space = H5Screate_simple(2, dimsf, NULL);
        auto dataset = H5Dcreate(file, "/dataset", H5T_NATIVE_DOUBLE, space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dataset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, ou_process.data());
        H5Dclose(dataset);
        H5Sclose(space);
    }

    { // make the file self-describing by adding a few attributes to `dataset`
        auto scalar = H5Screate(H5S_SCALAR);
        auto acpl = H5Pcreate(H5P_ATTRIBUTE_CREATE);
        H5Pset_char_encoding(acpl, H5T_CSET_UTF8);
        
        auto set_attribute = [&](const string& name, const double& value) {
            auto attr = H5Acreate_by_name(file, "dataset", name.c_str(), H5T_NATIVE_DOUBLE, scalar, acpl, H5P_DEFAULT, H5P_DEFAULT);
            H5Awrite(attr, H5T_NATIVE_DOUBLE, &value);
            H5Aclose(attr);
        };
        set_attribute("dt", dt);
        set_attribute("θ", theta);
        set_attribute("μ", mu);
        set_attribute("σ", sigma);

        H5Pclose(acpl);
        H5Sclose(scalar);
    }

    H5Fclose(file);

    return 0;
}

In [None]:
%%bash
g++ -std=c++17 -Wall -pedantic -I/usr/include/hdf5/serial -L/usr/lib/x86_64-linux-gnu -I./include  ./src/ou_hdf5.cpp ./build/parse_arguments.o ./build/ou_sampler.o -o ./build/ou_hdf5 -lhdf5_serial
./build/ou_hdf5
ls -iks ou_process.h5

## Visualization

The easiest way to visualize the content of the HDF5 file `ou_process.h5` is to select it in the VS Code Explorer. This will launch the __[H5Web VS Code plugin](https://marketplace.visualstudio.com/items?itemName=h5web.vscode-h5web)__, which is sufficient for quick inspection and simple plot types. A more powerful option is __[Matplotlib](https://matplotlib.org/)__, and a plot of the forty-third sample path (row index 42) can be obtained as shown below.

In [None]:
%matplotlib inline
import h5py
import numpy as np

f = h5py.File("ou_process.h5", "r")  # open read-only
dset = f["dataset"]

In [None]:
arr = dset[42,:]
print(f"min: {arr.min():.2f}, max: {arr.max():.2f}, mean: {arr.mean():.2f}")

In [None]:
import matplotlib.pyplot as plt

plt.style.use('_mpl-gallery')
fig, ax = plt.subplots()
ax.plot(np.arange(0,len(arr)), arr, linewidth=2.0)
plt.show()

In [None]:
f.close()

## Discussion

In this part of the tutorial, we demonstrated how to store a set of sample paths plus metadata in plain text, as an unformatted binary stream, and in an HDF5 file. Advantages and challenges are relative to circumstances, and pointless to belabor in the abstract. If circumstances change, it might be time for a change of approach. Here are a few considerations:

The simplicity of plain text is hard to beat for small data sets. No special tools or libraries are needed to handle text. (There are, however, __[challenges](https://youtu.be/_mZBa3sqTrI?si=sKQKvFgserRq2uHf)__ with "plain text" the moment you go beyond US-ASCII.) Performance in space (file size) and time (marshalling overhead) deteriorates with data size. Common mitigations include compression and parallel processing, but the original simplicity is gone.

Unformatted binary streams are fine as long as platforms and layouts are stable. The moment multiple platforms with different byte orders enter the mix, things get tricky. Change in the metadata or data is the biggest enemy. Not only is (separate) documentation in need of updates, but costly code changes to support legacy and new data might be necessary.
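
One common mitigation is to fix the byte order in the file regardless of the host. The sketch below assumes that both sides use 64-bit IEEE 754 doubles whose bit pattern follows the integer byte order, as on common platforms. The helpers `encode_le` and `decode_le` are hypothetical names; they serialize a `double` as eight little-endian bytes using shifts, so the result does not depend on the host's byte order.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes 64-bit doubles");

// Hypothetical helper: double -> 8 bytes, least significant byte first.
std::array<unsigned char, 8> encode_le(double value)
{
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the bit pattern safely
    std::array<unsigned char, 8> bytes{};
    for (int i = 0; i < 8; ++i)
        bytes[i] = static_cast<unsigned char>((bits >> (8 * i)) & 0xFFu);
    return bytes;
}

// Hypothetical inverse of encode_le.
double decode_le(const std::array<unsigned char, 8>& bytes)
{
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

This removes the byte-order ambiguity but not the need to document and version the layout itself.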

The HDF5 framework takes care of all of the issues mentioned so far at the expense of a dependency on the HDF5 library and file format, or on HSDS (the Highly Scalable Data Service). Common considerations before deciding to make your product dependent on the HDF5 framework include:

1. **Licensing and Costs:** FOSS, BSD or Apache license, cost of ownership
2. **Compatibility and Integration:** Great third-party support
3. **Security and Privacy:** Known vulnerabilities are tracked as CVEs and addressed in maintenance releases
4. **Support and Documentation:** Get support from The HDF Group; extensive documentation
5. **Reliability and Performance:** Use in mission-critical applications; de facto standard for science and engineering data; Google Scholar
6. **Updates and Longevity:** 25-year track record; regular maintenance releases; forward/backward compatibility guarantee
7. **Vendor Lock-in:** FOSS and open specifications
8. **Community and Ecosystem:** Outstanding, HUG events, Forum, Google Scholar, third-party support
9. **Legal and Compliance Issues:** Permissive licensing; you know your legal envelope
10. **Impact on Your Product Roadmap:** Where is the industry headed? Need room to grow? Move to the cloud? AI ready?