# Benchmark lossless compression strategies

In this notebook we analyze the performance of different lossless compression strategies in terms of:

* compression ratio (n_bytes / n_stored_bytes)
* compression speed
* decompression speed
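
As a rough illustration of these three metrics, the sketch below compresses a synthetic `int16` signal with the standard-library `zlib` module and reports the compression ratio together with the elapsed compression and decompression times. The `zlib` codec is only a stand-in for the benchmarked compressors, and the helper name `measure`, the 30 kHz sampling rate, and the test signal are assumptions made for this example.

```python
import array
import math
import time
import zlib


def measure(raw: bytes, level: int = 5) -> dict:
    """Compress and decompress `raw`; return CR and elapsed times in seconds."""
    t0 = time.perf_counter()
    stored = zlib.compress(raw, level)
    t1 = time.perf_counter()
    restored = zlib.decompress(stored)
    t2 = time.perf_counter()
    assert restored == raw  # lossless: the round trip must be exact
    return {
        "CR": len(raw) / len(stored),  # n_bytes / n_stored_bytes
        "compression_time": t1 - t0,
        "decompression_time": t2 - t1,
    }


# Hypothetical 1 s single-channel recording at an assumed 30 kHz sampling rate
fs = 30_000
duration_s = 1.0
samples = array.array(
    "h", (int(200 * math.sin(2 * math.pi * 10 * i / fs)) for i in range(fs))
)
metrics = measure(samples.tobytes())

# Speed relative to real time, following the notebook's duration / time convention
xrt = duration_s / metrics["compression_time"]
print(metrics, xrt)
```

A CR above 1 means the stored data is smaller than the original, and an xRT above 1 means compression runs faster than real time.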

The lossless compression algorithms compared are:

- Zarr BLOSC compressors
    * lz4 
    * lz4hc 
    * zlib
    * zstd
- Audio compressors
    * FLAC
    * WavPack
    
The Zarr compressors are applied via the SpikeInterface `save(format="zarr")` function and run with different options:

* clevel (1 - min, 5, 9 - max)
* BLOSC shuffle filter (no, auto/shuffle, bit)
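
To make the shuffle option more concrete: a byte shuffle regroups the bytes of multi-byte samples so that all first bytes are stored together, then all second bytes, and so on (bit shuffle applies the same idea at the bit level). The sketch below is a pure-Python illustration of the byte-level idea, not Blosc's implementation; `byte_shuffle`, `byte_unshuffle`, the synthetic signal, and the use of `zlib` as the codec are assumptions for the example.

```python
import array
import math
import zlib


def byte_shuffle(raw: bytes, itemsize: int = 2) -> bytes:
    """Group the i-th byte of every item together (len(raw) % itemsize == 0)."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def byte_unshuffle(shuffled: bytes, itemsize: int = 2) -> bytes:
    """Invert byte_shuffle."""
    n_items = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * n_items:(i + 1) * n_items]
    return bytes(out)


# Synthetic int16 signal with small amplitude, so high bytes carry little information
samples = array.array("h", (int(50 * math.sin(i / 40)) for i in range(10_000)))
raw = samples.tobytes()
shuffled = byte_shuffle(raw)
assert byte_unshuffle(shuffled) == raw  # the filter itself is lossless

print("plain   :", len(zlib.compress(raw, 5)), "bytes")
print("shuffled:", len(zlib.compress(shuffled, 5)), "bytes")
```

Whether shuffling helps depends on the data and on the codec, which is why the benchmark treats it as a separate option.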

For the FLAC compression, see the custom implementation of a SpikeInterface-like save function in `audiocompression.py`, which uses [pyFLAC](https://github.com/sonos/pyFLAC). 
Since pyFLAC supports at most 2 channels, the input data is split into 2-channel streams, and the block chunking mechanism is also implemented *externally*.
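
The splitting into 2-channel streams described above can be sketched as follows. This is only an illustration of the grouping step, not the code in `audiocompression.py`; the helper name `split_into_streams` is hypothetical.

```python
def split_into_streams(channel_ids, channels_per_stream=2):
    """Return consecutive groups of at most `channels_per_stream` channel ids."""
    return [
        channel_ids[i:i + channels_per_stream]
        for i in range(0, len(channel_ids), channels_per_stream)
    ]


streams = split_into_streams(list(range(5)))
print(streams)  # [[0, 1], [2, 3], [4]]
```

With an odd number of channels, the last stream holds a single channel.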

All compressors are run with different chunk sizes (0.1s, 1s, 10s). Additionally, compression is run for two cases:

- raw data (no preprocessing)
- median subtraction and LSB division
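
The chunk durations listed above translate into sample ranges once a sampling frequency is fixed. A minimal sketch, assuming a 30 kHz sampling frequency and a hypothetical helper `chunk_bounds` (neither is taken from the benchmark script):

```python
def chunk_bounds(num_samples, sampling_frequency, chunk_duration_s):
    """Return (start, stop) sample indices of consecutive chunks."""
    chunk_size = int(round(chunk_duration_s * sampling_frequency))
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


fs = 30_000  # assumed sampling frequency in Hz
for chunk_dur in (0.1, 1.0, 10.0):
    bounds = chunk_bounds(num_samples=70_000, sampling_frequency=fs,
                          chunk_duration_s=chunk_dur)
    print(chunk_dur, len(bounds), bounds[-1])
```

The last chunk is shorter whenever the recording length is not a multiple of the chunk size.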


This notebook assumes that `scripts/benchmark-lossless.py` has been run and that the resulting CSV file is available (the loading cell below reads `../data/results/benchmark-lossless-lsb.csv`).


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

%matplotlib notebook

In [None]:
save_fig = False

fig_folder = Path(".") / "figures"
fig_folder.mkdir(exist_ok=True)

In [None]:
audio_compressors = ['flac', 'wavpack']

In [None]:
res = pd.read_csv("../data/results/benchmark-lossless-lsb.csv", index_col=False)

print(len(res))

# res_no_median = res.query("lsb != 'min'")
# print(len(res_no_median))
# res_no_median["lsb_bool"] = [False] * len(res_no_median)

# res_no_median.loc[res_no_median.lsb == "median", "lsb_bool"] = True

In [None]:
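# Compression speed relative to real time (xRT), assuming "C-speed" stores the
# elapsed compression time and "duration" the recording length in the same unit
res.loc[:, "xRT"] = res["duration"] / res["C-speed"]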

In [None]:
res_zarr = res.query(f"compressor != {audio_compressors}")

In [None]:
res_zarr

# ZARR

We start by comparing the compression options readily available via Zarr.

### What is the best ZARR-based option in terms of CR?

In [None]:
dset = res_zarr

In [None]:
res_np1 = dset.query("probe == 'Neuropixels1.0'")
res_np2 = dset.query("probe == 'Neuropixels2.0'")

In [None]:
print("\nNP1\n")
print(res_np1.iloc[np.argmax(res_np1.CR)])
print("\nNP2\n")
print(res_np2.iloc[np.argmax(res_np2.CR)])

For both NP1 and NP2, the best zarr compressor is **Zstd - level 9 - chunk 1s - shuffle BIT**

### Effect of shuffling options

In [None]:
fig_sh_all, axs_sh_all = plt.subplots(ncols=3, figsize=(10, 7))

sns.boxplot(data=dset, x="shuffle", y="CR", ax=axs_sh_all[0])
sns.boxplot(data=dset, x="shuffle", y="xRT", ax=axs_sh_all[1])
sns.boxplot(data=dset, x="shuffle", y="D-10s", ax=axs_sh_all[2])

fig_sh_all.subplots_adjust(wspace=0.3)

fig_sh, axs_sh = plt.subplots(ncols=3, figsize=(10, 7))

sns.boxplot(data=dset, x="shuffle", y="CR", hue="compressor", ax=axs_sh[0])
sns.boxplot(data=dset, x="shuffle", y="xRT", hue="compressor", ax=axs_sh[1])
sns.boxplot(data=dset, x="shuffle", y="D-10s", hue="compressor", ax=axs_sh[2])

fig_sh.subplots_adjust(wspace=0.3)

**COMMENT**

The `BIT shuffle` option appears to generally improve CR without noticeably affecting compression or decompression speed. 
Let's focus on that for the rest of the analysis.

In [None]:
selected_shuffle = "bit"
dset_shuffle = dset.query(f"shuffle == '{selected_shuffle}'")
dset_shuffle

### Effect of chunk duration

In [None]:
fig_ch_all, axs_ch_all = plt.subplots(ncols=3, figsize=(10, 7))

sns.boxplot(data=dset_shuffle, x="chunk_dur", y="CR", ax=axs_ch_all[0])
sns.boxplot(data=dset_shuffle, x="chunk_dur", y="xRT", ax=axs_ch_all[1])
sns.boxplot(data=dset_shuffle, x="chunk_dur", y="D-10s", ax=axs_ch_all[2])

fig_ch_all.subplots_adjust(wspace=0.3)

fig_ch, axs_ch = plt.subplots(ncols=3, figsize=(10, 7))

sns.boxplot(data=dset_shuffle, x="chunk_dur", y="CR", hue="compressor", ax=axs_ch[0])
sns.boxplot(data=dset_shuffle, x="chunk_dur", y="xRT", hue="compressor", ax=axs_ch[1])
sns.boxplot(data=dset_shuffle, x="chunk_dur", y="D-10s", hue="compressor", ax=axs_ch[2])

fig_ch.subplots_adjust(wspace=0.3)

**COMMENT**

Chunk duration seems to have little effect on the compression metrics, so let's pick 1s:

In [None]:
selected_chunk = "1s"
dset_chunk = dset_shuffle.query(f"chunk_dur == '{selected_chunk}'")

Let's now confirm that the level does its job...

### Effect of compression level

In [None]:
fig_lev_all, axs_lev_all = plt.subplots(ncols=3, figsize=(10, 7))

sns.boxplot(data=dset_chunk, x="level", y="CR", ax=axs_lev_all[0])
sns.boxplot(data=dset_chunk, x="level", y="xRT", ax=axs_lev_all[1])
sns.boxplot(data=dset_chunk, x="level", y="D-10s", ax=axs_lev_all[2])

fig_lev_all.subplots_adjust(wspace=0.3)

fig_lev, axs_lev = plt.subplots(ncols=3, figsize=(10, 7))

sns.boxplot(data=dset_chunk, x="level", y="CR", hue="compressor", ax=axs_lev[0])
sns.boxplot(data=dset_chunk, x="level", y="xRT", hue="compressor", ax=axs_lev[1])
sns.boxplot(data=dset_chunk, x="level", y="D-10s", hue="compressor", ax=axs_lev[2])

fig_lev.subplots_adjust(wspace=0.3)

**COMMENT**

For most compressors, compression level does its job (increasing levels yield increasing CR). Strangely, for `zlib` the level seems to have the opposite effect. Of course, the higher the level the slower the compression speed. Decompression speed doesn't seem to be affected.
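
A quick way to check whether CR really increases with level is to compare the values at consecutive levels. The sketch below does this for plain standard-library `zlib` on a synthetic signal; it does not go through Blosc or Zarr, so it is not expected to reproduce the `zlib` behaviour observed above. The helper name `cr_increases_with_level` and the test signal are assumptions for the example.

```python
import array
import math
import zlib


def cr_increases_with_level(cr_by_level):
    """True if CR is non-decreasing when levels are sorted in ascending order."""
    levels = sorted(cr_by_level)
    return all(cr_by_level[a] <= cr_by_level[b]
               for a, b in zip(levels, levels[1:]))


# Synthetic int16 signal: slow oscillation plus a faster, smaller component
samples = array.array(
    "h",
    (int(300 * math.sin(i / 25) + 40 * math.sin(i / 3)) for i in range(30_000)),
)
raw = samples.tobytes()

# Same levels as in the benchmark: 1 (min), 5, 9 (max)
cr_by_level = {
    level: len(raw) / len(zlib.compress(raw, level)) for level in (1, 5, 9)
}
print(cr_by_level, cr_increases_with_level(cr_by_level))
```

The same check could be applied per compressor to the benchmark table, for example on the median CR at each level.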

For the final analysis, let's pick level 9 and compare the raw compression with the median+lsb preprocessing.

In [None]:
selected_level = 9
dset_level = dset_chunk.query(f"level == {selected_level}")

### Effect of LSB correction

For Open Ephys saved data, the `int16` binary files actually have an lsb > 1:

- NP1: lsb = 12 --> ~2.34 uV
- NP2: lsb = 3 --> ~0.585 uV

While this does not affect the signals, it might affect compression because more bits than needed are used to encode for the voltage values.

For NP1, in addition, the channel signals are not always centered at 0, and the offset differs across channels: one channel could have **central** values of (-12, 0, 12) while another could have, for instance, (-11, 1, 13). For NP2 this is not the case, although many channels are still not centered at 0.

In order to account for this, we first estimate the median for each channel using chunks of the data and, before compression, we subtract the median and divide the signals by the LSB.

At decompression, to recover the original data, the signals are rescaled by the LSB and the median is re-added. Note that these last two steps are not necessary:

- the median values are irrelevant for downstream analysis
- the LSB scaling can be accounted for simply by resetting the `gain` values with the initial scaling

For these reasons, the decompression figures displayed below overestimate the cost that is actually required.
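
The preprocessing and its inverse can be sketched on a toy channel as follows. This is an illustration under simplifying assumptions, not the benchmark code: it works on plain integer lists instead of upcasting to float, estimates the LSB as the greatest common divisor of the gaps between unique values, and uses hypothetical helper names (`estimate_lsb`, `median_int`, `lsb_encode`, `lsb_decode`).

```python
from functools import reduce
from math import gcd


def estimate_lsb(samples):
    """Estimate the LSB as the GCD of the gaps between sorted unique values."""
    unique = sorted(set(samples))
    diffs = [b - a for a, b in zip(unique, unique[1:])]
    return reduce(gcd, diffs) if diffs else 1


def median_int(samples):
    """Pick an actual sample value at the middle rank, so the offset stays integer."""
    return sorted(samples)[len(samples) // 2]


def lsb_encode(samples, lsb, median):
    """Subtract the median, then divide by the LSB (exact on the LSB grid)."""
    return [(s - median) // lsb for s in samples]


def lsb_decode(encoded, lsb, median):
    """Rescale by the LSB and re-add the median."""
    return [e * lsb + median for e in encoded]


# Toy NP1-like channel: values spaced by 12 and centered at 1 rather than 0
channel = [-11, 1, 13, 1, -23, 25, 1, 13]
lsb = estimate_lsb(channel)
median = median_int(channel)
encoded = lsb_encode(channel, lsb, median)

assert lsb_decode(encoded, lsb, median) == channel  # the round trip is lossless
print(lsb, median, encoded)  # 12 1 [-1, 0, 1, 0, -2, 2, 0, 1]
```

After the correction the values span a much smaller integer range, which is what gives the compressor less to encode.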


In [None]:
for probe in np.unique(dset_level.probe):
    dset_probe = dset_level.query(f"probe == '{probe}'")
    fig_lsb_all, axs_lsb_all = plt.subplots(ncols=3, figsize=(10, 7))

    sns.boxplot(data=dset_probe, x="lsb", y="CR", ax=axs_lsb_all[0])
    sns.boxplot(data=dset_probe, x="lsb", y="xRT", ax=axs_lsb_all[1])
    sns.boxplot(data=dset_probe, x="lsb", y="D-10s", ax=axs_lsb_all[2])

    fig_lsb_all.subplots_adjust(wspace=0.3)
    fig_lsb_all.suptitle(probe)

    fig_lsb, axs_lsb = plt.subplots(ncols=3, figsize=(10, 7))

    sns.barplot(data=dset_probe, x="lsb", y="CR", hue="compressor", ax=axs_lsb[0])
    sns.barplot(data=dset_probe, x="lsb", y="xRT", hue="compressor", ax=axs_lsb[1])
    sns.barplot(data=dset_probe, x="lsb", y="D-10s", hue="compressor", ax=axs_lsb[2])

    fig_lsb.subplots_adjust(wspace=0.3)
    fig_lsb.suptitle(probe)

In [None]:
for probe in np.unique(dset_level.probe):
    dset_probe = dset_level.query(f"probe == '{probe}'")
    dset_no = dset_probe.query("lsb == False")
    dset_lsb = dset_probe.query("lsb == True")
    
    print(f"\n\n{probe}\n")
    print(dset_probe.groupby("lsb")["CR"].max())

**COMMENT**

For both NP1 and NP2, the LSB correction substantially improves the best achievable CR:

- NP1: from 2.1 ($\sim$47% size) to 3.13 ($\sim$32% size) 
- NP2: from 1.5 ($\sim$66% size) to 1.88 ($\sim$53% size)
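
The size percentages quoted above follow directly from the CR values, since the stored size is `100 / CR` percent of the original. A short check using the numbers from the list (the helper name `stored_size_percent` is just for this example):

```python
def stored_size_percent(cr: float) -> float:
    """Stored size as a percentage of the original size, given a compression ratio."""
    return 100.0 / cr


for probe, cr_raw, cr_lsb in [("NP1", 2.1, 3.13), ("NP2", 1.5, 1.88)]:
    print(f"{probe}: {stored_size_percent(cr_raw):.1f}% -> "
          f"{stored_size_percent(cr_lsb):.1f}%")
# NP1: 47.6% -> 31.9%
# NP2: 66.7% -> 53.2%
```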

Compression speed is reduced (especially for `lz4`) due to the preprocessing (which requires upcasting to float, scaling, and downcasting back to int16).

As a final step, we select LSB and Zstd as best options:

In [None]:
selected_lsb = True
selected_compressor = "zstd"
dset_best_zarr = dset_level.query(f"compressor == '{selected_compressor}' and lsb == {selected_lsb}")
dset_best_zarr

# AUDIO compression

Lossless audio codecs could provide a good alternative to general-purpose compression algorithms because: 

- Audio signals are also timeseries 
- Frequency range is similar
- Multiple channels are correlated

We tried FLAC and WavPack. The FLAC format supports up to 8 channels (and pyFLAC, as noted above, only 2), so for recordings with more channels we need to either:
- concatenate multi-channel signals into (num_samples x 2)
- save multiple streams
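
The first option, concatenating channel pairs along the time axis into a 2-column signal, can be sketched as below. This is a toy illustration on plain lists with hypothetical helper names (`concatenate_pairs`, `split_pairs`) and assumes an even number of equal-length channels.

```python
def concatenate_pairs(channels):
    """Stack consecutive channel pairs along time into two long columns.

    `channels` is a list with an even number of equal-length sample lists.
    """
    left = [s for ch in channels[0::2] for s in ch]
    right = [s for ch in channels[1::2] for s in ch]
    return left, right


def split_pairs(left, right, num_samples):
    """Invert concatenate_pairs, given the number of samples per channel."""
    channels = []
    for start in range(0, len(left), num_samples):
        channels.append(left[start:start + num_samples])
        channels.append(right[start:start + num_samples])
    return channels


channels = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
left, right = concatenate_pairs(channels)
assert split_pairs(left, right, num_samples=3) == channels
print(left, right)  # [1, 2, 3, 7, 8, 9] [4, 5, 6, 10, 11, 12]
```

The resulting signal has `num_samples * num_channels / 2` rows and 2 columns, so the original number of samples per channel must be stored alongside it to undo the concatenation.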

In [None]:
dset_audio = res.query(f"compressor in {audio_compressors}")

No shuffling is available in FLAC, so we just select the same chunk duration for the comparison. 
WavPack is run without a compression level.

In [None]:
selected_chunk = "1s"
dset_audio_chunk = dset_audio.query(f"chunk_dur == '{selected_chunk}'")

In [None]:
fig_lev_all, axs_lev_all = plt.subplots(ncols=3, figsize=(10, 7))

sns.boxplot(data=dset_audio_chunk, x="level", y="CR", ax=axs_lev_all[0])
sns.boxplot(data=dset_audio_chunk, x="level", y="xRT", ax=axs_lev_all[1])
sns.boxplot(data=dset_audio_chunk, x="level", y="D-10s", ax=axs_lev_all[2])

fig_lev_all.subplots_adjust(wspace=0.3)

fig_lev, axs_lev = plt.subplots(ncols=3, figsize=(10, 7))

# Use the chunk-filtered data here too, consistent with the figure above
sns.boxplot(data=dset_audio_chunk, x="level", y="CR", hue="compressor", ax=axs_lev[0])
sns.boxplot(data=dset_audio_chunk, x="level", y="xRT", hue="compressor", ax=axs_lev[1])
sns.boxplot(data=dset_audio_chunk, x="level", y="D-10s", hue="compressor", ax=axs_lev[2])

fig_lev.subplots_adjust(wspace=0.3)

Only a slight increase in compression performance is visible with increasing level. Let's select level 9 for FLAC and see how the LSB correction affects CR:

In [None]:
selected_level = 9
dset_wv = dset_audio_chunk.query("compressor == 'wavpack'")
dset_flac = dset_audio_chunk.query(f"compressor == 'flac' and level == {selected_level}")
dset_level = pd.concat([dset_wv, dset_flac])

In [None]:
dset_level

In [None]:
for probe in np.unique(dset_level.probe):
    dset_probe = dset_level.query(f"probe == '{probe}'")
    fig_lsb_all, axs_lsb_all = plt.subplots(ncols=3, figsize=(10, 7))

    sns.barplot(data=dset_probe, x="lsb", y="CR", hue="compressor", ax=axs_lsb_all[0])
    sns.barplot(data=dset_probe, x="lsb", y="xRT", hue="compressor",  ax=axs_lsb_all[1])
    sns.barplot(data=dset_probe, x="lsb", y="D-10s", hue="compressor", ax=axs_lsb_all[2])

    fig_lsb_all.subplots_adjust(wspace=0.3)
    fig_lsb_all.suptitle(probe)

In [None]:
dset_best_audio = dset_level.query(f"lsb == {selected_lsb}")

In [None]:
dset_best = pd.concat([dset_best_zarr, dset_best_audio])

In [None]:
for probe in np.unique(dset_best.probe):
    dset_probe = dset_best.query(f"probe == '{probe}'")
    fig_lsb_all, axs_lsb_all = plt.subplots(ncols=3, figsize=(10, 7))

    sns.barplot(data=dset_probe, x="compressor", y="CR", ax=axs_lsb_all[0])
    sns.barplot(data=dset_probe, x="compressor", y="xRT", ax=axs_lsb_all[1])
    sns.barplot(data=dset_probe, x="compressor", y="D-10s", ax=axs_lsb_all[2])

    fig_lsb_all.subplots_adjust(wspace=0.3)
    fig_lsb_all.suptitle(probe)
    display(dset_probe)

**COMMENT**

