# Benchmark lossless compression strategies

In this notebook we analyze the performance of different lossless compression strategies in terms of:

* compression ratio (n_bytes / n_stored_bytes)
* compression speed
* decompression speed
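
As a minimal illustration of the first metric, the sketch below computes a compression ratio with Python's built-in `zlib` on a synthetic `int16` signal. The signal, the compression level, and the helper name `compression_ratio` are assumptions made for illustration only; they are not part of the benchmark.

```python
import array
import zlib


def compression_ratio(raw: bytes, stored: bytes) -> float:
    """Return n_bytes / n_stored_bytes."""
    return len(raw) / len(stored)


# Synthetic int16 "trace": a sawtooth, chosen only because it is compressible.
samples = array.array("h", [(i % 50) - 25 for i in range(30_000)])
raw = samples.tobytes()
stored = zlib.compress(raw, 5)

cr = compression_ratio(raw, stored)
print(f"CR = {cr:.2f}, stored size = {100 / cr:.1f}% of the original")
```

A CR above 1 means the stored data is smaller than the original; the stored size as a percentage of the original is simply `100 / CR`.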

The lossless compression algorithms compared are:

- Zarr BLOSC compressors
    * lz4 
    * lz4hc 
    * zlib
    * zstd
- Audio compressors
    * FLAC
    * WavPack
    
The zarr compressors are implemented via the SpikeInterface `save(format="zarr")` function and run with different options:

* level (low - 1, medium - 5, high - 9)
* BLOSC shuffle filter (no, auto/shuffle, bit)


Custom numcodecs wrappers have also been written for FLAC and WavPack with the following levels/options:

* flac (low - 1, medium - 5, high - 8)
* wavpack (low - f, medium - h, high - hh)

Since pyFLAC supports 2 channels at most, we test 2 options for FLAC:

- chunking by time only (and flattening the data)
- chunking with size (num_samples, 2), so that FLAC compresses streams of "stereo" channels
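
The two FLAC layouts can be sketched with plain NumPy slicing. This is only an illustration of the array shapes involved: the toy chunk size, the row-major flattening order, and the variable names are assumptions, not the actual wrapper code.

```python
import numpy as np

num_samples, num_channels = 6, 4
# Hypothetical time chunk: rows are samples, columns are channels.
chunk = np.arange(num_samples * num_channels, dtype=np.int16).reshape(
    num_samples, num_channels
)

# Option 1: chunk by time only and flatten, giving a single mono stream.
flat_stream = chunk.reshape(-1)

# Option 2: chunk by (num_samples, 2), giving one "stereo" stream per channel pair.
stereo_streams = [chunk[:, start:start + 2] for start in range(0, num_channels, 2)]

print(flat_stream.shape)                  # (24,)
print([s.shape for s in stereo_streams])  # [(6, 2), (6, 2)]
```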

All compressors are run with different chunk sizes (0.1s, 1s, 10s). Additionally, compression is run for two cases:

- raw data (no preprocessing) - lsb = False
- median subtraction and LSB division - lsb = True
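
To get a feel for the size of the parameter space described above, the options for the Blosc compressors can be enumerated as a grid. This sketch assumes that every combination is run; the key names are illustrative and need not match the columns of the results file.

```python
from itertools import product

blosc_names = ["lz4", "lz4hc", "zlib", "zstd"]
levels = {"low": 1, "medium": 5, "high": 9}
shuffles = ["no", "shuffle", "bit"]
chunk_durations = ["0.1s", "1s", "10s"]
lsb_options = [False, True]

# One dictionary per (hypothetical) benchmark configuration.
blosc_grid = [
    dict(compressor=name, level=level, shuffle=shuffle, chunk_dur=chunk, lsb=lsb)
    for name, level, shuffle, chunk, lsb in product(
        blosc_names, levels, shuffles, chunk_durations, lsb_options
    )
]

print(len(blosc_grid))  # 4 * 3 * 3 * 3 * 2 = 216
```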


This notebook assumes that `ephys-compression/scripts/benchmark-lossless.py` has been run and that `ephys-compression/data/results/benchmark-lossless.csv` is available.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import sys

sys.path.append("..")

from utils import prettify_axes

%matplotlib notebook

In [None]:
save_fig = True

fig_folder = Path(".") / "figures" / "lossless"
fig_folder.mkdir(exist_ok=True, parents=True)

In [None]:
res = pd.read_csv("../data/results/benchmark-lossless-final.csv", index_col=False)

In [None]:
res

In [None]:
blosc_compressors = np.unique(res.query("compressor_type == 'blosc'").compressor)
numcodecs_compressors = np.unique(res.query("compressor_type == 'numcodecs'").compressor)
audio_compressors = np.unique(res.query("compressor_type == 'audio'").compressor)

# LSB correction?

For Open Ephys saved data, the `int16` binary files actually have an lsb > 1:

- NP1: lsb = 12 --> ~2.34 uV
- NP2: lsb = 3 --> ~0.585 uV

While this does not affect the signals, it might affect compression because more bits than needed are used to encode the voltage values.

For NP1, in addition, the channel signals are not always centered at 0, meaning that one channel could have **central** values of (-12, 0, 12) and another channel could have, for instance (-11, 1, 13). For NP2, this is not the case, but many channels are not centered at 0.

In order to account for this, we first estimate the median for each channel using chunks of the data and, before compression, we subtract the median and divide the signals by the LSB.

At decompression, to recover the original data, the signals are rescaled by the LSB and the median is re-added. Note that these last two steps are not necessary:

- the median values are irrelevant for downstream analysis
- the LSB scaling can be accounted for simply by resetting the `gain` values with the initial scaling

For these reasons, the decompression times displayed below overestimate the actual values.
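
The preprocessing and its inverse can be sketched on synthetic data as follows. This is a simplified illustration, not the benchmark code: it uses integer arithmetic, a made-up set of per-channel offsets, and an odd number of samples so that each channel median is itself a sample value (which keeps the division by the LSB exact).

```python
import numpy as np

rng = np.random.default_rng(0)
lsb = 12  # NP1 value quoted above
num_samples, num_channels = 1001, 4  # odd sample count: the median is a sample
offsets = np.array([0, 1, -3, 5], dtype=np.int16)  # hypothetical channel offsets

# Synthetic int16 traces that only take values offset + k * lsb.
steps = rng.integers(-20, 21, size=(num_samples, num_channels))
traces = (steps * lsb + offsets).astype(np.int16)

# Before compression: subtract the per-channel median, then divide by the LSB.
median = np.median(traces, axis=0).astype(np.int16)
scaled = (traces - median) // lsb

# After decompression: rescale by the LSB and re-add the median.
restored = scaled * lsb + median

print(np.array_equal(restored, traces))  # True
print(np.abs(traces).max(), np.abs(scaled).max())
```

The scaled signal spans a much smaller integer range than the original while carrying the same information, which is what gives the compressor less to encode.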


In [None]:
fig_lsb_all, axs_lsb_all = plt.subplots(ncols=2, figsize=(10, 5))

for p, probe in enumerate(np.unique(res.probe)):
    dset_probe = res.query(f"probe == '{probe}'")
    
    sns.boxplot(data=dset_probe, x="compressor_type", y="CR", hue="lsb", ax=axs_lsb_all[p])

    axs_lsb_all[p].set_title(probe, fontsize=20)
fig_lsb_all.subplots_adjust(wspace=0.3)  
prettify_axes(axs_lsb_all)

In [None]:
if save_fig:
    fig_lsb_all.savefig(fig_folder / "lsb_corr.pdf")

In [None]:
res_lsb = res.query("lsb == True")

In [None]:
res_blosc = res_lsb.query("compressor_type == 'blosc'")
res_numcodecs = res_lsb.query("compressor_type == 'numcodecs'")
res_audio = res_lsb.query("compressor_type == 'audio'")

rec_zarr = res_lsb.query("compressor_type != 'audio'")

# ZARR 

We start by comparing the compression options readily available via the Blosc meta-compressor in ZARR.

### What is the best ZARR-based option in terms of CR?

In [None]:
dset = rec_zarr

In [None]:
res_np1 = dset.query("probe == 'Neuropixels1.0'")
res_np2 = dset.query("probe == 'Neuropixels2.0'")

In [None]:
print("\nNP1\n")
print(res_np1.iloc[np.argmax(res_np1.CR)])
print("\nNP2\n")
print(res_np2.iloc[np.argmax(res_np2.CR)])

For both NP1 and NP2, the best zarr compressor is **Blosc-zstd - level 9 - chunk 1s/10s - shuffle BIT**

### Effect of shuffling options

The BLOSC meta-compressor provides two shuffling options:

- byte shuffle (shuffle)
- bit shuffle

A byte shuffle is also available via `numcodecs.Shuffle` for non-Blosc codecs.
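
Conceptually, a byte shuffle regroups the bytes of each sample so that all first bytes come first, then all second bytes, and so on; a bit shuffle applies the same idea at the level of individual bits. The NumPy sketch below illustrates the byte case only. It is not how Blosc or numcodecs implement it, and the helper names are hypothetical.

```python
import numpy as np


def byte_shuffle(arr: np.ndarray) -> np.ndarray:
    """Group the i-th byte of every element together (conceptual sketch)."""
    itemsize = arr.dtype.itemsize
    raw = np.ascontiguousarray(arr).view(np.uint8).reshape(-1, itemsize)
    return raw.T.copy().ravel()


def byte_unshuffle(shuffled: np.ndarray, dtype) -> np.ndarray:
    """Invert byte_shuffle and reinterpret the bytes as `dtype`."""
    itemsize = np.dtype(dtype).itemsize
    return shuffled.reshape(itemsize, -1).T.copy().ravel().view(dtype)


samples = np.array([1, 2, 3, -1], dtype="<i2")  # little-endian int16
shuffled = byte_shuffle(samples)

print(shuffled.tolist())  # [1, 2, 3, 255, 0, 0, 0, 255]
print(byte_unshuffle(shuffled, "<i2").tolist())  # [1, 2, 3, -1]
```

For small-amplitude `int16` signals the high bytes are mostly `0x00` or `0xFF`, so grouping them produces long, repetitive runs that are easier to compress.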

In [None]:
fig_sh_all, axs_sh_all = plt.subplots(ncols=3, nrows=2, figsize=(15, 10))
sns.boxplot(data=dset, x="shuffle", y="CR", ax=axs_sh_all[0, 0])
sns.boxplot(data=dset, x="shuffle", y="xRT", ax=axs_sh_all[0, 1])
sns.boxplot(data=dset, x="shuffle", y="D-10s", ax=axs_sh_all[0, 2])

sns.boxplot(data=dset, x="compressor", y="CR", hue="shuffle", ax=axs_sh_all[1, 0])
sns.boxplot(data=dset, x="compressor", y="xRT", hue="shuffle", ax=axs_sh_all[1, 1])
sns.boxplot(data=dset, x="compressor", y="D-10s", hue="shuffle", ax=axs_sh_all[1, 2])

axs_sh_all[1, 0].set_xticklabels(axs_sh_all[1, 0].get_xticklabels(), rotation=90)
axs_sh_all[1, 1].set_xticklabels(axs_sh_all[1, 1].get_xticklabels(), rotation=90)
axs_sh_all[1, 2].set_xticklabels(axs_sh_all[1, 2].get_xticklabels(), rotation=90)

fig_sh_all.subplots_adjust(wspace=0.3)

fig_sh_all.suptitle("Shuffling", fontsize=20)

prettify_axes(axs_sh_all)

In [None]:
if save_fig:
    fig_sh_all.savefig(fig_folder / "shuffling.pdf")

**COMMENT**

In general, pre-shuffling (byte or bit) the data improves compression performance with respect to no pre-shuffling.
The `BIT shuffle` available in blosc seems to be the best option, as it provides better CR, compression and decompression speed. 

Let's focus on that (and therefore on BLOSC compressors) for the rest of the analysis.

In [None]:
selected_shuffle = "bit"
dset_shuffle = dset.query(f"shuffle == '{selected_shuffle}'")
dset_shuffle

### Effect of chunk duration

In [None]:
fig_ch_all, axs_ch_all = plt.subplots(ncols=3, nrows=2, figsize=(15, 10))
sns.boxplot(data=dset_shuffle, x="chunk_dur", y="CR", ax=axs_ch_all[0, 0])
sns.boxplot(data=dset_shuffle, x="chunk_dur", y="xRT", ax=axs_ch_all[0, 1])
sns.boxplot(data=dset_shuffle, x="chunk_dur", y="D-10s", ax=axs_ch_all[0, 2])

sns.boxplot(data=dset_shuffle, x="compressor", y="CR", hue="chunk_dur", ax=axs_ch_all[1, 0])
sns.boxplot(data=dset_shuffle, x="compressor", y="xRT", hue="chunk_dur", ax=axs_ch_all[1, 1])
sns.boxplot(data=dset_shuffle, x="compressor", y="D-10s", hue="chunk_dur", ax=axs_ch_all[1, 2])

axs_ch_all[1, 0].set_xticklabels(axs_ch_all[1, 0].get_xticklabels(), rotation=90)
axs_ch_all[1, 1].set_xticklabels(axs_ch_all[1, 1].get_xticklabels(), rotation=90)
axs_ch_all[1, 2].set_xticklabels(axs_ch_all[1, 2].get_xticklabels(), rotation=90)

fig_ch_all.subplots_adjust(wspace=0.3)
fig_ch_all.suptitle("Chunk duration", fontsize=20)

prettify_axes(axs_ch_all)

In [None]:
if save_fig:
    fig_ch_all.savefig(fig_folder / "chunks.pdf")

**COMMENT**

Chunk duration seems to be relatively irrelevant for compression metrics. So let's pick 1s:

In [None]:
selected_chunk = "1s"
dset_chunk = dset_shuffle.query(f"chunk_dur == '{selected_chunk}'")

In [None]:
dset_chunk

Let's now confirm that the level does its job...

### Effect of compression level

In [None]:
fig_lev_all, axs_lev_all = plt.subplots(ncols=3, nrows=1, figsize=(15, 6))
sns.boxplot(data=dset_chunk, x="level", y="CR", hue="compressor", ax=axs_lev_all[0])
sns.boxplot(data=dset_chunk, x="level", y="xRT", hue="compressor", ax=axs_lev_all[1])
sns.boxplot(data=dset_chunk, x="level", y="D-10s", hue="compressor", ax=axs_lev_all[2])
fig_lev_all.subplots_adjust(wspace=0.3)
fig_lev_all.suptitle("Compression level", fontsize=20)

prettify_axes(axs_lev_all)

In [None]:
if save_fig:
    fig_lev_all.savefig(fig_folder / "levels.pdf")

In [None]:
fig_zlib, ax_zlib = plt.subplots(figsize=(12, 10))

res_zlib = dset.query("compressor == 'blosc-zlib' and chunk_dur == '1s'")

sns.barplot(data=res_zlib, x="level", y="CR", hue="shuffle", ax=ax_zlib)
ax_zlib.set_title("BLOSC-ZLIB - compression level", fontsize=20)
prettify_axes(ax_zlib)

In [None]:
if save_fig:
    fig_zlib.savefig(fig_folder / "zlib_level.pdf")

**COMMENT**

For most compressors, compression level does its job (increasing levels yield increasing CR). Strangely, for `blosc-zlib` the level seems to have the opposite effect. Of course, the higher the level the slower the compression speed. Decompression speed doesn't seem to be affected (this is not the case for the `numcodecs.Zlib` wrapper, which doesn't play well with BIT shuffling).

For the final analysis, let's pick level 9 and compare the raw compression with the median+lsb preprocessing.

In [None]:
selected_level = "high"
dset_level = dset_chunk.query(f"level == '{selected_level}'")

In [None]:
dset_level

**COMMENT**

For both NP1 and NP2, the LSB correction significantly improves CRs.

- NP1: from 2.1 ($\sim$47% size) to 3.13 ($\sim$32% size) 
- NP2: from 1.5 ($\sim$66% size) to 1.88 ($\sim$53% size)
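
The size percentages above follow directly from the CR, since the stored size relative to the original is `100 / CR`. A small sketch using the CR values quoted above (the helper name is hypothetical):

```python
def relative_size_percent(cr: float) -> float:
    """Stored size as a percentage of the original size."""
    return 100.0 / cr


for probe, cr_raw, cr_lsb in [("NP1", 2.1, 3.13), ("NP2", 1.5, 1.88)]:
    print(
        f"{probe}: {relative_size_percent(cr_raw):.1f}% -> "
        f"{relative_size_percent(cr_lsb):.1f}%"
    )
# NP1: 47.6% -> 31.9%
# NP2: 66.7% -> 53.2%
```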

Compression speed is reduced (especially for `lz4`) due to the preprocessing (which requires upcasting to float, scaling, and downcasting back to int16).

As a final step, we select LSB and Zstd as best options:

In [None]:
selected_compressor = "blosc-zstd"
dset_best_zarr = dset_level.query(f"compressor == '{selected_compressor}'")
dset_best_zarr

In [None]:
print(dset_best_zarr.to_latex(index=False, columns=["probe", "duration", "compressor_type",
                                                    "compressor", "level", "shuffle", "chunk_dur",
                                                    "CR", "xRT", "D-10s"]))

# AUDIO compression

Lossless audio codecs could provide a good alternative to general-purpose compression algorithms because: 

- Audio signals are also timeseries 
- Frequency range is similar
- Multiple channels are correlated

We tried FLAC and WavPack. pyFLAC supports at most 2 channels, so we need to either:

- flatten multi-channel signals into a single stream (channel_chunk_size=-1)
- save blocks with 2 channels (channel_chunk_size=2)

In [None]:
dset_audio = res_audio

No shuffling is available in FLAC, so we just select the same chunk duration for the comparison. 
WavPack doesn't have a numeric compression level; its modes (f, h, hh) are used as the low, medium, and high levels.

In [None]:
selected_chunk = "1s"
dset_audio_chunk = dset_audio.query(f"chunk_dur == '{selected_chunk}'")

### FLAC: flattening or not?

In [None]:
dset_flac = dset_audio_chunk.query("compressor == 'flac'")

In [None]:
dset_flac

In [None]:
fig_flac, axs_flac = plt.subplots(ncols=3, figsize=(15, 6))

sns.boxplot(data=dset_flac, x="channel_chunk_size", y="CR", ax=axs_flac[0])
sns.boxplot(data=dset_flac, x="channel_chunk_size", y="xRT", ax=axs_flac[1])
sns.boxplot(data=dset_flac, x="channel_chunk_size", y="D-10s", ax=axs_flac[2])
fig_flac.suptitle("Channel chunk size (flac)", fontsize=20)

prettify_axes(axs_flac)

In [None]:
if save_fig:
    fig_flac.savefig(fig_folder / "flac_chunks.pdf")

**COMMENT**

Flattening or using "stereo" channels doesn't seem to make a difference for CR. Actually, compression and decompression are slightly faster when stereo mode is enabled!

In [None]:
dset_wavpack = dset_audio_chunk.query("compressor == 'wavpack'")
dset_flac_2 = dset_flac.query("channel_chunk_size == 2")  # wavpack has channel_chunk_size == -1
dset_audio_2 = pd.concat([dset_flac_2, dset_wavpack])

### Effect of compression level

In [None]:
fig_lev_all, axs_lev_all = plt.subplots(ncols=3, nrows=2, figsize=(15, 10))

sns.barplot(data=dset_audio_2.query("probe == 'Neuropixels1.0'"), x="compressor", 
            y="CR", hue="level", ax=axs_lev_all[0, 0])
sns.barplot(data=dset_audio_2.query("probe == 'Neuropixels1.0'"), x="compressor", 
            y="xRT", hue="level", ax=axs_lev_all[0, 1])
sns.barplot(data=dset_audio_2.query("probe == 'Neuropixels1.0'"), x="compressor", 
            y="D-10s", hue="level", ax=axs_lev_all[0, 2])

sns.barplot(data=dset_audio_2.query("probe == 'Neuropixels2.0'"), x="compressor", 
            y="CR", hue="level", ax=axs_lev_all[1, 0])
sns.barplot(data=dset_audio_2.query("probe == 'Neuropixels2.0'"), x="compressor", 
            y="xRT", hue="level", ax=axs_lev_all[1, 1])
sns.barplot(data=dset_audio_2.query("probe == 'Neuropixels2.0'"), x="compressor", 
            y="D-10s", hue="level", ax=axs_lev_all[1, 2])

axs_lev_all[0, 1].set_title("Neuropixels 1.0", fontsize=18)
axs_lev_all[1, 1].set_title("Neuropixels 2.0", fontsize=18)


fig_lev_all.subplots_adjust(wspace=0.3, hspace=0.5)
fig_lev_all.suptitle("Compression level", fontsize=20)

prettify_axes(axs_lev_all)

In [None]:
dset_audio_2

In [None]:
if save_fig:
    fig_lev_all.savefig(fig_folder / "audio_levels.pdf")

There is a slight increase in compression performance with compression level from low to medium, but not between medium and high. For FLAC, compression speed does not seem to be affected by compression level.
However, for WavPack, decompression speed is a bit slow (~0.5 xRT), probably due to the sub-optimal implementation.

In [None]:
selected_level = "medium"
dset_best_audio = dset_audio_2.query(f"level == '{selected_level}'")

In [None]:
dset_best_all = pd.concat([dset_best_zarr, dset_best_audio])
dset_best_all

In [None]:
fig_best_all, axs_best_all = plt.subplots(ncols=3, nrows=2, figsize=(15, 10))

sns.barplot(data=dset_best_all.query("probe == 'Neuropixels1.0'"), x="compressor", 
            y="CR", ax=axs_best_all[0, 0])
sns.barplot(data=dset_best_all.query("probe == 'Neuropixels1.0'"), x="compressor", 
            y="xRT", ax=axs_best_all[0, 1])
sns.barplot(data=dset_best_all.query("probe == 'Neuropixels1.0'"), x="compressor", 
            y="D-10s", ax=axs_best_all[0, 2])

sns.barplot(data=dset_best_all.query("probe == 'Neuropixels2.0'"), x="compressor", 
            y="CR", ax=axs_best_all[1, 0])
sns.barplot(data=dset_best_all.query("probe == 'Neuropixels2.0'"), x="compressor", 
            y="xRT", ax=axs_best_all[1, 1])
sns.barplot(data=dset_best_all.query("probe == 'Neuropixels2.0'"), x="compressor", 
            y="D-10s",  ax=axs_best_all[1, 2])

axs_best_all[0, 1].set_title("Neuropixels 1.0", fontsize=18)
axs_best_all[1, 1].set_title("Neuropixels 2.0", fontsize=18)

prettify_axes(axs_best_all)

fig_best_all.subplots_adjust(wspace=0.3, hspace=0.5)
fig_best_all.suptitle("Compression strategies", fontsize=20)

In [None]:
if save_fig:
    fig_best_all.savefig(fig_folder / "best.pdf")

In [None]:
for probe in np.unique(dset_level.probe):
    dset_probe = dset_best_all.query(f"probe == '{probe}'")
    
    print(f"\n\n{probe}\n")
    print(dset_probe.groupby("compressor")["CR"].max())

In [None]:
dset_best_sorted = dset_best_all.sort_values("probe")

dset_best_sorted["file_size"] = (1 / dset_best_sorted["CR"]) * 100

In [None]:
print(dset_best_sorted.to_latex(index=False, columns=["probe", "duration", 
                                                    "compressor", "level", "shuffle", "chunk_dur",
                                                    "CR", "file_size", "xRT", "D-10s"]))

### Conclusion


In terms of CR, FLAC reaches the highest lossless compression for NP1 (3.462 -- $\sim$29% size), but it is extremely slow to decompress. Note that Zstd, with its 3.13 CR, reduces the size to $\sim$32%. WavPack is somewhere in the middle (CR 3.36 -- $\sim$30% size). 

For NP2, WavPack can reduce the size to $\sim$42%, in contrast to $\sim$53% for Zstd. Decompression is currently a bit slow for WavPack as well (around 5s to retrieve 10s of traces), but this can probably be improved by bypassing the temporary wav conversion and binding the WavPack C library directly.