# Lossless compression

This notebook reproduces the results of the `Lossless compression` section of the paper.

We assume that `ephys-compression/scripts/benchmark-lossless.py` has been run and that `../data/results-lossless/benchmark-lossless.csv` is available.

In [None]:
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

import pandas as pd
import seaborn as sns

from utils import prettify_axes, stat_test

%matplotlib inline

In [None]:
figsize_rect = (10,5)
figsize_square = (10, 10)

In [None]:
data_folder = Path("../data")
results_folder = Path("../results")

save_fig = True
fig_folder = results_folder / "figures" / "lossless"
fig_folder.mkdir(exist_ok=True, parents=True)

results_lossless_folder = data_folder / "results-lossless"

In [None]:
res = pd.read_csv(results_lossless_folder / "benchmark-lossless.csv", index_col=False)

In [None]:
probe_names = {"Neuropixels1.0": "NP1",
               "Neuropixels2.0": "NP2"}
for probe, probe_name in probe_names.items():
    res.loc[res.query(f"probe == '{probe}'").index, "probe"] = probe_name
probes = np.unique(res.probe)

In [None]:
res.head()

In [None]:
# color palette
cmap = plt.get_cmap("Paired")
palette = {}
ctypes = ["blosc", "numcodecs", "audio"]
num_compressors = len(np.unique(res.compressor))
i = 0
for ctype in ctypes:
    res_ctype = res.query(f"compressor_type == '{ctype}'")
    for comp in np.unique(res_ctype.sort_values("compressor").compressor):
        palette[comp]  = cmap(i / num_compressors)
        i += 1

# LSB correction drastically increases compression performance

As mentioned in the Experimental data section, the Neuropixels data recorded by `Open Ephys` (in our case from the `AIND` data source) do not have a unitary LSB (12 for NP1, 3 for NP2). In principle, dividing the samples by the LSB value would reduce the data range (and possibly improve compression) without affecting data integrity. We denote this process as *LSB correction*. In Figure 1 we display the distribution of all compression metrics without (light blue) and with (dark blue) LSB correction. Note that in this case the distributions contain data points from both NP1 and NP2 probes (1800 NP1 and 3600 NP2 observations). Compression ratio (panel A) is significantly higher after LSB correction, with a large effect size (p-value<1e-10, effect size=1). Compression speed (panel B), instead, slightly decreases, likely due to the additional computational load of the LSB correction itself (p-value<1e-4, effect size=0.14), while a minor increase in decompression speed is observed (panel C; p-value<1e-8, effect size=0.2). Due to the large improvement in compression ratio, we continued the analysis with LSB correction, and we advise users to do the same when using `Open Ephys` recordings (see the Software Extension section).
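
To make the idea concrete, here is a minimal sketch of LSB correction on synthetic integers, assuming every sample is an exact multiple of the LSB. The helper names `lsb_correct` and `lsb_restore` are hypothetical and are not part of the benchmark scripts. Real recordings may also need an offset removed before the division, which this sketch does not handle.

```python
import random
from array import array

LSB = 12  # step between adjacent sample values (the NP1 value quoted in the text)


def lsb_correct(samples, lsb):
    """Divide integer samples by the LSB. Refuses inputs that are not exact
    multiples, because the division would then lose information."""
    if any(s % lsb for s in samples):
        raise ValueError("samples are not all multiples of the LSB")
    return array("h", (s // lsb for s in samples))


def lsb_restore(corrected, lsb):
    """Invert the correction by multiplying back by the LSB."""
    return array("h", (s * lsb for s in corrected))


# Synthetic int16 samples (NOT real ephys data): multiples of the LSB
random.seed(0)
raw = array("h", (LSB * random.randint(-100, 100) for _ in range(1000)))

corrected = lsb_correct(raw, LSB)
restored = lsb_restore(corrected, LSB)

print("raw range:      ", min(raw), max(raw))
print("corrected range:", min(corrected), max(corrected))
print("round trip is lossless:", restored == raw)
```

The corrected values span a range that is `LSB` times smaller, which is why a subsequent compressor may have an easier job, while the round trip recovers the original samples exactly.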


In [None]:
res_aind = res.query("lsb != 'none'")

### Figure 1

In [None]:
fig_lsb_all, axs = plt.subplots(ncols=3, figsize=figsize_rect)

color = "Blues"

sns.boxenplot(data=res_aind, x="lsb", y="CR", ax=axs[0], palette=color)
sns.boxenplot(data=res_aind, x="lsb", y="cspeed_xrt", ax=axs[1], palette=color)
sns.boxenplot(data=res_aind, x="lsb", y="dspeed10s_xrt", ax=axs[2], palette=color)

axs[0].set_xlabel("")
axs[2].set_xlabel("")
axs[1].set_xlabel("LSB correction")

axs[0].set_xticklabels(["off", "on"])
axs[1].set_xticklabels(["off", "on"])
axs[2].set_xticklabels(["off", "on"])

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

fig_lsb_all.subplots_adjust(wspace=0.5)
prettify_axes(axs)

In [None]:
test_columns=["CR", "cspeed_xrt", "dspeed10s_xrt"]
rest_lsb = stat_test(res_aind, column_group_by="lsb", 
                     test_columns=test_columns, verbose=True)

In [None]:
if save_fig:
    fig_lsb_all.savefig(fig_folder / "lsb_corr.pdf")

In [None]:
# keep every entry that is not flagged as uncorrected (lsb != 'false')
res_lsb = res.query("lsb != 'false'")
print(len(res_lsb))

# Chunk size does not affect compression

The first compression parameter that we analyze is the chunk duration, i.e., the size of each block in time. Figure 2 displays the distribution of compression ratios (A), compression speeds (B), and decompression speeds (C) for the three chunk durations: 0.1 s, 1 s, and 10 s. We found no significant differences in compression ratio and decompression speed, while compression speed was significantly higher for chunks of 1 s (p-values: 0.1*vs*1<1e-10; 1*vs*10<1e-7), but with small effect sizes (effect sizes: 0.1*vs*1=0.28; 1*vs*10=0.19). Given this minor but significant increase in compression speed, and given that 1 s chunks are commonly used for parallel processing (e.g., the default in `SpikeInterface`), we restricted further analysis to chunk durations of 1 s (1200 data points).
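
As a rough illustration of what a chunk duration means in samples, the sketch below converts a duration into a per-chunk sample count and a number of chunks. The 30 kHz sampling rate, the 10 s recording length, and the helper name `chunk_layout` are assumptions made for this example only.

```python
import math


def chunk_layout(recording_s, chunk_s, sampling_rate_hz):
    """Return (samples per chunk, number of chunks) for a recording split
    into fixed-duration chunks; the last chunk may be shorter."""
    samples_per_chunk = round(chunk_s * sampling_rate_hz)
    total_samples = round(recording_s * sampling_rate_hz)
    n_chunks = math.ceil(total_samples / samples_per_chunk)
    return samples_per_chunk, n_chunks


SAMPLING_RATE_HZ = 30_000  # assumed sampling rate
RECORDING_S = 10           # assumed recording length

for chunk_s in (0.1, 1, 10):
    samples, n_chunks = chunk_layout(RECORDING_S, chunk_s, SAMPLING_RATE_HZ)
    print(f"chunk {chunk_s} s -> {samples} samples per chunk, {n_chunks} chunks")
```

Shorter chunks mean more, smaller blocks handed to the compressor, which is the trade-off the three durations probe.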

### Figure 2

In [None]:
fig_ch, axs = plt.subplots(ncols=3, nrows=1, figsize=figsize_rect)

color = "Oranges"

chunk_order = ["0.1s", "1s", "10s"]

sns.boxenplot(data=res_lsb, x="chunk_duration", y="CR", ax=axs[0], palette=color,
              order=chunk_order)
sns.boxenplot(data=res_lsb, x="chunk_duration", y="cspeed_xrt", ax=axs[1], palette=color,
              order=chunk_order)
sns.boxenplot(data=res_lsb, x="chunk_duration", y="dspeed10s_xrt", ax=axs[2], palette=color,
              order=chunk_order)

axs[0].set_xlabel("")
axs[1].set_xlabel("Chunk duration (s)")
axs[2].set_xlabel("")

axs[0].set_xticklabels(["0.1", "1", "10"])
axs[1].set_xticklabels(["0.1", "1", "10"])
axs[2].set_xticklabels(["0.1", "1", "10"])

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

fig_ch.subplots_adjust(wspace=0.5)
prettify_axes(axs)

In [None]:
rest_chunk = stat_test(res_lsb, column_group_by="chunk_duration", 
                       test_columns=test_columns, verbose=True)

In [None]:
if save_fig:
    fig_ch.savefig(fig_folder / "chunk_duration.pdf")

In [None]:
res_chunk = res_lsb.query("chunk_duration == '1s'")

## Differences between Neuropixels 1.0 and 2.0 probes

As mentioned in the Methods section, NP1 and NP2 acquisition systems differ in ADC depth (10 for NP1, 14 for NP2) and in the acquired signals, which are split into *ap* and *lf* bands in NP1, while they are wide-band in NP2. These key differences are likely to result in different compressibility between the two data sources.
Figure 3 shows the three compression metrics for the two probes. While there are no significant differences in compression and decompression speeds, compression ratio is significantly higher for NP1 than for NP2 (p-value<1e-10, effect size=0.77), confirming the higher compressibility of NP1 data, with its reduced frequency band and lower ADC depth (see Table 1). In the following sections, we therefore analyze the results from the two probes separately.
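
The p-values and effect sizes in this notebook come from `stat_test` in `utils`, whose implementation is not shown here. Purely as an illustration of what an effect size can measure, the sketch below computes Cohen's d (difference of means divided by the pooled standard deviation). This is one common definition and may well differ from the measure `stat_test` reports; the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d: difference of the means divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Made-up compression ratios, for illustration only
cr_group_a = [3.1, 3.3, 3.2, 3.4]
cr_group_b = [2.2, 2.4, 2.3, 2.5]

print(f"Cohen's d = {cohens_d(cr_group_a, cr_group_b):.2f}")
```

An effect size complements the p-value: with thousands of observations even tiny differences become significant, so the magnitude of the difference is what guides the choices made in the following sections.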

### Figure 3

In [None]:
fig_probes, axs = plt.subplots(ncols=3, nrows=1, figsize=figsize_rect)
probe_order = ["NP1", "NP2"]

color = "Purples"

sns.boxenplot(data=res_chunk, x="probe", y="CR", ax=axs[0], palette=color,
              order=probe_order)
sns.boxenplot(data=res_chunk, x="probe", y="cspeed_xrt", ax=axs[1], palette=color,
              order=probe_order)
sns.boxenplot(data=res_chunk, x="probe", y="dspeed10s_xrt", ax=axs[2], palette=color,
              order=probe_order)

axs[0].set_xlabel("")
axs[1].set_xlabel("Probe type")
axs[2].set_xlabel("")

axs[0].set_xticklabels(["NP1", "NP2"])
axs[1].set_xticklabels(["NP1", "NP2"])
axs[2].set_xticklabels(["NP1", "NP2"])

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

fig_probes.subplots_adjust(wspace=0.5)
prettify_axes(axs)

In [None]:
if save_fig:
    fig_probes.savefig(fig_folder / "probes.pdf")

In [None]:
rest_probe = stat_test(res_chunk, column_group_by="probe", 
                       test_columns=test_columns, verbose=True)

## Audio codecs achieve better compression ratio than general compressors 

We start by comparing compression options readily available via the Blosc meta-compressor in ZARR.

### What is the best built-in option?

In [None]:
res_zarr = res_chunk.query(f"compressor_type != 'audio'")

In [None]:
res_zarr_np1 = res_zarr.query("probe == 'NP1'")
res_zarr_np2 = res_zarr.query("probe == 'NP2'")

In [None]:
fig_zarr_np1, axs = plt.subplots(ncols=3, nrows=1, figsize=figsize_rect)

sns.boxenplot(data=res_zarr_np1, x="compressor", y="CR", ax=axs[0], palette=palette)
sns.boxenplot(data=res_zarr_np1, x="compressor", y="cspeed_xrt", ax=axs[1], palette=palette)
sns.boxenplot(data=res_zarr_np1, x="compressor", y="dspeed10s_xrt", ax=axs[2], palette=palette)

axs[0].set_xlabel("")
axs[1].set_xlabel("")
axs[2].set_xlabel("")

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

axs[0].set_xticklabels(axs[0].get_xticklabels(), rotation=90)
axs[1].set_xticklabels(axs[1].get_xticklabels(), rotation=90)
axs[2].set_xticklabels(axs[2].get_xticklabels(), rotation=90)

fig_zarr_np1.subplots_adjust(wspace=0.5, bottom=0.2)
fig_zarr_np1.suptitle("NP1", fontsize=20)
prettify_axes(axs)

fig_zarr_np2, axs = plt.subplots(ncols=3, nrows=1, figsize=figsize_rect)

sns.boxenplot(data=res_zarr_np2, x="compressor", y="CR", ax=axs[0], palette=palette)
sns.boxenplot(data=res_zarr_np2, x="compressor", y="cspeed_xrt", ax=axs[1], palette=palette)
sns.boxenplot(data=res_zarr_np2, x="compressor", y="dspeed10s_xrt", ax=axs[2], palette=palette)

axs[0].set_xlabel("")
axs[1].set_xlabel("Compressor")
axs[2].set_xlabel("")

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

axs[0].set_xticklabels(axs[0].get_xticklabels(), rotation=90)
axs[1].set_xticklabels(axs[1].get_xticklabels(), rotation=90)
axs[2].set_xticklabels(axs[2].get_xticklabels(), rotation=90)

fig_zarr_np2.subplots_adjust(wspace=0.5, bottom=0.25)
fig_zarr_np2.suptitle("NP2", fontsize=20)
prettify_axes(axs)


In [None]:
print("NP1")
res_np1_gb_comp = res_zarr_np1.groupby("compressor")
compressors = np.unique(res_zarr_np1.compressor)
print("\ncompression ratio\n")
for compressor in compressors:
    med = np.round(res_np1_gb_comp['CR'].median()[compressor], 2)
    mad = np.round(res_np1_gb_comp['CR'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
print("\ncompression speed\n")
for compressor in compressors:
    med = np.round(res_np1_gb_comp['cspeed_xrt'].median()[compressor], 2)
    mad = np.round(res_np1_gb_comp['cspeed_xrt'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
print("\ndecompression speed\n")
for compressor in compressors:
    med = np.round(res_np1_gb_comp['dspeed10s_xrt'].median()[compressor], 2)
    mad = np.round(res_np1_gb_comp['dspeed10s_xrt'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
    
print("\n\nNP2")
res_np2_gb_comp = res_zarr_np2.groupby("compressor")
print("\ncompression ratio\n")
for compressor in compressors:
    med = np.round(res_np2_gb_comp['CR'].median()[compressor], 2)
    mad = np.round(res_np2_gb_comp['CR'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
print("\ncompression speed\n")
for compressor in compressors:
    med = np.round(res_np2_gb_comp['cspeed_xrt'].median()[compressor], 2)
    mad = np.round(res_np2_gb_comp['cspeed_xrt'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
print("\ndecompression speed\n")
for compressor in compressors:
    med = np.round(res_np2_gb_comp['dspeed10s_xrt'].median()[compressor], 2)
    mad = np.round(res_np2_gb_comp['dspeed10s_xrt'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")

In [None]:
if save_fig:
    fig_zarr_np1.savefig(fig_folder / "zarr_comp_np1.pdf")
    fig_zarr_np2.savefig(fig_folder / "zarr_comp_np2.pdf")    

In [None]:
g_blosc_np1 = sns.relplot(data=res_zarr_np1.query("compressor_type == 'blosc'"), 
                          x="CR", y="dspeed10s_xrt", hue="compressor", 
                          style="shuffle", size="level", size_order=["high", "medium", "low"],
                          palette=palette)
fig_blosc_np1 = g_blosc_np1.figure
ax = g_blosc_np1.ax
ax.set_xlabel("Compression Ratio")
ax.set_ylabel("Decompression speed (xRT)")
prettify_axes([ax])

g_blosc_np2 = sns.relplot(data=res_zarr_np2.query("compressor_type == 'blosc'"), 
                          x="CR", y="dspeed10s_xrt", hue="compressor", 
                          style="shuffle", size="level", size_order=["high", "medium", "low"],
                          palette=palette)
fig_blosc_np2 = g_blosc_np2.figure
ax = g_blosc_np2.ax
ax.set_xlabel("Compression Ratio")
ax.set_ylabel("Decompression speed (xRT)")
prettify_axes([ax])

g_nc_np1 = sns.relplot(data=res_zarr_np1.query("compressor_type == 'numcodecs'"), 
                       x="CR", y="dspeed10s_xrt", hue="compressor", 
                       style="shuffle", size="level", size_order=["high", "medium", "low"],
                       palette=palette)
fig_nc_np1 = g_nc_np1.figure
ax = g_nc_np1.ax
ax.set_xlabel("Compression Ratio")
ax.set_ylabel("Decompression speed (xRT)")
prettify_axes([ax])

g_nc_np2 = sns.relplot(data=res_zarr_np2.query("compressor_type == 'numcodecs'"), 
                       x="CR", y="dspeed10s_xrt", hue="compressor", 
                       style="shuffle", size="level", size_order=["high", "medium", "low"],
                       palette=palette)
fig_nc_np2 = g_nc_np2.figure
ax = g_nc_np2.ax
ax.set_xlabel("Compression Ratio")
ax.set_ylabel("Decompression speed (xRT)")
prettify_axes([ax])

In [None]:
if save_fig:
    fig_blosc_np1.savefig(fig_folder / "rel_blosc_np1.pdf")
    fig_blosc_np2.savefig(fig_folder / "rel_blosc_np2.pdf")
    fig_nc_np1.savefig(fig_folder / "rel_numcodecs_np1.pdf")
    fig_nc_np2.savefig(fig_folder / "rel_numcodecs_np2.pdf")

### Effect of shuffling options

The BLOSC meta-compressor provides two shuffling options:

- byte shuffle (shuffle)
- bit shuffle

A byte shuffle is also available via the `numcodecs.Shuffle` for other non-blosc codecs.
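
To illustrate what a byte shuffle does, the sketch below regroups the bytes of fixed-size items so that all first bytes come first, then all second bytes, and so on. It is a pure-Python illustration of the idea, not the Blosc or `numcodecs` implementation, and the function names are hypothetical.

```python
import struct


def byte_shuffle(buf: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    if len(buf) % itemsize:
        raise ValueError("buffer length must be a multiple of itemsize")
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def byte_unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle by interleaving the byte planes again."""
    n_items = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n_items:(i + 1) * n_items]
    return bytes(out)


# Four little-endian int16 values: 01 00 02 00 03 00 04 00
raw = struct.pack("<4h", 1, 2, 3, 4)
shuffled = byte_shuffle(raw, itemsize=2)

print(raw.hex(" "))
print(shuffled.hex(" "))  # low bytes first, then the (all-zero) high bytes
print("round trip ok:", byte_unshuffle(shuffled, itemsize=2) == raw)
```

For small-amplitude integer signals the high bytes are often nearly constant, so grouping them produces long runs that a general-purpose compressor can exploit. Bit shuffling applies the same idea at the level of individual bits.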

In [None]:
color = "YlGn"

fig_sh_np1, axs = plt.subplots(ncols=3, nrows=2, figsize=figsize_square)
res_blosc = res_zarr_np1.query("compressor_type == 'blosc'")
sns.boxenplot(data=res_blosc, x="compressor", y="CR", hue="shuffle", ax=axs[0, 0],
              hue_order=["no", "byte", "bit"], palette=color)
sns.boxenplot(data=res_blosc, x="compressor", y="cspeed_xrt", hue="shuffle", ax=axs[0, 1],
              hue_order=["no", "byte", "bit"], palette=color)
sns.boxenplot(data=res_blosc, x="compressor", y="dspeed10s_xrt", hue="shuffle", ax=axs[0, 2],
              hue_order=["no", "byte", "bit"], palette=color)
axs[0, 0].set_xticklabels(axs[0, 0].get_xticklabels(), rotation=45)
axs[0, 1].set_xticklabels(axs[0, 1].get_xticklabels(), rotation=45)
axs[0, 2].set_xticklabels(axs[0, 2].get_xticklabels(), rotation=45)
axs[0, 0].set_xlabel("")
axs[0, 1].set_xlabel("")
axs[0, 2].set_xlabel("")
axs[0, 0].set_ylabel("Compression Ratio")
axs[0, 1].set_ylabel("Compression speed (xRT)")
axs[0, 2].set_ylabel("Decompression speed (xRT)")

res_nc = res_zarr_np1.query("compressor_type == 'numcodecs'")
sns.boxenplot(data=res_nc, x="compressor", y="CR", hue="shuffle", ax=axs[1, 0], palette=color)
sns.boxenplot(data=res_nc, x="compressor", y="cspeed_xrt", hue="shuffle", ax=axs[1, 1], palette=color)
sns.boxenplot(data=res_nc, x="compressor", y="dspeed10s_xrt", hue="shuffle", ax=axs[1, 2], palette=color)
axs[1, 0].set_xticklabels(axs[1, 0].get_xticklabels(), rotation=45)
axs[1, 1].set_xticklabels(axs[1, 1].get_xticklabels(), rotation=45)
axs[1, 2].set_xticklabels(axs[1, 2].get_xticklabels(), rotation=45)
axs[1, 0].set_ylabel("Compression Ratio")
axs[1, 1].set_ylabel("Compression speed (xRT)")
axs[1, 2].set_ylabel("Decompression speed (xRT)")
axs[1, 0].set_xlabel("")
axs[1, 2].set_xlabel("")

for ax in axs.flatten()[1:]:
    ax.get_legend().remove()

prettify_axes(axs)
fig_sh_np1.subplots_adjust(wspace=0.3, hspace=0.3)
fig_sh_np1.suptitle("NP1", fontsize=20)


fig_sh_np2, axs = plt.subplots(ncols=3, nrows=2, figsize=figsize_square)
res_blosc = res_zarr_np2.query("compressor_type == 'blosc'")
sns.boxenplot(data=res_blosc, x="compressor", y="CR", hue="shuffle", ax=axs[0, 0],
              hue_order=["no", "byte", "bit"], palette=color)
sns.boxenplot(data=res_blosc, x="compressor", y="cspeed_xrt", hue="shuffle", ax=axs[0, 1],
              hue_order=["no", "byte", "bit"], palette=color)
sns.boxenplot(data=res_blosc, x="compressor", y="dspeed10s_xrt", hue="shuffle", ax=axs[0, 2],
              hue_order=["no", "byte", "bit"], palette=color)
axs[0, 0].set_xticklabels(axs[0, 0].get_xticklabels(), rotation=45)
axs[0, 1].set_xticklabels(axs[0, 1].get_xticklabels(), rotation=45)
axs[0, 2].set_xticklabels(axs[0, 2].get_xticklabels(), rotation=45)
axs[0, 0].set_xlabel("")
axs[0, 1].set_xlabel("")
axs[0, 2].set_xlabel("")
axs[0, 0].set_ylabel("Compression Ratio")
axs[0, 1].set_ylabel("Compression speed (xRT)")
axs[0, 2].set_ylabel("Decompression speed (xRT)")

res_nc = res_zarr_np2.query("compressor_type == 'numcodecs'")
sns.boxenplot(data=res_nc, x="compressor", y="CR", hue="shuffle", ax=axs[1, 0], palette=color)
sns.boxenplot(data=res_nc, x="compressor", y="cspeed_xrt", hue="shuffle", ax=axs[1, 1], palette=color)
sns.boxenplot(data=res_nc, x="compressor", y="dspeed10s_xrt", hue="shuffle", ax=axs[1, 2], palette=color)
axs[1, 0].set_xticklabels(axs[1, 0].get_xticklabels(), rotation=45)
axs[1, 1].set_xticklabels(axs[1, 1].get_xticklabels(), rotation=45)
axs[1, 2].set_xticklabels(axs[1, 2].get_xticklabels(), rotation=45)
axs[1, 0].set_ylabel("Compression Ratio")
axs[1, 1].set_ylabel("Compression speed (xRT)")
axs[1, 2].set_ylabel("Decompression speed (xRT)")
axs[1, 0].set_xlabel("")
axs[1, 2].set_xlabel("")

for ax in axs.flatten()[1:]:
    ax.get_legend().remove()

prettify_axes(axs)
fig_sh_np2.subplots_adjust(wspace=0.3, hspace=0.3)
fig_sh_np2.suptitle("NP2", fontsize=20)

In [None]:
if save_fig:
    fig_sh_np1.savefig(fig_folder / "shuffling_np1.pdf")    
    fig_sh_np2.savefig(fig_folder / "shuffling_np2.pdf")

**COMMENT**

In general, pre-shuffling the data (byte or bit) improves compression performance with respect to no pre-shuffling.
The `BIT shuffle` available in blosc seems to be the best option, as it provides better CR, compression speed, and decompression speed.

Let's now confirm that the level does its job...

### Effect of compression level

In [None]:
color = 'afmhot'

fig_lev_np1, axs = plt.subplots(ncols=3, nrows=1, figsize=figsize_rect)
sns.boxenplot(data=res_zarr_np1, x="compressor", y="CR", hue="level", ax=axs[0], palette=color)
sns.boxenplot(data=res_zarr_np1, x="compressor", y="cspeed_xrt", hue="level", ax=axs[1], palette=color)
sns.boxenplot(data=res_zarr_np1, x="compressor", y="dspeed10s_xrt", hue="level", ax=axs[2], palette=color)

axs[0].set_xlabel("")
axs[1].set_xlabel("")
axs[2].set_xlabel("")

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

axs[0].set_xticklabels(axs[0].get_xticklabels(), rotation=90)
axs[1].set_xticklabels(axs[1].get_xticklabels(), rotation=90)
axs[2].set_xticklabels(axs[2].get_xticklabels(), rotation=90)

for ax in axs.flatten()[:-1]:
    ax.get_legend().remove()

prettify_axes(axs)
fig_lev_np1.subplots_adjust(wspace=0.5, bottom=0.2)
fig_lev_np1.suptitle("NP1", fontsize=20)


fig_lev_np2, axs = plt.subplots(ncols=3, nrows=1, figsize=figsize_rect)
sns.boxenplot(data=res_zarr_np2, x="compressor", y="CR", hue="level", ax=axs[0], palette=color)
sns.boxenplot(data=res_zarr_np2, x="compressor", y="cspeed_xrt", hue="level", ax=axs[1], palette=color)
sns.boxenplot(data=res_zarr_np2, x="compressor", y="dspeed10s_xrt", hue="level", ax=axs[2], palette=color)

axs[0].set_xlabel("")
axs[1].set_xlabel("Compressor")
axs[2].set_xlabel("")

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

axs[0].set_xticklabels(axs[0].get_xticklabels(), rotation=90)
axs[1].set_xticklabels(axs[1].get_xticklabels(), rotation=90)
axs[2].set_xticklabels(axs[2].get_xticklabels(), rotation=90)

for ax in axs.flatten()[:-1]:
    ax.get_legend().remove()

prettify_axes(axs)
fig_lev_np2.subplots_adjust(wspace=0.5, bottom=0.25)
fig_lev_np2.suptitle("NP2", fontsize=20)


In [None]:
if save_fig:
    fig_lev_np1.savefig(fig_folder / "levels_np1.pdf")
    fig_lev_np2.savefig(fig_folder / "levels_np2.pdf")    

In [None]:
color = "YlGn"

fig_zlib, ax = plt.subplots(figsize=figsize_rect)

res_zlib = res_chunk.query("compressor == 'blosc-zlib'")

sns.boxplot(data=res_zlib, x="level", y="CR", hue="shuffle", ax=ax, palette=color)

ax.set_xlabel("Compression Level")
ax.set_ylabel("Compression Ratio")
prettify_axes(ax)

In [None]:
if save_fig:
    fig_zlib.savefig(fig_folder / "zlib_level.pdf")

**COMMENT**

For most compressors, compression level does its job (increasing levels yield increasing CR). Strangely, for `blosc-zlib` the level seems to have the opposite effect. Of course, the higher the level the slower the compression speed. Decompression speed doesn't seem to be affected (this is not the case for the `numcodecs.Zlib` wrapper, that doesn't play well with BIT shuffling).

For the final analysis, let's pick the `high` level as the best option.
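
The trade-off between compression level, CR, and speed can be explored with a small stand-alone experiment. The sketch below uses the standard-library `zlib` module on a synthetic signal, so it is neither the Blosc nor the `numcodecs` code path used in the benchmark; the helper `benchmark_levels` and the random-walk signal are assumptions for illustration, and the numbers it prints are not comparable to the figures above.

```python
import random
import time
import zlib
from array import array


def benchmark_levels(payload: bytes, levels=(1, 5, 9)):
    """Compress the payload at each level and report CR and elapsed time."""
    results = {}
    for level in levels:
        t0 = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - t0
        if zlib.decompress(compressed) != payload:
            raise RuntimeError("round trip failed")  # lossless check
        results[level] = {"CR": len(payload) / len(compressed), "seconds": elapsed}
    return results


# Synthetic int16 signal (a bounded random walk), NOT real ephys data
random.seed(0)
value, samples = 0, []
for _ in range(50_000):
    value = max(-2000, min(2000, value + random.randint(-5, 5)))
    samples.append(value)
payload = array("h", samples).tobytes()

results = benchmark_levels(payload)
for level, stats in results.items():
    print(f"level {level}: CR={stats['CR']:.2f}, time={stats['seconds'] * 1e3:.1f} ms")
```

Running the same loop with and without a shuffle step is a quick way to check whether an unexpected level dependence, such as the one seen for `blosc-zlib`, comes from the codec itself or from its interaction with the pre-filter.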

In [None]:
selected_level = "high"
res_level = res_zarr.query(f"level == '{selected_level}'")

As a final step, we select `blosc-zstd` (bit shuffle) and `lzma` (byte shuffle) as the best options:

In [None]:
selected_compressors = ["blosc-zstd", "lzma"]
res_best_zstd = res_level.query(f"compressor == 'blosc-zstd' and shuffle == 'bit'")
res_best_lzma = res_level.query(f"compressor == 'lzma' and shuffle == 'byte'")
res_best_zarr = pd.concat([res_best_zstd, res_best_lzma])

# AUDIO compression

Lossless audio codecs could provide a good alternative to general-purpose compression algorithms because: 

- Audio signals are also timeseries 
- Frequency range is similar
- Multiple channels are correlated

We tried FLAC and WavPack. With pyFLAC, FLAC supports up to 2 channels, so we need to either:
- flatten multi-channel signals into a single channel (channel_chunk_size=-1)
- save blocks with 2 channels (channel_chunk_size=2)
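
The two layouts can be pictured with a tiny example. The sketch below is only an illustration on nested lists: the function names are hypothetical, and the sample ordering used when flattening is an assumption, since the notebook does not state how the codec wrapper orders the samples.

```python
def flatten_channels(frames):
    """Turn a list of per-sample frames [ch0, ch1, ...] into one
    single-channel stream (sample-major order, assumed here)."""
    return [value for frame in frames for value in frame]


def split_channel_blocks(frames, block_size=2):
    """Split the channels into consecutive blocks of `block_size` channels,
    each block keeping every time sample."""
    n_channels = len(frames[0])
    return [
        [frame[start:start + block_size] for frame in frames]
        for start in range(0, n_channels, block_size)
    ]


# Two time samples of a 4-channel signal (made-up values)
frames = [[1, 2, 3, 4],
          [5, 6, 7, 8]]

print(flatten_channels(frames))      # one mono stream
print(split_channel_blocks(frames))  # two "stereo" blocks
```

In the first layout the encoder sees a single long channel, while in the second it sees several two-channel blocks and can, in principle, exploit the correlation within each pair.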

In [None]:
res_audio = res_chunk.query("compressor_type == 'audio'")

No shuffling is available in FLAC, so we just select the same chunk duration for the comparison. 
WavPack doesn't have a compression level.

### FLAC: flattening or not?

In [None]:
res_flac = res_audio.query("compressor == 'flac'")

In [None]:
fig_flac, axs = plt.subplots(ncols=3, figsize=figsize_rect)

color = "YlOrBr"

sns.boxenplot(data=res_flac, x="channel_chunk_size", y="CR", ax=axs[0], palette=color)
sns.boxenplot(data=res_flac, x="channel_chunk_size", y="cspeed_xrt", ax=axs[1], palette=color)
sns.boxenplot(data=res_flac, x="channel_chunk_size", y="dspeed10s_xrt", ax=axs[2], palette=color)

axs[0].set_xlabel("")
axs[1].set_xlabel("Channel block size")
axs[2].set_xlabel("")

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")


prettify_axes(axs)
fig_flac.subplots_adjust(wspace=0.5)


In [None]:
if save_fig:
    fig_flac.savefig(fig_folder / "flac_chunks.pdf")

In [None]:
rest_flac = stat_test(res_flac, column_group_by="channel_chunk_size", 
                      test_columns=test_columns, verbose=True)

**COMMENT**

Flattening or using "stereo" channels doesn't seem to make a difference for CR. Actually, compression and decompression speeds are slightly faster when stereo mode is enabled!

In [None]:
res_wavpack = res_audio.query("compressor == 'wavpack'")
res_flac_2 = res_audio.query("channel_chunk_size == 2")  # wavpack has channel_chunk_size == -1
res_audio_2 = pd.concat([res_flac_2, res_wavpack])

In [None]:
res_audio_np1 = res_audio_2.query("probe == 'NP1'")
res_audio_np2 = res_audio_2.query("probe == 'NP2'")

For both NP1 and NP2, the best zarr compressor is **Blosc-zstd - level 9 - chunk 1s/10s - shuffle BIT**

In [None]:
fig_audio_np1, axs = plt.subplots(ncols=3, nrows=1, figsize=figsize_rect)

sns.boxplot(data=res_audio_np1, x="level", y="CR", hue="compressor", ax=axs[0], palette=palette)
sns.boxplot(data=res_audio_np1, x="level", y="cspeed_xrt", hue="compressor", ax=axs[1], palette=palette)
sns.boxplot(data=res_audio_np1, x="level", y="dspeed10s_xrt", hue="compressor", ax=axs[2], palette=palette)

axs[0].set_xlabel("")
axs[1].set_xlabel("")
axs[2].set_xlabel("")

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

axs[0].set_xticklabels(axs[0].get_xticklabels(), rotation=45)
axs[1].set_xticklabels(axs[1].get_xticklabels(), rotation=45)
axs[2].set_xticklabels(axs[2].get_xticklabels(), rotation=45)

for ax in axs[1:]:
    ax.get_legend().remove()

fig_audio_np1.subplots_adjust(wspace=0.5, bottom=0.2)
fig_audio_np1.suptitle("NP1", fontsize=20)
prettify_axes(axs)

fig_audio_np2, axs = plt.subplots(ncols=3, nrows=1, figsize=figsize_rect)

sns.boxplot(data=res_audio_np2, x="level", y="CR", hue="compressor", ax=axs[0], palette=palette)
sns.boxplot(data=res_audio_np2, x="level", y="cspeed_xrt", hue="compressor", ax=axs[1], palette=palette)
sns.boxplot(data=res_audio_np2, x="level", y="dspeed10s_xrt", hue="compressor", ax=axs[2], palette=palette)

axs[0].set_xlabel("")
axs[1].set_xlabel("Compression level")
axs[2].set_xlabel("")

axs[0].set_ylabel("Compression Ratio")
axs[1].set_ylabel("Compression speed (xRT)")
axs[2].set_ylabel("Decompression speed (xRT)")

axs[0].set_xticklabels(axs[0].get_xticklabels(), rotation=45)
axs[1].set_xticklabels(axs[1].get_xticklabels(), rotation=45)
axs[2].set_xticklabels(axs[2].get_xticklabels(), rotation=45)

for ax in axs[1:]:
    ax.get_legend().remove()

fig_audio_np2.subplots_adjust(wspace=0.5, bottom=0.25)
fig_audio_np2.suptitle("NP2", fontsize=20)
prettify_axes(axs)


In [None]:
if save_fig:
    fig_audio_np1.savefig(fig_folder / "audio_np1.pdf")
    fig_audio_np2.savefig(fig_folder / "audio_np2.pdf")    

### Compare all

There is a slight increase in compression performance with compression level from low to medium, but not between medium and high. For FLAC, compression speed does not seem to be affected by compression level.
However, for WavPack, decompression speed is a bit slow (~0.5 xRT), probably due to the sub-optimal implementation.

In [None]:
selected_level = "medium"
res_best_audio = res_audio_2.query(f"level == '{selected_level}'")

In [None]:
res_best_all = pd.concat([res_best_zarr, res_best_audio])

In [None]:
res_best_np1 = res_best_all.query("probe == 'NP1'")
res_best_np2 = res_best_all.query("probe == 'NP2'")

In [None]:
fig_best_all, axs = plt.subplots(ncols=3, nrows=2, figsize=figsize_square)

plot_fun = sns.boxplot

plot_fun(data=res_best_np1, x="compressor", 
         y="CR", ax=axs[0, 0], palette=palette)
plot_fun(data=res_best_np1, x="compressor", 
         y="cspeed_xrt", ax=axs[0, 1], palette=palette)
plot_fun(data=res_best_np1, x="compressor", 
         y="dspeed10s_xrt", ax=axs[0, 2], palette=palette)

plot_fun(data=res_best_np2, x="compressor", 
         y="CR", ax=axs[1, 0], palette=palette)
plot_fun(data=res_best_np2, x="compressor", 
         y="cspeed_xrt", ax=axs[1, 1], palette=palette)
plot_fun(data=res_best_np2, x="compressor", 
         y="dspeed10s_xrt",  ax=axs[1, 2], palette=palette)

axs[0, 1].set_title("NP1", fontsize=25)
axs[1, 1].set_title("NP2", fontsize=25)


axs[0, 0].set_xticklabels(axs[0, 0].get_xticklabels(), rotation=45)
axs[0, 1].set_xticklabels(axs[0, 1].get_xticklabels(), rotation=45)
axs[0, 2].set_xticklabels(axs[0, 2].get_xticklabels(), rotation=45)

axs[1, 0].set_xticklabels(axs[1, 0].get_xticklabels(), rotation=45)
axs[1, 1].set_xticklabels(axs[1, 1].get_xticklabels(), rotation=45)
axs[1, 2].set_xticklabels(axs[1, 2].get_xticklabels(), rotation=45)

axs[0, 0].set_xlabel("")
axs[1, 0].set_xlabel("")
axs[0, 1].set_xlabel("")
axs[1, 1].set_xlabel("compressor")
axs[0, 2].set_xlabel("")
axs[1, 2].set_xlabel("")

axs[0, 0].set_ylabel("Compression Ratio")
axs[0, 1].set_ylabel("Compression speed (xRT)")
axs[0, 2].set_ylabel("Decompression speed (xRT)")
axs[1, 0].set_ylabel("Compression Ratio")
axs[1, 1].set_ylabel("Compression speed (xRT)")
axs[1, 2].set_ylabel("Decompression speed (xRT)")

axs[0, 2].set_yscale("log")
axs[1, 2].set_yscale("log")

prettify_axes(axs)

fig_best_all.subplots_adjust(wspace=0.5, hspace=0.3)

In [None]:
if save_fig:
    fig_best_all.savefig(fig_folder / "best.pdf")

In [None]:
print("NP1")
res_np1_gb_comp = res_best_np1.groupby("compressor")
compressors = np.unique(res_best_np1.compressor)
print("\ncompression ratio\n")
for compressor in compressors:
    med = np.round(res_np1_gb_comp['CR'].median()[compressor], 2)
    mad = np.round(res_np1_gb_comp['CR'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
print("\ncompression speed\n")
for compressor in compressors:
    med = np.round(res_np1_gb_comp['cspeed_xrt'].median()[compressor], 2)
    mad = np.round(res_np1_gb_comp['cspeed_xrt'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
print("\ndecompression speed\n")
for compressor in compressors:
    med = np.round(res_np1_gb_comp['dspeed10s_xrt'].median()[compressor], 2)
    mad = np.round(res_np1_gb_comp['dspeed10s_xrt'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
    
print("\n\nNP2")
res_np2_gb_comp = res_best_np2.groupby("compressor")
print("\ncompression ratio\n")
for compressor in compressors:
    med = np.round(res_np2_gb_comp['CR'].median()[compressor], 2)
    mad = np.round(res_np2_gb_comp['CR'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
print("\ncompression speed\n")
for compressor in compressors:
    med = np.round(res_np2_gb_comp['cspeed_xrt'].median()[compressor], 2)
    mad = np.round(res_np2_gb_comp['cspeed_xrt'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")
print("\ndecompression speed\n")
for compressor in compressors:
    med = np.round(res_np2_gb_comp['dspeed10s_xrt'].median()[compressor], 2)
    mad = np.round(res_np2_gb_comp['dspeed10s_xrt'].mad()[compressor], 2)
    print(f"{compressor}: {med}+-{mad}")

### Conclusion


In terms of CR, FLAC reaches the highest lossless compression for NP1 (CR 3.462, $\sim$29% of the original size), but it is extremely slow to decompress. Note that Zstd, with its CR of 3.13, reduces the size to $\sim$32%. WavPack is somewhere in the middle (CR 3.36, $\sim$30% of the original size).

For NP2, WavPack reduces the data to $\sim$42% of the original size, in contrast to $\sim$53% for Zstd. Decompression is currently a bit slow for WavPack as well (around 5 s to retrieve 10 s of traces), but this can probably be improved by bypassing the temporary wav conversion and binding the WavPack C library directly.
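
The size percentages quoted above follow from the compression ratios, assuming CR is defined as original size divided by compressed size (an assumption consistent with the numbers in this section). A short sketch of the conversion, with hypothetical helper names:

```python
def size_percent(cr: float) -> float:
    """Compressed size as a percentage of the original, given CR = original / compressed."""
    return 100.0 / cr


def cr_from_size_percent(percent: float) -> float:
    """Inverse conversion: compression ratio from the remaining size in percent."""
    return 100.0 / percent


# NP1 compression ratios quoted in the conclusion
for name, cr in [("FLAC", 3.462), ("WavPack", 3.36), ("Zstd", 3.13)]:
    print(f"{name}: CR {cr} -> {size_percent(cr):.1f}% of the original size")

# NP2 sizes quoted in the conclusion, converted back to compression ratios
for name, percent in [("WavPack", 42), ("Zstd", 53)]:
    print(f"{name}: {percent}% of the original size -> CR {cr_from_size_percent(percent):.2f}")
```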