# Caterva - A Compressed and Multidimensional Container For Medium/Big Data 

<img src="static/caterva-logo.png" width=15% style="margin-left:auto; margin-right:auto">



Aleix Alcacer - The Blosc Development Team

# What is Caterva?


<img src="static/caterva.png" width=20% style="margin-left:auto; margin-right:auto">

Caterva is an open source C and Python **library** and a **format** built on top of Blosc2 that implements a compressed, chunked and multidimensional array.

# Why another chunked array?

Although several formats already implement multidimensional, chunked arrays (Zarr and HDF5, among others), Caterva is specifically optimized for **efficient data slicing**.

Thanks to the new features of Blosc2, Caterva can extract data from a compressed array at **very high speed**.

# Main features

* **Double multidimensional partitioning**: fast slicing.
* **Metalayers**: add metadata to your arrays.
* **Type-less**: flexibly define your own data types (i.e. as metalayers).
* **Open source**: https://github.com/Blosc/python-caterva
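
To make the type-less idea concrete, here is a minimal sketch using only the Python standard library. It does not use the Caterva API: the `container` dictionary and its keys (`itemsize`, `buffer`, `meta`, `dtype`) are hypothetical stand-ins for an array that only knows its item size, plus a user-defined metadata layer that records how to interpret the bytes.

```python
import struct

# Hypothetical stand-in for a type-less container (NOT the Caterva API):
# it only knows the item size and holds raw bytes; the meaning of those
# bytes lives in a user-defined metadata entry.
values = [1.5, 2.5, 3.5]
container = {
    "itemsize": 4,
    "buffer": struct.pack("<3f", *values),  # little-endian 32-bit floats
    "meta": {"dtype": b"<f"},               # user-defined layer naming the item format
}

# A reader recovers the typed values by consulting the metadata.
fmt = container["meta"]["dtype"].decode()   # byte order + struct type code
count = len(container["buffer"]) // container["itemsize"]
decoded = list(struct.unpack(f"{fmt[0]}{count}{fmt[1]}", container["buffer"]))
print(decoded)
```

The container itself never needs to understand the data type, so any convention the user can serialize into the metadata would work the same way.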


# How does Caterva achieve this performance?

<br>
<img src="static/zarr.png" width=15% style="margin-left:auto; margin-right:auto">


Other chunking libraries store data in multidimensional **chunks**. This makes extracting slices from compressed data more efficient, since only the chunks containing the slice are decompressed.
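
As an illustration of why chunking helps, the following sketch lists which chunks overlap a requested slice. It is plain Python, not code from any of the libraries mentioned; the function name `chunks_touched` and the 8x8 array with 4x4 chunks are assumptions chosen for the example.

```python
from itertools import product

def chunks_touched(slices, chunkshape):
    """Return the grid indices of every chunk that overlaps the slice.

    slices     -- one (start, stop) pair per dimension, stop exclusive
    chunkshape -- chunk length per dimension
    """
    ranges = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(slices, chunkshape)
    ]
    return list(product(*ranges))

# Illustrative 8x8 array split into 4x4 chunks (a 2x2 chunk grid).
# Reading row 5 across all columns only needs the two chunks in the
# second row of the grid; the other two are never decompressed.
touched = chunks_touched([(5, 6), (0, 8)], chunkshape=(4, 4))
print(touched)  # [(1, 0), (1, 1)]
```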



<img src="static/cat.png" width=15% style="margin-left:auto; margin-right:auto">


Caterva introduces an additional level of partitioning: within each chunk, the data is partitioned again into smaller multidimensional sets called **blocks**.

In this way, Caterva can read blocks individually (and in parallel) instead of whole chunks, which speeds up slice extraction because only the blocks containing the slice are decompressed.
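
A rough way to see the effect is to count how many items must be decompressed when the smallest readable unit is a chunk versus a block. The sketch below is plain Python with made-up sizes (a 512x512 array, 128x128 chunks, 16x16 blocks), not the benchmark configuration, and it assumes that blocks tile each chunk exactly and chunks tile the array exactly.

```python
from math import prod

def partitions_touched(slices, partshape):
    """Count the partitions of shape `partshape` that overlap the slice."""
    return prod(
        (stop - 1) // p - start // p + 1
        for (start, stop), p in zip(slices, partshape)
    )

def items_decompressed(slices, partshape):
    """Items decompressed when whole partitions are the smallest readable unit."""
    return partitions_touched(slices, partshape) * prod(partshape)

# Illustrative sizes only (assumptions, not taken from the benchmark).
chunks = (128, 128)
blocks = (16, 16)              # blocks tile each chunk exactly
row = [(100, 101), (0, 512)]   # one hyperplane of a 512x512 array

chunk_only = items_decompressed(row, chunks)   # 4 chunks  * 16384 items
with_blocks = items_decompressed(row, blocks)  # 32 blocks * 256 items
print(chunk_only, with_blocks)  # 65536 8192
```

In this toy configuration the slice itself holds 512 items, so block-level access decompresses 8 times less data than chunk-level access for the same request.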




# Performance comparison 

Now we will analyze the performance of extracting some hyperplanes from chunked arrays created with Caterva, Zarr and HDF5.

For comparison purposes, all arrays have been compressed using the Blosc compressor.

The chunk shapes have also been optimized for extracting hyperslices along the second dimension.

<table>
    <tr>
        <td>
            <img src="static/dim0.png" width=80% style="margin-left:0; margin-right:auto">
        </td>
        <td>
            <img src="static/dim1.png" width=80% style="margin-left:auto; margin-right:0">
        </td>
    </tr>
    <tr>
        <td style="text-align: center;">
            First dimension
        </td>
        <td style="text-align: center;">
            Second dimension
        </td>
    </tr>
</table>

Thanks to the second level of partitioning, Caterva decompresses less data than the other formats.


Time elapsed extracting a set of hyperslices in each dimension:

<img src="static/performance.png" width=65% style="margin-left:auto; margin-right:auto">

# Thank you very much!

Takeaway: *If you are interested in increasing the performance of array slicing, give Caterva a try!*

To learn more about Caterva, check out our SciPy poster at https://blosc.github.io/caterva-scipy21/

We welcome suggestions at:
* Twitter: https://twitter.com/Blosc2
* GitHub: https://github.com/Blosc/python-caterva
