# NDArray: An N-Dimensional, Compressed Data Container

NDArray objects let users perform different operations on arrays, such as setting, copying, or slicing them. In this section, we will see how to create and manipulate NDArray arrays, which hold both data and metadata. The data is *chunked* and *compressed*; the metadata describes the data itself, as well as the chunking and compression. Chunking and compression are what make NDArray arrays very efficient for working with large data.

In [None]:
import numpy as np
import blosc2

## Creating an array
Let's start by creating a 2D array with 100M elements filled with ``arange``. We can then print out the metadata, which contains information about: the array data (such as ``shape`` and ``dtype``); and how the data is compressed and stored, such as chunk- and block-shapes (``chunks`` and ``blocks``) and compression params (``CParams``). See [here](https://www.blosc.org/python-blosc2/getting_started/overview.html) for an explanation of chunking and blocking.



In [None]:
shape = (10_000, 10_000)
array = blosc2.arange(np.prod(shape), shape=shape)
print(array.info)

The ``cratio`` parameter tells us how effective the compression is, since it gives the ratio between the number of bytes required to store the array in uncompressed and compressed form. Here we require almost 500x less space for the compressed array! Note that all the compression and decompression parameters are set to the default, and ``chunks`` and ``blocks`` have been selected automatically - playing around with them will affect the ``cratio`` (as well as compression and decompression speed).

We can also create an NDArray by compressing a NumPy array:

In [None]:
nparray = np.linspace(0, 100, np.prod(shape), dtype=np.float64).reshape(shape)
b2array = blosc2.asarray(nparray)
print(b2array.info)

or an iterator:

In [None]:
N = 1_000_000
rng = np.random.default_rng()
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(it, dtype="i4,f4,f8", shape=(N,))
print(sa.info)
print(f"first 3 rows of sa: {sa[:3]}")

**Exercise:** Create a 2D NDArray with shape (1000, 10_000), filled with sequential integers, using the `range` iterator. Then use `blosc2.arange` to create the same array, and check that the two arrays are equal. Use the `%time` magic command to time the two operations. What do you notice about the time taken? Why do you think this is happening?

In [None]:
# Code your solution here


## Reading and modifying data
NDArray arrays cannot be read directly, since they are compressed, and so must be decompressed first (to NumPy arrays, which are stored in memory). This can be done for the full array using the ``[:]`` operator, which returns a NumPy array.

In [None]:
temp = array[:]  # This will decompress the full array
type(temp)


However, it is often not necessary (or desirable) to load the whole array into memory. We can quickly read just small parts of an NDArray array into a NumPy array via standard indexing routines.

In [None]:
res1 = array[0]  # get the first row
res2 = array[6:10]  # get a slice of four rows
print(f"Got one row (of shape {res1.shape}) and a slice of shape {res2.shape}.")
print(res1)
print(res2)

We can also modify the data in the array using standard NumPy indexing, with either NumPy or NDArray arrays as the data source. For example, we can set the first row to zeros (using an NDArray array) and the first column to ones (using a NumPy array):

In [None]:
array[0, :] = blosc2.zeros(10000, dtype=array.dtype)
array[:, 0] = np.ones(10000, dtype=array.dtype)
print(array)

Note that ``array`` is still an NDArray array. Let's check that the entries were correctly modified.

In [None]:
print(array[0, 0])
print(array[0, :])
print(array[:, 0])

## Enlarging the array
Existing arrays can be enlarged. This is one operation that is greatly enhanced by the chunking procedure implemented in NDArray arrays.

In [None]:
%time array.resize((10_001, 10_000))
print(array.shape)
array[10_000, :] = 1
array[10_000, :]

In [None]:
%time nparray2 = np.resize(nparray, (10_001, 10_000))

Enlarging a NumPy array is very costly because it requires a full copy of the data: the underlying data is stored contiguously in memory, so new memory must be allocated to hold the extended array, the old data copied into part of it, and the new data written to the remainder.
Enlarging is much faster for NDArray arrays because the data is chunked and the chunks may be stored non-contiguously in memory. The new chunks can simply be written to any available address while the old chunks are left untouched; references to the new chunks are then added to the NDArray container, which is a very quick operation.
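
The chunk-append idea described above can be illustrated with a toy model. This is only a conceptual sketch using the standard library (`zlib` and a plain Python list), with a hypothetical `ToyChunkedArray` class; it is not how Blosc2 lays out its data internally.

```python
import zlib
from array import array as typed_array


class ToyChunkedArray:
    """Toy 1D container: a list of independently compressed chunks (not Blosc2's real layout)."""

    def __init__(self, chunk_len):
        self.chunk_len = chunk_len
        self.chunks = []  # references to compressed chunks, which may live anywhere in memory

    def append_chunk(self, values):
        if len(values) != self.chunk_len:
            raise ValueError("each chunk must hold exactly chunk_len values")
        # Compress only the new chunk; existing chunks are never touched or copied
        raw = typed_array("q", values).tobytes()
        self.chunks.append(zlib.compress(raw))

    def read_chunk(self, i):
        # Decompress a single chunk back into a list of integers
        out = typed_array("q")
        out.frombytes(zlib.decompress(self.chunks[i]))
        return list(out)


toy = ToyChunkedArray(chunk_len=4)
toy.append_chunk([0, 1, 2, 3])
toy.append_chunk([4, 5, 6, 7])

old_ids = [id(c) for c in toy.chunks]  # identities of the existing chunk objects
toy.append_chunk([1, 1, 1, 1])  # "enlarge": just add one more chunk reference

# The original chunks are the very same objects as before the enlargement
print([id(c) for c in toy.chunks[:2]] == old_ids)
print(toy.read_chunk(2))
```

A contiguous buffer, by contrast, would have to be reallocated and copied in full to make room for the extra values.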

You can also shrink the array.

In [None]:
array.resize((9_000, 10_000))
print(array.shape)
print(array[8_999])  # This works
# array[9_000]  # This will raise an exception

## Persistent data
We can use the `save()` method to store the array on disk.  This is very useful when you are working with a large array but do not need to access it often.


In [None]:
array.save("array_tutorial.b2nd", mode="w")  # , contiguous=True)
!ls -lh array_tutorial.b2nd


For arrays, it is usual to use the `.b2nd` extension. Now let's open the saved array and check that the data was saved correctly (decompressing first to be able to compare):

In [None]:
array2 = blosc2.open("array_tutorial.b2nd")
np.array_equal(array2[:], array[:])  # Decompress both and make sure the saved array matches the original

In fact, it is possible to create an NDArray array directly on disk, specifying where it will be stored, without first creating it in memory. We may also specify the compression/decompression and other storage parameters (e.g. ``chunks`` and ``blocks``). For example, a 1000x1000 array filled with the byte string ``b"pepe"`` can be created like this:

In [None]:
array1 = blosc2.full(
    (1000, 1000),
    fill_value=b"pepe",
    chunks=(100, 100),
    blocks=(50, 50),
    urlpath="array1_tutorial.b2nd",
    mode="w",
)
!ls -lh array1_tutorial.b2nd

We can also write directly to disk using the other constructors we saw previously.

In [None]:
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(it, dtype="i4,f4,f8", shape=(N,), urlpath="sa-1M.b2nd", mode="w")
print("First 3 rows of sa:", sa[:3])
b2array = blosc2.asarray(nparray, urlpath="linspace_array.b2nd", mode="w")
print("First 3 rows of b2array:", b2array[:3])

To delete saved data, one may use the ``blosc2.remove_urlpath`` function.

In [None]:
blosc2.remove_urlpath("array_tutorial.b2nd")
blosc2.remove_urlpath("array1_tutorial.b2nd")
blosc2.remove_urlpath("sa-1M.b2nd")
blosc2.remove_urlpath("linspace_array.b2nd")

## Compression params
Let's see how to copy the NDArray data whilst altering the compression parameters. This may be useful in many contexts, for example testing how changing the codec of an existing array affects the compression ratio.

In [None]:
cparams = blosc2.CParams(
    codec=blosc2.Codec.LZ4,
    clevel=9,
    filters=[blosc2.Filter.SHUFFLE],
    filters_meta=[0],
)

array2 = array.copy(chunks=(500, 10_000), blocks=(50, 10_000), cparams=cparams)
print(array2.info)

In [None]:
print(array.info)

In this case the compression ratio is much higher for the original array, since we have changed to a different codec that is optimised for compression speed, not compression ratio. In general there is a tradeoff between the two.

**Exercise**: Do a computation on the two arrays and see how the time taken compares.  For example, you could compute the sum of all the elements in each array with the `sum()` method.

In [None]:
# Your solution here

### Native Blosc2 codecs
Blosc2 supports many standard codecs, since there is no one-size-fits-all compression solution - one codec may be perfect for one context, but quite suboptimal in another.
* ZLIB codec: uses the DEFLATE algorithm, is standard, and works well for images.
* ZSTD codec: similar compression ratio to ZLIB, but faster compression/decompression.
* LZ4 codec: even faster compression/decompression than ZSTD, but with a reduced compression ratio.
* BloscLZ codec: Blosc implementation of the popular LZ algorithms (good for repeated data, e.g. text). Similar tradeoff to LZ4.

Finally, via package extensions to Blosc2, one may access the JPEG2000 family of compression algorithms, which aim for a compromise between compression ratio and image quality; Blosc2 implements GROK (``blosc2-grok``) and OPENHTJ2K (``blosc2-openhtj2k``).

## TreeStore: Endowing your data with a hierarchical structure
With the `TreeStore` class, you can create a hierarchical structure for your data. This is useful when you want to store data in a tree-like format, where each node can have multiple children. The `TreeStore` class allows you to create, read, and modify trees of NDArray arrays.

Let's see an example:

In [None]:
with blosc2.TreeStore("example_tree.b2z", mode="w") as tstore:
    tstore["/data"] = np.array([1, 2, 3])  # numpy array
    tstore["/dir1/data1"] = blosc2.ones((2, 10)) # blosc2 array
    tstore["/dir1/data2"] = blosc2.linspace(0, 1, 10_000_000, shape=(10, 1000, 1000))  # blosc2 array
    tstore.vlmeta["author"] = "blosc2"
    tstore["/dir1"].vlmeta["year"] = 2025
!ls -lh "example_tree.b2z"

This will create a tree structure with a root node and two child nodes:
![Alt text](tree-store-example.png)

Let's explore the tree structure we just created by re-opening the `TreeStore` and printing out a dataset and some metadata.

In [None]:
tstore2 = blosc2.TreeStore("example_tree.b2z", mode="r")
print("/dir1/data1:\n", tstore2["/dir1/data1"][:])
print("root metadata:", tstore2.vlmeta[:])
print("/dir1 metadata:", tstore2["/dir1"].vlmeta[:])

In [None]:
list(tstore2)  # List all keys in the tree

In [None]:
for key, node in tstore2.items():
    print(f"Node: {key}, Data: {node[1] if isinstance(node, blosc2.NDArray) else node.vlmeta[:]}")