# NDArray: An N-Dimensional, Compressed Data Container

An NDArray is a chunked, compressed N-dimensional array that supports operations such as setting, copying, and slicing. In this section, we are going to see how to create and manipulate NDArray arrays. Each NDArray holds both data and metadata. The data is *chunked* and *compressed*; the metadata describes the data itself, as well as how it is chunked and compressed. Chunking and compression are the features that make NDArray arrays efficient for working with large data.

In [121]:
import numpy as np

import blosc2

## Creating an array
Let's start by creating a 2D array with 100M elements filled with ``arange``. We can then print out the metadata, which contains information about: the array data (such as ``shape`` and ``dtype``); and how the data is compressed and stored, such as chunk- and block-shapes (``chunks`` and ``blocks``) and compression params (``CParams``). See [here](../overview.html#ndarray-an-n-dimensional-store) for an explanation of chunking and blocking.



In [122]:
shape = (10_000, 10_000)
array = blosc2.arange(np.prod(shape), shape=shape)
array.info

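type    : NDArray
shape   : (10000, 10000)
chunks  : (50, 10000)
blocks  : (4, 10000)
dtype   : int64
cratio  : 491.73
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
        : nthreads=10, blocksize=320000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=10)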


The ``cratio`` parameter tells us how effective the compression is, since it gives the ratio between the number of bytes required to store the array in uncompressed and compressed form. Here we require almost 500x less space for the compressed array! Note that all the compression and decompression parameters are set to the default, and ``chunks`` and ``blocks`` have been selected automatically - playing around with them will affect the ``cratio`` (as well as compression and decompression speed).
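The figures in the metadata above can be cross-checked with a little arithmetic. The following is a minimal sketch in plain Python (it makes no blosc2 calls); it uses only the values printed by ``array.info``, and the variable names are illustrative.

```python
import math

# Values copied from the array.info output
shape = (10_000, 10_000)
chunks = (50, 10_000)
blocks = (4, 10_000)
typesize = 8      # bytes per int64 element
cratio = 491.73   # compression ratio reported by array.info

# Uncompressed sizes in bytes
nbytes = math.prod(shape) * typesize
chunk_nbytes = math.prod(chunks) * typesize
block_nbytes = math.prod(blocks) * typesize  # equals the reported blocksize=320000

# How many chunks cover the array, and how many blocks cover one chunk
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
blocks_per_chunk = math.prod(math.ceil(c / b) for c, b in zip(chunks, blocks))

# cratio = uncompressed bytes / compressed bytes, so the compressed size is roughly:
approx_cbytes = nbytes / cratio

print(f"uncompressed array : {nbytes:,} bytes")
print(f"one chunk          : {chunk_nbytes:,} bytes ({n_chunks} chunks in total)")
print(f"one block          : {block_nbytes:,} bytes ({blocks_per_chunk} blocks per chunk)")
print(f"compressed (approx): {approx_cbytes / 1e6:.2f} MB")
```

Since 50 rows do not divide evenly into blocks of 4 rows, the last block of each chunk is only partly filled with data.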

We can also create an NDArray by compressing a NumPy array:

In [123]:
nparray = np.linspace(0, 100, np.prod(shape), dtype=np.float64).reshape(shape)
b2array = blosc2.asarray(nparray)
b2array.info

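type    : NDArray
shape   : (10000, 10000)
chunks  : (50, 10000)
blocks  : (4, 10000)
dtype   : float64
cratio  : 26.18
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
        : nthreads=10, blocksize=320000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=10)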


or an iterator:

In [124]:
N = 1000_000
rng = np.random.default_rng()
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(it, dtype="i4,f4,f8", shape=(N,))
sa.info

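type    : NDArray
shape   : (1000000,)
chunks  : (250000,)
blocks  : (31250,)
dtype   : [('f0', '<i4'), ('f1', '<f4'), ('f2', '<f8')]
cratio  : 2.24
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=16,
        : nthreads=10, blocksize=500000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=10)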
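The ``typesize=16`` reported above follows from the structured dtype passed to ``fromiter``. The short NumPy-only sketch below shows how the string ``"i4,f4,f8"`` is interpreted; the deterministic third field (``0.5 * x``) is an illustrative stand-in for the random values used in the cell above.

```python
import numpy as np

# The comma-separated string describes a record with three fields
dt = np.dtype("i4,f4,f8")
print(dt.names)     # field names that NumPy assigns automatically
print(dt.itemsize)  # bytes per record: 4 + 4 + 8

# A tiny structured array built from the same kind of tuples
rows = np.array([(-x + 1, x - 2, 0.5 * x) for x in range(5)], dtype=dt)
print(rows["f0"])   # the first field of every record
```

Each element of the NDArray is one such 16-byte record, and individual fields can be selected by name once a slice has been decompressed to NumPy.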



## Reading and modifying data
Because the data in an NDArray is compressed, it cannot be read directly and must first be decompressed (into a NumPy array, which is held in memory). For the full array, this can be done with the ``[:]`` operator, which returns a NumPy array.

In [125]:
temp = array[:]  # This will decompress the full array
type(temp)

numpy.ndarray


However, it is often neither necessary nor desirable to load the whole array into memory. Using standard indexing, we can quickly read just a small part of an NDArray into a NumPy array.

In [126]:
array[0]

array([   0,    1,    2, ..., 9997, 9998, 9999], shape=(10000,))

We can also modify the data in the array using standard NumPy indexing, with either NumPy or NDArray arrays as the data source. For example, we can set the first row to zeros and the first column to ones. ``array`` will still be an NDArray.

In [127]:
array[0, :] = blosc2.zeros(10000, dtype=array.dtype)
array[:, 0] = np.ones(10000, dtype=array.dtype)
print(array)

<blosc2.ndarray.NDArray object at 0x7c54c3f01550>


In [128]:
array[0, 0]

array(1)

In [129]:
array[0, :]

array([1, 0, 0, ..., 0, 0, 0], shape=(10000,))

In [130]:
array[:, 0]

array([1, 1, 1, ..., 1, 1, 1], shape=(10000,))

## Enlarging the array
Existing arrays can be enlarged. This is one operation that is greatly enhanced by the chunking procedure implemented in NDArray arrays.

In [131]:
array.resize((10_001, 10_000))
print(array.shape)
array[10_000, :] = 1
array[10_000, :]

(10001, 10000)


array([1, 1, 1, ..., 1, 1, 1], shape=(10000,))

Enlarging a NumPy array requires a full copy of the data, since underlying data are stored contiguously in memory; hence new memory to hold the extended array is allocated, the old data is copied to the new memory, and then the new data is appended.
Enlarging is a much faster operation for NDArray arrays because data is chunked, and the chunks may be stored non-contiguously in memory, so one may simply write the necessary new chunks to some arbitrary address in memory and leave the old chunks untouched.
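The difference can be illustrated with a toy model. The sketch below is not how blosc2 is implemented: ``ChunkedStore`` is a hypothetical class, and the standard-library ``zlib`` stands in for the real codec. It only shows that, when chunks are stored independently, growing the container adds new chunks and leaves the existing ones untouched.

```python
import zlib
from array import array as typed_array


class ChunkedStore:
    """Toy 1-D container: every chunk is compressed and stored on its own."""

    def __init__(self, chunk_len):
        self.chunk_len = chunk_len
        self.chunks = []  # independent compressed buffers, in logical order
        self.length = 0

    def extend(self, values):
        """Append values by adding new chunks; existing chunks are not rewritten.

        For simplicity, each call starts a fresh chunk.
        """
        values = list(values)
        for start in range(0, len(values), self.chunk_len):
            piece = typed_array("q", values[start:start + self.chunk_len])
            self.chunks.append(zlib.compress(piece.tobytes()))
        self.length += len(values)

    def read_chunk(self, i):
        """Decompress a single chunk and return it as a list of ints."""
        out = typed_array("q")
        out.frombytes(zlib.decompress(self.chunks[i]))
        return out.tolist()


store = ChunkedStore(chunk_len=1000)
store.extend(range(5000))

old_chunks = list(store.chunks)  # references to the buffers that exist before growing
store.extend([1] * 1000)         # enlarge: only one new chunk is written

unchanged = all(a is b for a, b in zip(old_chunks, store.chunks))
print(len(store.chunks), unchanged)
```

A contiguous buffer, by contrast, would need to be reallocated and copied whenever it outgrows its current allocation.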

You can also shrink the array.

In [132]:
array.resize((9_000, 10_000))
print(array.shape)
print(array[8_999])  # This works
# array[9_000]  # This will raise an exception

(9000, 10000)
[       1 89990001 89990002 ... 89999997 89999998 89999999]


## Persistent data
We can use the `save()` method to store the array on disk.  This is very useful when you have a large array that you want to keep around but do not need to access all the time.


In [133]:
array.save("array_tutorial.b2nd", mode="w")
!ls -lh array_tutorial.b2nd

-rw-r--r-- 1 lshaw lshaw 1.7M Jul 23 16:19 array_tutorial.b2nd



For arrays, it is usual to use the `.b2nd` extension. Now let's open the saved array and check that its data matches the original (decompressing both first so that they can be compared):

In [134]:
array2 = blosc2.open("array_tutorial.b2nd")
np.all(array2[:] == array[:])  # Make sure saved array matches original

np.True_

In fact, it is possible to create an NDArray array directly on disk by specifying where it will be stored, so that the full array never has to be held in memory. We may also specify the compression/decompression and other storage parameters (e.g. ``chunks`` and ``blocks``). For example, a 1000x1000 array filled with the byte string ``b"pepe"`` can be created like this:

In [135]:
array1 = blosc2.full(
    (1000, 1000),
    fill_value=b"pepe",
    chunks=(100, 100),
    blocks=(50, 50),
    urlpath="array1_tutorial.b2nd",
    mode="w",
)
!ls -lh array1_tutorial.b2nd

-rw-r--r-- 1 lshaw lshaw 4.0K Jul 23 16:19 array1_tutorial.b2nd


We can also write directly to disk using the other constructors we saw previously.

In [136]:
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(it, dtype="i4,f4,f8", shape=(N,), urlpath="sa-1M.b2nd", mode="w")
b2array = blosc2.asarray(nparray, urlpath="linspace_array.b2nd", mode="w")

## Compression params
Let's see how to copy the data of an NDArray array while changing the compression parameters of the copy. This may be useful in many contexts, for example when testing how changing the codec of an existing array affects the compression ratio.

In [137]:
cparams = blosc2.CParams(
    codec=blosc2.Codec.LZ4,
    clevel=9,
    filters=[blosc2.Filter.BITSHUFFLE],
    filters_meta=[0],
)

array2 = array.copy(chunks=(500, 10_000), blocks=(50, 10_000), cparams=cparams)
print(array2.info)

type    : NDArray
shape   : (9000, 10000)
chunks  : (500, 10000)
blocks  : (50, 10000)
dtype   : int64
cratio  : 70.63
cparams : CParams(codec=<Codec.LZ4: 1>, codec_meta=0, clevel=9, use_dict=False, typesize=8,
        : nthreads=10, blocksize=4000000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.BITSHUFFLE: 2>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=10)



In [138]:
print(array.info)

type    : NDArray
shape   : (9000, 10000)
chunks  : (50, 10000)
blocks  : (4, 10000)
dtype   : int64
cratio  : 437.77
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
        : nthreads=10, blocksize=320000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=10)



In this case, the compression ratio is much higher for the original array (437.77 versus 70.63), since the copy uses a different codec (LZ4) that is optimised for compression speed rather than compression ratio. In general, there is a tradeoff between the two.

That's all for now.  There are more examples in the [examples directory of the git repository](https://github.com/Blosc/python-blosc2/tree/main/examples/ndarray) for you to explore.  Enjoy!