# Tutorial 3: Working with remote datasets
In this tutorial we will see how to work with remote datasets using Blosc2 and Caterva2. Caterva2 is a library for managing large remote datasets through a client-server architecture. It is built on top of Blosc2, so it can handle large datasets efficiently using all the tools and tricks you've learned in the previous tutorials.

In [1]:
import math
import blosc2
import caterva2 as cat2
import numpy as np
import psutil
import os
import time

%load_ext memory_profiler

# --- Memory profiler ---
def getmem():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

First, we need to connect to the remote server. A demo server is running at `https://cat2.cloud/demo`, which is a good way to get started quickly.

In [2]:
urlbase = "https://cat2.cloud/demo"
root = "@public"
dataset1 = "examples/cubeA.b2nd"
dataset2 = "examples/cubeB.b2nd"

### Caterva2 approach: flexible, remote management in the server

Computational paradigms for remote data often require downloading the data to the local machine before computing with it, which can be costly in transfer time, memory and local compute.

This problem is solved by Caterva2, which allows us to work with remote datasets in a more flexible way, doing as much as possible on the server side, and only downloading the data we really want to consult. You can also perform all sorts of file management operations (move, delete, copy datasets, etc.) on the server side - see the [Caterva2 documentation](https://ironarray.io/caterva2-doc/tutorials/API.html).

First we set up a connection from the client (our local machine) to the server, and define a pointer to the root where our desired dataset is stored. The root is a special directory that contains all the datasets we want to work with, and it is defined by the `@` symbol followed by the name of the root (in this case, `@public`). We'll then time the process of opening the remote dataset, and check the memory usage before and after opening it.

In [3]:
client = cat2.Client(urlbase)
myroot = client.get(root)

In [4]:
m0 = getmem()
t0 = time.time()
cat2_arr1 = myroot[dataset1]
print("Time to open remote dataset 1:", time.time() - t0, 's')
print("Memory usage after opening remote dataset 1:", getmem() - m0, "MB")

m0 = getmem()
t0 = time.time()
cat2_arr2 = myroot[dataset2]
print("Time to open remote dataset 2:", time.time() - t0, 's')
print("Memory usage after opening remote dataset 2:", getmem() - m0, "MB")

Time to open remote dataset 1: 0.3622901439666748 s
Memory usage after opening remote dataset 1: 0.625 MB
Time to open remote dataset 2: 0.043787479400634766 s
Memory usage after opening remote dataset 2: 0.0 MB


Both time and memory usage are very low, since we are only creating a pointer to the remote dataset, and not downloading it to the local machine.
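
The measure-before, measure-after pattern used in the cell above recurs throughout this tutorial. One possible way to package it is a small context manager. The sketch below is an illustration only and is not part of Blosc2 or Caterva2: the names `profile` and `traced_mb` are hypothetical, and `traced_mb` relies on the standard library's `tracemalloc` purely so that the block runs on its own. In the notebook you would pass the `getmem` helper defined earlier instead.

```python
import time
import tracemalloc
from contextlib import contextmanager


def traced_mb():
    """Stand-in for getmem(): MB currently allocated through Python's allocators."""
    return tracemalloc.get_traced_memory()[0] / 1024 / 1024


@contextmanager
def profile(label, memfn=traced_mb):
    """Time the enclosed block and report the change in memfn() across it."""
    stats = {}
    m0 = memfn()
    t0 = time.perf_counter()
    try:
        yield stats
    finally:
        stats["seconds"] = time.perf_counter() - t0
        stats["delta_mb"] = memfn() - m0
        print(f"{label}: {stats['seconds']:.4f} s, {stats['delta_mb']:+.3f} MB")


tracemalloc.start()
with profile("allocate a 1e6-element list") as stats:
    data = [0.0] * 1_000_000
```

In the notebook, `with profile("open remote dataset 1", getmem): cat2_arr1 = myroot[dataset1]` would report the same two quantities as the cell above, with memory measured as process RSS rather than through `tracemalloc`.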

**EXERCISE**:

a) Print out the shape of the datasets and calculate their size (in MB). What class is ``cat2_arr1`` an instance of?

b) You should find that the remote access to the dataset has returned a ``caterva2.client.Dataset`` object. Try to get a slice of the dataset, e.g. `cat2_arr1[500:502, 302, 900:905]`, and print out its type. Profile the time and memory consumption of this operation.


In [5]:
# a) Calculate the shape and nbytes of the dataset, and print out the type of cat2_arr1
# SOLUTION
print(f"cat2_arr1 of type {type(cat2_arr1)}. cat2_arr1 shape, nbytes - {cat2_arr1.shape}, {np.prod(cat2_arr1.shape) * np.dtype(cat2_arr1.dtype).itemsize // 1024**2} MB, cat2_arr2 shape, nbytes - {cat2_arr2.shape}, {np.prod(cat2_arr2.shape) * np.dtype(cat2_arr2.dtype).itemsize // 1024**2} MB.")

# b) Get a slice of cat2_arr1, print out its type, and profile time and memory consumption
# SOLUTION
m0 = getmem()
t0 = time.time()
result = cat2_arr1[500:502, 302, 900:905]
print("Time to slice dataset 1:", time.time() - t0, 's')
print("Memory usage after slicing remote dataset 1:", getmem() - m0, "MB")
print("Type of slice:", type(result))
del result


cat2_arr1 of type <class 'caterva2.client.Dataset'>. cat2_arr1 shape, nbytes - (1, 1000, 1000), 7 MB, cat2_arr2 shape, nbytes - (1, 1000, 1000), 7 MB.
Time to slice dataset 1: 0.0449831485748291 s
Memory usage after slicing remote dataset 1: 0.25 MB
Type of slice: <class 'numpy.ndarray'>
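
The size reported above follows from metadata alone, so nothing has to be downloaded to obtain it. The sketch below reproduces the arithmetic for the shape and dtype printed in the output; `nbytes_from_metadata` is a hypothetical helper name. It also applies the slice from exercise b) to a local NumPy array of the same shape, to show what NumPy's own slicing rules give. Whether the remote `Dataset` follows exactly the same rules is an assumption, which you can check by printing `result.shape` in the notebook.

```python
import math

import numpy as np


def nbytes_from_metadata(shape, dtype):
    """Uncompressed size in bytes implied by a shape and a dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize


size = nbytes_from_metadata((1, 1000, 1000), "float64")
print(size)              # size in bytes
print(size / 1024**2)    # size in MiB, about 7.63
print(size // 1024**2)   # floor division truncates this to 7, as printed above

# Local stand-in with the same shape as the remote dataset (NumPy semantics only).
local = np.zeros((1, 1000, 1000), dtype="float64")
piece = local[500:502, 302, 900:905]
print(piece.shape)       # the first axis has length 1, so 500:502 selects nothing
```

With NumPy semantics the slice is empty along its first axis, so a slice such as `[0:1, 302, 900:905]` may be a more informative choice for a dataset of this shape.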



Now we are going to compute a lazy expression that is executed server-side, downloading only the result. This takes advantage of the greater computational resources of the server and runs the computation where the datasets already are, which saves on transfer time. Let's see how much time we can save in total!

In order to have write access to the server, we have to authenticate. This is done by passing a tuple with the username and password to the ``Client`` constructor. If you don't have an account, you can create one at https://cat2.cloud/demo/login.

The code below defines a lazy expression that refers to the remote datasets, and then uses the ``upload`` method of the ``Client`` class to save the expression on the server.



In [6]:
# Define a simple lazy expression using cat2_arr1, cat2_arr2
client = cat2.Client(urlbase, ("user@example.com", "foobar11"))  # Replace with your credentials
client.timeout = 60  # Set a timeout for the operations
locexpr = cat2_arr1 + cat2_arr2
lexpr = client.upload(local_dset=locexpr, remotepath='@personal/my_lazyexpr.b2nd')
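
The cell above writes the credentials directly into the notebook. A common alternative is to read them from environment variables, so that they do not end up in a shared file. The following is a minimal sketch of that idea: the helper `credentials_from_env` and the variable names `CATERVA2_USER` and `CATERVA2_PASS` are hypothetical choices made for this example, not something defined by Caterva2.

```python
import os


def credentials_from_env(user_var="CATERVA2_USER", pass_var="CATERVA2_PASS"):
    """Return a (username, password) tuple, or None if either variable is unset."""
    user = os.environ.get(user_var)
    password = os.environ.get(pass_var)
    if not user or not password:
        return None
    return (user, password)


auth = credentials_from_env()
if auth is None:
    print("No credentials found; only an unauthenticated client can be created.")
else:
    print(f"Credentials found for {auth[0]}")

# In the notebook, the tuple would be passed on as in the cell above:
#   client = cat2.Client(urlbase, auth) if auth else cat2.Client(urlbase)
```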

**EXERCISE**

a) Inspect the type of `lexpr` (what type is it?). We could compute it right away, but instead just access the metadata of the result, such as its shape and dtype.

b) Finally, we can compute the lazy expression and download the result by indexing the local pointer to the remote lazy expression with ``[:]``. Profile the time and memory usage of this operation, and compare them with those of computing the lazy expression using blosc2 that we saw above. You can also check what happens when you only compute a small slice of the result.

In [7]:
# a) Inspect the type of the local pointer to the remote lazy expression
# SOLUTION
print(type(lexpr))

# b) Compute the lazy expression and profile time and memory usage
# SOLUTION
m0 = getmem()
t0 = time.time()
expr_result = lexpr[:]
print("Time to compute:", time.time() - t0, 's')
print("Memory usage after computation:", getmem() - m0, "MB")

<class 'caterva2.client.Dataset'>
Time to compute: 0.2932281494140625 s
Memory usage after computation: 9.6953125 MB
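
To make the idea of deferred evaluation more concrete, here is a toy model of a lazy sum written with NumPy only. It is a conceptual sketch and not how Blosc2 or Caterva2 implement lazy expressions: the class `ToyLazySum` is hypothetical, and it works on small in-memory arrays rather than remote datasets. What it shows is that the shape and dtype are known before anything is computed, and that indexing computes only the requested region.

```python
import numpy as np


class ToyLazySum:
    """Toy stand-in for a lazy expression: keeps operands, computes on indexing."""

    def __init__(self, a, b):
        if a.shape != b.shape:
            raise ValueError("operands must have the same shape")
        self.a, self.b = a, b
        # Metadata is derived from the operands without computing the sum.
        self.shape = a.shape
        self.dtype = np.result_type(a.dtype, b.dtype)

    def __getitem__(self, key):
        # Only the requested region of each operand is read and added.
        return self.a[key] + self.b[key]


a = np.arange(12, dtype=np.float64).reshape(1, 3, 4)
b = np.ones((1, 3, 4))
expr = ToyLazySum(a, b)

print(expr.shape, expr.dtype)  # metadata only, no computation yet
print(expr[0, 1, :2])          # computes two elements
print(expr[:].shape)           # computes the full result
```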


### User-defined Functions on the Server

Blosc2 also lets you wrap user-defined functions (UDFs) in ``LazyUDF`` objects, which are evaluated lazily and implement the same computation interface as ``LazyExpr``.

Moreover, these can be uploaded and executed on the server just like ``LazyExpr`` objects. Let's dive in and define a ``LazyUDF``!

In [8]:
def myUDF(inputs, output, offset):  # UDFs must have this signature...
    x, y = inputs
    # ... and they must fill output like this
    output[:] = x + y

# Create LazyUDF object
lazyudf = blosc2.lazyudf(myUDF, (cat2_arr1, cat2_arr2), dtype=cat2_arr1.dtype)
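
Because a UDF is an ordinary Python function, it can be sanity-checked on small in-memory NumPy arrays before it is wrapped with ``blosc2.lazyudf``. The sketch below calls `myUDF` directly. The input arrays, the output buffer and the `offset` value `(0, 0)` are made up for illustration, and this particular function ignores `offset` anyway.

```python
import numpy as np


def myUDF(inputs, output, offset):
    # Same body as in the cell above: add the two inputs into the output buffer.
    x, y = inputs
    output[:] = x + y


# Small made-up inputs and a preallocated output buffer of matching shape.
x = np.arange(6, dtype=np.float64).reshape(2, 3)
y = np.full((2, 3), 10.0)
out = np.empty_like(x)

myUDF((x, y), out, offset=(0, 0))
print(out)
```

Note that the function returns nothing: the result is delivered by writing into `output` in place, which is why the buffer has to exist before the call.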

**EXERCISE**:

a) Check the attributes of the ``lazyudf`` variable - dtype, shape, type.

b) Use ``client.upload`` to upload the ``LazyUDF`` to the server, and examine the returned object, to check it has the same metadata as the local ``lazyudf`` variable. This should all be very fast.

c) Compute the function and compare it with the output of the lazy expression ``expr_result`` from the previous exercise. The values should be the same!

In [9]:
# a) Check attributes of lazyudf
print(f'lazyudf has shape {lazyudf.shape}, dtype {lazyudf.dtype} and is of type {type(lazyudf)}')

# b) Upload to server
server_udf = client.upload(local_dset=lazyudf, remotepath='@personal/lazyudf.b2nd')
print(f'server_udf has shape {server_udf.shape}, dtype {server_udf.dtype} and is of type {type(server_udf)}')

# c) compute and compare
udf_result = server_udf[:]
np.testing.assert_array_equal(udf_result, expr_result)

lazyudf has shape (1, 1000, 1000), dtype float64 and is of type <class 'blosc2.lazyexpr.LazyUDF'>
server_udf has shape (1, 1000, 1000), dtype float64 and is of type <class 'caterva2.client.Dataset'>


## BONUS: Setting up your own server

If you would like to run Caterva2 on your own server and connect to it, you can do so by following these steps.
1) In a command line terminal, execute ``CATERVA2_SECRET=c2sikrit cat2sub``. Check that the server is running at `http://localhost:8000`.
2) In a separate terminal, execute ``cat2adduser user@example.com foobar11`` to create a new user on the server.
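
One quick way to check step 1 from Python is to test whether anything accepts TCP connections on the expected port. The sketch below uses only the standard library. The helper name `port_is_open` is hypothetical, and the check only confirms that some process is listening on the port, not that the process is a Caterva2 server.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


print("Something is listening on localhost:8000:", port_is_open("localhost", 8000))
```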

You may then return to this Jupyter notebook and use the code below to connect to your server.
Explore the Caterva2 documentation for the ``Client`` class and try to upload a file to the server (see [this tutorial](https://ironarray.io/caterva2-doc/tutorials/API.html) if you get stuck).

In [10]:
client = cat2.Client("http://localhost:8000", ("user@example.com", "foobar11"))
#
## YOUR CODE HERE
#