# Tutorial 3: Working with remote datasets
In this tutorial we will see how to work with remote datasets using Blosc2 and Caterva2. Caterva2 is a library for managing large remote datasets through a client-server architecture. It is built on top of Blosc2, so it can handle large datasets efficiently using all the tools and tricks you've learned in the previous tutorials.

In [None]:
import math
import blosc2
import caterva2 as cat2
import numpy as np
import psutil
import os
import time

%load_ext memory_profiler

# --- Memory profiler ---
def getmem():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024
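
The pattern of recording `getmem()` and `time.time()` before an operation and printing the differences afterwards recurs throughout this tutorial. As an optional convenience, the sketch below wraps that pattern in a context manager. The name `profile` and its `mem_fn` parameter are hypothetical and not part of Blosc2 or Caterva2; `mem_fn` defaults to a dummy so that the snippet runs on its own, and in this notebook you would pass `getmem` instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def profile(label, mem_fn=lambda: 0.0):
    """Print elapsed wall time and memory change for the enclosed block.

    `mem_fn` is any zero-argument callable returning memory use in MB
    (hypothetical parameter; pass the notebook's `getmem` helper here).
    """
    stats = {}
    m0 = mem_fn()
    t0 = time.perf_counter()
    try:
        yield stats
    finally:
        # Record the measurements even if the enclosed block raises
        stats["seconds"] = time.perf_counter() - t0
        stats["mb"] = mem_fn() - m0
        print(f"{label}: {stats['seconds']:.3f} s, {stats['mb']:+.2f} MB")


# Stand-in workload; inside the notebook use profile("...", mem_fn=getmem)
with profile("sum of squares") as stats:
    total = sum(i * i for i in range(100_000))
```

Note that a resident-memory difference is only a rough indicator, since the Python process may not return freed memory to the operating system immediately.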

First, we need to connect to the remote server - we have a demo running at `https://cat2.cloud/demo` that's good to get you started quickly.

In [None]:
urlbase = "https://cat2.cloud/demo"
root = "@public"
dataset1 = "examples/cubeA.b2nd"
dataset2 = "examples/cubeB.b2nd"

### Blosc2 approach: simple, but allows slicing of remote datasets

Now let's access the remote dataset using Blosc2, which is possible using essentially the same interface as for datasets stored on disk - the only difference is that we need to wrap the urlpath as a `blosc2.URLPath` object, to indicate that we are accessing a remote dataset.


In [None]:
urlpath = blosc2.URLPath(f"{root}/{dataset1}", urlbase)
m0 = getmem()
t0 = time.time()
blosc_arr1 = blosc2.open(urlpath, mode="r")
print("Time to open remote dataset 1:", time.time() - t0, 's')
print("Memory usage after opening remote dataset 1:", getmem() - m0, "MB")

urlpath = blosc2.URLPath(f"{root}/{dataset2}", urlbase)
m0 = getmem()
t0 = time.time()
blosc_arr2 = blosc2.open(urlpath, mode="r")
print("Time to open remote dataset 2:", time.time() - t0, 's')
print("Memory usage after opening remote dataset 2:", getmem() - m0, "MB")

When we request data from the remote server, it is sent chunk-by-chunk in compressed form. This can speed up transfers significantly, by up to roughly the compression ratio, which in this case is:

In [None]:
print(f'cratio of arr1: {round(blosc_arr1.meta["schunk"]["cratio"],2)}x')
print(f'cratio of arr2: {round(blosc_arr2.meta["schunk"]["cratio"],2)}x')
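
To get a feel for what a given compression ratio means for transfer time, here is a back-of-the-envelope sketch. The helper `transfer_seconds`, the array size, the bandwidth and the compression ratio are all made-up values for illustration, not measurements of the demo server. The estimate ignores network latency and the time spent compressing and decompressing.

```python
def transfer_seconds(nbytes, bandwidth_mb_s, cratio=1.0):
    """Estimate the time to send `nbytes` of data over a link.

    nbytes: uncompressed size in bytes
    bandwidth_mb_s: link speed in MB/s (hypothetical value)
    cratio: compression ratio; 1.0 means no compression
    """
    compressed_mb = nbytes / cratio / 2**20
    return compressed_mb / bandwidth_mb_s


nbytes = 1000 * 1000 * 1000 * 4  # hypothetical 1000^3 cube of float32
raw = transfer_seconds(nbytes, bandwidth_mb_s=10)
packed = transfer_seconds(nbytes, bandwidth_mb_s=10, cratio=8.0)
print(f"uncompressed: {raw:.0f} s, compressed: {packed:.0f} s, "
      f"speedup: {raw / packed:.1f}x")
```

In practice the observed speedup is usually somewhat lower than the compression ratio because of the overheads the sketch leaves out.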


**EXERCISE**:

a) Print out the shapes of the datasets and calculate their size (in MB). You will find that ``arr.schunk.nbytes`` does not work, since the variable is not a Blosc2 NDArray. Instead, find the correct entry in the ``arr.meta["schunk"]`` dict. What class are the arrays instances of? Why does opening the remote dataset use almost no memory when the dataset is so large?

b) You should find that the remote access to the dataset has returned a ``C2Array`` object, which is a (memory-light) pointer to the remote dataset. Try to get a slice of ``blosc_arr1``, e.g. `blosc_arr1[1, :1000, :1000]`, and print out its type. Profile the time and memory consumption of this operation, and compare it to the time and memory consumption of opening the remote dataset.

c) Recalling the string-based lazy expression constructor from the previous tutorial, we can define a lazy expression that evaluates a more complicated formula involving both arrays. Execute it via ``[:]``, returning a NumPy array. Profile the time and memory usage for the operation. You can use ``%memit`` or ``getmem()`` as you prefer. You can also check what happens when you only compute a small slice of the result (is it any faster? how much memory is used?).
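
For part a), one way to sanity-check the size you read from the metadata is to compute it from the shape and the size of one element. The sketch below does that with placeholder values; the shape and item size are assumptions for illustration, so substitute the ones you find on the remote arrays. The same helper also shows how small a single slice is compared with the full array.

```python
import math


def size_mb(shape, itemsize):
    """Uncompressed size in MB of an array with this shape and item size."""
    return math.prod(shape) * itemsize / 2**20


full_shape = (1000, 1000, 1000)  # hypothetical shape, replace with arr.shape
slice_shape = (1000, 1000)       # shape of a slice such as arr[1, :1000, :1000]
itemsize = 4                     # bytes per element, assuming float32

print(f"full array: {size_mb(full_shape, itemsize):.1f} MB")
print(f"one slice:  {size_mb(slice_shape, itemsize):.1f} MB")
```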

In [None]:
# a) Calculate the shape and nbytes of the datasets, and print out the type of blosc_arr1
#
## YOUR CODE HERE
#

# b) Get a slice of blosc_arr1, print out its type, and profile time and memory consumption
m0 = getmem()
t0 = time.time()
#
## YOUR CODE HERE
#
print("Time to open slice of remote dataset 1:", time.time() - t0, 's')
print("Memory usage after opening slice of remote dataset 1:", getmem() - m0, "MB")


# c) Define a simple lazy expression using blosc_arr1, blosc_arr2 and compute it
expr = 'mean(sin(arr1) + 2 * exp(arr2 + 1), axis = 1)'
lexpr = blosc2.lazyexpr(expr, operands={'arr1':blosc_arr1, 'arr2':blosc_arr2})
#
## YOUR CODE HERE
#
del result  # Free the memory held by the computed array (assumes you named it `result` above)

### Caterva2 approach: more flexible, allowing remote management in the server

Using Blosc2 we can thus access remote datasets and download only the slices we need. However, in order to do computations, we must download the data to the local machine and compute with it there, which can be expensive in both transfer time and local resources. Moreover, we cannot do any file management on the server side (move, delete, copy datasets, etc.).

These problems are solved by Caterva2, which allows us to work with remote datasets in a more flexible way, doing as much as possible on the server side, and only downloading the data we really want to consult. You can also perform all sorts of file management operations on the server side - see the [Caterva2 documentation](https://ironarray.io/caterva2-doc/tutorials/API.html).

First we set up a connection from the client (our local machine) to the server, and define a pointer to the root where our desired dataset is stored. The root is a special directory that contains all the datasets we want to work with, and it is defined by the `@` symbol followed by the name of the root (in this case, `@public`). We'll then time the process of opening the remote dataset, and check the memory usage before and after opening it.

In [None]:
client = cat2.Client(urlbase)
myroot = client.get(root)

In [None]:
m0 = getmem()
t0 = time.time()
cat2_arr1 = myroot[dataset1]
print("Time to open remote dataset 1:", time.time() - t0, 's')
print("Memory usage after opening remote dataset 1:", getmem() - m0, "MB")

m0 = getmem()
t0 = time.time()
cat2_arr2 = myroot[dataset2]
print("Time to open remote dataset 2:", time.time() - t0, 's')
print("Memory usage after opening remote dataset 2:", getmem() - m0, "MB")

As before, both time and memory usage are very low, since again we are only creating a pointer to the remote dataset, and not downloading it to the local machine.

**EXERCISE**:

a) Print out the shape of the datasets and calculate their size (in MB). What class is ``cat2_arr1`` an instance of?

b) You should find that the remote access to the dataset has returned a ``caterva2.client.Dataset`` object. Try and get a slice of the dataset, e.g. `cat2_arr1[500:502, 302, 900:905]`, and print out its type. Profile the time and memory consumption of this operation, and compare it to the time and memory consumption of accessing the remote dataset using blosc2 that we saw above.


In [None]:
# a) Calculate the shape and nbytes of the datasets, and print out the type of cat2_arr1
#
## YOUR CODE HERE
#

# b) Get a slice of cat2_arr1, print out its type, and profile time and memory consumption
#
## YOUR CODE HERE
#


Now we are going to try and compute the same lazy expression as before, but executed server-side and only downloading the result locally. We thus take advantage of the greater computational resources of the server to perform the computation and execute where the datasets already are, thus saving on transfer time. Now let's see how much time we can save in total!
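
To see why computing on the server can pay off, compare how much data has to travel in each case. The sketch below uses a placeholder operand shape and item size (both assumptions for illustration) and counts uncompressed bytes. It assumes the expression reduces along one axis, as `mean(..., axis=1)` does, and that the result has the same item size as the operands. The helper `result_shape` is hypothetical.

```python
import math


def result_shape(shape, axis):
    """Shape that remains after reducing `shape` along `axis`."""
    return tuple(n for i, n in enumerate(shape) if i != axis)


shape, itemsize = (1000, 1000, 1000), 4  # hypothetical operands, float32

# Client-side evaluation: both operands have to be downloaded
operand_bytes = 2 * math.prod(shape) * itemsize
# Server-side evaluation: only the reduced result is downloaded
result_bytes = math.prod(result_shape(shape, axis=1)) * itemsize

print(f"client-side download: {operand_bytes / 2**20:.0f} MB")
print(f"server-side download: {result_bytes / 2**20:.1f} MB")
```

Compression narrows the gap somewhat for the client-side case, but a reduction still shrinks the amount of data to move by about the length of the reduced axis.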

In order to have write-access to the server, we have to authenticate ourselves with the server. This is done by passing a tuple with the username and password to the ``Client`` constructor. If you don't have an account, you can create one at https://cat2.cloud/demo/login.
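
If you would rather not type your password into a notebook cell, one option is to read the credentials from environment variables. This is only a sketch: the helper `credentials_from_env` and the variable names `CAT2_USER` and `CAT2_PASSWORD` are hypothetical choices, not something Caterva2 defines.

```python
import os


def credentials_from_env(user_var="CAT2_USER", pass_var="CAT2_PASSWORD"):
    """Return (user, password) from the environment, or None if either is unset.

    The environment variable names are hypothetical; pick your own.
    """
    user = os.environ.get(user_var)
    password = os.environ.get(pass_var)
    if user and password:
        return (user, password)
    return None


auth = credentials_from_env()
print("credentials found" if auth else "no credentials, read-only access")

# In the notebook, using only the two constructor forms shown in this tutorial:
# client = cat2.Client(urlbase, auth) if auth else cat2.Client(urlbase)
```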

**EXERCISE**:

a) Modify the code below to authenticate yourself and then use the ``lazyexpr`` constructor of the ``Client`` class to save a lazy expression on the server which refers to the remote dataset.


In [None]:
# a) Define a lazy expression on the server that uses cat2_arr1 and cat2_arr2
client = cat2.Client(urlbase, ("user@example.com", "foobar12"))  # Replace with your credentials
client.timeout = 60  # Set a timeout for the operations
# `expr` is the expression string defined in the Blosc2 section above
lexpr_path = client.lazyexpr(
    "my_lazyexpr",
    expression=expr,
    operands={"arr1": cat2_arr1.path, "arr2": cat2_arr2.path},
)

b) Inspect the type of `lexpr_path`. You will see that we cannot simply compute it. Instead, we need to use the ``Client.get`` method to get a local pointer to the remote lazy expression.
Do this, and access metadata of the result (what type is it?), such as its shape and dtype.

c) Finally we can compute and return the lazy expression using the ``[:]`` method of the local pointer to the remote lazy expression. Profile the time and memory usage for this operation, and compare it to the time and memory usage of computing the lazy expression using blosc2 that we saw above. You can also check what happens when you only compute a small slice of the result.

In [None]:
# b) Get a local pointer to the remote lazy expression and access its metadata
#
## YOUR CODE HERE
#

# c) Compute the lazy expression and profile time and memory usage
#
## YOUR CODE HERE
#

## BONUS: Setting up your own server

If you would like to run your own Caterva2 server and connect to it, you can do so by following these steps.
1) In a command line terminal, execute ``CATERVA2_SECRET=c2sikrit cat2sub``. Check that the server is running at `http://localhost:8000`.
2) In a separate terminal, execute ``cat2adduser user@example.com foobar11`` to create a new user on the server.

You may then return to this Jupyter notebook and use the code below to connect to your server.
Explore the Caterva2 documentation for the ``Client`` class and try to upload a file to the server (see [this tutorial](https://ironarray.io/caterva2-doc/tutorials/API.html) if you get stuck).

In [None]:
client = cat2.Client("http://localhost:8000", ("user@example.com", "foobar11"))
#
## YOUR CODE HERE
#