# Planetary Computer Hub Overview

The Planetary Computer Hub is a *convenient* place to do geospatial data analysis on Azure. There are many ways to compute on Azure (VMs, Azure Functions, Kubernetes, Azure ML, ...). You can use the Planetary Computer data and APIs from any of them (see https://planetarycomputer.microsoft.com/docs/concepts/computing/), but make sure the compute is in the **West Europe** Azure region.

You can sign in at https://planetarycomputer.microsoft.com/compute. Request an account at https://planetarycomputer.microsoft.com/account/request if you haven't already.

<img src="https://planetarycomputer.microsoft.com/_images/hub-login.png" width="25%"/>

Once signed in, you choose an *environment*.

<img src="https://planetarycomputer.microsoft.com/_images/hub-profiles.png" width="25%"/>

Check out the [JupyterLab](https://jupyterlab.readthedocs.io/) docs if you're new to JupyterLab and want to learn more. The main thing for today is `Shift+Enter`, which runs the current cell.

## Cloud-native Geospatial

Why do we care about doing our data analysis in the cloud?

1. There's too much data. It simply doesn't fit on a single hard drive.
2. You potentially want access to all of it, and so do thousands of other people.
3. We want to do more complicated and compute-intensive analysis.

In [None]:
!time gdalinfo /vsicurl/https://naipeuwest.blob.core.windows.net/naip/v002/ia/2019/ia_60cm_2019/42091/m_4209150_sw_15_060_20190828.tif > /dev/null

I ran that same command on my laptop, from my home. It took *much* longer.

```console
> time ./gdalinfo /vsicurl/https://naipeuwest.blob.core.windows.net/naip/v002/ia/2019/ia_60cm_2019/42091/m_4209150_sw_15_060_20190828.tif > /dev/null

real    0m6.706s
user    0m0.351s
sys     0m0.097s
```

So the most important point: **put your compute next to the data**. It's faster and cheaper. All the Planetary Computer data is in the **West Europe** region.

## The Filesystem

There are two parts to your filesystem:

1. Your *home* directory, at `/home/jovyan/`, which persists across sessions. Put your code, notebooks, outputs, etc. here.
2. Everything else, which does *not* persist across sessions.

The size of your home directory is fairly limited. It's really intended for small things like code. It's *not* intended for large amounts of data.
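As a rough sketch, you can check how much free space each location has before writing anything large. This uses only the standard library; the helper name `free_gib` is made up for this example, and the numbers you see depend on your own session.

```python
import shutil
import tempfile
from pathlib import Path


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


# Path.home() is /home/jovyan on the Hub; gettempdir() is normally /tmp.
for location in (Path.home(), Path(tempfile.gettempdir())):
    print(f"{location}: {free_gib(location):.1f} GiB free")
```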

If you do need to (temporarily) store lots of data, put it in `/tmp`. This might be helpful for file formats like GRIB2 that can't be streamed over the network.

In [None]:
import tempfile

with tempfile.TemporaryDirectory() as td:
    # save temporary data files here
    print(td)

But note that anything saved outside of your home directory, including `/tmp`, will not be available the next time you start your notebook server.

## The software environment

When you logged in, you picked the software environment (Python, R, TensorFlow, PyTorch, ...). This consists of a conda environment at `/srv/conda/envs/notebook`.

In [None]:
!echo $CONDA_PREFIX

Notice that's outside of the home directory, so any changes you make to the software environment will be lost when your session restarts. Those environments are created using a set of Dockerfiles at https://github.com/Microsoft/planetary-computer-containers. Reach out there if you have a package that would be appropriate to add to the base environment.

You can install additional packages "at runtime".

In [None]:
%pip install cyberpandas

In [None]:
import cyberpandas

You can also install packages using conda / mamba.
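Because runtime installs are lost when the session restarts, it can help to check at the top of a notebook which packages are missing. Here is a minimal sketch using the standard library; the helper name `missing_packages` is made up for this example, and it only handles top-level import names.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level import names that are not importable right now."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; "cyberpandas" is only present if you installed it.
print(missing_packages(["json", "cyberpandas"]))
```

Anything the function reports can then be installed with `%pip install` as shown above.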

## Scaling with Dask

<img src="https://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg" width="10%"/>

Dask is a way to scale your Python code. It can be helpful when you're working with datasets that are larger than memory, or you're running some computationally expensive job that can be parallelized.
The Planetary Computer Hub includes a service to easily create Dask clusters. In short: you get to distribute your computation on many machines without having to worry about the hassles of actually setting up machines, managing environments, networking, etc.

We'll spend a few minutes learning about Dask, but many of the lessons are applicable to other parallel computing environments.

The core idea behind Dask, and any other parallel computing framework, is to get some work done faster by splitting some large job into smaller pieces and running those smaller pieces in parallel.
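To make that split-then-combine idea concrete before bringing in Dask, here is a minimal sketch that uses only the standard library. The helper names `chunked` and `parallel_sum` are made up for this example. It shows the structure of the approach only: threads won't speed up a pure-Python sum.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, size=25, max_workers=4):
    """Sum each chunk in a worker thread, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_sums = list(pool.map(sum, chunked(values, size)))
    return sum(partial_sums)


# Gives the same answer as sum(range(100)), computed piece by piece.
print(parallel_sum(list(range(100))))
```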

In [None]:
import dask
import dask.distributed

client = dask.distributed.Client()
client

Let's define a couple of functions to simulate real work.

In [None]:
import time

def inc(x):
    time.sleep(1)
    return x + 1

def add(x, y):
    time.sleep(1)
    return x + y

We'll run a workload that should take about 3 seconds.

In [None]:
%%time
x = inc(1)
y = inc(2)
z = add(x, y)

Now, let's parallelize that with Dask, specifically [`dask.delayed`](https://docs.dask.org/en/stable/delayed.html).

In [None]:
%%time
x = dask.delayed(inc)(1)
y = dask.delayed(inc)(2)
z = dask.delayed(add)(x, y)

What just happened? We haven't actually run our computation yet. `dask.delayed`, and several other components of Dask, are *lazy*. They don't run your computation until you explicitly ask them to.

In [None]:
z

Instead, we've built up a kind of recipe for our computation: a task graph that, when executed, will give us our result. We can visualize the task graph:

In [None]:
z.visualize()
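To build some intuition for what "lazy" means, here is a toy sketch in plain Python. It is not how Dask is implemented, and the names `Lazy`, `toy_inc`, and `toy_add` are made up for this example. Each `Lazy` object just records a function and its arguments, and nothing runs until `.compute()` is called.

```python
class Lazy:
    """A toy deferred call: records a function and its arguments without running it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any Lazy arguments first, then call the recorded function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records which functions have actually run


def toy_inc(x):
    calls.append("toy_inc")
    return x + 1


def toy_add(x, y):
    calls.append("toy_add")
    return x + y


lazy_x = Lazy(toy_inc, 1)
lazy_y = Lazy(toy_inc, 2)
lazy_z = Lazy(toy_add, lazy_x, lazy_y)

print(calls)             # still empty: building the graph ran nothing
print(lazy_z.compute())  # now toy_inc, toy_inc, and toy_add run
```

Unlike this toy, Dask can inspect the graph and run independent tasks (the two increments) at the same time.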

And get the actual result with `.compute()`.

In [None]:
%%time
z.compute()

We can see a few things. First, if you have the Task Stream plot open, you'll see that Dask did indeed run the two `inc` tasks in parallel. Second, you'll see that the overall time was just over 2s, slightly higher than our theoretical best time of 2s. Dask does incur some overhead related to task scheduling and data movement. See https://docs.dask.org/en/stable/best-practices.html for more.
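Where does the theoretical best of 2s come from? It is the length of the longest chain of dependent tasks in the graph. The sketch below works that out for our three one-second tasks. The dictionaries and the `finish_time` helper are made up for this example, and the calculation assumes unlimited workers and zero overhead.

```python
from functools import lru_cache

# Each task takes one second; z (the add) needs both x and y (the incs).
durations = {"x": 1.0, "y": 1.0, "z": 1.0}
depends_on = {"x": [], "y": [], "z": ["x", "y"]}


@lru_cache(maxsize=None)
def finish_time(task):
    """Earliest time `task` can finish if all of its inputs run as early as possible."""
    start = max((finish_time(dep) for dep in depends_on[task]), default=0.0)
    return start + durations[task]


serial_time = sum(durations.values())
best_parallel_time = max(finish_time(task) for task in durations)

print(serial_time, best_parallel_time)  # 3.0 2.0
```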

So that's Dask in a nutshell.

<img src="https://docs.dask.org/en/stable/_images/dask-overview.svg" width="50%"/>

With Dask, you can parallelize your workload on a single machine or across a distributed cluster of machines. Typically, distributed computing comes with a bunch of headaches around infrastructure: setting up the machines, getting them talking to each other, etc. On the Hub, the Dask Gateway service takes care of that for you.

In [None]:
import dask_gateway

cluster = dask_gateway.GatewayCluster()
cluster.scale(2)
client = cluster.get_client()
cluster

That created a cluster with all the default options. You can customize a few things when creating your cluster.

In [None]:
gateway = dask_gateway.Gateway()

options = gateway.cluster_options()
options

Pass those options to `gateway.new_cluster()` to create a customized cluster.
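A minimal sketch of that pattern is below. The helper `new_custom_cluster` is made up for this example, and the `worker_memory` option in the usage comment is an assumption: the available option names depend on how the gateway is configured, so check the `options` object displayed above for the real ones.

```python
def new_custom_cluster(gateway, **overrides):
    """Create a cluster from the gateway's default options plus any overrides."""
    options = gateway.cluster_options()
    for name, value in overrides.items():
        # The options object supports item assignment, e.g. options["name"] = value.
        options[name] = value
    return gateway.new_cluster(options)


# Usage on the Hub (needs a running Dask Gateway, so it is left commented out):
# import dask_gateway
# gateway = dask_gateway.Gateway()
# cluster = new_custom_cluster(gateway, worker_memory=8)  # option name is an assumption
```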

In [None]:
cluster.close()