# Problems with data access

This notebook performs a similar operation to the last notebook: loading some NWM data from Azure Blob Storage. But instead of visualizing the data, we'll focus on some potential performance issues with *accessing* the data as-is. We'll compare two ways of getting the data:

1. A "download" workflow, where you download the files ahead of time to your local disk
2. A "cloud-native" workflow, where you read the data directly from Blob Storage.


## Background

First, some background on different "styles" of working that the Pangeo community has identified.


![](download-model.png)

Under the "download" style of working, there are two distinct phases. First comes an initial download phase, where data are downloaded from the HTTP / FTP / whatever server to your local workspace (laptop, workstation, HPC, etc.). Once the data are downloaded, you start your iterative clean / transform / analyze / visualize cycle. This model works alright for smaller, static datasets. It breaks down for datasets that are updated frequently, or are so large that downloading the archive isn't an option.

![](cloud-native-model.png)

Under the "cloud-native" model, there isn't an initial download phase; the data stay where they are in Azure Blob Storage. Instead, you read the data directly into memory on your compute. Crucially, the compute is deployed in the same Azure region as the data, which gives you a nice, high-bandwidth connection between the storage and compute services.

Unsurprisingly, we're big fans of the cloud-native model. While there are some nuances, it's a lower barrier to entry for newcomers. And it scales extremely well to large datasets (depending on the format and access pattern, as we'll dig into now).

In [1]:
import adlfs
import azure.storage.blob
import planetary_computer
import xarray as xr

fs = adlfs.AzureBlobFileSystem(
    "noaanwm", credential=planetary_computer.sas.get_token("noaanwm", "nwm").token
)
# force xarray to import everything
xr.tutorial.open_dataset("air_temperature");

## Download Workflow

First, we'll use the classic "download model" workflow style, where we download the data to disk ahead of time.

In [2]:
import urllib.request

In [3]:
print("Downloading from Blob Storage")
%time filename, response = urllib.request.urlretrieve("https://noaanwm.blob.core.windows.net/nwm/nwm.20230123/short_range/nwm.t00z.short_range.land.f001.conus.nc")
print("-" * 80)

print("Reading metadata")
%time ds = xr.open_dataset(filename)
print("-" * 80)

print("Loading data")
%time ds = ds["SOILSAT_TOP"].load()

Downloading from Blob Storage
CPU times: user 74.5 ms, sys: 92 ms, total: 167 ms
Wall time: 425 ms
--------------------------------------------------------------------------------
Reading metadata
CPU times: user 18.2 ms, sys: 3.57 ms, total: 21.8 ms
Wall time: 21.8 ms
--------------------------------------------------------------------------------
Loading data
CPU times: user 316 ms, sys: 155 ms, total: 471 ms
Wall time: 471 ms


Timing will vary a bit, but we're seeing *roughly* 500 ms to download the data, 20 ms to read the metadata, and 500 ms to load the data from disk.

## Cloud-native model

The `open_dataset` call in the last notebook, reading from Blob Storage, might have felt a bit slow. Let's do some timings and logging to see what's going on.

In [4]:
import logging
import pathlib
import azure.core.pipeline.policies

p = pathlib.Path("log.txt")
p.unlink(missing_ok=True)

# Ensure range requests are logged
azure.core.pipeline.policies.HttpLoggingPolicy.DEFAULT_HEADERS_ALLOWLIST.add(
    "Content-Range"
)

logger = logging.getLogger()
logging.basicConfig(level=logging.DEBUG, filename="log.txt")

In [5]:
%%time
prefix = "nwm/nwm.20230123"

ds = xr.open_dataset(
    fs.open(f"{prefix}/short_range/nwm.t00z.short_range.land.f001.conus.nc")
)
display(ds)

CPU times: user 212 ms, sys: 54.2 ms, total: 267 ms
Wall time: 476 ms


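So about 0.5 seconds in the run captured above, and more like 1 – 1.5 seconds in other runs, *just to read the metadata*. Compare that with roughly 20 ms from local disk. Let's load up some data too.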

In [6]:
logger.info(f"{' Reading Data ':=^80}")

%time soil_saturation = ds["SOILSAT_TOP"].load()

CPU times: user 435 ms, sys: 139 ms, total: 574 ms
Wall time: 962 ms


And another 1 – 1.5 seconds to read the data. The logs will help us figure out what's going on.

## Inspecting the logs

We wrote a bunch of output to `log.txt`, which we'll go through now. For some context, the overall workflow here is that `xarray` uses `h5netcdf` to load the NetCDF file. `h5netcdf` will "open" the "file" we give it, and do a bunch of seeks and reads to parse the HDF5 file format. But, crucially, we don't have a regular file here. Instead, we have the fsspec file-like object that `fs.open` returned. When `h5netcdf` reads the first eight bytes, fsspec will go off and make an HTTP request to download that data from Blob Storage.
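To make the relationship between "reads" and HTTP requests concrete, here is a toy simulation. It is only a sketch: `FakeRemoteFile`, its block size, and the read offsets are all made up for illustration, and the caching logic is a simplified read-ahead scheme, not fsspec's actual implementation (fsspec ships several caching strategies of its own).

```python
class FakeRemoteFile:
    """Toy stand-in for a remote blob read through a read-ahead cache.

    Hypothetical helper for illustration only; not fsspec's real code.
    """

    def __init__(self, payload: bytes, block_size: int):
        self.payload = payload
        self.block_size = block_size
        self.loc = 0
        self.reads = 0      # read() calls made by the file-format library
        self.requests = 0   # simulated HTTP range requests
        self._cache_start = 0  # cached byte range is [start, end)
        self._cache_end = 0

    def seek(self, offset: int) -> int:
        self.loc = min(offset, len(self.payload))
        return self.loc

    def read(self, length: int) -> bytes:
        self.reads += 1
        start = self.loc
        end = min(start + length, len(self.payload))
        if start < self._cache_start or end > self._cache_end:
            # Cache miss: "fetch" the requested bytes plus some read-ahead.
            self.requests += 1
            self._cache_start = start
            self._cache_end = min(
                len(self.payload), max(end, start + self.block_size)
            )
        self.loc = end
        return self.payload[start:end]


# Made-up access pattern: 10 clusters of 13 small reads each.
remote = FakeRemoteFile(payload=bytes(1_000_000), block_size=64 * 1024)
for region in range(10):
    for j in range(13):
        remote.seek(region * 100_000 + j * 256)
        remote.read(256)

print(f"{remote.reads} reads -> {remote.requests} simulated range requests")
```

In this toy setup, small reads that land close together are served from one cached block, while each jump to a distant offset costs a new request. The real ratio depends on the cache strategy, the block size, and how the file's metadata is laid out.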

So to understand the performance, we'll want to look for file reads and HTTP requests.


1. Look at the number of "reads" by xarray / h5netcdf (~ 130!)
2. Count the number of HTTP requests (~ 13!)
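
One way to do that counting is with a small script rather than by eye. The sketch below is hedged: the two regular expressions are assumptions about what the log lines look like (they depend on the installed fsspec and azure-core versions and on the logging format), and the sample lines are synthetic, not real output. Check a few lines of your own `log.txt` and adjust the patterns before trusting the numbers.

```python
import re

# ASSUMED patterns; verify against your own log.txt before relying on them.
PATTERNS = {
    "reads": re.compile(r"read: \d+ - \d+"),
    "requests": re.compile(r"Request URL:"),
}


def count_log_events(lines, patterns):
    """Count how many lines match each named pattern."""
    counts = {name: 0 for name in patterns}
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Synthetic lines for demonstration only (not actual library output).
sample_lines = [
    "DEBUG:example.file:<File demo.nc> read: 0 - 8",
    "INFO:example.http:Request URL: 'https://example.invalid/demo.nc'",
    "DEBUG:example.file:<File demo.nc> read: 8 - 16",
    "DEBUG:example.file:<File demo.nc> read: 96 - 608",
    "INFO:example.http:Request URL: 'https://example.invalid/demo.nc'",
    "INFO:example.http:Response status: 206",
]

counts = count_log_events(sample_lines, PATTERNS)
print(counts)
```

To run it on the real log, pass an open file instead of the sample list, for example `count_log_events(open("log.txt"), PATTERNS)`.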

To summarize the timings (in seconds):

| Stage | Download workflow (s) | Cloud-native workflow (s) |
| --- | --- | --- |
| Download | 0.5 | - |
| Metadata | 0.02 | 1.5 |
| Data | 0.5 | 1.5 |
| **Total** | **1.02** | **3.0** |

Not looking so good for the "cloud-native" way, huh? Stay tuned!
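
As a quick check on the arithmetic, the totals can be recomputed from the per-stage numbers quoted in the summary. The values below are the rounded figures from this notebook, not fresh measurements, and the dictionary layout is just an illustrative choice.

```python
# Per-stage timings in seconds, copied from the summary above.
timings = {
    "download": {"Download": 0.5, "Metadata": 0.02, "Data": 0.5},
    "stream": {"Metadata": 1.5, "Data": 1.5},
}

totals = {name: round(sum(stages.values()), 2) for name, stages in timings.items()}
slowdown = totals["stream"] / totals["download"]

print(totals)
print(f"Streaming is about {slowdown:.1f}x slower for this single file")
```

Note that almost all of the gap comes from the metadata stage, which is the part the small, scattered reads affect most.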

## Lessons

In the cloud-native approach, we only download data on demand. This works great for cloud-friendly file formats like Cloud Optimized GeoTIFF and Zarr.

With `fsspec` reading *non*-cloud-optimized files (HDF5 files), we can emulate the cloud-native workflow where we download data on demand. It's extremely convenient, but has *very* different performance characteristics compared to a local file system. In general, the cloud-native approach works best when

1. Metadata is in a consolidated location (true for COG, Zarr; not true for HDF5, GRIB)
2. You're accessing a subset of the file (think of reading a small spatial subset of a large COG; a "download" model would download the whole file, wasting a bunch of bandwidth)
3. You're accessing the data in parallel
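
The third point, parallel access, can be illustrated without touching the network. In this sketch `fake_fetch` is a hypothetical stand-in for one remote read: it just sleeps for an assumed 50 ms of latency, and the file names are invented. Because each request spends most of its time waiting, a thread pool lets the waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(name, latency=0.05):
    """Hypothetical stand-in for one remote read; sleeps to mimic latency."""
    time.sleep(latency)
    return f"contents of {name}"


# Invented file names, for illustration only.
names = [f"file-{i:02d}.nc" for i in range(8)]

start = time.perf_counter()
serial = [fake_fetch(name) for name in names]
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # pool.map preserves input order, so results line up with `names`.
    parallel = list(pool.map(fake_fetch, names))
parallel_seconds = time.perf_counter() - start

print(f"serial:   {serial_seconds:.2f} s")
print(f"parallel: {parallel_seconds:.2f} s")
```

The serial loop pays the latency once per file, while the threaded version pays it roughly once overall. Real reads also depend on bandwidth and on how much decoding work happens on the CPU, so actual speedups will be smaller than this idealized case suggests.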

In the next couple of notebooks we'll look at some better ways to access these data on the cloud. We'll see

1. Kerchunk: optimizes *metadata reads* from the *existing files*, acting as a cloud-optimized layer on top of non-cloud-optimized files
2. Converting the data to a cloud-friendly format (Zarr, GeoParquet)

## Aside: Cloud-Native on the Planetary Computer

It's worth taking a quick look at what going all in on this cloud-native approach gets you.

For example, the [Planetary Computer Explorer](https://planetarycomputer.microsoft.com/explore?c=124.0274%2C-16.4940&z=9.00&v=2) is a web application that lets you explore a bunch of datasets hosted on the Planetary Computer. The National Water Model isn't there yet, because it's in NetCDF and isn't cataloged in STAC.

Next up, let's move to [using-kerchunk](using-kerchunk.ipynb).