## Problems with data access

This notebook performs a similar operation to the last notebook: loading some NWM data from Azure Blob Storage. But instead of actually using the data, we'll dig into some potential performance issues with using the data as-is. We'll compare two ways of getting the data:

1. A "download" workflow, where you download the files ahead of time to your local disk.
2. A "cloud-native" workflow, where you read the data directly from Blob Storage.

In [2]:
import azure.storage.blob
import planetary_computer
import adlfs
import xarray as xr

fs = adlfs.AzureBlobFileSystem(
    "noaanwm", credential=planetary_computer.sas.get_token("noaanwm", "nwm").token
)

## Download workflow

First, we'll use the "download" workflow, where we download the file to local disk before opening it.

In [5]:
import urllib.request

In [6]:
%time filename, headers = urllib.request.urlretrieve("https://noaanwm.blob.core.windows.net/nwm/nwm.20230123/short_range/nwm.t00z.short_range.land.f001.conus.nc")
%time ds = xr.open_dataset(filename)
%time soil_saturation = ds["SOILSAT_TOP"].load()

CPU times: user 74 ms, sys: 76.5 ms, total: 150 ms
Wall time: 432 ms
CPU times: user 19.3 ms, sys: 0 ns, total: 19.3 ms
Wall time: 19.3 ms
CPU times: user 320 ms, sys: 131 ms, total: 451 ms
Wall time: 451 ms
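
The `%time` magic only works inside IPython or Jupyter. Outside a notebook, a small helper along these lines could record the same per-stage wall times. This is a minimal sketch: `timed` and `timings` are hypothetical names introduced here, and the summed range is just a stand-in for a download, open, or load step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("example", timings):
    total = sum(range(100_000))  # stand-in for a download, open, or load step

print(timings)
```

Wrapping each of the three stages in its own `with timed(...)` block would give one entry per stage in `timings`.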


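To summarize the timings (wall time, in seconds), alongside the streaming numbers measured in the cloud-native section below:

| Stage | Download (s) | Stream (s) |
| --- | --- | --- |
| Download file | 0.4 | - |
| Read metadata | 0.02 | 2.8 |
| Load data | 0.45 | 1.0 |
| **Total** | **0.87** | **3.8** |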

Not looking so good for the "cloud-native" way, huh?
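
To put a number on that, here is a quick sketch that redoes the arithmetic from the summary table. The values are copied from the table as-is; the dictionary names are just for illustration.

```python
# Wall times in seconds, copied from the summary table above.
download = {"download": 0.4, "metadata": 0.02, "data": 0.45}
stream = {"metadata": 2.8, "data": 1.0}

download_total = sum(download.values())
stream_total = sum(stream.values())

print(f"download total: {download_total:.2f} s")
print(f"stream total:   {stream_total:.2f} s")
print(f"streaming is about {stream_total / download_total:.1f}x slower")
print(f"metadata is {stream['metadata'] / stream_total:.0%} of the streaming time")
```

Most of the gap comes from the metadata step rather than from loading the data itself.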

## Cloud-native model

Did that `open_dataset` in the last notebook feel a bit slow? Let's do some timings and logging.

In [2]:
import logging
import pathlib

p = pathlib.Path("log.txt")
p.unlink(missing_ok=True)

logger = logging.getLogger()
logging.basicConfig(level=logging.DEBUG, filename="log.txt")

In [3]:
%%time
prefix = "nwm/nwm.20230123"

ds = xr.open_dataset(
    fs.open(f"{prefix}/short_range/nwm.t00z.short_range.land.f001.conus.nc")
)
display(ds)

CPU times: user 2.22 s, sys: 140 ms, total: 2.36 s
Wall time: 2.72 s


So about 2-3 seconds *just to read the metadata*. Let's load up some data too.

In [4]:
logger.info(f"{' Reading Data ':=^80}")

%time soil_saturation = ds["SOILSAT_TOP"].load()

CPU times: user 426 ms, sys: 157 ms, total: 583 ms
Wall time: 930 ms


## Inspecting the logs

We wrote a bunch of output to `log.txt`. Let's see what was going on.

1. Look at the number of "reads" by xarray / h5netcdf (~ 130!)
2. Count the number of HTTP requests (~ 13!)
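
One way to do that counting is to scan `log.txt` for marker substrings. The sketch below is hedged: `count_matching_lines` is a hypothetical helper, and the two marker strings are assumptions about what the file-system layer and the Azure SDK write at DEBUG level. Open your own `log.txt` first and adjust the markers to match what you actually see.

```python
from pathlib import Path


def count_matching_lines(path, markers):
    """Count, for each marker substring, how many log lines contain it."""
    counts = {marker: 0 for marker in markers}
    with Path(path).open(errors="replace") as f:
        for line in f:
            for marker in markers:
                if marker in line:
                    counts[marker] += 1
    return counts


# ASSUMPTION: these substrings identify file reads and HTTP requests in the log.
READ_MARKER = " read: "
REQUEST_MARKER = "Request URL"

if Path("log.txt").exists():
    print(count_matching_lines("log.txt", [READ_MARKER, REQUEST_MARKER]))
```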

## Lessons

This cloud-native approach works by intercepting `read` calls and doing HTTP requests for you. This can be extremely convenient, but has *very* different performance characteristics compared to a local file system, because each request pays a network round trip. In general, the cloud-native approach works best when:

1. Metadata is in a consolidated location (true for COG, Zarr; not true for HDF5, GRIB)
2. You're accessing a subset of the data
3. You're accessing the data in parallel
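
To make the "intercepting `read` calls" idea concrete, here is a toy sketch that needs no network access. `CountingReader` is a hypothetical class, not part of adlfs or fsspec: it wraps an in-memory file and counts reads. The 130 small reads mirror the rough read count seen in the logs; the offsets and sizes are made up for illustration.

```python
import io


class CountingReader(io.BytesIO):
    """An in-memory file that counts how often `read` is called."""

    def __init__(self, data):
        super().__init__(data)
        self.read_calls = 0
        self.bytes_read = 0

    def read(self, size=-1):
        self.read_calls += 1
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


payload = bytes(1_000_000)  # 1 MB stand-in for a remote file

# Many small, scattered reads: the pattern scattered metadata tends to trigger.
scattered = CountingReader(payload)
for offset in range(0, 130 * 4096, 4096):
    scattered.seek(offset)
    scattered.read(512)

# One larger read of the same number of bytes: what consolidated metadata allows.
consolidated = CountingReader(payload)
consolidated.read(130 * 512)

print(scattered.read_calls, scattered.bytes_read)
print(consolidated.read_calls, consolidated.bytes_read)
```

Both readers return the same number of bytes, but one needs 130 calls and the other needs one. On a local disk the difference is negligible. Over HTTP, every read that is not served from a cache can turn into its own request, which is why the layout of the metadata matters so much.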