---
title: Reading RASI Virtualizarr for Rio Grande
author: Maheshwari Neelam
date: 'November 19, 2025'
execute:
  cache: true
  freeze: true
---


You can launch this notebook on Binder by clicking the button below.

<a href="https://binder.openveda.cloud/v2/gh/NASA-IMPACT/veda-docs/HEAD?labpath=user-guide/notebooks/tutorials/netcdf-to-cog-cmip6.ipynb">
<img src="https://binder.openveda.cloud/badge_logo.svg" alt="Binder" title="A cute binder" width="150"/> 
</a>


## Approach

[Virtual Zarr with Icechunk](https://icechunk.io/en/stable/virtual/) is a cloud-optimized way to store and access large geospatial datasets without duplicating the underlying data: the store holds references to chunks inside the original files rather than copies of them. Array data in cloud object storage can then be read chunk by chunk, so only the requested subset is transferred.
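To make the idea of a virtual reference concrete, here is a minimal toy sketch. It assumes a reference is simply a location plus a byte range; `VirtualChunkRef`, `read_chunk`, and the `s3://example-bucket/...` URL are hypothetical names for illustration and are not part of the Icechunk API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualChunkRef:
    """Hypothetical stand-in for a virtual chunk reference (not an Icechunk class)."""
    location: str  # URL of the original file that holds the bytes
    offset: int    # byte offset of the chunk inside that file
    length: int    # number of bytes in the chunk


def read_chunk(ref, objects):
    """Return the bytes a reference points at; `objects` maps URL -> file bytes."""
    blob = objects[ref.location]
    return blob[ref.offset : ref.offset + ref.length]


# Toy "object store": one file with a 6-byte header followed by two 7-byte chunks.
objects = {"s3://example-bucket/file.nc": b"HEADERchunk-0chunk-1"}

# The "virtual store" keeps only references, never a second copy of the data.
manifest = {
    "0.0": VirtualChunkRef("s3://example-bucket/file.nc", offset=6, length=7),
    "0.1": VirtualChunkRef("s3://example-bucket/file.nc", offset=13, length=7),
}

print(read_chunk(manifest["0.1"], objects))
```

A real store resolves such references with ranged reads against object storage, which is why the original files must stay in place and readable.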

Reading RASI data from virtual Zarr stores enables rapid analysis and visualization of hydrological variables across specific watersheds and HUCs (Hydrologic Unit Codes).

This tutorial shows how to read virtual Zarr data from S3 and create spatial and temporal visualizations for the Rio Grande HUC using [Xarray](https://github.com/pydata/xarray), [Icechunk](https://icechunk.io/), and [Cartopy](https://scitools.org.uk/cartopy/).

It covers:

1. Accessing RASI virtual Zarr stores on S3
2. Subsetting data for the Rio Grande basin (HUC 13)
3. Creating spatial mean time series and temporal mean maps
4. Handling fill values and data quality issues

## Step by step

### Step 0 - Imports

In [None]:
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import icechunk
import pandas as pd
import numpy as np

### Step 1 - Configuration

In [None]:
s3_bucket = "nasa-waterinsight"
s3_prefix = "virtual-zarr-store/icechunk/RASI/HISTORICAL"
s3_region = "us-west-2"
virtual_data_path = "s3://nasa-waterinsight/RASI/"  # Where actual data chunks are stored
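
A small sanity check on the virtual chunk prefix can catch typos before any network call is made. The sketch below assumes the prefix should be an `s3://` URL ending in a slash (the trailing-slash requirement is an assumption about Icechunk, not something this tutorial confirms); `check_s3_prefix` is a hypothetical helper.

```python
from urllib.parse import urlparse


def check_s3_prefix(url):
    """Validate an s3:// prefix and return it with a trailing slash (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3://bucket/... URL, got {url!r}")
    # Assumption: a prefix is safest when it ends with '/'.
    return url if url.endswith("/") else url + "/"


print(check_s3_prefix("s3://nasa-waterinsight/RASI"))
```

In the notebook this could wrap the existing value, for example `virtual_data_path = check_s3_prefix(virtual_data_path)`.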


### Step 2 - Configure storage and the virtual chunk container

In [None]:
storage = icechunk.s3_storage(
    bucket=s3_bucket,
    region=s3_region,
    prefix=s3_prefix,
    anonymous=True
)

config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        url_prefix=virtual_data_path,
        store=icechunk.s3_store(region=s3_region, anonymous=True)
    )
)

### Step 3 - Set up anonymous credentials for virtual chunks and open the repository

In [None]:
virtual_credentials = icechunk.containers_credentials(
    {virtual_data_path: icechunk.s3_anonymous_credentials()}
)

repo = icechunk.Repository.open(
    storage=storage, 
    config=config,
    authorize_virtual_chunk_access=virtual_credentials
)
session = repo.readonly_session(branch="main")

### Step 4 - Load dataset

In [None]:
ds = xr.open_zarr(session.store, consolidated=False)
print("Dataset loaded:")
print(ds)

# Use TotalPrecip_Percentiles variable
data_var = "TotalPrecip_Percentiles"
print(f"\nVisualizing: {data_var}")
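
Because the variable name is hardcoded, a misspelling only surfaces later as a bare `KeyError`. A minimal sketch of a friendlier lookup follows; `require_variable` is a hypothetical helper, and `Runoff_Percentiles` is a made-up name used only to populate the example list.

```python
import difflib


def require_variable(available, wanted):
    """Return `wanted` if present, else raise KeyError with close-match suggestions."""
    names = list(available)
    if wanted in names:
        return wanted
    close = difflib.get_close_matches(wanted, names, n=3)
    hint = f" Did you mean: {', '.join(close)}?" if close else ""
    raise KeyError(f"{wanted!r} not found.{hint}")


# Illustrative names only; the real list would come from the opened dataset.
example_names = ["TotalPrecip_Percentiles", "Runoff_Percentiles"]
print(require_variable(example_names, "TotalPrecip_Percentiles"))
```

With the dataset loaded, the call would look like `data_var = require_variable(ds.data_vars, "TotalPrecip_Percentiles")`, since iterating over `ds.data_vars` yields variable names.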


### Step 5 - Check the dataset

In [None]:
# Check variable attributes
print(f"\nVariable attributes:")
for attr, value in ds[data_var].attrs.items():
    print(f"  {attr}: {value}")

# Check raw data range before any processing
print(f"\nRaw data info:")
print(f"  Data type: {ds[data_var].dtype}")
print(f"  Shape: {ds[data_var].shape}")
print(f"  Chunk size: {ds[data_var].chunks}")
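
The shape and dtype printed above are enough to estimate how much memory a variable would need if it were loaded in full, which helps decide whether to subset first. This is a rough sketch; `estimate_nbytes` and `human_size` are hypothetical helpers, and the shape in the example is invented.

```python
import math


def estimate_nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with this shape and element size."""
    return math.prod(shape) * itemsize


def human_size(nbytes):
    """Format a byte count using binary units."""
    size = float(nbytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Invented example: 12 time steps on a 100 x 200 grid of 4-byte floats.
print(human_size(estimate_nbytes((12, 100, 200), 4)))
```

For the real variable, the inputs would be `ds[data_var].shape` and `ds[data_var].dtype.itemsize`.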

### Step 6 - Subset for Rio Grande basin (approximate HUC 13 extent)

In [None]:
# Rio Grande basin coordinates: ~26°N to 39°N, ~107°W to 97°W
rio_grande = ds.sel(
    lat=slice(26, 39),
    lon=slice(-107, -97)
)
print(f"\nSubset to Rio Grande basin:")
print(f"  Lat range: {float(rio_grande.lat.min()):.2f}° to {float(rio_grande.lat.max()):.2f}°")
print(f"  Lon range: {float(rio_grande.lon.min()):.2f}° to {float(rio_grande.lon.max()):.2f}°")

# Select median percentile (50th)
data_median = rio_grande[data_var].sel(percentile=50)

# Load a sample to check actual values
sample = data_median.isel(time=0).compute()
print(f"\nSample data (first timestep) - BEFORE masking:")
print(f"  Min: {float(sample.min()):.4f}")
print(f"  Max: {float(sample.max()):.4f}")
print(f"  Mean: {float(sample.mean()):.4f}")
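
Label-based slicing with `slice(26, 39)` assumes the latitude coordinate is stored in ascending order. On a descending coordinate the same slice selects nothing, and the bounds have to be reversed. A minimal sketch of a helper that handles both cases follows; `bounds_slice` is a hypothetical name.

```python
def bounds_slice(first, last, lo, hi):
    """Build a slice for label-based selection on a monotonic coordinate.

    `first` and `last` are the coordinate's first and last values;
    `lo` and `hi` are the desired bounds in either order.
    """
    if lo > hi:
        lo, hi = hi, lo
    # Ascending coordinate: slice(lo, hi). Descending coordinate: slice(hi, lo).
    return slice(lo, hi) if first <= last else slice(hi, lo)


print(bounds_slice(25.0, 53.0, 26, 39))  # ascending latitudes
print(bounds_slice(53.0, 25.0, 26, 39))  # descending latitudes
```

With the dataset open, this could be used as `ds.sel(lat=bounds_slice(float(ds.lat[0]), float(ds.lat[-1]), 26, 39))`. Checking that the subset is not empty before computing statistics is a cheap safeguard.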

### Step 7 - Mask fill values (-9999)

In [None]:
# Mask fill values (-9999)
data_median_masked = data_median.where(data_median > -9000)

sample_masked = data_median_masked.isel(time=0).compute()
print(f"\nSample data (first timestep) - AFTER masking fill values:")
print(f"  Min: {float(sample_masked.min()):.4f}")
print(f"  Max: {float(sample_masked.max()):.4f}")
print(f"  Mean: {float(sample_masked.mean()):.4f}")
print(f"  Valid pixels: {(~np.isnan(sample_masked)).sum().values} / {sample_masked.size}")
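
To see why masking matters, the toy example below compares a mean computed with and without a single `-9999` fill value. It uses plain NumPy on an invented 2 x 2 array and applies the same `> -9000` threshold as the cell above.

```python
import numpy as np

FILL = -9999.0

# Invented 2 x 2 field with one missing cell encoded as the fill value.
values = np.array([[1.0, 2.0], [FILL, 3.0]])

# Replace fill values with NaN, using the same threshold as the tutorial.
masked = np.where(values > -9000, values, np.nan)

raw_mean = values.mean()        # contaminated by the fill value
clean_mean = np.nanmean(masked)  # ignores the NaN cell

print(f"raw mean:   {raw_mean}")
print(f"clean mean: {clean_mean}")
```

One fill value out of four cells is enough to drag the mean far below zero, so the masking has to happen before any aggregation.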

### Step 8 - Calculate spatial and temporal metrics

In [None]:
# Calculate spatial mean time series (simple mean over Rio Grande basin for each time step)
# Using masked data to exclude fill values
spatial_mean_ts = data_median_masked.mean(dim=['lat', 'lon'], skipna=True)

# Convert to a pandas Series indexed by time (despite the name, ts_df is a Series)
ts_df = spatial_mean_ts.to_pandas()

# Calculate temporal mean (mean over time for each grid cell)
temporal_mean = data_median_masked.mean(dim='time', skipna=True)
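
The spatial mean above treats every grid cell equally. On a regular latitude-longitude grid, cells shrink toward the poles, so a common refinement is to weight each cell by the cosine of its latitude. The sketch below shows that alternative in plain NumPy; it is not what the tutorial computes, and `area_weighted_mean` is a hypothetical helper.

```python
import numpy as np


def area_weighted_mean(field, lat_deg):
    """Cosine-latitude weighted mean of a (lat, lon) field, ignoring NaN cells."""
    field = np.asarray(field, dtype=float)
    weights = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=float)))[:, np.newaxis]
    weights = np.broadcast_to(weights, field.shape)
    valid = ~np.isnan(field)
    weighted_sum = (np.where(valid, field, 0.0) * weights).sum()
    return float(weighted_sum / (weights * valid).sum())


# Invented field: one row at the equator, one row at 60 degrees north.
field = np.array([[1.0, 1.0], [3.0, 3.0]])
print(area_weighted_mean(field, [0.0, 60.0]))
```

xarray offers the same idea directly through `DataArray.weighted`, for example `data_median_masked.weighted(np.cos(np.deg2rad(data_median_masked.lat))).mean(dim=["lat", "lon"])`. Over a basin spanning roughly 26°N to 39°N the difference from the simple mean is likely small but not zero.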


### Step 9 - Create figure with two subplots

In [None]:
fig = plt.figure(figsize=(18, 10))

# 1. Spatial Mean Time Series (top)
ax1 = plt.subplot(2, 1, 1)
ax1.plot(ts_df.index, ts_df.values, linewidth=2, marker='o', markersize=8, 
         alpha=0.8, color='steelblue', label='Spatial Mean')
ax1.set_xlabel('Date', fontsize=12, fontweight='bold')
ax1.set_ylabel(f"{data_var.replace('_', ' ')} (m³/s)", fontsize=12, fontweight='bold')
ax1.set_title(f'Rio Grande Basin - {data_var.replace("_", " ")} Spatial Mean Time Series\n(Note: Units metadata shows m³/s but may represent precipitation)', 
             fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
plt.setp(ax1.xaxis.get_majorticklabels(), rotation=45, ha='right')

# 2. Temporal Mean Map (bottom)
ax2 = plt.subplot(2, 1, 2, projection=ccrs.PlateCarree())
im = temporal_mean.plot(
    ax=ax2,
    transform=ccrs.PlateCarree(),
    cmap='Blues',
    cbar_kwargs={'label': f'{data_var.replace("_", " ")} (m³/s)', 'shrink': 0.8}
)
ax2.coastlines(resolution='50m', linewidth=0.5)
ax2.add_feature(cfeature.BORDERS, linewidth=0.5, edgecolor='gray')
ax2.add_feature(cfeature.STATES, linewidth=0.3, edgecolor='gray', alpha=0.5)
# ax2.add_feature(cfeature.RIVERS, linewidth=1, edgecolor='blue', alpha=0.5)
ax2.gridlines(draw_labels=True, linewidth=0.5, alpha=0.5)
ax2.set_title(f'Rio Grande Basin - {data_var.replace("_", " ")} Temporal Mean', 
             fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('rasi_rio_grande_analysis.png', dpi=300, bbox_inches='tight')
print("\n✓ Plot saved as rasi_rio_grande_analysis.png")
print(f"\nSpatial Mean Statistics (after masking fill values):")
print(f"  Mean: {ts_df.mean():.4f} m³/s")
print(f"  Std Dev: {ts_df.std():.4f} m³/s")
print(f"  Min: {ts_df.min():.4f} (on {ts_df.idxmin().strftime('%Y-%m')})")
print(f"  Max: {ts_df.max():.4f} (on {ts_df.idxmax().strftime('%Y-%m')})")
print(f"\n⚠️  Note: Metadata says units are 'm³/s' which is unusual for precipitation.")
print(f"     Values now look reasonable after masking -9999 fill values.")
plt.show()
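
The plotting code hardcodes `m³/s` in several labels even though the note questions that unit. One option is to build labels from the variable's own attributes so that a metadata fix propagates automatically. This is a minimal sketch; `axis_label` is a hypothetical helper, and it assumes the unit is stored under a `units` attribute.

```python
def axis_label(name, attrs, default_units="unknown units"):
    """Build a plot label from a variable name and its attribute mapping."""
    units = attrs.get("units", default_units)
    return f"{name.replace('_', ' ')} ({units})"


print(axis_label("TotalPrecip_Percentiles", {"units": "m³/s"}))
print(axis_label("TotalPrecip_Percentiles", {}))
```

In the notebook the call would be `axis_label(data_var, ds[data_var].attrs)`, used for both the y-axis label and the colorbar label.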