# 10: Scaling to Large Datasets

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Austfi/xsnowForPatrol/blob/main/notebooks/10_large_datasets.ipynb)

Regional archives can contain hundreds of stations and decades of simulations. This notebook covers strategies for scaling xsnow analysis to that size: Dask-backed chunking, lazy computation, windowed streaming, and on-disk caching.

## What You'll Learn

- Profiling memory footprint and coordinate sizes
- Chunking xsnow datasets for Dask-backed computation
- Executing lazy computations with progress monitoring
- Streaming subsets from disk to keep memory usage low
- Persisting derived products efficiently


## Installation (For Colab Users)

If you're using Google Colab, run the cell below to install xsnow and dependencies. If you're running locally and have already installed xsnow, you can skip this cell.


In [None]:
%pip install -q numpy pandas xarray matplotlib seaborn dask netcdf4 zarr
%pip install -q git+https://gitlab.com/avacollabra/postprocessing/xsnow


In [None]:
import pandas as pd
import numpy as np
import xarray as xr
import dask.array as da
from dask.diagnostics import ProgressBar
import matplotlib.pyplot as plt
import seaborn as sns
import xsnow

sns.set(style='whitegrid', context='talk')


In [None]:
print("Loading xsnow sample data for large dataset workflows...")
print("Using xsnow.single_profile_timeseries() as a lightweight proxy")
print()

try:
    ds = xsnow.single_profile_timeseries()
    # Fall back to the object itself if it has no `.data` attribute.
    base_ds = getattr(ds, 'data', ds)
    print("✅ Data loaded successfully!")
    print(base_ds)
except Exception as exc:
    print(f"❌ Error loading sample data: {exc}")
    print("\nMake sure xsnow is properly installed:")
    print("  pip install git+https://gitlab.com/avacollabra/postprocessing/xsnow")
    ds = None
    base_ds = None


## Part 1: Inspect Dataset Size

Review coordinate sizes and estimated memory usage to plan chunking strategies.


In [None]:
if base_ds is not None:
    size_bytes = base_ds.nbytes if hasattr(base_ds, 'nbytes') else sum(v.nbytes for v in base_ds.data_vars.values())
    print(f"Approximate in-memory size: {size_bytes / 1e6:.2f} MB")
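    # `sizes` is the dimension-to-length mapping; iterating `dims` as a mapping is deprecated in recent xarray.
    for dim, length in base_ds.sizes.items():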
        print(f"Dimension {dim}: {length}")
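
A per-variable breakdown often shows that a few layered variables dominate the footprint. The sketch below ranks variables by `nbytes` on a small synthetic dataset. The variable names (`temperature`, `density`, `snow_depth`), the dimension sizes, and the helper `memory_by_variable` are assumptions for illustration, not the xsnow sample schema.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for an xsnow dataset; names and sizes are illustrative assumptions.
n_time, n_layer = 240, 50
demo = xr.Dataset(
    {
        "temperature": (("time", "layer"), np.zeros((n_time, n_layer), dtype="float64")),
        "density": (("time", "layer"), np.zeros((n_time, n_layer), dtype="float32")),
        "snow_depth": (("time",), np.zeros(n_time, dtype="float64")),
    }
)


def memory_by_variable(ds):
    """Return (name, megabytes) pairs sorted from largest to smallest."""
    sizes = {name: var.nbytes / 1e6 for name, var in ds.data_vars.items()}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


report = memory_by_variable(demo)
for name, megabytes in report:
    print(f"{name:<12} {megabytes:8.3f} MB  dims={dict(demo[name].sizes)}")
```

Because `nbytes` is derived from shape and dtype, the same helper should also work on lazily loaded, Dask-backed variables without triggering a read.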


## Part 2: Apply Chunking

Use xarray's chunking API to prepare for Dask parallelism.


In [None]:
if base_ds is not None:
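    # Target chunk lengths per dimension.
    chunk_plan = {'time': 48, 'layer': 60}
    # Keep only dimensions present in the dataset; chunk() raises on unknown names.
    chunk_plan = {dim: size for dim, size in chunk_plan.items() if dim in base_ds.dims}
    chunked = base_ds.chunk(chunk_plan)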
    print(chunked)
else:
    chunked = None
    print("Dataset not available.")
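
Chunk sizes are a trade-off: very small chunks make scheduling overhead dominate, while very large chunks may not fit in memory. The arithmetic can be sketched as below. The helper `chunk_length_for_target`, the 100 MB target, and the archive dimensions are all hypothetical values chosen for illustration, not recommendations from xsnow.

```python
import math


def chunk_length_for_target(other_dims_size, itemsize, target_bytes, dim_length):
    """Length along one dimension so that a chunk holds at most ~target_bytes.

    other_dims_size: product of the chunk lengths along all other dimensions
    itemsize:        bytes per element (8 for float64)
    target_bytes:    desired upper bound for one chunk
    dim_length:      full length of the dimension being chunked
    """
    bytes_per_step = other_dims_size * itemsize
    length = max(1, target_bytes // bytes_per_step)
    return min(length, dim_length)


# Hypothetical archive: 300 locations x 60 layers of float64, hourly for ten years.
n_time = 10 * 365 * 24
time_chunk = chunk_length_for_target(
    other_dims_size=300 * 60, itemsize=8, target_bytes=100_000_000, dim_length=n_time
)
n_chunks = math.ceil(n_time / time_chunk)
print(f"time chunk: {time_chunk} steps, {n_chunks} chunks per variable")
```

The resulting length could then be passed to `Dataset.chunk`, for example `ds.chunk({'time': time_chunk})`.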


## Part 3: Lazy Computations with Progress Bars

Combine xsnow operations with Dask diagnostics to monitor progress.


In [None]:
if 'chunked' in locals() and chunked is not None:
    if 'temperature' in chunked:
        # Building the reduction is lazy; only compute() triggers the work.
        mean_temp = chunked['temperature'].mean(dim='layer')
        with ProgressBar():
            result = mean_temp.compute()
        print(result)
    else:
        print("Temperature variable missing; adjust the computation to available variables.")
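
Reductions on a chunked dataset only build a task graph; nothing is calculated until `.compute()` (or `.load()`) runs. The sketch below shows this on synthetic data and batches two reductions into one `dask.compute` call, so chunks shared by both are typically processed once. The `temperature` name and array sizes are assumptions for illustration.

```python
import dask
import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar

# Synthetic stand-in; a real dataset would come from xsnow.
values = np.arange(96 * 10, dtype="float64").reshape(96, 10)
demo = xr.Dataset({"temperature": (("time", "layer"), values)}).chunk({"time": 24})

# These lines only describe the work; the results are still Dask-backed.
layer_mean = demo["temperature"].mean(dim="layer")
layer_max = demo["temperature"].max(dim="layer")
print(type(layer_mean.data))

# One compute call evaluates both reductions together.
with ProgressBar():
    mean_result, max_result = dask.compute(layer_mean, layer_max)

print(mean_result.values[:3], max_result.values[:3])
```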


## Part 4: Stream Data in Windows

Iterate through manageable time windows instead of loading everything at once.


In [None]:
if base_ds is not None and 'time' in base_ds.coords:
    window = pd.Timedelta('7D')
    times = pd.DatetimeIndex(base_ds['time'].values)
    start, end = times[0], times[-1]
    current = start
    summaries = []
    # Use <= so a final timestamp that lands on a window boundary is still covered.
    while current <= end:
        # Label-based slices include both endpoints, which would count boundary
        # timestamps twice; select the half-open window [current, current + window) instead.
        in_window = np.flatnonzero((times >= current) & (times < current + window))
        subset = base_ds.isel(time=in_window)
        if 'density' in subset.data_vars and in_window.size > 0:
            summaries.append({
                'window_start': current,
                'window_end': current + window,
                'mean_density': float(subset['density'].mean().values),
            })
        current += window
    summary_df = pd.DataFrame(summaries)
    display(summary_df.head())
else:
    print('Dataset missing time coordinate.')
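
When each window only needs a reduction, xarray's `resample` is an alternative to an explicit loop: it groups timestamps into fixed-width bins and applies the reduction per bin. This is a different technique from manual windowing, sketched here on synthetic daily data. The `density` values and dates are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Fifteen hypothetical daily values, 0.0 through 14.0.
times = pd.date_range("2024-01-01", periods=15, freq="D")
demo = xr.Dataset(
    {"density": (("time",), np.arange(15, dtype="float64"))},
    coords={"time": times},
)

# Group into 7-day bins starting at the first day and average each bin.
weekly = demo["density"].resample(time="7D").mean()
print(weekly.to_series())
```

On a Dask-backed dataset the resampled result stays lazy until it is computed, so this can be combined with chunking.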


## Part 5: Persist Results to Disk

Save intermediate products to a Zarr store (or to NetCDF with `to_netcdf`) so later sessions can reuse them without recomputation.


In [None]:
if 'chunked' in locals() and chunked is not None:
    target = chunked[['temperature', 'density']] if all(v in chunked.data_vars for v in ['temperature', 'density']) else chunked
    target = target.isel(location=0) if 'location' in target.dims else target
    output_path = 'cache_chunked.zarr'
    target.to_zarr(output_path, mode='w')
    print(f"Persisted chunked subset to {output_path} (overwrite mode).")


## Summary

- Inspect dimension sizes before loading everything into memory.
- Chunk data to leverage Dask and monitor computations with progress bars.
- Stream processing and persistent caches keep workflows responsive on large archives.

**Next steps:** Explore scheduling and distributed clusters in `10a_large_datasets_performance.ipynb`.
