# 05: Working with Custom Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Austfi/xsnowForPatrol/blob/main/notebooks/05_working_with_custom_data.ipynb)

This notebook shows you how to prepare and load your own SNOWPACK output files into xsnow.

## What You'll Learn

- Preparing your own .pro and .smet files
- File format requirements
- Loading custom data
- Troubleshooting common issues
- Merging multiple data sources
- Basic data validation

> **Note**: For comprehensive missing data handling, see **07_data_quality_and_cleaning.ipynb**. For performance optimization and Zarr format, see **09_performance_and_storage.ipynb**.


## Installation (For Colab Users)

If you're using Google Colab, run the cell below to install xsnow and dependencies. If you're running locally and have already installed xsnow, you can skip this cell.


In [None]:

%pip install -q numpy pandas xarray matplotlib seaborn dask netcdf4 zarr
%pip install -q git+https://gitlab.com/avacollabra/postprocessing/xsnow



In [None]:
import xsnow
import os
import glob



In [None]:
# Example: Explore xsnow sample data
import xsnow

print("xsnow provides sample datasets:")
print()

# Example 1: Single profile
print("1. Single profile (one snapshot):")
ds_single = xsnow.single_profile()
print(f"   Dimensions: {dict(ds_single.dims)}")

print()
# Example 2: Time series
print("2. Time series (multiple snapshots over time):")
ds_timeseries = xsnow.single_profile_timeseries()
print(f"   Dimensions: {dict(ds_timeseries.dims)}")


## Part 1: File Format Requirements

xsnow can read SNOWPACK output files in these formats:

### .pro Files (Profile Time Series)

- **Format**: SNOWPACK profile format (legacy)
- **Contains**: Time series of snow profiles with layer-by-layer data
- **Required**: Header with station metadata, profile data blocks
- **Generated by**: SNOWPACK when `PROF_FORMAT = PRO` in .ini file

### .smet Files (Meteorological Time Series)

- **Format**: SMET (MeteoIO format)
- **Contains**: Time series of scalar variables (no layers)
- **Required**: SMET header with field descriptions, time series data
- **Generated by**: SNOWPACK or MeteoIO for meteorological data
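
As a rough illustration of the header and data sections described above, the sketch below reads a SMET-style header into a dictionary. The sample text, the station values, and the `parse_smet_header` helper are made up for illustration and are not part of xsnow; real files may carry additional header keys.

```python
def parse_smet_header(text):
    """Return (header_dict, field_names) from SMET-style text."""
    header = {}
    in_header = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[HEADER]":
            in_header = True
        elif line == "[DATA]":
            break  # everything after this marker is time series data
        elif in_header and "=" in line:
            key, value = line.split("=", 1)
            header[key.strip()] = value.strip()
    fields = header.get("fields", "").split()
    return header, fields


# Hypothetical file content, for illustration only
sample = """SMET 1.1 ASCII
[HEADER]
station_id = TEST1
latitude = 46.8
longitude = 9.8
altitude = 2540
nodata = -999
fields = timestamp TA RH HS
[DATA]
2024-01-01T00:00 268.2 0.85 1.20
"""

header, fields = parse_smet_header(sample)
print(header["station_id"], fields)
```

Checking that the `fields` entry lists the variables you expect is a quick way to spot a misconfigured export before loading the file.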

### Other Formats

xsnow may support other formats (check documentation):
- NetCDF (if SNOWPACK outputs to NetCDF)
- Other SNOWPACK output formats


## Part 2: Preparing Your Files

### Step 1: Generate SNOWPACK Output

If you're running SNOWPACK yourself:

1. **Configure SNOWPACK** (via Inishell or .ini file):
   - Set `PROF_FORMAT = PRO` to generate .pro files
   - Configure which variables to output
   - Set output directory

2. **Run SNOWPACK** simulation

3. **Check output files**:
   - Look for `.pro` files in output directory
   - Check for `.smet` files if configured
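
Before running a long simulation, it can help to confirm the profile format setting programmatically. The sketch below is one possible check using Python's standard `configparser`. Only `PROF_FORMAT = PRO` comes from the steps above; the `[OUTPUT]` section name, the other keys, and the `profile_format` helper are assumptions for illustration, so compare them against your own .ini file.

```python
import configparser

# Hypothetical .ini excerpt; only PROF_FORMAT = PRO is taken from the text above
ini_text = """
[OUTPUT]
PROF_WRITE = TRUE
PROF_FORMAT = PRO
METEOPATH = ./output
"""


def profile_format(ini_string, section="OUTPUT"):
    """Return the PROF_FORMAT value from an .ini string, or None if absent."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(ini_string)
    if not parser.has_section(section):
        return None
    return parser.get(section, "PROF_FORMAT", fallback=None)


fmt = profile_format(ini_text)
print(f"PROF_FORMAT = {fmt}")
```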

### Step 2: Verify File Format

Let's check if your files are in the correct format:


In [None]:
# Check for .pro files in data directory
data_dir = "data"
pro_files = glob.glob(os.path.join(data_dir, "*.pro"))
smet_files = glob.glob(os.path.join(data_dir, "*.smet"))

print(f"Found {len(pro_files)} .pro files:")
for f in pro_files[:5]:  # Show first 5
    print(f"  {f}")

print(f"\nFound {len(smet_files)} .smet files:")
for f in smet_files[:5]:  # Show first 5
    print(f"  {f}")

# Quick format check
if pro_files:
    first_file = pro_files[0]
    print(f"\nInspecting first .pro file: {first_file}")
    with open(first_file, 'r') as f:
        first_lines = [f.readline() for _ in range(10)]
    # Keep the original line numbers while skipping empty lines
    non_empty = [(i + 1, line.strip()) for i, line in enumerate(first_lines) if line.strip()]
    print("First 5 non-empty lines (from the first 10 lines):")
    for line_no, text in non_empty[:5]:
        print(f"  Line {line_no}: {text[:80]}")
else:
    print("\nNo .pro files found in data directory.")
    print("Note: You can use xsnow's built-in sample data instead!")


## Part 3: Loading Your Custom Data

Now let's load your files:


**Now You Try**: After checking for files, try:
- Listing all files in a different directory
- Checking file sizes to see which files are largest
- Using `os.path.getmtime()` to find the most recently modified file


In [None]:
# Method 1: Load a single file
# Uncomment and modify path to load your own file:
# ds = xsnow.read("data/your_file.pro")


### Loading Multiple Files

You can load multiple files at once:
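
One way to build the file list is to collect paths by extension first. The sketch below uses only the standard library; `collect_snowpack_files` is a hypothetical helper, and the temporary directory with empty placeholder files exists only so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def collect_snowpack_files(directory, extensions=(".pro", ".smet")):
    """Return a sorted list of file paths in `directory` with the given extensions."""
    folder = Path(directory)
    return sorted(
        str(p) for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


# Demonstrate on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pro", "b.smet", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in collect_snowpack_files(tmp)]

print(found)
```

A list built this way can be passed to `xsnow.read()` in place of a hand-written list of paths.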


**Now You Try**: After loading a file, try:
- Loading a file from a different directory (use an absolute path)
- Loading multiple files and comparing their dimensions
- Inspecting the first few rows of data after loading


In [None]:
# Method 2: Load multiple files
# List of files
# ds = xsnow.read(['data/file1.pro', 'data/file2.pro'])

# All files in directory
# ds = xsnow.read('data/')

# Mix of .pro and .smet
# ds = xsnow.read(['data/profile.pro', 'data/meteo.smet'])


## Part 4: Troubleshooting Common Issues

### Issue 1: File Not Found

**Error**: `FileNotFoundError` or similar

**Solutions**:
- Check file path is correct
- Use absolute paths if relative paths don't work
- Verify file exists: `os.path.exists('path/to/file.pro')`


**Now You Try**: After troubleshooting file format issues, try:
- Creating a function to validate file format before loading
- Writing code to automatically detect file format (.pro vs .smet)
- Checking if files have the expected header structure


In [None]:
# Example: Check if file exists before loading
# test_file = "data/your_file.pro"
# if os.path.exists(test_file):
#     ds = xsnow.read(test_file)
# else:
#     print(f"File not found: {test_file}")


### Issue 2: Format Not Recognized

**Error**: File format not supported or parsing errors

**Solutions**:
- Verify file is actual .pro or .smet format (not just renamed)
- Check file header matches expected format
- Try opening file in text editor to inspect structure
- Check SNOWPACK version compatibility
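
A quick check of the first line can catch renamed files early. The sketch below assumes that SMET files begin with a `SMET` signature line and that .pro files begin with a `[STATION_PARAMETERS]` block; both markers, and the `sniff_format` helper, are assumptions to verify against your own files.

```python
def sniff_format(first_line):
    """Guess the file type from its first non-empty line."""
    line = first_line.strip()
    if line.startswith("SMET"):
        return "smet"
    if line.startswith("[STATION_PARAMETERS]"):
        return "pro"
    return "unknown"


print(sniff_format("SMET 1.1 ASCII"))
print(sniff_format("[STATION_PARAMETERS]"))
print(sniff_format("<html>"))
```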


In [None]:
# Inspect file header
# Uncomment to inspect your own file:
# test_file = "data/your_file.pro"
# with open(test_file, 'r') as f:
#     header_lines = [f.readline().strip() for _ in range(20)]
#     print("Header lines (first 20, non-empty):")
#     for i, line in enumerate(header_lines):
#         if line:  # Skip empty lines
#             print(f"  {i+1}: {line[:100]}")


### Issue 3: Missing Variables

**Problem**: Expected variables not in dataset

**Solutions**:
- Check SNOWPACK output configuration
- Verify variables were enabled in SNOWPACK .ini file
- Some variables may be computed by xsnow (like HS, z)
- Check variable names match xsnow's expected names


**Now You Try**: After validating your data, try:
- Creating a summary report of data quality (count of NaNs, value ranges, etc.)
- Writing a function to automatically validate multiple datasets
- Comparing validation results between different files or locations


In [None]:
# Check available variables in your dataset
# Uncomment after loading your data:
# print("Available variables in dataset:")
# for var in list(ds.data_vars.keys())[:20]:
#     print(f"  {var}: {ds[var].dims}")

# Check for common variables
# common_vars = ['density', 'temperature', 'HS', 'grain_type', 'grain_size']
# for var in common_vars:
#     if var in ds.data_vars:
#         print(f"  {var}: found")
#     else:
#         print(f"  {var}: not found")


### Issue 4: Time Alignment Problems

**Problem**: Multiple files have different time ranges or frequencies

**Solutions**:
- xsnow will try to align times automatically
- Check time ranges: `ds.coords['time'].values`
- Resample if needed: `ds.resample(time='1H').mean()`
- Manually select overlapping time periods
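
Selecting the overlapping period by hand comes down to taking the later start and the earlier end. The sketch below shows this with two made-up hourly time axes in NumPy; the arrays are illustrative and not read from any file.

```python
import numpy as np

# Two made-up hourly time axes that only partly overlap
times_a = np.arange("2024-01-01T00", "2024-01-02T00", dtype="datetime64[h]")
times_b = np.arange("2024-01-01T12", "2024-01-03T00", dtype="datetime64[h]")

# Overlap runs from the later start to the earlier end
start = max(times_a[0], times_b[0])
end = min(times_a[-1], times_b[-1])

if start <= end:
    mask_a = (times_a >= start) & (times_a <= end)
    print(f"Overlap: {start} to {end} ({mask_a.sum()} steps in series A)")
else:
    print("No overlapping period")
```

With xarray-backed data, the same bounds can be passed to `ds.sel(time=slice(start, end))`.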


In [None]:
# Check time range and frequency
# Uncomment after loading your data:
# import numpy as np
# times = ds.coords['time'].values
# print(f"Time range: {times[0]} to {times[-1]}")
# print(f"Total time steps: {len(times)}")
# 
# if len(times) > 1:
#     time_diff = times[1] - times[0]
#     print(f"Time step frequency: {time_diff}")
#     
#     # Check if times are regular
#     if len(times) > 2:
#         time_diffs = np.diff(times)
#         # Exact comparison: np.allclose is not meant for timedelta64 values
#         if (time_diffs == time_diffs[0]).all():
#             print("Times are regularly spaced")
#         else:
#             print("Times are irregularly spaced")


**Now You Try**: After merging data, try:
- Merging data from 3 or more files
- Checking that merged data has the expected dimensions
- Comparing variables before and after merging to ensure nothing was lost


## Part 5: Data Validation

After loading, validate your data:
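
The checks can be bundled into a small reusable summary. The sketch below works on a plain NumPy array with made-up density values; `summarize` is a hypothetical helper, and the 0 to 1000 kg/m³ bounds mirror the typical range used in the cell that follows.

```python
import numpy as np


def summarize(values, valid_min, valid_max):
    """Count NaNs and out-of-range entries in a numeric array."""
    values = np.asarray(values, dtype=float)
    finite = values[~np.isnan(values)]
    out_of_range = int(((finite < valid_min) | (finite > valid_max)).sum())
    return {
        "n_total": int(values.size),
        "n_nan": int(np.isnan(values).sum()),
        "n_out_of_range": out_of_range,
        "min": float(finite.min()) if finite.size else None,
        "max": float(finite.max()) if finite.size else None,
    }


# Made-up density values (kg/m³) with one gap and one implausible entry
report = summarize([120.0, 250.0, np.nan, 1500.0, 410.0], 0, 1000)
print(report)
```

For a loaded dataset, the input would be something like `ds['density'].values`.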


In [None]:
import numpy as np

# Check for NaN values
# Uncomment after loading your data:
# nan_count = ds['density'].isnull().sum().values
# total_count = ds['density'].size
# if nan_count > 0:
#     print(f"Found {nan_count} NaN values in density ({100*nan_count/total_count:.1f}%)")
# else:
#     print("No NaN values in density")

# Check for reasonable value ranges
# density_vals = ds['density'].values
# valid_vals = density_vals[~np.isnan(density_vals)]
# if len(valid_vals) > 0:
#     print(f"Density range: {valid_vals.min():.1f} to {valid_vals.max():.1f} kg/m³")
#     if valid_vals.min() < 0 or valid_vals.max() > 1000:
#         print("Density values outside typical range (0-1000 kg/m³)")
#     else:
#         print("Density values in reasonable range")

# temp_vals = ds['temperature'].values
# valid_vals = temp_vals[~np.isnan(temp_vals)]
# if len(valid_vals) > 0:
#     print(f"Temperature range: {valid_vals.min():.1f} to {valid_vals.max():.1f} °C")


## Part 6: Merging Profile and Meteorological Data


> **Note on Missing Data**: For comprehensive coverage of missing data detection, handling strategies (interpolation, filling, dropping), and cleaning pipelines, see **07_data_quality_and_cleaning.ipynb**.


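In [None]:
# The fills below are assumed reconstructions so the check runs on its own
ds = xsnow.single_profile_timeseries()  # sample data; replace with your dataset
density_mean_filled = ds['density'].fillna(ds['density'].mean())
density_median_filled = ds['density'].fillna(ds['density'].median())
# Mean per location: average over every dimension except 'location'
other_dims = [d for d in ds['density'].dims if d != 'location']
density_location_filled = ds['density'].fillna(ds['density'].mean(dim=other_dims))

# Check results
print(f"\nOriginal missing: {ds['density'].isnull().sum().values}")
print(f"After mean fill: {density_mean_filled.isnull().sum().values}")
print(f"After median fill: {density_median_filled.isnull().sum().values}")
print(f"After location mean fill: {density_location_filled.isnull().sum().values}")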


#### Strategy 4: Dropping Missing Data

Sometimes it's better to remove data with missing values:


In [None]:
# Example: Drop missing data with different strategies
# Note: use .sizes for dimension lengths (DataArray.dims is a tuple and has no .get)

# Strategy 1: Drop layers where density is missing
density_no_missing_layers = ds['density'].dropna(dim='layer')
print(f"Drop missing layers: {density_no_missing_layers.sizes.get('layer', 'N/A')} layers remaining")

# Strategy 2: Drop layers where ANY variable has a missing value (aggressive)
layers_clean = ds.dropna(dim='layer', how='any')
print(f"Drop layers with any missing: {layers_clean.sizes.get('layer', 'N/A')} layers remaining")

# Strategy 3: Drop layers only if ALL values are missing (conservative)
layers_partial = ds.dropna(dim='layer', how='all')
print(f"Drop only fully-missing layers: {layers_partial.sizes.get('layer', 'N/A')} layers remaining")

# Strategy 4: Drop time steps (whole profiles) with any missing data
time_clean = ds.dropna(dim='time', how='any')
print(f"Drop time steps with any missing: {time_clean.sizes.get('time', 'N/A')} time steps remaining")


#### Strategy 5: Multi-Step Cleaning Pipeline

For real-world data, combine multiple strategies:


In [None]:
# Comprehensive cleaning pipeline for density data
def clean_density_data(ds, variable='density'):
    """
    Multi-step cleaning pipeline for density data.
    
    Steps:
    1. Interpolate missing values along layer dimension
    2. Forward/backward fill any remaining gaps
    3. Fill with location-specific mean if still missing
    4. Drop profiles with >50% missing data
    """
    density = ds[variable].copy()
    
    # Step 1: Interpolate along layer dimension (for depth profiles)
    density = density.interpolate_na(dim='layer', method='linear')
    
    # Step 2: Forward then backward fill
    # (fillna() only accepts fill values; ffill/bfill may need the optional bottleneck package)
    density = density.ffill(dim='layer')
    density = density.bfill(dim='layer')
    
    # Step 3: Fill remaining with location-specific mean
    location_mean = density.mean(dim=['time', 'layer', 'slope', 'realization'])
    density = density.fillna(location_mean)
    
    # Step 4: Flag profiles with >50% missing (if any remain)
    missing_per_profile = density.isnull().sum(dim='layer') / density.sizes['layer']
    valid_profiles = missing_per_profile < 0.5
    
    # Note: This is simplified - valid_profiles is computed but not applied here.
    # In practice you'd use it to filter the dataset, e.g. density.where(valid_profiles).
    
    return density

# Apply cleaning pipeline
density_cleaned = clean_density_data(ds, 'density')
print("Cleaning pipeline applied:")
print(f"  Original missing: {ds['density'].isnull().sum().values}")
print(f"  After cleaning: {density_cleaned.isnull().sum().values}")
print(f"  Missing values removed: {ds['density'].isnull().sum().values - density_cleaned.isnull().sum().values}")


### Choosing the Right Strategy

**Use interpolation when:**
- Missing values are in time series data
- You have enough neighboring data points
- Values change smoothly over time/depth

**Use forward/backward fill when:**
- Missing values are at edges (surface or bottom layers)
- Values are relatively constant
- Quick fix needed

**Use statistical imputation when:**
- Missing values are scattered randomly
- You have good overall statistics
- Interpolation isn't appropriate

**Use dropping when:**
- Missing data is extensive (>50% of profile)
- Missing data indicates data quality issues
- You have enough remaining data for analysis

**Use multi-step pipeline when:**
- Data has complex missing patterns
- You need robust cleaning
- Production data processing
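
To make the differences concrete, the sketch below applies three of the strategies to one made-up density profile using plain NumPy. The values are illustrative only, and the forward fill is a simple index-based version that assumes the first entry is known.

```python
import numpy as np

# Made-up density profile (kg/m³) with an interior gap and an edge gap
profile = np.array([100.0, np.nan, 200.0, 260.0, np.nan])
idx = np.arange(profile.size)
known = ~np.isnan(profile)

# Linear interpolation; np.interp holds the end values constant outside the known range
interpolated = np.interp(idx, idx[known], profile[known])

# Forward fill: carry the last known value forward
last_known = np.maximum.accumulate(np.where(known, idx, 0))
forward_filled = profile[last_known]

# Mean imputation: replace every gap with the mean of the known values
mean_filled = np.where(known, profile, profile[known].mean())

print("interpolated:  ", interpolated)
print("forward filled:", forward_filled)
print("mean filled:   ", np.round(mean_filled, 1))
```

The interior gap is where the methods disagree most: interpolation follows the trend between neighbours, forward fill repeats the previous layer, and mean imputation ignores position entirely.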


**Now You Try**: After learning about missing data handling, try:
- Creating a validation report for your dataset showing missing data patterns
- Applying different cleaning strategies and comparing results
- Creating a custom cleaning function for your specific data needs
- Validating that cleaned data produces reasonable analysis results


## Part 5.5: Data Type Optimization

For large datasets, optimizing data types can significantly reduce memory usage and improve performance. Understanding and choosing appropriate data types (dtypes) is crucial for efficient data processing.

**What you'll see**: The examples below show how to check, convert, and optimize data types to reduce memory usage while maintaining data quality.


### Understanding Data Types

Different data types use different amounts of memory:


In [None]:
# Check current data types
print("Current data types in dataset:")
for var in list(ds.data_vars.keys())[:5]:
    dtype = ds[var].dtype
    print(f"  {var}: {dtype}")

# Check memory usage
density_size = ds['density'].nbytes / (1024**2)  # Size in MB
print(f"\nDensity array size: {density_size:.2f} MB")

# Common data types and their memory usage:
print("\nCommon NumPy data types:")
print("  int8:  1 byte  (range: -128 to 127)")
print("  int16: 2 bytes (range: -32,768 to 32,767)")
print("  int32: 4 bytes (range: -2.1B to 2.1B)")
print("  int64: 8 bytes (range: very large)")
print("  float32: 4 bytes (single precision, ~7 decimal digits)")
print("  float64: 8 bytes (double precision, ~15 decimal digits)")


### Converting Data Types

Convert data to more memory-efficient types when appropriate:


In [None]:
# Example 1: Convert float64 to float32 (halves memory usage)
# Check if values fit in float32 range (nanmin/nanmax ignore missing values)
density_vals = ds['density'].values
density_min = np.nanmin(density_vals)
density_max = np.nanmax(density_vals)

print(f"Original dtype: {ds['density'].dtype}")
print(f"Value range: {density_min:.1f} to {density_max:.1f} kg/m³")
print(f"Original size: {ds['density'].nbytes / (1024**2):.2f} MB")

# Convert to float32 (if values fit)
if density_min >= np.finfo(np.float32).min and density_max <= np.finfo(np.float32).max:
    density_float32 = ds['density'].astype('float32')
    print(f"Converted to float32: {density_float32.dtype}")
    print(f"New size: {density_float32.nbytes / (1024**2):.2f} MB")
    print(f"Memory saved: {(ds['density'].nbytes - density_float32.nbytes) / (1024**2):.2f} MB")
    print(f"  (50% reduction for this variable)")
else:
    print("Values outside float32 range - cannot convert safely")


### Memory Optimization Strategies


In [None]:
# Strategy 1: Convert all float64 to float32 (if precision allows)
def optimize_dtypes(ds):
    """Convert data types to more memory-efficient types."""
    ds_optimized = ds.copy()
    
    for var in ds.data_vars:
        if ds[var].dtype == 'float64':
            # Check if the finite values fit in float32 (NaN/inf survive the cast)
            vals = ds[var].values
            vals_finite = vals[np.isfinite(vals)]
            if len(vals_finite) > 0:
                min_val = vals_finite.min()
                max_val = vals_finite.max()
                if (min_val >= np.finfo(np.float32).min and 
                    max_val <= np.finfo(np.float32).max):
                    ds_optimized[var] = ds[var].astype('float32')
                    print(f"  {var}: float64 -> float32")
    
    return ds_optimized

# Apply optimization
ds_optimized = optimize_dtypes(ds)

# Compare memory usage
original_size = sum(var.nbytes for var in ds.data_vars.values()) / (1024**2)
optimized_size = sum(var.nbytes for var in ds_optimized.data_vars.values()) / (1024**2)
print(f"\nOriginal dataset size: {original_size:.2f} MB")
print(f"Optimized dataset size: {optimized_size:.2f} MB")
print(f"Memory saved: {original_size - optimized_size:.2f} MB ({100*(original_size-optimized_size)/original_size:.1f}%)")


### When to Use Different Data Types

**Use float32 when:**
- Values fit in range (~-3.4e38 to 3.4e38)
- ~7 decimal digits precision is sufficient
- Memory is limited
- Working with large datasets

**Use float64 when:**
- High precision is required (~15 decimal digits)
- Values might be very large or very small
- Scientific calculations need maximum precision
- Memory is not a concern

**Use int types when:**
- Data is integer-valued (e.g., layer counts, indices)
- Can use int16 or int32 instead of int64
- Significant memory savings possible
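
For integer data, the narrowest safe type can be picked by comparing the value range against `np.iinfo`. The sketch below uses made-up layer counts; `smallest_int_dtype` is a hypothetical helper, not an xsnow or NumPy function.

```python
import numpy as np


def smallest_int_dtype(values):
    """Return the narrowest signed integer dtype that can hold all values."""
    values = np.asarray(values)
    lo, hi = int(values.min()), int(values.max())
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dtype)
    raise ValueError("values do not fit in int64")


layer_counts = np.array([12, 48, 150, 73], dtype=np.int64)  # made-up values
target = smallest_int_dtype(layer_counts)
compact = layer_counts.astype(target)
print(target, layer_counts.nbytes, "->", compact.nbytes, "bytes")
```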


### Practical Example: Optimizing Large Time Series

For large time series datasets, dtype optimization can save significant memory:


In [None]:
# Example: Optimize a large time series dataset
# Simulate a large dataset (many time steps, locations, layers)
print("Example: Optimizing large time series dataset")

# Check current memory usage
total_memory = 0
for var_name, var in ds.data_vars.items():
    var_memory = var.nbytes / (1024**2)
    total_memory += var_memory
    if var_memory > 0.1:  # Show variables using >0.1 MB
        print(f"  {var_name}: {var_memory:.2f} MB ({var.dtype})")

print(f"\nTotal dataset memory: {total_memory:.2f} MB")

# Optimize: Convert float64 to float32
# Note: astype() does not raise for out-of-range values (they become inf),
# so check value ranges first (see optimize_dtypes above) if that is a concern
ds_opt = ds.copy()
for var_name in ds.data_vars:
    if ds[var_name].dtype == 'float64':
        ds_opt[var_name] = ds[var_name].astype('float32')

# Check optimized memory
total_memory_opt = sum(var.nbytes for var in ds_opt.data_vars.values()) / (1024**2)
print(f"\nOptimized dataset memory: {total_memory_opt:.2f} MB")
print(f"Memory saved: {total_memory - total_memory_opt:.2f} MB")
print(f"Reduction: {100*(total_memory - total_memory_opt)/total_memory:.1f}%")

# For very large datasets, this can save gigabytes!
print("\n💡 Tip: For datasets with 1000s of time steps and locations,")
print("   dtype optimization can save significant memory and improve performance.")


### Trade-offs: Precision vs Memory

**Important considerations:**
- **Precision loss**: float32 has ~7 decimal digits vs float64's ~15
- **Range limits**: float32 range is smaller than float64
- **Performance**: float32 can be faster on some systems
- **Compatibility**: Some operations may require float64

**Best practice**: 
- Use float32 for storage and visualization
- Use float64 for critical calculations
- Test that precision loss doesn't affect your analysis
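
One way to test the precision loss is to round-trip values through float32 and compare against the float64 originals. The sketch below uses synthetic densities drawn from a seeded random generator, so the numbers are illustrative rather than real snowpack data.

```python
import numpy as np

rng = np.random.default_rng(0)
density64 = rng.uniform(50.0, 917.0, size=10_000)  # synthetic densities, kg/m³
density32 = density64.astype(np.float32)

# Largest relative error introduced by storing the values as float32
rel_err = np.max(np.abs(density32.astype(np.float64) - density64) / density64)
print(f"Max relative error: {rel_err:.2e}")
print(f"float32 machine epsilon: {np.finfo(np.float32).eps:.2e}")

# Compare a summary statistic computed from both versions
mean_diff = abs(density64.mean() - density32.mean(dtype=np.float64))
print(f"Difference in mean: {mean_diff:.2e} kg/m³")
```

If the differences are far below the measurement uncertainty of the quantity, float32 storage is unlikely to change the analysis.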


**Now You Try**: After learning about data type optimization, try:
- Checking the data types and memory usage of your dataset
- Converting appropriate variables to float32 and measuring memory savings
- Creating a function to automatically optimize data types for your datasets
- Comparing analysis results before and after dtype optimization to ensure precision is sufficient


In [None]:
# Example: Merge profile and meteo data
# Load both at once
# ds = xsnow.read(['data/profile.pro', 'data/meteo.smet'])

# Or load separately and merge
# import xarray as xr
# ds_pro = xsnow.read('data/profile.pro')
# ds_met = xsnow.read('data/meteo.smet')
# ds_combined = xr.merge([ds_pro, ds_met])  # Using xarray's merge


## Part 7: Working with Zarr Format for Large Datasets

Since xsnow is built on xarray, it can work with Zarr-backed datasets. **Zarr** is a format for storing chunked, compressed, N-dimensional arrays, which is ideal for large snowpack datasets.

### What is Zarr?

**Zarr** is a storage format that:
- Stores data in **chunks** (smaller pieces) rather than one large file
- Supports **compression** to reduce storage size
- Enables **lazy loading** (only load what you need)
- Works well with **Dask** for parallel computing
- Is efficient for **large time series** or **multiple locations**

### When to Use Zarr

Consider using Zarr when you have:
- **Large time series**: Many years of hourly/daily data
- **Multiple locations**: Data from many stations or grid points
- **Ensemble runs**: Multiple realizations or scenarios
- **Limited memory**: Need to work with data larger than RAM
- **Cloud storage**: Want to store data in cloud object storage (S3, GCS, etc.)

### Benefits for Snowpack Data

- **Efficient storage**: Compressed chunks reduce file size
- **Fast access**: Load only the chunks you need
- **Parallel processing**: Works seamlessly with Dask
- **Scalable**: Handle datasets that don't fit in memory

### Converting to Zarr Format

**What you'll see**: The examples below show how to save xsnow datasets to Zarr format and load them back.


In [None]:
import zarr
import xarray as xr

# Example 1: Save dataset to Zarr format
# Load sample data first
ds = xsnow.single_profile_timeseries()

# Save to Zarr (chunked and compressed)
zarr_path = "snowpack_data.zarr"
print(f"Saving dataset to Zarr format: {zarr_path}")

# Configure chunking strategy
# Chunk by time and location for efficient access
chunks = {
    'time': 100,      # 100 time steps per chunk
    'location': 1,    # 1 location per chunk
    'layer': -1,      # All layers in one chunk
    'slope': -1,      # All slopes in one chunk
    'realization': -1 # All realizations in one chunk
}
# Keep only the dimensions this dataset actually has
chunks = {dim: size for dim, size in chunks.items() if dim in ds.dims}

# Save to Zarr with compression
# Note: zarr.Blosc is the zarr-python 2.x compressor API; newer zarr releases may need a different codec
try:
    ds.chunk(chunks).to_zarr(
        zarr_path,
        mode='w',  # 'w' = write (overwrite), 'a' = append
        encoding={
            'density': {'compressor': zarr.Blosc(cname='zstd', clevel=3)},
            'temperature': {'compressor': zarr.Blosc(cname='zstd', clevel=3)},
        }
    )
    print(f"✅ Saved to {zarr_path}")
    
    # Check file size
    import os
    if os.path.exists(zarr_path):
        # Zarr creates a directory, so we need to check its size
        total_size = sum(
            os.path.getsize(os.path.join(dirpath, filename))
            for dirpath, dirnames, filenames in os.walk(zarr_path)
            for filename in filenames
        )
        print(f"   Zarr store size: {total_size / 1024 / 1024:.2f} MB")
except Exception as e:
    print(f"Note: Zarr save example (may need actual data): {e}")

# Example 2: Load from Zarr format
# This enables lazy loading - data isn't loaded until you access it
try:
    ds_zarr = xr.open_zarr(zarr_path)
    print(f"\n✅ Loaded from Zarr: {zarr_path}")
    print(f"   Dimensions: {dict(ds_zarr.dims)}")
    print(f"   Data is lazy-loaded (not in memory yet)")
    
    # Accessing data triggers loading
    print(f"\n   Accessing a small subset...")
    sample = ds_zarr['density'].isel(location=0, time=0, layer=0).values
    print(f"   Sample value: {sample}")
except Exception as e:
    print(f"\nNote: Zarr load example (file may not exist): {e}")

# Example 3: Working with large datasets using Dask
# Zarr works seamlessly with Dask for parallel processing
try:
    # Open with Dask chunks for parallel processing
    ds_dask = xr.open_zarr(zarr_path, chunks={'time': 50, 'location': 1})
    print(f"\n✅ Opened with Dask chunks")
    print(f"   Chunks: {ds_dask.chunks}")
    print(f"   Data type: {type(ds_dask['density'].data)}")
    print(f"   (Dask array - computation is lazy and parallel)")
except Exception as e:
    print(f"\nNote: Dask example (file may not exist): {e}")

print("\n💡 Key points:")
print("   - Zarr stores data in compressed chunks")
print("   - Enables lazy loading (load only what you need)")
print("   - Works with Dask for parallel processing")
print("   - Ideal for datasets larger than available RAM")


### Chunking Strategy for Snowpack Data

When saving to Zarr, choose chunk sizes based on how you'll access the data:

- **Time-based access** (e.g., "all data for January"): Chunk by time
- **Location-based access** (e.g., "all data for Station A"): Chunk by location
- **Profile-based access** (e.g., "complete profiles"): Chunk by time and location together
- **Layer-based access** (e.g., "surface layer over time"): Chunk by layer

**Rule of thumb**: Chunk size should be 10-100 MB for optimal performance.
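
The rule of thumb is easy to check with a little arithmetic: the uncompressed chunk size is the product of the chunk shape times the bytes per value. The sketch below uses two hypothetical chunk shapes, and `chunk_megabytes` is an illustrative helper.

```python
import math


def chunk_megabytes(chunk_shape, itemsize):
    """Uncompressed size of one chunk in MB (1 MB = 1024**2 bytes)."""
    return math.prod(chunk_shape) * itemsize / 1024**2


# Hypothetical chunk: 100 time steps x 1 location x 50 layers, float32 (4 bytes)
small = chunk_megabytes((100, 1, 50), itemsize=4)

# Hypothetical larger chunk: 10,000 time steps x 10 locations x 50 layers, float32
large = chunk_megabytes((10_000, 10, 50), itemsize=4)

print(f"small chunk: {small:.3f} MB")
print(f"large chunk: {large:.1f} MB")
```

The first shape comes out far below the 10 MB mark, which suggests that small per-station datasets may be better served by larger time chunks or a single chunk per variable.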

### Zarr vs NetCDF

| Feature | Zarr | NetCDF |
|---------|------|--------|
| **Chunking** | Native, flexible | Supported in NetCDF-4 (HDF5-based), set per variable |
| **Compression** | Multiple algorithms | zlib/deflate as standard in NetCDF-4 |
| **Lazy loading** | Excellent | Good |
| **Cloud storage** | Native support | Requires special setup |
| **File format** | Directory of files | Single file |
| **Best for** | Large datasets, cloud | Standard scientific data |

**For snowpack data**: Use Zarr when you have large time series or multiple locations. Use NetCDF for standard-sized datasets or when compatibility is important.

**Now You Try**: After learning about Zarr, try:
- Saving a dataset to Zarr format with different chunk sizes
- Loading from Zarr and comparing memory usage vs. loading from NetCDF
- Experimenting with different compression levels to balance size vs. speed


## Summary

✅ **What we learned:**

1. **File formats**: .pro (profiles) and .smet (meteorological) files
2. **Loading custom data**: Use `xsnow.read()` with your file paths
3. **Multiple files**: Load lists of files or entire directories
4. **Troubleshooting**: Common issues and solutions
5. **Validation**: Check data quality and ranges
6. **Merging**: Combine profile and meteo data
7. **Zarr format**: Chunked, compressed storage for large datasets

## Key Tips

- **File paths**: Use absolute paths if relative paths cause issues
- **Format verification**: Inspect file headers to ensure correct format
- **Variable names**: Check xsnow documentation for expected variable names
- **Time alignment**: xsnow tries to align times automatically when merging; check the resulting time range afterwards
- **Data quality**: Always validate loaded data
- **Large datasets**: Consider Zarr format for efficient storage and access
- **Chunking**: Choose chunk sizes based on your access patterns

## Next Steps

Now that you can load your own data:
- Apply analysis techniques from previous notebooks
- Create visualizations with your data
- Or learn to extend xsnow: **06_extending_xsnow.ipynb**

## Exercises

1. Load one of your own .pro files and inspect its structure
2. Check for missing variables and verify data ranges
3. Load multiple files and compare their time ranges
4. Merge a .pro and .smet file if you have both
5. Validate your data and identify any quality issues
6. Save a dataset to Zarr format and compare file sizes with NetCDF
7. Load from Zarr and experiment with different chunk sizes
