# 05: Working with Custom Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Austfi/xsnowForPatrol/blob/main/notebooks/05_working_with_custom_data.ipynb)

This notebook shows you how to prepare and load your own SNOWPACK output files into xsnow.

## What You'll Learn

- Preparing your own .pro and .smet files
- File format requirements
- Loading custom data
- Troubleshooting common issues
- Merging multiple data sources
- Basic data validation

> **Note**: For comprehensive missing data handling, see **07_data_quality_and_cleaning.ipynb**. For performance optimization and Zarr format, see **09_performance_and_storage.ipynb**.


## Installation (For Colab Users)

If you're using Google Colab, run the cell below to install xsnow and dependencies. If you're running locally and have already installed xsnow, you can skip this cell.


In [None]:

%pip install -q numpy pandas xarray matplotlib seaborn dask netcdf4 zarr
%pip install -q git+https://gitlab.com/avacollabra/postprocessing/xsnow



In [None]:
import xsnow
import os
import glob



In [None]:
# Example: Explore xsnow sample data
import xsnow

print("xsnow provides sample datasets:")
print()

# Example 1: Single profile
print("1. Single profile (one snapshot):")
ds_single = xsnow.single_profile()
print(f"   Dimensions: {dict(ds_single.dims)}")

print()
# Example 2: Time series
print("2. Time series (multiple snapshots over time):")
ds_timeseries = xsnow.single_profile_timeseries()
print(f"   Dimensions: {dict(ds_timeseries.dims)}")


## Part 1: File Format Requirements

xsnow can read SNOWPACK output files in these formats:

### .pro Files (Profile Time Series)

- **Format**: SNOWPACK profile format (legacy)
- **Contains**: Time series of snow profiles with layer-by-layer data
- **Required**: Header with station metadata, profile data blocks
- **Generated by**: SNOWPACK when `PROF_FORMAT = PRO` in .ini file

### .smet Files (Meteorological Time Series)

- **Format**: SMET (MeteoIO format)
- **Contains**: Time series of scalar variables (no layers)
- **Required**: SMET header with field descriptions, time series data
- **Generated by**: SNOWPACK or MeteoIO for meteorological data

### Other Formats

xsnow may support other formats (check documentation):
- NetCDF (if SNOWPACK outputs to NetCDF)
- Other SNOWPACK output formats


## Part 2: Preparing Your Files

### Step 1: Generate SNOWPACK Output

If you're running SNOWPACK yourself:

1. **Configure SNOWPACK** (via Inishell or .ini file):
   - Set `PROF_FORMAT = PRO` to generate .pro files
   - Configure which variables to output
   - Set output directory

2. **Run SNOWPACK** simulation

3. **Check output files**:
   - Look for `.pro` files in output directory
   - Check for `.smet` files if configured
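
Before running a long simulation, it can help to confirm that the configuration actually requests `.pro` output. The cell below is a minimal sketch using Python's standard `configparser`. Only `PROF_FORMAT = PRO` comes from the steps above; the `[OUTPUT]` section name and the other keys in the excerpt are assumptions about a typical SNOWPACK `.ini` file, so compare them against your own configuration.

```python
import configparser

# Hypothetical excerpt of a SNOWPACK .ini file (assumed layout, not copied from a real run)
ini_text = """
[OUTPUT]
PROF_WRITE = TRUE
PROF_FORMAT = PRO
TS_WRITE = TRUE
TS_FORMAT = SMET
"""

# Only '=' is treated as a delimiter so keys containing ':' are not split;
# strict=False tolerates repeated keys, interpolation=None leaves '%' untouched.
config = configparser.ConfigParser(delimiters=("=",), strict=False, interpolation=None)
config.read_string(ini_text)  # for a real file: config.read("path/to/your.ini")

prof_format = config.get("OUTPUT", "PROF_FORMAT", fallback="").strip().upper()
writes_pro = "PRO" in prof_format.split()  # the value may list more than one format
print(f"PROF_FORMAT = {prof_format!r}; .pro output requested: {writes_pro}")
```

If `writes_pro` is `False` for your file, SNOWPACK will probably not produce the `.pro` files that the rest of this notebook expects.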

### Step 2: Verify File Format

Let's check if your files are in the correct format:


In [None]:
# Check for .pro files in data directory
data_dir = "data"
pro_files = glob.glob(os.path.join(data_dir, "*.pro"))
smet_files = glob.glob(os.path.join(data_dir, "*.smet"))

print(f"Found {len(pro_files)} .pro files:")
for f in pro_files[:5]:  # Show first 5
    print(f"  {f}")

print(f"\nFound {len(smet_files)} .smet files:")
for f in smet_files[:5]:  # Show first 5
    print(f"  {f}")

# Quick format check
if pro_files:
    first_file = pro_files[0]
    print(f"\nInspecting first .pro file: {first_file}")
    with open(first_file, 'r') as f:
        first_lines = [f.readline() for _ in range(10)]
    # Keep non-empty lines (with their original line numbers), then show at most 5
    non_empty = [(i, line.strip()) for i, line in enumerate(first_lines, start=1) if line.strip()]
    print("First non-empty lines (up to 5):")
    for i, line in non_empty[:5]:
        print(f"  Line {i}: {line[:80]}")
else:
    print("\nNo .pro files found in data directory.")
    print("Note: You can use xsnow's built-in sample data instead!")


**Now You Try**: After checking for files, try:
- Listing all files in a different directory
- Checking file sizes to see which files are largest
- Using `os.path.getmtime()` to find the most recently modified file


## Part 3: Loading Your Custom Data

Now let's load your files:


In [None]:
# Method 1: Load a single file
# Uncomment and modify path to load your own file:
# ds = xsnow.read("data/your_file.pro")


**Now You Try**: After loading a file, try:
- Loading a file from a different directory (use an absolute path)
- Loading multiple files and comparing their dimensions
- Inspecting the first few rows of data after loading


### Loading Multiple Files

You can load multiple files at once:



In [None]:
# Method 2: Load multiple files
# List of files
# ds = xsnow.read(['data/file1.pro', 'data/file2.pro'])

# All files in directory
# ds = xsnow.read('data/')

# Mix of .pro and .smet
# ds = xsnow.read(['data/profile.pro', 'data/meteo.smet'])
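
If you prefer to build the file list explicitly before handing it to `xsnow.read`, a small helper can do the collecting. This is only a sketch: `collect_snowpack_files` is a hypothetical name, and the demo writes empty placeholder files to a temporary directory rather than real SNOWPACK output.

```python
import os
import tempfile
from pathlib import Path

def collect_snowpack_files(directory, extensions=(".pro", ".smet")):
    """Return a sorted list of paths in `directory` that have one of `extensions`."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root}")
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )

# Demo with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pro", "b.smet", "notes.txt"):
        Path(tmp, name).touch()
    paths = collect_snowpack_files(tmp)
    found_names = [os.path.basename(p) for p in paths]

print(found_names)
# With real data you would then pass the list on, for example:
# ds = xsnow.read(collect_snowpack_files("data"))
```

Building the list yourself makes it easy to print, filter, or sort the files before loading, which helps when a directory also contains unrelated output.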


## Part 4: Troubleshooting Common Issues

### Issue 1: File Not Found

**Error**: `FileNotFoundError` or similar

**Solutions**:
- Check file path is correct
- Use absolute paths if relative paths don't work
- Verify file exists: `os.path.exists('path/to/file.pro')`


**Now You Try**: After troubleshooting file format issues, try:
- Creating a function to validate file format before loading
- Writing code to automatically detect file format (.pro vs .smet)
- Checking if files have the expected header structure


In [None]:
# Example: Check if file exists before loading
# test_file = "data/your_file.pro"
# if os.path.exists(test_file):
#     ds = xsnow.read(test_file)
# else:
#     print(f"File not found: {test_file}")


### Issue 2: Format Not Recognized

**Error**: File format not supported or parsing errors

**Solutions**:
- Verify the file really is in .pro or .smet format (not just renamed)
- Check file header matches expected format
- Try opening file in text editor to inspect structure
- Check SNOWPACK version compatibility


In [None]:
# Inspect file header
# Uncomment to inspect your own file:
# test_file = "data/your_file.pro"
# with open(test_file, 'r') as f:
#     header_lines = [f.readline().strip() for _ in range(20)]
#     print("Header lines (first 20, non-empty):")
#     for i, line in enumerate(header_lines):
#         if line:  # Skip empty lines
#             print(f"  {i+1}: {line[:100]}")
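
To catch renamed or mislabeled files automatically, you can look at the first non-empty line instead of the extension. The sketch below assumes that SMET files open with a `SMET ...` signature line and that `.pro` files open with a `[STATION_PARAMETERS]` section header. Check those assumptions against your own files. `sniff_snowpack_format` is a hypothetical helper, not part of xsnow.

```python
import os
import tempfile

def sniff_snowpack_format(path, max_lines=50):
    """Guess 'pro', 'smet', or 'unknown' from the first non-empty line of a file."""
    with open(path, "r", errors="replace") as f:
        for _, line in zip(range(max_lines), f):
            text = line.strip()
            if not text:
                continue  # skip leading blank lines
            # Assumed signatures (verify against your own files)
            if text.upper().startswith("SMET"):
                return "smet"
            if text.upper() == "[STATION_PARAMETERS]":
                return "pro"
            return "unknown"
    return "unknown"

# Demo on two tiny stand-in files (not real SNOWPACK output)
with tempfile.TemporaryDirectory() as tmp:
    pro_path = os.path.join(tmp, "demo.pro")
    renamed_path = os.path.join(tmp, "renamed.pro")  # SMET content, wrong extension
    with open(pro_path, "w") as f:
        f.write("[STATION_PARAMETERS]\nStationName= demo\n")
    with open(renamed_path, "w") as f:
        f.write("SMET 1.1 ASCII\n[HEADER]\n")
    pro_guess = sniff_snowpack_format(pro_path)
    renamed_guess = sniff_snowpack_format(renamed_path)

print(pro_guess, renamed_guess)
```

A mismatch between the guess and the file extension is a quick hint that a file was renamed or exported in a different format.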


### Issue 3: Missing Variables

**Problem**: Expected variables not in dataset

**Solutions**:
- Check SNOWPACK output configuration
- Verify variables were enabled in SNOWPACK .ini file
- Some variables may be computed by xsnow (like HS, z)
- Check variable names match xsnow's expected names


**Now You Try**: After validating your data, try:
- Creating a summary report of data quality (count of NaNs, value ranges, etc.)
- Writing a function to automatically validate multiple datasets
- Comparing validation results between different files or locations


In [None]:
# Check available variables in your dataset
# Uncomment after loading your data:
# print("Available variables in dataset:")
# for var in list(ds.data_vars.keys())[:20]:
#     print(f"  {var}: {ds[var].dims}")

# Check for common variables
# common_vars = ['density', 'temperature', 'HS', 'grain_type', 'grain_size']
# for var in common_vars:
#     if var in ds.data_vars:
#         print(f"  {var}: found")
#     else:
#         print(f"  {var}: not found")


### Issue 4: Time Alignment Problems

**Problem**: Multiple files have different time ranges or frequencies

**Solutions**:
- xsnow will try to align times automatically
- Check time ranges: `ds.coords['time'].values`
- Resample if needed: `ds.resample(time='1h').mean()` (recent pandas versions deprecate the uppercase `'1H'` alias)
- Manually select overlapping time periods
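
The resampling and overlap-selection steps above can be sketched with plain xarray on synthetic data. Nothing here is xsnow-specific: the variable names `TA` and `HS`, the dates, and the frequencies are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two synthetic series standing in for data loaded from separate files
t_hourly = pd.date_range("2024-01-01 00:00", periods=48, freq="h")
t_3hourly = pd.date_range("2024-01-01 12:00", periods=16, freq="3h")
ds_a = xr.Dataset({"TA": ("time", np.linspace(-10.0, -2.0, 48))}, coords={"time": t_hourly})
ds_b = xr.Dataset({"HS": ("time", np.linspace(100.0, 115.0, 16))}, coords={"time": t_3hourly})

# Bring the hourly series onto the coarser 3-hourly grid
ds_a_3h = ds_a.resample(time="3h").mean()

# Keep only the period covered by both series
start = max(ds_a_3h.time.values[0], ds_b.time.values[0])
end = min(ds_a_3h.time.values[-1], ds_b.time.values[-1])
a_overlap = ds_a_3h.sel(time=slice(start, end))
b_overlap = ds_b.sel(time=slice(start, end))

print(f"Overlap: {start} to {end}")
print(f"Time steps in overlap: {a_overlap.sizes['time']} and {b_overlap.sizes['time']}")
```

Once both datasets share the same time stamps, they can be compared or combined without introducing NaN-filled gaps.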


In [None]:
# Check time range and frequency
# Uncomment after loading your data:
# import numpy as np
# times = ds.coords['time'].values
# print(f"Time range: {times[0]} to {times[-1]}")
# print(f"Total time steps: {len(times)}")
#
# if len(times) > 1:
#     time_diffs = np.diff(times)
#     print(f"First time step: {time_diffs[0]}")
#
#     # Check if times are regular (exact comparison; np.allclose does not accept timedelta64)
#     if np.all(time_diffs == time_diffs[0]):
#         print("Times are regularly spaced")
#     else:
#         print("Times are irregularly spaced")


**Now You Try**: After merging data, try:
- Merging data from 3 or more files
- Checking that merged data has the expected dimensions
- Comparing variables before and after merging to ensure nothing was lost


## Part 5: Data Validation

After loading, validate your data:


In [None]:
import numpy as np

# Check for NaN values
# Uncomment after loading your data:
# nan_count = ds['density'].isnull().sum().values
# total_count = ds['density'].size
# if nan_count > 0:
#     print(f"Found {nan_count} NaN values in density ({100*nan_count/total_count:.1f}%)")
# else:
#     print("No NaN values in density")

# Check for reasonable value ranges
# density_vals = ds['density'].values
# valid_vals = density_vals[~np.isnan(density_vals)]
# if len(valid_vals) > 0:
#     print(f"Density range: {valid_vals.min():.1f} to {valid_vals.max():.1f} kg/m³")
#     if valid_vals.min() < 0 or valid_vals.max() > 1000:
#         print("Density values outside typical range (0-1000 kg/m³)")
#     else:
#         print("Density values in reasonable range")

# temp_vals = ds['temperature'].values
# valid_vals = temp_vals[~np.isnan(temp_vals)]
# if len(valid_vals) > 0:
#     print(f"Temperature range: {valid_vals.min():.1f} to {valid_vals.max():.1f} °C")


## Part 6: Merging Profile and Meteorological Data
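
Part 3 showed that `xsnow.read` accepts a mix of .pro and .smet files. If you instead end up with two separate datasets, a generic xarray merge is one way to combine layered profile variables with scalar meteorological variables. The sketch below uses synthetic stand-in data and plain `xr.merge`. It is an assumption that this matches your data layout, and it does not show how xsnow merges files internally. The variable name `TA` is illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2024-01-01", periods=4, freq="D")

# Stand-in for layered profile data (the kind of content a .pro file holds)
profile = xr.Dataset(
    {"density": (("time", "layer"), np.full((4, 3), 250.0))},
    coords={"time": time, "layer": [0, 1, 2]},
)

# Stand-in for scalar meteorological data (the kind of content a .smet file holds)
meteo = xr.Dataset({"TA": ("time", [-8.0, -6.5, -4.0, -5.5])}, coords={"time": time})

# join="inner" keeps only time stamps present in both datasets
combined = xr.merge([profile, meteo], join="inner")
print(combined)
```

After merging, `density` keeps its `layer` dimension while `TA` stays a pure time series, so both can be selected by the same time stamp.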


> **Note on Missing Data**: For comprehensive coverage of missing data detection, handling strategies (interpolation, filling, dropping), and cleaning pipelines, see **07_data_quality_and_cleaning.ipynb**.


In [None]:


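# Assumes `ds` is a dataset you loaded in Part 3.
# The fill steps below are a reconstruction (an assumption) of what the names imply.
density = ds['density']
density_mean_filled = density.fillna(density.mean())
density_median_filled = density.fillna(density.median())
# Location-specific mean: average over every dimension except 'location'
other_dims = [d for d in density.dims if d != 'location']
density_location_filled = density.fillna(density.mean(dim=other_dims))

# Check results
print(f"\nOriginal missing: {ds['density'].isnull().sum().values}")
print(f"After mean fill: {density_mean_filled.isnull().sum().values}")
print(f"After median fill: {density_median_filled.isnull().sum().values}")
print(f"After location mean fill: {density_location_filled.isnull().sum().values}")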


#### Strategy 4: Dropping Missing Data

Sometimes it's better to remove data with missing values:


In [None]:
# Example: Drop missing data with different strategies
# (assumes `ds` is a dataset you loaded in Part 3)

# Strategy 1: Drop any layer index where density is missing anywhere
density_no_missing_layers = ds['density'].dropna(dim='layer')
# DataArray.dims is a tuple, so use .sizes to look up a dimension length
print(f"Drop missing layers: {density_no_missing_layers.sizes.get('layer', 'N/A')} layers remaining")

# Strategy 2: Drop layer indices where ANY variable has a missing value (aggressive)
layers_clean = ds.dropna(dim='layer', how='any')
print(f"Drop layers with any missing: {layers_clean.sizes.get('layer', 'N/A')} layers remaining")

# Strategy 3: Drop layer indices only if ALL values are missing (conservative)
layers_partial = ds.dropna(dim='layer', how='all')
print(f"Drop only fully-missing layers: {layers_partial.sizes.get('layer', 'N/A')} layers remaining")

# Strategy 4: Drop time steps (whole profiles) with any missing data
time_clean = ds.dropna(dim='time', how='any')
print(f"Drop time steps with any missing: {time_clean.sizes.get('time', 'N/A')} time steps remaining")


#### Strategy 5: Multi-Step Cleaning Pipeline

For real-world data, combine multiple strategies:


In [None]:
# Comprehensive cleaning pipeline for density data
def clean_density_data(ds, variable='density'):
    """
    Multi-step cleaning pipeline for density data.
    
    Steps:
    1. Interpolate missing values along layer dimension
    2. Forward/backward fill any remaining gaps
    3. Fill with location-specific mean if still missing
    4. Mask profiles that are still >50% missing
    """
    density = ds[variable].copy()
    
    # Step 1: Interpolate along layer dimension (for depth profiles)
    density = density.interpolate_na(dim='layer', method='linear')
    
    # Step 2: Forward then backward fill
    # (fillna() only accepts a fill value; ffill()/bfill() do directional filling)
    density = density.ffill(dim='layer')
    density = density.bfill(dim='layer')
    
    # Step 3: Fill remaining with location-specific mean
    # (average over every dimension except 'location')
    other_dims = [d for d in density.dims if d != 'location']
    location_mean = density.mean(dim=other_dims)
    density = density.fillna(location_mean)
    
    # Step 4: Mask profiles with >50% missing (if any remain)
    missing_per_profile = density.isnull().sum(dim='layer') / density.sizes['layer']
    valid_profiles = missing_per_profile <= 0.5
    
    # Set profiles that fail the check to NaN.
    # Note: This is simplified - in practice you may want to drop them from the dataset
    density = density.where(valid_profiles)
    
    return density

# Apply cleaning pipeline
density_cleaned = clean_density_data(ds, 'density')
print("Cleaning pipeline applied:")
print(f"  Original missing: {ds['density'].isnull().sum().values}")
print(f"  After cleaning: {density_cleaned.isnull().sum().values}")
print(f"  Missing values removed: {ds['density'].isnull().sum().values - density_cleaned.isnull().sum().values}")


### Choosing the Right Strategy

**Use interpolation when:**
- Missing values are in time series data
- You have enough neighboring data points
- Values change smoothly over time/depth

**Use forward/backward fill when:**
- Missing values are at edges (surface or bottom layers)
- Values are relatively constant
- Quick fix needed

**Use statistical imputation when:**
- Missing values are scattered randomly
- You have good overall statistics
- Interpolation isn't appropriate

**Use dropping when:**
- Missing data is extensive (>50% of profile)
- Missing data indicates data quality issues
- You have enough remaining data for analysis

**Use multi-step pipeline when:**
- Data has complex missing patterns
- You need robust cleaning
- Production data processing
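
To see how these choices differ in practice, the sketch below applies three of them to one tiny made-up density profile. It uses pandas so that it runs without a loaded dataset. The values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up density profile (kg/m³) with gaps at both edges and one in the middle
density = pd.Series([np.nan, 120.0, np.nan, 200.0, 260.0, np.nan])

# Interpolation: fills only gaps that have valid neighbours on both sides
interpolated = density.interpolate(method="linear", limit_area="inside")

# Forward/backward fill: handles the remaining gaps at the edges
edge_filled = interpolated.ffill().bfill()

# Statistical imputation: every gap gets the same overall mean
mean_filled = density.fillna(density.mean())

comparison = pd.DataFrame({
    "original": density,
    "interpolated": interpolated,
    "edge_filled": edge_filled,
    "mean_filled": mean_filled,
})
print(comparison)
```

Interpolation leaves the edge gaps untouched, which is why the multi-step pipeline follows it with a forward and backward fill. Mean imputation ignores position in the profile entirely.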


**Now You Try**: After learning about missing data handling, try:
- Creating a validation report for your dataset showing missing data patterns
- Applying different cleaning strategies and comparing results
- Creating a custom cleaning function for your specific data needs
- Validating that cleaned data produces reasonable analysis results


> **Note on Data Type Optimization**: For comprehensive coverage of data type optimization, memory management, and performance tuning for large datasets, see **09_performance_and_storage.ipynb**.
