# 07: Data Quality and Cleaning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Austfi/xsnowForPatrol/blob/main/notebooks/07_data_quality_and_cleaning.ipynb)

This notebook covers strategies for detecting and handling missing data in xsnow datasets, followed by basic range validation.

## What You'll Learn

- Detecting missing data (NaN values, outliers, invalid ranges)
- Missing data handling strategies (interpolation, filling, dropping)
- Data validation and quality checks
- Cleaning pipelines for production use

> **Note**: This is a reference notebook covering advanced data quality topics. The main tutorial notebooks focus on core functionality.


## Installation (For Colab Users)

If you're using Google Colab, run the cell below to install xsnow and dependencies.


In [None]:
%pip install -q numpy pandas xarray matplotlib seaborn scipy bottleneck
%pip install -q git+https://gitlab.com/avacollabra/postprocessing/xsnow


In [None]:
import xsnow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load sample data
ds = xsnow.single_profile_timeseries()
print("✅ Data loaded successfully")

%matplotlib inline


## Part 1: Detecting Missing Data

Before cleaning data, you need to identify what's missing and where.


In [None]:
# Check for missing values
missing_density = ds['density'].isnull()
n_missing = missing_density.sum().values
total = missing_density.size
print(f"Missing density values: {n_missing} out of {total} ({100*n_missing/total:.2f}%)")

# Check multiple variables
print("\nMissing data summary:")
for var in ['density', 'temperature', 'HS']:
    if var in ds.data_vars:
        n_miss = ds[var].isnull().sum().values
        n_total = ds[var].size
        pct = 100 * n_miss / n_total if n_total > 0 else 0
        print(f"  {var}: {n_miss} missing ({pct:.2f}%)")
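
Totals hide where the gaps are. The sketch below uses a synthetic stand-in for `ds['density']` (the values and the `time`/`layer` shape are made up for illustration). It shows how reducing the `.isnull()` mask along one dimension gives a missing fraction per time step and a missing count per layer.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for ds['density']: 3 time steps x 4 layers (made-up values)
density = xr.DataArray(
    [[250.0, 300.0, np.nan, np.nan],
     [240.0, np.nan, 310.0, np.nan],
     [230.0, 280.0, 320.0, 350.0]],
    dims=("time", "layer"),
    coords={"time": [0, 1, 2], "layer": [0, 1, 2, 3]},
    name="density",
)

missing = density.isnull()

# Fraction of layers missing in each profile (one value per time step)
missing_per_time = missing.mean(dim="layer")

# Number of time steps in which each layer index is missing
missing_per_layer = missing.sum(dim="time")

print("Missing fraction per time step:", missing_per_time.values)
print("Missing count per layer:", missing_per_layer.values)
```

In a real xsnow dataset, some of these gaps may simply be padding for profiles with fewer layers, so a high missing count in the upper layer indices is not necessarily a data quality problem.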


## Part 2: Missing Data Handling Strategies

### Strategy 1: Interpolation for Time Series

For time series data, interpolation estimates missing values from neighboring time points.


In [None]:
# Example: Interpolate missing values in snow height time series
hs_series = ds['HS'].isel(location=0, slope=0, realization=0)

# Method 1: Linear interpolation (estimates between known points)
hs_interpolated_linear = hs_series.interpolate_na(dim='time', method='linear')
print("Linear interpolation: Estimates missing values from neighboring time points")

# Method 2: Polynomial interpolation (smoother curves; this method requires SciPy)
hs_interpolated_poly = hs_series.interpolate_na(dim='time', method='polynomial', order=2)
print("Polynomial interpolation: Uses polynomial fit for smoother estimates")

# Compare results
print(f"\nOriginal missing values: {hs_series.isnull().sum().values}")
print(f"After linear interpolation: {hs_interpolated_linear.isnull().sum().values}")
print(f"After polynomial interpolation: {hs_interpolated_poly.isnull().sum().values}")
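
Two details of linear interpolation are easy to miss: it does not extrapolate, so gaps at the start or end of a series stay missing, and by default it uses the coordinate values rather than the array positions. The following sketch uses a made-up, unevenly spaced series (the `time` values are illustrative hours, not xsnow output) to show both effects.

```python
import numpy as np
import xarray as xr

# Made-up snow height series with uneven time spacing (hours)
hs = xr.DataArray(
    [np.nan, 100.0, np.nan, 140.0, np.nan],
    dims="time",
    coords={"time": [0, 1, 2, 6, 7]},
    name="HS",
)

# Default: interpolate using the 'time' coordinate values
by_coord = hs.interpolate_na(dim="time", method="linear")

# Alternative: treat the points as equally spaced and ignore the coordinate
by_index = hs.interpolate_na(dim="time", method="linear", use_coordinate=False)

print("By coordinate:", by_coord.values)
print("By position:  ", by_index.values)
print("Still missing after interpolation:", int(by_coord.isnull().sum()))
```

The interior gap at hour 2 is filled differently in the two cases because the neighbors are at hours 1 and 6. The leading and trailing gaps remain NaN and need a separate strategy, such as a fill or a drop.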


### Strategy 2: Forward/Backward Fill for Profile Data

For profile data (depth layers), forward or backward fill copies the nearest valid value along the `layer` dimension into each gap.


In [None]:
# Example: Fill missing density values in a profile
profile_density = ds['density'].isel(location=0, time=0, slope=0, realization=0)

# Forward fill: carry the previous layer's value along 'layer'
# (xarray needs the optional bottleneck package for this; recent versions also accept numbagg)
density_ffill = profile_density.ffill(dim='layer')
print("Forward fill: Missing values filled with previous layer's density")

# Backward fill: carry the next layer's value back along 'layer'
density_bfill = profile_density.bfill(dim='layer')
print("Backward fill: Missing values filled with next layer's density")

# Combined: forward then backward (also fills leading gaps that forward fill leaves)
density_combined = profile_density.ffill(dim='layer').bfill(dim='layer')
print("Combined fill: Forward then backward fill")

# Check results
print(f"\nOriginal missing: {profile_density.isnull().sum().values}")
print(f"After forward fill: {density_ffill.isnull().sum().values}")
print(f"After backward fill: {density_bfill.isnull().sum().values}")
print(f"After combined fill: {density_combined.isnull().sum().values}")
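
To see what each fill does, and why combining them matters, here is a NumPy-only sketch on a made-up six-layer profile. `forward_fill_1d` is a hypothetical helper written for illustration; it is not part of xarray or xsnow.

```python
import numpy as np

def forward_fill_1d(values):
    """Hypothetical helper: replace each NaN with the last valid value before it."""
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    # Index of the most recent valid entry at each position (0 until one is seen)
    last_valid = np.maximum.accumulate(np.where(valid, np.arange(values.size), 0))
    # Leading NaNs map to index 0, which is itself NaN, so they stay missing
    return values[last_valid]

# Made-up density profile with a leading, an interior, and a trailing gap
profile = np.array([np.nan, 200.0, np.nan, np.nan, 350.0, np.nan])

ffilled = forward_fill_1d(profile)                # leading gap remains
bfilled = forward_fill_1d(profile[::-1])[::-1]    # trailing gap remains
combined = forward_fill_1d(ffilled[::-1])[::-1]   # forward, then backward

print("Forward: ", ffilled)
print("Backward:", bfilled)
print("Combined:", combined)
```

Forward fill alone already closes the interior gap; the backward pass is only needed for the values before the first valid layer.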


### Strategy 3: Statistical Imputation

Fill missing values with a summary statistic such as the mean or median.


In [None]:
# Example: Fill missing density with mean value
mean_density = ds['density'].mean().values
density_mean_filled = ds['density'].fillna(mean_density)
print(f"Mean fill: Missing values filled with overall mean ({mean_density:.1f} kg/m³)")

# Fill with median (more robust to outliers)
median_density = ds['density'].median().values
density_median_filled = ds['density'].fillna(median_density)
print(f"Median fill: Missing values filled with overall median ({median_density:.1f} kg/m³)")

# Fill with location-specific mean (better for multi-location data)
density_location_mean = ds['density'].mean(dim=['time', 'layer', 'slope', 'realization'])
density_location_filled = ds['density'].fillna(density_location_mean)
print("Location-specific mean fill: Each location uses its own mean")
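
The location-specific fill works because `.fillna()` accepts a `DataArray` and broadcasts it by dimension name. The sketch below shows this on a small synthetic array (the site names and values are invented), and also keeps a mask of which values were imputed so they can be excluded or flagged later.

```python
import numpy as np
import xarray as xr

# Synthetic density: 2 locations x 3 layers (made-up values)
density = xr.DataArray(
    [[200.0, np.nan, 400.0],
     [100.0, 150.0, np.nan]],
    dims=("location", "layer"),
    coords={"location": ["site_a", "site_b"]},
    name="density",
)

# Remember which entries are about to be imputed
was_imputed = density.isnull()

# Per-location mean; NaNs are skipped by default for float data
location_mean = density.mean(dim="layer")

# The 1-D mean is broadcast along 'layer' to match the 2-D array
filled = density.fillna(location_mean)

print(filled.values)
print("Imputed values:", int(was_imputed.sum()))
```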


### Strategy 4: Dropping Missing Data

Sometimes it's better to remove data with missing values.


In [None]:
# Example: Drop missing data with different strategies

# Strategy 1: Drop any layer index where density is missing anywhere
# (for a multi-dimensional array, one NaN in any profile removes that layer)
density_no_missing_layers = ds['density'].dropna(dim='layer')
print(f"Drop missing layers: {density_no_missing_layers.sizes.get('layer', 'N/A')} layers remaining")

# Strategy 2: Drop entire profiles (time steps) with ANY missing values (aggressive)
profiles_clean = ds.dropna(dim='time', how='any')
print(f"Drop profiles with any missing: {profiles_clean.sizes.get('time', 'N/A')} profiles remaining")

# Strategy 3: Drop a profile only if ALL of its values are missing (conservative)
profiles_partial = ds.dropna(dim='time', how='all')
print(f"Drop only fully-missing profiles: {profiles_partial.sizes.get('time', 'N/A')} profiles remaining")
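
Between "any" and "all" there is a middle ground: `dropna` also takes a `thresh` argument that keeps an entry only if it has at least that many valid values. A minimal sketch on synthetic data (four made-up profiles with three layers each) compares the three options.

```python
import numpy as np
import xarray as xr

# Synthetic density: 4 time steps x 3 layers (made-up values)
density = xr.DataArray(
    [[250.0, 300.0, 320.0],
     [240.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan],
     [230.0, 280.0, np.nan]],
    dims=("time", "layer"),
    coords={"time": [0, 1, 2, 3]},
    name="density",
)

strict = density.dropna(dim="time", how="any")        # complete profiles only
lenient = density.dropna(dim="time", how="all")       # drop empty profiles only
at_least_two = density.dropna(dim="time", thresh=2)   # need >= 2 valid layers

print("how='any':", strict.time.values)
print("how='all':", lenient.time.values)
print("thresh=2: ", at_least_two.time.values)
```

The right threshold depends on the analysis; the value 2 here is arbitrary.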


## Part 3: Data Validation

Check for invalid values, outliers, and data quality issues.


In [None]:
# Check for reasonable value ranges
density_vals = ds['density'].values
valid_vals = density_vals[~np.isnan(density_vals)]

if len(valid_vals) > 0:
    print(f"Density range: {valid_vals.min():.1f} to {valid_vals.max():.1f} kg/m³")
    if valid_vals.min() < 0 or valid_vals.max() > 1000:
        print("⚠️ Warning: Density values outside typical range (0-1000 kg/m³)")
    else:
        print("✅ Density values in reasonable range")
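
A fixed range catches physically impossible values but not plausible-looking spikes. One common heuristic for those is the interquartile range (IQR) rule, sketched below on a made-up snow height series. The 1.5 multiplier is the conventional default rather than anything specific to xsnow, and flagged values should be inspected before they are discarded.

```python
import numpy as np
import xarray as xr

# Made-up snow height series with one spike and one missing value
hs = xr.DataArray(
    [100.0, 102.0, 101.0, 103.0, 250.0, 104.0, np.nan, 102.0],
    dims="time",
    name="HS",
)

# Quartiles computed over the valid values only
q1, q3 = np.nanpercentile(hs.values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# NaN compares as False, so missing values are never flagged as outliers
is_outlier = (hs < lower) | (hs > upper)

# Replace flagged values with NaN so they can be handled like other gaps
hs_masked = hs.where(~is_outlier)

print(f"Accepted range: {lower:.1f} to {upper:.1f}")
print("Outliers flagged:", int(is_outlier.sum()))
```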


## Summary

✅ **What we learned:**

1. **Missing data detection**: Using `.isnull()` and `.sum()`
2. **Interpolation**: For time series data
3. **Forward/backward fill**: For profile data
4. **Statistical imputation**: Using mean, median, etc.
5. **Dropping data**: When appropriate
6. **Data validation**: Checking value ranges and quality

## Key Techniques

- **`.isnull()`**: Detect missing values
- **`.interpolate_na()`**: Interpolate missing values
- **`.fillna()`**: Fill missing values with a constant or an array
- **`.ffill()` / `.bfill()`**: Propagate the nearest valid value along a dimension
- **`.dropna()`**: Remove missing values
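
The techniques listed above can be chained into a small, repeatable cleaning step. The following is a minimal sketch, not a production pipeline: `clean_series` is a hypothetical helper, the `-999.0` sentinel and the range limits are assumptions for illustration, and real limits depend on the variable and its units.

```python
import numpy as np
import xarray as xr

def clean_series(da, dim, valid_min, valid_max):
    """Hypothetical helper: mask out-of-range values, then interpolate along `dim`."""
    in_range = (da >= valid_min) & (da <= valid_max)
    masked = da.where(in_range)  # out-of-range values become NaN
    cleaned = masked.interpolate_na(dim=dim, method="linear")
    report = {
        "missing_before": int(da.isnull().sum()),
        "out_of_range": int((da.notnull() & ~in_range).sum()),
        "missing_after": int(cleaned.isnull().sum()),
    }
    return cleaned, report

# Made-up series with two gaps and one sentinel value
hs = xr.DataArray(
    [100.0, np.nan, 110.0, -999.0, 120.0, np.nan],
    dims="time",
    coords={"time": np.arange(6)},
    name="HS",
)

cleaned, report = clean_series(hs, dim="time", valid_min=0.0, valid_max=1000.0)
print(cleaned.values)
print(report)
```

Returning a report alongside the cleaned data makes it easy to log how much was changed. The trailing gap is left missing because linear interpolation does not extrapolate.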

## Next Steps

Return to the main tutorial notebooks to continue learning xsnow.
