# Exploratory Data Analysis of NWM and USGS Time Series

This notebook performs exploratory analysis of the National Water Model (NWM) forecasts and USGS observed runoff data.

## Objectives:
1. Understand the structure and properties of the data
2. Identify patterns and trends in NWM forecasts and USGS observations
3. Visualize forecast errors across different lead times
4. Evaluate data quality issues such as missing values and outliers
5. Identify potential features for model development

In [4]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from glob import glob

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams.update({'font.size': 12})

## 1. Data Loading and Inspection

First, let's load the USGS observed data and NWM forecast data for both streams.
Based on the file inspection, we know that:

- USGS files have columns: 'DateTime', 'USGSFlowValue', and quality code
- NWM files have columns: 'model_initialization_time', 'model_output_valid_time', 'streamflow_value'
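
A quick column check before parsing can surface a name mismatch early, instead of failing later with a `KeyError`. The following is a minimal sketch, not part of the original notebook: `missing_columns` and the two constant sets are hypothetical names, and the quality-code column is left out because its exact name is not given above.

```python
import pandas as pd

# Expected columns, copied from the lists above. The quality-code column is
# omitted because its exact name is not stated in this notebook.
USGS_COLUMNS = {"DateTime", "USGSFlowValue"}
NWM_COLUMNS = {
    "model_initialization_time",
    "model_output_valid_time",
    "streamflow_value",
}

def missing_columns(df, expected):
    """Return the expected column names that are absent from df, sorted."""
    return sorted(set(expected) - set(df.columns))

# Toy frame with a deliberately wrong column name (illustrative values only)
toy = pd.DataFrame({"datetime": ["2022-01-01 00:00:00"], "USGSFlowValue": [1.0]})
print(missing_columns(toy, USGS_COLUMNS))  # ['DateTime']
```

Calling such a helper right after `pd.read_csv` would let the loaders raise a descriptive error that names the missing column.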

In [None]:
# Define paths - adjust to match the actual data location
data_path = "../data/raw"
stream_ids = ["20380357", "21609641"]

# Function to load USGS data with correct column names
def load_usgs_data(stream_id):
    # Sort so that the file picked below is deterministic (glob order is arbitrary)
    usgs_files = sorted(glob(os.path.join(data_path, str(stream_id), "*_Strt_*.csv")))
    if not usgs_files:
        raise FileNotFoundError(f"No USGS data files found for stream {stream_id}")
    
    usgs_df = pd.read_csv(usgs_files[0])
    # Convert to datetime with correct column name
    usgs_df['DateTime'] = pd.to_datetime(usgs_df['DateTime'])
    usgs_df.set_index('DateTime', inplace=True)
    
    return usgs_df

# Function to load NWM data with correct column names
def load_nwm_data(stream_id):
    nwm_files = glob(os.path.join(data_path, str(stream_id), "streamflow_*.csv"))
    if not nwm_files:
        raise FileNotFoundError(f"No NWM data files found for stream {stream_id}")
    
    dfs = []
    for file in nwm_files:
        df = pd.read_csv(file)
        dfs.append(df)
    
    nwm_df = pd.concat(dfs, ignore_index=True)
    # Convert to datetime using correct format for these columns
    nwm_df['model_initialization_time'] = pd.to_datetime(nwm_df['model_initialization_time'], format='%Y-%m-%d_%H:%M:%S')
    nwm_df['model_output_valid_time'] = pd.to_datetime(nwm_df['model_output_valid_time'], format='%Y-%m-%d_%H:%M:%S')
    
    # Calculate lead time in hours (needed for analysis)
    nwm_df['lead_time'] = (nwm_df['model_output_valid_time'] - nwm_df['model_initialization_time']).dt.total_seconds() / 3600
    
    return nwm_df

# Load data for each stream
data = {}
for stream_id in stream_ids:
    try:
        usgs_df = load_usgs_data(stream_id)
        nwm_df = load_nwm_data(stream_id)
        data[stream_id] = {
            "usgs": usgs_df,
            "nwm": nwm_df
        }
        print(f"Successfully loaded data for stream {stream_id}")
    except Exception as e:
        print(f"Error loading data for stream {stream_id}: {e}")

## 2. Examining Data Structure

Let's take a look at the structure and contents of both datasets.

In [None]:
# Stream ID to examine
stream_id = stream_ids[0]  # First stream

print("USGS Data Structure:")
print(data[stream_id]["usgs"].head())
print("\nUSGS Data Info:")
print(data[stream_id]["usgs"].info())
print("\nUSGS Data Statistics:")
print(data[stream_id]["usgs"].describe())

print("\n\nNWM Data Structure:")
print(data[stream_id]["nwm"].head())
print("\nNWM Data Info:")
print(data[stream_id]["nwm"].info())
print("\nNWM Data Statistics by Lead Time:")
print(data[stream_id]["nwm"].groupby("lead_time")["streamflow_value"].describe().head())

## 3. Temporal Coverage and Availability

Let's examine the temporal coverage and availability of the data.

In [None]:
for stream_id in stream_ids:
    usgs_df = data[stream_id]["usgs"]
    nwm_df = data[stream_id]["nwm"]
    
    print(f"Stream {stream_id} - USGS Data:")
    print(f"Start date: {usgs_df.index.min()}")
    print(f"End date: {usgs_df.index.max()}")
    print(f"Total records: {len(usgs_df)}")
    print(f"Missing values: {usgs_df.isna().sum().sum()}")
    
    print(f"\nStream {stream_id} - NWM Data:")
    print(f"Start initialization time: {nwm_df['model_initialization_time'].min()}")
    print(f"End initialization time: {nwm_df['model_initialization_time'].max()}")
    print(f"Start valid time: {nwm_df['model_output_valid_time'].min()}")
    print(f"End valid time: {nwm_df['model_output_valid_time'].max()}")
    print(f"Total records: {len(nwm_df)}")
    print(f"Missing values: {nwm_df.isna().sum().sum()}")
    print("-" * 60)

## 4. Data Distribution Analysis

Let's visualize the distribution of runoff values in both datasets.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

for i, stream_id in enumerate(stream_ids):
    usgs_df = data[stream_id]["usgs"]
    nwm_df = data[stream_id]["nwm"]
    
    # Histogram of USGS observed runoff (updated column name)
    ax = axes[i, 0]
    ax.hist(usgs_df['USGSFlowValue'], bins=50, alpha=0.7)
    ax.set_title(f"Stream {stream_id} - USGS Observed Runoff Distribution")
    ax.set_xlabel("Runoff")
    ax.set_ylabel("Frequency")
    
    # Histogram of NWM forecast runoff by lead time (updated column name)
    ax = axes[i, 1]
    lead_times = [1, 6, 12, 18]  # Representative lead times
    for lead in lead_times:
        lead_data = nwm_df[nwm_df['lead_time'] == lead]['streamflow_value']
        ax.hist(lead_data, bins=50, alpha=0.4, label=f"Lead {lead}h")
    
    ax.set_title(f"Stream {stream_id} - NWM Forecast Runoff Distribution by Lead Time")
    ax.set_xlabel("Runoff")
    ax.set_ylabel("Frequency")
    ax.legend()

plt.tight_layout()
plt.show()

## 5. Time Series Alignment

Let's restructure the NWM data to align with USGS observations based on valid times.

In [None]:
# Restructure NWM data to align with USGS observations
def align_nwm_with_usgs(stream_id, lead_times=(1, 6, 12, 18)):
    usgs_df = data[stream_id]["usgs"]
    nwm_df = data[stream_id]["nwm"]
    
    aligned_data = {}
    # Using the correct column name for USGS flow data
    aligned_data['usgs'] = usgs_df['USGSFlowValue']
    
    for lead in lead_times:
        # Filter for this lead time (we calculated this from timestamps)
        lead_df = nwm_df[nwm_df['lead_time'] == lead].copy()
        # Set index to valid time (when the forecast is for)
        lead_df.set_index('model_output_valid_time', inplace=True)
        # Get the streamflow column using correct name
        lead_series = lead_df['streamflow_value']
        # Add to aligned data with a descriptive name
        aligned_data[f'nwm_lead_{lead}'] = lead_series
    
    # Combine into one DataFrame
    return pd.DataFrame(aligned_data)

In [None]:
# Create aligned data for visualization
for stream_id in stream_ids:
    aligned_df = align_nwm_with_usgs(stream_id)
    
    # Check how much data we have before and after alignment
    usgs_count = len(data[stream_id]["usgs"])
    aligned_count = len(aligned_df.dropna())
    data_retention = (aligned_count / usgs_count) * 100 if usgs_count > 0 else 0
    
    print(f"Stream {stream_id} - Data retention after alignment: {data_retention:.2f}% ({aligned_count}/{usgs_count} rows)")
    
    # Plot time series for a 3-month period
    sample_start = pd.Timestamp('2022-01-01')
    sample_end = pd.Timestamp('2022-04-01')
    sample_df = aligned_df.loc[sample_start:sample_end]
    
    plt.figure(figsize=(18, 8))
    plt.plot(sample_df.index, sample_df['usgs'], label='USGS Observed', linewidth=2)
    for lead in [1, 6, 12, 18]:
        if f'nwm_lead_{lead}' in sample_df.columns:
            plt.plot(sample_df.index, sample_df[f'nwm_lead_{lead}'], 
                     label=f'NWM Lead {lead}h', alpha=0.7)
    
    plt.title(f"Stream {stream_id} - Observed vs. Forecast Runoff (Jan-Mar 2022)")
    plt.xlabel("Date")
    plt.ylabel("Runoff")
    plt.legend()
    plt.grid(True)
    plt.show()

## 6. Forecast Error Analysis

Let's calculate and visualize the errors in the NWM forecasts compared to USGS observations.

In [None]:
# Calculate forecast errors
for stream_id in stream_ids:
    aligned_df = align_nwm_with_usgs(stream_id, lead_times=range(1, 19))
    
    # Add error columns (forecast - observed)
    for lead in range(1, 19):
        if f'nwm_lead_{lead}' in aligned_df.columns:
            aligned_df[f'error_lead_{lead}'] = aligned_df[f'nwm_lead_{lead}'] - aligned_df['usgs']
    
    # Plot error distributions for selected lead times
    plt.figure(figsize=(15, 8))
    lead_times = [1, 6, 12, 18]
    for i, lead in enumerate(lead_times):
        if f'error_lead_{lead}' in aligned_df.columns:
            plt.subplot(2, 2, i+1)
            sns.histplot(aligned_df[f'error_lead_{lead}'].dropna(), kde=True)
            plt.title(f"Stream {stream_id} - Error Distribution (Lead {lead}h)")
            plt.xlabel("Forecast Error (NWM - USGS)")
    
    plt.tight_layout()
    plt.show()
    
    # Plot error vs lead time
    error_means = [aligned_df[f'error_lead_{lead}'].mean() for lead in range(1, 19) if f'error_lead_{lead}' in aligned_df.columns]
    error_stds = [aligned_df[f'error_lead_{lead}'].std() for lead in range(1, 19) if f'error_lead_{lead}' in aligned_df.columns]
    lead_list = [lead for lead in range(1, 19) if f'error_lead_{lead}' in aligned_df.columns]
    
    plt.figure(figsize=(12, 6))
    plt.errorbar(lead_list, error_means, yerr=error_stds, fmt='o-')
    plt.title(f"Stream {stream_id} - Mean Forecast Error by Lead Time")
    plt.xlabel("Lead Time (hours)")
    plt.ylabel("Mean Error (NWM - USGS)")
    plt.grid(True)
    plt.show()

## 7. Correlation Analysis

Let's investigate the correlation between observed values and forecasts at different lead times.

In [None]:
import scipy.stats as stats

for stream_id in stream_ids:
    aligned_df = align_nwm_with_usgs(stream_id, lead_times=range(1, 19))
    
    correlations = []
    lead_times = []
    
    for lead in range(1, 19):
        col = f'nwm_lead_{lead}'
        if col in aligned_df.columns:
            # Drop any NaN values
            valid_data = aligned_df[['usgs', col]].dropna()
            if len(valid_data) > 1:  # pearsonr needs at least two points
                corr, _ = stats.pearsonr(valid_data['usgs'], valid_data[col])
                correlations.append(corr)
                lead_times.append(lead)
    
    # Plot correlation vs lead time
    plt.figure(figsize=(12, 6))
    plt.plot(lead_times, correlations, 'o-')
    plt.title(f"Stream {stream_id} - Correlation between Observed and Forecast Runoff")
    plt.xlabel("Lead Time (hours)")
    plt.ylabel("Pearson Correlation Coefficient")
    plt.grid(True)
    plt.ylim(0, 1)
    plt.show()

## 8. Seasonal Patterns in Forecast Errors

Let's investigate if there are seasonal patterns in the forecast errors.

In [None]:
for stream_id in stream_ids:
    aligned_df = align_nwm_with_usgs(stream_id)
    
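    # Add error columns (forecast - observed); align_nwm_with_usgs does not create them
    for lead in [1, 6, 12, 18]:
        if f'nwm_lead_{lead}' in aligned_df.columns:
            aligned_df[f'error_lead_{lead}'] = aligned_df[f'nwm_lead_{lead}'] - aligned_df['usgs']
    
    # Add month column
    aligned_df['month'] = aligned_df.index.month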
    
    # Calculate mean errors by month for different lead times
    plt.figure(figsize=(12, 8))
    for i, lead in enumerate([1, 6, 12, 18]):
        if f'error_lead_{lead}' in aligned_df.columns:
            monthly_errors = aligned_df.groupby('month')[f'error_lead_{lead}'].mean()
            plt.plot(monthly_errors.index, monthly_errors.values, 'o-', 
                     label=f'Lead {lead}h')
    
    plt.title(f"Stream {stream_id} - Monthly Mean Forecast Error")
    plt.xlabel("Month")
    plt.ylabel("Mean Error (NWM - USGS)")
    plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                             'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    plt.legend()
    plt.grid(True)
    plt.show()

## 9. Autocorrelation Analysis of Forecast Errors

Let's investigate the autocorrelation in forecast errors.

In [None]:
from pandas.plotting import autocorrelation_plot

for stream_id in stream_ids:
    aligned_df = align_nwm_with_usgs(stream_id)
    
    # Add error columns if they don't exist
    for lead in [1, 6, 12, 18]:
        if f'nwm_lead_{lead}' in aligned_df.columns and f'error_lead_{lead}' not in aligned_df.columns:
            aligned_df[f'error_lead_{lead}'] = aligned_df[f'nwm_lead_{lead}'] - aligned_df['usgs']
    
    # Plot autocorrelation for errors at different lead times
    plt.figure(figsize=(15, 12))
    for i, lead in enumerate([1, 6, 12, 18]):
        if f'error_lead_{lead}' in aligned_df.columns:
            plt.subplot(2, 2, i+1)
            autocorrelation_plot(aligned_df[f'error_lead_{lead}'].dropna())
            plt.title(f"Stream {stream_id} - Error Autocorrelation (Lead {lead}h)")
    
    plt.tight_layout()
    plt.show()

## 10. Error Persistence Analysis

Let's analyze whether errors are persistent across lead times, which would indicate potential advantages for the baseline persistence model.

In [None]:
for stream_id in stream_ids:
    aligned_df = align_nwm_with_usgs(stream_id, lead_times=range(1, 19))
    
    # Calculate errors
    for lead in range(1, 19):
        col = f'nwm_lead_{lead}'
        if col in aligned_df.columns and f'error_lead_{lead}' not in aligned_df.columns:
            aligned_df[f'error_lead_{lead}'] = aligned_df[col] - aligned_df['usgs']
    
    # Calculate correlation between errors at different lead times
    error_cols = [col for col in aligned_df.columns if col.startswith('error_lead_')]
    if error_cols:
        error_corr = aligned_df[error_cols].corr()
        
        plt.figure(figsize=(12, 10))
        sns.heatmap(error_corr, annot=True, cmap='coolwarm', fmt='.2f', 
                   linewidths=0.5, square=True)
        plt.title(f"Stream {stream_id} - Error Correlation Matrix")
        plt.tight_layout()
        plt.show()

## 11. Flow Regime Analysis

Let's examine how forecast errors vary across different flow regimes (low, medium, high flows).

In [None]:
for stream_id in stream_ids:
    aligned_df = align_nwm_with_usgs(stream_id)
    
    # Add flow regime classification based on quantiles
    flow_quantiles = aligned_df['usgs'].quantile([0.25, 0.75]).values
    aligned_df['flow_regime'] = 'Medium'
    aligned_df.loc[aligned_df['usgs'] <= flow_quantiles[0], 'flow_regime'] = 'Low'
    aligned_df.loc[aligned_df['usgs'] >= flow_quantiles[1], 'flow_regime'] = 'High'
    
    # Make sure error columns exist
    lead_times = [1, 6, 12, 18]
    for lead in lead_times:
        if f'nwm_lead_{lead}' in aligned_df.columns and f'error_lead_{lead}' not in aligned_df.columns:
            aligned_df[f'error_lead_{lead}'] = aligned_df[f'nwm_lead_{lead}'] - aligned_df['usgs']
    
    # Create boxplots of errors by flow regime
    plt.figure(figsize=(15, 10))
    for i, lead in enumerate(lead_times):
        if f'error_lead_{lead}' in aligned_df.columns:
            plt.subplot(2, 2, i+1)
            error_by_regime = [aligned_df[aligned_df['flow_regime'] == regime][f'error_lead_{lead}'].dropna() 
                              for regime in ['Low', 'Medium', 'High']]
            
            plt.boxplot(error_by_regime, labels=['Low', 'Medium', 'High'], showfliers=False)
            plt.title(f"Stream {stream_id} - Forecast Error by Flow Regime (Lead {lead}h)")
            plt.xlabel("Flow Regime")
            plt.ylabel("Forecast Error (NWM - USGS)")
            plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Calculate and print error statistics by flow regime
    print(f"Error Statistics by Flow Regime for Stream {stream_id}:")
    for lead in lead_times:
        if f'error_lead_{lead}' in aligned_df.columns:
            print(f"\nLead Time: {lead} hours")
            for regime in ['Low', 'Medium', 'High']:
                regime_errors = aligned_df[aligned_df['flow_regime'] == regime][f'error_lead_{lead}'].dropna()
                if len(regime_errors) > 0:
                    print(f"  {regime} Flow Regime:")
                    print(f"    Mean Error: {regime_errors.mean():.2f}")
                    print(f"    RMSE: {np.sqrt(np.mean(regime_errors**2)):.2f}")
                    print(f"    % Bias: {100 * regime_errors.mean() / aligned_df[aligned_df['flow_regime'] == regime]['usgs'].mean():.2f}%")
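
The statistics printed above could be collected in a single helper so that the same definitions are reused for every lead time and regime. This is a sketch under assumptions, not the notebook's own code: `evaluate` is a hypothetical name, and it adds the Pearson correlation and the Nash-Sutcliffe efficiency (NSE) to the mean error, RMSE, and percent bias already computed. Percent bias follows the sign convention used in this notebook (forecast minus observed); some references use the opposite sign.

```python
import numpy as np

def evaluate(obs, sim):
    """Return CC, RMSE, PBIAS and NSE for paired observed/simulated values."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)

    # Keep only pairs where both values are present
    mask = ~(np.isnan(obs) | np.isnan(sim))
    obs, sim = obs[mask], sim[mask]

    err = sim - obs  # same sign convention as the notebook: NWM - USGS
    return {
        "CC": float(np.corrcoef(obs, sim)[0, 1]),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        # Percent bias relative to total observed flow (undefined if that sum is 0)
        "PBIAS": float(100.0 * err.sum() / obs.sum()),
        # NSE: 1 is a perfect fit, 0 matches the skill of the observed mean
        "NSE": float(1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)),
    }

# Toy example: a forecast that is uniformly 1 unit too high (illustrative values)
print(evaluate([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

With the aligned data, a call might look like `evaluate(aligned_df['usgs'], aligned_df['nwm_lead_6'])`, optionally after filtering rows by `flow_regime`.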

## 12. Data Alignment Analysis

The README mentions a significant data loss (~75%) during alignment. Let's investigate the causes of this data loss to better understand the challenges in data preprocessing.

In [None]:
for stream_id in stream_ids:
    usgs_df = data[stream_id]["usgs"]
    nwm_df = data[stream_id]["nwm"]
    
    # Check for missing data by time pattern
    usgs_by_hour = usgs_df.groupby(usgs_df.index.hour).count()['USGSFlowValue']  # Using correct column name
    nwm_by_hour = nwm_df.groupby(nwm_df['model_output_valid_time'].dt.hour).count()['streamflow_value']  # Using correct column name
    
    # Plot record counts by hour of day (gaps in coverage show up as low bars)
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    sns.barplot(x=usgs_by_hour.index, y=usgs_by_hour.values)
    plt.title(f"Stream {stream_id} - USGS Data Count by Hour")
    plt.xlabel("Hour of Day")
    plt.ylabel("Data Count")
    
    plt.subplot(1, 2, 2)
    sns.barplot(x=nwm_by_hour.index, y=nwm_by_hour.values)
    plt.title(f"Stream {stream_id} - NWM Data Count by Hour")
    plt.xlabel("Hour of Day")
    plt.ylabel("Data Count")
    
    plt.tight_layout()
    plt.show()
    
    # Count NWM data by lead time
    nwm_by_lead = nwm_df.groupby('lead_time').count()['streamflow_value']  # Using correct column name
    
    plt.figure(figsize=(12, 6))
    sns.barplot(x=nwm_by_lead.index.astype(int), y=nwm_by_lead.values)
    plt.title(f"Stream {stream_id} - NWM Data Count by Lead Time")
    plt.xlabel("Lead Time (hours)")
    plt.ylabel("Data Count")
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # Calculate percentage of aligned data
    usgs_timestamps = set(usgs_df.index)
    nwm_timestamps = {}
    
    for lead in range(1, 19):
        lead_df = nwm_df[nwm_df['lead_time'] == lead]
        if not lead_df.empty:
            nwm_timestamps[lead] = set(lead_df['model_output_valid_time'])
    
    # Calculate intersection for each lead time
    for lead in range(1, 19):
        if lead in nwm_timestamps:
            intersection = usgs_timestamps.intersection(nwm_timestamps[lead])
            if usgs_timestamps:  # guard the division below against an empty USGS record
                match_percent = len(intersection) / len(usgs_timestamps) * 100
                print(f"Stream {stream_id} - Lead {lead}h: {match_percent:.1f}% of USGS timestamps have matching NWM forecasts")
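
One candidate explanation for low match percentages is a difference in sampling interval, for example sub-hourly USGS readings against hourly NWM valid times. That is an assumption to verify, not something this notebook has established. The sketch below measures how many timestamps fall exactly on the hour and shows hourly averaging as one possible remedy; `on_the_hour_fraction` is a hypothetical helper and the 15-minute toy series stands in for `usgs_df['USGSFlowValue']`.

```python
import pandas as pd

def on_the_hour_fraction(index):
    """Fraction of timestamps in a DatetimeIndex that fall exactly on the hour."""
    if len(index) == 0:
        return 0.0
    on_hour = (index.minute == 0) & (index.second == 0)
    return float(on_hour.mean())

# Toy 15-minute series (illustrative values, not real gauge data)
toy_index = pd.date_range("2022-01-01 00:00", periods=8, freq="15min")
toy_flow = pd.Series(range(8), index=toy_index, dtype=float)

print(on_the_hour_fraction(toy_flow.index))  # 0.25

# One possible remedy: aggregate sub-hourly observations to hourly means
hourly = toy_flow.resample("1h").mean()
print(hourly)
```

If the fraction for the real USGS index is close to 0.25, most of the reported loss would come from off-hour readings rather than from gaps. Whether averaging or picking the on-the-hour reading is more appropriate depends on how the forecasts are meant to be verified.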

## 13. Summary of Findings and Implications for Model Development

Based on our exploratory data analysis, here is a summary of key findings and their implications for the Seq2Seq LSTM model development:

### Key Findings

1. **Data Coverage and Alignment:**
   - Significant data loss during alignment between NWM forecasts and USGS observations
   - Temporal patterns in data availability may affect model training
   - NWM forecasts show varying availability across different lead times

2. **Error Distribution and Characteristics:**
   - Forecast errors increase with lead time, as expected in hydrological forecasting
   - Error distributions tend to be positively skewed (the long right tail corresponds to NWM overpredicting runoff)
   - Strong autocorrelation in forecast errors suggests potential for the LSTM to leverage temporal patterns

3. **Flow Regime Analysis:**
   - Error magnitude and bias vary significantly across different flow regimes (low, medium, high)
   - High flows show larger absolute errors but potentially smaller relative errors
   - The model may need to handle these regime-dependent error patterns

4. **Seasonal Patterns:**
   - Clear seasonal patterns in forecast errors suggest the importance of capturing seasonal dependencies
   - These patterns differ by lead time, indicating potentially complex temporal dynamics

5. **Error Persistence and Correlation:**
   - Strong correlation between errors at consecutive lead times
   - Error autocorrelation suggests past errors contain valuable information for future error prediction
   - The persistence baseline model likely benefits from these correlations
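
Several of the findings above (bias, skewness, autocorrelation) can also be checked numerically rather than read off the plots. A minimal sketch, assuming an error series such as `aligned_df['error_lead_1']`; `error_summary` is a hypothetical name and the toy values are for illustration only.

```python
import numpy as np
import pandas as pd

def error_summary(errors):
    """Summarize a forecast-error series (forecast - observed)."""
    errors = errors.dropna()
    return {
        "mean": float(errors.mean()),                  # bias
        "rmse": float(np.sqrt((errors ** 2).mean())),  # overall magnitude
        "skew": float(errors.skew()),                  # > 0 means a long right tail
        # Lag-1 autocorrelation; dropping NaNs above assumes the remaining
        # points are still roughly evenly spaced in time
        "lag1_autocorr": float(errors.autocorr(lag=1)),
    }

# Toy error series with two large overpredictions (illustrative values)
toy_errors = pd.Series([0.1, 0.2, 0.1, 0.3, 2.0, 1.5, 0.4, 0.2])
summary = error_summary(toy_errors)
print(summary)
```

Running this for each lead time from 1 to 18 hours would give a table against which the statements above can be compared.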

### Implications for Seq2Seq LSTM Model Development

1. **Feature Engineering:**
   - Include lagged values of observed runoff, forecasts, and errors as they contain valuable information
   - Consider adding explicit seasonal features (e.g., month, day of year) to help capture seasonal patterns
   - Flow regime indicators might help the model adapt to different error patterns across flow magnitudes

2. **Model Architecture:**
   - The sequence length should be sufficient to capture relevant temporal dependencies (24 hours seems reasonable based on autocorrelation analysis)
   - The strong correlation between errors at different lead times supports the simultaneous prediction of all lead times (1-18h)
   - Consider using attention mechanisms to allow the model to focus on the most relevant parts of the input sequence

3. **Training Considerations:**
   - The positively skewed error distributions suggest that a custom loss function might be important
   - Different weighting schemes might be explored to balance performance across flow regimes
   - TimeSeriesSplit cross-validation is crucial given the strong temporal dependencies in the data

4. **Evaluation Strategy:**
   - Evaluate model performance separately for different flow regimes and seasons
   - Compare against both the raw NWM forecasts and the persistence baseline
   - Use multiple metrics (CC, RMSE, PBIAS, NSE) as they provide complementary information about model performance
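
The feature ideas listed under Feature Engineering can be sketched as follows. This is an illustration under assumptions rather than the project's actual pipeline: `build_features` is a hypothetical name, the lag set is arbitrary, lags are taken by row (which presumes a regular hourly index), and the day of year is encoded with sine and cosine terms as one common way to represent seasonality.

```python
import numpy as np
import pandas as pd

def build_features(df, obs_col="usgs", lags=(1, 2, 3)):
    """Add lagged observations, seasonal terms and a flow-regime label."""
    out = df.copy()

    # Lagged observed runoff; shift() moves by rows, so the index is
    # assumed to be regularly spaced
    for lag in lags:
        out[f"{obs_col}_lag_{lag}"] = out[obs_col].shift(lag)

    # Cyclical encoding of the day of year
    doy = out.index.dayofyear.to_numpy()
    out["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)

    # Flow regime from the same quartile thresholds used in section 11
    q25, q75 = out[obs_col].quantile([0.25, 0.75])
    out["flow_regime"] = np.select(
        [out[obs_col] <= q25, out[obs_col] >= q75],
        ["Low", "High"],
        default="Medium",
    )
    return out

# Toy hourly series (illustrative values only)
idx = pd.date_range("2022-01-01", periods=6, freq="h")
toy = pd.DataFrame({"usgs": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)
features = build_features(toy)
print(features[["usgs", "usgs_lag_1", "flow_regime"]])
```

When used for training, the quartile thresholds would need to be computed on the training split only, so that information from the validation period does not leak into the features.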