# Exploratory Data Analysis for Runoff Forecasting

This notebook explores the USGS observed runoff data and NWM forecast data to prepare for an LSTM-based runoff forecasting model.

## Data Structure

**USGS Data:**
* Each stream has one CSV file with hourly observations
* File naming pattern: `*_Strt_*.csv`
* Columns:
  * DateTime: Timestamp with timezone (e.g., "2021-04-20 07:00:00+00:00")
  * USGSFlowValue: The observed runoff value
  * USGS_GageID or 00060_cd: A quality code (mostly "A")

**NWM Data:**
* Each stream has 25 monthly CSV files with forecasts
* File naming pattern: `streamflow_[streamID]_[YYYYMM].csv`
* Columns:
  * NWM_version_number: Version of the NWM model (e.g., "v2.1")
  * model_initialization_time: When the forecast was issued
  * model_output_valid_time: When the forecast is valid for
  * streamflow_value: The predicted runoff value
  * streamID: The stream identifier
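
The NWM naming pattern above can be parsed to recover the stream ID and the month each file covers. The sketch below assumes the stream ID is purely numeric (the stream folders are selected with `isdigit()` later in this notebook); the helper name `parse_nwm_filename` and the example file names are hypothetical.

```python
import re
from datetime import datetime

# Pattern taken from the naming scheme above: streamflow_[streamID]_[YYYYMM].csv
# Assumption: the stream ID consists of digits only
NWM_PATTERN = re.compile(r"^streamflow_(?P<stream_id>\d+)_(?P<yyyymm>\d{6})\.csv$")

def parse_nwm_filename(filename):
    """Return (stream_id, first day of the month), or None if the name does not match."""
    match = NWM_PATTERN.match(filename)
    if match is None:
        return None
    try:
        month_start = datetime.strptime(match.group("yyyymm"), "%Y%m")
    except ValueError:
        # Six digits that are not a valid year-month (e.g. month 13)
        return None
    return match.group("stream_id"), month_start

# Hypothetical file names, for illustration only
print(parse_nwm_filename("streamflow_12345_202104.csv"))
print(parse_nwm_filename("12345_Strt_example.csv"))
```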

## Import Required Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob
from datetime import datetime, timedelta
import re
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Increase figure resolution for better clarity
plt.rcParams['figure.dpi'] = 120
plt.rcParams['figure.figsize'] = [12, 8]

## Helper Functions for Data Loading and Processing

In [None]:
def load_usgs_data(folder_path):
    """Load USGS observation data for a given stream."""
    # Find the USGS file using pattern matching (sorted so the choice is deterministic)
    usgs_files = sorted(glob.glob(os.path.join(folder_path, '*_Strt_*.csv')))

    if not usgs_files:
        print(f"No USGS data files found in {folder_path}")
        # Return a pair so callers can always unpack (df, station_id)
        return None, None

    # Load the first matching file
    usgs_file = usgs_files[0]
    print(f"Loading USGS data from: {usgs_file}")
    
    # Extract USGS station ID from filename
    station_id = os.path.basename(usgs_file).split('_')[0]
    
    # Read the CSV file
    df = pd.read_csv(usgs_file)
    
    # Convert datetime column
    df['DateTime'] = pd.to_datetime(df['DateTime'])
    
    return df, station_id

def load_nwm_data(folder_path, stream_id):
    """Load and combine all NWM forecast data for a given stream."""
    # Find all NWM files for this stream
    nwm_files = glob.glob(os.path.join(folder_path, f"streamflow_{stream_id}_*.csv"))
    
    if not nwm_files:
        print(f"No NWM data files found for stream {stream_id} in {folder_path}")
        return None
    
    print(f"Found {len(nwm_files)} NWM data files for stream {stream_id}")
    
    # Load and combine all files
    dfs = []
    for file in sorted(nwm_files):
        df = pd.read_csv(file)
        dfs.append(df)
    
    # Combine all monthly dataframes
    combined_df = pd.concat(dfs, ignore_index=True)
    
    # Convert datetime columns
    datetime_cols = ['model_initialization_time', 'model_output_valid_time']
    for col in datetime_cols:
        if col in combined_df.columns:
            combined_df[col] = pd.to_datetime(combined_df[col])
    
    return combined_df

def get_stream_folders():
    """Get all stream folders in the current directory."""
    # List all directories in the current directory
    # Sort so that "first" and "second" stream are the same on every run
    stream_folders = sorted(d for d in os.listdir() if os.path.isdir(d) and d.isdigit())
    return stream_folders

## Explore Available Data

In [None]:
# List all stream folders
stream_folders = get_stream_folders()
print(f"Found {len(stream_folders)} stream folders: {stream_folders}")

# Look at file structure in the first folder
if stream_folders:
    first_folder = stream_folders[0]
    files = os.listdir(first_folder)
    
    # Count different file types
    usgs_files = [f for f in files if "_Strt_" in f]
    nwm_files = [f for f in files if "streamflow_" in f]
    
    print(f"\nIn folder {first_folder}:")
    print(f"- USGS observation files: {len(usgs_files)}")
    print(f"- NWM forecast files: {len(nwm_files)}")
    
    # Show file naming patterns
    if usgs_files:
        print(f"\nExample USGS file: {usgs_files[0]}")
    if nwm_files:
        print(f"Example NWM files: {nwm_files[:3]} ...")
        
        # Extract date range from NWM files (one regex search per file name)
        matches = (re.search(r'_(\d{6})\.csv$', f) for f in nwm_files)
        dates = sorted(f"{m.group(1)[:4]}-{m.group(1)[4:]}" for m in matches if m)
        if dates:
            print(f"NWM data spans from {dates[0]} to {dates[-1]}")

## Load and Examine USGS Observation Data

In [None]:
# Select the first stream for initial analysis
stream_folder = stream_folders[0]
usgs_data, station_id = load_usgs_data(stream_folder)

# Display basic information
print(f"\nUSGS Data Overview for Station {station_id}:")
print(f"Shape: {usgs_data.shape}")
print("\nColumns:")
print(usgs_data.columns.tolist())

# Display the first few rows
print("\nFirst 5 rows:")
display(usgs_data.head())

# Get data statistics
print("\nUSGS Flow Value Statistics:")
display(usgs_data['USGSFlowValue'].describe())

# Check for missing values
print("\nMissing values:")
print(usgs_data.isnull().sum())

# Check date range
date_range = usgs_data['DateTime'].agg(['min', 'max'])
print(f"\nDate range: {date_range['min']} to {date_range['max']}")
print(f"Total duration: {date_range['max'] - date_range['min']}")

## Visualize USGS Observation Data

In [None]:
# Time series plot of USGS flow values
plt.figure(figsize=(14, 6))
plt.plot(usgs_data['DateTime'], usgs_data['USGSFlowValue'], color='blue', linewidth=1)
plt.title(f'USGS Flow Values Over Time for Station {station_id}')
plt.xlabel('Date')
plt.ylabel('Flow Value (cfs)')
plt.grid(True)
plt.tight_layout()
plt.show()

# Distribution of flow values
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(usgs_data['USGSFlowValue'], kde=True, bins=30)
plt.title('Distribution of Flow Values')
plt.xlabel('Flow Value (cfs)')

plt.subplot(1, 2, 2)
sns.histplot(np.log1p(usgs_data['USGSFlowValue']), kde=True, bins=30)
plt.title('Distribution of Log-Transformed Flow Values')
plt.xlabel('Log(Flow Value + 1)')

plt.tight_layout()
plt.show()

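# Examine quality code distribution if available
# ('00060_cd' does not contain the substring "code", so match the '_cd' suffix explicitly)
quality_code_col = next((col for col in usgs_data.columns 
                         if 'code' in col.lower() or col.endswith('_cd') or 'GageID' in col), None)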

if quality_code_col:
    code_counts = usgs_data[quality_code_col].value_counts()
    print(f"\nDistribution of quality codes ({quality_code_col}):")
    display(code_counts)
    
    plt.figure(figsize=(8, 5))
    sns.countplot(y=usgs_data[quality_code_col])
    plt.title('Distribution of Quality Codes')
    plt.tight_layout()
    plt.show()

## Load and Examine NWM Forecast Data

In [None]:
# Get the stream ID from the folder name
stream_id = stream_folder
nwm_data = load_nwm_data(stream_folder, stream_id)

# Display basic information
print(f"\nNWM Data Overview for Stream {stream_id}:")
print(f"Shape: {nwm_data.shape}")
print("\nColumns:")
print(nwm_data.columns.tolist())

# Display the first few rows
print("\nFirst 5 rows:")
display(nwm_data.head())

# Get data statistics
print("\nNWM Flow Value Statistics:")
display(nwm_data['streamflow_value'].describe())

# Check for missing values
print("\nMissing values:")
print(nwm_data.isnull().sum())

# Check unique NWM versions
if 'NWM_version_number' in nwm_data.columns:
    versions = nwm_data['NWM_version_number'].unique()
    print(f"\nNWM model versions: {versions}")

# Check date range
if 'model_output_valid_time' in nwm_data.columns:
    date_range = nwm_data['model_output_valid_time'].agg(['min', 'max'])
    print(f"\nForecast valid time range: {date_range['min']} to {date_range['max']}")
    print(f"Total duration: {date_range['max'] - date_range['min']}")

## Visualize NWM Forecast Data

In [None]:
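# Time series plot of NWM flow values
# Note: nwm_data holds forecasts at every available lead time, so a valid time may
# appear more than once and the line can zigzag where forecasts overlap
plt.figure(figsize=(14, 6))
plt.plot(nwm_data['model_output_valid_time'], nwm_data['streamflow_value'], 
         color='green', linewidth=1, alpha=0.7)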
plt.title(f'NWM Forecast Flow Values Over Time for Stream {stream_id}')
plt.xlabel('Valid Time')
plt.ylabel('Flow Value (cfs)')
plt.grid(True)
plt.tight_layout()
plt.show()

# Distribution of flow values
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(nwm_data['streamflow_value'], kde=True, bins=30)
plt.title('Distribution of NWM Flow Values')
plt.xlabel('Flow Value (cfs)')

plt.subplot(1, 2, 2)
sns.histplot(np.log1p(nwm_data['streamflow_value']), kde=True, bins=30)
plt.title('Distribution of Log-Transformed NWM Flow Values')
plt.xlabel('Log(Flow Value + 1)')

plt.tight_layout()
plt.show()

# Examine forecast lead times
if 'model_initialization_time' in nwm_data.columns and 'model_output_valid_time' in nwm_data.columns:
    # Calculate lead time in hours
    nwm_data['lead_time_hours'] = (nwm_data['model_output_valid_time'] - 
                                  nwm_data['model_initialization_time']).dt.total_seconds() / 3600
    
    # Plot distribution of lead times
    plt.figure(figsize=(10, 5))
    sns.histplot(nwm_data['lead_time_hours'], bins=30, kde=True)
    plt.title('Distribution of Forecast Lead Times')
    plt.xlabel('Lead Time (hours)')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.tight_layout()
    plt.show()
    
    # Show common lead times
    common_lead_times = nwm_data['lead_time_hours'].value_counts().head(10)
    print("\nMost common forecast lead times (hours):")
    display(common_lead_times)

## Compare USGS Observations with NWM Forecasts

In [None]:
# Prepare data for comparison
# We'll use the NWM forecasts with the shortest lead time for each valid time point

# First, make sure the USGS data has a datetime index for easy alignment
usgs_aligned = usgs_data.copy()
usgs_aligned.set_index('DateTime', inplace=True)

# For NWM data, select the shortest lead time for each valid time
if 'lead_time_hours' in nwm_data.columns:
    # Group by valid time and select the entry with shortest lead time
    nwm_shortest_leads = nwm_data.loc[nwm_data.groupby('model_output_valid_time')['lead_time_hours'].idxmin()]
    
    # Set the index to the valid time for alignment
    nwm_aligned = nwm_shortest_leads.set_index('model_output_valid_time')
    
    # Align the time series and create a combined dataframe
    combined = pd.DataFrame({
        'USGS_Flow': usgs_aligned['USGSFlowValue'],
        'NWM_Flow': nwm_aligned['streamflow_value']
    })
    
    # Plot both time series
    plt.figure(figsize=(14, 8))
    plt.plot(combined.index, combined['USGS_Flow'], label='USGS Observed', color='blue', linewidth=1.5)
    plt.plot(combined.index, combined['NWM_Flow'], label='NWM Forecast', color='green', linewidth=1, alpha=0.7)
    plt.title(f'Comparison of USGS Observed vs NWM Forecast Flow Values for Stream {stream_id}')
    plt.xlabel('Date')
    plt.ylabel('Flow Value (cfs)')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()
    
    # Create a scatter plot to visualize the relationship
    plt.figure(figsize=(10, 8))
    
    # Drop any rows with missing values
    combined_clean = combined.dropna()
    
    plt.scatter(combined_clean['USGS_Flow'], combined_clean['NWM_Flow'], 
                alpha=0.5, edgecolor='k', linewidth=0.5)
    
    # Add a perfect prediction line
    max_val = max(combined_clean['USGS_Flow'].max(), combined_clean['NWM_Flow'].max())
    plt.plot([0, max_val], [0, max_val], 'r--', linewidth=1)
    
    plt.title('USGS Observed vs NWM Forecast Flow Values')
    plt.xlabel('USGS Observed Flow (cfs)')
    plt.ylabel('NWM Forecast Flow (cfs)')
    plt.grid(True)
    plt.axis('equal')
    plt.tight_layout()
    plt.show()
    
    # Calculate correlation and error metrics
    correlation = combined_clean.corr().iloc[0, 1]
    mae = np.mean(np.abs(combined_clean['USGS_Flow'] - combined_clean['NWM_Flow']))
    rmse = np.sqrt(np.mean((combined_clean['USGS_Flow'] - combined_clean['NWM_Flow'])**2))
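    # MAPE is undefined where the observed flow is zero, so exclude those rows
    nonzero_obs = combined_clean['USGS_Flow'] != 0
    obs_nz = combined_clean.loc[nonzero_obs, 'USGS_Flow']
    nwm_nz = combined_clean.loc[nonzero_obs, 'NWM_Flow']
    mape = np.mean(np.abs((obs_nz - nwm_nz) / obs_nz)) * 100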
    
    print(f"Correlation between USGS and NWM: {correlation:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f} cfs")
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f} cfs")
    print(f"Mean Absolute Percentage Error (MAPE): {mape:.4f}%")
    
    # Also calculate log-space metrics to reduce influence of extreme values
    print("\nLog-space metrics:")
    combined_clean_log = np.log1p(combined_clean)
    log_correlation = combined_clean_log.corr().iloc[0, 1]
    log_mae = np.mean(np.abs(combined_clean_log['USGS_Flow'] - combined_clean_log['NWM_Flow']))
    log_rmse = np.sqrt(np.mean((combined_clean_log['USGS_Flow'] - combined_clean_log['NWM_Flow'])**2))
    
    print(f"Log-space Correlation: {log_correlation:.4f}")
    print(f"Log-space MAE: {log_mae:.4f}")
    print(f"Log-space RMSE: {log_rmse:.4f}")
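
The alignment above builds `combined` from two datetime indexes. The USGS timestamps carry a UTC offset, and if the NWM timestamps turn out to be timezone-naive, pandas cannot align the two directly. A minimal sketch of normalizing both to UTC first, assuming naive timestamps are already in UTC (that assumption should be checked against the NWM files); the helper `to_utc_index` and the sample values are illustrative only.

```python
import pandas as pd

def to_utc_index(series):
    """Return a copy of `series` whose DatetimeIndex is timezone-aware UTC."""
    out = series.copy()
    if out.index.tz is None:
        # Assumption: naive timestamps are already expressed in UTC
        out.index = out.index.tz_localize("UTC")
    else:
        out.index = out.index.tz_convert("UTC")
    return out

# Synthetic stand-ins for the two sources: one tz-aware, one tz-naive
obs = pd.Series([10.0, 12.0, 11.0],
                index=pd.date_range("2021-04-20 07:00", periods=3, freq="h", tz="UTC"))
fcst = pd.Series([9.5, 12.5, 11.5],
                 index=pd.date_range("2021-04-20 07:00", periods=3, freq="h"))

aligned = pd.DataFrame({"USGS_Flow": to_utc_index(obs), "NWM_Flow": to_utc_index(fcst)})
print(aligned)
```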

## Time Series Analysis and Feature Engineering for LSTM

In [None]:
# Conduct time series analysis on USGS data (observed values)
# We'll resample to daily data for seasonal decomposition
usgs_daily = usgs_aligned['USGSFlowValue'].resample('D').mean()

# Plot the resampled daily time series
plt.figure(figsize=(14, 6))
usgs_daily.plot()
plt.title('Daily Average USGS Flow Values')
plt.xlabel('Date')
plt.ylabel('Flow Value (cfs)')
plt.grid(True)
plt.tight_layout()
plt.show()

# Seasonal Decomposition
try:
    # Fill any missing values for decomposition
    # fillna(method=...) is deprecated in recent pandas; use ffill()/bfill() directly
    usgs_daily_filled = usgs_daily.ffill().bfill()
    
    # Perform decomposition if we have enough data
    if len(usgs_daily_filled) >= 2*365:  # At least 2 years for good decomposition
        decomposition = seasonal_decompose(usgs_daily_filled, model='additive', period=365)
        
        fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(14, 12))
        decomposition.observed.plot(ax=ax1)
        ax1.set_title('Observed')
        ax1.set_ylabel('Flow Value')
        
        decomposition.trend.plot(ax=ax2)
        ax2.set_title('Trend')
        ax2.set_ylabel('Flow Value')
        
        decomposition.seasonal.plot(ax=ax3)
        ax3.set_title('Seasonality')
        ax3.set_ylabel('Flow Value')
        
        decomposition.resid.plot(ax=ax4)
        ax4.set_title('Residuals')
        ax4.set_ylabel('Flow Value')
        
        plt.tight_layout()
        plt.show()
    else:
        print("Not enough data for seasonal decomposition (need at least 2 years)")
except Exception as e:
    print(f"Seasonal decomposition failed: {e}")

# Autocorrelation and Partial Autocorrelation Analysis
plt.figure(figsize=(14, 7))

plt.subplot(1, 2, 1)
plot_acf(usgs_daily.dropna(), lags=40, ax=plt.gca())
plt.title('Autocorrelation Function')

plt.subplot(1, 2, 2)
plot_pacf(usgs_daily.dropna(), lags=40, ax=plt.gca())
plt.title('Partial Autocorrelation Function')

plt.tight_layout()
plt.show()
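
The ACF and PACF plots above use daily means, so each lag is one day. When the model window is later chosen in hours, it may help to look at autocorrelation on the hourly series as well. The sketch below uses `Series.autocorr` on a synthetic hourly series with a 24-hour cycle; the helper `autocorr_by_lag` and the synthetic data are assumptions for illustration, not results from the USGS record.

```python
import numpy as np
import pandas as pd

def autocorr_by_lag(series, lags):
    """Pearson autocorrelation of `series` at each lag, measured in index steps."""
    return pd.Series({lag: series.autocorr(lag=lag) for lag in lags})

# Synthetic hourly series with a 24-hour cycle, standing in for hourly flow
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
flow = pd.Series(50 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size))

hourly_acf = autocorr_by_lag(flow, lags=[1, 6, 12, 24, 48, 72])
print(hourly_acf.round(2))
```

On real data, the lags at which the autocorrelation stays clearly above zero give a rough upper bound for a useful input window.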

## Feature Engineering Ideas for LSTM Model

In [None]:
# Create a sample of features that could be useful for an LSTM model

# We'll use the combined dataframe from earlier
if 'combined' in locals():
    # Create a copy to avoid modifying the original
    features_df = combined.copy()
    
    # 1. Add time-based features
    features_df['hour'] = features_df.index.hour
    features_df['day'] = features_df.index.day
    features_df['month'] = features_df.index.month
    features_df['year'] = features_df.index.year
    features_df['day_of_week'] = features_df.index.dayofweek
    features_df['day_of_year'] = features_df.index.dayofyear
    
    # 2. Add lag features (past values)
    # shift() counts rows, so these lags equal hours only on a regular hourly index
    for lag in [1, 3, 6, 12, 24, 48, 72]:  # Hours
        features_df[f'USGS_lag_{lag}h'] = features_df['USGS_Flow'].shift(lag)
        features_df[f'NWM_lag_{lag}h'] = features_df['NWM_Flow'].shift(lag)
    
    # 3. Add rolling statistics
    for window in [6, 12, 24, 48, 72]:  # Hours
        features_df[f'USGS_rolling_mean_{window}h'] = features_df['USGS_Flow'].rolling(window=window).mean()
        features_df[f'USGS_rolling_std_{window}h'] = features_df['USGS_Flow'].rolling(window=window).std()
        features_df[f'USGS_rolling_min_{window}h'] = features_df['USGS_Flow'].rolling(window=window).min()
        features_df[f'USGS_rolling_max_{window}h'] = features_df['USGS_Flow'].rolling(window=window).max()
    
    # 4. Difference features (rate of change)
    features_df['USGS_diff_1h'] = features_df['USGS_Flow'].diff(1)
    features_df['USGS_diff_24h'] = features_df['USGS_Flow'].diff(24)
    
    # 5. Cyclic encoding of time features (preserves cyclical nature)
    features_df['hour_sin'] = np.sin(2 * np.pi * features_df['hour'] / 24)
    features_df['hour_cos'] = np.cos(2 * np.pi * features_df['hour'] / 24)
    features_df['month_sin'] = np.sin(2 * np.pi * features_df['month'] / 12)
    features_df['month_cos'] = np.cos(2 * np.pi * features_df['month'] / 12)
    # Use 365.25 so the encoding stays continuous across leap years
    features_df['day_of_year_sin'] = np.sin(2 * np.pi * features_df['day_of_year'] / 365.25)
    features_df['day_of_year_cos'] = np.cos(2 * np.pi * features_df['day_of_year'] / 365.25)
    
    # Display the features dataframe
    print("Feature-engineered DataFrame for LSTM model:")
    display(features_df.head())
    
    # Show the total number of features
    print(f"\nTotal number of features: {features_df.shape[1]}")
    
    # Correlation of features with the target (future flow)
    # Let's say we want to predict 24 hours ahead
    target_col = 'USGS_Flow'
    features_df['target_24h_ahead'] = features_df[target_col].shift(-24)
    
    # Calculate correlation with the target, dropping the target's trivial self-correlation of 1.0
    corr_with_target = (features_df.corr()['target_24h_ahead']
                        .drop('target_24h_ahead')
                        .sort_values(ascending=False))
    
    print("\nTop 15 features most correlated with the target (24h ahead flow):")
    display(corr_with_target.head(15))

## LSTM Model Preparation Guidelines

Based on the exploratory data analysis, here are the recommended steps for building an LSTM model for runoff forecasting:

### 1. Data Preprocessing

- **Handling Missing Values**: Fill or interpolate missing values in both USGS and NWM data.
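- **Feature Scaling**: Normalize features to the range [0, 1] or standardize them to mean 0 and standard deviation 1. Fit the scaling parameters on the training split only to avoid leaking information from validation and test data.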
- **Log Transformation**: Consider log-transforming flow values to reduce skewness (use `np.log1p()`).
- **Resampling**: Consider resampling data to a uniform frequency (hourly or daily).
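
The preprocessing steps listed here could be combined roughly as follows. This is a sketch under stated assumptions, not a fixed recipe: the function name `preprocess_flow`, the 6-step gap limit, and the 70% training fraction are illustrative choices, and standardization is used where min-max scaling would work equally well.

```python
import numpy as np
import pandas as pd

def preprocess_flow(flow, freq="h", train_fraction=0.7, max_gap=6):
    """Resample, interpolate short gaps, log-transform, and standardize a flow series."""
    # Put the series on a uniform time grid
    regular = flow.resample(freq).mean()
    # Time-weighted interpolation; fills at most `max_gap` consecutive missing steps
    filled = regular.interpolate(method="time", limit=max_gap)
    logged = np.log1p(filled)

    # Scaling statistics come from the earliest part of the record only
    n_train = int(len(logged) * train_fraction)
    mean = logged.iloc[:n_train].mean()
    std = logged.iloc[:n_train].std()
    scaled = (logged - mean) / std
    return scaled, {"mean": mean, "std": std}

# Small synthetic example with two missing hours
idx = pd.date_range("2021-04-20", periods=10, freq="h")
flow = pd.Series([5.0, 6.0, np.nan, 8.0, 9.0, 9.0, 8.0, np.nan, 6.0, 5.0], index=idx)
scaled, stats = preprocess_flow(flow)
print(scaled.isna().sum(), stats)
```

Keeping `stats` makes it possible to invert the transform on model output with `np.expm1(prediction * stats["std"] + stats["mean"])`.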

### 2. Feature Engineering

- **Time-based Features**: Hour, day, month, day of week, day of year.
- **Cyclic Encoding**: Sine and cosine transformations of cyclical time features.
- **Lag Features**: Past values of flow at different time lags (1h, 6h, 12h, 24h, etc.).
- **Rolling Window Statistics**: Mean, min, max, std of past flow values.
- **Difference Features**: Rate of change between time steps.
- **NWM Forecast Features**: Include NWM forecasts as features.

### 3. Sequence Preparation for LSTM

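- **Sequence Length**: Choose a window that covers the lags with significant autocorrelation (e.g., 72 hours). The ACF/PACF plots in this notebook are computed on daily means, so their lags are in days; repeat the analysis on hourly data before fixing an hourly window.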
- **Sequence Creation**: Create input sequences X (features over time) and target sequences y (future flow values).
- **Sequence Format**: Shape should be [samples, time steps, features].
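
A sliding-window helper is one way to produce arrays in this shape. The sketch below is a minimal version; the function name `make_sequences` and the `horizon` parameter (how many steps past the window the target lies) are assumptions introduced for illustration.

```python
import numpy as np

def make_sequences(features, target, sequence_length, horizon):
    """Build X with shape [samples, time steps, features] and y `horizon` steps past each window."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    X, y = [], []
    last_start = len(features) - sequence_length - horizon
    for start in range(last_start + 1):
        end = start + sequence_length          # window covers rows start .. end - 1
        X.append(features[start:end])
        y.append(target[end + horizon - 1])    # `horizon` steps after the window's last row
    return np.array(X), np.array(y)

# Synthetic example: 100 hourly rows, 3 features, 72-hour window, 24-hour-ahead target
features = np.random.default_rng(0).normal(size=(100, 3))
target = np.arange(100, dtype=float)
X, y = make_sequences(features, target, sequence_length=72, horizon=24)
print(X.shape, y.shape)
```

Rows containing NaN (from lag or rolling features, for example) should be dropped before windowing, since every window that overlaps them would otherwise be contaminated.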

### 4. Train-Validation-Test Split

- **Chronological Split**: Use earlier data for training, middle for validation, and latest for testing.
- **Proportions**: Typically 70% train, 15% validation, 15% test.
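
A chronological split can be sketched as below, using the 70/15/15 proportions mentioned above. The helper name `chronological_split` is hypothetical; the important property is that rows are never shuffled, so validation and test data always lie after the training period.

```python
import numpy as np
import pandas as pd

def chronological_split(df, train_frac=0.70, val_frac=0.15):
    """Split a time-indexed frame into train/validation/test blocks without shuffling."""
    df = df.sort_index()
    n = len(df)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Synthetic hourly frame for illustration
idx = pd.date_range("2021-04-20", periods=100, freq="h")
frame = pd.DataFrame({"flow": np.arange(100, dtype=float)}, index=idx)
train, val, test = chronological_split(frame)
print(len(train), len(val), len(test))
```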

### 5. LSTM Model Architecture

- **Input Layer**: LSTM layer with appropriate input shape.
- **Hidden Layers**: Additional LSTM layers with dropout for regularization.
- **Output Layer**: Dense layer with linear activation for regression.
- **Sample Structure**:
  ```python
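  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import LSTM, Dropout, Dense

  # Placeholder values; set these from the prepared sequences
  sequence_length, num_features = 72, 10

  model = Sequential([
      LSTM(50, return_sequences=True, input_shape=(sequence_length, num_features)),
      Dropout(0.2),
      LSTM(50),
      Dropout(0.2),
      Dense(1)
  ])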
  ```

### 6. Model Training

- **Loss Function**: Mean Squared Error (MSE) or Mean Absolute Error (MAE).
- **Optimizer**: Adam with appropriate learning rate.
- **Early Stopping**: To prevent overfitting.
- **Learning Rate Scheduling**: Reduce learning rate when plateauing.
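
The logic behind early stopping and plateau-based learning rate reduction can be sketched without any framework. The class below is an illustration of the idea only; it is not how Keras implements its `EarlyStopping` and `ReduceLROnPlateau` callbacks, and the class name, patience values, and validation losses are made up.

```python
class PlateauMonitor:
    """Track validation loss and signal LR reduction or early stopping on a plateau."""

    def __init__(self, stop_patience=10, lr_patience=5, lr_factor=0.5, min_delta=0.0):
        self.stop_patience = stop_patience
        self.lr_patience = lr_patience
        self.lr_factor = lr_factor
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_since_best = 0

    def update(self, val_loss, learning_rate):
        """Return (new_learning_rate, should_stop) for the epoch that just finished."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
            # Reduce the learning rate every `lr_patience` epochs without improvement
            if self.epochs_since_best % self.lr_patience == 0:
                learning_rate *= self.lr_factor
        return learning_rate, self.epochs_since_best >= self.stop_patience

monitor = PlateauMonitor(stop_patience=4, lr_patience=2)
lr = 1e-3
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74]  # made-up values
for epoch, loss in enumerate(val_losses, start=1):
    lr, stop = monitor.update(loss, lr)
    if stop:
        print(f"Stopping after epoch {epoch}; best validation loss {monitor.best_loss}")
        break
```

In practice the framework's own callbacks would normally be used; this sketch only makes the patience mechanics explicit.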

### 7. Model Evaluation

- **Metrics**: RMSE, MAE, MAPE, and correlation coefficient.
- **Visualization**: Compare predicted vs actual values.
- **Benchmark**: Compare with simpler models and NWM forecasts.
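
The metrics listed here can be gathered in one helper so that the LSTM and the NWM benchmark are scored identically. A minimal sketch follows; the function name `regression_metrics` and the sample arrays are illustrative. MAPE skips zero observations, where the percentage error is undefined.

```python
import numpy as np

def regression_metrics(observed, predicted):
    """RMSE, MAE, MAPE (percent, zero observations excluded), and Pearson correlation."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = predicted - observed
    nonzero = observed != 0  # MAPE is undefined where the observation is zero
    return {
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        "mae": float(np.mean(np.abs(error))),
        "mape_percent": float(np.mean(np.abs(error[nonzero] / observed[nonzero])) * 100),
        "correlation": float(np.corrcoef(observed, predicted)[0, 1]),
    }

# Made-up values standing in for observations, model output, and the NWM benchmark
observed = [10.0, 20.0, 0.0, 40.0]
model_pred = [12.0, 18.0, 1.0, 40.0]
nwm_pred = [15.0, 25.0, 5.0, 30.0]

model_scores = regression_metrics(observed, model_pred)
nwm_scores = regression_metrics(observed, nwm_pred)
print(model_scores)
print(nwm_scores)
```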

### 8. Multi-Step Forecasting Strategy

- **Direct Method**: Train separate models for each forecast horizon.
- **Recursive Method**: Use one-step predictions as inputs for next steps.
- **Multiple Output**: Train a single model to predict multiple time steps ahead.
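
The recursive and multiple-output strategies differ mainly in how predictions and targets are arranged, which the sketch below illustrates. The function names and the toy one-step model are assumptions for illustration. The direct method would instead build one single-step target per horizon and train a separate model on each.

```python
import numpy as np

def recursive_forecast(predict_one_step, history, steps):
    """Roll a one-step model forward, feeding each prediction back as the newest input."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = predict_one_step(np.array(window))
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # drop the oldest value, append the prediction
    return np.array(forecasts)

def multi_output_targets(series, sequence_length, horizon):
    """Targets for a multiple-output model: `horizon` future values per input window."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - sequence_length - horizon + 1
    return np.array([series[i + sequence_length: i + sequence_length + horizon]
                     for i in range(n_samples)])

# Stand-in one-step "model": the mean of the last three values in the window
def toy_model(window):
    return float(window[-3:].mean())

forecasts = recursive_forecast(toy_model, history=[1.0, 2.0, 3.0], steps=2)
targets = multi_output_targets(np.arange(10), sequence_length=4, horizon=3)
print(forecasts)
print(targets.shape)
```

Recursive forecasting accumulates error because each step consumes earlier predictions, while the multiple-output layout trains against all horizons at once.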

## Analyze Second Stream for Comparison

In [None]:
# Check if we have multiple streams to compare
if len(stream_folders) > 1:
    # Select the second stream
    second_stream_folder = stream_folders[1]
    second_usgs_data, second_station_id = load_usgs_data(second_stream_folder)
    
    if second_usgs_data is not None:
        # Plot the two streams' USGS data for comparison
        plt.figure(figsize=(14, 8))
        
        # Align both datasets to have the same datetime index format
        first_stream_ts = usgs_data.set_index('DateTime')['USGSFlowValue']
        second_stream_ts = second_usgs_data.set_index('DateTime')['USGSFlowValue']
        
        # Plot both time series
        plt.plot(first_stream_ts.index, first_stream_ts, 
                 label=f'Stream {station_id}', color='blue', linewidth=1.5)
        plt.plot(second_stream_ts.index, second_stream_ts, 
                 label=f'Stream {second_station_id}', color='red', linewidth=1.5)
        
        plt.title('Comparison of Runoff Between Two Streams')
        plt.xlabel('Date')
        plt.ylabel('Flow Value (cfs)')
        plt.legend()
        plt.grid(True)
        plt.tight_layout()
        plt.show()
        
        # Normalize the data and plot again to compare patterns
        first_stream_norm = (first_stream_ts - first_stream_ts.min()) / (first_stream_ts.max() - first_stream_ts.min())
        second_stream_norm = (second_stream_ts - second_stream_ts.min()) / (second_stream_ts.max() - second_stream_ts.min())
        
        plt.figure(figsize=(14, 8))
        plt.plot(first_stream_norm.index, first_stream_norm, 
                 label=f'Stream {station_id} (normalized)', color='blue', linewidth=1.5)
        plt.plot(second_stream_norm.index, second_stream_norm, 
                 label=f'Stream {second_station_id} (normalized)', color='red', linewidth=1.5)
        
        plt.title('Comparison of Normalized Runoff Patterns Between Two Streams')
        plt.xlabel('Date')
        plt.ylabel('Normalized Flow Value')
        plt.legend()
        plt.grid(True)
        plt.tight_layout()
        plt.show()
        
        # Calculate correlation between the two streams
        # Align the two time series first
        aligned_df = pd.DataFrame({
            f'stream_{station_id}': first_stream_ts,
            f'stream_{second_station_id}': second_stream_ts
        })
        
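        # Use a separate name so the USGS-vs-NWM correlation computed earlier is not overwritten
        stream_correlation = aligned_df.corr().iloc[0, 1]
        print(f"Correlation between the two streams: {stream_correlation:.4f}")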
        
        # This correlation could be useful for multi-stream modeling

## Conclusion and Next Steps

### Key Findings

1. **Data Coverage**: We have analyzed runoff data from multiple streams covering approximately two years (April 2021 to April 2023).
2. **Data Types**:
   - USGS observations (ground truth)
   - NWM forecasts at various lead times
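3. **Temporal Patterns**: The runoff data shows clear seasonal patterns and possibly some recurring events, examined here through seasonal decomposition and ACF/PACF plots of daily mean flow.
4. **NWM Performance**: Agreement between NWM forecasts and USGS observations was quantified with correlation, MAE, RMSE, and MAPE, plus log-space versions of the first three.
5. **Feature Importance**: Candidate features were ranked by their correlation with the flow 24 hours ahead.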

### Next Steps for LSTM Model Development

1. **Data Preprocessing**:
   - Complete the feature engineering process outlined above
   - Create a unified dataset with all relevant features
   - Handle missing values and normalize/standardize features

2. **Sequence Preparation**:
   - Create time sequences for LSTM input based on ACF/PACF analysis
   - Structure the data in the format [samples, time steps, features]

3. **Model Development**:
   - Implement the LSTM architecture
   - Experiment with different numbers of layers and units
   - Add regularization to prevent overfitting

4. **Evaluation Framework**:
   - Set up proper metrics for runoff forecasting
   - Compare with baseline models and NWM forecasts
   - Evaluate at different forecast horizons

5. **Multi-Stream Approach**:
   - Consider using data from multiple streams in a single model
   - Explore transfer learning between similar streams

6. **Operational Deployment**:
   - Plan for model updating as new data becomes available
   - Consider ensemble approaches combining LSTM with NWM forecasts

This exploratory analysis provides a strong foundation for building an effective LSTM-based runoff forecasting model.