# Model Development: Seq2Seq LSTM for Runoff Forecast Error Correction

This notebook develops and tunes a Sequence-to-Sequence (Seq2Seq) LSTM model that corrects errors in National Water Model (NWM) runoff forecasts. We compare model architectures and hyperparameters to select the best-performing configuration.

## Objectives
1. Import and prepare the preprocessed data
2. Implement the Seq2Seq LSTM architecture
3. Perform hyperparameter tuning with temporal cross-validation
4. Evaluate candidate models and select the best one
5. Save the final trained model

In [None]:
# Import necessary libraries (standard library first, then third-party)
import os
import sys

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import TimeSeriesSplit
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Add parent directory to path for importing project modules
sys.path.append('..')
from src.preprocess import DataPreprocessor
from src.model import Seq2SeqLSTMModel
from src.tuner import Seq2SeqTuner

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams.update({'font.size': 12})

## 1. Data Loading and Preparation

First, let's load the preprocessed data for model development.

In [None]:
# Data paths
raw_data_path = "../data/raw"
processed_data_path = "../data/processed"
models_path = "../models"

# Ensure directories exist
os.makedirs(processed_data_path, exist_ok=True)
os.makedirs(models_path, exist_ok=True)

# Initialize data preprocessor
preprocessor = DataPreprocessor(
    raw_data_path=raw_data_path,
    processed_data_path=processed_data_path,
    sequence_length=24  # Default: 24 hours of past data
)

# Process data for both streams
stream_ids = ["20380357", "21609641"]
data = preprocessor.process_data(stream_ids=stream_ids)

In [None]:
# Extract training/validation data for the first stream
stream_id = stream_ids[0]  # Using first stream for development
X_encoder_train = data[stream_id]['train_val']['X_encoder']
X_decoder_train = data[stream_id]['train_val']['X_decoder']
y_train = data[stream_id]['train_val']['y']

print(f"Encoder input shape: {X_encoder_train.shape}")
print(f"Decoder input shape: {X_decoder_train.shape}")
print(f"Target output shape: {y_train.shape}")
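
The shapes printed above follow the `(samples, timesteps, features)` layout that the model cells below rely on. As a rough illustration of how such windows can be cut from an hourly series, here is a minimal sketch assuming 24 past hours feed the encoder and the following 18 hours are the targets. `make_windows` and the toy series are hypothetical; the real logic lives in `DataPreprocessor` and may differ.

```python
def make_windows(series, encoder_len=24, decoder_len=18):
    """Cut a 1-D series into (past window, future window) pairs.

    Hypothetical helper for illustration only; not the DataPreprocessor logic.
    """
    encoder_windows, targets = [], []
    last_start = len(series) - encoder_len - decoder_len
    for start in range(last_start + 1):
        split = start + encoder_len
        encoder_windows.append(series[start:split])        # past values
        targets.append(series[split:split + decoder_len])  # future values
    return encoder_windows, targets


# Toy hourly series of 48 made-up values (0, 1, 2, ...)
toy_series = list(range(48))
enc, tgt = make_windows(toy_series)

print(f"windows: {len(enc)}, encoder length: {len(enc[0])}, target length: {len(tgt[0])}")
```

Each window ends exactly where its target begins, so no target value appears in the encoder input of the same sample.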

## 2. Create Base Model

Let's create a base Seq2Seq LSTM model to understand its architecture before tuning.

In [None]:
# Create base model
base_model = Seq2SeqLSTMModel(
    encoder_timesteps=X_encoder_train.shape[1],  # Number of timesteps in encoder sequence
    encoder_features=X_encoder_train.shape[2],   # Number of features per timestep
    decoder_timesteps=18,                        # 18 hours (lead times 1-18)
    lstm_units=64,                               # Base number of LSTM units
    dropout_rate=0.2,                            # Initial dropout rate
    learning_rate=0.001,                         # Initial learning rate
    num_layers=2                                 # Initial number of LSTM layers
)

# Build and display model architecture
model = base_model.build_model()
model.summary()
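
When reading the `model.summary()` output, it can help to check the parameter counts by hand. A standard LSTM layer with bias terms has four gates, each with an input kernel, a recurrent kernel, and a bias, giving `4 * units * (input_dim + units + 1)` trainable parameters. The sketch below applies that formula; the feature count of 5 is a made-up value, and how `Seq2SeqLSTMModel` actually stacks its layers is an assumption.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer with bias terms."""
    return 4 * units * (input_dim + units + 1)


n_features = 5   # hypothetical number of encoder features
lstm_units = 64  # matches the base model above

# First layer reads the raw features; a stacked second layer reads
# the 64-dimensional output of the first (assumed layout).
first_layer = lstm_param_count(n_features, lstm_units)
second_layer = lstm_param_count(lstm_units, lstm_units)

print(f"first LSTM layer:  {first_layer} parameters")
print(f"second LSTM layer: {second_layer} parameters")
```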

## 3. Define TimeSeriesSplit for Temporal Cross-Validation

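For time series data, the validation scheme must respect the temporal order of observations. Random shuffling would let the model train on samples that come after the ones it is validated on, leaking future information. `TimeSeriesSplit` avoids this: each fold trains on an expanding window of earlier samples and validates on the block that immediately follows it.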

In [None]:
# Define TimeSeriesSplit for temporal cross-validation
n_splits = 3
tscv = TimeSeriesSplit(n_splits=n_splits)

# Visualize the splits
plt.figure(figsize=(15, 5))
for i, (train_idx, val_idx) in enumerate(tscv.split(X_encoder_train)):
    plt.scatter(val_idx, [i] * len(val_idx), c='red', s=10, label='Validation' if i == 0 else '')
    plt.scatter(train_idx, [i] * len(train_idx), c='blue', s=10, label='Training' if i == 0 else '')
plt.title('TimeSeriesSplit Cross-validation')
plt.ylabel('Split')
plt.xlabel('Sample index')
plt.legend()
plt.show()
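
The expanding-window pattern in the plot above can also be checked numerically. The following is a minimal pure-Python sketch that mirrors the index arithmetic of `TimeSeriesSplit` with its default settings (no `gap`, no `max_train_size`, default `test_size`); `expanding_window_splits` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, val_indices) pairs with an expanding training window."""
    fold_size = n_samples // (n_splits + 1)
    first_val_start = n_samples - n_splits * fold_size
    for val_start in range(first_val_start, n_samples, fold_size):
        train_idx = list(range(val_start))
        val_idx = list(range(val_start, val_start + fold_size))
        yield train_idx, val_idx


splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for fold, (train_idx, val_idx) in enumerate(splits):
    # Every training index precedes every validation index
    assert max(train_idx) < min(val_idx)
    print(f"fold {fold}: train 0..{train_idx[-1]}, validate {val_idx[0]}..{val_idx[-1]}")
```

Note that the first fold has the shortest training window, so its validation loss is usually the least representative of the final model.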

## 4. Basic Training without Hyperparameter Tuning

Before tuning, let's train the base model on the first fold as a quick check that the data pipeline and training loop work end to end.

In [None]:
# Take only the first split (the one with the shortest training window) for this basic test
train_idx, val_idx = next(iter(tscv.split(X_encoder_train)))
X_enc_fold_train, X_enc_fold_val = X_encoder_train[train_idx], X_encoder_train[val_idx]
X_dec_fold_train, X_dec_fold_val = X_decoder_train[train_idx], X_decoder_train[val_idx]
y_fold_train, y_fold_val = y_train[train_idx], y_train[val_idx]

# Setup callbacks
callbacks = [
    EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)
]

# Train model
base_history = base_model.train(
    X_enc_fold_train, X_dec_fold_train, y_fold_train,
    validation_data=([X_enc_fold_val, X_dec_fold_val], y_fold_val),
    batch_size=32,
    epochs=30,
    callbacks=callbacks,
    verbose=1
)
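
The two callbacks above both react to stalls in `val_loss`, but on different time scales: the learning rate is halved after 5 epochs without improvement, and training stops after 10. The sketch below replays a made-up loss curve through simplified versions of both rules to show how they interact. It is not the Keras implementation (it ignores `min_delta`, `cooldown`, and weight restoration), and `simulate_callbacks` is a hypothetical helper.

```python
def simulate_callbacks(val_losses, stop_patience=10, lr_patience=5,
                       factor=0.5, lr=1e-3, min_lr=1e-6):
    """Replay val_losses through simplified early-stopping and LR-reduction rules.

    Returns (stop_epoch, best_epoch, final_lr); stop_epoch is None if
    training would run to the end.
    """
    best, best_epoch, stop_wait = float("inf"), None, 0
    lr_best, lr_wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        # Early stopping: count epochs since the best loss so far
        if loss < best:
            best, best_epoch, stop_wait = loss, epoch, 0
        else:
            stop_wait += 1
        # LR reduction: shrink the learning rate after a shorter stall
        if loss < lr_best:
            lr_best, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * factor, min_lr)
                lr_wait = 0
        if stop_wait >= stop_patience:
            return epoch, best_epoch, lr
    return None, best_epoch, lr


# Made-up curve: improves for three epochs, then plateaus
toy_val_losses = [1.0, 0.8, 0.6] + [0.7] * 12
stop_epoch, best_epoch, final_lr = simulate_callbacks(toy_val_losses)
print(f"stopped at epoch {stop_epoch}, best epoch {best_epoch}, final lr {final_lr}")
```

Because the learning-rate patience is shorter than the stopping patience, the optimizer gets a chance to recover at a lower learning rate before training is abandoned.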

In [None]:
# Plot training history
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(base_history.history['loss'], label='Training Loss')
plt.plot(base_history.history['val_loss'], label='Validation Loss')
plt.title('Loss over epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(base_history.history['mae'], label='Training MAE')
plt.plot(base_history.history['val_mae'], label='Validation MAE')
plt.title('Mean Absolute Error over epochs')
plt.xlabel('Epoch')
plt.ylabel('MAE')
plt.legend()

plt.tight_layout()
plt.show()

## 5. Hyperparameter Tuning with KerasTuner

Now let's perform hyperparameter tuning using KerasTuner and TimeSeriesSplit validation.

In [None]:
# Initialize tuner
tuner = Seq2SeqTuner(
    encoder_timesteps=X_encoder_train.shape[1],
    encoder_features=X_encoder_train.shape[2],
    decoder_timesteps=18,
    project_name='nwm_seq2seq_tuning',
    directory=models_path
)

In [None]:
# Setup tuner
tuner.setup_tuner(
    tuner_type='hyperband',  # Using Hyperband algorithm for efficiency
    max_trials=20,           # Kept small for the notebook; use a higher value for a full search
    executions_per_trial=1   # Number of times to train each trial
)
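
Hyperband gains its efficiency by training many configurations for only a few epochs and giving a larger budget to the most promising ones. Its core building block is successive halving, sketched below in isolation. This is not KerasTuner's implementation and says nothing about how `Seq2SeqTuner` configures it; `successive_halving` and `toy_loss` are hypothetical, with `toy_loss` standing in for an actual training run.

```python
def successive_halving(configs, evaluate, min_epochs=1, factor=3):
    """Keep the best 1/factor of the configs each round, multiplying the epoch budget."""
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, epochs))
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs *= factor
    return survivors[0]


def toy_loss(config, epochs):
    """Made-up loss: falls with more epochs and is lowest near learning_rate=0.01."""
    return abs(config["learning_rate"] - 0.01) + 1.0 / epochs


candidates = [
    {"learning_rate": lr}
    for lr in (0.0001, 0.001, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0)
]
best = successive_halving(candidates, toy_loss)
print(f"selected learning rate: {best['learning_rate']}")
```

With nine candidates and `factor=3`, only three configurations are trained past the first round and one past the second, which is where the savings come from.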

In [None]:
# Perform hyperparameter search with TimeSeriesSplit
tuner_results, best_hps = tuner.search_with_time_series_cv(
    X_encoder_train, 
    X_decoder_train, 
    y_train, 
    n_splits=3,        # Number of time series folds
    batch_size=32,     # Batch size
    epochs=30,         # Max epochs per trial
    verbose=1          # Verbose output
)
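
How `search_with_time_series_cv` combines the folds is not shown in this notebook. A common approach, assumed here purely for illustration, is to average the validation loss over all folds and prefer the configuration with the lowest mean. The fold losses and configuration names below are made up.

```python
from statistics import mean


def cross_validated_score(fold_losses):
    """Average per-fold validation losses into one score (lower is better)."""
    return mean(fold_losses)


# Made-up validation losses for two hypothetical configurations over 3 folds
fold_losses_by_config = {
    "lstm_units=64": [0.30, 0.26, 0.22],
    "lstm_units=128": [0.20, 0.31, 0.33],
}

scores = {
    name: cross_validated_score(losses)
    for name, losses in fold_losses_by_config.items()
}
best_config = min(scores, key=scores.get)
print(f"best configuration by mean fold loss: {best_config}")
```

In this toy case the second configuration wins the first fold but loses on average, which is the kind of fold-specific result that averaging over several temporal folds is meant to smooth out.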

## 6. Examine Hyperparameter Tuning Results

In [None]:
# Get top performing trials
top_trials = tuner_results.oracle.get_best_trials(5)
print("Top 5 trials:")
for i, trial in enumerate(top_trials):
    print(f"\nTrial {i+1} - Score: {trial.score:.5f}")
    print(f"  LSTM Units: {trial.hyperparameters.values['lstm_units']}")
    print(f"  Dropout Rate: {trial.hyperparameters.values['dropout_rate']}")
    print(f"  Learning Rate: {trial.hyperparameters.values['learning_rate']}")
    print(f"  Number of Layers: {trial.hyperparameters.values['num_layers']}")

## 7. Build and Train Final Model with Best Hyperparameters

Let's build and train the final model with the best hyperparameters on the full training+validation dataset.

In [None]:
# Get best hyperparameters
best_hps = tuner_results.get_best_hyperparameters(1)[0]

print("Best hyperparameters:")
print(f"LSTM Units: {best_hps.get('lstm_units')}")
print(f"Dropout Rate: {best_hps.get('dropout_rate')}")
print(f"Learning Rate: {best_hps.get('learning_rate')}")
print(f"Number of Layers: {best_hps.get('num_layers')}")

In [None]:
# Create final model with best hyperparameters
final_model = Seq2SeqLSTMModel(
    encoder_timesteps=X_encoder_train.shape[1],
    encoder_features=X_encoder_train.shape[2],
    decoder_timesteps=18,
    lstm_units=best_hps.get('lstm_units'),
    dropout_rate=best_hps.get('dropout_rate'),
    learning_rate=best_hps.get('learning_rate'),
    num_layers=best_hps.get('num_layers')
)

final_model.build_model()
final_model.model.summary()

In [None]:
# Define callbacks for final training.
# No validation split is held out at this stage, so the callbacks monitor the training loss.
final_callbacks = [
    EarlyStopping(monitor='loss', patience=15, restore_best_weights=True),
    ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5, min_lr=1e-6),
    ModelCheckpoint(
        filepath=os.path.join(models_path, "nwm_lstm_model_checkpoint.keras"),
        monitor='loss',
        save_best_only=True,
        save_weights_only=False,
        verbose=1
    )
]

In [None]:
# Train final model on full training set
final_history = final_model.train(
    X_encoder_train, X_decoder_train, y_train,
    batch_size=32,
    epochs=100,  # Upper bound; early stopping may end training sooner
    callbacks=final_callbacks,
    verbose=1
)

In [None]:
# Plot final training history
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(final_history.history['loss'], label='Training Loss')
plt.title('Loss over epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(final_history.history['mae'], label='Training MAE')
plt.title('Mean Absolute Error over epochs')
plt.xlabel('Epoch')
plt.ylabel('MAE')
plt.legend()

plt.tight_layout()
plt.show()

## 8. Save Final Model

In [None]:
# Save the final model
final_model.save(os.path.join(models_path, "nwm_lstm_model.keras"))
print(f"Final model saved to {os.path.join(models_path, 'nwm_lstm_model.keras')}")

## 9. Additional Analysis: Impact of Hyperparameters

Let's analyze how different hyperparameters affected model performance.

In [None]:
# Create plots to visualize hyperparameter impact

# Extract trials data
trials_df = pd.DataFrame([
    {
        'lstm_units': trial.hyperparameters.values['lstm_units'],
        'dropout_rate': trial.hyperparameters.values['dropout_rate'],
        'learning_rate': trial.hyperparameters.values['learning_rate'],
        'num_layers': trial.hyperparameters.values['num_layers'],
        'score': trial.score
    } for trial in tuner_results.oracle.trials.values() if trial.score is not None
])

# Plot impact of each hyperparameter
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# LSTM Units
sns.boxplot(x='lstm_units', y='score', data=trials_df, ax=axes[0, 0])
axes[0, 0].set_title('Impact of LSTM Units on Performance')
axes[0, 0].set_ylabel('Validation Loss')

# Dropout Rate
sns.boxplot(x='dropout_rate', y='score', data=trials_df, ax=axes[0, 1])
axes[0, 1].set_title('Impact of Dropout Rate on Performance')
axes[0, 1].set_ylabel('Validation Loss')

# Learning Rate
sns.boxplot(x='learning_rate', y='score', data=trials_df, ax=axes[1, 0])
axes[1, 0].set_title('Impact of Learning Rate on Performance')
axes[1, 0].set_ylabel('Validation Loss')
axes[1, 0].tick_params(axis='x', labelrotation=45)

# Number of Layers
sns.boxplot(x='num_layers', y='score', data=trials_df, ax=axes[1, 1])
axes[1, 1].set_title('Impact of Number of Layers on Performance')
axes[1, 1].set_ylabel('Validation Loss')

plt.tight_layout()
plt.show()
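
Box plots are easier to interpret alongside a numeric summary. The sketch below groups trial scores by one hyperparameter and reports the median per value. The trial records are made up, and `median_score_by` is a hypothetical helper; with the real `trials_df`, the equivalent would be `trials_df.groupby('num_layers')['score'].median()`.

```python
from collections import defaultdict
from statistics import median


def median_score_by(trials, name):
    """Return {hyperparameter value: median score} for one hyperparameter."""
    grouped = defaultdict(list)
    for trial in trials:
        grouped[trial[name]].append(trial["score"])
    return {value: median(scores) for value, scores in sorted(grouped.items())}


# Made-up trial records for illustration only
toy_trials = [
    {"num_layers": 1, "score": 0.30},
    {"num_layers": 1, "score": 0.34},
    {"num_layers": 2, "score": 0.25},
    {"num_layers": 2, "score": 0.27},
    {"num_layers": 2, "score": 0.31},
]

summary = median_score_by(toy_trials, "num_layers")
for value, score in summary.items():
    print(f"num_layers={value}: median validation loss {score:.3f}")
```

With only 20 trials, some hyperparameter values will have very few samples, so differences between medians should be read as hints rather than firm conclusions.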

## 10. Summary and Next Steps

In this notebook, we have:
1. Loaded and prepared the preprocessed data for model development
2. Created a base Seq2Seq LSTM model and tested it on a validation fold
3. Performed hyperparameter tuning using KerasTuner with TimeSeriesSplit validation
4. Analyzed the impact of different hyperparameters on model performance
5. Built and trained a final model with the best hyperparameters
6. Saved the trained model for later use in forecast correction

Next steps will include:
- Evaluating the model on the held-out test set (Oct 2022 - Apr 2023)
- Comparing model performance against the baseline persistence model
- Visualizing forecast corrections and calculating evaluation metrics
- Creating the required plots for the technical report