# **Chapter 26: Recurrent Neural Networks**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand why standard feedforward networks (MLPs) are insufficient for sequential data
- Explain the architecture of a basic recurrent neural network (RNN) and how it processes sequences
- Identify the vanishing/exploding gradient problem and how it limits simple RNNs
- Implement Long Short‑Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks using Keras
- Prepare time‑series data in the correct format (samples, timesteps, features) for RNNs
- Build and train RNN models for predicting NEPSE stock returns
- Explain how bidirectional RNNs use both past and future context, and why that makes them unsuitable for forecasting
- Stack multiple RNN layers to increase model capacity
- Apply proper validation techniques (walk‑forward) to avoid look‑ahead bias
- Compare RNN performance with MLPs and traditional models on the NEPSE dataset

---

## **26.1 Sequential Data Processing**

Time‑series data, by its nature, is sequential: the order of observations matters. In previous chapters, we used **lag features** to feed past information into models like MLPs, random forests, or linear regression. For example, we created columns like `Return_Lag1`, `Return_Lag2`, etc., and treated them as independent features. While this works, it has limitations:

- The model does not explicitly learn the **temporal dynamics**; it treats lags as separate dimensions without capturing the evolving pattern.
- For longer sequences, the number of lag features grows, leading to high dimensionality and potential overfitting.
- The model cannot easily learn patterns that depend on the **order** of the sequence, such as trends or repeating motifs.

Recurrent neural networks (RNNs) are designed to handle sequential data by maintaining a **hidden state** that is updated as the network processes each time step. This hidden state acts as a memory, allowing information to persist across the sequence.

---

## **26.2 Basic RNN Architecture**

A simple RNN processes a sequence one element at a time. At each time step `t`, it takes the current input `x_t` and the previous hidden state `h_{t-1}`, and computes a new hidden state `h_t`:

`h_t = tanh(W_xh * x_t + W_hh * h_{t-1} + b_h)`

The output at each step can be `h_t` itself (if we want a sequence of outputs) or just the final hidden state (if we only need a single prediction after the whole sequence).

For time‑series forecasting, we often use the **many‑to‑one** architecture: we feed a sequence of past observations (e.g., last 20 days of returns) and predict the next value.
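To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass for a single sequence. The layer sizes and random weights are illustrative assumptions, not values from the NEPSE models; a trained network would learn `W_xh`, `W_hh`, and `b_h`.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over one sequence and return all hidden states.

    x_seq: (timesteps, n_features)
    W_xh:  (n_hidden, n_features)
    W_hh:  (n_hidden, n_hidden)
    b_h:   (n_hidden,)
    """
    h = np.zeros(W_hh.shape[0])          # h_0: initial hidden state
    states = []
    for x_t in x_seq:                    # one time step at a time, in order
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.array(states)              # (timesteps, n_hidden)

rng = np.random.default_rng(0)
timesteps, n_features, n_hidden = 20, 1, 4   # illustrative sizes
x_seq = rng.normal(size=(timesteps, n_features))
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

states = rnn_forward(x_seq, W_xh, W_hh, b_h)
h_last = states[-1]                      # many-to-one: keep only the final state
print(states.shape, h_last.shape)
```

In a many‑to‑one forecaster, `h_last` would be passed to a small output layer that produces the prediction; a sequence‑output model would use every row of `states`.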

### **26.2.1 Vanishing/Exploding Gradients**

During training, gradients are propagated back through time (Backpropagation Through Time – BPTT). For long sequences, repeated multiplication of the same weight matrix can cause gradients to vanish (become very small) or explode (become very large). Vanishing gradients prevent the network from learning long‑range dependencies; exploding gradients cause unstable training.

This is why simple RNNs are rarely used in practice. Instead, we use gated architectures like LSTM and GRU.
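The effect of repeated multiplication can be illustrated with a small NumPy experiment. This is a simplified sketch: it ignores the `tanh` derivative and simply pushes a gradient vector back through the same recurrent matrix many times, with the matrix rescaled to an assumed spectral radius of 0.8 or 1.2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 8
W = rng.normal(size=(n_hidden, n_hidden))
# Rescale so the largest absolute eigenvalue (spectral radius) is 1
W = W / np.max(np.abs(np.linalg.eigvals(W)))

def backprop_norms(W_hh, steps):
    """Norm of a gradient vector after each of `steps` backward time steps
    through a linear recurrence (tanh derivative ignored for simplicity)."""
    grad = np.ones(W_hh.shape[0])
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad
        norms.append(np.linalg.norm(grad))
    return norms

small = backprop_norms(0.8 * W, 50)   # spectral radius 0.8: gradient shrinks
large = backprop_norms(1.2 * W, 50)   # spectral radius 1.2: gradient grows
print(f"radius 0.8: step 1 = {small[0]:.3f}, step 50 = {small[-1]:.2e}")
print(f"radius 1.2: step 1 = {large[0]:.3f}, step 50 = {large[-1]:.2e}")
```

In a real RNN the `tanh` derivative (which is at most 1) is also multiplied in at every step, which pushes gradients further toward vanishing.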

---

## **26.3 Long Short‑Term Memory (LSTM)**

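LSTMs introduce a **cell state** and three gates (input, forget, output) that regulate the flow of information. Because the cell state is updated additively rather than by repeated matrix multiplication, gradients can flow across many time steps. This lets the network remember information over long periods and largely mitigates (though does not fully eliminate) the vanishing gradient problem.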

- **Forget gate:** decides what to discard from the cell state.
- **Input gate:** decides what new information to store.
- **Output gate:** decides what to output based on the cell state.

The equations are more complex, but conceptually, LSTMs can learn which past information is relevant and for how long to keep it.
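For readers who want to see the gates in action, the following NumPy sketch implements one LSTM time step. The stacking order of the four weight blocks (forget, input, candidate, output) and the sizes are arbitrary choices made for this illustration; libraries such as Keras use their own internal layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * n_hidden, n_features)  input weights for the four blocks
    U: (4 * n_hidden, n_hidden)    recurrent weights
    b: (4 * n_hidden,)             biases
    Blocks are stacked as: forget, input, candidate, output (illustrative order).
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:n])            # forget gate: how much of c_prev to keep
    i = sigmoid(z[n:2 * n])        # input gate: how much new content to write
    g = np.tanh(z[2 * n:3 * n])    # candidate values to write
    o = sigmoid(z[3 * n:4 * n])    # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g       # additive cell-state update
    h_t = o * np.tanh(c_t)         # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, n_hidden, timesteps = 1, 4, 20   # illustrative sizes
W = rng.normal(scale=0.3, size=(4 * n_hidden, n_features))
U = rng.normal(scale=0.3, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(timesteps, n_features)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print("final hidden state:", h)
```

The line `c_t = f * c_prev + i * g` is the important one: when the forget gate is close to 1, the cell state passes through almost unchanged, which is what lets information (and gradients) survive over many steps.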

### **26.3.1 Gated Recurrent Units (GRU)**

GRUs are a simplified variant of LSTMs with two gates (reset and update) and no separate cell state. They have fewer parameters and often perform similarly to LSTMs on many tasks, which makes them a reasonable first choice to try for time‑series forecasting.
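In Keras, swapping an LSTM for a GRU is a one‑line change. The sketch below builds both variants with the same number of units so their parameter counts can be compared; the window length of 20 and the single feature are assumptions that match the setup used later in this chapter, and `build_recurrent_model` is a helper name introduced here for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 20, 1   # assumed window length and feature count

def build_recurrent_model(cell="gru", units=50):
    """Many-to-one model with a single recurrent layer and a linear output."""
    rnn_layer = layers.GRU(units) if cell == "gru" else layers.LSTM(units)
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        rnn_layer,
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

gru_model = build_recurrent_model("gru")
lstm_model = build_recurrent_model("lstm")
print("GRU parameters: ", gru_model.count_params())
print("LSTM parameters:", lstm_model.count_params())
```

Either model can then be trained with the same `fit` call; which one generalizes better on a given stock is something to check on validation data.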

---

## **26.4 Preparing Data for RNNs**

RNNs expect input in the shape: `(samples, timesteps, features)`. For the NEPSE dataset, we need to create sequences of past returns (and possibly other features) to predict the next value.

Suppose we use a window of 20 past days to predict the next day's return. For each day `t`, the input is the sequence of returns (and possibly other features such as volume or RSI) for the 20 days ending at day `t`, that is, days `t-19` through `t`. The target is the return at day `t+1`.

### **26.4.1 Creating Sequences**

We'll write a function to generate these sequences from our DataFrame.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load and prepare NEPSE data (single symbol)
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()

# Compute returns
df_stock['Return'] = df_stock['Close'].pct_change() * 100

# Optionally, add other features (e.g., volume change, RSI, etc.)
# For simplicity, we'll start with only returns
feature_columns = ['Return']  # we can add more later

# Drop NaN (first row of returns)
df_stock = df_stock.dropna(subset=feature_columns)

# Function to create sequences
def create_sequences(data, seq_length, target_col_idx=0):
    """
    data: numpy array of shape (n_samples, n_features)
    seq_length: number of past steps to use
    target_col_idx: which column is the target (0 for first column)
    Returns:
        X: (n_samples - seq_length, seq_length, n_features)
        y: (n_samples - seq_length,)
    """
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length, :])          # all features
        y.append(data[i+seq_length, target_col_idx]) # target (next value)
    return np.array(X), np.array(y)

# Choose sequence length (e.g., 20 days)
seq_length = 20

# Get data as numpy array (only the feature columns)
data = df_stock[feature_columns].values

# Create sequences
X, y = create_sequences(data, seq_length)

print(f"X shape: {X.shape}")  # (samples, timesteps, features)
print(f"y shape: {y.shape}")
```

**Explanation:**

- `create_sequences` slides a window of length `seq_length` over the data. For each window, it takes all features in that window as input `X`, and the next value (target column) as output `y`.
- This produces a 3D array `X` of shape `(n_samples - seq_length, seq_length, n_features)`.
- We must ensure we do not use future information; the sequences are constructed only from past data relative to each prediction point.
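A quick way to check the alignment is to run the function on a tiny array whose values are easy to read. The toy series below (values 0 to 5) is purely illustrative; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def create_sequences(data, seq_length, target_col_idx=0):
    """Same windowing logic as above, repeated so this check is self-contained."""
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i + seq_length, :])
        y.append(data[i + seq_length, target_col_idx])
    return np.array(X), np.array(y)

toy = np.arange(6, dtype=float).reshape(-1, 1)   # values 0..5, one feature
X_toy, y_toy = create_sequences(toy, seq_length=3)

print(X_toy.shape, y_toy.shape)
for window, target in zip(X_toy, y_toy):
    print(window.ravel(), "->", target)
```

Each window of three values is paired with the value that immediately follows it (0, 1, 2 with 3; 1, 2, 3 with 4; 2, 3, 4 with 5), so no target ever appears inside its own input window.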

### **26.4.2 Train/Test Split (Temporal)**

We need to split the sequences temporally. Since the sequences are overlapping, we cannot randomly shuffle; we must take the first part as training, later part as test.

```python
# Number of samples
n_samples = X.shape[0]
split_idx = int(n_samples * 0.8)

X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print(f"Train samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}")
```

### **26.4.3 Scaling**

RNNs also benefit from scaled inputs. Two points need care. First, the scaler must be fitted on the training set **only** and then applied unchanged to the test set. Second, each feature gets a single mean and standard deviation shared by every time step; we do **not** compute separate statistics for each position in the window.

We can reshape the data to 2D (samples * timesteps, features), fit the scaler, then reshape back.

```python
# Reshape to 2D: (samples * timesteps, features)
original_shape = X_train.shape
X_train_2d = X_train.reshape(-1, X_train.shape[-1])

# Fit scaler on training 2D data
scaler = StandardScaler()
X_train_scaled_2d = scaler.fit_transform(X_train_2d)

# Reshape back to 3D
X_train_scaled = X_train_scaled_2d.reshape(original_shape)

# Transform test data using same scaler
X_test_2d = X_test.reshape(-1, X_test.shape[-1])
X_test_scaled_2d = scaler.transform(X_test_2d)
X_test_scaled = X_test_scaled_2d.reshape(X_test.shape)

print(f"Scaled X_train shape: {X_train_scaled.shape}")
```

**Explanation:**

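- We reshape to 2D so that the scaler treats every time step of every training window as one observation of each feature. This assumes that each feature's distribution is roughly stationary over the training period, which is a common working assumption.
- Reshaping is equivalent to taking each feature's mean and standard deviation over the sample and time axes of the 3D array. Because the windows overlap, values near the start and end of the training period appear in fewer windows, so these statistics differ slightly from those of the raw training series; for long series the difference is negligible.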
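The same per‑feature scaling can also be written directly with NumPy broadcasting, which makes it explicit which axes the statistics are computed over. The arrays below are random stand‑ins with the `(samples, timesteps, features)` layout, not NEPSE data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the windowed arrays: (samples, timesteps, features)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 20, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(25, 20, 3))

# Per-feature statistics over every sample and time step of the training set
mean = X_train_demo.mean(axis=(0, 1))   # shape (features,)
std = X_train_demo.std(axis=(0, 1))     # population std (ddof=0), as StandardScaler uses

# Broadcasting applies the same per-feature shift and scale at every time step
X_train_demo_scaled = (X_train_demo - mean) / std
X_test_demo_scaled = (X_test_demo - mean) / std   # training statistics only

print(X_train_demo_scaled.mean(axis=(0, 1)))   # approximately 0 for each feature
print(X_train_demo_scaled.std(axis=(0, 1)))    # approximately 1 for each feature
```

The test set will generally not have exactly zero mean and unit variance after this transform, and that is expected: it is scaled with the training statistics.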

---

## **26.5 Building an LSTM Model in Keras**

We'll build a simple LSTM model with one hidden layer.

```python
model = keras.Sequential([
    layers.LSTM(50, activation='tanh', return_sequences=False, input_shape=(X_train_scaled.shape[1], X_train_scaled.shape[2])),
    layers.Dense(1)  # output layer for regression
])

model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()
```

**Explanation:**

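- `LSTM(50, ...)` creates an LSTM layer with 50 units. `activation='tanh'` (the default) is applied to the candidate cell values and to the cell state when computing the layer's output; the gates use a separate `recurrent_activation`, which defaults to `sigmoid`. The final prediction comes from the linear `Dense` layer that follows.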
- `return_sequences=False` means this layer returns only the last output (after processing the whole sequence). This is typical for many‑to‑one forecasting.
- `input_shape` is `(timesteps, n_features)`.
- The output `Dense(1)` produces the predicted return.

### **26.5.1 Training with Validation**

We'll use a validation split that respects time order. Since our data is already in temporal order, we can use `validation_split` (which takes the last part) or manually create a validation set from the end of the training data.

```python
# Use last 10% of training as validation
val_size = int(len(X_train_scaled) * 0.1)
X_val = X_train_scaled[-val_size:]
y_val = y_train[-val_size:]
X_train_final = X_train_scaled[:-val_size]
y_train_final = y_train[:-val_size]

# Train
history = model.fit(
    X_train_final, y_train_final,
    validation_data=(X_val, y_val),
    epochs=50,
    batch_size=32,
    verbose=0
)

# Plot loss
plt.figure(figsize=(10,4))
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.show()
```

**Explanation:**

- We manually split the training data into a final training set and a validation set, preserving temporal order. This ensures that validation data is always later than training data.
- We train for a fixed 50 epochs here. In practice, early stopping on the validation loss is a common way to decide when to stop.
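Early stopping can be added with Keras's `EarlyStopping` callback. The sketch below uses small random placeholder arrays and a deliberately tiny model so that it runs quickly on its own; the `patience` value of 5 is an assumption to tune, not a recommendation from the NEPSE experiments.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data with the chapter's (samples, timesteps, features) layout
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(200, 20, 1)), rng.normal(size=200)
X_va, y_va = rng.normal(size=(40, 20, 1)), rng.normal(size=40)

model_es = keras.Sequential([
    keras.Input(shape=(20, 1)),
    layers.LSTM(16),
    layers.Dense(1),
])
model_es.compile(optimizer="adam", loss="mse")

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=5,                  # stop after 5 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch's weights
)

history_es = model_es.fit(
    X_tr, y_tr,
    validation_data=(X_va, y_va),
    epochs=30,
    batch_size=32,
    callbacks=[early_stop],
    verbose=0,
)
print(f"Stopped after {len(history_es.history['loss'])} epochs")
```

With the chapter's data, the same callback would be passed to `model.fit` alongside `X_train_final`, `y_train_final`, and the temporal validation set.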

### **26.5.2 Evaluation on Test Set**

```python
y_pred = model.predict(X_test_scaled).flatten()
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"LSTM Test RMSE: {rmse:.4f}")

# Compare with a simple baseline (e.g., predicting 0)
baseline_rmse = np.sqrt(mean_squared_error(y_test, np.zeros_like(y_test)))
print(f"Baseline (predict 0) RMSE: {baseline_rmse:.4f}")
```

---

## **26.6 Stacked RNNs**

We can stack multiple LSTM layers to learn higher‑level temporal abstractions. When stacking, intermediate layers must return sequences (`return_sequences=True`) so that the next LSTM layer receives a sequence.

```python
model_stacked = keras.Sequential([
    layers.LSTM(50, return_sequences=True, input_shape=(X_train_scaled.shape[1], X_train_scaled.shape[2])),
    layers.LSTM(50, return_sequences=False),
    layers.Dense(1)
])
model_stacked.compile(optimizer='adam', loss='mse', metrics=['mae'])
model_stacked.summary()
```

**Explanation:**

- The first LSTM returns a sequence (same length as input) to feed into the second LSTM.
- The second LSTM returns only the last output.
- Stacking increases capacity but also risk of overfitting; use with sufficient data and regularization.

---

## **26.7 Bidirectional RNNs**

A bidirectional RNN runs two independent RNNs – one forward through the sequence, one backward – and concatenates their outputs. This allows the network to use both past and future context. However, for time‑series forecasting, using future context at prediction time is impossible (would be look‑ahead). Therefore, **bidirectional RNNs are not suitable for forecasting** where we only have past data. They are useful for tasks like sequence labeling (where the whole sequence is available) or for imputation, but not for predicting the next step. We mention it for completeness and to avoid misapplication.

---

## **26.8 Training Tips for RNNs**

### **26.8.1 Sequence Length**

Choosing the sequence length is crucial. Too short, and the model may miss important longer‑term dependencies. Too long, and it may overfit or be computationally expensive. For daily stock returns, common lengths are 20 (one trading month) to 60 (one quarter). Experiment with different values and validate.

### **26.8.2 Stateful RNNs**

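By default, Keras RNNs are **stateless**: every sequence in a batch starts from a fresh (zero) hidden state, and nothing is carried over from one batch to the next. With **stateful** RNNs (`stateful=True`), the final state for each sample in a batch is reused as the initial state for the sample at the same position in the next batch. This can help with very long sequences, but it requires careful data handling and is rarely needed for typical time‑series forecasting where we use fixed‑length windows.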

### **26.8.3 Regularization**

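- **Dropout:** Apply dropout to inputs or recurrent connections. In Keras, LSTM has `dropout` (applied to the inputs) and `recurrent_dropout` (applied to the recurrent state). Use small values (0.2‑0.3). Note that a non‑zero `recurrent_dropout` prevents TensorFlow from using its fast cuDNN kernel on GPU, so training can be noticeably slower.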
- **Weight regularization:** Add L1/L2 penalties to kernels.
- **Early stopping** as always.

```python
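# Take the window length and feature count from the prepared training data
timesteps, n_features = X_train_scaled.shape[1], X_train_scaled.shape[2]

# Separate name so the model trained in Section 26.5 is not overwritten
model_reg = keras.Sequential([
    layers.LSTM(50, dropout=0.2, recurrent_dropout=0.2, return_sequences=False, input_shape=(timesteps, n_features)),
    layers.Dense(1)
])
model_reg.compile(optimizer='adam', loss='mse', metrics=['mae'])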
```

### **26.8.4 Learning Rate and Optimizer**

Adam with default settings usually works well. You can also try RMSprop.

### **26.8.5 Scaling Targets**

If the target (returns) has a very wide range, scaling it (e.g., to zero mean unit variance) can help. However, for returns, it's usually fine to leave as is.
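If you do scale the target, the statistics should come from the training targets only, and predictions must be transformed back before computing RMSE in return units. A minimal sketch, using random placeholder returns and a made‑up prediction vector in place of real model output:

```python
import numpy as np

rng = np.random.default_rng(0)
y_train_demo = rng.normal(loc=0.05, scale=1.5, size=500)   # placeholder returns (%)

# Statistics come from the training targets only
y_mean, y_std = y_train_demo.mean(), y_train_demo.std()

def scale_target(y):
    """Standardize targets with the training mean and std."""
    return (y - y_mean) / y_std

def unscale_target(y_scaled):
    """Map scaled predictions back to percentage returns."""
    return y_scaled * y_std + y_mean

y_train_scaled = scale_target(y_train_demo)    # what the model would be trained on
pred_scaled = np.array([0.0, 1.0, -0.5])       # stand-in for model.predict output
pred_returns = unscale_target(pred_scaled)     # back to return units for evaluation
print(pred_returns)
```

A scaled prediction of 0 maps back to the training mean return, which is a useful sanity check when debugging.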

---

## **26.9 Example: Multi‑step Ahead Forecasting**

Sometimes we want to predict multiple steps ahead (e.g., next 5 days). There are several strategies:

- **Direct multi‑step:** Train a model to predict t+1, another for t+2, etc. (or a model with multiple outputs).
- **Recursive:** Use the one‑step model iteratively, feeding its predictions back as input for the next step.
- **Sequence‑to‑sequence:** Encode the input sequence and decode a sequence of future values (encoder‑decoder).

We'll demonstrate a simple direct multi‑step with multiple outputs.

```python
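horizon = 5  # number of future days to predict

def create_sequences_multi(data, seq_length, horizon, target_col_idx=0):
    """
    Like create_sequences, but each target holds the next `horizon` values.
    Returns:
        X: (n_windows, seq_length, n_features)
        y: (n_windows, horizon)
    """
    X, y = [], []
    for i in range(len(data) - seq_length - horizon + 1):
        X.append(data[i:i+seq_length, :])
        y.append(data[i+seq_length:i+seq_length+horizon, target_col_idx])
    return np.array(X), np.array(y)

X_multi, y_multi = create_sequences_multi(data, seq_length, horizon)

# Temporal split; skip horizon-1 windows so train and test targets do not overlap
split_multi = int(len(X_multi) * 0.8)
gap = horizon - 1
X_multi_train, y_multi_train = X_multi[:split_multi], y_multi[:split_multi]
X_multi_test, y_multi_test = X_multi[split_multi + gap:], y_multi[split_multi + gap:]

# Scale inputs with statistics from the training windows only
n_features = X_multi.shape[2]
scaler_multi = StandardScaler().fit(X_multi_train.reshape(-1, n_features))
X_multi_train = scaler_multi.transform(X_multi_train.reshape(-1, n_features)).reshape(X_multi_train.shape)
X_multi_test = scaler_multi.transform(X_multi_test.reshape(-1, n_features)).reshape(X_multi_test.shape)

# Build model with one output unit per forecast step
model_multi = keras.Sequential([
    layers.LSTM(50, input_shape=(seq_length, n_features)),
    layers.Dense(horizon)  # 5 outputs
])
model_multi.compile(optimizer='adam', loss='mse')
model_multi.fit(X_multi_train, y_multi_train, epochs=50, batch_size=32, verbose=0)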
```

**Explanation:**

- The model outputs 5 values simultaneously, predicting the next 5 days. This is a direct multi‑step approach.
- The loss is MSE computed across all 5 steps.
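For contrast, the recursive strategy listed earlier can be sketched in a few lines. This version is univariate and uses a stand‑in forecaster (the window mean) so that it runs without a trained model; `recursive_forecast` and `mean_forecaster` are names introduced here for illustration.

```python
import numpy as np

def recursive_forecast(predict_one_step, last_window, horizon):
    """Iterate a one-step forecaster `horizon` times.

    predict_one_step: callable mapping a (seq_length,) window to one float
    last_window:      the most recent seq_length observed values
    """
    window = np.asarray(last_window, dtype=float).copy()
    forecasts = []
    for _ in range(horizon):
        next_value = float(predict_one_step(window))
        forecasts.append(next_value)
        # Drop the oldest value and append the prediction as if it were observed
        window = np.append(window[1:], next_value)
    return np.array(forecasts)

def mean_forecaster(window):
    """Stand-in for a trained one-step model: predicts the window mean."""
    return window.mean()

recent_returns = np.array([0.4, -0.2, 0.1, 0.3])   # placeholder values
recursive_preds = recursive_forecast(mean_forecaster, recent_returns, horizon=5)
print(recursive_preds)
```

With a trained Keras model, `predict_one_step` would wrap the scaling, the reshape to `(1, seq_length, 1)`, and the `model.predict` call. Because each step consumes earlier predictions, errors can compound as the horizon grows, which is one reason the direct approach is often preferred for short horizons.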

---

## **26.10 Comparison with MLP and Other Models**

We can compare the LSTM with a simple MLP that uses lagged features as input (like in Chapter 25). On the NEPSE dataset, the LSTM may perform slightly better if there are meaningful sequential patterns, but returns are noisy, so the difference is often small and should be checked empirically rather than assumed. The key advantage of RNNs is that they share the same weights across time steps, so they can process sequences of arbitrary length without explicitly creating many lag features.

```python
# Build an MLP on the same data (but using flattened sequences)
X_train_flat = X_train_scaled.reshape(X_train_scaled.shape[0], -1)
X_test_flat = X_test_scaled.reshape(X_test_scaled.shape[0], -1)

mlp = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train_flat.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)
])
mlp.compile(optimizer='adam', loss='mse')
mlp.fit(X_train_flat, y_train, validation_split=0.1, epochs=50, batch_size=32, verbose=0)

y_pred_mlp = mlp.predict(X_test_flat).flatten()
rmse_mlp = np.sqrt(mean_squared_error(y_test, y_pred_mlp))
print(f"MLP Test RMSE: {rmse_mlp:.4f}")
print(f"LSTM Test RMSE: {rmse:.4f}")
```

**Explanation:**

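- The MLP sees the same window as a flat vector. It can assign a different weight to each lag position, but it has no built‑in notion of temporal order; the LSTM processes the steps in sequence and shares weights across them.
- Any edge for the LSTM is most likely to appear with longer sequences. With noisy returns the two models may well perform similarly, so let the test RMSE decide.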

---

## **26.11 Practical Considerations for NEPSE**

- **Data size:** For a single stock, we may have only a few thousand trading days. LSTMs typically require more data; consider pooling multiple stocks or using smaller models.
- **Feature engineering:** While LSTMs can learn from raw sequences, providing additional features (volume, technical indicators) often helps. Include them in the feature dimension.
- **Regularization is crucial:** Use dropout, recurrent dropout, and early stopping.
- **Walk‑forward validation:** Use time‑series CV to tune hyperparameters (sequence length, units, dropout). Implement a manual loop over expanding windows.
- **Benchmarking:** Always compare with simpler models (ARIMA, linear regression, MLP) to justify the added complexity.
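The expanding‑window loop mentioned above can be organized around a small index generator. The sketch below is one possible layout, with the fold count and minimum training fraction as assumptions to adjust; `expanding_window_splits` is a helper name introduced here.

```python
import numpy as np

def expanding_window_splits(n_samples, n_folds, min_train_frac=0.5):
    """Yield (train_idx, val_idx) pairs for walk-forward validation.

    The first fold trains on the first `min_train_frac` of the samples; each
    later fold extends the training set by the previous validation block.
    """
    min_train = int(n_samples * min_train_frac)
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        val_end = train_end + fold_size
        yield np.arange(0, train_end), np.arange(train_end, val_end)

splits = list(expanding_window_splits(n_samples=1000, n_folds=4))
for train_idx, val_idx in splits:
    print(f"train 0..{train_idx[-1]}  validate {val_idx[0]}..{val_idx[-1]}")
```

Inside each fold, the scaler should be refitted on that fold's training windows and a fresh model trained, so that nothing from the validation block leaks into preprocessing. The per‑fold validation errors can then be averaged to compare hyperparameter settings. scikit‑learn's `TimeSeriesSplit` provides a similar splitter if you prefer not to write your own.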

---

## **26.12 Chapter Summary**

In this chapter, we explored Recurrent Neural Networks for time‑series forecasting, using the NEPSE dataset as an example.

- **RNN basics:** They maintain a hidden state to process sequences, but simple RNNs suffer from vanishing gradients.
- **LSTM and GRU** are gated architectures that can learn long‑term dependencies.
- **Data preparation:** We must create sequences of shape `(samples, timesteps, features)`.
- **Building models in Keras:** We implemented an LSTM for one‑step ahead return prediction, and discussed multi‑step strategies.
- **Stacked and bidirectional RNNs:** Stacking increases capacity; bidirectional is not suitable for forecasting.
- **Training tips:** Scale data, use dropout, early stopping, and proper temporal validation.
- **Comparison:** LSTMs may outperform MLPs if sequential patterns exist, but require careful tuning.

### **Practical Takeaways for the NEPSE System:**

- Start with a simple LSTM with one hidden layer, using a sequence length of 20‑60 days.
- Include additional features (volume, RSI, volatility) as extra channels.
- Use walk‑forward validation to tune hyperparameters and avoid overfitting.
- Regularize heavily; financial returns are noisy, and LSTMs can easily overfit.
- Compare with MLP and ARIMA to determine if the sequential modeling adds value.

In the next chapter, **Chapter 27: Convolutional Neural Networks for Time‑Series**, we will see how CNNs can also be applied to sequential data, often with competitive results and faster training.

---

**End of Chapter 26**

