# Chapter 27: Convolutional Neural Networks for Time‑Series

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand how convolutional neural networks (CNNs) can be applied to sequential data
- Explain the mechanics of 1D convolutions and how they differ from 2D convolutions used in images
- Design and implement 1D CNN models for time‑series forecasting using the NEPSE dataset
- Utilize pooling layers to reduce dimensionality and extract salient features
- Grasp the concept of dilated convolutions and their role in capturing long‑range dependencies
- Build Temporal Convolutional Networks (TCNs) with residual connections for improved performance
- Combine CNN and RNN layers in hybrid architectures to leverage both local patterns and sequential memory
- Apply best practices for training CNNs on time‑series data, including regularization and validation
- Compare CNN‑based models with MLPs, RNNs, and traditional models on the NEPSE prediction task

---

## **27.1 Introduction to CNNs for Sequential Data**

Convolutional Neural Networks (CNNs) have revolutionized computer vision, but they are also highly effective for time‑series analysis. When applied to sequences, 1D convolutions slide a filter (kernel) across the time dimension to detect local patterns—such as short‑term trends, spikes, or repeated motifs. These learned patterns are then combined hierarchically to recognize more complex structures.

For the NEPSE prediction system, CNNs can automatically extract relevant features from raw sequences of returns, volume, or technical indicators. Unlike RNNs, which process sequences step‑by‑step, CNNs are **parallelizable** and often faster to train. They have also been shown to achieve state‑of‑the‑art results on various time‑series benchmarks when properly designed (e.g., WaveNet, Temporal Convolutional Networks).

### **27.1.1 Why Use CNNs for Time‑Series?**

- **Local pattern detection:** Convolutions capture short‑term dependencies (e.g., the shape of a price spike) regardless of where they occur in the sequence.
- **Hierarchical features:** Stacking convolutional layers allows the network to learn increasingly abstract representations (from raw movements to patterns like "head and shoulders").
- **Efficiency:** A convolution over all time steps can be computed in parallel, which typically makes convolutional layers faster to train than recurrent layers, especially on GPUs.
- **Flexible receptive field:** Through dilation, CNNs can cover long time spans without a huge number of parameters.

---

## **27.2 1D Convolutions**

A 1D convolution operates over a single dimension (time). For an input sequence of length `L` with `C` channels (features), a convolution kernel of size `k` slides across time, computing a dot product between the kernel weights and the corresponding window of data. Each kernel spans all `C` input channels. At stride 1, the result is a new sequence (feature map) of length `L - k + 1` with no padding (`'valid'`) or `L` with `'same'` padding. Multiple kernels produce multiple output channels.

### **27.2.1 Understanding the Operation**

Imagine we have a sequence of daily returns `r₁, r₂, …, r_T`. A kernel of size 3 might learn to detect a pattern like "down, up, up". At each position, it computes:

`output[t] = w₁ * r_t + w₂ * r_{t+1} + w₃ * r_{t+2} + b`

where `w` are the learned weights and `b` is a bias. After training, certain kernels become activated by specific local shapes.
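The formula above can be sketched directly in NumPy. This is a minimal illustration, not how Keras computes convolutions internally: the helper name `conv1d_valid`, the return values, and the hand‑picked kernel are all made up for the example (a trained network would learn its own weights).

```python
import numpy as np

def conv1d_valid(x, w, b=0.0):
    """Slide kernel w over x with no padding: output[t] = sum_j w[j] * x[t + j] + b."""
    k = len(w)
    return np.array([np.dot(w, x[t:t + k]) + b for t in range(len(x) - k + 1)])

# Made-up daily returns (in %), for illustration only
returns = np.array([0.5, -1.0, 0.8, 1.2, -0.3, -0.9, 0.4, 0.7])

# Hand-picked kernel that responds to "down, up, up":
# a negative weight on the first day, positive weights on the next two
kernel = np.array([-1.0, 1.0, 1.0])

out = conv1d_valid(returns, kernel)
print(out.shape)          # (6,), i.e. L - k + 1 = 8 - 3 + 1
print(int(out.argmax()))  # 1: the window (-1.0, 0.8, 1.2) matches the pattern best
```

The largest activation appears where the window most resembles the kernel's shape, which is what "a kernel becomes activated by a local pattern" means in practice.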

### **27.2.2 Implementing a 1D CNN in Keras**

We'll build a simple CNN for the NEPSE return prediction task. We'll use the same sequence data as in Chapter 26 (20‑day windows of returns). We'll add a convolutional layer followed by a dense output layer.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load and prepare data (same as in Chapter 26)
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()

# Compute returns
df_stock['Return'] = df_stock['Close'].pct_change() * 100

# Use only returns as feature for simplicity
feature_columns = ['Return']
df_stock = df_stock.dropna(subset=feature_columns)

# Function to create sequences (same as before)
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length, :])
        y.append(data[i+seq_length, 0])  # target is the next return
    return np.array(X), np.array(y)

seq_length = 20
data = df_stock[feature_columns].values
X, y = create_sequences(data, seq_length)

# Temporal split
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Scale features (reshape to 2D, scale, reshape back)
original_shape = X_train.shape
X_train_2d = X_train.reshape(-1, X_train.shape[-1])
scaler = StandardScaler()
X_train_scaled_2d = scaler.fit_transform(X_train_2d)
X_train_scaled = X_train_scaled_2d.reshape(original_shape)

X_test_2d = X_test.reshape(-1, X_test.shape[-1])
X_test_scaled_2d = scaler.transform(X_test_2d)
X_test_scaled = X_test_scaled_2d.reshape(X_test.shape)

print(f"X_train_scaled shape: {X_train_scaled.shape}")  # (samples, timesteps, features)
```

Now we build a simple 1D CNN model:

```python
model_cnn = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_train_scaled.shape[1], X_train_scaled.shape[2])),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(filters=32, kernel_size=3, activation='relu'),
    layers.GlobalAveragePooling1D(),  # or Flatten()
    layers.Dense(1)
])

model_cnn.compile(optimizer='adam', loss='mse', metrics=['mae'])
model_cnn.summary()
```

**Explanation:**

- `Conv1D` with 64 filters and kernel size 3. The input shape is `(timesteps=20, features=1)`. Keras uses `'valid'` padding (no padding) by default, so the output has shape `(batch, timesteps - kernel_size + 1, 64)`, here `(batch, 18, 64)`. To keep the time dimension at 20, pass `padding='same'`.
- `MaxPooling1D` with pool size 2 halves the time dimension (from 18 to 9) by keeping the larger activation in each pair of adjacent time steps.
- A second `Conv1D` layer with 32 filters, which shortens the time dimension from 9 to 7.
- `GlobalAveragePooling1D` averages over the time dimension, producing a fixed‑size vector for each sample, which is then fed to a dense output layer. Alternatively, we could use `Flatten()`, but global pooling reduces parameters and helps prevent overfitting.
- The output layer is a single neuron for regression.
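To check the time dimension layer by layer without building the model, the length arithmetic can be sketched in plain Python. The helper names `conv1d_length` and `pool1d_length` are hypothetical; they assume stride‑1 convolutions and a pooling stride equal to the pool size, which are the Keras defaults used in the model above.

```python
def conv1d_length(n, kernel_size, padding="valid", dilation=1):
    """Output length of a stride-1 Conv1D layer."""
    if padding in ("same", "causal"):
        return n
    return n - dilation * (kernel_size - 1)

def pool1d_length(n, pool_size):
    """Output length of MaxPooling1D with stride = pool_size and 'valid' padding."""
    return (n - pool_size) // pool_size + 1

n = 20                                # input window length
n = conv1d_length(n, kernel_size=3)   # 18 after the first Conv1D
n = pool1d_length(n, pool_size=2)     # 9 after MaxPooling1D
n = conv1d_length(n, kernel_size=3)   # 7 after the second Conv1D
print(n)  # 7
```

These numbers should agree with the output shapes printed by `model_cnn.summary()`.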

### **27.2.3 Training and Evaluation**

```python
# Validation split (temporal)
val_size = int(len(X_train_scaled) * 0.1)
X_val = X_train_scaled[-val_size:]
y_val = y_train[-val_size:]
X_train_final = X_train_scaled[:-val_size]
y_train_final = y_train[:-val_size]

history = model_cnn.fit(
    X_train_final, y_train_final,
    validation_data=(X_val, y_val),
    epochs=50,
    batch_size=32,
    verbose=0
)

# Plot loss
plt.figure(figsize=(10,4))
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.show()

# Test evaluation
y_pred = model_cnn.predict(X_test_scaled).flatten()
rmse_cnn = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"CNN Test RMSE: {rmse_cnn:.4f}")
```

**Explanation:**

- Training follows the same procedure as for the LSTM model. The CNN often trains faster because convolutions are parallelizable.
- Performance may be comparable to the LSTM. The CNN captures local patterns but has no explicit memory: each convolutional output depends only on the inputs inside its receptive field, which is set by the kernel sizes, pooling, and depth of the stack.

---

## **27.3 Pooling Layers**

Pooling layers downsample the feature maps, reducing dimensionality and providing a degree of invariance to small shifts in time. Common types:

- **Max pooling:** Takes the maximum value over a window. Preserves the most activated features.
- **Average pooling:** Takes the average. Smoother but may lose sharp features.

In time‑series, pooling helps to condense the representation and focus on the most important local patterns. However, excessive pooling can discard useful temporal information. For forecasting, global pooling (over the whole time dimension) is often used before the final dense layer.
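The difference between the pooling types is easy to see on a single feature map. The sketch below uses NumPy with a hypothetical helper `pool1d` and made‑up activations; it mimics non‑overlapping pooling and drops trailing values that do not fill a window.

```python
import numpy as np

def pool1d(x, pool_size, mode="max"):
    """Non-overlapping 1D pooling; trailing values that do not fill a window are dropped."""
    n_windows = len(x) // pool_size
    windows = np.asarray(x[:n_windows * pool_size]).reshape(n_windows, pool_size)
    return windows.max(axis=1) if mode == "max" else windows.mean(axis=1)

# Made-up activations of one channel over 7 time steps
feature_map = np.array([0.1, 0.9, 0.2, 0.0, 0.4, 0.6, 0.3])

print(pool1d(feature_map, 2, "max"))  # [0.9 0.2 0.6]
print(pool1d(feature_map, 2, "avg"))  # [0.5 0.1 0.5]
print(feature_map.mean())             # global average pooling: one number per channel
```

Max pooling keeps the 0.9 spike intact, while average pooling smooths it to 0.5, which illustrates the trade‑off described above.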

---

## **27.4 Temporal Convolutional Networks (TCN)**

A Temporal Convolutional Network (TCN) is a specific architecture that combines several principles:

- **Causal convolutions:** Ensure that the output at time `t` depends only on inputs at times `≤ t` (no look‑ahead). In Keras, `padding='causal'` achieves this by zero‑padding only the left (past) side of the sequence.
- **Dilated convolutions:** Introduce gaps in the kernel to exponentially increase the receptive field without increasing the number of parameters.
- **Residual connections:** Allow training of very deep networks by adding skip connections.
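The causal property can be checked numerically. The following NumPy sketch uses a hypothetical helper `causal_conv1d` with arbitrary weights; it pads zeros on the left only and then confirms that changing the most recent input leaves all earlier outputs untouched.

```python
import numpy as np

def causal_conv1d(x, w):
    """output[t] = sum_j w[j] * x[t - (k - 1) + j], with zeros padded on the left only."""
    k = len(w)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(w, padded[t:t + k]) for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.2, 0.3, 0.5])  # arbitrary weights for illustration

y = causal_conv1d(x, w)

# Change only the last (most recent) value and recompute
x_changed = x.copy()
x_changed[-1] = 100.0
y_changed = causal_conv1d(x_changed, w)

print(len(y) == len(x))                     # True: output length equals input length
print(np.allclose(y[:-1], y_changed[:-1]))  # True: earlier outputs are unaffected
```

With symmetric (`'same'`) padding the second check would fail, because the output just before the end would also see the changed value.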

### **27.4.1 Dilated Convolutions**

In a standard convolution, the kernel is applied to contiguous time steps. With dilation, we skip steps. For example, a kernel of size 3 with dilation rate 2 covers positions `t, t+2, t+4` (or `t-4, t-2, t` in a causal layer). By stacking layers with exponentially increasing dilation (1, 2, 4, 8, …), the receptive field grows quickly, enabling the network to capture long‑range dependencies. With a kernel of size 3, four such layers see 31 time steps, compared with 9 for four undilated layers.
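How quickly the receptive field grows can be worked out with a few lines of arithmetic. The function below is a sketch for stride‑1 convolutions without pooling; the name `receptive_field` and the `convs_per_level` parameter are introduced here for illustration.

```python
def receptive_field(kernel_size, dilation_rates, convs_per_level=1):
    """Receptive field (in time steps) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilation_rates:
        # Each convolution extends the field by (kernel_size - 1) * dilation steps
        rf += convs_per_level * (kernel_size - 1) * d
    return rf

print(receptive_field(3, [1, 1, 1, 1]))                     # 9: four layers, no dilation
print(receptive_field(3, [1, 2, 4, 8]))                     # 31: exponentially increasing dilation
print(receptive_field(3, [1, 2, 4, 8], convs_per_level=2))  # 61: two convolutions per level
```

The parameter count is the same in the first two cases; only the spacing of the kernel taps changes.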

### **27.4.2 Building a Simple TCN Block**

We can implement a TCN using Keras with custom layers, but there are also libraries like `keras-tcn`. Here's a simplified version of a residual block with dilated convolutions:

```python
def residual_block(x, dilation_rate, filters, kernel_size=3):
    # Save input for skip connection
    shortcut = x
    
    # First dilated convolution
    x = layers.Conv1D(filters, kernel_size, padding='causal', dilation_rate=dilation_rate, activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    
    # Second dilated convolution
    x = layers.Conv1D(filters, kernel_size, padding='causal', dilation_rate=dilation_rate, activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    
    # Add residual (if dimensions match; otherwise use 1x1 conv)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding='same')(shortcut)
    x = layers.add([x, shortcut])
    x = layers.Activation('relu')(x)
    return x

# Build TCN model
inputs = keras.Input(shape=(X_train_scaled.shape[1], X_train_scaled.shape[2]))
x = inputs
filters = 32
for dilation_rate in [1, 2, 4, 8]:
    x = residual_block(x, dilation_rate, filters)
    filters *= 2  # double filters each block (optional)

x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(1)(x)

model_tcn = keras.Model(inputs, outputs)
model_tcn.compile(optimizer='adam', loss='mse', metrics=['mae'])
model_tcn.summary()
```

**Explanation:**

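- Each residual block contains two dilated convolutional layers with `padding='causal'` (ensures no leakage from the future).
- Dilation rates increase exponentially (1, 2, 4, 8), giving a large receptive field with few layers. With two convolutions of kernel size 3 per block, the receptive field is 61 time steps, which already exceeds the 20‑step input window, so the larger dilations pay off mainly with longer sequences.
- Filters are doubled in each block (32, 64, 128, 256) to increase capacity. Because the channel count changes every time, each block uses the 1x1 convolution on the shortcut.
- Global average pooling compresses the time dimension, followed by a 64‑unit dense layer and the single‑neuron output.
- The model can be trained with the same `fit` call as the simple CNN.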

---

## **27.5 CNN‑RNN Hybrids**

Combining CNNs and RNNs can leverage the strengths of both: CNNs extract local features, and RNNs model temporal dependencies. A common pattern is to use one or more convolutional layers to preprocess the input sequence, then feed the resulting feature maps into an LSTM or GRU.

```python
model_hybrid = keras.Sequential([
    layers.Conv1D(filters=64, kernel_size=5, activation='relu', padding='same', input_shape=(seq_length, 1)),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(filters=32, kernel_size=3, activation='relu', padding='same'),
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(50, return_sequences=False),
    layers.Dense(1)
])

model_hybrid.compile(optimizer='adam', loss='mse', metrics=['mae'])
model_hybrid.summary()
```

**Explanation:**

- The convolutional layers extract local features, and the two pooling layers reduce the time dimension from 20 to 10 and then to 5 steps.
- The LSTM then processes the reduced sequence, capturing longer‑term dependencies.
- This hybrid can be more efficient than a pure LSTM on long sequences because the LSTM only has to step through the shortened sequence.

---

## **27.6 Architectural Patterns**

When designing CNNs for time‑series, consider these patterns:

- **Early vs. late fusion:** If you have multiple features (e.g., returns, volume, RSI), you can either process them together from the start (early fusion) or have separate branches that are combined later (late fusion).
- **Dilated stacks:** Stacking dilated convolutions (like TCN) often works well for long sequences.
- **Residual connections:** Give gradients a direct path through deep stacks, which mitigates vanishing gradients and makes deep networks easier to train.
- **Global pooling vs. flattening:** Global average pooling is parameter‑efficient and often generalizes better than flattening followed by dense layers.
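The parameter saving from global pooling can be quantified for the output head of the simple CNN in Section 27.2.2, whose last feature map has 7 time steps and 32 channels. The helper `dense_params` is a hypothetical name for the usual weights‑plus‑biases count of a dense layer.

```python
def dense_params(n_inputs, n_units=1):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

time_steps, channels = 7, 32  # last feature map of the simple CNN in Section 27.2.2

flatten_head = dense_params(time_steps * channels)  # Flatten() -> Dense(1)
gap_head = dense_params(channels)                   # GlobalAveragePooling1D() -> Dense(1)

print(flatten_head)  # 225
print(gap_head)      # 33
```

The gap widens with longer windows, since the flattened head grows with the number of time steps while the pooled head does not.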

---

## **27.7 Implementation Strategies**

### **27.7.1 Data Preparation**

- Ensure sequences are constructed without look‑ahead.
- Scale features per channel (as shown).
- For multi‑step forecasting, you can have multiple output neurons (one per future step) or use a sequence‑to‑sequence architecture.
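For the multi‑output approach, the sequence builder from Section 27.2.2 needs targets that span several future steps. The sketch below is one possible variant; `create_multistep_sequences`, `horizon`, and the synthetic `demo_data` array are names introduced here, not part of the chapter's pipeline.

```python
import numpy as np

def create_multistep_sequences(data, seq_length, horizon):
    """Build (X, y) pairs where y holds the next `horizon` values of column 0."""
    X, y = [], []
    for i in range(len(data) - seq_length - horizon + 1):
        X.append(data[i:i + seq_length, :])
        # Targets start right after the input window, so there is no look-ahead in X
        y.append(data[i + seq_length:i + seq_length + horizon, 0])
    return np.array(X), np.array(y)

# Synthetic stand-in for the returns column, shape (n_days, n_features)
demo_data = np.arange(30, dtype=float).reshape(-1, 1)

X_ms, y_ms = create_multistep_sequences(demo_data, seq_length=20, horizon=5)
print(X_ms.shape, y_ms.shape)  # (6, 20, 1) (6, 5)
print(y_ms[0])                 # [20. 21. 22. 23. 24.]
```

A model for these targets would end in `Dense(horizon)` instead of `Dense(1)`.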

### **27.7.2 Regularization**

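- Use dropout after convolutional layers. `SpatialDropout1D` drops entire feature maps (channels) rather than individual activations, which suits the correlated neighboring time steps in a feature map.
- Apply batch normalization to stabilize training.
- Use L2 weight regularization.
- Use early stopping on the validation loss to halt training before the model starts to overfit.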
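The techniques in this list can be combined in a single small model. The following is a sketch under assumptions: the function name `build_regularized_cnn` and the values for the L2 strength, dropout rate, and patience are illustrative starting points, not tuned settings for NEPSE.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_regularized_cnn(seq_length=20, n_features=1, l2_strength=1e-4, dropout_rate=0.2):
    """Small 1D CNN combining L2 penalties, batch normalization, and spatial dropout."""
    return keras.Sequential([
        keras.Input(shape=(seq_length, n_features)),
        layers.Conv1D(32, 3, padding='same',
                      kernel_regularizer=regularizers.l2(l2_strength)),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.SpatialDropout1D(dropout_rate),  # drops whole feature maps
        layers.GlobalAveragePooling1D(),
        layers.Dense(1),
    ])

model_reg = build_regularized_cnn()
model_reg.compile(optimizer='adam', loss='mse')

# Stop when validation loss has not improved for 10 epochs and keep the best weights
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)
# Pass callbacks=[early_stop] to model_reg.fit(...) together with validation_data
```

Early stopping only works when `fit` receives validation data, so it pairs naturally with the temporal validation split used earlier in the chapter.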

### **27.7.3 Hyperparameter Tuning**

Important hyperparameters:

- Number of filters
- Kernel size
- Number of layers
- Dilation rates
- Pooling size
- Learning rate and optimizer

Use walk‑forward validation to tune these, as with any time‑series model.
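One way to set up walk‑forward validation is an expanding training window followed by a validation block that always lies later in time. The generator below is a minimal sketch; `walk_forward_splits` and its parameters are hypothetical names, and scikit‑learn's `TimeSeriesSplit` offers a similar ready‑made splitter.

```python
def walk_forward_splits(n_samples, n_folds=4, min_train=None):
    """Yield (train_idx, val_idx) ranges for expanding-window validation."""
    if min_train is None:
        min_train = n_samples // (n_folds + 1)
    fold_size = (n_samples - min_train) // n_folds
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        val_end = train_end + fold_size
        # Training data always ends where validation data begins
        yield range(0, train_end), range(train_end, val_end)

for train_idx, val_idx in walk_forward_splits(100, n_folds=4):
    print(len(train_idx), val_idx.start, val_idx.stop)
# 20 20 40
# 40 40 60
# 60 60 80
# 80 80 100
```

For each candidate hyperparameter setting, a model would be retrained on every training range and scored on the matching validation range, with the scores averaged across folds. The scaler should be refit on each training range as well.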

### **27.7.4 Comparing with RNNs and MLPs**

CNNs often train faster and can achieve competitive accuracy. For the NEPSE dataset, they might capture short‑term patterns (e.g., 2‑3 day reversals) effectively. Long‑term dependencies (e.g., monthly cycles) may require deeper networks or larger dilation rates.

---

## **27.8 Practical Considerations for NEPSE**

- **Feature engineering:** Even with CNNs, providing derived features (RSI, moving averages) as additional channels often helps. The CNN can learn to combine them.
- **Sequence length:** Experiment with lengths from 10 to 60 days. Use validation performance to decide.
- **Multi‑step forecasting:** For predicting multiple days ahead, a TCN with multiple output neurons or a sequence‑to‑sequence model can be used.
- **Ensemble:** Combining CNN and LSTM predictions (simple average) may improve robustness.
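The simple‑average ensemble mentioned in the last bullet takes only a line of NumPy. In the sketch below, the arrays are made‑up numbers standing in for test targets and for the predictions of two already trained models; in practice they would come from calls such as `model_cnn.predict(...)`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Made-up numbers standing in for test targets and two models' predictions
y_true = np.array([0.5, -0.2, 0.1, 0.8, -0.6])
pred_cnn = np.array([0.3, -0.4, 0.4, 0.5, -0.2])
pred_lstm = np.array([0.7, 0.1, -0.1, 0.9, -0.8])

pred_ensemble = (pred_cnn + pred_lstm) / 2.0  # simple average

for name, pred in [("CNN", pred_cnn), ("LSTM", pred_lstm), ("Ensemble", pred_ensemble)]:
    print(f"{name}: RMSE = {rmse(y_true, pred):.4f}")
```

The averaged forecast's squared error can never exceed the average of the two models' squared errors, and it helps most when the models make different mistakes.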

---

## **27.9 Comparison with Other Models**

We can benchmark the CNN against the LSTM and MLP from previous chapters. On financial returns, the models often end up with similar RMSE, while CNNs usually train faster. The choice may come down to computational resources and whether the data has strong local patterns.
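When the models' RMSE values are close, it helps to know what a trivial forecast scores on the same test set. The sketch below is an addition for context rather than part of the chapter's pipeline: `baseline_rmse` is a hypothetical helper, and the random data merely stands in for `y_train` and `y_test`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def baseline_rmse(y_train, y_test):
    """RMSE of two trivial forecasts: always zero, and the training-set mean return."""
    y_test = np.asarray(y_test, dtype=float)
    zero = rmse(y_test, np.zeros_like(y_test))
    mean = rmse(y_test, np.full_like(y_test, np.mean(y_train)))
    return {"zero": zero, "train_mean": mean}

# Synthetic returns standing in for y_train / y_test
rng = np.random.default_rng(0)
y_train_demo = rng.normal(0.0, 1.5, size=400)
y_test_demo = rng.normal(0.0, 1.5, size=100)

print(baseline_rmse(y_train_demo, y_test_demo))
```

A model whose test RMSE is not clearly below these baselines has learned little beyond the average return, whichever architecture it uses.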

---

## **27.10 Chapter Summary**

In this chapter, we explored the application of Convolutional Neural Networks to time‑series forecasting, using the NEPSE dataset as a running example.

- **1D convolutions** detect local patterns in sequences and are highly parallelizable.
- **Pooling layers** reduce dimensionality and provide a degree of invariance.
- **Temporal Convolutional Networks (TCNs)** use dilated convolutions and residual connections to capture long‑range dependencies efficiently.
- **CNN‑RNN hybrids** combine the strengths of both architectures.
- We built simple CNN, TCN, and hybrid models in Keras, and trained and evaluated the simple CNN on the return prediction task.
- Regularization and proper temporal validation are essential to avoid overfitting.

### **Practical Takeaways for the NEPSE System:**

- Start with a simple 1D CNN as a fast baseline.
- If the sequence is long, consider a TCN with dilation.
- Use multiple feature channels (returns, volume, technical indicators) to enrich the input.
- Compare CNN performance with LSTM and MLP; often CNNs are competitive and faster.
- Always validate with walk‑forward to ensure the model generalizes to new market regimes.

In the next chapter, **Chapter 28: Transformer Models for Time‑Series**, we will delve into attention‑based architectures that have recently achieved state‑of‑the‑art results in many sequence‑to‑sequence tasks.

---

**End of Chapter 27**

