# TCN Architectures and Variants - Comprehensive Technical Reference

**Version 2.0** - Expanded Documentation (February 2026)

This notebook provides complete technical documentation of Temporal Convolutional Network (TCN) architectures used in the TAPE-TCN portfolio optimization system.

**Audience**: Researchers, developers, and practitioners requiring deep understanding of TCN theory and implementation for portfolio RL.

## Table of Contents

 1. [Introduction and Motivation](#section1)
 2. [Theoretical Foundations](#section2)
 3. [TCN Block Implementation](#section3)
 4. [TCN Variants Taxonomy](#section4)
 5. [Multi-Head Self-Attention](#section5)
 6. [Fusion Architecture](#section6)
 7. [Receptive Field Analysis](#section7)
 8. [Portfolio Optimization Application](#section8)
 9. [Computational Complexity](#section9)
 10. [References](#section10)

## 1. Introduction and Motivation <a id='section1'></a>

### Why TCN for Portfolio Optimization?

Temporal Convolutional Networks (TCNs) offer key advantages for sequential portfolio allocation:

**1. Parallelizable Training**: Unlike RNNs/LSTMs, TCNs process entire sequences in parallel → faster training on long financial time series.

**2. Stable Gradients**: Residual connections + dilated convolutions mitigate vanishing gradients, which is crucial for learning long-term market dependencies.

**3. Flexible Receptive Fields**: Exponentially growing receptive fields via dilation capture multi-scale patterns: daily volatility, monthly trends, quarterly earnings.

**4. Causal Structure**: Built-in causality ensures no future information leakage, critical for realistic backtesting.

**5. Variable-Length Sequences**: TCNs handle any sequence length without architecture changes.

### Connection to TAPE-TCN Portfolio RL

In this project:

- **Input**: Multi-asset feature sequences (technical indicators + fundamentals + macro variables)
- **TCN Processing**: Temporal encoding of market dynamics and regime shifts
- **Output**: Dirichlet concentration parameters α for portfolio weight sampling

TCNs learn to map market states → optimal allocations while capturing:

- **Cross-asset correlations** (via Fusion pathway)
- **Regime persistence** (via long receptive fields)
- **Multi-horizon risk-return** (via TAPE reward)
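
To make the output stage concrete, here is a minimal, dependency-free sketch of turning raw network outputs into Dirichlet concentration parameters and sampling portfolio weights from them. The softplus-plus-offset mapping and all function names are illustrative assumptions, not the project's actual output head.

```python
import math
import random

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z)); always positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def to_concentration(logits, offset=1.0):
    """Map raw outputs to Dirichlet alpha > 0 (hypothetical mapping)."""
    return [softplus(z) + offset for z in logits]

def sample_dirichlet(alpha, rng):
    """Draw weights ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)                      # fixed seed for reproducibility
alpha = to_concentration([0.3, -1.2, 2.0, 0.0])
weights = sample_dirichlet(alpha, rng)
print(alpha)
print(weights, sum(weights))                # weights are positive and sum to 1
```

Sampled weights are long-only and fully invested by construction, which is why a Dirichlet policy suits portfolio allocation.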

**Key Papers**:

- Bai et al. (2018): TCN foundations [bai2018tcn]
- Jiang et al. (2017): DRL for portfolio management [jiang2017deep]
- Yang et al. (2022): Dirichlet portfolio RL [yang2022selective]
- André & Coqueret (2021): Dirichlet factor portfolios [andre2021dirichlet]

## 2. Theoretical Foundations <a id='section2'></a>

### 2.1 Causal Convolution

A **causal convolution** ensures the output at time $t$ depends only on inputs at times $\leq t$, never on future timesteps.

For 1D convolution with kernel size $k$:

$$
y_t = \sum_{i=0}^{k-1} w_i \cdot x_{t-i}
$$

Implementation: **causal padding** - left-pad input by $(k-1)$ zeros before standard convolution.
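
The padding rule can be checked with a small single-channel sketch in plain Python. The function name `causal_conv1d` is hypothetical; the project itself relies on TensorFlow layers.

```python
def causal_conv1d(x, w):
    """y[t] = sum_i w[i] * x[t - i], with x[t] treated as 0 for t < 0."""
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)        # left-pad only, never right-pad
    # padded[t + k - 1 - i] is x[t - i]
    return [sum(w[i] * padded[t + k - 1 - i] for i in range(k))
            for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.3, 0.2]
y = causal_conv1d(x, w)

# Changing inputs at t >= 3 must leave outputs at t < 3 untouched
x_future = x[:3] + [100.0, -100.0]
y_future = causal_conv1d(x_future, w)
print(y[:3] == y_future[:3])   # True
```

The output has the same length as the input, and the leak check is the same property a backtest relies on.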

### 2.2 Dilated Convolution

A **dilated convolution** with rate $d$ samples input with gaps:

$$
y_t = \sum_{i=0}^{k-1} w_i \cdot x_{t - i \cdot d}
$$

- $d=1$: standard convolution
- $d=2$: samples every other timestep  
- $d=4$: samples every 4th timestep

**Benefit**: Enlarges the receptive field without adding parameters; when dilations double from layer to layer, the receptive field grows exponentially with depth.
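
As an illustration, the dilated sum can be written directly in plain Python. This is a single-channel sketch with a hypothetical function name; skipping negative indices is equivalent to left-padding with $(k-1) \cdot d$ zeros.

```python
def dilated_causal_conv1d(x, w, d=1):
    """y[t] = sum_i w[i] * x[t - i*d], treating x[t] as 0 for t < 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w_i in enumerate(w):
            src = t - i * d
            if src >= 0:              # same effect as (k-1)*d zeros of left padding
                acc += w_i * x[src]
        out.append(acc)
    return out

# An impulse at t=0 lights up exactly the tap offsets 0, d, 2d, ...
impulse = [1.0] + [0.0] * 9
taps = dilated_causal_conv1d(impulse, [1.0, 1.0, 1.0], d=4)
offsets = [t for t, v in enumerate(taps) if v != 0.0]
print(offsets)   # [0, 4, 8]
```

Three weights cover a span of nine timesteps here, the same parameter count as an undilated kernel of size 3.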

### 2.3 Residual Connections

Each TCN block includes a skip connection:

$$
\text{output} = \text{Activation}(\text{Conv}(x) + x)
$$

If dimensions mismatch, use 1x1 projection:

$$
\text{output} = \text{Activation}(\text{Conv}(x) + W_{\text{proj}} \cdot x)
$$

**Benefit**: Enables gradient flow through very deep networks (He et al. 2015).
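
A per-timestep sketch of both cases follows. ReLU as the activation and the helper names are assumptions made for illustration; the convolutional branch is passed in as a precomputed vector.

```python
def relu(v):
    return [max(a, 0.0) for a in v]

def matvec(W, v):
    """Multiply matrix W (rows = output channels) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def residual_output(x_t, branch_t, W_proj=None):
    """Activation(branch + skip) for one timestep.

    x_t:      block input at time t (C_in values)
    branch_t: convolutional branch output at time t (C_out values)
    W_proj:   optional C_out x C_in matrix, i.e. a 1x1 convolution
    """
    skip = x_t if W_proj is None else matvec(W_proj, x_t)
    if len(skip) != len(branch_t):
        raise ValueError("channel mismatch: supply W_proj")
    return relu([b + s for b, s in zip(branch_t, skip)])

# Same width: identity skip
print(residual_output([1.0, -2.0], [0.5, 0.5]))            # [1.5, 0.0]
# C_in=2 -> C_out=3: the skip path needs a projection
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(residual_output([1.0, -2.0], [0.0, 3.0, 0.5], W))    # [1.0, 1.0, 0.0]
```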

### 2.4 TCN vs RNN/LSTM

| Aspect | RNN/LSTM | TCN |
|--------|----------|-----|
| **Training** | Sequential | Fully parallel |
| **Speed** | Slow | Fast (GPU-optimized) |
| **Receptive Field** | Full history | Controlled by design |
| **Gradient Flow** | Vanishing/exploding | Stable (residuals) |
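| **Memory** | $O(T)$ stored states for BPTT (training); fixed-size state (inference) | Filters shared across time (training); buffer of RF inputs (inference) |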
| **Long Dependencies** | Difficult | Excellent (dilations) |

**References**: Bai et al. (2018), He et al. (2015) for ResNets

## 4. TCN Variants Taxonomy <a id='section4'></a>

This project implements **three TCN variants**, each with different complexity-expressiveness tradeoffs.

### 4.1 Variant A: Plain TCN (`TCNActor`, `TCNCritic`)

**Architecture**:
```
Input (batch, timesteps, features)
  ↓
TCNBlock_1 (filters=32, dilation=2)
  ↓
TCNBlock_2 (filters=64, dilation=4)
  ↓
TCNBlock_3 (filters=64, dilation=8)
  ↓
GlobalAveragePooling1D → (batch, 64)
  ↓
Dense(num_actions) → Dirichlet α
```

**Configuration**:
- `actor_critic_type = "TCN"`
- `tcn_filters = [32, 64, 64]`
- `tcn_kernel_size = 5`
- `tcn_dilations = [2, 4, 8]`

**Best for**: Baseline, computational efficiency, interpretability

**Receptive Field**: RF = 113 timesteps (exceeds sequence_length=60)

### 4.2 Variant B: TCN + Attention (`TCNAttentionActor`, `TCNAttentionCritic`)

**Architecture**:
```
Input (batch, timesteps, features)
  ↓
TCN Blocks (same as Plain TCN)
  ↓
Projection → (batch, timesteps, attention_dim=64)
  ↓
Multi-Head Self-Attention (4 heads)
  ↓
GlobalAveragePooling1D
  ↓
Dense → Dirichlet α
```

**Configuration**:
- `actor_critic_type = "TCN"` + `use_attention = True` OR `actor_critic_type = "TCN_ATTENTION"`
- `attention_heads = 4`
- `attention_dim = 64`

**Best for**: Learning temporal importance weighting, regime-dependent allocation

**Tradeoff**: +15% parameters, +attention overhead, +interpretability (attention weights)

### 4.3 Variant C: TCN + Fusion (`TCNFusionActor`, `TCNFusionCritic`)

**Architecture** (Dual Pathway):
```
Input (batch, timesteps, features)
  ↓
┌──────────────────────┬──────────────────────┐
│ Per-Asset Pathway    │ Global Pathway       │
│                      │                      │
│ Reshape by assets    │ (no reshape)         │
│  ↓                   │  ↓                   │
│ Shared TCN on each   │ TCN on full input    │
│  ↓                   │  ↓                   │
│ Time pooling         │ Time pooling         │
│  ↓                   │  ↓                   │
│ Project → embed_dim  │ Project → embed_dim  │
│  ↓                   │                      │
│ Cross-Asset Attention│                      │
│  ↓                   │                      │
│ Asset pooling        │                      │
└──────────────────────┴──────────────────────┘
                ↓
        Gated Fusion:
        gate = σ(W · [h_asset, h_global])
        h = gate ⊙ h_asset + (1-gate) ⊙ h_global
                ↓
        Dense → Dirichlet α
```

**Configuration**:
- `actor_critic_type = "TCN"` + `use_fusion = True`
- `fusion_embed_dim = 128`
- `fusion_attention_heads = 4`

**Best for**: Capturing cross-asset relationships + global market context

**Complexity**: Highest parameter count (~2x Plain TCN), most expressive

### 4.4 Variant Comparison

| Variant | Parameters | FLOPs/Step | Interpretability | Use Case |
|---------|------------|------------|------------------|----------|
| Plain TCN | Baseline | Baseline | Medium | Fast prototyping, baseline |
| TCN+Attention | +15% | +20% | High (attn weights) | Regime detection |
| TCN+Fusion | +100% | +150% | High (asset relationships) | Cross-asset strategy |

**Implementation**: `src/agents/actor_critic_tf.py`

**References**: Li et al. (2025) for fusion architecture [li2025ttsnet], André & Coqueret (2021) for Dirichlet portfolios [andre2021dirichlet]

## 5. Multi-Head Self-Attention <a id='section5'></a>

### 5.1 Attention Mechanism

After TCN feature extraction, **multi-head self-attention** learns to weight timesteps by importance.

**Query-Key-Value**:
$$
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
$$

**Scaled Dot-Product Attention**:
$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

**Multi-Head**:
$$
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W_O
$$

where each head $i$ operates on a subspace:
$$
\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)
$$
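
The scaled dot-product step can be sketched for a single head in plain Python. For brevity the example sets $W_Q = W_K = W_V$ to the identity, which is an assumption of this illustration; function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    out = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
           for w_row in weights]
    return out, weights

# Toy input: T=3 timesteps, d_k=2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])             # each row sums to 1
```

A multi-head layer runs this computation once per head on separately projected inputs, concatenates the head outputs, and applies $W_O$.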

### 5.2 Why Attention After TCN?

**TCN strengths**: Efficient temporal encoding with large receptive fields 

**TCN limitation**: Global average pooling weights every timestep equally when summarizing the sequence

**Attention benefit**: Learn which timesteps matter most for current decision

**Example**: During market crash, attention may focus on recent extreme moves; during calm periods, focus on longer-term trends.

### 5.3 Implementation

From `src/agents/actor_critic_tf.py::MultiHeadSelfAttention`:

- **Heads**: 4 (default)
- **Dimension**: 64 (default)
- **Dropout**: 0.1
- **Scale factor**: $1/\sqrt{d_k}$ where $d_k = d_{\text{model}} / h = 64/4 = 16$

**Positional information**: Implicit via TCN's causal structure (no explicit positional encoding needed)

**References**: Vaswani et al. (2017) for transformers, Li et al. (2025) for TCN-attention fusion [li2025ttsnet]

## 6. Fusion Architecture <a id='section6'></a>

The **TCN Fusion** variant implements a dual-pathway design to capture both:
- **Per-asset temporal patterns** (individual stock dynamics)
- **Global market context** (macro regime, cross-asset correlations)

### 6.1 Dual Pathway Design

**Per-Asset Pathway**:
1. Reshape input: split features by asset
2. Apply shared TCN encoder to each asset's time series
3. Pool over time → per-asset embeddings
4. Cross-asset attention → learn asset interactions
5. Pool over assets → single representation

**Global Pathway**:
1. Process full input (all assets concatenated)
2. Apply TCN on global state
3. Pool over time → global market embedding

### 6.2 Gated Fusion

Combine pathways via learned gate:

$$
\mathbf{g} = \sigma(W_g \cdot [\mathbf{h}_{\text{asset}}, \mathbf{h}_{\text{global}}])
$$

$$
\mathbf{h}_{\text{fused}} = \mathbf{g} \odot \mathbf{h}_{\text{asset}} + (1 - \mathbf{g}) \odot \mathbf{h}_{\text{global}}
$$

where $\sigma$ is sigmoid, $\odot$ is element-wise product.

**Intuition**: Gate learns when to rely on asset-specific signals vs. global market context.
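
A small numeric sketch of the gate follows, using plain Python lists. The bias term `b_g` and the function names are assumptions added for illustration; the formula above has no explicit bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(h_asset, h_global, W_g, b_g):
    """g = sigmoid(W_g [h_asset, h_global] + b_g); h = g*h_asset + (1-g)*h_global."""
    concat = h_asset + h_global                   # list concatenation, length 2D
    gate = [sigmoid(sum(w * c for w, c in zip(row, concat)) + b)
            for row, b in zip(W_g, b_g)]
    fused = [g * a + (1.0 - g) * m
             for g, a, m in zip(gate, h_asset, h_global)]
    return fused, gate

h_asset = [1.0, -1.0]
h_global = [0.0, 2.0]
W_g = [[0.0] * 4, [0.0] * 4]    # zero weights -> gate is 0.5 in every dimension
b_g = [0.0, 0.0]
fused, gate = gated_fusion(h_asset, h_global, W_g, b_g)
print(gate)    # [0.5, 0.5]
print(fused)   # [0.5, 0.5]
```

Because the gate is a vector, each embedding dimension can lean toward a different pathway.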
 
### 6.3 Cross-Asset Attention

Within per-asset pathway, attention captures asset relationships:

$$
\alpha_{ij} = \frac{\exp(\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d})}{\sum_m \exp(\mathbf{q}_i^\top \mathbf{k}_m / \sqrt{d})}
$$

$$
\mathbf{h}_i = \sum_j \alpha_{ij} \mathbf{v}_j
$$

**Example**: High-tech stocks may attend strongly to each other during sector rotation.

### 6.4 Implementation Details

From `src/agents/actor_critic_tf.py::TCNFusionActor`:

```python
# Per-asset pathway
x_assets = reshape_by_asset(x)  # (batch, timesteps, per_asset_dim)
x_assets = shared_tcn(x_assets)
x_assets = time_pool(x_assets)
x_assets = project(x_assets)
x_assets = cross_asset_attention(x_assets)
h_asset = asset_pool(x_assets)

# Global pathway
h_global = time_pool(tcn(x))
h_global = project(h_global)

# Gated fusion
gate = sigmoid(gate_layer([h_asset, h_global]))
h_fused = gate * h_asset + (1-gate) * h_global
```

**References**: Li et al. (2025) TTSNet [li2025ttsnet], Wong & Liu (2025) multi-modal portfolio [wong2025portfolio]

## 7. Receptive Field Analysis <a id='section7'></a>

### 7.1 What is Receptive Field?

The **receptive field** (RF) is the number of input timesteps, the current one plus past ones, that can influence the current output.

For a TCN block with:
- Kernel size $k$
- Dilation rate $d$  
- 2 convolutional layers

Single block RF contribution:
$$
\text{RF}_{\text{block}} = 2(k-1)d
$$

### 7.2 Multi-Block Receptive Field

For a stack of $B$ blocks with dilations $\{d_1, d_2, ..., d_B\}$:

$$
\text{RF}_{\text{total}} = 1 + 2(k-1) \sum_{i=1}^B d_i
$$

**Derivation**: Each block adds $2(k-1)d_i$ to receptive field. The "+1" accounts for the current timestep.
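
The formula translates directly into a helper. The function names and the `convs_per_block` parameter are hypothetical and introduced only for this sketch.

```python
def receptive_field(kernel_size, dilations, convs_per_block=2):
    """RF = 1 + convs_per_block * (k - 1) * sum(dilations)."""
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

def cumulative_rf(kernel_size, dilations, convs_per_block=2):
    """Receptive field after each successive block."""
    return [receptive_field(kernel_size, dilations[:i + 1], convs_per_block)
            for i in range(len(dilations))]

print(receptive_field(5, [2, 4, 8]))   # 113
print(cumulative_rf(5, [2, 4, 8]))     # [17, 49, 113]
```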

### 7.3 Current Configuration

From `src/config.py::PHASE1_CONFIG`:
- $k = 5$
- $\{d_1, d_2, d_3\} = \{2, 4, 8\}$
- `sequence_length = 60`

**Calculation**:
$$
\text{RF} = 1 + 2(5-1)(2+4+8) = 1 + 8 \times 14 = 113
$$

**Interpretation**: RF (113) > sequence length (60) → the final output position can draw on the full 60-step input window.

### 7.4 Dilation Schedule Impact

Exponential dilation ($d_i = 2^i$) provides exponential RF growth:

| Blocks | Dilations | RF ($k=5$) |
|--------|-----------|------------|
| 3 | [1, 2, 4] | 57 |
| 3 | [2, 4, 8] | 113 |
| 4 | [1, 2, 4, 8] | 121 |
| 4 | [2, 4, 8, 16] | 241 |

**Guideline**: Choose dilations so $\text{RF} \geq \text{sequence\_length}$ for full context.

### 7.5 Receptive Field Visualization

```
Block 1 (d=2): each conv samples t-0, t-2, t-4, t-6, t-8; two convs reach t-16
Block 2 (d=4): samples output from Block 1, cumulative reach t-48
Block 3 (d=8): samples output from Block 2, cumulative reach t-112
```

With exponential dilations, each block adds twice the reach of the previous one (16, 32, then 64 steps), so the added reach **doubles** with every block.
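
The closed-form RF can be cross-checked empirically by pushing an impulse through a stack of dilated causal convolutions and finding the farthest lag that still produces a response. This is a simplified sketch: one channel, all-ones kernels, no residuals or nonlinearities, and hypothetical function names.

```python
def ones_dilated_conv(x, k, d):
    """All-ones kernel of size k with dilation d; zero history before t=0."""
    return [sum(x[t - i * d] for i in range(k) if t - i * d >= 0)
            for t in range(len(x))]

def tcn_stack(x, k, dilations, convs_per_block=2):
    """Apply the blocks in order, each with convs_per_block convolutions."""
    for d in dilations:
        for _ in range(convs_per_block):
            x = ones_dilated_conv(x, k, d)
    return x

def measured_rf(k, dilations, length=300):
    """Farthest lag reached by an impulse at t=0, plus one for the current step.

    `length` must exceed the true RF, otherwise the result is capped.
    """
    impulse = [1.0] + [0.0] * (length - 1)
    response = tcn_stack(impulse, k, dilations)
    farthest = max(t for t, v in enumerate(response) if v != 0.0)
    return farthest + 1

print(measured_rf(5, [2, 4, 8]))   # 113
```

Identity skip connections do not extend the reach, so omitting them here does not change the measured span.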

**References**: Bai et al. (2018) [bai2018tcn]

## 8. Portfolio Optimization Application <a id='section8'></a>

### 8.1 Why Temporal Modeling Matters

Portfolio allocation requires understanding:
1. **Momentum persistence**: Trends continue over multiple days
2. **Volatility clustering**: High volatility follows high volatility
3. **Regime shifts**: Bull/bear markets, crisis periods
4. **Multi-horizon risk**: Drawdowns unfold over weeks/months
5. **Transaction costs**: Rebalancing decisions depend on recent changes

**TCN advantages**:
- Large RF captures long-term patterns
- Causal structure ensures realistic backtesting
- Parallel training enables fast experimentation

### 8.2 Connection to TAPE Reward

The TAPE reward system evaluates portfolios on multi-horizon metrics:
- **Sharpe ratio**: Requires estimating returns distribution
- **Sortino ratio**: Requires downside deviation tracking
- **MDD**: Requires drawdown history
- **Turnover**: Requires action memory

**TCN role**: Learn temporal patterns that optimize these multi-horizon objectives.

### 8.3 Asset Differentiation

Different assets have different temporal characteristics:
- **Growth stocks**: High momentum, high volatility
- **Value stocks**: Mean-reverting, low volatility
- **Defensive**: Counter-cyclical behavior

**Fusion architecture**: Captures both individual asset dynamics + cross-asset relationships.

### 8.4 Actuarial Features Integration

TCNs process actuarial drawdown features (from `src/actuarial.py`):
- `Actuarial_Expected_Recovery`: Time to recover from drawdown
- `Actuarial_Prob_30d`, `Actuarial_Prob_60d`: Drawdown probability forecasts
- `Actuarial_Reserve_Severity`: Risk reserve sizing

These features capture non-Markovian risk dynamics that TCNs can leverage.

**References**:
- Jiang et al. (2017): EIIE framework [jiang2017deep]
- Yang et al. (2022): Dirichlet portfolio RL [yang2022selective]
- Zhang et al. (2020): DL for portfolio optimization [zhang2020deep]
- Wong & Liu (2025): Multi-modal portfolio [wong2025portfolio]

## 9. Computational Complexity <a id='section9'></a>

### 9.1 TCN Parameter Count

For a single TCN block with input channels $C_{\text{in}}$, output channels $C_{\text{out}}$, kernel size $k$:

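**Two Conv1D layers** (the first maps $C_{\text{in}} \to C_{\text{out}}$, the second $C_{\text{out}} \to C_{\text{out}}$):
$$
\text{Params}_{\text{conv}} = (k \times C_{\text{in}} \times C_{\text{out}} + C_{\text{out}}) + (k \times C_{\text{out}} \times C_{\text{out}} + C_{\text{out}})
$$

**Downsample** (1x1 projection, needed when $C_{\text{in}} \neq C_{\text{out}}$):
$$
\text{Params}_{\text{downsample}} = C_{\text{in}} \times C_{\text{out}} + C_{\text{out}}
$$

For **Plain TCN** with `filters=[32,64,64]`, `k=5`, input=100 features:
- Block 1: $(5 \times 100 \times 32) + (5 \times 32 \times 32) + $ biases and downsample $\approx 24K$
- Block 2: $(5 \times 32 \times 64) + (5 \times 64 \times 64) + $ biases and downsample $\approx 33K$
- Block 3: $2(5 \times 64 \times 64) + $ biases $\approx 41K$
- **Total**: ~98K parameters (roughly 100K)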
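
The tally can also be done programmatically. The sketch below is an illustration under stated assumptions: the first convolution in a block maps $C_{\text{in}}$ to $C_{\text{out}}$, the second maps $C_{\text{out}}$ to $C_{\text{out}}$, every layer has a bias, and nothing else (such as normalization) is counted. The helper names are hypothetical and not taken from `src/agents/actor_critic_tf.py`.

```python
def conv1d_params(k, c_in, c_out, bias=True):
    """Weights plus optional bias of one Conv1D layer."""
    return k * c_in * c_out + (c_out if bias else 0)

def tcn_block_params(k, c_in, c_out):
    """Two Conv1D layers plus a 1x1 projection when widths differ.

    Assumption: conv 1 maps c_in -> c_out and conv 2 maps c_out -> c_out.
    """
    total = conv1d_params(k, c_in, c_out) + conv1d_params(k, c_out, c_out)
    if c_in != c_out:
        total += conv1d_params(1, c_in, c_out)    # residual projection
    return total

def tcn_params(k, input_features, filters):
    """Per-block parameter counts for a stack of TCN blocks."""
    widths = [input_features] + list(filters)
    return [tcn_block_params(k, a, b) for a, b in zip(widths, widths[1:])]

per_block = tcn_params(5, 100, [32, 64, 64])
print(per_block, sum(per_block))
```

Counts reported by the framework's model summary may differ if the real blocks include additional trainable layers.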

### 9.2 FLOPs Analysis

For sequence length $T$, single Conv1D:
$$
\text{FLOPs}_{\text{conv}} = T \times k \times C_{\text{in}} \times C_{\text{out}}
$$

**TCN advantage**: Parallel across $T$ (vs. RNN's sequential $T$ operations)

### 9.3 Attention Overhead

Multi-head attention with $h$ heads, dimension $d$, sequence length $T$:

$$
\text{FLOPs}_{\text{attn}} = 4Td^2 + 2T^2d
$$

**Self-attention bottleneck**: $O(T^2)$ for attention matrix computation

For $T=60$, $d=64$: $\text{FLOPs}_{\text{attn}} = 4 \cdot 60 \cdot 64^2 + 2 \cdot 60^2 \cdot 64 \approx 1.4M$ (small overhead)
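
Both cost formulas are easy to evaluate for other sequence lengths or widths. The helpers below are a sketch with hypothetical names; they count multiply-accumulate operations and ignore softmax, biases, and activations.

```python
def conv_cost(T, k, c_in, c_out):
    """T * k * C_in * C_out for one Conv1D over a length-T sequence."""
    return T * k * c_in * c_out

def attention_cost(T, d):
    """4*T*d^2 for the Q, K, V and output projections,
    plus 2*T^2*d for the score matrix and the weighted sum over values."""
    return 4 * T * d * d + 2 * T * T * d

T, d = 60, 64
print(attention_cost(T, d))
print(conv_cost(T, 5, 100, 32))
```

The number of heads does not appear because splitting $d$ across $h$ heads leaves the total work unchanged; only the quadratic term grows when the sequence gets longer.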

### 9.4 Architecture Comparison

| Architecture | Params | FLOPs/Forward | Training Speed | Memory |
|--------------|--------|---------------|----------------|--------|
| Plain TCN | 100K | 5M | 1.0x (baseline) | 1.0x |
| TCN + Attention | 115K | 6.6M | 0.9x | 1.2x |
| TCN + Fusion | 200K | 12M | 0.6x | 1.8x |

**Training times** (empirical on this project, 10 assets, 60-step sequences):
- Plain TCN: ~20 sec/epoch
- TCN+Attention: ~22 sec/epoch
- TCN+Fusion: ~35 sec/epoch

### 9.5 Memory Efficiency

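**TCN**: No recurrent hidden state is carried between steps, and filters are shared across time, so training avoids storing $O(T)$ sequential states for backpropagation through time as an RNN does. At inference, a TCN instead needs a buffer of the most recent RF inputs rather than a single state vector.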

**Batch processing**: TCN fully parallelizes → better GPU utilization

**References**: Bai et al. (2018) for TCN efficiency [bai2018tcn]

## 10. References <a id='section10'></a>

### Core TCN Architecture

- **[bai2018tcn]** Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. _arXiv:1803.01271_.
- **[li2025ttsnet]** Li, Z., Luo, S., Liu, H., Tang, C., & Miao, J. (2025). TTSNet: Transformer–Temporal Convolutional Network–Self-Attention with Feature Fusion for Prediction of Remaining Useful Life. _Sensors, 25_(2), 432.
- **[xu2021portfolio]** Xu, X., & Zhang, Y. (2021). DP-TCN: Differential privacy-inspired TCN for stock prediction using financial news. _arXiv:2106.09121_.

### Deep RL for Portfolio Optimization

- **[jiang2017deep]** Jiang, Z., Xu, D., & Liang, J. (2017). A deep reinforcement learning framework for the financial portfolio management problem. _arXiv:1706.10059_.
- **[yang2022selective]** Yang, H., Park, H., & Lee, K. (2022). A selective portfolio management algorithm with off-policy reinforcement learning using Dirichlet distribution. _Axioms, 11_(12), 664.
- **[zhang2020deep]** Zhang, Z., Zohren, S., & Roberts, S. (2020). Deep learning for portfolio optimization. _Oxford-Man Institute Working Paper_.
- **[sood2023deep]** Sood, S., Papasotiriou, K., Vaiciulis, M., & Balch, T. (2023). Deep reinforcement learning for optimal portfolio allocation. _AAAI Conference_.
- **[wong2025portfolio]** Wong, J., & Liu, L. L. (2025). Portfolio optimization through a multi-modal deep reinforcement learning framework. _Engineering: Open Access, 3_(4), 1-8.
- **[choudhary2025risk]** Choudhary, H., Orra, A., Sahoo, K., & Thakur, M. (2025). Risk-adjusted deep reinforcement learning for portfolio optimization: A multi-reward approach. _IJCIS, 18_(1), 126.
- **[wang2025risk]** Wang, X., & Liu, L. (2025). Risk-sensitive deep reinforcement learning for portfolio optimization. _J. Risk Financial Management, 18_(7), 347.

### Dirichlet Policies

- **[andre2021dirichlet]** André, E., & Coqueret, G. (2021). Dirichlet policies for reinforced factor portfolios. _arXiv:2011.05381v3_.
- **[tian2022prescriptive]** Tian, Y., Han, M., Kulkarni, C., & Fink, O. (2022). A prescriptive Dirichlet power allocation policy with deep reinforcement learning. _Engineering Applications of AI, 112_, 104882.

### Reward Shaping

- **[ng1999policy]** Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. _ICML_.
- **[marom2018belief]** Marom, O., & Rosman, B. (2018). Belief reward shaping in reinforcement learning. _AAAI Conference_.
- **[huang2024self]** Huang, Y., Zhou, C., Zhang, L., & Lu, X. (2024). A self-rewarding mechanism in deep reinforcement learning for trading strategy optimization. _Mathematics, 12_(24), 4020.

### ResNets and Architecture Foundations

- **[he2015resnet]** He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. _CVPR_.
- **[vaswani2017attention]** Vaswani, A., et al. (2017). Attention is all you need. _NeurIPS_.

### Additional Papers in `related_works/`

See `tcn_documentation/related_works/` for 28 additional papers covering:
- Potential-based reward shaping
- Curriculum learning
- Activation functions (AB-Swish, RSigELU)
- Graph convolutional networks for RL
- Market sentiment integration
- Ensemble trading strategies

```python
# Verification: Check current TCN configuration
from src.config import PHASE1_CONFIG

ap = PHASE1_CONFIG['agent_params']
k = ap['tcn_kernel_size']
d = ap['tcn_dilations']
rf = 1 + 2*(k-1)*sum(d)

print('=== Current TCN Configuration ===')
print(f"Architecture: {ap['actor_critic_type']}")
print(f"TCN filters: {ap['tcn_filters']}")
print(f"Kernel size: {k}")
print(f"Dilations: {d}")
print(f"Sequence length: {ap['sequence_length']}")
print(f"Theoretical receptive field: {rf}")
print(f"\nRF > sequence_length: {rf > ap['sequence_length']}")
print(f"Attention enabled: {ap.get('use_attention', False)}")
print(f"Fusion enabled: {ap.get('use_fusion', False)}")
```