# TAPE-TCN State-of-the-Art Research Plan (Project Guide)

**Project**: Adaptive Portfolio RL (TCN-only track)  
**Scope**: End-to-end guide for ongoing research, engineering, and publication workflow  
**Status**: Living plan aligned to current implementation in `adaptive_portfolio_rl/src`


## 1) Purpose of This Notebook

This notebook is the master research guide for the TCN program. It is designed to:

- keep experiments scientifically consistent,
- connect implementation choices to theory,
- enforce reproducible logging and evaluation,
- provide a publication-ready structure for methods and results.

It is intentionally practical: each research pillar maps to code, metrics, checkpoints, and decision criteria.


## 2) Current Project Baseline (Code-Aligned)

From the current implementation:

- **Assets**: 10 equities + cash (`MSFT, GOOGL, JPM, JNJ, XOM, PG, NEE, LIN, CAT, UNH`).
- **Train/Test split**: train ends `2019-12-31`, OOS starts `2020-01-01`.
- **Policy**: Dirichlet actor-critic with TCN variants (`TCN`, `TCN_ATTENTION`, `TCN_FUSION`).
- **Reward**: TAPE reward with three shaping components (DSR/PBRS, turnover proximity, drawdown dual controller) plus terminal TAPE scaling.
- **Current TCN defaults**: `tcn_filters=[32,64,64]`, `kernel=5`, `dilations=[2,4,8]`, `sequence_length=60`.
- **Current PPO defaults**: `max_total_timesteps=100000`, `timesteps_per_ppo_update=250`, `ppo_epochs=5`, `batch_size=256`, `lr=7e-4`, `entropy_coef=0.01`.


## 3) The Seven Core Pillars

1. **Risk-aware Reward Engineering (TAPE Core)**  
2. **Actuarial Drawdown Intelligence**  
3. **Dirichlet Simplex Policy Design**  
4. **Temporal Modeling with TCN Family**  
5. **Multi-source Feature System (TI + Macro + Fundamentals + Cross-sectional)**  
6. **Execution Realism and Turnover Governance**  
7. **Robust Evaluation, Reproducibility, and Deployment Readiness**

These seven pillars are the backbone of the research narrative and should be preserved in all write-ups.


## 4) Mathematical System Summary

### 4.1 Step Reward (Conceptual)

\[
r_t = r_t^{\text{base}} + r_t^{\text{DSR/PBRS}} + r_t^{\text{turnover}} - r_t^{\text{drawdown-penalty}}
\]

with terminal scaling term:

\[
r_t \leftarrow r_t \cdot \left(1 + \kappa_{\text{terminal}}\,\text{TAPE}_{\text{episode}}\right)
\]
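
A minimal sketch of how the step reward and terminal scaling could be assembled, assuming the four components are already available as floats. The function and argument names are hypothetical and are not taken from `src/reward_utils.py`; the numbers are illustrative only.

```python
def compose_step_reward(base, dsr_pbrs, turnover_bonus, drawdown_penalty):
    """Sum the base, DSR/PBRS, and turnover terms, then subtract the drawdown penalty."""
    return base + dsr_pbrs + turnover_bonus - drawdown_penalty


def apply_terminal_scaling(reward, tape_episode, kappa_terminal):
    """Scale a reward by (1 + kappa_terminal * TAPE_episode)."""
    return reward * (1.0 + kappa_terminal * tape_episode)


# Illustrative values only (not project results).
r = compose_step_reward(base=0.002, dsr_pbrs=0.001,
                        turnover_bonus=0.0005, drawdown_penalty=0.0015)
r_terminal = apply_terminal_scaling(r, tape_episode=0.6, kappa_terminal=0.5)
```

Keeping composition and scaling in separate functions makes it easier to log each term and spot reward-term dominance.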

### 4.2 Turnover Proximity Target

\[
\Delta_t = |\tau_t - \tau^*|,\quad
r_t^{\text{turnover}} = \alpha_{\tau}\,\max\left(0, 1 - \frac{\Delta_t}{b_{\tau}}\right)
\]

where \(\tau^*\) is the target turnover and \(b_{\tau}\) is the tolerance band.
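
The turnover term is a triangular bonus that peaks at the target and vanishes outside the band. A small sketch, with a hypothetical function name and illustrative parameter values:

```python
def turnover_proximity_reward(turnover, target, band, alpha_tau):
    """Return alpha_tau * max(0, 1 - |turnover - target| / band)."""
    if band <= 0:
        raise ValueError("band must be positive")
    delta = abs(turnover - target)
    return alpha_tau * max(0.0, 1.0 - delta / band)


# Illustrative values only: full bonus at the target, none outside the band.
at_target = turnover_proximity_reward(0.10, target=0.10, band=0.05, alpha_tau=1.0)
outside_band = turnover_proximity_reward(0.20, target=0.10, band=0.05, alpha_tau=1.0)
```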

### 4.3 Drawdown Dual Penalty

\[
\text{DD}_t = 1 - \frac{V_t}{\max_{s\le t}V_s},\quad
r_t^{\text{drawdown-penalty}} = \lambda_t\,\max(0, \text{DD}_t - d_{\text{trigger}})
\]

\[
\lambda_{t+1} = \Pi_{[\lambda_{\min},\lambda_{\max}]}\left(\lambda_t + \eta_\lambda\,g_t\right)
\]
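
A hedged sketch of the drawdown penalty and the projected dual update. The definition of \(g_t\) is not given in this section, so the loop below assumes it is the constraint violation \(\text{DD}_t - d_{\text{trigger}}\); all names and numeric values are illustrative, not the project's settings.

```python
def drawdown(value, running_peak):
    """DD_t = 1 - V_t / max_{s<=t} V_s."""
    return 1.0 - value / running_peak


def drawdown_penalty(lam, dd, trigger):
    """lambda_t * max(0, DD_t - d_trigger)."""
    return lam * max(0.0, dd - trigger)


def update_lambda(lam, g, eta, lam_min, lam_max):
    """Projected step: clip(lambda + eta * g) onto [lam_min, lam_max]."""
    return min(lam_max, max(lam_min, lam + eta * g))


values = [100.0, 105.0, 94.5, 98.0]  # illustrative portfolio values
trigger = 0.05
peak, lam = values[0], 1.0
penalties = []
for v in values:
    peak = max(peak, v)
    dd = drawdown(v, peak)
    penalties.append(drawdown_penalty(lam, dd, trigger))
    # ASSUMPTION: g_t is the constraint violation DD_t - d_trigger.
    lam = update_lambda(lam, g=dd - trigger, eta=2.0, lam_min=0.0, lam_max=5.0)
```

Under this assumption the multiplier decays while drawdown stays below the trigger and grows once the trigger is breached.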

### 4.4 Dirichlet Policy

\[
\mathbf{w}_t \sim \text{Dirichlet}(\boldsymbol{\alpha}_t),\quad
\alpha_i > 0
\]

The current alpha map is activation-controlled (e.g., `elu` or `softplus`), with a dynamic epsilon floor:

\[
\epsilon_t = \max(\epsilon_{\min},\epsilon_{\max}(1-\text{progress}_t))
\]
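
The sketch below illustrates the annealed floor and a Dirichlet draw using only the standard library. It assumes the floor is added after a `softplus` activation; how `src/agents/actor_critic_tf.py` actually combines the activation and the floor is not shown here, and the logits are made up.

```python
import math
import random


def epsilon_floor(progress, eps_min, eps_max):
    """epsilon_t = max(eps_min, eps_max * (1 - progress)), progress in [0, 1]."""
    return max(eps_min, eps_max * (1.0 - progress))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_dirichlet(alphas, rng):
    """Draw one simplex point by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
logits = [0.3, -1.2, 2.0, 0.0]  # hypothetical actor outputs
eps = epsilon_floor(progress=0.25, eps_min=0.1, eps_max=1.0)
# ASSUMPTION: the floor is added after the activation so every alpha_i > 0.
alphas = [softplus(z) + eps for z in logits]
weights = sample_dirichlet(alphas, rng)
```

A larger floor early in training keeps the concentration parameters away from zero, which spreads sampled weights more evenly; the floor then shrinks as `progress` approaches 1.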


## 5) Pillar 1: Risk-aware Reward Engineering (TAPE Core)

**Research Question**: Can a multi-objective reward improve risk-adjusted return stability over long OOS windows?  
**Hypothesis**: TAPE will dominate a raw-return reward on the Sharpe/Sortino/MDD trade-off.

**Implementation anchors**:
- `src/environment_tape_rl.py`
- `src/reward_utils.py`
- `src/profile_manager.py`

**Primary KPIs**:
- OOS Sharpe, OOS Sortino, OOS MDD
- Turnover and win-rate stability
- TAPE score consistency across checkpoints

**Failure signs**:
- TAPE high but Sharpe low in OOS,
- reward-term dominance imbalance,
- unstable dual variable dynamics.
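
The return-based KPIs above can be computed from a series of periodic returns. This is a sketch under stated assumptions: a zero risk-free rate, a zero target return for Sortino, and 252 periods per year. The project's own metric code may use different conventions.

```python
import math
import statistics


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized mean return over sample standard deviation (risk-free rate 0)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return float("nan")
    return math.sqrt(periods_per_year) * statistics.mean(returns) / sd


def sortino_ratio(returns, periods_per_year=252):
    """Annualized mean return over downside deviation (target return 0)."""
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / len(returns))
    if downside == 0:
        return float("nan")
    return math.sqrt(periods_per_year) * statistics.mean(returns) / downside


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        mdd = max(mdd, 1.0 - equity / peak)
    return mdd
```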


## 6) Pillar 2: Actuarial Drawdown Intelligence

**Research Question**: Do actuarial drawdown-state features improve crisis robustness?  
**Hypothesis**: Actuarial features reduce tail-risk and improve recovery behavior under stress regimes.

**Implementation anchors**:
- `src/actuarial.py`
- `src/data_utils.py::add_actuarial_features`
- `src/config.py::ACTUARIAL_PARAMS`

**Actuarial feature family**:
- `Actuarial_Expected_Recovery`
- `Actuarial_Prob_30d`
- `Actuarial_Prob_60d`
- `Actuarial_Reserve_Severity`

**KPIs**:
- Drawdown depth/frequency,
- recovery time from local troughs,
- MDD stability across random start dates.
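
Drawdown depth, frequency, and recovery time can be read off a list of drawdown episodes. The helper below is a hypothetical sketch, not the logic in `src/actuarial.py`: an episode starts at a running peak and ends when the value first regains that peak.

```python
def drawdown_episodes(values):
    """Return (peak_idx, trough_idx, recovery_idx, depth) tuples.

    recovery_idx is None when the series ends before the prior peak is regained.
    """
    episodes = []
    peak_idx, trough_idx, in_dd = 0, 0, False
    for i, v in enumerate(values):
        if v >= values[peak_idx]:
            if in_dd:
                depth = 1.0 - values[trough_idx] / values[peak_idx]
                episodes.append((peak_idx, trough_idx, i, depth))
                in_dd = False
            peak_idx = i
        else:
            if not in_dd or v < values[trough_idx]:
                trough_idx = i
            in_dd = True
    if in_dd:
        depth = 1.0 - values[trough_idx] / values[peak_idx]
        episodes.append((peak_idx, trough_idx, None, depth))
    return episodes


values = [100.0, 110.0, 99.0, 104.5, 110.0, 120.0, 114.0]  # illustrative only
episodes = drawdown_episodes(values)
# Recovery time measured from the trough, for episodes that did recover.
recovery_times = [rec - trough for _, trough, rec, _ in episodes if rec is not None]
```

Frequency is then `len(episodes)` per evaluation window, and depth statistics come from the last tuple element.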


## 7) Pillar 3: Dirichlet Simplex Policy Design

**Research Question**: Does native-simplex action modeling improve allocation validity and optimization stability?  
**Hypothesis**: Dirichlet actor reduces projection artifacts and yields cleaner weight dynamics.

**Implementation anchors**:
- `src/agents/actor_critic_tf.py`
- `src/agents/ppo_agent_tf.py`

**Decision knobs**:
- `dirichlet_alpha_activation` (`elu` / `softplus` / alternatives)
- `dirichlet_epsilon` annealing (`max`, `min`)
- deterministic evaluation modes (`mode`, `mean`)

**KPIs**:
- action diversity (`action_uniques`),
- concentration behavior (`argmax_alpha_uniques`),
- turnover-quality relationship.


## 8) Pillar 4: Temporal Modeling with TCN Family

**Research Question**: Which TCN variant best handles multi-regime portfolio control?  
**Hypothesis**: Fusion and attention variants can improve regime sensitivity if regularized correctly.

**Variants in code**:
- `TCN`
- `TCN_ATTENTION`
- `TCN_FUSION`

**Current receptive field guide** (causal stack):

\[
R = 1 + \sum_{b=1}^{B} 2(k-1)d_b
\]

For \(k=5\), \(d=[2,4,8]\), \(B=3\):

\[
R = 1 + 2\cdot4\cdot(2+4+8) = 113
\]

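This exceeds the sequence length of 60, so the stack already covers the entire input window and extra depth or dilation cannot add usable context; context saturation should be monitored.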
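
The receptive field formula is easy to check in code when trying alternative kernels or dilations. The sketch assumes the factor of 2 in the formula corresponds to two causal convolutions per block; the function name is hypothetical.

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_block=2):
    """R = 1 + sum_b convs_per_block * (kernel_size - 1) * d_b."""
    return 1 + sum(convs_per_block * (kernel_size - 1) * d for d in dilations)


sequence_length = 60
receptive_field = tcn_receptive_field(kernel_size=5, dilations=[2, 4, 8])
covers_full_window = receptive_field >= sequence_length
print(receptive_field, covers_full_window)  # 113 True
```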


## 9) Pillar 5: Multi-source Feature System

**Research Question**: Can richer, economically grounded features improve cross-asset differentiation?  
**Hypothesis**: Cross-sectional and fundamental features improve signal quality relative to pure technical inputs.

**Feature domains**:
- Technical indicators,
- Macro/FRED features,
- Fundamentals (quarterly aligned),
- Regime features,
- Cross-sectional rankings/z-scores,
- Actuarial features.

**Implementation anchors**:
- `src/data_utils.py`
- `src/build_fundamentals_from_csv.py`
- `src/generate_fundamental_features.py`

**Data integrity controls**:
- warm-up and NaN handling before split,
- train-only scaler fitting,
- disabled-feature hard drop (no zombie columns).
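
Train-only scaler fitting means the scaling statistics are computed from rows dated on or before the train end date and then reused unchanged on OOS rows. A standard-library sketch with made-up rows; the helper names are hypothetical and the project pipeline in `src/data_utils.py` may use a different scaler.

```python
import statistics

TRAIN_END = "2019-12-31"  # train/test boundary from the baseline split


def fit_standardizer(train_values):
    """Return (mean, std) from training data only; std falls back to 1.0 if zero."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)


def transform(values, mean, std):
    """Apply previously fitted statistics without refitting."""
    return [(v - mean) / std for v in values]


# Illustrative (date, feature value) rows; ISO dates compare correctly as strings.
rows = [("2019-12-27", 1.0), ("2019-12-30", 2.0),
        ("2019-12-31", 3.0), ("2020-01-02", 10.0)]
train = [v for d, v in rows if d <= TRAIN_END]
test = [v for d, v in rows if d > TRAIN_END]

mean, std = fit_standardizer(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # uses train statistics only
```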


## 10) Pillar 6: Execution Realism and Turnover Governance

**Research Question**: Can the policy stay tradable under realistic frictions while preserving alpha?  
**Hypothesis**: Turnover-aware reward with position and cash bounds improves deployability.

**Execution controls**:
- transaction cost model,
- max single position,
- minimum cash position,
- turnover target + band + curriculum.

**KPIs**:
- daily turnover distribution,
- turnover-adjusted Sharpe,
- concentration drift,
- net-vs-gross performance gap.
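
A small sketch of the turnover, cost, and bound checks behind these KPIs. It assumes turnover is the sum of absolute weight changes, costs are proportional to turnover, and cash is the last weight; the actual cost model and weight ordering in the environment may differ.

```python
def portfolio_turnover(prev_weights, new_weights):
    """Sum of absolute weight changes across all assets, including cash."""
    return sum(abs(n - p) for p, n in zip(prev_weights, new_weights))


def net_return(gross_return, turnover_value, cost_rate):
    """Gross return minus proportional transaction costs."""
    return gross_return - cost_rate * turnover_value


def within_bounds(weights, max_position, min_cash):
    """ASSUMPTION: the last weight is cash; all others are risky assets."""
    *risky, cash = weights
    return max(risky) <= max_position and cash >= min_cash


# Illustrative two-asset-plus-cash example.
prev = [0.30, 0.30, 0.40]
new = [0.35, 0.20, 0.45]
tau = portfolio_turnover(prev, new)
net = net_return(gross_return=0.004, turnover_value=tau, cost_rate=0.001)
gap = 0.004 - net  # net-vs-gross performance gap for this step
```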


## 11) Pillar 7: Robust Evaluation and Reproducibility

**Research Question**: Is the strategy robust across decision mode, start date, and checkpoint selection?  
**Hypothesis**: Robust models retain rank across `det_mode`, `det_mean`, and stochastic tracks.

**Required protocol**:
- deterministic track 1: `det_mode`
- deterministic track 2: `det_mean`
- stochastic multi-run track: sampled actions

**Mandatory logs**:
- metadata JSON per run,
- per-episode CSV,
- summary CSV,
- weight/action/alpha artifacts per evaluation track.

**Selection rule (recommended)**:
- choose by OOS metrics first,
- use TAPE threshold as secondary filter,
- avoid in-sample-only checkpoint selection.
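
One way to encode the selection rule is to rank checkpoints by an OOS metric and return the best one that also clears the TAPE threshold. The dictionary keys, checkpoint names, and numbers below are hypothetical placeholders, not logged results.

```python
def select_checkpoint(checkpoints, tape_threshold):
    """Rank by OOS Sharpe, then return the first checkpoint passing the TAPE filter."""
    ranked = sorted(checkpoints, key=lambda c: c["oos_sharpe"], reverse=True)
    for ckpt in ranked:
        if ckpt["tape"] >= tape_threshold:
            return ckpt
    return None  # nothing passes the filter


# Placeholder records for illustration only.
candidates = [
    {"name": "ckpt_a", "oos_sharpe": 1.4, "tape": 0.40},
    {"name": "ckpt_b", "oos_sharpe": 1.1, "tape": 0.62},
    {"name": "ckpt_c", "oos_sharpe": 0.9, "tape": 0.70},
]
chosen = select_checkpoint(candidates, tape_threshold=0.5)
```

Here `ckpt_a` has the best OOS Sharpe but fails the TAPE filter, so the rule falls through to `ckpt_b`.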


## 12) Workplan (Execution Roadmap)

### Stage A — System Lock (Now)
- Freeze baseline config and code hashes.
- Validate data pipeline integrity and feature inventory.
- Validate train/test date boundaries and scaler behavior.
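
Freezing the baseline config can be done by storing a fingerprint alongside each run. The sketch below hashes a canonical JSON dump; the dictionary is a hypothetical subset of the defaults listed in this guide, and it assumes the config is JSON-serializable.

```python
import hashlib
import json


def config_fingerprint(config):
    """SHA-256 of a canonical JSON dump, stable under key reordering."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical subset of the baseline settings.
baseline = {"tcn_filters": [32, 64, 64], "kernel": 5,
            "dilations": [2, 4, 8], "sequence_length": 60}
reordered = {"sequence_length": 60, "dilations": [2, 4, 8],
             "kernel": 5, "tcn_filters": [32, 64, 64]}
changed = dict(baseline, kernel=3)

fingerprint = config_fingerprint(baseline)
```

Comparing the stored fingerprint against a freshly computed one before each campaign flags silent config drift.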

### Stage B — Variant Training Campaign
- Run `TCN`, `TCN_ATTENTION`, `TCN_FUSION` under aligned budgets.
- Keep reward and data controls fixed.
- Capture unified logs and metadata.

### Stage C — Robustness Campaign
- Deterministic dual-mode + stochastic runs.
- Random-start evaluations over full OOS.
- Check horizon sensitivity and drawdown dynamics.

### Stage D — Reporting Pack
- Build result tables with placeholders replaced only by verified runs.
- Produce portfolio behavior diagnostics (weights, alphas, turnover, drawdown paths).
- Finalize manuscript-ready charts and methods narrative.


## 13) Decision Gates (Go / No-Go)

A run is **Go** only if all hold:

- OOS Sharpe meets target band,
- OOS MDD stays under risk budget,
- turnover within execution budget,
- no leakage flags,
- stable behavior across det/stochastic tracks.

If any gate fails, freeze deployment claims and iterate model controls before further conclusions.
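
The gates can be expressed as a single check that also reports which gate failed. All keys and thresholds below are hypothetical placeholders; in particular, measuring cross-track stability as a Sharpe spread is an assumption, not a rule stated in this guide.

```python
def evaluate_gates(metrics, limits):
    """Return (go, per_gate_results); go is True only if every gate passes."""
    gates = {
        "sharpe": metrics["oos_sharpe"] >= limits["min_sharpe"],
        "mdd": metrics["oos_mdd"] <= limits["max_mdd"],
        "turnover": metrics["turnover"] <= limits["max_turnover"],
        "leakage": not metrics["leakage_flags"],
        # ASSUMPTION: stability is the Sharpe spread across det/stochastic tracks.
        "track_stability": metrics["track_sharpe_spread"] <= limits["max_track_spread"],
    }
    return all(gates.values()), gates


# Placeholder numbers for illustration only.
limits = {"min_sharpe": 1.0, "max_mdd": 0.20,
          "max_turnover": 0.10, "max_track_spread": 0.3}
run = {"oos_sharpe": 1.2, "oos_mdd": 0.25, "turnover": 0.08,
       "leakage_flags": [], "track_sharpe_spread": 0.1}
go, detail = evaluate_gates(run, limits)
failed = [name for name, ok in detail.items() if not ok]
```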


## 14) References Map from `related_works`

This plan should cite and align with the project’s local corpus in:
`notebooks/documentation/related_works/`

### Reward shaping and safe RL
- *Policy Invariance under reward transformations - Theory and application to reward shaping.pdf*
- *Potential-Based Reward Shaping in Reinforcement Learning.pdf*
- *Improving the Effectiveness of Potential-Based Reward Shaping in Reinforcement Learning.pdf*
- *Belief Reward Shaping in Reinforcement Learning.pdf*
- *Benchmarking Potential Based Rewards for Learning Humanoid Locomotion.pdf*
- *NeurIPS-2020-learning-to-utilize-shaping-rewards-a-new-approach-of-reward-shaping-Paper.pdf*
- *NeurIPS-2022-exploration-guided-reward-shaping-for-reinforcement-learning-under-sparse-rewards-Paper-Conference.pdf*

### Portfolio RL and risk-aware optimization
- *Deep Reinforcement Learning for Optimal Portfolio Allocation-A Comparative Study with Mean-Variance Optimization.pdf*
- *Risk-Adjusted Deep Reinforcement Learning for Portfolio Optimization A Multi-reward Approach.pdf*
- *Risk-Sensitive Deep Reinforcement Learning for Portfolio Optimization.pdf*
- *Deep Reinforcement Learning for Automated Stock Trading-An Ensemble Strategy.pdf*
- *A Self-Rewarding Mechanism in Deep Reinforcement Learning for Trading Strategy Optimization.pdf*
- *portfolio-optimization-through-a-multimodal-deep-reinforcement-learning-framework.pdf*

### Dirichlet policy and simplex allocation
- *Dirichlet policies for reinforced factor portfolios.pdf*
- *A prescriptive Dirichlet power allocation policy with deep reinforcement Learning.pdf*
- *A Selective Portfolio Management Algorithm with Off-Policy Reinforcement Learning Using Dirichlet Distribution.pdf*

### TCN / attention / fusion architecture references
- *TTSNet- Transformer–Temporal Convolutional Network–Self-Attention with Feature Fusion for Prediction.pdf*
- *Graph convolutional recurrent networks for reward shaping in RL.pdf*

### Activation design references
- *An overview of the activation functions used in deep learning algorithms.pdf*
- *Parametric RSigELU--a new trainable activation function for deep learning.pdf*
- *AB-Swish-- A Novel Activation Function for Neural Networks in Hybrid Portfolio Optimization Frameworks.pdf*

### Curriculum and training strategy
- *curriculum learning.pdf*


## 15) Layman's Summary (Simple Version)

- You are teaching a portfolio agent not just to make money, but to survive bad markets and trade realistically.
- The project stands on 7 pillars: smart reward, actuarial safety, valid allocation math, strong TCN memory, rich features, real-world trading constraints, and strict evaluation.
- Every experiment should be judged by **risk-adjusted performance and reliability**, not return alone.
- If a model looks good only in one mode or one window, it is not ready.
- This notebook is your operating manual: update it each time you change core settings.


```python
# Quick project-alignment check (run optionally)
from src.config import PHASE1_CONFIG

print('Architecture:', PHASE1_CONFIG['agent_params']['actor_critic_type'])
print('TCN filters:', PHASE1_CONFIG['agent_params']['tcn_filters'])
print('Kernel:', PHASE1_CONFIG['agent_params']['tcn_kernel_size'])
print('Dilations:', PHASE1_CONFIG['agent_params']['tcn_dilations'])
print('Dirichlet activation:', PHASE1_CONFIG['agent_params']['dirichlet_alpha_activation'])
print('Train timesteps:', PHASE1_CONFIG['training_params']['max_total_timesteps'])
print('PPO step/update:', PHASE1_CONFIG['training_params']['timesteps_per_ppo_update'])
print('Tickers:', PHASE1_CONFIG['ASSET_TICKERS'])
```
