# TAPE Reward System - Comprehensive Technical Reference

**Version**: 2.0 (Expanded February 2026)

**Scope**: Complete technical documentation of the TAPE (Terminal Aggregate Performance Enhancement) reward system used in the project, including mathematical formulations and code mapping.


## Table of Contents

1. [Introduction and Philosophy](#section1)
2. [The Three-Component Step Reward](#section2)
3. [Terminal TAPE Utility](#section3)
4. [Active vs. Reserved Components](#section4)
5. [Code Implementation Map](#section5)
6. [Configuration Guide](#section6)
7. [Mathematical Notation](#section7)


## 1. Introduction and Philosophy <a id='section1'></a>

The **Terminal Aggregate Performance Enhancement (TAPE)** system addresses a fundamental challenge in portfolio reinforcement learning: the misalignment between *step-wise* rewards (daily returns) and *episode-wise* objectives (Sharpe ratio, drawdown, turnover).

### Core Philosophy

1.  **Dense Step Signals**: Agents need frequent feedback to learn. We provide this via a multi-component daily reward signal (Net Return + DSR + Turnover).
2.  **Holistic Terminal Assessment**: Financial performance is not just accumulated return. It includes risk-adjusted metrics (Sharpe/Sortino) and behavioral constraints (Turnover). The **Terminal TAPE Bonus** acts as a final "report card" that provides a strong gradient toward these holistic goals.
3.  **Behavioral Shaping**: Instead of hard constraints (which break gradients), we use *proximity rewards* (e.g., for turnover) to guide the agent into desirable behavioral bands.


## 2. The Three-Component Step Reward <a id='section2'></a>

The step-level reward $R_t$ is composed of three active components:

$$
R_t = R_t^{\text{base}} + R_t^{\text{DSR}} + R_t^{\text{turnover}}
$$

### 2.1 Component 1: Base Reward (Net Return)

This component incentivizes raw capital growth, net of transaction costs.

$$
R_t^{\text{base}} = \text{Clip}(\text{NetReturn}_t \times 100, -10, 10)
$$

Where:
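- $\text{NetReturn}_t$: Portfolio return at step $t$ after deducting transaction costs, expressed as a fraction.
- $100$: Scaling factor that converts the return to percent, so typical daily values fall in roughly $[-2, 2]$.
- $\text{Clip}(\cdot, -10, 10)$: Bounds the scaled value, so it only binds on extreme days (moves beyond $\pm 10\%$).
- `enable_base_reward`: Config flag to toggle this component.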
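
As a minimal sketch of the formula above, the scaling and clipping can be written as follows. The function name `base_reward_sketch` and its parameter names are illustrative assumptions, not taken from the project code.

```python
def base_reward_sketch(net_return: float, scale: float = 100.0, clip: float = 10.0) -> float:
    """Scale a fractional net return to percent and clip it to [-clip, clip]."""
    return max(-clip, min(clip, net_return * scale))


# A +1.2% day stays well inside the clip; a -25% day is clipped at the lower bound.
print(base_reward_sketch(0.012))
print(base_reward_sketch(-0.25))
```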

### 2.2 Component 2: Differential Sharpe Ratio (DSR)

Based on Moody & Saffell (2001), the DSR gives a per-step signal that pushes the policy toward a higher Sharpe ratio. We apply it as a **Potential-Based Reward Shaping (PBRS)** term, whose generic form is:

$$
R_t^{\text{shaped}} = R_t + \gamma \Phi(S_{t+1}) - \Phi(S_t)
$$

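In our implementation, the potential $\Phi$ is the rolling Sharpe ratio $S^{\text{rolling}}$ (not to be confused with the state $S_t$ in the generic form above), with indices shifted back by one step so that only quantities known at step $t$ are used: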
$$
R_t^{\text{DSR}} = \lambda_{\text{DSR}} \cdot (\gamma \cdot S_{t}^{\text{rolling}} - S_{t-1}^{\text{rolling}})
$$

Where:
- $S_t^{\text{rolling}}$: Sharpe ratio calculated over a rolling window (`dsr_window=60`).
- $\lambda_{\text{DSR}}$: Scalar (`dsr_scalar=5.0`) to balance this signal with returns.
- $\gamma$: Discount factor (`gamma=0.99`).

**Goal**: Encourage consistency and low volatility, not just raw return.
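
The shaping term can be sketched as follows. This is an illustration under stated assumptions, not the project's `calculate_sharpe_ratio_dsr()`: the class name `RollingSharpeShaper` is hypothetical, the Sharpe ratio is taken as the unannualized mean divided by the sample standard deviation of the window, the potential starts at zero, and a small `eps` guards against division by zero.

```python
import math
from collections import deque


class RollingSharpeShaper:
    """Hypothetical helper: lambda_DSR * (gamma * S_t - S_{t-1}) on a rolling window."""

    def __init__(self, window: int = 60, scalar: float = 5.0,
                 gamma: float = 0.99, eps: float = 1e-8):
        self.returns = deque(maxlen=window)  # oldest return drops out automatically
        self.scalar = scalar
        self.gamma = gamma
        self.eps = eps
        self.prev_sharpe = 0.0  # assumption: potential is zero before any data

    def _rolling_sharpe(self) -> float:
        n = len(self.returns)
        if n < 2:
            return 0.0  # sample standard deviation is undefined for one point
        mean = sum(self.returns) / n
        var = sum((r - mean) ** 2 for r in self.returns) / (n - 1)
        return mean / (math.sqrt(var) + self.eps)

    def update(self, net_return: float) -> float:
        """Append the latest return and emit the shaping reward for this step."""
        self.returns.append(net_return)
        sharpe = self._rolling_sharpe()
        reward = self.scalar * (self.gamma * sharpe - self.prev_sharpe)
        self.prev_sharpe = sharpe
        return reward


shaper = RollingSharpeShaper()
first = shaper.update(0.01)
second = shaper.update(0.02)
print(first, second)
```

Under these assumptions the first call returns zero, since one observation gives no Sharpe estimate; the second call is positive because the rolling Sharpe rises from zero.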

### 2.3 Component 3: Turnover Proximity Reward

To encourage a specific trading frequency (e.g., "active but not churning"), we reward the agent for keeping turnover $\tau_t$ close to a target $\tau^*$.

$$
\text{Deviation}_t = |\tau_t - \tau^*|
$$

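We define the half-width of the tolerance band as $\delta = \tau^* \cdot \text{band\_fraction}$. For example, a target of $\tau^* = 0.76$ with a 20% band gives $\delta = 0.152$.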

- **Inside Band** ($|\tau_t - \tau^*| \le \delta$): Positive reward scaling with proximity.
  $$R_t^{\text{turnover}} = \lambda_{\text{TO}} \cdot (1 - \frac{\text{Deviation}_t}{\delta})$$
- **Outside Band**: Negative penalty scaling with excess deviation.
  $$R_t^{\text{turnover}} = -\lambda_{\text{TO}} \cdot (\frac{\text{Deviation}_t}{\delta} - 1)$$

Where:
- $\lambda_{\text{TO}}$: `turnover_penalty_scalar` (default 5.0).
- $\tau^*$: `target_turnover` (0.76 in the configuration shown in Section 6).

**Goal**: Guide the agent to a specific location on the active-passive spectrum.
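
The two-branch rule above can be sketched as below. The function name `turnover_reward_sketch` is hypothetical, and the default arguments simply mirror the values quoted in this document; the guard against a non-positive band is an added assumption.

```python
def turnover_reward_sketch(turnover: float, target: float = 0.76,
                           band_fraction: float = 0.20, scalar: float = 5.0) -> float:
    """Proximity reward: positive inside the tolerance band, negative outside it."""
    delta = target * band_fraction  # half-width of the tolerance band
    if delta <= 0:
        raise ValueError("target and band_fraction must give a positive band width")
    deviation = abs(turnover - target)
    if deviation <= delta:
        # Inside band: scalar at the target, falling linearly to 0 at the band edge.
        return scalar * (1.0 - deviation / delta)
    # Outside band: penalty grows linearly with the excess deviation.
    return -scalar * (deviation / delta - 1.0)


for tau in (0.76, 0.85, 1.10):
    print(tau, turnover_reward_sketch(tau))
```

As written, the two branches are algebraically the same linear function, $\lambda_{\text{TO}} \cdot (1 - \text{Deviation}_t / \delta)$, so the reward is continuous at the band edge and changes sign there.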

## 3. Terminal TAPE Utility <a id='section3'></a>

At the end of an episode, we calculate a holistic **TAPE Score** in the range $[0, 1]$ based on five metrics:

1.  **Sharpe Ratio**
2.  **Sortino Ratio**
3.  **Max Drawdown**
4.  **Turnover**
5.  **Skewness**

Each metric $x_i$ is mapped to a utility $U_i(x_i)$ via a **Skewed Utility Function**:

$$
U(x; \mu, \sigma^-, \sigma^+) = \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$

Where $\sigma = \sigma^-$ if $x < \mu$ and $\sigma = \sigma^+$ if $x \ge \mu$ (at $x = \mu$ the utility is $1$ regardless). Choosing $\sigma^- < \sigma^+$ makes the function punitive on the downside (e.g., low Sharpe) but permissive on the upside.
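
A minimal sketch of the asymmetric Gaussian, together with one possible aggregation, is shown below. The function names, the example parameters, and the weighted-mean aggregation are assumptions for illustration; the project's `skewed_utility_function()` and `calculate_tape_score()` may differ in detail.

```python
import math


def skewed_utility_sketch(x: float, mu: float, sigma_minus: float, sigma_plus: float) -> float:
    """Gaussian-shaped utility with a different width on each side of the target mu."""
    sigma = sigma_minus if x < mu else sigma_plus
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def tape_score_sketch(utilities: dict, weights: dict) -> float:
    """Assumed aggregation: weighted mean of per-metric utilities, which stays in [0, 1]."""
    total_weight = sum(weights[name] for name in utilities)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * u for name, u in utilities.items()) / total_weight


# Hypothetical Sharpe target of 1.0: narrow below the target, wide above it.
below = skewed_utility_sketch(0.5, mu=1.0, sigma_minus=0.5, sigma_plus=2.0)
above = skewed_utility_sketch(1.5, mu=1.0, sigma_minus=0.5, sigma_plus=2.0)
print(below, above)
print(tape_score_sketch({"sharpe": below, "turnover": 1.0}, {"sharpe": 3.0, "turnover": 1.0}))
```

With these example widths, a Sharpe ratio 0.5 below the target loses far more utility than one 0.5 above it.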

**Final Terminal Bonus**:
$$
R_T^{\text{bonus}} = \text{Clip}(\text{TAPE Score} \times \Lambda_{\text{term}}, -C, C)
$$

Where:
- $\Lambda_{\text{term}}$: `tape_terminal_scalar` (default 1000.0).
- $C$: `tape_terminal_clip` (default 5.0).

This bonus provides a strong end-of-episode signal that aligns the agent's long-term behavior with these multi-metric preferences.
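
The clipping step can be sketched as follows. The function name `terminal_bonus_sketch` is hypothetical, and the defaults mirror the values listed above.

```python
def terminal_bonus_sketch(tape_score: float, scalar: float = 1000.0, clip: float = 5.0) -> float:
    """Scale the TAPE score and clip the result to [-clip, clip]."""
    return max(-clip, min(clip, tape_score * scalar))


for score in (0.0, 0.002, 0.005, 0.5, 1.0):
    print(score, terminal_bonus_sketch(score))
```

With the documented defaults and a score in $[0, 1]$, the scaled value reaches the clip at a score of $0.005$, so any higher score yields the same bonus of $C = 5.0$. If finer discrimination between episodes is wanted, the ratio $C / \Lambda_{\text{term}}$ is the quantity to tune.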

## 4. Active vs. Reserved Components <a id='section4'></a>

### Current Status

| Component | Status | Config Key | Description |
|-----------|--------|------------|-------------|
| **Base Reward** | ✅ **Active** | `enable_base_reward` | Net return contribution |
| **DSR Component** | ✅ **Active** | `dsr_scalar` | Sharpe shaping |
| **Turnover** | ✅ **Active** | `target_turnover` | Proximity target |
| **Drawdown Controller** | ⏸️ **Inactive** | `drawdown_constraint` | **Reserved**. Code initializes params but does not currently sum this term in `_get_reward`. |

### Note on Drawdown Controller
The environment code (`environment_tape_rl.py`) contains logic to initialize a Dual-Variable Drawdown Controller (Lagrangian method). However, in the current `_get_reward` function, this penalty term is **not added** to the final reward sum. It is reserved for future constraint-based experiments.

## 5. Code Implementation Map <a id='section5'></a>

### `src/environment_tape_rl.py`
- `_get_reward()`: The master function summing components.
  - **Base**: `base_reward = portfolio_return * 100`
  - **DSR**: `dsr_component` using rolling buffer `self.dsr_history`
  - **Turnover**: `turnover_reward` logic with `deviation` and `band_fraction`
- `step()`: Calculates terminal metrics and applies `tape_score` bonus.

### `src/reward_utils.py`
- `calculate_sharpe_ratio_dsr()`: Optimized incremental Sharpe calculation.
- `calculate_tape_score()`: Aggregates metrics using profile weights.
- `skewed_utility_function()`: Implements the asymmetric Gaussian utility.

## 6. Configuration Guide <a id='section6'></a>

Key parameters in `src/config.py` (`environment_params`):

```python
"environment_params": {
    "reward_system": "tape",           # Enable this system
    "dsr_window": 60,                  # Rolling window for Sharpe
    "dsr_scalar": 5.0,                 # Strength of Sharpe shaping
    
    # Turnover Control
    "target_turnover": 0.76,           # Target daily turnover (76%)
    "turnover_target_band": 0.20,      # Tolerance ±20%
    "turnover_penalty_scalar": 5.0,    # Strength of proximity reward
    
    # Terminal
    "tape_terminal_scalar": 1000.0,    # Final bonus multiplier
    "tape_terminal_clip": 5.0,         # Clip bonus magnitude
    
    # Base
    "enable_base_reward": True         # Include raw returns
}
```
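
To show how these keys map onto the symbols in Section 2, here is a hedged sketch of a step-reward computation driven by such a dictionary. The function `step_reward_sketch` is hypothetical and is not the project's `_get_reward()`; it assumes that `turnover_target_band` plays the role of `band_fraction`, and it takes `gamma` as a separate argument because the excerpt above does not list it.

```python
def step_reward_sketch(net_return: float, sharpe_now: float, sharpe_prev: float,
                       turnover: float, params: dict, gamma: float = 0.99) -> float:
    """Sum the three step components using config-style keys (illustrative only)."""
    total = 0.0
    if params.get("enable_base_reward", True):
        # Base: percent-scaled net return, clipped to [-10, 10].
        total += max(-10.0, min(10.0, net_return * 100.0))
    # DSR shaping: lambda_DSR * (gamma * S_t - S_{t-1}).
    total += params["dsr_scalar"] * (gamma * sharpe_now - sharpe_prev)
    # Turnover proximity: the inside-band and outside-band formulas reduce to
    # the same linear expression, so a single line covers both cases.
    delta = params["target_turnover"] * params["turnover_target_band"]
    deviation = abs(turnover - params["target_turnover"])
    total += params["turnover_penalty_scalar"] * (1.0 - deviation / delta)
    return total


params = {
    "dsr_scalar": 5.0,
    "target_turnover": 0.76,
    "turnover_target_band": 0.20,
    "turnover_penalty_scalar": 5.0,
    "enable_base_reward": True,
}
print(step_reward_sketch(0.01, sharpe_now=1.0, sharpe_prev=1.0, turnover=0.76, params=params))
```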

## 7. Mathematical Notation <a id='section7'></a>

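- $R_t$: Total reward at step $t$
- $R_T^{\text{bonus}}$: Terminal bonus added at the final step $T$
- $\tau_t$, $\tau^*$: Portfolio turnover at step $t$ and its target (`target_turnover`)
- $\delta$: Half-width of the turnover tolerance band
- $S_t^{\text{rolling}}$: Rolling Sharpe ratio over `dsr_window` steps
- $\gamma$: Discount factor (0.99)
- $\mu, \sigma^-, \sigma^+$: Utility function target and its lower and upper widths
- $\lambda_{\text{DSR}}, \lambda_{\text{TO}}, \Lambda_{\text{term}}$: Scalar coefficients (`dsr_scalar`, `turnover_penalty_scalar`, `tape_terminal_scalar`)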