# RA-KL Research Notebook

## Regime-Aware KL Budget Control for PPO in Portfolio Optimization

This notebook is a paper-ready research workflow. Use it to:
- define the RA-KL method and claims,
- run ablations,
- compute training stability diagnostics,
- evaluate out-of-sample performance,
- generate publication tables/figures.

Use this notebook as your primary write-up workspace for the KL-management contribution.

## 1) Study Scope and Contribution Claims

### Problem
Static PPO KL thresholds are fragile in non-stationary markets. They can either:
- allow policy over-jumps (instability), or
- trigger frequent early-stops (under-updating).

### Proposed Method
**RA-KL (Regime-Aware KL Budget Controller)**: an online controller that adapts PPO aggressiveness using update-level feedback.

### Main Contribution Statement
RA-KL transforms PPO trust-region control from static hyperparameters to a closed-loop, regime-aware budgeting mechanism.

### Testable Hypotheses
1. RA-KL reduces KL overshoot rate and early-stop frequency.
2. RA-KL improves risk-adjusted performance (Sharpe/Sortino/Calmar).
3. RA-KL lowers unnecessary turnover without collapsing responsiveness.
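
Hypothesis 1 needs an operational definition of KL overshoot. A minimal sketch, assuming an overshoot is any update whose approximate KL exceeds `tolerance * target_kl`; the function name, the `tolerance` parameter, and the sample values are illustrative assumptions, not part of the logged pipeline:

```python
import math
from typing import Iterable


def kl_overshoot_rate(kl_values: Iterable[float], target_kl: float, tolerance: float = 1.0) -> float:
    """Fraction of updates whose approximate KL exceeds tolerance * target_kl."""
    if target_kl <= 0:
        raise ValueError("target_kl must be positive")
    # Drop missing entries (None or NaN) before counting.
    valid = [k for k in kl_values if k is not None and not math.isnan(k)]
    if not valid:
        return float("nan")
    threshold = tolerance * target_kl
    return sum(k > threshold for k in valid) / len(valid)


# Illustrative values only, not measured results.
example_rate = kl_overshoot_rate([0.01, 0.03, 0.05, float("nan")], target_kl=0.02)
```

The function accepts any iterable of floats, so a KL column from the step diagnostics could be passed in directly. Whatever threshold is chosen should be reported alongside the rate.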

## 2) Method Specification (Paper Draft)

### 2.1 Controller Inputs
At update $t$, define:
- $k_t$: observed approximate KL,
- $c_t$: clip fraction,
- $s_t$: early-stop indicator,
- $a_t$: alpha-dispersion proxy (Dirichlet alpha std),
- $r_t$: regime/stress proxy (e.g., volatility bucket).

Use exponential moving averages (EMA):
\[
\bar{k}_t = (1-\lambda_k)\bar{k}_{t-1} + \lambda_k k_t
\]
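(similarly for $\bar{c}_t, \bar{s}_t, \bar{a}_t$), where $\lambda_k \in (0, 1]$ is the smoothing factor: larger values weight recent updates more heavily.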
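
The EMA recursion can be sketched as a single update function. This is a minimal illustration: the helper name `ema_update` and the choice to seed the average with the first observation are assumptions, since the text does not specify an initial value.

```python
from typing import Optional


def ema_update(prev: Optional[float], value: float, lam: float) -> float:
    """One EMA step: (1 - lam) * prev + lam * value."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must be in (0, 1]")
    if prev is None:
        # Assumption: initialise the average with the first observation.
        return float(value)
    return (1.0 - lam) * prev + lam * value


# Illustrative KL observations, not measured values.
k_bar = None
for k_t in [0.010, 0.030, 0.020]:
    k_bar = ema_update(k_bar, k_t, lam=0.5)
```

The same function would serve for $\bar{c}_t$, $\bar{s}_t$, and $\bar{a}_t$, each with its own smoothing factor.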

### 2.2 KL Tracking Error
\[
e_t = \frac{\bar{k}_t}{k^{\text{base}}} - 1
\]
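where $k^{\text{base}}$ is the baseline KL target, so $e_t > 0$ means the smoothed KL is running above target and $e_t < 0$ means it is running below.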
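
A small sketch of the tracking error and its positive and negative parts ($e_t^+$, $e_t^-$, used by the adaptive controls). The `TrackingError` container and function name are hypothetical, and the numbers are illustrative:

```python
from typing import NamedTuple


class TrackingError(NamedTuple):
    e: float        # signed relative error
    e_plus: float   # max(e, 0): KL above target
    e_minus: float  # max(-e, 0): KL below target


def kl_tracking_error(k_bar: float, k_base: float) -> TrackingError:
    """Relative deviation of the smoothed KL from the baseline target."""
    if k_base <= 0:
        raise ValueError("k_base must be positive")
    e = k_bar / k_base - 1.0
    return TrackingError(e=e, e_plus=max(e, 0.0), e_minus=max(-e, 0.0))


over = kl_tracking_error(k_bar=0.03, k_base=0.02)   # smoothed KL 50% above target
under = kl_tracking_error(k_bar=0.01, k_base=0.02)  # smoothed KL 50% below target
```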

### 2.3 Adaptive Controls
\[
\eta_t = \text{clip}(\eta_{t-1} \exp(-\kappa_\eta e_t^+), \eta_{\min}, \eta_{\max})
\]
\[
\epsilon_t = \text{clip}(\epsilon_{t-1}(1-\kappa_\epsilon e_t^+), \epsilon_{\min}, \epsilon_{\max})
\]
\[
k_t^{\text{target}} = \text{clip}(k^{\text{base}} \exp(-\kappa_k e_t^+ + \kappa_{\text{relax}} e_t^-), k_{\min}, k_{\max})
\]

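where $e_t^+ = \max(e_t,0)$ and $e_t^- = \max(-e_t,0)$, $\eta_t$ is the actor learning rate, $\epsilon_t$ is the PPO clip range, and the $\kappa$ coefficients are non-negative controller gains.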
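
The three update rules can be sketched as one controller step. This is a minimal sketch under stated assumptions: the `ControlState` container, the function name, and every default gain and bound are placeholders chosen for illustration, not tuned or recommended values.

```python
import math
from dataclasses import dataclass


def clamp(x: float, lo: float, hi: float) -> float:
    """Scalar equivalent of clip(x, lo, hi)."""
    return max(lo, min(hi, x))


@dataclass
class ControlState:
    lr: float          # eta_{t-1}, actor learning rate
    clip_range: float  # epsilon_{t-1}, PPO clip range


def adapt_controls(state, e_t, k_base, *,
                   kappa_lr=0.5, kappa_clip=0.2, kappa_k=0.5, kappa_relax=0.25,
                   lr_bounds=(1e-5, 3e-4), clip_bounds=(0.05, 0.3),
                   kl_bounds=(0.005, 0.05)):
    """Apply the eta, epsilon, and target-KL updates for one PPO update."""
    e_plus = max(e_t, 0.0)
    e_minus = max(-e_t, 0.0)
    lr = clamp(state.lr * math.exp(-kappa_lr * e_plus), *lr_bounds)
    eps = clamp(state.clip_range * (1.0 - kappa_clip * e_plus), *clip_bounds)
    k_target = clamp(k_base * math.exp(-kappa_k * e_plus + kappa_relax * e_minus), *kl_bounds)
    return ControlState(lr=lr, clip_range=eps), k_target


state = ControlState(lr=3e-4, clip_range=0.2)
tightened, k_tight = adapt_controls(state, e_t=0.5, k_base=0.02)   # KL above target
relaxed, k_relax = adapt_controls(state, e_t=-0.5, k_base=0.02)    # KL below target
```

As the rules are written, $\eta_t$ and $\epsilon_t$ are recursive and only shrink when $e_t > 0$, while $k_t^{\text{target}}$ is recomputed from $k^{\text{base}}$ at every update. A recovery rule for $\eta_t$ and $\epsilon_t$ would have to be specified separately.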

### 2.4 Regime Gate
During stress regimes, tighten the KL budget via a multiplier $\rho_t \in (0,1]$:
\[
k_t^{\text{target}} \leftarrow \rho_t k_t^{\text{target}}
\]
Example: $\rho_t = 0.8$ in a high-volatility regime.
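
A hedged sketch of the gate, assuming the regime proxy is a volatility bucket. The regime labels, the volatility thresholds, and the multipliers other than 0.8 for high volatility are hypothetical placeholders:

```python
# Hypothetical regime labels and multipliers; only rho = 0.8 for the
# high-volatility regime comes from the text.
REGIME_RHO = {"low_vol": 1.0, "mid_vol": 0.9, "high_vol": 0.8}


def volatility_bucket(realized_vol: float, low: float = 0.10, high: float = 0.25) -> str:
    """Map realized volatility to a coarse regime label (thresholds are placeholders)."""
    if realized_vol < low:
        return "low_vol"
    if realized_vol < high:
        return "mid_vol"
    return "high_vol"


def apply_regime_gate(k_target: float, regime: str) -> float:
    """Scale the KL target by the regime multiplier rho in (0, 1]."""
    rho = REGIME_RHO.get(regime, 1.0)  # unknown regimes leave the budget unchanged
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    return rho * k_target


gated = apply_regime_gate(0.02, volatility_bucket(0.30))
calm = apply_regime_gate(0.02, volatility_bucket(0.05))
```

Whether the gated target should be re-clipped to $[k_{\min}, k_{\max}]$ is not specified above, so this sketch leaves it unclipped.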

### 2.5 Dirichlet-aware Dampening
If KL is high and the alpha dispersion $a_t$ changes abruptly, apply a temporary cooldown of $N$ updates to the actor learning rate and clip range.
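
One possible reading of this rule, sketched as a small stateful helper. All of it is an assumption made for illustration: the class name, measuring an "abrupt change" as the relative jump in alpha std between consecutive updates, both thresholds, and the damping factor applied during cooldown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirichletCooldown:
    kl_threshold: float            # KL level treated as "high" (placeholder)
    alpha_jump_threshold: float    # relative change in alpha std treated as "abrupt"
    n_updates: int = 5             # cooldown length N
    damp: float = 0.5              # multiplier for actor LR and clip range during cooldown
    remaining: int = 0
    prev_alpha_std: Optional[float] = None

    def step(self, kl: float, alpha_std: float) -> float:
        """Return the multiplier to apply to actor LR and clip range for this update."""
        jump = 0.0
        if self.prev_alpha_std is not None and self.prev_alpha_std > 0:
            jump = abs(alpha_std - self.prev_alpha_std) / self.prev_alpha_std
        self.prev_alpha_std = alpha_std
        # Start (or restart) the cooldown when both conditions hold.
        if kl > self.kl_threshold and jump > self.alpha_jump_threshold:
            self.remaining = self.n_updates
        if self.remaining > 0:
            self.remaining -= 1
            return self.damp
        return 1.0


cooldown = DirichletCooldown(kl_threshold=0.03, alpha_jump_threshold=0.5, n_updates=2)
# Illustrative (kl, alpha_std) pairs, not logged values.
multipliers = [cooldown.step(kl, a) for kl, a in [(0.01, 1.0), (0.05, 2.0), (0.01, 2.0), (0.01, 2.0)]]
```

Here the second update triggers the cooldown, which then dampens that update and the next one before releasing.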

### 2.6 Expected Effect
- fewer over-aggressive updates,
- fewer repetitive early-stops,
- smoother portfolio reallocation path,
- better train-to-test robustness.

In [None]:
# 3) Imports and plotting defaults
from pathlib import Path
import json
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 160)

In [None]:
# 4) Configure run paths
# Set this to your run folder that contains logs/*episodes*.csv, *step_diagnostics*.csv, *summary*.csv
RESULTS_ROOT = Path('/content/adaptive_portfolio_rl/tcn_fusion_results')
LOGS_DIR = RESULTS_ROOT / 'logs'

# Optional fixed timestamp suffix (example: '20260221_071837').
# If None, notebook auto-picks latest files by mtime.
RUN_TAG = None

print('RESULTS_ROOT:', RESULTS_ROOT)
print('LOGS_DIR:', LOGS_DIR)

In [None]:
# 5) File loader helpers

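def _pick_latest(pattern: str, logs_dir: Path, run_tag: str | None = None):
    """Return the newest file matching `pattern`, preferring names that contain `run_tag`."""
    files = sorted(logs_dir.glob(pattern), key=lambda p: p.stat().st_mtime)
    if run_tag:
        tagged = [p for p in files if run_tag in p.name]
        if tagged:
            return tagged[-1]
        # No file carries the requested tag: warn instead of silently mixing runs.
        print(f'WARNING: no {pattern!r} file contains run_tag={run_tag!r}; falling back to latest file.')
    return files[-1] if files else None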


def load_run_artifacts(logs_dir: Path, run_tag: str | None = None):
    episodes_p = _pick_latest('*episodes*.csv', logs_dir, run_tag)
    steps_p = _pick_latest('*step_diagnostics*.csv', logs_dir, run_tag)
    summary_p = _pick_latest('*summary*.csv', logs_dir, run_tag)
    meta_p = _pick_latest('*_metadata.json', logs_dir, run_tag)
    manifest_p = _pick_latest('*active_feature_manifest.json', logs_dir, run_tag)

    out = {
        'episodes_path': episodes_p,
        'steps_path': steps_p,
        'summary_path': summary_p,
        'metadata_path': meta_p,
        'manifest_path': manifest_p,
        'episodes': pd.read_csv(episodes_p) if episodes_p else None,
        'steps': pd.read_csv(steps_p) if steps_p else None,
        'summary': pd.read_csv(summary_p) if summary_p else None,
        'metadata': json.loads(meta_p.read_text(encoding='utf-8')) if meta_p else None,
        'manifest': json.loads(manifest_p.read_text(encoding='utf-8')) if manifest_p else None,
    }
    return out

art = load_run_artifacts(LOGS_DIR, RUN_TAG)
for k in ['episodes_path','steps_path','summary_path','metadata_path','manifest_path']:
    print(f'{k}:', art[k])

In [None]:
# 6) Quick schema inspection

def show_schema(df: pd.DataFrame | None, name: str, n=40):
    if df is None:
        print(f'{name}: None')
        return
    print(f'\n{name}: shape={df.shape}')
    print('columns:', list(df.columns[:n]))
    if len(df.columns) > n:
        print(f'... (+{len(df.columns)-n} more)')

show_schema(art['episodes'], 'episodes')
show_schema(art['steps'], 'step_diagnostics')
show_schema(art['summary'], 'summary')

## 3) Ablation Plan (state in paper)

### Baselines
1. **Static PPO-KL**: fixed target_kl and fixed clip.
2. **Scheduled PPO**: hand-crafted rollout/batch/LR schedule only.

### Proposed Variants
3. **Adaptive-KL only**: closed-loop KL controller without regime gate.
4. **Adaptive-KL + Dirichlet Dampening**.
5. **Full RA-KL**: Adaptive-KL + regime gate + Dirichlet dampening.

### Controlled Factors
- same seed set, data split, feature set,
- same architecture (TCN_FUSION),
- same reward system (TAPE) unless in reward ablation.
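
The five arms and the shared seed set can be written down as a run grid before any training starts. A sketch only: the flag names, variant keys, and seed values are hypothetical and would need to be mapped onto the actual training configuration.

```python
from itertools import product

# Hypothetical feature flags for the five ablation arms.
ABLATION_VARIANTS = {
    "static_kl":          {"schedule": False, "adaptive_kl": False, "regime_gate": False, "dirichlet_dampening": False},
    "scheduled_ppo":      {"schedule": True,  "adaptive_kl": False, "regime_gate": False, "dirichlet_dampening": False},
    "adaptive_kl":        {"schedule": False, "adaptive_kl": True,  "regime_gate": False, "dirichlet_dampening": False},
    "adaptive_kl_dampen": {"schedule": False, "adaptive_kl": True,  "regime_gate": False, "dirichlet_dampening": True},
    "ra_kl_full":         {"schedule": False, "adaptive_kl": True,  "regime_gate": True,  "dirichlet_dampening": True},
}
SEEDS = [0, 1, 2]  # placeholder seed set, identical for every variant


def build_run_grid(variants: dict, seeds: list) -> list:
    """Cross every variant with every seed so all arms share the same seeds."""
    return [
        {"label": f"{name}_seed{seed}", "variant": name, "seed": seed, **flags}
        for (name, flags), seed in product(variants.items(), seeds)
    ]


run_grid = build_run_grid(ABLATION_VARIANTS, SEEDS)
```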

In [None]:
# 7) Register experiment runs for cross-run ablation comparison
# Fill with your available runs (paths or log files).

EXPERIMENTS = [
    {
        'label': 'static_kl',
        'group': 'baseline',
        'logs_dir': LOGS_DIR,
        'run_tag': None,  # put explicit tag for this run
        'method': 'Static PPO-KL',
    },
    # Add more runs here:
    # {'label': 'ra_kl_full', 'group': 'proposed', 'logs_dir': Path(...), 'run_tag': '2026....', 'method': 'RA-KL Full'},
]

print('Experiments configured:', len(EXPERIMENTS))

In [None]:
# 8) Metric extractor utilities (robust to column naming differences)

def pick_col(df, candidates, default=None):
    for c in candidates:
        if c in df.columns:
            return c
    return default


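def compute_training_stability(episodes: pd.DataFrame | None, steps: pd.DataFrame | None):
    """Summarize optimizer diagnostics (from `steps`) and episode metrics (from `episodes`).

    Metrics whose source column is missing are omitted from the result.
    """
    res = {}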

    if steps is not None and len(steps) > 0:
        kl_col = pick_col(steps, ['approx_kl', 'kl', 'ppo_approx_kl'])
        clip_col = pick_col(steps, ['clip_fraction', 'ppo_clip_fraction'])
        early_col = pick_col(steps, ['early_stop_kl_triggered', 'kl_early_stop', 'early_stop'])
        turn_col = pick_col(steps, ['turnover', 'episode_turnover', 'turnover_pct'])

        if kl_col:
            res['kl_mean'] = float(steps[kl_col].mean())
            res['kl_p95'] = float(steps[kl_col].quantile(0.95))
            res['kl_max'] = float(steps[kl_col].max())
        if clip_col:
            res['clip_fraction_mean'] = float(steps[clip_col].mean())
        if early_col:
            res['early_stop_rate'] = float((steps[early_col] > 0).mean())
        if turn_col:
            res['turnover_mean'] = float(steps[turn_col].mean())

    if episodes is not None and len(episodes) > 0:
        shp_col = pick_col(episodes, ['Sharpe', 'sharpe', 'sharpe_ratio'])
        ret_col = pick_col(episodes, ['Return', 'total_return', 'episode_return'])
        dd_col = pick_col(episodes, ['Max_Drawdown', 'max_drawdown', 'MDD'])

        if shp_col:
            res['episode_sharpe_mean'] = float(episodes[shp_col].mean())
            res['episode_sharpe_p75'] = float(episodes[shp_col].quantile(0.75))
        if ret_col:
            res['episode_return_mean'] = float(episodes[ret_col].mean())
        if dd_col:
            res['episode_mdd_mean'] = float(episodes[dd_col].mean())

    return res

In [None]:
# 9) Build ablation table

def load_one_experiment(exp):
    art = load_run_artifacts(exp['logs_dir'], exp.get('run_tag'))
    stab = compute_training_stability(art['episodes'], art['steps'])

    row = {
        'label': exp['label'],
        'group': exp.get('group'),
        'method': exp.get('method'),
        'run_tag': exp.get('run_tag'),
    }
    row.update(stab)
    return row, art

rows = []
loaded = {}
for exp in EXPERIMENTS:
    row, loaded_art = load_one_experiment(exp)
    rows.append(row)
    loaded[exp['label']] = loaded_art

ablation_df = pd.DataFrame(rows)
ablation_df

In [None]:
# 10) KL diagnostics plots for one selected run
SELECT_LABEL = EXPERIMENTS[0]['label'] if EXPERIMENTS else None
sel = loaded.get(SELECT_LABEL, {})
steps_df = sel.get('steps')

if steps_df is None or steps_df.empty:
    print('No step_diagnostics loaded for selected run.')
else:
    kl_col = pick_col(steps_df, ['approx_kl', 'kl', 'ppo_approx_kl'])
    clip_col = pick_col(steps_df, ['clip_fraction', 'ppo_clip_fraction'])
    early_col = pick_col(steps_df, ['early_stop_kl_triggered', 'kl_early_stop', 'early_stop'])
    x_col = pick_col(steps_df, ['update', 'update_count', 'step', 'timesteps'])

    x = steps_df[x_col] if x_col else np.arange(len(steps_df))

    fig, axes = plt.subplots(3, 1, figsize=(12, 10), sharex=True)

    if kl_col:
        axes[0].plot(x, steps_df[kl_col], lw=1.2)
        axes[0].set_title('Approx KL over training updates')
        axes[0].set_ylabel('KL')

    if clip_col:
        axes[1].plot(x, steps_df[clip_col], lw=1.2, color='tab:orange')
        axes[1].set_title('Clip fraction over updates')
        axes[1].set_ylabel('clip_fraction')

    if early_col:
        axes[2].plot(x, steps_df[early_col], lw=1.0, color='tab:red')
        axes[2].set_title('KL early-stop trigger indicator')
        axes[2].set_ylabel('trigger')

    axes[2].set_xlabel('update index')
    plt.tight_layout()
    plt.show()

In [None]:
# 11) Performance plots (return, sharpe, drawdown, turnover)

sel = loaded.get(SELECT_LABEL, {})
ep_df = sel.get('episodes')

if ep_df is None or ep_df.empty:
    print('No episodes file loaded for selected run.')
else:
    idx = np.arange(len(ep_df))
    cols = {
        'return': pick_col(ep_df, ['Return', 'total_return', 'episode_return']),
        'sharpe': pick_col(ep_df, ['Sharpe', 'sharpe', 'sharpe_ratio']),
        'drawdown': pick_col(ep_df, ['Max_Drawdown', 'max_drawdown', 'MDD']),
        'turnover': pick_col(ep_df, ['Turnover', 'turnover', 'turnover_pct']),
    }

    fig, axes = plt.subplots(2, 2, figsize=(13, 8))
    axes = axes.ravel()

    for ax, (name, col) in zip(axes, cols.items()):
        if col:
            ax.plot(idx, ep_df[col], lw=1.2)
            ax.set_title(f'{name} ({col})')
            ax.set_xlabel('episode')
        else:
            ax.set_title(f'{name}: column not found')
            ax.axis('off')

    plt.tight_layout()
    plt.show()

In [None]:
# 12) Paper-ready summary table with ranking
rank_cols = [c for c in [
    'episode_sharpe_mean',
    'episode_return_mean',
    'episode_mdd_mean',
    'turnover_mean',
    'kl_mean',
    'early_stop_rate',
] if c in ablation_df.columns]

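# Keep identifier columns plus the ranking metrics that are available.
id_cols = [c for c in ['label', 'group', 'method', 'run_tag'] if c in ablation_df.columns]
summary_table = ablation_df[id_cols + rank_cols].copy()
if 'episode_sharpe_mean' in summary_table.columns:
    summary_table = summary_table.sort_values('episode_sharpe_mean', ascending=False)

summary_table = summary_table.reset_index(drop=True)
summary_table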

In [None]:
# 13) Export publication assets (CSV + LaTeX)
OUT_DIR = Path('./paper_outputs_ra_kl')
OUT_DIR.mkdir(parents=True, exist_ok=True)

ablation_csv = OUT_DIR / 'ablation_summary.csv'
ablation_df.to_csv(ablation_csv, index=False)
print('Saved:', ablation_csv)

# LaTeX table for paper
latex_path = OUT_DIR / 'ablation_summary.tex'
with open(latex_path, 'w', encoding='utf-8') as f:
    f.write(ablation_df.to_latex(index=False, float_format=lambda x: f'{x:0.4f}' if isinstance(x, float) else str(x)))
print('Saved:', latex_path)

## 4) Write-up Blocks You Can Paste into Paper

### Method paragraph (short)
We introduce Regime-Aware KL Budgeting (RA-KL), a closed-loop PPO controller that adapts target KL, actor learning rate, and clip range from update-level feedback (approximate KL, clipping pressure, and early-stop incidence), with additional regime-conditioned tightening during market stress and Dirichlet-policy dampening when concentration dynamics become unstable.

### Experimental protocol paragraph
All ablations use the same data split, architecture (TCN_FUSION), reward design (TAPE), and seed protocol. We compare static PPO trust-region settings against adaptive variants, and report both optimization diagnostics (KL overshoot, clip fraction, early-stop rate) and portfolio performance metrics (Sharpe, Sortino, drawdown, turnover, return).

### Main finding template
RA-KL reduced KL overshoot frequency by [X%], reduced early-stop rate by [Y%], and improved out-of-sample Sharpe by [Z], while maintaining lower turnover and comparable drawdown.

## 5) Reporting Checklist

- Include exact hyperparameter bounds for dynamic controls.
- Report mean and variance over multiple seeds.
- Separate training stability metrics from test performance metrics.
- Add failure-case analysis (when RA-KL underperforms).
- Provide compute-cost comparison (time/update and wall-clock).
- Include robustness checks across benchmark and stress windows.
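
For the multi-seed item, a minimal aggregation sketch using only the standard library. The row layout (`method`, `seed`, metric columns) and the numbers are placeholders, not results; with pandas the same summary could be produced with `groupby('method')` followed by `agg(['mean', 'std'])`.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_over_seeds(rows: list, metric: str) -> dict:
    """Per-method mean, sample standard deviation, and seed count for one metric."""
    by_method = defaultdict(list)
    for row in rows:
        by_method[row["method"]].append(row[metric])
    return {
        method: {
            "mean": mean(values),
            # Sample std needs at least two seeds.
            "std": stdev(values) if len(values) > 1 else float("nan"),
            "n_seeds": len(values),
        }
        for method, values in by_method.items()
    }


# Placeholder rows for illustration only.
rows = [
    {"method": "method_a", "seed": 0, "sharpe": 1.0},
    {"method": "method_a", "seed": 1, "sharpe": 1.2},
    {"method": "method_b", "seed": 0, "sharpe": 1.1},
    {"method": "method_b", "seed": 1, "sharpe": 1.3},
    {"method": "method_b", "seed": 2, "sharpe": 1.5},
]
seed_summary = aggregate_over_seeds(rows, "sharpe")
```

Reporting `n_seeds` next to each mean makes unequal seed counts across arms visible in the table.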