# RL News Trading Agent - Google Colab Training

Single-file implementation for iterative development.

## Sections:
1. **Setup** - Smart install (skips already installed packages)
2. **Config** - Experiment configurations
3. **Data** - Generate/load cached market & news data
4. **Environment** - Custom Gymnasium trading environment
5. **Training** - Train PPO agent (saves models automatically)
5b. **Quick Evaluate** - Load saved models WITHOUT retraining ⚡
6. **Results** - Display metrics with CLAUDE_RESULTS markers
7. **Visualization** - Training progress plots

---

## 🚀 Quick Start

**First run:** `Runtime → Run All` (~5 min)

**Repeat runs (fast):**
1. Run Sections 1-4 (Setup, Config, Data, Environment)
2. Run Section 5b (Quick Evaluate) ← loads models without retraining
3. Run Sections 6-7 (Results)

---
## SECTION 1: Setup

In [None]:
# ============================================
# SECTION 1: Setup (Smart Install)
# ============================================
# Run once at the start of the session
# On repeat runs, already installed packages are skipped

import subprocess
import sys

def install_if_missing(package, import_name=None):
    """Install package only if not already installed."""
    if import_name is None:
        import_name = package.split('[')[0].replace('-', '_')
    try:
        __import__(import_name)
        return False  # Already installed
    except ImportError:
        print(f"📦 Installing {package}...")
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package, '-q'])
        return True

# Install only missing packages
packages = [
    ('stable-baselines3[extra]', 'stable_baselines3'),
    ('gymnasium', 'gymnasium'),
    ('yfinance', 'yfinance'),
    ('ta', 'ta'),
    ('plotly', 'plotly'),
]

installed_count = 0
for pkg_info in packages:
    if isinstance(pkg_info, tuple):
        pkg, imp = pkg_info
    else:
        pkg, imp = pkg_info, None
    if install_if_missing(pkg, imp):
        installed_count += 1

if installed_count > 0:
    print(f"✅ Installed {installed_count} new packages")
else:
    print("✅ All packages already installed (skipped)")

# Now import everything
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO, SAC, A2C
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import BaseCallback
import traceback
import time

# Define paths for caching
DATA_CACHE_PATH = '/content/data_cache.npz'
MODELS_DIR = '/content/models'
RESULTS_PATH = '/content/experiment_results.json'

print(f"\n✅ Setup complete")
print(f"Gymnasium version: {gym.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"\n📁 Cache paths:")
print(f"  Data: {DATA_CACHE_PATH}")
print(f"  Models: {MODELS_DIR}")
print(f"  Results: {RESULTS_PATH}")

---
## SECTION 2: Experiment Configuration

Define multiple experiments to compare different training approaches.
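The experiment dictionaries in this section share most of their keys and differ in only one or two values. A minimal sketch of deriving each config from a shared base is shown here. `BASE_CONFIG` and `make_config` are hypothetical names that do not appear in the notebook; the base values are the ones that repeat across the experiments.

```python
# Hypothetical helper (not part of the notebook): build configs from one base.
BASE_CONFIG = {
    "reward_type": "simple_pnl",
    "normalize_obs": True,
    "entropy_coef": 0.01,
    "transaction_penalty": 0.0,
    "sharpe_window": 20,
    "action_repeat_penalty": 0.0,
    "learning_rate": 3e-4,
    "timesteps": 50000,
}


def make_config(name, **overrides):
    """Return a new dict: BASE_CONFIG plus a display name and any overrides."""
    unknown = set(overrides) - set(BASE_CONFIG)
    if unknown:
        # Catch typos such as 'entropy_coeff' before a long training run starts
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return {"name": name, **BASE_CONFIG, **overrides}


combo = make_config("100K + High Entropy (combo)", entropy_coef=0.05, timesteps=100000)
print(combo["entropy_coef"], combo["timesteps"], combo["learning_rate"])
```

Rejecting unknown keys is a design choice: the training loop reads values with `config.get(..., default)`, so a misspelled key would otherwise fall back to the default silently.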

In [None]:
# ============================================
# SECTION 2: Experiment Configuration
# ============================================
# Run #3: Scaling Up - combining the best findings
# Conclusions from Run #2:
# ✅ 100K steps = +187% (vs +118% baseline) - WORKS
# ✅ High entropy 0.05 = best Sharpe 2.75 - WORKS
# ❌ Low LR 1e-4 = worse results - DOES NOT WORK

EXPERIMENTS = {
    # ===========================================
    # BASELINES (best of Run #2)
    # ===========================================
    "norm_100k": {
        "name": "100K steps (Run #2 best return)",
        "reward_type": "simple_pnl",
        "normalize_obs": True,
        "entropy_coef": 0.01,
        "transaction_penalty": 0.0,
        "sharpe_window": 20,
        "action_repeat_penalty": 0.0,
        "learning_rate": 3e-4,
        "timesteps": 100000,
    },

    "norm_high_ent": {
        "name": "High Entropy (Run #2 best Sharpe)",
        "reward_type": "simple_pnl",
        "normalize_obs": True,
        "entropy_coef": 0.05,
        "transaction_penalty": 0.0,
        "sharpe_window": 20,
        "action_repeat_penalty": 0.0,
        "learning_rate": 3e-4,
        "timesteps": 50000,
    },

    # ===========================================
    # RUN #3 NEW EXPERIMENTS
    # ===========================================

    # Hypothesis 1: combine the best settings (100K + high entropy)
    "combo_best": {
        "name": "100K + High Entropy (combo)",
        "reward_type": "simple_pnl",
        "normalize_obs": True,
        "entropy_coef": 0.05,  # Best Sharpe config
        "transaction_penalty": 0.0,
        "sharpe_window": 20,
        "action_repeat_penalty": 0.0,
        "learning_rate": 3e-4,
        "timesteps": 100000,  # Best return config
    },

    # Hypothesis 2: even more timesteps
    "steps_200k": {
        "name": "200K steps",
        "reward_type": "simple_pnl",
        "normalize_obs": True,
        "entropy_coef": 0.01,
        "transaction_penalty": 0.0,
        "sharpe_window": 20,
        "action_repeat_penalty": 0.0,
        "learning_rate": 3e-4,
        "timesteps": 200000,  # 2x the best run
    },

    # Hypothesis 3: higher LR (since the low one does not work)
    "high_lr": {
        "name": "High LR (5e-4)",
        "reward_type": "simple_pnl",
        "normalize_obs": True,
        "entropy_coef": 0.01,
        "transaction_penalty": 0.0,
        "sharpe_window": 20,
        "action_repeat_penalty": 0.0,
        "learning_rate": 5e-4,  # Above the default
        "timesteps": 50000,
    },

    # Hypothesis 4: maximal combination
    "steps_200k_ent": {
        "name": "200K + High Entropy (max)",
        "reward_type": "simple_pnl",
        "normalize_obs": True,
        "entropy_coef": 0.05,  # High entropy
        "transaction_penalty": 0.0,
        "sharpe_window": 20,
        "action_repeat_penalty": 0.0,
        "learning_rate": 3e-4,
        "timesteps": 200000,  # Max timesteps
    },
}

print("✅ Experiment configurations loaded (Run #3 - Scaling Up)")
print(f"Total experiments: {len(EXPERIMENTS)}")
print("\n📋 Experiments:")
for key, config in EXPERIMENTS.items():
    lr = config.get('learning_rate', 3e-4)
    ts = config.get('timesteps', 50000)
    print(f"  - {config['name']}")
    print(f"      LR: {lr}, Steps: {ts//1000}K, Ent: {config['entropy_coef']}")

---
## SECTION 3: Data Collection

Generate synthetic market data and news sentiment for testing.
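The price series in this section is a geometric random walk: daily log returns are drawn from a normal distribution and then cumulatively summed and exponentiated. As a minimal sketch of that idea, the version here uses a local `np.random.default_rng` generator, so seeding does not touch NumPy's global random state. `simulate_prices` is a hypothetical name, and because it uses a different generator it will not reproduce the exact numbers from `np.random.seed(42)`.

```python
import numpy as np


def simulate_prices(n_days, start_price=100.0, mean_return=0.0005,
                    volatility=0.02, seed=42):
    """Geometric random walk: start_price * exp(cumulative sum of log returns)."""
    rng = np.random.default_rng(seed)  # local generator, global state untouched
    log_returns = rng.normal(mean_return, volatility, n_days)
    return start_price * np.exp(np.cumsum(log_returns))


prices = simulate_prices(500)
print(prices.shape, bool(prices.min() > 0))
```

Exponentiating the cumulative sum keeps every price strictly positive, which an additive random walk would not guarantee.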

In [None]:
# ============================================
# SECTION 3: Data Collection (with Caching)
# ============================================
# Data is generated once and saved to the cache
# On repeat runs it is loaded from the cache (~instantly)

def generate_synthetic_market_data(n_days=365):
    """
    Generate synthetic OHLCV data with technical indicators.
    """
    dates = pd.date_range(end=datetime.now(), periods=n_days, freq='D')
    
    # Simulate price with random walk + trend
    np.random.seed(42)
    returns = np.random.normal(0.0005, 0.02, n_days)  # Mean daily return ~0.05%, volatility 2%
    prices = 100 * np.exp(np.cumsum(returns))
    
    # Generate OHLCV
    df = pd.DataFrame({
        'timestamp': dates,
        'open': prices * np.random.uniform(0.98, 1.0, n_days),
        'high': prices * np.random.uniform(1.0, 1.02, n_days),
        'low': prices * np.random.uniform(0.97, 1.0, n_days),
        'close': prices,
        'volume': np.random.uniform(1e6, 5e6, n_days)
    })
    
    # Technical indicators
    df['returns_1d'] = df['close'].pct_change()
    df['returns_7d'] = df['close'].pct_change(7)
    
    # RSI
    delta = df['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['rsi_14'] = 100 - (100 / (1 + rs))
    
    # MACD
    ema_12 = df['close'].ewm(span=12).mean()
    ema_26 = df['close'].ewm(span=26).mean()
    df['macd'] = ema_12 - ema_26
    df['macd_signal'] = df['macd'].ewm(span=9).mean()
    
    # Bollinger Bands
    sma_20 = df['close'].rolling(window=20).mean()
    std_20 = df['close'].rolling(window=20).std()
    df['bollinger_upper'] = sma_20 + (std_20 * 2)
    df['bollinger_lower'] = sma_20 - (std_20 * 2)
    
    # ATR (Average True Range)
    high_low = df['high'] - df['low']
    high_close = np.abs(df['high'] - df['close'].shift())
    low_close = np.abs(df['low'] - df['close'].shift())
    ranges = pd.concat([high_low, high_close, low_close], axis=1)
    true_range = np.max(ranges, axis=1)
    df['atr_14'] = true_range.rolling(14).mean()
    
    # Volume ratio
    df['volume_ratio'] = df['volume'] / df['volume'].rolling(20).mean()
    
    df.fillna(0, inplace=True)
    return df

def generate_synthetic_news_data(n_days=365):
    """
    Generate synthetic news sentiment data.
    """
    dates = pd.date_range(end=datetime.now(), periods=n_days, freq='D')
    
    np.random.seed(43)
    
    # Sentiment follows a random walk between -1 and 1
    sentiment_base = np.cumsum(np.random.normal(0, 0.1, n_days))
    sentiment_base = np.clip(sentiment_base, -3, 3) / 3  # Normalize to [-1, 1]
    
    df = pd.DataFrame({
        'timestamp': dates,
        'sentiment_1h': sentiment_base + np.random.normal(0, 0.1, n_days),
        'sentiment_24h': sentiment_base,
        'sentiment_7d': pd.Series(sentiment_base).rolling(7).mean().fillna(0).values,
        'sentiment_trend': pd.Series(sentiment_base).diff().fillna(0).values,
        'news_volume': np.random.poisson(20, n_days),
        'news_velocity': np.random.uniform(0.5, 2.0, n_days)
    })
    
    # Clip sentiment to [-1, 1]
    for col in ['sentiment_1h', 'sentiment_24h', 'sentiment_7d', 'sentiment_trend']:
        df[col] = np.clip(df[col], -1, 1)
    
    return df


def get_or_generate_data(n_days=500, force_regenerate=False):
    """
    Load data from cache or generate new data.
    Set force_regenerate=True to regenerate even if cache exists.
    """
    if os.path.exists(DATA_CACHE_PATH) and not force_regenerate:
        print("📦 Loading cached data...")
        data = np.load(DATA_CACHE_PATH, allow_pickle=True)
        market_data = pd.DataFrame(data['market'].item())
        news_data = pd.DataFrame(data['news'].item())
        print(f"✅ Loaded from cache: {DATA_CACHE_PATH}")
    else:
        print("🔄 Generating new synthetic data...")
        market_data = generate_synthetic_market_data(n_days=n_days)
        news_data = generate_synthetic_news_data(n_days=n_days)
        
        # Save to cache
        np.savez(DATA_CACHE_PATH, 
                 market=market_data.to_dict(), 
                 news=news_data.to_dict())
        print(f"💾 Saved to cache: {DATA_CACHE_PATH}")
    
    return market_data, news_data


# Load or generate data
market_data, news_data = get_or_generate_data(n_days=500)

print(f"\n✅ Data ready")
print(f"Market data shape: {market_data.shape}")
print(f"News data shape: {news_data.shape}")
print(f"\nMarket data sample:")
print(market_data.head())
print(f"\nNews data sample:")
print(news_data.head())

# Hint: To regenerate data, run:
# market_data, news_data = get_or_generate_data(n_days=500, force_regenerate=True)

---
## SECTION 4: Environment

Custom Gymnasium trading environment with configurable rewards and normalization.
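The normalization mentioned here relies on running statistics that are updated one observation at a time. A standalone sketch of Welford's online algorithm is given here for reference. `RunningNormalizer` is a hypothetical class that does not appear in the notebook, and it tracks the sum of squared deviations (`m2`) rather than the standard deviation directly, which is a slightly different bookkeeping from the environment's own method.

```python
import numpy as np


class RunningNormalizer:
    """Online mean/std tracker (Welford) with clipped z-score normalization."""

    def __init__(self, shape, eps=1e-8, clip=10.0):
        self.count = 0
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)  # sum of squared deviations
        self.eps = eps
        self.clip = clip

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        self.count += 1
        delta = x - self.mean            # deviation from the old mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    @property
    def std(self):
        # Population standard deviation; fall back to 1 until two samples exist
        if self.count < 2:
            return np.ones_like(self.mean)
        return np.sqrt(self.m2 / self.count)

    def normalize(self, x):
        z = (np.asarray(x, dtype=np.float64) - self.mean) / (self.std + self.eps)
        return np.clip(z, -self.clip, self.clip).astype(np.float32)


samples = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
normalizer = RunningNormalizer(shape=(2,))
for row in samples:
    normalizer.update(row)
print(normalizer.mean, normalizer.std)
```

One caveat with any running normalizer: a freshly created environment starts with empty statistics, so observations seen at evaluation time are scaled differently from those seen late in training unless the statistics are carried over.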

In [None]:
class TradingEnv(gym.Env):
    """
    Custom trading environment compatible with Stable Baselines3.
    
    Observation Space:
        - market: 15 technical indicators
        - news: 6 sentiment features
        - portfolio: 5 position metrics
    
    Action Space:
        - Discrete(7): HOLD, BUY_25%, BUY_50%, BUY_100%, SELL_25%, SELL_50%, SELL_100%
    """
    
    def __init__(self, market_data, news_data, config=None, initial_balance=10000, commission=0.001):
        super(TradingEnv, self).__init__()
        
        self.market_data = market_data.reset_index(drop=True)
        self.news_data = news_data.reset_index(drop=True)
        self.initial_balance = initial_balance
        self.commission = commission
        
        # Experiment configuration
        if config is None:
            # Fall back to the first defined experiment (EXPERIMENTS has no "baseline" key)
            config = next(iter(EXPERIMENTS.values()))
        self.config = config
        
        # Action space: 0=HOLD, 1-3=BUY, 4-6=SELL
        self.action_space = spaces.Discrete(7)
        
        # Observation space
        self.observation_space = spaces.Dict({
            'market': spaces.Box(low=-np.inf, high=np.inf, shape=(15,), dtype=np.float32),
            # Unbounded: news_velocity reaches 2.0 and normalized features are clipped to [-10, 10]
            'news': spaces.Box(low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32),
            'portfolio': spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)
        })
        
        # Normalization statistics (running mean/std)
        self.obs_mean = None
        self.obs_std = None
        self.obs_count = 0
        
        self.reset()
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        
        self.current_step = 50  # Start after warm-up period for indicators
        self.balance = self.initial_balance
        self.shares_held = 0
        self.total_value = self.initial_balance
        self.trades = []
        self.portfolio_values = [self.initial_balance]
        self.last_action = 0
        
        # For Sharpe-based reward
        self.recent_returns = []
        
        return self._get_observation(), {}
    
    def _normalize_observation(self, obs):
        """Apply observation normalization if enabled."""
        if not self.config.get("normalize_obs", False):
            return obs
        
        # Initialize normalization statistics
        if self.obs_mean is None:
            self.obs_mean = {k: np.zeros_like(v) for k, v in obs.items()}
            self.obs_std = {k: np.ones_like(v) for k, v in obs.items()}
        
        # Update running statistics (Welford's online algorithm)
        self.obs_count += 1
        normalized_obs = {}
        
        for key in obs.keys():
            delta = obs[key] - self.obs_mean[key]
            self.obs_mean[key] += delta / self.obs_count
            delta2 = obs[key] - self.obs_mean[key]
            self.obs_std[key] = np.sqrt((self.obs_std[key]**2 * (self.obs_count - 1) + delta * delta2) / self.obs_count + 1e-8)
            
            # Normalize
            normalized_obs[key] = (obs[key] - self.obs_mean[key]) / (self.obs_std[key] + 1e-8)
            normalized_obs[key] = np.clip(normalized_obs[key], -10, 10)  # Clip extreme values
        
        return normalized_obs
    
    def _get_observation(self):
        """Get current observation."""
        row = self.market_data.iloc[self.current_step]
        news_row = self.news_data.iloc[self.current_step]
        
        # Market features (15)
        market_features = np.array([
            row['close'] / 100,  # Normalized price
            row['returns_1d'],
            row['returns_7d'],
            row['rsi_14'] / 100,
            row['macd'] / row['close'] if row['close'] > 0 else 0,
            row['macd_signal'] / row['close'] if row['close'] > 0 else 0,
            (row['close'] - row['bollinger_lower']) / (row['bollinger_upper'] - row['bollinger_lower']) if row['bollinger_upper'] != row['bollinger_lower'] else 0.5,
            row['atr_14'] / row['close'] if row['close'] > 0 else 0,
            row['volume_ratio'],
            row['volume'] / 1e6,  # Normalized volume
            (row['high'] - row['low']) / row['close'] if row['close'] > 0 else 0,
            (row['close'] - row['open']) / row['open'] if row['open'] > 0 else 0,
            row['high'] / row['close'] if row['close'] > 0 else 1,
            row['low'] / row['close'] if row['close'] > 0 else 1,
            row['volume'] / row['volume'] if self.current_step == 0 else row['volume'] / self.market_data.iloc[self.current_step-1]['volume']
        ], dtype=np.float32)
        
        # News features (6)
        news_features = np.array([
            news_row['sentiment_1h'],
            news_row['sentiment_24h'],
            news_row['sentiment_7d'],
            news_row['sentiment_trend'],
            news_row['news_volume'] / 50,  # Normalized
            news_row['news_velocity']
        ], dtype=np.float32)
        
        # Portfolio features (5)
        current_price = row['close']
        portfolio_value = self.balance + self.shares_held * current_price
        
        portfolio_features = np.array([
            self.balance / self.initial_balance,  # Cash ratio
            self.shares_held * current_price / self.initial_balance if self.initial_balance > 0 else 0,  # Position ratio
            portfolio_value / self.initial_balance - 1,  # Return
            self.shares_held / 100 if self.shares_held > 0 else 0,  # Normalized shares
            len(self.trades) / 100  # Normalized trade count
        ], dtype=np.float32)
        
        obs = {
            'market': market_features,
            'news': news_features,
            'portfolio': portfolio_features
        }
        
        return self._normalize_observation(obs)
    
    def _calculate_reward(self, portfolio_value, action):
        """Calculate reward based on configuration."""
        reward_type = self.config.get("reward_type", "simple_pnl")
        
        if reward_type == "simple_pnl":
            # Simple P&L reward
            reward = (portfolio_value - self.total_value) / self.total_value
        
        elif reward_type == "sharpe_based":
            # Sharpe-based reward (risk-adjusted returns)
            portfolio_return = (portfolio_value - self.total_value) / self.total_value
            self.recent_returns.append(portfolio_return)
            
            # Keep only recent window
            window = self.config.get("sharpe_window", 20)
            if len(self.recent_returns) > window:
                self.recent_returns.pop(0)
            
            # Calculate Sharpe-like reward
            if len(self.recent_returns) >= 2:
                mean_return = np.mean(self.recent_returns)
                std_return = np.std(self.recent_returns)
                sharpe = mean_return / (std_return + 1e-9)
                reward = sharpe
            else:
                reward = portfolio_return
        
        else:
            reward = 0
        
        # Apply transaction penalty
        transaction_penalty = self.config.get("transaction_penalty", 0.0)
        if action != 0:  # Not HOLD
            reward -= transaction_penalty
        
        # Apply action repeat penalty (discourage same action repeatedly)
        action_repeat_penalty = self.config.get("action_repeat_penalty", 0.0)
        if action == self.last_action and action != 0:
            reward -= action_repeat_penalty
        
        return reward
    
    def step(self, action):
        """Execute one time step."""
        current_price = self.market_data.iloc[self.current_step]['close']
        
        # Execute action
        if action == 0:  # HOLD
            pass
        elif action in [1, 2, 3]:  # BUY
            buy_pct = [0.25, 0.5, 1.0][action - 1]
            amount_to_invest = self.balance * buy_pct
            shares_to_buy = (amount_to_invest / current_price) * (1 - self.commission)
            
            if shares_to_buy > 0:
                self.shares_held += shares_to_buy
                self.balance -= amount_to_invest
                self.trades.append({
                    'step': self.current_step,
                    'action': 'BUY',
                    'shares': shares_to_buy,
                    'price': current_price
                })
        
        elif action in [4, 5, 6]:  # SELL
            sell_pct = [0.25, 0.5, 1.0][action - 4]
            shares_to_sell = self.shares_held * sell_pct
            
            if shares_to_sell > 0:
                self.balance += shares_to_sell * current_price * (1 - self.commission)
                self.shares_held -= shares_to_sell
                self.trades.append({
                    'step': self.current_step,
                    'action': 'SELL',
                    'shares': shares_to_sell,
                    'price': current_price
                })
        
        # Calculate portfolio value
        portfolio_value = self.balance + self.shares_held * current_price
        self.portfolio_values.append(portfolio_value)
        
        # Calculate reward
        reward = self._calculate_reward(portfolio_value, action)
        self.total_value = portfolio_value
        self.last_action = action
        
        # Move to next step
        self.current_step += 1
        
        # Check if episode is done
        done = self.current_step >= len(self.market_data) - 1
        truncated = False
        
        return self._get_observation(), reward, done, truncated, {}
    
    def render(self, mode='human'):
        """Render the environment (optional)."""
        current_price = self.market_data.iloc[self.current_step]['close']
        portfolio_value = self.balance + self.shares_held * current_price
        profit = ((portfolio_value / self.initial_balance) - 1) * 100
        
        print(f"Step: {self.current_step} | Price: ${current_price:.2f} | "
              f"Balance: ${self.balance:.2f} | Shares: {self.shares_held:.2f} | "
              f"Portfolio: ${portfolio_value:.2f} | Profit: {profit:.2f}%")

print("✅ Environment class defined")

---
## SECTION 5: Training

Train multiple experiments and compare results.
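Experiments are compared mainly on the Sharpe ratio and the maximum drawdown, both computed from the sequence of portfolio values. A minimal standalone sketch of those two calculations is given here, assuming daily steps and 252 trading periods per year. `sharpe_ratio` and `max_drawdown_pct` are hypothetical helper names, and the sample values are made up for illustration.

```python
import numpy as np


def sharpe_ratio(portfolio_values, periods_per_year=252):
    """Annualized mean/std of step-to-step returns (risk-free rate assumed 0)."""
    values = np.asarray(portfolio_values, dtype=np.float64)
    step_returns = values[1:] / values[:-1] - 1
    return np.mean(step_returns) / (np.std(step_returns) + 1e-9) * np.sqrt(periods_per_year)


def max_drawdown_pct(portfolio_values):
    """Largest peak-to-trough decline, as a negative percentage."""
    values = np.asarray(portfolio_values, dtype=np.float64)
    running_max = np.maximum.accumulate(values)
    return np.min((values - running_max) / running_max) * 100


# Illustrative values only: peak of 110 followed by a dip to 99
values = [100.0, 110.0, 99.0, 120.0]
print(f"Sharpe: {sharpe_ratio(values):.2f} | Max drawdown: {max_drawdown_pct(values):.2f}%")
```

The drawdown here comes from the fall from 110 to 99, which is a 10% decline from the running peak.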

In [None]:
# ============================================
# SECTION 5: Training (with Model Saving)
# ============================================
# Trains all experiments and SAVES the models
# Uses learning_rate and timesteps from each experiment's config
# ⚠️ Run #3: the 200K-step experiments take ~10 min each

class ProgressCallback(BaseCallback):
    """Custom callback for logging training progress."""
    def __init__(self, check_freq, verbose=1):
        super(ProgressCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.episode_rewards = []
        self.episode_lengths = []
    
    def _on_step(self):
        if self.n_calls % self.check_freq == 0:
            if len(self.model.ep_info_buffer) > 0:
                mean_reward = np.mean([ep_info['r'] for ep_info in self.model.ep_info_buffer])
                mean_length = np.mean([ep_info['l'] for ep_info in self.model.ep_info_buffer])
                self.episode_rewards.append(mean_reward)
                self.episode_lengths.append(mean_length)
                if self.verbose > 0:
                    print(f"  Step: {self.n_calls:6d} | Mean reward: {mean_reward:8.4f} | Mean ep length: {mean_length:.1f}")
        return True


# ============================================
# Model Save/Load Functions
# ============================================

def save_model(model, experiment_name):
    """Save trained model to disk."""
    os.makedirs(MODELS_DIR, exist_ok=True)
    path = f'{MODELS_DIR}/{experiment_name}.zip'
    model.save(path)
    print(f"💾 Model saved: {path}")
    return path

def load_model(experiment_name, env=None):
    """Load model from disk. Returns None if not found."""
    path = f'{MODELS_DIR}/{experiment_name}.zip'
    if os.path.exists(path):
        print(f"📂 Loading model: {path}")
        return PPO.load(path, env=env)
    return None

def list_saved_models():
    """List all saved models."""
    if not os.path.exists(MODELS_DIR):
        return []
    return [f.replace('.zip', '') for f in os.listdir(MODELS_DIR) if f.endswith('.zip')]

def save_results(results_dict):
    """Save experiment results to JSON."""
    serializable = {}
    for exp_key, results in results_dict.items():
        if 'error' in results:
            serializable[exp_key] = results
        else:
            serializable[exp_key] = {
                k: v if not isinstance(v, (np.ndarray, list)) or k != 'portfolio_values' 
                else [float(x) for x in v]
                for k, v in results.items()
            }
    with open(RESULTS_PATH, 'w') as f:
        json.dump(serializable, f, indent=2, default=str)
    print(f"💾 Results saved: {RESULTS_PATH}")

def load_results():
    """Load experiment results from JSON."""
    if os.path.exists(RESULTS_PATH):
        with open(RESULTS_PATH, 'r') as f:
            return json.load(f)
    return None


def evaluate_agent(model, market_data, news_data, config):
    """Evaluate a trained agent and return metrics."""
    eval_env = TradingEnv(market_data, news_data, config=config, initial_balance=10000)
    obs, info = eval_env.reset()
    done = False
    
    actions_taken = []
    rewards_list = []
    
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        actions_taken.append(int(action))
        obs, reward, done, truncated, info = eval_env.step(action)
        rewards_list.append(reward)
        done = done or truncated
    
    final_price = eval_env.market_data.iloc[eval_env.current_step - 1]['close']
    final_value = eval_env.balance + eval_env.shares_held * final_price
    total_return = (final_value / eval_env.initial_balance - 1) * 100
    
    initial_price = eval_env.market_data.iloc[50]['close']
    buy_hold_return = ((final_price / initial_price) - 1) * 100
    
    returns_array = np.array(eval_env.portfolio_values[1:]) / np.array(eval_env.portfolio_values[:-1]) - 1
    sharpe = np.mean(returns_array) / (np.std(returns_array) + 1e-9) * np.sqrt(252)
    
    portfolio_values = np.array(eval_env.portfolio_values)
    running_max = np.maximum.accumulate(portfolio_values)
    drawdown = (portfolio_values - running_max) / running_max
    max_drawdown = np.min(drawdown) * 100
    
    # Note: "win rate" here is the share of steps with a positive reward, not of profitable trades
    winning_trades = sum(1 for r in rewards_list if r > 0)
    win_rate = (winning_trades / len(rewards_list) * 100) if len(rewards_list) > 0 else 0
    
    action_names = ['HOLD', 'BUY_25%', 'BUY_50%', 'BUY_100%', 'SELL_25%', 'SELL_50%', 'SELL_100%']
    action_dist = {name: actions_taken.count(i) / len(actions_taken) * 100 for i, name in enumerate(action_names)}
    
    return {
        'final_value': final_value,
        'total_return': total_return,
        'buy_hold_return': buy_hold_return,
        'outperformance': total_return - buy_hold_return,
        'sharpe': sharpe,
        'max_drawdown': max_drawdown,
        'win_rate': win_rate,
        'num_trades': len(eval_env.trades),
        'action_dist': action_dist,
        'portfolio_values': eval_env.portfolio_values
    }


# ============================================
# RUN TRAINING
# ============================================

experiment_results = {}
total_start_time = time.time()

print("="*80)
print("STARTING MULTI-EXPERIMENT TRAINING (Run #3 - Scaling Up)")
print("="*80)
print(f"📁 Models will be saved to: {MODELS_DIR}")
print(f"📊 Total experiments: {len(EXPERIMENTS)}")
print(f"⏱️ Estimated time: ~15-20 min (200K steps experiments)")

for exp_key, exp_config in EXPERIMENTS.items():
    try:
        # Get config-specific hyperparameters
        learning_rate = exp_config.get('learning_rate', 3e-4)
        total_timesteps = exp_config.get('timesteps', 50000)
        entropy_coef = exp_config.get('entropy_coef', 0.01)
        
        print(f"\n{'='*80}")
        print(f"EXPERIMENT: {exp_config['name']}")
        print(f"{'='*80}")
        print(f"Hyperparameters:")
        print(f"  - learning_rate: {learning_rate}")
        print(f"  - timesteps: {total_timesteps}")
        print(f"  - entropy_coef: {entropy_coef}")
        print(f"  - normalize_obs: {exp_config.get('normalize_obs', False)}")
        print(f"  - transaction_penalty: {exp_config.get('transaction_penalty', 0.0)}")
        print()
        
        # Create environment
        train_env = TradingEnv(market_data, news_data, config=exp_config)
        
        # Initialize PPO with config-specific parameters
        model = PPO(
            "MultiInputPolicy",
            train_env,
            learning_rate=learning_rate,  # From config!
            n_steps=2048,
            batch_size=64,
            n_epochs=10,
            gamma=0.99,
            gae_lambda=0.95,
            clip_range=0.2,
            ent_coef=entropy_coef,  # From config!
            verbose=0
        )
        
        print(f"Starting training ({total_timesteps//1000}K timesteps)...")
        
        # Calculate epochs based on timesteps
        timesteps_per_epoch = 10000
        epochs = total_timesteps // timesteps_per_epoch
        
        callback = ProgressCallback(check_freq=10000, verbose=1)
        start_time = time.time()
        
        for epoch in range(epochs):
            print(f"\n  Epoch {epoch + 1}/{epochs}:")
            model.learn(
                total_timesteps=timesteps_per_epoch,
                callback=callback,
                reset_num_timesteps=False
            )
        
        training_time = time.time() - start_time
        print(f"\n✅ Training complete in {training_time:.2f}s")
        
        # Save model
        save_model(model, exp_key)
        
        # Evaluate
        print(f"Evaluating...")
        results = evaluate_agent(model, market_data, news_data, exp_config)
        results['training_time'] = training_time
        results['config'] = exp_config
        
        experiment_results[exp_key] = results
        
        print(f"✅ {exp_config['name']} complete!")
        print(f"   Return: {results['total_return']:+.2f}% | Sharpe: {results['sharpe']:.2f} | Drawdown: {results['max_drawdown']:.2f}%")
        
    except Exception as e:
        print(f"\n❌ ERROR in experiment {exp_key}:")
        print(f"Error type: {type(e).__name__}")
        print(f"Error message: {str(e)}")
        traceback.print_exc()
        experiment_results[exp_key] = {'error': str(e), 'config': exp_config}

# Save results
save_results(experiment_results)

total_time = time.time() - total_start_time
print(f"\n{'='*80}")
print(f"ALL EXPERIMENTS COMPLETE")
print(f"Total time: {total_time:.2f}s ({total_time/60:.1f} minutes)")
print(f"{'='*80}")

saved = list_saved_models()
print(f"\n💾 Saved models ({len(saved)}):")
for m in saved:
    print(f"  - {m}")

In [None]:
# ============================================
# SECTION 5b: Quick Evaluate (Skip Training)
# ============================================
# Loads the saved models and evaluates them
# Does NOT retrain - saves ~5 minutes!

print("="*80)
print("QUICK EVALUATE - Loading saved models")
print("="*80)

# Check for saved models
saved_models = list_saved_models()
print(f"\n📂 Found {len(saved_models)} saved models: {saved_models}")

if len(saved_models) == 0:
    print("\n⚠️ No saved models found!")
    print("Run Section 5 (Training) first to train and save models.")
else:
    experiment_results = {}
    
    for exp_key, exp_config in EXPERIMENTS.items():
        print(f"\n{'='*60}")
        print(f"Loading: {exp_config['name']}")
        
        model = load_model(exp_key)
        
        if model is None:
            print(f"  ⚠️ Model not found for {exp_key}, skipping...")
            continue
        
        # Evaluate
        print(f"  Evaluating...")
        results = evaluate_agent(model, market_data, news_data, exp_config)
        results['training_time'] = 0  # Not trained this session
        results['config'] = exp_config
        
        experiment_results[exp_key] = results
        
        print(f"  ✅ Return: {results['total_return']:+.2f}% | Sharpe: {results['sharpe']:.2f} | Drawdown: {results['max_drawdown']:.2f}%")
    
    print(f"\n{'='*80}")
    print(f"✅ Quick Evaluate complete! Loaded {len(experiment_results)} models.")
    print("Now run Section 6-7 to see detailed results.")
    print("="*80)

---
## SECTION 5b: Quick Evaluate (Skip Training)

**Use this cell to load already-trained models WITHOUT retraining.**

When to use:
- After the first Run All (the models are already saved)
- When you just want to look at the results
- When you changed only the output configs (Section 6-7)

⚠️ Requires: Sections 1-4 must already have been run

---
## SECTION 6: Results Comparison

Display comparison table of all experiments with clear markers for Claude.

In [None]:
print("\n" + "="*80)
print("CLAUDE_RESULTS_START")
print("="*80)

print("\n📊 EXPERIMENT COMPARISON TABLE")
print("="*80)

# Create comparison table
comparison_data = []
for exp_key, results in experiment_results.items():
    if 'error' in results:
        comparison_data.append({
            'Experiment': results['config']['name'],
            'Status': 'ERROR',
            'Return': 'N/A',
            'Sharpe': 'N/A',
            'Drawdown': 'N/A',
            'Win Rate': 'N/A',
            'Trades': 'N/A',
            'Time': 'N/A'
        })
    else:
        comparison_data.append({
            'Experiment': results['config']['name'],
            'Status': '✅',
            'Return': f"{results['total_return']:+.2f}%",
            'Sharpe': f"{results['sharpe']:.2f}",
            'Drawdown': f"{results['max_drawdown']:.2f}%",
            'Win Rate': f"{results['win_rate']:.2f}%",
            'Trades': str(results['num_trades']),
            'Time': f"{results['training_time']:.1f}s"
        })

# Print table header
headers = ['Experiment', 'Status', 'Return', 'Sharpe', 'Drawdown', 'Win Rate', 'Trades', 'Time']
col_widths = [35, 8, 12, 10, 12, 12, 8, 10]

header_row = ""
for header, width in zip(headers, col_widths):
    header_row += f"{header:<{width}}"
print(header_row)
print("-" * sum(col_widths))  # Separator spans the full table width

# Print table rows
for row in comparison_data:
    row_str = ""
    for header, width in zip(headers, col_widths):
        row_str += f"{row[header]:<{width}}"
    print(row_str)

print("\n" + "="*80)
print("📈 DETAILED RESULTS BY EXPERIMENT")
print("="*80)

for exp_key, results in experiment_results.items():
    if 'error' in results:
        print(f"\n❌ {results['config']['name']}")
        print(f"   Error: {results['error']}")
        continue
    
    print(f"\n{results['config']['name']}")
    print("-" * 80)
    print(f"Performance Metrics:")
    print(f"  Total Return:          {results['total_return']:+.2f}%")
    print(f"  Buy & Hold Return:     {results['buy_hold_return']:+.2f}%")
    print(f"  Outperformance:        {results['outperformance']:+.2f}%")
    print(f"  Sharpe Ratio:          {results['sharpe']:.2f}")
    print(f"  Max Drawdown:          {results['max_drawdown']:.2f}%")
    print(f"  Win Rate:              {results['win_rate']:.2f}%")
    print(f"  Total Trades:          {results['num_trades']}")
    print(f"  Training Time:         {results['training_time']:.2f}s")
    
    print(f"\nAction Distribution:")
    for action_name, pct in results['action_dist'].items():
        bar = "█" * int(pct / 2)
        print(f"  {action_name:12} {pct:5.1f}% {bar}")
    
    print(f"\nConfiguration:")
    for key, value in results['config'].items():
        if key != 'name':
            print(f"  {key:25} {value}")

print("\n" + "="*80)
print("🏆 BEST PERFORMERS")
print("="*80)

# Find best performers
valid_results = {k: v for k, v in experiment_results.items() if 'error' not in v}

if valid_results:
    best_return = max(valid_results.items(), key=lambda x: x[1]['total_return'])
    best_sharpe = max(valid_results.items(), key=lambda x: x[1]['sharpe'])
    best_drawdown = min(valid_results.items(), key=lambda x: abs(x[1]['max_drawdown']))  # Smallest drawdown magnitude is best, whatever the sign convention
    best_winrate = max(valid_results.items(), key=lambda x: x[1]['win_rate'])
    
    print(f"Best Return:       {best_return[1]['config']['name']:40} {best_return[1]['total_return']:+.2f}%")
    print(f"Best Sharpe Ratio: {best_sharpe[1]['config']['name']:40} {best_sharpe[1]['sharpe']:.2f}")
    print(f"Best Drawdown:     {best_drawdown[1]['config']['name']:40} {best_drawdown[1]['max_drawdown']:.2f}%")
    print(f"Best Win Rate:     {best_winrate[1]['config']['name']:40} {best_winrate[1]['win_rate']:.2f}%")
    
    print(f"\n📌 RECOMMENDATION")
    print("-" * 80)
    
    # Calculate composite score
    scores = {}
    for exp_key, results in valid_results.items():
        # Scale each metric so that higher is better; abs() makes the drawdown
        # term a penalty whether max_drawdown is stored as positive or negative
        score = (
            results['total_return'] / 100 * 0.3 +        # 30% weight on returns
            results['sharpe'] / 2 * 0.3 +                 # 30% weight on Sharpe
            -abs(results['max_drawdown']) / 20 * 0.2 +    # 20% weight on drawdown
            results['win_rate'] / 100 * 0.2               # 20% weight on win rate
        )
        scores[exp_key] = score
    
    best_overall = max(scores.items(), key=lambda x: x[1])
    best_exp = valid_results[best_overall[0]]
    
    print(f"Best Overall:      {best_exp['config']['name']}")
    print(f"  Composite Score: {best_overall[1]:.4f}")
    print(f"  Total Return:    {best_exp['total_return']:+.2f}%")
    print(f"  Sharpe Ratio:    {best_exp['sharpe']:.2f}")
    print(f"  Max Drawdown:    {best_exp['max_drawdown']:.2f}%")
    print(f"  Win Rate:        {best_exp['win_rate']:.2f}%")
else:
    print("No valid results to compare.")

print("\n" + "="*80)
print("CLAUDE_RESULTS_END")
print("="*80)

In [None]:
try:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Filter valid results
    valid_results = {k: v for k, v in experiment_results.items() if 'error' not in v}
    
    if not valid_results:
        print("No valid results to visualize.")
    else:
        # Plot 1: Portfolio value comparison
        ax1 = axes[0, 0]
        colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
        for idx, (exp_key, results) in enumerate(valid_results.items()):
            steps = range(len(results['portfolio_values']))
            ax1.plot(steps, results['portfolio_values'], 
                    label=results['config']['name'], 
                    linewidth=2, 
                    color=colors[idx % len(colors)])
        
        ax1.axhline(y=10000, color='gray', linestyle='--', alpha=0.5, label='Initial Balance')
        ax1.set_title('Portfolio Value Over Time - All Experiments', fontsize=14, fontweight='bold')
        ax1.set_xlabel('Steps')
        ax1.set_ylabel('Portfolio Value ($)')
        ax1.legend(loc='best', fontsize=9)
        ax1.grid(True, alpha=0.3)
        
        # Plot 2: Returns comparison (bar chart)
        ax2 = axes[0, 1]
        exp_names = [v['config']['name'][:20] for v in valid_results.values()]
        returns = [v['total_return'] for v in valid_results.values()]
        buy_hold = [v['buy_hold_return'] for v in valid_results.values()]
        
        x = np.arange(len(exp_names))
        width = 0.35
        
        bars1 = ax2.bar(x - width/2, returns, width, label='Agent Return', color='#2ca02c')
        bars2 = ax2.bar(x + width/2, buy_hold, width, label='Buy & Hold', color='#d62728')
        
        ax2.set_title('Total Return Comparison', fontsize=14, fontweight='bold')
        ax2.set_ylabel('Return (%)')
        ax2.set_xticks(x)
        ax2.set_xticklabels(exp_names, rotation=45, ha='right', fontsize=8)
        ax2.legend()
        ax2.grid(True, alpha=0.3, axis='y')
        ax2.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
        
        # Add value labels on bars
        for bars in [bars1, bars2]:
            for bar in bars:
                height = bar.get_height()
                ax2.text(bar.get_x() + bar.get_width()/2., height,
                        f'{height:.1f}%',
                        ha='center', va='bottom' if height > 0 else 'top', 
                        fontsize=7)
        
        # Plot 3: Sharpe Ratio & Max Drawdown
        ax3 = axes[1, 0]
        sharpe_ratios = [v['sharpe'] for v in valid_results.values()]
        max_drawdowns = [abs(v['max_drawdown']) for v in valid_results.values()]
        
        x = np.arange(len(exp_names))
        
        ax3_twin = ax3.twinx()
        
        bars1 = ax3.bar(x - width/2, sharpe_ratios, width, label='Sharpe Ratio', color='#1f77b4', alpha=0.8)
        bars2 = ax3_twin.bar(x + width/2, max_drawdowns, width, label='Max Drawdown (abs)', color='#ff7f0e', alpha=0.8)
        
        ax3.set_title('Risk-Adjusted Metrics', fontsize=14, fontweight='bold')
        ax3.set_ylabel('Sharpe Ratio', color='#1f77b4')
        ax3_twin.set_ylabel('Max Drawdown (%) [abs]', color='#ff7f0e')
        ax3.set_xticks(x)
        ax3.set_xticklabels(exp_names, rotation=45, ha='right', fontsize=8)
        ax3.tick_params(axis='y', labelcolor='#1f77b4')
        ax3_twin.tick_params(axis='y', labelcolor='#ff7f0e')
        ax3.grid(True, alpha=0.3, axis='y')
        
        # Add legends
        lines1, labels1 = ax3.get_legend_handles_labels()
        lines2, labels2 = ax3_twin.get_legend_handles_labels()
        ax3.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=9)
        
        # Plot 4: Win Rate comparison
        ax4 = axes[1, 1]
        win_rates = [v['win_rate'] for v in valid_results.values()]
        
        bars = ax4.barh(exp_names, win_rates, color=colors[:len(exp_names)])
        ax4.set_title('Win Rate Comparison', fontsize=14, fontweight='bold')
        ax4.set_xlabel('Win Rate (%)')
        ax4.axvline(x=50, color='red', linestyle='--', alpha=0.5, label='50% (Random)')
        ax4.legend()
        ax4.grid(True, alpha=0.3, axis='x')
        
        # Add value labels
        for i, (bar, val) in enumerate(zip(bars, win_rates)):
            ax4.text(val + 1, i, f'{val:.1f}%', va='center', fontsize=9)
        
        plt.tight_layout()
        plt.savefig('experiment_comparison.png', dpi=150, bbox_inches='tight')
        plt.show()
        
        print("\n✅ Visualization complete")
        print("Plot saved as: experiment_comparison.png")

except Exception as e:
    print(f"\n❌ ERROR during visualization:")
    print(f"Error: {str(e)}")
    traceback.print_exc()

---
## SECTION 7: Save Results

Save all results to a downloadable text file for Claude to review.
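
One way such a text file could be written is sketched below. The function name `write_results_report`, the default filename, and the choice of fields are assumptions for illustration; the only things taken from the notebook are the `CLAUDE_RESULTS_START`/`CLAUDE_RESULTS_END` markers and the result keys that Section 6 reads from `experiment_results`.

```python
from pathlib import Path

def write_results_report(experiment_results, path="claude_results.txt"):
    """Write a plain-text summary of each experiment (hypothetical helper)."""
    lines = ["CLAUDE_RESULTS_START"]
    for exp_key, results in experiment_results.items():
        # Fall back to the experiment key if no display name is stored
        name = results.get("config", {}).get("name", exp_key)
        if "error" in results:
            lines.append(f"{name}: ERROR - {results['error']}")
            continue
        lines.append(
            f"{name}: return={results['total_return']:+.2f}% "
            f"sharpe={results['sharpe']:.2f} "
            f"drawdown={results['max_drawdown']:.2f}% "
            f"win_rate={results['win_rate']:.2f}% "
            f"trades={results['num_trades']}"
        )
    lines.append("CLAUDE_RESULTS_END")
    out = Path(path)
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

In Colab, the written file can then be offered as a browser download with `from google.colab import files` followed by `files.download(str(out))`.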

---
## Summary

This notebook implements a **multi-experiment RL trading agent** with an A/B testing framework:

**What's included:**
1. **Setup**: Installed dependencies (stable-baselines3, gymnasium, etc.)
2. **Experiment Configuration**: 4 experiments to compare different approaches
   - Baseline (Simple PnL reward)
   - Sharpe-based Reward (risk-adjusted returns + transaction penalties)
   - Normalized Observations (feature scaling for better learning)
   - Best Combo (Sharpe + Normalized + higher entropy)
3. **Data**: Generated synthetic market data (OHLCV + technical indicators) and news sentiment
4. **Environment**: Custom Gymnasium trading environment with:
   - Configurable reward functions (simple_pnl, sharpe_based)
   - Optional observation normalization
   - Transaction and action repeat penalties
   - Observation space: market (15), news (6), portfolio (5) features
   - Action space: 7 discrete actions (HOLD, BUY 25/50/100%, SELL 25/50/100%)
5. **Training**: Trained PPO agent for each experiment (5 epochs × 10K timesteps = 50K total)
6. **Results**: Comprehensive comparison table with:
   - Performance metrics (Return, Sharpe, Drawdown, Win Rate)
   - Detailed action distributions
   - Best performer identification
   - Composite scoring for overall recommendation
7. **Visualization**: Multi-panel comparison plots
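
The composite scoring behind the overall recommendation can be written as a small function. The weights and divisors below mirror the Section 6 cell. The function name `composite_score` and the sample numbers are illustrative, and applying `abs()` to the drawdown is an assumption that drawdown should count as a penalty regardless of the sign it is stored with.

```python
def composite_score(total_return, sharpe, max_drawdown, win_rate):
    """Weighted sum of scaled metrics; higher is better (illustrative sketch)."""
    return (
        total_return / 100 * 0.3          # 30% weight on return (in %)
        + sharpe / 2 * 0.3                # 30% weight on Sharpe ratio
        - abs(max_drawdown) / 20 * 0.2    # 20% weight, drawdown is a penalty
        + win_rate / 100 * 0.2            # 20% weight on win rate (in %)
    )

# Made-up example values: 12% return, Sharpe 1.0, 8% drawdown, 55% win rate
score = composite_score(12.0, 1.0, 8.0, 55.0)
print(f"{score:.4f}")  # 0.2160
```

Note that the divisors only rescale the metrics rather than normalizing them: a Sharpe ratio of 2 contributes as much to the score as a 100% return, so the Sharpe term tends to dominate for realistic values.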

**How to use this notebook:**
1. Open in Google Colab: https://colab.research.google.com/github/AssTrahanec/rl-trading-agent/blob/main/colab_notebooks/rl_training.ipynb
2. Runtime → Run All
3. Wait for all experiments to complete (~10-15 minutes for 4 experiments)
4. Copy everything between CLAUDE_RESULTS_START/END markers
5. Share with Claude for analysis
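
Step 4 can also be done programmatically when the cell output has been saved as text. The helper below is a hypothetical sketch and not part of the notebook; only the two marker strings come from Section 6. It also trims the `=` rule lines that the notebook prints around the markers.

```python
def extract_claude_results(output_text,
                           start="CLAUDE_RESULTS_START",
                           end="CLAUDE_RESULTS_END"):
    """Return the text between the first start marker and the next end marker.

    Returns None if either marker is missing.
    """
    start_idx = output_text.find(start)
    if start_idx == -1:
        return None
    body_start = start_idx + len(start)
    end_idx = output_text.find(end, body_start)
    if end_idx == -1:
        return None
    # Drop surrounding whitespace and the "=====" separator lines
    return output_text[body_start:end_idx].strip("= \n\r\t")
```

For example, `extract_claude_results(open("output.txt", encoding="utf-8").read())` would return just the results block from a saved log, where `output.txt` is a placeholder filename.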

**Iterative workflow:**
- Claude analyzes results → identifies best approaches
- Claude updates experiment configurations or adds new experiments
- Claude commits and pushes to GitHub
- User refreshes Colab (F5) → Run All → Copy results
- Repeat until performance is satisfactory

**Possible next improvements:**
- Add more experiments (different algorithms: SAC, A2C)
- Test different hyperparameters (learning rate, entropy coefficient)
- Implement real market data (yfinance, ccxt)
- Add FinBERT for real news sentiment analysis
- Implement walk-forward validation
- Add more sophisticated reward functions
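
Walk-forward validation, listed above, trains on one window of time and evaluates on the window that immediately follows, then slides both forward, so the agent is never tested on data that precedes its training data. A minimal sketch of the index bookkeeping follows; the function name and window sizes are hypothetical, and it assumes a rolling (fixed-length) training window rather than an expanding one.

```python
def walk_forward_splits(n_steps, train_size, test_size):
    """Yield (train_range, test_range) pairs that roll forward in time."""
    start = 0
    while start + train_size + test_size <= n_steps:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # Slide the window forward by one test block

# Tiny example: 10 time steps, train on 4, test on the next 2
for train, test in walk_forward_splits(n_steps=10, train_size=4, test_size=2):
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

Each pair of ranges could be used to slice the market and news data before training and evaluating one model per fold; averaging the per-fold metrics would then give a less optimistic estimate than a single train/test split.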