# RL News Trading Agent - Google Colab Training

Single-file implementation for iterative development.

**Sections:**
1. Setup - Install dependencies
2. Data Collection - Generate synthetic market & news data
3. Environment - Custom Gymnasium trading environment
4. Training - Train PPO agent
5. Results - Display metrics with clear markers
6. Visualization - Training progress plots
7. Save Results - Export results to downloadable file

---
## SECTION 1: Setup

In [None]:
# Install required packages
!pip install "stable-baselines3[extra]" gymnasium transformers yfinance ccxt ta plotly -q

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO, SAC, A2C
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import BaseCallback
import traceback
import time

print("✅ Setup complete")
print(f"Gymnasium version: {gym.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

---
## SECTION 2: Data Collection

Generate synthetic market data and news sentiment for testing.
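
The price series generated in this section is a geometric random walk: daily log-returns are drawn from a normal distribution and compounded from a starting price of 100. A minimal sketch of that idea follows. The helper name `simulate_prices` is hypothetical, and the sketch uses Python's `random` module instead of NumPy, so its numbers will differ from the notebook's.

```python
import math
import random

def simulate_prices(n_days, mu=0.0005, sigma=0.02, start=100.0, seed=42):
    """Compound normally distributed daily log-returns into a price path."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    prices = []
    log_price = math.log(start)
    for _ in range(n_days):
        log_price += rng.gauss(mu, sigma)  # one day's log-return
        prices.append(math.exp(log_price))
    return prices

prices = simulate_prices(5)
print([round(p, 2) for p in prices])
```

Working in log space keeps every simulated price strictly positive, which is why the notebook exponentiates a cumulative sum rather than adding returns directly.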

In [None]:
def generate_synthetic_market_data(n_days=365):
    """
    Generate synthetic OHLCV data with technical indicators.
    """
    dates = pd.date_range(end=datetime.now(), periods=n_days, freq='D')
    
    # Simulate price with random walk + trend
    np.random.seed(42)
    returns = np.random.normal(0.0005, 0.02, n_days)  # Mean daily return ~0.05%, volatility 2%
    prices = 100 * np.exp(np.cumsum(returns))
    
    # Generate OHLCV
    df = pd.DataFrame({
        'timestamp': dates,
        'open': prices * np.random.uniform(0.98, 1.0, n_days),
        'high': prices * np.random.uniform(1.0, 1.02, n_days),
        'low': prices * np.random.uniform(0.97, 1.0, n_days),
        'close': prices,
        'volume': np.random.uniform(1e6, 5e6, n_days)
    })
    
    # Technical indicators
    df['returns_1d'] = df['close'].pct_change()
    df['returns_7d'] = df['close'].pct_change(7)
    
    # RSI
    delta = df['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['rsi_14'] = 100 - (100 / (1 + rs))
    
    # MACD
    ema_12 = df['close'].ewm(span=12).mean()
    ema_26 = df['close'].ewm(span=26).mean()
    df['macd'] = ema_12 - ema_26
    df['macd_signal'] = df['macd'].ewm(span=9).mean()
    
    # Bollinger Bands
    sma_20 = df['close'].rolling(window=20).mean()
    std_20 = df['close'].rolling(window=20).std()
    df['bollinger_upper'] = sma_20 + (std_20 * 2)
    df['bollinger_lower'] = sma_20 - (std_20 * 2)
    
    # ATR (Average True Range)
    high_low = df['high'] - df['low']
    high_close = np.abs(df['high'] - df['close'].shift())
    low_close = np.abs(df['low'] - df['close'].shift())
    ranges = pd.concat([high_low, high_close, low_close], axis=1)
    true_range = np.max(ranges, axis=1)
    df['atr_14'] = true_range.rolling(14).mean()
    
    # Volume ratio
    df['volume_ratio'] = df['volume'] / df['volume'].rolling(20).mean()
    
    df.fillna(0, inplace=True)
    return df

def generate_synthetic_news_data(n_days=365):
    """
    Generate synthetic news sentiment data.
    """
    dates = pd.date_range(end=datetime.now(), periods=n_days, freq='D')
    
    np.random.seed(43)
    
    # Sentiment follows a random walk between -1 and 1
    sentiment_base = np.cumsum(np.random.normal(0, 0.1, n_days))
    sentiment_base = np.clip(sentiment_base, -3, 3) / 3  # Normalize to [-1, 1]
    
    df = pd.DataFrame({
        'timestamp': dates,
        'sentiment_1h': sentiment_base + np.random.normal(0, 0.1, n_days),
        'sentiment_24h': sentiment_base,
        'sentiment_7d': pd.Series(sentiment_base).rolling(7).mean().fillna(0).values,
        'sentiment_trend': pd.Series(sentiment_base).diff().fillna(0).values,
        'news_volume': np.random.poisson(20, n_days),
        'news_velocity': np.random.uniform(0.5, 2.0, n_days)
    })
    
    # Clip sentiment to [-1, 1]
    for col in ['sentiment_1h', 'sentiment_24h', 'sentiment_7d', 'sentiment_trend']:
        df[col] = np.clip(df[col], -1, 1)
    
    return df

# Generate data
market_data = generate_synthetic_market_data(n_days=500)
news_data = generate_synthetic_news_data(n_days=500)

print("✅ Data generation complete")
print(f"Market data shape: {market_data.shape}")
print(f"News data shape: {news_data.shape}")
print(f"\nMarket data sample:")
print(market_data.head())
print(f"\nNews data sample:")
print(news_data.head())

---
## SECTION 3: Environment

Custom Gymnasium environment for trading.
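
The core of the environment is the mapping from the seven discrete actions to cash and share changes. The sketch below isolates that bookkeeping outside the Gymnasium class so it can be checked by hand. `ACTION_TABLE` and `apply_action` are hypothetical names introduced for illustration; the arithmetic is meant to mirror the `step` method in the cell below, with commission deducted from shares on a buy and from proceeds on a sell.

```python
# Fractions of cash (BUY) or of the position (SELL) for actions 1-6; 0 is HOLD.
ACTION_TABLE = {
    0: ("HOLD", 0.0),
    1: ("BUY", 0.25), 2: ("BUY", 0.5), 3: ("BUY", 1.0),
    4: ("SELL", 0.25), 5: ("SELL", 0.5), 6: ("SELL", 1.0),
}

def apply_action(balance, shares, price, action, commission=0.001):
    """Return (balance, shares) after executing one discrete action."""
    side, fraction = ACTION_TABLE[int(action)]
    if side == "BUY":
        spend = balance * fraction
        # Commission is taken out of the shares received.
        shares += (spend / price) * (1 - commission)
        balance -= spend
    elif side == "SELL":
        sold = shares * fraction
        # Commission is taken out of the cash proceeds.
        balance += sold * price * (1 - commission)
        shares -= sold
    return balance, shares

# Round trip at a flat price: buy with all cash, then sell everything.
balance, shares = apply_action(10_000.0, 0.0, 100.0, action=3)
balance, shares = apply_action(balance, shares, 100.0, action=6)
print(round(balance, 2), shares)
```

A full round trip at an unchanged price loses roughly 0.2% of the starting balance, which is the cost the agent has to overcome before any trade is profitable.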

In [None]:
class TradingEnv(gym.Env):
    """
    Custom trading environment compatible with Stable Baselines3.
    
    Observation Space:
        - market: 15 technical indicators
        - news: 6 sentiment features
        - portfolio: 5 position metrics
    
    Action Space:
        - Discrete(7): HOLD, BUY_25%, BUY_50%, BUY_100%, SELL_25%, SELL_50%, SELL_100%
    """
    
    def __init__(self, market_data, news_data, initial_balance=10000, commission=0.001):
        super(TradingEnv, self).__init__()
        
        self.market_data = market_data.reset_index(drop=True)
        self.news_data = news_data.reset_index(drop=True)
        self.initial_balance = initial_balance
        self.commission = commission
        
        # Action space: 0=HOLD, 1-3=BUY, 4-6=SELL
        self.action_space = spaces.Discrete(7)
        
        # Observation space
        self.observation_space = spaces.Dict({
            'market': spaces.Box(low=-np.inf, high=np.inf, shape=(15,), dtype=np.float32),
            'news': spaces.Box(low=-1, high=1, shape=(6,), dtype=np.float32),
            'portfolio': spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)
        })
        
        self.reset()
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        
        self.current_step = 50  # Start after warm-up period for indicators
        self.balance = self.initial_balance
        self.shares_held = 0
        self.total_value = self.initial_balance
        self.trades = []
        self.portfolio_values = [self.initial_balance]
        
        return self._get_observation(), {}
    
    def _get_observation(self):
        """Get current observation."""
        row = self.market_data.iloc[self.current_step]
        news_row = self.news_data.iloc[self.current_step]
        
        # Market features (15)
        market_features = np.array([
            row['close'] / 100,  # Normalized price
            row['returns_1d'],
            row['returns_7d'],
            row['rsi_14'] / 100,
            row['macd'] / row['close'] if row['close'] > 0 else 0,
            row['macd_signal'] / row['close'] if row['close'] > 0 else 0,
            (row['close'] - row['bollinger_lower']) / (row['bollinger_upper'] - row['bollinger_lower']) if row['bollinger_upper'] != row['bollinger_lower'] else 0.5,
            row['atr_14'] / row['close'] if row['close'] > 0 else 0,
            row['volume_ratio'],
            row['volume'] / 1e6,  # Normalized volume
            (row['high'] - row['low']) / row['close'] if row['close'] > 0 else 0,
            (row['close'] - row['open']) / row['open'] if row['open'] > 0 else 0,
            row['high'] / row['close'] if row['close'] > 0 else 1,
            row['low'] / row['close'] if row['close'] > 0 else 1,
            1.0 if self.current_step == 0 else row['volume'] / self.market_data.iloc[self.current_step-1]['volume']  # Volume change vs. previous step
        ], dtype=np.float32)
        
        # News features (6)
        news_features = np.array([
            news_row['sentiment_1h'],
            news_row['sentiment_24h'],
            news_row['sentiment_7d'],
            news_row['sentiment_trend'],
            news_row['news_volume'] / 50,  # Normalized
            news_row['news_velocity'] / 2  # Normalized: the raw range [0.5, 2.0] would exceed the declared [-1, 1] bounds
        ], dtype=np.float32)
        
        # Portfolio features (5)
        current_price = row['close']
        portfolio_value = self.balance + self.shares_held * current_price
        
        portfolio_features = np.array([
            self.balance / self.initial_balance,  # Cash ratio
            self.shares_held * current_price / self.initial_balance if self.initial_balance > 0 else 0,  # Position ratio
            portfolio_value / self.initial_balance - 1,  # Return
            self.shares_held / 100 if self.shares_held > 0 else 0,  # Normalized shares
            len(self.trades) / 100  # Normalized trade count
        ], dtype=np.float32)
        
        return {
            'market': market_features,
            'news': news_features,
            'portfolio': portfolio_features
        }
    
    def step(self, action):
        """Execute one time step."""
        current_price = self.market_data.iloc[self.current_step]['close']
        
        # Execute action
        if action == 0:  # HOLD
            pass
        elif action in [1, 2, 3]:  # BUY
            buy_pct = [0.25, 0.5, 1.0][action - 1]
            amount_to_invest = self.balance * buy_pct
            shares_to_buy = (amount_to_invest / current_price) * (1 - self.commission)
            
            if shares_to_buy > 0:
                self.shares_held += shares_to_buy
                self.balance -= amount_to_invest
                self.trades.append({
                    'step': self.current_step,
                    'action': 'BUY',
                    'shares': shares_to_buy,
                    'price': current_price
                })
        
        elif action in [4, 5, 6]:  # SELL
            sell_pct = [0.25, 0.5, 1.0][action - 4]
            shares_to_sell = self.shares_held * sell_pct
            
            if shares_to_sell > 0:
                self.balance += shares_to_sell * current_price * (1 - self.commission)
                self.shares_held -= shares_to_sell
                self.trades.append({
                    'step': self.current_step,
                    'action': 'SELL',
                    'shares': shares_to_sell,
                    'price': current_price
                })
        
        # Calculate portfolio value
        portfolio_value = self.balance + self.shares_held * current_price
        self.portfolio_values.append(portfolio_value)
        
        # Calculate reward
        reward = (portfolio_value - self.total_value) / self.total_value
        self.total_value = portfolio_value
        
        # Move to next step
        self.current_step += 1
        
        # Check if episode is done
        done = self.current_step >= len(self.market_data) - 1
        truncated = False
        
        return self._get_observation(), reward, done, truncated, {}
    
    def render(self):
        """Render the environment (optional)."""
        current_price = self.market_data.iloc[self.current_step]['close']
        portfolio_value = self.balance + self.shares_held * current_price
        profit = ((portfolio_value / self.initial_balance) - 1) * 100
        
        print(f"Step: {self.current_step} | Price: ${current_price:.2f} | "
              f"Balance: ${self.balance:.2f} | Shares: {self.shares_held:.2f} | "
              f"Portfolio: ${portfolio_value:.2f} | Profit: {profit:.2f}%")

# Test environment
print("Creating trading environment...")
env = TradingEnv(market_data, news_data)
print(f"✅ Environment created")
print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

# Test reset and step
obs, info = env.reset()
print(f"\nInitial observation shapes:")
print(f"  - market: {obs['market'].shape}")
print(f"  - news: {obs['news'].shape}")
print(f"  - portfolio: {obs['portfolio'].shape}")

# Test random action
action = env.action_space.sample()
obs, reward, done, truncated, info = env.step(action)
print(f"\nTest step: action={action}, reward={reward:.6f}, done={done}")

---
## SECTION 4: Training

Train a PPO agent in five training rounds ("epochs"), running a quick evaluation episode after each one.
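
One detail worth checking when reading the timestep counts: PPO gathers experience in whole rollouts of `n_steps`, so a `learn()` call can overshoot its requested budget. The sketch below reproduces that arithmetic for the settings used in this section. It assumes a single environment, and it assumes that `reset_num_timesteps=False` adds the requested budget to the model's running count, which is how Stable Baselines3 appears to behave. The helper `simulate_learn_calls` is hypothetical and does not call the library.

```python
def simulate_learn_calls(n_calls, budget_per_call, n_steps):
    """Return the running timestep count after each simulated learn() call."""
    num_timesteps = 0
    history = []
    for _ in range(n_calls):
        target = num_timesteps + budget_per_call  # budget is added to the running count
        while num_timesteps < target:
            num_timesteps += n_steps  # one full rollout at a time
        history.append(num_timesteps)
    return history

counts = simulate_learn_calls(n_calls=5, budget_per_call=10_000, n_steps=2048)
print(counts)  # [10240, 20480, 30720, 40960, 51200]
```

Under these assumptions the agent would see 51,200 environment steps rather than the nominal 50,000 tracked by the `total_timesteps` variable. Printing `model.num_timesteps` after training is a simple way to confirm the actual figure.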

In [None]:
class ProgressCallback(BaseCallback):
    """
    Custom callback for logging training progress.
    """
    def __init__(self, check_freq, verbose=1):
        super(ProgressCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.episode_rewards = []
        self.episode_lengths = []
    
    def _on_step(self):
        if self.n_calls % self.check_freq == 0:
            # Episode stats are read from the model's ep_info_buffer (filled by the Monitor wrapper)
            if len(self.model.ep_info_buffer) > 0:
                mean_reward = np.mean([ep_info['r'] for ep_info in self.model.ep_info_buffer])
                mean_length = np.mean([ep_info['l'] for ep_info in self.model.ep_info_buffer])
                self.episode_rewards.append(mean_reward)
                self.episode_lengths.append(mean_length)
                
                if self.verbose > 0:
                    print(f"Step: {self.n_calls} | Mean reward: {mean_reward:.4f} | Mean ep length: {mean_length:.1f}")
        
        return True

try:
    print("Initializing PPO agent...")
    
    # Create environment
    train_env = TradingEnv(market_data, news_data)
    
    # Initialize PPO model
    model = PPO(
        "MultiInputPolicy",
        train_env,
        learning_rate=3e-4,
        n_steps=2048,
        batch_size=64,
        n_epochs=10,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,
        verbose=1
    )
    
    print("✅ Model initialized")
    print(f"Policy: {model.policy}")
    print(f"\nStarting training...\n")
    
    # Progressive training with checkpoints
    callback = ProgressCallback(check_freq=2048, verbose=1)
    start_time = time.time()
    
    # Train for multiple epochs with intermediate evaluations
    total_timesteps = 0
    epochs = 5
    timesteps_per_epoch = 10000
    
    for epoch in range(epochs):
        print(f"\n{'='*60}")
        print(f"EPOCH {epoch + 1}/{epochs}")
        print(f"{'='*60}")
        
        model.learn(
            total_timesteps=timesteps_per_epoch,
            callback=callback,
            reset_num_timesteps=False
        )
        
        total_timesteps += timesteps_per_epoch
        
        # Quick evaluation
        eval_env = TradingEnv(market_data, news_data)
        obs, info = eval_env.reset()
        done = False
        episode_reward = 0
        
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, truncated, info = eval_env.step(action)
            episode_reward += reward
            done = done or truncated
        
        final_value = eval_env.balance + eval_env.shares_held * eval_env.market_data.iloc[eval_env.current_step - 1]['close']
        total_return = (final_value / eval_env.initial_balance - 1) * 100
        
        print(f"\nEpoch {epoch + 1} Evaluation:")
        print(f"  Episode reward: {episode_reward:.4f}")
        print(f"  Total return: {total_return:.2f}%")
        print(f"  Final portfolio value: ${final_value:.2f}")
        print(f"  Number of trades: {len(eval_env.trades)}")
    
    training_time = time.time() - start_time
    print(f"\n✅ Training complete in {training_time:.2f} seconds")
    print(f"Total timesteps: {total_timesteps}")

except Exception as e:
    print(f"\n❌ ERROR during training:")
    print(f"Error type: {type(e).__name__}")
    print(f"Error message: {str(e)}")
    print(f"\nFull traceback:")
    traceback.print_exc()

---
## SECTION 5: Results

Display final metrics with clear markers for Claude to parse.
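
Two of the reported metrics, the Sharpe ratio and the maximum drawdown, are computed from the list of portfolio values alone. The sketch below restates both calculations with only the standard library so they can be verified on a tiny series. The function names are hypothetical. As in the notebook, the risk-free rate is taken as zero and the ratio is annualized with 252 periods per year.

```python
import math
import statistics

def step_returns(values):
    """Simple return between consecutive portfolio values."""
    return [curr / prev - 1 for prev, curr in zip(values, values[1:])]

def annualized_sharpe(values, periods_per_year=252):
    """Mean over standard deviation of step returns, scaled by sqrt(periods)."""
    returns = step_returns(values)
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population std, like np.std's default
    return mean / (std + 1e-9) * math.sqrt(periods_per_year)

def max_drawdown(values):
    """Largest peak-to-trough loss as a negative fraction of the peak."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)
        worst = min(worst, (value - peak) / peak)
    return worst

values = [100.0, 110.0, 99.0, 120.0]
print(f"Sharpe: {annualized_sharpe(values):.2f}")
print(f"Max drawdown: {max_drawdown(values) * 100:.2f}%")
```

In the example series the portfolio peaks at 110 and falls to 99 before recovering, so the drawdown is measured against 110 rather than against the starting value.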

In [None]:
try:
    print("\n" + "="*60)
    print("CLAUDE_RESULTS_START")
    print("="*60)
    
    # Final evaluation on full dataset
    eval_env = TradingEnv(market_data, news_data, initial_balance=10000)
    obs, info = eval_env.reset()
    done = False
    
    actions_taken = []
    rewards_list = []
    
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        actions_taken.append(int(action))  # predict() returns a NumPy value; store a plain int for counting
        obs, reward, done, truncated, info = eval_env.step(action)
        rewards_list.append(reward)
        done = done or truncated
    
    # Calculate metrics
    final_price = eval_env.market_data.iloc[eval_env.current_step - 1]['close']
    final_value = eval_env.balance + eval_env.shares_held * final_price
    total_return = (final_value / eval_env.initial_balance - 1) * 100
    
    # Calculate buy & hold baseline
    initial_price = eval_env.market_data.iloc[50]['close']
    buy_hold_return = ((final_price / initial_price) - 1) * 100
    
    # Sharpe ratio (simplified)
    returns_array = np.array(eval_env.portfolio_values[1:]) / np.array(eval_env.portfolio_values[:-1]) - 1
    sharpe = np.mean(returns_array) / (np.std(returns_array) + 1e-9) * np.sqrt(252)
    
    # Max drawdown
    portfolio_values = np.array(eval_env.portfolio_values)
    running_max = np.maximum.accumulate(portfolio_values)
    drawdown = (portfolio_values - running_max) / running_max
    max_drawdown = np.min(drawdown) * 100
    
    # Win rate: share of steps with a positive reward (per step, not per closed trade)
    winning_steps = sum(1 for r in rewards_list if r > 0)
    win_rate = (winning_steps / len(rewards_list) * 100) if len(rewards_list) > 0 else 0
    
    # Print results
    print(f"\n📊 FINAL RESULTS")
    print(f"{'-'*60}")
    print(f"Initial Balance:        ${eval_env.initial_balance:,.2f}")
    print(f"Final Portfolio Value:  ${final_value:,.2f}")
    print(f"Total Return:           {total_return:+.2f}%")
    print(f"Buy & Hold Return:      {buy_hold_return:+.2f}%")
    print(f"Outperformance:         {total_return - buy_hold_return:+.2f}%")
    print(f"")
    print(f"Sharpe Ratio:           {sharpe:.2f}")
    print(f"Max Drawdown:           {max_drawdown:.2f}%")
    print(f"Win Rate:               {win_rate:.2f}%")
    print(f"Total Trades:           {len(eval_env.trades)}")
    print(f"Training Time:          {training_time:.2f}s")
    print(f"Total Timesteps:        {total_timesteps:,}")
    print(f"")
    
    # Action distribution
    action_names = ['HOLD', 'BUY_25%', 'BUY_50%', 'BUY_100%', 'SELL_25%', 'SELL_50%', 'SELL_100%']
    print(f"📈 ACTION DISTRIBUTION")
    print(f"{'-'*60}")
    for i, name in enumerate(action_names):
        count = actions_taken.count(i)
        pct = (count / len(actions_taken) * 100) if len(actions_taken) > 0 else 0
        print(f"{name:12} {count:5d} ({pct:5.1f}%)")
    
    print(f"\n{'='*60}")
    print("CLAUDE_RESULTS_END")
    print("="*60)

except Exception as e:
    print("\n" + "="*60)
    print("CLAUDE_RESULTS_START")
    print("="*60)
    print(f"\n❌ ERROR during evaluation:")
    print(f"Error type: {type(e).__name__}")
    print(f"Error message: {str(e)}")
    print(f"\nFull traceback:")
    traceback.print_exc()
    print(f"\n{'='*60}")
    print("CLAUDE_RESULTS_END")
    print("="*60)

---
## SECTION 6: Visualization

Plot training progress and portfolio performance.

In [None]:
try:
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot 1: Portfolio value over time
    ax1 = axes[0, 0]
    steps = range(len(eval_env.portfolio_values))
    ax1.plot(steps, eval_env.portfolio_values, label='Agent Portfolio', linewidth=2)
    ax1.axhline(y=eval_env.initial_balance, color='gray', linestyle='--', label='Initial Balance')
    ax1.set_title('Portfolio Value Over Time', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Steps')
    ax1.set_ylabel('Portfolio Value ($)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Price and trades
    ax2 = axes[0, 1]
    price_steps = range(50, 50 + len(eval_env.portfolio_values))
    prices = eval_env.market_data.iloc[50:50+len(eval_env.portfolio_values)]['close'].values
    ax2.plot(price_steps, prices, label='Price', linewidth=2, alpha=0.7)
    
    # Mark buy/sell trades (one scatter call per side so the legend labels match the markers)
    buys = [t for t in eval_env.trades if t['action'] == 'BUY']
    sells = [t for t in eval_env.trades if t['action'] == 'SELL']
    ax2.scatter([t['step'] for t in buys], [t['price'] for t in buys],
                color='green', marker='^', s=100, alpha=0.6, zorder=5, label='Buy')
    ax2.scatter([t['step'] for t in sells], [t['price'] for t in sells],
                color='red', marker='v', s=100, alpha=0.6, zorder=5, label='Sell')
    
    ax2.set_title('Price and Trading Actions', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Steps')
    ax2.set_ylabel('Price ($)')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Plot 3: Drawdown
    ax3 = axes[1, 0]
    ax3.fill_between(range(len(drawdown)), drawdown * 100, 0, alpha=0.3, color='red')
    ax3.plot(range(len(drawdown)), drawdown * 100, color='red', linewidth=2)
    ax3.set_title('Drawdown', fontsize=14, fontweight='bold')
    ax3.set_xlabel('Steps')
    ax3.set_ylabel('Drawdown (%)')
    ax3.grid(True, alpha=0.3)
    
    # Plot 4: Returns distribution
    ax4 = axes[1, 1]
    ax4.hist(returns_array * 100, bins=50, alpha=0.7, edgecolor='black')
    ax4.axvline(x=0, color='red', linestyle='--', linewidth=2)
    ax4.set_title('Returns Distribution', fontsize=14, fontweight='bold')
    ax4.set_xlabel('Return (%)')
    ax4.set_ylabel('Frequency')
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('trading_results.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("\n✅ Visualization complete")
    print("Plot saved as: trading_results.png")

except Exception as e:
    print(f"\n❌ ERROR during visualization:")
    print(f"Error: {str(e)}")
    traceback.print_exc()

---
## Summary

This notebook implements a complete RL trading agent pipeline for Google Colab:

1. **Setup**: Installed dependencies (stable-baselines3, gymnasium, etc.)
2. **Data**: Generated synthetic market data (OHLCV + technical indicators) and news sentiment
3. **Environment**: Created custom Gymnasium trading environment with:
   - Observation space: market (15), news (6), portfolio (5) features
   - Action space: 7 discrete actions (HOLD, BUY 25/50/100%, SELL 25/50/100%)
   - Reward: Portfolio value change with transaction costs
4. **Training**: Trained PPO agent over 5 epochs (50K total timesteps)
5. **Results**: Evaluated performance with clear markers (CLAUDE_RESULTS_START/END)
6. **Visualization**: Plotted portfolio value, trades, drawdown, and returns
7. **Export**: Saved results to downloadable text file

**How to use this notebook:**
1. Upload to Google Colab: https://colab.research.google.com
2. Click "Runtime" → "Run All"
3. Wait for execution to complete (~5-10 minutes)
4. Download `rl_training_results.txt` from the Files panel
5. Share the results with Claude for analysis and iteration

**Iterative workflow:**
- Claude analyzes results and suggests improvements
- User updates code in Colab or uses Claude's updated version
- Run again to test changes
- Repeat until performance is satisfactory

**Possible improvements:**
- Use real market data (yfinance, ccxt)
- Integrate actual news sentiment (FinBERT)
- Tune hyperparameters (learning rate, epochs, etc.)
- Try different RL algorithms (A2C, DQN; SAC would need a continuous action space)
- Add more sophisticated reward functions
- Implement proper backtesting with walk-forward validation
- Add slippage modeling and more realistic transaction costs (the environment currently applies only a flat commission)
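
Walk-forward validation, listed above, means training on one window of history, testing on the window that immediately follows, then rolling both forward. A minimal sketch of the index bookkeeping is shown below. The helper `walk_forward_splits` and the window sizes are assumptions chosen for illustration, sized to the 500-day synthetic dataset.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_range, test_range) index windows that roll forward in time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # slide forward by one test window

for train, test in walk_forward_splits(n_samples=500, train_size=250, test_size=50):
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
```

Each test window starts where its training window ends, so the agent is never evaluated on data it was trained on, unlike the current setup where training and evaluation share the same series.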

---
## SECTION 7: Save Results

Save all results to a downloadable text file for Claude to review.
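
A minimal sketch of the export step follows, assuming the metrics have been collected into a dictionary. The helper `save_results` and the metric names are hypothetical, and the values shown are placeholders rather than real results. The file name `rl_training_results.txt` and the markers match those used elsewhere in this notebook.

```python
import tempfile
from pathlib import Path

def save_results(metrics, path="rl_training_results.txt"):
    """Write metrics between the result markers and return the file path."""
    lines = ["CLAUDE_RESULTS_START"]
    for name, value in metrics.items():
        lines.append(f"{name}: {value}")
    lines.append("CLAUDE_RESULTS_END")
    out = Path(path)
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out

# Placeholder values for illustration only, not real results.
example_metrics = {"total_return_pct": 0.0, "sharpe": 0.0, "total_trades": 0}

# Written to a temporary directory here so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_results(example_metrics, Path(tmp) / "rl_training_results.txt")
    print(saved.read_text(encoding="utf-8"))
```

In Colab, calling `save_results(metrics)` with the default path writes to the working directory, where the file can be downloaded from the Files panel.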

**Next steps for iteration:**
- Upload this notebook to Kaggle
- Make it public
- Run it and share the URL
- Claude will fetch results via WebFetch
- Iterate on bugs, hyperparameters, or features
