# **RL AGENT QUICK‑START GUIDE**
*Last updated: 2025-05-11*

## **PART 0: OVERVIEW**

### **A) PURPOSE**
Provide a **plug‑and‑play** reinforcement‑learning stack—callbacks, custom LSTM policy, trainer, and smoke‑test harness—for experimenting with **PPO‑based trading agents** in any Gym‑compatible environment.

### **B) HOW TO USE THIS NOTEBOOK**
1. Install dependencies (Part 1).
2. Skim the package layout (Part 2) for available hooks.
3. Implement your own Gym environment following the API contract (Part 3).
4. Train & evaluate your agent with the example code cells (Part 4).
5. Iterate—tune hyper‑parameters, swap callbacks, or extend the policy.

## **PART 1: SETUP & DEPENDENCIES**

### **A) INSTALLATION**
```bash
pip install -r requirements.txt  # stable-baselines3, sb3-contrib, gymnasium, torch, pandas, etc.
```
💡 *For CUDA support, follow the official PyTorch instructions.*

## **PART 2: PACKAGE COMPONENTS**

### **A) CALLBACKS & UTILITIES**
| File | Highlights |
| ---- | ---------- |
| `callbacks.py` | `EarlyStoppingCallback` (reward‑plateau detector)<br>`CheckpointCallback` (saves best model) |

Both inherit SB3 `BaseCallback`, so they work with *any* SB3 / SB3‑Contrib algorithm.
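
The reward-plateau idea can be sketched without any framework code. The block below is a minimal illustration, not the actual `EarlyStoppingCallback`: the class name and `min_delta` are hypothetical, and `patience` simply mirrors the config key used in Part 4. Inside an SB3 callback, the same check would sit in `_on_step()`, where returning `False` tells SB3 to stop training.

```python
class RewardPlateauDetector:
    """Hypothetical helper: flags a stop after `patience` checks without improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience      # consecutive stale checks tolerated
        self.min_delta = min_delta    # smallest gain that counts as improvement
        self.best = float("-inf")
        self.stale_checks = 0

    def should_stop(self, mean_reward: float) -> bool:
        if mean_reward > self.best + self.min_delta:
            self.best = mean_reward   # new best: reset the counter
            self.stale_checks = 0
        else:
            self.stale_checks += 1    # no improvement at this check
        return self.stale_checks >= self.patience


detector = RewardPlateauDetector(patience=3)
history = [1.0, 2.0, 2.0, 1.9, 2.0, 2.5]  # made-up mean rewards, one per check

# Index of the first check at which training would be halted
stop_index = next(i for i, r in enumerate(history) if detector.should_stop(r))
print(stop_index)
```

With these made-up values the detector fires on the third consecutive check that fails to beat the best reward, so the later improvement to 2.5 is never seen. That is the usual trade-off when picking a small `patience`.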

### **B) CUSTOM POLICY**
| File | Class | Architecture |
| ---- | ----- | ------------ |
| `policy.py` | `TradingLSTMPolicy` | 2‑layer 128‑unit MLP ➜ 64‑hidden LSTM ➜ actor & critic heads |

Factory helper `make_trading_lstm_policy()` returns the class for easy SB3 registration.

### **C) TRAINER & SMOKE TEST**
* **Trainer** – `train.py > train_agent()` wires everything (vectorised env, TensorBoard, callbacks) and persists `final_model.zip`.
* **Smoke Test** – `test_agent.py` spins up a minimal `SimpleTradingEnv`, runs a 20 k‑step training loop, reloads the model, and makes a deterministic prediction.

## **PART 3: GYM ENVIRONMENT CONTRACT**

### **A) `reset()` & `step()`**

#### **Purpose**
Cleanly initialise each episode and advance the environment **one step per agent action**.

#### **Thought Process**
A predictable, SB3‑compatible API lets the same agent implementation run on simulated order‑book data, synthetic price streams, or live feeds.

#### **Method**
```python
def reset(self, seed=None, options=None):
    super().reset(seed=seed)  # seeds self.np_random (assumes a gymnasium.Env subclass)
    self.pointer = 0          # time index
    self.inventory = 0        # cleared position
    return self._obs(), {}    # (observation, info)

def step(self, action):
    pnl = self._execute(action)  # realised P&L if the order fills
    carry = -self.hold_cost_coeff * abs(self.inventory)  # holding penalty
    reward = pnl + carry

    self.pointer += 1
    terminated = self.pointer >= self.max_steps  # end of the episode window
    truncated = False                            # no external time limit here
    info = {
        'inventory': self.inventory,
        'realized_pnl': pnl,
        'action_mask': self._action_mask(),  # valid actions for the next step
    }
    return self._obs(), reward, terminated, truncated, info
```
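
From the caller's side, the contract is the Gymnasium five-tuple `(obs, reward, terminated, truncated, info)`, and either flag ends the episode. The sketch below shows one way to drive it. `CountdownEnv` and `run_episode` are hypothetical stand-ins with no dependencies; a real environment would subclass `gymnasium.Env` and declare its observation and action spaces.

```python
class CountdownEnv:
    """Hypothetical stub exposing the same reset()/step() signatures."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps

    def reset(self, seed=None, options=None):
        self.pointer = 0
        return self.pointer, {}  # the observation is just the time index

    def step(self, action):
        self.pointer += 1
        reward = 1.0 if action == 1 else 0.0  # toy reward, not a trading P&L
        terminated = self.pointer >= self.max_steps
        return self.pointer, reward, terminated, False, {}


def run_episode(env, policy):
    """Run one episode and return (total_reward, steps_taken)."""
    obs, info = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        steps += 1
        done = terminated or truncated  # either flag ends the episode
    return total_reward, steps


# Toy policy: choose action 1 whenever the observed time index is odd
total, steps = run_episode(CountdownEnv(max_steps=5), policy=lambda obs: obs % 2)
print(total, steps)
```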

### **B) ACTION MASKING**

#### **Purpose**
Prevent physically impossible moves (e.g. **selling with zero inventory** or **exceeding position limits**) from ever being sampled or sent to the environment.

#### **Thought Process**
Instead of adding a huge negative reward for invalid moves (which slows learning), we expose a Boolean *mask* so sampling & evaluation phases only consider **feasible actions**.

#### **Method**
* Return `info['action_mask']` on every `step()` call—`True` for valid indices, `False` otherwise.
* Training helpers (`run_random_episode`, etc.) pick from the **masked set** when generating random actions.
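
A minimal sketch of both halves, building the mask and sampling from it, might look like this. The three-action layout, the long-only rule, and the function names are assumptions made for illustration; the package's own `_action_mask()` and `run_random_episode` may differ.

```python
import random

HOLD, BUY, SELL = 0, 1, 2  # hypothetical discrete action layout


def action_mask(inventory: int, max_position: int) -> list[bool]:
    """Long-only mask: no selling when flat, no buying at the position limit."""
    return [True, inventory < max_position, inventory > 0]


def sample_masked_action(mask: list[bool], rng: random.Random) -> int:
    """Pick uniformly among the indices whose mask entry is True."""
    valid = [i for i, ok in enumerate(mask) if ok]
    return rng.choice(valid)


rng = random.Random(0)
flat_mask = action_mask(inventory=0, max_position=2)  # SELL is masked out
full_mask = action_mask(inventory=2, max_position=2)  # BUY is masked out

# SELL never appears while the position is flat
samples = {sample_masked_action(flat_mask, rng) for _ in range(50)}
print(flat_mask, full_mask, SELL in samples)
```

A mask in `info` only guides helpers that read it. `RecurrentPPO` does not consume masks during training; in sb3-contrib that is the job of `MaskablePPO`, which expects the mask through an `action_masks()` method or the `ActionMasker` wrapper.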

### **C) REWARD FLOW**

#### **Purpose**
Align the learning signal with **portfolio objectives**—maximise realised P&L while penalising risky inventory carry.

#### **Thought Process**
A simple additive scheme keeps the reward **scale stable** across assets and episodes, which helps PPO’s advantage normalisation.

#### **Method**
```
reward_t = realised_pnl_t  -  hold_cost_coeff * |inventory_t|
```
Where `realised_pnl_t` comes from trade executions and `hold_cost_coeff` is a tunable penalty for open positions.

## **PART 4: TRAIN & INFERENCE PLAYGROUND**

```python
# 🚀 TRAIN – 20k timesteps on the toy SimpleTradingEnv
from agent.test_agent import SimpleTradingEnv
from agent.train import train_agent

env = SimpleTradingEnv(obs_dim=10, episode_length=100)

config = dict(
    total_timesteps=20_000,
    n_envs=2,
    learning_rate=3e-4,
    batch_size=64,
    n_steps=128,
    early_stopping=True,
    check_freq=1_000,
    patience=3,
    save_checkpoints=True,
    save_freq=5_000,
    use_custom_policy=True,
)

model = train_agent(env=env,
                    config=config,
                    log_dir='./logs/demo_run',
                    save_path='./models/demo_run',
                    verbose=1)
```
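
To see how the rollout-related settings interact, the arithmetic below may help. It assumes `train_agent()` forwards `n_envs`, `n_steps`, `batch_size`, and `total_timesteps` unchanged to an SB3 PPO-style algorithm and that early stopping does not fire. The minibatch count is nominal, since a recurrent buffer batches sequences rather than single transitions.

```python
import math

# Subset of the demo config that controls rollout size
config = dict(total_timesteps=20_000, n_envs=2, n_steps=128, batch_size=64)

# Each rollout collects n_steps transitions from every parallel environment
rollout_size = config["n_envs"] * config["n_steps"]

# Nominal number of gradient minibatches per pass over one rollout
minibatches_per_epoch = rollout_size // config["batch_size"]

# SB3 keeps collecting whole rollouts until total_timesteps is reached
rollouts_needed = math.ceil(config["total_timesteps"] / rollout_size)
timesteps_collected = rollouts_needed * rollout_size

print(rollout_size, minibatches_per_epoch, rollouts_needed, timesteps_collected)
```

Under these assumptions the run slightly overshoots `total_timesteps`, because training always finishes the rollout it has started.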


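```python
# 🔍 INFERENCE – load and predict
from sb3_contrib.ppo_recurrent import RecurrentPPO

from agent.test_agent import SimpleTradingEnv

# SB3 appends ".zip" when the path has no extension
model = RecurrentPPO.load('./models/demo_run/final_model')

env = SimpleTradingEnv(obs_dim=10, episode_length=100)
obs, _ = env.reset()
lstm_state = None  # None lets the policy start from a zeroed LSTM state

# Pass lstm_state back into the next predict() call to keep the recurrent memory
action, lstm_state = model.predict(obs, state=lstm_state, deterministic=True)
print(f'Predicted action: {action}')
```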


## **PART 5: NEXT STEPS**

* Swap `SimpleTradingEnv` for your custom **LOBEnv**; as long as it follows the Part 3 contract, the trainer, callbacks, and policy should need **no changes**.
* Point TensorBoard to `./logs` for live reward & loss curves.
* Extend by:
  * Writing new callbacks (e.g. Slack alerts)
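  * Swapping the algorithm in `train.py`. Note that `RecurrentPPO` is the only recurrent algorithm SB3‑Contrib ships, so a `RecurrentA2C` or `RecurrentSAC` would have to come from your own implementation or another library; SB3's SAC additionally requires a continuous action space.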