# RL CQL Training

This notebook:
1. Loads offline RL tensors from `data/processed/rl_tensors_*.npz`
2. Loads CQL hyperparameters from `configs/model.yaml` and `configs/training.yaml`
3. Trains a dueling CQL using `src/rl/cql.py`
4. Saves the trained model weights to `models/`

## 1. Imports & Paths

In [1]:
import os
import sys
from pathlib import Path
import torch
import yaml

PROJECT_ROOT = Path(os.getcwd()).resolve().parent
sys.path.append(str(PROJECT_ROOT))

print("PROJECT_ROOT:", PROJECT_ROOT)

PROJECT_ROOT: C:\Users\Matth\OneDrive\Desktop\CS3346\MLB-Bullpen-Strategy


In [2]:
from src.rl.cql import (
    load_cql_training_config,
    train_cql,
    BullpenOfflineDataset,
    RLDatasetConfig,
)
from src.ope.offline_eval_cql import (
    OfflineEvalConfig,
    load_model_and_dataset,
    evaluate_td_error_full_mse,
    direct_policy_value_estimate,
    compute_policy_behavior_stats,
    compute_q_distributions,
    summarize_policy_behavior_stats,
    summarize_q_distributions,
)

## 2. Configurations

In [7]:
DATA_DIR = PROJECT_ROOT / "data"
PROC_DIR = DATA_DIR / "processed"
CONFIG_DIR = PROJECT_ROOT / "configs"
MODELS_DIR = PROJECT_ROOT / "models"

MODELS_DIR.mkdir(parents=True, exist_ok=True)

YEAR_TAG = "2022_2023"
RL_TENSORS_PATH = PROC_DIR / f"rl_tensors_{YEAR_TAG}.npz"
MODEL_CFG_PATH = CONFIG_DIR / "model.yaml"
TRAIN_CFG_PATH = CONFIG_DIR / "training.yaml"
MODEL_OUT_PATH = MODELS_DIR / f"cql_model_{YEAR_TAG}.pth"

print("RL tensors:", RL_TENSORS_PATH)
print("Model config:", MODEL_CFG_PATH)
print("Training config:", TRAIN_CFG_PATH)
print("Model output:", MODEL_OUT_PATH)

RL tensors: C:\Users\Matth\OneDrive\Desktop\CS3346\MLB-Bullpen-Strategy\data\processed\rl_tensors_2022_2023.npz
Model config: C:\Users\Matth\OneDrive\Desktop\CS3346\MLB-Bullpen-Strategy\configs\model.yaml
Training config: C:\Users\Matth\OneDrive\Desktop\CS3346\MLB-Bullpen-Strategy\configs\training.yaml
Model output: C:\Users\Matth\OneDrive\Desktop\CS3346\MLB-Bullpen-Strategy\models\cql_model_2022_2023.pth


## 3. Load Dataset & Build Model

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

train_cfg = load_cql_training_config(
    model_config_path=MODEL_CFG_PATH,
    data_path=RL_TENSORS_PATH,
    device=device,
)

train_cfg

ds = BullpenOfflineDataset(
    RLDatasetConfig(
        data_path=train_cfg.data_path,
        device=train_cfg.device,
    )
)

print("Dataset size:", len(ds))
print("State dim:", ds.state_dim)
print("Num actions:", ds.num_actions)
print("H (next hitters window):", ds.H)
print("R (max relievers per team):", ds.R)

Using device: cpu
Dataset size: 407660
State dim: 208
Num actions: 11
H (next hitters window): 5
R (max relievers per team): 10


## 4. Create Dueling CQL Model + Trainer
This calls `train_cql(train_cfg)`, which:

* loads BullpenOfflineDataset from train_cfg.data_path
* splits into train/val by train_cfg.val_fraction
* trains a dueling CQL with a target network
* logs TD-error periodically using evaluate_td_error in cql.py
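
The training objective itself lives in `src/rl/cql.py` and is not shown in this notebook. As a rough illustration only, the sketch below uses the standard discrete-action CQL form: a squared TD error plus a conservative penalty `logsumexp_a Q(s, a) - Q(s, a_logged)`. The function names and the `alpha` weight are hypothetical and are not read from `configs/model.yaml`; the project may scale or combine the terms differently.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_loss_single(q_values, action, td_target, alpha=1.0):
    """Loss for one transition: squared TD error plus the conservative penalty.

    q_values  : Q(s, a) for every action in state s
    action    : index of the logged (behavior) action
    td_target : bootstrapped target, assumed to be computed elsewhere
    alpha     : hypothetical penalty weight (an assumption, not the project's value)
    """
    td_error = q_values[action] - td_target
    # The penalty is >= 0 and shrinks as the logged action dominates the others.
    penalty = logsumexp(q_values) - q_values[action]
    return td_error ** 2 + alpha * penalty


q = [1.0, 2.0, 0.5]
loss = cql_loss_single(q, action=0, td_target=1.5, alpha=0.5)
print(f"toy CQL loss: {loss:.4f}")
```

The penalty pushes down Q-values of actions that were not logged, which is what makes the learned Q-function conservative on out-of-distribution actions.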

In [5]:
cql_model = train_cql(train_cfg)

[CQL] step=0 loss=2683.03076
      val_td_error=1922.08275
      (new best val TD: 1922.08275)
[CQL] step=1000 loss=70.73781
      val_td_error=63.18487
      (new best val TD: 63.18487)
[CQL] step=2000 loss=51.70225
      val_td_error=47.31063
      (new best val TD: 47.31063)
[CQL] step=3000 loss=61.52235
      val_td_error=42.82033
      (new best val TD: 42.82033)
[CQL] step=4000 loss=39.97438
      val_td_error=37.18340
      (new best val TD: 37.18340)
[CQL] step=5000 loss=25.22389
      val_td_error=31.89558
      (new best val TD: 31.89558)
[CQL] step=6000 loss=43.89261
      val_td_error=26.80671
      (new best val TD: 26.80671)
[CQL] step=7000 loss=30.33501
      val_td_error=23.26473
      (new best val TD: 23.26473)
[CQL] step=8000 loss=26.39179
      val_td_error=18.94468
      (new best val TD: 18.94468)
[CQL] step=9000 loss=15.05893
      val_td_error=16.67717
      (new best val TD: 16.67717)
[CQL] step=10000 loss=13.25649
      val_td_error=14.56026
      (new best va

## 5. Save trained model weights

In [8]:
torch.save(cql_model.state_dict(), MODEL_OUT_PATH)
MODEL_OUT_PATH

WindowsPath('C:/Users/Matth/OneDrive/Desktop/CS3346/MLB-Bullpen-Strategy/models/cql_model_2022_2023.pth')

## Offline Policy Evaluation (OPE)
Now we use `src/ope/offline_eval_cql.py` to:

* load the saved model and dataset
* compute:
    * Mean Squared TD Error (MSTE)
    * Direct Q-based value of the greedy policy
    * Action agreement with the logged policy

In [9]:
ope_cfg = OfflineEvalConfig(
    model_config_path=MODEL_CFG_PATH,
    model_path=MODEL_OUT_PATH,
    tensors_path=RL_TENSORS_PATH,
    device=device,
    batch_size=2048,
    gamma=train_cfg.gamma,
)

eval_model, eval_ds, eval_loader = load_model_and_dataset(ope_cfg)

print("Eval dataset size:", len(eval_ds))
print("State dim:", eval_ds.state_dim)
print("Num actions:", eval_ds.num_actions)

Eval dataset size: 407660
State dim: 208
Num actions: 11


## 6. Mean Squared TD Error (MSTE)
This is the mean squared Bellman residual over the full dataset. It reuses `evaluate_td_error` from `cql.py` under the hood, passing the same model as both the online and the target network.
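
A minimal sketch of that quantity in plain Python, for illustration. The tuple layout, the `q_fn` callable, and the masking of unavailable next-state actions are all assumptions made here; the real `evaluate_td_error` may differ in each of these.

```python
def mean_squared_td_error(transitions, q_fn, gamma):
    """Average of (Q(s, a) - [r + gamma * max_a' Q(s', a')])^2.

    transitions : iterable of (state, action, reward, next_state, done, next_mask)
    q_fn        : returns a list of Q-values for a state; used here as both
                  the online and the target network
    gamma       : discount factor
    """
    total = 0.0
    count = 0
    for state, action, reward, next_state, done, next_mask in transitions:
        target = reward
        if not done:
            next_q = q_fn(next_state)
            # Assumption: only available next actions enter the max.
            valid = [q for q, ok in zip(next_q, next_mask) if ok]
            target += gamma * max(valid)
        total += (q_fn(state)[action] - target) ** 2
        count += 1
    return total / count


# Toy tabular Q-function with two states and two actions.
q_table = {"s0": [1.0, 2.0], "s1": [0.5, 3.0]}
toy_transitions = [
    ("s0", 1, 0.5, "s1", False, [True, False]),  # action 1 unavailable in s1
    ("s1", 0, 0.5, None, True, []),              # terminal transition
]
toy_mste = mean_squared_td_error(toy_transitions, q_table.__getitem__, gamma=0.9)
print(f"toy MSTE: {toy_mste:.5f}")
```

Because the same model supplies both sides of the residual, this number measures self-consistency of the Q-function rather than the quality of the greedy policy.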

In [10]:
mste = evaluate_td_error_full_mse(
    model=eval_model,
    loader=eval_loader,
    gamma=ope_cfg.gamma,
    device=ope_cfg.device,
)

print(f"Mean Squared TD Error (MSTE): {mste:.6f}")

Mean Squared TD Error (MSTE): 0.941603


## 7. Direct Q-based value estimate (FQE-style Direct Method)
For each state s:

* compute Q(s, a) for all actions
* mask unavailable actions
* take the greedy action a* = argmax_a Q(s, a)
* define V_hat(s) = Q(s, a*)

Then average V_hat(s) across the dataset as an estimate of V(pi_greedy).
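
The procedure can be sketched on toy data as follows. This is an illustrative stand-in, not the body of `direct_policy_value_estimate`; the helper names and the list-of-lists inputs are hypothetical.

```python
def greedy_action(q_values, mask):
    """Index of the highest-Q action among those whose mask entry is True."""
    valid_actions = [a for a, ok in enumerate(mask) if ok]
    # Ties resolve to the lowest valid index.
    return max(valid_actions, key=lambda a: q_values[a])


def direct_value_estimate(q_rows, mask_rows):
    """Average V_hat(s) = Q(s, a*) over all states."""
    values = []
    for q_values, mask in zip(q_rows, mask_rows):
        a_star = greedy_action(q_values, mask)
        values.append(q_values[a_star])
    return sum(values) / len(values)


# Two toy states with three actions each.
q_rows = [[4.0, 5.0, 9.0], [3.0, 2.0, 1.0]]
mask_rows = [[True, True, False], [True, False, True]]
estimate = direct_value_estimate(q_rows, mask_rows)
print(estimate)  # 4.0
```

In the first state the largest Q-value (9.0) belongs to a masked action, so the greedy choice falls back to 5.0. Note that this estimate trusts the learned Q-values directly, so any over- or under-estimation in Q carries straight into the result.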

In [11]:
dm_value = direct_policy_value_estimate(
    model=eval_model,
    loader=eval_loader,
    device=ope_cfg.device,
)

print(f"Direct Q-based value estimate (V(pi_greedy)): {dm_value:.6f}")

Direct Q-based value estimate (V(pi_greedy)): 4.577549


## 8. Policy Behavior Stats and Q distributions
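How often does the greedy CQL action (respecting the availability mask) match the logged (historical) action from the dataset? This cell also collects Q-value distributions for the "stay" action and for the best "pull" action.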

In [12]:
# New distributional metrics
policy_stats = compute_policy_behavior_stats(eval_model, eval_loader, device=ope_cfg.device)

q_stats = compute_q_distributions(eval_model, eval_loader, device=ope_cfg.device)

## 9. Summary

In [13]:
print("========= FINAL CQL EVALUATION RESULTS =========")
print(f"TD Error (MSTE):              {mste:.6f}")
print(f"Direct Q-based V(pi_greedy):  {dm_value:.6f}")
summarize_policy_behavior_stats(policy_stats)
summarize_q_distributions(q_stats)

TD Error (MSTE):              0.941603
Direct Q-based V(pi_greedy):  4.577549
=== Policy vs Behavior Stats ===
Num samples:   407660
Num actions:   11

Behavior pull rate: 5.90%
Policy pull rate:   90.87%
Action agreement:   9.43%

Behavior action counts (per action index):
[383628   4016   3546   3018   2587   2369   2204   1923   1662   1423
   1284]
Policy action counts (per action index):
[37228 66452 29504 16041 28919 21101 48095 12143 53480 17726 76971]
Valid action counts (per action index):
[407660 368480 357208 365623 361313 371378 361377 375523 364592 364655
 373403]
=== Q Distribution Stats ===
q_all_valid: n=4071212, mean=4.373, std=3.121, min=0.469, max=73.523
q_stay: n=407660, mean=4.179, std=3.039, min=0.469, max=68.737
q_best_pull: n=407660, mean=4.573, std=3.246, min=0.724, max=73.523
q_stay_minus_best_pull: n=407660, mean=-0.394, std=0.398, min=-4.786, max=0.639
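
As a sanity check, the pull rates can be recomputed from the printed action counts. The sketch assumes action index 0 is "stay" and indices 1 to 10 are reliever pulls; that assumption is consistent with index 0 being valid in all 407660 samples and with `q_stay` having n=407660, but it is not confirmed by the code shown here.

```python
# Counts copied from the output above.
behavior_counts = [383628, 4016, 3546, 3018, 2587, 2369, 2204, 1923, 1662, 1423, 1284]
policy_counts = [37228, 66452, 29504, 16041, 28919, 21101, 48095, 12143, 53480, 17726, 76971]
valid_counts = [407660, 368480, 357208, 365623, 361313, 371378,
                361377, 375523, 364592, 364655, 373403]


def pull_rate(counts, stay_index=0):
    """Fraction of samples whose action is anything other than 'stay'."""
    return 1.0 - counts[stay_index] / sum(counts)


behavior_rate = f"{pull_rate(behavior_counts):.2%}"
policy_rate = f"{pull_rate(policy_counts):.2%}"
print("Behavior pull rate:", behavior_rate)  # 5.90%
print("Policy pull rate:  ", policy_rate)    # 90.87%
print("Total valid (s, a) pairs:", sum(valid_counts))  # 4071212
```

Both rates match the reported values, and the valid-action total matches the `q_all_valid` sample count.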


In [13]:
import numpy as np

# Reuse the tensor path defined in the Configurations cell
npz = np.load(RL_TENSORS_PATH)

for key in ["reward_folded"]:
    x = npz[key]
    print(key, "shape:", x.shape)
    print(
        key,
        "mean:", float(x.mean()),
        "std:", float(x.std()),
        "min:", float(x.min()),
        "max:", float(x.max()),
    )

reward_folded shape: (407660,)
reward_folded mean: -0.010023725219070911 std: 0.6859701871871948 min: -7.711379528045654 max: 1.1493159532546997
