# GC-SAC OPE Demo (FQE + DM/TIS/DR)

Minimal example: load SAC checkpoints → sample behavior data → train FQE → compute the DM/TIS/DR estimates.


In [1]:
from pathlib import Path
import numpy as np
import torch as th

from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy

from gc_ope.env.get_env import get_env
from gc_ope.utils.load_config_with_hydra import load_config

from gc_ope.algorithm.ope.logged_dataset import collect_logged_dataset
from gc_ope.algorithm.ope.fqe import FQETrainer
from gc_ope.algorithm.ope.ope_input import build_ope_inputs
from gc_ope.algorithm.ope.estimators import dm_estimate, tis_estimate, dr_estimate

PROJECT_ROOT_DIR = Path().absolute().parent.parent.parent.parent
PROJECT_ROOT_DIR




PosixPath('/home/maxine/ai4robot/gc_ope')

## Prepare the environment and policies

In [2]:
# Prepare the environment and policies
cfg = load_config(
    config_path="../../../configs/train",
    config_name="config",
)
cfg.env.env_id = "FlyCraft-v0"

env = get_env(cfg.env)

ckpt_path_1 = PROJECT_ROOT_DIR / "checkpoints/flycraft/easy/sac/seed_1/rl_model_100000_steps"
ckpt_path_2 = PROJECT_ROOT_DIR / "checkpoints/flycraft/easy/sac/seed_1/rl_model_200000_steps"

behavior_algo = SAC.load(ckpt_path_1)
eval_algo = SAC.load(ckpt_path_2)

gamma = float(getattr(eval_algo, "gamma", 0.99))
print("gamma=", gamma)



load config from: /home/maxine/ai4robot/gc_ope/configs/env_configs/flycraft/env_config_for_ppo_easy.json
3 Generator(PCG64) Generator(PCG64)
gamma= 0.995


## Sample behavior data and cache evaluation-policy actions/log-probabilities
Logged dataset from behavior policy rollouts.

Contains transitions $$(s_t, a_t, r_{t+1}, s_{t+1}, \text{done}_t)$$
collected by rolling out a behavior policy, along with optional
precomputed evaluation policy actions and log-probabilities.

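1. `DictReplayBuffer` is not used: besides (s, a, r, s', d), variable-length trajectories must be identifiable, so `traj_id` and `step_index` are stored explicitly.
2. During rollout, the evaluation policy is queried directly to obtain probabilities for s and (s, a).
3. For the state `obs`, two forms are stored: `dict` is the original goal-conditioned RL state format, and `flat` is the flattened form (?), used for ...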

Attributes:
- obs_flat: Flattened observations (N, obs_dim).
- actions: Actions taken by behavior policy (N, act_dim).
- rewards: Rewards (N,).
- next_obs_flat: Next observations (N, obs_dim).
- dones: Episode termination flags (N,).
- traj_id: Trajectory ID for each transition (N,).
- step_index: Step index within trajectory (N,).
- obs_dict: Original dict observations (list of N dicts).
- next_obs_dict: Original dict next observations (list of N dicts).
- behavior_log_prob: Log-probability of actions under behavior policy (N,).

*For the evaluation policy:*
- eval_action_curr: Evaluation policy actions at $s_t$ (N, act_dim) or None.
- eval_action_next: Evaluation policy actions at $s_{t+1}$ (N, act_dim) or None.
- eval_log_prob_curr: Log-probability of eval actions at $s_t$ (N,) or None.
- eval_log_prob_next: Log-probability of eval actions at $s_{t+1}$ (N,) or None.

``` python
if eval_algo is not None:
    # a_t under eval policy on s_t and s_{t+1}
    a_curr, _ = eval_algo.predict(obs, deterministic=True)
    a_next, _ = eval_algo.predict(next_obs, deterministic=True)
    a_curr_t, eval_logp_curr = _compute_action_log_prob(eval_algo, obs, a_curr)
    a_next_t, eval_logp_next = _compute_action_log_prob(eval_algo, next_obs, a_next)
    eval_action_curr.append(a_curr_t)
    eval_action_next.append(a_next_t)
    eval_log_prob_curr.append(eval_logp_curr)
    eval_log_prob_next.append(eval_logp_next)
```
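The `traj_id` and `step_index` fields listed above are what make variable-length trajectories recoverable from a flat array of transitions. A minimal sketch of that grouping follows; `split_by_trajectory` is a hypothetical helper written for illustration, not a function from `gc_ope`, and the toy arrays are made up.

``` python
import numpy as np


def split_by_trajectory(traj_id, step_index, values):
    """Group per-transition values into one array per trajectory, ordered by step.

    Hypothetical helper: assumes traj_id and step_index have shape (N,)
    and values has N rows, as in the attribute list above.
    """
    traj_id = np.asarray(traj_id)
    step_index = np.asarray(step_index)
    values = np.asarray(values)
    trajectories = []
    for tid in np.unique(traj_id):
        idx = np.flatnonzero(traj_id == tid)
        # Sort within the trajectory in case transitions are not stored in order.
        idx = idx[np.argsort(step_index[idx], kind="stable")]
        trajectories.append(values[idx])
    return trajectories


# Toy data: two trajectories of lengths 3 and 2 stored back to back.
traj_id = np.array([0, 0, 0, 1, 1])
step_index = np.array([0, 1, 2, 0, 1])
rewards = np.array([-1.0, -1.0, -2.0, -1.0, -3.0])

per_traj_rewards = split_by_trajectory(traj_id, step_index, rewards)
lengths = [len(r) for r in per_traj_rewards]
print(lengths)
```

The same grouping applies to any per-transition field, for example the log-probabilities needed for trajectory-level importance weights.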

In [None]:
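# Sample behavior data and cache evaluation-policy actions/log-probabilities
#TODO: Separate behavior-data sampling from evaluation-policy caching, so that each can be configured and reused independently.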
n_episodes = 20
max_steps = 400

dataset = collect_logged_dataset(
    env=env,
    behavior_algo=behavior_algo,
    eval_algo=eval_algo,
    n_episodes=n_episodes,
    max_steps=max_steps,
)



print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 34。target: (208.22, -8.12, -4.01)。achieved target: (192.55, 1.12, 21.15)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 49。target: (197.91, -6.81, 14.07)。achieved target: (185.93, 1.66, 36.61)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 65。target: (161.37, -2.18, 1.00)。achieved target: (181.41, 2.02, 48.31)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 90。target: (193.06, 1.74, 14.27)。achieved target: (164.07, 13.05, 81.55)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 40。target: (245.63, -4.32, 8.91)。achieved target: (188.89, 1.93, 30.94)。expert steps: 0。
print, Train, continuousely_move_away_termination_ba

In [None]:
dataset.__dict__.keys()

dict_keys(['obs_flat', 'actions', 'rewards', 'next_obs_flat', 'dones', 'traj_id', 'step_index', 'obs_dict', 'next_obs_dict', 'behavior_log_prob', 'eval_action_curr', 'eval_action_next', 'eval_log_prob_curr', 'eval_log_prob_next'])

In [8]:
print(
    f"Collected {len(dataset.obs_flat)} transitions from {dataset.traj_id.max() + 1} episodes; "
    f"obs_dim={dataset.obs_flat.shape[1]}, act_dim={dataset.actions.shape[1]}"
)

Collected 1419 transitions from 20 episodes; obs_dim=14, act_dim=3


## Prepare the OPE inputs
### Train FQE
Fitted Q Evaluation trainer for continuous goal-conditioned policies.

FQE is an off-policy evaluation method that approximates a Q function
$Q_\theta(s, a)$ for the evaluation policy $\pi_\phi(s)$.

The FQE loss is:

$$
    L(\theta) = \mathbb{E}_{(s_t, a_t, r_{t+1}, s_{t+1}) \sim D}
        \left[ \left( Q_\theta(s_t, a_t) - r_{t+1}
            - \gamma Q_{\theta'}(s_{t+1}, \pi_\phi(s_{t+1})) \right)^2 \right]
$$

where $D$ is the logged dataset, $\theta'$ is the target network
parameters (soft-updated with $\tau$), and $\pi_\phi(s_{t+1})$
is the deterministic action from the evaluation policy.

Because it is fit specifically to the evaluation policy on the logged data,
the Q function trained by FQE is intended to estimate that policy's value more
accurately than the critic learned during policy training.
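The loss above can be illustrated on a toy batch. The following is a minimal NumPy sketch, not the `FQETrainer` implementation: the function names are hypothetical, and the `(1 - done)` mask on the bootstrap term is an assumption that does not appear in the displayed loss (it is a common way to stop bootstrapping past terminal transitions). The soft update follows the usual Polyak form $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$.

``` python
import numpy as np


def fqe_td_target(rewards, next_q, dones, gamma):
    """y = r_{t+1} + gamma * (1 - done_t) * Q_target(s_{t+1}, pi(s_{t+1})).

    The terminal mask is an assumption of this sketch.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q = np.asarray(next_q, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    return rewards + gamma * (1.0 - dones) * next_q


def fqe_loss(q_pred, target):
    """Mean squared TD error between Q(s_t, a_t) and the fixed target."""
    q_pred = np.asarray(q_pred, dtype=np.float64)
    return float(np.mean((q_pred - target) ** 2))


def soft_update(target_params, online_params, tau):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [(1.0 - tau) * tp + tau * op
            for tp, op in zip(target_params, online_params)]


# Toy batch of two transitions; the second one ends its episode.
target = fqe_td_target(rewards=[-1.0, -1.0], next_q=[-10.0, -5.0],
                       dones=[0.0, 1.0], gamma=0.5)
loss = fqe_loss(q_pred=[-5.0, -1.0], target=target)
new_target_params = soft_update([np.array([0.0])], [np.array([1.0])], tau=0.005)
print(target, loss, new_target_params)
```

With `tau=0.005`, as used in this notebook, the target network moves only 0.5% of the way toward the online network per update, which keeps the regression target slowly varying.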

#QUESTION: This does not seem to be goal-conditioned? Currently the whole obs ('observation', 'desired_goal', 'achieved_goal') is flattened into a single obs.

In [14]:
display(dataset.obs_flat[0])
display(dataset.obs_dict[0].keys())
display(dataset.obs_dict[0]['observation'])
display(dataset.obs_dict[0]['achieved_goal'])
display(dataset.obs_dict[0]['desired_goal'])

array([0.5      , 0.5145179, 0.5      , 0.2      , 0.5      , 0.5      ,
       0.5      , 0.25     , 0.2082162, 0.4549032, 0.4888545, 0.2      ,
       0.5      , 0.5      ], dtype=float32)

dict_keys(['observation', 'desired_goal', 'achieved_goal'])

array([0.5      , 0.5145179, 0.5      , 0.2      , 0.5      , 0.5      ,
       0.5      , 0.25     ], dtype=float32)

array([0.2, 0.5, 0.5], dtype=float32)

array([0.2082162, 0.4549032, 0.4888545], dtype=float32)
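Comparing the arrays printed above, `obs_flat[0]` appears to be the concatenation of `observation`, `desired_goal`, and `achieved_goal`, in that order. The sketch below reproduces that layout with the displayed values; `flatten_goal_obs` and `FLATTEN_KEYS` are hypothetical names for illustration, and the actual flattening code in `gc_ope` may differ.

``` python
import numpy as np

# Key order inferred from the printed obs_flat[0]; treat it as an assumption.
FLATTEN_KEYS = ("observation", "desired_goal", "achieved_goal")


def flatten_goal_obs(obs_dict, keys=FLATTEN_KEYS):
    """Concatenate the entries of a goal-conditioned dict observation."""
    parts = [np.asarray(obs_dict[k], dtype=np.float32).ravel() for k in keys]
    return np.concatenate(parts)


# Values copied from the cell output above.
obs_dict = {
    "observation": np.array(
        [0.5, 0.5145179, 0.5, 0.2, 0.5, 0.5, 0.5, 0.25], dtype=np.float32
    ),
    "achieved_goal": np.array([0.2, 0.5, 0.5], dtype=np.float32),
    "desired_goal": np.array([0.2082162, 0.4549032, 0.4888545], dtype=np.float32),
}

obs_flat = flatten_goal_obs(obs_dict)
print(obs_flat.shape)
```

Under this layout the Q network still receives the desired goal as part of its input, so it is goal-conditioned in the sense that the goal is an input feature, although the goal is not handled separately.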

In [15]:
# Train FQE
#TODO: Move the FQE train and predict steps into build_ope_inputs and do not expose them externally.
obs_dim = dataset.obs_flat.shape[1]
act_dim = dataset.actions.shape[1]

fqe = FQETrainer(
    obs_dim=obs_dim,
    act_dim=act_dim,
    eval_algo=eval_algo,
    gamma=gamma,
    lr=3e-4,
    tau=0.005,
)

loss_log = []


def _logger(epoch: int, loss: float):
    if epoch % 50 == 0 or epoch == 1:
        print(f"Epoch {epoch:04d} | FQE loss={loss:.3f}")
    loss_log.append((epoch, loss))


fqe.fit(dataset, batch_size=256, n_epochs=300, shuffle=True, logger=_logger)



Epoch 0001 | FQE loss=378.711
Epoch 0050 | FQE loss=343.292
Epoch 0100 | FQE loss=314.811
Epoch 0150 | FQE loss=306.488
Epoch 0200 | FQE loss=301.351
Epoch 0250 | FQE loss=268.231
Epoch 0300 | FQE loss=236.435


### Build the OPE inputs
Build OPE inputs from logged dataset and trained FQE model.

Computes evaluation policy actions/log-probs and Q-values needed for
DM, TIS, and DR estimators.

In [16]:
# Build the OPE inputs
#TODO: Move the FQE train and predict steps into build_ope_inputs. To implement this, pass a parameter: q_function_method
inputs = build_ope_inputs(dataset, eval_algo=eval_algo, fqe=fqe, gamma=gamma)


In [17]:
inputs.__dict__.keys()

dict_keys(['obs_flat', 'actions', 'rewards', 'next_obs_flat', 'dones', 'traj_id', 'step_index', 'behavior_log_prob', 'eval_action', 'eval_log_prob', 'q_sa_behavior', 'q_sa_eval', 'gamma'])

## Compute DM / TIS / DR
### DM
Direct Method (DM) estimator.

DM estimates the policy value using the FQE Q-function:

$$
    \hat{V}^{\text{DM}} = \frac{1}{N} \sum_{i=1}^N Q(s_i, \pi_{\text{eval}}(s_i))
$$

If `initial_only=True`, only initial states (`step_index == 0`) are used:

$$
    \hat{V}^{\text{DM}} = \frac{1}{|\mathcal{I}_0|} \sum_{i \in \mathcal{I}_0} Q(s_i, \pi_{\text{eval}}(s_i))
$$

where $\mathcal{I}_0$ is the set of initial state indices.
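The two DM variants differ only in which states are averaged. A minimal sketch, assuming the Q-values $Q(s_i, \pi_{\text{eval}}(s_i))$ are already available as an array (the `q_sa_eval` argument here is a plain NumPy array and `dm_value` is a hypothetical function, not `dm_estimate`):

``` python
import numpy as np


def dm_value(q_sa_eval, step_index, initial_only=False):
    """Average Q(s_i, pi_eval(s_i)) over all states, or over initial states only."""
    q = np.asarray(q_sa_eval, dtype=np.float64)
    if initial_only:
        # Keep only the first transition of each trajectory.
        q = q[np.asarray(step_index) == 0]
    return float(q.mean())


# Toy data: two trajectories of lengths 3 and 2.
q_sa_eval = np.array([-10.0, -8.0, -6.0, -20.0, -12.0])
step_index = np.array([0, 1, 2, 0, 1])

dm_all_states = dm_value(q_sa_eval, step_index, initial_only=False)
dm_initial = dm_value(q_sa_eval, step_index, initial_only=True)
print(dm_all_states, dm_initial)
```

The initial-state average estimates the expected discounted return from the start-state distribution, while the all-states average is weighted by how often each state appears in the logged data, so the two numbers are generally not comparable.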

### TIS
Trajectory-wise Importance Sampling (TIS) estimator.

TIS estimates the policy value using trajectory-level importance weights:

$$
    \hat{V}^{\text{TIS}} = \frac{1}{M} \sum_{\tau=1}^M w_\tau G_\tau
$$

where $M$ is the number of trajectories, $G_\tau$ is the
discounted return of trajectory $\tau$, and the importance weight is:

$$
    w_\tau = \prod_{t=0}^{T_\tau-1} \frac{\pi_{\text{eval}}(a_t | s_t)}{\pi_{\text{behavior}}(a_t | s_t)}
        = \exp\left( \sum_{t=0}^{T_\tau-1} \left( \log \pi_{\text{eval}}(a_t | s_t)
            - \log \pi_{\text{behavior}}(a_t | s_t) \right) \right)
$$
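A minimal sketch of the TIS computation in log space follows. It assumes `eval_logp` holds the evaluation policy's log-probability of the logged behavior action $a_t$ (not of the evaluation policy's own action); `tis_per_trajectory` is a hypothetical function, not `tis_estimate`, and the toy numbers are made up.

``` python
import numpy as np


def tis_per_trajectory(rewards, eval_logp, behavior_logp, traj_id, step_index, gamma):
    """Return w_tau * G_tau for each trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    log_ratio = np.asarray(eval_logp, dtype=np.float64) - np.asarray(
        behavior_logp, dtype=np.float64
    )
    traj_id = np.asarray(traj_id)
    step_index = np.asarray(step_index)
    values = []
    for tid in np.unique(traj_id):
        mask = traj_id == tid
        # Sum log-ratios first, exponentiate once: w_tau = exp(sum_t log ratio_t).
        w_tau = np.exp(np.sum(log_ratio[mask]))
        # Discounted return G_tau = sum_t gamma^t * reward_t.
        g_tau = np.sum(gamma ** step_index[mask] * rewards[mask])
        values.append(w_tau * g_tau)
    return np.array(values)


# Toy data: trajectory 0 has two steps, trajectory 1 has one.
per_traj = tis_per_trajectory(
    rewards=[-1.0, -1.0, -2.0],
    eval_logp=[np.log(2.0), np.log(0.5), np.log(2.0)],
    behavior_logp=[0.0, 0.0, 0.0],
    traj_id=[0, 0, 1],
    step_index=[0, 1, 0],
    gamma=0.5,
)
tis_value = float(per_traj.mean())
print(per_traj, tis_value)
```

Because the exponent is a sum over all steps of a trajectory, its magnitude tends to grow with trajectory length, so $w_\tau$ can become extremely large or small when the two policies differ. Clipping or self-normalizing the weights are common mitigations, at the cost of some bias.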

### DR
Doubly Robust (DR) estimator.

DR combines importance sampling with a control variate (Q-function) to
reduce variance. For each trajectory $\tau$, the estimate is:

$$
    \hat{V}_\tau^{\text{DR}} = \sum_{t=0}^{T_\tau-1} \gamma^t \left[
        w_t (r_{t+1} - Q(s_t, a_t)) + w_{t-1} Q(s_t, \pi_{\text{eval}}(s_t))
    \right]
$$

where $w_t = \prod_{k=0}^t \frac{\pi_{\text{eval}}(a_k | s_k)}{\pi_{\text{behavior}}(a_k | s_k)}$
is the step-wise importance weight, $w_{-1} = 1$, and
$Q(s_t, a_t)$ is the Q-value for the behavior action while
$Q(s_t, \pi_{\text{eval}}(s_t))$ is for the evaluation policy action.

The overall estimate is:

$$

    \hat{V}^{\text{DR}} = \frac{1}{M} \sum_{\tau=1}^M \hat{V}_\tau^{\text{DR}}
$$
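The per-trajectory DR sum can be sketched as follows. This is an illustration under stated assumptions, not the `dr_estimate` implementation: inputs are arrays for a single trajectory in step order, `log_ratio[t]` is $\log \pi_{\text{eval}}(a_t | s_t) - \log \pi_{\text{behavior}}(a_t | s_t)$, `rewards[t]` is the reward received after $a_t$, and `dr_single_trajectory` is a hypothetical name.

``` python
import numpy as np


def dr_single_trajectory(rewards, log_ratio, q_behavior, q_eval, gamma):
    """Doubly robust value of one trajectory.

    q_behavior[t] = Q(s_t, a_t) for the logged action,
    q_eval[t]     = Q(s_t, pi_eval(s_t)).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    q_behavior = np.asarray(q_behavior, dtype=np.float64)
    q_eval = np.asarray(q_eval, dtype=np.float64)
    # Step-wise cumulative weights w_t, computed in log space.
    w = np.exp(np.cumsum(np.asarray(log_ratio, dtype=np.float64)))
    # Shifted weights w_{t-1}, with w_{-1} = 1.
    w_prev = np.concatenate(([1.0], w[:-1]))
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * (w * (rewards - q_behavior) + w_prev * q_eval)))


# Toy two-step trajectory where Q is exact for a policy that repeats the
# logged actions: Q at t=1 is -1, Q at t=0 is -1 + 0.5 * (-1) = -1.5.
common = dict(rewards=[-1.0, -1.0], q_behavior=[-1.5, -1.0],
              q_eval=[-1.5, -1.0], gamma=0.5)
dr_on_policy = dr_single_trajectory(log_ratio=[0.0, 0.0], **common)
dr_off_policy = dr_single_trajectory(log_ratio=[np.log(2.0), 0.0], **common)
print(dr_on_policy, dr_off_policy)
```

In this toy case the Q-values are exact, so the correction terms cancel and both calls return the same value regardless of the importance weights. With an inaccurate Q function the weighted residual terms no longer cancel, and large weights can dominate the estimate.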


In [20]:
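#TODO: Needs revision: for the initial state's Q-value, should the Q-value of the initial state be used rather than the FQE Q-value?
#TODO: Separate computing each trajectory's V value from computing the mean & CI.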
dm_all = dm_estimate(inputs, initial_only=False)
# dm_init = dm_estimate(inputs, initial_only=True)
tis_res = tis_estimate(inputs)
dr_res = dr_estimate(inputs)

print("DM (step-wise):", dm_all)
# print("DM (initial-state):", dm_init)
print("TIS:", tis_res)
print("DR:", dr_res)



DM (step-wise): EstimateResult(mean=-27.3273868560791, ci_lower=-27.494298934936523, ci_upper=-27.148042678833008)
TIS: EstimateResult(mean=-2.541983344700355e+21, ci_lower=-7.689500609917867e+21, ci_upper=-6.466416358947754)
DR: EstimateResult(mean=-1.87473706078356e+21, ci_lower=-5.671088024972073e+21, ci_upper=5695156699267072.0)


In [19]:
# Evaluate the true return online (optional, time-consuming)
if True:
    mean_r_b, std_r_b = evaluate_policy(behavior_algo, env, n_eval_episodes=5, deterministic=True)
    mean_r_e, std_r_e = evaluate_policy(eval_algo, env, n_eval_episodes=5, deterministic=True)
    print(f"behavior return: {mean_r_b:.2f} ± {std_r_b:.2f}")
    print(f"eval return:     {mean_r_e:.2f} ± {std_r_e:.2f}")





print, Train, negative_overload_and_big_phi_termination。 steps: 251。target: (206.70, 8.42, -17.65)。achieved target: (70.86, -15.47, 25.64)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 58。target: (235.09, -6.62, 27.86)。achieved target: (178.99, 3.42, 52.36)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 93。target: (212.37, 2.14, 28.23)。achieved target: (162.29, 15.84, 84.77)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 84。target: (228.70, 5.80, -26.75)。achieved target: (168.67, 19.70, 76.02)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 35。target: (186.93, -8.30, -18.39)。achieved target: (192.96, 1.07, 20.23)。expert steps: 0。

print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 122。target: (179.68, -0.14, 20.97)。achieved target: (186.12, -14.12, 25.64)。expert steps: 0。
print, Train, timeout_termination。 steps: 399。target: (246.52, 4.16, -17.18)。achieved target: (265.60, -1.20, 35.98)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 54。target: (204.50, 4.12, -26.89)。achieved target: (193.27, 9.00, 14.09)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 101。target: (217.99, -2.63, 5.38)。achieved target: (208.04, -8.86, 10.96)。expert steps: 0。
print, Train, continuousely_move_away_termination_based_on_mu_error_and_chi_error。 steps: 59。target: (216.95, 3.38, 1.38)。achieved target: (199.45, -0.20, 9.97)。expert steps: 0。
behavior return: -209.68 ± 31.57
eval return:     -180.59 ± 3.18
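When comparing these numbers with the OPE estimates, note that `evaluate_policy` sums the raw rewards of each episode, that is, it reports undiscounted returns, whereas the TIS and DR formulas in this notebook use discounted returns with $\gamma$. A minimal sketch of the difference, using a hypothetical `discounted_return` helper and made-up rewards:

``` python
import numpy as np


def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * reward_t for a single episode."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))


# Toy episode with three steps.
episode_rewards = [-1.0, -1.0, -1.0]
undiscounted = float(np.sum(episode_rewards))
discounted = discounted_return(episode_rewards, gamma=0.5)
print(undiscounted, discounted)
```

For a like-for-like ground truth, one option is to roll out the evaluation policy, record per-step rewards, and apply the same discounting that the estimators use.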
