# Learning a Stable Transition: A Representation-First Approach to Offline-to-Online Reinforcement Learning

**Abstract**: A core difficulty in Offline-to-Online (O2O) RL is mitigating "value hallucination", a failure mode also seen in model-based RL, when the agent encounters novel states during online interaction. Replay-based methods manage this transition only implicitly: they often lack an explicit mechanism for quantifying distribution shift, which leads to instability. We propose a "representation-first" O2O framework that decouples the memory of the offline data from the online learning process. We first learn a Data Support Representation (DSR), a density- or energy-based model, exclusively on the offline dataset. The DSR provides a continuous measure of support for any state-action pair. During the online phase it is used in two ways: (1) as a pessimism regularizer for the value function, with conservatism scaled inversely to data support, and (2) as an intrinsic exploration bonus that guides the agent toward regions just beyond the boundary of the offline dataset. This approach removes the need for a shared replay buffer, aligns with work on representations for continual learning, and lets the online learner (e.g., a simple on-policy agent) adapt without catastrophic overestimation. We show that our method prevents the performance drop commonly seen in O2O fine-tuning and yields more stable, monotonic improvement in online performance.
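
The two roles of the DSR described in the abstract can be illustrated with a small sketch. It assumes the DSR exposes a support score in [0, 1], where 1 means "well inside the offline data". The function names, the coefficients `beta` and `eta`, and the bump-shaped bonus centred on a `boundary` value are all illustrative assumptions, not the formulation used by the repo scripts.

```python
import math


def pessimism_penalty(support: float, beta: float = 1.0) -> float:
    """Penalty on a value estimate; grows as data support falls (assumed linear form)."""
    support = min(max(support, 0.0), 1.0)  # clamp to [0, 1]
    return beta * (1.0 - support)


def exploration_bonus(support: float, eta: float = 0.1,
                      boundary: float = 0.5, width: float = 0.2) -> float:
    """Bonus that peaks near an assumed support boundary and decays on both sides."""
    support = min(max(support, 0.0), 1.0)
    return eta * math.exp(-((support - boundary) ** 2) / (2.0 * width ** 2))


def shaped_target(reward: float, next_value: float, support: float,
                  gamma: float = 0.99, beta: float = 1.0, eta: float = 0.1) -> float:
    """One-step target: bonus added to the reward, penalty applied to the bootstrap value."""
    bonus = exploration_bonus(support, eta)
    penalty = pessimism_penalty(support, beta)
    return reward + bonus + gamma * (next_value - penalty)
```

Under these assumptions, a well-supported pair receives almost no penalty and little bonus, a pair near the boundary receives the largest bonus, and a pair far outside the data is dominated by the penalty.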

In [None]:
# Setup: install minimal dependencies (Gymnasium Mujoco, Mujoco engine, Torch, NumPy, h5py)
!pip -q install "gymnasium[mujoco]==0.29.1" mujoco==3.1.6 torch numpy h5py


In [None]:
# Mount Google Drive and add the repo to sys.path
from google.colab import drive
drive.mount('/content/drive')
import os
import sys
REPO_ROOT = '/content/drive/MyDrive/O2O'  # change if you store it elsewhere
if REPO_ROOT not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append(REPO_ROOT)
print('Repo root:', REPO_ROOT)
!ls -la "$REPO_ROOT"


In [None]:
# Config — choose task and directories in Drive
TASK = 'hopper-medium-v2'  # or 'walker2d-medium-v2'
ENV_MAP = {'hopper-medium-v2':'Hopper-v4','walker2d-medium-v2':'Walker2d-v4'}
ENV_ID = ENV_MAP[TASK]
DRIVE_ROOT = '/content/drive/MyDrive/O2O'
DATA_DIR = f'{DRIVE_ROOT}/data'
CKPT_DIR = f'{DRIVE_ROOT}/checkpts'
LOG_DIR = f'{DRIVE_ROOT}/logs'
for d in (DATA_DIR, CKPT_DIR, LOG_DIR):
    os.makedirs(d, exist_ok=True)
# File-name stem shared by all derived artifacts, e.g. 'hopper_medium'
TASK_SLUG = TASK.replace('-v2', '').replace('-', '_')
# D4RL hosts the gym_mujoco_v2 files with underscores before the version
# suffix (e.g. 'hopper_medium-v2.hdf5'), so build the name from the slug.
H5_FILE = f'{TASK_SLUG}-v2.hdf5'
H5_PATH = f'{DATA_DIR}/{H5_FILE}'
NPZ_PATH = f'{DATA_DIR}/{TASK_SLUG}.npz'
DSR_OUT = f'{CKPT_DIR}/{TASK_SLUG}_dsr.pt'
# Optional: Behavior Cloning pretraining for policy init
RUN_BC = True
BC_ACTOR = f'{CKPT_DIR}/{TASK_SLUG}_bc_actor.pt'
print('ENV_ID:', ENV_ID)
print('H5:', H5_PATH)
print('NPZ:', NPZ_PATH)
print('DSR:', DSR_OUT)
print('BC actor (if RUN_BC):', BC_ACTOR)


In [None]:
# Download D4RL-style HDF5 and convert to NPZ (states/actions only)
import urllib.request
import h5py
import numpy as np

BASE_URL = 'https://rail.eecs.berkeley.edu/datasets/offline_rl/gym_mujoco_v2'
if not os.path.exists(H5_PATH):
    url = f"{BASE_URL}/{H5_FILE}"
    print('Downloading', url)
    # Download under a temporary name first so that an interrupted transfer
    # does not leave a truncated file that the exists-check would accept.
    tmp_path = H5_PATH + '.part'
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, H5_PATH)
    print('Saved', H5_PATH)
else:
    print('Found HDF5', H5_PATH)


def h5_to_npz(h5_path, npz_path, max_samples=None):
    """Copy observations and actions from an HDF5 file into a compressed NPZ.

    If max_samples is set, keep a random subset of that many transitions.
    Returns the shapes of the saved (states, actions) arrays.
    """
    with h5py.File(h5_path, 'r') as f:
        obs = np.array(f['observations'], dtype=np.float32)
        acts = np.array(f['actions'], dtype=np.float32)
    if max_samples is not None and max_samples < len(obs):
        idx = np.random.permutation(len(obs))[:max_samples]
        obs, acts = obs[idx], acts[idx]
    np.savez_compressed(npz_path, states=obs, actions=acts)
    return obs.shape, acts.shape


if not os.path.exists(NPZ_PATH):
    shapes = h5_to_npz(H5_PATH, NPZ_PATH, max_samples=None)
    print('Converted to NPZ', NPZ_PATH, 'shapes=', shapes)
else:
    print('Found NPZ', NPZ_PATH)
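
Before training on the converted file, a quick sanity check can catch mismatched or corrupted arrays. The sketch below is one possible check. The helper name `check_npz` is an assumption, while the keys `states` and `actions` are the ones `h5_to_npz` writes.

```python
import numpy as np


def check_npz(npz_path: str) -> tuple:
    """Load an NPZ with keys 'states' and 'actions' and validate its contents."""
    with np.load(npz_path) as data:
        states, actions = data["states"], data["actions"]
    if states.ndim != 2 or actions.ndim != 2:
        raise ValueError(f"expected 2-D arrays, got {states.shape} and {actions.shape}")
    if len(states) != len(actions):
        raise ValueError(f"length mismatch: {len(states)} states vs {len(actions)} actions")
    if not (np.isfinite(states).all() and np.isfinite(actions).all()):
        raise ValueError("dataset contains NaN or infinite values")
    return states.shape, actions.shape
```

In this notebook it could be called as `check_npz(NPZ_PATH)` right after the conversion cell.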


In [None]:
# Train DSR on the offline dataset using the repo script
import torch
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using device:', DEVICE)
!python {REPO_ROOT + '/train_dsr.py'} --offline_path {NPZ_PATH} --env_id {ENV_ID} --dsr_out {DSR_OUT} --epochs 20 --batch_size 2048 --device {DEVICE}
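
The model inside `train_dsr.py` is not shown in this notebook. As a rough illustration of what a density-based support score could look like, the sketch below fits a diagonal Gaussian to concatenated state-action vectors and maps the normalized distance from the data mean to a score in [0, 1]. The class `GaussianSupport` and its methods are hypothetical stand-ins, not the repo's API, and the synthetic data only mimics Hopper's dimensions (11 observations, 3 actions).

```python
import numpy as np


class GaussianSupport:
    """Hypothetical stand-in for a DSR: diagonal Gaussian over (state, action)."""

    def fit(self, states: np.ndarray, actions: np.ndarray) -> "GaussianSupport":
        x = np.concatenate([states, actions], axis=1)
        self.mean = x.mean(axis=0)
        self.std = x.std(axis=0) + 1e-6  # small constant avoids division by zero
        return self

    def support(self, states: np.ndarray, actions: np.ndarray) -> np.ndarray:
        """Score in [0, 1]: equals 1 at the data mean and decays with distance."""
        x = np.concatenate([states, actions], axis=1)
        z = (x - self.mean) / self.std
        d2 = (z ** 2).mean(axis=1)  # mean squared z-score per sample
        return np.exp(-0.5 * d2)


# Synthetic stand-in for an offline dataset (not real D4RL data).
rng = np.random.default_rng(0)
S = rng.normal(size=(1000, 11)).astype(np.float32)
A = rng.uniform(-1.0, 1.0, size=(1000, 3)).astype(np.float32)

dsr = GaussianSupport().fit(S, A)
in_dist = dsr.support(S[:5], A[:5])         # samples drawn from the data
far = dsr.support(S[:5] + 10.0, A[:5])      # states shifted far from the data
```

A single Gaussian cannot represent multi-modal data, so this sketch shows the interface only: fit on offline pairs, then score arbitrary pairs.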


In [None]:
# Optional: Behavior Cloning pretraining to warm-start the actor
if RUN_BC:
    print('Running BC pretraining...')
    !python {REPO_ROOT + '/pretrain_bc.py'} --offline_path {NPZ_PATH} --env_id {ENV_ID} --out_actor {BC_ACTOR} --device {DEVICE} --epochs 20 --batch_size 1024
else:
    print('Skipping BC pretraining (RUN_BC=False)')


In [None]:
# Compose optional init flag for PPO if BC was run
INIT_ACTOR_ARG = f'--init_actor {BC_ACTOR}' if RUN_BC else ''
print('INIT_ACTOR_ARG:', INIT_ACTOR_ARG)


In [None]:
# Run online PPO with DSR pessimism + bonus
TOTAL_STEPS = 300_000
STEPS_PER_EPOCH = 4096
MINIBATCH = 256
ITERS = 10
!python {REPO_ROOT + '/train_online.py'} --env_id {ENV_ID} --dsr_path {DSR_OUT} --total_steps {TOTAL_STEPS} --steps_per_epoch {STEPS_PER_EPOCH} --minibatch_size {MINIBATCH} --train_iters {ITERS} --device {DEVICE} --log_csv {LOG_DIR} {INIT_ACTOR_ARG}
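
The same command can also be assembled as an argument list, which avoids shell quoting problems if a Drive path ever contains spaces. This is a minimal sketch: the helper `build_online_cmd` is a hypothetical name, the flags are copied from the cell above, and the example paths follow the notebook's default configuration.

```python
import shlex
import sys


def build_online_cmd(repo_root, env_id, dsr_path, log_dir, device,
                     total_steps=300_000, steps_per_epoch=4096,
                     minibatch_size=256, train_iters=10, init_actor=None):
    """Return the argv list for train_online.py (flags as used in the cell above)."""
    cmd = [
        sys.executable, f"{repo_root}/train_online.py",
        "--env_id", env_id,
        "--dsr_path", dsr_path,
        "--total_steps", str(total_steps),
        "--steps_per_epoch", str(steps_per_epoch),
        "--minibatch_size", str(minibatch_size),
        "--train_iters", str(train_iters),
        "--device", device,
        "--log_csv", log_dir,
    ]
    if init_actor:  # only pass the flag when a BC checkpoint is provided
        cmd += ["--init_actor", init_actor]
    return cmd


cmd = build_online_cmd(
    repo_root="/content/drive/MyDrive/O2O",
    env_id="Hopper-v4",
    dsr_path="/content/drive/MyDrive/O2O/checkpts/hopper_medium_dsr.pt",
    log_dir="/content/drive/MyDrive/O2O/logs",
    device="cpu",
)
print(shlex.join(cmd))  # shell-safe rendering of the command, for inspection
```

The list can then be executed with `subprocess.run(cmd, check=True)`, which raises an error if the training script exits with a non-zero status.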
