# Introduction

In this tutorial, we will show how to use PyTupli to set up an efficient pipeline for offline reinforcement learning (RL) for a custom environment. This includes
- creating a benchmark and uploading it to an instance of a TupliStorage,
- re-loading this benchmark from the storage,
- recording RL tuples (state, action, reward, terminal, timeout) for this benchmark and uploading them to the storage,
- creating a dataset from the stored episodes, and
- training an offline RL agent using d3rlpy.

The last step is optional. If you want to try it, install the d3rlpy library with `pip install d3rlpy`.

In [1]:
import io
import os
import tempfile
import math
from typing import Optional

import numpy as np
import pandas as pd

from gymnasium import spaces
from gymnasium.envs.classic_control import utils, MountainCarEnv
from gymnasium.wrappers import TimeLimit

import d3rlpy
from d3rlpy.algos import DiscreteCQLConfig
from d3rlpy.dataset import MDPDataset

from pytupli.benchmark import TupliEnvWrapper
from pytupli.storage import TupliAPIClient, TupliStorage, FileStorage
from gymnasium import Env
from pytupli.schema import ArtifactMetadata, FilterEQ, EpisodeMetadataCallback
from pytupli.dataset import TupliDataset, NumpyTupleParser

2025-07-28 15:45.23 [info     ] Register Shimmy environments.


PyTupli has two storage options: a local FileStorage and the TupliAPIClient, which uses MongoDB as a backend. You can run this notebook with either storage type by adjusting the flag below. If you want to use the TupliAPIClient, follow the instructions in the README to start the application.

In [2]:
STORAGE_FLAG = 'api'  # 'api' or 'file'

### Creating a Custom Environment
We will use the MountainCar example from gymnasium with a small modification: the car is slowed down by wind in the horizontal direction. We load the wind data from a CSV file.
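The wind file itself is not included in this tutorial. The following is a minimal sketch for generating a stand-in file. It assumes the layout implied by how the environment reads the data (`index_col=0, header=None`): no header row, the step index in the first column, and the wind value in the second. The function name `write_synthetic_wind_csv` and the value range are made up for illustration and say nothing about the real data.

```python
import csv
import os
import random
import tempfile


def write_synthetic_wind_csv(path, n_steps=1000, seed=0):
    """Write a header-less CSV with one row per time step: (step index, wind value)."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for step in range(n_steps):
            # Made-up wind values in [0, 0.05]; the real data set may look different.
            writer.writerow([step, round(rng.uniform(0.0, 0.05), 4)])


# Write the file to a temporary directory and read it back as a sanity check.
wind_path = os.path.join(tempfile.mkdtemp(), "wind_data.csv")
write_synthetic_wind_csv(wind_path)

with open(wind_path, newline="") as f:
    rows = list(csv.reader(f))
print(len(rows), rows[0][0], rows[-1][0])  # 1000 0 999
```

The file needs at least as many rows as the longest possible episode, because the environment looks up the wind value by the current step index.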

In [3]:
class CustomMountainCarEnv(MountainCarEnv):
    """
    ## Description

    The Mountain Car MDP is a deterministic MDP that consists of a car placed stochastically
    at the bottom of a sinusoidal valley, with the only possible actions being the accelerations
    that can be applied to the car in either direction. The goal of the MDP is to strategically
    accelerate the car to reach the goal state on top of the right hill. There are two versions
    of the mountain car domain in gymnasium: one with discrete actions and one with continuous.
    This version is the one with discrete actions.

    This MDP first appeared in [Andrew Moore's PhD Thesis (1990)](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-209.pdf)

    ```
    @TECHREPORT{Moore90efficientmemory-based,
        author = {Andrew William Moore},
        title = {Efficient Memory-based Learning for Robot Control},
        institution = {University of Cambridge},
        year = {1990}
    }
    ```

    ## Observation Space

    The observation is a `ndarray` with shape `(2,)` where the elements correspond to the following:

    | Num | Observation                          | Min   | Max  | Unit         |
    |-----|--------------------------------------|-------|------|--------------|
    | 0   | position of the car along the x-axis | -1.2  | 0.6  | position (m) |
    | 1   | velocity of the car                  | -0.07 | 0.07 | velocity (v) |

    ## Action Space

    There are 3 discrete deterministic actions:

    - 0: Accelerate to the left
    - 1: Don't accelerate
    - 2: Accelerate to the right

    ## Transition Dynamics:

    Given an action, the mountain car follows the following transition dynamics:

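    *velocity<sub>t+1</sub> = velocity<sub>t</sub> + (action - 1) * force - cos(3 * position<sub>t</sub>) * gravity - wind<sub>t</sub> * cos(position<sub>t</sub>)*

    *position<sub>t+1</sub> = position<sub>t</sub> + velocity<sub>t+1</sub>*

    where force = 0.001, gravity = 0.0025, and wind<sub>t</sub> is the value of the wind data at step t
    (scaled by 0.01 when the CSV file is loaded). The collisions at either end are inelastic with the velocity set to 0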
    upon collision with the wall. The position is clipped to the range `[-1.2, 0.6]` and
    velocity is clipped to the range `[-0.07, 0.07]`.

    ## Reward:

    The goal is to reach the flag placed on top of the right hill as quickly as possible, as such the agent is
    penalised with a reward of -1 for each timestep.

    ## Starting State

    The position of the car is assigned a uniform random value in *[-0.6 , -0.4]*.
    The starting velocity of the car is always assigned to 0.

    ## Episode End

    The episode ends if either of the following happens:
    1. Termination: The position of the car is greater than or equal to 0.5 (the goal position on top of the right hill)
    2. Truncation: The episode reaches the step limit of the `TimeLimit` wrapper (999 steps in this tutorial).

    ## Arguments

    Mountain Car has two parameters for `gymnasium.make` with `render_mode` and `goal_velocity`.
    On reset, the `options` parameter allows the user to change the bounds used to determine the new random state.

    ```python
    >>> import gymnasium as gym
    >>> env = gym.make("MountainCar-v0", render_mode="rgb_array", goal_velocity=0.1)  # default goal_velocity=0
    >>> env
    <TimeLimit<OrderEnforcing<PassiveEnvChecker<MountainCarEnv<MountainCar-v0>>>>>
    >>> env.reset(seed=123, options={"low": -0.6, "high": -0.4})  # these are the default bounds
    (array([-0.46352962,  0.        ], dtype=float32), {})

    ```

    ## Version History

    * v0: Initial versions release
    """

    metadata = {
        'render_modes': ['human', 'rgb_array'],
        'render_fps': 30,
    }

    def __init__(self, data_path: str, render_mode: Optional[str] = None, goal_velocity=0):
        self.min_position = -1.2
        self.max_position = 0.6
        self.max_speed = 0.07
        self.goal_position = 0.5
        self.goal_velocity = goal_velocity
        self.current_step = 0
        self.data = pd.read_csv(data_path, index_col=0, header=None) * 0.01

        self.force = 0.001
        self.gravity = 0.0025

        self.low = np.array([self.min_position, -self.max_speed], dtype=np.float32)
        self.high = np.array([self.max_position, self.max_speed], dtype=np.float32)

        self.render_mode = render_mode

        self.screen_width = 600
        self.screen_height = 400
        self.screen = None
        self.clock = None
        self.isopen = True

        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(self.low, self.high, dtype=np.float32)

    def step(self, action: int):
        assert self.action_space.contains(action), f'{action!r} ({type(action)}) invalid'

        position, velocity = self.state
        velocity += (
            (action - 1) * self.force
            + math.cos(3 * position) * (-self.gravity)
            - self.data.loc[self.current_step].to_numpy().flatten()[0] * math.cos(position)
        )
        velocity = np.clip(velocity, -self.max_speed, self.max_speed)
        position += velocity
        position = np.clip(position, self.min_position, self.max_position)
        if position == self.min_position and velocity < 0:
            velocity = 0

        terminated = bool(position >= self.goal_position and velocity >= self.goal_velocity)
        reward = -1.0
        self.current_step += 1

        self.state = (position, velocity)
        if self.render_mode == 'human':
            self.render()
        # truncation=False as the time limit is handled by the `TimeLimit` wrapper added during `make`
        return np.array(self.state, dtype=np.float32), reward, terminated, False, {}

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        options: Optional[dict] = None,
    ):
        super().reset(seed=seed)
        # Note that if you use custom reset bounds, it may lead to out-of-bound
        # state/observations.
        self.current_step = 0
        low, high = utils.maybe_parse_reset_bounds(options, -0.6, -0.4)
        self.state = np.array([self.np_random.uniform(low=low, high=high), 0])

        if self.render_mode == 'human':
            self.render()
        return np.array(self.state, dtype=np.float32), {}
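To make the modified dynamics concrete, the sketch below restates the update rule from `step()` as a standalone function. The constants are copied from the class above; the function name `wind_step` and the wind value used in the example are assumptions for illustration only.

```python
import math

FORCE = 0.001
GRAVITY = 0.0025
MAX_SPEED = 0.07
MIN_POSITION, MAX_POSITION = -1.2, 0.6


def wind_step(position, velocity, action, wind):
    """Apply one transition of the wind-modified mountain car dynamics."""
    velocity += (
        (action - 1) * FORCE
        - math.cos(3 * position) * GRAVITY
        - wind * math.cos(position)  # extra term compared to the standard environment
    )
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    # Inelastic collision with the left wall
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0
    return position, velocity


# Same state and action, once without wind and once with a made-up wind value
calm = wind_step(-0.5, 0.0, action=2, wind=0.0)
windy = wind_step(-0.5, 0.0, action=2, wind=0.0005)
print(calm[1] > windy[1])  # True
```

With `wind=0.0` the function reduces to the standard MountainCar update. A positive wind value reduces the velocity whenever `cos(position)` is positive, which holds over the whole position range `[-1.2, 0.6]`.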

### Serialize Environment for Upload
As a next step, we want to upload our environment to our storage using PyTupli. For this, we detach the CSV file from the environment, upload it separately, and replace the data attribute in the environment with the id of the stored object. This allows us to re-use artifacts such as CSV files in multiple benchmarks. For example, suppose you only want to change one parameter of the environment, e.g., the maximum speed. You would have to create a new benchmark, but could re-use the CSV file. PyTupli automatically recognizes such duplicates.

To separate the CSV file, we subclass TupliEnvWrapper and override the `_serialize()` and `_deserialize()` methods. The TupliEnvWrapper is essentially a gymnasium wrapper that records RL tuples in the `step()` function, with additional functionality for interacting with the storage.
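The detach-and-reference idea can be illustrated without PyTupli. The sketch below is an analogy only: `InMemoryArtifactStore`, `ToyEnv`, `detach`, and `attach` are hypothetical names, and detecting duplicates by hashing the content is an assumption about how such a mechanism could work, not a description of PyTupli's internals.

```python
import hashlib
import uuid


class InMemoryArtifactStore:
    """Hypothetical stand-in for a storage backend; not part of PyTupli."""

    def __init__(self):
        self._content_by_id = {}
        self._id_by_hash = {}

    def store(self, content: bytes) -> str:
        # Identical content maps to the same hash, so it is stored only once.
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._id_by_hash:
            return self._id_by_hash[digest]
        artifact_id = uuid.uuid4().hex
        self._content_by_id[artifact_id] = content
        self._id_by_hash[digest] = artifact_id
        return artifact_id

    def load(self, artifact_id: str) -> bytes:
        return self._content_by_id[artifact_id]


class ToyEnv:
    """Toy environment holding a data attribute and one parameter."""

    def __init__(self, data: str, max_speed: float):
        self.data = data
        self.max_speed = max_speed


def detach(env, store):
    """Replace the data attribute with the id of the stored artifact."""
    env.data = store.store(env.data.encode("utf-8"))
    return env


def attach(env, store):
    """Restore the data attribute from the stored artifact."""
    env.data = store.load(env.data).decode("utf-8")
    return env


store = InMemoryArtifactStore()
csv_text = "0,0.01\n1,0.02\n"
env_a = detach(ToyEnv(csv_text, max_speed=0.07), store)
env_b = detach(ToyEnv(csv_text, max_speed=0.05), store)
print(env_a.data == env_b.data)  # True
```

Both toy environments differ in `max_speed` but end up referencing the same artifact id, which mirrors the re-use of one CSV file across several benchmarks.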

In [4]:
class MyTupliEnvWrapper(TupliEnvWrapper):
    def _serialize(self, env: Env) -> tuple:
        """Detach the wind data, store it as an artifact, and keep only its id in the environment."""
        related_data_sources = []
        ds = env.unwrapped.data
        metadata = ArtifactMetadata(name='test')
        data_kwargs = {'header': None}
        try:
            # Serialize the DataFrame to CSV bytes
            content = ds.to_csv(encoding='utf-8', **data_kwargs)
            content = content.encode(encoding='utf-8')
        except Exception as e:
            raise ValueError(f'Failed to serialize data source: {e}') from e

        # Upload the CSV and replace the DataFrame with the id of the stored artifact
        ds_storage_metadata = self.storage.store_artifact(artifact=content, metadata=metadata)
        related_data_sources.append(ds_storage_metadata.id)
        env.unwrapped.data = ds_storage_metadata.id
        return env, related_data_sources

    @classmethod
    def _deserialize(cls, env: Env, storage: TupliStorage) -> Env:
        """Download the artifact referenced by the environment and restore the DataFrame."""
        data_kwargs = {'header': None, 'index_col': 0}
        ds = storage.load_artifact(env.unwrapped.data)
        ds = ds.decode('utf-8')
        d = io.StringIO(ds)
        df = pd.read_csv(d, **data_kwargs)

        env.unwrapped.data = df
        return env

In [5]:
# Select which storage to use
if STORAGE_FLAG == 'api':
    storage = TupliAPIClient()
elif STORAGE_FLAG == 'file':
    storage = FileStorage()
else:
    raise ValueError(f"Unknown storage flag: {STORAGE_FLAG}. Has to be 'api' or 'file'.")

In [6]:
# instantiate the environment
max_eps_length = 999
# Adjust this path to the location of wind_data.csv on your machine
data_path = '/home/hannah/Documents/Code/pytupli/docs/source/tutorials/data/wind_data.csv'
env = TimeLimit(
    CustomMountainCarEnv(render_mode=None, data_path=data_path), max_episode_steps=max_eps_length
)
# Now we can create the benchmark
tupli_env = MyTupliEnvWrapper(env, storage=storage)

### Uploading and Downloading Benchmarks
We will now upload the benchmark and download it again.

In [7]:
tupli_env.store(name='mountain-car-v0', description='Mountain Car v0 benchmark')

Let us list the uploaded benchmarks:

In [8]:
%system
!pytupli list_benchmarks

id                                created_by    is_public    created_at                  hash                                                              metadata
--------------------------------  ------------  -----------  --------------------------  ----------------------------------------------------------------  -------------------------------------------------------------------------------------------------------------------------
9ab24bb4e96c4b38bf0c3577bdb4e4bb  hannah        ❌           2025-07-10T13:37:28.852619  a23a5fdeafb580c2d170f19da7ce4969164d8e458c6eeef2494ad71ff5c340c1  {'name': 'test_env', 'description': '', 'difficulty': None, 'version': None, 'extra': {}}
411704fb2dd34f50982817141264840b  hannah        ❌           2025-07-10T13:48:14.966254  6837628829fbb513e75eae4e0f33ad322f310780731c2308ab90fc4218fb8d47  {'name': 'second_test_env', 'description': '', 'difficulty': None, 'version': None, 'extra': {}}
989f723675fc4b85ab19c1621a96f9dc  hannah        ❌           2025

As a next step, we show how to download the benchmark. Note that this step is only for demonstration purposes. When loading the benchmark, we can pass a callback that will later be used to add metadata to recorded episodes. We provide a simple example of such a callback.

In [9]:
class MyCallback(EpisodeMetadataCallback):
    def __init__(self):
        super().__init__()
        # We will compute the cumulative reward for an episode
        self.cum_reward = 0
        # Furthermore, we want to store the fact that the episode was not an expert episode
        self.is_expert = False

    def reset(self):
        # Start the next episode with a cumulative reward of zero
        self.cum_reward = 0

    def __call__(self, tuple):
        # Accumulate the reward and return the metadata for the current episode
        self.cum_reward += tuple.reward
        return {"cum_eps_reward": [self.cum_reward], "is_expert": self.is_expert}
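The behavior of such a callback can be checked in isolation. The sketch below uses a stand-in class without the PyTupli base class and `SimpleNamespace` objects in place of RL tuples. It assumes that the wrapper invokes the callback once per recorded tuple and calls `reset()` between episodes; this calling convention is not shown in this tutorial.

```python
from types import SimpleNamespace


class CumulativeRewardCallback:
    """Stand-in with the same three methods as MyCallback, minus the PyTupli base class."""

    def __init__(self):
        self.cum_reward = 0.0
        self.is_expert = False

    def reset(self):
        self.cum_reward = 0.0

    def __call__(self, rl_tuple):
        self.cum_reward += rl_tuple.reward
        return {"cum_eps_reward": [self.cum_reward], "is_expert": self.is_expert}


callback = CumulativeRewardCallback()
metadata = {}
# Simulate three recorded steps with a reward of -1 each
for _ in range(3):
    metadata = callback(SimpleNamespace(reward=-1.0))
print(metadata)  # {'cum_eps_reward': [-3.0], 'is_expert': False}

# After a reset, the accumulation starts again from zero
callback.reset()
print(callback.cum_reward)  # 0.0
```

Under this assumption, the metadata returned for the last tuple of an episode holds the total episode reward.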


In [10]:
loaded_tupli_env = MyTupliEnvWrapper.load(storage=storage, benchmark_id=tupli_env.id, metadata_callback=MyCallback())

### Recording Episodes for Offline RL Training
The TupliEnvWrapper records all interactions with the custom environment to the storage as tuples (state, action, reward, terminal, timeout). These tuples can then be used to train an offline RL agent for this environment with any offline RL library. For simplicity, we will use a random policy to generate the data.

In [11]:
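# For reproducibility when generating episodes
np.random.seed(42)
obs, info = loaded_tupli_env.reset(seed=42)

# `step` counts environment steps across all episodes, so the message below
# reports the global step at which an episode ended, not the episode length
for step in range(2000):
    # Random policy: sample one of the three discrete actions
    action = np.int64(np.random.randint(low=0, high=3))
    obs, reward, done, truncated, info = loaded_tupli_env.step(action)
    if done or truncated:
        print(f'Episode finished after {step + 1} timesteps')
        obs, info = loaded_tupli_env.reset()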

Episode finished after 999 timesteps
Episode finished after 1998 timesteps
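The two messages above report the global step counter, so both episodes were in fact 999 steps long. The following sketch shows one way to track the length of each episode separately. `StubEnv` is a hypothetical stand-in that never terminates and truncates after a fixed number of steps; it replaces the wrapped environment only to keep the example self-contained.

```python
import random


class StubEnv:
    """Hypothetical stand-in: never terminates, truncates after `max_steps` steps."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return (0.0, 0.0), {}

    def step(self, action):
        self.t += 1
        truncated = self.t >= self.max_steps
        return (0.0, 0.0), -1.0, False, truncated, {}


rng = random.Random(42)
stub_env = StubEnv(max_steps=999)
obs, info = stub_env.reset(seed=42)

episode_lengths = []  # lengths of all finished episodes
episode_length = 0    # steps taken in the current episode
for global_step in range(2000):
    action = rng.randrange(3)  # random policy over three discrete actions
    obs, reward, terminated, truncated, info = stub_env.step(action)
    episode_length += 1
    if terminated or truncated:
        episode_lengths.append(episode_length)
        episode_length = 0
        obs, info = stub_env.reset()

print(episode_lengths, episode_length)  # [999, 999] 2
```

With 2000 steps and a limit of 999 steps per episode, two episodes finish and two steps of a third episode remain.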


### Downloading Episodes for a Benchmark
Next, let us download all episodes that have been recorded for our benchmark. For this, we create a TupliDataset using a filter with the id of the benchmark.

In [12]:
# Create dataset
mdp_dataset = TupliDataset(storage=storage).with_benchmark_filter(
    FilterEQ(key='id', value=loaded_tupli_env.id)
)
mdp_dataset.load()

We can show the contents of the dataset using the `preview()` method:

In [13]:
mdp_dataset.preview()

[EpisodeItem(id='8236cd52226b44cdb9ed453d180ce4fa', created_by='hannah', is_public=False, created_at='2025-07-28T15:45:24.757499', benchmark_id='4b076cdedc4d433ea4da6d2801c05ca0', metadata={'cum_eps_reward': [-999.0], 'is_expert': False}, n_tuples=999, terminated=False, timeout=True, tuples=[RLTuple(state=[-0.5113905072212219, 0.0008338160114362836], action=2, reward=-1.0, info={}, terminal=False, timeout=False), RLTuple(state=[-0.5116723775863647, -0.0002818846551235765], action=0, reward=-1.0, info={}, terminal=False, timeout=False), RLTuple(state=[-0.5110931396484375, 0.0005792493466287851], action=2, reward=-1.0, info={}, terminal=False, timeout=False), RLTuple(state=[-0.5096522569656372, 0.0014408753486350179], action=2, reward=-1.0, info={}, terminal=False, timeout=False), RLTuple(state=[-0.5093242526054382, 0.000327992340316996], action=0, reward=-1.0, info={}, terminal=False, timeout=False), RLTuple(state=[-0.5101639032363892, -0.0008396401535719633], action=0, reward=-1.0, inf

### Training an Offline RL Agent
Finally, we use d3rlpy to train an offline RL agent on this environment. We do not train it to convergence; we only show how to get from a PyTupli dataset to running offline RL. Our TupliDataset has a method for converting all episodes into NumPy arrays for states, actions, rewards, terminals, and timeouts. This can be customized if other output formats are required. Using these arrays, we can then create an `MDPDataset`, which is the required input format for all d3rlpy algorithms.
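To make the conversion concrete, the sketch below flattens a list of episodes into five parallel sequences. It is an illustration of the idea, not PyTupli's implementation: the helper `flatten_episodes` is hypothetical, the episodes are made-up dictionaries whose keys mirror the fields shown in the preview above, and plain lists stand in for the NumPy arrays to keep the example free of dependencies. That the last tuple of a truncated episode carries `timeout=True` is an assumption.

```python
def flatten_episodes(episodes):
    """Concatenate the tuples of all episodes into five parallel lists."""
    observations, actions, rewards, terminals, timeouts = [], [], [], [], []
    for episode in episodes:
        for rl_tuple in episode["tuples"]:
            observations.append(rl_tuple["state"])
            actions.append(rl_tuple["action"])
            rewards.append(rl_tuple["reward"])
            terminals.append(float(rl_tuple["terminal"]))
            timeouts.append(float(rl_tuple["timeout"]))
    return observations, actions, rewards, terminals, timeouts


# Two made-up toy episodes; the values are illustrative only
toy_episodes = [
    {"tuples": [
        {"state": [-0.51, 0.001], "action": 2, "reward": -1.0, "terminal": False, "timeout": False},
        {"state": [-0.51, 0.000], "action": 0, "reward": -1.0, "terminal": False, "timeout": True},
    ]},
    {"tuples": [
        {"state": [0.49, 0.020], "action": 2, "reward": -1.0, "terminal": False, "timeout": False},
        {"state": [0.51, 0.020], "action": 2, "reward": -1.0, "terminal": True, "timeout": False},
    ]},
]

toy_obs, toy_act, toy_rew, toy_terminal, toy_truncated = flatten_episodes(toy_episodes)
print(len(toy_obs), toy_terminal, toy_truncated)  # 4 [0.0, 0.0, 0.0, 1.0] [0.0, 1.0, 0.0, 0.0]
```

After flattening, the episode boundaries are only recoverable from the terminal and timeout flags, which is why both are passed on to the `MDPDataset`.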

In [14]:
obs, act, rew, terminal, truncated = mdp_dataset.convert_to_tensors(parser=NumpyTupleParser)
# create d3rlpy dataset
d3rlpy_dataset = MDPDataset(
    observations=obs, actions=act, rewards=rew, terminals=terminal, timeouts=truncated
)

2025-07-28 15:45.25 [info     ] Signatures have been automatically determined. action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]) observation_signature=Signature(dtype=[dtype('float32')], shape=[(2,)]) reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)])
2025-07-28 15:45.25 [info     ] Action-space has been automatically determined. action_space=<ActionSpace.DISCRETE: 2>
2025-07-28 15:45.25 [info     ] Action size has been automatically determined. action_size=3


Finally, let us show that training an agent with conservative Q-learning (CQL) works on this data.
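As background for reading the training log of the next cell, the sketch below shows a commonly used form of the discrete CQL objective: a TD loss plus `alpha` times a conservative penalty, where the penalty is the log-sum-exp of the Q-values minus the Q-value of the action stored in the dataset. That d3rlpy computes exactly this is an assumption, although the logged values are consistent with it for `alpha=2.0` (epoch 4: 0.0016 + 2.0 × 1.0985 ≈ 2.1986). The function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)) for v in values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def conservative_penalty(q_values, dataset_action):
    """Penalty that is small when the dataset action has the largest Q-value."""
    return logsumexp(q_values) - q_values[dataset_action]


def cql_loss(td_loss, q_values, dataset_action, alpha):
    """Total loss under the assumed form: TD loss plus weighted penalty."""
    return td_loss + alpha * conservative_penalty(q_values, dataset_action)


# With three equal Q-values, the penalty equals ln(3) regardless of the action
uniform_q = [-50.0, -50.0, -50.0]
penalty = conservative_penalty(uniform_q, dataset_action=1)
print(round(penalty, 4))  # 1.0986
```

A `conservative_loss` that stays close to ln(3) ≈ 1.0986 therefore suggests that the learned Q-values are nearly equal across the three actions, which is plausible for data from a random policy and a short training run.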

In [15]:
# algorithm for offline training: CQL from d3rlpy
d3rlpy.seed(1)  # for reproducibility
algo = DiscreteCQLConfig(batch_size=64, alpha=2.0, target_update_interval=1000).create(device='cpu')
# train
algo.fit(dataset=d3rlpy_dataset, n_steps=10000, n_steps_per_epoch=100)

[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mdataset info                  [0m [36mdataset_info[0m=[35mDatasetInfo(observation_signature=Signature(dtype=[dtype('float32')], shape=[(2,)]), action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), action_space=<ActionSpace.DISCRETE: 2>, action_size=3)[0m
[2m2025-07-28 15:45.25[0m [[32m[1mdebug    [0m] [1mBuilding models...            [0m
[2m2025-07-28 15:45.25[0m [[32m[1mdebug    [0m] [1mModels have been built.       [0m
[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mDirectory is created at d3rlpy_logs/DiscreteCQL_20250728154525[0m
[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mParameters                    [0m [36mparams[0m=[35m{'observation_shape': [2], 'action_size': 3, 'config': {'type': 'discrete_cql', 'params': {'batch_size': 64, 'gamma': 0.99, 'observation_scaler': {'type': 'none', 'params': {}}, 'a

Epoch 1/100: 100%|██████████| 100/100 [00:00<00:00, 377.93it/s, loss=2.41, td_loss=0.213, conservative_loss=1.1]

[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=1 step=100[0m [36mepoch[0m=[35m1[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005567145347595215, 'time_algorithm_update': 0.0020180344581604004, 'loss': 2.3966327929496765, 'td_loss': 0.19853577110916376, 'conservative_loss': 1.09904851436615, 'time_step': 0.00263277530670166}[0m [36mstep[0m=[35m100[0m
[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_100.d3[0m



Epoch 2/100: 100%|██████████| 100/100 [00:00<00:00, 439.61it/s, loss=2.21, td_loss=0.016, conservative_loss=1.1]

[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=2 step=200[0m [36mepoch[0m=[35m2[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004550933837890625, 'time_algorithm_update': 0.00175459623336792, 'loss': 2.212840859889984, 'td_loss': 0.01538949720794335, 'conservative_loss': 1.0987256860733032, 'time_step': 0.0022612261772155763}[0m [36mstep[0m=[35m200[0m
[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_200.d3[0m



Epoch 3/100: 100%|██████████| 100/100 [00:00<00:00, 438.60it/s, loss=2.2, td_loss=0.00351, conservative_loss=1.1]

[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=3 step=300[0m [36mepoch[0m=[35m3[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00045412778854370117, 'time_algorithm_update': 0.0017646265029907227, 'loss': 2.200702414512634, 'td_loss': 0.003590560882585123, 'conservative_loss': 1.0985559260845184, 'time_step': 0.002267448902130127}[0m [36mstep[0m=[35m300[0m
[2m2025-07-28 15:45.25[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_300.d3[0m



Epoch 4/100: 100%|██████████| 100/100 [00:00<00:00, 392.28it/s, loss=2.2, td_loss=0.00166, conservative_loss=1.1]

[2m2025-07-28 15:45.26[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=4 step=400[0m [36mepoch[0m=[35m4[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005457544326782226, 'time_algorithm_update': 0.0019359111785888672, 'loss': 2.1985737800598146, 'td_loss': 0.001622819603071548, 'conservative_loss': 1.0984754824638367, 'time_step': 0.0025356578826904296}[0m [36mstep[0m=[35m400[0m





[2m2025-07-28 15:45.26[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_400.d3[0m


Epoch 5/100: 100%|██████████| 100/100 [00:00<00:00, 407.84it/s, loss=2.2, td_loss=0.00415, conservative_loss=1.1]

[2m2025-07-28 15:45.26[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=5 step=500[0m [36mepoch[0m=[35m5[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004919314384460449, 'time_algorithm_update': 0.001887047290802002, 'loss': 2.19991708278656, 'td_loss': 0.004285427670693025, 'conservative_loss': 1.09781583070755, 'time_step': 0.0024393200874328613}[0m [36mstep[0m=[35m500[0m
[2m2025-07-28 15:45.26[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_500.d3[0m



Epoch 6/100: 100%|██████████| 100/100 [00:00<00:00, 432.74it/s, loss=2.2, td_loss=0.00218, conservative_loss=1.1]

[2m2025-07-28 15:45.26[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=6 step=600[0m [36mepoch[0m=[35m6[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004746437072753906, 'time_algorithm_update': 0.0017737793922424317, 'loss': 2.198714690208435, 'td_loss': 0.0020858740305993704, 'conservative_loss': 1.0983144092559813, 'time_step': 0.002298996448516846}[0m [36mstep[0m=[35m600[0m
[2m2025-07-28 15:45.26[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_600.d3[0m



Epoch 7/100: 100%|██████████| 100/100 [00:00<00:00, 428.39it/s, loss=2.2, td_loss=0.00353, conservative_loss=1.1]

[2m2025-07-28 15:45.26[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=7 step=700[0m [36mepoch[0m=[35m7[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00046657800674438477, 'time_algorithm_update': 0.001801440715789795, 'loss': 2.2006174540519714, 'td_loss': 0.003812542131054215, 'conservative_loss': 1.098402464389801, 'time_step': 0.0023208904266357423}[0m [36mstep[0m=[35m700[0m
[2m2025-07-28 15:45.26[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_700.d3[0m



Epoch 8/100: 100%|██████████| 100/100 [00:00<00:00, 397.56it/s, loss=2.2, td_loss=0.00462, conservative_loss=1.1]

[2m2025-07-28 15:45.27[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=8 step=800[0m [36mepoch[0m=[35m8[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004999542236328125, 'time_algorithm_update': 0.0019462394714355468, 'loss': 2.201106686592102, 'td_loss': 0.004484999722335487, 'conservative_loss': 1.098310842514038, 'time_step': 0.0025028419494628907}[0m [36mstep[0m=[35m800[0m





[2m2025-07-28 15:45.27[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_800.d3[0m


Epoch 9/100: 100%|██████████| 100/100 [00:00<00:00, 399.19it/s, loss=2.2, td_loss=0.00228, conservative_loss=1.1]

[2m2025-07-28 15:45.27[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=9 step=900[0m [36mepoch[0m=[35m9[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004914283752441406, 'time_algorithm_update': 0.0019498515129089357, 'loss': 2.1986337685585022, 'td_loss': 0.00250896273937542, 'conservative_loss': 1.0980623996257781, 'time_step': 0.002493433952331543}[0m [36mstep[0m=[35m900[0m





[2m2025-07-28 15:45.27[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_900.d3[0m


Epoch 10/100: 100%|██████████| 100/100 [00:00<00:00, 374.70it/s, loss=2.2, td_loss=0.0028, conservative_loss=1.1]

[2m2025-07-28 15:45.27[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=10 step=1000[0m [36mepoch[0m=[35m10[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005762290954589844, 'time_algorithm_update': 0.0020333194732666017, 'loss': 2.1994874477386475, 'td_loss': 0.002849989581736736, 'conservative_loss': 1.0983187305927276, 'time_step': 0.0026563167572021484}[0m [36mstep[0m=[35m1000[0m
[2m2025-07-28 15:45.27[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1000.d3[0m



Epoch 11/100: 100%|██████████| 100/100 [00:00<00:00, 360.13it/s, loss=2.3, td_loss=0.102, conservative_loss=1.1]

[2m2025-07-28 15:45.27[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=11 step=1100[0m [36mepoch[0m=[35m11[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0006136512756347657, 'time_algorithm_update': 0.0020970726013183594, 'loss': 2.28967072725296, 'td_loss': 0.0928150053115678, 'conservative_loss': 1.0984278655052184, 'time_step': 0.0027637600898742674}[0m [36mstep[0m=[35m1100[0m
[2m2025-07-28 15:45.27[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1100.d3[0m



Epoch 12/100: 100%|██████████| 100/100 [00:00<00:00, 354.22it/s, loss=2.2, td_loss=0.000653, conservative_loss=1.1]

[2m2025-07-28 15:45.28[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=12 step=1200[0m [36mepoch[0m=[35m12[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0006206274032592773, 'time_algorithm_update': 0.0021375727653503416, 'loss': 2.1974543070793153, 'td_loss': 0.0006476261564239394, 'conservative_loss': 1.0984033381938934, 'time_step': 0.0028111600875854494}[0m [36mstep[0m=[35m1200[0m
[2m2025-07-28 15:45.28[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1200.d3[0m



Epoch 13/100: 100%|██████████| 100/100 [00:00<00:00, 359.39it/s, loss=2.2, td_loss=0.000719, conservative_loss=1.1]

[2m2025-07-28 15:45.28[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=13 step=1300[0m [36mepoch[0m=[35m13[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0006099987030029297, 'time_algorithm_update': 0.002111635208129883, 'loss': 2.197365963459015, 'td_loss': 0.0008116933463315945, 'conservative_loss': 1.0982771360874175, 'time_step': 0.0027706408500671386}[0m [36mstep[0m=[35m1300[0m
[2m2025-07-28 15:45.28[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1300.d3[0m



Epoch 14/100: 100%|██████████| 100/100 [00:00<00:00, 363.63it/s, loss=2.2, td_loss=0.00086, conservative_loss=1.1]

[2m2025-07-28 15:45.28[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=14 step=1400[0m [36mepoch[0m=[35m14[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0006060028076171876, 'time_algorithm_update': 0.0020788097381591797, 'loss': 2.1977316999435423, 'td_loss': 0.000837066586536821, 'conservative_loss': 1.0984473168849944, 'time_step': 0.002737417221069336}[0m [36mstep[0m=[35m1400[0m
[2m2025-07-28 15:45.28[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1400.d3[0m



Epoch 15/100: 100%|██████████| 100/100 [00:00<00:00, 352.61it/s, loss=2.2, td_loss=0.000487, conservative_loss=1.1]

[2m2025-07-28 15:45.28[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=15 step=1500[0m [36mepoch[0m=[35m15[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0006205201148986817, 'time_algorithm_update': 0.0021503591537475586, 'loss': 2.197506539821625, 'td_loss': 0.000559124128849362, 'conservative_loss': 1.0984737122058867, 'time_step': 0.002822751998901367}[0m [36mstep[0m=[35m1500[0m





[2m2025-07-28 15:45.28[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1500.d3[0m


Epoch 16/100: 100%|██████████| 100/100 [00:00<00:00, 343.59it/s, loss=2.2, td_loss=0.000888, conservative_loss=1.1]

[2m2025-07-28 15:45.29[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=16 step=1600[0m [36mepoch[0m=[35m16[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0006248164176940918, 'time_algorithm_update': 0.002216043472290039, 'loss': 2.197997591495514, 'td_loss': 0.00094447792347637, 'conservative_loss': 1.0985265529155732, 'time_step': 0.0028961730003356934}[0m [36mstep[0m=[35m1600[0m
[2m2025-07-28 15:45.29[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1600.d3[0m



Epoch 17/100: 100%|██████████| 100/100 [00:00<00:00, 344.67it/s, loss=2.2, td_loss=0.000705, conservative_loss=1.1]

[2m2025-07-28 15:45.29[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=17 step=1700[0m [36mepoch[0m=[35m17[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0006282734870910645, 'time_algorithm_update': 0.0021921420097351075, 'loss': 2.1970662832260133, 'td_loss': 0.0007343449354812037, 'conservative_loss': 1.0981659662723542, 'time_step': 0.002886338233947754}[0m [36mstep[0m=[35m1700[0m
[2m2025-07-28 15:45.29[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1700.d3[0m



Epoch 18/100: 100%|██████████| 100/100 [00:00<00:00, 367.11it/s, loss=2.2, td_loss=0.000985, conservative_loss=1.1]

[2m2025-07-28 15:45.29[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=18 step=1800[0m [36mepoch[0m=[35m18[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005885624885559082, 'time_algorithm_update': 0.0020654106140136717, 'loss': 2.194919857978821, 'td_loss': 0.0009621448916732334, 'conservative_loss': 1.0969788503646851, 'time_step': 0.0027103424072265625}[0m [36mstep[0m=[35m1800[0m
[2m2025-07-28 15:45.29[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1800.d3[0m



Epoch 19/100: 100%|██████████| 100/100 [00:00<00:00, 391.83it/s, loss=2.2, td_loss=0.000917, conservative_loss=1.1]

[2m2025-07-28 15:45.30[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=19 step=1900[0m [36mepoch[0m=[35m19[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005143189430236816, 'time_algorithm_update': 0.001968593597412109, 'loss': 2.1962599110603334, 'td_loss': 0.0009440037869353546, 'conservative_loss': 1.0976579535007476, 'time_step': 0.002539358139038086}[0m [36mstep[0m=[35m1900[0m
[2m2025-07-28 15:45.30[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_1900.d3[0m



Epoch 20/100: 100%|██████████| 100/100 [00:00<00:00, 413.44it/s, loss=2.2, td_loss=0.000863, conservative_loss=1.1]

[2m2025-07-28 15:45.30[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=20 step=2000[0m [36mepoch[0m=[35m20[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004642653465270996, 'time_algorithm_update': 0.0018935203552246094, 'loss': 2.1974512267112734, 'td_loss': 0.0008664641072391533, 'conservative_loss': 1.0982923746109008, 'time_step': 0.002407989501953125}[0m [36mstep[0m=[35m2000[0m
[2m2025-07-28 15:45.30[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2000.d3[0m



Epoch 21/100: 100%|██████████| 100/100 [00:00<00:00, 388.95it/s, loss=2.27, td_loss=0.0712, conservative_loss=1.1]

2025-07-28 15:45.30 [info     ] DiscreteCQL_20250728154525: epoch=21 step=2100 epoch=21 metrics={'time_sample_batch': 0.0005271315574645996, 'time_algorithm_update': 0.0019734764099121095, 'loss': 2.261094753742218, 'td_loss': 0.0650152921580593, 'conservative_loss': 1.0980397248268128, 'time_step': 0.002556498050689697} step=2100
2025-07-28 15:45.30 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2100.d3



Epoch 22/100: 100%|██████████| 100/100 [00:00<00:00, 385.53it/s, loss=2.2, td_loss=0.00115, conservative_loss=1.1]

2025-07-28 15:45.30 [info     ] DiscreteCQL_20250728154525: epoch=22 step=2200 epoch=22 metrics={'time_sample_batch': 0.000541989803314209, 'time_algorithm_update': 0.001987617015838623, 'loss': 2.197791721820831, 'td_loss': 0.0011231240205233917, 'conservative_loss': 1.098334299325943, 'time_step': 0.002582676410675049} step=2200
2025-07-28 15:45.30 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2200.d3



Epoch 23/100: 100%|██████████| 100/100 [00:00<00:00, 413.56it/s, loss=2.19, td_loss=0.00109, conservative_loss=1.1]

2025-07-28 15:45.31 [info     ] DiscreteCQL_20250728154525: epoch=23 step=2300 epoch=23 metrics={'time_sample_batch': 0.0004741644859313965, 'time_algorithm_update': 0.001876215934753418, 'loss': 2.194902892112732, 'td_loss': 0.0010723149092518724, 'conservative_loss': 1.0969152915477753, 'time_step': 0.0024067497253417967} step=2300
2025-07-28 15:45.31 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2300.d3



Epoch 24/100: 100%|██████████| 100/100 [00:00<00:00, 424.30it/s, loss=2.2, td_loss=0.000945, conservative_loss=1.1]

2025-07-28 15:45.31 [info     ] DiscreteCQL_20250728154525: epoch=24 step=2400 epoch=24 metrics={'time_sample_batch': 0.0004554891586303711, 'time_algorithm_update': 0.0018364500999450684, 'loss': 2.1958152437210083, 'td_loss': 0.0009611728510935791, 'conservative_loss': 1.0974270331859588, 'time_step': 0.0023450636863708496} step=2400
2025-07-28 15:45.31 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2400.d3



Epoch 25/100: 100%|██████████| 100/100 [00:00<00:00, 428.66it/s, loss=2.2, td_loss=0.000974, conservative_loss=1.1]

2025-07-28 15:45.31 [info     ] DiscreteCQL_20250728154525: epoch=25 step=2500 epoch=25 metrics={'time_sample_batch': 0.00045788049697875977, 'time_algorithm_update': 0.0018152213096618653, 'loss': 2.1973110389709474, 'td_loss': 0.0009258302650414407, 'conservative_loss': 1.0981926012039185, 'time_step': 0.002320990562438965} step=2500
2025-07-28 15:45.31 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2500.d3



Epoch 26/100: 100%|██████████| 100/100 [00:00<00:00, 417.96it/s, loss=2.2, td_loss=0.000771, conservative_loss=1.1]

2025-07-28 15:45.31 [info     ] DiscreteCQL_20250728154525: epoch=26 step=2600 epoch=26 metrics={'time_sample_batch': 0.0004761838912963867, 'time_algorithm_update': 0.0018532466888427734, 'loss': 2.196728525161743, 'td_loss': 0.0007932424181490205, 'conservative_loss': 1.0979676389694213, 'time_step': 0.0023805713653564453} step=2600
2025-07-28 15:45.31 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2600.d3



Epoch 27/100: 100%|██████████| 100/100 [00:00<00:00, 384.42it/s, loss=2.2, td_loss=0.00115, conservative_loss=1.1]

2025-07-28 15:45.32 [info     ] DiscreteCQL_20250728154525: epoch=27 step=2700 epoch=27 metrics={'time_sample_batch': 0.0005347943305969239, 'time_algorithm_update': 0.001997036933898926, 'loss': 2.197197813987732, 'td_loss': 0.0011636634092428721, 'conservative_loss': 1.0980170750617981, 'time_step': 0.0025886750221252443} step=2700
2025-07-28 15:45.32 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2700.d3


Epoch 28/100: 100%|██████████| 100/100 [00:00<00:00, 362.13it/s, loss=2.2, td_loss=0.000858, conservative_loss=1.1]

2025-07-28 15:45.32 [info     ] DiscreteCQL_20250728154525: epoch=28 step=2800 epoch=28 metrics={'time_sample_batch': 0.0005988383293151856, 'time_algorithm_update': 0.002096092700958252, 'loss': 2.1970012974739075, 'td_loss': 0.0008431548319640569, 'conservative_loss': 1.0980790734291077, 'time_step': 0.002748849391937256} step=2800
2025-07-28 15:45.32 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2800.d3



Epoch 29/100: 100%|██████████| 100/100 [00:00<00:00, 410.93it/s, loss=2.2, td_loss=0.00127, conservative_loss=1.1]

2025-07-28 15:45.32 [info     ] DiscreteCQL_20250728154525: epoch=29 step=2900 epoch=29 metrics={'time_sample_batch': 0.0004908180236816406, 'time_algorithm_update': 0.0018802642822265624, 'loss': 2.1956692242622378, 'td_loss': 0.0013493687222944572, 'conservative_loss': 1.0971599233150482, 'time_step': 0.0024227428436279295} step=2900
2025-07-28 15:45.32 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_2900.d3



Epoch 30/100: 100%|██████████| 100/100 [00:00<00:00, 374.62it/s, loss=2.2, td_loss=0.00183, conservative_loss=1.1]

2025-07-28 15:45.32 [info     ] DiscreteCQL_20250728154525: epoch=30 step=3000 epoch=30 metrics={'time_sample_batch': 0.0005670523643493652, 'time_algorithm_update': 0.0020356225967407228, 'loss': 2.196509358882904, 'td_loss': 0.001714693953981623, 'conservative_loss': 1.097397335767746, 'time_step': 0.0026561284065246583} step=3000
2025-07-28 15:45.32 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3000.d3



Epoch 31/100: 100%|██████████| 100/100 [00:00<00:00, 418.41it/s, loss=2.26, td_loss=0.0625, conservative_loss=1.1]

2025-07-28 15:45.33 [info     ] DiscreteCQL_20250728154525: epoch=31 step=3100 epoch=31 metrics={'time_sample_batch': 0.00046095371246337893, 'time_algorithm_update': 0.0018667006492614747, 'loss': 2.2531930565834046, 'td_loss': 0.05699506776465569, 'conservative_loss': 1.0980989944934845, 'time_step': 0.0023780179023742674} step=3100
2025-07-28 15:45.33 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3100.d3



Epoch 32/100: 100%|██████████| 100/100 [00:00<00:00, 407.62it/s, loss=2.2, td_loss=0.00104, conservative_loss=1.1]

2025-07-28 15:45.33 [info     ] DiscreteCQL_20250728154525: epoch=32 step=3200 epoch=32 metrics={'time_sample_batch': 0.0004948973655700684, 'time_algorithm_update': 0.0018903970718383788, 'loss': 2.196646854877472, 'td_loss': 0.0009938515130488669, 'conservative_loss': 1.0978265023231506, 'time_step': 0.0024421000480651855} step=3200
2025-07-28 15:45.33 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3200.d3



Epoch 33/100: 100%|██████████| 100/100 [00:00<00:00, 430.93it/s, loss=2.2, td_loss=0.000858, conservative_loss=1.1]

2025-07-28 15:45.33 [info     ] DiscreteCQL_20250728154525: epoch=33 step=3300 epoch=33 metrics={'time_sample_batch': 0.0004577851295471191, 'time_algorithm_update': 0.0018019366264343261, 'loss': 2.1971650981903075, 'td_loss': 0.0008624040643917397, 'conservative_loss': 1.098151340484619, 'time_step': 0.002308976650238037} step=3300
2025-07-28 15:45.33 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3300.d3



Epoch 34/100: 100%|██████████| 100/100 [00:00<00:00, 412.23it/s, loss=2.2, td_loss=0.000943, conservative_loss=1.1]

2025-07-28 15:45.33 [info     ] DiscreteCQL_20250728154525: epoch=34 step=3400 epoch=34 metrics={'time_sample_batch': 0.00048842191696167, 'time_algorithm_update': 0.0018771505355834961, 'loss': 2.19635671377182, 'td_loss': 0.0009327739142463542, 'conservative_loss': 1.0977119731903076, 'time_step': 0.0024143147468566896} step=3400
2025-07-28 15:45.33 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3400.d3



Epoch 35/100: 100%|██████████| 100/100 [00:00<00:00, 416.87it/s, loss=2.2, td_loss=0.00114, conservative_loss=1.1]

2025-07-28 15:45.34 [info     ] DiscreteCQL_20250728154525: epoch=35 step=3500 epoch=35 metrics={'time_sample_batch': 0.0004731059074401855, 'time_algorithm_update': 0.001864159107208252, 'loss': 2.1976770973205566, 'td_loss': 0.0011044965320616028, 'conservative_loss': 1.0982862997055054, 'time_step': 0.0023872184753417967} step=3500
2025-07-28 15:45.34 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3500.d3



Epoch 36/100: 100%|██████████| 100/100 [00:00<00:00, 370.42it/s, loss=2.2, td_loss=0.000734, conservative_loss=1.1]

2025-07-28 15:45.34 [info     ] DiscreteCQL_20250728154525: epoch=36 step=3600 epoch=36 metrics={'time_sample_batch': 0.0005660319328308106, 'time_algorithm_update': 0.0020635390281677247, 'loss': 2.1967822265625, 'td_loss': 0.0007547940395306795, 'conservative_loss': 1.0980137073993683, 'time_step': 0.002687985897064209} step=3600
2025-07-28 15:45.34 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3600.d3



Epoch 37/100: 100%|██████████| 100/100 [00:00<00:00, 403.28it/s, loss=2.19, td_loss=0.00114, conservative_loss=1.1]

2025-07-28 15:45.34 [info     ] DiscreteCQL_20250728154525: epoch=37 step=3700 epoch=37 metrics={'time_sample_batch': 0.0005019855499267578, 'time_algorithm_update': 0.0019105076789855957, 'loss': 2.194888060092926, 'td_loss': 0.0011231697170296683, 'conservative_loss': 1.0968824434280395, 'time_step': 0.0024664974212646484} step=3700
2025-07-28 15:45.34 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3700.d3


Epoch 38/100: 100%|██████████| 100/100 [00:00<00:00, 390.65it/s, loss=2.2, td_loss=0.00107, conservative_loss=1.1]

2025-07-28 15:45.34 [info     ] DiscreteCQL_20250728154525: epoch=38 step=3800 epoch=38 metrics={'time_sample_batch': 0.0005265402793884277, 'time_algorithm_update': 0.0019679880142211914, 'loss': 2.1961218523979187, 'td_loss': 0.001076404427876696, 'conservative_loss': 1.0975227224826813, 'time_step': 0.0025478482246398928} step=3800
2025-07-28 15:45.34 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3800.d3



Epoch 39/100: 100%|██████████| 100/100 [00:00<00:00, 410.59it/s, loss=2.2, td_loss=0.00105, conservative_loss=1.1]

2025-07-28 15:45.35 [info     ] DiscreteCQL_20250728154525: epoch=39 step=3900 epoch=39 metrics={'time_sample_batch': 0.0004711127281188965, 'time_algorithm_update': 0.0019014263153076172, 'loss': 2.1965386128425597, 'td_loss': 0.0010531386702496092, 'conservative_loss': 1.0977427423000337, 'time_step': 0.0024236011505126952} step=3900
2025-07-28 15:45.35 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_3900.d3



Epoch 40/100: 100%|██████████| 100/100 [00:00<00:00, 418.40it/s, loss=2.2, td_loss=0.000764, conservative_loss=1.1]

2025-07-28 15:45.35 [info     ] DiscreteCQL_20250728154525: epoch=40 step=4000 epoch=40 metrics={'time_sample_batch': 0.0004645562171936035, 'time_algorithm_update': 0.0018659543991088867, 'loss': 2.1961905717849732, 'td_loss': 0.0007454320683609694, 'conservative_loss': 1.09772256731987, 'time_step': 0.002378692626953125} step=4000
2025-07-28 15:45.35 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4000.d3



Epoch 41/100: 100%|██████████| 100/100 [00:00<00:00, 426.11it/s, loss=2.25, td_loss=0.054, conservative_loss=1.1]

2025-07-28 15:45.35 [info     ] DiscreteCQL_20250728154525: epoch=41 step=4100 epoch=41 metrics={'time_sample_batch': 0.00045660495758056643, 'time_algorithm_update': 0.0018251967430114747, 'loss': 2.2444769239425657, 'td_loss': 0.04920598843396874, 'conservative_loss': 1.0976354658603669, 'time_step': 0.002335050106048584} step=4100
2025-07-28 15:45.35 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4100.d3



Epoch 42/100: 100%|██████████| 100/100 [00:00<00:00, 417.92it/s, loss=2.2, td_loss=0.00118, conservative_loss=1.1]

2025-07-28 15:45.35 [info     ] DiscreteCQL_20250728154525: epoch=42 step=4200 epoch=42 metrics={'time_sample_batch': 0.0004755043983459473, 'time_algorithm_update': 0.0018530845642089843, 'loss': 2.197680506706238, 'td_loss': 0.0011245051398873329, 'conservative_loss': 1.0982780075073242, 'time_step': 0.002380106449127197} step=4200
2025-07-28 15:45.35 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4200.d3



Epoch 43/100: 100%|██████████| 100/100 [00:00<00:00, 422.72it/s, loss=2.2, td_loss=0.000876, conservative_loss=1.1]

2025-07-28 15:45.36 [info     ] DiscreteCQL_20250728154525: epoch=43 step=4300 epoch=43 metrics={'time_sample_batch': 0.00046542406082153323, 'time_algorithm_update': 0.0018411803245544434, 'loss': 2.19569308757782, 'td_loss': 0.0008939958154223859, 'conservative_loss': 1.0973995530605316, 'time_step': 0.00235353946685791} step=4300
2025-07-28 15:45.36 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4300.d3



Epoch 44/100: 100%|██████████| 100/100 [00:00<00:00, 424.83it/s, loss=2.2, td_loss=0.000986, conservative_loss=1.1]

2025-07-28 15:45.36 [info     ] DiscreteCQL_20250728154525: epoch=44 step=4400 epoch=44 metrics={'time_sample_batch': 0.0004541301727294922, 'time_algorithm_update': 0.0018338370323181153, 'loss': 2.197810397148132, 'td_loss': 0.0010860397884971463, 'conservative_loss': 1.0983621823787688, 'time_step': 0.0023408055305480955} step=4400
2025-07-28 15:45.36 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4400.d3



Epoch 45/100: 100%|██████████| 100/100 [00:00<00:00, 427.27it/s, loss=2.2, td_loss=0.00112, conservative_loss=1.1]

2025-07-28 15:45.36 [info     ] DiscreteCQL_20250728154525: epoch=45 step=4500 epoch=45 metrics={'time_sample_batch': 0.00045701980590820315, 'time_algorithm_update': 0.001823270320892334, 'loss': 2.1963061475753785, 'td_loss': 0.0011160826054401696, 'conservative_loss': 1.0975950312614442, 'time_step': 0.002328629493713379} step=4500
2025-07-28 15:45.36 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4500.d3



Epoch 46/100: 100%|██████████| 100/100 [00:00<00:00, 423.77it/s, loss=2.2, td_loss=0.000778, conservative_loss=1.1]

2025-07-28 15:45.36 [info     ] DiscreteCQL_20250728154525: epoch=46 step=4600 epoch=46 metrics={'time_sample_batch': 0.0004597926139831543, 'time_algorithm_update': 0.0018334841728210448, 'loss': 2.1983505201339724, 'td_loss': 0.0007569296075962484, 'conservative_loss': 1.0987967932224274, 'time_step': 0.002347161769866943} step=4600
2025-07-28 15:45.36 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4600.d3



Epoch 47/100: 100%|██████████| 100/100 [00:00<00:00, 422.02it/s, loss=2.2, td_loss=0.000766, conservative_loss=1.1]

2025-07-28 15:45.37 [info     ] DiscreteCQL_20250728154525: epoch=47 step=4700 epoch=47 metrics={'time_sample_batch': 0.00045778989791870115, 'time_algorithm_update': 0.0018497180938720703, 'loss': 2.1964164352416993, 'td_loss': 0.0007753977939137257, 'conservative_loss': 1.097820520401001, 'time_step': 0.0023577427864074707} step=4700
2025-07-28 15:45.37 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4700.d3



Epoch 48/100: 100%|██████████| 100/100 [00:00<00:00, 393.45it/s, loss=2.2, td_loss=0.00124, conservative_loss=1.1]

2025-07-28 15:45.37 [info     ] DiscreteCQL_20250728154525: epoch=48 step=4800 epoch=48 metrics={'time_sample_batch': 0.0005004525184631348, 'time_algorithm_update': 0.001968719959259033, 'loss': 2.1953913831710814, 'td_loss': 0.0011732379256864079, 'conservative_loss': 1.097109078168869, 'time_step': 0.002527334690093994} step=4800
2025-07-28 15:45.37 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4800.d3


Epoch 49/100: 100%|██████████| 100/100 [00:00<00:00, 385.98it/s, loss=2.19, td_loss=0.00118, conservative_loss=1.1]

2025-07-28 15:45.37 [info     ] DiscreteCQL_20250728154525: epoch=49 step=4900 epoch=49 metrics={'time_sample_batch': 0.0005402588844299317, 'time_algorithm_update': 0.0019860291481018067, 'loss': 2.195286548137665, 'td_loss': 0.0011594361462630332, 'conservative_loss': 1.0970635497570038, 'time_step': 0.002579333782196045} step=4900
2025-07-28 15:45.37 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_4900.d3



Epoch 50/100: 100%|██████████| 100/100 [00:00<00:00, 407.55it/s, loss=2.19, td_loss=0.000825, conservative_loss=1.1]

2025-07-28 15:45.37 [info     ] DiscreteCQL_20250728154525: epoch=50 step=5000 epoch=50 metrics={'time_sample_batch': 0.0005037856101989747, 'time_algorithm_update': 0.0018882203102111817, 'loss': 2.194396321773529, 'td_loss': 0.0008400411676848307, 'conservative_loss': 1.0967781388759612, 'time_step': 0.0024425601959228516} step=5000
2025-07-28 15:45.37 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5000.d3



Epoch 51/100: 100%|██████████| 100/100 [00:00<00:00, 363.68it/s, loss=2.24, td_loss=0.0451, conservative_loss=1.1]

2025-07-28 15:45.38 [info     ] DiscreteCQL_20250728154525: epoch=51 step=5100 epoch=51 metrics={'time_sample_batch': 0.0005719494819641113, 'time_algorithm_update': 0.0021027255058288573, 'loss': 2.2378094482421873, 'td_loss': 0.041239810792612845, 'conservative_loss': 1.0982848227024078, 'time_step': 0.002734043598175049} step=5100
2025-07-28 15:45.38 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5100.d3



Epoch 52/100: 100%|██████████| 100/100 [00:00<00:00, 358.04it/s, loss=2.2, td_loss=0.00234, conservative_loss=1.1]

2025-07-28 15:45.38 [info     ] DiscreteCQL_20250728154525: epoch=52 step=5200 epoch=52 metrics={'time_sample_batch': 0.0005881166458129882, 'time_algorithm_update': 0.002130899429321289, 'loss': 2.1982424426078797, 'td_loss': 0.002603997699916363, 'conservative_loss': 1.0978192222118377, 'time_step': 0.0027790045738220214} step=5200
2025-07-28 15:45.38 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5200.d3



Epoch 53/100: 100%|██████████| 100/100 [00:00<00:00, 387.23it/s, loss=2.2, td_loss=0.00196, conservative_loss=1.1]

2025-07-28 15:45.38 [info     ] DiscreteCQL_20250728154525: epoch=53 step=5300 epoch=53 metrics={'time_sample_batch': 0.0005287861824035644, 'time_algorithm_update': 0.001987776756286621, 'loss': 2.1960549640655516, 'td_loss': 0.001880011151661165, 'conservative_loss': 1.0970874762535094, 'time_step': 0.0025689053535461427} step=5300
2025-07-28 15:45.38 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5300.d3



Epoch 54/100: 100%|██████████| 100/100 [00:00<00:00, 297.48it/s, loss=2.2, td_loss=0.000848, conservative_loss=1.1]

2025-07-28 15:45.39 [info     ] DiscreteCQL_20250728154525: epoch=54 step=5400 epoch=54 metrics={'time_sample_batch': 0.0005863070487976074, 'time_algorithm_update': 0.0026813626289367678, 'loss': 2.1959192824363707, 'td_loss': 0.0009607326565310359, 'conservative_loss': 1.0974792742729187, 'time_step': 0.003337249755859375} step=5400
2025-07-28 15:45.39 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5400.d3



Epoch 55/100: 100%|██████████| 100/100 [00:00<00:00, 415.45it/s, loss=2.2, td_loss=0.00116, conservative_loss=1.1]

2025-07-28 15:45.39 [info     ] DiscreteCQL_20250728154525: epoch=55 step=5500 epoch=55 metrics={'time_sample_batch': 0.000460207462310791, 'time_algorithm_update': 0.0018834710121154786, 'loss': 2.196174066066742, 'td_loss': 0.0012187362852273508, 'conservative_loss': 1.0974776697158815, 'time_step': 0.0023953890800476074} step=5500
2025-07-28 15:45.39 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5500.d3



Epoch 56/100: 100%|██████████| 100/100 [00:00<00:00, 386.78it/s, loss=2.2, td_loss=0.0025, conservative_loss=1.1]

2025-07-28 15:45.39 [info     ] DiscreteCQL_20250728154525: epoch=56 step=5600 epoch=56 metrics={'time_sample_batch': 0.0005307507514953614, 'time_algorithm_update': 0.001985924243927002, 'loss': 2.197875699996948, 'td_loss': 0.0023507915827212854, 'conservative_loss': 1.0977624487876891, 'time_step': 0.00257326602935791} step=5600
2025-07-28 15:45.39 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5600.d3



Epoch 57/100: 100%|██████████| 100/100 [00:00<00:00, 386.88it/s, loss=2.2, td_loss=0.0011, conservative_loss=1.1]  

2025-07-28 15:45.39 [info     ] DiscreteCQL_20250728154525: epoch=57 step=5700 epoch=57 metrics={'time_sample_batch': 0.0005351758003234863, 'time_algorithm_update': 0.0019792938232421876, 'loss': 2.1951411747932434, 'td_loss': 0.0010955849025049247, 'conservative_loss': 1.0970227921009064, 'time_step': 0.0025716471672058107} step=5700
2025-07-28 15:45.39 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5700.d3


Epoch 58/100: 100%|██████████| 100/100 [00:00<00:00, 416.13it/s, loss=2.2, td_loss=0.00129, conservative_loss=1.1]

2025-07-28 15:45.40 [info     ] DiscreteCQL_20250728154525: epoch=58 step=5800 epoch=58 metrics={'time_sample_batch': 0.00047307252883911134, 'time_algorithm_update': 0.0018701672554016114, 'loss': 2.1969187307357787, 'td_loss': 0.0012697618824313395, 'conservative_loss': 1.097824476957321, 'time_step': 0.002392282485961914} step=5800
2025-07-28 15:45.40 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5800.d3



Epoch 59/100: 100%|██████████| 100/100 [00:00<00:00, 408.85it/s, loss=2.19, td_loss=0.00188, conservative_loss=1.1]

2025-07-28 15:45.40 [info     ] DiscreteCQL_20250728154525: epoch=59 step=5900 epoch=59 metrics={'time_sample_batch': 0.00047320127487182617, 'time_algorithm_update': 0.0019121074676513672, 'loss': 2.1937217545509338, 'td_loss': 0.0018052738095866517, 'conservative_loss': 1.0959582471847533, 'time_step': 0.0024355602264404296} step=5900
2025-07-28 15:45.40 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_5900.d3



Epoch 60/100: 100%|██████████| 100/100 [00:00<00:00, 418.71it/s, loss=2.2, td_loss=0.00119, conservative_loss=1.1]

2025-07-28 15:45.40 [info     ] DiscreteCQL_20250728154525: epoch=60 step=6000 epoch=60 metrics={'time_sample_batch': 0.0004694795608520508, 'time_algorithm_update': 0.001861875057220459, 'loss': 2.1961928606033325, 'td_loss': 0.0012504387358785608, 'conservative_loss': 1.0974712038040162, 'time_step': 0.002376384735107422} step=6000
2025-07-28 15:45.40 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6000.d3



Epoch 61/100: 100%|██████████| 100/100 [00:00<00:00, 410.58it/s, loss=2.24, td_loss=0.0434, conservative_loss=1.1]

2025-07-28 15:45.40 [info     ] DiscreteCQL_20250728154525: epoch=61 step=6100 epoch=61 metrics={'time_sample_batch': 0.00047635555267333985, 'time_algorithm_update': 0.0018982386589050293, 'loss': 2.236229724884033, 'td_loss': 0.03967914224718697, 'conservative_loss': 1.0982752883434295, 'time_step': 0.0024237799644470214} step=6100
2025-07-28 15:45.40 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6100.d3



Epoch 62/100: 100%|██████████| 100/100 [00:00<00:00, 404.97it/s, loss=2.2, td_loss=0.00187, conservative_loss=1.1]

2025-07-28 15:45.41 [info     ] DiscreteCQL_20250728154525: epoch=62 step=6200 epoch=62 metrics={'time_sample_batch': 0.0004819440841674805, 'time_algorithm_update': 0.0019266748428344728, 'loss': 2.197395284175873, 'td_loss': 0.0018316685914760455, 'conservative_loss': 1.0977818071842194, 'time_step': 0.0024573302268981936} step=6200
2025-07-28 15:45.41 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6200.d3



Epoch 63/100: 100%|██████████| 100/100 [00:00<00:00, 405.50it/s, loss=2.2, td_loss=0.00162, conservative_loss=1.1]

2025-07-28 15:45.41 [info     ] DiscreteCQL_20250728154525: epoch=63 step=6300 epoch=63 metrics={'time_sample_batch': 0.0004749917984008789, 'time_algorithm_update': 0.001930844783782959, 'loss': 2.1965748119354247, 'td_loss': 0.001626943639712408, 'conservative_loss': 1.0974739360809327, 'time_step': 0.002455434799194336} step=6300
2025-07-28 15:45.41 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6300.d3



Epoch 64/100: 100%|██████████| 100/100 [00:00<00:00, 410.23it/s, loss=2.2, td_loss=0.00117, conservative_loss=1.1]

2025-07-28 15:45.41 [info     ] DiscreteCQL_20250728154525: epoch=64 step=6400 epoch=64 metrics={'time_sample_batch': 0.0004700160026550293, 'time_algorithm_update': 0.0019088816642761231, 'loss': 2.197249164581299, 'td_loss': 0.0011479717545444146, 'conservative_loss': 1.0980505990982055, 'time_step': 0.0024258971214294435} step=6400
2025-07-28 15:45.41 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6400.d3



Epoch 65/100: 100%|██████████| 100/100 [00:00<00:00, 412.26it/s, loss=2.2, td_loss=0.0015, conservative_loss=1.1]

2025-07-28 15:45.41 [info     ] DiscreteCQL_20250728154525: epoch=65 step=6500 epoch=65 metrics={'time_sample_batch': 0.0004769134521484375, 'time_algorithm_update': 0.0018893003463745118, 'loss': 2.1957614421844482, 'td_loss': 0.0014366957996389828, 'conservative_loss': 1.0971623718738557, 'time_step': 0.0024144744873046877} step=6500
2025-07-28 15:45.41 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6500.d3



Epoch 66/100: 100%|██████████| 100/100 [00:00<00:00, 406.36it/s, loss=2.2, td_loss=0.00151, conservative_loss=1.1]

2025-07-28 15:45.42 [info     ] DiscreteCQL_20250728154525: epoch=66 step=6600 epoch=66 metrics={'time_sample_batch': 0.0004748296737670898, 'time_algorithm_update': 0.001923069953918457, 'loss': 2.1966270327568056, 'td_loss': 0.0014970313938101755, 'conservative_loss': 1.0975650000572204, 'time_step': 0.0024482226371765137} step=6600
2025-07-28 15:45.42 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6600.d3



Epoch 67/100: 100%|██████████| 100/100 [00:00<00:00, 324.99it/s, loss=2.2, td_loss=0.00169, conservative_loss=1.1]

2025-07-28 15:45.42 [info     ] DiscreteCQL_20250728154525: epoch=67 step=6700 epoch=67 metrics={'time_sample_batch': 0.0006723475456237793, 'time_algorithm_update': 0.0023181843757629395, 'loss': 2.1983171677589417, 'td_loss': 0.0017722686848719604, 'conservative_loss': 1.0982724475860595, 'time_step': 0.003056778907775879} step=6700
2025-07-28 15:45.42 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6700.d3



Epoch 68/100: 100%|██████████| 100/100 [00:00<00:00, 419.24it/s, loss=2.2, td_loss=0.00114, conservative_loss=1.1]

2025-07-28 15:45.42 [info     ] DiscreteCQL_20250728154525: epoch=68 step=6800 epoch=68 metrics={'time_sample_batch': 0.00045365333557128906, 'time_algorithm_update': 0.0018648409843444824, 'loss': 2.1956448483467104, 'td_loss': 0.0012274235920631327, 'conservative_loss': 1.0972087144851685, 'time_step': 0.002373466491699219} step=6800
2025-07-28 15:45.42 [info     ] Model parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6800.d3



Epoch 69/100: 100%|██████████| 100/100 [00:00<00:00, 413.47it/s, loss=2.2, td_loss=0.00108, conservative_loss=1.1]

[2m2025-07-28 15:45.42[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=69 step=6900[0m [36mepoch[0m=[35m69[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00047734737396240237, 'time_algorithm_update': 0.0018783402442932129, 'loss': 2.1954086685180663, 'td_loss': 0.0010653546734829434, 'conservative_loss': 1.0971716606616975, 'time_step': 0.0024070024490356447}[0m [36mstep[0m=[35m6900[0m
[2m2025-07-28 15:45.42[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_6900.d3[0m



Epoch 70/100: 100%|██████████| 100/100 [00:00<00:00, 421.44it/s, loss=2.2, td_loss=0.00129, conservative_loss=1.1]

[2m2025-07-28 15:45.43[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=70 step=7000[0m [36mepoch[0m=[35m70[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004603123664855957, 'time_algorithm_update': 0.0018489694595336915, 'loss': 2.1972744131088255, 'td_loss': 0.0013059828532277606, 'conservative_loss': 1.0979842114448548, 'time_step': 0.002361128330230713}[0m [36mstep[0m=[35m7000[0m
[2m2025-07-28 15:45.43[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7000.d3[0m



Epoch 71/100: 100%|██████████| 100/100 [00:00<00:00, 385.32it/s, loss=2.24, td_loss=0.0424, conservative_loss=1.1]

[2m2025-07-28 15:45.43[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=71 step=7100[0m [36mepoch[0m=[35m71[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005326128005981445, 'time_algorithm_update': 0.0019896578788757323, 'loss': 2.2326128482818604, 'td_loss': 0.03883167172083631, 'conservative_loss': 1.0968905925750732, 'time_step': 0.0025800132751464845}[0m [36mstep[0m=[35m7100[0m
[2m2025-07-28 15:45.43[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7100.d3[0m



Epoch 72/100: 100%|██████████| 100/100 [00:00<00:00, 404.99it/s, loss=2.2, td_loss=0.00172, conservative_loss=1.1]

[2m2025-07-28 15:45.43[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=72 step=7200[0m [36mepoch[0m=[35m72[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.000496530532836914, 'time_algorithm_update': 0.0019109487533569337, 'loss': 2.1950860810279846, 'td_loss': 0.001654102717875503, 'conservative_loss': 1.0967159831523896, 'time_step': 0.0024579763412475586}[0m [36mstep[0m=[35m7200[0m





[2m2025-07-28 15:45.43[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7200.d3[0m


Epoch 73/100: 100%|██████████| 100/100 [00:00<00:00, 404.06it/s, loss=2.2, td_loss=0.00152, conservative_loss=1.1]

[2m2025-07-28 15:45.43[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=73 step=7300[0m [36mepoch[0m=[35m73[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005002140998840332, 'time_algorithm_update': 0.0019114255905151368, 'loss': 2.1955630135536195, 'td_loss': 0.0014652931969612838, 'conservative_loss': 1.0970488595962524, 'time_step': 0.0024619674682617187}[0m [36mstep[0m=[35m7300[0m





[2m2025-07-28 15:45.43[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7300.d3[0m


Epoch 74/100: 100%|██████████| 100/100 [00:00<00:00, 384.02it/s, loss=2.2, td_loss=0.0012, conservative_loss=1.1]

[2m2025-07-28 15:45.44[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=74 step=7400[0m [36mepoch[0m=[35m74[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005357074737548828, 'time_algorithm_update': 0.0019951415061950683, 'loss': 2.1952951455116274, 'td_loss': 0.0011524688403005711, 'conservative_loss': 1.0970713353157044, 'time_step': 0.0025900816917419434}[0m [36mstep[0m=[35m7400[0m
[2m2025-07-28 15:45.44[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7400.d3[0m



Epoch 75/100: 100%|██████████| 100/100 [00:00<00:00, 359.23it/s, loss=2.2, td_loss=0.00105, conservative_loss=1.1]

[2m2025-07-28 15:45.44[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=75 step=7500[0m [36mepoch[0m=[35m75[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.000604555606842041, 'time_algorithm_update': 0.002111508846282959, 'loss': 2.196189987659454, 'td_loss': 0.0011116434112773278, 'conservative_loss': 1.0975391697883605, 'time_step': 0.002771108150482178}[0m [36mstep[0m=[35m7500[0m
[2m2025-07-28 15:45.44[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7500.d3[0m



Epoch 76/100: 100%|██████████| 100/100 [00:00<00:00, 406.81it/s, loss=2.19, td_loss=0.00148, conservative_loss=1.1]

[2m2025-07-28 15:45.44[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=76 step=7600[0m [36mepoch[0m=[35m76[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004975366592407227, 'time_algorithm_update': 0.0018982982635498047, 'loss': 2.1946453285217284, 'td_loss': 0.0015036492938816082, 'conservative_loss': 1.0965708434581756, 'time_step': 0.002446579933166504}[0m [36mstep[0m=[35m7600[0m
[2m2025-07-28 15:45.44[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7600.d3[0m



Epoch 77/100: 100%|██████████| 100/100 [00:00<00:00, 423.70it/s, loss=2.2, td_loss=0.00129, conservative_loss=1.1]

[2m2025-07-28 15:45.44[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=77 step=7700[0m [36mepoch[0m=[35m77[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00047012090682983397, 'time_algorithm_update': 0.0018267321586608886, 'loss': 2.196987895965576, 'td_loss': 0.0012641044548945502, 'conservative_loss': 1.0978618907928466, 'time_step': 0.002348353862762451}[0m [36mstep[0m=[35m7700[0m
[2m2025-07-28 15:45.44[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7700.d3[0m



Epoch 78/100: 100%|██████████| 100/100 [00:00<00:00, 371.20it/s, loss=2.2, td_loss=0.00112, conservative_loss=1.1]

[2m2025-07-28 15:45.45[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=78 step=7800[0m [36mepoch[0m=[35m78[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005690073966979981, 'time_algorithm_update': 0.002055826187133789, 'loss': 2.195310065746307, 'td_loss': 0.0011137244015117175, 'conservative_loss': 1.0970981705188751, 'time_step': 0.002682504653930664}[0m [36mstep[0m=[35m7800[0m
[2m2025-07-28 15:45.45[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7800.d3[0m



Epoch 79/100: 100%|██████████| 100/100 [00:00<00:00, 403.94it/s, loss=2.2, td_loss=0.00116, conservative_loss=1.1]

[2m2025-07-28 15:45.45[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=79 step=7900[0m [36mepoch[0m=[35m79[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005019116401672363, 'time_algorithm_update': 0.0019088006019592286, 'loss': 2.195496542453766, 'td_loss': 0.001138850417046342, 'conservative_loss': 1.0971788382530212, 'time_step': 0.0024642467498779295}[0m [36mstep[0m=[35m7900[0m
[2m2025-07-28 15:45.45[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_7900.d3[0m



Epoch 80/100: 100%|██████████| 100/100 [00:00<00:00, 369.64it/s, loss=2.2, td_loss=0.00111, conservative_loss=1.1] 

[2m2025-07-28 15:45.45[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=80 step=8000[0m [36mepoch[0m=[35m80[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005827975273132324, 'time_algorithm_update': 0.0020485162734985352, 'loss': 2.1947550106048586, 'td_loss': 0.001167198322364129, 'conservative_loss': 1.0967939066886903, 'time_step': 0.002692546844482422}[0m [36mstep[0m=[35m8000[0m
[2m2025-07-28 15:45.45[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8000.d3[0m



Epoch 81/100: 100%|██████████| 100/100 [00:00<00:00, 419.20it/s, loss=2.23, td_loss=0.0371, conservative_loss=1.1]

[2m2025-07-28 15:45.45[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=81 step=8100[0m [36mepoch[0m=[35m81[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00047295331954956056, 'time_algorithm_update': 0.0018500924110412599, 'loss': 2.227931935787201, 'td_loss': 0.033952856429386884, 'conservative_loss': 1.0969895398616791, 'time_step': 0.0023734259605407717}[0m [36mstep[0m=[35m8100[0m
[2m2025-07-28 15:45.45[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8100.d3[0m



Epoch 82/100: 100%|██████████| 100/100 [00:00<00:00, 421.24it/s, loss=2.2, td_loss=0.00183, conservative_loss=1.1]

[2m2025-07-28 15:45.46[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=82 step=8200[0m [36mepoch[0m=[35m82[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004636073112487793, 'time_algorithm_update': 0.0018477320671081543, 'loss': 2.19747983455658, 'td_loss': 0.0023927097214618697, 'conservative_loss': 1.0975435674190521, 'time_step': 0.002362477779388428}[0m [36mstep[0m=[35m8200[0m
[2m2025-07-28 15:45.46[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8200.d3[0m



Epoch 83/100: 100%|██████████| 100/100 [00:00<00:00, 386.21it/s, loss=2.2, td_loss=0.00325, conservative_loss=1.1]

[2m2025-07-28 15:45.46[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=83 step=8300[0m [36mepoch[0m=[35m83[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.000538039207458496, 'time_algorithm_update': 0.001985733509063721, 'loss': 2.2000172615051268, 'td_loss': 0.0034504746797028927, 'conservative_loss': 1.098283394575119, 'time_step': 0.0025764894485473633}[0m [36mstep[0m=[35m8300[0m
[2m2025-07-28 15:45.46[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8300.d3[0m



Epoch 84/100: 100%|██████████| 100/100 [00:00<00:00, 419.34it/s, loss=2.2, td_loss=0.00298, conservative_loss=1.1]

[2m2025-07-28 15:45.46[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=84 step=8400[0m [36mepoch[0m=[35m84[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004709506034851074, 'time_algorithm_update': 0.0018519878387451172, 'loss': 2.1969930720329285, 'td_loss': 0.0028046430594986303, 'conservative_loss': 1.097094213962555, 'time_step': 0.002372426986694336}[0m [36mstep[0m=[35m8400[0m
[2m2025-07-28 15:45.46[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8400.d3[0m



Epoch 85/100: 100%|██████████| 100/100 [00:00<00:00, 409.10it/s, loss=2.2, td_loss=0.00295, conservative_loss=1.1]

[2m2025-07-28 15:45.46[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=85 step=8500[0m [36mepoch[0m=[35m85[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004938220977783203, 'time_algorithm_update': 0.0018848967552185058, 'loss': 2.197575035095215, 'td_loss': 0.0028086735983379185, 'conservative_loss': 1.0973831844329833, 'time_step': 0.0024326634407043457}[0m [36mstep[0m=[35m8500[0m
[2m2025-07-28 15:45.46[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8500.d3[0m



Epoch 86/100: 100%|██████████| 100/100 [00:00<00:00, 413.77it/s, loss=2.2, td_loss=0.00271, conservative_loss=1.1]

[2m2025-07-28 15:45.47[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=86 step=8600[0m [36mepoch[0m=[35m86[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004716086387634277, 'time_algorithm_update': 0.0018826532363891602, 'loss': 2.197786946296692, 'td_loss': 0.0026098875643219797, 'conservative_loss': 1.0975885307788849, 'time_step': 0.002405276298522949}[0m [36mstep[0m=[35m8600[0m
[2m2025-07-28 15:45.47[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8600.d3[0m



Epoch 87/100: 100%|██████████| 100/100 [00:00<00:00, 385.00it/s, loss=2.2, td_loss=0.00274, conservative_loss=1.1]

[2m2025-07-28 15:45.47[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=87 step=8700[0m [36mepoch[0m=[35m87[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005232834815979004, 'time_algorithm_update': 0.002008681297302246, 'loss': 2.197643308639526, 'td_loss': 0.002755700311390683, 'conservative_loss': 1.0974437963962556, 'time_step': 0.002584958076477051}[0m [36mstep[0m=[35m8700[0m





[2m2025-07-28 15:45.47[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8700.d3[0m


Epoch 88/100: 100%|██████████| 100/100 [00:00<00:00, 405.34it/s, loss=2.2, td_loss=0.00213, conservative_loss=1.1]

[2m2025-07-28 15:45.47[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=88 step=8800[0m [36mepoch[0m=[35m88[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004944634437561035, 'time_algorithm_update': 0.001909637451171875, 'loss': 2.1980531549453737, 'td_loss': 0.0023269034427357838, 'conservative_loss': 1.0978631222248076, 'time_step': 0.002456667423248291}[0m [36mstep[0m=[35m8800[0m





[2m2025-07-28 15:45.47[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8800.d3[0m


Epoch 89/100: 100%|██████████| 100/100 [00:00<00:00, 398.64it/s, loss=2.2, td_loss=0.00367, conservative_loss=1.1]

[2m2025-07-28 15:45.47[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=89 step=8900[0m [36mepoch[0m=[35m89[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005027151107788086, 'time_algorithm_update': 0.0019388103485107422, 'loss': 2.198425705432892, 'td_loss': 0.0036549880728125573, 'conservative_loss': 1.0973853635787965, 'time_step': 0.002494988441467285}[0m [36mstep[0m=[35m8900[0m
[2m2025-07-28 15:45.47[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_8900.d3[0m



Epoch 90/100: 100%|██████████| 100/100 [00:00<00:00, 415.54it/s, loss=2.2, td_loss=0.00237, conservative_loss=1.1]

[2m2025-07-28 15:45.48[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=90 step=9000[0m [36mepoch[0m=[35m90[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0004624009132385254, 'time_algorithm_update': 0.001880943775177002, 'loss': 2.1986197805404664, 'td_loss': 0.0026663483370793985, 'conservative_loss': 1.0979767179489135, 'time_step': 0.002394850254058838}[0m [36mstep[0m=[35m9000[0m
[2m2025-07-28 15:45.48[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9000.d3[0m



Epoch 91/100: 100%|██████████| 100/100 [00:00<00:00, 391.35it/s, loss=2.23, td_loss=0.0362, conservative_loss=1.1]

[2m2025-07-28 15:45.48[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=91 step=9100[0m [36mepoch[0m=[35m91[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005315661430358887, 'time_algorithm_update': 0.001959683895111084, 'loss': 2.228097147941589, 'td_loss': 0.033212613491923546, 'conservative_loss': 1.0974422657489777, 'time_step': 0.002542593479156494}[0m [36mstep[0m=[35m9100[0m
[2m2025-07-28 15:45.48[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9100.d3[0m



Epoch 92/100: 100%|██████████| 100/100 [00:00<00:00, 427.37it/s, loss=2.2, td_loss=0.00235, conservative_loss=1.1]

[2m2025-07-28 15:45.48[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=92 step=9200[0m [36mepoch[0m=[35m92[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00045597076416015625, 'time_algorithm_update': 0.0018234753608703613, 'loss': 2.1968222665786743, 'td_loss': 0.0023351298249326647, 'conservative_loss': 1.0972435677051544, 'time_step': 0.0023288178443908692}[0m [36mstep[0m=[35m9200[0m
[2m2025-07-28 15:45.48[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9200.d3[0m



Epoch 93/100: 100%|██████████| 100/100 [00:00<00:00, 426.30it/s, loss=2.2, td_loss=0.00227, conservative_loss=1.1]

[2m2025-07-28 15:45.48[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=93 step=9300[0m [36mepoch[0m=[35m93[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00046092748641967773, 'time_algorithm_update': 0.0018229055404663085, 'loss': 2.1981692504882813, 'td_loss': 0.002199953671079129, 'conservative_loss': 1.0979846489429474, 'time_step': 0.0023334813117980957}[0m [36mstep[0m=[35m9300[0m
[2m2025-07-28 15:45.48[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9300.d3[0m



Epoch 94/100: 100%|██████████| 100/100 [00:00<00:00, 384.56it/s, loss=2.2, td_loss=0.00219, conservative_loss=1.1]

[2m2025-07-28 15:45.49[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=94 step=9400[0m [36mepoch[0m=[35m94[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005285000801086426, 'time_algorithm_update': 0.002005882263183594, 'loss': 2.197166223526001, 'td_loss': 0.0024109763628803195, 'conservative_loss': 1.097377623319626, 'time_step': 0.0025871992111206055}[0m [36mstep[0m=[35m9400[0m
[2m2025-07-28 15:45.49[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9400.d3[0m



Epoch 95/100: 100%|██████████| 100/100 [00:00<00:00, 398.11it/s, loss=2.2, td_loss=0.00302, conservative_loss=1.1]

[2m2025-07-28 15:45.49[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=95 step=9500[0m [36mepoch[0m=[35m95[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005094480514526368, 'time_algorithm_update': 0.0019360923767089845, 'loss': 2.1988195276260374, 'td_loss': 0.0029319437482627107, 'conservative_loss': 1.0979437935352325, 'time_step': 0.0025005745887756348}[0m [36mstep[0m=[35m9500[0m





[2m2025-07-28 15:45.49[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9500.d3[0m


Epoch 96/100: 100%|██████████| 100/100 [00:00<00:00, 404.66it/s, loss=2.2, td_loss=0.00149, conservative_loss=1.1]

[2m2025-07-28 15:45.49[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=96 step=9600[0m [36mepoch[0m=[35m96[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.0005018281936645508, 'time_algorithm_update': 0.0019066476821899415, 'loss': 2.1952530360221862, 'td_loss': 0.0014993083794252015, 'conservative_loss': 1.0968768572807313, 'time_step': 0.002458498477935791}[0m [36mstep[0m=[35m9600[0m





[2m2025-07-28 15:45.49[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9600.d3[0m


Epoch 97/100: 100%|██████████| 100/100 [00:00<00:00, 414.73it/s, loss=2.19, td_loss=0.00232, conservative_loss=1.1]

[2m2025-07-28 15:45.49[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=97 step=9700[0m [36mepoch[0m=[35m97[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00048806190490722655, 'time_algorithm_update': 0.0018617892265319824, 'loss': 2.1928140926361084, 'td_loss': 0.002345390365226194, 'conservative_loss': 1.095234352350235, 'time_step': 0.0023996424674987793}[0m [36mstep[0m=[35m9700[0m
[2m2025-07-28 15:45.49[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9700.d3[0m



Epoch 98/100: 100%|██████████| 100/100 [00:00<00:00, 416.84it/s, loss=2.2, td_loss=0.002, conservative_loss=1.1] 

[2m2025-07-28 15:45.50[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=98 step=9800[0m [36mepoch[0m=[35m98[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00047081470489501953, 'time_algorithm_update': 0.0018648099899291992, 'loss': 2.19599750995636, 'td_loss': 0.001943185601849109, 'conservative_loss': 1.0970271587371827, 'time_step': 0.00238771915435791}[0m [36mstep[0m=[35m9800[0m
[2m2025-07-28 15:45.50[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9800.d3[0m



Epoch 99/100: 100%|██████████| 100/100 [00:00<00:00, 415.27it/s, loss=2.2, td_loss=0.00217, conservative_loss=1.1]

[2m2025-07-28 15:45.50[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=99 step=9900[0m [36mepoch[0m=[35m99[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00047873735427856446, 'time_algorithm_update': 0.001869077682495117, 'loss': 2.195319347381592, 'td_loss': 0.0021575475920690224, 'conservative_loss': 1.0965808951854705, 'time_step': 0.002396836280822754}[0m [36mstep[0m=[35m9900[0m
[2m2025-07-28 15:45.50[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_9900.d3[0m



Epoch 100/100: 100%|██████████| 100/100 [00:00<00:00, 321.70it/s, loss=2.2, td_loss=0.00222, conservative_loss=1.1]

[2m2025-07-28 15:45.50[0m [[32m[1minfo     [0m] [1mDiscreteCQL_20250728154525: epoch=100 step=10000[0m [36mepoch[0m=[35m100[0m [36mmetrics[0m=[35m{'time_sample_batch': 0.00057830810546875, 'time_algorithm_update': 0.002337493896484375, 'loss': 2.1974499917030332, 'td_loss': 0.002233420611009933, 'conservative_loss': 1.0976082861423493, 'time_step': 0.002981810569763184}[0m [36mstep[0m=[35m10000[0m





[2m2025-07-28 15:45.50[0m [[32m[1minfo     [0m] [1mModel parameters are saved to d3rlpy_logs/DiscreteCQL_20250728154525/model_10000.d3[0m


[(1,
  {'time_sample_batch': 0.0005567145347595215,
   'time_algorithm_update': 0.0020180344581604004,
   'loss': 2.3966327929496765,
   'td_loss': 0.19853577110916376,
   'conservative_loss': 1.09904851436615,
   'time_step': 0.00263277530670166}),
 (2,
  {'time_sample_batch': 0.0004550933837890625,
   'time_algorithm_update': 0.00175459623336792,
   'loss': 2.212840859889984,
   'td_loss': 0.01538949720794335,
   'conservative_loss': 1.0987256860733032,
   'time_step': 0.0022612261772155763}),
 (3,
  {'time_sample_batch': 0.00045412778854370117,
   'time_algorithm_update': 0.0017646265029907227,
   'loss': 2.200702414512634,
   'td_loss': 0.003590560882585123,
   'conservative_loss': 1.0985559260845184,
   'time_step': 0.002267448902130127}),
 (4,
  {'time_sample_batch': 0.0005457544326782226,
   'time_algorithm_update': 0.0019359111785888672,
   'loss': 2.1985737800598146,
   'td_loss': 0.001622819603071548,
   'conservative_loss': 1.0984754824638367,
   'time_step': 0.0025356578826
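
The list printed above holds one `(epoch, metrics)` pair per epoch. As a quick sanity check, the sketch below copies the first four entries and tests two things: whether `loss` equals `td_loss` plus a weighted `conservative_loss`, and how close `conservative_loss` is to ln 3. The weight of 2.0 is an assumption inferred from the logged numbers; the tutorial does not state it. The names `history`, `CONSERVATIVE_WEIGHT`, and `loss_residual` are made up for this example.

```python
import math

# First four (epoch, metrics) entries, copied from the training output above
# (timing fields omitted).
history = [
    (1, {'loss': 2.3966327929496765,
         'td_loss': 0.19853577110916376,
         'conservative_loss': 1.09904851436615}),
    (2, {'loss': 2.212840859889984,
         'td_loss': 0.01538949720794335,
         'conservative_loss': 1.0987256860733032}),
    (3, {'loss': 2.200702414512634,
         'td_loss': 0.003590560882585123,
         'conservative_loss': 1.0985559260845184}),
    (4, {'loss': 2.1985737800598146,
         'td_loss': 0.001622819603071548,
         'conservative_loss': 1.0984754824638367}),
]

# Assumption: weight on the conservative term, inferred from the numbers above.
CONSERVATIVE_WEIGHT = 2.0


def loss_residual(metrics, weight=CONSERVATIVE_WEIGHT):
    """Return loss - (td_loss + weight * conservative_loss)."""
    return metrics['loss'] - (
        metrics['td_loss'] + weight * metrics['conservative_loss']
    )


for epoch, metrics in history:
    gap_to_ln3 = metrics['conservative_loss'] - math.log(3)
    print(f'epoch {epoch}: residual={loss_residual(metrics):.2e}, '
          f'conservative_loss - ln(3)={gap_to_ln3:.2e}')
```

A conservative loss this close to ln 3 ≈ 1.0986 would be consistent with the Q-values of the three MountainCar actions being almost equal, which is worth keeping in mind when interpreting the learned policy.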

### Storing and Retrieving Policy Parameters as Artifacts
Now let's store the trained policy parameters as an artifact associated with our benchmark. This demonstrates how to link artifacts to specific benchmarks for better organization and retrieval.

In [16]:
# Save the model to a temporary file
with tempfile.NamedTemporaryFile(suffix='.pt', delete=False) as temp_file:
    temp_path = temp_file.name

# Save the model using d3rlpy's save_model method
algo.save_model(temp_path)

# Read the file content as bytes
with open(temp_path, 'rb') as f:
    policy_artifact = f.read()

# Clean up the temporary file
os.unlink(temp_path)

# Create metadata for the artifact, linking it to our benchmark
policy_metadata = ArtifactMetadata(
    name='trained_cql_policy', 
    description='Trained CQL policy parameters for MountainCar environment',
    benchmark_id=loaded_tupli_env.id  # Link artifact to the benchmark
)

# Store the artifact
stored_policy = storage.store_artifact(artifact=policy_artifact, metadata=policy_metadata)
print(f"Stored policy artifact with ID: {stored_policy.id}")
print(f"Artifact linked to benchmark: {stored_policy.benchmark_id}")

Stored policy artifact with ID: 48ecc6b86e9a4cdeafd000e293511a6b
Artifact linked to benchmark: 7b334ed6b7724bec8872418ad32667a4
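
The save, read, and delete steps in the cell above can be wrapped in a small helper. The sketch below is not part of PyTupli or d3rlpy: `bytes_from_saver` is a hypothetical name, `save_fn` stands for any callable that writes to a given path (for example `algo.save_model`), and `demo_saver` is a stand-in so that the example runs without a trained model.

```python
import os
import tempfile
from typing import Callable


def bytes_from_saver(save_fn: Callable[[str], None], suffix: str = '.pt') -> bytes:
    """Call save_fn(path) on a temporary file and return the file's bytes."""
    # Create the file and close the handle so save_fn can reopen it by name.
    fd, temp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        save_fn(temp_path)
        with open(temp_path, 'rb') as f:
            return f.read()
    finally:
        # Remove the file even if saving or reading raises.
        os.unlink(temp_path)


def demo_saver(path: str) -> None:
    """Stand-in for a model's save method (hypothetical)."""
    with open(path, 'wb') as f:
        f.write(b'fake-model-weights')


payload = bytes_from_saver(demo_saver)
print(f'Serialized {len(payload)} bytes')
```

With a trained agent, `policy_artifact = bytes_from_saver(algo.save_model)` would yield the same bytes as the manual steps, with the cleanup guaranteed by the `finally` block.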


Another collaborator could now retrieve this artifact by listing all artifacts associated with our benchmark and downloading it.

In [17]:
# Create a filter to find artifacts associated with our benchmark
benchmark_filter = FilterEQ(key='benchmark_id', value=loaded_tupli_env.id)

# List all artifacts associated with this benchmark
benchmark_artifacts = storage.list_artifacts(filter=benchmark_filter)

print(f"Found {len(benchmark_artifacts)} artifacts associated with benchmark {loaded_tupli_env.id}:")
for artifact in benchmark_artifacts:
    print(f"  - ID: {artifact.id}")
    print(f"    Name: {artifact.name}")
    print(f"    Description: {artifact.description}")
    print(f"    Benchmark ID: {artifact.benchmark_id}")
    print(f"    Created: {artifact.created_at}")
    print()

Found 0 artifacts associated with benchmark 4b076cdedc4d433ea4da6d2801c05ca0:
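
Conceptually, an equality filter keeps only the records whose field matches the given value. The sketch below mimics this on plain dictionaries. It does not use PyTupli, and `filter_eq`, the record layout, and the IDs `bench-a`, `bench-b`, and `bench-c` are all invented for illustration.

```python
from typing import Any, Dict, List


def filter_eq(records: List[Dict[str, Any]], key: str, value: Any) -> List[Dict[str, Any]]:
    """Return the records whose `key` field equals `value`."""
    return [record for record in records if record.get(key) == value]


# Hypothetical artifact metadata records.
artifacts = [
    {'name': 'trained_cql_policy', 'benchmark_id': 'bench-a'},
    {'name': 'wind_data', 'benchmark_id': 'bench-b'},
]

matches = filter_eq(artifacts, 'benchmark_id', 'bench-a')
no_matches = filter_eq(artifacts, 'benchmark_id', 'bench-c')

print([record['name'] for record in matches])
print(len(no_matches))
```

An empty result, like the one in the output above, is what such a filter returns when no stored record carries the queried ID, so it is worth comparing the ID in the filter with the `benchmark_id` printed when the artifact was stored.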


Finally, we demonstrate deserialization of the stored policy.

In [18]:
# Load the policy artifact
loaded_policy_artifact = storage.load_artifact(stored_policy.id)

# Write the bytes to a temporary file
with tempfile.NamedTemporaryFile(suffix='.pt', delete=False) as temp_file:
    temp_path = temp_file.name
    temp_file.write(loaded_policy_artifact)

# Create a new algorithm instance and load the model
loaded_algo = DiscreteCQLConfig().create(device='cpu')
loaded_algo.build_with_env(loaded_tupli_env)
loaded_algo.load_model(temp_path)

# Clean up the temporary file
os.unlink(temp_path)

print("Successfully loaded trained CQL policy!")


Successfully loaded trained CQL policy!
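
A context manager is one way to make sure the temporary file is removed once the model has been loaded. The sketch below uses only the standard library; `temp_file_with_bytes` is a hypothetical helper name, and the byte string stands in for a downloaded artifact.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temp_file_with_bytes(data: bytes, suffix: str = '.pt') -> Iterator[str]:
    """Write data to a temporary file, yield its path, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Write and close the file before handing out the path.
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        yield path
    finally:
        os.unlink(path)


# Stand-in for the bytes returned by the storage (hypothetical content).
with temp_file_with_bytes(b'fake-model-weights') as path:
    with open(path, 'rb') as f:
        restored = f.read()
    existed_inside = os.path.exists(path)

exists_after = os.path.exists(path)
print(existed_inside, exists_after)
```

In the cell above, the body of the `with` block would hold the `loaded_algo.load_model(path)` call.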


### Testing the Trained Policy
Now, let us test the trained (and loaded) policy:

In [19]:
# Activate rendering
loaded_tupli_env.unwrapped.render_mode = 'human'
# Deactivate recording of episodes
loaded_tupli_env.deactivate_recording()
# Run the environment for 800 steps in total, resetting whenever an episode ends
np.random.seed(seed=42)
obs, info = loaded_tupli_env.reset(seed=42)

for step in range(800):
    # predict() expects a batch, so add a leading batch dimension to the observation
    action = np.int64(loaded_algo.predict(np.expand_dims(obs, axis=0))[0])
    obs, reward, done, truncated, info = loaded_tupli_env.step(action)
    if done or truncated:
        # Note: step counts all steps taken so far, not only those of the current episode
        print(f'Episode finished after {step + 1} timesteps')
        obs, info = loaded_tupli_env.reset()
# Close the environment (this also closes the render window)
loaded_tupli_env.close()

Episode finished after 375 timesteps
Episode finished after 694 timesteps


The trained policy manages to reach the flag even though it has only learned from random actions!
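
In the test loop above, `step` keeps counting across resets, so the second printed number (694) is the total number of steps taken, and the second episode itself lasted 694 - 375 = 319 steps. The sketch below shows one way to record per-episode lengths instead. `CountdownEnv` is a hypothetical stand-in that only imitates the Gymnasium `reset`/`step` return values, and `episode_lengths` is an illustrative helper, not a PyTupli or d3rlpy function.

```python
from typing import Any, Callable, List


class CountdownEnv:
    """Hypothetical stand-in environment that terminates after `length` steps."""

    def __init__(self, length: int) -> None:
        self.length = length
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.length
        # Same 5-tuple layout as Gymnasium: obs, reward, terminated, truncated, info
        return self.t, -1.0, terminated, False, {}


def episode_lengths(env, policy: Callable[[Any], Any], max_steps: int) -> List[int]:
    """Run `policy` for `max_steps` steps and return the length of each finished episode."""
    lengths = []
    obs, _ = env.reset()
    steps_in_episode = 0
    for _ in range(max_steps):
        obs, _reward, terminated, truncated, _info = env.step(policy(obs))
        steps_in_episode += 1
        if terminated or truncated:
            lengths.append(steps_in_episode)
            steps_in_episode = 0  # restart the counter for the next episode
            obs, _ = env.reset()
    return lengths


lengths = episode_lengths(CountdownEnv(length=3), policy=lambda obs: 0, max_steps=10)
print(lengths)
```

An episode still running when `max_steps` is reached is not included in the returned list.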

### Deleting Benchmarks
To clean up our storage, we now delete the benchmark and all related artifacts. Episodes will automatically be deleted, too.

In [20]:
# Clean up: First delete the policy artifact we created
print(f"Deleting policy artifact: {stored_policy.id}")
storage.delete_artifact(stored_policy.id)

# Then delete the benchmark and all remaining related artifacts
# Episodes will automatically be deleted too
print(f"Deleting benchmark: {loaded_tupli_env.id}")
loaded_tupli_env.delete(delete_artifacts=True)

print("Cleanup completed!")

Deleting policy artifact: 48ecc6b86e9a4cdeafd000e293511a6b
Deleting benchmark: 4b076cdedc4d433ea4da6d2801c05ca0
Cleanup completed!
