<h1 style="color:rgb(0,120,170)"><b>Master Thesis - Evaluating Offline Reinforcement Learning Under Suboptimal Data Quality: A Comparative Study of BC, BVE and IQL
</b></h1>
<h2 style="color:rgb(0,120,170)"><u>Part 2: Training Offline RL Agents</u></h2>

<b>Author:</b> Manuel Sperl<br>
<b>Date:</b> 2025

This notebook is part of my master’s thesis project. The overall goal is to evaluate three offline reinforcement learning (RL) algorithms, Behavioral Cloning (BC), Behavior Value Estimation (BVE), and Implicit Q-Learning (IQL), on datasets of varying quality.

Offline RL methods rely entirely on pre-collected data and cannot interact with the environment during training. As such, the quality and nature of the dataset play a critical role in downstream performance. In this second notebook, I train the three offline RL agents on datasets of varying quality.

The following offline RL methods are implemented in this notebook:

<b>Behavioral Cloning (BC)</b>: A supervised learning approach that imitates the policy that produced the dataset by minimizing the difference between the predicted actions and the actions recorded in the dataset.

<b>Behavior Value Estimation (BVE)</b>: A value-based method that estimates the value of the actions actually taken in the dataset, i.e. the Q-values of the behavior policy, and then acts on these estimates. This allows it to learn from suboptimal actions as well as from good ones.

<b>Implicit Q-Learning (IQL)</b>: A method that learns Q-values and a state-value function using only actions that appear in the dataset, and then extracts a policy from these estimates. It is designed to improve on the behavior policy when the data is suboptimal.
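To make the three descriptions above more concrete, the following is a minimal, framework-free sketch of the per-sample quantities usually associated with each method: a cross-entropy loss for BC, a SARSA-style bootstrap target for BVE, and the expectile loss used for the value function in IQL. The function names, the discount `gamma=0.99`, and the expectile `tau=0.7` are illustrative assumptions and are not taken from the `bc_utils`, `bve_utils`, or `iql_utils` modules used in this notebook.

```python
import math


def bc_loss(logits, dataset_action):
    """Cross-entropy between the policy's action logits and the dataset action."""
    max_logit = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_norm - logits[dataset_action]


def bve_target(reward, next_q_values, next_dataset_action, done, gamma=0.99):
    """SARSA-style target: bootstrap from the action the dataset took next,
    not from the maximum over all actions."""
    bootstrap = 0.0 if done else gamma * next_q_values[next_dataset_action]
    return reward + bootstrap


def expectile_loss(q_value, v_value, tau=0.7):
    """Asymmetric squared error that pulls V towards an upper expectile of Q
    when tau > 0.5."""
    diff = q_value - v_value
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2
```

With `tau > 0.5`, the expectile loss penalizes underestimating `Q` more than overestimating it, which is how IQL approximates a maximum over actions without querying actions that are absent from the data.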

For each of these methods, I train agents on datasets with varying levels of perturbation (0%, 5%, 10%, and 20%) to simulate different data quality conditions. The perturbation levels represent the degree of noise or suboptimality in the dataset, which is crucial for evaluating the robustness of each method.
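As an illustration only, one common way to realize such a perturbation level is to replace each recorded action with a uniformly random one with probability equal to the perturbation rate. The sketch below assumes this interpretation; the helper `perturb_actions` is hypothetical, the default of 18 actions is assumed to match Seaquest's full action set, and the actual perturbation procedure is the one used when the datasets were created.

```python
import random


def perturb_actions(actions, perturbation_rate, num_actions=18, seed=0):
    """Return a copy of `actions` in which each entry is replaced by a
    uniformly random action with probability `perturbation_rate`.

    Hypothetical helper: it illustrates one possible meaning of
    "x% perturbation", not the dataset generation code of this project.
    """
    rng = random.Random(seed)  # local generator, so results are repeatable
    perturbed = []
    for action in actions:
        if rng.random() < perturbation_rate:
            perturbed.append(rng.randrange(num_actions))
        else:
            perturbed.append(action)
    return perturbed


# 20% perturbation on a short dummy action sequence
noisy = perturb_actions([0, 1, 2, 3, 4] * 4, perturbation_rate=0.20)
print(noisy)
```

Under this reading, a higher rate means the recorded actions say less about what the data-collecting agent intended, which is consistent with the BC training loss rising with the perturbation level in the logs below.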

Each agent is trained for 10 epochs, and every configuration is repeated with 3 different seeds to support the reproducibility and robustness of the results. The number of epochs is fixed to keep training and evaluation time manageable. All resulting agents are saved to disk for later evaluation and comparison across data quality levels.
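The full set of training runs can be enumerated as a small grid. The sketch below builds the dataset names and paths in the pattern used by the cells in this notebook; the helper names `dataset_path` and `experiment_grid` are illustrative, and the path for the expert dataset is assumed to follow the same pattern as the beginner and intermediate ones.

```python
from itertools import product

DATASET_LEVELS = ["beginner", "intermediate", "expert"]
PERTURBATION_LEVELS = [0, 5, 10, 20]  # percent
METHODS = ["BC", "BVE", "IQL"]
SEEDS = 3


def dataset_path(level, perturbation):
    """Path pattern as used by the load_and_prepare_dataset calls in this notebook."""
    return f"datasets/{level}_logs/seaquest_{level}_perturb{perturbation}.pkl"


def experiment_grid():
    """One entry per (dataset level, perturbation level, method) combination."""
    return [
        {
            "method": method,
            "dataset": f"seaquest_{level}_perturb{perturbation}",
            "path": dataset_path(level, perturbation),
        }
        for level, perturbation, method in product(
            DATASET_LEVELS, PERTURBATION_LEVELS, METHODS
        )
    ]


grid = experiment_grid()
print(f"{len(grid)} training calls, {len(grid) * SEEDS} saved models")
```

This makes the overall cost visible: 3 dataset levels, 4 perturbation levels, and 3 methods give 36 training calls, each of which trains 3 seeds.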

<h4><u>The notebook is structured in three main parts:</u></h4>

<b><a href="#models_beginner">1. Training all models on: Beginner Dataset</a></b>

<b><a href="#models_intermediate">2. Training all models on: Intermediate Dataset</a></b>

<b><a href="#models_expert">3. Training all models on: Expert Dataset</a></b>



<div class="alert alert-info">

<h3 style="color:rgb(0,120,170)">How to use this notebook</h3>

This notebook is structured to be executed from start to finish. Before you begin, ensure that all necessary packages are installed.
</div>

<h3 style="color:rgb(0,120,170)">Imports</h3>

In [None]:
import torch
import warnings

warnings.filterwarnings("ignore")

# ensure the module is re-imported after changes
import importlib

import datasets.dataset_utils
importlib.reload(datasets.dataset_utils)

from datasets.dataset_utils import set_all_seeds, create_environment, load_and_prepare_dataset

import offline_rl_models.behavioral_cloning_bc.bc_utils
importlib.reload(offline_rl_models.behavioral_cloning_bc.bc_utils)

from offline_rl_models.behavioral_cloning_bc.bc_utils import train_and_evaluate_BC

import offline_rl_models.implicit_q_learning_iql.iql_utils
importlib.reload(offline_rl_models.implicit_q_learning_iql.iql_utils)

from offline_rl_models.implicit_q_learning_iql.iql_utils import train_and_evaluate_IQL

import offline_rl_models.behavior_value_estimation_bve.bve_utils
importlib.reload(offline_rl_models.behavior_value_estimation_bve.bve_utils)

from offline_rl_models.behavior_value_estimation_bve.bve_utils import train_and_evaluate_BVE

<h3 style="color:rgb(0,120,170)">Global Variables</h3>

In [4]:
SEED = 12345
ENV_ID = 'SeaquestNoFrameskip-v4'
EPOCHS = 10
SEEDS = 3
BATCH_SIZE = 64

<h3 style="color:rgb(0,120,170)">Environment Setup</h3>

In [5]:
# set seed for reproducibility
set_all_seeds(SEED)

# use the GPU (CUDA) if available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

# initialize environment
env = create_environment(env_id=ENV_ID, seed=SEED)

Device: cuda


<div id='models_beginner'>
<h2 style="color:rgb(0,120,170)">Training all models on: Beginner Dataset</h2>

In this section, we will train all three offline RL methods (BC, BVE, and IQL) on the Beginner dataset, with varying levels of perturbation (0%, 5%, 10%, and 20%). Each method will be trained for 10 epochs, and the results will be saved to disk for later evaluation.

<h3><u>0% Perturbation:</u></h3>

First, we will load and prepare the Beginner dataset with 0% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/beginner_logs/seaquest_beginner_perturb0.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_beginner_perturb0 dataset ===
Preprocessing and splitting seaquest_beginner_perturb0 dataset...
Creating dataloaders for seaquest_beginner_perturb0...
Dataloaders ready for: seaquest_beginner_perturb0


<h4>Behavioral Cloning (BC)</h4>

In [None]:
%%time

# BC model -> Beginner dataset, 0% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Beginner | Perturbation 0% -----")

Training BC on seaquest_beginner_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:05:42<00:00, 394.22s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Train Loss: 0.24313
    ➤ Avg Validation Loss: 0.49041
    ➤ Avg Reward: 122.00
Saving the model...
Model saved to agent_methods/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb0/bc_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:05:02<00:00, 390.22s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Train Loss: 0.23865
    ➤ Avg Validation Loss: 0.51509
    ➤ Avg Reward: 112.00
Saving the model...
Model saved to agent_methods/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb0/bc_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:04:15<00:00, 385.59s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Train Loss: 0.24239
    ➤ Avg Validation Loss: 0.50235
    ➤ Avg Reward: 108.00
Saving the model...
Model saved to agent_methods/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb0/bc_model_perturb0_seed3.pth
Return Stats saved to agent_methods/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb0/stats_perturb0.pkl
----- Execution time: BC - Beginner | Perturbation 0% -----
CPU times: total: 3h 58min 13s
Wall time: 3h 15min 8s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Beginner dataset, 0% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Beginner | Perturbation 0% -----")

Training BVE on seaquest_beginner_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:01:08<00:00, 366.83s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Train Loss: 0.04047
    ➤ Avg Val Loss: 0.04236
    ➤ Avg Reward: 68.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb0/bve_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:02:02<00:00, 372.21s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Train Loss: 0.04019
    ➤ Avg Val Loss: 0.04214
    ➤ Avg Reward: 82.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb0/bve_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:04:20<00:00, 386.01s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Train Loss: 0.04046
    ➤ Avg Val Loss: 0.04241
    ➤ Avg Reward: 100.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb0/bve_model_perturb0_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb0/stats_perturb0.pkl
----- Execution time: BVE - Beginner | Perturbation 0% -----
CPU times: total: 3h 49min 39s
Wall time: 3h 7min 38s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Beginner dataset, 0% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Beginner | Perturbation 0% -----") 

Training IQL on seaquest_beginner_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:11:16<00:00, 427.65s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Actor Loss: 0.24376
    ➤ Avg Critic1 Loss: 0.04095
    ➤ Avg Critic2 Loss: 0.04088
    ➤ Avg Value Loss: 0.00059
    ➤ Avg Validation Loss: 0.52930
    ➤ Avg Reward: 66.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb0/iql_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:12:34<00:00, 435.48s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Actor Loss: 0.22850
    ➤ Avg Critic1 Loss: 0.04078
    ➤ Avg Critic2 Loss: 0.04073
    ➤ Avg Value Loss: 0.00061
    ➤ Avg Validation Loss: 0.53246
    ➤ Avg Reward: 58.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb0/iql_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:13:57<00:00, 443.74s/it]


Finished Training on seaquest_beginner_perturb0
    ➤ Avg Actor Loss: 0.22105
    ➤ Avg Critic1 Loss: 0.04037
    ➤ Avg Critic2 Loss: 0.04057
    ➤ Avg Value Loss: 0.00073
    ➤ Avg Validation Loss: 0.52086
    ➤ Avg Reward: 82.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb0/iql_model_perturb0_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb0/stats_perturb0.pkl
----- Execution time: IQL - Beginner | Perturbation 0% -----
CPU times: total: 4h 20min 8s
Wall time: 3h 38min 1s
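
The per-seed "Avg Reward" values printed above for the Beginner dataset at 0% perturbation can be summarized with a few lines of standard-library Python. This is only a quick sanity check with an illustrative helper name (`summarize`); the stored `stats_perturb0.pkl` files are the basis for the actual evaluation, and with three seeds the standard deviation is a rough indication at best.

```python
from statistics import mean, stdev

# "Avg Reward" per seed, copied from the training logs above
rewards_beginner_perturb0 = {
    "BC": [122.0, 112.0, 108.0],
    "BVE": [68.0, 82.0, 100.0],
    "IQL": [66.0, 58.0, 82.0],
}


def summarize(per_seed_rewards):
    """Map each method to (mean, sample standard deviation) over its seeds."""
    return {
        method: (mean(values), stdev(values))
        for method, values in per_seed_rewards.items()
    }


summary = summarize(rewards_beginner_perturb0)
for method, (avg, spread) in summary.items():
    print(f"{method}: mean={avg:.2f}, std={spread:.2f}")
```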


-----------------------------

<h3><u>5% Perturbation:</u></h3>

First, we will load and prepare the Beginner dataset with 5% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/beginner_logs/seaquest_beginner_perturb5.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_beginner_perturb5 dataset ===
Preprocessing and splitting seaquest_beginner_perturb5 dataset...
Creating dataloaders for seaquest_beginner_perturb5...
Dataloaders ready for: seaquest_beginner_perturb5


<h4>Behavioral Cloning (BC)</h4>

In [None]:
%%time

# BC model -> Beginner dataset, 5% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Beginner | Perturbation 5% -----")

Training BC on seaquest_beginner_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:06:03<00:00, 396.32s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Train Loss: 0.54945
    ➤ Avg Validation Loss: 0.80399
    ➤ Avg Reward: 120.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb5/bc_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:05:04<00:00, 390.43s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Train Loss: 0.53910
    ➤ Avg Validation Loss: 0.81655
    ➤ Avg Reward: 122.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb5/bc_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:05:43<00:00, 394.30s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Train Loss: 0.53908
    ➤ Avg Validation Loss: 0.81277
    ➤ Avg Reward: 96.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb5/bc_model_perturb5_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb5/stats_perturb5.pkl
----- Execution time: BC - Beginner | Perturbation 5% -----
CPU times: total: 3h 58min 54s
Wall time: 3h 16min 58s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Beginner dataset, 5% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Beginner | Perturbation 5% -----")

Training BVE on seaquest_beginner_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:04:04<00:00, 384.43s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Train Loss: 0.03992
    ➤ Avg Val Loss: 0.04436
    ➤ Avg Reward: 98.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb5/bve_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:03:41<00:00, 382.14s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Train Loss: 0.03994
    ➤ Avg Val Loss: 0.04446
    ➤ Avg Reward: 104.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb5/bve_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:05:46<00:00, 394.64s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Train Loss: 0.03984
    ➤ Avg Val Loss: 0.04444
    ➤ Avg Reward: 134.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb5/bve_model_perturb5_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb5/stats_perturb5.pkl
----- Execution time: BVE - Beginner | Perturbation 5% -----
CPU times: total: 3h 55min 8s
Wall time: 3h 13min 42s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Beginner dataset, 5% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Beginner | Perturbation 5% -----") 

Training IQL on seaquest_beginner_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:10:51<00:00, 425.15s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Actor Loss: 0.38308
    ➤ Avg Critic1 Loss: 0.04042
    ➤ Avg Critic2 Loss: 0.04041
    ➤ Avg Value Loss: 0.00023
    ➤ Avg Validation Loss: 0.82830
    ➤ Avg Reward: 82.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb5/iql_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:12:45<00:00, 436.51s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Actor Loss: 0.35953
    ➤ Avg Critic1 Loss: 0.04033
    ➤ Avg Critic2 Loss: 0.04034
    ➤ Avg Value Loss: 0.00022
    ➤ Avg Validation Loss: 0.82422
    ➤ Avg Reward: 54.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb5/iql_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:14:18<00:00, 445.82s/it]


Finished Training on seaquest_beginner_perturb5
    ➤ Avg Actor Loss: 0.34982
    ➤ Avg Critic1 Loss: 0.04011
    ➤ Avg Critic2 Loss: 0.04018
    ➤ Avg Value Loss: 0.00021
    ➤ Avg Validation Loss: 0.82678
    ➤ Avg Reward: 78.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb5/iql_model_perturb5_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb5/stats_perturb5.pkl
----- Execution time: IQL - Beginner | Perturbation 5% -----
CPU times: total: 4h 19min 30s
Wall time: 3h 38min 8s


-----------------

<h3><u>10% Perturbation:</u></h3>

First, we will load and prepare the Beginner dataset with 10% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/beginner_logs/seaquest_beginner_perturb10.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_beginner_perturb10 dataset ===
Preprocessing and splitting seaquest_beginner_perturb10 dataset...
Creating dataloaders for seaquest_beginner_perturb10...
Dataloaders ready for: seaquest_beginner_perturb10


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Beginner dataset, 10% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Beginner | Perturbation 10% -----")

Training BC on seaquest_beginner_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:03:17<00:00, 379.80s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Train Loss: 0.78067
    ➤ Avg Validation Loss: 1.13129
    ➤ Avg Reward: 112.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb10/bc_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:03:17<00:00, 379.74s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Train Loss: 0.78177
    ➤ Avg Validation Loss: 1.14384
    ➤ Avg Reward: 128.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb10/bc_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:02:54<00:00, 377.49s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Train Loss: 0.79521
    ➤ Avg Validation Loss: 1.11932
    ➤ Avg Reward: 102.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb10/bc_model_perturb10_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb10/stats_perturb10.pkl
----- Execution time: BC - Beginner | Perturbation 10% -----
CPU times: total: 3h 49min 58s
Wall time: 3h 9min 38s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Beginner dataset, 10% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Beginner | Perturbation 10% -----")

Training BVE on seaquest_beginner_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:02:21<00:00, 374.14s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Train Loss: 0.04170
    ➤ Avg Val Loss: 0.03870
    ➤ Avg Reward: 130.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb10/bve_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:03:05<00:00, 378.59s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Train Loss: 0.04174
    ➤ Avg Val Loss: 0.03865
    ➤ Avg Reward: 118.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb10/bve_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:04:24<00:00, 386.48s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Train Loss: 0.04160
    ➤ Avg Val Loss: 0.03860
    ➤ Avg Reward: 100.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb10/bve_model_perturb10_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb10/stats_perturb10.pkl
----- Execution time: BVE - Beginner | Perturbation 10% -----
CPU times: total: 3h 51min 34s
Wall time: 3h 10min 1s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Beginner dataset, 10% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Beginner | Perturbation 10% -----") 

Training IQL on seaquest_beginner_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:09:12<00:00, 415.20s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Actor Loss: 0.48676
    ➤ Avg Critic1 Loss: 0.04186
    ➤ Avg Critic2 Loss: 0.04184
    ➤ Avg Value Loss: 0.00026
    ➤ Avg Validation Loss: 1.14102
    ➤ Avg Reward: 44.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb10/iql_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:10:46<00:00, 424.69s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Actor Loss: 0.49054
    ➤ Avg Critic1 Loss: 0.04202
    ➤ Avg Critic2 Loss: 0.04205
    ➤ Avg Value Loss: 0.00026
    ➤ Avg Validation Loss: 1.12868
    ➤ Avg Reward: 36.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb10/iql_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:12:38<00:00, 435.81s/it]


Finished Training on seaquest_beginner_perturb10
    ➤ Avg Actor Loss: 0.47351
    ➤ Avg Critic1 Loss: 0.04179
    ➤ Avg Critic2 Loss: 0.04187
    ➤ Avg Value Loss: 0.00025
    ➤ Avg Validation Loss: 1.12495
    ➤ Avg Reward: 70.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb10/iql_model_perturb10_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb10/stats_perturb10.pkl
----- Execution time: IQL - Beginner | Perturbation 10% -----
CPU times: total: 4h 15min 15s
Wall time: 3h 32min 50s


-----------------

<h3><u>20% Perturbation:</u></h3>

First, we will load and prepare the Beginner dataset with 20% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/beginner_logs/seaquest_beginner_perturb20.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_beginner_perturb20 dataset ===
Preprocessing and splitting seaquest_beginner_perturb20 dataset...
Creating dataloaders for seaquest_beginner_perturb20...
Dataloaders ready for: seaquest_beginner_perturb20


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Beginner dataset, 20% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Beginner | Perturbation 20% -----")

Training BC on seaquest_beginner_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:05:59<00:00, 395.94s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Train Loss: 1.16269
    ➤ Avg Validation Loss: 1.64394
    ➤ Avg Reward: 90.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb20/bc_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:14:19<00:00, 445.99s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Train Loss: 1.17402
    ➤ Avg Validation Loss: 1.64498
    ➤ Avg Reward: 108.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb20/bc_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:03:44<00:00, 382.40s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Train Loss: 1.17974
    ➤ Avg Validation Loss: 1.63410
    ➤ Avg Reward: 96.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb20/bc_model_perturb20_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_beginner/perturb20/stats_perturb20.pkl
----- Execution time: BC - Beginner | Perturbation 20% -----
CPU times: total: 4h 5min 9s
Wall time: 3h 24min 11s


<h4>Behavior Value Estimation (BVE)</h4>

In [14]:
%%time

# BVE model -> Beginner dataset, 20% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Beginner | Perturbation 20% -----")

Training BVE on seaquest_beginner_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:03:25<00:00, 380.56s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Train Loss: 0.04720
    ➤ Avg Val Loss: 0.04362
    ➤ Avg Reward: 120.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb20/bve_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:04:26<00:00, 386.68s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Train Loss: 0.04734
    ➤ Avg Val Loss: 0.04358
    ➤ Avg Reward: 150.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb20/bve_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:04:10<00:00, 385.04s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Train Loss: 0.04708
    ➤ Avg Val Loss: 0.04358
    ➤ Avg Reward: 154.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb20/bve_model_perturb20_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_beginner/perturb20/stats_perturb20.pkl
----- Execution time: BVE - Beginner | Perturbation 20% -----
CPU times: total: 3h 53min 24s
Wall time: 3h 12min 13s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Beginner dataset, 20% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_beginner_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Beginner | Perturbation 20% -----") 

Training IQL on seaquest_beginner_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:11:09<00:00, 426.94s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Actor Loss: 0.68427
    ➤ Avg Critic1 Loss: 0.04792
    ➤ Avg Critic2 Loss: 0.04792
    ➤ Avg Value Loss: 0.00028
    ➤ Avg Validation Loss: 1.54215
    ➤ Avg Reward: 26.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb20/iql_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:13:14<00:00, 439.41s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Actor Loss: 0.66933
    ➤ Avg Critic1 Loss: 0.04791
    ➤ Avg Critic2 Loss: 0.04793
    ➤ Avg Value Loss: 0.00026
    ➤ Avg Validation Loss: 1.55796
    ➤ Avg Reward: 10.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb20/iql_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:13:04<00:00, 438.41s/it]


Finished Training on seaquest_beginner_perturb20
    ➤ Avg Actor Loss: 0.66543
    ➤ Avg Critic1 Loss: 0.04751
    ➤ Avg Critic2 Loss: 0.04758
    ➤ Avg Value Loss: 0.00031
    ➤ Avg Validation Loss: 1.54557
    ➤ Avg Reward: 48.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb20/iql_model_perturb20_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_beginner/perturb20/stats_perturb20.pkl
----- Execution time: IQL - Beginner | Perturbation 20% -----
CPU times: total: 4h 24min 26s
Wall time: 3h 37min 40s


------------------

<div id='models_intermediate'>
<h2 style="color:rgb(0,120,170)">Training all models on: Intermediate Dataset</h2>

In this section, we will train all three offline RL methods (BC, BVE, and IQL) on the Intermediate dataset, with varying levels of perturbation (0%, 5%, 10%, and 20%). Each method will be trained for 10 epochs, and the results will be saved to disk for later evaluation.

<h3><u>0% Perturbation:</u></h3>

First, we will load and prepare the Intermediate dataset with 0% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/intermediate_logs/seaquest_intermediate_perturb0.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_intermediate_perturb0 dataset ===
Preprocessing and splitting seaquest_intermediate_perturb0 dataset...
Creating dataloaders for seaquest_intermediate_perturb0...
Dataloaders ready for: seaquest_intermediate_perturb0


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Intermediate dataset, 0% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Intermediate | Perturbation 0% -----")

Training BC on seaquest_intermediate_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:05:51<00:00, 395.15s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Train Loss: 0.28432
    ➤ Avg Validation Loss: 0.41366
    ➤ Avg Reward: 234.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb0/bc_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:06:51<00:00, 401.14s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Train Loss: 0.27632
    ➤ Avg Validation Loss: 0.40081
    ➤ Avg Reward: 246.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb0/bc_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:06:50<00:00, 401.03s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Train Loss: 0.28067
    ➤ Avg Validation Loss: 0.41067
    ➤ Avg Reward: 228.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb0/bc_model_perturb0_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb0/stats_perturb0.pkl
----- Execution time: BC - Intermediate | Perturbation 0% -----
CPU times: total: 4h 3min 37s
Wall time: 3h 19min 41s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Intermediate dataset, 0% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Intermediate | Perturbation 0% -----")

Training BVE on seaquest_intermediate_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:04:33<00:00, 387.35s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Train Loss: 0.08025
    ➤ Avg Val Loss: 0.07806
    ➤ Avg Reward: 38.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb0/bve_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:07:23<00:00, 404.31s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Train Loss: 0.08030
    ➤ Avg Val Loss: 0.07794
    ➤ Avg Reward: 6.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb0/bve_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:08:24<00:00, 410.45s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Train Loss: 0.08300
    ➤ Avg Val Loss: 0.08294
    ➤ Avg Reward: 0.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb0/bve_model_perturb0_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb0/stats_perturb0.pkl
----- Execution time: BVE - Intermediate | Perturbation 0% -----
CPU times: total: 4h 1min 52s
Wall time: 3h 20min 31s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Intermediate dataset, 0% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Intermediate | Perturbation 0% -----")

Training IQL on seaquest_intermediate_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:11:54<00:00, 431.41s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Actor Loss: 0.35819
    ➤ Avg Critic1 Loss: 0.07922
    ➤ Avg Critic2 Loss: 0.07914
    ➤ Avg Value Loss: 0.00277
    ➤ Avg Validation Loss: 0.52132
    ➤ Avg Reward: 204.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb0/iql_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:13:47<00:00, 442.74s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Actor Loss: 0.35734
    ➤ Avg Critic1 Loss: 0.07983
    ➤ Avg Critic2 Loss: 0.07973
    ➤ Avg Value Loss: 0.00322
    ➤ Avg Validation Loss: 0.52078
    ➤ Avg Reward: 212.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb0/iql_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:15:48<00:00, 454.80s/it]


Finished Training on seaquest_intermediate_perturb0
    ➤ Avg Actor Loss: 0.36840
    ➤ Avg Critic1 Loss: 0.07873
    ➤ Avg Critic2 Loss: 0.07978
    ➤ Avg Value Loss: 0.00379
    ➤ Avg Validation Loss: 0.53048
    ➤ Avg Reward: 228.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb0/iql_model_perturb0_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb0/stats_perturb0.pkl
----- Execution time: IQL - Intermediate | Perturbation 0% -----
CPU times: total: 4h 22min 50s
Wall time: 3h 41min 42s
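
The per-seed average rewards printed above can be condensed into a mean and a standard deviation per method. The following is a minimal sketch, not part of the notebook's pipeline: the values are copied by hand from the logs above (Intermediate dataset, 0% perturbation), and the names `rewards_perturb0` and `summarize`, as well as the choice of the sample standard deviation, are assumptions made for illustration.

```python
import statistics

# Per-seed "Avg Reward" values copied from the logs above
# (Intermediate dataset, 0% perturbation, seeds 1 to 3).
rewards_perturb0 = {
    "BC": [234.0, 246.0, 228.0],
    "BVE": [38.0, 6.0, 0.0],
    "IQL": [204.0, 212.0, 228.0],
}


def summarize(rewards):
    """Return {method: (mean, sample standard deviation)} across seeds."""
    return {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in rewards.items()
    }


summary = summarize(rewards_perturb0)
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.2f}, std={std:.2f}")
```

With only three seeds per configuration, the standard deviation is a rough indicator of spread rather than a reliable estimate.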


-----------------------------

<h3><u>5% Perturbation:</u></h3>

First, we will load and prepare the Intermediate dataset with 5% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/intermediate_logs/seaquest_intermediate_perturb5.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_intermediate_perturb5 dataset ===
Preprocessing and splitting seaquest_intermediate_perturb5 dataset...
Creating dataloaders for seaquest_intermediate_perturb5...
Dataloaders ready for: seaquest_intermediate_perturb5


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Intermediate dataset, 5% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Intermediate | Perturbation 5% -----")

Training BC on seaquest_intermediate_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:05:16<00:00, 391.69s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Train Loss: 0.56711
    ➤ Avg Validation Loss: 0.73694
    ➤ Avg Reward: 230.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb5/bc_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:04:58<00:00, 389.83s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Train Loss: 0.57056
    ➤ Avg Validation Loss: 0.72916
    ➤ Avg Reward: 246.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb5/bc_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:05:09<00:00, 390.94s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Train Loss: 0.56762
    ➤ Avg Validation Loss: 0.73089
    ➤ Avg Reward: 236.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb5/bc_model_perturb5_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb5/stats_perturb5.pkl
----- Execution time: BC - Intermediate | Perturbation 5% -----
CPU times: total: 3h 58min
Wall time: 3h 15min 32s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Intermediate dataset, 5% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Intermediate | Perturbation 5% -----")

Training BVE on seaquest_intermediate_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:04:15<00:00, 385.56s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Train Loss: 0.08246
    ➤ Avg Val Loss: 0.08396
    ➤ Avg Reward: 78.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb5/bve_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:03:59<00:00, 383.91s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Train Loss: 0.08158
    ➤ Avg Val Loss: 0.08267
    ➤ Avg Reward: 68.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb5/bve_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:07:03<00:00, 402.35s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Train Loss: 0.08595
    ➤ Avg Val Loss: 0.08895
    ➤ Avg Reward: 36.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb5/bve_model_perturb5_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb5/stats_perturb5.pkl
----- Execution time: BVE - Intermediate | Perturbation 5% -----
CPU times: total: 3h 57min
Wall time: 3h 15min 28s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Intermediate dataset, 5% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Intermediate | Perturbation 5% -----")

Training IQL on seaquest_intermediate_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:12:13<00:00, 433.36s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Actor Loss: 0.58017
    ➤ Avg Critic1 Loss: 0.08222
    ➤ Avg Critic2 Loss: 0.08203
    ➤ Avg Value Loss: 0.00279
    ➤ Avg Validation Loss: 0.82237
    ➤ Avg Reward: 206.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb5/iql_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:13:20<00:00, 440.04s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Actor Loss: 0.56678
    ➤ Avg Critic1 Loss: 0.08202
    ➤ Avg Critic2 Loss: 0.08189
    ➤ Avg Value Loss: 0.00281
    ➤ Avg Validation Loss: 0.83321
    ➤ Avg Reward: 220.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb5/iql_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:14:58<00:00, 449.82s/it]


Finished Training on seaquest_intermediate_perturb5
    ➤ Avg Actor Loss: 0.61793
    ➤ Avg Critic1 Loss: 0.08116
    ➤ Avg Critic2 Loss: 0.08287
    ➤ Avg Value Loss: 0.00358
    ➤ Avg Validation Loss: 0.81915
    ➤ Avg Reward: 224.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb5/iql_model_perturb5_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb5/stats_perturb5.pkl
----- Execution time: IQL - Intermediate | Perturbation 5% -----
CPU times: total: 4h 22min 50s
Wall time: 3h 40min 45s


-----------------

<h3><u>10% Perturbation:</u></h3>

First, we will load and prepare the Intermediate dataset with 10% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/intermediate_logs/seaquest_intermediate_perturb10.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_intermediate_perturb10 dataset ===
Preprocessing and splitting seaquest_intermediate_perturb10 dataset...
Creating dataloaders for seaquest_intermediate_perturb10...
Dataloaders ready for: seaquest_intermediate_perturb10


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Intermediate dataset, 10% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Intermediate | Perturbation 10% -----")

Training BC on seaquest_intermediate_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:05:24<00:00, 392.40s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Train Loss: 0.80976
    ➤ Avg Validation Loss: 1.01118
    ➤ Avg Reward: 246.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb10/bc_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:05:02<00:00, 390.26s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Train Loss: 0.81260
    ➤ Avg Validation Loss: 1.01867
    ➤ Avg Reward: 228.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb10/bc_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:04:39<00:00, 387.99s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Train Loss: 0.81154
    ➤ Avg Validation Loss: 1.01699
    ➤ Avg Reward: 232.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb10/bc_model_perturb10_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb10/stats_perturb10.pkl
----- Execution time: BC - Intermediate | Perturbation 10% -----
CPU times: total: 3h 57min 6s
Wall time: 3h 15min 14s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Intermediate dataset, 10% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Intermediate | Perturbation 10% -----")

Training BVE on seaquest_intermediate_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:05:17<00:00, 391.71s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Train Loss: 0.08220
    ➤ Avg Val Loss: 0.08228
    ➤ Avg Reward: 58.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb10/bve_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:06:10<00:00, 397.02s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Train Loss: 0.08187
    ➤ Avg Val Loss: 0.08247
    ➤ Avg Reward: 12.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb10/bve_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:04:22<00:00, 386.23s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Train Loss: 0.08419
    ➤ Avg Val Loss: 0.08647
    ➤ Avg Reward: 40.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb10/bve_model_perturb10_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb10/stats_perturb10.pkl
----- Execution time: BVE - Intermediate | Perturbation 10% -----
CPU times: total: 3h 57min 38s
Wall time: 3h 16min


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Intermediate dataset, 10% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Intermediate | Perturbation 10% -----")

Training IQL on seaquest_intermediate_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:12:11<00:00, 433.19s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Actor Loss: 0.78640
    ➤ Avg Critic1 Loss: 0.08167
    ➤ Avg Critic2 Loss: 0.08145
    ➤ Avg Value Loss: 0.00310
    ➤ Avg Validation Loss: 1.06722
    ➤ Avg Reward: 192.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb10/iql_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:13:53<00:00, 443.33s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Actor Loss: 0.75196
    ➤ Avg Critic1 Loss: 0.08132
    ➤ Avg Critic2 Loss: 0.08113
    ➤ Avg Value Loss: 0.00294
    ➤ Avg Validation Loss: 1.08212
    ➤ Avg Reward: 212.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb10/iql_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:15:48<00:00, 454.89s/it]


Finished Training on seaquest_intermediate_perturb10
    ➤ Avg Actor Loss: 0.84405
    ➤ Avg Critic1 Loss: 0.08033
    ➤ Avg Critic2 Loss: 0.08224
    ➤ Avg Value Loss: 0.00388
    ➤ Avg Validation Loss: 1.05568
    ➤ Avg Reward: 212.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb10/iql_model_perturb10_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb10/stats_perturb10.pkl
----- Execution time: IQL - Intermediate | Perturbation 10% -----
CPU times: total: 4h 24min 51s
Wall time: 3h 42min 7s


-------------------

<h3><u>20% Perturbation:</u></h3>

First, we will load and prepare the Intermediate dataset with 20% perturbation.

In [7]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/intermediate_logs/seaquest_intermediate_perturb20.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_intermediate_perturb20 dataset ===
Preprocessing and splitting seaquest_intermediate_perturb20 dataset...
Creating dataloaders for seaquest_intermediate_perturb20...
Dataloaders ready for: seaquest_intermediate_perturb20


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Intermediate dataset, 20% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Intermediate | Perturbation 20% -----")

Training BC on seaquest_intermediate_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:01:12<00:00, 367.25s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Train Loss: 1.22012
    ➤ Avg Validation Loss: 1.44635
    ➤ Avg Reward: 236.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb20/bc_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:00:26<00:00, 362.65s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Train Loss: 1.20236
    ➤ Avg Validation Loss: 1.46492
    ➤ Avg Reward: 226.67
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb20/bc_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:00:17<00:00, 361.72s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Train Loss: 1.21812
    ➤ Avg Validation Loss: 1.45246
    ➤ Avg Reward: 230.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb20/bc_model_perturb20_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_intermediate/perturb20/stats_perturb20.pkl
----- Execution time: BC - Intermediate | Perturbation 20% -----
CPU times: total: 3h 48min 14s
Wall time: 3h 2min 4s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Intermediate dataset, 20% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Intermediate | Perturbation 20% -----")

Training BVE on seaquest_intermediate_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:00:51<00:00, 365.14s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Train Loss: 0.08359
    ➤ Avg Val Loss: 0.08713
    ➤ Avg Reward: 86.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb20/bve_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:01:05<00:00, 366.57s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Train Loss: 0.08216
    ➤ Avg Val Loss: 0.08302
    ➤ Avg Reward: 48.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb20/bve_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:02:48<00:00, 376.81s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Train Loss: 0.08416
    ➤ Avg Val Loss: 0.08657
    ➤ Avg Reward: 28.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb20/bve_model_perturb20_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_intermediate/perturb20/stats_perturb20.pkl
----- Execution time: BVE - Intermediate | Perturbation 20% -----
CPU times: total: 3h 48min 32s
Wall time: 3h 4min 56s


<h4>Implicit Q-Learning (IQL)</h4>

In [8]:
%%time

# IQL model -> Intermediate dataset, 20% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_intermediate_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Intermediate | Perturbation 20% -----")

Training IQL on seaquest_intermediate_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:12:23<00:00, 434.31s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Actor Loss: 1.01801
    ➤ Avg Critic1 Loss: 0.08178
    ➤ Avg Critic2 Loss: 0.08171
    ➤ Avg Value Loss: 0.00259
    ➤ Avg Validation Loss: 1.45974
    ➤ Avg Reward: 218.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb20/iql_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:13:58<00:00, 443.81s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Actor Loss: 0.98326
    ➤ Avg Critic1 Loss: 0.08133
    ➤ Avg Critic2 Loss: 0.08122
    ➤ Avg Value Loss: 0.00248
    ➤ Avg Validation Loss: 1.46601
    ➤ Avg Reward: 206.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb20/iql_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:16:04<00:00, 456.40s/it]


Finished Training on seaquest_intermediate_perturb20
    ➤ Avg Actor Loss: 1.15043
    ➤ Avg Critic1 Loss: 0.08068
    ➤ Avg Critic2 Loss: 0.08215
    ➤ Avg Value Loss: 0.00354
    ➤ Avg Validation Loss: 1.45665
    ➤ Avg Reward: 218.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb20/iql_model_perturb20_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_intermediate/perturb20/stats_perturb20.pkl
----- Execution time: IQL - Intermediate | Perturbation 20% -----
CPU times: total: 4h 25min 25s
Wall time: 3h 42min 38s
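
To compare the training cost of the three methods on the Intermediate dataset, the `Wall time` lines printed by `%%time` can be converted to seconds and summed. This is a minimal sketch under stated assumptions: the strings are copied by hand from the outputs above (perturbation levels 0%, 5%, 10%, 20%), and `wall_time_to_seconds` and `wall_times` are hypothetical names that do not appear elsewhere in the notebook.

```python
import re


def wall_time_to_seconds(text):
    """Convert a string such as '3h 19min 41s' to a number of seconds."""
    units = {"h": 3600, "min": 60, "s": 1}
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(h|min|s)\b", text):
        total += int(value) * units[unit]
    return total


# Wall times copied from the Intermediate-dataset outputs above,
# ordered by perturbation level: 0%, 5%, 10%, 20%.
wall_times = {
    "BC": ["3h 19min 41s", "3h 15min 32s", "3h 15min 14s", "3h 2min 4s"],
    "BVE": ["3h 20min 31s", "3h 15min 28s", "3h 16min", "3h 4min 56s"],
    "IQL": ["3h 41min 42s", "3h 40min 45s", "3h 42min 7s", "3h 42min 38s"],
}

totals = {
    name: sum(wall_time_to_seconds(t) for t in times)
    for name, times in wall_times.items()
}
for name, seconds in totals.items():
    print(f"{name}: {seconds / 3600:.2f} h total wall time")
```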


----------------

<div id='models_expert'>
<h2 style="color:rgb(0,120,170)">Training all models on: Expert Dataset</h2>

In this section, we train all three offline RL methods (BC, BVE, and IQL) on the Expert dataset at four perturbation levels (0%, 5%, 10%, and 20%). Each method is trained for 10 epochs on each of 3 seeds, and the trained models and return statistics are saved to disk for later evaluation.
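
The cells in this section follow the same pattern for every perturbation level, differing only in the dataset name and path. The following is a minimal sketch of how those names and paths could be generated in a loop; `dataset_spec` and `PERTURBATION_LEVELS` are hypothetical helpers that are not defined in the notebook, and the path template is inferred from the `dataset_path` arguments used in the cells.

```python
PERTURBATION_LEVELS = (0, 5, 10, 20)


def dataset_spec(quality, perturbation):
    """Return (dataset name, pickle path) following this notebook's naming scheme."""
    name = f"seaquest_{quality}_perturb{perturbation}"
    path = f"datasets/{quality}_logs/{name}.pkl"
    return name, path


expert_specs = [dataset_spec("expert", p) for p in PERTURBATION_LEVELS]
for name, path in expert_specs:
    print(name, "->", path)
```

Each `(name, path)` pair corresponds to the `dataset` and `dataset_path` arguments passed explicitly in the cells of this section.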

<h3><u>0% Perturbation:</u></h3>

First, we will load and prepare the Expert dataset with 0% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/expert_logs/seaquest_expert_perturb0.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_expert_perturb0 dataset ===
Preprocessing and splitting seaquest_expert_perturb0 dataset...
Creating dataloaders for seaquest_expert_perturb0...
Dataloaders ready for: seaquest_expert_perturb0


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Expert dataset, 0% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Expert | Perturbation 0% -----")

Training BC on seaquest_expert_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [58:28<00:00, 350.88s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Train Loss: 0.61396
    ➤ Avg Validation Loss: 2.50880
    ➤ Avg Reward: 310.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb0/bc_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [57:47<00:00, 346.72s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Train Loss: 0.62624
    ➤ Avg Validation Loss: 2.41913
    ➤ Avg Reward: 262.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb0/bc_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [58:15<00:00, 349.50s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Train Loss: 0.61294
    ➤ Avg Validation Loss: 2.47897
    ➤ Avg Reward: 300.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb0/bc_model_perturb0_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb0/stats_perturb0.pkl
----- Execution time: BC - Expert | Perturbation 0% -----
CPU times: total: 3h 37min 28s
Wall time: 2h 54min 38s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Expert dataset, 0% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Expert | Perturbation 0% -----")

Training BVE on seaquest_expert_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [57:34<00:00, 345.44s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Train Loss: 0.14414
    ➤ Avg Val Loss: 0.14347
    ➤ Avg Reward: 102.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb0/bve_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [59:12<00:00, 355.23s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Train Loss: 0.14477
    ➤ Avg Val Loss: 0.14326
    ➤ Avg Reward: 58.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb0/bve_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:00:08<00:00, 360.85s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Train Loss: 0.14418
    ➤ Avg Val Loss: 0.14427
    ➤ Avg Reward: 70.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb0/bve_model_perturb0_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb0/stats_perturb0.pkl
----- Execution time: BVE - Expert | Perturbation 0% -----
CPU times: total: 3h 38min 41s
Wall time: 2h 57min 6s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Expert dataset, 0% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb0',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Expert | Perturbation 0% -----")

Training IQL on seaquest_expert_perturb0
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:09:50<00:00, 419.07s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Actor Loss: 0.90878
    ➤ Avg Critic1 Loss: 0.14487
    ➤ Avg Critic2 Loss: 0.14479
    ➤ Avg Value Loss: 0.00165
    ➤ Avg Validation Loss: 2.06453
    ➤ Avg Reward: 154.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb0/iql_model_perturb0_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:12:07<00:00, 432.74s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Actor Loss: 0.85197
    ➤ Avg Critic1 Loss: 0.14568
    ➤ Avg Critic2 Loss: 0.14552
    ➤ Avg Value Loss: 0.00143
    ➤ Avg Validation Loss: 2.06200
    ➤ Avg Reward: 178.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb0/iql_model_perturb0_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:13:52<00:00, 443.27s/it]


Finished Training on seaquest_expert_perturb0
    ➤ Avg Actor Loss: 0.88189
    ➤ Avg Critic1 Loss: 0.14435
    ➤ Avg Critic2 Loss: 0.14482
    ➤ Avg Value Loss: 0.00178
    ➤ Avg Validation Loss: 2.06715
    ➤ Avg Reward: 200.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb0/iql_model_perturb0_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb0/stats_perturb0.pkl
----- Execution time: IQL - Expert | Perturbation 0% -----
CPU times: total: 4h 17min 30s
Wall time: 3h 36min 3s


----------------

<h3><u>5% Perturbation:</u></h3>

First, we will load and prepare the Expert dataset with 5% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/expert_logs/seaquest_expert_perturb5.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_expert_perturb5 dataset ===
Preprocessing and splitting seaquest_expert_perturb5 dataset...
Creating dataloaders for seaquest_expert_perturb5...
Dataloaders ready for: seaquest_expert_perturb5


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Expert dataset, 5% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Expert | Perturbation 5% -----")

Training BC on seaquest_expert_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [57:33<00:00, 345.35s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Train Loss: 0.81537
    ➤ Avg Validation Loss: 2.67363
    ➤ Avg Reward: 278.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb5/bc_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [57:11<00:00, 343.15s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Train Loss: 0.79753
    ➤ Avg Validation Loss: 2.76204
    ➤ Avg Reward: 278.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb5/bc_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [57:20<00:00, 344.01s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Train Loss: 0.81591
    ➤ Avg Validation Loss: 2.69931
    ➤ Avg Reward: 296.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb5/bc_model_perturb5_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb5/stats_perturb5.pkl
----- Execution time: BC - Expert | Perturbation 5% -----
CPU times: total: 3h 34min 20s
Wall time: 2h 52min 12s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Expert dataset, 5% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Expert | Perturbation 5% -----")

Training BVE on seaquest_expert_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [58:39<00:00, 351.98s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Train Loss: 0.13826
    ➤ Avg Val Loss: 0.14059
    ➤ Avg Reward: 158.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb5/bve_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [59:10<00:00, 355.06s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Train Loss: 0.13895
    ➤ Avg Val Loss: 0.14089
    ➤ Avg Reward: 108.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb5/bve_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:00:12<00:00, 361.22s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Train Loss: 0.13805
    ➤ Avg Val Loss: 0.14084
    ➤ Avg Reward: 148.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb5/bve_model_perturb5_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb5/stats_perturb5.pkl
----- Execution time: BVE - Expert | Perturbation 5% -----
CPU times: total: 3h 40min 7s
Wall time: 2h 58min 13s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Expert dataset, 5% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb5',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Expert | Perturbation 5% -----")

Training IQL on seaquest_expert_perturb5
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:07:49<00:00, 406.91s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Actor Loss: 0.94390
    ➤ Avg Critic1 Loss: 0.13908
    ➤ Avg Critic2 Loss: 0.13910
    ➤ Avg Value Loss: 0.00113
    ➤ Avg Validation Loss: 2.08642
    ➤ Avg Reward: 174.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb5/iql_model_perturb5_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:09:54<00:00, 419.41s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Actor Loss: 0.93750
    ➤ Avg Critic1 Loss: 0.13934
    ➤ Avg Critic2 Loss: 0.13917
    ➤ Avg Value Loss: 0.00114
    ➤ Avg Validation Loss: 2.08482
    ➤ Avg Reward: 188.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb5/iql_model_perturb5_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:11:17<00:00, 427.74s/it]


Finished Training on seaquest_expert_perturb5
    ➤ Avg Actor Loss: 0.94147
    ➤ Avg Critic1 Loss: 0.13869
    ➤ Avg Critic2 Loss: 0.13900
    ➤ Avg Value Loss: 0.00121
    ➤ Avg Validation Loss: 2.08925
    ➤ Avg Reward: 182.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb5/iql_model_perturb5_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb5/stats_perturb5.pkl
----- Execution time: IQL - Expert | Perturbation 5% -----
CPU times: total: 4h 11min 28s
Wall time: 3h 29min 14s
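
To compare the BVE and IQL runs above at a glance, the per-seed `Avg Reward` values can be summarised as mean and sample standard deviation across the three seeds. This is a minimal sketch and not part of the training pipeline: `rewards_perturb5` and `summarize` are hypothetical names, and the numbers are copied by hand from the output above.

```python
from statistics import mean, stdev

# Per-seed "Avg Reward" values copied from the 5% perturbation logs above.
rewards_perturb5 = {
    "BVE": [158.0, 108.0, 148.0],
    "IQL": [174.0, 188.0, 182.0],
}


def summarize(rewards):
    """Return {algorithm: (mean, sample standard deviation)} across seeds."""
    return {algo: (mean(vals), stdev(vals)) for algo, vals in rewards.items()}


summary_perturb5 = summarize(rewards_perturb5)
for algo, (mu, sigma) in summary_perturb5.items():
    n_seeds = len(rewards_perturb5[algo])
    print(f"{algo}: {mu:.2f} +/- {sigma:.2f} (n={n_seeds} seeds)")
```

With only three seeds, the standard deviation is a rough indicator of spread rather than a reliable estimate, so it should be read alongside the individual values.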


----------------

<h3><u>10% Perturbation:</u></h3>

First, we will load and prepare the Expert dataset with 10% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/expert_logs/seaquest_expert_perturb10.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_expert_perturb10 dataset ===
Preprocessing and splitting seaquest_expert_perturb10 dataset...
Creating dataloaders for seaquest_expert_perturb10...
Dataloaders ready for: seaquest_expert_perturb10
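
The dataset path and the dataset name passed to the training functions follow the same pattern for every perturbation level in this section. A small helper could derive both from the level, which would avoid copy-paste mistakes between cells. This is only a sketch: `dataset_config` is a hypothetical helper that is not defined in the notebook, and the naming pattern is inferred from the paths and names shown in the cells of this section.

```python
def dataset_config(perturbation, root="datasets/expert_logs"):
    """Build the dataset name and pickle path for one perturbation level (in percent)."""
    if perturbation not in (5, 10, 20):
        # The naming of the unperturbed (0%) dataset is not shown in this section,
        # so it is deliberately not guessed here.
        raise ValueError(f"unsupported perturbation level: {perturbation}")
    name = f"seaquest_expert_perturb{perturbation}"
    return {"dataset": name, "dataset_path": f"{root}/{name}.pkl"}


cfg = dataset_config(10)
print(cfg["dataset"])       # name passed as `dataset=` to the training functions
print(cfg["dataset_path"])  # path passed as `dataset_path=` to the loader
```

Under this assumption, `cfg["dataset_path"]` would go to `load_and_prepare_dataset` and `cfg["dataset"]` to the `train_and_evaluate_*` calls.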


<h4>Behavioral Cloning (BC)</h4>

In [8]:
%%time

# BC model -> Expert dataset, 10% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Expert | Perturbation 10% -----")

Training BC on seaquest_expert_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:03:33<00:00, 381.39s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Train Loss: 0.98455
    ➤ Avg Validation Loss: 3.03464
    ➤ Avg Reward: 268.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb10/bc_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:04:01<00:00, 384.10s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Train Loss: 0.97900
    ➤ Avg Validation Loss: 3.04738
    ➤ Avg Reward: 288.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb10/bc_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:03:09<00:00, 378.96s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Train Loss: 0.95708
    ➤ Avg Validation Loss: 3.03210
    ➤ Avg Reward: 280.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb10/bc_model_perturb10_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb10/stats_perturb10.pkl
----- Execution time: BC - Expert | Perturbation 10% -----
CPU times: total: 3h 52min 1s
Wall time: 3h 10min 52s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Expert dataset, 10% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Expert | Perturbation 10% -----")

Training BVE on seaquest_expert_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:03:53<00:00, 383.40s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Train Loss: 0.14170
    ➤ Avg Val Loss: 0.14945
    ➤ Avg Reward: 190.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb10/bve_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:01:48<00:00, 370.90s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Train Loss: 0.14223
    ➤ Avg Val Loss: 0.14840
    ➤ Avg Reward: 96.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb10/bve_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:03:09<00:00, 378.95s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Train Loss: 0.14147
    ➤ Avg Val Loss: 0.14889
    ➤ Avg Reward: 102.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb10/bve_model_perturb10_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb10/stats_perturb10.pkl
----- Execution time: BVE - Expert | Perturbation 10% -----
CPU times: total: 3h 50min 5s
Wall time: 3h 9min 2s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Expert dataset, 10% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb10',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Expert | Perturbation 10% -----")

Training IQL on seaquest_expert_perturb10
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:08:36<00:00, 411.61s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Actor Loss: 1.03359
    ➤ Avg Critic1 Loss: 0.14298
    ➤ Avg Critic2 Loss: 0.14289
    ➤ Avg Value Loss: 0.00108
    ➤ Avg Validation Loss: 2.31632
    ➤ Avg Reward: 158.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb10/iql_model_perturb10_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:10:49<00:00, 424.96s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Actor Loss: 1.01935
    ➤ Avg Critic1 Loss: 0.14287
    ➤ Avg Critic2 Loss: 0.14277
    ➤ Avg Value Loss: 0.00115
    ➤ Avg Validation Loss: 2.31462
    ➤ Avg Reward: 160.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb10/iql_model_perturb10_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:11:24<00:00, 428.44s/it]


Finished Training on seaquest_expert_perturb10
    ➤ Avg Actor Loss: 1.03351
    ➤ Avg Critic1 Loss: 0.14192
    ➤ Avg Critic2 Loss: 0.14230
    ➤ Avg Value Loss: 0.00129
    ➤ Avg Validation Loss: 2.30956
    ➤ Avg Reward: 208.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb10/iql_model_perturb10_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb10/stats_perturb10.pkl
----- Execution time: IQL - Expert | Perturbation 10% -----
CPU times: total: 4h 12min 10s
Wall time: 3h 31min 3s
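
The per-seed results printed above can also be collected programmatically instead of being copied by hand. The sketch below parses the text output of one training cell and extracts the `Avg Reward` for each seed. `parse_rewards` is a hypothetical helper, and `sample_log` is an excerpt of the BC output for 10% perturbation shown above. The stats file written to disk is probably the more robust source, but its layout is not shown in this section.

```python
import re

SEED_RE = re.compile(r"-- Starting Seed (\d+)/(\d+) --")
REWARD_RE = re.compile(r"Avg Reward:\s*([-+]?\d+(?:\.\d+)?)")


def parse_rewards(log_text):
    """Map seed index -> Avg Reward from the printed training output."""
    rewards, current_seed = {}, None
    for line in log_text.splitlines():
        seed_match = SEED_RE.search(line)
        if seed_match:
            # Remember which seed the following summary lines belong to.
            current_seed = int(seed_match.group(1))
            continue
        reward_match = REWARD_RE.search(line)
        if reward_match and current_seed is not None:
            rewards[current_seed] = float(reward_match.group(1))
    return rewards


# Excerpt of the BC output for 10% perturbation (other lines omitted).
sample_log = """\
-- Starting Seed 1/3 --
    ➤ Avg Reward: 268.00
-- Starting Seed 2/3 --
    ➤ Avg Reward: 288.00
-- Starting Seed 3/3 --
    ➤ Avg Reward: 280.00
"""

bc_rewards_perturb10 = parse_rewards(sample_log)
print(bc_rewards_perturb10)
```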


----------------

<h3><u>20% Perturbation:</u></h3>

First, we will load and prepare the Expert dataset with 20% perturbation.

In [6]:
dataloaders = load_and_prepare_dataset(
    dataset_path='datasets/expert_logs/seaquest_expert_perturb20.pkl',
    batch_size=BATCH_SIZE,
    seed=SEED
)

=== Loading seaquest_expert_perturb20 dataset ===
Preprocessing and splitting seaquest_expert_perturb20 dataset...
Creating dataloaders for seaquest_expert_perturb20...
Dataloaders ready for: seaquest_expert_perturb20


<h4>Behavioral Cloning (BC)</h4>

In [7]:
%%time

# BC model -> Expert dataset, 20% perturbation
train_and_evaluate_BC(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BC - Expert | Perturbation 20% -----")

Training BC on seaquest_expert_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [58:14<00:00, 349.41s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Train Loss: 1.24102
    ➤ Avg Validation Loss: 3.06819
    ➤ Avg Reward: 270.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb20/bc_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [58:17<00:00, 349.79s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Train Loss: 1.24820
    ➤ Avg Validation Loss: 2.99306
    ➤ Avg Reward: 254.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb20/bc_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:03:53<00:00, 383.37s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Train Loss: 1.24026
    ➤ Avg Validation Loss: 3.04161
    ➤ Avg Reward: 274.00
Saving the model...
Model saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb20/bc_model_perturb20_seed3.pth
Return Stats saved to offline_rl_models/behavioral_cloning_bc/bc_logs/seaquest_expert/perturb20/stats_perturb20.pkl
----- Execution time: BC - Expert | Perturbation 20% -----
CPU times: total: 3h 45min 47s
Wall time: 3h 33s


<h4>Behavior Value Estimation (BVE)</h4>

In [7]:
%%time

# BVE model -> Expert dataset, 20% perturbation
train_and_evaluate_BVE(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: BVE - Expert | Perturbation 20% -----")

Training BVE on seaquest_expert_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:01:59<00:00, 371.93s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Train Loss: 0.14251
    ➤ Avg Val Loss: 0.14471
    ➤ Avg Reward: 146.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb20/bve_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:02:15<00:00, 373.53s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Train Loss: 0.14323
    ➤ Avg Val Loss: 0.14539
    ➤ Avg Reward: 100.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb20/bve_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:03:29<00:00, 380.91s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Train Loss: 0.14223
    ➤ Avg Val Loss: 0.14514
    ➤ Avg Reward: 138.00
Saving the model...
Saved model to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb20/bve_model_perturb20_seed3.pth
Saved stats to offline_rl_models/behavior_value_estimation_bve/bve_logs/seaquest_expert/perturb20/stats_perturb20.pkl
----- Execution time: BVE - Expert | Perturbation 20% -----
CPU times: total: 3h 49min 19s
Wall time: 3h 7min 53s


<h4>Implicit Q-Learning (IQL)</h4>

In [7]:
%%time

# IQL model -> Expert dataset, 20% perturbation
train_and_evaluate_IQL(
    dataloaders=dataloaders,
    device=device,
    seeds=SEEDS,
    epochs=EPOCHS,
    dataset='seaquest_expert_perturb20',
    env_id=ENV_ID,
    seed=SEED
)

print("----- Execution time: IQL - Expert | Perturbation 20% -----")

Training IQL on seaquest_expert_perturb20
-- Starting Seed 1/3 --


Epochs: 100%|██████████| 10/10 [1:19:33<00:00, 477.37s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Actor Loss: 1.14735
    ➤ Avg Critic1 Loss: 0.14402
    ➤ Avg Critic2 Loss: 0.14395
    ➤ Avg Value Loss: 0.00114
    ➤ Avg Validation Loss: 2.36784
    ➤ Avg Reward: 168.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb20/iql_model_perturb20_seed1.pth
-- Starting Seed 2/3 --


Epochs: 100%|██████████| 10/10 [1:11:14<00:00, 427.48s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Actor Loss: 1.14457
    ➤ Avg Critic1 Loss: 0.14401
    ➤ Avg Critic2 Loss: 0.14391
    ➤ Avg Value Loss: 0.00118
    ➤ Avg Validation Loss: 2.36264
    ➤ Avg Reward: 190.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb20/iql_model_perturb20_seed2.pth
-- Starting Seed 3/3 --


Epochs: 100%|██████████| 10/10 [1:12:05<00:00, 432.58s/it]


Finished Training on seaquest_expert_perturb20
    ➤ Avg Actor Loss: 1.15624
    ➤ Avg Critic1 Loss: 0.14297
    ➤ Avg Critic2 Loss: 0.14330
    ➤ Avg Value Loss: 0.00133
    ➤ Avg Validation Loss: 2.36016
    ➤ Avg Reward: 170.00
Saving the model...
Model saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb20/iql_model_perturb20_seed3.pth
Return Stats saved to offline_rl_models/implicit_q_learning_iql/iql_logs/seaquest_expert/perturb20/stats_perturb20.pkl
----- Execution time: IQL - Expert | Perturbation 20% -----
CPU times: total: 4h 26min 34s
Wall time: 3h 43min 6s
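
The wall times reported by `%%time` give a rough picture of the relative training cost of the three methods. The sketch below converts the wall-time strings for the 20% perturbation runs into seconds and expresses each as a ratio to the BC run. `wall_time_seconds` is a hypothetical helper that only handles the `h`, `min`, and `s` units that appear in this section, and the strings are copied from the output above.

```python
import re

UNIT_SECONDS = {"h": 3600, "min": 60, "s": 1}


def wall_time_seconds(text):
    """Convert a wall-time string such as '3h 7min 53s' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(h|min|s)\b", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


# Wall times copied from the 20% perturbation cells above.
wall_times_perturb20 = {
    "BC": "3h 33s",
    "BVE": "3h 7min 53s",
    "IQL": "3h 43min 6s",
}

seconds = {algo: wall_time_seconds(t) for algo, t in wall_times_perturb20.items()}
for algo, secs in seconds.items():
    print(f"{algo}: {secs} s ({secs / seconds['BC']:.2f}x BC)")
```

These timings include evaluation and model saving and come from a single machine, so they indicate relative cost only.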
