## Open notebook in:
| Colab | Gradient |
|:------|:---------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/transformers-the-definitive-guide/blob/master/CH03/ch03_decision_transformer.ipynb) | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com//github.com/Nicolepcx/transformers-the-definitive-guide/blob/main/CH03/ch03_decision_transformer.ipynb) |

# About this notebook

The code in this notebook is adapted from the [original code of the paper](https://github.com/kzl/decision-transformer/blob/master/gym/decision_transformer/models/decision_transformer.py) as well as from the [example script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/decision_transformer). The trained Hugging Face models were contributed by [Edward Beeching](https://huggingface.co/edbeeching/decision-transformer-gym-halfcheetah-medium). The mean and standard deviation coefficients required for state normalization are taken from the model card of the respective model.




# Installing dependencies

Note: these system packages are required to run the environment. Do not change the list, otherwise the environment will break.

In [None]:
!apt-get install -y \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common \
    patchelf \
    xvfb


## Installing Pip packages
We also require the following pip packages:

In [None]:
!pip -q install gym==0.23.0 \
                free-mujoco-py==2.1.6 \
                transformers==4.38.2 \
                colabgymrender==1.0.2 \
                xvfbwrapper==0.2.9 \
                imageio==2.31.6 \
                mujoco==3.1.5
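
Because the notebook depends on these exact pins, it can help to confirm what is actually installed before going further. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of the original notebook, and the version numbers are copied from the pip command above.

```python
from importlib import metadata

# Pins copied from the pip command above.
PINNED = {
    "gym": "0.23.0",
    "free-mujoco-py": "2.1.6",
    "transformers": "4.38.2",
    "colabgymrender": "1.0.2",
    "xvfbwrapper": "0.2.9",
    "imageio": "2.31.6",
    "mujoco": "3.1.5",
}


def check_pins(pins):
    """Return {package: (expected, installed)} for every package that does not match its pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Report every package whose installed version differs from the pin.
for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If the loop prints nothing, every pinned package is present at the expected version.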


## Importing the packages


In [None]:
import torch
import mujoco_py
import gym
import numpy as np

from colabgymrender.recorder import Recorder
from transformers import DecisionTransformerModel

import warnings
warnings.filterwarnings("ignore")



## Defining a function that performs masked autoregressive prediction of actions

The model's prediction is conditioned on sequences of states, actions, time-steps, and returns-to-go. The action for the current time-step is not yet known, so it is included as zeros and masked in (its attention-mask entry is 1) so as not to skew the model's attention distribution; only the left-padding positions are masked out.

In [None]:
def get_action(model, states, actions, rewards, returns_to_go, timesteps):
    # We don't care about the past rewards in this model
    max_length = model.config.max_length
    state_dim = model.config.state_dim
    action_dim = model.config.act_dim

    states = states.reshape(1, -1, state_dim)[:, -max_length:]
    actions = actions.reshape(1, -1, action_dim)[:, -max_length:]
    returns_to_go = returns_to_go.reshape(1, -1, 1)[:, -max_length:]
    timesteps = timesteps.reshape(1, -1)[:, -max_length:]
    padding = max_length - states.shape[1]

    # Pad all tokens to sequence length
    attention_mask = torch.cat([torch.zeros(padding), torch.ones(states.shape[1])], dim=0).to(dtype=torch.long).reshape(1, -1)
    pad_tensor = lambda x, dim: torch.cat([torch.zeros((1, padding, dim)), x], dim=1).float()

    states = pad_tensor(states, state_dim)
    actions = pad_tensor(actions, action_dim)
    returns_to_go = pad_tensor(returns_to_go, 1)
    timesteps = torch.cat([torch.zeros((1, padding), dtype=torch.long), timesteps], dim=1)

    state_preds, action_preds, return_preds = model(
        states=states,
        actions=actions,
        rewards=rewards,
        returns_to_go=returns_to_go,
        timesteps=timesteps,
        attention_mask=attention_mask,
        return_dict=False,
    )

    return action_preds[0, -1]
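
The left-padding and attention-mask logic inside `get_action` is easier to see on a tiny input. The sketch below reproduces just that step with NumPy instead of PyTorch; the helper name `left_pad` and the toy shapes are assumptions made for illustration and do not appear in the original code.

```python
import numpy as np


def left_pad(seq, max_length):
    """Keep the last `max_length` steps of a (T, dim) array, left-pad with zeros, and build the mask."""
    seq = seq[-max_length:]                      # keep only the most recent steps
    padding = max_length - seq.shape[0]          # number of zero rows to prepend
    padded = np.concatenate([np.zeros((padding, seq.shape[1])), seq], axis=0)
    # 0 marks padding positions, 1 marks real time-steps
    mask = np.concatenate([np.zeros(padding), np.ones(seq.shape[0])]).astype(np.int64)
    return padded, mask


states = np.arange(6, dtype=np.float64).reshape(3, 2)  # 3 time-steps, state_dim = 2
padded, mask = left_pad(states, max_length=5)

print(padded.shape)   # (5, 2)
print(mask.tolist())  # [0, 0, 1, 1, 1]
```

Padding on the left keeps the most recent time-step in the last position, which is why `get_action` can read the prediction from index `-1`.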


In [None]:
def interact_with_environment(env, model, state_mean, state_std, max_ep_len, target_return_value, state_dim, act_dim, scale, device):
    episode_return, episode_length = 0, 0
    state = env.reset()
    target_return = torch.tensor([target_return_value], device=device, dtype=torch.float32).reshape(1, 1)
    states = torch.from_numpy(state).reshape(1, state_dim).to(device=device, dtype=torch.float32)
    actions = torch.zeros((0, act_dim), device=device, dtype=torch.float32)
    rewards = torch.zeros(0, device=device, dtype=torch.float32)
    timesteps = torch.tensor([[0]], device=device, dtype=torch.long)

    for t in range(max_ep_len):
        actions = torch.cat([actions, torch.zeros((1, act_dim), device=device)], dim=0)
        rewards = torch.cat([rewards, torch.zeros(1, device=device)], dim=0)

        normalized_states = (states - state_mean) / state_std
        action = get_action(model, normalized_states, actions, rewards, target_return, timesteps)
        actions[-1] = action

        state, reward, done, _ = env.step(action.detach().cpu().numpy())
        cur_state = torch.from_numpy(state).reshape(1, state_dim).to(device)
        states = torch.cat([states, cur_state.to(device)], dim=0)
        rewards[-1] = reward

        pred_return = target_return[0, -1] - (reward / scale)
        target_return = torch.cat([target_return, pred_return.reshape(1, 1).to(device)], dim=1)
        timesteps = torch.cat([timesteps, torch.tensor([[t + 1]], device=device, dtype=torch.long)], dim=1)

        episode_return += reward
        episode_length += 1

        if done:
            break

    return episode_return, episode_length
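
The loop above conditions the model on a return-to-go that shrinks after every step by the scaled reward just received. As a rough illustration of that bookkeeping only, here is a plain-Python sketch; the function name `returns_to_go` and the reward values are made up for the example.

```python
def returns_to_go(target_return, rewards, scale):
    """Return the scaled return-to-go sequence: start at the target, subtract each scaled reward."""
    rtg = [target_return / scale]
    for reward in rewards:
        rtg.append(rtg[-1] - reward / scale)
    return rtg


# Made-up rewards for three environment steps
rtg = returns_to_go(target_return=12000, rewards=[5.0, 4.0, 6.0], scale=1000.0)
print([round(x, 3) for x in rtg])  # [12.0, 11.995, 11.991, 11.985]
```

The sequence has one more entry than there are rewards, matching how `target_return` in the loop always holds a value for the step about to be predicted.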



In [None]:
def setup_environment(env_name):
    if env_name == "HalfCheetah-v3":
        state_mean = np.array(
            [-0.06845774, 0.01641455, -0.18354906, -0.27624607, -0.34061527, -0.09339716,
             -0.21321271, -0.08774239, 5.1730075, -0.04275195, -0.03610836, 0.14053793,
             0.06049833, 0.09550975, 0.067391, 0.00562739, 0.01338279]
        )
        state_std = np.array(
            [0.07472999, 0.30234998, 0.3020731, 0.34417078, 0.17619242, 0.5072056, 0.25670078,
             0.32948127, 1.2574149, 0.7600542, 1.9800916, 6.5653625, 7.4663677, 4.472223, 10.566964,
             5.6719327, 7.498259]
        )
        model_name = "edbeeching/decision-transformer-gym-halfcheetah-medium"
    elif env_name == "Walker2d-v3":
        state_mean = np.array(
            [1.218966, 0.14163373, -0.03704914, -0.1381431, 0.51382244, -0.0471911, -0.47288352, 0.04225416, 2.3948874,
             -0.03143199, 0.04466356, -0.02390724, -0.10134014, 0.09090938, -0.00419264, -0.12120572, -0.5497064]
        )
        state_std = np.array(
            [0.12311358, 0.324188, 0.11456084, 0.26230657, 0.5640279, 0.22718786, 0.38373196, 0.7373677, 1.2387927,
             0.7980206, 1.5664079, 1.8092705, 3.0256042, 4.062486, 1.4586568, 3.744569, 5.585129]
        )
        model_name = "edbeeching/decision-transformer-gym-walker2d-medium"
    else:
        raise ValueError("Unsupported environment")

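    # Convert the normalization statistics to tensors and load the pretrained model on the CPU
    state_mean = torch.from_numpy(state_mean).to(device='cpu')
    state_std = torch.from_numpy(state_std).to(device='cpu')
    model = DecisionTransformerModel.from_pretrained(model_name).to(device='cpu')

    # NOTE: `env` is not a parameter; it is read from the notebook's global scope,
    # so it must be created with gym.make(...) before this function is called.
    state_dim = env.observation_space.shape[0]
    act_dim = env.action_space.shape[0]

    return env, model, state_mean, state_std, state_dim, act_dim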
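
The `state_mean` and `state_std` arrays are used to standardize each raw observation before it is passed to the model, as `(state - mean) / std` applied per dimension. The sketch below shows that operation on a three-dimensional toy state; the numbers are invented for illustration, whereas the arrays in `setup_environment` have 17 entries each.

```python
import numpy as np

# Illustrative statistics, NOT the values from the model cards
state_mean = np.array([1.0, -2.0, 0.5])
state_std = np.array([0.5, 2.0, 0.25])


def normalize(state, mean, std):
    """Standardize a raw observation element-wise."""
    return (state - mean) / std


raw_state = np.array([1.5, -2.0, 0.0])
normalized = normalize(raw_state, state_mean, state_std)
print(normalized.tolist())  # [1.0, 0.0, -2.0]
```

The statistics must be the ones the pretrained model was trained with, which is why the notebook hard-codes them per environment.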

In [None]:
device = "cpu"
max_ep_len = 1000
scale = 1000.0  # normalization for rewards/returns
TARGET_RETURN = 12000 / scale  # evaluation is conditioned on a return of 12000, scaled accordingly
directory = './video'

In [None]:
env_options = {
    "1": "HalfCheetah-v3",
    "2": "Walker2d-v3",
}

In [None]:
# Get user input for environment choice (the key from env_options, not the name itself)
user_input = input("Enter the environment number (1 for 'HalfCheetah-v3' or 2 for 'Walker2d-v3'): ")


Enter the environment number (1 for 'HalfCheetah-v3' or 2 for 'Walker2d-v3'): 1


In [None]:
# Validate input and get the corresponding environment name
if user_input in env_options:
    env_name = env_options[user_input]
else:
    # Stop here: env_name would otherwise be undefined in the cells below
    raise ValueError("Invalid input. Please enter 1 or 2.")


In [None]:
env = gym.make(env_name)
env = Recorder(env, directory, fps=30)


env, model, state_mean, state_std, state_dim, act_dim = setup_environment(env_name)


In [None]:
# Main script
episode_return, episode_length = interact_with_environment(env, model, state_mean, state_std, max_ep_len, TARGET_RETURN, state_dim, act_dim, scale, device)

# Play the environment after interaction
env.play()





In [None]:
import numpy as np

# Example input embeddings (batch_size, sequence_length, embedding_dim)
input_embeddings = np.random.rand(2, 5, 8)  # 2 sequences, each of length 5, embedding dimension 8

# Example positional embeddings (batch_size, sequence_length, embedding_dim)
positional_embeddings = np.random.rand(2, 5, 8)  # Same dimensions as input embeddings

# Adding positional embeddings
added_embeddings = input_embeddings + positional_embeddings

# Concatenating positional embeddings
concatenated_embeddings = np.concatenate((input_embeddings, positional_embeddings), axis=-1)


print("\nShape of Input Embeddings:", input_embeddings.shape)
print("Shape of Positional Embeddings:", positional_embeddings.shape)
print("Shape of Added Embeddings:", added_embeddings.shape)
print("Shape of Concatenated Embeddings:", concatenated_embeddings.shape)



Shape of Input Embeddings: (2, 5, 8)
Shape of Positional Embeddings: (2, 5, 8)
Shape of Added Embeddings: (2, 5, 8)
Shape of Concatenated Embeddings: (2, 5, 16)
