<h1><center> Homework #3</center></h1>

Task:

- implement the A2C (Advantage Actor-Critic) algorithm;
- train an agent in the Car Racing environment.

Task description on the Gymnasium website ([link](https://gymnasium.farama.org/environments/box2d/lunar_lander/))

## Imports

In [1]:
!nvidia-smi

Sat Dec  6 11:00:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   54C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import sys

sys.path.append("..")

In [3]:
import gymnasium as gym
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import torch

from tqdm import trange
from torch import nn
from torch.nn import functional as F

# Commented out because the stable_baselines3 code below is commented out as well
# from stable_baselines3 import A2C
# from stable_baselines3.common.callbacks import BaseCallback
# from stable_baselines3.common.env_util import make_vec_env
# from stable_baselines3.common.vec_env import (
#     DummyVecEnv,
#     VecMonitor,
#     VecFrameStack,
#     VecTransposeImage,
# )

In [4]:
# %load_ext autoreload
# %autoreload 2

from src.torch_utils import get_device, preprocess_state, transform_state_to_tensor
from src.actor_critic import (
    ActorNet,
    ValueNet,
    compute_returns,
)

In [5]:
# %load_ext tensorboard

## Environment

**Observation space:**

A top-down 96x96 RGB image of the car and race track.

**Actions:**

- 0: steering, -1 is full left, +1 is full right
- 1: gas
- 2: braking

The three numbers (in order) are:

1. Steering
   - Range: [-1.0, 1.0]
   - Negative values: turn left
   - Positive values: turn right
2. Acceleration (Gas)
   - Range: [0.0, 1.0]
   - 0 = no acceleration
   - 1 = full acceleration
3. Brake
   - Range: [0.0, 1.0]
   - 0 = no braking
   - 1 = full braking

---

Example Actions:

- [0.0, 0.5, 0.0] → Go straight, accelerate at 50% power, no brake.
- [-0.8, 0.1, 0.0] → Sharp left turn, low acceleration.
- [0.3, 0.0, 0.7] → Gentle right turn, no gas, brake at 70%.
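
The ranges above can be enforced before an action is sent to the environment. The snippet below is a minimal sketch in plain Python; `clip_action` is a hypothetical helper introduced only for illustration and is not part of this notebook or of Gymnasium.

```python
def clip_action(action):
    """Clamp a raw [steering, gas, brake] triple to the ranges listed above."""
    # (low, high) bounds for steering, gas and brake, in that order
    bounds = [(-1.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
    return [min(max(value, low), high) for value, (low, high) in zip(action, bounds)]


# Steering below -1 and brake above 1 are pulled back into range
print(clip_action([-1.7, 0.4, 1.3]))  # [-1.0, 0.4, 1.0]
```

Clamping is only one option; a policy network could instead squash its outputs so that they always land in these ranges.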

**Rewards:**

The reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles visited in the track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.
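
The arithmetic in this example can be reproduced with a few lines of Python. This is a sketch under the assumption that N is the number of tiles on the track and that every visited tile contributes 1000/N; `episode_score` and its parameters are hypothetical names, not part of the environment API.

```python
def episode_score(tiles_visited, total_tiles, frames):
    """Total reward: +1000/N per visited tile and -0.1 per frame."""
    tile_reward = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_reward - frame_penalty


# All tiles visited in 732 frames, as in the example above
# (300 is an arbitrary track size; it cancels out when every tile is visited)
print(f"{episode_score(tiles_visited=300, total_tiles=300, frames=732):.1f}")  # 926.8
```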

In [6]:
# %pip install swig

In [7]:
# %pip install Box2D

---
## stable_baselines3 version

We use CnnPolicy because image observations need a convolutional neural network.

Training does progress, but it requires a very large number of steps and does not reach acceptable performance, so the code is commented out.

In [8]:
# env = make_vec_env(
#     env_id="CarRacing-v3",
#     n_envs=4,
#     env_kwargs={"continuous": True, 'max_episode_steps': 1_000},
#     vec_env_cls=DummyVecEnv
# )

In [9]:
# env = VecMonitor(env)
# env = VecFrameStack(env, n_stack=4)
# env = VecTransposeImage(env)

In [10]:
# model = A2C(
#     policy="CnnPolicy",
#     n_steps=512,
#     gamma=0.99,
#     learning_rate=3e-4,
#     max_grad_norm=0.5,
#     use_rms_prop=True,
#     vf_coef=0.25,
#     ent_coef=0.01,
#     gae_lambda=0.95,
#     normalize_advantage=True,
#     tensorboard_log=None,
#     env=env,
#     verbose=1
# )

In [11]:
# model.learn(
#     total_timesteps=250_000,
#     progress_bar=True
# )

In [12]:
# %tensorboard --logdir ./a2c_carracing_tensorboard/

In [13]:
# model.save("car_racing_baseline")

In [14]:
# model = A2C.load("car_racing_baseline")

In [15]:
# def create_agent_env():
#     def _init():
#         env = gym.make(
#             "CarRacing-v3",
#             continuous=True,
#             domain_randomize=False,
#             lap_complete_percent=0.95,
#             max_episode_steps=5_000,
#             render_mode="rgb_array",
#         )
#         return env
#     env = DummyVecEnv([_init])
#     env = VecFrameStack(env, n_stack=4)
#     env = VecTransposeImage(env)
#     return env

In [16]:
# agent_env = create_agent_env()
# render_env = gym.make(
#     "CarRacing-v3",
#     continuous=True,
#     domain_randomize=False,
#     lap_complete_percent=0.95,
#     max_episode_steps=5_000,
#     render_mode="human",
# )

In [17]:
# obs_agent = agent_env.reset()
# obs_render, _ = render_env.reset()

# done = False
# score = 0

# while not done:
#     action, _ = model.predict(obs_agent, deterministic=True)
#     obs_agent, reward, done, _ = agent_env.step(action)
#     _, _, _,_, _ = render_env.step(action[0])
#     score += reward[0]

In [18]:
# agent_env.close()
# render_env.close()

---
## Custom implementation

## Implementation:
1. Randomly initialize the policy (actor) network $\pi^{\mu}(a|s)|_{\theta^{\mu}}$ and the V-function (critic) network $V^{\theta}(s)|_{\theta^{V}}$ with weights $\theta^{\mu}$ and $\theta^V$, as well as the target networks $V'$ and $\pi'$: $\theta^{V'} \gets \theta^V$ and $\theta^{\mu'} \gets \theta^{\mu}$
2. Set the number of training episodes $M$ and, for each episode, do the following:
3. Roll out a trajectory until a terminal state is reached.
    - In state $s_t$, act according to the current policy and choose the action $a_t = \pi^{\mu}(s_t)|_{\theta^{\mu}}$
    - Execute action $a_t$, move to state $s_{t+1}$ and receive reward $r_t$
    - In state $s_{t+1}$, again following the current policy, choose the action $a_{t+1} = \pi^{\mu}(s_{t+1})|_{\theta^{\mu}}$
    - Compute $Loss(\theta^V)=\big( r_t + \gamma V^{\theta}(s_{t+1}) - V^{\theta}(s_t) \big)^2$
    - Compute $Loss(\theta^{\mu}) = \ln{\pi^{\mu}(a_t|s_t)}(r_t + \gamma V^{\theta}(s_{t+1}) - V^{\theta}(s_t))$
    - Update the weights: <br>
    __Note!__ For the V-function we ___minimize___ $Loss(\theta^V)$, whereas for the policy we ___maximize___ $Loss(\theta^{\mu})$! <br>
      $\quad \quad \theta^V \gets \theta^V - \alpha \nabla_{\theta^V}Loss(\theta^V)$, <br>
      $\quad \quad \theta^{\mu} \gets \theta^{\mu} + \beta \nabla_{\theta^{\mu}}Loss(\theta^{\mu})$
    - Update the target networks: <br>
    $\quad \quad \theta^{V'} \gets \tau \theta^V + (1 - \tau) \theta^{V'}$, <br>
    $\quad \quad \theta^{\mu'} \gets \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$
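
The quantities in the steps above can be checked on scalar numbers before moving to neural networks. The snippet below is a minimal sketch in plain Python, not the code used for training in this notebook. All function names are hypothetical, and the `done` flag (no bootstrapping from a terminal state) is an assumption that the list above does not spell out.

```python
import math


def td_error(reward, value, next_value, gamma, done):
    """r_t + gamma * V(s_{t+1}) - V(s_t); the bootstrap term is dropped at a terminal state."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


def critic_loss(delta):
    """Squared TD error, minimized with respect to the critic weights."""
    return delta ** 2


def actor_objective(log_prob, delta):
    """ln pi(a_t|s_t) times the TD error, maximized with respect to the actor weights."""
    return log_prob * delta


def soft_update(target, online, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * w + (1.0 - tau) * w_target for w, w_target in zip(online, target)]


delta = td_error(reward=1.0, value=0.5, next_value=2.0, gamma=0.5, done=False)
print(delta)                                         # 1.5
print(critic_loss(delta))                            # 2.25
print(actor_objective(math.log(0.5), delta) < 0.0)   # True
print(soft_update(target=[0.0, 1.0], online=[1.0, 3.0], tau=0.25))  # [0.25, 1.5]
```

In an autograd framework the actor objective is usually negated so that a standard minimizer can be used, which is what the sign in `policy_loss` in the training cell does.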

In [19]:
env = gym.make(
    "CarRacing-v3",
    continuous=True,
    domain_randomize=False,
    lap_complete_percent=0.95,
    max_episode_steps=1_000,
    # render_mode="human",  # Uncomment to watch the game
)

Stack several consecutive frames so that the agent sees multiple images at once

In [20]:
env = gym.wrappers.FrameStackObservation(env, 4)

In [21]:
# Example state
print(env.observation_space)

Box(0, 255, (4, 96, 96, 3), uint8)
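
The leading dimension of 4 in the shape above is the number of stacked frames. The idea behind frame stacking can be sketched with a fixed-length buffer. This is an illustration in plain Python, not the Gymnasium wrapper itself: the `FrameStack` class is hypothetical, and padding with the first frame on reset is an assumption that need not match the wrapper's behaviour.

```python
from collections import deque


class FrameStack:
    """Keep the most recent `num_frames` observations, oldest first."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: pad the buffer with copies of the first frame
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # The deque drops the oldest frame automatically once it is full
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(num_frames=4)
print(stack.reset("f0"))  # ['f0', 'f0', 'f0', 'f0']
print(stack.push("f1"))   # ['f0', 'f0', 'f0', 'f1']
print(stack.push("f2"))   # ['f0', 'f0', 'f1', 'f2']
```

Strings stand in for images here; with real observations each element would be a 96x96x3 array, which gives the stacked shape printed above.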


In [22]:
# Example action
print(env.action_space.sample())

[-0.82968724  0.45596352  0.7342646 ]


In [23]:
device = get_device()

Device in use: cuda


In [24]:
# Main RL parameters
gamma = torch.tensor(0.99).to(device)  # discount factor
num_episodes = 250

# Main DL parameters
lr = 1e-4
batch_size = 128
max_grad_norm = 0.5

In [25]:
actor_model = ActorNet().to(device)
value_model = ValueNet().to(device)

In [26]:
# state = transform_state_to_tensor(state)

In [27]:
# actor_model.get_action_and_log_prob(state)

In [28]:
opt_actor = torch.optim.AdamW(actor_model.parameters(), lr=lr, fused=True)
opt_value = torch.optim.AdamW(value_model.parameters(), lr=lr, fused=True)

In [None]:
reward_records = []

for episode in trange(num_episodes):

    done = False
    visited_states = []
    actions = []
    rewards = []
    state, _ = env.reset()

    # Play one episode (collect trajectory)
    while not done:

        state = preprocess_state(state)
        state = transform_state_to_tensor(state, device=device)
        visited_states.append(state)

        with torch.no_grad():
            action, _ = actor_model.get_action_and_log_prob(state)

        state, reward, terminated, truncated, _ = env.step(
            action.cpu().numpy().flatten()
        )

        done = terminated or truncated
        actions.append(action)
        rewards.append(reward)

    # Prepare data
    rewards_tensor = torch.tensor(rewards, dtype=torch.float32, device=device)

    # Train value model
    opt_value.zero_grad()

    values = value_model(torch.cat(visited_states))
    values = values.squeeze()

    returns = compute_returns(
        rewards=rewards_tensor, values=values, gamma=gamma, device=device
    )
    advantages = (returns - values).detach()

    # Treat the returns as fixed regression targets: no gradient flows through them
    value_model_loss = F.mse_loss(values, returns.detach())
    value_model_loss.backward()

    nn.utils.clip_grad_norm_(value_model.parameters(), max_grad_norm)
    opt_value.step()

    # Train actor model
    opt_actor.zero_grad()

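    # NOTE: this call presumably samples fresh actions, so these log-probabilities
    # may not correspond to the actions stored in `actions` during the rollout.
    _, log_probs_tensor = actor_model.get_action_and_log_prob(torch.cat(visited_states))
    # log_probs_tensor = torch.cat(log_probs)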

    policy_loss = -log_probs_tensor * advantages
    policy_loss.sum().backward()

    nn.utils.clip_grad_norm_(actor_model.parameters(), max_grad_norm)
    opt_actor.step()

    reward_records.append(sum(rewards))

    if episode % 5 == 0:
        print(f"Mean for last 5 episodes is {np.mean(reward_records[-5:])}")
        print(f"Mean for last 50 episodes is {np.mean(reward_records[-50:])}")

    # Stop if mean reward for 100 episodes > 475.0
    # Naive early stopping
    if np.average(reward_records[-100:]) > 475.0:
        break

print(f"\nDone in {episode+1} episodes")
env.close()

  0%|          | 1/250 [00:11<48:44, 11.75s/it]

Mean for last 5 episodes is -23.357664233576653
Mean for last 50 episodes is -23.357664233576653


  2%|▏         | 6/250 [01:12<49:04, 12.07s/it]

Mean for last 5 episodes is -30.43669613159731
Mean for last 50 episodes is -29.256857481927202


  4%|▍         | 11/250 [02:12<47:25, 11.91s/it]

Mean for last 5 episodes is -33.981671330021754
Mean for last 50 episodes is -31.404500140151995


  5%|▌         | 13/250 [02:36<46:47, 11.85s/it]

In [30]:
returns = []

# Work backwards through the episode.
# NOTE: this computes only r_t + gamma * r_{t+1}, a two-reward sum,
# not the full discounted return.
for t in reversed(range(len(rewards_tensor))):
    if t == len(rewards_tensor) - 1:
        # Last step of the episode: there is no next reward
        returns_t = rewards_tensor[t]
    else:
        returns_t = rewards_tensor[t] + gamma * rewards_tensor[t + 1]
    returns.insert(0, returns_t)

tensor([ 6.7214, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990,  3.2266,  3.2612, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,  3.2266,  3.2612, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990, -0.1990,
        -0.1990, -0.1990, -0.1990, -0.19
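
Most entries in the output above equal -0.1 + 0.99 * (-0.1) = -0.199, which is the sum of two consecutive frame penalties rather than a return over the rest of the episode. For comparison, a full discounted return can be accumulated backwards with a running value. The snippet below is a minimal sketch in plain Python; `discounted_returns` is a hypothetical helper and is not necessarily what `compute_returns` from `src.actor_critic` does.

```python
def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """G_t = r_t + gamma * G_{t+1}, starting from `bootstrap_value` after the last step."""
    returns = []
    running = bootstrap_value  # 0.0 when the episode ends in a terminal state
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


print(discounted_returns([1.0, 2.0, 4.0], gamma=0.5))  # [3.0, 4.0, 4.0]
```

Appending and reversing once also avoids the repeated `insert(0, ...)` calls, each of which shifts the whole list.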

## Training graphs

In [None]:
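# `reward_records` holds the total reward of each training episode.
# NOTE: the training loop does not record episode lengths, so the "steps"
# column stays NaN until they are logged.
table = pd.DataFrame({"steps": np.nan, "reward": reward_records})
# table = table.iloc[2_000:, :]  # remove exploratory_period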

In [None]:
plt.plot(table.index, table["reward"].rolling(100).mean())
plt.xlabel("Training episode")
plt.ylabel("Reward")
plt.title("Average reward per 100 episodes")
plt.show()

In [None]:
plt.plot(table.index, table["steps"].rolling(100).mean())
plt.xlabel("Training episode")
plt.ylabel("steps")
plt.title("Average steps per 100 episodes")
plt.show()
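
A rolling mean with a window of 100 has no value until 100 episodes are available, so a short training run produces an empty curve. The sketch below reproduces that behaviour in plain Python with a small window; `rolling_mean` is a hypothetical helper, and it returns `None` where pandas would return `NaN`.

```python
def rolling_mean(values, window):
    """Mean of the last `window` values; None until a full window is available."""
    result = []
    running_sum = 0.0
    for index, value in enumerate(values):
        running_sum += value
        if index >= window:
            # Drop the value that just left the window
            running_sum -= values[index - window]
        result.append(running_sum / window if index >= window - 1 else None)
    return result


print(rolling_mean([1.0, 2.0, 3.0, 4.0], window=2))  # [None, 1.5, 2.5, 3.5]
```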

## Animation

In [None]:
env = gym.make(
    "CarRacing-v3",
    continuous=True,
    domain_randomize=False,
    lap_complete_percent=0.95,
    max_episode_steps=5_000,
    render_mode="human",  # Show the game in a window
)

In [None]:
device = get_device()

done = False
score = 0
state, _ = env.reset()
state = torch.tensor(state, dtype=torch.float32, device=device)
# n_actions = env.action_space.n
# n_observations = len(state)

while not done:
    env.render()  # Redundant with render_mode="human", which already draws every step
    # with torch.no_grad():
    #     # best action
    #     action = policy_net_2(state).argmax()
    action = env.action_space.sample()
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    state = next_state
    state = torch.tensor(state, dtype=torch.float32, device=device)
    score += reward

env.close()

print(f"Score is: {score}")

In [None]:
torch.ones(1)

In [None]:
env.close()