<h1><center> Homework #3</center></h1>

Task:

- implement the A2C (Advantage Actor-Critic) algorithm;
- train an agent in the Car Racing environment.

The task is described on the Gymnasium website ([link](https://gymnasium.farama.org/environments/box2d/car_racing/))

## Imports

In [1]:
1 + 1

2

In [2]:
# !nvidia-smi

In [3]:
import sys

sys.path.append("..")

In [4]:
import gymnasium as gym
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import torch

from tqdm import trange
from torch import nn
from torch.nn import functional as F

# Disabled because the stable_baselines3 cells below are commented out
# from stable_baselines3 import A2C
# from stable_baselines3.common.callbacks import BaseCallback
# from stable_baselines3.common.env_util import make_vec_env
# from stable_baselines3.common.vec_env import (
#     DummyVecEnv,
#     VecMonitor,
#     VecFrameStack,
#     VecTransposeImage,
# )

In [5]:
# %load_ext autoreload
# %autoreload 2

from src.torch_utils import get_device, preprocess_state, transform_state_to_tensor
from src.actor_critic import (
    ActorNet,
    ValueNet,
    compute_returns,
)

In [6]:
# %load_ext tensorboard

## Environment

**Observation space:**

A top-down 96x96 RGB image of the car and race track.

**Actions:**

- 0: steering, -1 is full left, +1 is full right
- 1: gas
- 2: braking

The three numbers (in order) are:

1. Steering
   - Range: [-1.0, 1.0]
   - Negative values: turn left
   - Positive values: turn right
2. Acceleration (Gas)
   - Range: [0.0, 1.0]
   - 0 = no acceleration
   - 1 = full acceleration
3. Brake
   - Range: [0.0, 1.0]
   - 0 = no braking
   - 1 = full braking

---

Example Actions:

- [0.0, 0.5, 0.0] → Go straight, accelerate at 50% power, no brake.
- [-0.8, 0.1, 0.0] → Sharp left turn, low acceleration.
- [0.3, 0.0, 0.7] → Gentle right turn, no gas, brake at 70%.
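
A policy network usually produces unbounded numbers, while the environment expects the ranges listed above. The sketch below shows one possible way to squash three raw outputs into a valid action, assuming tanh for steering and a logistic sigmoid for gas and brake. The helper `squash_action` is hypothetical and is not necessarily what `ActorNet.get_actions` in `src.actor_critic` does.

```python
import math


def squash_action(raw):
    """Map three unbounded numbers to CarRacing's continuous action ranges.

    Hypothetical helper: tanh for steering, logistic sigmoid for gas and brake.
    """
    steer_raw, gas_raw, brake_raw = raw
    steering = math.tanh(steer_raw)              # lies in [-1, 1]
    gas = 1.0 / (1.0 + math.exp(-gas_raw))       # lies in [0, 1]
    brake = 1.0 / (1.0 + math.exp(-brake_raw))   # lies in [0, 1]
    return [steering, gas, brake]


# Zero steering, 50% gas, and a small brake value.
action = squash_action([0.0, 0.0, -2.0])
print(action)
```

With this mapping, a raw output of 0 means "go straight" for steering but "half throttle" for gas, which is one reason the choice of squashing function affects early exploration.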

**Rewards:**

The reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you finish in 732 frames after visiting every tile, your reward is 1000 - 0.1*732 = 926.8 points.
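
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: `episode_reward` is a hypothetical helper, and the tile count of 300 is an arbitrary illustrative value (the total depends on the generated track).

```python
def episode_reward(frames, tiles_visited, total_tiles):
    """Total reward: +1000/N per visited tile, -0.1 per frame."""
    tile_bonus = 1000.0 / total_tiles * tiles_visited
    frame_penalty = 0.1 * frames
    return tile_bonus - frame_penalty


# Full lap (every tile visited) finished in 732 frames.
full_lap = episode_reward(frames=732, tiles_visited=300, total_tiles=300)
print(round(full_lap, 1))  # 926.8

# An agent that never reaches a tile during a 1000-frame episode.
idle = episode_reward(frames=1000, tiles_visited=0, total_tiles=300)
print(round(idle, 1))  # -100.0
```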

In [7]:
# %pip install swig

In [8]:
# %pip install Box2D

---
## stable_baselines3 version

We use CnnPolicy because a convolutional neural network is needed to process the image observations.

Training runs, but it needs a very large number of steps and does not reach acceptable performance, so the code is commented out.

In [9]:
# env = make_vec_env(
#     env_id="CarRacing-v3",
#     n_envs=4,
#     env_kwargs={"continuous": True, 'max_episode_steps': 1_000},
#     vec_env_cls=DummyVecEnv
# )

In [10]:
# env = VecMonitor(env)
# env = VecFrameStack(env, n_stack=4)
# env = VecTransposeImage(env)

In [11]:
# model = A2C(
#     policy="CnnPolicy",
#     n_steps=512,
#     gamma=0.99,
#     learning_rate=3e-4,
#     max_grad_norm=0.5,
#     use_rms_prop=True,
#     vf_coef=0.25,
#     ent_coef=0.01,
#     gae_lambda=0.95,
#     normalize_advantage=True,
#     tensorboard_log=None,
#     env=env,
#     verbose=1
# )

In [12]:
# model.learn(
#     total_timesteps=250_000,
#     progress_bar=True
# )

In [13]:
# %tensorboard --logdir ./a2c_carracing_tensorboard/

In [14]:
# model.save("car_racing_baseline")

In [15]:
# model = A2C.load("car_racing_baseline")

In [16]:
# def create_agent_env():
#     def _init():
#         env = gym.make(
#             "CarRacing-v3",
#             continuous=True,
#             domain_randomize=False,
#             lap_complete_percent=0.95,
#             max_episode_steps=5_000,
#             render_mode="rgb_array",
#         )
#         return env
#     env = DummyVecEnv([_init])
#     env = VecFrameStack(env, n_stack=4)
#     env = VecTransposeImage(env)
#     return env

In [17]:
# agent_env = create_agent_env()
# render_env = gym.make(
#     "CarRacing-v3",
#     continuous=True,
#     domain_randomize=False,
#     lap_complete_percent=0.95,
#     max_episode_steps=5_000,
#     render_mode="human",
# )

In [18]:
# obs_agent = agent_env.reset()
# obs_render, _ = render_env.reset()

# done = False
# score = 0

# while not done:
#     action, _ = model.predict(obs_agent, deterministic=True)
#     obs_agent, reward, done, _ = agent_env.step(action)
#     _, _, _,_, _ = render_env.step(action[0])
#     score += reward[0]

In [19]:
# agent_env.close()
# render_env.close()

---
## Custom implementation

## Implementation:
1. Randomly initialize the policy network (actor) $\pi^{\mu}(a|s)|_{\theta^{\mu}}$ and the V-function network (critic) $V^{\theta}(s)|_{\theta^{V}}$ with weights $\theta^{\mu}$ and $\theta^V$, and the target networks $V'$ and $\pi'$: $\theta^{V'} \gets \theta^V$ and $\theta^{\mu'} \gets \theta^{\mu}$
2. Set the number of training episodes $M$ and, for each episode, do the following:
3. Roll out a trajectory until a terminal state is reached.
    - In state $s_t$, act according to the current policy and choose the action $a_t = \pi^{\mu}(s_t)|_{\theta^{\mu}}$
    - Execute the action $a_t$, move to state $s_{t+1}$, and receive the reward $r_t$
    - In state $s_{t+1}$, act according to the current policy and choose the action $a_{t+1} = \pi^{\mu}(s_{t+1})|_{\theta^{\mu}}$
    - Compute $Loss(\theta^V)=\big( r_t + \gamma V^{\theta}(s_{t+1}) - V^{\theta}(s_t) \big)^2$
    - Compute $Loss(\theta^{\mu}) = \ln{\pi^{\mu}(a_t|s_t)}(r_t + \gamma V^{\theta}(s_{t+1}) - V^{\theta}(s_t))$
    - Update the weights: <br>
    __Note!__ For the V-function we ___minimize___ $Loss(\theta^V)$, while for the policy we ___maximize___ $Loss(\theta^{\mu})$! <br>
      $\quad \quad \theta^V \gets \theta^V - \alpha \nabla_{\theta^V}Loss(\theta^V)$, <br>
      $\quad \quad \theta^{\mu} \gets \theta^{\mu} + \beta \nabla_{\theta^{\mu}}Loss(\theta^{\mu})$
    - Update the target networks: <br>
    $\quad \quad \theta^{V'} \gets \tau \theta^V + (1 - \tau) \theta^{V'}$, <br>
    $\quad \quad \theta^{\mu'} \gets \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$
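
The quantities in step 3 can be illustrated for a single transition with plain Python floats standing in for the networks. All numbers and the helper names (`td_error`, `soft_update`) are illustrative assumptions. The `terminal` flag, which drops the bootstrap term at the end of an episode, is also an addition that the formulas above do not spell out. In a real implementation the gradients come from automatic differentiation.

```python
import math


def td_error(reward, value_s, value_next, gamma, terminal=False):
    """One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)."""
    # Assumption: no bootstrapping from the state after a terminal transition.
    bootstrap = 0.0 if terminal else gamma * value_next
    return reward + bootstrap - value_s


def soft_update(target_weights, online_weights, tau):
    """Target update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


gamma = 0.99
delta = td_error(reward=1.0, value_s=2.0, value_next=3.0, gamma=gamma)

critic_loss = delta ** 2            # minimized with respect to the critic weights
log_prob = math.log(0.25)           # ln pi(a_t | s_t) for the action taken
actor_objective = log_prob * delta  # maximized with respect to the actor weights

new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
print(delta, critic_loss, actor_objective, new_target)
```

A positive TD error means the action turned out better than the critic expected, so maximizing `log_prob * delta` raises the probability of that action.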

In [20]:
device = get_device()

Используемое устройство: cuda


In [30]:
# Main RL hyperparameters
gamma = torch.tensor(0.99).to(device)  # discount_factor
num_episodes = 500

# Main DL hyperparameters
lr = 3e-4  # as in the original paper
max_grad_norm = 0.5
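
`compute_returns` is imported from `src.actor_critic`, and its body is not shown in this notebook. As a rough sketch of how `gamma` typically enters a discounted-return computation, here is a plain-Python version. The function name `discounted_returns`, its arguments, and its exact behaviour are assumptions, not the project's implementation.

```python
def discounted_returns(rewards, dones, gamma, bootstrap_value=0.0):
    """Compute G_t = r_t + gamma * G_{t+1}, resetting at episode ends."""
    returns = [0.0] * len(rewards)
    running = bootstrap_value  # value estimate for the state after the last step
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # nothing to accumulate beyond the end of an episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], [False, False, True], gamma=0.5))
# [1.75, 1.5, 1.0]
```

Iterating backwards lets each return reuse the one after it, so the whole trajectory is processed in a single pass.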

In [31]:
env = gym.make(
    "CarRacing-v3",
    continuous=True,
    domain_randomize=False,
    lap_complete_percent=0.95,
    max_episode_steps=1_000,
    # render_mode="human",  # Uncomment to watch the game
)

In [32]:
env = gym.wrappers.FrameStackObservation(env, 4)

In [33]:
actor_model = ActorNet().to(device)
value_model = ValueNet().to(device)

In [34]:
# state = transform_state_to_tensor(state)

In [35]:
# actor_model.get_action_and_log_prob(state)

In [36]:
opt_actor = torch.optim.AdamW(actor_model.parameters(), lr=lr, fused=True)
opt_value = torch.optim.AdamW(value_model.parameters(), lr=lr, fused=True)

In [37]:
reward_records = []

for episode in trange(num_episodes):

    done = False
    visited_states = []
    terminated_flags = []
    actions = []
    rewards = []
    state, _ = env.reset()

    # Play one episode (collect trajectory)
    while not done:

        state = preprocess_state(state)
        state = transform_state_to_tensor(state, device=device)
        visited_states.append(state.squeeze(0))

        with torch.no_grad():
            raw_actions, transformed_actions = actor_model.get_actions(state)

        state, reward, terminated, truncated, _ = env.step(
            transformed_actions.cpu().numpy().flatten()
        )

        done = terminated or truncated
        terminated_flags.append(done)
        actions.append(raw_actions)
        rewards.append(reward)

    # Prepare data
    rewards_tensor = torch.tensor(rewards, device=device)
    terminated_tensor = torch.tensor(terminated_flags, device=device)

    # Train value model
    opt_value.zero_grad()

    values = value_model(torch.stack(visited_states))
    values = values.squeeze()

    returns = compute_returns(
        rewards=rewards_tensor,
        terminated=terminated_tensor,
        values=values,
        gamma=gamma,
        device=device
    )
    advantages = (returns - values).detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Treat the returns as fixed regression targets (no gradient through them)
    value_model_loss = F.mse_loss(values, returns.detach())
    value_model_loss.backward()

    nn.utils.clip_grad_norm_(value_model.parameters(), max_grad_norm)
    opt_value.step()

    # Train actor model
    opt_actor.zero_grad()

    log_probs_tensor = actor_model.get_log_prob_given_actions(
        state=torch.stack(visited_states), raw_actions=torch.cat(actions)
    )
    # log_probs_tensor = torch.cat(log_probs)

    policy_loss = -log_probs_tensor * advantages
    policy_loss.sum().backward()

    nn.utils.clip_grad_norm_(actor_model.parameters(), max_grad_norm)
    opt_actor.step()

    reward_records.append(sum(rewards))

    if episode % 5 == 0:
        print(f"Mean for last 5 episodes is {np.mean(reward_records[-5:])}")
        print(f"Mean for last 50 episodes is {np.mean(reward_records[-50:])}")

    # Stop if mean reward for 100 episodes > 475.0
    # Naive early stopping
    if np.average(reward_records[-100:]) > 475.0:
        break

print(f"\nDone in {episode+1} episodes")
env.close()

  0%|          | 1/500 [00:10<1:27:10, 10.48s/it]

Mean for last 5 episodes is -17.21854304635753
Mean for last 50 episodes is -17.21854304635753


  1%|          | 6/500 [01:05<1:32:07, 11.19s/it]

Mean for last 5 episodes is -28.411161060444243
Mean for last 50 episodes is -26.54572472476313


  2%|▏         | 11/500 [01:58<1:26:27, 10.61s/it]

Mean for last 5 episodes is -20.344433514479256
Mean for last 50 episodes is -23.72695599281591


  3%|▎         | 16/500 [02:52<1:27:08, 10.80s/it]

Mean for last 5 episodes is -29.5166412774947
Mean for last 50 episodes is -25.53623264427803


  4%|▍         | 21/500 [03:45<1:25:24, 10.70s/it]

Mean for last 5 episodes is -25.221517900089392
Mean for last 50 episodes is -25.461300562328358


  5%|▌         | 26/500 [04:39<1:25:25, 10.81s/it]

Mean for last 5 episodes is -22.57429486886528
Mean for last 50 episodes is -24.906107159739307


  6%|▌         | 31/500 [05:33<1:24:04, 10.76s/it]

Mean for last 5 episodes is -27.065541204828445
Mean for last 50 episodes is -25.25440297346336


  7%|▋         | 36/500 [06:25<1:21:02, 10.48s/it]

Mean for last 5 episodes is -28.221786252479273
Mean for last 50 episodes is -25.666539539993348


  8%|▊         | 41/500 [07:20<1:22:22, 10.77s/it]

Mean for last 5 episodes is -25.83052684360818
Mean for last 50 episodes is -25.686537991653694


  9%|▉         | 46/500 [08:13<1:20:45, 10.67s/it]

Mean for last 5 episodes is -19.6609165642845
Mean for last 50 episodes is -25.0315791408527


 10%|█         | 51/500 [09:07<1:21:13, 10.85s/it]

Mean for last 5 episodes is -10.413394424283577
Mean for last 50 episodes is -23.726021391085688


 11%|█         | 56/500 [10:01<1:19:21, 10.72s/it]

Mean for last 5 episodes is -30.875515634574334
Mean for last 50 episodes is -23.972456848498695


 12%|█▏        | 61/500 [10:53<1:16:00, 10.39s/it]

Mean for last 5 episodes is -19.09905248621047
Mean for last 50 episodes is -23.84791874567182


 13%|█▎        | 66/500 [11:46<1:16:24, 10.56s/it]

Mean for last 5 episodes is -12.483002136664993
Mean for last 50 episodes is -22.14455483158885


 14%|█▍        | 71/500 [12:39<1:15:05, 10.50s/it]

Mean for last 5 episodes is -13.949937111297666
Mean for last 50 episodes is -21.017396752709672


 15%|█▌        | 76/500 [13:32<1:15:14, 10.65s/it]

Mean for last 5 episodes is -24.944866882696424
Mean for last 50 episodes is -21.25445395409279


 16%|█▌        | 81/500 [14:26<1:15:09, 10.76s/it]

Mean for last 5 episodes is -25.451809306380973
Mean for last 50 episodes is -21.093080764248043


 17%|█▋        | 86/500 [15:20<1:13:29, 10.65s/it]

Mean for last 5 episodes is -30.310867667084096
Mean for last 50 episodes is -21.301988905708523


 18%|█▊        | 91/500 [16:14<1:13:58, 10.85s/it]

Mean for last 5 episodes is -11.234877199149787
Mean for last 50 episodes is -19.84242394126268


 19%|█▉        | 96/500 [17:08<1:12:06, 10.71s/it]

Mean for last 5 episodes is -22.81143136467528
Mean for last 50 episodes is -20.15747542130176


 20%|██        | 101/500 [18:02<1:10:58, 10.67s/it]

Mean for last 5 episodes is -13.27770253881739
Mean for last 50 episodes is -20.443906232755147


 21%|██        | 106/500 [18:56<1:11:37, 10.91s/it]

Mean for last 5 episodes is -13.404686532006437
Mean for last 50 episodes is -18.696823322498354


 22%|██▏       | 111/500 [19:48<1:09:03, 10.65s/it]

Mean for last 5 episodes is -26.171115503000493
Mean for last 50 episodes is -19.404029624177355


 23%|██▎       | 116/500 [20:42<1:09:28, 10.85s/it]

Mean for last 5 episodes is -29.39109228281371
Mean for last 50 episodes is -21.094838638792226


 24%|██▍       | 121/500 [21:36<1:07:51, 10.74s/it]

Mean for last 5 episodes is -24.00420189068806
Mean for last 50 episodes is -22.100265116731272


 25%|██▌       | 126/500 [22:30<1:07:02, 10.76s/it]

Mean for last 5 episodes is -14.909960283107011
Mean for last 50 episodes is -21.096774456772327


 26%|██▌       | 131/500 [23:24<1:06:19, 10.79s/it]

Mean for last 5 episodes is -22.74545995521979
Mean for last 50 episodes is -20.826139521656206


 27%|██▋       | 136/500 [24:16<1:03:17, 10.43s/it]

Mean for last 5 episodes is -24.735244547549758
Mean for last 50 episodes is -20.26857720970277


 28%|██▊       | 141/500 [25:10<1:03:45, 10.66s/it]

Mean for last 5 episodes is -15.249309115896228
Mean for last 50 episodes is -20.670020401377414


 29%|██▉       | 146/500 [26:03<1:02:49, 10.65s/it]

Mean for last 5 episodes is -31.598305696303186
Mean for last 50 episodes is -21.548707834540206


 30%|███       | 151/500 [26:58<1:03:19, 10.89s/it]

Mean for last 5 episodes is -34.338858367833055
Mean for last 50 episodes is -23.654823417441772


 31%|███       | 156/500 [27:52<1:01:58, 10.81s/it]

Mean for last 5 episodes is -11.816609392575618
Mean for last 50 episodes is -23.49601570349869


 32%|███▏      | 161/500 [28:46<1:00:30, 10.71s/it]

Mean for last 5 episodes is -17.195112476248408
Mean for last 50 episodes is -22.598415400823484


 33%|███▎      | 166/500 [29:41<1:00:42, 10.91s/it]

Mean for last 5 episodes is -22.13385025227024
Mean for last 50 episodes is -21.872691197769132


 34%|███▍      | 171/500 [30:34<58:09, 10.61s/it]

Mean for last 5 episodes is -17.430267823724602
Mean for last 50 episodes is -21.21529779107279


 35%|███▌      | 176/500 [31:27<57:05, 10.57s/it]

Mean for last 5 episodes is -26.19059553994391
Mean for last 50 episodes is -22.34336131675648


 36%|███▌      | 181/500 [32:20<57:46, 10.87s/it]

Mean for last 5 episodes is -7.397364540221938
Mean for last 50 episodes is -20.808551775256692


 37%|███▋      | 186/500 [33:13<55:07, 10.53s/it]

Mean for last 5 episodes is -26.169853305453493
Mean for last 50 episodes is -20.952012651047067


 38%|███▊      | 191/500 [34:07<55:54, 10.86s/it]

Mean for last 5 episodes is -14.392073343087997
Mean for last 50 episodes is -20.866289073766247


 39%|███▉      | 196/500 [35:01<54:15, 10.71s/it]

Mean for last 5 episodes is -13.320308152847213
Mean for last 50 episodes is -19.03848931942065


 40%|████      | 201/500 [35:54<53:22, 10.71s/it]

Mean for last 5 episodes is -28.7620237002623
Mean for last 50 episodes is -18.480805852663572


 41%|████      | 206/500 [36:49<53:11, 10.86s/it]

Mean for last 5 episodes is -14.71481467749798
Mean for last 50 episodes is -18.770626381155807


 42%|████▏     | 211/500 [37:44<52:51, 10.97s/it]

Mean for last 5 episodes is 0.71844717184184
Mean for last 50 episodes is -16.97927041634678


 43%|████▎     | 216/500 [38:38<51:48, 10.95s/it]

Mean for last 5 episodes is -19.243299923271458
Mean for last 50 episodes is -16.6902153834469


 44%|████▍     | 221/500 [39:34<51:46, 11.13s/it]

Mean for last 5 episodes is 5.300673980571237
Mean for last 50 episodes is -14.41712120301732


 45%|████▌     | 226/500 [40:28<49:19, 10.80s/it]

Mean for last 5 episodes is -26.792799910375173
Mean for last 50 episodes is -14.477341640060446


 46%|████▌     | 231/500 [41:23<48:43, 10.87s/it]

Mean for last 5 episodes is -24.41666716953466
Mean for last 50 episodes is -16.17927190299172


 47%|████▋     | 236/500 [42:19<48:56, 11.12s/it]

Mean for last 5 episodes is -20.24015900412067
Mean for last 50 episodes is -15.586302472858439


 48%|████▊     | 241/500 [43:13<47:09, 10.92s/it]

Mean for last 5 episodes is -14.781595878106382
Mean for last 50 episodes is -15.625254726360277


 49%|████▉     | 246/500 [44:08<46:26, 10.97s/it]

Mean for last 5 episodes is -13.928009602034024
Mean for last 50 episodes is -15.686024871278958


 50%|█████     | 251/500 [45:01<43:59, 10.60s/it]

Mean for last 5 episodes is -26.907327157772563
Mean for last 50 episodes is -15.500555217029982


 51%|█████     | 256/500 [45:56<44:27, 10.93s/it]

Mean for last 5 episodes is -11.906208215768357
Mean for last 50 episodes is -15.219694570857023


 52%|█████▏    | 261/500 [46:49<42:36, 10.70s/it]

Mean for last 5 episodes is -19.66243094795478
Mean for last 50 episodes is -17.25778238283668


 53%|█████▎    | 266/500 [47:42<41:56, 10.75s/it]

Mean for last 5 episodes is -21.179735736211917
Mean for last 50 episodes is -17.451425964130728


 54%|█████▍    | 271/500 [48:36<41:27, 10.86s/it]

Mean for last 5 episodes is -1.822281530128479
Mean for last 50 episodes is -18.1637215152007


 55%|█████▌    | 276/500 [49:28<39:18, 10.53s/it]

Mean for last 5 episodes is -13.39824468084681
Mean for last 50 episodes is -16.82426599224786


 56%|█████▌    | 281/500 [50:22<38:55, 10.66s/it]

Mean for last 5 episodes is -26.022214210333395
Mean for last 50 episodes is -16.984820696327738


 57%|█████▋    | 286/500 [51:17<39:06, 10.96s/it]

Mean for last 5 episodes is -10.697831703316178
Mean for last 50 episodes is -16.03058796624729


 58%|█████▊    | 291/500 [52:11<38:19, 11.00s/it]

Mean for last 5 episodes is -20.13600630273134
Mean for last 50 episodes is -16.56602900870978


 59%|█████▉    | 296/500 [53:07<38:13, 11.25s/it]

Mean for last 5 episodes is -4.733948652098276
Mean for last 50 episodes is -15.646622913716207


 60%|██████    | 301/500 [54:02<36:22, 10.97s/it]

Mean for last 5 episodes is -28.00819822159027
Mean for last 50 episodes is -15.75671002009798


 61%|██████    | 306/500 [54:55<35:07, 10.87s/it]

Mean for last 5 episodes is -9.69570688175569
Mean for last 50 episodes is -15.535659886696713


 62%|██████▏   | 311/500 [55:51<34:59, 11.11s/it]

Mean for last 5 episodes is -31.82630364663537
Mean for last 50 episodes is -16.75204715656477


 63%|██████▎   | 316/500 [56:45<33:26, 10.91s/it]

Mean for last 5 episodes is -31.158812822395134
Mean for last 50 episodes is -17.749954865183096


 64%|██████▍   | 321/500 [57:39<31:55, 10.70s/it]

Mean for last 5 episodes is -18.68959435840411
Mean for last 50 episodes is -19.43668614801066


 65%|██████▌   | 326/500 [58:32<31:03, 10.71s/it]

Mean for last 5 episodes is -39.520517782337976
Mean for last 50 episodes is -22.048913458159777


 66%|██████▌   | 331/500 [59:26<30:22, 10.78s/it]

Mean for last 5 episodes is 4.389629628518323
Mean for last 50 episodes is -19.007729074274604


 67%|██████▋   | 336/500 [1:00:20<29:22, 10.75s/it]

Mean for last 5 episodes is -23.520807446111483
Mean for last 50 episodes is -20.290026648554132


 68%|██████▊   | 341/500 [1:01:14<28:28, 10.75s/it]

Mean for last 5 episodes is -39.016389387305026
Mean for last 50 episodes is -22.1780649570115


 69%|██████▉   | 346/500 [1:02:07<27:03, 10.54s/it]

Mean for last 5 episodes is -28.815597993678978
Mean for last 50 episodes is -24.586229891169573


 70%|███████   | 351/500 [1:03:00<26:31, 10.68s/it]

Mean for last 5 episodes is -10.966054693736222
Mean for last 50 episodes is -22.882015538384167


 71%|███████   | 356/500 [1:03:54<25:38, 10.68s/it]

Mean for last 5 episodes is -16.155923036485298
Mean for last 50 episodes is -23.528037153857127


 72%|███████▏  | 361/500 [1:04:48<25:00, 10.79s/it]

Mean for last 5 episodes is -28.92297125848159
Mean for last 50 episodes is -23.23770391504175


 73%|███████▎  | 366/500 [1:05:42<24:02, 10.77s/it]

Mean for last 5 episodes is -24.274951737474822
Mean for last 50 episodes is -22.549317806549716


 74%|███████▍  | 371/500 [1:06:35<22:43, 10.57s/it]

Mean for last 5 episodes is -45.836500326790556
Mean for last 50 episodes is -25.26400840338836


 75%|███████▌  | 376/500 [1:07:30<22:37, 10.95s/it]

Mean for last 5 episodes is -78.023047738651
Mean for last 50 episodes is -29.114261399019668


 76%|███████▌  | 381/500 [1:08:25<21:25, 10.80s/it]

Mean for last 5 episodes is -81.64130900457226
Mean for last 50 episodes is -37.717355262328724


 77%|███████▋  | 386/500 [1:09:21<21:08, 11.13s/it]

Mean for last 5 episodes is -83.11069699499092
Mean for last 50 episodes is -43.67634421721667


 78%|███████▊  | 391/500 [1:10:16<20:03, 11.04s/it]

Mean for last 5 episodes is -82.38232518636099
Mean for last 50 episodes is -48.01293779712227


 79%|███████▉  | 396/500 [1:11:11<19:18, 11.14s/it]

Mean for last 5 episodes is -83.87455719826143
Mean for last 50 episodes is -53.5188337175805


 80%|████████  | 401/500 [1:12:08<18:41, 11.33s/it]

Mean for last 5 episodes is -84.19140254440617
Mean for last 50 episodes is -60.8413685026475


 81%|████████  | 406/500 [1:13:03<17:15, 11.01s/it]

Mean for last 5 episodes is -84.27181037579182
Mean for last 50 episodes is -67.65295723657816


 82%|████████▏ | 411/500 [1:13:58<16:17, 10.98s/it]

Mean for last 5 episodes is -83.16577751645892
Mean for last 50 episodes is -73.07723786237588


 83%|████████▎ | 416/500 [1:14:52<15:17, 10.92s/it]

Mean for last 5 episodes is -83.01106005522091
Mean for last 50 episodes is -78.9508486941505


 84%|████████▍ | 421/500 [1:15:48<14:38, 11.12s/it]

Mean for last 5 episodes is -83.32142939745661
Mean for last 50 episodes is -82.6993416012171


 85%|████████▌ | 426/500 [1:16:43<13:31, 10.97s/it]

Mean for last 5 episodes is -83.9325992218302
Mean for last 50 episodes is -83.29029674953503


 86%|████████▌ | 431/500 [1:17:38<12:45, 11.09s/it]

Mean for last 5 episodes is -82.57898602088899
Mean for last 50 episodes is -83.3840644511667


 87%|████████▋ | 436/500 [1:18:33<11:41, 10.96s/it]

Mean for last 5 episodes is -83.28508216993444
Mean for last 50 episodes is -83.40150296866105


 88%|████████▊ | 441/500 [1:19:29<10:56, 11.12s/it]

Mean for last 5 episodes is -83.24815955889116
Mean for last 50 episodes is -83.48808640591406


 88%|████████▊ | 442/500 [1:19:48<10:28, 10.83s/it]


KeyboardInterrupt: 
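
The training loop above standardizes the advantages before the policy update. The sketch below reproduces that step on a plain list so the effect is easy to inspect. The helper `normalize_advantages` is hypothetical. It uses the sample standard deviation (divisor n - 1), which matches the default of `torch.Tensor.std`.

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.stdev(advantages)  # sample std, divisor n - 1
    return [(a - mean) / (std + eps) for a in advantages]


normalized = normalize_advantages([1.0, 2.0, 3.0])
print(normalized)
```

After standardization roughly half of the advantages are negative, so the update pushes probability away from below-average actions even when every raw advantage has the same sign.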

In [None]:
torch.save(actor_model.state_dict(), "hw_3_trained_agent.pickle")

## Training graphs

In [None]:
table = pd.DataFrame(reward_records, columns=["reward"])
# table = table.iloc[2_000:, :]  # remove exploratory_period

In [None]:
plt.plot(table.index, table["reward"])
plt.xlabel("Training episode")
plt.ylabel("Reward")
plt.title("Reward per episode")
plt.show()

In [None]:
plt.plot(table.index, table["reward"].rolling(100).mean())
plt.xlabel("Training episode")
plt.ylabel("Reward")
plt.title("Average reward per 100 episodes")
plt.show()

## Animation

In [None]:
env = gym.make(
    "CarRacing-v3",
    continuous=True,
    domain_randomize=False,
    lap_complete_percent=0.95,
    max_episode_steps=5_000,
    render_mode="human",  # Render the game in a window
)

In [None]:
env = gym.wrappers.FrameStackObservation(env, 4)

In [None]:
device = get_device()

done = False
score = 0.0  # cumulative reward for the episode
state, _ = env.reset()

while not done:
    env.render()
    state = preprocess_state(state)
    state = transform_state_to_tensor(state, device=device)
    with torch.no_grad():
        _, transformed_actions = actor_model.get_actions(state)

    state, reward, terminated, truncated, _ = env.step(
        transformed_actions.cpu().numpy().flatten()
    )

    score += reward
    done = terminated or truncated

env.close()

print(f"Score is: {score}")

Используемое устройство: cpu


KeyboardInterrupt: 

In [None]:
env.close()