Name: Pouya Lahabi

Student ID: 400109843

In this assignment, we will implement and test REINFORCE and PPO, which are both on-policy RL algortihms.

# REINFORCE algorithm **(40 points)**

## Setup

We must first install the required packages.

In [None]:
!pip -q install gymnasium[mujoco]
!pip install imageio -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m953.9/953.9 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.5/5.5 MB[0m [31m40.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.8/211.8 kB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
import gymnasium as gym
import random
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
from collections import namedtuple, deque
import imageio

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Normal
from torch.distributions import Categorical

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Explore the environment

We will train an REINFORCE agent on the `CartPole` environment.

This code displays a video given it's path.

In [None]:
from IPython.display import HTML
from base64 import b64encode

def show_video(path):
    mp4 = open(path, 'rb').read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML("""
    <video width=400 controls>
          <source src="%s" type="video/mp4">
    </video>
    """ % data_url)

Explore the `CartPole` environment using random actions. At each timestep, render the current frame, and use it to make a video of the trajectory.

In [None]:
env = gym.make("CartPole-v1", render_mode="rgb_array")
frames = []

env.reset()
for _ in range(100):
    frames.append(env.render())
    # select a random action
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
imageio.mimsave('./CartPole.mp4', frames, fps=25)
show_video('./CartPole.mp4')




## Policy Network **(10 points)**

Complete the following code to build an agent that predicts the the probability of playing each action, given the state.

In [None]:
class PolicyNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(PolicyNetwork, self).__init__()
        
        # Define the Policy Network architecture
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
    def forward(self, x):
        
        # predict the probability of playing each action
        x = F.relu(self.fc1(x))
        # x = F.relu(self.fc2(x))

        action_probs = F.softmax(self.fc2(x), dim=-1)
        return action_probs

## Agent **(20 points)**

REINFORCE algorithm works by interacting with an environment by taking actions based on a policy. As the agent collects rewards from the environment, it records the outcomes and the **log probabilities** of the actions it took. At the end of an episode, the algorithm calculates the total **discounted reward** from each step—this is known as the return.

$$ R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
 $$

These returns are used to weight the logged probabilities, actions that lead to higher returns are made more probable.


$$ \theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \gamma^t R_t \nabla_\theta \log \pi_\theta(a_t|s_t)
 $$


In [None]:
class REINFORCEAgent:
    def __init__(self, policy, optimizer, gamma=0.99):
        self.policy = policy
        self.optimizer = optimizer
        self.gamma = gamma
        self.log_probs = []
        self.rewards = []

    def select_action(self, state):
        
        # select an action by sampling from the actor's response
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.policy(state)
        m = torch.distributions.Categorical(probs)
        action = m.sample()
        self.log_probs.append(m.log_prob(action))
        return action.item()

    def update_policy(self):
        R = 0
        policy_loss = []
        returns = []

        
        # Calculate the discounted reward
        for r in self.rewards[::-1]:
            R = r + self.gamma * R
            returns.insert(0, R)

        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)

        
        # Calculate the policy loss
        for log_prob, return_ in zip(self.log_probs, returns):
          policy_loss.append(-log_prob * return_)

        self.optimizer.zero_grad()
        
        policy_loss = torch.cat(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

        # Reset the rewards and log probabilities
        del self.rewards[:]
        del self.log_probs[:]

    def store_reward(self, reward):
        self.rewards.append(reward)


## Training **(5 points)**

Define the hyperparameters and complete the training loop.

In [None]:
env = gym.make('CartPole-v1')
input_size = env.observation_space.shape[0]
output_size = env.action_space.n
lr = 1e-2

policy = PolicyNetwork(input_size, 128, output_size)
optimizer = optim.Adam(policy.parameters(), lr=lr)
agent = REINFORCEAgent(policy, optimizer)

num_episodes = 1000

for episode in range(num_episodes):
    state, info = env.reset()
    total_reward = 0

    
    # collect rewards and log probabilities for updating the policy in a loop
    while True:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        agent.store_reward(reward)

        state = next_state

        if terminated or truncated:
            break

    agent.update_policy()
    if episode % 50 == 0:
        print(f'Episode {episode+1}: Total Reward = {total_reward}')
env.close()

Episode 1: Total Reward = 18.0
Episode 51: Total Reward = 8.0
Episode 101: Total Reward = 29.0
Episode 151: Total Reward = 105.0
Episode 201: Total Reward = 500.0
Episode 251: Total Reward = 500.0
Episode 301: Total Reward = 500.0
Episode 351: Total Reward = 500.0
Episode 401: Total Reward = 500.0
Episode 451: Total Reward = 500.0
Episode 501: Total Reward = 500.0
Episode 551: Total Reward = 11.0
Episode 601: Total Reward = 124.0
Episode 651: Total Reward = 190.0
Episode 701: Total Reward = 500.0
Episode 751: Total Reward = 500.0
Episode 801: Total Reward = 122.0
Episode 851: Total Reward = 303.0
Episode 901: Total Reward = 309.0
Episode 951: Total Reward = 218.0


## Evaluation **(5 points)**

Here we use the trained agent and collect a trajectory using it's policy. Calculate the cumulative reward by adding rewards in each time space. Save and display the video of this run in the end.

In [None]:
env = gym.make("CartPole-v1", render_mode="rgb_array")
state, _ = env.reset()
frames = []

total_reward = 0

# run the policy in the environment in a loop


for _ in range(1000):
    frames.append(env.render())
    # select a random action
    action = agent.select_action(state)
    next_state, reward, terminated, truncated, info = env.step(action)
    state = next_state
    if terminated or truncated:
        break

env.close()
print(f'Total Reward: {total_reward}')

imageio.mimsave('./eval_reinforce.mp4', frames, fps=25)
show_video('./eval_reinforce.mp4')



Total Reward: 0


# Proximal Policy Optimization **(60 points)**

## Setup

## Explore the environment

This code is essential for rendering MUJOCO based environments.

In [None]:
# Configure MuJoCo to use the EGL rendering backend (requires GPU)
%env MUJOCO_GL=egl

env: MUJOCO_GL=egl


We will train a PPO agent in the `HalfCheetah` environment. This environment features continuous actions and more complex mechanics.

Explore this environment using random actions as well, and display the video of the resulting trajectory.

* What are the observation and action spaces of this environment?

* Are values bounded?

In [None]:
env = gym.make("HalfCheetah-v4", render_mode="rgb_array")
env.reset()
frames = []

for _ in range(100):
    frames.append(env.render())
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
imageio.mimsave('./HalfCheetah.mp4', frames, fps=25)
show_video('./HalfCheetah.mp4')


## Actor & Critic **(15 points)**

Proximal Policy Optimization (PPO) is an advanced reinforcement learning algorithm that uses separate actor and critic networks to optimize policy performance.

The actor network is responsible for predicting a probability distribution over actions (discrete) or estimating the value for each action (continuous), given the current state, while the critic network evaluates how good the action taken by the actor is, by predicting the reward based on state.


In [51]:
class Actor(nn.Module):
    def __init__(self, state_dim, hidden_size, action_dim):
        super(Actor, self).__init__()
        
        # Define the Actor architecture
        self.fc1 = nn.Linear(state_dim, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)
        self.mu = nn.Linear(hidden_size // 2, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim), requires_grad=True)

    def forward(self, state):
        
        # In case of continuous environment, we usually
        # predict a mean and std for each action and sample
        # the action from a normal distribution
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        mu = torch.tanh(self.mu(x))
        std = torch.exp(self.log_std)
        return mu, std

class Critic(nn.Module):
    def __init__(self, state_dim, hidden_size):
        super(Critic, self).__init__()
        
        # Define the Critic architecture
        self.fc1 = nn.Linear(state_dim, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size // 2)
        self.value = nn.Linear(hidden_size // 2, 1)

    def forward(self, state):
        
        # Predict the value of state
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        value = self.value(x)
        return value


## Memory

PPO algorithms need to store sequences of actions, states, log probabilities, rewards, and state values to train the agent. This data is captured in the `Memory` class, which facilitates batch processing by holding and then clearing these elements at the end of each training iteration.

In [52]:
class Memory:
    def __init__(self):
        self.actions = []
        self.states = []
        self.logprobs = []
        self.rewards = []
        self.state_values = []

    def clear(self):
        del self.actions[:]
        del self.states[:]
        del self.logprobs[:]
        del self.rewards[:]
        del self.state_values[:]


## Agent **(35 points)**

In PPO, the actor's goal is to maximize the expected return. However, direct maximization can cause large policy updates, risking instability. To prevent this, PPO employs a clipping mechanism, limiting policy changes to a defined range.

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) \right]
 $$

Additionally, it uses a probability ratio to scale updates, ensuring changes This ratio provides a scaling factor for the policy updates, ensuring that changes are made in proportion to the improvement in policy performance.

$$ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
 $$

 The critic aims to minimize the error between its predictions and the actual returns.

 $$ L^{VF}(\phi) = \left( V_\phi(s_t) - \hat{R}_t \right)^2
 $$

In [53]:
class PPO(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_size=64, lr=1e-4, gamma=0.99, epochs=4, eps_clip=0.2):
        super(PPO, self).__init__()
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.epochs = epochs

        self.actor = Actor(state_dim, hidden_size, action_dim)
        self.critic = Critic(state_dim, hidden_size)

        self.optimizer_actor = optim.Adam(self.actor.parameters(), lr=lr)
        self.optimizer_critic = optim.Adam(self.critic.parameters(), lr=lr)
        self.memory = Memory()

    def select_action(self, state):
        
        # Save state, action, log probability and state value of current step in the memory buffer.
        # predict the actions by sampling from a normal distribution
        # based on the mean and std calculated by actor
        state = torch.tensor(state, dtype=torch.float32)
        mean, std = self.actor(state)
        dist = Normal(mean, std)
        action = dist.sample()
        action_logprob = dist.log_prob(action)
        state_value = self.critic(state)

        self.memory.states.append(state)
        self.memory.actions.append(action)
        self.memory.logprobs.append(action_logprob)
        self.memory.state_values.append(state_value)

        return action.detach().numpy()

    def evaluate(self, state, action):
        
        # evaluate the state value of this state and log probability of choosing this action
        action = torch.tensor(action,dtype=torch.float32)
        mean, std = self.actor(state)
        dist = Normal(mean, std)
        action_logprobs = dist.log_prob(action)
        entropy = dist.entropy()
        state_value = self.critic(state)

        return action_logprobs, state_value, entropy

    def update(self):
        rewards = []
        discounted_reward = 0
        
        # Calculate discounted rewards
        for reward in self.memory.rewards[::-1]:
            discounted_reward = reward + self.gamma * discounted_reward
            rewards.insert(0, discounted_reward)

        rewards = torch.tensor(rewards, dtype=torch.float32)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-9)

        
        # load saved states, actions, log probs, and state values
        old_states = torch.stack(self.memory.states)
        old_actions = torch.stack(self.memory.actions)
        old_logprobs = torch.stack(self.memory.logprobs)
        old_state_values = torch.stack(self.memory.state_values).squeeze()

        
        # Calculate advantages for each timestep (usually difference of rewards and state values)
        advantages = rewards - old_state_values.detach()
        advantages = advantages.unsqueeze(-1)

        loss_ac = 0
        loss_cri = 0
        for _ in range(self.epochs):
            # calculate logprobs and state values based on the new policy
            

            logprobs, state_values, entropy = self.evaluate(old_states, old_actions)
            ratios = torch.exp(logprobs - old_logprobs.detach())

            
            # Calculate the loss function and perform the optimization

            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            loss_actor = -torch.min(surr1, surr2).mean().mean()
            loss_critic = F.mse_loss(torch.squeeze(state_values), rewards)


            self.optimizer_actor.zero_grad()
            loss_actor.backward()
            loss_ac += loss_actor.item()
            self.optimizer_actor.step()

            self.optimizer_critic.zero_grad()
            loss_critic.backward()
            loss_cri += loss_critic.item()
            self.optimizer_critic.step()

        # clear the buffer
        self.memory.clear()
        return loss_ac, loss_cri

## Training **(5 points)**

Define the hyperparameters and complete the training loop.

In [55]:
env = gym.make("HalfCheetah-v4")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
hidden_size = 64
lr = 3e-4

print(state_dim)
print(action_dim)

agent = model = PPO(state_dim, action_dim, hidden_size=hidden_size, lr=lr)

# We need to train for many more steps to achieve acceptable results compared to the last environment
num_episodes = 2000

actor_losses = []
critic_losses = []
moving_rewards = np.array([])

for episode in range(num_episodes):
    state, _ = env.reset()
    total_reward = 0
    
    # write the training loop
    while True:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)

        agent.memory.rewards.append(reward)

        state = next_state
        total_reward += reward

        if terminated or truncated:
            break


    if episode % 5 == 0:
        loss_ac, loss_cri = agent.update()
        actor_losses.append(loss_ac)
        critic_losses.append(loss_cri)
    moving_rewards = np.append(moving_rewards, total_reward)
    if episode % 50 == 0:
        print(f"actor loss:\t{loss_ac:.6f}")
        print(f"critic loss:\t{loss_cri:.6f}")
        print(f'Episode {episode}: Going Reward = {moving_rewards.mean():.1f}: Std = {moving_rewards.std():.1f}')
        moving_rewards = np.array([])

env.close()

17
6


  action = torch.tensor(action,dtype=torch.float32)


actor loss:	-0.839822
critic loss:	4.307190
Episode 0: Going Reward = -798.1: Std = 0.0
actor loss:	0.014746
critic loss:	4.049279
Episode 50: Going Reward = -766.7: Std = 81.4
actor loss:	0.028678
critic loss:	4.011692
Episode 100: Going Reward = -716.8: Std = 80.7
actor loss:	0.067217
critic loss:	3.987872
Episode 150: Going Reward = -703.1: Std = 72.9
actor loss:	0.040208
critic loss:	3.986250
Episode 200: Going Reward = -682.7: Std = 72.2
actor loss:	-0.039015
critic loss:	3.907885
Episode 250: Going Reward = -672.0: Std = 67.8
actor loss:	0.135488
critic loss:	4.033622
Episode 300: Going Reward = -652.3: Std = 49.6
actor loss:	0.208288
critic loss:	4.038214
Episode 350: Going Reward = -608.8: Std = 64.7
actor loss:	0.031417
critic loss:	3.314217
Episode 400: Going Reward = -608.6: Std = 60.4
actor loss:	0.743354
critic loss:	4.571159
Episode 450: Going Reward = -588.7: Std = 71.9
actor loss:	0.284984
critic loss:	4.193894
Episode 500: Going Reward = -561.5: Std = 72.9
actor loss:	

In [56]:
torch.save(agent.state_dict(), 'model.pth')

In [57]:
env = gym.make("HalfCheetah-v4")

for episode in range(num_episodes):
    state, _ = env.reset()
    total_reward = 0
    
    # write the training loop
    while True:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)

        agent.memory.rewards.append(reward)

        state = next_state
        total_reward += reward

        if terminated or truncated:
            break


    if episode % 5 == 0:
        loss_ac, loss_cri = agent.update()
        actor_losses.append(loss_ac)
        critic_losses.append(loss_cri)
    moving_rewards = np.append(moving_rewards, total_reward)
    if episode % 50 == 0:
        print(f"actor loss:\t{loss_ac:.6f}")
        print(f"critic loss:\t{loss_cri:.6f}")
        print(f'Episode {episode}: Going Reward = {moving_rewards.mean():.1f}: Std = {moving_rewards.std():.1f}')
        moving_rewards = np.array([])

env.close()

  action = torch.tensor(action,dtype=torch.float32)


actor loss:	1.036371
critic loss:	2.489498
Episode 0: Going Reward = 43.4: Std = 234.0
actor loss:	-0.855331
critic loss:	1.981381
Episode 50: Going Reward = 0.3: Std = 211.5
actor loss:	0.024499
critic loss:	2.142779
Episode 100: Going Reward = -21.6: Std = 237.3
actor loss:	0.417421
critic loss:	1.461275
Episode 150: Going Reward = 20.2: Std = 257.9
actor loss:	-0.699259
critic loss:	2.116474
Episode 200: Going Reward = 72.7: Std = 283.2
actor loss:	0.088421
critic loss:	1.253533
Episode 250: Going Reward = 48.8: Std = 277.3
actor loss:	0.398929
critic loss:	1.438171
Episode 300: Going Reward = 65.1: Std = 279.2
actor loss:	-2.629447
critic loss:	4.865219
Episode 350: Going Reward = 112.1: Std = 306.8
actor loss:	0.322394
critic loss:	1.579695
Episode 400: Going Reward = 208.0: Std = 274.8
actor loss:	1.735177
critic loss:	2.243730
Episode 450: Going Reward = 114.4: Std = 295.2
actor loss:	1.096504
critic loss:	1.762571
Episode 500: Going Reward = 224.6: Std = 314.7
actor loss:	0.310

In [58]:
torch.save(agent.state_dict(), 'model.pth')

In [59]:
env = gym.make("HalfCheetah-v4")

for episode in range(num_episodes):
    state, _ = env.reset()
    total_reward = 0
    
    # write the training loop
    while True:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)

        agent.memory.rewards.append(reward)

        state = next_state
        total_reward += reward

        if terminated or truncated:
            break


    if episode % 5 == 0:
        loss_ac, loss_cri = agent.update()
        actor_losses.append(loss_ac)
        critic_losses.append(loss_cri)
    moving_rewards = np.append(moving_rewards, total_reward)
    if episode % 50 == 0:
        print(f"actor loss:\t{loss_ac:.6f}")
        print(f"critic loss:\t{loss_cri:.6f}")
        print(f'Episode {episode}: Going Reward = {moving_rewards.mean():.1f}: Std = {moving_rewards.std():.1f}')
        moving_rewards = np.array([])

env.close()

  action = torch.tensor(action,dtype=torch.float32)


actor loss:	1.730663
critic loss:	2.125857
Episode 0: Going Reward = 564.9: Std = 463.5
actor loss:	0.843161
critic loss:	1.529854
Episode 50: Going Reward = 464.1: Std = 429.3
actor loss:	0.088880
critic loss:	1.235499
Episode 100: Going Reward = 498.2: Std = 459.3
actor loss:	0.244473
critic loss:	1.094931
Episode 150: Going Reward = 671.9: Std = 508.9
actor loss:	0.994212
critic loss:	2.768891
Episode 200: Going Reward = 829.1: Std = 481.2
actor loss:	0.382234
critic loss:	1.137798
Episode 250: Going Reward = 702.6: Std = 480.3
actor loss:	-0.342030
critic loss:	1.133339
Episode 300: Going Reward = 887.6: Std = 511.1
actor loss:	-0.062790
critic loss:	0.875749
Episode 350: Going Reward = 851.2: Std = 537.8
actor loss:	-0.959186
critic loss:	1.427001
Episode 400: Going Reward = 743.4: Std = 529.6
actor loss:	-0.615260
critic loss:	1.347581
Episode 450: Going Reward = 942.5: Std = 518.3
actor loss:	0.385449
critic loss:	0.877702
Episode 500: Going Reward = 870.5: Std = 499.7
actor los

In [60]:
torch.save(agent.state_dict(), 'model.pth')

In [61]:
env = gym.make("HalfCheetah-v4")

for episode in range(4000):
    state, _ = env.reset()
    total_reward = 0
    
    # write the training loop
    while True:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)

        agent.memory.rewards.append(reward)

        state = next_state
        total_reward += reward

        if terminated or truncated:
            break


    if episode % 5 == 0:
        loss_ac, loss_cri = agent.update()
        actor_losses.append(loss_ac)
        critic_losses.append(loss_cri)
    moving_rewards = np.append(moving_rewards, total_reward)
    if episode % 50 == 0:
        print(f"actor loss:\t{loss_ac:.6f}")
        print(f"critic loss:\t{loss_cri:.6f}")
        print(f'Episode {episode}: Going Reward = {moving_rewards.mean():.1f}: Std = {moving_rewards.std():.1f}')
        moving_rewards = np.array([])

env.close()

  action = torch.tensor(action,dtype=torch.float32)


actor loss:	-0.069505
critic loss:	0.924055
Episode 0: Going Reward = 1237.8: Std = 599.2
actor loss:	-0.283444
critic loss:	1.256054
Episode 50: Going Reward = 1300.3: Std = 513.6
actor loss:	1.459521
critic loss:	2.836913
Episode 100: Going Reward = 1400.7: Std = 603.8
actor loss:	0.466365
critic loss:	1.247801
Episode 150: Going Reward = 1261.3: Std = 639.9
actor loss:	0.630232
critic loss:	1.257712
Episode 200: Going Reward = 1297.9: Std = 643.9
actor loss:	-0.366151
critic loss:	1.299490
Episode 250: Going Reward = 1208.5: Std = 581.6
actor loss:	0.647270
critic loss:	1.369054
Episode 300: Going Reward = 1358.2: Std = 616.3
actor loss:	1.922846
critic loss:	4.125993
Episode 350: Going Reward = 1408.7: Std = 655.0
actor loss:	0.489997
critic loss:	1.271141
Episode 400: Going Reward = 1366.2: Std = 609.1
actor loss:	1.082786
critic loss:	1.780468
Episode 450: Going Reward = 1383.5: Std = 657.5
actor loss:	0.077922
critic loss:	1.404174
Episode 500: Going Reward = 1575.2: Std = 544.8

## Evaluation **(5 points)**

Evaluate the trained policy on the environment. Calculate the cumulative reward and display the video of the trajectory.

In [65]:
env = gym.make("HalfCheetah-v4", render_mode="rgb_array")
state, _ = env.reset()
frames = []

total_reward = 0

# run the policy in the environment in a loop
for _ in range(1000):
    frames.append(env.render())
    # select a random action
    action = agent.select_action(state)
    next_state, reward, terminated, truncated, info = env.step(action)

    total_reward += reward

    state = next_state
    if terminated or truncated:
        break


env.close()
print(f'Total Reward: {total_reward}')

imageio.mimsave('./eval_ppo.mp4', frames, fps=25)
show_video('./eval_ppo.mp4')

Total Reward: 1687.6510305009226
