### CDS NYU
### DS-GA 3001 | Reinforcement Learning
### Lab 8
## Fundamental Actor-Critic Algorithms

---
## Section Leader
Akshitha Kumbam – ak11071@nyu.edu

Kushagra Khatwani – kk5395@nyu.edu
<hr>
<br>

In today's lab, we will learn to implement two essential algorithms from the Actor-Critic family: Advantage Actor Critic (https://openai.com/research/openai-baselines-acktr-a2c), and Deep Deterministic Policy Gradients (https://www.deepmind.com/publications/deterministic-policy-gradient-algorithms).

<br>

A2C helps in reducing the variance of the updates stemming from the empirical cumulative reward as seen in REINFORCE. We will see how it does so. Whereas, DDPG focuses on a different kind of problems, those with continuous action spaces. Can you think of a way to solve continuous action spaces using DQNs? 

### Part I
### Advantage Actor Critic

Recollect the policy gradient update from REINFORCE: $\sum_t \nabla G_t \log \pi(a_t | s_t)$

#### Using Q(s, a) and V(s)
Let's compute the advantage of an action defined by

<center>$A(s, a) = Q(s, a) - V(s)$</center>

Note: We may also subtract the average return of state s instead of using V(s). It is called using a baseline. However, it is easy to see that approximating V(s) non-linearly may be better than using a moving average.

#### Required Imports

In [5]:
# Advantage Actor-Critic Algorithm on Pong, Author: Anudeep Tubati, NYU
# Modified implementation of REINFORCE at https://karpathy.github.io/2016/05/31/rl/
import sys
import numpy as np
import random

#Quick fix for M1 architecture (M2/M3 might also need this) if torch imports fail. Reinstalling torch works as well.
# import os
# os.environ['KMP_DUPLICATE_LIB_OK']='True'

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Bernoulli

import gymnasium as gym
import ale_py

gym.register_envs(ale_py)

from PIL import Image

from time import sleep

from pathlib import Path


# code for the only two actions in Pong
# 2 -> UP, 3 -> DOWN
ACTIONS = [2, 3]

#### Helper Functions

In [2]:
def prepro(img):
    """ prepro 210x160x3 uint8 frame into 6000 (75x80) 1D float vector """
    img = img[35:185]
    img = img[::2, ::2, 0]
    img[img == 144] = 0
    img[img == 109] = 0
    img[img != 0] = 1

    return img.astype(np.float32).ravel()

def discount_rewards(reward):
    # Compute the gamma-discounted rewards over an episode
    gamma = 0.99    # discount rate
    running_add = 0
    discounted_r = torch.zeros_like(reward)

    for i in reversed(range(0, len(reward))):
        if reward[i] != 0: # reset the sum, since this was a game boundary (pong specific!)
            running_add = 0
        running_add = running_add * gamma + reward[i]
        discounted_r[i] = running_add

    discounted_r -= torch.mean(discounted_r) # normalizing the result
    discounted_r /= torch.std(discounted_r) # divide by standard deviation
    return discounted_r


def log(filename, string):
    with open(filename, 'a+') as logger:
        logger.write(string)


# wrapper for the Gym environment that uses our helper functions
class Pong(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.env = env

        self.ROWS = 75
        self.COLS = 80

        self.state_size = (self.ROWS * self.COLS, )

    def reset(self):
        self.prev_x = prepro(self.env.reset()[0])

        return np.zeros(self.state_size).flatten()
    
    def step(self, action):
        next_state, reward, done, _, info = self.env.step(action)

        next_state = prepro(next_state)
        actual_next_state = next_state - self.prev_x
        self.prev_x = next_state

        return actual_next_state.flatten(), reward, done, info

#### Agent and Networks

In [3]:
class Actor(nn.Module):
    def __init__(self, input_size, action_size):
        super(Actor, self).__init__()

        self.fc1 = nn.Linear(input_size, 200)
        self.output = nn.Linear(200, action_size)
    
    def forward(self, x):

        x = F.relu(self.fc1(x))
        action_prob = torch.sigmoid(self.output(x))

        return action_prob


class Critic(nn.Module):
    def __init__(self, input_size):
        super(Critic, self).__init__()

        self.fc1 = nn.Linear(input_size, 200)
        self.output = nn.Linear(200, 1)
    
    def forward(self, x):

        x = F.relu(self.fc1(x))
        value = self.output(x)

        return value


class A2CAgent(object):
    def __init__(self, input_size, log_filename):
        self.actor = Actor(input_size, 1)
        self.critic = Critic(input_size)
        self.optimizerA = optim.RMSprop(self.actor.parameters(), lr=0.001, weight_decay=0.99)
        self.optimizerC = optim.RMSprop(self.critic.parameters(), lr=0.001, weight_decay=0.99)

        self.memory = {
            'rewards': [],
            'log_probs': [],
            'states': []
        }
        self.epoch = 0

        self.log_filename = log_filename
    
    def select_action(self, state):

        # sample an action from stochastic policy
        action_prob = self.actor.forward(state)
        dist = Bernoulli(action_prob)

        sampled_val = dist.sample()
        action_idx = int(sampled_val.item())

        # compute log prob
        # print(sampled_val.item() == 1.0, sampled_val, action_idx)
        action_to_take = ACTIONS[action_idx]

        self.memory['log_probs'].append(dist.log_prob(sampled_val))

        return action_to_take

    def remember(self, state, reward):
        self.memory['states'].append(state)
        self.memory['rewards'].append(reward)
    
    def update_network(self):
        len_r = len(self.memory['rewards'])
        assert len_r == len(self.memory['log_probs'])

        # convert to tensors for ease of operation
        self.memory['rewards'] = torch.tensor(self.memory['rewards'], dtype=torch.float32)
        discounted_r = discount_rewards(self.memory['rewards']).unsqueeze(1)
        self.memory['log_probs'] = torch.stack(self.memory['log_probs'])
        states = torch.stack(self.memory['states'])

        # get V values of states from critic
        values = self.critic.forward(states)

        CHANGE_SCHED_EPOCH = 1000

        # train only critic first, then train policy as well
        if self.epoch <= CHANGE_SCHED_EPOCH:
            # calculate policy loss
            policy_losses = (-1 * self.memory['log_probs']) * discounted_r

            # calculate loss for critic
            value_loss = F.mse_loss(values, discounted_r)

        else:
            if self.epoch == (CHANGE_SCHED_EPOCH + 9):
                self.optimizerA.param_groups[0]['lr'] = 0.0005
                print("\nACTOR LR CHANGED TO 0.0005")

            # calculate advantages for A2C
            advantages = discounted_r - values.detach()

            # calculate policy loss
            policy_losses = (-1 * self.memory['log_probs']) * advantages

            # calculate targets for critic by adding discounted next_state
            # values (except for last state)
            targets = self.memory['rewards'].unsqueeze(1).clone()
            targets[:-1] += (0.99 * values.detach())[1:]

            # calculate value loss from targets
            value_loss = F.mse_loss(values, targets)

        # print(f"[{self.epoch}]", value_loss)

        # crux of training
        self.optimizerA.zero_grad()
        self.lossA = policy_losses.sum()
        self.lossA.backward()
        self.optimizerA.step()

        self.optimizerC.zero_grad()
        self.lossC = value_loss
        self.lossC.backward()
        self.optimizerC.step()

        # reset memory because this is on-policy
        for k in self.memory.keys():
            self.memory[k] = []
    
    def learn(self, env, num_epochs, roll_size, start=0):

        print(f"Resuming from {start + 1}, Writing to {self.log_filename}\n")
        # self.log_file.write(f"Resuming from {start + 1}\n\n")

        assert roll_size == 10

        avg = -float('inf')
        best_avg = -float('inf')
        max_score = -float('inf')
        all_scores = np.zeros((num_epochs, ), dtype=np.int32)

        for eps_idx in range(start + 1, num_epochs):
            self.epoch = eps_idx

            # beginning of an episode
            state = env.reset()
            state = torch.tensor(state, dtype=torch.float32)
            done = False
            score = 0

            while not done:

                action = self.select_action(state)

                # run one step
                next_state, reward, done, _ = env.step(action)
                next_state = torch.tensor(next_state, dtype=torch.float32)

                self.remember(state, reward)
                state = next_state

                score += reward

            # bookkeeping of stats
            all_scores[eps_idx] = score
            if score > max_score:
                max_score = score

            sys.stdout.write(f"\r [{eps_idx}]: {score}, Avg: {avg:.2f}, Max: {max_score}, Best_avg: {best_avg:.2f}")
            sys.stdout.flush()
            
            if ((eps_idx + 1) % roll_size) == 0:
                avg = np.mean(all_scores[(eps_idx + 1) - roll_size:eps_idx])
                if avg > best_avg:
                    best_avg = avg
                    self.save(eps_idx, "pong_checkpoint_bestavg")

                # print(f"\n [{eps_idx}]: {score}, Avg: {avg:.2f}, Max: {max_score}, Best_avg: {best_avg:.2f}")
                stat_string = f" [{eps_idx}]: {score}, Avg: {avg:.2f}, Max: {max_score}, Best_avg: {best_avg:.2f}\n"
                log(self.log_filename, stat_string)
                self.save(eps_idx, "pong_checkpoint_latest")

                np.save('checkpoints/all_scores.npy', all_scores, allow_pickle=False)
            
            # train every 10 episodes
            if ((eps_idx + 1) % 10) == 0:
                self.update_network()
            
            # graph the scores every 100 eps
            if ((eps_idx + 1) % 100) == 0:
                pass
        
        avg = np.mean(all_scores)
        max_score = np.max(all_scores)
        print(f"\n [{eps_idx}]: {score}, Avg: {avg:.2f}, Max: {max_score}, Best_avg: {best_avg:.2f}")

    def save(self, epoch, path):
        save_dir = 'checkpoints/'
        path = save_dir + path + ".pt"

        Path(save_dir).mkdir(exist_ok=True)

        try:
            lossA = self.lossA
            lossC = self.lossC
        except AttributeError:
            lossA = None
            lossC = None

        torch.save(
            {
                'epoch': epoch,
                'actor_state_dict': self.actor.state_dict(),
                'critic_state_dict': self.critic.state_dict(),
                'optimizerA_state_dict': self.optimizerA.state_dict(),
                'optimizerC_state_dict': self.optimizerC.state_dict(),
                'lossA': lossA,
                'lossC': lossC,
            },
            path
        )

    def load(self, path):
        save_dir = 'checkpoints/'
        path = save_dir + path + ".pt"

        checkpoint = torch.load(path)

        epoch = checkpoint['epoch']
        self.actor.load_state_dict(checkpoint['actor_state_dict'])
        self.critic.load_state_dict(checkpoint['critic_state_dict'])
        self.optimizerA.load_state_dict(checkpoint['optimizerA_state_dict'])
        self.optimizerC.load_state_dict(checkpoint['optimizerC_state_dict'])

        return epoch

#### Training the agent

In [7]:
env = gym.make("PongDeterministic-v0")
env = Pong(env)

if Path('logs.txt').exists():
    print("Logs already exist, appending to them.")
agent = A2CAgent(6000, 'logs.txt')
epoch_resume = -1
# epoch_resume = agent.load('pong_checkpoint_bestavg')
agent.learn(env, 1000, 10, epoch_resume)              

  logger.deprecation(
A.L.E: Arcade Learning Environment (version 0.10.1+6a7e0ae)
[Powered by Stella]


Resuming from 0, Writing to logs.txt

 [19]: -19.0, Avg: -20.67, Max: -19.0, Best_avg: -20.67

KeyboardInterrupt: 

In [None]:
# Testing out the trained agent
episodes = 1
env = gym.make("PongDeterministic-v0", render_mode="human")
env = Pong(env)
obs = env.reset()
done = False
print(obs)
for ep in range(episodes):
    while not done:
        obs = torch.tensor(obs, dtype=torch.float32)
        action = agent.select_action(obs)
        # run one step
        obs, reward, done, _ = env.step(action)
env.close

### Part II
### Deep Deterministic Policy Gradients

#### Required Imports

In [1]:
# derived from https://github.com/wpiszlogin/driver_critic/
import numpy as np
import tensorflow as tf
from tools import *
from tensorflow.keras import layers
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam

import gymnasium as gym
import ale_py

gym.register_envs(ale_py)


2025-04-03 12:34:09.491115: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


#### Base Class for the agent

Unlike DQN where we set target network weights to the trainable network weights in defined intervals, we use a smoother update for the target networks' weights. 
![tau_update](https://spinningup.openai.com/en/latest/_images/math/d417987803ca9f61ac60741880a748129bd66dde.svg)

In [2]:
class BaseSolution:
    def __init__(self, action_space, model_outputs=None, noise_mean=None, noise_std=None):

        # Hyperparameters
        self.gamma = 0.99
        self.actor_lr = 0.00001
        self.critic_lr = 0.002
        self.tau = 0.005
        self.memory_capacity = 60000

        # For problems that have specific outputs of an actor model
        self.need_decode_out = model_outputs is not None
        self.model_action_out = model_outputs if model_outputs else action_space.shape[0]
        self.action_space = action_space

        # Init noise generator
        if noise_mean is None:
            noise_mean = np.full(self.model_action_out, 0.0, np.float32)
        if noise_std is None:
            noise_std  = np.full(self.model_action_out, 0.2, np.float32)
        std = self.noise = NoiseGenerator(noise_mean, noise_std)

        # Initialize buffer R
        self.r_buffer = MemoriesRecorder(memory_capacity=self.memory_capacity)

        self.actor_opt      = Adam(self.actor_lr)
        self.critic_opt     = Adam(self.critic_lr)
        self.actor          = None
        self.critic         = None
        self.target_actor   = None
        self.target_critic  = None

    def reset(self):
        self.noise.reset()

    def build_actor(self, state_shape, name="Actor"):
        inputs = layers.Input(shape=state_shape)
        x = inputs
        x = layers.Conv2D(16, kernel_size=(5, 5), strides=(4, 4), padding='valid', use_bias=False, activation="relu")(x)
        x = layers.Conv2D(32, kernel_size=(3, 3), strides=(3, 3), padding='valid', use_bias=False, activation="relu")(x)
        x = layers.Conv2D(32, kernel_size=(3, 3), strides=(3, 3), padding='valid', use_bias=False, activation="relu")(x)

        x = layers.Flatten()(x)
        x = layers.Dense(64, activation='relu')(x)
        y = layers.Dense(self.model_action_out, activation='tanh')(x)

        model = Model(inputs=inputs, outputs=y, name=name)
        model.summary()
        return model

    def build_critic(self, state_shape, name="Critic"):
        state_inputs = layers.Input(shape=state_shape)
        x = state_inputs
        x = layers.Conv2D(16, kernel_size=(5, 5), strides=(4, 4), padding='valid', use_bias=False, activation="relu")(x)
        x = layers.Conv2D(32, kernel_size=(3, 3), strides=(3, 3), padding='valid', use_bias=False, activation="relu")(x)
        x = layers.Conv2D(32, kernel_size=(3, 3), strides=(3, 3), padding='valid', use_bias=False, activation="relu")(x)

        x = layers.Flatten()(x)
        action_inputs = layers.Input(shape=(self.model_action_out,))
        x = layers.concatenate([x, action_inputs])

        x = layers.Dense(64, activation='relu')(x)
        x = layers.Dense(32, activation='relu')(x)
        y = layers.Dense(1)(x)

        model = Model(inputs=[state_inputs, action_inputs], outputs=y, name=name)
        model.summary()
        return model

    def init_networks(self, state_shape):
        self.actor  = self.build_actor(state_shape)
        self.critic = self.build_critic(state_shape)

        # Build target networks in the same way
        self.target_actor  = self.build_actor(state_shape, name='TargetActor')
        self.target_critic = self.build_critic(state_shape, name='TargetCritic')

        # Copy parameters from action and critic
        self.target_actor.set_weights(self.actor.get_weights())
        self.target_critic.set_weights(self.critic.get_weights())

    def get_action(self, state, add_noise=True):
        prep_state = self.preprocess(state)
        if self.actor is None:
            self.init_networks(prep_state.shape)

        # Get result from a network
        tensor_state = tf.expand_dims(tf.convert_to_tensor(prep_state), 0)
        actor_output = self.actor(tensor_state).numpy()

        # Add noise
        if add_noise:
            actor_output = actor_output[0] + self.noise.generate()
        else:
            actor_output = actor_output[0]

        if self.need_decode_out:
            env_action = self.decode_model_output(actor_output)
        else:
            env_action = actor_output

        # Clip min-max
        env_action = np.clip(np.array(env_action), a_min=self.action_space.low, a_max=self.action_space.high)
        return env_action, actor_output

    def decode_model_output(self, model_out):
        return np.array([model_out[0], model_out[1].clip(0, 1), -model_out[1].clip(-1, 0)])

    def preprocess(self, img, greyscale=False):
        img = img.copy()
        # Remove numbers and enlarge speed bar
        for i in range(88, 93+1):
            img[i, 0:12, :] = img[i, 12, :]

        # Unify grass color
        replace_color(img, original=(102, 229, 102), new_value=(102, 204, 102))

        if greyscale:
            img = img.mean(axis=2)
            img = np.expand_dims(img, 2)

        # Make car black
        car_color = 68.0
        car_area = img[67:77, 42:53]
        car_area[car_area == car_color] = 0

        # Scale from 0 to 1
        img = img / img.max()

        # Unify track color
        img[(img > 0.411) & (img < 0.412)] = 0.4
        img[(img > 0.419) & (img < 0.420)] = 0.4

        # Change color of kerbs
        game_screen = img[0:83, :]
        game_screen[game_screen == 1] = 0.80
        return img

    def learn(self, state, train_action, reward, new_state):
        # Store transition in R
        prep_state     = self.preprocess(state)
        prep_new_state = self.preprocess(new_state)
        self.r_buffer.write(prep_state, train_action, reward, prep_new_state)

        # Sample mini-batch from R
        state_batch, action_batch, reward_batch, new_state_batch  = self.r_buffer.sample()

        state_batch     = tf.convert_to_tensor(state_batch)
        action_batch    = tf.convert_to_tensor(action_batch)
        reward_batch    = tf.convert_to_tensor(reward_batch)
        reward_batch    = tf.cast(reward_batch, dtype=tf.float32)
        new_state_batch = tf.convert_to_tensor(new_state_batch)

        self.update_actor_critic(state_batch, action_batch, reward_batch, new_state_batch)

        # Update target networks
        self.update_target_network(self.target_actor.variables, self.actor.variables)
        self.update_target_network(self.target_critic.variables, self.critic.variables)

    @tf.function
    def update_actor_critic(self, state, action, reward, new_state):
        # Update critic
        with tf.GradientTape() as tape:
            # Calc y
            new_action = self.target_actor(new_state, training=True)
            y = reward + self.gamma * self.target_critic([new_state, new_action], training=True)

            critic_loss = tf.math.reduce_mean(tf.square(y - self.critic([state, action], training=True)))

        critic_gradients = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.critic_opt.apply_gradients(zip(critic_gradients, self.critic.trainable_variables))

        # Update actor
        with tf.GradientTape() as tape:
            critic_out = self.critic([state, self.actor(state, training=True)], training=True)
            actor_loss = -tf.math.reduce_mean(critic_out)  # Need to maximize

        actor_gradients = tape.gradient(actor_loss, self.actor.trainable_variables)
        self.actor_opt.apply_gradients(zip(actor_gradients, self.actor.trainable_variables))

    # @tf.function
    def update_target_network(self, target_weights, new_weights):
        for t, n in zip(target_weights, new_weights):
            t.assign((1 - self.tau) * t + self.tau * n)

    def save_solution(self, path='models/'):
        self.actor.save(path + 'actor.h5')
        self.critic.save(path + 'critic.h5')
        self.target_actor.save(path + 'target_actor.h5')
        self.target_critic.save(path + 'target_critic.h5')

    def load_solution(self, path='models/'):
        self.actor = tf.keras.models.load_model(path + 'actor.h5')
        self.critic = tf.keras.models.load_model(path + 'critic.h5')
        self.target_actor = tf.keras.models.load_model(path + 'target_actor.h5')
        self.target_critic = tf.keras.models.load_model(path + 'target_critic.h5')

#### Train the model
Gets an average reward of 400 with 30 mins of training on an Intel i5 7th Gen

In [3]:
# Parameters
n_episodes = 100
problem = 'CarRacing-v3'

gym.logger.min_level = "ERROR"
preview = False
best_result = 0
all_episode_reward = []

In [4]:
# Initialize simulation
env = gym.make(problem)
state, info = env.reset()

In [5]:
# Define custom standard deviation for noise
# We need lesser noise for steering
noise_std = np.array([0.1, 4 * 0.2], dtype=np.float32)
solution = BaseSolution(env.action_space, model_outputs=2, noise_std=noise_std)

In [6]:
# Loop of episodes
for ie in range(n_episodes):
    state, info = env.reset()
    solution.reset()
    done = False
    episode_reward = 0
    no_reward_counter = 0

    # One-step-loop
    while not done:
        if preview:
            env.render()

        action, train_action = solution.get_action(state)

        # This will make steering much easier
        action /= 4
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Models action output has a different shape for this problem
        solution.learn(state, train_action, reward, new_state)
        state = new_state
        episode_reward += reward

        if reward < 0:
            no_reward_counter += 1
            if no_reward_counter > 200:
                break
        else:
            no_reward_counter = 0

    all_episode_reward.append(episode_reward)
    average_result = np.array(all_episode_reward[-10:]).mean()
    print('Last result:', episode_reward, 'Average results:', average_result)

    if episode_reward > best_result:
        print('Saving best solution')
        solution.save_solution()
        best_result = episode_reward



Last result: 72.38685121107413 Average results: 72.38685121107413
Saving best solution
Last result: 11.085829959513948 Average results: 41.73634058529404




Last result: 74.97152317880925 Average results: 52.814734783132444
Saving best solution
Last result: 1.6438202247190432 Average results: 40.02200614352909
Last result: -2.4721649484535533 Average results: 31.523171925132562
Last result: 1.8463917525773228 Average results: 26.577041896373355
Last result: -0.16811594202896854 Average results: 22.756305062315878
Last result: 14.47070063694244 Average results: 21.720604509144202
Last result: 2.0105691056907826 Average results: 19.530600575427158
Last result: 74.90793650793836 Average results: 25.068334168678277




Last result: 187.37500000000188 Average results: 36.56714904757105
Saving best solution




Last result: 525.1666666666599 Average results: 87.97523271828564
Saving best solution
Last result: 111.17992831541542 Average results: 91.59607323194626




Last result: 804.2553191489218 Average results: 171.85722312436653
Saving best solution
Last result: 190.348717948716 Average results: 191.1393114140835
Last result: 268.9547703180168 Average results: 217.85014927062744
Last result: 321.66101694914715 Average results: 250.03306255974508
Last result: 639.9999999999907 Average results: 312.58599249604987
Last result: 298.43037974683085 Average results: 342.2279735601639
Last result: 526.8115942028975 Average results: 387.4183393296598




Last result: 814.2857142857009 Average results: 450.10941075822967
Saving best solution
Last result: 258.53267326732214 Average results: 423.4460114182959
Last result: 387.42698961937134 Average results: 451.0707175486915
Last result: 266.97056856187044 Average results: 397.3422424899864
Last result: 424.9999999999869 Average results: 420.8073706951135
Last result: 396.3520547945139 Average results: 433.54709914276316
Last result: 71.00706713781038 Average results: 408.4817041616295
Last result: 127.19444444444727 Average results: 357.20114860607515
Last result: 494.0727272727173 Average results: 376.7653833586638
Last result: 473.77049180327384 Average results: 371.4612731187014
Last result: 530.7189542483519 Average results: 343.10459711496657
Last result: 437.9537953795305 Average results: 361.0467093261874
Last result: 248.43205574912963 Average results: 347.14721593916323
Last result: 98.15177304964685 Average results: 330.2653363879408
Last result: 133.07619047619286 Average resu

KeyboardInterrupt: 

In [7]:
# We can improve stability of solution with noise
noise_mean = np.array([0.0, -0.83], dtype=np.float32)
noise_std = np.array([0.0, 4 * 0.02], dtype=np.float32)
solution = BaseSolution(env.action_space, model_outputs=2, noise_mean=noise_mean, noise_std=noise_std)
solution.load_solution('models/')



#### Evaluate the model

In [8]:
n_episodes = 10
env = gym.make(problem, render_mode="human")
state, info = env.reset()
all_episode_reward = []

In [9]:
# Loop of episodes
for ie in range(n_episodes):
    state, info = env.reset()
    solution.reset()
    done = False
    episode_reward = 0
    no_reward_counter = 0

    # One-step-loop
    while not done:

        action, train_action = solution.get_action(state)

        # This will make steering much easier
        action /= 4
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        state = new_state
        episode_reward += reward

        if reward < 0:
            no_reward_counter += 1
            if no_reward_counter > 200:
                break
        else:
            no_reward_counter = 0

    all_episode_reward.append(episode_reward)
    average_result = np.array(all_episode_reward).mean()
    print('Last result:', episode_reward, 'Average results:', average_result)
env.close()


Last result: 178.26214689265865 Average results: 178.26214689265865
Last result: 685.7142857142807 Average results: 431.98821630346964


KeyboardInterrupt: 

: 

#### Exercise: Improve the DDPG Agent
1) This code does not use braking efficiently. Find out how and fix it to get a better agent.

2) This code does not make use of the `terminated` flag. Add it to the replay buffer and updates. Check if the model converges faster after doing so.

3) Evaluate the agent without noise.