# Deep Reinforcement Learning Laboratory

In this laboratory session we will work on getting more advanced versions of Deep Reinforcement Learning algorithms up and running. Deep Reinforcement Learning is **hard**, and getting agents to stably train can be frustrating and requires quite a bit of subtlety in analysis of intermediate results. We will start by refactoring (a bit) my implementation of `REINFORCE` on the [Cartpole environment](https://gymnasium.farama.org/environments/classic_control/cart_pole/).

## Exercise 1: Improving my `REINFORCE` Implementation (warm up)

In this exercise we will refactor a bit and improve some aspects of my `REINFORCE` implementation.

**First Things First**: Spend some time playing with the environment to make sure you understand how it works.

In [1]:
# Standard imports.
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Plus one non standard one -- we need this to sample from policies.
from torch.distributions import Categorical

#Tracking experiments
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: niccolo-arati (dla-labs). Use `wandb login --relogin` to force relogin


True

In [2]:
def temperature_scaled_softmax(logits, temperature):
    logits = logits / temperature
    return F.softmax(logits, dim=-1)

# A simple, but generic, policy network with one hidden layer.
class PolicyNet(nn.Module):
    def __init__(self, env, inner_size = 128, T=1.0):
        super().__init__()
        self.fc1 = nn.Linear(env.observation_space.shape[0], inner_size)
        self.fc2 = nn.Linear(inner_size, env.action_space.n)
        self.relu = nn.ReLU()
        self.temperature = T

    def forward(self, s):
        s = F.relu(self.fc1(s))
        s = temperature_scaled_softmax(self.fc2(s), self.temperature)
        return s

**Next Things Next**: Now get your `REINFORCE` implementation working on the environment. You can import my (probably buggy and definitely inefficient) implementation here. Or even better, refactor an implementation into a separate package from which you can `import` the stuff you need here.

In [3]:
class Reinforce:
    def __init__(self, policy, env, env_render=None, gamma=0.99, num_episodes=10, lr=1e-2,
                 max_len = 500, N = 100):
        self.policy = policy
        self.environment = env
        self.env_render = env_render
        self.gamma = gamma
        self.num_episodes = num_episodes
        self.learning_rate = lr
        self.max_len = max_len #per gli episodi
        self.N = N

    # Given an environment, observation, and policy, sample from pi(a | obs). Returns the
    # selected action and the log probability of that action (needed for policy gradient).
    def select_action(self, obs):
        dist = Categorical(self.policy(obs))
        action = dist.sample() 
        log_prob = dist.log_prob(action)
        return (action.item(), log_prob.reshape(1))

    def select_max_action(self, obs):
        probs = self.policy(obs)
        action = torch.argmax(probs)
        log_prob = torch.log(torch.max(probs))
        return(action.item(), log_prob.reshape(1))

    # Utility to compute the discounted total reward. Torch doesn't like flipped arrays, so we need to
    # .copy() the final numpy array. There's probably a better way to do this.
    def compute_returns(self, rewards):
        return np.flip(np.cumsum([self.gamma**(i+1)*r for (i, r) in enumerate(rewards)][::-1]), 0).copy()

    def setEnvRender(self, env_render):
        self.env_render = env_render
    
    def setPolicy(self, policy):
        self.policy = policy

    # Given an environment and a policy, run it up to the maximum number of steps.
    def run_episode(self, display=False, test=False):
        # Collect just about everything.
        observations = []
        actions = []
        log_probs = []
        rewards = []
        env = self.environment
        if display:
            env = self.env_render

        # Reset the environment and start the episode.
        (obs, info) = env.reset()
        for i in range(self.max_len):
            # Get the current observation, run the policy and select an action.
            obs = torch.tensor(obs)
            if test:
                (action, log_prob) = self.select_max_action(obs)
            else:
                (action, log_prob) = self.select_action(obs)
            observations.append(obs)
            actions.append(action)
            log_probs.append(log_prob)

            # Advance the episode by executing the selected action.
            (obs, reward, term, trunc, info) = env.step(action)
            rewards.append(reward)
            if term or trunc:
                break
        return (observations, actions, torch.cat(log_probs), rewards)

    # A direct, inefficient, and probably buggy of the REINFORCE policy gradient algorithm.
    def reinforce(self):
        # The only non-vanilla part: we use Adam instead of SGD.
        opt = torch.optim.Adam(self.policy.parameters(), lr=self.learning_rate)

        # Track episode rewards in a list.
        running_rewards = [0.0]

        # The main training loop.
        self.policy.train()
        for episode in range(self.num_episodes):
            # Run an episode of the environment, collect everything needed for policy update.
            (observations, actions, log_probs, rewards) = self.run_episode()

            # Compute the discounted reward for every step of the episode.
            returns = torch.tensor(self.compute_returns(rewards), dtype=torch.float32)

            # Keep a running average of total discounted rewards for the whole episode.
            running_reward = 0.05 * returns[0].item() + 0.95 * running_rewards[-1]
            running_rewards.append(running_reward)

            # Standardize returns.
            returns = (returns - returns.mean()) / returns.std()

            # Make an optimization step
            opt.zero_grad()
            loss = (-log_probs * returns).mean()
            loss.backward()
            opt.step()

            metrics = {"Policy Loss": loss,
                       "Running Reward": running_reward}
            wandb.log({**metrics}) 

            # Render an episode after every 100 policy updates.
            if not episode % self.N:
                self.policy.eval()
                (obs, _, _, _) = self.run_episode(display=True)
                self.policy.train()
                print(f'Running reward: {running_rewards[-1]}')

        # Return the running rewards.
        self.policy.eval()
        return (running_rewards, self.policy.state_dict())

Run with initial algorithm (2000 episodes, gamma = 0.99, temperature = 20)

In [4]:
# Your code here. You should be able to train an agent to solve Cartpole. This will be our starting point.

n_run = 5
seeds = [11, 111, 1111, 11111, 111111]
val_seeds = [22, 222, 22222, 22222, 222222]

state_dicts = []
for i in range(n_run):

    #Instantiate a rendering and a non rendering environment
    env_render = gym.make('CartPole-v1', render_mode='human')
    env = gym.make('CartPole-v1')

    torch.manual_seed(seeds[i])
    env_render.reset(seed = val_seeds[i])
    env.reset(seed = seeds[i])

    #track run
    run = wandb.init(
          # Set the project where this run will be logged
          project="Lab3-Final",
          # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
          name=f"Reinforce",
          # Track hyperparameters and run metadata
          config={
          "learning_rate": 1e-2,
          "architecture": "REINFORCE",
          "dataset": "CartPole",
          "hidden_layer_size": 128,
          "episodes": 2000,
          "gamma": 0.99,
          "episode_max_len": 500,
          "N": 100,
          "temperature": 20})

    # Make a policy network.
    policy = PolicyNet(env, inner_size=run.config["hidden_layer_size"], T=run.config["temperature"])

    # Train the agent.
    r = Reinforce(policy, env, env_render, gamma=run.config["gamma"], num_episodes=run.config["episodes"],
                  lr=run.config["learning_rate"], max_len=run.config["episode_max_len"], N=run.config["N"])
    (rewards, state_dict) = r.reinforce()
    
    state_dicts.append(state_dict)

    # Close up everything
    env_render.close()
    env.close()



Running reward: 0.6497082233428956
Running reward: 23.190019938114148
Running reward: 34.2965949373752
Running reward: 82.70511604452692
Running reward: 93.05599589222612
Running reward: 97.93792739779911
Running reward: 97.94823068770908
Running reward: 97.4321737049628
Running reward: 97.27579236350714
Running reward: 98.13662530975331
Running reward: 98.19534296458968
Running reward: 98.33517156102218
Running reward: 98.26824690796535
Running reward: 95.66775462775256
Running reward: 97.85358418639588
Running reward: 93.03835264075786
Running reward: 98.25870593512299
Running reward: 98.34898775205852
Running reward: 98.34952226819374
Running reward: 96.9865712025036


0,1
Policy Loss,▄▃▂▃▄▂▃▂▆▄▅▃▂▄▅▃▄▅▅▁█▄▃▁▅▄▄▃▂▄▃▃▅▄▆▂▄▄▄▁
Running Reward,▁▂▂▂▃▅▇█████████████████████████████████

0,1
Policy Loss,-0.00899
Running Reward,98.34103


Running reward: 0.42808961868286133
Running reward: 25.863219355172127
Running reward: 48.50486924684883
Running reward: 82.82855868124207
Running reward: 94.0840698403936
Running reward: 96.56840462011412
Running reward: 98.00008763034171
Running reward: 77.04336354226602
Running reward: 96.82797656644448
Running reward: 97.15818445644356
Running reward: 98.02453171861248
Running reward: 98.34760131676703
Running reward: 98.08222405243808
Running reward: 98.31886604104206
Running reward: 98.3493439317235
Running reward: 98.34952437696586
Running reward: 98.1931484449106
Running reward: 98.34717201039915
Running reward: 98.2168495838246
Running reward: 96.20843634915487


0,1
Policy Loss,▅▃▁▂▆▅▃▅▅▄▅▄▆▄▄▆▄▅▅▅▆▄▇▄▄█▅▄▄▄▅▄▅▅▄▄▅▄▇▇
Running Reward,▁▂▂▃▄▆▇██▇███▇▇█████████████████████████

0,1
Policy Loss,-0.01536
Running Reward,97.1594


Running reward: 1.0608932495117187
Running reward: 21.76099091345924
Running reward: 57.34512706294905
Running reward: 91.65530371034195
Running reward: 95.37254665452285
Running reward: 90.49952093055856
Running reward: 97.88843788111278
Running reward: 98.32094715222074
Running reward: 96.65827759649117
Running reward: 98.12478714637611
Running reward: 98.34819488195659
Running reward: 97.59485343813407
Running reward: 91.70991170071443
Running reward: 82.90690033189746
Running reward: 73.33164629858186
Running reward: 78.9565067622796
Running reward: 71.85287770872525
Running reward: 86.06232391771704
Running reward: 97.87107920308149
Running reward: 97.88316670816815


0,1
Policy Loss,▂▁▄▅▂▁▁▁▂▂▅▃▃▂▃▃▃▃▂▃▃▃▄▃▃▂▂▃▂▃▅▅█▂▄▃▃▃▂▃
Running Reward,▁▂▂▄▅▇███▇██████████████▇▇▇▆▆▇▆▆▆▇██████

0,1
Policy Loss,-0.00332
Running Reward,97.88481


Running reward: 0.7774312019348145
Running reward: 21.03520649842848
Running reward: 46.62653593972456
Running reward: 92.65362679618268
Running reward: 96.47210269742361
Running reward: 95.46401282522424
Running reward: 98.15981128381767
Running reward: 97.41286383968063
Running reward: 98.32258456860892
Running reward: 98.06122814343607
Running reward: 98.17195798977701
Running reward: 98.22462588853817
Running reward: 98.19977529855937
Running reward: 98.34863885150278
Running reward: 98.34952020251781
Running reward: 98.3264891936881
Running reward: 98.25080875853736
Running reward: 97.74985769077867
Running reward: 98.34434074889974
Running reward: 96.50468099790825


0,1
Policy Loss,▃█▁▃▄▂▃▂▃▂▃▂▃▃▆▂▃▃▄▃▅▂▅▃▄▂▃▃▄▂▃▂▃▃▃▂▃▁▄▂
Running Reward,▁▂▂▃▄▆██████████████████████████████████

0,1
Policy Loss,0.00449
Running Reward,98.30753


Running reward: 0.7774312019348145
Running reward: 28.666278900224185
Running reward: 68.36353509333874
Running reward: 95.22789090103967
Running reward: 95.40643687839002
Running reward: 97.07285936001247
Running reward: 96.68920068707166
Running reward: 98.30876741683528
Running reward: 98.31645413151655
Running reward: 98.00210736414499
Running reward: 98.34746855272117
Running reward: 97.74791352321091
Running reward: 98.07643203330349
Running reward: 98.31865023424464
Running reward: 98.3493426540331
Running reward: 98.25732959429939
Running reward: 96.06293294061365
Running reward: 98.0593281013731
Running reward: 98.10253615536354
Running reward: 96.66928990583492


In [5]:
# And run the final agent for a few episodes.
env_render = gym.make('CartPole-v1', render_mode='human')
env_render.reset(seed = 100)
r.setEnvRender(env_render)

average_test_rewards = []
for state_dict in state_dicts:
    total_rewards = 0
    policy = PolicyNet(env = env_render)
    policy.load_state_dict(state_dict)
    r.setPolicy(policy)
    for _ in range(20):
        (_, _, _, rewards) = r.run_episode(display=True, test=True)
        total_rewards += np.sum(rewards)
    average_test_reward = total_rewards / 20
    print(f'Average reward for episode: {average_test_reward}')
    average_test_rewards.append(average_test_reward)
avg_test_rew = np.sum(average_test_rewards) / 5
print(f'Total Average reward for test episode: {avg_test_rew}')
test_metrics = {"Total average test reward": avg_test_rew}
wandb.log({**test_metrics})
env_render.close()
wandb.finish()

Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Total Average reward for test episode: 500.0


0,1
Policy Loss,▆▄▁▅▃▃▆▇▅▆▇▆▄▅▅▇▆▆▇▇▆▇▆▅▆▅▄▃▄▇▅▅▇▇█▆█▆▆▄
Running Reward,▁▂▃▄▆███████████████████████████████████
Total average test reward,▁

0,1
Policy Loss,-0.00856
Running Reward,98.33905
Total average test reward,500.0


**Last Things Last**: My implementation does a **super crappy** job of evaluating the agent performance during training. The running average is not a very good metric. Modify my implementation so that every $N$ iterations (make $N$ an argument to the training function) the agent is run for $M$ episodes in the environment. Collect and return: (1) The average **total** reward received over the $M$ iterations; and (2) the average episode length. Analyze the performance of your agents with these new metrics.

In [6]:
class ReinforceAvg(Reinforce):
    def __init__(self, policy, env, env_render=None, gamma=0.99, num_episodes=10, lr=1e-2,
                 max_len=500, N=100, eval_episodes=10):
        super().__init__(policy, env, env_render, gamma, num_episodes, lr, max_len, N)
        self.M = eval_episodes

    def run_episode(self, display=False, test=False):
        # Collect just about everything.
        observations = []
        actions = []
        log_probs = []
        rewards = []
        env = self.environment
        if display:
            env = self.env_render

        # Reset the environment and start the episode.
        (obs, info) = env.reset()
        for i in range(self.max_len):
            # Get the current observation, run the policy and select an action.
            obs = torch.tensor(obs)
            if test:
                (action,log_prob) = self.select_max_action(obs)
            else:
                (action, log_prob) = self.select_action(obs)
            observations.append(obs)
            actions.append(action)
            log_probs.append(log_prob)

            # Advance the episode by executing the selected action.
            (obs, reward, term, trunc, info) = env.step(action)
            rewards.append(reward)
            if term or trunc:
                break
        length = i + 1
        return (observations, actions, torch.cat(log_probs), rewards, length)

    def reinforce(self):
        # The only non-vanilla part: we use Adam instead of SGD.
        opt = torch.optim.Adam(self.policy.parameters(), lr=self.learning_rate)

        # Track episode rewards in a list.
        running_rewards = [0.0]
        average_rewards = []
        average_lengths = []

        # The main training loop.
        self.policy.train()
        state_dict = None
        best_reward = 0
        for episode in range(self.num_episodes):
            # Run an episode of the environment, collect everything needed for policy update.
            (observations, actions, log_probs, rewards, length) = self.run_episode()

            # Compute the discounted reward for every step of the episode.
            returns = torch.tensor(self.compute_returns(rewards), dtype=torch.float32)

            # Keep a running average of total discounted rewards for the whole episode.
            running_reward = 0.05 * returns[0].item() + 0.95 * running_rewards[-1]
            running_rewards.append(running_reward)

            # Standardize returns.
            returns = (returns - returns.mean()) / returns.std()

            # Make an optimization step
            opt.zero_grad()
            loss = (-log_probs * returns).mean()
            loss.backward()
            opt.step()

            metrics = {"Policy Loss": loss,
                       "Running Reward": running_reward}
            wandb.log({**metrics})

            # Render an episode after every 100 policy updates.
            if not episode % self.N:
                self.policy.eval()
                total_reward = 0
                total_length = 0
                for _ in range(self.M):
                    (_, _, _, rewards, length) = self.run_episode()
                    total_reward += np.sum(rewards)
                    total_length += length
                average_reward = total_reward / self.M
                average_rewards.append(average_reward)
                print(f'Average Total: {average_reward}')
                average_length = total_length / self.M
                average_lengths.append(average_length)
                print(f'Average Length: {average_length}')

                val_metrics = {"Average Total Reward": average_reward,
                               "Average Length": average_length}
                wandb.log({**val_metrics})

                if average_reward >= best_reward:
                    best_reward = average_reward
                    state_dict = self.policy.state_dict()

                (obs, _, _, _, _) = self.run_episode(display=True)
                self.policy.train()
                print(f'Running reward: {running_rewards[-1]}')

        # Return the running rewards.
        self.policy.eval()
        return (running_rewards, average_rewards, average_lengths, state_dict)

Runs varying hyperparameters, with validation and average total reward and length for episode

In [7]:
state_dicts = []
for i in range(n_run):

    #Instantiate a rendering and a non rendering environment
    env_render = gym.make('CartPole-v1', render_mode='human')
    env = gym.make('CartPole-v1')

    torch.manual_seed(seeds[i])
    env_render.reset(seed = val_seeds[i])
    env.reset(seed = seeds[i])

    run = wandb.init(
          # Set the project where this run will be logged
          project="Lab3-Final",
          # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
          name=f"ReinforceAvg lower gamma",
          # Track hyperparameters and run metadata
          config={
          "learning_rate": 1e-2,
          "architecture": "REINFORCE_AVG",
          "dataset": "CartPole",
          "hidden_layer_size": 128,
          "episodes": 2000,
          "gamma": 0.92,
          "episode_max_len": 500,
          "N": 100,
          "temperature": 20,
          "M": 10})
    
    # Make a policy network.
    policy = PolicyNet(env, inner_size=run.config["hidden_layer_size"], T=run.config["temperature"])

    # Train the agent.
    r = ReinforceAvg(policy, env, env_render, gamma=run.config["gamma"], num_episodes=run.config["episodes"],
                  lr=run.config["learning_rate"], max_len=run.config["episode_max_len"],
                     N=run.config["N"], eval_episodes=run.config["M"])
    (total, average, length, state_dict) = r.reinforce()

    state_dicts.append(state_dict)

    # Close up everything
    env_render.close()
    env.close()

#provare anche a diminuire hidden layer size

Average Total: 26.1
Average Length: 26.1
Running reward: 0.3960641145706177
Average Total: 29.6
Average Length: 29.6
Running reward: 9.704630357956068
Average Total: 64.1
Average Length: 64.1
Running reward: 10.767900320285971
Average Total: 128.3
Average Length: 128.3
Running reward: 11.436967272403658
Average Total: 353.0
Average Length: 353.0
Running reward: 11.492468479901543
Average Total: 172.9
Average Length: 172.9
Running reward: 11.424512905761132
Average Total: 500.0
Average Length: 500.0
Running reward: 11.495140430880197
Average Total: 500.0
Average Length: 500.0
Running reward: 11.499887142211803
Average Total: 413.6
Average Length: 413.6
Running reward: 11.496445273747195
Average Total: 500.0
Average Length: 500.0
Running reward: 11.49540190266793
Average Total: 500.0
Average Length: 500.0
Running reward: 11.499972776830365
Average Total: 500.0
Average Length: 500.0
Running reward: 11.499999838824404
Average Total: 500.0
Average Length: 500.0
Running reward: 11.4999999990

0,1
Average Length,▁▁▂▃▆▃██▇███████████
Average Total Reward,▁▁▂▃▆▃██▇███████████
Policy Loss,▆▆▅▁▃▃▇▄▅▄▅▆▆▆▆▆▅▄█▅▅▆▄▆▅▅▆▆▄▆▅▅▅▅▄▅▆▅▆▇
Running Reward,▁▆▆▆▇███████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Policy Loss,-0.01677
Running Reward,11.5


Average Total: 22.0
Average Length: 22.0
Running reward: 0.3035072088241577
Average Total: 29.9
Average Length: 29.9
Running reward: 9.314940477113208
Average Total: 74.8
Average Length: 74.8
Running reward: 11.156722728170807
Average Total: 108.9
Average Length: 108.9
Running reward: 11.430517254837257
Average Total: 157.2
Average Length: 157.2
Running reward: 11.492831622802584
Average Total: 281.7
Average Length: 281.7
Running reward: 11.499814069222595
Average Total: 194.8
Average Length: 194.8
Running reward: 11.484491454387348
Average Total: 500.0
Average Length: 500.0
Running reward: 11.499129462451933
Average Total: 500.0
Average Length: 500.0
Running reward: 11.499994845956985
Average Total: 500.0
Average Length: 500.0
Running reward: 11.499999969485314
Average Total: 477.0
Average Length: 477.0
Running reward: 11.499999999819313
Average Total: 497.1
Average Length: 497.1
Running reward: 11.499999999998908
Average Total: 480.5
Average Length: 480.5
Running reward: 11.493494939

0,1
Average Length,▁▁▂▂▃▅▄█████████████
Average Total Reward,▁▁▂▂▃▅▄█████████████
Policy Loss,▅▃▅▄▁▇▄▅▇▅▂▇▃▄▆█▅▄▅▇▅▄▄▄▄▄▆▄▇▅▄▅▆▃▄▅▃▅▄▅
Running Reward,▁▅▆▇████████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Policy Loss,-0.01291
Running Reward,11.5


Average Total: 23.9
Average Length: 23.9
Running reward: 0.4972723007202149
Average Total: 25.5
Average Length: 25.5
Running reward: 9.464758245217674
Average Total: 60.5
Average Length: 60.5
Running reward: 10.900748623670577
Average Total: 130.1
Average Length: 130.1
Running reward: 11.384159032809848
Average Total: 313.1
Average Length: 313.1
Running reward: 11.494048335591792
Average Total: 111.9
Average Length: 111.9
Running reward: 11.471558266447634
Average Total: 328.9
Average Length: 328.9
Running reward: 11.433333214352109
Average Total: 461.1
Average Length: 461.1
Running reward: 11.494848494083826
Average Total: 500.0
Average Length: 500.0
Running reward: 11.49996933630357
Average Total: 318.0
Average Length: 318.0
Running reward: 11.499999682556076
Average Total: 490.0
Average Length: 490.0
Running reward: 11.499999969724346
Average Total: 432.0
Average Length: 432.0
Running reward: 11.499951406161763
Average Total: 455.9
Average Length: 455.9
Running reward: 11.4999925745

0,1
Average Length,▁▁▂▃▅▂▅▇█▅█▇▇▃█▇▇█▆█
Average Total Reward,▁▁▂▃▅▂▅▇█▅█▇▇▃█▇▇█▆█
Policy Loss,▇▇▆▇▁▇▇█▆▇▆▆▇▇▇▇▇▆▇▇█▇▇▇▇▇▆▇▇██▇▇▇▇▇▆▇▇▇
Running Reward,▁▅▆▆▇███████████████████████████████████

0,1
Average Length,466.1
Average Total Reward,466.1
Policy Loss,0.00253
Running Reward,11.5


Average Total: 21.2
Average Length: 21.2
Running reward: 0.43566479682922366
Average Total: 49.9
Average Length: 49.9
Running reward: 9.872335727198276
Average Total: 77.9
Average Length: 77.9
Running reward: 11.332912167667539
Average Total: 260.9
Average Length: 260.9
Running reward: 11.43249004610792
Average Total: 500.0
Average Length: 500.0
Running reward: 11.483380017838329
Average Total: 412.8
Average Length: 412.8
Running reward: 11.499685875702824
Average Total: 339.2
Average Length: 339.2
Running reward: 11.49960380057938
Average Total: 410.3
Average Length: 410.3
Running reward: 11.474441784154077
Average Total: 481.2
Average Length: 481.2
Running reward: 11.499836115357384
Average Total: 483.5
Average Length: 483.5
Running reward: 11.499982966908515
Average Total: 500.0
Average Length: 500.0
Running reward: 11.498210679960195
Average Total: 476.5
Average Length: 476.5
Running reward: 11.452192032789084
Average Total: 500.0
Average Length: 500.0
Running reward: 11.4997169515

0,1
Average Length,▁▁▂▅█▇▆▇█████▇▇█████
Average Total Reward,▁▁▂▅█▇▆▇█████▇▇█████
Policy Loss,▄█▁▃▅▂▅▄▄▄▄▅▄▄▅▅▅▃▅▆▄▄▄▄▅▆▅▄▆▅▄▃▄▄▄▅▃▄▄▄
Running Reward,▁▅▆▇████████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Policy Loss,-0.0105
Running Reward,11.49996


Average Total: 31.9
Average Length: 31.9
Running reward: 0.43566479682922366
Average Total: 24.8
Average Length: 24.8
Running reward: 9.166403752386982
Average Total: 44.4
Average Length: 44.4
Running reward: 11.008327461913176
Average Total: 116.6
Average Length: 116.6
Running reward: 11.39817018599383
Average Total: 297.1
Average Length: 297.1
Running reward: 11.49307012171752
Average Total: 395.6
Average Length: 395.6
Running reward: 11.498879004876944
Average Total: 135.7
Average Length: 135.7
Running reward: 11.270078701480966
Average Total: 386.0
Average Length: 386.0
Running reward: 11.49308857102441
Average Total: 283.2
Average Length: 283.2
Running reward: 11.446263014114743
Average Total: 469.0
Average Length: 469.0
Running reward: 11.499681848604828
Average Total: 500.0
Average Length: 500.0
Running reward: 11.49984391328684
Average Total: 462.4
Average Length: 462.4
Running reward: 11.49999907588403
Average Total: 487.8
Average Length: 487.8
Running reward: 11.4999991501796

In [8]:
# And run the final agent for a few episodes.
env_render = gym.make('CartPole-v1', render_mode='human')
env_render.reset(seed = 100)
r.setEnvRender(env_render)

average_test_rewards = []
for state_dict in state_dicts:
    total_rewards = 0
    policy = PolicyNet(env = env_render)
    policy.load_state_dict(state_dict)
    r.setPolicy(policy)
    for _ in range(20):
        (_, _, _, rewards, _) = r.run_episode(display=True, test=True)
        total_rewards += np.sum(rewards)
    average_test_reward = total_rewards / 20
    print(f'Average reward for episode: {average_test_reward}')
    average_test_rewards.append(average_test_reward)
avg_test_rew = np.sum(average_test_rewards) / 5
print(f'Total Average reward for test episode: {avg_test_rew}')
test_metrics = {"Total average test reward": avg_test_rew}
wandb.log({**test_metrics})
env_render.close()
wandb.finish()

Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Total Average reward for test episode: 500.0


0,1
Average Length,▁▁▁▂▅▆▃▆▅██▇███▆███▅
Average Total Reward,▁▁▁▂▅▆▃▆▅██▇███▆███▅
Policy Loss,▄▅▄▅█▅▆▄▄▅▄▄▁▄▄▅▄▄▄▅▅▄▄▄▄▄▄▄▅▄▄▄▄▅▄▅▄▄▄▅
Running Reward,▁▅▆▇▇███████████████████████████████████
Total average test reward,▁

0,1
Average Length,325.8
Average Total Reward,325.8
Policy Loss,-0.0021
Running Reward,11.49985
Total average test reward,500.0


In [9]:
state_dicts = []
for i in range(n_run):

    #Instantiate a rendering and a non rendering environment
    env_render = gym.make('CartPole-v1', render_mode='human')
    env = gym.make('CartPole-v1')

    torch.manual_seed(seeds[i])
    env_render.reset(seed = val_seeds[i])
    env.reset(seed = seeds[i])

    run = wandb.init(
          # Set the project where this run will be logged
          project="Lab3-Final",
          # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
          name=f"ReinforceAvg lower T",
          # Track hyperparameters and run metadata
          config={
          "learning_rate": 1e-2,
          "architecture": "REINFORCE_AVG",
          "dataset": "CartPole",
          "hidden_layer_size": 128,
          "episodes": 2000,
          "gamma": 0.99,
          "episode_max_len": 500,
          "N": 100,
          "temperature": 10,
          "M": 10})
    
    # Make a policy network.
    policy = PolicyNet(env, inner_size=run.config["hidden_layer_size"] , T=run.config["temperature"])

    # Train the agent.
    r = ReinforceAvg(policy, env, env_render, gamma=run.config["gamma"], num_episodes=run.config["episodes"],
                  lr=run.config["learning_rate"], max_len=run.config["episode_max_len"],
                     N=run.config["N"], eval_episodes=run.config["M"])
    (total, average, length, state_dict) = r.reinforce()

    state_dicts.append(state_dict)

    # Close up everything
    env_render.close()
    env.close()

Average Total: 26.1
Average Length: 26.1
Running reward: 0.6497082233428956
Average Total: 59.4
Average Length: 59.4
Running reward: 39.347795523237224
Average Total: 424.0
Average Length: 424.0
Running reward: 87.32869074964754
Average Total: 461.2
Average Length: 461.2
Running reward: 97.4552843648731
Average Total: 500.0
Average Length: 500.0
Running reward: 96.7742111894198
Average Total: 500.0
Average Length: 500.0
Running reward: 98.20799809789683
Average Total: 500.0
Average Length: 500.0
Running reward: 98.2133700963476
Average Total: 500.0
Average Length: 500.0
Running reward: 98.1758297435418
Average Total: 500.0
Average Length: 500.0
Running reward: 98.22037771741685
Average Total: 500.0
Average Length: 500.0
Running reward: 97.52594265140341
Average Total: 500.0
Average Length: 500.0
Running reward: 96.32728488819339
Average Total: 500.0
Average Length: 500.0
Running reward: 98.19207783210567
Average Total: 500.0
Average Length: 500.0
Running reward: 98.04145270893343
Avera

0,1
Average Length,▁▁▇▇██████████▇█████
Average Total Reward,▁▁▇▇██████████▇█████
Policy Loss,▆▇▆▆▅▅▆▆▄▅▄█▁▅▇▅▅▇▆█▃▆▅▇▂▅▄▃▄▅▆▆▇▃▄▅▇█▆▅
Running Reward,▁▂▄▆▇███████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Policy Loss,0.00079
Running Reward,98.34928


Average Total: 22.2
Average Length: 22.2
Running reward: 0.42808961868286133
Average Total: 18.6
Average Length: 18.6
Running reward: 16.36440253796899
Average Total: 31.2
Average Length: 31.2
Running reward: 26.66849904100528
Average Total: 87.1
Average Length: 87.1
Running reward: 53.85440465697726
Average Total: 328.4
Average Length: 328.4
Running reward: 89.97042449802098
Average Total: 223.9
Average Length: 223.9
Running reward: 88.02701860712041
Average Total: 445.7
Average Length: 445.7
Running reward: 97.33720978096878
Average Total: 187.5
Average Length: 187.5
Running reward: 86.99964191210253
Average Total: 76.2
Average Length: 76.2
Running reward: 66.83412140993117
Average Total: 188.2
Average Length: 188.2
Running reward: 82.48210826221373
Average Total: 283.7
Average Length: 283.7
Running reward: 94.00736088958236
Average Total: 205.0
Average Length: 205.0
Running reward: 94.50863231312991
Average Total: 280.2
Average Length: 280.2
Running reward: 87.31770852837317
Average

0,1
Average Length,▁▁▁▂▆▄▇▃▂▃▅▄▅███████
Average Total Reward,▁▁▁▂▆▄▇▃▂▃▅▄▅███████
Policy Loss,▆▃▁█▄▆▄▂▅▃▅▅▄▃▆▄▄▃▄▅▆▃▃▅▄▅▆▅▄▃▅▄▅▅▅▄▅▆▆▃
Running Reward,▁▂▂▂▂▄▅▇█▇▇██▇▇▆▅▆▆▇██▇▇████▇███████████

0,1
Average Length,500.0
Average Total Reward,500.0
Policy Loss,-0.02959
Running Reward,97.61093


Average Total: 25.3
Average Length: 25.3
Running reward: 1.0608932495117187
Average Total: 104.3
Average Length: 104.3
Running reward: 47.828443612874274
Average Total: 455.7
Average Length: 455.7
Running reward: 91.52262721604303
Average Total: 434.0
Average Length: 434.0
Running reward: 94.4630668193838
Average Total: 143.3
Average Length: 143.3
Running reward: 80.96704893988772
Average Total: 189.7
Average Length: 189.7
Running reward: 75.95496642354303
Average Total: 500.0
Average Length: 500.0
Running reward: 98.12179780283537
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34817718346082
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34951746919869
Average Total: 500.0
Average Length: 500.0
Running reward: 98.33790851137228
Average Total: 219.7
Average Length: 219.7
Running reward: 72.32367070855146
Average Total: 500.0
Average Length: 500.0
Running reward: 98.18155952251425
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34853100446843


0,1
Average Length,▁▂▇▇▃▃████▄████▄████
Average Total Reward,▁▂▇▇▃▃████▄████▄████
Policy Loss,▂▃█▅▃▃▅▆▃▃▅▆▃▁▄▅▇▆▂▄▁▄▅▄▃▃▄▄▂▂▄▃▃▆▁▃▂▁▂▃
Running Reward,▁▃▄▆████▆▆▇████████▆▇█████████▇█████████

0,1
Average Length,500.0
Average Total Reward,500.0
Policy Loss,0.01534
Running Reward,98.34953


Average Total: 17.6
Average Length: 17.6
Running reward: 0.7774312019348145
Average Total: 63.4
Average Length: 63.4
Running reward: 30.911007662428396
Average Total: 280.6
Average Length: 280.6
Running reward: 82.61082721961085
Average Total: 500.0
Average Length: 500.0
Running reward: 97.22373255041809
Average Total: 435.9
Average Length: 435.9
Running reward: 97.48509009859872
Average Total: 450.1
Average Length: 450.1
Running reward: 97.15320673544689
Average Total: 500.0
Average Length: 500.0
Running reward: 95.5736432886379
Average Total: 500.0
Average Length: 500.0
Running reward: 98.33309076020156
Average Total: 484.9
Average Length: 484.9
Running reward: 98.34942814958896
Average Total: 500.0
Average Length: 500.0
Running reward: 98.33157593908335
Average Total: 500.0
Average Length: 500.0
Running reward: 98.23227878866895
Average Total: 500.0
Average Length: 500.0
Running reward: 94.51695138562508
Average Total: 500.0
Average Length: 500.0
Running reward: 96.59801497903715
Av

0,1
Average Length,▁▂▅█▇▇█████████▇▂██▄
Average Total Reward,▁▂▅█▇▇█████████▇▂██▄
Policy Loss,▇▂▁▅▆█▇▇▆▇▆▆▆▆▅▅▅▇▆▆▅▄▆▆▅▆▄█▆▅█▅▆▅▆▇▆▇▆▇
Running Reward,▁▂▃▅▇████▇█▇██████████████████▆▅▆█████▇▇

0,1
Average Length,234.3
Average Total Reward,234.3
Policy Loss,0.00441
Running Reward,94.96179


Average Total: 31.9
Average Length: 31.9
Running reward: 0.7774312019348145
Average Total: 28.4
Average Length: 28.4
Running reward: 24.377308287412447
Average Total: 188.0
Average Length: 188.0
Running reward: 78.70860525450098
Average Total: 404.9
Average Length: 404.9
Running reward: 95.86506714537013
Average Total: 435.8
Average Length: 435.8
Running reward: 95.14123001529926
Average Total: 500.0
Average Length: 500.0
Running reward: 98.30091003541413
Average Total: 466.5
Average Length: 466.5
Running reward: 97.98720888660506
Average Total: 291.0
Average Length: 291.0
Running reward: 88.35950526231758
Average Total: 168.9
Average Length: 168.9
Running reward: 68.2593934495506
Average Total: 403.0
Average Length: 403.0
Running reward: 97.63266507433812
Average Total: 500.0
Average Length: 500.0
Running reward: 95.00130962151316
Average Total: 402.7
Average Length: 402.7
Running reward: 97.11364544533767
Average Total: 318.6
Average Length: 318.6
Running reward: 95.96090992142119
Av

In [10]:
# And run the final agent for a few episodes.
env_render = gym.make('CartPole-v1', render_mode='human')
env_render.reset(seed = 100)
r.setEnvRender(env_render)

average_test_rewards = []
for state_dict in state_dicts:
    total_rewards = 0
    policy = PolicyNet(env = env_render)
    policy.load_state_dict(state_dict)
    r.setPolicy(policy)
    for _ in range(20):
        (_, _, _, rewards, _) = r.run_episode(display=True, test=True)
        total_rewards += np.sum(rewards)
    average_test_reward = total_rewards / 20
    print(f'Average reward for episode: {average_test_reward}')
    average_test_rewards.append(average_test_reward)
avg_test_rew = np.sum(average_test_rewards) / 5
print(f'Total Average reward for test episode: {avg_test_rew}')
test_metrics = {"Total average test reward": avg_test_rew}
wandb.log({**test_metrics})
env_render.close()
wandb.finish()

Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Total Average reward for test episode: 500.0


0,1
Average Length,▁▁▃▇▇██▅▃▇█▇▅███████
Average Total Reward,▁▁▃▇▇██▅▃▇█▇▅███████
Policy Loss,▆▅▅▇▆▆▇▆▆▄▇█▆▇▅▁▆▆▃▇▇▅▇▆▄▅▆▅▇▅▆▄▅▅▅▅▆▆▇▄
Running Reward,▁▂▃▅▇██▇██████▇▇▇██▇████████████████████
Total average test reward,▁

0,1
Average Length,500.0
Average Total Reward,500.0
Policy Loss,0.00965
Running Reward,98.29517
Total average test reward,500.0


-----
## Exercise 2: `REINFORCE` with a Value Baseline (warm up)

In this exercise we will augment my implementation (or your own) of `REINFORCE` to subtract a baseline from the target in the update equation in order to stabilize (and hopefully speed-up) convergence. For now we will stick to the Cartpole environment.



In [11]:
class BaselineNet(nn.Module):
    def __init__(self, env, inner_size=128):
        super().__init__()
        self.fc1 = nn.Linear(env.observation_space.shape[0], inner_size)
        self.fc2 = nn.Linear(inner_size, 1)
        self.relu = nn.ReLU()

    def forward(self, s):
        s = F.relu(self.fc1(s))
        s = self.fc2(s)
        return s

**First Things First**: Recall from the slides on Deep Reinforcement Learning that we can **subtract** any function that doesn't depend on the current action from the q-value without changing the (maximum of our) objecttive function $J$:  

$$ \nabla J(\boldsymbol{\theta}) \propto \sum_{s} \mu(s) \sum_a \left( q_{\pi}(s, a) - b(s) \right) \nabla \pi(a \mid s, \boldsymbol{\theta}) $$

In `REINFORCE` this means we can subtract from our target $G_t$:

$$ \boldsymbol{\theta}_{t+1} \triangleq \boldsymbol{\theta}_t + \alpha (G_t - b(S_t)) \frac{\nabla \pi(A_t \mid s, \boldsymbol{\theta})}{\pi(A_t \mid s, \boldsymbol{\theta})} $$

Since we are only interested in the **maximum** of our objective, we can also **rescale** our target by any function that also doesn't depend on the action. A **simple baseline** which is even independent of the state -- that is, it is **constant** for each episode -- is to just **standardize rewards within the episode**. So, we **subtract** the average return and **divide** by the variance of returns:

$$ \boldsymbol{\theta}_{t+1} \triangleq \boldsymbol{\theta}_t + \alpha \left(\frac{G_t - \bar{G}}{\sigma_G}\right) \nabla  \pi(A_t \mid s, \boldsymbol{\theta}) $$

This baseline is **already** implemented in my implementation of `REINFORCE`. Experiment with and without this standardization baseline and compare the performance. We are going to do something more interesting.

In [12]:
# Your code here. Modify your implementation of `REINFORCE` to optionally use the standardize baseline.

class ReinforceStd(ReinforceAvg):
    def __init__(self, policy, env, env_render=None, gamma=0.99, num_episodes=10, lr=1e-2,
                 max_len=500, N=100, eval_episodes=10, baseline=None):
        super().__init__(policy, env, env_render, gamma, num_episodes, lr, max_len, N, eval_episodes)
        self.baseline = baseline

    # A direct, inefficient, and probably buggy of the REINFORCE policy gradient algorithm.
    def reinforce(self):
        # The only non-vanilla part: we use Adam instead of SGD.
        opt = torch.optim.Adam(self.policy.parameters(), lr=self.learning_rate)

        # If we have a baseline network, create the optimizer.
        if self.baseline == 'std':
            print('Training agent with standardization baseline.')
        else:
            print('Training agent with no baseline.')

        # Track episode rewards in a list.
        running_rewards = [0.0]
        average_rewards = []
        average_lengths = []

        # The main training loop.
        self.policy.train()
        state_dict = None
        best_reward = 0
        for episode in range(self.num_episodes):
            # Run an episode of the environment, collect everything needed for policy update.
            (observations, actions, log_probs, rewards, length) = self.run_episode()

            # Compute the discounted reward for every step of the episode.
            returns = torch.tensor(self.compute_returns(rewards), dtype=torch.float32)

            # Keep a running average of total discounted rewards for the whole episode.
            running_reward = 0.05 * returns[0].item() + 0.95 * running_rewards[-1]
            running_rewards.append(running_reward)

            # Handle baseline.
            if self.baseline == 'std':
                target = (returns - returns.mean()) / returns.std()
            else:
                target = returns

            # Make an optimization step
            opt.zero_grad()

            # Update policy network
            loss = (-log_probs * target).mean()
            loss.backward()
            opt.step()

            metrics = {"Policy Loss": loss,
                       "Running Reward": running_reward}
            wandb.log({**metrics})

            # Render an episode after every 100 policy updates.
            if not episode % self.N:
                self.policy.eval()
                total_reward = 0
                total_length = 0
                for _ in range(self.M):
                    (_, _, _, rewards, length) = self.run_episode(display=True)
                    total_reward += np.sum(rewards)
                    total_length += length
                average_reward = total_reward / self.M
                average_rewards.append(average_reward)
                print(f'Average Total: {average_reward}')
                average_length = total_length / self.M
                average_lengths.append(average_length)
                print(f'Average Length: {average_length}')

                val_metrics = {"Average Total Reward": average_reward,
                               "Average Length": average_length}
                wandb.log({**val_metrics})

                if average_reward >= best_reward:
                    best_reward = average_reward
                    state_dict = self.policy.state_dict()

                (obs, _, _, _, _) = self.run_episode()
                self.policy.train()
                print(f'Running reward: {running_rewards[-1]}')

        # Return the running rewards.
        self.policy.eval()
        return (running_rewards, average_rewards, average_lengths, state_dict)

In [13]:
state_dicts = []
for i in range(n_run):

    # Instantiate a rendering and a non-rendering environment.
    env_render = gym.make('CartPole-v1', render_mode='human')
    env = gym.make('CartPole-v1')

    torch.manual_seed(seeds[i])
    env_render.reset(seed = val_seeds[i])
    env.reset(seed = seeds[i])

    run = wandb.init(
          # Set the project where this run will be logged
          project="Lab3-Final",
          # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
          name=f"Reinforce without std",
          # Track hyperparameters and run metadata
          config={
          "learning_rate": 1e-2,
          "architecture": "REINFORCE_STD",
          "dataset": "CartPole",
          "hidden_layer_size": 128,
          "episodes": 2000,
          "gamma": 0.99,
          "episode_max_len": 500,
          "N": 100,
          "temperature": 10,
          "M": 10,
          "baseline": None})
    
    # Make a policy network.
    policy = PolicyNet(env, inner_size=run.config["hidden_layer_size"], T=run.config["temperature"])

    # Train the agent.
    r = ReinforceStd(policy, env, env_render, gamma=run.config["gamma"], num_episodes=run.config["episodes"],
                  lr=run.config["learning_rate"], max_len=run.config["episode_max_len"],
                     N=run.config["N"], eval_episodes=run.config["M"], baseline=run.config["baseline"])
    (total, average, length, state_dict) = r.reinforce()

    state_dicts.append(state_dict)

    # Close up everything
    env_render.close()
    env.close()

Training agent with no baseline.
Average Total: 21.3
Average Length: 21.3
Running reward: 0.6497082233428956
Average Total: 28.3
Average Length: 28.3
Running reward: 21.686505653591553
Average Total: 26.8
Average Length: 26.8
Running reward: 24.00807321889379
Average Total: 33.4
Average Length: 33.4
Running reward: 26.99608878710016
Average Total: 76.3
Average Length: 76.3
Running reward: 38.84036962148689
Average Total: 67.9
Average Length: 67.9
Running reward: 44.86376850914731
Average Total: 91.8
Average Length: 91.8
Running reward: 54.97029859122171
Average Total: 119.1
Average Length: 119.1
Running reward: 61.43057535894768
Average Total: 83.1
Average Length: 83.1
Running reward: 59.958023427857206
Average Total: 158.1
Average Length: 158.1
Running reward: 58.74335755000611
Average Total: 166.0
Average Length: 166.0
Running reward: 75.78445376297306
Average Total: 165.7
Average Length: 165.7
Running reward: 76.47119771135021
Average Total: 210.4
Average Length: 210.4
Running rewar

0,1
Average Length,▁▁▁▁▂▂▂▂▂▃▃▃▄▂▄█▃▃▆▅
Average Total Reward,▁▁▁▁▂▂▂▂▂▃▃▃▄▂▄█▃▃▆▅
Policy Loss,▂▁▆▄▃▄▁▃▅▃▆▃▂▅▃▃▅▅▄▅▇▆▅▅▃▆▆▇▇▄█▇▅▄▆▅▅▆▆▅
Running Reward,▁▂▂▂▂▂▂▃▄▄▄▄▄▅▅▄▅▅▆▇▇▆▇█▆▄▆▇▇██▆▇█▇██▇▇▇

0,1
Average Length,303.7
Average Total Reward,303.7
Policy Loss,12.14417
Running Reward,84.69919


Training agent with no baseline.
Average Total: 25.5
Average Length: 25.5
Running reward: 0.42808961868286133
Average Total: 40.4
Average Length: 40.4
Running reward: 25.83990490063615
Average Total: 20.4
Average Length: 20.4
Running reward: 18.91481959818454
Average Total: 63.8
Average Length: 63.8
Running reward: 38.63250651648312
Average Total: 118.3
Average Length: 118.3
Running reward: 35.08271682967789
Average Total: 108.4
Average Length: 108.4
Running reward: 59.78177470420637
Average Total: 92.5
Average Length: 92.5
Running reward: 56.27718725531761
Average Total: 337.9
Average Length: 337.9
Running reward: 90.47169477535797
Average Total: 470.8
Average Length: 470.8
Running reward: 85.14311458508297
Average Total: 463.5
Average Length: 463.5
Running reward: 98.06804627883203
Average Total: 464.3
Average Length: 464.3
Running reward: 97.30743683056632
Average Total: 478.6
Average Length: 478.6
Running reward: 96.82032698138853
Average Total: 369.1
Average Length: 369.1
Running 

0,1
Average Length,▁▁▁▂▂▂▂▆█▇▇█▆▃▄██▂█▅
Average Total Reward,▁▁▁▂▂▂▂▆█▇▇█▆▃▄██▂█▅
Policy Loss,▃▃█▅▂▃▂▃▇▅▇▆▇▆▅▇▄▅▅▅▄▄▄▆▂█▂█▇▄▄▆▄▆▅▃▃▃▂▁
Running Reward,▁▂▂▂▂▂▃▂▅▄▅▇▆▇▇▅▇██████▇▆▆▅▅▆████▇▆███▄▃

0,1
Average Length,296.0
Average Total Reward,296.0
Policy Loss,3.37506
Running Reward,25.6201


Training agent with no baseline.
Average Total: 29.2
Average Length: 29.2
Running reward: 1.0608932495117187
Average Total: 44.9
Average Length: 44.9
Running reward: 34.353444827513506
Average Total: 69.3
Average Length: 69.3
Running reward: 40.15142466592689
Average Total: 98.1
Average Length: 98.1
Running reward: 57.24275572431227
Average Total: 262.1
Average Length: 262.1
Running reward: 61.67759797621861
Average Total: 133.7
Average Length: 133.7
Running reward: 66.97493610526575
Average Total: 120.8
Average Length: 120.8
Running reward: 75.52974037879046
Average Total: 109.9
Average Length: 109.9
Running reward: 66.67732314014523
Average Total: 250.6
Average Length: 250.6
Running reward: 88.58951810067828
Average Total: 47.7
Average Length: 47.7
Running reward: 49.33560274282194
Average Total: 98.7
Average Length: 98.7
Running reward: 52.27052633196842
Average Total: 142.6
Average Length: 142.6
Running reward: 58.058162675865745
Average Total: 16.4
Average Length: 16.4
Running rew

0,1
Average Length,▁▂▃▃█▄▄▄█▂▃▅▁▁▁▆▇▆▅▅
Average Total Reward,▁▂▃▃█▄▄▄█▂▃▅▁▁▁▆▇▆▅▅
Policy Loss,▅▇▅▄▅█▆▃▇▄▇▆▅▆▆▅▅▅▁▂▃▄▆▂▁▁▁▁▁▇▂▅▆▅▅▆▁▆▃▃
Running Reward,▁▂▃▃▃▄▅▅▆▇▆▇▆▇▆▇█▆▃▂▅▄▆▄▂▁▁▁▂▆▄▆▇▇▇▇▃▄▇█

0,1
Average Length,153.0
Average Total Reward,153.0
Policy Loss,5.74649
Running Reward,95.64626


Training agent with no baseline.
Average Total: 21.8
Average Length: 21.8
Running reward: 0.7774312019348145
Average Total: 41.1
Average Length: 41.1
Running reward: 40.98544742669288
Average Total: 124.0
Average Length: 124.0
Running reward: 55.14791072226677
Average Total: 21.2
Average Length: 21.2
Running reward: 23.30424198665057
Average Total: 175.0
Average Length: 175.0
Running reward: 38.85004798271144
Average Total: 109.7
Average Length: 109.7
Running reward: 62.28220807488584
Average Total: 96.7
Average Length: 96.7
Running reward: 62.32258979033018
Average Total: 271.8
Average Length: 271.8
Running reward: 96.40232410734545
Average Total: 377.5
Average Length: 377.5
Running reward: 92.44703943794701
Average Total: 500.0
Average Length: 500.0
Running reward: 98.30120954885301
Average Total: 198.2
Average Length: 198.2
Running reward: 85.45103502722965
Average Total: 500.0
Average Length: 500.0
Running reward: 97.9377569785624
Average Total: 142.8
Average Length: 142.8
Running 

0,1
Average Length,▁▁▃▁▃▂▂▅▆█▄█▃▂▂▂████
Average Total Reward,▁▁▃▁▃▂▂▅▆█▄█▃▂▂▂████
Policy Loss,▄▅▇▃▆▃▁▁▅█▇▅▆▄▆▆▄▄▄▆▆▃▄▄▄▄▃▄▄▅▄▄▃▃▃▄▄▄▄▅
Running Reward,▁▃▃▄▅▄▁▁▆▇▅▅▆█▇▇████▇███▇▅▆▆▆▅▅▇███████▇

0,1
Average Length,496.0
Average Total Reward,496.0
Policy Loss,7.00684
Running Reward,89.56594


Training agent with no baseline.
Average Total: 19.5
Average Length: 19.5
Running reward: 0.7774312019348145
Average Total: 28.3
Average Length: 28.3
Running reward: 25.33766506726217
Average Total: 32.9
Average Length: 32.9
Running reward: 32.82942792120509
Average Total: 141.1
Average Length: 141.1
Running reward: 70.90733730692628
Average Total: 315.7
Average Length: 315.7
Running reward: 87.09838858929352
Average Total: 83.5
Average Length: 83.5
Running reward: 70.14015037828108
Average Total: 91.5
Average Length: 91.5
Running reward: 62.87526524960545
Average Total: 500.0
Average Length: 500.0
Running reward: 93.40735237851514
Average Total: 489.6
Average Length: 489.6
Running reward: 98.25662608092652
Average Total: 500.0
Average Length: 500.0
Running reward: 97.43213824157567
Average Total: 500.0
Average Length: 500.0
Running reward: 98.3440940338763
Average Total: 82.3
Average Length: 82.3
Running reward: 84.32670736503974
Average Total: 228.8
Average Length: 228.8
Running rewa

In [14]:
# And run the final agent for a few episodes.
env_render = gym.make('CartPole-v1', render_mode='human')
env_render.reset(seed = 100)
r.setEnvRender(env_render)

average_test_rewards = []
for state_dict in state_dicts:
    total_rewards = 0
    policy = PolicyNet(env = env_render)
    policy.load_state_dict(state_dict)
    r.setPolicy(policy)
    for _ in range(20):
        (_, _, _, rewards, _) = r.run_episode(display=True, test=True)
        total_rewards += np.sum(rewards)
    average_test_reward = total_rewards / 20
    print(f'Average reward for episode: {average_test_reward}')
    average_test_rewards.append(average_test_reward)
avg_test_rew = np.sum(average_test_rewards) / 5
print(f'Total Average reward for test episode: {avg_test_rew}')
test_metrics = {"Total average test reward": avg_test_rew}
wandb.log({**test_metrics})
env_render.close()
wandb.finish()

Average reward for episode: 465.1
Average reward for episode: 23.55
Average reward for episode: 465.55
Average reward for episode: 428.95
Average reward for episode: 475.05
Total Average reward for test episode: 371.64


0,1
Average Length,▁▁▁▃▅▂▂████▂▄▃▃▃▂▃▃▄
Average Total Reward,▁▁▁▃▅▂▂████▂▄▃▃▃▂▃▃▄
Policy Loss,▂▂▆█▁█▅▅▅▇▃▆▃▄▄▄▅▅▄▄▄▄▂▅▆▂▅▃▆▄▄▄▄▅▃▅▄▄▄▂
Running Reward,▁▂▂▃▃▄▅▆▇▇▄▄▅▆████████▄▅▇▄▆▆▆▆▆▅▅▆▆▆▇▇██
Total average test reward,▁

0,1
Average Length,255.4
Average Total Reward,255.4
Policy Loss,6.49841
Running Reward,97.95765
Total average test reward,371.64


**The Real Exercise**: Standard practice is to use the state-value function $v(s)$ as a baseline. This is intuitively appealing -- we are more interested in updating out policy for returns that estimate the current **value** worse. Our new update becomes:

$$ \boldsymbol{\theta}_{t+1} \triangleq \boldsymbol{\theta}_t + \alpha (G_t - \tilde{v}(S_t \mid \mathbf{w})) \frac{\nabla \pi(A_t \mid s, \boldsymbol{\theta})}{\pi(A_t \mid s, \boldsymbol{\theta})} $$

where $\tilde{v}(s \mid \mathbf{w})$ is a **deep neural network** with parameters $w$ that estimates $v_\pi(s)$. What neural network? Typically, we use the **same** network architecture as that of the Policy.

**Your Task**: Modify your implementation to fit a second, baseline network to estimate the value function and use it as **baseline**.

In [15]:
class ReinforceBas(ReinforceStd):
    def __init__(self, policy, env, env_render=None, gamma=0.99, num_episodes=10, lr=1e-2,
                 max_len=500, N=100, eval_episodes=10, baseline=None, lrb=1e-2):
        super().__init__(policy, env, env_render, gamma, num_episodes, lr, 
                         max_len, N, eval_episodes, baseline)
        self.learning_rate_baseline = lrb

    def select_action(self, obs):
        value = self.baseline(obs)
        value.compute_grad = True
        dist = Categorical(self.policy(obs))
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return (action.item(), log_prob.reshape(1), value)

    # Given an environment and a policy, run it up to the maximum number of steps.
    def run_episode(self, display=False, test=False):
        # Collect just about everything.
        observations = []
        actions = []
        log_probs = []
        rewards = []
        values = []
        env = self.environment
        if display:
            env = self.env_render

        # Reset the environment and start the episode.
        (obs, info) = env.reset()
        for i in range(self.max_len):
            # Get the current observation, run the policy and select an action.
            obs = torch.tensor(obs)
            if test:
                (action, log_prob) = self.select_max_action(obs)
            else:
                (action, log_prob, value) = self.select_action(obs)
                values.append(value)
            observations.append(obs)
            actions.append(action)
            log_probs.append(log_prob)

            # Advance the episode by executing the selected action.
            (obs, reward, term, trunc, info) = env.step(action)
            rewards.append(reward)
            if term or trunc:
                break
        length = i + 1
        return (observations, actions, torch.cat(log_probs), rewards, length, values)

    # A direct, inefficient, and probably buggy of the REINFORCE policy gradient algorithm.
    def reinforce(self):
        # The only non-vanilla part: we use Adam instead of SGD.
        opt = torch.optim.Adam(self.policy.parameters(), lr=self.learning_rate)

        # If we have a baseline network, create the optimizer.
        if isinstance(self.baseline, nn.Module):
            opt_baseline = torch.optim.Adam(self.baseline.parameters(), lr=self.learning_rate_baseline)
            self.baseline.train()
            print('Training agent with baseline value network.')
        elif self.baseline == 'std':
            print('Training agent with standardization baseline.')
        else:
            print('Training agent with no baseline.')

        # Track episode rewards in a list.
        running_rewards = [0.0]
        average_rewards = []
        average_lengths = []

        # The main training loop.
        self.policy.train()
        state_dict = None
        best_reward = 0
        for episode in range(self.num_episodes):
            # Run an episode of the environment, collect everything needed for policy update.
            (observations, actions, log_probs, rewards, length, values) = self.run_episode()

            # Compute the discounted reward for every step of the episode.
            returns = torch.tensor(self.compute_returns(rewards), dtype=torch.float32)
            values =  torch.cat(values)

            # Keep a running average of total discounted rewards for the whole episode.
            running_reward = 0.05 * returns[0] + 0.95 * running_rewards[-1]
            wandb.log({"Reward": running_reward})
            running_rewards.append(running_reward)

            # Handle baseline.
            if isinstance(self.baseline, nn.Module):
                with torch.no_grad():
                    target = returns - values
            elif self.baseline == 'std':
                target = (returns - returns.mean()) / returns.std()
            else:
                target = returns

            # Make an optimization step
            opt.zero_grad()

            # Update policy network
            loss = (-log_probs * target).mean()
            loss.backward()
            opt.step()

            # Update baseline network.
            if isinstance(self.baseline, nn.Module):
                loss_baseline = F.mse_loss(values, returns)
                wandb.log({"Loss_baseline": loss_baseline})
                opt_baseline.zero_grad()
                loss_baseline.backward()
                opt_baseline.step()

                metrics = {"Policy Loss": loss,
                           "Running Reward": running_reward,
                           "Baseline Loss": loss_baseline}
            else:
              metrics = {"Policy Loss": loss,
                         "Running Reward": running_reward}

            wandb.log({**metrics})

            # Render an episode after every 100 policy updates.
            if not episode % self.N:
                self.policy.eval()
                total_reward = 0
                total_length = 0
                for _ in range(self.M):
                    (_, _, _, rewards, length, values) = self.run_episode()
                    total_reward += np.sum(rewards)
                    total_length += length
                average_reward = total_reward / self.M
                wandb.log({"Avg_total_reward": average_reward})
                average_rewards.append(average_reward)
                print(f'Average Total: {average_reward}')
                average_length = total_length / self.M
                wandb.log({"Avg_length": average_length})
                average_lengths.append(average_length)
                print(f'Average Length: {average_length}')

                val_metrics = {"Average Total Reward": average_reward,
                               "Average Length": average_length}
                wandb.log({**val_metrics})

                if average_reward >= best_reward:
                    best_reward = average_reward
                    state_dict = self.policy.state_dict()

                (obs, _, _, _, _, _) = self.run_episode(display=True)
                self.policy.train()
                print(f'Running reward: {running_rewards[-1]}')

        # Return the running rewards.
        self.policy.eval()
        if isinstance(self.baseline, nn.Module):
            self.baseline.eval()
        return (running_rewards, average_rewards, average_lengths, state_dict)

Runs varying learning_rate and hidden_layer_size for baseline net

In [16]:
state_dicts = []
for i in range(n_run):

    # Instantiate a rendering and a non-rendering environment.
    env_render = gym.make('CartPole-v1', render_mode='human')
    env = gym.make('CartPole-v1')

    torch.manual_seed(seeds[i])
    env_render.reset(seed = val_seeds[i])
    env.reset(seed = seeds[i])

    run = wandb.init(
          # Set the project where this run will be logged
          project="Lab3-Final",
          # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
          name=f"ReinforceBas",
          # Track hyperparameters and run metadata
          config={
          "learning_rate": 1e-2,
          "baaseline_learning_rate":1e-2,
          "architecture": "REINFORCE_BAS",
          "dataset": "CartPole",
          "hidden_layer_size": 128,
          "episodes": 2000,
          "gamma": 0.99,
          "episode_max_len": 500,
          "N": 100,
          "temperature": 10,
          "M": 10,
          "baseline": "BaselineNet",
          "learning_rate_baseline": 1e-2})

    # Make a policy and a baseline network.
    policy = PolicyNet(env, inner_size=run.config["hidden_layer_size"], T=run.config["temperature"])
    baseline = BaselineNet(env)

    # Train the agent.r = ReinforceBas(policy, env, env_render, num_episodes=2000, baseline=baseline)
    r = ReinforceBas(policy, env, env_render, gamma=run.config["gamma"], num_episodes=run.config["episodes"],
                  lr=run.config["learning_rate"], max_len=run.config["episode_max_len"],
                     N=run.config["N"], eval_episodes=run.config["M"], baseline=baseline,
                     lrb=run.config["learning_rate_baseline"])
    (total, average, length, state_dict) = r.reinforce()

    state_dicts.append(state_dict)

    # Close up everything
    env_render.close()
    env.close()

Training agent with baseline value network.
Average Total: 19.8
Average Length: 19.8
Running reward: 0.9013606905937195
Average Total: 125.4
Average Length: 125.4
Running reward: 45.96234893798828
Average Total: 259.2
Average Length: 259.2
Running reward: 89.23333740234375
Average Total: 481.3
Average Length: 481.3
Running reward: 95.04817199707031
Average Total: 500.0
Average Length: 500.0
Running reward: 97.9099349975586
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34683227539062
Average Total: 333.2
Average Length: 333.2
Running reward: 93.80870056152344
Average Total: 499.0
Average Length: 499.0
Running reward: 97.20504760742188
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34073638916016
Average Total: 427.5
Average Length: 427.5
Running reward: 98.03227996826172
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34623718261719
Average Total: 459.9
Average Length: 459.9
Running reward: 98.0577621459961
Average Total: 500.0
Average Length

0,1
Average Length,▁▃▄███▆██▇█▇█████▅█▅
Average Total Reward,▁▃▄███▆██▇█▇█████▅█▅
Avg_length,▁▃▄███▆██▇█▇█████▅█▅
Avg_total_reward,▁▃▄███▆██▇█▇█████▅█▅
Baseline Loss,▁▂▇█▃▄██▇▄▅▂▂▃▁▁▆▆▅▇▇▇▃▃▇▃▇▇▇▇▆▄▆▆▅█▃▃▆▅
Loss_baseline,▁▂▇█▃▄██▇▄▅▂▂▃▁▁▆▆▅▇▇▇▃▃▇▃▇▇▇▇▆▄▆▆▅█▃▃▆▅
Policy Loss,▅▄▇▆▅▅▅▁▅▆▅▅▆▂▄▄▃▃▆▆▄▄▅▅▇▅▅▅▄▄▅▅▄▇▂█▃▅▄▄
Reward,▁▂▄▇▇▇██████████████████████████████████
Running Reward,▁▂▄▇▇▇██████████████████████████████████

0,1
Average Length,317.2
Average Total Reward,317.2
Avg_length,317.2
Avg_total_reward,317.2
Baseline Loss,331.57318
Loss_baseline,331.57318
Policy Loss,0.29328
Reward,98.2318
Running Reward,98.2318


Training agent with baseline value network.
Average Total: 21.8
Average Length: 21.8
Running reward: 0.6927111744880676
Average Total: 93.7
Average Length: 93.7
Running reward: 49.437034606933594
Average Total: 463.5
Average Length: 463.5
Running reward: 94.8961410522461
Average Total: 471.3
Average Length: 471.3
Running reward: 97.31956481933594
Average Total: 467.3
Average Length: 467.3
Running reward: 98.16445922851562
Average Total: 269.2
Average Length: 269.2
Running reward: 97.17326354980469
Average Total: 500.0
Average Length: 500.0
Running reward: 98.28874969482422
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34907531738281
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34005737304688
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34847259521484
Average Total: 500.0
Average Length: 500.0
Running reward: 98.23775482177734
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34877014160156
Average Total: 500.0
Average Length

0,1
Average Length,▁▂▇██▅██████████████
Average Total Reward,▁▂▇██▅██████████████
Avg_length,▁▂▇██▅██████████████
Avg_total_reward,▁▂▇██▅██████████████
Baseline Loss,▃▃▄▂▃█▂▄▃█▁▂▂▁▃▂▄▂▁▁▁▁▁▂▁▄▁▁▃▂▁▃▇▄▂▁▁▁▂▂
Loss_baseline,▃▃▄▂▃█▂▄▃█▁▂▂▁▃▂▄▂▁▁▁▁▁▂▁▄▁▁▃▂▁▃▇▄▂▁▁▁▂▂
Policy Loss,▇▆▁▆▆█▅▅▄▄▄▄▆▅▃▆▇▆▆▄▄▅▅▅▄▇▄▅▃▄▅▆▇▄▄▄▅▅▄▃
Reward,▁▃▅▇████████████████████████████████████
Running Reward,▁▃▅▇████████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,16.47844
Loss_baseline,16.47844
Policy Loss,-0.97385
Reward,98.03937
Running Reward,98.03937


Training agent with baseline value network.
Average Total: 22.7
Average Length: 22.7
Running reward: 1.5372270345687866
Average Total: 123.0
Average Length: 123.0
Running reward: 46.47514343261719
Average Total: 446.3
Average Length: 446.3
Running reward: 94.68729400634766
Average Total: 232.8
Average Length: 232.8
Running reward: 92.20755767822266
Average Total: 498.0
Average Length: 498.0
Running reward: 96.64935302734375
Average Total: 479.9
Average Length: 479.9
Running reward: 96.86875915527344
Average Total: 497.3
Average Length: 497.3
Running reward: 98.23246765136719
Average Total: 303.1
Average Length: 303.1
Running reward: 97.73979949951172
Average Total: 481.8
Average Length: 481.8
Running reward: 98.28963470458984
Average Total: 500.0
Average Length: 500.0
Running reward: 98.31915283203125
Average Total: 500.0
Average Length: 500.0
Running reward: 97.51057434082031
Average Total: 500.0
Average Length: 500.0
Running reward: 98.32536315917969
Average Total: 500.0
Average Leng

0,1
Average Length,▁▂▇▄███▅████████████
Average Total Reward,▁▂▇▄███▅████████████
Avg_length,▁▂▇▄███▅████████████
Avg_total_reward,▁▂▇▄███▅████████████
Baseline Loss,▁▁▇▃▄▂▃▇▃▂▇▅▃▂▂▄▂▃▂▁▂▂▁▁▁▂▁▂▁▁▂█▃▂▁▁▁▂▁▄
Loss_baseline,▁▁▇▃▄▂▃▇▃▂▇▅▃▂▂▄▂▃▂▁▂▂▁▁▁▂▁▂▁▁▂█▃▂▁▁▁▂▁▄
Policy Loss,▅▄▃▃▄▄▄▁▆▄▁▄▇▅▃▆▅▃▆▅▄▃▃▅▄▂▄▅▅▄▅█▅▆▄▄▄▅▄▁
Reward,▁▂▅▇██▇███████████▇█████████████████████
Running Reward,▁▂▅▇██▇███████████▇█████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,536.85193
Loss_baseline,536.85193
Policy Loss,0.66678
Reward,98.34936
Running Reward,98.34936


Training agent with baseline value network.
Average Total: 20.3
Average Length: 20.3
Running reward: 1.1382864713668823
Average Total: 165.1
Average Length: 165.1
Running reward: 50.45021057128906
Average Total: 475.9
Average Length: 475.9
Running reward: 95.73809814453125
Average Total: 384.8
Average Length: 384.8
Running reward: 88.25031280517578
Average Total: 469.2
Average Length: 469.2
Running reward: 94.30438995361328
Average Total: 432.8
Average Length: 432.8
Running reward: 97.99562072753906
Average Total: 500.0
Average Length: 500.0
Running reward: 98.11251831054688
Average Total: 500.0
Average Length: 500.0
Running reward: 95.69313049316406
Average Total: 500.0
Average Length: 500.0
Running reward: 97.94505310058594
Average Total: 500.0
Average Length: 500.0
Running reward: 97.95350646972656
Average Total: 469.4
Average Length: 469.4
Running reward: 98.31255340576172
Average Total: 500.0
Average Length: 500.0
Running reward: 98.25164031982422
Average Total: 500.0
Average Leng

0,1
Average Length,▁▃█▆█▇██████████████
Average Total Reward,▁▃█▆█▇██████████████
Avg_length,▁▃█▆█▇██████████████
Avg_total_reward,▁▃█▆█▇██████████████
Baseline Loss,▁▂▃▃█▂▂▂▄▄▂▄▂▃▃▁▄▆▅▅▂▃▃▅▄▄▂▂▂▂▃▃▁▁▁▁▆▅▃▂
Loss_baseline,▁▂▃▃█▂▂▂▄▄▂▄▂▃▃▁▄▆▅▅▂▃▃▅▄▄▂▂▂▂▃▃▁▁▁▁▆▅▃▂
Policy Loss,▅▅▅▃█▆▅▄▂▄▅▃▄▁▅▄▁▅▃▂▃▃▆▆▄▅▅▅▅▃▃▂▃▅▄▄▄▄▄▅
Reward,▁▂▅▇████████████████████████████████████
Running Reward,▁▂▅▇████████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,205.38142
Loss_baseline,205.38142
Policy Loss,2.27979
Reward,98.34936
Running Reward,98.34936


Training agent with baseline value network.
Average Total: 19.9
Average Length: 19.9
Running reward: 0.8604653477668762
Average Total: 157.3
Average Length: 157.3
Running reward: 56.816139221191406
Average Total: 500.0
Average Length: 500.0
Running reward: 96.77999114990234
Average Total: 476.7
Average Length: 476.7
Running reward: 97.58685302734375
Average Total: 500.0
Average Length: 500.0
Running reward: 98.23280334472656
Average Total: 486.2
Average Length: 486.2
Running reward: 98.34669494628906
Average Total: 314.2
Average Length: 314.2
Running reward: 90.81932830810547
Average Total: 487.1
Average Length: 487.1
Running reward: 98.28250885009766
Average Total: 477.5
Average Length: 477.5
Running reward: 98.3460464477539
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34691619873047
Average Total: 500.0
Average Length: 500.0
Running reward: 98.33350372314453
Average Total: 500.0
Average Length: 500.0
Running reward: 94.72212982177734
Average Total: 479.7
Average Leng

In [18]:
# And run the final agent for a few episodes.
env_render = gym.make('CartPole-v1', render_mode='human')
env_render.reset(seed = 100)
r.setEnvRender(env_render)

average_test_rewards = []
for state_dict in state_dicts:
    total_rewards = 0
    policy = PolicyNet(env = env_render)
    policy.load_state_dict(state_dict)
    r.setPolicy(policy)
    for _ in range(20):
        (_, _, _, rewards, _, _) = r.run_episode(display=True, test=True)
        total_rewards += np.sum(rewards)
    average_test_reward = total_rewards / 20
    print(f'Average reward for episode: {average_test_reward}')
    average_test_rewards.append(average_test_reward)
avg_test_rew = np.sum(average_test_rewards) / 5
print(f'Total Average reward for test episode: {avg_test_rew}')
test_metrics = {"Total average test reward": avg_test_rew}
wandb.log({**test_metrics})
env_render.close()
wandb.finish()

Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Total Average reward for test episode: 500.0


0,1
Average Length,▁▃████▅██████████▇██
Average Total Reward,▁▃████▅██████████▇██
Avg_length,▁▃████▅██████████▇██
Avg_total_reward,▁▃████▅██████████▇██
Baseline Loss,▁▁▄▄▄▂▃▇▇▅▅▁▃▅▅▅▆▄▃▁▄▃▄▂▂▃▂▆▃▄▃▃█▆▁█▅▄▅▃
Loss_baseline,▁▁▄▄▄▂▃▇▇▅▅▁▃▅▅▅▆▄▃▁▄▃▄▂▂▃▂▆▃▄▃▃█▆▁█▅▄▅▃
Policy Loss,█▄▇▇▄▆▃█▅▇▅▆▅▄▅▅▆▆▅▆▃▆▅▆▅▅▆▁▅▅▅▄▂█▅▃▆▄▆▆
Reward,▁▃▆▇███████▇████████████████████████████
Running Reward,▁▃▆▇███████▇████████████████████████████
Total average test reward,▁

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,133.73801
Loss_baseline,133.73801
Policy Loss,-0.54477
Reward,98.33543
Running Reward,98.33543
Total average test reward,500.0


In [19]:
state_dicts = []
for i in range(n_run):

    # Instantiate a rendering and a non-rendering environment.
    env_render = gym.make('CartPole-v1', render_mode='human')
    env = gym.make('CartPole-v1')

    torch.manual_seed(seeds[i])
    env_render.reset(seed = val_seeds[i])
    env.reset(seed = seeds[i])

    run = wandb.init(
          # Set the project where this run will be logged
          project="Lab3-Final",
          # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
          name=f"ReinforceBas lower inner_size",
          # Track hyperparameters and run metadata
          config={
          "learning_rate": 1e-2,
          "baaseline_learning_rate":1e-2,
          "architecture": "REINFORCE_BAS",
          "dataset": "CartPole",
          "hidden_layer_size": 64,
          "episodes": 2000,
          "gamma": 0.99,
          "episode_max_len": 500,
          "N": 100,
          "temperature": 10,
          "M": 10,
          "baseline": "BaselineNet",
          "learning_rate_baseline": 1e-2})

    # Make a policy and a baseline network.
    policy = PolicyNet(env, inner_size=run.config["hidden_layer_size"], T=run.config["temperature"])
    baseline = BaselineNet(env)

    # Train the agent.r = ReinforceBas(policy, env, env_render, num_episodes=2000, baseline=baseline)
    r = ReinforceBas(policy, env, env_render, gamma=run.config["gamma"], num_episodes=run.config["episodes"],
                  lr=run.config["learning_rate"], max_len=run.config["episode_max_len"],
                     N=run.config["N"], eval_episodes=run.config["M"], baseline=baseline,
                     lrb=run.config["learning_rate_baseline"])
    (total, average, length, state_dict) = r.reinforce()

    state_dicts.append(state_dict)

    # Close up everything
    env_render.close()
    env.close()

Training agent with baseline value network.
Average Total: 17.0
Average Length: 17.0
Running reward: 0.4733087122440338
Average Total: 50.6
Average Length: 50.6
Running reward: 32.35449981689453
Average Total: 333.3
Average Length: 333.3
Running reward: 79.63274383544922
Average Total: 496.4
Average Length: 496.4
Running reward: 96.70327758789062
Average Total: 447.5
Average Length: 447.5
Running reward: 96.91535186767578
Average Total: 460.3
Average Length: 460.3
Running reward: 98.22537231445312
Average Total: 490.1
Average Length: 490.1
Running reward: 97.33820343017578
Average Total: 500.0
Average Length: 500.0
Running reward: 94.5257339477539
Average Total: 169.5
Average Length: 169.5
Running reward: 81.21688842773438
Average Total: 500.0
Average Length: 500.0
Running reward: 97.94082641601562
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34701538085938
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34935760498047
Average Total: 500.0
Average Length:

0,1
Average Length,▁▁▆█▇▇██▃███████████
Average Total Reward,▁▁▆█▇▇██▃███████████
Avg_length,▁▁▆█▇▇██▃███████████
Avg_total_reward,▁▁▆█▇▇██▃███████████
Baseline Loss,▂▁▂▁█▃▅▃▂▁▂▄▂▁▅▁▁▃▆▃▁▂▁▆▅▃▇▃▁▂▁▁▁▁▄▃▂▂▁▂
Loss_baseline,▂▁▂▁█▃▅▃▂▁▂▄▂▁▅▁▁▃▆▃▁▂▁▆▅▃▇▃▁▂▁▁▁▁▄▃▂▂▁▂
Policy Loss,▅▄▃▃▆▆█▅▅▄▃▃▃▄▁▅▄▃▄▆▃▂▄▅▄▄▂▆▄▂▄▄▅▃▂▅▄▃▃▄
Reward,▁▂▃▅▇████████▇█▇▇███████████████████████
Running Reward,▁▂▃▅▇████████▇█▇▇███████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,267.01199
Loss_baseline,267.01199
Policy Loss,-0.40022
Reward,98.34526
Running Reward,98.34526


Training agent with baseline value network.
Average Total: 20.5
Average Length: 20.5
Running reward: 0.6497082114219666
Average Total: 73.5
Average Length: 73.5
Running reward: 31.853710174560547
Average Total: 500.0
Average Length: 500.0
Running reward: 87.5874252319336
Average Total: 474.1
Average Length: 474.1
Running reward: 96.6775131225586
Average Total: 455.1
Average Length: 455.1
Running reward: 97.29200744628906
Average Total: 454.2
Average Length: 454.2
Running reward: 97.34679412841797
Average Total: 358.2
Average Length: 358.2
Running reward: 92.523193359375
Average Total: 448.0
Average Length: 448.0
Running reward: 97.01289367675781
Average Total: 223.1
Average Length: 223.1
Running reward: 90.32347106933594
Average Total: 500.0
Average Length: 500.0
Running reward: 98.16915130615234
Average Total: 500.0
Average Length: 500.0
Running reward: 98.20433044433594
Average Total: 500.0
Average Length: 500.0
Running reward: 98.28092193603516
Average Total: 481.0
Average Length: 4

0,1
Average Length,▁▂██▇▇▆▇▄████▇██████
Average Total Reward,▁▂██▇▇▆▇▄████▇██████
Avg_length,▁▂██▇▇▆▇▄████▇██████
Avg_total_reward,▁▂██▇▇▆▇▄████▇██████
Baseline Loss,▁▂▁▂▂▆▃▃▃▄▄▂▃▂▂▁▂▅▄▂▂▂▂▂▁▁▁▁▁▁▁▄▁▁▁▁█▃▂▂
Loss_baseline,▁▂▁▂▂▆▃▃▃▄▄▂▃▂▂▁▂▅▄▂▂▂▂▂▁▁▁▁▁▁▁▄▁▁▁▁█▃▂▂
Policy Loss,▅▅▄▄▅█▆▄▃▃▄▄▃▅▅▄▃▄▃▄▆▄▄▅▃▄▅▄▄▄▄▅▄▄▄▄▁▄▄▅
Reward,▁▂▃▆▇███████████████████████████████████
Running Reward,▁▂▃▆▇███████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,96.8741
Loss_baseline,96.8741
Policy Loss,-2.06757
Reward,98.34908
Running Reward,98.34908


Training agent with baseline value network.
Average Total: 22.2
Average Length: 22.2
Running reward: 1.432761549949646
Average Total: 64.4
Average Length: 64.4
Running reward: 31.198293685913086
Average Total: 324.9
Average Length: 324.9
Running reward: 86.25784301757812
Average Total: 494.7
Average Length: 494.7
Running reward: 96.1632308959961
Average Total: 439.8
Average Length: 439.8
Running reward: 94.91915893554688
Average Total: 500.0
Average Length: 500.0
Running reward: 96.54084014892578
Average Total: 404.8
Average Length: 404.8
Running reward: 98.01779174804688
Average Total: 500.0
Average Length: 500.0
Running reward: 97.94641876220703
Average Total: 500.0
Average Length: 500.0
Running reward: 97.72764587402344
Average Total: 500.0
Average Length: 500.0
Running reward: 98.11595916748047
Average Total: 500.0
Average Length: 500.0
Running reward: 97.94359588623047
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34589385986328
Average Total: 469.9
Average Length:

0,1
Average Length,▁▂▅█▇█▇███████████▄█
Average Total Reward,▁▂▅█▇█▇███████████▄█
Avg_length,▁▂▅█▇█▇███████████▄█
Avg_total_reward,▁▂▅█▇█▇███████████▄█
Baseline Loss,▁▂▁▃▃▅▃▆█▂█▄▂▅▃▄▆▇▆▆▆▆▆▂▃▄▆▆▄▆▆▄▅▃▂▁▁▂▂▂
Loss_baseline,▁▂▁▃▃▅▃▆█▂█▄▂▅▃▄▆▇▆▆▆▆▆▂▃▄▆▆▄▆▆▄▅▃▂▁▁▂▂▂
Policy Loss,▇▇▄▅█▆▅▅▁▆▃▄▅▄▂▄▄▅▄▇▆▅▃▅▅▆▃▆▅▃▆▇▃▇▇▅▃▃▅▇
Reward,▁▂▃▆▇████▇█████████████████████████▇▇███
Running Reward,▁▂▃▆▇████▇█████████████████████████▇▇███

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,96.83517
Loss_baseline,96.83517
Policy Loss,-1.57884
Reward,98.30106
Running Reward,98.30106


Training agent with baseline value network.
Average Total: 24.2
Average Length: 24.2
Running reward: 1.736941933631897
Average Total: 59.0
Average Length: 59.0
Running reward: 39.15964126586914
Average Total: 361.1
Average Length: 361.1
Running reward: 81.53880310058594
Average Total: 402.3
Average Length: 402.3
Running reward: 96.61164855957031
Average Total: 453.0
Average Length: 453.0
Running reward: 98.05730438232422
Average Total: 431.3
Average Length: 431.3
Running reward: 96.02603912353516
Average Total: 366.1
Average Length: 366.1
Running reward: 97.140625
Average Total: 500.0
Average Length: 500.0
Running reward: 94.53871154785156
Average Total: 493.9
Average Length: 493.9
Running reward: 98.27022552490234
Average Total: 282.2
Average Length: 282.2
Running reward: 97.60303497314453
Average Total: 457.1
Average Length: 457.1
Running reward: 96.67610168457031
Average Total: 500.0
Average Length: 500.0
Running reward: 98.06002807617188
Average Total: 500.0
Average Length: 500.0
R

0,1
Average Length,▁▂▆▇▇▇▆██▅▇██████▅██
Average Total Reward,▁▂▆▇▇▇▆██▅▇██████▅██
Avg_length,▁▂▆▇▇▇▆██▅▇██████▅██
Avg_total_reward,▁▂▆▇▇▇▆██▅▇██████▅██
Baseline Loss,▅▂▃▃▃▆▄█▃▄▃▃▂▁▂▅█▆▃▅▂▂▂▃▄▁▁▁▂▁▁▁▁▁▁▁▃▁▁▇
Loss_baseline,▅▂▃▃▃▆▄█▃▄▃▃▂▁▂▅█▆▃▅▂▂▂▃▄▁▁▁▂▁▁▁▁▁▁▁▃▁▁▇
Policy Loss,█▂▅▄▃▂▂▁▄▆▂▁▄▂▃▇▄▆▄▄▃▄▂▂▆▃▄▃▄▂▃▃▄▄▄▃▅▂▃▄
Reward,▁▃▃▄▇███████████████████████████████████
Running Reward,▁▃▃▄▇███████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,646.85034
Loss_baseline,646.85034
Policy Loss,3.74032
Reward,98.34936
Running Reward,98.34936


Training agent with baseline value network.
Average Total: 17.4
Average Length: 17.4
Running reward: 0.81915682554245
Average Total: 76.6
Average Length: 76.6
Running reward: 32.222686767578125
Average Total: 373.9
Average Length: 373.9
Running reward: 92.16215515136719
Average Total: 454.1
Average Length: 454.1
Running reward: 96.19007873535156
Average Total: 107.9
Average Length: 107.9
Running reward: 81.78004455566406
Average Total: 369.6
Average Length: 369.6
Running reward: 90.45332336425781
Average Total: 500.0
Average Length: 500.0
Running reward: 97.94478607177734
Average Total: 500.0
Average Length: 500.0
Running reward: 97.40885925292969
Average Total: 477.2
Average Length: 477.2
Running reward: 96.87308502197266
Average Total: 442.1
Average Length: 442.1
Running reward: 97.55530548095703
Average Total: 344.0
Average Length: 344.0
Running reward: 97.45919799804688
Average Total: 500.0
Average Length: 500.0
Running reward: 98.24307250976562
Average Total: 485.8
Average Length:

In [21]:
# And run the final agent for a few episodes.
env_render = gym.make('CartPole-v1', render_mode='human')
env_render.reset(seed = 100)
r.setEnvRender(env_render)

average_test_rewards = []
for state_dict in state_dicts:
    total_rewards = 0
    policy = PolicyNet(env, inner_size=run.config["hidden_layer_size"], T=run.config["temperature"])
    policy.load_state_dict(state_dict)
    r.setPolicy(policy)
    for _ in range(20):
        (_, _, _, rewards, _, _) = r.run_episode(display=True, test=True)
        total_rewards += np.sum(rewards)
    average_test_reward = total_rewards / 20
    print(f'Average reward for episode: {average_test_reward}')
    average_test_rewards.append(average_test_reward)
avg_test_rew = np.sum(average_test_rewards) / 5
print(f'Total Average reward for test episode: {avg_test_rew}')
test_metrics = {"Total average test reward": avg_test_rew}
wandb.log({**test_metrics})
env_render.close()
wandb.finish()

Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 500.0
Total Average reward for test episode: 500.0


0,1
Average Length,▁▂▆▇▂▆███▇▆██▇▇█▆██▆
Average Total Reward,▁▂▆▇▂▆███▇▆██▇▇█▆██▆
Avg_length,▁▂▆▇▂▆███▇▆██▇▇█▆██▆
Avg_total_reward,▁▂▆▇▂▆███▇▆██▇▇█▆██▆
Baseline Loss,▁▂▄▃▄▂▄▆▂▃▃▆▃▁▄▂▆▅▅▅▄▅█▅▂▂▄█▁▃▂▄▃▁▂▁▁▃▂▁
Loss_baseline,▁▂▄▃▄▂▄▆▂▃▃▆▃▁▄▂▆▅▅▅▄▅█▅▂▂▄█▁▃▂▄▃▁▂▁▁▃▂▁
Policy Loss,▇▇▆▆▆▄▄▆▄▄▅▄█▄▂▄▄▇▅▆▇▃▄▂▇▄▃▁▆▅▆█▅▆▆▄▅▇▇▃
Reward,▁▂▃▆▇▇██▆▇██████████████████████████████
Running Reward,▁▂▃▆▇▇██▆▇██████████████████████████████
Total average test reward,▁

0,1
Average Length,333.8
Average Total Reward,333.8
Avg_length,333.8
Avg_total_reward,333.8
Baseline Loss,40.10471
Loss_baseline,40.10471
Policy Loss,0.88201
Reward,98.2653
Running Reward,98.2653
Total average test reward,500.0


In [22]:
state_dicts = []
for i in range(n_run):

    # Instantiate a rendering and a non-rendering environment.
    env_render = gym.make('CartPole-v1', render_mode='human')
    env = gym.make('CartPole-v1')

    torch.manual_seed(seeds[i])
    env_render.reset(seed = val_seeds[i])
    env.reset(seed = seeds[i])

    run = wandb.init(
          # Set the project where this run will be logged
          project="Lab3-Final",
          # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
          name=f"ReinforceBas higher lrb",
          # Track hyperparameters and run metadata
          config={
          "learning_rate": 1e-2,
          "baaseline_learning_rate":1e-2,
          "architecture": "REINFORCE_BAS",
          "dataset": "CartPole",
          "hidden_layer_size": 128,
          "episodes": 2000,
          "gamma": 0.99,
          "episode_max_len": 500,
          "N": 100,
          "temperature": 10,
          "M": 10,
          "baseline": "BaselineNet",
          "learning_rate_baseline": 0.1})

    # Make a policy and a baseline network.
    policy = PolicyNet(env, inner_size=run.config["hidden_layer_size"], T=run.config["temperature"])
    baseline = BaselineNet(env)

    # Train the agent.r = ReinforceBas(policy, env, env_render, num_episodes=2000, baseline=baseline)
    r = ReinforceBas(policy, env, env_render, gamma=run.config["gamma"], num_episodes=run.config["episodes"],
                  lr=run.config["learning_rate"], max_len=run.config["episode_max_len"],
                     N=run.config["N"], eval_episodes=run.config["M"], baseline=baseline,
                     lrb=run.config["learning_rate_baseline"])
    (total, average, length, state_dict) = r.reinforce()

    state_dicts.append(state_dict)

    # Close up everything
    env_render.close()
    env.close()

Training agent with baseline value network.
Average Total: 19.8
Average Length: 19.8
Running reward: 0.9013606905937195
Average Total: 213.3
Average Length: 213.3
Running reward: 65.07324981689453
Average Total: 493.9
Average Length: 493.9
Running reward: 91.6904525756836
Average Total: 500.0
Average Length: 500.0
Running reward: 96.04548645019531
Average Total: 500.0
Average Length: 500.0
Running reward: 98.26725769042969
Average Total: 500.0
Average Length: 500.0
Running reward: 96.80255889892578
Average Total: 500.0
Average Length: 500.0
Running reward: 98.128662109375
Average Total: 500.0
Average Length: 500.0
Running reward: 95.4605484008789
Average Total: 500.0
Average Length: 500.0
Running reward: 98.33232879638672
Average Total: 500.0
Average Length: 500.0
Running reward: 98.28246307373047
Average Total: 500.0
Average Length: 500.0
Running reward: 98.15731048583984
Average Total: 500.0
Average Length: 500.0
Running reward: 94.3599624633789
Average Total: 500.0
Average Length: 5

0,1
Average Length,▁▄█████████████▇█▇██
Average Total Reward,▁▄█████████████▇█▇██
Avg_length,▁▄█████████████▇█▇██
Avg_total_reward,▁▄█████████████▇█▇██
Baseline Loss,▂▂▄▄▅▄▅▄▄▄▄▃▄▁▇▅▃█▂▃▄▄▃▁▃▁▂▁▂▁▁▁▂▁▁▁▃▅▂▂
Loss_baseline,▂▂▄▄▅▄▅▄▄▄▄▃▄▁▇▅▃█▂▃▄▄▃▁▃▁▂▁▂▁▁▁▂▁▁▁▃▅▂▂
Policy Loss,▄▆▆█▄▆▅▆▅▅▄▇▅▆▆▅▆▁▆▇▆█▅▆▃▅▆▅▇▆▇▆▇▅▅▅▆▇▇▅
Reward,▁▃▆▇█▇█████████████████████████████████▇
Running Reward,▁▃▆▇█▇█████████████████████████████████▇

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,24.9475
Loss_baseline,24.9475
Policy Loss,0.74484
Reward,87.66064
Running Reward,87.66064


Training agent with baseline value network.
Average Total: 21.8
Average Length: 21.8
Running reward: 0.6927111744880676
Average Total: 85.4
Average Length: 85.4
Running reward: 45.39704132080078
Average Total: 383.9
Average Length: 383.9
Running reward: 93.3168716430664
Average Total: 467.5
Average Length: 467.5
Running reward: 94.18602752685547
Average Total: 483.9
Average Length: 483.9
Running reward: 95.03372955322266
Average Total: 431.0
Average Length: 431.0
Running reward: 96.9093017578125
Average Total: 500.0
Average Length: 500.0
Running reward: 97.8935317993164
Average Total: 459.1
Average Length: 459.1
Running reward: 97.66127014160156
Average Total: 432.6
Average Length: 432.6
Running reward: 86.75359344482422
Average Total: 493.9
Average Length: 493.9
Running reward: 92.00419616699219
Average Total: 500.0
Average Length: 500.0
Running reward: 98.31185913085938
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34847259521484
Average Total: 500.0
Average Length: 5

0,1
Average Length,▁▂▆██▇█▇▇███████████
Average Total Reward,▁▂▆██▇█▇▇███████████
Avg_length,▁▂▆██▇█▇▇███████████
Avg_total_reward,▁▂▆██▇█▇▇███████████
Baseline Loss,▁▂▄▅▄▃▇▃▃▃▄▆▄▄▂▂▄▆▄▂▃▁▃▂▂▂▂▃▁▁▂▂▁▂█▃▂▁▁▄
Loss_baseline,▁▂▄▅▄▃▇▃▃▃▄▆▄▄▂▂▄▆▄▂▃▁▃▂▂▂▂▃▁▁▂▂▁▂█▃▂▁▁▄
Policy Loss,▃▄▁▆▆▇▁▃▇▄█▅▄▅▅▄▃▄▅▃▆▅▄▅▅▃▄▃▄▄▄▂▃▅▇▄▆▄▄▅
Reward,▁▃▄▆███▇███████▇████████████████████████
Running Reward,▁▃▄▆███▇███████▇████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,399.52283
Loss_baseline,399.52283
Policy Loss,-1.00553
Reward,97.75267
Running Reward,97.75267


Training agent with baseline value network.
Average Total: 22.7
Average Length: 22.7
Running reward: 1.5372270345687866
Average Total: 67.0
Average Length: 67.0
Running reward: 38.81025695800781
Average Total: 168.9
Average Length: 168.9
Running reward: 77.17304229736328
Average Total: 224.1
Average Length: 224.1
Running reward: 82.11857604980469
Average Total: 423.8
Average Length: 423.8
Running reward: 96.67865753173828
Average Total: 500.0
Average Length: 500.0
Running reward: 98.27657318115234
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34899139404297
Average Total: 500.0
Average Length: 500.0
Running reward: 98.03437805175781
Average Total: 500.0
Average Length: 500.0
Running reward: 98.21302795410156
Average Total: 500.0
Average Length: 500.0
Running reward: 96.08950805664062
Average Total: 500.0
Average Length: 500.0
Running reward: 98.31603240966797
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34922790527344
Average Total: 437.4
Average Length

0,1
Average Length,▁▂▃▄▇███████▇███▇███
Average Total Reward,▁▂▃▄▇███████▇███▇███
Avg_length,▁▂▃▄▇███████▇███▇███
Avg_total_reward,▁▂▃▄▇███████▇███▇███
Baseline Loss,▁▁▁▂▂▃▂▃█▃▃▂▂▁▂▁▁▁▂▃▂▂▁▂▂▃▂▂▃▂▂▂▁▁▁▁▁▂▁▁
Loss_baseline,▁▁▁▂▂▃▂▃█▃▃▂▂▁▂▁▁▁▂▃▂▂▁▂▂▃▂▂▃▂▂▂▁▁▁▁▁▂▁▁
Policy Loss,▇▆▇█▇▇▆█▁▇▇█▆▇▇▇▆▇▆█▆▇▇▇▇▆█▇▇▇▅▇▇▇▇▇▇▇▆▇
Reward,▁▂▃▅▆█▇█████████████████████████████████
Running Reward,▁▂▃▅▆█▇█████████████████████████████████

0,1
Average Length,500.0
Average Total Reward,500.0
Avg_length,500.0
Avg_total_reward,500.0
Baseline Loss,201.73639
Loss_baseline,201.73639
Policy Loss,1.46595
Reward,98.34936
Running Reward,98.34936


Training agent with baseline value network.
Average Total: 20.3
Average Length: 20.3
Running reward: 1.1382864713668823
Average Total: 125.3
Average Length: 125.3
Running reward: 58.06493377685547
Average Total: 398.3
Average Length: 398.3
Running reward: 87.54988861083984
Average Total: 433.3
Average Length: 433.3
Running reward: 95.77208709716797
Average Total: 500.0
Average Length: 500.0
Running reward: 97.10147094726562
Average Total: 467.9
Average Length: 467.9
Running reward: 97.2571792602539
Average Total: 500.0
Average Length: 500.0
Running reward: 98.1942138671875
Average Total: 500.0
Average Length: 500.0
Running reward: 98.2298355102539
Average Total: 500.0
Average Length: 500.0
Running reward: 96.50343322753906
Average Total: 500.0
Average Length: 500.0
Running reward: 98.25199890136719
Average Total: 500.0
Average Length: 500.0
Running reward: 98.34768676757812
Average Total: 500.0
Average Length: 500.0
Running reward: 96.89368438720703
Average Total: 500.0
Average Length:

0,1
Average Length,▁▃▇▇███████████▂▅▃▄█
Average Total Reward,▁▃▇▇███████████▂▅▃▄█
Avg_length,▁▃▇▇███████████▂▅▃▄█
Avg_total_reward,▁▃▇▇███████████▂▅▃▄█
Baseline Loss,▄▃▃█▅▃▅▁█▄▆▆▆▆▄▁▂▁▁▂▁▂▂▁▂▂▁▁▃▄▄▃▂▂▁▁▆▂▃▂
Loss_baseline,▄▃▃█▅▃▅▁█▄▆▆▆▆▄▁▂▁▁▂▁▂▂▁▂▂▁▁▃▄▄▃▂▂▁▁▆▂▃▂
Policy Loss,█▅▇▃▆▆▃▅▄▇▂▁▆▆▇▅▆▅▅▄▅▆▅▅▆▄▆▅▃▄▄▆▅▅▄▅▄▆▇▅
Reward,▁▃▅▆▇▇██████████████████▇█████▇█▇▇▇▇▇▇██
Running Reward,▁▃▅▆▇▇██████████████████▇█████▇█▇▇▇▇▇▇██

0,1
Average Length,473.6
Average Total Reward,473.6
Avg_length,473.6
Avg_total_reward,473.6
Baseline Loss,101.52434
Loss_baseline,101.52434
Policy Loss,-0.64806
Reward,93.21489
Running Reward,93.21489


Training agent with baseline value network.
Average Total: 19.9
Average Length: 19.9
Running reward: 0.8604653477668762
Average Total: 107.1
Average Length: 107.1
Running reward: 52.112220764160156
Average Total: 343.3
Average Length: 343.3
Running reward: 92.42385864257812
Average Total: 270.1
Average Length: 270.1
Running reward: 93.06834411621094
Average Total: 272.7
Average Length: 272.7
Running reward: 95.0214614868164
Average Total: 499.6
Average Length: 499.6
Running reward: 96.44585418701172
Average Total: 497.5
Average Length: 497.5
Running reward: 97.79147338867188
Average Total: 500.0
Average Length: 500.0
Running reward: 96.88499450683594
Average Total: 154.6
Average Length: 154.6
Running reward: 89.80244445800781
Average Total: 487.7
Average Length: 487.7
Running reward: 95.1417236328125
Average Total: 387.5
Average Length: 387.5
Running reward: 98.17354583740234
Average Total: 500.0
Average Length: 500.0
Running reward: 98.0136947631836
Average Total: 500.0
Average Length

In [23]:
# And run the final agent for a few episodes.
env_render = gym.make('CartPole-v1', render_mode='human')
env_render.reset(seed = 100)
r.setEnvRender(env_render)

average_test_rewards = []
for state_dict in state_dicts:
    total_rewards = 0
    policy = PolicyNet(env = env_render)
    policy.load_state_dict(state_dict)
    r.setPolicy(policy)
    for _ in range(20):
        (_, _, _, rewards, _, _) = r.run_episode(display=True, test=True)
        total_rewards += np.sum(rewards)
    average_test_reward = total_rewards / 20
    print(f'Average reward for episode: {average_test_reward}')
    average_test_rewards.append(average_test_reward)
avg_test_rew = np.sum(average_test_rewards) / 5
print(f'Total Average reward for test episode: {avg_test_rew}')
test_metrics = {"Total average test reward": avg_test_rew}
wandb.log({**test_metrics})
env_render.close()
wandb.finish()

Average reward for episode: 231.95
Average reward for episode: 500.0
Average reward for episode: 500.0
Average reward for episode: 471.4
Average reward for episode: 500.0
Total Average reward for test episode: 440.66999999999996


0,1
Average Length,▁▂▆▅▅███▃█▆████████▇
Average Total Reward,▁▂▆▅▅███▃█▆████████▇
Avg_length,▁▂▆▅▅███▃█▆████████▇
Avg_total_reward,▁▂▆▅▅███▃█▆████████▇
Baseline Loss,▃▃▇▇▄▄▃█▃▆▃▂▁▁▁▂▁▂▁▁▁▆▁▇▁▂▁▁▂▂▂▂▂▄▃▃▆▃▆▂
Loss_baseline,▃▃▇▇▄▄▃█▃▆▃▂▁▁▁▂▁▂▁▁▁▆▁▇▁▂▁▁▂▂▂▂▂▄▃▃▆▃▆▂
Policy Loss,▆▆▁▂▄▅▇▁▃█▃▆▅▅▅▃▄▆▄▅▅█▅▅▅▇▆▅▄▅▅▅▃▅▅▄▂▆▅▅
Reward,▁▂▅▇████████████▇▇██████████████████████
Running Reward,▁▂▅▇████████████▇▇██████████████████████
Total average test reward,▁

0,1
Average Length,425.2
Average Total Reward,425.2
Avg_length,425.2
Avg_total_reward,425.2
Baseline Loss,92.30949
Loss_baseline,92.30949
Policy Loss,-0.81379
Reward,98.30076
Running Reward,98.30076
Total average test reward,440.67


-----
## Exercise 3: Going Deeper

As usual, pick **AT LEAST ONE** of the following exercises to complete.

### Exercise 3.1: Solving Lunar Lander with `REINFORCE` (easy)

Use my (or even better, improve on my) implementation of `REINFORCE` to solve the [Lunar Lander Environment](https://gymnasium.farama.org/environments/box2d/lunar_lander/). This environment is a little bit harder than Cartpole, but not much. Make sure you perform the same types of analyses we did during the lab session to quantify and qualify the performance of your agents.

### Exercise 3.2: Solving Cartpole and Lunar Lander with `Deep Q-Learning` (harder)

On policy Deep Reinforcement Learning tends to be **very unstable**. Write an implementation (or adapt an existing one) of `Deep Q-Learning` to solve our two environments (Cartpole and Lunar Lander). To do this you will need to implement a **Replay Buffer** and use a second, slow-moving **target Q-Network** to stabilize learning.

### Exercise 3.3: Solving the OpenAI CarRacing environment (hardest)

Use `Deep Q-Learning` -- or even better, an off-the-shelf implementation of **Proximal Policy Optimization (PPO)** -- to train an agent to solve the [OpenAI CarRacing](https://github.com/andywu0913/OpenAI-GYM-CarRacing-DQN) environment. This will be the most *fun*, but also the most *difficult*. Some tips:

1. Make sure you use the `continuous=False` argument to the environment constructor. This ensures that the action space is **discrete** (we haven't seen how to work with continuous action spaces).
2. Your Q-Network will need to be a CNN. A simple one should do, with two convolutional + maxpool layers, folowed by a two dense layers. You will **definitely** want to use a GPU to train your agents.
3. The observation space of the environment is a single **color image** (a single frame of the game). Most implementations stack multiple frames (e.g. 3) after converting them to grayscale images as an observation.

