# Deep Reinforcement Learning

Reinforcement Learning (RL) is an approach in which an agent learns to make sequential decisions by interacting with an environment. The objective is for the agent to maximize the cumulative reward it receives over time.
The agent does this by repeatedly observing the consequences of its actions and shifting toward actions that lead to better outcomes.

To do this, we will use Gym, a platform for developing and comparing reinforcement learning algorithms. Gym provides a common interface to many environments: it accepts actions from an agent, plays them out in the environment, and returns the resulting observations and rewards.


## Environment

We will be using the `CartPole` environment from Gym's library for this assignment. In this environment, a pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pole starts upright on the cart, and the goal is to keep it balanced by applying forces to the cart in the left and right directions.

You can use the code below to run an instance of a random agent in this environment and see the results.

In [None]:
from IPython.display import HTML
from base64 import b64encode

def show_video(path):
    # Read the video bytes and close the file handle when done
    with open(path, 'rb') as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML("""
    <video width=400 controls>
          <source src="%s" type="video/mp4">
    </video>
    """ % data_url)

In [None]:
!pip install gym[atari,accept-rom-license] -qq
!pip install imageio -qq


In [None]:
import cv2
import gym
import imageio
import numpy as np
from gym import spaces

We use `gym.make()` to create an instance of an environment. Before interacting with it, we reset the environment to its initial state with the `.reset()` method. We can then call the `.step()` method, which accepts an action as input and performs it.

In [None]:
env_name = 'CartPole-v1'

env = gym.make(env_name)

env.reset()

frames = []

for _ in range(500):
    action = env.action_space.sample()

    obs, reward, done, _ = env.step(action)

    frames.append(env.render(mode='rgb_array'))

    if done:
        env.reset()

env.close()
imageio.mimsave('./cartpole.mp4', frames, fps=25)



In [None]:
show_video('./cartpole.mp4')

As you can see, the random agent fails to keep the pole balanced. In the next section we will train an agent to learn how to perform this task.
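
The random-agent loop above uses the older Gym interface, where `reset()` returns only the observation and `step()` returns four values. In Gym 0.26 and later (and in Gymnasium), `reset()` returns `(obs, info)`, `step()` returns `(obs, reward, terminated, truncated, info)`, and the render mode is passed to `gym.make(..., render_mode='rgb_array')` instead of to `render()`. If the cells fail with an unpacking error, small adapters along these lines may help. This is only a sketch: `unpack_reset` and `unpack_step` are hypothetical helper names, not part of Gym.

```python
def unpack_reset(result):
    """Return only the observation from env.reset(), for either API version."""
    # Newer API: (obs, info) where info is a dict. Older API: obs only.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def unpack_step(result):
    """Normalize env.step() output to the 4-tuple (obs, reward, done, info)."""
    if len(result) == 5:
        # Newer API: the episode is over if it terminated or was truncated.
        obs, reward, terminated, truncated, info = result
        return obs, reward, bool(terminated or truncated), info
    obs, reward, done, info = result
    return obs, reward, bool(done), info
```

With these helpers, a call such as `obs, reward, done, _ = unpack_step(env.step(action))` would behave the same under both interfaces.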

## Algorithm
We will be using the A2C algorithm.

Advantage Actor-Critic (A2C) is a reinforcement learning algorithm that consists of two parts: an actor, which selects an action based on the current state, and a critic, which estimates the state's value function, i.e. the expected future reward from that state.

We will implement this together step by step.




In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions.categorical import Categorical

import numpy as np
import gym
from collections import deque
from tqdm import tqdm

## Neural Network

Here we design a simple feed-forward model that embeds the observation from the environment into a hidden layer. On top of the hidden layer we use two fully connected heads: one predicts a probability distribution over the next action, and the other estimates the value of the current state. The same network therefore acts as both actor and critic.


In [None]:
class ActorCritic(nn.Module):
    def __init__(self, hidden_size, num_inputs, num_outputs):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(num_inputs, hidden_size)
        self.fc2_actor = nn.Linear(hidden_size, num_outputs)
        self.fc2_critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        action_probs = F.softmax(self.fc2_actor(x), dim=-1)
        value = self.fc2_critic(x)
        return action_probs, value
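
To make the actor head's output concrete, the following dependency-free sketch shows what the softmax does to raw scores and how an action is then sampled from the resulting distribution. The logits are made-up numbers standing in for the output of `fc2_actor`, and `softmax` and `sample_action` are illustrative helpers, not the PyTorch functions used in the model.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 0.0]      # hypothetical scores for CartPole's two actions
probs = softmax(logits)  # the higher score gets the higher probability
rng = random.Random(0)   # seeded only to make the sketch repeatable
action = sample_action(probs, rng)
print(probs, action)
```

Sampling rather than always taking the most probable action is what lets the agent keep exploring during training.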

## A2C

The A2C algorithm aims to jointly train both the actor and the critic to improve the policy. It does this by updating the parameters
of the actor to increase the likelihood of good actions and updating the parameters
of the critic to better estimate the value function.

In each iteration, A2C plays one episode until it ends (or until the step limit is reached). During this time it records the log probability of each chosen action, the reward, and the predicted value at every step. These quantities are used to update the model at the end of the trajectory.

The actor is updated using the objective below:

$$ L_{\text{actor}} = -\log \pi(a|s;\theta) \times A(s, a) $$
where the advantage is calculated as:
$$A(s, a) = Q(s, a) - V(s) $$

Here $Q(s,a)$ is the estimated value of taking action $a$ in state $s$, and $V(s)$ is the value predicted by our critic.

This loss function aims to improve the probability of playing actions that result in higher rewards.
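
The value $Q(s,a)$ is not available directly. One common choice, and the one the `compute_returns` method in the implementation further down appears to follow, is to estimate it with the discounted return of the sampled episode. As a sketch, with $r_t$ the reward at step $t$, $T$ the episode length, and $\gamma$ the discount factor (the `gamma` argument):

$$ R_t = \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k} = r_t + \gamma R_{t+1}, \qquad R_T = 0 $$

$$ A(s_t, a_t) \approx R_t - V(s_t) $$

The recursive form is why the returns can be computed in a single backward pass over the recorded rewards.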

For the critic, the loss is the squared error between the observed return $R$ of a state and the value predicted for it:

$$ L_{\text{critic}} = \frac{1}{2} ( R - V(s))^2 $$
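
A small worked example may help before reading the full agent. The sketch below uses plain Python and made-up numbers (a three-step episode, $\gamma = 0.5$, and invented critic predictions and log probabilities) to compute the discounted returns, the advantages, and both losses. It averages the losses over the steps and, as an assumption made to match the mean squared error used in the agent code, drops the $\frac{1}{2}$ factor from the critic loss.

```python
import math


def discounted_returns(rewards, gamma):
    """Compute R_t = r_t + gamma * R_{t+1} for every step, working backwards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical 3-step episode; all numbers are chosen for easy arithmetic.
rewards = [1.0, 1.0, 1.0]
gamma = 0.5
values = [1.0, 1.0, 1.0]         # made-up critic predictions V(s_t)
log_probs = [math.log(0.5)] * 3  # made-up log pi(a_t | s_t)

returns = discounted_returns(rewards, gamma)
advantages = [R - v for R, v in zip(returns, values)]

# Actor: weight each log probability by its advantage, then average.
actor_loss = -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)
# Critic: mean squared error between returns and predicted values.
critic_loss = sum((R - v) ** 2 for R, v in zip(returns, values)) / len(rewards)

print(returns)     # [1.75, 1.5, 1.0]
print(advantages)  # [0.75, 0.5, 0.0]
```

The first step has the largest advantage, so its action receives the strongest push; the last step's return matches the critic's prediction exactly and contributes nothing to either loss.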

In [None]:
class A2CAgent:
    def __init__(self, env, num_episodes=1000, max_steps=500, gamma=0.99, lr=1e-3, hidden_size=256):
        self.env = env
        self.num_episodes = num_episodes
        self.max_steps = max_steps
        self.gamma = gamma
        self.lr = lr
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        num_inputs = env.observation_space.shape[0]
        num_outputs = env.action_space.n

        self.policy_net = ActorCritic(hidden_size, num_inputs, num_outputs).to(self.device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=self.lr)

    def choose_action(self, state):
        state = torch.FloatTensor(state).to(self.device)
        # Inference only: no gradients are needed to pick an action
        with torch.no_grad():
            action_probs, _ = self.policy_net(state)
        dist = Categorical(action_probs)
        action = dist.sample()
        return action.item()

    def compute_returns(self, rewards):
        returns = []
        discounted_reward = 0
        for r in reversed(rewards):
            discounted_reward = r + self.gamma * discounted_reward
            returns.insert(0, discounted_reward)
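        # Use float32 explicitly so the returns match the dtype of the predicted values
        returns = torch.tensor(returns, dtype=torch.float32).to(self.device)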
        return returns

    def train(self):
        episode_rewards = []
        for episode in tqdm(range(self.num_episodes)):
            state = self.env.reset()
            log_probs = []
            values = []
            rewards = []
            episode_reward = 0

            for step in range(self.max_steps):
                state = torch.FloatTensor(state).to(self.device)
                action_probs, value = self.policy_net(state)
                dist = Categorical(action_probs)
                action = dist.sample()
                next_state, reward, done, _ = self.env.step(action.cpu().numpy())

                log_prob = dist.log_prob(action).unsqueeze(0)
                log_probs.append(log_prob)
                # Keep the graph: detaching here would stop the critic loss from training the critic head
                values.append(value)
                rewards.append(reward)
                episode_reward += reward

                state = next_state

                if done:
                    break

            log_probs = torch.cat(log_probs, dim=0)
            values = torch.cat(values)

            returns = self.compute_returns(rewards)

            advantages = returns - values

            actor_loss = -(advantages.detach() * log_probs).mean()

            critic_loss = F.mse_loss(values, returns)

            loss = actor_loss + critic_loss

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            episode_rewards.append(episode_reward)

            if (episode + 1) % 10 == 0:
                print(f"Episode {episode + 1}, Reward: {episode_reward}")

        self.env.close()
        return episode_rewards


Define the model and set hyperparameters.

In [None]:
env_name = 'CartPole-v1'
num_episodes = 1300
max_steps = 500
lr = 0.001
hidden_size = 256

env = gym.make(env_name)

a2c_model = A2CAgent(env, num_episodes=num_episodes, max_steps=max_steps, lr=lr, hidden_size=hidden_size)

Train the model.

In [None]:
rewards = a2c_model.train()

Episode 10, Reward: 11.0
Episode 20, Reward: 34.0
Episode 30, Reward: 17.0
Episode 40, Reward: 16.0
Episode 50, Reward: 39.0
Episode 60, Reward: 31.0
Episode 70, Reward: 107.0
Episode 80, Reward: 29.0
Episode 90, Reward: 29.0
Episode 100, Reward: 52.0
Episode 110, Reward: 47.0
Episode 120, Reward: 16.0
Episode 130, Reward: 30.0
Episode 140, Reward: 34.0
Episode 150, Reward: 30.0
Episode 160, Reward: 22.0
Episode 170, Reward: 52.0
Episode 180, Reward: 34.0
Episode 190, Reward: 102.0
Episode 200, Reward: 73.0
Episode 210, Reward: 31.0
Episode 220, Reward: 43.0
Episode 230, Reward: 59.0
Episode 240, Reward: 46.0
Episode 250, Reward: 147.0
Episode 260, Reward: 67.0
Episode 270, Reward: 92.0
Episode 280, Reward: 38.0
Episode 290, Reward: 80.0
Episode 300, Reward: 83.0
Episode 310, Reward: 254.0
Episode 320, Reward: 115.0
Episode 330, Reward: 96.0
Episode 340, Reward: 293.0
Episode 350, Reward: 101.0
Episode 360, Reward: 150.0
Episode 370, Reward: 76.0
Episode 380, Reward: 128.0
Episode 390, Reward: 327.0
Episode 400, Reward: 219.0
Episode 410, Reward: 147.0
Episode 420, Reward: 120.0
Episode 430, Reward: 212.0
Episode 440, Reward: 88.0
Episode 450, Reward: 500.0
Episode 460, Reward: 119.0
Episode 470, Reward: 239.0
Episode 480, Reward: 148.0
Episode 490, Reward: 151.0
Episode 500, Reward: 33.0
Episode 510, Reward: 27.0
Episode 520, Reward: 27.0
Episode 530, Reward: 35.0
Episode 540, Reward: 26.0
Episode 550, Reward: 119.0
Episode 560, Reward: 48.0
Episode 570, Reward: 46.0
Episode 580, Reward: 109.0
Episode 590, Reward: 21.0
Episode 600, Reward: 163.0
Episode 610, Reward: 239.0
Episode 620, Reward: 192.0
Episode 630, Reward: 121.0
Episode 640, Reward: 139.0
Episode 650, Reward: 174.0
Episode 660, Reward: 118.0
Episode 670, Reward: 302.0
Episode 680, Reward: 85.0
Episode 690, Reward: 155.0
Episode 700, Reward: 178.0
Episode 710, Reward: 106.0
Episode 720, Reward: 78.0
Episode 730, Reward: 130.0
Episode 740, Reward: 236.0
Episode 750, Reward: 371.0
Episode 760, Reward: 343.0
Episode 770, Reward: 249.0
Episode 780, Reward: 197.0
Episode 790, Reward: 17.0
Episode 800, Reward: 32.0
Episode 810, Reward: 198.0
Episode 820, Reward: 500.0
Episode 830, Reward: 500.0
Episode 840, Reward: 69.0
Episode 850, Reward: 213.0
Episode 860, Reward: 131.0
Episode 870, Reward: 158.0
Episode 880, Reward: 203.0
Episode 890, Reward: 183.0
Episode 900, Reward: 255.0
Episode 910, Reward: 377.0
Episode 920, Reward: 217.0
Episode 930, Reward: 234.0
Episode 940, Reward: 302.0
Episode 950, Reward: 370.0
Episode 960, Reward: 500.0
Episode 970, Reward: 189.0
Episode 980, Reward: 284.0
Episode 990, Reward: 500.0
Episode 1000, Reward: 500.0
Episode 1010, Reward: 234.0
Episode 1020, Reward: 210.0
Episode 1030, Reward: 349.0
Episode 1040, Reward: 500.0
Episode 1050, Reward: 260.0
Episode 1060, Reward: 500.0
Episode 1070, Reward: 291.0
Episode 1080, Reward: 500.0
Episode 1090, Reward: 375.0
Episode 1100, Reward: 311.0
Episode 1110, Reward: 500.0
Episode 1120, Reward: 429.0
Episode 1130, Reward: 339.0
Episode 1140, Reward: 500.0
Episode 1150, Reward: 500.0
Episode 1160, Reward: 422.0
Episode 1170, Reward: 144.0
Episode 1180, Reward: 203.0
Episode 1190, Reward: 500.0
Episode 1200, Reward: 289.0
Episode 1210, Reward: 500.0
Episode 1220, Reward: 470.0
Episode 1230, Reward: 500.0
Episode 1240, Reward: 500.0
Episode 1250, Reward: 500.0
Episode 1260, Reward: 500.0
Episode 1270, Reward: 500.0
Episode 1280, Reward: 357.0
Episode 1290, Reward: 251.0
100%|██████████| 1300/1300 [07:49<00:00,  2.77it/s]
Episode 1300, Reward: 499.0
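
Single-episode rewards are noisy, as the log above shows, so a smoothed curve is usually easier to read than individual values. The helper below is a minimal sketch of a moving average in plain Python; `moving_average` is a hypothetical name, and the sample list is simply the last six logged rewards (episodes 1250 to 1300, one per ten episodes), not the full `rewards` list returned by `train()`.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (full windows only)."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Logged rewards at episodes 1250, 1260, ..., 1300 from the output above.
logged = [500.0, 500.0, 500.0, 357.0, 251.0, 499.0]
smoothed = moving_average(logged, window=3)
print(smoothed)
```

Applied to the full list, for example `moving_average(rewards, 50)`, the same function would give a curve that can be plotted to judge whether training has stabilized.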





## Evaluation

Use the `choose_action` method of the trained agent to evaluate its performance.

In [None]:
env = gym.make(env_name)
model = a2c_model

num_episodes = 10
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

frames = []
episode_rewards = []


for i in range(num_episodes):
    state = env.reset()
    episode_reward = 0
    done = False

    while not done:
        frame = env.render(mode='rgb_array')
        frames.append(frame)

        action = model.choose_action(state)

        next_state, reward, done, _ = env.step(action)

        state = next_state
        episode_reward += reward

    episode_rewards.append(episode_reward)
    print(f"Episode {i+1} Reward: {episode_reward}")

env.close()

episode_rewards = np.array(episode_rewards)
avg_reward = np.mean(episode_rewards)
print(f"Average Reward over {num_episodes} episodes: {avg_reward}")

output_path = './test.mp4'
imageio.mimsave(output_path, frames, fps=25)

Episode 1 Reward: 320.0
Episode 2 Reward: 500.0
Episode 3 Reward: 500.0
Episode 4 Reward: 500.0
Episode 5 Reward: 237.0
Episode 6 Reward: 500.0
Episode 7 Reward: 500.0
Episode 8 Reward: 500.0
Episode 9 Reward: 500.0
Episode 10 Reward: 500.0
Average Reward over 10 episodes: 455.7


In [None]:
show_video('./test.mp4')