# Intro to RL 

## Problem setup: We want to make an AI that is able to complete a simple video game.

### What is the game we are going to start with?
In this game, we want our agent (character) to move through the 2D world and reach the goal. At each timestep our agent can move up, down, left or right. The agent cannot move into obstacles, and when it reaches the goal, the game ends.

# insert video of game being played

We are going to use an environment that we built, called Griddy, which works in exactly the same way as the other environments provided as part of OpenAI Gym.


The main ideas are:
<ul>
<li>we need to create our environment</li>
<li>we need to initialise it by calling `env.reset()`</li>
<li>we can increment the simulation by one timestep by calling `env.step(action)`</li>
</ul>

Check out [OpenAI Gym's docs](http://gym.openai.com/docs/) to see in more detail how the environments work in general.
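The create / reset / step cycle described above can be sketched without Griddy installed. The `TinyCorridorEnv` class below is a hypothetical stand-in invented for this illustration (it is not part of GriddyEnv or Gym); it only mimics the `reset()` and `step(action)` interface, assuming `step` returns the same `(observation, reward, done, info)` tuple used in the rest of this notebook.

```python
import random


class TinyCorridorEnv:
    """Hypothetical stand-in with a Gym-style interface (NOT GriddyEnv)."""

    def __init__(self, length=4):
        self.length = length
        self.pos = 0

    def reset(self):
        # put the agent back at the start and return the first observation
        self.pos = 0
        return self.pos

    def step(self, action):
        # action 1 moves right, anything else moves left; the ends clip the move
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1   # the goal is the right-most cell
        reward = 1 if done else 0
        return self.pos, reward, done, {}


demo_env = TinyCorridorEnv()        # 1. create the environment
observation = demo_env.reset()      # 2. initialise it
done = False
total_reward = 0
steps = 0
rng = random.Random(0)              # seeded so the run is repeatable
while not done:
    action = rng.choice([0, 1])     # pick any action
    observation, reward, done, info = demo_env.step(action)  # 3. advance one timestep
    total_reward += reward
    steps += 1
print("Reached the goal after {} steps".format(steps))
```

The loop has the same shape as the ones used with `GriddyEnv` later on: only the environment object and the way actions are chosen change.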

Let's set up our simulation to train our agent in.


In [1]:
# IMPORTS
import time
import pickle
import numpy as np

import gym
from GriddyEnv import GriddyEnv # make sure you: pip3 install GriddyEnv

# SET UP THE ENVIRONMENT
env = GriddyEnv()    # create the environment

## Once we have an agent in the game, what do we do?

Our agent has no idea how to win the game. It simply observes states that change based on its actions and receives a reward signal for doing so.
So the agent has to learn about the game for itself. Just like a baby learns to interact with its world by playing with it, our agent has to try random actions to figure out when and why it receives negative or positive rewards.

A function that tells the agent what to do in a given state is called a **policy**.

We need our agent to understand what actions might lead it to achieving high rewards, but it doesn't know anything about how to complete the game yet. So let's set up our environment and implement a random policy that takes in a state and returns a random action for the agent to take.

![](./images/policy.png)

### Action Space
The action space consists of 4 unique actions: 0, 1, 2, 3<br>
0 - Move left<br>
1 - Move right<br>
2 - Move up<br>
3 - Move down<br>

### Observation Space
The observation has shape (3, 4, 4), because our grid world is 4x4.<br>
Each of the 3 channels is a binary mask for the location of different objects within the environment.<br>
Channel 0 - Goal<br>
Channel 1 - Wall<br>
Channel 2 - Agent<br>
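To make the channel layout concrete, here is a small sketch that builds an observation by hand and reads positions back out of it. The cell positions are chosen purely for illustration, and `locate` is a hypothetical helper, not part of GriddyEnv.

```python
import numpy as np

# Hand-built observation: 3 binary masks over a 4x4 grid
observation = np.zeros((3, 4, 4), dtype=np.int64)
observation[0, 3, 3] = 1   # channel 0: goal in the bottom-right cell
observation[1, 1, 2] = 1   # channel 1: a single wall
observation[2, 0, 0] = 1   # channel 2: agent in the top-left cell


def locate(observation, channel):
    """Return (row, col) of the first cell set to 1 in the given channel."""
    rows, cols = np.where(observation[channel] == 1)
    return int(rows[0]), int(cols[0])


agent_pos = locate(observation, 2)
goal_pos = locate(observation, 0)
print("agent at", agent_pos, "goal at", goal_pos)
```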

In [5]:
#Visualise agent function
def visualise_agent(policy, n=5):
    try:
        for trial_i in range(n):
            observation = env.reset()
            done=False
            t=0
            while not done:
                env.render()
                policy_action = policy(observation)
                observation, reward, done, info = env.step(policy_action)
                time.sleep(0.5)
                t+=1
            env.render()
            time.sleep(1.5)
            print("Episode {} finished after {} timesteps".format(trial_i, t))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [15]:
# IMPLEMENT A RANDOM POLICY
def random_policy(state):
    #action = #fill this in
    return action
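If you get stuck on the exercise above, one possible way to fill it in is sketched here. It assumes only the four action indices listed under Action Space. The name `random_policy_sketch` and the constant `N_ACTIONS` are made up for this example; the final solution at the end of the notebook samples from `env.action_space` instead.

```python
import numpy as np

N_ACTIONS = 4  # 0: left, 1: right, 2: up, 3: down


def random_policy_sketch(state):
    """Ignore the state and return one of the 4 action indices uniformly at random."""
    return int(np.random.randint(N_ACTIONS))


print([random_policy_sketch(None) for _ in range(10)])
```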


## How do we know if we are doing well?

When our agent takes an action and moves into a new state, the environment returns a reward. The reward is +1 when the agent reaches the goal, and 0 everywhere else. The reward that the agent receives at any point can be thought of as what it feels in that moment, like pain or pleasure.

**However**, the reward doesn't tell the agent how good that move actually was, only whether it sensed anything, and how good or bad that sensation was.

E.g.
- Our agent might not receive any reward for stepping toward the goal, even though this might be a good move.
- A robot might receive a negative reward as its battery depletes, but still make good progress towards its goal.
- A chess-playing agent might receive a positive reward for taking an opponent's piece, but make a bad move in doing so by exposing its king to an attack that eventually causes it to lose the game.

What we really want to know is not the instantaneous reward, but "How good is the position I'm in right now?", that is, what amount of reward can our agent get from this point onwards.
This future reward is also known as the return.

![](./images/undiscounted_return.png)

#### Is getting a reward now as good as getting the same reward later?
- What if the reward is removed from the game in the next timestep?
- Would you rather be rich now or later?
- What if a larger reward is introduced and you don't have enough energy to reach both?
- What about inflation?

It's better to get rewards sooner rather than later.

![](./images/decay.png)

We can encode this into our goal by using a **discount factor**, $\gamma \in [0, 1]$ ($\gamma$ between 0 and 1). This makes our agent value immediate rewards more than those that can only be reached further in the future. The goal then becomes:

![](./images/discounted_return.png)


The return can be defined recursively, as shown below.

![](./images/recursive_return.png)
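Written out in symbols, and assuming the figures follow the standard textbook notation (with $R_{t+1}$ the reward received after acting at timestep $t$, and $G_t$ the return from timestep $t$), the discounted return and its recursive form are:

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\
G_t &= R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$

Setting $\gamma = 1$ recovers the undiscounted return.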

Because of this, we can calculate the returns by having our agent play one run-through of the game and then *backing-up* through that trajectory, step-by-step, looking forward at what the future reward was from that point.

The backup procedure is how we determine the return for each state that we visited in an episode. The return of a terminal state is always zero. Note that the terminal state is not the goal: it is the state that our agent transitions into once the episode has finished. We start the backup at the final timestep before our agent entered the terminal state, where the return is easy to calculate. It is simply the reward that we received for moving into this state, because the return from the next (terminal) state is always zero. Then, using the recursive expression for returns (above), we can calculate the return for the timestep before that. We repeat this step by step until we reach our initial state, at which point we know the returns for the whole episode.

![](./images/backup.png)
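As a worked example of the backup, the sketch below computes the returns for a three-step episode in which only the last step is rewarded. The function name `discounted_returns` and the example reward list are made up for illustration.

```python
def discounted_returns(rewards, discount_factor):
    """Back up through one episode, returning the return at every timestep."""
    returns = [0.0] * len(rewards)
    future_return = 0.0  # the return of the terminal state is zero
    for t in reversed(range(len(rewards))):
        # recursive definition: this reward plus the discounted future return
        future_return = rewards[t] + discount_factor * future_return
        returns[t] = future_return
    return returns


rewards = [0, 0, 1]  # the agent only gets +1 on the step that reaches the goal
returns = discounted_returns(rewards, discount_factor=0.8)
print(returns)  # approximately [0.64, 0.8, 1.0]
```

The last step has a return of 1, the step before it 0.8 × 1, and the first step 0.8 × 0.8, so states further from the goal have smaller returns.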

### So how good *is* each state?
In general, the goal of reinforcement learning is to maximise the **expected** future reward. That is, to maximise the expected return from the current state onwards. The measure of this is called the *value* of the state (or the state value). A function that predicts this value is called a **state-value function** or **value function**.

![](./images/value_def.png)

If we had a way to estimate this, then we could look ahead to the state that each action would take us to and take the action which results in us landing in the state with best value. 

![](./images/follow_values.png)

Values will not be updated for states that aren't visited during an episode.

If we initialise the value for each state as zero, and then average the return for each state over many episodes, that average return will converge to the true value of the state for this policy. In this notebook we call this process of iteratively updating the value function **value iteration** (in the wider RL literature, averaging sampled returns like this is usually called Monte Carlo policy evaluation).

![](./images/update_values.png)

Value iteration is a type of **value-based** method. Notice that to learn an optimal policy, we never have to represent it explicitly. There is no function which represents the policy. Instead we just look ahead and choose the action that maximises the value of the next state.

Now that our agent is exploring the environment, let's implement value iteration.

In [29]:
epsilon = 1
i_episode=0
discount_factor=0.8
value_table = {}

In [30]:
def calculate_Gs(episode_mem, discount_factor=0.95):
    return episode_mem

def update_value_table(value_table, episode_mem, alpha=0.5):
    return value_table, v_delta
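The two stubs above are left as an exercise. One way they might be filled in is sketched below, under several assumptions that the stubs do not confirm: each return is stored under a new `'G'` key, states are keyed by `ndarray.tobytes()`, and each return is credited to `new_observation` (the state the agent moved into), following the backup description above.

```python
import numpy as np


def calculate_Gs(episode_mem, discount_factor=0.95):
    """Attach the discounted return 'G' to every step, working backwards."""
    future_return = 0.0  # the terminal state has zero return
    for mem in reversed(episode_mem):
        future_return = mem['reward'] + discount_factor * future_return
        mem['G'] = future_return
    return episode_mem


def update_value_table(value_table, episode_mem, alpha=0.5):
    """Move each entered state's value a fraction alpha towards its observed return."""
    all_diffs = []
    for mem in episode_mem:
        key = mem['new_observation'].tobytes()   # hashable key for the state array
        old_val = value_table.get(key, 0.0)      # unseen states start at zero
        new_val = old_val + alpha * (mem['G'] - old_val)
        all_diffs.append(abs(new_val - old_val))
        value_table[key] = new_val
    v_delta = float(np.mean(all_diffs)) if all_diffs else 0.0
    return value_table, v_delta
```

Keying on `new_observation` means the goal cell itself carries the +1 reward, so a one-step look-ahead will prefer moving into it. The running update with `alpha` is an exponentially weighted average of returns rather than a plain mean; `v_delta` reports how much the table changed, which is handy for judging convergence.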

In [31]:
def train(policy, n_episodes=100):
    global epsilon
    global value_table
    global i_episode
    try:
        for _ in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t=0
            while not done:
                env.render()
                time.sleep(0.05)
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation=new_observation
                t+=1
                epsilon*=0.999
            episode_mem = calculate_Gs(episode_mem, discount_factor)
            value_table, v_delta = update_value_table(value_table, episode_mem)
            i_episode+=1
            print("Episode {} finished after {} timesteps. Epsilon={}. V_Delta={}".format(i_episode, t, epsilon, v_delta))
            env.render()
            time.sleep(1)
        env.close()
    except KeyboardInterrupt:
        env.close()

In [None]:
train(random_policy)

## How can we use the values that we know to perform well?

Now that our agent is capable of exploring and learning about its environment, we need to make it take advantage of what it knows so that it can perform well.
Our random policy has helped us to estimate the values of each state, which means we have some idea of how good each state is. Think about how we could use this knowledge to make our agent perform well before reading the next paragraphs.

In this simple version of the game, we know exactly what actions will lead us to what states. That means we have a perfect **model** of the environment. A model is a function that tells us how the state will change when we take certain actions. E.g. we know that if the agent tries to move up into an empty space, then that's where it will end up.

Because we know exactly what states we can end up in by taking an action, we can just look at the value of the states and choose the action which leads us to the state with the greatest value. So we just move into the best state that we can reach at any point.
A policy that always takes the action that it expects to lead to the best currently reachable state is called a **greedy policy**.

Let's implement a greedy policy.

In [None]:
#transition model
def transition(state, action):
    state = np.copy(state)
    agent_pos = list(zip(*np.where(state[2] == 1)))[0]
    new_agent_pos = np.array(agent_pos)
    if action==0:
        new_agent_pos[1]-=1
    elif action==1:
        new_agent_pos[1]+=1
    elif action==2:
        new_agent_pos[0]-=1
    elif action==3:
        new_agent_pos[0]+=1    
    new_agent_pos = np.clip(new_agent_pos, 0, 3)

    state[2, agent_pos[0], agent_pos[1]] = 0 #moved from this position so it is empty
    
    state[2, new_agent_pos[0], new_agent_pos[1]] = 1 #moved to this position
    return state

In [None]:
### implement greedy policy
#greedy policy
def greedy_policy(state):
    return policy_action
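A possible shape for the greedy policy is sketched below. To keep the example self-contained it carries its own compact copy of the transition model (`transition_sketch`, which like the model above ignores walls) and takes the value table as an explicit argument. Both of those choices, the `MOVES` table and the `tobytes()` keys are assumptions made for this illustration.

```python
import numpy as np

# (row, col) offsets for actions 0-3: left, right, up, down
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}


def transition_sketch(state, action):
    """Deterministic model: move the agent mask one cell, clipped to the grid."""
    state = np.copy(state)
    row, col = (int(idx[0]) for idx in np.where(state[2] == 1))
    d_row, d_col = MOVES[action]
    new_row = min(max(row + d_row, 0), state.shape[1] - 1)
    new_col = min(max(col + d_col, 0), state.shape[2] - 1)
    state[2, row, col] = 0          # the agent leaves its old cell
    state[2, new_row, new_col] = 1  # and appears in the new one
    return state


def greedy_policy_sketch(state, value_table):
    """Pick the action whose predicted next state has the highest stored value."""
    next_values = [
        value_table.get(transition_sketch(state, action).tobytes(), 0.0)
        for action in range(4)
    ]
    return int(np.argmax(next_values))
```

Note that `np.argmax` returns the first maximum, so with an empty value table this sketch always picks action 0. Breaking ties randomly would be a reasonable refinement.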

## Why not just act greedily all the time?

If we act greedily all the time then we will move into the state with the best value. But remember that these values are only estimates based on our agent's experience with the game, which means that they might not be correct. So if we want to make sure that our agent will do well by always choosing the next action greedily, we need to make sure that it has good estimates for the values of those states. This brings us to a core challenge in reinforcement learning: **the exploration vs exploitation dilemma**. Our agent can either exploit what it knows by using its current knowledge to choose the best action, or it can explore more and improve its knowledge, perhaps learning that some actions are better or worse than what it does currently.

## An epsilon-greedy policy
We can combine our random policy and our greedy policy to make an improved policy that both explores its environment and exploits its current knowledge. An $\epsilon$-greedy (epsilon-greedy) policy is one which exploits what it knows most of the time, but with probability $\epsilon$ will instead select a random action to try.

## Do we need to keep exploring once we are confident in the values of states?

As our agent explores more, it becomes more confident in predicting how valuable any state is. Once it knows a lot, it should start to explore less and exploit what it knows more. That means that we should decrease epsilon over time.

Let's implement it.

In [None]:
def epsilon_greedy_policy(state):
    # explore with probability epsilon (the global value that train() decays), otherwise exploit
    if np.random.rand() < epsilon:
        return random_policy(state)
    else:
        return greedy_policy(state)
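To get a feel for how quickly exploration fades, the sketch below computes epsilon after a number of multiplicative decay steps. The helper name `decayed_epsilon` is made up for this example; the decay rate of 0.995 per episode is the one used in the final solution at the end of this notebook.

```python
def decayed_epsilon(initial_epsilon, decay, n_steps):
    """Epsilon after n_steps applications of `epsilon *= decay`."""
    return initial_epsilon * decay ** n_steps


for n in (0, 100, 500, 1000):
    print("after {} episodes: epsilon = {:.4f}".format(n, decayed_epsilon(1.0, 0.995, n)))
```

A decay rate closer to 1 keeps the agent exploring for longer, which is safer in larger environments but slows down the point at which it starts exploiting what it has learned.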

# How can we find an optimal policy?

An optimal policy would take the best possible action in any state. Because of this, the optimal value function would give the maximum possible values for any state.

In the first line below, the maximum state-value of a state is equivalent to the maximum state-action value when taking the best action in that state. Following this, we can derive a recursive definition of the optimal value function.

In the last step, we even remove the policy from the equation entirely! This means that value iteration never needs to explicitly represent a policy in terms of a function that takes in a state and returns a distribution over actions.
Instead, value iteration uses a **model**, $p(s', r | s, a)$, to look one step ahead and take the action, $a$, that is most likely to lead to the next state with the highest value.

A **model** defines how the state changes. It is also known as the transition dynamics of the environment. In our case the model is really simple: we are certain that taking the action to move right will move our agent one space to the right as long as there are no obstacles. There is no randomness in our environment (e.g. no wind that might push us into a different cell when we try to move right). That is, our environment is deterministic, not stochastic.

![](./images/bellman_op_v.png)

![](./images/backup_v.png)

![](./images/update_rule_v.png)
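For reference, the derivation described above can be written out as follows. This is the standard textbook form of the Bellman optimality equation and is assumed, not confirmed, to match the figures; $v_*$ and $q_*$ denote the optimal state-value and state-action value functions.

$$
\begin{aligned}
v_*(s) &= \max_a q_*(s, a) \\
&= \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \right] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]
\end{aligned}
$$

No policy appears in the last line: only the model $p(s', r \mid s, a)$ and the value function itself. In a deterministic environment like ours, the sum collapses to the single next state that each action leads to.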

In [46]:
# implement value iteration to find optimal value function
def update_value_table(episode_mem, value_table, discount_factor=0.95, alpha=0.5):
    return value_table, v_delta

# Final Code solution with visualisation

In [1]:
import time
import pickle
import numpy as np
import torch
import torch.nn.functional as F

import gym
from GriddyEnv import GriddyEnv

In [2]:
class QNetwork(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(np.prod(env.observation_space.shape), 32)
        self.fc2 = torch.nn.Linear(32, 32)
        self.fc3 = torch.nn.Linear(32, env.action_space.n)
    def forward(self, obs):
        obs = obs.view(-1, np.prod(env.observation_space.shape))
        x = F.relu(self.fc1(obs))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    def create_optimizer(self, lr=0.001):
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)

In [3]:
def random_policy(state):
    return env.action_space.sample()

In [4]:
def create_greedy_policy(q_network):
    def greedy_policy(state, return_action_val=False):
        action_values = q_network(torch.tensor(state).double()).detach().numpy()
        policy_action = np.argmax(action_values)
        if return_action_val: return policy_action, action_values[0][policy_action]
        return policy_action
    return greedy_policy

def create_stochastic_policy(q_network):
    def stochastic_policy(state, return_action_val=False):
        action_values = q_network(torch.tensor(state).double()).detach().numpy()
        action_probs = F.softmax(torch.tensor(action_values), dim=-1)
        policy_action = torch.distributions.Categorical(action_probs).sample().item()
        if return_action_val: return policy_action, action_values[0][policy_action]
        return policy_action
    return stochastic_policy

In [5]:
def create_epsilon_greedy_policy(policy):
    def epsilon_greedy_policy(state):
        action = env.action_space.sample() if np.random.rand()<epsilon else policy(state)
        return action
    return epsilon_greedy_policy

In [6]:
##Q
def update_q_table(q_table, episode_mem, alpha=0.5):
    all_diffs=[]
    for mem in episode_mem:
        key = pickle.dumps(np.array((*mem['observation'].flatten(), mem['action'])))
        if key not in q_table:
            q_table[key]=0 #initialize
        new_val = q_table[key] + alpha*(mem['q']-q_table[key])
        diff = abs(q_table[key]-new_val)
        all_diffs.append(diff)
        q_table[key] = new_val
    return q_table, np.mean(all_diffs)

def calculate_qs(episode_mem, discount_factor=0.95):
    for i, mem in reversed(list(enumerate(episode_mem))):
        if i==len(episode_mem)-1:
            episode_mem[i]['q']= mem['reward']
        else:
            _, next_obs_q = greedy_policy(mem['new_observation'], return_action_val=True)
            calculated_q = mem['reward']+discount_factor*next_obs_q
            episode_mem[i]['q'] = calculated_q
    return episode_mem

def update_q_table(episode_mem, q_network, discount_factor=0.95, alpha=0.1):
    all_diffs=[]
    for i, mem in reversed(list(enumerate(episode_mem))):
        if i==len(episode_mem)-1:
            #episode_mem[i]['q']= mem['reward']
            calculated_new_q= mem['reward']
        else:
            _, next_obs_q = greedy_policy(mem['new_observation'], return_action_val=True)
            calculated_new_q = mem['reward']+discount_factor*next_obs_q
            #episode_mem[i]['q'] = calculated_q
        predicted_old_q = q_network(torch.tensor(mem['observation']).double())[0, mem['action']]
        all_diffs.append(abs(calculated_new_q-predicted_old_q.item()))
        label_new_q = predicted_old_q.item() + alpha*(calculated_new_q-predicted_old_q.item())
        cost = F.mse_loss(predicted_old_q, torch.tensor(label_new_q).double())
        cost.backward()
        q_network.optimizer.step()
        q_network.optimizer.zero_grad()   
    return np.mean(all_diffs)

In [7]:
def q_table_viz(q_network):
    qs = np.zeros((4, 4, 4))
    base_st = np.zeros((3, 4, 4), dtype=np.int64)
    base_st[0, 3, 3]=1
    for i in range(4):
        for j in range(4):
            test_st = np.copy(base_st)
            test_st[2, i, j] = 1
            action_vals = q_network(torch.tensor(test_st).double())
            for action in range(4):                
                qs[action, i, j] = action_vals[0, action].item()
    return qs

In [8]:
def visualise_agent(policy, value_network=None, n=5):
    try:
        for trial_i in range(n):
            observation = env.reset()
            done=False
            t=0
            while not done:
                # NOTE: value_table_viz is not defined in this notebook, so leave value_network as None
                if value_network: env.render(value_table_viz(value_network, observation))
                else: env.render()
                policy_action = policy(observation)
                observation, reward, done, info = env.step(policy_action)
                #time.sleep(0.5)
                t+=1
            env.render()
            time.sleep(1.5)
            print("Episode {} finished after {} timesteps".format(trial_i, t))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [15]:
#HYPER-PARAMS
epsilon = 1
i_episode=0
discount_factor=0.95
alpha=0.1
lr = 0.001

env = GriddyEnv(4, 4)
#env = gym.make('CartPole-v1')
q_network = QNetwork().double()
q_network.create_optimizer(lr)
greedy_policy = create_greedy_policy(q_network)
stochastic_policy = create_stochastic_policy(q_network)
epsilon_greedy_policy = create_epsilon_greedy_policy(greedy_policy)

In [16]:
def train(policy, n_episodes=100):
    global epsilon
    global q_network
    global i_episode
    try:
        for _ in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t=0
            while not done:
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation=new_observation
                t+=1
            epsilon*=0.995
            q_delta = update_q_table(episode_mem, q_network, discount_factor, alpha)
            i_episode+=1
            print("Episode {} finished after {} timesteps. Eplislon={}. Q_Delta={}".format(i_episode, t, epsilon, q_delta))#, end='\r')
            #print(value_table_viz(value_table))
            #print()
            #env.render(value_table_viz(value_network, observation))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [17]:
train(epsilon_greedy_policy, 1000)

Episode 1 finished after 51 timesteps. Eplislon=0.995. Q_Delta=0.04986628715166174
Episode 2 finished after 23 timesteps. Eplislon=0.990025. Q_Delta=0.056947377301697
Episode 3 finished after 86 timesteps. Eplislon=0.985074875. Q_Delta=0.021745694623375847
Episode 4 finished after 20 timesteps. Eplislon=0.9801495006250001. Q_Delta=0.055985401194907844
Episode 5 finished after 114 timesteps. Eplislon=0.9752487531218751. Q_Delta=0.02882166354646702
Episode 6 finished after 45 timesteps. Eplislon=0.9703725093562657. Q_Delta=0.041306373344464804
Episode 7 finished after 6 timesteps. Eplislon=0.9655206468094844. Q_Delta=0.1187547271733982
Episode 8 finished after 15 timesteps. Eplislon=0.960693043575437. Q_Delta=0.06616494181511673
Episode 9 finished after 35 timesteps. Eplislon=0.9558895783575597. Q_Delta=0.03837591349893982
Episode 10 finished after 91 timesteps. Eplislon=0.9511101304657719. Q_Delta=0.0265574246309001
Episode 11 finished after 40 timesteps. Eplislon=0.946354579813443. Q_D

Episode 105 finished after 28 timesteps. Eplislon=0.5907768628656763. Q_Delta=0.020146360036629635
Episode 106 finished after 3 timesteps. Eplislon=0.5878229785513479. Q_Delta=0.04894994638155359
Episode 107 finished after 3 timesteps. Eplislon=0.5848838636585911. Q_Delta=0.031681350629919315
Episode 108 finished after 11 timesteps. Eplislon=0.5819594443402982. Q_Delta=0.03570705571483095
Episode 109 finished after 19 timesteps. Eplislon=0.5790496471185967. Q_Delta=0.012107811341320714
Episode 110 finished after 5 timesteps. Eplislon=0.5761543988830038. Q_Delta=0.03926569400799429
Episode 111 finished after 7 timesteps. Eplislon=0.5732736268885887. Q_Delta=0.03310783952735468
Episode 112 finished after 5 timesteps. Eplislon=0.5704072587541458. Q_Delta=0.01497484501213
Episode 113 finished after 5 timesteps. Eplislon=0.567555222460375. Q_Delta=0.005559804730397544
Episode 114 finished after 5 timesteps. Eplislon=0.5647174463480732. Q_Delta=0.014339550230658315
Episode 115 finished after

Episode 200 finished after 15 timesteps. Eplislon=0.3669578217261671. Q_Delta=0.012575077396928366
Episode 201 finished after 13 timesteps. Eplislon=0.36512303261753626. Q_Delta=0.015827056081499404
Episode 202 finished after 3 timesteps. Eplislon=0.3632974174544486. Q_Delta=0.006650874675507956
Episode 203 finished after 3 timesteps. Eplislon=0.3614809303671764. Q_Delta=0.004439213406411698
Episode 204 finished after 3 timesteps. Eplislon=0.3596735257153405. Q_Delta=0.00535291236887514
Episode 205 finished after 4 timesteps. Eplislon=0.3578751580867638. Q_Delta=0.010394307555756238
Episode 206 finished after 6 timesteps. Eplislon=0.35608578229633. Q_Delta=0.01219462225119472
Episode 207 finished after 2 timesteps. Eplislon=0.3543053533848483. Q_Delta=0.009607777554484387
Episode 208 finished after 5 timesteps. Eplislon=0.35253382661792404. Q_Delta=0.012052036861087734
Episode 209 finished after 6 timesteps. Eplislon=0.3507711574848344. Q_Delta=0.010305986701007944
Episode 210 finished

Episode 307 finished after 6 timesteps. Eplislon=0.21462770857094118. Q_Delta=0.007905978554825721
Episode 308 finished after 5 timesteps. Eplislon=0.21355457002808648. Q_Delta=0.008984661812290119
Episode 309 finished after 3 timesteps. Eplislon=0.21248679717794605. Q_Delta=0.004933824127484221
Episode 310 finished after 8 timesteps. Eplislon=0.21142436319205632. Q_Delta=0.004266256453677883
Episode 311 finished after 7 timesteps. Eplislon=0.21036724137609603. Q_Delta=0.0045295896390898606
Episode 312 finished after 4 timesteps. Eplislon=0.20931540516921554. Q_Delta=0.004891946655480289
Episode 313 finished after 4 timesteps. Eplislon=0.20826882814336947. Q_Delta=0.010013760796200283
Episode 314 finished after 7 timesteps. Eplislon=0.20722748400265262. Q_Delta=0.007631506822683479
Episode 315 finished after 3 timesteps. Eplislon=0.20619134658263935. Q_Delta=0.007424825090603789
Episode 316 finished after 5 timesteps. Eplislon=0.20516038984972615. Q_Delta=0.005921788933210737
Episode 3

Episode 431 finished after 3 timesteps. Eplislon=0.11527836319047392. Q_Delta=0.003200969466322201
Episode 432 finished after 5 timesteps. Eplislon=0.11470197137452155. Q_Delta=0.006236093516882457
Episode 433 finished after 5 timesteps. Eplislon=0.11412846151764894. Q_Delta=0.013108714909293084
Episode 434 finished after 1 timesteps. Eplislon=0.1135578192100607. Q_Delta=0.012357474611901065
Episode 435 finished after 1 timesteps. Eplislon=0.11299003011401039. Q_Delta=0.00508299209440799
Episode 436 finished after 2 timesteps. Eplislon=0.11242507996344034. Q_Delta=0.006660552900823846
Episode 437 finished after 3 timesteps. Eplislon=0.11186295456362313. Q_Delta=0.0039487370428384905
Episode 438 finished after 3 timesteps. Eplislon=0.11130363979080501. Q_Delta=0.0022965453444757644
Episode 439 finished after 3 timesteps. Eplislon=0.11074712159185099. Q_Delta=0.01628315133202385
Episode 440 finished after 4 timesteps. Eplislon=0.11019338598389174. Q_Delta=0.005263421534251023
Episode 441

Episode 519 finished after 5 timesteps. Eplislon=0.07416156859737154. Q_Delta=0.007820365748267544
Episode 520 finished after 4 timesteps. Eplislon=0.07379076075438468. Q_Delta=0.003470093738319663
Episode 521 finished after 6 timesteps. Eplislon=0.07342180695061275. Q_Delta=0.004099684747838468
Episode 522 finished after 1 timesteps. Eplislon=0.07305469791585968. Q_Delta=0.0032657335962373857
Episode 523 finished after 1 timesteps. Eplislon=0.07268942442628039. Q_Delta=0.013344548639540488
Episode 524 finished after 4 timesteps. Eplislon=0.07232597730414898. Q_Delta=0.003248868691335921
Episode 525 finished after 1 timesteps. Eplislon=0.07196434741762824. Q_Delta=0.0015803164897327804
Episode 526 finished after 3 timesteps. Eplislon=0.0716045256805401. Q_Delta=0.0007019124507892845
Episode 527 finished after 3 timesteps. Eplislon=0.0712465030521374. Q_Delta=0.002087947934996078
Episode 528 finished after 3 timesteps. Eplislon=0.0708902705368767. Q_Delta=0.0015626081626223993
Episode 5

Episode 624 finished after 3 timesteps. Eplislon=0.043812938841152796. Q_Delta=0.0014546705271606026
Episode 625 finished after 2 timesteps. Eplislon=0.04359387414694703. Q_Delta=0.004487296075013136
Episode 626 finished after 2 timesteps. Eplislon=0.043375904776212296. Q_Delta=0.0013796267009917873
Episode 627 finished after 1 timesteps. Eplislon=0.043159025252331236. Q_Delta=0.003332468132373867
Episode 628 finished after 3 timesteps. Eplislon=0.04294323012606958. Q_Delta=0.0032566359533100497
Episode 629 finished after 4 timesteps. Eplislon=0.04272851397543923. Q_Delta=0.0021325069799648655
Episode 630 finished after 3 timesteps. Eplislon=0.04251487140556204. Q_Delta=0.003125126083971789
Episode 631 finished after 3 timesteps. Eplislon=0.04230229704853423. Q_Delta=0.0028147055330015234
Episode 632 finished after 1 timesteps. Eplislon=0.04209078556329156. Q_Delta=0.0017723014944238669
Episode 633 finished after 6 timesteps. Eplislon=0.0418803316354751. Q_Delta=0.002384112364453239
Ep

Episode 719 finished after 6 timesteps. Eplislon=0.027214167668287145. Q_Delta=0.004895127901852907
Episode 720 finished after 2 timesteps. Eplislon=0.02707809682994571. Q_Delta=0.00856036716100228
Episode 721 finished after 1 timesteps. Eplislon=0.02694270634579598. Q_Delta=0.0065264251840333465
Episode 722 finished after 3 timesteps. Eplislon=0.026807992814067. Q_Delta=0.002512195086742756
Episode 723 finished after 18 timesteps. Eplislon=0.026673952849996664. Q_Delta=0.01879972856137302
Episode 724 finished after 5 timesteps. Eplislon=0.02654058308574668. Q_Delta=0.009240225185856588
Episode 725 finished after 5 timesteps. Eplislon=0.026407880170317945. Q_Delta=0.004642761409249197
Episode 726 finished after 1 timesteps. Eplislon=0.026275840769466357. Q_Delta=0.012768441817741838
Episode 727 finished after 2 timesteps. Eplislon=0.026144461565619025. Q_Delta=0.01451903071060151
Episode 728 finished after 2 timesteps. Eplislon=0.02601373925779093. Q_Delta=0.0034511315742916104
Episode

Episode 821 finished after 3 timesteps. Eplislon=0.01632109498333433. Q_Delta=0.012697607657368678
Episode 822 finished after 2 timesteps. Eplislon=0.016239489508417658. Q_Delta=0.010592762486057838
Episode 823 finished after 3 timesteps. Eplislon=0.01615829206087557. Q_Delta=0.011632140584848924
Episode 824 finished after 5 timesteps. Eplislon=0.01607750060057119. Q_Delta=0.007456593520653976
Episode 825 finished after 1 timesteps. Eplislon=0.015997113097568336. Q_Delta=0.011308076422018587
Episode 826 finished after 4 timesteps. Eplislon=0.015917127532080494. Q_Delta=0.00865785619096085
Episode 827 finished after 3 timesteps. Eplislon=0.01583754189442009. Q_Delta=0.0172583607896248
Episode 828 finished after 1 timesteps. Eplislon=0.01575835418494799. Q_Delta=0.010676976160319152
Episode 829 finished after 4 timesteps. Eplislon=0.01567956241402325. Q_Delta=0.006427898981893082
Episode 830 finished after 2 timesteps. Eplislon=0.015601164601953134. Q_Delta=0.008921366472544878
Episode 8

Episode 920 finished after 3 timesteps. Eplislon=0.009936519429207103. Q_Delta=0.00164299734434592
Episode 921 finished after 2 timesteps. Eplislon=0.009886836832061067. Q_Delta=0.001581027097098786
Episode 922 finished after 2 timesteps. Eplislon=0.009837402647900761. Q_Delta=0.001755569542337243
Episode 923 finished after 6 timesteps. Eplislon=0.009788215634661257. Q_Delta=0.0014150551475976363
Episode 924 finished after 3 timesteps. Eplislon=0.00973927455648795. Q_Delta=0.0009132430291965606
Episode 925 finished after 2 timesteps. Eplislon=0.009690578183705511. Q_Delta=0.0014000870182348524
Episode 926 finished after 5 timesteps. Eplislon=0.009642125292786984. Q_Delta=0.0006566394594036184
Episode 927 finished after 2 timesteps. Eplislon=0.009593914666323049. Q_Delta=0.001621002215896672
Episode 928 finished after 2 timesteps. Eplislon=0.009545945092991434. Q_Delta=0.0007925335086186314
Episode 929 finished after 3 timesteps. Eplislon=0.009498215367526477. Q_Delta=0.0006297905615617

In [38]:
print('Current estimates of q')
q_table_viz(q_network)

Current estimates of q


array([[[0.6158861 , 0.62582307, 0.62775117, 0.63344254],
        [0.59492904, 0.64371173, 0.72779348, 0.76158336],
        [0.63116915, 0.6493037 , 0.77441553, 0.81960775],
        [0.64674917, 0.68068603, 0.82203199, 0.67465353]],

       [[0.66970614, 0.74170172, 0.71648302, 0.74727872],
        [0.6548344 , 0.73713401, 0.80110498, 0.90997436],
        [0.67860998, 0.79380229, 0.979512  , 0.94932395],
        [0.74961898, 0.89303597, 1.11797035, 0.8362021 ]],

       [[0.59005972, 0.60746182, 0.58005861, 0.60399724],
        [0.54453877, 0.58100982, 0.66956028, 0.69903702],
        [0.54524714, 0.63031498, 0.75585223, 0.74375212],
        [0.58360111, 0.65783605, 0.7621886 , 0.64929498]],

       [[0.67290375, 0.73610613, 0.71822396, 0.76673053],
        [0.62813164, 0.73591152, 0.86242113, 0.9477278 ],
        [0.65914157, 0.80449378, 1.01969181, 1.12815753],
        [0.67966637, 0.80821061, 1.01098999, 0.82250206]]])
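One way to sanity-check this printout is to look at the cells next to the goal. The array is indexed as `qs[action, row, col]` and `q_table_viz` places the goal at cell (3, 3). The sketch below copies the four Q-values for two neighbouring cells from the printout above and picks the best action for each; `ACTION_NAMES` is a helper list made up for this example, following the action numbering given under Action Space.

```python
import numpy as np

ACTION_NAMES = ['left', 'right', 'up', 'down']  # action indices 0-3

# Q-values copied from the printout, ordered by action index
q_left_of_goal = np.array([0.82203199, 1.11797035, 0.7621886, 1.01098999])   # cell (3, 2)
q_above_goal = np.array([0.81960775, 0.94932395, 0.74375212, 1.12815753])    # cell (2, 3)

best_left_of_goal = ACTION_NAMES[int(np.argmax(q_left_of_goal))]
best_above_goal = ACTION_NAMES[int(np.argmax(q_above_goal))]
print(best_left_of_goal, best_above_goal)  # right down
```

In both cells the highest-valued action is the one that steps straight into the goal, which is what we would hope for from a trained network.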

In [18]:
visualise_agent(greedy_policy, n=3)

Episode 0 finished after 4 timesteps
Episode 1 finished after 3 timesteps
Episode 2 finished after 5 timesteps


## End of notebook!

Next you might want to check out:
- [Policy Gradients]()
