# Intro to RL 

## Problem setup: We want to make an AI that is able to complete a simple video game.

### What is the game we are going to start with?
In this game, we want our agent (character) to move through the 2D world and reach the goal. At each timestep our agent can move either up, down, left or right. The agent cannot move into obstacles, and when it reaches the goal, the game ends.

# insert video of game being played

We are going to use an environment that we built, called Griddy, that works in exactly the same way as other environments provided as part of OpenAI Gym.


The main ideas are:
<ul>
<li>we need to create our environment</li>
<li>we need to initialise it by calling `env.reset()`</li>
<li>we can increment the simulation by one timestep by calling `env.step(action)`</li>
</ul>
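
The three ideas above can be sketched as a loop. The `TinyLineEnv` class below is a hypothetical stand-in written only for this example (it is not Griddy); it mimics the `reset()` / `step(action)` interface so that the loop can run on its own.

```python
import random

class TinyLineEnv:
    """Hypothetical Gym-style environment: walk along positions 0..3, goal at 3."""

    def reset(self):
        self.pos = 0
        return self.pos                      # initial observation

    def step(self, action):
        # action 0 moves left, action 1 moves right; positions are clipped to 0..3
        self.pos = min(max(self.pos + (1 if action == 1 else -1), 0), 3)
        done = self.pos == 3
        reward = 1 if done else 0
        return self.pos, reward, done, {}    # observation, reward, done, info

def run_episode(env, policy, max_steps=1000):
    observation = env.reset()                # initialise the environment
    total_reward, steps, done = 0, 0, False
    while not done and steps < max_steps:
        action = policy(observation)
        observation, reward, done, info = env.step(action)  # advance one timestep
        total_reward += reward
        steps += 1
    return total_reward, steps

total_reward, steps = run_episode(TinyLineEnv(), lambda obs: random.choice([0, 1]))
print(total_reward, steps)
```

The same `run_episode` loop should work with any environment that follows this four-value `step` convention.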

Check out [OpenAI Gym's docs](http://gym.openai.com/docs/) to see how the environments work in general and in more detail.

Let's set up our simulation to train our agent in.


In [17]:
# IMPORTS
import time
import pickle
import numpy as np

from GriddyEnv import GriddyEnv # make sure you: pip3 install GriddyEnv

# SET UP THE ENVIRONMENT
env = GriddyEnv()    # create the environment

## Once we have an agent in the game, what do we do?

Our agent has no idea of how to win the game.
It simply observes states and takes actions.
As a result of these actions, the agent sees the environment change to a new state and also receives some sensation. This sensation, which may be good or bad, is called a **reward**.

This continuous interaction between the agent and our environment sets up the framework for a **reinforcement learning problem**, in an agent-environment loop as shown below.

![](images/agent-env-loop.png)

So without any prior knowledge, the agent has to learn about the game for itself. Just like a baby learns to interact with its world by playing with it, our agent has to try random actions in the environment to figure out what causes it to receive negative or positive rewards.

A function which tells the agent what actions to take from a given state is called a **policy**.

![](./images/policy.png)

Policies can be deterministic or stochastic (have randomness).

Mathematically, a policy is a probability distribution over actions, conditioned on the state. 
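
As a rough illustration of the difference, the sketch below writes one deterministic and one stochastic policy over four actions. The function names and the probabilities are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 4

def deterministic_policy(state):
    # always returns the same action for a given state
    return state % NUM_ACTIONS

def stochastic_policy(state):
    # pi(a | s): a probability for each action (here the same for every state)
    action_probs = np.array([0.1, 0.7, 0.1, 0.1])
    return int(rng.choice(NUM_ACTIONS, p=action_probs))

print(deterministic_policy(5), stochastic_policy(5))
```

A deterministic policy can be seen as the special case of a stochastic one in which a single action has probability 1.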

## Defining our RL problem

In our case:

### Action Space
The action space consists of 4 unique actions: 0, 1, 2, 3<br>
0 - Move left<br>
1 - Move right<br>
2 - Move up<br>
3 - Move down<br>

### Observation Space
Has shape (3, 4, 4). Our grid world is 4x4<br>
Each of the 3 channels is a binary mask for the location of different objects within the environment.<br>
Channel 0 - Goal<br>
Channel 1 - Wall<br>
Channel 2 - Agent<br>
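
To make the channel layout concrete, here is a small sketch that builds an observation by hand and converts the agent's position into a single integer state (row-major, 0 to 15). The helper name `agent_state` is hypothetical, and the hand-built array is only assumed to match what the environment returns.

```python
import numpy as np

GRID_SIZE = 4

# Example observation: goal in the bottom-right corner, agent at row 2, column 1, no walls
observation = np.zeros((3, GRID_SIZE, GRID_SIZE), dtype=np.int64)
observation[0, 3, 3] = 1   # channel 0: goal
observation[2, 2, 1] = 1   # channel 2: agent

def agent_state(observation):
    # find the single cell set to 1 in the agent channel
    row, col = np.argwhere(observation[2] == 1)[0]
    return int(GRID_SIZE * row + col)

print(agent_state(observation))
```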

Let's drop our agent into the environment, implement a random policy, and then watch it act for a few episodes.

In [18]:
#Visualise agent function
def visualise_agent(policy, n=5):
    try:
        for trial_i in range(n):
            observation = env.reset()
            done=False
            t=0
            while not done:
                env.render()
                agent_pos = list(zip(*np.where(observation[2] == 1)))[0]
                print('agent_pos:', agent_pos)
                state = 4 * agent_pos[0] + agent_pos[1]
                policy_action = policy(state)
#                 print('policy_action:', policy_action)
                observation, reward, done, info = env.step(policy_action)
                time.sleep(0.1)
                t += 1
#                 print('observation:', observation)
            env.render()
            time.sleep(1.5)
            print("Episode {} finished after {} timesteps".format(trial_i, t))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [19]:
# IMPLEMENT A RANDOM POLICY
def random_policy(state):
    action = np.random.randint(0, 4)
    return action

In [20]:
#visualise_agent(random_policy)

## How do we know if we are doing well?
## What should our agent try to do?
## What should our agent try to optimise?

When our agent takes an action and moves into a new state, the environment returns it a reward. The reward when it reaches the goal is +1, and 0 everywhere else. The reward that the agent receives at any point can be considered as what it feels in that moment - like pain or pleasure.

**However**, the reward doesn't tell the agent how good that move actually was, only whether it sensed anything, and how good or bad that sensation was at that particular moment.

E.g.
- Our agent might not receive any reward for stepping toward the goal, even though this might be a good move.
- A robot might receive a negative reward as its battery depletes, but still make good progress towards its goal.
- A chess playing agent might receive a positive reward for taking an opponent's piece, but make a bad move in doing so by exposing its king to an attack eventually causing it to lose the game.

What we really want to know is not the instantaneous reward, but "How good is the position I'm in right now?", that is, what amount of reward can our agent get from this point onwards.
This future reward is also known as the return.

![](./images/undiscounted_return.png)

#### Is getting a reward now as good as getting the same reward later?
- What if the reward is removed from the game in the next timestep?
- Would you rather be rich now or later?
- What if a larger reward is introduced and you don't have enough energy to reach both?
- What about inflation?

It's better to get rewards sooner rather than later.

![](./images/decay.jpg)

We can encode this into our goal by using a **discount factor**, $\gamma \in [0, 1]$ ($\gamma$ between 0 and 1).
A reward received $t$ timesteps in the future is multiplied by $\gamma^t$, the discount factor raised to the power of $t$. When $\gamma$ is less than 1, raising it to a higher power makes it smaller, so rewards further away in the future are weighted less than those nearby in time.

This makes our agent value immediate rewards more than those which can only be reached further in the future.
With discounting, the goal becomes:

![](./images/discounted_return.png)

This is called the **discounted return**. From here on, when we say return, we mean discounted return unless otherwise specified, because we rarely use the undiscounted version.
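
A minimal sketch of this sum, assuming the rewards of one episode are stored in a list where `rewards[t]` is the reward received `t` timesteps from now (the function name is hypothetical):

```python
def discounted_return(rewards, discount_factor):
    # weight the reward t timesteps ahead by discount_factor ** t
    return sum(discount_factor ** t * r for t, r in enumerate(rewards))

# The same +1 reward is worth less the later it arrives
print(discounted_return([1, 0, 0], 0.9))
print(discounted_return([0, 0, 1], 0.9))
```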

Once an agent has played a whole game it can calculate the return for each state it visited by simply adding up the discounted reward it achieved from there on.

The value of these return values can also be defined recursively as shown below.

![](./images/recursive_return.png)
 
Because of this recursive relationship, we can calculate the experienced returns for each visited state in a single pass through the trajectory of that episode. We do this by working backwards: first we take the return from the terminal state (always zero), and then we recursively calculate the return from each previous timestep by discounting the return that follows it and adding the reward from that timestep.
This process is called ***backup***.

The return of a terminal state is always zero, because from there the episode will have terminated and hence the agent will not be able to attain any further reward.

![](./images/backup.png)
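
The backward pass can be sketched as follows, assuming `rewards[t]` holds the reward received after acting at timestep `t` (the function name is hypothetical):

```python
def backup_returns(rewards, discount_factor):
    """Compute returns[t] = rewards[t] + discount_factor * returns[t + 1]."""
    returns = [0.0] * (len(rewards) + 1)   # the extra last entry is the terminal state
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + discount_factor * returns[t + 1]
    return returns

print(backup_returns([0, 0, 1], 0.5))
```

Each return is computed from the one after it, so the whole episode needs only one pass.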


## What tools might our agent be able to use to perform well?

### Value functions - so how good is each state?

Now we know how to calculate the return that we experienced during any single episode. 
But this is just a single sample estimate. 
In general, the goal of reinforcement learning is to maximise the **expected** future reward.
That is, to maximise the average return from the current state onwards.
This quantity is defined as the **state-value** of a state, commonly just referred to as the **value** of a state.
A function that returns this value given a state is called a **value function**.

![](./images/value_def.jpg)

**Note that a value function must correspond to some policy.** If we follow a bad policy then states will have lower values than if we follow a good policy. If we change the policy, then the value function will change.


### Q-functions - how good is taking a certain action in a certain state?

Another useful thing for our agent to know is how good it is to take a particular action, from a particular state.

![](./images/q_def.jpg)

### The Bellman equations

The Bellman equations are a way of expressing state and action value functions recursively.
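
For reference, one standard way of writing them is given here, using $p(s', r | s, a)$ for the model and $\pi(a | s)$ for the policy. This is the common textbook form, written out as an assumption about the notation intended in this notebook.

$$v_{\pi}(s) = \sum_{a} \pi(a | s) \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_{\pi}(s') \right]$$

$$q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \sum_{a'} \pi(a' | s') q_{\pi}(s', a') \right]$$

In both cases the value now is the expected immediate reward plus the discounted value of whatever comes next.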

### The Bellman Optimality equations

#### If we act optimally (following an optimal policy $\pi^*$), how do the Bellman equations for $V$ and $Q$ relate to each other?

#### Recovering the Bellman optimality equations

#### What do the Bellman optimality equations optimise?

What does it mean to reduce the Bellman error?
Until the Bellman optimality equations are solved, we have not found an optimal policy. 
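
As a hedged summary in standard notation (not necessarily the exact form this notebook intends), the Bellman optimality equations replace the average over the policy's actions with a maximum:

$$v_{*}(s) = \max_{a} q_{*}(s, a) = \max_{a} \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_{*}(s') \right]$$

$$q_{*}(s, a) = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma \max_{a'} q_{*}(s', a') \right]$$

For an estimate $V$ of the value function, one common definition of the Bellman error at a state is the gap between the two sides of the first equation:

$$\delta(s) = \left| \max_{a} \sum_{s', r} p(s', r | s, a) \left[ r + \gamma V(s') \right] - V(s) \right|$$

When $\delta(s) = 0$ for every state, $V$ satisfies the optimality equation.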


## How can we use these tools, and improve our agent's performance?

### Method 1: Dynamic Programming (DP) methods for computing value functions and improving policies

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect **model** of the environment.

A model tells us how the environment will change when we take certain actions. It may be stochastic or deterministic. A model can allow us to **simulate** the progression of the environment, taking any action from any state.

# diagram of model

**A model will also be referred to as a transition function.**

Fortunately (by our design) in this simple version of the game, we do know exactly what actions will lead us to what states. That means we have a perfect **model** of the environment. A model is a function that tells us how the state will change when we take certain actions. E.g. we know that if the agent tries to move up into an empty space, then that's where it will end up.

Let's run the cell below to define our transition function.

In [21]:
# TRANSITION FUNCTION/ENVIRONMENT MODEL
def old_transition(state, action):
    state = np.copy(state)
    agent_pos = list(zip(*np.where(state[2] == 1)))[0]
    new_agent_pos = np.array(agent_pos)
    if action==0:
        new_agent_pos[1]-=1
    elif action==1:
        new_agent_pos[1]+=1
    elif action==2:
        new_agent_pos[0]-=1
    elif action==3:
        new_agent_pos[0]+=1    
    new_agent_pos = np.clip(new_agent_pos, 0, 3)

    state[2, agent_pos[0], agent_pos[1]] = 0 # moved from this position so it is empty
    
    state[2, new_agent_pos[0], new_agent_pos[1]] = 1 # moved to this position
    
    goal_pos = list(zip(*np.where(state[0] == 1)))[0]
    agent_pos = list(zip(*np.where(state[2] == 1)))[0]
    reward = 1 if agent_pos == goal_pos else 0 # +1 for reaching the goal, 0 everywhere else

    return state, reward

def transition(state, action):
    # action encoding matches the action space defined above
    LEFT = 0
    RIGHT = 1
    UP = 2
    DOWN = 3
    GOAL = 15 # integer index of the goal cell (row 3, col 3)
    nrow = 4
    ncol = 4
    row = state // ncol # convert integer state to (row, col)
    col = state % ncol
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == RIGHT:
        col = min(col + 1, ncol - 1)
    elif action == UP:
        row = max(row - 1, 0)
    elif action == DOWN:
        row = min(row + 1, nrow - 1)
    state = ncol * row + col # convert back to integer
    reward = 0
    if state == GOAL:
        reward = 1
    return state, reward

If we have a model, we can look ahead to the successor states reachable from our current state.
If we also had a way to estimate the value function, then we could look ahead to the state that each action would take us to and take the action which results in the best expected return.

Acting greedily with respect to a value function for an optimal policy will produce optimal behaviour.
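
A one-step look-ahead of this kind might look like the sketch below. The `model` function is a hypothetical deterministic model of an empty 4x4 grid (no walls, goal at state 15), and the value table is filled with made-up numbers that grow towards the goal.

```python
GRID_SIZE = 4
GOAL = 15
# (row change, column change) for actions 0: left, 1: right, 2: up, 3: down
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}

def model(state, action):
    """Hypothetical deterministic model: returns (next_state, reward)."""
    row, col = divmod(state, GRID_SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), GRID_SIZE - 1)
    col = min(max(col + d_col, 0), GRID_SIZE - 1)
    next_state = GRID_SIZE * row + col
    reward = 1 if next_state == GOAL else 0
    return next_state, reward

def greedy_action(state, value_table, discount_factor=0.9):
    # score each action by the reward it earns plus the discounted value of where it lands
    def lookahead(action):
        next_state, reward = model(state, action)
        return reward + discount_factor * value_table[next_state]
    return max(MOVES, key=lookahead)

# Made-up values: higher for states that are fewer steps from the goal
value_table = {s: 0.9 ** (6 - (s // 4 + s % 4)) for s in range(16)}
print(greedy_action(14, value_table))  # state 14 sits immediately left of the goal
```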

![](./images/follow_values.png)

#### Computing value functions using dynamic programming
## Algorithm 1: Policy iteration

### Step 1: Policy evaluation step

Policy evaluation is the process of approximately evaluating the value function for our current policy.

How can we do this using our environment model?

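The Bellman equations define the value of any state recursively, as a function of the values of its successor states.
We know that we can use our model to simulate the next states that our agent will move into by taking any given action, and that it defines what rewards we might receive.
Given this, along with the fact that our value table already contains estimates of the values of those successor states, we can compute a new estimate for each state as the reward for the transition plus the discounted value estimate of the state that we land in. Sweeping over every state and repeating until the estimates stop changing gives us the value function for the current policy.

We can't overestimate the value of a state because the transition dynamics are known.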

### Step 2: Policy improvement step

### Does this converge to an optimal policy?

For a policy $\pi'$ to be at least as good as some other policy $\pi$, it must be such that $v_{\pi'}(s) \geq v_{\pi}(s)$ for all states.

Why is $\pi'$ strictly better than $\pi$?

$\pi$ determines which column of the action-value (Q) table we take for each state. This is fixed until we do policy improvement. As we perform policy evaluation, the values in the table might change, and so the column containing the maximum value for any given state might change.

### Full policy iteration algorithm

The full policy iteration algorithm iterates between policy evaluation and policy improvement. This alternately improves the value function by minimising the Bellman error, and then improves the policy by making it greedy with respect to that value function.

# policy eval algo

Local consistency: the values of each state must relate to those of all its neighbouring (reachable in 1 step) states according to the Bellman equation for $v(s)$.


In [22]:
def initialise_value_table(num_states=16):
    value_table = {} # start off with empty map
    for s in range(num_states): # for each state
        value_table[s] = 0 # initialise every state's value to zero
    print(value_table)
    return value_table

def initialise_policy(num_states=16, num_actions=4):
    policy = {} # start off with empty map
    for s in range(num_states):
        action_probs = np.random.rand((num_actions))
        action_probs /= sum(action_probs)
        policy[s] = action_probs
    return policy


def initialise_deterministic_policy(num_states=16, num_actions=4):
    policy = {} # start off with empty map
    for s in range(num_states):
        action_probs = np.random.rand((num_actions))
        action_probs /= sum(action_probs)
        policy[s] = np.argmax(action_probs) # only line different to above - stores a single action, rather than the distribution over actions
    return policy

In [25]:
# COMPUTING VALUE FUNCTION USING DYNAMIC PROGRAMMING
def policy_evaluation(policy, value_table, discount_factor, error_threshold=0.01, num_states=16):
    print()
    print(value_table)

    converged = False # initially we have not found a converged value function for this policy
    k = 0
    while not converged: # until the value function converges
        print('sweep ', k)
        k += 1
        new_value_table = {} # fresh table each sweep, so updates never overwrite the values we are reading from
        worst_delta = 0 # largest change in any state's value during this sweep
        for state in range(num_states): # loop over each state
            action = policy[state]
            new_state, reward = transition(state, action)
            new_val = reward + discount_factor * value_table[new_state]
            new_value_table[state] = new_val
            delta = abs(new_val - value_table[state]) # find the absolute diff between new val and old val for this state
            if delta > worst_delta:
                worst_delta = delta
                print('worst_delta:', worst_delta)
        if worst_delta < error_threshold: # once the values stop changing
            converged = True # we have found the value function
            print('Converged on value function')
        value_table = new_value_table # the next sweep reads from the values just computed
    return value_table # return value table evaluated for this policy
        
# IMPROVE POLICY
def policy_improvement(value_table, discount_factor): # set a greedy policy which will always be better than the previous
    new_policy = {}
    action_space = range(4) 
    for state in value_table.keys(): # loop over each state
        best_value = -float('inf')
        best_action = None
        print()
        print('STATE:', state)
        for action in action_space: 
            print('action:', action)
            new_state, reward = transition(state, action)
            value = reward + discount_factor * value_table[new_state]
            print('value:', value)
            if value > best_value:
                best_value = value
                best_action = action
        new_policy[state] = best_action
    return new_policy

def check_stable_policy(old_policy, new_policy):
    stable = True
    for s in range(16):
        old_action = old_policy[s]
        new_action = new_policy[s]
        if new_action != old_action:
            stable = False
    return stable

# # IMPLEMENT A GREEDY POLICY
# def greedy_policy(state, value_table):
#     best_expected_return = 0
#     best_action = None
#     for action in range(0, 4): # for all possible actions
#         new_state, reward = transition(state, action)
#         expected_return = 0
#         for new_state in new_states:
#             expected_return += reward + discount_factor * value_table[new_state] # our MODEL is deterministic, if it weren't, we'd need to average over different possible next states
#         if expected_return > best_expected_return:
#             best_expected_return = expected_return
#             best_action = action
#     return action

In [30]:
# POLICY ITERATION ALGORITHM
def policy_iteration(discount_factor=0.9):
    value_table = initialise_value_table()
#     policy = initialise_policy()
    policy = initialise_deterministic_policy()
    policy_stable = False
    policy_idx = 0
    while not policy_stable:
        print('Evaluating policy ', policy_idx)
        value_table = policy_evaluation(policy, value_table, discount_factor) # converge on value function
        print('Iterating policy ', policy_idx)
        new_policy = policy_improvement(value_table, discount_factor) # set greedy policy using converged value function
        if check_stable_policy(policy, new_policy):
            policy_stable = True
            print('Policy now stable - optimal policy found')

        policy = new_policy
        policy_idx += 1 # count how many policies we have evaluated so far
    print('Optimal policy:', policy)
    return policy
        
optimal_policy = policy_iteration()


{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0}
Evaluating policy  0

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0}
sweep  0
Converged on value function
Iterating policy  0

STATE: 0
action: 0
value: 0.0
action: 1
value: 0.0
action: 2
value: 0.0
action: 3
value: 0.0

STATE: 1
action: 0
value: 0.0
action: 1
value: 0.0
action: 2
value: 0.0
action: 3
value: 0.0

STATE: 2
action: 0
value: 0.0
action: 1
value: 0.0
action: 2
value: 0.0
action: 3
value: 0.0

STATE: 3
action: 0
value: 0.0
action: 1
value: 0.0
action: 2
value: 0.0
action: 3
value: 0.0

STATE: 4
action: 0
value: 0.0
action: 1
value: 0.0
action: 2
value: 0.0
action: 3
value: 0.0

STATE: 5
action: 0
value: 0.0
action: 1
value: 0.0
action: 2
value: 0.0
action: 3
value: 0.0

STATE: 6
action: 0
value: 0.0
action: 1
value: 0.0
action: 2
value: 0.0
action: 3
value: 0.0

STATE: 7
action: 0
value: 0.0
action: 1
value: 0.0
actio

In [31]:

def policy_map_to_func(policy_map):
    
    def policy(k):
#         print('state:', k)
#         print('action:', policy_map[k])
        return policy_map[k]
    
    return policy

optimal_policy_func = policy_map_to_func(optimal_policy)
visualise_agent(optimal_policy_func)

agent_pos: (2, 2)
[2 2]
Agent_pos: (2, 2)



Notice that the output policy is optimal, but it is not the only optimal policy. For this map it's easy to see that there could be alternatives.

## Algorithm 2: Value iteration - forget the explicit policy

Value iteration is very similar to policy iteration. The difference is that we only perform a single sweep over the state space when updating our value function, instead of repeating sweeps until our approximate value function converges for this policy.

Every time we perform a sweep, the value function gets closer to the true value function for the current policy. 
It can be seen that improved policies are found by acting greedily with respect to the current value estimates.
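
The sketch below shows one common way to write value iteration, in which the single evaluation sweep and the greedy improvement are folded into one update that takes the maximum over actions. It assumes a hypothetical deterministic `model` of an empty 4x4 grid with the goal at state 15, treated as terminal; the function names are made up for this example.

```python
GRID_SIZE, GOAL, ACTIONS = 4, 15, range(4)
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # left, right, up, down

def model(state, action):
    """Assumed deterministic model: returns (next_state, reward)."""
    row, col = divmod(state, GRID_SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), GRID_SIZE - 1)
    col = min(max(col + d_col, 0), GRID_SIZE - 1)
    next_state = GRID_SIZE * row + col
    return next_state, (1 if next_state == GOAL else 0)

def value_iteration(discount_factor=0.9, error_threshold=1e-6):
    values = {s: 0.0 for s in range(GRID_SIZE * GRID_SIZE)}
    while True:
        new_values, worst_delta = {}, 0.0
        for state in values:
            if state == GOAL:               # terminal state: no further reward
                new_values[state] = 0.0
                continue
            # back up the best one-step look-ahead instead of following a policy
            new_values[state] = max(
                reward + discount_factor * values[next_state]
                for next_state, reward in (model(state, a) for a in ACTIONS)
            )
            worst_delta = max(worst_delta, abs(new_values[state] - values[state]))
        values = new_values
        if worst_delta < error_threshold:
            return values

def greedy_policy_from(values, discount_factor=0.9):
    policy = {}
    for state in values:
        def lookahead(action):
            next_state, reward = model(state, action)
            return reward + discount_factor * values[next_state]
        policy[state] = max(ACTIONS, key=lookahead)
    return policy

values = value_iteration()
policy = greedy_policy_from(values)
print(values[0], policy[14])
```

A policy only appears at the very end, when it is read off the converged values.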


In [None]:
### implement greedy policy
#greedy policy
def greedy_policy(state):
    return policy_action

![](./images/update_values.png)

Value iteration is a type of **value based** method. Notice that to learn an optimal policy, we never have to represent it explicitly. There is no function which represents the policy. Instead we just look-ahead and choose the action that maximises the value of the next state.

Now that our agent is exploring the environment, let's implement value iteration.

In [None]:
epsilon = 1
i_episode = 0
discount_factor = 0.8
value_table = {}

In [None]:
def calculate_Gs(episode_mem, discount_factor=0.95):
    for i, mem in reversed(list(enumerate(episode_mem))): #start from terminal state
        if i==len(episode_mem)-1: #if terminal state, G=reward
            episode_mem[i]['G']= mem['reward'] 
        else:
            G = mem['reward']+discount_factor*episode_mem[i+1]['G']
            episode_mem[i]['G'] = G 
    return episode_mem

def update_value_table(value_table, episode_mem):
    all_diffs=[]
    for mem in episode_mem:
        key = pickle.dumps(mem['new_observation'])
        if key not in value_table:
            value_table[key]=0 #initialize
        new_val = max(value_table[key], mem['G'])
        diff = abs(value_table[key]-new_val)
        all_diffs.append(diff)
        value_table[key] = new_val
    return value_table, np.mean(all_diffs)

In [None]:
def train(policy, n_episodes=100):
    global epsilon
    global value_table
    global i_episode
    try:
        for _ in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t=0
            while not done:
                env.render()
                time.sleep(0.05)
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation=new_observation
                t+=1
                epsilon*=0.999
            episode_mem = calculate_Gs(episode_mem, discount_factor)
            value_table, v_delta = update_value_table(value_table, episode_mem)
            i_episode+=1
            print("Episode {} finished after {} timesteps. Epsilon={}. V_Delta={}".format(i_episode, t, epsilon, v_delta))
            env.render()
            time.sleep(1)
        env.close()
    except KeyboardInterrupt:
        env.close()

In [None]:
train(random_policy)


### Method 2: Monte-Carlo methods for computing value functions and improving policies

Monte-Carlo methods are those based on repeated sampling to estimate a quantity. 

#### Computing value functions using Monte-Carlo methods

For this method, we will use Monte-Carlo sampling to estimate the value of each state by running many episodes, and then doing backup from the terminal state.

The equation below shows how we can use Monte-Carlo sampling to compute the value function for any given policy.

![](./
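
As a rough sketch of the idea, the code below estimates state values by averaging sampled returns (every-visit Monte-Carlo). The 1-D corridor in `sample_episode` is a hypothetical stand-in for the game, and here each return is credited to the state it was measured from.

```python
import random
from collections import defaultdict

def sample_episode(policy, start=0, goal=3, max_steps=100):
    """Hypothetical 1-D corridor: states 0..3, reward 1 on reaching the goal."""
    state, trajectory = start, []
    for _ in range(max_steps):
        action = policy(state)                     # 0: left, 1: right
        next_state = min(max(state + (1 if action == 1 else -1), 0), goal)
        reward = 1 if next_state == goal else 0
        trajectory.append((state, reward))
        state = next_state
        if state == goal:
            break
    return trajectory

def monte_carlo_evaluation(policy, n_episodes=1000, discount_factor=0.9):
    return_sums = defaultdict(float)
    visit_counts = defaultdict(int)
    for _ in range(n_episodes):
        G = 0.0
        # work backwards so each return is one backup from the step after it
        for state, reward in reversed(sample_episode(policy)):
            G = reward + discount_factor * G
            return_sums[state] += G
            visit_counts[state] += 1
    # average all sampled returns for each visited state
    return {s: return_sums[s] / visit_counts[s] for s in return_sums}

random_walk = lambda state: random.choice([0, 1])
print(monte_carlo_evaluation(random_walk))
```

With more episodes, the averages should settle towards the expected return of each state under the sampled policy.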

#### Improving the policy using Monte-Carlo methods

As with DP, we can use generalised policy iteration to cyclically improve the value function for the current policy and then improve the current policy based on that value function. Again, policy improvement is done by acting greedily with respect to the current value function. However, in this method, we will use Monte-Carlo sampling to find the value function, instead of simulating transitions with a model as described above.

#### Finding an optimal policy using Monte-Carlo methods


Values will not be updated for states that aren't visited during an episode.



This process of iteratively updating the value function is called **value iteration**.

Let's implement an algorithm to find the value function for our random policy.



Now that our agent is capable of exploring and learning about its environment, we need to make it take advantage of what it knows so that it can perform well.
Our random policy has helped us to estimate the values of each state, which means we have some idea of how good each state is. Think about how we could use this knowledge to make our agent perform well before reading the next paragraphs.


# old transition func

Because we know exactly what states we can end up in by taking an action, we can just look at the values of those states and choose the action which leads us to the state with the greatest value. So we just move into the best state that we can reach at any point.
A policy that always takes the action that it expects will lead to the best currently reachable state is called a **greedy policy**.

Let's implement a greedy policy.

## Why not just act greedily all the time?

If we act greedily all the time then we will move into the state with the best value. But remember that these values are only estimates based on our agent's experience with the game, which means that they might not be correct. So if we want to make sure that our agent will do well by always choosing the next action greedily, we need to make sure that it has good estimates for the values of those states. This brings us to a core challenge in reinforcement learning: **the exploration vs exploitation dilemma**. Our agent can either exploit what it knows by using its current knowledge to choose the best action, or it can explore more and improve its knowledge, perhaps learning that some actions are better (or worse) than what it currently does.

## An epsilon-greedy policy
We can combine our random policy and our greedy policy to make an improved policy that both explores its environment and exploits its current knowledge. An $\epsilon$-greedy (epsilon-greedy) policy is one which exploits what it knows most of the time, but with probability $\epsilon$ will instead select a random action to try.

## Do we need to keep exploring once we are confident in the values of states?

As our agent explores more, it becomes more confident in predicting how valuable any state is. Once it knows a lot, it should start to explore less and exploit what it knows more. That means that we should decrease epsilon over time.
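
One simple schedule is to multiply epsilon by a constant slightly below 1 at every timestep, which is what the training loops in this notebook do with `epsilon*=0.999`. The sketch below adds a lower bound, `min_epsilon`, which is an assumption of this example rather than something the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(action_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the best one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))

epsilon = 1.0            # start fully random
decay = 0.999            # multiplicative decay per timestep
min_epsilon = 0.05       # hypothetical floor so that some exploration always remains

for step in range(5000):
    action = epsilon_greedy([0.1, 0.5, 0.2, 0.0], epsilon)
    epsilon = max(min_epsilon, epsilon * decay)

print(epsilon)
```

Without a floor, epsilon keeps shrinking towards zero and the agent eventually stops exploring altogether.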

Let's implement it

In [None]:
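def epsilon_greedy_policy(state):
    epsilon = 0.05 # probability of exploring with a random action
    if np.random.rand() < epsilon: # numpy is already imported as np; the random module is not
        return random_policy(state)
    else:
        return greedy_policy(state)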

## How can we find an optimal policy?

An optimal policy would take the best possible action in any state. Because of this, the optimal value function would give the maximum possible values for any state.

In the first line below, the maximum state-value of a state is equivalent to the maximum action-value when taking the best action in that state. Following this, we can derive a recursive definition of the optimal value function.

In the last step, we even remove the policy from the equation entirely! This means that value iteration never needs to explicitly represent a policy in terms of a function that takes in a state and returns a distribution over actions.
Instead, value iteration uses a **model**, $p(s', r | s, a)$, to look one step ahead, and takes the action, $a$, that most likely leads to the next state with the highest state-value.

A **model** defines how the state changes. It is also known as the transition dynamics of the environment. In our case the model is really simple: we are certain that taking the action to move right will move our agent one space to the right as long as there are no obstacles. There is no randomness in our environment (e.g. no wind that might push us into a different cell when we try to move right). That is, our environment is deterministic, not stochastic.

![](./images/bellman_op_v.png)

![](./images/backup_v.png)

![](./images/update_rule_v.png)

In [None]:
# implement value iteration to find optimal value function
def update_value_table(episode_mem, value_table, discount_factor=0.95, alpha=0.5):
    return value_table, v_delta

# Final Code solution with visualisation

In [None]:
import time
import pickle
import numpy as np

import gym
from griddy_env import GriddyEnv

In [None]:
def key_from_state(state):
    key = pickle.dumps(state)
    if key not in value_table:
        value_table[key]=0 #initialize
    return key

In [None]:
def update_value_table(episode_mem, value_table, discount_factor=0.95, learning_rate=0.1):
    all_diffs=[]
    for i, mem in reversed(list(enumerate(episode_mem))): #start from terminal state
        if i==len(episode_mem)-1: #if terminal state, G=reward
            calculated_new_v = episode_mem[i]['reward']
        else:
            calculated_new_v = mem['reward']+(discount_factor*np.max(greedy_policy(mem['new_observation'], return_action_vals=True)))
        key = key_from_state(mem['new_observation'])
        diff = abs(value_table[key]-calculated_new_v)
        all_diffs.append(diff)
        value_table[key] =  value_table[key] + learning_rate*(calculated_new_v-value_table[key])
    return value_table, np.mean(all_diffs)

In [None]:
#This is the transition model aka our model of the environment. Given state and action it predicts next state
def transition(state, action):
    state = np.copy(state)
    agent_pos = list(zip(*np.where(state[2] == 1)))[0]
    new_agent_pos = np.array(agent_pos)
    if action==0:
        new_agent_pos[1]-=1
    elif action==1:
        new_agent_pos[1]+=1
    elif action==2:
        new_agent_pos[0]-=1
    elif action==3:
        new_agent_pos[0]+=1    
    new_agent_pos = np.clip(new_agent_pos, 0, 3)

    state[2, agent_pos[0], agent_pos[1]] = 0 #moved from this position so it is empty
    state[2, new_agent_pos[0], new_agent_pos[1]] = 1 #moved to this position
    return state

In [None]:
def greedy_policy(state, return_action_vals=False):
    action_values=[]
    for test_action in range(4): #for each action
        new_state = transition(state, test_action)
        key = key_from_state(new_state)
        action_values.append(value_table[key])
    policy_action = np.argmax(action_values)
    if return_action_vals: return action_values
    return policy_action

In [None]:
def epsilon_greedy_policy(state):
    action = env.action_space.sample() if np.random.rand()<epsilon else greedy_policy(state)
    return action

In [None]:
def random_policy(state):
    return np.random.randint(0, 4)

In [None]:
def value_table_viz(value_table):
    values = np.zeros((4, 4))
    base_st = np.zeros((3, 4, 4), dtype=np.int64)
    base_st[0, 3, 3]=1
    for i in range(4):
        for j in range(4):
            test_st = np.copy(base_st)
            test_st[2, i, j] = 1
            key = pickle.dumps(test_st)
            if key in value_table:
                val = value_table[key]
            else:
                val=0
            values[i, j] = val
    return values

In [None]:
def visualise_agent(policy, value_table=None, n=5):
    try:
        for trial_i in range(n):
            observation = env.reset()
            done=False
            t=0
            while not done:
                if value_table: env.render(value_table_viz(value_table))
                else: env.render()
                policy_action = policy(observation)
                observation, reward, done, info = env.step(policy_action)
                time.sleep(0.5)
                t+=1
            env.render()
            time.sleep(1.5)
            print("Episode {} finished after {} timesteps".format(trial_i, t))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [None]:
env = GriddyEnv(4, 4)
epsilon = 1
i_episode=0
discount_factor=0.9
learning_rate=0.3
value_table = {}

In [None]:
def train(policy, n_episodes=100):
    global epsilon
    global value_table
    global i_episode
    try:
        for _ in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t=0
            while not done:
                env.render()
                time.sleep(0.05)
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation=new_observation
                t+=1
                epsilon*=0.999
            value_table, v_delta = update_value_table(episode_mem, value_table, discount_factor, learning_rate)
            i_episode+=1
            print("Episode {} finished after {} timesteps. Epsilon={}. V_Delta={}".format(i_episode, t, epsilon, v_delta))#, end='\r')
            #print(value_table_viz(value_table))
            #print()
            env.render(value_table_viz(value_table))
            time.sleep(2)
        env.close()
    except KeyboardInterrupt:
        env.close()

In [None]:
train(epsilon_greedy_policy)

In [None]:
print('Current estimates of value')
value_table_viz(value_table)

In [None]:
visualise_agent(greedy_policy, value_table)

## End of notebook!

Next you might want to check out:
- [Policy Gradients]()
