# Intro to RL 

## Problem setup: We want to make an AI that is able to complete a simple video game.

### What is the game we are going to start with?
In this game, we want our agent (character) to move through the 2D world and reach the goal. At each timestep our agent can move up, down, left or right. The agent cannot move into obstacles, and when it reaches the goal, the game ends.

# insert video of game being played

We are going to use an environment that we built, called Griddy, that works in exactly the same way as other environments provided as part of OpenAI Gym.


The main ideas are:
<ul>
<li>we need to create our environment</li>
<li>we need to initialise it by calling `env.reset()`</li>
<li>we can increment the simulation by one timestep by calling `env.step(action)`</li>
</ul>

Check out [OpenAI Gym's docs](http://gym.openai.com/docs/) to see how the environments work in general and in more detail.
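
To make the reset/step cycle concrete, here is a minimal sketch using a stand-in environment. `CorridorEnv` and `run_episode` are hypothetical names written for this illustration (this is not Griddy and not part of Gym). The class only mimics the `reset()` / `step(action)` interface described above, with `step` returning `(observation, reward, done, info)`.

```python
import random

class CorridorEnv:
    """Hypothetical 1D corridor: the agent starts in cell 0 and the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # put the agent back at the start and return the first observation
        self.position = 0
        return self.position

    def step(self, action):
        # action 0 moves left, action 1 moves right; the ends of the corridor clamp the position
        move = -1 if action == 0 else 1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1 if done else 0
        return self.position, reward, done, {}

def run_episode(env, policy, max_steps=1000):
    """Play one episode (or at most max_steps timesteps) and return (total reward, timesteps)."""
    observation = env.reset()
    total_reward, t, done = 0, 0, False
    while not done and t < max_steps:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        t += 1
    return total_reward, t

demo_env = CorridorEnv(length=5)
total_reward, steps = run_episode(demo_env, lambda obs: random.choice([0, 1]))
print("Reward {} after {} timesteps".format(total_reward, steps))
```

The same loop shape (reset once, then step until `done`) is what the functions later in this notebook use with Griddy.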

Let's set up our simulation to train our agent in.


In [25]:
# IMPORTS
import time
import pickle
import numpy as np

import gym
from griddy_env import GriddyEnv

# SET UP THE ENVIRONMENT
env = GriddyEnv()    # create the environment

## Once we have an agent in the game, what do we do?

Our agent has no idea of how to win the game. It simply observes states that change based on its actions and receives a reward signal for doing so.
So the agent has to learn about the game for itself. Just like a baby learns to interact with its world by playing with it, our agent has to try random actions to figure out when and why it receives negative or positive rewards.

A function which tells the agent what to do in a given state is called a **policy**.

We need our agent to understand what actions might lead it to achieving high rewards, but it doesn't know anything about how to complete the game yet. So let's set up our environment and implement a random policy that takes in a state and returns a random action for the agent to take.

![](./images/policy.png)

### Action Space
The action space consists of 4 unique actions: 0, 1, 2, 3<br>
0 - Move left<br>
1 - Move right<br>
2 - Move up<br>
3 - Move down<br>

### Observation Space
Observations have shape (3, 4, 4), because our grid world is 4x4.<br>
Each of the 3 channels is a binary mask for the location of different objects within the environment.<br>
Channel 0 - Goal<br>
Channel 1 - Wall<br>
Channel 2 - Agent<br>
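
As an illustration of this layout, the sketch below builds an observation by hand with NumPy and reads the agent's position back out of it. The positions chosen (goal in the bottom-right corner, one wall, agent in the top-left) are made up for the example, and `make_observation` and `find_agent` are hypothetical helpers, not part of Griddy.

```python
import numpy as np

GOAL, WALL, AGENT = 0, 1, 2  # channel indices as listed above

def make_observation(goal, walls, agent, size=4):
    """Build a (3, size, size) stack of binary masks from (row, col) positions."""
    obs = np.zeros((3, size, size), dtype=np.int64)
    obs[GOAL, goal[0], goal[1]] = 1
    for row, col in walls:
        obs[WALL, row, col] = 1
    obs[AGENT, agent[0], agent[1]] = 1
    return obs

def find_agent(obs):
    """Return the (row, col) of the first cell set in the agent channel."""
    rows, cols = np.where(obs[AGENT] == 1)
    return int(rows[0]), int(cols[0])

example_obs = make_observation(goal=(3, 3), walls=[(1, 1)], agent=(0, 0))
print(example_obs.shape)        # (3, 4, 4)
print(find_agent(example_obs))  # (0, 0)
```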

In [5]:
#Visualise agent function
def visualise_agent(policy, n=5):
    try:
        for trial_i in range(n):
            observation = env.reset()
            done=False
            t=0
            while not done:
                env.render()
                policy_action = policy(observation)
                observation, reward, done, info = env.step(policy_action)
                time.sleep(0.5)
                t+=1
            env.render()
            time.sleep(1.5)
            print("Episode {} finished after {} timesteps".format(trial_i, t))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [15]:
# IMPLEMENT A RANDOM POLICY
def random_policy(state):
    #action = #fill this in
    return action


## How do we know if we are doing well?

When our agent takes an action and moves into a new state, the environment returns a reward. The reward is +1 when the agent reaches the goal and 0 everywhere else. The reward that the agent receives at any point can be thought of as what it feels in that moment, like pain or pleasure.

**However**, the reward doesn't tell the agent how good that move actually was, only whether it sensed anything, and how good or bad that sensation was.

E.g.
- Our agent might not receive any reward for stepping toward the goal, even though this might be a good move.
- A robot might receive a negative reward as its battery depletes, but still make good progress towards its goal.
- A chess playing agent might receive a positive reward for taking an opponent's piece, but make a bad move in doing so by exposing its king to an attack eventually causing it to lose the game.

What we really want to know is not the instantaneous reward, but "How good is the position I'm in right now?", that is, what amount of reward can our agent get from this point onwards.
This future reward is also known as the return.

![](./images/undiscounted_return.png)

#### Is getting a reward now as good as getting the same reward later?
- What if the reward is removed from the game in the next timestep?
- Would you rather be rich now or later?
- What if a larger reward is introduced and you don't have enough energy to reach both?
- What about inflation?

It's better to get rewards sooner rather than later.

![](./images/decay.png)

We can encode this into our goal by using a **discount factor**, $\gamma \in [0, 1]$ ($\gamma$ between 0 and 1). This makes our agent value immediate rewards more than those which can only be reached further in the future. The goal then becomes:

![](./images/discounted_return.png)


The value of these return values can be defined recursively as shown below.

![](./images/recursive_return.png)
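
In the notation commonly used in reinforcement learning textbooks (an assumption here; the figures above may use slightly different symbols), let $R_{t+1}$ be the reward received after acting at timestep $t$, and $T$ the final timestep. The discounted return and its recursive form can then be written as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$

$$G_t = R_{t+1} + \gamma G_{t+1}, \qquad G_T = 0$$

For example, if the agent receives rewards 0, 0 and 1 over its last three steps and $\gamma = 0.9$, the returns from those steps are $0.81$, $0.9$ and $1$ respectively.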

Because of this, we can calculate the returns by having our agent play one run-through of the game and then *backing-up* through that trajectory, step-by-step, looking forward at what the future reward was from that point.

The backup procedure lets us determine the return for each state that we visited in an episode. The return of a terminal state is always zero. Note that the terminal state is not the goal: it is the state which our agent transitions into once the episode has finished. We start the backup at the final timestep before our agent entered the terminal state, where the return is easy to calculate. It is simply the reward that we received at that timestep, because the return from the next state (the terminal state) is always zero. Then, using the recursive expression of returns (above), we can calculate the return for the timestep before that. We repeat this until we reach our initial state, at which point we know the returns for the whole episode.

![](./images/backup.png)
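
The backup can be sketched in a few lines of Python. This is a minimal illustration that works on a plain list of rewards rather than on the episode memory used later in the notebook. The function name `discounted_returns` is an assumption, and the discount factor of 0.5 is chosen only to keep the numbers easy to check by hand.

```python
def discounted_returns(rewards, discount_factor=0.9):
    """Back up through one episode's rewards and give the return from each timestep."""
    returns = [0.0] * len(rewards)
    future_return = 0.0  # the return from the terminal state is zero
    for t in reversed(range(len(rewards))):
        # recursive definition: return now = reward now + discounted return from the next step
        future_return = rewards[t] + discount_factor * future_return
        returns[t] = future_return
    return returns

print(discounted_returns([0, 0, 0, 1], discount_factor=0.5))
# [0.125, 0.25, 0.5, 1.0]
```

Because the loop runs backwards, each return is computed from the one after it, so the whole episode is processed in a single pass.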

### So how good *is* each state?
In general, the goal of reinforcement learning is to maximise the **expected** future reward. That is, to maximise the expected return from the current state onwards. This expected return is called the *value* of the state (or the state value). A function that predicts this value is called a **state-value function** or **value function**.

![](./images/value_def.png)

If we had a way to estimate this, then we could look ahead to the state that each action would take us to and take the action that lands us in the state with the highest value.

![](./images/follow_values.png)

Values will not be updated for states that aren't visited during an episode.

If we initialise the value for each state as zero, and then average the return for each state over many episodes, that average return will converge to the true value of the state for this policy. This process of iteratively updating the value function is called **value iteration**.

![](./images/update_values.png)
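
One possible way to keep such an average without storing every return is an incremental mean, sketched below. The string state keys, the `demo_counts` table and the function name `average_returns_into_values` are assumptions made for this illustration. The exercise that follows uses a step size `alpha` instead of exact visit counts.

```python
from collections import defaultdict

def average_returns_into_values(values, visit_counts, episode_returns):
    """Fold one episode of (state, return) pairs into running averages."""
    for state, episode_return in episode_returns:
        visit_counts[state] += 1
        # incremental mean: new_avg = old_avg + (sample - old_avg) / n
        error = episode_return - values[state]
        values[state] += error / visit_counts[state]
    return values

demo_values = defaultdict(float)  # every state starts with value zero
demo_counts = defaultdict(int)
average_returns_into_values(demo_values, demo_counts, [("A", 0.5), ("B", 1.0)])
average_returns_into_values(demo_values, demo_counts, [("A", 1.0)])
print(demo_values["A"], demo_values["B"])  # 0.75 1.0
```

State "B" is only visited in the first episode, so its estimate does not change in the second one.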

Value iteration is a type of **value based** method. Notice that to learn an optimal policy, we never have to represent it explicitly. There is no function which represents the policy. Instead we just look-ahead and choose the action that maximises the value of the next state.

Now that our agent is exploring the environment, let's implement value iteration.

In [29]:
epsilon = 1
i_episode=0
discount_factor=0.8
value_table = {}

In [30]:
def calculate_Gs(episode_mem, discount_factor=0.95):
    return episode_mem

def update_value_table(value_table, episode_mem, alpha=0.5):
    return value_table, v_delta

In [31]:
def train(policy, n_episodes=100):
    global epsilon
    global value_table
    global i_episode
    try:
        for _ in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t=0
            while not done:
                env.render()
                time.sleep(0.05)
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation=new_observation
                t+=1
                epsilon*=0.999
            episode_mem = calculate_Gs(episode_mem, discount_factor)
            value_table, v_delta = update_value_table(value_table, episode_mem)
            i_episode+=1
            print("Episode {} finished after {} timesteps. Epsilon={}. V_Delta={}".format(i_episode, t, epsilon, v_delta))
            env.render()
            time.sleep(1)
        env.close()
    except KeyboardInterrupt:
        env.close()

In [None]:
train(random_policy)

## How can we use the values that we know to perform well?

Now that our agent is capable of exploring and learning about its environment, we need to make it take advantage of what it knows so that it can perform well.
Our random policy has helped us to estimate the values of each state, which means we have some idea of how good each state is. Think about how we could use this knowledge to make our agent perform well before reading the next paragraphs.

In this simple version of the game, we know exactly what actions will lead us to what states. That means we have a perfect **model** of the environment. A model is a function that tells us how the state will change when we take certain actions. E.g. we know that if the agent tries to move up into an empty space, then that's where it will end up.

Because we know exactly what states we can end up in by taking an action, we can just look at the value of the states and choose the action which leads us to the state with the greatest value. So we just move into the best state that we can reach at any point.
A policy that always takes the action that it expects to lead to the best currently reachable state is called a **greedy policy**.
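
To illustrate the idea on something smaller than Griddy, the sketch below uses a made-up three-state corridor. The `toy_model` and `toy_values` tables and the function name `greedy_action` are hypothetical, chosen only for this example.

```python
# Deterministic toy model: toy_model[state][action] is the next state
toy_model = {
    "left":   {0: "left",   1: "middle"},
    "middle": {0: "left",   1: "right"},
    "right":  {0: "middle", 1: "right"},
}
# Current value estimates for each state
toy_values = {"left": 0.2, "middle": 0.5, "right": 0.9}

def greedy_action(state, model, values):
    """Look one step ahead and pick the action whose next state has the highest value."""
    return max(model[state], key=lambda action: values[model[state][action]])

print(greedy_action("left", toy_model, toy_values))    # 1
print(greedy_action("middle", toy_model, toy_values))  # 1
```

With these values the best move is always to the right. If the value estimates were reversed, the same function would pick action 0 instead.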

Let's implement a greedy policy.

In [None]:
#transition model
def transition(state, action):
    state = np.copy(state)
    agent_pos = list(zip(*np.where(state[2] == 1)))[0]
    new_agent_pos = np.array(agent_pos)
    if action==0:
        new_agent_pos[1]-=1
    elif action==1:
        new_agent_pos[1]+=1
    elif action==2:
        new_agent_pos[0]-=1
    elif action==3:
        new_agent_pos[0]+=1    
    new_agent_pos = np.clip(new_agent_pos, 0, 3)

    state[2, agent_pos[0], agent_pos[1]] = 0 #moved from this position so it is empty
    
    state[2, new_agent_pos[0], new_agent_pos[1]] = 1 #moved to this position
    return state

In [None]:
### implement greedy policy
#greedy policy
def greedy_policy(state):
    return policy_action

## Why not just act greedily all the time?

If we act greedily all the time then we will move into the state with the best value. But remember that these values are only estimates based on our agent's experience with the game, which means that they might not be correct. So if we want to make sure that our agent will do well by always choosing the next action greedily, we need to make sure that it has good estimates for the values of those states. This brings us to a core challenge in reinforcement learning: **the exploration vs exploitation dilemma**. Our agent can either exploit what it knows by using its current knowledge to choose the best action, or it can explore more and improve its knowledge, perhaps learning that some actions are better (or even worse) than what it does currently.

## An epsilon-greedy policy
We can combine our random policy and our greedy policy to make an improved policy that both explores its environment and exploits its current knowledge. An $\epsilon$-greedy (epsilon-greedy) policy is one which exploits what it knows most of the time, but with probability $\epsilon$ will instead select a random action to try.

## Do we need to keep exploring once we are confident in the values of states?

As our agent explores more, it becomes more confident in predicting how valuable any state is. Once it knows a lot, it should start to explore less and exploit what it knows more. That means that we should decrease epsilon over time.
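
A common way to do this is to multiply epsilon by a constant slightly below 1 after every step. The sketch below shows how quickly such a schedule shrinks. The decay rate of 0.999 matches the training loop in this notebook, while the function name `decayed_epsilon` and the `min_epsilon` floor are assumptions added for illustration.

```python
def decayed_epsilon(initial_epsilon, decay_rate, n_steps, min_epsilon=0.0):
    """Epsilon after n_steps multiplicative decays, never dropping below min_epsilon."""
    return max(min_epsilon, initial_epsilon * decay_rate ** n_steps)

for n_steps in (0, 1000, 5000):
    print(n_steps, round(decayed_epsilon(1.0, 0.999, n_steps), 3))
```

Starting from 1, epsilon falls to roughly 0.37 after 1000 steps, so the agent still explores a lot early on and becomes almost fully greedy after several thousand steps.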

Let's implement it.

In [None]:
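import random

def epsilon_greedy_policy(state):
    # use the global epsilon, which train() decays after every step,
    # so that the agent explores less as its value estimates improve
    if random.random() < epsilon:
        return random_policy(state)   # explore
    else:
        return greedy_policy(state)   # exploit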

# How can we find an optimal policy?

An optimal policy would take the best possible action in any state. Because of this, the optimal value function would give the maximum possible values for any state.

In the first line below, the maximum state-value of a state is equivalent to the maximum state-action value when taking the best action in that state. Following this, we can derive a recursive definition of the optimal value function.

In the last step, we even remove the policy from the equation entirely! This means that value iteration never needs to explicitly represent a policy in terms of a function that takes in a state and returns a distribution over actions.
Instead, value iteration uses a **model**, $p(s', r | s, a)$, to look one step ahead and take the action, $a$, that is most likely to lead to the next state with the highest state value.

A **model** defines how the state changes. It is also known as the transition dynamics of the environment. In our case the model is really simple: we are certain that taking the action to move right will move our agent one space to the right as long as there are no obstacles. There is no randomness in our environment (e.g. no wind that might push us into a different cell when we try to move right). That is, our environment is deterministic, not stochastic.

![](./images/bellman_op_v.png)

![](./images/backup_v.png)

![](./images/update_rule_v.png)
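
As a sketch of how this kind of backup behaves with a deterministic model, the code below repeatedly applies it to a made-up four-cell corridor until the values stop changing. The corridor, its reward of +1 for stepping into the last cell, and the names `corridor_step` and `value_iteration` are assumptions for illustration only. The sketch follows the common convention that a state's value does not include the reward for entering it, so the terminal cell keeps a value of 0.

```python
def corridor_step(state, action, n_cells=4):
    """Deterministic model: action 0 moves left, action 1 moves right."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), n_cells - 1)
    reward = 1.0 if next_state == n_cells - 1 else 0.0
    return next_state, reward

def value_iteration(n_cells=4, discount_factor=0.5, tolerance=1e-9):
    values = [0.0] * n_cells  # the last cell is terminal, so its value stays 0
    while True:
        biggest_change = 0.0
        for state in range(n_cells - 1):  # skip the terminal cell
            # optimality backup: best one-step reward plus discounted next value
            backups = []
            for action in (0, 1):
                next_state, reward = corridor_step(state, action, n_cells)
                backups.append(reward + discount_factor * values[next_state])
            new_value = max(backups)
            biggest_change = max(biggest_change, abs(new_value - values[state]))
            values[state] = new_value
        if biggest_change < tolerance:
            return values

print(value_iteration())  # [0.25, 0.5, 1.0, 0.0]
```

The values halve with each extra step away from the goal, which is the discount factor of 0.5 at work.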

In [46]:
# implement value iteration to find optimal value function
def update_value_table(episode_mem, value_table, discount_factor=0.95, alpha=0.5):
    return value_table, v_delta

# Final Code solution with visualisation

In [2]:
import time
import pickle
import numpy as np

import gym
from griddy_env import GriddyEnv

In [3]:
def key_from_state(state):
    key = pickle.dumps(state)
    if key not in value_table:
        value_table[key]=0 #initialize
    return key

In [4]:
def update_value_table(episode_mem, value_table, discount_factor=0.95, learning_rate=0.1):
    all_diffs=[]
    for i, mem in reversed(list(enumerate(episode_mem))): #start from terminal state
        if i==len(episode_mem)-1: #if terminal state, G=reward
            calculated_new_v = episode_mem[i]['reward']
        else:
            calculated_new_v = mem['reward']+(discount_factor*np.max(greedy_policy(mem['new_observation'], return_action_vals=True)))
        key = key_from_state(mem['new_observation'])
        diff = abs(value_table[key]-calculated_new_v)
        all_diffs.append(diff)
        value_table[key] =  value_table[key] + learning_rate*(calculated_new_v-value_table[key])
    return value_table, np.mean(all_diffs)

In [5]:
#This is the transition model aka our model of the environment. Given state and action it predicts next state
def transition(state, action):
    state = np.copy(state)
    agent_pos = list(zip(*np.where(state[2] == 1)))[0]
    new_agent_pos = np.array(agent_pos)
    if action==0:
        new_agent_pos[1]-=1
    elif action==1:
        new_agent_pos[1]+=1
    elif action==2:
        new_agent_pos[0]-=1
    elif action==3:
        new_agent_pos[0]+=1    
    new_agent_pos = np.clip(new_agent_pos, 0, 3)

    state[2, agent_pos[0], agent_pos[1]] = 0 #moved from this position so it is empty
    state[2, new_agent_pos[0], new_agent_pos[1]] = 1 #moved to this position
    return state

In [6]:
def greedy_policy(state, return_action_vals=False):
    action_values=[]
    for test_action in range(4): #for each action
        new_state = transition(state, test_action)
        key = key_from_state(new_state)
        action_values.append(value_table[key])
    policy_action = np.argmax(action_values)
    if return_action_vals: return action_values
    return policy_action

In [7]:
def epsilon_greedy_policy(state):
    action = env.action_space.sample() if np.random.rand()<epsilon else greedy_policy(state)
    return action

In [8]:
def random_policy(state):
    return np.random.randint(0, 4)

In [16]:
def value_table_viz(value_table):
    values = np.zeros((4, 4))
    base_st = np.zeros((3, 4, 4), dtype=np.int64)
    base_st[0, 3, 3]=1
    for i in range(4):
        for j in range(4):
            test_st = np.copy(base_st)
            test_st[2, i, j] = 1
            key = pickle.dumps(test_st)
            if key in value_table:
                val = value_table[key]
            else:
                val=0
            values[i, j] = val
    return values

In [17]:
def visualise_agent(policy, value_table=None, n=5):
    try:
        for trial_i in range(n):
            observation = env.reset()
            done=False
            t=0
            while not done:
                if value_table: env.render(value_table_viz(value_table))
                else: env.render()
                policy_action = policy(observation)
                observation, reward, done, info = env.step(policy_action)
                time.sleep(0.5)
                t+=1
            env.render()
            time.sleep(1.5)
            print("Episode {} finished after {} timesteps".format(trial_i, t))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [18]:
env = GriddyEnv(4, 4)
epsilon = 1
i_episode=0
discount_factor=0.9
learning_rate=0.3
value_table = {}

In [22]:
def train(policy, n_episodes=100):
    global epsilon
    global value_table
    global i_episode
    try:
        for _ in range(n_episodes):
            observation = env.reset()
            episode_mem = []
            done=False
            t=0
            while not done:
                env.render()
                time.sleep(0.05)
                action = policy(observation)
                new_observation, reward, done, info = env.step(action)
                episode_mem.append({'observation':observation,
                                    'action':action,
                                    'reward':reward,
                                    'new_observation':new_observation,
                                    'done':done})
                observation=new_observation
                t+=1
                epsilon*=0.999
            value_table, v_delta = update_value_table(episode_mem, value_table, discount_factor, learning_rate)
            i_episode+=1
            print("Episode {} finished after {} timesteps. Epsilon={}. V_Delta={}".format(i_episode, t, epsilon, v_delta))#, end='\r')
            #print(value_table_viz(value_table))
            #print()
            env.render(value_table_viz(value_table))
            time.sleep(2)
        env.close()
    except KeyboardInterrupt:
        env.close()

In [None]:
train(epsilon_greedy_policy)

In [24]:
print('Current estimates of value')
value_table_viz(value_table)

Current estimates of value


array([[0.51701049, 0.57998881, 0.62474069, 0.67338781],
       [0.58888181, 0.6550619 , 0.72840154, 0.80980877],
       [0.65598463, 0.72898751, 0.80999937, 0.8999999 ],
       [0.70749033, 0.80901724, 0.89982683, 0.99999997]])

In [None]:
visualise_agent(greedy_policy, value_table)

## End of notebook!

Next you might want to check out:
- [Policy Gradients]()
