# Intro to RL 

## Problem setup: We want to make an AI that is able to complete a simple video game.

### What is the game we are going to start with?
In this game, we want our agent (character) to move through the 2D world and reach the goal. At each timestep our agent can to either move up, down, left or right. The agent cannot move into obstacles, and when it reaches the goal, the game ends.

# insert video of game being played

We are going to use an environment that we built, called Griddy, that works in exactly the same way as other environments provided as part of openAI gym. 


The main ideas are:
<ul>
<li>we need to create our environment</li>
<li>we need to initialise it by calling `env.reset()`</li>
<li>we can increment the simulation by one timestep by calling `env.step(action)`</li>
</ul>

Check out [openAI gym's docs](http://gym.openai.com/docs/) to see how the environments work in general and in more detail.

Let's set up our simulation to train our agent in.


In [42]:
# IMPORTS
import gym
from griddy_env import GriddyEnvOneHot
import numpy as np
import pickle
from copy import deepcopy
import time
import random

# SET UP THE ENVIRONMENT
env = GriddyEnvOneHot()    # create the environment

## Once we have an agent in the game, what do we do?

Our agent has no idea of how to win the game. It simply observes states that change based on it's actions and receives a reward signal for doing so.
So the agent has to learn about the game for itself. Just like a baby learns to interact with it's world by playing with it, our agent has to try random actions to figure out when and why it receives negative or positive rewards.

A function which tells the agent what to do in a given state is called a **policy**

We need our agent to understand what actions might lead it to achieving high rewards, but it doesn't know anything about how to complete the game yet. So let's set up our environment and implement a random policy that takes in a state and returns a random action for the agent to take.

# picture maybe

In [43]:
# IMPLEMENT A RANDOM POLICY
def random_policy(state):
    return random.randint(0, 3)

In [44]:
# WRITE A LOOP FOR THE AGENT TO TRY RANDOM ACTIONS
num_episodes = 3

try:
    for episode_idx in range(num_episodes):
        print('Episode', episode_idx)
        observation = env.reset()
        done = False
        episode_mem = []
        t = 0
        while not done:
            env.render()
            action = random_policy(observation)
            observation, _, done, info = env.step(action)
            t += 1
            time.sleep(0.1)
        env.render()
        #time.sleep(0.5)
        print(f"Episode finished after {t + 1} timesteps.")
    env.close()
except KeyboardInterrupt:
    env.close()

Episode 0
Episode finished after 17 timesteps.
Episode 1



## How do we know if we are doing well?

When our agent takes this action and moves into a new state, the environment returns it a reward. The reward when it reaches the goal is +1, and 0 everywhere else. The reward that the agent receives at any point can be considered as what it feels in that moment - like pain or pleasure.

**However**, the reward doesn't tell the agent how good that move actually was, only whether it sensed anything, and how good or bad that sensation was.

E.g.
- Our agent might not receive any reward for stepping toward the goal, even though this might be a good move.
- A robot might receive a negative reward as it's battery depletes, but still make good progress towards its goal.
- A chess playing agent might receive a positive reward for taking an opponent's piece, but make a bad move in doing so by exposing its king to an attack eventually causing it to lose the game.

What we really want to know is not the instantaneous reward, but "How good is the position I'm in right now?". The measure of this, is called the *value* of the state. If we had a way to estimate this, then we could look ahead to the state that each action would take us to and take the action which results in us landing in the state with best value. A function that predicts this value is called a **state-value function**.

# diagram of following value function

### So how good *is* each state?
Intuitively, we want our agents to receive as much reward as possible.
In general, the goal of reinforcement learning is to maximise this future reward. 

# goal of RL equation
![](./images/objective.png)

The value of a state is the total reward that we can expect from this state onwards. 
This future reward is also known as the return.

# return
![](./images/return.png)

To determine what these values are, we can have our agent play one run-through of the game and then *back-up* through that trajectory, step-by-step, looking forward at what the future reward was from that point.

# backup diagram
![](./images/backup.png)

#### Is getting a reward now as good as getting the same reward later?
- What if the reward is removed from the game in the next timestep?
- Would you rather be rich now or later?
- What if a larger reward is introduced and you don't have enough energy to reach both?
- What about inflation?

It's better to get rewards sooner.

![](./images/decay.png)

We can encode this into our goal by using a **discount factor**, $\gamma \in [0, 1]$ ($\gamma$ between 0 and 1). This makes the goal become:

![](./images/discounted_obj.png)


#### How good is the terminal state?
In this initial version of the game we get +1 reward when we reach the goal. So +1 is the value of that state!
#### How good is the last state before the game ends?
Well we don't get a reward for this state. But we know that the action we took led to the terminal state where we got a reward of +1. So discounting that future reward gives us an estimate of the value of this state at $\gamma$.

This process is recursive, and we can continue to apply it to each preceding state in the trajectory that we took until we arrive back at the initial state, having estimated values for every state that we encountered. Some states may not have been visited and as such won't have had their values updated yet.

### The backup algorithm for value iteration

# algo
![](./images/backup_algo.png)

Value iteration is a type of **value based** method. Notice that to learn an optimal policy, we never have to represent it explicitly. There is no function which represents the policy.



<div class="body">
<div class="title">
Implement the backup algorithm 🔥
</div>

Now our agent is exploring the environment, let's implement the backup algorithm to update our estimates of the value of each state.
</div>


In [11]:
# INITIALISE THE ENVIRONMENT
### SAME AS PREVIOUS CELL

# LOOP TO RUN EPISODES TO EXPLORE THE ENVIRONMENT
### SAME AS PREVIOUS CELL

    # FOR EACH TIMESTEP, SELECT AN ACTION USING OUR RANDOM POLICY
    ### SAME AS PREVIOUS CELL
    
    # FOR EACH EPISODE RUN THE BACKUP ALGORITHM TO UPDATE THE VALUES
    ### IMPLEMENT

<div class="body">
<div class="title">
How can we use the values that we know to perform well?
</div>

Now that our agent is capable of exploring and learning about it's environment, we need to make it take advantage of what it knows so that it can perform well.
Our random policy has helped us to estimate the values of each state, which means we have some idea of how good each state is. Think about how we could use this knowledge to make our agent perform well before reading the next paragraphs.

In this simple version of the game, we know exactly what actions will lead us to what states. That means we have a perfect **model** of the environment. A model is a function that tells us how the state will change when we take certain actions. E.g. we know that if the agent tries to move up into an empty space, then that's where it will end up.

Because we know exactly what states we can end up in by taking an action, we can just look at the value of the states and choose the action which leads us to the state with the greatest value. So we just move into the best state that we can reach at any point.
A policy that always takes the action that it expects to end up in the best, currently reachable state is called a **greedy policy**.
</div>


<div class="body">
<div class="title">
Why not just act greedily all the time?
</div>

If we act greedily all the time then we will move into the state with the best value. But remember that these values are only estimates based on our agent's experience with the game, which means that they might not be correct. So if we want to make sure that our agent will do well by always choosing the next action greedily, we need to make sure that it has good estimates for the values of those states. This brings us to a core challenge in reinforcement learning: **the exploration vs exploitation dilemma**. Our agent can either exploit what it knows by using it's current knowledge to choose the best action, or it can explore more and improve it's knowledge perhaps learning that some actions are even worse than what it does currently.

# An epsilon-greedy policy
We can combine our random policy and our greedy policy to make an improved policy that both explores its environment and exploits its current knowledge. An $\epsilon$-greedy (epsilon-greedy) policy is one which exploits what it knows most of the time, but with probability $\epsilon$ will instead select a random action to try.

## Do we need to keep exploring once we are confident in the values of states?

As our agent explores more, it becomes more confident in predicting how valuable any state is. Once it knows a lot, it should start to explore less and exploit what it knows more. That means that we should decrease epsilon over time.

Let's implement it

</div>

In [None]:
def epsilon_greedy_policy(state):
    epsilon = 0.05
    if random.random() < epsilon:
        return random_policy(state)
    else:
        return greedy_policy(state)

num_episodes = 100


# INITIALISE THE ENVIRONMENT
### COPY FROM PREVIOUS CODE CELL

# LOOP TO RUN EPISODES TO EXPLORE THE ENVIRONMENT
### COPY FROM PREVIOUS CODE CELL

    # FOR EACH TIMESTEP, SELECT AN ACTION USING OUR EPSILON-GREEDY POLICY
    ### IMPLEMENT

<div class="body">
<div class="title">
What if we don't have a model?
</div>
How big is the input space to the action-state value function?
</div>

<div class="body">
<div class="title">End of notebook!
</div>

Next you might want to check out:
- [Policy Gradients]()
</div>