---

# OpenAI and Reinforcement Learning

In this notebook we will:

1. Introduce programming with the [OpenAI Gym environment](https://gym.openai.com)

2. Introduce Reinforcement Learning with the Q-Learning Algorithm

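To follow along, we first need to install gym and, in my case, also cmake and the gym Atari extras. This can be done by uncommenting and running the following code cell. Note that the cells in this notebook use the classic gym API, in which ``env.step`` returns four values; from gym 0.26 onward it returns five, so an older release is needed to run them unchanged.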


---

In [105]:
#!pip install cmake 'gym[atari]' scipy
#!pip install gym

---


## The Gym Environment

The core gym interface is ``env``, the unified interface shared by every environment. Its basic methods are:

1. ``env.reset()``: Resets the environment and returns a random initial state.

2. ``env.step(action)``: Advances the environment by one timestep and returns:

 - observation (object): This is the current state of the environment and is an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.

 - reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.

 - done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)

 - info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.

3. ``env.render()``: Renders one frame of the environment (helpful in visualizing the environment)
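
To make the three methods concrete without depending on gym, here is a minimal sketch of a toy environment that follows the same calling convention. The class ``CorridorEnv``, its reward values, and the variable ``toy_env`` are hypothetical and exist only for illustration; they are not part of gym.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, finish at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Return the initial observation (fixed here, not random).
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, anything else moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 10.0 if done else -1.0
        info = {}  # no diagnostics in this toy example
        return self.position, reward, done, info

    def render(self):
        # Draw the corridor as text, marking the agent with 'A'.
        return "".join("A" if i == self.position else "." for i in range(self.length))


toy_env = CorridorEnv()
observation = toy_env.reset()
done = False
total_reward = 0.0
while not done:
    observation, reward, done, info = toy_env.step(1)  # always move right
    total_reward += reward

print(toy_env.render(), total_reward)
```

The loop at the bottom has the same shape as the interaction loops used with Taxi-v3 later in this notebook: reset once, then call ``step`` until ``done`` is ``True``.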



We can best illustrate these notions with an example such as the Taxi-v3 environment. 


---

In [106]:
# Import the gym environment 
import gym

# Instantiate the taxi environment 
env = gym.make("Taxi-v3").env

# Reset the environment 
env.reset()

# Show the current frame of the environment 
env.render()

+---------+
|R: |[43m [0m: :[35mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+



---

## The Taxi-v3 environment

There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop them off at another. We receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.

* Observations: There are 500 discrete states, since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations (25 × 5 × 4 = 500).
   
* Passenger locations:
 - 0: R(ed)
 - 1: G(reen)
 - 2: Y(ellow)
 - 3: B(lue)
 - 4: in taxi
  
* Destinations:
 - 0: R(ed)
 - 1: G(reen)
 - 2: Y(ellow)
 - 3: B(lue)
   
* Actions: There are 6 discrete deterministic actions:
 - 0: move south
 - 1: move north
 - 2: move east
 - 3: move west
 - 4: pickup passenger
 - 5: drop off passenger
 
* Rewards: There is a default per-step reward of -1, except for delivering the passenger, which is +20, or executing "pickup" and "drop-off" actions illegally, which is -10.

* Rendering:
 - blue: passenger
 - magenta: destination
 - yellow: empty taxi
 - green: full taxi
 - other letters (R, G, Y and B): locations for passengers and destinations
   
* State: each state is represented by the tuple (taxi_row, taxi_col, passenger_location, destination), which is encoded as a single integer between 0 and 499
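
The mapping between the tuple and the single integer observation can be sketched as a mixed-radix encoding. The functions below are a standalone re-implementation written for illustration, assuming the layout that reproduces the indices printed in this notebook (for example, the tuple (0, 2, 2, 0) corresponds to state 48); they are not gym's own code.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    # Mixed-radix packing: 5 rows x 5 columns x 5 passenger locations x 4 destinations.
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger_location
    index = index * 4 + destination
    return index


def decode(index):
    # Undo the packing, peeling off the last component first.
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


print(encode(0, 2, 2, 0))  # 48
print(decode(149))         # (1, 2, 2, 1)
print(encode(4, 4, 4, 3))  # 499, the largest of the 500 states
```

Decoding an observation this way is handy when reading the integer states printed by ``env.step``.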


---

In [107]:
import numpy as np

# Choose a random action
action = np.random.randint(6)

# Take a step
observation, reward, done, info = env.step(action)

# Show the current frame of the environment 
env.render()

print(f"action: {action} \n")

print(f"observation: {observation}\n")

print(f"reward: {reward}\n")

print(f"done: {done} \n")

+---------+
|R: | : :[35mG[0m|
| : |[43m [0m: : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (South)
action: 0 

observation: 149

reward: -1

done: False 



In [108]:
# Choose a random action
action = np.random.randint(6)

# Take a step
observation, reward, done, info = env.step(action)

# Show the current frame of the environment 
env.render()

print(f"action: {action} \n")

print(f"observation: {observation}\n")

print(f"reward: {reward}\n")

print(f"done: {done} \n")

+---------+
|R: |[43m [0m: :[35mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (North)
action: 1 

observation: 49

reward: -1

done: False 



In [109]:
# Choose a random action
action = np.random.randint(6)

# Take a step
observation, reward, done, info = env.step(action)

# Show the current frame of the environment 
env.render()

print(f"action: {action} \n")

print(f"observation: {observation}\n")

print(f"reward: {reward}\n")

print(f"done: {done} \n")

+---------+
|R: |[43m [0m: :[35mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (Pickup)
action: 4 

observation: 49

reward: -10

done: False 



---

We can also manually set the state of the current ``env`` by first using the ``env.encode`` method and then setting the current state (``env.s``) to be this new state. The following code cell illustrates this option. 

---

In [110]:
# Use the env.encode method 
state = env.encode(0, 2, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print(f"State: {state}")

# Set the current state of the environment 
env.s = state

# Show the current frame of the environment 
env.render()

State: 48
+---------+
|[35mR[0m: |[43m [0m: :G|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (Pickup)


---

In [111]:
env.reset()

epochs = 0
penalties = 0
reward = 0
max_iter = 150

frames = [] # for animation

done = False

while not done and epochs < max_iter:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print(f"Timesteps taken: {epochs} \n")
print(f"Penalties incurred: {penalties} \n")

Timesteps taken: 150 

Penalties incurred: 44 



In [112]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        #print(frame['frame'].getvalue())
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        

print_frames(frames)

+---------+
|[34;1mR[0m:[43m [0m| : :G|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (Pickup)

Timestep: 150
State: 22
Action: 4
Reward: -10


---

In [113]:
env.P[448]

{0: [(1.0, 448, -1, False)],
 1: [(1.0, 348, -1, False)],
 2: [(1.0, 448, -1, False)],
 3: [(1.0, 428, -1, False)],
 4: [(1.0, 448, -10, False)],
 5: [(1.0, 448, -10, False)]}

---
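
The table printed above is the transition model for state 448: ``env.P[state]`` maps each action to a list of ``(probability, next_state, reward, done)`` tuples. The sketch below copies that output into a plain dictionary and prints it in readable form. The ``decode_state`` helper is a hypothetical re-implementation that assumes the mixed-radix state layout consistent with the indices shown in this notebook; it is not gym's own code.

```python
# Copy of the env.P[448] output shown above.
transitions = {
    0: [(1.0, 448, -1, False)],
    1: [(1.0, 348, -1, False)],
    2: [(1.0, 448, -1, False)],
    3: [(1.0, 428, -1, False)],
    4: [(1.0, 448, -10, False)],
    5: [(1.0, 448, -10, False)],
}
action_names = ["south", "north", "east", "west", "pickup", "dropoff"]


def decode_state(index):
    # Assumed layout: ((row * 5 + col) * 5 + passenger) * 4 + destination.
    index, destination = divmod(index, 4)
    index, passenger = divmod(index, 5)
    row, col = divmod(index, 5)
    return row, col, passenger, destination


for action, outcomes in transitions.items():
    for probability, next_state, reward, done in outcomes:
        row, col, _, _ = decode_state(next_state)
        print(f"{action_names[action]:>7}: p={probability}, taxi at ({row}, {col}), "
              f"reward={reward}, done={done}")
```

Read this way, state 448 has the taxi at row 4, column 2: moving south or east runs into a wall and leaves the state unchanged, while pickup and drop-off are illegal there and cost 10 points.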

# Introduction to Reinforcement Learning  

Reinforcement learning is learning what to do, that is, how to map situations to actions so as to maximize a numerical reward signal. The learner is not told which actions to take, but must instead discover which actions yield the most reward by trying them. This type of problem can be thought of as a specific instance of a Markov decision process (MDP). The learner and decision maker is called the **agent**. The thing that the agent interacts with is called the **environment**, which can be thought of as everything outside the agent. 

The agent and environment interact in a loop: the agent observes some portion of the environment and takes an action, after which the environment responds and presents a new situation to the agent. More specifically, the agent and environment interact at each of a sequence of discrete time steps $t = 0, 1, \dots, T$, where $T$ is the final time step, at which a terminal state is reached. At each time step $t$, the agent receives some representation of the environment's **state**, $S_t \in \mathcal{S}$, and on that basis selects an **action**, $A_t \in \mathcal{A}$. Here, $\mathcal{S}$ is the set of all possible states and $\mathcal{A}$ is the set of all valid actions. One time step later, and in part as a consequence of action $A_t$, the agent receives a numerical **reward**, $R_{t+1} \in \mathbb{R}$, and finds itself in a new state, $S_{t+1}\in \mathcal{S}$. The MDP and agent together give rise to a sequence, or **trajectory**, typically denoted by $\tau$:

$$
\tau = S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots 
$$


The goal of the agent is to maximize the cumulative reward it collects along a trajectory starting from state $S_t$. This quantity is called the **return** and is given by:

$$
G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots, 
$$

where $\gamma \in [0, 1]$ is the **discount rate**. The discount rate determines the present value of the future rewards. This formula can be written recursively as:

$$
G_{t} = R_{t+1} + \gamma G_{t+1}. 
$$
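
As a quick numerical check, the sketch below computes the return of a short, made-up reward sequence in both ways: as the direct discounted sum and through the recursion. The reward values and the discount rate of 0.5 are illustrative assumptions, chosen so the arithmetic is easy to follow by hand.

```python
def discounted_return(rewards, gamma):
    # Direct sum: G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    return sum(gamma ** k * reward for k, reward in enumerate(rewards))


def discounted_return_recursive(rewards, gamma):
    # Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, with G = 0 after the last reward.
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)


example_rewards = [-1, -1, -1, 20]  # three ordinary steps, then a successful drop-off

direct = discounted_return(example_rewards, 0.5)
recursive = discounted_return_recursive(example_rewards, 0.5)
print(direct, recursive)  # both equal 0.75
```

By hand: $-1 - 0.5 - 0.25 + 0.125 \cdot 20 = 0.75$, which matches both functions.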



## Policies and the Q-Function

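A **policy** $\pi$ is a mapping from states to probabilities of selecting each possible action. The **action-value function** (or q-function) of a policy $\pi$ gives the expected return when starting in state $s$, taking action $a$, and following $\pi$ thereafter:

$$
q_{\pi}(s, a) = \mathbb{E}_{\pi}\Big[ G_t \mid S_t = s, A_t = a\Big] = \mathbb{E}_{\pi}\Big[ R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\Big]
$$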
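
One common way to turn q-value estimates into a policy is the epsilon-greedy rule, which is also what the training loop in this notebook samples from: with probability ``epsilon`` pick an action uniformly at random, otherwise pick the action with the largest estimate. The sketch below writes out the resulting action probabilities for a single state. The function name and the q-values are made up for illustration.

```python
def epsilon_greedy_probabilities(q_values, epsilon):
    # With probability epsilon choose uniformly (the greedy action included);
    # otherwise choose the action with the largest estimated q-value.
    n_actions = len(q_values)
    greedy_action = max(range(n_actions), key=lambda a: q_values[a])
    probabilities = [epsilon / n_actions] * n_actions
    probabilities[greedy_action] += 1.0 - epsilon
    return probabilities


example_q_values = [-2.4, -2.5, -2.3, -2.6, -11.0, -11.0]  # made-up estimates, six actions
action_probabilities = epsilon_greedy_probabilities(example_q_values, epsilon=0.1)

print(action_probabilities)
print(sum(action_probabilities))
```

The greedy action (index 2 here) receives most of the probability mass, while every other action keeps a small chance of being explored.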

Let $Q(S_t, A_t)$ denote the agent's current estimate of the q-value of the state-action pair $(S_t, A_t)$. Through experience, the agent can learn how good this estimate is (just as we compare predicted labels to true labels in supervised learning) and update $Q(S_t, A_t)$ after observing the reward that follows. Q-learning uses the following update rule, where $\alpha$ is the learning rate, $R$ is the reward received, and $S'$ is the next state:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \Big[R + \gamma \max_{a}Q(S', a) - Q(S, A) \Big]
$$
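
To see the rule in action, here is a minimal sketch of a single update with made-up numbers. The helper ``q_update`` is hypothetical and only mirrors the formula above. The second part checks that the rule is equivalent to a weighted average of the old value and the target, which is the form used in the training cell of this notebook.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma):
    # Target: immediate reward plus the discounted best value of the next state.
    target = reward + gamma * max(next_q_values)
    # Move the old estimate a fraction alpha of the way toward the target.
    return q_sa + alpha * (target - q_sa)


# An ordinary -1 step from a state whose estimates are all still 0.
new_value = q_update(q_sa=0.0, reward=-1, next_q_values=[0.0, 0.0, 0.0],
                     alpha=0.1, gamma=0.6)
print(new_value)  # -0.1

# Same rule as a weighted average: (1 - alpha) * old + alpha * target.
old_value, alpha, gamma = -2.0, 0.1, 0.6
target = -1 + gamma * (-1.5)
incremental_form = old_value + alpha * (target - old_value)
weighted_form = (1 - alpha) * old_value + alpha * target
print(incremental_form, weighted_form)
```

With a learning rate of 0.1, each update moves the estimate only 10% of the way toward the target, so many visits to a state-action pair are needed before its q-value settles.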



---

In [121]:
import numpy as np


# Initialize Q-values as a Q-table of 0's 
q_table = np.zeros([env.observation_space.n, env.action_space.n])

print(f"The shape of the Q-table is: {q_table.shape} \n")

The shape of the Q-table is: (500, 6) 



In [136]:
%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []
frames = []
for i in range(1, 100_001):  # 100,000 training episodes
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        # Q-table update Rule 
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        
        state = next_state
        epochs += 1
        
    
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")


 
state = env.encode(0, 2, 1, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)
print(q_table[state])
env.s = state
print(np.argmax(q_table[state]))  # index of the best action in this state
env.render()

Episode: 100000
Training finished.

State: 44
[ -2.47060537  -2.45101778  -2.41837066  -2.45101987 -11.45012154
 -11.45057615]
2
+---------+
|[35mR[0m: |[43m [0m: :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Wall time: 32 s


In [137]:
state = env.encode(0, 2, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()

State: 48
+---------+
|[35mR[0m: |[43m [0m: :G|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (Dropoff)


In [138]:
q_table[48]

array([ -2.41837066,  -2.43829334,  -2.44519974,  -2.43835261,
       -10.88028614, -10.09929611])

In [143]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 2
frames = []
for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        # Put each rendered frame into dict for animation
        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward
                            }
            )
        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 2 episodes:
Average timesteps per episode: 10.0
Average penalties per episode: 0.0


In [144]:
print_frames(frames)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35m[34;1m[43mB[0m[0m[0m: |
+---------+
  (Dropoff)

Timestep: 20
State: 475
Action: 5
Reward: 20
