# Reinforcement Learning Code, Part 2: Agents that act


Here we implement an RL agent that interacts with a simple environment. In this case, the agent is a simulated RL mouse, and the environment is a 2D maze.

## Defining the environment

To define the environment, we need to define the set of possible states $S = \{s_1, s_2, \dots, s_N\}$, the state transition function $P_{s,s'}^{a}$, and the reward transition function $R_{s,s'}^{a}$.

In our case, the environment consists of just a 4x4 grid. Our hypothetical agent perceives only one cell at any time: the cell where it is. Therefore, our states correspond to the sixteen positions of the maze, which we can indicate with the coordinates $(0, 0), (0, 1), \dots, (0, 3), (1, 0), \dots, (3, 3)$.

An environment is characterized by two functions:

* The state transition probability function $P(s, a, s')$, which is the probability of transitioning to state $s'$ when action $a$ is applied in state $s$; and

* The reward transition probability function $R(s, a, s')$, which is the probability of receiving a reward $r$ when action $a$ is applied in state $s$ and the environment moves to state $s'$.

In our simple case, both $P(s,a,s')$ and $R(s,a,s')$ reduce to deterministic functions.
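
To make the deterministic case concrete, here is a minimal sketch, assuming a 4x4 grid in which a move off the edge leaves the agent in place. The names ```next_state``` and ```P``` are hypothetical helpers introduced only for illustration: with deterministic dynamics, $P(s, a, s')$ is 1 for exactly one successor state and 0 for all the others.

```python
from itertools import product

ACTIONS = ("up", "down", "left", "right")
SHAPE = (4, 4)  # Assumed grid size (rows, columns)

# Row/column offsets for each action
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def next_state(s, a):
    """Hypothetical helper: the single state reached from s under action a."""
    row, col = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    if 0 <= row < SHAPE[0] and 0 <= col < SHAPE[1]:
        return (row, col)
    return s  # Moving off the grid leaves the agent where it is

def P(s, a, s_prime):
    """Deterministic transition probability: 1.0 for the one successor, else 0.0."""
    return 1.0 if next_state(s, a) == s_prime else 0.0

states = list(product(range(SHAPE[0]), range(SHAPE[1])))
# For any state and action, the probabilities over all successors sum to 1
total = sum(P((0, 0), "down", s2) for s2 in states)
```

The reward function can be simplified in the same way: instead of a distribution over rewards, each transition yields a single fixed value.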


In [28]:
import random
import numpy as np
from copy import copy
import matplotlib.pyplot as plt 

class Maze():
    """A maze environment"""

    ACTIONS = ("up", "down", "left", "right") # Tuple of available actions
    INITIAL_STATE = (0, 0) # Always starts at the top-left corner
    
    def __init__(self, fname = "grid.txt"):
        """Inits a maze by loading the grid file"""
        self.grid = np.loadtxt(fname)
        self.state = self.INITIAL_STATE
        self.end = False


    def state_transition(self, state, action):
        """Returns the state reached after an action (unchanged if the move would leave the grid)"""
        s = state
        new_s = copy(s)
        
        if action in self.ACTIONS:
            if action == "up":
                if s[0] > 0:
                    new_s = (s[0] - 1, s[1])
            
            elif action == "left":
                if s[1] > 0:
                    new_s = (s[0], s[1] - 1)
            
            elif action == "down":
                if s[0] < (self.grid.shape[0] - 1):
                    new_s = (s[0] + 1, s[1])

            else:
                if s[1] < (self.grid.shape[1] -1):
                    new_s = (s[0], s[1] + 1)
                    
        return new_s
                    
    
    def reward_transition(self, state, action, new_state):
        """Returns -1 if the agent did not move, else the grid value of the new cell"""
        if state == new_state:
            return -1
        else:
            return self.grid[new_state[0], new_state[1]]
        
    
    # Quick way to combine State transitions and Reward transitions 
    def transition(self, action):
        """Changes the state following an action"""
        s = self.state
        new_s = self.state_transition(s, action)
        new_r = self.reward_transition(s, action, new_s)
        
        self.state = new_s
        return (new_s, new_r) # Returns s_{t+1}, r_{t+1}

    def print_state(self):
        "Prints a text representation of the maze (with the agent position)"
        bar = "-" * ( 4 * self.grid.shape[1] + 1)
        for i in range(self.grid.shape[0]):
            row = "|"
            for j in range(self.grid.shape[1]):
                cell = " "
                if i == self.state[0] and j == self.state[1]:
                    cell = "*"
                row += (" %s |" % cell)
            print(bar)
            print(row)
        print(bar)
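
The ```Maze``` class loads its layout from ```grid.txt```, whose contents are not shown here. The following is a minimal sketch of how such a file could be created, assuming a 4x4 grid of rewards with a single goal cell worth 10. The position of the goal and the zero reward everywhere else are assumptions made for illustration, not the original layout.

```python
import os
import tempfile
import numpy as np

# Hypothetical layout: every cell gives a reward of 0,
# except for a goal in the bottom-right corner (an assumption).
grid = np.zeros((4, 4))
grid[3, 3] = 10.0

# Write the grid as whitespace-separated text, the format np.loadtxt reads back
path = os.path.join(tempfile.mkdtemp(), "grid.txt")
np.savetxt(path, grid, fmt="%.1f")

# Reading the file back gives a 4x4 float array
loaded = np.loadtxt(path)
```

Passing the resulting path as the ```fname``` argument of ```Maze``` would load this layout instead of the default file.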

The maze is simple but functional. It is easy to create a maze, check the available actions, apply a few actions, and so on. Let's check that our environment works.

In [29]:
m = Maze()
m.print_state()
state_after_bad_action = m.state_transition(m.state, "up")  # Illegal action: bounces!
state_after_good_action = m.state_transition(m.state, "down") # Legal action: goes down

print("State after illegal action: %s" % (state_after_bad_action,))
print("Reward after illegal action: %s" % (m.reward_transition(m.state, "up", state_after_bad_action)))

print("State after legal action: %s" % (state_after_good_action,))
print("Reward after legal action: %s" % (m.reward_transition(m.state, "down", state_after_good_action)))



-----------------
| * |   |   |   |
-----------------
|   |   |   |   |
-----------------
|   |   |   |   |
-----------------
|   |   |   |   |
-----------------
State after illegal action: (0, 0)
Reward after illegal action: -1
State after legal action: (1, 0)
Reward after legal action: 0.0


And, finally, we can easily navigate in our virtual maze by executing the appropriate actions:

In [30]:
m.transition("down")

((1, 0), 0.0)

Note how the ```Maze``` object returns two values at the end of each action execution: the new state $s_{t+1}$ and the associated reward $r_{t+1}$. In this case, where the ```down``` action was executed with the original maze layout of the ```grid.txt``` file, the two values are $s_{t+1} = $ ```(1, 0)``` and $r_{t+1} = $ ```0.0```. We can also execute more actions, and see what happens after a few movements:

In [31]:
m.transition("down")
m.transition("down")
m.transition("right")
m.print_state()


-----------------
|   |   |   |   |
-----------------
|   |   |   |   |
-----------------
|   |   |   |   |
-----------------
|   | * |   |   |
-----------------


Now we can create our own agents. As an example, we will create a $Q$-learning agent that interacts with the ```Maze``` world.

In [40]:

import random
import numpy as np

class Agent():
    def __init__(self, actions=Maze.ACTIONS, epsilon=0.1, alpha=0.2, gamma=0.9):
        """Creates a Q-agent"""
        self.Q = {}    ## Initial dictionary of (s, a) pairs. At the beginning, it's empty.

        self.epsilon = epsilon     # Epsilon for the epsilon-greedy policy
        self.alpha = alpha         # Learning rate
        self.gamma = gamma         # Temporal discounting
        self.actions = actions     # Set of possible actions (provide those of Maze.ACTIONS)

    def getQ(self, state, action):
        """Returns the Q-value associated to a given state and action (or 0.0 if unknown)"""
        return self.Q.get((state, action), 0.0)
        
    def updateQ(self, state, action, reward, value):
        """Moves Q(s, a) toward a given value (or sets it to the reward if unknown)"""
        oldv = self.Q.get((state, action), None)
        if oldv is None:
            self.Q[(state, action)] = reward
        else:
            self.Q[(state, action)] = oldv + self.alpha * (value - oldv)

    def policy(self, state):
        """Selects an action with an epsilon-greedy policy"""
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            q = [self.getQ(state, a) for a in self.actions]
            maxQ = max(q)
            count = q.count(maxQ)
            if count > 1:
                best = [i for i in range(len(self.actions)) if q[i] == maxQ]
                i = random.choice(best)
            else:
                i = q.index(maxQ)

            action = self.actions[i]
        return action

    def learn(self, state1, action1, reward, state2, action2):
        """Updates Q(state1, action1) given the reward and the new state (action2 is not used by Q-learning)"""
        g = self.gamma
        a = self.alpha
        
        q1 = self.getQ(state1, action1)
        # Best value available from the new state (0.0 if nothing is known about it)
        max_q2 = max([self.getQ(state2, a2) for a2 in self.actions])
        
        # Reward prediction error
        rpe = reward + g * max_q2 - q1 
        self.Q[(state1, action1)] = q1 + a * rpe

    def run(self, task, n):
        """Run through N moves in the maze"""
        for _ in range(n):
            s = task.state
            a = self.policy(s)
            new_s, r = task.transition(a)
            # Q-learning does not need the next action, so None is passed
            self.learn(s, a, r, new_s, None)

    def calculateV(self, task):
        """Returns a representation of the best Q-value in every state"""
        v = np.zeros(task.grid.shape)
        for i in range(v.shape[0]):
            for j in range(v.shape[1]):
                maxq = max([self.getQ((i, j), a) for a in self.actions])
                v[i, j] = maxq
        return v
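
As a worked example of the update performed in ```learn```, the sketch below applies the same rule, $Q(s, a) \leftarrow Q(s, a) + \alpha \, [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$, to plain numbers. The function name ```q_update``` and the input values are hypothetical and chosen only for illustration; the defaults mirror those of the ```Agent``` constructor.

```python
def q_update(q1, reward, max_q2, alpha=0.2, gamma=0.9):
    """One Q-learning step: move q1 toward reward + gamma * max_q2 by a fraction alpha."""
    rpe = reward + gamma * max_q2 - q1   # Reward prediction error
    return q1 + alpha * rpe

# Illustrative numbers: Q(s, a) is still 0, there is no immediate reward,
# and the best action in the next state is currently valued at 10.
new_q = q_update(q1=0.0, reward=0.0, max_q2=10.0)   # approximately 1.8

# Repeating the same update moves the estimate toward the target of 9.0
q = 0.0
for _ in range(50):
    q = q_update(q, reward=0.0, max_q2=10.0)
```

A single update covers only a fraction $\alpha$ of the distance to the target, which is why values spread gradually through the maze over many trials.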

Now, two functions to run multiple trials.

In [35]:
def run_trial(environment, agent):
    "A trial ends when the agent gets the reward of 10"
    state1 = environment.state
    reward1 = 0.0
    action1 = agent.policy(state1)
    
    while reward1 != 10:
        state2, reward2 = environment.transition(action1)
        action2 = agent.policy(state2)
        
        # Update the Q-values for state1, action1
        # (reward1 is the reward that was received on entering state1)
        agent.learn(state1, action1, reward1, state2, action2)
        
        # Move on to the next step
        state1 = state2
        action1 = action2
        reward1 = reward2
        
        
    # Final update for the rewarded state: there is no next state or action
    agent.learn(state1, action1, reward1, None, None)

def run_trials(environment, agent, n):
    for j in range(n):
        run_trial(environment, agent)
        environment.state = Maze.INITIAL_STATE
        
        

In [39]:
m = Maze()
a = Agent()
run_trials(m, a, 1)
m.state
a.Q
run_trials(m, a, 100)
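
Once the agent has been trained, a dictionary such as ```a.Q``` can be read back to see which action the agent prefers in each state. The following is a minimal sketch, assuming the same ```(state, action)``` keys used by the ```Agent``` class; the helper ```greedy_action``` is hypothetical, and the Q-values below are made-up placeholders rather than results of an actual run.

```python
ACTIONS = ("up", "down", "left", "right")

def greedy_action(Q, state, actions=ACTIONS):
    """Hypothetical helper: the action with the highest Q-value in a state.

    Unknown (state, action) pairs count as 0.0; on ties, the first action wins.
    """
    return max(actions, key=lambda act: Q.get((state, act), 0.0))

# Placeholder values for illustration only, not the output of a real run
Q = {((0, 0), "down"): 1.5, ((0, 0), "right"): 0.3}

best = greedy_action(Q, (0, 0))       # "down" has the largest value here
unseen = greedy_action(Q, (2, 2))     # No entries: all values tie at 0.0
```

Unlike the ```policy``` method, this helper never explores and breaks ties deterministically, so it is meant for inspecting what was learned rather than for training.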