# Reinforcement learning code
Here we implement an RL agent that interacts with a simple environment. In this case, the agent is a simulated mouse and the environment is a 2D maze.

## Defining the environment

To define the environment, we need to define the set of possible states $S = \{s_1, s_2, \dots, s_N\}$, the transition function $P_{s,s'}^{a}$, and the reward function $R_{s,s'}^{a}$.

In our case, the environment consists of a 4x4 grid. Our hypothetical agent perceives only one cell at any time: the cell it is currently in. Therefore, our states correspond to the sixteen positions of the maze, which we can indicate with the coordinates $(0, 0), (0, 1), \dots, (0, 3), (1, 0), \dots, (3, 3)$.
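As a quick illustration, the state set and a deterministic transition function for such a grid can be sketched in a few lines. This is only a sketch under the assumptions stated in the text (a 4x4 grid, four moves, bouncing off walls); the names `STATES`, `MOVES`, and `transition` are hypothetical and are not part of the ```Maze``` class. It also ignores rewards and the reset that happens when the cheese is found.

```python
from itertools import product

N_ROWS, N_COLS = 4, 4  # Assumed grid size, as described in the text

# All sixteen states, as (row, column) coordinates
STATES = list(product(range(N_ROWS), range(N_COLS)))

# Hypothetical mapping from action names to (row, column) offsets
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def transition(state, action):
    """Returns the next state; moving into a wall leaves the state unchanged"""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N_ROWS and 0 <= new_col < N_COLS:
        return (new_row, new_col)
    return state


print(len(STATES))                 # 16
print(transition((0, 0), "down"))  # (1, 0)
print(transition((0, 0), "up"))    # (0, 0): bounced off the top wall
```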


In [8]:
import random
import numpy as np


class Maze():
    """A maze environment"""

    ACTIONS = ("up", "down", "left", "right")  # Tuple of possible actions
    INITIAL_STATE = (0, 0)  # Always starts at the top-left corner
    
    def __init__(self, fname = "grid.txt"):
        """Inits a maze by loading the grid file"""
        self.grid = np.loadtxt(fname)
        self.state = self.INITIAL_STATE

        
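    def execute_action(self, action):
        """Executes one of four possible actions: up, down, left, and right.

        Returns the tuple (new_state, reward). Moves into a wall (and
        unrecognized actions) leave the state unchanged and yield -1.
        """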
        s = self.state
        
        # By default, it assumes you bounced into a wall
        new_s = s
        new_r = -1
        
        if action in self.ACTIONS:
            if action == "up":
                if s[0] > 0:
                    new_s = (s[0] - 1, s[1])
                    new_r = self.grid[new_s[0], new_s[1]]
            
            elif action == "left":
                if s[1] > 0:
                    new_s = (s[0], s[1] - 1)
                    new_r = self.grid[new_s[0], new_s[1]]
            
            elif action == "down":
                if s[0] < (self.grid.shape[0] - 1):
                    new_s = (s[0] + 1, s[1])
                    new_r = self.grid[new_s[0], new_s[1]]

            else:  # action == "right"
                if s[1] < (self.grid.shape[1] - 1):
                    new_s = (s[0], s[1] + 1)
                    new_r = self.grid[new_s[0], new_s[1]]

        # If you find the cheese, you go back to square #1
        # ---This keeps the environment looping---
        if new_r == 10:
            new_s =  self.INITIAL_STATE

        # Updates the state, and returns the new state and reward as a tuple
        self.state = new_s
        return (new_s, new_r)

    def print_state(self):
        "Prints a text representation of the maze (with the agent position)"
        bar = "-" * ( 4 * self.grid.shape[1] + 1)
        for i in range(self.grid.shape[0]):
            row = "|"
            for j in range(self.grid.shape[1]):
                cell = " "
                if i == self.state[0] and j == self.state[1]:
                    cell = "*"
                row += (" %s |" % cell)
            print(bar)
            print(row)
        print(bar)
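The constructor above loads the rewards from ```grid.txt```, whose contents are not shown here. As a minimal sketch, assuming a 4x4 layout with a single cheese cell worth 10 and zeros elsewhere (an illustrative layout, not necessarily the original file), a whitespace-separated file of this kind could be written as follows. The file name `example_grid.txt` is hypothetical.

```python
# Hypothetical reward layout: 0 everywhere, 10 (the cheese) in the bottom-right corner
rows = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 10],
]

# Write one grid row per line, with values separated by spaces
with open("example_grid.txt", "w") as f:
    for row in rows:
        f.write(" ".join(str(v) for v in row) + "\n")

# Read the file back to check its contents
with open("example_grid.txt") as f:
    grid_text = f.read()
print(grid_text)
```

This is the plain-text format that `np.loadtxt` reads by default, so a file like this could then be passed to the constructor, for example ```Maze("example_grid.txt")```.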

The maze is simple but functional. It is easy to create a maze, check the available actions, execute a few actions, and so on. For example, we can create a new ```Maze``` object and have the possible actions readily available: 

In [9]:
m = Maze()
m.ACTIONS

('up', 'down', 'left', 'right')

We can also use the ```print_state()``` method to display a very low-tech representation of the maze, with the agent's position indicated by the star symbol ```*```:

In [10]:
m.print_state()

-----------------
| * |   |   |   |
-----------------
|   |   |   |   |
-----------------
|   |   |   |   |
-----------------
|   |   |   |   |
-----------------


And, finally, we can easily navigate in our virtual maze by executing the appropriate actions:

In [12]:
m.execute_action("down")

((3, 0), 0.0)

Note how the ```Maze``` object returns two values at the end of each action execution: the new state $s_{t+1}$ and the associated reward $r_{t+1}$. In the run shown above, which used the original maze layout of the ```grid.txt``` file, the two values are $s_{t+1} = $ ```(3, 0)``` and $r_{t+1} = $ ```0.0```. The result depends on where the agent was before the call: a freshly created maze starts at ```(0, 0)```, so a state of ```(3, 0)``` means that earlier moves had already been executed in this session. We can also execute more actions, and see what happens after a few movements:

In [14]:
m.execute_action("down")
m.execute_action("down")
m.execute_action("right")
m.print_state()


-----------------
|   |   |   |   |
-----------------
|   |   |   |   |
-----------------
|   |   |   |   |
-----------------
|   | * |   |   |
-----------------


Now we can create our own agents. As an example, we will create a $Q$-learning agent that interacts with the ```Maze``` world.
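
For reference, the update that the agent's ```learn``` and ```updateQ``` methods apply is the usual one-step $Q$-learning rule, where $\alpha$ is the learning rate and $\gamma$ is the discount factor:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
$$

One detail is specific to this implementation: the first time a pair $(s_t, a_t)$ is encountered, its entry is initialized to the immediate reward $r_{t+1}$ rather than updated with the rule above.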

In [15]:

import random
import numpy as np

class QAgent():
    def __init__(self, actions, epsilon=0.1, alpha=0.2, gamma=0.9):
        """Creates a Q-agent"""
        self.q = {}    # Dictionary of Q-values keyed by (s, a) pairs. At the beginning, it's empty.

        self.epsilon = epsilon     # Epsilon for the epsilon-greedy policy
        self.alpha = alpha         # Learning rate
        self.gamma = gamma         # Temporal discounting
        self.actions = actions     # Set of possible actions (provide those of Maze.ACTIONS)

    def getQ(self, state, action):
        """Returns the Q-value associated to a given state and action (or 0.0 if unknown)"""
        return self.q.get((state, action), 0.0)
        
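    def updateQ(self, state, action, reward, value):
        """Moves Q(s, a) toward the target value by a step of size alpha"""
        oldv = self.q.get((state, action), None)
        if oldv is None:
            # First visit to (s, a): initialize the entry with the immediate reward
            self.q[(state, action)] = reward
        else:
            # Standard update: Q(s, a) <- Q(s, a) + alpha * (target - Q(s, a))
            self.q[(state, action)] = oldv + self.alpha * (value - oldv)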

    def choose_action(self, state):
        """Selects an action with an epsilon-greedy policy"""
        if random.random() < self.epsilon:
            action = random.choice(self.actions)
        else:
            q = [self.getQ(state, a) for a in self.actions]
            maxQ = max(q)
            count = q.count(maxQ)
            if count > 1:
                best = [i for i in range(len(self.actions)) if q[i] == maxQ]
                i = random.choice(best)
            else:
                i = q.index(maxQ)

            action = self.actions[i]
        return action

    def learn(self, state1, action1, reward, state2):
        """Updates the Q-values when given an (s,a) pair, the reward value and a new state"""
        maxqnew = max([self.getQ(state2, a) for a in self.actions])
        self.updateQ(state1, action1, reward, reward + self.gamma*maxqnew)

    def run(self, task, n):
        """Run through N moves in the maze"""
        for _ in range(n):
            s = task.state
            a = self.choose_action(s)
            new_s, r = task.execute_action(a)
            self.learn(s, a, r, new_s)

    def calculateV(self, task):
        """Returns a representation of the best Q-value in every state"""
        v = np.zeros(task.grid.shape)
        for i in range(v.shape[0]):
            for j in range(v.shape[1]):
                maxq = max([self.getQ((i, j), a) for a in self.actions])
                v[i, j] = maxq
        return v
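
Once the agent has learned, the table of $Q$-values can also be turned into a policy by picking the best action in each state. The following is a minimal sketch, assuming a dictionary with the same `(state, action)` keys as ```QAgent.q```; the function `greedy_policy` and the example values are hypothetical and only illustrate the idea.

```python
ACTIONS = ("up", "down", "left", "right")  # Same actions as Maze.ACTIONS


def greedy_policy(q, states, actions=ACTIONS):
    """Maps each state to the action with the highest Q-value (unknown pairs count as 0.0)"""
    policy = {}
    for s in states:
        # max() returns the first maximal action, so ties go to the earliest action
        policy[s] = max(actions, key=lambda a: q.get((s, a), 0.0))
    return policy


# Made-up Q-values for two states, for illustration only
example_q = {
    ((0, 0), "down"): 0.5,
    ((0, 0), "right"): 0.2,
    ((1, 0), "right"): 0.7,
    ((1, 0), "up"): -1.0,
}

policy = greedy_policy(example_q, [(0, 0), (1, 0)])
print(policy)  # {(0, 0): 'down', (1, 0): 'right'}
```

Unlike ```choose_action```, this sketch breaks ties deterministically instead of at random, which keeps the output reproducible.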
