# Assignment: Policy Evaluation in the Cliff Walking Environment

Welcome to the Course 2, Module 2 Programming Assignment! In this assignment, you will implement one of the fundamental sample-based, bootstrapping, model-free reinforcement learning agents for prediction: the one that uses one-step temporal difference learning, also known as TD(0). The task is to design an agent for policy evaluation in the Cliff Walking environment. Recall that policy evaluation is the prediction problem where the goal is to accurately estimate the values of states given some policy.

### Learning Objectives
- Implement parts of the Cliff Walking environment, to get experience specifying MDPs [Section 1].
- Implement an agent that uses bootstrapping and, particularly, TD(0) [Section 2].
- Apply TD(0) to estimate value functions for different policies, i.e., run policy evaluation experiments [Section 3].

## The Cliff Walking Environment

The Cliff Walking environment is a gridworld with a discrete state space and a discrete action space. The agent starts at grid cell S. The agent can move (deterministically) to the four neighboring cells by taking the actions Up, Down, Left, or Right. Trying to move out of bounds results in staying in the same location. So, for example, trying to move left when at a cell in the leftmost column results in no movement, and the agent remains in the same location. The agent receives a reward of -1 per step in most states and a reward of -100 when falling off the cliff. This is an episodic task; termination occurs when the agent reaches the goal grid cell G. Falling off the cliff resets the agent to the start state, without termination.

The diagram below illustrates the description above and also shows two of the policies we will be evaluating.

<img src="cliffwalk.png" style="height:400px">

## Packages

We import the following libraries, which are required for this assignment:
1. jdc: Jupyter magic that allows defining classes over multiple jupyter notebook cells.
2. numpy: the fundamental package for scientific computing with Python.
3. matplotlib: the library for plotting graphs in Python.
4. RL-Glue: the library for reinforcement learning experiments.
5. BaseEnvironment, BaseAgent: the base classes from which we will inherit when creating the environment and agent classes in order for them to support the RL-Glue framework.
6. Manager: the file allowing for visualization and testing.
7. itertools.product: the function for computing Cartesian products of iterables.
8. tqdm.tqdm: the function that provides progress bars for visualizing the status of loops.

**Please do not import other libraries**; doing so will break the autograder.

**NOTE: For this notebook, there is no need to make any calls to methods of random number generators. Spurious or missing calls to random number generators may affect your results.**

In [None]:
!pip install jdc



In [None]:
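# Note: Manager is imported below from the local manager.py file that ships with
# this assignment, so this install is likely unnecessary.
!pip install manage.py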



In [None]:
# Do not modify this cell!

import jdc
import numpy as np
from rl_glue import RLGlue
from agent import BaseAgent
from environment import BaseEnvironment
from manager import Manager
from itertools import product
from tqdm import tqdm

## Section 1. Environment

In the first part of this assignment, you will get to see how the Cliff Walking environment is implemented. You will also get to implement parts of it to aid your understanding of the environment and more generally how MDPs are specified. In particular, you will implement the logic for:
 1. Converting 2-dimensional coordinates to a single index for the state,
 2. One of the actions (action up), and,
 3. Reward and termination.

Given below is an annotated diagram of the environment with more details that may help in completing the tasks of this part of the assignment. Note that we will be creating a more general environment where the height and width positions can be variable but the start, goal and cliff grid cells have the same relative positions (bottom left, bottom right and the cells between the start and goal grid cells respectively).

<img src="cliffwalk-annotated.png" style="height:400px">

Once you have gone through the code and begun implementing solutions, it may be a good idea to come back here and see if you can convince yourself that the diagram above is an accurate representation of the code given and the code you have written.
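To make the layout described above concrete, here is a minimal sketch that prints the grid as text, assuming the start is the bottom-left cell, the goal is the bottom-right cell, and the cliff fills the cells between them. `render_cliff_grid` is a hypothetical helper for illustration only; it is not part of the assignment files or of RL-Glue.

```python
def render_cliff_grid(height=4, width=12):
    """Return the grid as a list of strings: S = start, G = goal, C = cliff, . = other."""
    rows = []
    for r in range(height):
        cells = []
        for c in range(width):
            if r == height - 1 and c == 0:
                cells.append("S")      # bottom-left corner
            elif r == height - 1 and c == width - 1:
                cells.append("G")      # bottom-right corner
            elif r == height - 1:
                cells.append("C")      # bottom row between start and goal
            else:
                cells.append(".")
        rows.append(" ".join(cells))
    return rows


for line in render_cliff_grid():
    print(line)
```

Changing `height` and `width` keeps the same relative positions, which is the generalization the environment code is meant to support.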

In [None]:
# Create empty CliffWalkEnvironment class.
# These methods will be filled in later cells.
class CliffWalkEnvironment(BaseEnvironment):
    def env_init(self, env_info={}):
        raise NotImplementedError

    def env_start(self):
        raise NotImplementedError

    def env_step(self, action):
        raise NotImplementedError

    def env_cleanup(self):
        raise NotImplementedError

    # helper method
    def state(self, loc):
        raise NotImplementedError

## env_init()

The first function we add to the environment is the initialization function which is called once when an environment object is created. In this function, the grid dimensions and special locations (start and goal locations and the cliff locations) are stored for easy use later.

In [None]:
%%add_to CliffWalkEnvironment

def env_init(self, env_info={}):
        """Setup for the environment called when the experiment first starts.
        Note:
            Initialize a tuple with the reward, first state, boolean
            indicating if it's terminal.
        """

        # Note, we can setup the following variables later, in env_start() as it is equivalent.
        # Code is left here to adhere to the note above, but these variables are initialized once more
        # in env_start() [See the env_start() function below.]

        reward = None
        state = None # See Aside
        termination = None
        self.reward_state_term = (reward, state, termination)

        # AN ASIDE: Observation is a general term used in the RL-Glue files that can be interchangeably
        # used with the term "state" for our purposes and for this assignment in particular.
        # A difference arises in the use of the terms when we have what is called Partial Observability where
        # the environment may return states that may not fully represent all the information needed to
        # predict values or make decisions (i.e., the environment is non-Markovian.)

        # Set the default height to 4 and width to 12 (as in the diagram given above)
        self.grid_h = env_info.get("grid_height", 4)
        self.grid_w = env_info.get("grid_width", 12)

        # Now, we can define a frame of reference. Let positive x be towards the direction down and
        # positive y be towards the direction right (following the row-major NumPy convention.)
        # Then, keeping with the usual convention that arrays are 0-indexed, max x is then grid_h - 1
        # and max y is then grid_w - 1. So, we have:
        # Starting location of agent is the bottom-left corner, (max x, min y).
        self.start_loc = (self.grid_h - 1, 0)
        # Goal location is the bottom-right corner. (max x, max y).
        self.goal_loc = (self.grid_h - 1, self.grid_w - 1)

        # The cliff contains all the cells between start_loc and goal_loc.
        self.cliff = [(self.grid_h - 1, i) for i in range(1, (self.grid_w - 1))]

        # Take a look at the annotated environment diagram given in the above Jupyter Notebook cell to
        # verify that your understanding of the above code is correct for the default case, i.e., where
        # height = 4 and width = 12.

## *Implement* state()
    
The agent location can be described as a two-tuple or coordinate (x, y) describing the agent's position.
However, we can convert the (x, y) tuple into a single index and provide agents with just this integer.
One reason for this choice is that the spatial aspect of the problem is secondary and there is no need
for the agent to know about the exact dimensions of the environment.
From the agent's viewpoint, it is just perceiving some states, accessing their corresponding values
in a table, and updating them. Both the coordinate (x, y) state representation and the converted coordinate representation are thus equivalent in this sense.

Given a grid cell location, the state() function should return the state: a single index corresponding to the location in the grid.


```
Example: Suppose grid_h is 2 and grid_w is 2. Then, we can write the grid cell two-tuple or coordinate
states as follows (following the usual 0-index convention):
|(0, 0) (0, 1)| |0 1|
|(1, 0) (1, 1)| |2 3|
Assuming row-major order as NumPy does, we can flatten the latter to get a vector [0 1 2 3].
So, if loc = (0, 0) we return 0. While, for loc = (1, 1) we return 3.
```
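The row-major flattening above can be cross-checked against NumPy. The sketch below uses two hypothetical helpers, `to_index` and `to_loc` (not part of the assignment), and compares the forward mapping with `np.ravel_multi_index` for the default 4 x 12 grid.

```python
import numpy as np

grid_h, grid_w = 4, 12  # default grid dimensions used in this notebook


def to_index(loc, width):
    """Row-major flattening: every full row above contributes `width` cells."""
    row, col = loc
    return row * width + col


def to_loc(index, width):
    """Inverse mapping: recover (row, col) from the single index."""
    return divmod(index, width)


loc = (3, 9)
idx = to_index(loc, grid_w)

# NumPy applies the same row-major rule for a (grid_h, grid_w) array.
assert idx == np.ravel_multi_index(loc, (grid_h, grid_w))
assert to_loc(idx, grid_w) == loc
```

The inverse mapping is not needed by the agent, but it is handy when debugging which grid cell a given state index refers to.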

In [None]:
%%add_to CliffWalkEnvironment


# Modify the return statement of this function to return a correct single index as
# the state (see the logic for this in the previous cell.)
def state(self, loc):
    # your code here
    row, col = loc
    linear_index = row * self.grid_w + col
    return linear_index

In [None]:
# Feel free to make any changes to this cell to debug your code

env = CliffWalkEnvironment()
env.env_init({ "grid_height": 4, "grid_width": 12 })

coords = [(0, 0), (0, 11), (1, 5), (3, 0), (3, 9), (3, 11)]
correct_outputs = [0, 11, 17, 36, 45, 47]

got = [env.state(s) for s in coords]
assert got == correct_outputs

In [None]:
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.

np.random.seed(0)

env = CliffWalkEnvironment()
for n in range(100):
    # make a gridworld of random size and shape
    height = np.random.randint(2, 100)
    width = np.random.randint(2, 100)
    env.env_init({ "grid_height": height, "grid_width": width })

    # generate some random coordinates within the grid
    idx_h = np.random.randint(height)
    idx_w = np.random.randint(width)

    # check that the state index is correct
    state = env.state((idx_h, idx_w))
    correct_state = width * idx_h + idx_w

    assert state == correct_state

## env_start()

In env_start(), we initialize the agent location to be the start location and return the state corresponding to it as the first state for the agent to act upon. Additionally, we set the reward and termination terms to 0 and False respectively, consistent with the notion that there is neither reward nor termination before the first action is taken.

In [None]:
%%add_to CliffWalkEnvironment

def env_start(self):
    """The first method called when the episode starts, called before the
    agent starts.

    Returns:
        The first state from the environment.
    """
    reward = 0

    # agent_loc will hold the current location of the agent
    self.agent_loc = self.start_loc

    # state is the one-dimensional state representation of the agent location.
    state = self.state(self.agent_loc)
    termination = False
    self.reward_state_term = (reward, state, termination)

    return self.reward_state_term[1]

## *Implement* env_step()

Once an action is taken by the agent, the environment must provide a new state, reward and termination signal.

In the Cliff Walking environment, agents move around using a 4-cell neighborhood called the Von Neumann neighborhood (https://en.wikipedia.org/wiki/Von_Neumann_neighborhood). Thus, the agent has 4 available actions at each state. Three of the actions have been implemented for you and your first task is to implement the logic for the fourth action (Action UP).

Your second task for this function is to implement the reward logic. Look over the environment description given earlier in this notebook if you need a refresher for how the reward signal is defined.
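One way to picture the movement rule is as a lookup from action to a (row, column) offset, followed by a bounds check. The sketch below assumes the action encoding used in this notebook's environment code (0: Up, 1: Left, 2: Down, 3: Right, with rows growing downward); `ACTION_DELTAS` and `move` are hypothetical names for illustration, not part of the environment class.

```python
# Hypothetical lookup table: action id -> (row delta, column delta).
# Rows grow downward, so "up" decreases the row index.
ACTION_DELTAS = {
    0: (-1, 0),  # Up
    1: (0, -1),  # Left
    2: (1, 0),   # Down
    3: (0, 1),   # Right
}


def move(loc, action, height, width):
    """Return the location after taking `action`, staying put if it would leave the grid."""
    row, col = loc
    d_row, d_col = ACTION_DELTAS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < height and 0 <= new_col < width:
        return (new_row, new_col)
    return loc  # out-of-bounds moves leave the agent where it was


print(move((1, 0), 0, 4, 12))  # moves up one row
print(move((0, 0), 0, 4, 12))  # already in the top row, so no movement
```

This only covers movement; the reward and termination logic is applied to the resulting location afterwards.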

In [None]:
%%add_to CliffWalkEnvironment

def isInBounds(self, x, y, width, height):
    return 0 <= x < height and 0 <= y < width

def env_step(self, action):
    """A step taken by the environment."""
    x, y = self.agent_loc

    # UP
    if action == 0:
        x = x - 1
    # LEFT
    elif action == 1:
        y = y - 1
    # DOWN
    elif action == 2:
        x = x + 1
    # RIGHT
    elif action == 3:
        y = y + 1
    else:
        raise Exception(f"{action} is not a recognized action [0: Up, 1: Left, 2: Down, 3: Right]!")

    if not self.isInBounds(x, y, self.grid_w, self.grid_h):
        x, y = self.agent_loc

    self.agent_loc = (x, y)
    reward = -1
    terminal = False

    # Check whether the agent fell off the cliff
    if self.agent_loc in self.cliff:
        reward = -100
        terminal = False
        self.agent_loc = self.start_loc  # Reset the agent's position

    elif self.agent_loc == self.goal_loc:
        terminal = True

    self.reward_state_term = (reward, self.state(self.agent_loc), terminal)
    return self.reward_state_term

In [None]:
# Feel free to make any changes to this cell to debug your code

def test_action_up():
    env = CliffWalkEnvironment()
    env.env_init({"grid_height": 4, "grid_width": 12})
    env.agent_loc = (0, 0)
    env.env_step(0)
    assert(env.agent_loc == (0, 0))

    env.agent_loc = (1, 0)
    env.env_step(0)
    assert(env.agent_loc == (0, 0))

def test_reward():
    env = CliffWalkEnvironment()
    env.env_init({"grid_height": 4, "grid_width": 12})
    env.agent_loc = (0, 0)
    reward_state_term = env.env_step(0)
    assert(reward_state_term[0] == -1 and reward_state_term[1] == env.state((0, 0)) and
           reward_state_term[2] == False)

    env.agent_loc = (3, 1)
    reward_state_term = env.env_step(2)
    assert(reward_state_term[0] == -100 and reward_state_term[1] == env.state((3, 0)) and
           reward_state_term[2] == False)

    env.agent_loc = (2, 11)
    reward_state_term = env.env_step(2)
    assert(reward_state_term[0] == -1 and reward_state_term[1] == env.state((3, 11)) and reward_state_term[2] == True)

test_action_up()
test_reward()

In [None]:
np.random.seed(0)

env = CliffWalkEnvironment()
for n in range(100):
    # create a cliff world of random size
    height = np.random.randint(2, 100)
    width = np.random.randint(2, 100)
    env.env_init({"grid_height": height, "grid_width": width})

    # start the agent in a random location
    idx_h = 0 if np.random.random() < 0.5 else np.random.randint(height)
    idx_w = np.random.randint(width)
    env.agent_loc = (idx_h, idx_w)

    env.env_step(0)
    assert(env.agent_loc == (0 if idx_h == 0 else idx_h - 1, idx_w))

In [None]:
np.random.seed(0)

env = CliffWalkEnvironment()
for n in range(100):
    # create a cliff world of random size
    height = np.random.randint(4, 10)
    width = np.random.randint(4, 10)
    env.env_init({"grid_height": height, "grid_width": width})
    env.env_start()

    # start the agent near the cliff
    idx_h = height - 2
    idx_w = np.random.randint(1, width - 2)
    env.agent_loc = (idx_h, idx_w)

    r, sp, term = env.env_step(2)
    assert(r == -100 and sp == (height - 1) * width and term == False)

for n in range(100):
    # create a cliff world of random size
    height = np.random.randint(4, 10)
    width = np.random.randint(4, 10)
    env.env_init({"grid_height": height, "grid_width": width})
    env.env_start()

    # start the agent near the goal
    idx_h = height - 2
    idx_w = width - 1
    env.agent_loc = (idx_h, idx_w)

    r, sp, term = env.env_step(2)
    assert(r == -1 and sp == (height - 1) * width + (width - 1) and term == True)

for n in range(100):
    # create a cliff world of random size
    height = np.random.randint(4, 10)
    width = np.random.randint(4, 10)
    env.env_init({"grid_height": height, "grid_width": width})
    env.env_start()

    # start the agent in a random location
    idx_h = np.random.randint(0, height - 3)
    idx_w = np.random.randint(0, width - 1)
    env.agent_loc = (idx_h, idx_w)

    r, sp, term = env.env_step(2)
    assert(r == -1 and term == False)

## env_cleanup()

There is not much cleanup to do for the Cliff Walking environment. Here, we simply reset the agent location to be the start location in this function.

In [None]:
%%add_to CliffWalkEnvironment

def env_cleanup(self):
    """Cleanup done after the environment ends."""
    self.agent_loc = self.start_loc

## Section 2. Agent

In this second part of the assignment, you will be implementing the key updates for Temporal Difference Learning. There are two cases to consider depending on whether an action leads to a terminal state or not.
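The two cases can be summarized in a single function. This is a minimal sketch under stated assumptions, not the agent implementation itself: `td0_update` is a hypothetical standalone helper, and the numbers in the usage lines are illustrative (two states, current estimates 0 and 1, step size 0.1, discount 0.99).

```python
import numpy as np


def td0_update(values, last_state, reward, next_state, step_size, discount, terminal=False):
    """Apply one TD(0) update in place to values[last_state] and return the TD error."""
    # A terminal state has value 0 by definition, so there is nothing to bootstrap from.
    bootstrap = 0.0 if terminal else discount * values[next_state]
    td_error = reward + bootstrap - values[last_state]
    values[last_state] += step_size * td_error
    return td_error


# Non-terminal transition: the target is reward + discount * V[next_state].
values = np.array([0.0, 1.0])
td0_update(values, last_state=0, reward=-1, next_state=1, step_size=0.1, discount=0.99)
print(values[0])  # approximately -0.001; values[1] is untouched
```

In the terminal case the same function is called with `terminal=True`, and the target reduces to the reward alone.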

In [None]:
# Create empty TDAgent class.
# These methods will be filled in later cells.

class TDAgent(BaseAgent):
    def agent_init(self, agent_info={}):
        raise NotImplementedError

    def agent_start(self, state):
        raise NotImplementedError

    def agent_step(self, reward, state):
        raise NotImplementedError

    def agent_end(self, reward):
        raise NotImplementedError

    def agent_cleanup(self):
        raise NotImplementedError

    def agent_message(self, message):
        raise NotImplementedError

## agent_init()

As we did with the environment, we first initialize the agent once when a TDAgent object is created. In this function, we create a random number generator, seeded with the seed provided in the agent_info dictionary to get reproducible results. We also set the policy, discount, and step size based on the agent_info dictionary. Finally, with the convention that the policy is always specified as a mapping from states to actions, i.e., an array of shape (# States, # Actions), we initialize a values array of shape (# States,) to zeros.
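As a small illustration of these shape conventions, the sketch below builds a uniform random policy for the default 4 x 12 grid and derives the values array from its first dimension. The variable names are chosen for this example only.

```python
import numpy as np

num_states, num_actions = 4 * 12, 4

# Each row is a probability distribution over the four actions for one state.
policy = np.full((num_states, num_actions), 1.0 / num_actions)
assert np.allclose(policy.sum(axis=1), 1.0)

# One value estimate per state, all starting at zero.
values = np.zeros((policy.shape[0],))
print(policy.shape, values.shape)
```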

In [None]:
%%add_to TDAgent

def agent_init(self, agent_info={}):
    """Setup for the agent called when the experiment first starts."""

    # Create a random number generator with the provided seed to seed the agent for reproducibility.
    self.rand_generator = np.random.RandomState(agent_info.get("seed"))

    # Policy will be given; recall that the goal is to accurately estimate its corresponding value function.
    self.policy = agent_info.get("policy")

    # Discount factor (gamma) to use in the updates.
    self.discount = agent_info.get("discount")

    # The learning rate or step size parameter (alpha) to use in updates.
    self.step_size = agent_info.get("step_size")

    # Initialize an array of zeros that will hold the values.
    # Recall that the policy can be represented as a (# States, # Actions) array. With the
    # assumption that this is the case, we can use the first dimension of the policy to
    # initialize the array for values.
    self.values = np.zeros((self.policy.shape[0],))

## agent_start()

In agent_start(), we choose an action based on the initial state and the policy we are evaluating. We also cache the state so that we can later update its value when we perform a temporal difference update. Finally, we return the action chosen so that the RL loop can continue and the environment can execute this action.
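Choosing an action "based on the policy" means sampling from the row of the policy array that belongs to the current state. The sketch below is illustrative: it uses a single made-up policy row that favors action 0 and checks that the empirical frequencies from `RandomState.choice` roughly match the given probabilities.

```python
import numpy as np

rand_generator = np.random.RandomState(0)

# Illustrative policy row for one state: action 0 with probability 0.9,
# the remaining probability split evenly over the other three actions.
policy_row = np.array([0.9, 0.1 / 3, 0.1 / 3, 0.1 / 3])

# Draw many actions from this row and measure how often each one occurs.
samples = rand_generator.choice(len(policy_row), size=10_000, p=policy_row)
frequencies = np.bincount(samples, minlength=len(policy_row)) / samples.size
print(frequencies)
```

For a deterministic row such as `[1, 0, 0, 0]`, the same call always returns action 0.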

In [None]:
%%add_to TDAgent

def agent_start(self, state):
    """The first method called when the episode starts, called after
    the environment starts.
    Args:
        state (Numpy array): the state from the environment's env_start function.
    Returns:
        The first action the agent takes.
    """
    # The policy can be represented as a (# States, # Actions) array. So, we can use
    # the second dimension here when choosing an action.
    action = self.rand_generator.choice(range(self.policy.shape[1]), p=self.policy[state])
    self.last_state = state
    return action

## *Implement* agent_step()

In agent_step(), the agent must:

- Perform an update to improve the value estimate of the previously visited state, and
- Act based on the state provided by the environment.

The latter of the two steps above has been implemented for you. Implement the former. Note that, unlike agent_end(), the episode has not yet ended in agent_step(). In other words, the previously observed state was not a terminal state.

In [None]:
%%add_to TDAgent

def agent_step(self, reward, state):
    """A step taken by the agent.
    Args:
        reward (float): the reward received for taking the last action taken
        state (Numpy array): the state from the
            environment's step after the last action, i.e., where the agent ended up after the
            last action
    Returns:
        The action the agent is taking.
    """

    # Hint: We should perform an update with the last state given that we now have the reward and
    # next state. We break this into two steps. Recall for example that the Monte-Carlo update
    # had the form: V[S_t] = V[S_t] + alpha * (target - V[S_t]), where the target was the return, G_t.

    # TD(0) update: V[S_t] = V[S_t] + alpha * (reward + gamma * V[S_t+1] - V[S_t])
    last_state = self.last_state
    td_error = reward + self.discount * self.values[state] - self.values[last_state]
    self.values[last_state] += self.step_size * td_error

    # Having updated the value for the last state, we now act based on the current
    # state, and set the last state to be the current one as we will next be making an
    # update with it when agent_step is called next once the action we return from this function
    # is executed in the environment.

    action = self.rand_generator.choice(range(self.policy.shape[1]), p=self.policy[state])
    self.last_state = state

    return action

## *Implement* agent_end()

Implement the TD update for the case where an action leads to a terminal state.

In [None]:
%%add_to TDAgent

def agent_end(self, reward):
    """Run when the agent terminates.
    Args:
        reward (float): the reward the agent received for entering the terminal state.
    """

    # Hint: Here too, we should perform an update with the last state given that we now have the reward.
    last_state = self.last_state
    # Note that in this case, the action led to termination.
    # Once more, we break this into two steps:
    # computing the target and the update itself that uses the target
    # and the current value estimate for the state whose value we are updating.
    td_error = reward - self.values[last_state]
    self.values[last_state] += self.step_size * td_error

## agent_cleanup()

In cleanup, we simply reset the last state to be None to ensure that we are not storing any states past an episode.

In [None]:
%%add_to TDAgent

def agent_cleanup(self):
    """Cleanup done after the agent ends."""
    self.last_state = None

## agent_message()

agent_message() can generally be used to get different kinds of information about an RLGlue agent in the interaction loop of RLGlue. Here, we conditionally check for a message matching "get_values" and use it to retrieve the values table the agent has been updating over time.

In [None]:
%%add_to TDAgent

def agent_message(self, message):
    """A function used to pass information from the agent to the experiment.
    Args:
        message: The message passed to the agent.
    Returns:
        The response (or answer) to the message.
    """
    if message == "get_values":
        return self.values
    else:
        raise Exception("TDAgent.agent_message(): Message not understood!")

In [None]:
# Feel free to make any changes to this cell to debug your code

# The following test checks that the TD update works for a case where the transition
# garners reward -1 and does not lead to a terminal state. This is in a simple two state setting
# where there is only one action. The first state's current value estimate is 0 while the second is 1.
# Note the discount and step size if you are debugging this test.
agent = TDAgent()
policy_list = np.array([[1.], [1.]])
agent.agent_init({"policy": np.array(policy_list), "discount": 0.99, "step_size": 0.1})
agent.values = np.array([0., 1.])
agent.agent_start(0)

reward = -1
next_state = 1
agent.agent_step(reward, next_state)

assert(np.isclose(agent.values[0], -0.001) and np.isclose(agent.values[1], 1.))

# The following test checks that the TD update works for a case where the transition
# garners reward -100 and leads to a terminal state. This is in a simple one state setting
# where there is only one action. The state's current value estimate is 0.
# Note the discount and step size if you are debugging this test.
agent = TDAgent()
policy_list = np.array([[1.]])
agent.agent_init({"policy": np.array(policy_list), "discount": 0.99, "step_size": 0.1})
agent.values = np.array([0.])
agent.agent_start(0)

reward = -100
next_state = 0
agent.agent_end(reward)

assert(np.isclose(agent.values[0], -10))

In [None]:
agent = TDAgent()
policy_list = [np.random.dirichlet(np.ones(10), size=1).squeeze() for _ in range(100)]

for n in range(100):
    gamma = np.random.random()
    alpha = np.random.random()
    agent.agent_init({"policy": np.array(policy_list), "discount": gamma, "step_size": alpha})
    agent.values = np.random.randn(*agent.values.shape)
    state = np.random.randint(100)
    agent.agent_start(state)

    for _ in range(100):
        prev_agent_vals = agent.values.copy()
        reward = np.random.random()
        if np.random.random() > 0.1:
            next_state = np.random.randint(100)
            agent.agent_step(reward, next_state)
            prev_agent_vals[state] = prev_agent_vals[state] + alpha * (reward + gamma * prev_agent_vals[next_state] - prev_agent_vals[state])
            assert(np.allclose(prev_agent_vals, agent.values))
            state = next_state
        else:
            agent.agent_end(reward)
            prev_agent_vals[state] = prev_agent_vals[state] + alpha * (reward - prev_agent_vals[state])
            assert(np.allclose(prev_agent_vals, agent.values))
            break

## Section 3. Policy Evaluation Experiments

Finally, in this last part of the assignment, you will get to see the TD policy evaluation algorithm in action by looking at the estimated values, the per-state value error, and, after the experiment is complete, the Root Mean Squared Value Error (RMSVE) curve vs. episode number, summarizing how the value error changed over time.

The code below runs a single run of an experiment given env_info and agent_info dictionaries. A "manager" object is created for visualizations and is used in part for the autograder. By default, the run lasts 5000 episodes. When true_values_file is specified, the learned value function is compared with the values stored in that file. By default, the learned value function is plotted after every 100 episodes. In addition, when true_values_file is specified, the value error per state and the root mean squared value error are also plotted.
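For reference, a root mean squared value error can be computed as sketched below. This is an assumption about the metric's general form, with every state weighted equally; the Manager class may compute or weight it differently, and `rmsve` is a hypothetical helper that is not part of the assignment files.

```python
import numpy as np


def rmsve(values, true_values):
    """Root mean squared difference between estimated and true state values.

    Assumes every state is weighted equally (an assumption for this sketch).
    """
    values = np.asarray(values, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    return float(np.sqrt(np.mean((values - true_values) ** 2)))


# Toy example with made-up numbers: all-zero estimates against two true values.
print(rmsve([0.0, 0.0], [3.0, 4.0]))
```

As the estimates approach the true values, this quantity approaches 0.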

In [None]:
%matplotlib notebook

def run_experiment(env_info, agent_info, num_episodes=5000, experiment_name=None,
                   plot_freq=100, true_values_file=None, value_error_threshold=1e-8):
    env = CliffWalkEnvironment
    agent = TDAgent
    rl_glue = RLGlue(env, agent)

    rl_glue.rl_init(agent_info, env_info)

    manager = Manager(env_info, agent_info, true_values_file=true_values_file, experiment_name=experiment_name)
    for episode in range(1, num_episodes + 1):
        rl_glue.rl_episode(0) # no step limit
        if episode % plot_freq == 0:
            values = rl_glue.agent.agent_message("get_values")
            manager.visualize(values, episode)

    values = rl_glue.agent.agent_message("get_values")
    return values

The cell below runs a policy evaluation experiment with the deterministic optimal policy that strides just above the cliff. You should observe that the per-state value error and RMSVE curve asymptotically go towards 0. The arrows in the four directions denote the probabilities of taking each action. This experiment is ungraded but should serve as a good test for the later experiments. The true values file provided for this experiment may help with debugging as well.
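Because this policy is deterministic and never enters the cliff, its true values on the path can be worked out by hand: with a discount of 1 and a reward of -1 per step, the value of a state is minus the number of steps left to reach the goal. The sketch below counts those steps for the default 4 x 12 grid, assuming the policy goes up from the start, right along the row above the cliff, and down into the goal. The `action`, `step`, and `steps_to_goal` names are hypothetical and used only here.

```python
height, width = 4, 12
start = (height - 1) * width   # state 36, bottom-left corner
goal = height * width - 1      # state 47, bottom-right corner

# Deterministic action for each state on the path (0: Up, 2: Down, 3: Right).
action = {start: 0, goal - width: 2}
action.update({s: 3 for s in range((height - 2) * width, (height - 1) * width - 1)})

# Change in state index caused by each of the actions used on the path.
step = {0: -width, 2: width, 3: 1}


def steps_to_goal(state):
    """Follow the deterministic policy from `state` and count the steps until the goal."""
    count = 0
    while state != goal:
        state += step[action[state]]
        count += 1
    return count


# With discount 1 and reward -1 per step, the value is minus the remaining step count.
true_values = {s: -steps_to_goal(s) for s in action}
print(true_values[start])
```

States off the path are never visited under this policy when starting from S, so their estimates stay at their initial value of 0.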

In [None]:
env_info = {"grid_height": 4, "grid_width": 12, "seed": 0}

agent_info = {"discount": 1, "step_size": 0.01, "seed": 0}

# The Optimal Policy that strides just along the cliff
policy = np.ones(shape=(env_info['grid_width'] * env_info['grid_height'], 4)) * 0.25

policy[36] = [1, 0, 0, 0]
for i in range(24, 35):
    policy[i] = [0, 0, 0, 1]
policy[35] = [0, 0, 1, 0]

agent_info.update({"policy": policy})

In [None]:
# The Safe Policy
# Hint: Fill in the array below (as done in the previous cell) based on the safe policy illustration
# in the environment diagram. This is the policy that strides as far as possible away from the cliff.
# We call it a "safe" policy because if the environment had any stochasticity, this policy would do a good job of
# keeping the agent from falling into the cliff (in contrast to the optimal policy shown before).

# build a uniform random policy
policy = np.ones(shape=(env_info['grid_width'] * env_info['grid_height'], 4)) * 0.25

# build an example environment
env = CliffWalkEnvironment()
env.env_init(env_info)

# modify the uniform random policy
# your code here
# Action encoding: [0: Up, 1: Left, 2: Down, 3: Right]

# left column (except the top-left corner): go up
for x in range(1, env.grid_h):
    policy[env.state((x, 0))] = [1, 0, 0, 0]

# top row (except the top-right corner): go right
for y in range(env.grid_w - 1):
    policy[env.state((0, y))] = [0, 0, 0, 1]

# right column (except the goal in the bottom-right corner): go down
for x in range(env.grid_h - 1):
    policy[env.state((x, env.grid_w - 1))] = [0, 0, 1, 0]

In [None]:
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.

width = env_info['grid_width']
height = env_info['grid_height']

# left side of space
for x in range(1, height):
    s = env.state((x, 0))

    # go up
    assert np.all(policy[s] == [1, 0, 0, 0])

# top of space
for y in range(width - 1):
    s = env.state((0, y))

    # go right
    assert np.all(policy[s] == [0, 0, 0, 1])

# right side of space
for x in range(height - 1):
    s = env.state((x, width - 1))

    # go down
    assert np.all(policy[s] == [0, 0, 1, 0])

In [None]:
agent_info.update({"policy": policy})
v = run_experiment(env_info, agent_info, experiment_name="Policy Evaluation On Safe Policy", num_episodes=5000, plot_freq=500)

In [None]:
# A Near Optimal Stochastic Policy
# Now, we try a stochastic policy that deviates a little from the optimal policy seen above.
# This means we can get different results due to randomness.
# We will thus average the value function estimates we get over multiple runs.
# This can take some time, up to about 5 minutes from previous testing.
# NOTE: The autograder will compare . Re-run this cell upon making any changes.

env_info = {"grid_height": 4, "grid_width": 12}
agent_info = {"discount": 1, "step_size": 0.01}

policy = np.ones(shape=(env_info['grid_width'] * env_info['grid_height'], 4)) * 0.25
policy[36] = [0.9, 0.1/3., 0.1/3., 0.1/3.]
for i in range(24, 35):
    policy[i] = [0.1/3., 0.1/3., 0.1/3., 0.9]
policy[35] = [0.1/3., 0.1/3., 0.9, 0.1/3.]
agent_info.update({"policy": policy})
agent_info.update({"step_size": 0.01})

In [None]:
env_info['seed'] = 0
agent_info['seed'] = 0
v = run_experiment(env_info, agent_info,
               experiment_name="Policy Evaluation On Optimal Policy",
               num_episodes=5000, plot_freq=100)

## Wrapping Up
Congratulations, you have completed assignment 2! In this assignment, we investigated a very useful concept for sample-based online learning: temporal difference. We particularly looked at the prediction problem where the goal is to find the value function corresponding to a given policy. In the next assignment, by learning the action-value function instead of the state-value function, you will get to see how temporal difference learning can be used in control as well.