# The Agent-Environment Interaction

In this exercise, you will implement the interaction of a reinforcement learning agent with its environment. We will use the gridworld environment from the second lecture. You will find a description of the environment below, along with two pieces of relevant material from the lectures: the agent-environment interface and the Q-learning algorithm.

1. Create an agent that chooses actions randomly with this environment. 

2. Create an agent that uses Q-learning. You can use initial Q values of 0, a stochasticity parameter for the $\epsilon$-greedy policy function of $\epsilon=0.05$, and a learning rate of $\alpha = 0.1$, but feel free to experiment with other settings of these three parameters.

3. Plot the mean total reward (i.e. the undiscounted return) obtained by the two agents for each episode. This kind of graph is called a **learning curve**, and it gives us an idea of how our agent's performance changes during training.


## The agent-environment interface

<img src="img/agent-environment.png" style="width: 500px;" align="left"/> 

<br><br><br>

The interaction of the agent with its environment starts at decision stage $t=0$ with the observation of the current state $s_0$. (Notice that there is no reward at this initial stage.) The agent then chooses an action to execute at decision stage $t=1$. The environment responds by changing its state to $s_1$ and returning the numerical reward signal $r_1$.
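The loop described above can be sketched as follows. This is a minimal illustration rather than the lecture's reference code: the toy `CountdownEnv`, the `CoinFlipAgent`, the `run_episode` helper, and their method names (`reset`, `step`, `choose_action`) are all hypothetical and exist only to keep the example self-contained.

```python
import random

class CountdownEnv:
    """Hypothetical toy environment: the state counts down from 3 to 0."""

    def reset(self):
        self.state = 3
        return self.state              # s_0 is observed without any reward

    def step(self, action):
        self.state -= 1                # the environment changes its state ...
        reward = -1.0                  # ... and returns a numerical reward
        done = self.state == 0         # state 0 is terminal
        return self.state, reward, done

class CoinFlipAgent:
    """Hypothetical agent that ignores the state and acts randomly."""

    def choose_action(self, state):
        return random.choice([0, 1])

def run_episode(env, agent):
    """Run one episode and return the total reward and the visited states."""
    state = env.reset()
    total_reward = 0.0
    trajectory = [state]
    done = False
    while not done:
        action = agent.choose_action(state)      # agent acts on the current state
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        trajectory.append(state)
    return total_reward, trajectory

total_reward, trajectory = run_episode(CountdownEnv(), CoinFlipAgent())
print(total_reward, trajectory)
```

The same loop structure applies to the gridworld: only the environment's transition and reward logic and the agent's action choice change.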


## The environment: Navigation in a gridworld

<img src="img/gold.png" style="width: 250px;" align="left"/>

The agent has four possible actions in each state (grid square): west, north, south, and east. The actions are unreliable. They move the agent in the intended direction with probability 0.8, and with probability 0.2, they move the agent in a random other direction. If the direction of movement is blocked, the agent remains in the same grid square. The initial state of the agent is one of the five grid squares at the bottom, selected randomly. The grid squares with the gold and the bomb are **terminal states**. If the agent finds itself in one of these squares, the episode ends. Then a new episode begins with the agent at a randomly selected initial state.
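The unreliable actions can be illustrated with a short sampling sketch. The function name `sample_executed_action`, the `slip_probability` parameter, and the action ordering are assumptions made for this example; the probabilities 0.8 and 0.2 come from the description above.

```python
import numpy as np

ACTIONS = ["WEST", "NORTH", "SOUTH", "EAST"]  # ordering is an assumption

def sample_executed_action(intended, rng, slip_probability=0.2):
    """Return the index of the action that is actually executed.

    With probability 1 - slip_probability the intended action is kept;
    otherwise one of the other three actions is chosen uniformly at random.
    """
    if rng.random() < slip_probability:
        others = [a for a in range(len(ACTIONS)) if a != intended]
        return int(rng.choice(others))
    return intended

rng = np.random.default_rng(0)
samples = [sample_executed_action(1, rng) for _ in range(10_000)]
frac_intended = float(np.mean(np.array(samples) == 1))
print(f"fraction executed as intended: {frac_intended:.3f}")
```

Over many samples the fraction of intended moves should come out close to 0.8, with the remaining moves spread across the other three directions.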

You will use a reinforcement learning algorithm to compute the best policy for finding the gold with as few steps as possible while avoiding the bomb. For this, we will use the following reward function: $-1$ for each navigation action, an additional $+10$ for finding the gold, and an additional $-10$ for hitting the bomb. For example, the immediate reward for transitioning into the square with the gold is $-1 + 10 = +9$. Do not use discounting (that is, set $\gamma=1$).
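As a worked example of this reward function, the sketch below computes immediate rewards and the undiscounted return of a short episode. The square labels (`"empty"`, `"gold"`, `"bomb"`) and the function names are hypothetical and used only for illustration.

```python
STEP_COST = -1.0      # charged for every navigation action
GOLD_BONUS = 10.0     # additional reward for finding the gold
BOMB_PENALTY = -10.0  # additional reward for hitting the bomb

def immediate_reward(next_square):
    """Reward for moving into a square labelled 'empty', 'gold', or 'bomb'."""
    reward = STEP_COST
    if next_square == "gold":
        reward += GOLD_BONUS
    elif next_square == "bomb":
        reward += BOMB_PENALTY
    return reward

def undiscounted_return(rewards):
    """With gamma = 1 the return is simply the sum of the rewards."""
    return sum(rewards)

# Three ordinary moves followed by a move onto the gold
episode_rewards = [immediate_reward(s) for s in ["empty", "empty", "empty", "gold"]]
print(episode_rewards, undiscounted_return(episode_rewards))
```

Here the episode collects $-1 - 1 - 1 + 9 = 6$, while stepping onto the bomb yields an immediate reward of $-1 - 10 = -11$.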

## Q-Learning
For your reference, the pseudocode for the Q-Learning algorithm is reproduced below (Reinforcement Learning, Sutton & Barto, 2018, Section 6.5, p. 131).
<img src="img/q.png" style="width: 720px;"/>
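The two core pieces of the pseudocode, $\epsilon$-greedy action selection and the update $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$, might be written as in the sketch below. The function names, the tabular layout of `Q` (one row per state, one column per action), and the state indices in the usage lines are assumptions for illustration, not part of the exercise's required interface.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))  # break ties between equal Q values randomly

def q_learning_update(Q, state, action, reward, next_state, terminal,
                      alpha=0.1, gamma=1.0):
    """Apply one tabular Q-learning update to Q in place."""
    # The value of a terminal state is zero, so the target is just the reward.
    target = reward if terminal else reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

rng = np.random.default_rng(0)
Q = np.zeros((25, 4))  # 25 grid squares x 4 actions, initial Q values of 0

# Hypothetical transition: from state 22, action 1 leads onto the gold (state 23)
q_learning_update(Q, state=22, action=1, reward=9.0, next_state=23, terminal=True)
print(Q[22, 1])
print(epsilon_greedy(Q[22], epsilon=0.05, rng=rng))
```

Breaking ties randomly matters early in training, when all Q values are still zero and a plain `argmax` would always return the first action.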


## Example of a learning curve

<img src="img/lc_example.png" style="width: 550px;" align="left"/>

<br><br><br><br>

This sample learning curve shows the reward obtained by a Q-learning agent across 500 episodes. Do not try to replicate this exact curve: it was computed using a different environment from the one described here.
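One way to obtain the data for such a curve is to record the total reward of every episode and average it over several independent runs. The sketch below assumes this averaging-over-runs reading of "mean total reward"; the helper names and the numbers in `returns` are made up purely for illustration.

```python
import numpy as np

def mean_learning_curve(returns_per_run):
    """Average the per-episode return over independent runs.

    returns_per_run has shape (num_runs, num_episodes).
    """
    return np.asarray(returns_per_run, dtype=float).mean(axis=0)

def moving_average(values, window):
    """Optional smoothing: mean over a sliding window of episodes."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Made-up returns for 3 runs of 4 episodes each (not real results)
returns = [[-40, -20, -10, 0],
           [-60, -30, -10, 2],
           [-50, -10, -10, 4]]
curve = mean_learning_curve(returns)
smoothed = moving_average(curve, 2)
print(curve)
print(smoothed)
```

The resulting `curve` array can then be passed to a plotting function such as `matplotlib.pyplot.plot`, with the episode index on the horizontal axis.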

In [2]:
# YOUR CODE HERE!
import numpy as np

### `Gridworld` Class


In [13]:
class Gridworld:
    def __init__(self):
        self.num_rows = 5
        self.num_cols = 5
        self.num_cells = self.num_cols * self.num_rows
        self.random_move_probability = 0.2

        #Starting position for our agent
        self.agent_position = np.array([0,np.random.randint(0,5)])

        #Terminal states positions:
        self.bomb_positions = [[3,3]]
        self.gold_positions = [[4,3]]
        self.terminal_states = np.concatenate([self.bomb_positions, self.gold_positions])

        #Rewards
        self.rewards = np.zeros([self.num_rows,self.num_cols])
        for b_position in self.bomb_positions:
            self.rewards[tuple(b_position)] = -10
        for g_position in self.gold_positions:
            self.rewards[tuple(g_position)] = 10 

        #Possible actions
        self.actions = ["UP", "RIGHT", "DOWN", "LEFT"]
        self.num_actions = len(self.actions)

    def get_available_actions(self):
        return self.actions
        
    def make_step(self, action_index):
        #Random step
        if np.random.uniform(0,1) < self.random_move_probability:
            action_indices = np.arange(self.num_actions,dtype=int)
            action_indices = np.delete(action_indices,action_index)
            action_index = np.random.choice(action_indices,1)[0]
        action = self.actions[action_index]

        #Determine new position and check if it hits a wall
        old_position = self.agent_position
        new_position = self.agent_position
        if action == "UP":
            if old_position[0] > 0:
                new_position=old_position-[1,0]
        elif action == "RIGHT":
            if old_position[1] < 4:
                new_position=old_position+[0,1]
        elif action == "DOWN":
            if old_position[0] < 4:
                new_position=old_position+[1,0]
        elif action == "LEFT":
            if old_position[1] > 0:
                new_position=old_position-[0,1]
        else:
            raise ValueError("Action was mis-specified.")
            
        #Update environment state
        self.agent_position= new_position

        #rewards
        reward = self.rewards[tuple(self.agent_position)]
        reward -= 1

        return reward, new_position

    def reset(self):
        self.agent_position = np.array([0,np.random.randint(0,5)])
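
When checking whether the agent has reached a terminal state, note that Python's `in` operator on a NumPy array compares element-wise and returns `True` if any single element matches, so a position can be reported as terminal when only one coordinate coincides. The sketch below shows the pitfall and one possible exact-match check; the helper name `is_terminal` is hypothetical and not part of the class above.

```python
import numpy as np

# Bomb and gold positions, as stored in Gridworld.terminal_states
terminal_states = np.array([[3, 3], [4, 3]])

def is_terminal(position, terminal_states):
    """True only if position equals one complete row of terminal_states."""
    return bool(np.any(np.all(terminal_states == position, axis=1)))

position = np.array([0, 3])                            # shares only the column with the terminals
naive_result = position in terminal_states             # element-wise "any" comparison
exact_result = is_terminal(position, terminal_states)  # full-row comparison
print(naive_result, exact_result)
print(is_terminal(np.array([4, 3]), terminal_states))
```

The naive membership test flags `[0, 3]` as terminal even though it is an ordinary square, whereas the row-wise comparison only accepts `[3, 3]` and `[4, 3]`.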

    

### `RandomAgent` Class

In [4]:
class RandomAgent():
    def choose_action(self, available_actions):
        number_of_actions = len(available_actions)
        random_action_index = np.random.randint(0, number_of_actions)
        return random_action_index

### Testing code

In [5]:
env = Gridworld()
agent = RandomAgent()

In [6]:
print("Current position of the agent =", env.agent_position)
available_actions = env.get_available_actions()
print("Available_actions =", available_actions)
chosen_action = agent.choose_action(available_actions)
print("Randomly chosen action =", chosen_action)
reward, new_position = env.make_step(chosen_action)
print("Reward obtained =", reward)
print("Current position of the agent =", env.agent_position)
print(env.rewards)

Current position of the agent = [0 2]
Available_actions = ['UP', 'RIGHT', 'DOWN', 'LEFT']
Randomly chosen action = 2
Reward obtained = -1.0
Current position of the agent = [1 2]
[[  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0. -10.   0.]
 [  0.   0.   0.  10.   0.]]


In [12]:
reward=np.zeros(100)
env = Gridworld()
agent = RandomAgent()
for i in range(100):
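    # `not in` on a NumPy array compares element-wise, so test for an exact row match instead
    while not any(np.array_equal(env.agent_position, t) for t in env.terminal_states):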
        print("Current position of the agent =", env.agent_position)
        print(env.agent_position)
        print(env.terminal_states)
        available_actions = env.get_available_actions()
        print("Available_actions =", available_actions)
        chosen_action = agent.choose_action(available_actions)
        print("Randomly chosen action =", chosen_action)
        reward[i] += env.make_step(chosen_action)[0]
        print("Reward obtained =", reward[i])
        print("Current position of the agent =", env.agent_position)
    env.reset()
    print(i)

Current position of the agent = [0 1]
[0 1]
[[3 3]
 [4 3]]
Available_actions = ['UP', 'RIGHT', 'DOWN', 'LEFT']
Randomly chosen action = 3
Reward obtained = -1.0
Current position of the agent = [0 0]
Current position of the agent = [0 0]
[0 0]
[[3 3]
 [4 3]]
Available_actions = ['UP', 'RIGHT', 'DOWN', 'LEFT']
Randomly chosen action = 0
Reward obtained = -2.0
Current position of the agent = [0 0]
Current position of the agent = [0 0]
[0 0]
[[3 3]
 [4 3]]
Available_actions = ['UP', 'RIGHT', 'DOWN', 'LEFT']
Randomly chosen action = 0
Reward obtained = -3.0
Current position of the agent = [0 0]
Current position of the agent = [0 0]
[0 0]
[[3 3]
 [4 3]]
Available_actions = ['UP', 'RIGHT', 'DOWN', 'LEFT']
Randomly chosen action = 2
Reward obtained = -4.0
Current position of the agent = [1 0]
Current position of the agent = [1 0]
[1 0]
[[3 3]
 [4 3]]
Available_actions = ['UP', 'RIGHT', 'DOWN', 'LEFT']
Randomly chosen action = 0
Reward obtained = -5.0
Current position of the agent = [0 0]
Curre