# RL Framework

- Environment
- Agent

<div>
<img src="img/rl_base.png" width="500"/>
</div>

## Environment
### GridWorld
GridWorld is a 2D rectangular grid of size NxM. It has an **agent** starting in one of the grid squares and possible **rewards** in other grid squares.

In our initial setup, the GridWorld is a 3x4 grid with the agent starting in the bottom-left corner. The world contains one blocking state, one positive reward, and one negative reward.

The agent's **goal** is to receive a positive reward by moving up, down, left or right. The game ends when a reward is received.

<div>
<img src="img/grid_example.png" width="250"/>
</div>

Complete the **TODO**s in the code, marked with `...`, and discuss some questions, marked with **TODO** in the text.

### Step 1 (TODO): Define GridWorld Parameters
Define variables for the dimensions of the GridWorld, the initial position of the agent, the list of blocking state positions, and the dictionary of rewards with key: position and value: reward.

Use the following values:

- dimensions of the GridWorld: (3, 4)
- initial position of the agent: (2, 0)
- list of blocking state positions: [(1, 1)]
- dictionary of rewards: {(0,3): 1, (1,3): -1}

In [1]:
# Dimensions of the GridWorld.
world_shape = (3, 4)
# Initial position of the agent.
agent_init_pos = (2, 0)

# TODO: Define the remaining variables.
# List of blocking state positions.
blocking_states = [(1, 1)]
# Dictionary of rewards with key: position and value: reward.
reward_states = {(0,3): 1, (1,3): -1}
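
A quick sanity check on these parameters can catch typos early. The sketch below is only an illustration: `in_bounds` is a hypothetical helper, and the rule that the agent should not start on a blocking state or a reward is an assumption, not something the exercise requires.

```python
# Parameters as defined in Step 1.
world_shape = (3, 4)
agent_init_pos = (2, 0)
blocking_states = [(1, 1)]
reward_states = {(0, 3): 1, (1, 3): -1}

def in_bounds(pos, shape):
    # Hypothetical helper: True if pos is a valid (row, column) index
    # for a grid of the given shape.
    return 0 <= pos[0] < shape[0] and 0 <= pos[1] < shape[1]

# Every position we defined should lie inside the grid.
all_positions = [agent_init_pos] + blocking_states + list(reward_states)
assert all(in_bounds(p, world_shape) for p in all_positions)

# Assumption: the agent starts on an empty square.
assert agent_init_pos not in blocking_states
assert agent_init_pos not in reward_states
```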

### Step 2 (TODO): Visualization
For now, only a human agent will interact with our environment, so we need a visualization that a human can read.

The environment is represented by a 2D array. We will label
- empty states with 0,
- the agent with 4,
- blocking states with 8,
- rewards with their respective values.

Your task is to define the function `render_environment` that takes in the world shape, agent position, blocking states, and reward states, and returns a 2D array representing the environment. Follow the instructions in the code comments.

In [2]:
import numpy as np

legend = {
    'empty': 0,
    'agent': 4,
    'blocking': 8
}

def render_environment(world_shape, agent_pos, blocking_states, reward_states):
    # Initialize empty states.
    states = np.ones(world_shape) * legend['empty']
    
    # Add agent.
    # We can index states with agent_pos because it is a tuple of ints.
    # You cannot index a dictionary with numpy arrays.
    # Make sure that every time you call this function, the indexing values are tuples of ints.
    states[agent_pos] = legend['agent']
    
    # TODO: Add blocking states.
    # Iterate over blocking_states, and set the states value according to the legend.
    # blocking_states is a list of tuples of ints.
    for blocking_state in blocking_states:
        states[blocking_state] = legend['blocking']
    
    # TODO: Add rewards.
    # Iterate over reward_states dictionary's items.
    for reward_state, reward in reward_states.items():
        states[reward_state] = reward

    return states

Now, call the `render_environment` function with the GridWorld parameters defined earlier, and print the result.

In [3]:
render = render_environment(world_shape, agent_init_pos, blocking_states, reward_states)
print(render)

[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]
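
The numeric array can be hard to read at a glance. As a rough sketch, the codes can be mapped to characters instead. The helper `render_as_text` and the chosen symbols are assumptions made for this illustration and are not part of the exercise.

```python
import numpy as np

# Same codes as the legend above; the symbols are illustrative choices.
symbols = {0: '.', 4: 'A', 8: '#'}

def render_as_text(states):
    # Hypothetical helper: map each cell code to a character.
    # Any value not in the legend is treated as a reward and shown by its sign.
    lines = []
    for row in states:
        chars = [symbols.get(int(v), '+' if v > 0 else '-') for v in row]
        lines.append(' '.join(chars))
    return '\n'.join(lines)

# The array printed above, written out so that this block runs on its own.
states = np.array([[0., 0., 0., 1.],
                   [0., 8., 0., -1.],
                   [4., 0., 0., 0.]])
print(render_as_text(states))
```

Note that a reward equal to 0, 4, or 8 would collide with the legend codes, in the numeric array as well as in this text view.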


### Step 3 (TODO): Actions
Possible actions in the GridWorld:
- up
- down
- right
- left

The agent is blocked by the bounds of the GridWorld and by blocking states.

Define a dictionary `possible_actions` with keys: 'up', 'down', 'right', 'left', and values: numpy arrays that represent the corresponding movements.

In [4]:
# TODO: Define the possible_actions dictionary.
possible_actions = {
    'up': np.array([-1, 0]),
    'right': np.array([0, 1]),
    'down': np.array([1, 0]),
    'left': np.array([0, -1])
}

Now, define the function `move_agent` that takes in the agent's position, the action to perform, the shape of the world, and the blocking states, and returns the new position of the agent. Follow the instructions in the code comments.

In [5]:
def move_agent(agent_pos, action, world_shape, blocking_states):
    # Move agent.
    # We cast agent_pos to numpy array, so that we can do maths properly.
    new_agent_pos = np.array(agent_pos) + possible_actions[action]
    
    # TODO: Check if new position is blocked, by checking whether the new agent position
    # is in a blocking state. If it is, return the unchanged agent_pos. 
    if tuple(new_agent_pos) in blocking_states:
        return agent_pos
    
    # TODO: Check if new position is out of bounds. If it is, return the unchanged agent_pos.
    # Possible coordinates are in the [0, world_shape[0] - 1] and [0, world_shape[1] - 1] intervals.
    if (new_agent_pos < (0, 0)).any() or (new_agent_pos >= world_shape).any():
        return agent_pos
    
    # We cast new_agent_pos to a tuple so that it can be used for indexing during rendering.
    return tuple(new_agent_pos)
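
The two checks can also be looked at in isolation. The following sketch uses a hypothetical helper, `is_valid`, that is not part of the exercise; it relies on numpy broadcasting to compare each coordinate with its own bound.

```python
import numpy as np

world_shape = (3, 4)
blocking_states = [(1, 1)]

def is_valid(pos, world_shape, blocking_states):
    # Hypothetical helper combining both checks from move_agent.
    pos = np.array(pos)
    # Broadcasting compares the row with the row bound and the column with the column bound.
    out_of_bounds = (pos < 0).any() or (pos >= world_shape).any()
    # Convert to a tuple of plain ints before the membership test.
    blocked = tuple(int(c) for c in pos) in blocking_states
    return not (out_of_bounds or blocked)

print(is_valid((2, 0), world_shape, blocking_states))   # True: inside and free
print(is_valid((3, 0), world_shape, blocking_states))   # False: row 3 does not exist
print(is_valid((0, -1), world_shape, blocking_states))  # False: negative column
print(is_valid((1, 1), world_shape, blocking_states))   # False: blocking state
```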

Test the `move_agent` function with some actions and visualize the results.

In [6]:
# Test some actions
actions = ['down', 'up', 'right', 'left', 'up', 'up']
new_agent_pos = agent_init_pos
render = render_environment(world_shape, new_agent_pos, blocking_states, reward_states)
print(render)
for action in actions:
    print(f'going {action}')
    new_agent_pos = move_agent(new_agent_pos, action, world_shape, blocking_states)
    render = render_environment(world_shape, new_agent_pos, blocking_states, reward_states)
    print(render)

[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]
going down
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]
going up
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
going right
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
going left
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
going up
[[ 4.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
going up
[[ 4.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]


### Step 4 (TODO): Implement the GridWorld Class
We now have all the building blocks for the environment.

Your task is to implement the `GridWorld` class, which encapsulates all the information about the environment (dimensions, agent position, and so on) and provides two methods to interact with it:
- <code>reset</code> returns the GridWorld to its initial state,
- <code>step</code> advances the environment by executing an action, returning the new observation, calculating the reward, and deciding whether the game has ended.

Follow the instructions in the code comments.

In [7]:
class GridWorld:
    def __init__(self, world_shape, agent_init_pos, blocking_states, reward_states):
        # TODO: Initialize the class attributes.
        self.world_shape = world_shape
        self.agent_init_pos = agent_init_pos
        self.blocking_states = blocking_states
        self.reward_states = reward_states
        
        # TODO: Initialize the agent's current position to its initial position.
        self.agent_current_pos = self.agent_init_pos
        
    def reset(self):
        # TODO: Reset agent position.
        self.agent_current_pos = self.agent_init_pos
            
        # TODO: Render initial observation.
        # Use the function render_environment from before.
        observation = render_environment(self.world_shape, self.agent_current_pos, self.blocking_states, self.reward_states)
        
        return observation
        
    def step(self, action):
        # TODO: Execute action and update agent_current_pos.
        # Use the function move_agent from before.
        self.agent_current_pos = move_agent(self.agent_current_pos, action, self.world_shape, self.blocking_states)
        
        # Check if there is any reward and whether the game ended. If the game has 
        # ended, set done flag to True.
        if self.agent_current_pos in self.reward_states.keys():
            done = True
            reward = self.reward_states[self.agent_current_pos]
        else:
            done = False
            reward = 0
            
        # TODO: Render observation
        # Use the function render_environment again.
        observation = render_environment(self.world_shape, self.agent_current_pos, self.blocking_states, self.reward_states)
        return observation, reward, done

## Agent
We use Python's built-in <code>input()</code> function to receive actions from the human agent.

### Step 5: Define Function to Receive Input
We define the function `receive_action_input` that asks the user for input and returns the corresponding action.

In [8]:
def receive_action_input():
    # Read input and translate to action.
    action = input('move with w, a, s, d; exit with q - ')
    if action == 'w':
        return 'up'
    elif action == 'a':
        return 'left'
    elif action == 's':
        return 'down'
    elif action == 'd':
        return 'right'
    # Additional action to exit from GridWorld
    elif action == 'q':
        return 'exit'
    # Handle other cases.
    else:
        print(f'Invalid input - {action}')
        return receive_action_input()

Test the `receive_action_input` function.

In [9]:
receive_action_input()

move with w, a, s, d; exit with q - q


'exit'
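
The same key translation can be expressed as a dictionary lookup, which is easy to test without typing anything. This is only a sketch of an alternative: `key_to_action` and `translate_key` are hypothetical names, and the prompt and retry logic of `receive_action_input` are deliberately left out.

```python
# Keyboard keys mapped to action names, matching the prompt above.
key_to_action = {'w': 'up', 'a': 'left', 's': 'down', 'd': 'right', 'q': 'exit'}

def translate_key(key):
    # Hypothetical helper: returns the action name, or None for an unknown key.
    return key_to_action.get(key)

print(translate_key('w'))  # up
print(translate_key('q'))  # exit
print(translate_key('x'))  # None
```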

Now, we define the function `act` that presents the observation to the human agent and obtains an action from them.

In [10]:
def act(observation):
    # Present observation to human agent.
    print(observation)
    
    # Obtain action from human agent.
    action = receive_action_input()
    return action

## Agent-Environment Interaction

### Step 6 (TODO): Implement Agent-Environment Interaction
Your task is to implement the agent-environment interaction loop. In each iteration of the loop, get an action from the agent, execute it in the environment, and handle the consequences. Follow the instructions in the code comments.

In [11]:
# Initialize environment.
env = GridWorld(world_shape, agent_init_pos, blocking_states, reward_states)

# Reset environment and receive initial observation.
obs = env.reset()
# We loop until the agent exits the game
while True:
    # TODO: Get action from agent.
    # Use the act function.
    action = act(obs)
    
    # Exit loop if agent's action is exit.
    if action == 'exit':
        break
    
    # TODO: Execute action in environment.
    # Use the env object's step method. It takes the action as a parameter and returns
    # the observation, reward and a done flag.
    obs, reward, done = env.step(action)
    
    # Print action and reward.
    print(f'went {action}, received reward {reward}')
    
    # Reset environment if game ended
    if done:
        print('============= game over =============')
        obs = env.reset()

[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]
move with w, a, s, d; exit with q - w
went up, received reward 0
[[ 0.  0.  0.  1.]
 [ 4.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
move with w, a, s, d; exit with q - w
went up, received reward 0
[[ 4.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
move with w, a, s, d; exit with q - d
went right, received reward 0
[[ 0.  4.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
move with w, a, s, d; exit with q - d
went right, received reward 0
[[ 0.  0.  4.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]
move with w, a, s, d; exit with q - d
went right, received reward 1
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]
move with w, a, s, d; exit with q - q
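
The loop does not depend on where the actions come from. As a sketch, the human can be replaced by a fixed list of actions, which makes the interaction reproducible. To keep the block runnable on its own, `step` below is a condensed stand-in for `GridWorld.step` that works on plain tuples. Both `step` and `run_scripted_episode` are assumptions made for this illustration.

```python
# Movement vectors as (row, column) offsets, as in possible_actions.
moves = {'up': (-1, 0), 'right': (0, 1), 'down': (1, 0), 'left': (0, -1)}

def step(pos, action, world_shape, blocking_states, reward_states):
    # Condensed stand-in for GridWorld.step: returns (new_pos, reward, done).
    new_pos = (pos[0] + moves[action][0], pos[1] + moves[action][1])
    inside = 0 <= new_pos[0] < world_shape[0] and 0 <= new_pos[1] < world_shape[1]
    if not inside or new_pos in blocking_states:
        new_pos = pos  # the move is blocked, so the agent stays in place
    reward = reward_states.get(new_pos, 0)
    return new_pos, reward, new_pos in reward_states

def run_scripted_episode(actions, start, world_shape, blocking_states, reward_states):
    # Replay the given actions until the list is exhausted or the game ends.
    pos, rewards = start, []
    for action in actions:
        pos, reward, done = step(pos, action, world_shape, blocking_states, reward_states)
        rewards.append(reward)
        if done:
            break
    return pos, rewards

# The same key presses as in the game above: w, w, d, d, d.
final_pos, rewards = run_scripted_episode(
    ['up', 'up', 'right', 'right', 'right'],
    start=(2, 0), world_shape=(3, 4),
    blocking_states=[(1, 1)], reward_states={(0, 3): 1, (1, 3): -1})
print(final_pos, rewards)  # (0, 3) [0, 0, 0, 0, 1]
```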


### Step 7 (TODO): Value
Define a function `calculate_values` that calculates the cumulative future reward for each step from a list of rewards. Follow the instructions in the code comments.
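
One way to write this down, using notation that is an assumption of this sketch rather than part of the exercise: let $r_0, \dots, r_{T-1}$ be the rewards of a game with $T$ steps, and let $V_t$ be the value at step $t$. The cumulative future reward and its recursive form are then:

$$
V_t = \sum_{k=t+1}^{T-1} r_k, \qquad V_{T-1} = 0, \qquad V_t = r_{t+1} + V_{t+1} \quad (t < T-1)
$$

For the rewards $[0, 1, 2, 3, 4]$ this gives $V_4 = 0$, $V_3 = 4$, $V_2 = 3 + 4 = 7$, $V_1 = 2 + 7 = 9$, and $V_0 = 1 + 9 = 10$.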

In [12]:
def calculate_values(rewards):
    # Empty list to store the values.
    values = []
    # Iterate over the indices of rewards, but skip the last one because there is no future reward
    for i in range(len(rewards) - 1):
        # TODO: Calculate the sum of future rewards.
        value = np.sum(rewards[i + 1:])
        # TODO: Append the calculated value to values.
        values.append(value)
    # TODO: Append 0 as future reward for the last step.
    values.append(0)
    return values

# TODO (Optional): calculate values using numpy's cumulative sum (cumsum) function.
def calculate_values_cumsum(rewards):
    # TODO: Shift the rewards by one step and append 0: after the last step there is no further reward.
    shifted_rewards = rewards[1:] + [0]
    # TODO: convert list of shifted rewards to numpy array
    rewards_np = np.array(shifted_rewards)
    # TODO: Reverse rewards.
    rewards_np = rewards_np[::-1]
    # TODO: Compute cumulative sum.
    values = np.cumsum(rewards_np)
    # TODO: Reverse values.
    values = values[::-1]
    return values

# TODO (Optional): calculate values using the recursive definition.
def calculate_values_recursive(rewards):
    if len(rewards) == 1:
        # TODO: If last step, there is no future reward, return a list containing only a 0
        return [0]
    else:
        # TODO: Compute values for the next step.
        next_values = calculate_values_recursive(rewards[1:])
        # TODO: Use the recursive definition to compute current value.
        value = rewards[1] + next_values[0] 
        # Return current value and the next values as a single list
        return [value] + next_values

# TODO (Optional): calculate values from sum
def calculate_values_from_sum(rewards):
    # Empty list to store the values.
    values = []
    # TODO: Compute the value before the first step, i.e. the sum of all rewards.
    previous_value = np.sum(rewards)
    # TODO: Iterate over rewards.
    for reward in rewards:
        # TODO: Subtract current reward from previous value.
        previous_value = previous_value - reward
        # TODO: Append the calculated value to values.
        values.append(previous_value)
    return values

Test the `calculate_values` function with some rewards.

In [13]:
rewards = [0, 1, 2, 3, 4]
values = calculate_values(rewards)
print(values)

[10, 9, 7, 4, 0]


(Optional) Test the other value calculation functions.

In [15]:
values = calculate_values_cumsum(rewards)
print(values)
values = calculate_values_recursive(rewards)
print(values)
values = calculate_values_from_sum(rewards)
print(values)

[10  9  7  4  0]
[10, 9, 7, 4, 0]
[10, 9, 7, 4, 0]
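
A further variant, sketched here under the hypothetical name `calculate_values_suffix`, avoids shifting the list: it first computes the suffix sums, which still include the reward of the current step, and then subtracts that reward again.

```python
import numpy as np

def calculate_values_suffix(rewards):
    # Hypothetical variant, not part of the exercise.
    rewards_np = np.array(rewards)
    # Suffix sums: element i is the sum of rewards[i:], current reward included.
    suffix_sums = np.cumsum(rewards_np[::-1])[::-1]
    # Remove the current reward to keep only the future rewards.
    return suffix_sums - rewards_np

print(calculate_values_suffix([0, 1, 2, 3, 4]))  # [10  9  7  4  0]
```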


### Step 8 (TODO): Calculate values from a game 
Now, play a game and collect the received rewards. Record the agent's positions, too. Follow the instructions in the code comments.

In [16]:
# TODO: Initialize environment.
env = GridWorld(world_shape, agent_init_pos, blocking_states, reward_states)

# TODO: Reset environment and receive initial observation.
obs = env.reset()

# List to collect rewards from game.
rewards = []
# List to collect agent position for later
agent_positions = []
# Add initial agent position to agent_positions
agent_positions.append(env.agent_current_pos)

# We loop until the game ends or the agent exits.
done = False
while not done:
    # TODO: Get action from agent.
    action = act(obs)
    
    # TODO: Exit loop if agent's action is exit.
    if action == 'exit':
        break
    
    # TODO: Execute action in environment.
    obs, reward, done = env.step(action)
    # TODO: Store received reward.
    rewards.append(reward)
    # TODO: Store the agent's new position.
    agent_positions.append(env.agent_current_pos)
    
    # Print action and reward.
    print(f'went {action}, received reward {reward}')
    print(done)

[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 4.  0.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
False
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  4.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 0
False
[[ 0.  0.  0.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  4.  0.]]


move with w, a, s, d; exit with q -  w


went up, received reward 0
False
[[ 0.  0.  0.  1.]
 [ 0.  8.  4. -1.]
 [ 0.  0.  0.  0.]]


move with w, a, s, d; exit with q -  w


went up, received reward 0
False
[[ 0.  0.  4.  1.]
 [ 0.  8.  0. -1.]
 [ 0.  0.  0.  0.]]


move with w, a, s, d; exit with q -  d


went right, received reward 1
True


Now, compute the values from the recorded rewards. Print rewards, values and agent positions and their lengths.

In [17]:
values = calculate_values(rewards)
print('rewards', len(rewards), rewards)
print('values', len(values), values)
print('agent_positions', len(agent_positions), agent_positions)

rewards 5 [0, 0, 0, 0, 1]
values 5 [1, 1, 1, 1, 0]
agent_positions 6 [(2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 3)]


Note that `agent_positions` has one more element than `values` and `rewards`. **TODO** Discuss the reason below:
#### Discussion

### Step 9: Visualize obtained values
The function `visualize_values` visualizes the values obtained from a game.

In [18]:
def visualize_values(values, positions, world_shape):
    # Create array containing zeros
    value_vis = np.zeros(world_shape)
    # Set each position to its corresponding value
    for p, v in zip(positions, values):
        value_vis[tuple(p)] = v
    return value_vis

Run `visualize_values` on the results of the previous game. Keep in mind that `agent_positions` has one more element than `values`, so pass either `agent_positions[1:]` or `agent_positions[:-1]` to `visualize_values`. Which one is correct? What would happen if we visited the same position multiple times during the game? **TODO** Discuss below:

#### Discussion


In [19]:
# TODO: Visualize the values obtained from the game
visualize_values(values, agent_positions[1:], env.world_shape)

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 1., 0.]])