# Artificial Intelligence
## Assignment 6 – Reinforcement Learning

### Personal details

* **Name:** Ahmed Jabir Zuhayr

In [1]:
import random
import time

In Assignment 1 we solved the problem of finding the shortest path from point A to point B. Now we will expand on this by placing our agent in a dynamic environment where it must navigate through obstacles and reach a goal while avoiding traps.

### 6.1 – Dynamic Environments

![escaperoom.png](escaperoom.png)

<small>Image generated with ChatGPT</small>

You will be working with an environment that can be conceptualized as an escape room where a humanoid robot is tasked with finding its way out as quickly as possible. The robot can move in four directions (up, down, left, right) and can carry a key that allows it to open doors.

We can repurpose the `Grid` class from Assignment 1 to represent this environment:

In [2]:
from ex6_utils import Grid

grid = Grid(xlim=13+2, ylim=9+2)
grid.generate_nodes(carrying_key=False)
grid.generate_nodes(carrying_key=True)
grid.get_initial().current = True
grid.visualize(agent=None, delay=0)

■ ■ ■ ■ ■ ■ ■ ■ ■ ■ G ■ ■ ■ ■
■ . @ . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ~ ~ ~ ■ . . . . . ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ■ — ■ ■ ■ — ■ ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ . . . | . T . . . T . ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ ■ . ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ . ■
■ K . . . . . . . . . . . . ■
■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■


The symbols used in the visualization are as follows:

- `@`: current position of the agent
- `G`: goal
- `—` / `|`: door
- `■`: wall
- `K`: key
- `~`: lava
- `T`: trap
- `.`: empty node

Doors are locked until the agent picks up the key, which happens automatically when the agent reaches the key's square. Lava leads to instant death and a restart. Traps turn into lava when stepped on and are non-deterministic: there is a 50/50 chance that the agent falls into the lava or manages to jump to a random neighbouring square to save itself.
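
One way to read the trap rule above is sketched below. This is only an illustration of the described 50/50 behaviour, not the logic inside `ex6_utils`; the function `resolve_trap` and its `neighbours` argument are hypothetical names introduced for this example.

```python
import random

def resolve_trap(neighbours, rng=random):
    """Hypothetical helper: decide what happens when the agent steps on a trap.

    Returns None if the agent falls into the lava, otherwise the
    neighbouring square it manages to jump to.
    """
    if rng.random() < 0.5:
        return None  # fell into the lava
    return rng.choice(neighbours)  # escaped to a random neighbour

# Simulate many trap encounters with a seeded generator.
rng = random.Random(0)
outcomes = [resolve_trap(["up", "down", "left", "right"], rng) for _ in range(1000)]
deaths = outcomes.count(None)
print(f"{deaths} deaths out of {len(outcomes)} trap encounters")
```

Because the outcome is random, the agent cannot learn a fixed consequence for stepping on a trap; it can only learn an expected value over many visits.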

In this problem our agent has to make decisions based on incomplete information. The agent only knows its current position and whether it is holding a key, as well as the properties of the current node (e.g. whether it is standing in lava). This type of problem is well-suited for **reinforcement learning**, where the agent explores its environment and learns through trial and error.

### 6.2 – Q-Learning

The basic idea in reinforcement learning is to learn a **policy** that maximizes the expected cumulative reward for an agent by interacting with its environment. Actions performed yield *rewards* that can be either positive or negative, and the agent learns to associate actions with their outcomes.

**Q-learning** is a form of reinforcement learning that uses a table of *action values*, or **Q-values**, with one entry per state-action pair, to estimate the expected utility of taking a given action in a given state. The values are updated using the following equation:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
$$

where
- $s$  is the current state
- $a$  is the action taken
- $r$  is the immediate reward received
- $s'$ is the next state
- $a'$ is an action taken in the next state
- $\alpha$ is the *learning rate*
- $\gamma$ is the *discount factor*

Q-learning is **model-free**, meaning it neither knows nor learns the underlying transition model of the environment ("with probability $P$, taking an action $a$ in state $s$ leads to state $s'$"). All the agent learns is the value of being in a given state and taking a specific action, based on the rewards it has previously received. This allows the agent to learn a policy for always taking actions that maximize its expected reward.
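
To make the update rule above concrete, here is a single update worked through in Python. The numbers are illustrative assumptions rather than values from this environment, and `q_update` is a hypothetical helper, not part of the assignment code.

```python
def q_update(q_sa, reward, max_next_q, alpha, gamma):
    """Return the new Q(s, a) after one Q-learning step."""
    td_target = reward + gamma * max_next_q  # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_sa              # how far the old estimate was off
    return q_sa + alpha * td_error           # move a fraction alpha toward the target

# Assumed values: Q(s, a) = 2.0, r = -1.0, max_a' Q(s', a') = 10.0
new_q = q_update(q_sa=2.0, reward=-1.0, max_next_q=10.0, alpha=0.5, gamma=0.95)
print(new_q)  # approximately 5.25
```

With $\alpha = 0.5$ the estimate moves halfway from the old value 2.0 toward the target 8.5. A smaller $\alpha$ would take smaller, more cautious steps.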

**Task 1: Reward function (0.2 pt)**

Let's start by defining a class to represent the Q-learning agent. Your task is to fill in suitable values in the `get_reward` function (replace the zeros where appropriate), which returns the feedback the agent receives for ending up in a given state.

In [4]:
class QLearningAgent():
    def __init__(self, grid):
        self.grid = grid
        self.current_node = grid.get_initial()
        self.previous_node = None
        self.previous_action = None
        self.has_key = False
        self.total_rewards = 0
    
    def move(self, direction):
        self.previous_node = self.current_node
        self.previous_action = direction

        next_node = self.current_node.get_neighbor(direction, self.has_key)

        if not next_node or next_node.blocked:
            # Invalid move
            return self.get_reward(next_node)

        # Trap logic
        if next_node.trap:
            next_node.lava = True
            while random.random() < 0.5 and next_node.blocked:
                next_node = next_node.get_neighbor("random", self.has_key)

        # Key pickup
        if next_node.has_key:
            self.has_key = True

        # Door logic
        if next_node.locked:
            if self.has_key:
                next_node.locked = False
            else:
                # Invalid move
                return self.get_reward(next_node)

        # Valid move
        self.current_node.current = False
        next_node.current = True
        self.current_node = next_node
        self.grid.current_node = next_node
        return self.get_reward(next_node)
    
    def get_reward(self, node):
        # ---------- YOUR CODE HERE ----------- #
        reward = 0
        if node is None or node.blocked or node.locked:
            reward += -1
        elif node.goal:
            reward += 100
        elif node.lava:
            reward += -100
        elif node.trap:
            reward += -10
        elif node.has_key:
            reward += 0
        return reward
        # ---------- YOUR CODE HERE ----------- #

    def reset(self, grid):
        self.grid = grid
        self.has_key = False
        self.current_node = self.grid.get_initial()
        self.previous_node = None  # clear stale state from the previous episode
        self.previous_action = None
        self.total_rewards = 0

**Task 2: Q-Learning Update (0.5 pt)**

Now let's bring in learning. Your task is to fill in the `update_q_value` function to update the Q-values based on the agent's experiences. You may experiment with different values for the learning rate and the discount factor, but reasonable defaults are $\alpha = 0.5$ and $\gamma = 0.95$.

(Hint: the provided equation updates the Q value of a state based on the reward received and the maximum Q value of the next state. You'll need to switch perspectives in implementing this: update the *previous state's* Q value based on the *current state*.)

(Hint: this is not necessary, but if you get stuck because you don't understand why the Q-table is defined as it is, you might want to jump ahead and look at the discussion section at the end of the notebook.)

In [7]:
def initialize_q_table(grid):
    """
    Initialize a Q-table with all values set to zero.

    The Q-table is a dictionary of dictionaries, where the keys of the outer dictionary identify each node,
    and the inner dictionaries map actions ("up", "down", "left", "right") to Q-values.
    
    (You can ignore the loop over two sets of nodes for now.)
    """
    q_table = {}
    for nodes in [grid.nodes, grid.nodes2]:
        for row in nodes:
            for node in row:
                q_table[node.id] = {"up": 0, "down": 0, "left": 0, "right": 0}
    return q_table

def update_q_value(agent, q_table, reward):
    """
    Update the Q-value for the previous state based on the action taken and the received reward.
    """
    previous = agent.previous_node.id
    current = agent.current_node.id
    action = agent.previous_action
    # ---------- YOUR CODE HERE ----------- #
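    alpha = 0.1  # learning rate
    gamma = 0.9  # discount factor
    current_q = q_table[previous][action]
    # Best value reachable from the state the agent ended up in
    max_next_q = max(q_table[current].values())
    # Move the old estimate toward the target r + gamma * max_a' Q(s', a')
    q_table[previous][action] = current_q + alpha * (reward + gamma * max_next_q - current_q)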
    # ---------- YOUR CODE HERE ----------- #

def get_action_greedy(agent, q_table):
    """
    Returns the best action according to the current policy.
    """
    Q = q_table[agent.current_node.id]
    Q_max = max(Q.values())
    best_actions = [action for action in ["up", "down", "left", "right"] if Q[action] == Q_max]
    return random.choice(best_actions) # break ties randomly

Run the following cell once to train the agent. You may adjust the visualization parameters as you see fit.

In [9]:
# ---------- YOUR CODE HERE (OPTIONAL) ----------- #
visualize = True # toggle visualization
increment = 1 # visualize every increment-th iteration
delay = 0.01 # adjust as needed
# ---------- YOUR CODE HERE (OPTIONAL) ----------- #

random.seed(42548)
agent = QLearningAgent(grid)
q_table = initialize_q_table(grid)

for i in range(100):
    print(f"Running iteration {i}")
    while True:
        action = get_action_greedy(agent, q_table)

        if action:
            reward = agent.move(action)
            agent.total_rewards += reward
            update_q_value(agent, q_table, reward)

        if visualize and i % increment == 0:
            grid.visualize(agent)
            print(f"Current iteration: {i}")
            print(f"Total rewards: {agent.total_rewards}")
            # ---------- YOUR CODE HERE (OPTIONAL) ----------- #
            # Optional debugging can be added here.
            # You may uncomment the lines below to inspect
            # the Q-values of the current and previous nodes.

            # Q = q_table[agent.current_node.id]
            # print("\nCurrent node Q-values:")
            # print(f"   {Q['up']:.2f}")
            # print(f"{Q['left']:.2f} {Q['right']:.2f}")
            # print(f"   {Q['down']:.2f}")
            # prev_Q = q_table[agent.previous_node.id]
            # print("\nPrevious node Q-values:")
            # print(f"   {prev_Q['up']:.2f}")
            # print(f"{prev_Q['left']:.2f} {prev_Q['right']:.2f}")
            # print(f"   {prev_Q['down']:.2f}")
            # ---------- YOUR CODE HERE (OPTIONAL) ----------- #
            time.sleep(delay)

        if agent.current_node.goal or agent.current_node.lava:
            break

    agent.reset(grid)
    grid.reset()

■ ■ ■ ■ ■ ■ ■ ■ ■ ■ @ ■ ■ ■ ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ~ ~ ~ ■ . . . . . ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ■ — ■ ■ ■ — ■ ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ . . . | . T . . . T . ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ ■ . ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ . ■
■ . . . . . . . . . . . . . ■
■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■
Current iteration: 99
Total rewards: 100


Run the following cell to test your agent (can be done multiple times). It should have converged to a stable policy within 100 iterations. If the agent gets stuck in an infinite loop, there is something wrong with your implementation or reward function.

In [12]:
while True:
    action = get_action_greedy(agent, q_table)
    if action:
        reward = agent.move(action)
        agent.total_rewards += reward
        grid.visualize(agent)
        print(f"Total rewards: {agent.total_rewards}")
        time.sleep(0.1)
    if agent.current_node.goal or agent.current_node.lava:
        break

agent.reset(grid)
grid.reset()

■ ■ ■ ■ ■ ■ ■ ■ ■ ■ @ ■ ■ ■ ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ~ ~ ~ ■ . . . . . ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ■ — ■ ■ ■ — ■ ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ . . . | . T . . . T . ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ ■ . ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ . ■
■ . . . . . . . . . . . . . ■
■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■
Total rewards: 100


The learned policy should now guide the agent through the room in the middle, as this is the shortest path to the goal. However, the agent might still make a suboptimal detour near the goal.* The reason for this behavior lies in the exploration strategy being used.

The `get_action_greedy` function always selects the action with the highest Q-value for the current state. This strategy maximizes *exploitation* of the current best-known action but involves no *exploration* of other, potentially better actions. As a result, the agent may get stuck in a local optimum and fail to discover better paths.
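
The effect can be seen in a much smaller setting than the escape room. The sketch below uses a hypothetical two-action problem with fixed rewards, and it breaks ties by picking the first action (unlike `get_action_greedy`, which breaks ties randomly) so that the outcome is reproducible. The name `run_greedy` is an assumption made for this example.

```python
def run_greedy(rewards, steps=20, alpha=0.5):
    """Purely greedy learner on a stateless problem with one reward per action."""
    q = [0.0] * len(rewards)
    counts = [0] * len(rewards)
    for _ in range(steps):
        # Greedy choice; max() returns the first maximal index on ties.
        action = max(range(len(q)), key=lambda a: q[a])
        counts[action] += 1
        q[action] += alpha * (rewards[action] - q[action])
    return q, counts

# Action 1 pays five times as much, but the greedy learner never tries it.
q, counts = run_greedy([1.0, 5.0])
print(counts)  # [20, 0]
```

After the first step, action 0 has a positive estimate while action 1 is still at zero, so the greedy rule keeps choosing action 0 and the better option is never sampled.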

<small>*This is generally non-deterministic, but has been enforced through a fixed seed for the random number generator. If you managed to define the reward function so that you already found the optimal policy, congratulations! You still need to do the final task.</small>

**Task 3: Epsilon-Greedy Exploration (0.3 pt)**

To encourage exploration, we can use a method similar to the one in Assignment 2 by implementing an **epsilon-greedy** strategy. This involves selecting a random action with a small probability $\epsilon$ (e.g. 0.1) and the best-known action with probability $1 - \epsilon$. Optionally, we could start with a higher value and decay $\epsilon$ over time to reduce exploration as the agent becomes more confident in its learned policy, but we will not do that here.

Your task is to implement the `get_action_epsilon_greedy` function to select actions according to this strategy. You may utilize the `get_action_greedy` function and the `random` module.
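
The optional decay mentioned above is not used in this assignment, but for reference it could look roughly like the sketch below. The exponential schedule, the function name `decayed_epsilon`, and the parameter values are all assumptions chosen for illustration; other schedules (for example linear decay) are equally valid.

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.95):
    """Exponentially decay epsilon from `start` toward a floor of `end`."""
    return max(end, start * decay ** episode)

# Exploration is high at first and settles at the floor later on.
for episode in (0, 10, 50, 99):
    print(episode, round(decayed_epsilon(episode), 3))
```

In a training loop, the value returned for the current episode would simply be passed as the `epsilon` argument of the action-selection function.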

In [13]:
def get_action_epsilon_greedy(agent, epsilon, q_table):
    # ---------- YOUR CODE HERE ----------- #
    if random.random() < epsilon:
        return random.choice(["up", "down", "left", "right"])
    else:
        return get_action_greedy(agent, q_table)
    # ---------- YOUR CODE HERE ----------- #

The code for training and testing your agent is the same as before.

In [14]:
# ---------- YOUR CODE HERE (OPTIONAL) ----------- #
visualize = True # toggle visualization
increment = 1 # visualize every increment-th iteration
delay = 0.00 # adjust as needed
# ---------- YOUR CODE HERE (OPTIONAL) ----------- #

random.seed(42548)
agent = QLearningAgent(grid)
q_table = initialize_q_table(grid)
epsilon = 0.1

for i in range(100):
    while True:
        action = get_action_epsilon_greedy(agent, epsilon, q_table)

        if action:
            reward = agent.move(action)
            agent.total_rewards += reward
            update_q_value(agent, q_table, reward)

        if visualize and i % increment == 0:
            grid.visualize(agent)
            print(f"Current iteration: {i}")
            print(f"Total rewards: {agent.total_rewards}")
            # ---------- YOUR CODE HERE (OPTIONAL) ----------- #
            # OPTIONAL DEBUGGING CAN BE ADDED HERE
            
            # Q = q_table[agent.current_node.id]
            # print("\nCurrent node Q-values:")
            # print(f"   {Q['up']:.2f}")
            # print(f"{Q['left']:.2f} {Q['right']:.2f}")
            # print(f"   {Q['down']:.2f}")
            # prev_Q = q_table[agent.previous_node.id]
            # print("\nPrevious node Q-values:")
            # print(f"   {prev_Q['up']:.2f}")
            # print(f"{prev_Q['left']:.2f} {prev_Q['right']:.2f}")
            # print(f"   {prev_Q['down']:.2f}")
            # ---------- YOUR CODE HERE (OPTIONAL) ----------- #
            time.sleep(delay)

        if agent.current_node.goal or agent.current_node.lava:
            break

    agent.reset(grid)
    grid.reset()

■ ■ ■ ■ ■ ■ ■ ■ ■ ■ @ ■ ■ ■ ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ~ ~ ~ ■ . . . . . ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ■ — ■ ■ ■ — ■ ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ . . . | . T . . . T . ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ ■ . ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ . ■
■ . . . . . . . . . . . . . ■
■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■
Current iteration: 99
Total rewards: 100


In [15]:
while True:
    action = get_action_greedy(agent, q_table)
    if action:
        reward = agent.move(action)
        agent.total_rewards += reward
        grid.visualize(agent)
        print(f"Total rewards: {agent.total_rewards}")
        time.sleep(0.1)
    if agent.current_node.goal or agent.current_node.lava:
        break

agent.reset(grid)
grid.reset()

■ ■ ■ ■ ■ ■ ■ ■ ■ ■ @ ■ ■ ■ ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ~ ~ ~ ■ . . . . . ■
■ . . . ■ ~ ~ ~ ■ . . . ■ . ■
■ . . . ■ ■ — ■ ■ ■ — ■ ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ . . . | . T . . . T . ■ . ■
■ . . . ■ . . . T . . . ■ . ■
■ ■ . ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ . ■
■ . . . . . . . . . . . . . ■
■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ ■
Total rewards: 100


The agent should now learn the optimal policy within 100 iterations. Notice that we used `get_action_epsilon_greedy` for training but `get_action_greedy` for testing, as we want to exploit the learned policy during testing rather than continue exploring.

### EXTRA: Discussion

**Something we neglected to mention before is why the Q-table is initialized from two different sets of nodes. The reason is somewhat intricate: we're actually working with two different grids that represent different states of the environment. Nodes can be thought of as (x, y, k) tuples, where x and y are the coordinates and k is a boolean indicating whether the key has been collected. One grid is the original grid, in which the doors are locked and the key has not yet been collected (k = 0), while the other represents all states after the key has been collected (k = 1) and the doors are unlocked. The underlying logic has been abstracted away, but after picking up the key the agent is effectively transported to a whole new environment with its own set of states and corresponding Q-values.**

**Why is this necessary? What would happen if we used only a single grid and provided a positive reward when landing on the key square and subsequently 1) didn't remove it from the grid, or 2) removed it from the grid?**

Q-learning assumes the Markov property: the value of being in a state depends only on that state, not on the history of how the agent got there. If the same (x, y) coordinate can mean two fundamentally different things depending on whether the key has been picked up, the Q-values for that coordinate become unreliable, because they average over two incompatible situations.

**Case 1: the key stays on the grid after collection.** The key square would yield a positive reward every time the agent steps on it, whether or not the key has already been collected. The agent would learn to revisit the key square repeatedly to farm rewards, never progressing toward the door or the goal. The Q-values around the key square would be artificially inflated, pulling the agent into a pointless loop.

**Case 2: the key is removed from the grid after collection.** This sounds reasonable, but now the same grid node (x, y) means something different before and after the key is picked up. A single Q-table would blend these two meanings into one entry. Early in training, Q-values are learned with the key present; later visits to the same coordinate update the same Q-values even though the strategic context has completely changed. The agent cannot learn a coherent policy, because the optimal action at (x, y) genuinely differs depending on whether k = 0 or k = 1. For example, the correct path to the goal may require going through the now-unlocked door, which was not accessible before.

**Why two grids solve this.** By treating the state as (x, y, k), each node in each grid has its own dedicated Q-values that are only ever updated in the correct context. The agent in the k = 0 world learns to navigate toward the key; the agent in the k = 1 world learns to navigate through the unlocked door toward the goal. These are genuinely different problems with different optimal policies, and keeping them separate ensures that the Markov property holds and the Q-values remain meaningful.
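
The (x, y, k) view can be sketched as a Q-table keyed directly by such tuples. This is a simplified illustration, not how `ex6_utils` stores its nodes (the notebook keys the table by `node.id`); the names `ACTIONS` and `make_q_table` are assumptions, and the 15 x 11 size mirrors the grid dimensions used earlier.

```python
ACTIONS = ("up", "down", "left", "right")

def make_q_table(width, height):
    """Build a zero-initialized Q-table with one entry per (x, y, k) state."""
    return {
        (x, y, k): {action: 0.0 for action in ACTIONS}
        for x in range(width)
        for y in range(height)
        for k in (False, True)  # k: has the key been collected?
    }

q_table = make_q_table(width=15, height=11)

# Learning something at a square without the key...
q_table[(1, 9, False)]["left"] = 1.0
# ...leaves the same square with the key untouched.
print(q_table[(1, 9, True)]["left"])  # 0.0
print(len(q_table))  # 330 states: 15 * 11 positions, each with k = False or True
```

Doubling the state space this way is the price of keeping each Q-value tied to exactly one situation.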

## Aftermath

Please provide short answers to the following questions:

**1. Did you experience any issues or find anything particularly confusing?**

No

**2. Is there anything you would like to see improved in the assignment?**

No

### Submission

1. Make sure you have completed all tasks and filled in your personal details at the top of this notebook.
2. Ensure all the code runs without errors: restart the kernel and run all cells in order.
3. Submit *only* this notebook (`ex6.ipynb`) on Moodle.