# CPSC-5616EL: Gridworld - Q-Learning (fixed path)


This implementation uses the Q-learning update rule, but applies it only along a fixed path (a fixed policy), [(3, 1), (2, 1), (1, 1), (0, 1)], to demonstrate how the Q-values converge. Training is capped at 20 iterations (episodes), each of which ends at the terminal state (0, 1).
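
To make the update rule concrete, here is a minimal sketch that replays the first three updates of state (1, 1) by hand and then computes the values the path should approach in the limit. The helper `q_update` and the names `history` and `targets` are hypothetical and exist only for this illustration; the sketch assumes the same settings as the notebook (alpha = 0.5, gamma = 1.0, living reward -1, terminal value 10).

```python
# Hypothetical helper for illustration; it is not part of the notebook code.
def q_update(q_old, reward, max_q_next, alpha=0.5, gamma=1.0):
    """One Q-learning update: blend the old estimate with the new sample."""
    sample = reward + gamma * max_q_next
    return (1 - alpha) * q_old + alpha * sample

# Replay the "up" action in state (1, 1) for three iterations.
# In the first iteration (1, 1) is updated before (0, 1) has been set to 10,
# so the best next-state value is still 0; afterwards it is 10.
q = 0.0
history = []
for max_q_next in (0.0, 10.0, 10.0):
    q = q_update(q, reward=-1.0, max_q_next=max_q_next)
    history.append(q)
print(history)  # [-0.5, 4.25, 6.625]

# Fixed point of the update along the path: Q = reward + gamma * (best next value).
# With gamma = 1 each step back from the terminal state costs one living reward.
targets = {}
value_next = 10.0
for state in [(1, 1), (2, 1), (3, 1)]:
    value_next = -1.0 + 1.0 * value_next
    targets[state] = value_next
print(targets)  # {(1, 1): 9.0, (2, 1): 8.0, (3, 1): 7.0}
```

The replayed values 4.25 and 6.625 correspond to the entries printed for state (1, 1) in iterations 2 and 3. With alpha = 0.5 each update closes half the remaining gap to the current sample, so 20 iterations bring the Q-values close to these targets without reaching them exactly.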

In [None]:
import numpy as np

# Grid size
n, m = 4, 3

# Reward grid: living reward of -1 everywhere, except +10 at the terminal state (0, 1) and -10 at (0, 2)
rewards = np.array([
    [-1, 10, -10],
    [-1, -1, -1],
    [-1, -1, -1],
    [-1, -1, -1]
])

# Initialize Q-values to zeros (Q-table)
q_values = np.zeros((n, m, 4))

# Possible actions as (row, column) offsets, in this order: up, down, left, right.
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

# Discount factor gamma determines the agent's consideration for future rewards.
gamma = 1.0

# Learning rate alpha for Q-learning
alpha = 0.5

# Define the fixed path the agent will follow during training
path = [(3, 1), (2, 1), (1, 1), (0, 1)]

# Number of iterations
iterations = 20

for it in range(iterations):
    print(f"Iteration {it + 1}")

    for i, j in path:
        action_index = 0  # Agent's action index (moving up)
        action = actions[action_index]

        # Calculate the next state (ni, nj) based on the action
        ni, nj = i + action[0], j + action[1]

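        # States (0,1) and (0,2) have no successor, so their Q-value is set directly
        # to the reward and the update below is skipped. Note that (0,2) is not on
        # the fixed path, so the second branch never runs in this example.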
        if (i, j) == (0, 1):
            q_values[i, j, action_index] = 10
            continue
        elif (i, j) == (0, 2):
            q_values[i, j, action_index] = -10
            continue

        # Q-learning update rule:
        # 1. Calculate the sample using the reward and the maximum Q-value of the next state
        sample = rewards[i, j] + gamma * np.max(q_values[ni, nj])

        # 2. Update Q-value using the Q-learning update rule (weighted average)
        q_values[i, j, action_index] = (1 - alpha) * q_values[i, j, action_index] + alpha * sample

        # Clip the Q-value to a maximum of 10
        q_values[i, j, action_index] = min(q_values[i, j, action_index], 10.0)

    # Print the greedy action (argmax over actions) and its Q-value for each state on the path
    print("Q-values for the path:")
    for i, j in path:
        action_index = np.argmax(q_values[i, j])
        max_action = actions[action_index]
        print(f"State ({i},{j}): {max_action} Q={q_values[i, j, action_index]:.2f}")

    print("-----------------------------")

# Display the final greedy action and its Q-value for each state on the path
print("Final Q-values for the path:")
for i, j in path:
    action_index = np.argmax(q_values[i, j])
    max_action = actions[action_index]
    print(f"State ({i},{j}): {max_action} Q={q_values[i, j, action_index]:.2f}")


Iteration 1
Q-values for the path:
State (3,1): (1, 0) Q=0.00
State (2,1): (1, 0) Q=0.00
State (1,1): (1, 0) Q=0.00
State (0,1): (-1, 0) Q=10.00
-----------------------------
Iteration 2
Q-values for the path:
State (3,1): (1, 0) Q=0.00
State (2,1): (1, 0) Q=0.00
State (1,1): (-1, 0) Q=4.25
State (0,1): (-1, 0) Q=10.00
-----------------------------
Iteration 3
Q-values for the path:
State (3,1): (1, 0) Q=0.00
State (2,1): (-1, 0) Q=1.25
State (1,1): (-1, 0) Q=6.62
State (0,1): (-1, 0) Q=10.00
-----------------------------
Iteration 4
Q-values for the path:
State (3,1): (1, 0) Q=0.00
State (2,1): (-1, 0) Q=3.44
State (1,1): (-1, 0) Q=7.81
State (0,1): (-1, 0) Q=10.00
-----------------------------
Iteration 5
Q-values for the path:
State (3,1): (-1, 0) Q=1.06
State (2,1): (-1, 0) Q=5.12
State (1,1): (-1, 0) Q=8.41
State (0,1): (-1, 0) Q=10.00
-----------------------------
Iteration 6
Q-values for the path:
State (3,1): (-1, 0) Q=2.59
State (2,1): (-1, 0) Q=6.27
State (1,1): (-1, 0) Q=8.7