## COSC-4117EL: Gridworld - Q-Learning (with epsilon-greedy)


This sample shows the implementation of Q-learning (a model-free reinforcement learning algorithm) applied to a simple grid world. The agent can take four possible actions at each state: move up, down, left, or right. The Q-values represent the expected cumulative reward the agent can obtain by taking a specific action in a specific state.

**Q-learning with epsilon-greedy exploration:**

For each state (i, j), the algorithm iterates over the four possible actions.
For each action, it calculates the resulting state (ni, nj), clamped to the grid so that a move off the edge leaves the agent in place.
With probability epsilon the agent explores by switching to a random action; otherwise it keeps the action currently being swept. (A standard epsilon-greedy policy would instead exploit by choosing the action with the highest Q-value for the state.)
The Q-value for the current state and action is then updated using the Q-learning update rule.
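
The two ingredients described above, action selection and the update rule, can be sketched as standalone helpers. This is a minimal illustration rather than the notebook's own code: the names `epsilon_greedy` and `q_update` are hypothetical, and the selection shown is the standard epsilon-greedy policy (greedy with probability 1 - epsilon), not the sweep over all actions used in the cell below.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def q_update(q_values, state, action_index, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q-value."""
    # Sample = immediate reward + discounted best Q-value of the next state
    sample = reward + gamma * np.max(q_values[next_state])
    i, j = state
    q_values[i, j, action_index] = (1 - alpha) * q_values[i, j, action_index] + alpha * sample
    return q_values[i, j, action_index]

rng = np.random.default_rng(0)
q = np.zeros((4, 3, 4))
q[0, 1] = 10.0  # Q-values of the goal state, fixed at 10 as in the notebook

# With epsilon=0 the choice is always greedy; all values at (1,1) tie at 0, so argmax returns index 0 (up)
a = epsilon_greedy(q[1, 1], epsilon=0.0, rng=rng)
new_q = q_update(q, (1, 1), a, reward=0.0, next_state=(0, 1), alpha=0.5, gamma=1.0)
print(a, new_q)  # 0 5.0
```

With alpha = 0.5, a single update moves the Q-value halfway from its old value (0) toward the sample (10).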

**Q-value Clipping:**

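The states (0,1) and (0,2) are treated as terminal: before any update, the code sets the Q-value of every action to 10 at (0,1) and to -10 at (0,2), then skips to the next action. The clipping step after the update (at most 10 for (0,1), at least -10 for (0,2)) is therefore a safeguard that is never reached for these two states in the code as written.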
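
As a small illustration of what the clipping would do if the Q-values at these two states ever drifted past their bounds, the sketch below applies the same cap and floor to a whole row at once with `np.minimum` and `np.maximum`. The starting numbers are made up for the example and do not come from a run of the notebook.

```python
import numpy as np

q = np.zeros((4, 3, 4))
# Made-up values that overshoot the bounds, for illustration only
q[0, 1] = [12.0, 9.0, 10.5, 3.0]
q[0, 2] = [-12.0, -9.0, -10.5, 0.0]

q[0, 1] = np.minimum(q[0, 1], 10.0)   # cap every action's Q-value at 10
q[0, 2] = np.maximum(q[0, 2], -10.0)  # floor every action's Q-value at -10

print(q[0, 1])
print(q[0, 2])
```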

The Q-values represent the agent's learned values for how good each action is for each state, considering future rewards.

When you run this code, it prints the Q-values after the specified number of iterations. Because exploration is random, the exact numbers vary between runs. Increasing the number of iterations or adjusting the hyperparameters (alpha, gamma, epsilon) may bring the Q-values closer to their optimal values, depending on the problem and reward structure.
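
To see why more iterations help, consider a simplified case: a state with reward 0 whose best successor value stays fixed at 10 (for example, the state directly below the goal), with alpha = 0.5 and gamma = 1.0 as in the notebook. Writing $Q_k$ for the Q-value after $k$ updates and starting from $Q_0 = 0$, the update rule gives:

$$
\begin{aligned}
Q_{k+1} &= (1-\alpha)\,Q_k + \alpha\,\bigl[\,0 + \gamma \cdot 10\,\bigr] = 0.5\,Q_k + 5 \\
Q_1 &= 5, \qquad Q_2 = 7.5, \qquad Q_3 = 8.75 \\
Q_k &= 10\,\bigl(1 - 0.5^{k}\bigr), \qquad Q_{10} = 10\,\bigl(1 - 2^{-10}\bigr) \approx 9.99
\end{aligned}
$$

The gap to the limit halves with every update, so values close to 10 after ten iterations are what this simplified model predicts. States farther from the goal lag behind because their successors' values are themselves still rising.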

# `TODO: can you add a living reward of -1 for each step the agent takes?`
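
One possible way to approach this, offered as a sketch rather than the intended solution: build a second reward grid that charges the living reward in every non-terminal state. The names `living_reward`, `terminal`, and `step_rewards` are hypothetical and do not appear in the notebook; the choice to leave the two terminal rewards unchanged is also an assumption.

```python
import numpy as np

# Hypothetical name: living_reward is not defined in the notebook
living_reward = -1.0

rewards = np.array([
    [0, 10, -10],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
], dtype=float)

# Mark the two terminal states so that they keep their original rewards (an assumption)
terminal = np.zeros(rewards.shape, dtype=bool)
terminal[0, 1] = True
terminal[0, 2] = True

# Every non-terminal state pays the living reward on each step taken from it
step_rewards = np.where(terminal, rewards, rewards + living_reward)
print(step_rewards)
```

Under this sketch, the update would read `step_rewards[i, j]` in place of `rewards[i, j]` when computing the sample. With a cost per step, states farther from (0,1) should settle at lower values than states next to it.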

In [None]:
import numpy as np

# Grid size
n, m = 4, 3

# Rewards grid setup
rewards = np.array([
    [0, 10, -10],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0]
])

# Initialize Q-values to zeros (Q-table)
q_values = np.zeros((n, m, 4))

# Possible actions that the agent can take: move up, down, left, or right.
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

# Discount factor gamma determines the agent's consideration for future rewards.
gamma = 1.0

# Learning rate alpha for Q-learning
alpha = 0.5

# Number of iterations
iterations = 10

# Exploration rate (epsilon) for epsilon-greedy strategy
epsilon = 0.1

# Perform Q-learning with epsilon-greedy exploration
for it in range(iterations):
    #print(f"Iteration {it + 1}")
    for i in range(n):
        for j in range(m):
            for action_index, action in enumerate(actions):
                # (0,1) and (0,2) are terminal: fix their Q-values to the terminal reward and skip the update
                if (i, j) == (0, 1):
                    q_values[i, j, action_index] = 10
                    continue
                elif (i, j) == (0, 2):
                    q_values[i, j, action_index] = -10
                    continue

                # Explore with probability epsilon (random action); otherwise keep the action being swept
                if np.random.rand() < epsilon:
                    action_index = np.random.randint(0, len(actions))
                    action = actions[action_index]

                # Compute the next state after the exploration step, so that it matches
                # the action whose Q-value is updated below
                ni, nj = i + action[0], j + action[1]

                # Ensure the agent stays within the grid
                ni = max(0, min(n - 1, ni))
                nj = max(0, min(m - 1, nj))

                # Q-learning update rule:
                # 1. Calculate the sample using the reward and the maximum Q-value of the next state
                sample = rewards[i, j] + gamma * np.max(q_values[ni, nj])

                # 2. Update Q-value using the Q-learning update rule (weighted average)
                q_values[i, j, action_index] = (1 - alpha) * q_values[i, j, action_index] + alpha * sample

                # Clip the Q-value to a maximum of 10 for (0,1) and a minimum of -10 for (0,2).
                # Not reached for these states because of the `continue` above; kept as a safeguard.
                if i == 0 and j == 1:
                    q_values[i, j, action_index] = min(q_values[i, j, action_index], 10.0)
                elif i == 0 and j == 2:
                    q_values[i, j, action_index] = max(q_values[i, j, action_index], -10.0)

# Display final Q-values and corresponding actions for all states and paths
print(f"Final Q-values after {iterations} iteration(s)")
for i in range(n):
    for j in range(m):
        for action_index, action in enumerate(actions):
            print(f"State ({i},{j}): {action} Q={q_values[i, j, action_index]:.2f}")
        print("-----------------")


Final Q-values after 10 iteration(s)
State (0,0): (-1, 0) Q=9.73
State (0,0): (1, 0) Q=9.43
State (0,0): (0, -1) Q=9.84
State (0,0): (0, 1) Q=9.98
-----------------
State (0,1): (-1, 0) Q=10.00
State (0,1): (1, 0) Q=10.00
State (0,1): (0, -1) Q=10.00
State (0,1): (0, 1) Q=10.00
-----------------
State (0,2): (-1, 0) Q=-10.00
State (0,2): (1, 0) Q=-10.00
State (0,2): (0, -1) Q=-10.00
State (0,2): (0, 1) Q=-10.00
-----------------
State (1,0): (-1, 0) Q=9.77
State (1,0): (1, 0) Q=8.95
State (1,0): (0, -1) Q=9.72
State (1,0): (0, 1) Q=9.79
-----------------
State (1,1): (-1, 0) Q=9.98
State (1,1): (1, 0) Q=9.63
State (1,1): (0, -1) Q=9.65
State (1,1): (0, 1) Q=9.53
-----------------
State (1,2): (-1, 0) Q=-9.98
State (1,2): (1, 0) Q=8.73
State (1,2): (0, -1) Q=9.85
State (1,2): (0, 1) Q=9.69
-----------------
State (2,0): (-1, 0) Q=9.61
State (2,0): (1, 0) Q=8.25
State (2,0): (0, -1) Q=9.28
State (2,0): (0, 1) Q=9.62
-----------------
State (2,1): (-1, 0) Q=9.92
State (2,1): (1, 0) Q=9.23