# Q-Learning in a 2×2 GridWorld

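This notebook demonstrates how Q-values are computed in a simple deterministic **2×2 gridworld** using Q-value iteration: a model-based dynamic-programming method that applies the same Bellman update as Q-learning, but sweeps over a known transition and reward model instead of learning from sampled experience.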
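
As a sketch of the update this notebook iterates (the symbols $P$, $R$, and $Q_k$ are notation introduced here, standing for the `transition_prob`, `rewards`, and `Q_values` arrays in the code cells), each sweep applies the Bellman optimality backup to every state-action pair:

$$
Q_{k+1}(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \bigr]
$$

Because the grid is deterministic, exactly one $s'$ has $P(s' \mid s, a) = 1$ for each pair $(s, a)$, so the sum collapses to a single term.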

It is designed to help build intuition for:
- How states and actions interact in an MDP
- How Q-values evolve over iterations
- What an optimal policy looks like in grid navigation problems


In [None]:
# Cell 2: Imports and Definitions
import numpy as np

states = [(0, 0), (0, 1), (1, 0), (1, 1)]  # S0, S1, S2, S3
actions = ['Up', 'Down', 'Left', 'Right']
action_moves = {
    'Up': (-1, 0),
    'Down': (1, 0),
    'Left': (0, -1),
    'Right': (0, 1)
}

num_states = len(states)
num_actions = len(actions)

Q_values = np.zeros((num_states, num_actions))
transition_prob = np.zeros((num_states, num_actions, num_states))
rewards = np.zeros((num_states, num_actions, num_states))

In [None]:
# Cell 3: Populate transition and reward matrices

for i, state in enumerate(states):
    for j, action in enumerate(actions):
        move = action_moves[action]
        next_state = (state[0] + move[0], state[1] + move[1])

        # Clip next_state within grid boundaries (2x2 -> max index 1)
        next_state = (
            max(min(next_state[0], 1), 0),
            max(min(next_state[1], 1), 0)
        )

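        next_state_idx = states.index(next_state)
        transition_prob[i, j, next_state_idx] = 1.0  # deterministic move
        # +10 for landing on the goal S3, -1 for every other step.
        # S3 is not terminal: Down/Right from S3 stay there and earn +10 again.
        rewards[i, j, next_state_idx] = 10 if next_state_idx == 3 else -1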
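
To make the clipping rule above concrete, here is a minimal sketch that tabulates where each action leads. The `step` helper and the `next_idx` table are hypothetical names introduced for illustration; the grid and moves are rebuilt so the snippet runs on its own.

```python
import numpy as np

# Same 2x2 layout as the notebook; rebuilt here so the check is self-contained.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]  # S0, S1, S2, S3
action_moves = {'Up': (-1, 0), 'Down': (1, 0), 'Left': (0, -1), 'Right': (0, 1)}


def step(state, action, size=2):
    """Hypothetical helper: apply a deterministic move, clipped to the grid."""
    d_row, d_col = action_moves[action]
    row = min(max(state[0] + d_row, 0), size - 1)
    col = min(max(state[1] + d_col, 0), size - 1)
    return (row, col)


# next_idx[s, a] = index of the state reached from state s under action a
next_idx = np.array(
    [[states.index(step(s, a)) for a in action_moves] for s in states]
)
print(next_idx)
```

Each row is a state (S0 to S3) and each column an action in the order Up, Down, Left, Right. Bumping into a wall leaves the agent in place, and the last row shows that Up and Left lead out of S3, so the goal is not absorbing in this model.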


In [None]:
# Cell 4: Perform Q-value iteration

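gamma = 0.90     # discount factor
iterations = 5   # number of full sweeps over all (state, action) pairs

for _ in range(iterations):
    # Synchronous update: every backup in this sweep reads the previous sweep's values
    Q_old = Q_values.copy()
    for s in range(num_states):
        for a in range(num_actions):
            # Bellman optimality backup: expected reward plus discounted best next value
            Q_values[s, a] = sum(
                transition_prob[s, a, sp] * (
                    rewards[s, a, sp] + gamma * np.max(Q_old[sp])
                )
                for sp in range(num_states)
            )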
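
The cell above runs a fixed 5 sweeps. As a sketch of how one might check convergence instead, the loop below repeats the same backup until the largest change drops below a threshold. The `next_idx` encoding, the `tol` value, and the 1000-sweep cap are assumptions made for this example, not part of the original notebook.

```python
import numpy as np

# Deterministic 2x2 model encoded as next-state indices
# (rows: S0..S3; columns: Up, Down, Left, Right).
next_idx = np.array([[0, 2, 0, 1],
                     [1, 3, 0, 1],
                     [0, 2, 2, 3],
                     [1, 3, 2, 3]])
reward = np.where(next_idx == 3, 10.0, -1.0)  # +10 for landing on S3, else -1

gamma = 0.90
tol = 1e-8        # assumed stopping threshold
max_sweeps = 1000 # assumed safety cap

Q = np.zeros((4, 4))
for sweep in range(1, max_sweeps + 1):
    V = Q.max(axis=1)                      # V(s') = max over a' of Q(s', a')
    Q_new = reward + gamma * V[next_idx]   # synchronous Bellman backup
    delta = np.abs(Q_new - Q).max()        # largest change in this sweep
    Q = Q_new
    if delta < tol:
        break

print(f"stopped after {sweep} sweeps, last change {delta:.2e}")
print(np.round(Q, 3))
```

Under these assumptions the value of staying at the goal approaches 10 / (1 − 0.9) = 100, compared with 40.951 after the 5 sweeps used above, since each sweep only shrinks the remaining error by a factor of `gamma`.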


In [None]:
# Cell 5: View final Q-values
print("Final Q-Value Table:")
print(np.round(Q_values, 3))


In [None]:
# Cell 6: Compute and visualize optimal policy

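action_symbols = ['↑', '↓', '←', '→']  # same order as `actions`
# np.argmax returns the first maximizer, so tied actions show as a single arrow
policy_indices = np.argmax(Q_values, axis=1)
policy_symbols = [action_symbols[i] for i in policy_indices]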

print("\nOptimal Policy Grid:")
print(f"{policy_symbols[0]} {policy_symbols[1]}")
print(f"{policy_symbols[2]} G")  # G = goal state
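
`np.argmax` reports only the first maximizing action, which hides ties. A possible way to list every greedy action per state is sketched below; `next_idx` is an assumed compact encoding of the deterministic transitions, and the 5 sweeps mirror the iteration cell above.

```python
import numpy as np

actions = ['Up', 'Down', 'Left', 'Right']
# Assumed encoding: next_idx[s, a] is the state reached from s under action a.
next_idx = np.array([[0, 2, 0, 1],
                     [1, 3, 0, 1],
                     [0, 2, 2, 3],
                     [1, 3, 2, 3]])
reward = np.where(next_idx == 3, 10.0, -1.0)

# Five synchronous sweeps with gamma = 0.90, as in the notebook.
Q = np.zeros((4, 4))
for _ in range(5):
    Q = reward + 0.90 * Q.max(axis=1)[next_idx]

# Mark every action whose Q-value matches the row maximum.
best = Q.max(axis=1, keepdims=True)
is_greedy = np.isclose(Q, best)
greedy = [[a for a, tied in zip(actions, row) if tied] for row in is_greedy]

for s, acts in enumerate(greedy):
    print(f"S{s}: {acts}")
```

With this model, S0 and S3 each have two equally good actions (Down and Right), while S1 and S2 have one.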


## Summary

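- The greedy policy settles after two iterations, but the Q-values have not converged after 5: the goal is not terminal, so the value of staying at S3 keeps growing toward 10 / (1 − 0.9) = 100 and has only reached 40.951.
- The greedy policy moves directly toward the goal at state **S3**: Down or Right from S0 (a tie), Down from S1, and Right from S2.
- With only 16 Q-values, every update can be checked by hand, which makes this setup useful for visualizing and debugging value updates before generalizing to larger environments.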

Next steps:
- Generalize to **N×N gridworlds**
- Add **walls/obstacles**
- Introduce **stochasticity** or explore other value functions (such as state values V(s))
