# Q-Learning on an NxN GridWorld (No Obstacles)

This notebook demonstrates Q-Learning applied to a deterministic NxN grid world environment.

## Problem Setup
- The agent can start in any state of the grid.
- The objective is to reach the bottom-right corner of the grid (the goal state).
- Rewards:
  - +10 for reaching the goal state.
  - -1 for every other move, which encourages the agent to take the shortest path.
- Actions: Up, Down, Left, Right.
- Transitions are deterministic: each action moves the agent one cell in the chosen direction, unless that would take it off the grid, in which case the agent stays in place.
- Discount factor γ = 0.98.
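The rules above can be sketched as a single transition function. This is a hypothetical illustration, not code from `q_learning_nxn`: the name `step` and the `ACTIONS` table are made up here, and it assumes states are `(row, col)` tuples with row 0 at the top, that bumping into the boundary still costs -1, and that the episode ends as soon as the goal is reached.

```python
# Hypothetical sketch of the GridWorld dynamics (not taken from q_learning_nxn).
# Assumptions: states are (row, col) tuples, row 0 is the top row,
# and the goal (bottom-right cell) is terminal.
ACTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def step(state, action, grid_size=(5, 5)):
    """Return (next_state, reward, done) for one deterministic move."""
    rows, cols = grid_size
    goal = (rows - 1, cols - 1)
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    # A move that would leave the grid keeps the agent where it was.
    if not (0 <= r < rows and 0 <= c < cols):
        r, c = state
    next_state = (r, c)
    if next_state == goal:
        return next_state, 10.0, True   # +10 and the episode ends
    return next_state, -1.0, False      # -1 for every other move


print(step((0, 0), "up"))     # ((0, 0), -1.0, False): blocked by the top edge
print(step((4, 3), "right"))  # ((4, 4), 10.0, True): enters the goal
```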

---

## 1. Import Required Libraries and Setup


```python
import numpy as np
import matplotlib.pyplot as plt
from q_learning_nxn import QLearningNxN  # project module that provides the Q-learning agent
```


## 2. Initialize Q-Learning Environment and Parameters


```python
# Grid size as (rows, columns)
grid_size = (5, 5)

# Create the Q-learning agent with discount factor gamma = 0.98
q_learning_agent = QLearningNxN(grid_size=grid_size, gamma=0.98)

print(f"Grid size: {grid_size[0]} rows x {grid_size[1]} columns")
print(f"Total states: {len(q_learning_agent.states)}")
print(f"Goal state: {q_learning_agent.goal_state}")
```
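A 5 x 5 grid has 25 states, and the heatmap cell later in this notebook reshapes each column of `Q_values` into the grid shape, which only lines up if states are indexed row by row. The sketch below shows that ordering; `enumerate_states`, `sketch_states`, and `sketch_index` are hypothetical names, and whether `QLearningNxN.states` uses the same order is an assumption.

```python
import numpy as np


def enumerate_states(grid_size):
    """Hypothetical helper: list states row by row and map each to its index."""
    n_rows, n_cols = grid_size
    states = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    index = {s: i for i, s in enumerate(states)}  # index = row * n_cols + col
    return states, index


sketch_states, sketch_index = enumerate_states((5, 5))
print(len(sketch_states))    # 25
print(sketch_index[(4, 4)])  # 24, the bottom-right goal cell

# Row-major order matches NumPy's default (C-order) reshape, so flat index i
# lands in cell divmod(i, n_cols) when a per-state vector is reshaped to the grid.
layout = np.arange(len(sketch_states)).reshape(5, 5)
print(all(layout[r, c] == sketch_index[(r, c)] for r, c in sketch_states))  # True
```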


## 3. Training the Q-Learning Agent

We train the agent for 50 iterations and then inspect the learned policy and Q-values.


```python
# Train the agent
n_iterations = 50
q_learning_agent.train(n_iterations=n_iterations)

print(f"Training completed with {n_iterations} iterations.")
```
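The body of `QLearningNxN.train` is not shown here, and it is not stated whether one "iteration" is an episode or a sweep over all states. For reference, the following is a minimal, self-contained sketch of standard tabular Q-learning on the same grid. It assumes epsilon-greedy exploration, random non-goal start states, and a terminal goal; `q_learning`, `alpha`, `epsilon`, `n_episodes`, `max_steps`, and `seed` are hypothetical names, not the module's API. Each transition applies the update Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)], dropping the max term when the goal is reached.

```python
import numpy as np


def q_learning(grid_size=(5, 5), gamma=0.98, alpha=0.5, epsilon=0.1,
               n_episodes=1000, max_steps=200, seed=0):
    """Hypothetical tabular Q-learning sketch (not the QLearningNxN code)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_size
    goal = (rows - 1, cols - 1)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    Q = np.zeros((rows * cols, len(moves)))     # row-major state index

    def step(r, c, a):
        nr, nc = r + moves[a][0], c + moves[a][1]
        if not (0 <= nr < rows and 0 <= nc < cols):
            nr, nc = r, c  # off-grid moves leave the agent in place
        if (nr, nc) == goal:
            return nr, nc, 10.0, True
        return nr, nc, -1.0, False

    for _ in range(n_episodes):
        # Start each episode from a random non-goal state.
        r, c = goal
        while (r, c) == goal:
            r, c = int(rng.integers(rows)), int(rng.integers(cols))
        for _ in range(max_steps):
            s = r * cols + c
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                a = int(rng.integers(len(moves)))
            else:
                a = int(np.argmax(Q[s]))
            r, c, reward, done = step(r, c, a)
            # TD target: no bootstrapped future value once the goal is reached.
            target = reward if done else reward + gamma * np.max(Q[r * cols + c])
            Q[s, a] += alpha * (target - Q[s, a])
            if done:
                break
    return Q


Q_sketch = q_learning()
print(np.argmax(Q_sketch, axis=1).reshape(5, 5))  # greedy action index per cell
```

Because every reward is deterministic here, a fixed learning rate such as `alpha=0.5` is enough; stochastic transitions would normally call for a smaller or decaying rate.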


## 4. Display the Learned Policy

Arrows show the greedy action (the action with the highest learned Q-value) in each state, and `G` marks the goal state.


```python
q_learning_agent.print_policy()
```
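Extracting a greedy policy from a Q-table amounts to an argmax over the action axis. The sketch below shows one way `print_policy` could work; `policy_rows`, `ARROWS`, and the toy table `Q_toy` are hypothetical, and it assumes the action order up, down, left, right and row-major state indices.

```python
import numpy as np

ARROWS = ["↑", "↓", "←", "→"]  # assumed action order: up, down, left, right


def policy_rows(Q, grid_size, goal):
    """Render the greedy action in each state as one string per grid row."""
    rows, cols = grid_size
    # Ties go to the lowest action index, as np.argmax returns the first maximum.
    greedy = np.argmax(Q, axis=1).reshape(rows, cols)
    lines = []
    for r in range(rows):
        cells = ["G" if (r, c) == goal else ARROWS[greedy[r, c]] for c in range(cols)]
        lines.append(" ".join(cells))
    return lines


# Toy 2x2 example: states 0=(0,0), 1=(0,1), 2=(1,0), 3=(1,1) is the goal.
Q_toy = np.array([
    [0.0, 5.0, 0.0, 4.0],  # (0,0): down is best
    [0.0, 9.0, 1.0, 2.0],  # (0,1): down is best
    [1.0, 2.0, 0.0, 9.0],  # (1,0): right is best
    [0.0, 0.0, 0.0, 0.0],  # goal row is not used
])
for line in policy_rows(Q_toy, (2, 2), goal=(1, 1)):
    print(line)
# ↓ ↓
# → G
```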


## 5. Visualize Q-Values for Each Action

We plot one heatmap per action, showing the learned Q-value of taking that action in each cell, to inspect the agent's value estimates.


```python
actions = q_learning_agent.actions
Q_values = q_learning_agent.Q_values  # shape: (n_states, n_actions)
grid_shape = grid_size

fig, axs = plt.subplots(1, len(actions), figsize=(20, 4))
for i, action in enumerate(actions):
    ax = axs[i]
    # Column i holds Q(s, action) for every state; reshaping to the grid
    # assumes states are indexed row by row (index = row * n_cols + col).
    q_vals_action = Q_values[:, i].reshape(grid_shape)
    im = ax.imshow(q_vals_action, cmap='coolwarm', interpolation='nearest')
    ax.set_title(f"Q-values for '{action}'")
    ax.set_xticks(np.arange(grid_shape[1]))
    ax.set_yticks(np.arange(grid_shape[0]))
    # Annotate each cell; np.ndenumerate yields ((row, col), value) pairs
    for (j, k), val in np.ndenumerate(q_vals_action):
        ax.text(k, j, f"{val:.1f}", ha='center', va='center', color='black')
    fig.colorbar(im, ax=ax)

fig.suptitle("Q-Value Heatmaps for Each Action")
plt.show()
```
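The heatmaps can be sanity-checked against the exact optimal values. Assuming the goal is terminal (no further reward after entering it), a state at Manhattan distance d ≥ 1 from the goal is best served by a shortest path: d - 1 moves at -1 each, then +10 on the final move, so V*(s) = 10·γ^(d-1) - (1 - γ^(d-1)) / (1 - γ). For example, d = 2 gives -1 + 0.98·10 = 8.8. In each cell, the largest of the four heatmap values, max_a Q(s, a), should approach this number as training converges. The sketch below computes the closed form and cross-checks it with value iteration; `closed_form_values` and `value_iteration` are hypothetical helpers, and the terminal-goal convention is an assumption about `QLearningNxN`.

```python
import numpy as np


def closed_form_values(grid_size, gamma=0.98, goal_reward=10.0, step_reward=-1.0):
    """Optimal V*(s) from the shortest-path formula (goal assumed terminal)."""
    rows, cols = grid_size
    V = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            d = (rows - 1 - r) + (cols - 1 - c)  # Manhattan distance to the goal
            if d == 0:
                continue  # the terminal goal state keeps value 0
            g = gamma ** (d - 1)
            V[r, c] = step_reward * (1 - g) / (1 - gamma) + goal_reward * g
    return V


def value_iteration(grid_size, gamma=0.98, goal_reward=10.0, step_reward=-1.0,
                    tol=1e-10):
    """Bellman optimality backups on the same deterministic grid."""
    rows, cols = grid_size
    goal = (rows - 1, cols - 1)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    V = np.zeros((rows, cols))
    while True:
        V_new = np.zeros_like(V)
        for r in range(rows):
            for c in range(cols):
                if (r, c) == goal:
                    continue
                best = -np.inf
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        nr, nc = r, c  # blocked move: stay in place
                    if (nr, nc) == goal:
                        q = goal_reward
                    else:
                        q = step_reward + gamma * V[nr, nc]
                    best = max(best, q)
                V_new[r, c] = best
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


V_exact = closed_form_values((5, 5))
V_vi = value_iteration((5, 5))
print(np.round(V_exact, 2))
print(np.allclose(V_exact, V_vi))  # True
```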


## Summary

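- After training, the greedy policy should lead from every state to the goal along a shortest path; the printed policy shows whether 50 iterations were enough.
- The Q-values estimate the expected discounted return of taking each action in each state and acting optimally afterwards.
- Because the environment is small and deterministic, the basic mechanics of Q-learning are easy to follow.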

---

Next steps: extend the environment with obstacles and stochastic transitions.
