# Q-Learning in GridWorld

This notebook demonstrates how to implement **Q-Learning** in a custom GridWorld environment using Python. The agent learns to navigate from any valid cell to a goal state while avoiding obstacles.

Key Concepts:
- **Q-Learning**: Off-policy Temporal Difference control algorithm.
- **ε-greedy exploration**: Balances exploration and exploitation.
- **GridWorld**: A small grid of cells with obstacles, a per-step penalty, and a reward for reaching the goal.
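
As a rough illustration of the ε-greedy idea listed above, the sketch below picks a random action with probability ε and the highest-valued action otherwise. The function name `epsilon_greedy`, the seeded generator, and the Q-values in `q_row` are made up for this example and are not part of the notebook's own code.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: uniform random action
    return int(np.argmax(q_row))  # exploit: ties resolve to the lowest index

rng = np.random.default_rng(0)            # seeded only to make the demo repeatable
q_row = np.array([0.0, 2.5, -1.0, 0.3])   # hypothetical Q-values for one state

# With epsilon = 0 the choice is always the greedy action (index 1 here).
greedy_only = [epsilon_greedy(q_row, 0.0, rng) for _ in range(5)]

# With epsilon = 0.5 the greedy action is still picked most of the time,
# because a random draw can also land on it.
picks = [epsilon_greedy(q_row, 0.5, rng) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
```

With four actions and ε = 0.5, the greedy action is expected in about 0.5 + 0.5 × 0.25 = 62.5% of draws.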

---


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from utils import (
    plot_arrows_MDP_QL,
    plot_grid_world_MDP_QL,
    simulate_policy_MDP_QL,
    print_value_grid
)


## Define the GridWorld Environment

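- 5x6 grid (5 rows, 6 columns); states are `(row, column)` tuples.
- Step penalty of -1 for every move, and a reward of +10 for the move that reaches the goal at (2, 4).
- Obstacles at (1, 4), (2, 3), and (3, 4). Moves into an obstacle or off the grid leave the agent where it is.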
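
To make the layout easier to picture, here is a small sketch that prints the grid as text. The helper `render_grid` is a hypothetical name introduced for this illustration; the dimensions, obstacles, and goal are the ones used in this notebook.

```python
ROWS, COLS = 5, 6
OBSTACLES = {(1, 4), (2, 3), (3, 4)}
GOAL_STATE = (2, 4)

def render_grid(rows, cols, obstacles, goal):
    """Return a text map: '#' for obstacles, 'G' for the goal, '.' for free cells."""
    lines = []
    for r in range(rows):
        cells = []
        for c in range(cols):
            if (r, c) == goal:
                cells.append("G")
            elif (r, c) in obstacles:
                cells.append("#")
            else:
                cells.append(".")
        lines.append(" ".join(cells))
    return "\n".join(lines)

layout = render_grid(ROWS, COLS, OBSTACLES, GOAL_STATE)
print(layout)
```

The map shows that the goal is enclosed by obstacles above, below, and to the left, so the agent can only enter it from (2, 5).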


In [None]:
grid_world = [5, 6]  # [rows, columns]
GOAL_REWARD = 10.0   # reward for the move that enters the goal
STEP_PENALTY = -1.0  # reward for every other move
OBSTACLES = [(1, 4), (2, 3), (3, 4)]  # impassable cells as (row, column)
GOAL_STATE = (2, 4)

actions = np.array(['up', 'down', 'left', 'right'])
# Row index grows downward, so 'up' decreases the row.
action_moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
action_to_index = {act: i for i, act in enumerate(actions)}

# All non-obstacle cells, in row-major order, and their row index in the Q-table.
states = [(j, i) for j in range(grid_world[0]) for i in range(grid_world[1]) if (j, i) not in OBSTACLES]
state_to_index = {s: idx for idx, s in enumerate(states)}
GOAL_INDEX = state_to_index[GOAL_STATE]

# One row per state, one column per action, initialised to zero.
Q_values = np.zeros((len(states), len(actions)))

def validate_policy_result(current_state, move):
    """Return (next_state, True) if the move is legal, else (current_state, False)."""
    next_state = current_state[0] + move[0], current_state[1] + move[1]
    if next_state not in OBSTACLES and (0 <= next_state[0] < grid_world[0]) and (0 <= next_state[1] < grid_world[1]):
        return next_state, True
    return current_state, False


## Q-Learning Algorithm

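Each episode starts from a random non-goal cell and ends when the agent reaches the goal or after `max_steps` moves. After every move, the agent applies the Q-learning update, which nudges the current estimate toward the Bellman optimality target:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here $\alpha$ is `learning_rate`, $\gamma$ is `gamma`, and $s'$ is the cell reached after taking action $a$ in state $s$. Actions are chosen ε-greedily, and ε decays after each episode.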
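
The update rule can be traced by hand on single transitions. The sketch below isolates it in a hypothetical helper, `q_update`, whose defaults mirror the `learning_rate=0.1` and `gamma=1.0` used in this notebook; the input values are chosen to match a freshly initialised (all-zero) Q-table.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.1, gamma=1.0):
    """Apply one Q-learning update to a single Q(s, a) estimate."""
    td_target = reward + gamma * max_q_next  # bootstrapped target
    td_error = td_target - q_sa              # how far the estimate is off
    return q_sa + learning_rate * td_error

# Ordinary move on a fresh table: 0 + 0.1 * (-1 + 1.0 * 0 - 0) = -0.1
step_value = q_update(q_sa=0.0, reward=-1.0, max_q_next=0.0)

# Move that enters the goal: 0 + 0.1 * (10 + 1.0 * 0 - 0) = 1.0
goal_value = q_update(q_sa=0.0, reward=10.0, max_q_next=0.0)
```

Because the target uses the maximum over next-state actions rather than the action the agent actually takes next, the update is off-policy.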


In [None]:
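def run_episodes_update_Q(Q_values, valid_starts, n_episodes, learning_rate=0.1,
                          gamma=1.0, max_steps=100, epsilon=0.1, epsilon_decay=0.995):
    """Run tabular Q-learning for n_episodes, updating Q_values in place."""
    for episode in range(n_episodes):
        # Start each episode from a random non-goal cell.
        s = valid_starts[np.random.choice(len(valid_starts))]
        steps = 0
        while s != GOAL_STATE and steps < max_steps:
            s_idx = state_to_index[s]
            # ε-greedy action selection: explore with probability epsilon.
            if np.random.rand() < epsilon:
                action_name = np.random.choice(actions)
            else:
                action_name = actions[np.argmax(Q_values[s_idx])]
            action_idx = action_to_index[action_name]

            # Blocked moves (walls, obstacles) leave the agent in place.
            sp, _ = validate_policy_result(s, action_moves[action_name])
            sp_idx = state_to_index[sp]
            r = GOAL_REWARD if sp == GOAL_STATE else STEP_PENALTY

            # TD update toward r + gamma * max_a' Q(s', a'). The goal row of
            # Q_values is never updated, so it stays 0 and acts as the terminal value.
            td_target = r + gamma * np.max(Q_values[sp_idx])
            Q_values[s_idx, action_idx] += learning_rate * (td_target - Q_values[s_idx, action_idx])
            s = sp
            steps += 1
        # Decay exploration after each episode, with a floor of 0.01.
        epsilon = max(0.01, epsilon * epsilon_decay)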


## Train the Agent


In [None]:
valid_starts = [s for s in states if s != GOAL_STATE]
run_episodes_update_Q(Q_values, valid_starts, n_episodes=2000)


## Visualize the Learned Policy

We print the Q-table and overlay arrows showing the greedy policy, that is, the highest-valued action in each state.
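
Reading a policy off a Q-table amounts to a row-wise `argmax`. The sketch below shows this on a tiny made-up table, `Q_demo`, with three states; it does not reproduce what the plotting helpers in `utils` do internally, which is not shown in this notebook.

```python
import numpy as np

actions = np.array(['up', 'down', 'left', 'right'])

# Hypothetical Q-table: one row per state, one column per action.
Q_demo = np.array([
    [-3.0, -1.0, -4.0, -2.0],
    [ 0.5,  0.1,  0.2,  9.0],
    [ 0.0,  0.0,  0.0,  0.0],   # a state that was never updated
])

greedy_idx = np.argmax(Q_demo, axis=1)   # best action index per state
greedy_actions = actions[greedy_idx]     # the same, as action names
state_values = Q_demo.max(axis=1)        # V(s) = max_a Q(s, a)
```

Note that an all-zero row resolves to the first action (`'up'`) purely by tie-breaking, so an arrow drawn for an unvisited or terminal state carries no information.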


In [None]:
print_value_grid(Q_values)
plot_arrows_MDP_QL(Q_values, grid_world, actions, states, goal_state=GOAL_STATE,
                   OBSTACLES=OBSTACLES, state_to_index=state_to_index)


## Simulate Optimal Path from a Start State


In [None]:
START_STATE = (4, 0)
START_INDEX = state_to_index[START_STATE]

path = simulate_policy_MDP_QL(Q_values, states, actions, start_index=START_INDEX,
                               goal_index=GOAL_INDEX, grid_world=grid_world, OBSTACLES=OBSTACLES, action_moves=action_moves)

plot_arrows_MDP_QL(Q_values, grid_world, actions, states, goal_state=GOAL_STATE, OBSTACLES=OBSTACLES,
                   path=path, state_to_index=state_to_index)


## Final Visualization: GridWorld with Path


In [None]:
plot_grid_world_MDP_QL(states, grid_size=grid_world, obstacles=OBSTACLES,
                       goal=GOAL_STATE, path=path, Q_values=Q_values, actions=actions)


**Project Summary: Q-Learning in GridWorld**

*This project demonstrates how to implement Q-Learning, an off-policy reinforcement learning algorithm, in a custom-built GridWorld environment. The agent learns to navigate through a grid, avoid obstacles, and reach a designated goal using the ε-greedy strategy for exploration.*

**Key Features:**

**Environment:**

5x6 GridWorld with impassable obstacles and a rewarding goal.

**Learning Algorithm:**

Q-Learning with dynamic ε-decay and TD updates.
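
To see how quickly exploration shrinks, the sketch below replays the decay rule from the training loop (ε multiplied by 0.995 after each episode, floored at 0.01) on its own. The function `epsilon_schedule` and the parameter name `epsilon_min` are introduced for this illustration; the notebook hard-codes the floor.

```python
def epsilon_schedule(n_episodes, epsilon=0.1, epsilon_decay=0.995, epsilon_min=0.01):
    """Return the epsilon value in effect during each episode."""
    values = []
    for _ in range(n_episodes):
        values.append(epsilon)
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return values

schedule = epsilon_schedule(2000)

# Index of the first episode that runs with epsilon at its floor.
first_floor_episode = next(i for i, e in enumerate(schedule) if e == 0.01)
```

Under these defaults the floor is reached at episode index 460, so roughly the last three quarters of the 2000 training episodes run with ε = 0.01.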

**Visualization:**

- Arrow-based policy map
- Grid overlay with obstacles, start, and goal
- Simulated path from a chosen start state

**Reinforcement Learning Concepts Covered:**

- State-action value function (Q-values)
- Bellman Equation update
- ε-greedy policy for exploration vs. exploitation
- Off-policy learning

**Outcome:**

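After training, the greedy policy derived from the Q-table guides the agent around the obstacles to the goal, as the grid-based plots show. Whether that policy is optimal for every start cell depends on how thoroughly training explored the state-action space.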
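
One way to sanity-check the learned path is to compare its length with the true shortest route, which a breadth-first search can compute directly from the grid definition. This is a sketch under the notebook's grid settings; `shortest_path_length` is a hypothetical helper and is not part of `utils`.

```python
from collections import deque

ROWS, COLS = 5, 6
OBSTACLES = {(1, 4), (2, 3), (3, 4)}
GOAL_STATE = (2, 4)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def shortest_path_length(start, goal=GOAL_STATE):
    """Breadth-first search over free cells; returns the move count or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES:
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS
                    and nxt not in OBSTACLES and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal not reachable from start

best = shortest_path_length((4, 0))       # start state used in this notebook
best_return = -1.0 * (best - 1) + 10.0    # step penalties plus the goal reward
```

From (4, 0) the shortest route takes 8 moves, for an undiscounted return of 3.0. If `path` is a list of visited states that includes the start (an assumption about the helper's return value), then `len(path) - 1` equal to 8 indicates the learned route is optimal for that start.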