# Q-learning Tutorial using Python

In this notebook we will:

1. Summarize the Q-learning algorithm using a pseudocode (verbal) description.
2. Implement a detailed example using a simple maze problem.

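The maze is a grid where:
  - 0 represents a free cell the agent can move into, and
  - 1 represents a wall (obstacle).

The agent receives a reward of -1 for each move, including moves blocked by a wall or the grid boundary (the agent then stays in place), and +100 for the move that reaches the goal.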

The Q-learning update rule is:

  Q(s, a) ← Q(s, a) + α * [r + γ * maxₐ′ Q(s′, a′) - Q(s, a)]

where:
  - α (alpha) is the learning rate,
  - γ (gamma) is the discount factor,
  - r is the reward, and
  - s and s′ are the current and next states, a is the action taken in s, and a′ ranges over the actions available in s′.
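
To make the rule concrete, here is a minimal sketch of a single update on a toy Q-table. The helper name `q_update` and the 2-state, 2-action table are illustrative assumptions and not part of the notebook's implementation; α = 0.1 and γ = 0.9 are the values used later in the notebook.

```python
import numpy as np


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q[s, a].

    `q_update` is a hypothetical helper used only for illustration.
    """
    td_target = r + gamma * np.max(Q[s_next])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]             # the bracketed term in the rule
    Q[s, a] += alpha * td_error
    return Q[s, a]


# Toy table: 2 states, 2 actions, all zeros except one known value.
Q = np.zeros((2, 2))
Q[1, 0] = 10.0

# Taking action 1 in state 0 yields reward -1 and leads to state 1.
new_value = q_update(Q, s=0, a=1, r=-1, s_next=1)
print(new_value)
```

Here Q(s, a) starts at 0 and the target is -1 + 0.9 * 10 = 8, so the entry moves one tenth of the way toward the target, to about 0.8.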

**Links**
- https://medium.com/@alwinraju/in-depth-guide-to-implementing-q-learning-in-python-with-openai-gyms-taxi-environment-cd356cc6a288
- https://github.com/vmayoral/basic_reinforcement_learning/blob/master/tutorial1/README.md
- https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/
- https://en.wikipedia.org/wiki/Q-learning


## Pseudocode Summary for Q-learning

1. Initialize Q(s, a) arbitrarily for all state–action pairs.
2. For each episode:
    - Reset the environment starting from the initial state.
    - While the state is not terminal:
        1. Choose an action a using an epsilon-greedy policy.
        2. Execute action a, obtain reward r and new state s′.
        3. Update Q(s, a) as follows:
          
             Q(s, a) ← Q(s, a) + α * [r + γ * maxₐ′ Q(s′, a′) - Q(s, a)]
        4. Set s ← s′.
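3. Repeat until the Q-values, and therefore the greedy policy, stop changing. In practice, training is often run for a fixed number of episodes instead.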
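
Step 1 of the inner loop can be sketched as a small function. This is one possible version, not the notebook's own code: the name `epsilon_greedy`, the use of NumPy's `Generator`, and the random tie-breaking are all assumptions made for illustration. The implementation later in this notebook uses `np.argmax` instead, which always returns the first maximum.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one row of the Q-table.

    Hypothetical helper: `q_row` holds Q(s, a) for every action in state s.
    """
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return int(rng.integers(len(q_row)))
    # Exploit: break ties randomly instead of always taking the first maximum.
    best = np.flatnonzero(q_row == np.max(q_row))
    return int(rng.choice(best))


rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
q_row = np.array([0.0, 2.0, 2.0, -1.0])

# With epsilon = 0 the choice is purely greedy, so only actions 1 and 2 appear.
greedy_picks = {epsilon_greedy(q_row, 0.0, rng) for _ in range(100)}
print(greedy_picks)
```

Random tie-breaking matters most early in training, when the Q-table is all zeros and every action is tied.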


## Maze Environment and Q-learning Implementation

In the next code cell we define the maze environment as a Python class and implement the complete Q-learning algorithm. The maze is represented by a 2D NumPy array.

In [2]:
import numpy as np
import random


class MazeEnv:
    def __init__(self):
        # Maze layout: 0 = free space, 1 = wall
        # A sample 6x6 maze
        self.maze = np.array(
            [
                [0, 0, 0, 1, 0, 0],
                [0, 1, 0, 1, 0, 1],
                [0, 1, 0, 0, 0, 0],
                [0, 0, 1, 1, 1, 0],
                [1, 0, 0, 0, 1, 0],
                [0, 0, 1, 0, 0, 0],
            ]
        )
        self.n_rows, self.n_cols = self.maze.shape

        # Define start and goal positions
        self.start_state = (0, 0)  # Top-left corner
        self.goal_state = (5, 5)  # Bottom-right corner

        # Define actions: up, down, left, right
        self.action_space = {
            0: (-1, 0),  # Up
            1: (1, 0),  # Down
            2: (0, -1),  # Left
            3: (0, 1),  # Right
        }
        self.n_actions = len(self.action_space)

        self.current_state = self.start_state

    def reset(self):
        """Reset the environment to the initial state."""
        self.current_state = self.start_state
        return self.current_state

    def step(self, action):
        """
        Take a step in the maze environment.
        Input:
            action (int): 0 (up), 1 (down), 2 (left), 3 (right)
        Returns:
            next_state (tuple): the new state after taking the action
            reward (int): reward for the action
            done (bool): True if the goal is reached
        """
        delta = self.action_space[action]
        next_state = (
            self.current_state[0] + delta[0],
            self.current_state[1] + delta[1],
        )

        # Check boundaries
        if (
            next_state[0] < 0
            or next_state[0] >= self.n_rows
            or next_state[1] < 0
            or next_state[1] >= self.n_cols
        ):
            next_state = self.current_state
        # Check if the next state is a wall
        elif self.maze[next_state] == 1:
            next_state = self.current_state

        # Define reward and termination condition
        if next_state == self.goal_state:
            reward = 100
            done = True
        else:
            reward = -1
            done = False

        self.current_state = next_state
        return next_state, reward, done

    def state_to_index(self, state):
        """Convert a (row, col) state into a flat index."""
        return state[0] * self.n_cols + state[1]

    def print_policy(self, Q):
        """Display the learned policy with arrows for each action."""
        directions = {0: "↑", 1: "↓", 2: "←", 3: "→"}
        policy = np.full(self.maze.shape, " ")
        for r in range(self.n_rows):
            for c in range(self.n_cols):
                if self.maze[r, c] == 1:
                    policy[r, c] = "█"
                elif (r, c) == self.goal_state:
                    policy[r, c] = "G"
                else:
                    state_index = self.state_to_index((r, c))
                    best_action = np.argmax(Q[state_index])
                    policy[r, c] = directions[best_action]
        for row in policy:
            print(" ".join(row))


# Q-learning parameters
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 0.5  # Exploration probability
num_episodes = 500  # Number of training episodes

# Initialize environment and Q-table
env = MazeEnv()
n_states = env.n_rows * env.n_cols
n_actions = env.n_actions

# Q-table initialization: each row for a state, each column for an action
Q = np.zeros((n_states, n_actions))

# Q-learning Training Loop
for episode in range(num_episodes):
    state = env.reset()
    state_index = env.state_to_index(state)
    done = False

    while not done:
        # Choose action with epsilon-greedy policy
        if random.uniform(0, 1) < epsilon:
            action = random.choice(list(env.action_space.keys()))
        else:
            action = np.argmax(Q[state_index])

        next_state, reward, done = env.step(action)
        next_state_index = env.state_to_index(next_state)

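        # Q-learning update:
        # Q(s, a) <- Q(s, a) + alpha * [reward + gamma * max_a' Q(s', a') - Q(s, a)]
        # The goal state's row is never updated and stays at zero, so on the
        # terminal transition the target reduces to the reward alone.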
        curr_q = Q[state_index, action]
        target = reward + gamma * np.max(Q[next_state_index])
        Q[state_index, action] = curr_q + alpha * (target - curr_q)

        state_index = next_state_index

    if (episode + 1) % 100 == 0:
        print(f"Episode {episode + 1} completed.")

print("\nLearned Q-table:")
print(Q)

print("\nLearned Policy (arrows show the best action for each cell):")
env.print_policy(Q)

# Testing the learned policy from the start state
print("\nTest Run:")
state = env.reset()
steps = 0
path = [state]
done = False
while not done and steps < 50:
    state_index = env.state_to_index(state)
    action = np.argmax(Q[state_index])
    state, reward, done = env.step(action)
    path.append(state)
    steps += 1

print("Path taken:", path)
if done:
    print(f"Goal reached in {steps} steps!")
else:
    print("Failed to reach goal within the step limit.")

Episode 100 completed.
Episode 200 completed.
Episode 300 completed.
Episode 400 completed.
Episode 500 completed.

Learned Q-table:
[[ 28.34020005  24.47722978  28.35371127  32.61625379]
 [ 32.60586859  32.61243338  28.3476529   37.3513931 ]
 [ 37.34479665  42.612659    32.61054185  37.34804752]
 [  0.           0.           0.           0.        ]
 [  2.72204597  46.75775064   6.10292696  -0.98781656]
 [ -0.99673864  -1.09293229   1.40205811  -0.94862952]
 [ 28.34815535   8.35215522  17.3825928   20.38908024]
 [  0.           0.           0.           0.        ]
 [ 37.34747087  48.45851     42.60868331  42.61175886]
 [  0.           0.           0.           0.        ]
 [ 26.44559286  62.17041029  46.11355035  41.93833196]
 [  0.           0.           0.           0.        ]
 [ -0.14429012  17.19331759   0.3455677    1.23546592]
 [  0.           0.           0.           0.        ]
 [ 42.60963061  48.45566475  48.45687693  54.9539    ]
 [ 54.95140903  54.95252071  48.45728793  
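
As a sanity check on the printed table, the value of the best action in the start state can be computed by hand. The sketch below assumes the shortest route from (0, 0) to (5, 5) takes 10 moves (counted from the maze layout, for example right, right, down, down, right, right, right, down, down, down), so the agent collects -1 on the first nine moves and +100 on the tenth. The helper name `discounted_return` is illustrative.

```python
def discounted_return(n_moves, gamma=0.9, step_reward=-1, goal_reward=100):
    """Discounted return of a path that reaches the goal on move `n_moves`."""
    rewards = [step_reward] * (n_moves - 1) + [goal_reward]
    return sum(gamma**t * r for t, r in enumerate(rewards))


start_value = discounted_return(10)
print(round(start_value, 4))
```

The result is about 32.6163, which agrees with the largest entry in the first row of the table above (32.61625379, action 3 = right). The all-zero rows (indices 3, 7, 9, 11, and 13) correspond to wall cells, which the agent never occupies, so their entries are never updated.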

## Conclusion

This notebook demonstrated the Q-learning algorithm on a simple maze problem. The agent learns a policy for reaching the goal by repeatedly applying the Q-update rule:

  Q(s, a) ← Q(s, a) + α * [r + γ * maxₐ′ Q(s′, a′) - Q(s, a)]

Experiment with the maze layout, learning parameters, or number of episodes to see how performance is affected. Happy learning!
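
One experiment to try is replacing the fixed exploration probability with one that shrinks over the course of training. The schedule below is only a sketch: the function name `decayed_epsilon` and the values of `eps_min` and `decay` are assumptions chosen for illustration, not settings used in this notebook.

```python
def decayed_epsilon(episode, eps_start=0.5, eps_min=0.05, decay=0.99):
    """Exponentially decay epsilon per episode, never going below eps_min."""
    return max(eps_min, eps_start * decay**episode)


# Inspect the schedule at a few points of a 500-episode run.
for episode in (0, 100, 499):
    print(episode, round(decayed_epsilon(episode), 3))
```

To use it, compute `epsilon = decayed_epsilon(episode)` at the top of each training episode, so the agent explores heavily at first and acts mostly greedily later.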