In [1]:
import numpy as np
import random

# Define the maze
maze = [
    [0, -1, 0, 0, 1],
    [0, -1, 0, -1, -1],
    [0, 0, 0, 0, 0],
    [-1, -1, 0, -1, 0],
    [0, 0, 0, -1, 0]
]

start = (0, 0)  # Starting point
goal = (0, 4)   # Goal point

# Map actions to numbers (0, 1, 2, 3)
# 0 = up, 1 = down, 2 = left, 3 = right
action_dict = {
    0: (-1, 0),  # up
    1: (1, 0),   # down
    2: (0, -1),  # left
    3: (0, 1)    # right
}

# Initialize Q-table with zeros (maze dimensions x number of actions)
q_table = np.zeros((len(maze), len(maze[0]), 4))

alpha = 0.1     # Learning rate
gamma = 0.9     # Discount factor
epsilon = 0.1   # Exploration rate
episodes = 1000 # Number of episodes

def is_valid_position(position):
    row, col = position
    return 0 <= row < len(maze) and 0 <= col < len(maze[0]) and maze[row][col] != -1

def choose_action(state):
    if random.uniform(0, 1) < epsilon:
        # Random action (0, 1, 2, or 3)
        return random.randint(0, 3)
    else:
        row, col = state
        # Exploit: pick the action with the highest Q-value for this state
        return int(np.argmax(q_table[row, col]))

# Q-learning
for episode in range(episodes):
    state = start
    while state != goal:
        row, col = state
        action = choose_action(state)

        # Ensure valid action is chosen (safety check)
        if action not in action_dict:
            print(f"Invalid action: {action}, using 0 (up) instead.")
            action = 0  # Default to 0 (up) in case of invalid action.

        move = action_dict[action]
        next_state = (row + move[0], col + move[1])

        if not is_valid_position(next_state):
            reward = -1  # Penalty for hitting a wall
            next_state = state  # Stay in the same position
        elif next_state == goal:
            reward = 1  # Reward for reaching the goal
        else:
            reward = -0.1  # Small penalty for each move

        # Update Q-value toward reward + discounted best value of the next state
        next_row, next_col = next_state
        best_next_value = np.max(q_table[next_row, next_col])
        q_table[row, col, action] += alpha * (reward + gamma * best_next_value - q_table[row, col, action])

        # Update state
        state = next_state

    # Decrease exploration rate over time
    epsilon = max(0.01, epsilon * 0.99)

# Print the trained Q-table
print("Trained Q-Table:")
print(q_table)

# Find the path using the trained Q-table
state = start
path = [state]
max_steps = len(maze) * len(maze[0])  # A loop-free path visits each cell at most once
for _ in range(max_steps):
    if state == goal:
        break
    row, col = state
    action = int(np.argmax(q_table[row, col]))  # Choose the best action based on Q-values
    move = action_dict[action]
    next_state = (row + move[0], col + move[1])
    if not is_valid_position(next_state):
        break  # The greedy action runs into a wall; stop here
    state = next_state
    path.append(state)

# Print the path taken by the agent
print("Path taken by the agent:", path)

Trained Q-Table:
[[[-0.67108401 -0.0434062  -0.6596749  -0.56884186]
  [ 0.          0.          0.          0.        ]
  [-0.25556198  0.06024417 -0.21397658  0.8       ]
  [-0.1        -0.10965367  0.05219     1.        ]
  [ 0.          0.          0.          0.        ]]

 [[-0.32962058  0.062882   -0.56763977 -0.70338836]
  [ 0.          0.          0.          0.        ]
  [ 0.62       -0.02746457 -0.37665468 -0.37357948]
  [ 0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.        ]]

 [[-0.18731646 -0.57341646 -0.44370292  0.18098   ]
  [-0.46888239 -0.41653968 -0.09074595  0.3122    ]
  [ 0.458      -0.16064348 -0.10655847 -0.13384147]
  [-0.19709412 -0.192439   -0.04124843 -0.12921187]
  [-0.19619388 -0.13029543 -0.13229608 -0.199     ]]

 [[ 0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.        ]
  [-0.10023423 -0.14700515 -0.199      -0.199     ]
  [ 0.          0.          0.          0

A **maze** is a grid-based environment that contains paths, walls, and sometimes a goal. It’s a classic setting for teaching and testing **Reinforcement Learning (RL)** concepts, as it provides a controlled space with clear objectives and obstacles. In the maze environment, the **agent** (like a robot or virtual entity) must navigate from a **starting position** to a **goal position** while avoiding walls or obstacles. 

### Purpose of the Maze in Reinforcement Learning

In Reinforcement Learning, the **purpose of the maze environment** is to allow an agent to learn **how to make decisions** to reach a target. Since each cell in the maze represents a state, the agent must decide which direction to move next. By **exploring** different paths and learning from the **feedback** (rewards or penalties), the agent can identify the **optimal path** to reach the goal.

### How Reinforcement Learning Works in a Maze

The maze environment offers a straightforward scenario for RL, where:
- **The agent** interacts with the maze, making choices at each step.
- **The goal** is to maximize cumulative reward, which here means reaching the endpoint in as few steps as possible while avoiding walls.
- The agent learns a **policy** (strategy) to navigate the maze efficiently.

In the RL approach, the agent:
1. **Explores** different paths and experiences various outcomes.
2. **Learns** from rewards (in this notebook: positive for reaching the goal, negative for hitting a wall, and a small penalty for every other move).
3. **Adapts** its decisions over time to find the best route, effectively solving the maze with an optimal policy.
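
The explore-and-learn loop above rests on a single environment step: given a state and an action, return the next state and a reward. Below is a minimal sketch of that step, assuming the maze layout and reward values used in this notebook's code cell. The `step` helper and the `moves` name are hypothetical and do not appear in the original cell.

```python
# Maze layout copied from the notebook: 0 = open, -1 = wall, 1 = goal marker
maze = [
    [0, -1, 0, 0, 1],
    [0, -1, 0, -1, -1],
    [0, 0, 0, 0, 0],
    [-1, -1, 0, -1, 0],
    [0, 0, 0, -1, 0],
]
goal = (0, 4)
moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


def step(state, action):
    """Hypothetical helper: apply one action and return (next_state, reward)."""
    row, col = state
    d_row, d_col = moves[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(maze) and 0 <= new_col < len(maze[0])
    if not inside or maze[new_row][new_col] == -1:
        return state, -1.0  # blocked: stay in place and take the wall penalty
    if (new_row, new_col) == goal:
        return goal, 1.0  # reached the goal
    return (new_row, new_col), -0.1  # ordinary move: small step cost


print(step((0, 0), 3))  # moving right from the start runs into the wall at (0, 1)
print(step((0, 0), 1))  # moving down from the start is an ordinary move
print(step((0, 3), 3))  # moving right from (0, 3) reaches the goal
```

Keeping the transition and reward logic in one function like this is a design choice, not something the notebook does; the original cell inlines the same checks inside the training loop.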

Using this, let’s walk through the implementation steps:

1. **Define the Maze Environment**: Represent the maze as a grid where each cell can be an open path, an obstacle, or the goal.
2. **Implement Q-Learning**: Use Q-Learning, a widely used RL technique in which the agent learns the best actions by updating Q-values (estimates of the expected return from taking a given action in a given state).
3. **Train the Agent**: Run simulations for the agent to explore and find optimal paths.
4. **Test the Agent’s Policy**: Once trained, the agent should be able to navigate to the goal efficiently. 


This code demonstrates **Q-Learning** in a simple maze environment, where the agent learns to navigate from a **starting point** to a **goal** while avoiding obstacles. Here’s an overview of how the code is structured and how it enables the agent to learn and find an optimal path:

### Code Breakdown

1. **Maze Setup**:
   - The maze is represented as a 2D list where:
     - `0` indicates an open path.
     - `-1` indicates a wall or obstacle that the agent cannot pass.
     - `1` at `(0, 4)` marks the goal. The code never reads this value; it detects the goal by comparing the state with the `goal` variable.
   - **Starting point** is `(0, 0)`, and the **goal** is `(0, 4)`.

2. **Action Space**:
   - The agent can move in four directions:
     - `0` for up, `1` for down, `2` for left, and `3` for right.
   - Each action has a corresponding movement vector in `action_dict`.

3. **Q-Table Initialization**:
   - The Q-table is initialized to zeros and stores one **Q-value** for each state-action pair. Its shape is `(rows, columns, 4 actions)`, which is `(5, 5, 4)` for this maze.

4. **Learning Parameters**:
   - `alpha` (learning rate, `0.1`): Determines how much new information overrides the old estimate.
   - `gamma` (discount factor, `0.9`): Controls how much future rewards count relative to immediate ones.
   - `epsilon` (exploration rate, initially `0.1`): Probability of choosing a random action to encourage exploration.

5. **Q-Learning Algorithm**:
   - For each episode, the agent starts at `start` and keeps acting until it reaches `goal`.
   - **Choose an action**: Based on `epsilon`, the agent either explores randomly or exploits the best action (highest Q-value) for the current state.
   - **Update Q-values**: After every step, the Q-value of the state-action pair just taken is updated from the received reward and the highest Q-value of the next state.
   - **Decrease `epsilon`** after each episode (multiplied by `0.99`, with a floor of `0.01`) to reduce exploration over time, allowing the agent to rely more on learned behavior.

6. **Training Process**:
   - The agent navigates the maze for `episodes` number of times, gradually improving its path by adjusting Q-values based on rewards.
   - Rewards include:
     - `-1` for hitting a wall.
     - `1` for reaching the goal.
     - `-0.1` for each move to encourage efficiency.

7. **Path Extraction**:
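   - After training, the agent follows the highest-valued action in each state, starting from `(0, 0)`. If training has converged, this greedy path is the learned route to the goal; the loop stops early if the greedy action would run into a wall.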
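
The update in step 5 is the standard Q-learning rule, `Q(s, a) += alpha * (reward + gamma * max Q(s', a') - Q(s, a))`. The following is a small worked example of that arithmetic, assuming the notebook's `alpha = 0.1` and `gamma = 0.9`. The `q_update` helper and the sample inputs are illustrative only and are not part of the original cell.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.9):
    """Illustrative helper: return the new Q(s, a) after one Q-learning update."""
    td_target = reward + gamma * max_next_q  # reward plus discounted best future value
    return q_sa + alpha * (td_target - q_sa)  # move a fraction alpha toward the target


# First time the agent steps into the goal: Q starts at 0, the reward is 1,
# and the goal state's Q-values are still 0 because the goal is never updated.
first = q_update(q_sa=0.0, reward=1.0, max_next_q=0.0)
print(first)

# An ordinary move (reward -0.1) into a cell whose best Q-value is already 1.0.
second = q_update(q_sa=0.0, reward=-0.1, max_next_q=1.0)
print(second)
```

Repeating the update drives each Q-value toward its target. This is consistent with the printed table: moving right from `(0, 3)` has value `1.0`, moving right from `(0, 2)` has value `-0.1 + 0.9 * 1.0 = 0.8`, and moving up from `(1, 2)` has value `-0.1 + 0.9 * 0.8 = 0.62`.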

### Output
- **Trained Q-Table**: Displays Q-values learned for each state-action combination.
- **Path Taken by the Agent**: Shows the sequence of states the agent follows to reach the goal based on the Q-values learned during training.
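
To connect the two outputs, the sketch below reads the greedy action out of individual Q-table rows and follows it from the start. The Q-values are copied from the rows visible in the printed table above; plain Python lists are used instead of NumPy so the snippet runs on its own, and `values.index(max(values))` picks the first maximum just as `np.argmax` does. The names `q_rows`, `moves`, and `action_names` are assumptions made for this example.

```python
action_names = ["up", "down", "left", "right"]  # same order as action_dict
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
start, goal = (0, 0), (0, 4)

# Q-values for selected states, copied from the printed Q-table
q_rows = {
    (0, 0): [-0.67108401, -0.0434062, -0.6596749, -0.56884186],
    (0, 2): [-0.25556198, 0.06024417, -0.21397658, 0.8],
    (0, 3): [-0.1, -0.10965367, 0.05219, 1.0],
    (1, 0): [-0.32962058, 0.062882, -0.56763977, -0.70338836],
    (1, 2): [0.62, -0.02746457, -0.37665468, -0.37357948],
    (2, 0): [-0.18731646, -0.57341646, -0.44370292, 0.18098],
    (2, 1): [-0.46888239, -0.41653968, -0.09074595, 0.3122],
    (2, 2): [0.458, -0.16064348, -0.10655847, -0.13384147],
}

state = start
path = [state]
# Follow the greedy action until the goal, or until a state has no copied row
while state != goal and state in q_rows:
    values = q_rows[state]
    best = values.index(max(values))  # index of the largest Q-value
    print(state, "->", action_names[best])
    d_row, d_col = moves[best]
    state = (state[0] + d_row, state[1] + d_col)
    path.append(state)

print(path)
```

Each state on this route has exactly one positive Q-value, which is the action that moves the agent one step closer to the goal.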

### Example Usage
Running the cell trains the agent for 1000 episodes and then prints the Q-table and the greedy path from `(0, 0)` toward `(0, 4)`. A path that ends at the goal indicates that the agent has learned a route around the walls; a path that stops short indicates that training has not yet converged.
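
One way to inspect the result is to draw the path on the grid. The sketch below is a hypothetical `render` helper, not part of the notebook. The `path` list used here is the route implied by the greedy actions in the visible rows of the printed Q-table; in practice, pass in the `path` list produced by the cell.

```python
maze = [
    [0, -1, 0, 0, 1],
    [0, -1, 0, -1, -1],
    [0, 0, 0, 0, 0],
    [-1, -1, 0, -1, 0],
    [0, 0, 0, -1, 0],
]
# Assumed example path; replace with the notebook's own `path` output
path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 3), (0, 4)]


def render(maze, path):
    """Hypothetical helper: draw the maze with S = start, G = goal, # = wall, * = path."""
    on_path = set(path)
    lines = []
    for r, row in enumerate(maze):
        cells = []
        for c, cell in enumerate(row):
            if (r, c) == path[0]:
                cells.append("S")
            elif cell == 1:
                cells.append("G")
            elif cell == -1:
                cells.append("#")
            elif (r, c) in on_path:
                cells.append("*")
            else:
                cells.append(".")
        lines.append(" ".join(cells))
    return "\n".join(lines)


print(render(maze, path))
```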