# Implementing Reinforcement Learning: An Agent Exploring a Maze Environment

To implement Reinforcement Learning (RL) in a maze environment, we can use Q-Learning, a model-free RL algorithm: it learns the value of each action directly from experience, without needing a model of how the maze responds to moves. The agent explores the maze, takes actions, and uses the rewards it receives to learn how to reach a goal.

We'll break down the process into the following steps:

1. Environment Setup: We’ll define a maze where the agent can move.
2. Q-Learning Algorithm: The agent will learn from exploring the maze using the Q-Learning algorithm.
3. Agent Movement: The agent will move based on its Q-values and explore until it finds the goal.
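
Before the full implementation, the update at the heart of Q-Learning can be sketched for a single step. The snippet below is a minimal illustration rather than part of the maze code: the function name `q_update` and the sample numbers are assumptions chosen for the example, while the learning rate (0.1) and discount factor (0.9) are the defaults used by the agent class in this notebook.

```python
def q_update(old_q, reward, best_next_q, learning_rate=0.1, discount_factor=0.9):
    """Return the new Q-value after one Q-Learning update (hypothetical helper)."""
    # Target: immediate reward plus the discounted best value of the next state
    target = reward + discount_factor * best_next_q
    # Blend the old estimate with the target, weighted by the learning rate
    return (1 - learning_rate) * old_q + learning_rate * target

# Example: a step that costs -1 and lands in a state whose best known Q-value is 10
new_q = q_update(old_q=0.0, reward=-1, best_next_q=10.0)
print(new_q)
```

Here the target is -1 + 0.9 * 10 = 8, and a learning rate of 0.1 moves the old estimate of 0 one tenth of the way toward it, giving roughly 0.8.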

In [2]:
# Step 1: Environment Setup
# First, we define a simple maze environment. The maze is a grid of cells in which the agent
# can move in four directions: up, down, left, and right. The agent receives a reward of +10
# when it reaches the goal and a penalty of -1 for every other step. A move into a wall or
# off the grid leaves the agent where it is and still costs -1.

import numpy as np
import random

# Define the maze grid, where 0 is an empty space, 1 is a wall, and 9 is the goal.
maze = np.array([
    [0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 9],
    [0, 0, 0, 0, 0, 0]
])

# Define the action space
# 0: Up, 1: Down, 2: Left, 3: Right
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # (row, col) offsets; row 0 is the top row

# Start and goal positions
start = (0, 0)  # Start position at top-left corner
goal = (3, 5)  # Goal position: the cell marked 9 (row 3, last column)

# Create a simple environment class
class MazeEnv:
    def __init__(self, maze, start, goal):
        self.maze = maze
        self.start = start
        self.goal = goal
        self.agent_position = start
        
    def reset(self):
        self.agent_position = self.start
        return self.agent_position

    def step(self, action):
        # Get the new position after taking the action
        move = actions[action]
        new_position = (self.agent_position[0] + move[0], self.agent_position[1] + move[1])
        
        # Check for boundaries or walls
        if (0 <= new_position[0] < self.maze.shape[0] and 0 <= new_position[1] < self.maze.shape[1] and 
            self.maze[new_position] != 1):
            self.agent_position = new_position
        
        # Check if goal is reached
        if self.agent_position == self.goal:
            return self.agent_position, 10, True  # 10 reward for reaching the goal
        
        return self.agent_position, -1, False  # -1 penalty for each step
        
    def render(self):
        # Visualize the agent's position in the maze
        maze_copy = np.copy(self.maze)
        maze_copy[self.agent_position] = 2  # Mark agent's position
        print(maze_copy)
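
To make the boundary and wall handling in `step` concrete, here is a small standalone sketch. The helper `try_move` and the 2x3 `demo_maze` are hypothetical names introduced for illustration; the check mirrors the condition used in `MazeEnv.step`, assuming moves are given as (row, col) offsets.

```python
import numpy as np

# Hypothetical 2x3 maze for illustration: 0 = empty, 1 = wall
demo_maze = np.array([
    [0, 1, 0],
    [0, 0, 0],
])

def try_move(maze, position, move):
    """Return the new position, or the old one if the move is blocked (illustrative helper)."""
    row, col = position[0] + move[0], position[1] + move[1]
    inside = 0 <= row < maze.shape[0] and 0 <= col < maze.shape[1]
    if inside and maze[row, col] != 1:
        return (row, col)
    return position  # blocked by a wall or by the edge of the grid

print(try_move(demo_maze, (0, 0), (0, 1)))   # wall to the right: agent stays put
print(try_move(demo_maze, (0, 0), (-1, 0)))  # off the top edge: agent stays put
print(try_move(demo_maze, (0, 0), (1, 0)))   # free cell below: agent moves
```

Checking the bounds before indexing the array matters: a negative row or column would otherwise be treated by NumPy as an index counted from the end and silently wrap around to the opposite side of the grid.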


In [3]:
# Step 2: Q-Learning Algorithm
# Now, we’ll implement the Q-Learning algorithm, which involves updating Q-values based on the agent's experiences. The agent learns the best actions to take by receiving feedback from the environment.

class QLearningAgent:
    def __init__(self, env, learning_rate=0.1, discount_factor=0.9, exploration_rate=1.0, exploration_decay=0.995):
        self.env = env
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        self.exploration_decay = exploration_decay
        self.q_table = np.zeros((env.maze.shape[0], env.maze.shape[1], len(actions)))  # Q-value table

    def choose_action(self, state):
        if random.uniform(0, 1) < self.exploration_rate:
            # Exploration: choose a random action
            return random.choice(range(len(actions)))
        else:
            # Exploitation: choose the best action based on Q-values
            return np.argmax(self.q_table[state[0], state[1], :])

    def learn(self, state, action, reward, next_state):
        # Get the best future Q-value for the next state
        future_q = np.max(self.q_table[next_state[0], next_state[1], :])
        
        # Update Q-value for the current state-action pair
        self.q_table[state[0], state[1], action] = (1 - self.learning_rate) * self.q_table[state[0], state[1], action] + \
                                                      self.learning_rate * (reward + self.discount_factor * future_q)

    def train(self, episodes=1000):
        for episode in range(episodes):
            state = self.env.reset()
            done = False
            total_reward = 0
            while not done:
                action = self.choose_action(state)
                next_state, reward, done = self.env.step(action)
                self.learn(state, action, reward, next_state)
                state = next_state
                total_reward += reward
                
            # Decay the exploration rate
            self.exploration_rate *= self.exploration_decay
            if episode % 100 == 0:
                print(f"Episode {episode}, Total Reward: {total_reward}, Exploration Rate: {self.exploration_rate}")

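    def test(self, max_steps=100):
        # Follow the greedy policy (no random exploration) so the run shows what was learned.
        # max_steps guards against an endless loop if the learned policy never reaches the goal.
        state = self.env.reset()
        done = False
        total_reward = 0
        steps = 0
        while not done and steps < max_steps:
            action = int(np.argmax(self.q_table[state[0], state[1], :]))
            next_state, reward, done = self.env.step(action)
            state = next_state
            total_reward += reward
            steps += 1
            self.env.render()
        if done:
            print(f"Goal reached with total reward: {total_reward}")
        else:
            print(f"Stopped after {max_steps} steps without reaching the goal")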
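
The exploration schedule in `train` is a geometric decay: after n completed episodes, the rate equals the initial rate multiplied by `exploration_decay` raised to the power n. The sketch below uses a hypothetical helper, `exploration_after`, to show how quickly the default settings (1.0 and 0.995) shift the agent from exploring to exploiting.

```python
def exploration_after(episodes, initial_rate=1.0, decay=0.995):
    """Exploration rate after a number of completed episodes (illustrative helper)."""
    return initial_rate * decay ** episodes

# Print the rate at a few points during a 1000-episode run
for n in (1, 101, 501, 1000):
    print(n, round(exploration_after(n), 4))
```

Because `train` decays the rate before printing, the log line for episode 100 reports the value after 101 decays, about 0.60. A slower decay (a value closer to 1) keeps the agent exploring for longer, which may help in larger mazes at the cost of more training episodes.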


In [4]:
# Step 3: Training the Agent
# Now, let’s create an instance of the environment and the QLearning agent, train the agent, and then test it.

# Create the environment and agent
env = MazeEnv(maze, start, goal)
agent = QLearningAgent(env)

# Train the agent
agent.train(episodes=1000)

# Test the agent by visualizing the path to the goal
print("Testing the trained agent:")
agent.test()


Episode 0, Total Reward: -181, Exploration Rate: 0.995
Episode 100, Total Reward: -5, Exploration Rate: 0.6027415843082742
Episode 200, Total Reward: 0, Exploration Rate: 0.36512303261753626
Episode 300, Total Reward: -1, Exploration Rate: 0.2211807388415433
Episode 400, Total Reward: 1, Exploration Rate: 0.13398475271138335
Episode 500, Total Reward: 2, Exploration Rate: 0.0811640021330769
Episode 600, Total Reward: 1, Exploration Rate: 0.04916675299948831
Episode 700, Total Reward: 1, Exploration Rate: 0.029783765425331846
Episode 800, Total Reward: 1, Exploration Rate: 0.018042124582040707
Episode 900, Total Reward: 3, Exploration Rate: 0.010929385683282892
Testing the trained agent:
[[0 2 0 1 0 0]
 [0 1 0 1 0 0]
 [0 1 0 0 0 0]
 [0 1 1 1 0 9]
 [0 0 0 0 0 0]]
[[0 0 2 1 0 0]
 [0 1 0 1 0 0]
 [0 1 0 0 0 0]
 [0 1 1 1 0 9]
 [0 0 0 0 0 0]]
[[0 0 0 1 0 0]
 [0 1 2 1 0 0]
 [0 1 0 0 0 0]
 [0 1 1 1 0 9]
 [0 0 0 0 0 0]]
[[0 0 0 1 0 0]
 [0 1 0 1 0 0]
 [0 1 2 0 0 0]
 [0 1 1 1 0 9]
 [0 0 0 0 0 0]]
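
After training, the learned behaviour can also be read directly from the Q-table by taking the highest-valued action in each cell. The following is a minimal sketch under stated assumptions: `greedy_policy` is a hypothetical helper, the action order (up, down, left, right) follows the comment in the environment setup, and the small hand-filled `demo_q` table stands in for a trained `agent.q_table`.

```python
import numpy as np

ARROWS = ["^", "v", "<", ">"]  # assumed action order: 0 Up, 1 Down, 2 Left, 3 Right

def greedy_policy(q_table):
    """Return a grid of arrows showing the best action per cell (illustrative helper)."""
    best = np.argmax(q_table, axis=2)  # index of the highest Q-value in each cell
    return [[ARROWS[a] for a in row] for row in best]

# Hand-filled 1x3 Q-table for illustration only (these are not trained values)
demo_q = np.zeros((1, 3, 4))
demo_q[0, 0, 3] = 1.0   # first cell prefers Right
demo_q[0, 1, 3] = 2.0   # second cell prefers Right
demo_q[0, 2, 1] = 0.5   # third cell prefers Down

for row in greedy_policy(demo_q):
    print(" ".join(row))
```

With a trained agent this would be called as `greedy_policy(agent.q_table)`. Cells the agent never occupies, such as walls, keep all-zero Q-values, so `np.argmax` reports the first action for them; those arrows carry no meaning.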


In [None]:
# Key Points:

# State Representation: The state is the agent's position in the maze, a (row, column) pair.
# Action Space: The agent can choose from four possible actions (up, down, left, right).
# Q-Table: A NumPy array of shape (rows, columns, 4) that holds one Q-value for every state-action pair, initialized to zero.
# Exploration vs. Exploitation: The agent starts with an exploration rate of 1.0, so every action is random. The rate is multiplied by 0.995 after each episode, so the agent increasingly exploits what it has learned by choosing the action with the highest Q-value.


# How It Works:

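# The agent starts at the top-left corner of the maze (start = (0, 0)).
# At each step it either picks a random action (with probability equal to the exploration rate) or the action with the highest Q-value for its current cell.
# Every step costs -1, except the step that reaches the goal, which ends the episode with a reward of +10.
# After each step the agent updates the Q-value of the action it took, using the reward and the best Q-value available from the next cell.
# Over time, the agent becomes better at navigating the maze. In the training log above, the total reward reaches 3 by episode 900, which corresponds to the shortest path of 8 moves (seven -1 penalties plus the +10 goal reward).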

# Output:

# During training, the total reward and the current exploration rate are printed every 100 episodes. After training, the agent demonstrates its learned policy by navigating toward the goal; each step is rendered as a grid in which 2 marks the agent's position.

# This is a basic implementation of Q-learning in a maze environment.