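This notebook implements the Q-learning algorithm for a grid world environment, in which an agent learns to navigate from a start state to a goal state. The first code cell is a compact warm-up on a 3x4 reward grid; the second wraps a 3x3 grid world in a GridWorld class.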

Simple Grid World Environment (3x3)
The grid world is a 3x3 grid where:

The agent starts at the top-left corner (0, 0).
The goal is at the bottom-right corner (2, 2).

GridWorld Environment: A simple 3x3 grid world with a start position and a goal position.
Q-Table: A 3D array to store Q-values for each state-action pair. The dimensions are [grid_size, grid_size, num_actions].
Hyperparameters: Learning rate (alpha), discount factor (gamma), exploration rate (epsilon), and number of training episodes.
Training Loop: The agent is trained using the Q-learning algorithm. For each episode, the environment is reset, and the agent explores or exploits based on the epsilon-greedy policy. The Q-values are updated using the Q-learning formula.
Evaluation: The trained agent is evaluated in the grid world, and the path taken by the agent is rendered.
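
The "Q-learning formula" mentioned in the training loop is not written out in the text. A standard statement of the one-step update, which appears to be what both code cells implement, is given here; the symbols s' (next state) and a' (candidate next action) are notation introduced for this sketch.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

Expanding the bracket gives the equivalent form (1 - alpha) * Q(s, a) + alpha * (r + gamma * max Q(s', a')), so the two ways of writing the update differ only in arrangement.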

Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward. Unlike supervised learning, where the model is trained on a dataset with labeled examples, reinforcement learning involves learning from the consequences of actions, exploring and exploiting to find the best strategies.

Key Concepts in Reinforcement Learning
Agent: The learner or decision-maker.
Environment: Everything the agent interacts with.
State: A representation of the current situation of the environment.
Action: Choices made by the agent that affect the state.
Reward: Feedback from the environment based on the action taken.
Policy: The strategy that the agent employs to determine actions based on states.
Value Function: A prediction of future rewards used to evaluate the desirability of states or actions.
Q-Learning: A popular RL algorithm that learns the value of state-action pairs (Q-values), from which a policy can be derived.
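
The concepts above can be tied together in a few lines. The following is a minimal sketch, not the notebook's own code: the helper names epsilon_greedy and q_update and the two-state toy table are assumptions made for illustration. It shows a policy (epsilon-greedy) choosing an action from a state's Q-values, and a reward being folded into the value estimate by one Q-learning update.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Policy: pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q-value."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]

# Hypothetical toy problem: 2 states, 2 actions, all Q-values start at zero.
q = np.zeros((2, 2))
rng = np.random.default_rng(0)

# epsilon=0.0 disables exploration, so the greedy action is always chosen.
action = epsilon_greedy(q[0], epsilon=0.0, rng=rng)

# Pretend the environment returned reward 1.0 and moved the agent to state 1.
new_value = q_update(q, state=0, action=action, reward=1.0,
                     next_state=1, alpha=0.1, gamma=0.9)
print(action, new_value)
```

With an all-zero table the greedy action is index 0, and a single update with alpha = 0.1 moves its value one tenth of the way toward the target of 1.0.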

Applications of Reinforcement Learning
Robotics: Learning to perform tasks like walking, grasping, and manipulating objects.
Gaming: Creating agents that can play and excel at board games such as chess and Go, as well as video games.
Autonomous Vehicles: Decision-making for navigation and control.
Healthcare: Personalized treatment plans and medical decision-making.
Finance: Algorithmic trading and portfolio management.

In [1]:
import numpy as np
import random

# Define the environment as a 3x4 grid of rewards:
# 1 = goal at [0, 3], -1 = penalty cells, 0 = empty cells
grid = [
    [0, 0, 0, 1],
    [0, -1, 0, -1],
    [0, 0, 0, 0]
]

# Define actions: 0=up, 1=right, 2=down, 3=left
actions = 4

# Q-table
q_table = np.zeros((len(grid), len(grid[0]), actions))

# Hyperparameters
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 0.1  # Exploration rate

# Training parameters
episodes = 1000

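def choose_action(state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, actions - 1)  # Explore: random action
    else:
        return np.argmax(q_table[state[0]][state[1]])  # Exploit: best known action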

def get_next_state(state, action):
    if action == 0:  # Up
        next_state = [max(state[0] - 1, 0), state[1]]
    elif action == 1:  # Right
        next_state = [state[0], min(state[1] + 1, len(grid[0]) - 1)]
    elif action == 2:  # Down
        next_state = [min(state[0] + 1, len(grid) - 1), state[1]]
    else:  # Left
        next_state = [state[0], max(state[1] - 1, 0)]
    return next_state

# Training loop
for _ in range(episodes):
    state = [2, 0]  # Start state
    
    while state != [0, 3]:  # Until goal state is reached
        action = choose_action(state, epsilon)
        next_state = get_next_state(state, action)
        reward = grid[next_state[0]][next_state[1]]
        
        # Q-value update
        old_q = q_table[state[0]][state[1]][action]
        next_max = np.max(q_table[next_state[0]][next_state[1]])
        new_q = (1 - alpha) * old_q + alpha * (reward + gamma * next_max)
        q_table[state[0]][state[1]][action] = new_q
        
        state = next_state

# Print the learned Q-table
print("Learned Q-table:")
print(q_table)

Learned Q-table:
[[[ 0.64053665  0.81        0.60643746  0.69455038]
  [ 0.78572591  0.9        -0.28965447  0.67666503]
  [ 0.85753237  1.          0.72352471  0.71151964]
  [ 0.          0.          0.          0.        ]]

 [[ 0.729      -0.34260979  0.57067222  0.60391069]
  [ 0.80174241  0.10954271  0.          0.17777732]
  [ 0.89747569 -0.091       0.03914909  0.        ]
  [ 0.3439      0.          0.          0.        ]]

 [[ 0.6561      0.31929802  0.57360514  0.55233607]
  [-0.14976338  0.51781876  0.03082074  0.059049  ]
  [ 0.72838184  0.          0.06476343  0.        ]
  [-0.23122     0.          0.          0.        ]]]
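
One way to sanity-check the printed Q-table: the only positive reward is the 1 received on entering the goal, so along a penalty-free shortest path the best Q-value should be gamma raised to the number of remaining steps minus one. The sketch below compares that prediction with the values printed above for the path [2,0] -> [1,0] -> [0,0] -> [0,1] -> [0,2] -> goal. The list names are made up for this check, and the numbers are copied from the output rather than recomputed.

```python
gamma = 0.9  # Discount factor used in the cell above

# Q-values copied from the printed table for the greedy action at each state
# on the path: up at [2,0], up at [1,0], then right at [0,0], [0,1], [0,2].
printed = [0.6561, 0.729, 0.81, 0.9, 1.0]

# Predicted optimal value when `steps_left` moves remain before the goal.
expected = [gamma ** (steps_left - 1) for steps_left in range(5, 0, -1)]

for p, e in zip(printed, expected):
    print(f"{p:.4f}  {e:.4f}")
```

The two columns agree to four decimal places, which suggests the values along this path have converged. Entries for rarely visited state-action pairs, such as those in the bottom-right corner, are still far from their limits.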


In [1]:
import numpy as np
import random

# Define the environment
class GridWorld:
    def __init__(self):
        self.size = 3
        self.start = (0, 0)
        self.goal = (2, 2)
        self.reset()

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        """Move the agent and return (next_state, reward, done)."""
        row, col = self.agent_pos  # Positions are (row, column); row 0 is the top
        if action == 0:  # Up
            row = max(row - 1, 0)
        elif action == 1:  # Down
            row = min(row + 1, self.size - 1)
        elif action == 2:  # Left
            col = max(col - 1, 0)
        elif action == 3:  # Right
            col = min(col + 1, self.size - 1)

        self.agent_pos = (row, col)
        if self.agent_pos == self.goal:
            return self.agent_pos, 1, True  # Reward 1 only on reaching the goal

        return self.agent_pos, 0, False

    def render(self):
        """Print the grid: 'A' = agent, 'G' = goal, '.' = empty cell."""
        grid = np.full((self.size, self.size), '.')
        grid[self.goal] = 'G'
        grid[self.agent_pos] = 'A'
        print('\n'.join(' '.join(row) for row in grid))
        print()

# Initialize the environment
env = GridWorld()

# Q-Learning parameters
q_table = np.zeros((env.size, env.size, 4))
alpha = 0.1  # Learning rate
gamma = 0.99  # Discount factor
epsilon = 0.1  # Exploration rate
num_episodes = 1000

# Training the agent
for episode in range(num_episodes):
    state = env.reset()
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = random.choice([0, 1, 2, 3])  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values
        
        next_state, reward, done = env.step(action)
        old_value = q_table[state][action]
        next_max = np.max(q_table[next_state])
        
        new_value = old_value + alpha * (reward + gamma * next_max - old_value)
        q_table[state][action] = new_value
        
        state = next_state

# Evaluate the agent by following the greedy policy
state = env.reset()
env.render()
done = False
max_steps = 20  # Safety cap so a poorly trained policy cannot loop forever
for _ in range(max_steps):
    action = np.argmax(q_table[state])
    state, reward, done = env.step(action)
    env.render()
    if done:
        break
print("Goal reached!" if done else "Failed to reach the goal.")


A . .
. . .
. . G

. A .
. . .
. . G

. . A
. . .
. . G

. . .
. . A
. . G

. . .
. . .
. . A

Goal reached!
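
The rendered path shows only the states the agent visits from the start. To see the greedy action in every cell, the Q-table can be turned into a grid of arrows. The sketch below is self-contained and does not reuse the trained q_table: it builds an idealized table, assuming training has fully converged, from the fact that each Q-value equals gamma raised to the Manhattan distance from the next state to the goal. All names here (move, steps_to_goal, ideal_q_table, policy_grid) are hypothetical helpers, and the action encoding follows the GridWorld class above.

```python
import numpy as np

SIZE, GOAL, GAMMA = 3, (2, 2), 0.99
# Same action encoding as the GridWorld class: 0=up, 1=down, 2=left, 3=right
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
ARROWS = {0: "^", 1: "v", 2: "<", 3: ">"}

def move(state, action):
    """Return the next (row, col), clamped to the grid edges."""
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    return (row, col)

def steps_to_goal(state):
    """Manhattan distance from a state to the goal."""
    return abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1])

def ideal_q_table():
    """Idealized converged Q-values (an assumption, not the trained table)."""
    q = np.zeros((SIZE, SIZE, 4))
    for row in range(SIZE):
        for col in range(SIZE):
            if (row, col) == GOAL:
                continue  # Terminal state is never updated, so it stays zero
            for action in MOVES:
                q[row, col, action] = GAMMA ** steps_to_goal(move((row, col), action))
    return q

def policy_grid(q):
    """Render the greedy action per cell as an arrow; 'G' marks the goal."""
    lines = []
    for row in range(SIZE):
        cells = []
        for col in range(SIZE):
            if (row, col) == GOAL:
                cells.append("G")
            else:
                cells.append(ARROWS[int(np.argmax(q[row, col]))])
        lines.append(" ".join(cells))
    return "\n".join(lines)

print(policy_grid(ideal_q_table()))
```

Where two actions are equally good, np.argmax returns the lower index, so this idealized table prefers down over right. The trained agent above went right first; both routes take four moves, and which one a trained table prefers depends on small differences in the learned values.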
