In this implementation, the EpsilonGreedyPolicy class encapsulates the epsilon-greedy policy, and its select_action method picks an action from the given Q-values. With probability ε the method chooses an action uniformly at random; otherwise it chooses the action with the highest Q-value. Adjusting ε controls the trade-off between exploration and exploitation.

In [1]:
import numpy as np

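class EpsilonGreedyPolicy:
    """Epsilon-greedy action selection over a fixed number of actions."""

    def __init__(self, epsilon, num_actions):
        # epsilon is a probability, so it must lie in [0, 1]
        if not 0.0 <= epsilon <= 1.0:
            raise ValueError("epsilon must be between 0 and 1")
        self.epsilon = epsilon
        self.num_actions = num_actions

    def select_action(self, q_values):
        """Return the index of the chosen action, given one Q-value per action."""
        if np.random.uniform(0, 1) < self.epsilon:
            # Randomly choose an action (exploration)
            return np.random.randint(0, self.num_actions)
        else:
            # Choose the action with the highest Q-value (exploitation);
            # np.argmax returns the first index when several values tie
            return int(np.argmax(q_values))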

# Example usage
epsilon = 0.1  # Exploration rate
num_actions = 4  # Number of actions
q_values = [0.2, 0.5, 0.8, 0.3]  # Q-values for each action

# Create an epsilon-greedy policy with the specified epsilon and number of actions
policy = EpsilonGreedyPolicy(epsilon, num_actions)

# Select an action using the epsilon-greedy policy based on the given Q-values
action = policy.select_action(q_values)
print("Selected Action:", action)


Selected Action: 2
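
A single call returns only one sample, so the printed action can differ between runs. To see the policy's long-run behaviour, the following sketch repeats the selection many times and compares the observed frequencies with the probabilities implied by the rule: ε/n for each non-greedy action and 1-ε + ε/n for the greedy one, where n is the number of actions. The standalone select_action function, the seed, and the number of trials are assumptions made for this illustration, not part of the class above.

```python
import numpy as np

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy selection; same rule as the select_action method."""
    if rng.uniform(0, 1) < epsilon:
        return int(rng.integers(0, len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))  # exploit: highest Q-value

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability
q_values = [0.2, 0.5, 0.8, 0.3]
epsilon = 0.1
num_trials = 100_000  # hypothetical sample size

# Count how often each action is selected
counts = np.zeros(len(q_values), dtype=int)
for _ in range(num_trials):
    counts[select_action(q_values, epsilon, rng)] += 1

frequencies = counts / num_trials

# Probabilities implied by the rule: epsilon / n for every action,
# plus 1 - epsilon for the greedy action
expected = np.full(len(q_values), epsilon / len(q_values))
expected[np.argmax(q_values)] += 1 - epsilon

print("Empirical:", frequencies)
print("Expected: ", expected)
```

The empirical frequencies should land close to the expected values, with the greedy action (index 2) dominating.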


The epsilon-greedy algorithm is a simple yet effective strategy for balancing exploration and exploitation in reinforcement learning. Here's how it works:

    Exploration vs. Exploitation: In reinforcement learning, an agent repeatedly faces a choice between trying actions whose value is still uncertain (exploration) and taking the action it currently believes is best (exploitation). Exploration lets the agent discover potentially better actions, while exploitation uses its current knowledge to maximize reward.

    Epsilon: Epsilon (ε) is a parameter between 0 and 1 that sets the probability of exploration. On each step, the agent chooses a random action (explores) with probability ε and chooses the action with the highest estimated value (exploits) with probability 1-ε.

    Greedy Action Selection: In the epsilon-greedy algorithm, the agent selects the action with the highest estimated value (greedy action) with probability 1-ε. This is the exploitation step, where the agent chooses the action that it believes will yield the highest reward based on its current knowledge.

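    Random Action Selection: With probability ε, the agent selects an action uniformly at random from the action space. This is the exploration step, where the agent tries actions regardless of their current estimates to gather more information about their rewards. Because the random draw can also land on the greedy action, the greedy action is chosen with overall probability 1-ε + ε/n when there are n actions.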

    Balancing Exploration and Exploitation: By adjusting the value of ε, the agent can balance exploration and exploitation. A higher value of ε encourages more exploration, while a lower value of ε favors exploitation. At ε = 0 the agent is purely greedy, and at ε = 1 it acts purely at random.
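
To make the balance concrete, here is a minimal sketch that runs the same selection rule on a small simulated problem for several values of ε. Everything beyond the selection rule is an assumption made for illustration: the run_bandit helper, the Bernoulli rewards, the success probabilities (borrowed from the example Q-values), the sample-average update of the estimates, the seed, and the number of steps.

```python
import numpy as np

def run_bandit(epsilon, reward_probs, num_steps, rng):
    """Run epsilon-greedy on a Bernoulli bandit and return the average reward."""
    num_actions = len(reward_probs)
    q_estimates = np.zeros(num_actions)  # estimated value of each action
    pull_counts = np.zeros(num_actions, dtype=int)
    total_reward = 0.0
    for _ in range(num_steps):
        if rng.uniform(0, 1) < epsilon:
            action = int(rng.integers(0, num_actions))  # explore
        else:
            action = int(np.argmax(q_estimates))  # exploit
        # Reward is 1 with the chosen action's success probability, else 0
        reward = float(rng.uniform(0, 1) < reward_probs[action])
        pull_counts[action] += 1
        # Incremental sample-average update of the chosen action's estimate
        q_estimates[action] += (reward - q_estimates[action]) / pull_counts[action]
        total_reward += reward
    return total_reward / num_steps

reward_probs = [0.2, 0.5, 0.8, 0.3]  # hypothetical success probabilities
rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

average_rewards = {
    eps: run_bandit(eps, reward_probs, num_steps=5000, rng=rng)
    for eps in (0.0, 0.1, 0.5, 1.0)
}
for eps, avg in average_rewards.items():
    print(f"epsilon={eps:.1f}  average reward={avg:.3f}")
```

In this particular setup, ε = 0 never leaves the first action: all estimates start at zero and np.argmax breaks ties by taking the lowest index, so the agent never learns that another action pays better. ε = 1 ignores the estimates entirely and earns roughly the mean of the success probabilities. Intermediate values explore enough to find the best action while still exploiting it most of the time.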