## Installation

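Before running the code, make sure the Gym library is installed. The code in this notebook uses the classic Gym API, in which `reset()` returns only the observation and `step()` returns a 4-tuple; Gym 0.26 and later changed both signatures. You can install Gym with: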

```python
!pip install gym
```

```python
import gym
import numpy as np
import random
```

The `initialize_q_values` function returns a Q-table initialized to zeros, with one row per state and one column per action. Its dimensions are set by `num_states` (the number of states) and `num_actions` (the number of possible actions).

```python
def initialize_q_values(num_states, num_actions):
    # One row per state, one column per action, all values starting at zero.
    return np.zeros((num_states, num_actions))
```
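As a quick illustration, the sketch below builds a Q-table for a hypothetical problem with 3 states and 2 actions (sizes chosen only for this example) and reads one entry. For Taxi-v3 the same call would use 500 states and 6 actions.

```python
import numpy as np

def initialize_q_values(num_states, num_actions):
    return np.zeros((num_states, num_actions))

# Hypothetical sizes, chosen only for illustration.
q_table = initialize_q_values(num_states=3, num_actions=2)

print(q_table.shape)  # (3, 2)
print(q_table[1, 0])  # 0.0 (value of action 0 in state 1)
```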

The `epsilon_greedy_action` function samples a random action from the action space (`action_space`) with probability `epsilon`; otherwise it chooses the action with the highest Q-value for the given state in the Q-table (`q_values`).

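```python
def epsilon_greedy_action(q_values, state, epsilon, action_space):
    # Explore: with probability epsilon, pick a random action.
    if random.uniform(0, 1) < epsilon:
        return action_space.sample()
    # Exploit: otherwise pick the action with the highest Q-value.
    return np.argmax(q_values[state])
```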
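The two branches can be checked in isolation without Gym. The sketch below uses `ToyActionSpace`, a hypothetical stand-in that only mimics the `sample()` method of a Gym discrete action space, and made-up Q-values for a single state.

```python
import random
import numpy as np

class ToyActionSpace:
    """Hypothetical stand-in for a Gym discrete action space."""
    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

def epsilon_greedy_action(q_values, state, epsilon, action_space):
    if random.uniform(0, 1) < epsilon:
        return action_space.sample()
    return np.argmax(q_values[state])

# Made-up Q-values for one state with three actions; action 1 is the best.
q_values = np.array([[0.0, 1.5, -0.3]])
space = ToyActionSpace(3)

# epsilon = 0.0 never explores, so the greedy action is always returned.
greedy = epsilon_greedy_action(q_values, 0, 0.0, space)
print(greedy)  # 1

# epsilon = 1.0 always explores, so every action gets sampled over many draws.
random.seed(0)
explored = [epsilon_greedy_action(q_values, 0, 1.0, space) for _ in range(1000)]
print(sorted(set(explored)))
```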

The `update_q_values` function applies the SARSA update rule to one state-action pair in the Q-table. It moves the current Q-value toward the temporal-difference target, `reward + discount_factor * q_values[next_state, next_action]`, by a step proportional to the learning rate (`learning_rate`).

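```python
def update_q_values(q_values, state, action, reward, next_state, next_action, learning_rate, discount_factor):
    # TD target: observed reward plus the discounted value of the next state-action pair.
    td_target = reward + discount_factor * q_values[next_state, next_action]
    # Move the current estimate a fraction (learning_rate) of the way toward the target.
    q_values[state, action] += learning_rate * (td_target - q_values[state, action])
```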
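To make the arithmetic concrete, here is one update worked through on a tiny 2x2 Q-table. The table contents, reward, and indices are made-up values for illustration only.

```python
import numpy as np

def update_q_values(q_values, state, action, reward, next_state, next_action,
                    learning_rate, discount_factor):
    q_values[state, action] = q_values[state, action] + learning_rate * (
        reward + discount_factor * q_values[next_state, next_action] - q_values[state, action]
    )

q_values = np.zeros((2, 2))
q_values[1, 0] = 2.0  # assume the next state-action pair is already valued at 2.0

update_q_values(q_values, state=0, action=1, reward=-1,
                next_state=1, next_action=0,
                learning_rate=0.1, discount_factor=0.9)

# TD target = -1 + 0.9 * 2.0 = 0.8
# TD error  = 0.8 - 0.0      = 0.8
# New value = 0.0 + 0.1 * 0.8 = 0.08
print(round(q_values[0, 1], 4))  # 0.08
```

Only the entry for the visited pair `(state, action)` changes; the value used for bootstrapping, `q_values[1, 0]`, is left untouched.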

The `train_sarsa_agent` function trains a SARSA (State-Action-Reward-State-Action) agent by interacting with the environment for `num_episodes` episodes. At each step it selects the next action with the epsilon-greedy strategy and then updates the Q-value of the current state-action pair from the observed reward, using the learning rate (`learning_rate`) and discount factor (`discount_factor`). Because the update uses the action the agent will actually take next, SARSA is an on-policy method. The function returns the trained Q-table.

```python
def train_sarsa_agent(environment, num_episodes, learning_rate, discount_factor, exploration_rate):
    state_space_size = environment.observation_space.n
    action_space_size = environment.action_space.n
    q_values = initialize_q_values(state_space_size, action_space_size)

    for episode in range(num_episodes):
        # Classic Gym API: reset() returns only the initial observation.
        current_state = environment.reset()
        done = False

        # Choose the first action before entering the loop.
        current_action = epsilon_greedy_action(q_values, current_state, exploration_rate, environment.action_space)

        while not done:
            # Classic Gym API: step() returns (observation, reward, done, info).
            next_state, reward, done, _ = environment.step(current_action)

            # SARSA picks the next action first, then uses it in the update.
            next_action = epsilon_greedy_action(q_values, next_state, exploration_rate, environment.action_space)

            update_q_values(q_values, current_state, current_action, reward, next_state, next_action, learning_rate, discount_factor)

            current_state = next_state
            current_action = next_action

    return q_values
```
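The same loop can be exercised without Gym on a problem small enough to check by eye. The sketch below is an assumption-laden toy: `DiscreteSpace` and `CorridorEnv` are hypothetical classes written for this example, mimicking only the classic `reset()`/`step()` interface, and the SARSA functions are condensed copies of the ones above.

```python
import random
import numpy as np

class DiscreteSpace:
    """Hypothetical minimal stand-in for a Gym discrete space."""
    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

class CorridorEnv:
    """Hypothetical 5-cell corridor: start in cell 0, reach cell 4.

    Action 0 moves left, action 1 moves right, and every step costs -1.
    """
    def __init__(self, length=5, max_steps=50):
        self.observation_space = DiscreteSpace(length)
        self.action_space = DiscreteSpace(2)
        self.goal = length - 1
        self.max_steps = max_steps

    def reset(self):
        self.position, self.steps = 0, 0
        return self.position

    def step(self, action):
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.goal)
        self.steps += 1
        done = self.position == self.goal or self.steps >= self.max_steps
        return self.position, -1, done, {}

def epsilon_greedy_action(q, state, epsilon, action_space):
    if random.random() < epsilon:
        return action_space.sample()
    return int(np.argmax(q[state]))

def train_sarsa_agent(env, num_episodes, learning_rate, discount_factor, exploration_rate):
    q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy_action(q, state, exploration_rate, env.action_space)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy_action(q, next_state, exploration_rate, env.action_space)
            target = reward + discount_factor * q[next_state, next_action]
            q[state, action] += learning_rate * (target - q[state, action])
            state, action = next_state, next_action
    return q

random.seed(0)  # fixed seed so the run is repeatable
q_values = train_sarsa_agent(CorridorEnv(), 2000, 0.1, 0.9, 0.1)

# Greedy action in each non-terminal cell (the goal cell is never updated).
greedy_policy = np.argmax(q_values[:-1], axis=1)
print(greedy_policy)
```

With these settings the greedy policy is expected to choose action 1 (right) in every non-terminal cell, since each step to the left only adds cost.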

The `evaluate_agent` function measures a trained agent's performance over `num_trials` episodes. In each episode it always takes the greedy action (the one with the highest Q-value in `q_values`), with no exploration, and it returns the average total reward per episode.

```python
def evaluate_agent(environment, q_values, num_trials=100):
    total_rewards = 0
    for _ in range(num_trials):
        state = environment.reset()
        done = False
        while not done:
            # Always act greedily during evaluation (no exploration).
            action = np.argmax(q_values[state])
            state, reward, done, _ = environment.step(action)
            total_rewards += reward
    # Average total reward per episode.
    return total_rewards / num_trials
```

The following line creates the "Taxi-v3" environment from the Gym library. This environment models a simplified taxi navigation problem in which the agent must pick up a passenger and drop them off at the correct destination.

```python
# Create the Taxi-v3 environment
taxi_environment = gym.make("Taxi-v3")
```
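Taxi-v3 has 500 discrete states and 6 actions, which is what sets the size of the Q-table. The sketch below shows, in plain Python, how the 500 states arise from a 5x5 grid, 5 passenger locations (four landmarks plus "in the taxi"), and 4 destinations. `encode_state` and `decode_state` are illustrative helper names written for this example; they follow the state encoding described in the Gym Taxi documentation.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    # Pack the four components into a single integer in [0, 500).
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination

def decode_state(state):
    # Reverse the packing, peeling off components from the last to the first.
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination

num_states = 5 * 5 * 5 * 4
print(num_states)                # 500
print(encode_state(4, 4, 4, 3))  # 499
print(decode_state(499))         # (4, 4, 4, 3)
```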

The hyperparameters below control how the SARSA agent learns: the learning rate (`learning_rate`), the discount factor (`discount_factor`), the exploration rate (`exploration_rate`, the epsilon of the epsilon-greedy strategy), and the number of training episodes (`num_episodes`). The learning rate sets the size of each Q-value update, the discount factor sets how much future rewards count relative to immediate ones, and the exploration rate sets how often the agent takes a random action.

```python
# Hyperparameters
learning_rate = 0.1
discount_factor = 0.9
exploration_rate = 0.1
num_episodes = 1000
```
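To get a feel for what the learning rate does, the sketch below repeatedly applies the update `q += learning_rate * (target - q)` toward a fixed target. The helper name `value_after_updates`, the target of 1.0, and the choice of 10 updates are assumptions made for illustration; in real training the target changes from step to step.

```python
def value_after_updates(learning_rate, target=1.0, num_updates=10):
    # Start from zero and take num_updates steps toward a fixed target.
    q = 0.0
    for _ in range(num_updates):
        q += learning_rate * (target - q)
    return q

results = {lr: value_after_updates(lr) for lr in (0.1, 0.5)}
for lr, value in results.items():
    print(lr, round(value, 4))
# 0.1 0.6513
# 0.5 0.999
```

After 10 updates, a learning rate of 0.1 has covered about 65% of the distance to the target, while 0.5 has covered almost all of it. A smaller rate learns more slowly but is less sensitive to any single noisy reward.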

The following line trains the SARSA agent on the "Taxi-v3" environment with the hyperparameters above. The resulting `trained_q_values` table holds the learned state-action values.

```python
# Train the SARSA agent
trained_q_values = train_sarsa_agent(taxi_environment, num_episodes, learning_rate, discount_factor, exploration_rate)
```

The following code evaluates the trained SARSA agent in the "Taxi-v3" environment over 100 trials (the default value of `num_trials`) and prints the average reward per trial as a measure of how well the agent completes the task.

```python
# Evaluate the trained agent
average_reward = evaluate_agent(taxi_environment, trained_q_values)
print(f"The trained agent achieves an average reward of {average_reward} over 100 trials.")
```

The trained agent achieves an average reward of -160.1 over 100 trials.
