# Q-learning

### Model (Q-Learning):
The reinforcement learning model used is Q-learning, a classic algorithm for learning optimal action-selection policies in Markov decision processes. The Q-table stores Q-values for state-action pairs, and the agent updates these values based on observed rewards and transitions. During training, the agent balances exploration and exploitation to discover optimal policies.
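For reference, the standard tabular Q-learning update can be written as below. The symbols are chosen to match the code later in this notebook: $\alpha$ is the learning rate (`alpha`), $\gamma$ is the discount factor (`gamma`), $r$ is the observed reward, and $s'$ is the next state (`next_state`).

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
$$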

### Environment (Dataset):
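The code uses the Taxi-v3 environment from the OpenAI Gym toolkit. This environment represents a simplified taxi problem in which an agent navigates a grid to pick up a passenger and drop them off at a destination. It has 500 discrete states and 6 discrete actions. Each step costs -1, an illegal pickup or drop-off costs -10, and a successful drop-off pays +20. Because states, actions, and rewards are all discrete, the environment is well suited to tabular reinforcement learning.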

### Code Overview:
The provided code consists of two main parts: training the Q-learning agent and evaluating its performance.

#### Training the Agent:
1. **Initialization:** The Q-table is initialized with zeros. It stores one Q-value per state-action pair: an estimate of the expected cumulative discounted reward for taking that action in that state.

2. **Training Loop:**
   - The agent explores the environment by taking actions and updating Q-values based on the Q-learning formula.
   - Hyperparameters are defined: the learning rate (`alpha`), the discount factor (`gamma`), and the exploration rate (`epsilon`), which is the probability of taking a random action instead of the greedy one.
   - The training loop runs for a specified number of episodes, during which the agent interacts with the environment, learns from experiences, and updates its Q-table.
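The steps above can be sketched on a much smaller problem. The example below is not the Taxi-v3 code: it uses a hypothetical five-state chain (the function `train_chain` and its toy reward scheme are assumptions made for illustration) so that the epsilon-greedy choice and the Q-table update can be run without installing Gym.

```python
import numpy as np


def train_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.6, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain (not Taxi-v3).

    States are 0..n_states-1. Action 0 moves left, action 1 moves right.
    Every step costs -1; reaching the last state pays +20 and ends the episode.
    """
    rng = np.random.default_rng(seed)
    q_table = np.zeros((n_states, 2))
    goal = n_states - 1

    for _ in range(episodes):
        state = 0
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                action = int(rng.integers(2))
            else:
                action = int(np.argmax(q_table[state]))

            # Toy transition: move one cell, clipped to the ends of the chain.
            next_state = min(state + 1, goal) if action == 1 else max(state - 1, 0)
            done = next_state == goal
            reward = 20 if done else -1

            # Same update form as the notebook's training cell.
            old_value = q_table[state, action]
            next_max = np.max(q_table[next_state])
            q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

            state = next_state

    return q_table


q_table = train_chain()
# Greedy action per state; the goal state's row is never updated and stays at zero.
print(np.argmax(q_table, axis=1))
```

After training, the greedy action in every non-terminal state should be 1 (move right), which is the shortest route to the goal in this toy setup.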

#### Evaluating the Agent:
1. The trained agent is evaluated over a specified number of episodes to assess its performance.
2. The agent selects actions based on learned Q-values (exploitation).
3. Metrics such as average timesteps per episode and average penalties per episode are calculated and printed.





In [2]:
# Use pip to install the 'cmake' package, which is a cross-platform build tool.
!pip install cmake

# Install the 'gym' package with the 'atari' extra dependencies.
# 'gym' is a toolkit for developing and comparing reinforcement learning algorithms.
# The '[atari]' indicates that additional dependencies for Atari environments will be installed.
!pip install 'gym[atari]'

# Install the 'scipy' package, which is a scientific library for mathematics, science, and engineering.
!pip install scipy


Collecting ale-py~=0.7.5 (from gym[atari])
  Downloading ale_py-0.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Installing collected packages: ale-py
Successfully installed ale-py-0.7.5


In [None]:
# Import the necessary libraries.
import gym
import random
import numpy as np
from IPython.display import clear_output

In [3]:
# Create a Taxi environment and extract the inner environment.
env = gym.make("Taxi-v3").env

# Reset the environment to a new, random state.
env.reset()

# Print the current state of the environment in ANSI mode.
print(env.render(mode="ansi"))

# Print the action space of the environment.
print("Action Space {}".format(env.action_space))

# Print the state space of the environment.
print("State Space {}".format(env.observation_space))

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+


Action Space Discrete(6)
State Space Discrete(500)




In [4]:
# The parameters represent (taxi row, taxi column, passenger index, destination index).
state = env.encode(3, 1, 2, 0)
print("State:", state)

# Set the environment's current state to the encoded state.
env.s = state

# Print the rendered state of the environment in ANSI mode.
env.render(mode="ansi")

State: 328




'+---------+\n|\x1b[35mR\x1b[0m: | : :\x1b[34;1mG\x1b[0m|\n| : | : :\x1b[43m \x1b[0m|\n| : : : : |\n| | : | : |\n|Y| : |B: |\n+---------+\n\n'
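To make the state number less opaque, here is a minimal sketch of how four components can be packed into a single integer. It assumes a mixed-radix layout of 5 rows × 5 columns × 5 passenger locations × 4 destinations, which is consistent with the 500 states reported above and with `State: 328`. The functions `encode` and `decode` below are stand-alone illustrations, not the environment's own methods.

```python
def encode(taxi_row, taxi_col, passenger_index, destination_index):
    """Pack the four components into one integer in [0, 500)."""
    state = taxi_row                         # 5 possible rows
    state = state * 5 + taxi_col             # 5 possible columns
    state = state * 5 + passenger_index      # 4 landmarks plus "in the taxi"
    state = state * 4 + destination_index    # 4 landmarks
    return state


def decode(state):
    """Invert encode(): recover (row, col, passenger, destination)."""
    destination_index = state % 4
    state //= 4
    passenger_index = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, passenger_index, destination_index


print(encode(3, 1, 2, 0))  # 328, the same value the cell above printed
print(decode(328))         # (3, 1, 2, 0)
```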

In [5]:
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}
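Each entry in this dictionary maps an action to a list of `(probability, next_state, reward, done)` tuples. The sketch below copies the output above into a literal and prints it with action names. The names in `ACTION_NAMES` assume the action numbering documented for Taxi-v3 (0 to 5 as south, north, east, west, pickup, drop-off), and `describe` is a hypothetical helper.

```python
# Transition table for state 328, copied from the output above.
P_328 = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}

# Assumed action numbering for Taxi-v3.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def describe(transitions):
    """Return one readable line per (action, outcome) pair."""
    lines = []
    for action, outcomes in transitions.items():
        for probability, next_state, reward, done in outcomes:
            lines.append(
                f"{ACTION_NAMES[action]:>7}: p={probability}, "
                f"next={next_state}, reward={reward}, done={done}"
            )
    return lines


for line in describe(P_328):
    print(line)
```

Read this way, moving west leaves the state unchanged at 328 (the taxi is blocked), and both pickup and drop-off are illegal here, which is why they carry the -10 penalty.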

In [6]:
# Set the environment to the specified state for illustration.
env.s = 328

# Initialize variables for tracking the number of epochs, penalties, and total rewards.
epochs = 0
penalties, reward = 0, 0

# Create a list to store frames for animation.
frames = []

# Flag indicating whether the episode is done.
done = False

# Run the episode until it is done.
while not done:
    # Choose a random action from the action space.
    action = env.action_space.sample()

    # Take a step in the environment based on the chosen action.
    state, reward, done, info = env.step(action)

    # If the agent receives a penalty, increment the penalties counter.
    if reward == -10:
        penalties += 1

    # Put each rendered frame into a dictionary for animation.
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

    # Increment the epoch counter.
    epochs += 1

# Print the results of the episode.
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 1756
Penalties incurred: 575


In [7]:
from IPython.display import clear_output
from time import sleep
from io import StringIO

# Function to print frames for animation
def print_frames(frames):
    for i, frame in enumerate(frames):
        # Clear the output for a dynamic display.
        clear_output(wait=True)

        # Use StringIO to read the frame and print it.
        file = StringIO(frame['frame'])
        print(file.getvalue())

        # Print additional information for each frame.
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")

        # Pause for a short duration between frames.
        sleep(.1)

# Call the function to display the frames.
print_frames(frames)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 1756
State: 0
Action: 5
Reward: 20


In [8]:
# Initialize a Q-table with zeros.
# The shape of the table is determined by the number of states and actions in the environment.
q_table = np.zeros([env.observation_space.n, env.action_space.n])
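A small illustration of what a zero-initialized table implies, using the sizes printed earlier (`Discrete(500)` states and `Discrete(6)` actions) as literal numbers so the snippet runs without the environment:

```python
import numpy as np

# Same shape the notebook builds from env.observation_space.n and env.action_space.n.
q_table = np.zeros([500, 6])
print(q_table.shape)  # (500, 6)

# With every entry equal, np.argmax returns the first index, so the greedy
# action in an unvisited state is always action 0.
first_greedy_action = int(np.argmax(q_table[0]))
print(first_greedy_action)  # 0
```

In an all-zero table the greedy choice carries no information yet. It becomes meaningful only once rewards have pushed the entries apart.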

In [9]:
# Hyperparameters
alpha = 0.1  # Learning rate
gamma = 0.6  # Discount factor
epsilon = 0.1  # Exploration rate: probability of taking a random action

# For plotting metrics
all_epochs = []
all_penalties = []

# Training loop
for i in range(1, 100001):
    # Reset the environment for a new episode
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        # Exploration-exploitation trade-off
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        # Take a step in the environment
        next_state, reward, done, info = env.step(action)

        # Update Q-values using the Q-learning formula
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        # Update counters
        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    # Logging for every 100 episodes
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

Episode: 100000
Training finished.
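As a worked example of the update used in the training cell, the sketch below applies it once to a fresh (all-zero) entry for each of the three reward values that appear in this notebook. The helper name `q_update` is an assumption for illustration. The defaults are the notebook's `alpha` and `gamma`.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One Q-learning update, in the same form as the training loop."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# First visit to an entry: old_value and next_max are both still zero.
ordinary_move = q_update(0.0, -1, 0.0)    # -0.1
illegal_action = q_update(0.0, -10, 0.0)  # -1.0
final_dropoff = q_update(0.0, 20, 0.0)    # 2.0
print(ordinary_move, illegal_action, final_dropoff)
```

With `alpha = 0.1`, a single update moves an entry only a tenth of the way toward its target, which is one reason many episodes are needed before the values settle.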



In [10]:
# Initialize variables to track total epochs and penalties over episodes.
total_epochs, total_penalties = 0, 0

# Number of episodes for evaluation.
episodes = 100

# Run the agent for the specified number of episodes.
for _ in range(episodes):
    # Reset the environment for a new episode.
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        # Choose actions using the learned Q-values (exploitation).
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        # Count penalties incurred during the episode.
        if reward == -10:
            penalties += 1

        # Increment the timestep counter.
        epochs += 1

    # Update total counters for all episodes.
    total_penalties += penalties
    total_epochs += epochs

# Calculate and print average metrics over all episodes.
print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes:
Average timesteps per episode: 12.53
Average penalties per episode: 0.0
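The evaluation loop above runs until `done` becomes true. The random episode earlier ran for 1756 steps, which suggests that no step limit is in effect on the unwrapped environment, so a greedy policy that keeps revisiting the same states could loop indefinitely. A minimal sketch of a capped loop follows. The function `run_greedy_episode`, the `step_fn` callback, and the cap of 200 steps are all assumptions made for illustration, and the toy `stuck_step` stands in for an environment that never terminates.

```python
import numpy as np


def run_greedy_episode(step_fn, q_table, start_state, max_steps=200):
    """Follow argmax actions until the episode ends or max_steps is reached.

    step_fn(state, action) -> (next_state, reward, done) is a hypothetical
    stand-in for the environment's step function.
    """
    state, steps, penalties = start_state, 0, 0
    done = False
    while not done and steps < max_steps:
        action = int(np.argmax(q_table[state]))
        state, reward, done = step_fn(state, action)
        if reward == -10:
            penalties += 1
        steps += 1
    return steps, penalties, done


def stuck_step(state, action):
    """Toy dynamics that never terminate: the agent stays where it is."""
    return state, -1, False


steps, penalties, finished = run_greedy_episode(stuck_step, np.zeros((2, 2)), start_state=0)
print(steps, penalties, finished)  # 200 0 False
```

Returning `done` alongside the counters makes it possible to tell a completed episode from one that was cut off by the cap.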


The Q-learning agent was trained for 100,000 episodes on the Taxi-v3 environment. As a baseline, a single episode of purely random actions, started from state 328 before any training, took 1756 timesteps and incurred 575 penalties before the passenger was finally dropped off (last frame: state 0, action 5, reward 20).

Upon evaluation over 100 episodes:
- Average timesteps per episode: 12.53
- Average penalties per episode: 0.0

**Conclusion:**
1. **Training Efficiency:** The agent learned a policy that completes evaluation episodes in 12.53 timesteps on average, compared with 1756 timesteps for the random-action baseline episode.

2. **Penalties:** The learned policy incurred no penalties (illegal pickups or drop-offs) during the evaluation episodes, compared with 575 in the random baseline episode.

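3. **Generalization:** The evaluation episodes start from freshly reset initial states, and the agent incurred no penalties in any of them. A Q-table does not generalize across states, so this result reflects how thoroughly training covered the 500-state space rather than transfer to unseen situations. The low average number of timesteps per episode indicates that the learned policy is effective across those starting states.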