Evaluating reinforcement learning (RL) models differs from evaluating supervised or unsupervised models because an RL agent makes sequential decisions, and each action influences the states and rewards it sees next. Here's a breakdown of key metrics and considerations:

# Key Metrics:

## Cumulative Reward:
This is the most fundamental metric. It's the total reward an agent accumulates over an episode or a series of episodes.
It directly reflects how well the agent is achieving its goal.
## Average Reward:
This provides a more stable measure, especially when episode lengths vary.
It's calculated by averaging the rewards obtained per time step or per episode.
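
The two averaging choices can give quite different numbers when episode lengths vary. The sketch below uses made-up reward logs (the `episodes` list and both function names are illustrative assumptions, not part of any library) to show the per-episode and per-step versions side by side.

```python
import numpy as np

# Hypothetical logs: one list of per-step rewards for each episode.
episodes = [
    [0.0, 0.0, 1.0],            # 3 steps, total reward 1
    [0.0, 0.0, 0.0, 0.0, 0.0],  # 5 steps, total reward 0
    [0.0, 1.0],                 # 2 steps, total reward 1
]

def average_reward_per_episode(episodes):
    """Mean of the total reward collected in each episode."""
    return float(np.mean([sum(ep) for ep in episodes]))

def average_reward_per_step(episodes):
    """Total reward divided by the total number of steps across all episodes."""
    total_reward = sum(sum(ep) for ep in episodes)
    total_steps = sum(len(ep) for ep in episodes)
    return total_reward / total_steps

print(average_reward_per_episode(episodes))  # 2 successes over 3 episodes
print(average_reward_per_step(episodes))     # 2 reward units over 10 steps
```

The per-step average penalizes long episodes, so it is usually the better choice when finishing quickly matters.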
## Discounted Cumulative Reward:
In many RL scenarios, future rewards are discounted. This metric accounts for the time value of rewards.
It's useful for evaluating how well an agent balances immediate and long-term rewards.
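
For an episode with rewards $r_0, r_1, \dots, r_{T-1}$ and discount factor $\gamma$, the discounted return is $G = \sum_{t=0}^{T-1} \gamma^t r_t$. A minimal sketch of that computation follows; the function name `discounted_return` and the example values are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma=0.95):
    """Compute G = sum_t gamma**t * rewards[t] for one episode.

    Iterating backwards applies the recursion G_t = r_t + gamma * G_{t+1},
    which avoids computing powers of gamma explicitly.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 1 received two steps in the future, discounted at gamma = 0.5,
# is worth 0.5 ** 2 = 0.25 at the start of the episode.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

With `gamma=1.0` this reduces to the plain cumulative reward.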
## Episode Length:
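The number of steps an agent takes to complete an episode.
It can indicate the efficiency of the agent's policy, but its interpretation depends on the task: shorter episodes are better when the goal is to reach a target quickly, while longer episodes are better when the goal is to survive (for example, keeping a pole balanced).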
## Success Rate:
In tasks with a clear definition of success or failure, this metric measures the percentage of episodes where the agent achieves the goal.
## Sample Efficiency:
Measures how many interactions with the environment the agent requires to learn an effective policy.
Important when environment interactions are costly or time-consuming.
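
One common way to put a number on sample efficiency is to count the environment steps consumed before a rolling average of episode rewards first reaches a target. The sketch below assumes per-episode reward and length logs are available; the function name `steps_to_threshold`, the `window` parameter, and the example data are illustrative assumptions.

```python
import numpy as np

def steps_to_threshold(episode_rewards, episode_lengths, threshold, window=100):
    """Return the number of environment steps used before the rolling mean of
    episode rewards (over `window` episodes) first reaches `threshold`.
    Returns None if the threshold is never reached."""
    rewards = np.asarray(episode_rewards, dtype=float)
    lengths = np.asarray(episode_lengths)
    if len(rewards) < window:
        return None
    # Rolling mean via cumulative sums: mean of rewards[i : i + window]
    csum = np.cumsum(np.insert(rewards, 0, 0.0))
    rolling_mean = (csum[window:] - csum[:-window]) / window
    hits = np.flatnonzero(rolling_mean >= threshold)
    if hits.size == 0:
        return None
    # Index of the last episode inside the first qualifying window
    last_episode = hits[0] + window - 1
    return int(lengths[: last_episode + 1].sum())

# Hypothetical logs: the agent starts succeeding from the third episode on.
rewards = [0, 0, 1, 1, 1, 1]
lengths = [10, 10, 10, 10, 10, 10]
print(steps_to_threshold(rewards, lengths, threshold=1.0, window=2))
```

A lower value means the agent needed fewer interactions to reach the same level of performance.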
## Robustness:
Measures how well the agent performs in varying or unexpected environments.
It tests the agent's ability to generalize beyond the conditions it was trained on.
## Policy Entropy:
Measures the randomness of the agent's policy.
Higher entropy can encourage exploration, while lower entropy indicates a more deterministic policy.
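
For a discrete action space, entropy can be computed directly from the action probabilities in a given state as $H = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$. The sketch below does this for an epsilon-greedy policy; both helper names are assumptions for illustration, and ties in the Q-values are resolved to the first maximum for simplicity.

```python
import numpy as np

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    p = np.asarray(action_probs, dtype=float)
    p = p[p > 0]  # Terms with zero probability contribute nothing
    return float(-np.sum(p * np.log(p)))

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy in a single state:
    epsilon is spread uniformly, the rest goes to the greedy action."""
    q_values = np.asarray(q_values, dtype=float)
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

q = [0.1, 0.5, 0.2, 0.0]
for eps in (1.0, 0.1, 0.0):
    print(eps, policy_entropy(epsilon_greedy_probs(q, eps)))
```

With `epsilon=1.0` the policy is uniform and the entropy is at its maximum of `log(4)` for four actions; with `epsilon=0.0` the policy is deterministic and the entropy is zero.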

import numpy as np
import gym
import random

# Create a simple FrozenLake environment (4x4)
env = gym.make('FrozenLake-v1', is_slippery=False)  # Non-slippery for simplicity

# Define the Q-learning agent
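class QLearningAgent:
    def __init__(self, state_size, action_size, alpha=0.8, gamma=0.95, epsilon=1.0, epsilon_decay=0.001):
        self.state_size = state_size
        self.action_size = action_size
        self.alpha = alpha  # Learning rate
        self.gamma = gamma  # Discount factor
        self.epsilon = epsilon  # Exploration rate
        self.epsilon_decay = epsilon_decay  # Amount subtracted from epsilon per finished episode
        self.q_table = np.zeros((state_size, action_size))

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.randrange(self.action_size)  # Explore: uniform random action
        # Exploit: pick a best action, breaking ties at random so that an
        # all-zero Q-table does not always select action 0
        q_values = self.q_table[state, :]
        best_actions = np.flatnonzero(q_values == q_values.max())
        return int(np.random.choice(best_actions))

    def learn(self, state, action, reward, next_state, done):
        predict = self.q_table[state, action]
        if done:
            target = reward  # Terminal state: there is no future value to bootstrap from
        else:
            target = reward + self.gamma * np.max(self.q_table[next_state, :])

        self.q_table[state, action] += self.alpha * (target - predict)
        # Decay epsilon once per finished episode. Decaying on every step would
        # push epsilon to its 0.01 floor within about 1,000 steps, before the
        # agent has had a realistic chance to reach the goal.
        if done and self.epsilon > 0.01:
            self.epsilon -= self.epsilon_decay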

# Initialize the agent
agent = QLearningAgent(env.observation_space.n, env.action_space.n)

# Training the agent
num_episodes = 5000
episode_rewards = []
episode_lengths = []
success_counts = 0

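for episode in range(num_episodes):
    state, _ = env.reset()  # reset() returns (observation, info)
    done = False
    total_reward = 0
    step_count = 0

    while not done:
        action = agent.choose_action(state)
        # step() returns (observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Pass only `terminated`: hitting the time limit is not a true terminal state
        agent.learn(state, action, reward, next_state, terminated)
        total_reward += reward
        state = next_state
        step_count += 1
        done = terminated or truncated  # End the episode on either signal

    episode_rewards.append(total_reward)
    episode_lengths.append(step_count)
    if total_reward == 1:  # FrozenLake gives reward 1 only when the goal is reached
        success_counts += 1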

# Evaluation metrics, averaged over all training episodes (exploration included)
average_reward = np.mean(episode_rewards)
average_length = np.mean(episode_lengths)
success_rate = success_counts / num_episodes

print(f"Average Reward: {average_reward}")
print(f"Average Episode Length: {average_length}")
print(f"Success Rate: {success_rate}")

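# Optional: replay a few episodes with the trained agent acting greedily.
# A separate environment is created with render_mode='ansi' so that render()
# returns a text frame that can be printed.
test_env = gym.make('FrozenLake-v1', is_slippery=False, render_mode='ansi')

num_test_episodes = 5
for _ in range(num_test_episodes):
    state, _ = test_env.reset()
    done = False
    print(test_env.render())
    while not done:
        action = int(np.argmax(agent.q_table[state, :]))  # Greedy action, no exploration
        state, reward, terminated, truncated, _ = test_env.step(action)
        done = terminated or truncated  # Stop at the goal, a hole, or the step limit
        print(test_env.render())
    print("---")

test_env.close()
env.close()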

Average Reward: 0.0
Average Episode Length: 1061.1462
Success Rate: 0.0