# Model-free methods: Q-learning

## Gymnasium environments
[Gymnasium](https://gymnasium.farama.org/index.html), maintained by the Farama Foundation, is a collection of environments designed for developing and comparing reinforcement learning algorithms. It is a Python library that offers easy-to-use interfaces for a wide range of environments and is the successor to the Gym library originally provided by OpenAI. The environments are organized into categories such as classic control, Box2D, toy text, and MuJoCo, among others.

Each environment is implemented as a Python class that provides a consistent set of methods to interact with it. These methods include:

- `reset(seed=None)`: Resets the environment to its initial state and returns the initial `observation` together with an `info` dictionary. Passing a `seed` value seeds the environment's random number generator to ensure reproducible results.
- `step(action)`: Executes the provided action, returning the next `observation`, the `reward`, two episode-end booleans (`terminated` and `truncated`, in that order), and an additional `info` dictionary.
- `render()`: Renders the environment for visualization purposes.
- `close()`: Closes the environment and frees resources.

This standardized interface makes it simple to develop, test, and compare RL algorithms across different environments.


### FrozenLake-v1
The FrozenLake-v1 environment is a 4x4 grid world where the agent has to reach the goal without falling into a hole. The agent can move in four directions: up, down, left, and right. When `is_slippery=True` (the default), the environment is stochastic, meaning that the agent can slip and move in a different direction than the one it chose. An episode is considered solved when the agent reaches the goal state. The agent receives a reward of 1 when it reaches the goal and 0 otherwise.

<img src="https://gymnasium.farama.org/_images/frozen_lake.gif"/>

Four different versions of the FrozenLake environment are proposed:

* A deterministic one
* A stochastic one with only 2 holes
* A stochastic one with 4 holes
* A stochastic one with an 8x8 map

Your implementation should always solve the deterministic version and solve the other versions around 50% of the time.

In [None]:
import gymnasium as gym

# env = gym.make('FrozenLake-v1', desc=["SFFF", "FHFH", "FFFH", "HFFG"],  map_name="4x4", is_slippery=False, render_mode="rgb_array") # --> Deterministic (no slippery), Easy
# env = gym.make('FrozenLake-v1', desc=["SFFH", "FFFF", "FFFF", "HFFG"],  map_name="4x4", is_slippery=True, render_mode="rgb_array") # --> Stochastic (slippery), More Challenging
env = gym.make('FrozenLake-v1', desc=["SFFF", "FHFH", "FFFH", "HFFG"],  map_name="4x4", is_slippery=True, render_mode="rgb_array") # --> Very Challenging
# env = gym.make('FrozenLake-v1', desc=["SFFFFFFF", "FFFFFFFF", "HHFFFFFF", "HHFFFFFF", "HHFFFFFF", "HHFFFFFF", "HHFFFFFF", "FFGFFFFF"],  map_name="8x8", is_slippery=True, render_mode="rgb_array") # --> Very Challenging

## Q-learning implementation

Q-learning is a model-free reinforcement learning algorithm that learns by interacting with the environment. It samples transitions from episodes and updates the action-value function after every step, using the observed reward plus the discounted value estimate of the next state (a temporal-difference update) instead of waiting for the full return. Q-learning is an off-policy algorithm: it learns the values of the greedy target policy while a different, epsilon-greedy behavior policy generates the data.

### Exercise 1: Epsilon-greedy policy

Implement a function that, given a Q-table, a state, and an epsilon value, returns the action to take following an epsilon-greedy policy. The epsilon-greedy policy selects a random action with probability epsilon and the action with the highest Q-value with probability 1-epsilon.

In [2]:
import numpy as np

def epsilon_greedy_policy(Q, state, epsilon, n_actions):
    """
    Choose an action using the epsilon-greedy strategy.
    Args:
        Q: The Q-table
        state: The current state
        epsilon: The exploration rate
        n_actions: Total number of actions available
    Returns:
        action: The action selected
    """
    if np.random.rand() < (1 - epsilon):
        action = np.argmax(Q[state, :])
    else:
        action = np.random.randint(n_actions)

    return action
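
One way to sanity-check an epsilon-greedy implementation is to compare how often the greedy action is chosen with the theoretical value $1 - \epsilon + \epsilon / |A|$ (the random branch can also pick the greedy action). The sketch below uses a small stand-in function, `sample_epsilon_greedy`, and made-up Q-values; neither is part of the exercise code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

def sample_epsilon_greedy(q_row, epsilon, rng):
    """Illustrative epsilon-greedy choice for a single row of Q-values."""
    if rng.random() < 1 - epsilon:
        return int(np.argmax(q_row))      # exploit
    return int(rng.integers(len(q_row)))  # explore uniformly

q_row = np.array([0.1, 0.5, 0.2, 0.0])  # made-up values; action 1 is greedy
epsilon = 0.3
n_samples = 20_000

actions = np.array([sample_epsilon_greedy(q_row, epsilon, rng)
                    for _ in range(n_samples)])
empirical = np.mean(actions == 1)
expected = 1 - epsilon + epsilon / len(q_row)

print(f"empirical greedy frequency: {empirical:.3f}")
print(f"expected greedy frequency:  {expected:.3f}")
```

The two numbers should agree up to sampling noise.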

### Exercise 2: Q-learning

Implement the Q-learning algorithm to learn the optimal Q-values for the FrozenLake environment. The algorithm samples episodes and, after every step, updates the Q-value of the visited state-action pair using the following formula:

$$Q(s, a) = Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$

where:
- $Q(s, a)$ is the Q-value of state $s$ and action $a$.
- $\alpha$ is the learning rate.
- $r$ is the reward obtained after taking action $a$ in state $s$.
- $\gamma$ is the discount factor.
- $s'$ is the next state.
- $a'$ ranges over the actions available in the next state $s'$; the maximum selects the best of them.
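
To make the formula concrete, here is a single update worked through with made-up numbers (the Q-values and reward are illustrative; `alpha` and `gamma` match the defaults used in the implementation further down).

```python
alpha, gamma = 0.1, 0.9
q_sa = 0.5                     # current estimate Q(s, a), made up
reward = 0.0                   # reward r observed after taking a in s
q_next = [0.2, 0.8, 0.4, 0.1]  # made-up Q(s', a') for the four actions

td_target = reward + gamma * max(q_next)  # r + gamma * max_a' Q(s', a')
td_error = td_target - q_sa               # how far the estimate is off
q_sa_new = q_sa + alpha * td_error        # move a fraction alpha toward the target

print(f"TD target:       {td_target:.3f}")
print(f"TD error:        {td_error:.3f}")
print(f"updated Q(s, a): {q_sa_new:.3f}")
```

The estimate moves from 0.5 to 0.522, that is, 10% of the way toward the target 0.72.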


In [3]:
def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.3):
    """
    Q-Learning algorithm implementation.
    Args:
        env: The environment
        episodes: The number of episodes to train for
        alpha: The learning rate
        gamma: The discount factor
        epsilon: The exploration rate
    Returns:
        Q: The learned Q-table
        policy: The learned policy
    """
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    policy = np.zeros(env.observation_space.n, dtype=int)
    for i in range(episodes):
        obs, _ = env.reset()
        state = obs
        while True:
            action = epsilon_greedy_policy(Q, state, epsilon, env.action_space.n)
            state_prime, reward, terminated, truncated, info = env.step(action)
            Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[state_prime, :]) - Q[state, action])
            policy[state] = np.argmax(Q[state, :])
            state = state_prime

            if terminated or truncated:  # episode ended (terminal state or time limit)
                break
   
    return Q, policy

In [None]:
# Example usage with FrozenLake environment
Q, policy = q_learning(env, episodes=10000, alpha=0.1, gamma=0.9, epsilon=0.3) # what happens when you use a very small epsilon?
print(Q)
print(policy)

[[0.2298209  0.24363756 0.23384531 0.21912067]
 [0.25707458 0.25717107 0.27115917 0.24513052]
 [0.31764709 0.17997296 0.2567811  0.16904549]
 [0.         0.         0.         0.        ]
 [0.24452327 0.29665319 0.25719722 0.2517203 ]
 [0.28467607 0.36426255 0.30449671 0.28361284]
 [0.35832222 0.43245626 0.39628357 0.33115466]
 [0.33660641 0.49708142 0.39700551 0.30830871]
 [0.20764456 0.25165787 0.23332769 0.33296811]
 [0.32238629 0.41231061 0.4702718  0.42919328]
 [0.44307683 0.61719416 0.52821817 0.45591791]
 [0.70950153 0.62622125 0.59645403 0.50638969]
 [0.         0.         0.         0.        ]
 [0.29130454 0.204852   0.63171146 0.44732433]
 [0.53537175 0.76965668 0.7833215  0.71158919]
 [0.         0.         0.         0.        ]]
[1 2 0 0 1 1 1 1 3 2 1 0 0 2 2 0]
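
A flat policy array is hard to read, so it can help to draw it on the grid. The sketch below is an illustration, not part of the exercise: it hard-codes the policy printed above and assumes the two-hole map, because the all-zero Q-rows at states 3, 12 and 15 match that map's terminal tiles rather than those of the four-hole map.

```python
# Policy copied from the output above; the map is an assumption (see text)
policy = [1, 2, 0, 0, 1, 1, 1, 1, 3, 2, 1, 0, 0, 2, 2, 0]
desc = ["SFFH", "FFFF", "FFFF", "HFFG"]

# FrozenLake action encoding: 0 = left, 1 = down, 2 = right, 3 = up
arrows = {0: "←", 1: "↓", 2: "→", 3: "↑"}

n_cols = len(desc[0])
lines = []
for r, row in enumerate(desc):
    cells = []
    for c, tile in enumerate(row):
        if tile in "HG":
            cells.append(tile)  # terminal tiles: no action is ever taken here
        else:
            cells.append(arrows[policy[r * n_cols + c]])
    lines.append("".join(cells))

print("\n".join(lines))
```

In a slippery environment the arrows do not always point toward the goal, since moving away from a hole can be safer than moving along it.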


### Exercise 3: Evaluate Q-learning algorithm

Create a function that evaluates the policy learned by Q-learning by running multiple episodes and calculating the average return. You can also store the frames of the episodes to visualize the agent's behavior if rendering is enabled. Make sure that the environment was created with `render_mode="rgb_array"` so that the frames can be stored.

In [5]:
def generate_episode(env, policy, render=False):
    """Run one episode following the deterministic (greedy) policy, not an epsilon-greedy one."""
    frames = []
    obs, _ = env.reset()

    if render:
        frames.append(env.render())

    state = obs
    total_reward = 0
    while True:
        state, reward, terminated, truncated, info = env.step(policy[state])
        total_reward += reward
        if render:
            frames.append(env.render())

        # Stop at a terminal state (goal or hole) or when the time limit is hit
        if terminated or truncated:
            break

    return total_reward, frames

In [6]:
def evaluate_policy(env, policy, episodes=100):
    total_sum = 0
    for i in range(episodes):
        total_reward, _ = generate_episode(env, policy, render=False)
        total_sum += total_reward
    return total_sum / episodes
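
Because each FrozenLake episode returns either 0 or 1, the average over a finite number of episodes is a noisy estimate of the success rate. The helper below, `success_rate_with_stderr`, is a hypothetical addition that reports the binomial standard error alongside the mean; the returns are made up for illustration.

```python
import math

def success_rate_with_stderr(returns):
    """Mean of 0/1 episode returns and its binomial standard error."""
    n = len(returns)
    p = sum(returns) / n
    stderr = math.sqrt(p * (1 - p) / n)
    return p, stderr

# Made-up evaluation result: 52 successes out of 100 episodes
returns = [1.0] * 52 + [0.0] * 48
p, stderr = success_rate_with_stderr(returns)
print(f"success rate: {p:.2f} +/- {stderr:.2f}")
```

With 100 episodes and a success rate near 50%, the standard error is about 0.05, so differences of a few percentage points between runs are not meaningful.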

In [7]:
performance = evaluate_policy(env, policy, episodes=100)
print(f"Average performance: {performance}")

Average performance: 1.0


In [8]:
# Create gif of the episode
import imageio

rw, frames = generate_episode(env, policy, render=True)
imageio.mimsave('frozenlake_Q_learning.gif', frames, loop=0, duration=100)

### Exercise 4: SARSA algorithm

Implement the SARSA algorithm to learn Q-values for the FrozenLake environment. Like Q-learning, SARSA updates the Q-value of the visited state-action pair after every step, but it is an on-policy algorithm: the update uses the Q-value of the action that the epsilon-greedy policy actually selects in the next state, instead of the maximum over all actions. Check the changes with respect to the Q-learning algorithm in the slides.
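
The difference between the two algorithms comes down to the update target. The sketch below computes both targets for the same transition, using made-up Q-values and assuming the behavior policy happened to explore and pick action 2 in the next state.

```python
gamma = 0.9
reward = 0.0
q_next = [0.2, 0.8, 0.4, 0.1]  # made-up Q(s', .) values

# Q-learning (off-policy): bootstrap from the best action in s'
q_learning_target = reward + gamma * max(q_next)

# SARSA (on-policy): bootstrap from the action a' actually selected in s'
action_prime = 2  # assumed exploratory choice, for illustration only
sarsa_target = reward + gamma * q_next[action_prime]

print(f"Q-learning target: {q_learning_target:.2f}")
print(f"SARSA target:      {sarsa_target:.2f}")
```

When the selected action is the greedy one, the two targets coincide; they differ only on exploratory steps.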


In [12]:
def sarsa(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    SARSA algorithm implementation.
    Args:
        env: The environment
        episodes: The number of episodes to train for
        alpha: The learning rate
        gamma: The discount factor
        epsilon: The exploration rate
    Returns:
        Q: The learned Q-table
        policy: The learned policy
    """
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    policy = np.zeros(env.observation_space.n, dtype=int)
    for i in range(episodes):
        obs, _ = env.reset()
        state = obs
        action = epsilon_greedy_policy(Q, state, epsilon, env.action_space.n)
        while True:
            state_prime, reward, terminated, truncated, info = env.step(action)
            action_prime = epsilon_greedy_policy(Q, state_prime, epsilon, env.action_space.n)
            Q[state, action] = Q[state, action] + alpha * (reward + gamma * Q[state_prime, action_prime] - Q[state, action])
            state = state_prime
            action = action_prime

            if terminated or truncated:  # episode ended (terminal state or time limit)
                break

    
    for s in range(env.observation_space.n):
        policy[s] = np.argmax(Q[s, :])

    return Q, policy

In [10]:
# Example usage with FrozenLake environment
Q, policy = sarsa(env, episodes=10000, alpha=0.1, gamma=0.9, epsilon=0.3)
print(Q)
print(policy)

[[0.1011756  0.10130941 0.10390745 0.10056469]
 [0.11957302 0.11700672 0.12771754 0.10766697]
 [0.12470073 0.09949858 0.09453772 0.09591144]
 [0.         0.         0.         0.        ]
 [0.11084486 0.12389547 0.10847952 0.11205749]
 [0.14903019 0.14586025 0.18040888 0.1270721 ]
 [0.15362477 0.19363816 0.23331723 0.17886949]
 [0.19818157 0.3550716  0.10740192 0.12514775]
 [0.08491797 0.0936215  0.11794369 0.14702064]
 [0.21229231 0.20894064 0.2921035  0.15876418]
 [0.27910795 0.36634349 0.37165827 0.29655817]
 [0.40342153 0.60633318 0.42396599 0.41011681]
 [0.         0.         0.         0.        ]
 [0.14458172 0.1535057  0.32623544 0.27157872]
 [0.37596138 0.54732197 0.46660399 0.51058272]
 [0.         0.         0.         0.        ]]
[2 2 0 0 1 2 2 1 3 2 2 1 0 2 1 0]


In [11]:
# Evaluate the learned policy using the previously defined function
performance = evaluate_policy(env, policy, episodes=100)
print(f"Average performance: {performance}")

Average performance: 1.0
