# An Introduction to Reinforcement Learning

Think about how you might teach a dog a new trick, such as sitting on command:

- If it performs the trick correctly (it sits), you'll reward it with a treat (positive feedback) ✔️
- If it doesn't sit correctly, it doesn't get a treat (negative feedback) ❌

By repeating actions that lead to positive outcomes, the dog learns to sit when it hears the command in order to get its treat. Reinforcement learning is a subdomain of machine learning in which an 'agent' (the dog) learns which sequences of actions to take (sitting) in its environment (in response to the command 'sit') in order to maximize its reward (getting a treat).

This can be illustrated more formally as:

![sutton barto rl](https://www.gocoder.one/static/RL-diagram-b3654cd3d5cc0e07a61a214977038f01.png)

## Key Concepts in Reinforcement Learning

1. **Agent**: The learner or decision maker.
2. **Environment**: The external system with which the agent interacts.
3. **State (s)**: A representation of the current situation of the agent.
4. **Action (a)**: Choices made by the agent that affect the environment.
5. **Reward (r)**: Feedback from the environment based on the action taken.
6. **Policy (π)**: A strategy used by the agent to determine the next action based on the current state.
7. **Value Function (V)**: A function that estimates the expected cumulative reward from a given state.
8. **Q-Value (Q)**: A function that estimates the expected cumulative reward from a given state-action pair.

## Detailed Explanation of Key Concepts

- **Agent**: In RL, the agent is the entity that interacts with the environment to learn optimal behaviors. The agent's goal is to maximize the cumulative reward it receives over time.
- **Environment**: The environment is the external system that the agent interacts with. It provides the agent with states and rewards based on the actions taken.
- **State (s)**: A state is a specific situation or configuration of the environment. The state provides the agent with all the information needed to make a decision.
- **Action (a)**: An action is a specific move or decision made by the agent that affects the state of the environment.
- **Reward (r)**: A reward is a scalar feedback signal received by the agent after taking an action. It indicates how good or bad the action was in the context of the environment.
- **Policy (π)**: A policy is a mapping from states to actions. It defines the agent's behavior at any given time. The policy can be deterministic (specific action for each state) or stochastic (probability distribution over actions for each state).
- **Value Function (V)**: The value function estimates the expected cumulative reward from a given state, following a specific policy. It helps the agent evaluate how good it is to be in a certain state.
- **Q-Value (Q)**: The Q-value (or action-value) function estimates the expected cumulative reward from a given state-action pair. It helps the agent evaluate how good it is to take a certain action in a certain state.
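
To make the difference between a deterministic and a stochastic policy concrete, here is a minimal sketch. The state names, action names and probabilities are made up for illustration (loosely following the dog example) and do not come from any real environment.

```python
import random

# Hypothetical toy actions, for illustration only.
ACTIONS = ["sit", "bark"]

# Deterministic policy: exactly one action per state.
deterministic_policy = {"command_sit": "sit", "no_command": "bark"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    "command_sit": {"sit": 0.9, "bark": 0.1},
    "no_command": {"sit": 0.2, "bark": 0.8},
}

def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]

def act_stochastic(state, rng=random):
    """Sample an action according to the policy's probabilities for this state."""
    probs = stochastic_policy[state]
    actions = list(probs.keys())
    weights = list(probs.values())
    return rng.choices(actions, weights=weights, k=1)[0]

print(act_deterministic("command_sit"))
print(act_stochastic("command_sit"))
```

The deterministic policy always returns the same action for a given state, while the stochastic one can return different actions on different calls.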

# Q-Learning

Q-learning is a reinforcement learning algorithm that seeks to find the best possible next action given its current state, in order to maximise the reward it receives (the 'Q' in Q-learning stands for quality - i.e. how valuable an action is).

Let's take the following starting state:

![](https://www.gocoder.one/static/start-state-6a115a72f07cea072c28503d3abf9819.png)

Which action (up, down, left, right, pick-up or drop-off) should the agent take in order to maximise its reward? (_Note: blue = pick-up location and green = drop-off destination_)

First, let's take a look at how our agent is 'rewarded' for its actions. **Remember in reinforcement learning, we want our agent to take actions that will maximise the possible rewards it receives from its environment.**

### 'Taxi' reward system

According to the [Taxi documentation](https://gym.openai.com/envs/Taxi-v3/):

> _"…you receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions."_

Looking back at our original state, the possible actions it can take and the corresponding rewards it will receive are shown below:

![](https://www.gocoder.one/static/state-rewards-62ab43a53e07062b531b3199a8bab5b3.png)

In the image above, the agent loses 1 point per timestep it takes. It will also lose 10 points if it uses the pick-up or drop-off action here.
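
As a rough illustration of how these rewards add up over an episode, the sketch below tallies them for a made-up sequence of outcomes. The outcome labels and the example episode are hypothetical, and the sketch assumes each timestep earns exactly one of the three rewards (so a successful drop-off earns +20 rather than +20 minus the step penalty).

```python
# Reward values from the Taxi documentation quoted above.
REWARDS = {
    "move": -1,      # any ordinary timestep
    "illegal": -10,  # illegal pick-up or drop-off
    "dropoff": 20,   # successful drop-off
}

def total_reward(outcomes):
    """Sum the rewards for a sequence of per-step outcomes."""
    return sum(REWARDS[outcome] for outcome in outcomes)

# Hypothetical episode: 3 moves, one illegal pick-up, 2 more moves, then a successful drop-off.
episode = ["move"] * 3 + ["illegal"] + ["move"] * 2 + ["dropoff"]
print(total_reward(episode))
```

A shorter route with no illegal actions gives a higher total, which is exactly the behaviour we want the agent to discover.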

We want our agent to go North towards the pick-up location denoted by a blue R - **but how will it know which direction to move in if every move is equally punishing?**

### Exploration vs. Exploitation

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/exp_2.jpg" width="600">

#### Exploration

Our agent currently has no way of knowing which action will lead it closest to the blue R. This is where trial-and-error comes in - we'll have our agent take random actions, and observe what rewards it gets (i.e. our agent will **explore**).

Over many iterations, our agent will have observed that certain sequences of actions will be more rewarding than others. Along the way, our agent will need to keep track of which actions led to what rewards.

#### Exploitation

We can let our agent explore and record what it learns in a Q-table (introduced below), which is updated using the Q-learning algorithm. As our agent learns more about the environment, we can let it use this knowledge to take more optimal actions and converge faster - known as **exploitation**.

During exploitation, our agent will look at its Q-table and select the action with the highest Q-value (instead of a random action). Over time, our agent will need to explore less, and start exploiting what it knows instead.

### Epsilon-Greedy ($\epsilon$) Exploration vs. Exploitation

In reinforcement learning, an agent must balance between exploring new actions to discover their effects (**exploration**) and exploiting known actions that yield high rewards (**exploitation**). The epsilon-greedy strategy is commonly used to address this trade-off.

#### Epsilon-Greedy Strategy

-   **Exploration**: With probability $\epsilon$, the agent chooses a random action to explore the environment. This helps the agent discover new strategies and avoid local optima.
-   **Exploitation**: With probability $1 - \epsilon$, the agent chooses the action with the highest Q-value. This leverages the knowledge already gained to maximize the reward.

#### Formula

$$ \text{action} =
\begin{cases} 
\text{random action} & \text{with probability } \epsilon \\
\arg\max\limits_a Q(s, a) & \text{with probability } 1 - \epsilon
\end{cases} $$
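
The formula above can be sketched in a few lines of Python. The function name `epsilon_greedy` and the example Q-values are illustrative assumptions, not part of any library.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from one row of Q-values.

    With probability epsilon a uniformly random action is returned (explore);
    otherwise the index of the largest Q-value is returned (exploit).
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.5, -0.2, 0.0])  # made-up Q-values for a single state

print(epsilon_greedy(q_row, epsilon=0.0, rng=rng))  # epsilon = 0: always exploits
print(epsilon_greedy(q_row, epsilon=1.0, rng=rng))  # epsilon = 1: always explores
```

Note that a random action can happen to be the greedy one, so the best-known action is actually chosen slightly more often than $1 - \epsilon$ of the time.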

#### Epsilon Decay

Initially, a higher $\epsilon$ encourages exploration. Over time, $\epsilon$ decays to encourage exploitation as the agent gains more knowledge about the environment.
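
One possible decay schedule is exponential decay, which has the same form as the one used in the training loop later in this notebook. The sketch below is only one option among many; the `min_epsilon` floor is an added assumption that keeps a little exploration going.

```python
import math

def decayed_epsilon(episode, decay_rate=0.01, min_epsilon=0.0):
    """Exponential decay: epsilon = exp(-decay_rate * episode), floored at min_epsilon."""
    return max(min_epsilon, math.exp(-decay_rate * episode))

for episode in (0, 100, 500):
    print(episode, round(decayed_epsilon(episode), 3))
```

With `decay_rate=0.01`, epsilon starts at 1 (pure exploration) and has dropped to roughly 0.37 after 100 episodes.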

#### Visual Explanation

1.  **Exploration Phase** (high $\epsilon$): The agent tries different actions to explore the environment.
2.  **Exploitation Phase** (low $\epsilon$): The agent uses the knowledge gained to choose the best-known actions.

### Introducing… Q-tables

A Q-table is simply a look-up table storing values representing the maximum expected future rewards our agent can expect for a certain action in a certain state (_known as Q-values_). It will tell our agent that when it encounters a certain state, some actions are more likely than others to lead to higher rewards. It becomes a 'cheatsheet' telling our agent what the best action to take is.

The image below illustrates what our 'Q-table' will look like:

-   Each row corresponds to a unique state in the 'Taxi' environment
-   Each column corresponds to an action our agent can take
-   Each cell corresponds to the Q-value for that state-action pair - a higher Q-value means a higher maximum reward our agent can expect to get if it takes that action in that state.

<img src="https://www.gocoder.one/static/q-table-9461cc903f50b78d757ea30aeb3eb8bc.png" width="600">
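
In code, a Q-table can be represented as a 2D NumPy array. The sketch below assumes the Taxi sizes of 500 states and 6 actions (in practice these are read from the environment), and the Q-values written into state 42 are made up purely to show a lookup.

```python
import numpy as np

n_states, n_actions = 500, 6  # assumed Taxi sizes; read them from the env in practice
q_table = np.zeros((n_states, n_actions))

# Pretend the agent has learned something about state 42 (made-up numbers).
q_table[42] = [-2.0, -1.5, -3.0, -2.5, -10.0, -10.0]

best_action = int(np.argmax(q_table[42]))  # column with the highest Q-value
best_value = float(np.max(q_table[42]))    # the Q-value in that column
print(best_action, best_value)
```

Looking up the best action for a state is just an `argmax` over that state's row.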

## Q-Learning Algorithm

The Q-learning algorithm is given below. We won't go into details, but you can read more about it in [Ch 6 of Sutton & Barto (2018)](http://www.incompleteideas.net/book/RLbook2018trimmed.pdf).

![](https://www.gocoder.one/static/q-learning-algorithm-84b84bb5dc16ba8097e31aff7ea42748.png)

The Q-learning algorithm will help our agent **update the current $Q$-value $Q(S_t,A_t)$ with its observations after taking an action.** Roughly speaking, Q increases if the outcome was better than the current estimate, and decreases if it was worse.

Note that in Taxi, our agent doesn't receive a positive reward until it successfully drops off a passenger (_+20 points_). Hence even if our agent is heading in the correct direction, there will be a delay in the positive reward it should receive. The following term in the Q-learning equation addresses this:

![](https://www.gocoder.one/static/max-q-e593ddcec76cda87ed189c31d60837b6.png)

This term adjusts our current Q-value to include a portion of the rewards it may receive in the future, estimated from the next state ($S_{t+1}$). The '$a$' under the max ranges over all the possible actions available in that next state. The equation also contains two hyperparameters which we can specify:

1.  Learning rate ($α$): how easily the agent should accept new information over previously learnt information
2.  Discount factor ($γ$): how much the agent should take into consideration the rewards it could receive in the future versus its immediate reward
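
To get a feel for the discount factor, the sketch below computes the discounted sum of a made-up reward sequence (three step penalties followed by a successful drop-off) for two values of $γ$. The helper name `discounted_return` and the reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a sequence of rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

rewards = [-1, -1, -1, 20]  # hypothetical: three moves, then a drop-off

print(discounted_return(rewards, gamma=0.6))   # future drop-off counts for less
print(discounted_return(rewards, gamma=0.99))  # future drop-off counts almost fully
```

A lower $γ$ shrinks the contribution of the +20 that arrives three steps later, so the agent behaves more short-sightedly.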

### Steps in Q-Learning

1. **Initialize the Q-table**: Assign a value of 0 to all entries.
2. **Choose an action**: Use an ε-greedy strategy to balance exploration and exploitation.
3. **Take the action**: Observe the reward and the next state.
4. **Update the Q-table**: Use the Bellman equation to update the Q-values.
5. **Repeat**: Continue until the episode ends.

Q-Learning involves the agent interacting with the environment and updating a Q-table that contains Q-values for all possible state-action pairs. The agent uses these Q-values to make decisions that maximize its expected cumulative reward. The Q-value for a state-action pair (s, a) is updated using the Bellman equation, which incorporates the reward received and the maximum Q-value of the next state.
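
The five steps above can be sketched end to end on a deliberately tiny problem. The corridor environment below is a hypothetical stand-in (it is not Taxi and does not use Gym), and its rewards and hyperparameter values are assumptions chosen only to keep the example short.

```python
import numpy as np

# Hypothetical toy environment: a corridor of 5 cells.
# The agent starts in cell 0 and the episode ends when it reaches cell 4.
N_STATES, N_ACTIONS = 5, 2  # actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    reward = 10.0 if done else -1.0  # made-up rewards: -1 per move, +10 at the goal
    return next_state, reward, done

def train(num_episodes=500, max_steps=50, alpha=0.1, gamma=0.9, decay_rate=0.01, seed=0):
    rng = np.random.default_rng(seed)
    q_table = np.zeros((N_STATES, N_ACTIONS))  # 1. initialize the Q-table
    epsilon = 1.0
    for episode in range(num_episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:  # 2. choose an action (epsilon-greedy)
                action = int(rng.integers(N_ACTIONS))
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = step(state, action)  # 3. take the action
            # 4. update the Q-table; a terminal state has no future value
            future = 0.0 if done else np.max(q_table[next_state])
            td_target = reward + gamma * future
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            state = next_state
            if done:  # 5. repeat until the episode ends
                break
        epsilon = np.exp(-decay_rate * episode)  # explore less as training goes on
    return q_table

q_table = train()
print(np.argmax(q_table[:GOAL], axis=1))  # greedy action for each non-terminal cell
```

After training, the greedy action in every non-terminal cell should be "right", since that is the shortest way to the goal.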

### Detailed Explanation of the Algorithm

- **Exploration-Exploitation Strategy**: The agent must balance exploration (trying new actions to discover their effects) and exploitation (choosing actions that are known to yield high rewards). The ε-greedy strategy is commonly used, where the agent chooses a random action with probability ε, and the best-known action with probability 1-ε.
- **Q-Table Initialization**: The Q-table is initialized with arbitrary values, often zeros. It has dimensions `[number of states, number of actions]`.
- **State and Action**: The agent starts in an initial state and chooses an action based on the current Q-values and the exploration-exploitation strategy.
- **Reward and Next State**: After taking an action, the agent receives a reward and observes the next state.
- **Q-Value Update**: The Q-value for the state-action pair is updated using the Bellman equation, incorporating the observed reward and the maximum Q-value of the next state.
- **Episode Termination**: The episode terminates when the agent reaches a terminal state (e.g., completing a task or reaching a goal).

### Bellman Equation

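The Bellman equation provides a recursive decomposition for solving Markov Decision Processes (MDPs): the value of a state-action pair is expressed in terms of the immediate reward and the value of the next state. Q-learning turns this into an incremental update rule, in which the Q-value for a state-action pair (s, a) is updated as follows: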

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$

where:
- $ Q(s, a) $: Current Q-value.
- $ \alpha $: Learning rate (0 < α ≤ 1). It determines how much new information overrides the old information.
- $ r $: Reward received after taking action $ a $ from state $ s $.
- $ \gamma $: Discount factor (0 ≤ γ ≤ 1) which balances the importance of immediate and future rewards. A higher value of γ puts more emphasis on future rewards.
- $ s' $: Next state after taking action $ a $.
- $ \max_{a'} Q(s', a') $: Maximum Q-value of the next state $ s' $ over all possible actions $ a' $.
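
A single update can be worked through numerically. The helper `q_update` and all the numbers below are made up for illustration; the arithmetic follows the update rule above term by term.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-learning update and return the new Q(s, a)."""
    td_target = reward + gamma * max_q_next    # r + gamma * max_a' Q(s', a')
    return q_sa + alpha * (td_target - q_sa)   # move Q(s, a) a fraction alpha towards the target

# Made-up numbers: current estimate 0.0, step penalty -1, best next-state value 2.0.
new_q = q_update(q_sa=0.0, reward=-1.0, max_q_next=2.0, alpha=0.1, gamma=0.6)
print(round(new_q, 4))
```

Here the target is $-1 + 0.6 \times 2.0 = 0.2$, and with $\alpha = 0.1$ the estimate moves one tenth of the way from 0.0 towards it, giving about 0.02.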

# Q-Learning Implementation for the Taxi Environment

In this section, we will implement the Q-learning algorithm for the Taxi environment. We'll go through each step in detail to understand how the agent learns to optimize its actions.

### Step 1: Import Libraries and Initialize the Environment

First, we need to import the necessary libraries and create the Taxi environment.

In [None]:
#!pip install gym
#!pip install pygame
import numpy as np
import gym
import random

# Create Taxi environment
env = gym.make("Taxi-v3", render_mode="human")

### Step 2: Initialize the Q-Table

Next, we initialize the Q-table with zeros. The Q-table has dimensions `[number of states, number of actions]`.

-   `state_size`: The total number of possible states.
-   `action_size`: The total number of possible actions.
-   `q_table`: A table to store Q-values for each state-action pair, initialized to zero.

In [None]:
# Initialize Q-table
state_size = env.observation_space.n  # Total number of states
action_size = env.action_space.n      # Total number of actions
q_table = np.zeros((state_size, action_size))
q_table

### Step 3: Define Hyperparameters

We define the hyperparameters for the Q-learning algorithm.

-   `learning_rate (α)`: Determines how much new information overrides old information.
-   `discount_rate (γ)`: Balances the importance of immediate and future rewards.
-   `epsilon`: The probability of choosing a random action (exploration).
-   `decay_rate`: The rate at which `epsilon` decreases.

In [None]:
# Hyperparameters
learning_rate = 0.1  # Alpha, learning rate
discount_rate = 0.6  # Gamma, discount factor
epsilon = 1.0        # Exploration rate
decay_rate = 0.01    # Decay rate for epsilon

### Step 4: Train the Q-Learning Agent

We train the agent over a number of episodes. In each episode, the agent interacts with the environment and updates the Q-table based on its experiences.

-   `env.reset()`: Resets the environment to the initial state at the beginning of each episode.
-   `random.uniform(0, 1) < epsilon`: Determines whether to explore or exploit.
-   `env.action_space.sample()`: Selects a random action (exploration).
-   `np.argmax(q_table[state])`: Selects the action with the highest Q-value (exploitation).
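-   `env.step(action)`: Executes the action and returns the next state, the reward, a `done` flag (the episode reached a terminal state), a `truncated` flag (the episode was cut off, e.g. by a time limit), and an `info` dictionary.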
-   `q_table[state, action]`: Updates the Q-value using the Bellman equation.

In [None]:
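# Training variables
# Note: these values are kept very small so the demo runs quickly with rendering on;
# learning a useful policy for Taxi typically needs far more episodes and steps.
num_episodes = 20  # Total number of episodes
max_steps = 10     # Max steps per episode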

# Training the agent
for episode in range(num_episodes):
    state, info = env.reset()  # Reset the environment to the initial state
    done = False               # Variable to check if the episode is finished

    for _ in range(max_steps):
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore: select a random action
        else:
            action = np.argmax(q_table[state])  # Exploit: select the action with max Q-value
        
        next_state, reward, done, truncated, info = env.step(action)  # Take action and observe the result

        # Update Q-table
        q_table[state, action] = q_table[state, action] + learning_rate * (reward + discount_rate * np.max(q_table[next_state]) - q_table[state, action])
        
        state = next_state  # Move to the next state

        if done or truncated:  # Stop when the episode ends or is cut off by a time limit
            break
    
    # Decay epsilon
    epsilon = np.exp(-decay_rate * episode)

print(f"Training completed over {num_episodes} episodes")

### Step 5: Visualize the Agent's Performance

Finally, we visualize the agent's performance by rendering the environment.

-   Renders the environment to visualize the agent's actions.
-   Uses `sleep` to slow down the rendering for better visualization.

In [None]:
from IPython.display import clear_output
from time import sleep

# Watch the trained agent
state, info = env.reset()
done = False
rewards = 0

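for s in range(max_steps):
    clear_output(wait=True)
    env.render()  # Render the environment in the human mode
    action = np.argmax(q_table[state])  # Always exploit: pick the best-known action
    next_state, reward, done, truncated, info = env.step(action)
    rewards += reward
    print(f"Step {s+1}, Total Reward: {rewards}")

    state = next_state
    sleep(0.5)  # Pause so each step is visible

    if done or truncated:  # Stop when the episode ends or is cut off by a time limit
        break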

env.close()
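
Watching a single episode gives only a rough impression of how well the agent does. As a complement, the sketch below averages the total reward of the greedy policy over several episodes. The function `evaluate_greedy` and the `TwoStateEnv` stand-in are hypothetical helpers written for this example; the function assumes the same `reset()` and `step()` return values as the cells above.

```python
import numpy as np

def evaluate_greedy(env, q_table, num_episodes=10, max_steps=100):
    """Average total reward per episode when always taking the highest-valued action.

    Assumes env.reset() returns (state, info) and env.step(action) returns
    (next_state, reward, done, truncated, info).
    """
    totals = []
    for _ in range(num_episodes):
        state, _info = env.reset()
        total = 0.0
        for _ in range(max_steps):
            action = int(np.argmax(q_table[state]))
            state, reward, done, truncated, _info = env.step(action)
            total += reward
            if done or truncated:
                break
        totals.append(total)
    return float(np.mean(totals))

class TwoStateEnv:
    """Hypothetical stand-in environment so the sketch runs without Gym."""

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        # Action 1 in state 0 reaches terminal state 1 with reward +20; anything else costs -1.
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 20.0, True, False, {}
        return self.state, -1.0, False, False, {}

demo_q = np.array([[0.0, 1.0], [0.0, 0.0]])  # made-up Q-values preferring action 1 in state 0
print(evaluate_greedy(TwoStateEnv(), demo_q))
```

To evaluate the Taxi agent, the same function could be called with the trained `q_table` and an environment created without `render_mode="human"`, so that the episodes run quickly.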