# 10.4.3 Q-Learning

## Explanation of Q-Learning

Q-Learning is a model-free reinforcement learning algorithm that learns an optimal action-selection policy for a given environment. It belongs to the class of temporal difference (TD) learning methods, in which the agent learns directly from experience without requiring a model of the environment. The algorithm maintains Q-values, which estimate the expected cumulative discounted reward of taking a specific action in a given state and acting optimally afterwards, and it updates them after every step.

The Q-value update rule is given by:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

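where:
- $Q(s, a)$ is the Q-value of state $s$ and action $a$,
- $\alpha \in (0, 1]$ is the learning rate,
- $r$ is the reward received after taking action $a$ in state $s$,
- $\gamma \in [0, 1]$ is the discount factor,
- $s'$ is the new state after taking action $a$,
- $\max_{a'} Q(s', a')$ is the maximum Q-value over all actions available in the next state $s'$.

The bracketed term is the temporal difference (TD) error: the gap between the bootstrapped target $r + \gamma \max_{a'} Q(s', a')$ and the current estimate $Q(s, a)$.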
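
To make the update rule concrete, here is a minimal sketch of a single update on a small table. The function name `q_update`, the 3-state table, and the hand-picked values are assumptions chosen for illustration only.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update to Q in place and return the TD error."""
    td_target = r + gamma * np.max(Q[s_next])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]             # bracketed term of the update rule
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table: 3 states, 2 actions, with hand-picked values for state 1.
Q = np.zeros((3, 2))
Q[1] = [0.5, 2.0]

td_error = q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(td_error)  # 1.0 + 0.9 * 2.0 - 0.0, i.e. about 2.8
print(Q[0, 1])   # 0.0 + 0.1 * 2.8, i.e. about 0.28
```

Only the entry for the visited pair $(s, a)$ changes; every other entry of the table is left as it was.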

## Applications and Benefits of Q-Learning

- **Applications:**
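  - **Game AI:** Q-Learning can be used to train agents in games with small state spaces, such as tic-tac-toe. Larger games, such as chess or Atari titles, typically require function approximation (e.g., deep Q-networks) rather than a Q-table.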
  - **Robotics:** It can be applied to train robots for tasks like navigation, object manipulation, or any task requiring decision-making.
  - **Resource Management:** Q-Learning can optimize resource allocation in network routing, energy distribution, or cloud computing.

- **Benefits:**
  - **Model-Free:** Q-Learning does not require a model of the environment, making it applicable to a wide range of problems.
  - **Off-Policy:** Q-Learning learns the optimal policy even if the agent is following a different policy (e.g., an exploratory policy).
  - **Simplicity:** The algorithm is relatively simple to implement and understand, making it a good starting point for learning reinforcement learning.
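
The off-policy property can be seen directly in how the target is formed. The sketch below contrasts the Q-Learning target with an on-policy (SARSA-style) target, which is included only as a point of comparison. Both function names and the Q-values are assumptions for illustration.

```python
import numpy as np

def q_learning_target(Q, r, s_next, gamma=0.9):
    """Off-policy target: uses the greedy value of s', whatever the agent does next."""
    return r + gamma * np.max(Q[s_next])

def on_policy_target(Q, r, s_next, a_next, gamma=0.9):
    """On-policy (SARSA-style) target: uses the action actually taken in s'."""
    return r + gamma * Q[s_next, a_next]

# Hypothetical Q-values: in state 1, action 1 is greedy and action 0 is exploratory.
Q = np.array([[0.0, 0.0],
              [0.2, 1.0]])

print(q_learning_target(Q, r=0.0, s_next=1))           # about 0.9
print(on_policy_target(Q, r=0.0, s_next=1, a_next=0))  # about 0.18 (exploratory step)
print(on_policy_target(Q, r=0.0, s_next=1, a_next=1))  # about 0.9 (greedy step)
```

Because the Q-Learning target never looks at the next action, an exploratory move in $s'$ does not pull the estimate for $(s, a)$ away from the greedy value.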

## Methods for Implementing Q-Learning

Implementing Q-Learning involves initializing the Q-values arbitrarily and iteratively updating them as the agent interacts with the environment. The agent selects actions with an epsilon-greedy policy, which balances exploration and exploitation: with probability $\epsilon$ it takes a random action, and otherwise it takes the action with the highest current Q-value. The steps are as follows:

1. **Initialize** the Q-values $Q(s, a)$ arbitrarily (e.g., to zeros).
2. **For each episode**:
   - Initialize the state $s$.
   - **For each step** in the episode:
     - Choose an action $a$ using the epsilon-greedy policy.
     - Take action $a$, observe reward $r$ and next state $s'$.
     - Update $Q(s, a)$ using the Q-Learning update rule.
     - Set the current state $s$ to the next state $s'$.
   - Continue until the episode ends (e.g., reaching a terminal state or maximum steps).
3. **Repeat** the process for a predefined number of episodes or until the Q-values converge.
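
The steps above can be sketched as one function that also handles the two ways an episode can end: reaching a terminal state or hitting a step cap. This is a minimal sketch under stated assumptions: the `ChainEnv` class, its `reset`/`step` interface, the `max_steps` parameter, and the random tie-breaking are all hypothetical choices made for illustration.

```python
import random
import numpy as np

class ChainEnv:
    """Hypothetical 1-D chain: action 1 moves right, action 0 moves left.
    The last state is terminal and entering it yields reward 1."""
    def __init__(self, n_states=6):
        self.n_states = n_states
        self.n_actions = 2
        self.state = 0

    def reset(self, rng):
        self.state = rng.randrange(self.n_states - 1)  # any non-terminal state
        return self.state

    def step(self, action):
        self.state = self.state + 1 if action == 1 else max(0, self.state - 1)
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done

def q_learning(env, n_episodes=500, max_steps=100,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = np.zeros((env.n_states, env.n_actions))        # step 1: initialize
    for _episode in range(n_episodes):                 # step 2: each episode
        s = env.reset(rng)
        for _step in range(max_steps):                 # cap on episode length
            if rng.random() < epsilon:
                a = rng.randrange(env.n_actions)       # explore
            else:
                best = np.flatnonzero(Q[s] == Q[s].max())
                a = rng.choice(best.tolist())          # exploit, ties broken at random
            s_next, r, done = env.step(a)
            # Do not bootstrap from a terminal state: its future value is zero.
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if done:
                break
    return Q

Q = q_learning(ChainEnv())
print(np.argmax(Q[:-1], axis=1))  # greedy action for each non-terminal state
```

Passing an explicit `done` flag keeps the terminal case out of the table entirely, so the row for the terminal state is never read or written during an update.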

The final Q-values define the learned policy: in each state, the agent selects the action with the highest Q-value.
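
Reading the policy off a Q-table is a one-line operation. The table below is a hypothetical example with made-up values (rows are states, columns are actions), not the result of any run in this section.

```python
import numpy as np

# Hypothetical learned Q-table: 3 states, 2 actions.
Q = np.array([
    [0.59, 0.66],
    [0.59, 0.73],
    [0.66, 0.81],
])

greedy_policy = np.argmax(Q, axis=1)  # index of the best action in each state
state_values = np.max(Q, axis=1)      # value of taking that action

print(greedy_policy)  # [1 1 1]
print(state_values)   # [0.66 0.73 0.81]
```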


___
___
### Readings:
- [An introduction to Q-Learning: reinforcement learning](https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/)
- [Q-learning: a value-based reinforcement learning algorithm](https://medium.com/intro-to-artificial-intelligence/q-learning-a-value-based-reinforcement-learning-algorithm-272706d835cf)
- [Introducing Q-Learning \( Hugging Face \)](https://huggingface.co/learn/deep-rl-course/en/unit2/q-learning)
- [SARSA & Q Learning in Temporal Difference for Reinforcement Learning](https://medium.com/data-science-in-your-pocket/sarsa-q-learning-in-temporal-difference-for-reinforcement-learning-with-example-8bfd902a5d2)
___
___

In [1]:
import numpy as np
import random
from collections import defaultdict

In [2]:
# Chain environment (states 0..5 in a row, state 5 is terminal) and hyperparameters
n_states = 6   # Number of states
n_actions = 2  # Number of actions (0 = left, 1 = right)
gamma = 0.9    # Discount factor
alpha = 0.1    # Learning rate
epsilon = 0.1  # Exploration rate
n_episodes = 500  # Number of episodes

In [3]:
rewards = np.zeros(n_states)
rewards[-1] = 1.0  # Reward at the terminal state

In [4]:
Q = defaultdict(lambda: np.zeros(n_actions))

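# Epsilon-greedy policy
def choose_action(state):
    """Return a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)  # Explore
    # Exploit; np.argmax breaks ties in favor of the lowest index (action 0)
    return int(np.argmax(Q[state]))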

In [5]:
# Q-Learning algorithm
for episode in range(n_episodes):
    state = np.random.randint(0, n_states - 1)  # Random non-terminal start (upper bound is exclusive)
    while state != n_states - 1:                # Loop until reaching the terminal state
        action = choose_action(state)
        # Deterministic transition: move right, or move left but not past state 0
        next_state = state + 1 if action == 1 else max(0, state - 1)
        reward = rewards[next_state]            # 1.0 only on entering the terminal state
        # Q[terminal] is never updated, so it stays at zero and adds nothing to the target
        td_target = reward + gamma * np.max(Q[next_state])
        Q[state][action] += alpha * (td_target - Q[state][action])
        state = next_state

In [6]:
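# Greedy action per state. The last entry belongs to the terminal state, whose
# Q-values are never updated, so its value (argmax of all zeros = 0) carries no meaning.
optimal_policy = np.argmax([Q[state] for state in range(n_states)], axis=1)

print("Optimal Policy:", optimal_policy)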

Optimal Policy: [1 1 1 1 1 0]
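
As a cross-check on the learned policy, the optimal Q-values of this small chain can be computed exactly, because its dynamics are known and deterministic. The sketch below uses value iteration, a model-based method that is not part of Q-Learning itself. The helper `step` is a hypothetical function that mirrors the transition and reward logic of the training loop above.

```python
import numpy as np

n_states, n_actions, gamma = 6, 2, 0.9
terminal = n_states - 1

def step(state, action):
    """Deterministic chain dynamics, mirroring the training loop."""
    next_state = state + 1 if action == 1 else max(0, state - 1)
    reward = 1.0 if next_state == terminal else 0.0
    return next_state, reward

Q_star = np.zeros((n_states, n_actions))
for _ in range(100):  # far more sweeps than this 6-state chain needs
    for s in range(terminal):  # the terminal row stays at zero
        for a in range(n_actions):
            s_next, r = step(s, a)
            Q_star[s, a] = r + gamma * Q_star[s_next].max()

print(np.round(Q_star, 3))
print(np.argmax(Q_star[:terminal], axis=1))  # [1 1 1 1 1]
```

Moving right from state $s$ is worth $\gamma^{4 - s}$ (for example $0.9^4 \approx 0.656$ from state 0), which always exceeds the value of moving left, so the exact greedy policy agrees with the first five entries of the learned policy.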


## Conclusion

Q-Learning is a widely used reinforcement learning algorithm that enables agents to learn optimal policies through interaction with the environment. By iteratively updating Q-values based on the rewards received, tabular Q-Learning converges to an optimal policy that maximizes long-term reward, provided that every state-action pair keeps being visited and the learning rate is decayed appropriately. The example provided demonstrates a simple implementation of Q-Learning in a six-state chain environment, highlighting the core concepts and steps involved. Understanding and applying Q-Learning is fundamental for solving many reinforcement learning tasks, making it a valuable tool in AI and machine learning.
