# Q-Learning

Q-learning is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for an agent interacting with an environment. The agent learns a Q-value function that estimates the utility (or quality) of taking a specific action in a specific state to maximize long-term rewards.

**Key Components of Q-Learning**
- States (S): The possible configurations or situations in which the agent can be.
- Actions (A): The set of actions the agent can perform in each state.
- Reward (R): The immediate feedback received after performing an action.
- Q-value (Q(s,a)): A function that represents the expected cumulative reward of taking action a in state s and following the optimal policy afterward.

---

**The Q-Learning Algorithm**
The Q-value is updated iteratively using an update rule derived from the Bellman equation:  
$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$  

Where:  
- Q(s,a): Current estimate of the Q-value for state s and action a.
- α: Learning rate (how much new information overrides old information).
- R: Immediate reward received after taking action a in state s.
- γ: Discount factor (how much future rewards are valued compared to immediate rewards).
- $\max_{a'} Q(s', a')$: Maximum Q-value for the next state $s'$, over all possible actions $a'$.
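
As a worked illustration of a single update, here is a minimal sketch. The function name `q_update`, the small 2x2 table, and the chosen numbers are assumptions made up for this example only.

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # Target: immediate reward plus the discounted best value of the next state
    td_target = reward + gamma * np.max(q_table[next_state])
    # Error: how far the current estimate is from that target
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return q_table[state, action]

# Hypothetical table with 2 states and 2 actions, used only for illustration
q = np.array([[0.0, 0.5],
              [0.2, 1.0]])
new_value = q_update(q, state=0, action=1, reward=0.0, next_state=1)
print(new_value)
```

Here the target is 0 + 0.9 × 1.0 = 0.9, the error is 0.9 − 0.5 = 0.4, and the estimate moves one tenth of the way toward the target, to roughly 0.54.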



**Simplified Formula**
$$
Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')
$$
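
A short derivation may help connect the two formulas. Assuming deterministic transitions and rewards (an assumption of this sketch), setting the learning rate to α = 1 in the update rule makes the old estimate cancel:

$$
\begin{aligned}
Q(s, a) &\leftarrow Q(s, a) + 1 \cdot \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \\
        &= R + \gamma \max_{a'} Q(s', a')
\end{aligned}
$$

With α < 1 the update moves only part of the way toward this target, which averages out noise when rewards or transitions are stochastic.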

**Algorithm Steps**
1. Initialization:
    - Set Q(s,a) to arbitrary values (e.g., zero) for all state-action pairs.
2. Action Selection:
    - Use an exploration strategy (e.g., ϵ-greedy) to choose an action:
        - With probability ϵ, choose a random action (exploration).
        - With probability 1−ϵ, choose the action with the highest Q-value (exploitation).
3. Action Execution and Feedback:
    - Perform the chosen action a, then observe the new state s′ and the reward R.
4. Q-value Update:
    - Update Q(s,a) using the update rule above.
5. Repeat:
    - Continue from step 2 until the Q-values converge or a fixed number of episodes has run.
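
The action-selection step can be sketched as a small helper. The name `epsilon_greedy` and the random tie-breaking among equally good actions are assumptions of this example, not something the steps above prescribe.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one row of the Q-table."""
    if rng.random() < epsilon:
        # Explore: uniformly random action
        return int(rng.integers(len(q_row)))
    # Exploit: highest Q-value, breaking ties at random
    best = np.flatnonzero(q_row == np.max(q_row))
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_row = np.array([0.2, 0.7, 0.7])  # hypothetical Q-values for one state
greedy_action = epsilon_greedy(q_row, epsilon=0.0, rng=rng)
print(greedy_action)
```

Breaking ties at random matters mostly early in training, when every Q-value is still zero and a plain `argmax` would always return the first action.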

---

**Key Features of Q-Learning**
- Model-Free: Q-learning does not require a model of the environment (e.g., transition probabilities).
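- Off-Policy: It learns the optimal policy independently of the behavior policy the agent uses to explore, because each update bootstraps from the best action in the next state rather than the action actually taken there.
- Convergence: In the tabular setting, Q-learning converges to the optimal Q-values provided every state-action pair keeps being visited and the learning rate is decayed appropriately.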

---

**Applications of Q-Learning**
- Robotics: Navigation and control.
- Game playing: Agents learning to play board games or video games.
- Operations research: Solving optimization problems.
- Finance: Portfolio management and trading strategies.

Q-learning serves as a foundation for more advanced reinforcement learning algorithms like Deep Q-Networks (DQN), where deep neural networks approximate the Q-value function for environments with large or continuous state spaces.

In [1]:
import numpy as np
import random

In [2]:
# Define a simple environment
# States: 0 (start), 1, 2, 3 (goal)
# Actions: 0 (left), 1 (right)
rewards = [0, 0, 0, 1]  # Reward is 1 only at the goal state (3)
num_states = len(rewards)
num_actions = 2

In [3]:
# Initialize Q-table with zeros
q_table = np.zeros((num_states, num_actions))

In [4]:
q_table

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [5]:
# Hyperparameters
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.2  # Exploration rate
episodes = 1000

In [6]:
# Q-learning algorithm
for episode in range(episodes):
    state = 0  # Start state
    while state != num_states - 1:  # Loop until reaching the goal (state 3)
        # Choose an action (epsilon-greedy)
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(num_actions)  # Explore: random action
        else:
            action = np.argmax(q_table[state])  # Exploit: best action (ties go to action 0)
        
        # Determine the next state
        next_state = state + 1 if action == 1 else max(0, state - 1)
        
        # Update Q-value
        q_table[state, action] = q_table[state, action] + learning_rate * (
            rewards[next_state] + discount_factor * np.max(q_table[next_state]) - q_table[state, action]
        )
        
        state = next_state  # Move to the next state


In [7]:
# Display the learned Q-table
print("Learned Q-Table:")
print(q_table)

Learned Q-Table:
[[0.72899977 0.81      ]
 [0.72899748 0.9       ]
 [0.8099861  1.        ]
 [0.         0.        ]]
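
The learned values can be sanity-checked against the exact optimum. The sketch below assumes the environment model is known (which Q-learning itself does not need) and applies repeated Bellman backups, i.e. value iteration rather than Q-learning, to the same four-state chain. The `step` helper and the name `q_star` are introduced here for illustration.

```python
import numpy as np

rewards = [0, 0, 0, 1]
num_states, num_actions = len(rewards), 2
gamma = 0.9
goal = num_states - 1

def step(state, action):
    """Deterministic transition: action 1 moves right, action 0 moves left (clamped at 0)."""
    return state + 1 if action == 1 else max(0, state - 1)

q_star = np.zeros((num_states, num_actions))
for _ in range(100):  # far more sweeps than this tiny chain needs
    for s in range(goal):  # the goal row is terminal and stays at zero
        for a in range(num_actions):
            s_next = step(s, a)
            q_star[s, a] = rewards[s_next] + gamma * np.max(q_star[s_next])

print(np.round(q_star, 3))
```

The "right" column works out to 0.81, 0.9 and 1.0 (that is, γ², γ and 1), and the "left" column to 0.729, 0.729 and 0.81, so the learned table agrees with the optimum up to a small residual error in the "left" entries.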


In [8]:
# Test the trained agent
state = 0
print("\nTesting the agent:")
while state != num_states - 1:  # Stop at the goal (state 3)
    action = np.argmax(q_table[state])  # Pick the best action
    state = state + 1 if action == 1 else max(0, state - 1)
    print(f"Moved to state {state}")


Testing the agent:
Moved to state 1
Moved to state 2
Moved to state 3
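
The same greedy policy can also be read off the Q-table in one step instead of rolling it out. This is a minimal sketch: the table values are copied from the learned Q-table printed earlier, and the `action_names` list is an assumption added for readability.

```python
import numpy as np

# Values copied from the learned Q-table shown earlier
q_table = np.array([
    [0.72899977, 0.81],
    [0.72899748, 0.9],
    [0.8099861, 1.0],
    [0.0, 0.0],
])
action_names = ["left", "right"]  # action 0 = left, action 1 = right

# Best action per state: index of the largest value in each row
greedy_policy = np.argmax(q_table, axis=1)
for state, action in enumerate(greedy_policy[:-1]):  # skip the terminal goal state
    print(f"State {state}: go {action_names[action]}")
```

Every non-terminal state selects "right", which is consistent with the three moves printed by the test run.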
