# SARSA Algorithm: On-Policy Temporal Difference Control

SARSA stands for **State–Action–Reward–State–Action**, after the quintuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ that each update uses.

Unlike Q-Learning (which is **off-policy**), **SARSA is an on-policy algorithm**: it learns the value of the policy it actually follows, exploration included, rather than the value of the greedy policy.

---
### 🔍 SARSA Update Rule

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$$

Where:
- $\alpha$ → Learning rate
- $\gamma$ → Discount factor
- $r_{t+1}$ → Reward received from the environment after taking $a_t$ in $s_t$
- $a_{t+1}$ → Next action chosen by the **current policy**
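To make the update concrete, here is a minimal sketch of a single SARSA step on made-up numbers. The function name `sarsa_update` and all values are illustrative assumptions, not part of any library.

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma):
    """Return the updated Q(s_t, a_t) after one SARSA step."""
    td_target = reward + gamma * q_next  # r_{t+1} + gamma * Q(s_{t+1}, a_{t+1})
    td_error = td_target - q_sa          # gap between target and current estimate
    return q_sa + alpha * td_error

# Assumed values: Q(s_t, a_t) = 0.5, r_{t+1} = 1, Q(s_{t+1}, a_{t+1}) = 2
new_q = sarsa_update(q_sa=0.5, reward=1.0, q_next=2.0, alpha=0.1, gamma=0.9)
print(round(new_q, 2))  # 0.73
```

The target is $1 + 0.9 \times 2 = 2.8$, the TD error is $2.8 - 0.5 = 2.3$, and the estimate moves a fraction $\alpha = 0.1$ of the way toward the target.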

---
### ⚙️ Key Differences Between SARSA and Q-Learning
| Feature | SARSA | Q-Learning |
|----------|--------|-------------|
| Type | On-policy | Off-policy |
| Update Target | $r + \gamma Q(s', a')$ | $r + \gamma \max_a Q(s', a)$ |
| What is learned | Value of the ε-greedy policy being followed | Value of the greedy policy, regardless of the behavior policy |
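The difference in update targets can be seen on a single transition. The sketch below uses assumed Q-values for the next state and assumes the ε-greedy policy happened to pick the non-greedy action; none of the numbers come from a trained agent.

```python
import numpy as np

gamma = 0.9
reward = 0.0
q_next = np.array([1.0, 3.0])  # assumed Q(s', a) for actions 0 and 1
next_action = 0                # assumed: the policy explored and picked action 0

# SARSA bootstraps from the action that was actually chosen in s'.
sarsa_target = reward + gamma * q_next[next_action]

# Q-Learning bootstraps from the best action in s', whatever was chosen.
q_learning_target = reward + gamma * q_next.max()

print(f"SARSA target:      {sarsa_target:.2f}")
print(f"Q-Learning target: {q_learning_target:.2f}")
```

When the chosen action is also the greedy one, the two targets coincide; they differ only on exploratory steps.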

---
### 🧩 Example: SARSA in a Simple Grid Environment

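```python
import numpy as np
import random

random.seed(0)  # fixed seed so repeated runs give the same Q-table

# Define environment parameters: a 1-D corridor with terminal states at both ends
n_states = 6
actions = [0, 1]  # 0 = left, 1 = right
Q = np.zeros((n_states, len(actions)))
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 0.2  # Exploration rate

def step(state, action):
    """Return (next_state, reward); only reaching state 5 pays a reward of 1."""
    if state == 0 or state == 5:  # terminal states are absorbing
        return state, 0
    if action == 1:  # Move right
        next_state = state + 1
    else:  # Move left
        next_state = state - 1
    reward = 1 if next_state == 5 else 0
    return next_state, reward

def choose_action(state):
    """ε-greedy action selection with random tie-breaking."""
    if random.random() < epsilon:
        return random.choice(actions)
    # np.argmax alone returns action 0 on ties, so an untrained agent would
    # move left on every greedy step and rarely reach the reward at state 5.
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(random.choice(best.tolist()))

# Training loop for SARSA
for episode in range(200):
    state = 2  # Start from the middle
    action = choose_action(state)
    while state not in [0, 5]:
        next_state, reward = step(state, action)
        next_action = choose_action(next_state)

        # SARSA update rule (terminal rows of Q are never updated, so they stay 0)
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * Q[next_state, next_action] - Q[state, action]
        )

        state, action = next_state, next_action

print("Learned Q-Table:")
print(Q)
# Entries for the terminal states 0 and 5 carry no meaning in the policy below.
print("\nDerived Policy (0=Left, 1=Right):")
print(np.argmax(Q, axis=1))
```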

---
### 📊 Comparing SARSA and Q-Learning
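- **SARSA** is more conservative: its target uses the action the ε-greedy policy actually picks next, so the cost of occasional exploratory moves is built into the learned values.
- **Q-Learning** bootstraps from the best action in the next state, as if the agent will always act greedily from then on, even while it is still exploring.

Hence, in risky environments where an exploratory step can be costly (for example, walking along the edge of a cliff), **SARSA tends to produce safer policies**, while **Q-Learning** may produce more aggressive ones.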
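A rough numerical illustration of this point: suppose the next state has one very bad action and one good action. The sketch below averages SARSA's sampled target over the ε-greedy choice of the next action and compares it with Q-Learning's target. The Q-values and the helper `epsilon_greedy_probs` are assumptions made up for this example.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under ε-greedy (ties go to the first maximum)."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)              # uniform exploration mass
    probs[np.argmax(q_values)] += 1.0 - epsilon  # remaining mass on the greedy action
    return probs

gamma, epsilon, reward = 0.9, 0.2, 0.0
q_next = np.array([-100.0, 1.0])  # assumed: action 0 is disastrous, action 1 is safe

probs = epsilon_greedy_probs(q_next, epsilon)
# Average of the SARSA target over the next action the policy might sample.
avg_sarsa_target = reward + gamma * float(probs @ q_next)
# Q-Learning ignores the exploration risk and uses the best action only.
q_learning_target = reward + gamma * float(q_next.max())

print(f"Average SARSA target: {avg_sarsa_target:.2f}")
print(f"Q-Learning target:    {q_learning_target:.2f}")
```

With a 10% chance of sampling the disastrous action, SARSA's target is negative on average, so states next to the hazard look unattractive and the learned policy tends to steer away from them. Q-Learning's target stays positive because it only looks at the best action.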

---
### 🧮 SARSA Algorithm Steps
1. Initialize `Q(s, a)` arbitrarily for all non-terminal states, and set `Q(terminal, ·) = 0`.
2. For each episode:
   - Initialize `s` and choose `a` using ε-greedy.
   - Repeat for each step, until `s` is a terminal state:
     - Take action `a`, observe `r` and `s'`.
     - Choose `a'` from `s'` using ε-greedy.
     - Update: `Q(s, a) ← Q(s, a) + α [r + γQ(s', a') - Q(s, a)]`.
     - Set `s ← s'`, `a ← a'`.
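The steps above can be sketched as a reusable function. This is one possible arrangement, not a canonical implementation: the `step_fn(state, action) -> (next_state, reward)` and `is_terminal(state)` callables, the zero initialization, the random tie-breaking, and the 4-state chain used to try it out are all assumptions chosen for illustration.

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, else a greedy one (random tie-break)."""
    if rng.random() < epsilon:
        return rng.randrange(Q.shape[1])
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best.tolist()))

def sarsa(step_fn, is_terminal, start_state, n_states, n_actions,
          episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = np.zeros((n_states, n_actions))           # step 1: zeros, terminal rows stay 0
    for _ in range(episodes):                     # step 2: loop over episodes
        s = start_state
        a = epsilon_greedy(Q, s, epsilon, rng)    # initial action from ε-greedy
        while not is_terminal(s):
            s2, r = step_fn(s, a)                 # take a, observe r and s'
            a2 = epsilon_greedy(Q, s2, epsilon, rng)  # choose a' from s'
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])  # SARSA update
            s, a = s2, a2
    return Q

# Hypothetical 4-state chain: states 0 and 3 are terminal, reaching 3 pays 1.
def chain_step(s, a):
    s2 = s + 1 if a == 1 else s - 1
    return s2, (1.0 if s2 == 3 else 0.0)

Q = sarsa(chain_step, lambda s: s in (0, 3), start_state=1, n_states=4, n_actions=2)
print(np.argmax(Q[1:3], axis=1))  # greedy action in the two non-terminal states
```

Passing the environment in as plain callables keeps the learning loop independent of any particular environment library; on this chain the greedy action in both non-terminal states should come out as 1 (move right).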

---
### 📈 Visualization of Q-values

```python
import matplotlib.pyplot as plt

# Uses Q and actions from the SARSA training cell above.
plt.figure(figsize=(8, 5))
for a in range(len(actions)):
    plt.plot(Q[:, a], label=f'Action {a}')
plt.title('Learned Q-values (SARSA)')
plt.xlabel('State')
plt.ylabel('Q-value')
plt.legend()
plt.grid(True)
plt.show()
```

---
### ⚙️ Summary: SARSA vs Q-Learning
| Algorithm | Policy Type | Update Target | Behavior |
|------------|-------------|----------------|-----------|
| **SARSA** | On-policy | $r + \gamma Q(s', a')$ | Learns the value of the policy it follows |
| **Q-Learning** | Off-policy | $r + \gamma \max_a Q(s', a)$ | Learns the value of the greedy policy |

---
### 📚 References
- Sutton & Barto, *Reinforcement Learning: An Introduction* (Ch. 6)
- David Silver, *UCL RL Course*, Lecture 5 (TD Control)
- OpenAI Spinning Up, *SARSA Explained*