<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Machine%20Learning%20Interview%20Prep%20Questions/Reinforcement%20Learning%20Algorithms/SARSA/sarsa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SARSA: On-Policy Temporal Difference Control

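SARSA (State-Action-Reward-State-Action) is a reinforcement learning algorithm that learns an action-value function $Q(s, a)$ for a given environment. Unlike Q-Learning, which is off-policy, SARSA is **on-policy**: it updates its Q-values using the next action that the current policy actually selects, not the greedy action. With sufficient exploration and a policy that becomes greedy over time, these estimates approach the optimal action-value function.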
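The on-policy versus off-policy distinction shows up in the value each algorithm bootstraps from. The sketch below is illustrative only: the function names `sarsa_target` and `q_learning_target`, the tiny Q-table, and the numbers are assumptions made for this example, not part of the notebook's code.

```python
import numpy as np

def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action the policy actually chose."""
    return reward + gamma * Q[next_state, next_action]

def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the greedy action, whatever was chosen."""
    return reward + gamma * np.max(Q[next_state])

# Hypothetical Q-table with 2 states and 2 actions, for illustration only.
Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])

# Suppose the behaviour policy explored and picked action 0 in state 1.
on_policy = sarsa_target(Q, reward=-1.0, next_state=1, next_action=0, gamma=0.9)
off_policy = q_learning_target(Q, reward=-1.0, next_state=1, gamma=0.9)

print(on_policy)   # -1 + 0.9 * Q[1, 0]
print(off_policy)  # -1 + 0.9 * max(Q[1])
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.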

---

## SARSA Update Rule

The SARSA update rule is:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
$$

Where:

- $s_t$: current state
- $a_t$: current action
- $r_{t+1}$: reward received after taking action $a_t$
- $s_{t+1}$: next state
- $a_{t+1}$: next action taken by the current policy
- $\alpha$: learning rate
- $\gamma$: discount factor
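To make the rule concrete, here is a minimal sketch of a single update on scalar values. The helper name `sarsa_update` and the numbers are assumptions chosen for easy arithmetic; they do not come from a real training run.

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma):
    """Return the updated Q(s_t, a_t) after one SARSA step."""
    td_target = reward + gamma * q_next   # r_{t+1} + gamma * Q(s_{t+1}, a_{t+1})
    td_error = td_target - q_sa           # how far the current estimate is off
    return q_sa + alpha * td_error        # move a fraction alpha toward the target

# Illustrative numbers (assumed): target = -1 + 0.5 * 4 = 1, error = 1 - 2 = -1
new_q = sarsa_update(q_sa=2.0, reward=-1.0, q_next=4.0, alpha=0.1, gamma=0.5)
print(new_q)  # about 1.9
```

The estimate moves only a fraction $\alpha$ of the way toward the target, which is why many visits are needed before the values settle.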

---

## Key Properties

- **On-policy:** Learns the value of the policy it is actually following, exploratory actions included.
- **Exploration strategy:** Typically paired with an ε-greedy policy to balance exploration and exploitation.
- **Goal:** Learn a policy that maximizes the expected cumulative (discounted) reward.
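As a rough illustration of the ε-greedy strategy mentioned above, the sketch below computes the probability of each action rather than sampling one. The function name `epsilon_greedy_probs` and the sample Q-values are hypothetical; the scheme assumed here explores uniformly over all actions (the greedy one included) with probability ε.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Probability of each action under an epsilon-greedy policy.

    Assumes ties are broken in favour of the lowest index, as np.argmax does.
    """
    n = len(q_values)
    probs = np.full(n, epsilon / n)                    # exploration mass, spread uniformly
    probs[int(np.argmax(q_values))] += 1.0 - epsilon   # remaining mass on the greedy action
    return probs

# Hypothetical Q-values for one state with four actions.
probs = epsilon_greedy_probs(np.array([0.2, 1.5, -0.3, 0.0]), epsilon=0.1)
print(probs)
```

With ε = 0.1 and four actions, each non-greedy action gets probability 0.025 and the greedy action gets the rest.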

---

## Code Implementation (SARSA on a 4x4 GridWorld)


```python
import numpy as np
import random

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

# Define the 4x4 GridWorld environment (states 0..15, numbered row by row)
grid_size = 4
num_states = grid_size * grid_size
num_actions = 4  # 0 = Up, 1 = Right, 2 = Down, 3 = Left

# Environment dynamics: moves are deterministic, and a move into a wall
# leaves the state unchanged
def get_next_state(state, action):
    row = state // grid_size
    col = state % grid_size

    if action == 0 and row > 0:                # Up
        row -= 1
    elif action == 1 and col < grid_size - 1:  # Right
        col += 1
    elif action == 2 and row < grid_size - 1:  # Down
        row += 1
    elif action == 3 and col > 0:              # Left
        col -= 1

    return row * grid_size + col

# Reward function: evaluated on the state the agent lands in
def get_reward(state):
    if state == 15:
        return 10  # Goal state
    else:
        return -1  # Penalty for each step

# Epsilon-greedy policy: explore with probability epsilon, otherwise act greedily
def choose_action(state, Q, epsilon):
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, num_actions - 1)
    else:
        return np.argmax(Q[state])

# SARSA parameters
alpha = 0.1       # Learning rate
gamma = 0.99      # Discount factor
epsilon = 0.1     # Exploration rate
episodes = 500    # Total training episodes

# Initialize Q-table with zeros
Q = np.zeros((num_states, num_actions))

# Training loop
for episode in range(episodes):
    state = 0  # Start at top-left corner (state 0)
    action = choose_action(state, Q, epsilon)

    while state != 15:  # Until we reach the goal
        next_state = get_next_state(state, action)
        reward = get_reward(next_state)
        next_action = choose_action(next_state, Q, epsilon)

        # SARSA update. Q[15] is never written to, so it stays at zero and the
        # goal state contributes no bootstrapped value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])

        # Move to next state-action pair
        state = next_state
        action = next_action

# Display the greedy policy derived from the learned Q-table
actions_map = ['↑', '→', '↓', '←']
policy_grid = np.array(
    [actions_map[np.argmax(Q[s])] if s != 15 else '🏁' for s in range(num_states)]
).reshape((grid_size, grid_size))
print("Learned Policy (SARSA):")
print(policy_grid)
```

```text
Learned Policy (SARSA):
[['→' '↓' '↓' '↓']
 ['→' '→' '↓' '↓']
 ['→' '→' '→' '↓']
 ['→' '→' '→' '🏁']]
```
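One way to sanity-check a learned table is to follow its greedy actions from the start state and see whether the goal is reached. The sketch below is a self-contained assumption-laden example: `step` re-implements the same deterministic moves as `get_next_state`, `greedy_rollout` is a hypothetical helper, and `demo_Q` is a hand-built stand-in for a trained table, not the result of the run above.

```python
import numpy as np

GRID_SIZE = 4
GOAL = 15

def step(state, action):
    """Deterministic grid moves: 0 = Up, 1 = Right, 2 = Down, 3 = Left."""
    row, col = divmod(state, GRID_SIZE)
    if action == 0 and row > 0:
        row -= 1
    elif action == 1 and col < GRID_SIZE - 1:
        col += 1
    elif action == 2 and row < GRID_SIZE - 1:
        row += 1
    elif action == 3 and col > 0:
        col -= 1
    return row * GRID_SIZE + col

def greedy_rollout(Q, start=0, max_steps=50):
    """Follow argmax(Q) from start; return the visited states,
    or None if the goal is not reached within max_steps."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == GOAL:
            return path
        state = step(state, int(np.argmax(Q[state])))
        path.append(state)
    return path if state == GOAL else None

# Hand-built stand-in table (hypothetical): prefer Right, except Down in the last column.
demo_Q = np.zeros((GRID_SIZE * GRID_SIZE, 4))
for s in range(GRID_SIZE * GRID_SIZE):
    if s % GRID_SIZE < GRID_SIZE - 1:
        demo_Q[s, 1] = 1.0
    else:
        demo_Q[s, 2] = 1.0

path = greedy_rollout(demo_Q)
print(path)  # [0, 1, 2, 3, 7, 11, 15]
```

With the notebook's trained table, the same check would be `greedy_rollout(Q)`; a `None` result would indicate that the greedy policy loops or gets stuck.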


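## Key Points

- **On-policy:** SARSA evaluates and improves the same ε-greedy policy it uses to pick actions.
- **Exploration-aware:** Each update bootstraps from the action actually taken in the next state, so the cost of exploratory moves is reflected in the Q-values.
- **Simple environment:** The deterministic 4x4 grid keeps the Q-table small (16 states × 4 actions), which makes the learned policy easy to inspect.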