# 10.4.2 SARSA (State-Action-Reward-State-Action)

### Explanation of SARSA (State-Action-Reward-State-Action)

SARSA is an on-policy temporal-difference (TD) control algorithm. It updates the action-value function $Q$ using five quantities: the current state, the action taken, the reward received, the next state, and the next action. The acronym SARSA stands for State-Action-Reward-State-Action, which is exactly this sequence.

The update rule for SARSA is given by:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$

Where:
- $s_t$ and $a_t$ are the current state and action, respectively.
- $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$.
- $s_{t+1}$ and $a_{t+1}$ are the next state and the action selected in that state, respectively.
- $\alpha \in (0, 1]$ is the learning rate, which controls how far each update moves $Q(s_t, a_t)$ toward the target.
- $\gamma \in [0, 1]$ is the discount factor, which weights future rewards relative to immediate ones.
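
To make the update rule concrete, here is a minimal sketch of a single update in Python. The function name `sarsa_update` and the numeric values are illustrative assumptions chosen for this example; they do not come from any library.

```python
def sarsa_update(q_sa, reward, q_next, alpha=0.1, gamma=0.9):
    """Return the updated Q(s_t, a_t) after one SARSA step.

    q_sa   -- current estimate Q(s_t, a_t)
    reward -- r_{t+1}
    q_next -- Q(s_{t+1}, a_{t+1}) for the action actually selected next
    """
    td_target = reward + gamma * q_next   # r_{t+1} + gamma * Q(s_{t+1}, a_{t+1})
    td_error = td_target - q_sa           # the bracketed term in the update rule
    return q_sa + alpha * td_error


# Hypothetical numbers: Q(s_t, a_t) = 0.5, r_{t+1} = 1.0, Q(s_{t+1}, a_{t+1}) = 0.8
new_q = sarsa_update(q_sa=0.5, reward=1.0, q_next=0.8)
print(round(new_q, 3))  # 0.622
```

Here the target is $1.0 + 0.9 \times 0.8 = 1.72$, the TD error is $1.72 - 0.5 = 1.22$, and the estimate moves one tenth of the way toward the target.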

Because the target uses $a_{t+1}$, the action that the behavior policy actually selects in $s_{t+1}$, SARSA learns the value of the policy being followed, including its exploration strategy.


 
### Methods for Implementing SARSA

1. **Initialize the Q-Table**: Set every $Q(s, a)$ to zero or a small random value. Values for terminal states should stay at zero.
2. **Choose an Action**: At the start of each episode, use an epsilon-greedy policy over the current Q-values to choose an action in the initial state.
3. **Take the Action and Observe the Outcome**: Execute the chosen action, then observe the reward received and the next state.
4. **Choose the Next Action**: Use the same epsilon-greedy policy to choose the next action in the new state.
5. **Update the Q-Value**: Update the Q-value for the current state-action pair using the SARSA update rule.
6. **Repeat**: Make the next state and next action the current ones and continue from step 3 until the episode ends. Run further episodes until the Q-values converge or a predefined number of episodes is reached.

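Provided every state-action pair keeps being visited, this process improves the policy over time. With a fixed exploration rate, SARSA learns the best policy among those that continue to explore, which is near-optimal. Converging to the optimal greedy policy additionally requires exploration to be reduced over time, for example by gradually decaying epsilon.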


___
___
### Readings:
- [Reinforcement Learning: Temporal Difference Learning — Part 2](https://readmedium.com/en/https:/medium.com/analytics-vidhya/reinforcement-learning-temporal-difference-learning-part-2-c290af52f483)
- [The Epsilon-Greedy Algorithm for Reinforcement Learning](https://medium.com/analytics-vidhya/the-epsilon-greedy-algorithm-for-reinforcement-learning-5fe6f96dc870)
- [Temporal differencing (SARSA, Q learning, Expected SARSA)](https://medium.com/@j13mehul/reinforcement-learning-part-6-temporal-differencing-sarsa-q-learning-expected-sarsa-b7725c755410)
- [Reinforcement Learning: SARSA](https://readmedium.com/en/https:/towardsdev.com/reinforcement-learning-sarsa-1a703f0cb25b)
- [Reinforcement Learning with SARSA — A Good Alternative to Q-Learning Algorithm](https://readmedium.com/en/https:/towardsdatascience.com/reinforcement-learning-with-sarsa-a-good-alternative-to-q-learning-algorithm-bf35b209e1c)
- [Navigating Complexity: The Role of SARSA in Reinforcement Learning for Game Strategy Optimization](https://readmedium.com/en/https:/pub.aimind.so/navigating-complexity-the-role-of-sarsa-in-reinforcement-learning-for-game-strategy-optimization-6dff9e630453)
___
___

### Benefits and Scenarios for Using SARSA

- **On-Policy Learning**: SARSA evaluates and improves the same policy it uses to act. This suits environments where the agent's behavior during learning matters, because the exploration strategy is part of what is being evaluated.
- **Exploration-Sensitive**: Because SARSA bootstraps from the action actually taken in the next state, its value estimates account for the exploration policy. This can lead to safer behavior than Q-learning, which is an off-policy method. For example, when an occasional random action near a hazard is costly, SARSA tends to learn a path that keeps more distance from the hazard.
- **Stability**: SARSA can be more stable than Q-learning in noisy or highly variable environments, because its target uses the action chosen by the current policy rather than the maximum Q-value over all next actions, which noise can overestimate.
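
The contrast with Q-learning comes down to the bootstrap target. The sketch below compares the two targets for one transition. The function names and the Q-values are hypothetical, picked so that one next action is clearly bad.

```python
def sarsa_target(reward, q_next_row, next_action, gamma=0.9):
    """On-policy target: bootstrap from the action actually selected next."""
    return reward + gamma * q_next_row[next_action]


def q_learning_target(reward, q_next_row, gamma=0.9):
    """Off-policy target: bootstrap from the best next action, taken or not."""
    return reward + gamma * max(q_next_row)


# Hypothetical next-state Q-values: action 0 is costly, action 1 is good.
q_next_row = [-10.0, 2.0]
exploratory_action = 0  # assume epsilon-greedy happened to explore

print(sarsa_target(0.0, q_next_row, exploratory_action))  # -9.0
print(q_learning_target(0.0, q_next_row))                 # 1.8
```

When the exploratory action is the costly one, the SARSA target reflects that cost, while the Q-learning target ignores it. Over many updates this is why SARSA's estimates are lower for states where exploration is risky.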

In [1]:
import numpy as np
import random
from collections import defaultdict

In [2]:
# Initialize parameters
n_states = 6   # Number of states (0..5 on a line; state 5 is the terminal goal)
n_actions = 2  # Number of actions (0 = left, 1 = right)
alpha = 0.1    # Learning rate
gamma = 0.9    # Discount factor
epsilon = 0.1  # Exploration rate
n_episodes = 1000  # Number of episodes

In [3]:
# Initialize Q-table
Q = defaultdict(lambda: np.zeros(n_actions))

In [4]:
def epsilon_greedy_policy(state):
    """Return an action for `state` using an epsilon-greedy policy over Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)  # Explore: uniform random action
    # Exploit: greedy action; np.argmax resolves ties to the lowest index (0 = left)
    return int(np.argmax(Q[state]))
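
One practical detail: `np.argmax` returns the first maximal index, so while a state's Q-values are all equal the greedy choice is always action 0 (left). In this environment that means early episodes reach the goal only through exploratory steps, which can make them long. A possible variant, sketched here as an assumption rather than as part of the original notebook, breaks ties at random. The name `epsilon_greedy_random_ties` is hypothetical, and the function takes the Q-values as an argument instead of reading the global `Q`.

```python
import random


def epsilon_greedy_random_ties(q_values, epsilon, rng=random):
    """Epsilon-greedy action selection that breaks greedy ties uniformly at random.

    q_values -- sequence of Q-values for one state, indexed by action
    epsilon  -- exploration probability
    rng      -- the random module or a random.Random instance
    """
    n_actions = len(q_values)
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # Explore
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)      # Exploit, choosing randomly among ties


rng = random.Random(0)
# With all-zero Q-values and no exploration, both actions still get selected.
picks = {epsilon_greedy_random_ties([0.0, 0.0], 0.0, rng) for _ in range(200)}
print(sorted(picks))  # [0, 1]
```

Once the Q-values differ, this behaves like the plain `np.argmax` version.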

In [5]:
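def sarsa():
    """Run tabular SARSA on the chain and return the greedy policy derived from Q."""
    for _ in range(n_episodes):
        # Random non-terminal start: randint's upper bound is exclusive (0..n_states-2)
        state = np.random.randint(0, n_states - 1)
        action = epsilon_greedy_policy(state)  # Choose the first action

        while state != n_states - 1:  # Run until the terminal state is reached
            # Deterministic transition: 1 moves right, 0 moves left (clamped at state 0)
            next_state = state + 1 if action == 1 else max(0, state - 1)
            reward = 1.0 if next_state == n_states - 1 else 0.0  # Reward only at the goal

            # On-policy: choose the next action with the same epsilon-greedy policy
            next_action = epsilon_greedy_policy(next_state)

            # SARSA update; Q of the terminal state is never updated, so it stays 0
            Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action] - Q[state][action])

            state = next_state
            action = next_action

    # Greedy policy with respect to the learned Q-values. The entry for the
    # terminal state is arbitrary because its Q-values are all zero.
    optimal_policy = {state: int(np.argmax(Q[state])) for state in range(n_states)}
    return optimal_policy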

In [6]:
optimal_policy = sarsa()
print("Optimal Policy:", optimal_policy)

Optimal Policy: {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 0}
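
To check what this policy does, one option is to follow it greedily from a start state. The sketch below hard-codes the policy printed above and re-implements the same left/right transition rule. `greedy_rollout` and its `max_steps` guard are hypothetical helpers added for illustration.

```python
def greedy_rollout(policy, start_state, n_states=6, max_steps=20):
    """Follow `policy` from `start_state` and return the list of visited states."""
    path = [start_state]
    state = start_state
    # Stop at the terminal state, or after max_steps in case the policy loops.
    while state != n_states - 1 and len(path) <= max_steps:
        action = policy[state]
        state = state + 1 if action == 1 else max(0, state - 1)
        path.append(state)
    return path


# Policy copied from the output above (the entry for terminal state 5 is unused).
policy = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 0}
print(greedy_rollout(policy, 0))  # [0, 1, 2, 3, 4, 5]
```

Every non-terminal state maps to action 1 (right), so the agent walks straight to the goal. The action listed for state 5 has no meaning, since the episode ends there.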


## Conclusion

In this section, we explored the SARSA (State-Action-Reward-State-Action) algorithm, a key method in reinforcement learning for on-policy temporal difference control. We implemented SARSA in a simple six-state chain environment, where the agent moves left or right and is rewarded only on reaching the final state. The epsilon-greedy policy let the agent balance exploring new actions against exploiting known rewards. The greedy policy derived from the learned Q-table moves right in every non-terminal state, which is the optimal strategy in this environment. This foundation prepares us to tackle more complex reinforcement learning problems.
