# **Introduction**

This notebook implements the SARSA(λ) algorithm, an on-policy temporal-difference method that averages over all $n$-step returns. Every $n$-step return for $n = 1, 2, \dots, \infty$ is considered: the $n$-step return receives the weight $(1-\lambda)\lambda^{n-1}$, so the weights decay geometrically by a factor of $\lambda$ with each additional step, and the normalization by $(1-\lambda)$ makes them sum to 1.
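
To make the weighting concrete, here is a minimal sketch in plain Python that computes the weights $(1-\lambda)\lambda^{n-1}$ for the first few $n$-step returns. The function name `lambda_return_weights` and the truncation length `n_max` are assumptions made for illustration and are not part of the notebook. Because the sum is truncated, the weights add up to $1-\lambda^{n_{max}}$ rather than exactly 1; in an episodic task the remaining weight is usually assigned to the full return.

```python
def lambda_return_weights(lam: float, n_max: int) -> list[float]:
    """Return the weights (1 - lam) * lam**(n - 1) for n = 1..n_max."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1)")
    return [(1.0 - lam) * lam ** (n - 1) for n in range(1, n_max + 1)]


# the first three weights for lam = 0.5 halve at every step
weights = lambda_return_weights(lam=0.5, n_max=3)

# the truncated sum equals 1 - lam**n_max and approaches 1 as n_max grows
total = sum(weights)
```

With $\lambda = 0$ all of the weight falls on the one-step return, which recovers ordinary one-step SARSA.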

Similar to other implementations, this one uses the Frozen Lake environment from Gymnasium, an open-source Python library that provides a standardized API for developing and comparing reinforcement learning algorithms.

# **Import Packages**

This section imports the necessary packages.

In [1]:
# Import these packages:
import gymnasium as gym 
import numpy as np
import tqdm
import matplotlib.pyplot as plt

# **Environment Setup**

This section sets up the environment and defines the relevant functions needed for this implementation.

In [None]:
# SARSA(λ)-Agent Class:
class SARSA_L_Agent:
    ####################### INITIALIZATION #######################
    # constructor:
    def __init__(self, env: gym.Env, gamma: float, alpha: float, lamb: float, beta: float, es: bool, rs: bool):
        """
        Constructor for the agent. This is a TD-based agent implementing SARSA(λ), meaning that the policy
        is evaluated and improved at every time-step by weighting all n-step returns.

        env:    a Gymnasium environment
        gamma:  a float, the discount factor
        alpha:  a float, the learning rate
        lamb:   a float, the trace decay rate λ
        beta:   a float, the decay rate of ε
        es:     a boolean indicating whether to use exploring starts
        rs:     a boolean indicating whether to use reward shaping
                    if True:
                        goal_value: +10.0
                        hole_value: -1.0
                    else:
                        goal_value: +1.0
                        hole_value: 0.0 (sparse reward: only the goal is rewarded)

        Attributes created by the constructor:
        Q:      the estimate of the action-value function q, initialized to zeros over all states and actions
        E:      the eligibility trace, initialized to zeros over all states and actions
        """

        # object parameters:
        self.env = env
        self.gamma = gamma
        self.alpha = alpha
        self.lamb = lamb
        self.beta = beta
        self.es = es
        self.rs = rs

        # set the reward shaping:
        if self.rs:
            self.goal_value = 10.0
            self.hole_value = -1.0
        else:
            self.goal_value = 1.0
            self.hole_value = 0.0

        # get the number of states, number of actions:
        nS, nA = env.observation_space.n, env.action_space.n

        # get the terminal states (holes and goal) of the current map:
        desc = env.unwrapped.desc.astype("U1")
        chars = desc.flatten()
        self.terminal_states = [i for i, c in enumerate(chars) if c in ("H", "G")]

        # tabular Q values:
        self.Q = np.zeros((nS, nA))

        # eligibility trace:
        self.E = np.zeros((nS, nA))

        # print the action and observation spaces for the user:
        print(f"Action Space is: {env.action_space}")
        print(f"Observation Space is: {env.observation_space}\n")


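
The constructor above allocates `Q` and `E` with the same shape. As a rough illustration of how two such tables typically interact in tabular SARSA(λ), the sketch below performs a single update with accumulating traces. It is an assumption about the general technique, not the notebook's own update method: the function name `sarsa_lambda_step` and its signature are hypothetical, and plain nested lists stand in for the NumPy arrays so that the example is self-contained.

```python
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, terminated,
                      alpha, gamma, lamb):
    """One tabular SARSA(lambda) update with accumulating traces.

    Q and E are nested lists indexed as [state][action]; both are
    modified in place. Returns the TD error of this step.
    """
    # bootstrap from the next state-action pair unless the episode ended
    target = r if terminated else r + gamma * Q[s_next][a_next]
    delta = target - Q[s][a]

    # accumulating trace (assumed variant): bump the pair just visited
    E[s][a] += 1.0

    # every pair is updated in proportion to its trace, then the trace decays
    for i in range(len(Q)):
        for j in range(len(Q[i])):
            Q[i][j] += alpha * delta * E[i][j]
            E[i][j] *= gamma * lamb
    return delta


# tiny example: 3 states, 2 actions, a two-step episode
n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
E = [[0.0] * n_actions for _ in range(n_states)]

# step 1: take action 1 in state 0, no reward, continue with (state 1, action 0)
sarsa_lambda_step(Q, E, 0, 1, 0.0, 1, 0, False, alpha=0.5, gamma=0.9, lamb=0.8)
# step 2: take action 0 in state 1, reward 1, episode terminates
sarsa_lambda_step(Q, E, 1, 0, 1.0, 2, 0, True, alpha=0.5, gamma=0.9, lamb=0.8)
```

After the second step, the reward reaches not only the last pair `(1, 0)` but also the earlier pair `(0, 1)`, scaled by its decayed trace `gamma * lamb`. This backward credit assignment is what distinguishes SARSA(λ) from one-step SARSA. In a full agent the trace table would normally be reset to zeros at the start of each episode.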