# Windy Gridworld with Sarsa and extended action space

Let's look at the windy gridworld problem, with the starting point **S** and the destination **G**. As shown in the diagram, a crosswind, represented by the arrows, pushes the agent upward by a number of cells that depends on the column.

<img src="images/WindyGridworld.png" alt="Drawing" style="width: 600px;"/>

The strength of the wind is indicated below each column. Initially, the actions are the standard moves ``up, right, down, left``.
For example, if the agent is one cell to the right of the goal and moves ``left``, it lands one cell above the goal. Furthermore, we assume an undiscounted task with a constant reward of $-1$ at each time step until the goal is reached.
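The example above can be reproduced with a minimal sketch of the transition rule. The grid size, wind strengths, and goal position are assumptions taken from the standard textbook version of the problem, and `step` is a hypothetical helper; the actual implementation in ``windy.py`` may differ.

```python
# Assumed layout (standard textbook version); windy.py may use other values.
N_ROWS, N_COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward push per column
GOAL = (3, 7)                          # (row, col), row 0 is the top row

# Standard moves as (d_row, d_col); "up" decreases the row index.
MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}


def step(state, action):
    """Apply a move plus the wind of the column the agent starts in."""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row = row + d_row - WIND[col]
    new_col = col + d_col
    # Keep the agent inside the grid.
    new_row = min(max(new_row, 0), N_ROWS - 1)
    new_col = min(max(new_col, 0), N_COLS - 1)
    next_state = (new_row, new_col)
    reward = -1  # constant reward per time step
    done = next_state == GOAL
    return next_state, reward, done


# One cell to the right of the goal, moving left: lands one cell above the goal.
print(step((3, 8), "left"))
```

In this sketch the wind of the departure column is applied, which is what makes the agent end up one cell above the goal rather than on it.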

The setup of the exercise consists of this Jupyter notebook and the two Python modules ``windy.py`` and ``sarsa.py``. The former contains the implementation of the environment, whereas the latter implements all relevant methods for the agent. Make sure that all files are in the same directory so that the imports work.
The module ``sarsa.py`` provides two methods: ``run_episode()``, which applies the policy to the environment and visualizes the results, and ``sarsa()``, which implements the SARSA algorithm from the lecture.

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

# Import the supplied Python modules; they must be in the same directory as the notebook!
import windy
from sarsa import sarsa, run_episode

## Task 1
Take a closer look at the **framework of the `sarsa` function** and implement the **SARSA algorithm** from the lecture using the given input parameters.
The **output values** should include a table of $Q(s,a)$ values, a **stochastic policy**, and a `history` array containing the number of time steps per episode.
The parameter $\varepsilon$ should decay with the number of episodes $e$ according to
$$
\varepsilon = \frac{1}{e}
$$
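As a rough orientation, the building blocks of tabular SARSA can be sketched as follows. This is not the framework from ``sarsa.py``: the function names, the integer state indexing, and the array shapes are assumptions for illustration, and how `eps0` enters the schedule is left to the actual implementation.

```python
import numpy as np


def epsilon_for_episode(e):
    """Decay schedule from the task: epsilon = 1 / e for episode e = 1, 2, ..."""
    return 1.0 / e


def epsilon_greedy_probs(q_row, eps):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_row)
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(q_row)] += 1.0 - eps
    return probs


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma=1.0, done=False):
    """One on-policy TD update of Q(s, a), performed in place."""
    # gamma = 1.0 corresponds to the undiscounted task.
    target = r if done else r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (target - q[s, a])


# Tiny demonstration on a hypothetical table with 3 states and 4 actions.
q = np.zeros((3, 4))
sarsa_update(q, s=0, a=1, r=-1, s_next=1, a_next=2, alpha=0.5)
print(q[0, 1])  # -0.5
probs = epsilon_greedy_probs(q[0], epsilon_for_episode(2))
print(probs)
```

The update uses the action `a_next` that the policy actually selects in the next state, which is what makes the method on-policy.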

In [None]:
# Initialize the environment
env = gym.make('WindyGridworld-v0', disable_env_checker=True)
q, policy, history = sarsa(env, 500, eps0=0.5, alpha=0.5)

## Task 2

Plot the episodes over the time steps. For correct display, the required time steps per episode must be stored in the `history` output array.

In [None]:
plt.figure()
plt.xlabel("Time steps"); plt.xlim(0, 8_000)
plt.ylabel("Episodes"); plt.ylim(0, 170)
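# Prepend 0 via concatenate so this works whether history is a list or a NumPy array
timesteps = np.cumsum(np.concatenate(([0], history)))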
plt.plot(timesteps, np.arange(len(timesteps)), color='red')
plt.show()

Plot the value function and the policy, indicated as arrows, with the ``plot_results()`` method, then run a single episode with ``run_episode()``.

In [None]:
import matplotlib
from sarsa import plot_results

matplotlib.rcParams['figure.figsize'] = [10, 10]

plot_results(env, q, policy)

In [None]:
rewards = run_episode(env, policy, render=True)
print(f"Episode length = {len(rewards)}")

## Task 3

Now consider four additional possible actions, the so-called King's moves, as shown in the illustration. These are the diagonal movement options.
Adapt the environment model in the file ``windy.py`` so that it defines the additional actions and enables them depending on the Boolean parameter passed to the environment. What has changed?
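One possible way to organize the extended action set is a lookup table that is enlarged when the flag is set. The action indices, the `(d_row, d_col)` encoding, and the helper `build_moves` are hypothetical and only illustrate the idea; ``windy.py`` may structure this differently.

```python
# Standard moves as (d_row, d_col); row 0 is assumed to be the top row.
STANDARD_MOVES = {
    0: (-1, 0),  # up
    1: (0, 1),   # right
    2: (1, 0),   # down
    3: (0, -1),  # left
}

# Diagonal King's moves.
KING_MOVES = {
    4: (-1, 1),   # up-right
    5: (1, 1),    # down-right
    6: (1, -1),   # down-left
    7: (-1, -1),  # up-left
}


def build_moves(king=False):
    """Return the action table, optionally extended by the diagonal moves."""
    moves = dict(STANDARD_MOVES)
    if king:
        moves.update(KING_MOVES)
    return moves


print(len(build_moves()))           # 4
print(len(build_moves(king=True)))  # 8
```

The size of the resulting table would then determine the number of discrete actions the environment exposes, and with it the number of columns in the $Q(s,a)$ table.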

In [None]:
env = gym.make('WindyGridworld-v0', king=True, disable_env_checker=True)
q, policy, _ = sarsa(env, 500, eps0=0.5, alpha=0.5)

In [None]:
plot_results(env, q, policy)

In [None]:
rewards = run_episode(env, policy, render=True)
print(f"Episode length = {len(rewards)}")

## Task 4

Now consider an additional stop action. It lets the wind blow the agent in the wind direction without any additional movement.
Adapt the environment model in the file ``windy.py`` so that it defines the additional action and enables it depending on the Boolean parameter passed to the environment. What has changed?
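The effect of the stop action can be illustrated with a small sketch in which the agent's own displacement is zero and only the wind acts. The wind strengths, the grid height, and the helper `apply_move` are assumptions based on the standard textbook layout, not the contents of ``windy.py``.

```python
# Assumed layout (standard textbook version); windy.py may use other values.
N_ROWS = 7
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward push per column
STOP = (0, 0)                          # no displacement by the agent itself


def apply_move(state, move):
    """Apply a (d_row, d_col) move plus the wind of the current column."""
    row, col = state
    new_row = min(max(row + move[0] - WIND[col], 0), N_ROWS - 1)
    new_col = min(max(col + move[1], 0), len(WIND) - 1)
    return (new_row, new_col)


# Stopping in a column with wind strength 2 moves the agent two rows up.
print(apply_move((3, 6), STOP))  # (1, 6)
# Stopping in a calm column leaves the agent where it is.
print(apply_move((3, 0), STOP))  # (3, 0)
```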

In [None]:
env = gym.make('WindyGridworld-v0', king=True, stop=True, disable_env_checker=True)
q, policy, _ = sarsa(env, 500, eps0=0.5, alpha=0.5)

In [None]:
plot_results(env, q, policy)

In [None]:
rewards = run_episode(env, policy, render=True)
print(f"Episode length = {len(rewards)}")