Temporal-Difference (TD) methods are model-free learning methods that learn directly from episodes of experience.
They combine ideas from Monte Carlo methods and dynamic programming (DP).

Temporal Difference Learning Algorithm Outline:

1. Initialize the value function V with some initial values.
2. For each time step in an episode:
   a. Use the policy derived from the current value function to select an action.
   b. Observe the reward and the next state.
   c. Update the value of the current state based on the observed reward and the value of the next state.

Notice that the update uses the observed reward and the current estimate of the next state's value to compute a "target", and then moves the value of the current state towards this target. Because the target itself relies on an estimate, this is called bootstrapping. This is different from Monte Carlo methods, which wait until the end of an episode to compute a target based on the total observed return, and from DP methods, which use the expected next value based on a model of the environment.
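The difference between the two kinds of target can be made concrete with a minimal sketch. The function names td0_update and mc_update, the three-state value table, and the numbers below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V[state] toward reward + gamma * V[next_state]."""
    # A terminal next state contributes no future value.
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])
    return target

def mc_update(V, state, rewards, alpha=0.1, gamma=0.99):
    """Monte Carlo step: needs every reward from `state` to the end of the episode."""
    target = sum(gamma ** k * r for k, r in enumerate(rewards))
    V[state] += alpha * (target - V[state])
    return target

V = np.zeros(3)   # hypothetical 3-state problem
V[1] = 2.0        # pretend state 1 already has an estimate
# TD can update state 0 after a single transition (0 -> 1 with reward 1.0).
td_target = td0_update(V, state=0, reward=1.0, next_state=1, done=False)
print(td_target, V[0])
```

The TD update only needs one transition, while mc_update cannot be called until the full list of rewards for the rest of the episode is known.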

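Sarsa is an on-policy Temporal-Difference control method: it learns the Q-function of the same epsilon-greedy policy that it uses to select actions. The name comes from the quintuple (state, action, reward, next state, next action) used in each update.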

Sarsa Algorithm Outline:

1. Initialize the Q-function with some initial values.
2. For each time step in an episode:
   a. Use an epsilon-greedy policy derived from the current Q-function to select an action.
   b. Observe the reward and the next state.
   c. Select the next action using the same epsilon-greedy policy.
   d. Update the Q-value of the current state-action pair based on the observed reward and the Q-value of the next state-action pair.
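The epsilon-greedy selection used in the outline above can be sketched as follows. This is one possible implementation under stated assumptions: the Q-function is a NumPy table of shape (n_states, n_actions), and the helper name make_epsilon_greedy is hypothetical.

```python
import numpy as np

def make_epsilon_greedy(action_values, rng=None):
    """Return policy(state, epsilon) over a table of shape (n_states, n_actions)."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions = action_values.shape[1]

    def policy(state, epsilon=0.0):
        # Explore: with probability epsilon pick a uniformly random action.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        # Exploit: otherwise pick the action with the highest current estimate.
        return int(np.argmax(action_values[state]))

    return policy

action_values = np.zeros((4, 2))   # hypothetical: 4 states, 2 actions
action_values[0] = [0.1, 0.7]
policy = make_epsilon_greedy(action_values, rng=np.random.default_rng(0))
greedy_action = policy(0, epsilon=0.0)   # epsilon = 0 always exploits
```

Because the returned policy reads the table on every call, it automatically follows the Q-function as the learning updates change it. Note that np.argmax breaks ties in favor of the lowest action index.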

In [None]:
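import gym
import numpy as np

# env, action_values, and policy are assumed to be defined earlier in the notebook:
# env is a Gym environment using the classic API (reset() returns the state,
# step() returns a 4-tuple), action_values is indexable as [state][action],
# and policy(state, epsilon) returns an action.

def sarsa(action_values, policy, episodes, alpha=0.1, gamma=0.99, epsilon=0.2):

    for episode in range(1, episodes + 1):
        state = env.reset()
        action = policy(state, epsilon)
        done = False

        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = policy(next_state, epsilon)

            # Sarsa update: move Q(s, a) toward reward + gamma * Q(s', a').
            qsa = action_values[state][action]
            next_qsa = action_values[next_state][next_action]
            action_values[state][action] = qsa + alpha * (reward + gamma * next_qsa - qsa)
            state = next_state
            action = next_action

# The number of episodes is an illustrative choice.
sarsa(action_values, policy, episodes=1000)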

In this code, episodes is the number of episodes to train over, alpha is the step size parameter, gamma is the discount factor, and epsilon is the exploration rate of the epsilon-greedy policy. The policy function is used to select actions, and the Q-function, stored in action_values, is updated according to the Sarsa update rule.
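To try the update rule end to end without installing an environment, the following is a minimal sketch on a hand-written task. The Corridor class is a hypothetical stand-in that mimics the classic Gym interface (reset() returns the state, step() returns a 4-tuple), and, unlike the cell above, this version of sarsa receives the environment as an explicit argument.

```python
import numpy as np

class Corridor:
    """Hypothetical 5-state corridor: start at 0, goal at 4, reward -1 per step."""
    n_states, n_actions = 5, 2   # actions: 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        # Walls at both ends: the state is clipped to the valid range.
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        return self.state, -1.0, done, {}

def sarsa(env, action_values, policy, episodes, alpha=0.1, gamma=0.99, epsilon=0.2):
    for _ in range(episodes):
        state = env.reset()
        action = policy(state, epsilon)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = policy(next_state, epsilon)
            # Sarsa update: move Q(s, a) toward reward + gamma * Q(s', a').
            qsa = action_values[state][action]
            next_qsa = action_values[next_state][next_action]
            action_values[state][action] = qsa + alpha * (reward + gamma * next_qsa - qsa)
            state, action = next_state, next_action

rng = np.random.default_rng(0)
env = Corridor()
action_values = np.zeros((env.n_states, env.n_actions))

def policy(state, epsilon=0.0):
    # Epsilon-greedy over the shared action_values table.
    if rng.random() < epsilon:
        return int(rng.integers(env.n_actions))
    return int(np.argmax(action_values[state]))

sarsa(env, action_values, policy, episodes=500)
print(action_values)
print([policy(s) for s in range(env.n_states - 1)])   # greedy action per non-terminal state
```

Since every step costs -1, one would expect the learned values to favor moving right, toward the goal. The row for the terminal state stays at zero because it is never updated, which is what makes the final update of each episode use a target equal to the reward alone.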

After the function has run, the learned Q-function can be used to determine the best action to take in any state by taking the action with the highest Q-value. For example, np.argmax(action_values[state]) will give the best action to