# Temporal difference (TD) learning

Temporal difference learning is a model-free reinforcement learning algorithm. 

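A `model` in RL describes the dynamics of the environment: the probabilities of transitioning from one state to another given an action, together with the rewards received along the way.

An agent in the TD learning framework does not have a model of the environment. Instead, it learns value estimates (and, from them, a policy) directly from the experience it gathers by interacting with the environment.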

# Definition recap 

Before diving deeper into TD learning, we need to define some additional terms.

`Control` - the process of learning the optimal policy (finding the policy matrix $P$).

`Prediction` - the process of learning the value functions $v_{\pi}(s)$ and $q_{\pi}(s, a)$ for a fixed policy $\pi$, for all $s$ and $a$ (finding the value matrix $V$).

`Bootstrapping` - the process of updating the value estimate of a state using the current value estimate of its successor state, instead of waiting for the final outcome of the episode.

`Episode` - one complete sequence of interactions with the environment, from the initial state until a terminal state is reached.

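`Step` - a single interaction with the environment: the agent takes one action and observes one reward and one next state.

`State` - a description of the situation the agent is currently in.

`Action` - the move the agent chooses in a given state.

`Reward` - the numeric feedback the agent receives from the environment after taking an action.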

# TD(0) algorithm 

TD(0) is the simplest TD method: the special case of TD($\lambda$) with $\lambda = 0$, which looks only one step ahead. The idea is that the agent chooses an action based on policy $\pi$, observes the reward and the next state, and then updates the value function $v_{\pi}(s)$ for the current state $s$.

The full algorithm is as follows:

1. Initialize the value function $v_{\pi}(s)$ for all $s$ arbitrarily, except for terminal states, where $v_{\pi}(s_{terminal}) = 0$. Choose the learning rate $\alpha$, the discount factor $\gamma$, and the number of episodes $N$.

2. For each episode $n = 1, 2, \dots, N$:

    1. Initialize the state $s$.
    
    2. While the state $s$ is not terminal:
    
        1. Choose an action $a$ from the state $s$ using policy $\pi$.
        
        2. Observe the reward $r$ and the next state $s'$.
        
        3. Update the value function $v_{\pi}(s)$:
        
            $$v_{\pi}(s) \leftarrow v_{\pi}(s) + \alpha \left[r + \gamma v_{\pi}(s') - v_{\pi}(s)\right]$$
            
        4. Set $s \leftarrow s'$.
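To make the update in step 3 concrete, here is a single update worked by hand. The numbers are assumed purely for illustration: $\alpha = 0.1$, $\gamma = 0.9$, current estimates $v_{\pi}(s) = 0$ and $v_{\pi}(s') = 2$, and an observed reward $r = -1$.

$$v_{\pi}(s) \leftarrow 0 + 0.1 \left[-1 + 0.9 \cdot 2 - 0\right] = 0.1 \cdot 0.8 = 0.08$$

The bracketed term, $r + \gamma v_{\pi}(s') - v_{\pi}(s) = 0.8$, is the TD error: the gap between the bootstrapped target $r + \gamma v_{\pi}(s')$ and the current estimate. The learning rate $\alpha$ controls how much of that gap is closed in one step.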

The TD algorithm above works particularly well in episodic tasks, where at the end of each episode the agent returns to the initial state and starts a new episode from there.
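The loop described above, including the restart from the initial state at every episode, can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used later in this notebook: the function `td0_prediction`, the `step_fn` and `policy_fn` interfaces, and the four-state corridor are all assumptions made up for the example.

```python
import numpy as np


def td0_prediction(step_fn, policy_fn, n_states, start_state, terminal_states,
                   n_episodes=500, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction for a fixed policy.

    step_fn(state, action) -> (reward, next_state)  # hypothetical environment interface
    policy_fn(state) -> action                      # the fixed policy pi
    """
    # Arbitrary initialization; terminal states are never updated and stay at 0
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s = start_state  # every episode restarts from the initial state
        while s not in terminal_states:
            a = policy_fn(s)
            r, s_next = step_fn(s, a)
            # TD(0) update: move V[s] toward the bootstrapped target r + gamma * V[s_next]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V


# Toy environment (assumed for illustration): a corridor 0 -> 1 -> 2 -> 3,
# state 3 is terminal, every step is rewarded with -1
def corridor_step(state, action):
    return -1.0, state + action


# Toy policy (assumed for illustration): always move one cell to the right
def always_right(state):
    return 1


V = td0_prediction(corridor_step, always_right, n_states=4, start_state=0,
                   terminal_states={3})
print(np.round(V, 2))
```

With $\gamma = 1$ the true values of this corridor are the negative numbers of steps left to the terminal state, so the printed estimates should approach $-3, -2, -1$ and $0$.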

# Initializing the frozen lake environment

We will create an environment where our agent needs to go from the initial state to the goal without falling into a randomly generated set of holes.

Each step is rewarded with a reward of -1 and the goal state has a reward of 10. If an agent falls into the hole, it receives a reward of -10 and the episode is terminated. If the agent reaches the goal, the episode is also terminated.

In [9]:
# Importing python packages 
import numpy as np

def init_env(
        n_rows: int, 
        n_cols: int,
        step_reward: float = -1, 
        goal_reward: float = 10,
        hole_reward: float = -10,
        n_holes: int = 1,
        ) -> tuple[np.ndarray, np.ndarray, np.ndarray]: 
    """
    Function that returns the initial environment as a tuple (S, V, R): 
        S - the state matrix indexed by [row, col]
        V - the initial value matrix indexed by [row, col]
        R - the reward matrix indexed by [row, col]
    """
    # Initiating the S matrix 
    S = np.arange(0, n_rows * n_cols).reshape(n_rows, n_cols)

    # Creating the initial V matrix
    V = np.zeros((n_rows, n_cols))

    # The start state will always be the top left corner 
    # The goal state will always be the bottom right corner
    # We will generate random holes that our agent can fall into
    # Any other state that is not a hole or the goal state will receive the step reward 
    R = np.zeros((n_rows, n_cols))
    R.fill(step_reward)
    R[0, 0] = step_reward
    R[-1, -1] = goal_reward

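    hole_coords = []
    for _ in range(n_holes):
        # Holes are drawn from interior cells only (np.random.randint excludes
        # the upper bound), so the start and goal corners are never holes;
        # this requires n_rows >= 3 and n_cols >= 3
        hole_row = np.random.randint(1, n_rows - 1)
        hole_col = np.random.randint(1, n_cols - 1)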
        R[hole_row, hole_col] = hole_reward

        # Appending to the hole coordinates list
        hole_coords.append((hole_row, hole_col))

    return S, V, R

In [10]:
init_env(3, 5)

(array([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]]),
 array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]),
 array([[ -1.,  -1.,  -1.,  -1.,  -1.],
        [ -1.,  -1., -10.,  -1.,  -1.],
        [ -1.,  -1.,  -1.,  -1.,  10.]]))
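The returned matrices can be read together: `S` gives the index of every cell and `R` marks which cells end an episode. The sketch below shows one possible way to do that. The helpers `terminal_states` and `state_to_coords` are hypothetical names introduced for illustration, and the matrices are hard-coded from the example output above so that the snippet does not depend on the random hole placement.

```python
import numpy as np

# Matrices copied from the example output above (3 x 5 grid, one hole)
S = np.arange(15).reshape(3, 5)
R = np.full((3, 5), -1.0)
R[1, 2] = -10.0   # the hole
R[-1, -1] = 10.0  # the goal


def terminal_states(S, R, goal_reward=10, hole_reward=-10):
    """Hypothetical helper: indices of the states where an episode ends."""
    mask = (R == goal_reward) | (R == hole_reward)
    return set(S[mask].tolist())


def state_to_coords(state, S):
    """Hypothetical helper: convert a state index to its (row, col) position."""
    row, col = np.unravel_index(state, S.shape)
    return int(row), int(col)


print(terminal_states(S, R))   # the hole (state 7) and the goal (state 14)
print(state_to_coords(7, S))   # position of the hole in the grid
```

Identifying terminal cells by comparing against the reward values only works when the step reward differs from both the goal and the hole reward, which holds for the defaults used here.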