# Assignment 2: Model-Free Prediction and Control

## Model-Free Prediction

### Sources
- https://towardsdatascience.com/reinforcement-learning-rl-101-with-python-e1aa0d37d43b

In [1]:
from typing import Tuple, List
from collections import defaultdict
from utils import Maze, show_utility, value_iteration

import random as rd
import numpy as np

First, the environment, the terminal states, and the values from value iteration are initialized.

In [2]:
start_state = (3, 2)
terminal_states = [(0, 3), (3, 0)]

rewards = np.array([[-1, -1, -1, 40],
                    [-1, -1, -10, -10],
                    [-1, -1, -1, -1],
                    [10, -2, -1, -1]])

# initialize the Maze
maze = Maze(rewards, terminal_states, start_state)

# use the value function to get the utilities
values = value_iteration(maze, discount=0.9, p_action=1.0)
values

array([[30.5   , 35.    , 40.    ,  0.    ],
       [26.45  , 30.5   , 35.    , 40.    ],
       [22.805 , 26.45  , 22.805 , 26.    ],
       [ 0.    , 22.805 , 19.5245, 22.4   ]])
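
The top row of this table can be checked by hand. The sketch below assumes deterministic moves (`p_action=1.0`, as in the call above) and that the reward of a cell is collected on entering it, which is how the episode functions in this notebook read `env.R`. The variable names are illustrative only.

```python
# Hand-check of the top row of the value table.
# Assumptions: moves are deterministic and the reward is collected on entering a cell.
discount = 0.9

reward_goal = 40   # reward of the terminal cell (0, 3)
reward_step = -1   # reward of the ordinary cells in the top row

v_terminal = 0.0                               # terminal states keep value 0
v_0_2 = reward_goal + discount * v_terminal    # step right from (0, 2) into the goal
v_0_1 = reward_step + discount * v_0_2         # step right from (0, 1) into (0, 2)
v_0_0 = reward_step + discount * v_0_1         # step right from (0, 0) into (0, 1)

print(v_0_2, v_0_1, v_0_0)
```

The three results correspond to the 40, 35 and 30.5 in the first row of the array.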

### Generating episodes
Two new functions are introduced for generating an episode. The first generates an episode under a random policy; the second uses the values from the value iteration above to follow the optimal (greedy) policy.

Both functions take the same parameters, although not every function uses all of them. This keeps the interface consistent.

In [3]:
def generate_episode_random(env: Maze, values: np.ndarray, discount: float, p_action: float) -> List[Tuple[Tuple[int, int], Tuple[int, int], int]]:
    """Generates an episode based on the random policy.
    
    The p_action parameter is not used here: the actions are already random, so a failed action makes no difference.
    The values and discount parameters are not used either.
    They are kept in the signature for API consistency with generate_episode_optimal.
    
    args:
        env (Maze): The environment which holds the rewards of all the possible states.
        values (np.ndarray): The utility values of the states.
        discount (float): The discount factor.
        p_action (float): The probability of a successful action.

    returns:
        List[Tuple[Tuple[int, int], Tuple[int, int], int]]: Returns a list with all the state-action pairs with the corresponding rewards.
    """
    steps = []  # holds Tuples with the states, actions and rewards
    pos = env.get_random_position()
    
    # break if the chosen state is a terminal state
    while pos not in env.end_states:

        next_actions = env.get_next_action_positions(pos)
        # choose a random action and get the reward for the action.
        action = rd.choice(next_actions)
        
        reward = env.R[action]
        steps.append((pos, action, reward))
        # update the pos to the taken action
        pos = action
        
    # append the terminal state with an empty action and zero reward
    steps.append((pos, (), 0))

    return steps

In [4]:
def generate_episode_optimal(env: Maze, values: np.ndarray, discount: float, p_action: float) -> List[Tuple[Tuple[int, int], Tuple[int, int], int]]:
    """Generates an episode based on the optimal policy.

    args:
        env (Maze): The environment which holds the rewards of all the possible states.
        values (np.ndarray): The utility values of the states.
        discount (float): The discount factor.
        p_action (float): The probability of a successful action.

    returns:
        List[Tuple[Tuple[int, int], Tuple[int, int], int]]: Returns a list with all the state-action pairs with the corresponding rewards.
    """
    steps = []  # holds Tuples with the states, actions and rewards
    pos = env.get_random_position()
    
    # break if the chosen state is a terminal state
    while pos not in env.end_states:
    
        # get the next action based on the optimal policy
        next_actions = env.get_next_action_positions(pos)
        action_values = []
        
        for action in next_actions:
            action_values.append(env.R[action] + (discount * values[action]))
        
        # collect every action whose value equals the maximum (ties are kept)
        max_elem = max(action_values)
        policy_actions = [act for val, act in zip(action_values, next_actions) if val == max_elem]
        
        # choose the desired action; with probability 1 - p_action the action fails
        action = rd.choice(policy_actions)
        if p_action < rd.random():
            # the desired action cannot be taken, so choose one of the others
            chosen_index = next_actions.index(action)
            # remove the chosen action and pick a random one from the remainder
            action = rd.choice(next_actions[:chosen_index] + next_actions[chosen_index + 1:])
        
        reward = env.R[action]
        steps.append((pos, action, reward))
        # update the pos to the taken action
        pos = action
        
    # append the terminal state with an empty action and zero reward
    steps.append((pos, (), 0))

    return steps
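
The action selection used above (greedy choice with random tie-breaking, plus a chance that the action fails) can be isolated into a small standalone sketch. The function name `greedy_with_slip` and the example numbers are hypothetical and not part of `utils`. Unlike the cell above, this sketch also guards against the case where there is no alternative action to slip to.

```python
import random


def greedy_with_slip(candidates, action_values, p_action, rng=random):
    """Pick a best-valued candidate; with probability 1 - p_action slip to another one."""
    best = max(action_values)
    # keep every candidate that ties for the best value
    best_candidates = [c for c, v in zip(candidates, action_values) if v == best]
    intended = rng.choice(best_candidates)

    # the intended action fails with probability 1 - p_action
    others = [c for c in candidates if c != intended]
    if others and rng.random() > p_action:
        return rng.choice(others)
    return intended


# Hypothetical neighbours of some cell and their one-step lookahead values.
neighbours = [(0, 1), (1, 0), (1, 2)]
lookahead = [35.0, 22.8, 35.0]

# With p_action=1.0 the result is always one of the two tied best cells.
print(greedy_with_slip(neighbours, lookahead, p_action=1.0))
```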

### Monte-Carlo Policy Evaluation

In [5]:
def monte_carlo_policy_evaluation(env: Maze, values: np.ndarray, policy: callable, discount: float = 0.9,
                              n_episodes: int = 10000, p_action: float = 0.7) -> np.ndarray:
    """A function that uses Monte Carlo policy evaluation to estimate a value function.

    args:
        env (Maze): The environment which holds the rewards of all the possible states.
        values (np.ndarray): The utility values of the states.
        policy (callable): The policy that is used to generate an episode.
        discount (float, optional): The discount factor. Defaults to 0.9.
        n_episodes (int, optional): The amount of episodes to run. Defaults to 10000.
        p_action (float, optional): The probability of a successful action. Defaults to 0.7.

    returns:
        np.ndarray: Returns the calculated values per state.
    """
    state_values = np.zeros(env.R.shape)
    state_returns = defaultdict(list)

    for _ in range(n_episodes):
        # generate a new episode with a certain policy
        episode = policy(env, values, discount, p_action)

        G = 0
        visited_states = []
        # walk the episode backwards, accumulating the discounted return G
        for pos, action, reward in episode[::-1]:
            G = discount * G + reward
            
            if pos not in visited_states:
                # record the return once per state per episode; because the episode is
                # traversed backwards, this is the return following the last visit
                state_returns[pos].append(G)
                # calculate the average value
                state_values[pos] = np.mean(state_returns[pos])
                # update visited states
                visited_states.append(pos)
    
    return state_values
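
The backward accumulation `G = discount * G + reward` used in the loop above can be illustrated on a tiny episode. This is a minimal sketch with a hypothetical helper name (`discounted_returns`) and made-up rewards. It only shows how the return of each step is obtained, not the averaging over episodes.

```python
def discounted_returns(rewards, discount):
    """Return G_t for every step t, computed backwards via G_t = r_t + discount * G_{t+1}."""
    returns = []
    g = 0.0
    for reward in reversed(rewards):
        g = reward + discount * g
        returns.append(g)
    returns.reverse()  # restore chronological order
    return returns


# Hypothetical three-step episode: two ordinary moves, then the move into the goal.
print(discounted_returns([-1, -1, 40], discount=0.9))
```

The last step yields 40, the step before it -1 + 0.9 * 40 = 35, and the first step -1 + 0.9 * 35 = 30.5. The evaluation function then averages such returns per state over all episodes.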

#### MC Random policy
With a discount of 1.0

In [6]:
random_values1 = monte_carlo_policy_evaluation(maze, values, policy=generate_episode_random, discount=1.0)
show_utility(random_values1)

-------------------------------------
| -1.08  | 4.58   | 15.71  | 0.0    | 
-------------------------------------
| 0.97   | 4.69   | 14.89  | 19.39  | 
-------------------------------------
| 2.98   | 3.61   | 4.5    | 2.62   | 
-------------------------------------
| 0.0    | 3.49   | -0.32  | -3.26  | 
-------------------------------------


With a discount of 0.9

In [7]:
random_values2 = monte_carlo_policy_evaluation(maze, values, policy=generate_episode_random, discount=0.9)
show_utility(random_values2)

-------------------------------------
| -0.15  | 5.31   | 17.95  | 0.0    | 
-------------------------------------
| 0.06   | 1.4    | 10.69  | 20.18  | 
-------------------------------------
| 3.71   | 1.24   | 0.07   | 0.87   | 
-------------------------------------
| 0.0    | 3.67   | -1.78  | -3.71  | 
-------------------------------------


#### MC Optimal policy
With a discount of 1.0

In [8]:
optim_values1 = monte_carlo_policy_evaluation(maze, values, policy=generate_episode_optimal, discount=1.0)
show_utility(optim_values1)

-------------------------------------
| 34.67  | 37.16  | 39.62  | 0.0    | 
-------------------------------------
| 31.95  | 34.03  | 36.26  | 38.12  | 
-------------------------------------
| 27.3   | 31.03  | 28.51  | 27.06  | 
-------------------------------------
| 0.0    | 26.25  | 26.14  | 24.64  | 
-------------------------------------


With a discount of 0.9

In [9]:
optim_values2 = monte_carlo_policy_evaluation(maze, values, policy=generate_episode_optimal, discount=0.9)
show_utility(optim_values2)

-------------------------------------
| 24.15  | 32.26  | 39.54  | 0.0    | 
-------------------------------------
| 18.83  | 24.27  | 29.12  | 36.7   | 
-------------------------------------
| 14.02  | 18.19  | 14.45  | 20.22  | 
-------------------------------------
| 0.0    | 13.05  | 11.1   | 15.09  | 
-------------------------------------


### Temporal Difference Learning
#### Functions for generating a single step using a policy

In [10]:
def get_random_step(env: Maze, values: np.ndarray, pos: Tuple[int, int], discount: float, p_action: float) -> Tuple[Tuple[int, int], int]:
    """Picks the next action based on the current state and the random policy.
    
    args:
        env (Maze): The environment which holds the rewards of all the possible states.
        values (np.ndarray): The current values of the states in the env.
        pos (Tuple[int, int]): The current position from which an action should be taken.
        discount (float): The discount factor.
        p_action (float): The probability of a successful action.

    returns:
        Tuple[Tuple[int, int], int]: A tuple with the action and corresponding reward of the next step.
    
    """
    next_actions = env.get_next_action_positions(pos)
    
    # choose a random action
    action = rd.choice(next_actions)
    reward = env.R[action]
    
    # return the action taken and the reward of the state reached by that action
    return action, reward


def get_optimal_step(env: Maze, values: np.ndarray, pos: Tuple[int, int], discount: float, p_action: float) -> Tuple[Tuple[int, int], int]:
    """Picks the next action based on the current state and the optimal policy.

    args:
        env (Maze): The environment which holds the rewards of all the possible states.
        values (np.ndarray): The current values of the states in the env.
        pos (Tuple[int, int]): The current position from which an action should be taken.
        discount (float): The discount factor.
        p_action (float): The probability of a successful action.

    returns:
        Tuple[Tuple[int, int], int]: A tuple with the action and corresponding reward of the next step.
    """
    # get the next action based on the optimal policy
    next_actions = env.get_next_action_positions(pos)
    action_values = []
    
    # calculate the value of the next actions based on the values calculated during the value iteration step
    for action in next_actions:
        action_values.append(env.R[action] + (discount * values[action]))

    # collect every action whose value equals the maximum (ties are kept)
    max_elem = max(action_values)
    policy_actions = [act for val, act in zip(action_values, next_actions) if val == max_elem]
    
    # choose the desired action; with probability 1 - p_action the action fails
    action = rd.choice(policy_actions)
    if p_action < rd.random():
        # the desired action cannot be taken, so choose one of the others
        chosen_index = next_actions.index(action)
        # remove the chosen action and pick a random one from the remainder
        action = rd.choice(next_actions[:chosen_index] + next_actions[chosen_index + 1:])

    # gather the reward of the taken action
    reward = env.R[action]
    
    return action, reward

In [11]:
def temporal_difference_learning(env: Maze, values: np.ndarray, policy: callable, step_size: float = 0.1,
                                 discount: float = 0.9, n_episodes: int = 10000, p_action: float = 0.7) -> np.ndarray:
    """A function that uses temporal difference learning to get a value-function.

    args:
        env (Maze): The environment which holds the rewards of all the possible states.
        values (np.ndarray): The utility values of the states.
        policy (callable): The policy that is used to take a step.
        step_size (float, optional): The learning rate of the TD update, i.e. the fraction of the TD error added to the current estimate. Defaults to 0.1.
        discount (float, optional): The discount factor. Defaults to 0.9.
        n_episodes (int, optional): The amount of episodes to run. Defaults to 10000.
        p_action (float, optional): The probability of a successful action. Defaults to 0.7.

    returns:
        np.ndarray: Returns the calculated values per state.
    """
    state_values = np.zeros(env.R.shape)

    for _ in range(n_episodes):
        # get the random first position
        state = env.get_random_position()

        while state not in env.end_states:
            
            # choose an action based on the policy
            action, reward = policy(env, values, state, discount, p_action)
            
            # update the value of the current_state
            state_values[state] = state_values[state] + step_size * (reward + discount * state_values[action] - state_values[state])
            
            # update the current state
            state = action

    return state_values   
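
The update inside the `while` loop above is the TD(0) rule: the estimate of the current state is moved a fraction `step_size` towards `reward + discount * V(next state)`. The sketch below applies that rule once to made-up numbers. The helper name `td0_update` and the values are hypothetical.

```python
def td0_update(v_state, v_next, reward, step_size, discount):
    """One TD(0) update: move V(s) a fraction step_size towards reward + discount * V(s')."""
    td_target = reward + discount * v_next
    td_error = td_target - v_state
    return v_state + step_size * td_error


# Hypothetical numbers: current estimate 0, next-state estimate 40, step reward -1.
new_value = td0_update(v_state=0.0, v_next=40.0, reward=-1, step_size=0.1, discount=0.9)
print(new_value)
```

Here the TD target is -1 + 0.9 * 40 = 35, so the estimate moves from 0 to about 3.5. Unlike the Monte Carlo evaluation, the update happens after every step and relies on the current estimate of the next state rather than on the full return.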

#### TD Random policy
With a discount of 1.0

In [12]:
random_values1 = temporal_difference_learning(maze, values, policy=get_random_step, discount=1.0, p_action=0.7)
show_utility(random_values1)

-------------------------------------
| -12.29 | -7.47  | 12.7   | 0.0    | 
-------------------------------------
| -8.98  | -12.05 | -4.19  | 1.81   | 
-------------------------------------
| 0.08   | -9.95  | -15.47 | -16.26 | 
-------------------------------------
| 0.0    | -6.55  | -15.36 | -16.77 | 
-------------------------------------


With a discount of 0.9

In [13]:
random_values2 = temporal_difference_learning(maze, values, policy=get_random_step, discount=0.9, p_action=0.7)
show_utility(random_values2)

-------------------------------------
| -5.24  | -4.01  | 5.58   | 0.0    | 
-------------------------------------
| -4.26  | -6.12  | -4.13  | 2.63   | 
-------------------------------------
| 0.94   | -5.43  | -7.98  | -9.52  | 
-------------------------------------
| 0.0    | 0.07   | -6.96  | -8.29  | 
-------------------------------------


#### TD Optimal policy
With a discount of 1.0

In [14]:
optim_values1 = temporal_difference_learning(maze, values, policy=get_optimal_step, discount=1.0, p_action=0.7)
show_utility(optim_values1)

-------------------------------------
| 32.43  | 35.68  | 36.91  | 0.0    | 
-------------------------------------
| 30.78  | 33.3   | 32.16  | 33.67  | 
-------------------------------------
| 26.33  | 29.91  | 27.49  | 24.72  | 
-------------------------------------
| 0.0    | 27.91  | 25.26  | 23.92  | 
-------------------------------------


With a discount of 0.9

In [15]:
optim_values2 = temporal_difference_learning(maze, values, policy=get_optimal_step, discount=0.9, p_action=0.7)
show_utility(optim_values2)

-------------------------------------
| 21.64  | 27.9   | 35.24  | 0.0    | 
-------------------------------------
| 15.78  | 20.36  | 25.46  | 32.71  | 
-------------------------------------
| 12.41  | 14.79  | 13.21  | 18.11  | 
-------------------------------------
| 0.0    | 11.21  | 9.75   | 13.01  | 
-------------------------------------
