# Reinforcement Learning

## Value iteration

In [None]:
from mllab.rl import TicTacToe, LGame

We implement value iteration and test it on two board games, Tic-tac-toe and the L-Game ([Wikipedia](https://en.wikipedia.org/wiki/L_game)).

Both games have the same interface, so you can write your code to work for both games without further changes. The interface is shown in the following.

In [None]:
game = TicTacToe
state = game.unique_states[2]  # A list of all possible game states. The states are normalized.
state

In [None]:
state = game.unique_states[0]
print("Current player:", state.player)
print("Winner?", state.winner())  # returns either None or a player number
print("Is terminal? ", state.is_terminal())  # is the game finished?
print("List of valid actions: ", state.valid_actions())

Let's see how we can apply an action and get a new state.

In [None]:
import random

action = random.choice(state.valid_actions())
print("Place piece at", action)
new_state = state.apply_action(action)
new_state

### Normalized States

The returned `new_state` is **not normalized**. Since the value function is only defined for normalized states. You have to normalize the state.

In [None]:
new_state_normalized = new_state.normalized()

### Task 1

Implement the `win_reward` and `value_iteration` function below. Use the game interface explained above.

**Pitfall**: Let $s$ be the current state, $a$ be a valid action for $s$, and $s^\prime$ be the state we get if action $a$ is taken in state $s$. Then, the reward for the player taking action $a$ in $s$ is _not_ $V(s^\prime)$, but $-V(s^\prime)$. We use one value function for both policies and only store the value for normalized states (for equivalence classes).

In [None]:
from itertools import count


def win_reward(s, action=None):
    """
    Compute the reward if in state s the given action is applied.
    
    If there is not winner, 0 is returned. Otherwise, 1 or -1 is returned,
    depending on whether the current player has won or lost.
    """
    # your code goes here


def value_iteration(game, asynchronous=True, reward=win_reward):
    """
    Perform value iteration and return the value function.
    
    Parameters
    ==========
    
    game: A game class (e.g. TicTacToe or LGame)
    asynchronous: bool
        Whether to do the updates directly on the current iterate
        or if the iterate is only updated at the end of the max operation (updated in parallel, synchronous).
        With other words, if the parameter is False the maximum is computed
        independently (using old values, ignoring other updates). Otherwise, other max operations
        in the same iteration are taken into account.
    reward: function
        A function which takes a state and an action and returns a number.
    """
    states = game.unique_states
    v = {s: 0 for s in states}  # value function, initialized to 0
    # your code goes here

Compute an optimal value function for the game Tic-Tac-Toe.

In [None]:
vT_slow = value_iteration(TicTacToe, asynchronous=False)
vT_fast = value_iteration(TicTacToe, asynchronous=True)

Now compute an optimal value function for the L-Game.

In [None]:
vL_slow = value_iteration(LGame, asynchronous=False)
vL_fast = value_iteration(LGame, asynchronous=True)

Write code to get an policy from a value function.

In [None]:
import random


class ValueFunctionBasedPolicy:
    """
    A policy computed using a value function.
    
    Usage
    =====
    
        # assume a value function is stored in v
        policy = ValueFunctionBasedPolicy(v)

        # get a best action for state
        action = policy[state]
    """

    def __init__(self, v, reward):
        self._v = v
        self._reward = reward
    
    def __getitem__(self, s):
        """Get an action for state s."""
        return random.choice(self.actions(s))
    
    def actions(self, s):
        """Get all actions which maximize the reward in state s."""
        actions = list(s.valid_actions())
        if not actions:
            return []
        # your code goes here

    def value(self, s, a):
        return self._reward(s, a) - self._v[s.apply_action(a).normalized()]

Let us watch the agent play against itself.

In [None]:
from mllab.rl import self_play

In [None]:
self_play(TicTacToe, ValueFunctionBasedPolicy(vT_fast, win_reward), sleep=2)

In [None]:
self_play(LGame, ValueFunctionBasedPolicy(vL_fast, win_reward), sleep=0.01, max_steps=500)

### L-Game Insights and Improvements

The L-Game does never finish if both players are perfect. There some states though, which are special. Let us use the value function to get some insights.

### Task 2

Collect all states with negative value. How many are terminal, how many are not terminal? What does it mean that a state has a negative value but is not terminal?

In [None]:
# your code goes here

Let the agent start in every state with negative reward which is not terminal and let it play against itself. Use
```python
final_state, steps = self_play(state, ValueFunctionBasedPolicy(vL_fast), max_steps=100, sleep=0)
```
to compute the steps until the game terminates and it final state.

Print the number of steps it takes to terminate and how the game ended. Run the code several times, what do you observe?

In [None]:
# your code goes here

Use the computed value function to design a new reward which improves the behaivor.

In [None]:
good_positions = set(s for s in LGame.unique_states if vL[s] > 0 and s.winner() is None)
bad_positions = set(s for s in LGame.unique_states if vL[s] < 0 and s.winner() is None)

def asap_win_reward(s, a=None):
    # your code goes here

In [None]:
vL_improved = value_iteration(LGame, reward=asap_win_reward)

Now, compute the number of steps before termination again and compare. Maybe, also read the Wikipedia article on the L-Game referenced at the beginning.

In [None]:
# your code goes here

## Policy Iteration

### Task 3

Implement `policy_evaluation` and `policy_improvement` below.

In [None]:
import random
import math
from mllab.rl import Policy
from itertools import cycle


def policy_improvement(p1, p2, v1, v2, reward):
    """
    Compute an improved policy for the states using the value function v.
    
    Parameters
    ==========
    p1: dict
        Policy for player 1
    p2: dict
        Policy for player 2
    v1: dict
        Value function for player 1
    v2: dict
        Value function for player 2
    reward: callable
        The reward function
    Returns
    =======
    (dict, dict)
        Two policies, one for player 1, the second for player 2
    """
    p1 = dict(p1)
    states = list(v1.keys())
    # compute greedy update for p1
    for s in states:
        # your code goes here
    # compute best response to p1 for p2
    v1 = policy_evaluation(states, p1, p2, reward)
    v2 = policy_evaluation(states, p2, p1, reward)
    p2 = {}
    while True:
        print('P', end='')  # print a 'P' for each iteration
        # your code goes here
    return p1, p2


def policy_evaluation(states, p1, p2, reward):
    """
    Comptue the value function for given policies.
    
    Parameters
    ==========
    states: list of normalized states
    p1: dict
        Policy for first player
    p2: dict
        Policy for second player
    """
    v = {}
    for s in states:
        current = s
        rewards = []
        sign = 1  # cf. pitfall
        # store (sign, state) pairs and check for cycles!
        trajectory = []
        # iterate until the current state is terminal (i.e., None)
        while current is not None:
            # your code goes here
        v[s] = sum(rewards)
    return v


def policy_iteration(game, reward):
    """
    Perform policy iteration for the given game, and reward function.
    
    Returns
    =======
    policy: Policy
    """
    states = game.unique_states
    p1 = {s: random.choice(s.valid_actions()) for s in states if s.valid_actions()}
    p2 = {s: random.choice(s.valid_actions()) for s in states if s.valid_actions()}
    v1, v2 = {}, {}
    while True:
        v1_new = policy_evaluation(states, p1, p2, reward)
        v2_new = policy_evaluation(states, p2, p1, reward)
        if v1 == v1_new and v2 == v2_new:
            break
        v1, v2 = v1_new, v2_new
        p1, p2 = policy_improvement(p1, p2, v1, v2, reward)
        print('.', end='')
    print('')
    return Policy(game, p)

In [None]:
pT_policy_iteration = policy_iteration(TicTacToe, win_reward)

The policy iteration does might not converge for the L-Game, more specifically, finding the best response in `policy_improvement`. What it the reason? Can you fix it? (optional)

In [None]:
pL_pi = policy_iteration(LGame, asap_win_reward)

## Deep Q Learning (DQN)

Install some Python package we need by running the following cell.

In [None]:
!pip install gym box2d box2d-kengz opencv-python h5py tqdm

In [1]:
import gym
import numpy as np
from mllab.rl.dqn import BaseQNetwork, ReplayMemory, EpsilonGreedyPolicy, ProportionalPrioritizationReplayMemory

Using TensorFlow backend.


### Environment

The action space of the car racing environment is continous and
consists of a three dimensional real vector $[-1, 1]x[0, 1]x[0, 1]$
corresponding to steering position, amount of gas and and brake intensity.
We need discrete actions, so you have to pick finitely many points from this box.

A initial suggestion has been made, **feel free to modify it**.

In [2]:
env = gym.make('CarRacing-mllab-v0', verbose=0)
# This will open a window. Call env.close() at the end to get rid of it.

In [3]:
# Picke a finite set of actions
action_space = env.action_space.discretize((
    np.array([ 0, 1, 0]),  # full gas
    np.array([ 0, 0, 1]),  # full brake
    np.array([-1, 0, 0]),  # steer left
    np.array([ 1, 0, 0]),  # steer right
    np.array([ 0, 0, 0]),  # do nothing
))


In [None]:
env = gym.make('MountainCar-v0')

action_space = env.action_space
print(action_space.sample())

Let's watch a random policy.

In [None]:
import time 

start = time.process_time()
current = time.process_time()

env.reset()
while  True:
    current = time.process_time()
    new_state, reward, terminated, _info = env.step(action_space.sample())
    env.render()
    if (current - start) >= 1:
        break
print(new_state)
env.close()

In [4]:
env.close()

### Preprocessing map

The state space of the car racing environment is made of an $96\times96$ RGB image and seven measurements:

- The velocity of the car (absolute value)
- The angular velocity of the four wheels
- The steering angle of the front wheels
- The angular velocity of the car

The map `preprocess` takes a state and transforms it to a state which hopefully is better suited as an input to the neural network. You can use the transformation as is **or change it**.

In [9]:
import cv2 as cv
from matplotlib import pyplot as plt
from keras import backend as K


def show(image):
    """
    Show a greyscale image.
    
    Useful for debugging.
    """
    fig, ax = plt.subplots(dpi=2 * 72)
    if image.ndim == 3:
        if image.shape[0] == 1:
            image = image.reshape(image.shape[1:])
        elif image.shape[-1] == 1:
            image = image.reshape(image.shape[:-1])
    if image.ndim == 2:
        ax.imshow(image, cmap='gray')
    else:
        ax.imshow(image)
    plt.axis('off')
    plt.show()


def preprocess(state):
    """
    Preprocess the rendered color image of the car racing environment.

    Parameters
    ----------

    state: (image, measurements)
        image is an RGB image, more precisely an 96x96x3 array.
        measurements is 1D vector of length 7.
    """
    image, measurements = state
    # Convert to grayscale
    gray = cv.cvtColor(image, cv.COLOR_RGB2GRAY)
    # Resize the image (to save memory)
    # Get mask for red markings in curves
    curve_marks = cv.inRange(image, (250, 0, 0), (255, 0, 0))
    # Replace markings with white
    gray[curve_marks == 255] = 255
    gray = cv.resize(gray, (0,0), fx=0.85, fy=0.85)
    # Remove pattern in grass by setting light pixels (> 130) to white (255)
    gray = cv.threshold(gray, 130, 255, cv.THRESH_TRUNC)[1] / 130
    if K.image_data_format() == 'channels_first':
        gray = gray.reshape((1,) + gray.shape)
    else:
        gray = gray.reshape(gray.shape + (1,))
    measurements = np.concatenate((
        measurements[:4],
        np.array([np.cos(measurements[4]), np.sin(measurements[4])]),
        measurements[5:],
    ))
    return (gray.astype(K.floatx()), measurements.astype(K.floatx()))

def preprocess_mount(state):
    return (state[0],state[1])

### Q-Network

The Q-Network maps a (preprocessd) state to a Q-value for each action. Since out state consists of an image and measurements, we need to use Keras' functional API to build a neural network which can take mixed input.

First, define two models for the scalar inputs and the image inputs:

```python
input_img = layers.Input(shape=...)
img = layers.Conv2D(4, kernel_size=(3, 3), activation='relu')(input_img)
# add more layers here (replace input_img by img)
img = layers.Flatten()(img)
img = keras.Model(input_img, img)

input_scalar = layers.Input(shape=...)
img = keras.Dense(8, activation='relu')(input_scalar)
# as above
scalar = layers.Model(input_scalar, scalar)
```

Then concatenate both models and create a new model:
```python
model = layers.concatenate([img.output, scalar.output])
model = layers.Dense(num_actions, activation='linear')(model)
model = keras.Model(inputs=[img.input, scalar.input], outputs=model)
```

### Task 4

Define your model for the Q-network by implementing the `build_model` method.

The method must return a model and an optimizer. The loss is implemented in the parent class.

In [11]:
import keras
import keras.layers as layers
import keras.optimizers as optimizers
from keras.layers import Input, Dense
from keras.models import Model

class QNetworkCar(BaseQNetwork):
    model_type = 'car'
    
    def build_model(self, state_shape):
        num_actions = len(self.action_space)
        # Build the network for the image part
        img_shape, scalar_shape = state_shape

        input_img = layers.Input(shape=img_shape)

        img = layers.Conv2D(4, kernel_size=(3, 3), activation='relu')(input_img)
        img = layers.Conv2D(4, kernel_size=(3, 3), activation='relu')(img)
        # add more layers here (replace input_img by img)
        img = layers.Flatten()(img)
        img = keras.Model(input_img, img)

        # Build the network for the scalar part
        input_scalar = layers.Input(shape=scalar_shape)
        scalar = layers.Dense(8, activation='relu')(input_scalar)
        scalar = keras.Model(input_scalar, scalar)
        # Combine both networks
        model = layers.concatenate([img.output, scalar.output])
        # add your layers, if any, here
        # the output shape must be the number of actions!
        model = layers.Dense(num_actions, activation='linear')(model)
        model = keras.Model(inputs=[img.input, scalar.input], outputs=model)
        

        opt = optimizers.RMSprop(lr=0.00025 / 4, rho=0.95, epsilon=0.01)

        return model, opt
    
class QNetworkMount(BaseQNetwork):
    model_type = 'mountain'
    
    def build_model(self,state_shape):
        
        num_actions = 3
        
        # This returns a tensor
        inputs = Input(shape=(num_actions,))
        # a layer instance is callable on a tensor, and returns a tensor
        x = Dense(3, activation='relu')(inputs)
        predictions = Dense(1, activation='linear')(x)

        # This creates a model that includes
        # the Input layer and three Dense layers
        
        model = Model(inputs=[inputs,], outputs=predictions)

        opt = optimizers.RMSprop(lr=0.00025 / 4, rho=0.95, epsilon=0.01)

        return model, opt

### Replay Memory

The replay memory stores transitions. It was already implemented for your. To add a transition use
```python
replay_memory.add(state, action_index, reward, new_state)
```
**Important:** `state` and `new_state` must be the output of `preprocess`. If `state` is terminal, `new_state` must be `None`. The action index (not the actual index) is returned by the policy, see below.


In order to sample a batch of transitions, call
```python
transitions, sample_weights = replay_memory.sample(importance_criterion, progress)
```
The parameter `importance_criterion` is a callable (e.g., a function) which get a transitions as arguments and returns an number to measure the prediction error for the transitions. You should use the TD-Error
$$
    |y - Q(s^\prime, a)| = |\bigl(r + \gamma Q_\textrm{target}(s^\prime, \operatorname{argmax}_aQ(s^\prime, a))\bigr) - Q(s^\prime)|.
$$
For terminal states $y$ is just $r$.

The arguments for `importance_criterion` are
```python
def my_criterion(s, actions, rewards, s2, not_terminal): ...
```
Where
- `s` is a list of preprocessed state
- `actions` is a NumPy array of action indices (the action taken in `s`)
- `rewards` is a NumPy array of rewards received (the reward received after taking the action from `actions` in the state from `s`)
- `s2` is a list of preprocessd states (the new state). Only non terminal states are returned.
- `not_terminal` is a NumPy array of boolean indicating which of the states in `s` was not terminal. 

The parameter `progression` is a float in $[0, 1]$ which represents the percentage of the steps taken so far.

The return value `transitions` of `replay_memory.sample` is a tuple which has the same entries as those given as parameters to `importance_criterion`. The `sample_weights` return value must be passed to the gradient step (see policy description).

#### Memory requirements
Depending on the size of your state the memory requirements can be huge. For example, to store 100k transitions you need 10GB of memory or more. Check if your machine has enough memory or try a smaller replay memory.

### $\varepsilon$-Greedy-Policy

A policy class is already implemented for your. Initialize it as following (feel free to change the parameters):
```python
policy = EpsilonGreedyPolicy(q_network)
policy.initial_exploration = 1.0  # initial epsilon value
policy.final_exploration = 0.01  # lowest epsilon value
policy.evaluation_exploration = 0.001  # epsilon used during evaluation
policy.final_exploration_step = 500_000  # number of steps over which epsilon is linearly decreased
```
Here, `q_network` is an instance of `QNetwork`.

With probability $\varepsilon$ the policy returns a random action (exploration). Otherwise, an action is returned with maximal Q-value. The probability $\varepsilon$ is linearly decreased with the step number.

You get an action from the policy by calling it (like a function):
```python
action_index, action = policy(preprocessed_state, step)
```

More methods:

- `policy.copy()` creates an independent copy of the policy
- `policy.gradient_step(states, actions, labels, sample_weights)` performs a gradient step. `state` and `actions` are the return values of the replay memory (first two elements in `transitions`), and `sample_weights` is the second return value of the replay memory.
- `policy.copy_weights_from(other_policy)` Copies over the weights from another policy.

To compute the Q-values of the underlying network, use
```python
policy.q_network(states)
```
which returns a NumPy array where each row contains the outputs of the network. You need this to implement the label computation and for `importance_criterion`.

### Q-Learning Algorithm

Implement the `train` method.

In [12]:
from tqdm import tqdm_notebook as tqdm


class DeepQLearning:
    # After how many steps the weights are copied to the target-action network
    target_network_update_frequency = 1_000
    discount_factor = 0.99
    # A random policy is run for that many steps to initialize the replay memory
    replay_start_size = 5_000

    def __init__(self, env, replay_memory, policy, preprocess=preprocess):
        self.env = env
        self.replay_memory = replay_memory
        self.policy = policy
        self.preprocess = preprocess
        self.rewards = []
        self.best_agent = None


    def train(self, total_steps, replay_period, weight_filename=None, evaluate=None, double_dqn=False):
        """
        Train the agent using DQN.

        Parameters
        ==========

        total_steps: int
            Number of steps the agent is trained for.
        replay_period: int
            Number of steps between which the network is trained.
        weight_filename: str or None
            If not None the weights of Q-network are stored to this file during training.
        evaluate: int or None
            Number of episodes after which the policy is evaluted and the result is printed.
        double_qdn: bool
            Whether to use Double-DQN (DDQN).
        """
        if len(self.replay_memory) == 0:
            self.initialize_replay_memory()
        self.action_value = self.policy
        self.target_action_value = self.policy
        episode = 0
        step = 0

        while episode < total_steps:
            episode += 1
            self.env.reset()
            preprocessed_state = self.preprocess(self.env.state)
            print("Episode {} ({} steps so far)".format(episode, step))
            
            
            
            step=0
            for _ in tqdm(range(env.spec.max_episode_steps)):
                step += 1
                preprocessed_state = self.preprocess(self.env.state)
                
                action_index, action = policy(preprocessed_state, step)
               
                new_state, reward, *_ = env.step(action)
                new_state = self.preprocess(new_state)
                replay_memory.add(preprocessed_state, action_index, reward, new_state)
                
                if step % replay_period == 0:
                    
                    transitions, sample_weights = self.replay_memory.sample(self.importance_criterion, progression = step/total_steps)
                    nt= transitions[4]
                    y = transitions[2]
                    s2= transitions[3]
                    y[nt] = y[nt] + self.discount_factor*self.target_action_value.q_network(s2)[np.arange(s2[0].shape[0]), np.argmax(self.action_value.q_network(s2),axis=1)]
                    
                    self.target_action_value.gradient_step(*transitions[0:2], y, sample_weights)
                    
                if step % self.target_network_update_frequency ==0:
                    self.action_value.copy_weights_from(self.target_action_value)
                    
            if evaluate is not None and episode % evaluate == 0:
                total_reward = self.evaluate(self.target_action_value, weight_filename)
                print("Total reward: {}".format(total_reward))
                

    def initialize_replay_memory(self):
        """Initialize the replay memory using a random policy."""
        self.env.reset()
        self.replay_memory.purge()
        state = self.preprocess(self.env.state)
        size = min(self.replay_start_size, self.replay_memory.capacity)
        print("Initialize replay memory with {} transitions".format(size))
        for _ in tqdm(range(size)):
            if self.policy.q_network.model_type == 'car':
                action_index, action = self.policy.sample(return_index=True)
            elif self.policy.q_network.model_type == 'mountain':
                action_index, action = self.policy.sample(),self.policy.sample()
                
            new_state, reward, terminated, _info = self.env.step(action)
            new_state = self.preprocess(new_state)
            self.replay_memory.add(state, action_index, reward, new_state)
            if terminated:
                self.env.reset()
                state = self.preprocess(self.env.state)
            else:
                state = new_state

    def evaluate(self, policy, weight_filename=None):
        state = self.env.reset()
        total_reward = 0
        for _ in tqdm(range(env.spec.max_episode_steps)):
            # get action from policy
            _, action = policy(self.preprocess(state))
            state, r, terminal, _ = self.env.step(action)
            total_reward = r + total_reward
            if terminal:
                break
        if self.best_agent is None or total_reward > max(self.rewards):
            self.best_agent = policy.copy()
            if weight_filename is not None:
                self.best_agent.q_network.save(weight_filename + '.best')
        self.rewards.append(total_reward)
        return total_reward
    
    def importance_criterion(self, s, actions, rewards, s2, nt):
        y = rewards
        y[nt] = y[nt] + self.discount_factor*self.target_action_value.q_network(s2)[np.arange(s2[0].shape[0]), np.argmax(self.action_value.q_network(s2),axis=1)]
        w = np.max(self.action_value.q_network(s2),axis=1)
        return np.abs(y-w)

Let's create all objects, set parameters, and start training. **Make sure the replay memory is not too big for your memory!**

# Mountain Car ( Not Working)

In [54]:
from keras.models import Sequential 
from keras.layers import Dense 

# ---- Mountain Car -----
env = gym.make('MountainCar-v0')

action_space = env.action_space

q_network = QNetworkMount((2,), action_space)

policy = EpsilonGreedyPolicy(q_network)
policy.initial_exploration = 1.0  # initial epsilon value
policy.final_exploration = 0.01  # lowest epsilon value
policy.evaluation_exploration = 0.001  # epsilon used during evaluation
policy.final_exploration_step = 500  # number of steps over which epsilon is linearly decreased

# Create the (empty) replay memory
replay_memory = ProportionalPrioritizationReplayMemory(
    (1,),
    # ATTENTION: This is most likely too much for a laptop
    capacity=500, batch_size=32)

dqn = DeepQLearning(env, replay_memory, policy,preprocess = preprocess_mount)
dqn.target_network_update_frequency = 50 #5000
dqn.replay_start_size = 64

<class 'tensorflow.python.framework.ops.Tensor'>
<class 'tensorflow.python.framework.ops.Tensor'>


In [55]:
dqn.train(total_steps=1, replay_period=10, evaluate=1, weight_filename="agentM.h5")

() ()
Initialize replay memory with 64 transitions


HBox(children=(IntProgress(value=0, max=64), HTML(value='')))

() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
0.0
() ()
Episode 1 (0 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
() ()
1


ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[-0.469495  ],
       [-0.47360775],
       [-0.49301416],
       [-0.51150113],
       [-0.52384526],
       [-0.53044635],
       [-0.5341803 ],
       [-0.54677486],
       [-0.561777  ],
 ...

# Car Racing

In [15]:
# --- Car Racing ---0

env = gym.make('CarRacing-mllab-v0', verbose=0)

action_space = env.action_space.discretize((
    np.array([ 0, 1, 0]),  # full gas
    np.array([ 0, 0, 1]),  # full brake
    np.array([-1, 0, 0]),  # steer left
    np.array([ 1, 0, 0]),  # steer right
    np.array([ 0, 0, 0]),  # do nothing
))

# Get shape of transformed state
s = preprocess(env.reset())
img_shape = s[0].shape
scalar_shape = s[1].shape

# Create the Q-Network
q_network = QNetworkCar((img_shape, scalar_shape), action_space)

policy = EpsilonGreedyPolicy(q_network)
policy.initial_exploration = 1.0  # initial epsilon value
policy.final_exploration = 0.01  # lowest epsilon value
policy.evaluation_exploration = 0.001  # epsilon used during evaluation
policy.final_exploration_step = 500  # number of steps over which epsilon is linearly decreased

# Create the (empty) replay memory
replay_memory = ProportionalPrioritizationReplayMemory(
    img_shape, scalar_shape,
    # ATTENTION: This is most likely too much for a laptop
    capacity=5000, batch_size=32)

dqn = DeepQLearning(env, replay_memory, policy)
dqn.target_network_update_frequency = 5000 #5000
dqn.replay_start_size = 64

In [24]:
#dqn.train(episodes=10, max_steps_per_episode=1000, evaluate=5, weight_filename="agent.h5")
dqn.train(total_steps=50, replay_period=10, evaluate=5, weight_filename="agent.h5")

Initialize replay memory with 64 transitions


HBox(children=(IntProgress(value=0, max=64), HTML(value='')))

0.0
Episode 1 (0 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.02
Episode 2 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.04
Episode 3 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.06
Episode 4 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.08
Episode 5 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -85.20710059171537
0.1
Episode 6 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.12
Episode 7 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.14
Episode 8 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.16
Episode 9 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.18
Episode 10 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -77.18631178707221
0.2
Episode 11 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.22
Episode 12 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.24
Episode 13 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.26
Episode 14 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.28
Episode 15 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -87.26114649681459
0.3
Episode 16 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.32
Episode 17 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.34
Episode 18 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.36
Episode 19 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.38
Episode 20 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -86.57718120805303
0.4
Episode 21 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.42
Episode 22 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.44
Episode 23 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.46
Episode 24 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.48
Episode 25 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -85.91549295774585
0.5
Episode 26 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.52
Episode 27 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.54
Episode 28 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.56
Episode 29 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.58
Episode 30 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -86.57718120805303
0.6
Episode 31 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.62
Episode 32 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.64
Episode 33 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.66
Episode 34 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.68
Episode 35 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -86.11111111111047
0.7
Episode 36 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.72
Episode 37 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.74
Episode 38 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.76
Episode 39 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.78
Episode 40 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -92.53731343283484
0.8
Episode 41 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.82
Episode 42 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.84
Episode 43 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.86
Episode 44 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.88
Episode 45 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -93.46405228758066
0.9
Episode 46 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.92
Episode 47 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.94
Episode 48 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.96
Episode 49 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

0.98
Episode 50 (100 steps so far)


HBox(children=(IntProgress(value=0), HTML(value='')))

HBox(children=(IntProgress(value=0, max=1000), HTML(value='')))

Total reward: -93.86503067484557


Get weights from google drive, trained using google colab

In [31]:
!pip install googledrivedownloader
!pip install requests

Collecting requests
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
Collecting idna<2.9,>=2.5 (from requests)
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests)
  Downloading https://files.pythonhosted.org/packages/e6/60/247f23a7121ae632d62811ba7f273d0e58972d75e58a94d329d51550a47d/urllib3-1.25.3-py2.py3-none-any.whl (150kB)
Collecting certifi>=2017.4.17 (from requests)
  Downloading https://files.pythonhosted.org/packages/69/1b/b853c7a9d4f6a6d00749e94eb6f3a041e342a885b87340b79c1ef73e3a78/certifi-2019.6.16-py2.py3-none-any.whl (157kB)
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8

In [39]:
from google_drive_downloader import GoogleDriveDownloader as gdd


gdd.download_file_from_google_drive(file_id='1sRhf154Xp7I_RXwsEDFpVa-y5tlUrR4d',
                                    dest_path='./weights',
                                    unzip=True)


Downloading 1sRhf154Xp7I_RXwsEDFpVa-y5tlUrR4d into ./weights... Done.
Unzipping...

In [40]:
dqn.policy.q_network.load("weights")

We can watch the agent:

In [25]:
import time
def render_policy(env, preprocess, policy):
    """Visualize a policy for an environment."""

    start = time.process_time()
    current = time.process_time()
    env.reset()


    while True:
        current = time.process_time()
        state = preprocess(env.state)
        terminal = env.step(policy(state)[1])[2]
        env.render()
        if (current - start) >= 30:
            break
    env.close()

In [26]:
# Optional: To record the video uncomment the following lines 
# and change "env" in the call to render_policy below to "rec_env"

# rec_env = gym.wrappers.Monitor(env, "recording", video_callable=lambda episode_id: True, force=True)
# rec_env.reset_video_recorder()
render_policy(env, preprocess, policy)

In [11]:
env.close()