# Applied Deep Learning Tutorial 
contact: Mark.schutera@kit.edu


# Deep Reinforcement Learning with Deep-Q-Network (DQN)

## Introduction
In this tutorial, you will implement a Deep-Q-Network that is able to solve a classic control task. The approach builds upon the DeepMind paper Playing Atari with Deep Reinforcement Learning [paper](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), which first introduced the notion of a Deep Q-Network.
<img src="graphics/atari_play.png" width="700"><br>
<center> Fig. 1: Breakout environment of the Atari game </center>

## Core idea
As you probably remember from the lecture, we can learn a policy for our game by trial and error and model it within a Q-matrix. Here, the Q-matrix is approximated by a deep neural network. After training, it gives us an estimate of the expected reward when taking action a in state s: Q(s, a).
Playing the action with the maximum Q-value in any given state is the same as playing optimally with respect to these estimates, that is, following a full exploitation strategy.
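
As a small illustration of the greedy rule above, consider a hand-made Q-table. The table values and the names `q_table` and `greedy_action` are invented for this sketch and are not part of the tutorial code; in the tutorial itself the table is replaced by a neural network.

```python
import numpy as np

# Hypothetical Q-table: 3 states (rows) x 2 actions (columns), values invented for illustration
q_table = np.array([
    [0.1, 0.5],
    [0.7, 0.2],
    [0.0, 0.3],
])

def greedy_action(q_table, state):
    """Return the index of the action with the largest Q-value in the given state."""
    return int(np.argmax(q_table[state]))

for state in range(q_table.shape[0]):
    print('state', state, '-> greedy action', greedy_action(q_table, state))
```

Full exploitation simply means calling such a function in every state, without ever trying a different action.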

## OpenAI Gym
[OpenAI Gym](https://gym.openai.com/docs/) is a library that can simulate a large number of reinforcement learning environments, including Atari games (these need to be installed additionally). You will need Python 3.5+.

>pip install gym


## Taking our cart pole on a first ride
Now that you have gym installed, you can load a classic control environment: first 'CartPole-v0' for a quick test ride, then 'Pendulum-v0', which we use for the rest of this tutorial.


In [1]:
# Import the gym module
import gym


In [4]:

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()



In [3]:
# Load the environment
env = gym.make('Pendulum-v0')

# Reset, it returns the starting frame
frame = env.reset()

for _ in range(100000):
    # Perform a random action, returns the new frame, reward and whether the game is over
    
    '''
    Implement to sample a random action from the action space within the loaded environment
    action = environment.action_space.sample()
    '''
    action = env.action_space.sample()
    observation, reward, is_done, info = env.step(action)
    print('observation: ', observation, 'reward: ', reward)
    if is_done: break
    env.render()

env.close()


observation:  [-0.62096734  0.78383644  0.62204748] reward:  -4.886862995935415
observation:  [-0.66524989  0.74662078  1.15704492] reward:  -5.059880174146855
observation:  [-0.73501691  0.67804878  1.95726317] reward:  -5.420116607278223
observation:  [-0.81175464  0.58399863  2.42917556] reward:  -6.126302008045042
observation:  [-0.88941889  0.45709303  2.97843764] reward:  -6.930695297295492
observation:  [-0.9566193   0.29134089  3.58191627] reward:  -8.002316935557456
observation:  [-0.99458071  0.10396738  3.82945276] reward:  -9.382563356839903
observation:  [-0.99537117 -0.09610528  4.00818874] reward:  -10.692945963182936
observation:  [-0.96146246 -0.27493624  3.64539081] reward:  -10.884402508936926
observation:  [-0.89593945 -0.44417621  3.63462388] reward:  -9.527750619403111
observation:  [-0.81749807 -0.57593134  3.06976739] reward:  -8.513009452209062
observation:  [-0.73113727 -0.68223038  2.74131841] reward:  -7.3328474206375
observation:  [-0.65043032 -0.75956593  

observation:  [-0.26524265  0.96418169 -0.43100281] reward:  -3.6259714077025986
observation:  [-0.28897995  0.95733515  0.49411151] reward:  -3.4032370422393132
observation:  [-0.35081796  0.93644368  1.30566552] reward:  -3.499141102025509
observation:  [-0.44227128  0.89688133  1.9937017 ] reward:  -3.8924552388918605
observation:  [-0.55107788  0.83445382  2.51051775] reward:  -4.515104183761019
observation:  [-0.68352222  0.72992971  3.37843631] reward:  -5.274536734213726
observation:  [-0.81055328  0.5856649   3.85037977] reward:  -6.539701206560693
observation:  [-0.91714938  0.39854362  4.31543514] reward:  -7.812287289358149
observation:  [-0.98325253  0.18224834  4.53311507] reward:  -9.324581102212454
observation:  [-0.99879682 -0.04903992  4.64664408] reward:  -10.806594554032536
observation:  [-0.95909107 -0.28309771  4.75925631] reward:  -11.723882736122459
observation:  [-0.87542163 -0.48336009  4.34932999] reward:  -10.415358739768077
observation:  [-0.76866528 -0.6396

This already looks nice, yet the actions are random. It is therefore time to better understand our environment and to implement our Deep-Q-Network.


In [1]:
# import the necessary libraries
import gym
import gym.spaces
import gym.wrappers
import numpy as np
import random
import pickle
from collections import deque
from keras.layers import Flatten, Dense
from keras import backend as K
from keras.models import Sequential, Model, load_model
from keras import optimizers

## Observation
The observation is made up of cos(theta), sin(theta) and theta dot. 
Theta is normalized between -pi and pi.
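
To make the observation layout concrete, the following sketch recovers the angle theta from an observation with `arctan2`. The helper name `angle_from_observation` and the hand-built example observation are assumptions made for illustration only.

```python
import numpy as np

def angle_from_observation(observation):
    """Recover theta (in radians, within [-pi, pi]) from [cos(theta), sin(theta), theta_dot]."""
    cos_theta, sin_theta, _theta_dot = observation
    return float(np.arctan2(sin_theta, cos_theta))

# Example observation built by hand for theta = pi / 2 and theta_dot = 0.5
observation = np.array([np.cos(np.pi / 2), np.sin(np.pi / 2), 0.5])
theta = angle_from_observation(observation)
print('theta =', theta)
```

Encoding the angle as a cosine and sine pair avoids the jump between -pi and pi that the raw angle would show when the pendulum swings through the bottom position.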

## Action
The action is the joint effort, a continuous value between -2.0 and +2.0.
Write a function to discretize this continuous action space.


In [2]:
# define the action space
def create_action_bins(num_action_bins):
    '''
    Using linspace of numpy implement the action bins for the pendulum, when given the number of the action bins as argument
    '''
    actionbins = np.linspace(-2.0, 2.0, num_action_bins)
    
    return actionbins

# depending on the action, find the according actionbin 
# discretization of the continuous action space
def find_actionbin(action, actionbins):
    idx = (np.abs(actionbins - action)).argmin()

    return idx
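
As a usage sketch of the two helpers above, the block below repeats them in compact form so that it runs on its own. The choice of five bins and the sample actions are arbitrary and only meant to keep the output short.

```python
import numpy as np

def create_action_bins(num_action_bins):
    # evenly spaced torques covering the joint effort range [-2.0, 2.0]
    return np.linspace(-2.0, 2.0, num_action_bins)

def find_actionbin(action, actionbins):
    # index of the bin value closest to the continuous action
    return int(np.abs(actionbins - action).argmin())

actionbins = create_action_bins(5)
print(actionbins)  # bin values: -2, -1, 0, 1, 2

for action in (-1.7, 0.4, 0.6):
    idx = find_actionbin(action, actionbins)
    print(action, '-> bin', idx, 'with torque', actionbins[idx])
```

More bins allow finer control but also give the network more outputs to learn, which is one of the trade-offs to experiment with later.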

## Reward
The reward is defined as
> -(theta^2 + 0.1 x theta_dt^2 + 0.001 x action^2)

What is the lowest possible reward (the highest cost)? And what is the highest possible reward (the lowest cost)?

-(pi^2 + 0.1 x 8^2 + 0.001 x 2^2) = -16.2736044

-(0^2 + 0.1 x 0^2 + 0.001 x 0^2) = 0

From this reward function, what is the goal of the agent?
In essence, the goal is to remain at zero angle (vertical), with the least rotational velocity, and the least effort.

For a hint have a look at the [wiki](https://github.com/openai/gym/wiki).
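
The two bounds computed above can be checked numerically. This is a minimal sketch; the function name `pendulum_reward` is chosen here for illustration, and the extreme values (angle pi, angular velocity 8, effort 2) are the ones used in the arithmetic above.

```python
import numpy as np

def pendulum_reward(theta, theta_dt, action):
    """Reward as defined above: -(theta^2 + 0.1 * theta_dt^2 + 0.001 * action^2)."""
    return -(theta ** 2 + 0.1 * theta_dt ** 2 + 0.001 * action ** 2)

# Worst case: pendulum hanging down (theta = pi), maximum speed (8) and maximum effort (2)
lowest_reward = pendulum_reward(np.pi, 8.0, 2.0)
# Best case: upright, at rest, no effort
highest_reward = pendulum_reward(0.0, 0.0, 0.0)

print('lowest reward :', lowest_reward)
print('highest reward:', highest_reward)
```

The angle term dominates the sum, so the agent gains most by getting the pendulum upright and only then by reducing speed and effort.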

In [3]:
def train_model(memory, gamma=0.9):
    for state, action, reward, state_new in memory:
        
        # flatten state to make it compatible to our neural network
        flat_state_new = np.reshape(state_new, [1, 3])
        flat_state = np.reshape(state, [1, 3])

        # determine estimated reward given state s' after action a, 
        # combination of observed and predicted exploited reward.
        
        target = reward + gamma * np.amax(model.predict(flat_state_new))
        
        # determine current expected agent rewards
        targetfull = model.predict(flat_state)
        
        # update current expected rewards with the emulated predicted reward
        targetfull[0][action] = target
        
        # Fit model based on emulation and prediction
        model.fit(flat_state, targetfull, epochs=1, verbose=0)

<span style="color:red;">
    
    
# Understanding the background concept:

--

   

To train a neural network with supervised learning, we need labels (target outputs). In DQN, the target plays the role of the label. It is generated according to Bellman's principle: the observed reward plus the discounted maximum Q-value predicted for the next state. Using this target, we compute the mean squared error (MSE) loss and minimise it during training.
    
</span>
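
The way the target is assembled can be followed with plain numbers. In this sketch the arrays `q_current` and `q_next` are invented stand-ins for the outputs of `model.predict`, and `build_target` is a hypothetical helper that mirrors the steps inside `train_model` without needing Keras.

```python
import numpy as np

def build_target(q_current, q_next, action, reward, gamma=0.9):
    """Copy the current Q-estimates and overwrite the taken action with the Bellman target."""
    target = reward + gamma * np.max(q_next)
    targetfull = np.array(q_current, dtype=float)  # copy, so q_current stays untouched
    targetfull[action] = target
    return targetfull

# Invented network outputs for 4 action bins (stand-ins for model.predict)
q_current = np.array([-3.0, -2.5, -4.0, -3.5])  # Q(s, .)
q_next = np.array([-2.0, -1.0, -3.0, -2.5])     # Q(s', .)

targetfull = build_target(q_current, q_next, action=2, reward=-1.5, gamma=0.9)
print(targetfull)
```

Because all entries except the taken action equal the current prediction, the mean squared error only penalises the output for that one action.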

## Deep Q Model

As a reminder, this is our Q function.
> Q(s, a) = r + gamma max_a'(Q(s', a'))

The input of our neural network, our generalizable Q-matrix, is the observation, i.e. the state of the pendulum,
and the output is one estimate of the expected reward per action. Gamma is the discount factor applied to the predicted reward in the next state s', r is the observed reward, and a' ranges over the actions available in s'.

For our first network we will implement a DQN with keras:

- Layer with 128 ReLU units
- Layer with 64 ReLU units
- 3 inputs and one output per action bin with linear activation function
- Adam optimizer with learning rate 0.0002, beta_1 0.9 and beta_2 0.999
- Loss mean squared error
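
To get a feeling for the size of this network, the number of trainable parameters can be counted by hand. The sketch assumes 20 action bins for the output layer; `dense_params` is a helper name introduced here for illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

num_action_bins = 20  # assumption: 20 bins; any positive number works
layer_sizes = [3, 128, 64, num_action_bins]

params_per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)
```

Comparing this count with the summary printed by Keras for the built model is a quick sanity check that the layers were set up as intended.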

In [4]:
# Define the Deep-Q-Network in keras

def build_model(num_output_nodes):
    model = Sequential()
    
    model.add(Dense(128, input_shape=(3,), activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(num_output_nodes, activation='linear'))
    
    
    adam = optimizers.Adam(lr=0.0002, beta_1=0.9, beta_2=0.999)
    model.compile(optimizer=adam, loss='mse')

    return model

In [5]:
def run_episodes(epsilon, gamma, training_iterations, sequence_iterations):
    
    # These are hyperparameters to play around with, after your first run.
    epsilon_decay = 0.9999
    epsilon_min = 0.02
    steps_per_sequence = 250

    for epoch in range(0, training_iterations // sequence_iterations):# for1
        for sequence_id in range(0, sequence_iterations): # for2
            state = env.reset()
            memory = deque()
            
            total_reward = 0
            
            # Easy implementation of decaying exploration
            if epsilon > epsilon_min:
                epsilon = epsilon * epsilon_decay
            
            for i in range(0, steps_per_sequence): # for3
                    
                '''
                Given epsilon implement a simple method for trading off exploration and exploitation
            
                Hint: For random values (use numpy) smaller than epsilon we want to explore
                
                
                
                
                Hint: For random values larger than epsilon we want to exploit
                
                
                '''
                if np.random.uniform() < epsilon:
                    # Exploration: sample a random continuous action and map it to the nearest bin
                    action = env.action_space.sample()
                    actionbin = find_actionbin(action, actionbinslist)
                else:
                    # Exploitation: greedily pick the bin with the highest predicted Q-value
                    flat_state = np.reshape(state, [1, 3])
                    actionbin = np.argmax(model.predict(flat_state))

                # convert the chosen bin into the continuous action expected by the environment
                action = actionbinslist[actionbin]
                action = np.array([action])

                # emulate the action in the simulation and observe the transition 
                # as well as the reward
                observation, reward, done, _ = env.step(action)
                total_reward += reward

                state_new = observation

                '''
                save transitions into memory
                Hint: The memory is used as an argument for the train_model function.
                '''
                
                memory.append((state, actionbin, reward, state_new))
                
                
                
                state = state_new
                
            # train model on the samples from memory
            train_model(memory, gamma)
            
            print(epoch , ' epoch', sequence_id, ' sequence. Average reward = ', total_reward / steps_per_sequence, '. epsilon = ', epsilon)

           


<span style="color:red;">
    
# Understanding the background concept:

--


    
    
The innermost for loop (for3) runs 250 times, collects 250 samples of (s, a, r, s') and stores them in the memory. These 250 samples are then used to train the model once. Similarly, the loop for2 repeats this 25 times (sequence_iterations == 25).

For every epoch, we therefore train the model 25 times with 250 samples each time.
With training_iterations = 1000, the outer loop for1 runs for 40 epochs (1000 / 25).
    
</span>
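
The loop counts above, and what they imply for epsilon, can be worked out with a few lines of arithmetic. The sketch uses the hyperparameter values from this tutorial; the variable names `sequences`, `transitions`, `final_epsilon` and `sequences_to_min` are introduced here for illustration.

```python
import math

training_iterations = 1000
sequence_iterations = 25
steps_per_sequence = 250
epsilon, epsilon_decay, epsilon_min = 1.0, 0.9999, 0.02

epochs = training_iterations // sequence_iterations
sequences = epochs * sequence_iterations
transitions = sequences * steps_per_sequence

# epsilon is multiplied by epsilon_decay once per sequence
final_epsilon = epsilon * epsilon_decay ** sequences

# sequences needed until epsilon drops to epsilon_min at this decay rate
sequences_to_min = math.ceil(math.log(epsilon_min) / math.log(epsilon_decay))

print(epochs, 'epochs,', sequences, 'sequences,', transitions, 'transitions')
print('epsilon after training:', round(final_epsilon, 4))
print('sequences until epsilon_min:', sequences_to_min)
```

With a decay of 0.9999 per sequence, epsilon is still close to 0.9 after all 1000 sequences, so the agent acts mostly at random during the whole run. This is one reason to experiment with the decay factor.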

## Function for running the policy of our DQN after loading or training


In [6]:
def play_game(rounds):
    state = env.reset()
    totalreward = 0

    for _ in range(0, rounds):
        # Rendering for visualization
        env.render()

        flat_state = np.reshape(state, [1, 3])
        actionbin = np.argmax(model.predict(flat_state))

        action = actionbinslist[actionbin]
        action = np.array([action])

        observation, reward, done, _ = env.step(action)

        totalreward += reward

        state_new = observation
        state = state_new
        
    return totalreward

## Train the DQN


In [15]:
env = gym.make('Pendulum-v0')

# These are hyperparameters to play around with

# iterations
training_iterations = 1000
sequence_iterations = 25

# epsilon (setting exploitation vs exploration)
epsilon = 1

# gamma (importance of predicted estimated reward)
gamma = 0.9

# Discretization settings for the action space
num_action_bins = 20
actionbinslist = create_action_bins(num_action_bins)



# Build model
model = build_model(num_action_bins)

run_episodes(epsilon, gamma, training_iterations, sequence_iterations)

'''
training takes super long, this is not efficient at all, how can we bypass this?
Hint: See cells in run pretrained model.

'''
   

0  epoch 0  sequence. Average reward =  -5.265124535395204 . epsilon =  0.9999
0  epoch 1  sequence. Average reward =  -6.084306390158031 . epsilon =  0.9998000100000001
0  epoch 2  sequence. Average reward =  -8.067130315879194 . epsilon =  0.9997000299990001
0  epoch 3  sequence. Average reward =  -8.475199613461633 . epsilon =  0.9996000599960002
0  epoch 4  sequence. Average reward =  -6.184959141074345 . epsilon =  0.9995000999900007
0  epoch 5  sequence. Average reward =  -6.023470164568668 . epsilon =  0.9994001499800017
0  epoch 6  sequence. Average reward =  -6.866023373383015 . epsilon =  0.9993002099650037
0  epoch 7  sequence. Average reward =  -5.774279801564324 . epsilon =  0.9992002799440072
0  epoch 8  sequence. Average reward =  -8.076870277917344 . epsilon =  0.9991003599160128
0  epoch 9  sequence. Average reward =  -9.173715904890374 . epsilon =  0.9990004498800211
0  epoch 10  sequence. Average reward =  -4.85829289998192 . epsilon =  0.9989005498350332
0  epoch 11

KeyboardInterrupt: 

<span style="color:red;">
    
By loading the weights of a pretrained model.
    
</span>

In [None]:

# Save the trained model

print('saving model')
model.save('pendulum_model_juno_' + str(training_iterations) + '.h5')
print('model saved')


In [None]:
# Evaluate performance on 10 test runs with 100 steps each
trarray = []
rounds = 100
for i in range(10):
    trarray.append(play_game(rounds))
    print(i, ' sequence. Running average test reward = ', np.average(trarray)/rounds, 'Test reward of this run = ', trarray[-1]/rounds)
    

## Run pretrained model

In case you already trained a model or want to load the pretrained model for sanity checking use the following script (make sure you executed the necessary cells starting with the imports).

- How does the performance change with the amount of trained iterations?
- How can we measure performance to begin with?
- Is it sufficient to start the play_game function a single time? 
- How can we make sure, that the evaluation is meaningful?



In [14]:
env = gym.make('Pendulum-v0')

actionbinslist = create_action_bins(20)

# 'pendulum_model_[iterationstrained].h5' 
# iterationstrained: 100, 1000, 10000
model = load_model('pendulum_model_1000.h5')

'''
Is the next line meaningful for evaluation, if not, what can we do?
'''
#play_game(rounds=250)
#play_game(rounds=2500)

for i in range(3):
    play_game(rounds=250)
    
    
env.close()

# Answer for - How does the performance change with the amount of trained iterations?

<span style="color:red;">

With only a few training iterations, the model fails to learn that the vertical position (theta = 0) is the goal.

As we increase the number of training iterations, the model becomes able to hold the pendulum once it reaches theta = 0.

</span>

# Answer for - How can we measure performance to begin with?

<span style="color:red;">
    
    
To begin with, we play the game for a fixed maximum number of rounds. Each restart places the pendulum in a different initial position, so we can count how often the goal is reached.
    
Example:
    
    for i in range(50):
        play_game(rounds=250)
    
In this example, we start and play the game 50 times, which means 50 different random initial positions. The success rate of our DQN can then be calculated as 'x/50', where x is the number of times the game has reached the goal (vertical position).
    
</span>

# Answer for - Is it sufficient to start the play_game function a single time?

<span style="color:red;">


# Case1 - play_game(rounds=250) or play_game(rounds=2500):

For evaluation, calling play_game(rounds=250) or play_game(rounds=2500) just once is not appropriate.
#### Reason:
If the game reaches the goal or gets into a loop after only a few rounds (say 50), the optimal state has been found and the state stays essentially the same for the remaining rounds (200 rounds). A longer single run therefore adds little information, and it still covers only one initial position.

 
    
</span>

# Answer for - How can we make sure, that the evaluation is meaningful?

<span style="color:red;">

# Case2 - multiple calls for play_game(rounds = 250):
This case might be appropriate.
    
#### Reason: 
play_game is called multiple times, and because each call starts from a random initial position, this for loop (multiple calls) evaluates the model for different initial positions.
    
</span>
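
One way to make the repeated evaluation quantitative is to report the mean reward per step together with its standard error. The sketch below is an assumption about how this could be done, not part of the original notebook: `evaluate` is a hypothetical helper, and `fake_play_game` is a stand-in that returns noisy invented rewards so the example runs without gym or a trained model.

```python
import numpy as np

def evaluate(play_fn, episodes, rounds):
    """Run play_fn several times and summarise the average reward per step."""
    per_step = np.array([play_fn(rounds) / rounds for _ in range(episodes)])
    mean = per_step.mean()
    # standard error of the mean: how much the average itself is expected to fluctuate
    stderr = per_step.std(ddof=1) / np.sqrt(episodes)
    return mean, stderr

# Hypothetical stand-in for play_game: returns a noisy total reward so the sketch runs without gym
rng = np.random.default_rng(0)

def fake_play_game(rounds):
    return float(rng.normal(loc=-1.0 * rounds, scale=0.3 * rounds))

mean, stderr = evaluate(fake_play_game, episodes=50, rounds=250)
print('average reward per step: %.3f +/- %.3f' % (mean, stderr))
```

In the notebook, `play_game` could be passed in place of `fake_play_game`. Two models can then be compared by checking whether their averages differ by clearly more than the reported standard errors.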

## Next steps to take it from here

- Implement a skip frame approach
- Experiment with the discretization of the action bins (e.g. advantages and disadvantages of triadisation)
- Experiment with exploration vs exploitation

Send extended ipynb file to mark.schutera@kit.edu for the chance to get bonus points for the final project.

<span style="color:red;">

# Skip frame approach:

Below is an implementation with a skip_frame factor of 2.

This means that only one out of every two frames (observations) from the environment is used to form a new state and fed to the NN.
    
</span>
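
A common variant of frame skipping repeats the chosen action for several frames and only hands the last frame and the summed reward to the agent. The sketch below illustrates that variant under stated assumptions: `CountingEnv` is a hypothetical stand-in for a gym environment (it returns the same four values from `step` as the environments used in this notebook), and `step_with_frame_skip` is a helper name invented here.

```python
class CountingEnv:
    """Hypothetical stand-in for a gym environment, used only to make the sketch runnable."""
    def __init__(self, episode_length=10):
        self.episode_length = episode_length
        self.t = 0

    def step(self, action):
        self.t += 1
        observation = self.t  # the "frame" is just the step counter
        reward = -1.0         # constant cost per frame
        done = self.t >= self.episode_length
        return observation, reward, done, {}

def step_with_frame_skip(env, action, skip=2):
    """Repeat the same action for `skip` frames and return the last frame and the summed reward."""
    total_reward = 0.0
    for _ in range(skip):
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return observation, total_reward, done, info

env = CountingEnv()
observation, reward, done, _ = step_with_frame_skip(env, action=0.0, skip=2)
print(observation, reward, done)
```

With this helper the environment still advances on every frame, while the network is queried and the memory is filled only once per group of skipped frames.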

<span style="color:red;">

# Exploitation and Exploration:
Exploitation means always choosing the action with the highest Q-value predicted by the model.

Initially, however, exploration is beneficial.

### A faster epsilon_decay factor can be chosen: we start the episodes with epsilon = 1 (exploration first) and then decay epsilon at a higher rate (factor 0.8) so as to move to exploitation quickly.
    
    
</span>
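
The effect of the faster decay can be checked without running the environment. The following sketch reuses the decay rule of `run_episodes`; the helpers `epsilon_greedy_bin` and `decay_schedule` and the example Q-values are assumptions introduced for illustration.

```python
import numpy as np

def epsilon_greedy_bin(q_values, epsilon, rng):
    """With probability epsilon pick a random bin (explore), otherwise the best bin (exploit)."""
    if rng.uniform() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decay_schedule(epsilon, epsilon_decay, epsilon_min, sequences):
    """Epsilon used in each sequence, following the decay rule of run_episodes."""
    schedule = []
    for _ in range(sequences):
        if epsilon > epsilon_min:
            epsilon = epsilon * epsilon_decay
        schedule.append(epsilon)
    return schedule

schedule = decay_schedule(epsilon=1.0, epsilon_decay=0.8, epsilon_min=0.02, sequences=25)
first_below_min = next(i for i, eps in enumerate(schedule) if eps <= 0.02)
print('epsilon falls below epsilon_min in sequence', first_below_min)

rng = np.random.default_rng(0)
q_values = np.array([-3.0, -1.0, -2.0])  # invented Q-values for three bins
print('greedy choice:', epsilon_greedy_bin(q_values, epsilon=0.0, rng=rng))
```

With a factor of 0.8, epsilon drops below epsilon_min after 18 sequences (index 17 when counting from zero), so almost the entire training run is spent exploiting. Whether that leaves enough exploration is exactly what the experiment should show.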

In [5]:
def run_episodes(epsilon, gamma, training_iterations, sequence_iterations):
    
    # These are hyperparameters to play around with, after your first run.
    epsilon_decay = 0.8 
    epsilon_min = 0.02
    steps_per_sequence = 250

    for epoch in range(0, training_iterations // sequence_iterations):# for1
        for sequence_id in range(0, sequence_iterations): # for2
            state = env.reset()
            memory = deque()
            
            total_reward = 0
            
            # Easy implementation of decaying exploration
            if epsilon > epsilon_min:
                epsilon = epsilon * epsilon_decay
            
            for i in range(0, steps_per_sequence): # for3
                
                if i % 2 == 0: # --> Skip the even frames
                    continue
                
                #if np.random.uniform() < epsilon:
                #    action = env.action_space.sample()
                #else: # Exploitation = The greedy approach of choosing the action with high output value Q

                # Exploitation: greedily pick the bin with the highest predicted Q-value
                flat_state = np.reshape(state, [1, 3])
                actionbin = np.argmax(model.predict(flat_state))

                # convert the chosen bin into the continuous action expected by the environment
                action = actionbinslist[actionbin]
                action = np.array([action])

                # emulate the action in the simulation and observe the transition 
                # as well as the reward
                observation, reward, done, _ = env.step(action)
                total_reward += reward

                state_new = observation
                
                memory.append((state, actionbin, reward, state_new))
                
                state = state_new
                
            # train model on the samples from memory
            train_model(memory, gamma)
            
            print(epoch , ' epoch', sequence_id, ' sequence. Average reward = ', total_reward / steps_per_sequence, '. epsilon = ', epsilon)

           