# Session 04 DQN - Assignment


In deep Q-learning, we use a neural network to approximate the Q-value function. The state is given as an input to the neural network. 
The output of the neural network represents the (estimated) Q-values of all possible actions. Using an argmax, we choose the action corresponding to the highest Q-value.
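
The action selection described above can be sketched as follows. This is a minimal illustration, assuming the network output is available as a NumPy array of shape `(1, n_actions)`. The helper name `select_action` and its `epsilon` and `rng` parameters are hypothetical and not part of the assignment code.

```python
import numpy as np

def select_action(q_values, epsilon=0.0, rng=None):
    """Pick an action from a (1, n_actions) array of estimated Q-values.

    With probability epsilon a random action is returned (exploration);
    otherwise the action with the highest Q-value is returned (exploitation).
    """
    if rng is None:
        rng = np.random.default_rng()
    n_actions = q_values.shape[-1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values[0]))

# Hypothetical Q-values for three actions
q = np.array([[-1.2, 0.4, 0.1]])
greedy_action = select_action(q)  # epsilon=0 means the choice is always greedy
```

With `epsilon=0` the function reduces to the plain argmax; a non-zero `epsilon` adds the random exploration that is typically used during training.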


To train the Q network, we sample a batch of stored experiences from the replay memory. An experience is a tuple of (state, action, reward, next_state).
We input the state into the Q network and get the estimated Q-values. For the Q network to adjust the weights, it needs to have an idea of how accurate these predicted Q-values are.
However, we do not know the target or actual value here as we are dealing with a reinforcement learning problem. The solution is to estimate the target value by using a second neural network, called the target network. This target network will take the next state as an input and predict the Q-values for all possible actions from that state. 
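Now we can compute the labels $y$ to train the policy network: $y = R(s, a) + \gamma \max_{a'} Q_{target}(s', a')$, where $Q_{target}$ denotes the target network. The difference between $y$ and the predicted value $Q(s, a)$ is the error that training reduces.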
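
As a rough sketch, the labels can be computed for a whole batch at once with NumPy. The function name `td_targets` is hypothetical, and the handling of terminal transitions through a `dones` flag is an assumption that goes beyond the experience tuple described above: when an episode has ended there is no next state to bootstrap from, so the label is just the reward.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each sample.

    next_q_values holds the target network's predictions for the next
    states, with shape (batch_size, n_actions). For terminal transitions
    (done=True) the bootstrap term is dropped.
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    max_next_q = np.max(next_q_values, axis=1)
    return rewards + gamma * max_next_q * (~dones)

# Hypothetical batch of two transitions with three actions each
rewards = [-1.0, -1.0]
next_q = np.array([[0.5, 2.0, 1.0],
                   [3.0, 1.0, 0.0]])
dones = [False, True]
y = td_targets(rewards, next_q, dones, gamma=0.9)
```

Only the entry of the chosen action is overwritten with this label; the Q-values of the other actions keep their predicted values, so they contribute no error to the MSE loss.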

The Q network can now be trained with the MSE loss between these labels and its predicted Q-values. It's important to know that the target network starts as an exact copy of the policy network and that the weights of the target network are kept fixed while the policy network is being trained.

After a certain number of Q-network updates, we copy its weights to the target network.

For more detailed information: https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/



In [1]:
import gym
import random
import numpy as np
import matplotlib.pyplot as plt
import collections

# Import Tensorflow libraries

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.optimizers import Adam

from IPython.display import HTML

## MountainCar-V0

A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
The agent (a car) is started at the bottom of a valley. For any given state the agent may choose to accelerate to the left, right or cease any acceleration.

<img src="./NotebookImages/MountainCart.gif">

For a description of the state vector, the action space and the episode termination, have a look at: https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py

- Implement a DQN to solve this environment.
- Try to minimize the total number of steps per episode needed to reach the flag.
- You are allowed to tweak the reward function, for example by giving an extra reward for getting closer to the flag.
- Modify the DQN implementation into a deep SARSA implementation. Compare the deep SARSA to the DQN implementation.
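
The reward tweak mentioned in the list above could look like the sketch below. It is only one of many possible choices and is not tested for how much it speeds up learning. The function name `shaped_reward` and the `weight` parameter are hypothetical. The flag position of 0.5 is the goal position of MountainCar-v0, and the state is assumed to be the `(position, velocity)` pair returned by `env.step`.

```python
GOAL_POSITION = 0.5  # position of the flag in MountainCar-v0

def shaped_reward(reward, new_state, weight=1.0):
    """Return the environment reward minus a penalty for distance to the flag.

    new_state is the (position, velocity) observation after the step.
    The closer the car is to the flag, the smaller the penalty, which
    acts as an extra reward for getting closer.
    """
    position = new_state[0]
    distance_to_flag = abs(GOAL_POSITION - position)
    return reward - weight * distance_to_flag

# Hypothetical step results: far from the flag vs. close to it
far = shaped_reward(-1.0, (-0.5, 0.0))
near = shaped_reward(-1.0, (0.4, 0.02))
```

The shaped value would be stored in the replay memory instead of the raw reward, while the raw reward can still be used for reporting episode results.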

In [2]:
# Define DQAgent

class DQAgent:
    def __init__(self, replayCapacity, inputShape, hiddenUnits, actionSpace, learningRate):
        # Init replay memory
        self.capacity = replayCapacity
        self.memory = collections.deque(maxlen=self.capacity)
        self.populated = False  # What does this do? (Set to True by sampleFromReplayMemory once a full batch is available)
                
        # Policy /Q network
        self.inputShape = inputShape
        self.hiddenUnits = hiddenUnits # 24 from demo
        self.actionSpace = actionSpace # 2 from demo
        self.learning_rate = learningRate # 0.001 from demo
        self.q_model = self.buildNetwork()

        # Target network with same weights as policy network
        self.target_model = self.buildNetwork()
        self.target_model.set_weights(self.q_model.get_weights())
    
    def buildNetwork(self):
        model = Sequential()
        model.add(Dense(self.hiddenUnits, input_shape=self.inputShape, activation='relu'))
        model.add(Dense(self.hiddenUnits, activation='relu'))
        model.add(Dense(self.actionSpace, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate = self.learning_rate), metrics=['MeanSquaredError'])
        return model
    
    def addToReplayMemory(self, step):
        self.step = step
        self.memory.append(self.step)
    
    def sampleFromReplayMemory(self, batchSize):
        self.batchSize = batchSize
        if self.batchSize > len(self.memory): 
            # if the number of samples in the memory is still smaller than the number of samples needed for one batch return False
            self.populated = False
            return self.populated
        else:
            self.populated = True
            return random.sample(self.memory, self.batchSize)
    
    def q_network_fit(self, batch, batchSize):
        # Placeholder: only stores its arguments. The training loop calls q_model.fit directly.
        self.batchSize = batchSize
        self.batch = batch

    def q_network_predict(self, state):
        self.state = state
        self.qPolicy = self.q_model.predict(self.state)
        return self.qPolicy
    
    def target_network_predict(self, state):
        self.state = state
        self.qTarget = self.target_model.predict(self.state)
        return self.qTarget
    
    def update_target_network(self):
        self.target_model.set_weights(self.q_model.get_weights())   
    

In [3]:
# Memory parameters
REPLAY_MEMORY_CAPACITY = 10000

# Model parameters
#MIN_REPLAY_MEMORY_SIZE = 1_000  # Minimum number of steps in a memory to start training
BATCH_SIZE = 128  # How many steps (samples) to use for training
UPDATE_TARGET_INTERVAL = 500

DISCOUNT = 0.90
EPSILON = 0.95 # Exploration percentage
MIN_EPSILON = 0.01
DECAY = 0.999

# Agent parameters
POSSIBLE_ACTIONS = [0,1,2] #0 = left, 1 = do nothing, 2 = right
hiddenLayerSize = 24
actionSpaceSize = len(POSSIBLE_ACTIONS)
print(actionSpaceSize)
learningRate = 0.001

3


In [4]:
# Test code, to be removed
ENV_NAME = 'MountainCar-v0'

env = gym.make(ENV_NAME)
state = env.reset()

print(state)
new_state = state.reshape(1,-1)
print(new_state)

[-0.48339436  0.        ]
[[-0.48339436  0.        ]]


In [6]:
ENV_NAME = 'MountainCar-v0'

#  The episode ends if either of the following happens:
#     1. Termination: The position of the car is greater than or equal to 0.5 (the goal position on top of the right hill)
#     2. Truncation: The length of the episode is 200.

env = gym.make(ENV_NAME)
state = env.reset()

# Create the Agent
agent = DQAgent(replayCapacity=REPLAY_MEMORY_CAPACITY, inputShape=state.shape, hiddenUnits=hiddenLayerSize, actionSpace=actionSpaceSize,learningRate=learningRate)

# Fill the replay memory with the first batch of samples
updateCounter = 0
rewardHistory = []

for episode in range(200):
    episodeReward = 0
    stepCounter = 0

    state = env.reset()
    done = False

    while not done:
        env.render()
        r = random.random()

        # Epsilon-greedy: explore with probability EPSILON, otherwise act greedily
        if r < EPSILON:
            action = random.choice(POSSIBLE_ACTIONS)
        else:
            qValues = agent.q_network_predict(state.reshape(1,-1))
            action = np.argmax(qValues[0])

        newState, reward, done, _ = env.step(action)

        # if (done) and (stepCounter <199):
        #     reward -= 10
        stepCounter += 1

        # Store step in replay memory
        experience = (state, action, reward, newState, done)
        agent.addToReplayMemory(experience)
        state = newState
        episodeReward += reward

        # When enough steps in replay memory -> train policy network
        if len(agent.memory) >= (BATCH_SIZE):
            # Reduce epsilon Greedy parameter -> less exploration
            EPSILON *= DECAY
            if EPSILON < MIN_EPSILON:
                EPSILON = MIN_EPSILON
            
            miniBatch = agent.sampleFromReplayMemory(BATCH_SIZE)
            # Unzip the list of experience tuples into one sequence per field
            states, actions, rewards, nextStates, dones = zip(*miniBatch)
            miniBatch_states = np.asarray(states, dtype=float)
            miniBatch_actions = np.asarray(actions, dtype=int)
            miniBatch_rewards = np.asarray(rewards, dtype=float)
            miniBatch_next_states = np.asarray(nextStates, dtype=float)
            miniBatch_done = np.asarray(dones, dtype=bool)

            current_state_q_values = agent.q_network_predict(miniBatch_states)
            y = current_state_q_values

            next_state_q_values = agent.target_network_predict(miniBatch_next_states)
            max_q_next_state = np.max(next_state_q_values, axis=1)

            for i in range(BATCH_SIZE):
                if miniBatch_done[i]:
                    y[i,miniBatch_actions[i]] = miniBatch_rewards[i]
                else:
                    y[i,miniBatch_actions[i]] = miniBatch_rewards[i] + DISCOUNT * max_q_next_state[i]

            agent.q_model.fit(miniBatch_states, y, batch_size= BATCH_SIZE, verbose=0)
        
        else:
            # Not enough experiences yet: skip training and the target-update bookkeeping
            continue
        if updateCounter == UPDATE_TARGET_INTERVAL:
            agent.update_target_network()
            print('target updated')
            updateCounter = 0
        updateCounter += 1
    print('episodeReward for episode ', episode, '= ', episodeReward, 'with epsilon = ', EPSILON)
    rewardHistory.append(episodeReward)

env.close()

plt.plot(rewardHistory)
plt.show()


         



Exception ignored in: <function Viewer.__del__ at 0x000001C2B0C38A60>
Traceback (most recent call last):
  File "c:\Users\Richard\.conda\envs\gym\lib\site-packages\gym\envs\classic_control\rendering.py", line 165, in __del__
    self.close()
  File "c:\Users\Richard\.conda\envs\gym\lib\site-packages\gym\envs\classic_control\rendering.py", line 83, in close
    self.window.close()
  File "c:\Users\Richard\.conda\envs\gym\lib\site-packages\pyglet\window\win32\__init__.py", line 299, in close
    super(Win32Window, self).close()
  File "c:\Users\Richard\.conda\envs\gym\lib\site-packages\pyglet\window\__init__.py", line 823, in close
    app.windows.remove(self)
  File "c:\Users\Richard\.conda\envs\gym\lib\_weakrefset.py", line 114, in remove
    self.data.remove(ref(item))
KeyError: <weakref at 0x000001C2D279BD60; to 'Win32Window' at 0x000001C2ACBF0C70>


episodeReward for episode  0 =  -200.0 with epsilon =  0.6209492420859954
episodeReward for episode  1 =  -200.0 with epsilon =  0.5083393701993449
episodeReward for episode  2 =  -200.0 with epsilon =  0.4161514303916004
target updated
episodeReward for episode  3 =  -200.0 with epsilon =  0.3406818813759434
episodeReward for episode  4 =  -200.0 with epsilon =  0.27889882341299566
target updated
episodeReward for episode  5 =  -200.0 with epsilon =  0.22832019533001768
episodeReward for episode  6 =  -200.0 with epsilon =  0.18691406065325245
episodeReward for episode  7 =  -200.0 with epsilon =  0.15301697696688396
target updated
episodeReward for episode  8 =  -200.0 with epsilon =  0.12526716908429902
episodeReward for episode  9 =  -200.0 with epsilon =  0.10254982134296374
target updated
episodeReward for episode  10 =  -200.0 with epsilon =  0.08395229120566039
episodeReward for episode  11 =  -200.0 with epsilon =  0.06872744492756346
episodeReward for episode  12 =  -200.0 wi

## LunarLander-v2

Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.
For more information about this environment, see: https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py

<img src="./NotebookImages/LunarLander.gif">

- Implement a DQN to solve this environment. LunarLander-v2 defines "solving" as getting an average reward of 200 over 100 consecutive trials.
- Try to minimize the number of episodes it takes to solve the environment.
- How would you tweak the reward function for the LunarLander to make a quicker descent?
- Modify the DQN implementation into a deep SARSA implementation. Compare the deep SARSA to the DQN implementation.
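
The difference between the two algorithms lies in the label used for training, which can be sketched as follows. The function names are hypothetical, and the example is a simplified single-transition version rather than a full implementation. DQN bootstraps from the best next action, whereas SARSA bootstraps from the action that the agent actually takes in the next state, so a SARSA experience also has to store that next action.

```python
import numpy as np

def dqn_target(reward, next_q_values, done, gamma):
    """Q-learning label: bootstrap from the highest next-state Q-value."""
    if done:
        return reward
    return reward + gamma * np.max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, done, gamma):
    """SARSA label: bootstrap from the Q-value of the action taken next."""
    if done:
        return reward
    return reward + gamma * next_q_values[next_action]

# Hypothetical next-state Q-values for the four LunarLander actions
next_q = np.array([1.0, 3.0, 2.0, 0.5])
y_dqn = dqn_target(1.0, next_q, done=False, gamma=0.5)
y_sarsa = sarsa_target(1.0, next_q, next_action=2, done=False, gamma=0.5)
```

The two labels coincide whenever the next action happens to be the greedy one, so the algorithms differ mainly while the agent is still exploring.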

## OPTIONAL: CarRacing-v0


Description of the environment:

Easiest continuous control task to learn from pixels, a top-down racing environment. Discrete control is reasonable in this environment as well, on/off discretisation is fine. State consists of 96x96 pixels. Reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points. Episode finishes when all tiles are visited. Some indicators are shown at the bottom of the window and in the state RGB buffer. From left to right: true speed, four ABS sensors, steering wheel position, gyroscope.
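
The reward arithmetic above can be checked with a few lines of Python. The function name `car_racing_return` is hypothetical, and the tile count of 300 is a made-up value; it cancels out whenever all tiles are visited.

```python
def car_racing_return(frames, tiles_visited, total_tiles):
    """Total episode reward: -0.1 per frame plus 1000/N per visited tile."""
    tile_reward = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_reward - frame_penalty

# All tiles visited in 732 frames, as in the example from the description
total = car_racing_return(frames=732, tiles_visited=300, total_tiles=300)
# Visiting only half of the tiles in the same number of frames
half = car_racing_return(frames=732, tiles_visited=150, total_tiles=300)
```

Here `total` is approximately 926.8, matching the description, and `half` is approximately 426.8.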

<img src="./NotebookImages/CarRacing.gif">

Solve this environment with Deep Q-learning. 
- Skip the first 60 frames of an episode until the zooming has stopped and the car is ready to be controlled.
- Crop each state (=image) in such a way that the indicators are removed.
- It might be useful to convert the images to grayscale.
- You might want to take a couple of consecutive images as one state. 
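
The cropping, grayscale conversion and frame stacking from the list above could be sketched as follows. The names `preprocess` and `FrameStack` are hypothetical. The crop height of 12 pixels and the stack size of 4 frames are assumptions that should be checked against the actual frames. Skipping the first 60 frames is not shown here, as it happens in the environment loop.

```python
from collections import deque

import numpy as np

def preprocess(frame, crop_bottom=12):
    """Crop the indicator bar and convert an RGB frame to grayscale in [0, 1]."""
    cropped = frame[:-crop_bottom, :, :]
    return cropped.astype(np.float32).mean(axis=2) / 255.0

class FrameStack:
    """Keep the last n preprocessed frames and expose them as one state."""

    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        # Fill the whole stack with copies of the first frame
        first = preprocess(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(first)
        return self.state()

    def push(self, frame):
        # The deque drops the oldest frame automatically
        self.frames.append(preprocess(frame))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=-1)

# Dummy 96x96 RGB frames standing in for environment observations
stack = FrameStack(n_frames=4)
state = stack.reset(np.zeros((96, 96, 3), dtype=np.uint8))
state = stack.push(np.full((96, 96, 3), 255, dtype=np.uint8))
```

Stacking consecutive frames lets the network infer speed and direction, which a single image does not contain.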

The action space can for example look like this:
```
self.actionSpace = [(-1, 1, 0.2), (0, 1, 0.2), (1, 1, 0.2),
                    (-1, 1, 0), (0, 1, 0), (1, 1, 0),
                    (-1, 0, 0.2), (0, 0, 0.2), (1, 0, 0.2),
                    (-1, 0, 0), (0, 0, 0), (1, 0, 0)]
```
```
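
With such a list, the Q network needs one output per entry, and the argmax index is translated back into a continuous action before it is passed to the environment. The sketch below illustrates this, assuming each tuple is ordered as (steering, gas, brake). The names `ACTION_SPACE` and `to_env_action` are hypothetical.

```python
import numpy as np

ACTION_SPACE = [(-1, 1, 0.2), (0, 1, 0.2), (1, 1, 0.2),
                (-1, 1, 0), (0, 1, 0), (1, 1, 0),
                (-1, 0, 0.2), (0, 0, 0.2), (1, 0, 0.2),
                (-1, 0, 0), (0, 0, 0), (1, 0, 0)]

def to_env_action(q_values):
    """Map the argmax over the Q-values to a (steering, gas, brake) array."""
    index = int(np.argmax(q_values))
    return np.array(ACTION_SPACE[index], dtype=np.float32)

# Hypothetical Q-values with the highest value at index 4: full gas, no steering
q_values = np.zeros(len(ACTION_SPACE))
q_values[4] = 1.0
action = to_env_action(q_values)
```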
