- Q-Learning Algorithms vs Policy-Based Algorithms
- Start with simple lookup table algo
- Then look at Neural Network equivalent
Unlike policy gradient methods, which attempt to learn functions that
directly map an observation to an action, Q-Learning attempts to learn
the value of being in a given state and taking a specific action there
- Frozen lake
- Need an algorithm which learns long-term expected rewards
- This is exactly what q-learning is designed to provide
In its simplest form, Q-learning is a table of values for every state (row)
and action (column) possible in the environment.
- Start by initializing the Q-table to be uniform (all-zeros)
- Observe rewards obtained for various actions & update table accordingly
- Make updates to the Q-table using something called the
Bellman equation
Bellman equation:
The expected long-term reward for a given action is equal to the immediate
reward from the current action combined with the expected reward from the best
future action taken at the following state
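As a rough sketch of this update for a FrozenLake-sized table (the learning rate and discount factor values below are assumptions for illustration, not from the notes):

```python
import numpy as np

# Q-table: 16 states (rows) x 4 actions (columns), initialized to all zeros
Q = np.zeros((16, 4))

def bellman_update(state, action, reward, next_state, lr=0.8, gamma=0.95):
    # Target = immediate reward + discounted value of the best action
    # available at the following state
    target = reward + gamma * np.max(Q[next_state, :])
    # Move the table entry for (state, action) toward that target
    Q[state, action] += lr * (target - Q[state, action])
```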
- No worries for a simple grid world (a 16 × 4 table: 16 states, 4 actions)
- For any interesting problems, tables simply don't work
We need a way to take a description of our state and produce Q-values
for actions without a table; this is where neural networks come in
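As a minimal sketch of that idea for FrozenLake, assuming a one-hot state encoding and TF 1.x style (the shapes and initialization are illustrative):

```python
import tensorflow as tf

# One-hot encoding of the 16 FrozenLake states as the network input
state_input = tf.placeholder(tf.float32, shape=[1, 16])
# A single weight matrix maps the state description to 4 Q-values (one per action)
weights = tf.Variable(tf.random_uniform([16, 4], 0, 0.01))
q_values = tf.matmul(state_input, weights)
# The greedy choice is the action with the largest predicted Q-value
best_action = tf.argmax(q_values, axis=1)
```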
- Q-table
- Q-network
- Deep Q-network
- Go from single-layer network to multi-layer convolutional network
- Implement Experience Replay: allow the network to train itself using stored memories from its experience
- Utilize a second target network, used to compute target q-values
These 3 innovations allowed Google DeepMind to kill the game (2014)
Improvements to the DQN architecture since then
- Double DQN
- Dueling DQN
Allow for improved performance, stability and faster training time
- Instead of considering each pixel independently, convolutional layers allow us to consider regions of an image, and maintain spatial relationships between objects on the screen
In Tensorflow, we can utilize the tf.contrib.layers.convolution2d function to easily create a convolutional layer
conv_layer = tf.contrib.layers.convolution2d(inputs, num_outputs, kernel_size,
                                             stride, padding)
- num_outputs: number of filters applied to the previous layer
- kernel_size: how large a window we would like to slide over the previous layer
- stride: how many pixels to skip as we slide the window across the layer
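A hedged usage example, assuming a batch of stacked 110x80x4 preprocessed frames (the filter count, window size, and stride below are illustrative choices, not prescribed by the notes):

```python
import tensorflow as tf

# Batch of preprocessed frames: 110x80 pixels, 4 frames stacked along the channel axis
frames = tf.placeholder(tf.float32, shape=[None, 110, 80, 4])

# 32 filters, each an 8x8 window slid across the image 4 pixels at a time
conv_layer = tf.contrib.layers.convolution2d(
    inputs=frames,
    num_outputs=32,
    kernel_size=[8, 8],
    stride=[4, 4],
    padding='VALID')
```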
- Basic idea: store the agent's experiences, then randomly draw batches of them to train the network
- By keeping the experiences random, we prevent the network from only learning about what it is immediately doing in the environment and allow it to learn from a more varied array of past experiences
- Each experience is stored as a tuple of
(state, action, reward, next_state)
- The Experience Replay buffer stores a fixed number of recent memories
- As new ones come in, old ones are removed
- When it comes to training, we draw a uniform random batch from the buffer and train our network on it
- Build a simple class to store and retrieve memories
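A minimal sketch of such a class, assuming Python lists and uniform random sampling (the class name, default buffer size, and exact tuple contents are assumptions):

```python
import random

class ExperienceBuffer:
    def __init__(self, buffer_size=50000):
        self.buffer = []
        self.buffer_size = buffer_size

    def store(self, experience):
        # Drop the oldest memory once the buffer is full
        if len(self.buffer) >= self.buffer_size:
            self.buffer.pop(0)
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Draw a uniform random batch of past experiences
        return random.sample(self.buffer, batch_size)
```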
- Use a 2nd network during the training procedure
- Used to generate the target Q-values that will be used to compute the loss for every action during training
- The target network's weights are kept fixed and only periodically (or slowly) updated toward the primary Q-network's values
- This makes training more stable
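One way the slow update could look in TF 1.x, assuming the primary and target networks were built under separate variable scopes (the scope handling and the tau rate are assumptions):

```python
import tensorflow as tf

def make_target_update_ops(primary_scope, target_scope, tau=0.001):
    primary_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=primary_scope)
    target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=target_scope)
    # Move each target variable a small step toward its primary counterpart;
    # running these ops periodically keeps the target network nearly fixed
    return [t.assign(tau * p + (1.0 - tau) * t)
            for p, t in zip(primary_vars, target_vars)]
```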
- The agent plays the game; at each step it saves:
  - the initial state
  - the action it took
  - the reward it got
  - the next state it reached
- This data is used to train the Neural Network
- The input of the Neural Network will be an initial state
- i.e. a stack of 4 preprocessed frames
- The output will be its estimate of Q(s, a)
- At each Q-learning iteration, you play one step in the game
- Instead of updating the model based on the last step
- Add all relevant information from the step you just took (current state, next state, action taken, reward and whether the next state is terminal) to a finite-size memory
- And then call fit batch on a random sample of that memory (see the sketch below)
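Putting the pieces together, a single iteration might look roughly like this; env is assumed to be Gym-style, and model, buffer, choose_action, and fit_batch are placeholder names, not from the notes:

```python
def q_learning_step(env, model, buffer, state, epsilon, batch_size=32):
    # Pick an action (epsilon-greedy, discussed below) and play one step in the game
    action = choose_action(model, state, epsilon)
    next_state, reward, done, _ = env.step(action)

    # Save everything relevant about the step we just took to the finite-size memory
    buffer.store((state, action, reward, next_state, done))

    # Train on a random batch drawn from memory rather than on the last step alone
    fit_batch(model, buffer.sample(batch_size))

    # Start a new episode if the next state is terminal; otherwise carry on from it
    return env.reset() if done else next_state
```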
- Successive states are highly similar
- THEREFORE there is a significant risk that the network will completely forget what it's like to be in a state that it hasn't visited for a while. Replaying prevents this by still showing old frames to the network
- A single preprocessed 110x80 frame takes around 9 KB
- With ER we store 1M of those, ~9 GB of frames, all in RAM
- Store frames as np.uint8 and convert them into floats in the [0, 1] range at the last moment
  - np.float32 will use 4x as much space; np.float64 will use 8x as much space
- Never copy frames (each frame will be part of 4 unique states). Make sure each unique frame uses the same memory in each state
- Don't call np.array() multiple times on a given frame, nor copy.copy
- Be careful when serializing memory
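A small sketch of the storage convention described above (the helper name is illustrative):

```python
import numpy as np

# A raw 110x80 frame stored as uint8: ~9 KB instead of 4x/8x that for float32/float64
frame = np.zeros((110, 80), dtype=np.uint8)

def to_network_input(frames):
    # Convert to float32 in the [0, 1] range only at the last moment,
    # right before the stacked state is fed to the network
    return np.stack(frames, axis=-1).astype(np.float32) / 255.0
```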
- How to choose what action to play while learning?
- When the DQN is bad, we want to explore; otherwise, the agent keeps following whatever its initial random weight initialization tells it to do
- When it gets better, we want to exploit more
- Use epsilon greedy
- With probability epsilon, we choose a random action
- Otherwise we choose the greedy action (maximise Q-val for that state)
- Common to start with a high epsilon and reduce it as the DQN goes through more iterations
  - This is referred to as annealing
- Annealing linearly means the value of epsilon is reduced by a fixed amount at each iteration
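A sketch of epsilon-greedy action selection with linear annealing; the start/end values, step count, and the model.predict call are assumptions for illustration:

```python
import random
import numpy as np

def linear_epsilon(step, start=1.0, end=0.1, anneal_steps=1000000):
    # Reduce epsilon by a fixed amount per iteration until it bottoms out at `end`
    return max(end, start - (start - end) * step / anneal_steps)

def choose_action(model, state, epsilon, num_actions=4):
    # With probability epsilon, explore with a random action
    if random.random() < epsilon:
        return random.randrange(num_actions)
    # Otherwise exploit: take the action that maximises the predicted Q-value
    return int(np.argmax(model.predict(state)))
```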