
Lecture 9 - Deep Reinforcement Learning

Function approximators.

Outline

  • Motivation
  • Recycling is good: an intro to Reinforcement Learning
  • Deep Q-learning
  • Application of Deep Q-Learning: Breakout (Atari)
  • Tips to train Deep Q-Network
  • Advanced topics

Motivation

Deep reinforcement learning is new, but reinforcement learning is an old topic.

Two examples of deep RL:

  1. AlphaGo
  2. Human-level control with deep RL

Game of Go

The goal is to maximize your territory. White and black stones are placed on a 19 × 19 board. How would we solve Go with classical supervised learning?

  • Supervised learning: input a picture of the board and predict the next move a professional player would make. (The problem is that the number of states is far too big, about 10^170, and the game is about long-term strategy, not just discriminating between board positions.)
  • RL: automatically learn to make a good sequence of decisions. (In RL we do not have ground-truth labels.)

In RL we give the agent a reward signal, and the agent learns the task by trial and error.

Examples of RL applications:

  • Robotics
  • Games
  • Advertisement

Recycling is good: an Introduction to RL

States and rewards

Goal: maximize the return (the sum of rewards).

types of states

  1. starting states

  2. normal states

  3. terminal states

In this game we have a limited number of moves, and in each state there is a set of possible actions.

discounted return = the return where rewards count less the later they arrive (every time step that passes, the reward is discounted further).

return = the sum of the rewards, with no discounting.

(figure: computing returns in the example environment)
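
As a formula (a standard definition, written out here since the slide is not reproduced): with discount factor γ ∈ (0, 1] and reward r_t at time step t,

```latex
% Discounted return (standard definition; gamma = 1 recovers the plain, undiscounted sum)
G = \sum_{t=0}^{T} \gamma^{t}\, r_{t}
```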

Q-table = the table that stores my knowledge: the value Q(s, a) of taking action a in state s.

We build a tree of states and actions and calculate the discounted return for every path.

For each action we compute the long-term (discounted) reward, because that is what we want to maximize, and we update the Q-table with it.

(figure: state-action tree and the resulting Q-table)

Every optimal Q-table must satisfy the following equation (the Bellman equation).

(figure: Bellman equation and the optimal policy)
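
Written out (the standard form of that equation, reconstructed here since the slide is not reproduced), together with the greedy policy it induces:

```latex
% Bellman optimality equation (deterministic-transition form; in general an expectation over s')
Q^{*}(s, a) = r(s, a) + \gamma \max_{a'} Q^{*}(s', a')

% The optimal policy picks the action with the highest Q-value in each state
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```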

The policy π above is the decision-making part: it tells us which action to take in each state.

Why is deep learning helpful?

Because for some applications the Q-table is far too large to store (for Go it would be roughly a 10^170 × (19 × 19) matrix: one row per state, one column per action).

(figure: replacing the Q-table with a neural network)

Deep Q-learning

(figure: the deep Q-network setup)

We don't have labels, so this is set up as a regression problem, and we use an L2 loss.

The labels are moving targets: we build them with the Bellman equation. We start with a guess and train toward it; this gives a better Q-network, which in turn produces better guesses, and we repeat this iteratively.

We recompute the Bellman target at every step.

Learning with deep RL relies on a few hacks. One hack is to fix the Q on the right-hand side of the equation below (a fixed target network); otherwise the network keeps chasing a moving target and never settles. We keep this Q fixed for, say, 100,000 or 1,000,000 iterations, then update it and repeat.

(figure: Q-learning update with a fixed target network)

episode = one game, from start to end

The hardest part to understand, and the part that differs from ordinary networks, is that we forward-propagate twice: once for the state s and once for the next state s'. This is because the Bellman equation is recursive.

(figure: two forward passes, one for s and one for s')
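
A minimal PyTorch-style sketch of this loss computation (the names `q_net`, `target_net`, and `dqn_loss` are hypothetical, and discrete actions are assumed):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN loss computation: two forward passes, one for s and one for s'."""
    s, a, r, s_next, done = batch        # mini-batch of transitions, as tensors

    # Forward pass 1: Q(s, a) for the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Forward pass 2: max over a' of Q_target(s', a'), with the target network held fixed
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        # At a terminal step the target is just the reward (no bootstrapped term)
        target = r + gamma * q_next * (1.0 - done)

    # L2 (mean squared error) loss between the prediction and the Bellman target
    return F.mse_loss(q_sa, target)
```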

Application of Deep Q-Network: Breakout (Atari)

Goal: destroy all of the bricks.

(figure: Breakout game screen)

After many iterations the network discovers a trick: it digs a tunnel through the bricks and finishes the game faster. It finds this on its own, without supervision.

We could feed the network the position of everything (the feature-based approach). Another approach is to feed it the raw pixels, hand it the controller, and let it play the game the way we do!

We cannot give the network a single frame and ask it to generalize, because a single frame throws away the motion information (e.g., which way the ball is moving). So we give it a short clip (4 frames, for example) to make its prediction.

Preprocessing (see the sketch after this list):

  • grayscale
  • cropping
  • history of 4 frames
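
A rough sketch of these preprocessing steps in Python/NumPy (the exact crop and the 84 × 84 output size are assumptions, not taken from the lecture):

```python
import numpy as np
from collections import deque

def preprocess(frame, history, size=(84, 84)):
    """Grayscale + crop + downsample one RGB frame and append it to a 4-frame history."""
    gray = frame.mean(axis=2)                      # grayscale: average the RGB channels
    cropped = gray[34:194, :]                      # crop away the score bar (Atari-style, assumed)
    # nearest-neighbour downsample to `size` (avoids an image-library dependency)
    ys = np.linspace(0, cropped.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, cropped.shape[1] - 1, size[1]).astype(int)
    small = cropped[np.ix_(ys, xs)] / 255.0
    history.append(small)                          # keep only the last 4 frames
    return np.stack(history, axis=0)               # shape (4, 84, 84): the network input

# usage: history = deque(maxlen=4), initialised with 4 copies of the first frame
```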

The architecture is a convolutional network (see the sketch below).
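
A small convolutional Q-network sketch in PyTorch (layer sizes are illustrative, not necessarily the lecture's exact architecture):

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network: 4 stacked 84x84 frames in, one Q-value per action out."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),   # 9x9 spatial size after the two convs
            nn.Linear(256, n_actions),
        )

    def forward(self, x):
        return self.head(self.conv(x))
```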

Some tricks to make learning work better:

  • Keep track of terminal steps (at a terminal step the target is just the reward, so the loss changes there).
  • Experience replay: reuse past experiences, since otherwise some states might rarely be seen again.
  • Epsilon-greedy action selection (exploration/exploitation trade-off).

(figure: training tricks)

We create an experience replay database (replay memory) containing all of our experiences and train on mini-batches sampled from it. Benefits:

  • data efficiency (each experience can be reused over many epochs)
  • decorrelated experiences
  • trading off exploration / exploitation

We are now training on samples from the replay memory, not only on the current experience we are having.
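
A minimal replay-memory sketch (the class name `ReplayMemory`, the capacity, and the batch size are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores past transitions and samples decorrelated mini-batches from them."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)       # oldest experiences are dropped when full

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # uniform random sample
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done
```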

exploration vs exploitation

If we always take the best (greedy) action, we may never discover better ones, so we should also explore.

For example: explore 5% of the time, otherwise exploit the current knowledge.
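
A sketch of epsilon-greedy action selection (using the 5% exploration rate mentioned above as the default):

```python
import random

def epsilon_greedy(q_values, epsilon=0.05):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    n_actions = len(q_values)
    if random.random() < epsilon:
        return random.randrange(n_actions)                        # explore
    return max(range(n_actions), key=lambda a: q_values[a])       # exploit
```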

The final pseudocode, with all the tricks and hacks:

(figure: full DQN training pseudocode)
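
Since the pseudocode slide is not reproduced here, a rough Python sketch of the full loop, tying together the hypothetical pieces sketched above (`ReplayMemory`, `epsilon_greedy`, `dqn_loss`); the `env.reset()` / `env.step()` interface is also an assumption:

```python
import copy
import numpy as np
import torch

def train_dqn(env, q_net, episodes=10_000, target_update=10_000,
              gamma=0.99, epsilon=0.05, batch_size=32):
    """Rough DQN loop: act epsilon-greedily, store transitions in replay memory,
    learn from random mini-batches, and periodically refresh the target network."""
    target_net = copy.deepcopy(q_net)            # fixed copy used for the Bellman targets
    memory = ReplayMemory()
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
    step = 0
    for _ in range(episodes):                    # one episode = one game from start to end
        s, done = env.reset(), False             # assumed API: reset() -> preprocessed state stack
        while not done:
            with torch.no_grad():
                q_values = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))[0]
            a = epsilon_greedy(q_values.tolist(), epsilon)
            s_next, r, done = env.step(a)        # assumed API: step(a) -> (next state, reward, done)
            memory.push(s, a, r, s_next, float(done))
            s = s_next
            step += 1
            if len(memory.buffer) >= batch_size:
                raw = memory.sample(batch_size)
                batch = [torch.as_tensor(np.stack(x), dtype=torch.float32) for x in raw]
                batch[1] = batch[1].long()       # actions must be integer indices for gather()
                loss = dqn_loss(q_net, target_net, batch, gamma)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if step % target_update == 0:        # the "fixed Q" hack: refresh the target network
                target_net.load_state_dict(q_net.state_dict())
```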

With and without human knowledge

There is a huge difference between having a human in versus out of the learning loop.

Some games rely on analogies that humans understand but machines don't. Games that require a long-term strategy to win are also hard for DQN.

Advanced Topics

Before AlphaGo:

  • Tree searching
  • ...

Calculating a score for each board position (from the image of the board).

Two AIs play against each other: a fixed one and a dynamic one that tries to beat it. If the dynamic one wins N games in a row, it becomes the new fixed one, a copy of it becomes the new dynamic one, and the process repeats.

Policy gradients:

These methods update the policy function directly at each step (rather than a Q-function).

meta learning

Train on many similar tasks; then very few updates should be enough to adapt to the specific task we want.

imitation learning

Defining the reward is hard, so humans come to the rescue: we try to imitate a human instead.