Title: Playing Atari with Deep Reinforcement Learning

Author: Mnih et al. (2013)

General Content:

Introduces the first CNN architecture that learns to play Atari games fully end-to-end from raw pixels plus reward and termination signals. To make this sample-efficient, the authors train the CNN with experience replay, frame concatenation, dimensionality reduction (gray-scaling + downsampling), frame skipping and reward clipping.

Key take-away for own work:

Even though the main aim is to introduce a general learning algorithm with the least amount of inductive bias, there are still many elements which are somewhat hand-crafted and over-fitted to the application at hand. TD-Gammon was invented roughly 20 years before the DQN breakthrough - those "hacks", not the core learning algorithm, are what made DRL really take off.

Keypoints:

  • Problem of transferring DL breakthroughs to RL: data distribution and sample efficiency. Most DL assumes independent (i.i.d.) data samples. In RL the data-generating process is driven by the agent's behavior, which is itself what is being learned. Hence, the data distribution is non-stationary!

  • Solution: Experience Replay. Uniform sampling from the buffer breaks up the sequentiality of the transition tuples and reduces gradient variance, allows sample-efficient reuse of transitions, and smooths the training distribution over many past behaviors. The buffer has a maximum size, so old experiences are eventually overwritten - which can discard rare transitions from the random exploration at the beginning of learning. A minimal buffer sketch follows this bullet.
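
A minimal replay-buffer sketch (class and method names are my own, not from the paper; capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity=100_000):
        # deque with maxlen silently overwrites the oldest transitions once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions and lets each transition be reused many times.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```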

  • Frame stacking - a different view of the state. A state is defined as the concatenated observation-action sequence, s_{t+1} = s_t || (a_t, x_{t+1}) with x the raw observation (written out below). This yields a large but finite MDP in which every sequence is a distinct state.
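
Written out, the state used in the paper is the full observation/action history up to time t:

```latex
% State as defined in the paper: the whole sequence of observations x_i and actions a_i
s_t = x_1, a_1, x_2, a_2, \ldots, a_{t-1}, x_t
\qquad\Rightarrow\qquad
s_{t+1} = \left(s_t, a_t, x_{t+1}\right)
```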

  • Bellman recursion intuition: if the optimal value Q*(s', a') of the next state were known for all actions a', the optimal strategy is to select the action a' maximising the expected bootstrapped value r + γ·Q*(s', a') (equation below).
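
The Bellman optimality equation this intuition refers to, as stated in the paper:

```latex
Q^{*}(s, a) = \mathbb{E}_{s' \sim \mathcal{E}} \left[ r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right]
```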

  • Problem with classical value iteration: it estimates the value function separately for each sequence, with no generalization - enter function approximation. Train via SGD on batches of transitions sampled from the "behavior distribution" - off-policy learning is necessary with an ER buffer. With single online updates this reduces to familiar Q-Learning (update sketch below).
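
A hedged sketch of one such SGD step on the squared TD error (PyTorch; names and hyperparameters are mine; note the 2013 paper bootstraps from the same network, there is no separate target network yet):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (illustrative)

def dqn_update(q_net, optimizer, batch):
    """One SGD step on the squared TD error. The bootstrap target uses the
    same network - the 2013 paper has no separate target network."""
    states, actions, rewards, next_states, dones = batch
    # states/next_states: float (B, 4, 84, 84); actions: long (B,);
    # rewards/dones: float (B,)

    # Q(s, a) for the actions actually taken
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target r + gamma * max_a' Q(s', a'), zeroed after terminal states
    with torch.no_grad():
        next_max_q = q_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_max_q * (1.0 - dones)

    loss = F.mse_loss(q_taken, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```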

  • Previous Work:

    • TD-Gammon (Tesauro, > 20 years earlier): original DRL work approximating V with a single-hidden-layer perceptron. Did not generalize beyond backgammon.
    • Neural Fitted Q-Learning (NFQ): batch updates and autoencoder-based representation learning - not end-to-end!
    • Don't feed the action as a separate input node to the network; instead use one output node per action, so a single forward pass yields Q-values for all actions.
  • Preprocessing (sketch after this list) - raw Atari frames start at (210x160x3):

    1. Gray-scale + Downsample (110x84x1)
    2. Crop to Square for 2D Conv (84x84)
    3. Stack four subsequent frames (84x84x4)
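
A hedged sketch of that preprocessing pipeline (OpenCV/NumPy as stand-ins; the exact crop offset is not in my notes, so it is illustrative):

```python
import collections
import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """(210, 160, 3) uint8 RGB frame -> (84, 84) grayscale crop."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)                  # (210, 160)
    small = cv2.resize(gray, (84, 110), interpolation=cv2.INTER_AREA)   # (110, 84)
    cropped = small[18:102, :]  # illustrative crop to (84, 84), roughly the playing area
    return cropped

class FrameStack:
    """Keeps the last four preprocessed frames as the (84, 84, 4) network input."""

    def __init__(self, k=4):
        self.frames = collections.deque(maxlen=k)

    def reset(self, frame_rgb):
        f = preprocess_frame(frame_rgb)
        for _ in range(self.frames.maxlen):
            self.frames.append(f)
        return np.stack(self.frames, axis=-1)

    def step(self, frame_rgb):
        self.frames.append(preprocess_frame(frame_rgb))
        return np.stack(self.frames, axis=-1)
```
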
  • CNN Architecture (sketch after this list):

    1. In: (84x84x4)
    2. Hidden 1: 2D Conv, 16 8x8 filters, stride 4, ReLU
    3. Hidden 2: 2D Conv, 32 4x4 filters, stride 2, ReLU
    4. Fully Connected: 256 units, ReLU
    5. Fully Connected: one linear output unit per action (a Q-value for each action)
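
A hedged PyTorch reconstruction of this architecture (the paper predates PyTorch; module and variable names are mine):

```python
import torch
import torch.nn as nn

class DQN2013(nn.Module):
    """Conv net from the 2013 paper: (4, 84, 84) in, one Q-value per action out."""

    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # (4, 84, 84) -> (16, 20, 20)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> (32, 9, 9)
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # linear output: Q(s, a) for every action
        )

    def forward(self, x):
        # x: float tensor of shape (batch, 4, 84, 84); transpose channel-last frame
        # stacks to channel-first and scale to [0, 1] before calling.
        return self.net(x)
```

A forward pass on a (batch, 4, 84, 84) tensor returns (batch, num_actions) Q-values - one output node per action, as noted under Previous Work.
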
  • Additional tricks: reward clipping to {-1, 0, +1}, RMSProp, batch size 32, frame skipping (only select a new action every 4th frame and repeat it in between), linear epsilon decay from 1.0 to 0.1 over the first 1 million frames (sketch below).
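
A hedged sketch of the linear epsilon schedule and the reward clipping (function names are mine):

```python
import numpy as np

def epsilon(frame_idx, eps_start=1.0, eps_end=0.1, decay_frames=1_000_000):
    """Linear decay from eps_start to eps_end over the first decay_frames, then constant."""
    frac = min(frame_idx / decay_frames, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def clip_reward(r):
    """Clip positive rewards to +1, negative rewards to -1, leave 0 unchanged."""
    return float(np.sign(r))
```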

  • Uses an interesting learning metric: collect a fixed set of states by running a random policy before training, then, at different points during training, compute the max over actions of the predicted Q-value for each saved state and average over the set (sketch below).
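
A hedged sketch of that metric (held-out states collected once with a random policy; q_net as in the sketches above):

```python
import torch

@torch.no_grad()
def average_max_q(q_net, heldout_states):
    """Average of max_a Q(s, a) over a fixed set of states from a random policy."""
    q = q_net(heldout_states)            # (num_states, num_actions)
    return q.max(dim=1).values.mean().item()
```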

Questions/Critical Observations:

  • No target network is introduced yet. One has to rely on the average max Q-value over held-out random states to visualize stable learning.

  • How robust to input perturbation or noise/blur injection?

  • Is it possible to generalize between games at all? Transfer learning?

  • How to identify the number of frames to stack in order to achieve Markovity / an MDP? Is there a general method to ensure Markovity?

  • What happens if rewards are clipped differently or more expressive than {-1, 0, 1}?

  • What happens if we randomly shuffle the 4 frames in their order?

  • Is # frames used in skipping related to number of frames concatenated?