Step into the AI Era: Deep Reinforcement Learning Workshop

In this workshop, through exercises, we will learn about (Deep) Reinforcement Learning and how to implement different strategies to train an agent to solve different tasks (or play games) in OpenAI Gym. To keep the environment consistent and to make use of a free GPU, we will use Google Colaboratory (a Google Account is needed).

What is Reinforcement Learning

Although it is also classified as machine learning, what makes reinforcement learning stand out is that labelled examples are not necessary for training, so it is not supervised learning. However, unlike unsupervised learning methods such as k-means clustering or anomaly detection, reinforcement learning takes a bottom-up approach rather than a top-down approach. By trying out different actions under different policies and recording the different outcomes (rewards), we train an agent that creates its own 'training data' from trials and 'learns' from it. Reinforcement learning is sometimes listed alongside supervised learning and unsupervised learning as one of the three basic machine learning paradigms. [1]

101 of Reinforcement Learning

First, we will go through the basics of reinforcement learning. Almost all problems we solve using reinforcement learning involve defining a set of agent states in the environment, a set of actions that the agent can take, and the rewards those actions can lead to. The basic framework for this is the Markov decision process (MDP).

Markov decision process

[Figure: Markov decision process]

To explain: at each state, the agent can take an action. Each action leads to one of several possible next states, each with its own probability, and each transition yields its own reward. [2]
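The description above can be sketched as a pair of arrays. This is a minimal illustration, assuming a made-up MDP with two states and two actions; the numbers in `P` and `R` and the helper `step` are hypothetical and do not come from the workshop notebooks.

```python
import numpy as np

# Hypothetical two-state, two-action MDP, used only for illustration.
# P[s, a, s2] is the probability of landing in state s2 after taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])
# R[s, a, s2] is the reward received for that transition.
R = np.array([
    [[0.0, 1.0], [0.0, 2.0]],
    [[-1.0, 0.0], [0.0, 0.5]],
])


def step(state, action, rng):
    """Sample the next state and the reward for taking `action` in `state`."""
    next_state = int(rng.choice(P.shape[2], p=P[state, action]))
    return next_state, float(R[state, action, next_state])


rng = np.random.default_rng(0)
state = 0
for t in range(5):
    action = int(rng.integers(2))  # a random policy, just to show the loop
    next_state, reward = step(state, action, rng)
    print(t, state, action, next_state, reward)
    state = next_state
```

Each row `P[s, a]` sums to 1, which is what makes it a probability distribution over next states.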

Finding the 'winning' policy

The goal of reinforcement learning is to find the policy, that is, the strategy for which series of actions to take, that gains the maximum reward possible. You may want to try brute force, which is to try every combination of actions and pick the best policy. But most of the time this will not work, as the number of policies can be very large, or even infinite. Practically speaking, we need other algorithms to pick the best policy (or the best one we have come across so far). We will introduce some of the popular ones in this workshop. We will also implement them in Python (with Keras and TensorFlow) to solve problems or play games in OpenAI Gym.
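To get a feeling for why brute force breaks down, here is a rough sketch that counts deterministic policies (one fixed action per state). The function name and the problem sizes are hypothetical and chosen only for illustration.

```python
from itertools import product


def count_deterministic_policies(n_states, n_actions):
    """A deterministic policy picks one action per state: n_actions ** n_states choices."""
    return n_actions ** n_states


# For a tiny problem we can still list every policy explicitly.
n_states, n_actions = 3, 2
all_policies = list(product(range(n_actions), repeat=n_states))
print(len(all_policies))   # 8
print(all_policies[:3])

# For a still modest problem the count is already far beyond enumeration.
n_big = count_deterministic_policies(500, 6)
print(len(str(n_big)), "digits")
```

Stochastic policies, which assign probabilities to actions, form a continuous set, so there the count is infinite.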

Crossentropy Method

The crossentropy method is considered a Monte Carlo method, as its mechanism involves trying different actions many times. It applies provided that:

  1. the MDP is finite
  2. sufficient memory is available
  3. problem is episodic
  4. after each episode, a new one starts fresh

Details and the mathematical explanation of the crossentropy method can be found on Wikipedia. To summarize, here is an overview of what we are going to do with the crossentropy method:

While it has not converged:

  1. Sample N policies from the current distribution
  2. Evaluate the N policies
  3. Choose the best m% of the policies
  4. Update the distribution according to the policies we have chosen
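The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's implementation. It assumes a policy is described by a parameter vector, the distribution is an independent Gaussian per parameter, and the loop runs for a fixed number of iterations instead of testing convergence. The function `toy_return` and its `target` are hypothetical stand-ins for "play an episode and report the total reward".

```python
import numpy as np


def cross_entropy_method(evaluate, dim, n_samples=100, elite_frac=0.2,
                         n_iters=50, seed=0):
    """Maximise `evaluate` over parameter vectors of length `dim`."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), 2.0 * np.ones(dim)
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(n_iters):
        # 1. Sample N candidate policies from the current distribution.
        samples = rng.normal(mean, std, size=(n_samples, dim))
        # 2. Evaluate each candidate.
        scores = np.array([evaluate(s) for s in samples])
        # 3. Keep the best m% (the "elite" candidates).
        elite = samples[np.argsort(scores)[-n_elite:]]
        # 4. Refit the distribution to the elite candidates.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-3  # small floor so sampling never collapses
    return mean


# Hypothetical stand-in for an episode return: higher is better, best at `target`.
target = np.array([1.0, -2.0, 0.5])


def toy_return(theta):
    return -np.sum((theta - target) ** 2)


best = cross_entropy_method(toy_return, dim=3)
print(best)
```

In a real task, `evaluate` would run one or more episodes in the environment with the sampled policy and return the total reward.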


  • Crossentropy Method Open In Colab

  • Deep Crossentropy Method Open In Colab

Model-free Model

So far we have assumed that we know exactly what will happen when we take a certain action in the current state, that is, the reward and the next state for each state-action pair. What if we are not sure (which is the case in most real-life situations) and can only estimate the expected reward Qπ(s,a) of a state-action pair statistically? This is where the Model-free Model comes in. The differences are:

Model-based: you know P(s'|s,a)

  • can apply dynamic programming
  • can plan ahead

Model-free: you can sample trajectories

  • can try stuff out
  • insurance not included

To find the expectation, there are 2 strategies:

1: Monte-Carlo

In this method, the whole sampled 'path' of playing the game from start to finish is completed, and Q is estimated as the average over these complete paths. This method is less reliant on the Markov property.
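A minimal sketch of this averaging follows. It assumes each episode is stored as a list of `(state, action, reward)` tuples and that future rewards are discounted by a factor `gamma`. Both the storage format and the discount are assumptions made for this example.

```python
from collections import defaultdict


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    running, out = 0.0, []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def monte_carlo_q(episodes, gamma=0.99):
    """Every-visit Monte Carlo estimate: average the observed return per (state, action)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        returns = discounted_returns([r for _, _, r in episode], gamma)
        for (state, action, _), ret in zip(episode, returns):
            totals[(state, action)] += ret
            counts[(state, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Two hypothetical one-step episodes from the same state-action pair.
episodes = [[("s0", "right", 1.0)], [("s0", "right", 3.0)]]
print(monte_carlo_q(episodes, gamma=1.0))
```

Note that nothing is updated until an episode has finished, which is why this approach needs episodic problems.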

2: Temporal Difference

In this method, the recurrent formula for Q is used, and the agent learns from partial trajectories (learning on the go). It is great for infinite MDPs and needs less experience to learn.
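As a rough sketch of learning from a single transition, the update below moves the current estimate a small step towards `reward + gamma * next_value`. The learning rate `alpha`, the discount `gamma`, and the function name are assumptions for illustration. How `next_value` is obtained depends on the algorithm.

```python
def td_update(q_value, reward, next_value, alpha=0.1, gamma=0.99, done=False):
    """One temporal-difference step for a single state-action value."""
    # If the episode ended, there is no future value to bootstrap from.
    target = reward + (0.0 if done else gamma * next_value)
    return q_value + alpha * (target - q_value)


# Hypothetical numbers: current estimate 0.0, reward 1.0, next state valued at 10.0.
print(td_update(0.0, 1.0, 10.0, alpha=0.5, gamma=0.9))
```

Because the update needs only one step of experience, it can be applied while the episode is still running.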

Cliff World: Q-learning vs SARSA

This comparison is sometimes also referred to as off-policy vs on-policy. The difference between Q-learning and SARSA is:

on-policy (e.g. SARSA)

  • Agent can pick actions
  • Agent always follows its own policy

off-policy (e.g. Q-learning)

  • Agent can't pick actions
  • Learning with exploration, playing without exploration
  • Learning from expert (expert is imperfect)
  • Learning from sessions (recorded data)

One famous example is the Cliff World:

[Figure: Cliff World]

As you can see, in theory, if the agent always picks the optimal path, as the greedy policy learned by Q-learning (off-policy) does, it will always pick the lower path. However, during training, the epsilon-greedy "exploration" (with probability ε take a random action; otherwise, take the best-known action) can easily make the agent fail on that path, as one step downwards pushes it into the gutter (a reward of -10). Q-learning values the lower path as if no exploration happened, so the agent keeps falling while it trains. In this case, SARSA (on-policy) is more desirable, as it learns the best rewards obtainable under the policy it actually follows. For the path at the bottom, the expected reward for each tile is therefore low, as it also counts the risk of stepping into the gutter (by mistake or exploration).
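The difference comes down to which next-state value each algorithm bootstraps from. The sketch below assumes a tabular `Q` array indexed as `Q[state, action]`. The function names, the table values, and the discount `gamma` are hypothetical and only meant to illustrate the two targets.

```python
import numpy as np


def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon take a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))


def q_learning_target(Q, reward, next_state, gamma=0.99):
    """Off-policy: assume the best action will be taken in the next state."""
    return reward + gamma * np.max(Q[next_state])


def sarsa_target(Q, reward, next_state, next_action, gamma=0.99):
    """On-policy: use the action the agent actually takes next, exploration included."""
    return reward + gamma * Q[next_state, next_action]


# Hypothetical tile next to the cliff: action 0 walks on, action 1 steps into the gutter.
Q = np.array([[0.0, 0.0],
              [5.0, -10.0]])
print(q_learning_target(Q, reward=-1.0, next_state=1, gamma=1.0))            # 4.0
print(sarsa_target(Q, reward=-1.0, next_state=1, next_action=1, gamma=1.0))  # -11.0
```

When exploration happens to pick the bad action, SARSA's target drops while Q-learning's does not, which is why SARSA ends up rating the tiles next to the cliff lower.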


  • Cliff World Open In Colab

Experience Replay

In deep learning, the same set of data is used to train the model for many epochs. However, each piece of 'training data' we have generated so far is used only once. Sometimes a single play-through of the game takes a long time, so training this way is computationally expensive.

To improve this situation slightly, we can store the 'gaming' experience in a buffer. Then we can train on random subsamples of it, so we don't need to revisit the same (s,a) many times while playing the game in order to learn it. Also, note that this only works with off-policy algorithms.
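A buffer like this can be sketched with the standard library alone. The class name, the method names, and the transition layout `(state, action, reward, next_state, done)` are assumptions for this example rather than the notebook's exact interface.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest ones are dropped when it is full."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def __len__(self):
        return len(self.storage)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random batch as five tuples: states, actions, rewards, next states, dones."""
        batch = random.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones


# Hypothetical transitions: states are plain integers here.
buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
print(len(buffer))
print(buffer.sample(2))
```

Sampling at random also breaks up the strong correlation between consecutive steps of the same episode.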

Approximate Q-learning and Deep Q-Network

The state space can be large, and sometimes continuous. So, much like what we did to turn the Crossentropy Method into the Deep Crossentropy Method, we can approximate the agent with a function and learn the Q values using a neural network. This is what we will do in the following exercise.
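To show the idea without a deep learning framework, the sketch below swaps the neural network for a linear model in NumPy: Q(s, a) is approximated as a dot product between a weight vector per action and a feature vector for the state. The class `LinearQ`, its methods, and the hyperparameters are hypothetical. The exercise itself uses a neural network instead.

```python
import numpy as np


class LinearQ:
    """Approximate Q(s, a) as W[a] . features(s) instead of storing a table."""

    def __init__(self, n_features, n_actions):
        self.W = np.zeros((n_actions, n_features))

    def q_values(self, features):
        """Return one Q estimate per action for the given state features."""
        return self.W @ features

    def update(self, features, action, reward, next_features, done,
               alpha=0.1, gamma=0.99):
        """One Q-learning step on the weights; returns the TD error."""
        target = reward
        if not done:
            target += gamma * np.max(self.q_values(next_features))
        td_error = target - self.q_values(features)[action]
        # Only the weights of the action that was taken are adjusted.
        self.W[action] += alpha * td_error * features
        return td_error


# Hypothetical two-feature state and a terminal transition with reward 1.0.
model = LinearQ(n_features=2, n_actions=2)
s, s_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(model.update(s, action=0, reward=1.0, next_features=s_next, done=True, alpha=0.5))
print(model.q_values(s))
```

A neural network plays the same role as `W` here, with the weight update done by gradient descent on the squared TD error.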

The famous DQN paper, in which an agent learns to play Atari games such as Breakout, was published by Google DeepMind in 2015. The design involves stacking 4 frames together, so that the agent can 'see' the movement of the ball, and using a CNN as the agent. We will try implementing it in the last exercise. Before that, feel free to check out the video of how a fully trained agent plays the game.
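Frame stacking itself is simple to sketch. The helper below keeps the most recent 4 frames and stacks them along a last "channel" axis, so that motion between frames becomes visible in a single observation. The class name, the tiny 8x8 frame size, and the channel-last layout are assumptions for illustration.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keep the most recent frames and expose them as one stacked observation."""

    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        """At the start of an episode, fill the stack with copies of the first frame."""
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.observation()

    def push(self, frame):
        """Add the newest frame; the oldest one falls out automatically."""
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Shape: (height, width, n_frames), oldest frame first.
        return np.stack(list(self.frames), axis=-1)


# Hypothetical 8x8 greyscale frames.
stack = FrameStack(n_frames=4)
obs = stack.reset(np.zeros((8, 8)))
obs = stack.push(np.ones((8, 8)))
print(obs.shape)
```

The stacked array is what a CNN would take as input, with the frames playing the role of image channels.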


  • DQN Open In Colab

Credit: Big thanks to the Yandex School of Data Analysis, on whose materials most of the content of this workshop is based.

