Sources:
* [Analytics Vidhya - A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python](https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/)
* [Analytics Vidhya - Simple Beginner's guide to Reinforcement Learning & its implementation](https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/)

# Importing the required libraries

## Install keras-rl library

In [5]:
# git clone https://github.com/matthiasplappert/keras-rl.git
# cd keras-rl
# python setup.py install

## Install dependencies for CartPole environment

In [6]:
# pip install h5py
# pip install gym

## Import libraries

In [1]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

import matplotlib.pyplot as plt
%matplotlib inline
from IPython import display

Using TensorFlow backend.


# Building the environment

Environments preloaded into gym:
* [FrozenLake-v0](https://gym.openai.com/envs/FrozenLake-v0/)
* [Taxi-v3](https://gym.openai.com/envs/Taxi-v3/)
* [CartPole-v1](https://gym.openai.com/envs/CartPole-v1/)

In [3]:
env_name = 'CartPole-v1'

env = gym.make(env_name)
env.seed(123)

[123]

# Model

In [4]:
n_actions = env.action_space.n

Next, we build a simple neural network with a single hidden layer. It takes the 4-dimensional CartPole observation as input and outputs one Q-value per action.

In [5]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(n_actions))
model.add(Activation('linear'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
_________________________________________________________________
activation_2 (Activation)    (None, 2)                 0         
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
None
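
As a quick sanity check on the summary above, the parameter counts can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is only illustrative and is not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Sizes taken from the model summary: 4 observations, 16 hidden units, 2 actions.
obs_dim, hidden_units, n_outputs = 4, 16, 2

layer_params = [
    dense_params(obs_dim, hidden_units),    # dense_1
    dense_params(hidden_units, n_outputs),  # dense_2
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

The Flatten and Activation layers contribute no parameters, which is why the total matches the two Dense layers alone.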


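Next, we configure and compile the agent. We use an epsilon-greedy policy, which usually picks the action with the highest estimated Q-value but occasionally picks a random action so that the agent keeps exploring. We use `SequentialMemory` as the replay memory so that the agent can store the actions it performed and the rewards it received, and later sample them for training.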
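
To make the two ideas above concrete, here is a minimal standard-library sketch of epsilon-greedy action selection and a size-limited replay memory. It is not the keras-rl implementation: the names `eps_greedy_action` and `ReplayBuffer` are hypothetical, and the `eps` value is only illustrative.

```python
import random
from collections import deque

def eps_greedy_action(q_values, eps=0.1, rng=random):
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

class ReplayBuffer:
    """Hypothetical stand-in for a replay memory with a fixed size limit."""

    def __init__(self, limit):
        # A bounded deque drops the oldest transition once the limit is hit.
        self.buffer = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniformly sample transitions without replacement.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: store a few made-up transitions, then draw a training batch.
memory_sketch = ReplayBuffer(limit=3)
for step in range(5):
    memory_sketch.append(step, 0, 1.0, step + 1, False)
batch = memory_sketch.sample(2)
greedy = eps_greedy_action([0.1, 0.9], eps=0.0)
```

Sampling past transitions at random, rather than training on them in the order they occurred, is what lets the agent reuse experience and reduces the correlation between consecutive training examples.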

In [8]:
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)

In [9]:
dqn = DQNAgent(
    model=model, nb_actions=n_actions,
    memory=memory, nb_steps_warmup=10,
    target_model_update=1e-2, policy=policy
)
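
As far as the keras-rl source indicates, a `target_model_update` value below 1 is interpreted as a soft-update rate: after each training step the target network's weights move a small fraction of the way toward the online network's weights. The sketch below illustrates that rule on plain Python lists; `soft_update` and `tau` are illustrative names, not keras-rl API.

```python
def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the online weight."""
    return [
        (1.0 - tau) * t + tau * w
        for t, w in zip(target_weights, online_weights)
    ]

# Made-up weights, only to show the size of one update step.
target_w = [0.0, 1.0]
online_w = [1.0, 1.0]
target_w = soft_update(target_w, online_w)
print(target_w)
```

With `tau = 1e-2` the target network changes slowly, which keeps the Q-learning targets relatively stable between updates.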

In [10]:
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

In [11]:
# Okay, now it's time to learn something!
# Rendering the environment during training slows it down considerably, so visualization is turned off here.
dqn.fit(env, nb_steps=10000, visualize=False, verbose=1)

Training for 10000 steps ...
Interval 1 (0 steps performed)
done, took 26.724 seconds


<keras.callbacks.History at 0x135c12110>

In [12]:
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: 114.000, steps: 114
Episode 2: reward: 125.000, steps: 125
Episode 3: reward: 145.000, steps: 145
Episode 4: reward: 121.000, steps: 121
Episode 5: reward: 117.000, steps: 117


<keras.callbacks.History at 0x136034510>
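
To put the test run in context, the episode rewards printed above can be summarized in a few lines. The rewards are copied from the output; the 500-step cap is the episode limit of `CartPole-v1`, so this is a rough measure of how far the agent is from the maximum after 10000 training steps.

```python
# Rewards copied from the five test episodes above.
test_rewards = [114.0, 125.0, 145.0, 121.0, 117.0]
max_episode_reward = 500.0  # CartPole-v1 ends an episode after 500 steps

mean_reward = sum(test_rewards) / len(test_rewards)
fraction_of_max = mean_reward / max_episode_reward
print(f"mean reward: {mean_reward:.1f} ({fraction_of_max:.0%} of the maximum)")
```

Results vary from run to run, so training for more steps or with different hyperparameters would likely change these numbers.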