# Reinforcement Learning - Cartpole Problem

- Giving credit where credit is due: https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/

In [2]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Using TensorFlow backend.


### Exploration vs Exploitation :

These are two competing strategies an agent can follow when solving a Reinforcement Learning problem.

To make the explanation easier, imagine you are in Vegas, standing in front of many, many slot machines, and you want to maximise your return. You can sit down in front of one machine, and one machine only, and pull its lever all day long. This is known as the __PURE EXPLOITATION APPROACH__.
If you do the exact opposite and pull the lever of every slot machine in the casino, one after the other, you are going for the __PURE EXPLORATION APPROACH__. In practice, an agent has to balance the two.
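
The slot machine picture can be simulated in a few lines. This is only a minimal sketch: the win probabilities and the `run_bandit` helper are made up for illustration, and the rule used to mix the two approaches (explore with probability `epsilon`, otherwise exploit) is one common choice, not the only one.

```python
import numpy as np

def run_bandit(epsilon, payout_probs, n_pulls=10_000, seed=0):
    """Play a row of slot machines with an epsilon-greedy rule.

    epsilon=0.0 is pure exploitation, epsilon=1.0 is pure exploration.
    Returns the average payout per pull.
    """
    rng = np.random.default_rng(seed)
    n_machines = len(payout_probs)
    pulls = np.zeros(n_machines)      # how often each machine was played
    estimates = np.zeros(n_machines)  # running mean payout per machine
    total = 0.0
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            machine = int(rng.integers(n_machines))  # explore: any machine
        else:
            machine = int(np.argmax(estimates))      # exploit: best so far
        payout = float(rng.random() < payout_probs[machine])
        pulls[machine] += 1
        # Incremental update of the mean payout seen on this machine
        estimates[machine] += (payout - estimates[machine]) / pulls[machine]
        total += payout
    return total / n_pulls

payout_probs = [0.2, 0.5, 0.8]  # hypothetical win probabilities
for eps in (0.0, 0.1, 1.0):
    print(f"epsilon={eps}: average payout {run_bandit(eps, payout_probs):.3f}")
```

With `epsilon=0.0` the player never leaves the first machine, and with `epsilon=1.0` the player never uses what was learned, so a small value in between should usually earn more than either extreme.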

### Markov Decision process :

A Markov Decision Process (MDP) is a mathematical framework for formulating a reinforcement learning task.
The following are the elements used to build this mathematical framework:
- Set of states, S
- Set of actions, A
- Reward function, R
- Policy, π
- Value, V

Taking an action (A) moves us from a starting state (S) to a new state (S'). For the actions we take, we gain rewards (R) or are punished for bad moves. The way we choose an action in each state is our policy (π), and the sum of the rewards and punishments collected while following it is our value (V).

Keeping this in mind, at any given time (t) we want to pick the policy that maximises the expected reward for every possible state S:

$$E(R_t \mid \pi_t, S_t)$$
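
To make these elements concrete, here is a small sketch of a toy problem that is not part of the Cartpole setup: five positions on a line, where reaching the right end pays +1 and reaching the left end pays -1. The states, the `slip` probability, and all function names are assumptions chosen for illustration. The expected reward of a policy is estimated by simply averaging over many simulated episodes.

```python
import random

STATES = range(5)    # S: positions 0..4 on a line
TERMINAL = (0, 4)    # reaching either end finishes the episode
ACTIONS = (-1, +1)   # A: step left or step right

def reward(state):
    """R: +1 for reaching the right end, -1 for the left end, 0 otherwise."""
    return {0: -1.0, 4: 1.0}.get(state, 0.0)

def rollout(policy, start, rng, slip=0.2):
    """Follow `policy` from `start` and return the summed reward of one episode."""
    state, total = start, 0.0
    while state not in TERMINAL:
        action = policy[state]
        if rng.random() < slip:  # the environment sometimes reverses the move
            action = -action
        state += action
        total += reward(state)
    return total

def estimate_value(policy, start, episodes=5000, seed=0):
    """Monte Carlo estimate of the expected reward for a policy and start state."""
    rng = random.Random(seed)
    return sum(rollout(policy, start, rng) for _ in range(episodes)) / episodes

# Two policies (pi): a fixed action for every non-terminal state
always_right = {s: +1 for s in (1, 2, 3)}
always_left = {s: -1 for s in (1, 2, 3)}

print("always right:", estimate_value(always_right, start=2))
print("always left: ", estimate_value(always_left, start=2))
```

The policy that heads for the +1 end should come out with the clearly higher estimate, which is exactly the comparison the expectation above expresses.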

### Setting the relevant variables :

In [3]:
# Uncomment the next two lines to list all the pre-built environments the gym library has to offer.
#from gym import envs
#print(envs.registry.all())

ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions available in the Cartpole problem

env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

[2017-10-11 12:54:58,233] Making new env: CartPole-v0


### Building a simple neural network model with two hidden layers :

In [8]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
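model.add(Dense(32)) # hidden layer 1 (no activation follows it, so it stays linear)
model.add(Dense(64)) # hidden layer 2
model.add(Activation('relu')) # ReLU is applied to the output of hidden layer 2 only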
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_4 (Dense)              (None, 32)                160       
_________________________________________________________________
dense_5 (Dense)              (None, 64)                2112      
_________________________________________________________________
activation_3 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 130       
_________________________________________________________________
activation_4 (Activation)    (None, 2)                 0         
Total params: 2,402
Trainable params: 2,402
Non-trainable params: 0
_________________________________________________________________
None
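
The parameter counts in the summary above can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 4 observation values -> 32 units -> 64 units -> 2 actions
layer_sizes = [4, 32, 64, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [160, 2112, 130]
print(sum(counts))  # 2402
```

The Flatten and Activation layers contribute no parameters, which is why they show 0 in the table.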


### Configuring and Compiling our agent :

- Policy = Epsilon Greedy (take a random action with probability epsilon, otherwise take the action with the highest predicted Q-value)
- Memory = Sequential (stores the actions we performed and the reward we got for each one, so the agent can learn from them later)
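
The two ideas in the list above can be sketched in plain NumPy. This is not how keras-rl implements `EpsGreedyQPolicy` or `SequentialMemory` internally; `eps_greedy_action` and `SimpleReplayMemory` are hypothetical stand-ins, and the values of `eps` and `limit` are illustrative.

```python
from collections import deque

import numpy as np

def eps_greedy_action(q_values, eps, rng):
    """Pick a random action with probability eps, else the highest-valued one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

class SimpleReplayMemory:
    """Bounded buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, limit):
        self.buffer = deque(maxlen=limit)  # oldest entries drop out when full

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        """Return batch_size distinct stored transitions chosen at random."""
        idx = rng.choice(len(self.buffer), size=batch_size, replace=False)
        return [self.buffer[int(i)] for i in idx]

rng = np.random.default_rng(0)
memory = SimpleReplayMemory(limit=3)
for step in range(5):
    # Pretend the network predicted these Q-values for the two Cartpole actions
    action = eps_greedy_action(np.array([0.1, 0.7]), eps=0.1, rng=rng)
    memory.append((step, action, 1.0, step + 1, False))

print(len(memory.buffer))  # 3: only the newest three transitions remain
print(memory.sample(2, rng))
```

Sampling past transitions at random, instead of training only on the latest one, is the reason the agent needs a memory at all.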

In [5]:
policy = EpsGreedyQPolicy() 
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=100,
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

### Show time !

To make your project as visual and clear as possible, you can set __visualize = True__ so that the training process is rendered on screen. Keep in mind, however, that this slows training down quite a lot.

In [6]:
try:
    dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)
except KeyboardInterrupt:
    # Allow training to be stopped early by hand without a traceback
    pass

Training for 5000 steps ...
    9/5000: episode: 1, duration: 0.083s, episode steps: 9, steps per second: 109, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.162 [-2.882, 1.749], loss: --, mean_absolute_error: --, mean_q: --
   21/5000: episode: 2, duration: 0.006s, episode steps: 12, steps per second: 1902, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.917 [0.000, 1.000], mean observation: -0.084 [-3.008, 2.002], loss: --, mean_absolute_error: --, mean_q: --
   33/5000: episode: 3, duration: 0.007s, episode steps: 12, steps per second: 1653, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.833 [0.000, 1.000], mean observation: -0.089 [-2.561, 1.619], loss: --, mean_absolute_error: --, mean_q: --
   43/5000: episode: 4, duration: 0.009s, episode steps: 10, steps per second: 1141, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.0

In [7]:
dqn.test(env, nb_episodes=5, visualize=False)

Testing for 5 episodes ...
Episode 1: reward: 9.000, steps: 9
Episode 2: reward: 10.000, steps: 10
Episode 3: reward: 20.000, steps: 20
Episode 4: reward: 9.000, steps: 9
Episode 5: reward: 9.000, steps: 9


<keras.callbacks.History at 0x1230cd8d0>
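
One hedged way to read these test results is to average the five episode rewards and compare them with the score Gym registers as "solved" for CartPole-v0 (an average reward of 195, with episodes capped at 200 steps). The rewards below are copied by hand from the output above; the variable names are made up for this sketch.

```python
test_rewards = [9.0, 10.0, 20.0, 9.0, 9.0]  # copied from the test output above
mean_reward = sum(test_rewards) / len(test_rewards)

SOLVED_THRESHOLD = 195.0  # CartPole-v0 reward threshold as registered in Gym

print(f"mean test reward: {mean_reward:.1f}")  # mean test reward: 11.4
print("solved" if mean_reward >= SOLVED_THRESHOLD else "not solved yet")  # not solved yet
```

After only 5000 training steps the agent is still far from that mark, so a longer training run is probably the first thing to try.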