# Cart Pole Problem

Problem Statement:
CartPole, also known as an inverted pendulum, is a pendulum whose center of gravity lies above its pivot point. It is unstable,
but it can be controlled by moving the pivot point under the center of mass. The goal is to keep the pole balanced by
applying appropriate forces to the pivot point.
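
To make the control idea concrete, here is a minimal sketch of a cart-pole simulation in plain Python. The physical constants, the 12-degree failure angle, the starting state and the two toy policies are all assumptions chosen for illustration (they follow the commonly used CartPole benchmark setup) and are not taken from this notebook. It compares always pushing right with pushing in the direction the pole leans.

```python
import math

# Assumed constants, modeled on the commonly used CartPole benchmark setup.
GRAVITY = 9.8
CART_MASS = 1.0
POLE_MASS = 0.1
TOTAL_MASS = CART_MASS + POLE_MASS
POLE_HALF_LENGTH = 0.5
FORCE_MAG = 10.0
TIME_STEP = 0.02
ANGLE_LIMIT = 12 * math.pi / 180   # pole counts as fallen beyond 12 degrees
POSITION_LIMIT = 2.4               # cart counts as out of bounds beyond this


def step(state, action):
    """Advance the cart-pole by one Euler step; action 1 pushes right, 0 pushes left."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + POLE_MASS * POLE_HALF_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / TOTAL_MASS))
    x_acc = temp - POLE_MASS * POLE_HALF_LENGTH * theta_acc * cos_t / TOTAL_MASS
    return (x + TIME_STEP * x_dot,
            x_dot + TIME_STEP * x_acc,
            theta + TIME_STEP * theta_dot,
            theta_dot + TIME_STEP * theta_acc)


def run_episode(policy, state=(0.0, 0.0, 0.05, 0.0), max_steps=200):
    """Return how many steps the pole stays up under the given policy."""
    for steps in range(max_steps):
        x, _, theta, _ = state
        if abs(x) > POSITION_LIMIT or abs(theta) > ANGLE_LIMIT:
            return steps
        state = step(state, policy(state))
    return max_steps


def always_right(state):
    # Ignores the state entirely and always pushes the cart to the right.
    return 1


def push_toward_lean(state):
    # Move the pivot under the center of mass: push in the direction the pole leans.
    return 1 if state[2] > 0 else 0


constant_steps = run_episode(always_right)
heuristic_steps = run_episode(push_toward_lean)
print("always push right:", constant_steps, "steps")
print("push toward lean: ", heuristic_steps, "steps")
```

In this sketch the lean-following policy keeps the pole up longer than the constant push, which is the intuition behind "moving the pivot point under the center of mass". A learned policy, such as the DQN agent trained in this notebook, aims to do better than such a hand-written rule.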

In [1]:
!pip install gym



In [2]:
!pip install h5py



In [3]:
!pip install keras



In [4]:
!pip install tensorflow

Collecting setuptools>=41.0.0 (from tensorboard<1.15.0,>=1.14.0->tensorflow)
  Downloading https://files.pythonhosted.org/packages/ec/51/f45cea425fd5cb0b0380f5b0f048ebc1da5b417e48d304838c02d6288a1e/setuptools-41.0.1-py2.py3-none-any.whl (575kB)
Installing collected packages: setuptools
  Found existing installation: setuptools 40.8.0
    Uninstalling setuptools-40.8.0:
      Successfully uninstalled setuptools-40.8.0
Successfully installed setuptools-41.0.1


In [5]:
!pip install keras-rl 



In [6]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions available in the CartPole problem
env = gym.make(ENV_NAME)
# Fix the random seeds so that runs are reproducible
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

# Build a small fully connected network that maps an observation to one Q-value per action
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

# Configure the DQN agent: epsilon-greedy exploration and a replay memory of up to 50,000 steps
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

# Okay, now it's time to learn something! We visualize the training here for show, but this slows down training quite a lot.
dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)

# Evaluate the trained agent for 5 episodes
dqn.test(env, nb_episodes=5, visualize=True)

Using TensorFlow backend.
W0806 19:13:54.943631  5012 deprecation_wrapper.py:119] From C:\Users\Dell\Anacondanew\lib\site-packages\keras\backend\tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0806 19:13:56.061696  5012 deprecation_wrapper.py:119] From C:\Users\Dell\Anacondanew\lib\site-packages\keras\backend\tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0806 19:13:56.687731  5012 deprecation_wrapper.py:119] From C:\Users\Dell\Anacondanew\lib\site-packages\keras\backend\tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0806 19:13:56.952746  5012 deprecation_wrapper.py:119] From C:\Users\Dell\Anacondanew\lib\site-packages\keras\backend\tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

W0806 19:13:56.954746  501

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
_________________________________________________________________
activation_2 (Activation)    (None, 2)                 0         
=================================================================
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
None
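
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic. The helper name `dense_params` is introduced here for illustration only, and the sizes (4 inputs, 16 hidden units, 2 actions) are read off the output shapes above.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units


observation_size = 4   # Flatten output shape (None, 4)
hidden_units = 16      # Dense(16)
nb_actions = 2         # Dense(nb_actions), output shape (None, 2)

hidden = dense_params(observation_size, hidden_units)   # 4*16 + 16
output = dense_params(hidden_units, nb_actions)         # 16*2 + 2
print(hidden, output, hidden + output)  # 80 34 114
```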


W0806 19:13:58.911859  5012 deprecation_wrapper.py:119] From C:\Users\Dell\Anacondanew\lib\site-packages\keras\optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.



Training for 5000 steps ...




   79/5000: episode: 1, duration: 6.819s, episode steps: 79, steps per second: 12, episode reward: 79.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.519 [0.000, 1.000], mean observation: 0.060 [-0.402, 0.722], loss: 0.428072, mean_absolute_error: 0.495901, mean_q: 0.052834
  113/5000: episode: 2, duration: 0.155s, episode steps: 34, steps per second: 219, episode reward: 34.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.529 [0.000, 1.000], mean observation: 0.151 [-0.159, 0.753], loss: 0.351183, mean_absolute_error: 0.445727, mean_q: 0.190834
  163/5000: episode: 3, duration: 0.250s, episode steps: 50, steps per second: 200, episode reward: 50.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.520 [0.000, 1.000], mean observation: 0.082 [-0.295, 0.778], loss: 0.313844, mean_absolute_error: 0.465298, mean_q: 0.319593
  197/5000: episode: 4, duration: 0.160s, episode steps: 34, steps per second: 212, episode reward: 34.000, mean reward: 1.000 [1.000, 1.000], mean acti

  719/5000: episode: 32, duration: 0.053s, episode steps: 9, steps per second: 171, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.111 [0.000, 1.000], mean observation: 0.135 [-1.611, 2.534], loss: 0.615713, mean_absolute_error: 2.483820, mean_q: 4.703921
  729/5000: episode: 33, duration: 0.076s, episode steps: 10, steps per second: 131, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.131 [-1.544, 2.503], loss: 0.436056, mean_absolute_error: 2.465793, mean_q: 4.741426
  738/5000: episode: 34, duration: 0.061s, episode steps: 9, steps per second: 148, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.111 [0.000, 1.000], mean observation: 0.151 [-1.516, 2.495], loss: 0.367566, mean_absolute_error: 2.461628, mean_q: 4.829305
  747/5000: episode: 35, duration: 0.057s, episode steps: 9, steps per second: 158, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean actio

 1010/5000: episode: 62, duration: 0.051s, episode steps: 10, steps per second: 197, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.120 [-1.560, 2.492], loss: 1.078485, mean_absolute_error: 3.630986, mean_q: 6.796435
 1022/5000: episode: 63, duration: 0.066s, episode steps: 12, steps per second: 183, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.333 [0.000, 1.000], mean observation: 0.102 [-1.129, 1.692], loss: 1.222748, mean_absolute_error: 3.676298, mean_q: 6.787294
 1035/5000: episode: 64, duration: 0.063s, episode steps: 13, steps per second: 206, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.385 [0.000, 1.000], mean observation: 0.103 [-0.935, 1.488], loss: 0.889639, mean_absolute_error: 3.653696, mean_q: 6.744975
 1048/5000: episode: 65, duration: 0.065s, episode steps: 13, steps per second: 200, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean

 1679/5000: episode: 91, duration: 0.257s, episode steps: 41, steps per second: 159, episode reward: 41.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.488 [0.000, 1.000], mean observation: -0.113 [-1.142, 0.247], loss: 1.015555, mean_absolute_error: 5.087359, mean_q: 9.682144
 1706/5000: episode: 92, duration: 0.130s, episode steps: 27, steps per second: 207, episode reward: 27.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.481 [0.000, 1.000], mean observation: -0.095 [-0.783, 0.191], loss: 1.009575, mean_absolute_error: 5.208023, mean_q: 9.938776
 1730/5000: episode: 93, duration: 0.125s, episode steps: 24, steps per second: 192, episode reward: 24.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: -0.104 [-0.962, 0.392], loss: 0.821628, mean_absolute_error: 5.300941, mean_q: 10.246982
 1752/5000: episode: 94, duration: 0.107s, episode steps: 22, steps per second: 205, episode reward: 22.000, mean reward: 1.000 [1.000, 1.000], 

 2270/5000: episode: 121, duration: 0.143s, episode steps: 28, steps per second: 195, episode reward: 28.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: -0.111 [-0.984, 0.188], loss: 3.526611, mean_absolute_error: 6.979375, mean_q: 13.039157
 2309/5000: episode: 122, duration: 0.190s, episode steps: 39, steps per second: 205, episode reward: 39.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.513 [0.000, 1.000], mean observation: -0.023 [-1.046, 0.398], loss: 2.841224, mean_absolute_error: 6.982582, mean_q: 13.088970
 2328/5000: episode: 123, duration: 0.125s, episode steps: 19, steps per second: 152, episode reward: 19.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.526 [0.000, 1.000], mean observation: -0.119 [-1.133, 0.375], loss: 2.596931, mean_absolute_error: 6.983477, mean_q: 13.252357
 2380/5000: episode: 124, duration: 0.306s, episode steps: 52, steps per second: 170, episode reward: 52.000, mean reward: 1.000 [1.000, 1.

 3519/5000: episode: 150, duration: 0.516s, episode steps: 94, steps per second: 182, episode reward: 94.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: 0.033 [-0.542, 0.981], loss: 3.364593, mean_absolute_error: 9.057568, mean_q: 17.336023
 3555/5000: episode: 151, duration: 0.200s, episode steps: 36, steps per second: 180, episode reward: 36.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.528 [0.000, 1.000], mean observation: 0.084 [-0.209, 0.753], loss: 3.621181, mean_absolute_error: 9.116926, mean_q: 17.485611
 3599/5000: episode: 152, duration: 0.224s, episode steps: 44, steps per second: 197, episode reward: 44.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.523 [0.000, 1.000], mean observation: 0.082 [-0.248, 0.819], loss: 3.580888, mean_absolute_error: 9.158651, mean_q: 17.585537
 3651/5000: episode: 153, duration: 0.257s, episode steps: 52, steps per second: 203, episode reward: 52.000, mean reward: 1.000 [1.000, 1.000

Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 61.000, steps: 61


<keras.callbacks.History at 0xece43c8>
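
Two settings of the agent above are easier to follow with a small example: the epsilon-greedy policy and `target_model_update=1e-2`. The sketch below is a plain-Python illustration of the two techniques, not keras-rl's actual implementation. It assumes an exploration rate of 0.1 and assumes that a `target_model_update` value below 1 acts as a soft-update rate; the function names and the sample Q-values are hypothetical.

```python
import random


def eps_greedy_action(q_values, eps=0.1, rng=random):
    """With probability eps pick a random action, otherwise the highest-valued one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau of the way toward the online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target_weights, online_weights)]


rng = random.Random(123)
q_values = [0.2, 1.5]   # hypothetical Q-values for "push left" and "push right"
actions = [eps_greedy_action(q_values, eps=0.1, rng=rng) for _ in range(1000)]
print("share of greedy actions:", actions.count(1) / len(actions))

# One soft update with tau = 0.01 moves the target 1% of the way to the online weights.
target = soft_update([0.0, 0.0], [1.0, -1.0], tau=1e-2)
print("target weights after one soft update:", target)
```

Exploring occasionally lets the agent try actions whose Q-values are still underestimated, while the slowly moving target network keeps the regression targets from shifting at every training step.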