# Making an agent to play OpenAI's CartPole with Reinforcement Learning

Reinforcement learning is a method for training an agent to perform a task, given observations of the environment state and the rewards it receives for the actions it takes in response.
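To make the observe, act, receive-reward cycle concrete, here is a minimal sketch of one episode. `CoinFlipEnv`, `run_episode`, and the two policies are hypothetical names made up for illustration; they are not part of Gym or keras-rl. The toy environment only mimics the `reset()` / `step()` shape of the classic Gym interface, whose `step()` additionally returns an `info` value.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: match the shown coin for 10 steps."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.steps_left = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.steps_left = 10
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward 1.0 when the action matches the coin that was observed.
        reward = 1.0 if action == self.coin else 0.0
        self.steps_left -= 1
        self.coin = self.rng.randint(0, 1)
        done = self.steps_left == 0
        return self.coin, reward, done


def run_episode(env, policy):
    """Run one episode and return the total reward the policy collected."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)                   # agent reacts to the state
        observation, reward, done = env.step(action)   # environment responds
        total_reward += reward
    return total_reward


def copy_policy(observation):
    # Always answers with the observed coin, so it is rewarded on every step.
    return observation


def always_zero_policy(observation):
    # Ignores the observation entirely.
    return 0


print(run_episode(CoinFlipEnv(), copy_policy))
print(run_episode(CoinFlipEnv(), always_zero_policy))
```

A policy that uses the observation collects the full reward of the episode, while one that ignores it can only match by chance. Learning a policy that maximizes this total is the goal of training.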

We use the CartPole environment provided by OpenAI Gym, in which a freely swinging pole, pointing upwards, is attached to a cart that can be moved horizontally. The goal is to keep the pole from falling over.

The observation of the environment is a list of 4 numbers: the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole.
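Since the observation is a bare list, it can help to give the four entries names while exploring. The sketch below assumes the order listed above; `CartPoleObservation` and `parse_observation` are hypothetical helpers, not part of Gym, and the sample values are made up.

```python
from typing import NamedTuple, Sequence


class CartPoleObservation(NamedTuple):
    """Named view of the 4 observation values, in the order listed above."""
    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_rotation_rate: float


def parse_observation(raw: Sequence[float]) -> CartPoleObservation:
    """Wrap a raw 4-element observation so its fields can be read by name."""
    if len(raw) != 4:
        raise ValueError(f"expected 4 values, got {len(raw)}")
    return CartPoleObservation(*(float(v) for v in raw))


# Made-up sample values, only to show the field access.
obs = parse_observation([0.02, -0.01, 0.03, 0.04])
print(obs.pole_angle, obs.cart_velocity)
```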

In [1]:
import gym

from keras.models import Model
from keras.layers import Input, Dense, Activation, Flatten, Reshape


from rl.agents.cem import CEMAgent
from rl.memory import EpisodeParameterMemory

env = gym.make('CartPole-v0')

# Define our reinforcement learning agent

The reinforcement learning agent will observe the environment state, and return a score for each action it can take. 

For this problem, we can get the observation shape from `env.observation_space.shape`, and the number of available actions from `env.action_space.n`.
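To illustrate what "a score for each action" means, the sketch below turns two raw outputs into softmax scores and picks the highest one. The input values are made up, and greedy selection is an assumption for illustration; it is not necessarily how keras-rl chooses actions during training. In CartPole, action 0 pushes the cart to the left and action 1 pushes it to the right.

```python
import math


def softmax(logits):
    """Convert raw outputs into positive scores that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(scores):
    """Return the index of the action with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Made-up raw outputs for the two CartPole actions (0 = left, 1 = right).
scores = softmax([0.2, 1.5])
action = greedy_action(scores)
print(scores, action)
```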

In [2]:
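def build_model():
    # Input is a single observation (4 numbers for CartPole)
    inp = Input(env.observation_space.shape)

    # First residual block: y keeps the Dense output for the skip connection
    y = Dense(20)(inp)
    x = Activation('elu')(y)

    x = Dense(20)(x)
    x = Activation('elu')(x) + y

    # Second residual block, same pattern as the first
    y = Dense(20)(x)
    x = Activation('elu')(y)

    x = Dense(20)(x)
    x = Activation('elu')(x) + y

    x = Dense(20)(x)
    x = Activation('elu')(x)

    # One output per action; softmax turns the outputs into scores that sum to 1
    x = Dense(env.action_space.n)(x)
    x = Activation('softmax')(x)

    return Model(inp, x)

# Use a separate name for the builder so the function is not overwritten by the model
model = build_model()

model.summary()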

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 4)]          0                                            
__________________________________________________________________________________________________
dense (Dense)                   (None, 20)           100         input_1[0][0]                    
__________________________________________________________________________________________________
activation (Activation)         (None, 20)           0           dense[0][0]                      
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 20)           420         activation[0][0]                 
_______________________________________________________________________________________

# Training the Model
keras-rl makes reinforcement learning very simple. You define the environment (provided by Gym in this case), set up the memory, choose the reinforcement learning algorithm, then compile and fit. Here we use the cross-entropy method, implemented by `CEMAgent`.

The trained model should give the highest score to the action that leads to the highest total reward.
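The idea behind the cross-entropy method can be sketched on a toy problem: sample candidate parameter vectors from a Gaussian, score them, keep the best fraction (the "elite"), and refit the Gaussian to that elite. The code below is a simplified illustration and not keras-rl's implementation. `cross_entropy_method`, `toy_score`, the target vector, and all default values are assumptions chosen for the example. In the notebook, the parameters would be the network weights and the score would be the episode reward.

```python
import random
import statistics


def cross_entropy_method(score_fn, dim, iterations=30, population=50,
                         elite_frac=0.2, seed=0):
    """Maximize score_fn over vectors of length dim with a diagonal Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [2.0] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample a population of candidate parameter vectors.
        candidates = [[rng.gauss(mean[i], std[i]) for i in range(dim)]
                      for _ in range(population)]
        # Keep only the best-scoring fraction of candidates.
        candidates.sort(key=score_fn, reverse=True)
        elite = candidates[:n_elite]
        # Refit the sampling distribution to the elite set.
        for i in range(dim):
            column = [c[i] for c in elite]
            mean[i] = statistics.mean(column)
            std[i] = statistics.pstdev(column) + 1e-3  # small floor keeps exploring
    return mean


target = [1.0, -2.0, 0.5]  # hypothetical optimum of the toy problem


def toy_score(params):
    # Higher is better: negative squared distance to the target.
    return -sum((p - t) ** 2 for p, t in zip(params, target))


best = cross_entropy_method(toy_score, dim=3)
print(best)
```

A smaller `elite_frac` makes each update more selective, but it leaves fewer samples to estimate the new mean and spread from.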

In [None]:
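# Memory of the parameter sets tried so far and the rewards they earned
memory = EpisodeParameterMemory(limit=1000, window_length=1)
# Cross-entropy method agent; elite_frac is the fraction of best parameter sets kept at each update
cem = CEMAgent(model=model, nb_actions=env.action_space.n, memory=memory,
               batch_size=50, nb_steps_warmup=2000, train_interval=50, elite_frac=0.05)
cem.compile()

# Train for 100000 environment steps without rendering; verbose=2 logs one line per episode
cem.fit(env, nb_steps=100000, visualize=False, verbose=2)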

Training for 100000 steps ...
    31/100000: episode: 1, duration: 4.292s, episode steps: 31, steps per second: 7, episode reward: 31.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.419 [0.000, 1.000], mean observation: -0.023 [-1.392, 1.947], mean_best_reward: --
    45/100000: episode: 2, duration: 0.802s, episode steps: 14, steps per second: 17, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.714 [0.000, 1.000], mean observation: -0.095 [-1.938, 1.134], mean_best_reward: --
    57/100000: episode: 3, duration: 0.703s, episode steps: 12, steps per second: 17, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.417 [0.000, 1.000], mean observation: 0.113 [-0.960, 1.523], mean_best_reward: --
    69/100000: episode: 4, duration: 0.884s, episode steps: 12, steps per second: 14, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.167 [0.000, 1.000], mean observation: 0.093 [-1.619, 2.581], mean_best_reward: --
 