# Deep Reinforcement Algorithm in OpenAI gym environment

We shall build a deep neural network and use RL to solve a cart and pole balancing problem

In [1]:
import sys
print(sys.version)

3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]


In git bash, we type the following commands:


git clone https://github.com/openai/gym

cd gym

pip install -e . # minimal install




This downloads the bare minimums for the OpenAI Gym environment. 

In [2]:
import gym
print(gym.__version__)

import keras
print(keras.__version__)

0.12.1


Using Theano backend.


2.2.4


If it does not show 'Using Theano backend' and instead shows "Using Tensorflow backend" or anything else;
go to .keras folder in the directory where Anaconda is installed;
open the 'keras' JSON file in a text editor and change whatever is written in the section marked as "backend" to "Theano"

In [3]:
import random
import math
import numpy as np
from collections import deque

## Setting up OpenAI Gym environment

In [4]:
env = gym.make('CartPole-v0')

for i_episode in range(20):
    observation = env.reset()
    
    for t in range(100):
        env.render()
        
        print(observation)
        
        action = env.action_space.sample()
        
        observation, reward, done, info = env.step(action)
        
        if done:
            break

[ 0.02093232  0.04103462 -0.00739778 -0.03682714]
[ 0.02175302  0.23626187 -0.00813433 -0.33183494]
[ 0.02647825  0.43149866 -0.01477102 -0.62707189]
[ 0.03510823  0.62682363 -0.02731246 -0.92436991]
[ 0.0476447   0.43208105 -0.04579986 -0.64039385]
[ 0.05628632  0.23762652 -0.05860774 -0.36247838]
[ 0.06103885  0.04338442 -0.06585731 -0.0888363 ]
[ 0.06190654 -0.15073476 -0.06763403  0.1823632 ]
[ 0.05889184 -0.34482702 -0.06398677  0.45296671]
[ 0.0519953  -0.14886134 -0.05492743  0.14082046]
[ 0.04901808  0.04700251 -0.05211102 -0.16867278]
[ 0.04995813 -0.14733631 -0.05548448  0.10712603]
[ 0.0470114  -0.34162106 -0.05334196  0.38180063]
[ 0.04017898 -0.14578388 -0.04570595  0.07278757]
[ 0.0372633   0.04996251 -0.04425019 -0.23395825]
[ 0.03826255  0.24568786 -0.04892936 -0.5402642 ]
[ 0.04317631  0.44146222 -0.05973464 -0.84795377]
[ 0.05200555  0.24720369 -0.07669372 -0.57463724]
[ 0.05694963  0.44331232 -0.08818646 -0.89046133]
[ 0.06581587  0.63951296 -0.10599569 -1.20951189]


[-0.02035437 -0.77519139  0.01467604  1.19710377]
[-0.03585819 -0.58026244  0.03861812  0.90905643]
[-0.04746344 -0.38568387  0.05679925  0.6287571 ]
[-0.05517712 -0.19139868  0.06937439  0.35448927]
[-0.05900509 -0.38743494  0.07646417  0.66821605]
[-0.06675379 -0.19345488  0.08982849  0.40055473]
[-0.07062289  0.00028568  0.09783959  0.13748967]
[-0.07061718 -0.19609159  0.10058938  0.45936578]
[-0.07453901 -0.00252468  0.1097767   0.20000652]
[-0.0745895   0.19086996  0.11377683 -0.05612969]
[-0.0707721   0.38419226  0.11265423 -0.31085994]
[-0.06308826  0.18766064  0.10643704  0.01511916]
[-0.05933504  0.3811079   0.10673942 -0.2421762 ]
[-0.05171289  0.57455609  0.10189589 -0.49937521]
[-0.04022176  0.37815634  0.09190839 -0.17639806]
[-0.03265864  0.18184748  0.08838043  0.14380546]
[-0.02902169  0.37559985  0.09125654 -0.11973873]
[-0.02150969  0.56930387  0.08886176 -0.38229357]
[-0.01012361  0.76305911  0.08121589 -0.64568871]
[ 0.00513757  0.95696104  0.06830212 -0.91173167]


[-0.03271322 -0.41134839  0.05668308  0.61629274]
[-0.04094019 -0.21706225  0.06900894  0.34198749]
[-0.04528143 -0.41309468  0.07584869  0.65560878]
[-0.05354332 -0.60918608  0.08896086  0.97117834]
[-0.06572705 -0.80538208  0.10838443  1.29042858]
[-0.08183469 -0.61179239  0.134193    1.03355065]
[-0.09407054 -0.80841888  0.15486402  1.3651732 ]
[-0.11023891 -0.61553788  0.18216748  1.12466241]
[-0.12254967 -0.81251838  0.20466073  1.46850561]
[-0.01812699  0.01898009  0.00548304  0.0208221 ]
[-0.01774738 -0.17622006  0.00589948  0.31522993]
[-0.02127179  0.01881736  0.01220408  0.02441333]
[-0.02089544 -0.17647746  0.01269234  0.32092166]
[-0.02442499  0.01846146  0.01911078  0.03226828]
[-0.02405576  0.21330422  0.01975614 -0.25432426]
[-0.01978967  0.01790583  0.01466966  0.04452404]
[-0.01943156 -0.17742337  0.01556014  0.34179904]
[-0.02298003 -0.37276321  0.02239612  0.63934782]
[-0.03043529 -0.56819014  0.03518307  0.93899862]
[-0.04179909 -0.37355971  0.05396305  0.65757534]


In [5]:
print(env.action_space)
print(env.observation_space)

Discrete(2)
Box(4,)


In [6]:
print(env.observation_space.high)
print(env.observation_space.low)


[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]


## Defining parameters

In [7]:
# training parameters

n_episodes = 1000    # no. of episodes
n_win_ticks = 195    # every time step is a tick(in OpenAI); done state = win_tick
max_env_steps = None # for OPen AI

# RL parameters

gamma = 1.0          # Discount factor: measure of how far ahead in time the algorithm looks
                     # might not be good now 
                     # To prioritise rewards in the distant future, the value is kept one 
                     # deciding whether or not we want to value current rewards or future rewards   
epsilon = 1.0        # exploration factor starting from one   
                     # Exploration : Choose a uniformly random choice, random force to use; agent choosing an alg thinking it 
                     # will have the best long term effect
                     # avoid local minimum
                     # exploitation: when you keep doing what you were doing; exploration: when you try something new   
epsilon_min = 0.01   # starting with high expl with and then immediately start lowering this 
epsilon_decay = 0.995 # how quickly it will stop exploring
alpha = 0.01         # Learning rate: how big you take a leap in finding optimal policy
                     # it will determine to what extent new info will override old info
                     # alpha=0 means no learning; alpha = 1 means considering only recent info   
alpha_decay = 0.01   # lowering alpha

batch_size = 64      # 64 samples
monitor = False      # stuff for OpenAI
quiet = False        # control print statements 


# Environment Parameters

# for AI Gym

memory = deque(maxlen = 100000)    # custom list parameter, setting(controlling) max length 
env = gym.make("CartPole-v0")
if max_env_steps is not None: 
    env.max_episode_steps = max_env_steps

## Building the neural network

In [8]:
# building the neural network

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


#Model definition

model = Sequential()
model.add(Dense(24, input_dim=4, activation = 'relu'))
                # 24 neurons, input dimensions = 4 as current environment has 4 paramters
                # activation is rectified linear unit

#adding hidden layers

model.add(Dense(48, activation = 'relu'))
model.add(Dense(2, activation = 'relu'))  
                # we have force to the left and to the right
                # so two possible outputs; so 2 neurons

#how to compile this
model.compile(loss = 'mse', optimizer = Adam(lr = alpha, decay = alpha_decay))    # learning rate is alpha

## Defining necessary functions

In [9]:
# defining necessary functions

#setting up memory
def remember(state, action, reward, next_state, done):          # reward that we got, checking whether it is done ot not
    memory.append((state, action, reward, next_state, done))
    
#choose action: pick what to do    
def choose_action(state, epsilon):
    return env.action.sample() if (np.random.random() <= epsilon) else np.argmax(model.predict(state))
                                                #if no. chosen randomly from action space <= 1(at start)
                                                #if not, we shall get our model making up prediction based off current state
                                                            #i.e., for exploration stage, prediction on force and direction
        
def get_epsilon(t):
    return max(epsilon_min, min(epsilon, 1.0-math.log10((t+1)*epsilon_decay)))
                                                #towards the end we'd be decreasing substantially
                                                # in the beginning, right up at epsilon
        
# getting preprocess
def preprocess_state(state):
    return np.reshape(state, [1, 4])            # transposing state matrix to a column

#going through replay
def replay(batch_size, epsilon):
    x_batch, y_batch = [], []
    minibatch = random.sample(memory, min(len(memory), batch_size))
    
    for state, action, reward, next_state, done in minibatch:
        y_target = model.predict(state)
        y_target[0][action] = reward if done else reward + gamma + np.max(model.predict(next_state)[0])
        x_batch.append(state[0])
        y_batch.append(y_target[0])
        
    #fit our model
    #using the actions to train our model
    model.fit(np.array(x_batch), np.array(y_batch), batch_size=len(x_batch), verbose=0)
                                                                    #verbose: whther or not to make print statements outof this
        
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

In [10]:
# define run function
# training our model which would choose the best action to do


def run():
    scores = deque(maxlen = 100)
    
    for e in range(n_episodes):
        state = preprocess_state(env.reset())    # start from the beginning each and everytime 
        done = False
        i = 0                                    # time-set = 0
        
        while not done:                          # while done is false
            action = choose_action(state, get_epsilon(e))
            next_state, reward, done, _ = env.step(action)
            env.render()                         # rendering so that we can see what's goin' on
            next_state = preprocess_state(next_state)
            remember(state, action, reward, next_state, done)
            state = next_state
            i += 1
        
        scores.append(i)
        
        mean_score = np.mean(scores)
        
        if mean_score >= n_win_ticks and e >= 100:
            if not quiet: print('Ran {} episodes. Solved after {} trials'.format(e, e-100))
            return e-100
        if e % 20 == 0 and not quiet:
            print('[episode {}] - Mean survival time over last 100 episodes was {} ticks.'.format(e, mean_score))
            
        
        replay(batch_size,epsilon)
        
    if not quiet: print('did not solve after {} episodes'.format(e))
    return e

## Training the network

In [None]:

# copying and pasting all the things from above

# as running the environment already initiated is not a good idea


import gym
import keras
import random
import math
import numpy as np
from collections import deque



# training parameters

n_episodes = 1000    # no. of episodes
n_win_ticks = 195    # every time step is a tick(in OpenAI); done state = win_tick
max_env_steps = None # for OPen AI

# RL parameters

gamma = 1.0          # Discount factor: measure of how far ahead in time the algorithm looks
                     # might not be good now 
                     # To prioritise rewards in the distant future, the value is kept one 
                     # deciding whether or not we want to value current rewards or future rewards   
epsilon = 1.0        # exploration factor starting from one   
                     # Exploration : Choose a uniformly random choice, random force to use; agent choosing an alg thinking it 
                     # will have the best long term effect
                     # avoid local minimum
                     # exploitation: when you keep doing what you were doing; exploration: when you try something new   
epsilon_min = 0.01   # starting with high expl with and then immediately start lowering this 
epsilon_decay = 0.995 # how quickly it will stop exploring
alpha = 0.01         # Learning rate: how big you take a leap in finding optimal policy
                     # it will determine to what extent new info will override old info
                     # alpha=0 means no learning; alpha = 1 means considering only recent info   
alpha_decay = 0.01   # lowering alpha

batch_size = 64      # 64 samples
monitor = False      # stuff for OpenAI
quiet = False        # control print statements 


# Environment Parameters

# for AI Gym

memory = deque(maxlen = 100000)    # custom list parameter, setting(controlling) max length 
env = gym.make("CartPole-v0")
if max_env_steps is not None: 
    env.max_episode_steps = max_env_steps
    
    
# building the neural network

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


#Model definition

model = Sequential()
model.add(Dense(24, input_dim=4, activation = 'relu'))
                # 24 neurons, input dimensions = 4 as current environment has 4 paramters
                # activation is rectified linear unit

#adding hidden layers

model.add(Dense(48, activation = 'relu'))
model.add(Dense(2, activation = 'relu'))  
                # we have force to the left and to the right
                # so two possible outputs; so 2 neurons

#how to compile this
model.compile(loss = 'mse', optimizer = Adam(lr = alpha, decay = alpha_decay))    # learning rate is alpha


# defining necessary functions

#setting up memory
def remember(state, action, reward, next_state, done):          # reward that we got, checking whether it is done ot not
    memory.append((state, action, reward, next_state, done))
    
#choose action: pick what to do    
def choose_action(state, epsilon):
    return env.action_space.sample() if (np.random.random() <= epsilon) else np.argmax(model.predict(state))
                                                #if no. chosen randomly from action space <= 1(at start)
                                                #if not, we shall get our model making up prediction based off current state
                                                            #i.e., for exploration stage, prediction on force and direction
        
def get_epsilon(t):
    return max(epsilon_min, min(epsilon, 1.0-math.log10((t+1)*epsilon_decay)))
                                                #towards the end we'd be decreasing substantially
                                                # in the beginning, right up at epsilon
        
# getting preprocess
def preprocess_state(state):
    return np.reshape(state, [1, 4])            # transposing state matrix to a column

#going through replay
def replay(batch_size, epsilon):
    x_batch, y_batch = [], []
    minibatch = random.sample(memory, min(len(memory), batch_size))
    
    for state, action, reward, next_state, done in minibatch:
        y_target = model.predict(state)
        y_target[0][action] = reward if done else reward + gamma + np.max(model.predict(next_state)[0])
        x_batch.append(state[0])
        y_batch.append(y_target[0])
        
    #fit our model
    #using the actions to train our model
    model.fit(np.array(x_batch), np.array(y_batch), batch_size=len(x_batch), verbose=0)
                                                                    #verbose: whther or not to make print statements outof this
        
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
        
        
# define run function
# training our model which would choose the best action to do


def run():
    scores = deque(maxlen = 100)
    
    for e in range(n_episodes):
        state = preprocess_state(env.reset())    # start from the beginning each and everytime 
        done = False
        i = 0                                    # time-set = 0
        
        while not done:                          # while done is false
            action = choose_action(state, get_epsilon(e))
            next_state, reward, done, _ = env.step(action)
            env.render()                         # rendering so that we can see what's goin' on
            next_state = preprocess_state(next_state)
            remember(state, action, reward, next_state, done)
            state = next_state
            i += 1
        
        scores.append(i)
        
        mean_score = np.mean(scores)
        
        if mean_score >= n_win_ticks and e >= 100:
            if not quiet: print('Ran {} episodes. Solved after {} trials'.format(e, e-100))
            return e-100
        if e % 20 == 0 and not quiet:
            print('[episode {}] - Mean survival time over last 100 episodes was {} ticks.'.format(e, mean_score))
            
        
        replay(batch_size,epsilon)
        
    if not quiet: print('did not solve after {} episodes'.format(e))
    return e





run()




[episode 0] - Mean survival time over last 100 episodes was 11.0 ticks.
[episode 20] - Mean survival time over last 100 episodes was 105.61904761904762 ticks.
[episode 40] - Mean survival time over last 100 episodes was 67.1219512195122 ticks.
[episode 60] - Mean survival time over last 100 episodes was 48.22950819672131 ticks.
[episode 80] - Mean survival time over last 100 episodes was 38.617283950617285 ticks.
[episode 100] - Mean survival time over last 100 episodes was 38.44 ticks.
[episode 120] - Mean survival time over last 100 episodes was 18.36 ticks.
[episode 140] - Mean survival time over last 100 episodes was 15.09 ticks.
[episode 160] - Mean survival time over last 100 episodes was 16.26 ticks.
[episode 180] - Mean survival time over last 100 episodes was 18.35 ticks.
[episode 200] - Mean survival time over last 100 episodes was 13.74 ticks.
[episode 220] - Mean survival time over last 100 episodes was 14.76 ticks.
[episode 240] - Mean survival time over last 100 episodes 

AssertionError: 

AssertionError: 

[episode 520] - Mean survival time over last 100 episodes was 22.78 ticks.
[episode 540] - Mean survival time over last 100 episodes was 22.32 ticks.
