Reinforcement Learning with OpenAI Gym
---
This notebook builds and tests reinforcement learning agents in different OpenAI Gym environments.

In [1]:
import tensorflow as tf
import gym

import numpy as np
import matplotlib.pyplot as plt
import time
import os

%matplotlib inline

Load the Environment
---
Call `gym.make("environment name")` to load a new environment.

Check out the list of available environments at <https://gym.openai.com/envs/>

Edit this cell to load different environments!

In [2]:
# TODO: Load an environment
env = gym.make("CartPole-v1")
# If this raises an error saying the environment doesn't exist, run: pip install 'gym[all]'




In [3]:
# TODO: Print observation and action spaces
print(env.observation_space)
print(env.action_space)


Box(4,)
Discrete(2)


Run an Agent
---

Reset the environment before each run with `env.reset()`.

Step forward through the environment to get new observations and rewards over time with `env.step()`.

`env.step()` takes the action to perform on this step as its parameter and returns the following:
- Observations for this step
- Rewards earned this step
- "Done", a boolean value indicating if the game is finished
- Info - some debug information that some environments provide. 
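
To make the reset/step contract concrete without needing Gym installed, here is a minimal sketch using a hypothetical toy environment (`CoinFlipEnv` is invented for illustration and is not part of Gym). It assumes the four-value return described above; newer Gym and Gymnasium releases return five values from `step`, so check the version you have installed.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment that mimics the reset/step contract above."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return the first observation
        self.steps_taken = 0
        return [0.0]

    def step(self, action):
        # Advance one time step and return (observation, reward, done, info)
        self.steps_taken += 1
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        done = self.steps_taken >= self.max_steps
        info = {"coin": coin}
        return [float(coin)], reward, done, info


toy_env = CoinFlipEnv()
toy_obs = toy_env.reset()
toy_done = False
toy_total = 0.0
while not toy_done:
    toy_obs, toy_reward, toy_done, toy_info = toy_env.step(1)  # always guess 1
    toy_total += toy_reward
print(toy_total)
```

The loop has the same shape as the random agent below: reset once, then call `step` until `done` is true, summing the rewards along the way.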

In [4]:
# TODO: Make a random agent
games_to_play = 10

for i in range(games_to_play):
    # Reset the env
    obs = env.reset()  # initialize all vars and prep game to run
    episode_rewards = 0
    done = False
    
    while not done:
        env.render()  # draws frame of the game
        
        action = env.action_space.sample()  # choose action randomly
        
        # Take a step in the env with the chosen action
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        
    print(episode_rewards)  # print total rewards when done


env.close()  # close the env

15.0
20.0
20.0
23.0
15.0
23.0
15.0
22.0
19.0
19.0


Policy Gradients
---
The policy gradients algorithm records gameplay over a training period, then feeds the recorded states, actions, and rewards through a neural network. Actions that led to a reward become more likely, and actions that did not become less likely.
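
As a rough illustration of how "more likely" and "less likely" can be expressed as a loss, the sketch below weights the cross-entropy of each chosen action by the reward credited to it, in plain Python. The helper names (`softmax`, `policy_gradient_loss`) and the sample numbers are assumptions made for this example, not code from the notebook.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_loss(logits_per_step, actions, credited_rewards):
    # Mean over steps of -log(probability of the action taken) * credited reward
    losses = []
    for logits, action, credit in zip(logits_per_step, actions, credited_rewards):
        probs = softmax(logits)
        losses.append(-math.log(probs[action]) * credit)
    return sum(losses) / len(losses)


example_logits = [[0.0, 0.0], [2.0, 0.0]]  # hypothetical network outputs for two steps
example_actions = [0, 0]                   # the action taken at each step

loss_when_rewarded = policy_gradient_loss(example_logits, example_actions, [1.0, 1.0])
loss_when_penalized = policy_gradient_loss(example_logits, example_actions, [-1.0, -1.0])
print(loss_when_rewarded, loss_when_penalized)
```

With positive credit, lowering the loss means raising the probability of the chosen action; with negative credit the sign flips, so lowering the loss pushes that probability down.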

In [5]:
# TODO Build the policy gradient neural network
class Agent:
    
    def __init__(self, num_actions, state_size):
        initializer = tf.contrib.layers.xavier_initializer()  # Xavier (Glorot) initializer for the layer weights
        
        # The None dimension lets a batch of any number of states be fed into the network
        self.input_layer = tf.placeholder(dtype=tf.float32, shape=[None, state_size])
        
        # Neural net starts here...
        
        # creates a hidden layer connected to input layer with 8 units, relu activation, and the xavier initializer
        hidden_layer_1 = tf.layers.dense(self.input_layer, 8, activation=tf.nn.relu, kernel_initializer=initializer)
        hidden_layer_2 = tf.layers.dense(hidden_layer_1, 8, activation=tf.nn.relu, kernel_initializer=initializer)
        
        # Output of neural net...
        
        out = tf.layers.dense(hidden_layer_2, num_actions, activation=None)
        
        self.outputs = tf.nn.softmax(out)
        self.choice = tf.argmax(self.outputs, axis=1)  # index of the most probable action for each state in the batch
        
        self.rewards = tf.placeholder(shape=[None, ], dtype=tf.float32)
        self.actions = tf.placeholder(shape=[None, ], dtype=tf.int32)
        
        one_hot_actions = tf.one_hot(self.actions, num_actions)
        
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=out, labels=one_hot_actions)
        
        self.loss = tf.reduce_mean(cross_entropy * self.rewards)
        
        self.gradients = tf.gradients(self.loss, tf.trainable_variables())
        
        # Create a placeholder list for gradients
        self.gradients_to_apply = []
        for index, variable in enumerate(tf.trainable_variables()):
            gradient_placeholder = tf.placeholder(tf.float32)
            self.gradients_to_apply.append(gradient_placeholder)
            
        # Create the operation to update gradients with the gradients placeholder...
        
        optimizer = tf.train.AdamOptimizer(learning_rate=1e-2)
        # apply_gradients pairs each fed-in gradient with its corresponding trainable variable in the model.
        # Run this operation whenever the model should apply what it has learned from its games and update its parameters.
        self.update_gradients = optimizer.apply_gradients(zip(self.gradients_to_apply, tf.trainable_variables()))

        

Discounting and Normalizing Rewards
---
To determine how "successful" a given action is, the policy gradient algorithm evaluates each action based on how much reward was earned after it was performed in an episode.

The discount rewards function goes through each time step of an episode and tracks the total rewards earned from each step to the end of the episode.

For example, if an episode took 10 steps to finish, and the agent earns 1 point of reward every step, the rewards for each frame would be stored as 
`[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]`

This gives early actions, which kept the game going, credit for the rewards that followed, while later actions (which likely led to the end of the game) get less credit.

One disadvantage of arranging rewards like this is that early actions didn't necessarily directly contribute to later rewards, so a **discount factor** is applied that scales rewards down over time. A discount factor < 1 means that rewards earned closer to the current time step will be worth more than rewards earned later.

With our reward example above, if we applied a discount factor of 0.90, the rewards would be stored as
`[6.5132156, 6.12579511, 5.6953279, 5.217031, 4.68559, 4.0951, 3.439, 2.71, 1.9, 1.0]`

This means that the early actions still get more credit than later actions, but not the full value of the rewards for the entire episode.
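
The two reward lists above can be reproduced with a short backward pass over the episode. This is a minimal sketch in plain Python; the function name `discounted_returns` and its `gamma` parameter are assumptions chosen for the example.

```python
def discounted_returns(step_rewards, gamma):
    # Walk backwards so each step accumulates the discounted rewards that follow it
    running_total = 0.0
    result = [0.0] * len(step_rewards)
    for i in reversed(range(len(step_rewards))):
        running_total = running_total * gamma + step_rewards[i]
        result[i] = running_total
    return result


example_rewards = [1.0] * 10  # 1 point of reward on each of 10 steps

undiscounted = discounted_returns(example_rewards, 1.0)
discounted = discounted_returns(example_rewards, 0.9)
print(undiscounted)
print([round(value, 4) for value in discounted])
```

A `gamma` of 1.0 gives the plain running totals from the first example, and 0.9 gives the discounted values from the second.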

Finally, the rewards are normalized (shifted to zero mean and scaled to unit standard deviation) so that reward values stay on a comparable scale across longer and shorter episodes.
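
As a small illustration of the normalization step, the sketch below centers a list of values on zero and divides by the population standard deviation. The `normalize` helper is an assumption for this example, and it would divide by zero if every value were identical.

```python
import math


def normalize(values):
    # Shift to zero mean, then scale to unit (population) standard deviation.
    # Assumes the values are not all identical, otherwise std would be 0.
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    std = math.sqrt(sum(c * c for c in centered) / len(centered))
    return [c / std for c in centered]


short_episode = normalize([3.0, 2.0, 1.0])
long_episode = normalize([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
print(short_episode)
print(long_episode)
```

Both episodes end up on the same scale even though their raw totals differ, with early steps positive and late steps negative.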

You can tweak the discount factor as one of the hyperparameters of your model to find one that fits your task the best!

In [6]:
# TODO Create the discounted and normalized rewards function
discount_rate = 0.95


def discount_normalize_rewards(rewards):
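    # Force a float array so integer or object inputs are not truncated or rejected during normalization
    discounted_rewards = np.zeros_like(rewards, dtype=np.float64)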
    total_rewards = 0
    
    for i in reversed(range(len(rewards))):
        total_rewards = total_rewards * discount_rate + rewards[i]
        discounted_rewards[i] = total_rewards
        
    # Normalize rewards across multiple game lengths
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    
    return discounted_rewards
        

Training Procedure
---
The agent will play games and record the history of the episode. At the end of every game, the episode's history will be processed to calculate the **gradients** that the model learned from that episode.

Every few games the calculated gradients will be applied, updating the model's parameters with the lessons from the games so far.
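
The batching idea can be sketched without TensorFlow: add each episode's gradients into a buffer, apply the sum every few episodes, then clear the buffer. All numbers and names below (`demo_batch_size`, `demo_episode_gradients`, `demo_buffer`) are hypothetical, and episodes are counted from 1 here for readability.

```python
demo_batch_size = 3  # hypothetical: apply an update every 3 episodes

# Hypothetical per-episode gradients for a model with two parameters
demo_episode_gradients = [
    [0.5, -1.0],
    [0.25, 0.5],
    [0.25, 1.5],
    [1.0, 1.0],
]

demo_buffer = [0.0, 0.0]
applied_updates = []

for episode_number, gradients in enumerate(demo_episode_gradients, start=1):
    # Add this episode's gradients to the running totals
    for index, gradient in enumerate(gradients):
        demo_buffer[index] += gradient

    if episode_number % demo_batch_size == 0:
        # Stand-in for the optimizer step: record the summed gradients, then clear the buffer
        applied_updates.append(list(demo_buffer))
        demo_buffer = [0.0 for _ in demo_buffer]

print(applied_updates)
print(demo_buffer)
```

After four episodes, one update has been applied (the sum of the first three), and the fourth episode's gradients are still waiting in the buffer.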

While training, you'll keep track of average scores and render the environment occasionally to see your model's progress.

In [7]:
# TODO Create the training loop
tf.reset_default_graph()

# Modify to match shape of actions and states in the env
num_actions = 2
state_size = 4

path = "./cartpole-pg/"  # for checkpoints

training_episodes = 1000
max_steps_per_episode = 10000
episode_batch_size = 5

agent = Agent(num_actions, state_size)

init = tf.global_variables_initializer()

saver = tf.train.Saver(max_to_keep=2)

if not os.path.exists(path):
    os.makedirs(path)

with tf.Session() as sess:
    sess.run(init)
    
    total_episode_rewards = []
    
    gradient_buffer = sess.run(tf.trainable_variables())
    
    for index, gradient in enumerate(gradient_buffer):
        gradient_buffer[index] = gradient * 0
    
    for episode in range(training_episodes):
        state = env.reset()
        
        episode_history = []
        episode_rewards = 0
    
        for step in range(max_steps_per_episode):
            if episode % 100 == 0:
                env.render()

            # Get the probability of each action for the current state
            action_probabilities = sess.run(agent.outputs, feed_dict={agent.input_layer: [state]})
            action_choice = np.random.choice(range(num_actions), p=action_probabilities[0])

            # Save the resulting states, rewards and whether the episode finished
            state_next, reward, done, _ = env.step(action_choice)

            episode_history.append([state, action_choice, reward, state_next])
            state = state_next

            episode_rewards += reward

            if done or step + 1 == max_steps_per_episode:
                total_episode_rewards.append(episode_rewards)
                episode_history = np.array(episode_history)
                # Replace the stored rewards with their discounted, normalized values
                episode_history[:,2] = discount_normalize_rewards(episode_history[:,2])

                ep_gradients = sess.run(agent.gradients, 
                                        feed_dict={agent.input_layer: np.vstack(episode_history[:,0]),
                                                   agent.actions: episode_history[:,1],
                                                   agent.rewards: episode_history[:,2]})

                # Add this episode's gradients to the buffer
                for index, gradient in enumerate(ep_gradients):
                    gradient_buffer[index] += gradient

                break
        if episode % episode_batch_size == 0:
            feed_dict_gradients = dict(zip(agent.gradients_to_apply, gradient_buffer))
            
            sess.run(agent.update_gradients, feed_dict=feed_dict_gradients)
            
            for index, gradient in enumerate(gradient_buffer):
                gradient_buffer[index] = gradient * 0
            
            if episode % 100 == 0:
                saver.save(sess, path + "pg-checkpoint", episode)
                
                print("Average reward / 100 eps: " + str(np.mean(total_episode_rewards[-100:])))
        

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

Average reward / 100 eps: 33.0
Average reward / 100 eps: 21.04
Average reward / 100 eps: 35.06
Average reward / 100 eps: 114.82
Average reward / 100 eps: 276.01
Average reward / 100 eps: 207.42
Average reward / 100 eps: 207.11
Average reward / 100 eps: 205.12
Average reward / 100 eps: 293.26
Average reward / 100 eps: 498.4


Testing the Model
---

This cell plays several games with the trained model, always choosing the most probable action and skipping the learning step, so you can see how well your model has learned!
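
The difference between choosing actions during training and during testing can be shown in a few lines of plain Python. The probabilities below are a hypothetical policy output for a single state, not values from the trained model.

```python
import random

example_probabilities = [0.3, 0.7]  # hypothetical policy output for one state

# Training-style choice: sample an action in proportion to its probability
demo_rng = random.Random(0)
sampled_actions = [
    demo_rng.choices([0, 1], weights=example_probabilities)[0] for _ in range(10)
]

# Testing-style choice: always take the most probable action
greedy_action = max(
    range(len(example_probabilities)), key=lambda a: example_probabilities[a]
)

print(sampled_actions)
print(greedy_action)
```

Sampling keeps some exploration in the training games, while the greedy choice shows what the policy currently considers its best move.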

In [8]:
# TODO Create the testing loop
testing_episodes = 5

with tf.Session() as sess:
    checkpoint = tf.train.get_checkpoint_state(path)
    saver.restore(sess, checkpoint.model_checkpoint_path)
    
    for episode in range(testing_episodes):
        
        state = env.reset()
        
        episode_rewards = 0
        
        for step in range(max_steps_per_episode):
            env.render()
            
            # Get Action
            action_argmax = sess.run(agent.choice, feed_dict={agent.input_layer:[state]})
            action_choice = action_argmax[0]
            
            state_next, reward, done, _ = env.step(action_choice)
            state = state_next
            
            episode_rewards += reward
            
            if done or step + 1 == max_steps_per_episode:
                print("Rewards for episode " + str(episode) + ": " + str(episode_rewards))
                break

INFO:tensorflow:Restoring parameters from ./cartpole-pg/pg-checkpoint-900
Rewards for episode 0: 500.0
Rewards for episode 1: 500.0
Rewards for episode 2: 500.0
Rewards for episode 3: 500.0
Rewards for episode 4: 500.0


In [9]:
# Run to close the environment
env.close()