Reinforcement Learning with OpenAI Gym
---
This notebook builds and tests reinforcement learning agents in different environments.

In [1]:
import tensorflow as tf
import gym

import numpy as np
import matplotlib.pyplot as plt
import time

%matplotlib inline

Load the Environment
---
Call `gym.make("environment name")` to load a new environment.

Check out the list of available environments at <https://gym.openai.com/envs/>

Edit this cell to load different environments!

In [2]:
# TODO: Load an environment
env = gym.make("CartPole-v1")
# If gym reports that the environment doesn't exist, run: pip install 'gym[all]'


In [3]:
# TODO: Print observation and action spaces
print(env.observation_space)
print(env.action_space)


Box(4,)
Discrete(2)


Run an Agent
---

Reset the environment before each run with `env.reset()`, which returns the first observation.

Step forward through the environment with `env.step()` to get new observations and rewards over time.

`env.step()` takes the action to perform on this step and returns the following:
- The observation for this step
- The reward earned on this step
- "Done", a boolean value indicating whether the game is finished
- Info, debug information that some environments provide

In [4]:
# TODO: Make a random agent
games_to_play = 10

for i in range(games_to_play):
    # Reset the env
    obs = env.reset()  # initialize all vars and prep game to run
    episode_rewards = 0
    done = False
    
    while not done:
        env.render()  # draws frame of the game
        
        action = env.action_space.sample()  # choose action randomly
        
        # Take a step in the env with the chosen action
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        
    print(episode_rewards)  # print total rewards when done


env.close()  # close the env

23.0
14.0
12.0
35.0
28.0
18.0
28.0
21.0
11.0
35.0


Policy Gradients
---
The policy gradients algorithm records gameplay over a training period, then runs the chosen actions and their results through a neural network, making actions that led to a reward more likely and actions that did not less likely.
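
To make the "more likely / less likely" idea concrete, here is a minimal NumPy sketch of a single policy gradient update. It is an illustration under assumptions, not the network this notebook asks you to build: a linear softmax policy stands in for the neural network, and the names `action_probs` and `grad_log_prob`, the sample observation, the advantage value, and the learning rate are all made up for the example.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def action_probs(weights, obs):
    """Linear policy (an assumption standing in for a neural network)."""
    return softmax(obs @ weights)

def grad_log_prob(weights, obs, action):
    """Gradient of log pi(action | obs) with respect to the weights."""
    probs = action_probs(weights, obs)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(obs, one_hot - probs)

rng = np.random.default_rng(0)
weights = np.zeros((4, 2))  # 4 observation values and 2 actions, as in CartPole
obs = np.array([0.1, -0.2, 0.05, 0.3])  # made-up observation

probs_before = action_probs(weights, obs)
action = rng.choice(2, p=probs_before)  # sample an action from the policy

advantage = 1.0      # positive score: the action turned out well
learning_rate = 0.1  # illustrative value
weights += learning_rate * advantage * grad_log_prob(weights, obs, action)

probs_after = action_probs(weights, obs)
print(probs_before[action], probs_after[action])
```

With a positive score the probability of the sampled action goes up; a negative score would push it down instead.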

In [5]:
# TODO Build the policy gradient neural network

Discounting and Normalizing Rewards
---
To determine how "successful" a given action is, the policy gradient algorithm evaluates each action based on how much reward was earned after it was performed in an episode.

The discount rewards function goes through each time step of an episode and tracks the total rewards earned from each step to the end of the episode.

For example, if an episode took 10 steps to finish, and the agent earns 1 point of reward every step, the rewards for each frame would be stored as 
`[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]`

This lets the agent give more credit to early actions, which didn't lose the game, and less credit to later actions, which likely led to the end of the game.

One disadvantage of arranging rewards like this is that early actions didn't necessarily contribute directly to later rewards, so a **discount factor** is applied that scales rewards down the further in the future they are earned. A discount factor below 1 means that rewards earned close to the current time step are worth more than rewards earned later.

With our reward example above, if we applied a discount factor of 0.90, the rewards would be stored as
`[6.5132156, 6.12579511, 5.6953279, 5.217031, 4.68559, 4.0951, 3.439, 2.71, 1.9, 1.0]`

This means that the early actions still get more credit than later actions, but not the full value of the rewards for the entire episode.
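
The two reward lists above can be reproduced with a short function that walks backwards through an episode. This is one possible sketch; the name `discount_rewards` and its signature are assumptions, not something the notebook defines.

```python
import numpy as np

def discount_rewards(rewards, discount_factor):
    """Return, for each step, the discounted sum of rewards from that step to the end."""
    discounted = np.zeros(len(rewards))
    running_total = 0.0
    # Walk backwards so each step can reuse the total of the steps after it.
    for step in reversed(range(len(rewards))):
        running_total = rewards[step] + discount_factor * running_total
        discounted[step] = running_total
    return discounted

rewards = [1.0] * 10  # 10 steps, 1 point of reward per step

undiscounted = discount_rewards(rewards, 1.0)  # a factor of 1 means no discounting
discounted = discount_rewards(rewards, 0.9)

print(undiscounted)
print(discounted)
```

A discount factor of 1.0 gives the plain running totals from the first example, and 0.9 gives the discounted values from the second.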

Finally, the discounted rewards are normalized, typically by subtracting their mean and dividing by their standard deviation, so that values from longer and shorter episodes end up on a comparable scale.
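
A common way to normalize is to subtract the mean and divide by the standard deviation. The sketch below assumes that approach; the name `normalize_rewards` and the small `epsilon` guard are illustrative choices, and whether you pool values from one episode or from several is left to you.

```python
import numpy as np

def normalize_rewards(discounted_rewards, epsilon=1e-8):
    """Shift values to zero mean and scale them to (roughly) unit standard deviation."""
    values = np.asarray(discounted_rewards, dtype=float)
    centered = values - values.mean()
    # epsilon avoids dividing by zero when every value is identical
    return centered / (values.std() + epsilon)

normalized = normalize_rewards([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
print(normalized)
```

After normalizing, roughly half of the steps have a positive score (their actions become more likely) and half a negative score (their actions become less likely).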

You can tweak the discount factor as one of the hyperparameters of your model to find one that fits your task the best!

In [6]:
# TODO Create the discounted and normalized rewards function

Training Procedure
---
The agent will play games and record the history of the episode. At the end of every game, the episode's history will be processed to calculate the **gradients** that the model learned from that episode.

Every few games the calculated gradients will be applied, updating the model's parameters with the lessons from the games so far.

While training, you'll keep track of average scores and render the environment occasionally to see your model's progress.
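
The procedure above can be sketched end to end in plain NumPy. Everything specific here is an assumption made for illustration: `ToyEnv` is a hypothetical stand-in environment (not part of gym), the policy is a linear softmax rather than a neural network, rewards are only mean-centered per episode, and the hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Hypothetical stand-in environment (not part of gym).

    Each observation is a one-hot vector; the action matching its index earns 1.
    """
    episode_length = 5

    def reset(self):
        self.steps = 0
        self.state = rng.integers(2)
        return np.eye(2)[self.state]

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.steps += 1
        done = self.steps >= self.episode_length
        self.state = rng.integers(2)
        return np.eye(2)[self.state], reward, done, {}

def action_probs(weights, obs):
    """Linear softmax policy (an assumption standing in for a neural network)."""
    logits = obs @ weights
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def discounted_returns(rewards, discount_factor):
    returns, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    return returns

env = ToyEnv()
weights = np.zeros((2, 2))
gradient_buffer = np.zeros_like(weights)
games_to_play, update_every, learning_rate = 50, 10, 0.1
scores, updates_applied = [], 0

for game in range(1, games_to_play + 1):
    obs, done = env.reset(), False
    history = []  # (observation, action, reward) for every step of this game

    while not done:
        action = rng.choice(2, p=action_probs(weights, obs))
        next_obs, reward, done, info = env.step(action)
        history.append((obs, action, reward))
        obs = next_obs

    # End of the game: turn the episode's history into gradients.
    rewards = [r for _, _, r in history]
    returns = discounted_returns(rewards, discount_factor=0.9)
    returns = returns - returns.mean()  # simple centering instead of full normalization
    for (step_obs, step_action, _), step_return in zip(history, returns):
        one_hot = np.eye(2)[step_action]
        grad_log_prob = np.outer(step_obs, one_hot - action_probs(weights, step_obs))
        gradient_buffer += step_return * grad_log_prob
    scores.append(sum(rewards))

    # Every few games: apply the accumulated gradients, then clear the buffer.
    if game % update_every == 0:
        weights += learning_rate * gradient_buffer / update_every
        gradient_buffer[:] = 0.0
        updates_applied += 1
        print(f"game {game}: average score {np.mean(scores[-update_every:]):.2f}")
```

Accumulating gradients over several games before each update averages out some of the noise in individual episodes, at the cost of updating the model less often.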

In [7]:
# TODO Create the training loop

Testing the Model
---

This cell runs through games, choosing actions without any learning updates, so you can see how well your model has learned!
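
A testing loop has the same shape as the random agent loop earlier in the notebook, except that actions come from the trained policy and nothing is recorded for learning. The sketch below is only an outline: `play_test_games` and `CountdownEnv` are hypothetical names, the stub environment exists purely so the example runs without gym, and the fixed `choose_action` function stands in for your trained model.

```python
import numpy as np

def play_test_games(env, choose_action, games_to_play):
    """Play full episodes with a fixed policy and return each episode's total reward."""
    scores = []
    for _ in range(games_to_play):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            # No history is stored and no gradients are computed here.
            obs, reward, done, info = env.step(choose_action(obs))
            total += reward
        scores.append(total)
    return scores

class CountdownEnv:
    """Hypothetical stub (not part of gym): every episode lasts 3 steps, 1 reward each."""

    def reset(self):
        self.remaining = 3
        return np.array([self.remaining])

    def step(self, action):
        self.remaining -= 1
        return np.array([self.remaining]), 1.0, self.remaining == 0, {}

# A trained model would pick the action here; this stand-in always picks action 0.
scores = play_test_games(CountdownEnv(), choose_action=lambda obs: 0, games_to_play=2)
print(scores)
```

To watch the agent play, you could call `env.render()` inside the `while` loop, as in the random agent cell.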

In [8]:
# TODO Create the testing loop

In [9]:
# Run to close the environment
env.close()