<a href="https://colab.research.google.com/github/Benned-H/Summer2019/blob/master/Simple_RL_with_TF/Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 2 - Policy-based Agents

This time, we'll introduce a full reinforcement learning problem with observations, actions, and long-term rewards. Environments that pose the full problem are modeled as Markov Decision Processes (MDPs). In such an environment, both the reward and the next state depend on the current state and on the action taken. We can define an MDP as follows:   
**Def**: A *Markov Decision Process* consists of a set of possible states $S$, a set of possible actions $A$, a transition function $T$, and a reward function $R$. At any time, the agent is in some state $s\in S$ and may take any action $a\in A$. Given a state-action pair $(s,a)$, $T(s,a)$ defines the probability of transitioning to each new state $s'$, and $R(s,a)$ defines the reward. Thus at every step of an MDP, the agent is given some state $s\in S$, takes an action $a\in A$, and receives a new state $s'$ and a reward $r$.
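
To make the definition concrete, here is a minimal sketch of an MDP written as plain Python dictionaries. The two states, two actions, and all probabilities and rewards are invented purely for illustration; they are not taken from Cart-Pole or from this series.

```python
import random

# Hypothetical two-state MDP, invented for illustration only.
STATES = ["left", "right"]
ACTIONS = ["stay", "move"]

# T[(s, a)] maps each possible next state s' to its probability.
T = {
    ("left", "stay"): {"left": 0.9, "right": 0.1},
    ("left", "move"): {"left": 0.2, "right": 0.8},
    ("right", "stay"): {"left": 0.1, "right": 0.9},
    ("right", "move"): {"left": 0.8, "right": 0.2},
}

# R[(s, a)] is the reward for taking action a in state s.
R = {
    ("left", "stay"): 0.0,
    ("left", "move"): -1.0,
    ("right", "stay"): 1.0,
    ("right", "move"): -1.0,
}

def step(state, action, rng=random):
    """Sample a next state s' from T(s, a) and return (s', r)."""
    next_states = list(T[(state, action)].keys())
    probs = list(T[(state, action)].values())
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[(state, action)]

# The agent loop from the definition: given s, take a, receive s' and r.
rng = random.Random(0)
state = "left"
for t in range(5):
    action = rng.choice(ACTIONS)  # a uniformly random policy, for now
    next_state, reward = step(state, action, rng)
    print(t, state, action, reward, next_state)
    state = next_state
```

The random policy is only a placeholder here. The point is the shape of the interaction, which is the same loop we'll run against Cart-Pole.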

**Cart-Pole Task** - We'll need a challenge a bit more complex than the two-armed bandit to have a full MDP. For this, we'll use the Cart-Pole environment provided by OpenAI Gym. This environment gives us:   
* *Observations* - The agent observes the cart's position and velocity along with the pole's angle and angular velocity. With this information, our agent will learn to produce appropriate action probabilities.
* *Delayed reward* - Our only goal is to keep the pole in the air as long as possible, so it's crucial that we consider how valuable our actions were in the long term. We'll create a function to weight rewards over time by modifying our earlier Policy Gradient. To update our agent with more than one experience at a time, we'll collect experiences in a buffer and update the agent with an entire buffer at once. We call one of these sequences of experiences a rollout.
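
The buffer idea above can be sketched roughly as follows. To keep the example runnable without Gym, it uses `CountdownEnv`, a hypothetical stand-in environment written for this sketch. It assumes the classic Gym interface, where `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`; newer Gym releases changed these signatures, so treat that as an assumption. `collect_rollout` and `random_policy` are likewise illustrative names, not functions from this series.

```python
import random

class CountdownEnv:
    """Hypothetical stand-in environment (not part of Gym).

    Every episode lasts `horizon` steps and pays 1.0 per step,
    loosely mimicking Cart-Pole's reward for staying alive.
    """

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [float(self.t)], 1.0, done, {}

def collect_rollout(env, policy, max_steps=200):
    """Run one episode and return a buffer of (s, a, r, s') tuples."""
    buffer = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done, _info = env.step(action)
        buffer.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return buffer

def random_policy(state):
    """Placeholder policy: ignore the state and pick action 0 or 1."""
    return random.choice([0, 1])

rollout = collect_rollout(CountdownEnv(horizon=5), random_policy)
print(len(rollout))  # one entry per step taken in the episode
```

Once an episode ends, the whole `rollout` list would be handed to the agent as a single update, instead of updating after every individual step.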

As a recap of TensorFlow object types, here's a summary of my notes from Part 1:
* ```tf.constant(x)``` - Stores some constant value ```x``` which it will continually reproduce into the computational graph.
* ```tf.Session()``` - Returns the Session which we'll use to run the computational graphs.
* ```tf.placeholder(dtype, shape=None)``` - Allows external values to be fed into the graph. Use this to input training examples or rollouts for RL. Must be fed inputs at runtime.
* ```tf.Variable()``` - Allows the graph to give different outputs for the same input. These are the trainable parameters; they are initialized with some value and an optional datatype, and they persist across multiple executions of the graph.
* ```sess.run(train, feed_dict={x: x_in, y: y_in})``` - Runs the graph ```train``` using Session ```sess```. We're also passing in inputs for placeholders ```x``` and ```y```.

In [6]:
### Cart-Pole Vanilla Policy Gradient Agent ###

import random
import gym

env = gym.make('CartPole-v0')

In [4]:
# Survey the environment's observation and action spaces.
print(env.observation_space) # 4 floats: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space) # 0 or 1

Box(4,)
Discrete(2)


In [8]:
# This first reset just gives our starting state, or our
# first set of observations.
env.reset()

array([0.0421422 , 0.01901745, 0.00920982, 0.04300114])

In [0]:
def discount(arr, gamma):
  """Discounts a given list of rewards in place using the given gamma.

  Afterward, arr[t] holds r[t] + gamma * r[t+1] + gamma**2 * r[t+2] + ...,
  where r is the original list of rewards.
  """

  # Walk backward so each entry can reuse the discounted sum that follows it.
  for t in reversed(range(len(arr) - 1)):
    arr[t] += gamma * arr[t + 1] # Add all discounted future rewards at once

  return arr
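
As a quick worked example of what discounting computes, here is a small sketch that applies the definition directly: entry $t$ is the sum of $\gamma^{k-t} r_k$ over all $k \ge t$. The helper `discounted_returns` is a hypothetical name used only for this check, and the reward list assumes a Cart-Pole-style reward of 1.0 per step.

```python
def discounted_returns(rewards, gamma):
    """Hypothetical helper: entry t is the sum over k >= t of gamma**(k - t) * rewards[k]."""
    n = len(rewards)
    return [
        sum(gamma ** (k - t) * rewards[k] for k in range(t, n))
        for t in range(n)
    ]

rewards = [1.0, 1.0, 1.0]
print(discounted_returns(rewards, 0.5))  # [1.75, 1.5, 1.0]
```

With `gamma = 0.5`, the first step is credited with 1 + 0.5 + 0.25 = 1.75, while the last step only gets its own reward. Setting `gamma` to 1.0 gives the plain reward-to-go `[3.0, 2.0, 1.0]`, and setting it to 0.0 leaves the rewards unchanged.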

I'll be reading through OpenAI's Spinning Up before I continue coding this, since this series doesn't teach the algorithmic foundations. Last revised 7/16/2019.