<h1>Introduction to Reinforcement Learning with TensorFlow</h1>
<p>This tutorial was adapted from various <a href="https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724">examples</a> provided by Arthur Juliani. I would recommend their various blog posts for further reading on Reinforcement Learning.</p>
<p>This notebook introduces some of the key concepts of Reinforcement Learning with Python and TensorFlow. Please make sure that the following import statements run successfully before continuing:</p>

In [1]:
import tensorflow as tf
import gym


<p>Now that we have this taken care of, we can begin with a brief introduction to one subset of Reinforcement Learning methods: policy based methods. Policy based methods improve our agent's performance by determining which policies for acting produce better results. This differs from value based methods such as Q learning. <a href="https://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html">This writeup</a> provides a relatively concise comparison of the two approaches, but in short: policy methods learn directly how to act, while Q learning methods learn how valuable each action is in each state and act on those estimates. Both approaches may be used to make an intelligent agent, but in the examples here the state space is continuous rather than discrete, so we focus on policy based methods.</p>
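<p>To make the distinction concrete, here is a minimal sketch of how each kind of agent might pick an action. It is an illustration only, not the method used later in this notebook: the function names and the numbers are hypothetical, and it uses only the Python standard library.</p>

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_action(logits, rng):
    """Policy based choice: sample an action from the policy's probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_q_action(q_values):
    """Value based choice: pick the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)

# Hypothetical numbers for a problem with two actions.
logits = [0.2, 1.0]    # raw scores a policy might output
q_values = [4.5, 5.1]  # value estimates a Q learning agent might hold

print("Policy probabilities:", softmax(logits))
print("Sampled policy action:", policy_action(logits, rng))
print("Greedy Q action:", greedy_q_action(q_values))
```

<p>The policy based agent samples from a distribution, so it can still try the less likely action, while the greedy value based agent always takes the action with the highest estimate.</p>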
<p>Our agent should operate by processing its current <i>observation</i> of the world to choose <i>actions</i> which maximize a <i>reward</i> feedback signal. The specific world our agent will work in is referred to as the <i>environment</i>, and for this example, we will be working with OpenAI's <a href="https://gym.openai.com/docs/">Cart-Pole</a> environment. Running the following code should set up our example environment.</p>

In [2]:
env = gym.make("CartPole-v0")

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


<p>Now that our environment is set up, we can see what it looks like using the render() function. For this environment, our goal is to train an agent to balance a pivoting pole on a rolling cart. The following code runs an agent that performs random actions for 20 episodes, resetting the environment at the start of each one. Each episode lasts at most 100 frames and ends early once the environment reports that it is done, which happens when the pole tips too far or the cart moves too far from the center.</p>

<p>Note: if the following code fails, try downgrading the pyglet package using the following command:</p>
<div style="background-color:#300a24"><b><p style="color:white">python3 -m pip install pyglet==1.2.4</p></b></div>

In [3]:
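for i_episode in range(20):
    # Start each episode from a fresh initial state
    observation = env.reset()
    for t in range(100):
        env.render()
        # Pick a random action from the set of valid actions
        action = env.action_space.sample()
        # Apply the action and receive the next observation and reward
        observation, reward, done, info = env.step(action)
        if done:
            # The episode ended before reaching the 100 frame limit
            print("Episode finished after {} timesteps".format(t+1))
            break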

Episode finished after 12 timesteps
Episode finished after 14 timesteps
Episode finished after 11 timesteps
Episode finished after 17 timesteps
Episode finished after 25 timesteps
Episode finished after 20 timesteps
Episode finished after 15 timesteps
Episode finished after 19 timesteps
Episode finished after 12 timesteps
Episode finished after 25 timesteps
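
<p>The episode lengths above give a rough baseline for a random agent, which is useful to compare against once we train a policy. The short sketch below simply summarizes the numbers printed in this run; your own values will differ because the actions are random.</p>

```python
# Episode lengths copied from the output printed above.
episode_lengths = [12, 14, 11, 17, 25, 20, 15, 19, 12, 25]

# Average number of timesteps the random agent kept the pole up
mean_length = sum(episode_lengths) / len(episode_lengths)

print("Mean episode length: {:.1f}".format(mean_length))
print("Longest: {}, shortest: {}".format(max(episode_lengths), min(episode_lengths)))
```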


<h1>Observation Space</h1>
<p>It is important to note what our observation space is. Our observations represent how our agent perceives the world, and in the case of the CartPole environment, our observations consist of 4 numbers representing the position of the cart, the velocity of the cart, the angle of the pole, and the rotational velocity of the pole. Additional information about the CartPole environment can be found <a href="https://github.com/openai/gym/wiki/CartPole-v0">here</a>.</p>

In [4]:
print('Observation space:', env.observation_space)

Observation space: Box(4,)
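
<p>Because an observation is just four unlabeled numbers, it can help to attach names to them while exploring. The following is a minimal sketch: the field names and the label_observation helper are illustrative and are not part of gym, and the sample values are made up. In this notebook you could pass it the array returned by env.reset() instead.</p>

```python
from collections import namedtuple

# Field names are illustrative labels for the four values described above;
# they are not identifiers defined by gym.
CartPoleObservation = namedtuple(
    "CartPoleObservation",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

def label_observation(observation):
    """Attach names to a 4-element CartPole observation."""
    values = [float(v) for v in observation]
    if len(values) != 4:
        raise ValueError("expected 4 values, got {}".format(len(values)))
    return CartPoleObservation(*values)

# Made-up sample values standing in for a real observation.
sample = [0.03, -0.02, 0.01, 0.04]
labeled = label_observation(sample)

print("Pole angle:", labeled.pole_angle)
print("Cart velocity:", labeled.cart_velocity)
```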
