# Introduction to reinforcement learning

Consider the scenario of teaching a dog new tricks. The dog doesn't understand our language, so we can't tell them what to do. Instead, we follow a different strategy. We set up a situation (or a cue), and the dog tries to respond in many different ways. If the dog's response is the desired one, we reward them with a snack. The next time the dog is exposed to the same situation, they perform a similar action with even more enthusiasm, expecting more food. That's like learning "what to do" from positive experiences. Similarly, dogs tend to learn what not to do when faced with negative experiences.

That's exactly how Reinforcement Learning works in a broader sense:

* Your dog is an "agent" that is exposed to the environment. The environment could be your house, with you in it.
* The situations they encounter are analogous to a state. An example of a state could be your dog standing in your living room while you say a specific word in a certain tone.
* Our agents react by performing an action to transition from one "state" to another "state". Your dog goes from standing to sitting, for example.
* After the transition, they may receive a reward or penalty in return. You give them a treat! Or a "No" as a penalty.
* The policy is the strategy of choosing an action given a state in expectation of better outcomes.
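The vocabulary above can be written down as a few lines of Python. This is only a sketch of the dog analogy: the state names, action names, reward values, and the `transition`, `reward`, and `policy` names are all made up for illustration and do not come from any library.

```python
# Hypothetical names throughout; this mirrors the dog analogy, not a real library.
STATES = ["standing", "sitting"]
ACTIONS = ["sit", "stay"]

def transition(state, action):
    """Environment dynamics: the state that follows from taking `action` in `state`."""
    if action == "sit":
        return "sitting"
    return state  # "stay" leaves the state unchanged

def reward(state, action, next_state):
    """A treat (+1) for going from standing to sitting, otherwise nothing (0)."""
    if state == "standing" and next_state == "sitting":
        return 1
    return 0

# A policy maps each state to the action the agent chooses in that state.
policy = {"standing": "sit", "sitting": "stay"}

state = "standing"
action = policy[state]
next_state = transition(state, action)
print(state, action, next_state, reward(state, action, next_state))
# prints: standing sit sitting 1
```

Here the policy is fixed by hand. In Reinforcement Learning the agent has to discover a good policy from the rewards it receives.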

Reinforcement Learning sits somewhere between Supervised Learning and Unsupervised Learning, and there are a few important things to note:

* Being greedy doesn't always work. Some things are easy to do for instant gratification, and others provide long-term rewards. The goal is not to be greedy by chasing quick, immediate rewards, but to maximize the total reward over the whole training.
* Sequence matters in Reinforcement Learning. The reward an agent receives does not depend only on the current state, but on the entire history of states. Unlike supervised and unsupervised learning, time is important here.
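To make the difference between greedy and long-term choices concrete, here is a minimal sketch. The option names and reward numbers are invented for illustration. Each option is described by the reward it gives right away and the reward that follows later.

```python
# Each hypothetical first move maps to (immediate reward, reward that follows later).
options = {
    "grab_snack_now": (1, 0),
    "finish_the_trick": (0, 5),
}

# A greedy agent only compares the immediate reward.
greedy = max(options, key=lambda name: options[name][0])

# A far-sighted agent compares the total reward over the whole sequence.
far_sighted = max(options, key=lambda name: sum(options[name]))

print(greedy)       # grab_snack_now
print(far_sighted)  # finish_the_trick
```

The greedy choice collects a total of 1, while the far-sighted choice collects 5, even though its first step pays nothing.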

## The process

<img src="./img/Reinforcement-Learning-Animation.gif" alt="drawing" width="650"/>

In a way, Reinforcement Learning is the science of making optimal decisions using experiences. Breaking it down, the process of Reinforcement Learning involves these simple steps:

1. Observation of the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the experiences and refining our strategy
6. Iterating until an optimal strategy is found
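The steps above can be sketched in plain Python before any library is involved. This is an illustration under assumptions, not the method used later in this tutorial: it swaps in a simple epsilon-greedy rule with a running average of rewards, and the action names, the `environment_reward` function, and the `train` function are hypothetical.

```python
import random

ACTIONS = ["sit", "bark", "roll_over"]  # hypothetical responses to a single cue

def environment_reward(action):
    """The trainer gives a treat (1.0) only for the desired response."""
    return 1.0 if action == "sit" else 0.0

def train(episodes=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}  # estimated reward per action
    count = {a: 0 for a in ACTIONS}    # how often each action was tried
    for _ in range(episodes):          # step 6: iterate
        # Steps 1-2: observe (there is only one cue here) and decide how to act.
        untried = [a for a in ACTIONS if count[a] == 0]
        if untried:
            action = untried[0]                   # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # occasionally explore
        else:
            action = max(ACTIONS, key=value.get)  # otherwise use the best estimate
        # Steps 3-4: act and receive a reward or penalty.
        r = environment_reward(action)
        # Step 5: refine the estimate with an incremental average.
        count[action] += 1
        value[action] += (r - value[action]) / count[action]
    return value

values = train()
best = max(values, key=values.get)
print(best)  # sit
```

Because the reward here is deterministic, the estimate for `sit` settles at 1.0 after its first trial and the other two stay at 0.0. Real environments are noisier and have many states, which is where the rest of this tutorial picks up.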

Let's now understand Reinforcement Learning by actually developing an agent that learns to play a game on its own.

## Getting Started with Gym

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano.

The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.

To get started, you’ll need to have Python 3.5+ installed. Simply install gym using pip:

In [2]:
# pip install gym

## Environments

Here’s a bare minimum example of getting something running. This will run an instance of the CartPole-v0 environment for 300 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem:

In [2]:
import gym
env = gym.make('CartPole-v0')
env.reset()
for i in range(300):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()

In [4]:
import gym
env = gym.make('MountainCar-v0')
env.reset()
for i in range(600):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()

Normally, we’ll end the simulation before the cart-pole is allowed to go off-screen. More on that later. For now, please ignore the warning about calling step() even though this environment has already returned done = True.

If you’d like to see some other environments in action, try replacing CartPole-v0 above with something like MountainCar-v0, MsPacman-v0 (requires the Atari dependency), or Hopper-v1 (requires the MuJoCo dependencies). Environments all descend from the Env base class.

Note that if you’re missing any dependencies, you should get a helpful error message telling you what you’re missing. (Let us know if a dependency gives you trouble without a clear instruction to fix it.) Installing a missing dependency is generally pretty simple. You’ll also need a MuJoCo license for Hopper-v1.



## Observations

If we ever want to do better than take random actions at each step, it’d probably be good to actually know what our actions are doing to the environment.

The environment’s step function returns exactly what we need. In fact, step returns four values. These are:

* observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
* reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
* done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
* info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.
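To see where these four values come from, here is a toy environment written without Gym. It is a sketch only: `CountdownEnv` is a hypothetical class that does not subclass Gym's `Env`, and it merely imitates the `reset()` and `step()` convention described above.

```python
class CountdownEnv:
    """Hypothetical toy environment following the reset()/step() convention."""

    def __init__(self, start=3):
        self.start = start
        self.remaining = start

    def reset(self):
        """Begin a new episode and return the initial observation."""
        self.remaining = self.start
        return self.remaining

    def step(self, action):
        """Apply an action and return (observation, reward, done, info)."""
        self.remaining -= 1
        observation = self.remaining          # what the agent sees next
        reward = 1.0 if action == 1 else 0.0  # action 1 is the rewarded one
        done = self.remaining == 0            # the episode ends at zero
        info = {"last_action": action}        # diagnostics only
        return observation, reward, done, info

env = CountdownEnv()
observation = env.reset()
done = False
total = 0.0
while not done:
    observation, reward, done, info = env.step(1)
    total += reward
print(observation, total, done, info)
# prints: 0 3.0 True {'last_action': 1}
```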

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward.
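As a small illustration of choosing an action from an observation instead of sampling at random, here is a hand-written rule for CartPole. It is a sketch under assumptions: `choose_action` is a hypothetical helper, and it assumes the observation is laid out as [cart position, cart velocity, pole angle, pole angular velocity] with actions encoded as 0 = push left and 1 = push right.

```python
def choose_action(observation):
    """Push the cart toward the side the pole is leaning to.

    Assumes the observation layout
    [cart position, cart velocity, pole angle, pole angular velocity]
    and the action encoding 0 = push left, 1 = push right.
    """
    pole_angle = observation[2]
    return 1 if pole_angle > 0 else 0

print(choose_action([0.0, 0.0, 0.05, 0.0]))   # 1
print(choose_action([0.0, 0.0, -0.05, 0.0]))  # 0
```

In an episode loop, this function would take the place of `env.action_space.sample()`. It is a fixed rule rather than a learned policy, so it only shows how the observation can inform the choice of action.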

The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the earlier CartPole loop would be to respect the done flag:

In [9]:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(10):
    observation = env.reset()
    for t in range(300):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

[-0.04814275 -0.0376203  -0.01749686  0.01238456]
[-0.04889516 -0.23248701 -0.01724917  0.29949605]
[-0.0535449  -0.03712349 -0.01125925  0.00142338]
[-0.05428737  0.15815811 -0.01123078 -0.2947906 ]
[-0.0511242   0.35343835 -0.0171266  -0.5909943 ]
[-0.04405544  0.5487959  -0.02894648 -0.8890225 ]
[-0.03307952  0.3540784  -0.04672693 -0.6055778 ]
[-0.02599795  0.54982156 -0.05883849 -0.91260475]
[-0.01500152  0.35554287 -0.07709058 -0.63897955]
[-0.00789066  0.5516503  -0.08987018 -0.95490915]
[ 0.00314234  0.74785864 -0.10896836 -1.2744204 ]
[ 0.01809951  0.5542823  -0.13445677 -1.0177513 ]
[ 0.02918516  0.7509154  -0.1548118  -1.3494501 ]
[ 0.04420347  0.94760615 -0.1818008  -1.6862909 ]
Episode finished after 14 timesteps
[ 0.00499526  0.00483182  0.021218   -0.04613058]
[ 0.00509189 -0.19058785  0.02029539  0.25317058]
[ 0.00128014 -0.38599363  0.0253588   0.5521853 ]
[-0.00643973 -0.19123682  0.0364025   0.26759872]
[-0.01026447 -0.38685888  0.04175448  0.5715374 ]
[-0.01800165 -