# An introduction to Reinforcement Learning
> "Demo and Discussion of RL & OpenAI gym"

In [1]:
#collapse-hide
%matplotlib inline
import PIL
import gym
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import animation
# Imports specifically so we can render outputs in Jupyter.
from IPython.display import display

# Larger axis and tick labels for readability
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# To get smooth animations
mpl.rc('animation', html='jshtml')

## What is RL?

- In a nutshell, RL is the study of agents and how they learn by trial and error
- It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forgo that behavior in the future

This is quite a broad setting, which can apply to a wide variety of tasks.  

<img src="my_icons/RL_applications.JPG" width=700 height=700 left   align="left"/>

1. The agent can be the program controlling a robot. In this case, the environment is the real world. The agent observes the environment through a set of sensors such as cameras and touch sensors, and its actions consist of sending signals to activate motors  
1. The agent can be the program controlling Ms. Pac-Man. In this case, the environment is a simulation of the Atari game
1. The agent can be the program playing a board game such as Go
1. The agent does not have to control a physically (or virtually) moving thing. For example, it can be a smart thermostat that gets positive rewards whenever it is close to the target temperature and saves energy, and negative rewards when humans need to tweak the temperature
1. The agent can observe stock market prices and decide how much to buy or sell every second. The rewards are the monetary gains and losses

## RL framework

<img src="my_icons/RL_framework.JPG" width=700 height=700 left   align="left"/>

**Key Concepts and Terminology**:  

1. The main characters of RL are the agent and the environment:<br>
   - The agent is the learner and decision maker, the entity whose behavior we want to improve
   - The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the environment, and then decides on an action to take  
   - The environment changes when the agent acts on it, but may also change on its own  

2. Moreover:
   - The agent perceives a reward signal from the environment, a number that tells it how good or bad the current world state is  
   - The goal of the agent is to maximize its cumulative reward, called the return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal
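
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration, not a real environment: the `ThermostatEnv` class, its reward, and the `simple_agent` rule are hypothetical names invented here to show the observe, act, reward cycle and how the return accumulates.

```python
import random


class ThermostatEnv:
    """Hypothetical toy environment: a room whose temperature drifts on its own."""

    def __init__(self, target=21.0, seed=0):
        self.target = target
        self.rng = random.Random(seed)
        self.temperature = 15.0

    def reset(self):
        self.temperature = 15.0
        return self.temperature  # the observation

    def step(self, action):
        # action is +1 (heat), 0 (idle) or -1 (cool)
        drift = self.rng.uniform(-0.5, 0.5)  # the environment also changes on its own
        self.temperature += action + drift
        reward = -abs(self.temperature - self.target)  # closer to the target is better
        return self.temperature, reward


def simple_agent(observation, target=21.0):
    """Hypothetical hand-written policy: heat when cold, cool when hot."""
    if observation < target - 0.5:
        return 1
    if observation > target + 0.5:
        return -1
    return 0


def run_episode(env, agent, n_steps=50):
    observation = env.reset()
    episode_return = 0.0  # cumulative reward
    for _ in range(n_steps):
        action = agent(observation)             # the agent decides on an action
        observation, reward = env.step(action)  # the environment responds
        episode_return += reward
    return episode_return


episode_return = run_episode(ThermostatEnv(), simple_agent)
print(f"Return over one episode: {episode_return:.2f}")
```

Here the agent is hand-coded. A reinforcement learning method would instead adjust the agent's behavior so that the return grows over many episodes.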


> To talk more specifically about what RL does, we need to introduce additional terminology: states and observations, action spaces, policies, return, and value functions.

### States and Observations<br>
- A state **s** is a complete description of the world. No information about the world is hidden from the state.  
- An observation **o** is a partial description of a state, which may omit information. In practice, the two terms are often used interchangeably

- In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor. For instance, a visual observation could be represented by the RGB matrix of its pixel values, and the state of a robot might be represented by its joint angles and velocities
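
As a small illustration of these representations, the sketch below builds a state vector, a partial observation, and an image tensor with NumPy. The three-joint robot, its numbers, and the 64x64 image size are assumptions made up for the example.

```python
import numpy as np

# Hypothetical robot arm with 3 joints: the full state is angles plus angular velocities.
joint_angles = np.array([0.10, -0.45, 1.20])      # radians
joint_velocities = np.array([0.00, 0.30, -0.10])  # radians per second
state = np.concatenate([joint_angles, joint_velocities])

# A partial observation that omits the velocities (for example, no velocity sensors).
observation = state[:3]

# A visual observation: a 64x64 RGB image stored as a height x width x channel tensor.
image_observation = np.zeros((64, 64, 3), dtype=np.uint8)

print(state.shape)              # (6,)
print(observation.shape)        # (3,)
print(image_observation.shape)  # (64, 64, 3)
```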


### Action Spaces <br>
- Different environments allow different kinds of actions  
- The set of all valid actions in a given environment is often called the action space. Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent.  
  Other environments, such as those where the agent controls a robot in the physical world, have continuous action spaces, where actions are real-valued vectors
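
The difference between the two kinds of action space can be sketched with NumPy. The four action names and the two-motor torque bounds below are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Discrete action space: a finite set of moves, indexed 0..n-1.
# The four action names are made up for illustration.
discrete_actions = ["up", "down", "left", "right"]
discrete_action = rng.integers(len(discrete_actions))

# Continuous action space: a real-valued vector within bounds,
# here a hypothetical torque in [-1, 1] for each of two motors.
low = np.array([-1.0, -1.0])
high = np.array([1.0, 1.0])
continuous_action = rng.uniform(low, high)

print(discrete_actions[discrete_action])
print(continuous_action)
```

In Gym, these two kinds of spaces correspond to `gym.spaces.Discrete` and `gym.spaces.Box`.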

## Policies

- The algorithm a software agent uses to determine its actions is called its policy. A policy, usually denoted by $\pi$, can be deterministic or stochastic
- Because the policy is essentially the agent’s brain, it’s not uncommon to substitute the word “policy” for “agent”, e.g. saying “The policy is trying to maximize reward.”
- In deep RL, we deal with parameterized policies, i.e. policies whose outputs are computable functions that depend on a set of parameters (e.g. the weights and biases of a neural network), which we can adjust to change the behavior via some optimization algorithm  
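
To make the idea of a parameterized policy concrete, here is a minimal sketch in NumPy. The `LinearSoftmaxPolicy` class is a hypothetical example (a single linear layer followed by a softmax), not the policy used later in any particular environment. It shows where the adjustable parameters live and how the same policy can be used stochastically or deterministically.

```python
import numpy as np


def softmax(logits):
    """Convert a vector of scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


class LinearSoftmaxPolicy:
    """Hypothetical policy: one linear layer followed by a softmax over actions."""

    def __init__(self, n_observations, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        # The parameters an optimization algorithm would adjust.
        self.weights = self.rng.normal(scale=0.1, size=(n_observations, n_actions))
        self.biases = np.zeros(n_actions)

    def action_probabilities(self, observation):
        return softmax(observation @ self.weights + self.biases)

    def act_stochastic(self, observation):
        """Sample an action according to its probability."""
        probs = self.action_probabilities(observation)
        return int(self.rng.choice(len(probs), p=probs))

    def act_deterministic(self, observation):
        """Always pick the most probable action."""
        return int(np.argmax(self.action_probabilities(observation)))


policy = LinearSoftmaxPolicy(n_observations=4, n_actions=2)
observation = np.array([0.1, -0.2, 0.05, 0.0])
print(policy.action_probabilities(observation))
print(policy.act_stochastic(observation), policy.act_deterministic(observation))
```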

<img src="my_icons/RL_agent.JPG" width=800 height=800 left   align="left"/>

### Policy design<br>

- Techniques to design a policy:

      For example, consider a robotic vacuum cleaner whose reward is
      the amount of dust it picks up in 30 minutes. 
      Its policy could be to move forward with some probability p every second,
      or randomly rotate left or right with probability 1 – p. 
      The rotation angle would be a random angle between –r and +r.
      
      How would you train such a robot?
      There are just two policy parameters you can tweak: the probability p and the angle range r.
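
The two-parameter policy described above could be written as follows. This is a sketch under assumptions: the text does not say how the rotation angle is distributed or in which unit, so a uniform draw in degrees is assumed here, and the function name `vacuum_policy` is hypothetical.

```python
import random


def vacuum_policy(p, r, rng):
    """Return one action of the two-parameter vacuum cleaner policy.

    With probability p the robot moves forward; otherwise it rotates by a
    random angle drawn uniformly from [-r, +r] (assumed to be in degrees).
    """
    if rng.random() < p:
        return ("forward", 0.0)
    return ("rotate", rng.uniform(-r, r))


rng = random.Random(42)
actions = [vacuum_policy(p=0.8, r=45.0, rng=rng) for _ in range(10)]
for action in actions:
    print(action)
```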

<img src="my_icons/RL_vaccum_robot.JPG" width=700 height=700 left   align="left"/><br>

Four points in policy space (left) and the agent’s corresponding behavior (right)

- One possible learning algorithm could be to try out many different values for these parameters, and pick the
combination that performs best (see the figure above).  
This is an example of ***policy search using a brute force approach***. When the policy space is too
large (which is generally the case), finding a good set of parameters  
this way is like searching for a needle in a gigantic haystack.
- Another way to explore the policy space is to use genetic algorithms. For example, you could randomly create a first generation of 100 policies and try them out,  
then “kill” the 80 worst policies and make the 20 survivors produce 4 offspring each.  
An offspring is a copy of its parent plus some random variation. The surviving policies plus their offspring together constitute the second generation.  
You can continue to iterate through generations this way until you find a good policy.
- Yet another approach is to use optimization techniques, by evaluating the
gradients of the rewards with respect to the policy parameters,  
then tweaking these parameters by following the gradients toward higher rewards.  
This approach is called **policy gradients (PG)**.  
Going back to the vacuum cleaner robot, you could slightly increase p and evaluate whether doing so increases the amount of dust picked up by the robot in 30 minutes;  
if it does, then increase p some more, or else reduce p.
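
The three search strategies above can be compared on a toy problem. Everything in this sketch is an assumption made for illustration: `dust_collected` is a made-up smooth score standing in for an actual 30-minute run of the robot, the grid, mutation sizes, and learning rates are hand-picked for that function, and the last routine uses a simple finite-difference estimate of the gradient rather than the estimators used by practical policy gradient algorithms.

```python
import random


def dust_collected(p, r):
    """Hypothetical stand-in for a 30-minute run: a smooth score peaking at p=0.7, r=60."""
    return 100.0 - 200.0 * (p - 0.7) ** 2 - 0.01 * (r - 60.0) ** 2


def clip(value, low, high):
    return max(low, min(high, value))


def brute_force_search(score_fn):
    """Evaluate every point on a fixed grid and keep the best one."""
    grid = [(i / 10, float(angle)) for i in range(11) for angle in range(0, 181, 15)]
    return max(grid, key=lambda params: score_fn(*params))


def genetic_search(score_fn, n_generations=20, seed=0):
    """Keep the best 20 of 100 policies; each survivor produces 4 mutated offspring."""
    rng = random.Random(seed)
    population = [(rng.random(), rng.uniform(0.0, 180.0)) for _ in range(100)]
    for _ in range(n_generations):
        population.sort(key=lambda params: score_fn(*params), reverse=True)
        survivors = population[:20]
        offspring = [
            (clip(p + rng.gauss(0.0, 0.05), 0.0, 1.0),
             clip(r + rng.gauss(0.0, 5.0), 0.0, 180.0))
            for p, r in survivors
            for _ in range(4)
        ]
        population = survivors + offspring
    return max(population, key=lambda params: score_fn(*params))


def finite_difference_ascent(score_fn, p=0.2, r=120.0, n_steps=100,
                             lr_p=0.001, lr_r=10.0, eps=1e-3):
    """Nudge each parameter, estimate the slope of the score, and step uphill."""
    for _ in range(n_steps):
        grad_p = (score_fn(p + eps, r) - score_fn(p - eps, r)) / (2 * eps)
        grad_r = (score_fn(p, r + eps) - score_fn(p, r - eps)) / (2 * eps)
        p = clip(p + lr_p * grad_p, 0.0, 1.0)
        r = clip(r + lr_r * grad_r, 0.0, 180.0)
    return p, r


best_grid = brute_force_search(dust_collected)
best_genetic = genetic_search(dust_collected)
best_gradient = finite_difference_ascent(dust_collected)
for name, params in [("brute force", best_grid),
                     ("genetic", best_genetic),
                     ("finite-difference ascent", best_gradient)]:
    print(f"{name}: p={params[0]:.3f}, r={params[1]:.1f}, "
          f"score={dust_collected(*params):.2f}")
```

On this toy score all three approaches end up near the same peak. The brute-force grid needs a number of evaluations that grows quickly with each extra parameter, which is why it becomes impractical for large policy spaces.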