# Reinforcement learning

## What is reinforcement learning?

### Definition

**Reinforcement learning** is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal.

The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.  
In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.

The two most important distinguishing features of reinforcement learning are:
- **trial-and-error search**
- **delayed reward**

### The field

**Reinforcement learning** (like many topics whose names end with “ing”) is simultaneously:
- a problem
- a class of solution methods that work well on the problem
- the field that studies this problem and its solution methods

It is convenient to use a single name for all three things, but at the same time essential to keep the three conceptually separate.  
In particular, the distinction between problems and solution methods is very important in reinforcement learning;  
failing to make this distinction is the source of many confusions.

We formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically,  
as the optimal control of incompletely-known Markov decision processes.

## How does reinforcement learning differ from other types of machine learning?

### Reinforcement learning vs Supervised learning

**Reinforcement learning** is different from **supervised learning**, the kind of learning studied in most current research in the field of machine learning.  
Supervised learning is learning from a training set of labeled examples provided by a knowledgeable external supervisor.

### Reinforcement learning vs Unsupervised learning

**Reinforcement learning** is also different from what machine learning researchers call **unsupervised learning**,  
which is typically about finding structure hidden in collections of unlabeled data.

### Trade-off between exploration and exploitation

One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the **trade-off between exploration and exploitation**.

To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward.  
But to discover such actions, it has to try actions that it has not selected before.  
The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future.  
The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task.
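
The trade-off can be made concrete with a small sketch. The code below is not from the text: it assumes a hypothetical multi-armed bandit with Gaussian rewards and uses an epsilon-greedy rule, one common way to mix the two behaviors. The names `run_epsilon_greedy`, `true_means`, and `epsilon` are illustrative only.

```python
import random


def run_epsilon_greedy(true_means, epsilon, steps, seed=0):
    """Play a hypothetical Gaussian bandit with an epsilon-greedy agent.

    Returns (average reward per step, value estimates, action counts).
    """
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions  # running estimate of each action's reward
    counts = [0] * n_actions       # how often each action has been tried
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an action uniformly at random.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action with the highest current estimate.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # Assumed reward model: unit-variance noise around the true mean.
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Incremental sample-average update of the chosen action's estimate.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return total_reward / steps, estimates, counts
```

Setting `epsilon=0.0` gives an agent that only exploits, and `epsilon=1.0` one that only explores; comparing the average reward of both against an intermediate value such as `0.1` is a simple way to observe the dilemma described above.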

### Goal-directed agent

Another **key feature** of **reinforcement learning** is that it explicitly considers the whole problem of a **goal-directed agent** interacting with an **uncertain environment**.  
This is in contrast to many approaches that consider subproblems without addressing how they might fit into a larger picture.

## Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system:  
a policy, a reward signal, a value function, and, optionally, a model of the environment.

Each of these is described below:

- **A policy** defines the learning agent’s way of behaving at a given time. Roughly speaking,  
a policy is a mapping from perceived states of the environment to actions to be taken  
when in those states. It corresponds to what in psychology would be called a set of  
stimulus–response rules or associations. In some cases the policy may be a simple function  
or lookup table, whereas in others it may involve extensive computation such as a search  
process. The policy is the core of a reinforcement learning agent in the sense that it alone  
is sufficient to determine behavior. In general, policies may be stochastic, specifying  
probabilities for each action.

- **A reward signal** defines the goal of a reinforcement learning problem. On each time  
step, the environment sends to the reinforcement learning agent a single number called  
the reward. The agent’s sole objective is to maximize the total reward it receives over  
the long run. The reward signal thus defines which events are good and bad for the  
agent.

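- **A value function** specifies what is good in the long run, whereas the reward signal  
indicates what is good in an immediate sense. Roughly speaking, the value of a state is  
the total amount of reward an agent can expect to accumulate over the future, starting  
from that state.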

- **A model of the environment** is something that mimics the behavior of the environment, or  
more generally, that allows inferences to be made about how the environment will behave.  
For example, given a state and action, the model might predict the resultant next state  
and next reward. Models are used for planning, by which we mean any way of deciding  
on a course of action by considering possible future situations before they are actually  
experienced. Methods for solving reinforcement learning problems that use models and  
planning are called model-based methods, as opposed to simpler model-free methods that  
are explicitly trial-and-error learners—viewed as almost the opposite of planning.
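
The four elements can be illustrated together on a toy problem. The sketch below is an assumption-laden example, not something defined in these notes: the two states, their actions, the reward numbers, and the discount factor `GAMMA` are all made up, and the model is deterministic for simplicity. It computes the value of each state under a fixed policy by repeatedly applying the model.

```python
# Hypothetical two-state task; every name and number here is illustrative.
GAMMA = 0.9  # discount factor (an assumption; keeps the long-run total finite)

# Model: predicts (next_state, reward) for each (state, action) pair.
MODEL = {
    "low": {"wait": ("low", 0.0), "recharge": ("high", -1.0)},
    "high": {"wait": ("high", 0.0), "search": ("low", 5.0)},
}

# Policy: a simple lookup table from states to actions.
POLICY = {"low": "recharge", "high": "search"}


def evaluate_policy(model, policy, gamma, sweeps=1000):
    """Estimate each state's value under `policy` by iterating over the model."""
    values = {state: 0.0 for state in model}
    for _ in range(sweeps):
        for state in model:
            # Reward signal: the immediate number returned for this step.
            next_state, reward = model[state][policy[state]]
            # Value: immediate reward plus discounted value of what follows.
            values[state] = reward + gamma * values[next_state]
    return values


values = evaluate_policy(MODEL, POLICY, GAMMA)
```

In this toy setup the action taken in state `"low"` has a negative immediate reward, yet the state's value comes out positive because it leads to the rewarding `"search"` step. That gap between immediate reward and long-run value is the distinction the list above draws.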

## Summary

**Reinforcement learning** is a computational approach to understanding and automating goal-directed learning and decision making.  
It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its  
environment, without requiring exemplary supervision or complete models of the environment.  
Reinforcement learning is the first field to seriously address the computational issues that arise when learning  
from interaction with an environment in order to achieve long-term goals.

**Reinforcement learning** uses the formal framework of **Markov decision processes** to  
define the interaction between a learning agent and its environment in terms of states,  
actions, and rewards. This framework is intended to be a simple way of representing  
essential features of the artificial intelligence problem. These features include a sense of  
cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals.
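
As a rough illustration of this interaction, the sketch below runs an agent-environment loop in terms of states, actions, and rewards. The `Corridor` environment, its `slip` parameter, and the `reset`/`step` method names are hypothetical choices made for this example; they are not an interface defined in these notes.

```python
import random


class Corridor:
    """Hypothetical environment: walk right along a corridor to reach a goal."""

    def __init__(self, length=5, slip=0.1, seed=0):
        self.length = length
        self.slip = slip  # chance that a move goes the wrong way
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        """Put the agent back in the first cell and return that state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply action (-1 or +1); return (next_state, reward, done)."""
        if self.rng.random() < self.slip:
            action = -action  # nondeterminism: the move is reversed
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # explicit goal: reach the last cell
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one episode; return the list of (state, action, reward) steps."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def always_right(state):
    """A fixed policy that moves right in every state."""
    return +1
```

Calling `run_episode(Corridor(), always_right)` yields a trajectory in which actions influence later states (cause and effect), moves occasionally go the wrong way (uncertainty), and reward arrives only at the final cell (an explicit goal).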

The concepts of **value** and **value function** are key to most of the reinforcement learning  
methods that we consider in this book. We take the position that value functions  
are important for efficient search in the space of policies. The use of value functions  
distinguishes reinforcement learning methods from evolutionary methods that search  
directly in policy space guided by evaluations of entire policies.
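
For reference, the notion of value discussed here is commonly written as an expected discounted return. The symbols below ($G_t$ for the return, $\gamma \in [0, 1)$ for a discount factor, $R_t$ and $S_t$ for the reward and state at time $t$, and $v_\pi$ for the value function of a policy $\pi$) follow a standard textbook convention and are assumptions here, since these notes do not define any notation.

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad
v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]
$$

Under this reading, a value-based method estimates $v_\pi$ state by state, whereas an evolutionary method only ever scores a whole policy by the return it obtains.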