# Reinforcement Learning: An Introduction 2nd Edition
***


# Chapter 1  --  Introduction

## 1. What is Reinforcement Learning?

Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.  
  
**Reinforcement learning** is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.  
  
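Its two most important distinguishing features are **trial-and-error search** and **delayed reward**: the learner is not told which actions to take but must discover which ones yield the most reward by trying them, and an action may affect not only the immediate reward but also all subsequent rewards.  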
  
There are three machine learning paradigms:
* Supervised Learning: Learning from a training set of labeled examples provided by a knowledgeable external supervisor.
* Unsupervised Learning: Finding structure hidden in collections of unlabeled data.
* Reinforcement Learning: Trying to maximize a reward signal.

The trade-off between **exploration** and **exploitation** is a challenge that arises in reinforcement learning but not in the other kinds of learning:
* Exploration: trying new actions in order to make better action selections in the future.
* Exploitation: choosing actions the agent has already experienced and found effective, in order to obtain the best reward.
![](https://steemitimages.com/640x0/https://steemitimages.com/DQmXH5tjBiS41iNtcyvh7s7Rj5z3SqGkcwoaV2otRJNx3FT/Exploration_vs._Exploitation.png)
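
One common way to balance the two is an epsilon-greedy rule. The sketch below is only an illustration under stated assumptions: the function name `epsilon_greedy`, the value estimates, and the choice of `epsilon=0.1` are hypothetical and do not come from the book's text.

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(values))
    # Exploit: pick an action with the highest estimate, breaking ties randomly.
    best = max(values)
    return rng.choice([i for i, v in enumerate(values) if v == best])

# Hypothetical value estimates for three actions (assumed for illustration).
estimates = [0.2, 0.5, 0.1]
rng = random.Random(0)
picks = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
print({action: picks.count(action) for action in range(len(estimates))})
```

With a small `epsilon`, most picks go to the action with the highest estimate, while the remaining picks keep sampling the other actions so their estimates can still improve.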

## 2. Elements of Reinforcement Learning

Beyond the **agent** and the **environment**, there are four main subelements of a reinforcement learning system: a **policy**, a **reward signal**, a **value function**, and optionally, a **model** of the environment.  
* **Policy:** Defines the learning agent's way of behaving at a given time. It is a mapping from perceived states of the environment to the actions to be taken in those states. It could be a simple function or lookup table, or it could involve a search process.  
* **Reward Signal:** Defines the goal in a reinforcement learning problem. At each time step, the environment sends the agent a single number called the reward, which indicates what the good and bad events are for the agent.  
* **Value Function:** Defines what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. **The most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.**  
* **Model:** Mimics the behavior of the environment, allowing the agent to infer how the environment will behave, and is used for planning. Methods for solving reinforcement learning problems that use models and planning are called **model-based** methods, as opposed to simpler **model-free** methods, which are explicitly trial-and-error learners and are viewed as almost the opposite of planning.  
![](https://www.kdnuggets.com/images/reinforcement-learning-fig1-700.jpg)
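
To make the four subelements concrete, here is a minimal sketch on a made-up three-state corridor. Everything in it (the corridor, the names `model`, `policy`, `rollout_return`, the reward of 1.0) is an assumption for illustration. The model is exact and deterministic here, so it also stands in for the environment.

```python
# Hypothetical corridor with states 0, 1, 2; state 2 is terminal.

def model(state, action):
    """Model: predicts the next state and reward for a state-action pair."""
    if action == "right":
        next_state = min(state + 1, 2)
    else:
        next_state = max(state - 1, 0)
    # Reward signal: a single number per step; only reaching state 2 pays off.
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward

# Policy: a lookup table from non-terminal states to actions.
policy = {0: "right", 1: "right"}

def rollout_return(state, policy, max_steps=10):
    """Total reward collected by following `policy` from `state`."""
    total = 0.0
    for _ in range(max_steps):
        if state == 2:  # terminal state reached
            break
        state, reward = model(state, policy[state])
        total += reward
    return total

# Value function: long-run reward from each non-terminal state under `policy`.
values = {s: rollout_return(s, policy) for s in (0, 1)}
print(values)
```

Note that state 0 earns no immediate reward for moving right, yet its value equals that of state 1: values look past the next reward to everything that follows.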

## 3. Limitations and Scope

**Reinforcement learning:** 
* Heavily relies on the concept of **state**—as input to the policy and value function, and as both input to and output from the model. Informally, we can think of the state as a signal conveying to the agent some sense of “how the environment is” at a particular time. 
* Learns by interacting with the environment and estimating value functions.

**Evolutionary methods:** 
* Methods such as genetic algorithms, genetic programming, and simulated annealing apply multiple static policies, each interacting over an extended period of time with a separate instance of the environment. The policies that obtain the most reward, and random variations of them, are carried over to the next generation of policies, and the process repeats.  
* If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find—or if a lot of time is available for the search—then evolutionary methods can be effective. In addition, evolutionary methods have advantages on problems in which the learning agent cannot sense the complete state of its environment.
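
The generate, evaluate, and select loop described above can be sketched as follows. This is a toy under stated assumptions: `evolve`, `evaluate`, and `TARGET` are hypothetical names, and the fitness function merely stands in for the total reward a static policy would collect in its own copy of the environment.

```python
import random

def evolve(evaluate, n_states, n_actions, pop_size=20, generations=30, seed=0):
    """Search over static policies (one action index per state) by selection and mutation."""
    rng = random.Random(seed)
    population = [[rng.randrange(n_actions) for _ in range(n_states)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the half of the population that obtained the most reward.
        population.sort(key=evaluate, reverse=True)
        survivors = population[: pop_size // 2]
        # Refill with random variations of the survivors.
        children = []
        for parent in survivors:
            child = list(parent)
            child[rng.randrange(n_states)] = rng.randrange(n_actions)
            children.append(child)
        population = survivors + children
    return max(population, key=evaluate)

# Stand-in for "total reward over an episode": agreement with a hidden good policy.
TARGET = [1, 0, 1, 1, 0]

def evaluate(policy):
    return sum(action == target for action, target in zip(policy, TARGET))

best = evolve(evaluate, n_states=5, n_actions=2)
print(best, evaluate(best))
```

No value function is estimated anywhere in this loop: each policy is judged only by its final score, which is the key contrast with the reinforcement learning methods above.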



## 4. An Extended Example: Tic-Tac-Toe

### How to build a tic-tac-toe robot from scratch
* **Step 1: Building an environment (state), the board**  
A matrix is the usual representation of the board. At a minimum, the environment should expose the current state, the game rules (a game-over flag), and the winner, if any, in that state.
* **Step 2: Building an agent**  
The agent should be able to sense the environment, estimate the value of each state by interacting with the environment, and update those estimates at each time step (here, using the TD error). It also uses a policy to select actions.
* **Step 3: Building a third party to run the game**  
The third party serves as a judge that manages the game and the agents and lets the agents interact with the environment.
* **Step 4: Building a human player interface**  
* **Step 5: Training the agent with the self-play method**
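
As a rough sketch of Step 1, the board below is stored as a flat tuple of nine cells rather than a matrix, so that a state can be used directly as a dictionary key in a value table. That representation, and the names `LINES`, `winner`, and `play`, are assumptions made for this example and are not taken from the linked code.

```python
# Cells are in row-major order: 1 = first player, -1 = second player, 0 = empty.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def winner(board):
    """Return 1 or -1 for a win, 0 for a draw, or None if the game is not over."""
    for a, b, c in LINES:
        total = board[a] + board[b] + board[c]
        if total == 3:
            return 1
        if total == -3:
            return -1
    return 0 if all(cell != 0 for cell in board) else None

def play(board, position, symbol):
    """Return the new board after `symbol` is placed at `position`."""
    if board[position] != 0:
        raise ValueError("cell already occupied")
    return board[:position] + (symbol,) + board[position + 1:]

empty = (0,) * 9
board = play(play(play(empty, 0, 1), 4, -1), 1, 1)
board = play(play(board, 8, -1), 2, 1)   # first player completes the top row
print(winner(empty), winner(board))
```

Because `play` returns a new tuple instead of mutating the board, earlier states stay intact, which is convenient when an agent needs to look back at the state it moved from.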

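### Temporal-Difference Learning Method
* $V(s) \leftarrow V(s) + \alpha [V(s^{\prime}) - V(s)]$, where $s$ is the state before the move, $s^{\prime}$ is the state after the move, and $\alpha$ is a small positive step-size parameter that sets the learning rate.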
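
A minimal numeric sketch of the update above, assuming a dictionary-backed value table. The name `td_update`, the state labels, and the starting values are illustrative only.

```python
def td_update(values, state, next_state, alpha=0.1):
    """Move V(state) a fraction alpha of the way toward V(next_state)."""
    values[state] += alpha * (values[next_state] - values[state])
    return values[state]

# Hypothetical estimates: the later state is a certain win (value 1.0).
values = {"s": 0.5, "s_next": 1.0}
new_value = td_update(values, "s", "s_next", alpha=0.1)
print(new_value)  # about 0.55: one tenth of the 0.5 gap has been closed
```

Only the earlier state's estimate changes. Repeating the update over many games pulls each state's value toward the values of the states that tend to follow it.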


### Shut up! Show me the code!
* [Official Release Python Code](https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter01/tic_tac_toe.py)  
* [C++ Version]()

***