# Chapter 1

## Introduction

- We learn by interacting with our environment. This is a foundational idea in theories of learning and intelligence.

- This book explores a computational approach to learning from interaction. Reinforcement learning is focused on goal-directed learning from interaction.

## 1.1 Reinforcement Learning

- RL is learning what to do to maximise a numerical reward signal. The learner discovers which actions yield the most reward by trial-and-error. Actions may affect immediate reward and all subsequent rewards (delayed rewards).

- RL is a problem, a class of solution methods that work well on the problem, and the field that studies the problem and its solution methods.

- The RL problem is the optimal control of incompletely-known Markov decision processes. A method well suited to solving such problems is a reinforcement learning method.

- A challenge in RL is the trade-off between exploration and exploitation. To obtain a lot of reward, an RL agent must prefer actions it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions not selected before. The agent has to _exploit_ what it has already experienced in order to obtain reward, but it also has to _explore_ to make better action selections in the future. Neither can be pursued exclusively without failing at the task.

- RL starts with a complete, interactive, goal-seeking agent. All RL agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments.
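
The exploration-exploitation trade-off above can be made concrete with a small sketch. It uses the common "epsilon-greedy" rule, which these notes do not name: with a small probability `epsilon` the agent tries a random action, otherwise it picks the action with the highest current estimate. The function name, the reward estimates, and the value of `epsilon` are all assumptions chosen for illustration.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: uniformly random action
    best = max(estimates)
    # exploit: break ties among the best-looking actions at random
    return rng.choice([a for a, q in enumerate(estimates) if q == best])

rng = random.Random(0)
estimates = [0.2, 0.8, 0.5]  # hypothetical reward estimates for three actions
picks = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # fraction of picks that chose action 1
```

With `epsilon` set to 0 the agent only exploits and never revises a poor estimate; with `epsilon` set to 1 it only explores and never uses what it has learned.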

## 1.2 Examples

- RL examples:

  - Chess player makes move. Choice informed by planning &mdash; anticipating replies and counterreplies &mdash; and by immediate, intuitive judgements of desirability of positions and moves.

  - Controller adjusts parameters of a machine's operation in real time.

  - Newborn learns to walk then run.

  - Preparing breakfast by accessing information about the state of your body that determines your nutritional needs, level of hunger, and food preferences.

- In examples the effects of actions cannot be fully predicted; agent monitors environment frequently and reacts appropriately.

  - E.g. watching milk pouring so it doesn't overflow, or winning at chess.

- Agent can use its experience to improve its performance over time. The knowledge the agent brings to the task at the start &mdash; either from previous experience or built in by design or evolution &mdash; influences what is useful or easy to learn.

## 1.3 Elements of Reinforcement Learning

- Beyond agent and environment, 4 main sub-elements: _policy_, _reward signal_, _value function_, and, optionally, _model_ of the environment.

- Policy: defines learning agent's way of behaving at a given time.

  - Mapping from perceived states of the environment to actions to be taken when in those states.

- Reward signal: defines goal of an RL problem.

  - On each time step, environment sends to RL agent a single number called _reward_ (immediate).

  - Objective of agent is to maximise total reward it receives over the long run.

  - Reward signal defines what are the good and bad events (immediate).

  - Reward signal is primary basis for altering policy.

- Value function: specifies what is good in the long run.

  - Value of a state: the total amount of reward an agent can expect to accumulate over the future, starting from that state.

- Without rewards there would be no values, and the only point of estimating values is to achieve more reward. Nevertheless, values are what we are most concerned with when making and evaluating decisions. Most important component of almost all RL algorithms is a method for efficiently estimating values.

- Model: mimics behaviour of the environment and allows inferences to be made about how the environment will behave.

  - Given state and action, model might predict resultant next state and next reward.

  - Used for planning: way of deciding on a course of action.

  - Methods for solving RL problems that use models and planning are called _model-based_ methods, as opposed to simpler _model-free_ methods that are explicitly trial-and-error learners (viewed as almost the opposite of planning).
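
As a rough illustration of how the elements above fit together, the sketch below uses a hypothetical four-state corridor (not from the text). The model predicts the next state and reward, the value table holds assumed long-run estimates, and the policy picks the action whose predicted outcome looks best. All names and numbers are made up for the example.

```python
# Hypothetical 4-state corridor: states 0..3, state 3 is the goal.
ACTIONS = (-1, +1)  # step left or step right

def model(state, action):
    """Model: predicts (next_state, reward) for a state-action pair."""
    next_state = min(max(state + action, 0), 3)
    reward = 1.0 if next_state == 3 and state != 3 else 0.0
    return next_state, reward

# Value function: assumed (deliberately rough) long-run estimates per state.
value = {0: 0.25, 1: 0.5, 2: 0.75, 3: 0.0}

def policy(state):
    """Policy: maps a state to an action by one-step lookahead through the model."""
    def lookahead(action):
        next_state, reward = model(state, action)
        return reward + value[next_state]
    return max(ACTIONS, key=lookahead)

print([policy(s) for s in range(3)])  # [1, 1, 1]: always step towards the goal
```

This is a model-based agent in miniature: it plans by consulting the model before acting. A model-free agent would have no `model` function and would have to learn which actions are good purely from experienced rewards.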

## 1.4 Limitations and Scope

- Policy and value function take state as input. State is both input to and output from model.

  - State: signal conveying to agent "how the environment is" at particular time.

- Most RL methods in this book structured around estimating value functions. Others such as evolutionary methods have advantage on problems in which learning agent cannot sense the complete state of its environment.

- Focus in this book is on RL methods that learn while interacting with the environment, which evolutionary methods do not do. Evolutionary methods do not use fact that policy they are searching for is a function from states to actions; they do not notice which states an individual passes through during its lifetime, or which actions it selects.

## 1.5 An Extended Example: Tic-Tac-Toe

- Cannot readily be solved through classical techniques such as "minimax", which assumes a particular way of playing by the opponent. Best a classical approach can do is to learn a model of the opponent's behaviour, up to some level of confidence, then apply dynamic programming to compute an optimal solution given that approximate model.

- Evolutionary method searches space of possible policies (rule that tells player what move to make for every state of the game) for one with high probability of winning. For each policy, obtain estimate of winning probability by playing some number of games against the opponent.

- Tic-tac-toe approach with a method using a value function:

  1. Set up table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of winning from that state &mdash; the state's _value_. The whole table is the learned value function.

    - State A has higher value than state B (better), if estimate from A > that from B.

    - Assuming playing Xs:

      - &forall; states with 3 Xs in a row, __P__[win] = 1

      - &forall; states with 3 Os in a row, or that are filled up, __P__[win] = 0

      - Set initial values of all other states to 0.5.

  2. Play many games against the opponent.

    - Select moves by examining the states that would result from each possible move and look up their current values in the table. Most of the time move _greedily_; occasionally select randomly. These are _exploratory_ moves because they cause us to experience states that we might otherwise never see.
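
The table set-up and move selection described in the two steps above might look like the following minimal sketch. It assumes a board is a 9-character string of `"X"`, `"O"`, and spaces, fills the table lazily rather than enumerating every state up front, and uses an `epsilon` parameter for the chance of an exploratory move. These representation choices and function names are assumptions, not something the text prescribes.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def has_line(board, mark):
    return any(all(board[i] == mark for i in line) for line in WIN_LINES)

def initial_value(board):
    """Initial estimate of P[win] for X on this board."""
    if has_line(board, "X"):
        return 1.0
    if has_line(board, "O") or " " not in board:
        return 0.0
    return 0.5

def value_of(values, board):
    # Fill the table lazily instead of enumerating every state up front.
    if board not in values:
        values[board] = initial_value(board)
    return values[board]

def choose_move(values, board, epsilon, rng):
    """Return (square, was_greedy) for X to play on `board`."""
    empty = [i for i, c in enumerate(board) if c == " "]
    if rng.random() < epsilon:
        return rng.choice(empty), False  # exploratory move
    def after(i):
        # Value of the state that would result from playing square i.
        return value_of(values, board[:i] + "X" + board[i + 1:])
    return max(empty, key=after), True  # greedy move

values = {}
board = "XX OO    "
print(choose_move(values, board, epsilon=0.0, rng=random.Random(0)))  # (2, True)
```

The returned flag records whether the move was greedy, which matters because only greedy moves are used to back up values.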

While playing, change the values of the states we find ourselves in during the game. Attempt to make more accurate estimates by "backing up" the value of the state after each greedy move to the state before the move: current value of earlier state updated to be closer to the value of the later state.

Let S<sub>t</sub> denote the state before the greedy move, and S<sub>t+1</sub> the state after that move, then the update to the estimated value of S<sub>t</sub>, denoted V(S<sub>t</sub>), can be written as

$$
V(S_t) \leftarrow V(S_t) + \alpha[V(S_{t+1}) - V(S_t)]
$$

where &alpha; is a small positive fraction called the _step-size parameter_, which influences the rate of learning. This update rule is an example of a _temporal-difference_ learning method because its changes are based on a difference between estimates at two successive times.
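
A short worked sketch of this update, assuming the value table is a dictionary keyed by state. The state names and starting estimates are hypothetical, and the step size of 0.25 is larger than one would normally use; it is chosen only so the arithmetic is exact.

```python
def td_update(values, state, next_state, alpha):
    """Move V(state) a fraction alpha of the way towards V(next_state)."""
    values[state] += alpha * (values[next_state] - values[state])
    return values[state]

values = {"before": 0.5, "after": 1.0}  # hypothetical states and estimates
print(td_update(values, "before", "after", alpha=0.25))  # 0.625
print(td_update(values, "before", "after", alpha=0.25))  # 0.71875
```

Each update closes a quarter of the remaining gap: 0.5 + 0.25 × (1.0 − 0.5) = 0.625, then 0.625 + 0.25 × (1.0 − 0.625) = 0.71875. Repeated updates pull the earlier state's estimate towards that of the later state.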

- If step-size parameter reduced properly over time, then method converges, for any fixed opponent, to the true probability of winning from each state given optimal play by our player. If not reduced all the way to zero over time, also plays well against opponents that slowly change their way of playing.

- To evaluate a policy, an evolutionary method holds the policy fixed and plays many games against the opponent, or simulates many games using a model of the opponent. The frequency of wins gives an unbiased estimate of the probability of winning with that policy, and can be used to direct the next policy selection. Only the final outcome of each game is used. In contrast, value function methods allow individual states to be evaluated. Both search the space of policies, but learning a value function takes advantage of information available during the course of play.

- In the example no prior knowledge beyond rules of the game. Prior information can be incorporated.

- Model-free systems cannot think about how their environments will change in response to actions. Tic-tac-toe player is model-free with respect to its opponent: has no model of opponent of any kind.
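
The policy evaluation that evolutionary methods rely on, described earlier in this section, can be sketched as a win-frequency estimate. The two players below (`first_empty_player` and `random_player`) are hypothetical stand-ins for a fixed policy and an opponent, and a draw is counted as not winning. Only the final outcome of each game feeds the estimate.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play_game(policy, opponent, rng):
    """Play one game with `policy` as X (moving first); return True if X wins."""
    board = [" "] * 9
    players = [("X", policy), ("O", opponent)]
    for turn in range(9):
        mark, choose = players[turn % 2]
        board[choose(board, rng)] = mark
        if winner(board) is not None:
            return winner(board) == "X"
    return False  # a draw counts as not winning

def random_player(board, rng):
    return rng.choice([i for i, c in enumerate(board) if c == " "])

def first_empty_player(board, rng):
    return board.index(" ")  # fixed policy: always take the first free square

def estimate_win_probability(policy, opponent, games, rng):
    """Fraction of games won; intermediate states are never inspected."""
    wins = sum(play_game(policy, opponent, rng) for _ in range(games))
    return wins / games

rng = random.Random(0)
print(estimate_win_probability(first_empty_player, random_player, 2000, rng))
```

Nothing in `estimate_win_probability` records which states were visited or which moves were made. That is the information a value function method uses and an evolutionary method discards.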