# Day 1 - Introduction

Learning happens by interacting with our environment.
Whatever it is we're learning, we notice how our environment reacts to our actions, and we seek to influence what happens.

The book ([Sutton & Barto](http://incompleteideas.net/book/RLbook2020.pdf)) explores a computational approach to learning from interaction. Looking at idealized situations, we try to understand the best way to learn and act, exploring different designs and methods.

## Reinforcement Learning

* Learn what to do to maximize reward
* Learner must discover actions and rewards itself
* All actions may affect far future rewards
* Distinguishing features of RL: trial-and-error search + delayed reward
* RL is a class of problems, the solutions to these problems, and the entire field studying these
* Ideas from dynamical systems theory: optimal control of incompletely known Markov decision processes
* Agent must sense the environment, be able to act in it, and have goals to achieve in it
    * Sensation, Action, Goal
* Agents have to learn from their own experience; supervised learning alone is not adequate, because labeled examples of correct behavior are rarely available for every situation the agent encounters
* Unlike unsupervised learning, RL doesn't aim to find structure, but to maximize reward
* One challenge: Exploration vs. Exploitation
    * Learning more vs. Using knowledge to gain reward
* RL considers the whole problem of goal-directed acting, not focusing on subproblems in isolation
* Subproblems are considered only in service of a complete, goal-seeking agent
* A "complete" agent can be part of a larger agent
* RL is part of a trend towards greater interdisciplinary integration
* Many core RL algorithms are inspired by how real biological systems learn
* Modern AI and RL focus more on the search for general principles, as opposed to a collection of many specialized methods for intelligence

## Examples

* Agents need to assess past experience, immediate concerns, and plans for the future, while keeping track of their internal state and the external environment, and setting and pursuing goals
* Each action can have indirect, future consequences that need to be taken into account for later planning
* Every action serves a goal, and the agent can judge its progress toward that goal directly from what it senses, which guides its behavior
* The agent continuously learns from its experience to improve its behavior in the future

## Elements of Reinforcement Learning

* Four main elements:
    1. Policy
    2. Reward signal
    3. Value function
    4. Environment model
* Policy maps from perceived states to actions to perform
    - A form of stimulus-response
    - Can be a simple lookup table, or involve complex planning and search
* Reward signal defines the goal of a problem
    - A single number given by the environment at each time step
    - Goal is to maximize this over time
    - Defines what's good and bad
    - Like pleasure and pain
    - Primary basis for altering the policy
* Value function specifies long-term reward
    - Value of a state is expected accumulated future reward
    - Rewards: immediate, intrinsic desirability; Values: long-term desirability
    - Low reward might be followed by very high reward, making value an important concept
    - Rewards: Pleasure and pain; Values: Higher level judgment of desirability of environment
    - Without rewards, no values; Purpose of values is to maximize rewards
    - All choices are made based on values
    - Values are much harder to determine than immediately received rewards
    - **Most important component of RL: Efficient method for estimating values**
* Environment model
    - Predicts next state and reward, based on current state and chosen action
    - Used for planning, considering future situations
    - Model-based methods exist in contrast to simpler, model-free methods
    - Model-based: Long term planning; Model-free: Simple trial-and-error learning (~opposite of planning)
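
The four elements can be made concrete with a small sketch. Everything in it is hypothetical and not taken from the book: the three-state chain, the function names `model`, `policy`, and `state_value`, the reward numbers, and the finite horizon are all assumptions chosen for illustration. The environment is deterministic, so the "expected" accumulated reward is just a plain sum.

```python
# Hypothetical 3-state chain: states 0, 1, 2; state 2 is terminal.
# Actions are "stay" or "go". All numbers are illustrative assumptions.

def model(state, action):
    """Environment model: predicts (next_state, reward) for a state-action pair."""
    if state == 2:                      # terminal state: nothing more happens
        return 2, 0.0
    if action == "go":
        next_state = state + 1
        # Low reward now, high reward later: the case where values matter.
        reward = 10.0 if next_state == 2 else -1.0
        return next_state, reward
    return state, 0.0                   # "stay" costs nothing but leads nowhere

def policy(state):
    """Policy: a simple lookup table from perceived states to actions."""
    return {0: "go", 1: "go", 2: "stay"}[state]

def state_value(state, horizon=10):
    """Value: accumulated future reward when following `policy` from `state`."""
    total = 0.0
    for _ in range(horizon):
        state, reward = model(state, policy(state))
        total += reward
    return total

print(model(0, "go"))     # the immediate reward of leaving state 0 is negative
print(state_value(0))     # the long-term value of state 0 is still positive
```

The first step out of state 0 is penalized, yet state 0 has a positive value because of the reward that follows, which is the distinction between rewards and values drawn in the list above.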

## Limitations and Scope

* State is to be thought of as what the agent knows about the environment
* Issue of constructing this state representation not considered in the book, to focus on decision-making
* Most methods considered are concerned with estimating value functions, but this is not strictly necessary for solving RL problems
    - Alternatives:
        1. Genetic algorithms
        2. Genetic programming
        3. Simulated annealing
        4. Other optimization methods
    - Applying multiple static policies, selecting the best (+ random variations) for next generation
    - These are called evolutionary methods
* In small, or well-structured policy space, evolutionary methods can find good policies
* Evolutionary methods also have advantages when the agent cannot fully sense the state of its environment
* Our focus is on learning by interacting with the environment
* Evolutionary methods ignore important structure of the problem
    - The fact that policies map from states to actions
    - Information about what states are actually visited
    - Which actions agents select in their lifetimes
* Evolutionary methods and learning share features and may work well together
* Evolutionary methods not considered useful *on their own*, for RL problems

## An Extended Example: Tic-Tac-Toe

| | | |
|-|-|-|
|X|O|O|
|O|X|X|
| | |X|

* Classical optimization methods can compute optimal solution, but:
* They require a complete specification of the opponent for this
* This information is not available a priori, in most interesting problems
* The information can be estimated from experience
* First, learn a model of the opponent's behavior
* Then, apply dynamic programming to compute optimal solution
* Similar to reinforcement learning methods
* Evolutionary method would directly search policy space
* It would hill-climb in policy space, achieving incremental improvement
* Hundreds of different optimization methods could be applied for this
* With a value function, we would try to estimate each state's probability of leading to a win
    - State $\mathsf A$ is "better than" state $\mathsf B$ if it has a higher value; a higher probability of leading to a win
* We then play many games against the opponent
* To choose an action, we examine the value of the resulting state of each action
* Most of the time, choose *greedily*, sometimes choose at random to explore
* While playing, we adjust the estimated values
* After each greedy move, we update the previous state's value to be closer to the next state's value
* If $S_t$ is the state before the move, and $S_{t+1}$ the state after the move, the update to the estimated value of $S_t$, $V(S_t)$, looks like this: $$V(S_t)\leftarrow V(S_t)+\alpha\left[V(S_{t+1})-V(S_t)\right] $$
    - $\alpha$: Small positive fraction, *step-size parameter*; influences rate of learning
* Update rule is an example of *temporal-difference learning*; changes based on difference between estimates at two successive time steps
* If $\alpha$ is reduced to zero properly over time, estimates approach true winning probabilities against static opponent
* If $\alpha$ is not reduced to zero, the agent can even adapt to a changing opponent
* Evolutionary methods hold a policy fixed while evaluating over many games
* Value-based methods use information gained *during each game* to update estimates
* Evolutionary methods only look at the final outcome of each game, so a winning policy gets credit for all of its moves, even moves that never occurred in the games played
* RL emphasizes learning while interacting
* There is a clear goal, and correct behavior requires planning
* Multi-move traps can be set even without explicit lookahead, without a model
* RL applies even without an adversary, in "games against nature"
* Also applicable when behavior continues indefinitely, with rewards of different magnitudes arriving at any given time
* Also applicable when there are no discrete time steps, though theory gets more complicated
* Can even be used with large, or infinite state spaces, like the $\sim 10^{20}$ states in backgammon, using neural networks
    - See Gerry Tesauro's TD-Gammon
* Neural networks allow generalization across similar states from experience
* How well an RL system works is closely tied to its ability to generalize
* This area shows the greatest need for supervised learning
* ANNs and Deep Learning are great, but not the only applicable methods for this
* While the Tic-Tac-Toe agent had no prior knowledge, RL allows incorporating such knowledge
* While the true state was known in Tic-Tac-Toe, RL can also be applied when some information is hidden
* In many situations, there is no known model of the environment at all, but RL still works
* Model-free methods cannot even think about how their actions will change the environment
* They have an advantage when constructing a model is a bottleneck
* Model-free methods are also building blocks for model-based methods
* RL can be used at low, single-move levels, while also being applied to higher levels, where each "action" is the application of some elaborate problem-solving method
* RL can operate at several of these levels at once
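
The temporal-difference update rule and the mostly greedy move selection described above can be sketched in a few lines. This is a minimal sketch, not a full Tic-Tac-Toe player: the names `td_update` and `choose_move`, the values of `ALPHA` and `EPSILON`, and the use of plain strings as states are assumptions for illustration. Starting unseen states at 0.5 follows the book's suggestion for states that are neither won nor lost.

```python
import random
from collections import defaultdict

ALPHA = 0.1      # step-size parameter (assumed value)
EPSILON = 0.1    # probability of an exploratory move (assumed value)

# Value table: estimated probability of winning from each state.
# States not seen before start at a neutral 0.5.
V = defaultdict(lambda: 0.5)

def td_update(V, s_t, s_next, alpha=ALPHA):
    """Apply V(S_t) <- V(S_t) + alpha * [V(S_{t+1}) - V(S_t)]."""
    V[s_t] += alpha * (V[s_next] - V[s_t])

def choose_move(V, candidates, rng, epsilon=EPSILON):
    """Pick one of the states reachable by a move.
    Returns (next_state, was_greedy); was_greedy is False for exploratory moves."""
    if rng.random() < epsilon:
        return rng.choice(candidates), False
    return max(candidates, key=lambda s: V[s]), True

# Usage: a state from which we just moved into a won position.
V["won"] = 1.0
rng = random.Random(0)
next_state, was_greedy = choose_move(V, ["won", "other"], rng)
if was_greedy:                      # update only after greedy moves
    td_update(V, "before", next_state)
print(V["before"])                  # 0.5 unless a greedy move shifted it toward V["won"]
```

Returning `was_greedy` alongside the chosen state keeps the decision of whether to learn from exploratory moves outside the selection function, which is the question Exercise 1.4 asks about.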

## Exercises

### $Exercise\ \mathcal{1.1}$*:*$\ Self$-$Play$

#### Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

It would have to learn a different policy over time, as any moves that were good in the past, were good against a past version of itself. A move that led to a win against the original agent may no longer work at all after the policy is adjusted the first time. Applied for long enough, this method would probably find the optimal strategy for Tic-Tac-Toe, leading to a draw every time, unless exploratory actions are taken.

### $Exercise\ \mathcal{1.2}$*:*$\ Symmetries$

#### Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this?

To make use of symmetries, states that are equal according to these should share the same value, and be updated simultaneously every time one of them is encountered.
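
One possible way to share values is to map every board to a single canonical representative before looking it up in the value table. The sketch below assumes a board is stored as a tuple of 9 cells in row-major order, with `"X"`, `"O"`, or `" "` in each cell; that representation and the helper names `rotate`, `mirror`, and `canonical` are assumptions, not part of the book.

```python
def rotate(board):
    """Rotate a 3x3 board (tuple of 9 cells, row-major) 90 degrees clockwise."""
    return tuple(board[3 * (2 - c) + r] for r in range(3) for c in range(3))

def mirror(board):
    """Reflect the board left to right."""
    return tuple(board[3 * r + (2 - c)] for r in range(3) for c in range(3))

def canonical(board):
    """Return one fixed representative of the 8 symmetric variants of `board`."""
    variants = []
    b = board
    for _ in range(4):              # 4 rotations, each with and without a mirror
        variants.append(b)
        variants.append(mirror(b))
        b = rotate(b)
    return min(variants)            # any consistent choice works; min is convenient

empty = (" ",) * 9
top_left = ("X",) + (" ",) * 8
bottom_right = (" ",) * 8 + ("X",)
top_edge = (" ", "X") + (" ",) * 7

print(canonical(top_left) == canonical(bottom_right))   # corners are equivalent
print(canonical(top_left) == canonical(top_edge))       # a corner is not an edge
```

Indexing the value table with `canonical(board)` instead of `board` means an update for one position is automatically an update for all of its symmetric variants.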

#### In what ways would this change improve the learning process?

The learning process would be sped up, as this introduces generalization across states, where learning from one state changes the values of all the equivalent states.

#### Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we?

If the opponent does not take advantage of the states' symmetries, then it might act differently in two states that we consider to be the same state. In this case, taking the symmetries into account oversimplifies the problem by throwing away information about how our opponent acts. Instead of seeing that the opponent takes deterministic moves in two distinct states, we may infer that the opponent randomly picks one of two actions in the "same" state.

#### Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

In that case, symmetrically equivalent positions should not share values. It might be best to let the learning algorithm decide how to generalize across states, whenever such generalization is actually beneficial.

### $Exercise\ \mathcal{1.3}$*:*$\ Greedy\ Play$

#### Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

The agent would quickly learn to follow one specific path from each state, without considering that its estimates of some states might be pessimistic. It may never visit a state it thinks to be of low value, which in reality might lead to another state that has a much higher value. It may outperform a nongreedy player in early games, but it might not be able to adapt as fast as the nongreedy player.

### $Exercise\ \mathcal{1.4}$*:*$\ Learning\ from\ Exploration$

#### Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves?

1. When we do learn from exploratory moves, the probabilities learned are the true winning probabilities of the agent that continues exploring at random, just as it has during learning.
2. When we do *not* learn from exploratory moves, then the values converge to those that assume that exploratory actions are not taken.

In the first case, a state from which the agent can immediately win will still not have a value of $1$, as there is a small chance the agent chooses a random move instead of the winning move. In the second case, the value of such a state would approach exactly $1$, as the chance of winning, given that a greedy move is taken, is $100\%$.
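
A rough worked calculation makes the difference concrete. The symbols here are introduced only for illustration: assume the agent explores with probability $\varepsilon$, and that an exploratory move from state $s$ leads to a win with average probability $\bar v$. For a state $s$ that has an immediately winning greedy move, the two learned values are then approximately

$$V_{\text{explore}}(s) = (1-\varepsilon)\cdot 1 + \varepsilon\,\bar v, \qquad V_{\text{greedy}}(s) = 1$$

With the assumed numbers $\varepsilon = 0.1$ and $\bar v = 0.5$, this gives $V_{\text{explore}}(s) = 0.9 + 0.05 = 0.95$, which stays strictly below $1$ unless every exploratory move also wins.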

#### Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

Under this assumption, learning the values of only greedy moves may lead to problems, especially if there are states where the greedy move leads to a very high chance of winning, while any other move leads to an immediate loss. If we assume that the greedy action is always taken, such a state would have a very high value, but will regularly lead to losses, whenever a random, nongreedy action is taken instead.  
If the values are updated on exploratory steps as well, then the agent will take this random chance of an immediate loss into account when choosing the next action, and might avoid such a precarious situation.

### $Exercise\ \mathcal{1.5}$*:*$\ Other\ Improvements$

#### Can you think of other ways to improve the reinforcement learning player?

Instead of only considering the value of the immediate next states, the agent could look ahead several moves into the future for each value update.
Additionally, instead of choosing exploratory actions at random, there could be some heuristic for choosing actions that result in the highest information gain.

#### Can you think of any better way to solve the tic-tac-toe problem as posed?

As posed, I cannot think of any method that is clearly better than what was described.

## Summary

* RL is an approach to understanding and automating goal-directed learning and decision-making
* Emphasis is on learning from interaction with an environment, without requiring supervision or complete models
* The first field to address the computational issues that arise when learning from interaction to achieve long-term goals
* RL uses the framework of Markov decision processes:
    - States
    - Actions
    - Rewards
* MDPs are a simple way of representing essential features of the AI problem, including:
    - Cause and effect
    - Uncertainty and nondeterminism
    - Explicit goals
* The concepts of values and value functions are key to most RL methods and are important for efficient search in the space of policies
* Value functions distinguish RL from evolutionary methods

## Early History of Reinforcement Learning

* Check out Minsky's "[Steps](https://courses.csail.mit.edu/6.803/pdf/steps.pdf)" paper