# &#x1F4D1; &nbsp; <span style="color:red"> Reflections. Introduction To Reinforcement Learning. Lessons 3-4</span>

##   &#x1F916; &nbsp; <span style="color:red">Links</span>

An Analysis of Temporal-Difference Learning with Function Approximation
http://www.mit.edu/~jnt/Papers/J063-97-bvr-td.pdf

Reinforcement Learning
http://www.cis.upenn.edu/~cis519/fall2015/lectures/14_ReinforcementLearning.pdf

Reinforcement Learning & Monte Carlo Planning
https://courses.cs.washington.edu/courses/csep573/12au/lectures/18-rl.pdf

Reinforcement Learning
http://isites.harvard.edu/fs/docs/icb.topic539621.files/lec5.pdf

The Forward View of TD(λ): 
https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node74.html

True Online TD(λ): 
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2014c1_seijen14.pdf
http://wittawat.com/assets/talks/true_online_td.pdf

Model-Free Prediction: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf

Reinforcement Learning: http://www.lsi.upc.es/~mmartin/Ag4-4x.pdf

MathML: https://www.w3.org/TR/MathML/

http://incompleteideas.net/sutton/book/bookdraft2016aug.pdf

##  &#x1F916; &nbsp;  <span style="color:red"> Lesson 3. TD and Friends</span>

The **reward hypothesis**:
That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

***Dynamic Programming (DP)*** is based on the Bellman equation and breaks a problem down into subproblems. Because it divides a big task into smaller steps that are solved from the model, this approach depends on a perfect model of the environment.

***Monte Carlo methods (MC)*** do not need a model of the learning environment. They approximate future rewards from experience in the form of sequences of state-action-reward samples. However, these methods only update after a complete sequence, when the final state is reached.

***Temporal Difference (TD) methods*** combine both approaches: there is no need for a model of the learning
environment, and an update is made at every step of the incremental procedure. The method learns
directly from raw experience in a partially unknown system, using each recorded sample.

Bootstrapping: update involves an estimate
    
- MC does not bootstrap
- DP bootstraps
- TD bootstraps

Sampling: update samples an expectation

- MC samples
- DP does not sample
- TD samples
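
The two distinctions above can be made concrete by writing out the update target each family uses. This is a minimal sketch; the function names and the `(probability, reward, next_state)` model format are assumptions made for illustration, not part of the original notes.

```python
def mc_target(rewards, gamma):
    """MC target: the full sampled return of a finished episode (samples, no bootstrap)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, next_value, gamma):
    """TD(0) target: one sampled reward plus an estimate of the next state (samples, bootstraps)."""
    return reward + gamma * next_value


def dp_target(transitions, values, gamma):
    """DP target: full expectation over a known model, using estimates (bootstraps, no sampling).

    transitions is a list of (probability, reward, next_state) tuples (assumed format).
    """
    return sum(p * (r + gamma * values[s2]) for p, r, s2 in transitions)


mc = mc_target([1.0, 0.0, 2.0], gamma=0.5)           # 1 + 0.5*0 + 0.25*2 = 1.5
td = td_target(1.0, next_value=4.0, gamma=0.5)       # 1 + 0.5*4 = 3.0
dp = dp_target([(0.5, 1.0, "x"), (0.5, 3.0, "y")],
               {"x": 4.0, "y": 0.0}, gamma=0.5)      # 0.5*3 + 0.5*3 = 3.0
```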

Model-based RL:  Learn a model, and use it to derive a controller.

Model-free RL: Learn a controller without learning a model.

- Q-learning: the function Q(i, a) gives the expected value of taking action a in state i. The Q value incorporates both the immediate reward and the expected value of the next state reached. The Q function also determines a policy (pick the action with the highest Q value) and a value function (the Q value of that action).
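
As a rough illustration of how a Q table yields both a policy and a value function, the sketch below uses the standard tabular Q-learning update. The update rule itself is not spelled out in these notes, and the helper names, the `(state, action)` dictionary keys, and the toy numbers are all assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions, alpha, gamma):
    """Move Q(state, action) toward reward + gamma * max_a' Q(next_state, a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])


def greedy_policy(Q, state, actions):
    """Policy derived from Q: the action with the highest Q value."""
    return max(actions, key=lambda a: Q[(state, a)])


def state_value(Q, state, actions):
    """Value function derived from Q: the Q value of the best action."""
    return max(Q[(state, a)] for a in actions)


actions = ["left", "right"]
Q = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
q_learning_update(Q, 0, "right", 1.0, 1, actions, alpha=0.5, gamma=0.9)
best_action = greedy_policy(Q, 0, actions)
best_value = state_value(Q, 0, actions)
```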

Temporal difference method. 

- TD tries to combine the advantages of both the model-based and model-free approaches. Like the model-based approach, it does learn a transition model. Like the model-free approach, it provides a very quick decision procedure. TD also makes sense if the transition model is known, but the MDP is very hard to solve.
- Rather than learn the Q function, TD learns the value function V(i). The idea is quite similar to Q-learning. It is based on the fact that the value of a state i is equal to the expected immediate reward at i plus the expected future reward from the successor state j. Every time a transition is made from i to j, we receive a sample of both the immediate and future rewards. On transitioning from i to j and receiving reward r, the estimate for the value of i is updated according to the formula:
  - V(i) ← (1 − α) V(i) + α(r + V(j))
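
Expanding the update above shows that it is an error-correction step. This rearrangement is plain algebra on the formula as written, which has no discount factor:

$$V(i) \leftarrow (1-\alpha)\,V(i) + \alpha\,\bigl(r + V(j)\bigr) = V(i) + \alpha\,\bigl[r + V(j) - V(i)\bigr]$$

The bracketed term is the difference between the sampled target $r + V(j)$ and the current estimate $V(i)$. With a discount factor $\gamma$ multiplying $V(j)$, this has the same form as step (c) of the TD learning procedure later in this lesson.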

Advantages and Disadvantages of MC vs. TD:

- TD can learn before knowing the final outcome
  - TD can learn online after every step
  - MC must wait until end of episode before return is known
- TD can learn without the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments  
- MC has high variance, zero bias
  - Good convergence properties (even with function approximation)
  - Not very sensitive to initial value
  - Very simple to understand and use
- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to $V^{\pi}(s)$ (but not always with function approximation)
  - More sensitive to initial value
- TD exploits Markov property
  - Usually more efficient in Markov environments
- MC does not exploit Markov property
  - Usually more effective in non-Markov environments

**TD(λ)** is a popular TD algorithm that combines basic TD learning with eligibility traces to speed up learning further. Its popularity can be explained by its simple implementation, its low computational complexity, and its conceptually straightforward interpretation, given by its forward view.

TD learning(MDP, π, γ)

 - Inputs: policy π, discount factor γ
 - Output: value function $V^{\pi}$
 - Initialize V = 0, α according to schedule
 
repeat

1. initialize s
2. while s ∉ T do
   - (a) take action a = π(s)
   - (b) observe reward r and next state s′
   - (c) V(s) ← V(s) + α[r + γV(s′) − V(s)]
   - (d) let s = s′
3. decay α according to schedule

forever
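
The procedure above can be sketched in Python as follows. This is an illustrative version under several assumptions: the `step(state, action)` environment interface, the deterministic three-state chain, the finite episode count in place of "forever", and the `alpha0 / (1 + decay * episode)` schedule are all invented for the example.

```python
from collections import defaultdict


def td0_evaluate(step, policy, start_state, terminal_states, gamma,
                 n_episodes=500, alpha0=0.5, decay=0.01):
    """Tabular TD(0) policy evaluation.

    step(state, action) -> (reward, next_state) is an assumed environment interface.
    """
    V = defaultdict(float)                      # initialize V = 0
    for episode in range(n_episodes):           # finite stand-in for "repeat ... forever"
        alpha = alpha0 / (1.0 + decay * episode)  # decay alpha according to schedule
        s = start_state                         # 1. initialize s
        while s not in terminal_states:         # 2. while s is not terminal
            a = policy(s)                       # (a) take action a = pi(s)
            r, s_next = step(s, a)              # (b) observe r and s'
            V[s] += alpha * (r + gamma * V[s_next] - V[s])  # (c) TD(0) update
            s = s_next                          # (d) let s = s'
    return V


def chain_step(state, action):
    """Hypothetical deterministic chain 0 -> 1 -> 2 -> 3 with reward 1 per step."""
    return 1.0, state + 1


V = td0_evaluate(chain_step, policy=lambda s: "forward",
                 start_state=0, terminal_states={3}, gamma=0.5)
```

For this chain the true values are V(2) = 1, V(1) = 1.5 and V(0) = 1.75, so the learned table can be compared against them directly.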

Learning rate parameter ${\alpha}$

- ${\alpha}$ weights each new experience against the current estimate.
- In stationary environments:
$${\alpha(s) = \frac{1}{\text{number of visits to state } s}}$$

  - In this case, the Q and V values are the exact arithmetic average of the experiences.
  
- In non-stationary environments:
  - ${\alpha}$ takes a constant value (usually in the range 0.3 to 0.5).
- A constant value makes the relative influence of past experiences decay over time.
- The higher the value, the faster the learning (recent experiences have more influence on the estimates).
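
A small numeric check of the two regimes above: with a step size of 1/n the running estimate equals the arithmetic mean of the samples, while a constant step size weights recent samples more heavily. The helper name and the sample values are made up for this sketch.

```python
def running_estimate(samples, alpha_fn):
    """Apply estimate += alpha * (sample - estimate), where alpha_fn(n) is the step size on visit n."""
    estimate, n = 0.0, 0
    for x in samples:
        n += 1
        estimate += alpha_fn(n) * (x - estimate)
    return estimate


samples = [2.0, 4.0, 6.0, 8.0]

# alpha = 1 / (number of visits): reproduces the arithmetic mean, 5.0
mean_estimate = running_estimate(samples, lambda n: 1.0 / n)

# constant alpha = 0.5: recent samples dominate, giving 6.125
recency_estimate = running_estimate(samples, lambda n: 0.5)
```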

The *forward view* of TD(λ) (Sutton & Barto, 1998) is that the estimate at each time step is moved toward an update target known as the λ-return; the corresponding algorithm is known as the λ-return algorithm. The λ-return is an estimate of the expected return based on both subsequent rewards and the expected-return estimates at subsequent states, with λ determining the precise way these are combined. The forward view is useful primarily for understanding the algorithm theoretically and intuitively.
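
To make the combination explicit, here is a minimal sketch of the λ-return for an episodic task, using the usual definition as a λ-weighted mixture of n-step returns with the remaining weight on the full return. The indexing convention (`rewards[k]` is the reward received on leaving the k-th state, `values[k]` is the current estimate for that state, terminal value 0) and the example numbers are assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Discounted sum of n rewards from time t, plus a bootstrapped value if the episode has not ended."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                      # still inside the episode: bootstrap from the estimate
        g += gamma ** n * values[end]
    return g


def lambda_return(rewards, values, t, gamma, lam):
    """(1 - lam) * sum_n lam^(n-1) * G_t^(n), with the leftover weight on the full return."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        total += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    total += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total


rewards = [1.0, 0.0, 2.0]
values = [0.0, 2.0, 0.0]
g_td0 = lambda_return(rewards, values, 0, gamma=0.5, lam=0.0)  # one-step TD target: 2.0
g_mc = lambda_return(rewards, values, 0, gamma=0.5, lam=1.0)   # Monte Carlo return: 1.5
g_mix = lambda_return(rewards, values, 0, gamma=0.5, lam=0.5)  # 1.625
```

Setting λ = 0 recovers the one-step TD target and λ = 1 recovers the Monte Carlo return, which is why TD(λ) is often described as interpolating between the two.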

The *backward view* of TD(λ) provides a causal, incremental mechanism for approximating the forward view and, in the off-line case, for achieving it exactly.

- TD(0) algorithms
  - Q-learning
  - Sarsa

- TD(1) is roughly equivalent to every-visit Monte-Carlo
  - Error is accumulated online, step-by-step
  - If value function is only updated offline at end of episode then total update is exactly the same as MC
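
The backward view and the offline TD(1) claim above can be checked numerically. The sketch below accumulates TD(λ) updates with accumulating eligibility traces while holding V fixed during the episode (the offline case), then compares the λ = 1 result with every-visit Monte Carlo. The episode format `(state, reward, next_state)` with `None` marking termination, and all numbers, are assumptions.

```python
from collections import defaultdict


def offline_td_lambda_updates(episode, V, alpha, gamma, lam):
    """Total per-state update from backward-view TD(lambda), with V held fixed."""
    trace = defaultdict(float)
    delta_V = defaultdict(float)
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V[s_next]
        td_error = r + gamma * v_next - V[s]
        for k in trace:                 # decay every existing trace
            trace[k] *= gamma * lam
        trace[s] += 1.0                 # accumulating trace for the visited state
        for k, e in trace.items():      # credit the TD error to all traced states
            delta_V[k] += alpha * td_error * e
    return delta_V


def every_visit_mc_updates(episode, V, alpha, gamma):
    """Total per-state update from every-visit Monte Carlo, with V held fixed."""
    delta_V = defaultdict(float)
    g = 0.0
    for s, r, _ in reversed(episode):
        g = r + gamma * g               # return following this visit
        delta_V[s] += alpha * (g - V[s])
    return delta_V


episode = [("A", 1.0, "B"), ("B", 0.0, "A"), ("A", 2.0, None)]
V = {"A": 0.5, "B": -1.0}
td1 = offline_td_lambda_updates(episode, V, alpha=0.1, gamma=0.9, lam=1.0)
mc = every_visit_mc_updates(episode, V, alpha=0.1, gamma=0.9)
```

Here `td1` and `mc` agree for both states up to floating-point rounding, even though state A is visited twice.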

The *true online* TD(λ) is a new variant of TD(λ) allowing exact online updates with the same computational complexity as classical TD(λ).

Online updates

- TD(λ) updates are applied online at each step within episode
- Forward and backward-view TD(λ) are slightly different
- NEW: Exact online TD(λ) achieves perfect equivalence
- By using a slightly different form of eligibility trace

**Reinforcement Learning**: 

- model-based 
- value-based (TD) 
- policy-based

##  &#x1F916; &nbsp;  <span style="color:red"> Lesson 4. Convergence</span>

- policy evaluation or **prediction problem**
  - estimating the value function  for a given policy
- **control problem** (finding an optimal policy)
  - some variation of generalized policy iteration (GPI)

The **Bellman-Equation** 

- expresses the relationship between the value of a state and the values of its successor states;
- calculates the value of a state by considering all available options and weighting each by its probability of occurring.

Let $V^{\pi}(s)$ be the state-value function at state $s$ under policy $\pi$, and let $Q^{\pi}(s,a)$ be the action-value (Q-value) function for taking action $a$ in state $s$ and following ${\pi}$ thereafter, with reward function $R$ and transition function $P$.

$$V^{\pi}(s) = \sum_a {\pi(a,s)} \sum_{s'}{P_{ss'}^a \ [R_{ss'}^a + \gamma V^{\pi}(s')]}$$

$$Q^{\pi}(s,a) = \sum_{s'}{P_{ss'}^a \ [R_{ss'}^a + \gamma V^{\pi}(s')]}$$
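
The two equations above can be turned into an iterative policy evaluation routine by applying the right-hand side repeatedly until the values stop changing. This is a sketch on a made-up two-state MDP; the dictionary layout `P[(s, a)] = [(probability, next_state, reward), ...]`, the uniform random policy, and the function names are assumptions.

```python
def q_from_v(V, P, s, a, gamma):
    """Q(s, a) = sum over s' of P * [R + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])


def evaluate_policy(states, actions, P, pi, gamma, tol=1e-10):
    """V(s) = sum over a of pi(a, s) * Q(s, a), iterated to a fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        biggest_change = 0.0
        for s in states:
            new_v = sum(pi[(s, a)] * q_from_v(V, P, s, a, gamma) for a in actions)
            biggest_change = max(biggest_change, abs(new_v - V[s]))
            V[s] = new_v
        if biggest_change < tol:
            return V


states = ["A", "B"]
actions = ["stay", "go"]
P = {  # hypothetical deterministic transitions: (probability, next_state, reward)
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
pi = {(s, a): 0.5 for s in states for a in actions}  # uniform random policy

V = evaluate_policy(states, actions, P, pi, gamma=0.5)
q_a_go = q_from_v(V, P, "A", "go", gamma=0.5)
```

Solving the two linear equations by hand gives V(A) = 1.25 and V(B) = 1.75, and therefore Q(A, go) = 1 + 0.5 · 1.75 = 1.875, which the iteration should reproduce to within the tolerance.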

For finite MDPs there is always at least one optimal policy ${\pi^*}$ that is better than or equal to all other policies.

Value functions partially order the policies,

- but at least one optimal policy exists, and
- all optimal policies have the same value function V*.

The optimal value function $V^*(s)$ can be interpreted as the expected total (discounted) reward received by an agent that starts in state $s$ and behaves optimally.

$$V^{*}(s) = \max_{\pi} V^{\pi}(s)$$

Optimal policies also share the same optimal action-value function $Q^*$.

$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$$
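
For a very small MDP, the definition of $V^*$ as a maximum over policies can be checked by brute force: evaluate every deterministic policy and take the best value in each state. The sketch below assumes a made-up two-state MDP and restricts the search to deterministic policies, which is sufficient for finite MDPs; the names and the fixed sweep count are illustrative choices.

```python
from itertools import product

states = ["A", "B"]
actions = ["stay", "go"]
gamma = 0.5
P = {  # hypothetical deterministic transitions: (probability, next_state, reward)
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}


def evaluate(policy, sweeps=200):
    """Value of a deterministic policy (dict state -> action) by repeated Bellman backups."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, policy[s])])
    return V


# Enumerate all deterministic policies and evaluate each one.
policies = [dict(zip(states, choice)) for choice in product(actions, repeat=len(states))]
values = [evaluate(pi) for pi in policies]

# V*(s) is the best value any policy achieves in state s.
V_star = {s: max(V[s] for V in values) for s in states}

# Policies that attain V* in every state at once.
optimal = [pi for pi, V in zip(policies, values)
           if all(abs(V[s] - V_star[s]) < 1e-9 for s in states)]
```

In this example the policy "stay, stay" is better in B while "go, go" is better in A, so neither dominates the other (the ordering is only partial), yet the single policy "go in A, stay in B" attains the maximum in both states.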

**Examples of generalized MDPs**:

- alternating Markov games
- discounted expected-reward MDPs 
- risk-sensitive MDPs
- exploration-sensitive MDPs
- etc.