In [4]:
import os
import sys
# add parent directory to path to be able to load local RLC lib.
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [5]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
from RLC.capture_chess.environment import Board
from RLC.capture_chess.learn import Q_learning
from RLC.capture_chess.agent import Agent

In [8]:
env = Board()
env.render()
env.visual_board

AttributeError: 'Board' object has no attribute 'render'

# Function approximation

So far we only looked at tabular methods. Now we turn to approximate solutions. The problem with large state spaces is not just the memory needed for large tables, but also the time and data needed to fill them accurately. In many of our target tasks, almost every state encountered will never have been seen before. To make sensible decisions in such states it is necessary to generalize from previous encounters with different states that are in some sense similar to the current one. In other words, the key issue is that of generalization.

Function approximation is an instance of supervised learning, the primary topic studied in machine learning, artificial neural networks, pattern recognition, and statistical curve fitting. In theory, any of the methods studied in these fields can be used in the role of function approximator within reinforcement learning algorithms, although in practice some fit more easily into this role than others. In reinforcement learning, however, it is important that learning be able to occur online, while the agent interacts with its environment or with a model of its environment. This requires methods that are able to learn efficiently from incrementally acquired data. In addition, reinforcement learning generally requires function approximation methods able to handle nonstationary target functions.

The novelty in this chapter is that the approximate value function is represented not as a table but as a parameterized functional form with weight vector $w \in \mathbb{R}^{d}$. We will write $\hat{v}(s,w) \approx v_{\pi}(s)$ for the approximate value of state $s$ given weight vector $w$. Typically, the number of weights (the dimensionality of $w$) is much less than the number of states, and changing one weight changes the estimated value of many states. Consequently, when a single state is updated, the change generalizes from that state to affect the values of many other states.

Such generalization makes the learning potentially more powerful but also potentially more difficult to manage and understand. Moreover, with far fewer weights than states, making one state's estimate more accurate invariably means making others' less accurate. We are obligated then to say which states we care most about. We must specify a state distribution $\mu(s)\ge 0$, $\sum_{s} \mu(s) = 1$, representing how much we care about the error in each state $s$.

# Prediction

In the prediction setting, the error in a state $s$ is the square of the difference between the approximate value $\hat{v}(s,w)$ and the true value $v_{\pi}(s)$. Weighting this over the state space by $\mu$, we obtain a natural objective function, the Mean Squared Value Error:

$$VE(w) = \sum_{s} \mu(s)[v_{\pi}(s) - \hat{v}(s,w) ]^{2}  $$

Often $\mu(s)$ is chosen to be the fraction of time spent in $s$. Under on-policy training this is called the on-policy distribution. In continuing tasks, the on-policy distribution is the stationary distribution under $\pi$. For episodic tasks it follows from the expected number of visits $\nu(s)$, which satisfies
$$ \nu(s) = h(s) + \gamma \sum_{\overline{s}} \nu(\overline{s}) \sum_{a} \pi( a | \overline{s} ) \ p(s | \overline{s}, a) $$

with $\nu(s)$ the average number of time steps spent in $s$ during a single episode and $h(s)$ the probability that an episode starts in $s$. The on-policy distribution is then
$$\mu(s) = \frac{\nu(s)}{\sum_{s'} \nu(s')}$$
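The two equations above can be solved directly for a small problem. The following is a minimal sketch, assuming a hypothetical 3-state chain and made-up value estimates; `on_policy_distribution` and `value_error` are illustrative names, not part of the RLC library.

```python
import numpy as np

def on_policy_distribution(P_pi, h, gamma=1.0):
    """Solve nu = h + gamma * P_pi^T nu and normalise the result to mu.

    P_pi[i, j] is the probability of moving from non-terminal state i to
    non-terminal state j under the policy. Rows may sum to less than 1;
    the missing mass is the probability of terminating.
    """
    n = len(h)
    nu = np.linalg.solve(np.eye(n) - gamma * P_pi.T, h)
    return nu / nu.sum()

def value_error(mu, v_true, v_hat):
    """Mean Squared Value Error: sum_s mu(s) * (v_pi(s) - v_hat(s))^2."""
    return float(np.sum(mu * (v_true - v_hat) ** 2))

# Hypothetical chain: always start in state 0, always move one state to the
# right, and terminate after leaving state 2.
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0]])
h = np.array([1.0, 0.0, 0.0])

mu = on_policy_distribution(P_pi, h)
ve = value_error(mu, v_true=np.array([3.0, 2.0, 1.0]), v_hat=np.array([2.0, 2.0, 2.0]))
```

Every state is visited exactly once per episode here, so the on-policy distribution is uniform and the value error is the plain average of the squared errors.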

It is not completely clear that the VE is the right performance objective for reinforcement learning. Remember that our ultimate purpose, the reason we are learning a value function, is to find a better policy. The best value function for this purpose is not necessarily the one that minimizes VE, but so far no better metric has been found.

## Gradient descent

We sample the states and try to optimize w to get the examples correct. This means the strategy is to minimize the error on the observed examples. Stochastic gradient-descent (SGD) methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example:
$$w_{t+1} = w_{t} - \frac{1}{2} \alpha \nabla [v_{\pi}(s_{t}) - \hat{v}(s_{t}, w_{t}) ]^{2}$$
$$w_{t+1} = w_{t} + \alpha [v_{\pi}(s_{t}) - \hat{v}(s_{t}, w_{t}) ] \nabla \hat{v}(s_{t}, w_{t})$$

where $\alpha$ is a positive step-size parameter and $\nabla f(w)$ is the column vector of partial derivatives of $f$ w.r.t. the components of $w$. In most cases we will not have an example at time t of the true target value $v_{\pi}(s_{t})$, but an approximation $U_{t}$. If $U_{t}$ is an unbiased estimate for each t (as the Monte Carlo return is), then $w_{t}$ is guaranteed to converge to a local optimum for decreasing $\alpha$. However, this is not guaranteed for a biased estimate, which is what bootstrapping targets are: they depend on the current weight vector. Methods that use such targets take into account the effect of changing the weight vector $w_{t}$ on the estimate, but ignore its effect on the target. These are called semi-gradient methods. Often these methods are preferred over true gradient methods for the following reasons. One, they typically enable significantly faster learning. Two, they enable learning to be continual and online without waiting until the end of an episode.
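The update rule above can be sketched for both kinds of target. The example below is an illustration under stated assumptions: a hypothetical 5-state random walk (terminating left with reward 0 and right with reward 1, so the true values are 1/6, ..., 5/6), one-hot features, and hand-picked step sizes. The Monte Carlo version uses the full return as $U_{t}$; the TD(0) version uses the bootstrapped target $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$.

```python
import numpy as np

N_STATES = 5

def episode(rng):
    """One random-walk episode as a list of (s, r, s_next, done) tuples."""
    s = N_STATES // 2
    transitions = []
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        done = s_next < 0 or s_next >= N_STATES
        r = 1.0 if s_next >= N_STATES else 0.0
        transitions.append((s, r, s_next, done))
        if done:
            return transitions
        s = s_next

def features(s):
    """One-hot feature vector x(s); any other feature map could be used."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def gradient_mc(n_episodes, alpha, gamma=1.0, seed=0):
    """Gradient Monte Carlo: the target U_t is the observed return G_t."""
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)
    for _ in range(n_episodes):
        G = 0.0
        for s, r, _, _ in reversed(episode(rng)):
            G = r + gamma * G
            x = features(s)
            w += alpha * (G - w @ x) * x      # grad of w @ x w.r.t. w is x
    return w

def semi_gradient_td0(n_episodes, alpha, gamma=1.0, seed=0):
    """Semi-gradient TD(0): the target bootstraps from the current weights."""
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)
    for _ in range(n_episodes):
        for s, r, s_next, done in episode(rng):
            x = features(s)
            v_next = 0.0 if done else w @ features(s_next)
            # The target r + gamma * v_next is treated as a constant:
            # no gradient flows through v_next.
            w += alpha * (r + gamma * v_next - w @ x) * x
    return w

v_true = np.arange(1, N_STATES + 1) / (N_STATES + 1)
w_mc = gradient_mc(4000, alpha=0.005)
w_td = semi_gradient_td0(4000, alpha=0.01)
```

With one-hot features both methods reduce to their tabular counterparts, which makes it easy to compare the learned weights against `v_true`.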

TODO: add (semi-)gradient method to estimate v with MC.
    : add semi-gradient TD(0) method.
    
State aggregation is a simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector $w$) for each group. The value of a state is estimated as its group’s component, and when the state is updated, that component alone is updated. State aggregation is a special case of stochastic gradient descent in which the gradient, $\nabla \hat{v}(s_{t},w_{t})$, is 1 for $s_{t}$'s group’s component and 0 for the other components. For state aggregation the resulting learned function will have a typical staircasing effect.
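A minimal sketch of state aggregation, assuming 9 states split into 3 equal-width groups and made-up weights (all names are illustrative):

```python
import numpy as np

def group_of(state, n_states, n_groups):
    """Index of the group that `state` falls into (equal-width groups)."""
    return state * n_groups // n_states

def v_hat(state, w, n_states):
    """The value of a state is the weight of its group."""
    return w[group_of(state, n_states, len(w))]

def sgd_update(w, state, target, alpha, n_states):
    """The gradient is one-hot, so only the group's own component moves."""
    g = group_of(state, n_states, len(w))
    w[g] += alpha * (target - w[g])
    return w

n_states = 9
w = np.array([0.1, 0.5, 0.9])
staircase = [v_hat(s, w, n_states) for s in range(n_states)]
w_after = sgd_update(w.copy(), state=4, target=1.0, alpha=0.5, n_states=n_states)
```

`staircase` takes the same value for all states within a group, and the update for state 4 changes only the middle weight.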

## Linear methods

One of the most important special cases of function approximation is that in which the approximate function, $\hat{v}(\cdot,w)$, is a linear function of the weight vector, $w$. Linear methods approximate the state-value function by the inner product between $w$ and $x(s)$:
$$\hat{v}(s,w) = w^{T}x(s) = \sum_{i=1}^{d} w_{i} x_{i}(s)$$

The vector $x(s)$ is called a feature vector representing state s. For linear methods, features are basis functions because they form a linear basis for the set of approximate functions. Constructing d-dimensional feature vectors to represent states is the same as selecting a set of d basis functions. It is natural to use SGD updates with linear function approximation. The gradient of the approximate value function with respect to w in this case is
$$\nabla \hat{v}(s, w) = x(s)$$

In particular, in the linear case there is only one optimum (or, in degenerate cases, one set of equally good optima), and thus any method that is guaranteed to converge to or near a local optimum is automatically guaranteed to converge to or near the global optimum. 

Choosing features appropriate to the task is an important way of adding prior domain knowledge to reinforcement learning systems. A limitation of the linear form is that it cannot take into account any interactions between features, such as the presence of feature i being good only in the absence of feature j.

To get some intuitive feel for how to set the step-size parameter $\alpha$ manually, it is best to go back momentarily to the tabular case. There we can understand that a step size of $\alpha = 1$ will result in a complete elimination of the sample error after one target. We usually want to learn slower than this. In the tabular case, a step size of $\alpha = \frac{1}{10}$ would take about 10 experiences to converge approximately to their mean target, and if we wanted to learn in 100 experiences we would use $\alpha = \frac{1}{100}$. In general, if $\alpha = \frac{1}{\tau}$, then the tabular estimate for a state will approach the mean of its targets, with the most recent targets having the greatest effect, after about $\tau$ experiences with the state. With general function approximation there is not such a clear notion of number of experiences with a state, as each state may be similar to and dissimilar from all the others to various degrees. However, there is a similar rule that gives similar behavior in the case of linear function approximation. Suppose you wanted to learn in about $\tau$ experiences with substantially the same feature vector. A good rule of thumb for setting the step-size parameter of linear SGD methods is then 
$$\alpha = (\tau \ E[x^{T}x])^{-1}$$

where $x$ is a random feature vector chosen from the same distribution as input vectors will be in the SGD. This method works best if the feature vectors do not vary greatly in length.
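As a small worked example of this rule of thumb, the sketch below estimates $E[x^{T}x]$ from a sample of feature vectors. The binary features with exactly 8 active components out of 64 are an assumption chosen for illustration.

```python
import numpy as np

def rule_of_thumb_alpha(feature_vectors, tau):
    """alpha = 1 / (tau * E[x^T x]), with the expectation estimated from samples."""
    X = np.asarray(feature_vectors, dtype=float)
    mean_squared_norm = np.mean(np.sum(X * X, axis=1))
    return 1.0 / (tau * mean_squared_norm)

rng = np.random.default_rng(0)
# Hypothetical binary features: every vector has exactly 8 of 64 components set.
X = np.zeros((1000, 64))
for row in X:
    row[rng.choice(64, size=8, replace=False)] = 1.0

alpha = rule_of_thumb_alpha(X, tau=10)
```

Every vector here has squared length 8, so learning in about 10 experiences gives $\alpha = 1/(10 \cdot 8) = 0.0125$.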

## Non-linear methods

Artificial neural networks (ANNs) are widely used for nonlinear function approximation. Training the hidden layers of an ANN is a way to automatically create features appropriate for a given problem, so that hierarchical representations can be produced without relying exclusively on hand-crafted features. Creating such features automatically has been an enduring challenge for artificial intelligence and explains why learning algorithms for ANNs with hidden layers have received so much attention over the years. ANNs typically learn by a stochastic gradient method.

## Least squares TD

Least-squares TD (LSTD) directly calculates the TD fixed point:
$$w_{t} = \hat{A}_{t}^{-1}\hat{b}_{t}$$

with $\hat{A}_{t} = \sum_{k=0}^{t-1} x_{k}(x_{k}- \gamma x_{k+1})^{T} +\epsilon I$ (the term $\epsilon I$, with small $\epsilon > 0$, ensures that $\hat{A}_{t}$ is invertible) and $\hat{b}_{t} = \sum_{k=0}^{t-1} R_{k+1} x_{k}$. LSTD is more data efficient than semi-gradient TD, but also more expensive: even when the inverse is maintained incrementally, each step costs O($d^{2}$), which is still significantly more than the O(d) of semi-gradient TD. Whether the greater data efficiency of LSTD is worth this computational expense depends on how large d is, how important it is to learn quickly, and the expense of other parts of the system.
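The formulas above translate almost literally into code. The following sketch accumulates $\hat{A}_{t}$ and $\hat{b}_{t}$ over a batch of transitions and solves the linear system once at the end. For clarity it uses a direct solve rather than the incremental O($d^{2}$) update of the inverse, and the two-state chain is a made-up example.

```python
import numpy as np

def lstd(transitions, d, gamma, epsilon=1e-3):
    """Return w = A^{-1} b for a list of (x, reward, x_next) transitions.

    x_next must be the zero vector when the transition ends the episode.
    """
    A = epsilon * np.eye(d)          # epsilon * I keeps A invertible
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)

# Hypothetical chain: state 0 -> state 1 (reward 1), state 1 -> terminal (reward 2).
x0, x1, terminal = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.zeros(2)
data = [(x0, 1.0, x1), (x1, 2.0, terminal)] * 100

w = lstd(data, d=2, gamma=0.9)
```

With one-hot features the solution should be close to the true values $v(1) = 2$ and $v(0) = 1 + 0.9 \cdot 2 = 2.8$, up to the small bias introduced by $\epsilon$.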

## Memory based 

In memory-based function approximation, training examples (or at least a subset of them) are simply saved in memory as they arrive, without updating any parameters. Then, whenever a query state's value estimate is needed, a set of examples is retrieved from memory and used to compute a value estimate for the query state. This approach is sometimes called lazy learning because processing training examples is postponed until the system is queried to provide an output. Unlike parametric methods, the approximating function's form is not limited to a fixed parameterized class of functions, such as linear functions or polynomials, but is instead determined by the training examples themselves, together with some means for combining them to output estimated values for query states.

There are many different memory-based methods depending on how the stored training examples are selected and how they are used to respond to a query. Here, we focus on local-learning methods that approximate a value function only locally in the neighborhood of the current query state. These methods retrieve a set of training examples from memory whose states are judged to be the most relevant to the query state, where relevance usually depends on the distance between states: the closer a training example’s state is to the query state, the more relevant it is considered to be, where distance can be defined in many different ways. After the query state is given a value, the local approximation is discarded. Examples of local approximation are nearest neighbor, weighted averaging and locally weighted regression. Because trajectory sampling is of such importance in reinforcement learning, memory-based local methods can focus function approximation on local neighborhoods of states (or state–action pairs) visited in real or simulated trajectories. There may be no need for global approximation because many areas of the state space will never (or almost never) be reached. In addition, memory-based methods allow an agent’s experience to have a relatively immediate effect on value estimates in the neighborhood of the current state, in contrast with a parametric method’s need to incrementally adjust parameters of a global approximation.

Memory-based methods such as the weighted average and locally weighted regression methods described above depend on assigning weights to examples in the database depending on the distance between the example state and the query state. The function that assigns these weights is called a kernel function, or simply a kernel. Kernel functions numerically express how relevant knowledge about any state is to any other state. For many sets of feature vectors, kernel regression has a compact functional form that can be evaluated without any computation taking place in the d-dimensional feature space. In these cases, kernel regression is much less complex than directly using a linear parametric method with states represented by these feature vectors. This is the so-called "kernel trick", which allows effectively working in a high-dimensional feature space while actually working only with the set of stored training examples. The kernel trick is the basis of many machine learning methods, and researchers have shown how it can sometimes benefit reinforcement learning.
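A minimal sketch of a kernel-weighted average over stored examples, assuming a Gaussian kernel on Euclidean distance (one of many possible choices) and a hypothetical memory of two examples:

```python
import numpy as np

def gaussian_kernel(states, query, bandwidth=1.0):
    """Relevance of each stored state to the query state."""
    squared_dist = np.sum((states - query) ** 2, axis=-1)
    return np.exp(-squared_dist / (2.0 * bandwidth ** 2))

def kernel_value_estimate(memory_states, memory_targets, query, bandwidth=1.0):
    """Weighted average of stored targets; nothing is learned in advance."""
    weights = gaussian_kernel(memory_states, query, bandwidth)
    return float(weights @ memory_targets / weights.sum())

# Hypothetical memory: two 1-dimensional states with targets 0 and 1.
memory_states = np.array([[0.0], [1.0]])
memory_targets = np.array([0.0, 1.0])

midpoint = kernel_value_estimate(memory_states, memory_targets, np.array([0.5]))
near_first = kernel_value_estimate(memory_states, memory_targets, np.array([0.0]), bandwidth=0.1)
```

A query halfway between the two examples weights them equally, while a query on top of the first example with a narrow bandwidth is dominated by that example's target.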

### Experience replay 

The system stores the experience it gathers as [state, action, reward, next_state] tuples. The learning phase is then logically separate from gaining experience, and is based on taking random samples from this data. You still want to interleave the two processes, acting and learning, because improving the policy leads to different behaviour, which in turn produces new data to learn from. In addition, we don't want the data we feed to the learner to be correlated in any way. Random sampling of experiences breaks the temporal correlation of behavior and distributes/averages it over many of its previous states. By doing so, we avoid significant oscillations or divergence in our model, problems that can arise from correlated data.

Advantages of experience replay:
- More efficient use of previous experience, by learning from it multiple times. This is key when gaining real-world experience is costly, because you can get full use of it. It is especially useful when there is low variance in immediate outcomes (reward, next state) given the same state, action pair.
- Better convergence behaviour when training a function approximator. Partly this is because the data is more like i.i.d. data assumed in most supervised learning convergence proofs.

Disadvantage of experience replay:
- It is harder to use multi-step learning algorithms, such as $Q(\lambda)$, which can be tuned to give better learning curves by balancing between bias (due to bootstrapping) and variance (due to delays and randomness in long-term outcomes). Multi-step DQN with experience replay is one of the extensions explored in the paper Rainbow: Combining Improvements in Deep Reinforcement Learning.
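The storage and sampling described in this section can be sketched with a fixed-size buffer. This is a minimal illustration using only the standard library; the class name `ReplayBuffer` and its methods are hypothetical and not taken from the RLC code.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest entries drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniform sample without replacement, breaking temporal correlation."""
        batch_size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1)

batch = buffer.sample(2)
```

Uniform sampling is the simplest choice; alternatives such as prioritising some transitions over others are possible but not shown here.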

A well-known use case of experience replay is the deep Q network (DQN). Here a convolutional neural net is trained with pixels as input and Q values as output. Additionally, a buffer is kept for the experience replay, from which random data is sampled. We want to minimize the difference between our current Q and the target Q:
$$ L_{i}(w_{i}) = \mathbb{E} [(R_{ss'}^{a} + \gamma \max_{a'} Q(s', a', \overline{w}_{i}) - Q(s,a, w_{i}))^{2}]$$

SGD is then used to minimize this loss. To keep the non-linear approximator stable, two sets of weights are used: the weights $w_{i}$ that are being learned, and a frozen copy $\overline{w}_{i}$ that is used to compute the targets. After a fixed number of steps, the frozen weights $\overline{w}_{i}$ are replaced by the latest $w_{i}$.
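The role of the frozen weights can be sketched without a deep network. The example below replaces the convolutional net with a linear stand-in (one weight row per action), which is an assumption made purely to keep the code short; the target is computed with the frozen weights and the gradient is taken only with respect to the learning weights.

```python
import numpy as np

def q_values(w, state):
    """Linear stand-in for the Q network: one row of weights per action."""
    return w @ state

def dqn_loss_and_grad(w, w_frozen, batch, gamma):
    """Mean squared TD error of a batch and its gradient w.r.t. w only."""
    loss = 0.0
    grad = np.zeros_like(w)
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(q_values(w_frozen, s_next))  # frozen weights
        td_error = target - q_values(w, s)[a]
        loss += td_error ** 2
        grad[a] += -2.0 * td_error * s
    return loss / len(batch), grad / len(batch)

def train(w, replay_batches, gamma, lr, sync_every):
    """SGD on sampled batches; the frozen copy is refreshed every sync_every steps."""
    w_frozen = w.copy()
    for step, batch in enumerate(replay_batches, start=1):
        _, grad = dqn_loss_and_grad(w, w_frozen, batch, gamma)
        w = w - lr * grad
        if step % sync_every == 0:
            w_frozen = w.copy()
    return w

w = np.zeros((2, 2))
w_frozen = np.array([[1.0, 0.0], [0.0, 2.0]])
batch = [(np.array([1.0, 0.0]), 0, 1.0, np.array([0.0, 1.0]))]
loss, grad = dqn_loss_and_grad(w, w_frozen, batch, gamma=0.5)
```

For this single transition the target is $1 + 0.5 \cdot 2 = 2$ and the current estimate is 0, giving a squared error of 4.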

# Control

## Episodic

For control, the approximation of the state-value function is replaced by an approximation of the action-value function, $\hat{q}(s,a,w) \approx q_{\pi}(s,a)$. We choose the action for which $\hat{q}$ is maximal in the current state. Policy improvement is then done by changing the estimation policy to a soft approximation of the greedy policy such as the $\epsilon$-greedy policy. Actions are selected according to this same policy. This only works well when the action set is discrete and not too large.
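A minimal sketch of $\epsilon$-greedy action selection and one episodic semi-gradient Sarsa step, assuming a linear action-value function with one weight row per action (an illustrative choice, not the RLC implementation):

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def semi_gradient_sarsa_update(w, x, a, r, x_next, a_next, alpha, gamma, done):
    """One step for linear q(s, a) = w[a] @ x(s); updates w in place."""
    q_next = 0.0 if done else w[a_next] @ x_next
    td_error = r + gamma * q_next - w[a] @ x
    w[a] += alpha * td_error * x      # gradient w.r.t. w[a] is x
    return td_error

rng = np.random.default_rng(0)
greedy_action = epsilon_greedy(np.array([0.1, 0.9, 0.3]), epsilon=0.0, rng=rng)

w = np.zeros((2, 2))
td_error = semi_gradient_sarsa_update(
    w, x=np.array([1.0, 0.0]), a=1, r=1.0,
    x_next=np.array([0.0, 1.0]), a_next=0,
    alpha=0.5, gamma=0.9, done=False)
```

Because both the acting policy and the policy being evaluated are the same $\epsilon$-greedy policy, this is an on-policy method.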

## Continuing

Like the discounted setting, the average-reward setting applies to continuing problems, problems for which the interaction between agent and environment goes on and on forever without termination or start states. Unlike that setting, however, there is no discounting: the agent cares just as much about delayed rewards as it does about immediate reward.

The discounted setting is problematic with function approximation, and thus the average-reward setting is needed to replace it. To see why, consider an infinite sequence of returns with no beginning or end, and no clearly identified states. The states might be represented only by feature vectors, which may do little to distinguish the states from each other. As a special case, all of the feature vectors may be the same. Thus one really has only the reward sequence (and the actions), and performance has to be assessed purely from these. How could it be done? One way is by averaging the rewards over a long interval, which is the idea of the average-reward setting. How could discounting be used? Well, for each time step we could measure the discounted return. Some returns would be small and some big, so again we would have to average them over a sufficiently large time interval. In the continuing setting there are no starts and ends, and no special time steps, so there is nothing else that could be done. However, if you do this, it turns out that the average of the discounted returns is proportional to the average reward. In fact, for policy $\pi$, the average of the discounted returns is always $r(\pi)/(1 - \gamma)$, that is, it is essentially the average reward, $r(\pi)$. In particular, the ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting.

The root cause of the difficulty is that with function approximation we lose the policy improvement theorem. It is no longer true that if we change the policy to improve the discounted value of one state then we are guaranteed to have improved the overall policy in any useful sense. That guarantee was key to the theory of our reinforcement learning control methods. With function approximation we have lost it! In fact, the lack of a policy improvement theorem is also a theoretical lacuna for the total-episodic and average-reward settings. Once we introduce function approximation we can no longer guarantee improvement for any setting.

In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward, or simply average reward, while following that policy, which we denote as $r(\pi)$:
$$ r(\pi) = \lim_{h \rightarrow \infty} \frac{1}{h} \sum_{t=1}^{h}\mathbb{E}[R_{t} \ | \ S_{0}, A_{0:t-1} \sim \pi] $$
$$ r(\pi) = \sum_{s} \mu_{\pi}(s) \sum_{a} \pi(a \ | \ s) \sum_{s', r} p(s', r \ | \ s,a) \, r $$

Returns are now defined as differences w.r.t. the average reward:
$$ G_{t} = R_{t+1} - r(\pi) +  R_{t+2} - r(\pi) +  R_{t+3} - r(\pi) + \ldots$$

This results in the following algorithm for differential semi-gradient Sarsa:
- Input: $\hat{q}$, a differentiable action-value function parameterization
- Initialize: step sizes $\alpha,\ \beta > 0 $, value-function weights $ w \in \mathbb{R}^{d}$ and average reward estimate $\overline{R} \in \mathbb{R}$ arbitrarily (e.g. 0), state S and action A
- Loop for each step:
    - Take action A, observe R, S'
    - Choose A' as a function of $\hat{q}(S', \cdot , w)$ (e.g. $\epsilon$-greedy)
    - $ \delta = R - \overline{R} + \hat{q}(S', A', w) - \hat{q}(S, A, w)$
    - $\overline{R} = \overline{R} + \beta \delta$
    - $w = w + \alpha \delta \nabla \hat{q}(S, A, w)$
    - S = S' , A = A'
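The steps above can be sketched on a toy problem. The 2-state continuing task, the one-hot (tabular) features and the step sizes below are all assumptions chosen for illustration; with one-hot features the gradient $\nabla \hat{q}(S, A, w)$ is 1 for the visited (state, action) entry and 0 elsewhere.

```python
import numpy as np

def step(state, action):
    """Hypothetical continuing task: the chosen action becomes the next state
    (whatever the current state is), and landing in state 1 pays reward 1."""
    next_state = action
    reward = 1.0 if next_state == 1 else 0.0
    return reward, next_state

def epsilon_greedy(q_row, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def differential_sarsa(n_steps=20000, alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros((2, 2))     # one weight per (state, action) pair
    r_bar = 0.0              # average-reward estimate
    s = 0
    a = epsilon_greedy(w[s], epsilon, rng)
    for _ in range(n_steps):
        r, s_next = step(s, a)
        a_next = epsilon_greedy(w[s_next], epsilon, rng)
        delta = r - r_bar + w[s_next, a_next] - w[s, a]
        r_bar += beta * delta
        w[s, a] += alpha * delta
        s, a = s_next, a_next
    return w, r_bar

w, r_bar = differential_sarsa()
```

In this task the best policy always chooses action 1, so with $\epsilon = 0.1$ the average reward of the learned behaviour is expected to end up somewhat below 1.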


# Off-policy

The tabular off-policy methods readily extend to semi-gradient algorithms, but these algorithms do not converge as robustly as they do under on-policy training. Recall that in off-policy learning we seek to learn a value function for a target policy $\pi$, given data due to a different behavior policy b. In the prediction case, both policies are static and given, and we seek to learn either state values or action values. In the control case, action values are learned, and both policies typically change during learning, with $\pi$ being the greedy policy with respect to $\hat{q}$, and b being something more exploratory such as the $\epsilon$-greedy policy with respect to $\hat{q}$.

The challenge of off-policy learning can be divided into two parts, one that arises in the tabular case and one that arises only with function approximation. The first part of the challenge has to do with the target of the update (not to be confused with the target policy), and the second part has to do with the distribution of the updates. The techniques related to importance sampling deal with the first part; these may increase variance but are needed in all successful algorithms, tabular and approximate. The second part arises because the distribution of updates in the off-policy case does not follow the on-policy distribution, which is important to the stability of semi-gradient methods. Two general approaches have been explored to deal with this. One is to use importance sampling methods again, this time to warp the update distribution back to the on-policy distribution, so that semi-gradient methods are guaranteed to converge (in the linear case). The other is to develop true gradient methods that do not rely on any special distribution for stability. An example of this is TDC (TD(0) with gradient correction).

The danger of instability and divergence arises whenever we combine all of the following three elements, making up what we call the deadly triad:
- Function Approximation
- Bootstrapping
- Off-policy training

In particular, note that the danger is not due to control or to generalized policy iteration. Those cases are more complex to analyze, but the instability arises in the simpler prediction case whenever it includes all three elements of the deadly triad. The danger is also not due to learning or to uncertainties about the environment, because it occurs just as strongly in planning methods, such as dynamic programming, in which the environment is completely known. If any two elements of the deadly triad are present, but not all three, then instability can be avoided. Additionally, off-policy learning combined with function approximation is not guaranteed to converge to the correct value. Off-policy learning with function approximation that converges quickly is still an active field of study.
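A very small example shows how the three elements interact. Assume two states whose values are approximated as $w$ and $2w$ (features 1 and 2 with a single shared weight), a reward of 0, and an off-policy update distribution that only ever updates the transition from the first state to the second. This setup is a sketch chosen for illustration; the function name is hypothetical.

```python
def off_policy_semi_gradient(w0, alpha, gamma, n_updates):
    """Repeat the semi-gradient TD(0) update on the single transition
    from the state with feature 1 to the state with feature 2 (reward 0)."""
    w = w0
    history = [w]
    for _ in range(n_updates):
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        w = w + alpha * td_error * 1.0   # gradient of v(first state) = w * 1 is 1
        history.append(w)
    return history

diverging = off_policy_semi_gradient(w0=1.0, alpha=0.1, gamma=0.9, n_updates=100)
shrinking = off_policy_semi_gradient(w0=1.0, alpha=0.1, gamma=0.4, n_updates=100)
```

Each update multiplies $w$ by $1 + \alpha(2\gamma - 1)$, so the weight grows without bound whenever $\gamma > 0.5$. Under on-policy training the second state would also be updated, which counteracts the growth.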


# Function approximation on capture chess

## Environment

There is a maximum of 25 moves, after that the environment resets.
Our Agent only plays white.
The Black player is part of the environment and returns random moves.
The reward structure is not based on winning/losing/drawing but on capturing black pieces:
- pawn capture: +1
- knight capture: +3
- bishop capture: +3
- rook capture: +5
- queen capture: +9

Our state is represented by an 8x8x8 array:
- Plane 0 represents pawns
- Plane 1 represents rooks
- Plane 2 represents knights
- Plane 3 represents bishops
- Plane 4 represents queens
- Plane 5 represents kings
- Plane 6 represents 1/fullmove number (needed for the Markov property)
- Plane 7 represents can-claim-draw

White pieces have the value 1, black pieces are -1.

In [15]:
# start with a linear network (8,8,8) env to (64,64) state space => 32768 weights.
# state aggregation or tile coding can be used to reduce the nr of weights used.

board = Board()
agent = Agent(network='linear',gamma=0.1,lr=0.07)
R = Q_learning(agent,board)
R.agent.fix_model()
R.agent.model.summary()

ModuleNotFoundError: No module named 'keras'