# Notes from Fundamentals of RL

### K armed bandits
Reinforcement learning trains on evaluations of the actions taken rather than on instructions.  Evaluative feedback indicates how good the action taken was, but not what the correct action is, while instructive feedback indicates what the correct action is.

The $k$-armed bandit problem is one in which you face repeated choices, receiving a reward after each choice.  Each action has an expected (time invariant in this example) reward -- the value of the action.
$$
q_*(a) :=E \,[R_t\, \vert \, A_t=a]
$$
We do not know $q_*$, but estimate the value of action $a$ at time $t$ with $Q_t(a)$.  At any given time step at least one action has the highest estimated value, and you choose between exploiting your current knowledge by choosing it (the greedy action) and exploring other actions in order to improve your estimate of their value.

Action-value methods:  methods for estimating action values and using the estimates to make decisions.

For example, estimate the expected reward of an action using the average reward received for it (sample averaging):
$$
Q_t(a)=\frac{\textrm{sum of rewards when action } a \textrm{ taken}}{\textrm{number of times action } a \textrm{ taken}}  =\frac{\sum_{i=1}^{t-1} R_i 1_{A_i=a}}{\sum_{i=1}^{t-1}1_{A_i=a}}
$$
Greedy selection is:
$$
A_t= \textrm{argmax}_a Q_t(a)
$$
To induce exploration, you can use epsilon-greedy methods, in which you make the greedy choice most of the time and, with probability epsilon, sample uniformly at random from all actions, independently of the estimated values.
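
The sample-average estimate and epsilon-greedy selection can be sketched together in a few lines. This is a minimal illustration, not a reference implementation: the unit-variance Gaussian rewards, the default estimate of 0 for untried actions, and the names `sample_averages`, `epsilon_greedy_action`, and `run_bandit` are all assumptions made for the example.

```python
import random


def sample_averages(reward_sums, counts):
    """Q_t(a) = (sum of rewards for a) / (times a was taken); 0.0 if untried."""
    return [s / n if n else 0.0 for s, n in zip(reward_sums, counts)]


def epsilon_greedy_action(q_estimates, epsilon, rng):
    """With probability epsilon pick uniformly at random, else pick greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))
    best = max(q_estimates)
    # Break ties among the greedy actions at random.
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])


def run_bandit(true_values, epsilon=0.1, steps=2000, seed=0):
    """Simulate a stationary bandit; rewards are Gaussian (an assumption)."""
    rng = random.Random(seed)
    reward_sums = [0.0] * len(true_values)
    counts = [0] * len(true_values)
    for _ in range(steps):
        q = sample_averages(reward_sums, counts)
        action = epsilon_greedy_action(q, epsilon, rng)
        reward = rng.gauss(true_values[action], 1.0)
        reward_sums[action] += reward
        counts[action] += 1
    return sample_averages(reward_sums, counts), counts


estimates, counts = run_bandit([0.2, 0.8, 0.5])
print(estimates, counts)
```

Setting `epsilon=0.0` gives purely greedy selection, which can lock onto a suboptimal action if its early samples happen to look good.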

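For averaging, with $Q_n$ denoting the estimate of a single action's value after it has been selected $n-1$ times, the incremental update rule is:
$$
Q_{n+1}=n^{-1} \sum_{i=1}^n R_i \,=\, n^{-1} \left( R_n + (n-1) Q_n \right) \, = \, Q_n + n^{-1} \left( R_n - Q_n \right)
$$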
$$
\textrm{NewEstimate}=\textrm{OldEstimate}+\textrm{StepSize}\, ( \textrm{Target} - \textrm{OldEstimate})
$$
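Where Target is the value the estimate is moved toward (here the $n$th reward), so the difference between Target and OldEstimate is the error in the estimate.  
For the nonstationary case, give more weight to recent rewards by using a constant step size $\alpha \in (0, 1]$, which discounts old rewards geometrically: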
$$
Q_{n+1} = Q_n + \alpha \left( R_n - Q_n \right) = (1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha (1-\alpha)^{n-i} R_i
$$
This is sometimes called an exponential recency-weighted average.  The constant $\alpha$ can be replaced by a sequence of step sizes $\alpha_n$; in fact convergence will occur with probability 1 for any sequence such that
$$
\sum_n \alpha_n = \infty \quad \textrm{and} \quad \sum_n \alpha_n^2 < \infty
$$
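The first condition makes the steps big enough to overcome initial conditions and the second makes them small enough to assure convergence.  But convergence is not desired in a nonstationary environment: a constant $\alpha$ violates the second condition, so the estimate never settles and keeps tracking the most recent rewards.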
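
As a quick numerical check of the update rules above, the sketch below applies the constant step-size update one reward at a time and compares the result with the weighted sum written out directly, including the weight $(1-\alpha)^n$ on the initial estimate $Q_1$. The function names and the reward sequence are made up for illustration.

```python
def constant_step_update(q, reward, alpha):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return q + alpha * (reward - q)


def recency_weighted_average(q1, rewards, alpha):
    """(1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i."""
    n = len(rewards)
    total = (1 - alpha) ** n * q1
    for i, r in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (n - i) * r
    return total


def sample_average_update(q, reward, n):
    """Same update with step size 1/n, which reproduces the plain average."""
    return q + (reward - q) / n


rewards = [1.0, 0.0, 2.0, 1.0]  # illustrative rewards for one action
alpha, q1 = 0.5, 0.0

q_constant = q1
q_average = q1
for n, r in enumerate(rewards, start=1):
    q_constant = constant_step_update(q_constant, r, alpha)
    q_average = sample_average_update(q_average, r, n)

print(q_constant, recency_weighted_average(q1, rewards, alpha))
print(q_average, sum(rewards) / len(rewards))
```

With $\alpha = 0.5$ the most recent reward carries half of the total weight, whereas the step size $1/n$ weights all rewards equally.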

Value estimates are biased by initial conditions, and setting optimistic initial conditions encourages exploration early in the process.  Thus optimistic initial values tend to be good for stationary problems, but the temporary drive for exploration does not help in nonstationary problems.

An alternative to using a fixed epsilon is to add a term that encourages exploration of infrequently sampled actions and represents the uncertainty in the estimate of each action's value (upper confidence bound selection):
$$
A_t = \textrm{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}  \right]
$$
where $N_t(a)$ is the number of times $a$ has been selected so far and $c > 0$ controls the degree of exploration.
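
A minimal sketch of this selection rule follows. Treating an action with $N_t(a) = 0$ as maximizing, so that every action is tried once before the formula is used, is an assumption added here to avoid division by zero; the name `ucb_action` and the numbers in the usage lines are illustrative only.

```python
import math


def ucb_action(q_estimates, counts, t, c=2.0):
    """Return argmax_a [Q_t(a) + c * sqrt(ln t / N_t(a))]."""
    # Assumption: an untried action is selected before any scored action.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_estimates, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Action 0 looks slightly better, but action 1 has barely been sampled,
# so its uncertainty bonus dominates.
print(ucb_action([1.0, 0.9], [100, 2], t=102, c=2.0))
# With c = 0 the bonus vanishes and selection is purely greedy.
print(ucb_action([1.0, 0.9], [100, 2], t=102, c=0.0))
```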

Finally, instead of using value estimates, one could learn a numerical preference $H_t(a)$ for each action, where only the relative preferences matter, and select actions via softmax:


$$
P(A_t=a)  = \frac{e^{H_t(a)}}{e^{H_t(1)}+\cdots + e^{H_t(k)} } :=\pi_t(a)
$$

Where $H_t$ is a preference function updated via stochastic gradient ascent after selecting action $A_t$ and receiving reward $R_t$:

$$
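\begin{aligned}
H_{t+1}(A_t) &:= H_t(A_t)+\alpha(R_t-\bar R_t) (1-\pi_t(A_t)) \\
H_{t+1}(a) &:= H_t(a) -\alpha(R_t-\bar R_t)\pi_t(a) \qquad \textrm{for all } a \neq A_t
\end{aligned}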
$$
Where $\bar R_t$ is the mean of the rewards so far, which serves as a baseline.
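
The softmax policy and one preference update can be sketched as below: the selected action's preference moves up when the reward beats the baseline, and all other preferences move the opposite way in proportion to their probabilities. The function names are hypothetical, and subtracting the maximum preference inside `softmax` is a standard numerical-stability step that does not change the probabilities.

```python
import math


def softmax(preferences):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))."""
    m = max(preferences)  # shift for numerical stability
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_bandit_update(preferences, action, reward, baseline, alpha):
    """Apply one preference update after taking `action` and seeing `reward`."""
    pi = softmax(preferences)
    advantage = reward - baseline
    updated = []
    for a, h in enumerate(preferences):
        if a == action:
            updated.append(h + alpha * advantage * (1 - pi[a]))
        else:
            updated.append(h - alpha * advantage * pi[a])
    return updated


# Three actions with equal preferences; action 1 earns a reward above baseline.
new_h = gradient_bandit_update([0.0, 0.0, 0.0], action=1, reward=1.0,
                               baseline=0.0, alpha=0.3)
print(new_h)
print(softmax(new_h))
```

The updates across actions sum to zero, which reflects the fact that only relative preferences affect the policy.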

In gradient descent, you minimize an objective function of the form 
$$
F(v) = n^{-1} \sum_{i=1}^n F_i(v) 
$$
for a parameter $v$, with $F_i$ being the error attributed to the $i$th observation.  This leads to an update process with learning rate $\eta$ that consists of taking steps opposite the direction of the gradient.  
$$
v\leftarrow v-\eta \nabla F(v)
$$
In stochastic gradient descent, you choose one observation at a time to minimize with respect to:
$$
v\leftarrow v-\eta \nabla F_i(v)
$$
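
To make the difference between the two update rules concrete, here is a small sketch on a toy objective chosen for illustration: $F_i(v) = \tfrac{1}{2}(v - x_i)^2$, whose average is minimized at the mean of the observations. The data, learning rates, and function names are assumptions for the example.

```python
import random


def full_gradient(v, data):
    """Gradient of F(v) = (1/n) * sum_i 0.5 * (v - x_i)^2."""
    return sum(v - x for x in data) / len(data)


def single_gradient(v, x):
    """Gradient of one term F_i(v) = 0.5 * (v - x_i)^2."""
    return v - x


def gradient_descent(data, eta=0.1, steps=200, v=0.0):
    for _ in range(steps):
        v -= eta * full_gradient(v, data)
    return v


def stochastic_gradient_descent(data, eta=0.01, epochs=200, v=0.0, seed=0):
    rng = random.Random(seed)
    order = list(data)
    for _ in range(epochs):
        rng.shuffle(order)  # visit observations in a random order
        for x in order:
            v -= eta * single_gradient(v, x)
    return v


data = [1.0, 2.0, 3.0, 6.0]  # mean is 3.0
v_gd = gradient_descent(data)
v_sgd = stochastic_gradient_descent(data)
print(v_gd, v_sgd)
```

With a constant learning rate the stochastic version keeps fluctuating in a small neighborhood of the minimizer rather than settling exactly on it.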
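In our example the objective is the expected reward, which is maximized rather than minimized, so the update is a gradient ascent step:
$$
H_{t+1}(a)=H_t(a) +\alpha\frac{\partial E [R_t]}{\partial H_t(a)} \qquad E[R_t]=\sum_x \pi_t(x) q_*(x)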
$$
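Where $x$ ranges over all actions.  We do not know $q_*$, so the exact gradient cannot be computed, but the sampled preference update equals it in expected value.  Full derivation in Sutton on page 38.  
Resources for real-world applications of contextual bandit problems: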
https://www.hunch.net/~rwil/ and https://vowpalwabbit.org/neurips2019/


### Markov decision processes

Finite MDPs are used for associative evaluative feedback problems.  Actions influence both immediate and delayed rewards, and values are state dependent: $q_*(s,a)$.  The learner is the agent, which interacts with the environment.  At each of a sequence of discrete time steps, the agent is in a state, in which it takes an action, resulting in a reward and an updated state.  In a finite MDP the numbers of states, actions, and rewards are all finite.  
$$
p(s^\prime, \,r \, \vert \, s,\,a):=P(S_t={s^\prime}, \, R_t=r \, \vert \, S_{t-1}=s,\,A_{t-1}=a)\qquad \sum_{s^\prime} \sum_r p(s^\prime, \,r \, \vert \, s,\,a) =1
$$
The assumption that only the current state matters (the Markov property) is an assumption that all relevant information about previous states is included in the current state.  Compute state transition probabilities by summing over rewards, $p(s^\prime \, \vert \, s,\,a) = \sum_r p(s^\prime, \,r \, \vert \, s,\,a)$, and find expected rewards by averaging over rewards, summing over next states when the next state is not given:
$$
r(s,a)=E[R_t\, \vert \, S_{t-1}=s,\, A_{t-1}=a] = \sum_r r \sum_{s^\prime} p(s^\prime, \,r \, \vert \, s,\,a) \qquad 
r(s,a,s^\prime)=E[R_t\, \vert \, S_{t}=s^\prime, \,S_{t-1}=s, \, A_{t-1}=a] = \sum_r r \, p( \,r \, \vert \,s^\prime,\, s,\,a) = \sum_r r \, \frac{p(s^\prime, \,r \, \vert \, s,\,a)}{p(s^\prime \, \vert \, s,\,a)}
$$
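
These marginalizations are easy to check on a small table. The sketch below stores $p(s^\prime, r \vert s, a)$ as a nested dictionary for a made-up MDP (the states, actions, rewards, and probabilities are all hypothetical) and derives the transition probabilities and both expected rewards from it.

```python
# Hypothetical dynamics: dynamics[(s, a)][(s_next, r)] = p(s_next, r | s, a)
dynamics = {
    ("s0", "go"): {("s0", 0.0): 0.25, ("s1", 1.0): 0.5, ("s1", 5.0): 0.25},
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s1", "go"): {("s0", 2.0): 1.0},
}


def transition_probs(dynamics, s, a):
    """p(s_next | s, a): sum the joint probability over rewards."""
    probs = {}
    for (s_next, _reward), p in dynamics[(s, a)].items():
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs


def expected_reward(dynamics, s, a):
    """r(s, a): sum of r * p(s_next, r | s, a) over next states and rewards."""
    return sum(r * p for (_s_next, r), p in dynamics[(s, a)].items())


def expected_reward_given_next(dynamics, s, a, s_next):
    """r(s, a, s_next): sum_r r * p(s_next, r | s, a) / p(s_next | s, a)."""
    joint = sum(r * p for (sn, r), p in dynamics[(s, a)].items() if sn == s_next)
    return joint / transition_probs(dynamics, s, a)[s_next]


print(transition_probs(dynamics, "s0", "go"))
print(expected_reward(dynamics, "s0", "go"))
print(expected_reward_given_next(dynamics, "s0", "go", "s1"))
```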
The actions represent choices made by the agent, the states represent the basis on which the choices are made, and rewards represent an interpretation of achieving goals.  Goals are formalized by rewards passing from environment to agent.  In RL, goals and purposes are represented by the maximization of expected cumulative reward.  The reward signal does not impart prior knowledge of how to do a task, but only of what you want to achieve.  In general the agent tries to maximize the expected return, where the return is some function of the future rewards, denoted $G_t = F(R_{t+1}, \dots, R_T)$, and $T$ is a final time step.  A final time step makes sense when the interaction breaks down into subsequences (episodes) that end in a terminal state.  Each episode begins independently of the previous episode.  Episodic tasks break down into episodes, while continuing tasks do not end.  In the continuing case, future rewards may be discounted with a discount rate $0 \le \gamma < 1$, and the discounted return is represented by 
$$
G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1} = R_{t+1}+\gamma G_{t+1}
$$
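
The recursive form $G_t = R_{t+1} + \gamma G_{t+1}$ suggests computing returns backward from the end of a reward sequence. The sketch below assumes a finite sequence, with all rewards after its end taken to be zero; the function name and the example rewards are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where rewards[t] holds R_{t+1}.

    Works backward using G_t = R_{t+1} + gamma * G_{t+1}, with the return
    after the last reward taken to be 0 (an assumption for finite sequences).
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(discounted_returns([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

Working backward takes one pass over the rewards, whereas evaluating the sum directly for every $t$ would repeat most of the work.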



