# Reinforcement Learning

Reinforcement learning is a branch of machine learning concerned with **reward maximization**: how an agent ought to choose actions in order to maximize the reward it receives.  It differs from supervised learning in that we do not have labeled input-output pairs, and from unsupervised learning in that we nonetheless have a goal, known beforehand, to maximize.

Reinforcement learning is closely related to **Dynamic Programming**, a mathematical optimization technique that has found use in many fields.  The basic idea is to break a complicated problem down into many sub-problems, often recursively.  In reinforcement learning, this usually appears as the value of a state being defined in terms of the value of the next state.

The **Bellman** equation is probably the central equation in reinforcement learning: $$V(s) = \max_a(R(s,a) + \gamma V(s'))$$

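where $s$ is the current state, $a$ is an action, $s'$ is the state reached by taking action $a$ in state $s$, $R(s,a)$ is the reward for taking action $a$ in state $s$, $V(s)$ is the value of being in state $s$, and $\gamma$ is a discount factor between 0 and 1.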
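
As a rough illustration of the equation above, the sketch below applies the Bellman backup repeatedly (value iteration) until the values settle. The 4-state corridor, its reward of 1 for reaching the last state, and the names `step` and `value_iteration` are all assumptions made up for this example.

```python
# Hypothetical 4-state corridor: states 0..3, state 3 is terminal.
# Actions move left (-1) or right (+1); entering state 3 pays reward 1.
GAMMA = 0.9
STATES = [0, 1, 2, 3]
ACTIONS = [-1, +1]
TERMINAL = 3

def step(s, a):
    """Deterministic transition: returns (next_state, reward)."""
    s_next = min(max(s + a, 0), TERMINAL)
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward

def value_iteration(sweeps=50):
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s == TERMINAL:
                continue  # nothing can be earned after the terminal state
            candidates = []
            for a in ACTIONS:
                s_next, r = step(s, a)
                # Bellman backup term: R(s,a) + gamma * V(s')
                candidates.append(r + GAMMA * V[s_next])
            V[s] = max(candidates)
    return V

V = value_iteration()
print(V)
```

Each state's value ends up as the reward of 1 discounted by how many steps remain before the goal is reached, which is the recursive structure the equation describes.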

In general, a Markov chain is a process in which the probability of the next state depends only on the current state, and not on the sequence of steps that led to it.
A **Markov Decision Process** (MDP) takes place in a stochastic environment where outcomes are not completely under the control of the decision maker; in other words, they are partly random (an assumption we make about real-life problems).  At each point in time, the system is in some state *s* and the decision maker must choose some action *a*, whereby they receive some reward *R(s,a)* and the system moves into a new state *s'* (s prime).

Assuming our system is an MDP, we can write a stochastic version of the Bellman equation:
$$V(s) = \max_a(R(s,a) + \gamma\sum_{s'}P(s, a, s')V(s'))$$

where $P(s,a,s')$ is the probability of moving to state $s'$ when action $a$ is taken in state $s$, so the sum is the expected value of the next state.
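
The stochastic version can be sketched the same way, with the single next state replaced by a probability-weighted sum. The tiny MDP below (a "go" action that succeeds 80% of the time, and a "stay" action) and the table names `P` and `R` are hypothetical, chosen only so the result can be checked by hand.

```python
GAMMA = 0.9
TERMINAL = 2

# P[s][a] maps each next state s' to its probability P(s, a, s').
P = {
    0: {"go": {1: 0.8, 0: 0.2}, "stay": {0: 1.0}},
    1: {"go": {2: 0.8, 1: 0.2}, "stay": {1: 1.0}},
}
# R[s][a] is the immediate reward R(s, a).
R = {
    0: {"go": 0.0, "stay": 0.0},
    1: {"go": 1.0, "stay": 0.0},
}

def value_iteration(sweeps=500):
    V = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
    for _ in range(sweeps):
        for s in P:
            # V(s) = max_a ( R(s,a) + gamma * sum_s' P(s,a,s') * V(s') )
            V[s] = max(
                R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in P[s]
            )
    return V

V = value_iteration()
print(V)
```

Solving the fixed point by hand for the "go" action gives $V(1) = 1/0.82$ and $V(0) = 0.72\,V(1)/0.82$, which the iteration approaches.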

Exploitation vs Exploration:
In order to optimize the reward, an agent must continually make a tradeoff between exploitation and exploration.  Exploitation means making the best use of what it has already learned for maximum gain.  But in order to find what can be exploited in the first place, the agent must explore as well.  Single-mindedly pursuing either one of these will inevitably lead to failure.  The agent needs to try a variety of actions *and* progressively favor the more rewarding actions over time.

The 4 main parts of an RL agent:
1. Policy
2. Reward
3. Value
4. Model of Environment

Policy vs. Plan: If a system is deterministic, the situational decision making is referred to as a "plan"; if the environment is stochastic, it is referred to as a "policy".

The "Living Penalty" is a negative reward added for every move the agent makes.  It can encourage the agent to change its actions, namely toward actions that are potentially riskier but less time-consuming.
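
A small arithmetic sketch of this effect: compare a short, risky route with a long, safe one, with and without a living penalty. The routes, probabilities, step counts, and the helper `expected_return` are invented for illustration, and discounting is left out to keep the arithmetic simple.

```python
def expected_return(p_goal, steps, living_penalty, goal=1.0, pit=-1.0):
    """Expected outcome of a route plus the penalty paid for each step taken."""
    outcome = p_goal * goal + (1 - p_goal) * pit
    return outcome + living_penalty * steps

preferred = {}
for penalty in (0.0, -0.2):
    # Hypothetical risky route: 2 steps, 80% chance of the goal, 20% of the pit.
    risky = expected_return(p_goal=0.8, steps=2, living_penalty=penalty)
    # Hypothetical safe route: 6 steps, always reaches the goal.
    safe = expected_return(p_goal=1.0, steps=6, living_penalty=penalty)
    preferred[penalty] = "risky" if risky > safe else "safe"
    print(penalty, preferred[penalty])
```

With these made-up numbers the safe route wins when moves are free (1.0 against 0.6), and the risky route wins once each move costs 0.2 (0.2 against -0.2).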

**Q-Learning** Q-learning is a model-free reinforcement learning algorithm.  Given enough exploration, it can determine the optimal **policy** of any finite Markov decision process, optimal in the sense that it maximizes the expected value of the total reward over all successive steps, starting from the current state.  We call it Q-learning because we are concerned with the "quality" of actions in a given state.

$$Q(s,a) = R(s,a) + \gamma\max_{a'}Q(s', a')$$

$$Q(s,a) = R(s,a) + \gamma\sum_{s'}(P(s,a,s')V(s'))$$

$$Q(s,a) = R(s,a) + \gamma\sum_{s'}(P(s,a,s')\max_{a'}Q(s',a'))$$
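
The step between the last two equations can be made explicit. Assuming the value of a state is the value of its best action (the same max that appears in the Bellman equation), substituting for $V(s')$ gives:

$$V(s') = \max_{a'}Q(s',a')$$

$$Q(s,a) = R(s,a) + \gamma\sum_{s'}P(s,a,s')V(s') = R(s,a) + \gamma\sum_{s'}P(s,a,s')\max_{a'}Q(s',a')$$

In a deterministic environment only one $s'$ has nonzero probability, so the sum collapses to the first of the three equations.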

**Temporal Difference Methods**: $$TD(s,a) = R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a)$$

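Comparing this with the first Q equation above, the first two terms form a new estimate of $Q(s,a)$ and the last term is the previous estimate, so the temporal difference is just $TD_t(s,a) = Q_{new}(s,a) - Q_{t-1}(s,a)$.  We can use it to update the previous estimate, with $\alpha$ as the learning rate: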

$$Q_t(s,a) = Q_{t-1}(s,a) + \alpha TD_t(s,a)$$

We can combine the above equations to get: 

$$Q_t(s,a) = Q_{t-1}(s,a) + \alpha(R(s,a) + \gamma\max_{a'}Q_{t-1}(s',a') - Q_{t-1}(s,a))$$

We want to drive $TD(s,a)$ toward zero: at that point the agent's Q-values are consistent with the rewards it actually receives, since a new estimate no longer differs from the previous one.
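
A minimal tabular sketch of the update rule above. The corridor environment, the purely random behaviour policy, and the hyperparameter values are assumptions made for this example; any policy that keeps visiting every state-action pair would do, since the update itself always bootstraps from the max.

```python
import random

GAMMA, ALPHA = 0.9, 0.5          # assumed discount factor and learning rate
ACTIONS = [-1, +1]               # move left or right
TERMINAL = 3                     # hypothetical corridor with states 0..3

def step(s, a):
    """Deterministic transition; entering the terminal state pays reward 1."""
    s_next = min(max(s + a, 0), TERMINAL)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(TERMINAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            a = rng.choice(ACTIONS)  # assumption: purely random exploration
            s_next, r = step(s, a)
            # TD error: R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)
            td = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)]
            Q[(s, a)] += ALPHA * td
            s = s_next
    return Q

Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(TERMINAL)}
print(greedy)
```

Even though the agent behaves randomly while learning, the greedy policy read off the table afterwards heads straight for the goal, and the TD errors shrink toward zero as the table settles.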

The TD error mimics the firing pattern of dopamine neurons in the brain, which respond to the difference between expected and received reward.

**SARSA** (State-Action-Reward-State-Action) uses a slightly different update rule:

$$Q_t(s,a) = Q_{t-1}(s,a) + \alpha(R(s,a) + \gamma Q_{t-1}(s',a'_{t+1}) - Q_{t-1}(s,a))$$

SARSA is an on-policy algorithm: the max over actions in the Q-learning update above is replaced by $a'_{t+1}$, the next action chosen according to the *current* policy the agent is pursuing.  Q-learning, by contrast, is not restricted to evaluating the policy it follows.  SARSA is useful in non-stationary environments, those that are constantly changing, where no optimal policy could ever possibly be reached, and when function approximation has to be used.  SARSA's update is not greedy like the Q-learning update.
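
To make the difference concrete, the sketch below applies both updates to the same transition. The state names, the hand-set Q-table, and the function names are hypothetical; the only point is that the two targets diverge when the next action actually taken is not the greedy one.

```python
GAMMA, ALPHA = 0.9, 0.5  # assumed discount factor and learning rate

def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy: bootstrap from the best action available in s_next."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action a_next the agent will actually take."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

actions = ["left", "right"]
start = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
         ("s1", "left"): 0.2, ("s1", "right"): 1.0}

Q_ql, Q_sarsa = dict(start), dict(start)
# Same transition (s0, right, reward 0, s1); the policy then explores with "left".
q_learning_update(Q_ql, "s0", "right", 0.0, "s1", actions)
sarsa_update(Q_sarsa, "s0", "right", 0.0, "s1", a_next="left")

print(Q_ql[("s0", "right")], Q_sarsa[("s0", "right")])
```

Q-learning credits the transition with the value of the best follow-up action, while SARSA credits it with the value of the exploratory action that was really chosen, so its estimate here is lower.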

**Actor-Critic Learning** Asynchronous Advantage Actor-Critic Algorithm (A3C):
This algorithm splits the agent into a policy, or actor, and a value function, or critic.  The critic judges the performance of the actor for every action, which adjusts the *preference* for that action according to the formula: $$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\, TD_t$$
where $\beta$ is the size of the update.
A3C agents are able to select a new action without examining ALL possible actions, and a priori knowledge about policy constraints can be used.
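
The preference update can be sketched on a deliberately tiny problem. This is not A3C itself (there are no asynchronous workers and no neural networks); it only illustrates the actor and critic roles and the formula above. The single-state, two-action task, the softmax mapping from preferences to probabilities, and the step sizes are all assumptions.

```python
import math
import random

ALPHA, BETA = 0.1, 0.5           # critic and actor step sizes (assumed values)
REWARDS = [0.0, 1.0]             # hypothetical two-action task: action 1 is better

def softmax(prefs):
    """Turn preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=200, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]           # actor: preference p(s, a) for the single state
    value = 0.0                  # critic: V(s) for the single state
    for _ in range(steps):
        a = rng.choices([0, 1], weights=softmax(prefs))[0]
        r = REWARDS[a]
        td = r - value           # one-step episode, so there is no gamma * V(s') term
        value += ALPHA * td      # critic update
        prefs[a] += BETA * td    # actor update: p(s,a) <- p(s,a) + beta * TD
    return prefs, value

prefs, value = train()
print(softmax(prefs))
```

Actions that do better than the critic expected get a positive TD and a higher preference; actions that do worse get pushed down, without the actor ever comparing all actions directly.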

![RL methods](<RL methods.png>)

**Deep Q-Learning** All the concepts of Q-Learning can be extended to neural networks.
In this framework, the NN will attempt to minimize $$L = \sum(Q_{Target} - Q)^2$$ where $Q_{Target}$ is the target value for a particular action, built from the reward and the Q-value the network calculated during a previous pass, and $Q$ is the network's current prediction.  On each iteration the network takes a state as input and calculates Q-values for every possible action in that state.  This is similar to the standard Q-learning algorithm; the difference is that the calculation is now done by a neural network through the updating of weights, instead of being explicitly computed with an exact formula.
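
The loss can be sketched without any deep learning library by standing in a linear model for the network. Everything here is an assumption made for illustration: the linear `q_values` "network", the separate frozen `target_weights` used to build $Q_{Target}$, the two-transition batch, and the learning rate.

```python
GAMMA = 0.9

def q_values(weights, state):
    """Stand-in 'network': one linear output per action, Q(s,a) = w_a . s."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, state)) for w in weights]

def td_loss(weights, target_weights, batch):
    """L = sum over the batch of (Q_target - Q(s,a))^2."""
    loss = 0.0
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else max(q_values(target_weights, s_next))
        q_target = r + GAMMA * bootstrap   # held fixed: built from older weights
        loss += (q_target - q_values(weights, s)[a]) ** 2
    return loss

def sgd_step(weights, target_weights, batch, lr=0.1):
    """One gradient-descent step on L with respect to `weights` only."""
    new = [list(w) for w in weights]
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else max(q_values(target_weights, s_next))
        error = (r + GAMMA * bootstrap) - q_values(weights, s)[a]
        for i, x_i in enumerate(s):
            new[a][i] += lr * 2 * error * x_i   # -dL/dw = 2 * error * x
    return new

# Hypothetical batch of (state, action, reward, next_state, done) transitions.
batch = [([1.0, 0.0], 1, 0.0, [0.0, 1.0], False),
         ([0.0, 1.0], 1, 1.0, [0.0, 0.0], True)]
weights = [[0.0, 0.0], [0.0, 0.0]]       # two actions, two state features
target_weights = [list(w) for w in weights]

before = td_loss(weights, target_weights, batch)
weights = sgd_step(weights, target_weights, batch)
after = td_loss(weights, target_weights, batch)
print(before, after)
```

The weight update plays the role that the explicit table update plays in standard Q-learning: each step nudges the predicted Q-value toward its target, and the loss falls as a result.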

One strategy is to use the *Softmax* function to choose an action in each state.  The Softmax function converts the Q-values into a set of probabilities that sum to one; the agent then samples an action from these probabilities, so actions with higher Q-values are selected more often.
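
A minimal sketch of softmax action selection using only the standard library. The `temperature` parameter is an addition not mentioned above (lower values make the choice greedier), and the example Q-values are made up.

```python
import math
import random

def softmax(q_values, temperature=1.0):
    """Convert Q-values into probabilities that sum to one."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(q_values, rng, temperature=1.0):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]

rng = random.Random(0)
probs = softmax([1.0, 2.0, 3.0])
action = sample_action([1.0, 2.0, 3.0], rng)
print(probs, action)
```

Because the action is sampled rather than taken as the argmax, lower-valued actions still get tried occasionally, which gives the agent some exploration for free.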

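With an epsilon-greedy rule, the agent selects the best action with probability $(1-\epsilon)$ and a random action otherwise, which keeps it exploring and helps prevent it from locking onto a suboptimal action too early.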
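
The rule fits in a few lines. The function name `epsilon_greedy` and the example Q-values are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
q = [0.1, 0.5, 0.3]
always_greedy = {epsilon_greedy(q, 0.0, rng) for _ in range(100)}
always_random = {epsilon_greedy(q, 1.0, rng) for _ in range(1000)}
print(always_greedy, always_random)
```

Setting $\epsilon = 0$ gives pure exploitation and $\epsilon = 1$ gives pure exploration; in practice $\epsilon$ is often started high and reduced over time so the agent progressively favors the more rewarding actions.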

*Recommended Readings*: https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

https://pdfs.semanticscholar.org/968b/ab782e52faf0f7957ca0f38b9e9078454afe.pdf

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0