## Neural Networks: 

from https://pad.gwdg.de/s/Machine_Learning_For_Physicists_2021#

- Reinforcement learning: 
    - Model-free (REINFORCE)
    - Policy gradient
    - Reward baseline
    - Q-learning

## Reinforcement learning


                   observation
              --------------------(ENVIRONMENT)   Fully observed vs. partially observed 'state' of the environment
              |                       ^
              v                       |
            (AGENT) -------------------
                         action

    Agent may be: self-driving car, robot -> observes immediate environment & moves.

- The net tries out things
- Train a net to produce actions based on rare rewards instead of being told the correct action. 

**Challenge**: the correct action is unknown -> **NO SUPERVISED LEARNING**; rewards will be rare (or given only at the end). We could use the final reward to define a cost function, but we cannot know how the environment would react to a proposed change of the actions that were taken (UNLESS we have a model of the environment).

        
**RL BASIC SETTING**:

Take an observation, which indicates that the environment is in a certain state (partial or full information about the state) -> we want to map this into the action that will be performed next -> Policy (strategy) = mapping from state to action.


    [       RL-AGENT       ]       [RL - ENVIRONMENT]
    ------------------------       ------------------
    | - Action             |   ->  |                |
    | Policy: state-action |       |                |
    | -Observation         |   <-  |                |
    ________________________       __________________
    
e.g., robot game: pick boxes as quick as possible

        State  = position x,y
        Action = move (direction)
        Reward - whenever it picks a box, reward = 1, otherwise = 0 or even negative (if we want to penalize)

#### MODEL-FREE GENERAL REINFORCEMENT LEARNING TECHNIQUE

**> REINFORCE: Policy gradient**: use probabilistic action choice. If the reward at the end turns out to be high, make **all** the actions in this sequence **more likely** (otherwise, do the opposite). This also reinforces 'bad' actions, but since they occur more often in trajectories with low reward, the net effect will still be to suppress them.

*Policy*: probability distribution -> probability to pick action $a_t$ given an observed state $s_t$ at time $t$; $\theta$ parameters of the NN. 

\begin{equation}
\pi_\theta(a_t | s_t)
\end{equation}

Given a state $s$ (`X` is the robot, `o` are the boxes):

                                    [S]
                                    ______________________
                                    |  o        o    o o | 
                                    |        o           |
                                    |  o                 |
                                    |   o         o      |
                                    | oo     o    X      |
                                    ______________________
                                    
                                    
                                    
Action $a$ and its probability $\pi_\theta(a|s)$:

      down       0.1
      up         0.6
      left       0.2
      right      0.1
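
A minimal Python sketch of how an action could be drawn from such a policy. Only the four probabilities come from the table above; the ordering of the actions, the seed, and the helper name `sample_action` are assumptions made for illustration.

```python
import numpy as np

# Action probabilities from the table above (the ordering is an assumption).
actions = ["down", "up", "left", "right"]
probs = np.array([0.1, 0.6, 0.2, 0.1])

rng = np.random.default_rng(seed=0)  # fixed seed only for reproducibility


def sample_action(rng, probs):
    """Draw one action index according to the policy probabilities."""
    return rng.choice(len(probs), p=probs)


# The empirical frequencies should approach the policy probabilities.
samples = np.array([sample_action(rng, probs) for _ in range(10_000)])
frequencies = np.bincount(samples, minlength=len(probs)) / len(samples)
print(dict(zip(actions, frequencies)))
```

With many samples, "up" should come out as the most frequent action, but every action keeps being tried occasionally, which is what lets the agent explore.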

Environment: makes (possibly stochastic) transition to a new state s' and possibly gives a reward r. 

*Transition function*: probability that the environment goes into a new state $s'$, given the current state $s$ and the action taken $a$: $P(s'|s,a)$.

*Probability for having a certain trajectory of actions and states:* product over time-steps

\begin{equation}
P_\theta(\tau)=\prod_t P(s_{t+1}|s_t,a_t)\pi_\theta(a_t | s_t)
\end{equation}

* Trajectory: 
    - $\tau=(a,s)$
    - $a=a_0, a_1, a_2,..$
    - $s=s_1, s_2,...$ ($s_0$ is fixed)
    
Expected overall reward (= 'return'): $R(\tau)$ is the return of one trajectory (sum of the individual rewards $r$ over all times); the expectation sums over all actions at all times and over all states at all times $>0$.

\begin{equation}
\bar{R}= E\{R\} = \sum_\tau P_\theta(\tau)R(\tau) = \sum_{a_0, a_1,...,s_1,s_2,...} P_\theta(\tau)R(\tau)
\end{equation}

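Try to maximize the expected return by changing the parameters of the policy. $P_\theta(\tau)$ is a product over time steps, so the product rule gives one term per time step $t$, in which only the factor $\pi_\theta(a_t | s_t)$ is differentiated (the transition function does not depend on $\theta$).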

\begin{equation}
\frac{\partial\bar{R}}{\partial \theta}=?
\end{equation}

\begin{equation}
\frac{\partial\bar{R}}{\partial \theta}=\sum_t\sum_\tau R(\tau)\frac{\partial\pi_\theta(a_t | s_t)}{\partial \theta}\frac{1}{\pi_\theta(a_t | s_t)}\prod_{t'}P(s_{t'+1}|s_{t'},a_{t'})\pi_\theta(a_{t'} | s_{t'})
\end{equation}


\begin{equation}
\frac{\partial\bar{R}}{\partial \theta}= \sum_t\sum_\tau R(\tau)\frac{\partial ln\pi_\theta(a_t | s_t)}{\partial \theta}\prod_{t'}P(s_{t'+1}|s_{t'},a_{t'})\pi_\theta(a_{t'} | s_{t'}) 
\end{equation}

**> Main formula of policy gradient method:**

\begin{equation}
\frac{\partial\bar{R}}{\partial \theta}= \sum_t E\left(R\frac{\partial ln\pi_\theta(a_t | s_t)}{\partial \theta}\right)
\end{equation}
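
As a sanity check, the formula can be compared against a finite-difference derivative of $\bar{R}$ on a problem small enough to enumerate every trajectory. The sketch below is an illustration only: the two-state environment, its random transition probabilities and rewards, and the tabular softmax policy are all made-up assumptions, not part of the lecture.

```python
import itertools
import numpy as np

n_states, n_actions, T = 2, 2, 3
rng = np.random.default_rng(1)
# Hypothetical environment: transition probabilities P[s, a, s'] and rewards r[s, a].
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))


def policy(theta):
    """Softmax policy pi[s, a] from a table of parameters theta[s, a]."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def trajectories(theta, s0=0):
    """Yield (probability, return, G) for every possible trajectory."""
    pi = policy(theta)
    for acts in itertools.product(range(n_actions), repeat=T):
        for states in itertools.product(range(n_states), repeat=T):
            s, prob, ret, G = s0, 1.0, 0.0, np.zeros_like(theta)
            for a, s_next in zip(acts, states):
                prob *= pi[s, a] * P[s, a, s_next]
                ret += r[s, a]
                # d ln pi(a|s) / d theta[s, :] = onehot(a) - pi[s, :]
                G[s] += np.eye(n_actions)[a] - pi[s]
                s = s_next
            yield prob, ret, G


def expected_return(theta):
    return sum(p * R for p, R, _ in trajectories(theta))


def policy_gradient(theta):
    """Exact E(R * sum_t d ln pi / d theta), summed over all trajectories."""
    return sum(p * R * G for p, R, G in trajectories(theta))


theta = rng.normal(size=(n_states, n_actions))
grad = policy_gradient(theta)

# Central finite differences of the expected return.
eps, fd = 1e-6, np.zeros_like(theta)
for idx in np.ndindex(theta.shape):
    d = np.zeros_like(theta)
    d[idx] = eps
    fd[idx] = (expected_return(theta + d) - expected_return(theta - d)) / (2 * eps)

print(np.max(np.abs(grad - fd)))  # should be tiny
```

In a real problem the sum over all trajectories is not feasible, which is why the expectation is estimated from sampled trajectories instead.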

(1) Run a lot of trajectories

(2) For **each** trajectory calculate the return, R

(3) Look at which actions $a_t$ and states $s_t$ we went through for **each** particular trajectory

(4) Calculate the probability of taking the action $a_t$ given the state $s_t$ for each of the time-steps of a **given** trajectory 

(5) Calculate its logarithmic derivative

(6) Sum this over all times

(7) Average over all trajectories

Stochastic gradient ascent (we maximize the return): E(...) is approximated via the value for one trajectory (or a batch).

\begin{equation}
\Delta\theta=\eta\frac{\partial\bar{R}}{\partial \theta}
\end{equation}
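
The recipe (1)-(7) together with the update rule can be sketched in a few lines of NumPy. To keep it short, this example assumes a policy that does not depend on the state, with two actions of which only the second one is rewarded; the learning rate, batch size and trajectory length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([0.0, 1.0])   # hypothetical: only action 1 is rewarded
theta = np.zeros(2)              # policy parameters (one logit per action)
eta, batch_size, T = 0.1, 32, 5


def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()


for step in range(200):
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(batch_size):                  # (1) run trajectories
        actions = rng.choice(2, size=T, p=pi)
        R = rewards[actions].sum()               # (2) return of this trajectory
        # (3)-(6) sum over time of d ln pi(a_t) / d theta = onehot(a_t) - pi
        G = np.bincount(actions, minlength=2) - T * pi
        grad += R * G
    grad /= batch_size                           # (7) average over trajectories
    theta += eta * grad                          # step uphill in the return

print(softmax(theta))  # probability of the rewarded action should be close to 1
```

Note that the return here is never negative, so every sampled action is pushed up; the rewarded action still wins because it is pushed up more strongly.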


**> Physical meaning:**

\begin{equation}
\frac{\partial\bar{R}}{\partial \theta}= \sum_t E\left(R\frac{\partial ln\pi_\theta(a_t | s_t)}{\partial \theta}\right)
\end{equation}

Taking the derivative of $\ln\pi_\theta(a_t | s_t)$ with respect to $\theta$ gives the direction in parameter space in which the probability $\pi_\theta(a_t | s_t)$ gets **larger**; a step along this direction makes the action $a_t$ more likely in state $s_t$.

**One particular trajectory is defined by the sequence of all states and actions** <- how can I make the probability for one step **inside** the sequence larger? Since we're summing over all times -> how can I make **on average** all the probabilities larger?

    - Take a trajectory
    - Calculate its return
    - Let me make all the probabilities of all the actions that I really took more probable
    - All the probabilities depend on the return:
\begin{equation}
R\frac{\partial ln\pi_\theta(a_t | s_t)}{\partial \theta}
\end{equation}

Increase the probability of all action choices in the given sequence, depending on size of return $R$. Even if $R>0$ always, due to normalization of probabilities this will tend to suppress the action choices in sequences with lower-than-average returns.

For a given parameter $\theta_k$: gradient of the log-probability $\ln P_\theta(\tau)$ of a given trajectory $\tau$ (only the policy $\pi_\theta(a_t | s_t)$ depends on the parameters $\theta_k$):

\begin{equation}
G_k=\frac{\partial ln P_\theta(\tau)}{\partial \theta_k}=\sum_t\frac{\partial ln\pi_\theta(a_t | s_t)}{\partial \theta_k}
\end{equation}
\begin{equation}
\frac{\partial\bar{R}}{\partial \theta_k}= E(RG_k)
\end{equation}

**> Policy gradient: reward baseline**

Fluctuations of estimate for return gradient can be huge -> things improve if one subtracts a constant baseline from the return:

\begin{equation}
\frac{\partial\bar{R}}{\partial \theta}= \sum_t E\left((R-b)\frac{\partial ln\pi_\theta(a_t | s_t)}{\partial \theta}\right)=E((R-b)G)
\end{equation}
\begin{equation}
E((R-b)G) = E(RG) - b\,E(G) = E(RG) \qquad \text{since} \qquad E(G_k)=\sum_\tau P_\theta(\tau)\frac{\partial ln P_\theta(\tau)}{\partial \theta_k}=\frac{\partial}{\partial \theta_k}\sum_\tau P_\theta(\tau)= \frac{\partial}{\partial \theta_k} 1 = 0
\end{equation}

However, the variance of the fluctuating random variable $(R-b)G$ is different and can be smaller (depending on $b$). 

**Optimal baseline:** $k$ refers to the parameters $\theta_k$

\begin{equation}
X_k=(R-b_k)G_k \qquad \qquad Var(X_k)=E(X_k^2)-E(X_k)^2 = min \qquad \qquad \frac{\partial Var(X_k)}{\partial b_k}=0
\end{equation}

\begin{equation}
b_k=\frac{E(G_k^2 R)}{E(G_k^2)} \qquad \qquad G_k=\frac{\partial ln P_\theta(\tau)}{\partial \theta_k}
\end{equation}
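
A short derivation sketch for this $b_k$, using only the definitions above. Since $E(G_k)=0$, the mean $E(X_k)=E(RG_k)$ does not depend on $b_k$, so only $E(X_k^2)$ has to be minimized:

\begin{equation}
\frac{\partial Var(X_k)}{\partial b_k}=\frac{\partial}{\partial b_k}E\left((R-b_k)^2G_k^2\right)=-2\left(E(G_k^2R)-b_kE(G_k^2)\right)=0
\end{equation}

Solving for $b_k$ gives the ratio $E(G_k^2R)/E(G_k^2)$, i.e. an average of the return weighted by $G_k^2$.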

\begin{equation}
\Delta\theta_k=\eta E(G_k(R-b_k))
\end{equation}
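
The effect of the baseline can be checked numerically. The toy problem below is an assumption made for illustration: a state-independent sigmoid policy with $N$ up/down steps, for which $G=N_{up}-N\pi_\theta(up)$, and a return with a large constant offset so that the fluctuations without a baseline are clearly visible.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p_up, n_traj = 20, 0.5, 50_000   # hypothetical toy problem

# State-independent policy: each step goes "up" with probability p_up.
n_up = rng.binomial(N, p_up, size=n_traj)
R = 10.0 + n_up                     # return with a large constant offset
G = n_up - N * p_up                 # d ln P(tau) / d theta for a sigmoid policy

# Optimal baseline E(G^2 R) / E(G^2), estimated from the samples.
b_opt = np.mean(G**2 * R) / np.mean(G**2)

X_plain = R * G                     # gradient samples without baseline
X_base = (R - b_opt) * G            # gradient samples with optimal baseline

print("baseline:", b_opt)
print("means:    ", X_plain.mean(), X_base.mean())
print("variances:", X_plain.var(), X_base.var())
```

Both means estimate the same gradient, while the variance with the baseline should be far smaller, so fewer trajectories are needed for a usable gradient estimate.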

##### RL: random walk

A random walk where the probability to go up is determined by the policy and the return is given by the final position (optimal strategy: always go up) -> the policy doesn't depend on the current state.

> Policy:
\begin{equation}
\pi_\theta(up)=\frac{1}{1+e^{-\theta}}
\end{equation}
Return:
\begin{equation}
R=x(T)
\end{equation}

RL update: $a_t$ up or down
\begin{equation}
\Delta\theta = \eta \sum_t\left<R\frac{\partial ln\pi_\theta(a_t)}{\partial \theta}\right>
\end{equation}

\begin{equation}
\frac{\partial ln\pi_\theta(a_t)}{\partial \theta} = \pm e^{\mp\theta}\pi_\theta(a_t)=\pm(1-\pi_\theta(a_t))
\end{equation}

(+ for up, - for down)

\begin{equation}
up \qquad \frac{\partial ln\pi_\theta(a_t)}{\partial \theta} = \ 1-\pi_\theta(up) \qquad \qquad down\qquad \frac{\partial ln\pi_\theta(a_t)}{\partial \theta} = -\pi_\theta(up)
\end{equation}

\begin{equation}
\sum_t\frac{\partial ln\pi_\theta(a_t)}{\partial \theta} = N_{up}-N\pi_\theta(up)
\end{equation}

$N_{up}$ number of **up-steps**, $N$ number of time-steps. 

Return:
\begin{equation}
R=x(T)=N_{up}-N_{down}=2N_{up}-N
\end{equation}

RL update: $a_t$ up or down
\begin{equation}
\Delta\theta = \eta \sum_t\left<R\frac{\partial ln\pi_\theta(a_t)}{\partial \theta}\right>
\end{equation}

\begin{equation}
\left<R\sum_t\frac{\partial ln\pi_\theta(a_t)}{\partial \theta}\right>=2\left<(N_{up} - N/2)(N_{up}-\bar{N}_{up})\right> \qquad \bar{N}_{up}=N\pi_\theta(up)
\end{equation}

Initially, when $\pi_\theta(up)=1/2$: 
\begin{equation}
\Delta\theta = 2\eta \left<(N_{up} - N/2)^2\right> = 2\eta Var(N_{up})=\eta N/2 >0 \qquad \text{(binomial distribution)}
\end{equation}

In general: 

\begin{equation}
\left<R\sum_t\frac{\partial ln\pi_\theta(a_t)}{\partial \theta}\right>=2N\pi_\theta(up)(1-\pi_\theta(up))
\end{equation}
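
The general result can be checked with a quick Monte Carlo simulation of the random walk. The number of steps, the number of sampled trajectories and the tested values of $\theta$ are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_traj = 10, 200_000   # time steps per walk, number of sampled walks


def estimate(theta):
    """Return (Monte Carlo estimate, analytic value) of <R sum_t d ln pi / d theta>."""
    p_up = 1.0 / (1.0 + np.exp(-theta))
    n_up = rng.binomial(N, p_up, size=n_traj)   # up-steps in each trajectory
    R = 2 * n_up - N                            # return = final position x(T)
    G = n_up - N * p_up                         # sum_t d ln pi(a_t) / d theta
    return np.mean(R * G), 2 * N * p_up * (1 - p_up)


for theta in (0.0, 1.0, -2.0):
    mc, exact = estimate(theta)
    print(f"theta = {theta:5.1f}   sampled = {mc:.3f}   analytic = {exact:.3f}")
```

The gradient is always positive, so $\theta$ keeps growing and the walker learns to always go up; the update slows down as $\pi_\theta(up)$ approaches 1.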

##### RL: Walker target (Keras)

Robot that can change its position: steps are 0 (stays) or 1 (moves). 
    
    There's a specific target-site -> reward is +1 for each time it remains on the target site; 
                                      return is the number of time steps on target. 

Stage of NN:

    output = action probabilities (softmax to guarantee the normalization)

> Policy $\pi_\theta(a|s)$

        a=0 (stay),     a=1 (move)
        
        input = s = are we on target? (0/1)

Steps to obtain *one trajectory*: 

    (1) Execute action, record new state
    (2) Apply NN to state thus obtain action probabilities
    (3) From probabilities, obtain action for next step
    (4) -> (1) repeat

Steps *for each trajectory*: 

    (1) Do one trajectory (batch of trajectories) [Previous - Steps to obtain one trajectory]
    (2) Obtain overall sum of rewards (=return) for each trajectory
    (3) Apply policy gradient training (enhance probabilities for all actions in a high-return trajectory)
    (4) -> (1) repeat
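
The two loops above can be sketched without Keras by replacing the network with a small table of logits, one row per observed state. Everything specific here is an assumption for illustration: the ring of `M` sites, the target position, the hyperparameters, and the use of the batch-mean return as a simple baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
M, target, T = 5, 3, 20          # hypothetical: ring of M sites, target site, steps
theta = np.zeros((2, 2))         # logits theta[s, a]; s = on target?, a = stay/move
eta, batch = 0.05, 50


def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def run_trajectory(pi):
    """Run one trajectory from site 0; return (state-action counts, return)."""
    x, ret, counts = 0, 0, np.zeros((2, 2))
    for _ in range(T):
        s = int(x == target)                 # observation: are we on target?
        a = int(rng.random() < pi[s, 1])     # sample action: 1 = move, 0 = stay
        counts[s, a] += 1
        x = (x + a) % M                      # execute action
        ret += int(x == target)              # reward +1 for each step on target
    return counts, ret


for it in range(200):
    pi = policy(theta)
    runs = [run_trajectory(pi) for _ in range(batch)]
    returns = np.array([ret for _, ret in runs], dtype=float)
    baseline = returns.mean()                # simple baseline (an assumption)
    grad = np.zeros_like(theta)
    for (counts, _), ret in zip(runs, returns):
        # sum_t d ln pi(a_t|s_t) / d theta[s, a] = counts - (visits of s) * pi
        G = counts - counts.sum(axis=1, keepdims=True) * pi
        grad += (ret - baseline) * G
    theta += eta * grad / batch

pi = policy(theta)
print("P(move | off target) =", pi[0, 1])
print("P(stay | on target)  =", pi[1, 0])
```

After training, the walker should mostly move while off target and mostly stay once it is on target.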
    
**Categorical cross-entropy trick:**

input = state, s

output = action, a, probabilities (softmax) - $\pi_\theta(a|s)$

                        a=0       a=1      a=2
                         o         o        o
                         |         |        |
                         ...(connections)...
                            |           |
                            o           o
                            input = state
 
Categorical cross-entropy: distribution from net ($\pi_\theta(a|s)$), desired distribution ($P(a)$).

\begin{equation}
C=-\sum_aP(a)ln\pi_\theta(a|s) 
\end{equation}

Set $P(a) = R$ for the action $a$ that was taken and $P(a)=0$ for all other actions $a$ (so $P$ is not a normalized distribution; the return acts as a weight).

\begin{equation}
\Delta\theta=-\eta\frac{\partial C}{\partial \theta}\qquad \text{implements policy gradient}
\end{equation}
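
A NumPy sketch of why this works: with $P(a)=R$ for the taken action, the gradient of $C$ with respect to the softmax inputs equals $-R$ times the gradient of $\ln\pi_\theta(a|s)$. The logits, actions and returns below are made-up data, and `build_desired_outputs` is a hypothetical helper name.

```python
import numpy as np


def softmax(z):
    z = np.exp(z - z.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def build_desired_outputs(actions, returns, n_actions):
    """desired[j, a] = R if action a was taken in state j, else 0."""
    desired = np.zeros((len(actions), n_actions))
    desired[np.arange(len(actions)), actions] = returns
    return desired


def cross_entropy(logits, desired):
    """C = -sum_j sum_a P_j(a) ln pi(a|s_j), summed over all encountered states."""
    return -np.sum(desired * np.log(softmax(logits)))


# Hypothetical data: 4 encountered states, 3 actions, logits from some network.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
actions = np.array([0, 2, 1, 2])
returns = np.array([3.0, 3.0, 1.0, 1.0])   # return of the run each state came from

desired_outputs = build_desired_outputs(actions, returns, n_actions=3)

# Policy-gradient expression: -R * d ln pi(a|s) / d logits = -R * (onehot(a) - pi).
onehot = np.eye(3)[actions]
analytic = -returns[:, None] * (onehot - softmax(logits))

# Finite-difference gradient of the cross-entropy C with respect to the logits.
eps, numeric = 1e-6, np.zeros_like(logits)
for idx in np.ndindex(logits.shape):
    d = np.zeros_like(logits)
    d[idx] = eps
    numeric[idx] = (cross_entropy(logits + d, desired_outputs)
                    - cross_entropy(logits - d, desired_outputs)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

Minimizing $C$ by gradient descent therefore moves the parameters in the same direction as maximizing the return with the policy gradient formula.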

- Encountered N states (during repeated runs)
- After setting categorical cross-entropy as a cost function, implement policy gradient

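
```python
# Policy gradient in Keras: one training step on the collected batch.
# `net` is the policy network compiled with a categorical cross-entropy loss.
net.train_on_batch(observed_inputs, desired_outputs)
```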

Where:

    observed_inputs -> array N x state-size
    desired_outputs -> array N x number of actions
    
The desired output is the distribution $P(a)$, which was set to $P(a) = R$ for the action $a$ that was taken and $P(a)=0$ for all other actions $a$:
>**desired_outputs(j,a)=R** for the state $j$ if action $a$ was taken during a run that gave overall return $R$. 

>**Walker target .py @ 22:40 - 44:00 https://www.fau.tv/clip/id/11621**

### Q-learning:

An alternative to the policy gradient approach; introduce a quality function $Q$ that predicts the future reward for a given state $s$ and a given action $a$.

**Deterministic policy:** just select the action with the largest Q!

- Value function -> function of the state (how valuable it is to be in a particular state, e.g. position).
- Quality function -> function of the state but also the action: Q(s,a). Predicts the future reward for a given state $s$ and a given action $a$:

\begin{equation}
Q(s_t,a_t)=E(R_t|s_t,a_t)
\end{equation}

*assuming future steps to follow the policy*. Discounted future reward: 
\begin{equation}
R_t=\sum_{t'=t}^Tr_{t'}\gamma^{t'-t}
\end{equation}

Reward at time step $t$: $r_t$, depends on state and action at time $t$

Discount factor: $0<\gamma\leq 1$, learning somewhat easier for smaller factor (short memory times)
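
A minimal sketch of how the discounted future reward $R_t$ could be computed for every time step of a recorded reward sequence. The function name and the example rewards are assumptions for illustration.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """R_t = sum_{t' >= t} gamma^(t'-t) r_t', computed backwards in one pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # R_t = r_t + gamma * R_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))
```

For these rewards and $\gamma=0.5$ the values are 0.3125, 0.625, 1.25, 0.5 and 1.0: a reward that lies two steps in the future only counts with weight $\gamma^2$.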

Value of state: $V(s)=max_aQ(s,a)$

Update rule: **Bellman equation**

\begin{equation}
Q(s_t,a_t)=E(r_t+\gamma max_aQ(s_{t+1},a)|s_t,a_t)
\end{equation}

*expected value of the next reward $r_t$ + the discount factor $\gamma$ times the quality function already evaluated for the next time step and maximized over all actions*.

We don't know $Q$ -> *update rule*
\begin{equation}
Q^{new}(s_t,a_t)=Q^{old}(s_t,a_t)+ \alpha(r_t+\gamma max_aQ^{old}(s_{t+1},a)-Q^{old}(s_t,a_t))
\end{equation}

where $\alpha<1$ is the small update factor and the bracket $(r_t+\gamma max_aQ^{old}(s_{t+1},a)-Q^{old}(s_t,a_t))$ is the correction; its expectation will be $0$ once we have converged to the correct $Q$. If we use a NN to calculate $Q$, it will be trained to yield the new value in each step.
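
For a small number of states and actions, $Q$ can simply be stored as a table. The following sketch applies one update step; the table size, the pre-filled values and the function name `q_update` are assumptions for illustration.

```python
import numpy as np


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the correction term."""
    correction = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * correction
    return correction


Q = np.zeros((3, 2))            # hypothetical table: 3 states, 2 actions
Q[1] = [0.5, 2.0]               # pretend we already know something about state 1
delta = q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(delta, Q[0, 1])
```

Only the entry for the visited pair $(s_t,a_t)$ changes; the information about good future states propagates backwards one step per update.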

Initially, $Q$ is arbitrary: bad to follow this $Q$ all the time -> introduce probability $\epsilon$ of random action (**exploration**, $\epsilon$-greedy):

                         Follow Q - exploitation
        Do something random (new) - exploration
        Reduce randomness later. 
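
A sketch of $\epsilon$-greedy Q-learning with a table, on a made-up problem: a chain of five sites where reaching the last site gives reward 1 and ends the episode. The environment, the schedule for $\epsilon$ and all hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2       # hypothetical chain: action 0 = left, 1 = right
alpha, gamma = 0.5, 0.9
Q = np.zeros((n_states, n_actions))


def step(s, a):
    """Move along the chain; reaching the last site gives reward 1 and ends the episode."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    done = s_next == n_states - 1
    return s_next, float(done), done


def epsilon_greedy(Q, s, epsilon):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))      # exploration: random action
    return int(np.argmax(Q[s]))                  # exploitation: follow Q


for episode in range(200):
    epsilon = max(0.05, 1.0 - episode / 100)     # reduce randomness later
    s = 0
    for _ in range(1000):                        # safety cap on episode length
        a = epsilon_greedy(Q, s, epsilon)
        s_next, r, done = step(s, a)
        # No future reward after the terminal state.
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

greedy_policy = np.argmax(Q[:-1], axis=1)        # best action in each non-terminal state
print(greedy_policy)
print(np.round(Q, 3))
```

After training, the greedy action in every non-terminal state should be "right", and the learned values of that action fall off by a factor $\gamma$ per site of distance from the goal.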