# Reinforcement Learning - Summary

Lecture, summer 2024, by *Prof. Matthias Niepert* at the University of Stuttgart


## Table of Contents
1. [Introduction](#introduction)
2. [Markov Decision Processes](#markov-decision-processes)


## Introduction

### RL - Problem
* General framework for decision making
* Agent -> max reward (long-term)

<img src="slides/images/RL_Context.png" width="40%">
<img src="slides/images/RL_Context_ML.png" width="40%">

#### RL - Cycle
<img src="slides/images/RL_Cycle.png" width="80%">

with state $S_t$, action $A_t$, reward $R_t$ and history $H_t$

$$ \begin{align} H_t = S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{t-1}, A_{t-1}, R_t \end{align} $$


#### Markov Property

A state $S_t$ is Markov if and only if
$$ \begin{align} \text{Pr}\{S_{t+1} | S_t\} = \text{Pr}\{ S_{t+1} | S_1, \dots, S_t \} \end{align} $$

i.e., the future is independent of the past given the present.

#### Markov - Types

* Agent observes Markov state - Markov Decision Process (MDP)

* Agent observes the state only indirectly $\to$ Partially Observable MDP (POMDP)


### RL - Agent

* Policy $\pi$ - agent's behavior (det. or stoch.)
* Value function $V(s)$ - expected return from state $s$
* Action-value function $Q(s,a)$ - expected return from state $s$ and action $a$
* Model - agent's representation of the environment

The true model aka transition function is given by
$$ \begin{align} p(s',r|s,a) = \text{Pr}\{ S_{t+1} = s', R_{t+1} = r| S_t = s, A_t = a\} \end{align} $$

#### RL - Flavors

```mermaid
mindmap
  root((RL))
    model-based
      id[test]
    model-free
      value-based
      policy-based
      actor-critic
    imitation learning
```
Exploitation vs. exploration: the reward of each action follows a probability distribution, so the agent aims to maximize the expected reward.

#### Bandits - Tabular solution methods

If state and action spaces are small enough
* find exact solution
  * optimal V
  * optimal $\pi$

Code example can be found [here](exercises/exercise_01/ex01-bandits.py)

#### Greedy and Epsilon-Greedy

* $\varepsilon$ is the probability of taking another (random or softmax) action instead of the action currently believed to be best; $\varepsilon = 0$ gives the greedy policy

$$ \begin{align} A_t = \text{argmax}_a Q_t(a) \end{align} $$

* Fine-tuning $\varepsilon$: a suitable value depends on the setting:
  * reward variance is small, e.g. zero
  * reward variance is large
  * task is non-stationary
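
The $\varepsilon$-greedy rule above can be sketched in a few lines. This is only an illustration, assuming uniform random exploration; the function name `epsilon_greedy` and the estimates in `q_estimates` are made up for the example and do not come from the exercise code.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit: argmax_a Q_t(a)

rng = np.random.default_rng(0)
q_estimates = np.array([0.1, 0.5, 0.2])  # hypothetical estimates Q_t(a)
greedy_action = epsilon_greedy(q_estimates, epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the function always returns the index of the largest estimate, which is the purely greedy case.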

$$ \begin{align} \pi_t(a) = \frac{e^{\frac{Q_t(a)}{\tau}}}{\sum_{a' = 1}^k e^{\frac{Q_t(a')}{\tau}}} \end{align} $$

Gibbs or Boltzmann distribution with temperature $\tau$: for $\tau \to 0$ the policy becomes greedy, for $\tau \to \infty$ it becomes uniformly random.
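
A minimal sketch of the softmax (Boltzmann) policy, showing how the temperature changes the action probabilities. The helper name `softmax_policy` and the example estimates are assumptions for illustration. Subtracting the maximum preference is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax_policy(q_values, tau):
    """Return pi_t(a) proportional to exp(Q_t(a) / tau)."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()          # stability: avoids overflow in exp
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

q_estimates = [0.1, 0.5, 0.2]                   # hypothetical estimates Q_t(a)
cold = softmax_policy(q_estimates, tau=0.01)    # low temperature: nearly greedy
hot = softmax_policy(q_estimates, tau=1000.0)   # high temperature: nearly uniform
```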

Incremental equation
$$ \begin{align} Q_{n+1} = Q_n + \underbrace{\frac{1}{n}}_{\text{generally } \alpha} [R_n - Q_n] \end{align} $$
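
As a quick check of the incremental equation, the sketch below applies it to a short, made-up reward sequence. With step size $\frac{1}{n}$ the result equals the sample mean. The function name `incremental_mean` is hypothetical.

```python
def incremental_mean(rewards):
    """Apply Q_{n+1} = Q_n + (1/n) * (R_n - Q_n) to a sequence of rewards."""
    q = 0.0
    for n, reward in enumerate(rewards, start=1):
        q += (reward - q) / n
    return q

observed = [1.0, 0.0, 2.0, 1.0]   # hypothetical rewards for one action
estimate = incremental_mean(observed)
```

Replacing `1 / n` with a constant $\alpha$ weights recent rewards more strongly, which is the usual choice for non-stationary tasks.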

## Markov Decision Processes

### Definitions

<img src="slides/images/MDP_Defi.png" width="80%">


#### Goal

Cumulative reward with discount factor

$$ \begin{align} G_t &= \sum_{i=0}^\infty \gamma^i R_{t+i+1} \\
&= R_{t+1} + \gamma G_{t+1} \end{align} $$

with $\gamma \in [0,1]$
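
The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ gives a simple way to compute all returns of a finite episode in one backward pass. The following is a sketch under the assumption that the episode terminates, so the return after the last reward is zero; the function name is made up.

```python
def discounted_returns(rewards, gamma):
    """rewards[k] holds R_{k+1}; returns the list of G_k for every step k."""
    g = 0.0
    returns = [0.0] * len(rewards)
    for k in reversed(range(len(rewards))):
        g = rewards[k] + gamma * g    # G_k = R_{k+1} + gamma * G_{k+1}
        returns[k] = g
    return returns

returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
```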

#### Transition Function

$$ \begin{align} p(s'|s,a) &= \text{Pr}\{ S_{t+1} = s'|S_t=s, A_t = a \} \\ &= \sum_{r \in \mathcal{R}} p(s',r|s,a) \end{align} $$

#### Reward function

immediate reward

$$ \begin{align} r(s,a,s') &= \mathbb{E}\left[R_{t+1}|S_t = s, A_t = a, S_{t+1} = s'\right] \\ &= \sum_{r \in \mathcal{R}} r \frac{p(s',r|s,a)}{p(s'|s,a)} \end{align} $$

Note - typically we assume there is a single reward for each $s,a,s'$ and drop the $\mathbb{E}$.
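
To make the two definitions concrete, the sketch below derives $p(s'|s,a)$ and $r(s,a,s')$ from a joint distribution $p(s',r|s,a)$ for one fixed pair $(s,a)$. The probabilities, state names, and function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical dynamics p(s', r | s, a) for one fixed (s, a):
# keys are (next_state, reward), values are probabilities.
dynamics = {("high", 1.0): 0.6, ("high", 0.0): 0.2, ("low", -3.0): 0.2}

def state_transition_probs(dynamics):
    """p(s'|s,a): sum the joint probabilities over all rewards."""
    probs = defaultdict(float)
    for (s_next, _reward), prob in dynamics.items():
        probs[s_next] += prob
    return dict(probs)

def expected_reward(dynamics, s_next):
    """r(s,a,s'): reward expectation conditioned on landing in s_next."""
    p_next = state_transition_probs(dynamics)[s_next]
    weighted = sum(r * p for (s2, r), p in dynamics.items() if s2 == s_next)
    return weighted / p_next

p_next = state_transition_probs(dynamics)
r_high = expected_reward(dynamics, "high")
```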

##### Collection rewards

$$ \begin{align} r(s,a) &= \sum_{r \in \mathcal{R}} r \sum_{s'\in \mathcal{S}} p(s',r|s,a) \\ r(s) &= ...
 \end{align} $$

#### Transition Graph

```mermaid
graph LR
    id0((low)) --> id01[recharge]
    id01 --$$1,\:0$$--> id1((high))


    id0 --> id02[search]
    id02 --$$\beta, r_{search}$$--> id0
    id02 --$$1-\beta, -3$$--> id1

    id0 --> id03[wait]
    id03 --$$1,\: r_{wait}$$--> id0


    id1 --> id11[search]
    id11 --$$\alpha, r_{search}$$--> id1
    id11 --$$1-\alpha, \: r_{search}$$--> id0

    id1 --> id12[wait]
    id12 --$$1, r_{wait}$$--> id1

```
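
A transition graph like this one can be stored as a nested dictionary that maps each state and action to a list of `(probability, next_state, reward)` outcomes. The sketch below assumes that waiting yields $r_{wait}$ in both states and that edges leaving a state carry no reward of their own. The numeric values for $\alpha$, $\beta$, $r_{search}$ and $r_{wait}$ are placeholders, not values from the lecture.

```python
ALPHA, BETA = 0.8, 0.6          # hypothetical transition parameters
R_SEARCH, R_WAIT = 2.0, 1.0     # hypothetical rewards

# P[state][action] -> list of (probability, next_state, reward)
P = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait": [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
        "wait": [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}

def expected_reward(state, action):
    """r(s, a): probability-weighted reward over all outcomes."""
    return sum(prob * reward for prob, _next, reward in P[state][action])

def is_valid(transitions):
    """Check that the outcome probabilities of every (s, a) sum to one."""
    return all(
        abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-12
        for actions in transitions.values()
        for outcomes in actions.values()
    )
```

The same `(probability, next_state, reward)` layout makes it easy to evaluate the sums over $s'$ and $r$ that appear in the Bellman equations.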

#### Bellman Equation

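Recursive equation relating the value of a state to the values of its successor states; the optimal value function is a fixed point of the optimality form.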

##### Value Function
$$ \begin{align} v_\pi(s) &= \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_\pi(s')\right] \qquad \forall \: s \in \mathcal{S} \\
v_*(s) &= \max_a \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_*(s')\right] 
\end{align}$$
##### Action-Value Function
$$\begin{align}
q_\pi(s,a) &= \sum_{s',r} p(s',r|s,a)\left[r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s',a')\right] \\
q_*(s,a) &= \sum_{s',r} p(s',r|s,a)\left[r + \gamma 
\max_{a'} q_*(s',a')\right] \end{align}$$

##### Relation to each other
$$ \begin{align}
q_\pi(s,a) &= \sum_{s',r} p(s',r|s,a)\left[r + \gamma v_\pi(s') \right]
\end{align}$$

since $v_\pi(s) = \sum_{a} \pi(a|s) q_\pi(s,a)$
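
The Bellman expectation equations can be turned into an iterative procedure: repeatedly compute $q_\pi$ from the current $v_\pi$ and then average over the policy. The sketch below assumes a tabular MDP given as arrays and uses the expected reward $r(s,a)$ in place of the sum over $r$. The array layout, the function name, and the tiny example MDP are all assumptions for illustration.

```python
import numpy as np

def policy_evaluation(p, r, pi, gamma, tol=1e-10):
    """Iterate the Bellman expectation backup until v stops changing.

    p[s, a, s2] = p(s2 | s, a), r[s, a] = r(s, a), pi[s, a] = pi(a | s).
    """
    v = np.zeros(p.shape[0])
    while True:
        q = r + gamma * p @ v            # q(s, a) from the current v
        v_new = (pi * q).sum(axis=1)     # v(s) = sum_a pi(a|s) q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, r + gamma * p @ v_new
        v = v_new

# Hypothetical 2-state, 2-action MDP: action k moves deterministically to state k.
p = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)                # uniform random policy
v_pi, q_pi = policy_evaluation(p, r, pi, gamma=0.9)
```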


#### Bellman: Matrix Form

$$ \begin{align} v = (I - \gamma\mathcal{P})^{-1}\mathcal{R} \end{align}$$

where $\mathcal{P}$ is the transition matrix 
$$ \begin{align}\mathcal{P} = \begin{pmatrix} p_{11} & \dots & p_{1n}\\
\vdots & \ddots & \vdots \\
p_{n1} & \dots & p_{nn} \end{pmatrix} \end{align} $$
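and $\mathcal{R}$ the vector of expected immediate rewards. The closed form follows from solving $v = \mathcal{R} + \gamma \mathcal{P} v$ for $v$.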
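
For small state spaces the matrix form can be evaluated directly. The sketch below uses a made-up three-state chain under a fixed policy. The names `P_pi` and `R_pi` are assumptions that stand for the transition matrix and reward vector induced by that policy. For $\gamma < 1$ the matrix $I - \gamma\mathcal{P}$ is invertible.

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state chain under a fixed policy; each row of P_pi sums to 1.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
R_pi = np.array([1.0, 2.0, 0.0])   # expected immediate reward per state

# Solving the linear system avoids forming the inverse explicitly.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
```

The direct solve costs roughly $O(n^3)$ in the number of states, which is why iterative methods are preferred for larger problems.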


#### Optimal Policy

$$ \begin{align} v_*(s) &= \max_\pi v_\pi(s) \quad \forall \: s \in \mathcal{S} \\
q_*(s,a) &= \max_\pi q_\pi(s,a) = \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \right] \end{align} $$





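```python
# Bellman optimality equation solved by value iteration for the slippery grid world (FrozenLake)
import numpy as np
import gym


def value_iteration(env, gamma=0.9, theta=1e-8):
    """Return a one-hot greedy policy and the converged state values."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    # P[s][a] is a list of (probability, next_state, reward, done) tuples
    P = env.unwrapped.P

    def action_values(s, V):
        """One-step lookahead: q(s, a) for every action a."""
        return np.array([
            sum(p * (r + gamma * V[s_]) for p, s_, r, _ in P[s][a])
            for a in range(n_actions)
        ])

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            V[s] = action_values(s, V).max()  # Bellman optimality backup
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # stop once the largest update is below theta
            break

    # Extract the greedy policy with respect to the converged values
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        policy[s, np.argmax(action_values(s, V))] = 1
    return policy, V


# 'FrozenLake-v0' as in the original notebook; newer gym versions may only register 'FrozenLake-v1'
env = gym.make('FrozenLake-v0')
policy, V = value_iteration(env)
```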


```python
# Visualize the grid and the policy
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.table import Table


def plot_policy(policy):
    """Draw the greedy action index of every grid cell as a table."""
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])

    nrows, ncols = policy.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    # Add cells
    for (i, j), val in np.ndenumerate(policy):
        tb.add_cell(i, j, width, height, text=str(val), loc='center', facecolor='white')

    # Row and column labels
    for i in range(nrows):
        tb.add_cell(i, -1, width, height, text=str(i), loc='right', edgecolor='none')
    for j in range(ncols):
        tb.add_cell(-1, j, width, height / 2, text=str(j), loc='center', edgecolor='none')

    ax.add_table(tb)
    plt.show()


# The 16 states form a 4x4 grid, so reshape to (side, side) rather than (16, 16)
side = int(np.sqrt(env.observation_space.n))
plot_policy(np.argmax(policy, axis=1).reshape(side, side))
```