## Action spaces, Policies and the like

### Policies can be deterministic or stochastic

A deterministic policy is just a function of the (partially observed) state: 

$$ a_t = f(s_t) $$

A stochastic one is given by a distribution over actions, conditioned on the state:

$$ a_t \sim \pi_{\theta}(\cdot | s_t ) $$

Two common stochastic policies are **categorical** and **diagonal Gaussian**.

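- **Categorical policies** are used for discrete action spaces and are generally implemented by applying a *softmax* function to the final layer of the network that accepts the state, giving one probability per action
    - Actions can be sampled from these probabilities
    - The log likelihood of an action $a$ is $\log \pi_{\theta}(a|s) = \log [P_{\theta}(s)]_a$, where $P_{\theta}(s)$ is the vector of probabilities output by the network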
    
- **Diagonal Gaussian policies** use a neural network to map the state to a mean action $\mu$. The log standard deviations $\log\sigma$ on the diagonal of the covariance matrix are either a fixed vector that does not depend on the state, or the output of a neural network (which may share layers with the mean network)

     - The log likelihood of a $k$-dimensional action $a$ is 
     $$ \log\pi_{\theta}(a|s) = -\frac{1}{2}\left(\sum_{i=1}^{k}\left(\frac{(a_i - \mu_i)^2}{\sigma_{i}^2} + 2\log\sigma_i\right) + k \log 2\pi \right) $$
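As a rough illustration of the two policy types, here is a minimal sketch using only the Python standard library. Plain lists stand in for network outputs, and all function names (`softmax`, `categorical_log_prob`, `gaussian_log_prob`, and so on) are hypothetical helpers chosen for this example, not part of any particular RL library.

```python
import math
import random

def softmax(logits):
    """Turn raw final-layer outputs (logits) into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_sample(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def categorical_log_prob(probs, action):
    """log pi(a|s): log of the probability vector's entry for `action`."""
    return math.log(probs[action])

def gaussian_sample(mu, log_std, rng=random):
    """a_i = mu_i + sigma_i * z_i with z_i ~ N(0, 1), one draw per dimension."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_std)]

def gaussian_log_prob(action, mu, log_std):
    """Diagonal Gaussian log likelihood, following the formula term by term."""
    k = len(action)
    total = sum(
        (a - m) ** 2 / math.exp(2 * ls) + 2 * ls  # (a_i - mu_i)^2 / sigma_i^2 + 2 log sigma_i
        for a, m, ls in zip(action, mu, log_std)
    )
    return -0.5 * (total + k * math.log(2 * math.pi))

# Example usage with made-up numbers.
probs = softmax([2.0, 1.0, 0.1])
a_discrete = categorical_sample(probs)
a_continuous = gaussian_sample(mu=[0.0, 1.0], log_std=[-0.5, -0.5])
```

Working with log standard deviations rather than standard deviations means the parameters are unconstrained, since any real value of `log_std` gives a positive `sigma`.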

## Trajectories

(also called episodes or rollouts) are just sequences of states and actions, $\tau = (s_0, a_0, s_1, a_1, \dots)$

State transitions depend only on the most recent state and action, $s_t$ and $a_t$, and not on earlier history (Markov property). They can be deterministic or stochastic.
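To make this concrete, here is a small sketch of collecting a trajectory. The environment (`toy_step`) and the policy (`toy_policy`) are invented purely for illustration: the state is an integer, actions are -1 or +1, and the transition adds a little random noise.

```python
import random

def toy_step(state, action, rng):
    """Hypothetical stochastic transition: move by `action`, plus noise.

    The next state depends only on the current state and action.
    """
    noise = rng.choice([-1, 0, 1])
    return state + action + noise

def toy_policy(state, rng):
    """Hypothetical stochastic policy: usually step toward 0."""
    toward_zero = -1 if state > 0 else 1
    return toward_zero if rng.random() < 0.8 else -toward_zero

def rollout(policy, step, s0, T, rng):
    """Collect a trajectory as the list [s_0, a_0, s_1, a_1, ..., s_T]."""
    trajectory = [s0]
    state = s0
    for _ in range(T):
        action = policy(state, rng)
        state = step(state, action, rng)
        trajectory += [action, state]
    return trajectory

tau = rollout(toy_policy, toy_step, s0=3, T=5, rng=random.Random(0))
```

Note that `toy_step` never looks at anything except its current `state` and `action` arguments, which is what the Markov property requires.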

## Rewards and Returns

The reward function R:

$$ r_t = R(s_t, a_t, s_{t+1}) $$

is used to compute some form of cumulative reward (the return) over a trajectory.

### Finite horizon undiscounted return

is just the sum of rewards obtained in a certain number of steps

$$ R(\tau) = \sum_{t=0}^T r_t $$

### Infinite horizon discounted return 

sums over all rewards ever obtained, with each reward discounted by a factor $\gamma \in (0,1)$ for every step into the future at which it is obtained

$$ R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t $$
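A short sketch of both kinds of return, computed from a list of rewards. The function names are hypothetical, and since a program can only hold finitely many rewards, the discounted version here is a truncation of the infinite sum to the rewards actually recorded.

```python
def finite_horizon_return(rewards):
    """Undiscounted return: the plain sum of the recorded rewards."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return sum_t gamma^t * r_t over the recorded rewards.

    Accumulating from the last reward backwards avoids computing
    explicit powers of gamma.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 0.0, 2.0]
undiscounted = finite_horizon_return(rewards)        # 1 + 0 + 2 = 3.0
discounted = discounted_return(rewards, gamma=0.5)   # 1 + 0.5*0 + 0.25*2 = 1.5
```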


## The RL problem

We want to find a policy that maximizes expected return.

The probability of a *T*-step trajectory is

$$ P(\tau|\pi) = \rho_0 (s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t) \pi(a_t|s_t) $$

where $\rho_0$ is the start-state distribution.

The corresponding expected return is:

$$ J(\pi) = \int_{\tau} P(\tau|\pi)R(\tau) = E_{\tau\sim \pi}[R(\tau)] $$
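In practice the expectation is usually approximated by averaging the returns of sampled trajectories. The sketch below does this for an invented toy task, chosen because its exact answer is known: at each of `T` steps the policy picks the rewarded action with probability `p_good`, so the expected return is `T * p_good`. All names are hypothetical.

```python
import random

def sample_return(p_good, T, rng):
    """Return of one sampled trajectory in a hypothetical T-step task.

    At each step the policy picks the rewarded action with probability
    `p_good` (reward 1) and the other action otherwise (reward 0).
    """
    return sum(1.0 if rng.random() < p_good else 0.0 for _ in range(T))

def estimate_J(p_good, T, num_trajectories, rng):
    """Monte Carlo estimate of J(pi): average return over sampled trajectories."""
    returns = [sample_return(p_good, T, rng) for _ in range(num_trajectories)]
    return sum(returns) / num_trajectories

rng = random.Random(0)
J_hat = estimate_J(p_good=0.7, T=10, num_trajectories=5000, rng=rng)
# The exact value for this toy task is T * p_good = 7.0; J_hat should be close.
```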

The optimal policy is expressed as 

$$ \pi^{*} = \arg\max_\pi J(\pi) $$

## Value Functions

represent the expected return of a state or a state-action pair under a given policy. 

1. **On-Policy Value Function** gives the expected return if you start in state *s* and always act according to policy $\pi$

$$ V^{\pi}(s) = E_{\tau \sim \pi} [R(\tau) \mid s_0 = s] $$

2. **On-Policy Action-Value Function** gives the expected return if you start in state *s*, take an arbitrary action *a* (which need not come from the policy), and then always act according to $\pi$

$$ Q^{\pi}(s,a) = E_{\tau \sim \pi} [R(\tau) \mid s_0 = s, a_0 = a] $$

3. **Optimal Value Function** gives the expected return if you start in state *s* and always act according to the *optimal policy*

$$ V^{*}(s) = \max_{\pi} E_{\tau \sim \pi} [R(\tau) \mid s_0 = s] $$

4. **Optimal Action-Value Function** gives the expected return if you start in state *s*, take an arbitrary action *a*, and then always act according to the optimal policy

$$ Q^{*}(s,a) = \max_{\pi} E_{\tau \sim \pi} [R(\tau) \mid s_0 = s, a_0 = a] $$

### Note:

Without explicit time-dependence in a value function, we mean the expected infinite-horizon discounted return. A value function for the finite-horizon undiscounted return would need time as an argument, because the return still obtainable from a state depends on how many steps remain before the horizon ends.

### Optimal action

If we have the optimal action-value function $Q^{*}$, the optimal action is

$$ a^{*}(s) = \arg\max_{a} Q^{*}(s,a) $$
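For a small discrete problem where the optimal action-value function is stored as a table, picking the optimal action is a single arg max. The table below is a sketch with made-up values, and `greedy_action` is a hypothetical helper name.

```python
def greedy_action(Q, state):
    """Return arg max_a Q(state, a) for a table of the form {state: {action: value}}."""
    action_values = Q[state]
    return max(action_values, key=action_values.get)

# Made-up optimal action values for a two-state, two-action problem.
Q_star = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 0.9, "right": 0.1},
}

best_in_s0 = greedy_action(Q_star, "s0")  # "right"
best_in_s1 = greedy_action(Q_star, "s1")  # "left"
```

If several actions tie for the maximum, `max` returns the first one it encounters; any of the tied actions would be equally good.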

## The Bellman Equations

simply state that the value of your starting point is the reward you expect to get from being there, plus the (discounted) value of wherever you land next.
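One way to see this self-consistency is to apply it repeatedly until the values stop changing. The sketch below assumes the on-policy form $V^{\pi}(s) = E[r + \gamma V^{\pi}(s')]$ and a tiny invented two-state problem in which the policy has already been folded into the transition probabilities. The function name and the numbers are hypothetical.

```python
def evaluate_policy(transitions, gamma, sweeps=200):
    """Iterate V(s) <- E[r + gamma * V(s')] until it (approximately) settles.

    `transitions[s]` lists (probability, next_state, reward) triples for the
    state-to-state dynamics obtained by following a fixed policy.
    """
    V = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        V = {
            s: sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for s, outcomes in transitions.items()
        }
    return V

# Invented dynamics: A always moves to B with reward 1;
# B moves to A or stays in B with equal probability and reward 0.
transitions = {
    "A": [(1.0, "B", 1.0)],
    "B": [(0.5, "A", 0.0), (0.5, "B", 0.0)],
}
V = evaluate_policy(transitions, gamma=0.5)
# Solving the two equations by hand gives V(A) = 1.2 and V(B) = 0.4.
```

With `gamma = 0.5` each sweep halves the remaining error, so 200 sweeps is far more than needed here; the count is only a convenient fixed budget for the sketch.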

## Advantage functions

When choosing an action we can calculate the **relative advantage** of that action. It describes how much better it is to take a specific action *a* in state *s* than to take an action sampled from $\pi(\cdot|s)$, assuming we act according to $\pi$ afterwards in both cases.

$$ A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s) $$

It is a big part of policy gradient methods.
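A small worked example for a single state with two actions, using made-up action values and a made-up policy. It relies on the identity $V^{\pi}(s) = \sum_a \pi(a|s) Q^{\pi}(s,a)$; the helper names are hypothetical.

```python
def state_value(Q_s, pi_s):
    """V(s) = sum_a pi(a|s) * Q(s, a) for one state."""
    return sum(pi_s[a] * Q_s[a] for a in Q_s)

def advantages(Q_s, pi_s):
    """A(s, a) = Q(s, a) - V(s) for every action available in the state."""
    v = state_value(Q_s, pi_s)
    return {a: q - v for a, q in Q_s.items()}

Q_s = {"left": 1.0, "right": 3.0}    # made-up action values for one state
pi_s = {"left": 0.5, "right": 0.5}   # made-up policy probabilities
A_s = advantages(Q_s, pi_s)          # V(s) = 2.0, so left: -1.0, right: 1.0
```

Averaged over the policy's own action probabilities, the advantages sum to zero, which is why the advantage measures how an action compares with the policy's typical behaviour rather than how good the state is overall.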