State
1) Understand A2C and A3C. https://danieltakeshi.github.io/2018/06/28/a2c-a3c/#fn:unclear
2) Implement NTM in isolation.
3) Put NTM and A3C together.

https://arxiv.org/pdf/1803.10760.pdf
Memory is not enough. It needs to be stored in the right format.

There were already RL models with memory. However, those models handled long delays poorly.

## Models
### RL-LSTM
It stops performing well when the amount of data to remember is big. 

#### Steps
1) $o_t = concat((env\_data, r_{t-1}), a_{t-1})$

2) $e_t = transform(o_t)$

3) $h_t = LSTM(e_t; h_{t-1})$

4) $Pr(a_t) = transform(concat(h_t, e_t))$ I think here it's optional to take $e_t$ into account.

### RL-MEM
#### Steps
Steps 1, 2, and 4 are the same as RL-LSTM. Assume we have $n$ read heads.
3) 
$$
c^{ij}_t = similarity(k^i_t, M^j_{t-1}) \\
b^{i:}_t = softargmax(c^{i:}_t) \\
a^i_t = M^T_{t-1}b^{i:}_t \\
read(k_t, M_{t-1}) = [a^1_t, a^2_t, ..., a^n_t] \\ 
$$

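Here $similarity(k^i_t, M^j_{t-1})$ is the cosine similarity between key $k^i_t$ and memory row $M^j_{t-1}.$ In short, $read(k_t, M_{t-1})$ weights the memory rows by $softargmax(similarity(k_t, M_{t-1})).$ The rest of the step:

$m_t = read(k_t, M_{t-1})$

$h_t = LSTM(e_t, m_t; h_{t-1})$

$M_t = write(M_{t-1}, affine(h_t))$ We always write in an empty memory slot.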
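
The read in step 3 can be sketched in NumPy as below. This is only a minimal sketch: it assumes the similarity is cosine similarity, and the memory size, the key values, and the function names `softargmax` and `read` are hypothetical choices for illustration, not something taken from the paper.

```python
import numpy as np

def softargmax(x):
    """Numerically stable softmax along the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def read(keys, memory, eps=1e-8):
    """Content-based read.

    keys:   (n_heads, width) array, one key k^i per read head
    memory: (n_slots, width) array, one row M^j per memory slot
    Returns the concatenated reads [a^1, ..., a^n] and the weights b.
    """
    # c[i, j] = cosine similarity between key i and memory row j
    key_norm = np.linalg.norm(keys, axis=1, keepdims=True)
    mem_norm = np.linalg.norm(memory, axis=1, keepdims=True)
    c = (keys @ memory.T) / (key_norm * mem_norm.T + eps)
    b = softargmax(c)    # each row b[i, :] sums to 1 over the slots
    a = b @ memory       # a[i] = M^T b[i, :]
    return a.reshape(-1), b

rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 4))    # 5 slots of width 4 (hypothetical sizes)
keys = memory[[1, 3]] * 2.0         # two heads whose keys point at slots 1 and 3
m_t, weights = read(keys, memory)
```

Each head puts its largest weight on the slot whose direction matches its key, and `m_t` is what would be fed to the LSTM together with $e_t.$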

### MERLIN
We pass the state through a compression bottleneck to generate a lower-dimensional representation. The system is asked to recover the reward for a given state, but the state can be stored as it desires.

It'd be interesting to think about the problem in terms of building world models.

# RL
## Initial definitions
Actions: the things the agent can do.
State: all the information about the place where the agent is.
Observation: information (potentially partial) about the state where the agent is.
Policy: map from observation to actions
* deterministic: you use an MLP to map from observations to a continuous space, which is the output of the MLP. {what can we do to map to a discrete space}
* stochastic
    * multinomial: we use an MLP and interpret its output as the logits (unnormalized log probabilities) of a multinomial distribution over actions
    * diagonal gaussian: we use an MLP to generate the mean of the gaussian. Then, two possibilities for the std
        * fixed std
        * we use another MLP to generate the std
        Finally, we generate a sample by $a = \mu + \sigma \odot z$ with $z \sim N(0, I).$ The noise $z$ is what makes the policy stochastic. Without it, the action would be a deterministic function of the state.
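
The two stochastic policies above can be sketched as follows. The two-layer `mlp`, the layer sizes, and the random weights are all hypothetical stand-ins for a trained network. The only point is how the network output is turned into a sampled action.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(obs, w1, w2):
    """Tiny two-layer MLP with a tanh hidden layer."""
    return np.tanh(obs @ w1) @ w2

def sample_categorical(logits, rng):
    """Sample a discrete action from unnormalized log-probabilities."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p), p

def sample_diag_gaussian(mu, log_std, rng):
    """Sample a continuous action as a = mu + sigma * z with z ~ N(0, I)."""
    z = rng.standard_normal(mu.shape)
    return mu + np.exp(log_std) * z

obs_dim, hidden, n_actions, act_dim = 3, 8, 4, 2    # hypothetical sizes
obs = rng.normal(size=obs_dim)

# Stochastic policy over a discrete action space
w1 = rng.normal(size=(obs_dim, hidden))
w2 = rng.normal(size=(hidden, n_actions))
action, probs = sample_categorical(mlp(obs, w1, w2), rng)

# Diagonal gaussian policy with a fixed log std
v1 = rng.normal(size=(obs_dim, hidden))
v2 = rng.normal(size=(hidden, act_dim))
mu = mlp(obs, v1, v2)
log_std = np.full(act_dim, -0.5)
cont_action = sample_diag_gaussian(mu, log_std, rng)
```

Working with the log of the std instead of the std itself is a design choice in this sketch: it keeps $\sigma$ positive without any constraint on the parameters.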

Trajectories/episodes/rollouts: the sequence of states and actions that happened. We assume that $s_t$ only depends on $s_{t-1}$ and $a_{t-1}.$

Reward: $R(s_t, a_t, s_{t+1}).$ Notice that if the policy and environment are deterministic, $R(s_t) = R(s_t, a_t, s_{t+1})$ given a fixed agent. As the reward is a measure of what happened and not of the things that we could have done, we only care about the values sampled from the probability distributions and not about the distributions themselves. That's why we need $a_t$ and $s_{t+1}$ as arguments for the reward function.

Return: the rewards accumulated along a trajectory. Two common choices:
* finite-horizon undiscounted return: we sum all the rewards until time $T$
* infinite-horizon discounted return: we sum infinitely many rewards, each weighted by $\gamma^t$ with $\gamma \in (0, 1),$ so rewards further in the future have less value.

Finite-horizon discounted isn't wrong per se, but it doesn't seem to be used. Infinite-horizon undiscounted: it seems difficult to estimate and it seems that in most cases it would be infinite {continue thinking.}
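
A small sketch of the two ways of accumulating rewards, with made-up reward sequences. The long constant-reward sequence is only there to illustrate the worry about the undiscounted infinite-horizon case: its undiscounted sum keeps growing with the length, while the discounted sum approaches $1/(1-\gamma).$

```python
def undiscounted_return(rewards):
    """Sum of all the rewards until the end of the episode."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t, accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]                       # hypothetical episode
total = undiscounted_return(rewards)            # 1 + 0 + 2 = 3.0
discounted = discounted_return(rewards, 0.5)    # 1 + 0.5*0 + 0.25*2 = 1.5

# A constant reward of 1 for many steps: the undiscounted sum equals the
# number of steps, the discounted sum stays close to 1 / (1 - 0.9) = 10.
long_rewards = [1.0] * 1000
long_total = undiscounted_return(long_rewards)
long_discounted = discounted_return(long_rewards, 0.9)
```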

## Central (park) problem
$$
P(\tau|\pi) = \rho_0(s_0) \prod^{T-1}_{t=0}P(s_{t+1}|s_t, a_t) \pi(a_t|s_t) \\
R(\tau) = \sum_{t=0}^{T-1} R(s_t, a_t, s_{t+1}) \\
J(\pi) = \underset{\tau \sim \pi}{E} [R(\tau)] = \int_\tau P(\tau|\pi)R(\tau) d\tau \\
\pi^* = arg \underset{\pi}{max} J(\pi) \\
$$

{the space of $\tau$ seems discrete. What does that mean when we integrate?}
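
When states and actions are discrete and the horizon is finite, the integral over $\tau$ becomes a plain sum over all possible trajectories. The sketch below enumerates them for a tiny MDP. Every number in it (2 states, 2 actions, $T = 2,$ and the tables `rho0`, `P`, `pi`, `R`) is made up for illustration.

```python
import itertools
import numpy as np

# Hypothetical MDP: 2 states, 2 actions, horizon T = 2.
rho0 = np.array([0.6, 0.4])                 # rho_0(s_0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],                  # pi[s, a] = pi(a | s)
               [0.4, 0.6]])
R = np.array([[[0.0, 1.0], [0.0, 2.0]],     # R[s, a, s'] = R(s, a, s')
              [[1.0, 0.0], [0.0, 3.0]]])
T = 2

def trajectory_prob_and_return(states, actions):
    """P(tau | pi) and R(tau) for tau = (s_0, a_0, s_1, ..., a_{T-1}, s_T)."""
    prob, ret = rho0[states[0]], 0.0
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= pi[s, a] * P[s, a, s_next]
        ret += R[s, a, s_next]
    return prob, ret

# J(pi) as a sum over every possible trajectory
total_prob, J = 0.0, 0.0
for states in itertools.product(range(2), repeat=T + 1):
    for actions in itertools.product(range(2), repeat=T):
        p, r = trajectory_prob_and_return(states, actions)
        total_prob += p
        J += p * r
```

The trajectory probabilities add up to 1, which is a quick sanity check that $P(\tau|\pi)$ is a proper distribution. The enumeration grows exponentially with $T,$ so this only works as an illustration.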

## Value functions
On-policy value function
$$V^\pi(s) = \underset{\tau \sim \pi}{E} [R(\tau) | s_0 = s]$$

On-policy action-value function
$$Q^\pi(s, a) = \underset{\tau \sim \pi}{E} [R(\tau) | s_0 = s, a_0 = a]$$

Optimal value function
$$V^*(s) = \underset{\pi}{max}\underset{\tau \sim \pi}{E} [R(\tau) | s_0 = s]$$

Optimal action-value function
$$Q^*(s, a) = \underset{\pi}{max}\underset{\tau \sim \pi}{E} [R(\tau) | s_0 = s, a_0 = a]$$

### Identities
$$V^\pi(s) = \underset{a \sim \pi}{E} Q^\pi(s, a)$$ {prove it}
$$V^*(s) = \underset{a}{max} Q^*(s, a)$$ {prove it}
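
A possible proof sketch for both identities, assuming a discrete action space (for continuous actions the sum becomes an integral). The first is the law of total expectation applied to the first action:

$$
V^\pi(s) = \underset{\tau \sim \pi}{E} [R(\tau) | s_0 = s] = \sum_a \pi(a|s) \underset{\tau \sim \pi}{E} [R(\tau) | s_0 = s, a_0 = a] = \underset{a \sim \pi(.|s)}{E} Q^\pi(s, a)
$$

For the second, use the first identity, then $Q^\pi \le Q^*,$ and then the fact that a weighted average is never larger than the largest term:

$$
V^*(s) = \underset{\pi}{max} \sum_a \pi(a|s) Q^\pi(s, a) \le \underset{\pi}{max} \sum_a \pi(a|s) Q^*(s, a) \le \underset{a}{max} Q^*(s, a)
$$

Assuming there is a policy that attains $Q^*,$ equality holds: take $arg \underset{a}{max} Q^*(s, a)$ in $s$ and act optimally afterwards.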

## Optimal policies
$$a^*(s) = arg \underset{a}{max} Q^*(s, a)$$

## Bellman equations
The idea is that the value of a state (or of a (state, action) pair) is the reward we get from being there plus the discounted value of wherever we land next.

{emphasize this concept, as it is very imp}

$$
V^\pi(s) = \underset{a \sim \pi(.|s)}{E}\underset{s' \sim P(.|s, a)}{E} [R(s, a) + \gamma V^\pi(s')]\\
Q^\pi(s, a) = \underset{s' \sim P(.|s, a)}{E} [R(s, a) + \gamma \underset{a' \sim \pi(.|s')}{E} Q^\pi(s', a')] \\
V^*(s) = \underset{a}{max}\underset{s' \sim P(.|s, a)}{E} [R(s, a) + \gamma V^*(s')] \\
Q^*(s, a) = \underset{s' \sim P(.|s, a)}{E} [R(s, a) + \gamma \underset{a'}{max} Q^*(s', a')] \\
$$
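
One way to get a feel for these equations is to iterate them on a small tabular MDP until the values stop changing. The sketch below does that. The MDP is random and hypothetical (3 states, 2 actions, $\gamma = 0.9$), and the reward is taken to be $R(s, a)$ as in the equations above.

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)      # P[s, a, s'] = P(s' | s, a)
R = rng.random((n_s, n_a))             # R[s, a]
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)    # pi[s, a] = pi(a | s)

def q_from_v(V):
    """Q(s, a) = R(s, a) + gamma * E_{s'}[V(s')]."""
    return R + gamma * P @ V

def evaluate_policy(pi, iters=1000):
    """Iterate the Bellman equation for V^pi until it (approximately) converges."""
    V = np.zeros(n_s)
    for _ in range(iters):
        V = (pi * q_from_v(V)).sum(axis=1)    # expectation over a ~ pi
    return V

def value_iteration(iters=1000):
    """Iterate the Bellman equation for V^* (max over a instead of expectation)."""
    V = np.zeros(n_s)
    for _ in range(iters):
        V = q_from_v(V).max(axis=1)
    return V

V_pi = evaluate_policy(pi)
Q_pi = q_from_v(V_pi)
V_star = value_iteration()
Q_star = q_from_v(V_star)
```

At convergence, `V_pi` equals the $\pi$-weighted average of `Q_pi` over actions and `V_star` equals the row-wise maximum of `Q_star`, which are the two identities from the previous section.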

Let's try to understand them.

Say we consider the reward to be a function of the state and the action we took (and not of the state where we land.) In that case, given a state and an action, the immediate reward is deterministic and independent of the state where the agent lands. We can then pull the reward out of the expectation over $s'$ in the action-value functions.

Thus, the value of a (state, action) pair is the immediate reward from that pair plus the discounted expected value over all the states where the agent can land. In turn, the value of a state is the expectation (over all actions) of the action-value function. Since the policy can give each action a different probability, we can think of this expectation as a weighted average of the returns[1] we get for every action, where the weight of each return is the probability that the policy assigns to the action. Notice that for $Q^*$ the (state, action) pair is already selected, and we can't change it. Also, the agent doesn't decide where it's going to land. But it has the option to maximize the expected return by selecting the best action to take after landing in some new state.

$$
Q^\pi(s, a) = R(s, a) + \underset{s' \sim P(.|s, a)}{E} [\gamma \underset{a' \sim \pi(.|s')}{E} Q^\pi(s', a')] \\
Q^*(s, a) = R(s, a) + \underset{s' \sim P(.|s, a)}{E} [\gamma \underset{a'}{max} Q^*(s', a')] \\
$$

For the optimal value function, the agent maximizes the return by selecting the best action after $s$ instead of selecting the best action after $s'.$ Similarly, for the value function, we take the expectation of the return over all the actions after $s$ instead of over all the actions after $s'.$

More concisely, 
* Value function: fixed: s. max/expectation: a. random: s'
* Action-value function: fixed: s, a. random: s'. max/expectation: a'

Notice that in the expressions for the optimal value and action-value functions, we select one specific action instead of taking the expectation.

We can think of the optimal action-value function and optimal value function as taking the path that looks most promising. Call $\{t_1, t_2, ..., t_n\}$ the paths that the optimal value function has to decide over. All those paths have the form
$$a_1-s_1-a_2-s_2-...$$
where each $a_i$ is an action in the MDP and each $s_i$ is a state in the MDP.
It's important to notice that every path starts with an action. Also, notice that the resulting state is something that the agent doesn't have control over. Now, notice that even when we start from the same state, the optimal value function and the optimal action-value function decide over different sets of paths.

For each path $t_i$ of the form $a_1-s_1-a_2-s_2-a_3-s_3-...$ that $V^*$ has to decide over, $Q^*$ has to decide over the path $a_2-s_2-a_3-s_3-...$ because its first action $a_1$ is already fixed. It's as if we need to remove opportunities from $V^*$ to reach $Q^*.$

## Advantage function
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

## MDP
We need a list of all the states, actions, reward given state and action, transition probabilities given state and action, initial state probabilities.

Why isn't the reward stochastic? It seems the reward is just a fixed property of $(s_t, a_t, s_{t+1}).$

# Algorithms
## Models
The agent always knows the available actions in a state. A model gives it information about the reward function (ie, given this state and this action, what's the reward?) and the transition probabilities (ie, given this state and this action, what's the probability distribution over the states where it can land?) [2]
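
For a small discrete MDP, such a model can be as simple as two tables. The class below is a hypothetical sketch (the name `TabularModel`, its methods, and all the numbers are made up) of what the agent could query if it had a model.

```python
import numpy as np

class TabularModel:
    """Hypothetical tabular model of an MDP: transition probabilities and rewards."""

    def __init__(self, P, R):
        self.P = np.asarray(P)    # P[s, a, s'] = P(s' | s, a)
        self.R = np.asarray(R)    # R[s, a] = reward for taking a in s

    def reward(self, s, a):
        """Given this state and this action, what's the reward?"""
        return self.R[s, a]

    def next_state_distribution(self, s, a):
        """Given this state and this action, where can the agent land?"""
        return self.P[s, a]

    def sample_next_state(self, s, a, rng):
        """Simulate one transition without touching the real environment."""
        return rng.choice(self.P.shape[2], p=self.P[s, a])

model = TabularModel(
    P=[[[1.0, 0.0], [0.3, 0.7]],
       [[0.0, 1.0], [0.5, 0.5]]],
    R=[[0.0, 1.0],
       [2.0, 0.0]],
)
rng = np.random.default_rng(0)
s_next = model.sample_next_state(0, 1, rng)
```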

### Model-based
The agent learns/has a model of the MDP

### Model-free
The agent maps observations to actions {i'm not sure}

## What do we model
The only way for the agent to influence the world is by the actions it takes. And those actions are 100% given by the policy of the agent. Thus, if we want an intelligent agent, then we need an agent that has a good policy. We can directly model the policy, so as to get the one that gives us the most return. We can also model $V^*$ and $Q^*.$ 

Say we have $Q^*.$ One way {is it the only one?} to derive the best policy from $Q^*$ is to calculate the value of $Q^*$ for every action, and then take the action with highest return. 
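
With a tabular $Q^*$ this is a single argmax per state. The table below is made up, and `greedy_action` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical Q^*[s, a] table: 2 states, 3 actions.
Q_star = np.array([[1.0, 2.5, 0.3],
                   [0.2, 0.1, 0.9]])

def greedy_action(Q, s):
    """Action with the highest action-value in state s."""
    return int(np.argmax(Q[s]))

policy = [greedy_action(Q_star, s) for s in range(Q_star.shape[0])]
# policy == [1, 2]
```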

Say we have $V^*.$ {i don't know what you'd do, even if you have the transition probs. what you _can_ tell is whether the action you took was wrong (after taking it many times.) You can tell this because you'd have what you think is the value function executed at one state and the empirical one given by the expectation of the immediate reward plus the value function executed at the states where you land.}
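
One possible answer, assuming both the transition probabilities and the reward function are known: do a one-step lookahead, which is the Bellman equation for $Q^*$ evaluated with $V^*.$ The sketch below uses a made-up MDP where action 0 always leads to state 0 and action 1 always leads to state 1. The value iteration loop is only there to obtain a $V^*$ to work with.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # P[s, a, s'] (hypothetical)
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],                  # R[s, a] (hypothetical)
              [0.0, 2.0]])

# Value iteration, only to obtain a V^* for this example.
V_star = np.zeros(2)
for _ in range(500):
    V_star = (R + gamma * P @ V_star).max(axis=1)

def action_from_v(V, s):
    """One-step lookahead: argmax over a of R(s, a) + gamma * E_{s'}[V(s')]."""
    return int(np.argmax(R[s] + gamma * P[s] @ V))

policy = [action_from_v(V_star, s) for s in range(2)]
```

Without the transition probabilities the expectation over $s'$ can't be computed, which is one reason $Q^*$ is more convenient than $V^*$ when there is no model.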

We could also model the reward and transition functions. Based on those we can use ~dynamic programming. {Think more about this.}

## Policy optimization
$\pi_\theta = $ policy parameterized by $\theta.$
We look for the $\pi_\theta$ that maximizes the expected return $J(\pi_\theta).$


## Policy gradient
We first calculate the gradient of the log-probabilities of the actions the agent took (how the log-probability of each action changes if we change our parameters $\theta.$) Then, for each action we calculate the return $R_t = r_t + r_{t+1} + \dots + r_T.$ {I'm assuming that those rewards are a function of the action and state.} We then multiply the return vector with the matrix of gradients of the action log-probs. The return vector weights the different actions. Imagine all the actions give the same expected return. Then, we just add up the gradients of the log-probs (that would be the same as having a return vector with all entries equal.) Now, if one action leads to more expected return than another, then after multiplying by the return vector we would have taken that into account. {I'm not yet sure about what it means to add up over all the timesteps until T}
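
The weighting idea is easiest to see in a one-step problem, where the return of an action is just its reward. The sketch below assumes a softmax policy over 3 actions with made-up parameters and rewards. It compares the sampled estimate (gradient of the log-probability of each sampled action, weighted by its return) against the exact gradient of $J.$

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def exact_gradient(theta, rewards):
    """Gradient of J(theta) = sum_a pi(a) r(a) for a softmax policy."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def sampled_gradient(theta, rewards, n_samples, rng):
    """Average of grad log pi(a) * R over sampled actions."""
    pi = softmax(theta)
    actions = rng.choice(len(pi), size=n_samples, p=pi)
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - pi
    grad_log_pi = np.eye(len(pi))[actions] - pi    # one row per sample
    returns = rewards[actions]
    return (grad_log_pi * returns[:, None]).mean(axis=0)

rng = np.random.default_rng(0)
theta = np.array([0.1, -0.2, 0.3])     # hypothetical policy parameters
rewards = np.array([1.0, 0.0, 2.0])    # hypothetical return of each action
g_exact = exact_gradient(theta, rewards)
g_sampled = sampled_gradient(theta, rewards, 100_000, rng)
```

With enough samples the two gradients agree closely. If all the rewards were equal, both would be (approximately) zero, because shifting probability between actions would not change the expected return.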

# Notes
[1] Remember that the return is different from the reward. The return measures how much reward you will receive over a number of steps. The reward measures how good/bad a particular (state, action) pair was.

[2] Having a model seems particularly interesting. It doesn't seem that good to just map observations to actions. I think the good thing about having a model is that it encodes the prior that the task is given by an MDP.