### Notation
- $\theta$ denotes the parameters of our policy function (written $\mu_\theta$ or $\pi_\theta$ below)
- $\sim$ denotes "distributed as"

### States vs Observations
A **State** provides a complete description of the environment. An **Observation** provides a partial description of the environment.

### Discrete vs Continuous (Action Spaces)
A **Discrete** action space has a finite set of available actions. A **Continuous** action space has real-valued actions, so there are infinitely many possible actions.

### Deterministic vs Stochastic (Policies)
A policy is a function that takes a state $s_t$ and returns an action $a_t$.

If the policy is **deterministic**, the function is denoted $\mu$
$$a_t = \mu_\theta(s_t)$$

If the policy is **stochastic**, the function is denoted $\pi$
$$a_t \sim \pi_\theta(s_t)$$
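
To make the distinction concrete, here is a minimal sketch of both kinds of policy over a discrete action space. The linear scoring rule, the softmax sampling, and the values of `theta` and `s_t` are illustrative assumptions, not part of the definitions above.

```python
import math
import random

def scores_for(theta, s):
    """Score each action as a dot product of its parameter row with the state."""
    return [sum(w * x for w, x in zip(row, s)) for row in theta]

def mu(theta, s):
    """Deterministic policy: always return the highest-scoring action."""
    scores = scores_for(theta, s)
    return max(range(len(scores)), key=lambda a: scores[a])

def pi(theta, s, rng):
    """Stochastic policy: sample an action from a softmax over the scores."""
    scores = scores_for(theta, s)
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(z - m) for z in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

theta = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical parameters, one row per action
s_t = [0.2, 0.9]                  # hypothetical state vector
rng = random.Random(0)

a_det = mu(theta, s_t)                               # same action on every call
a_samples = [pi(theta, s_t, rng) for _ in range(5)]  # may differ between calls
```

Calling `mu` twice with the same state always gives the same action, while repeated calls to `pi` draw fresh samples from the action distribution.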

### Trajectories
A trajectory $\tau$ is a sequence of states and actions
$$\tau = (s_0, a_0, s_1, a_1, \ldots, s_n, a_n)$$

State transitions (whatever happens between $s_t$ and $s_{t+1}$) are governed by the natural laws of the environment, and depend on the most recent action $a_t$.

If the transition is **deterministic**, then the subsequent state $s_{t+1}$ is fully determined by $s_t$ and $a_t$
$$s_{t+1} = f(s_t, a_t)$$

If the transition is **stochastic**, then the subsequent state $s_{t+1}$ is sampled from a distribution
$$s_{t+1} \sim P_{s_t, a_t}$$
where $P_{s, a}$ gives the distribution over the states we may transition to if we take action $a$ in state $s$.
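
As an illustration, the sketch below implements one deterministic transition function and one stochastic one, then rolls out a short trajectory. The two-state environment, its probabilities, and the helper names (`f`, `sample_next_state`, `rollout`) are all hypothetical choices made for this example.

```python
import random

# Hypothetical environment with states {0, 1} and actions {0, 1}.
# P[(s, a)] maps each possible next state to its probability.
P = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {1: 1.0},
}

def f(s, a):
    """Deterministic transition: the next state is fixed by (s, a)."""
    return (s + a) % 2

def sample_next_state(s, a, rng):
    """Stochastic transition: draw the next state from P[(s, a)]."""
    dist = P[(s, a)]
    return rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

def rollout(policy, s0, steps, rng):
    """Collect tau = (s_0, a_0, s_1, a_1, ...) as a flat list."""
    tau, s = [], s0
    for _ in range(steps):
        a = policy(s)
        tau += [s, a]
        s = sample_next_state(s, a, rng)
    return tau

# A policy that always picks action 1, started in state 0.
tau = rollout(lambda s: 1, s0=0, steps=4, rng=random.Random(0))
```

Even with a deterministic policy, rerunning `rollout` with a different seed can produce a different trajectory, because the randomness here comes from the environment.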

### Reward and Return

The reward $r_t$ is given by the reward function $R$, which depends on the current state $s_t$, the action taken $a_t$, and the subsequent state $s_{t+1}$

$$r_t = R(s_t, a_t, s_{t+1})$$

The goal of an agent is to maximize the cumulative reward over a trajectory, called the return $R(\tau)$.

The sum of rewards over a fixed number of steps is called the **finite-horizon undiscounted return**
$$R(\tau) = \sum_{t=0}^T r_t$$

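The sum of all rewards over an infinite number of steps, each weighted by how far in the future it is received, is called the **infinite-horizon discounted return**
$$R(\tau) = \sum_{t=0}^\infty \gamma^t \cdot r_t$$
where $\gamma \in (0, 1)$ is the discount factor.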
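
Both returns are easy to compute from a list of rewards. The sketch below uses a made-up reward sequence; since an infinite sum cannot be evaluated directly, the discounted version is applied to a finite list, which corresponds to truncating the series.

```python
def undiscounted_return(rewards):
    """Finite-horizon undiscounted return: the plain sum of rewards."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return over the given rewards: sum of gamma**t * r_t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical rewards r_0 .. r_3

print(undiscounted_return(rewards))     # 4.0
print(discounted_return(rewards, 0.5))  # 1.0 + 0.0 + 0.5 + 0.125 = 1.625
```

With $\gamma = 0.5$ the reward of 2 at $t = 2$ contributes only 0.5, which shows how discounting favors rewards received sooner.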

### The Value Function

The dynamics of our RL process are as follows: we start in some state $s_0$
and choose some action $a_0 \in A$. As a result of our
choice, the state randomly transitions to some successor state
$s_1$, drawn according to $s_1 \sim P_{s_0,a_0}$. Then we pick another action $a_1$.
As a result of this action, the state transitions again, now to some $s_2 \sim P_{s_1,a_1}$.
We then pick $a_2$, and so on. Pictorially, we can represent this process as
follows:
$$s_0 \stackrel{a_0}{\to} s_1 \stackrel{a_1}{\to} \ldots$$

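The total discounted reward along this trajectory is given below, where the reward is written as $R(s_t, a_t)$, dropping the dependence on $s_{t+1}$ for brevity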
$$R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \ldots$$

Next we introduce the **value function** $V^{\pi_\theta}$, the expected discounted return when starting in state $s$ and following the policy $\pi_\theta$ from then on
$$V^{\pi_\theta}(s) = \mathbb{E}\left[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \ldots ~|~ s_0=s, \pi_\theta\right]$$
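
Because the value function is an expectation, one simple way to approximate it is to average the discounted return over many sampled trajectories (a Monte Carlo estimate). The sketch below assumes a hypothetical two-state environment `P`, a hypothetical reward `R` that pays 1 whenever the agent is in state 1, and a truncation of the infinite sum at `horizon` steps.

```python
import random

# Hypothetical transition distributions: P[(s, a)] -> {next_state: probability}
P = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {1: 1.0},
}

def R(s, a):
    """Hypothetical reward: 1 for every step spent in state 1, else 0."""
    return 1.0 if s == 1 else 0.0

def sample_return(policy, s, gamma, horizon, rng):
    """Discounted return of one sampled trajectory, truncated at `horizon` steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * R(s, a)
        discount *= gamma
        dist = P[(s, a)]
        s = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    return total

def estimate_value(policy, s, gamma=0.9, horizon=200, episodes=500, seed=0):
    """Monte Carlo estimate of V^pi(s): the mean of sampled returns from s."""
    rng = random.Random(seed)
    returns = [sample_return(policy, s, gamma, horizon, rng) for _ in range(episodes)]
    return sum(returns) / len(returns)

always_1 = lambda s: 1                # a fixed deterministic policy
v1 = estimate_value(always_1, s=1)    # close to 1 / (1 - gamma) = 10
v0 = estimate_value(always_1, s=0)    # lower, since state 0 pays nothing
```

Under this policy state 1 is never left, so every sampled return from it is the same geometric sum; from state 0 the estimate varies with the seed and the number of episodes.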

### The Optimal Value Function
The quantity we are ultimately trying to find is the **optimal value function**, the best expected return achievable from each state $s$ over all possible policies

$$V^*(s) = \max_\pi V^\pi (s)$$
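
For a very small problem, the maximum over policies can be taken literally by enumerating candidates. The sketch below does this for a hypothetical two-state environment, restricting the search to deterministic policies that depend only on the current state (for finite problems of this kind an optimal policy of that form exists). The `evaluate` helper approximates $V^\pi$ by repeatedly applying the one-step expectation, which is an assumption of this example rather than something derived above.

```python
from itertools import product

STATES, ACTIONS = (0, 1), (0, 1)

# Hypothetical transition distributions: P[(s, a)] -> {next_state: probability}
P = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {1: 1.0},
}

def R(s, a):
    """Hypothetical reward: 1 for every step spent in state 1, else 0."""
    return 1.0 if s == 1 else 0.0

def evaluate(policy, gamma=0.9, sweeps=500):
    """Approximate V^pi for a policy given as a dict {state: action}."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {
            s: R(s, policy[s])
            + gamma * sum(p * V[s2] for s2, p in P[(s, policy[s])].items())
            for s in STATES
        }
    return V

def optimal_value(gamma=0.9):
    """V*(s) as the maximum of V^pi(s) over all enumerated policies."""
    candidates = [
        evaluate(dict(zip(STATES, acts)), gamma)
        for acts in product(ACTIONS, repeat=len(STATES))
    ]
    return {s: max(V[s] for V in candidates) for s in STATES}

V_star = optimal_value()
```

Enumeration grows exponentially with the number of states, so this only illustrates the definition; it is not a practical way to find $V^*$.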