This is a summary note for [Stanford XCS234 Reinforcement Learning](https://web.stanford.edu/class/cs234/). It contains some of my own understanding and inferences from the learning process, while most of the notes come directly from the course notes or the book [Reinforcement Learning, Sutton and Barto](http://incompleteideas.net/book/the-book-2nd.html).

# Module 1: Introduction

## Application of Reinforcement Learning

**Reading Material:**
- Ch.1 of **[Reinforcement Learning, Sutton and Barto ](http://incompleteideas.net/book/the-book-2nd.html)**

**Reinforcement Learning** learns through experience or data to make good decisions under uncertainty. A learning **agent** must be able to sense the **state** of its **environment** to some extent and take **actions** that affect the state, with a **goal** or goals related to the state of the environment.

**Application**
- Chess/Game: Go
- Fusion Science: Learning Plasma Control
- Covid Testing: Efficient and Targeted
- ChatGPT
    - Step 1: Behavior Cloning, Imitation Learning
    - Step 2: Model of Reward, Model-based RL
    - Step 3: Reinforcement Learning, RLHF

## Reinforcement Learning Frameworks

- Optimization: Find the optimal way to make decisions
- Delayed Consequences: Decisions now can impact things later; temporal credit assignment is hard.
- Exploration: Learning about the world by making decisions; You only get the result of what you try.
- Generalization: A policy is a mapping from past experience to action
    - a pre-programmed policy might be too large to cover all possibilities

**Imitation Learning (IL)** assumes input demonstrations of good policies. It allows us to reduce RL to Supervised Learning (SL). For example, instead of hand-coding a policy for autonomous driving, we can learn from data recorded while a human drives.

**When RL is powerful**
- No examples of desired behavior e.g. beyond human performance or no existing data
- Enormous search or optimization problem with delayed outcomes.

## Reinforcement Learning Fundamentals

- State
- Actions/Decisions
- Reward model
- Meaning of dynamics model

**Warning**: Reward Hacking - the agent chooses the easiest way to maximize the reward, even when the resulting behavior is not meaningful.

<div style="text-align:center;">
  <img height="100%" width="50%" src="sources/XCS224_m1_p4_0.png" />
</div>

**Two Objects**
- World/Environment
- Agent: Our model

**Four Sub-elements**
- **policy**: It defines an agent's behavior at a given time and maps the state of the environment to actions.
- **reward signal** (immediate): It defines the goal of a reinforcement learning problem. At each timestep, a reward (scalar value) is sent to an agent from the environment. A large number indicates good **actions**, and a small number indicates bad actions. If the actions are bad, the policy may be changed to select other actions next time. **Agent** aims to maximize its total rewards over the long run. 
- **value function**: It specifies what is good in the long run. The value of a **state** is the total amount of reward an agent can expect to accumulate over the future, starting from that state **(future expectations)**.
- **model of environment**: It is used to mimic the behavior of the environment. The next state and next reward can be inferred from the model. **Model-based methods** are algorithms that use a model of the environment, whereas **Model-free methods** are simpler trial-and-error learners.

## Self-Summary:

1. What's the relationship between reward and value?
> The **Value** of a state is the expectation of future **rewards**. 

# Module 2: Policy Evaluation

**Reading Material:**
- Ch.3,4.1-4.4,5.1-5.5,6.1-6.7 of **[Reinforcement Learning, Sutton and Barto ](http://incompleteideas.net/book/the-book-2nd.html)**

## Markov Decision Processes (MDPs)

### Markov Processes

**Markov Assumption**

State $s_t$ is Markov iff: $p(s_{t+1} | s_t,a_t) = p(s_{t+1} | h_t,a_t)$, which means the future is independent of the past given the present.

**Things to Think of**
1. Is state Markov? Is world partially observable?
> If the data cannot help us identify the state, for example which floor I am on based on a laser reading, then the world is partially observable.
2. Are dynamics deterministic or stochastic?
3. Do actions influence only immediate reward (bandits) or reward and next state?

**MDP Model**

The dynamics model predicts the next agent state.
$$p(s_{t+1} = s' | s_t =s, a_t = a)$$

The reward model predicts the immediate reward.
$$r(s_t=s,a_t=a) = \mathbb{E}[r_t|s_t=s,a_t=a]$$

<div style="text-align:center;">
 <img height="100%" width="50%" src="sources/XCS224_2_1_1.png" />
</div>
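
For a small tabular problem, the two models above can be stored as arrays. The following is a minimal sketch, assuming a hypothetical 2-state, 2-action MDP; the numbers, the array names `P` and `R`, and the helper `step` are made up for illustration and are not from the course.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
n_states, n_actions = 2, 2

# Dynamics model: P[s, a, s'] = p(s_{t+1} = s' | s_t = s, a_t = a)
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # from state 0: action 0, action 1
    [[0.0, 1.0], [0.5, 0.5]],   # from state 1: action 0, action 1
])

# Reward model: R[s, a] = E[r_t | s_t = s, a_t = a]
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])


def step(rng, s, a):
    """Sample the next state from P and return it with the expected reward R[s, a]."""
    s_next = rng.choice(n_states, p=P[s, a])
    return int(s_next), float(R[s, a])


rng = np.random.default_rng(0)
s_next, r = step(rng, s=0, a=1)
```

Each `P[s, a]` is a probability distribution over next states, so it must sum to 1.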

**Policy**

Policy $\pi$ determines how the agent chooses actions.

$\pi:S \rightarrow A$ mapping from states to actions

*Deterministic policy* $$\pi(s)=a$$

*Stochastic Policy* $$\pi(a | s) = Pr(a_t = a | s_t = s)$$
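
The two kinds of policy can be sketched in tabular form as follows. This assumes a hypothetical 3-state, 2-action problem; the tables `pi_det` and `pi_sto` and both helper functions are illustrative names, not part of the course material.

```python
import numpy as np

# Deterministic policy: pi(s) = a, stored as one action index per state.
pi_det = np.array([1, 0, 1])

# Stochastic policy: pi_sto[s, a] = Pr(a_t = a | s_t = s); each row sums to 1.
pi_sto = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
    [0.0, 1.0],
])


def act_deterministic(s):
    """Look up the single action the deterministic policy assigns to state s."""
    return int(pi_det[s])


def act_stochastic(rng, s):
    """Sample an action from the distribution pi_sto[s]."""
    return int(rng.choice(pi_sto.shape[1], p=pi_sto[s]))


rng = np.random.default_rng(0)
a_det = act_deterministic(1)
a_sto = act_stochastic(rng, 2)   # state 2 puts all probability on action 1
```

A deterministic policy is the special case of a stochastic policy in which every row puts probability 1 on one action.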


**Evaluation and Control**

*Evaluation* Estimate/predict the expected rewards from following a given policy

*Control* Optimization: find the best policy

**Markov Process or Markov Chain**

*Memoryless random process* Sequence of random states with Markov property

*Definition of Markov Process*
> S is a (finite) set of states ($s \in S$) <br>
> P is the dynamics/transition model that specifies $p(s_{t+1} = s' | s_t = s)$

Note, no rewards, no actions.

If finite number (N) of states, can express P as a matrix

$$P=\begin{bmatrix} P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N | s_1) \\  P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N | s_2) \\ \vdots & \vdots & & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N) \end{bmatrix}$$

All rows need to sum up to 1.
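
As a rough illustration of the matrix form, the sketch below builds a hypothetical 3-state chain (the probabilities are invented), checks the row-sum condition, samples a trajectory, and propagates a start distribution two steps forward. The helper `sample_chain` is an illustrative name.

```python
import numpy as np

# Hypothetical 3-state chain; P[i, j] = p(s_{t+1} = j | s_t = i).
P = np.array([
    [0.6, 0.4, 0.0],
    [0.3, 0.4, 0.3],
    [0.0, 0.5, 0.5],
])
assert np.allclose(P.sum(axis=1), 1.0)   # every row is a distribution


def sample_chain(rng, P, s0, length):
    """Sample `length` states, each drawn from the row of P for the previous state."""
    states = [s0]
    for _ in range(length - 1):
        states.append(int(rng.choice(P.shape[0], p=P[states[-1]])))
    return states


rng = np.random.default_rng(0)
episode = sample_chain(rng, P, s0=0, length=10)

# State distribution after 2 steps when starting in state 0: mu0 P^2
mu0 = np.array([1.0, 0.0, 0.0])
mu2 = mu0 @ np.linalg.matrix_power(P, 2)   # [0.48, 0.40, 0.12]
```

Because only the current state indexes the row used for sampling, the trajectory is memoryless by construction.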

**Markov Reward Process (MRP)**

*Markov Reward Process* = Markov Chain + rewards

*Definition of Markov Reward Process*
> S is a (finite) set of states ($s \in S$) <br>
> P is the dynamics/transition model that specifies $p(s_{t+1} = s' | s_t = s)$<br>
> R is a reward function $R(s_t = s) = \mathbb{E}[r_t | s_t = s]$<br>
> Discount factor $\gamma \in [0,1]$

Note: No Action

If finite number (N) of states can express R as a vector.

**Return & Value Function**

*Definition of Horizon (H)*
> Number of time steps in each episode <br>
> Can be infinite <br>
> Otherwise called **finite** Markov reward process

*Definition of Return*, $G_t$ (for a Markov Reward Process)
> Discounted sum of rewards from time step t to horizon H
$$G_t = r_t + \gamma r_{t+1} + \gamma^2r_{t+2}+...+\gamma^{H-1}r_{t+H-1}$$
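
The return formula can be evaluated directly on a recorded reward sequence. This is a minimal sketch; the function name `discounted_return` and the sample rewards are illustrative.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
G = discounted_return(rewards, gamma=0.5)   # 1 + 0.5*0 + 0.25*2 = 1.5
```

The backward loop avoids computing powers of $\gamma$ explicitly and gives the same result as the forward sum.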

*Definition of State Value Function* $V(s)$ (for a Markov Reward Process)
>Expected return from starting in state s
$$V(s) = \mathbb{E}[G_t | s_t=s] = \mathbb{E} [r_t + \gamma r_{t+1} + \gamma^2 r_{t+2}+ ...+ \gamma^{H-1} r_{t+H-1} | s_t=s]$$
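
For a finite MRP with a finite horizon, this expectation can be computed exactly by backing up one step at a time: start from $V_0 = 0$ and apply $V_k = R + \gamma P V_{k-1}$ for $H$ steps. The sketch below is one way to do that, assuming a hypothetical 2-state MRP; the numbers and the function name `mrp_value_finite_horizon` are illustrative, and the recursion is a standard dynamic-programming technique rather than something stated in the notes above.

```python
import numpy as np


def mrp_value_finite_horizon(P, R, gamma, H):
    """Expected discounted return over H steps from each start state."""
    V = np.zeros(len(R))
    for _ in range(H):
        # Immediate reward plus the discounted expected value of the next state.
        V = R + gamma * (P @ V)
    return V


# Hypothetical 2-state MRP: state 0 pays reward 1, state 1 is absorbing with reward 0.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])

V = mrp_value_finite_horizon(P, R, gamma=0.5, H=2)   # [1.25, 0.0]
```

Checking state 0 by hand: the first reward is 1, and with probability 0.5 the process stays in state 0 and earns another 1 discounted by 0.5, so $V(s_0) = 1 + 0.5 \cdot 0.5 \cdot 1 = 1.25$.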

**Discount Factor**
1. Mathematically convenient (avoid infinite returns and values)
2. Humans often act as if there's a discount factor < 1
3. $\gamma=0$ Only care about immediate reward
4. $\gamma=1$ Future reward is as beneficial as immediate reward
5. If episode lengths are always finite ($H < \infty$), can use $\gamma =1$
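
A short worked bound may make points 1 and 5 concrete. It assumes rewards are bounded, $|r_t| \le r_{\max}$; the symbol $r_{\max}$ is introduced here only for illustration and is not part of the original notes.

$$|G_t| \le \sum_{k=0}^{\infty} \gamma^k r_{\max} = \frac{r_{\max}}{1-\gamma} \quad (\gamma < 1,\ H = \infty)$$

$$|G_t| \le \sum_{k=0}^{H-1} r_{\max} = H\, r_{\max} \quad (\gamma = 1,\ H < \infty)$$

In the first case the discount keeps the infinite sum finite; in the second the finite horizon does the same job, which is why $\gamma = 1$ is safe there.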


### Markov Decision Process (MDPs)