# Reinforcement Learning

$s_t$ = state at time $t$

$a_t$ = action taken at time $t$

$r_{t+1}$ = reward after action is taken

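$G_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+\cdots = r_{t+1} + \gamma G_{t+1} \approx r_{t+1} + \gamma Q(s_{t+1},a_{t+1})$

Here $\gamma \in [0,1]$ is the discount factor. The last step is an approximation (bootstrapping): the unknown future return $G_{t+1}$ is replaced by the current estimate $Q(s_{t+1},a_{t+1})$.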
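As a small illustration of the recursion $G_t = r_{t+1} + \gamma G_{t+1}$, the sketch below computes the return for every step of one finite episode by sweeping backward. The function name `discounted_returns` and the sample rewards are made up for this example; with `gamma=1.0` the result is the plain sum of future rewards.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for one finite episode.

    rewards[t] holds r_{t+1}, the reward received after the action at time t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the final step is taken to be zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns


rewards = [1.0, 0.0, 2.0]  # hypothetical episode
print(discounted_returns(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
print(discounted_returns(rewards, gamma=1.0))  # [3.0, 2.0, 2.0]
```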

$Q_{new}(s_t,a_t)= Q(s_t,a_t)+\alpha[r_{t+1}+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t)]$, where $\alpha$ is the learning rate

#### Bellman Equation for State Value Function

$V^\pi(s) = \sum_{a\in A}\pi(a|s)\,\sum_{s^{\prime}\in S}P(s^{\prime}|s,a)[R(s,a)+\gamma V^\pi(s^{\prime})]$

Where:

* $V^\pi(s):$ Value function of state $s$ under policy $\pi$, i.e. the expected return when starting in $s$ and following $\pi$
* $P(s^{\prime}|s,a):$  Transition probability from state $s$ to $s^{\prime}$ when taking action $a$
* $R(s,a):$ Reward obtained after taking action $a$ in state $s$
* $\gamma:$ Discount factor controlling the importance of future rewards
* $\pi(a|s):$ Probability of taking action $a$ in state $s$ under policy $\pi$
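The state-value equation can be turned into an algorithm by applying its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below is a minimal version under stated assumptions: the two-state MDP, the list-based encoding of `P`, `R` and `pi`, and the function name `policy_evaluation` are all invented for illustration.

```python
def policy_evaluation(P, R, pi, gamma, tol=1e-10, max_iters=10_000):
    """Approximate V^pi by repeatedly applying the Bellman expectation backup.

    P[s][a]  : list of (probability, next_state) pairs
    R[s][a]  : reward for taking action a in state s
    pi[s][a] : probability of taking action a in state s
    """
    V = [0.0] * len(P)
    for _ in range(max_iters):
        new_V = []
        for s in range(len(P)):
            v = 0.0
            for a, p_action in enumerate(pi[s]):
                for p_next, s_next in P[s][a]:
                    v += p_action * p_next * (R[s][a] + gamma * V[s_next])
            new_V.append(v)
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:  # values have stopped changing
            break
    return V


# Hypothetical MDP: in state 0, action 0 ("stay") earns reward 1 and stays,
# action 1 ("go") earns 0 and moves to state 1, which is absorbing with reward 0.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 1)]],
]
R = [[1.0, 0.0], [0.0, 0.0]]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy

V = policy_evaluation(P, R, pi, gamma=0.9)
print([round(v, 3) for v in V])  # [0.909, 0.0]
```

For this toy problem the fixed point can be checked by hand: $V^\pi(0) = 0.5\,(1 + 0.9\,V^\pi(0))$, so $V^\pi(0) = 0.5/0.55 \approx 0.909$.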

#### Bellman Equation for Action Value Function
$$Q^\pi(s,a) = \sum_{s^{\prime}\in S}P(s^{\prime}|s,a)\cdot[R(s,a)+\gamma \sum_{a^{\prime}\in A}\pi(a^{\prime}|s^{\prime})\cdot Q^\pi(s^{\prime},a^{\prime})]$$

Where $Q^\pi(s,a)$ represents the expected return for taking action $a$ in state $s$ and following policy $\pi$ afterward.
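The action-value equation can be iterated in the same way as the state-value one. The following sketch evaluates $Q^\pi$ on a small made-up MDP; the encoding (`P[s][a]` as a list of `(probability, next_state)` pairs), the toy numbers, and the name `q_policy_evaluation` are assumptions for illustration only.

```python
def q_policy_evaluation(P, R, pi, gamma, sweeps=500):
    """Approximate Q^pi with repeated Bellman expectation backups."""
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        new_Q = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                for p_next, s_next in P[s][a]:
                    # Expected value of the next state under the policy:
                    # sum over a' of pi(a'|s') * Q(s', a')
                    expected_next = sum(
                        pi[s_next][a2] * Q[s_next][a2] for a2 in range(n_actions)
                    )
                    new_Q[s][a] += p_next * (R[s][a] + gamma * expected_next)
        Q = new_Q
    return Q


# Hypothetical MDP: state 0 has "stay" (reward 1, remain in 0) and "go"
# (reward 0, move to state 1); state 1 is absorbing with reward 0.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 1)]],
]
R = [[1.0, 0.0], [0.0, 0.0]]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy

Q = q_policy_evaluation(P, R, pi, gamma=0.9)
print([[round(q, 3) for q in row] for row in Q])  # [[1.818, 0.0], [0.0, 0.0]]
```

Averaging the first row with the policy weights gives $0.5 \cdot 1.818 + 0.5 \cdot 0 \approx 0.909$, which is the state value of state 0 under the same policy.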

#### Q-Learning

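$$Q^*(s,a) = \sum_{s^{\prime}\in S}P(s^{\prime}|s,a)\cdot[R(s,a)+\gamma \max_{a^{\prime}\in A} Q^*(s^{\prime},a^{\prime})]$$

This is the Bellman optimality equation: the policy-weighted sum over next actions is replaced by a maximum. Q-learning approximates $Q^*$ from sampled transitions instead of using the transition probabilities $P$ directly.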
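When the transition probabilities and rewards are known, the optimal action values can be approximated by iterating a backup whose target is $R(s,a)+\gamma \max_{a^{\prime}} Q(s^{\prime},a^{\prime})$ (Q-value iteration). The sketch below does this for a made-up two-state MDP; the data layout and the name `q_value_iteration` are illustrative assumptions, and this is the model-based counterpart of Q-learning rather than Q-learning itself.

```python
def q_value_iteration(P, R, gamma, sweeps=2000):
    """Approximate Q* with repeated Bellman optimality backups.

    P[s][a] : list of (probability, next_state) pairs
    R[s][a] : reward for taking action a in state s
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        new_Q = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                for p_next, s_next in P[s][a]:
                    # Greedy value of the next state: max over a' of Q(s', a')
                    new_Q[s][a] += p_next * (R[s][a] + gamma * max(Q[s_next]))
        Q = new_Q
    return Q


# Hypothetical MDP: state 0 has "stay" (reward 1, remain in 0) and "go"
# (reward 0, move to state 1); state 1 is absorbing with reward 0.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 1)]],
]
R = [[1.0, 0.0], [0.0, 0.0]]

Q_star = q_value_iteration(P, R, gamma=0.9)
print([[round(q, 3) for q in row] for row in Q_star])  # [[10.0, 0.0], [0.0, 0.0]]

# A greedy policy picks the action with the largest Q* in each state.
greedy_policy = [max(range(len(row)), key=lambda a: row[a]) for row in Q_star]
print(greedy_policy)  # [0, 0]
```

Always choosing "stay" in state 0 yields $1 + 0.9 + 0.9^2 + \cdots = 1/(1-0.9) = 10$, which matches the computed value.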



### `Q-learning` vs. `SARSA (State-Action-Reward-State-Action)`

`Q-learning` and `SARSA` are reinforcement learning algorithms designed to estimate the optimal policy in a `Markov Decision Process (MDP)`.

Both of these algorithms are value-based methods that learn action-value functions, but they differ in how they update their Q-values and handle **exploration** and **exploitation**.

#### `Q-learning`

* `Q-learning` is an off-policy algorithm: it learns the value of the greedy (optimal) policy independently of the policy the agent uses to choose actions. Its updates use the maximum estimated Q-value in the next state, regardless of the action the agent actually takes there.
* In `Q-learning`, the update target is $r_{t+1}+\gamma\max_{a^{\prime}}Q(s_{t+1},a^{\prime})$, built from the highest estimated return available from the next state, independent of the action actually chosen.
* Because the target takes a maximum over noisy estimates and assumes the best action will always be taken afterward, `Q-learning` may overestimate Q-values. Exploration comes from the behavior policy (for example $\epsilon$-greedy), not from the update rule itself.
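A minimal sketch of one tabular Q-learning step is given below. The Q-table is a plain list of lists, and the helper names `epsilon_greedy` and `q_learning_update`, the `done` flag for terminal transitions, and the sample numbers are assumptions made for this example rather than part of any particular library.

```python
import random


def epsilon_greedy(Q, state, epsilon, rng):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[state]))
    return max(range(len(Q[state])), key=lambda a: Q[state][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one Q-learning update in place and return the new Q[s][a]."""
    # Off-policy target: bootstrap from the best next action,
    # whatever the agent actually does in s_next.
    best_next = 0.0 if done else max(Q[s_next])
    td_target = r + gamma * best_next
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


Q = [[0.0, 0.0], [1.0, 3.0]]  # hypothetical table: 2 states x 2 actions
rng = random.Random(0)
action = epsilon_greedy(Q, state=1, epsilon=0.0, rng=rng)  # greedy: action 1

# Transition (s=0, a=1, r=2, s_next=1): target = 2 + 0.9 * max(1, 3) = 4.7
new_value = q_learning_update(Q, s=0, a=1, r=2.0, s_next=1, alpha=0.5, gamma=0.9)
print(action, round(new_value, 3))  # 1 2.35
```

The update never looks at the action chosen in `s_next`, which is what makes the method off-policy.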




#### `SARSA`
* `SARSA` is an on-policy algorithm: it updates its Q-values based on the actions the agent actually takes. It learns the value of the policy currently being followed, so the agent’s action in the next state directly influences the Q-value update.
* In `SARSA`, the update target is $r_{t+1}+\gamma Q(s_{t+1},a_{t+1})$, where $a_{t+1}$ is the action actually chosen in the next state, so the update depends on both the next state and the action selected by the current policy.
* `SARSA` is more conservative and aligns its learning with the agent’s current policy, including its exploratory actions. It is less prone to overestimation, as it learns from the actions it actually takes rather than from a maximum over estimates.
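For comparison, one tabular SARSA step might look like the sketch below. The function name `sarsa_update`, the `done` flag for terminal transitions, and the sample Q-table are illustrative assumptions. The only difference from a max-based update is that the target uses the Q-value of the action `a_next` that the agent actually selected.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """Apply one SARSA update in place and return the new Q[s][a]."""
    # On-policy target: bootstrap from the action actually chosen in s_next.
    next_q = 0.0 if done else Q[s_next][a_next]
    td_target = r + gamma * next_q
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


# Hypothetical table: 2 states x 2 actions. In state 1 the greedy action is 1.
Q = [[0.0, 0.0], [1.0, 3.0]]

# The agent explored and picked a_next=0 in state 1:
# target = 2 + 0.9 * Q[1][0] = 2.9
explored = sarsa_update(Q, s=0, a=0, r=2.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)

# The agent picked the greedy a_next=1 in state 1:
# target = 2 + 0.9 * Q[1][1] = 4.7, the same as a max-based target
greedy = sarsa_update(Q, s=0, a=1, r=2.0, s_next=1, a_next=1, alpha=0.5, gamma=0.9)

print(round(explored, 3), round(greedy, 3))  # 1.45 2.35
```

When the next action is exploratory, the SARSA target is lower than the max-based one, which is why its estimates reflect the cost of exploration under the current policy.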

Summary Table: Differences between Q-learning and SARSA

|Aspect|Q-learning|SARSA|
|------|----------|-----|
|**Policy Type**|Off-policy|On-policy|
|**Update Rule**|Uses the maximum Q-value from the next state|Uses the Q-value of the actual next action taken|
|**Exploration**|Update target ignores exploratory actions and always assumes the greedy next action|Update target includes the exploratory actions of the agent’s actual policy|
|**Convergence Speed**|Often faster, as it bootstraps from the greedy action (not guaranteed)|Often slower, as it learns based on the actual actions taken|
|**Stability**|May lead to overestimation of Q-values|More stable, less prone to overestimation|
|**Suitability**|Suitable for environments where the optimal policy is the focus and risky exploratory actions during learning are acceptable|Suitable for environments where stability and alignment with the current policy are more important|
|**Risk of Suboptimal Policy**|Lower risk, as it always targets the best estimated action|Higher risk, as it learns the value of the current policy, which may be suboptimal|