## Reinforcement Learning (RL)
Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment so as to maximize cumulative reward.

#### Terms:
1. **Agent:** The learner or decision maker.
2. **Environment:** The external system with which the agent interacts.
3. **State`(s)`:** A representation of the current situation of the agent.
4. **Action`(a)`:** A move the agent chooses from the set of possible moves.
5. **Action Space:** The set of all valid actions in a given environment.
   1. ***Discrete Action Space:*** Only a finite number of moves are available to the agent, e.g. `Atari and Go`.
   2. ***Continuous Action Space:*** Actions are real-valued vectors, e.g. `Robot` control.
6. **Reward`(r)`:** Feedback from the environment based on the action taken by the agent. In general $r_t = R(s_t, a_t, s_{t+1})$; this is often simplified to depend only on the state, $r_t = R(s_t)$, or on the state-action pair, $r_t = R(s_t,a_t)$.
7. **Return:** The cumulative reward over a trajectory $\tau$. `Finite-horizon undiscounted return` $\to R(\tau) = \sum_{t=0}^T r_t.$ `Infinite-horizon discounted return` $\to R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, with $\gamma \in (0,1)$.
8. **Policy($\pi$):** The rule an agent uses to decide which action to take in the current state. A deterministic policy is written $a_t=\mu (s_t)$ and a stochastic policy $a_t \sim \pi (\cdot \mid s_t)$.
   1. Deterministic Policy: maps each state to a single action with certainty, $\mu: S \to A$ with $s\in S$ and $a\in A$.
   2. Stochastic Policy: maps each state to a probability distribution over actions.
      1. Categorical Policy: A categorical policy is used in discrete action spaces and is built like a classifier: a final linear layer produces a logit for each action, followed by a `softmax` that converts the logits into probabilities. Log-likelihood: denote the last layer of probabilities as $P_{\theta}(s)$. The log-likelihood of an action `a` is then obtained by indexing into that vector: $\log \pi_{\theta}(a\mid s)=\log [P_{\theta}(s)]_a$
      2. Gaussian Policy: A multivariate Gaussian (normal) distribution is described by a mean vector $(\mu)$ and a covariance matrix $(\Sigma)$. `In a diagonal Gaussian distribution the covariance matrix has entries only on the diagonal, so we can represent it by a vector.` The mean is produced by a network $\mu_{\theta}(s)$, and the standard deviations are represented in one of two ways:
         1. First way: There is a single vector of log standard deviations, $\log \sigma$, which is not a function of state but a standalone parameter.
         2. Second way: There is a neural network that maps from states to log standard deviations, $\log \sigma_{\theta}(s)$. It may optionally share some layers with the mean network.
   3. **Sampling:**
9. **Value Function`(V)`:** The expected long-term discounted return, as opposed to the short-term reward.
10. **Q-Value`(Q)`:** The expected utility of taking a given action in a given state and following a particular policy thereafter. 
11. **Trajectories($\tau$):** A trajectory $\tau$ is a sequence of states and actions in the world. $\tau = (s_0, a_0, s_1, a_1, ...).$
`The very first state of the world`, $s_0$, is randomly sampled from the start-state distribution, sometimes denoted by $\rho_0$: $s_0 \sim \rho_0(\cdot)$. State transitions are governed by the natural laws of the environment and depend only on the current state $s_t$ and the most recent action $a_t$.
    1. Deterministic: $s_{t+1} = f(s_t, a_t)$
    2. Stochastic: $s_{t+1} \sim P(\cdot|s_t, a_t)$
12. **Reward:** The reward depends on the current state, the action just taken, and the next state.
13. **Return:**
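
The categorical and diagonal Gaussian log-likelihoods described under the policy entry can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function names `categorical_log_prob` and `diag_gaussian_log_prob` and all example values are assumptions made for this sketch.

```python
import numpy as np

def categorical_log_prob(logits, action):
    """Log-likelihood of a discrete action under softmax(logits)."""
    # Numerically stable log-softmax: subtract the max before exponentiating.
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    # Index into the vector of log-probabilities: log [P_theta(s)]_a
    return log_probs[action]

def diag_gaussian_log_prob(action, mean, log_std):
    """Log-likelihood of a continuous action under a diagonal Gaussian."""
    std = np.exp(log_std)
    k = action.shape[0]  # dimensionality of the action vector
    return -0.5 * (np.sum(((action - mean) / std) ** 2 + 2.0 * log_std)
                   + k * np.log(2.0 * np.pi))

# Hypothetical logits for a 3-action environment (illustrative values only).
logits = np.array([2.0, 1.0, 0.1])
print(categorical_log_prob(logits, 0))

# Hypothetical 2-D action with a state-independent log-std vector (the "first way").
print(diag_gaussian_log_prob(np.array([0.0, 0.0]), np.zeros(2), np.zeros(2)))
```

In a real policy network the logits, mean, and (in the "second way") log standard deviations would come from the network's output for the current state.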

#### Advanced Terms
1. **Markov Decision Process (MDP):** A mathematical framework for modeling decision-making, consisting of states, actions, rewards, and transition probabilities.
2. **Transition Probability (P):** The probability of moving from one state to another given a particular action.
3. **Exploration vs Exploitation:** Balancing the choice between exploring new actions to find potentially better rewards and exploiting known actions that yield high rewards.
4. **Discount factor($\gamma$):** A factor used to discount future rewards to their present values, typically between 0 and 1.
5. **Return(G):** The cumulative reward an agent receives, usually discounted.
6. **Episode:** A sequence of states, actions, and rewards that ends in a terminal state.
7. **Learning Rate($\alpha$):** A parameter that determines how much new information overrides old information.
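
The return and discount factor defined above can be computed for a finished episode with a short backward pass. This is a sketch under stated assumptions: the helper name `discounted_return` and the three-step reward list are made up for illustration.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over a finite list of rewards."""
    g = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]  # hypothetical three-step episode
print(discounted_return(rewards))             # undiscounted: 3.0
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With `gamma=1.0` this is the finite-horizon undiscounted return; any `gamma` below 1 weights later rewards less.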

### Mathematical Formulation of RL: Markov Decision Process (S, A, R, P, $\gamma$)
- `S:` set of possible states
- `A:` set of possible actions
- `R:` distribution of reward given a `(state, action)` pair
- `P:` transition probability, the distribution over the next state given a `(state, action)` pair; $P(s^{\prime}\mid s,a)$ is the probability of moving from `state s` to **$s^{\prime}$** under action `a`
- $\gamma$: discount factor

#### Solving MDP:
1. Value Iteration: Iteratively updating the value of each state based on expected rewards and transition probabilities.
2. Policy Iteration: Alternating between evaluating the current policy and improving it by choosing actions that maximize the expected value.
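
A minimal sketch of value iteration on a tabular MDP is shown here (policy iteration is not covered by it). The array layout `P[s, a, s2]` and `R[s, a]`, the function name `value_iteration`, and the two-state MDP are all assumptions chosen for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P[s, a, s2]: transition probabilities, R[s, a]: expected reward."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s2 P[s, a, s2] * V[s2]
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            # Return the values and the greedy action in each state.
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical 2-state, 2-action MDP: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # stay
P[0, 1, 1] = P[1, 1, 0] = 1.0   # switch
R = np.array([[0.0, 0.0],        # no reward available in state 0
              [1.0, 0.0]])       # staying in state 1 pays +1
V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)  # V is close to [9, 10]; policy is [1 0]
```

In this toy MDP the best plan is to switch into state 1 and then stay there, which is what the greedy actions express.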

### Value Function and Q-value function:

#### Value function (On-Policy Value Function)
The value function `(state-value function)` represents the expected cumulative reward that an agent can achieve starting from a given `state(s)` and following a particular policy($\pi$) thereafter.

$$V^{\pi}(s) = \mathbb{E}[\sum_{t\geq 0} \gamma^{t}r_t \mid s_0=s,\pi ]$$
where,
- $\mathbb{E}$ denotes the expected value under the policy $\pi$
- $\sum_{t\geq 0} \gamma^{t}r_t$ is the total discounted reward accumulated from time step `t = 0` onward
- $\gamma$ is the discount factor, $0\leq \gamma \leq 1$
- $r_t$ is the reward received after taking an action at time step `t`
- $s_0$ is the state at time `t = 0`

#### Q-Value Function (On-Policy Action-Value Function):
The Q-value function, or action-value function, `Q(s,a)` represents the expected return starting from `state s`, taking action `a`, and following a particular policy($\pi$) thereafter.

$$Q^{\pi}(s,a) = \mathbb{E}[\sum_{t\geq 0} \gamma^{t}r_t \mid s_0=s,a_0=a, \pi ]$$
where,
- $\mathbb{E}$ denotes the expected value under the policy $\pi$
- $\sum_{t\geq 0} \gamma^{t}r_t$ is the total discounted reward accumulated from time step `t = 0` onward
- $\gamma$ is the discount factor, $0\leq \gamma \leq 1$
- $r_t$ is the reward received after taking an action at time step `t`
- $s_0$ is the state at time `t = 0`
- $a_0$ is the action at time `t = 0`
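
The two definitions are linked by a standard identity, stated here for reference: averaging the action-value over the policy's choice of first action recovers the state value.

$$V^{\pi}(s) = \mathbb{E}_{a\sim \pi(\cdot\mid s)}[\, Q^{\pi}(s,a) \,]$$

For a deterministic policy $a=\mu(s)$ the expectation has a single term, so $V^{\pi}(s) = Q^{\pi}(s,\mu(s))$.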

### Bellman Equation:
The Bellman equation is a fundamental concept in Reinforcement Learning that provides a recursive decomposition of the value function. It states that the value of a state under a particular policy $(\pi)$ can be decomposed into the immediate reward plus the discounted value of the subsequent state.

***Bellman Equation for Value Function:***
$$V^{\pi}(s) = \mathbb{E}_{a\sim \pi(\cdot\mid s),\, s^{\prime}\sim P(\cdot\mid s,a)}[\, r+ \gamma V^{\pi}(s^{\prime}) \,]$$

***Bellman Equation for Q-value function:***
$$Q^{\pi}(s,a) = \mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\, r+ \gamma\, \mathbb{E}_{a^{\prime}\sim \pi(\cdot\mid s^{\prime})}[ Q^{\pi}(s^{\prime},a^{\prime}) ] \,]$$

***Example:***
```
[(0,0),(0,1)]
[(1,0),(1,1)]
```
- Start State: `(1,0)`
- Goal State: `(0,1)`
- Actions: `Up, Right`
- `r`: +1 for reaching the goal and -0.1 for each move
- $\gamma$ = 0.9
- $\pi$: always move up if possible, otherwise move right

***Value Function:*** $V^{\pi}((1,0)) = \mathbb{E}[\, r+ \gamma V^{\pi}((0,0)) \mid s_0=(1,0),\pi ]$, where the goal state is assigned the value $V^{\pi}((0,1)) = 1$ and each move earns $r = -0.1$.
1. $V^{\pi}((0,0)) = -0.1 + 0.9\times 1= 0.8$
2. $V^{\pi}((1,0)) = -0.1 + 0.9\times 0.8= 0.62$

The expected cumulative reward starting from state `(1,0)` is 0.62.

***Q-Function:*** $Q^{\pi}((1,0),Up) = \mathbb{E}[\, r+ \gamma Q^{\pi}(s_1, a_1) \mid s_0=(1,0),a_0=Up, \pi ]$

$Q^{\pi}((0,0),Right) = -0.1 + 0.9\times 1 = 0.8$ (one move at cost 0.1 that directly reaches the goal, whose value is 1)

$Q^{\pi}((1,0),Up) = -0.1 + \gamma\, Q^{\pi}((0,0), Right) = -0.1 + 0.9\times 0.8 = 0.62$
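
The grid example can be checked numerically by applying the Bellman backup under the fixed policy. This sketch assumes the convention that the goal state has value 1 and every move earns -0.1; the helper names `policy`, `step`, `v`, and `q` are hypothetical.

```python
GAMMA = 0.9
STEP_REWARD = -0.1
GOAL, GOAL_VALUE = (0, 1), 1.0   # assumed convention: the goal state is worth 1

def policy(state):
    """Always move up if possible, otherwise move right."""
    row, _ = state
    return "Up" if row > 0 else "Right"

def step(state, action):
    """Deterministic grid transition."""
    row, col = state
    return (row - 1, col) if action == "Up" else (row, col + 1)

def v(state):
    """Bellman backup under the fixed policy: V(s) = r + gamma * V(s')."""
    if state == GOAL:
        return GOAL_VALUE
    return STEP_REWARD + GAMMA * v(step(state, policy(state)))

def q(state, action):
    """Q(s, a) = r + gamma * V(s'): take `action` once, then follow the policy."""
    return STEP_REWARD + GAMMA * v(step(state, action))

print(round(v((0, 0)), 3))        # 0.8
print(round(v((1, 0)), 3))        # 0.62
print(round(q((1, 0), "Up"), 3))  # 0.62
```

Because the policy is deterministic and chooses `Up` in `(1,0)`, the state value and the Q-value of that action coincide.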

### Optimal Q-Value Function:
The optimal Q-value function $Q^{*}$ is the maximum expected cumulative reward achievable from a given (state, action) pair.
$$Q^{*}(s,a) = \max_{\pi}\mathbb{E}[\sum_{t\geq 0} \gamma^{t}r_t \mid s_0=s,a_0=a, \pi ]$$


The Bellman equation:
$$Q^{*}(s,a) = \mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\, r+ \gamma \max_{a^{\prime}} Q^{*}(s^{\prime},a^{\prime}) \,]$$
If the optimal state-action values for the next time step, $Q^{*}(s^{\prime},a^{\prime})$, are known, then the optimal strategy is to take the action that maximizes the expected value of $r+ \gamma Q^{*}(s^{\prime},a^{\prime})$.

The optimal policy $\pi^{*}$ corresponds to taking the best action in any state as specified by $Q^{*}$ 
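
Reading the optimal policy off a known $Q^{*}$ amounts to $\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)$. A minimal sketch, assuming a small made-up Q-table whose values are purely illustrative:

```python
import numpy as np

# Hypothetical Q* table: rows are states, columns are actions.
q_star = np.array([[0.2, 0.8],
                   [0.5, 0.1],
                   [0.0, 0.3]])

# pi*(s) = argmax_a Q*(s, a), computed for every state at once.
greedy_policy = np.argmax(q_star, axis=1)
print(greedy_policy)  # [1 0 1]
```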

### Solving Optimal Policy:
Value iteration algorithm: use the Bellman equation as an iterative update, where $Q_{i}$ converges to $Q^{*}$ as $i \to \infty$.
$$Q_{i+1}(s,a) = \mathbb{E}[\, r+ \gamma \max_{a^{\prime}} Q_{i}(s^{\prime},a^{\prime}) \mid s,a \,]$$
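
The iterative update can be sketched for a small tabular problem. To keep the expectation trivial, this example assumes deterministic transitions; the function name `q_value_iteration`, the array layout, and the three-state chain are hypothetical choices for illustration.

```python
import numpy as np

def q_value_iteration(next_state, reward, terminal, gamma, n_iters=100):
    """Tabular Q-iteration for a deterministic MDP.

    next_state[s, a] and reward[s, a] describe the environment;
    terminal[s] is True where no further reward can be collected.
    """
    Q = np.zeros(reward.shape)
    for _ in range(n_iters):
        # Value of the best next action, zero once the episode has ended.
        best_next = np.where(terminal, 0.0, Q.max(axis=1))
        # Q_{i+1}(s, a) = r + gamma * max_a' Q_i(s', a')
        Q = reward + gamma * best_next[next_state]
    return Q

# Hypothetical chain 0 - 1 - 2: action 0 moves left (or stays), action 1 moves right.
next_state = np.array([[0, 1], [0, 2], [2, 2]])
reward = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # +1 for entering state 2
terminal = np.array([False, False, True])
Q = q_value_iteration(next_state, reward, terminal, gamma=0.9)
print(Q)  # approximately [[0.81, 0.9], [0.81, 1.0], [0.0, 0.0]]
```

Every entry of the table is visited on every sweep, so the cost grows with the number of state-action pairs.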

The problem is that this approach is not scalable: it must compute $Q(s,a)$ for every state-action pair. If the state space is large, computing it for the entire state space is computationally infeasible.

***`To solve the above problem, use a function approximator to estimate Q(s,a)`, for example a neural network.*** We already learnt that if we have some complex function that we don't know but want to estimate, a neural network is a good way to estimate it.


#### Q-Learning:
Use a function approximator to estimate the action value function.
$$ Q(s,a,\theta) \approx Q^{*}(s,a)$$
- $\theta$ are the function parameters (weights).

We want to find a Q-function that satisfies the Bellman equation:
$$Q^{*}(s,a) = \mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\, r+ \gamma \max_{a^{\prime}} Q^{*}(s^{\prime},a^{\prime}) \,]$$

***Forward Pass:***<br>
***`Loss function:`*** $L_i(\theta_{i})=  \mathbb{E}_{s,a\sim p(\cdot)}[(y_i - Q(s,a,\theta_i))^2]$<br>
where the target is $y_i= \mathbb{E}_{s^{\prime}\sim P(\cdot\mid s,a)}[\, r+ \gamma \max_{a^{\prime}} Q(s^{\prime},a^{\prime}, \theta_{i-1}) \,]$<br>
Minimizing this loss iteratively pushes the Q-value toward the target value $(y_i)$ that it should have if the Q-function corresponds to the optimal $Q^*$ (and the optimal policy $\pi^*$).

***Backward Pass:*** Gradient update with respect to the Q-function parameters $\theta$, written up to a constant factor of $-2$ from differentiating the square and with the target held fixed: <br>
$$\nabla_{\theta_{i}}L_i (\theta_{i}) = \mathbb{E}_{s,a\sim p(\cdot),\, s^{\prime}\sim P(\cdot\mid s,a) }[\, ( r+ \gamma \max_{a^{\prime}} Q(s^{\prime},a^{\prime}, \theta_{i-1}) - Q(s,a, \theta_{i}) )\, \nabla_{\theta_{i}} Q(s,a,\theta_i) \,]$$
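
The forward and backward pass can be sketched for a single sampled transition. To stay self-contained this uses a linear approximator $Q(s,a,\theta) = \theta_a \cdot \phi(s)$ in place of a neural network; the feature vectors, the `done` flag for terminal transitions, and the names `q_values` and `q_learning_step` are assumptions made for this illustration.

```python
import numpy as np

def q_values(theta, features):
    """Linear approximator: Q(s, a; theta) = theta[a] . phi(s), for all actions a."""
    return theta @ features

def q_learning_step(theta, theta_prev, phi_s, a, r, phi_s2, done, gamma, lr):
    """One stochastic gradient step on (y - Q(s, a; theta))^2 with y held fixed."""
    # Forward pass: the target y uses the previous parameters theta_{i-1}.
    y = r if done else r + gamma * np.max(q_values(theta_prev, phi_s2))
    td_error = y - q_values(theta, phi_s)[a]
    # Backward pass: for a linear model the gradient of Q(s, a) w.r.t. theta[a]
    # is phi(s); the constant factor from the square is absorbed into lr.
    new_theta = theta.copy()
    new_theta[a] += lr * td_error * phi_s
    return new_theta, td_error

# Hypothetical transition with 3 state features and 2 actions.
theta = np.zeros((2, 3))
phi_s = np.array([1.0, 0.0, 0.0])
phi_s2 = np.array([0.0, 1.0, 0.0])
theta_new, err = q_learning_step(theta, theta.copy(), phi_s, a=1, r=1.0,
                                 phi_s2=phi_s2, done=False, gamma=0.9, lr=0.5)
print(err)        # 1.0: the target is 1.0 + 0.9 * 0 and the prediction is 0
print(theta_new)  # only row 1 (the action taken) changes
```

With a neural network the same target and error are used, and the gradient of `Q` is obtained by backpropagation instead of the closed form above.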



#### Algorithms:
1. **Q-Learning:** A model-free algorithm
2. **SARSA(State-Action-Reward-State-Action):**
3. **Deep Q-Networks`(DQN)`:**
4. **Policy Gradient Methods:**
5. **Actor-Critic Methods:**
6. **Proximal Policy Optimization`(PPO)`:**
   
#### Concepts in Practice:
1. **Reward Shaping:**
2. **Experience Replay:**
3. **Transfer Learning:**
4. **Multi-Agent Reinforcement Learning:**