# **Reinforcement Learning**
<img align="right" src="https://vitalflux.com/wp-content/uploads/2020/12/Reinforcement-learning-real-world-example.png">

- In reinforcement learning, an agent learns how to act in an environment by trial and error: it takes actions, observes the resulting states and rewards, and adjusts its behavior to collect more reward.

If you need the latest version of gym, use the block of code below:
```
!pip uninstall gym -y
!pip install gym
```

In [None]:
# !pip install -U gym==0.25.2
!pip install gym[atari]
!pip install autorom[accept-rom-license]
!pip install swig
!pip install gym[box2d]

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import gym
from IPython.display import HTML
from base64 import b64encode
from gym.wrappers import record_video, record_episode_statistics
from gym.wrappers import RecordVideo, RecordEpisodeStatistics
import torch
import os

import warnings
warnings.filterwarnings('ignore')

In [None]:
def display_video(episode=0, video_width=600, video_dir="/content/video"):
    """Embed the recorded video of the given episode in the notebook."""
    video_path = os.path.join(video_dir, f"rl-video-episode-{episode}.mp4")
    # Read the file inside a context manager so the handle is closed afterwards.
    with open(video_path, "rb") as f:
        video_file = f.read()
    decoded = b64encode(video_file).decode()
    video_url = f"data:video/mp4;base64,{decoded}"
    return HTML(f"""<video width="{video_width}" controls><source src="{video_url}"></video>""")

def create_env(name, render_mode=None, video_folder='/content/video'):
    # render_mode options: "human", "rgb_array", "ansi"
    env = gym.make(name, new_step_api=True, render_mode=render_mode)
    env = RecordVideo(env, video_folder=video_folder, episode_trigger=lambda x: x % 50 == 0)
    env = RecordEpisodeStatistics(env)
    return env

def show_reward(total_rewards):
    plt.plot(total_rewards)
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.show()

## **Understanding the Markov Decision Process (MDP)**


1. **Markov Chains (MC)**

The future is independent of the past given the present: knowing the present state at time $t$ makes the future state at time $t+1$ independent of all past states at times $0, 1, …, t-1$. In other words, the state at time $t+1$ depends only on the state at time $t$.

Here we consider only the transition probabilities from one state to another, not their rewards.

2. **Markov Reward Processes (MRP)**
We can calculate the value of a state $v(s)$, which is the expected cumulative (discounted) reward that the agent collects when it is in state $S_t=s$ at time $t$ and follows the dynamics of the system.

$$
    \large v(s) = \mathbb{E} \left[G_t | S_t=s \right]
$$

> Where $G_t =R_{t+1}+γR_{t+2}+γ^{2}R_{t+3}+...$

The expectation operator $\mathbb{E}[\cdot]$ is used to derive formulae and to prove theoretical results. In practice, it is approximated by averaging over many sampled simulations, an approach known as **Monte Carlo simulation**.

<br>

3. **Markov Decision Processes (MDP)**
Now we add actions. In an MRP the agent has no control over the outcome; it only observes the states and rewards that the environment produces. Under the MDP regime, however, the agent chooses actions based on the current state/observation, and over time it can learn to take actions that maximize the cumulative reward ($G_t$).

&nbsp; &nbsp; In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming.

**Transition** : Moving from one state to another is called Transition.

**Transition Probability (T)**: The probability that the agent will move from one state to another.

<br>

<img align='right' width='400' src="https://miro.medium.com/v2/resize:fit:860/format:webp/1*MBcie302iU3qbQPbhU0psw.png">

&nbsp; &nbsp; The edges of the tree denote transition probabilities. Let's take a sample from this chain.

&nbsp; &nbsp; Suppose that we are sleeping. According to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we will sleep more, and a 0.2 chance that we will eat ice cream.

- The transition matrix is shown below (one row per current state, one column per next state):

$$
\begin{bmatrix}
0.2 & 0.6 & 0.2\\
0.1 & 0.6 & 0.3 \\
0.2 & 0.7 & 0.1
\end{bmatrix}
$$
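
As a minimal sketch of how such a matrix is used, the snippet below samples a trajectory from the chain with NumPy and computes the state distribution two steps ahead. The first row corresponds to Sleep as described above; the labels for the second and third rows (Run, Ice cream) and the helper name `sample_chain` are assumptions made for illustration.

```python
import numpy as np

# Row i holds the probabilities of moving from state i to every state.
# Assumed state order: Sleep, Run, Ice cream.
states = ["Sleep", "Run", "Ice cream"]
P = np.array([
    [0.2, 0.6, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.7, 0.1],
])

# Every row of a transition matrix must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

def sample_chain(P, start, n_steps, rng):
    """Sample a state trajectory; each step depends only on the current state."""
    path = [start]
    for _ in range(n_steps):
        path.append(int(rng.choice(len(P), p=P[path[-1]])))
    return path

rng = np.random.default_rng(0)
path = sample_chain(P, start=0, n_steps=5, rng=rng)
print(" > ".join(states[s] for s in path))

# Distribution over states two steps after starting in Sleep (row 0 of P squared).
two_step = np.linalg.matrix_power(P, 2)[0]
print(two_step)
```

Raising the matrix to the power $n$ gives the $n$-step transition probabilities, which is why no rewards are needed at this stage.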

<br>
<br>
<br>


- **The return $G_t$ is the total discounted reward from time-step $t$ (rewards further in the future are discounted more heavily)**

$\large G_t =R_{t+1}+γR_{t+2}+...=\sum_{k=0}^{\infty}γ^kR_{t+k+1}$

**Note:** Discounting is needed in continuing tasks, where the undiscounted sum can grow without bound, and is optional in episodic tasks.

<br>

<img align='left' width='350' src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*p5KQnP1rwTcXFooMF0n-TA.png">

<br>
<br>
<br>
Suppose our start state is Class 2 and we sample the episode
Class 2 > Class 3 > Pass > Sleep.
<br>
With $γ = 0.5$:

$$
G_t = -2 + (-2 \times 0.5) + (10 \times 0.25) + 0 = -0.5
$$
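
The same arithmetic can be checked in a few lines of Python. This is a small sketch; the helper name `discounted_return` is hypothetical, and the reward list is read off the sampled episode above.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards: the return at each step is the reward plus
    # gamma times the return of the following step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards collected along Class 2 > Class 3 > Pass > Sleep.
rewards = [-2, -2, 10, 0]
print(discounted_return(rewards, gamma=0.5))  # -0.5
```

With `gamma=1.0` the same function returns the plain sum of rewards, which is the undiscounted episodic case.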

## **Policies and Value Functions**

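The agent gets feedback from the environment through rewards. The dynamics of the MDP are defined by $p(s', r | s, a)$, the probability of reaching state $s'$ and receiving reward $r$ after taking action $a$ in state $s$. Alongside these dynamics we have the return $G_t$, the discounted sum of all rewards received from time $t$ onward.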

<br>

The transition dynamics are outside the agent's control. The agent can, however, control its decisions, that is, which action to take in a particular state. It does so with the objective of maximizing the return $G_t$ from each state $S_t$.

The mapping of states to actions is known as **policy**:

$$
    \pi (a|s)
$$

So the policy is the probability of taking action $a$ when the agent is in state $s$ at time $t$. The agent's objective is to learn the mapping from states to actions that maximizes $G_t$.

- **Stochastic** policies: there are multiple actions that the agent can take in a state, and the probability of taking each action is given by $π(a|s)$.
- **Deterministic** policies: there is exactly one action for each state, chosen with probability 1.
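
The two kinds of policy can be represented quite differently in code. The sketch below uses made-up probabilities for a 2-state, 2-action problem; the array names are hypothetical.

```python
import numpy as np

# Stochastic policy: pi[s, a] is the probability of action a in state s.
# The numbers are hypothetical and only illustrate the layout.
pi_stochastic = np.array([
    [0.7, 0.3],
    [0.1, 0.9],
])
assert np.allclose(pi_stochastic.sum(axis=1), 1.0)

# Deterministic policy: one action index per state.
pi_deterministic = np.array([0, 1])

rng = np.random.default_rng(0)
state = 0

# Stochastic: sample an action from the distribution for this state.
a_sampled = int(rng.choice(2, p=pi_stochastic[state]))
# Deterministic: look the action up directly.
a_fixed = int(pi_deterministic[state])

print(a_sampled, a_fixed)
```

A deterministic policy is the special case of a stochastic one in which each row of `pi_stochastic` puts all of its probability on a single action.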

<br>

The value function is always defined in the context of the policy the agent is following. The policy is also referred to as the agent's behavior.

$$
    \large v_{\pi}(s) = \mathbb{E_{\pi}} \left[G_t | S_t=s \right]
$$

> Where $v_π(s)$ specifies the “state value” of state $s$ when the agent is following a policy $π$. The value of $G_t$ is dependent on the trajectory of states that the agent will see after time $t$.

In practice we usually estimate these values by simulation: we run many episodes and average the observed returns, and this average converges to the expectation $\mathbb{E}$. This method is called **Monte Carlo simulation**.
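
A rough sketch of such a simulation is shown below. To keep it short, the policy is assumed to be already folded into a single state-to-state transition matrix; the numbers are borrowed from the small MDP used later in this notebook (action $a_0$, $γ = 0.5$), and the function name `mc_state_value`, the episode count and the truncation horizon are arbitrary choices made for illustration.

```python
import numpy as np

# State-to-state transition matrix under a fixed policy, and per-state rewards.
P = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.2, 0.2],
])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.5

def mc_state_value(start, n_episodes=2000, horizon=20, seed=0):
    """Estimate v(start) by averaging discounted returns over sampled rollouts."""
    rng = np.random.default_rng(seed)
    returns = np.empty(n_episodes)
    for ep in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        # gamma**20 is about 1e-6, so truncating the rollout here loses very little.
        for _ in range(horizon):
            g += discount * R[s]
            discount *= gamma
            s = rng.choice(3, p=P[s])
        returns[ep] = g
    return returns.mean()

estimate = mc_state_value(start=0)
print(estimate)
```

For these numbers the exact value of state 0 is about 1.68, so the estimate should land close to that; more episodes tighten it further.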

- Action-value function: the expected return when the agent is in state $s$ at time $t$, takes action $a$, and follows policy $π$ afterwards is known as the action-value function $q_π(s, a)$:

$$
    \large q_{\pi}(s, a) = \mathbb{E_{\pi}} \left[G_t | S_t=s, A_t=a \right]
$$




## **Bellman Equation for Value Function**


First let's write down $G_t$ in a recursive form:

$$
\begin{split}
    G_t & = R_{t+1}+γR_{t+2}+γ^{2}R_{t+3}+...+γ^{T-t-1}R_{T} \\
    G_t & = R_{t+1}+γ\left[R_{t+2}+γR_{t+3}+γ^{2}R_{t+4}+...+γ^{T-t-2}R_{T} \right] \\
    G_t & = R_{t+1} + γ G_{t+1}
\end{split}
$$

<br>

So:

$$
\begin{split}
    \large v_{\pi}(s) & = \mathbb{E_{\pi}} \left[G_t | S_t=s \right] \\
    \large v_{\pi}(s) & = \mathbb{E_{\pi}} \left[R_{t+1} + γ G_{t+1} | S_t=s \right]
\end{split}
$$

<br>

The Bellman equation for $V_\pi(s)$:
---

---
<br>

$$
    V_\pi(s) = \sum_{a} \pi(a|s) \sum_{s{'}} P(s{'}| s, a) \left[ R(s, a, s{'}) + \gamma V_\pi(s{'}) \right]
$$

> where: <br>
- $\pi(a \mid s)$ : Probability of taking action $a$ in state $s$ under policy $\pi$,
- $P(s{'} \mid s, a)$ : Probability of transitioning to state $s{'}$ after taking action $a$ in state $s$,
- $R(s, a, s{'})$ : Reward received when transitioning from $s$ to $s{'}$ using action $a$.
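
One way to turn this equation into code is to apply its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below assumes array layouts `P[s, a, s']`, `R[s, a, s']` and `pi[s, a]`; the function name and the tiny 2-state MDP are hypothetical.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman expectation backup until V stops changing."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # Inner sum over s': P(s'|s,a) * [R(s,a,s') + gamma * V(s')].
        q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        # Outer sum over a, weighted by the policy pi(a|s).
        V_new = (pi * q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0            # reward 1 whenever the agent lands in state 1
pi = np.full((2, 2), 0.5)   # uniform random policy

V = policy_evaluation(P, R, pi, gamma=0.9)
print(V)
```

Because the backup is a contraction for $γ < 1$, the loop converges regardless of the initial guess for `V`.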

<br>

The action-value function $Q_\pi(s, a)$
---

---
The action-value function $ Q_\pi(s, a)$  represents the expected cumulative reward starting from state $s$, taking action $a$, and then following policy  $\pi$:

$$
    Q_\pi(s, a)  = \mathbb{E_\pi} \left[ R_{t+1} + γ G_{t+1} | S_t=s, A_t=a \right]
$$


The q-value is the value of the pair $(s, a)$, and the state value is the value of a state $s$. The policy links the two: the state value is the average of the q-values of the possible actions, weighted by the probability of each action:

$$
    \large v_{\pi}(s) = \sum_{a} \pi(a| s) Q_\pi(s, a)
$$

<br>

- The Bellman equation for $Q_\pi(s, a)$ is:

$$
\begin{split}
    Q_\pi(s, a) & = \sum_{s{'}} P(s{'}| s, a) \left[ R(s, a, s{'}) + \gamma \sum_{a{'}} \pi(a{'} | s{'}) Q_\pi(s{'}, a{'}) \right] \\
    Q_\pi(s, a) & = \sum_{s{'}} P(s{'}| s, a) \left[ R(s, a, s{'}) + \gamma v_{\pi}(s^{'}) \right] \\
\end{split}
$$
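
The relation between the two value functions can be checked numerically. In this sketch, $v_π$ is first obtained by solving the linear Bellman system, $Q_π$ is then built from it, and finally the policy-weighted average of $Q_π$ is compared with $v_π$. The MDP and all array names are hypothetical.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, laid out as P[s, a, s'] and R[s, a, s'].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0            # reward 1 whenever the agent lands in state 1
pi = np.full((2, 2), 0.5)   # uniform random policy
gamma = 0.9

# Fold the policy into a state-to-state matrix and an expected reward vector.
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = np.einsum("sa,sat,sat->s", pi, P, R)

# v_pi solves the linear system (I - gamma * P_pi) v = r_pi.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Q_pi(s, a) = sum_s' P(s'|s,a) * [R(s,a,s') + gamma * v_pi(s')].
q = np.einsum("sat,sat->sa", P, R + gamma * v[None, None, :])

# v_pi(s) = sum_a pi(a|s) * Q_pi(s, a) should hold up to rounding.
v_from_q = (pi * q).sum(axis=1)
print(np.allclose(v, v_from_q))
```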

<br>

Optimal State-Value Function:
---

---
The optimal state-value function $V_*(s)$ is the maximum value achievable in state $s$ under any policy; finding it is the objective of a reinforcement learning problem:

$$
V_*(s) = \max_\pi v_{\pi}(s)
$$

> the optimal state value is the maximum one that can be obtained across all possible policies $π$.

If an agent is following the optimal policy, then in state $s$ it takes the action that maximizes $Q_*(s, a)$.

$$
V_*(s) = \max_a Q_*(s, a)
$$

The Bellman optimality equation for $V_*(s)$ is:

$$
V_*(s) = \max_a \sum_{s{'}} P(s{'} | s, a) \left[ R(s, a, s{'}) + \gamma V_*(s{'}) \right]
$$

<br>

Optimal Action-Value Function:
---

---
The optimal action-value function $Q_*(s, a)$ is the maximum value achievable starting from state $s$, taking action $a$, and then following the optimal policy. It satisfies the Bellman optimality equation:

$$
Q_*(s, a) = \sum_{s{'}} P(s{'} | s, a) \left[ R(s, a, s{'}) + \gamma \max_{a{'}} Q_*(s{'}, a{'}) \right]
$$
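
When the transition probabilities are known, this equation can be used directly as an update rule (often called value iteration). The sketch below applies it to a hypothetical 2-state, 2-action MDP; the function name and the array layouts `P[s, a, s']`, `R[s, a, s']` are assumptions.

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Repeat the Bellman optimality backup for Q until it stops changing."""
    Q = np.zeros(P.shape[:2])
    for _ in range(max_iter):
        # Current estimate of V_*(s') is the max over a' of Q(s', a').
        V = Q.max(axis=1)
        # Q(s, a) = sum_s' P(s'|s,a) * [R(s,a,s') + gamma * max_a' Q(s', a')].
        Q_new = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Hypothetical MDP: landing in state 1 pays a reward of 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

Q_star = q_value_iteration(P, R, gamma=0.9)
greedy_policy = Q_star.argmax(axis=1)   # best action index in each state
print(Q_star)
print(greedy_policy)
```

In this toy problem the greedy policy picks action 1 in both states, since that action moves to (or stays in) the rewarding state most reliably.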



---
<br>

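The Bellman equation helps us find optimal policies and value functions. For a fixed policy, with $R$ the vector of state rewards and $T$ the state-to-state transition matrix under that policy, it can be written in matrix form as an iterative update:

$ V_{t+1} = R + \gamma \times T \times V_{t} $

<br>

When the value converges, which means $ V_{t+1} = V_{t} = V$:

<br>

$$
\begin{split}
    V - \gamma \times T \times V & = R \\
    (I - \gamma \times T) V & = R \\
    V & = (I - \gamma \times T)^{-1} \times R
\end{split}
$$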

<br>

**Create an MDP:**
> We have an environment that consists of: <br>
$
S = [s_0, s_1, s_2], \\ A = [a_0, a_1], \\
\text{transition matrix}: T(s, a, s^{'}), \\
\text{discount factor}: \gamma
$

<br>

- To keep the first model simple, we assume that the optimal policy selects the first action in every state, $ \hspace{1mm}{action = a_0} $, so the action index is always 0.
- Define the reward function and the discount factor.



In [None]:
def calc_value_matrix_inv(gamma, trans_matrix, rewards):
    """Solve V = (I - gamma * T)^{-1} R for a fixed policy."""
    inv = torch.inverse(torch.eye(rewards.shape[0]) - gamma * trans_matrix)
    v = torch.mm(inv, rewards.reshape(-1, 1))
    return v

In [None]:
T = torch.tensor(
    [[[0.8, 0.1, 0.1],
      [0.1, 0.6, 0.3]],
     [[0.7, 0.2, 0.1],
      [0.1, 0.8, 0.1]],
     [[0.6, 0.2, 0.2],
      [0.1, 0.4, 0.5]]]
)

R = torch.tensor([1., 0., -1.])
gammas = [0, 0.5, 0.99]
action = 0

# T is indexed as T[s, a, s'], so T[:, action] is the 3x3 state-to-state matrix for that action.
Trans_matrix = T[:, action]
for gamma in gammas:
    v = calc_value_matrix_inv(gamma, Trans_matrix, R)
    print(f"The value function under the optimal policy and discount factor = {gamma} is: \n{v.numpy()} \n")

The value function under the optimal policy and discount factor = 0 is: 
[[ 1.]
 [ 0.]
 [-1.]] 

The value function under the optimal policy and discount factor = 0.5 is: 
[[ 1.6786704 ]
 [ 0.62603873]
 [-0.48199445]] 

The value function under the optimal policy and discount factor = 0.99 is: 
[[65.8293 ]
 [64.71942]
 [63.4876 ]] 
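
As a cross-check, the same values can be reached without a matrix inverse by applying the update $V_{t+1} = R + γ T V_t$ until it converges. The sketch below uses NumPy instead of torch purely to stay self-contained; the matrix and rewards are copied from the cell above, and the tolerance and iteration cap are arbitrary.

```python
import numpy as np

# Transition matrix for action a_0 and per-state rewards, copied from the cell above.
T0 = np.array([[0.8, 0.1, 0.1],
               [0.7, 0.2, 0.1],
               [0.6, 0.2, 0.2]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.5

# Repeatedly apply V <- R + gamma * T V, starting from zeros.
V = np.zeros(3)
for step in range(1000):
    V_next = R + gamma * T0 @ V
    if np.max(np.abs(V_next - V)) < 1e-8:
        break
    V = V_next

# Closed-form solution of (I - gamma * T) V = R for comparison.
V_closed = np.linalg.solve(np.eye(3) - gamma * T0, R)

print(V_next)
print(V_closed)
```

Both vectors should agree with the $γ = 0.5$ output printed above to several decimal places.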



## **Problems**

There are two main problems:
1. The environment's transition probabilities are usually unknown.
2. Computing the matrix inverse is expensive and becomes impractical for large state spaces.

So in practice we use different methods.