# Reinforcement Learning With Practice

I'm writing this notebook to better understand Reinforcement Learning (RL) and how to implement it in practice. The motivation came from reading [A Survey on Deep Reinforcement Learning Algorithms for Robotic Manipulation](https://www.mdpi.com/1424-8220/23/7/3762). After reading it, I still found it hard to apply the algorithms to real-world problems; for example, it was not obvious to me what the reward or the policy should be.

I introduce each algorithm briefly and use it to solve the [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/) environment from Gymnasium (the maintained successor of OpenAI Gym).

---
**Notations**

A Markov decision process (MDP) is defined as a tuple $\langle S, A, r, T, \gamma \rangle$ where:
- $S$ - the set of states ($s \in S$ is the current state)
- $A$ - the set of actions ($a \in A$ is the action taken)
- $r$ - reward for taking action $a$ in state $s$
- $T$ - state transition function
- $\gamma$ - discount factor: a reward obtained in the future is worth less than an immediate reward
---
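
To make the role of $\gamma$ concrete, here is a minimal sketch that computes the discounted return $\sum_t \gamma^t r_t$ for a list of rewards. The function name `discounted_return` is only illustrative and not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three steps with reward 1 each, as in CartPole.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With $\gamma$ close to 1 the agent values long episodes almost as much as their raw length; with a small $\gamma$ only the next few rewards matter.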

## About CartPole
1. action space: an ndarray with shape `(1,)` which can take values `{0, 1}`, indicating the direction (left or right) of the fixed force the cart is pushed with.
2. observation space: an ndarray with shape `(4,)` containing the cart position, cart velocity, pole angle, and pole angular velocity.
3. reward: `1` for every step taken, including the termination step.
4. termination: the episode ends when the pole is more than 12 degrees from vertical or the cart moves more than 2.4 units from the center.
5. truncation: the episode ends after 500 steps for v1 (200 for v0).
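
The difference between termination and truncation matters later, because only a terminated episode should stop bootstrapping in the value updates. The sketch below re-implements the two rules for illustration only; in practice `env.step` already returns both flags. The constant and function names are my own, and the 12 degree limit (the observation stores the angle in radians) is taken from the Gymnasium CartPole documentation, so treat it as an assumption for other versions.

```python
import math

X_LIMIT = 2.4                    # cart position limit
THETA_LIMIT = math.radians(12)   # assumed pole angle limit, converted to radians
MAX_STEPS = 500                  # step limit for v1


def episode_status(obs, step_count):
    """Return (terminated, truncated) for one observation and a 1-based step count."""
    x, _x_dot, theta, _theta_dot = obs
    terminated = abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT
    truncated = step_count >= MAX_STEPS
    return terminated, truncated


print(episode_status((0.0, 0.0, 0.3, 0.0), step_count=10))   # (True, False): pole fell
print(episode_status((0.0, 0.0, 0.0, 0.0), step_count=500))  # (False, True): time limit
```

Since every step gives a reward of `1`, the undiscounted return of an episode equals its length, so 500 is the best possible score on v1.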

## 1. Value-Based RL

### 1.1 Q-Learning
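$$ Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a') \tag{1} $$

$Q(s, a)$ is the action-value function, which estimates how good it is to take action $a$ in state $s$. Equation (1) is the Bellman optimality equation that the optimal $Q$ satisfies, written here for a deterministic transition to $s'$. Q-Learning is an off-policy algorithm: it learns the optimal policy directly, regardless of the policy used to collect the experience.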
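
Equation (1) is a fixed point rather than an update rule. A minimal tabular sketch, assuming a learning rate `alpha` (not part of Eq. (1)) and hashable states (CartPole observations would first have to be discretized), moves `Q[s][a]` a small step toward the right-hand side. The function name and the toy states `"s0"` and `"s1"` are hypothetical.

```python
from collections import defaultdict

N_ACTIONS = 2  # CartPole has two actions


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Move Q[s][a] toward r + gamma * max_a' Q[s_next][a'] and return the TD error."""
    # A terminal next state contributes no future value.
    best_next = 0.0 if done else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


Q = defaultdict(lambda: [0.0] * N_ACTIONS)
Q["s1"][1] = 2.0
q_learning_update(Q, "s0", 0, r=1.0, s_next="s1", done=False, alpha=0.5, gamma=0.5)
print(Q["s0"][0])  # 0.5 * (1 + 0.5 * 2) = 1.0
```

The `max` over the next state's values is what makes the method off-policy: the target ignores which action the agent actually takes next.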

### 1.2 SARSA
$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right] \tag{2}$$
SARSA is on-policy, which means $a'$ in $Q(s', a')$ is the action the current policy actually takes in $s'$. $\alpha$ is the learning rate.
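
As a rough sketch of Eq. (2), the update below takes the next action as an explicit argument. I assume an epsilon-greedy behaviour policy here, which the text above does not prescribe; the helper names and the toy table are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """Apply Eq. (2): the target uses the action a_next chosen by the policy."""
    next_q = 0.0 if done else Q[s_next][a_next]
    Q[s][a] += alpha * (r + gamma * next_q - Q[s][a])


Q = {"s0": [0.0, 0.0], "s1": [1.0, 3.0]}
a_next = 0  # suppose the policy explored and picked the non-greedy action in s1
sarsa_update(Q, "s0", 1, r=1.0, s_next="s1", a_next=a_next, done=False, alpha=0.5, gamma=0.5)
print(Q["s0"][1])  # 0.5 * (1 + 0.5 * 1) = 0.75
```

A Q-Learning update on the same transition would have used the larger value `3.0` from `s1`, whatever action was actually taken.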

### 1.3 Deep Q-Learning (DQN)
Deep Q-learning (DQN), introduced by Mnih et al. [18], uses a neural network to approximate the Q-values. The update rule depends on values produced by the network itself, which makes convergence difficult. To address this, the DQN algorithm introduces a replay buffer and a target network with parameters $\theta'$. The replay buffer stores past interactions as a list of tuples, which can be sampled to update the value and policy networks. This lets the network learn from each tuple multiple times and reduces its dependence on the most recent experience. The target networks are time-delayed copies of the policy and Q-networks, and their parameters are updated according to the following equations:

$$ \theta_Q' \leftarrow \tau \theta_Q + (1 - \tau) \theta_Q' \tag{4}$$
$$ \theta_\mu' \leftarrow \tau \theta_\mu + (1 - \tau) \theta_\mu' \tag{5}$$

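where $\theta_Q$ and $\theta_\mu$ denote the parameters of the Q-network and the policy network, $\theta_Q'$ and $\theta_\mu'$ those of their target copies, and $\tau \in (0, 1]$ is the update rate. Equation (5) only applies to methods that keep a separate policy network; plain DQN has just the Q-network, so $\theta = \theta_Q$ and $\theta' = \theta_Q'$. The loss for one transition, for example the squared error, can be: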
$$ L = \left(Q(s, a;\theta) - (r + \gamma \max_{a'} Q(s', a';\theta')) \right) ^ 2 \tag{3}$$
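
The soft update in Eq. (4) and the loss in Eq. (3) can be sketched without a deep learning library. Here parameters are flat lists of floats, which is a simplification of my own (a real network would hold tensors), and the function names are hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Eq. (4): theta' <- tau * theta + (1 - tau) * theta', elementwise."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]


def dqn_loss(q_sa, r, q_next_target, done, gamma=0.99):
    """Eq. (3) for one transition; q_next_target holds Q(s', .; theta')."""
    # A terminal transition has no bootstrapped term.
    target = r if done else r + gamma * max(q_next_target)
    return (q_sa - target) ** 2


print(soft_update([0.0, 0.0], [1.0, 2.0], tau=0.5))  # [0.5, 1.0]
print(dqn_loss(q_sa=1.0, r=1.0, q_next_target=[0.0, 2.0], done=False, gamma=0.5))  # (1 - 2)^2 = 1.0
```

A small `tau` keeps the target network changing slowly, so the regression target in the loss stays nearly fixed between updates.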

<center>
    <img src="./images/dqn.webp" alt="example">
</center>
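
The replay buffer described above can be sketched with the standard library alone. The class below is a minimal version assuming uniform sampling and a fixed capacity; the class name, the tuple layout `(s, a, r, s_next, done)`, and the `seed` argument are my own choices.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) tuples with uniform sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push(t, 0, 1.0, t + 1, False)
print(len(buf))  # 3, only the most recent transitions remain
batch = buf.sample(2)
```

Because a stored tuple can be drawn in many different batches, each interaction with the environment is reused several times.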


### 1.4 Double Deep Q-Learning (Double DQN)
$$ Q(s,a;\theta) = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta') \tag{6}$$
The main neural network, with parameters $\theta$, determines the best next action $a'$, while the target network, with parameters $\theta'$, is used to evaluate this action and compute its Q-value. This simple change has been shown to reduce the overestimation of Q-values in DQN.
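
To see the difference from the DQN target, the sketch below computes both targets for one non-terminal transition. The Q-value lists are made-up numbers and the function names are hypothetical.

```python
def dqn_target(r, q_next_target, gamma):
    """DQN: the target network both selects and evaluates a'."""
    return r + gamma * max(q_next_target)


def double_dqn_target(r, q_next_online, q_next_target, gamma):
    """Eq. (6): the main network selects a', the target network evaluates it."""
    best = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return r + gamma * q_next_target[best]


q_online = [1.0, 2.0]  # Q(s', .; theta), hypothetical values
q_target = [4.0, 0.0]  # Q(s', .; theta'), hypothetical values
print(dqn_target(1.0, q_target, gamma=0.5))                   # 1 + 0.5 * 4 = 3.0
print(double_dqn_target(1.0, q_online, q_target, gamma=0.5))  # 1 + 0.5 * 0 = 1.0
```

When the two networks disagree about the best action, the Double DQN target cannot exceed the DQN target, which is where the reduced overestimation comes from.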


See [Double Deep Q-Learning](double_dqn.ipynb) for a code example.

### 1.5 Dueling Deep Q-Learning
Dueling DQN improves upon traditional DQN by decomposing the Q-values into two separate components:
- the value function $V(s)$, which is the expected return from a given state
- the advantage function $A(s, a)$, which reflects the relative advantage of taking a particular action compared to the other actions
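
The two components still have to be combined into Q-values. A common choice, which I assume here because the text above does not state it, is to subtract the mean advantage: $Q(s, a) = V(s) + A(s, a) - \frac{1}{|A|} \sum_{a'} A(s, a')$. The function name is hypothetical.

```python
def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, .) as Q = V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (adv - mean_adv) for adv in advantages]


print(dueling_q_values(10.0, [1.0, 3.0]))  # [9.0, 11.0]
```

Subtracting the mean makes the decomposition unique; otherwise a constant could be shifted between $V$ and $A$ without changing $Q$.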