# Reinforcement Learning With Practice

I'm writing this notebook to better understand reinforcement learning (RL) and its practical implementation. The motivation came while reading [A Survey on Deep Reinforcement Learning Algorithms for Robotic Manipulation](https://www.mdpi.com/1424-8220/23/7/3762). I found it hard to see how the algorithms apply to real-world problems; for example, it was unclear to me what the reward or the policy would be in a concrete task.

I introduce each algorithm briefly and use it to solve the [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/) environment from Gymnasium (the maintained successor to OpenAI Gym).

---
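**Notation**

A Markov decision process (MDP) is defined as a tuple $\langle S, A, r, T, \gamma \rangle$ where:
- $S$ - state space, the set of all states $s$
- $A$ - action space, the set of all actions $a$
- $r$ - reward function; $r(s, a)$ is the reward for taking action $a$ in state $s$
- $T$ - state transition function, which determines the next state $s'$ after taking action $a$ in state $s$
- $\gamma$ - discount factor in $[0, 1]$, meaning that a reward obtained in the future is worth less than the same reward obtained immediately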
---
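
To make the role of $\gamma$ concrete, the sketch below computes a discounted return $\sum_t \gamma^t r_t$ for a short list of rewards. The function name `discounted_return` and the example values are illustrative assumptions, not something taken from the survey.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three steps with reward 1 each (as in CartPole) and gamma = 0.5:
# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
```

With $\gamma = 1$ the same call would simply add up the rewards, so a smaller $\gamma$ makes the agent care less about rewards far in the future.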

## About CartPole
1. action space: an `ndarray` with shape `(1,)` which can take values `{0, 1}`, indicating the direction of the fixed force the cart is pushed with (`0` pushes the cart to the left, `1` to the right).
2. observation space: an `ndarray` with shape `(4,)` containing the cart position, cart velocity, pole angle, and pole angular velocity.
3. reward: `1` for every step taken, including the termination step.
4. termination: the episode ends when the pole is more than 12 degrees from vertical or the cart moves more than 2.4 units from the center.
5. truncation: the episode ends after 500 steps for v1 (200 for v0).
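
The termination and truncation rules can be sketched as a small check on a single observation. This is only a sketch, not Gymnasium's own code: the helper `episode_status` and the constant names are hypothetical, and the thresholds (±2.4 for the cart position, ±12° for the pole angle, 500 steps for v1) are the ones given in the Gymnasium CartPole documentation.

```python
import math

X_THRESHOLD = 2.4                    # cart position limit (assumed from the Gymnasium docs)
THETA_THRESHOLD = math.radians(12)   # pole angle limit, converted to radians
MAX_STEPS_V1 = 500                   # step limit for CartPole-v1


def episode_status(observation, step_count, max_steps=MAX_STEPS_V1):
    """Return (terminated, truncated) for one observation and the steps taken so far."""
    x, _x_dot, theta, _theta_dot = observation
    # Termination: the task itself has failed (cart too far out or pole too tilted).
    terminated = abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD
    # Truncation: the step limit was reached, regardless of the state.
    truncated = step_count >= max_steps
    return terminated, truncated


print(episode_status((0.0, 0.0, 0.0, 0.0), step_count=10))    # (False, False)
print(episode_status((2.5, 0.0, 0.0, 0.0), step_count=10))    # (True, False)
print(episode_status((0.0, 0.0, 0.0, 0.0), step_count=500))   # (False, True)
```

Keeping the two flags separate matters for learning: after a truncation the state is not a failure state, so value-based methods can still bootstrap from it.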

## 1. Value-Based RL

### 1.1 Q-Learning
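$$ Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a') \tag{1} $$

$Q(s, a)$ is the action-value function, which estimates how good it is to take action $a$ in state $s$. Eq. (1) is its Bellman optimality form, where $s'$ is the next state and the maximum runs over the next actions $a'$. Q-Learning is an off-policy algorithm: its target always uses the greedy next action, whatever action the agent actually takes while exploring, so it learns the optimal action values directly.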
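
As a rough illustration, a tabular version of this update could look like the sketch below. The learning rate `alpha`, the dictionary-based table, the `done` flag, and the function name `q_learning_update` are assumptions made for the example; the idea is only that $Q(s, a)$ is moved a step toward the target $r + \gamma \max_{a'} Q(s', a')$.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Move Q[(s, a)] toward the target r + gamma * max over a' of Q[(s_next, a')]."""
    # Do not bootstrap from the next state once the episode has terminated.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)   # unseen (state, action) pairs start at 0
actions = [0, 1]         # CartPole: push left / push right
Q[("s1", 1)] = 2.0       # pretend action 1 already looks good in state "s1"

# Target = 1.0 + 0.9 * max(0.0, 2.0) = 2.8, so Q moves halfway from 0 to 2.8.
q_learning_update(Q, "s0", 0, r=1.0, s_next="s1", actions=actions, alpha=0.5, gamma=0.9)
print(Q[("s0", 0)])
```

CartPole observations are continuous, so a table like this would need a discretized state; the string states here are placeholders.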

### 1.2 SARSA
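$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r(s, a) + \gamma Q(s', a') - Q(s, a) \right] \tag{2}$$

Here $\alpha$ is the learning rate. SARSA is on-policy, which means the next action $a'$ in $Q(s', a')$ is the one chosen by the current policy, rather than the maximizing action used by Q-Learning.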
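
A minimal tabular sketch of Eq. (2) is shown below. It assumes an $\epsilon$-greedy policy as "the current policy"; that choice, the `done` flag, and the helper names `epsilon_greedy` and `sarsa_update` are assumptions for illustration only.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """Eq. (2): bootstrap from the action a_next that the policy actually chose."""
    next_value = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
actions = [0, 1]
Q[("s1", 0)], Q[("s1", 1)] = 2.0, 0.0

# epsilon = 0.0 makes the policy purely greedy, so a_next is action 0 here.
a_next = epsilon_greedy(Q, "s1", actions, epsilon=0.0)
sarsa_update(Q, "s0", 1, r=1.0, s_next="s1", a_next=a_next, alpha=0.5, gamma=0.9)
print(a_next, Q[("s0", 1)])
```

If the policy had explored and picked action `1` in `"s1"`, the update would have used `Q[("s1", 1)]` instead, which is exactly where SARSA differs from Q-Learning.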

### 1.3 Deep Q-Learning (DQN)
[Deep Q-Learning](dqn.ipynb) 

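The cell below shows how a `deque` with `maxlen` behaves as a fixed-size replay buffer: once it is full, appending a new item discards the oldest one.

```python
from collections import deque

# A buffer that keeps only the 2 most recent items.
replay_buffer = deque(maxlen=2)


def fill(buffer):
    """Append 0..4; items beyond maxlen push out the oldest entries."""
    for i in range(5):
        buffer.append(i)


fill(replay_buffer)
print(replay_buffer)
```

Output:

```
deque([3, 4], maxlen=2)
```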
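
Building on the `deque` behavior shown above, a replay buffer for DQN could be sketched as follows. The class name `ReplayBuffer`, the `(state, action, reward, next_state, done)` tuple layout, and uniform random sampling are assumptions for illustration; the linked notebook may organize this differently.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # When the buffer is full, the oldest transition is dropped automatically.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

print(len(buffer))                    # 3: only the transitions from t = 2, 3, 4 remain
batch = buffer.sample(batch_size=2)   # two distinct transitions, in random order
```

Sampling at random rather than replaying transitions in order is meant to break the correlation between consecutive steps of an episode.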
