# Reinforcement Learning 

### Key Concepts

- **State (s)**: Represents the helicopter's position, orientation, and speed, detailing the current situation.

- **Action (a)**: Decisions made to control the helicopter's movements based on its state.

- **Reward**: The signal that drives reinforcement learning. The algorithm is rewarded for good outcomes ("good helicopter") and penalized for bad ones ("bad helicopter"), much like training a dog with feedback. This pushes the system to discover, on its own, the behavior that maximizes total reward.


## Return 
- Discount Factor ($\gamma$): A number slightly less than 1. Each later reward is multiplied by one more factor of $\gamma$, so rewards received sooner count for more than rewards received later.
- $\text{Return} = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$ (until the terminal state)

$\therefore$ The return depends on the starting state: different positions give different returns.

![Image](./image/Return.png)
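
To make the return formula concrete, here is a minimal sketch that sums a reward sequence with increasing powers of $\gamma$. The reward values and $\gamma = 0.5$ are made-up numbers for illustration, not taken from the figure.

```python
def discounted_return(rewards, gamma):
    """Return R_1 + gamma * R_2 + gamma^2 * R_3 + ... for a finite episode."""
    total = 0.0
    for step, reward in enumerate(rewards):
        # The reward at position `step` (0-based) is weighted by gamma ** step.
        total += (gamma ** step) * reward
    return total


# Hypothetical episode: no reward for three steps, then 100 at the terminal state.
rewards = [0, 0, 0, 100]
print(discounted_return(rewards, gamma=0.5))  # 12.5
```

Starting one step closer to the terminal state (`[0, 0, 100]`) gives 25.0 instead, which is why the return differs from state to state.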

## Policy

- **Policy ($\pi$)**: A function that maps a given state ($s$) to an action ($a$) that the agent should take when in that state. The policy is denoted as $\pi(s) = a$, indicating the action $a$ recommended when the system is in state $s$.

## State-action value 

- **Definition $Q(s, a)$**: The return (total future rewards) expected from starting in state $s$, taking action $a$ once, and then following the optimal policy thereafter.

- **Choose max Return**: A natural way to pick an action in state $s$ is to compute $Q(s, a)$ for every action $a$ and then choose the action with the largest value.
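
As a small illustration of choosing the action with the largest state-action value, the sketch below assumes the $Q(s, a)$ values for one state are already known and stored in a dictionary. The action names and numbers are hypothetical.

```python
def greedy_action(q_values):
    """Return the action whose Q(s, a) is largest for a fixed state s."""
    return max(q_values, key=q_values.get)


# Hypothetical Q(s, a) values for a single state s.
q_values = {"left": 12.5, "right": 6.25, "stay": 10.0}
best = greedy_action(q_values)
print(best)  # left
```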

## Bellman Equation
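- Formula: $Q(s,a) = R(s) + \gamma \max_{a'} Q(s',a')$, where $s'$ is the state reached after taking action $a$ in state $s$, and $a'$ ranges over the actions available in $s'$.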
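
One way to see the Bellman equation at work is to apply it repeatedly on a tiny problem until the values stop changing. The sketch below assumes a hypothetical six-state corridor with deterministic left/right moves, $\gamma = 0.5$, and the convention that $Q(s, a) = R(s)$ in terminal states. None of these specifics come from the notes above.

```python
GAMMA = 0.5
REWARDS = [100, 0, 0, 0, 0, 40]  # hypothetical R(s) for states 0..5
TERMINAL = {0, 5}                # episodes end at either end of the corridor
ACTIONS = (-1, +1)               # -1 = move left, +1 = move right


def next_state(s, a):
    """Deterministic transition, clipped to the corridor."""
    return min(max(s + a, 0), len(REWARDS) - 1)


# Start from Q = 0 everywhere and apply the Bellman equation repeatedly.
q = {(s, a): 0.0 for s in range(len(REWARDS)) for a in ACTIONS}
for _ in range(20):
    for s in range(len(REWARDS)):
        for a in ACTIONS:
            if s in TERMINAL:
                # Assumed convention: no future term at a terminal state.
                q[(s, a)] = float(REWARDS[s])
            else:
                s2 = next_state(s, a)
                q[(s, a)] = REWARDS[s] + GAMMA * max(q[(s2, b)] for b in ACTIONS)

print(q[(1, -1)], q[(1, +1)])  # 50.0 12.5
```

From state 1, moving left reaches the reward of 100 in one step ($0 + 0.5 \cdot 100 = 50$), while moving right only pays off after turning back, so its value is lower.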

# Continuous State

- In the real world, the state of a system is often continuous: each state variable can take any value within a range rather than one of a few discrete values.

## Example

#### 1. Autonomous Driving

- **State Variables**: Position ($x, y$) on a map, velocity ($v$), orientation angle ($\theta$), and acceleration ($a$).
- **Description**: The continuous state space includes the car's precise location, speed, direction, and rate of speed change. The RL algorithm must decide on actions like steering angle adjustments, acceleration, or braking to navigate roads safely and efficiently.

#### 2. Robotic Manipulation

- **State Variables**: Joint angles ($\theta_1, \theta_2, \dots, \theta_n$), joint velocities ($\dot{\theta}_1, \dot{\theta}_2, \dots, \dot{\theta}_n$), and object positions ($x, y, z$).
- **Description**: In robotic arm control, the state includes the angles and velocities of each joint for precision movements, as well as the position of objects the robot interacts with. The goal is often to manipulate objects or perform tasks with high dexterity.

# Algorithm Build Up


### Data 

- Our goal is to know which action is best in each state, and for that we need the value given by the Bellman equation for each action $\implies$ our model is trained to predict $Q(s,a)$.

#### Parameters

- $X$: $s$ (state), $a$ (action)
- $Y$: $Q(s,a) = R(s) + \gamma \max_{a'} Q(s',a')$

![Image](./image/DataCreation.png)
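
The sketch below shows one possible way to turn recorded experience into $(X, Y)$ training pairs. It assumes each experience is a tuple `(state, action, reward, next_state, done)`. The `done` flag, the `q_guess` placeholder for the current model, and all numbers are hypothetical and are not part of the notes or the figure.

```python
GAMMA = 0.99
ACTIONS = [0, 1, 2, 3]  # hypothetical discrete action ids


def q_guess(state, action):
    """Placeholder for the current model's estimate of Q(s, a)."""
    return 0.1 * action + sum(state)


def build_training_set(experiences):
    """Build inputs X = (s, a) and targets Y = R(s) + gamma * max_a' Q(s', a')."""
    xs, ys = [], []
    for state, action, reward, next_state, done in experiences:
        # Assumption: at a terminal state there is no future term.
        future = 0.0 if done else max(q_guess(next_state, a) for a in ACTIONS)
        xs.append(list(state) + [action])
        ys.append(reward + GAMMA * future)
    return xs, ys


# Two made-up experiences with a 2-dimensional state.
experiences = [
    ((0.0, 1.0), 2, 1.0, (0.5, 1.0), False),
    ((0.5, 1.0), 0, -100.0, (0.0, 0.0), True),
]
xs, ys = build_training_set(experiences)
print(xs)
print(ys)
```

Because the targets are computed from the model's own current guess, they are only rough at first and become more accurate as the model is retrained on them.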


### Algorithm

![Image](./image/Algorithm.png)

# Epsilon-greedy policy

![Image](./image/epsilon.png)

- Some actions are never tried because their current estimated $Return$ is low, even though they might actually be the better option.
$\implies$ To reduce this risk, with a small probability (called epsilon, $\epsilon$) the agent picks a random action instead of the one with the highest estimated return.
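
A minimal sketch of this selection rule, assuming the estimated values for the current state are given as a list indexed by action. The function name, the `rng` parameter, and the numbers are illustrative choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        # Explore: every action gets an equal chance.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [1.0, 3.5, 2.0]  # hypothetical Q(s, a) for actions 0, 1, 2
print(epsilon_greedy(q_values, epsilon=0.0))  # 1 (always greedy when epsilon is 0)
print(epsilon_greedy(q_values, epsilon=0.1))  # usually 1, occasionally a random action
```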

## Trick: 
- Start with a high $\epsilon$, so that early on actions are chosen almost uniformly at random. As the model improves, gradually reduce $\epsilon$: by then the model has already been trained on those actions, so good actions are kept and bad ones are dropped.
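
One common way to reduce $\epsilon$ gradually is to multiply it by a constant after each episode and stop at a floor. The decay rate `0.995` and floor `0.01` below are assumed values for illustration, not values given in these notes.

```python
def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    """Shrink epsilon by a constant factor, but never below epsilon_min."""
    return max(epsilon_min, epsilon * decay)


epsilon = 1.0  # start fully random
for episode in range(1000):
    # ... run one episode with the current epsilon, then train ...
    epsilon = decay_epsilon(epsilon)

print(epsilon)  # 0.01 (the floor is reached before episode 1000)
```

Keeping a small floor means the agent never stops exploring entirely.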

# Mini-batch and soft-updates

## Mini-Batch
- **Definition**: Split the dataset into small subsets (mini-batches) and train on one subset per step instead of on the whole dataset $\implies$ each training step is cheaper, which makes training more efficient.

![Image](./image/Mini-Batch.png)
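
As a rough sketch, a mini-batch can be drawn by sampling a random subset of the stored experiences at each training step. The buffer size, batch size, and tuple layout below are hypothetical, and the figure above may use different numbers.

```python
import random
from collections import deque

# Hypothetical store of past experiences; the oldest entries fall off when full.
buffer = deque(maxlen=10_000)
for i in range(500):
    buffer.append((i, 0, 0.0, i + 1))  # made-up (state, action, reward, next_state)


def sample_minibatch(buffer, batch_size, rng=random):
    """Draw a random subset without replacement to train on for one step."""
    return rng.sample(list(buffer), min(batch_size, len(buffer)))


batch = sample_minibatch(buffer, batch_size=64)
print(len(batch))  # 64
```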

## Soft-Update
- **Definition**: Do not overwrite the old value with the new one directly. Instead, take a weighted sum of the two, so the updated value is influenced by both the old and the new values and changes only gradually.

![Image](./image/Soft-Update.png)
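
A minimal sketch of the weighted sum, assuming the values being updated are model weights stored as a flat list and that the mixing weight is a small constant `TAU`. Both the name and the value `0.01` are assumptions for illustration.

```python
TAU = 0.01  # assumed mixing weight: how much of the new value is blended in


def soft_update(old_weights, new_weights, tau=TAU):
    """Blend old and new weights: tau * new + (1 - tau) * old, element-wise."""
    return [tau * new + (1.0 - tau) * old
            for old, new in zip(old_weights, new_weights)]


old = [1.0, -2.0]  # hypothetical current weights
new = [2.0, 0.0]   # hypothetical freshly trained weights
old = soft_update(old, new)
print(old)  # approximately [1.01, -1.98]
```

With a small `TAU`, a single bad training step can only move the weights slightly, which tends to make learning more stable.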