# More Dynamic Objective Surfaces for Harder Optimization Problems

<center><img src=images/erm_2.png width=35%></center>

Everything traces back to **Empirical Risk Minimization**, or its equivalent for objective functions rather than loss functions. It is the most foundational modern unconstrained optimization method. Neural networks, reinforcement learning algorithms, and so on just build fancier, more complicated, and more intelligent ways (using memory, using systems, ...) of finding such an optimum on the objective function's surface. All of them are some form of **unconstrained global search optimization**.

<center><img src=images/erm_1.png width=30%></center>

## PPO Main Training Loop
Demystifying TorchRL: all components are customizable and adjustable.

The main training loop in TorchRL is very simple and direct, and can be transcribed almost directly from the PPO paper:
- Step MDP
- Collect random sample data using the **probabilistic policy player** with DataCollector -> ReplayBuffer
    - Compute reward targets via the **TD Bellman equation**
    - Compute the advantage (GAE) from the ReplayBuffer data
        - Loop over the collected data to compute loss values
        - Backpropagate (gradient descent on the value network using MSE, with Adam) -> GAE gets a better understanding of what is good
        - Optimize the policy by maximizing the ppo_clip objective (which uses GAE) -> move in the direction that GAE points at (Adam)
            - Forward pass: calculate the loss
            - Backward pass: gradient update of the network, then zero out the gradients
            - To preserve creativity with alignment: PPO Loss = Value Loss + Policy Loss + Entropy Loss
        - Repeat
    - Repeat
- Repeat
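
As a rough illustration of the advantage step and the combined loss in the list above, here is a minimal pure-Python sketch. It is not TorchRL's implementation: `compute_gae`, `total_ppo_loss`, and the default values for `gamma`, `lam`, `value_coef`, and `entropy_coef` are assumptions chosen for illustration, and the trajectory is assumed to have no early termination.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] and values[t] are the reward and the value estimate at step t;
    last_value is the value estimate of the state reached after the last step.
    Returns (advantages, value_targets).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    # Targets for the value network: advantage plus the old value estimate.
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets


def total_ppo_loss(policy_objective, value_loss, entropy,
                   value_coef=0.5, entropy_coef=0.01):
    """Combine the three terms into one scalar to minimize.

    The clipped objective and the entropy are maximized, so they enter
    with a minus sign; the value loss is minimized directly.
    The coefficients are hypothetical defaults, not values from the source.
    """
    return -policy_objective + value_coef * value_loss - entropy_coef * entropy
```

With `gamma=1` and `lam=1` the advantage reduces to the undiscounted reward-to-go minus the value estimate, and with `lam=0` it reduces to the one-step TD error, which is a quick way to sanity-check the recursion.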

Algorithm link: https://pytorch.org/rl/tutorials/coding_ppo.html

<center><img src=images/ppo.png width=70%></center>

## PPO Loss Functions & Full Algorithm Representation:
PPO Algorithm: https://spinningup.openai.com/en/latest/algorithms/ppo.html

This is the full representation of the PPO Algorithm
<center><img src=images/ppo_3.svg width=60%></center>
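
For reference, the two update rules in the PPO-Clip pseudocode on the Spinning Up page linked above can be written out as follows. This is a transcription from that page, so treat the notation as an assumption if the figure differs: $\mathcal{D}_k$ is the set of trajectories collected with policy $\pi_{\theta_k}$, $T$ is the trajectory length, and $\hat{R}_t$ is the reward-to-go.

$$
\theta_{k+1} = \arg\max_{\theta} \frac{1}{|\mathcal{D}_k| T} \sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^{T} \min\left( \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} A^{\pi_{\theta_k}}(s_t,a_t),\; g(\epsilon, A^{\pi_{\theta_k}}(s_t,a_t)) \right)
$$

$$
g(\epsilon, A) = \begin{cases} (1+\epsilon) A & A \geq 0 \\ (1-\epsilon) A & A < 0 \end{cases}
$$

$$
\phi_{k+1} = \arg\min_{\phi} \frac{1}{|\mathcal{D}_k| T} \sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^{T} \left( V_{\phi}(s_t) - \hat{R}_t \right)^2
$$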

### Policy Network Optimization
This representation is, in some sense, an **off-policy** flavor of PPO: it uses the **`importance sampling ratio`**, **`clipping`**, and **`KL divergence`** to take a more conservative approach to updates:

1. The **importance sampling ratio** $\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}$ compares the probability of the sampled action under the current policy with its probability under the previous policy. It indicates how much more or less likely the new policy is to take action $a$ in state $s$ than the previous policy, and it is the quantity that gets bounded to **keep the change in behavior from being too drastic**.

2. $g(\epsilon, A^{\pi_{\theta_k}}(s,a))$ is the clipped counterpart of the ratio-weighted advantage: it uses the advantage function $A^{\pi_{\theta_k}}(s,a)$ and a clipping parameter $\epsilon$ to cap how much the objective can gain from changing the ratio, which **prevents the new policy from moving too far away from the old policy**.
    - We take the $\min$ of the two terms, which makes the approach more **conservative**.

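3. In the Spinning Up pseudocode linked above, $\mathcal{D}_k$ is the set of trajectories collected with policy $\pi_{\theta_k}$, not a divergence term. **KL divergence** enters PPO separately: it is a statistical measure of how big a change the update caused in the policy, used either as a penalty (PPO-Penalty) or as an early-stopping check -> bigger changes are punished more or cut short.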

4. Recall that the original **policy gradient** (gradient search method) is about maximizing the **expected reward**. The expectation here is estimated by an empirical sample mean over a finite batch of trajectories collected under the previous policy from the replay buffer.

5. Now think of it as an optimization problem, **`abstracting the problem into a search problem`**: we are trying to find the maximum of a non-convex surface formed by this empirical objective function. The policy network's parameters are updated in the direction of **stochastic gradient ascent**, with the gradients themselves computed by **backpropagation**.
    - ***Training a neural network is itself an optimization problem***
    - A neural network is essentially a bunch of **weights**, an upgraded linear regression or perceptron. These weights have a non-linear relationship with the objective function, which forms a high-dimensional risk surface, and **gradient search** performs unconstrained optimization on that surface.
    - The objective (or risk) function is guided **dynamically** by the advantage function.
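
The ratio, the clipping, and the $\min$ described in the list above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the TorchRL or Spinning Up implementation: `ppo_clip_objective` is a hypothetical helper, the inputs are per-sample log-probabilities and advantages, and `eps=0.2` is an assumed default.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Empirical mean of the clipped surrogate objective (to be maximized).

    new_logp[i] / old_logp[i]: log-probability of sampled action i under the
    current / previous policy. advantages[i]: its advantage estimate.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Importance sampling ratio pi_theta(a|s) / pi_theta_k(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

For example, with a ratio of 1.5, a positive advantage of 1, and `eps=0.2`, the term is capped at 1.2, so the policy gains nothing from pushing the ratio further. With a negative advantage the unclipped term is kept when it is the worse one, so a bad move is never hidden by the clip.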

### Value Network Gradient Descent
On the other hand, the value network is updated via **`Mean Squared Error`**:

1. It compares a target built from **newly** sampled rewards along a **different path generated by the current policy** from the **current state** (which, following the default **max Bellman update** idea, tries to push the reward as high as possible) with the currently estimated value of that state:

    - ***One interpretation***: this is where the assumption behind the Bellman equation comes in. The Bellman update always assumes moving to the better next state, so comparing the current state's estimate with a freshly sampled **theoretically best current** value (a new sample with $\max_a Q(s,a)$) makes sense. **Under this view, updating this way is meant to help the agent collect more reward -> helps expand its understanding of the MDP**

    - ***Second interpretation***: from a search perspective, this is **expanding the understanding of the MDP**: it keeps reshaping the value network's **"memory"** toward the direction with higher reward -> somewhat like a **learned or adaptive heuristic**.

2. In the Spinning Up pseudocode, $\mathcal{D}_k$ is the set of collected trajectories rather than a KL divergence term; dividing by $|\mathcal{D}_k| T$ turns the summed squared errors into a mean over all sampled time steps.

3. We try to minimize this value, usually with the Adam optimizer, which here performs **stochastic gradient descent** (the mirror image of the ascent used for the policy).
    - Again, think of it as an optimization problem, **`abstracting the problem into a search problem`**: we are trying to find the minimum of a non-convex surface formed by this empirical loss function.
    - The actual gradient computation for the parameter update happens in **backpropagation**.
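
The value update described above can be sketched as a regression of the value estimates onto sampled returns. This is a minimal sketch, assuming the target is the discounted reward-to-go along the sampled path; `rewards_to_go` and `value_mse_loss` are hypothetical helper names and `gamma=0.99` is an assumed default.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of rewards from each step to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def value_mse_loss(predicted_values, targets):
    """Mean squared error between value estimates and their targets."""
    squared_errors = [(v - r) ** 2 for v, r in zip(predicted_values, targets)]
    return sum(squared_errors) / len(targets)
```

In a real training loop the predicted values would come from the value network, and an optimizer such as Adam would step the network's weights along the negative gradient of this loss.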

## Imitation Learning With Brax
- Repository from Charles: https://github.com/charles-zhng/Brax-Imitation
- Repository from Talmo's Lab: https://github.com/talmolab/VNL-Brax-Imitation
- MOCAP Data Set: https://drive.google.com/file/d/10WbPKUr9_1vH0c5KwuqpvdIlcRGEeE2k/view?usp=drive_link

<center><img src=images/imitation_pipeline_7.png width=90%></center>

## Brax Interaction + Network Training
<center><img src=images/brax_env_imitation_4.png width=70%></center>

## PPO Train Detailed Breakdown
<center><img src=images/ppo_train_breakdown_3.png width=70%></center>

## PPO Train Hyperparameter Tuning
<center><img src=images/ppo_hyperparameter.png width=70%></center>

## Brax Environment Functions in Relation to the Networks
<center><img src=images/brax_function_2.png width=60%></center>