<a href="https://colab.research.google.com/github/George7531/for_machine_learning/blob/main/ml/Learnings/deep_learning_tool_kits/Deep_RL/advantage_actor_critic(aac).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# day 397


## intro:

![fsd](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png)

* Remember the reward hypothesis: any goal can be described as maximizing the expected cumulative reward.
* In policy-based methods such as `Reinforce` (a policy gradient method, which is a subclass of policy-based methods), we used Monte Carlo sampling to estimate the return, which tends to give high variance in the policy gradient estimates.
* Because of that variance, Monte Carlo methods need many samples, which slows training down.
* So Actor-Critic can be an alternative to either a pure policy-based method or a pure value-based method.
* The Actor-Critic method lets the agent combine both worlds: policy-based and value-based methods.
* actor - makes the agent act (policy-based method).
* critic - gives a score/value for the action taken (value-based method).


## our target with this unit:
* To train a robotic arm.


## what gives variance in policy gradient ascent method?

![fssd](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/variance.jpg)
* The return used in policy gradient ascent is unbiased: it is computed from the actual rewards received after taking particular actions, not from estimates.
* The variance comes from the fact that the environment itself is stochastic (random events can occur at any time, increasing the situational entropy) and so is the policy. As a result, different trajectories from the same starting state can lead to very different returns.
* One known way to reduce variance is to increase the number of sampled trajectories, but that raises the computational cost and makes training sample-inefficient.


![fsfs](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/pg.jpg)

1. If the return is high, we will push up the probabilities of the (state, action) combinations.
2. Otherwise, if the return is low, it will push down the probabilities of the (state, action) combinations.


This return R(τ) is calculated using a Monte-Carlo sampling. We collect a trajectory and calculate the discounted return, and use this score to increase or decrease the probability of every action taken in that trajectory. If the return is good, all actions will be “reinforced” by increasing their likelihood of being taken.
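As a minimal sketch of the Monte Carlo return described above (the function name and the toy rewards are made up for illustration), the discounted return of every step in one collected episode can be computed by walking the rewards backwards:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy episode: the only reward arrives at the last step.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```

In Reinforce, each of these returns would then scale the log-probability gradient of the action taken at that step.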


# How to reduce the variance of the Reinforce algorithm?

## Actor critic method:
* We use a policy function to select the action.
* We use a value function to score the action taken in the given state.
* To understand the Actor-Critic, imagine you’re playing a video game. You can play with a friend that will provide you with some feedback. You’re the Actor and your friend is the Critic.

![sdfs](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/ac.jpg)

You don’t know how to play at the beginning, so you try some actions randomly. The Critic observes your action and provides feedback.

Learning from this feedback, you’ll update your policy and be better at playing that game.

On the other hand, your friend (Critic) will also update their way to provide feedback so it can be better next time.

This is the idea behind Actor-Critic. We learn two function approximations:

1. A policy that controls how our agent acts: $\pi_\theta(s)$

2. A value function to assist the policy update by measuring how good the action taken is: $q_w(s,a)$


## process:

Let’s see the training process to understand how the Actor and Critic are optimized:

At each timestep $t$, we get the current state $S_t$ from the environment and pass it as input through our Actor and Critic.

Our Policy takes the state and outputs an action $A_t$.

![ssf](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step1.jpg)


The Critic takes that action also as input and, using $S_t$ and $A_t$, computes the value of taking that action at that state: the Q-value.

![fsf](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step2.jpg)

* The action $A_t$ performed in the environment gives a new state $S_{t+1}$ and a reward $R_{t+1}$.


![sfs](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step3.jpg)

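* Notice that the new state $S_{t+1}$ goes to both the actor and the critic. The reward $R_{t+1}$ is used by the critic to form its TD target; the actor is updated through the critic's Q-value rather than through the reward directly.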

The Actor updates its policy parameters using the Q value.

![sfsf](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step4.jpg)

* Thanks to its updated weight parameters the actor takes $A_{t+1}$ given the new state $S_{t+1}$

* The Critic then updates its value parameters.


![sfsdf](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step5.jpg)
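The five steps above can be sketched for a tiny tabular problem. This is only an illustration under assumptions that are not in the notebook: a softmax policy over per-state action preferences `theta`, a Q-table `q` as the critic, and made-up learning rates `alpha` and `beta`.

```python
import math


def softmax(prefs):
    """Turn action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, q, s, a, r, s_next, a_next, done,
                      alpha=0.1, beta=0.1, gamma=0.99):
    """Update a tabular actor (theta) and critic (q) from one transition."""
    pi = softmax(theta[s])
    q_sa = q[s][a]  # critic's score, read before the critic is updated

    # Actor: for a softmax policy, d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
    for b in range(len(pi)):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha * grad_log_pi * q_sa

    # Critic: move q(s, a) toward the TD target r + gamma * q(s', a').
    target = r if done else r + gamma * q[s_next][a_next]
    td_error = target - q_sa
    q[s][a] += beta * td_error
    return td_error


# Two states, two actions; the critic currently scores (s=0, a=0) at 0.5.
theta = [[0.0, 0.0], [0.0, 0.0]]
q = [[0.5, 0.0], [0.0, 0.0]]
td = actor_critic_step(theta, q, s=0, a=0, r=1.0, s_next=1, a_next=0, done=False)
print(td, softmax(theta[0]), q[0][0])
```

Because the critic's score for the chosen action is positive, the actor raises that action's probability, and because the TD target was higher than the old estimate, the critic raises its Q-value for that pair.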

## Adding Advantage in Actor-Critic (A2C):

We can stabilize learning further by using the Advantage function as Critic instead of the Action value function.

The idea is that the Advantage function calculates the relative advantage of an action compared to the others possible at a state: how taking that action at a state is better compared to the average value of the state. It’s subtracting the mean value of the state from the state action pair:

![sfsd](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/advantage1.jpg)

In other words, this function calculates the extra reward we get if we take this action at that state compared to the mean reward we get at that state.

The extra reward is what’s beyond the expected value of that state.

1. If A(s,a) > 0: our gradient is pushed in that direction.
2. If A(s,a) < 0 (our action does worse than the average value of that state), our gradient is pushed in the opposite direction.


The problem with implementing this advantage function is that it requires two value functions Q(s,a) and V(s). Fortunately, we can use the TD error as a good estimator of the advantage function.


![sfsd](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/advantage2.jpg)
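A minimal sketch of using the one-step TD error as the advantage estimate, so that only a state-value function V(s) is needed. The function name, the `done` handling, and the numbers are illustrative assumptions:

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error r + gamma * V(s') - V(s), used as an estimate of A(s, a)."""
    # At the end of an episode there is no next state to bootstrap from.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# Action did better than the critic expected from this state: positive advantage.
print(td_advantage(reward=1.0, value_s=0.5, value_next=1.0, gamma=0.5))  # 1.0
# Action did worse than expected: negative advantage.
print(td_advantage(reward=0.0, value_s=0.5, value_next=0.0, gamma=0.5))  # -0.5
```

The sign of this value decides the direction in which the gradient is pushed, as in the two cases listed earlier.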


# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym 🤖

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png"  alt="Thumbnail"/>

In this notebook, you'll learn to use A2C with [Panda-Gym](https://github.com/qgallouedec/panda-gym). You're going **to train a robotic arm** (Franka Emika Panda robot) to perform a task:

- `Reach`: the robot must place its end-effector at a target position.

After that, you'll be able **to train in other robotics tasks**.


## Objectives of this notebook 🏆

At the end of the notebook, you will:

- Be able to use **Panda-Gym**, the environment library.
- Be able to **train robots using A2C**.
- Understand why **we need to normalize the input**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.



## Create a virtual display 🔽

During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).

Hence the following cell will install the libraries and create and run a virtual screen 🖥

In [None]:
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7cfb53d12920>

### Install dependencies 🔽

The first step is to install the dependencies, we’ll install multiple ones:
- `gymnasium`
- `panda-gym`: Contains the robotics arm environments.
- `stable-baselines3`: The SB3 deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.
- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.

⏲ The installation can **take 10 minutes**.

In [None]:
!pip install stable-baselines3[extra]
!pip install gymnasium

Collecting stable-baselines3[extra]
  Downloading stable_baselines3-2.3.2-py3-none-any.whl (182 kB)
Collecting gymnasium<0.30,>=0.28.1 (from stable-baselines3[extra])
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting shimmy[atari]~=1.3.0 (from stable-baselines3[extra])
  Downloading Shimmy-1.3.0-py3-none-any.whl (37 kB)
Collecting autorom[accept-rom-license]~=0.6.1 (from stable-baselines3[extra])
  Downloading AutoROM-0.6.1-py3-none-any.whl (9.4 kB)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.6.1->stable-baselines3[extra])
  Downloading AutoROM.accept-rom-license-0.6.1.tar.gz (434 kB)

In [None]:
!pip install huggingface_sb3
!pip install huggingface_hub
!pip install panda_gym

Collecting huggingface_sb3
  Downloading huggingface_sb3-3.0-py3-none-any.whl (9.7 kB)
Installing collected packages: huggingface_sb3
Successfully installed huggingface_sb3-3.0
Collecting panda_gym
  Downloading panda_gym-3.0.7-py3-none-any.whl (23 kB)
Collecting pybullet (from panda_gym)
  Downloading pybullet-3.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (103.2 MB)
Installing collected packages: pybullet, panda_gym
Successfully installed panda_gym-3.0.7 pybullet-3.2.6


## Import the packages 📦

In [None]:
import os

import gymnasium as gym
import panda_gym

from huggingface_sb3 import load_from_hub, package_to_hub

from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.env_util import make_vec_env

from huggingface_hub import notebook_login

## PandaReachDense-v3 🦾

The agent we're going to train is a robotic arm that needs to do controls (moving the arm and using the end-effector).

In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.

In `PandaReach`, the robot must place its end-effector at a target position (green ball).

We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This contrasts with a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.
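To make the dense/sparse difference concrete, here is a small sketch that assumes the reward is based on the Euclidean distance between the end-effector and the target. The `threshold=0.05` value and the `-1`/`0` sparse convention are assumptions for illustration, not values taken from this notebook:

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dense_reward(achieved, desired):
    # Informative at every step: the closer to the target, the closer to 0.
    return -distance(achieved, desired)


def sparse_reward(achieved, desired, threshold=0.05):
    # Only tells the agent whether the task is completed (assumed convention).
    return 0.0 if distance(achieved, desired) <= threshold else -1.0


target = (0.0, 0.0, 0.0)
for position in [(3.0, 4.0, 0.0), (0.3, 0.4, 0.0), (0.0, 0.0, 0.01)]:
    print(position, dense_reward(position, target), sparse_reward(position, target))
```

With the dense version the agent sees its reward improve as it gets closer, while the sparse version stays flat until the target is actually reached.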

Also, we're going to use the *End-effector displacement control*, it means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg"  alt="Robotics"/>


This way **the training will be easier**.



### Create the environment

#### The environment 🎮

In `PandaReachDense-v3` the robotic arm must place its end-effector at a target position (green ball).

In [None]:
env_id = "PandaReachDense-v3"

# Create the env
env = gym.make(env_id)

# Get the state space and action space
s_size = env.observation_space.shape
a_size = env.action_space

In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation

_____OBSERVATION SPACE_____ 

The State Space is:  None
Sample observation OrderedDict([('achieved_goal', array([ 8.738305 , -2.0686479,  3.8928504], dtype=float32)), ('desired_goal', array([-0.9353803, -9.929406 , -7.5904245], dtype=float32)), ('observation', array([-5.1071725 , -9.88583   ,  0.94734216, -8.460135  ,  3.3830638 ,
       -9.177484  ], dtype=float32))])


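The state space prints as `None` because the observation space is a `Dict` space, which has no single shape. The observation space **is a dictionary with 3 different elements**: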
- `achieved_goal`: (x,y,z) the current position of the end-effector.
- `desired_goal`: (x,y,z) the target position for the end-effector.
- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).

Given it's a dictionary as observation, **we will need to use a MultiInputPolicy policy instead of MlpPolicy**.
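Conceptually, a policy for dictionary observations has to turn the three entries into one input vector before feeding a network. The sketch below only illustrates that idea with plain lists; it is not how the library is implemented, and the function name and sample values are made up:

```python
def flatten_dict_observation(obs, keys=("achieved_goal", "desired_goal", "observation")):
    """Concatenate the entries of a dict observation into one flat list of floats."""
    flat = []
    for key in keys:  # fixed key order so the layout is the same at every step
        flat.extend(float(x) for x in obs[key])
    return flat


sample_obs = {
    "achieved_goal": [0.1, 0.0, 0.2],
    "desired_goal": [0.3, -0.1, 0.2],
    "observation": [0.1, 0.0, 0.2, 0.0, 0.0, 0.0],  # position + velocity
}
flat = flatten_dict_observation(sample_obs)
print(len(flat))  # 12
```

A plain `MlpPolicy` expects a single vector like `flat` directly from the environment, which is why it does not fit a dictionary observation.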

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action


 _____ACTION SPACE_____ 

The Action Space is:  Box(-1.0, 1.0, (3,), float32)
Action Space Sample [ 0.8071037   0.44063088 -0.36583576]




A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).

For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.

We also normalize rewards with this same wrapper by adding `norm_reward = True`

[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)

In [None]:
# vectorize the environment
env = make_vec_env(env_id, n_envs=4)

# Adding this wrapper to normalize the observation and the reward
env = VecNormalize(env,
                   norm_obs=True,
                   norm_reward=True,
                   clip_obs=10.0)
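To get a feeling for what such a normalization wrapper does with observations, here is a scalar sketch of running mean/standard deviation normalization with clipping. It is a simplified illustration, not the wrapper's actual code: the class name is hypothetical, it handles one feature at a time, and it uses Welford's online update.

```python
import math


class RunningNormalizer:
    """Keeps a running mean/variance of one feature and normalizes new values."""

    def __init__(self, clip=10.0, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.epsilon = epsilon

    def update(self, x):
        # Welford's online update of mean and squared deviations.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / self.count if self.count > 0 else 1.0
        z = (x - self.mean) / math.sqrt(var + self.epsilon)
        # Clip so that outliers cannot produce huge network inputs.
        return max(-self.clip, min(self.clip, z))


norm = RunningNormalizer(clip=10.0)
for value in [1.0, 2.0, 3.0]:
    norm.update(value)
print(norm.normalize(2.0), norm.normalize(1000.0))  # 0.0 10.0
```

The clipping plays the same role as `clip_obs=10.0` in the cell above: normalized values outside [-10, 10] are cut off.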

### Create the A2C Model 🤖

For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes

To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3).

In [None]:
# Create the A2C model and try to find the best parameters
model = A2C('MultiInputPolicy',env,verbose=1)
model.learn(total_timesteps=1000000)
model.save('a2c_PandaReachDenseV-3')


Streaming output truncated to the last 5000 lines.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 2.86     |
|    ep_rew_mean        | -0.225   |
|    success_rate       | 1        |
| time/                 |          |
|    fps                | 563      |
|    iterations         | 23800    |
|    time_elapsed       | 845      |
|    total_timesteps    | 476000   |
| train/                |          |
|    entropy_loss       | -1.03    |
|    explained_variance | 0.971    |
|    learning_rate      | 0.0007   |
|    n_updates          | 23799    |
|    policy_loss        | -0.00335 |
|    std                | 0.346    |
|    value_loss         | 0.000117 |
------------------------------------

In [None]:
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")

### Evaluate the agent 📈
- Now that our agent is trained, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`

In [None]:
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v3")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

# We need to override the render_mode
eval_env.render_mode = "rgb_array"

# Do not update the normalization statistics at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load("a2c-PandaReachDense-v3")

mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

Mean reward = -0.18 +/- 0.11
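The two numbers printed above are a mean and a standard deviation over evaluation episodes (`evaluate_policy` runs 10 episodes by default). As a rough sketch of that summary step only, with a hypothetical helper and made-up episode rewards:

```python
import statistics


def summarize_episode_rewards(episode_rewards):
    """Mean and (population) standard deviation of per-episode total rewards."""
    mean = statistics.fmean(episode_rewards)
    std = statistics.pstdev(episode_rewards)
    return mean, std


# Hypothetical total rewards of four evaluation episodes.
rewards = [-0.10, -0.30, -0.20, -0.20]
mean_reward, std_reward = summarize_episode_rewards(rewards)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```

A small standard deviation relative to the mean suggests the agent behaves consistently across episodes.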




### Publish your trained model on the Hub 🔥
Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.

📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20


By using `package_to_hub`, as we already mentioned in the former units, **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.

This way:
- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share with the community an agent that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard


To be able to share your model with the community there are three more steps to follow:

1️⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join

2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">

- Copy the token
- Run the cell below and paste the token

In [None]:
notebook_login()
!git config --global credential.helper store


In [None]:
from huggingface_sb3 import package_to_hub

package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id="GeorgeImmanuel/a2c_robotic_arm", # Change the username
    commit_message="Initial commit",
)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

Saving video to /tmp/tmpx3t7v7_5/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmpx3t7v7_5/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmpx3t7v7_5/-step-0-to-step-1000.mp4
Moviepy - Done !
Moviepy - video ready /tmp/tmpx3t7v7_5/-step-0-to-step-1000.mp4
ℹ Pushing repo GeorgeImmanuel/a2c_robotic_arm to the Hugging Face Hub

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/GeorgeImmanuel/a2c_robotic_arm/tree/main/

CommitInfo(commit_url='https://huggingface.co/GeorgeImmanuel/a2c_robotic_arm/commit/ae57fd408d0c3a83da2ae4fbd81d44fc316be802', commit_message='Initial commit', commit_description='', oid='ae57fd408d0c3a83da2ae4fbd81d44fc316be802', pr_url=None, pr_revision=None, pr_num=None)

## Some additional challenges 🏆
The best way to learn **is to try things on your own**! Why not try `PandaPickAndPlace-v3`?

If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**.

PandaPickAndPlace-v1 (this model uses the v1 version of the environment): https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1

And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html

We provide you the steps to train another agent (optional):

1. Define the environment called "PandaPickAndPlace-v3"
2. Make a vectorized environment
3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
4. Create the A2C Model (don't forget verbose=1 to print the training logs).
5. Train it for 1M Timesteps
6. Save the model and the VecNormalize statistics
7. Evaluate your agent
8. Publish your trained model on the Hub 🔥 with `package_to_hub`


In [None]:
# define the environment called 'PandaPickAndPlace-v3'

env_id = 'PandaPickAndPlace-v3'

env = gym.make(env_id)

# vectorize the environment
env = make_vec_env(env_id,n_envs=1)

# normalize the vectorized environment's observations and rewards
env = VecNormalize(env,
                   norm_obs=True,
                   norm_reward=True,
                   clip_obs=10.0)

In [None]:
# create the A2C model and train it for 1M steps
model = A2C('MultiInputPolicy',env,verbose=1)
model.learn(total_timesteps=1000000)


Streaming output truncated to the last 5000 lines.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 48       |
|    ep_rew_mean        | -48      |
|    success_rate       | 0.04     |
| time/                 |          |
|    fps                | 175      |
|    iterations         | 173800   |
|    time_elapsed       | 4965     |
|    total_timesteps    | 869000   |
| train/                |          |
|    entropy_loss       | -2.5     |
|    explained_variance | 0.00695  |
|    learning_rate      | 0.0007   |
|    n_updates          | 173799   |
|    policy_loss        | -0.002   |
|    std                | 0.461    |
|    value_loss         | 8.32e-07 |
------------------------------------

<stable_baselines3.a2c.a2c.A2C at 0x7cfa0f4f4d60>

In [None]:
model.save('a2c_PandaPickAndPlace-v3')
env.save('normalized_env.pkl')

In [None]:
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaPickAndPlace-v3")])
eval_env = VecNormalize.load("normalized_env.pkl", eval_env)

# We need to override the render_mode
eval_env.render_mode = "rgb_array"

# Do not update the normalization statistics at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load("a2c_PandaPickAndPlace-v3")

mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")



Mean reward = -45.00 +/- 15.00


In [None]:
notebook_login()
!git config --global credential.helper store


In [None]:
from huggingface_sb3 import package_to_hub

package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id="GeorgeImmanuel/a2c_PickAndPlaceRobot-v2", # Change the username
    commit_message="Initial commit",
)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

Saving video to /tmp/tmpyjph6_8b/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmpyjph6_8b/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmpyjph6_8b/-step-0-to-step-1000.mp4
Moviepy - Done !
Moviepy - video ready /tmp/tmpyjph6_8b/-step-0-to-step-1000.mp4
ℹ Pushing repo GeorgeImmanuel/a2c_PickAndPlaceRobot-v2 to the Hugging Face Hub

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/GeorgeImmanuel/a2c_PickAndPlaceRobot-v2/tree/main/

CommitInfo(commit_url='https://huggingface.co/GeorgeImmanuel/a2c_PickAndPlaceRobot-v2/commit/66441103f1776af8ba48d5aa2cdfe6bae457d6da', commit_message='Initial commit', commit_description='', oid='66441103f1776af8ba48d5aa2cdfe6bae457d6da', pr_url=None, pr_revision=None, pr_num=None)

# day 401

# in-depth understanding of Bias-Variance tradeoff:

## Bias and variance relationship in supervised learning:
* More bias pushes the model toward generalization; more variance pushes it toward fitting the training data closely.
* Too much bias causes underfitting (too much generalization).
* Too much variance causes overfitting.


## difference between supervised learning and Reinforcement learning:
* In supervised learning the labels are known ahead of time, so the model can try to predict values close to them.
* In reinforcement learning we don't have training data ahead of time because the environment is dynamic and changing. The agent collects samples from its environment through its sensors (Monte Carlo sampling) and tries to maximize the cumulative reward for its actions. A reward can arrive at every timestep: a good action leads to a positive reward and a bad action leads to no reward or a negative one. But unlike in supervised learning, there is no external oracle telling the agent which action is correct and which is wrong.

## value estimates and probability distribution over action space:
![sfs](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*lzpznPvKnSUvLwNfG9pOew.png)
* Rewarding state denoted by yellow star. Value estimates denoted by green spheres. Above: Without credit assignment, only rewarding state is seen as being valuable. Below: By using discounted sums over future rewards, the trajectory toward star has meaningful value estimates.


## gamma, bias, variance, dart game:
A naive approach to an RL learning algorithm would be to encourage actions which were associated with positive rewards, and discourage actions associated with negative rewards. Instead of updating our agent’s policy based on immediate rewards though, we often want to account for actions (and the states of the environment when those actions were taken) which lead up to rewards. For example, imagine walking down a corridor to a rewarding object. It isn’t just the final step we want to perform again, but all the steps up to that rewarding one. There are a number of approaches for doing this, all of which involving doing a form of credit assignment. This means giving some credit to the series of actions which led to a positive reward, not just the most recent action. This credit assignment is often referred to as learning a value estimate: V(s) for state, and Q(s, a) for state-action pair.

We control how rewarding past actions and states are considered to be by using a discount factor (γ, ranging from 0 to 1). Large values of γ lead to assigning credit to states and actions far into the past, while a small value leads to only assigning credit to more recent states and actions. In the case of RL, variance now refers to a noisy, but on average accurate value estimate, whereas bias refers to a stable, but inaccurate value estimate. To make this more concrete, imagine a game of darts. A high-bias player is one who always hits close to the target, but is always consistently off in some direction. A high-variance player, on the other hand, is one who sometimes hits the target, and is sometimes off, but on average near the target.

![fsfs](https://miro.medium.com/v2/resize:fit:828/format:webp/1*aZNcm3rx8EqKIQRDuzKZlA.png)

* Red: True value of a given state/action. Blue: Low-variance, high-bias estimate. Green: low-bias, high-variance estimates.

There is a multitude of ways of assigning credit, given an agent’s trajectory through an environment, each with different amounts of variance or bias. Monte-Carlo sampling of action trajectories as well as Temporal-Difference learning are two classic algorithms used for value estimation, and both are prototypical examples of methods which are variance and bias heavy, respectively.

- conclusion:
    * so, we should give credit not only to the final step that reaches the target but also to the earlier steps of the trajectory that led to it.
    * gamma value, the discount rate (ranging between 0 and 1):
      1. if it is low (close to 0), only the states and actions right before the reward get credit.
      2. if it is high (e.g. 0.98 or 0.99), credit reaches states and actions far back along the trajectory, even when the agent was still far from the target.

    * To estimate the value of a state we use the classic Monte Carlo sampling method or Temporal Difference learning.
    * The Monte Carlo sampling method is laden with high variance.
    * The Temporal Difference method gives a high-bias estimate.
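To make the effect of γ concrete: a reward that arrives k steps in the future is weighted by γ^k. The short sketch below (helper names are made up; `1 / (1 - γ)` is the usual rule of thumb for how many steps "matter") compares a small and a large discount factor:

```python
def discount_weights(gamma, horizon):
    """Weight gamma**k given to a reward that arrives k steps in the future."""
    return [gamma ** k for k in range(horizon)]


def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that carry meaningful weight."""
    return 1.0 / (1.0 - gamma)


print(discount_weights(0.5, 4))  # [1.0, 0.5, 0.25, 0.125]
print(effective_horizon(0.5))    # 2.0
print(round(effective_horizon(0.99)))  # 100
```

With γ = 0.5 a reward four steps away already counts for little, while with γ = 0.99 rewards around a hundred steps away still contribute, so credit reaches much further back along the trajectory.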


## High variance Monte carlo estimate:
In Monte-Carlo (MC) sampling, we rely on full trajectories of an agent acting within an episode of the environment to compute the reinforcement signal. Given a trajectory, we produce a value estimate R(s, a) for each step in the path by calculating a discounted sum of future rewards for each step in the trajectory. The problem is that the policies we are learning (and often the environments we are learning in) are stochastic, which means there is a certain level of noise to account for. This stochasticity leads to variance in the rewards received in any given trajectory. Imagine again the example with the reward at the end of the corridor. Given that an agent’s policy might be stochastic, it could be the case that in some trajectories the agent is able to walk to the rewarding state at the end, and in other trajectories it fails to do so. These two kinds of trajectories would provide very different value estimates, with the former suggesting the end of the corridor is valuable, and the latter suggesting it isn’t. This variance is typically mitigated by using a large number of action trajectories, with the hope that the variance introduced in any one trajectory will be reduced in aggregate, and provide an estimate of the “true” reward structure of the environment.

![sfsf](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*3wb1I1Hgl6jtb5gogeySeg.png)
* Monte-Carlo Estimate of Reward Signal. t refers to time-step in the trajectory. r refers to reward received at each time-step.
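The discounted sum in the figure above can be sketched in a few lines. This assumes a toy 4-step corridor with two hand-written trajectories (one that reaches the reward, one that does not); the reward lists and the name `returns_to_go` are illustrative only:

```python
def returns_to_go(rewards, gamma):
    """Monte Carlo return G_t for every step t, accumulated backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

gamma = 0.9
success_rewards = [0.0, 0.0, 0.0, 1.0]   # agent walks to the end of the corridor
failure_rewards = [0.0, 0.0, 0.0, 0.0]   # agent never gets there

g_success = returns_to_go(success_rewards, gamma)
g_failure = returns_to_go(failure_rewards, gamma)

# Same start state, very different value estimates: this gap is the variance.
print("success:", [round(g, 3) for g in g_success])
print("failure:", [round(g, 3) for g in g_failure])
```

Both trajectories start in the same state, yet one says the start is worth about 0.73 and the other says it is worth 0, which is exactly the disagreement the paragraph describes.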

## High Bias Temporal Difference:
On the other end of the spectrum is one-step Temporal Difference (TD) learning. In this approach, the reward signal for each step in a trajectory is composed of the immediate reward plus a learned estimate of the value at the next step. By relying on a value estimate rather than a Monte-Carlo rollout there is much less stochasticity in the reward signal, since our value estimate is relatively stable over time. The problem is that the signal is now biased, due to the fact that our estimate is never completely accurate. In our corridor example, we might have some estimate of the value of the end of the corridor, but it may suggest that the corridor is less valuable than it actually is, since our estimate may not be able to distinguish between it and other similar unrewarding corridors. Furthermore, in the case of Deep Reinforcement Learning, the value estimate is often modeled using a deep neural network, making things worse. In Deep Q-Networks for example, the Q-estimates (value estimates over actions) are computed using an old copy of the network (a “target” network), which will provide “older” Q-estimates, with a very specific kind of bias, relating to the belief of an outdated model.

![fsds](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*Nma9I0Ot4TyzoL7rmqFatQ.png)
Temporal-Difference Estimate of Reward Signal. r refers to reward at time t. V(s) refers to parameterized value estimate.
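As a rough sketch of the one-step TD signal, assuming the same toy corridor and a critic whose numbers are made up (and deliberately too low) to show where the bias comes from; `td_target` and `critic_values` are hypothetical names:

```python
def td_target(reward, next_value, gamma, done):
    """One-step TD target: r_t + gamma * V(s_{t+1}); no bootstrap on a terminal step."""
    if done:
        return reward
    return reward + gamma * next_value

gamma = 0.9
rewards = [0.0, 0.0, 0.0, 1.0]           # +1 only when the goal is reached
critic_values = [0.2, 0.3, 0.4, 0.5]     # hypothetical V(s_t), underestimating the corridor

targets = []
for t in range(len(rewards)):
    done = (t == len(rewards) - 1)
    next_value = 0.0 if done else critic_values[t + 1]
    targets.append(td_target(rewards[t], next_value, gamma, done))

# Returns the agent actually collects on this successful trajectory.
true_returns = [gamma ** (len(rewards) - 1 - t) for t in range(len(rewards))]

print("TD targets  :", [round(x, 3) for x in targets])
print("true returns:", [round(x, 3) for x in true_returns])
```

The targets for the early steps inherit the critic's error, so they sit below the returns that were really obtained. They would also be the same on every visit with the same reward, which is the low-variance side of the trade.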

#### conclusion:
* The trajectory an agent takes affects whether it reaches the target. For example, if the agent is allowed 15 steps per episode, every trajectory has at most 15 steps, but there are many possible paths through the environment, so the agent reaches the target in some episodes and not in others. The returns therefore differ from episode to episode. This spread in the outcome is the variance.
* Averaging the returns of a bunch of trajectories does reduce the variance (much like the time synchronous averaging method we learned in signal processing), but the noise only shrinks roughly with the square root of the number of trajectories, so it takes a lot of samples to help much.

## Approaches to Balancing Bias and Variance
Now that we understand bias and variance and their causes, how do we address them? There are a number of approaches which attempt to mitigate the negative effect of too much bias or too much variance in the reward signal. I am going to highlight a few of the most commonly used approaches in modern systems such as Proximal Policy Optimization (PPO), Asynchronous Advantage Actor-Critic (A3C), Trust Region Policy Optimization (TRPO), and others.


### 1. Advantage Learning
One of the most common approaches to reducing the variance of an estimate is to employ a baseline which is subtracted from the reward signal to produce a more stable value. Many of the baselines chosen fall into the category of Advantage-based Actor-Critic methods, which utilize both an actor which defines the policy, and a critic (often a parameterized value estimate) which provides a lower-variance reward signal to update the actor. The thinking goes that variance can simply be subtracted out from a Monte-Carlo sample (R/Q) using a more stable learned value function V(s) in the critic. This value function is typically a neural network, and can be learned using either Monte-Carlo sampling, or Temporal difference (TD) learning. The resulting Advantage A(s, a) is then the difference between the two estimates. This advantage estimate has the other nice property of corresponding to how much better the agent actually performed than was expected on average, thus allowing for intuitively interpretable values.

![sfs](https://miro.medium.com/v2/resize:fit:828/format:webp/1*NUi91JHh4YhcGi7vRBp9HQ.png)

Advantage Estimate Equation. Pi refers to the current policy. Q(s, a) here refers to Monte-Carlo sampled reward signal analogous to R(s, a), rather than a learned estimate. V(s) refers to parameterized value estimate.
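A small sketch of the advantage as "sampled return minus critic baseline", assuming the toy corridor again and hypothetical critic values that sit between the success and failure returns (all numbers and helper names here are made up for illustration):

```python
def returns_to_go(rewards, gamma):
    """Monte Carlo return G_t for every step t, accumulated backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages_from_returns(returns, values):
    """A(s_t, a_t) = G_t - V(s_t): how much better than expected the step turned out."""
    return [g - v for g, v in zip(returns, values)]

gamma = 0.9
critic_values = [0.4, 0.45, 0.5, 0.55]   # hypothetical V(s_t) from the critic

adv_success = advantages_from_returns(returns_to_go([0.0, 0.0, 0.0, 1.0], gamma), critic_values)
adv_failure = advantages_from_returns(returns_to_go([0.0, 0.0, 0.0, 0.0], gamma), critic_values)

print("advantage (reached goal):", [round(a, 3) for a in adv_success])
print("advantage (missed goal) :", [round(a, 3) for a in adv_failure])
```

Positive values mean the actions did better than the critic expected and should be made more likely; negative values mean the opposite.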

### 2. Generalized Advantage Estimate

We can also arrive at advantage functions in other ways than employing a simple baseline. For example, the value function can be applied to directly smooth the reinforcement signal obtained from a series of trajectories. The Generalized Advantage Estimate (GAE), introduced by John Schulman and colleagues in 2016, does just this. The GAE formulation allows for an interpolation between pure TD learning and pure Monte-Carlo sampling using a lambda parameter. By setting lambda to 0, the algorithm reduces to TD learning, while setting it to 1 produces Monte-Carlo sampling. Values in-between (particularly those in the 0.9 to 0.999 range) produce better empirical performance by trading off the bias of V(s) with the variance of the trajectory.

![fsd](https://miro.medium.com/v2/resize:fit:828/format:webp/1*RwKSm0KeX1Vkm3-ew17Zeg.png)
Generalized Advantage Estimate under two edge cases which reduce to: TD Learning (Above), and MC Sampling (Below).
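The two edge cases can be checked numerically with a short sketch of the usual backward GAE recursion. The function name `gae`, the argument `last_value` and the toy numbers are assumptions for illustration, not code from the GAE paper or from any library:

```python
def gae(rewards, values, last_value, gamma, lam):
    """Generalized Advantage Estimate for one trajectory.

    values[t] is V(s_t); last_value is V of the state after the final step
    (use 0.0 when the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages

gamma = 0.9
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.6, 0.7]   # hypothetical critic outputs

adv_td = gae(rewards, values, last_value=0.0, gamma=gamma, lam=0.0)   # pure TD error
adv_mc = gae(rewards, values, last_value=0.0, gamma=gamma, lam=1.0)   # MC return minus V
adv_mix = gae(rewards, values, last_value=0.0, gamma=gamma, lam=0.95)

print("lambda=0   :", [round(a, 3) for a in adv_td])
print("lambda=1   :", [round(a, 3) for a in adv_mc])
print("lambda=0.95:", [round(a, 3) for a in adv_mix])
```

With `lam=0.0` each entry is just the one-step TD error, and with `lam=1.0` the TD errors telescope into the Monte Carlo return minus `V(s_t)`, matching the two cases in the figure.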


### 3.  Value-Function Bootstrapping

Outside of calculating an advantage function, the bias-variance trade-off presents itself when deciding what to do at the end of a trajectory when learning. Instead of waiting for an entire episode to complete before collecting a trajectory of experience, modern RL algorithms often break experience batches down into smaller sub-trajectories, and use a value-estimate to bootstrap the Monte-Carlo signal when that trajectory doesn’t end with the termination of the episode. By using a bootstrap signal, that estimate can contain information about the rewards the agent might have gotten, if it continued going to the end of the episode. It is essentially a guess about how the episode will turn out from that point onward. Take again our example of the corridor. If we are using a time horizon for our trajectories that ends halfway through the corridor, and if our value estimate reflects the fact that there is a rewarding state at the end, we will be able to assign value to the early part of the corridor, even though the agent didn’t experience the reward. As one might expect, the longer the trajectory length we use, the less frequently value estimates are used for bootstrapping, and thus the greater the variance (and lower the bias). In contrast, using short trajectories means relying more on the value estimate, creating a more biased reinforcement signal. By deciding how long the trajectory needs to be before cutting it off and bootstrapping it, we can propagate the reward signal in a more efficient way, but only if we get the balance right.

![sfsf](https://miro.medium.com/v2/resize:fit:828/format:webp/1*zSWwqPzzptvquHrTppZqYg.png)
Arrow corresponds to agent trajectory. Above: The value estimate at time step 3 is used to bootstrap trajectory value estimate. Below: no bootstrapping is used, and no value is assumed for these states.
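The two rows of the figure can be imitated with a sketch that cuts the corridor trajectory off after three unrewarded steps. The value 0.8 for the cut-off state and the name `truncated_returns` are made-up assumptions:

```python
def truncated_returns(rewards, gamma, bootstrap_value):
    """Returns for a sub-trajectory, seeded with an estimate of what comes after it."""
    returns = [0.0] * len(rewards)
    running = bootstrap_value      # stands in for the rewards we did not wait to see
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

gamma = 0.9
first_half_rewards = [0.0, 0.0, 0.0]   # trajectory ends halfway, no reward seen yet
value_at_cutoff = 0.8                  # hypothetical critic estimate V(s_3)

with_bootstrap = truncated_returns(first_half_rewards, gamma, bootstrap_value=value_at_cutoff)
without_bootstrap = truncated_returns(first_half_rewards, gamma, bootstrap_value=0.0)

print("with bootstrap   :", [round(g, 4) for g in with_bootstrap])
print("without bootstrap:", [round(g, 4) for g in without_bootstrap])
```

With the bootstrap, the early corridor states receive value even though the agent never saw the reward; without it they are all scored 0. The price is that any error in the assumed `value_at_cutoff` flows straight into every return.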

#### conclusion:
* Bootstrapping in RL means updating an estimate from another estimate: when a sub-trajectory is cut off before the episode ends, the critic's value V(s) of the last state reached stands in for the rewards that were not observed.
* This is not the statistical bootstrap, which resamples the same sample many times (m times) with replacement, so that the same number can repeat and each resample has the same length as the original sample (an original of size 3 gives resamples of size 3). The two only share a name.
* The trajectory length before the cut-off sets the trade-off: shorter sub-trajectories lean more on V(s) (more bias, less variance), while longer ones lean more on sampled rewards (less bias, more variance).
* I think that instead of waiting for the value of every step up to the actual target, cutting the trajectory well before the target and bootstrapping lets the agent update sooner, which saves compute time and makes for faster decision-making systems.



## Beautiful explanation for bootstrapping: n-step Bootstrapping
If the Monte Carlo method overfits (high variance) and the TD method underfits (high bias), then it is natural to consider the middle ground. We don't want to consider only a single step, but we also don't want to consider every step until the end of the episode, so we consider n steps. This is called n-step bootstrapping.

![sfsf](https://www.endtoend.ai/assets/blog/misc/bias-variance-tradeoff-in-reinforcement-learning/n_step.png)

At the cost of an extra hyperparameter n, the n-step bootstrapping method often works better than pure Monte Carlo or one-step TD.
![sfs](https://www.endtoend.ai/assets/blog/misc/bias-variance-tradeoff-in-reinforcement-learning/rms_nstep.png)
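A minimal sketch of the n-step return, assuming the toy corridor and hypothetical critic values used for illustration (the function name `n_step_return` is not from any library):

```python
def n_step_return(rewards, values, t, n, gamma):
    """n real rewards starting at step t, then a bootstrap from V(s_{t+n}) if the episode is not over.

    values[k] is V(s_k) for the non-terminal states; the terminal state is worth 0.
    """
    episode_length = len(rewards)
    end = min(t + n, episode_length)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < episode_length:
        g += gamma ** (end - t) * values[end]   # bootstrap from the critic
    return g

gamma = 0.9
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.6, 0.7]   # hypothetical V(s_t)

one_step = n_step_return(rewards, values, t=0, n=1, gamma=gamma)    # same as the TD target
two_step = n_step_return(rewards, values, t=0, n=2, gamma=gamma)
full_mc = n_step_return(rewards, values, t=0, n=4, gamma=gamma)     # same as the MC return

print(f"n=1: {one_step:.3f}  n=2: {two_step:.3f}  n=4: {full_mc:.3f}")
```

Small `n` trusts the critic more, large `n` trusts the sampled rewards more, and once `n` reaches the end of the episode no bootstrap term is left.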

## conclusion:
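* In the plot, 4-step bootstrapping gives the least RMS (root mean square) error, at an x-axis value of around 0.35-0.40. The x-axis there appears to be the learning rate (step size) α rather than the GAE mix of Monte Carlo and Temporal Difference, with each curve showing a different n.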
* Bias and variance are a significant problem in Reinforcement Learning, as they can slow down the agent's learning. For simple environments, the Monte Carlo method or the Temporal Difference method works well enough, but for complex environments n-step bootstrapping can significantly boost learning.




# state value function and action value function:
* $ v_{\pi}(s_t) $ - state value function
* $ q_{\pi}(s_t,a_t) $ - action value function.