# 5. Tasks <a id='5.'></a>

## I. Task 1: use the following algorithms to solve the robot sanding task
  - **PPO** (you can take code from ex1 as a basis)
  - **DDPG** (you can take code from ex6 as a basis)

Implementations from the exercises can be used and extended; otherwise, please implement the algorithms yourself. For learning purposes, you may look at existing implementations on the Internet.

- We do not want to focus on hyperparameter or neural network architecture tuning. Therefore, use the following choices:
  - For PPO, use the neural network policy configuration found in exercise 1 code.
  - For DDPG, use the neural network policy and value function configuration found in exercise 6 code.
  - <span style="color:red">We provide the hyperparameters in the configuration files 'cfg/algo/*'. DO NOT change the parameters there.</span>

- **No copying of code! Code should be original, written by yourself or taken from the exercises.**

- You should extend or modify your basic PPO/DDPG algorithm. Possible extensions are described below, but you can also come up with your own. For each modification or extension, describe it in exactly two sentences and state where the changes can be found in the code (for example, file name and function name, or line numbers).

- More detailed instructions are given in the following sections. If you are not sure what you are allowed to do, contact the TAs, preferably on Zulip so that others can also learn from the question.

**Note: not following the requirements may lead to point deduction or rejection of the project work.**

## II. Task 2: possible extensions to improve performance

After completing the basic PPO/DDPG implementation, try to improve the agent's performance through technical extensions. Several possible extensions are listed below.

**Note 1**: Some of the suggested extensions require more effort than others. If you choose "easier" extensions that require less effort, such as "During training, linearly decrease the log standard deviation of each dimension of the Gaussian policy from the original value to zero", please implement multiple extensions.

**Note 2**: You can also propose your own improvements, but you should cite the references they are based on.

#### PPO
- **Exploration**: Crucial in policy gradient methods. Options include:
  - During training, linearly decrease the log standard deviation of each dimension of the Gaussian policy from the original value to zero.
  - During training, linearly decrease the standard deviation of each dimension of the Gaussian policy from the original value to a small value.
  - Add an entropy bonus to the policy loss or reward function. The strength of the entropy bonus is typically controlled by a parameter $\alpha$. To select $\alpha$, you can, for example:
    - Keep $\alpha$ constant.
    - Employ a schedule for $\alpha$, for example, drive it from a high value (high exploration) to a low one (high exploitation) during training.
    - Employ a schedule for a target entropy.
- **Implementation Techniques**: Several techniques can impact PPO's performance, such as value normalization.
- **Further Reading**:
  1. [The 37 Implementation Details of Proximal Policy Optimization (Huang et al., 2022)](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)
  2. [What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (Andrychowicz et al., 2020)](https://arxiv.org/abs/2006.05990)
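
The exploration options above can be sketched as follows. This is a minimal illustration and not the exercise code: the function names, the start and end values, and `total_steps` are hypothetical, and the entropy formula assumes a Gaussian policy with a diagonal covariance matrix.

```python
import math
from typing import Sequence


def linear_schedule(start: float, end: float, step: int, total_steps: int) -> float:
    """Interpolate linearly from `start` to `end` over `total_steps`, then hold `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def diag_gaussian_entropy(log_std: Sequence[float]) -> float:
    """Entropy of a diagonal Gaussian: sum_i (log_std_i + 0.5 * log(2 * pi * e))."""
    return sum(ls + 0.5 * math.log(2.0 * math.pi * math.e) for ls in log_std)


def add_entropy_bonus(policy_loss: float, entropy: float, alpha: float) -> float:
    """Subtract alpha * entropy from a loss that is minimized, which rewards exploration."""
    return policy_loss - alpha * entropy


# Illustrative values only (assumptions, not taken from the course configuration).
total_steps = 200_000
for step in (0, 100_000, 200_000):
    std = linear_schedule(1.0, 0.1, step, total_steps)      # std: original -> small value
    alpha = linear_schedule(0.01, 0.0, step, total_steps)   # alpha: high -> low
    entropy = diag_gaussian_entropy([math.log(std)] * 2)
    print(step, round(std, 3), round(alpha, 4), round(entropy, 3))
```

Note that a log standard deviation of zero corresponds to a standard deviation of 1, so the two decay options listed above end at different noise levels unless the end values are chosen accordingly.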

#### DDPG
- **Mitigating Value Overestimation**: 
  - Consider the Twin Delayed DDPG (TD3) algorithm to address overestimation bias and training instability. ([Fujimoto et al., 2018](https://arxiv.org/abs/1802.09477))
- **Utilizing Distributional Critics**: 
  - Focus on the entire distribution of value functions for enhanced performance and stability.
  - **QR-DDPG**: Provides robust value estimates through quantile regression. ([Dabney et al., 2018](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17184))
  - **D4PG**: Combines distributional value functions, off-policy training, and actor-critic methods.
  - **IQN for DDPG**: Extends DDPG by representing the full quantile function for the value distribution. ([Dabney et al., 2018](http://proceedings.mlr.press/v80/dabney18a.html))
  - **FQF**: Enhances distributional RL by learning both quantile values and their fractions. ([Yang et al., 2019](https://papers.nips.cc/paper/2019/hash/8fb134e0e6d44a4f95a8bb2d5b2cb1c4-Abstract.html))
- **Enhancing Exploration**: 
  - Explore efficiently and avoid suboptimal solutions with strategies like:
    - **Ornstein-Uhlenbeck Process**: Generates correlated noise, helpful in control tasks with inertia.
    - **Intrinsic Curiosity Module (ICM)**: Encourages exploration through self-supervised prediction. ([Pathak et al., 2017](https://openaccess.thecvf.com/content_cvpr_2017_workshops/w13/html/Pathak_Curiosity-Driven_Exploration_by_CVPR_2017_paper.html))
    - **Random Network Distillation (RND)**: Generates intrinsic rewards based on prediction errors. ([Burda et al., 2018](https://arxiv.org/abs/1810.12894))
    - *Additional Strategies*: Feel free to explore other methods.
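
Two of the ideas above can be sketched in a few lines: Ornstein-Uhlenbeck exploration noise and the clipped double-Q target used by TD3. This is a hedged sketch in plain Python rather than the exercise 6 code; the class and function names, the default parameters, the action bounds of ±1, and the callable stand-ins for the target networks are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


class OUNoise:
    """Ornstein-Uhlenbeck process: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, dim: int, mu: float = 0.0, theta: float = 0.15,
                 sigma: float = 0.2, dt: float = 1e-2, seed: Optional[int] = None):
        self.dim, self.mu, self.theta, self.sigma, self.dt = dim, mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self) -> None:
        """Restart the process at its mean, e.g. at the start of each episode."""
        self.x = [self.mu] * self.dim

    def sample(self) -> List[float]:
        """Advance the process one step; consecutive samples are temporally correlated."""
        self.x = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)


def clip(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def td3_target(reward: float, done: bool, next_state: Sequence[float],
               target_actor: Callable, target_q1: Callable, target_q2: Callable,
               rng: random.Random, gamma: float = 0.99, noise_std: float = 0.2,
               noise_clip: float = 0.5, act_low: float = -1.0, act_high: float = 1.0) -> float:
    """TD3 target: smooth the target action with clipped noise, then take the smaller Q-value."""
    next_action = [
        clip(a + clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip), act_low, act_high)
        for a in target_actor(next_state)
    ]
    q_min = min(target_q1(next_state, next_action), target_q2(next_state, next_action))
    return reward + gamma * (1.0 - float(done)) * q_min
```

Taking the minimum of two target critics is what counteracts overestimation; TD3 additionally delays the actor and target updates relative to the critic updates, which is not shown here.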

#### Model-Based RL
- To improve the basic PPO or DDPG algorithm, you can integrate model learning ([Model-based RL survey](https://arxiv.org/abs/2206.09328)) into the training process, that is, learn a dynamics model and a reward model.
  - **More data**: With a learned model you can generate more data (the 200k sample limit applies only to samples generated using the sanding simulator, not to samples generated using learned models).
  - **Planning and Acting**: Combine planning with learning for more informed decisions. You can use, for example, the cross entropy method (CEM) introduced during the course for planning with the learned model.
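
Planning with a learned model via CEM could look roughly like the sketch below. It assumes that the learned dynamics and reward models are available as plain callables; `toy_dynamics`, `toy_reward`, the goal position, the action bounds, and all default parameters are hypothetical stand-ins and have nothing to do with the sanding simulator.

```python
import random
import statistics
from typing import Callable, List, Sequence


def rollout_return(state: Sequence[float], plan: Sequence[float],
                   dynamics: Callable, reward: Callable, act_dim: int) -> float:
    """Sum of model-predicted rewards when executing a flat action sequence."""
    total = 0.0
    for t in range(0, len(plan), act_dim):
        action = plan[t:t + act_dim]
        total += reward(state, action)
        state = dynamics(state, action)
    return total


def cem_plan(state: Sequence[float], dynamics: Callable, reward: Callable, act_dim: int,
             horizon: int = 3, pop: int = 200, n_elite: int = 20, iters: int = 10,
             init_std: float = 1.0, act_low: float = -1.0, act_high: float = 1.0,
             seed: int = 0) -> List[float]:
    """Cross-entropy method: sample plans, keep the elites, refit the Gaussian, repeat."""
    rng = random.Random(seed)
    n = horizon * act_dim
    mean, std = [0.0] * n, [init_std] * n
    for _ in range(iters):
        plans = [
            [min(act_high, max(act_low, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(pop)
        ]
        plans.sort(key=lambda p: rollout_return(state, p, dynamics, reward, act_dim),
                   reverse=True)
        elites = plans[:n_elite]
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean[:act_dim]  # execute only the first action, then replan


# Toy stand-ins for learned models: a point that should move towards GOAL.
GOAL = (0.5, -0.5)


def toy_dynamics(state, action):
    return [state[0] + action[0], state[1] + action[1]]


def toy_reward(state, action):
    nxt = toy_dynamics(state, action)
    return -((nxt[0] - GOAL[0]) ** 2 + (nxt[1] - GOAL[1]) ** 2)


first_action = cem_plan([0.0, 0.0], toy_dynamics, toy_reward, act_dim=2)
print(first_action)
```

Because every rollout here queries only the model, none of these samples count towards the 200k simulator sample limit mentioned above.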

## III. Hints & Tips

### a) Hints
- The sanding area is 100 units wide and 100 units high; take this into account when sampling x, y coordinates.
- Because the actions are multidimensional, a Gaussian policy must use a multivariate Gaussian distribution or, equivalently for a diagonal covariance matrix, a product of independent univariate Gaussian distributions.
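
The second hint can be made concrete with a small sketch: for a diagonal covariance matrix, the log-probability of a multidimensional action is the sum of the per-dimension univariate Gaussian log-probabilities. The function name and the example values are hypothetical.

```python
import math
from typing import Sequence


def diag_gaussian_log_prob(action: Sequence[float], mean: Sequence[float],
                           log_std: Sequence[float]) -> float:
    """Log-density of a diagonal Gaussian as a sum of univariate log-densities."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)  # standardized distance in this dimension
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


# Two-dimensional action evaluated at the mean with unit standard deviations.
print(diag_gaussian_log_prob([0.3, -0.2], [0.3, -0.2], [0.0, 0.0]))
```

If the implementation uses PyTorch (an assumption about the exercise code), the same quantity is obtained by summing `torch.distributions.Normal(mean, std).log_prob(action)` over the last dimension. Forgetting that sum is a common bug, because the PPO probability ratio then ends up being computed per dimension instead of per action.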

### b) Debugging Tips
- To debug with a fixed seed, call `reset(seed=fixed_seed)` each time you reset the environment. This gives the same initial position in every episode, which makes the task easier to learn. By default, training uses random seeds and policies are evaluated the same way; calling `reset()` without arguments uses a random seed.
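
The effect of a fixed seed can be illustrated with a toy stand-in. `ToyEnv` below is hypothetical and is not the sanding simulator; it only mimics the `reset(seed=...)` convention described above, and the assumption that the initial position is drawn uniformly from the 100 x 100 area is made purely for illustration.

```python
import random
from typing import List, Optional


class ToyEnv:
    """Hypothetical stand-in that only mimics the reset(seed=...) convention."""

    def __init__(self) -> None:
        self._rng = random.Random()

    def reset(self, seed: Optional[int] = None) -> List[float]:
        # Passing a seed re-creates the random generator, so the start is reproducible.
        if seed is not None:
            self._rng = random.Random(seed)
        return [self._rng.uniform(0.0, 100.0), self._rng.uniform(0.0, 100.0)]


env = ToyEnv()
fixed_seed = 0

# Debugging: every episode starts from the same initial position.
fixed_starts = [env.reset(seed=fixed_seed) for _ in range(3)]

# Default behaviour: reset() without a seed keeps drawing new initial positions.
random_starts = [env.reset() for _ in range(3)]
print(fixed_starts, random_starts)
```

A policy that was only debugged with a fixed seed should still be trained and evaluated with random seeds afterwards, since that is the default setting described above.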
