Add A2C #183
Conversation
I added a2c_sentiments.py, as requested by Louis Castricato. Unfortunately, there is a lot of overlap with ppo_sentiments.py, but adding relative imports in the examples folder might increase confusion.
Related issue: #16.
That's fine.
Do you have compute access yet? Or Colab? Can you run the A2C sentiment example and post the wandb run here? I can merge after we verify it runs.
SGD: str = "sgd"

def get_optimizer_class(name: OptimizerName):
    torch_optimizers: Dict[str, type] = dict(
@jon-tow I remember you being somewhat against this?
This should be fine. We don't really care about most of the torch optimizers outside of the current ones. I'm not even sure we should supply RMSProp, as it's never really used for optimizing transformer language models.
Do NOT merge until we verify the wandb runs.
I don't have access to the compute cluster yet. I can run the script after I get home from the university: I have a research computer at home with a GPU that has more VRAM.
Thanks, @shermansiu! I've made some minor change requests. Let's get some reports on this before doing a final review for merging. If you're unable to get resources soon-ish we can run things for you later in the week 👍
examples/ilql_sentiments.py
Outdated
TRLX_PATH = pathlib.Path(__file__).resolve().parent.parent
with TRLX_PATH.joinpath("configs/ilql_config.yml").open() as f:
    default_config = yaml.safe_load(f)
If we want to make this change, we have to do it for every example (not just the sentiment ones) for consistency. Otherwise, revert.
configs/a2c_gptj.yml
Outdated
@@ -0,0 +1,55 @@
train:
Remove the a2c_gptj.yml unless it's tested thoroughly. We don't want folks wasting their compute resources on a large-ish tune with untested hparams. You should be able to get a reasonable signal + sanity checks from the
If you can get us runs in the next 12 hours, we can merge this for 0.4.
The RMSProp optimizer is used in the original A3C/A2C paper (Mnih et al., 2016). As suggested by Huang et al. (2022), we switch to the RMSProp optimizer to "implement" A2C on top of our existing implementation of PPO.
We follow the steps taken by Huang et al. (2022) in implementing A2C as a special case of PPO. Because `scale_reward` is set to False in the existing configurations, advantage normalization is already disabled, so we don't need to remove it. Moreover, entropy regularization is not implemented in trlX's PPO, so we don't need to manually set the entropy coefficient to 0.
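Concretely, the Huang et al. (2022) recipe amounts to a handful of config changes. A sketch of what the A2C-specific settings might look like (key names here are illustrative, not necessarily trlX's exact config schema):

```yaml
train:
  optimizer: rmsprop   # Mnih et al. (2016) use RMSProp for A3C/A2C
method:
  ppo_epochs: 1        # a single update per rollout batch: the importance
                       # ratio stays at 1, so clipping never activates and
                       # the PPO objective reduces to the A2C loss
```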
Flake8 was flagging the `get_optimizer_class` function with C901 (too complex), so I refactored it. Moreover, adding a default PyTorch or `bitsandbytes` optimizer is now a one-liner.
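As a sketch of what that refactor might look like (the registry contents and error handling here are illustrative, not the exact code in this PR):

```python
from typing import Dict, Type

import torch

# Registry of supported optimizers; adding a new torch optimizer
# is a one-line change to this dict.
TORCH_OPTIMIZERS: Dict[str, Type[torch.optim.Optimizer]] = dict(
    adam=torch.optim.Adam,
    adamw=torch.optim.AdamW,
    sgd=torch.optim.SGD,
    rmsprop=torch.optim.RMSprop,
)


def get_optimizer_class(name: str) -> Type[torch.optim.Optimizer]:
    """Look up an optimizer class by name, with a helpful error otherwise."""
    try:
        return TORCH_OPTIMIZERS[name]
    except KeyError:
        supported = ", ".join(sorted(TORCH_OPTIMIZERS))
        raise ValueError(f"Unknown optimizer '{name}'. Supported: {supported}")
```

A flat name-to-class dict keeps the lookup to a single `try`/`except`, which avoids the chained `if`/`elif` branches that trip flake8's C901 complexity check.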
I made a run of A2C: unfortunately, the mean reward is quite volatile compared to that of PPO. I no longer think A2C is a good candidate algorithm for RLHF.
This holds even when I slash the learning rate from 1e-5 to 1e-7: both runs are equally volatile, but 1e-7 has a lower mean reward.
It is the algorithm DeepMind uses for all their RLHF work, and it seems to work well there. There are probably other hyperparameters, etc., that make it work for them. They also use the KL penalty as an auxiliary loss rather than as part of the reward, which makes the RL reward simpler; that may be why A2C works for them.
This is the relevant paper. Muesli is a higher priority item, but I could look at this later. |
This is what was used in the Sparrow paper for A2C: "We extend the RL scheme of Menick et al. (2022); Perez et al. (2022), training a 70B A2C policy using Adafactor (Shazeer and Stern, 2018), a learning rate of 2 × 10−6, an effective batch size of 16, and ℓ2-norm gradient clipping to a max norm of 1.0. Instead of the typical entropy term, we regularise by adding the KL divergence between the RL policy and the initial language model (SFT or Chinchilla) to the loss, with a weight 0.2. To reduce memory usage, we freeze the first 80% of the weights (64/80 transformer layers) to the pretrained values, share parameters between policy and value functions, and train with reduced precision using bfloat16 as in Rae et al. (2021) and stochastic rounding (Gupta et al., 2015). The value function predicts the final reward (without discounting) at each token. We implement the value function as an MLP with two hidden layers of size 2048, which takes as input the final transformer representation at each time step. We shard the models across 64 TPU v3 machines (Shoeybi et al., 2019)".
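The KL-as-auxiliary-loss idea from the excerpt can be sketched as follows (a minimal illustration, assuming per-token logits from the policy and the frozen reference model; the 0.2 weight comes from the quote, and the function name is hypothetical):

```python
import torch
import torch.nn.functional as F


def kl_regularized_loss(
    policy_logits: torch.Tensor,
    ref_logits: torch.Tensor,
    rl_loss: torch.Tensor,
    kl_weight: float = 0.2,
) -> torch.Tensor:
    """Regularize by adding KL(policy || reference) to the RL loss,
    instead of folding the KL penalty into the reward signal."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Per-position KL divergence over the vocabulary, averaged over the batch
    kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1).mean()
    return rl_loss + kl_weight * kl
```

Keeping the KL term out of the reward leaves the reward model's signal untouched, which is the simplification the comment above is pointing at.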
Given how A2C is a special case of PPO (Huang et al., 2022), adding A2C to trlX becomes a matter of adding a few dedicated A2C configurations, as opposed to implementing A2C from scratch.
Some minor refactoring was necessary to get everything to work (and comply with flake8's standards), but overall the changes are quite minimal.
[1] S. Huang, A. Kanervisto, A. Raffin, W. Wang, S. Ontañón, and R. F. J. Dossa, "A2C is a special case of PPO," 2022. https://arxiv.org/pdf/2205.09123.pdf