Hello,
I tried to use reinforcement learning (the trlx library) to train a FLAN-T5 model. The examples I have seen cover sentiment analysis and summarization, but my case is a generic text-generation task. To try it on my own dataset, I started from the config ppo_config_cnn_daily.yml.

The result I got is below, and the model behaves as if it were doing summarization:

Prompt: [some text] ... from 1978, excluding finishing years incoming 62
Output: years incoming 62 incoming 62 incoming 62 incoming 62 incoming 62 incoming 62 incoming 62 incoming 62 incoming 62 incoming
I'm using this code with my own prompts:

import trlx

trlx.train(
    reward_fn=reward_fn,          # scores each batch of generated samples
    prompts=train_prompts,        # prompts used for the PPO rollouts
    eval_prompts=valid_prompts,   # held-out prompts for evaluation
    config=config,
)
I have two problems, maybe caused by the summarization setup:

First, the repetition at the end of the output.
Second, the output is very different from what I expect.
Should I change the code, or are there examples for generic tasks?

My configuration is:
gen_kwargs:
  max_new_tokens: 100
gen_experience_kwargs:
  max_new_tokens: 100
  do_sample: True
  temperature: 1.0
  top_k: 50
  top_p: 0.95

That works well for summarization. Any suggestions on how to change these parameters for a generic text-to-text task rather than summarization?
The problem seems to be that the example configuration file you used (ppo_config_cnn_daily.yml) is designed for a summarization task and may not suit a generic text-generation task, so you may need to adjust the configuration file and hyperparameters.

Here are some suggestions for adjusting the configuration file and hyperparameters for a generic text generation task:

1. Adjust the maximum sequence length: in your gen_kwargs and gen_experience_kwargs sections you have set max_new_tokens to 100, the maximum number of tokens generated per prompt. Depending on your task, you may need to raise or lower this value (see the config sketch after this list).

2. Adjust the sampling parameters: in gen_kwargs you have set temperature to 1.0, top_k to 50, and top_p to 0.95. These control the randomness and diversity of the generated text, and you may need to tune them for your task. Hugging Face generate() also accepts anti-repetition options such as no_repeat_ngram_size and repetition_penalty, which can address the looping you describe; the sketch after this list includes them.

3. Use a different reward function: the right reward function depends on the goal of your task. For example, for product descriptions you might reward relevance and accuracy; for creative writing you might reward creativity and originality. A minimal sketch of the expected interface is at the end of this reply.

4. Consider using a different architecture: FLAN-T5 may not be optimal for your specific text-generation task. You may want to experiment with other pre-trained models to see which works best; the config sketch below shows where the model is selected.
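As a concrete starting point for points 1, 2, and 4, here is a minimal sketch. It assumes a recent trlx version where TRLConfig.load_yaml, config.model.model_path, config.model.model_arch_type, and config.method.gen_kwargs are available; the file path, model name, and specific values are placeholders to experiment with, not recommendations:

from trlx.data.configs import TRLConfig

# Load the summarization example config as a base
# (adjust the path to where the file lives in your checkout).
config = TRLConfig.load_yaml("configs/ppo_config_cnn_daily.yml")

# Point 4: choose the model. FLAN-T5 is an encoder-decoder,
# so trlx needs model_arch_type "seq2seq".
config.model.model_path = "google/flan-t5-base"   # placeholder model
config.model.model_arch_type = "seq2seq"

# Points 1 and 2: generation settings used for rollouts.
# no_repeat_ngram_size and repetition_penalty are standard
# Hugging Face generate() kwargs that often curb loops like
# "incoming 62 incoming 62 ...".
config.method.gen_kwargs.update(
    max_new_tokens=64,        # match the typical length of your targets
    do_sample=True,
    temperature=0.7,          # lower = less random
    top_k=50,
    top_p=0.95,
    no_repeat_ngram_size=3,   # forbid repeating any 3-gram
    repetition_penalty=1.2,   # >1.0 discourages repeated tokens
)

If you keep a separate gen_experience_kwargs section, apply the same anti-repetition settings there as well, since that is what generates the PPO experience.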

Overall, the key is to experiment with different configurations and hyperparameters until you find the combination that works best for your specific text generation task.
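For point 3, the reward function just maps a batch of generated strings to one scalar per sample. A minimal sketch, assuming the trlx reward_fn interface that receives samples (newer versions also pass prompts= and outputs= as keyword arguments, so the **kwargs keeps the signature compatible); the distinct-token heuristic is a hypothetical placeholder, not a recommendation:

from typing import List

def reward_fn(samples: List[str], **kwargs) -> List[float]:
    # Placeholder heuristic: penalize repetitive outputs by rewarding
    # the ratio of distinct tokens. Replace with a task-specific score
    # (similarity to a reference, a classifier score, a ROUGE/BLEU
    # metric, ...).
    rewards = []
    for text in samples:
        tokens = text.split()
        if not tokens:
            rewards.append(0.0)
            continue
        rewards.append(len(set(tokens)) / len(tokens))
    return rewards

A repetition-penalizing term like this targets your first problem directly; shaping the rest of the reward toward the outputs you expect is what addresses the second.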