The full SALMON pipeline consists of three stages: Synthetic Preference Collection, Training the Principle-Following Reward Model, and RL Training with the Principle-Following Reward Model. Our paper provides a detailed description of each stage.
For efficiency reasons, we use the model-parallel scheme from llama when sampling responses from the initial policy model. To prepare sharded model checkpoints of LLaMA or Dromedary on your own machine or cluster, please refer to the inference guide in Dromedary.
We sample two responses from the initial policy model and use the policy model itself to select the preferred response according to a given human-written principle.
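Conceptually, this selection step prompts the policy model with both candidate responses and a single principle, then reads off which response it prefers. The sketch below only illustrates the idea; the prompt template and the `query_model` helper are assumptions for illustration, not the exact implementation in step1_synthetic_preference_collection.

```python
# Illustrative sketch of principle-guided preference selection (not the exact
# prompt used by the repository's scripts). `query_model` is a hypothetical
# helper that returns the policy model's text completion for a given prompt.

PREFERENCE_TEMPLATE = """Consider the following principle:
{principle}

User prompt:
{prompt}

Response (A):
{response_a}

Response (B):
{response_b}

Which response better follows the principle? Answer "(A)" or "(B)":"""


def select_preferred(query_model, principle, prompt, response_a, response_b):
    """Ask the policy model itself which of the two responses it prefers."""
    judgment = query_model(PREFERENCE_TEMPLATE.format(
        principle=principle,
        prompt=prompt,
        response_a=response_a,
        response_b=response_b,
    ))
    # The preference label is whichever option the model names.
    return "A" if "(A)" in judgment else "B"
```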
Before diving into the experiments, please install the llama_dromedary package in Dromedary to enable model parallelism.
Running the code
cd step1_synthetic_preference_collection
python -u clean_oasst1_prompts.py \
--output_file "/path/to/your/oasst1_prompts.json"
Running the code
salloc --nodes 8 --time 6:00:00 --gres=gpu:32g:6 srun bash scripts/generate_oasst1_response0.sh
salloc --nodes 8 --time 6:00:00 --gres=gpu:32g:6 srun bash scripts/generate_oasst1_response1.sh
Running the code
salloc --nodes 1 --time 24:00:00 --gres=gpu:80g:8 srun bash scripts/generate_synthetic_preference.sh
Next, for each user prompt, a subset of principles is randomly sampled from the established principle list, with some principles randomly negated. The user prompt, the two model responses, and the sub-sampled principles are then aggregated into a single training instance for the reward model.
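The sketch below shows how such a training instance might be assembled. The principle texts, negation wording, probabilities, and field names are all assumptions for illustration; the real logic lives in step2_rm_training/aggregate_synthetic_preference.py.

```python
import random

# Hypothetical principle list: (positive wording, negated wording).
PRINCIPLES = [
    ("The response should be concise.",
     "The response should NOT be concise."),
    ("The response should be honest and factual.",
     "The response should NOT be honest and factual."),
]


def build_rm_instance(prompt, responses, preference, k=1, negate_prob=0.2):
    """Sub-sample k principles, randomly negating some, and bundle them with
    the prompt, the two responses, and the step-1 preference label."""
    sampled = random.sample(PRINCIPLES, k=min(k, len(PRINCIPLES)))
    principles = [
        negated if random.random() < negate_prob else positive
        for positive, negated in sampled
    ]
    # NOTE: how the preference label is adjusted when a principle is negated
    # is omitted here; see the repository script for the actual treatment.
    return {
        "prompt": prompt,
        "responses": responses,
        "principles": principles,
        "preference": preference,
    }
```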
Running the code
cd step2_rm_training
python -u aggregate_synthetic_preference.py \
--response_pattern "/path/to/your/oasst1_dromedary2_sft_response*.json" \
--preference_file "/path/to/your/oasst1_dromedary2_sft_preference.json" \
--output_file "/path/to/your/oasst1_dromedary2_sft_aggregated_preference.json"
Running the code
python -u clean_pmp_data.py \
--output_file "/path/to/your/pmp_data.json"
salloc --nodes 1 --time 24:00:00 --gres=gpu:80g:8 srun bash scripts/train_reward_model_70b_qlora_pmp.sh
Running the code
salloc --nodes 1 --time 24:00:00 --gres=gpu:80g:8 srun bash scripts/train_reward_model_70b_qlora_ft.sh
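Both reward-model scripts fine-tune a 70B backbone with QLoRA on pairwise comparisons. As a reminder of what such training typically optimizes, the snippet below shows the standard pairwise ranking loss; it is not extracted from the training scripts, and any deviations in the actual objective are not reflected here.

```python
import torch.nn.functional as F
from torch import Tensor


def pairwise_rm_loss(reward_chosen: Tensor, reward_rejected: Tensor) -> Tensor:
    """Standard pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).

    Both tensors hold the scalar rewards the model assigns to the preferred
    and dispreferred responses, conditioned on the same prompt and the same
    sub-sampled principles.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```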
Finally, we train the policy model with RL against the principle-following reward model, using diverse user prompts drawn from ShareGPT, Dolly-15k, OpenAssistant, OpenOrca, and MATH.
Running the code
cd step3_ppo_training
python subsample_openorca_prompts.py \
--train_data_path "/path/to/your/l1M-GPT4-Augmented.parquet (obtained from OpenOrca)" \
--output_path "/path/to/your/openorca_prompts.json"
python aggregate_sharegpt_prompts.py \
--data_files=zetavg/ShareGPT-Processed,path/to/sg_90k_part1.json,path/to/sg_90k_part2.json (obtained from ShareGPT_Vicuna_unfiltered) \
--output_path "/path/to/sharegpt_prompts.json"
python clean_and_merge_prompts.py \
--sharegpt_prompt_path "/path/to/sharegpt_prompts.json" \
--openorca_prompt_path "/path/to/openorca_prompts.json" \
--output_file "/path/to/your/salmon_merged_prompts.json"
Running the code
salloc --nodes 6 --time 24:00:00 --gres=gpu:80g:8 srun bash scripts/train_ppo_model_70b_qlora_salmon.sh
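Inside the PPO loop, the per-sample reward is typically the principle-conditioned reward-model score combined with a KL penalty toward the frozen initial policy. The sketch below shows this standard RLHF-style shaping for orientation only; the coefficient and function names are assumptions, not the exact recipe in train_ppo_model_70b_qlora_salmon.sh.

```python
from torch import Tensor


def shaped_reward(rm_score: Tensor,
                  logprobs_policy: Tensor,
                  logprobs_ref: Tensor,
                  kl_coef: float = 0.1) -> Tensor:
    """Combine the principle-conditioned RM score with a KL penalty.

    rm_score:        scalar score from the principle-following reward model.
    logprobs_policy: per-token log-probs of the response under the current policy.
    logprobs_ref:    per-token log-probs under the frozen initial policy.
    """
    kl_penalty = kl_coef * (logprobs_policy - logprobs_ref).sum()
    return rm_score - kl_penalty
```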