The full SALMON pipeline consists of three stages: Synthetic Preference Collection, Training the Principle-Following Reward Model, and RL Training with the Principle-Following Reward Model. Our paper provides a detailed description of each stage.
For efficiency reasons, we use the model-parallel scheme from llama when sampling responses from the initial policy model. To prepare sharded model checkpoints of LLaMA or Dromedary on your own machine or cluster, please refer to the inference guide in Dromedary.
We sample two responses from the initial policy model and use the policy model itself to select the preferred response according to a given human-written principle.
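Conceptually, this selection step prompts the policy model with both candidate responses and a single principle, then reads off which response it prefers. The sketch below only illustrates the idea; the prompt template and the `query_model` helper are assumptions for illustration, not the exact implementation in step1_synthetic_preference_collection.

```python
# Illustrative sketch of principle-guided preference selection (not the exact
# prompt used by the repository's scripts). `query_model` is a hypothetical
# helper that returns the policy model's text completion for a given prompt.

PREFERENCE_TEMPLATE = """Consider the following principle:
{principle}

User prompt:
{prompt}

Response (A):
{response_a}

Response (B):
{response_b}

Which response better follows the principle? Answer "(A)" or "(B)":"""


def select_preferred(query_model, principle, prompt, response_a, response_b):
    """Ask the policy model itself which of the two responses it prefers."""
    judgment = query_model(PREFERENCE_TEMPLATE.format(
        principle=principle,
        prompt=prompt,
        response_a=response_a,
        response_b=response_b,
    ))
    # The preference label is whichever option the model names.
    return "A" if "(A)" in judgment else "B"
```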
Before diving into the experiments, please install the llama_dromedary package in Dromedary to enable model parallelism.
Running the code
cd step1_synthetic_preference_collection
python -u clean_oasst1_prompts.py \
--output_file "/path/to/your/oasst1_prompts.json"
Running the code
salloc --nodes 8 --time 6:00:00 --gres=gpu:32g:6 srun bash scripts/generate_oasst1_response0.sh
salloc --nodes 8 --time 6:00:00 --gres=gpu:32g:6 srun bash scripts/generate_oasst1_response1.sh
Running the code
salloc --nodes 1 --time 24:00:00 --gres=gpu:80g:8 srun bash scripts/generate_synthetic_preference.sh
Next, for each user prompt, a subset of principles is randomly sampled from the established principle list, with some principles randomly negated. The user prompt, the two model responses, and the sub-sampled principles are then aggregated into a single training instance for the reward model.
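The sketch below shows how such a training instance might be assembled. The principle texts, negation wording, probabilities, and field names are all assumptions for illustration; the real logic lives in step2_rm_training/aggregate_synthetic_preference.py.

```python
import random

# Hypothetical principle list: (positive wording, negated wording).
PRINCIPLES = [
    ("The response should be concise.",
     "The response should NOT be concise."),
    ("The response should be honest and factual.",
     "The response should NOT be honest and factual."),
]


def build_rm_instance(prompt, responses, preference, k=1, negate_prob=0.2):
    """Sub-sample k principles, randomly negating some, and bundle them with
    the prompt, the two responses, and the step-1 preference label."""
    sampled = random.sample(PRINCIPLES, k=min(k, len(PRINCIPLES)))
    principles = [
        negated if random.random() < negate_prob else positive
        for positive, negated in sampled
    ]
    # NOTE: how the preference label is adjusted when a principle is negated
    # is omitted here; see the repository script for the actual treatment.
    return {
        "prompt": prompt,
        "responses": responses,
        "principles": principles,
        "preference": preference,
    }
```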
Running the code
cd step2_rm_training
python -u aggregate_synthetic_preference.py \
--response_pattern "/path/to/your/oasst1_dromedary2_sft_response*.json" \
--preference_file "/path/to/your/oasst1_dromedary2_sft_preference.json" \
--output_file "/path/to/your/oasst1_dromedary2_sft_aggregated_preference.json"
Running the code
python -u clean_pmp_data.py \
--output_file "/path/to/your/pmp_data.json"
salloc --nodes 1 --time 24:00:00 --gres=gpu:80g:8 srun bash scripts/train_reward_model_70b_qlora_pmp.sh
Running the code
salloc --nodes 1 --time 24:00:00 --gres=gpu:80g:8 srun bash scripts/train_reward_model_70b_qlora_ft.sh
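Both reward-model scripts fine-tune a 70B backbone with QLoRA on pairwise comparisons. As a reminder of what such training typically optimizes, the snippet below shows the standard pairwise ranking loss; it is not extracted from the training scripts, and any deviations in the actual objective are not reflected here.

```python
import torch.nn.functional as F
from torch import Tensor


def pairwise_rm_loss(reward_chosen: Tensor, reward_rejected: Tensor) -> Tensor:
    """Standard pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).

    Both tensors hold the scalar rewards the model assigns to the preferred
    and dispreferred responses, conditioned on the same prompt and the same
    sub-sampled principles.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```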
Finally, we train the policy model with RL against the principle-following reward model, using diverse user prompts drawn from ShareGPT, Dolly-15k, OpenAssistant, OpenOrca, and MATH.
Running the code
cd step3_ppo_training
python subsample_openorca_prompts.py \
--train_data_path "/path/to/your/l1M-GPT4-Augmented.parquet (obtained from OpenOrca)" \
--output_path "/path/to/your/openorca_prompts.json"
python aggregate_sharegpt_prompts.py \
--data_files=zetavg/ShareGPT-Processed,path/to/sg_90k_part1.json,path/to/sg_90k_part2.json (obtained from ShareGPT_Vicuna_unfiltered) \
--output_path "/path/to/sharegpt_prompts.json"
python clean_and_merge_prompts.py \
--sharegpt_prompt_path "/path/to/sharegpt_prompts.json" \
--openorca_prompt_path "/path/to/openorca_prompts.json" \
--output_file "/path/to/your/salmon_merged_prompts.json"
Running the code
salloc --nodes 6 --time 24:00:00 --gres=gpu:80g:8 srun bash scripts/train_ppo_model_70b_qlora_salmon.sh
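Inside the PPO loop, the per-sample reward is typically the principle-conditioned reward-model score combined with a KL penalty toward the frozen initial policy. The sketch below shows this standard RLHF-style shaping for orientation only; the coefficient and function names are assumptions, not the exact recipe in train_ppo_model_70b_qlora_salmon.sh.

```python
from torch import Tensor


def shaped_reward(rm_score: Tensor,
                  logprobs_policy: Tensor,
                  logprobs_ref: Tensor,
                  kl_coef: float = 0.1) -> Tensor:
    """Combine the principle-conditioned RM score with a KL penalty.

    rm_score:        scalar score from the principle-following reward model.
    logprobs_policy: per-token log-probs of the response under the current policy.
    logprobs_ref:    per-token log-probs under the frozen initial policy.
    """
    kl_penalty = kl_coef * (logprobs_policy - logprobs_ref).sum()
    return rm_score - kl_penalty
```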