CUP IT | Hackathon

About

In this task we are required to rank comments in a social network. We decided to use the reward-modeling part of the RLHF approach.

Exploratory Data Analysis

Approach

Pairwise training of a reward model, as used in the InstructGPT paper by OpenAI, rates comments on a given post by learning from the relative differences between comments rather than from an absolute rating scale. This is especially useful when the rating scale is inconsistent or not well-defined: the model only has to learn which of two comments is better, so it relies less on a predefined scale and is better able to generalize to new data.

For each row in the dataset, a list of comments is given with scores from 0 to 4 (0 is best, 4 is worst). We use this data to train a reward model that maps a (post, comment) pair to a reward r. The reward model is trained to predict which comment a human will prefer, using the rewards as logits.
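
As an illustration of how ranked comments can be turned into pairwise comparisons, here is a minimal sketch. The function name `build_pairs` and the `(text, score)` tuple layout are assumptions made for this example, not the repository's actual preprocessing code.

```python
from itertools import combinations


def build_pairs(post, comments):
    """Return (post, chosen, rejected) triples from scored comments.

    `comments` is a list of (text, score) tuples where a lower score
    means a better comment (0 is best, 4 is worst).
    Hypothetical helper, not taken from the repository.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(comments, 2):
        if score_a == score_b:
            continue  # a tie carries no preference signal
        if score_a < score_b:
            pairs.append((post, text_a, text_b))
        else:
            pairs.append((post, text_b, text_a))
    return pairs


example = build_pairs(
    "post text",
    [("c0", 0), ("c1", 1), ("c2", 2), ("c3", 3), ("c4", 4)],
)
print(len(example))  # 5 distinct scores give 10 comparisons
```

Each triple supplies one (chosen, rejected) comparison for the same post, which is the input shape a pairwise loss needs.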

We used the first 2 steps of the RLHF pipeline:

Supervised fine-tuning on the given dataset
Reward model training based on comparisons

Reward model training details

We used the following loss function to train our reward model:

loss = -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
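
To make the behaviour of this loss concrete, the sketch below recomputes it in plain Python without PyTorch. It is a framework-free illustration, not the training code; the names `log_sigmoid` and `pairwise_loss` are chosen for this example. It uses a numerically stable form of log(sigmoid(x)), which in PyTorch is available as `torch.nn.functional.logsigmoid`.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(chosen - rejected)) over all pairs."""
    diffs = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return -sum(log_sigmoid(d) for d in diffs) / len(diffs)


# Equal rewards: the model is undecided, so the loss is log(2).
print(pairwise_loss([0.0], [0.0]))
# Chosen comment scored higher: small loss.
print(pairwise_loss([2.0], [0.0]))
# Chosen comment scored lower: large loss.
print(pairwise_loss([0.0], [2.0]))
```

The loss depends only on the difference between the two rewards, so adding a constant to all rewards leaves it unchanged. This is why the reward scale itself carries no absolute meaning.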

Results

We used the NDCG metric to compare runs.
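
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k for one post. It rests on two assumptions that the source does not confirm: relevance is taken as `worst_score - score` (so the best comment, score 0, gets relevance 4), and the gain is linear in relevance. The evaluation code in this repository may use a different variant.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(true_scores, predicted_rewards, k, worst_score=4):
    """NDCG@k for one post.

    true_scores: ground-truth scores, lower is better (0 is best).
    predicted_rewards: model rewards, higher is better.
    """
    # Assumption: turn "lower is better" scores into relevances.
    relevances = [worst_score - s for s in true_scores]
    # Order comments by predicted reward, best first.
    order = sorted(
        range(len(relevances)),
        key=lambda i: predicted_rewards[i],
        reverse=True,
    )
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


scores = [0, 1, 2, 3, 4]
print(ndcg_at_k(scores, [5.0, 4.0, 3.0, 2.0, 1.0], k=5))  # perfect ordering
print(ndcg_at_k(scores, [1.0, 2.0, 3.0, 4.0, 5.0], k=5))  # reversed ordering
```

A perfect ordering gives 1.0, and any other ordering of distinct relevances gives a lower value.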

We compared 2 main models:

with no additional context
with post context from a Google BigQuery dataset: link

Grouped NDCG with k=5

Paired NDCG with k=2

Weights & Biases Report

https://wandb.ai/aleksey-korshuk/huggingface/reports/CUP-IT-Report--VmlldzozODMzODI0

Train SFT model

deepspeed reward_model/train_sft.py \
  --model_name_or_path gpt2 \
  --dataset_name AlekseyKorshuk/up-it-ds-sft \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --do_train \
  --do_eval \
  --output_dir /tmp/test-clm \
  --push_to_hub

Train reward model

deepspeed reward_model/train_reward_model.py \
  --model_path AlekseyKorshuk/cup-it-ds-sft-pretrained \
  --dataset_path AlekseyKorshuk/cup-it-ds-pairwise \
  --output_dir no-context

Resulting model: https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-no-context

deepspeed reward_model/train_reward_model.py \
  --model_path AlekseyKorshuk/cup-it-ds-sft-pretrained \
  --dataset_path ummagumm-a/cup-it-ds-classification-pairwise-train-val \
  --output_dir with-context

Resulting model: https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-with-context

Inference

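To generate scores for the test dataset:

# create the checkpoint directory first, otherwise wget -O fails if it does not exist
mkdir -p ./rm_checkpoint/no-context/checkpoint-4956
wget https://huggingface.co/AlekseyKorshuk/cup-it-ds-reward-model-no-context/resolve/main/pytorch_model.bin -O ./rm_checkpoint/no-context/checkpoint-4956/pytorch_model.bin
python3 reward_model/inference.py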
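
Since the dataset scores comments from 0 (best) to 4 (worst), reward scores can be mapped back to that format by ranking the comments of each post. The sketch below shows one way to do this; `rewards_to_ranks` is a hypothetical helper, and whether `reward_model/inference.py` performs this step is an assumption.

```python
def rewards_to_ranks(rewards):
    """Map rewards (higher is better) to ranks (0 is best)."""
    # Indices of the comments, sorted from highest to lowest reward.
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    ranks = [0] * len(rewards)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


print(rewards_to_ranks([0.3, 2.1, -1.0, 0.9, 1.5]))  # [3, 0, 4, 2, 1]
```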
