LLM as a Judge

During this project, we explored using ChatGPT4/gpt-4-0613 as a rater/critic, with mixed results.

- gpt-4-0613 caught only about 50% of basic Japanese language errors in model output compared to manual review by a native speaker (n=100+)
- Neither ChatGPT4 nor gpt-4-0613 could reliably rate the quality of Japanese responses or keep rating categories separate, even on a single-question, multi-turn basis and even with extensive prompting.
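For reference, the judging setup was a single call asking gpt-4-0613 to flag language errors in a candidate answer. The sketch below is illustrative only, assuming the OpenAI Python client (v1+); the rubric text here is a placeholder, not the actual prompt used in testing.

```python
# Minimal sketch of a gpt-4-0613 judge call (OpenAI Python client >= 1.0).
# The system rubric below is a placeholder, not the prompt used in our testing.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_japanese_errors(question: str, answer: str) -> str:
    """Ask gpt-4-0613 to list basic Japanese language errors in an answer."""
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0,  # keep the rating as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You are a strict Japanese language reviewer. "
                        "List every grammatical error, unnatural phrasing, or "
                        "kanji misuse in the answer. If there are none, reply 'OK'."},
            {"role": "user",
             "content": f"質問: {question}\n\n回答: {answer}"},
        ],
    )
    return resp.choices[0].message.content
```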

The Rakuda blog post cites the MT-Bench paper's 83% agreement with human rankings, but that figure was measured only on English ratings and, as far as I know, has never been validated for Japanese. Based on my experience above, I have some doubts.
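For context, "agreement" here is the fraction of pairwise comparisons where the GPT-4 judge picks the same winner (or tie) as the human annotator. Checking this for Japanese output is trivial to compute once you have parallel verdicts; the hard part is collecting the native-speaker labels. A hypothetical sketch:

```python
# Hypothetical sketch: agreement rate between judge and human verdicts on the
# same pairwise comparisons. Each verdict is "A", "B", or "tie".
def agreement_rate(judge_verdicts: list[str], human_verdicts: list[str]) -> float:
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

# e.g. agreement_rate(["A", "tie", "B"], ["A", "B", "B"]) -> 0.667
```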

There are at least two projects, Yuzu.ai's Rakuda Benchmark and JP Stability.ai's Japanese MT-Bench, that basically port the FastChat MT-Bench code for Japanese use.

One thing to note when going through the code is that there are some tricky issues worth pointing out:

- Japanese MT-Bench defaults to a somewhat crazy few-shot English prompt, so you need to be very careful to go through the prompt templates and assign the right one to your model (the sketch after this list shows one way to print the exact prompt a model will receive).
- Rakuda uses each model's default prompt, which is better, but that means, for example, llama-2-chat's insanely neurotic English prompt (and English prompts for all the non-Japanese models). Keep this in mind when looking at their leaderboard: non-JA models are actually even stronger than they appear, as in our testing, swapping to a JA prompt gave a huge boost in Japanese language performance.
- For testing, ratings (or Rakuda's pairwise comparisons) are typically done at temp 0.7 with a single generation per model. For my testing, I set --num_choices=4 to try to counteract the sampling variance, although I believe each model may need its own sampling settings tweaked. Otherwise, using do_sample=False may be the most reliable way to compare models (see the sketch after this list).
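As a concrete example of the first and last points, it's worth printing the exact prompt your model will be fed before running a full benchmark, and comparing against a greedy run. The sketch below assumes FastChat's conversation-template API (`get_conversation_template` / `get_prompt`) and Hugging Face transformers; names and behavior can differ between FastChat versions, and the model path is a placeholder, so treat it as illustrative rather than a drop-in script.

```python
# Sketch: inspect which prompt template FastChat assigns to a model, then
# generate greedily so model-to-model comparisons aren't confounded by sampling.
# FastChat API names may differ between versions; model_path is a placeholder.
import torch
from fastchat.model import get_conversation_template
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-org/your-ja-model"  # placeholder

# 1. Check the template FastChat will use for this model path.
conv = get_conversation_template(model_path)
conv.append_message(conv.roles[0], "日本の首都はどこですか？")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
print(repr(prompt))  # verify this is the JA prompt you expect, not an English default

# 2. Generate with do_sample=False (greedy) instead of temp 0.7 sampling.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```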