
Repos

As we were working across a variety of cloud instances over time, we forked a bunch of the eval repos to track our own fixes:

  • llm-jp-eval - this was just to add bos_token handling and support Qwen testing (trust_remote_code)
    • scripts and outputs are in the eval folder in the main repo
  • Rakuda - we created a template for our model, but it kept barfing on inference, so I ripped the inference code out and replaced it with vLLM, which also ended up being 50X faster (see the sketch after this list)
  • Japanese MT-Bench - the FastChat codebase is super gnarly and I couldn't figure out their prompt-picking algo, so I wedged in what we needed. It's also slow as heck and really needs to be replaced...
    • llm-judge - start of a custom inferencer for MT-Bench-likes
  • gpt4-autoeval - this adapts a GPT-4 autoeval of the ELYZA-tasks-100 benchmark
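
For reference, the inference swap boils down to batch generation through vLLM. Here's a minimal sketch, not the fork's exact code - the model name and prompts are illustrative, and trust_remote_code covers the Qwen case mentioned above:

```python
from vllm import LLM, SamplingParams

# trust_remote_code is needed for models like Qwen that ship custom modeling code
llm = LLM(model="augmxnt/shisa-7b-v1", trust_remote_code=True)
sampling = SamplingParams(temperature=0.1, max_tokens=512)

prompts = [
    "質問: 日本の首都はどこですか?\n回答:",
    "質問: 富士山の高さはどれくらいですか?\n回答:",
]

# vLLM batches the whole prompt list in a single call, which is where
# most of the speedup over sequential HF generate() calls comes from
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())
```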

Review of Evals

llm-jp-eval - as described in its DATASET.md, basically about a dozen reading-comprehension-style tests (more are being added at the moment)

Rakuda is interesting: basically an adaptation of the MT-Bench llm_judge code, but using pairwise comparison ranking instead, and with Japanese-oriented Q&A.
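
As a rough illustration of the pairwise side (not Rakuda's actual implementation, which fits a proper statistical model over the match results), you can already get a crude ranking from the judge's pairwise verdicts with a simple win-rate tally:

```python
from collections import defaultdict

def rank_by_win_rate(matches):
    """Toy pairwise ranking: sort models by overall win rate.

    matches: list of (model_a, model_b, winner) tuples, where winner
    is "a", "b", or "tie".
    """
    wins, games = defaultdict(float), defaultdict(int)
    for model_a, model_b, winner in matches:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            wins[model_a] += 1
        elif winner == "b":
            wins[model_b] += 1
        else:  # count a tie as half a win for both sides
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return sorted(games, key=lambda m: wins[m] / games[m], reverse=True)

print(rank_by_win_rate([
    ("shisa-7b", "youri-7b", "a"),
    ("shisa-7b", "elyza-7b", "tie"),
    ("youri-7b", "elyza-7b", "b"),
]))
```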

Evals to Make

  • Combined Task comparison
  • JA output fluency error rate (char rate)
  • EN/JA leakage testing (char rate) (see the character-rate sketch after this list)
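
Both character-rate checks can start as something as simple as counting Japanese-script characters. A rough sketch - the Unicode ranges and the leakage threshold here are illustrative assumptions, not a settled spec:

```python
import re

# Hiragana, katakana, and CJK unified ideographs (approximate coverage)
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def ja_char_rate(text: str) -> float:
    """Fraction of non-whitespace characters that are Japanese script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(JA_CHARS.match(c)) for c in chars) / len(chars)

# Flag likely EN leakage in a response that should be Japanese
response = "東京は日本の首都です。 Tokyo is the capital of Japan."
rate = ja_char_rate(response)
if rate < 0.9:  # threshold is arbitrary here
    print(f"possible EN leakage: ja_char_rate={rate:.2f}")
```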

We have a prototype Streamlit Eval Server that's focused on letting you run/analyze ratings.
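
The server itself isn't documented here yet, but the general shape is a small Streamlit app that loads judge outputs and lets you slice the ratings. A minimal sketch - the JSONL schema with model/question_id/score fields is a made-up assumption, not the prototype's actual format:

```python
import json

import pandas as pd
import streamlit as st

st.title("Eval Ratings Browser")

# Hypothetical judge output: one JSON object per line with model, question_id, score
uploaded = st.file_uploader("Upload judge output (JSONL)", type=["jsonl"])
if uploaded:
    lines = uploaded.read().decode("utf-8").splitlines()
    df = pd.DataFrame([json.loads(line) for line in lines if line.strip()])
    model = st.selectbox("Model", sorted(df["model"].unique()))
    subset = df[df["model"] == model]
    st.metric("Mean score", f"{subset['score'].mean():.2f}")
    st.bar_chart(subset.groupby("question_id")["score"].mean())
```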

Results

llm-jp-eval

I ran a lot of llm-jp-eval benchmarks as sanity checks and baselines. The original llm-jp-eval sheet started from LLM-jp's original leaderboard, with some validation runs on models from there to make sure our setup was working, but it grew to include a lot more data.

HuggingFace Leaderboard Output

All data is taken from the [HF Leaderboard](https://huggingface.co/open-llm-leaderboard) and can also be found as a spreadsheet.

Our lower MMLU is probably due to this bug.

HF's leaderboard uses a specific version of lm-evaluation-harness; once that bug is fixed, I may revisit and include our corrected benchmark scores if I have time.

| Type | Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|------|-------|---------|-----|-----------|------|------------|------------|-------|
| Base | mistralai/Mistral-7B-v0.1 | 60.97 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 37.83 |
| FT | augmxnt/shisa-7b-v1 | 55.01 | 56.14 | 78.63 | 23.12 | 52.49 | 78.06 | 41.62 |
| FT | mistralai/Mistral-7B-Instruct-v0.1 | 54.96 | 54.52 | 75.63 | 55.38 | 56.28 | 73.72 | 14.25 |
| Base | augmxnt/shisa-base-7b-v1 | 51.64 | 52.30 | 77.63 | 23.12 | 42.40 | 78.53 | 35.86 |
| FT | elyza/ELYZA-japanese-Llama-2-7b-instruct | 49.78 | 53.16 | 78.25 | 47.07 | 39.08 | 73.24 | 7.88 |
| FT | elyza/ELYZA-japanese-Llama-2-7b-fast-instruct | 49.15 | 53.75 | 77.55 | 46.85 | 38.84 | 71.59 | 6.29 |
| Base | elyza/ELYZA-japanese-Llama-2-7b | 48.70 | 52.22 | 76.42 | 44.60 | 37.92 | 72.69 | 8.34 |
| FT | rinna/youri-7b-chat | 48.51 | 51.19 | 76.09 | 46.06 | 41.17 | 75.06 | 1.52 |
| FT | elyza/ELYZA-japanese-Llama-2-7b-fast | 47.67 | 51.88 | 75.46 | 44.34 | 36.45 | 71.59 | 6.29 |
| Base | rinna/youri-7b | 47.11 | 49.06 | 74.89 | 42.22 | 36.03 | 71.82 | 8.64 |
| FT | llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 31.77 | 26.88 | 44.78 | 23.12 | 45.19 | 50.67 | 0.00 |
| FT | llm-jp/llm-jp-13b-instruct-full-jaster-v1.0 | 31.63 | 27.22 | 44.70 | 23.12 | 44.69 | 50.04 | 0.00 |
| Base | cyberagent/open-calm-7b | 28.21 | 20.48 | 30.65 | 25.22 | 44.15 | 48.54 | 0.23 |