
Repos

As we were working across a variety of cloud instances over time, we forked a bunch of the eval repos to track our own fixes:

  • llm-jp-eval - this was just to add bos_token handling and support Qwen testing (trust_remote_code)
    • scripts and outputs are in the eval folder in the main repo
  • Rakuda - we created a template for our model, but it kept barfing on inference, so I ripped the inference code out and replaced it with vLLM, which also ended up being 50X faster (see the sketch after this list)
  • Japanese MT-Bench - the FastChat codebase is super gnarly and I couldn't figure out their prompt-picking algo, so I wedged in what we needed. It's also slow as heck and really needs to be replaced...
    • llm-judge - start of a custom inferencer for MT-Bench-likes
  • gpt4-autoeval - this adapts a GPT-4 autoeval of the ELYZA-tasks-100 benchmark
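
For reference, the inference swap boils down to batch generation through vLLM. Here's a minimal sketch, not the fork's exact code - the model name and prompts are illustrative, and trust_remote_code covers the Qwen case mentioned above:

```python
from vllm import LLM, SamplingParams

# trust_remote_code is needed for models like Qwen that ship custom modeling code
llm = LLM(model="augmxnt/shisa-7b-v1", trust_remote_code=True)
sampling = SamplingParams(temperature=0.1, max_tokens=512)

prompts = [
    "質問: 日本の首都はどこですか?\n回答:",
    "質問: 富士山の高さはどれくらいですか?\n回答:",
]

# vLLM batches the whole prompt list in a single call, which is where
# most of the speedup over sequential HF generate() calls comes from
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())
```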

Review of Evals

llm-jp-eval - as described in its DATASET.md, basically about a dozen reading-comprehension-style tests (more are being added at the moment)

Rakuda is interesting: basically an adaptation of the MT-Bench llm_judge code, but using pairwise comparison ranking instead, and with Japanese-oriented Q&A.
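
As a rough illustration of the pairwise side (not Rakuda's actual implementation, which fits a proper statistical model over the match results), you can already get a crude ranking from the judge's pairwise verdicts with a simple win-rate tally:

```python
from collections import defaultdict

def rank_by_win_rate(matches):
    """Toy pairwise ranking: sort models by overall win rate.

    matches: list of (model_a, model_b, winner) tuples, where winner
    is "a", "b", or "tie".
    """
    wins, games = defaultdict(float), defaultdict(int)
    for model_a, model_b, winner in matches:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            wins[model_a] += 1
        elif winner == "b":
            wins[model_b] += 1
        else:  # count a tie as half a win for both sides
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return sorted(games, key=lambda m: wins[m] / games[m], reverse=True)

print(rank_by_win_rate([
    ("shisa-7b", "youri-7b", "a"),
    ("shisa-7b", "elyza-7b", "tie"),
    ("youri-7b", "elyza-7b", "b"),
]))
```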

Evals to Make

  • Combined Task comparison
  • JA output fluency error rate (char rate)
  • EN/JA leakage testing (char rate) (see the character-rate sketch after this list)
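
Both character-rate checks can start as something as simple as counting Japanese-script characters. A rough sketch - the Unicode ranges and the leakage threshold here are illustrative assumptions, not a settled spec:

```python
import re

# Hiragana, katakana, and CJK unified ideographs (approximate coverage)
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def ja_char_rate(text: str) -> float:
    """Fraction of non-whitespace characters that are Japanese script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(JA_CHARS.match(c)) for c in chars) / len(chars)

# Flag likely EN leakage in a response that should be Japanese
response = "東京は日本の首都です。 Tokyo is the capital of Japan."
rate = ja_char_rate(response)
if rate < 0.9:  # threshold is arbitrary here
    print(f"possible EN leakage: ja_char_rate={rate:.2f}")
```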

We have a prototype Streamlit Eval Server that's focused on letting you run/analyze ratings.
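
The server itself isn't documented here yet, but the general shape is a small Streamlit app that loads judge outputs and lets you slice the ratings. A minimal sketch - the JSONL schema with model/question_id/score fields is a made-up assumption, not the prototype's actual format:

```python
import json

import pandas as pd
import streamlit as st

st.title("Eval Ratings Browser")

# Hypothetical judge output: one JSON object per line with model, question_id, score
uploaded = st.file_uploader("Upload judge output (JSONL)", type=["jsonl"])
if uploaded:
    lines = uploaded.read().decode("utf-8").splitlines()
    df = pd.DataFrame([json.loads(line) for line in lines if line.strip()])
    model = st.selectbox("Model", sorted(df["model"].unique()))
    subset = df[df["model"] == model]
    st.metric("Mean score", f"{subset['score'].mean():.2f}")
    st.bar_chart(subset.groupby("question_id")["score"].mean())
```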

Results

llm-jp-eval

I ran a lot of llm-jp-eval benchmarks as sanity checks and baselines. The original llm-jp-eval sheet started from LLM-jp's original leaderboard, with some validation runs on models from there to make sure our setup was working, but it grew to include a lot more data.

HuggingFace Leaderboard Output

All data is taken from the [HF Leaderboard](https://huggingface.co/open-llm-leaderboard) and can also be found as a spreadsheet.

Our lower MMLU is probably due to this bug.

HF's leaderboard uses a specific version of lm-evaluation-harness; once that bug is fixed, I may revisit and include our corrected benchmark scores if I have time.

| Type | Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|------|-------|---------|-----|-----------|------|------------|------------|-------|
| Base | mistralai/Mistral-7B-v0.1 | 60.97 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 37.83 |
| FT | augmxnt/shisa-7b-v1 | 55.01 | 56.14 | 78.63 | 23.12 | 52.49 | 78.06 | 41.62 |
| FT | mistralai/Mistral-7B-Instruct-v0.1 | 54.96 | 54.52 | 75.63 | 55.38 | 56.28 | 73.72 | 14.25 |
| Base | augmxnt/shisa-base-7b-v1 | 51.64 | 52.30 | 77.63 | 23.12 | 42.40 | 78.53 | 35.86 |
| FT | elyza/ELYZA-japanese-Llama-2-7b-instruct | 49.78 | 53.16 | 78.25 | 47.07 | 39.08 | 73.24 | 7.88 |
| FT | elyza/ELYZA-japanese-Llama-2-7b-fast-instruct | 49.15 | 53.75 | 77.55 | 46.85 | 38.84 | 71.59 | 6.29 |
| Base | elyza/ELYZA-japanese-Llama-2-7b | 48.70 | 52.22 | 76.42 | 44.60 | 37.92 | 72.69 | 8.34 |
| FT | rinna/youri-7b-chat | 48.51 | 51.19 | 76.09 | 46.06 | 41.17 | 75.06 | 1.52 |
| FT | elyza/ELYZA-japanese-Llama-2-7b-fast | 47.67 | 51.88 | 75.46 | 44.34 | 36.45 | 71.59 | 6.29 |
| Base | rinna/youri-7b | 47.11 | 49.06 | 74.89 | 42.22 | 36.03 | 71.82 | 8.64 |
| FT | llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 31.77 | 26.88 | 44.78 | 23.12 | 45.19 | 50.67 | 0.00 |
| FT | llm-jp/llm-jp-13b-instruct-full-jaster-v1.0 | 31.63 | 27.22 | 44.70 | 23.12 | 44.69 | 50.04 | 0.00 |
| Base | cyberagent/open-calm-7b | 28.21 | 20.48 | 30.65 | 25.22 | 44.15 | 48.54 | 0.23 |