
A Review of Public Japanese Training Sets


(And some comments on their impact on Japanese LLM performance)

Review

The original starting point for our work was to see if we could collect and use public Japanese training data. This is what we found:

Japanese Pretrain

| Rows | Tokens | Size | Name | URL |
| --- | --- | --- | --- | --- |
| 90.3K | | 1.5G | MC4 Japanese Data | https://huggingface.co/datasets/Atom007/mc4-japanese-data/tree/main |
| 1.62M | | 3G | OSCAR | https://huggingface.co/datasets/oscar/viewer/unshuffled_deduplicated_ja/train<br>https://huggingface.co/datasets/oscar/tree/refs%2Fconvert%2Fparquet/unshuffled_deduplicated_ja/partial-train |
| | | 2.74G | Wikipedia JA 20220808 | https://huggingface.co/datasets/inarikami/wikipedia-japanese |
| | | 300G | CulturaX JA (mC4 + OSCAR) | https://huggingface.co/datasets/uonlp/CulturaX/tree/main/ja |

EN/JA

| Rows | Tokens | Size | Name | URL |
| --- | --- | --- | --- | --- |
| | | | Seamless | https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md<br>https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-jpn.tsv.gz |
| 1.12K | | | xtreme_s | https://huggingface.co/datasets/google/xtreme_s/viewer/covost2.ja.en/train |
| 49.4K | | | JA Anthropic/hh-rlhf | https://huggingface.co/datasets/kunishou/hh-rlhf-49k-ja<br>see also: https://huggingface.co/datasets/fujiki/japanese_hh-rlhf-49k<br>https://huggingface.co/datasets/shi3z/anthropic_hh_rlhf_japanese |
| 50K | | | Snow Simplified | https://huggingface.co/datasets/snow_simplified_japanese_corpus |

Japanese Instruct

| Rows | Tokens | Size | Name | URL |
| --- | --- | --- | --- | --- |
| 1.05K | | | Newschat | https://huggingface.co/datasets/fujiki/newschat-with-impression<br>https://github.com/fukanarita/newschat-with-impression |
| 500 | | | Sakura | https://huggingface.co/datasets/saldra/sakura_japanese_dataset |
| 52K | | | JA Alpaca | https://huggingface.co/datasets/fujiki/japanese_alpaca_data<br>https://github.com/masa3141/japanese-alpaca-lora<br>Cleaned version? https://github.com/shi3z/alpaca_ja<br>gpt-3.5-turbo translation: https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-japanese |
| 111K | | | JA Guanaco | https://huggingface.co/datasets/fujiki/guanaco_ja<br>https://github.com/masa3141/japanese-alpaca-lora |
| 1.35M | | | izumi-lab/llm-japanese-dataset wikipedia tasks | https://huggingface.co/datasets/fujiki/llm-japanese-dataset_wikipedia |
| 84.3K | | | izumi-lab/llm-japanese-dataset snow tasks | https://huggingface.co/datasets/fujiki/llm-japanese-dataset_snow |
| 4.84K | | | khalidalt/tydiqa-goldp | https://huggingface.co/datasets/khalidalt/tydiqa-goldp/viewer/japanese/train |
| 7.11K | | | csebuetnlp/xlsum | https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/japanese/train |

Analysis

Our original intuition was that the relatively poor conversational performance of previously released JA models was due to the use of dated (Alpaca/HH-RLHF) fine-tuning training sets that have since been superseded in the English-speaking open source community, and that we could simply augment the training data by adapting some of the newer training sets available and converting them to Japanese (this is how Jon Durbin, creator of the extremely popular and well-regarded Airoboros training sets, initially got involved).

While doing conversational benchmarking, however, we found some alarming discrepancies. Even some of the strongest benchmark-scoring models, like the just-released Japanese Stable LM Beta Instruct 70B, a Llama 2 70B-based model pre-trained on an additional 100B JA tokens (~50X the number of JA tokens in Llama 2's original training set), had problems replying with correct and natural Japanese to even the most basic chat prompts. In fact, in our basic native-speaker testing, it performed about on par with Qwen-14B-Chat and significantly worse than XWin-LM-70B-V0.1, both of which had no additional JA pre-training or fine-tuning at all! What's going on?

One pattern we noticed is that most JA models have largely used the same fine-tuning datasets for their instruction tunes: a combination of translations of Alpaca, Databricks' Dolly 15K, Anthropic HH-RLHF, unfiltered OASST, and non-conversational data (wikinews Q&As, benchmark train sets, etc.). Setting aside the fact that the wider open source fine-tuning community has largely moved on from these early tuning sets (Dolly has such bangers as "Why mobile is bad for human," OASST is 54% non-English, or, well, just review the actual data and you'll find copious oddities), it turns out that, on top of that, the Japanese versions used are actually derived from a single set of (very flawed) translations.

Let's do a deep dive into one of them, kunishou/hh-rlhf-49k-ja (here's the list of HF models that mark themselves as using this dataset):

| Original | kunishou/hh-rlhf-49k-ja | Notes | GPT-4 Translation | Notes |
| --- | --- | --- | --- | --- |
| First of all, I would like to know what NASA stands for. | まず第一に、私はNASAが何を意味しているのか知りたいです。 | | まず最初に、NASAの意味を知りたいと思います。 | |
| NASA, the National Aeronautics and Space Administration, has an excellent website that answers a lot of questions about NASA. This might answer what you’re looking for: “The National Aeronautics and Space Administration (NASA) is the agency of the United States government responsible for the nation’s civilian space program and aeronautics and aerospace research. The agency’s vision is to “pioneer the future in space exploration, scientific discovery and aeronautics research.”” | 国立航空宇宙局であるNASAには、NASAに関する多くの質問に答える優れたウェブサイトがあります。これはあなたが探しているものに答えるかもしれません: 「国立航空宇宙局(NASA)は、米国の民間宇宙プログラムと航空航空航空宇宙研究を担当する米国政府の機関です。エージェンシーのビジョンは、「宇宙探査、科学的発見、航空学の研究における未来の先駆者」です。」 | * 3 repetitions (航空航空航空)<br>* "Agency" not translated properly | 国立航空宇宙局(NASA)は、多くのNASAに関する質問に答える素晴らしいウェブサイトを持っています。これによりあなたが探している答えが見つかるかもしれません:「国立航空宇宙局(NASA)は、アメリカ政府の機関であり、国の民間宇宙プログラムと航空宇宙研究を担当しています。この機関のビジョンは、"宇宙探査、科学的発見、航空研究の未来を開拓する"です。」 | Super natural, way better. |
| Do you have some ideas for making hamburgers more unique such as spices and toppings? | スパイスやトッピングなど、ハンバーガーをよりユニークにするためのアイデアはありますか? | | スパイスやトッピングを使ってハンバーガーをよりユニークにするためのアイデアは何かありますか? | |
| Well I’ve actually got a bunch of suggestions for you that will make your hamburgers more delicious! You might want to try adding some parmesan cheese, onions, or other flavored toppings. | さて、私は実際に**あなたのためにあなたのためにあなたのためにあなたのためにあなたのために**もっと美味しかったです!パルメザンチーズ、玉ねぎ、またはその他の風味のあるトッピングを追加してみてください。 | * "I actually" followed by a 5X repetition of "for you"<br>* then says "it was tastier", which is also wrong | 実は、ハンバーガーをもっと美味しくするためのたくさんの提案がありますよ!パルメザンチーズや玉ねぎ、その他の風味豊かなトッピングを追加してみてはいかがでしょうか? | Natural, great. |
| What is the difference between the FBI and CIA? | FBIとCIAの違いは何ですか? | | FBIとCIAの違いは何ですか? | |
| The FBI and CIA are both intelligence agencies, which means they are responsible for gathering and analyzing information about national security topics. Both were created by the National Security Act of 1947, but today the FBI is primarily responsible for domestic investigations, while the CIA’s focus is on overseas missions. The FBI is an abbreviation of “Federal Bureau of Investigation”, while the CIA stands for “Central Intelligence Agency”. | FBIとCIAはどちらも諜報機関です。つまり、国家安全保障のトピックに関する情報の収集と分析を担当しています。どちらも1947年の国家安全保障法によって作成されましたが、今日ではFBIは主に国内の調査を担当していますが、CIAの焦点は海外任務に焦点を当てています。FBIは「連邦捜査局」の略語であり、CIAは「中央情報局」の略です。 | | FBIとCIAは両方とも情報機関であり、国家安全保障に関する情報の収集と分析を担当しています。両機関は1947年の国家安全保障法により設立されましたが、現在ではFBIが主に国内の調査を、CIAが海外のミッションを中心として担当しています。FBIは「Federal Bureau of Investigation」(連邦捜査局)の略で、CIAは「Central Intelligence Agency」(中央情報局)を意味します | |
| What is the best way to learn money management? | お金の管理を学ぶための最良の方法は何ですか? | | お金の管理を学ぶ最善の方法は何でしょうか? | |
| I think the best way to learn about money management is to get your basic finances in order first. In order to manage money well, you have to have at least a basic understanding of what you’re managing. | お金の管理について学ぶ最良の方法は、最初に基本的な財政を順番に取得することだと思います。お金をうまく管理するには、少なくともあなたが管理しているものを基本的に理解する必要があります。 | * Translates "get" literally, as in "receive", which makes the sentence mean something completely different (and also quite confusing) | 私は、金銭管理について学ぶ最善の方法は、まず基本的な財務状況を整理することだと思います。金銭をうまく管理するためには、少なくとも管理しているものが何であるかを基本的に理解していなければなりません。 | Actually understandable for a Japanese person, idiom translated correctly. |
| What do you believe is the secret to a successful marriage? | 結婚を成功させるための秘密は何だと思いますか? | | あなたが思う結婚生活の成功の秘訣は何ですか? | Question phrased better in Japanese. |
| The secret to a successful marriage is to respect each other’s quirks, accept that we all have flaws, forgive and be patient with each other’s mistakes, and be conscious of your responsibilities as a spouse. | 結婚の成功の秘secretは、お互いの癖を尊重し、私たち全員が欠陥を持っていることを受け入れ、お互いの間違いを許し、忍耐強く、配偶者としてのあなたの責任を意識することです。 | * "secret" left half-translated (秘secret); in Japanese it is 秘密 | 成功した結婚生活の秘訣は、お互いの個性を尊重し、私たち全員には欠点があることを受け入れ、お互いの間違いを許し、忍耐強く接し、配偶者としての責任を自覚することです。 | Much more natural. |

As you can see from the manual annotations/notes, 4 out of the first 5 items have errors, in many cases major errors. As a point of reference, we've also included a gpt-4-0613 translation, which in this case had no errors and was judged to be distinctly more natural Japanese in almost every case. Here, ChatGPT-4 (Turbo) analysis of the items was also able to correctly identify the errors (its critiques are not always reliable); either way, we can be fairly confident that its translations are of generally high quality.

Also, having validated ChatGPT-4's ability to critique these translations, we performed a GPT-assisted analysis of a couple of the kunishou/databricks-dolly-15k-ja (HF models) items and, again, a similar analysis of the kunishou/oasst1-89k-ja (HF models) items.
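For reference, here is a minimal sketch of this kind of GPT-assisted critique pass using the OpenAI Python SDK. The prompt wording and the `critique_translation` helper are illustrative assumptions for this example, not the exact prompts used for the linked analyses:

```python
# Minimal sketch of a GPT-assisted translation critique pass.
# Assumes the openai v1 Python SDK and OPENAI_API_KEY in the environment;
# the prompt below is illustrative, not the exact prompt used in our reviews.
from openai import OpenAI

client = OpenAI()

def critique_translation(source_en: str, translation_ja: str,
                         model: str = "gpt-4-0613") -> str:
    """Ask the model to flag repetitions, mistranslations, and unnatural phrasing."""
    prompt = (
        "You are a bilingual EN/JA reviewer. Compare the English source with the "
        "Japanese translation and list any errors (repetitions, mistranslations, "
        "unnatural or awkward phrasing). Reply in English; say 'No errors' if clean.\n\n"
        f"English source:\n{source_en}\n\n"
        f"Japanese translation:\n{translation_ja}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```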

Note: llm-jp-13b uses its own versions of the dolly-15k and oasst1 datasets, translated with DeepL, but we couldn't find a copy for analysis.

Conclusion

Based on these discoveries, and our review of the suitability of the other training sets available, we abandoned the use of the pre-existing machine-translated datasets entirely and decided to go our own way.

From our testing, GPT-4 is very strong at polite/normal speech-level Japanese and could be used reliably for translation, although its throughput and cost were an issue for us. We ran a number of human-validated comparisons between DeepL, Google Translate, text-bison-32k, and gpt-4-0613 (along with human-validated, ChatGPT-4-assisted reviews like the one above), and we found that text-bison-32k could be run cheaply and quickly enough to rapidly generate translations, while a few simple algorithms let us use gpt-4 as a "big gun" to fix potentially problematic translations.
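As an illustration of what such an escalation step can look like, below is a small sketch. The heuristics, thresholds, and the `cheap_translate`/`expensive_translate` callables (hypothetical wrappers around e.g. text-bison-32k and gpt-4-0613) are assumptions for this example, not our production rules:

```python
import re

def looks_problematic(src_en: str, ja: str) -> bool:
    """Heuristic flags for a machine translation that may need a second pass.

    Illustrative checks only:
    - a short chunk repeated back-to-back (the failure mode seen in the
      hh-rlhf-49k-ja examples above)
    - a JA/EN character-length ratio far outside a typical range
    """
    # the same 2-12 character chunk repeated 3+ times in a row
    if re.search(r"(.{2,12})\1{2,}", ja):
        return True
    # flag extreme length ratios (thresholds are rough guesses)
    ratio = len(ja) / max(len(src_en), 1)
    return ratio < 0.3 or ratio > 2.5

def translate(src_en: str, cheap_translate, expensive_translate) -> str:
    """Translate with the cheap model first; escalate only when flagged."""
    ja = cheap_translate(src_en)
    if looks_problematic(src_en, ja):
        ja = expensive_translate(src_en)
    return ja
```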

One other data issue we noticed is that almost all of the pre-training of existing models has been done with relatively low-quality, unfiltered datasets from Wikipedia, mC4, The Pile, OSCAR, etc. Over the past few months, much higher-quality, better-filtered pre-training datasets like SlimPajama, MADLAD-400, and CulturaX have been publicly released, and this, to us, was another easy win. We coupled this with DSIR sampling to maximize diversity within a limited amount of pre-train compute.
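For context, here is a toy sketch of the DSIR idea: score raw documents by importance weights over hashed n-gram features relative to a target set, then resample. The character bigrams, bucket count, smoothing, and Gumbel top-k sampling here are simplifying assumptions for this sketch, not the paper's reference implementation or our actual pipeline:

```python
# Toy sketch of DSIR-style data selection over hashed character n-grams.
import hashlib
import numpy as np

def hashed_ngrams(text: str, n: int = 2, buckets: int = 10_000) -> list[int]:
    """Map character n-grams to hash buckets (character n-grams suit JA text)."""
    out = []
    for i in range(len(text) - n + 1):
        digest = hashlib.md5(text[i:i + n].encode("utf-8")).digest()
        out.append(int.from_bytes(digest[:4], "little") % buckets)
    return out

def bucket_logprobs(texts, buckets: int = 10_000) -> np.ndarray:
    """Fit an add-one-smoothed unigram model over hash buckets."""
    counts = np.ones(buckets)
    for t in texts:
        for b in hashed_ngrams(t, buckets=buckets):
            counts[b] += 1
    return np.log(counts / counts.sum())

def dsir_select(raw_texts, target_texts, k: int, buckets: int = 10_000, seed: int = 0):
    """Sample k raw documents proportionally to exp(log p_target - log p_raw)
    without replacement, via the Gumbel top-k trick."""
    logp_target = bucket_logprobs(target_texts, buckets)
    logp_raw = bucket_logprobs(raw_texts, buckets)
    rng = np.random.default_rng(seed)
    scores = []
    for t in raw_texts:
        ids = hashed_ngrams(t, buckets=buckets)
        logw = float(sum(logp_target[b] - logp_raw[b] for b in ids))
        scores.append(logw + rng.gumbel())
    top = np.argsort(scores)[::-1][:k]
    return [raw_texts[i] for i in top]
```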

We share this analysis in the hope that those training their own future Japanese models will pay special attention and care to curating the data they use, as it seems that many GPU hours have been unnecessarily burned.

Also, while we are releasing our higher-quality machine-translated dataset with our models, we encourage teams with the resources to conduct manual reviews of training sets by native speakers, and to share their findings.

Additional Notes

  • In Japan, the government has taken a clear stance that copyright law does not apply to AI training, so datasets like https://annas-archive.org/ carry none of the restrictions/risks they do in other jurisdictions. Given that Japanese is not a low-resource language, in theory there should be no shortage of usable tokens. (In practice, the sources of publicly available, high-quality Japanese training data seem much more limited.)
  • A good list of general datasets: https://github.com/Zjh-819/LLMDataHub
  • A discussion on better pre-training filtering: https://www.reddit.com/r/LocalLLaMA/comments/17om8xf/redpajamadatav2_is_incredible/
  • For base models, we found llm-jp-eval, which was recently published, to be quite useful as a fast, wide-ranging JA benchmark eval. We were a bit confused, though, as it was published alongside an llm-jp leaderboard with their own "jaster" fine-tunes sitting on top without additional commentary (also sitting near the top of the Nejumi LLM evals). When we reviewed these "jaster" models, however, we found that they universally performed significantly worse in conversational testing than the non-jaster tunes of every model we tried. After some digging, it seems that benchmark train data makes up ~60% of the tuning data, fed at a high LR (or 100% for the non-llm-jp "jaster" tunes?). In any case, while an interesting pre-Goodhart's-Lawing of the benchmark, we're still left a bit confused. More interestingly, in our ablation testing, we found that training with a percentage of benchmark train data in the mix gave better (manually native-speaker-reviewed) results than training without it. We speculate that, as a smaller part of our process and at a lower LR, the additional good JA tokens outweigh the negative effects of the SFT answers. Also, the rest of our training set includes extended and multi-turn answers, while the "jaster" tunes do not.

2023-12-06 Addendum: Ablations

As we were finishing up our Shisa 7B tunes, one thing I queued up was a set of ablations (for science). Here are some JA MT-Bench results from two tests.

First, we take our shisa-base-7b-v1 model and compare a tune using our allsources-v0.4 tuning set vs. one using the training sets listed by Japanese Stable LM Instruct Gamma 7B:

| Model | Score |
| --- | --- |
| shisa-base-7b-allsources-v0.4 | 5.68 |
| shisa-base-7b-jaslmg-instruct | 4.42 |

Then, we do the reverse and take Japanese Stable LM Base Gamma 7B (Mistral 7B with 100B JA tokens of additional pre-training) and apply our allsources-v0.4 training set to it:

| Model | Score |
| --- | --- |
| shisa-gamma-7b-allsources-v0.4 | 5.65 |
| ja-stablelm-instruct-gamma-7b* | 4.01 |

This is encouraging in the sense that it suggests our tuning set can be effective at increasing the performance of potentially any JA pre-trained model, and that our base model, with only 8B additional JA tokens of pre-training (>10X less than the Gamma model), can score competitively with either "good" or "bad" fine-tuning data.

2023-12-07: LIAT AIP @ RIKEN

I was passed a link to the LLMのための日本語インストラクションデータ作成プロジェクト (Japanese instruction data creation project for LLMs). There is apparently a new "Joint Research" project underway to develop a commercially licensed fine-tuning dataset (ETA March 2024). Interesting and worth noting, although I think an open-source set would be better, faster to build, and could be scaled much larger than their project.

2023-12-11 JA Translation Rating

Back in June, apparently, a JP developer, @Sakkusakumura, rated JA translations with BERTScore by re-translating the text back to English. I have some doubts about the exact approach, since you could just be compounding the uncertainty of the JA→EN translation onto the original EN→JA translation, but it's still interesting: https://twitter.com/Sakkusakumura/status/1674273868897288193
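For reference, a minimal sketch of that kind of round-trip rating with the bert-score package; `back_translate` is a hypothetical JA→EN callable, and this is a reconstruction of the general idea, not necessarily the linked developer's exact setup:

```python
# Rate an EN->JA translation by back-translating to EN and scoring the round
# trip against the source with BERTScore. As noted above, errors from the
# JA->EN leg get compounded with errors from the original EN->JA translation.
from bert_score import score

def rate_translation_roundtrip(en_source: str, ja_translation: str, back_translate) -> float:
    en_roundtrip = back_translate(ja_translation)  # hypothetical JA->EN MT call
    P, R, F1 = score([en_roundtrip], [en_source], lang="en")
    return float(F1[0])
```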