
shisa-v2


See: shisa-ai/shisa-v2

Areas of Improvement

Code cleanup

  • Set up black and gitleaks hooks (see the sketch after this list)
  • Reorganize the code so it makes sense
  • Add working dirs to .gitignore
  • 1-click cloud-deploy containers for training and evals
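A minimal sketch of a `.pre-commit-config.yaml` wiring up both hooks; the `rev` pins are assumptions and should be bumped to current releases:

```yaml
# Hypothetical .pre-commit-config.yaml -- rev pins are placeholders.
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2          # pin to a current black release
    hooks:
      - id: black
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4         # pin to a current gitleaks release
    hooks:
      - id: gitleaks
```

Run `pre-commit install` once per clone to activate the hooks.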

Language leakage

  • Run sweeps with different sampling parameters to find the settings that minimize leakage (see the sketch after this list)
  • Test reduced tokenizer sizes for language leakage (maybe not a problem if not using an extended tokenizer)
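A minimal sweep sketch, assuming an OpenAI-compatible endpoint (e.g., vLLM or a llama.cpp server) at `localhost:8000` and a crude Latin-character ratio as the leakage proxy; the model name, prompts, and parameter grid are all placeholders:

```python
# Sketch: sweep sampling parameters and score EN-into-JA leakage.
# Assumes an OpenAI-compatible server (vLLM, llama.cpp, etc.) is running;
# model name, prompts, and the parameter grid are placeholders.
import itertools
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

JA_PROMPTS = [
    "日本の四季について教えてください。",  # "Tell me about Japan's four seasons."
    "おすすめの本を教えてください。",      # "Recommend me a book."
]

def latin_ratio(text: str) -> float:
    """Fraction of Latin letters in a reply that should be Japanese."""
    return len(re.findall(r"[A-Za-z]", text)) / max(len(text), 1)

for temp, top_p, rep_pen in itertools.product([0.2, 0.7, 1.0], [0.9, 0.95], [1.0, 1.15]):
    scores = []
    for prompt in JA_PROMPTS:
        resp = client.chat.completions.create(
            model="shisa-v2",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
            top_p=top_p,
            extra_body={"repetition_penalty": rep_pen},  # vLLM extension
        )
        scores.append(latin_ratio(resp.choices[0].message.content))
    print(f"temp={temp} top_p={top_p} rep_pen={rep_pen} "
          f"mean_latin_ratio={sum(scores) / len(scores):.3f}")
```

A real sweep would want a proper language-ID model rather than a character heuristic, but the harness shape is the same.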

Instruction following

  • Compare English vs Japanese instruction following

Language steerability

  • Training samples for "reply in Japanese", "reply in English", "reply in the language the user speaks", etc.
  • Multi-turn training with language switching within turns

Training Data

Tuning diversity

Language prefs

  • Review the % of translate-to-JA vs translate-to-EN samples
  • Potentially take the Snow/translation datasets (and our own datasets) and swap in automated variations of "Reply in English/Japanese", appended or prepended (see the sketch after this list)
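A minimal sketch of that augmentation, assuming chat-format samples; the instruction templates and output schema are made up for illustration:

```python
# Sketch: rewrite translation pairs as explicit language-steering samples
# by prepending/appending a "reply in X" instruction. Templates and the
# output schema are assumptions, not the project's actual format.
import random

STEER_TEMPLATES = {
    "ja": ["日本語で答えてください。", "返答は日本語でお願いします。"],  # "Answer in Japanese."
    "en": ["Reply in English.", "Please answer in English."],
}

def steer_sample(prompt: str, reply: str, target_lang: str) -> dict:
    """Attach a language-steering instruction before or after the prompt."""
    instruction = random.choice(STEER_TEMPLATES[target_lang])
    if random.random() < 0.5:
        user = f"{instruction}\n{prompt}"   # prepended
    else:
        user = f"{prompt}\n{instruction}"   # appended
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": reply},
        ]
    }

print(steer_sample("What is the capital of Japan?", "日本の首都は東京です。", "ja"))
```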

Niceties

  • Figure out a good way to insert identity samples: "who made you", "tell me about yourself", "describe yourself", etc.

DPO Review

Pre-Training

Relevant New Research

Evals

See: https://github.com/AUGMXNT/inference-benchmark for benchmarks

llm_judge fork

  • Swap to lm-eval's vLLM backend for fast inference (or the OpenAI API w/ llama.cpp GGUF, ExLlamaV2, MLC, etc.) - 50X faster than HF Transformers
  • Keep the data format
  • Move to the OpenAI API 1.0+ client
  • Make compatible w/ shisa-eval-server (human eval)
  • Turn the Elyza 100 tasks into a tasks.json w/ a custom judging rubric (may need to extend the format?); see the sketch after this list
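A sketch of what an extended task entry plus a rubric-aware judge call could look like with the OpenAI 1.0+ client; the `rubric` field, judge model, and prompt wording are hypothetical extensions, not the existing llm_judge format:

```python
# Sketch: a tasks.json entry extended with a per-task rubric, judged via
# the OpenAI 1.0+ client. Field names and the judge model are placeholders.
from openai import OpenAI

task = {
    "question_id": 1,
    "category": "elyza-100",
    "turns": ["仕事の熱意を取り戻すためのアイデアを5つ挙げてください。"],
    # "List five ideas for regaining enthusiasm for work."
    "rubric": "Score 1-5 on relevance, concreteness, and natural Japanese.",  # hypothetical field
}

judge = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(task: dict, answer: str) -> str:
    """Ask a judge model to score an answer against the task's rubric."""
    prompt = (
        f"Question: {task['turns'][0]}\n"
        f"Answer: {answer}\n"
        f"Rubric: {task['rubric']}\n"
        "Return a score and a one-sentence justification."
    )
    resp = judge.chat.completions.create(
        model="gpt-4-turbo",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```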

Bigger runs

Options

Bad Options

  • Yi 34B
  • DeepSeek LLM 67B (MIT License)
    • No commercial limitations, just restrictions to lawful use (non-military, no harming minors, etc.)
    • Has a 7B to tune on
    • GQA, 2T EN/CN pretrain, 4K context, 102.4K vocab
      • en: 4.329528, ja: 0.852132 (see the tokenizer-efficiency sketch after this list)
      • Oof, a bad tokenizer for Japanese
  • Qwen-72B (Qwen License) - licensing sucks
    • Can't train derived works
    • Requires a separate license past 100M MAU
  • Mixtral 8x7B (Apache 2.0) - too hard to tune
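A quick sketch of how per-language tokenizer-efficiency figures like the ones above can be measured, as average characters per token (higher = more efficient); reading the en/ja numbers as chars/token is an assumption, and the sample strings are placeholders:

```python
# Sketch: rough per-language tokenizer efficiency as chars per token.
# Assumes the en/ja figures above are chars/token; the sample text is
# tiny and illustrative -- a real measurement would use a large corpus.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-67b-base")

samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "ja": "素早い茶色の狐がのろまな犬を飛び越える。",
}
for lang, text in samples.items():
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{lang}: {len(text) / n_tokens:.3f} chars/token")
```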

Misc

Improved HF Space?

See: