
shisa-v2


See: shisa-ai/shisa-v2

Areas of Improvement

Code cleanup

  • Set up black and gitleaks hooks (see the sketch after this list)
  • Reorganize the code so it makes sense
  • Add working dirs to .gitignore
  • 1-click cloud-deploy containers for training and evals
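A minimal sketch of a `.pre-commit-config.yaml` wiring up both hooks; the `rev` pins are assumptions and should be bumped to current releases:

```yaml
# Hypothetical .pre-commit-config.yaml -- rev pins are placeholders.
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2          # pin to a current black release
    hooks:
      - id: black
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4         # pin to a current gitleaks release
    hooks:
      - id: gitleaks
```

Run `pre-commit install` once per clone to activate the hooks.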

Language leakage

  • Run sweeps with different sampling parameters to find the settings that minimize leakage (see the sketch after this list)
  • Test reduced tokenizer sizes for language leakage (maybe not a problem if not using an extended tokenizer)
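A minimal sweep sketch, assuming an OpenAI-compatible endpoint (e.g., vLLM or a llama.cpp server) at `localhost:8000` and a crude Latin-character ratio as the leakage proxy; the model name, prompts, and parameter grid are all placeholders:

```python
# Sketch: sweep sampling parameters and score EN-into-JA leakage.
# Assumes an OpenAI-compatible server (vLLM, llama.cpp, etc.) is running;
# model name, prompts, and the parameter grid are placeholders.
import itertools
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

JA_PROMPTS = [
    "日本の四季について教えてください。",  # "Tell me about Japan's four seasons."
    "おすすめの本を教えてください。",      # "Recommend me a book."
]

def latin_ratio(text: str) -> float:
    """Fraction of Latin letters in a reply that should be Japanese."""
    return len(re.findall(r"[A-Za-z]", text)) / max(len(text), 1)

for temp, top_p, rep_pen in itertools.product([0.2, 0.7, 1.0], [0.9, 0.95], [1.0, 1.15]):
    scores = []
    for prompt in JA_PROMPTS:
        resp = client.chat.completions.create(
            model="shisa-v2",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
            top_p=top_p,
            extra_body={"repetition_penalty": rep_pen},  # vLLM extension
        )
        scores.append(latin_ratio(resp.choices[0].message.content))
    print(f"temp={temp} top_p={top_p} rep_pen={rep_pen} "
          f"mean_latin_ratio={sum(scores) / len(scores):.3f}")
```

A real sweep would want a proper language-ID model rather than a character heuristic, but the harness shape is the same.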

Instruction following

  • Compare English vs Japanese instruction following

Language steerability

  • Training samples for "reply in Japanese", "reply in English", "reply in the language the user speaks", etc.
  • Multi-turn training with language switching within turns

Training Data

Tuning diversity

Language prefs

  • Review the % of translate-to-JA vs translate-to-EN samples
  • Potentially take the Snow/translation datasets (and our own datasets) and swap in automated variations of "Reply in English/Japanese", appended or prepended (see the sketch after this list)
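A minimal sketch of that augmentation, assuming chat-format samples; the instruction templates and output schema are made up for illustration:

```python
# Sketch: rewrite translation pairs as explicit language-steering samples
# by prepending/appending a "reply in X" instruction. Templates and the
# output schema are assumptions, not the project's actual format.
import random

STEER_TEMPLATES = {
    "ja": ["日本語で答えてください。", "返答は日本語でお願いします。"],  # "Answer in Japanese."
    "en": ["Reply in English.", "Please answer in English."],
}

def steer_sample(prompt: str, reply: str, target_lang: str) -> dict:
    """Attach a language-steering instruction before or after the prompt."""
    instruction = random.choice(STEER_TEMPLATES[target_lang])
    if random.random() < 0.5:
        user = f"{instruction}\n{prompt}"   # prepended
    else:
        user = f"{prompt}\n{instruction}"   # appended
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": reply},
        ]
    }

print(steer_sample("What is the capital of Japan?", "日本の首都は東京です。", "ja"))
```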

Niceties

  • Figure out a good way to insert identity samples: "who made you", "tell me about yourself", "describe yourself", etc.

DPO Review

Pre-Training

Relevant New Research

Evals

See: https://github.com/AUGMXNT/inference-benchmark for benchmarks

llm_judge fork

  • Swap to lm-eval's vLLM backend for fast inference (or the OpenAI API w/ llama.cpp GGUF, ExLlamaV2, MLC, etc.) - 50X faster than HF Transformers
  • Keep the data format
  • Move to the OpenAI API 1.0+ client
  • Make compatible w/ shisa-eval-server (human eval)
  • Turn the Elyza 100 tasks into a tasks.json w/ a custom judging rubric (may need to extend the format?); see the sketch after this list
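A sketch of what an extended task entry plus a rubric-aware judge call could look like with the OpenAI 1.0+ client; the `rubric` field, judge model, and prompt wording are hypothetical extensions, not the existing llm_judge format:

```python
# Sketch: a tasks.json entry extended with a per-task rubric, judged via
# the OpenAI 1.0+ client. Field names and the judge model are placeholders.
from openai import OpenAI

task = {
    "question_id": 1,
    "category": "elyza-100",
    "turns": ["仕事の熱意を取り戻すためのアイデアを5つ挙げてください。"],
    # "List five ideas for regaining enthusiasm for work."
    "rubric": "Score 1-5 on relevance, concreteness, and natural Japanese.",  # hypothetical field
}

judge = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(task: dict, answer: str) -> str:
    """Ask a judge model to score an answer against the task's rubric."""
    prompt = (
        f"Question: {task['turns'][0]}\n"
        f"Answer: {answer}\n"
        f"Rubric: {task['rubric']}\n"
        "Return a score and a one-sentence justification."
    )
    resp = judge.chat.completions.create(
        model="gpt-4-turbo",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```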

Bigger runs

Options

Bad Options

  • Yi 34B
  • DeepSeek LLM 67B (MIT License)
    • No commercial limitations, just restrictions to lawful use (non-military, no harming minors, etc.)
    • Has a 7B to tune on
    • GQA, 2T EN/CN pretrain, 4K context, 102.4K vocab
      • en: 4.329528, ja: 0.852132 (see the tokenizer-efficiency sketch after this list)
      • Oof, a bad tokenizer for Japanese
  • Qwen-72B (Qwen License) - licensing sucks
    • Can't train derived works
    • Requires a separate license past 100M MAU
  • Mixtral 8x7B (Apache 2.0) - too hard to tune
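A quick sketch of how per-language tokenizer-efficiency figures like the ones above can be measured, as average characters per token (higher = more efficient); reading the en/ja numbers as chars/token is an assumption, and the sample strings are placeholders:

```python
# Sketch: rough per-language tokenizer efficiency as chars per token.
# Assumes the en/ja figures above are chars/token; the sample text is
# tiny and illustrative -- a real measurement would use a large corpus.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-67b-base")

samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "ja": "素早い茶色の狐がのろまな犬を飛び越える。",
}
for lang, text in samples.items():
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{lang}: {len(text) / n_tokens:.3f} chars/token")
```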

Misc

Improved HF Space?

See: