Zelda Rose

A straightforward trainer for transformer-based models.

Installation

Simply install with pipx

pipx install zeldarose
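
Zelda Rose is published on PyPI, so a plain pip install (preferably in a dedicated virtual environment) should also work:

pip install zeldarose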

Train MLM models

Here is a short example of training first a tokenizer, then a transformer MLM model:

TOKENIZERS_PARALLELISM=true zeldarose tokenizer --vocab-size 4096 --out-path local/tokenizer --model-name "my-muppet" tests/fixtures/raw.txt
zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt

The .txt files are meant to be raw text files, with one sample (e.g. sentence) per line.
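
For instance, a suitable two-sample input file could be built like this (a hypothetical file, with arbitrary contents):

printf '%s\n' "A first training sample." "A second, independent sample." > local/extra-samples.txt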

There are other parameters (see zeldarose transformer --help for a comprehensive list); the one you are most likely to be interested in is --config, which gives the path to a training config (for which we have samples in examples/).
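
For instance, a run with an explicit training config might look like this (a sketch only: examples/mlm.toml is a hypothetical file name, check examples/ for the actual ones):

zeldarose transformer --config examples/mlm.toml --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt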

The parameters --pretrained-model, --tokenizer and --model-config are all fed directly to Hugging Face's transformers and can be either pretrained model names or local paths.

Distributed training

This is somewhat tricky; you have several options:

  • If you are running in a SLURM cluster, use --strategy ddp and invoke via srun (see the sketch after this list)

    • You might want to preprocess your data first, outside of the main compute allocation. The --profile option can be abused for that purpose, since it won't run a full training but will run any data preprocessing you ask for. At this step it can also help to load a tiny placeholder model such as RoBERTa-minuscule to avoid running out of memory, since the only thing that matters for this preprocessing is the tokenizer.
  • Otherwise you have two options

    • Run with --strategy ddp_spawn, which uses multiprocessing.spawn to start the process swarm (tested, but possibly slower and more limited; see the pytorch-lightning docs)
    • Run with --strategy ddp and launch via torch.distributed.launch with --use_env and --no_python (untested; see the sketch after this list)
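
Here are minimal sketches of both launch modes, reusing the paths from the example above. These are illustrations, not tested recipes: the SLURM resource values are placeholders to adapt to your cluster, and the torch.distributed.launch variant is, as noted, untested.

#!/bin/bash
# SLURM sketch: save as train.sh and submit with `sbatch train.sh`;
# srun starts one training process per task
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
srun zeldarose transformer --strategy ddp --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt

# Without SLURM (untested): torch.distributed.launch spawns the workers itself,
# here 2 processes on a single node
python -m torch.distributed.launch --nproc_per_node 2 --use_env --no_python \
    zeldarose transformer --strategy ddp --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt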

Other hints

  • Data management relies on 🤗 datasets and uses their cache management system. To run in a clean environment, you might have to check the cache directory pointed to by the HF_DATASETS_CACHE environment variable, as in the example below.
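
For instance, to run with a fresh, throwaway cache, you can point that variable elsewhere for a single invocation (the path here is arbitrary):

HF_DATASETS_CACHE=/tmp/zeldarose-cache zeldarose transformer --tokenizer local/tokenizer --pretrained-model flaubert/flaubert_small_cased --out-dir local/muppet --val-text tests/fixtures/raw.txt tests/fixtures/raw.txt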

Citation

@inproceedings{grobol:hal-04262806,
    TITLE = {{Zelda Rose: a tool for hassle-free training of transformer models}},
    AUTHOR = {Grobol, Lo{\"i}c},
    URL = {https://hal.science/hal-04262806},
    BOOKTITLE = {{3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)}},
    ADDRESS = {Singapore},
    YEAR = {2023},
    MONTH = Dec,
    PDF = {https://hal.science/hal-04262806/file/Zeldarose_OSS_EMNLP23.pdf},
    HAL_ID = {hal-04262806},
    HAL_VERSION = {v1},
}