Experiments for my thesis 🤗
Install the necessary dependencies:

```
pip install -r requirements.txt
```
Compile the CUDA extensions for the N-Gram Multihot approach:

```
cd cpp
python setup.py install
```
To run training on GPUs, please install PyTorch with CUDA support.
All of the following tasks can be run with `tools/run_all.sh`.
The preprocessing script builds the vocabulary and the tokenized dataset. The easiest way is to use the training config: set `data` to the dataset, and `saved_dict` and `saved_data` to the output files for the dictionary and the tokenized dataset, respectively.

NOTE: The `data` setting can be a Hugging Face dataset or a local one prefixed with `text/`.
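A minimal sketch of the relevant settings (only the three key names above come from this README; the values are placeholders):

```yaml
# Sketch only -- see configs/base.yaml for the full set of options.
data: wikitext            # Hugging Face dataset, or a local file such as "text/corpus.txt"
saved_dict: out/dict.pt   # output file for the dictionary (placeholder path)
saved_data: out/data.pt   # output file for the tokenized dataset (placeholder path)
```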
```
python preprocess.py --config configs/base.yaml
```
The training script trains either a standard LSTM or a Transformer model, combined with the N-Gram Multihot approach. All parameters can be defined in a YAML configuration file; see `configs/base.yaml` for the available options, or run `python train.py --help`.
```
python train.py --config configs/base.yaml
```
Parameters can also be set on the command line, overriding the values from the YAML config.
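For example (the `--lr` flag is illustrative, not confirmed by this README; check `python train.py --help` for the actual flag names):

```
# Override the learning rate from configs/base.yaml (flag name is an assumption)
python train.py --config configs/base.yaml --lr 0.001
```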
For downstream evaluation we use the flair library. Different downstream tasks can be declared in a separate YAML configuration file (see `configs/flair_base.yaml`). If a task's `use` setting is set to `True`, training for that task is started. Multiple training tasks can be declared.
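A minimal sketch of what such task entries might look like (only the `use` field comes from this README; the task names and surrounding structure are assumptions, see `configs/flair_base.yaml` for the real schema):

```yaml
# Sketch only -- everything except "use" is an assumption.
ner:
  use: True    # True => training for this task is started
pos:
  use: False   # declared, but skipped
```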
```
python train_ds.py --config configs/flair_base.yaml
```
- DeepSpeed tries to access some temporary folders for its CUDA extensions that the user may not have permission for. Export `TORCH_EXTENSIONS_DIR` to point to a writable location, as shown below.
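For example (the path is a placeholder):

```
export TORCH_EXTENSIONS_DIR=/path/with/write/access
```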
Example: character n-grams of `hello, world` (one n-gram ending at each position):

- 1-grams: `h`, `e`, `l`, `l`, `o`, `,`, ` `, `w`, `o`, `r`, `l`, `d`
- 2-grams: `h`, `he`, `el`, `ll`, `lo`, `o,`, `, `, ` w`, `wo`, `or`, `rl`, `ld`
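A minimal Python sketch of this idea (illustrative only; this is not the CUDA implementation in `cpp/`, and the `char_ngrams`/`multihot` helpers are hypothetical names):

```python
def char_ngrams(text: str, n: int) -> list[str]:
    # One n-gram ending at each character position; n-grams at the
    # start of the string are shortened, so the first entry is just "h".
    return [text[max(0, i - n + 1): i + 1] for i in range(len(text))]

def multihot(grams: list[str], vocab: dict[str, int]) -> list[int]:
    # Multi-hot vector over the n-gram vocabulary: 1 for every n-gram
    # that occurs in `grams` and is known to `vocab`.
    vec = [0] * len(vocab)
    for g in grams:
        if g in vocab:
            vec[vocab[g]] = 1
    return vec

print(char_ngrams("hello, world", 2))
# ['h', 'he', 'el', 'll', 'lo', 'o,', ', ', ' w', 'wo', 'or', 'rl', 'ld']
```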