GPT-NeoX is a fork of Megatron-LM with some additional features. GPT-NeoX supports more tokenizers than Megatron-LM, such as `LlamaTokenizer`, `HFGPT2Tokenizer`, or `TiktokenTokenizer`.

However, neither repository supports the Transformers `AutoTokenizer` out of the box. That's why I made this repository.
In this repository:

- You can encode your dataset with any `AutoTokenizer`-compatible tokenizer.
- Check out `examples/sample_dataset.jsonl` for the format of the dataset.
```jsonl
// examples/sample_dataset.jsonl
{"text": "This is the first sentence."}
{"text": "This is the second sentence."}
// ...
```
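For reference, a dataset in this format can be produced with Python's standard `json` module. The file name and sentences below are just examples, not anything the repository requires:

```python
import json

# Example sentences to serialize; any iterable of strings works.
sentences = [
    "This is the first sentence.",
    "This is the second sentence.",
]

# Write one JSON object per line -- the JSONL format expected above.
# ensure_ascii=False keeps non-ASCII text (e.g. Korean) readable in the file.
with open("sample_dataset.jsonl", "w", encoding="utf-8") as f:
    for text in sentences:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```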
- Install the dependencies:

  ```shell
  pip install -r requirements.txt
  ```

  For more details, please check out `requirements.frozen.txt`.
- Encode the dataset with the following command (see the sample shell file `run.sh`):
```shell
# Create output directory
mkdir -p data-sample/

# Encode the dataset
python preprocess_data.py \
    --input examples/sample_dataset.jsonl \
    --tokenizer-type AutoTokenizer \
    --vocab-file beomi/llama-2-ko-7b \
    --output-prefix data-sample/out \
    --dataset-impl mmap \
    --workers 10
```
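Before kicking off a long preprocessing run, a quick stdlib-only sanity check of the input can be useful. The helper below is an illustration and not part of this repository; it simply counts documents and characters, and raises if any line is not valid JSON:

```python
import json

def dataset_stats(path):
    """Count documents and total text characters in a JSONL dataset.

    Raises json.JSONDecodeError if any line is not valid JSON,
    which would also make preprocess_data.py fail.
    """
    n_docs = 0
    n_chars = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            n_docs += 1
            n_chars += len(doc["text"])
    return n_docs, n_chars
```

Example: `dataset_stats("examples/sample_dataset.jsonl")` returns a `(documents, characters)` tuple you can eyeball before spending time on tokenization.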
- The input file name must end with `.jsonl`.
- Your JSONL file must NOT contain any empty lines.

Example (this causes an error):

```jsonl
{"text": "This is the first sentence."}
{"text": "This is the second sentence."}
// YOU SHOULD NOT HAVE AN EMPTY LINE HERE
{"text": "This is the third sentence."}
// YOU SHOULD NOT HAVE AN EMPTY LINE AT THE END OF THE FILE EITHER
```

The JSONL above will cause this error:

```
jsonlines.jsonlines.InvalidLineError: line contains invalid json: Expected object or value (line 3)
```

This is OK:

```jsonl
{"text": "This is the first sentence."}
{"text": "This is the second sentence."}
```
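To catch empty lines (and other invalid JSON) before preprocessing crashes partway through, you can run a small stdlib-only validator over the file first. This helper is a sketch for illustration, not part of the repository:

```python
import json

def validate_jsonl(path):
    """Return a list of problems found in a JSONL file (empty list = OK).

    Flags empty lines explicitly, since they are the most common cause of
    the InvalidLineError shown above.
    """
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                errors.append(f"line {lineno}: empty line")
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {lineno}: {e}")
    return errors
```

Running `validate_jsonl` on your dataset and fixing every reported line before calling `preprocess_data.py` avoids wasting a long tokenization run.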
The code in this repository is partially copied/edited from EleutherAI/gpt-neox @ 10bf78.