Welcome to our tutorial on using SentencePiece to create a powerful, language-agnostic tokenizer! Learn how to transform raw English and Spanish text into organized, numerical tokens: an essential first step for building high-quality Neural Machine Translation models.
This repository contains a tutorial notebook (`.ipynb`) that guides you through the process of training your own SentencePiece tokenizer from scratch. We cover everything from preparing your text data to training the model and testing its capabilities.
- Practical Focus: Get hands-on experience with a widely-used tokenization tool in NLP.
- Clear Steps: Follow a simple, step-by-step guide to train your own tokenizer for English-Spanish.
- NMT Foundation: Understand a crucial preprocessing step required for many translation and language generation tasks.
- Beginner-Friendly: Suitable for those new to tokenization or SentencePiece, with clear explanations.
- Setting the Stage:
  - Introduction to why tokenization is needed for NLP models.
  - What SentencePiece is and its key advantages (subword units, language-agnostic).
- Getting Your Tools Ready:
  - Installing the `sentencepiece` library.
  - Downloading and preparing the English-Spanish dataset (`spa-eng.zip`), as in the setup sketch below.
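If you want to run the steps outside the notebook, a minimal setup sketch looks like this. The download URL is an assumption (it is the copy used in common TensorFlow tutorials); substitute whatever source the notebook uses:

```python
# Install the library first, e.g. in a notebook cell:
#   pip install sentencepiece

import urllib.request
import zipfile

# Assumed dataset location; the notebook may fetch spa-eng.zip from elsewhere.
URL = "http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"

urllib.request.urlretrieve(URL, "spa-eng.zip")
with zipfile.ZipFile("spa-eng.zip") as zf:
    zf.extractall(".")  # yields spa-eng/spa.txt with tab-separated sentence pairs
```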
- Training Your Custom Tokenizer:
  - Preparing a combined text file from English and Spanish sentences for SentencePiece training.
  - Using `SentencePieceTrainer` to train your model, understanding key parameters like `vocab_size`, `model_type` (Unigram vs. BPE), `character_coverage`, and special token IDs (`<pad>`, `<s>`, `</s>`, `<unk>`); see the training sketch after this list.
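As a rough sketch of both steps, assuming the tab-separated `spa-eng/spa.txt` layout from the setup above (the vocabulary size and other values here are illustrative, not necessarily the notebook's settings):

```python
import sentencepiece as spm

# Write one sentence per line, mixing both languages into a single
# training file so English and Spanish share one vocabulary.
with open("spa-eng/spa.txt", encoding="utf-8") as src, \
     open("combined.txt", "w", encoding="utf-8") as dst:
    for line in src:
        en, es = line.rstrip("\n").split("\t")[:2]
        dst.write(en + "\n" + es + "\n")

spm.SentencePieceTrainer.train(
    input="combined.txt",        # raw text, one sentence per line
    model_prefix="enes",         # writes enes.model and enes.vocab
    vocab_size=8000,             # total number of subword pieces (illustrative)
    model_type="unigram",        # or "bpe"
    character_coverage=1.0,      # keep every character seen in the data
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,  # <pad>, <s>, </s>, <unk>
)
```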
- Testing Your New Tokenizer:
  - Loading your trained SentencePiece model (`.model` file).
  - Encoding text into subword pieces and numerical IDs.
  - Decoding IDs back into human-readable text to verify (see the round-trip sketch below).
  - Understanding the role of the `.vocab` file.
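A minimal round-trip check, assuming the `enes.model` file produced by the training sketch above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="enes.model")

text = "¿Dónde está la biblioteca?"
pieces = sp.encode_as_pieces(text)  # subword strings, e.g. ['▁¿', 'D', 'ónde', ...]
ids = sp.encode_as_ids(text)        # the matching numerical IDs

print(sp.decode_pieces(pieces))  # back to the original sentence
print(sp.decode_ids(ids))        # same string, recovered from IDs
```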
- Subword Tokenization (Unigram & BPE)
- `sentencepiece.SentencePieceTrainer.train()`: Training a new tokenizer.
  - `input`: Specifying training data.
  - `model_prefix`: Naming your output model files.
  - `vocab_size`: Setting the desired vocabulary size.
  - `character_coverage`: Ensuring representation of characters.
  - `model_type`: Choosing between `unigram` or `bpe`.
  - Defining `pad_id`, `bos_id`, `eos_id`, `unk_id`.
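All of these options can also be passed as one flag string, mirroring the `spm_train` command-line tool; this form appears in many older examples. The values below are the same illustrative ones as in the training sketch above:

```python
import sentencepiece as spm

# Flag-string equivalent of the keyword-argument call shown earlier.
spm.SentencePieceTrainer.train(
    "--input=combined.txt --model_prefix=enes --vocab_size=8000 "
    "--model_type=unigram --character_coverage=1.0 "
    "--pad_id=0 --bos_id=1 --eos_id=2 --unk_id=3"
)
```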
- `sentencepiece.SentencePieceProcessor()`: Loading a trained model.
  - `load()`: Loading the `.model` file (or pass `model_file=` to the constructor).
  - `encode_as_pieces()`: Converting text to subword strings.
  - `encode_as_ids()`: Converting text to numerical IDs.
  - `decode_pieces()`: Converting subword strings back to text.
  - `decode_ids()`: Converting numerical IDs back to text.
  - Accessing special token IDs (`bos_id()`, `eos_id()`, etc.) and vocabulary size (`get_piece_size()`).
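A quick introspection sketch, again assuming the `enes.model` from above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="enes.model")

# Special token IDs were fixed at training time via pad_id/bos_id/eos_id/unk_id.
print(sp.pad_id(), sp.bos_id(), sp.eos_id(), sp.unk_id())  # e.g. 0 1 2 3
print(sp.get_piece_size())            # vocabulary size, e.g. 8000
print(sp.id_to_piece(sp.bos_id()))    # the piece behind an ID, here '<s>'
```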
- Understanding the `.model` and `.vocab` output files.
- Text Normalization (NFKC-based `nmt_nfkc` by default; `nmt_nfkc_cf` adds case folding).
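Normalization is baked in at training time. A minimal sketch of switching to the case-folding rule (flag name as documented for `spm_train`):

```python
import sentencepiece as spm

# Same illustrative settings as before, but with case-folding NFKC normalization.
spm.SentencePieceTrainer.train(
    input="combined.txt",
    model_prefix="enes_cf",
    vocab_size=8000,
    normalization_rule_name="nmt_nfkc_cf",  # default is "nmt_nfkc"
)
```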
Dive into the notebook and start turning your text into tidy tokens!