## Lesson Notebook 6: Machine Translation

In this notebook we will look at several examples related to machine translation:

   * Simple translation examples with T5

   * Translation example with M2M100 - many more languages

   * MT metrics examples

   * Subword models and tokenizers



<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [Simple Translation Example](#simpleTranslation)
  * 3. [M2M100 Translation Example](#m2mTranslation)
  * 4. [Machine Translation Metrics](#translationMetrics)  
  * 5. [Subword Models](#subwordModels)
  * [Answers](#answers)







  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-spring-main/blob/master/materials/lesson_notebooks/lesson_6_Machine_Translation.ipynb)

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup


We'll start with the usual setup. We install sentencepiece first because some of the models below (T5 and M2M100) need it to tokenize their input text.

In [1]:
!pip install -q sentencepiece

In [2]:
!pip install tensorflow -U --quiet
!pip install keras -U --quiet
!pip install tensorflow-datasets -U --quiet
!pip install tensorflow-text -U --quiet
!pip install transformers -U --quiet

In [3]:
!pip install -q -U datasets

In [4]:
!pip install -q git+https://github.com/keras-team/keras-nlp.git --upgrade

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for keras-hub (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
keras-nlp 0.18.1 requires keras-hub==0.18.1, but you have keras-hub 0.19.0 which is incompatible.

In [5]:
#Am I running a GPU and what type is it?
!nvidia-smi

Sun Feb  9 01:11:31 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   57C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

[Return to Top](#returnToTop)  
<a id = 'simpleTranslation'></a>


## 2. Simple Translation Example

The pretrained T5 checkpoints were trained to translate in one direction only: from English into another language (German, French, or Romanian).  For example, they can translate from English to French but not from French to English.

Let's test this out.

In [6]:
#import T5 and show
from transformers import T5Tokenizer, TFT5Model, TFT5ForConditionalGeneration

In [7]:
# Adjacent string literals are concatenated, so no stray indentation ends up inside the sentence.
SENTENCE_TO_TRANSLATE = ("PG&E stated it scheduled the blackouts in response to forecasts for high winds "
                         "amid dry conditions.")

BACK_TRANSLATE_TEST = ("PG&E a déclaré qu'elle avait prévu les panne de courant.")

In [8]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base') #also t5-small and t5-large
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

t5_model.summary()

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (Embedding)          multiple                  24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  109628544 
                                                                 
 decoder (TFT5MainLayer)     multiple                  137949312 
                                                                 
Total params: 222903552 (850.31 MB)
Trainable params: 222903552 (850.31 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


T5 selects its task from a text prefix, so we prepend the translation prompt to the sentence we want to translate. That way the model knows what we want it to do with the input.

In [9]:
t5_input_text = "translate english to french: " + SENTENCE_TO_TRANSLATE
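
As a minimal sketch of how such a prefix can be assembled programmatically, the helper below builds the prompt from a target language. The names `build_t5_translation_prompt` and `SUPPORTED_TARGETS` are hypothetical and not part of `transformers`. The set lists the three English-to-X translation tasks the original T5 checkpoints were trained on, and the lowercase prefix simply mirrors the form used in this notebook.

```python
# Hypothetical helper (not a transformers API): builds a T5-style task prefix.
# Assumption: only the English -> {German, French, Romanian} directions are supported.
SUPPORTED_TARGETS = {"german", "french", "romanian"}


def build_t5_translation_prompt(text: str, target: str) -> str:
    """Return the input string T5 expects for an English -> target translation."""
    target = target.lower()
    if target not in SUPPORTED_TARGETS:
        raise ValueError(f"T5 was not trained to translate English to {target!r}")
    return f"translate english to {target}: {text}"


prompt = build_t5_translation_prompt("The house is wonderful.", "French")
print(prompt)
```

Rejecting unsupported targets up front is a design choice for this sketch: as the reverse-direction experiment later in this section shows, the model itself will not complain when asked for a direction it was never trained on.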

In [10]:
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

**QUESTION 1**: What do the inputs look like?  We've already seen BERT inputs. What's happening with T5? What's the same as what we saw with BERT and what's different?

In [11]:
t5_inputs

{'input_ids': <tf.Tensor: shape=(1, 29), dtype=int32, numpy=
array([[13959, 22269,    12, 20609,    10,     3,  7861,   184,   427,
         4568,    34,  5018,     8,  1001,   670,     7,    16,  1773,
           12,  7555,     7,    21,   306, 13551, 18905,  2192,  1124,
            5,     1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 29), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

In [12]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   max_length=20)

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

["PG&E a déclaré qu'elle avait prévu les panne de courant"]


Not bad. Now let's try the reverse even though we know the model wasn't trained to translate in that direction.  What do you think it will do?

In [13]:
t5_back_text = "translate french to english: " + BACK_TRANSLATE_TEST

In [14]:
t5_binputs = t5_tokenizer([t5_back_text], return_tensors='tf')

The decoder still runs and emits fluent text, but it is French rather than the English we asked for.  These models will almost always produce some output, so you need to make sure that you're asking for something the model was trained to do and that the output is actually correct.

In [15]:
t5_summary_ids = t5_model.generate(t5_binputs['input_ids'],
                                   max_length=20)

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

["PG&E a déclaré qu'elle avait prévu les panne de courant"]


[Return to Top](#returnToTop)  
<a id = 'm2mTranslation'></a>


## 3. M2M100 Translation Example

M2M100 is a large model that was pre-trained on many languages simultaneously.  You do need to give it some clues about what you are expecting when it translates.  Typically this takes the form of specifying the input and the output languages.  Let's look at the tokenizer first. How can it handle 100 different languages?

In [16]:
from transformers import M2M100Config, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")

tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")


['▁Don',
 "'",
 't',
 '▁you',
 '▁love',
 '▁',
 '🤗',
 '▁Transform',
 'ers',
 '?',
 '▁We',
 '▁sure',
 '▁do',
 '.']

Now let's try to translate.  We have two original sentences, one in English and one in Chinese that have roughly the same meaning.

In [17]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒."

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="zh")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

In [18]:
encoded_zh = tokenizer(chinese_text, return_tensors="pt")

Let's start by translating the Chinese sentence into English.

In [19]:
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

['Do not interfere with the matters of the witches, because they are delicate and will soon be angry.']

Interesting and subtly different from our English original. Now we'll try translating the English to Chinese and then we'll take that Chinese output and translate it back into English.  This should give us an idea of how well the model works.

In [20]:
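tokenizer.src_lang = "en"  # the tokenizer was created with src_lang="zh"; tag the English input correctly
encoded_en = tokenizer(en_text, return_tensors="pt")
tokenizer.src_lang = "zh"  # restore the source language for the Chinese text encoded later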

In [21]:
generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.get_lang_id("zh"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

['不要介入魔術師的事情,因為他們是微妙和快樂的憤怒。']

Now we'll store that Chinese output in a variable so we can translate back to English.

In [22]:
chinese_back_text = '不要介入魔術師的事情,因為他們是微妙和快樂的憤怒。'

In [23]:
encoded_zhb = tokenizer(chinese_back_text, return_tensors="pt")

In [24]:
generated_tokens = model.generate(**encoded_zhb, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

['Do not interfere with the things of the witches, because they are delicate and pleasant anger.']

Now you can see how far it has drifted as we have translated back and forth. With some care this approach can be used to generate novel content that can augment a training set (as long as the drift isn't too bad). This is what we call back translation.
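
To make the augmentation idea concrete, here is a minimal sketch of a back-translation loop with a drift filter. Everything in it is an assumption for illustration: `back_translate_augment`, `jaccard`, and the `min_sim` threshold are hypothetical names, the two translation callables are stand-ins for real model calls such as the M2M100 `generate` calls above, and token-set overlap is only a crude proxy for a proper similarity metric.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two sentences (a crude similarity proxy)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def back_translate_augment(
    sentences: List[str],
    to_pivot: Callable[[str], str],
    from_pivot: Callable[[str], str],
    min_sim: float = 0.3,
) -> List[str]:
    """Keep round-trip paraphrases that differ from the source but have not drifted too far."""
    augmented = []
    for src in sentences:
        paraphrase = from_pivot(to_pivot(src))
        if paraphrase != src and jaccard(src, paraphrase) >= min_sim:
            augmented.append(paraphrase)
    return augmented


# Stand-in translators: a lookup table replays the round trip recorded in this notebook.
original = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
round_trip = "Do not interfere with the things of the witches, because they are delicate and pleasant anger."
recorded = {original: round_trip}

new_examples = back_translate_augment(
    [original, "Hello."],
    to_pivot=lambda s: s,                    # pretend: English -> Chinese
    from_pivot=lambda s: recorded.get(s, s), # pretend: Chinese -> English
)
print(new_examples)
```

In practice you would likely replace `jaccard` with an embedding-based score, since surface overlap penalizes exactly the rewording that makes a paraphrase useful.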

[Return to Top](#returnToTop)  
<a id = 'translationMetrics'></a>

## 4. Machine Translation Metrics

HuggingFace provides a library called evaluate that includes a large number of metrics.  We'll use two of them here.

In [25]:
!pip install -q evaluate
import evaluate

from datasets import DownloadConfig

### 4.1 BLEU example

The [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu) has been around for a while. Let's run an example of the scoring using the function provided by the evaluate library from HuggingFace.

In [26]:
#let's manually create some candidates and references
#individual sentence example - this is best to experiment with.
bleu_candidate = ["the earth trembled in Japan again on Monday the 4th of September"
]

bleu_reference = [["earthquakes hit Japan again on Monday September 4"]
]

#multiple pairs of inputs and reference outputs
bleu_candidates = ["the earth trembled in Japan again on Monday the 4th of September",
                   "earthquakes struck Japan again on Monday the 4th of September"
]
bleu_references = [
                   ["earthquakes hit Japan again on Monday September 4"],
                   ["On September 4th , a Monday , Japan had another earthquake"]
]

Let's first try our individual candidate and reference.  They're both sort of saying the same thing.  Does the BLEU score reflect that similarity?

In [27]:
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=bleu_candidate, references=bleu_reference)
print(results)

{'bleu': 0.22416933501922287, 'precisions': [0.4166666666666667, 0.2727272727272727, 0.2, 0.1111111111111111], 'brevity_penalty': 1.0, 'length_ratio': 1.5, 'translation_length': 12, 'reference_length': 8}
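
To see where the `precisions` and `brevity_penalty` fields come from, here is a minimal from-scratch sketch of BLEU: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It is not the `evaluate` implementation. It assumes plain whitespace tokenization, no smoothing, and the shortest reference as the reference length; `corpus_bleu` and `ngrams` are names chosen for this sketch. For the single sentence pair above it should land on the same values as the library output.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """candidates: list of strings; references: list of lists of strings."""
    matches, totals = [0] * max_n, [0] * max_n
    cand_len = ref_len = 0
    for cand, refs in zip(candidates, references):
        cand_tok = cand.split()
        refs_tok = [r.split() for r in refs]
        cand_len += len(cand_tok)
        ref_len += min(len(r) for r in refs_tok)  # assumption: shortest reference
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand_tok, n)
            max_ref = Counter()
            for r in refs_tok:
                max_ref |= ngrams(r, n)           # element-wise max over references
            clipped = cand_counts & max_ref       # element-wise min = clipped counts
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(cand_counts.values())
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        return 0.0, precisions                    # no smoothing in this sketch
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return score, precisions


candidate = ["the earth trembled in Japan again on Monday the 4th of September"]
reference = [["earthquakes hit Japan again on Monday September 4"]]
score, precisions = corpus_bleu(candidate, reference)
print(score, precisions)
```

Because the counts are pooled across all sentence pairs before the precisions are computed, the same function also handles the multi-pair case; it is a corpus-level score, not an average of sentence scores.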


BLEU is typically computed in aggregate over many candidates, and each candidate can have multiple reference translations.  Here we run the candidates and references so you can see how it's done.

In [28]:
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=bleu_candidates, references=bleu_references)
print(results)

{'bleu': 0.14367696612929734, 'precisions': [0.4090909090909091, 0.15, 0.1111111111111111, 0.0625], 'brevity_penalty': 1.0, 'length_ratio': 1.1578947368421053, 'translation_length': 22, 'reference_length': 19}


### 4.2 BERTScore

The BLEU score matches the actual word strings in your candidate translation to the word strings in the reference translation.  But what if your candidate says the same thing as the reference but simply uses different words to do so?  In that case your BLEU score may be zero because no words match but at a meaning level your candidate is actually a partial match.

There's another metric, called BERTScore, that leverages the contextualized vectors generated by a BERT model to make the comparison between the candidate and the reference.  You can read about [BERTScore here](https://openreview.net/pdf?id=SkeHuCVFDr).

BERTScore makes pairwise comparisons between the vectors for all of the tokens in the reference and the candidate using cosine similarity.  The result is a similarity score between the two sentences that takes into account synonyms and also alternate orderings of words.  

In [29]:
!pip install -q bert_score

To see how it works, let's give the algorithm some data we create.  Let's make examples that use different words but mean sort of the same thing.  We would expect these to produce a high score.  You can change the inputs below to see how it affects the scores.

In [30]:
from evaluate import load
bertscore = load("bertscore")

In [31]:
predictions = ["hello there", "general kenobi"]
references = ["hi there", "obie wan kenobi"]
results = bertscore.compute(predictions=predictions, references=references, lang="en")
#results = bertscore.compute(predictions=predictions, references=references, lang="en", model_type="distilbert-base-uncased")
print(results)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [0.9971846342086792, 0.9397112727165222], 'recall': [0.9971846342086792, 0.8711985349655151], 'f1': [0.9971846342086792, 0.904158890247345], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.48.3)'}
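
The precision, recall, and F1 values above come from greedy matching over token vectors. The sketch below illustrates that matching step under simplifying assumptions: the 2-d vectors are made-up stand-ins for contextual embeddings, and IDF weighting and baseline rescaling are left out. The function names are chosen for this sketch and are not part of the `bert_score` package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_bertscore(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors."""
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(row[j] for row in sim) for j in range(len(ref_vecs))) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy vectors: two candidate tokens, three reference tokens.
cand_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
precision, recall, f1 = greedy_bertscore(cand_vecs, ref_vecs)
print(precision, recall, f1)
```

In this toy case every candidate token has an exact counterpart in the reference, so precision is perfect, while recall is lower because the third reference token has no close match in the candidate.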


### 4.3 Sentence Transformers

There's another way that you can compare sequences of text to see how similar they are.  We can use a model that generates a single vector to represent that sequence of text and then compare those vectors to see how "similar" the two sentences are.  There's a library called [Sentence Transformers](https://sbert.net/docs/sentence_transformer/pretrained_models.html) that includes a variety of pretrained models that convert sequences of text into vectors.  You can choose among the models to find one that works well for your particular circumstances.

In [32]:
from sentence_transformers import SentenceTransformer

# Load https://huggingface.co/sentence-transformers/all-mpnet-base-v2
model = SentenceTransformer("all-mpnet-base-v2")

Let's compare the vectors of several sentences. Two are topically similar, so they should generate a higher but not very high score, while the third is unrelated and should therefore have a low score.

In [33]:
embeddings = model.encode([
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
])
similarities = model.similarity(embeddings, embeddings)
print(similarities)

tensor([[1.0000, 0.6817, 0.0492],
        [0.6817, 1.0000, 0.0421],
        [0.0492, 0.0421, 1.0000]])
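
A similarity matrix like the one above can be sketched by hand in two steps: collapse each sentence's token vectors into one vector, then take the cosine similarity of every pair. The example assumes mean pooling over non-padding tokens, which is a common choice for Sentence Transformers models but is an assumption here, and the token vectors are invented for illustration.

```python
import math


def mean_pool(token_vecs, mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [v for v, m in zip(token_vecs, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(vectors):
    """Pairwise cosine similarities; symmetric, with 1.0 on the diagonal."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy token vectors for three "sentences"; the last token of the second one is padding.
sent_a = mean_pool([[1.0, 0.0], [1.0, 1.0]], [1, 1])
sent_b = mean_pool([[2.0, 0.0], [2.0, 2.0], [9.0, 9.0]], [1, 1, 0])
sent_c = mean_pool([[0.0, 1.0], [-1.0, 1.0]], [1, 1])

sims = similarity_matrix([sent_a, sent_b, sent_c])
for row in sims:
    print([round(x, 4) for x in row])
```

Cosine similarity ignores vector length, which is why the first two toy sentences score as identical even though one vector is twice as long as the other.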


[Return to Top](#returnToTop)  
<a id = 'subwordModels'></a>

## 5. Subword Models

Different pretrained models use different subword models.  Each subword model identifies a different set of "tokens" based on an efficient representation of words and parts of words in the pre-training corpus.  The model has an embedding for each one of the subwords in its vocabulary.

We do not typically interact directly with the subword models but rather do so indirectly through the Tokenizer object.

Let's try the BERT base cased tokenizer.  It uses a wordpiece subword model.

In [34]:
#wordpiece
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

print(f'The vocabulary size is {tokenizer.vocab_size}')

The vocabulary size is 28996


In [35]:
tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")

['Don',
 "'",
 't',
 'you',
 'love',
 '[UNK]',
 'Transformers',
 '?',
 'We',
 'sure',
 'do',
 '.']
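
The `[UNK]` for the emoji is a consequence of how WordPiece splits a word: it greedily takes the longest prefix that is in the vocabulary, continues with `##`-prefixed pieces, and gives up on the whole word if some part cannot be matched. The sketch below shows that greedy longest-match-first procedure with a tiny made-up vocabulary; it is an illustration, not the `BertTokenizer` code, and it skips the basic punctuation and whitespace splitting that happens first.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # shrink the window and retry
        if piece is None:
            return [unk]                      # nothing matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (hypothetical), far smaller than BERT's 28,996 entries.
toy_vocab = {"un", "##aff", "##able", "love"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("🤗", toy_vocab))
```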

This is the same tokenizer class, but loaded with the vocabulary of the multilingual (uncased) version of BERT. Note that it contains many more tokens than BERT base cased (105,879 versus 28,996) because it has to be able to deal with the scripts and symbols of many languages.

In [36]:
#wordpiece
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")
print(f'The vocabulary size is {tokenizer.vocab_size}')

The vocabulary size is 105879


In [37]:
tokenizer.tokenize("你不喜欢🤗变形金刚吗？ 我们肯定会。")

['你',
 '不',
 '喜',
 '欢',
 '[UNK]',
 '变',
 '形',
 '金',
 '刚',
 '吗',
 '？',
 '我',
 '们',
 '肯',
 '定',
 '会',
 '。']

Let's put that first English sentence through the multilingual tokenizer.  Apart from lowercasing (this checkpoint is uncased), it produces the same subwords for English even though it can also handle other languages, as shown by its much larger vocabulary size.

In [38]:
tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")

['don',
 "'",
 't',
 'you',
 'love',
 '[UNK]',
 'transformers',
 '?',
 'we',
 'sure',
 'do',
 '.']

T5 uses the sentencepiece subword model.  Here we'll use the tokenizer for the multilingual version of T5 called mt5.  Notice the vocabulary size.

In [39]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
print(f'The vocabulary size is {tokenizer.vocab_size}')

The vocabulary size is 250100


In [40]:
tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")

['▁Don',
 "'",
 't',
 '▁you',
 '▁love',
 '▁',
 '🤗',
 '▁',
 'Transformers',
 '?',
 '▁We',
 '▁sure',
 '▁do',
 '.']

The sentencepiece subword model includes a marker (`▁`) to indicate that a subword is at the beginning of a word and thus, in English, is preceded by a space.  This means that with sentencepiece it is possible to accurately reconstruct the sentence because the word boundaries are identified explicitly.
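
A short sketch makes the reconstruction point concrete. The helper below is an illustration, not the library's decoder (its name is chosen for this sketch). It rebuilds the text by concatenating the pieces and turning each marker back into a space, and it contrasts that with the WordPiece tokens from earlier, where the original spacing and the emoji cannot be recovered.

```python
SPIECE_MARKER = "\u2581"  # the "lower one eighth block" character that marks a word start


def sentencepiece_detokenize(pieces):
    """Concatenate the pieces and turn every marker back into a space."""
    return "".join(pieces).replace(SPIECE_MARKER, " ").lstrip(" ")


M = SPIECE_MARKER
# The mT5 pieces shown above, written with the marker constant for clarity.
mt5_pieces = [M + "Don", "'", "t", M + "you", M + "love", M, "🤗", M,
              "Transformers", "?", M + "We", M + "sure", M + "do", "."]
restored = sentencepiece_detokenize(mt5_pieces)
print(restored)

# WordPiece output from the BERT cased tokenizer: spacing is lost and the emoji became [UNK].
wordpiece_tokens = ["Don", "'", "t", "you", "love", "[UNK]", "Transformers",
                    "?", "We", "sure", "do", "."]
naive = " ".join(wordpiece_tokens)
print(naive)
```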


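Finally, let's look at **GPT2**, which uses a byte-level Byte-Pair Encoding (BPE) subword model.  Its output looks quite different: a leading `Ġ` stands for a preceding space, and characters it has no token for, such as the emoji, are split into byte-level pieces instead of becoming `[UNK]`.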

In [41]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(f'The vocabulary size is {tokenizer.vocab_size}')

The vocabulary size is 50257


In [42]:
tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")

['Don',
 "'t",
 'Ġyou',
 'Ġlove',
 'ĠðŁ',
 '¤',
 'Ĺ',
 'ĠTransformers',
 '?',
 'ĠWe',
 'Ġsure',
 'Ġdo',
 '.']
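
The odd-looking pieces `'ĠðŁ', '¤', 'Ĺ'` are the UTF-8 bytes of " 🤗" rendered as printable characters. The sketch below rebuilds a byte-to-character table in the style used by GPT-2's byte-level BPE and uses it to map the tokens back to text. It is written from scratch for illustration and is not imported from `transformers`; treat the table construction as an assumption about how the tokenizer displays bytes.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own character.
    bs = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # control bytes, space, etc. are shifted to 256 and up
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def gpt2_tokens_to_text(tokens):
    """Turn byte-level token strings back into the original text."""
    raw = bytes(byte_decoder[ch] for ch in "".join(tokens))
    return raw.decode("utf-8")


# How the four UTF-8 bytes of the emoji (plus the leading space) are displayed.
emoji_piece = "".join(byte_encoder[b] for b in " 🤗".encode("utf-8"))
print(emoji_piece)

gpt2_tokens = ['Don', "'t", 'Ġyou', 'Ġlove', 'ĠðŁ', '¤', 'Ĺ',
               'ĠTransformers', '?', 'ĠWe', 'Ġsure', 'Ġdo', '.']
decoded = gpt2_tokens_to_text(gpt2_tokens)
print(decoded)
```

Because every possible byte has an entry in the table, any input can be tokenized and decoded without loss, which is why GPT2 never needs an unknown token.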

[Return to Top](#returnToTop)  
<a id = 'answers'></a>

## ANSWERS

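1.  Like BERT, the T5 tokenizer returns `input_ids` and an `attention_mask`. Unlike BERT, the T5 model doesn't have the token type ids that BERT uses to identify different segments, and there is no `[CLS]` token at the start: the sequence simply ends with the end-of-sequence token `</s>` (id 1).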

