
# Example solution to exercise task 1

Task description:

> ### Subword tokenization
>
> [...] Select two models, one monolingual trained purely on your target language, and one multilingual trained on several languages including your target language. Here are some suggestions:
>
> * Finnish: `TurkuNLP/bert-base-finnish-cased-v1` and `bert-base-multilingual-cased`

We'll use the `transformers` library and follow that suggestion.

In [None]:
!pip --quiet install transformers

In [None]:
MODEL_1_NAME = 'TurkuNLP/bert-base-finnish-cased-v1'
MODEL_2_NAME = 'bert-base-multilingual-cased'

We could get the tokenizer from a `pipeline` as shown in the [Generation basics notebook](https://github.com/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/generation_basics.ipynb), but here we'll use the `AutoTokenizer` class ([documentation](https://huggingface.co/docs/transformers/en/model_doc/auto)) to load just the tokenizer, which avoids needlessly loading the model itself.

In [None]:
from transformers import AutoTokenizer

tokenizer_1 = AutoTokenizer.from_pretrained(MODEL_1_NAME)
tokenizer_2 = AutoTokenizer.from_pretrained(MODEL_2_NAME)

Task description:

> a) What are the vocabulary sizes in these models? Keeping in mind that the multilingual model is trained on several languages (how many?), how do these compare?

As with most well-documented Python classes, you can use `help` as well as the online documentation to figure out what the class offers.

In [None]:
help(AutoTokenizer)

We can get the vocabulary sizes of the tokenizers simply with `.vocab_size`:

In [None]:
print(MODEL_1_NAME, tokenizer_1.vocab_size)
print(MODEL_2_NAME, tokenizer_2.vocab_size)

The monolingual model has a vocabulary of approximately 50,000 items, while the multilingual model's vocabulary has approximately 120,000, i.e. more than twice as many. This makes sense, as the multilingual model needs to represent words in many more languages than the monolingual one.

From the documentation of [`bert-base-multilingual-cased`](https://huggingface.co/google-bert/bert-base-multilingual-cased) we can see that the multilingual model was trained on 104 languages. So, if we were to (incorrectly!) assume that the vocabularies of these languages do not overlap, each language would get on average approx. 1% of the 120,000 vocabulary items, i.e. only about 1,200. In practice, many of the languages use the Latin script and share many of the same words (e.g. proper nouns), but the number of vocabulary items specific to each language will still be comparatively small.
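As a quick sanity check of this back-of-the-envelope estimate, the arithmetic can be sketched as follows. The numbers are the rounded figures from the paragraph above (not the exact `vocab_size` values), and the variable names are only illustrative.

```python
# Rounded figures quoted in the text, not exact tokenizer values.
MULTILINGUAL_VOCAB = 120_000
NUM_LANGUAGES = 104

# Naive (and incorrect!) assumption: the vocabularies do not overlap,
# so every language gets an equal slice of the shared vocabulary.
per_language = MULTILINGUAL_VOCAB / NUM_LANGUAGES
share = 1 / NUM_LANGUAGES

print(f"vocabulary items per language: {per_language:.0f}")
print(f"share of vocabulary per language: {share:.2%}")
```

The exact division gives slightly fewer items than the rounded 1% figure, which does not change the conclusion.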

Task description:

> b) Write code to load the selected tokenizers and tokenize text using these (output is expected to be subword tokens in text form, not numbers). Select a piece of text (e.g. Wikipedia page or news article) written in your target language and tokenize it separately using both models. Inspect whether the tokenization results differ. How many subwords did each produce?

We've already loaded the tokenizers. Let's pick a paragraph of text from the [Finnish Wikipedia article on Turku](https://fi.wikipedia.org/wiki/Turku):

In [None]:
text_fin = """Turku syntyi Aurajoen suulle jo ennen 1200-lukua ja se on Suomen vanhin kaupunki. Kaupungin perustamisvuotena pidetään vuotta 1229, jolloin paavi Gregorius IX mainitsee bullassaan ensimmäistä kertaa kaupungin nimeltä Aboa (Turun latinankielinen nimi). Jo sitä aikaisemmin arabialainen maantieteilijä Al Idrisi viittaa Abuan kaupunkiin kaukana pohjoisessa. Turku oli pitkään Suomen merkittävin asutuskeskus, maan epävirallinen pääkaupunki vuosina 1809–1812[9] ja 1840-luvulle saakka Suomen suurin kaupunki. Turku on yhä alueensa paikallishallinnollinen, taloudellinen ja kulttuurinen keskus. Kaupunki on Lounais-Suomen aluehallintoviraston päätoimipaikka ja Varsinais-Suomen maakuntakeskus."""

For reference, how many whitespace-separated strings are there in that text?

In [None]:
print(len(text_fin.split()))

To tokenize text, you can invoke the tokenizer object directly or call e.g. `tokenize` or `encode`. The counts you get from these may differ slightly because some of them add special tokens (such as `[CLS]` and `[SEP]`), but these differences are not important for our estimates here. We'll use `tokenize` and look at the number of subwords first.
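If you want to see how much the different entry points disagree, a small helper along these lines may be useful. This is only a sketch: `count_three_ways` is a hypothetical name, and it assumes a Hugging Face tokenizer object that offers `tokenize`, `encode` (with the `add_special_tokens` argument) and direct calls returning `input_ids`.

```python
def count_three_ways(tokenizer, text):
    """Return token counts from tokenize(), encode() and a direct call."""
    return {
        # Subword strings, without any special tokens.
        "tokenize": len(tokenizer.tokenize(text)),
        # Token ids, with special tokens added by default.
        "encode": len(tokenizer.encode(text)),
        # Token ids with special tokens explicitly switched off.
        "encode_no_special": len(tokenizer.encode(text, add_special_tokens=False)),
        # Calling the tokenizer returns a mapping that includes "input_ids".
        "call": len(tokenizer(text)["input_ids"]),
    }
```

For example, `count_three_ways(tokenizer_1, text_fin)` would be expected to give `encode` and `call` counts that are two higher than the `tokenize` count for BERT-style tokenizers, since one token is added at the start and one at the end.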

In [None]:
print(MODEL_1_NAME, len(tokenizer_1.tokenize(text_fin)))
print(MODEL_2_NAME, len(tokenizer_2.tokenize(text_fin)))

The multilingual tokenizer uses more subwords to represent this text despite having a much larger vocabulary. This is because the vocabulary of the monolingual model is dedicated to the target language, whereas the multilingual vocabulary is shared across many languages. We get a different result if we tokenize text that is not in the target language of the first tokenizer:

In [None]:
text_eng = """Turku is Finland's oldest city.[1] It is not known when Turku was granted city status. Pope Gregory IX first mentioned the town of Aboa in his Bulla in 1229, and this year is now used as the founding year of the city.[4][5][14] Turku was the most important city in the eastern part (today's Finland) of the Kingdom of Sweden. After the Finnish War, Finland became an autonomous Grand Duchy of the Russian Empire in 1809, and Turku became the capital of the Grand Duchy. However, Turku lost its status as capital only three years later in 1812,[1] when Tsar Alexander I of Russia decided to move the capital to Helsinki. It was only after the last great fire in 1827 that most government institutions were moved to Helsinki along with the Royal Academy of Turku, founded in 1640, which later became the University of Helsinki, thus consolidating Helsinki's position as the new capital. Turku was Finland's most populous city until the late 1840s and remains the regional capital, an important business and cultural centre, and a port."""

print(MODEL_1_NAME, len(tokenizer_1.tokenize(text_eng)))
print(MODEL_2_NAME, len(tokenizer_2.tokenize(text_eng)))
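Since the Finnish and English texts differ in length, raw subword counts are hard to compare across them. One common way to normalize is the average number of subwords per whitespace-separated word. The sketch below assumes only that the tokenizer is passed in as a function from text to a list of tokens; `subwords_per_word` and `toy_tokenize` are hypothetical names, and the toy tokenizer exists only so the example runs without downloading a model.

```python
def subwords_per_word(tokenize, text):
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def toy_tokenize(text):
    """Toy stand-in tokenizer: cut each word into chunks of at most 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


ratio = subwords_per_word(toy_tokenize, "Turku on Suomen vanhin kaupunki")
print(f"{ratio:.2f} subwords per word")
```

With the real tokenizers, one would call e.g. `subwords_per_word(tokenizer_1.tokenize, text_fin)` and `subwords_per_word(tokenizer_2.tokenize, text_fin)`; values closer to 1 indicate that more words are kept whole.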

Let's have a look at the differences in tokenization by printing the first 30 tokens from each tokenizer side by side:

In [None]:
for t1, t2 in list(zip(tokenizer_1.tokenize(text_fin), tokenizer_2.tokenize(text_fin)))[:30]:
  print(t1, '\t', t2)
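Because the two tokenizers produce different numbers of tokens, pairing tokens by position drifts out of alignment as soon as the tokenizations diverge. One way to make the comparison easier to read is to group the pieces that belong together first. The sketch below assumes BERT-style WordPiece output, where a `##` prefix marks a token that continues the previous one; `group_wordpieces` is a hypothetical helper and the example token list is made up for illustration, not real tokenizer output.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens so that each '##' continuation joins the previous token."""
    groups = []
    for token in tokens:
        if token.startswith("##") and groups:
            groups[-1].append(token)  # continuation of the current group
        else:
            groups.append([token])    # start of a new group
    return groups


# Made-up token list for illustration only.
example_tokens = ["Tur", "##ku", "on", "kaupunki", "."]
for pieces in group_wordpieces(example_tokens):
    print(" ".join(pieces))
```

Applied to `tokenizer_1.tokenize(text_fin)` and `tokenizer_2.tokenize(text_fin)`, the two lists of groups should line up much more closely than the raw token lists. Note that punctuation typically forms its own group, so the groups do not correspond exactly to whitespace-separated words.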

Task description:

> c) [This webpage](https://platform.openai.com/tokenizer) can be used to visualize how ChatGPT tokenizes text. Try what it does with English and Finnish (or any other smaller language) text. What is your take on this?

We find that older GPT models are highly inefficient for Finnish. For example,

> Turku syntyi Aurajoen suulle jo ennen 1200-lukua ja se on Suomen vanhin kaupunki.

is 81 characters split into 30 tokens by models up to and including GPT-4, while

> Turku was founded at the mouth of the Aura River before the 13th century and is Finland's oldest city.

is 102 characters but only 24 tokens. The more recent GPT-4o models are somewhat more efficient at tokenizing Finnish, splitting the Finnish sentence above into 26 tokens. However, this is still not as efficient as a tokenizer dedicated to the language:

In [None]:
len(tokenizer_1.tokenize("Turku syntyi Aurajoen suulle jo ennen 1200-lukua ja se on Suomen vanhin kaupunki."))
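To put the GPT figures quoted above on a common scale, they can be converted to characters per token. The snippet below uses only the character and token counts stated in the text; the dictionary labels are illustrative, and the English count is taken to come from the same older tokenizer as the 30-token Finnish count, as the text implies.

```python
# (characters, tokens) as reported in the text above.
reported_counts = {
    "Finnish, models up to GPT-4": (81, 30),
    "Finnish, GPT-4o": (81, 26),
    "English": (102, 24),
}

chars_per_token = {
    name: chars / tokens for name, (chars, tokens) in reported_counts.items()
}

for name, value in chars_per_token.items():
    print(f"{name}: {value:.2f} characters per token")
```

The count returned by the monolingual tokenizer in the cell above can be added as another entry, using the same 81 characters, for a direct comparison.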