# Yerevan Winter School Tutorial 3
## Multilingual Tokenization and Sentence Embeddings for Low Resource Languages

**Context.**  
In this hands-on session, you will explore how multilingual models represent different languages at two levels.  
First, you will inspect how tokenizers split sentences in English and in at least one low resource language of interest.  
Second, you will generate sentence embeddings for short sentences in multiple languages and visualize them in two dimensions.

**What you will do.**

1. Use a multilingual tokenizer to inspect how sentences are split into tokens in English and another language.  
2. Compare tokenization granularity and discuss how this might affect model performance.  
3. Generate sentence embeddings for short sentences in multiple languages using a multilingual sentence encoder.  
4. Visualize the embeddings in two dimensions and inspect clustering patterns.  
5. Briefly document your observations about representation quality and typical failure cases for your language of interest.

**Important note.**  
The goal is to build intuition about how multilingual models handle your language, not to reach definitive scientific conclusions.  
We work with very small examples so that everything fits into a short tutorial.


## 0. Setup

Run the following cells to install and import the required libraries.  
This notebook is designed for Google Colab, but it will also work in a local Jupyter environment with internet access.


In [None]:
!pip install -q transformers sentence-transformers datasets umap-learn matplotlib pandas scikit-learn

In [None]:
import random
from typing import Dict, List

import matplotlib.pyplot as plt
import pandas as pd
import torch
from datasets import Dataset
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

## 1. Configuration: choose your languages and example sentences

In this section, you define:

- At least one low resource language of interest.  
- A small set of short sentences in each language.  
- The multilingual models you want to inspect.

By default, we include English (`en`) and Luxembourgish (`lb`) as an example low resource language.  
You can replace `lb` with another language that is relevant to you, for example Armenian (`hy`), Kurdish (`ku`), or any language supported by your chosen models.


In [None]:
# Define the languages you will work with.
# You must include "en" for English and at least one additional language code.

languages = [
    {"code": "en", "name": "English"},
    {"code": "lb", "name": "Luxembourgish"},  # change this to your low resource language if you prefer
]

languages

In [None]:
# Define example sentences for each language.
# You can modify these or replace them with your own small set.

example_sentences = {
    "en": [
        "The doctor arrived late to the hospital.",
        "Children are playing outside in the snow.",
        "This research project focuses on low resource languages.",
        "The bus was very crowded this morning.",
        "I enjoy reading books in different languages.",
    ],
    "lb": [
        "Den Dokter ass spéit am Spidol ukomm.",
        "Kanner spillen dobaussen am Schnéi.",
        "Dëse Fuerschungsprojet konzentréiert sech op Sproochen mat wéineg Ressourcen.",
        "De Bus war haut de Moien ganz voll.",
        "Ech liese gär Bicher an ënnerschiddleche Sproochen.",
    ],
}

# Quick sanity check
for lang in languages:
    code_ = lang["code"]
    print(f"Language {code_} has {len(example_sentences.get(code_, []))} sentences.")

If your low resource language is different from Luxembourgish, update:

- The `languages` list above.  
- The corresponding entry in `example_sentences`.  

Make sure that each language code in `languages` has a matching key in `example_sentences` with at least three sentences.
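
The consistency rule above can be checked automatically. The following is a minimal sketch, not part of the original notebook: the helper name `check_language_config` and the toy configuration are hypothetical, and the threshold of three sentences simply mirrors the requirement stated above.

```python
from typing import Dict, List


def check_language_config(
    languages: List[Dict[str, str]],
    example_sentences: Dict[str, List[str]],
    min_sentences: int = 3,
) -> List[str]:
    """Return a list of readable problems; an empty list means the config is consistent."""
    problems = []
    codes = [lang["code"] for lang in languages]
    if "en" not in codes:
        problems.append("English ('en') is missing from `languages`.")
    if len(codes) < 2:
        problems.append("Add at least one language besides English.")
    for code in codes:
        n = len(example_sentences.get(code, []))
        if n < min_sentences:
            problems.append(
                f"Language '{code}' has {n} sentences; expected at least {min_sentences}."
            )
    return problems


# Toy configuration with a deliberate gap: "hy" has no sentences yet.
toy_languages = [
    {"code": "en", "name": "English"},
    {"code": "hy", "name": "Armenian"},
]
toy_sentences = {"en": ["One.", "Two.", "Three."]}

for problem in check_language_config(toy_languages, toy_sentences):
    print(problem)
```

In the notebook itself you could call the helper with your own `languages` and `example_sentences` objects before moving on to the next section.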


## 2. Choose multilingual models for tokenization and embeddings

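We will use two separate models. Because the sentence encoder ships with its own tokenizer, the splits you inspect in Section 3 are not necessarily the ones the encoder sees.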

- A multilingual tokenizer, for example from `xlm-roberta-base` or a similar encoder only model.  
- A multilingual sentence transformer, for example `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.

You can modify the defaults below if you have a specific model you want to inspect.


In [None]:
# Model used to inspect tokenization behaviour.
tokenizer_model_name = "xlm-roberta-base"

# Model used to generate sentence embeddings.
embedding_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_name)
print("Tokenizer loaded from:", tokenizer_model_name)

embedder = SentenceTransformer(embedding_model_name)
print("Sentence embedding model loaded from:", embedding_model_name)

## 3. Inspecting tokenization behaviour

In this section, you will:

1. Tokenize each sentence using the chosen multilingual tokenizer.  
2. Inspect the tokens, including marker characters such as `▁`, which SentencePiece tokenizers like the default `xlm-roberta-base` one place at the start of each word.  
3. Compare token counts and average token length across languages.

Tokenization granularity matters because:

- Shorter tokens mean longer sequences for the same sentence, which increases computation cost.  
- Very fragmented tokenization may harm performance for your language if the model sees many rare subword patterns.
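
One common way to put a number on fragmentation is the average number of subword tokens per word. The sketch below is illustrative only: the helper `tokens_per_word` is hypothetical, the token lists are written by hand rather than produced by a real tokenizer, and splitting on whitespace is an assumption that does not hold for languages written without spaces.

```python
from typing import List


def tokens_per_word(sentence: str, tokens: List[str]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = sentence.split()
    return len(tokens) / max(len(words), 1)


# Hand-written token lists for illustration only; real tokenizer output will differ.
toy_examples = {
    "en": ("The bus was crowded.", ["▁The", "▁bus", "▁was", "▁crowd", "ed", "."]),
    "lb": ("Dëse Bus war voll.", ["▁D", "ë", "se", "▁Bus", "▁war", "▁vol", "l", "."]),
}

for code, (sentence, tokens) in toy_examples.items():
    print(code, tokens_per_word(sentence, tokens))
```

A value close to 1 means most words survive as single tokens. Higher values mean each word is broken into several pieces, which is the fragmentation described above.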


In [None]:
def tokenize_sentence(sentence: str) -> Dict:
    tokens = tokenizer.tokenize(sentence)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return {
        "tokens": tokens,
        "token_ids": token_ids,
        "num_tokens": len(tokens),
    }

# Build a table with tokenization details for all sentences.
rows = []
for lang in languages:
    code_ = lang["code"]
    name_ = lang["name"]
    for sent in example_sentences.get(code_, []):
        tok_info = tokenize_sentence(sent)
        rows.append({
            "language_code": code_,
            "language_name": name_,
            "sentence": sent,
            "tokens": tok_info["tokens"],
            "num_tokens": tok_info["num_tokens"],
            "avg_chars_per_token": len(sent) / max(tok_info["num_tokens"], 1),
        })

token_df = pd.DataFrame(rows)
token_df

You can scroll through the table above to see how sentences in each language are split into tokens.

The `avg_chars_per_token` column gives a rough indication of granularity: it divides the sentence length in characters (spaces included) by the number of tokens.  
Lower values mean that the tokenizer splits the sentence into many short pieces.  
Higher values mean that tokens are longer on average.


In [None]:
# Summary statistics by language.
summary_token_stats = token_df.groupby(["language_code", "language_name"])[
    ["num_tokens", "avg_chars_per_token"]
].agg(["mean", "min", "max"]).round(2)

summary_token_stats

### 3.1 Visualizing token counts

Run the cell below to visualize average token counts per language in a simple bar chart.


In [None]:
# Compute average number of tokens per sentence by language.
avg_tokens = token_df.groupby("language_code")["num_tokens"].mean()

plt.figure()
avg_tokens.plot(kind="bar")
plt.xlabel("Language code")
plt.ylabel("Average number of tokens per sentence")
plt.title("Average tokenization length by language")
plt.tight_layout()
plt.show()

### 3.2 Inspect specific tokenizations

If you see interesting or surprising patterns, you can inspect them more closely by printing tokens for selected sentences.


In [None]:
# Change these indices or conditions to inspect specific examples.

for idx, row in token_df.head(10).iterrows():
    print("Language:", row["language_code"])
    print("Sentence:", row["sentence"])
    print("Tokens:", row["tokens"])
    print("Number of tokens:", row["num_tokens"])
    print("-" * 60)

You can adapt the loop above to filter for a specific language or a specific sentence.  
For example, you can look at the longest or shortest tokenizations per language.


In [None]:
# Example: inspect the three sentences with the highest number of tokens.

top_long = token_df.sort_values("num_tokens", ascending=False).head(3)
for idx, row in top_long.iterrows():
    print("Language:", row["language_code"])
    print("Sentence:", row["sentence"])
    print("Tokens:", row["tokens"])
    print("Number of tokens:", row["num_tokens"])
    print("-" * 60)

## 4. Generating multilingual sentence embeddings

Next, you will use a multilingual sentence transformer to obtain vector representations for all sentences.

Steps:

1. Create a list of all `(language, sentence)` pairs.  
2. Encode each sentence into a high dimensional vector.  
3. Reduce the vectors to two dimensions using PCA.  
4. Visualize them in a scatter plot, colored by language.


In [None]:
# Build a unified dataset of sentences with language labels.
all_rows = []
for lang in languages:
    code_ = lang["code"]
    name_ = lang["name"]
    for sent in example_sentences.get(code_, []):
        all_rows.append({
            "language_code": code_,
            "language_name": name_,
            "sentence": sent,
        })

sentence_df = pd.DataFrame(all_rows)
sentence_df

In [None]:
# Encode sentences into embeddings.

sentences = sentence_df["sentence"].tolist()
embeddings = embedder.encode(sentences, convert_to_numpy=True)

print("Embeddings shape:", embeddings.shape)

In [None]:
# Reduce to two dimensions using PCA for visualization.

pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

sentence_df["dim1"] = embeddings_2d[:, 0]
sentence_df["dim2"] = embeddings_2d[:, 1]

sentence_df.head()

## 5. Visualizing sentence embeddings

Run the cell below to see a scatter plot of sentences in the two dimensional PCA space, colored by language.

You should look for patterns such as:

- Do sentences from the same language cluster together?  
- Are clusters for different languages clearly separated or overlapping?  
- Does your low resource language behave similarly to high resource languages?


In [None]:
# Simple scatter plot of sentence embeddings by language.

plt.figure()
for lang in languages:
    code_ = lang["code"]
    subset = sentence_df[sentence_df["language_code"] == code_]
    plt.scatter(subset["dim1"], subset["dim2"], label=code_)

for _, row in sentence_df.iterrows():
    plt.text(row["dim1"], row["dim2"], row["language_code"], fontsize=8)

plt.xlabel("PCA dimension 1")
plt.ylabel("PCA dimension 2")
plt.title("Sentence embeddings projected to 2D")
plt.legend()
plt.tight_layout()
plt.show()

If you have enough sentences, you can add more languages or replace PCA with UMAP for potentially clearer clusters.  
For this short tutorial, PCA is usually sufficient.
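
Whether two PCA dimensions are "sufficient" depends on how much of the variance they retain. The following sketch computes that fraction with NumPy. It is an illustration under assumptions: `explained_variance_2d` is a hypothetical helper, and the random matrix is only a stand-in for real embeddings (its size of 10 by 32 is arbitrary).

```python
import numpy as np


def explained_variance_2d(vectors: np.ndarray) -> float:
    """Fraction of total variance captured by the first two principal components."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal axis.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[:2].sum() / variances.sum())


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 32))  # random stand-in, not real embeddings
print(round(explained_variance_2d(toy_embeddings), 3))
```

In the notebook, `pca.explained_variance_ratio_.sum()` on the fitted `pca` object should report the same quantity. If the value is low, treat distances in the 2D plot with caution, since most of the structure lives in the discarded dimensions.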


## 6. Observations and typical failure cases

Use the questions below as a guide when you look at your tokenization and embedding results.

### 6.1 Tokenization

- Does your low resource language have many more tokens per sentence compared to English?  
- Are there specific characters or letter combinations that are split in unexpected ways?  
- Do named entities, technical terms, or local words get fragmented into many sub tokens?
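
To answer the first question with a number, you can express each language's average token count relative to English. This is a minimal sketch: the helper `token_ratio_vs_reference` is hypothetical and the averages are invented for illustration. The comparison is only meaningful when the sentences are roughly parallel across languages, as the default examples are.

```python
from typing import Dict


def token_ratio_vs_reference(
    mean_tokens: Dict[str, float], reference: str = "en"
) -> Dict[str, float]:
    """Average tokens per sentence for each language, divided by the reference language's average."""
    base = mean_tokens[reference]
    return {code: round(value / base, 2) for code, value in mean_tokens.items()}


# Invented averages for illustration; replace with your own measurements.
toy_mean_tokens = {"en": 10.0, "lb": 14.0, "hy": 17.5}
print(token_ratio_vs_reference(toy_mean_tokens))
```

In the notebook, `avg_tokens.to_dict()` from Section 3.1 could serve as the input. A ratio of 1.4 would mean the language needs about 40% more tokens than English for comparable content.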

### 6.2 Embeddings

- Do sentences in your language cluster tightly or are they scattered among other languages?  
- If you use parallel or similar sentences across languages, do their embeddings appear close to each other?  
- Are there sentences that you expect to be similar but that appear far apart in the embedding space?
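
A 2D projection can distort distances, so it can help to check parallel sentences in the original embedding space. The sketch below assumes two arrays whose rows are aligned translations, and measures how often the nearest neighbour by cosine similarity is the correct translation. The helper names and the synthetic vectors are hypothetical stand-ins, not output of the sentence encoder.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def translation_match_rate(source: np.ndarray, target: np.ndarray) -> float:
    """Share of source rows whose most similar target row has the same index."""
    sims = cosine_similarity_matrix(source, target)
    nearest = sims.argmax(axis=1)
    return float((nearest == np.arange(len(source))).mean())


# Synthetic stand-ins: "translations" are the source vectors plus small noise.
rng = np.random.default_rng(0)
toy_source = rng.normal(size=(5, 16))
toy_target = toy_source + 0.05 * rng.normal(size=(5, 16))
print(translation_match_rate(toy_source, toy_target))
```

With the default example sentences, which are aligned across English and Luxembourgish, you could pass the English rows of `embeddings` as `source` and the Luxembourgish rows as `target`. A low match rate suggests the encoder does not align the two languages well for these sentences.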

### 6.3 Typical failure cases

- Words or phrases that are unknown, heavily fragmented, or mapped to unexpected parts of the embedding space.  
- Mixing of scripts or code switching that confuses the tokenizer or encoder.  
- Systematically worse behaviour for your language compared to high resource languages.
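
Two of these failure cases can be flagged with simple checks. The sketch below is a rough heuristic under stated assumptions: `scripts_in` approximates a character's script by the first word of its Unicode name, and `unknown_token_share` assumes the unknown token is written `<unk>`. Both helper names are hypothetical. In the notebook you would pass `tokenizer.unk_token` instead of relying on the default.

```python
import unicodedata
from typing import List, Set


def scripts_in(text: str) -> Set[str]:
    """Approximate the scripts used in `text` via the first word of each letter's Unicode name."""
    scripts = set()
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def unknown_token_share(tokens: List[str], unk_token: str = "<unk>") -> float:
    """Fraction of tokens equal to the unknown token (the default value is an assumption)."""
    if not tokens:
        return 0.0
    return tokens.count(unk_token) / len(tokens)


# A sentence mixing Latin and Armenian script, and a hand-written token list.
print(sorted(scripts_in("Hello բարև")))
print(unknown_token_share(["▁Hello", "<unk>", "▁world", "."]))
```

Sentences that report more than one script are candidates for closer inspection of their tokenization, and a non-zero unknown-token share points to characters the vocabulary does not cover.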

You can use the markdown cell below to write down your notes.


### 6.4 Notes

Use this space to document your observations.  
You can work in English or in your own language.

- Tokenization observations.  
- Embedding observations.  
- Examples of typical failure cases.  
- Ideas for how these issues might affect downstream tasks in your language.


## 7. Optional extensions

If you have extra time, you can try one or more of the following extensions.

1. **Add more models.**  
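   Compare tokenization and embeddings across different multilingual models, for example `xlm-roberta-base` versus `bert-base-multilingual-cased`. Pick models with different vocabularies: `xlm-roberta-base` and `xlm-roberta-large` share the same tokenizer, so their token splits will be identical.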

2. **Parallel sentences.**  
   Construct a small set of parallel sentences across languages and check whether embeddings for translations are close in the 2D plot.

3. **Longer sentences.**  
   Add longer and more complex sentences, for example from news or Wikipedia, and inspect how token counts and embeddings change.

4. **Task specific prompts.**  
   Use task oriented sentences, for example questions, instructions, or domain specific text, to see whether certain genres are better represented than others.

These small experiments can help you decide which models and tokenizers might be more suitable for low resource applications in your own projects.
