# WECHSEL Tutorial

In this tutorial, we will use Langsfer to transfer a model trained in English to German with the [WECHSEL](https://arxiv.org/abs/2112.06598) method, similarly to one of the experiments described in the paper.

WECHSEL is a cross-lingual language transfer method that efficiently initializes the embedding parameters of a language model in a target language using the embedding parameters from an existing model in a source language, facilitating more efficient training in the new language.

The method requires as input:

- a tokenizer in the source language,
- a pre-trained language model in the source language,
- a tokenizer in the target language,
- two monolingual fastText embeddings, one for the source language and one for the target language.
  They can be obtained in one of two ways:
    - using pre-trained fastText embeddings,
    - training fastText embeddings from scratch.
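
To make the role of these inputs more concrete, here is a minimal sketch of the core idea as described in the WECHSEL paper: a target token's embedding is initialized as a similarity-weighted average of the embeddings of its nearest source tokens in the aligned auxiliary (fastText) space. The function names, the toy vectors, and the default values of `k` and `temperature` are assumptions chosen for illustration; Langsfer's actual implementation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def init_target_embedding(target_aux, source_aux, source_emb, k=2, temperature=0.1):
    """Initialize one target-token embedding from its k most similar source tokens.

    target_aux: auxiliary (aligned) vector of the target token
    source_aux: auxiliary vectors of the source tokens, one per token
    source_emb: model embeddings of the source tokens, same order as source_aux
    k, temperature: hypothetical defaults for this toy example
    """
    sims = [cosine(target_aux, s) for s in source_aux]
    # Indices of the k nearest source tokens in the auxiliary space.
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax weights over the similarities of those neighbours.
    weights = [math.exp(sims[i] / temperature) for i in nearest]
    total = sum(weights)
    dim = len(source_emb[0])
    return [
        sum(w / total * source_emb[i][d] for w, i in zip(weights, nearest))
        for d in range(dim)
    ]


# Toy data: 3 source tokens with 2-d auxiliary vectors and 2-d model embeddings.
source_aux = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
source_emb = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
target_aux = [1.0, 1.0]  # equally similar to the first two source tokens

new_embedding = init_target_embedding(target_aux, source_aux, source_emb)
print(new_embedding)
```

In this toy setup the target token is equally close to the first two source tokens, so its embedding lands halfway between theirs.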

For this tutorial, we stay as close as possible to the parameters described in the paper:

- for the source model and tokenizer, we will use [roberta-base](https://huggingface.co/FacebookAI/roberta-base),
- for the target tokenizer, we will train one from scratch,
- for the fastText embeddings, we will download pre-trained models from [fastText's website](https://fasttext.cc/docs/en/crawl-vectors.html).

For the sake of brevity, however, we will use fewer training samples and steps.

# Setup

We begin by importing libraries and setting some defaults.

In [1]:
%load_ext autoreload
%load_ext tensorboard

In [2]:
import random
import warnings
from typing import Generator

import datasets
import numpy as np
import torch
from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

warnings.simplefilter("ignore")

# Constants
SOURCE_MODEL_NAME = "roberta-base"
DATASET_NAME = "oscar-corpus/oscar"
DATASET_CONFIG_NAME = "unshuffled_deduplicated_de"
DATASET_SIZE = 20000
TRAIN_DATASET_SIZE = 16000
TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 8
EVAL_STEPS = 4000
MAX_TRAIN_STEPS = 48000
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 0.01
ADAM_EPSILON = 1e-6
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.98
SEED = 16

random.seed(SEED)
np.random.seed(SEED)

We will use the following functions and classes from Langsfer.

In [3]:
%autoreload
from langsfer.high_level import wechsel
from langsfer.embeddings import FastTextEmbeddings
from langsfer.utils import download_file

# Dataset

We use the [datasets](https://huggingface.co/docs/datasets/index) library to load the German configuration of the [OSCAR](https://huggingface.co/datasets/oscar-corpus/oscar) dataset (**O**pen **S**uper-large **C**rawled **A**LMAnaCH co**R**pus) and then take a limited number of samples from it for training and validation.

In [4]:
dataset = datasets.load_dataset(
    DATASET_NAME,
    DATASET_CONFIG_NAME,
    split="train",
    streaming=True,
    trust_remote_code=True,
)
dataset = dataset.shuffle(seed=SEED)
dataset = dataset.take(DATASET_SIZE)
train_dataset = dataset.take(TRAIN_DATASET_SIZE)
val_dataset = dataset.skip(TRAIN_DATASET_SIZE)
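
The split above relies on `take` and `skip` both reading the same shuffled stream from its beginning. The following sketch mimics that behaviour with plain Python iterators to show that the two subsets do not overlap. The helper names and the small sizes are stand-ins for illustration, and it assumes that the seeded shuffle yields the same order on every pass.

```python
from itertools import islice


def take(iterable, n):
    """Yield the first n items (analogous to IterableDataset.take)."""
    return islice(iterable, n)


def skip(iterable, n):
    """Yield everything after the first n items (analogous to IterableDataset.skip)."""
    return islice(iterable, n, None)


def stream():
    """Stand-in for the shuffled stream; each call restarts it in the same order."""
    return iter(range(100))


# Scaled-down versions of DATASET_SIZE and TRAIN_DATASET_SIZE.
dataset_size = 20
train_dataset_size = 16

train = list(take(take(stream(), dataset_size), train_dataset_size))
val = list(skip(take(stream(), dataset_size), train_dataset_size))

print(len(train), len(val))
print(set(train) & set(val))
```

With the tutorial's constants, the same logic gives 16000 training samples and 4000 validation samples.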

We take a sample text from the validation set, both to compare the source and target tokenizers and to evaluate the text generation of our trained model at the end.

In [5]:
sample_text = list(val_dataset.skip(10).take(1))[0]["text"]
print(sample_text)

mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester (Julia Jentsch) Angst… Doppelbödig.
in der rechten Armbeuge beim Öffnen des Mehlsilos zur Rettung der Bäckerstocher, welch ein Regiefehler! Der tiefergehende Sinn des Falles wird ansonsten auch nicht klar. Wirkt leider alles etwas zusammengeschustert.
Wer spielte die Hauptrolle in Film "The International" und wurde als potenzieller James Bond-Nachfolger gehandelt?


# Embeddings and Tokenizers

We load the source tokenizer as well as the source model and extract the input embeddings matrix from it.

In [6]:
source_tokenizer = AutoTokenizer.from_pretrained(SOURCE_MODEL_NAME)
source_model = AutoModel.from_pretrained(SOURCE_MODEL_NAME)
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()
del source_model

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We use the source tokenizer to convert the sample text to tokens.

In [7]:
tokens = source_tokenizer.tokenize(sample_text)
print(f"Number of tokens {len(tokens)}, tokens: {tokens}")

Number of tokens 172, tokens: ['mit', 'ĠEva', 'ĠMatt', 'es', 'Ġal', 's', 'ĠKl', 'ara', 'ĠBl', 'um', '.', 'ĠE', 'ine', 'ĠB', 'Ã¤', 'cker', 'st', 'och', 'ter', 'Ġstir', 'bt', 'Ġin', 'Ġder', 'ĠBack', 'r', 'Ã¶', 'h', 're', '.', 'ĠJet', 'z', 't', 'Ġhat', 'Ġi', 'h', 're', 'ĠSchw', 'ester', 'Ġ(', 'Jul', 'ia', 'ĠJ', 'ents', 'ch', ')', 'ĠAng', 'st', 'âĢ¦', 'ĠDo', 'ppel', 'b', 'Ã¶', 'dig', '.', 'Ċ', 'in', 'Ġder', 'Ġre', 'ch', 'ten', 'ĠArm', 'be', 'uge', 'Ġbe', 'im', 'ĠÃĸ', 'ff', 'nen', 'Ġdes', 'ĠMeh', 'ls', 'il', 'os', 'Ġz', 'ur', 'ĠR', 'ett', 'ung', 'Ġder', 'ĠB', 'Ã¤', 'cker', 'st', 'oc', 'her', ',', 'Ġwel', 'ch', 'Ġe', 'in', 'ĠReg', 'ief', 'eh', 'ler', '!', 'ĠDer', 'Ġt', 'ief', 'er', 'ge', 'hend', 'e', 'ĠSinn', 'Ġdes', 'ĠFall', 'es', 'Ġw', 'ird', 'Ġan', 'son', 'sten', 'Ġa', 'uch', 'Ġn', 'icht', 'Ġk', 'lar', '.', 'ĠW', 'irk', 't', 'Ġle', 'ider', 'Ġall', 'es', 'Ġet', 'was', 'Ġz', 'us', 'amm', 'enges', 'ch', 'ust', 'ert', '.', 'Ċ', 'W', 'er', 'Ġsp', 'iel', 'te', 'Ġdie', 'ĠHau', 'pt', 'rol', 'le',
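
The unusual characters in these tokens (`Ġ`, `Ċ`, `Ã¤`) are not encoding errors. As far as we understand it, RoBERTa reuses GPT-2's byte-level BPE, which first maps every UTF-8 byte to a printable character. The sketch below rebuilds such a table, modelled on GPT-2's byte-to-unicode mapping; the function names are our own and the table is reconstructed for illustration rather than taken from the tokenizer object.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    printable = (
        list(range(0x21, 0x7F))     # "!" to "~"
        + list(range(0xA1, 0xAD))   # "¡" to "¬"
        + list(range(0xAE, 0x100))  # "®" to "ÿ"
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points from 256 upwards.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def show_as_byte_level(text):
    """Render text the way byte-level BPE tokens display it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(show_as_byte_level(" Bäcker"))
print(show_as_byte_level("\n"))
```

A leading space becomes `Ġ`, a newline becomes `Ċ`, and the two UTF-8 bytes of `ä` become `Ã¤`, which is why German umlauts are split into several pieces by the English tokenizer.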

We train a new target tokenizer on the training dataset, using the same configuration as the source tokenizer.

In [8]:
def batch_iterator(
    dataset: datasets.IterableDataset, batch_size: int = 1000
) -> Generator[list[str], None, None]:
    # Yield lists of raw texts, one list per batch, for tokenizer training.
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]


target_tokenizer = source_tokenizer.train_new_from_iterator(
    batch_iterator(train_dataset), vocab_size=len(source_tokenizer)
)






We then use the target tokenizer to convert the sample text to tokens and notice that it produces fewer tokens than the source tokenizer did.

In [9]:
tokens = target_tokenizer.tokenize(sample_text)
print(f"Number of tokens {len(tokens)}, tokens: {tokens}")

Number of tokens 106, tokens: ['mit', 'ĠEva', 'ĠMatt', 'es', 'Ġals', 'ĠKl', 'ara', 'ĠBlum', '.', 'ĠEine', 'ĠBÃ¤cker', 'stochter', 'Ġstirbt', 'Ġin', 'Ġder', 'ĠBack', 'rÃ¶hre', '.', 'ĠJetzt', 'Ġhat', 'Ġihre', 'ĠSchwester', 'Ġ(', 'Julia', 'ĠJ', 'ent', 'sch', ')', 'ĠAngst', 'âĢ¦', 'ĠDoppel', 'bÃ¶', 'dig', '.', 'Ċ', 'in', 'Ġder', 'Ġrechten', 'ĠArmb', 'euge', 'Ġbeim', 'ĠÃĸffnen', 'Ġdes', 'ĠMehl', 'sil', 'os', 'Ġzur', 'ĠRettung', 'Ġder', 'ĠBÃ¤cker', 'st', 'ocher', ',', 'Ġwelch', 'Ġein', 'ĠReg', 'ief', 'ehler', '!', 'ĠDer', 'Ġtiefer', 'gehende', 'ĠSinn', 'Ġdes', 'ĠFall', 'es', 'Ġwird', 'Ġansonsten', 'Ġauch', 'Ġnicht', 'Ġklar', '.', 'ĠWir', 'kt', 'Ġleider', 'Ġalles', 'Ġetwas', 'Ġzusammen', 'gesch', 'uster', 't', '.', 'Ċ', 'Wer', 'Ġspielte', 'Ġdie', 'ĠHauptrolle', 'Ġin', 'ĠFilm', 'Ġ"', 'The', 'ĠInternational', '"', 'Ġund', 'Ġwurde', 'Ġals', 'Ġpoten', 'ziel', 'ler', 'ĠJames', 'ĠBond', '-', 'Nach', 'folger', 'Ġgehandelt', '?']
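
To quantify the difference, we can compare the two token counts printed above (172 with the source tokenizer, 106 with the target tokenizer). The helper below is a small illustration and not part of Langsfer.

```python
def relative_reduction(before: int, after: int) -> float:
    """Fraction by which the token count shrank."""
    return (before - after) / before


source_token_count = 172  # sample text tokenized with the source tokenizer
target_token_count = 106  # same text tokenized with the target tokenizer

reduction = relative_reduction(source_token_count, target_token_count)
print(f"{reduction:.1%} fewer tokens")  # prints "38.4% fewer tokens"
```

Fewer tokens per text means that more German text fits into the same context length.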


We then load pre-trained fastText embeddings to use as auxiliary embeddings.

In [10]:
source_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("en")
target_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("de")

After that, we download a bilingual dictionary for English and German so that we can align the auxiliary embeddings.

In [11]:
bilingual_dictionary_file = download_file(
    "https://raw.githubusercontent.com/CPJKU/wechsel/main/dicts/data/german.txt",
    "/tmp/german.txt",
)

If we open the file and read the first few lines, we can see that it maps English words to their German equivalents.

In [12]:
with bilingual_dictionary_file.open() as f:
    dictionary_lines = [dict([f.readline().strip().split("\t")]) for _ in range(10)]

dictionary_lines

[{'free': 'frei'},
 {'free': 'gratis'},
 {'free of charge': 'umsonst'},
 {'synonymous': 'synonym'},
 {'off': 'ab-'},
 {'suddenly': 'abrupt'},
 {'teetotal': 'abstinent'},
 {'pluralistic': 'plural'},
 {'house': 'Haus'},
 {'it': 'dat'}]
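
As the output shows, the same English word can appear on several lines with different translations. A minimal sketch of how such tab-separated lines could be grouped into a one-to-many mapping follows; `parse_dictionary` is a hypothetical helper for illustration, and the sample lines are copied from the output above.

```python
from collections import defaultdict


def parse_dictionary(lines):
    """Group tab-separated 'source<TAB>target' lines by source word."""
    translations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        source_word, target_word = line.split("\t")
        translations[source_word].append(target_word)
    return dict(translations)


sample_lines = [
    "free\tfrei",
    "free\tgratis",
    "free of charge\tumsonst",
    "house\tHaus",
]

dictionary = parse_dictionary(sample_lines)
print(dictionary["free"])
```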

Finally, we instantiate the embedding initializer for WECHSEL.

In [13]:
embedding_initializer = wechsel(
    source_tokenizer=source_tokenizer,
    source_embeddings_matrix=source_embeddings_matrix,
    target_tokenizer=target_tokenizer,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
    bilingual_dictionary_file=bilingual_dictionary_file,
)
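
The bilingual dictionary is what makes the two fastText spaces comparable. The WECHSEL paper aligns them by solving an orthogonal Procrustes problem over the dictionary's word pairs. The sketch below shows that technique on random toy vectors, assuming NumPy is available; the function name and the data are illustrative, and we make no claim about how Langsfer implements this step internally.

```python
import numpy as np


def orthogonal_procrustes(source_vectors, target_vectors):
    """Return the orthogonal W minimizing ||source_vectors @ W - target_vectors||_F."""
    u, _, vt = np.linalg.svd(source_vectors.T @ target_vectors)
    return u @ vt


rng = np.random.default_rng(16)

# Toy "dictionary": 50 word pairs, each word represented by an 8-d vector.
source_vectors = rng.normal(size=(50, 8))

# Build the target vectors by applying a hidden orthogonal transformation.
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
target_vectors = source_vectors @ hidden_rotation

W = orthogonal_procrustes(source_vectors, target_vectors)
aligned = source_vectors @ W

print(np.allclose(aligned, target_vectors))
```

Because `W` is orthogonal, the alignment preserves distances and angles within the rotated embedding space.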

We then initialize the target embeddings.

In [14]:
target_embeddings_matrix = embedding_initializer.initialize(seed=SEED, show_progress=True)

Overlapping Tokens:   0%|          | 0/5 [00:00<?, ?it/s]

Non-Overlapping Tokens: 0it [00:00, ?it/s]

Once we have the initialized embeddings matrix, we can use it to replace the embeddings matrix in the source model. 

In [15]:
target_model_wechsel = AutoModelForCausalLM.from_pretrained(SOURCE_MODEL_NAME)

# Resize its embedding layer
target_model_wechsel.resize_token_embeddings(len(target_tokenizer))

# Replace the source embeddings matrix with the target embeddings matrix
target_model_wechsel.get_input_embeddings().weight.data = torch.as_tensor(
    target_embeddings_matrix
)

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


> We used `AutoModelForCausalLM` instead of `AutoModel` because we will train the newly initialized model for causal language modelling.

# Training

## Dataset preprocessing

Before training, we must preprocess the training and validation sets by tokenizing the text, removing all other columns and then converting the resulting arrays to PyTorch tensors.

In [16]:
train_dataset = train_dataset.map(
    lambda x: target_tokenizer(x["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)
train_dataset = train_dataset.with_format("torch")

val_dataset = val_dataset.map(
    lambda x: target_tokenizer(x["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)
val_dataset = val_dataset.with_format("torch")

We define the training parameters and instantiate a [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) object.

In [17]:
data_collator = DataCollatorForLanguageModeling(tokenizer=target_tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="/tmp/wechsel",
    eval_strategy="steps",
    report_to="tensorboard",
    eval_steps=EVAL_STEPS // GRADIENT_ACCUMULATION_STEPS,
    max_steps=MAX_TRAIN_STEPS // GRADIENT_ACCUMULATION_STEPS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    adam_epsilon=ADAM_EPSILON,
    adam_beta1=ADAM_BETA1,
    adam_beta2=ADAM_BETA2,
    bf16=True,
)

trainer = Trainer(
    model=target_model_wechsel,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=target_tokenizer,
)

max_steps is given, it will override any value given in num_train_epochs
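
Because gradients are accumulated over several batches, the step counts passed to `TrainingArguments` are expressed in optimizer steps. The arithmetic can be checked in isolation; the helper function and its `num_devices` parameter are our own additions for illustration.

```python
TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 8
EVAL_STEPS = 4000
MAX_TRAIN_STEPS = 48000

# Values passed to TrainingArguments as max_steps and eval_steps.
optimizer_steps = MAX_TRAIN_STEPS // GRADIENT_ACCUMULATION_STEPS
eval_every = EVAL_STEPS // GRADIENT_ACCUMULATION_STEPS


def sequences_per_optimizer_step(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of sequences that contribute to a single optimizer update."""
    return per_device_batch_size * accumulation_steps * num_devices


print(optimizer_steps)  # 6000
print(eval_every)  # 500
print(sequences_per_optimizer_step(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))  # 128
```

On a single device, each optimizer update therefore sees 128 sequences. With more devices, the effective batch size grows proportionally.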


We evaluate the model before training.

In [18]:
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Evaluation loss before training: {eval_loss:.3f}")

Evaluation loss before training: 13.195
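
To put this number into perspective, we can compare it with the cross-entropy of a model that assigns equal probability to every token. The sketch below assumes a vocabulary of 50265 tokens, which is our understanding of the size of the roberta-base vocabulary that the target tokenizer was trained to match.

```python
import math

VOCAB_SIZE = 50265  # assumption: size of the roberta-base vocabulary
eval_loss_before = 13.195  # value printed above

# Cross-entropy (in nats) of a uniform distribution over the vocabulary.
uniform_loss = math.log(VOCAB_SIZE)

print(f"Uniform baseline loss: {uniform_loss:.3f}")
print(f"Loss before training is above the baseline: {eval_loss_before > uniform_loss}")
```

The loss before training is higher than this uniform baseline of roughly 10.825, so the model still needs training in the target language before its predictions become useful.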


We then train the model.

In [19]:
trainer.train()

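| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 0.7057 | 0.035726 |
| 1000 | 0.0154 | 0.00496  |
| 1500 | 0.0062 | 0.001831 |
| 2000 | 0.0044 | 0.000643 |
| 2500 | 0.0011 | 0.000399 |
| 3000 | 0.0008 | 0.000289 |
| 3500 | 0.0014 | 0.000218 |
| 4000 | 0.0005 | 0.000182 |
| 4500 | 0.0006 | 0.00015  |
| 5000 | 0.0006 | 0.000134 |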


TrainOutput(global_step=6000, training_loss=0.061477883825699485, metrics={'train_runtime': 23892.0883, 'train_samples_per_second': 64.289, 'train_steps_per_second': 0.251, 'total_flos': 4.0437385940213856e+17, 'train_loss': 0.061477883825699485, 'epoch': 95.01041666666667})

Finally, we evaluate the model after training.

In [20]:
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Evaluation loss after training: {eval_loss:.3f}")

Evaluation loss after training: 0.000


As an additional evaluation, we take the sample text, truncate it, and let the trained model generate a completion for it.

In [42]:
sample_input_ids = target_tokenizer(sample_text)["input_ids"]
shortened_input_ids = sample_input_ids[: len(sample_input_ids) // 3 - 13]

generated_token_ids = (
    trainer.model.generate(
        torch.as_tensor(shortened_input_ids).reshape(1, -1).to(trainer.model.device),
        max_length=300,
        min_length=10,
        top_p=0.9,
        temperature=0.9,
        repetition_penalty=2.0,
    )
    .detach()
    .cpu()
    .numpy()
    .reshape(-1)
)
# Keep special tokens such as <s> and </s> in the decoded text.
generated_tokens = target_tokenizer.decode(
    generated_token_ids, skip_special_tokens=False
)
print("Original Text:")
print(sample_text)
print("---")
print("Generated Text:")
print(generated_tokens)

Original Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester (Julia Jentsch) Angst… Doppelbödig.
in der rechten Armbeuge beim Öffnen des Mehlsilos zur Rettung der Bäckerstocher, welch ein Regiefehler! Der tiefergehende Sinn des Falles wird ansonsten auch nicht klar. Wirkt leider alles etwas zusammengeschustert.
Wer spielte die Hauptrolle in Film "The International" und wurde als potenzieller James Bond-Nachfolger gehandelt?
---
Generated Text:
<s>mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester freundinschwesterOmamutter</s>


The generated text is of poor quality, and the model would need further training on more data.
This run only serves to illustrate the workflow and is not meant to be a full-scale model training.

# Summary

In this tutorial, we have seen how to use WECHSEL with Langsfer to transfer a language model pre-trained in English to a new language, German.