# FOCUS Tutorial

In this tutorial, we will use Langsfer to transfer a model trained in English to German with the [FOCUS](https://arxiv.org/abs/2305.14481) method, similarly to one of the experiments described in the paper.

FOCUS is a cross-lingual language transfer method that is similar to WECHSEL in that it uses fastText auxiliary embeddings,
but rather than using pre-trained ones, it relies on embeddings that were trained from scratch on text pre-tokenized with the source and target tokenizers.

It is also similar to CLP-Transfer in that it relies on finding overlapping and non-overlapping tokens,
but rather than relying on exact matching, it uses fuzzy token matching to determine overlapping tokens.
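
To make the difference between exact and fuzzy matching concrete, here is a minimal sketch. The `canonicalize` helper, the set of word-start markers and the lowercasing rule are assumptions chosen for illustration; they are not necessarily the rules used by FOCUS or by Langsfer.

```python
# Hypothetical helpers: the marker set and the lowercasing rule are assumptions
# made for illustration, not the canonicalization used by FOCUS or Langsfer.
WORD_START_MARKERS = ("Ġ", "▁")  # GPT-2 byte-level BPE and SentencePiece markers


def canonicalize(token: str) -> str:
    """Map a token to a tokenizer-independent form."""
    for marker in WORD_START_MARKERS:
        if token.startswith(marker):
            # Replace the tokenizer-specific marker with a plain space.
            return " " + token[len(marker):].lower()
    return token.lower()


def fuzzy_overlap(
    source_vocab: list[str], target_vocab: list[str]
) -> dict[str, list[str]]:
    """Return, for each target token, the source tokens with the same canonical form."""
    source_by_form: dict[str, list[str]] = {}
    for token in source_vocab:
        source_by_form.setdefault(canonicalize(token), []).append(token)
    return {
        token: source_by_form[canonicalize(token)]
        for token in target_vocab
        if canonicalize(token) in source_by_form
    }


# Toy vocabularies: exact matching would only find "in".
source_vocab = ["ĠThe", "Ġhouse", "in"]
target_vocab = ["▁the", "▁Haus", "in"]
overlap = fuzzy_overlap(source_vocab, target_vocab)
print(overlap)
```

In this toy example, `▁the` is matched to `ĠThe` even though the two strings differ, while `▁Haus` has no counterpart and would count as a non-overlapping token.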

The method requires as input:

- a tokenizer in the source language,
- a pre-trained language model in the source language,
- a tokenizer in the target language,
- two monolingual fastText embeddings, one for the source language and one for the target language.
  They are trained from scratch on text pre-tokenized with the tokenizer of the respective language.

For the tutorial, we will use the same parameters as described in the paper wherever possible:

- For the source model and tokenizer, we will use [gpt2-large](https://huggingface.co/openai-community/gpt2-large).
- For the target tokenizer, we will train one from scratch.

For the sake of brevity, however, we will use fewer training samples and steps.

# Setup

We begin by importing libraries and setting some defaults.

In [1]:
%load_ext autoreload
%load_ext tensorboard

In [2]:
import random
import warnings
from typing import Generator

import datasets
import numpy as np
import torch
from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

warnings.simplefilter("ignore")

# Constants
SOURCE_MODEL_NAME = "openai-community/gpt2-large"
DATASET_NAME = "oscar-corpus/oscar"
SOURCE_DATASET_CONFIG_NAME = "unshuffled_deduplicated_en"
TARGET_DATASET_CONFIG_NAME = "unshuffled_deduplicated_de"
DATASET_SIZE = 20000
TRAIN_DATASET_SIZE = 16000
TRAIN_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 64
EVAL_STEPS = 4000
MAX_TRAIN_STEPS = 24000
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 0.01
ADAM_EPSILON = 1e-6
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.98
SEED = 16

random.seed(SEED)
np.random.seed(SEED)

We will use the following functions and classes from Langsfer.

In [3]:
%autoreload
from langsfer.high_level import focus
from langsfer.embeddings import FastTextEmbeddings
from langsfer.utils import train_fasttext_model

# Dataset

We use the [datasets](https://huggingface.co/docs/datasets/index) library to load the German configuration of the [OSCAR](https://huggingface.co/datasets/oscar-corpus/oscar) dataset (**O**pen **S**uper-large **C**rawled **A**LMAnaCH co**R**pus), and then take a limited number of samples from it for training and validation.

In [4]:
dataset = datasets.load_dataset(
    DATASET_NAME,
    TARGET_DATASET_CONFIG_NAME,
    split="train",
    streaming=True,
    trust_remote_code=True,
)
dataset = dataset.shuffle(seed=SEED)
dataset = dataset.take(DATASET_SIZE)
train_dataset = dataset.take(TRAIN_DATASET_SIZE)
val_dataset = dataset.skip(TRAIN_DATASET_SIZE)
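
The split above relies on `take` keeping the first examples of the stream and `skip` dropping them. The following sketch mimics this with a plain Python iterator and `itertools.islice`; the small sizes are stand-ins, and it assumes that the stream yields the same order each time it is iterated, as is the case with a fixed shuffle seed.

```python
from itertools import islice

DATASET_SIZE = 10  # stand-in for 20000
TRAIN_DATASET_SIZE = 8  # stand-in for 16000


def limited_stream():
    """Stand-in for the shuffled stream after .take(DATASET_SIZE)."""
    return islice(iter(range(100)), DATASET_SIZE)


# .take(n) keeps the first n examples of the stream.
train_examples = list(islice(limited_stream(), TRAIN_DATASET_SIZE))
# .skip(n) drops the first n examples and keeps the rest.
val_examples = list(islice(limited_stream(), TRAIN_DATASET_SIZE, None))

print(len(train_examples), len(val_examples))
```

With the constants used in this tutorial, the same logic yields 16000 training examples and 20000 - 16000 = 4000 validation examples that do not overlap.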

We take a sample text from the validation set in order to compare the tokenization of the source and target tokenizers, and to evaluate the text generated by our trained model at the end.

In [5]:
sample_text = list(val_dataset.skip(10).take(1))[0]["text"]
print(sample_text)

mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester (Julia Jentsch) Angst… Doppelbödig.
in der rechten Armbeuge beim Öffnen des Mehlsilos zur Rettung der Bäckerstocher, welch ein Regiefehler! Der tiefergehende Sinn des Falles wird ansonsten auch nicht klar. Wirkt leider alles etwas zusammengeschustert.
Wer spielte die Hauptrolle in Film "The International" und wurde als potenzieller James Bond-Nachfolger gehandelt?


# Embeddings and Tokenizers

We load the source tokenizer as well as the source model and extract the input embeddings matrix from it.

In [6]:
source_tokenizer = AutoTokenizer.from_pretrained(SOURCE_MODEL_NAME)
source_model = AutoModel.from_pretrained(SOURCE_MODEL_NAME)
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()
del source_model

We use the source tokenizer to convert the sample text to tokens.

In [7]:
tokens = source_tokenizer.tokenize(sample_text)
print(f"Number of tokens {len(tokens)}, tokens: {tokens}")

Number of tokens 172, tokens: ['mit', 'ĠEva', 'ĠMatt', 'es', 'Ġal', 's', 'ĠKl', 'ara', 'ĠBl', 'um', '.', 'ĠE', 'ine', 'ĠB', 'Ã¤', 'cker', 'st', 'och', 'ter', 'Ġstir', 'bt', 'Ġin', 'Ġder', 'ĠBack', 'r', 'Ã¶', 'h', 're', '.', 'ĠJet', 'z', 't', 'Ġhat', 'Ġi', 'h', 're', 'ĠSchw', 'ester', 'Ġ(', 'Jul', 'ia', 'ĠJ', 'ents', 'ch', ')', 'ĠAng', 'st', 'âĢ¦', 'ĠDo', 'ppel', 'b', 'Ã¶', 'dig', '.', 'Ċ', 'in', 'Ġder', 'Ġre', 'ch', 'ten', 'ĠArm', 'be', 'uge', 'Ġbe', 'im', 'ĠÃĸ', 'ff', 'nen', 'Ġdes', 'ĠMeh', 'ls', 'il', 'os', 'Ġz', 'ur', 'ĠR', 'ett', 'ung', 'Ġder', 'ĠB', 'Ã¤', 'cker', 'st', 'oc', 'her', ',', 'Ġwel', 'ch', 'Ġe', 'in', 'ĠReg', 'ief', 'eh', 'ler', '!', 'ĠDer', 'Ġt', 'ief', 'er', 'ge', 'hend', 'e', 'ĠSinn', 'Ġdes', 'ĠFall', 'es', 'Ġw', 'ird', 'Ġan', 'son', 'sten', 'Ġa', 'uch', 'Ġn', 'icht', 'Ġk', 'lar', '.', 'ĠW', 'irk', 't', 'Ġle', 'ider', 'Ġall', 'es', 'Ġet', 'was', 'Ġz', 'us', 'amm', 'enges', 'ch', 'ust', 'ert', '.', 'Ċ', 'W', 'er', 'Ġsp', 'iel', 'te', 'Ġdie', 'ĠHau', 'pt', 'rol', 'le',

We train a new target tokenizer on the training dataset, with the same configuration and vocabulary size as the source tokenizer.

In [8]:
def batch_iterator(
    dataset: datasets.Dataset, batch_size: int = 1000
) -> Generator[str, None, None]:
    for batch in dataset.iter(batch_size=batch_size):
        yield batch["text"]


target_tokenizer = source_tokenizer.train_new_from_iterator(
    batch_iterator(train_dataset), vocab_size=len(source_tokenizer)
)

We then use the target tokenizer to convert the sample text to tokens and notice that it produces fewer tokens than the source tokenizer (106 instead of 172).

In [9]:
tokens = target_tokenizer.tokenize(sample_text)
print(f"Number of tokens {len(tokens)}, tokens: {tokens}")

Number of tokens 106, tokens: ['mit', 'ĠEva', 'ĠMatt', 'es', 'Ġals', 'ĠKl', 'ara', 'ĠBlum', '.', 'ĠEine', 'ĠBÃ¤cker', 'stochter', 'Ġstirbt', 'Ġin', 'Ġder', 'ĠBack', 'rÃ¶hre', '.', 'ĠJetzt', 'Ġhat', 'Ġihre', 'ĠSchwester', 'Ġ(', 'Julia', 'ĠJ', 'ent', 'sch', ')', 'ĠAngst', 'âĢ¦', 'ĠDoppel', 'bÃ¶', 'dig', '.', 'Ċ', 'in', 'Ġder', 'Ġrechten', 'ĠArmb', 'euge', 'Ġbeim', 'ĠÃĸffnen', 'Ġdes', 'ĠMehl', 'sil', 'os', 'Ġzur', 'ĠRettung', 'Ġder', 'ĠBÃ¤cker', 'st', 'ocher', ',', 'Ġwelch', 'Ġein', 'ĠReg', 'ief', 'ehler', '!', 'ĠDer', 'Ġtiefer', 'gehende', 'ĠSinn', 'Ġdes', 'ĠFall', 'es', 'Ġwird', 'Ġansonsten', 'Ġauch', 'Ġnicht', 'Ġklar', '.', 'ĠWir', 'kt', 'Ġleider', 'Ġalles', 'Ġetwas', 'Ġzusammen', 'gesch', 'uster', 't', '.', 'Ċ', 'Wer', 'Ġspielte', 'Ġdie', 'ĠHauptrolle', 'Ġin', 'ĠFilm', 'Ġ"', 'The', 'ĠInternational', '"', 'Ġund', 'Ġwurde', 'Ġals', 'Ġpoten', 'ziel', 'ler', 'ĠJames', 'ĠBond', '-', 'Nach', 'folger', 'Ġgehandelt', '?']
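
The two token counts reported above can be turned into a simple compression figure. This is only a worked calculation on this single sample text, not a general measurement of tokenizer quality.

```python
source_token_count = 172  # reported for the source (GPT-2) tokenizer
target_token_count = 106  # reported for the newly trained target tokenizer

ratio = target_token_count / source_token_count
reduction = 1 - ratio

print(f"Target/source token ratio: {ratio:.3f}")
print(f"Relative reduction: {reduction:.1%}")
```

For this sample, the target tokenizer needs roughly 38% fewer tokens than the source tokenizer.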


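## FastText Embeddings

We first load the English configuration of the dataset and tokenize the source and target training texts with their respective tokenizers.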

In [10]:
source_train_dataset = (
    datasets.load_dataset(
        DATASET_NAME,
        SOURCE_DATASET_CONFIG_NAME,
        split="train",
        streaming=True,
        trust_remote_code=True,
    )
    .shuffle(seed=SEED)
    .take(DATASET_SIZE)
)
source_train_tokenized_dataset = source_train_dataset.map(
    lambda x: source_tokenizer(x["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)

In [11]:
target_train_tokenized_dataset = train_dataset.map(
    lambda x: target_tokenizer(x["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)

We then train the auxiliary fastText embeddings for the source and target languages.

In [12]:
source_fasttext_model = train_fasttext_model(
    source_train_dataset, total_examples=TRAIN_DATASET_SIZE
)
source_auxiliary_embeddings = FastTextEmbeddings(source_fasttext_model)

In [13]:
target_fasttext_model = train_fasttext_model(
    train_dataset, total_examples=TRAIN_DATASET_SIZE
)
target_auxiliary_embeddings = FastTextEmbeddings(target_fasttext_model)

## Embedding Initialization

Finally, we instantiate the embedding initializer for FOCUS.

In [14]:
embedding_initializer = focus(
    source_tokenizer=source_tokenizer,
    source_embeddings_matrix=source_embeddings_matrix,
    target_tokenizer=target_tokenizer,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
)

We then initialize the target embeddings.

In [15]:
target_embeddings_matrix = embedding_initializer.initialize(seed=SEED, show_progress=True)

Non-Overlapping Tokens: 0it [00:00, ?it/s]
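
To give an intuition for what happens to tokens without a match, the sketch below assumes, following the description in the FOCUS paper, that the embedding of a non-overlapping token is a weighted combination of the source embeddings of overlapping tokens, with weights obtained by applying sparsemax to similarities computed in the auxiliary embedding space. It is a simplified pure-Python illustration with toy numbers and hypothetical helper names, not Langsfer's implementation.

```python
def sparsemax(scores: list[float]) -> list[float]:
    """Project scores onto the probability simplex; low scores get exactly zero."""
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, score in enumerate(sorted_scores, start=1):
        cumulative += score
        if 1 + k * score > cumulative:
            # The k largest scores are still in the support: update the threshold.
            threshold = (cumulative - 1) / k
    return [max(score - threshold, 0.0) for score in scores]


def weighted_combination(
    weights: list[float], vectors: list[list[float]]
) -> list[float]:
    """Return the weighted sum of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


# Toy numbers, chosen only for illustration.
# Similarity of one new target token to three overlapping tokens,
# as it could be measured in the auxiliary (fastText) space.
similarities = [0.9, 0.8, 0.1]
# Source-model embeddings of the same three overlapping tokens.
overlapping_source_embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

weights = sparsemax(similarities)
new_embedding = weighted_combination(weights, overlapping_source_embeddings)
print(weights, new_embedding)
```

Unlike softmax, sparsemax assigns a weight of exactly zero to the dissimilar third token, so only the closest overlapping tokens contribute to the new embedding.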

Once we have the initialized embeddings matrix, we can use it to replace the embeddings matrix in the source model. 

In [16]:
target_model_wechsel = AutoModelForCausalLM.from_pretrained(SOURCE_MODEL_NAME)

# Resize its embedding layer
target_model_wechsel.resize_token_embeddings(len(target_tokenizer))

# Replace the source embeddings matrix with the target embeddings matrix
target_model_wechsel.get_input_embeddings().weight.data = torch.as_tensor(
    target_embeddings_matrix
)

> We used `AutoModelForCausalLM` instead of `AutoModel` because we will train the newly initialized model for causal language modelling.

# Training

## Dataset preprocessing

Before training, we must preprocess the training and validation sets by tokenizing the text, removing all other columns and then converting the resulting arrays to PyTorch tensors.

In [17]:
train_dataset = train_dataset.map(
    lambda x: target_tokenizer(x["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)
train_dataset = train_dataset.with_format("torch")

val_dataset = val_dataset.map(
    lambda x: target_tokenizer(x["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)
val_dataset = val_dataset.with_format("torch")

We define the training parameters and instantiate a [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) object.

In [18]:
data_collator = DataCollatorForLanguageModeling(tokenizer=target_tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="/tmp/wechsel",
    eval_strategy="steps",
    report_to="tensorboard",
    eval_steps=EVAL_STEPS // GRADIENT_ACCUMULATION_STEPS,
    max_steps=MAX_TRAIN_STEPS // GRADIENT_ACCUMULATION_STEPS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    adam_epsilon=ADAM_EPSILON,
    adam_beta1=ADAM_BETA1,
    adam_beta2=ADAM_BETA2,
    bf16=True,
)

if target_tokenizer.pad_token is None:
    target_tokenizer.pad_token = target_tokenizer.eos_token

trainer = Trainer(
    model=target_model_wechsel,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=target_tokenizer,
)

max_steps is given, it will override any value given in num_train_epochs
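
The step counts passed to `TrainingArguments` follow from the constants defined in the setup. The short calculation below spells them out; the per-step sequence count assumes training on a single device.

```python
TRAIN_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 64
EVAL_STEPS = 4000
MAX_TRAIN_STEPS = 24000

# Optimizer steps between two evaluations.
eval_every = EVAL_STEPS // GRADIENT_ACCUMULATION_STEPS
# Total number of optimizer steps.
optimizer_steps = MAX_TRAIN_STEPS // GRADIENT_ACCUMULATION_STEPS
# Sequences contributing to one optimizer step (single device assumed).
sequences_per_step = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

print(eval_every, optimizer_steps, sequences_per_step)
```

This gives an evaluation every 62 optimizer steps, 375 optimizer steps in total, and 128 sequences per optimizer step.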


We evaluate the model before training by computing the average loss on the validation set.

In [19]:
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Evaluation loss before training: {eval_loss:.3f}")

Evaluation loss before training: 10.399


We then truncate the sample text and let the initialized model generate a completion for it.

In [20]:
sample_input_ids = target_tokenizer(sample_text)["input_ids"]
shortened_input_ids = sample_input_ids[: len(sample_input_ids) // 3 - 13]
shortened_text = target_tokenizer.decode(shortened_input_ids)

generated_token_ids = (
    trainer.model.generate(
        torch.as_tensor(shortened_input_ids).reshape(1, -1).to(trainer.model.device),
        max_length=300,
        min_length=10,
        top_p=0.9,
        temperature=0.9,
        repetition_penalty=2.0,
    )
    .detach()
    .cpu()
    .numpy()
    .reshape(-1)
)
generated_tokens = target_tokenizer.decode(generated_token_ids)
print("Original Text:")
print(sample_text)
print("---")
print("Shortened Text:")
print(shortened_text)
print("---")
print("Generated Text:")
print(generated_tokens)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Original Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester (Julia Jentsch) Angst… Doppelbödig.
in der rechten Armbeuge beim Öffnen des Mehlsilos zur Rettung der Bäckerstocher, welch ein Regiefehler! Der tiefergehende Sinn des Falles wird ansonsten auch nicht klar. Wirkt leider alles etwas zusammengeschustert.
Wer spielte die Hauptrolle in Film "The International" und wurde als potenzieller James Bond-Nachfolger gehandelt?
---
Shortened Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester
---
Generated Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester, jemanden sich nicht zu einer ganzen Weltkrieg und die Verwendung von den Geburtstag auf dem Königin erfolgt werden kann:
The German-born American who was the first person to be on Mars and is now an international hero of human rights has been made into one by his own people
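
The length of the prompt used above follows from the slicing expression `len(sample_input_ids) // 3 - 13`. The calculation below assumes that the tokenizer call adds no special tokens, so that the number of input ids equals the 106 tokens reported earlier for the target tokenizer.

```python
sample_token_count = 106  # token count reported for the target tokenizer

# Same expression as in the slicing of sample_input_ids.
prompt_length = sample_token_count // 3 - 13

print(f"Prompt length in tokens: {prompt_length}")
```

Under this assumption the prompt consists of the first 22 tokens, which end at `ĠSchwester` and match the shortened text shown above.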

We then train the model.

In [21]:
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 62 | No log | 6.4109 |
| 124 | No log | 6.019698 |
| 186 | No log | 5.782487 |
| 248 | No log | 5.624911 |
| 310 | No log | 5.540668 |
| 372 | No log | 5.507936 |


TrainOutput(global_step=375, training_loss=6.1109830729166665, metrics={'train_runtime': 17329.4418, 'train_samples_per_second': 5.54, 'train_steps_per_second': 0.022, 'total_flos': 2.917360387347456e+17, 'train_loss': 6.1109830729166665, 'epoch': 5.166666666666667})

Finally, we repeat the model evaluation and the text generation after training.

In [22]:
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Evaluation loss after training: {eval_loss:.3f}")

Evaluation loss after training: 5.508
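
To make the two evaluation losses easier to interpret, they can be converted to perplexity. The sketch assumes that the reported loss is the mean token-level cross-entropy in nats, in which case the perplexity is its exponential.

```python
import math

loss_before = 10.399  # evaluation loss before training
loss_after = 5.508  # evaluation loss after training

perplexity_before = math.exp(loss_before)
perplexity_after = math.exp(loss_after)

print(f"Perplexity before training: {perplexity_before:,.0f}")
print(f"Perplexity after training: {perplexity_after:,.0f}")
```

Under this assumption, the perplexity on the validation set drops from roughly 33,000 to roughly 250.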


In [23]:
sample_input_ids = target_tokenizer(sample_text)["input_ids"]
shortened_input_ids = sample_input_ids[: len(sample_input_ids) // 3 - 13]
shortened_text = target_tokenizer.decode(shortened_input_ids)

generated_token_ids = (
    trainer.model.generate(
        torch.as_tensor(shortened_input_ids).reshape(1, -1).to(trainer.model.device),
        max_length=300,
        min_length=10,
        top_p=0.9,
        temperature=0.9,
        repetition_penalty=2.0,
    )
    .detach()
    .cpu()
    .numpy()
    .reshape(-1)
)
generated_tokens = target_tokenizer.decode(generated_token_ids)
print("Original Text:")
print(sample_text)
print("---")
print("Shortened Text:")
print(shortened_text)
print("---")
print("Generated Text:")
print(generated_tokens)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Original Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester (Julia Jentsch) Angst… Doppelbödig.
in der rechten Armbeuge beim Öffnen des Mehlsilos zur Rettung der Bäckerstocher, welch ein Regiefehler! Der tiefergehende Sinn des Falles wird ansonsten auch nicht klar. Wirkt leider alles etwas zusammengeschustert.
Wer spielte die Hauptrolle in Film "The International" und wurde als potenzieller James Bond-Nachfolger gehandelt?
---
Shortened Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester
---
Generated Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester eine kleine Rolle, die sich mit dem Mann und den anderen Menschen zu einem kleinen Tag auf einen Blick machen können!
Die beiden Schüler ist ein paar Minuten lang wieder im Jahr von einer Hand aus ihren Augen an seinen ersten Mal gestickert werden - das war es nicht so gut w

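After training, the generated text stays in German instead of switching to English, and the evaluation loss has dropped from 10.399 to 5.508. The text is still not coherent, however, and the model needs further training on more data.
This short run was done only for the sake of the tutorial and is not meant to be a full-blown model training.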

# Summary

In this tutorial, we have seen how to use FOCUS in order to transfer a pre-trained language model to a new language.