# CLP-Transfer Tutorial

In this tutorial, we will use Langsfer to transfer a model trained in English to German with the [CLP-Transfer](https://arxiv.org/abs/2301.09626) method, similarly to one of the experiments described in the paper.

Cross-Lingual and Progressive Transfer, or CLP-Transfer for short, is a cross-lingual transfer method that efficiently initializes the embedding parameters of a language model in a target language. It does so using the embedding parameters of an existing model in a source language together with the embedding parameters of a helper model in the target language.

The method requires as input:

- a tokenizer in the source language,
- a pre-trained language model in the source language,
- a tokenizer in the target language,
- a helper pre-trained language model in the target language.
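To make the role of these inputs more concrete, here is a minimal NumPy sketch of a CLP-style initialization as we understand it from the paper: tokens shared by both vocabularies keep their source embedding, and every other target token receives a weighted average of the shared tokens' source embeddings, with weights derived from cosine similarities in the helper model's embedding space. The function name `clp_style_init`, the toy vocabularies, and the clipping of negative similarities are illustrative assumptions; this is not Langsfer's actual implementation.

```python
import numpy as np


def clp_style_init(source_vocab, source_emb, target_vocab, helper_emb):
    """Toy CLP-style initialization (hypothetical helper, not Langsfer's code)."""
    # Tokens that appear in both the source and the target vocabulary
    overlap = [tok for tok in target_vocab if tok in source_vocab]
    tgt_idx = np.array([target_vocab[tok] for tok in overlap])
    src_idx = np.array([source_vocab[tok] for tok in overlap])

    target_emb = np.zeros((len(target_vocab), source_emb.shape[1]))
    # Overlapping tokens: copy the source embedding unchanged
    target_emb[tgt_idx] = source_emb[src_idx]

    # Unit-normalize helper embeddings so dot products are cosine similarities
    helper_unit = helper_emb / np.linalg.norm(helper_emb, axis=1, keepdims=True)
    for tok, idx in target_vocab.items():
        if tok in source_vocab:
            continue
        sims = helper_unit[tgt_idx] @ helper_unit[idx]
        # Assumption: clip negative similarities so the weights are non-negative;
        # the toy code also assumes at least one similarity is positive
        weights = np.clip(sims, 0.0, None)
        weights = weights / weights.sum()
        # Non-overlapping tokens: weighted average of overlapping source embeddings
        target_emb[idx] = weights @ source_emb[src_idx]
    return target_emb


# Toy data: "Haus" exists only in the target vocabulary
source_vocab = {"the": 0, "house": 1, "dog": 2}
target_vocab = {"the": 0, "house": 1, "Haus": 2}
source_emb = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
# Helper embeddings are indexed by the target vocabulary
helper_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])

target_emb = clp_style_init(source_vocab, source_emb, target_vocab, helper_emb)
print(target_emb[target_vocab["Haus"]])  # [0. 1.]
```

In this toy setup, "Haus" points in the same direction as "house" in the helper space, so it inherits the source embedding of "house". Note that the helper embeddings may have a different dimension than the source embeddings, since they are only used to compute similarities.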

For the tutorial, we will stay as close as possible to the parameters described in the paper:

- For the source model and tokenizer, we will use [gpt2-large](https://huggingface.co/openai-community/gpt2-large),
- For the helper model and target tokenizer, we will use [benjamin/gpt2-wechsel-german](https://huggingface.co/benjamin/gpt2-wechsel-german).

For the sake of brevity, however, we will use fewer training samples and steps.

# Setup

We begin by importing libraries and setting some defaults.

In [1]:
%load_ext autoreload
%load_ext tensorboard

In [2]:
import random
import warnings

import datasets
import numpy as np
import torch
from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

warnings.simplefilter("ignore")

# Constants
SOURCE_MODEL_NAME = "openai-community/gpt2-large"
HELPER_MODEL_NAME = "benjamin/gpt2-wechsel-german"
DATASET_NAME = "oscar-corpus/oscar"
DATASET_CONFIG_NAME = "unshuffled_deduplicated_de"
DATASET_SIZE = 20000
TRAIN_DATASET_SIZE = 16000
TRAIN_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 64
EVAL_STEPS = 4000
MAX_TRAIN_STEPS = 48000
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 0.01
ADAM_EPSILON = 1e-6
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.98
SEED = 16

random.seed(SEED)
np.random.seed(SEED)

We will use the following functions and classes from Langsfer.

In [3]:
%autoreload
from langsfer.high_level import clp_transfer
from langsfer.embeddings import TransformersEmbeddings

# Dataset

We use the [datasets](https://huggingface.co/docs/datasets/index) library to load the German configuration of the [OSCAR](https://huggingface.co/datasets/oscar-corpus/oscar) dataset (**O**pen **S**uper-large **C**rawled **A**LMAnaCH co**R**pus), and then take a limited number of samples from it for training and validation.

In [4]:
dataset = datasets.load_dataset(
    DATASET_NAME,
    DATASET_CONFIG_NAME,
    split="train",
    streaming=True,
    trust_remote_code=True,
)
dataset = dataset.shuffle(seed=SEED)
dataset = dataset.take(DATASET_SIZE)
train_dataset = dataset.take(TRAIN_DATASET_SIZE)
val_dataset = dataset.skip(TRAIN_DATASET_SIZE)
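The `take`/`skip` split above can be pictured with plain Python iterators. The following sketch uses `itertools.islice` as a stand-in for the streaming dataset's `take` and `skip` methods, with scaled-down sizes; the helper functions and the toy stream are assumptions made for illustration only.

```python
from itertools import islice

# Scaled-down stand-ins for DATASET_SIZE and TRAIN_DATASET_SIZE
DATASET_SIZE = 20
TRAIN_DATASET_SIZE = 16


def take(iterable, n):
    """Yield the first n items, mimicking a streaming dataset's take()."""
    return islice(iterable, n)


def skip(iterable, n):
    """Yield everything after the first n items, mimicking skip()."""
    return islice(iterable, n, None)


stream = range(1000)  # stand-in for the shuffled stream of documents
subset = list(take(stream, DATASET_SIZE))
train = list(take(subset, TRAIN_DATASET_SIZE))
val = list(skip(subset, TRAIN_DATASET_SIZE))
print(len(train), len(val))  # 16 4
```

With the tutorial's constants, the same logic gives 16000 training samples and 4000 validation samples, and the two sets do not overlap.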

We take a sample text from the validation set in order to evaluate the generations of our trained model at the end.

In [5]:
sample_text = list(val_dataset.skip(10).take(1))[0]["text"]
print(sample_text)

mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester (Julia Jentsch) Angst… Doppelbödig.
in der rechten Armbeuge beim Öffnen des Mehlsilos zur Rettung der Bäckerstocher, welch ein Regiefehler! Der tiefergehende Sinn des Falles wird ansonsten auch nicht klar. Wirkt leider alles etwas zusammengeschustert.
Wer spielte die Hauptrolle in Film "The International" und wurde als potenzieller James Bond-Nachfolger gehandelt?


# Embeddings and Tokenizers

We load the source tokenizer as well as the source model and extract the input embeddings matrix from it.

In [6]:
source_tokenizer = AutoTokenizer.from_pretrained(SOURCE_MODEL_NAME)
source_model = AutoModel.from_pretrained(SOURCE_MODEL_NAME)
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()
del source_model

We then load the target tokenizer as well as the helper model's embeddings to use as auxiliary embeddings. 

In [7]:
target_tokenizer = AutoTokenizer.from_pretrained(HELPER_MODEL_NAME)
target_auxiliary_embeddings = TransformersEmbeddings.from_model_name_or_path(
    HELPER_MODEL_NAME
)

We first use the source tokenizer to convert the sample text to tokens.

In [8]:
tokens = source_tokenizer.tokenize(sample_text)
print(f"Number of tokens {len(tokens)}, tokens: {tokens}")

Number of tokens 172, tokens: ['mit', 'ĠEva', 'ĠMatt', 'es', 'Ġal', 's', 'ĠKl', 'ara', 'ĠBl', 'um', '.', 'ĠE', 'ine', 'ĠB', 'Ã¤', 'cker', 'st', 'och', 'ter', 'Ġstir', 'bt', 'Ġin', 'Ġder', 'ĠBack', 'r', 'Ã¶', 'h', 're', '.', 'ĠJet', 'z', 't', 'Ġhat', 'Ġi', 'h', 're', 'ĠSchw', 'ester', 'Ġ(', 'Jul', 'ia', 'ĠJ', 'ents', 'ch', ')', 'ĠAng', 'st', 'âĢ¦', 'ĠDo', 'ppel', 'b', 'Ã¶', 'dig', '.', 'Ċ', 'in', 'Ġder', 'Ġre', 'ch', 'ten', 'ĠArm', 'be', 'uge', 'Ġbe', 'im', 'ĠÃĸ', 'ff', 'nen', 'Ġdes', 'ĠMeh', 'ls', 'il', 'os', 'Ġz', 'ur', 'ĠR', 'ett', 'ung', 'Ġder', 'ĠB', 'Ã¤', 'cker', 'st', 'oc', 'her', ',', 'Ġwel', 'ch', 'Ġe', 'in', 'ĠReg', 'ief', 'eh', 'ler', '!', 'ĠDer', 'Ġt', 'ief', 'er', 'ge', 'hend', 'e', 'ĠSinn', 'Ġdes', 'ĠFall', 'es', 'Ġw', 'ird', 'Ġan', 'son', 'sten', 'Ġa', 'uch', 'Ġn', 'icht', 'Ġk', 'lar', '.', 'ĠW', 'irk', 't', 'Ġle', 'ider', 'Ġall', 'es', 'Ġet', 'was', 'Ġz', 'us', 'amm', 'enges', 'ch', 'ust', 'ert', '.', 'Ċ', 'W', 'er', 'Ġsp', 'iel', 'te', 'Ġdie', 'ĠHau', 'pt', 'rol', 'le',

We then use the target tokenizer to convert the sample text to tokens and notice that the conversion creates fewer tokens than previously (108 instead of 172).

In [9]:
tokens = target_tokenizer.tokenize(sample_text)
print(f"Number of tokens {len(tokens)}, tokens: {tokens}")

Number of tokens 108, tokens: ['mit', 'ĠEva', 'ĠMatt', 'es', 'Ġals', 'ĠKl', 'ara', 'ĠBlum', '.', 'ĠEine', 'ĠBÃ¤cker', 'st', 'ochter', 'Ġstirbt', 'Ġin', 'Ġder', 'ĠBack', 'rÃ¶hre', '.', 'ĠJetzt', 'Ġhat', 'Ġihre', 'ĠSchwester', 'Ġ(', 'Jul', 'ia', 'ĠJ', 'ent', 'sch', ')', 'ĠAngst', 'âĢ¦', 'ĠDoppel', 'bÃ¶', 'dig', '.', 'Ċ', 'in', 'Ġder', 'Ġrechten', 'ĠArmb', 'euge', 'Ġbeim', 'ĠÃĸffnen', 'Ġdes', 'ĠMehl', 'sil', 'os', 'Ġzur', 'ĠRettung', 'Ġder', 'ĠBÃ¤cker', 'st', 'ocher', ',', 'Ġwelch', 'Ġein', 'ĠReg', 'ief', 'ehler', '!', 'ĠDer', 'Ġtiefer', 'gehende', 'ĠSinn', 'Ġdes', 'ĠFall', 'es', 'Ġwird', 'Ġansonsten', 'Ġauch', 'Ġnicht', 'Ġklar', '.', 'ĠWir', 'kt', 'Ġleider', 'Ġalles', 'Ġetwas', 'Ġzusammen', 'gesch', 'uster', 't', '.', 'Ċ', 'Wer', 'Ġspielte', 'Ġdie', 'ĠHauptrolle', 'Ġin', 'ĠFilm', 'Ġ"', 'The', 'ĠInternational', '"', 'Ġund', 'Ġwurde', 'Ġals', 'Ġpoten', 'ziel', 'ler', 'ĠJames', 'ĠBond', '-', 'Nach', 'folger', 'Ġgehandelt', '?']
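As a quick sanity check, the difference between the two tokenizations can be quantified from the token counts printed above. The variable names below are only illustrative.

```python
# Token counts reported by the two tokenizers for the same sample text
source_token_count = 172
target_token_count = 108

# Relative reduction in sequence length when using the target tokenizer
reduction = 1 - target_token_count / source_token_count
# How many source tokens correspond to one target token on average
tokens_ratio = source_token_count / target_token_count

print(f"{reduction:.1%} fewer tokens, ratio {tokens_ratio:.2f}")
# 37.2% fewer tokens, ratio 1.59
```

Shorter sequences for the same text mean that more German text fits into the model's context window, which is one motivation for switching to a target-language tokenizer.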


Finally, we instantiate the embedding initializer for CLP-Transfer.

In [10]:
embedding_initializer = clp_transfer(
    source_tokenizer=source_tokenizer,
    source_embeddings_matrix=source_embeddings_matrix,
    target_tokenizer=target_tokenizer,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
)

We then initialize the target embeddings.

In [11]:
target_embeddings_matrix = embedding_initializer.initialize(seed=SEED, show_progress=True)

Non-Overlapping Tokens: 0it [00:00, ?it/s]

Once we have the initialized embeddings matrix, we can use it to replace the embeddings matrix in the source model. 

In [12]:
target_model_wechsel = AutoModelForCausalLM.from_pretrained(SOURCE_MODEL_NAME)

# Resize its embedding layer
target_model_wechsel.resize_token_embeddings(len(target_tokenizer))

# Replace the source embeddings matrix with the target embeddings matrix
target_model_wechsel.get_input_embeddings().weight.data = torch.as_tensor(
    target_embeddings_matrix
)

> We used `AutoModelForCausalLM` instead of `AutoModel` because we will train the newly initialized model for causal language modelling.
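The replacement step relies on two things: the new matrix must have the same shape as the resized embedding layer, and assigning to `.weight.data` keeps the parameter object itself intact. The second point matters for models that tie the input embeddings to the output projection, as GPT-2 does by default. The toy modules below are an assumption for illustration and stand in for the real model.

```python
import numpy as np
import torch
from torch import nn

vocab_size, hidden_size = 5, 3

# Toy stand-ins for the model's input embeddings and tied output projection
embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
lm_head.weight = embedding.weight  # tie the weights by sharing one Parameter

# Stand-in for the initialized target embeddings matrix
new_matrix = np.arange(vocab_size * hidden_size, dtype=np.float32).reshape(
    vocab_size, hidden_size
)

# The shapes must match before replacing the data
assert new_matrix.shape == tuple(embedding.weight.shape)
embedding.weight.data = torch.as_tensor(new_matrix, dtype=embedding.weight.dtype)

# The output projection still shares the same Parameter and sees the new values
still_tied = lm_head.weight is embedding.weight
same_values = torch.equal(lm_head.weight, embedding.weight)
print(still_tied, same_values)  # True True
```

Casting to the layer's dtype, as done here, is a small safeguard in case the initialized matrix is stored in a different floating-point precision than the model.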

# Training

## Dataset preprocessing

Before training, we must preprocess the training and validation sets by tokenizing the text, removing all other columns and then converting the resulting arrays to PyTorch tensors.

In [13]:
train_dataset = train_dataset.map(
    lambda x: target_tokenizer(x["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)
train_dataset = train_dataset.with_format("torch")

val_dataset = val_dataset.map(
    lambda x: target_tokenizer(x["text"], truncation=True),
    batched=True,
    remove_columns=dataset.column_names,
)
val_dataset = val_dataset.with_format("torch")

We define the training parameters and instantiate a [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) object.

In [14]:
data_collator = DataCollatorForLanguageModeling(tokenizer=target_tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="/tmp/clp_transfer",
    eval_strategy="steps",
    report_to="tensorboard",
    eval_steps=EVAL_STEPS // GRADIENT_ACCUMULATION_STEPS,
    max_steps=MAX_TRAIN_STEPS // GRADIENT_ACCUMULATION_STEPS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    adam_epsilon=ADAM_EPSILON,
    adam_beta1=ADAM_BETA1,
    adam_beta2=ADAM_BETA2,
    bf16=True,
)

if target_tokenizer.pad_token is None:
    target_tokenizer.pad_token = target_tokenizer.eos_token

trainer = Trainer(
    model=target_model_wechsel,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=target_tokenizer,
)

max_steps is given, it will override any value given in num_train_epochs
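The step counts passed to `TrainingArguments` follow from the constants defined in the setup cell. The short calculation below spells them out, assuming training on a single device; the derived variable names are only illustrative.

```python
# Constants copied from the setup cell
TRAIN_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 64
EVAL_STEPS = 4000
MAX_TRAIN_STEPS = 48000

# Samples that contribute to one optimizer update (single device assumed)
effective_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
# Optimizer updates performed in total and between two evaluations
optimizer_steps = MAX_TRAIN_STEPS // GRADIENT_ACCUMULATION_STEPS
eval_every = EVAL_STEPS // GRADIENT_ACCUMULATION_STEPS

print(effective_batch_size, optimizer_steps, eval_every)  # 128 750 62
```

These values explain why the training log reports an evaluation every 62 steps and why training stops at step 750.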


We evaluate the model before training.

In [15]:
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Evaluation loss before training: {eval_loss:.3f}")

Evaluation loss before training: 10.631


We then train the model.

In [16]:
trainer.train()

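| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 62 | No log | 6.976626 |
| 124 | No log | 6.584036 |
| 186 | No log | 6.243427 |
| 248 | No log | 5.986516 |
| 310 | No log | 5.811096 |
| 372 | No log | 5.672745 |
| 434 | No log | 5.567986 |
| 496 | No log | 5.485695 |
| 558 | 6.278000 | 5.421839 |
| 620 | 6.278000 | 5.378 |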


TrainOutput(global_step=750, training_loss=6.01977197265625, metrics={'train_runtime': 34687.4607, 'train_samples_per_second': 5.535, 'train_steps_per_second': 0.022, 'total_flos': 5.861596369744896e+17, 'train_loss': 6.01977197265625, 'epoch': 11.083333333333334})

Finally, we evaluate the model after training.

In [17]:
eval_loss = trainer.evaluate()["eval_loss"]
print(f"Evaluation loss after training: {eval_loss:.3f}")

Evaluation loss after training: 5.338
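Evaluation losses are easier to interpret as perplexities. Assuming the reported loss is the mean token-level cross-entropy in nats, the perplexity is simply its exponential. The snippet below converts the two losses reported in this tutorial.

```python
import math

# Evaluation losses reported before and after training
loss_before = 10.631
loss_after = 5.338

# Perplexity is the exponential of the mean cross-entropy (in nats)
ppl_before = math.exp(loss_before)
ppl_after = math.exp(loss_after)

print(f"Perplexity before: {ppl_before:.0f}, after: {ppl_after:.0f}")
```

The perplexity drops from the order of tens of thousands to a few hundred, which shows a large improvement but still leaves plenty of room for further training.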


As an additional evaluation, we take the sample text, truncate it, and then let the trained model generate a completion for it.

In [None]:
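# Tokenize the sample text and keep only its beginning as the prompt
sample_input_ids = target_tokenizer(sample_text)["input_ids"]
shortened_input_ids = sample_input_ids[: len(sample_input_ids) // 3 - 13]
shortened_text = target_tokenizer.decode(shortened_input_ids, skip_special_tokens=False)

# Note: `top_p` and `temperature` only take effect when `do_sample=True` is passed;
# without it, `generate` decodes greedily by default (subject to `repetition_penalty`).
generated_token_ids = (
    trainer.model.generate(
        torch.as_tensor(shortened_input_ids).reshape(1, -1).to(trainer.model.device),
        max_length=300,
        min_length=10,
        top_p=0.9,
        temperature=0.9,
        repetition_penalty=2.0,
    )
    .detach()
    .cpu()
    .numpy()
    .reshape(-1)
)
# Convert the generated token ids back to text
generated_tokens = target_tokenizer.decode(
    generated_token_ids, skip_special_tokens=False
)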
print("Original Text:")
print(sample_text)
print("---")
print("Shortened Text:")
print(shortened_text)
print("---")
print("Generated Text:")
print(generated_tokens)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Original Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester (Julia Jentsch) Angst… Doppelbödig.
in der rechten Armbeuge beim Öffnen des Mehlsilos zur Rettung der Bäckerstocher, welch ein Regiefehler! Der tiefergehende Sinn des Falles wird ansonsten auch nicht klar. Wirkt leider alles etwas zusammengeschustert.
Wer spielte die Hauptrolle in Film "The International" und wurde als potenzieller James Bond-Nachfolger gehandelt?
---
Shortened Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester
---
Generated Text:
mit Eva Mattes als Klara Blum. Eine Bäckerstochter stirbt in der Backröhre. Jetzt hat ihre Schwester die Welt, das sie mit dem Mann und den anderen Menschen zu tun ist:
Die Frau wird von einem kleinen Kind aus einer großen Stadt auf ihrem Weg im Wald gebracht worden – auch wenn es sich um eine große Geschichte gibt! Die Mutter wurde am Ende des Jahres nach Berlin-Wittenberg

The generated text is recognizably German, but it is not yet coherent; the model needs further training on more data.
The short training run here only serves to illustrate the workflow and is not meant to be a full-blown model training.
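The generation call uses the `top_p` and `temperature` parameters, which shape the distribution that tokens are sampled from when sampling is enabled. The NumPy sketch below illustrates the idea behind temperature scaling and nucleus (top-p) filtering on a toy set of logits. It is a conceptual illustration with a hypothetical helper `top_p_filter`, not the implementation used by the `transformers` library.

```python
import numpy as np


def top_p_filter(logits, top_p=0.9, temperature=0.9):
    """Return a renormalized distribution restricted to the top-p nucleus."""
    # Temperature scaling: values below 1 sharpen the distribution
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Sort tokens by probability and accumulate their mass
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


toy_logits = [2.0, 1.0, 0.0, -5.0]
filtered = top_p_filter(toy_logits, top_p=0.9, temperature=1.0)
print(np.count_nonzero(filtered))  # 2
```

In this toy example, the two most likely tokens already cover more than 90% of the probability mass, so the remaining two tokens are excluded from sampling.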

# Summary

In this tutorial, we have seen how to use CLP-Transfer in order to transfer a pre-trained language model to a new language.