## Main ideas

* Create zero-shot baseline
* Train XLM-R on one language and zero-shot to another
* Create learning curves to see how much labelled data we need to beat the baseline
* Add youtube videos from the course
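
The learning-curve idea from the list above can be sketched independently of any training code: given a validation MAE for several training-set sizes, pick the smallest size that beats the baseline. The function name `smallest_size_beating_baseline` and every number below are hypothetical placeholders for illustration, not measured results.

```python
def smallest_size_beating_baseline(curve, baseline_mae):
    """Return the smallest training-set size whose MAE is below the baseline.

    `curve` maps the number of labelled training samples to a validation MAE
    (lower is better). Returns None if no size beats the baseline.
    """
    winners = [n for n, mae in curve.items() if mae < baseline_mae]
    return min(winners) if winners else None


# Placeholder numbers for illustration only, NOT measured results.
hypothetical_curve = {100: 1.9, 250: 1.5, 500: 1.1, 1000: 0.8}
hypothetical_baseline = 1.3

print(smallest_size_beating_baseline(hypothetical_curve, hypothetical_baseline))
```

In practice the dictionary would be filled by fine-tuning once per sample size and recording the validation metric each time.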

🤗

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# For HF machines
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

## Fine-tuning your first Transformer!

Add intro. Put inference API screenshot?

## Setup

Probably need Git LFS

In [3]:
# from huggingface_hub import notebook_login

# notebook_login()

## The dataset

In this tutorial we'll use the [Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi) (or MARC for short). This is a large-scale collection of Amazon product reviews in several languages: English, Japanese, German, French, Spanish, and Chinese. 

We can download the dataset from the Hugging Face Hub with the 🤗 Datasets library, but first let's take a look at the available subsets:

In [3]:
from datasets import get_dataset_config_names

dataset_name = "amazon_reviews_multi"
langs = get_dataset_config_names(dataset_name)
langs

['all_languages', 'de', 'en', 'es', 'fr', 'ja', 'zh']

The output lists one configuration per language code, plus an `all_languages` subset that presumably concatenates all the languages. Let's begin by downloading the German subset with the `load_dataset()` function from 🤗 Datasets:

In [4]:
from datasets import load_dataset

german_dataset = load_dataset(path=dataset_name, name="de")
german_dataset

Reusing dataset amazon_reviews_multi (/data/.cache/hf/datasets/amazon_reviews_multi/de/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)


DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

We can see that `german_dataset` is a `DatasetDict` object, which maps each split (`train`, `validation`, and `test`) to its corresponding `Dataset`. To access a single example, we first select the split by its key and then index into the resulting `Dataset`:

In [5]:
german_dataset["train"][0]

{'language': 'de',
 'product_category': 'sports',
 'stars': 1,
 'review_id': 'de_0203609',
 'reviewer_id': 'reviewer_de_0267719',
 'review_title': 'Leider nach 1 Jahr kaputt',
 'review_body': 'Armband ist leider nach 1 Jahr kaputt gegangen',
 'product_id': 'product_de_0865382'}

## From Datasets to DataFrames and back

In [8]:
from IPython.display import display, HTML

german_dataset.set_format("pandas")
german_df = german_dataset["train"][:]
# Create a random sample
sample = german_df.sample(n=5, random_state=42)
display(HTML(sample.to_html()))

| | review_id | product_id | reviewer_id | stars | review_body | review_title | language | product_category |
|---|---|---|---|---|---|---|---|---|
| 119737 | de_0970901 | product_de_0712478 | reviewer_de_0308094 | 3 | Ist ok ...blondierung quillt schnell auf | Ok | de | beauty |
| 72272 | de_0042217 | product_de_0734686 | reviewer_de_0904358 | 2 | Kein typischer Geruch oder Geschmack von einem Ghee! Ich würde es nicht wieder kaufen oder weiter empfehlen. Konkurrenz Produkt fand ich besser. | Kein typischer Geruch oder Geschmack von einem Ghee ! | de | grocery |
| 158154 | de_0278932 | product_de_0388890 | reviewer_de_0940030 | 4 | Dieses Buch hat mir sehr geholfen mit dem ersten Schlupf und der weiteren Aufzucht. Kann ich nur weiter empfehlen. | Sehr hilfreich | de | book |
| 65426 | de_0737352 | product_de_0560586 | reviewer_de_0632435 | 2 | super Schale, wunderschön, gutes Produkt ABER Der Saugnapf geht von der Schale runter, da die Maße des Saugnapf Ringes nicht passen. Man muss aufpassen dass man den nicht dauernd neu aufsetzen muss. | der Saugnapf hält nicht | de | baby_product |
| 30074 | de_0455430 | product_de_0375951 | reviewer_de_0482228 | 1 | Artikel ist niemals angekommen, habe ihn aber bezahlt! Und dann steht noch dort ich hätte unterschrieben, als er angeblich angekommen sei! null Sterne! Unglaublich 😒 | Artikel ist niemals angekommen!! | de | book |


In [9]:
german_df["product_category"].value_counts()

home                        26063
wireless                    19964
sports                      13748
home_improvement            12408
apparel                     10178
toy                          9781
pc                           8577
drugstore                    8075
lawn_and_garden              7426
beauty                       7162
electronics                  7114
other                        6460
furniture                    6334
kitchen                      5787
automotive                   5321
pet_products                 5028
book                         4927
office_product               4343
baby_product                 4070
shoes                        3568
luggage                      3256
digital_video_download       2970
personal_care_appliances     2836
grocery                      2737
digital_ebook_purchase       2720
jewelry                      2380
camera                       1906
watch                        1706
video_games                  1219
industrial_sup

In [10]:
german_df["stars"].value_counts()

1    40000
2    40000
3    40000
4    40000
5    40000
Name: stars, dtype: int64

In [11]:
german_dataset.reset_format()

## Filtering for a domain

In [12]:
def filter_for_wireless(example):
    return example["product_category"] == "wireless"
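
The filter above is hard-coded to one category. A minimal sketch of a reusable variant follows, assuming a hypothetical helper named `make_category_filter`; plain dictionaries stand in for dataset rows so the snippet runs without downloading anything.

```python
def make_category_filter(category):
    """Build a predicate that keeps examples from one product category."""
    def keep(example):
        return example["product_category"] == category
    return keep


# Plain dicts stand in for dataset rows here (illustrative values only).
rows = [
    {"review_id": "a", "product_category": "wireless"},
    {"review_id": "b", "product_category": "book"},
    {"review_id": "c", "product_category": "wireless"},
]

keep_wireless = make_category_filter("wireless")
wireless_rows = [row for row in rows if keep_wireless(row)]
print([row["review_id"] for row in wireless_rows])
```

The returned predicate has the same shape as `filter_for_wireless` (one example in, one boolean out), so it could be passed to `Dataset.filter()` in the same way.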

In [14]:
german_dataset = german_dataset.filter(filter_for_wireless)

Loading cached processed dataset at /data/.cache/hf/datasets/amazon_reviews_multi/de/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609/cache-a02aa3c50d82bb3a.arrow
Loading cached processed dataset at /data/.cache/hf/datasets/amazon_reviews_multi/de/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609/cache-4215bd725f9f9ab7.arrow
Loading cached processed dataset at /data/.cache/hf/datasets/amazon_reviews_multi/de/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609/cache-6926e4ca8872dc14.arrow


In [15]:
german_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 19964
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 500
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 491
    })
})

## Re-mapping the labels

In [18]:
german_dataset = german_dataset.rename_column("stars", "labels")

In [19]:
label_mapping = {idx+1:idx for idx in range(5)}
label_mapping

{1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
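
Classification heads index their classes from 0, which is why the star ratings 1 to 5 are shifted to 0 to 4. To turn predictions back into stars later on, the inverse mapping is handy. This is a small sketch; the name `id_to_stars` is an assumption and does not appear elsewhere in the notebook.

```python
# Same mapping as above: star rating -> class id
label_mapping = {idx + 1: idx for idx in range(5)}

# Inverse mapping: class id -> star rating
id_to_stars = {class_id: stars for stars, class_id in label_mapping.items()}

# Round trip: every star rating should map back to itself
round_trip_ok = all(id_to_stars[label_mapping[s]] == s for s in range(1, 6))
print(id_to_stars, round_trip_ok)
```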

In [20]:
def map_labels(examples):
    return {"labels": label_mapping[examples["labels"]]}

In [21]:
german_dataset = german_dataset.map(map_labels)
german_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'labels', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 19964
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'labels', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 500
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'labels', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 491
    })
})

## Creating a strong baseline

In [52]:
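from transformers import pipeline

# Multilingual NLI model used for zero-shot classification; device=0 selects the first GPU
zeroshot_classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli", device=0)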

Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [54]:
zeroshot_classifier("Ich liebe dieses Buch!", candidate_labels=[0,1,2,3,4])

{'sequence': 'Ich liebe dieses Buch!',
 'labels': [3, 4, 0, 2, 1],
 'scores': [0.2328941971063614,
  0.2296580672264099,
  0.21016642451286316,
  0.19636668264865875,
  0.13091468811035156]}

In [55]:
def compute_zeroshot_preds(examples):
    preds = zeroshot_classifier(examples["review_body"], candidate_labels=[0,1,2,3,4])
    return {"zeroshot_prediction": preds["labels"][0]}

In [56]:
german_test_dataset = german_dataset["test"].map(compute_zeroshot_preds)
german_test_dataset

In [57]:
german_test_dataset["zeroshot_prediction"][:10]

[2, 0, 4, 0, 1, 0, 2, 2, 2, 2]

In [58]:
german_test_dataset["labels"][0]

0

In [59]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(german_test_dataset["labels"], german_test_dataset["zeroshot_prediction"])

1.2926
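
To put the 1.2926 above in context, it helps to know what trivial predictors score. The sketch below computes the MAE by hand and evaluates a predictor that always outputs the same class. It assumes the evaluation labels are balanced across the five classes, as the training split is; that assumption is not checked here, and `mean_absolute_error_py` is a hypothetical stand-in for the scikit-learn function.

```python
def mean_absolute_error_py(y_true, y_pred):
    """Average absolute distance between true and predicted class ids."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# One example per class: a balanced toy label set (assumption).
balanced_labels = [0, 1, 2, 3, 4]

constant_scores = {}
for constant in range(5):
    preds = [constant] * len(balanced_labels)
    constant_scores[constant] = mean_absolute_error_py(balanced_labels, preds)

print(constant_scores)
```

Under that assumption, always predicting the middle class gives an MAE of 1.2 and always predicting an extreme class gives 2.0, which are useful reference points when reading the MAE values in this notebook.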

## Tokenization

In [60]:
from transformers import AutoTokenizer

model_checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [61]:
def tokenize_reviews(examples):
    return tokenizer(examples["review_body"], truncation=True, max_length=512)

In [62]:
tokenized_dataset = german_dataset.map(tokenize_reviews, batched=True)

## Load model

In [63]:
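from transformers import AutoModelForSequenceClassification

# One output class per star rating (labels 0-4)
num_labels = 5
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)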

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.out_p

## Create metrics

In [64]:
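import numpy as np

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model logits and the ground-truth labels
    predictions, labels = eval_pred
    # Pick the highest-scoring class for each example
    predictions = np.argmax(predictions, axis=1)
    return {"MAE": mean_absolute_error(labels, predictions)}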

## Create Trainer

In [67]:
%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false


In [65]:
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]
batch_size = 16
num_train_epochs = 3

num_train_samples = 500
train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(num_train_samples))
logging_steps = len(train_dataset) // (batch_size * num_train_epochs)

args = TrainingArguments(
    f"{model_name}-finetuned-marc-{num_train_samples}-samples",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    logging_steps=logging_steps,
    push_to_hub=True,
)
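
The step counts that the `Trainer` reports follow from the sample count, batch size, and number of epochs. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, and the final partial batch kept rather than dropped.

```python
import math

num_train_samples = 500
batch_size = 16
num_train_epochs = 3

# 500 / 16 = 31.25, so the last batch of each epoch is only partially filled
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Same expression as in the cell above
logging_steps = num_train_samples // (batch_size * num_train_epochs)

print(steps_per_epoch, total_steps, logging_steps)
```

The training log further down reports 160 optimization steps over 5 epochs, which is consistent with 32 steps per epoch (32 × 5).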

In [68]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

/home/lewis/git/workshops/nlp-zurich/xlm-roberta-base-finetuned-marc-500-samples is already a clone of https://huggingface.co/lewtun/xlm-roberta-base-finetuned-marc-500-samples. Make sure you pull the latest changes with `repo.git_pull()`.


In [69]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: reviewer_id, product_id, review_title, review_body, review_id, product_category, language.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16


{'eval_loss': 1.6235263347625732,
 'eval_MAE': 2.0,
 'eval_runtime': 17.1764,
 'eval_samples_per_second': 291.098,
 'eval_steps_per_second': 18.223}

In [70]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: reviewer_id, product_id, review_title, review_body, review_id, product_category, language.
***** Running training *****
  Num examples = 500
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 160


| Epoch | Training Loss | Validation Loss | Mae |
|---|---|---|---|
| 1 | 1.6149 | 1.603234 | 1.5906 |
| 2 | 1.5967 | 1.589962 | 1.2496 |
| 3 | 1.5675 | 1.48555 | 1.1496 |
| 4 | 1.4382 | 1.366122 | 0.8454 |
| 5 | 1.3009 | 1.327512 | 0.8746 |


The following columns in the evaluation set  don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: reviewer_id, product_id, review_title, review_body, review_id, product_category, language.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16
Saving model checkpoint to xlm-roberta-base-finetuned-marc-500-samples/checkpoint-32
Configuration saved in xlm-roberta-base-finetuned-marc-500-samples/checkpoint-32/config.json
Model weights saved in xlm-roberta-base-finetuned-marc-500-samples/checkpoint-32/pytorch_model.bin
tokenizer config file saved in xlm-roberta-base-finetuned-marc-500-samples/checkpoint-32/tokenizer_config.json
Special tokens file saved in xlm-roberta-base-finetuned-marc-500-samples/checkpoint-32/special_tokens_map.json
tokenizer config file saved in xlm-roberta-base-finetuned-marc-500-samples/tokenizer_config.json
Special tokens file saved in xlm-roberta-base-finetuned-marc-500-samples/special_tokens_m

TrainOutput(global_step=160, training_loss=1.505726170539856, metrics={'train_runtime': 219.5709, 'train_samples_per_second': 11.386, 'train_steps_per_second': 0.729, 'total_flos': 232485434971824.0, 'train_loss': 1.505726170539856, 'epoch': 5.0})

In [71]:
trainer.push_to_hub(commit_message="Training complete")

Saving model checkpoint to xlm-roberta-base-finetuned-marc-500-samples
Configuration saved in xlm-roberta-base-finetuned-marc-500-samples/config.json
Model weights saved in xlm-roberta-base-finetuned-marc-500-samples/pytorch_model.bin
tokenizer config file saved in xlm-roberta-base-finetuned-marc-500-samples/tokenizer_config.json
Special tokens file saved in xlm-roberta-base-finetuned-marc-500-samples/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 32.0k/1.04G [00:00<?, ?B/s]

Upload file runs/Oct12_16-57-03_vorace/events.out.tfevents.1634050856.vorace: 100%|##########| 8.01k/8.01k [00…

remote: error: cannot lock ref 'refs/heads/main': is at cc8d97f269f272d245b499730226a5e45c790897 but expected 1ad385c8a07abc1ca653d5ac11b1e91586cb4163        
To https://huggingface.co/lewtun/xlm-roberta-base-finetuned-marc-500-samples
 ! [remote rejected] main -> main (failed to update ref)
error: failed to push some refs to 'https://huggingface.co/lewtun/xlm-roberta-base-finetuned-marc-500-samples'



OSError: remote: error: cannot lock ref 'refs/heads/main': is at cc8d97f269f272d245b499730226a5e45c790897 but expected 1ad385c8a07abc1ca653d5ac11b1e91586cb4163        
To https://huggingface.co/lewtun/xlm-roberta-base-finetuned-marc-500-samples
 ! [remote rejected] main -> main (failed to update ref)
error: failed to push some refs to 'https://huggingface.co/lewtun/xlm-roberta-base-finetuned-marc-500-samples'


## Zero-shot cross-lingual evaluation

In [72]:
def evaluate_corpus(lang):
    dataset = load_dataset(dataset_name, lang, split="test")
    dataset = dataset.rename_column("stars", "labels")
    dataset = dataset.map(map_labels)
    tokenized_dataset = dataset.map(tokenize_reviews, batched=True)
    preds = trainer.evaluate(eval_dataset=tokenized_dataset)
    return {"MAE": preds["eval_MAE"]}
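
To compare transfer across every language rather than one at a time, the per-language call can be wrapped in a loop. This is a sketch with hypothetical names: `evaluate_languages` takes the evaluation function as an argument, and `fake_evaluate` is a stand-in returning dummy scores so the snippet runs without a model or dataset.

```python
def evaluate_languages(langs, evaluate_fn):
    """Call `evaluate_fn(lang)` for each language code and collect the MAE."""
    return {
        lang: evaluate_fn(lang)["MAE"]
        for lang in langs
        if lang != "all_languages"  # skip the combined subset
    }


def fake_evaluate(lang):
    # Stand-in that returns a dummy score, NOT a real result.
    return {"MAE": float(len(lang))}


# Configuration names as listed earlier in the notebook
langs = ["all_languages", "de", "en", "es", "fr", "ja", "zh"]
results = evaluate_languages(langs, fake_evaluate)
print(sorted(results))
```

Passing `evaluate_corpus` in place of `fake_evaluate` would run the real evaluation once per language.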

In [73]:
evaluate_corpus("en")

Reusing dataset amazon_reviews_multi (/data/.cache/hf/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)


The following columns in the evaluation set  don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: reviewer_id, product_id, review_title, review_body, review_id, product_category, language.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16


{'MAE': 0.874}

In [74]:
evaluate_corpus("fr")

Reusing dataset amazon_reviews_multi (/data/.cache/hf/datasets/amazon_reviews_multi/fr/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)


The following columns in the evaluation set  don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: reviewer_id, product_id, review_title, review_body, review_id, product_category, language.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 16


{'MAE': 0.8742}

In [75]:
classifier = pipeline("text-classification", model=trainer.model, tokenizer=trainer.tokenizer, device=0)

In [76]:
classifier("I love this book!")

[{'label': 'LABEL_3', 'score': 0.30068162083625793}]

In [77]:
classifier("Ich hasse dieses Buch!")

[{'label': 'LABEL_0', 'score': 0.372832715511322}]

In [78]:
classifier("J'adore ce livre")

[{'label': 'LABEL_3', 'score': 0.2572416365146637}]
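
The `LABEL_n` names are the defaults used when a model configuration has no custom `id2label` mapping, so `n` is the class id (stars minus one). A small sketch for converting them back to star ratings follows; the helper name `label_to_stars` is an assumption.

```python
def label_to_stars(label):
    """Convert a default pipeline label such as 'LABEL_3' to a 1-5 star rating."""
    class_id = int(label.rsplit("_", 1)[-1])
    return class_id + 1


# Outputs copied from the cells above
outputs = [
    {"label": "LABEL_3", "score": 0.30068162083625793},
    {"label": "LABEL_0", "score": 0.372832715511322},
]

print([label_to_stars(output["label"]) for output in outputs])
```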