<a href="https://colab.research.google.com/github/MorenoLaQuatra/DeepNLP/blob/main/%F0%9F%A4%97Transformers_Overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**


---


**Teaching Assistant:** Moreno La Quatra

**Overview:** 🤗 Transformers library

In [None]:
%%capture
! pip install transformers

# Pipelines

The easiest way to run inference with the 🤗 Transformers library is through [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines). A pipeline bundles all the steps required to analyze input text:
- Preprocessing
- Model inference
- Post-processing

![](https://huggingface.co/course/static/chapter2/full_nlp_pipeline.png)
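
The three steps above can be sketched with a toy example that needs no downloads. The vocabulary, weights, and function names below are hypothetical and only illustrate how the stages chain together; a real pipeline uses a trained tokenizer and model.

```python
import math

# Hypothetical toy vocabulary and per-token scores (not a real model).
TOY_VOCAB = {"[UNK]": 0, "i": 1, "like": 2, "hate": 3, "nlp": 4}
# One (negative, positive) score pair per vocabulary entry.
TOY_WEIGHTS = [(0.0, 0.0), (0.0, 0.0), (-1.0, 2.0), (2.0, -1.0), (0.0, 0.5)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: split the text and map each word to an integer ID."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.lower().split()]


def model_inference(input_ids):
    """Step 2: produce one raw score (logit) per class by summing token scores."""
    logits = [0.0, 0.0]
    for token_id in input_ids:
        for k, weight in enumerate(TOY_WEIGHTS[token_id]):
            logits[k] += weight
    return logits


def postprocess(logits):
    """Step 3: apply softmax and return the best label with its probability."""
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(model_inference(preprocess(text)))


print(toy_pipeline("I like NLP"))
print(toy_pipeline("I hate NLP"))
```

The output dictionaries deliberately mirror the `label`/`score` format returned by the sentiment analysis pipeline.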

## Sentiment analysis pipeline (Encoder-only models)

In [None]:
from transformers import pipeline

In [None]:
# simple example
sentiment_analyzer = pipeline("sentiment-analysis")
res = sentiment_analyzer(["I like Deep NLP course", "I don't like Deep NLP course!"])
print (res)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.997292697429657}, {'label': 'NEGATIVE', 'score': 0.9898529648780823}]


In [None]:
# providing model: look for it on Model Hub: https://huggingface.co/models
sentiment_analyzer = pipeline("sentiment-analysis", model="finiteautomata/bertweet-base-sentiment-analysis")
res = sentiment_analyzer(["TLDR: the movie was amazing", "What a mess! The plot was awful"])
print (res)

emoji is not installed, thus not converting emoticons or emojis into text. Please install emoji: pip3 install emoji


[{'label': 'POS', 'score': 0.9896277189254761}, {'label': 'NEG', 'score': 0.9823752045631409}]


## Text Generation Pipeline (Decoder-only models)

In [None]:
# simple example: no model is specified, so the pipeline falls back to its default (gpt2)

text_generator = pipeline("text-generation")
res = text_generator("The meaning of life is")
print (res)

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The meaning of life is to not only be lived in the body, but also to be lived in the spirit of reason. And that the spiritual soul in being free from disease, has all the faculties of life is to be praised and praised; thus'}]


In [None]:
text_generator = pipeline("text-generation", model="GroNLP/gpt2-small-italian")
res = text_generator("Il senso della vita è")
print (res)


[{'generated_text': 'Il senso della vita è il fondamento dell\'esistenza di un individuo, la sua relazione con le persone e i suoi rapporti familiari. L\'espressione "paziente umano" si riferisce all\'individuo in quanto non ha alcun legame fisico o psichico con gli esseri umani (come per esempio l\'uomo), bensì alla persona che ci viene a conoscenza delle sue condizioni: esso puٍ essere considerato come «maiale» ed anche “personaggibile».\n\nIn base al rapporto uomo-animale'}]


## Text Summarization Pipeline (Encoder-Decoder models)

In [None]:
# simple example

summarizer = pipeline("summarization")
res = summarizer("Scientific articles can be annotated with short sentences, called highlights, providing readers with an at-a-glance overview of the main findings. Highlights are usually manually specified by the authors. This paper presents a supervised approach, based on regression techniques, with the twofold aim at automatically extracting highlights of past articles with missing annotations and simplifying the process of manually annotating new articles. To this end, regression models are trained on a variety of features extracted from previously annotated articles. The proposed approach extends existing extractive approaches by predicting a similarity score, based on n-gram co-occurrences, between article sentences and highlights. The experimental results, achieved on a benchmark collection of articles ranging over heterogeneous topics, show that the proposed regression models perform better than existing methods, both supervised and not.")
print (res)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' Scientific articles can be annotated with short sentences, called highlights, providing readers with an at-a-glance overview of the main findings . Highlights are usually manually specified by the authors . This paper presents a supervised approach, based on regression techniques, with the twofold aim of automatically extracting highlights of past articles with missing annotations .'}]


In [None]:
# providing model: look for it on Model Hub: https://huggingface.co/models

summarizer = pipeline("summarization", model="shamikbose89/mt5-small-finetuned-arxiv-cs-finetuned-arxiv-cs-full")
res = summarizer("Scientific articles can be annotated with short sentences, called highlights, providing readers with an at-a-glance overview of the main findings. Highlights are usually manually specified by the authors. This paper presents a supervised approach, based on regression techniques, with the twofold aim at automatically extracting highlights of past articles with missing annotations and simplifying the process of manually annotating new articles. To this end, regression models are trained on a variety of features extracted from previously annotated articles. The proposed approach extends existing extractive approaches by predicting a similarity score, based on n-gram co-occurrences, between article sentences and highlights. The experimental results, achieved on a benchmark collection of articles ranging over heterogeneous topics, show that the proposed regression models perform better than existing methods, both supervised and not.")
print (res)

[{'summary_text': 'Efficient Regression Models for Scientific Articles'}]


# Tokenizers

Tokenizers convert text into tokens (and back), and are used to preprocess text before it is fed to the model. The tokenizer maps text to token IDs; the model then maps those IDs to embeddings.

The main tokenizer methods are:
- `tokenizer.encode(text)`: converts text into token IDs
- `tokenizer.decode(token_ids)`: converts token IDs back into text


[Tokenizers](https://huggingface.co/docs/transformers/master/en/main_classes/tokenizer) contain all the pre-processing tools used to split text into tokens. A tokenizer is trained on a corpus, from which it learns a vocabulary and the rules for splitting text into tokens; it can then be used to tokenize new text.
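
To make the idea of "a vocabulary plus splitting rules" concrete, here is a minimal sketch of a WordPiece-style tokenizer that uses greedy longest-match-first splitting. The tiny vocabulary is hypothetical and hand-written; a real tokenizer learns its vocabulary from a corpus and applies extra normalization rules that are omitted here.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from a corpus.
TOY_VOCAB = ["[UNK]", "deep", "learn", "##ing", "nl", "##p", "i"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB)}


def split_word(word):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in TOKEN_TO_ID:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text):
    """Text -> token IDs."""
    tokens = [p for word in text.lower().split() for p in split_word(word)]
    return [TOKEN_TO_ID[t] for t in tokens]


def decode(token_ids):
    """Token IDs -> text, gluing continuation pieces back onto their word."""
    text = " ".join(TOY_VOCAB[i] for i in token_ids)
    return text.replace(" ##", "")


ids = encode("I learning deep NLP")
print(ids)
print(decode(ids))
```

Words missing from the vocabulary are split into known pieces (`learning` becomes `learn` + `##ing`), which is how subword tokenizers keep the vocabulary small while still covering unseen words.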

NB: Auto classes (e.g., `AutoTokenizer`, `AutoModel`) create the right tokenizer (or model) object from a checkpoint name, without having to instantiate the model-specific class yourself.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokens = tokenizer ("I'm learning Deep NLP") 
print (tokens)

{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


Each model has a maximum number of tokens it can process, and the sentences in a batch usually have different lengths. The following tokenizer parameters handle this:

- `max_length` sets the maximum number of tokens
- `truncation` enables truncation of sentences longer than `max_length`
- `padding` enables padding of shorter sentences (with `padding='max_length'`, up to `max_length`)

The tokenizer returns an `attention_mask` that lets the model compute attention weights only for real tokens (and not for padding). The attention mask is a binary tensor of shape (batch_size, sequence_length) where 1 marks a token that should be attended to and 0 marks one that should not. The `return_attention_mask=True` parameter requests it explicitly.
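
The interplay between padding, truncation, and the attention mask can be sketched in plain Python. The helper `pad_or_truncate` below is hypothetical (it is not part of the library) and assumes that the last ID is a closing special token such as `[SEP]`, which is kept when truncating.

```python
def pad_or_truncate(input_ids, max_length, truncation=False, pad_id=0):
    """Return fixed-length IDs plus the matching attention mask.

    Assumes the last ID is a closing special token (e.g. [SEP]) that is
    kept when the sequence is truncated.
    """
    ids = list(input_ids)
    if truncation and len(ids) > max_length:
        ids = ids[: max_length - 1] + [ids[-1]]
    attention_mask = [1] * len(ids)          # 1 = real token
    n_pad = max(0, max_length - len(ids))    # no padding if already long enough
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding
    }


# IDs taken from the bert-base-cased output shown earlier for "I'm learning Deep NLP".
short_ids = [101, 146, 112, 182, 3776, 7786, 21239, 2101, 102]

padded = pad_or_truncate(short_ids, max_length=16)
print(padded["attention_mask"])

truncated = pad_or_truncate(short_ids, max_length=6, truncation=True)
print(truncated["input_ids"])
```

Without `truncation=True` a sequence longer than `max_length` is returned at its full length, which is also what the tokenizer does when only `padding` and `max_length` are set.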

In [None]:
tokens = tokenizer ("I'm learning Deep NLP", padding='max_length', max_length=16) 
print (tokens)

{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}


In [None]:
tokens = tokenizer ("I'm learning Deep NLP at Politecnico di Torino. I'm a 2nd year master student", padding='max_length', max_length=16, return_attention_mask=True) 
print (tokens)
tokens = tokenizer ("I'm learning Deep NLP at Politecnico di Torino. I'm a 2nd year master student", padding='max_length', max_length=16, truncation=True, return_attention_mask=True) 
print (tokens)
tokens = tokenizer ("I'm learning Deep NLP at Politecnico di Torino. I'm a 2nd year master student", padding='max_length', max_length=32, truncation=True, return_attention_mask=True) 
print (tokens)

{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 1120, 17129, 3150, 1665, 7770, 1186, 4267, 27882, 119, 146, 112, 182, 170, 2518, 1214, 3283, 2377, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 1120, 17129, 3150, 1665, 7770, 1186, 4267, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 146, 112, 182, 3776, 7786, 21239, 2101, 1120, 17129, 3150, 1665, 7770, 1186, 4267, 27882, 119, 146, 112, 182, 170, 2518, 1214, 3283, 2377, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Tokenizers do not only encode text into IDs; they also perform the inverse conversion. Tokenizers may also contain "special tokens" that encode specific information. For example, in BERT the `[CLS]` token marks the beginning of the input and the `[SEP]` token marks the end of a sentence. These tokens are added by the tokenizer when encoding the text and are usually removed when decoding.
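
As a minimal illustration of how special tokens are added on encoding and optionally stripped on decoding, consider the sketch below. The five-entry vocabulary and both helper functions are hypothetical; only the `skip_special_tokens` name mirrors the real `decode` argument.

```python
# Hypothetical toy vocabulary with BERT-style special tokens.
ID_TO_TOKEN = {0: "[PAD]", 1: "[CLS]", 2: "[SEP]", 3: "deep", 4: "nlp"}
SPECIAL_IDS = {0, 1, 2}


def add_special_tokens(token_ids, max_length):
    """Wrap IDs as [CLS] ... [SEP], then pad with [PAD] up to max_length."""
    ids = [1] + list(token_ids) + [2]
    return ids + [0] * (max_length - len(ids))


def decode(token_ids, skip_special_tokens=False):
    """Map IDs back to text, optionally dropping the special tokens."""
    if skip_special_tokens:
        token_ids = [i for i in token_ids if i not in SPECIAL_IDS]
    return " ".join(ID_TO_TOKEN[i] for i in token_ids)


ids = add_special_tokens([3, 4], max_length=6)
print(decode(ids))
print(decode(ids, skip_special_tokens=True))
```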

In [None]:
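# skip_special_tokens=False keeps [CLS], [SEP] and [PAD] in the decoded text;
# set it to True to strip them
text = tokenizer.decode(tokens.input_ids, skip_special_tokens=False)
print (text)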

# [CLS] special token for encoder model, used for classification/regression tasks
# [SEP] special token to separate multiple sentences
# [PAD] special token for padding

[CLS] I'm learning Deep NLP at Politecnico di Torino. I'm a 2nd year master student [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]


# Models

Transformer-based models are wrapped in their own classes in 🤗 Transformers. Similarly to `AutoTokenizer`, the `AutoModel` class takes care of instantiating the correct class for the model we want to use.

Since the same backbone architecture can serve different tasks (e.g., BERT can be used for both sequence classification and token-level classification), the Auto Model class should be chosen with the task appended to its name (e.g., `AutoModelForSequenceClassification`).
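
The difference between task-specific heads on a shared backbone can be sketched with NumPy. Everything here is a simplified assumption: the random `hidden_states` stand in for the backbone output, and each head is reduced to a single linear layer (real heads may add pooling, dropout, or extra layers).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size = 5, 8               # hypothetical sizes
num_seq_labels, num_token_labels = 3, 4   # hypothetical label counts

# Stand-in for the backbone output: one hidden vector per token.
hidden_states = rng.normal(size=(seq_len, hidden_size))

# Sequence classification head: one prediction for the whole input,
# computed here from the first ([CLS]) position.
w_seq = rng.normal(size=(hidden_size, num_seq_labels))
sequence_logits = hidden_states[0] @ w_seq

# Token classification head: one prediction per token.
w_tok = rng.normal(size=(hidden_size, num_token_labels))
token_logits = hidden_states @ w_tok

print(sequence_logits.shape)  # (3,)
print(token_logits.shape)     # (5, 4)
```

The backbone is identical in both cases; only the shape of the final projection, and therefore of the output, changes with the task.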

In [None]:
from transformers import AutoModelForSequenceClassification
bert_model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

However, the pre-trained BERT model has not been fine-tuned for any specific task (this is the reason behind the warning: the classification head is newly initialized). To use it for classification, we first need to fine-tune it (or we can use another model that has already been fine-tuned for the task).

[Model Hub](https://huggingface.co/models)

In [None]:
from transformers import AutoModelForSequenceClassification
bert_model_sc = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [None]:
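import numpy as np
sentences = ["Google stocks went up suddently, I earned 30B$"]
# NOTE: `tokenizer` is still the bert-base-cased tokenizer loaded above. In general the tokenizer
# should come from the same checkpoint as the model, e.g. AutoTokenizer.from_pretrained("ProsusAI/finbert")
tokenized_sentence = tokenizer(sentences, return_tensors="pt", padding="max_length", truncation=True, max_length=16)
pred = bert_model_sc(**tokenized_sentence)
# pred[0] holds the logits; take the first (only) sentence and pick the highest-scoring class
print (pred[0][0].detach().numpy(), np.argmax(pred[0][0].detach().numpy()))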

[ 0.10727316 -1.487677    1.9541717 ] 2
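
The array printed above contains raw logits, not probabilities. As a small sketch, a softmax turns them into a probability distribution over the classes. The values are copied from the output above; in the notebook, the label name for each index should be available through `bert_model_sc.config.id2label`.

```python
import numpy as np

# Logits copied from the output above.
logits = np.array([0.10727316, -1.487677, 1.9541717])

# Softmax: subtract the max for numerical stability, then normalise.
exps = np.exp(logits - logits.max())
probs = exps / exps.sum()

predicted_id = int(np.argmax(probs))
print(probs.round(3), predicted_id)
```

Softmax preserves the ordering of the logits, so the predicted index is the same as the `np.argmax` of the raw scores.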


# Finetuning a pretrained model

The pretraining + fine-tuning paradigm is key to the success of the 🤗 Transformers library. The [Model Hub](https://huggingface.co/models) contains plenty of pre-trained models that can be used as they are or fine-tuned on new datasets.

While the theoretical aspects of pretraining and finetuning have been discussed during the lectures, the following sections will show how to finetune a model on a new dataset.


For this example, we will use the `Trainer` class, a high-level class for training and evaluating 🤗 Transformers models.
The [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) lets users easily fine-tune the selected model for the task at hand.

In [None]:
# Your own data
import pandas as pd

!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/transformers_overview/Corona_NLP_train.csv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/transformers_overview/Corona_NLP_test.csv

df_train = pd.read_csv("Corona_NLP_train.csv")
df_test = pd.read_csv("Corona_NLP_test.csv")

df_train = df_train.dropna(how = 'any')
df_test  = df_test.dropna (how = 'any')

train_sentences = df_train["OriginalTweet"].tolist()
train_y = df_train["Sentiment"].tolist()

print(f"Train set: {len(train_sentences)}, {len(train_y)}")

eval_samples = int(0.05*len(train_sentences))


eval_sentences = train_sentences[:eval_samples]
eval_y = train_y[:eval_samples]

train_sentences = train_sentences[eval_samples:]
train_y = train_y[eval_samples:]

test_sentences = df_test["OriginalTweet"].tolist()
test_y = df_test["Sentiment"].tolist()

print(f"Train set: {len(train_sentences)}, {len(train_y)}")
print(f"Eval set: {len(eval_sentences)}, {len(eval_y)}")
print(f"Test set: {len(test_sentences)}, {len(test_y)}")
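
The cell above holds out the first 5% of the rows for evaluation. If the CSV happens to be ordered (for example by date), that slice may not be representative. A possible alternative, shown here only as a sketch on toy data and not as the approach used in this notebook, is to shuffle the indices with a fixed seed before splitting. The helper name `shuffled_split` is hypothetical.

```python
import random


def shuffled_split(sentences, labels, eval_fraction=0.05, seed=42):
    """Shuffle paired examples, then hold out a fraction for evaluation."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_eval = int(eval_fraction * len(indices))
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(sentences, train_idx), pick(labels, train_idx),
            pick(sentences, eval_idx), pick(labels, eval_idx))


# Toy data standing in for the tweets and their sentiment labels.
toy_x = [f"tweet {i}" for i in range(100)]
toy_y = [i % 5 for i in range(100)]

tr_x, tr_y, ev_x, ev_y = shuffled_split(toy_x, toy_y)
print(len(tr_x), len(ev_x))
```

Sentences and labels are selected with the same indices, so each sentence stays paired with its own label after shuffling.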


In [None]:
# Examples for Sequence Classification

# tokenizer and model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels = len(set(train_y)))

# Tokenization step
tokenized_train = tokenizer(train_sentences, padding="max_length", truncation=True, max_length=64)
tokenized_test = tokenizer(test_sentences, padding="max_length", truncation=True, max_length=64)
tokenized_eval = tokenizer(eval_sentences, padding="max_length", truncation=True, max_length=64)


# Label encoding step
from sklearn.preprocessing import LabelEncoder

def label_encoding(labels, le):
    # map string labels to integer IDs using an already fitted LabelEncoder
    y = le.transform(labels)
    return y

# fit the encoder on the distinct labels of the training set
all_labels = sorted(set(train_y))

le = LabelEncoder()
le.fit(all_labels)

train_y = label_encoding(train_y, le)
test_y = label_encoding(test_y, le)
eval_y = label_encoding(eval_y, le)

# V1: use your own dataset class
import torch
class SCDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_ds = SCDataset(tokenized_train, train_y)
eval_ds = SCDataset(tokenized_eval, eval_y)
test_ds = SCDataset(tokenized_test, test_y)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Compute metrics function should return a dictionary with the metrics computed for the task (e.g., accuracy)

def compute_metrics(pred):
    predictions = np.argmax(pred.predictions, axis=-1)
    labels = pred.label_ids
    return {
        "acc": accuracy_score(labels, predictions),
        "f1_macro": f1_score(labels, predictions, average="macro"),
        "f1_weight": f1_score(labels, predictions, average="weighted")
    }
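
To see what distinguishes the three metrics returned by `compute_metrics`, the sketch below recomputes them by hand on a tiny, made-up example with imbalanced classes. It does not use scikit-learn; the helper `f1_per_class` is hypothetical and only spells out the definitions.

```python
from collections import Counter


def f1_per_class(labels, predictions):
    """Return {class: F1}, with F1 = 2*TP / (2*TP + FP + FN) for each class."""
    scores = {}
    for c in set(labels) | set(predictions):
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, predictions) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, predictions) if y == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up example: class 0 is twice as frequent as class 1.
labels      = [0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 1, 1, 0]

per_class = f1_per_class(labels, predictions)
support = Counter(labels)  # number of true examples per class

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
f1_macro = sum(per_class.values()) / len(per_class)
f1_weighted = sum(per_class[c] * support[c] for c in per_class) / len(labels)
print(accuracy, f1_macro, f1_weighted)
```

Macro F1 gives every class the same weight, while weighted F1 weights each class by its number of true examples, so the two diverge when classes are imbalanced.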

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=10,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
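    logging_steps=10,                # log the training loss every 10 steps
    evaluation_strategy="epoch",     # evaluate at the end of each epoch
    save_strategy="epoch",           # save a checkpoint at the end of each epoch
    eval_steps=100,                  # only used when evaluation_strategy="steps"
    save_steps=100,                  # only used when save_strategy="steps"
    load_best_model_at_end=True,     # reload the best checkpoint when training ends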
)
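
The effect of `warmup_steps` can be sketched as a learning-rate schedule. The function below assumes a linear warmup followed by a linear decay and a base learning rate of 5e-5; both are assumptions about the defaults, since the arguments above set neither the scheduler type nor the learning rate explicitly. The helper `linear_schedule_lr` is hypothetical.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# One epoch over 30939 training examples with a batch size of 32
# (the figures reported in this notebook's training log).
total_steps = math.ceil(30939 / 32)
print(total_steps)  # 967

base_lr = 5e-5  # assumed default learning rate
for step in (0, 5, 10, 500, 967):
    lr = linear_schedule_lr(step, base_lr, warmup_steps=10, total_steps=total_steps)
    print(step, lr)
```

With only 10 warmup steps out of 967, the learning rate reaches its peak almost immediately and then decays for the rest of the epoch.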

In [None]:
from transformers import Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_ds,         # training dataset
    eval_dataset=eval_ds,             # evaluation dataset
    compute_metrics=compute_metrics
)

trainer.train()

***** Running training *****
  Num examples = 30939
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 967


Epoch,Training Loss,Validation Loss,Acc,F1 Macro,F1 Weight
1,0.7898,0.776426,0.701474,0.71355,0.70036


***** Running Evaluation *****
  Num examples = 1628
  Batch size = 64
Saving model checkpoint to ./results/checkpoint-967
Configuration saved in ./results/checkpoint-967/config.json
Model weights saved in ./results/checkpoint-967/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-967 (score: 0.7764255404472351).


TrainOutput(global_step=967, training_loss=1.075773225821778, metrics={'train_runtime': 734.4956, 'train_samples_per_second': 42.123, 'train_steps_per_second': 1.317, 'total_flos': 1017576526211712.0, 'train_loss': 1.075773225821778, 'epoch': 1.0})

In [None]:
# The Trainer API can also be used to test the model
preds = trainer.predict(test_ds)
print(preds)

***** Running Prediction *****
  Num examples = 2964
  Batch size = 64


PredictionOutput(predictions=array([[-7.4395530e-02, -2.9149461e+00,  2.6422818e+00,  3.0899101e-01,
        -6.5377206e-02],
       [-3.1479673e+00,  1.2602379e+00, -5.6172097e-01,  1.6221674e-02,
         3.2608027e+00],
       [-3.2403991e-01, -2.8279240e+00,  2.5448132e+00,  3.9962062e-01,
         4.2341407e-03],
       ...,
       [-1.2931813e+00, -2.8563869e-01,  1.0930369e+00, -2.4310488e-01,
         1.5647271e+00],
       [-2.6589458e+00, -1.7427621e+00, -2.0157535e-01,  3.4326386e+00,
         9.9057478e-01],
       [-1.5231811e+00,  4.6537170e+00, -1.2794781e+00, -1.7383429e+00,
         1.3228137e+00]], dtype=float32), label_ids=array([0, 4, 2, ..., 2, 3, 1]), metrics={'test_loss': 0.7736325263977051, 'test_acc': 0.6973684210526315, 'test_f1_macro': 0.7102028427981064, 'test_f1_weight': 0.6970589266885456, 'test_runtime': 21.8475, 'test_samples_per_second': 135.668, 'test_steps_per_second': 2.151})


Additional information on how to fine-tune a pretrained model, either with the Trainer API or with standard PyTorch/TensorFlow (Keras), can be found [here](https://huggingface.co/docs/transformers/training).

Some additional notebooks that may be useful for the project (if you want to, and are allowed to, use HF):

- PyTorch + HF: https://github.com/huggingface/transformers/tree/master/notebooks#pytorch-examples
- TensorFlow + HF: https://github.com/huggingface/transformers/tree/master/notebooks#tensorflow-examples

# Datasets & metrics

Hugging Face also provides separate packages for datasets and evaluation metrics: [datasets](https://huggingface.co/datasets) and [evaluate](https://huggingface.co/docs/evaluate/index).

In [None]:
! pip install evaluate datasets


In [None]:
# example on how to use a metric https://huggingface.co/spaces/evaluate-metric/accuracy
import evaluate
accuracy = evaluate.load("accuracy")

In [None]:
y_pred = preds.predictions.argmax(-1)
# compute the accuracy
accuracy.compute(predictions=y_pred, references=test_y)

{'accuracy': 0.6973684210526315}


In [None]:
# using a dataset
from datasets import load_dataset
dataset = load_dataset("scitldr")

No config specified, defaulting to: scitldr/Abstract


Downloading and preparing dataset scitldr/Abstract (download: 5.23 MiB, generated: 4.58 MiB, post-processed: Unknown size, total: 9.81 MiB) to /root/.cache/huggingface/datasets/scitldr/Abstract/0.0.0/79e0fa75961392034484808cfcc8f37deb15ceda153b798c92d9f621d1042fef...


Dataset scitldr downloaded and prepared to /root/.cache/huggingface/datasets/scitldr/Abstract/0.0.0/79e0fa75961392034484808cfcc8f37deb15ceda153b798c92d9f621d1042fef. Subsequent calls will reuse this data.


In [None]:
print(dataset["train"][0])
print("Source text:", dataset["train"][0]["source"])
print("Target text:", dataset["train"][0]["target"])

{'source': ['Due to the success of deep learning to solving a variety of challenging machine learning tasks, there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect.', 'Particularly, the properties of critical points and the landscape around them are of importance to determine the convergence performance of optimization algorithms.', 'In this paper, we provide a necessary and sufficient characterization of the analytical forms for the critical points (as well as global minimizers) of the square loss functions for linear neural networks.', 'We show that the analytical forms of the critical points characterize the values of the corresponding loss functions as well as the necessary and sufficient conditions to achieve global minimum.', 'Furthermore, we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks.', 'One partic

In [None]:
max_input_length = 512
max_output_length = 64
def preprocess_function(examples):
    # "source" is a list of sentences: join them into a single input string
    inputs = " ".join(examples["source"])
    targets = examples["target"]
    # tokenize inputs and targets to fixed lengths
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=max_output_length, truncation=True, padding="max_length")
    # the target token IDs become the labels the model is trained to produce
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
dataset = dataset.map(preprocess_function)

In [None]:
print(dataset["train"][0])
print("Source token IDs:", dataset["train"][0]["input_ids"])
print("Target token IDs:", dataset["train"][0]["labels"])

{'source': ['Due to the success of deep learning to solving a variety of challenging machine learning tasks, there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect.', 'Particularly, the properties of critical points and the landscape around them are of importance to determine the convergence performance of optimization algorithms.', 'In this paper, we provide a necessary and sufficient characterization of the analytical forms for the critical points (as well as global minimizers) of the square loss functions for linear neural networks.', 'We show that the analytical forms of the critical points characterize the values of the corresponding loss functions as well as the necessary and sufficient conditions to achieve global minimum.', 'Furthermore, we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks.', 'One partic

In [None]:
columns_to_return = ['input_ids', 'labels', 'attention_mask']
dataset.set_format(type='torch', columns=columns_to_return)


In [None]:
print(dataset["train"][0])
print("Source token IDs:", dataset["train"][0]["input_ids"])
print("Target token IDs:", dataset["train"][0]["labels"])

{'input_ids': tensor([    0, 28084,     7,     5,  1282,     9,  1844,  2239,     7, 15582,
           10,  3143,     9,  4087,  3563,  2239,  8558,     6,    89,    16,
           10,  2227,   773,    11,  2969,   872,  8047,    13,  1058, 26739,
         4836,    31,    10, 26534,  6659,     4, 36863,     6,     5,  3611,
            9,  2008,   332,     8,     5,  5252,   198,   106,    32,     9,
         3585,     7,  3094,     5, 33345,   819,     9, 25212, 16964,     4,
           96,    42,  2225,     6,    52,   694,    10,  2139,     8,  7719,
        34934,     9,     5, 23554,  4620,    13,     5,  2008,   332,    36,
          281,   157,    25,   720, 15970, 11574,    43,     9,     5,  3925,
          872,  8047,    13, 26956, 26739,  4836,     4,   166,   311,    14,
            5, 23554,  4620,     9,     5,  2008,   332, 33776,     5,  3266,
            9,     5, 12337,   872,  8047,    25,   157,    25,     5,  2139,
            8,  7719,  1274,     7,  3042,   720, 