<a href="https://colab.research.google.com/github/HeleneFabia/nlp-exploration/blob/main/exploring_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face NLP Course



In [None]:
# install libraries
!pip install datasets
!pip install transformers

## (Datasets)

In [27]:
# imports 
from datasets import (
    load_dataset_builder, 
    load_dataset,
)
from transformers import (
    BertTokenizerFast,
    AutoTokenizer
)

import torch
from torch.utils.data import(
    DataLoader
)

In [None]:
dataset_builder = load_dataset_builder(path="poem_sentiment")

train_dataset = load_dataset(path="poem_sentiment", split="train")
# valid_dataset = load_dataset(path="poem_sentiment", split="validation")
# test_dataset = load_dataset(path="poem_sentiment", split="test")

In [7]:
print("Description:", train_dataset.description)
print("Num data entries:", len(train_dataset))
print("Column names:", train_dataset.column_names)
print("Classes:", train_dataset.features["label"].names)
print("Example data entry:", train_dataset[0])

Description: Poem Sentiment is a sentiment dataset of poem verses from Project Gutenberg. This dataset can be used for tasks such as sentiment classification or style transfer for poems.

Num data entries: 892
Column names: ['id', 'verse_text', 'label']
Classes: ['negative', 'positive', 'no_impact', 'mixed']
Example data entry: {'id': 0, 'verse_text': 'with pale blue berries. in these peaceful shades--', 'label': 1}


In [32]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
train_enc_ds = train_dataset.map(lambda examples: 
                                          tokenizer(
                                              examples["verse_text"], 
                                              truncation=True,
                                              padding="max_length",
                                          ),
                                 batched=True
                                 )

In [40]:
print("Column names of encoded dataset:", train_enc_ds.column_names)
print("Tokenized data entry:", train_enc_ds[0])

Column names of encoded dataset: ['attention_mask', 'id', 'input_ids', 'label', 'token_type_ids', 'verse_text']
Tokenized data entry: {'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ... (output truncated)

--- What are attention mask, input ids and token type ids?

In [44]:
# use dataset with pytorch
train_enc_ds.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])

# create pytorch data loader
train_dl = DataLoader(train_enc_ds, batch_size=32)
next(iter(train_dl))

{'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'input_ids': tensor([[  101,  1114,  4554,  ...,     0,     0,     0],
         [  101,  1122,  5611,  ...,     0,     0,     0],
         [  101,  1105,  1115,  ...,     0,     0,     0],
         ...,
         [  101,  1106,  1115,  ...,     0,     0,     0],
         [  101,   192,  2386,  ...,     0,     0,     0],
         [  101,  1123, 15219,  ...,     0,     0,     0]]),
 'label': tensor([1, 2, 0, 3, 3, 3, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 1, 2, 2, 1,
         2, 2, 2, 2, 2, 2, 2, 2]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]])}
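
The batch above holds 32 labels because `batch_size=32`. As a rough sketch of how a data loader cuts the 892 training entries into batches (assuming no shuffling and that the last, smaller batch is kept; `iter_batches` is a hypothetical helper, not part of PyTorch):

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 892 stand-in entries, one per verse in the training split
entries = list(range(892))
batches = list(iter_batches(entries, batch_size=32))

print("Number of batches:", len(batches))
print("Size of first batch:", len(batches[0]))
print("Size of last batch:", len(batches[-1]))
```

The last batch is smaller than the others because 892 is not a multiple of 32.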

## 1. Transformer models

### What can Transformers do?

Playing around with Hugging Face's out-of-the-box (OTB) models:

In [45]:
# imports
from transformers import pipeline

In [52]:
classifier = pipeline("sentiment-analysis")
classifier("The ocean is beautiful.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998816251754761}]

In [54]:
generator = pipeline("text-generation")
generator("Looking at the ocean in front of me, I felt")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Looking at the ocean in front of me, I felt like an airplane flying right.\n\nThe storm had descended very quickly, so it appeared this was about to fall at a later date.\n\nWe all started feeling a wave of panic:'}]

In [56]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="What is my hobby?",
    context="I work as an engineer, but in my free time I enjoy cooking."
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'cooking', 'end': 58, 'score': 0.428894579410553, 'start': 51}
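
The `start` and `end` values in the output appear to be character offsets into the context. A quick check in plain Python, with the values copied from the output above:

```python
context = "I work as an engineer, but in my free time I enjoy cooking."
result = {"answer": "cooking", "start": 51, "end": 58}

# Slice the context with the returned offsets (end is exclusive)
span = context[result["start"]:result["end"]]
print(span)
```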

### How do Transformers work?

Important concepts: 
- **self-supervised learning**: labels are automatically computed from the input
- **pretraining**: training a model from scratch on very large amounts of data
- **transfer learning**: fine-tuning a pretrained model in a supervised manner with a smaller dataset for a specific language task
- **encoder**: receives input and builds a representation of it (optimized for acquiring an understanding of the input)
- **decoder**: receives the encoder's representation plus other inputs in order to generate a target sequence (optimized for generating an output)
- **encoder-only models** (e.g., BERT, DistilBERT): for tasks that require understanding of the input, e.g., sentence classification, named entity recognition
- **decoder-only models** (e.g., GPT, GPT-2): for generative tasks, e.g., text generation
- **encoder-decoder / seq2seq models** (e.g., BART, Marian): for generative tasks that require an input, e.g., translation, summarization
- **attention layer**: tells the model to pay attention to specific words in the input
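
To make the self-supervised learning idea from the list above more concrete, here is a minimal sketch of how labels can be derived from the input itself for next-token prediction. The helper `next_token_pairs` and the example sentence are made up for illustration; real pretraining works on token IDs and far larger corpora.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs where each target is simply the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "ocean", "is", "beautiful"]
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(context, "->", target)
```

No human annotation is involved: every label is a token that was already part of the text.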


## 2. Using HuggingFace Transformers

### Simple pipeline

Tokenizer:
- splits the input into words, subwords, or symbols (= tokens), since a model cannot process raw text directly
- maps each token to an integer
- adds additional inputs necessary for the model
- tokenization needs to happen in exactly the same way as was done with the data used for pretraining a model
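
The steps above can be sketched with a toy tokenizer. Everything here is hypothetical (the tiny `VOCAB`, the IDs, and the `toy_encode` helper); a real tokenizer loads the vocabulary that was used during pretraining.

```python
# Hypothetical miniature vocabulary; real vocabularies have tens of thousands of entries
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "sea": 5, "is": 6, "blue": 7, ".": 8}


def toy_encode(text):
    # Step 1: split the text into tokens (lowercased words, periods split off)
    tokens = text.lower().replace(".", " .").split()
    # Step 2: add the special tokens the model expects
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map each token to an integer, falling back to [UNK]
    input_ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    # Additional model input: attend to every token (nothing is padded here)
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_encode("The sea is turquoise.")
print(encoded)
```

The word "turquoise" is not in the toy vocabulary, so it is mapped to the `[UNK]` ID.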

In [72]:
# imports
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
)

from torch.nn.functional import softmax

In [58]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [66]:
input = ["I am looking at the ocean. How beautiful!"]
tokenized_input = tokenizer(input, padding=True, truncation=True)
print(tokenized_input)

{'input_ids': [[101, 1045, 2572, 2559, 2012, 1996, 4153, 1012, 2129, 3376, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


In [67]:
tokenized_input["input_ids"] = torch.tensor(tokenized_input["input_ids"])
tokenized_input["attention_mask"] = torch.tensor(tokenized_input["attention_mask"])
print(tokenized_input)

{'input_ids': tensor([[ 101, 1045, 2572, 2559, 2012, 1996, 4153, 1012, 2129, 3376,  999,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


hidden states (output of the base Transformer, which serves as input to the model head):
- of shape (batch_size, sequence_length, hidden_size)
- batch_size: number of sequences per batch
- sequence_length: number of tokens in the numerical representation of each sequence
- hidden_size: dimension of the vector that represents each token (depends on the model)

In [69]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [70]:
output = model(**tokenized_input)
print(output.logits)
prediction = softmax(output.logits, dim=1)
print(prediction)

tensor([[-4.3032,  4.5966]], grad_fn=<AddmmBackward0>)
tensor([[1.3640e-04, 9.9986e-01]], grad_fn=<SoftmaxBackward0>)
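
For reference, the softmax step can be reproduced in plain Python with the (rounded) logits printed above. `softmax_list` is a hypothetical helper written for this sketch, not the PyTorch function.

```python
import math


def softmax_list(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-4.3032, 4.5966]  # copied from the output above
probs = softmax_list(logits)
print(probs)
```

Because the logits are rounded to four decimals, the result only matches the tensor above approximately.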


In [86]:
input_id = 0
class_prediction = int(torch.argmax(prediction[input_id]))
probability = prediction[input_id][class_prediction].item()
print(f"Prediction for sentiment of '{input[input_id]}':", 
      model.config.id2label[class_prediction],
      f"with {probability:.2%} probability"  # format the 0-1 probability as a percentage
      )

Prediction for sentiment of 'I am looking at the ocean. How beautiful!': POSITIVE with 99.99% probability


### Models

In [2]:
# imports
from transformers import (
    BertConfig,
    BertModel,
)

In [18]:
config = BertConfig()
print(config)
model = BertModel(config)  # randomly initialized
model = BertModel.from_pretrained("bert-base-cased")  # pretrained (https://huggingface.co/bert-base-cased)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.15.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
model.save_pretrained("/path/to/my_trained_model")

### Tokenizers

Used to transform language data into numerical data so that the model can process it. Some approaches are:

**Word-based tokenizer**
- split raw text into words and find numerical representation for them
- would need A LOT of different input IDs (one for each word in a language) 
- no means of showing relationships between words ("dog" and "dogs" would have different input IDs)
- need "unknown" token ("[UNK]") for words that are not in the vocabulary.

**Character-based tokenizer**
- raw text is split into characters
- fewer distinct input IDs are necessary but numerical sequences would be much longer with this approach

**Subword tokenizer**
- frequently used words remain as they are, less frequently used ones are split into meaningful subwords (e.g., "annoyingly" --> "annoying" + "ly")
- good tradeoff between small number of distinct input IDs and short sequences
- examples: WordPiece (BERT), BPE (GPT-2), and Unigram
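
As a minimal sketch of the subword idea, the function below splits a word greedily into the longest pieces found in a vocabulary, marking continuation pieces with the `##` prefix that BERT's WordPiece uses. The vocabulary and the `split_into_subwords` helper are hypothetical; real tokenizers learn their vocabulary from a large corpus.

```python
def split_into_subwords(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"annoying", "##ly", "sea", "blue"}  # hypothetical vocabulary
print(split_into_subwords("annoyingly", toy_vocab))
print(split_into_subwords("sea", toy_vocab))
print(split_into_subwords("xyz", toy_vocab))
```

A frequent word like "sea" stays whole, "annoyingly" is split into two known pieces, and a word with no matching pieces falls back to the unknown token.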

In [8]:
from transformers import (
    BertTokenizer,
    AutoTokenizer
)

In [9]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# both lines load the same tokenizer; AutoTokenizer is a wrapper that picks the right tokenizer class for any checkpoint

In [12]:
tokenizer("The sea is incredibly blue and glittering today.")

{'input_ids': [101, 1109, 2343, 1110, 12170, 2221, 1105, 22837, 2052, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

**encoding** = general process of converting text to numbers (tokenization followed by conversion to input IDs)

**tokenization** = splitting text into tokens (in the same way as was done for the pretrained model we want to use)

In [16]:
tokens = tokenizer.tokenize("The sea is incredibly blue and glittering today.")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['The', 'sea', 'is', 'incredibly', 'blue', 'and', 'glittering', 'today', '.']
[1109, 2343, 1110, 12170, 2221, 1105, 22837, 2052, 119]


**decoding** = converting numbers to text

In [17]:
text = tokenizer.decode(ids)
print(text)

The sea is incredibly blue and glittering today.


### Handling multiple sequences

In [19]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification
)

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [26]:
sequence = "Hmmm, I love green tea!"
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input = torch.tensor([ids]) # model expects a batch, not one single sample
print(input.shape)

output = model(input)
print(output.logits)

torch.Size([1, 8])
tensor([[-2.6319,  2.8690]], grad_fn=<AddmmBackward0>)


**padding** = making sure all sequences have the same length by adding a padding token

In [34]:
sequences = ["Hmmm, I love green tea!", "It's Thursday"]  # sequences are of different length!
try:
  inputs = tokenizer(sequences, return_tensors="pt")
  input = inputs["input_ids"]
  print(input.shape)
  output = model(input)
  print(output.logits)
except ValueError as error:
  print(error)


Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.


**attention mask** = tensor of exact same shape as input IDs; filled with 0s and 1s; 1 means a specific token is paid attention to in the attention layer, 0 means it is not paid attention to

In [32]:
input_ids = [[5, 5, 5], [5, 5, tokenizer.pad_token_id]]
attention_mask = [[1, 1, 1], [1, 1, 0]]

outputs = model(torch.tensor(input_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 0.8322, -0.7892],
        [ 0.3235, -0.2539]], grad_fn=<AddmmBackward0>)
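
Padding and the attention mask go hand in hand. A small sketch of building both by hand, assuming the padding ID is 0 (`pad_batch` is a hypothetical helper; the tokenizer normally does this when called with `padding=True`):

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position to be ignored
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded_ids, mask


padded_ids, mask = pad_batch([[5, 5, 5], [5, 5]])
print(padded_ids)
print(mask)
```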


**truncation** = limiting the length of a sequence
(necessary because most models can only handle a limited number of tokens per sequence, typically 512 or 1024; however, there are also models (e.g. Longformer) that can handle longer sequences)

In [42]:
inputs = tokenizer(
    sequences, 
    padding=True, 
    truncation=True,
    return_tensors="pt"  # or "np" for numpy arrays
    )
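
Conceptually, truncation cuts the token list so that the sequence, including its special tokens, fits into the maximum length. A rough sketch follows; `truncate_with_specials` is a hypothetical helper, the token IDs are taken from the earlier DistilBERT example, and 101 and 102 are the IDs that begin and end the encoded sequences shown in this notebook.

```python
def truncate_with_specials(token_ids, max_length, cls_id=101, sep_id=102):
    """Keep only as many tokens as fit between the two special tokens."""
    kept = token_ids[:max_length - 2]  # reserve two positions for the special tokens
    return [cls_id] + kept + [sep_id]


# Token IDs of "I am looking at the ocean. How beautiful!" without special tokens
token_ids = [1045, 2572, 2559, 2012, 1996, 4153, 1012, 2129, 3376, 999]
truncated = truncate_with_specials(token_ids, max_length=6)
print(truncated)
```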

**special tokens** = added to the inputs, for example [CLS] (= beginning of a sequence) and [SEP] (= separator that marks the end of a sequence)

In [45]:
print(inputs["input_ids"][0])
print(tokenizer.decode(inputs["input_ids"][0]))

tensor([  101, 17012,  2213,  1010,  1045,  2293,  2665,  5572,   999,   102])
[CLS] hmmm, i love green tea! [SEP]


## 3. Fine-tuning a pretrained model

### Processing the data

In [70]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

In [55]:
ds = load_dataset("glue", "mrpc", split="train")
print("Columns:", ds.column_names)
print("Number of samples:", len(ds))
print("Classes:", ds.features["label"].names)
print("Example:", ds[0])

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Columns: ['sentence1', 'sentence2', 'label', 'idx']
Number of samples: 3668
Classes: ['not_equivalent', 'equivalent']
Example: {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}


In [None]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

**token type ids** = in this example, the tensor tells the model which input ids belong to the first sentence and which belong to the second sentence.

In [61]:
inputs = tokenizer(ds["sentence1"][10], ds["sentence2"][10])
print(inputs["input_ids"])
print(inputs["token_type_ids"])

[101, 6094, 2437, 2009, 6211, 2005, 10390, 2000, 22505, 2037, 13930, 1999, 10528, 2457, 2180, 10827, 2160, 6226, 1999, 2233, 1012, 102, 6094, 2437, 2009, 6211, 2005, 10390, 2000, 22505, 2037, 13930, 1999, 10528, 2457, 2180, 26203, 1010, 2160, 6226, 1999, 2233, 1998, 2001, 11763, 2011, 1996, 2317, 2160, 1012, 102]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
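
The pattern in the output above (zeros up to and including the first separator, ones afterwards) can be reproduced with a short sketch. The `encode_pair` helper and the short ID lists are made up for illustration.

```python
def encode_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Build [CLS] A [SEP] B [SEP] together with the matching token type ids."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


pair_ids, pair_types = encode_pair([7, 8, 9], [10, 11])  # hypothetical token IDs
print(pair_ids)
print(pair_types)
```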


In [None]:
tokenized_ds = tokenizer(
    ds["sentence1"],
    ds["sentence2"],
    padding=True,
    truncation=True
)  # tokenizes the whole split at once: requires a lot of RAM and returns a dictionary instead of a Dataset

In [62]:
def tokenize_func(dataset):  # use with Dataset.map() method; receives a batch of examples when batched=True
  return tokenizer(dataset["sentence1"], dataset["sentence2"], truncation=True)
  # no padding here: padding the whole dataset to the same length would be wasteful
  # instead, padding is applied to each batch later on

In [69]:
ds = load_dataset("glue", "mrpc")
print(ds)
tokenized_ds = ds.map(tokenize_func, batched=True)
print(tokenized_ds)

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-b8e53849ba067d19.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-04e9d6999f15f318.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-f66963ee13090126.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})


**dynamic padding** = apply padding in the collate function that the DataLoader uses to build each batch, so every batch is padded only to the length of its own longest sequence

In [76]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [79]:
samples = tokenized_ds["train"][:6]
print([len(sample) for sample in samples["input_ids"]]) # sequences are still of different lengths

[50, 59, 47, 67, 59, 50]


In [81]:
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)  # pads every sample to the longest sequence in this batch (67 here), not in the whole dataset
print({k: v.shape for k, v in batch.items()})  

{'attention_mask': torch.Size([6, 67]), 'input_ids': torch.Size([6, 67]), 'token_type_ids': torch.Size([6, 67]), 'labels': torch.Size([6])}
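
To get a feeling for what dynamic padding saves, the sketch below counts the padding tokens needed for the six sample lengths printed above, once when padding to the longest sequence in the batch and once when padding to a fixed length of 512. The fixed length is an assumption for comparison (it matches `max_position_embeddings` in the BERT config shown earlier), and `padding_needed` is a hypothetical helper.

```python
def padding_needed(lengths, target_length):
    """Count how many padding tokens are required to bring all sequences to target_length."""
    return sum(target_length - length for length in lengths)


lengths = [50, 59, 47, 67, 59, 50]  # sequence lengths of the six samples above

dynamic = padding_needed(lengths, max(lengths))  # pad to the longest sequence in the batch
fixed = padding_needed(lengths, 512)             # pad everything to a fixed maximum length

print("Padding tokens with dynamic padding:", dynamic)
print("Padding tokens with fixed-length padding:", fixed)
```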


### Fine-tuning a model with the Trainer API

In [83]:
from transformers import TrainingArguments