<a href="https://colab.research.google.com/github/HeleneFabia/nlp-exploration/blob/main/nlp_with_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face NLP Course



In [None]:
# install libraries
!pip install datasets
!pip install transformers

## 1. Transformer models

### What can Transformers do?

Playing around with Hugging Face's out-of-the-box models:

In [45]:
# imports
from transformers import pipeline

In [52]:
classifier = pipeline("sentiment-analysis")
classifier("The ocean is beautiful.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998816251754761}]

In [54]:
generator = pipeline("text-generation")
generator("Looking at the ocean in front of me, I felt")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Looking at the ocean in front of me, I felt like an airplane flying right.\n\nThe storm had descended very quickly, so it appeared this was about to fall at a later date.\n\nWe all started feeling a wave of panic:'}]

In [56]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="What is my hobby?",
    context="I work as an engineer, but in my free time I enjoy cooking."
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'cooking', 'end': 58, 'score': 0.428894579410553, 'start': 51}

### How do Transformers work?

Important concepts: 
- **self-supervised learning**: labels are automatically computed from the input
- **pretraining**: training a model from scratch on very large amounts of data
- **transfer learning**: fine-tuning a pretrained model in a supervised manner with a smaller dataset for a specific language task
- **encoder**: receives input and builds a representation of it (optimized for acquiring an understanding of the input)
- **decoder**: uses the encoder's representation plus other inputs to generate a target sequence (optimized for generating an output)
- **encoder-only models** (e.g., BERT, DistilBERT): for tasks that require understanding of the input, e.g., sentence classification, named entity recognition
- **decoder-only models** (e.g., GPT, GPT-2): for generative tasks, e.g., text generation
- **encoder-decoder / seq2seq models** (e.g., BART, Marian): for generative tasks that require an input, e.g., translation, summarization
- **attention layer**: tells the model to pay attention to specific words in the input


## 2. Using HuggingFace Transformers

### Simple pipeline

Tokenizer:
- splits the input into words/subwords/symbols (=tokens), since a model cannot process raw text directly
- maps each token to an integer
- adds additional inputs necessary for the model
- tokenization needs to happen in exactly the same way as was done with the data used for pretraining a model

In [72]:
# imports
import torch  # needed below for torch.tensor and torch.argmax
from torch.nn.functional import softmax

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
)

In [58]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [66]:
input = ["I am looking at the ocean. How beautiful!"]
tokenized_input = tokenizer(input, padding=True, truncation=True)
print(tokenized_input)

{'input_ids': [[101, 1045, 2572, 2559, 2012, 1996, 4153, 1012, 2129, 3376, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


In [67]:
tokenized_input["input_ids"] = torch.tensor(tokenized_input["input_ids"])
tokenized_input["attention_mask"] = torch.tensor(tokenized_input["attention_mask"])
print(tokenized_input)

{'input_ids': tensor([[ 101, 1045, 2572, 2559, 2012, 1996, 4153, 1012, 2129, 3376,  999,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


hidden states (output of the base Transformer model, input to the model head):
- of shape (batch_size, sequence_length, hidden_size)
- batch_size: number of sequences per batch
- sequence_length: number of tokens in the numerical representation of each sequence
- hidden_size: vector dimension of each token's representation (depends on the model)

In [69]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [70]:
output = model(**tokenized_input)
print(output.logits)
prediction = softmax(output.logits, dim=1)
print(prediction)

tensor([[-4.3032,  4.5966]], grad_fn=<AddmmBackward0>)
tensor([[1.3640e-04, 9.9986e-01]], grad_fn=<SoftmaxBackward0>)
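
As a rough illustration of what `softmax` does to the logits above, here is a minimal pure-Python sketch. The helper name `softmax_1d` is hypothetical (it is not part of PyTorch), and the logits are simply copied from the printed output.

```python
import math

def softmax_1d(logits):
    """Turn a list of raw scores (logits) into probabilities that sum to 1."""
    # subtracting the maximum keeps exp() numerically stable
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# logits copied from the model output printed above
probs = softmax_1d([-4.3032, 4.5966])
print(probs)
```

The second entry should come out close to the 0.9999 reported by the model for the POSITIVE class.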


In [86]:
input_id = 0
class_prediction = int(torch.argmax(prediction))
print(f"Prediction for sentiment of '{input[input_id]}':", 
      model.config.id2label[class_prediction],
      f"with probability {prediction.tolist()[input_id][class_prediction]:.4f}"  # a probability in [0, 1], not a percentage
      )

Prediction for sentiment of 'I am looking at the ocean. How beautiful!': POSITIVE with probability 0.9999


### Models

In [2]:
# imports
from transformers import (
    BertConfig,
    BertModel,
)

In [18]:
config = BertConfig()
print(config)
model = BertModel(config)  # randomly initialized
model = BertModel.from_pretrained("bert-base-cased")  # pretrained (https://huggingface.co/bert-base-cased)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.15.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
model.save_pretrained("/path/to/my_trained_model")

### Tokenizers

Used to transform language data into numerical data so that the model can process it. Some approaches are:

**Word-based tokenizer**
- split raw text into words and find numerical representation for them
- would need A LOT of different input IDs (one for each word in a language) 
- no means of showing relationships between words ("dog" and "dogs" would have different input IDs)
- need "unknown" token ("[UNK]") for words that are not in the vocabulary.

**Character-based tokenizer**
- raw text is split into characters
- fewer distinct input IDs are necessary but numerical sequences would be much longer with this approach

**Subword tokenizer**
- frequently used words remain as they are, less frequently used ones are split into meaningful subwords (e.g., "annoyingly" --> "annoying" + "ly")
- good tradeoff between small number of distinct input IDs and short sequences
- examples: WordPiece (BERT), BPE (GPT-2), and Unigram
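
The trade-off between the word-based and character-based approaches can be sketched with two toy vocabularies. Everything here (the vocabularies, the example text, the variable names) is made up for illustration and has nothing to do with a real pretrained tokenizer.

```python
text = "dogs love dogs"

# word-based: one ID per distinct word; unknown words fall back to [UNK]
word_vocab = {"[UNK]": 0, "dogs": 1, "love": 2}
word_ids = [word_vocab.get(w, word_vocab["[UNK]"]) for w in text.split()]
unk_ids = [word_vocab.get(w, word_vocab["[UNK]"]) for w in "dog".split()]

# character-based: tiny vocabulary, but a much longer sequence
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

print(len(word_ids), len(char_ids))  # sequence lengths for the same text
print(unk_ids)                       # "dog" is unknown although "dogs" is in the vocabulary
```

A subword tokenizer sits between the two: it would keep a frequent word like "dogs" whole or split it into pieces such as "dog" + "s", so related words share IDs without the sequence growing to character length.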

In [8]:
from transformers import (
    BertTokenizer,
    AutoTokenizer
)

In [9]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# essentially the same, but AutoTokenizer is a wrapper that picks the right tokenizer class for any checkpoint

In [12]:
tokenizer("The sea is incredibly blue and glittering today.")

{'input_ids': [101, 1109, 2343, 1110, 12170, 2221, 1105, 22837, 2052, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

**encoding** = general process of converting text to numbers

**tokenization** = splitting text into tokens (in the same way as was done for the pretrained model we want to use)

In [16]:
tokens = tokenizer.tokenize("The sea is incredibly blue and glittering today.")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['The', 'sea', 'is', 'incredibly', 'blue', 'and', 'glittering', 'today', '.']
[1109, 2343, 1110, 12170, 2221, 1105, 22837, 2052, 119]


**decoding** = converting numbers to text

In [17]:
text = tokenizer.decode(ids)
print(text)

The sea is incredibly blue and glittering today.


### Handling multiple sequences

In [19]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification
)

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [26]:
sequence = "Hmmm, I love green tea!"
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input = torch.tensor([ids]) # model expects a batch, not one single sample
print(input.shape)

output = model(input)
print(output.logits)

torch.Size([1, 8])
tensor([[-2.6319,  2.8690]], grad_fn=<AddmmBackward0>)


**padding** = making sure all sequences have the same length by adding a padding token

In [34]:
sequences = ["Hmmm, I love green tea!", "It's Thursday"]  # sequences are of different length!
try:
  inputs = tokenizer(sequences, return_tensors="pt")
  input = inputs["input_ids"]
  print(input.shape)
  output = model(input)
  print(output.logits)
except ValueError as error:
  print(error)


Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.


**attention mask** = tensor of exact same shape as input IDs; filled with 0s and 1s; 1 means a specific token is paid attention to in the attention layer, 0 means it is not paid attention to

In [32]:
input_ids = [[5, 5, 5], [5, 5, tokenizer.pad_token_id]]
attention_mask = [[1, 1, 1], [1, 1, 0]]

outputs = model(torch.tensor(input_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 0.8322, -0.7892],
        [ 0.3235, -0.2539]], grad_fn=<AddmmBackward0>)
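
To make the link between padding and the attention mask explicit, here is a minimal sketch that builds both by hand. The function name `pad_batch` is hypothetical, and `pad_token_id=0` is an assumption (it matches the `pad_token_id` shown in the BERT config earlier, but other tokenizers may use a different ID).

```python
def pad_batch(batch_ids, pad_token_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 = real token, 0 = padding that the attention layers should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

padded_ids, mask = pad_batch([[5, 5, 5], [5, 5]])
print(padded_ids)
print(mask)
```

In practice the tokenizer returns both lists when called with `padding=True`, so this helper is only meant to show what happens under the hood.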


**truncation** = limiting the length of a sequence
(necessary because most models can only handle sequences of up to 512 or 1024 tokens; some models, e.g. Longformer, can handle longer sequences)

In [42]:
inputs = tokenizer(
    sequences, 
    padding=True, 
    truncation=True,
    return_tensors="pt"  # or "np" for numpy arrays
    )

**special tokens** = added to the inputs, for example [CLS] (= beginning of a sequence) and [SEP] (= end of a sequence)

In [45]:
print(inputs["input_ids"][0])
print(tokenizer.decode(inputs["input_ids"][0]))

tensor([  101, 17012,  2213,  1010,  1045,  2293,  2665,  5572,   999,   102])
[CLS] hmmm, i love green tea! [SEP]


## 3. Fine-tuning a pretrained model

### Processing the data

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

In [39]:
ds = load_dataset("glue", "mrpc", split="train")
print("Columns:", ds.column_names)
print("Number of samples:", len(ds))
print("Classes:", ds.features["label"].names)
print("Example:", ds[0])

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Columns: ['sentence1', 'sentence2', 'label', 'idx']
Number of samples: 3668
Classes: ['not_equivalent', 'equivalent']
Example: {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}


In [None]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

**token type ids** = in this example, the tensor tells the model which input ids belong to the first sentence and which belong to the second sentence.

In [93]:
inputs = tokenizer(ds["sentence1"][10], ds["sentence2"][10])
print(inputs["input_ids"])
print(inputs["token_type_ids"])

[101, 6094, 2437, 2009, 6211, 2005, 10390, 2000, 22505, 2037, 13930, 1999, 10528, 2457, 2180, 10827, 2160, 6226, 1999, 2233, 1012, 102, 6094, 2437, 2009, 6211, 2005, 10390, 2000, 22505, 2037, 13930, 1999, 10528, 2457, 2180, 26203, 1010, 2160, 6226, 1999, 2233, 1998, 2001, 11763, 2011, 1996, 2317, 2160, 1012, 102]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
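
The pattern in the output above (0s up to and including the first `[SEP]`, then 1s) can be reproduced by hand. This is a sketch for BERT-style sentence pairs only: the function name `build_pair` and the short token ID lists are hypothetical, while 101 and 102 are the `[CLS]` and `[SEP]` IDs visible in the printed `input_ids`.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble [CLS] A [SEP] B [SEP] and the matching token type IDs."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and the final [SEP]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

pair_ids, pair_types = build_pair([7, 8, 9], [10, 11])
print(pair_ids)
print(pair_types)
```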


In [94]:
tokenized_ds = tokenizer(
    ds["sentence1"],
    ds["sentence2"],  # second element of each pair
    padding=True,
    truncation=True
)  # will require a lot of RAM; will return the dataset as a dictionary

In [6]:
def tokenize_func(dataset):  # use with Dataset.map() method
  return tokenizer(dataset["sentence1"], dataset["sentence2"], truncation=True)
  # no padding, since whole dataset would be padded to the same length (unnecessary)
  # instead, padding is applied to each batch 

In [7]:
ds = load_dataset("glue", "mrpc")
print(ds)
tokenized_ds = ds.map(tokenize_func, batched=True)
print(tokenized_ds)

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})


  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})


**dynamic padding** = applying padding per batch in the collate function that the DataLoader uses to build batches, so that each batch is only padded to its own longest sequence

In [8]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [98]:
samples = tokenized_ds["train"][:6]
print([len(sample) for sample in samples["input_ids"]]) # sequences are still of different lengths

[50, 59, 47, 67, 59, 50]


In [99]:
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)  # padded to the longest sequence in this batch, not in the whole dataset
print({k: v.shape for k, v in batch.items()})  

{'attention_mask': torch.Size([6, 67]), 'input_ids': torch.Size([6, 67]), 'token_type_ids': torch.Size([6, 67]), 'labels': torch.Size([6])}
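
A quick back-of-the-envelope calculation shows why padding per batch is worthwhile. The lengths are the six printed above; the helper name `padding_tokens` is hypothetical, and 512 is only an assumed fixed maximum length used for comparison.

```python
lengths = [50, 59, 47, 67, 59, 50]  # sequence lengths printed above

def padding_tokens(lengths, target_len):
    """Number of padding tokens needed to bring every sequence to target_len."""
    return sum(target_len - n for n in lengths)

dynamic = padding_tokens(lengths, max(lengths))  # pad to the longest sequence in the batch
fixed = padding_tokens(lengths, 512)             # pad to an assumed fixed length of 512
print(dynamic, fixed)
```

Every padding token still has to be pushed through the model, so the smaller number translates directly into less wasted computation.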


### Fine-tuning a model with the Trainer API

In [9]:
import numpy as np

from transformers import (
    TrainingArguments,
    AutoModelForSequenceClassification,
    Trainer
    )
from datasets import load_metric

In [12]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

In [13]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [14]:
def compute_metrics(valid_preds):
  metric = load_metric("glue", "mrpc")  # no trailing comma, otherwise metric becomes a tuple
  logits, labels = valid_preds
  preds_cls = np.argmax(logits, axis=-1)
  return metric.compute(predictions=preds_cls, references=labels)
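
For MRPC, the GLUE metric reports accuracy and F1. As a sketch of what those two numbers mean, the following pure-Python function computes them from class IDs; the name `accuracy_and_f1` and the toy predictions are hypothetical, and this is not the library's implementation.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 from two equally long lists of class IDs."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

scores = accuracy_and_f1([1, 0, 1, 1], [1, 0, 0, 1])
print(scores)
```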

In [15]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

### A full training

In [34]:
from tqdm.auto import tqdm

import torch
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForSequenceClassification,
    AdamW,
    get_scheduler
)
from datasets import load_metric
# from accelerate import Accelerator

In [16]:
tokenized_ds = tokenized_ds.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_ds = tokenized_ds.rename_column("label", "labels")
tokenized_ds.set_format("torch")

In [19]:
train_dl = DataLoader(
    tokenized_ds["train"], 
    shuffle=True, 
    batch_size=8,
    collate_fn=data_collator
    )

valid_dl = DataLoader(
    tokenized_ds["validation"], 
    batch_size=8, 
    collate_fn=data_collator
    )

In [22]:
for batch in train_dl:  # padding is applied to max length of respective batch
  print({k: v.shape for k, v in batch.items()})
  break

{'attention_mask': torch.Size([8, 72]), 'input_ids': torch.Size([8, 72]), 'labels': torch.Size([8]), 'token_type_ids': torch.Size([8, 72])}


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [27]:
optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
# accelerator = Accelerator()

In [30]:
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
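
The "linear" schedule decays the learning rate from its initial value to zero over the training steps, after an optional linear warmup. A minimal sketch of that rule follows; the function name `linear_lr` is hypothetical, and the default of 1377 steps is an assumption derived from 3 epochs times 459 batches (3668 training samples at batch size 8).

```python
def linear_lr(step, base_lr=5e-5, num_warmup_steps=0, num_training_steps=1377):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

for step in (0, 459, 918, 1377):  # start, end of epoch 1, end of epoch 2, end of training
    print(step, linear_lr(step))
```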

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# train_dl, valid_dl, model, optimizer = accelerator.prepare(
#    train_dl, 
#    valid_dl,
#    model,
#    optimizer
#)  # --> handles device placement so putting model and data on device during training is unnecessary when working with Accelerator

In [None]:
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
  for batch in train_dl:
    batch = {k: v.to(device) for k, v in batch.items()}
  
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    # accelerator.backward(loss)
    
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    
    progress_bar.update(1)

In [None]:
metric = load_metric("glue", "mrpc")
model.eval()
for batch in valid_dl:
  batch = {k: v.to(device) for k, v in batch.items()}
  with torch.no_grad():
    outputs = model(**batch)
  logits = outputs.logits
  predictions = torch.argmax(logits, dim=-1)
  metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()

## 4. Datasets library

In [27]:
from datasets import (
    load_dataset_builder, 
    load_dataset,
)
from transformers import (
    BertTokenizerFast,
    AutoTokenizer
)

import torch
from torch.utils.data import DataLoader

In [None]:
dataset_builder = load_dataset_builder(path="poem_sentiment")

train_dataset = load_dataset(path="poem_sentiment", split="train")
# valid_dataset = load_dataset(path="poem_sentiment", split="validation")
# test_dataset = load_dataset(path="poem_sentiment", split="test")

In [7]:
print("Description:", train_dataset.description)
print("Num data entries:", len(train_dataset))
print("Column names:", train_dataset.column_names)
print("Classes:", train_dataset.features["label"].names)
print("Example data entry:", train_dataset[0])

Description: Poem Sentiment is a sentiment dataset of poem verses from Project Gutenberg. This dataset can be used for tasks such as sentiment classification or style transfer for poems.

Num data entries: 892
Column names: ['id', 'verse_text', 'label']
Classes: ['negative', 'positive', 'no_impact', 'mixed']
Example data entry: {'id': 0, 'verse_text': 'with pale blue berries. in these peaceful shades--', 'label': 1}


In [32]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
train_enc_ds = train_dataset.map(lambda examples: 
                                          tokenizer(
                                              examples["verse_text"], 
                                              truncation=True,
                                              padding="max_length",
                                          ),
                                 batched=True
                                 )

In [40]:
print("Column names of encoded dataset:", train_enc_ds.column_names)
print("Tokenized data entry:", train_enc_ds[0])

Column names of encoded dataset: ['attention_mask', 'id', 'input_ids', 'label', 'token_type_ids', 'verse_text']
Tokenized data entry: {'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...

In [44]:
# use dataset with pytorch
train_enc_ds.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])

# create pytorch data loader
train_dl = DataLoader(train_enc_ds, batch_size=32)
next(iter(train_dl))

{'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'input_ids': tensor([[  101,  1114,  4554,  ...,     0,     0,     0],
         [  101,  1122,  5611,  ...,     0,     0,     0],
         [  101,  1105,  1115,  ...,     0,     0,     0],
         ...,
         [  101,  1106,  1115,  ...,     0,     0,     0],
         [  101,   192,  2386,  ...,     0,     0,     0],
         [  101,  1123, 15219,  ...,     0,     0,     0]]),
 'label': tensor([1, 2, 0, 3, 3, 3, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 1, 2, 2, 1,
         2, 2, 2, 2, 2, 2, 2, 2]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]])}

## 5. Tokenizers library

When training a model from scratch, it also makes sense to train the tokenizer from scratch. (Training a model from scratch makes sense if no model is available in a particular language or if your corpus is very different from the ones other models were trained on.) So whenever you want to pretrain a model and your dataset differs from the one used by an existing pretrained model, you should train a new tokenizer.

### Training a new tokenizer from an old one

A tokenizer needs to be trained on a corpus in order to identify the subwords that are of interest and occur most frequently in that corpus. Training a tokenizer is unlike training a model: it is a statistical process whose exact rules depend on the tokenization algorithm. Unlike model training, which involves randomness, it is deterministic: the same algorithm on the same corpus always gives the same result.

In [52]:
from datasets import load_dataset
from transformers import AutoTokenizer

In [None]:
raw_ds = load_dataset("code_search_net", "python")

In [43]:
print(raw_ds["train"])
print(raw_ds["train"][1]["whole_func_string"])

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})
def close(self):
        '''
        Cleanly shutdown the router socket
        '''
        if self._closing:
            return
        log.info('MWorkerQueue under PID %s is closing', os.getpid())
        self._closing = True
        # pylint: disable=E0203
        if getattr(self, '_monitor', None) is not None:
            self._monitor.stop()
            self._monitor = None
        if getattr(self, '_w_monitor', None) is not None:
            self._w_monitor.stop()
            self._w_monitor = None
        if hasattr(self, 'clients') and self.clients.closed is False:
            self.clients.close()
        if hasattr(self, 'workers') and self.workers.closed is False:
            self.workers.close()
        if ha

It makes sense to convert the dataset into an iterator over batches of texts (e.g., a list of lists of texts), so that the tokenizer can train on batches of texts. Using a generator instead of a list additionally means that the whole dataset does not need to be loaded into memory at once.

In [48]:
# training_corpus = [
#                   raw_ds["train"][i: i + 1000]["whole_func_string"] 
#                   for i in range(0, len(raw_ds["train"]), 1000)
#                   ]
# creates a list of lists of 1000 texts each and load it into memory

# use parentheses instead of brackets to create a Python generator that does not
# load anything into memory until it's necessary.
# nothing is loaded into memory but object is created that can be used in Python
# for loop - however, the object can only be used once

training_corpus = (
    raw_ds["train"][i: i + 1000]["whole_func_string"] 
    for i in range(0, len(raw_ds["train"]), 1000)
)

In [50]:
gen1 = (i for i in range(10))
gen2 = [i for i in range(10)]
print(gen1)
print(gen2)

<generator object <genexpr> at 0x7f8cf2e11b50>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
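
The remark that a generator "can only be used once" is easy to check, and it is the reason for wrapping the generator expression in a function that returns a fresh one on every call. The names `gen` and `make_gen` are made up for this small demonstration.

```python
gen = (i * i for i in range(3))
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left to iterate over

def make_gen():
    """Return a fresh generator each time it is called."""
    return (i * i for i in range(3))

fresh_pass = list(make_gen())
print(first_pass, second_pass, fresh_pass)
```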


In [51]:
def get_training_corpus():
  return (
      raw_ds["train"][i: i + 1000]["whole_func_string"]
      for i in range(0, len(raw_ds["train"]), 1000)
  )

# ALTERNATIVE: use yield statement to create generator
# def get_training_corpus():
#  for start_idx in range(0, len(raw_ds["train"]), 1000):
#    yield raw_ds["train"][start_idx: start_idx + 1000]["whole_func_string"] 

training_corpus = get_training_corpus()

In [None]:
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
# not starting *entirely* from scratch but using tokenization algorithm of GPT-2

In [55]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

print(old_tokenizer.tokenize(example))

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


In [57]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, vocab_size=52000)

In [59]:
print(tokenizer.tokenize(example))

['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


### Normalization and pre-tokenization

Before splitting a text into subtokens, a tokenizer performs normalization and pre-tokenization.

**normalization** = general cleaning, e.g., removing unnecessary whitespace, lowercasing, removing accents

In [61]:
from transformers import AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [71]:
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Hellö, how ARE YOU tòday?"))

hello, how are you today?


**pre-tokenization** = splitting the text into smaller entities, i.e., words (e.g., at whitespace or punctuation), which are later split into tokens.

In [72]:
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("hello, how are you today?"))

[('hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (15, 18)), ('today', (19, 24)), ('?', (24, 25))]


### Byte-Pair Encoding (BPE) tokenization

- used by the tokenizers of GPT models (e.g., GPT-2)

Training of tokenizer:
- first, normalization and pre-tokenization are applied
- computes unique set of words and builds vocabulary by taking all the symbols used to write those words
  - corpus: `["hug", "pug", "pun", "bun", "hugs"]`
  - initial vocabulary: `["b", "g", "h", "n", "p", "s", "u"]` 
- vocabulary is expanded by learning merges, i.e., merging two elements of the existing vocabulary, so that longer and longer subwords are added to the vocabulary
  - this is done by looking for the most frequent "pair", i.e., the two adjacent elements of the existing vocabulary that appear together most often in the corpus. The most frequent pair is merged and added to the vocabulary. Then the process is repeated.
  - vocabulary during merging: `["b", "g", "h", "n", "p", "s", "u", "ug"]` --> `["b", "g", "h", "n", "p", "s", "u", "ug", "un"]` --> `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` --> ...

Tokenization with trained tokenizer:
- normalization
- pre-tokenization
- splitting words into individual characters
- applying the learned merge rules, in order, to those splits
- example: `"unhug"` will be tokenized as `["un", "hug"]` and `"bug"` as `["b", "ug"]`, whereas `"mug"` becomes `["[UNK]", "ug"]` (since "m" was not in the corpus on which the tokenizer was trained)
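
The training and tokenization steps above can be sketched in a few lines of plain Python. This is a toy version, not the `tokenizers` implementation: the word frequencies are hypothetical (the corpus above lists only the words themselves, and the merge order depends on the counts), and the function names `train_bpe`, `merge_pair` and `bpe_tokenize` are made up for this example.

```python
from collections import Counter

# hypothetical word frequencies for the toy corpus
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        splits = {word: merge_pair(symbols, best) for word, symbols in splits.items()}
    return merges

def bpe_tokenize(word, merges, base_vocab, unk_token="[UNK]"):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = [ch if ch in base_vocab else unk_token for ch in word]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

base_vocab = sorted({ch for word in word_freqs for ch in word})
merges = train_bpe(word_freqs, num_merges=3)
print(base_vocab)
print(merges)
print(bpe_tokenize("unhug", merges, base_vocab))
print(bpe_tokenize("mug", merges, base_vocab))
```

With these assumed frequencies the learned merges follow the same order as the vocabulary growth listed above ("ug", then "un", then "hug").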

### WordPiece tokenization

- starts, like BPE, from a base vocabulary and then learns merge rules
- unlike BPE, does not select the most frequent pair but computes a score for each pair (the pair's frequency divided by the product of the frequencies of its two parts)
- unlike BPE, does not save the merge rules learned during training but only the final vocabulary

### Unigram tokenization

- starts from a big vocabulary and removes tokens from it until it reaches a desired vocabulary size
- computes a loss over the corpus given the current vocabulary and iteratively removes the 10-20% of tokens whose removal would increase the loss the least



### Building a tokenizer, block by block

Tokenization consists of 
- normalization
- pre-tokenization
- tokenizer model (like BPE, WordPiece, ...)
- post-processing (adding special tokens, generating attention mask and token type IDs)

In [40]:
from datasets import load_dataset
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer
)
from transformers import (
    PreTrainedTokenizerFast,
    BertTokenizerFast
)

In [None]:
ds = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

In [10]:
def get_training_corpus():
  for i in range(0, len(ds), 1000):
    yield ds[i: i + 1000]["text"]

In [11]:
# model block
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

In [14]:
# normalizer block 
tokenizer.normalizer = normalizers.Sequence(
    [
     normalizers.NFD(),  # NFD Unicode normalizer
     normalizers.Lowercase(),
     normalizers.StripAccents()
    ]
)
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?


In [16]:
# pre-tokenizer block
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # splits on whitespace and punctuation (the underscore counts as a word character)
print(tokenizer.pre_tokenizer.pre_tokenize_str("Today's such a beautiful and sunny day."))

[('Today', (0, 5)), ("'", (5, 6)), ('s', (6, 7)), ('such', (8, 12)), ('a', (13, 14)), ('beautiful', (15, 24)), ('and', (25, 28)), ('sunny', (29, 34)), ('day', (35, 38)), ('.', (38, 39))]


In [20]:
# train tokenizer
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

In [29]:
encoding = tokenizer.encode("Let's test the tokenizer, shall we?")
print("Tokens:", encoding.tokens)
print("Input IDs:", encoding.ids)
print("Offsets:", encoding.offsets)
print("Attention mask:", encoding.attention_mask)
print("Special token mask:", encoding.special_tokens_mask)
print("Overflowing:", encoding.overflowing)

Tokens: ['let', "'", 's', 'test', 'the', 'tok', '##eni', '##zer', ',', 'shall', 'we', '?']
Input IDs: [3019, 11, 61, 3611, 1333, 24319, 18903, 6614, 16, 11448, 1626, 35]
Offsets: [(0, 3), (3, 4), (4, 5), (6, 10), (11, 14), (15, 18), (18, 21), (21, 24), (24, 25), (26, 31), (32, 34), (34, 35)]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Special token mask: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Overflowing: []


In [37]:
# post-processing
cls_token_id = tokenizer.token_to_id("[CLS]") # to add at beginning of sequence
sep_token_id = tokenizer.token_to_id("[SEP]") # to add at end of each sentence in sequence
print("[CLS]", cls_token_id, "[SEP]", sep_token_id)

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",  # template for a single sentence ($A); the number after ":" is the token type ID
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",  # template for a sentence pair ($A, $B); $B and its [SEP] get token type ID 1
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)]
)

[CLS] 2 [SEP] 3


In [38]:
encoding = tokenizer.encode("Let's test the tokenizer, shall we?")
print("Tokens:", encoding.tokens)
print("Input IDs:", encoding.ids)
print("Offsets:", encoding.offsets)
print("Attention mask:", encoding.attention_mask)
print("Special token mask:", encoding.special_tokens_mask)
print("Overflowing:", encoding.overflowing)

Tokens: ['[CLS]', 'let', "'", 's', 'test', 'the', 'tok', '##eni', '##zer', ',', 'shall', 'we', '?', '[SEP]']
Input IDs: [2, 3019, 11, 61, 3611, 1333, 24319, 18903, 6614, 16, 11448, 1626, 35, 3]
Offsets: [(0, 0), (0, 3), (3, 4), (4, 5), (6, 10), (11, 14), (15, 18), (18, 21), (21, 24), (24, 25), (26, 31), (32, 34), (34, 35), (0, 0)]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Special token mask: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Overflowing: []
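
To see what the template does, the following plain-Python sketch mimics its effect on already-tokenized input. It is only an illustration of the BERT-style layout configured above, not how `TemplateProcessing` is implemented; the function name and the second sentence are made up for the example.

```python
def apply_bert_template(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] and build token type IDs and a special-tokens mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)  # first segment, including its special tokens
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)  # second segment and its [SEP]
    special_mask = [1 if t in ("[CLS]", "[SEP]") else 0 for t in tokens]
    return tokens, type_ids, special_mask

tokens, type_ids, special_mask = apply_bert_template(
    ["let", "'", "s", "test"], ["shall", "we", "?"]
)
print(tokens)        # ['[CLS]', 'let', "'", 's', 'test', '[SEP]', 'shall', 'we', '?', '[SEP]']
print(type_ids)      # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(special_mask)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
```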


In [39]:
# decoder
tokenizer.decoder = decoders.WordPiece(prefix="##")

tokenizer.decode(encoding.ids)

"let's test the tokenizer, shall we?"

In [None]:
# save tokenizer
tokenizer.save("tokenizer.json")

# load tokenizer
new_tokenizer = Tokenizer.from_file("tokenizer.json")

In [41]:
# wrap tokenizer in PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json",  # alternatively, load from file 
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

Building a BPE (GPT-2) tokenizer from scratch:

In [42]:
# choose model
tokenizer = Tokenizer(models.BPE())

In [43]:
# add pre-tokenizer (no normalizer needed for the GPT-2 tokenizer)
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # add_prefix_space=True would add a space at the beginning of the sentence if there is none
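
Byte-level pre-tokenization works on UTF-8 bytes and renders each byte as a printable character, which is why spaces show up as `Ġ` in GPT-2 tokens. The sketch below reconstructs a GPT-2-style byte-to-character table in plain Python to illustrate the idea. The helper names are hypothetical, and the `tokenizers` library handles this internally.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(0x21, 0x7F))    # '!' to '~'
        + list(range(0xA1, 0xAD))  # '¡' to '¬'
        + list(range(0xAE, 0x100)) # '®' to 'ÿ'
    )
    byte_values = printable[:]
    code_points = printable[:]
    extra = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control characters, ...) are shifted past U+00FF
            byte_values.append(b)
            code_points.append(256 + extra)
            extra += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()

def to_byte_level(text):
    """Render the UTF-8 bytes of `text` with the byte-level alphabet."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(to_byte_level("Hello world"))  # HelloĠworld
```

Because every possible byte has a character in this alphabet, a byte-level tokenizer never needs an unknown token.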

In [44]:
# train tokenizer
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

In [45]:
# post-processing
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)  # trim_offsets=False leaves offsets as they are, so an offset starts at the space before a word; trim_offsets=True would make it start at the word's first character


In [46]:
# decoder
tokenizer.decoder = decoders.ByteLevel()

In [47]:
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>"
)
# OR (requires `from transformers import GPT2TokenizerFast`)
# wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

## 6. Main NLP tasks

### Token classification

- **Named entity recognition (NER)**: find names, locations and organizations in a sentence; assign a class to each token
- **Part-of-speech tagging (POS)**: mark each word with its respective part of speech (e.g., noun, verb, ...)
- **Chunking**: find tokens that belong to the same entity
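
As a small illustration of how per-token classes turn into entities, the sketch below groups token labels into spans. It assumes the common BIO labeling scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity), which these notes do not spell out; the sentence, labels and function name are made up for the example.

```python
def group_entities(tokens, labels):
    """Merge consecutive B-/I- labelled tokens into (entity_type, text) spans."""
    entities, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            # Close the running entity (if any) and open a new one
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-"):
            current_tokens.append(token)  # continue the running entity
        else:
            # "O": outside any entity, so close the running one
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["Ada", "works", "at", "Hugging", "Face", "in", "Brooklyn"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = group_entities(tokens, labels)
print(entities)  # [('PER', 'Ada'), ('ORG', 'Hugging Face'), ('LOC', 'Brooklyn')]
```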

### Fine-tuning a masked language model

- a pre-trained model can be fine-tuned on your data if the dataset is not too different from the corpus used for pretraining the model
- **domain adaptation**: one example of fine-tuning, namely fine-tuning a pretrained language model on a dataset of domain-specific texts

In [3]:
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer
)
from datasets import load_dataset
import torch

In [None]:
checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
text = "This is a great [MASK]."

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
mask_token_idx = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_idx, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
  print(f"{text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

This is a great deal.
This is a great success.
This is a great adventure.
This is a great idea.
This is a great feat.


In [None]:
imdb_ds = load_dataset("imdb")

In [76]:
print(imdb_ds["train"]["text"][0])

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve