This notebook, and the series that follows, dives into the Transformers and Hugging Face philosophy and how the library is built.

https://huggingface.co/docs/transformers/philosophy

### Easy & Fast to use

> Three classes for any model: configuration, model, and a preprocessor such as a tokenizer (NLP), image_processor (vision), feature_extractor (audio) or processor (multimodal). All are initialized using the .from_pretrained() method. The model data is pulled from the Hugging Face Hub (huggingface_hub).

> pipeline() to do inference and Trainer to train the models

### Provide SOTA models that are close in performance to the original models:

> At least one example is provided for each architecture, reproducing the results reported by the model authors.

> The code stays **close** to the original codebase, meaning some of it may not be idiomatic PyTorch

> Provides API access to **Full Hidden States** and **attention weights** of the model

In [1]:
# Looking at the attention masks in Transformers

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [3]:
seq_a = "This is a sentence of 3 words"
seq_b = "This is a sentence of more than 3 words, providing lot more information"

In [4]:
encode_a = tokenizer(seq_a)['input_ids']
encode_b = tokenizer(seq_b)['input_ids']
len(encode_a), len(encode_b)  # (9, 16)  # Have different lengths

(9, 16)

In [5]:
# How the tokenizer output looks, with a single input
tokenizer(seq_a)

{'input_ids': [101, 1188, 1110, 170, 5650, 1104, 124, 1734, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
tokenizer([seq_a, seq_b], padding=True) # The tokens and attention masks are padded where required

{'input_ids': [[101, 1188, 1110, 170, 5650, 1104, 124, 1734, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 5650, 1104, 1167, 1190, 124, 1734, 117, 3558, 1974, 1167, 1869, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [7]:
encoded = tokenizer([seq_a, seq_b], padding=True) # The tokens and attention masks are padded where required

In [11]:
tokenizer.decode(encoded['input_ids'][0])

'[CLS] This is a sentence of 3 words [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
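
To make the padding behaviour concrete, here is a minimal pure-Python sketch of what padding to the longest sequence does to `input_ids` and `attention_mask`. The helper `pad_batch` is hypothetical (it is not part of the library); the token ids and the pad id of 0 are copied from the outputs above.

```python
# Toy re-implementation of "pad to the longest sequence in the batch".
# pad_batch is a hypothetical helper; PAD_ID = 0 is read off the tokenizer output above.
PAD_ID = 0

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the batch maximum and build the attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

seq_a_ids = [101, 1188, 1110, 170, 5650, 1104, 124, 1734, 102]
seq_b_ids = [101, 1188, 1110, 170, 5650, 1104, 1167, 1190, 124, 1734,
             117, 3558, 1974, 1167, 1869, 102]

padded = pad_batch([seq_a_ids, seq_b_ids])
print(padded["attention_mask"][0])
```

The mask has exactly as many ones as the unpadded sequence has tokens, which is how the model can tell real tokens from padding.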

#### Some interesting / important terms:

backbone: the model / network that outputs raw hidden states. It is connected to a **head**, and there are different heads for different tasks:

> LM Head

> DoubleHeads

> Question Answering

> Sequence Classification

> Token Classification

CTC (Connectionist Temporal Classification): an algorithm that lets a model learn without knowing exactly how the inputs and outputs are aligned. It is used in **speech recognition**.
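
As a small illustration of the alignment-free idea, the sketch below shows only the decoding side of CTC: per-frame labels are collapsed by merging consecutive repeats and then dropping a special blank symbol. The blank character `_` and the function name are assumptions chosen for this example, and the training loss itself is not shown.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this example

def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove the blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)

# One label per audio frame; the blank between the two "l" runs keeps the double letter.
frames = list("hh_e_ll_lloo_")
print(ctc_collapse(frames))
```

Many different frame-level sequences collapse to the same text, which is why the model never needs an explicit frame-to-character alignment.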

convolution: a neural-network layer in which a small kernel matrix slides over the input; at each position the overlapping values are multiplied element-wise by the kernel and summed up into one entry of a new matrix.
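
The multiply-and-sum step can be sketched in a few lines of plain Python. This is a toy version with no padding, stride 1 and a single channel; like most deep learning libraries it does not flip the kernel (strictly speaking, a cross-correlation). The function name and the numbers are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]  # top-left value minus bottom-right value

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-4, -4], [-4, -4]]
```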

decoder models are auto-regressive: they learn to predict the next token from the previous ones, with a causal mask hiding the future tokens during training.

encoder models are auto-encoding: they are typically pretrained with Masked Language Modeling and produce a numerical representation (embedding) of the input.
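
One way to picture the difference between the two families is the attention pattern. In a decoder, a position may only attend to itself and to earlier positions; in an encoder, every position can attend to every other one. The sketch below builds both patterns as 0/1 matrices; `visibility_mask` is a hypothetical helper written for this note.

```python
def visibility_mask(seq_len, causal):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

causal = visibility_mask(4, causal=True)          # decoder-style (auto-regressive)
bidirectional = visibility_mask(4, causal=False)  # encoder-style (auto-encoding)

for row in causal:
    print(row)
```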

labels are an optional argument that can be passed so that the model computes the loss itself. The base models don't accept labels, as they just output features.

position_ids are used by Transformers to identify the position of each token in the sequence. There are several kinds of positional embeddings, such as sinusoidal and relative embeddings. Each position id must be in the range [0, config.max_position_embeddings - 1].
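
As an example of one of these schemes, the following sketch computes sinusoidal embeddings for a handful of position ids, using the usual formulation PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The dimension of 4 and the `max_position_embeddings` value of 8 are arbitrary toy values, not taken from any real config.

```python
import math

def sinusoidal_embedding(position, d_model):
    """Return the d_model-dimensional sinusoidal embedding of one position."""
    vec = []
    for i in range(0, d_model, 2):  # i plays the role of 2i in the formula
        angle = position / (10000 ** (i / d_model))
        vec.append(math.sin(angle))  # even dimension
        vec.append(math.cos(angle))  # odd dimension
    return vec[:d_model]

max_position_embeddings = 8  # toy value, an assumption for this example
position_ids = list(range(5))

# Every position id has to fall inside the valid range.
assert all(0 <= p <= max_position_embeddings - 1 for p in position_ids)

table = [sinusoidal_embedding(p, d_model=4) for p in position_ids]
print(table[0])  # position 0 gives sin(0) = 0 and cos(0) = 1 in every pair
```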

self-supervised learning is the process in which a model creates its own learning objective and learns from **unlabeled data**. Masked language modeling is one such self-supervised objective.
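
The sketch below shows how masked language modeling manufactures its own labels from raw token ids: some tokens are replaced by a mask id, and the original id becomes the label at that position. It is deliberately simplified (real BERT pretraining also sometimes keeps or randomly replaces the selected tokens). `MASK_ID` is a placeholder value assumed for illustration, and -100 is used as the "ignore this position" label, the same convention as in the perplexity code in this notebook.

```python
import random

MASK_ID = 103          # placeholder id for the mask token (assumption for this example)
IGNORE_INDEX = -100    # label value that the loss skips
SPECIAL_IDS = {101, 102}  # [CLS] and [SEP] ids seen in the tokenizer output

def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Return (masked_input_ids, labels) built from unlabeled token ids."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in input_ids:
        if tok not in SPECIAL_IDS and rng.random() < mask_prob:
            masked.append(MASK_ID)   # hide the token from the model
            labels.append(tok)       # ...and ask the model to recover it
        else:
            masked.append(tok)
            labels.append(IGNORE_INDEX)  # no loss on unmasked positions
    return masked, labels

ids = [101, 1188, 1110, 170, 5650, 1104, 124, 1734, 102]
masked_ids, labels = mask_tokens(ids, mask_prob=0.5, seed=1)
print(masked_ids)
print(labels)
```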

ZeRO (Zero Redundancy Optimizer): a parallelism technique that shards tensors across devices, somewhat like tensor parallelism, except that the whole tensor is reconstructed in time for the forward or backward computation.



In [12]:
# Domain and the models segregated into different architectures

computer_vision = {
    "encoder": ['ViT','Swin', 'SegFormer', 'BEiT'],
    "decoder": ['ImageGPT'],
    "encoder-decoder": ['DETR'],
    "convolution": ['ConvNeXT']
}

NLP = {
    "encoder": ["BERT", "RoBERTa", "ALBERT", "DistilBERT", "DeBERTa", "Longformer"],
    "decoder": ["GPT-2", "XLNet", "GPT-J", "OPT", "BLOOM"],
    "encoder-decoder": ["BART", "Pegasus", "T5", ],
}

Audio = {
    "encoder": ["Wav2Vec2", "Hubert"],
    "encoder-decoder": ["Speech2Text", "Whisper"]
}

MultiM = {
    "encoder": ["VisualBERT", "ViLT", "CLIP", "OWL-ViT"],
    "encoder-decoder": ["TrOCR", "Donut"]
}

Reinforcement = {
    "decoder": ["Trajectory transformer", "Decision transformer"]
}

In [13]:
# Tokenizers
# moving from rule-based, word-level to char-level tokenization and settling on subword algorithms.
# subword tokenization keeps the vocabulary size reasonable while still learning meaningful representations
Rule_based = ['spacy', 'moses', 'XLM', 'FlauBERT',]

sub_word = ['Byte-pair-encoding', 'WordPiece', 'Unigram', 'SentencePiece']
# Need to locate the data on models and their respective tokenisation algorithms

space_based = ["GPT-2", "RoBERTa"]

In [None]:
# GPT has a vocabulary size of 40,478 since they have 478 base characters and chose to stop training after 40,000 merges.
# GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
tokenizer_algos = {
    "byte_pair": {
        "base": ['GPT'],
        "byte_level": ['GPT-2'],
        "intro": "https://arxiv.org/abs/1508.07909"
    },
    "WordPiece":{
        "base": ['BERT', 'DistilBERT', 'Electra'],
        "intro": "https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf"
    },
    "Unigram": {
        "base": [],
        "intro": "https://arxiv.org/pdf/1804.10959.pdf"
    },
    "SentencePiece":{
        "base": ["XLM", "ALBERT", "XLNET", "Marian", "T5"],
        "intro": "https://arxiv.org/pdf/1808.06226.pdf"
    }
}
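
The "merges" mentioned in the comments above are what byte-pair encoding learns: starting from single characters, it repeatedly merges the most frequent adjacent pair of symbols, so the final vocabulary is the base symbols plus one new symbol per merge. Below is a minimal sketch of that training loop on a made-up word-frequency table; the function names and the corpus are assumptions for illustration, not the library's implementation.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word split into characters, with its frequency.
word_freqs = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
              tuple("bun"): 4, tuple("hugs"): 5}
base_vocab = {ch for word in word_freqs for ch in word}

merges = []
for _ in range(3):  # number of merges is the training hyperparameter
    pair_counts = get_pair_counts(word_freqs)
    best = max(pair_counts, key=pair_counts.get)
    merges.append(best)
    word_freqs = merge_pair(best, word_freqs)

print(merges)                         # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(len(base_vocab) + len(merges))  # vocabulary size = base symbols + merges
```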

In [14]:
# How BertTokenizer works
tokens = tokenizer.tokenize("I have a great Nvidia 4070 GPU")
tokens
# '##' marks a token that continues the previous token (no space in between)

['I', 'have', 'a', 'great', 'N', '##vid', '##ia', '40', '##70', 'GP', '##U']
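
The splits above can be reproduced with a greedy longest-match-first lookup, which is how WordPiece is usually described at inference time. The sketch below uses a tiny hand-written vocabulary containing only the pieces seen in the output above; the function and the vocabulary are assumptions for illustration, and the real tokenizer also handles casing, punctuation and a much larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word greedily into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"I", "have", "a", "great", "N", "##vid", "##ia", "40", "##70", "GP", "##U"}

sentence = "I have a great Nvidia 4070 GPU"
pieces = [p for word in sentence.split() for p in wordpiece_tokenize(word, toy_vocab)]
print(pieces)
```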

In [2]:
from transformers import XLNetTokenizer

xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet_tokenizer.tokenize("Do you love your GPU very much? I do.")

['▁Do',
 '▁you',
 '▁love',
 '▁your',
 '▁G',
 'PU',
 '▁very',
 '▁much',
 '?',
 '▁I',
 '▁do',
 '.']

### Attention mechanism is the biggest computation bottleneck. 

> Reformer (https://huggingface.co/docs/transformers/attention#reformer) uses LSH attention. A hash is used to determine whether q and k are close, so that each query only considers the keys close to it.

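> Longformer uses local attention: each token attends only to a window of neighboring tokens on its left and right. By stacking attention layers that each have a small window, the last layer gets a receptive field larger than the window, so a representation of the entire sentence can be built. Reformer (not Longformer) also uses axial positional encodings, which factorize the large positional-embedding matrix into smaller ones in order to reduce the space taken on the GPU.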
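
To see why a local window is cheaper, the sketch below builds a sliding-window attention pattern as a 0/1 matrix and counts the attended pairs. It also allows a few positions to be marked as global (attending to, and attended by, every position), which Longformer supports for preselected tokens. The helper name, the window size and the sequence length are toy assumptions.

```python
def sliding_window_mask(seq_len, window, global_positions=()):
    """mask[i][j] == 1 if token i may attend to token j."""
    half = window // 2
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            local = abs(i - j) <= half                  # within the window around i
            is_global = i in global_set or j in global_set
            row.append(1 if local or is_global else 0)
        mask.append(row)
    return mask

local_only = sliding_window_mask(seq_len=6, window=2)
with_global = sliding_window_mask(seq_len=6, window=2, global_positions=(0,))

attended = sum(map(sum, local_only))
print(attended, "of", 6 * 6, "pairs are computed with local attention only")
```

The number of attended pairs grows roughly linearly with the sequence length for a fixed window, instead of quadratically as in full attention.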

### Padding and Truncation

To work with inputs of different lengths, shorter sequences can be padded with a padding token, or longer sequences can be truncated to a max_length.

> Both are controlled by the padding, truncation and max_length arguments.

padding => True or 'longest', 'max_length', False or 'do_not_pad' are the 3 options, with the 3rd being the default

truncation => True or 'longest_first', 'only_second' (truncate only the 2nd sentence of a pair), 'only_first' (truncate only the 1st sentence of a pair),
False or 'do_not_truncate' (the default)
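
How these arguments interact can be sketched without the library. The toy function below works on single sequences (no sentence pairs), mimics only the option names discussed above, and assumes a model maximum of 512 and a pad id of 0. It is a hypothetical helper meant to show the resulting lengths, not a copy of the tokenizer's logic.

```python
def truncate(ids, limit):
    """Cut a sequence to `limit` tokens while keeping its final (closing) token."""
    if len(ids) <= limit:
        return list(ids)
    return list(ids[: limit - 1]) + list(ids[-1:])

def pad_and_truncate(batch, padding=False, truncation=False, max_length=None,
                     pad_to_multiple_of=None, model_max_length=512, pad_id=0):
    # Without an explicit max_length, the model maximum is the limit.
    limit = max_length if max_length is not None else model_max_length
    if truncation:
        batch = [truncate(ids, limit) for ids in batch]
    else:
        batch = [list(ids) for ids in batch]

    if padding is True or padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        target = limit
    else:
        return batch  # no padding requested

    if pad_to_multiple_of:
        # round the target length up to the next multiple
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of
    # sequences already longer than the target are left untouched
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]

toy_batch = [list(range(1, 10)), list(range(1, 17))]  # lengths 9 and 16

def lengths(batch):
    return [len(ids) for ids in batch]

print(lengths(pad_and_truncate(toy_batch, padding=True)))
print(lengths(pad_and_truncate(toy_batch, truncation=True, max_length=12)))
print(lengths(pad_and_truncate(toy_batch, padding="max_length", truncation=True, max_length=15)))
```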

In [7]:
# The following cells show a series of tokenizer options at work
from rich import print
bat = [seq_a, seq_b]  # This has to be a list; tokenizer(seq_a, seq_b) would encode them as one sentence pair
# no padding
out = tokenizer(bat)

print(out)

In [8]:
# pad to the length of the longest sequence in the batch, just use True

out = tokenizer(bat, padding=True)
print(out)

In [11]:
# pad to the maximum length accepted by the model

out = tokenizer(bat, padding='max_length')  # note the max_length is not provided
print(len(out['input_ids'][0]))  # model max_length of 512

In [14]:
# pad to max_length with a max_length value

out = tokenizer(bat, padding='max_length', max_length=15)
print(out)

In [15]:
out = tokenizer(bat, padding=True, pad_to_multiple_of=8) # pad to the longest sequence, rounded up to a multiple of 8 (helps Tensor Cores on NVIDIA GPUs)
print(out)

In [16]:
# truncation strategies
out = tokenizer(bat, truncation=True)  # nothing is truncated here: without max_length the limit is the model maximum (512), and both sequences are shorter
print(out)

In [18]:
out = tokenizer(bat, padding=True, truncation=True)
print(out)

In [20]:
# out = tokenizer(bat, padding='max_length', truncation=True)  # this will give 512 tokens
out = tokenizer(bat, padding='max_length', truncation=True, max_length=15)  # this pads / truncates every sequence to 15 tokens
print(out)

In [None]:
tokened_data = tokenizer(["This is a test",
                          "There is more to the logic than I know."],
                        truncation=True, padding=True,
                        max_length=8)
tokened_data
# this truncates each sequence to at most 8 tokens; sequences shorter than
# the longest one left in the batch are padded up to it

In [22]:
out = tokenizer(bat, truncation=True, max_length=12)
# Truncating and not padding
print(out)

In [23]:
# truncating to max_length and padding to the longest sequence left in the batch (both end up at 12 here)
out = tokenizer(bat, truncation=True, padding=True, max_length=12)
print(out)

### Perplexity is defined as the exponentiated average negative log-likelihood of a sequence.

Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note that the metric applies specifically to classical language models (sometimes called autoregressive or causal language models). It is **not well defined for masked language models** such as BERT.
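
Written out, for a tokenized sequence $X = (x_1, \dots, x_t)$ the definition in the heading reads

$$\mathrm{PPL}(X) = \exp\left( -\frac{1}{t} \sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i}) \right)$$

The short sketch below applies this formula to hand-picked token probabilities, so the numbers are made up and no model is involved. A model that assigns probability 1/4 to every token has a perplexity of about 4, as if it were choosing uniformly among four options at each step.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    nlls = [-math.log(p) for p in token_probs]
    return math.exp(sum(nlls) / len(nlls))

uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])  # close to 4
confident_ppl = perplexity([0.9, 0.8, 0.95])        # close to 1: the model is rarely surprised

print(round(uniform_ppl, 4), round(confident_ppl, 4))
```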

In [None]:
# Calculating the perplexity

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
model_id = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

In [25]:
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (281597 > 512). Running this sequence through the model will result in indexing errors


In [None]:
import torch
from tqdm import tqdm

max_length = model.config.n_positions  # maximum context size of the model (1024 for GPT-2)
stride = 512  # how far the window moves at each step
seq_len = encodings.input_ids.size(1)  # total number of tokens in the test set

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)  # keep the window within max_length tokens
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)  # extract the window
    target_ids = input_ids.clone()  # clone it as target_ids
    target_ids[:, :-trg_len] = -100  # ignore the context tokens already scored in an earlier window

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)  # collect the negative log

    prev_end_loc = end_loc  # updating prev end_location
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
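
To check the window arithmetic of the loop above without loading a model, the following sketch replays only the index bookkeeping. The sequence length of 2500 is an arbitrary toy value; `max_length=1024` and `stride=512` mirror the settings used above, and `window_bounds` is a hypothetical helper.

```python
def window_bounds(seq_len, max_length, stride):
    """Return (begin_loc, end_loc, trg_len) for every sliding window."""
    bounds, prev_end_loc = [], 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens scored for the first time in this window
        bounds.append((begin_loc, end_loc, trg_len))
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    return bounds

bounds = window_bounds(seq_len=2500, max_length=1024, stride=512)
for begin_loc, end_loc, trg_len in bounds:
    context = (end_loc - begin_loc) - trg_len  # tokens masked with -100, used as context only
    print(begin_loc, end_loc, trg_len, context)

# Every token is scored exactly once across all windows.
print(sum(trg_len for _, _, trg_len in bounds))
```

After the first window, each window scores 512 new tokens (fewer in the last one) while using the preceding tokens as context. Note that averaging the per-window losses, as the loop above does, gives the shorter last window the same weight as the full ones, so the result is a close approximation of the token-level average rather than an exact one.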