## Behind the pipeline()
Transformer models can't process raw text directly, so the first step is to convert the text inputs into numbers the model can understand. This process is called tokenization and is performed by a tokenizer.

### Preprocessing - Tokenizer

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Next, we pass our sentences to the tokenizer and feed the returned dictionary to the model.
Transformer models only accept tensors as input, so we ask the tokenizer to return tensors (return_tensors="pt" for PyTorch, "np" for NumPy).

In [3]:
raw_inputs: list[str] = [
    "I've been a fan of Manchester United for years!",
    "The new movie was a fantastic adventure with stunning visuals. I highly recommend it to everyone.",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  1037,  5470,  1997,  5087,  2142,
          2005,  2086,   999,   102,     0,     0,     0,     0,     0,     0],
        [  101,  1996,  2047,  3185,  2001,  1037, 10392,  6172,  2007, 14726,
         26749,  1012,  1045,  3811, 16755,  2009,  2000,  3071,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
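To make the padding and attention mask in the output above concrete, here is a minimal pure-Python sketch of what padding=True does conceptually. The function name pad_batch and the toy IDs are hypothetical; only the pad ID 0 is taken from the tensor above, and the real tokenizer does considerably more.

```python
PAD_ID = 0  # [PAD] id, as seen in the input_ids tensor above


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build the mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy IDs for illustration only (not a real tokenizer's output)
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding makes the batch rectangular so it can be stored in a single tensor, and the mask tells the model which positions are padding.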

### Model
AutoModel loads only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features. For each model input, we’ll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

In [4]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

The output of the base model is a tensor of shape [batch_size, sequence_length, hidden_size]:

Batch size: The number of sequences processed at a time (2 in our example).
Sequence length: The length of the numerical representation of each sequence, including padding (20 in our example).
Hidden size: The vector dimension of each model input.

It is said to be “high dimensional” because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

In [5]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)


torch.Size([2, 20, 768])
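As a rough illustration of what the three dimensions in the shape above mean, the sketch below builds a nested list with the same layout and measures it. The helper shape_of is hypothetical and the values are placeholders, not real hidden states.

```python
def shape_of(nested):
    """Return the dimensions of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims


batch_size, sequence_length, hidden_size = 2, 20, 768

# One vector of hidden_size floats per token, per sequence (all zeros here)
hidden_states = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

print(shape_of(hidden_states))
print(batch_size * sequence_length * hidden_size)  # total number of values
```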


A model with a sequence classification head takes the high-dimensional vectors we saw before as input and outputs vectors containing two values (one per label), so the shape of its outputs is much smaller:

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

print(outputs.logits.shape)

torch.Size([2, 2])
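To see how a head can shrink [batch_size, sequence_length, hidden_size] down to [batch_size, num_labels], here is a simplified sketch that applies one linear layer to the first token's vector of each sequence. This is an assumption made for illustration: the weights are random, the sizes are tiny, and the actual head of this checkpoint may include additional layers.

```python
import random

random.seed(0)
hidden_size, num_labels = 8, 2  # tiny sizes for readability (the real hidden size is 768)


def linear(vector, weights, bias):
    """Compute weights @ vector + bias for one input vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


# Hypothetical, randomly initialised head parameters
weights = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

# Fake hidden states: 2 sequences, 5 tokens each, hidden_size values per token
hidden_states = [
    [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(5)]
    for _ in range(2)
]

# Use the vector of the first token of each sequence as the sentence summary
logits = [linear(sequence[0], weights, bias) for sequence in hidden_states]
print(len(logits), len(logits[0]))
```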


### Postprocessing the output
Our model predicted [-3.5853, 3.8333] for the first sentence and [-4.3617, 4.7374] for the second one. Those are not probabilities but logits, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a softmax layer.

In [7]:
print(outputs.logits)

tensor([[-3.5853,  3.8333],
        [-4.3617,  4.7374]], grad_fn=<AddmmBackward0>)


In [8]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions

tensor([[5.9962e-04, 9.9940e-01],
        [1.1176e-04, 9.9989e-01]], grad_fn=<SoftmaxBackward0>)
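As a sanity check, the softmax step can be reproduced with the standard library alone. The sketch below hard-codes the rounded logits printed earlier, so the results should agree with the tensor above only up to rounding.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Rounded logits copied from the printed output above
logits = [[-3.5853, 3.8333], [-4.3617, 4.7374]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([f"{p:.4e}" for p in row])
```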

Now we can see that the model predicted [0.0006, 0.9994] for the first sentence and [0.0001, 0.9999] for the second one. These are recognizable probability scores. To get the labels corresponding to each position, we can inspect the id2label attribute of the model config:

In [9]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
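Putting the mapping to use, a small sketch can pick the most probable position for each sentence and look up its label. The probabilities are copied from the softmax output above and the dictionary from the cell above; the variable names are only illustrative.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above
probabilities = [[5.9962e-04, 9.9940e-01], [1.1176e-04, 9.9989e-01]]

labels = []
for row in probabilities:
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
    labels.append(id2label[best])
    print(id2label[best], round(row[best], 4))
```

Both sentences are classified as POSITIVE with high confidence.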

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing.

## Models
This section explores creating models with AutoModel, then saving and reloading them.

### Creating a transformer

In [10]:
from transformers import AutoModel
# download model
model = AutoModel.from_pretrained("bert-base-cased")

# save model
model.save_pretrained('Models/bert-base-cased')

# load model
model = AutoModel.from_pretrained('Models/bert-base-cased')

### Encoding text

In [11]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_input = tokenizer("Hello, Have you sang today?")
encoded_input

{'input_ids': [101, 8667, 117, 4373, 1128, 6407, 2052, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
# We can decode the input IDs to get back the original text
tokenizer.decode(encoded_input['input_ids'])

'[CLS] Hello, Have you sang today? [SEP]'
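The round trip above can be mimicked with a toy vocabulary. The token-to-ID pairs below are inferred by position from the output above, assuming each word and punctuation mark maps to exactly one ID; the encode and decode helpers are hypothetical and much simpler than the real tokenizer.

```python
# Toy vocabulary inferred from the encoded output above (an assumption)
vocab = {
    "[CLS]": 101, "[SEP]": 102, "Hello": 8667, ",": 117,
    "Have": 4373, "you": 1128, "sang": 6407, "today": 2052, "?": 136,
}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(tokens):
    """Map tokens to IDs and wrap them in the special [CLS] and [SEP] tokens."""
    return [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]


def decode(ids):
    """Map IDs back to tokens; unlike the real decode, spacing is not cleaned up."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode(["Hello", ",", "Have", "you", "sang", "today", "?"])
print(ids)
print(decode(ids))
```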

In [13]:
# Encode multiple sentences at once
sequences: list[str] = [
    "I've been a fan of Manchester United for years!",
    "The new movie was a fantastic adventure with stunning visuals. I highly recommend it to everyone.",
]
inputs = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt", max_length=16)
inputs


{'input_ids': tensor([[  101,   146,   112,  1396,  1151,   170,  5442,  1104,  4280,  1244,
          1111,  1201,   106,   102,     0,     0],
        [  101,  1109,  1207,  2523,  1108,   170, 14820,  7644,  1114, 15660,
          5173,  1116,   119,   146,  3023,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [14]:
for ids in inputs['input_ids']:
    print(tokenizer.decode(ids))


[CLS] I ' ve been a fan of Manchester United for years! [SEP] [PAD] [PAD]
[CLS] The new movie was a fantastic adventure with stunning visuals. I highly [SEP]
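The second sentence was cut off because of max_length=16, yet it still ends with [SEP]. A minimal sketch of that behaviour, assuming the sequence already contains its special tokens and using made-up token IDs:

```python
CLS_ID, SEP_ID = 101, 102  # special token IDs seen in the outputs above


def truncate(ids, max_length):
    """Cut a sequence to max_length while keeping the closing [SEP]."""
    if len(ids) <= max_length:
        return ids
    return ids[: max_length - 1] + [SEP_ID]


# Made-up IDs: [CLS] + 20 tokens + [SEP] = 22 IDs in total
long_ids = [CLS_ID] + list(range(1000, 1020)) + [SEP_ID]
short = truncate(long_ids, 16)
print(len(short), short[0], short[-1])
```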


### Putting it all together

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequences = ["I've been a fan of Manchester United for years!", "In all my years in service i have never eaten corn"]

tokens = tokenizer(sequences)
tokens

{'input_ids': [[101, 1045, 1005, 2310, 2042, 1037, 5470, 1997, 5087, 2142, 2005, 2086, 999, 102], [101, 1999, 2035, 2026, 2086, 1999, 2326, 1045, 2031, 2196, 8828, 9781, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [23]:
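# Pad to the longest sequence and return PyTorch tensors
# (truncation=True and max_length could also be passed here)
tokens = tokenizer(sequences, padding=True, return_tensors="pt")
# Note: judging by the output type below, `model` is still the bert-base-cased AutoModel
# loaded earlier, while this tokenizer comes from `checkpoint`. For meaningful results,
# load the tokenizer and the model from the same checkpoint.
output = model(**tokens)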
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.2819, -0.0083,  0.0651,  ..., -0.2407,  0.4291,  0.1038],
         [-0.1350, -0.6947,  0.2639,  ..., -0.1382,  0.6238,  0.6530],
         [-0.3829, -0.9847,  0.5163,  ...,  0.0986,  0.4749,  0.7193],
         ...,
         [-0.2084, -0.4938,  0.2015,  ...,  0.1113,  0.4608,  0.2006],
         [ 0.1782, -0.2340,  0.2628,  ..., -0.4469,  0.1198,  0.2138],
         [ 0.6864, -0.7907,  0.0539,  ..., -0.9187,  0.7601, -0.0813]],

        [[ 0.4178,  0.0213,  0.0459,  ..., -0.2817,  0.1344, -0.0787],
         [ 0.0534, -0.2096,  0.1823,  ...,  0.0932, -0.0687, -0.3376],
         [-0.1404, -0.1754,  0.1438,  ...,  0.3340, -0.0152,  0.1059],
         ...,
         [ 0.2635, -0.3036, -0.0576,  ..., -0.2675,  0.1084, -0.0407],
         [ 0.6577,  0.4530,  0.2398,  ..., -0.7735,  0.3696, -1.0877],
         [-0.0570, -0.2286,  0.0018,  ...,  0.3271, -0.0414,  0.0727]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_ou

In [24]:
tokenizer.tokenize("I've been a fan of Manchester United for years!")

['i',
 "'",
 've',
 'been',
 'a',
 'fan',
 'of',
 'manchester',
 'united',
 'for',
 'years',
 '!']
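The split above can be approximated with lowercasing and a regular expression. This is only a sketch of the basic word and punctuation splitting; the real tokenizer also breaks unknown words into subword pieces, which this example does not attempt. The function name basic_tokenize is hypothetical.

```python
import re


def basic_tokenize(text):
    """Lowercase the text, then split it into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = basic_tokenize("I've been a fan of Manchester United for years!")
print(tokens)
```

For this sentence the result matches the tokenizer output above, since every word is kept whole.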