# Preprocessing with a tokenizer
---
Transformer models can't process raw text directly, so the first step of this pipeline is to convert the text inputs into numbers that the model can make sense of. This is done with a tokenizer, which will:
- Split the inputs into words, subwords, or symbols (such as punctuation), which are called tokens
- Map each token to an integer
- Add additional inputs that may be useful to the model
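
The three steps above can be sketched with a toy tokenizer. This is only an illustration: the vocabulary, the `toy_tokenize` and `toy_encode` helpers, and the example sentences are all hypothetical, and the whitespace-and-punctuation split is a simplification of the subword tokenization that the real DistilBERT tokenizer performs.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer ships its own vocabulary file.
PAD, UNK, CLS, SEP = "[PAD]", "[UNK]", "[CLS]", "[SEP]"
VOCAB = {tok: i for i, tok in enumerate(
    [PAD, UNK, CLS, SEP, "i", "love", "hate", "this", "course", "so", "much", "!", "."]
)}


def toy_tokenize(text):
    """Lowercase, then split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(batch):
    """Return padded input IDs and the matching attention masks."""
    # Steps 1 and 2: split into tokens, then map each token to an integer.
    encoded = []
    for text in batch:
        tokens = [CLS] + toy_tokenize(text) + [SEP]
        encoded.append([VOCAB.get(tok, VOCAB[UNK]) for tok in tokens])

    # Step 3: pad to the longest sequence and flag real tokens with 1.
    max_len = max(len(ids) for ids in encoded)
    input_ids = [ids + [VOCAB[PAD]] * (max_len - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["I love this course!", "I hate this so much."])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sentence is padded with the `[PAD]` ID, and its attention mask holds a 0 at the padded position so that a model could ignore it.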

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



In [5]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate thsi so much"
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223, 16215,  5332,  2061,  2172,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [6]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


The model returned a high-dimensional tensor of hidden states. The output of the Transformer module is usually large and generally has three dimensions:
- Batch size: the number of sequences processed at a time (2 here)
- Sequence length: the length of the numerical representation of each sequence (16 here)
- Hidden size: the dimension of the vector that represents each token (768 here)
---

In [7]:
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)


torch.Size([2, 2])


Now the output has far fewer dimensions: the model head takes the high-dimensional vectors we saw before as input and outputs vectors containing two values (one per label).
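
To make the change of shape concrete, here is a minimal sketch of a classification head written with plain Python lists. It rests on several assumptions: the hidden states are random stand-ins, the weights are random rather than trained, and the head is reduced to a single linear projection of the first token's vector. The real head in the checkpoint has trained parameters and may contain additional layers.

```python
import random

random.seed(0)
BATCH, SEQ_LEN, HIDDEN, NUM_LABELS = 2, 16, 768, 2

# Stand-in for last_hidden_state: random numbers with shape [2, 16, 768].
hidden_states = [
    [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]
    for _ in range(BATCH)
]

# Hypothetical head parameters: one weight row and one bias per label.
weights = [[random.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(NUM_LABELS)]
biases = [0.0] * NUM_LABELS


def classify(sequence):
    """Project the first token's hidden vector down to one score per label."""
    first_token = sequence[0]  # vector of length HIDDEN
    return [
        sum(w * x for w, x in zip(row, first_token)) + b
        for row, b in zip(weights, biases)
    ]


logits = [classify(sequence) for sequence in hidden_states]
print(len(logits), len(logits[0]))  # batch size, number of labels
```

Whatever the sequence length, each sequence ends up with exactly one score per label, which is why the logits have shape `[2, 2]`.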

In [8]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 3.1425, -2.6553]], grad_fn=<AddmmBackward0>)


The model predicted [-1.5607, 1.6123] for the first sentence and [3.1425, -2.6553] for the second.
<br>
These values are logits (raw, unnormalized scores), not probabilities. We can use torch to convert them to readable probabilities with a softmax layer.

In [10]:
import torch 

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)


tensor([[0.0402, 0.9598],
        [0.9970, 0.0030]], grad_fn=<SoftmaxBackward0>)
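
For intuition, the same computation can be written out by hand. The sketch below is a plain-Python softmax applied to the logits printed earlier; the `softmax` helper is written for illustration and is not how torch implements it internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the printed model output.
batch_logits = [[-1.5607, 1.6123], [3.1425, -2.6553]]
probabilities = [softmax(row) for row in batch_logits]

for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row is normalized independently, which corresponds to `dim=-1` in the torch call.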


In [12]:
model.config.id2label


{0: 'NEGATIVE', 1: 'POSITIVE'}

- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9970, POSITIVE: 0.0030
<br>
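
Turning the probabilities into a final answer only requires picking the most likely label for each sentence. A minimal sketch, using the rounded values and the label mapping printed earlier in this notebook (the `results` structure is chosen here for illustration):

```python
# Values copied from the outputs printed earlier in this notebook.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[0.0402, 0.9598], [0.9970, 0.0030]]

results = []
for row in probabilities:
    # Index of the highest probability in this row.
    best_id = max(range(len(row)), key=row.__getitem__)
    results.append({"label": id2label[best_id], "score": row[best_id]})

print(results)
```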

Now, for comparison, we will run the sentiment-analysis pipeline on the same inputs to see how close we were.

In [13]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model=checkpoint)
classifier(raw_inputs)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.9598051905632019},
 {'label': 'NEGATIVE', 'score': 0.9969748258590698}]