In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline("sentiment-analysis")

In [None]:
classifier([" i like you","i hate you"])

A pipeline has multiple components: tokenization (token ids and vocabulary), model inference, and post-processing that converts the logits into probabilities and then into labels.

The tokenizer does two things:
--> splits the text into tokens
--> maps each token to an integer id
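
The two tokenizer steps above can be sketched with a toy example. This is only an illustration: `toy_vocab`, `toy_tokenize`, and `toy_encode` are hypothetical names, the vocabulary is made up, and real tokenizers split text into subwords rather than on whitespace.

```python
# Toy vocabulary (hypothetical; a real tokenizer ships thousands of entries).
toy_vocab = {"[UNK]": 0, "i": 1, "like": 2, "hate": 3, "you": 4}


def toy_tokenize(text):
    """Step 1: split the text into tokens (lowercase + whitespace split)."""
    return text.lower().split()


def toy_encode(text):
    """Step 2: map each token to an integer id, using [UNK] for unknown words."""
    unk_id = toy_vocab["[UNK]"]
    return [toy_vocab.get(token, unk_id) for token in toy_tokenize(text)]


tokens = toy_tokenize(" i like you")
ids = toy_encode(" i like you")
print(tokens)
print(ids)
```

Words missing from the toy vocabulary fall back to the `[UNK]` id, which is one common way of handling unknown tokens.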

We use the AutoTokenizer class and its from_pretrained() method to load and apply the tokenizer on its own. Inside a pipeline this step is already done for us.
Each model has its own tokenizer, so the tokenizer must be loaded from the same checkpoint as the model.


In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Now we can pass text directly to the tokenizer and see what it returns.

In [None]:
tokenizer("i like you")

These ids get fed into the model, and embedding vectors are created from them.

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument. Below we ask for PyTorch tensors with return_tensors="pt".
The main things to remember here are that you can pass one sentence or a list of sentences, and that you can specify the type of tensors you want to get back (if no type is passed, you will get a list of lists as a result).
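
To make padding and truncation more concrete, here is a rough pure-Python sketch of what they do to a batch of id lists. The function `toy_pad_and_truncate`, the pad id of 0, and the example ids are all assumptions chosen for illustration; the real tokenizer handles this internally and may differ in its details.

```python
def toy_pad_and_truncate(batch_ids, max_length=None, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (made-up ids).
batch = toy_pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

After padding, every row has the same length, which is what allows the batch to be stacked into a single rectangular tensor.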



In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
inputs

We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an AutoModel class which also has a from_pretrained() method:

In [None]:
from transformers import AutoModel

In [None]:
model_cp = AutoModel.from_pretrained(checkpoint)

This architecture contains only the base Transformer module: given some inputs, it outputs what we'll call hidden states, also known as features.
Note that "base" in a checkpoint name (as in distilbert-base-uncased) describes the model size, not the output type. We get hidden states here because we loaded the checkpoint with AutoModel, which leaves out the task head.
For each model input, we retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model. That vector is the feature vector, or hidden state.
These features are usually fed into a head for a downstream task, but they can also be used on their own, for example in an unsupervised way.
Different tasks can be performed with the same architecture, but each task has a different head associated with it.


The hidden-state tensor usually has three dimensions:
1. Batch size: the number of sequences processed at a time.
2. Sequence length: the number of tokens in each (padded) sequence.
3. Hidden size: the vector dimension of each model input, such as 768.

For example, a shape of [10, 128, 768].
It is said to be "high dimensional" because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).
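
As a small illustration of the three dimensions, the sketch below builds a nested list of zeros with the layout [batch size, sequence length, hidden size] and reads the shape back. The sizes 2, 16, and 768 and the helper `shape_of` are assumptions for the example; a real model returns a tensor, not nested lists.

```python
batch_size, sequence_length, hidden_size = 2, 16, 768

# One hidden_size-dimensional vector per token, per sequence in the batch.
hidden_states = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]


def shape_of(nested):
    """Return the dimensions of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


print(shape_of(hidden_states))
```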

In [None]:
outputs = model_cp(**inputs)
outputs

In [None]:
outputs.last_hidden_state.shape

Note that the outputs of 🤗 Transformers models behave like namedtuples or dictionaries. You can access the elements by attributes (like we did) or by key (outputs["last_hidden_state"]), or even by index if you know exactly where the thing you are looking for is (outputs[0]).
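
The three access styles can be imitated with a small stand-in class. `ToyOutput` below is hypothetical and is not the class the library uses; it only sketches how one object can support attribute, key, and index access at the same time.

```python
from dataclasses import dataclass, fields


@dataclass
class ToyOutput:
    """Hypothetical stand-in for a model output object."""

    last_hidden_state: list

    def __getitem__(self, key):
        # String keys behave like dictionary lookups.
        if isinstance(key, str):
            return getattr(self, key)
        # Integer keys follow the order in which fields were declared.
        return getattr(self, fields(self)[key].name)


out = ToyOutput(last_hidden_state=[[0.1, 0.2], [0.3, 0.4]])
print(out.last_hidden_state is out["last_hidden_state"])
print(out.last_hidden_state is out[0])
```

All three expressions return the same underlying object, so which style to use is mostly a matter of readability.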

In [None]:
outputs[0].shape

In [None]:
outputs["last_hidden_state"].shape

Model heads: Making sense out of numbers
The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:

raw text --> tokenizer --> input ids --> embedding layer --> transformer blocks (which include attention layers)
--> hidden states (features) --> linear head that projects to the target dimension
--> logits --> softmax probabilities or class

This is what a typical model looks like.
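
The projection done by a head can be sketched as a single linear layer in plain Python. Everything here is made up for illustration: the 4-dimensional feature vector, the weights, and the bias are arbitrary numbers, and `linear_head` is a hypothetical helper. A real head works on vectors of the model's hidden size and uses learned weights.

```python
def linear_head(features, weights, bias):
    """Project a feature vector to one logit per class: logits = W @ x + b."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


features = [0.5, -1.0, 2.0, 0.0]   # pretend hidden state (hidden size 4)
weights = [
    [1.0, 0.0, -1.0, 0.5],         # weights for class 0
    [-1.0, 0.5, 1.0, 0.0],         # weights for class 1
]
bias = [0.1, -0.1]

logits = linear_head(features, weights, bias)
print(logits)
```

The output has one value per class, which is why a two-label classifier produces two logits per sentence.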


In [None]:
from transformers import AutoModel


In [None]:
from transformers import AutoTokenizer

In [None]:
raw_inputs

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
tokenizer(raw_inputs)

In [None]:
tokenizer(raw_inputs, truncation=True, padding=True)

In [None]:
input = tokenizer(raw_inputs, truncation=True, padding=True, return_tensors="pt")

In [None]:
from transformers import AutoModel

In [None]:
model_cp = AutoModel.from_pretrained(checkpoint)

In [None]:
output = model_cp(**input)

In [None]:
output.last_hidden_state.shape

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
classification_full_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [None]:
output = classification_full_model(**input)

In [None]:
output

In [None]:
print(output.logits.shape)

Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.

In [None]:
print(output.logits)

The values we get as output from our model don't necessarily make sense by themselves.
Our model predicted [-1.5607, 1.6123] for the first sentence and [ 4.1692, -3.3464] for the second one.
Those are not probabilities but logits, the raw, unnormalized scores output by the last layer of the model (the linear head that projects the hidden states).

To be converted to probabilities, they need to go through a SoftMax layer (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [None]:
import torch

In [None]:
predictions = torch.nn.functional.softmax(output.logits, dim=-1)

In [None]:
predictions

Now we can see that the model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second one. These are recognizable probability scores.
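
To see what softmax does to these numbers, the conversion can be sketched without torch. The `softmax` helper below is a plain-Python illustration applied to the rounded logits quoted above, so its results should agree with the reported probabilities only up to rounding.

```python
import math


def softmax(scores):
    """Exponentiate each score and normalize so the results sum to 1."""
    largest = max(scores)
    # Subtracting the maximum avoids overflow and does not change the result.
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the larger logit in a row always receives the larger probability.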

In [None]:
classification_full_model.config.id2label

Now we can conclude what the model predicted.
Index 0 holds the probability for class 'NEGATIVE' and index 1 holds the probability for class 'POSITIVE':

First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
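
The final post-processing step, picking the most likely class and looking up its name, can be sketched as follows. The `id2label` dictionary mirrors the mapping described above, the probabilities are the rounded values from the text, and `to_prediction` is a hypothetical helper whose output format is only meant to resemble what the pipeline returns.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Rounded probabilities quoted in the text above.
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]


def to_prediction(row):
    """Pick the index with the highest probability and map it to its label."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return {"label": id2label[best_index], "score": row[best_index]}


predictions = [to_prediction(row) for row in probabilities]
print(predictions)
```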

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing! Now let’s take some time to dive deeper into each of those steps.

In [None]:
from transformers import pipeline

In [None]:
input_raw_text = [" i am so happy",
                  "i am so so so so sad"]

In [None]:
sentiment_classification_pipeline = pipeline("sentiment-analysis")

In [None]:
pipeline_output = sentiment_classification_pipeline(input_raw_text)


In [None]:
pipeline_output

The pipeline above did everything in one go: tokenization, embeddings, Transformer attention blocks,
hidden-state features, the classification head, the conversion from logits to probabilities,
and finally the mapping to labels. We can also do it step by step.


In [None]:
from transformers import AutoModel
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

In [None]:
# Each tokenizer is tied to a specific model checkpoint, so we define the checkpoint first.
model_cp = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
# "base" in the name refers to the model size, not to a missing head.
# Whether we get hidden states or logits depends on the class used to load the checkpoint:
# AutoModel returns hidden states, AutoModelForSequenceClassification returns logits.
tokenizer = AutoTokenizer.from_pretrained(model_cp)

In [None]:
input_token_ids = tokenizer(input_raw_text, truncation=True,
                            padding=True,
                            return_tensors="pt")

In [None]:
input_token_ids

In [None]:
# Now that we have the input token ids, we can run the base model (no head) on them.

sent_model = AutoModel.from_pretrained(model_cp)

In [None]:
output_sent = sent_model(**input_token_ids)
# output_sent is an output object; the hidden states are in output_sent.last_hidden_state

In [None]:
output_sent.last_hidden_state.shape

In [None]:
# Load the same checkpoint with its classification head; this model runs the base Transformer and the head together.
sent_model_full = AutoModelForSequenceClassification.from_pretrained(model_cp)

In [None]:
output_sent_full = sent_model_full(**input_token_ids)

In [None]:
output_sent_full.logits

In [None]:
# This is not the hidden state: these are the logits, already passed through the classification head.

In [None]:
import torch
output_sent_prob = torch.nn.functional.softmax(output_sent_full.logits, dim=-1)

In [None]:
output_sent_prob

In [None]:
sent_model_full.config.id2label