<a href="https://colab.research.google.com/github/Harsh-2909/NLP-Colab-Notebooks/blob/main/course/chapter2/section2_pt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]


# Using the pipeline function

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
classifier(
    [
        "I am so happy after raising a Series A funding from YC",
        "I hate this so much!"
    ]
)

[{'label': 'POSITIVE', 'score': 0.9998717308044434},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

# Breaking down the pipeline function tasks

## Using AutoTokenizer to convert the sentence to tokens

Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens.
- Mapping each token to an integer.
- Adding additional inputs that may be useful to the model.
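
The three steps above can be sketched with a toy tokenizer in plain Python. This is only an illustration, not how `AutoTokenizer` works internally: the names `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are made up for this example, the vocabulary is a tiny hand-written subset whose ids are copied from the real tokenizer output shown later in this notebook, and the real tokenizer also splits unknown words into subwords, which this sketch does not attempt.

```python
import re

# Toy vocabulary (hypothetical subset). The ids match what the real tokenizer
# prints later in this notebook for "I hate this so much!".
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023, "so": 2061, "much": 2172, "!": 999,
}


def toy_tokenize(text):
    """Step 1: lowercase, then split into words and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    tokens = toy_tokenize(text)
    # Step 2: map each token to an integer, falling back to [UNK].
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Step 3: add the special tokens the model expects around the sentence.
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


tokens = toy_tokenize("I hate this so much!")
input_ids = toy_encode("I hate this so much!")
print(tokens)
print(input_ids)
```

Any word missing from the toy vocabulary maps to `[UNK]` (id 100) here, whereas a subword tokenizer would usually break it into smaller known pieces instead.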

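Transformer models only accept tensors as input, so we pass `return_tensors="pt"` when tokenizing to get PyTorch tensors back instead of plain Python lists. Setting `padding=True` pads the shorter sentence to the length of the longest one in the batch, and `truncation=True` cuts off any input longer than the model's maximum length (512 tokens for this checkpoint).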

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


In [None]:
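raw_inputs = [
    "I am so happy after raising a Series A funding from YC",
    "I hate this so much!",
]
# Pad to the longest sentence in the batch, truncate to the model's maximum
# length, and return PyTorch tensors.
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)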

{'input_ids': tensor([[ 101, 1045, 2572, 2061, 3407, 2044, 6274, 1037, 2186, 1037, 4804, 2013,
         1061, 2278,  102],
        [ 101, 1045, 5223, 2023, 2061, 2172,  999,  102,    0,    0,    0,    0,
            0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}
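
The output above shows the second sentence padded with `0` (the `[PAD]` id) and a matching `attention_mask` that marks real tokens with 1 and padding with 0. A minimal sketch of that relationship, in plain Python rather than tensors, might look like the following; `pad_batch` is a hypothetical helper written for this example, not a function from the Transformers library.

```python
PAD_ID = 0  # id of [PAD] for this checkpoint, as printed by the tokenizer


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every id list to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids copied from the output above.
first = [101, 1045, 2572, 2061, 3407, 2044, 6274, 1037,
         2186, 1037, 4804, 2013, 1061, 2278, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Padding on the right mirrors the `padding_side='right'` setting reported by the tokenizer.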


## Using AutoModel to pass the tokens through the encoder

`AutoModel` returns only the base transformer model, without a task-specific head. In this case, that is the body of the encoder, which turns each token into an embedding (feature vector) that takes the surrounding context into account. The result, `last_hidden_state`, has shape `(batch_size, sequence_length, hidden_size)`; for the inputs above that is 2 sentences, 15 tokens each, and 768 hidden dimensions.

These embeddings are the inputs to the other part of the model which, you guessed it, is the head. Each task uses a different head, even when the underlying architecture is the same.

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
print(model)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

In [None]:
outputs = model(**inputs)
print(outputs)
print(outputs.last_hidden_state)
print(outputs.last_hidden_state.shape)

BaseModelOutput(last_hidden_state=tensor([[[ 0.5984,  0.2175,  0.3714,  ...,  0.5597,  0.7327, -0.5380],
         [ 0.8314,  0.3081,  0.2288,  ...,  0.4910,  0.8217, -0.3606],
         [ 0.7083,  0.3264,  0.2397,  ...,  0.4904,  0.7760, -0.3500],
         ...,
         [ 0.9967,  0.2553,  1.0954,  ...,  0.0643,  0.9568,  0.1535],
         [ 0.4903,  0.1943,  0.9168,  ...,  0.0867,  0.3059, -0.4790],
         [ 1.3664,  0.1974,  0.5670,  ...,  0.6967,  0.3468, -0.7772]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.2319,  0.8268, -0.0312,  ..., -0.0764, -0.8509, -0.1043],
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

## Using AutoModelForSequenceClassification for generating the logits output

Here, we use `AutoModelForSequenceClassification` to process the tokenized inputs again. Like `AutoModel`, it first converts the tokens to feature vectors, and then passes those vectors to the head, which classifies each sentence. The output holds one row of logits per sentence and one column per label, hence the shape `[2, 2]`.
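
To make the role of the head more concrete, here is a simplified stand-in written in plain Python. It is a sketch under assumptions, not the library's code: it assumes the head takes the hidden state of the first token (`[CLS]`), applies a linear layer with a ReLU, and then a final linear layer with one output per label. The function names, the tiny hidden size of 2, and all weights are made up for illustration.

```python
def linear(vector, weights, bias):
    """Compute weights @ vector + bias for one vector (weights is out x in)."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def toy_classification_head(last_hidden_state, w_pre, b_pre, w_cls, b_cls):
    logits = []
    for sequence in last_hidden_state:          # one entry per sentence
        cls_vector = sequence[0]                # hidden state of the first token
        hidden = linear(cls_vector, w_pre, b_pre)
        hidden = [max(0.0, h) for h in hidden]  # ReLU
        logits.append(linear(hidden, w_cls, b_cls))
    return logits


# Hypothetical input: batch of 2 sentences, 3 tokens each, hidden size 2.
last_hidden_state = [
    [[1.0, 2.0], [0.5, 0.5], [0.0, 1.0]],
    [[-1.0, 3.0], [0.2, 0.1], [0.3, 0.3]],
]
w_pre, b_pre = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]    # made-up weights
w_cls, b_cls = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]  # made-up weights

logits = toy_classification_head(last_hidden_state, w_pre, b_pre, w_cls, b_cls)
print(logits)
```

The point of the sketch is the change in shape: the body produces one vector per token, while the head reduces that to one row of scores per sentence.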

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs)
print(outputs.logits.shape)
print(outputs.logits)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3027,  4.6586],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
torch.Size([2, 2])
tensor([[-4.3027,  4.6586],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


## Using PyTorch to apply `softmax` to the logits and get the probability distribution

The model outputs logits: the raw, unnormalized scores produced by its last layer. They must pass through a softmax function to become a probability distribution.

The resulting 2x2 tensor holds the probabilities for the 2 sentences. Each row is one sentence, with index 0 giving the probability of the negative label and index 1 the probability of the positive label. The mapping from index to label comes from `model.config.id2label`.
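
As a worked example, softmax can be computed by hand in plain Python: exponentiate each logit and divide by the row sum. The sketch below uses the logits printed in the previous cell, which are rounded to four decimals, so the results should agree with PyTorch only to about that precision. The helper name `softmax` here is a local function, not the PyTorch one.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    largest = max(row)
    # Subtracting the max leaves the result unchanged but avoids overflow.
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits as printed in the previous cell (rounded to four decimals).
logits = [[-4.3027, 4.6586], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([f"{p:.4e}" for p in row], "sum =", round(sum(row), 6))
```

Each row sums to 1, and the larger logit in a row always gets the larger probability, which is why the predicted class can also be read directly from the logits.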

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
print(model.config.id2label)

tensor([[1.2826e-04, 9.9987e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
{0: 'NEGATIVE', 1: 'POSITIVE'}
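
The last step the pipeline performs is turning these probabilities into labelled results. A minimal sketch of that post-processing, using the values printed above as plain Python lists, could look like this; `to_pipeline_format` is a hypothetical helper for illustration, not part of the Transformers API.

```python
# Probabilities and label mapping copied from the output above.
predictions = [[1.2826e-04, 9.9987e-01], [9.9946e-01, 5.4418e-04]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def to_pipeline_format(rows, id2label):
    """Pick the most probable class per row and attach its label and score."""
    results = []
    for row in rows:
        best = max(range(len(row)), key=row.__getitem__)  # index of the max
        results.append({"label": id2label[best], "score": row[best]})
    return results


results = to_pipeline_format(predictions, id2label)
print(results)
```

The labels and scores line up with what `pipeline("sentiment-analysis")` returned at the start of the notebook: `POSITIVE` for the first sentence and `NEGATIVE` for the second.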
