
In [None]:
!pip install transformers torch

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

In [None]:
text = "Natural language processing is transforming artificial intelligence"

In [12]:
inputs = tokenizer(
    text,
    return_tensors="pt",
    padding=True,
    truncation=True
)

inputs

{'input_ids': tensor([[  101,  3019,  2653,  6364,  2003, 17903,  7976,  4454,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [13]:
# Inference only: skip gradient tracking to save memory and time.
with torch.no_grad():
    outputs = model(**inputs)

In [14]:
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.3168,  0.0616, -0.1400,  ..., -0.5041, -0.2429,  0.2568],
         [ 0.3129,  0.2846,  0.0033,  ..., -0.2937,  0.3206,  0.3899],
         [-0.6273,  0.2300,  0.4675,  ..., -0.7940, -0.4448,  0.4404],
         ...,
         [ 0.3122,  0.4512,  0.0710,  ..., -0.6283,  0.0959,  1.0699],
         [-0.1590, -0.1183, -0.3428,  ..., -0.6872, -0.0175,  0.4321],
         [ 0.6419,  0.0927, -0.4300,  ..., -0.0649, -0.9082, -0.1934]]]), pooler_output=tensor([[-0.9357, -0.4791, -0.3929,  0.7156,  0.2830, -0.5229,  0.7785,  0.4939,
         -0.2261, -1.0000, -0.3376,  0.7401,  0.9899, -0.2240,  0.9055, -0.6604,
         -0.4744, -0.6179,  0.4350, -0.4089,  0.6952,  0.9996,  0.0974,  0.4790,
          0.4639,  0.8681, -0.7765,  0.9551,  0.9568,  0.7455, -0.7003,  0.4731,
         -0.9929, -0.2733, -0.5125, -0.9887,  0.5664, -0.7014, -0.0141, -0.2808,
         -0.8763,  0.4440,  0.9998, -0.5939,  0.5225, -0.4677, -1.0000,  0.

In [15]:
token_embeddings = outputs.last_hidden_state
token_embeddings.shape

torch.Size([1, 9, 768])

In [16]:
# Position 0 of every sequence holds the [CLS] token.
cls_embedding = token_embeddings[:, 0, :]
cls_embedding.shape

torch.Size([1, 768])

In [17]:
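attention_mask = inputs["attention_mask"]
# Broadcast the mask to the embedding shape so padded positions contribute zero.
mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()

masked_embeddings = token_embeddings * mask
# Divide by the number of real tokens; the clamp guards against division by zero.
sentence_embedding = masked_embeddings.sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)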

sentence_embedding.shape

torch.Size([1, 768])
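
With a single unpadded sentence the attention mask is all ones, so the masking step above has no visible effect. The sketch below repeats the same arithmetic in plain Python on a toy sequence with one padding position, to show why the mask matters once inputs are padded. The `masked_mean` helper and the toy values are hypothetical illustrations, not part of the notebook or of any library.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; ignore the rest."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding row to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy sequence (made-up values): two real tokens, then one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
# Naive mean over all positions, padding included, for comparison.
naive = [sum(column) / len(tokens) for column in zip(*tokens)]

print(pooled)  # [2.0, 3.0]
print(naive)   # skewed by the padding vector
```

The masked result averages only the two real tokens, while the naive mean is pulled toward the padding vector. The tensor version in the cell above performs the same computation for every sequence in a batch at once.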

- This notebook uses a pretrained transformer (`bert-base-uncased`) as a feature extractor: the model runs in inference mode and no weights are updated.
- Token-level embeddings come from the model's last hidden layer (`last_hidden_state`), one 768-dimensional vector per token.
- Two sentence-level representations are shown: the vector at the `[CLS]` position and the attention-masked mean of all token vectors.
- Such features are commonly used as inputs to downstream NLP tasks such as classification and semantic similarity.
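
As one illustration of the similarity use case mentioned above, sentence embeddings are often compared with cosine similarity. The following is a minimal sketch in plain Python; the `cosine_similarity` helper and the three short vectors are hypothetical stand-ins for real 768-dimensional sentence embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: emb_b points the same way as emb_a, emb_c is orthogonal.
emb_a = [1.0, 0.0, 1.0]
emb_b = [2.0, 0.0, 2.0]
emb_c = [0.0, 1.0, 0.0]

sim_ab = cosine_similarity(emb_a, emb_b)  # close to 1: same direction
sim_ac = cosine_similarity(emb_a, emb_c)  # 0: no shared direction

print(round(sim_ab, 6), sim_ac)
```

For tensors such as `sentence_embedding`, `torch.nn.functional.cosine_similarity` computes the same quantity without leaving PyTorch.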
