
In [1]:
pip install transformers



In [2]:
from transformers import AutoTokenizer, AutoModel
import torch

# Define the sentence
sentence = "The cat sat on the mat."

# Load pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize the sentence
# return_tensors='pt' returns PyTorch tensors
inputs = tokenizer(sentence, return_tensors='pt')

# Run the model to get the embeddings
# model(**inputs) returns a BaseModelOutputWithPoolingAndCrossAttentions object
# whose last_hidden_state holds one contextualized embedding per token (not per word)
with torch.no_grad(): # Disable gradient calculation for inference
    outputs = model(**inputs)

# The last_hidden_state contains the contextualized embeddings for each token
# Its shape is (batch_size, sequence_length, hidden_size)
# For a single sentence, batch_size is 1
embeddings = outputs.last_hidden_state

print("Sentence:", sentence)
print("Token IDs:", inputs['input_ids'])
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]))
print("Embeddings shape:", embeddings.shape)
print("First 5 embeddings for the first token:", embeddings[0, 0, :5])


Sentence: The cat sat on the mat.
Token IDs: tensor([[  101,  1996,  4937,  2938,  2006,  1996, 13523,  1012,   102]])
Tokens: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']
Embeddings shape: torch.Size([1, 9, 768])
First 5 embeddings for the first token: tensor([-0.3642, -0.0531, -0.3673, -0.0297, -0.4608])


### Explanation of the Code:

1.  **`pip install transformers`**: Installs the Hugging Face `transformers` library, which provides access to pre-trained models like BERT.

2.  **`from transformers import AutoTokenizer, AutoModel`**: Imports the necessary classes.
    *   `AutoTokenizer`: A class that automatically loads the correct tokenizer for a given pre-trained model.
    *   `AutoModel`: A class that automatically loads the correct model architecture for a given pre-trained model.

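3.  **`tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")`**: Loads the pre-trained tokenizer that matches the `"bert-base-uncased"` checkpoint. It splits text into WordPiece tokens and maps each token to an integer ID. "Uncased" means the text is lowercased first, which is why `The` appears as `the` in the printed tokens.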

4.  **`model = AutoModel.from_pretrained("bert-base-uncased")`**: Loads the pre-trained BERT model itself, which will generate the embeddings.

5.  **`inputs = tokenizer(sentence, return_tensors='pt')`**: This is the core tokenization step.
    *   The `sentence` is passed to the tokenizer.
    *   `return_tensors='pt'` tells the tokenizer to return PyTorch tensors, which is the input type the PyTorch model expects.
    *   The returned `inputs` object is dictionary-like. For BERT it contains `input_ids` (the token IDs), `token_type_ids` (segment IDs, all 0 for a single sentence), and `attention_mask` (1 for real tokens, 0 for padding).

6.  **`with torch.no_grad(): outputs = model(**inputs)`**: Passes the tokenized inputs through the BERT model to get the outputs.
    *   `torch.no_grad()` is used to disable gradient calculations during inference, saving memory and computation since we're not training the model.
    *   `model(**inputs)` unpacks the `inputs` dictionary into keyword arguments for the model.

7.  **`embeddings = outputs.last_hidden_state`**: The `outputs` object holds several tensors, including `last_hidden_state` and `pooler_output`. `last_hidden_state` holds the contextualized embedding of each token in the input sequence, taken from the final encoder layer. The shape of `embeddings` is `(batch_size, sequence_length, hidden_size)`:
    *   `batch_size`: The number of sentences processed (here, 1).
    *   `sequence_length`: The total number of tokens in the sentence, including the special tokens `[CLS]` and `[SEP]` (here, 9).
    *   `hidden_size`: The dimensionality of the embedding for each token (for `bert-base-uncased`, this is 768).
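
The indexing used in the final `print` call can be tried without downloading the model. The sketch below is only an illustration of the shape: it builds a placeholder tensor of zeros with the same dimensions as `last_hidden_state` in this example, so the values are not real embeddings. The names `cls_vector`, `first_five`, and `all_tokens` are made up for this illustration.

```python
import torch

# Placeholder with the same shape as last_hidden_state in this example:
# 1 sentence, 9 tokens, 768 dimensions. Zeros, not real embeddings.
embeddings = torch.zeros(1, 9, 768)

cls_vector = embeddings[0, 0]       # full embedding of the first token ([CLS])
first_five = embeddings[0, 0, :5]   # first 5 of its 768 dimensions
all_tokens = embeddings[0]          # one 768-dimensional row per token

print(cls_vector.shape)   # torch.Size([768])
print(first_five.shape)   # torch.Size([5])
print(all_tokens.shape)   # torch.Size([9, 768])
```

The first index selects the sentence in the batch, the second selects the token, and the third selects dimensions within that token's embedding.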

The printed output shows the token IDs, their corresponding tokens, the shape of the resulting embeddings, and the first five dimensions of the embedding for the first token (`[CLS]`).
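
Steps 5 and 6 can also be sketched in plain Python to show what the tokenizer output looks like and what `**inputs` does. This is a toy stand-in, not the real tokenizer or model: `TOY_VOCAB`, `toy_encode`, and `toy_forward` are hypothetical names, the vocabulary holds only the tokens and IDs from the printed output, and the sketch returns flat lists where the real tokenizer returns batched tensors and splits unknown words into WordPiece subwords.

```python
import re

# Hypothetical mini-vocabulary: only the tokens and IDs from the printed output.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "the": 1996, "cat": 4937, "sat": 2938, "on": 2006, "mat": 13523, ".": 1012,
}

def toy_encode(sentence):
    """Toy stand-in for tokenizer(sentence): lowercase, split, map tokens to IDs."""
    # An "uncased" tokenizer lowercases the text before the vocabulary lookup.
    pieces = re.findall(r"\w+|[^\w\s]", sentence.lower())
    # Special tokens mark the start and end of the sequence.
    tokens = ["[CLS]"] + pieces + ["[SEP]"]
    input_ids = [TOY_VOCAB[token] for token in tokens]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),   # single sentence: all segment 0
        "attention_mask": [1] * len(input_ids),   # no padding: every token is real
    }

def toy_forward(input_ids=None, attention_mask=None, token_type_ids=None):
    """Hypothetical stand-in for the model call; reports which arguments it received."""
    received = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }
    return sorted(name for name, value in received.items() if value is not None)

encoded = toy_encode("The cat sat on the mat.")
print(encoded["input_ids"])

# ** turns each dictionary key into a keyword argument of the same name.
print(toy_forward(**encoded))
```

The first line printed should match the token IDs in the notebook output, and the second lists the three keyword arguments that the unpacked dictionary supplied.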