# Embeddings and LLMs

+ Embeddings are dense vector representations of text, capturing semantic information. 
+ Similar texts have similar embeddings, allowing for tasks like clustering and similarity search.
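
As a minimal sketch of similarity search, the snippet below ranks a few hand-made 3-dimensional vectors by cosine similarity to a query. The vectors, the texts attached to them, and the names `cosine_similarity` and `corpus` are illustrative assumptions, not outputs of a real model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (made up for illustration).
corpus = {
    "Cats are cute": [0.9, 0.1, 0.0],
    "Dogs are loyal": [0.8, 0.2, 0.1],
    "I code in Python": [0.1, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of an animal-related query

# Rank corpus entries from most to least similar to the query.
ranked = sorted(
    corpus,
    key=lambda text: cosine_similarity(query, corpus[text]),
    reverse=True,
)
best_match = ranked[0]
print(ranked)
```

Clustering relies on the same idea: vectors that score high against each other are grouped together.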

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

emb1 = model(**tok("Cats are cute", return_tensors="pt")).last_hidden_state[0, 0]
print("Embedding dimension (BERT base): ", len(emb1))

Embedding dimension (BERT base):  768


+ The `[0, 0]` indexing in the code above selects the first token of the first sequence in the batch, which is the $\texttt{[CLS]}$ token.
+ The $\texttt{[CLS]}$ token is a special token that the BERT tokenizer adds at the beginning of every input sequence.
+ The final hidden state for this token is commonly used as a summary representation of the whole input, although without fine-tuning it is not optimized for sentence similarity.
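
To make the indexing concrete, here is a small sketch that uses nested Python lists in place of the `last_hidden_state` tensor. The token list and the tiny hidden size of 2 are illustrative assumptions (the hidden size of BERT base is 768); with nested lists, the tensor index `[0, 0]` becomes `[0][0]`.

```python
# Stand-in for last_hidden_state with shape (batch=1, seq_len=5, hidden=2).
# Values are made up; a real model would produce 768 numbers per token.
tokens = ["[CLS]", "cats", "are", "cute", "[SEP]"]
last_hidden_state = [
    [  # sequence 0 in the batch
        [0.1, 0.2],  # [CLS]
        [0.3, 0.4],  # cats
        [0.5, 0.6],  # are
        [0.7, 0.8],  # cute
        [0.9, 1.0],  # [SEP]
    ]
]

# First index picks the sequence, second index picks the token position.
batch_index, token_index = 0, 0
cls_vector = last_hidden_state[batch_index][token_index]

print(tokens[token_index], cls_vector)
```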

In [2]:
emb2 = model(**tok("Dogs are loyal", return_tensors="pt")).last_hidden_state[0, 0]
print("Cosine similarity of generic embeddings: ", torch.cosine_similarity(emb1, emb2, dim=0).item())

Cosine similarity of generic embeddings:  0.9672764539718628


## Training a Simple Embedding Model

1. **Dataset:** 6 text samples, each labeled with a 3D vector (e.g., `[1, 0, 0]` for ML, `[0, 1, 0]` for animals, etc.).  
2. **Tokenization:** Text is tokenized using DistilBERT, generating `input_ids` and `attention_mask`.  
3. **Model Architecture:** DistilBERT + Linear Layer (768 → 3) projects CLS token to 3D label space.  
4. **Loss Function:** MSE Loss measures the difference between predicted embeddings and target vectors.  
5. **Training Loop:** Forward pass extracts CLS embedding, applies linear layer, computes loss, and backpropagates.  
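
The forward pass and loss in steps 3 to 5 can be sketched without any deep learning library. The example below uses a hypothetical 4-dimensional CLS vector and hand-picked weights instead of DistilBERT's 768 dimensions; `linear` and `mse_loss` are illustrative helper names, and `mse_loss` averages over all elements, which matches the default `mean` reduction of `torch.nn.MSELoss`.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, with weight stored as one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def mse_loss(pred, target):
    """Mean of squared element-wise differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical CLS vector (4 dims here; 768 in the real model).
cls_output = [1.0, 0.0, 2.0, 1.0]

# Hypothetical 4 -> 3 projection weights and biases.
weight = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.0],
]
bias = [0.0, 0.0, 0.0]

pred = linear(cls_output, weight, bias)  # [0.5, 0.0, 0.5]
target = [1.0, 0.0, 0.0]                 # label vector for the ML class
loss = mse_loss(pred, target)            # (0.25 + 0.0 + 0.25) / 3

print(pred, loss)
```

Backpropagation then adjusts both the projection weights and the encoder weights to reduce this value.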

In [3]:
import torch
from transformers import AutoTokenizer, AutoModel, Trainer, TrainingArguments
from torch.utils.data import Dataset

# --- Toy Dataset ---
class ToyDataset(Dataset):
    def __init__(self):
        self.samples = [
            ("I love machine learning", [1, 0, 0]),
            ("I enjoy deep learning", [1, 0, 0]),
            ("Cats are cute", [0, 1, 0]),
            ("Dogs are loyal", [0, 1, 0]),
            ("Python is great for programming", [0, 0, 1]),
            ("I code in Python", [0, 0, 1])
        ]
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        text, label = self.samples[idx]
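        # Pad or truncate every sample to 16 tokens so batches stack cleanly
        inputs = self.tokenizer(
            text, padding="max_length", truncation=True, max_length=16, return_tensors="pt"
        )
        return {
            # squeeze(0) drops the batch dimension added by return_tensors="pt"
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0),
            "labels": torch.tensor(label),
        }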

# --- Model Definition ---
class SimpleEmbedder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = AutoModel.from_pretrained("distilbert-base-uncased")
        self.embedding_layer = torch.nn.Linear(768, 3)
        self.loss_fn = torch.nn.MSELoss()

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]
        embeddings = self.embedding_layer(cls_output)

        if labels is not None:
            loss = self.loss_fn(embeddings, labels.float())
            return {'loss': loss, 'embeddings': embeddings}
        return {'embeddings': embeddings}


# --- Training ---
dataset = ToyDataset()
train_args = TrainingArguments(
    output_dir="./embeddings_model",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_dir="./logs"
)

model = SimpleEmbedder()
trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=dataset
)

trainer.train()



TrainOutput(global_step=9, training_loss=0.16863920953538683, metrics={'train_runtime': 2.909, 'train_samples_per_second': 6.188, 'train_steps_per_second': 3.094, 'total_flos': 0.0, 'train_loss': 0.16863920953538683, 'epoch': 3.0})
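
The `global_step=9` in the output follows from the dataset size and the training arguments. Here is a quick check, assuming a single device and no gradient accumulation (the `Trainer` defaults), so that each batch is one optimizer step. The runtime is copied from the printed metrics.

```python
import math

num_samples = 6         # len(ToyDataset)
batch_size = 2          # per_device_train_batch_size
num_epochs = 3          # num_train_epochs
train_runtime = 2.909   # seconds, from the printed metrics

steps_per_epoch = math.ceil(num_samples / batch_size)  # 3 batches per epoch
total_steps = steps_per_epoch * num_epochs

# Throughput figures reported by the Trainer can be recovered the same way.
samples_per_second = num_samples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```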

In [4]:
# --- Generating Embeddings ---
model.eval()
sample_text = "I love coding"
tokenizer = dataset.tokenizer
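inputs = tokenizer(sample_text, return_tensors="pt")
# Trainer may have moved the model to an accelerator, so put the inputs on the same device
device = next(model.parameters()).device
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
with torch.no_grad():
    embedding = model(**inputs)

# Move the result back to the CPU before converting it to NumPy
print("Embedding for sample text:", embedding["embeddings"].cpu().numpy())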

Embedding for sample text: [[0.6994633  0.03243814 0.20225406]]
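
One way to read this 3D output is to compare it with the three target vectors used in training. The sketch below does so with squared Euclidean distance; the class names and the choice of distance are assumptions made for illustration, and the embedding is copied from the printed output above.

```python
# Embedding printed for "I love coding" (copied from the output above).
embedding = [0.6994633, 0.03243814, 0.20225406]

# Target vectors from the toy dataset; the names are illustrative labels.
targets = {
    "machine learning": [1, 0, 0],
    "animals": [0, 1, 0],
    "programming": [0, 0, 1],
}

def squared_distance(a, b):
    """Sum of squared element-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

distances = {name: squared_distance(embedding, vec) for name, vec in targets.items()}
nearest = min(distances, key=distances.get)

print(nearest, distances)
```

For this run the nearest target is the machine learning one rather than programming. With only six samples and three epochs, the model has likely had little opportunity to separate the classes.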
