 1. Installing Required Libraries

In [1]:
!pip install transformers torch --quiet

 2. Importing the tokenizer and model loader from transformers, along with PyTorch, NumPy, and the standard-library os module (used in the next step to set an environment variable).

In [2]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import os

 3. Choosing the distilbert-base-uncased model, a lightweight and fast BERT variant, and loading both its tokenizer and its weights. The symlink warning that appears on Windows systems is also suppressed.

In [3]:
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(f"Successfully loaded model and tokenizer: {model_name}")

Successfully loaded model and tokenizer: distilbert-base-uncased


 4. Defining descriptive, natural-language captions for flowers. These simulate real captions a user might input for text-to-image generation.

In [4]:
text_descriptions = [
    "The water lily is floating gently on the pond with large green leaves.",
    "A desert rose with thick succulent stems and vibrant pink petals.",
    "Gazania flowers bloom with bright orange and yellow striped petals.",
    "A wild pansy with a mix of purple, yellow, and white petals.",
    "An oxeye daisy with white petals and a sunny yellow center in a meadow.",
    "The columbine flower has a unique bell shape with purple and white colors.",
    "A tall sword lily with elegant, vibrant red flowers blooming upward.",
    "A bright orange dahlia with multiple layered petals and a dark center.",
    "A barbeton daisy showing vivid pink petals with a golden center."
]

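 5. Tokenizing the flower descriptions. With padding=True, shorter captions are padded to the length of the longest caption in the batch; with truncation=True, any caption longer than the model's maximum input length is cut. The attention mask marks real tokens with 1 and padding with 0.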

In [5]:
encoded_inputs = tokenizer(
    text_descriptions,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

input_ids = encoded_inputs["input_ids"]
attention_mask = encoded_inputs["attention_mask"]

print("Input IDs shape:", input_ids.shape)
print("Attention Mask shape:", attention_mask.shape)

Input IDs shape: torch.Size([9, 18])
Attention Mask shape: torch.Size([9, 18])
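
To make the relationship between padding and the attention mask concrete, here is a minimal pure-Python sketch. The helper `pad_batch` is hypothetical (it is not part of transformers), the token IDs are made up, and `pad_id=0` is an assumption for illustration; the real tokenizer handles all of this internally.

```python
# Illustrative only: pad_batch is a hypothetical helper, not a transformers API.
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded_ids, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded_ids.append(list(seq) + [pad_id] * n_pad)   # fill with pad_id
        mask.append([1] * len(seq) + [0] * n_pad)         # 1 = real token, 0 = padding
    return padded_ids, mask

# Made-up token IDs for two captions of different lengths.
toy_ids, toy_mask = pad_batch([[101, 2300, 7094, 102], [101, 102]])
print(toy_ids)   # [[101, 2300, 7094, 102], [101, 102, 0, 0]]
print(toy_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Both lists end up with the same rectangular shape, which is why the input IDs and the attention mask above report identical shapes.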


 6. Passing the tokenized input through the model to get the last hidden states: one 768-dimensional contextual embedding per token. torch.no_grad() turns off gradient tracking, since this is inference only.

In [6]:
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

last_hidden_states = outputs.last_hidden_state
print("Last hidden states shape:", last_hidden_states.shape)

Last hidden states shape: torch.Size([9, 18, 768])


 7. Converting the per-token embeddings into one fixed-size embedding per sentence by mean pooling. The attention mask is used so that padding tokens do not contribute to the average.

In [7]:
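def mean_pooling(hidden_states, attention_mask):
    # Broadcast the (batch, seq_len) mask to (batch, seq_len, hidden_size)
    mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
    # Sum only the embeddings of real tokens; padded positions are zeroed out
    sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
    # Count of real tokens per sentence, clamped to avoid division by zero
    sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
    return sum_embeddings / sum_mask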

sentence_embeddings = mean_pooling(last_hidden_states, attention_mask)
print("Sentence embeddings shape:", sentence_embeddings.shape)

Sentence embeddings shape: torch.Size([9, 768])
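
As a sanity check on why the mask matters, the following sketch repeats the same masked average on a tiny hand-made batch and compares it with a plain mean. It uses NumPy rather than PyTorch only to keep the example light; the arrays `demo_hidden` and `demo_mask` are invented for illustration and are not outputs of the model.

```python
import numpy as np

# Toy batch: 2 sentences, 3 token positions, 2 hidden dimensions.
demo_hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],     # no padding
])
demo_mask = np.array([[1, 1, 0], [1, 1, 1]])

# Add a trailing axis so the mask broadcasts over the hidden dimension.
mask = demo_mask[..., None].astype(float)

# Masked mean: sum over real tokens, divide by the number of real tokens.
masked_mean = (demo_hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)

# Plain mean over all positions, padding included.
naive_mean = demo_hidden.mean(axis=1)

print(masked_mean)  # first row is [2. 3.]: the padded position is ignored
print(naive_mean)   # first row is pulled far off by the padded position
```

For the second sentence, which has no padding, the two averages agree; only the padded sentence differs.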


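 8. Converting the pooled sentence embeddings into a NumPy array so they can be passed to libraries that expect NumPy input, such as scikit-learn. The first 10 of the 768 dimensions of the first caption (the water lily) are printed as a sample.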

In [8]:
sentence_embeddings_np = sentence_embeddings.cpu().numpy()
print("Sample vector for 'Water Lily':")
print(sentence_embeddings_np[0][:10])

Sample vector for 'Water Lily':
[-0.20325074  0.05302349 -0.00336746  0.2590729   0.30261055 -0.33806202
 -0.35584727  0.5285601  -0.16368869  0.00060811]
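
Once the embeddings are a NumPy array, one common next step is to compare captions by cosine similarity. The sketch below is one way to do that, not part of the notebook's pipeline: `cosine_similarity_matrix` is a hypothetical helper, and random vectors stand in for sentence_embeddings_np so the example runs on its own.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Stand-in for sentence_embeddings_np: random vectors with the same shape (9, 768).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(9, 768)).astype(np.float32)

similarity = cosine_similarity_matrix(fake_embeddings)
print(similarity.shape)  # (9, 9)
```

With the real array, `similarity[i, j]` would score how close caption `i` is to caption `j`, and each diagonal entry would be 1 up to floating-point error.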
