## Generate embeddings

In this step, you use an embedding model to embed the text chunks and the generated questions from the previous steps.

### Load the model

In [1]:
import torch

from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
model = model.to(device)
print(f"Model running on {device}.")

Model running on cuda.


### Load the data

In [2]:
import json

prefix = "rog_strix_gaming_notebook_pc_unscanned_file_chunks"
text_chunk_file_path = f"../data/chunks/{prefix}.json"
questions_file_path = f"../data/question_answer_pairs/{prefix}_qa_pairs.json"

with open(text_chunk_file_path, "r") as f:
    text_chunks = json.load(f)

with open(questions_file_path, "r") as f:
    questions = json.load(f)

print(f"Loaded {len(text_chunks)} text chunks.")
print(f"Loaded {len(questions)} questions.")

Loaded 43 text chunks.
Loaded 377 questions.


### Embed the text

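`Qwen3-Embedding` models are instruction-aware. The `encode` method accepts a `prompt_name` that selects one of the model's predefined prompts, and the selected prompt is prepended to the input text before it is embedded. The available prompts are stored in `model.prompts`: `query` carries a retrieval instruction, while `document` is empty.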

In [6]:
print(model.prompts)

{'query': 'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:', 'document': ''}
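
To make the effect of `prompt_name` concrete, the sketch below mimics what happens to the input before it is embedded: the selected prompt is prepended to the text. The `apply_prompt` helper is hypothetical and only illustrates the string handling; the real work happens inside `encode`. The prompt strings are copied from the output above, and the two sample sentences are made up for illustration.

```python
# Prompts copied from the `model.prompts` output above.
PROMPTS = {
    "query": (
        "Instruct: Given a web search query, retrieve relevant passages "
        "that answer the query\nQuery:"
    ),
    "document": "",
}


def apply_prompt(text: str, prompt_name: str) -> str:
    """Hypothetical helper: return the string that would be embedded."""
    return PROMPTS[prompt_name] + text


# Made-up sample inputs, not taken from the dataset.
question_input = apply_prompt("How do I turn on the notebook?", "query")
chunk_input = apply_prompt("Press the power button.", "document")

print(repr(question_input))
print(repr(chunk_input))
```

Because the `document` prompt is empty, a chunk is embedded as-is, while a question is embedded together with the retrieval instruction.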


Start by embedding the text chunks. The chunks are the passages to be retrieved, so they are embedded with the `document` prompt; the `query` prompt is reserved for the questions.

In [9]:
from tqdm import tqdm
from sentence_transformers import SentenceTransformer


def embed_text(
    model: SentenceTransformer, text: str, prompt_name: str
) -> list[float]:
    """Embed a single string and return the vector as a plain list."""
    embedding = model.encode(sentences=text, prompt_name=prompt_name).tolist()
    if not embedding:
        raise ValueError("No embedding returned from the model.")

    return embedding


model_name = "qwen3-embedding-0.6b"
failed_text_chunks = []
prompt_name = "query"  # used for the questions in the next step

for chunk in tqdm(text_chunks, total=len(text_chunks)):
    chunk_id = chunk["id"]
    text_chunk = chunk["text_chunk"]

    try:
        # Chunks are retrieval targets, so use the (empty) "document" prompt.
        text_chunk_embedding = embed_text(model, text_chunk, "document")
        chunk["embedding"] = text_chunk_embedding
    except Exception as e:
        print(f"Failed to embed chunk ID {chunk_id}: {e}")
        failed_text_chunks.append({"id": chunk_id, "text_chunk": text_chunk})
        continue

100%|██████████| 43/43 [00:04<00:00,  9.80it/s]


In [8]:
print(f"Failed to embed {len(failed_text_chunks)} text chunks.")

Failed to embed 0 text chunks.


Next, embed the questions with the `query` prompt.

In [10]:
failed_questions = []

for question_object in tqdm(questions, total=len(questions)):
    chunk_id = question_object["chunk_id"]
    question = question_object["question"]

    try:
        question_embedding = embed_text(model, question, prompt_name)
        question_object["embedding"] = question_embedding
    except Exception as e:
        print(f"Failed to embed question for chunk ID {chunk_id}: {e}")
        failed_questions.append({"chunk_id": chunk_id, "question": question})
        continue

100%|██████████| 377/377 [00:06<00:00, 61.78it/s]


In [11]:
print(f"Failed to embed {len(failed_questions)} questions.")

Failed to embed 0 questions.


### Merge the lists

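You will merge the chunks and the questions into a single dictionary with the following structure. Each `embeddings` dictionary is keyed by model name, so vectors from several models can be stored side by side:


```json
{
    "chunks": [
        {
            "id": 0,
            "text_chunk": "",
            "embeddings": {
                "gemini-embedding-001": [],
                "qwen3-embedding-0.6b": []
            }
        },
        ...
    ],
    "question_answer_pairs": [
        {
            "chunk_id": 0,
            "question": "",
            "embeddings": {
                "gemini-embedding-001": [],
                "qwen3-embedding-0.6b": []
            }
        },
        ...
    ]
}
```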

Before building the merged structure, the code checks whether the merged file already exists on disk. If it does, the file is loaded and the new embeddings are added under the current model's key instead of rebuilding everything from scratch. This keeps the embeddings that were already computed with other models and makes it easy to add new models to the `embeddings` dictionary without duplicating work.

In [12]:
import os
import json

embedding_directory = "../data/embeddings"
os.makedirs(embedding_directory, exist_ok=True)

embedding_file_path = f"{embedding_directory}/{prefix}_embeddings.json"
if os.path.exists(embedding_file_path):
    print(f"{embedding_file_path} already exists. Loading existing embeddings.")
    with open(embedding_file_path, "r") as f:
        merged_list = json.load(f)

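    # Add the new embeddings under this model's key. This relies on the saved
    # file listing chunks and questions in the same order as the in-memory lists.
    for i, chunk in enumerate(text_chunks):
        merged_list["chunks"][i]["embeddings"][model_name] = chunk["embedding"]

    for i, question in enumerate(questions):
        merged_list["question_answer_pairs"][i]["embeddings"][model_name] = question[
            "embedding"
        ]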

else:
    merged_list = {"chunks": [], "question_answer_pairs": []}
    for chunk in text_chunks:
        merged_list["chunks"].append(
            {
                "id": chunk["id"],
                "text_chunk": chunk["text_chunk"],
                "embeddings": {model_name: chunk["embedding"]},
            }
        )

    for question in questions:
        merged_list["question_answer_pairs"].append(
            {
                "chunk_id": question["chunk_id"],
                "question": question["question"],
                "embeddings": {model_name: question["embedding"]},
            }
        )

print(
    f"Merged list has {len(merged_list['chunks'])} chunks and {len(merged_list['question_answer_pairs'])} question-answer pairs."
)

../data/embeddings/rog_strix_gaming_notebook_pc_unscanned_file_chunks_embeddings.json already exists. Loading existing embeddings.
Merged list has 43 chunks and 377 question-answer pairs.
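
When the file already exists, the merge above matches entries by their position in the list. If the order of the saved file and the in-memory lists could ever differ, matching by `id` is a safer option. The sketch below is one possible variant, not the notebook's code: the hypothetical `add_model_embeddings` helper looks chunks up by `id` and adds a vector under a model key only when that key is missing. The vectors are toy values, and `model-a` and `model-b` are placeholder names.

```python
def add_model_embeddings(merged_chunks, new_chunks, model_name):
    """Add each new chunk's embedding under `model_name`, matching by `id`.

    Returns the number of chunks that received a new vector.
    """
    by_id = {chunk["id"]: chunk for chunk in merged_chunks}
    updated = 0
    for new_chunk in new_chunks:
        target = by_id.get(new_chunk["id"])
        if target is None:
            # Unknown chunk: append it with an empty embeddings dictionary.
            target = {
                "id": new_chunk["id"],
                "text_chunk": new_chunk["text_chunk"],
                "embeddings": {},
            }
            merged_chunks.append(target)
            by_id[target["id"]] = target
        if model_name not in target["embeddings"]:
            target["embeddings"][model_name] = new_chunk["embedding"]
            updated += 1
    return updated


# Toy data: one chunk that was already embedded with "model-a".
merged_chunks = [
    {"id": 0, "text_chunk": "first", "embeddings": {"model-a": [0.1, 0.2]}}
]
# The new chunks arrive in a different order than the saved ones.
new_chunks = [
    {"id": 1, "text_chunk": "second", "embedding": [0.5, 0.6]},
    {"id": 0, "text_chunk": "first", "embedding": [0.3, 0.4]},
]

first_run = add_model_embeddings(merged_chunks, new_chunks, "model-b")
second_run = add_model_embeddings(merged_chunks, new_chunks, "model-b")
print(first_run, second_run)
```

Unlike the cell above, which overwrites an existing entry for the same model, this variant leaves existing vectors untouched, so running it twice changes nothing the second time.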


### Save the embeddings

In [13]:
with open(embedding_file_path, "w") as f:
    json.dump(merged_list, f)
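
After saving, it can be worth reloading the file to confirm that it round-trips and that every entry has a vector for the expected model. The following is a minimal sketch on toy data written to a temporary directory; the `check_embeddings_file` helper and the `model-a` key are hypothetical and not part of the notebook.

```python
import json
import os
import tempfile


def check_embeddings_file(path, model_name):
    """Reload a merged embeddings file and return the set of vector lengths found."""
    with open(path, "r") as f:
        data = json.load(f)

    lengths = set()
    # A missing model key raises KeyError, which flags an incomplete entry.
    for entry in data["chunks"] + data["question_answer_pairs"]:
        lengths.add(len(entry["embeddings"][model_name]))
    return lengths


# Toy stand-in for the merged structure, with three-dimensional vectors.
toy = {
    "chunks": [
        {"id": 0, "text_chunk": "a", "embeddings": {"model-a": [0.1, 0.2, 0.3]}}
    ],
    "question_answer_pairs": [
        {"chunk_id": 0, "question": "q?", "embeddings": {"model-a": [0.4, 0.5, 0.6]}}
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_embeddings.json")
    with open(toy_path, "w") as f:
        json.dump(toy, f)
    lengths = check_embeddings_file(toy_path, "model-a")

print(lengths)
```

A result with a single length means all vectors share one dimension. With the notebook's file, you would pass `embedding_file_path` and `model_name` instead of the toy values.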