## Generate embeddings

In this step, you take the extracted text chunks and the questions generated in the previous steps and compute an embedding for each of them with an embedding model.

### Download the model

Download the model in `GGUF` format from [Hugging Face](https://huggingface.co/Qwen/Qwen3-Embedding-8B-GGUF/tree/main). An 8B model needs a substantial amount of memory, so make sure you have enough RAM and VRAM.

Download the following file:

- `Qwen3-Embedding-8B-Q4_K_M.gguf`

After that, serve the model using [llama-server](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md):

```bash
llama-server \
  --model ~/.cache/llama.cpp/Qwen3-Embedding-8B-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --ctx-size 6144 \
  --batch-size 6144 \
  --ubatch-size 6144 \
  --embedding \
  --pooling last \
  --port 36912
```

If you run out of memory, offload fewer layers to the GPU by reducing the value of `--n-gpu-layers`.
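Loading the model can take a while, so requests sent right after starting the server may fail. The following is a minimal sketch, not part of the original notebook, that polls llama-server's `/health` endpoint until it answers with HTTP 200. The function name `wait_for_server` and the timeout values are assumptions chosen for illustration; it uses only the standard library.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(
    base_url: str, timeout_s: float = 120.0, interval_s: float = 2.0
) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not reachable yet, or it answered with an error status
            # (for example while the model is still loading).
            pass
        time.sleep(interval_s)
    return False
```

For example, `wait_for_server("http://localhost:36912")` returns `True` once the server is ready, or `False` if it does not become ready within the timeout.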

### Prepare the endpoint

In [4]:
import json

import requests

LLAMA_SERVER_URL = "http://localhost:36912"


def embed_text(text: str) -> list[float]:
    """Return the embedding of `text` from the llama-server embeddings endpoint."""
    response = requests.post(
        url=f"{LLAMA_SERVER_URL}/v1/embeddings",
        headers={
            "Content-Type": "application/json",
        },
        data=json.dumps({"input": text}),
        timeout=120,  # seconds; avoids hanging forever if the server stalls
    )
    # Raise on HTTP errors so that callers can record the failure.
    response.raise_for_status()

    response_json = response.json()
    return response_json["data"][0]["embedding"]
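
The function above sends one text per request. As a hedged sketch, and assuming the endpoint accepts a list of strings as `input` the way OpenAI-compatible embedding endpoints do, several texts can be embedded in one request. The names `parse_embeddings` and `embed_texts` are hypothetical, the timeout is an arbitrary choice, and the standard library is used here instead of `requests` only to keep the example free of dependencies.

```python
import json
import urllib.request


def parse_embeddings(response_json: dict) -> list[list[float]]:
    """Return the embeddings ordered by their "index" field.

    Falls back to the position in the list when an item has no "index".
    """
    items = response_json["data"]
    ordered = sorted(
        enumerate(items), key=lambda pair: pair[1].get("index", pair[0])
    )
    return [item["embedding"] for _, item in ordered]


def embed_texts(
    texts: list[str],
    base_url: str = "http://localhost:36912",
    timeout_s: float = 120.0,
) -> list[list[float]]:
    """Embed several texts with a single POST to /v1/embeddings."""
    request = urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps({"input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        response_json = json.load(response)

    embeddings = parse_embeddings(response_json)
    if len(embeddings) != len(texts):
        raise ValueError(
            f"Expected {len(texts)} embeddings, got {len(embeddings)}."
        )
    return embeddings
```

Very large lists may need to be split into smaller requests, depending on the server settings.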

### Load the data

In [5]:
prefix = "rog_strix_gaming_notebook_pc_unscanned_file_chunks"
text_chunk_file_path = f"../data/chunks/{prefix}.json"
questions_file_path = f"../data/question_answer_pairs/{prefix}_qa_pairs.json"

with open(text_chunk_file_path, "r") as f:
    text_chunks = json.load(f)

with open(questions_file_path, "r") as f:
    questions = json.load(f)

print(f"Loaded {len(text_chunks)} text chunks.")
print(f"Loaded {len(questions)} questions.")

Loaded 43 text chunks.
Loaded 377 questions.
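
Each question refers to a chunk through its `chunk_id`, so it can be worth checking that every reference resolves before spending time on embeddings. The helper below is a small sketch with a hypothetical name, `find_orphan_questions`; it assumes only the `id` and `chunk_id` keys that the later cells also rely on.

```python
def find_orphan_questions(text_chunks: list[dict], questions: list[dict]) -> list[dict]:
    """Return the questions whose chunk_id does not match any chunk id."""
    known_ids = {chunk["id"] for chunk in text_chunks}
    return [
        question for question in questions if question["chunk_id"] not in known_ids
    ]
```

An empty result from `find_orphan_questions(text_chunks, questions)` means that every question points at a loaded chunk.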


### Embed the text

Start by embedding the text chunks.

In [7]:
from tqdm import tqdm


model_name = "qwen3-embedding-8b"
failed_text_chunks = []

for chunk in tqdm(text_chunks, total=len(text_chunks)):
    chunk_id = chunk["id"]
    text_chunk = chunk["text_chunk"]

    try:
        text_chunk_embedding = embed_text(text_chunk)
        chunk["embedding"] = text_chunk_embedding
    except Exception as e:
        print(f"Failed to embed chunk ID {chunk_id}: {e}")
        failed_text_chunks.append({"id": chunk_id, "text_chunk": text_chunk})
        # Keep going so that one failure does not skip the remaining chunks.
        continue

100%|██████████| 43/43 [00:16<00:00,  2.64it/s]


In [8]:
print(f"Failed to embed {len(failed_text_chunks)} text chunks.")

Failed to embed 0 text chunks.
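
The loop above records failures in `failed_text_chunks` but does not retry them. If failures turn out to be transient, a retry wrapper with exponential backoff is one option. This is a sketch under assumptions: `embed_with_retries` is a hypothetical helper, and the number of attempts and the delays are arbitrary defaults.

```python
import time
from typing import Callable


def embed_with_retries(
    embed: Callable[[str], list[float]],
    text: str,
    attempts: int = 3,
    base_delay_s: float = 1.0,
) -> list[float]:
    """Call embed(text), retrying with exponential backoff on any exception."""
    last_error = None
    for attempt in range(attempts):
        try:
            return embed(text)
        except Exception as error:
            last_error = error
            # Wait 1x, 2x, 4x, ... the base delay, but not after the last attempt.
            if attempt < attempts - 1:
                time.sleep(base_delay_s * 2**attempt)
    raise RuntimeError(f"Embedding failed after {attempts} attempts.") from last_error
```

In the notebook, it could be called as `embed_with_retries(embed_text, text_chunk)` in place of the direct `embed_text(text_chunk)` call.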


Next, embed the questions in the same way.

In [10]:
failed_questions = []

for question_object in tqdm(questions, total=len(questions)):
    chunk_id = question_object["chunk_id"]
    question = question_object["question"]

    try:
        question_embedding = embed_text(question)
        question_object["embedding"] = question_embedding
    except Exception as e:
        print(f"Failed to embed question for chunk ID {chunk_id}: {e}")
        failed_questions.append({"chunk_id": chunk_id, "question": question})
        continue

100%|██████████| 377/377 [00:11<00:00, 32.63it/s]


In [11]:
print(f"Failed to embed {len(failed_questions)} questions.")

Failed to embed 0 questions.
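
Before merging, a quick sanity check is to confirm that every item received an embedding and that all vectors have the same length. The sketch below uses a hypothetical helper, `embedding_dimension_counts`, and assumes the `embedding` key set by the loops above.

```python
from collections import Counter


def embedding_dimension_counts(items: list[dict], key: str = "embedding") -> Counter:
    """Count items per embedding length; items without an embedding count as None."""
    return Counter(
        len(item[key]) if item.get(key) is not None else None for item in items
    )
```

Calling it on `text_chunks + questions` should give a counter with a single key; more than one key, or a `None` key, points at missing or inconsistent vectors.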


### Merge the lists

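The merged data is a dictionary with the following structure. Each `embeddings` dictionary is keyed by model name: the example shows `gemini-embedding-001`, and this notebook adds its vectors under `qwen3-embedding-8b`.

```json
{
    "chunks": [
        {
            "id": 0,
            "text_chunk": "",
            "embeddings": {
                "gemini-embedding-001": []
            }
        },
        ...
    ],
    "question_answer_pairs": [
        {
            "chunk_id": 0,
            "question": "",
            "embeddings": {
                "gemini-embedding-001": []
            }
        },
        ...
    ]
}
```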

Before building the merged list, the code checks whether the embeddings file already exists on disk. If it does, the file is loaded and the new embeddings are added under this model's name instead of recreating the list from scratch. This keeps the embeddings from other models intact, so you can add a new model to the `embeddings` dictionary without recomputing the ones already stored.

In [12]:
import os
import json

embedding_directory = "../data/embeddings"
os.makedirs(embedding_directory, exist_ok=True)

embedding_file_path = f"{embedding_directory}/{prefix}_embeddings.json"
if os.path.exists(embedding_file_path):
    print(f"{embedding_file_path} already exists. Loading existing embeddings.")
    with open(embedding_file_path, "r") as f:
        merged_list = json.load(f)

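    # Entries are matched by position, which assumes that the saved file and the
    # freshly loaded data list chunks and questions in the same order.
    for i, chunk in enumerate(text_chunks):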
        merged_list["chunks"][i]["embeddings"][model_name] = chunk["embedding"]

    for i, question in enumerate(questions):
        merged_list["question_answer_pairs"][i]["embeddings"][model_name] = question[
            "embedding"
        ]

else:
    merged_list = {"chunks": [], "question_answer_pairs": []}
    for chunk in text_chunks:
        merged_list["chunks"].append(
            {
                "id": chunk["id"],
                "text_chunk": chunk["text_chunk"],
                "embeddings": {model_name: chunk["embedding"]},
            }
        )

    for question in questions:
        merged_list["question_answer_pairs"].append(
            {
                "chunk_id": question["chunk_id"],
                "question": question["question"],
                "embeddings": {model_name: question["embedding"]},
            }
        )

print(
    f"Merged list has {len(merged_list['chunks'])} chunks and {len(merged_list['question_answer_pairs'])} question-answer pairs."
)

../data/embeddings/rog_strix_gaming_notebook_pc_unscanned_file_chunks_embeddings.json already exists. Loading existing embeddings.
Merged list has 43 chunks and 377 question-answer pairs.
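
When the file already exists, the cell above matches entries by position. If the order of the saved file and the loaded data could ever differ, matching chunks by `id` is a more defensive alternative. The following is a sketch only; `add_chunk_embeddings` is a hypothetical helper and not part of the original notebook.

```python
def add_chunk_embeddings(
    merged_chunks: list[dict], text_chunks: list[dict], model_name: str
) -> list:
    """Store each chunk's embedding under model_name, matching entries by id.

    Returns the ids of chunks that have no entry in merged_chunks.
    """
    entries_by_id = {entry["id"]: entry for entry in merged_chunks}
    missing_ids = []
    for chunk in text_chunks:
        entry = entries_by_id.get(chunk["id"])
        if entry is None:
            missing_ids.append(chunk["id"])
            continue
        entry["embeddings"][model_name] = chunk["embedding"]
    return missing_ids
```

Questions have no `id` of their own in this structure, so a similar helper for them would need another key, for example the pair of `chunk_id` and `question`.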


### Save the embeddings

In [13]:
with open(embedding_file_path, "w") as f:
    json.dump(merged_list, f)
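
Because the same file accumulates embeddings from several models, an interrupted write could damage earlier results. One way to reduce that risk, shown here as a sketch with a hypothetical helper name, is to write to a temporary file in the same directory and then replace the target with `os.replace`.

```python
import json
import os
import tempfile


def save_json_atomically(data, path: str) -> None:
    """Write data as JSON to a temporary file, then move it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the target so the final move stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if writing or moving failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the notebook, `save_json_atomically(merged_list, embedding_file_path)` would take the place of the plain `json.dump` call.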