## Generate embeddings

You will take the extracted text chunks and the questions generated in the previous steps and embed them with an embedding model.

### Load the data

In [2]:
import json

prefix = "rog_strix_gaming_notebook_pc_unscanned_file_chunks"
text_chunk_file_path = f"../data/chunks/{prefix}.json"
questions_file_path = f"../data/question_answer_pairs/{prefix}_qa_pairs.json"

with open(text_chunk_file_path, "r") as f:
    text_chunks = json.load(f)

with open(questions_file_path, "r") as f:
    questions = json.load(f)

print(f"Loaded {len(text_chunks)} text chunks.")
print(f"Loaded {len(questions)} questions.")

Loaded 43 text chunks.
Loaded 377 questions.
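Before spending API calls, it can help to sanity-check the loaded records. The sketch below assumes the field names used later in this notebook (`id`, `text_chunk`, `chunk_id`, `question`). `validate_records` is a hypothetical helper, and the sample records are made up for illustration.

```python
def validate_records(text_chunks: list[dict], questions: list[dict]) -> list[str]:
    """Return a list of problems found; an empty list means the data looks consistent."""
    problems = []
    chunk_ids = set()

    for position, chunk in enumerate(text_chunks):
        if "id" not in chunk or "text_chunk" not in chunk:
            problems.append(f"chunk at position {position} is missing 'id' or 'text_chunk'")
            continue
        if chunk["id"] in chunk_ids:
            problems.append(f"duplicate chunk id {chunk['id']}")
        chunk_ids.add(chunk["id"])
        if not chunk["text_chunk"].strip():
            problems.append(f"chunk {chunk['id']} has empty text")

    for position, question in enumerate(questions):
        if "chunk_id" not in question or "question" not in question:
            problems.append(f"question at position {position} is missing 'chunk_id' or 'question'")
            continue
        # Every question should point back at a chunk that was actually loaded.
        if question["chunk_id"] not in chunk_ids:
            problems.append(
                f"question at position {position} references unknown chunk id {question['chunk_id']}"
            )

    return problems


# Made-up sample data: the second question points at a chunk that does not exist.
sample_chunks = [
    {"id": 0, "text_chunk": "First sample chunk."},
    {"id": 1, "text_chunk": "Second sample chunk."},
]
sample_questions = [
    {"chunk_id": 0, "question": "What does the first chunk say?"},
    {"chunk_id": 7, "question": "Which chunk is this about?"},
]

for problem in validate_records(sample_chunks, sample_questions):
    print(problem)
```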


### Embed the text

The OpenAI `text-embedding-3-large` model can be accessed via the OpenAI API. Here are some details about the model:

| Model name             | Price per 1M tokens (USD) | Batch price per 1M tokens (USD) | Max input tokens |
|------------------------|---------------------------|---------------------------------|------------------|
| text-embedding-3-large | $0.13                     | $0.065                          | 8192             |

Start by embedding the text chunks.

In [3]:
from tqdm import tqdm
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()


def embed_text(client: OpenAI, text: str, model_name: str) -> list[float]:
    """Return the embedding vector for a single text."""
    response = client.embeddings.create(input=text, model=model_name)
    embedding = response.data[0].embedding
    # Fail loudly on an empty vector so the caller can record the failure.
    if not embedding:
        raise ValueError("No embedding returned from the model.")

    return embedding


model_name = "text-embedding-3-large"
client = OpenAI()
failed_text_chunks = []

for chunk in tqdm(text_chunks, total=len(text_chunks)):
    chunk_id = chunk["id"]
    text_chunk = chunk["text_chunk"]

    try:
        text_chunk_embedding = embed_text(client, text_chunk, model_name)
        chunk["embedding"] = text_chunk_embedding
    except Exception as e:
        print(f"Failed to embed chunk ID {chunk_id}: {e}")
        failed_text_chunks.append({"id": chunk_id, "text_chunk": text_chunk})
        continue

100%|██████████| 43/43 [00:15<00:00,  2.72it/s]


In [4]:
print(f"Failed to embed {len(failed_text_chunks)} text chunks.")

Failed to embed 0 text chunks.
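The loop above records failures in `failed_text_chunks` but does not retry them. One possible approach is a retry wrapper with exponential backoff, sketched below. `with_retries` and `flaky_embed` are hypothetical names, and `flaky_embed` is a stand-in for `embed_text` so that the example runs without network access.

```python
import time
from typing import Callable


def with_retries(
    embed: Callable[[str], list[float]],
    text: str,
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[float]:
    """Call embed(text), retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1.")

    for attempt in range(1, max_attempts + 1):
        try:
            return embed(text)
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller record the failure.
            # Wait base_delay_s, then twice that, then four times that, and so on.
            sleep(base_delay_s * 2 ** (attempt - 1))


# Stand-in for embed_text (assumption): fails twice, then succeeds.
calls = {"count": 0}


def flaky_embed(text: str) -> list[float]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return [0.1, 0.2, 0.3]


delays = []  # Record the delays instead of actually sleeping.
vector = with_retries(flaky_embed, "hello", sleep=delays.append)
print(vector)  # [0.1, 0.2, 0.3]
print(delays)  # [1.0, 2.0]
```

In this notebook, the wrapper could be applied as `with_retries(lambda t: embed_text(client, t, model_name), text_chunk)`.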


In [5]:
failed_questions = []

for question_object in tqdm(questions, total=len(questions)):
    chunk_id = question_object["chunk_id"]
    question = question_object["question"]

    try:
        question_embedding = embed_text(client, question, model_name)
        question_object["embedding"] = question_embedding
    except Exception as e:
        print(f"Failed to embed question for chunk ID {chunk_id}: {e}")
        failed_questions.append({"chunk_id": chunk_id, "question": question})
        continue

100%|██████████| 377/377 [02:14<00:00,  2.81it/s]


In [6]:
print(f"Failed to embed {len(failed_questions)} questions.")

Failed to embed 0 questions.
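Sending one request per question took over two minutes for 377 questions. The embeddings endpoint also accepts a list of strings as `input`, so several texts can be embedded in a single request. The sketch below shows only the batching logic. `batched`, `embed_in_batches`, and `fake_embed_many` are hypothetical names, `fake_embed_many` stands in for a call such as `client.embeddings.create(input=batch, model=model_name)`, and the batch size of 4 is arbitrary.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1.")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(
    embed_many: Callable[[list[str]], list[list[float]]],
    texts: list[str],
    batch_size: int,
) -> list[list[float]]:
    """Embed texts batch by batch, keeping the results in input order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        batch_embeddings = embed_many(batch)
        # Guard against a response that silently drops or duplicates items.
        if len(batch_embeddings) != len(batch):
            raise ValueError("Embedding count does not match batch size.")
        embeddings.extend(batch_embeddings)
    return embeddings


# Stand-in for the real API call (assumption): one tiny vector per text.
def fake_embed_many(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]


sample_texts = [f"question {i}" for i in range(10)]
vectors = embed_in_batches(fake_embed_many, sample_texts, batch_size=4)
print(len(vectors))  # 10
```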


The requests I made cost practically nothing. The following cell estimates the total from local token counts and the standard price:

In [10]:
import tiktoken


def num_tokens_from_string(text: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(text))
    return num_tokens


total_tokens = 0
for chunk in text_chunks:
    text_chunk = chunk["text_chunk"]
    total_tokens += num_tokens_from_string(text_chunk, "cl100k_base")

for question_object in questions:
    question = question_object["question"]
    total_tokens += num_tokens_from_string(question, "cl100k_base")

price_per_million_tokens_usd = 0.13
total_cost_usd = (total_tokens / 1_000_000) * price_per_million_tokens_usd
print(f"Total tokens embedded: {total_tokens}")
print(f"Total cost: ${total_cost_usd:.6f}")

Total tokens embedded: 23188
Total cost: $0.003014
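The same arithmetic applies to the batch price listed in the table above. The sketch below reuses the token count printed by the previous cell; `estimate_cost_usd` is a hypothetical helper.

```python
def estimate_cost_usd(total_tokens: int, price_per_million_tokens_usd: float) -> float:
    """Convert a token count into a cost, given a price per one million tokens."""
    return total_tokens / 1_000_000 * price_per_million_tokens_usd


total_tokens = 23_188  # Token count reported above.
standard_cost = estimate_cost_usd(total_tokens, 0.13)
batch_cost = estimate_cost_usd(total_tokens, 0.065)

print(f"Standard: ${standard_cost:.6f}")  # Standard: $0.003014
print(f"Batch:    ${batch_cost:.6f}")  # Batch:    $0.001507
```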


### Merge the lists

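You will merge the chunks and questions into a single dictionary with the following structure. Each `embeddings` dictionary is keyed by the name of the embedding model, for example `gemini-embedding-001` or `text-embedding-3-large`: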


```json
{
    "chunks": [
        {
            "id": 0,
            "text_chunk": "",
            "embeddings": {
                "gemini-embedding-001": [],
            }
        },
        ...
    ],
    "question_answer_pairs": [
        {
            "chunk_id": 0,
            "question": "",
            "embeddings": {
                "gemini-embedding-001": [],
            }
        },
        ...
    ]
}
```

Before building the merged dictionary, the code checks whether an embeddings file already exists on disk. If it does, the file is loaded and the current model's embeddings are added to each entry's `embeddings` dictionary instead of recreating the structure from scratch. This preserves embeddings from previously used models and makes it easy to add new models without duplicating work.

In [11]:
import os
import json

embedding_directory = "../data/embeddings"
os.makedirs(embedding_directory, exist_ok=True)

embedding_file_path = f"{embedding_directory}/{prefix}_embeddings.json"
if os.path.exists(embedding_file_path):
    print(f"{embedding_file_path} already exists. Loading existing embeddings.")
    with open(embedding_file_path, "r") as f:
        merged_list = json.load(f)

    for i, chunk in enumerate(text_chunks):
        merged_list["chunks"][i]["embeddings"][model_name] = chunk["embedding"]

    for i, question in enumerate(questions):
        merged_list["question_answer_pairs"][i]["embeddings"][model_name] = question[
            "embedding"
        ]

else:
    merged_list = {"chunks": [], "question_answer_pairs": []}
    for chunk in text_chunks:
        merged_list["chunks"].append(
            {
                "id": chunk["id"],
                "text_chunk": chunk["text_chunk"],
                "embeddings": {model_name: chunk["embedding"]},
            }
        )

    for question in questions:
        merged_list["question_answer_pairs"].append(
            {
                "chunk_id": question["chunk_id"],
                "question": question["question"],
                "embeddings": {model_name: question["embedding"]},
            }
        )

print(
    f"Merged list has {len(merged_list['chunks'])} chunks and {len(merged_list['question_answer_pairs'])} question-answer pairs."
)

../data/embeddings/rog_strix_gaming_notebook_pc_unscanned_file_chunks_embeddings.json already exists. Loading existing embeddings.
Merged list has 43 chunks and 377 question-answer pairs.
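The update branch above matches entries by list position, which works as long as the saved file and the freshly loaded lists are in the same order. A more defensive variant could match chunks by `id`, as in the sketch below. `add_model_embeddings` is a hypothetical helper, and the sample data and model names are made up.

```python
def add_model_embeddings(
    merged_chunks: list[dict], new_chunks: list[dict], model_name: str
) -> int:
    """Attach each new chunk's embedding to the saved chunk with the same id.

    Returns the number of saved chunks that were updated.
    """
    by_id = {chunk["id"]: chunk for chunk in merged_chunks}
    updated = 0
    for new_chunk in new_chunks:
        target = by_id.get(new_chunk["id"])
        if target is None:
            raise KeyError(f"No saved chunk with id {new_chunk['id']}")
        target["embeddings"][model_name] = new_chunk["embedding"]
        updated += 1
    return updated


# Made-up data: the fresh list is deliberately in a different order.
saved = [
    {"id": 0, "text_chunk": "a", "embeddings": {"model-a": [0.1]}},
    {"id": 1, "text_chunk": "b", "embeddings": {"model-a": [0.2]}},
]
fresh = [
    {"id": 1, "text_chunk": "b", "embedding": [0.9]},
    {"id": 0, "text_chunk": "a", "embedding": [0.8]},
]

add_model_embeddings(saved, fresh, "model-b")
print(saved[0]["embeddings"])  # {'model-a': [0.1], 'model-b': [0.8]}
```

Questions have no `id` of their own, so the same idea would need a composite key there, such as the pair of `chunk_id` and `question`.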


### Save the embeddings

In [12]:
with open(embedding_file_path, "w") as f:
    json.dump(merged_list, f)
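
Because this file accumulates embeddings from several models, an interrupted write could lose earlier results. One possible safeguard is to write to a temporary file and then rename it over the target, as sketched below. `save_json_atomically` is a hypothetical helper, and the demo writes to a temporary directory rather than to `../data/embeddings`.

```python
import json
import os
import tempfile


def save_json_atomically(data: dict, path: str) -> None:
    """Write data as JSON to a temporary file, then move it over path."""
    directory = os.path.dirname(path) or "."
    # Create the temporary file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp_path, path)  # Replaces the old file in a single step.
    except BaseException:
        # Clean up the partial temporary file and leave the old target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo_embeddings.json")
    save_json_atomically({"chunks": [], "question_answer_pairs": []}, demo_path)
    with open(demo_path, "r") as f:
        reloaded = json.load(f)

print(reloaded)  # {'chunks': [], 'question_answer_pairs': []}
```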