# Recreate Verse Embeddings JSON

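This notebook regenerates the verse embeddings JSON for VerseJet from the original verse text file (`verses-1769.json`) using OpenAI's `text-embedding-3-small` embedding model. The result is written to `VersejetKJV_recreated.json`, so the existing `VersejetKJV.json` is not overwritten.

The output is a list of records with the same structure as `VersejetKJV.json`. Each record has these fields:
- `_id` (a generated UUID, stored as `{"$oid": "<uuid>"}`)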
- `book`
- `chapter`
- `verse`
- `cleaned_text`
- `embedding` (list of floats)

Once the new file has been generated, use it in place of the existing embeddings JSON so that the stored verse embeddings match the embedding model VerseJet uses.

In [None]:
# Install dependencies first (restart the kernel afterwards if prompted)
%pip install --quiet python-dotenv "openai>=1.0"

# Import required libraries
import json
import os
import re
import time
import uuid
from typing import Dict, List

import openai
from dotenv import load_dotenv

# Load OPENAI_API_KEY from a local .env file, if one exists
load_dotenv()

# Read the API key from the environment rather than hard-coding it here
openai.api_key = os.getenv('OPENAI_API_KEY')
if not openai.api_key:
    raise ValueError('OPENAI_API_KEY environment variable is not set')

Note: you may need to restart the kernel to use updated packages.


In [4]:
# Confirm the key is loaded without echoing the secret into the notebook output
bool(os.getenv('OPENAI_API_KEY'))

True

In [5]:
# Constants
VERSES_TEXT_PATH = '../verses-1769.json'  # adjust if you put the notebook elsewhere
OUTPUT_EMBEDDING_JSON_PATH = '../VersejetKJV_recreated.json'
EMBEDDING_MODEL = 'text-embedding-3-small'

# Regular expression to parse Bible reference
REF_REGEX = re.compile(r'^(.+?)\s+(\d+):(\d+)$')

MAX_BATCH_SIZE = 100  # number of texts embedded per API call (lower this if you hit rate limits)

def parse_reference(ref: str):
    """Parse a verse reference into book, chapter, and verse."""
    m = REF_REGEX.match(ref)
    if not m:
        raise ValueError(f'Invalid verse reference format: {ref}')
    book, chapter, verse = m.group(1), int(m.group(2)), int(m.group(3))
    return book, chapter, verse
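
As a quick check of the pattern above, the following minimal sketch repeats the regex and parser so that it runs on its own, then parses a few references. Only `1 Chronicles 10:1` is taken from this notebook; the other sample strings are illustrative and assume the keys in `verses-1769.json` follow the `Book Chapter:Verse` form.

```python
import re

# Same pattern as REF_REGEX above, repeated so this snippet runs on its own
REF_REGEX = re.compile(r'^(.+?)\s+(\d+):(\d+)$')

def parse_reference(ref: str):
    """Parse a verse reference into book, chapter, and verse."""
    m = REF_REGEX.match(ref)
    if not m:
        raise ValueError(f'Invalid verse reference format: {ref}')
    return m.group(1), int(m.group(2)), int(m.group(3))

# Book names may contain digits and spaces; the lazy (.+?) group grows
# until the remainder of the string is exactly "<chapter>:<verse>".
samples = ['1 Chronicles 10:1', 'Song of Solomon 2:1', 'Genesis 1:1']
parsed = [parse_reference(s) for s in samples]
for item in parsed:
    print(item)
```

A string without a `chapter:verse` suffix, such as `'Genesis'`, raises `ValueError`, which makes malformed keys visible early instead of producing a partial record.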

In [6]:
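from typing import Iterator  # chunk_list is a generator, so it returns an iterator

def chunk_list(lst: List, size: int) -> Iterator[List]:
    """Yield successive chunks of at most `size` items from `lst`."""
    for i in range(0, len(lst), size):
        yield lst[i:i + size]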
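
To illustrate how the chunking works out for this dataset, here is a small self-contained sketch. The list of integers is only a stand-in for the 31102 verse texts; the function is repeated so the snippet runs independently.

```python
from typing import Iterator, List

def chunk_list(lst: List, size: int) -> Iterator[List]:
    """Yield successive chunks of at most `size` items from `lst`."""
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

items = list(range(31102))               # stand-in for the verse texts
batches = list(chunk_list(items, 100))

print(len(batches))      # 312 batches in total
print(len(batches[0]))   # 100 items in a full batch
print(len(batches[-1]))  # 2 items in the final, partial batch
```

Because the function is a generator, the notebook never holds all batches in memory at once; `list(...)` is used here only to count them.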

In [7]:
def generate_embeddings(texts: List[str], model: str) -> List[List[float]]:
    """Generate embeddings for a list of texts using OpenAI API (v1.x)."""
    response = openai.embeddings.create(
        input=texts,
        model=model
    )
    # The new API returns an object with a 'data' attribute, which is a list of objects with 'embedding'
    embeddings = [item.embedding for item in response.data]
    return embeddings

In [8]:
# Load original verse texts
with open(VERSES_TEXT_PATH, 'r', encoding='utf-8') as f:
    verse_texts: Dict[str, str] = json.load(f)

print(f"Loaded {len(verse_texts)} verses from {VERSES_TEXT_PATH}")

Loaded 31102 verses from ../verses-1769.json


In [9]:
# Prepare list for embedding generation
refs = sorted(verse_texts.keys())
texts_for_embedding = [verse_texts[ref].strip() for ref in refs]

# Verify example
print(f"Example verse reference: {refs[0]}")
print(f"Example verse text: {texts_for_embedding[0][:80]}...")

Example verse reference: 1 Chronicles 10:1
Example verse text: Now the Philistines fought against Israel; and the men of Israel fled from befor...
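
Note that `sorted` compares the keys as plain strings, which is why `1 Chronicles 10:1` is listed first rather than `1 Chronicles 1:1`. This does not affect correctness, since each embedding is paired with its reference by index. If numeric chapter and verse order is preferred, one option is a sort key like the sketch below. `reference_sort_key` is a hypothetical helper, and books are still ordered alphabetically rather than canonically.

```python
import re

REF_REGEX = re.compile(r'^(.+?)\s+(\d+):(\d+)$')

def reference_sort_key(ref: str):
    """Hypothetical helper: sort by book name, then numeric chapter and verse."""
    m = REF_REGEX.match(ref)
    if not m:
        raise ValueError(f'Invalid verse reference format: {ref}')
    return m.group(1), int(m.group(2)), int(m.group(3))

sample_refs = ['1 Chronicles 10:1', '1 Chronicles 2:1', '1 Chronicles 1:1']

string_order = sorted(sample_refs)                           # character-by-character comparison
numeric_order = sorted(sample_refs, key=reference_sort_key)  # chapters compared as integers

print(string_order)
print(numeric_order)
```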


In [10]:
# Generate embeddings in batches to keep each request small
all_embeddings = []
total = len(texts_for_embedding)
num_batches = (total + MAX_BATCH_SIZE - 1) // MAX_BATCH_SIZE  # ceiling division
print(f"Generating embeddings for {total} verses in batches of {MAX_BATCH_SIZE}...")
for i, batch_texts in enumerate(chunk_list(texts_for_embedding, MAX_BATCH_SIZE), start=1):
    print(f"Processing batch {i}/{num_batches}")
    try:
        batch_embeddings = generate_embeddings(batch_texts, EMBEDDING_MODEL)
    except Exception as e:
        # Stop at the first failure; the assertion below then reports the shortfall
        print(f"Error generating embeddings at batch {i}: {e}")
        break
    all_embeddings.extend(batch_embeddings)
    time.sleep(0.1)  # brief pause between requests to stay under rate limits

assert len(all_embeddings) == total, f"Mismatch: expected {total}, got {len(all_embeddings)} embeddings"

Generating embeddings for 31102 verses in batches of 100...
Processing batch 1/312
Processing batch 2/312
Processing batch 3/312
Processing batch 4/312
Processing batch 5/312
Processing batch 6/312
Processing batch 7/312
Processing batch 8/312
Processing batch 9/312
Processing batch 10/312
Processing batch 11/312
Processing batch 12/312
Processing batch 13/312
Processing batch 14/312
Processing batch 15/312
Processing batch 16/312
Processing batch 17/312
Processing batch 18/312
Processing batch 19/312
Processing batch 20/312
Processing batch 21/3
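
The loop above stops at the first failed batch, so a single transient error means re-running all 312 requests. One possible mitigation is to retry a failed call with exponential backoff. The sketch below is an assumption, not part of the original notebook: `with_retries` is a hypothetical helper, and `flaky_embed` only simulates a failing API call so the example runs without network access.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar('T')

def with_retries(fn: Callable[[], T],
                 max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Hypothetical helper: call fn, retrying with exponential backoff on failure."""
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ... base_delay
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Simulated call that fails twice before succeeding
calls = {'n': 0}

def flaky_embed() -> List[List[float]]:
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('simulated rate limit')
    return [[0.1, 0.2, 0.3]]

# The sleep is replaced with a no-op so the demo finishes immediately
result = with_retries(flaky_embed, sleep=lambda seconds: None)
print(result, calls['n'])
```

In the notebook this could wrap the existing call as `with_retries(lambda: generate_embeddings(batch_texts, EMBEDDING_MODEL))`. Catching every `Exception` is deliberately broad for the sketch; narrowing it to rate-limit and connection errors would avoid retrying requests that can never succeed.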

In [11]:
# Construct output data structure matching VersejetKJV.json
output_verses = []

for idx, ref in enumerate(refs):
    book, chapter, verse = parse_reference(ref)
    cleaned_text = verse_texts[ref].strip()
    embedding = all_embeddings[idx]

    verse_entry = {
        "_id": {"$oid": str(uuid.uuid4())},
        "book": book,
        "chapter": chapter,
        "verse": verse,
        "cleaned_text": cleaned_text,
        "embedding": embedding
    }
    output_verses.append(verse_entry)

print(f"Constructed {len(output_verses)} verse records with embeddings")

Constructed 31102 verse records with embeddings
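
One caveat applies if the file is ever imported into MongoDB: in MongoDB Extended JSON, `$oid` is expected to hold a 24-character hexadecimal string, whereas `str(uuid.uuid4())` produces a 36-character string with hyphens. Whether VerseJet relies on this is not stated here, so the following is only a sketch of one alternative. `new_oid_hex` is a hypothetical helper that produces a value of the right shape.

```python
import re
import secrets
import uuid

def new_oid_hex() -> str:
    """Hypothetical helper: 12 random bytes rendered as 24 lowercase hex characters."""
    return secrets.token_hex(12)

uuid_id = str(uuid.uuid4())
oid_id = new_oid_hex()

# Compare the two shapes
print(len(uuid_id), uuid_id)
print(len(oid_id), oid_id)

# True when the value has the 24-hex-character shape expected under "$oid"
is_oid_shaped = re.fullmatch(r'[0-9a-f]{24}', oid_id) is not None
print(is_oid_shaped)
```

A random hex string matches the expected shape but lacks the timestamp and counter layout of a real ObjectId. If `pymongo` is available, `str(bson.ObjectId())` is the more conventional way to generate one.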


In [12]:
# Save to JSON file
with open(OUTPUT_EMBEDDING_JSON_PATH, 'w', encoding='utf-8') as f_out:
    json.dump(output_verses, f_out, indent=2)

print(f"Saved new embedding JSON to {OUTPUT_EMBEDDING_JSON_PATH}")

Saved new embedding JSON to ../VersejetKJV_recreated.json
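
Before swapping the new file in, a quick sanity check of the saved records can help. The sketch below assumes the default `text-embedding-3-small` output size of 1536 dimensions, which applies because no `dimensions` parameter is passed in this notebook. `validate_records` is a hypothetical helper, and the demo runs against a tiny temporary file with placeholder values rather than the real output.

```python
import json
import os
import tempfile

EXPECTED_DIM = 1536  # default size for text-embedding-3-small (assumes `dimensions` is not set)
REQUIRED_FIELDS = {'_id', 'book', 'chapter', 'verse', 'cleaned_text', 'embedding'}

def validate_records(path: str, expected_dim: int = EXPECTED_DIM) -> int:
    """Hypothetical helper: check fields and embedding length; return the record count."""
    with open(path, 'r', encoding='utf-8') as f:
        records = json.load(f)
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            raise ValueError(f'Record {i} is missing fields: {sorted(missing)}')
        if len(rec['embedding']) != expected_dim:
            raise ValueError(
                f"Record {i} has {len(rec['embedding'])} dimensions, expected {expected_dim}"
            )
    return len(records)

# Demo on a one-record file with a 4-dimensional placeholder embedding
sample = [{
    '_id': {'$oid': 'placeholder-id'},
    'book': 'Genesis',
    'chapter': 1,
    'verse': 1,
    'cleaned_text': 'placeholder text',
    'embedding': [0.0, 0.0, 0.0, 0.0],
}]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.json')
    with open(sample_path, 'w', encoding='utf-8') as f:
        json.dump(sample, f)
    count = validate_records(sample_path, expected_dim=4)

print(count)
```

Run against the real output as `validate_records('../VersejetKJV_recreated.json')`, the call should return 31102, matching the record count printed above.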
