# Lab 2: Chunk & Embed - From Text to Vectors

**Estimated Time:** 6-7 minutes

---

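## Step 1: Chunk Documents with VECTOR_CHUNKS

Embedding models have a maximum input size. Long documents need to be split into smaller **chunks** that fit within that limit.

The queries in this step call `VECTOR_CHUNKS` with the same settings each time: chunks are measured in words (`BY WORDS`), hold at most 200 words (`MAX 200`), repeat 40 words from the preceding chunk (`OVERLAP 40`), and break at sentence ends (`SPLIT BY SENTENCE`).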
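To make the effect of a word limit and an overlap concrete, here is a minimal Python sketch of overlapping word windows. It is a simplified illustration, not Oracle's implementation: the helper `chunk_by_words` is hypothetical, it splits on whitespace only, and it ignores sentence boundaries, so its output will not match `VECTOR_CHUNKS` exactly.

```python
def chunk_by_words(text, max_words=200, overlap=40):
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words. Hypothetical helper for
    illustration only; it does not respect sentence boundaries.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 500-word "document": w0 w1 ... w499
sample = " ".join(f"w{i}" for i in range(500))
chunks = chunk_by_words(sample)
print([len(c.split()) for c in chunks])  # prints [200, 200, 180]
```

The last 40 words of each window reappear at the start of the next one, which is the purpose of an overlap: text near a boundary is still seen with some of its surrounding context.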

In [None]:
print("=== CHUNKING A LONG DOCUMENT ===\n")
run_query("""
    SELECT ROWNUM AS chunk_num,
           SUBSTR(C.chunk_text, 1, 80) AS chunk_preview,
           LENGTH(C.chunk_text) AS chunk_chars
    FROM city_knowledge_base kb,
         VECTOR_CHUNKS(kb.content BY WORDS
             MAX 200
             OVERLAP 40
             SPLIT BY SENTENCE) C
    WHERE kb.title LIKE 'Harbor Bridge Annual%'
""")

In [None]:
print("=== CHUNKING A SHORT DOCUMENT ===\n")
run_query("""
    SELECT ROWNUM AS chunk_num,
           SUBSTR(C.chunk_text, 1, 80) AS chunk_preview,
           LENGTH(C.chunk_text) AS chunk_chars
    FROM city_knowledge_base kb,
         VECTOR_CHUNKS(kb.content BY WORDS
             MAX 200
             OVERLAP 40
             SPLIT BY SENTENCE) C
    WHERE kb.title LIKE '%Working Near Energized%'
""")

In [None]:
print("=== CHUNKS PER DOCUMENT ===\n")
run_query("""
    SELECT kb.doc_id,
           SUBSTR(kb.title, 1, 50) AS title,
           COUNT(*) AS chunk_count
    FROM city_knowledge_base kb,
         VECTOR_CHUNKS(kb.content BY WORDS
             MAX 200
             OVERLAP 40
             SPLIT BY SENTENCE) C
    GROUP BY kb.doc_id, kb.title
    ORDER BY chunk_count DESC
""")

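## Step 2: Create the Chunks Table

Store the chunks in their own table, linked back to the source document by `doc_id`. The `embedding` column is left empty here and filled in Step 3.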

In [None]:
with connection.cursor() as cursor:
    # Create the chunks table
    cursor.execute("""
        CREATE TABLE city_knowledge_chunks (
            chunk_id    NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            doc_id      NUMBER NOT NULL REFERENCES city_knowledge_base(doc_id),
            chunk_text  CLOB,
            chunk_pos   NUMBER,
            embedding   VECTOR
        )
    """)

    # Populate with chunks from all documents
    cursor.execute("""
        INSERT INTO city_knowledge_chunks (doc_id, chunk_text, chunk_pos)
        SELECT kb.doc_id,
               C.chunk_text,
               C.chunk_offset
        FROM city_knowledge_base kb,
             VECTOR_CHUNKS(kb.content BY WORDS
                 MAX 200
                 OVERLAP 40
                 SPLIT BY SENTENCE) C
    """)

    chunk_count = cursor.rowcount
    connection.commit()

print(f"Created city_knowledge_chunks table with {chunk_count} chunks.")
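As a rough sanity check on the reported chunk count, the arithmetic behind `MAX 200` and `OVERLAP 40` can be sketched in Python. This assumes pure word windows that advance by `max_words - overlap` words; the function `estimate_chunk_count` is hypothetical, and because `SPLIT BY SENTENCE` moves boundaries to sentence ends, the real counts may differ somewhat.

```python
import math


def estimate_chunk_count(word_count, max_words=200, overlap=40):
    """Estimate how many overlapping word windows cover word_count words.

    Hypothetical helper: assumes each window advances by
    (max_words - overlap) words and ignores sentence boundaries.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    if word_count <= 0:
        return 0
    if word_count <= max_words:
        return 1  # the whole document fits in a single chunk
    step = max_words - overlap
    # One full first window, then one more window per `step` words remaining.
    return 1 + math.ceil((word_count - max_words) / step)


for words in (120, 200, 201, 500, 1000):
    print(words, estimate_chunk_count(words))
# prints:
# 120 1
# 200 1
# 201 2
# 500 3
# 1000 6
```

If a document's actual chunk count is far from this estimate, its sentence structure or word count is probably different from what you expect.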

In [None]:
print("=== CHUNK TABLE SUMMARY ===\n")
run_query("""
    SELECT COUNT(*) AS total_chunks,
           ROUND(AVG(LENGTH(chunk_text))) AS avg_chunk_chars,
           MIN(LENGTH(chunk_text)) AS min_chunk_chars,
           MAX(LENGTH(chunk_text)) AS max_chunk_chars
    FROM city_knowledge_chunks
""")

## Step 3: Generate Vector Embeddings

The ONNX embedding model loaded in the database (`doc_model` in the statement below) converts each chunk into a vector, directly inside the database.

In [None]:
with connection.cursor() as cursor:
    cursor.execute("""
        UPDATE city_knowledge_chunks
        SET embedding = VECTOR_EMBEDDING(
            doc_model USING chunk_text
        )
    """)
    updated = cursor.rowcount
    connection.commit()

print(f"Generated embeddings for {updated} chunks.")

In [None]:
with connection.cursor() as cursor:
    cursor.execute("""
        SELECT chunk_id,
               SUBSTR(chunk_text, 1, 60) AS preview,
               VECTOR_DIMENSION_COUNT(embedding) AS dimensions,
               embedding
        FROM city_knowledge_chunks
        WHERE ROWNUM = 1
    """)
    row = cursor.fetchone()

print(f"Chunk ID:   {row[0]}")
print(f"Preview:    {row[1]}...")
print(f"Dimensions: {row[2]}")
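print("\nEmbedding (first 10 values):")
# A VECTOR column is fetched as a sequence of numbers (array.array in python-oracledb),
# so slice the values rather than the string representation.
print(list(row[3])[:10])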
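The individual numbers in an embedding are not meaningful on their own; what matters is how close two vectors are to each other. As a hedged illustration, the sketch below computes cosine similarity in plain Python. The function `cosine_similarity` and the three-dimensional toy vectors are made up for this example and are not real model output; real embeddings have the dimension count printed above.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, not real embeddings)
same_direction = cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction)  # prints 1.0
print(orthogonal)      # prints 0.0
```

Vectors pointing in the same direction score 1.0 regardless of length, and unrelated (orthogonal) vectors score 0.0.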

In [None]:
print("=== EMBEDDING VERIFICATION ===\n")
run_query("""
    SELECT COUNT(*) AS total_chunks,
           COUNT(embedding) AS with_embedding,
           COUNT(*) - COUNT(embedding) AS missing_embedding
    FROM city_knowledge_chunks
""")

Your knowledge base is now vectorized and ready for similarity search. **Proceed to Lab 3.**