## **1. Introduction**

Now that the text chunks are prepared and saved as JSON files, the next step is to vectorize and index them for semantic search. This notebook handles the core of the retrieval pipeline: converting each chunk's text into a vector embedding and storing it in a vector database.

The process involves three main steps:
1.  **Load the Data**: We'll load the two sets of chunks (`fixed_chunks.json` and `section_chunks.json`) that we created in the previous notebook.
2.  **Generate Embeddings**: We'll use a powerful sentence transformer model to convert the text content of each chunk into a numerical vector, also known as an embedding. These vectors capture the semantic meaning of the text.
3.  **Populate the Vector Database**: Finally, we'll set up a Weaviate vector database and import our chunks, storing their text, metadata, and corresponding vector embeddings.

---

## **2. Setup and Configuration**

### **2.1. Libraries**

This notebook relies on a few key libraries to handle the embedding and database operations:
* **`torch`**: Used to check for GPU availability, which can significantly speed up the embedding generation process.
* **`sentence-transformers`**: A library that provides an easy way to use state-of-the-art embedding models.
* **`tqdm`**: A simple utility for creating smart progress bars, which is helpful for monitoring long-running tasks like encoding hundreds of chunks.
* **`weaviate-client`**: The official Python client for interacting with the Weaviate vector database. We'll use it to create our data schema and import the chunks.

### **2.2. The Embedding Model**

For generating embeddings, we use **`BAAI/bge-large-en-v1.5`**, an open-source model that has ranked highly on the MTEB (Massive Text Embedding Benchmark) leaderboard for retrieval tasks. Because it runs locally, it gives us high-quality semantic search without relying on paid APIs.

The initial cells handle the setup. After defining the file paths and model configuration, the notebook checks for an available GPU to speed up encoding and then loads the `BGE` embedding model into memory.

Next, the notebook loads both sets of chunks (the baseline fixed-size chunks and the section-based ones) from the JSON files. Instead of creating two separate database collections, both sets are combined into a single list and stored together. A single collection means one schema and one import pass, and it simplifies evaluation: each chunk's metadata contains a `method` property (`'fixed_size'` or `'section_based'`), which lets us filter by chunking strategy and compare their performance in the next notebook.
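Because both strategies share one collection, the `method` field is the only thing that keeps them apart. The sketch below, which uses tiny made-up chunk dictionaries, shows one way to confirm that every chunk carries a valid `method` before indexing. The `sample_*` lists and the `count_by_method` helper are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Values the chunk metadata is expected to use for `method`
VALID_METHODS = {'fixed_size', 'section_based'}


def count_by_method(chunks):
    """Count chunks per chunking method, rejecting missing or unknown values."""
    counts = Counter()
    for chunk in chunks:
        method = chunk.get('method')
        if method not in VALID_METHODS:
            raise ValueError(
                f"Chunk {chunk.get('chunk_id')!r} has unexpected method: {method!r}"
            )
        counts[method] += 1
    return dict(counts)


# Hypothetical stand-ins for the contents of the two JSON files
sample_fixed = [{'chunk_id': 'f-0', 'text': '...', 'method': 'fixed_size'}]
sample_section = [
    {'chunk_id': 's-0', 'text': '...', 'method': 'section_based'},
    {'chunk_id': 's-1', 'text': '...', 'method': 'section_based'},
]

combined = sample_section + sample_fixed
method_counts = count_by_method(combined)
print(method_counts)
```

Running the same check on the real `all_chunks` list would catch a chunk with a missing or misspelled `method` before it reaches the database, where it would silently drop out of any filtered comparison.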

In [1]:
# === IMPORTS AND CONFIGURATION ===
import weaviate
import json
import torch
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

# Paths to processed chunks
FIXED_CHUNK_PATH = '../data/processed/fixed_chunks.json'
SECTION_CHUNK_PATH = '../data/processed/section_chunks.json'

# Model and collection configuration
MODEL_NAME = 'BAAI/bge-large-en-v1.5'
COLLECTION_NAME = 'StyleGuide'

In [2]:
# === DEVICE SETUP AND MODEL LOADING ===
if torch.cuda.is_available():
    device = 'cuda'
    print(f'✅ CUDA available. Using GPU: {torch.cuda.get_device_name(0)}')
else:
    device = 'cpu'
    print('⚠️ CUDA not available. Using CPU.')

embedding_model = SentenceTransformer(MODEL_NAME, device=device)
print('✅ Embedding model loaded')

✅ CUDA available. Using GPU: NVIDIA GeForce GTX 1050 Ti
✅ Embedding model loaded


In [3]:
# === LOAD PROCESSED CHUNKS ===
with open(FIXED_CHUNK_PATH, 'r', encoding='utf-8') as f:
    fixed_chunks = json.load(f)

with open(SECTION_CHUNK_PATH, 'r', encoding='utf-8') as f:
    section_chunks = json.load(f)

all_chunks = section_chunks + fixed_chunks
print(f'Loaded {len(fixed_chunks)} fixed-size chunks and {len(section_chunks)} section-based chunks')

Loaded 361 fixed-size chunks and 593 section-based chunks


---

## **3. Generating the Embeddings**

With the model loaded and all chunks prepared, the next step is to iterate through each one and use the model to convert its text into a vector embedding.

In [4]:
# === GENERATE EMBEDDINGS ===
embeddings = []
for chunk in tqdm(all_chunks, desc='Encoding chunks'):
    text = chunk['text']
    embedding = embedding_model.encode(text)
    embeddings.append(embedding)

print(f'Generated {len(embeddings)} embeddings')

Generated 954 embeddings
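Encoding one chunk per call works at this scale, but `SentenceTransformer.encode` also accepts a list of strings, which lets the model process several chunks per forward pass. The sketch below shows only the batching logic. `encode_in_batches` and `fake_encode` are hypothetical helpers; in the notebook, the `encode_fn` argument would be `embedding_model.encode`.

```python
def encode_in_batches(texts, encode_fn, batch_size=32):
    """Encode texts in fixed-size batches, preserving input order.

    encode_fn is assumed to take a list of strings and return one vector
    per string, in the same order.
    """
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(encode_fn(batch))
    return vectors


# Stand-in encoder for illustration only: maps each text to a 1-d "vector"
def fake_encode(batch):
    return [[float(len(text))] for text in batch]


demo_vectors = encode_in_batches(['a', 'bb', 'ccc'], fake_encode, batch_size=2)
print(demo_vectors)
```

`encode` also has its own `batch_size` parameter, so passing the full list of texts in a single call is an alternative to batching by hand.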


---

## **4. Populating the Vector Database**

Now that we have generated the vector embeddings, the final step is to load them into Weaviate. This makes our chunks searchable and ready for the retrieval experiments in the next notebook.
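Before loading anything, it can be worth confirming that the chunks and embeddings still line up, since the import pairs them purely by position. The following is a minimal sketch of such a check. The `check_alignment` helper and the demo data are assumptions for illustration; for the real data, `expected_dim` should be 1024 for `bge-large-en-v1.5`, going by its model card.

```python
def check_alignment(chunks, embeddings, expected_dim):
    """Return a list of problems found; an empty list means the data looks consistent."""
    problems = []
    if len(chunks) != len(embeddings):
        problems.append(f'{len(chunks)} chunks but {len(embeddings)} embeddings')
    for i, (chunk, vector) in enumerate(zip(chunks, embeddings)):
        if not chunk.get('text'):
            problems.append(f'chunk {i} has no text')
        if len(vector) != expected_dim:
            problems.append(
                f'embedding {i} has dimension {len(vector)}, expected {expected_dim}'
            )
    return problems


# Made-up data: the second chunk is empty and its vector is too short
demo_chunks = [{'text': 'alpha'}, {'text': ''}]
demo_embeddings = [[0.1, 0.2, 0.3], [0.1, 0.2]]

problems = check_alignment(demo_chunks, demo_embeddings, expected_dim=3)
for problem in problems:
    print(problem)
```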

The process is handled in three main parts:
1.  **Creating a Clean Collection**: To ensure a fresh start every time the notebook is run, the code first checks whether a collection named `StyleGuide` already exists and, if so, deletes it. This is a common practice during development to avoid errors or duplicate data.
2.  **Defining the Data Schema**: Before we can add data, we need to tell Weaviate what our data objects will look like. The code creates a new collection and defines a schema with properties for all the metadata we generated in the first notebook (`chunk_id`, `text`, `part`, `chapter`, `method`, etc.). It also configures the vector index to use **cosine distance**. This metric is well-suited for text embeddings because it measures similarity based on the **direction** of the vectors (capturing semantic meaning) rather than their magnitude, which can be influenced by factors like text length.
3.  **Batch Importing the Data**: Finally, the script iterates through our chunks and their corresponding embeddings. It uses Weaviate's efficient **batch import** feature to load 100 items at a time, storing both the chunk's metadata and its vector in a single database entry.
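To make the choice of metric concrete, here is a small pure-Python sketch of cosine distance, which Weaviate defines as one minus the cosine similarity. The vectors are made up for illustration, and `cosine_distance` is a hypothetical helper rather than anything the database exposes.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine distance is undefined for zero vectors')
    return 1.0 - dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]     # same direction, ten times the magnitude
orthogonal = [-2.0, 1.0, 0.0]   # dot product with v is zero

print(cosine_distance(v, scaled))      # approximately 0: direction is identical
print(cosine_distance(v, orthogonal))  # approximately 1: no directional overlap
```

Scaling a vector leaves its distance to the original at essentially zero, which is the sense in which the metric ignores magnitude.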

In [5]:
# === WEAVIATE SETUP AND DATA IMPORT ===
from weaviate.classes.config import Configure, DataType, Property, VectorDistances

with weaviate.connect_to_local() as client:
    # Remove existing collection to ensure clean data import
    if client.collections.exists(COLLECTION_NAME):
        client.collections.delete(COLLECTION_NAME)

    # Define collection schema with document metadata structure
    collection = client.collections.create(
        name=COLLECTION_NAME,
        properties=[
            Property(name='chunk_id', data_type=DataType.TEXT),
            Property(name='text', data_type=DataType.TEXT),
            Property(name='part', data_type=DataType.TEXT),
            Property(name='chapter', data_type=DataType.TEXT),
            Property(name='section', data_type=DataType.TEXT),
            Property(name='subsection', data_type=DataType.TEXT),
            # Stored as text because missing page numbers are recorded as 'N/A'
            Property(name='page_number', data_type=DataType.TEXT),
            Property(name='token_count', data_type=DataType.INT),
            Property(name='method', data_type=DataType.TEXT),
            Property(name='source_document', data_type=DataType.TEXT),
            Property(name='is_split', data_type=DataType.BOOL),
        ],
        # Configure vector index for custom embeddings
        vector_index_config=Configure.VectorIndex.hnsw(
            distance_metric=VectorDistances.COSINE
        )
    )

    # Batch import chunks with pre-computed embeddings
    with collection.batch.fixed_size(100) as batch:
        for chunk, embedding in zip(all_chunks, embeddings):
            chunk_properties = {
                'chunk_id': chunk.get('chunk_id'),
                'text': chunk.get('text'),
                'part': chunk.get('part', 'N/A'),
                'chapter': chunk.get('chapter', 'N/A'),
                'section': chunk.get('section', 'N/A'),
                'subsection': chunk.get('subsection', 'N/A'),
                'page_number': str(chunk.get('page_number', 'N/A')),
                'token_count': chunk.get('token_count'),
                'method': chunk.get('method'),
                'source_document': chunk.get('source_document'),
                'is_split': chunk.get('is_split', False),
            }
            batch.add_object(
                properties=chunk_properties,
                vector=embedding.tolist()
            )

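    # Only report success if no objects failed during the batch import
    failed = collection.batch.failed_objects
    if failed:
        print(f'⚠️ {len(failed)} of {len(all_chunks)} chunks failed to import')
    else:
        print(f'✅ Imported {len(all_chunks)} chunks into Weaviate')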

✅ Imported 954 chunks into Weaviate


We have now transformed our raw text chunks into a searchable vector index, setting the stage for the evaluation phase.