1. Setting Up Weaviate
First, we need a running Weaviate instance. You can run Weaviate locally via Docker or use a cloud-hosted instance:
Local (Docker): If Docker is installed, launch Weaviate with a single command. This runs Weaviate on port 8080 with default settings (no authentication, HNSW index, etc.):

In [None]:
docker run -d -p 8080:8080 -p 50051:50051 semitechnologies/weaviate:latest

(The above uses the latest Weaviate image. You can also use a specific version tag. The second port 50051 is for gRPC, not used in this workshop.)
Cloud: Alternatively, sign up for Weaviate Cloud Service (WCS) to get a free sandbox cluster. For WCS, you’ll need the cluster URL and an API key, which you pass for authentication when connecting.

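Next, install the Python client and any other needed libraries. The code in this workshop uses the v3-style weaviate.Client interface and the pre-1.0 openai SDK (openai.Embedding, openai.ChatCompletion), so those two packages are pinned accordingly:

In [None]:
pip install "weaviate-client<4" "openai<1" PyMuPDF pdfplumber matplotlib numpy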

Now, connect to Weaviate using the Python client. For local Docker, no auth is needed. For cloud, use the URL and API key:

In [None]:
import weaviate

# If using local Weaviate
client = weaviate.Client("http://localhost:8080")  
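
For the cloud case mentioned above, a minimal sketch of how the connection arguments could be assembled is shown below. The environment variable names WEAVIATE_URL and WEAVIATE_API_KEY and the helper weaviate_client_kwargs are assumptions made for illustration; weaviate.AuthApiKey is the API-key credential class in the v3 Python client.

```python
import os

def weaviate_client_kwargs(env=None):
    """Build keyword arguments for weaviate.Client from environment variables.

    WEAVIATE_URL and WEAVIATE_API_KEY are hypothetical variable names chosen
    for this sketch; use whatever names fit your setup.
    """
    env = os.environ if env is None else env
    kwargs = {"url": env.get("WEAVIATE_URL", "http://localhost:8080")}
    api_key = env.get("WEAVIATE_API_KEY")
    if api_key:
        # Imported lazily: only the authenticated (cloud) case needs it here
        import weaviate
        kwargs["auth_client_secret"] = weaviate.AuthApiKey(api_key=api_key)
    return kwargs

# Without any variables set, this falls back to the local Docker instance
print(weaviate_client_kwargs({}))
```

With this helper, the connection line becomes client = weaviate.Client(**weaviate_client_kwargs()), and the same notebook runs against either a local or a cloud instance.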


Defining the Schema
We’ll create a schema with a single class to store document chunks and their embeddings. In Weaviate, a class is like a collection of objects with defined properties. Our class will be called "DocumentChunk" and have properties for the chunk text and any metadata (like page number or chunk index). Weaviate can automatically vectorize data using modules, but since we’ll supply our own embeddings, we set the vectorizer to "none". We also configure the vector index (Weaviate uses an HNSW index by default) and specify the distance metric as cosine:

In [None]:
# Define class schema for document chunks
class_obj = {
    "class": "DocumentChunk",
    "description": "A chunk of document text and its embedding",
    "vectorizer": "none",            # We'll provide our own vectors (embeddings)
    "vectorIndexType": "hnsw",       # Use HNSW index (default)
    "vectorIndexConfig": {
        "distance": "cosine"        # Cosine similarity for vector comparisons
    },
    "properties": [
        {
            "name": "content",
            "dataType": ["text"],
            "description": "Text content of the document chunk"
        },
        {
            "name": "page",
            "dataType": ["int"],
            "description": "Page number of the source PDF where this chunk is found"
        },
        {
            "name": "chunkIndex",
            "dataType": ["int"],
            "description": "Sequential index of the chunk in the document"
        }
    ]
}

# Remove class if it exists (for re-runs of the workshop notebook)
# client.schema.get() with no argument returns the full schema: {"classes": [...]}
existing_classes = [c["class"] for c in client.schema.get().get("classes", [])]
if "DocumentChunk" in existing_classes:
    client.schema.delete_class("DocumentChunk")

# Create the class in Weaviate
client.schema.create_class(class_obj)
print("Schema created with classes:", [c["class"] for c in client.schema.get()["classes"]])

This schema will allow us to store each chunk of the PDF along with its embedding vector. We included a page property to enable filtering by page, and a chunkIndex to keep track of chunk ordering. The vectorizer: none setting ensures Weaviate will not attempt to vectorize our text (we handle it externally). The distance: cosine means the HNSW index will use cosine distance for similarity search (cosine similarity is a common choice for embeddings).

2. Loading and Chunking a Long PDF
With Weaviate ready, the next step is to load our large technical PDF and break it into chunks. Chunking is crucial: it balances context size (each chunk should be large enough to contain a meaningful piece of information) with relevance (chunks should be specific enough to match queries closely). We will explore multiple chunking strategies:
Fixed-length chunks (e.g. N words per chunk) – with and without overlaps.
Sentence-based splitting – chunk by whole sentences or groups of sentences.
Overlapping vs. non-overlapping – overlapping can help preserve context between chunks.
Hybrid approaches – e.g. split by paragraph or section, ensuring a minimum length.
Let’s start by reading the PDF. We’ll use PyMuPDF (imported as fitz) to extract text from each page. We’ll also demonstrate chunking by page to capture page-wise metadata:

In [None]:
import fitz  # PyMuPDF
import math

pdf_path = "technical_spec.pdf"  # replace with your PDF file path
doc = fitz.open(pdf_path)

all_chunks = []
chunk_size = 100   # example size: 100 words per chunk
overlap = 20       # example overlap: 20 words

chunk_index = 0
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    text = page.get_text().strip()
    if not text:
        continue  # skip blank pages
    # Split the page text into chunks of ~100 words with 20-word overlap
    words = text.split()
    step = chunk_size - overlap  # stride between chunk starts (equals chunk_size when overlap is 0)
    for i in range(0, len(words), step):
        chunk_words = words[i : i + chunk_size]
        chunk_text = " ".join(chunk_words)
        # Store the chunk with metadata
        chunk_obj = {
            "content": chunk_text,
            "page": page_num,
            "chunkIndex": chunk_index
        }
        all_chunks.append(chunk_obj)
        chunk_index += 1

print(f"Total chunks created: {len(all_chunks)}")
print("Sample chunk:\n", all_chunks[0]['content'][:250], "...")
print("Metadata of sample chunk:", {k: all_chunks[0][k] for k in ['page','chunkIndex']})

In the above code, we iterate through each page, extract text, then create chunks of ~100 words, with a 20-word overlap between consecutive chunks. Overlapping chunks by some fraction (here 20%) helps preserve context that spans chunk boundaries (weaviate.io). We stored each chunk in a dictionary with the text and metadata (a zero-based page number and a sequential index).

Chunking strategies: The fixed-size approach above is one strategy. Depending on your document and use case, you might try different methods:
Fixed-length (non-overlapping): set overlap = 0 in the code above to make disjoint chunks of chunk_size words.
Sentence-based splitting: maintain whole sentences in each chunk. For example, we can split the text by sentence and then group sentences until a length threshold is reached:

In [None]:
import re
def chunk_by_sentence(text, max_words=100):
    # Split text into sentences (naively by period/question/exclamation)
    sentences = re.split(r'(?<=[.?!])\s+', text)
    chunks = []
    current_chunk = []
    current_count = 0
    for sent in sentences:
        word_count = len(sent.split())
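        # Only finalize a non-empty chunk (avoids emitting an empty chunk when the first sentence is already too long)
        if current_chunk and current_count + word_count > max_words: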
            # finalize the current chunk and start a new one
            chunks.append(" ".join(current_chunk).strip())
            current_chunk = []
            current_count = 0
        current_chunk.append(sent)
        current_count += word_count
    # add the last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk).strip())
    return chunks

# Example: chunk first page text by sentence grouping
page0_text = doc.load_page(0).get_text()
sentence_chunks = chunk_by_sentence(page0_text, max_words=100)
print(f"Page 0 split into {len(sentence_chunks)} chunks by sentences.")

This function ensures chunks end at sentence boundaries and contain up to ~100 words. Sentence-based chunks might be more semantically coherent, which can help the embeddings capture meaning better.

Paragraph-based or Hybrid: Many technical PDFs have clear paragraph or section breaks. We can split on double newlines (\n\n) to get paragraphs, then merge smaller paragraphs so each chunk has at least a minimum number of words. For example:

In [None]:
def chunk_by_paragraph(text, min_words=50):
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    buffer = ""
    for para in paras:
        if len(para.split()) < min_words:
            # accumulate small paragraphs in buffer
            buffer += " " + para
        else:
            # if buffer has content, append it before the current para
            if buffer:
                combined = buffer + " " + para
                chunks.append(combined.strip())
                buffer = ""
            else:
                chunks.append(para)
    if buffer:
        chunks.append(buffer.strip())
    return chunks

# Example: chunk first page by paragraph
para_chunks = chunk_by_paragraph(page0_text, min_words=50)
print(f"Page 0 split into {len(para_chunks)} chunks by paragraph (with min_words=50).")

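This approach tries to keep paragraphs intact and merges short paragraphs into the next paragraph that meets the minimum, so most chunks reach at least min_words (a trailing run of short paragraphs can still produce a shorter final chunk). It’s a simple heuristic for a hybrid strategy.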

Handling tables and diagrams (optional): Technical PDFs may contain tables or diagrams that are not plain text. By default, our text extraction might skip or mishandle these. For a thorough solution, you could incorporate OCR or specialized parsing:
Use the pdfplumber library to detect tables (via page.extract_table) and include them as text (e.g., CSV format in the chunk).
Use OCR or the unstructured library to handle images/diagrams and get a text description.
In this workshop, we focus on text chunks. You can consider these enhancements as optional exercises (the code is structured so you can plug in additional parsing if needed).
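
As a sketch of the table idea above: a table extracted as a list of rows (which is the shape pdfplumber’s page.extract_table is expected to return, with None for empty cells) can be flattened to CSV text and stored as a chunk. The helper name table_to_csv_text and the sample rows are made up for illustration.

```python
import csv
import io

def table_to_csv_text(rows):
    """Render a table (list of rows, each a list of cell values) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Empty cells may arrive as None; write them as empty strings
        writer.writerow(["" if cell is None else str(cell).strip() for cell in row])
    return buffer.getvalue().strip()

# Placeholder rows standing in for an extracted table
sample_table = [["Command", "Description"], ["AT+EXAMPLE", None], ["AT+OTHER", "a, b"]]
print(table_to_csv_text(sample_table))
```

The resulting string can be used as the content of a chunk object, ideally with an extra metadata property marking it as a table so it can be filtered later.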

Mini-exercise: Try adjusting the chunk_size and overlap parameters, or switch to the sentence-based function, and observe how the number and content of chunks change. For instance, compare chunk_size=50 vs 200, or overlap=0 vs 20%. Smaller, non-overlapping chunks result in more segments with narrower focus, whereas larger or overlapping chunks carry more context. An optimal chunking strikes a balance – one forum discussion notes that larger chunks provide more context but can dilute specificity of the embedding. Experiment and see what works best for retrieval accuracy.
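
To get a feel for the exercise before touching the PDF, here is a self-contained sketch that applies the same fixed-length logic to a synthetic word list and reports how many chunks each setting produces. The function name fixed_length_chunks and the synthetic data are assumptions for illustration only.

```python
def fixed_length_chunks(words, chunk_size, overlap=0):
    """Split a list of words into chunks of chunk_size words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

# Synthetic "document" of 1000 words
words = [f"w{i}" for i in range(1000)]

for chunk_size, overlap in [(50, 0), (50, 10), (200, 0), (200, 40)]:
    chunks = fixed_length_chunks(words, chunk_size, overlap)
    print(f"chunk_size={chunk_size:>3}, overlap={overlap:>2}: {len(chunks)} chunks")
```

Overlap increases the chunk count for the same text because each chunk advances by fewer new words, which also means more embedding calls during indexing.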

3. Generating and Storing Embeddings in Weaviate
Now that we have chunks of the document, the next step is to convert each chunk into a vector embedding and store them in Weaviate. We will use OpenAI’s text embedding model (for example, text-embedding-ada-002, which produces 1536-dimensional embeddings) to generate vectors for each chunk. You can replace this with other providers or models (alternatives are covered later in this section). Before generating embeddings, make sure you have an API key for OpenAI (or your chosen provider). Set up the OpenAI API:

In [None]:
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # **Replace with your key** or load from environment
model_name = "text-embedding-ada-002"   # OpenAI embedding model (Ada v2)

Now, iterate through the chunks, compute the embedding for each, and send the data into Weaviate. We’ll use Weaviate’s batch API for efficiency. We’ll also allow tweaking preprocessing (for example, lowercasing the text or removing special characters) if needed before embedding – though often it’s not necessary for modern models:

In [None]:
# (Optional) Preprocessing function, e.g., to normalize text (currently just a placeholder)
def preprocess_text(text):
    return text.strip()  # We could add lowercasing, remove punctuation, etc., if needed.

# Prepare batch import
batch_size = 100
# dynamic=True lets the client adjust the batch size during import based on how long batches take
client.batch.configure(batch_size=batch_size, dynamic=True)

for i, chunk_obj in enumerate(all_chunks):
    text = preprocess_text(chunk_obj["content"])
    # Generate embedding vector using OpenAI
    try:
        emb_response = openai.Embedding.create(input=text, model=model_name)
    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Embedding API call failed at chunk {i}: {e}") from e
    vector = emb_response['data'][0]['embedding']
    # Add to Weaviate batch
    client.batch.add_data_object(
        data_object={
            "content": text,
            "page": chunk_obj["page"],
            "chunkIndex": chunk_obj["chunkIndex"]
        },
        class_name="DocumentChunk",
        vector=vector
    )
    # Send batch every 100 objects (to avoid too large batches in memory)
    if (i + 1) % batch_size == 0:
        client.batch.create_objects()
        print(f"{i+1} chunks indexed...")
# Flush remaining
client.batch.create_objects()
print(f"Finished indexing {len(all_chunks)} chunks.")

This code will take each chunk, get a 1536-dim embedding from OpenAI, and batch import it into Weaviate as a DocumentChunk object with the vector attached. We configured vectorizer: none earlier, so Weaviate knows these vectors are the ones to index (it won’t try to generate its own).
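
The loop above makes one embedding request per chunk. A possible speed-up, sketched below, is to send several texts per request. The embedding call is injected as embed_fn so the sketch stays provider-agnostic; embed_in_batches, batched, and fake_embed are hypothetical names. With the pre-1.0 openai SDK, embed_fn could wrap openai.Embedding.create(input=batch, model=model_name) and return the "embedding" field of each item in the response’s "data" list.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed `texts` by calling `embed_fn` once per batch; returns one vector per text."""
    vectors = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors than texts")
        vectors.extend(batch_vectors)
    return vectors

# Stand-in embedder for demonstration only: maps each text to a 2-dim vector
def fake_embed(batch):
    return [[float(len(text)), float(text.count(" "))] for text in batch]

vectors = embed_in_batches(["a b", "cd", "e f g"], fake_embed, batch_size=2)
print(len(vectors), "vectors computed")
```

The returned list is in the same order as the input texts, so vectors[i] can be attached to all_chunks[i] when adding objects to the Weaviate batch.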

Embedding alternatives: The above uses OpenAI. You can switch to other providers:
Cohere: e.g. use cohere.Client to embed text (co.embed(texts=[text]) returns vectors). Make sure to adjust the vector dimension and schema if using a model with different output size.
Hugging Face Transformers: use a SentenceTransformer model or a pipeline. For example:

In [None]:
# Using a SentenceTransformer model from Hugging Face
from sentence_transformers import SentenceTransformer
hf_model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim embeddings
vector = hf_model.encode(text).tolist()

If you use a 384-dim model like MiniLM, make sure every vector in the class has the same length (Weaviate requires consistent dimensions within a class). The dimension is not a setting in vectorIndexConfig; in practice, switching models means deleting and recreating the DocumentChunk class and re-indexing all chunks with the new model. Also consider the quality: smaller models may be faster but might reduce accuracy for complex technical text.
Local GPU models: if you have a GPU and a larger open-source embedding model, you can use that. The code structure remains the same; only the embedding function changes.
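
Because mixed vector lengths are an easy mistake when switching models, a small pre-import check can catch the problem before anything is sent to Weaviate. This is a minimal sketch; check_vector_dimensions is a hypothetical helper, not part of any client library.

```python
def check_vector_dimensions(vectors, expected_dim=None):
    """Return the common dimension of `vectors`, raising if they disagree."""
    dims = {len(v) for v in vectors}
    if len(dims) > 1:
        raise ValueError(f"Inconsistent vector dimensions: {sorted(dims)}")
    dim = dims.pop() if dims else None
    if expected_dim is not None and dim is not None and dim != expected_dim:
        raise ValueError(f"Expected {expected_dim}-dim vectors, got {dim}")
    return dim

# Two toy 3-dimensional vectors
print(check_vector_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))
```

Passing expected_dim=1536 (or 384 for MiniLM) makes the check fail early if a vector from a different model slips into the batch.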

Tweakable settings: Feel free to modify model_name, adjust batch_size for indexing, or add preprocessing (e.g., removing stopwords) before embedding. These can impact performance:
Using a different model may require re-running the schema setup if dimensions differ.
Preprocessing might slightly improve embeddings if your text contains a lot of irrelevant characters (but often the embedding model is robust enough without heavy cleaning).

After running the above, your Weaviate instance should now be populated with one object per chunk, each with an embedding. We can verify by checking object count or retrieving a sample object:

In [None]:
# Verify data indexed
obj_count = client.query.aggregate("DocumentChunk").with_meta_count().do()
print("Object count in Weaviate:", obj_count["data"]["Aggregate"]["DocumentChunk"][0]["meta"]["count"])

# Retrieve a sample object to verify content
result = client.query.get("DocumentChunk", ["content", "page", "_additional {vector}"]).with_limit(1).do()
print("Sample stored object:", result["data"]["Get"]["DocumentChunk"][0])

4. Querying Weaviate: Vector, Hybrid, and Filtered Search
With our vector database populated, we can perform semantic search to retrieve relevant chunks for a given query. Weaviate supports multiple query types:
Vector similarity search – find chunks with embeddings nearest to the query embedding (semantic search).
Keyword (BM25) search – find chunks by lexical match of query terms (like traditional search).
Hybrid search – combines vector similarity and keyword relevance.
Filtering – restrict results by metadata (e.g., only certain pages or sections).
These capabilities can be combined. As the Weaviate docs note, you can use similarity, keyword, and hybrid searches, along with filtering, to find the information you need. We’ll explore pure vector vs hybrid, and demonstrate a metadata filter.

First, define a sample user query. Our use-case query is:

In [None]:
query_text = ("Generate an AT command sequence that will attach the device to an "
              "LTE network using eDRX with 81 seconds cycle interval, periodically send 100 bytes of data using HTTPs, "
              "and immediately release the connection using RAI.")

This is a complex request that likely spans multiple parts of the technical specification (network attachment, eDRX settings, HTTP data sending, and RAI usage). Our goal is to retrieve the most relevant chunks from the document that contain information about these topics. 

Next, get the query’s embedding vector (using the same model as we did for the documents, to ensure vector space alignment):

In [None]:
# Embed the query text to a vector
query_emb = openai.Embedding.create(input=query_text, model=model_name)
query_vector = query_emb['data'][0]['embedding']

Now we can query Weaviate. We’ll try three approaches:
a. Pure Vector Search: This uses only the embedding similarity (cosine similarity) to rank results. Weaviate’s GraphQL API (via the Python client) provides a nearVector operator where we pass the query vector. We’ll request the content, page, chunkIndex, and the distance. Since we set distance: cosine in the schema, Weaviate will return a cosine distance (0 = identical, 2 = opposite; higher = less similar). We can ask for distance or certainty (for cosine, Weaviate defines certainty as 1 - distance/2, which maps the distance range onto 0 to 1).

In [None]:
# Pure vector similarity search (cosine)
results_vector = client.query.get("DocumentChunk", 
                                  ["content", "page", "chunkIndex", "_additional {distance}"]
                                 ).with_near_vector({"vector": query_vector}).with_limit(5).do()

chunks_vector = results_vector["data"]["Get"]["DocumentChunk"]
for i, res in enumerate(chunks_vector, 1):
    text_snippet = res["content"][:100].replace("\n", " ")
    print(f"{i}. [Page {res['page']}] {text_snippet}... (distance={res['_additional']['distance']:.3f})")

This will output the top 5 most similar chunks by vector embedding. Lower distance means higher similarity. You should see chunks that mention terms like “LTE attach”, “eDRX 81 seconds”, “100 bytes HTTP” or “RAI” if those are in the document. If the distance values are all relatively low (say 0.2–0.4), it indicates a strong semantic match for those concepts.
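
To make the returned numbers easier to interpret, the sketch below converts a cosine distance into cosine similarity and into a certainty value. It assumes Weaviate’s cosine conventions (distance = 1 - cosine similarity, certainty = 1 - distance/2); the function names are made up for illustration.

```python
def cosine_distance_to_similarity(distance):
    """Cosine similarity in [-1, 1] from a cosine distance in [0, 2]."""
    return 1.0 - distance

def cosine_distance_to_certainty(distance):
    """Certainty in [0, 1] from a cosine distance in [0, 2]."""
    return 1.0 - distance / 2.0

for distance in (0.0, 0.3, 1.0, 2.0):
    sim = cosine_distance_to_similarity(distance)
    cert = cosine_distance_to_certainty(distance)
    print(f"distance={distance:.1f} -> similarity={sim:+.2f}, certainty={cert:.2f}")
```

A distance of 0.3, for example, corresponds to a cosine similarity of 0.7 and a certainty of 0.85.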

b. Hybrid Search: Hybrid search combines keyword matching with vector similarity for a more robust retrieval. For example, if the query contains specific terms (like “RAI” or “HTTPs”) that appear in the document, BM25 keyword search can ensure those chunks are considered, even if their overall embedding similarity might be slightly lower. Weaviate’s hybrid operator allows a mix of vector and text search. We can provide the query text and the query vector together. An alpha parameter (0 to 1) controls the balance (alpha=0 is purely keyword, alpha=1 is purely vector; 0.5 gives equal weight):

In [None]:
# Hybrid search: combine vector similarity and keyword (BM25) search
results_hybrid = client.query.get("DocumentChunk", 
                                  ["content", "page", "chunkIndex", "_additional {score}"]
                                 ).with_hybrid(query=query_text, vector=query_vector, alpha=0.5).with_limit(5).do()

chunks_hybrid = results_hybrid["data"]["Get"]["DocumentChunk"]
for i, res in enumerate(chunks_hybrid, 1):
    snippet = res["content"][:100].replace("\n", " ")
    # The hybrid score may come back as a string, so convert it before numeric formatting
    print(f"{i}. [Page {res['page']}] {snippet}... (hybrid_score={float(res['_additional']['score']):.3f})")

In the hybrid result, _additional {score} is a fusion score (higher = more relevant) that combines vector and keyword relevance. You might observe that hybrid results include chunks that contain the exact query terms (e.g., a chunk explicitly mentioning “RAI” or “eDRX”) even if those weren’t the top pure vector hits. Hybrid search is powerful when you want to ensure certain keywords are present in results while still leveraging semantic similarity.
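
To build intuition for how alpha trades off the two signals, here is a simplified illustration of score fusion: each score set is min-max normalized, then combined as alpha * vector + (1 - alpha) * keyword. This is a sketch of the general idea only and is not Weaviate’s actual implementation; the function names and the toy scores are assumptions.

```python
def min_max_normalize(scores):
    """Scale a {id: score} mapping to the range 0..1."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}

def fuse_scores(vector_scores, keyword_scores, alpha=0.5):
    """Rank ids by alpha * normalized vector score + (1 - alpha) * normalized keyword score."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    fused = {
        key: alpha * vec.get(key, 0.0) + (1 - alpha) * kw.get(key, 0.0)
        for key in set(vec) | set(kw)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Toy scores: higher means more relevant in both dictionaries
vector_scores = {"chunk_a": 0.9, "chunk_b": 0.8, "chunk_c": 0.5}
keyword_scores = {"chunk_a": 2.0, "chunk_b": 4.0, "chunk_c": 5.0, "chunk_d": 1.0}

for alpha in (0.0, 0.5, 1.0):
    top_id, top_score = fuse_scores(vector_scores, keyword_scores, alpha)[0]
    print(f"alpha={alpha}: top result is {top_id}")
```

In this toy data the top result changes with alpha: the keyword favorite wins at alpha=0, the vector favorite at alpha=1, and a chunk that scores well on both wins at alpha=0.5.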

c. Metadata Filtering: Sometimes you want to restrict search to a subset of the data. For instance, if our PDF had multiple sections or if we only trust certain pages, we can apply a filter. Weaviate allows filtering on properties using .with_where in the query. We stored page numbers, so as an example, we can limit the search to the first 50 pages:

In [None]:
# Example filter: only consider chunks from pages 0-49
page_filter = {
    "path": ["page"],
    "operator": "LessThan",
    "valueInt": 50
}
results_filtered = client.query.get("DocumentChunk", ["content", "page", "_additional {distance}"])\
                    .with_near_vector({"vector": query_vector})\
                    .with_where(page_filter)\
                    .with_limit(5).do()

chunks_filtered = results_filtered["data"]["Get"]["DocumentChunk"]
print(f"Found {len(chunks_filtered)} results with page filter:")
for res in chunks_filtered:
    print(f"- Page {res['page']} (distance={res['_additional']['distance']:.3f}): {res['content'][:80]}...")

This query will only return chunks from pages 0–49 that are closest to the query. If the relevant info was in later pages, those would be excluded, demonstrating the effect of the filter. You can filter on any property – for example, if we had a section or chapter property, we could filter by that. Filtering is useful to narrow scope (e.g., “only search in the AT commands reference section”).

Comparison of methods: In practice, you might try pure vector vs hybrid to see which gives better results for your queries. Pure vector search may surface semantically relevant chunks that don’t share exact wording with the query, while hybrid can improve precision when specific terms are important. Weaviate’s hybrid search essentially fuses BM25 and vector results, which often yields the best of both worlds.
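
Filters can also be combined. The sketch below builds a where-filter for an arbitrary page range by joining two conditions with the And operator and an operands list. The helper name page_range_filter is hypothetical; the resulting dictionary would be passed to .with_where just like page_filter above.

```python
def page_range_filter(first_page, last_page_exclusive):
    """Build a where-filter matching first_page <= page < last_page_exclusive."""
    return {
        "operator": "And",
        "operands": [
            {"path": ["page"], "operator": "GreaterThanEqual", "valueInt": first_page},
            {"path": ["page"], "operator": "LessThan", "valueInt": last_page_exclusive},
        ],
    }

# Example: restrict the search to pages 100-149 (zero-based page numbers)
print(page_range_filter(100, 150))
```

Building filters through a small function like this keeps the page-range experiments in the mini-exercise to a one-line change.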

Mini-exercise: Try running both the pure vector and hybrid searches for the sample query. Do you notice any differences in the returned chunks or their order? Which method retrieves the chunk about “RAI” first? Now adjust the alpha in hybrid (e.g., 0.3 vs 0.7) to put more weight on keywords or vectors and see how results change. Also experiment with the metadata filter – for example, change the filter to a different page range or remove it entirely to see the unfiltered results.

5. Evaluating Retrieval Performance
To measure how well our retrieval is doing, we can calculate similarity scores between the query and the retrieved chunks. Since we’re using cosine similarity, a higher cosine similarity (closer to 1.0) means a more relevant chunk (in terms of embedding). Weaviate already gives us a distance or score, but we can also recompute cosine similarity manually to double-check or to aggregate results.
Let’s take the results from the pure vector search in section 4a and compute cosine similarities for each retrieved chunk:

In [None]:
import numpy as np

def cosine_sim(vec1, vec2):
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Get the top vector search results again, this time also requesting each chunk's stored vector
results = client.query.get("DocumentChunk", ["content", "chunkIndex", "_additional {distance vector}"])\
            .with_near_vector({"vector": query_vector}).with_limit(5).do()
chunks = results["data"]["Get"]["DocumentChunk"]

# Build a map from chunkIndex to the vector stored in Weaviate
chunk_vectors = {res["chunkIndex"]: res["_additional"]["vector"] for res in chunks}

# Calculate cosine similarity between query and each retrieved chunk vector
sims = []
for res in chunks:
    idx = res["chunkIndex"]
    vec = chunk_vectors.get(idx)
    if vec:
        sim = cosine_sim(query_vector, vec)
        sims.append(sim)
        snippet = res["content"][:60].replace("\n", " ")
        print(f"Chunk {idx} (cosine sim={sim:.3f}): {snippet}...")

This will print the cosine similarity for each of the top 5 chunks. You can compare these similarities to the distances or scores returned by Weaviate:
If using cosine distance: cosine_similarity = 1 - distance (Weaviate’s cosine distance is 1 - cosine similarity, so the two should agree up to floating-point error).
If using certainty: for cosine, certainty = 1 - distance/2 = (1 + cosine_similarity) / 2, so higher certainty = more similar.

We can also compute an average cosine similarity of the top results as a simple relevance metric. For example:

In [None]:
if sims:
    avg_sim = sum(sims) / len(sims)
    print(f"Average cosine similarity of top {len(sims)} results: {avg_sim:.3f}")

This average can indicate how tightly clustered the top results are around the query in vector space. A higher average might mean the query hits a very specific concept (all top results are similar to the query and to each other), whereas a lower average might mean the results are more loosely related or the query is broad.

Dynamic re-ranking: With this setup, you can tweak chunking or embedding and quickly recompute the similarity metrics to see the effect. For instance, if you re-chunk the document with a different strategy and re-index, you can run the same query and measure the new average cosine similarity or check if the relevant chunk now ranks higher.

Mini-exercise: After modifying your chunking approach or embedding model, run the retrieval and similarity computation again. Does the average top-k similarity increase or decrease? Examine the individual similarities – ideally, the truly relevant chunks should show the highest similarity to the query. If your changes improve the retrieval, you might see the target information chunk moving up to rank 1 with a higher score. This iterative approach helps in tuning the system.
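
Average similarity says nothing about whether the right chunk was found. If you can hand-label which chunks are relevant for a query, rank-based metrics make “the target chunk moved up” measurable. The sketch below computes reciprocal rank and recall@k; the function names and the labeled ids are assumptions for illustration, not results from the workshop document.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical example: chunkIndex values returned by a search, and hand-labeled relevant chunks
retrieved = [12, 7, 40, 3, 18]
relevant = {7, 3, 99}

print("Reciprocal rank:", reciprocal_rank(retrieved, relevant))
print("Recall@5:", round(recall_at_k(retrieved, relevant, 5), 3))
```

Here retrieved would be the chunkIndex values from a search result, so the same labeled set can be reused to compare chunking strategies or embedding models.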

6. Integrating with GPT-4o-mini for RAG-Based Answer Generation
Retrieval alone isn’t enough – we want to use the retrieved chunks to answer the user’s question. In a RAG pipeline, the relevant chunks are fed into a generative model (like GPT-4 or a smaller variant) as context. The model then crafts an answer that draws from that context. For our example query about the AT command sequence, let’s assume we have retrieved several chunks covering:
How to attach to LTE with eDRX and the cycle interval,
How to send data over HTTP,
How to release connection using RAI.

We will now prompt an LLM with these chunks. We’ll use OpenAI’s chat completion API for demonstration, with GPT-4o-mini (model ID "gpt-4o-mini") as the generator; you can substitute GPT-4 or any other suitable LLM that can handle the prompt length. First, collect the top chunks’ content as context:

In [None]:
# Let's use the hybrid search results (assuming it provided well-rounded context) from earlier
context_chunks = [res["content"] for res in chunks_hybrid]  # top 5 from hybrid search
# Or use chunks_vector if you prefer pure vector results

# Prepare a single string with all context
context_text = "\n\n".join(context_chunks)

Now, construct the prompt for the LLM. A common strategy is to include the context and then the query, instructing the model to use the context to answer. For example:

In [None]:
prompt = (f"You are a helpful AI assistant. You are given the following technical documentation context:\n\n"
          f"{context_text}\n\n"
          f"Using this information, please answer the question:\n"
          f"{query_text}\n\n"
          f"Provide a step-by-step AT command sequence with explanations.")

In the prompt above, we set a role (assistant with context) and explicitly ask for a step-by-step AT command sequence, since the query expects a sequence of commands. We included the retrieved chunks in the prompt. It’s important to note that if the context is very large, you must ensure it fits within the model’s context window (this varies by model, from a few thousand tokens for older GPT-3.5 variants to 128k tokens for GPT-4o-mini; our 5 chunks fit comfortably). Now we send this prompt to the model:

In [None]:
# Use OpenAI ChatCompletion (GPT)
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
answer = response["choices"][0]["message"]["content"]
print("AI Answer:\n", answer)

This will output the generated answer. Ideally, the answer will list a series of AT commands (like maybe AT+CGATT=1 to attach, AT+CEDRXS=... to set eDRX, commands to send data over HTTP, and AT+RAI=... or an equivalent command to release with RAI) along with some explanation.
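
If you retrieve more chunks or use a model with a small context window, the context may need trimming before it goes into the prompt. The sketch below keeps chunks in ranked order until a token budget is reached. It uses a rough heuristic of about 4 characters per token, which is an assumption; a real tokenizer would give exact counts. The function names are hypothetical.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token (heuristic assumption)."""
    return max(1, len(text) // 4)

def select_chunks_within_budget(chunks, max_context_tokens):
    """Keep chunks in their ranked order until the estimated token budget is used up."""
    selected, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_context_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

# Three dummy chunks of 40 characters each (about 10 estimated tokens apiece)
dummy_chunks = ["a" * 40, "b" * 40, "c" * 40]
kept = select_chunks_within_budget(dummy_chunks, max_context_tokens=25)
print(f"Kept {len(kept)} of {len(dummy_chunks)} chunks")
```

Because the chunks arrive sorted by relevance, cutting from the end drops the least relevant context first.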

Prompt refinement: The initial answer might not be perfectly formatted or may miss some detail. You can refine the prompt by:
Adding instructions such as “if the context does not contain the info, say you don’t know” to avoid hallucination.
Changing the format request (e.g., “provide just the AT commands without extra text” if you want a concise output).
Adding a system message in the ChatCompletion for clearer role (OpenAI’s API allows a system message like: {"role": "system", "content": "You are an expert IoT device assistant..."}).
For instance, to ensure the answer is an actual command sequence, you might do:

In [None]:
messages = [
    {"role": "system", "content": "You are an expert IoT modem assistant."},
    {"role": "user", "content": prompt}
]
response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
print(response["choices"][0]["message"]["content"])
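
The refinements listed above can be bundled into one helper that builds the message list. This is a sketch under assumptions: build_rag_messages is a hypothetical name, and the exact wording of the instructions is only one reasonable choice. Numbering the chunks lets the model (and you) refer back to a specific piece of context.

```python
def build_rag_messages(question, context_chunks):
    """Build chat messages that ground the answer in numbered context chunks."""
    numbered = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    system = (
        "You are an expert IoT modem assistant. Answer only from the provided context. "
        "If the context does not contain the answer, say you don't know."
    )
    user = f"Context:\n{numbered}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Placeholder context strings, not taken from any real specification
messages = build_rag_messages(
    "Which steps are needed to attach to the network?",
    ["First placeholder chunk.", "Second placeholder chunk."],
)
print(messages[1]["content"])
```

The returned list can be passed as the messages argument of the chat completion call shown above.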

Alternative LLMs: If you don’t have access to OpenAI’s API or prefer local models, you can use Hugging Face Transformers. For example, an instruction-tuned open model (the example below uses Falcon-7B-Instruct) can be invoked with the transformers pipeline:

In [None]:
# Using a local model via HuggingFace (ensure the model fits in memory and is suitable for Q&A)
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Use a separate variable so the embedding model_name defined earlier is not overwritten
llm_name = "tiiuae/falcon-7b-instruct"  # example open model (7B parameters)
tokenizer = AutoTokenizer.from_pretrained(llm_name)
model = AutoModelForCausalLM.from_pretrained(llm_name)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# max_new_tokens limits only the generated text; max_length would also count the (long) prompt
response = generator(prompt, max_new_tokens=500, do_sample=False)
print(response[0]['generated_text'])

Be aware that smaller models may not produce as accurate or detailed answers as GPT-3.5/4, especially for technical queries. However, they can be used for experimentation if API access is an issue.

Mini-exercise: Run the end-to-end query for the provided sample question and examine the output. Does the answer use information from the retrieved chunks correctly? If something is off, try modifying the prompt (e.g., ask for clarification or a different format). You could also test another query: for example, “How do I enable Power Saving Mode (PSM) on this device via AT commands?” (assuming the document covers PSM). See if the pipeline retrieves relevant context and if the LLM can answer it. This will validate your RAG system on a different question.

7. Experimentation and Mini-Exercises
By now, we have a working RAG pipeline. This is a great point to pause and experiment. Here are some mini-exercises and ideas to try, to deepen your understanding and possibly improve the system:

Chunking strategies: Change the chunking approach and re-index the data. For example, use the sentence-based chunking or a larger chunk_size. How does that affect retrieval? Does the answer quality improve or degrade? Perhaps overlapping chunks gave redundant results – try without overlap and see if you get a broader range of information in the top hits.

Embedding model choices: If you have access to other embedding models (Cohere, HuggingFace, etc.), try using them. You may need to adjust the Weaviate schema if the vector dimension changes. After indexing with a different model, run the same query and compare the results. Are the retrieved chunks more or less relevant? This can show how embedding quality impacts downstream QA.

Vector vs Keyword vs Hybrid: We demonstrated vector and hybrid search. You can also simulate a pure keyword search with .with_hybrid(query=query_text, alpha=0), which ranks purely by BM25 text relevance (alpha=1 is the opposite extreme, pure vector search). Try this and see what results you get (likely, chunks containing literal “LTE”, “eDRX”, “RAI” will surface). Compare the answers the LLM gives when using pure keyword context vs pure vector context. Which answer is more accurate?

Metadata filtering: If your document had labeled sections (say “Section 5: Network Attach”), you could add that as metadata and then filter queries to only search within a specific section when you know where the answer should come from. Try adding a dummy filter (like page range) as we did, and also try no filter – observe if irrelevant chunks from elsewhere ever creep in. This teaches when filtering can enhance precision.

Multiple queries testing: Write a small list of queries related to the document’s content (for example, other AT command scenarios if it’s a modem spec). Automate retrieving answers for each. Evaluate the results – this can highlight strengths and weaknesses of your RAG system. You might find certain queries where the retrieval needs tuning or the LLM needs a better prompt.

Performance and scaling: If your PDF is hundreds of pages, you might have thousands of chunks. The code as written should handle it, but monitor performance. Weaviate’s vector search is very fast even for large corpora, but embedding generation (if using an external API) could be a bottleneck. You could experiment with batch embedding: OpenAI’s embeddings endpoint accepts a list of texts in a single API call (up to 2048 inputs per request, each within the model’s token limit), which is much faster than one call per chunk. Also, adjusting the batch_size in Weaviate import can optimize throughput.
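
The batching idea can be sketched independently of any particular provider. In the minimal example below, embed_in_batches and batched are hypothetical helper names, and fake_embed_batch is a stand-in that only mimics the shape of an embedding call; in practice you would pass a function that sends each batch to your embedding API and returns one vector per text.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` one batch at a time, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_batch(batch)
        # Guard against a provider returning fewer vectors than inputs.
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedding function returned a wrong number of vectors")
        vectors.extend(batch_vectors)
    return vectors


# Stand-in embedder for demonstration only: NOT a real model.
def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    return [[float(len(text)), float(text.count(" "))] for text in batch]


chunks = ["AT+CPSMS=1", "enable eDRX", "release assistance indication"]
vectors = embed_in_batches(chunks, fake_embed_batch, batch_size=2)
print(len(vectors), "vectors produced in batches of 2")
```

Keeping the embedding call behind a plain function argument makes it easy to swap providers or to time different batch sizes without touching the loop.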

By trying these experiments, you’ll get a feel for how each component affects the whole system. This kind of iterative experimentation is typical in building real-world RAG applications.
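
To make the chunking experiment concrete, here is a minimal sketch of a word-based chunker with configurable overlap. The function name chunk_words and its word-level splitting are assumptions for illustration; the chunker used earlier in this workshop may count characters or sentences instead, so treat this only as a way to see how chunk_size and overlap change the number of chunks.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split `text` into chunks of `chunk_size` words that share `overlap` words."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for size, overlap in [(4, 0), (4, 2), (6, 0)]:
    pieces = chunk_words(sample, size, overlap)
    print(f"chunk_size={size} overlap={overlap}: {len(pieces)} chunks")
```

More overlap means more chunks to embed and store, which is the cost to weigh against any gain in retrieval quality.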

8. Visualization of Retrieval Effectiveness
To better understand our retrieval results, it’s helpful to visualize the similarity scores. We will plot the cosine similarities of the top retrieved chunks to see the distribution and drop-off. For example, let’s take the similarities we computed in section 5 for the top 5 vector search results and visualize them:

In [None]:
import matplotlib.pyplot as plt

# Assuming sims list from earlier (cosine similarities of top results) is available
sims_sorted = sorted(sims, reverse=True)
plt.figure(figsize=(6,4))
plt.bar(range(1, len(sims_sorted)+1), sims_sorted, color='skyblue')
plt.xlabel('Result Rank')
plt.ylabel('Cosine Similarity to Query')
plt.title('Similarity of Top Retrieved Chunks')
plt.xticks(range(1, len(sims_sorted)+1))
plt.ylim(0, 1.0)
plt.show()

This bar chart shows the cosine similarity for the 1st, 2nd, 3rd, etc., ranked chunks. You might see something like: the top result has similarity ~0.95, second ~0.90, and then a drop to ~0.8 for the later ones (this is just an illustration; actual values depend on your data and query). A sharp drop after the first result could mean the query has one extremely relevant chunk and the rest are less so. A more gradual decline suggests the query spans multiple chunks or the information is spread out.
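
If you prefer a number to eyeballing the chart, the drop-off can be located programmatically. The sketch below uses a hypothetical helper, largest_drop, and the illustrative values mentioned above (not measurements from real data) to find the rank after which similarity falls the most.

```python
from typing import List, Tuple


def largest_drop(similarities: List[float]) -> Tuple[int, float]:
    """Return (rank, drop) for the biggest fall between consecutive ranked results.

    `rank` is the 1-based position of the last result before the drop.
    """
    ranked = sorted(similarities, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two similarities to measure a drop")
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    index = max(range(len(drops)), key=drops.__getitem__)
    return index + 1, drops[index]


# Illustrative values only, not measured on real data.
example_sims = [0.95, 0.90, 0.80, 0.79, 0.78]
rank, drop = largest_drop(example_sims)
print(f"Largest drop of {drop:.2f} occurs after rank {rank}")
```

A large drop right after rank 1 or 2 suggests that passing only the first few chunks to the LLM may be enough for that query.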

If you want a bigger picture, you can retrieve, say, the top 10 or 20 results and plot their similarities or distances. A histogram of distances could also be insightful:

In [None]:
# Get top 20 results distances
results20 = (
    client.query
    .get("DocumentChunk", ["_additional {distance}"])
    .with_near_vector({"vector": query_vector})
    .with_limit(20)
    .do()
)
distances = [obj["_additional"]["distance"] for obj in results20["data"]["Get"]["DocumentChunk"]]
plt.figure(figsize=(6,4))
plt.hist(distances, bins=10, color='orange', edgecolor='black')
plt.xlabel('Cosine Distance (lower = closer)')
plt.ylabel('Frequency')
plt.title('Distribution of distances for top 20 results')
plt.show()

This might show if there’s a cluster of very close results versus a tail of less relevant ones.

Interpreting the visuals: A tight cluster of low-distance (high-similarity) results means the query vector found a very specific region in the vector space. That is often good, as it means the relevant info is clearly differentiated. If the similarities are all somewhat low (say, only around 0.5 to 0.6), it could indicate the query doesn’t match strongly with any single chunk (maybe the info is scattered or the embedding isn’t capturing it well). In such cases, you might consider whether your chunking could be improved (maybe the relevant info got split) or if the query should be reformulated.

Comparing the vector and hybrid approaches via visualization can also be interesting. For instance, you could plot the hybrid scores alongside vector similarities for the same query. Since hybrid scores aren’t cosine values, a direct comparison is tricky, but you can at least see their rank order differences.
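
Because hybrid scores and cosine similarities live on different scales, one way to compare the two searches is by rank alone. The sketch below assumes each result list has been reduced to chunk identifiers in rank order (for example the chunk index stored as metadata); rank_shifts, overlap_at_k, and the ID values are hypothetical and for illustration only.

```python
from typing import Dict, Hashable, List, Optional


def rank_shifts(
    vector_ids: List[Hashable], hybrid_ids: List[Hashable]
) -> Dict[Hashable, Optional[int]]:
    """Map each vector-search hit to how many places it moved in the hybrid ranking.

    Positive = ranked higher by hybrid, negative = ranked lower,
    None = missing from the hybrid results.
    """
    hybrid_rank = {chunk_id: rank for rank, chunk_id in enumerate(hybrid_ids, start=1)}
    shifts: Dict[Hashable, Optional[int]] = {}
    for rank, chunk_id in enumerate(vector_ids, start=1):
        other = hybrid_rank.get(chunk_id)
        shifts[chunk_id] = None if other is None else rank - other
    return shifts


def overlap_at_k(a: List[Hashable], b: List[Hashable], k: int) -> float:
    """Fraction of the top-k positions whose chunk appears in both rankings."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return len(set(a[:k]) & set(b[:k])) / k


# Made-up chunk IDs in rank order, for illustration only.
vector_ids = [12, 7, 30, 4, 19]
hybrid_ids = [7, 12, 4, 41, 30]
print(rank_shifts(vector_ids, hybrid_ids))
print(f"overlap@5 = {overlap_at_k(vector_ids, hybrid_ids, 5):.2f}")
```

A low overlap means the keyword component is surfacing different chunks than the embedding alone, which is worth inspecting by hand.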

Mini-exercise: After making a change (like a new chunk strategy or a different query), generate a new similarity plot. For example, if you try a query that the system isn’t very confident on, does the chart show a flatter similarity line (indicating uncertainty)? Or if you increase overlap significantly, do you see multiple top results with almost equal similarity (because overlapping chunks contain similar content, thus both rank high)? This visual analysis can guide you in fine-tuning the system further.
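
To check whether near-equal top scores really come from overlapping chunks, you can measure how much text the top hits share. The following is a rough sketch using Jaccard similarity over word sets; the helper names, the 0.5 threshold, and the sample chunk texts are all assumptions chosen for illustration.

```python
from itertools import combinations
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def near_duplicates(chunks: List[str], threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair of chunks at or above `threshold`."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(chunks), 2):
        score = jaccard(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Made-up chunk texts; the first two imitate heavily overlapping windows.
top_chunks = [
    "set the requested active time with AT+CPSMS to enable power saving mode",
    "with AT+CPSMS to enable power saving mode and confirm the setting",
    "eDRX cycle length is configured separately",
]
for i, j, score in near_duplicates(top_chunks):
    print(f"chunks {i} and {j} share {score:.0%} of their words")
```

If many of the top results are flagged, the LLM is receiving largely repeated context, and reducing overlap or deduplicating hits may free room for other chunks.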

9. Conclusion and Next Steps
In this workshop, we built a complete Retrieval-Augmented Generation pipeline:
Weaviate setup and schema – storing chunked data with custom embeddings.
PDF parsing and chunking – turning a large document into manageable pieces using various strategies.
Embedding generation – using a model to vectorize chunks and indexing them for similarity search.
Retrieval methods – exploring semantic vector search, keyword-based search, and hybrid combinations, plus filtering by metadata.
Evaluation – using cosine similarity and visual tools to measure retrieval relevance.
LLM integration – feeding retrieved context to an LLM (GPT-4o-mini) to answer a real technical question.
Throughout, we suggested experiments to solidify your understanding of how each component affects the outcome.

By completing the steps and exercises, you’ve gained experience with chunking best practices, optimizing vector search, and building a QA system that can handle domain-specific queries by augmenting an LLM with external knowledge. This approach can be applied to many scenarios, from technical manuals and product documentation to legal texts or research papers: anywhere you need precise answers from a large document.
We encourage you to continue iterating on this system:
Try using Weaviate’s Generative Module, which can do retrieval and generation in one step (offloading some prompt management to Weaviate).
Explore reranking techniques if you have many relevant chunks (using a second stage model to refine which chunks are truly most relevant to the query).
If you have multiple documents, expand the schema (e.g., add a document_title property) and explore multi-document RAG.
Finally, consider evaluation on a set of Q&A pairs if you can gather some ground-truth answers – this will allow more quantitative measurement of how well your RAG system is performing and where to improve.
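
As a starting point for that kind of evaluation, here is a minimal sketch of a retrieval hit rate: the fraction of questions for which the expected answer string appears in at least one of the top-k retrieved chunks. The function hit_rate_at_k, the toy corpus, and the word-overlap retriever are hypothetical stand-ins; in your pipeline, retrieve would wrap the Weaviate query and return chunk texts.

```python
import re
from typing import Callable, List, Sequence, Tuple


def hit_rate_at_k(
    qa_pairs: Sequence[Tuple[str, str]],
    retrieve: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected answer text appears in a top-k chunk."""
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    hits = 0
    for question, expected in qa_pairs:
        chunks = retrieve(question, k)
        if any(expected.lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(qa_pairs)


# Toy corpus and retriever for demonstration only (word overlap, not embeddings).
corpus = [
    "Use AT+CPSMS=1 to enable Power Saving Mode.",
    "Use AT+CEDRXS=1 to enable eDRX.",
    "AT+CFUN=1 sets full functionality.",
]


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9+=]+", text.lower()))


def toy_retrieve(question: str, k: int) -> List[str]:
    ranked = sorted(corpus, key=lambda c: len(tokens(question) & tokens(c)), reverse=True)
    return ranked[:k]


qa_pairs = [
    ("How do I enable power saving mode?", "AT+CPSMS=1"),
    ("How do I enable eDRX?", "AT+CEDRXS=1"),
    ("How do I update the firmware?", "firmware"),  # not covered by the toy corpus
]
print(f"hit rate@1 = {hit_rate_at_k(qa_pairs, toy_retrieve, k=1):.2f}")
```

Substring matching is a crude proxy for correctness, but tracking this number while you change chunking, embeddings, or alpha gives a quick signal of whether retrieval is improving.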
Happy coding, and happy querying with your new RAG system!