# Open Source RAG Pipeline Tutorial

This tutorial demonstrates how to build a complete **Retrieval-Augmented Generation (RAG)** pipeline using open source components from the DLLMForge library. 

## What You'll Learn

The RAG pipeline consists of several key components:

1. **Document Loading**: Load PDF documents and extract text
2. **Text Chunking**: Split documents into manageable chunks
3. **Embedding Generation**: Create vector embeddings using open source models
4. **Vector Storage**: Store embeddings in a FAISS vector database
5. **Retrieval**: Find relevant document chunks for queries
6. **Generation**: Generate answers using an open source LLM
7. **Evaluation**: Assess the quality of the RAG system

## Step 1: Import Required Modules

Start by importing all necessary components for the RAG pipeline.

In [1]:
from dllmforge.rag_embedding_open_source import LangchainHFEmbeddingModel
from dllmforge.rag_evaluation import RAGEvaluator
from dllmforge.LLMs.Deltares_LLMs import DeltaresOllamaLLM
from dllmforge.rag_preprocess_documents import PDFLoader, TextChunker

from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from pathlib import Path
import faiss

INFO:faiss.loader:Loading faiss with AVX2 support.
INFO:faiss.loader:Successfully loaded faiss with AVX2 support.
INFO:faiss:Failed to load GPU Faiss: name 'GpuIndexIVFFlat' is not defined. Will not load constructor refs for GPU indexes. This is only an error if you're trying to use GPU Faiss.


## Step 2: Initialize the Embedding Model

Create an embedding model using open source HuggingFace transformers.

The `LangchainHFEmbeddingModel` class supports any HuggingFace sentence transformer model and provides:
- Automatic model downloading and caching
- Batch embedding for efficient processing
- Input validation for embeddings

We use the `all-MiniLM-L6-v2` model here, but you can experiment with others from the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

**Note:** Larger models generally produce better embeddings, but they need more memory and compute and take longer to download.

In [2]:
# Initialize the embedding model
# Default model: "sentence-transformers/all-MiniLM-L6-v2"
model = LangchainHFEmbeddingModel("sentence-transformers/all-MiniLM-L6-v2")

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


## Step 3: Download Sample Documents

For this tutorial, you'll need some PDF documents to use as your knowledge base.

**Example:** Download the schemaGAN paper from [Science Direct](https://www.sciencedirect.com/science/article/pii/S0266352X25001260).

Place the PDF in a folder named `documents` in your working directory.

## Step 4: Load and Process Documents

Two important parameters to consider when chunking documents:

- **`chunk_size`**: The maximum size of each text chunk (in characters). Smaller chunks may improve retrieval performance but increase the number of chunks.
- **`overlap_size`**: The number of overlapping characters between chunks. Overlapping chunks help preserve context but may increase redundancy.
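
To make the interaction between the two parameters concrete, here is a minimal sliding-window sketch in plain Python. The function `chunk_with_overlap` is a hypothetical illustration and is not the implementation of `TextChunker`, which may split on different boundaries.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap_size: int) -> list[str]:
    """Split text into character windows that share overlap_size characters."""
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    step = chunk_size - overlap_size  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(25))  # 25 characters of digits
chunks = chunk_with_overlap(sample, chunk_size=10, overlap_size=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is the redundancy that `overlap_size` trades for context.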

Now let's load PDF documents and create text chunks:

In [6]:
# Define the directory containing PDF documents
# Portable form of ..\..\documents; adjust this to wherever you placed the PDFs
data_dir = Path("..") / ".." / "documents"
pdfs = list(data_dir.glob("*.pdf"))

# Initialize document loader and chunker
loader = PDFLoader()
chunker = TextChunker(chunk_size=1000, overlap_size=200)

global_embeddings = []
metadatas = []

# Process each PDF file
for pdf_path in pdfs:
    # Load the PDF document
    pages, file_name, metadata = loader.load(pdf_path)
    
    # Create chunks with overlap for better context preservation
    chunks = chunker.chunk_text(pages, file_name, metadata)
    
    # Generate embeddings for chunks
    chunk_embeddings = model.embed(chunks)
    
    # Store embeddings and metadata
    global_embeddings.extend(chunk_embeddings)
    metadatas.extend([chunk["metadata"] for chunk in chunks])
    
    print(f"Embedded {len(chunk_embeddings)} chunks from {file_name}.")

print(f"Total embeddings generated: {len(global_embeddings)}")

Embedded 107 chunks from SchemaGAN_ A conditional Generative Adversarial Network for geotechnical subsurface schematisation - 1-s2.0-S0266352X25001260-main.pdf.
Total embeddings generated: 107


**Expected Output:**
```
Embedded ... chunks from ...pdf
Total embeddings generated: ...
```

## Step 5: Create Vector Store

Set up a FAISS vector store to hold the embeddings. FAISS is a library for efficient similarity search and clustering of dense vectors.

**Alternative Index Types:**
- `IndexFlatL2`: Exact L2 distance (slower but accurate)
- `IndexFlatIP`: Inner product similarity
- `IndexIVFFlat`: Faster approximate search for large datasets
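
The difference between the two exact ("flat") index types comes down to the scoring function. The sketch below is plain Python, not FAISS, and uses made-up two-dimensional vectors: `IndexFlatL2` ranks by squared Euclidean distance (smaller is closer), while `IndexFlatIP` ranks by inner product (larger is closer).

```python
def l2_squared(a, b):
    """Squared Euclidean distance, the quantity an exact L2 search ranks by."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Inner product, the quantity an exact IP search ranks by."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors (illustrative only); "c" has a much larger norm than the others.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [3.0, 3.0]}
query = [1.0, 0.2]

nearest_l2 = sorted(vectors, key=lambda k: l2_squared(query, vectors[k]))
highest_ip = sorted(vectors, key=lambda k: inner_product(query, vectors[k]), reverse=True)

print("L2 ranking:", nearest_l2)
print("IP ranking:", highest_ip)
```

The inner product favours the long vector `c` even though `a` points in nearly the same direction as the query. If you switch to `IndexFlatIP`, normalizing the embeddings first makes the inner product equal to cosine similarity.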

Other vector stores like MongoDB or Weaviate can also be used.

In [7]:
# Get embedding dimension
embedding_dim = len(global_embeddings[0]["text_vector"])

# Create FAISS index for L2 (Euclidean) distance
index = faiss.IndexFlatL2(embedding_dim)

# Initialize vector store
vector_store = FAISS(
    embedding_function=model.embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

# Add embeddings to the vector store
for chunk, meta in zip(global_embeddings, metadatas):
    vector_store.add_texts(
        texts=[chunk["chunk"]],
        metadatas=[meta],
        ids=[chunk["chunk_id"]],
        embeddings=[chunk["text_vector"]]
    )

print(f"Vector store created with {len(global_embeddings)} embeddings")

Vector store created with 107 embeddings


## Step 6: Test Retrieval

Let's test the vector store with a sample query to see if it retrieves relevant chunks.

In [8]:
# Query the vector store directly
results = vector_store.similarity_search_with_score(
    query="Size of images for schema GAN", 
    k=5
)

print("Query result:", results)
print("\n" + "="*80 + "\n")

# Each result is a (Document, score) tuple. With IndexFlatL2 the score is an
# L2 distance, so lower values mean closer matches.
for i, (doc, score) in enumerate(results, 1):
    print(f"Result {i}:")
    print(f"Score: {score}")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 80)

Query result: [(Document(id='SchemaGAN_ A conditional Generative Adversarial Network for geotechnical subsurface schematisation - 1-s2.0-S0266352X25001260-main.pdf_i37', metadata={'/Producer': 'cairo 1.18.0 (https://cairographics.org)', '/Creator': 'Mozilla Firefox 144.0.2', '/CreationDate': "D:20251030112331+01'00"}, page_content=', adhering \nto best practices in the field of GAN training. During training, the \nDiscriminator’s parameters are adjusted based on the results from the \nloss function just like in the Generator.\n2.4. The combined cGAN architecture\nFor the compilation of the combined Generator and Discriminator \nnetworks into the schemaGAN model, the Adam optimiser was also \nused with a learning rate of 0.0002 and a beta value of 0.5 (Goodfellow \net al., 2014; Salimans et al., 2016).\nThe complete schemaGAN model boasts a total of 78,172,226 pa-\nrameters representing the weights and biases of the network’s layers. \nFrom these, a total of 67,003,137 are trainable

## Step 7: Initialize the LLM

Set up the open source language model using Ollama. We'll use the Qwen3 model in this example.

**Note:** Make sure you're on the Deltares network or VPN to access the Deltares hosted models.

In [9]:
# Initialize Ollama LLM
llm = DeltaresOllamaLLM(
    base_url="https://chat-api.directory.intra",  
    model_name="qwen3:latest",  # Or another available model
    temperature=0.8
)

## Step 8: Create Retriever and Generate Answers

Set up the retriever with similarity thresholds and generate answers to questions using the RAG pipeline.

In [10]:
# Create retriever with similarity threshold
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.1,  # Minimum similarity score
        "k": 10  # Maximum number of documents to retrieve
    },
)

# Generate answer using RAG
question = "Size of images produced by schemaGAN? give me the answer in axb format"
chat_result = llm.ask_with_retriever(question, retriever)
answer = chat_result.generations[0].message.content

print(f"Question: {question}")
print(f"\nAnswer: {answer}")

Question: Size of images produced by schemaGAN? give me the answer in axb format

Answer: <think>
Okay, the user is asking about the size of images produced by schemaGAN, specifically in axb format. Let me check the provided context documents.

Looking through the documents, the first one (i32) mentions that the final layer of the Generator uses a transposed Conv2D layer to upscale feature maps back to the original image size of 512 × 32 pixels. Another document (i34) also refers to input images sized 512 × 32 × 1. Additionally, the Discriminator's input is mentioned as 512 × 32 × 1. 

So, it seems consistent that the image size is 512 pixels in width and 32 pixels in height. The user wants the answer in axb format, which typically represents width × height. Therefore, the answer should be 512×32.
</think>

The images produced by schemaGAN have a size of **512×32** pixels. This is explicitly mentioned in the context regarding the Generator's final layer upsampling to the origin
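
The raw answer above still contains the model's `<think>` block. If you only want the final answer, a small helper can strip it. This is a sketch: `strip_thinking` is a hypothetical name, and it assumes the reasoning is always wrapped in `<think>...</think>` tags as in the output shown here.

```python
import re


def strip_thinking(answer: str) -> str:
    """Remove any <think>...</think> block and surrounding whitespace."""
    return re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL).strip()


raw = "<think>\nCheck the context for the image size.\n</think>\n\nThe size is 512x32."
clean = strip_thinking(raw)
print(clean)
```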

The answer should be relevant to the question based on the retrieved documents. If the answer is not satisfactory, consider:
- Refining the query
- Adjusting the retriever parameters
- Using a different embedding model
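
When adjusting `score_threshold`, keep in mind that the threshold applies to a relevance score rather than to the raw distance. The sketch below assumes the default Euclidean conversion used by LangChain vector stores, `1 - distance / sqrt(2)`; check this against your installed version. The function names and the distances are made up for illustration.

```python
import math


def euclidean_relevance(distance: float) -> float:
    """Assumed conversion from a distance to a relevance score (higher is better)."""
    return 1.0 - distance / math.sqrt(2)


def filter_by_threshold(scored_chunks, score_threshold, k):
    """Keep at most k chunks whose relevance is at least score_threshold."""
    scored = [(text, euclidean_relevance(dist)) for text, dist in scored_chunks]
    kept = [(text, rel) for text, rel in scored if rel >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# (chunk text, distance) pairs; the distances are illustrative only
hits = [("chunk A", 0.3), ("chunk B", 1.0), ("chunk C", 1.4)]
selected = filter_by_threshold(hits, score_threshold=0.1, k=10)
for text, relevance in selected:
    print(f"{text}: {relevance:.3f}")
```

Under this assumption, raising the threshold removes distant chunks before `k` is applied, so `k` acts only as an upper bound.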

## Step 9: Evaluate the RAG System

Use the built-in evaluation framework to assess RAG performance with multiple test questions.

In [11]:
# Define test questions with ground truth answers
TEST_QUESTIONS = [{
    "question": "Size of images produced by schemaGAN?",
    "ground_truth": "The images produced by schemaGAN have a size of **512 × 32 pixels**."
}, {
    "question": "What is the network architecture based on?",
    "ground_truth": "the pix2pix method from Isola et al. (2017)"
}]

# Initialize evaluator
evaluator = RAGEvaluator(llm_provider="deltares", deltares_llm=llm)

results = []
for q_data in TEST_QUESTIONS:
    question = q_data["question"]
    ground_truth = q_data["ground_truth"]
    
    # Generate answer
    chat_result = llm.ask_with_retriever(question, retriever)
    answer = chat_result.generations[0].message.content
    answer = answer.split("</think>")[-1].strip()  # Clean up response
    
    # Get retrieved contexts
    retrieved_contexts = retriever.invoke(question)
    
    # Evaluate the RAG pipeline
    evaluation = evaluator.evaluate_rag_pipeline(
        question=question,
        generated_answer=answer,
        retrieved_contexts=retrieved_contexts,
        ground_truth_answer=ground_truth
    )
    
    # Store results
    result = {
        'question': question,
        'ground_truth': ground_truth,
        'response': answer,
        'context': retrieved_contexts,
        'evaluation': evaluation
    }
    results.append(result)
    
    # Print evaluation metrics
    print(f"Question: {question}")
    print(f"RAGAS Score: {evaluation.ragas_score:.3f}")
    print(f"Answer Relevancy: {evaluation.answer_relevancy.score:.3f}")
    print(f"Faithfulness: {evaluation.faithfulness.score:.3f}")
    print(f"Context Recall: {evaluation.context_recall.score:.3f}")
    print(f"Context Relevancy: {evaluation.context_relevancy.score:.3f}")
    print("=" * 80)

🔍 Starting RAG evaluation...
  📊 Evaluating context relevancy...
  📊 Evaluating faithfulness...
  📊 Evaluating answer relevancy...
  📊 Evaluating context recall...
Question: Size of images produced by schemaGAN?
RAGAS Score: 0.285
Answer Relevancy: 0.950
Faithfulness: 0.500
Context Recall: 1.000
Context Relevancy: 0.100




🔍 Starting RAG evaluation...
  📊 Evaluating context relevancy...
  📊 Evaluating faithfulness...
  📊 Evaluating answer relevancy...
  📊 Evaluating context recall...
Question: What is the network architecture based on?
RAGAS Score: 0.625
Answer Relevancy: 0.950
Faithfulness: 0.800
Context Recall: 0.000
Context Relevancy: 0.400


## Understanding Evaluation Metrics

The RAG evaluation provides four key metrics:

### 1. Context Relevancy (0-1)
Measures how relevant the retrieved documents are to the question. 

- **Higher scores** = Better retrieval
- **Low scores?** → Check embedding model or vector store configuration

### 2. Context Recall (0-1)
Measures whether all necessary information was retrieved by comparing retrieved context with ground truth.

- **Low scores?** → Important documents may be missing
- **Improvement:** Adjust chunk size, overlap, or try a different embedding model

### 3. Faithfulness (0-1)
Measures factual accuracy and absence of hallucinations. Checks if the answer is grounded in the retrieved context.

- **Low scores?** → LLM may be generating unsupported information
- **Improvement:** Adjust LLM temperature or try a different model

### 4. Answer Relevancy (0-1)
Measures how directly the answer addresses the question. Penalizes verbose or off-topic responses.

- **Low scores?** → LLM may not be using retrieved context effectively
- **Improvement:** Experiment with prompt engineering or different LLMs

### RAGAS Score
Overall score combining all metrics, providing a single measure of RAG system quality.
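
This tutorial does not show how `RAGEvaluator` combines the metrics. The two results printed in Step 9 are, however, consistent with a harmonic mean over the non-zero metric scores. The sketch below checks that observation. `harmonic_mean_nonzero` is a hypothetical helper, not part of DLLMForge, and the actual implementation may differ.

```python
def harmonic_mean_nonzero(scores):
    """Harmonic mean of the scores, ignoring any that are zero."""
    nonzero = [s for s in scores if s > 0]
    if not nonzero:
        return 0.0
    return len(nonzero) / sum(1.0 / s for s in nonzero)


# Metric values copied from the Step 9 output:
# answer relevancy, faithfulness, context recall, context relevancy
q1 = harmonic_mean_nonzero([0.950, 0.500, 1.000, 0.100])
q2 = harmonic_mean_nonzero([0.950, 0.800, 0.000, 0.400])
print(f"{q1:.3f} {q2:.3f}")  # prints "0.285 0.625", matching the reported scores
```

A harmonic mean is dominated by the weakest metric, which would explain why a Context Relevancy of 0.100 pulls the first score down to 0.285 even though the other three metrics are high.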

## Advanced Configuration

Now let's explore some advanced techniques to optimize your RAG pipeline.

### Optimizing Chunk Size

Experiment with different chunk sizes based on your document types:

In [None]:
# For technical documents with detailed information
chunker_technical = TextChunker(chunk_size=1500, overlap_size=300)

# For shorter, conversational content
chunker_short = TextChunker(chunk_size=500, overlap_size=100)

# For very long documents
chunker_long = TextChunker(chunk_size=2000, overlap_size=400)

print("Different chunkers configured for various document types")
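
Before re-embedding a corpus, it can help to estimate how many chunks each configuration would produce. The sketch below assumes a plain sliding window over characters; `TextChunker` may split differently, so treat the numbers as rough guides. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, overlap_size: int) -> int:
    """Approximate chunk count for a sliding window with the given overlap."""
    if text_length <= 0:
        return 0
    step = chunk_size - overlap_size  # characters of new text per chunk
    return math.ceil(max(text_length - chunk_size, 0) / step) + 1


# A document of 100,000 characters under the configurations discussed above
for size, overlap in [(500, 100), (1000, 200), (1500, 300), (2000, 400)]:
    count = estimate_chunk_count(100_000, size, overlap)
    print(f"chunk_size={size}, overlap_size={overlap}: about {count} chunks")
```

Halving the chunk size roughly doubles the number of embeddings to compute and store, which is the cost side of the finer-grained retrieval that smaller chunks offer.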

### Using Different Embedding Models

Try different embedding models for better performance. Visit the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for more options.

In [None]:
# More capable but larger model
# model = LangchainHFEmbeddingModel(
#     model_name="sentence-transformers/all-mpnet-base-v2"
# )

# Multilingual model
# model = LangchainHFEmbeddingModel(
#     model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# )

# Domain-specific model (for scientific papers)
# model = LangchainHFEmbeddingModel(
#     model_name="sentence-transformers/allenai-specter"
# )

print("Alternative embedding models available (commented out)")

### Improving Retrieval

Fine-tune retrieval parameters for different use cases:

In [None]:
# For high precision (fewer but more relevant results)
retriever_precision = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.3,  # Higher threshold
        "k": 5  # Fewer results
    },
)

# For high recall (more results, potentially less relevant)
retriever_recall = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 20  # More results
    },
)

# For diverse results using Maximum Marginal Relevance
retriever_mmr = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 10,
        "lambda_mult": 0.7  # Balance between relevance and diversity
    },
)

print("Different retriever configurations created")
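
To see what `lambda_mult` controls, here is a minimal sketch of Maximum Marginal Relevance selection in plain Python. It works on precomputed similarity values that are made up for illustration, and `mmr_select` is a hypothetical function, not the LangChain implementation.

```python
def mmr_select(query_sims, pairwise_sims, k, lambda_mult):
    """Greedily pick k items, trading query relevance against redundancy."""
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            # Redundancy: similarity to the closest already-selected item
            redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Chunks 0 and 1 are near-duplicates; chunk 2 is less relevant but different.
query_sims = [0.90, 0.85, 0.60]
pairwise_sims = [
    [1.00, 0.98, 0.10],
    [0.98, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]

diverse = mmr_select(query_sims, pairwise_sims, k=2, lambda_mult=0.7)
relevance_only = mmr_select(query_sims, pairwise_sims, k=2, lambda_mult=1.0)
print("lambda_mult=0.7:", diverse)
print("lambda_mult=1.0:", relevance_only)
```

With `lambda_mult=1.0` the selection ignores redundancy and returns the two near-duplicates. With `lambda_mult=0.7` the second pick is the distinct chunk, because the duplicate is penalized for its similarity to the first pick.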

### Docling Automatic Chunker (Advanced)

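If the character-based chunker splits your documents poorly, try the automatic chunker from Docling. Its `HybridChunker` follows the document's structure (headings, paragraphs, tables) and uses the embedding model's tokenizer to keep each chunk within the token limit, which tends to produce more semantically coherent chunks.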

**Benefits:**
- More semantically meaningful chunks
- Better handling of document structure
- Improved retrieval performance

In [None]:
# This replaces the previous chunking code
from langchain_docling import DoclingLoader
from docling.chunking import HybridChunker
from langchain_docling.loader import ExportType

# Find all PDF files in the directory
pdfs = list(data_dir.glob("*.pdf"))
global_chunks = []

for pdf_path in pdfs:
    loader = DoclingLoader(
        file_path=pdf_path,
        export_type=ExportType.DOC_CHUNKS,
        chunker=HybridChunker(
            tokenizer=model.embeddings.model_name,
            chunk_size=512,        # Max length supported by MiniLM
            chunk_overlap=50       # Some overlap for better context
        )
    )
    docs = loader.load()
    global_chunks.extend(docs)

print(f"Total chunks generated with Docling: {len(global_chunks)}")

# Create FAISS vector store
# Embed a short probe string to discover the embedding dimension
index = faiss.IndexFlatL2(len(model.embeddings.embed_query("test")))
vector_store_docling = FAISS(
    embedding_function=model.embeddings,
    index=index,
    docstore=InMemoryDocstore({}),
    index_to_docstore_id={},
)
vector_store_docling.add_documents(global_chunks)

# Query the vector store (scores are L2 distances, so lower means closer)
docling_results = vector_store_docling.similarity_search_with_score(
    query="Size of images for schema GAN in pixels", 
    k=5
)
print("\nDocling Query result:")
for i, (doc, score) in enumerate(docling_results, 1):
    print(f"{i}. Score: {score:.4f} - {doc.page_content[:100]}...")

## Summary

Congratulations! You've built a complete RAG pipeline. Here's what you learned:

✅ **Document Processing**: Load and chunk PDFs efficiently  
✅ **Embeddings**: Generate vector embeddings with open source models  
✅ **Vector Storage**: Store and retrieve embeddings with FAISS  
✅ **LLM Integration**: Use open source LLMs for answer generation  
✅ **Evaluation**: Measure RAG performance with RAGAS metrics  
✅ **Optimization**: Fine-tune chunking, retrieval, and embeddings  

### Next Steps

1. Experiment with different embedding models
2. Try various chunk sizes and overlap settings
3. Test different retriever configurations
4. Evaluate with your own documents and questions
5. Explore Docling for improved document structure handling

### Key Takeaways

- **Start simple**: Use default configurations first
- **Iterate**: Adjust parameters based on evaluation metrics
- **Document-specific**: Optimize for your specific document types
- **Balance**: Trade-off between speed, accuracy, and resource usage