<a href="https://colab.research.google.com/github/LashawnFofung/RAG-Pipelines/blob/main/src/Task_RAG_with_Open_Source_Model_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **RAG with Open-Source LLM Model: Mistral 7B**


**My Goal:** I'm moving beyond proprietary APIs to build a complete and self-contained **RAG (Retrieval-Augmented Generation)** system using powerful open-source tools entirely within this Google Colab environment. The core of this system is **Mistral 7B Instruct**, loaded in the memory-efficient GGUF format for fast performance on a Colab GPU.
<br><br>

This notebook is focused on practical deployment: taking a text-heavy document (a simulated contract PDF) and building a functional system that can accurately answer questions using the locally hosted LLM.
<br><br>

**The "Why I'm Doing This": Going Open Source & Local**

Relying on external APIs (like OpenAI or Gemini) means sacrificing control and incurring costs. This walkthrough eliminates those dependencies:

- 🚫 **No API Keys Needed:** The entire generation and retrieval process runs locally on my Colab instance.

- 🛠️ **Total Control:** I'm using the open-source Mistral 7B model and customizing the system architecture with the LlamaIndex framework.

- 🚀 **Practical Efficiency:** Using the highly optimized GGUF format and CUDA acceleration, I can run a powerful 7-billion-parameter model efficiently on the available T4 GPU.
<br><br>

**⚠️ Important Setup Requirement**

Before running any code, I must ensure the correct hardware is allocated:

- Go to **Runtime > Change runtime type**.

- Select **GPU** as the hardware accelerator (the **Tesla T4** is ideal).

- *Note: The model will not load or run efficiently without this GPU runtime!*
<br><br>

**Notebook Structure: My RAG Blueprint**

This notebook is structured into seven logical steps to guide me from setup to a final, operational RAG query.
<br><br>

- [Section 1: Install Required Packages and Check GPU Support](#scrollTo=rCeTTYTK60ho&line=5&uniqifier=1)

  - Set up the foundation: installing llama-cpp-python with CUDA support for acceleration and verifying that my GPU is correctly detected.

- [Section 2: Load Mistral 7B in GGUF Format](#scrollTo=a5SuRP_R7WNV&line=9&uniqifier=1)

  - Download the Mistral 7B Instruct v0.2 Q4_K_M GGUF model (a quantized file of roughly 4.1 GB) and initialize it with llama-cpp-python so that model layers can be offloaded to the GPU.

- [Section 3: Run a Basic Test Query](#scrollTo=ZAlknaYu7nMx&line=9&uniqifier=1)

  - A quick test to confirm the local LLM is working and responding, separate from the RAG system.

- [Section 4: Installing RAG Connectors](#scrollTo=rGPn5NaaAxTm&line=1&uniqifier=1)

  - Install the essential tools for document handling (pymupdf) and the specific LlamaIndex integrations for running local models (llama-index-llms-llama-cpp and llama-index-embeddings-huggingface).

- [Section 5: Data Loading and Preparation](#scrollTo=rtNqlli2BUE1&line=3&uniqifier=1)

  - Simulate loading a real-world document by loading a sample PDF contract and extracting all the raw text using PyMuPDF.

- [Section 6: Configure and Build the RAG Pipeline](#scrollTo=0cx5dlafBsmC&line=3&uniqifier=1)

  This is the heart of the system. I will configure:

    - The local Mistral 7B LLM as the generator.

    - An open-source Hugging Face Embedding Model (BAAI/bge-small-en-v1.5) as the vectorizer.

    - The Vector Store and Retriever within LlamaIndex to chunk, embed, and index the document text.

- [Section 7: Run the RAG Query](#scrollTo=TxK96oRDDt6d&line=3&uniqifier=1)

  - The final demonstration: Submit a complex question about the contract, and the system will use the indexed context to provide a grounded, document-specific answer.

## **⚡️ Section 1: Install Required Packages and Check GPU Support**

This section sets up the environment and ensures I have a working GPU, which is critical for running the model efficiently.
<br><br>

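**💡 Why I Installed These**
- **torch:** Used here to confirm that Colab exposes a CUDA GPU; it also backs the Hugging Face embedding model in Section 6. llama-cpp-python handles its own GPU offloading, but without a GPU the model runs painfully slowly on the CPU.
- **llama-cpp-python:** This is the specialized library that knows how to efficiently run GGUF models on consumer hardware, including my Colab GPU. The cu123 index serves pre-compiled wheels built for CUDA 12.3. This runtime reports CUDA 12.5, and the load log in Section 2 still shows a layer offloaded to `CUDA0`, so the wheel worked here.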
- **llama-index:** The higher-level framework I'll use to connect my local Mistral model to external documents (RAG).

In [None]:
# 1. Install necessary libraries with CUDA support (just to be safe, though Colab often has torch installed)
!pip install -q torch

import torch


In [None]:
# 2. Check if the GPU is available and display its name
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")


CUDA available: True
GPU: Tesla T4


In [None]:
# 3. Check CUDA version first (this helps me pick the right 'llama-cpp-python' binary)
!nvcc --version

# Install llama-cpp-python with CUDA support (cu123 means CUDA 12.3).
# llama-cpp-python is the engine that runs our efficient GGUF model.
!pip install --no-cache-dir llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu123

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu123
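
The `nvcc` output above reports release 12.5, while the wheel index tag is `cu123`. A small sketch for comparing the two programmatically follows. The helper names `parse_nvcc_release` and `parse_wheel_tag` are hypothetical, and the tag parser assumes a two-digit major version (as in `cu118` or `cu123`).

```python
import re


def parse_nvcc_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_wheel_tag(tag: str) -> tuple[int, int]:
    """Turn a tag such as 'cu123' into (12, 3). Assumes a two-digit major version."""
    digits = tag.removeprefix("cu")
    return int(digits[:2]), int(digits[2:])


# One line copied from the nvcc output in this notebook.
sample = "Cuda compilation tools, release 12.5, V12.5.82"
toolkit = parse_nvcc_release(sample)
wheel = parse_wheel_tag("cu123")

print(f"Toolkit CUDA version: {toolkit}, wheel built for: {wheel}")
if toolkit != wheel:
    print("Versions differ; check the model load log to confirm layers reach the GPU.")
```

A mismatch is not necessarily a problem, which is why the sketch only prints a reminder instead of raising an error.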


In [None]:
# 4. Install LlamaIndex (The framework for building my RAG pipeline)
!pip install -q llama-index

# Install the 'jedi' package to resolve the ipython dependency conflict.
!pip install -q jedi




## **💾 Section 2: Load Mistral 7B in GGUF Format**

This is the core step: downloading the specific quantized model and initializing it to run on my GPU.
<br><br>

**💡 Model Parameters Explained**
- **`model_path`**: Where the GGUF file lives on my Colab instance.
- **`n_gpu_layers`**: How many of the model's layers are offloaded into GPU memory (VRAM). A positive number moves that many layers off the slower CPU, and -1 offloads all of them; the more layers on the GPU, the faster inference runs.
- **`n_ctx`**: The context window: the maximum combined length (in tokens) of the prompt and the response the model can handle.
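
To see why `n_ctx` matters for memory, here is a rough sketch of the key/value cache arithmetic. The layer count (32) and head dimension (128) match the metadata printed in the load log; the 8 key/value heads are an assumption on my part, since the log is cut off before that field. `kv_cache_mib` is a hypothetical helper, not part of llama-cpp-python.

```python
def kv_cache_mib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in MiB for an f16 cache.

    n_kv_heads=8 is an assumption (grouped-query attention); the other
    defaults come from the metadata shown in the load log.
    """
    # One K and one V vector per layer, per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_ctx * bytes_per_token / (1024 * 1024)


for n_ctx in (2048, 8192, 32768):
    print(f"n_ctx={n_ctx:>5}: about {kv_cache_mib(n_ctx):.0f} MiB of KV cache")
```

With `n_ctx=2048` the estimate is 256 MiB, which agrees with the `KV self size = 256.00 MiB` line in the load log. The cache grows linearly with the context window, so raising `n_ctx` competes with offloaded layers for the T4's memory.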

In [None]:
from llama_cpp import Llama
import os

# Define the local path where I want to save the model file
model_path = "/content/mistral.gguf"

# Check if the model is already downloaded to avoid re-downloading
if not os.path.exists(model_path):
    # !wget command downloads the Q4_K_M (4-bit quantization, medium quality) version of Mistral 7B Instruct v0.2
    # This specific version is ~4.1 GB and is optimized for speed and memory.
    !wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf -O {model_path}
    print(f"Model downloaded to {model_path}")

# Verification check
if os.path.exists(model_path):
    # Print the size for confirmation (it should be around 4.1 GB or 4166 MB)
    print(f"Model file exists. Size: {os.path.getsize(model_path) / (1024 * 1024):.2f} MB")
else:
    print("Model file not found!")

# Load the model with GPU acceleration
try:
    llm = Llama(
        model_path=model_path,

        # IMPORTANT: Offload model layers to the GPU. -1 offloads every layer.
        # I'm using 1 here to be safe on the Tesla T4, but I could try -1 for max speed
        # (the load log reports 33 offloadable layers for this model).
        n_gpu_layers=1,


        # Context window size: 2048 is standard, but Mistral supports much more (32768).
        # I'm keeping it small for quick testing.
        n_ctx=2048,
        verbose=True     # Show detailed loading progress (what layers are being loaded where)
    )

    print("Model loaded successfully!")

except Exception as e:
    print(f"Error loading model: {e}")

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/mistral.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.

Model file exists. Size: 4166.07 MB


llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4165.37 MiB
llm_load_tensors:      CUDA0 buffer size =   132.50 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   248.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =     8.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer

Model loaded successfully!


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': 
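
Beyond checking the file size, a quick header check can catch a truncated or wrong download before the slow load step. The sketch below assumes the usual GGUF layout (the 4-byte magic `GGUF` followed by a little-endian 32-bit version number); `read_gguf_version` is a hypothetical helper, and the demo writes a tiny stand-in file so it runs without the real model.

```python
import os
import struct
import tempfile


def read_gguf_version(path):
    """Return the GGUF version number, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumption: a little-endian uint32 version field follows the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Demo on a stand-in file so the sketch runs without the 4 GB download.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3))
    demo_path = tmp.name

demo_version = read_gguf_version(demo_path)
print(f"GGUF version: {demo_version}")
os.remove(demo_path)
```

On the real file, `read_gguf_version("/content/mistral.gguf")` should agree with the `GGUF V3` reported at the top of the load log.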

## **🚀 Section 3: Run a Basic Test Query**

Test the model to ensure it's functional and responsive.
<br><br>

**💡 What to Observe?**
- **Speed:** The `eval time` line reports generation speed in tokens per second. On the Tesla T4, **1.63 tokens per second** is slow; with only 1 of 33 layers offloaded, nearly all of the work still runs on the CPU.
- **Actionable Insight:** I should change `n_gpu_layers` from 1 to -1 (offload every layer) in Section 2 for a major speed boost.
- **Correctness:** The model's answer about RAG is **incorrect**: it expands RAG as "Recipe for Algorithmic Generalization" rather than Retrieval-Augmented Generation. This shows that Mistral 7B can hallucinate or confuse acronyms when it answers from its own weights alone. This is why I need a full RAG pipeline (connecting it to my documents) to keep it grounded!
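
The speed figure can be recomputed from the raw `llama_print_timings` numbers, which is handy when comparing runs with different `n_gpu_layers` values. This is a minimal sketch; `tokens_per_second` is a hypothetical helper, and the numbers are copied from the timing output of this run.

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and an elapsed time in milliseconds to tokens/second."""
    return n_tokens / (elapsed_ms / 1000.0)


# Values copied from the llama_print_timings output of this run.
generation = tokens_per_second(247, 151964.77)   # "eval time" line
prompt_eval = tokens_per_second(13, 10814.91)    # "prompt eval time" line

print(f"Generation:  {generation:.2f} tokens/s")
print(f"Prompt eval: {prompt_eval:.2f} tokens/s")
```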

In [None]:
# Test with a simple RAG-related query to see if it understands the concept
prompt = "What is RAG in the context of large language models?"
print(f"\nSending prompt: {prompt}")

# Run the inference!
response = llm(
    prompt,
    max_tokens=256,       # Set the maximum length for the model's answer
    temperature=0.1       # Low temperature for more focused, factual answers
    # Note: calling llm(prompt) sends the text as a raw completion; the [INST] chat template is not applied here.
)

print("\nResponse:")
print(response["choices"][0]["text"])


Sending prompt: What is RAG in the context of large language models?



llama_print_timings:        load time =   10815.00 ms
llama_print_timings:      sample time =      14.07 ms /   248 runs   (    0.06 ms per token, 17621.15 tokens per second)
llama_print_timings: prompt eval time =   10814.91 ms /    13 tokens (  831.92 ms per token,     1.20 tokens per second)
llama_print_timings:        eval time =  151964.77 ms /   247 runs   (  615.24 ms per token,     1.63 tokens per second)
llama_print_timings:       total time =  163015.44 ms /   260 tokens



Response:


RAG (Recipe for Algorithmic Generalization) is a framework developed by researchers at the University of California, Berkeley, for evaluating the generalization ability of large language models. The framework focuses on assessing how well a model can understand and apply instructions that are different from the training data, but still within the same domain.

The RAG framework consists of a set of benchmarks that test a model's ability to follow instructions, reason about missing information, and generalize to new situations. The benchmarks cover various domains such as mathematics, physics, and common sense reasoning. The framework also includes a human-in-the-loop component, where human annotators provide ground truth labels for the model's outputs, allowing for accurate evaluation and comparison of different models.

The RAG framework is important because it provides a more nuanced and comprehensive evaluation of large language models than traditional metrics such as a
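
The chat template in the model metadata wraps each user turn as `[INST] ... [/INST]`, and a plain `llm(prompt)` call does not apply that template. One option is to wrap the prompt by hand. This is a minimal sketch: `format_mistral_instruct` is a hypothetical helper, and it assumes the runtime adds the BOS token during tokenization.

```python
def format_mistral_instruct(user_message: str) -> str:
    """Wrap a single user turn in the [INST] ... [/INST] markers from the chat template."""
    # Assumption: BOS is added by the tokenizer, so it is not included here.
    return f"[INST] {user_message.strip()} [/INST]"


wrapped = format_mistral_instruct("What is RAG in the context of large language models?")
print(wrapped)
```

The wrapped string can be passed to `llm(...)` in place of the raw prompt. Whether this alone fixes the acronym confusion is not something this notebook tests.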

## **⚙️ Section 4: Installing RAG Components (with a PDF)**

Installing the specific connectors needed to build the **Retrieval-Augmented Generation (RAG)** pipeline: a PDF reader and the LlamaIndex integration for my local LLM.

In [None]:
# Install PyMuPDF (fitz) for reading the PDF document
!pip install -q pymupdf

# Install the specific LlamaIndex connector for models running via llama-cpp-python (GGUF)
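# Note: This step may re-install 'llama-cpp-python' (the log below shows a wheel being built).
# If that build lacks CUDA support, GPU offload is lost, so re-check the load log in Section 6.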
!pip install -q llama-index-llms-llama-cpp

[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m24.1/24.1 MB[0m [31m71.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m50.7/50.7 MB[0m [31m14.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for llama-cpp-python (pyproject.toml) ... [?25l[?25hdone


In [None]:
# Install a high-quality, lightweight embedding model from Hugging Face for creating vectors.
# Embeddings are essential for RAG, as they convert text into numerical vectors for searching.
!pip install -q llama-index-embeddings-huggingface

## **📄 Section 5: Data Loading and Preparation**

Simulating a real-world use case by loading a contract PDF and extracting its content.

In [12]:
# Placeholder content simulating a loaded document (used as a fallback)
raw_document_text = """
The monthly payment is due on the 1st of every month. Payments received after the 5th day
of the month will incur a late fee of $50. If payment is delayed by more than 30 days,
the account will be flagged, and an additional penalty of 1.5% of the outstanding balance
will be applied, compounded monthly. Failure to pay within 60 days will result in a
suspension of services and potential legal action. Please review section 4.3 for payment
processing guidelines and dispute resolution procedures. All disputes must be filed
within 10 calendar days of the late fee application date.
"""
text = raw_document_text
is_pdf_loaded = False


try:

  # Colab's `files` utility handles interactive file uploads.
  from google.colab import files

  # PyMuPDF (imported as 'fitz') for reliable, fast PDF parsing.
  import fitz
  print("\n--- Attempting interactive PDF upload ---")

  # --- 1. Document Loading and Extraction via Upload ---

  # Prompts to upload the PDF interactively from local machine.
  print("\n--- Uploading Document: 'sample_contract.pdf' ---")
  uploaded = files.upload()


  # Check if a file was successfully uploaded.
  if uploaded:
      # If successful, extracts the filename (which becomes the path) from the dictionary keys.
      pdf_path = list(uploaded.keys())[0]
      print(f"Successfully uploaded: {pdf_path}")

      # With valid pdf_path, the document can be opened and text can be extracted.
      # Using PyMuPDF (fitz) to open the PDF file for reading.
      doc = fitz.open(pdf_path)

      # Iterate through every page of the document to get the text from each,
      # and join them all together with a newline character (\n) as a separator.
      text = "\n".join([page.get_text() for page in doc])
      doc.close()

      # A quick check to make sure text extraction worked and to see the scale of data.
      print(f"✅ Extracted {len(text.split())} words from the contract.")
      is_pdf_loaded = True
  else:
      # If no file is uploaded, keep the placeholder text so the later steps still run.
      print("No file uploaded. Using placeholder text for RAG processing.")

except ImportError:
    # This block handles running outside a Colab environment
    print("⚠️ Skipping Colab/PyMuPDF interactive file upload (environment dependency).")
    print("Using placeholder text for RAG processing.")





--- Attempting interactive PDF upload ---

--- Uploading Document: 'sample_contract.pdf' ---


Saving sample_contract.pdf to sample_contract.pdf
Successfully uploaded: sample_contract.pdf
✅ Extracted 315 words from the contract.
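
Before indexing, the extracted text is split into chunks. The sketch below illustrates the general idea with fixed-size, overlapping word windows. It is only an illustration: `chunk_words` is a hypothetical helper, and LlamaIndex's own splitter works on sentences and tokens, so its chunk boundaries will differ.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


demo_chunks = chunk_words("a b c d e f g", chunk_size=3, overlap=1)
print(demo_chunks)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some words twice.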


## **🧠 Section 6: Configure and Build the RAG Pipeline**

Integrate my local LLM and embedding model into the LlamaIndex framework to make them work together.

In [13]:
# 🧠 Step 6: Configure LlamaIndex and Build the Vector Store 🧠

from llama_index.core import VectorStoreIndex, Document, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.settings import Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# 1. Configure the Local LLM (Mistral 7B) for use within LlamaIndex
llm = LlamaCPP(
    model_path="/content/mistral.gguf",
    temperature=0.7,
    max_new_tokens=512,
    context_window=2048,
    # Crucial: Offload layers to GPU for faster inference (though I should increase n_gpu_layers!)
    model_kwargs={"n_gpu_layers": 1}
)

# 2. Configure the Embedding Model
# This converts my text chunks into searchable vectors. BGE is a top-performing open-source choice.
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)


# 3. Set LlamaIndex defaults
# I'm telling LlamaIndex to use my local models, not default to OpenAI or others.
Settings.llm = llm
Settings.embed_model = embed_model


# 4. Prepare the Document Object
documents = [Document(text=text)]  # Wrap the extracted text in a LlamaIndex Document object


# 5. Build the Vector Index
# This is where the embedding model processes the text and stores the resulting vectors.
index = VectorStoreIndex.from_documents(documents)


# 6. Configure the Retriever
# The retriever decides which pieces of text (chunks) are most relevant to a query.
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,  # I'm asking for the top 2 most relevant chunks from the document.
)


# 7. Configure the Response Synthesizer
# This component takes the retrieved text and the original query, then feeds them to Mistral 7B to generate the final answer.
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", # A good mode for summarizing retrieved context.
)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /content/mistral.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.
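
To make the retriever's job concrete, here is a toy version of top-k similarity search. Word-count vectors stand in for the BGE embeddings, and `embed`, `cosine`, and `top_k` are hypothetical helpers, so this shows the ranking idea and not LlamaIndex's actual implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]


# Sentences taken from the placeholder contract text in Section 5.
chunks = [
    "Payments received after the 5th day of the month will incur a late fee of $50.",
    "Failure to pay within 60 days will result in a suspension of services.",
    "All disputes must be filed within 10 calendar days.",
]
results = top_k("late fee for payments", chunks, k=2)
print(results[0])
```

A real embedding model also matches paraphrases that share no words with the query, which word counts cannot do. The ranking step itself (score every chunk, keep the best `k`) is the same idea as `similarity_top_k=2`.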


## **🔍 Section 7: Run the RAG Query**

The final step: putting the entire pipeline to work to answer a question based on the document's content.

In [14]:
# Query the RAG Engine

query = "What are the late payment penalties in this contract?"

# Assemble the full Query Engine using the Retriever (for finding context) and the Synthesizer (for generating the answer).
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# Send the query to the engine
response = query_engine.query(query)
print(response)


llama_print_timings:        load time =    1623.47 ms
llama_print_timings:      sample time =       0.91 ms /    16 runs   (    0.06 ms per token, 17660.04 tokens per second)
llama_print_timings: prompt eval time =    2610.32 ms /   557 tokens (    4.69 ms per token,   213.38 tokens per second)
llama_print_timings:        eval time =    9459.46 ms /    15 runs   (  630.63 ms per token,     1.59 tokens per second)
llama_print_timings:       total time =   12083.43 ms /   572 tokens


1.5% per month from the due date until paid in full.
