# RAG (Retrieval-Augmented Generation) Example for Payment Systems

This notebook demonstrates a **production-oriented RAG pipeline** designed for **payment and risk-related use cases**.

The focus is on:
- Grounded generation using internal knowledge
- Safety and explainability
- Explicit **evaluation and observability hooks**

The architecture shown here is suitable for regulated environments where AI assists humans rather than making decisions.

## Architecture Overview

This RAG system has two phases:

1. **Offline (batch) phase**
   - Load internal payment and risk documents from the `data` folder.
   - Split them into chunks, embed them, and build a vector index.
   - This runs as a batch job whenever policies or documentation change.

2. **Online (request-time) phase**
   - Receive a question from a payments analyst (for example, why a transaction was high risk).
   - Retrieve the most relevant chunks from the index.
   - Ask the LLM to answer **using only that retrieved context**.
   - Run validation rules and log inputs/outputs for audit and monitoring.

The rest of the notebook walks through these two phases step by step.
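
As a rough orientation, the online phase described above can be sketched as a single function that chains retrieval, generation, validation, and logging. This is only an illustrative skeleton: `RagAnswer`, `answer_question`, and the stub callables are hypothetical names introduced here, not part of LlamaIndex, and the policy text in the demo is made up.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RagAnswer:
    """Everything produced while answering one question (hypothetical container)."""
    question: str
    context_chunks: list[str]
    answer: str
    is_valid: bool
    audit_log: list[dict] = field(default_factory=list)


def answer_question(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    validate: Callable[[str], bool],
) -> RagAnswer:
    """Online phase: retrieve -> generate -> validate -> log."""
    chunks = retrieve(question)
    answer = generate(question, chunks)
    is_valid = validate(answer)
    log_entry = {"question": question, "num_chunks": len(chunks), "is_valid": is_valid}
    return RagAnswer(question, chunks, answer, is_valid, [log_entry])


# Stub components so the sketch runs without an index or a model.
# The "policy" sentence is invented purely for illustration.
demo = answer_question(
    "Why was this transaction flagged as high risk?",
    retrieve=lambda q: ["Example policy text: transfers above the daily limit are flagged."],
    generate=lambda q, ctx: f"Possible reason based on internal policy: {ctx[0]}",
    validate=lambda text: "i will approve" not in text.lower(),
)
print(demo.answer)
print(demo.audit_log)
```

Passing the components in as callables keeps each stage replaceable and testable on its own, which is the property the later sections rely on.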


## 1. Technology Stack

- **Python** for orchestration
- **LlamaIndex** for document ingestion, chunking, and retrieval
- **Vector-based RAG** for grounding responses in internal knowledge
- **Ollama** for local LLM inference and **HuggingFace** for local embeddings (both swappable with other providers through LlamaIndex)

This notebook focuses on architectural clarity rather than framework-specific optimizations.

## 2. Install Dependencies

In a production environment, these dependencies would be baked into the runtime image.

In [10]:
%pip install llama-index chromadb llama-index-llms-ollama llama-index-embeddings-huggingface psutil

## 3. Imports and Environment Configuration

In this step we:
- Import the core building blocks from LlamaIndex.
- Configure the **LLM** using Ollama (runs locally, no API key needed).
- Configure the **embedding model** using HuggingFace (runs locally, no API key needed).

**Prerequisites**: You need Ollama installed and running with a model downloaded.
Run this in your terminal first:
```bash
# Install Ollama from https://ollama.ai
ollama pull llama3.1:8b
```
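
Before running the remaining cells, it can help to confirm that the Ollama server is reachable and the model has been pulled. The sketch below assumes Ollama's default local address (`http://localhost:11434`, the same host that appears in the request logs later in this notebook) and its `/api/tags` endpoint for listing local models; the helper name `list_ollama_models` and the assumed response shape (`{"models": [{"name": ...}]}`) are introduced here for illustration.

```python
import json
import urllib.request
from typing import Optional


def list_ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> Optional[list[str]]:
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection refused and timeouts; ValueError covers malformed JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable on localhost:11434")
elif "llama3.1:8b" not in models:
    print("Ollama is running, but llama3.1:8b has not been pulled yet")
else:
    print("Ollama is ready")
```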

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


## 4. Offline Phase: Knowledge Ingestion

The `./data` directory is assumed to contain internal payment-related documents such as:
- Risk rules documentation
- Compliance policies
- Country-specific payment regulations

This ingestion step runs offline and is re-executed whenever knowledge changes.

In [11]:
documents = SimpleDirectoryReader('./data').load_data()
print(f"Loaded {len(documents)} documents")



Loaded 8 documents


## 5. Build the Vector Index

This step performs:
- **Chunking**: splitting documents into smaller pieces
- **Embedding**: converting text chunks into vector representations (using HuggingFace locally)
- **Vector indexing**: storing embeddings for fast similarity search

We also configure the **LLM** (Ollama with Llama 3.1 8B, running locally) that will be used to generate answers.

The resulting index is the foundation of the RAG system.
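
To make the three steps concrete, here is a toy, dependency-free sketch of chunking, embedding, and similarity search. It is not how LlamaIndex implements them: the word-window splitter, the bag-of-words "embedding", and the names `chunk_words`, `embed`, `cosine`, and `top_k` are simplifications assumed for illustration, and the sample text is made up.

```python
import math
from collections import Counter


def chunk_words(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real embedding model returns a dense float vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, entries: list[tuple[str, Counter]], k: int = 3) -> list[tuple[float, str]]:
    """Score every indexed chunk against the query and keep the k best."""
    q = embed(query)
    scored = [(cosine(q, vec), chunk) for chunk, vec in entries]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up sample text, indexed as (chunk, vector) pairs.
text = "invoices must reference a purchase order " * 3 + "emergency purchases need retroactive approval"
toy_index = [(c, embed(c)) for c in chunk_words(text, chunk_size=8, overlap=2)]
for score, chunk in top_k("emergency purchases", toy_index, k=2):
    print(f"{score:.2f}  {chunk}")
```

The overlap between neighbouring chunks is what keeps a sentence that straddles a boundary retrievable from at least one chunk.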

In [12]:
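# Local LLM served by Ollama; it is only called at query time.
llm = Ollama(model="llama3.1:8b", request_timeout=120.0)

# Local embedding model used to vectorize chunks (and, later, queries).
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Chunk, embed, and index the documents in one call.
index = VectorStoreIndex.from_documents(documents, llm=llm, embed_model=embed_model)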

2025-12-17 15:36:16,165 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
2025-12-17 15:36:19,147 - INFO - 1 prompt is loaded, with the key: query


## 6. Online Phase: Query Execution

In the online phase we simulate a **payments analyst** asking a question.
The system will:
- Take the analyst question as input.
- Retrieve the most relevant chunks from the index.
- Ask the LLM to answer using only that retrieved context.

Example use case:
**Explain why a payment transaction was flagged as high risk**.

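The goal is to help the analyst understand *possible reasons* based on internal policies and risk rules, not to automatically approve or reject the transaction.

The query in this section asks about the procure-to-pay process instead, because that is the topic of the sample policy document that shows up in the retrieval output.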
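
The "answer using only the retrieved context" step comes down to how the prompt is assembled. LlamaIndex's query engine applies its own prompt template internally, so the helper below is not what the library does; it is a minimal sketch, under that caveat, of what a grounded prompt can look like. `build_grounded_prompt` and the `[n]` citation convention are assumptions made for illustration.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the numbered context chunks."""
    if not chunks:
        # With no context there is nothing to ground the answer in.
        raise ValueError("no context retrieved; refuse rather than answer ungrounded")
    context = "\n\n".join(f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1))
    return (
        "You are assisting a payments analyst.\n"
        "Answer using only the context below and cite sources as [n].\n"
        "If the context is insufficient, say so. "
        "Do not make any approval or rejection decisions.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
    )


# Chunk texts here are short excerpts from the sample policy shown in this notebook's output.
prompt = build_grounded_prompt(
    "What controls apply to purchase orders?",
    ["The PO must be sent to the vendor within 2 business days.",
     "Retroactive approval must be obtained within 48 hours."],
)
print(prompt)
```

Numbering the chunks gives the analyst a way to trace each claim in the answer back to a specific retrieved passage.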

In [13]:
import time

# Retrieve the 3 most similar chunks per query and generate with the local LLM.
query_engine = index.as_query_engine(similarity_top_k=3, llm=llm)

query = """
You are assisting a payments analyst.

Question: What are the key steps in the procure-to-pay process and what controls should be in place?
Explain using only the retrieved internal policies and guidelines.
Do not make any approval or rejection decisions.
"""

# perf_counter() is monotonic, which makes it safer than time.time() for measuring intervals.
start_time = time.perf_counter()
response = query_engine.query(query)
query_latency = time.perf_counter() - start_time

print(response)

2025-12-17 15:36:21,877 - INFO - HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"
2025-12-17 15:36:26,849 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


The procure-to-pay process involves several key steps. First, purchases over $500 require a formal purchase requisition, which must include certain details such as item description, quantity, estimated cost, vendor information, and business justification. The requisition must be submitted through the SAP procurement system.

Once approved, a Purchase Order will be generated automatically. The PO must be sent to the vendor within 2 business days. Goods receipt must be recorded in SAP within 24 hours of delivery, and any discrepancies must be reported to Procurement immediately.

Invoices must reference a valid Purchase Order number and undergo three-way matching with the PO and Goods Receipt records. Invoices with variances over 5% require manual review. Payments should be processed within 5 business days, and payments over $50,000 require dual authorization.

To ensure compliance, all procurement activities are subject to internal audit, and documentation must be retained for 7 years.


## 7. Retrieval Observability Hook

Inspecting retrieved context is critical for debugging RAG quality.
This allows operators to verify that the system is grounding responses in the correct documents.

Here we inspect, for each retrieved chunk:
- The similarity score (how close it is to the query in embedding space).
- The source document metadata (for example, file name and page number).
- A short snippet of the retrieved text.

This is what lets a human analyst quickly see **where the model is getting its answer from**.

In [14]:
for source in response.source_nodes:
    print("---")
    print(f"Score: {source.score}")
    print(f"Source: {source.node.metadata.get('file_name')}, page {source.node.metadata.get('page_label')}")
    print(source.node.text[:300])

---
Score: 0.7480235902677769
Source: procure-to-pay-guideline.pdf, page 1
Procure-to-Pay Guidelines
Acme Corporation - Internal Policy Document
Version 2.1 - Effective Date: January 2024
1. Introduction
This document outlines the procure-to-pay process for Acme Corporation.
All employees must follow these guidelines when purchasing goods or services.
2. Purchase Requisiti
---
Score: 0.6988250829971236
Source: procure-to-pay-guideline.pdf, page 2
5. Purchase Order Creation
5.1 Once approved, a Purchase Order will be generated automatically.
5.2 The PO must be sent to the vendor within 2 business days.
5.3 PO numbers follow the format: PO-YYYY-XXXXXX
6. Goods Receipt
6.1 All goods must be inspected upon delivery.
6.2 The receiving department 
---
Score: 0.6985231131393258
Source: procure-to-pay-guideline.pdf, page 3
9. Emergency Purchases
9.1 Emergency purchases may bypass normal approval workflow.
9.2 Retroactive approval must be obtained within 48 hours.
9.3 Emergency purchases are li
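
For monitoring, the same fields can be captured as structured records rather than printed. The sketch below is one possible shape, not a LlamaIndex feature: `RetrievedChunk`, `summarize_retrieval`, and the `min_score` threshold of 0.5 are assumptions chosen for illustration. The sample values are rounded from the output above; in the notebook they would be built from `response.source_nodes`.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    score: float
    file_name: str
    page_label: str
    snippet: str


def summarize_retrieval(chunks: list[RetrievedChunk], min_score: float = 0.5) -> dict:
    """Condense one retrieval into a few fields that are easy to log and alert on."""
    scores = [c.score for c in chunks]
    return {
        "num_chunks": len(chunks),
        "top_score": max(scores, default=None),
        "sources": sorted({c.file_name for c in chunks}),
        # Chunks below the (arbitrary) threshold hint at weak grounding.
        "low_score_chunks": sum(1 for s in scores if s < min_score),
    }


retrieved = [
    RetrievedChunk(0.748, "procure-to-pay-guideline.pdf", "1", "Procure-to-Pay Guidelines"),
    RetrievedChunk(0.699, "procure-to-pay-guideline.pdf", "2", "5. Purchase Order Creation"),
    RetrievedChunk(0.699, "procure-to-pay-guideline.pdf", "3", "9. Emergency Purchases"),
]
print(summarize_retrieval(retrieved))
```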

## 8. Generation Observability Hook

With local models (Ollama), there's no API cost, but we still want to monitor:
- **Latency**: how long the query took end-to-end (retrieval + generation)
- **Response length**: size of the generated answer
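- **Memory usage**: RAM consumed by the Python process (the Ollama server runs as a separate process and is not included)
- **CPU usage**: processor utilization of the Python process, sampled over a short window after the query has finished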

These metrics help detect performance issues and resource constraints in production.

In [15]:
import psutil
import os

print(f"Query latency: {query_latency:.2f} seconds")
print(f"Response length: {len(str(response))} characters")

process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
cpu_percent = process.cpu_percent(interval=0.1)

print(f"Process memory usage: {memory_mb:.1f} MB")
print(f"Process CPU usage: {cpu_percent:.1f}%")

Query latency: 4.97 seconds
Response length: 999 characters
Process memory usage: 276.3 MB
Process CPU usage: 1.0%
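
The single latency number above covers retrieval and generation together. If the two stages are invoked separately, a small timing helper can record them individually. This is a sketch using only the standard library; the `timed` helper and the metric names are hypothetical, and the `time.sleep` calls stand in for real retriever and LLM calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, sink: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed calls are still measured.
        sink[label] = time.perf_counter() - start


metrics: dict[str, float] = {}

with timed("retrieval_s", metrics):
    time.sleep(0.01)  # stand-in for a retriever call

with timed("generation_s", metrics):
    time.sleep(0.02)  # stand-in for an LLM call

for name, seconds in metrics.items():
    print(f"{name}: {seconds:.3f}")
```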


## 9. Output Validation (Structural Guardrail)

In production, AI output should never be trusted blindly.
Typical validation steps include:
- Enforcing structured output (JSON / schemas)
- Confidence thresholds
- Consistency checks against deterministic systems

Below is a simplified example of a post-generation validation hook.
Here we specifically enforce a **"no autonomous decisions"** rule: the model is not allowed to say it will approve, reject, or block a payment.

In [16]:
from typing import TypedDict


class ValidationResult(TypedDict):
    is_valid: bool
    reasons: list[str]


def validate_response(text: str) -> ValidationResult:
    lowered = text.lower()
    forbidden_phrases = ["approve", "reject", "block the payment"]
    hits = [p for p in forbidden_phrases if p in lowered]
    return {"is_valid": len(hits) == 0, "reasons": [f"Contains forbidden term: {h}" for h in hits]}


validation = validate_response(str(response))
print(validation)

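{'is_valid': False, 'reasons': ['Contains forbidden term: approve']}

Note that this result is a false positive: the response never makes a decision, but the substring check matches `approve` inside "Once approved, a Purchase Order will be generated automatically." Plain substring matching is too coarse for this rule, so a production check should target decision statements rather than individual words.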
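
A somewhat stricter check could look for decision-like statements instead of bare substrings. The patterns below are a heuristic sketch, not a complete guardrail: the pattern list and the name `validate_no_decisions` are assumptions made for illustration, and a real deployment would need a broader, tested rule set.

```python
import re

# Heuristic patterns (illustrative only):
#  1. first-person decisions, e.g. "I will approve", "we have rejected"
#  2. direct instructions about a payment, e.g. "block the payment"
DECISION_PATTERNS = [
    re.compile(
        r"\b(?:i|we)\s+(?:(?:will|would|hereby|have)\s+)?"
        r"(?:approve|reject|block|decline)(?:d|s|ed)?\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\b(?:approve|reject|block|decline)\s+(?:this|the)\s+(?:payment|transaction)\b",
        re.IGNORECASE,
    ),
]


def validate_no_decisions(text: str) -> dict:
    """Flag text that states or instructs an approve/reject/block decision."""
    hits = [m.group(0) for pattern in DECISION_PATTERNS for m in pattern.finditer(text)]
    return {"is_valid": not hits, "reasons": [f"Decision-like phrase: {h!r}" for h in hits]}


# Descriptive policy text, taken from the response shown earlier in this notebook.
print(validate_no_decisions("Once approved, a Purchase Order will be generated automatically."))
# An actual decision statement.
print(validate_no_decisions("I will approve this payment."))
```

Word boundaries keep descriptive uses such as "approved" or "approval" from triggering the rule, while first-person and imperative phrasing still does.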


## 10. Retrieval Quality Evaluation (Offline)

Retrieval quality is evaluated separately from generation.

Typical offline metrics:
- Recall@K (did we retrieve the correct knowledge?)
- Precision@K (how much noise was retrieved?)

In practice, this requires a labeled dataset of questions and expected source documents.

Below we create a **tiny synthetic example** with one question and its expected source document, just to illustrate how such an evaluation loop works in code.

In [17]:
eval_queries = [
    {
        "query": "What are the steps in the procure-to-pay process?",
        "expected_doc_ids": ["procure-to-pay-guideline.pdf"],
    }
]


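def recall_at_k(engine, eval_data, k: int = 3) -> float:
    """Fraction of queries with at least one expected document among the top-k sources.

    Strictly speaking this is a hit rate; it equals Recall@K only when each query
    has a single expected document, as in the example above.
    """
    hits = 0
    total = len(eval_data)

    for row in eval_data:
        # engine.query() also runs generation; only the retrieved sources are used here.
        res = engine.query(row["query"])
        # k cannot exceed the engine's similarity_top_k, which caps source_nodes.
        retrieved_ids = [s.node.metadata.get("file_name") for s in res.source_nodes[:k]]
        if any(doc_id in retrieved_ids for doc_id in row["expected_doc_ids"]):
            hits += 1

    return hits / total if total > 0 else 0.0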


print("Recall@3:", recall_at_k(query_engine, eval_queries, k=3))


2025-12-17 15:36:49,112 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Recall@3: 1.0
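
For reference, both metrics listed at the start of this section can be computed per query from a ranked list of retrieved IDs and a set of relevant IDs, with no engine involved. The sketch below uses hypothetical document IDs; note that it divides precision by the number of items actually returned, which is one of several conventions (another divides by `k`).

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of relevant documents that appear among the top-k retrieved items."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved items that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


# Hypothetical IDs: two chunks came from the relevant file, one from an unrelated file.
retrieved = ["policy-a.pdf", "policy-a.pdf", "policy-b.pdf"]
relevant = {"policy-a.pdf"}

print("Recall@3:", recall_at_k(retrieved, relevant, k=3))
print("Precision@3:", precision_at_k(retrieved, relevant, k=3))
```

Because retrieval works on chunks, the same file can appear several times in the top-k list; recall deduplicates by document, while precision counts every retrieved chunk.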


## 11. End-to-End Evaluation Signals

Beyond technical metrics, system success is measured by:
- Analyst resolution time
- Reduction in support escalations
- Human corrections of AI output

These signals feed back into improving retrieval, chunking, and prompts.

## 12. Auditability and Governance

For each request, a production system should store:
- Input query and metadata
- Retrieved document identifiers
- Prompt version
- Model version
- Output and validation result

This enables regulatory audits and incident investigations.
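
One way to persist these fields is an append-only log with one JSON line per request. The sketch below assumes a hypothetical `AuditRecord` schema and a `prompt_version` label (`"v1"`); the content hash is an optional addition that makes later tampering or truncation easier to detect. It only builds the line and leaves storage to the surrounding system.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    query: str
    retrieved_doc_ids: list[str]
    prompt_version: str
    model_version: str
    output: str
    validation: dict
    timestamp: str


def to_audit_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line, with a SHA-256 digest of its content."""
    body = asdict(record)
    canonical = json.dumps(body, sort_keys=True, ensure_ascii=False)
    body["sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return json.dumps(body, sort_keys=True, ensure_ascii=False)


record = AuditRecord(
    query="What are the key steps in the procure-to-pay process?",
    retrieved_doc_ids=["procure-to-pay-guideline.pdf"],
    prompt_version="v1",  # hypothetical label for the prompt template in use
    model_version="llama3.1:8b",
    output="(generated answer text)",
    validation={"is_valid": False, "reasons": ["Contains forbidden term: approve"]},
    timestamp=datetime.now(timezone.utc).isoformat(),
)
line = to_audit_line(record)
print(line)
```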

## 13. Summary

This notebook demonstrates a **payment-safe RAG architecture** where:
- Retrieval grounds model responses in internal knowledge
- Generation is constrained and observable
- Evaluation and validation are first-class concerns

The same pattern can be extended with human-in-the-loop review, multi-model routing, and stricter compliance controls.