Large Language Models (LLMs) are powerful, but they have a major limitation: their knowledge is static and limited to the data they were trained on. This is where Retrieval-Augmented Generation (RAG) comes in. It enhances LLMs by retrieving relevant external knowledge before generating responses.
We will build a RAG Pipeline for LLMs using Hugging Face Transformers.

A Retrieval-Augmented Generation (RAG) pipeline consists of two key components:

- Retriever: Searches a knowledge base for relevant documents based on the user’s query.
- Generator: Uses retrieved documents as context to generate accurate and relevant responses.
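
As a rough illustration of how these two components fit together, here is a toy sketch in plain Python. It is not the pipeline built in this notebook: the `retrieve` and `generate` helpers and the two-sentence knowledge base are hypothetical. The "retriever" ranks documents by shared words, and the "generator" only assembles the prompt a real model would receive.

```python
import re

def words(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=1):
    """Toy retriever (assumption): rank documents by words shared with the query."""
    query_words = words(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & words(doc)), reverse=True)
    return ranked[:k]

def generate(query, context_docs):
    """Placeholder generator: builds the prompt that a real model would answer."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical two-document knowledge base
knowledge_base = [
    "VAT is a consumption tax charged on goods and services.",
    "FAISS is a library for similarity search over dense vectors.",
]

question = "What is VAT?"
top_docs = retrieve(question, knowledge_base, k=1)
prompt = generate(question, top_docs)
print(prompt)
```

The rest of the notebook replaces the word-overlap ranking with embedding similarity and the placeholder generator with a question-answering model.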

RAG reduces hallucinations by grounding responses in retrieved context instead of relying only on what the model memorized during training. It also keeps answers up-to-date: new knowledge is added by updating the knowledge base, which reduces the need for frequent retraining. Together, these make LLM responses more factually accurate, reliable, and context-aware.

Source: https://thecleverprogrammer.com/2025/02/25/building-a-rag-pipeline-for-llms/

In [1]:
import wikipedia
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

#### Retrieving Knowledge
- To simulate an external knowledge base, we’ll fetch relevant Wikipedia articles based on a given topic:

In [4]:
def get_wikipedia_content(topic):
    try:
        page = wikipedia.page(topic)
        return page.content
    except wikipedia.exceptions.PageError:
        return None
    except wikipedia.exceptions.DisambiguationError as e:
        # handle cases where the topic is ambiguous
        print(f"Ambiguous topic. Please be more specific. Options: {e.options}")
        return None

# user input
topic = input("Enter a topic to learn about: ")
document = get_wikipedia_content(topic)

# here we input the topic "VAT in South Africa"
# Check the result before printing, so a failed lookup does not print "None"
if not document:
    raise SystemExit("Could not retrieve information.")
print(document)

Taxation may involve payments to a minimum of two different levels of government: central government through SARS or to local government. Prior to 2001 the South African tax system was "source-based", where in income is taxed in the country where it originates. Since January 2001, the tax system was changed to "residence-based" wherein taxpayers residing in South Africa are taxed on their income irrespective of its source. Non residents are only subject to domestic taxes.
Central government revenues come primarily from income tax, value added tax (VAT) and corporation tax. Local government revenues come primarily from grants from central government funds and municipal rates. In the 2018/19 fiscal year SARS collected R 1 287.7 billion (equivalent to US$ 86.4 billion) in tax revenue, a figure R71.2 billion (or 5.8%) more than that from the previous fiscal year.
In 2018/19 financial year, South Africa had a tax-to-GDP ratio of 26.2% that was only slightly more than the 25.9% in 2017/18. T

In [7]:
# Check the document size in characters: letters, digits, spaces, and punctuation marks all count.
# Example: "Hello world!" has 12 characters (including the space and the exclamation mark).
print(len(document))

51789


#### Chunking
- Since Wikipedia articles can be long, we will split the text into smaller overlapping chunks for better retrieval.

The code below chunks the text by token count, using the tokenizer of the all-mpnet-base-v2 model.

##### Chunking Mechanism:
- Tokenization: The text is first converted into tokens using the model's tokenizer, which splits text into word pieces (subwords). For example, "Hello world" might become ["hello", "world"], while "unhappiness" might become ["un", "##happiness"].

- Chunk Size: chunk_size=256 means each chunk contains up to 256 tokens. This keeps each chunk below the maximum sequence length of many transformer models (often 512 tokens).

- Overlap: chunk_overlap=20 means consecutive chunks share 20 tokens. This helps maintain context across chunk boundaries.

- Sliding Window: The first chunk covers tokens 0 to 256. The next chunk starts at token (256 - 20) = 236 and goes to 236 + 256 = 492. This continues until all tokens are processed.
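
The window arithmetic above can be checked without a tokenizer. The sketch below uses a hypothetical `chunk_spans` helper that computes only the (start, end) token positions, and applies it to the 11148-token count that the tokenizer reports for this article.

```python
def chunk_spans(n_tokens, chunk_size=256, chunk_overlap=20):
    """Return (start, end) token positions for each sliding-window chunk."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        # Step back by the overlap so neighbouring chunks share tokens
        start = end - chunk_overlap
    return spans

spans = chunk_spans(11148)
print(spans[:3])   # [(0, 256), (236, 492), (472, 728)]
print(spans[-1])   # (11092, 11148)
print(len(spans))  # 48
```

Each window advances by 256 - 20 = 236 tokens, so the last chunk is usually shorter than the others.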


##### 1. Why chunk_size=256?
Transformer models (like all-mpnet-base-v2) have a maximum sequence length (often 512 tokens). 256 tokens is a safe middle ground: it is small enough to stay under the model's limit when combined with other text (e.g., prompts or queries), yet large enough to preserve meaningful context. If we set chunk_size=512, we would risk exceeding the model's token limit as soon as extra text is added, and we would lose efficiency, since longer chunks take more compute.
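
To make the budget argument concrete, here is a small back-of-the-envelope sketch. The `max_chunks_in_budget` helper and the 32 tokens reserved for a query are illustrative assumptions, not values taken from any model.

```python
def max_chunks_in_budget(model_limit, chunk_size, reserved_tokens):
    """How many whole chunks fit after reserving tokens for the query or prompt."""
    available = model_limit - reserved_tokens
    return max(available // chunk_size, 0)

# Assumed numbers: a 512-token limit and 32 tokens reserved for the query
fits_256 = max_chunks_in_budget(512, 256, 32)
fits_512 = max_chunks_in_budget(512, 512, 32)
print(fits_256)  # 1
print(fits_512)  # 0
```

With these numbers, a 256-token chunk still fits next to the query, while a 512-token chunk leaves no room at all.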

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

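def split_text(text, chunk_size=256, chunk_overlap=20):
    # The overlap must be smaller than the chunk size, or the window never advances
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Split the whole document into subword tokens
    tokens = tokenizer.tokenize(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        # Turn this window of tokens back into readable text
        chunk_text = tokenizer.convert_tokens_to_string(tokens[start:end])
        chunks.append({
            'text': chunk_text,
            'token_count': end - start,
            'char_count': len(chunk_text)
        })
        if end == len(tokens):
            break
        # Step back by the overlap so consecutive chunks share tokens
        start = end - chunk_overlap
    return chunks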

chunks = split_text(document)

print(f"\nTotal chunks: {len(chunks)}\n")
print("="*50)
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i}:")
    print("-"*20)
    print(chunk['text'])
    print("-"*20)
    print(f"Tokens: {chunk['token_count']}")
    print(f"Characters: {chunk['char_count']}")
    print("="*50)

Token indices sequence length is longer than the specified maximum sequence length for this model (11148 > 512). Running this sequence through the model will result in indexing errors



Total chunks: 48


Chunk 1:
--------------------
taxation may involve payments to a minimum of two different levels of government : central government through sars or to local government. prior to 2001 the south african tax system was " source - based ", where in income is taxed in the country where it originates. since january 2001, the tax system was changed to " residence - based " wherein taxpayers residing in south africa are taxed on their income irrespective of its source. non residents are only subject to domestic taxes. central government revenues come primarily from income tax, value added tax ( vat ) and corporation tax. local government revenues come primarily from grants from central government funds and municipal rates. in the 2018 / 19 fiscal year sars collected r 1 287. 7 billion ( equivalent to us $ 86. 4 billion ) in tax revenue, a figure r71. 2 billion ( or 5. 8 % ) more than that from the previous fiscal year. in 2018 / 19 financial year, south africa had a tax - t

#### Storing and Retrieving Knowledge
To efficiently search for relevant chunks based on a user's query, we will use Sentence Transformers to convert the text into embeddings and store them in a FAISS index, which serves as our vector database.
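
To give a feel for what a flat L2 index does, the sketch below mimics the idea in plain Python: compare the query against every stored vector and keep the closest ones. This is only an illustration of exact nearest-neighbour search, not how FAISS is implemented. The `l2_search` helper and the 3-dimensional vectors are made up for the example.

```python
def l2_search(vectors, query, k):
    """Return the k nearest vectors to `query` as (squared_distance, index) pairs."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    return scored[:k]

# Made-up 3-dimensional "embeddings" standing in for three chunks
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
toy_query = [1.0, 0.0, 0.0]

results = l2_search(toy_vectors, toy_query, k=2)
print(results)  # [(0.0, 0), (0.5, 2)]
```

A smaller distance means a closer match, so the retrieved chunks are the ones whose embeddings lie nearest to the query embedding.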


In [13]:
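embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
# chunks is a list of dicts, so pass only the 'text' field of each chunk to the encoder
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = embedding_model.encode(chunk_texts)

dimension = embeddings.shape[1]
# IndexFlatL2 performs exact (brute-force) search using L2 distance
index = faiss.IndexFlatL2(dimension)
# FAISS expects a float32 array
index.add(np.asarray(embeddings, dtype=np.float32))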

print(embeddings)



[[ 0.01459188 -0.01774593 -0.0095589  ... -0.02351378 -0.0146273
   0.00042935]
 [ 0.01131538 -0.01086929 -0.00028749 ... -0.03539604  0.00576692
  -0.01074661]
 [ 0.00170391  0.02792783 -0.02949901 ... -0.00161242  0.00039813
  -0.0116331 ]
 ...
 [ 0.02365734 -0.04670324 -0.00976814 ...  0.03365016  0.02881048
   0.00461233]
 [-0.03186371 -0.03576483  0.01363133 ...  0.03879073  0.01748673
   0.02023913]
 [-0.01187734 -0.01171428  0.00763611 ...  0.02620777  0.00085154
   0.00733871]]


#### Querying the RAG Pipeline
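Now, we will take user input for the RAG pipeline. When a user asks a question, we convert the query into an embedding, retrieve the most relevant chunks from the FAISS index, and pass them as context to a transformer-based question-answering model. Note that the model used here (deepset/roberta-base-squad2) is extractive: it selects the answer span from the retrieved text instead of generating new text.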

In [32]:
query = input("Ask a question about the topic: ")
query_embedding = embedding_model.encode([query])

k = 15
distances, indices = index.search(np.array(query_embedding), k)
retrieved_chunks = [chunks[i] for i in indices[0]]
print("Retrieved chunks:")
print(retrieved_chunks)

Retrieved chunks:
[{'text': 'of the country by collecting the revenue due to enable government to deliver on its constitutional obligations, policy and delivery priorities in pursuance of better life for all in south africa. by encouraging tax and customs compliance, we also aspire to contribute to the building of fiscal citizenship reflected by a law abiding society. the anchor for sars to deliver on this mandate is the higher purpose and values which drives and informs all sars employees \' behaviour. " = = = number of taxpayers = = = on 31 march 2019, the tax register of sars had in excess of 26 million entries, excluding the following : 1 ) those cases where the persons or entities were suspended ; 2 ) estates ; and 3 ) entities with unknown addresses. individuals made up 79 % of the entries with an aggregate income of r1. 7 trillion. the tax register increased to more than 27 million entries for 2020. in 2019, of the 22. 1 million individual taxpayers only 6. 6 million ( 31 % ) we

In [33]:
qa_model_name = "deepset/roberta-base-squad2"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)
qa_pipeline = pipeline("question-answering", model=qa_model, tokenizer=qa_tokenizer)

context = " ".join([chunk['text'] for chunk in retrieved_chunks])
answer = qa_pipeline(question=query, context=context)
print(f"Answer: {answer['answer']}")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Answer: collection of taxes
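
The question-answering pipeline returns a dictionary that also contains a confidence `score` and the `start` and `end` character positions of the answer within the context. One possible way to use the score is sketched below. The `answer_or_fallback` helper, the 0.3 threshold, and the two result dictionaries are hypothetical values for illustration, not output from the run above.

```python
def answer_or_fallback(result, min_score=0.3):
    """Return the extracted answer, or a fallback message when confidence is low."""
    if result["score"] < min_score:
        return "I could not find a confident answer in the retrieved text."
    return result["answer"]

# Hypothetical results, shaped like the dictionaries the pipeline returns
confident = {"score": 0.82, "start": 120, "end": 139, "answer": "collection of taxes"}
uncertain = {"score": 0.04, "start": 5, "end": 9, "answer": "sars"}

print(answer_or_fallback(confident))
print(answer_or_fallback(uncertain))
```

A suitable threshold depends on the model and the data, so it is best tuned on a handful of questions with known answers.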
