In [2]:
%pip install -q wikipedia transformers sentence-transformers faiss-cpu numpy



In [3]:
import wikipedia
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

Step 1: Retrieving Knowledge
To simulate an external knowledge base, we'll fetch the Wikipedia article for a given topic:


In [4]:
def get_wikipedia_content(topic):
    try:
        page = wikipedia.page(topic)
        return page.content
    except wikipedia.exceptions.PageError:
        return None
    except wikipedia.exceptions.DisambiguationError as e:
        # handle cases where the topic is ambiguous
        print(f"Ambiguous topic. Please be more specific. Options: {e.options}")
        return None

# user input
topic = input("Enter a topic to learn about: ")
document = get_wikipedia_content(topic)

if not document:
    # raise SystemExit instead of calling exit(), which can shut down a notebook kernel
    raise SystemExit("Could not retrieve information.")

Enter a topic to learn about: Apple Computers


Here, we retrieve Wikipedia content for a user-provided topic using the `wikipedia` package. If the page is found, the function returns its content. If no page matches, or the topic is ambiguous (in which case the candidate titles are printed), the function returns None and execution stops.

Since Wikipedia articles can be long, we will split the text into smaller overlapping chunks for better retrieval:

In [5]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

def split_text(text, chunk_size=256, chunk_overlap=20):
    tokens = tokenizer.tokenize(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokenizer.convert_tokens_to_string(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - chunk_overlap
    return chunks

chunks = split_text(document)
print(f"Number of chunks: {len(chunks)}")

Token indices sequence length is longer than the specified maximum sequence length for this model (17988 > 512). Running this sequence through the model will result in indexing errors


Number of chunks: 77


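Here, we tokenize the retrieved Wikipedia content with the pre-trained all-mpnet-base-v2 tokenizer and split it into overlapping chunks of 256 tokens each, with 20 tokens shared between neighbouring chunks so that context carries across chunk boundaries. The sequence-length warning above is harmless in this step: the full token sequence is only sliced into chunks and is never passed to the model. Note that converting tokens back into strings does not restore the original text exactly. This tokenizer lowercases its input and separates punctuation, which is why the retrieved chunks shown later appear in lowercase.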
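To make the windowing arithmetic concrete, here is a minimal sketch of the same sliding-window logic applied to a plain list of integers standing in for tokens. The names `split_tokens` and `expected_chunk_count` are ours, introduced only for illustration. The window advances by `chunk_size - chunk_overlap` = 236 tokens per step, so for the 17,988 tokens reported in the warning above, both the loop and the closed-form count give 77 chunks, matching the notebook output.

```python
import math

def split_tokens(tokens, chunk_size=256, chunk_overlap=20):
    """Slide a fixed-size window over `tokens`, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        # Guard added for this sketch: an overlap >= chunk_size would never advance
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        start = end - chunk_overlap
    return chunks

def expected_chunk_count(n_tokens, chunk_size=256, chunk_overlap=20):
    """Closed-form number of windows produced by split_tokens."""
    if n_tokens == 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    return math.ceil((n_tokens - chunk_size) / stride) + 1

tokens = list(range(17988))  # stand-in for real tokens
chunks = split_tokens(tokens)
print(len(chunks), expected_chunk_count(len(tokens)))
```

The last 20 items of each chunk reappear as the first 20 items of the next one, which is the overlap that keeps a sentence from being cut off at a chunk boundary without any surrounding context.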

Step 2: Storing and Retrieving Knowledge
To efficiently search for relevant chunks, we will use Sentence Transformers to convert text into embeddings and store them in a FAISS index:

In [6]:
embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = embedding_model.encode(chunks)

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))


Here, we converted the text chunks into numerical embeddings using the Sentence Transformer model (all-mpnet-base-v2), which captures their semantic meaning. We then created a FAISS index with an L2 (Euclidean) distance metric and stored the embeddings in it. This will allow us to efficiently retrieve the most relevant chunks based on a user’s query.
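Conceptually, a flat L2 index compares the query against every stored vector and keeps the closest ones. The sketch below shows that idea in plain Python on toy two-dimensional vectors. The helpers `squared_l2` and `flat_l2_search` are hypothetical names for illustration and are not part of the FAISS API. To our understanding, `IndexFlatL2` likewise performs an exhaustive search and reports squared Euclidean distances, so smaller values mean closer matches.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to `query`."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    nearest = heapq.nsmallest(k, scored)  # sorted from closest to farthest
    return [d for d, _ in nearest], [i for _, i in nearest]

# Toy "embeddings"; real ones from all-mpnet-base-v2 have many more dimensions
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(vectors, query=[0.9, 0.1], k=2)
print(indices, distances)
```

Because every vector is compared, the result is exact, and the cost grows linearly with the number of chunks. For the 77 chunks in this notebook that cost is negligible.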

Step 3: Querying the RAG Pipeline
Now, we will take user input for the RAG pipeline. When a user asks a question, we will:

1. Convert the query into an embedding.
2. Retrieve the top-k most relevant chunks using FAISS.
3. Use a pre-trained question-answering model to extract the answer from the retrieved chunks.

In [7]:
query = input("Ask a question about the topic: ")
query_embedding = embedding_model.encode([query])

k = 3
distances, indices = index.search(np.array(query_embedding), k)
retrieved_chunks = [chunks[i] for i in indices[0]]
print("Retrieved chunks:")
for chunk in retrieved_chunks:
    print("- " + chunk)

Ask a question about the topic: Legal Cases Against Apple Computers
Retrieved chunks:
- the courts as shell companies known as patent trolls, with no evidence of actual use of patents in question. on december 21, 2016, nokia announced that in the u. s. and germany, it has filed a suit against apple, claiming that the latter ' s products infringe on nokia ' s patents. most recently, in november 2017, the united states international trade commission announced an investigation into allegations of patent infringement in regards to apple ' s remote desktop technology ; aqua connect, a company that builds remote desktop software, has claimed that apple infringed on two of its patents. epic games filed lawsuit against apple in august 2020 in the united states district court for the northern district of california, related to apple ' s practices in the ios app store. in january 2022, ericsson sued apple over payment of royalty of 5g technology. on june 24, 2024, the european commission accused
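When inspecting retrieval quality, it helps to see each chunk next to its distance rather than the text alone. The sketch below pairs the two, using plain nested lists in the same `[query][rank]` layout that `index.search` returns. The helper `format_hits` and the sample values are hypothetical. It also skips indices of -1, which FAISS uses as padding when fewer than k results are available; without that check, `chunks[-1]` would silently return the last chunk.

```python
def format_hits(chunks, distances, indices, preview_chars=80):
    """Pair each retrieved chunk with its rank and distance (first query only)."""
    hits = []
    for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), start=1):
        if idx < 0:  # padding entry: no result for this rank
            continue
        hits.append({
            "rank": rank,
            "index": int(idx),
            "distance": float(dist),
            "preview": chunks[idx][:preview_chars],
        })
    return hits

# Made-up values for illustration only
sample_chunks = ["alpha chunk", "beta chunk", "gamma chunk"]
sample_distances = [[0.12, 0.48, 3.4e38]]
sample_indices = [[2, 0, -1]]

for hit in format_hits(sample_chunks, sample_distances, sample_indices):
    print(f"{hit['rank']}. [{hit['distance']:.2f}] chunk {hit['index']}: {hit['preview']}")
```

In the notebook, the call would be `format_hits(chunks, distances, indices)` right after `index.search`.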

Step 4: Answering the Question with an LLM
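Now, we will use a pre-trained question-answering model to extract the final answer from the retrieved context. The model used here, deepset/roberta-base-squad2, is an extractive reader: it selects a span of text from the context rather than generating new text.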

In [8]:
qa_model_name = "deepset/roberta-base-squad2"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)
qa_pipeline = pipeline("question-answering", model=qa_model, tokenizer=qa_tokenizer)

context = " ".join(retrieved_chunks)
answer = qa_pipeline(question=query, context=context)
print(f"Answer: {answer['answer']}")

Device set to use cuda:0


Answer: app store
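The question-answering pipeline returns a dictionary with `score`, `start`, `end`, and `answer` keys, so the confidence score can be used to reject weak answers. One possible refinement, sketched below under our own assumptions, is to run the reader on each retrieved chunk separately and keep only the highest-scoring non-empty answer. The helper `pick_best_answer`, the 0.1 threshold, and the sample results are all hypothetical and would need tuning on real questions.

```python
def pick_best_answer(results, min_score=0.1):
    """Return the highest-scoring non-empty answer, or None if none reaches min_score."""
    candidates = [
        r for r in results
        if r["answer"].strip() and r["score"] >= min_score
    ]
    return max(candidates, key=lambda r: r["score"], default=None)

# Made-up per-chunk results in the shape the pipeline returns
sample_results = [
    {"score": 0.04, "start": 10, "end": 18, "answer": "answer A"},
    {"score": 0.61, "start": 42, "end": 50, "answer": "answer B"},
    {"score": 0.90, "start": 0, "end": 0, "answer": ""},  # empty span: no answer found
]

best = pick_best_answer(sample_results)
print(best["answer"] if best else "No confident answer found.")
```

In the notebook, `sample_results` would be replaced by something like `[qa_pipeline(question=query, context=c) for c in retrieved_chunks]`.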


This gives you a minimal end-to-end RAG pipeline that you can extend for real-world AI applications.

Summary
In this article, we built a Retrieval-Augmented Generation (RAG) pipeline for LLMs using:

Wikipedia as an external knowledge base
Sentence Transformers for embedding generation
FAISS for fast and efficient retrieval
Hugging Face’s QA pipeline to extract final answers