<a href="https://colab.research.google.com/github/SrishaRavi-SrishaRavi/RAG-Pipeline-using-Wikipedia---Distilbert/blob/main/Building_RAG_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's import all the required libraries.

## **Step 1: Retrieving Knowledge from Wikipedia**

In [None]:
import wikipedia
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

def get_wikipedia_content(topic):
  try:
    page = wikipedia.page(topic)
    return page.content
  except wikipedia.exceptions.PageError:
    return None
  except wikipedia.exceptions.DisambiguationError as e:
    # Handle cases where the topic is ambiguous
    print(f"Ambiguous topic, please be more specific. Options: {e.options}")
    return None

topic = input("Enter a topic you want to learn about: ")
document = get_wikipedia_content(topic)

if not document:
  # Stop here: the remaining steps need the article text.
  raise SystemExit("Sorry, I could not retrieve that page. Please check the spelling of the topic.")

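# Use the embedding model's own tokenizer so chunk sizes are measured in its tokens.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

def split_text(text, chunk_size=256, chunk_overlap=20):
  """Split text into chunks of up to chunk_size tokens that overlap by chunk_overlap tokens."""
  tokens = tokenizer.tokenize(text)
  chunks = []
  start = 0
  while start < len(tokens):
    end = min(start + chunk_size, len(tokens))
    chunks.append(tokenizer.convert_tokens_to_string(tokens[start:end]))
    if end == len(tokens):
      break
    # Step back so consecutive chunks share chunk_overlap tokens.
    start = end - chunk_overlap
  return chunks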

chunks = split_text(document)
print(f"Number of chunks:{len(chunks)}")

embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = embedding_model.encode(chunks)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))


query = input("Ask a question about this topic: ")
query_embeddings = embedding_model.encode([query])

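# Never ask for more neighbours than there are chunks: FAISS pads missing results with -1.
k = min(3, len(chunks))
distances, indices = index.search(np.array(query_embeddings), k)
retrieved_chunks = [chunks[i] for i in indices[0]]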
print("Retrieved Chunks")
for chunk in retrieved_chunks:
  print("- "+ chunk)


qa_model_name = "distilbert-base-uncased-distilled-squad"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)
qa_pipeline = pipeline("question-answering",model=qa_model,tokenizer=qa_tokenizer)

context = " ".join(retrieved_chunks)
answer = qa_pipeline(question=query,context=context)
print(f"Answer:{answer['answer']}")

In [None]:
import nbformat

# 🔧 Replace this with your notebook's filename
notebook_path = "/content/Building RAG Pipeline.ipynb"

# Load and clean
with open(notebook_path, "r", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

# Remove bad widget metadata if exists
if "widgets" in nb.metadata:
    del nb.metadata["widgets"]

# Save cleaned version (overwrite same file)
with open(notebook_path, "w", encoding="utf-8") as f:
    nbformat.write(nb, f)

print(f"✅ Cleaned notebook: {notebook_path}")


This step retrieves Wikipedia content for a user-provided topic using the Wikipedia API. If the topic is valid, the function returns the page content. Otherwise it handles the error: for an ambiguous topic it prints the candidate options, and if no content could be retrieved the script reports the problem and exits.

Wikipedia articles are long, so we split them into smaller overlapping chunks for better retrieval.

The retrieved Wikipedia content is tokenized and split into small overlapping chunks for efficient retrieval, with these settings:

*  Pre-trained tokenizer: `all-mpnet-base-v2`
*  `chunk_size`: 256 tokens
*  `chunk_overlap`: 20 tokens
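
To make the overlap concrete, here is a minimal, dependency-free sketch of the same sliding-window idea. It works on any list of tokens rather than on the `all-mpnet-base-v2` tokenizer output, and the name `sliding_chunks` and the toy sizes are assumptions chosen for illustration only.

```python
def sliding_chunks(tokens, chunk_size=256, chunk_overlap=20):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Toy example: 10 "tokens", windows of 4 that overlap by 1.
toy_tokens = list(range(10))
toy_windows = sliding_chunks(toy_tokens, chunk_size=4, chunk_overlap=1)
print(toy_windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.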


## **Step 2: Storing and Retrieving Knowledge**

The text chunks are converted into numerical embeddings using the Sentence Transformer model (`all-mpnet-base-v2`), which captures their semantic meaning. The embeddings are then stored in a FAISS index that uses the L2 (Euclidean) distance metric, which lets us efficiently retrieve the most relevant chunks for a user's query.
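
Conceptually, `faiss.IndexFlatL2` performs an exhaustive search: it compares the query against every stored vector and reports squared L2 distances. The pure-Python sketch below mimics that behaviour on toy 2-D vectors. The helper names and the vectors are assumptions for illustration, and FAISS itself is far faster than this loop.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Return (distances, ids) of the k stored vectors closest to the query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "embeddings": three stored vectors and one query.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_distances, toy_ids = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(toy_ids)  # [1, 0]: vector 1 is nearest, then vector 0
```

A smaller distance means a closer match, which is why the retrieved chunks come back ordered from most to least similar.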

## **Step 3: Querying the RAG Pipeline**

*  Convert the query into an embedding.
*  Retrieve the top-k most relevant chunks using FAISS.
*  Use a question-answering model (DistilBERT fine-tuned on SQuAD) to extract the answer from the retrieved chunks.
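
One detail worth guarding in the retrieval step: when an index holds fewer than `k` vectors, FAISS pads the returned ids with `-1`, and in Python `chunks[-1]` silently selects the last chunk. A small sketch of a defensive lookup follows. The function name `pick_chunks` and the sample data are hypothetical.

```python
def pick_chunks(chunks, ids):
    """Map one row of search-result ids to chunk texts, skipping -1 padding."""
    return [chunks[i] for i in ids if 0 <= i < len(chunks)]


# Two chunks in the index, but three neighbours were requested.
demo_chunks = ["chunk A", "chunk B"]
demo_ids = [1, 0, -1]
demo_result = pick_chunks(demo_chunks, demo_ids)
print(demo_result)  # ['chunk B', 'chunk A']
```

Capping `k` at the number of chunks before searching avoids the padding in the first place; the filter is a second line of defence.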


## **Step 4: Answering the Question with an LLM**
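
The code cell above answers the question with `distilbert-base-uncased-distilled-squad`, an extractive model: rather than writing new text, it scores each context token as a possible answer start and end, then returns the best-scoring span. The sketch below illustrates that span selection in plain Python. The scores are made up, `best_span` is a hypothetical helper, and the real pipeline works on model outputs and handles further details such as long contexts.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end, score) for the highest-scoring span with start <= end."""
    best = None
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Made-up scores for a six-token context (illustration only).
context_tokens = ["paris", "is", "the", "capital", "of", "france"]
start_scores = [0.1, 0.0, 0.2, 2.5, 0.1, 0.3]
end_scores = [0.2, 0.0, 0.1, 0.4, 0.1, 3.0]

s, e, _ = best_span(start_scores, end_scores)
span_text = " ".join(context_tokens[s:e + 1])
print(span_text)  # capital of france
```

Because the answer is always a span copied from the context, its quality depends directly on whether the retrieved chunks actually contain the answer.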