<a href="https://colab.research.google.com/github/SrishaRavi-SrishaRavi/RAG-Pipeline-using-Wikipedia---Distilbert/blob/main/Building_RAG_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install wikipedia
!pip install transformers
!pip install sentence-transformers
!pip install faiss-cpu

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=60dd3588b89a4f506c28cc0d088885ad147c3b3a3a7369e53e70b6a61723a505
  Stored in directory: /root/.cache/pip/wheels/8f/ab/cb/45ccc40522d3a1c41e1d2ad53b8f33a62f394011ec38cd71c6
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cu

Let's import all the required libraries.

In [2]:
import wikipedia
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

def get_wikipedia_content(topic):
  try:
    page = wikipedia.page(topic)
    return page.content
  except wikipedia.exceptions.PageError:
    return None
  except wikipedia.exceptions.DisambiguationError as e:
    # Handle cases where the topic is ambiguous
    print(f"Ambiguous topic, please be more specific. Options: {e.options}")
    return None

topic = input("Enter a topic you want to learn about:")
document = get_wikipedia_content(topic)

if not document:
  print("Sorry, I am not able to retrieve the particular document, please check your spelling once")
  exit()

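# Use the embedding model's own tokenizer so chunk sizes are measured in its tokens.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

def split_text(text, chunk_size=256, chunk_overlap=20):
  """Split text into chunks of chunk_size tokens that overlap by chunk_overlap tokens."""
  if not 0 <= chunk_overlap < chunk_size:
    # Without this guard the loop below would never advance.
    raise ValueError("chunk_overlap must be smaller than chunk_size")
  tokens = tokenizer.tokenize(text)
  chunks = []
  start = 0
  while start < len(tokens):
    end = min(start + chunk_size, len(tokens))
    chunks.append(tokenizer.convert_tokens_to_string(tokens[start:end]))
    if end == len(tokens):
      break
    start = end - chunk_overlap  # step back so consecutive chunks share context
  return chunks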

chunks = split_text(document)
print(f"Number of chunks:{len(chunks)}")

embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = embedding_model.encode(chunks)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))


query = input("Ask a question about this topic")
query_embeddings = embedding_model.encode([query])

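# FAISS pads results with -1 when k exceeds the number of indexed vectors, so cap k.
k = min(3, len(chunks))
distances, indices = index.search(np.array(query_embeddings), k)
retrieved_chunks = [chunks[i] for i in indices[0]]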
print("Retrieved Chunks")
for chunk in retrieved_chunks:
  print("- "+ chunk)


qa_model_name = "distilbert-base-uncased-distilled-squad"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)
qa_pipeline = pipeline("question-answering", model=qa_model, tokenizer=qa_tokenizer)

context = " ".join(retrieved_chunks)
answer = qa_pipeline(question=query, context=context)
print(f"Answer:{answer['answer']}")

Enter a topic you want to learn about:Indian Democrazy


Token indices sequence length is longer than the specified maximum sequence length for this model (4841 > 512). Running this sequence through the model will result in indexing errors


Number of chunks:21
Ask a question about this topicWhat is the capital of India?
Retrieved Chunks
- representative governance. in the 10th century ce, inscriptions at the vaikunda perumal temple suggest the election of local representatives to village councils during the chola empire. = = = independence from colonial rule = = = following nearly two centuries of british colonial rule — initially under the east india company and later under direct governance by the british crown — india gained independence in 1947 after a sustained nationalist anti - colonial movement. this movement was predominantly led by the indian national congress ( inc ; also known simply as the " congress " ) and prominent figures such as mahatma gandhi and jawaharlal nehru. however, the movement was also shaped by a diverse range of ideological influences, including communism, dalit leaders, and to a lesser extent, hindutva, a far - right hindu nationalist ideology, though the latter ' s participation is debated.

Device set to use cpu


Answer:new delhi


## **Step 1: Retrieving Knowledge from Wikipedia**

Retrieving Wikipedia content for a user-provided topic using the Wikipedia API. If the topic is valid, the function returns the page content; otherwise, it handles the error by either notifying the user that the topic is ambiguous and listing the possible options, or exiting if no relevant page is found.

Enter a topic you want to learn about:Mangoes
Sorry, I am not able to retrieve the particular document


Wikipedia articles are long, so the retrieved content is tokenized and split into small overlapping chunks for better retrieval. The settings used are:

*  Pre-trained tokenizer: `sentence-transformers/all-mpnet-base-v2`
*  `chunk_size`: 256 tokens
*  `chunk_overlap`: 20 tokens
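
To see how these settings turn into chunk boundaries, here is a small sketch that mirrors the windowing loop in `split_text` but works on token counts only, so it needs no tokenizer. The helper name `chunk_windows` is made up for this illustration, and the sketch assumes that the 4841 reported in the tokenizer warning above is the document's token count.

```python
def chunk_windows(n_tokens, chunk_size=256, chunk_overlap=20):
    """Return (start, end) token offsets for overlapping chunks (hypothetical helper)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        # Step back by the overlap so neighbouring chunks share tokens.
        start = end - chunk_overlap
    return windows

# 4841 is assumed to be the token count from the warning in the run above.
windows = chunk_windows(4841)
print(len(windows))   # 21
print(windows[:2])    # [(0, 256), (236, 492)]
print(windows[-1])    # (4720, 4841)
```

Each window starts 236 tokens (256 - 20) after the previous one, and 21 windows cover the document, which is consistent with the `Number of chunks:21` output above.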


## **Step 2: Storing and Retrieving Knowledge**

Text chunks are converted into numerical embeddings using the Sentence Transformer model (`all-mpnet-base-v2`), which captures their semantic meaning.
A FAISS index with an L2 (Euclidean) distance metric is then created, and the embeddings are stored in it. This allows us to retrieve the most relevant chunks for a user’s query. `IndexFlatL2` compares the query against every stored vector, which is exact and fast enough for the few dozen chunks of a single article.
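
As a rough illustration of what a flat L2 index computes, the sketch below runs the same kind of nearest-neighbour search in plain Python on made-up 2-D vectors. `l2_search` is a hypothetical helper, not part of FAISS. It returns squared Euclidean distances, which is also what FAISS's `IndexFlatL2` reports.

```python
def l2_search(vectors, query, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance (illustrative only)."""
    k = min(k, len(vectors))  # never ask for more neighbours than there are vectors
    dists = [sum((v - q) ** 2 for v, q in zip(vec, query)) for vec in vectors]
    order = sorted(range(len(vectors)), key=dists.__getitem__)[:k]
    return [dists[i] for i in order], order

# Made-up 2-D "embeddings"; real all-mpnet-base-v2 vectors have many more dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
query = [1.0, 0.0]

distances, indices = l2_search(vectors, query, k=2)
print(indices)    # [1, 0]
print(distances)  # [0.0, 1.0]
```

The smaller the distance, the closer the stored vector is to the query, so the first index returned points at the best-matching chunk.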

## **Step 3: Querying the RAG Pipeline**

*  Convert the query into an embedding.
*  Retrieve the top-k most relevant chunks using FAISS.
*  Use a pre-trained question-answering model (`distilbert-base-uncased-distilled-squad`) to extract the answer from the retrieved chunks.
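
The first two steps above can be sketched end to end without downloading any model. This toy version swaps the sentence embedding for simple word counts, which is a stand-in and not what the notebook uses. The names `toy_embed` and `top_k` and the sample chunks are invented for illustration.

```python
from collections import Counter

def toy_embed(text, vocab):
    """Stand-in for a sentence embedding: one word-count entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]

def top_k(query_vec, chunk_vecs, k):
    """Indices of the k chunk vectors closest to the query by squared L2 distance."""
    dists = [sum((q - c) ** 2 for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    return sorted(range(len(chunk_vecs)), key=dists.__getitem__)[:k]

# Made-up chunks standing in for the Wikipedia chunks.
chunks = [
    "new delhi is the capital of india",
    "india gained independence in 1947",
    "the chola empire held village elections",
]
vocab = sorted({word for chunk in chunks for word in chunk.split()})
chunk_vecs = [toy_embed(chunk, vocab) for chunk in chunks]

query = "what is the capital of india"
hits = top_k(toy_embed(query, vocab), chunk_vecs, k=2)  # steps 1 and 2
context = " ".join(chunks[i] for i in hits)             # input for step 3
print(hits)     # [0, 1]
print(context)  # new delhi is the capital of india india gained independence in 1947
```

The joined `context` string is what would be handed to the question-answering model together with the query.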


## **Step 4: Answering the Question with an LLM**
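
The answering step in the code above uses `distilbert-base-uncased-distilled-squad`, a model fine-tuned on SQuAD that picks an answer span out of the supplied context instead of writing new text. The sketch below illustrates only the span-selection idea, using made-up scores. It is not the pipeline's actual implementation, and `best_span` and its `max_answer_len` cap are assumptions made for this example.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Pick (start, end, score) maximising start_scores[i] + end_scores[j] with i <= j."""
    best = None
    for i, start_score in enumerate(start_scores):
        # Only consider spans that end at or after the start and are not too long.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best

tokens = ["the", "capital", "of", "india", "is", "new", "delhi"]
# Made-up per-token scores; a real model would produce these from the question and context.
start_scores = [0.1, 0.2, 0.0, 0.3, 0.1, 4.0, 1.0]
end_scores = [0.0, 0.1, 0.0, 0.5, 0.1, 1.5, 5.0]

start, end, score = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))  # new delhi
```

Because the answer is copied from the context, the model can only return text that appears in the retrieved chunks, which is why retrieval quality in Step 3 matters so much.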