
## Downloading the dependencies

In [1]:
!pip install langchain transformers faiss-cpu PyPDF2


Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2, faiss-cpu
Successfully installed PyPDF2-3.0.1 faiss-cpu-1.9.0


In [26]:
!pip install transformers



## PDF loading and text extraction

---



In [59]:
from PyPDF2 import PdfReader

def extract_text_from_pdf(pdf_path):
  reader = PdfReader(pdf_path)
  text = ""
  for page in reader.pages:
    text += page.extract_text()
  return text


pdf_text = extract_text_from_pdf("/Active Retrieval Augmented Generation.pdf")

In [60]:
print(pdf_text)

Active Retrieval Augmented Generation
Zhengbao Jiang1∗Frank F. Xu1∗Luyu Gao1∗Zhiqing Sun1∗Qian Liu2
Jane Dwivedi-Yu3Yiming Yang1Jamie Callan1Graham Neubig1
1Language Technologies Institute, Carnegie Mellon University
2Sea AI Lab3FAIR, Meta
{zhengbaj,fangzhex,luyug,zhiqings,gneubig}@cs.cmu.edu
Abstract
Despite the remarkable ability of large lan-
guage models (LMs) to comprehend and gen-
erate language, they have a tendency to hal-
lucinate and create factually inaccurate out-
put. Augmenting LMs by retrieving informa-
tion from external knowledge resources is one
promising solution. Most existing retrieval aug-
mented LMs employ a retrieve-and-generate
setup that only retrieves information once based
on the input. This is limiting, however, in
more general scenarios involving generation
of long texts, where continually gathering in-
formation throughout generation is essential. In
this work, we provide a generalized view of ac-
tive retrieval augmented generation , methods
that activel

## Chunking the text data

In [61]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


def split_text(text, c_size=500, c_overlap=50):
  # c_size and c_overlap are measured in characters, not tokens
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=c_size, chunk_overlap=c_overlap)
  chunks = text_splitter.split_text(text)
  return chunks


chunks = split_text(pdf_text)

In [62]:
len(chunks)

207
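
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, so its chunk boundaries will differ. The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window repeats the last three characters of the previous one, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.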

In [42]:
#for i in chunks:
 # print(f"Page: "+i)

## Embedding the Text Chunks

In [63]:
from transformers import T5Tokenizer, T5EncoderModel
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5EncoderModel.from_pretrained("t5-base")


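def embed_text(text):
  # Truncate explicitly so inputs never exceed the 512-token limit
  inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
  with torch.no_grad():
    # Mean-pool the encoder's token states into one vector per input
    embeddings = model(**inputs).last_hidden_state.mean(dim=1)
  return embeddings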

chunk_embeddings = [embed_text(chunk).squeeze().numpy() for chunk in chunks]



In [66]:
print(len(chunk_embeddings[0]))

768
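
`embed_text` averages the encoder states over every token position. That is safe here because each chunk is embedded on its own, so no padding tokens are present. If several chunks were embedded in one batch, the padded positions would need to be left out of the average. The sketch below shows that idea with plain Python lists rather than tensors; `masked_mean_pool` and the toy values are hypothetical and not part of the notebook.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding position
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
unmasked = masked_mean_pool(token_vectors, [1, 1, 1])
print(pooled)    # [2.0, 3.0]
print(unmasked)  # padding drags the average toward zero
```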


## Store Embeddings in FAISS

In [67]:
import faiss
import numpy as np

# Initialize FAISS index
embedding_size = chunk_embeddings[0].shape[0]
index = faiss.IndexFlatL2(embedding_size)

# Convert embeddings to numpy array and add them to the index
faiss_embeddings = np.array(chunk_embeddings).astype("float32")
index.add(faiss_embeddings)

## Define the Retrieval

In [68]:
def retrieve_chunks(question, top_k = 3):
  question_embedding = embed_text(question).squeeze().numpy().astype("float32").reshape(1, -1)

  # Search FAISS index for similar embeddings
  distances, indices = index.search(question_embedding, top_k)
  retrieved_chunks = [chunks[i] for i in indices[0]]
  return retrieved_chunks
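
For intuition, `IndexFlatL2` performs an exact, exhaustive nearest-neighbour search and reports squared Euclidean distances. The sketch below reproduces that behaviour in plain Python on toy two-dimensional vectors. The helper `l2_search` and the toy data are hypothetical, and the loop is only meant to show what is computed, not how FAISS implements it.

```python
def l2_search(query, vectors, k):
    """Return (squared_distances, indices) of the k vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
toy_distances, toy_indices = l2_search([1.0, 0.25], toy_vectors, k=2)
print(toy_indices)    # [3, 1]
print(toy_distances)  # [0.0625, 0.5625]
```

The returned indices are positions in the order the vectors were added, which is why `retrieve_chunks` can use them directly to look up entries in `chunks`.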


## Generate an Answer Using T5

In [69]:
from transformers import T5ForConditionalGeneration

generation_model = T5ForConditionalGeneration.from_pretrained("t5-base")


def generate_answer(question, retrieved_chunks):
  # Combine retrieved chunks into a single context
  context = " ".join(retrieved_chunks)
  # T5's question-answering prompts use the form "question: ... context: ..."
  input_text = f"question: {question} context: {context}"

  # Tokenize (truncating to the 512-token input limit) and generate the answer
  inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
  outputs = generation_model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=150,
  )
  answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
  return answer

## Full question-answering function


In [73]:
# The FAISS index was already populated above; adding the same embeddings
# again would only duplicate every vector, so this cell just retrieves and generates.
question = "what is FLARE"
retrieved_chunks = retrieve_chunks(question, top_k=3)
response = generate_answer(question, retrieved_chunks)

In [70]:
def answer_question(pdf_path, question, top_k=3):
  # Extract text from PDF and split into chunks
  pdf_text = extract_text_from_pdf(pdf_path)
  chunks = split_text(pdf_text)

  # Embed chunks and store them in a fresh FAISS index for this PDF.
  # Reusing the global index would append duplicates on every call, and the
  # returned ids could then exceed len(chunks).
  chunk_embeddings = [embed_text(chunk).squeeze().numpy() for chunk in chunks]
  faiss_embeddings = np.array(chunk_embeddings).astype("float32")
  local_index = faiss.IndexFlatL2(faiss_embeddings.shape[1])
  local_index.add(faiss_embeddings)

  # Retrieve relevant chunks; FAISS fills missing results with -1, so skip those
  question_embedding = embed_text(question).squeeze().numpy().astype("float32").reshape(1, -1)
  distances, indices = local_index.search(question_embedding, top_k)
  retrieved_chunks = [chunks[i] for i in indices[0] if i != -1]

  # Generate and return the answer
  answer = generate_answer(question, retrieved_chunks)
  return answer


In [58]:
pdf_path = "/Active Retrieval Augmented Generation.pdf"
question = "Single-time Retrieval Augmented Generation"
print(answer_question(pdf_path, question))