# SCENARIO 3

You are working in the AI team of an educational platform that wants to build a smart chatbot. The goal is to help students and literature enthusiasts ask questions about the content of a specific literary short story and get accurate, context-aware answers.

Problem:

Build a RAG-based (Retrieval-Augmented Generation) chatbot that can answer questions from a given PDF file.




Your Task:

1. Extract text from the PDF using a Python library.

2. Split the text into small chunks.

3. Create embeddings for each chunk.

4. Build a retriever to find relevant chunks for a user question.

5. Generate answers by combining the user's question and the retrieved text using a language model.

Expected Output:
* A chatbot that can answer questions about the PDF
* Answers should be relevant and based on the document
* Try evaluating with a few test questions
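
One lightweight way to evaluate with a few test questions is to check whether each answer contains an expected keyword. The sketch below assumes a hypothetical `answer_fn` callable that maps a question to an answer string (for example, a wrapper around the retrieval and generation steps). The names `evaluate` and `stub_answer` are illustrative and not part of the notebook; the stub only stands in for the real chatbot so the example runs on its own.

```python
def evaluate(answer_fn, test_cases):
  """Return the fraction of test cases whose answer contains the expected keyword.

  answer_fn  -- hypothetical callable: question (str) -> answer (str)
  test_cases -- list of (question, expected_keyword) pairs
  """
  if not test_cases:
    return 0.0
  hits = 0
  for question, expected in test_cases:
    answer = answer_fn(question)
    # Case-insensitive substring match: crude, but enough for a smoke test.
    found = expected.lower() in answer.lower()
    hits += found
    print(f"{'PASS' if found else 'FAIL'} | {question} -> {answer}")
  return hits / len(test_cases)


# Stub standing in for the real chatbot (assumption, for illustration only).
def stub_answer(question):
  canned = {"who wrote the story": "Gabriel Garcia Marquez"}
  return canned.get(question.lower().rstrip("?"), "unknown")


cases = [
  ("Who wrote the story?", "Marquez"),
  ("What unusual ring did the woman wear?", "snake"),
]
score = evaluate(stub_answer, cases)
print(f"Accuracy: {score:.2f}")
```

Keyword matching can miss correct answers that are phrased differently, so a failed case is a prompt to read the answer rather than a verdict on it.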




In [2]:
# pip install transformers torch PyPDF2

In [3]:
from PyPDF2 import PdfReader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch
import textwrap


def extract_pdf_text(pdf_path):
  """
  Extracts and returns all text content from a PDF file.

  Parameters:
      pdf_path (str): The path to the PDF file.

  Returns:
      str: Combined text content from all pages in the PDF.
  """
  reader = PdfReader(pdf_path)
  page_texts = []
  for page in reader.pages:
    # Pages without extractable text (e.g. scanned images) are skipped.
    page_text = page.extract_text()
    if page_text:
      page_texts.append(page_text)
  # Join with newlines so words at page boundaries do not run together.
  return "\n".join(page_texts)



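def split_into_chunks(text, max_tokens=300):
  """
  Splits a long text string into smaller chunks on sentence boundaries.

  Parameters:
      text (str): The text to split.
      max_tokens (int): Maximum chunk size, counted in whitespace-separated
          words as a rough stand-in for model tokens.

  Returns:
      List[str]: A list of text chunks.
  """
  # Naive sentence split; abbreviations such as "Mr. " are split as well.
  sentences = [s.strip() for s in text.split('. ') if s.strip()]
  chunks, current_chunk = [], ""
  for sentence in sentences:
    # Restore the period removed by split() unless punctuation is present.
    if sentence[-1] not in '.!?':
      sentence += '.'
    if len(current_chunk.split()) + len(sentence.split()) < max_tokens:
      current_chunk += sentence + " "
    else:
      # Avoid an empty chunk when the very first sentence is too long.
      if current_chunk:
        chunks.append(current_chunk.strip())
      current_chunk = sentence + " "

  if current_chunk:
    chunks.append(current_chunk.strip())
  return chunks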


def load_flan_pipeline():
  """
  Loads 'google/flan-t5-base' model.

  Returns:
      transformers.Pipeline: A text generation pipeline.
  """

  model_name = "google/flan-t5-base"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
  pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
  return pipe


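import re


def find_relevant_chunk(query, chunks):
  """
  Finds the most relevant text chunk by counting the words it shares
  with the query. This is keyword overlap, not embedding similarity.

  Parameters:
      query (str): The user input.
      chunks (List[str]): A list of text chunks to search through.

  Returns:
      str: The chunk with the most words in common with the query,
          or an empty string if no chunk shares a word with it.
  """
  # Tokenize on word characters so that "story?" still matches "story".
  query_words = set(re.findall(r"\w+", query.lower()))
  best_score = 0
  best_chunk = ""
  for chunk in chunks:
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    score = len(query_words & chunk_words)
    if score > best_score:
      best_score = score
      best_chunk = chunk
  return best_chunk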
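
Step 3 of the task asks for embeddings, whereas `find_relevant_chunk` ranks chunks by raw word overlap. As a dependency-free middle ground, the sketch below represents each text as a term-frequency vector and ranks chunks by cosine similarity. The names `vectorize`, `cosine`, and `top_k_chunks` are hypothetical, and term counts are only a stand-in for embeddings: a learned sentence-embedding model could replace `vectorize` without changing the ranking logic. The sample chunks are toy strings, not text from the PDF.

```python
import math
import re
from collections import Counter


def vectorize(text):
  # Term-frequency vector: word -> count. A stand-in for a learned embedding.
  return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
  # Cosine similarity between two sparse count vectors.
  dot = sum(count * b[word] for word, count in a.items())
  norm_a = math.sqrt(sum(c * c for c in a.values()))
  norm_b = math.sqrt(sum(c * c for c in b.values()))
  if norm_a == 0 or norm_b == 0:
    return 0.0
  return dot / (norm_a * norm_b)


def top_k_chunks(query, chunks, k=2):
  # Score every chunk against the query and keep the k best matches.
  query_vec = vectorize(query)
  scored = [(cosine(query_vec, vectorize(chunk)), chunk) for chunk in chunks]
  scored.sort(key=lambda pair: pair[0], reverse=True)
  # Drop chunks that share no words with the query at all.
  return [chunk for score, chunk in scored[:k] if score > 0]


# Toy chunks for illustration only (assumption, not taken from the PDF).
sample_chunks = [
  "The woman wore a ring shaped like a snake.",
  "A storm hit the hotel in Havana.",
  "The author was raised by his grandparents.",
]
print(top_k_chunks("What ring did the woman wear?", sample_chunks, k=1))
```

Unlike plain overlap counting, cosine similarity normalizes by vector length, so a long chunk does not win merely because it contains more words.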



In [4]:
def chat_with_pdf(pdf_path):
  """
  Runs a chat loop in which user queries are answered
  based on the content of the PDF.

  Parameters:
      pdf_path (str): Path to the PDF file.
  """

  print("Reading PDF...")
  text = extract_pdf_text(pdf_path)
  chunks = split_into_chunks(text)
  flan_pipe = load_flan_pipeline()

  print("Ask me anything.\nType 'exit' to quit.\n\n")
  while True:
    query = input("You: ").strip()
    if query.lower() in ['exit', 'quit']:
      print("\n\nGOODBYE!\n\n")
      break
    if not query:
      # Ignore empty input instead of sending a blank question to the model.
      continue
    # Retrieve the best-matching chunk and use it as the answer context.
    context = find_relevant_chunk(query, chunks)
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    response = flan_pipe(prompt, max_new_tokens=100)[0]['generated_text']
    print("\nBot:", textwrap.fill(response, width=80), "\n")
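
`chat_with_pdf` passes a single chunk to the model. If several retrieved chunks are combined, the prompt can outgrow the model's input window, so it helps to cap the context. The sketch below keeps the notebook's prompt layout and adds a word budget. `build_prompt` and the `max_context_words` default are hypothetical, and words are only a rough proxy for model tokens.

```python
def build_prompt(query, contexts, max_context_words=350):
  # Add chunks in ranked order until the word budget is used up.
  kept, used = [], 0
  for chunk in contexts:
    words = chunk.split()
    if used + len(words) > max_context_words:
      # Truncate the chunk that crosses the budget.
      words = words[:max_context_words - used]
    if words:
      kept.append(" ".join(words))
      used += len(words)
    if used >= max_context_words:
      break
  context = "\n".join(kept)
  # Same layout as the prompt used in chat_with_pdf.
  return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"


prompt = build_prompt("Who was in the car?",
                      ["one two three", "four five six"],
                      max_context_words=4)
print(prompt)
```

Placing the highest-ranked chunk first means that, when the budget runs out, the least relevant text is what gets cut.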


In [15]:
chat_with_pdf("/content/3_story.pdf")

Reading PDF...


Device set to use cpu


Ask me anything.
Type 'exit' to quit.


You: who wrote the story

Bot: Gabriel Garcia Marquez 

You: what was frau frieda's talent

Bot: oracular 

You: What was discovered inside the car that hit the hotel wall

Bot: Frau Frieda 

You: Who was in the car?

Bot: She was the housekeeper for the new Portuguese ambassador and his wife 

You: How many stories are there in the section "I Sell My Dreams" belongs to?

Bot: three short stories and two long ones 

You: What unusual ring did the woman wear?

Bot: a snake ring 

You: What happened to the woman in the car during the storm in Havana?

Bot: she was extricated from the car encrusted in the wall of Havana Riviera Hotel 

You: Who raised Gabriel Garcia Marquez?

Bot: his grandparents 

You: What prize did Gabriel Garcia Marquez win?

Bot: the Nobel Prize in Literature 

You: What did Frau Frieda do at breakfast for the family she worked for?

Bot: learn the immediate future of each of its members 

You: What unusual item did the woman 