Overview: What I Have Done and Why
In this project, I built a question-answering system that takes a PDF document as input and lets users ask questions about its content. The system answers from the document and, to make the interaction more engaging, reads each answer aloud through a text-to-speech feature.

The solution combines the following tools, libraries, and techniques:

Natural Language Processing (NLP) with Hugging Face Transformers:
I used LaMini-T5-738M, an instruction-tuned T5 language model for text generation. It generates an answer from the user's question together with the relevant text retrieved from the document.
Document Processing and Retrieval:
I used LangChain, PDFMiner, and Chroma to read, process, and store the content of the PDF document in a way that allows for efficient searching and retrieval. This way, the system can find relevant sections of the document to answer specific questions.
Text-to-Speech (TTS):
To enhance the user experience, I integrated a text-to-speech feature using gTTS and pydub, allowing the system to read the answers out loud to the user.
Interactive Question-Answering System:
The system was designed to allow users to ask one question at a time and receive a text-based and audio response. The interaction is controlled, so after receiving an answer, the system stops, giving the user control over when to ask the next question.
Why I Used These Tools and Features
1. Motivation Behind Using Hugging Face Transformers

I chose Hugging Face Transformers because it provides NLP models that are pre-trained on large amounts of text data. The specific model, LaMini-T5-738M, is an instruction-tuned model that can handle tasks such as question answering and text summarization. Given a user question and supporting text, it generates a text-based answer.
2. PDF Document Processing

Motivation: Extracting text from PDFs is not straightforward because PDF files are not designed to be easily parsed as plain text. Therefore, I used PDFMiner, a widely used library for extracting text from PDFs, including ones with complex layouts.
To ensure efficient search and retrieval of relevant sections of the PDF, I used LangChain combined with Chroma: LangChain breaks the document into smaller chunks, and Chroma stores them as embeddings (numerical representations). These embeddings can then be searched by semantic similarity, which finds relevant passages even when their wording differs from the question, something a plain keyword search over raw text would miss.
3. Question-Answer System with Langchain and Chroma

Motivation: PDFs can contain a lot of content, making it difficult to directly retrieve the most relevant parts to answer a specific question. To handle this, I used Langchain to create a pipeline that integrates a retrieval system (with Chroma) with the language model (from Hugging Face). This allows us to efficiently search for relevant parts of the document and use those to generate accurate answers to user questions.
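As a rough illustration of the retrieval idea, the sketch below ranks chunks by cosine similarity to a question. It is not what Chroma runs internally: the word-count vectors are a toy stand-in for real sentence embeddings, the helper names (embed, cosine, top_k) are hypothetical, and the sample chunks are made up for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Made-up chunks, only for illustration
chunks = [
    "the suggested vector database is pinecone",
    "the interface should let users upload documents",
    "share the code and deployment instructions",
]
best = top_k("which vector database is suggested", chunks, k=1)
print(best)
```

A real vector store replaces the word counts with learned embeddings and uses an index so that the search stays fast for large documents.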
4. Text-to-Speech Feature

Motivation: The goal was to make the interaction more engaging. Adding a text-to-speech feature lets the system read out the generated answer, making it more accessible and interactive for the user. The gTTS (Google Text-to-Speech) library converts text into speech, and pydub converts the resulting MP3 audio to WAV for playback in Google Colab.
5. Designed for Simplicity and User Control

Motivation: I wanted the interaction to be simple and controlled. By designing the system to handle one question at a time and stop after providing the answer, I ensure that the user is in control. They can decide when to ask the next question by manually calling the function again. This setup is intuitive, especially for exploratory tasks where users might want to ask multiple questions over time.

In [8]:
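!pip install transformers
!pip install chromadb
!pip install langchain
!pip install langchain-community
!pip install pdfminer.six
!pip install sentence-transformers
!pip install pydub
!pip install gTTS
!apt-get -qq install -y ffmpeg  # pydub needs the ffmpeg binary; the pip package named ffmpeg does not provide it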



What is happening here?

Motivation: Every machine learning or natural language processing task requires specific libraries. Here, we're installing the necessary libraries to handle PDF documents, language models, and text-to-speech processing. This step ensures that all the tools we need are available in the environment.
Breakdown of libraries:
transformers: Used for natural language models, such as the one we're using to answer questions. Hugging Face’s transformers library gives us access to powerful pre-trained language models.
chromadb: We use this for managing embeddings of the text. Chroma provides a vector database that lets us store the numerical representations of text and retrieve the most relevant parts efficiently.
langchain: This library allows us to build the "chains" for our question-answering process. It helps to integrate language models, retrieve relevant data, and process results in an efficient way.
pdfminer.six: This library is responsible for reading and extracting text from PDFs. It breaks down the content in a way that the model can understand.
sentence-transformers: A powerful tool for creating embeddings (numerical representations) of the text that we can use to search for relevant parts of the document.
pydub and ffmpeg: pydub converts the generated speech from MP3 to WAV so that we can play it back in Colab. It relies on the ffmpeg executable to decode the MP3 audio.
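Because pydub calls an external ffmpeg executable to decode MP3 audio, it can help to confirm that the executable is present before running the text-to-speech step. The check below is a small sketch using only the standard library; the function name ffmpeg_available is hypothetical.

```python
import shutil

def ffmpeg_available():
    """Return True if an ffmpeg executable can be found on the system PATH."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_available():
    print("ffmpeg found: pydub can decode and convert MP3 audio")
else:
    print("ffmpeg not found: install it with the system package manager")
```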


In [9]:
import os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from chromadb.config import Settings
from chromadb import Client
from IPython.display import Audio, display
from pydub import AudioSegment
from io import BytesIO


What is happening here?

Motivation: We need to bring in the core functions that help us handle PDFs, process text, manage embeddings, and use pre-trained models.
Breakdown of important imports:
transformers.AutoTokenizer and AutoModelForSeq2SeqLM: These classes load a pre-trained language model and its tokenizer. The model we're using is a variant of T5, an encoder-decoder model that maps input text to output text, which suits question answering.
pipeline: This allows us to easily set up a text-generation pipeline where the input is a question, and the output is an answer. It simplifies how we interact with the language model.
PDFMinerLoader: A document loader from langchain_community that reads the PDF and extracts its text. Splitting the text into chunks happens in a separate step.
RecursiveCharacterTextSplitter: This class breaks the large PDF text into smaller, manageable chunks that we can use to find relevant information.
SentenceTransformerEmbeddings: This is used to create numerical "embeddings" from text, so we can store and compare different parts of the document.
Chroma: This helps us build a searchable database where we can store embeddings and retrieve relevant chunks of the PDF efficiently.
gTTS (in the text-to-speech section): Used for converting text into speech, so the response can also be read aloud.
pydub and AudioSegment: These are used to process and play audio files directly in Colab.

In [11]:
checkpoint = "MBZUAI/LaMini-T5-738M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
base_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)


What is happening here?

Motivation: I chose the LaMini-T5-738M model because it is an instruction-tuned model that handles question-answering prompts and relatively complex queries well. T5 is a text-to-text model: it transforms one piece of text into another, which fits this task, where a question plus context is turned into an answer.
Details:
The checkpoint is the model's identifier on the Hugging Face Hub. We're using a version of T5 that has been fine-tuned for tasks like summarization, question answering, etc.
AutoTokenizer: This prepares the input text (your question) in a way that the model can understand.
AutoModelForSeq2SeqLM: This loads the actual pre-trained model, which will generate answers.

In [12]:

# Directory to store the embeddings
persist_directory = "db"

# Function to ingest PDF data and process it
def data_ingestion(pdf_file):
    loader = PDFMinerLoader(pdf_file)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)  # small overlap between neighboring chunks
    texts = text_splitter.split_documents(documents)

    # Create embeddings
    embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

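    # Create a persistent Chroma client (chromadb >= 0.4 keeps data in memory unless is_persistent=True)
    chroma_client = Client(Settings(persist_directory=persist_directory, is_persistent=True))

    # Create vector store; the persistent client writes it to persist_directory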
    db = Chroma.from_documents(
        texts,
        embeddings,
        client=chroma_client
    )
    return db

What is happening here?

Motivation: I needed a way to extract the text from the PDF and break it into manageable chunks. PDFs are not structured like regular text, so we need specialized tools to properly extract and handle their content. The idea is to "read" the PDF in a way that the machine can understand.
Steps:
PDFMinerLoader: Loads the PDF and extracts the raw text.
RecursiveCharacterTextSplitter: This function breaks the text into small pieces (500 characters) that overlap slightly. The reason for the overlap is that relevant information might spill from one chunk to the next.
SentenceTransformerEmbeddings: This transforms the text chunks into numerical "embeddings" that can be compared and searched by similarity.
Chroma: It stores the embeddings in a searchable database, allowing us to retrieve relevant sections when we ask questions.
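The chunking-with-overlap idea from the steps above can be sketched in plain Python. This is a simplified fixed-window splitter, not the algorithm RecursiveCharacterTextSplitter actually uses (which prefers to split on separators such as paragraph and line breaks); the function name split_with_overlap and the sizes are illustrative assumptions. Note that the overlap has to be smaller than the chunk size, otherwise the window never advances.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is long enough.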

In [13]:
def llm_pipeline():
    pipe = pipeline(
        'text2text-generation',
        model=base_model,
        tokenizer=tokenizer,
        max_length=256,
        do_sample=True,
        temperature=0.3,
        top_p=0.95
    )
    local_llm = HuggingFacePipeline(pipeline=pipe)
    return local_llm

What is happening here?

Motivation: I wanted a simple way to interact with the language model, where I can provide a question and receive a generated answer in return. The Hugging Face pipeline allows us to connect the model with just a few lines of code.
Details:
text2text-generation: This task type tells the pipeline that we will provide input text and expect generated output text from a sequence-to-sequence model (as opposed to, say, a classification task).
temperature and top_p: These settings control how "creative" the model is. A lower temperature (0.3) sharpens the token probabilities, so the model gives more straightforward answers. top_p=0.95 enables nucleus sampling: at each step the model samples only from the smallest set of most likely tokens whose combined probability reaches 95%, which keeps answers focused.
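To make the effect of these two settings concrete, here is a small sketch on a made-up four-token distribution. The logits are invented for illustration and the helper names (apply_temperature, nucleus) are hypothetical; the transformers library applies the same ideas internally during generation, but this is not its code.

```python
import math

def apply_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Return indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.0, -1.0]          # invented scores for four candidate tokens
warm = apply_temperature(logits, 1.0)   # unmodified distribution
cool = apply_temperature(logits, 0.3)   # sharper distribution
print(nucleus(warm, 0.95), nucleus(cool, 0.95))  # [0, 1, 2] [0]
```

At temperature 1.0 three tokens are needed to cover 95% of the probability mass, while at 0.3 the top token alone covers it, so sampling becomes nearly deterministic.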

In [14]:
def qa_llm(db):
    llm = llm_pipeline()
    retriever = db.as_retriever()
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    return qa


What is happening here?

Motivation: The goal here is to link the language model with the document database we built earlier (the one holding the PDF content). This way, when we ask a question, the system knows how to find the relevant parts of the PDF to help answer it.
Details:
from_chain_type: This method builds a chain that first retrieves relevant chunks from the document database (Chroma) and then passes them to the model along with the question. The "stuff" chain type places all retrieved chunks into a single prompt.
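What the "stuff" chain type does with the retrieved chunks can be sketched as simple string assembly. The prompt wording below is an assumption for illustration, not LangChain's actual default template, and build_stuff_prompt is a hypothetical helper.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "What vector database is suggested?",
    ["Chunk A text.", "Chunk B text."],  # placeholder chunks
)
print(prompt)
```

Because every chunk goes into one prompt, the combined text has to fit within the model's input limit, which is one reason to keep chunks small and retrieve only a few of them.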

In [15]:
def process_answer(query, qa_model):
    generated_text = qa_model(query)
    answer = generated_text['result']
    return answer

What is happening here?

Motivation: The user will input a question, and this function sends the question to the question-answer model (qa_model). The model then processes the query and returns the answer.
Details:
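The chain returns a dictionary, and the function extracts the generated answer from its 'result' key. Because return_source_documents=True was set when the chain was built, the retrieved chunks are also available under the 'source_documents' key.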
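The source chunks returned alongside the answer can be useful for checking where an answer came from. The sketch below formats such a result; it runs on a hand-built dictionary, with FakeDocument as a hypothetical stand-in for the document objects the chain returns (assumed here to expose page_content and metadata, with the file path under a "source" metadata key).

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in for a retrieved document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def summarize_result(result, preview_chars=60):
    """Return the answer plus a short preview of each source chunk."""
    lines = [f"Answer: {result['result']}"]
    for i, doc in enumerate(result.get("source_documents", []), start=1):
        source = doc.metadata.get("source", "unknown")
        lines.append(f"[{i}] {source}: {doc.page_content[:preview_chars]}")
    return "\n".join(lines)

# Hand-built example shaped like the chain's output
fake_result = {
    "query": "example question",
    "result": "example answer",
    "source_documents": [FakeDocument("example chunk text", {"source": "example.pdf"})],
}
print(summarize_result(fake_result))
```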

In [16]:
def text_to_speech(text):
    from gtts import gTTS
    tts = gTTS(text=text, lang='en')

    # Create a bytes buffer for the audio
    audio_fp = BytesIO()
    tts.write_to_fp(audio_fp)
    audio_fp.seek(0)

    # Load the audio in pydub and convert to playable format
    audio_segment = AudioSegment.from_file(audio_fp, format="mp3")

    # Export audio in wav format as Colab prefers this format for playback
    audio_buffer = BytesIO()
    audio_segment.export(audio_buffer, format="wav")
    audio_buffer.seek(0)

    # Play the audio in the notebook
    display(Audio(audio_buffer.read(), autoplay=True))


What is happening here?

Motivation: I wanted to make the interaction more immersive by adding a text-to-speech feature. This allows the answer to be played out loud after it's provided.
Details:
gTTS: Converts the text response into speech.
pydub: Handles the audio processing and format conversion (from MP3 to WAV).
Audio: The Audio class from IPython.display plays the generated audio directly in the Colab notebook.
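If playback is silent or fails, one way to debug is to inspect the WAV bytes before handing them to Audio. The sketch below uses only the standard library's wave module; the helper names are hypothetical, and a synthetic silent clip stands in for the speech that pydub would export.

```python
import io
import wave

def wav_duration_seconds(wav_bytes):
    """Read a WAV byte string and return its duration in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def make_silent_wav(seconds=1.0, rate=16000):
    """Build a mono 16-bit silent WAV in memory (stand-in for exported speech)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()

print(wav_duration_seconds(make_silent_wav(0.5)))  # 0.5
```

A duration of zero or an exception from wave.open would point to a problem in the conversion step rather than in the notebook's audio player.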

In [17]:
def ask_question_once():
    # Input a single question
    question = input("Enter your question: ")

    # Process the question and display the answer
    answer = process_answer({'query': question}, qa_model)
    print("\nAnswer:", answer)

    # Convert the answer to speech and play it
    text_to_speech(answer)


What is happening here?

Motivation: This function ties everything together. It lets the user enter a question, processes the question using the model, displays the answer, and plays the answer as speech.
Details:
It takes one question at a time and provides one response at a time. The function returns after giving the answer, so users call it again manually for the next question.

In [19]:
# Install necessary libraries
!pip install transformers
!pip install chromadb
!pip install langchain
!pip install langchain-community
!pip install pdfminer.six
!pip install sentence-transformers
!pip install pydub  # Install pydub for audio handling
!pip install gTTS  # Install gTTS for text-to-speech
!apt-get -qq install -y ffmpeg  # Install the ffmpeg binary that pydub uses for audio formats

import os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from chromadb.config import Settings
from chromadb import Client
from IPython.display import Audio, display
from pydub import AudioSegment
from io import BytesIO

# Set the model checkpoint
checkpoint = "MBZUAI/LaMini-T5-738M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
base_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Directory to store the embeddings
persist_directory = "db"

# Function to ingest PDF data and process it
def data_ingestion(pdf_file):
    loader = PDFMinerLoader(pdf_file)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)  # small overlap between neighboring chunks
    texts = text_splitter.split_documents(documents)

    # Create embeddings
    embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

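    # Create a persistent Chroma client (chromadb >= 0.4 keeps data in memory unless is_persistent=True)
    chroma_client = Client(Settings(persist_directory=persist_directory, is_persistent=True))

    # Create vector store; the persistent client writes it to persist_directory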
    db = Chroma.from_documents(
        texts,
        embeddings,
        client=chroma_client
    )
    return db

# Function for LLM pipeline
def llm_pipeline():
    pipe = pipeline(
        'text2text-generation',
        model=base_model,
        tokenizer=tokenizer,
        max_length=256,
        do_sample=True,
        temperature=0.3,
        top_p=0.95
    )
    local_llm = HuggingFacePipeline(pipeline=pipe)
    return local_llm

# Function for QA retrieval
def qa_llm(db):
    llm = llm_pipeline()
    retriever = db.as_retriever()
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    return qa

# Function to process the answer
def process_answer(query, qa_model):
    generated_text = qa_model(query)
    answer = generated_text['result']
    return answer

# Function to convert text to speech and play the audio using pydub
def text_to_speech(text):
    from gtts import gTTS
    tts = gTTS(text=text, lang='en')

    # Create a bytes buffer for the audio
    audio_fp = BytesIO()
    tts.write_to_fp(audio_fp)
    audio_fp.seek(0)

    # Load the audio in pydub and convert to playable format
    audio_segment = AudioSegment.from_file(audio_fp, format="mp3")

    # Export audio in wav format as Colab prefers this format for playback
    audio_buffer = BytesIO()
    audio_segment.export(audio_buffer, format="wav")
    audio_buffer.seek(0)

    # Play the audio in the notebook
    display(Audio(audio_buffer.read(), autoplay=True))

# Manually upload the PDF and replace with the actual path after upload
pdf_path = "/content/Gen AI Engineer _ Machine Learning Engineer Assignment.pdf"  # Replace with your actual file path

# Ingest the PDF and create the QA model
db = data_ingestion(pdf_path)
qa_model = qa_llm(db)

# Function to handle a single question-response interaction and stop afterward
def ask_question_once():
    # Input a single question
    question = input("Enter your question: ")

    # Process the question and display the answer
    answer = process_answer({'query': question}, qa_model)
    print("\nAnswer:", answer)

    # Convert the answer to speech and play it
    text_to_speech(answer)


In [18]:
 ask_question_once()

Enter your question: what Gen AI Engineer / Machine Learning Engineer Assignment is given?

Answer: The Gen AI Engineer / Machine Learning Engineer Assignment given is to develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA) bot.


In [5]:
 ask_question_once()

Enter your question: what problem statement is given?

Answer: The problem statement is: "Develop an interactive interface for the QA bot from Part 1 allowing users to input queries and retrieve answers in real time. The interface should enable users to upload documents and ask questions based on the content of the uploaded document."


In [6]:
 ask_question_once()

Enter your question: what vector database is suggested?

Answer: The vector database suggested is Pinecone.


In [7]:
 ask_question_once()

Enter your question: what general guidelines are given

Answer: The general guidelines given are to ensure modular and scalable code following best practices for both frontend and backend development, document the approach thoroughly, share the code, deployment GitHub instructions, and the final working model through General Guidelines.
