### RAG-based Querying on PDF Documents
This exercise builds a simple step within a data pipeline that collects and transforms data for an AI-assistant-based system. The primary data source is a PDF document, and multiple documents can be used. The goal is for these steps to deliver retrieval-augmented generation (RAG) context so that the AI assistant can answer questions based on the contents of the document(s).

In [1]:
# Import libraries
import os
import re
import glob
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pipeline.text_extraction import read_ocr_results
# Set the OpenAI API key below; it is used for the embeddings and for the LLM calls.
os.environ["OPENAI_API_KEY"] = ""

import warnings
warnings.filterwarnings('ignore')

#### Read files from the local OCR output

In [2]:
def find_json_files(directory):
    """
    Find all JSON files in a directory and its subdirectories.
    """
    # recursive=True is required for '**' to descend into nested subdirectories.
    return glob.glob(os.path.join(directory, '**', '*.json'), recursive=True)

directory_path = 'data/results/'
json_files = find_json_files(directory_path)

text_data_all_files = []
metadata_all_files = []
for file in json_files:
    text_data, metadata = read_ocr_results(file)
    text_data_all_files += text_data
    metadata_all_files += metadata
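One detail worth checking when collecting the OCR results: `glob.glob` only treats `**` as "any depth" when `recursive=True` is passed. The following is a minimal, self-contained check. The directory layout and file names are made up for illustration and are not the real `data/results/` structure.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one JSON file at the top level, one two levels down.
    nested = os.path.join(root, "sample1", "pages")
    os.makedirs(nested)
    for path in (os.path.join(root, "top.json"),
                 os.path.join(nested, "page_1.json")):
        with open(path, "w") as handle:
            handle.write("{}")

    pattern = os.path.join(root, "**", "*.json")

    # With recursive=True, '**' matches zero or more directory levels.
    found = sorted(os.path.relpath(p, root)
                   for p in glob.glob(pattern, recursive=True))

    # Without it, '**' acts like a single '*', i.e. exactly one directory level.
    non_recursive = glob.glob(pattern)

    print(found)
    print(len(non_recursive))
```

In this layout the non-recursive call finds nothing, because neither file sits exactly one directory below the root.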


In [3]:
# Data Cleaning
# Remove bracketed references such as [12], convert to lower case, then drop every
# character except letters, digits, spaces and periods (this also removes hyphens).
cleaned_text = [re.sub(r'\[[^\]]*\]', '', text) for text in text_data_all_files]
cleaned_text = [text.lower() for text in cleaned_text]
cleaned_text = [re.sub(r'[^a-zA-Z0-9 .]', '', text) for text in cleaned_text]
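To make the effect of the three cleaning steps concrete, here is a small sketch that applies the same regular expressions to one sentence. The helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the pipeline.

```python
import re

def clean_text(text):
    """Apply the same three cleaning steps to a single string."""
    text = re.sub(r"\[[^\]]*\]", "", text)      # drop bracketed references like [12]
    text = text.lower()                          # normalise case
    return re.sub(r"[^a-zA-Z0-9 .]", "", text)  # keep letters, digits, spaces, periods

# Made-up sample sentence, used only to show the transformation.
sample = "Full-reference metrics [12] such as PSNR (Peak Signal-to-Noise Ratio) are key."
cleaned = clean_text(sample)
print(cleaned)
```

Because hyphens are removed rather than replaced by a space, compound words are fused. This is why the retrieved chunks further down contain tokens such as `fullreference` and `signaltonoise`. Removing a reference also leaves a double space behind, which this cleaning does not collapse.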


In [4]:
# Text Splitting: Chunk the text into smaller pieces with overlap.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=300,
    length_function=len
)
texts = text_splitter.create_documents(cleaned_text, metadata_all_files)
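The interaction between `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is a deliberately simplified sketch and not the algorithm `RecursiveCharacterTextSplitter` uses, since the library tries to split on separators such as paragraph breaks and spaces before falling back to raw characters. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=300):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

# Synthetic 2400-character text, used only to count the resulting windows.
toy_text = "".join(str(i % 10) for i in range(2400))
toy_chunks = sliding_chunks(toy_text)
print(len(toy_chunks), [len(c) for c in toy_chunks])
```

With a size of 1000 and an overlap of 300 the window advances 700 characters per step, so a sentence that straddles a chunk boundary is likely to appear whole in at least one chunk, at the cost of storing some text twice.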


#### ChromaDB-based vector storage
Create vector embeddings from the extracted text chunks and store them in Chroma.

In [5]:
# Vector DB
directory_name = 'vector_db'

embedding = OpenAIEmbeddings(model='text-embedding-ada-002')

vectordb = Chroma.from_documents(documents=texts, 
                                 embedding=embedding,
                                 persist_directory=directory_name)

In [6]:
# Now we can load the persisted database from disk, and use it as normal. 
vectordb_loaded = Chroma(persist_directory=directory_name, 
                  embedding_function=embedding)

#### Direct use of similarity search

In [7]:
query = "What is IQA?"
docs = vectordb_loaded.similarity_search(query)
docs

[Document(page_content='offers an overview of image quality assessment iqa metrics sorted by reference image availability and nature and reviews the current evaluation framework for novel view synthesis nvs. 2.1 image quality assessment metrics fullreference metrics mean squared error mse peak signaltonoise ratio psnr and structural similarity index measure ssim  are key metrics in the friqa group for comparing a query image with the ground truth due to their simplicity and accuracy. further several variants are proposed to improve performance by evaluating at multiscale  utilising handcrafted features  and extend to specific applications such as image stitching  and high dynamic range hdr images . recently deep neural networks have advanced friqa towards aligning assessments more closely with human visual perception . overall friqa offers detailed evaluation at the cost of requiring ground truth images. reducedreference metrics rriqa methods are designed to address situa tions where o
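Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea with cosine similarity over tiny hand-written vectors. It is an illustration only: the real embeddings have far more dimensions, the distance metric Chroma actually uses depends on how the collection is configured, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy 3-dimensional "embeddings"; the values are invented for the example.
toy_docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.0, 0.0]
ranking = top_k(toy_query, toy_docs, k=2)
print(ranking)
```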

#### Convert the query to a vector, then search the DB

In [8]:
query_embed = embedding.embed_query(query)
docs = await vectordb_loaded.amax_marginal_relevance_search_by_vector(query_embed)
docs

[Document(page_content='offers an overview of image quality assessment iqa metrics sorted by reference image availability and nature and reviews the current evaluation framework for novel view synthesis nvs. 2.1 image quality assessment metrics fullreference metrics mean squared error mse peak signaltonoise ratio psnr and structural similarity index measure ssim  are key metrics in the friqa group for comparing a query image with the ground truth due to their simplicity and accuracy. further several variants are proposed to improve performance by evaluating at multiscale  utilising handcrafted features  and extend to specific applications such as image stitching  and high dynamic range hdr images . recently deep neural networks have advanced friqa towards aligning assessments more closely with human visual perception . overall friqa offers detailed evaluation at the cost of requiring ground truth images. reducedreference metrics rriqa methods are designed to address situa tions where o
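Maximal marginal relevance (MMR) differs from plain similarity search in that it trades relevance to the query against redundancy with chunks already selected. The following is a minimal pure-Python sketch of that greedy selection, assuming cosine similarity and a weighting parameter named `lambda_mult` after the LangChain argument of the same name. It is not LangChain's implementation, and the toy vectors are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k indices, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already chosen (0 if nothing chosen yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Documents 0 and 1 are near-duplicates; document 2 is less relevant but different.
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
mmr_order = mmr(toy_query, toy_docs, k=2)
print(mmr_order)
```

A plain top-2 ranking by relevance would return the two near-duplicates (indices 0 and 1), whereas the MMR selection swaps the second one for the more distinct document. Setting `lambda_mult=1.0` in this sketch reduces it to ordinary relevance ranking.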

#### Building an LLM retriever based on a prompt template

In [9]:
retriever = vectordb_loaded.as_retriever(search_type='similarity_score_threshold',
                                         search_kwargs={"score_threshold": 0.7})
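The effect of a score threshold can be sketched as a simple filter over scored results. This assumes relevance scores normalised so that higher means more similar. The function name `filter_by_threshold`, the chunk texts and the scores are all invented for illustration.

```python
def filter_by_threshold(scored_docs, score_threshold=0.7):
    """Keep (text, score) pairs at or above the threshold, best first."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical retrieval results as (chunk text, relevance score).
toy_results = [
    ("chunk about iqa metrics", 0.83),
    ("chunk about byte patches", 0.41),
    ("chunk about psnr and ssim", 0.72),
]
kept = filter_by_threshold(toy_results)
print(kept)
```

One consequence of this design is that a query unrelated to the indexed documents may yield no chunks at all, in which case the LLM is prompted with an empty context.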

In [10]:
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(), 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)

# Default prompt template
qa_chain.combine_documents_chain.llm_chain.prompt.template

"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

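The `"stuff"` chain type places all retrieved chunks into the `{context}` slot of this template in a single prompt. A rough sketch of that assembly in plain Python follows. The helper `build_stuff_prompt`, the blank-line separator between chunks and the two chunk texts are assumptions made for illustration.

```python
DEFAULT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n{context}\n\n"
    "Question: {question}\nHelpful Answer:"
)

def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join the retrieved chunks and fill both placeholders of the template."""
    context = separator.join(chunks)
    return DEFAULT_TEMPLATE.format(context=context, question=question)

# Made-up chunks standing in for retrieved documents.
prompt = build_stuff_prompt(
    ["iqa stands for image quality assessment.",
     "psnr and ssim are fullreference metrics."],
    "What is IQA?",
)
print(prompt)
```

Because every chunk is concatenated into one prompt, the number and size of retrieved chunks must fit within the model's context window.
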
In [11]:
def process_llm_response(llm_response):
    """
    Function to process the response from the LLM model.
    """
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"][:1]:
        print("Page No: ", source.metadata['page_no'])
        print("PDF File: ", source.metadata['file_name'])
        print("Page Content: ", source.page_content)
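The function above prints only the first source document. Since overlapping chunks from the same page can be retrieved together, it may also be useful to list each distinct file and page once. Here is a minimal sketch of that, using a stand-in `FakeDocument` class and a hand-built response dictionary (both hypothetical) so that it runs without calling the LLM.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in for a retrieved document, with the same two attributes."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def unique_sources(llm_response):
    """Return distinct (file_name, page_no) pairs in retrieval order."""
    seen = []
    for source in llm_response["source_documents"]:
        key = (source.metadata.get("file_name"), source.metadata.get("page_no"))
        if key not in seen:
            seen.append(key)
    return seen

# Hand-built response mimicking the 'result' / 'source_documents' keys.
fake_response = {
    "result": "IQA stands for image quality assessment.",
    "source_documents": [
        FakeDocument("first chunk", {"file_name": "sample2", "page_no": 3}),
        FakeDocument("overlapping chunk", {"file_name": "sample2", "page_no": 3}),
        FakeDocument("another chunk", {"file_name": "sample2", "page_no": 4}),
    ],
}
sources = unique_sources(fake_response)
print(sources)
```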

#### Example Usage:

In [12]:
query = "What is IQA?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 IQA stands for image quality assessment.


Sources:
Page No:  3
PDF File:  sample2
Page Content:  offers an overview of image quality assessment iqa metrics sorted by reference image availability and nature and reviews the current evaluation framework for novel view synthesis nvs. 2.1 image quality assessment metrics fullreference metrics mean squared error mse peak signaltonoise ratio psnr and structural similarity index measure ssim  are key metrics in the friqa group for comparing a query image with the ground truth due to their simplicity and accuracy. further several variants are proposed to improve performance by evaluating at multiscale  utilising handcrafted features  and extend to specific applications such as image stitching  and high dynamic range hdr images . recently deep neural networks have advanced friqa towards aligning assessments more closely with human visual perception . overall friqa offers detailed evaluation at the cost of requiring ground truth images. reduced

In [13]:
query = "What is SpaceByte?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 SpaceByte is a new bytelevel decoder architecture that utilizes multiscale modeling and a dynamic patching rule to improve efficiency and performance on a variety of text modalities. It has been shown to outperform other bytelevel architectures and roughly match the performance of subword transformers.


Sources:
Page No:  2
PDF File:  sample1
Page Content:  as a subword level transformer. to close this substantial performance gap we propose a new bytelevel decoder architecture spacebyte. spacebyte also utilizes multiscale modeling to improve efficiency by grouping bytes into patches. but ulike megabyte which uses a fixed patch size spacebyte uses a simple rule to dynamically partition the bytes into patches that are aligned with word and other language boundaries. a similar technique was also explored by thawani et al. . our experiments show that this simple modification is crucial for performance allowing spacebyte to outperform other bytelevel architectures and roughly match the pe

In [14]:
query = "What are the disadvantages of tokenization?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Tokenization imposes several disadvantages, including performance biases, increased adversarial vulnerability, decreased character level modeling performance, and increased modeling complexity.


Sources:
Page No:  1
PDF File:  sample1
Page Content:  spacebyte towards deleting tokenization from large language modeling kevin slagle rice university kevin.slaglerice.edu rxiv2404.14408v1  22 apr 202 abstract tokenization is widely used in large language models because it significantly improves performance. however tokenization imposes several disadvantages such as performance biases increased adversarial vulnerability decreased character level modeling performance and increased modeling complexity. to address these disadvantages without sacrificing performance we propose spacebyte a novel bytelevel decoder architecture that closes the performance gap between bytelevel and subword autoregressive language modeling. spacebyte consists of a bytelevel transformer model but with extra larger tr