Installing the Required Packages

In [None]:
!pip install PyPDF2 requests
!pip install pdfminer.six
!pip -q install langchain openai tiktoken chromadb

Importing the necessary Libraries

In [1]:
from pdfminer.high_level import extract_text
import requests
import os
import PyPDF2
import glob

Function Used to Download the PDF from the URL

In [2]:
def download_pdf(url, save_folder, idx):
    # A timeout keeps the notebook from hanging on an unresponsive server
    response = requests.get(url, timeout=30)

    # Checking if the request was successful
    if response.status_code == 200:
        # Files are named Paper_1.pdf, Paper_2.pdf, ... in download order
        pdf_filename = f"Paper_{idx + 1}.pdf"
        pdf_path = os.path.join(save_folder, pdf_filename)
        with open(pdf_path, 'wb') as file:
            file.write(response.content)
        print(f"PDF downloaded and saved as {pdf_filename}")
    else:
        print(f"Failed to download PDF from {url}")
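
A 200 response does not by itself guarantee that the body is a PDF: a server may return an HTML landing or error page instead. The sketch below is one way to guard against that. It assumes a hypothetical helper, `looks_like_pdf` (not part of the notebook), that checks for the `%PDF-` signature near the start of the data.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes carry the %PDF- signature near the start."""
    # Hypothetical helper: the 1024-byte window is a lenient assumption,
    # since well-formed files place the signature at offset 0.
    return b"%PDF-" in data[:1024]


print(looks_like_pdf(b"%PDF-1.7\n..."))          # a PDF header
print(looks_like_pdf(b"<html><body>Not found"))  # an HTML page
```

Inside `download_pdf`, such a check could run on `response.content` before the file is written.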

In [3]:
pdf_urls = [
    'https://dl.acm.org/doi/pdf/10.1145/3397271.3401075',
    'https://arxiv.org/pdf/2104.07186.pdf',
    'https://arxiv.org/pdf/2106.14807.pdf',
    'https://arxiv.org/pdf/2301.03266.pdf',
    'https://arxiv.org/pdf/2303.07678.pdf',
]

Parsing the Text from a URL Directly

In [5]:
!pip install beautifulsoup4



In [None]:
import requests
from bs4 import BeautifulSoup

def extract_text_from_url(url):
    # Send an HTTP GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract text from the parsed content
        text = soup.get_text()
        return text
    else:
        print(f"Failed to fetch content from {url}")
        return None
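
Text extracted from HTML usually keeps the blank lines and indentation of the original markup. The following is a minimal post-processing sketch; `collapse_whitespace` is a hypothetical helper name, and the sample string merely imitates what `get_text()` might return.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Strip each line and drop empty ones, keeping one line per text fragment."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "\n\n  Title \n\n\n   First   paragraph.\n\t\n Second paragraph. \n"
print(collapse_whitespace(sample))
```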

In [4]:
# Creating an output directory to store the PDFs downloaded from the given URLs

output_folder = 'pdfs'
os.makedirs(output_folder, exist_ok=True)

In [5]:
# Downloading each file from its URL as a PDF and saving it in the output folder
for idx, pdf_url in enumerate(pdf_urls):
    download_pdf(pdf_url, output_folder, idx)

PDF downloaded and saved as Paper_1.pdf
PDF downloaded and saved as Paper_2.pdf
PDF downloaded and saved as Paper_3.pdf
PDF downloaded and saved as Paper_4.pdf
PDF downloaded and saved as Paper_5.pdf


Parsing Function Used to Extract the Text from a Downloaded PDF File

In [13]:
def parse_pdf_to_text(pdf_file):
    with open(pdf_file, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # Concatenate the text of every page; a page with no extractable text contributes ''
        return ''.join(page.extract_text() or '' for page in pdf_reader.pages)

Parsed Content from Each PDF File Is Stored as a Text File in Another Folder

In [17]:
def create_text_files(parsed_texts, output_folder, pdf_filename):
    os.makedirs(output_folder, exist_ok=True)
    for idx, text in enumerate(parsed_texts):
        # Suffix later texts with their index so they do not overwrite the first file
        suffix = "" if idx == 0 else f"_{idx + 1}"
        text_filename = f"{pdf_filename}{suffix}.txt"
        text_filepath = os.path.join(output_folder, text_filename)
        with open(text_filepath, "w", encoding='utf-8') as file:
            file.write(text)
    print(f"Parsed content from '{pdf_filename}' saved in '{output_folder}'.")

In [18]:
def parse_pdfs_and_save_as_text(input_folder, output_folder):
    pdf_files = glob.glob(os.path.join(input_folder, "*.pdf"))
    for pdf_file in pdf_files:
        # splitext removes only the trailing extension, unlike str.replace
        pdf_filename = os.path.splitext(os.path.basename(pdf_file))[0]
        parsed_text = parse_pdf_to_text(pdf_file)
        create_text_files([parsed_text], output_folder, pdf_filename)

In [19]:
parse_pdfs_and_save_as_text("./pdfs/",output_folder="./parsed_file/")

Parsed content from 'Paper_1' saved in './parsed_file/'.
Parsed content from 'Paper_2' saved in './parsed_file/'.
Parsed content from 'Paper_3' saved in './parsed_file/'.
Parsed content from 'Paper_4' saved in './parsed_file/'.
Parsed content from 'Paper_5' saved in './parsed_file/'.


Summarizing the Text from the URL

In [6]:
# gensim.summarization was removed in gensim 4.0, so an older release is required
!pip install "gensim<4.0"



In [None]:
from gensim.summarization import summarize  # available only in gensim < 4.0

def text_summarization(text, ratio=0.2):
    # Use the TextRank algorithm to keep roughly `ratio` of the sentences
    summarized_text = summarize(text, ratio=ratio)
    # Save the summary to disk as well as returning it
    with open('summaries.txt', 'w', encoding='utf-8') as output_file:
        output_file.write(summarized_text + "\n\n")
    return summarized_text
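
Because `gensim.summarization` is unavailable in gensim 4.0 and later, it can help to see what extractive summarization does in principle. The sketch below is not TextRank and not gensim's implementation. It is a simplified word-frequency scorer, with hypothetical names, that keeps the highest-scoring sentences in their original order.

```python
import re
from collections import Counter


def frequency_summary(text: str, ratio: float = 0.2) -> str:
    """Keep roughly `ratio` of the sentences, chosen by average word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    keep = max(1, round(len(sentences) * ratio))
    # Rank by score, then restore document order for readability
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:keep]))


text = (
    "Retrieval models rank documents. "
    "Dense retrieval models encode documents as vectors. "
    "The weather was nice."
)
print(frequency_summary(text, ratio=0.2))
```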

Setting up LangChain to Summarize the PDF Files Using an OpenAI API Key

In [20]:
import os
from getpass import getpass

# Never hard-code the key in the notebook; prompt for it at run time instead
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

Summarizing Multiple PDFs and Saving the Summaries to a File

In [21]:
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain import OpenAI, PromptTemplate

In [22]:
llm = OpenAI(temperature=0.2)

def summarize_pdfs_from_folder(pdfs_folder):
    summaries = []
    with open('summaries.txt', 'w', encoding='utf-8') as output_file:
        for pdf_file in glob.glob(pdfs_folder + "/*.pdf"):
            loader = PyPDFLoader(pdf_file)
            docs = loader.load_and_split()
            chain = load_summarize_chain(llm, chain_type = "map_reduce")
            summary = chain.run(docs)
            output_file.write(f"Summary for: {pdf_file}\n")
            output_file.write(summary + "\n\n")
            summaries.append(summary)
    return summaries
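
The `map_reduce` chain type summarizes each chunk separately and then summarizes the combined partial summaries. The sketch below shows only that control flow; `map_reduce_summarize` and the stand-in `fake_llm_summary` are hypothetical and replace the LLM calls that LangChain would make.

```python
from typing import Callable, List

calls: List[str] = []  # records every input the stand-in summarizer receives


def fake_llm_summary(text: str) -> str:
    """Stand-in for an LLM call: keep everything up to the first period."""
    calls.append(text)
    return text.split(".")[0].strip() + "."


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize every chunk (map), then summarize the joined results (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))              # reduce step


chunks = [
    "ColBERT uses late interaction. It is fast.",
    "COIL uses inverted lists. It stores vectors.",
]
print(map_reduce_summarize(chunks, fake_llm_summary))
print(f"summarizer calls: {len(calls)}")  # one per chunk plus one for the reduce step
```

The number of model calls grows with the number of chunks, which matters on accounts with a low requests-per-minute limit.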

In [23]:
summaries = summarize_pdfs_from_folder("./pdfs")

Summary for:  ./pdfs\Paper_1.pdf
 This paper presents ColBERT, a novel ranking model that adapts deep language models such as BERT for efficient retrieval. It uses a late interaction architecture to independently encode the query and the document, and employs a cheap yet powerful interaction step to model their fine-grained similarity. Results show that ColBERT is competitive with existing BERT-based models in terms of effectiveness, while executing two orders-of-magnitude faster and requiring up to four orders-of-magnitude fewer FLOPs per query. It is evaluated on MS MARCO and TREC CAR datasets, and its reference implementation is released as open source.




Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-davinci-003 in organization org-l6PgmUGoM4u3TCQP6hDBScVG on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..

Summary for:  ./pdfs\Paper_2.pdf


This paper introduces a novel retrieval system, COIL, which combines the efficiency of exact match and the representation power of deep language models. It uses an inverted list indexing to store document vectors and vector similarity to compute matching scores. Experiments show that COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers with similar or smaller latency. It is also able to mitigate vocabulary mismatch with a high-level CLS vector matching and provides a step towards building a next-generation index that stores semantics.





Summary for:  ./pdfs\Paper_3.pdf
 This paper presents a novel conceptual framework for understanding recent developments in information retrieval, which is organized along two dimensions: sparse vs. dense representations and unsupervised vs. learned representations. It introduces a new technique called "uniCOIL" which achieves the current state-of-the-art in sparse retrieval on the MS MARCO passage ranking dataset. It also discusses various approaches to passage retrieval for open-domain question answering, including Learning Passage Impacts for Inverted Indexes, From doc2query to docTTTTTquery, RocketQA, LDA-based Document Models for Ad-hoc Retrieval, Global Term Weights for Document Retrieval Learned from TREC Data, Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval, Efficient Passage Retrieval with Hashing for Open-Domain Question Answering, and Anserini.





Summary for:  ./pdfs\Paper_4.pdf


Doc2Query is a technique used to improve the effectiveness of search engines by expanding documents before indexing. Gospodinov et al. explored the effects of relevance filtering on the retrieval effectiveness of Doc2Query models and found that it can improve retrieval effectiveness by up to 16%, reduce query execution time by 23%, and reduce index size by 33%. This paper also presents 25 research papers that explore various aspects of information retrieval, such as deeper text understanding, context-aware document term weighting, and multi-step retriever-reader interaction.





Summary for:  ./pdfs\Paper_5.pdf
 This paper presents a query expansion approach, called query2doc, which uses language models to generate pseudo-documents to improve sparse and dense retrieval systems. Results show that query2doc boosts the performance of BM25 by 3-15% on ad-hoc IR datasets, such as MS-MARCO and TREC DL, without any model fine-tuning. It also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results. Additionally, the paper discusses four different methods for dense passage retrieval and provides details on the hyperparameters used for training the dense retrievers.
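
The retry messages in the output above come from a limit of 3 requests per minute. One way to avoid hitting such a limit is to space calls out on the client side. The sketch below is a generic throttle, not a LangChain or OpenAI feature; `MinIntervalThrottle` is a hypothetical name, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # no call has been made yet

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay


# 3 requests per minute corresponds to one request every 20 seconds
throttle = MinIntervalThrottle(min_interval=20.0)
throttle.wait()  # the first call returns immediately
```

Calling `throttle.wait()` before each `chain.run(docs)` would space out the per-PDF calls. The chain itself may still issue several requests per document, so this is only a partial remedy.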





Loading the Parsed Content into ChromaDB

In [24]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader

Loading and Processing Multiple Documents

In [25]:
loader = DirectoryLoader('./parsed_file/', glob="./*.txt", loader_cls=TextLoader)

documents = loader.load()

In [26]:
#Splitting the Text

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
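
To make `chunk_size` and `chunk_overlap` concrete, the sketch below splits a string with a plain sliding window. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, whereas the hypothetical `sliding_window_chunks` cuts at fixed character offsets.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into pieces of at most `chunk_size` characters, each sharing
    `chunk_overlap` characters with its predecessor."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.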

In [27]:
len(texts)

244

In [28]:
texts[3]

Document(page_content='including DRMM [ 7], KNRM [ 4,36], and Duet [ 20,22]. In contrast\nPermission to make digital or hard copies of all or part of this work for personal or\nclassroom use is granted without fee provided that copies are not made or distributed\nfor profit or commercial advantage and that copies bear this notice and the full citation\non the first page. Copyrights for components of this work owned by others than the\nauthor(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or\nrepublish, to post on servers or to redistribute to lists, requires prior specific permission\nand/or a fee. Request permissions from permissions@acm.org.\nSIGIR ’20, July 25–30, 2020, Virtual Event, China\n©2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.\nACM ISBN 978-1-4503-8016-4/20/07. . . $15.00\nhttps://doi.org/10.1145/3397271.3401075\n0.15 0.20 0.25 0.30 0.35 0.40\nMRR@10101102103104105Query Latency (ms)\nBM25doc2queryKNRMDuet\nDe

Creating The Database

In [None]:
# Embed and store the text
# Supplying a persist_directory will store the embeddings on disk

persist_directory = 'db'

embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

In [31]:
# Persist the db to disk
vectordb.persist()
vectordb = None

In [32]:
# Load the persisted database from disk and use it as normal
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

Making a Retriever

In [33]:
retriever = vectordb.as_retriever()

In [34]:
docs = retriever.get_relevant_documents("what is ColBert?")

In [35]:
len(docs)

4

In [36]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [37]:
retriever.search_type

'similarity'

In [38]:
retriever.search_kwargs

{'k': 2}
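
With `search_type` set to `'similarity'`, the retriever returns the `k` chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with cosine similarity over toy 2-dimensional vectors. It is not how Chroma is implemented, and all names and vectors are made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 4) -> List[str]:
    """Return the ids of the k vectors most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine_similarity(query, index[doc_id]), reverse=True)
    return ranked[:k]


toy_index = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.8, 0.6],
    "chunk_c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], toy_index, k=2))
```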

Making a Chain So That the Parsed Content Can Be Retrieved from the DB Using RAG

In [39]:
# Create the Chain to answer the question

qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(),
                                       chain_type="stuff",
                                       retriever = retriever)
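
The `stuff` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows the idea only; the prompt wording and the `build_stuff_prompt` helper are assumptions, not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(chunks: List[str], question: str) -> str:
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    ["ColBERT encodes queries and documents independently.",
     "Late interaction compares their token embeddings."],
    "what is ColBert?",
)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason to keep `k` small.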

In [40]:
def process_llm_response(llm_response):
    print(llm_response['result'])

In [41]:
query = "what is ColBert?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 ColBert is a novel ranking model that employs contextualized late interaction over deep language models (in particular, BERT) for efficient retrieval. It allows for scaling to end-to-end neural retrieval directly from a large document collection, which can improve recall over existing models.


Changing the Model to gpt-3.5-turbo

In [42]:
from langchain.chat_models import ChatOpenAI

# Setting up turbo LLM
turbo_llm = ChatOpenAI(
    temperature=0,
    model_name='gpt-3.5-turbo'
)

In [43]:
qa_chain = RetrievalQA.from_chain_type(llm=turbo_llm,
                                       chain_type="stuff",
                                       retriever=retriever,
                                       return_source_documents=True)

The process_llm_response Function Prints the Answer to the User's Query Along with Its Source Documents

In [44]:
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
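
When several retrieved chunks come from the same file, the loop above prints that path more than once. Below is a small sketch of de-duplicating the sources while keeping the retrieval order. `unique_sources` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for LangChain documents with a `metadata` dictionary.

```python
from types import SimpleNamespace
from typing import Iterable, List


def unique_sources(source_documents: Iterable) -> List[str]:
    """Return each document's metadata['source'] once, in first-seen order."""
    # dict.fromkeys preserves insertion order and discards repeats
    return list(dict.fromkeys(doc.metadata["source"] for doc in source_documents))


fake_docs = [
    SimpleNamespace(metadata={"source": "parsed_file/Paper_1.txt"}),
    SimpleNamespace(metadata={"source": "parsed_file/Paper_3.txt"}),
    SimpleNamespace(metadata={"source": "parsed_file/Paper_1.txt"}),
]
print(unique_sources(fake_docs))
```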

In [45]:
query = "Explain Information Retrival Techniques"
llm_response = qa_chain(query)
process_llm_response(llm_response)

Information retrieval (IR) techniques are used to locate relevant documents from a large corpus based on a user's query. There are two main paradigms for IR: lexical-based sparse retrieval and embedding-based dense retrieval.

1. Lexical-based sparse retrieval: This approach uses algorithms like BM25 (Best Match 25) to match the query terms with the terms in the documents. It relies on the frequency and distribution of the query terms in the documents to determine relevance. This approach is widely used and performs well on out-of-domain datasets.

2. Embedding-based dense retrieval: This approach utilizes neural network models to represent the queries and documents as dense vectors in a high-dimensional space. The similarity between the query and documents is measured based on the proximity of their vector representations. Dense retrievers, such as BERT (Bidirectional Encoder Representations from Transformers), have shown improved performance when large amounts of labeled data are ava

In [46]:
query = "who is the Author of 'ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT'?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

The authors of 'ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT' are Omar Khattab and Matei Zaharia.


Sources:
parsed_file\Paper_1.txt
parsed_file\Paper_3.txt


In [47]:
query = "Explain about the 'Schematic diagrams illustrating query–document matching paradigms in neural IR' in paper_1?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In paper_1, the authors present a schematic diagram that illustrates different query-document matching paradigms in neural information retrieval (IR). The diagram compares existing approaches with a proposed late interaction paradigm.

The sub-figures (a), (b), and (c) represent existing approaches, while sub-figure (d) represents the proposed late interaction paradigm. These paradigms are used to match queries with documents in neural IR systems.

The authors highlight that BERT, a popular neural model, significantly improves search precision, as measured by Mean Reciprocal Rank (MRR) at the top 10 results. It shows an improvement of almost 7% compared to previous methods. However, using BERT also increases latency, with response times potentially reaching tens of thousands of milliseconds, even with high-end GPUs.

This tradeoff between search precision and latency is challenging because even a small increase in query response times can impact user experience and revenue. To address 

In [2]:
!pip install streamlit
