__Query Unstructured PDF URL__
- Create a search index from a PDF file.
- Encode the text into vectors using a Google embedding model.
- Use the index to find relevant passages/sections based on search queries.
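
As a rough illustration of the last step, the sketch below ranks passages by cosine similarity, a common measure for comparing embedding vectors. The three-dimensional vectors and the `cosine_similarity` and `top_k` helpers are hypothetical stand-ins: in this notebook the vectors come from the Google embedding model and the vector store performs the search.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, passages, k=2):
    # passages is a list of (text, vector) pairs; return the k closest texts.
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy data (assumption): real embeddings have hundreds of dimensions.
toy_passages = [
    ("lazy evaluation", [0.9, 0.1, 0.0]),
    ("higher-order functions", [0.1, 0.9, 0.0]),
    ("pip install output", [0.0, 0.0, 1.0]),
]
toy_query = [0.8, 0.2, 0.0]
best = top_k(toy_query, toy_passages, k=2)
```

The query vector points mostly along the first axis, so the passage whose vector does the same ranks first.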


[h3manth.com](https://h3manth.com)

In [1]:
!pip install -qU langchain "unstructured[pdf]" chromadb
!pip install -qU langchain-google-genai

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.2.1+cu121 requires torch==2.2.1, but you have torch 2.2.0 which is incompatible.
torchtext 0.17.1 requires torch==2.2.1, but you have torch 2.2.0 which is incompatible.[0m[31m
[0m

In [14]:
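import textwrap

from IPython.display import display
from IPython.display import Markdown

def to_markdown(text):
  # Turn bullet characters into markdown list items, then prefix every line
  # (blank lines included) with '> ' so the answer renders as a block quote.
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))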

In [4]:
from google.colab import userdata
import os
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [10]:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-pro", convert_system_message_to_human=True)

In [6]:
from langchain.document_loaders import UnstructuredURLLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import GooglePalmEmbeddings
from langchain.chains import RetrievalQA

In [7]:
"""
urls: A list containing a single URL that points to a PDF file. This is the document that gets indexed.
loader: A list holding one UnstructuredURLLoader, which is given the URLs to index. The loader downloads each URL and parses its content.
VectorstoreIndexCreator: This class from langchain.indexes builds a search index backed by a vector store.
embedding: The embedding model that encodes text into vectors that can be searched. Here it is GooglePalmEmbeddings.
text_splitter: Splits the text into chunks before encoding, since some embedding models limit input length. Here the chunk size is 3000 characters with no overlap.
from_loaders: Creates and populates the index from the loader(s) passed to it.

So the code parses the PDF content, splits it into chunks, generates embeddings with the Google model, and adds them to the index.
"""

urls = ['https://www.cs.kent.ac.uk/people/staff/dat/miranda/whyfp90.pdf']
loader = [UnstructuredURLLoader(urls=urls)]
index = VectorstoreIndexCreator(
        embedding=GooglePalmEmbeddings(),
        text_splitter=CharacterTextSplitter(chunk_size=3000, chunk_overlap=0)).from_loaders(loader)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
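
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-width chunking. `split_fixed` is a hypothetical helper, not the LangChain implementation: `CharacterTextSplitter` first splits on a separator (by default a blank line) and then merges the pieces up to the chunk size, whereas this simplified version cuts at fixed character offsets.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    # Each chunk starts (chunk_size - chunk_overlap) characters after the previous one.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghij"
no_overlap = split_fixed(sample, chunk_size=4)
with_overlap = split_fixed(sample, chunk_size=4, chunk_overlap=2)
```

With no overlap, as in the cell above, each character lands in exactly one chunk. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a chunk boundary.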


In [11]:
"""
RetrievalQA.from_chain_type: Creates a RetrievalQA pipeline from a predefined chain type.
llm: The language model used for question answering.
chain_type="stuff": Selects the "stuff" chain type, which inserts ("stuffs") all retrieved documents into a single prompt for the model.
retriever=index.vectorstore.as_retriever(): Uses the vector store index created earlier for retrieval.
input_key="question": Sets the input key for queries to "question", so queries are passed to the pipeline under a "question" key.
"""

chain = RetrievalQA.from_chain_type(llm=llm,
                            chain_type="stuff",
                            retriever=index.vectorstore.as_retriever(),
                            input_key="question")
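
The idea behind the "stuff" chain type can be sketched in plain Python: the retrieved passages are concatenated and placed into one prompt together with the question. `build_stuff_prompt` and its prompt wording are hypothetical and only illustrate the shape of the approach; they are not the template LangChain uses.

```python
def build_stuff_prompt(question, documents):
    # Join all retrieved passages into one context block, then append the question.
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Toy passages (assumption): in the notebook these come from the retriever.
docs = [
    "Higher-order functions aid modularity.",
    "Lazy evaluation aids modularity.",
]
prompt = build_stuff_prompt("What aids modularity?", docs)
```

Because every retrieved passage goes into a single prompt, this approach is simple but limited by the model's context window, which is one reason the chunk size chosen at indexing time matters.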

In [15]:
summary = chain.run('Summarize the paper in 8 bullet points in markdown format line by line')

to_markdown(summary)


> - Functional programming emphasizes applying functions to arguments to structure software.
> - Higher-order functions and lazy evaluation contribute significantly to modularity.
> - Modularization simplifies software design and reduces programming costs.
> - Foldr replaces all occurrences of Cons in a list with f and all occurrences of Nil with a.
> - Foldr can be used to define functions like append, length, doubleall, and summatrix.
> - Modularity allows for the reuse of general-purpose modules, faster development, and independent testing.
> - Functional languages provide new kinds of "glue" to enhance modularization.
> - Functional programming's power lies in improved modularization, leading to smaller, simpler, and more general modules.