<a href="https://colab.research.google.com/github/TanmayWINTR/Langchain/blob/main/PDFQueryLangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing Necessary Packages:

In [2]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken


Importing Libraries and Initializing Components:

In [39]:
# Import the libraries needed for the project.
from PyPDF2 import PdfReader  # Reads PDF files.
from langchain.embeddings.openai import OpenAIEmbeddings  # LangChain wrapper around OpenAI's embedding models.
from langchain.text_splitter import CharacterTextSplitter  # LangChain splitter that breaks text on a separator and sizes chunks by length.
from langchain.vectorstores import FAISS  # LangChain wrapper around a FAISS vector index.



Setting Environment Variables:

In [40]:
# Set environment variables for API keys.
# OPENAI_API_KEY: Your OpenAI API key, required by the cells that create embeddings and run the QA chain.
# SERPAPI_API_KEY: Your SerpAPI key; none of the cells shown here use it, so it can stay empty.
# Note: Replace the empty string with your actual OpenAI API key before running the cells that call the API.
import os
os.environ["OPENAI_API_KEY"] = ""
os.environ["SERPAPI_API_KEY"] = ""
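A blank key only surfaces as an error once the first API call is made. As a minimal sketch, a small check like the one below could report blank or missing keys up front. The helper name `missing_keys` and the demo mapping are illustrative assumptions, not part of the original notebook.

```python
import os

def missing_keys(names, env=os.environ):
    """Return the names whose environment value is unset or blank."""
    return [name for name in names if not env.get(name, "").strip()]

# Demonstration with an explicit mapping (placeholder values) instead of the real environment.
demo_env = {"OPENAI_API_KEY": "sk-placeholder", "SERPAPI_API_KEY": ""}
print(missing_keys(["OPENAI_API_KEY", "SERPAPI_API_KEY"], demo_env))  # ['SERPAPI_API_KEY']
```

In the notebook, calling `missing_keys(["OPENAI_API_KEY"])` with no second argument would check the real environment.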

Specifying PDF File for Processing:

In [41]:
# Specify the path of the PDF file to be processed.
# The PdfReader class from PyPDF2 is used to open and prepare the PDF for text extraction.
# Replace '/content/bedrock-ug.pdf' with the actual path to your PDF file.
pdfreader = PdfReader('/content/bedrock-ug.pdf')

Extracting Text from the PDF:

In [42]:
# Initialize a variable to hold the extracted text.
raw_text = ''

# Iterate over each page of the PDF and extract its text content.
for page in pdfreader.pages:
    content = page.extract_text()  # Attempt to extract text from the current page.
    if content:  # Skip pages where no text was extracted (for example, image-only pages).
        raw_text += content  # Append the page's text to raw_text.

# At this point, 'raw_text' contains the concatenated text from all readable pages of the PDF.
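The loop above appends each page directly to the previous one, so the last line of one page may run into the first line of the next. A possible variant, sketched here under the assumption that a newline between pages is acceptable, joins the pages with a separator. `join_page_text` and `FakePage` are hypothetical names used only for this illustration; `FakePage` stands in for a PyPDF2 page object so the sketch runs without a PDF.

```python
def join_page_text(pages, separator="\n"):
    """Join the text of each page, skipping pages where extraction returns nothing."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:
            parts.append(content)
    return separator.join(parts)

class FakePage:
    """Hypothetical stand-in for a PDF page object, used only for this demo."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

demo_text = join_page_text(
    [FakePage("end of page one"), FakePage(""), FakePage("Start of page two")]
)
print(repr(demo_text))  # 'end of page one\nStart of page two'
```

With the real reader, the equivalent call would be `raw_text = join_page_text(pdfreader.pages)`.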


In [43]:
raw_text



Splitting Text for Tokenization Compatibility:

In [44]:
# Initialize a CharacterTextSplitter to segment the extracted text into smaller chunks.
# Smaller chunks keep each embedding request, and later each QA prompt, within the model's input limits.
# Parameters:
# separator: The string on which the text is split before the pieces are merged into chunks, here a newline.
# chunk_size: The maximum size of each chunk, set to 800 characters (characters, not tokens, because length_function is len).
# chunk_overlap: Up to 200 characters are repeated between consecutive chunks to maintain context across boundaries.
# length_function: The function used to measure text length, here the built-in len.
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
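To make the roles of `separator`, `chunk_size`, and `chunk_overlap` concrete, here is a simplified pure-Python sketch of greedy chunking with overlap. It is not LangChain's implementation and may differ from it in edge cases; `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks of at most chunk_size characters,
    carrying trailing pieces (up to chunk_overlap characters) into the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it, then keep only a short tail as overlap.
            chunks.append(separator.join(current))
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

# Ten short lines, split with small limits so the overlap is easy to see.
demo_text = "\n".join(f"line {i:02d}" for i in range(10))
demo_chunks = split_with_overlap(demo_text, chunk_size=25, chunk_overlap=8)
for chunk in demo_chunks:
    print(repr(chunk))
```

In the demo, the last line of each chunk reappears as the first line of the next one, which is the behavior `chunk_overlap` is meant to provide. A single piece longer than `chunk_size` is emitted as an oversized chunk in this sketch.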

Determining the Number of Text Chunks:

In [45]:
# Display the number of chunks created after splitting the text.
# This is useful for understanding the distribution and segmentation of the text for further processing.
len(texts)


1927
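The chunk count alone says little about how the text was segmented. As a rough sketch, a summary of chunk lengths could complement it; `chunk_stats` is a hypothetical helper and the demo chunks are made up.

```python
from statistics import mean

def chunk_stats(chunks):
    """Summarize chunk lengths in characters: count, min, max, and mean."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }

# Made-up chunks of 10, 20, and 30 characters.
demo_chunks = ["a" * 10, "b" * 20, "c" * 30]
print(chunk_stats(demo_chunks))
```

In the notebook, `chunk_stats(texts)` would show whether any chunk exceeds the 800-character target, which can happen when a single line is longer than `chunk_size`.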

Initializing Embeddings from OpenAI:

In [46]:
# Initialize the OpenAI embeddings.
# This step involves setting up the OpenAIEmbeddings object to generate embeddings for text segments.
# Embeddings are crucial for representing text in a form suitable for similarity searches and other vector-based operations.
embeddings = OpenAIEmbeddings()


Creating a FAISS Vector Store from Texts:

In [47]:
# Create a FAISS vector store to manage and search document embeddings.
# This operation converts the text segments into embeddings and stores them in a FAISS index for efficient similarity searching.
document_search = FAISS.from_texts(texts, embeddings)
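Conceptually, a similarity search compares the query's embedding with every stored embedding and returns the closest texts. The sketch below shows that idea with a brute-force L2 (Euclidean) search in plain Python, assuming an exact L2 index. The function names and the 2-D toy vectors are made up for illustration; real embeddings have far more dimensions and FAISS performs the search much more efficiently.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_texts(query_vec, vectors, texts, k=2):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(
        zip(vectors, texts),
        key=lambda pair: l2_distance(query_vec, pair[0]),
    )
    return [text for _, text in ranked[:k]]

# Toy 2-D vectors standing in for real embeddings (values are made up).
toy_texts = ["pricing", "agents", "action groups"]
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(nearest_texts([1.0, 0.05], toy_vectors, toy_texts, k=2))  # ['agents', 'action groups']
```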


Inspecting the Document Search Object:

In [48]:
# Display the document_search object.
# This inspection is useful for verifying the successful creation of the FAISS vector store and its readiness for executing searches.
document_search


<langchain_community.vectorstores.faiss.FAISS at 0x7c2270a05720>

Loading a Question Answering Chain:

In [49]:
# Load a question answering (QA) chain using LangChain and an OpenAI model.
# The chain takes a list of input documents plus a question and generates an answer.
# chain_type="stuff" inserts ("stuffs") all of the supplied documents into a single prompt, so their combined length must fit in the model's context window.
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(), chain_type="stuff")
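The idea behind the "stuff" strategy can be sketched in a few lines: every retrieved passage is placed into one prompt, followed by the question. The prompt wording below is an assumption made for illustration and is not LangChain's actual template; `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(doc_texts, question):
    """Place every retrieved passage into one prompt, followed by the question."""
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up passages standing in for retrieved document chunks.
demo_prompt = build_stuff_prompt(
    ["Passage about agents.", "Passage about action groups."],
    "How do I add an action group?",
)
print(demo_prompt)
```

Because all passages share one prompt, retrieving more or larger chunks increases the prompt size; other chain types trade extra model calls for smaller prompts.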


Executing the QA Chain with a Specific Query:

In [50]:
# Define a query and perform a similarity search to find relevant document segments.
# The identified documents are then processed through the QA chain to generate an answer.
query = "How to add an action group to your agent in Amazon Bedrock"
docs = document_search.similarity_search(query)  # Find similar document segments to the query.
chain.run(input_documents=docs, question=query)  # Process the documents and query through the QA chain.


' To add an action group to your agent in Amazon Bedrock, follow these steps: \n1. Sign in to the AWS Management Console and open the Amazon Bedrock console at \nhttps://console.aws.amazon.com/bedrock/.\n2. Select Agents from the left navigation pane and choose an agent in the Agents section.\n3. Choose an agent from the Agents section and then choose the Working draft in the Working Draft section.\n4. Select Add in the Action groups section.\n5. Fill out the action group details.\n6. To define the schema for the action group, use the in-line OpenAPI schema editor. \n7. Select Add and wait for the success banner to appear. \n8. Select Prepare to apply the changes to your agent before testing it.'

Another Example of Executing the QA Chain:

In [51]:
query = "Do i need to pay for a foundation model in bedrock"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Yes, you do need to pay for a foundation model in Amazon Bedrock. Pricing is based on the volume of input tokens and output tokens, and on whether you have purchased provisioned throughput for the model. You can see the pricing for each model on the Model providers page in the Amazon Bedrock console.'
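Both query cells repeat the same two steps: retrieve similar chunks, then run the chain. A small wrapper could bundle them, as sketched below. `ask` is a hypothetical helper; `FakeStore` and `FakeChain` are stand-ins that mimic only the two methods used in the cells above so the sketch runs without API access. The `k` argument is assumed to control how many chunks the vector store returns.

```python
def ask(store, chain, query, k=4):
    """Retrieve the k most similar chunks for the query and run the QA chain on them."""
    docs = store.similarity_search(query, k=k)
    return chain.run(input_documents=docs, question=query)

class FakeStore:
    """Stand-in vector store that returns k placeholder passages."""
    def similarity_search(self, query, k=4):
        return [f"passage {i}" for i in range(k)]

class FakeChain:
    """Stand-in chain that reports how many passages it received."""
    def run(self, input_documents, question):
        return f"{len(input_documents)} passages used for: {question}"

demo_answer = ask(FakeStore(), FakeChain(), "What is Bedrock?", k=2)
print(demo_answer)  # 2 passages used for: What is Bedrock?
```

In the notebook, the call would be `ask(document_search, chain, query)`.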