# Load and extract text from PDF files

In [1]:
%pip install pdfplumber==0.11.1



In [2]:
import os
import pdfplumber

In [5]:
combined_text = ''
files_directory = 'files'
 
# Loop through all files in the directory
for filename in os.listdir(files_directory):
    if filename.endswith('.pdf'):
        # Open the PDF file
        with pdfplumber.open(os.path.join(files_directory, filename)) as pdf:
            # Loop through all pages in the PDF file
            for page in pdf.pages:
                # Extract the text from the page and append it to the combined text;
                # fall back to an empty string if the page yields no text
                combined_text += (page.extract_text() or '') + ' '
                
print(combined_text)

NexGen AI Tech Solutions
Quarterly Earnings
Report for Q2 2024
4th June, 2024 Contents
1. ExecutiveSummary
2. FinancialPerformance
3. RevenuebyDepartment
4. StrategicInitiatives
5. ExpectedPerformancefortheRestof2024
6. Conclusion
1 Executive Summary
InQ22024,NexGenAITechSolutionscontinuedtodriveexceptionalgrowththroughthe
strategicintegrationofAItechnologiesacrossourservicespectrum.Theadoptionof
cutting-edgeAIapplicationshasresultedinaremarkable20%growthintotalrevenuecompared
toQ22023,affirmingourleadershipinthetechsolutionssector.
2 Financial Performance
FinancialPerformanceinQ22024
● TotalRevenue:$6.5million,up20%from$5.4millioninQ22023.
● GrossProfit:$4.8million,representingagrossmarginof73.8%.
● OperatingExpenses:$2.5million,focusedonexpandingourAIcapabilitiesand
infrastructure.
● NetIncome:$2.3million,anetmarginof35.4%,upfrom$2.0millioninQ22023.
3 Revenue by Department
● AI-PoweredCloudServices:$2.8million,up25%from$2.24million.
● AI-EnhancedCybersecuritySolutions:$2.0million,up1

# Text Preprocessing and Splitting

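A text splitter breaks the extracted text into smaller chunks. Smaller chunks fit within the embedding model's input limit and let the retriever pass only the relevant passages to the LLM, which improves both answer quality and processing efficiency.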
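To make the idea of chunking concrete, here is a minimal pure-Python sketch of fixed-size splitting with optional overlap. It is a simplification, not what `RecursiveCharacterTextSplitter` does: that splitter prefers to break on separators such as paragraphs and spaces, so its chunk boundaries and chunk count can differ. The function name `split_into_chunks` and the placeholder text are assumptions made for illustration.

```python
def split_into_chunks(text, chunk_size=1000, chunk_overlap=0):
    """Split text into consecutive windows of at most chunk_size characters.

    Hypothetical helper: a naive fixed-width splitter, not LangChain's algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts (chunk_size - chunk_overlap) characters after the last
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Placeholder text with the same length as combined_text in this notebook
sample_text = "x" * 6156
chunks = split_into_chunks(sample_text, chunk_size=1000, chunk_overlap=0)
print(len(chunks), [len(chunk) for chunk in chunks])
```

With no overlap, the chunks concatenate back to the original text. A non-zero `chunk_overlap` repeats the tail of each chunk at the start of the next, which can help when a sentence straddles a boundary.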

In [6]:
print(len(combined_text))

6156


In [None]:
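# langchain and langchain-openai are imported later in this notebook.
# Their version bounds below are an assumption chosen to match the 0.2.x pins.
%pip install sentence-transformers==3.0.1
%pip install "langchain>=0.2.5,<0.3"
%pip install langchain-community==0.2.5
%pip install langchain-huggingface==0.0.3
%pip install "langchain-openai<0.2"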


In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

text_chunks = text_splitter.split_text(combined_text)

print(len(text_chunks))

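ModuleNotFoundError: No module named 'langchain'

This error means the `langchain` package is not available in the current kernel. Run the install cell above, restart the kernel, and then re-run this cell.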

# Generate Text Embedding

To allow for more accurate and relevant search results, represent the text in the PDF documents as vectors by creating embeddings, which are numerical representations of text data.

Use an open-source sentence transformer model from HuggingFace to compute the embeddings: 

sentence-transformers/paraphrase-MiniLM-L6-v2

Store the texts and the embeddings in a FAISS (Facebook AI Similarity Search) vector store.
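As a rough illustration of what a vector store does with embeddings, the sketch below ranks stored texts by cosine similarity to a query vector. The three-dimensional vectors are invented toy values, not real model output, and `cosine_similarity` and `most_similar` are hypothetical helper names. FAISS uses its own index structures and distance metric, so this only shows the general idea of nearest-neighbour search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy store: text -> made-up 3-dimensional "embedding" (assumption for illustration)
toy_store = {
    "Total revenue was $6.5 million.": [0.9, 0.1, 0.0],
    "Operating expenses were $2.5 million.": [0.1, 0.9, 0.0],
    "The report is dated June 2024.": [0.0, 0.1, 0.9],
}


def most_similar(query_vector, store, k=1):
    """Return the k stored texts whose vectors are closest to query_vector."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]


# Made-up query vector that points mostly along the first dimension
query_vector = [0.8, 0.2, 0.1]
best_match = most_similar(query_vector, toy_store, k=1)
print(best_match)
```

In the real pipeline the embedding model produces the vectors for both the stored chunks and the query, and the vector store performs the ranking.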

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-MiniLM-L6-v2")


In [None]:
%pip install faiss-cpu==1.8.0


In [None]:
# FAISS lives in langchain_community; the langchain.vectorstores path is deprecated
from langchain_community.vectorstores import FAISS
db = FAISS.from_texts(text_chunks, embeddings)
print(db.index.ntotal)


# Setup the Retrieval System

Convert the FAISS vector store into a retriever that can return documents for a given unstructured query.

In [None]:
retriever = db.as_retriever()


In [None]:
print(retriever.invoke("Who is the CEO?"))


# Create a RAG prompt template

Create a prompt template that includes both the question posed by the user and the context needed to answer it.

The prompt template turns the user input and the retrieved context into clear instructions for the OpenAI language model, so that the model answers from the supplied context rather than from its general knowledge.
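To show what the model eventually receives, the sketch below fills the same template text with plain `str.format`. This is only an approximation of what `ChatPromptTemplate` does (it also wraps the text in a chat message), and the chunk strings, the blank-line separator, and the variable names `retrieved_chunks` and `filled_prompt` are assumptions for illustration.

```python
template = """
    Answer the question based only on the following context:
    {context}

    Question: {input}
"""

# Hypothetical retrieved chunks; in the notebook these come from the retriever
retrieved_chunks = [
    "Total revenue: $6.5 million, up 20% from $5.4 million in Q2 2023.",
    "Net income: $2.3 million, a net margin of 35.4%.",
]
question = "What was the total revenue in Q2 2024?"

# Join the chunks into one context string and substitute both placeholders
filled_prompt = template.format(
    context="\n\n".join(retrieved_chunks),
    input=question,
)
print(filled_prompt)
```

The phrase "based only on the following context" is what steers the model toward the retrieved passages.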

In [None]:
template = """
    Answer the question based only on the following context:
    {context}
 
    Question: {input}
"""


In [None]:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(template)
prompt


# Setup the LLM and RAG retrieval chain

Construct a chain that can be used to generate a response based on a set of documents and a user query.

Then construct a RAG retrieval chain that will take a user query as input, pass this information to the retriever to fetch relevant documents, and finally pass both the user query and the document content as context to the OpenAI model to generate a response.
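The flow described above (retrieve, build the prompt, call the model) can be sketched with stand-in functions that need no API key. Everything here is a toy under stated assumptions: `toy_retrieve` ranks chunks by shared words instead of embeddings, `toy_llm` returns a placeholder string instead of calling OpenAI, and the result dict only mirrors the `input`, `context`, and `answer` keys that the real retrieval chain returns.

```python
PROMPT = (
    "Answer the question based only on the following context:\n"
    "{context}\n\n"
    "Question: {input}\n"
)

# Hypothetical chunks standing in for the contents of the vector store
CHUNKS = [
    "Total revenue: $6.5 million, up 20% from $5.4 million in Q2 2023.",
    "Operating expenses: $2.5 million.",
]


def toy_retrieve(query, chunks, k=1):
    """Stand-in retriever: rank chunks by the number of words shared with the query."""
    query_words = set(query.lower().split())

    def overlap(chunk):
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:k]


def toy_llm(prompt_text):
    """Stand-in model: returns a placeholder instead of a generated answer."""
    return f"(model reply to a {len(prompt_text)}-character prompt)"


def toy_rag_chain(inputs, chunks):
    """Retrieve context, fill the prompt, call the model, and return all three parts."""
    context = toy_retrieve(inputs["input"], chunks)
    prompt_text = PROMPT.format(context="\n\n".join(context), input=inputs["input"])
    return {"input": inputs["input"], "context": context, "answer": toy_llm(prompt_text)}


result = toy_rag_chain({"input": "What was the total revenue in Q2 2024?"}, CHUNKS)
print(result["context"])
print(result["answer"])
```

Keeping the retrieved context in the result makes it possible to check which passages an answer was based on.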

In [None]:
from langchain_openai import ChatOpenAI
# ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable
llm = ChatOpenAI(model="gpt-3.5-turbo")


Construct the chain that will be subsequently used to generate a response based on a set of documents and a question.

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain
combine_docs_chain = create_stuff_documents_chain(
    llm, prompt
)


Create the retrieval chain and run a first query.

In [None]:
from langchain.chains import create_retrieval_chain
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
print(rag_chain.invoke({"input": "Who is the CEO of the company?"}).get("answer"))


# Test the RAG Application

In [None]:
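# Use `question` rather than `input` to avoid shadowing the built-in input()
question = "What was the total revenue in Q2 2024?"
print(question)
print(rag_chain.invoke({"input": question}).get("answer"))

In [None]:
question = "Which department had the highest revenue?"
print(question)
print(rag_chain.invoke({"input": question}).get("answer"))

In [None]:
# Print the full result dict for the last question, not just the "answer" field
print(rag_chain.invoke({"input": question}))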
