# Title: Document Processing and Question Answering with LangChain

## Description
This notebook shows how to answer questions about a PDF document with LangChain. It loads the PDF, splits it into chunks, embeds the chunks into a Chroma vector store, and queries a Hugging Face hosted language model (Mistral-7B-Instruct) through a retrieval QA chain. The steps cover configuring the API token, initializing the language model, processing the PDF, and querying the chain for answers.

### Libraries Required:
- `os`
- `torch`
- `dotenv`
- `langchain_core.prompts`
- `langchain.chains`
- `langchain_community.embeddings`
- `langchain_community.document_loaders`
- `langchain.text_splitter`
- `langchain_community.vectorstores`
- `langchain_community.llms`
- `sentence-transformers`
- `InstructorEmbedding`


In [1]:
!pip install torch
!pip install langchain
!pip install langchain_core
!pip install langchain_community
!pip install pypdf
!pip install chromadb
!pip install sentence-transformers==2.2.2
!pip install InstructorEmbedding



In [2]:
# Import necessary libraries
import os
import torch
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.llms import HuggingFaceEndpoint

## Check for GPU availability and set the appropriate device for computation.

In [3]:
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

## Global Variables

In [4]:
chat_history = []
# Run the embedding model on the device selected above
embeddings = HuggingFaceInstructEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": DEVICE},
)

load INSTRUCTOR_Transformer
max_seq_length  512


## Initialize the language model

In [5]:
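# Read the Hugging Face token from the environment or prompt for it; never hardcode a token in a notebook
from getpass import getpass
if not os.environ.get("HUGGINGFACEHUB_API_TOKEN"):
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Hugging Face API token: ")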
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# Initialize the hosted model through the Hugging Face inference endpoint
llm_hub = HuggingFaceEndpoint(
    repo_id=model_id,
    task="text-generation",  # Specify the task explicitly
    max_new_tokens=2000,     # Upper bound on the number of generated tokens
    temperature=0.7,         # Lower values make sampling more deterministic
    top_p=0.9,               # Nucleus sampling: sample from the top 90% of probability mass
)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
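
The `temperature` and `top_p` arguments control how the next token is sampled. The following is a minimal, library-free sketch of the two ideas on a made-up score vector. It is not the inference server's implementation, and the names `apply_temperature`, `top_p_filter`, and `toy_logits` are hypothetical.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

# Made-up scores for four candidate tokens (illustration only)
toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = apply_temperature(toy_logits, temperature=0.7)
candidates = top_p_filter(probs, top_p=0.9)
print(candidates)
```

A temperature below 1 sharpens the distribution toward the highest-scoring token, and `top_p` then discards the low-probability tail before sampling.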


## Formatting the Document

In [None]:
# Enter the path to your PDF document
document_path = "trypd.pdf"

In [6]:

loader = PyPDFLoader(document_path)
documents = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(documents)

# Create an embeddings database using Chroma from the split text chunks
db = Chroma.from_documents(texts, embedding=embeddings)

# Build the QA chain, which utilizes the LLM and retriever for answering questions
conversation_retrieval_chain = RetrievalQA.from_chain_type(
    llm=llm_hub,
    chain_type="stuff",
    retriever=db.as_retriever(search_type="mmr", search_kwargs={'k': 6, 'lambda_mult': 0.25}),
    return_source_documents=True,  # Retrieve source documents for extraction
    input_key="question"
)
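
The retriever above uses `search_type="mmr"` (maximal marginal relevance) with `k=6` and `lambda_mult=0.25`. As a rough illustration of what those two settings trade off, here is a library-free sketch of greedy MMR selection over toy vectors. It is a simplification under stated assumptions, not LangChain's or Chroma's code: `cosine`, `mmr_select`, and the toy vectors are hypothetical, and the real retriever first fetches a larger candidate pool before reranking.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=6, lambda_mult=0.25):
    """Greedily pick up to k document indices, balancing relevance to the
    query against similarity to documents that were already picked."""
    remaining = list(range(len(doc_vecs)))
    selected = []

    def score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: documents 0 and 1 are near-duplicates, document 2 is different
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]
diverse = mmr_select(query, docs, k=2, lambda_mult=0.25)
relevant_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevant_only)
```

A low `lambda_mult` such as 0.25 favors diversity, so near-duplicate chunks are less likely to fill all six slots. A value of 1.0 reduces to plain similarity ranking.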

## Processing a user prompt

In [7]:
# Process a user prompt
prompt = "Is the order of priority defined? If yes, what is the order of precedence in the case of ambiguity between drawings and technical specifications?"

# Query the chain (invoke replaces the deprecated direct call on the chain object)
output = conversation_retrieval_chain.invoke({"question": prompt, "chat_history": chat_history})
answer = output["result"]
sources = output["source_documents"]

# Create extraction and summary
extraction = "\n".join([source.page_content for source in sources])

# Use the LLM to generate a summary based on the extracted text and prompt
summary_prompt = f"Based on the following extraction and the question, provide a detailed summary:\n\nExtraction:\n{extraction}\n\nQuestion:\n{prompt}\n\nSummary:"
response = llm_hub.generate(prompts=[summary_prompt])

# generate() returns an LLMResult; take the text of the first generation for the first prompt
generated_text = response.generations[0][0].text

summary = generated_text.strip()

# Simple heuristic to extract reference clause, this should be adjusted based on document structure
reference_clause = extraction.split('\n')[0]  # Assume the first line contains the reference clause

# Update the chat history
chat_history.append((prompt, answer))

# Assemble the structured response
response = {
    "Question": prompt,
    "Reference clause": reference_clause,
    "Extraction": extraction,
    "Summary": summary
}


{'Question': 'Is the order of priority defined? If yes, what is the order of precedence in the case of ambiguity between drawings and technical specifications?', 'Reference clause': 'GCC July 2020 ', 'Extraction': "GCC July 2020 \n \n1 \n PART I  \nREGULATIONS FOR TENDERS AND CONTRACTS  \nFOR THE GUIDANCE OF ENGINEERS & CONTRACTORS FOR WORKS \nCONTRACTS  \nMEANING OF TERMS  \n1.0 Applicability:  These conditions of contract shall be applicable for all the tenders and \ncontracts of railways for execution of works as defined in GFR 2017.  \n1.01  Order of Precedence of Documents : In a tender/contract, in case of any difference, \ncontradiction, discrepancy, with regard to conditions of tender/contract, specifications, \ndrawings, bill of quantities etc., forming part of the tender/contract, the following shall be the \norder of preceden ce: \ni. Letter of Award  \nii. Schedule of Items, Rates & Quantities  \niii. Special Conditions of Contract  \niv. Technical Specifications as given i
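
In the output above, the first-line heuristic returns the page header (`GCC July 2020`) rather than a clause. One possible refinement, sketched below, looks for lines that begin with a dotted clause number followed by a title and a colon, then picks the title that shares the most words with the question. This assumes clauses are formatted like `1.01  Order of Precedence of Documents :`, as in the extraction shown. `CLAUSE_PATTERN` and `find_reference_clause` are hypothetical names, and the pattern would need adjusting for other document layouts.

```python
import re

# A dotted clause number at the start of a line, then a title ending in a colon
CLAUSE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)+)\s+([^:\n]+?)\s*:", re.MULTILINE)

def find_reference_clause(extraction, question, fallback=""):
    """Return 'number title' for the clause whose title shares the most words
    with the question, or the fallback if no clause title overlaps."""
    question_words = set(re.findall(r"[a-z]+", question.lower()))
    best, best_overlap = fallback, 0
    for number, title in CLAUSE_PATTERN.findall(extraction):
        title_words = set(re.findall(r"[a-z]+", title.lower()))
        overlap = len(question_words & title_words)
        if overlap > best_overlap:
            best, best_overlap = f"{number} {title.strip()}", overlap
    return best

# Short excerpt in the same layout as the extraction above
sample_extraction = (
    "GCC July 2020 \n \n1 \n PART I  \nMEANING OF TERMS  \n"
    "1.0 Applicability:  These conditions of contract shall be applicable\n"
    "1.01  Order of Precedence of Documents : In a tender/contract, \n"
)
sample_question = "What is the order of precedence between drawings and specifications?"
clause = find_reference_clause(sample_extraction, sample_question)
print(clause)
```

In the notebook, the same function could be called with `extraction` and `prompt`, using the existing first-line value as the `fallback`.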

## Printing the response

In [15]:
print(response.keys())
print("Question:")
print(response["Question"])
print("--"*50+"\n\n")
print("Reference clause:")
print(response["Reference clause"])
print("--"*50+"\n\n")
print("Extraction:")
print(response["Extraction"])
print("--"*50+"\n\n")
print("Summary:")
print(response["Summary"])

dict_keys(['Question', 'Reference clause', 'Extraction', 'Summary'])
Question:
Is the order of priority defined? If yes, what is the order of precedence in the case of ambiguity between drawings and technical specifications?
----------------------------------------------------------------------------------------------------


Reference clause:
GCC July 2020 
----------------------------------------------------------------------------------------------------


Extraction:
GCC July 2020 
 
1 
 PART I  
REGULATIONS FOR TENDERS AND CONTRACTS  
FOR THE GUIDANCE OF ENGINEERS & CONTRACTORS FOR WORKS 
CONTRACTS  
MEANING OF TERMS  
1.0 Applicability:  These conditions of contract shall be applicable for all the tenders and 
contracts of railways for execution of works as defined in GFR 2017.  
1.01  Order of Precedence of Documents : In a tender/contract, in case of any difference, 
contradiction, discrepancy, with regard to conditions of tender/contract, specifications, 
drawings, bill of qua