# Title: Document Processing and Question Answering with LangChain

## Description
This notebook demonstrates how to set up and use a hosted language model through LangChain to process contract PDF documents and answer questions based on their content. It covers loading environment variables, initializing the language model, splitting and indexing a PDF, and querying the model for answers.

#### At the end, we also provide our working model as a Streamlit application

### Libraries Required:
- `os`
- `torch`
- `dotenv`
- `langchain_core.prompts`
- `langchain.chains`
- `langchain_community.embeddings`
- `langchain_community.document_loaders`
- `langchain.text_splitter`
- `langchain_community.vectorstores`
- `langchain_community.llms`
- `sentence-transformers`
- `InstructorEmbedding`


In [1]:
!pip install torch
!pip install langchain
!pip install langchain_core
!pip install langchain_community
!pip install pypdf
!pip install chromadb
!pip install sentence-transformers==2.2.2
!pip install InstructorEmbedding
!pip install pdfplumber
!pip install python-dotenv



In [2]:
# Import necessary libraries
import os
import torch
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.llms import HuggingFaceEndpoint

## Check for GPU availability and set the appropriate device for computation.

In [3]:
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

## Global Variables

In [4]:
chat_history = []
# Load the embedding model on the device selected above
embeddings = HuggingFaceInstructEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": DEVICE},
)

load INSTRUCTOR_Transformer
max_seq_length  512
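
Embeddings map each text chunk to a vector, and retrieval works by comparing those vectors. The following is a minimal sketch of that comparison step. It uses toy 3-dimensional vectors in place of real model output; the vectors, the chunk labels, and the `cosine_similarity` helper are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the embeddings of a query and two chunks (assumed values).
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "chunk about precedence": [0.9, 0.1, 0.8],
    "chunk about payment": [0.0, 1.0, 0.1],
}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```

In the notebook itself, this comparison happens inside the vector store, so no such helper needs to be written by hand.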




## Initialize the language model

In [5]:
# Paste your Hugging Face API token here (left empty in this shared notebook)
os.environ["HUGGINGFACEHUB_API_TOKEN"] = ''
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Initialize the hosted model for text generation
llm_hub = HuggingFaceEndpoint(
    repo_id=model_id,
    task="text-generation",  # Specify the task explicitly
    max_new_tokens=2000,     # Upper bound on the number of generated tokens
    temperature=0.4,         # Lower values give more deterministic output
    top_p=0.9,               # Nucleus sampling threshold
)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /home/kodliebe/.cache/huggingface/token
Login successful
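
Since the token is a secret, one option is to read it from the environment instead of pasting it into a cell. The sketch below shows one way to do that under stated assumptions: `get_hf_token` is a hypothetical helper, and it presumes the variable was exported in the shell or loaded from a `.env` file beforehand.

```python
import os

def get_hf_token(env_var="HUGGINGFACEHUB_API_TOKEN"):
    """Return the token stored in the environment, or fail with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} before initializing the model.")
    return token

# Usage (commented out so the sketch does not fail when no token is set):
# token = get_hf_token()
```

Failing early here gives a clearer error than an authentication failure at query time.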


## Loading and Indexing the Document

In [6]:
# Enter your document's path
document_path = "GCC_July_2020.pdf"

In [7]:
loader = PyPDFLoader(document_path)
documents = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(documents)

# Create an embeddings database using Chroma from the split text chunks
db = Chroma.from_documents(texts, embedding=embeddings)

# Build the QA chain, which utilizes the LLM and retriever for answering questions
conversation_retrieval_chain = RetrievalQA.from_chain_type(
    llm=llm_hub,
    chain_type="stuff",
    retriever=db.as_retriever(search_type="mmr", search_kwargs={'k': 6, 'lambda_mult': 0.25}),
    return_source_documents=True,  # Retrieve source documents for extraction
    input_key="question"
)
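
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of overlapping chunking. It uses fixed-width character windows, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, so its actual chunks will differ. The `split_with_overlap` helper is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.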

## Process a user prompt

In [8]:
# Process a user prompt
prompt = "Is the order of priority defined? If yes, what is the order of precedence in the case of ambiguity between drawings and technical specifications?"

# Query the model (invoke replaces the deprecated direct call on the chain)
output = conversation_retrieval_chain.invoke({"question": prompt, "chat_history": chat_history})
answer = output["result"]
sources = output["source_documents"]

# Create extraction and summary
extraction = "\n".join([source.page_content for source in sources])

# Use the LLM to generate a summary based on the extracted text and prompt
summary_prompt = f"Based on the following extraction and the question, provide a detailed summary:\n\nExtraction:\n{extraction}\n\nQuestion:\n{prompt}\n\nSummary:"
llm_result = llm_hub.generate(prompts=[summary_prompt])

# Extract the generated text from the first generation of the first (only) prompt
generated_text = llm_result.generations[0][0].text

summary = generated_text.strip()

# Simple heuristic: treat the first five lines of the extraction as the reference clause.
# Adjust this to the document's structure.
reference_clause = extraction.split('\n')[0:5]

# Update the chat history
chat_history.append((prompt, answer))

# Build the structured response
response = {
    "Question": prompt,
    "Reference clause": reference_clause,
    "Extraction": extraction,
    "Summary": summary
}
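
The retriever behind this query was configured with `search_type="mmr"` and `lambda_mult=0.25`. Maximal marginal relevance (MMR) picks chunks that are relevant to the query but not redundant with chunks already picked. The sketch below is a minimal greedy version on toy 2-dimensional vectors. The vectors and the `cosine` and `mmr_select` helpers are assumptions for illustration, and the vector store's own implementation may differ in details such as how many candidates it considers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k, lambda_mult):
    """Greedy MMR: score = lambda * relevance - (1 - lambda) * redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

q = [1.0, 0.0]
docs = [
    [1.0, 0.0],   # index 0: most relevant
    [0.99, 0.1],  # index 1: near-duplicate of index 0
    [0.6, 0.8],   # index 2: less relevant but different
]

print(mmr_select(q, docs, k=2, lambda_mult=0.25))  # favors diversity: [0, 2]
print(mmr_select(q, docs, k=2, lambda_mult=1.0))   # pure relevance:   [0, 1]
```

A low `lambda_mult` such as 0.25 weights diversity heavily, which can help when a contract repeats near-identical wording across clauses.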



## Printing the response

In [9]:
print(response.keys())
print("Question:")
print(response["Question"])
print("--"*50+"\n\n")
print("Reference clause:")
print(response["Reference clause"])
print("--"*50+"\n\n")
print("Extraction:")
print(response["Extraction"])
print("--"*50+"\n\n")
print("Summary:")
print(response["Summary"])

dict_keys(['Question', 'Reference clause', 'Extraction', 'Summary'])
Question:
Is the order of priority defined? If yes, what is the order of precedence in the case of ambiguity between drawings and technical specifications?
----------------------------------------------------------------------------------------------------


Reference clause:
['by means of drawings or otherwise, necessary for the proper execution of the works or any ', 'part thereof. All such drawings and instructions shall be consistent with the Contract ', 'Documents and reasonably inferable there from . ', ' ', '22.(5)  Meaning and Intent of Specification and Drawings:   If any ambiguity arises as to ']
----------------------------------------------------------------------------------------------------


Extraction:
by means of drawings or otherwise, necessary for the proper execution of the works or any 
part thereof. All such drawings and instructions shall be consistent with the Contract 
Documents and reasonabl
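
The reference clause printed above is just the first five lines of the extraction, so the clause heading `22.(5)` only shows up in the last of them. A regular expression could pull out such headings directly. The sketch below assumes that headings follow the `NN.(N)  Title:` pattern seen in this output; that pattern is inferred from a single example, and other clause formats in the PDF are unknown. The `find_clause_headings` helper is hypothetical.

```python
import re

# Matches headings such as "22.(5)  Meaning and Intent of Specification and Drawings:"
CLAUSE_HEADING = re.compile(
    r"^[ \t]*(\d+\.\(\d+\))[ \t]+([^:\n]+?)[ \t]*:", re.MULTILINE
)

def find_clause_headings(extraction):
    """Return (clause_number, title) pairs for lines that look like clause headings."""
    return [(m.group(1), m.group(2)) for m in CLAUSE_HEADING.finditer(extraction)]

# Sample lines taken from the reference clause output above.
sample = (
    "Documents and reasonably inferable there from . \n"
    " \n"
    "22.(5)  Meaning and Intent of Specification and Drawings:   If any ambiguity arises as to "
)
print(find_clause_headings(sample))
# [('22.(5)', 'Meaning and Intent of Specification and Drawings')]
```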

# Streamlit Application

In [12]:
!pip install streamlit
!npm install localtunnel


In [15]:
%%writefile chatbot_streamlit.py

import os
import torch
from dotenv import load_dotenv
import streamlit as st
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.llms import HuggingFaceEndpoint

# Load environment variables from .env file
load_dotenv()

# Check for GPU availability and set the appropriate device for computation.
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# Global variables
conversation_retrieval_chain = None
chat_history = []
llm_hub = None
embeddings = None

def init_llm():
    global llm_hub, embeddings

    # Set up the environment variable for HuggingFace and initialize the desired model
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = "PASTE_API_KEY_FROM_ZIP_FILE"
    model_id = "mistralai/Mistral-7B-Instruct-v0.3"

    # Initialize the hosted model for text generation
    llm_hub = HuggingFaceEndpoint(
        repo_id=model_id,
        task="text-generation",  # Specify the task explicitly
        max_new_tokens=2000,     # Upper bound on the number of generated tokens
        temperature=0.7,         # Higher values give more varied output
        top_p=0.9                # Nucleus sampling threshold
    )

    # Initialize embeddings using a pre-trained model, loaded on the selected device
    embeddings = HuggingFaceInstructEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": DEVICE}
    )

# Function to process a PDF document
def process_document(document_path):
    global conversation_retrieval_chain

    # Load the document
    loader = PyPDFLoader(document_path)
    documents = loader.load()

    # Split the document into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
    texts = text_splitter.split_documents(documents)

    # Create an embeddings database using Chroma from the split text chunks
    db = Chroma.from_documents(texts, embedding=embeddings)

    # Build the QA chain, which utilizes the LLM and retriever for answering questions
    conversation_retrieval_chain = RetrievalQA.from_chain_type(
        llm=llm_hub,
        chain_type="stuff",
        retriever=db.as_retriever(search_type="mmr", search_kwargs={'k': 6, 'lambda_mult': 0.25}),
        return_source_documents=True,  # Retrieve source documents for extraction
        input_key="question"
    )

# Function to process a user prompt
def process_prompt(prompt):
    global conversation_retrieval_chain
    global chat_history

    if conversation_retrieval_chain is None:
        raise ValueError("The document must be processed before querying.")

    # Query the model (invoke replaces the deprecated direct call on the chain)
    output = conversation_retrieval_chain.invoke({"question": prompt, "chat_history": chat_history})
    answer = output["result"]
    sources = output["source_documents"]

    # Create extraction and summary
    extraction = "\n".join([source.page_content for source in sources])
    summary = generate_summary(extraction, prompt)

    # Update the chat history
    chat_history.append((prompt, answer))

    # Return the structured response
    return {
        "Question": prompt,
        "Reference": generate_reference_clause(extraction),
        "Extraction": extraction,
        "Summary": summary
    }

def generate_summary(extraction, prompt):
    # Use the LLM to generate a summary based on the extracted text and prompt
    summary_prompt = f"Based on the following extraction and the question, provide a detailed summary:\n\nExtraction:\n{extraction}\n\nQuestion:\n{prompt}\n\nSummary:"
    response = llm_hub.generate(prompts=[summary_prompt])

    # Extract the generated text from the first generation
    generated_text = response.generations[0][0].text

    return generated_text.strip()

def generate_reference_clause(extraction):
    # Simple heuristic to extract reference clause, this should be adjusted based on document structure
    reference_clause = extraction.split('\n')[0]  # Assume the first line contains the reference clause
    return reference_clause

# Initialize the language model
init_llm()

# Streamlit application
st.title("PDF Document Q&A")

uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")

if uploaded_file is not None:
    document_path = f"temp_{uploaded_file.name}"
    with open(document_path, "wb") as f:
        f.write(uploaded_file.getbuffer())
    st.success("PDF uploaded successfully")

    process_document(document_path)
    st.success("PDF processed successfully")

    user_question = st.text_input("Ask a question about the document:")
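    # Show the result once the user submits a non-empty question
    if st.button("Submit"):
        if not user_question.strip():
            st.warning("Please enter a question before submitting.")
            st.stop()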
        response = process_prompt(user_question)
        st.write("### Question:")
        st.write(response["Question"])
        st.write("### Reference:")
        st.write(response["Reference"])
        st.write("### Extraction:")
        st.write(response["Extraction"])
        st.write("### Summary:")
        st.write(response["Summary"])

        # Clean up uploaded file
        os.remove(document_path)

Writing chatbot_streamlit.py
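
Streamlit reruns the script from top to bottom on every interaction, so as written the uploaded PDF is saved and re-indexed on each rerun, including the one triggered by the Submit button. One possible way to avoid that is to key the built index on a hash of the file contents and reuse it when the same file is seen again. The sketch below shows the idea with a plain dictionary and a stand-in builder; `get_or_build_index`, `fake_build`, and the cache layout are assumptions for illustration, not part of the script above.

```python
import hashlib

def get_or_build_index(pdf_bytes, build_index, cache):
    """Build the index once per distinct file content and reuse it afterwards."""
    key = "index_" + hashlib.sha256(pdf_bytes).hexdigest()
    if key not in cache:
        cache[key] = build_index(pdf_bytes)
    return cache[key]

# Demo with a plain dict and a stand-in builder that records each call.
calls = []

def fake_build(data):
    calls.append(len(data))
    return {"size": len(data)}

cache = {}
first = get_or_build_index(b"%PDF-demo", fake_build, cache)
second = get_or_build_index(b"%PDF-demo", fake_build, cache)
print(len(calls))  # 1: the second call reused the cached result
```

In the app, `st.session_state` could serve as the `cache` mapping and `uploaded_file.getvalue()` as `pdf_bytes`, since a module-level dictionary would be recreated on every rerun.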


## Copy the IPv4 address printed below, open the localtunnel URL on the last line, and enter the IPv4 address as the password

In [None]:
! wget -q -O - ipv4.icanhazip.com
!streamlit run chatbot_streamlit.py & npx localtunnel --port 8501

34.168.147.106

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.168.147.106:8501

npx: installed 22 in 1.619s
your url is: https://tender-birds-repair.loca.lt
