<a href="https://colab.research.google.com/github/Shrest4647/CloudFirewall/blob/main/LangchainPDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF Interaction with LangChain and ChatGPT

Here we learn how to use ChatGPT and the LangChain framework to ask questions about the contents of a PDF.

## Steps
The general structure of the code can be split into three main sections:

- Loading the document
- Creating embeddings and Vectorization
- Querying the PDF

## First, let's download a PDF for analysis

In [None]:
!curl -o paper.pdf http://login.lisnepal.com.np/uploads/HRM/HR_Manual_Ver_7_LIS_Nepal.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1336k  100 1336k    0     0   627k      0  0:00:02  0:00:02 --:--:--  627k


## Install the required packages

In [None]:
# In a Jupyter notebook, the lines below install the required packages
!pip install langchain
!pip install pypdf
!pip install chromadb
!pip install openai tiktoken


## Imports

In [None]:
from langchain.document_loaders import PyPDFLoader # for loading the PDF
from langchain.embeddings import OpenAIEmbeddings # for creating embeddings
from langchain.vectorstores import Chroma # the vector store that holds the embeddings
from langchain.chains import ChatVectorDBChain # for chatting with the PDF
from langchain.llms import OpenAI # the LLM we'll use (ChatGPT)

## Load and Split the PDF

In [None]:
pdf_path = "./paper.pdf"
loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()
print(pages[0].page_content)

HUMAN RESOURCE MANUAL  
 
LIS Nepal Pvt. Ltd.  
 Lokeshwor Tole, Manbhawan,  
 Lalitpur, Nepal  
 
This manual is the intellectual property of LIS Nepal . Unauthorized use or duplication of any idea or material contained 
here is prohibited. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used without 
prior written permission of HR.
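
`load_and_split()` returns a list of `Document` chunks rather than one long string. The sketch below is a simplified stand-in for that splitting step, not LangChain's actual splitter: the `chunk_text` helper and its `chunk_size` and `overlap` parameters are hypothetical names chosen for illustration, and it cuts on fixed character counts only.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "HUMAN RESOURCE MANUAL " * 20  # stand-in text, not the real PDF
chunks = chunk_text(sample, chunk_size=100, overlap=10)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.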


## Set OPENAI_API_KEY

In [None]:
%env OPENAI_API_KEY=<your-openai-api-key>

env: OPENAI_API_KEY=<your-openai-api-key>


# Chat Over Documents with Chat History


## Creating embeddings and Vectorization

In [None]:
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(pages, embedding=embeddings,
                                 persist_directory="/usrdb")
vectordb.persist()
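
Conceptually, the vector store maps every chunk to a vector and answers a query by returning the chunks whose vectors lie closest to the query's vector. The sketch below illustrates that idea with a toy bag-of-words "embedding" and cosine similarity in plain Python. It is only a stand-in under simplifying assumptions: `OpenAIEmbeddings` produces learned dense vectors and Chroma uses its own index, and the `embed`, `cosine`, and `top_k` helpers are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (real embeddings are dense learned vectors)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


# Illustrative chunks, loosely modelled on the manual's wording.
docs = [
    "employees are eligible for dashain allowance after probation",
    "career level is a numerical indicator of seniority",
    "this manual is the intellectual property of the company",
]
print(top_k("who is eligible for dashain allowance", docs))
```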

## Set up the Conversational Chain

We use the `ConversationalRetrievalChain` class from `langchain.chains`, together with `ConversationBufferMemory` from `langchain.memory`, to interact with ChatGPT using the previously generated vector database.

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
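
`ConversationBufferMemory` keeps the running transcript so that each new question can be interpreted in light of earlier turns. A minimal sketch of that idea in plain Python follows. The `BufferMemory` class and its methods are hypothetical and only mimic the behaviour; they are not LangChain's implementation, and the sample answer is made up.

```python
class BufferMemory:
    """Toy conversation buffer: stores (role, text) turns in order."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        """Record one exchange as a human turn followed by an assistant turn."""
        self.turns.append(("Human", question))
        self.turns.append(("Assistant", answer))

    def as_text(self):
        """Render the transcript as 'Role: text' lines."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)


memory_sketch = BufferMemory()
memory_sketch.save("How can you help me?", "I can answer questions about the PDF.")
print(memory_sketch.as_text())
```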

In [None]:
from langchain.chains import ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(
    OpenAI(temperature=0.1, model_name="gpt-3.5-turbo"),
    vectordb.as_retriever(),
    memory=memory,
    verbose=True,
)



In [None]:
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT

In [None]:
CONDENSE_QUESTION_PROMPT

PromptTemplate(input_variables=['chat_history', 'question'], output_parser=None, partial_variables={}, template='Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:', template_format='f-string', validate_template=True)
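
The template above has two placeholders, `{chat_history}` and `{question}`. As a rough illustration of what the chain sends to the model in the condensing step, the same template text can be filled with Python's built-in `str.format`. The history and follow-up values below are made up for the example.

```python
# Template text copied from the CONDENSE_QUESTION_PROMPT output shown above.
template = (
    "Given the following conversation and a follow up question, rephrase the "
    "follow up question to be a standalone question, in its original language."
    "\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\n"
    "Standalone question:"
)

# Illustrative values, not taken from a real run.
chat_history = "Human: What benefits exist?\nAssistant: Several allowances."
question = "Who is eligible for them?"

filled = template.format(chat_history=chat_history, question=question)
print(filled)
```

The model's completion of this filled prompt is the standalone question that is then used to search the vector store.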

In [None]:
from langchain.prompts import (
    ChatPromptTemplate,
    PromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

In [None]:
template="You are a cheerful and helpful assistant who answers user queries from the knowledge gained from the document context."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_template="{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

In [None]:
chat_template = ChatPromptTemplate.from_template("""
System: You are a cheerful and helpful assistant who is happy to answer user queries. You start each reply with the phrase 'Hi I am PDF Assistant', and end your reply with 'Glad to help you. Please let me know if you have more questions'. You are a very elegant assistant and try to format your replies using Markdown to make them easier for readers to understand.
Now, Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.
Chat History:
  {chat_history}
Follow Up Input:
  {question}
Standalone question:
""")

In [None]:
chat_template

ChatPromptTemplate(input_variables=['chat_history', 'question'], output_parser=None, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['chat_history', 'question'], output_parser=None, partial_variables={}, template='\nSystem: You are a cheerful and helpful assistant who is happy to answer user queries.  \nGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\nChat History:\n  {chat_history}\nFollow Up Input: \n  {question}\nStandalone question:\n', template_format='f-string', validate_template=True), additional_kwargs={})])

In [None]:
qa = ConversationalRetrievalChain.from_llm(
    OpenAI(temperature=0.1, model_name="gpt-3.5-turbo"),
    vectordb.as_retriever(),
    memory=memory,
    verbose=True,
    condense_question_prompt=chat_template,
)
# chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, chat_template])

# get a chat completion from the formatted messages
# chat_prompt.format_prompt(question="What is the first day of the week", chat_history="").to_messages()

In [None]:
qa({'question': "How can you help me?"})



> Entering new  chain...
Prompt after formatting:
Human: 
System: You are a cheerful and helpful assistant who is happy to answer user queries. You start each reply with the pharse 'Hi I am PDF Assistant', and end your reply with 'Glad to help you. Please let me know if you have more questions'. You are very elegant assistant and try format your replies using markdown format to make it easier for the readers to understand.
Now, Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.
Chat History:
  
Human: How can you help me?
Assistant: I'm sorry, I cannot answer that question as there is no context provided to suggest what kind of help is needed.
Follow Up Input: 
  How can you help me?
Standalone question:

> Finished chain.


> Entering new  chain...


> Entering new  chain...
Prompt after formatting:
Use the following pieces of 

{'question': 'How can you help me?',
 'chat_history': [HumanMessage(content='How can you help me?', additional_kwargs={}, example=False),
  AIMessage(content="I'm sorry, I cannot answer that question as there is no context provided to suggest what kind of help is needed.", additional_kwargs={}, example=False),
  HumanMessage(content='How can you help me?', additional_kwargs={}, example=False),
  AIMessage(content="I'm sorry, based on the given context, I cannot determine what kind of help is needed. Please provide more information or context.", additional_kwargs={}, example=False)],
 'answer': "I'm sorry, based on the given context, I cannot determine what kind of help is needed. Please provide more information or context."}

In [None]:
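# Assumed definition, reconstructed from the output below: a translation system prompt plus the human template
chat_prompt = ChatPromptTemplate.from_messages([SystemMessagePromptTemplate.from_template("You are a helpful assistant that translates {input_language} to {output_language}."), human_message_prompt])
output = chat_prompt.format(input_language="English", output_language="French", text="I love programming.")
output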

'System: You are a helpful assistant that translates English to French.\nHuman: I love programming.'

In [None]:
assistant_prompt = chat_prompt

# Utils

In [None]:
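def format_words(text, num=10):
    """Insert a line break after every `num` words so long answers are easier to read."""
    words = text.split()
    words_with_newlines = []
    for i, word in enumerate(words, 1):
        words_with_newlines.append(word)
        # After every num-th word, add a newline token
        if i % num == 0:
            words_with_newlines.append("\n")
    return " ".join(words_with_newlines)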


## Querying


In [None]:
query = "What are the different professional roles in this company?"

In [None]:
result = qa({"question": query, "chat_history": ""})
history = result['chat_history']
print(f"Answer: {format_words(result['answer'])}")



Prompt after formatting:
[32;1m[1;3mUse the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Human Resource Manual Ver 7.0.  Circulated On: July 17, 2022  
 
LIS Nepal Pvt. Ltd.  | Any duplication or unauthorized distribution  of this document  is strictly prohibited.  24 
  
Term  Definition  
Job Family  General category of job that defines whether a job is related to “revenue generating and delivery” or “Sales and 
General Administration”  
Career Stream  A fluid model of career development for employees within a defined job family.  Allows for movement among streams 
rather than a structured sequence, often referred to as a career ladder.  
Career Level  Numer ical indicator of level and seniority of career within Company based on role and responsibility, time and tenure, 
experience and maturity, job complexities, communication, and impact  
Business Title  Titles used for 

In [None]:
while True:
    query = input("Enter Your Query: ")
    if query == "quit":
        break
    # Example: "What are the different holiday plans available?"
    result2 = qa({"question": query, "chat_history": history})
    history = result2["chat_history"]

    print("Answer:")
    print(format_words(result2["answer"]))

# More example queries:
# What are the different employee benefits available?
# Can you explain what the loan scheme for Advance Salary means?
# What are the criteria to be met for claiming a referral bonus?



Prompt after formatting:
[32;1m[1;3mGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: What are the different professional roles in this company?
Assistant: There are different professional roles in this company, categorized into job families such as Delivery Services and Organization Services, with different career streams and levels. Some of the possible business titles and system titles include Solution Specialist, Project & Delivery Management, and Organization Management. There is also an appendix that provides a grade classification of employees.
Follow Up Input: what are the different employee benefits available?
Standalone question:


> Finished chain.
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Human Resource Manual Ver 7.0.  Circulated On: July 17, 2022  
 
LIS Nepal Pvt. Ltd.  | Any duplication or unauthorized distribution  of this document  is strictly prohibited.  15 
 CHAPTER SIX: EMPLOYE ES LOAN AND BENEFIT  
 
6.1. BENEFIT  
 
6.1.1.  DASHAIN ALLOWANCE  
 All the employees will be eligible for the Dashain Allowance after completion of the probation/trainee 
period under the following conditions:  
 Employee who completes the service of one year and above as on Fulpati will be eligible for Dashain 
Allowance equivalent to 100% of one month’s *Total Salary.  
 Employee who completes  the service of less than one year as on Fulpati will be eligible for Dashain 
Allowance on pro -rata basis of one month’s *Total Salary.  
 The service period of

## Using ChatVectorDBChain

In [None]:
pdf_qa = ChatVectorDBChain.from_llm(OpenAI(temperature=0.9, model_name="gpt-3.5-turbo"),
                                    vectordb, return_source_documents=True)

query = "What is the VideoTaskformer?"




ValidationError: ignored

In [None]:
result = pdf_qa({"question": query, "chat_history": ""})
print("Answer:")
print(result["answer"])

NameError: ignored

In [None]:
result2 = pdf_qa({"question": "Summarize the outcome of the paper in 100 words, highlighting its use cases", "chat_history": result["chat_history"]})
print("Answer:")
print(result2["answer"])

In [None]:
result3 = pdf_qa({"question": "Can you describe what VideoTaskformer Pre-training is and how does it work as described in the paper", "chat_history": result2["chat_history"]})
result3

In [None]:
import os

key = os.getenv("OPENAI_API_KEY")
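
`os.getenv` returns `None` when the variable is unset, so a missing key would only surface later as an API error. A small guard like the one below fails early instead. The `require_env` helper is a hypothetical name, and its `env` parameter exists only so the function can be exercised without touching the real environment.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a stand-in mapping instead of the real environment.
print(require_env("OPENAI_API_KEY", env={"OPENAI_API_KEY": "placeholder"}))
```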

In [None]:
!jupyter nbconvert --to markdown LangchainPDF.ipynb