# Summarization and Chat with PDF Using LangChain

### Objective:
The aim of this assignment is to develop a Generative AI application that uses Large
Language Models (LLMs) to take multi-page documents in any format as input, learn and
summarize their content, and accurately answer user questions about them.
The application should be conversational and maintain proper session management. This
use case suits individuals and businesses that want to streamline their document
management process, improve productivity, and save time and resources.

## Add OpenAI API Key

In [2]:
import os
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

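api_key = os.getenv("API_KEY")
if not api_key:
    # Fail early with a clear message instead of a TypeError on the next line
    raise RuntimeError("API_KEY is not set; add it to your .env file")
# LangChain's OpenAI wrapper reads the key from OPENAI_API_KEY
os.environ['OPENAI_API_KEY'] = api_key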

In [3]:
os.getcwd()

'/home/nitin2/Documents/GenerativeAIAssignment'

In [2]:
from langchain import OpenAI, PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
import tiktoken
import hashlib
import textwrap
llm = OpenAI(temperature=0)

ValidationError: 1 validation error for OpenAI
__root__
  Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass  `openai_api_key` as a named parameter. (type=value_error)

In [5]:
FILE_PATH = '/home/nitin2/Downloads/YouTubeMarketingGuide.pdf'

In [6]:
from langchain.document_loaders import PyPDFLoader
from IPython.display import display, Javascript
def load_document(file_path):
    """Load a PDF and return it as a list of split Document chunks."""
    print("Loading Pdf....")
    loader = PyPDFLoader(file_path)
    docs = loader.load_and_split()
    print("Document is ready...")
    return docs
    
# Cleaning text
def clean_doc(docs):
    """Join all page contents into one string with newlines replaced by spaces."""
    # Convert the list of documents into a string
    doc_string = ' '.join(doc.page_content for doc in docs)
    # Replace '\n' characters with a space so adjacent words stay separated
    doc_string = doc_string.replace('\n', ' ')
    print(doc_string)
    return doc_string

def generate_newhash(docs):
    """Write an MD5 hash of the document text to <cwd>/<pdf_name>/hash.txt."""
    pdf_name = os.path.basename(FILE_PATH)
    new_hash = hashlib.md5(''.join(t.page_content for t in docs).encode()).hexdigest()
    print("new hash", new_hash)
    # Create the per-PDF directory; do not fail if it already exists
    hash_dir = os.path.join(os.getcwd(), pdf_name)
    os.makedirs(hash_dir, exist_ok=True)
    hash_path = os.path.join(hash_dir, 'hash.txt')
    print(hash_path)
    with open(hash_path, "w") as file:
        file.write(new_hash)
    
    
    

## Loading Document

In [None]:
import time  # needed by restart_and_rerun_kernel below

from IPython.display import display, Javascript

#def restart_kernel():
 #   display(Javascript('IPython.notebook.kernel.restart();'))


def restart_and_rerun_kernel():
    display(Javascript('IPython.notebook.kernel.restart();'))
    time.sleep(1)  # Wait for the kernel to restart
    display(Javascript('IPython.notebook.execute_all_cells();'))


# Call the function to restart the kernel
#restart_kernel()
FILE_PATH = input("Enter File Path..")
docs = load_document(FILE_PATH)
#docs = loader.load_and_split()
print(docs)

# Restart and rerun the kernel
restart_and_rerun_kernel()

Enter File Path../home/nitin2/Downloads/YouTubeMarketingGuide.pdf
Loading Pdf....
Document is ready...
[Document(page_content='E v e r y o n e\nw a t c h e s\nY o u T u b e .\nO v e r\n7 5 %\no f\nA m e r i c a n s\na g e\n1 5\na n d\nu p\na r e\no n\nY o u T u b e ,\np a r t\no f\no v e r\n2\nb i l l i o n\nm o n t h l y\na c t i v e\nu s e r s ,\nm a k i n g\ni t\nt h e\nm o s t\np o p u l a r\nw e b s i t e\ni n\nt h e\nw o r l d\na f t e r\nG o o g l e .\nT h e\np o t e n t i a l\no f\na\nh u g e\na u d i e n c e\ni s\na\ng r e a t\nr e a s o n\nt o\nm a r k e t\ny o u r\nb u s i n e s s\no n\nY o u T u b e .\nB u t\ns h o u t i n g\nf r o m\nt h e\nr o o f t o p s\na b o u t\ny o u r\np r o d u c t s\nw i t h o u t\na\np l a n\nw o n ’ t\ng e t\ny o u\na n y w h e r e .\nY o u\nn e e d\na\ns t r a t e g y\nt o\ns u c c e e d\na n d\nt h a t ’ s\ne x a c t l y\nw h a t\ny o u ’ l l\nﬁ n d\nh e r e :\nt h e\n1 0\ns t e p s\nt o\nc r u s h\nY o u T u b e\nm a r k e t i n g\ni n\n2 0 

<IPython.core.display.Javascript object>

## Summarizing with map_reduce
This method first runs an initial prompt on each chunk of data (for summarization tasks, this could be a summary of that chunk; for question-answering tasks, it could be an answer based solely on that chunk). A different prompt is then run to combine all the initial outputs. LangChain implements this as the MapReduceDocumentsChain.
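
The two steps described above can be sketched in plain Python without calling an LLM. This is a minimal illustration, not LangChain's implementation: `map_reduce_summarize`, `first_sentence`, and `join_summaries` are hypothetical names, and the two helper functions are stand-ins for the per-chunk prompt and the combine prompt.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str],
                         map_fn: Callable[[str], str],
                         combine_fn: Callable[[List[str]], str]) -> str:
    """Run map_fn on every chunk, then combine the partial results."""
    # Map step: each chunk is processed independently of the others.
    partial_summaries = [map_fn(chunk) for chunk in chunks]
    # Reduce step: a second function merges all partial outputs.
    return combine_fn(partial_summaries)


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for the per-chunk LLM prompt."""
    return text.split(".")[0].strip() + "."


def join_summaries(summaries: List[str]) -> str:
    """Hypothetical stand-in for the combine prompt."""
    return " ".join(summaries)


chunks = [
    "YouTube has a large audience. It is popular worldwide.",
    "A strategy is required. Posting at random rarely works.",
]
final_summary = map_reduce_summarize(chunks, first_sentence, join_summaries)
print(final_summary)
```

Because the map step treats chunks independently, it can handle documents longer than a single model context, at the cost of one call per chunk plus the combine call.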

In [1]:
chain = load_summarize_chain(llm=llm, chain_type="map_reduce")
summary = chain.run(docs)   
print(summary)

NameError: name 'load_summarize_chain' is not defined

In [None]:
# Trace each step by setting verbose=True
chain = load_summarize_chain(llm, 
                             chain_type="map_reduce",
                             verbose=True
                             )
output_summary = chain.run(docs)

print(output_summary)

In [9]:
# Default prompt template applied first, to summarize each part
chain.llm_chain.prompt.template

'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

In [10]:
# for combining the parts
chain.combine_document_chain.llm_chain.prompt.template

'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

In [12]:
# summary with Custom Prompts
prompt_template = """Write a concise summary of the following:

{text}

CONCISE SUMMARY IN BULLET POINTS:"""

BULLET_POINT_PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

In [13]:
chain = load_summarize_chain(llm,      
                             chain_type="map_reduce",
                             map_prompt=BULLET_POINT_PROMPT, 
                             combine_prompt=BULLET_POINT_PROMPT)

# chain.llm_chain.prompt= BULLET_POINT_PROMPT
# chain.combine_document_chain.llm_chain.prompt= BULLET_POINT_PROMPT
output_summary = chain.run(docs)
print(output_summary)


• YouTube is the most popular website in the world after Google, with over 2 billion users and 72% of American internet users regularly browsing the platform. 
• To market your business on YouTube, you need to create content that your target customers want, optimize your content for the YouTube algorithm, and set up a YouTube channel with a Google account.
• Research your audience and competition, use social listening, and create a SWOT analysis.
• Optimize your videos to get more views, create a compelling title, write a description, add tags, create custom thumbnails, and use end screens and cards.
• Try YouTube advertising and influencer marketing, and track your progress with analytics.


### Now let's chat with the PDF

In [14]:
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

#### Text Splitter
This takes the text and splits it into chunks of up to 1000 characters, with a 200-character overlap between consecutive chunks.
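
To make the effect of chunk size and overlap concrete, here is a simplified fixed-window sketch. It is an assumption for illustration only: `RecursiveCharacterTextSplitter` splits on separators recursively rather than at fixed offsets, and `split_with_overlap` is a hypothetical helper, not part of LangChain.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means the last characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.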

In [15]:
# split the documents into chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(docs)

In [16]:
len(texts)

41

### Creating Embeddings

In [17]:
# select which embeddings we want to use
embeddings = OpenAIEmbeddings()

In [18]:
pdf_name = os.path.basename(FILE_PATH)
index_dir = "faiss_index" + pdf_name
hash_path = os.path.join(index_dir, 'hash.txt')
new_hash = hashlib.md5(''.join(t.page_content for t in docs).encode()).hexdigest()

# Read the hash stored next to a previously saved index, if there is one
stored_hash = None
if os.path.exists(hash_path):
    with open(hash_path, 'r') as f:
        stored_hash = f.read().strip()

if stored_hash == new_hash:
    print("loading the index from the disk")
    db = FAISS.load_local(index_dir, embeddings)
else:
    print("Creating new Index..")
    # Index the chunks produced by the text splitter above
    db = FAISS.from_documents(texts, embeddings)
    db.save_local(index_dir)
    print("Index Created")
    # Always refresh the hash so the next run can detect an unchanged document
    with open(hash_path, 'w') as f:
        f.write(new_hash)
    print("Successfully Created Hash file")
            
        
    

loading the index from the disk
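
The hash check that decides between loading and rebuilding the index can be sketched on its own, without FAISS or embeddings. The helper names (`content_hash`, `needs_rebuild`, `record_hash`) and the demo directory are hypothetical; only the idea of comparing a stored MD5 against a freshly computed one comes from the cell above.

```python
import hashlib
import os
import tempfile
from typing import List


def content_hash(pages: List[str]) -> str:
    """MD5 of the concatenated page texts."""
    return hashlib.md5("".join(pages).encode()).hexdigest()


def needs_rebuild(index_dir: str, new_hash: str) -> bool:
    """True when no stored hash exists or it differs from new_hash."""
    hash_path = os.path.join(index_dir, "hash.txt")
    if not os.path.exists(hash_path):
        return True
    with open(hash_path) as f:
        return f.read().strip() != new_hash


def record_hash(index_dir: str, new_hash: str) -> None:
    """Store the hash next to the (hypothetical) saved index."""
    os.makedirs(index_dir, exist_ok=True)
    with open(os.path.join(index_dir, "hash.txt"), "w") as f:
        f.write(new_hash)


with tempfile.TemporaryDirectory() as tmp:
    index_dir = os.path.join(tmp, "faiss_index_demo")
    pages = ["page one", "page two"]
    first_run = needs_rebuild(index_dir, content_hash(pages))
    record_hash(index_dir, content_hash(pages))
    second_run = needs_rebuild(index_dir, content_hash(pages))
    after_edit = needs_rebuild(index_dir, content_hash(pages + ["page three"]))

print(first_run, second_run, after_edit)
```

Skipping the rebuild when the hash matches avoids re-embedding an unchanged document, which is the expensive step in the pipeline.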


In [19]:
db.embedding_function

<bound method OpenAIEmbeddings.embed_query of OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None)>

In [20]:
query = "what are Dietary Supplements"
embedding_vector = embeddings.embed_query(query)
# similarity_search_by_vector returns the matching Documents without scores
similar_docs = db.similarity_search_by_vector(embedding_vector)
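
Searching by vector means ranking stored chunks by how close their embeddings are to the query embedding. The sketch below uses cosine similarity over tiny hand-written vectors purely as an assumption for illustration; the FAISS index in this notebook may use a different metric, and `cosine_similarity`, `top_k`, and the sample `store` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int) -> List[Tuple[str, float]]:
    """Return the k stored texts most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up three-dimensional "embeddings"; real ones have many more dimensions.
store = [
    ("chunk about advertising", [1.0, 0.0, 0.0]),
    ("chunk about thumbnails", [0.0, 1.0, 0.0]),
    ("chunk about analytics", [0.7, 0.7, 0.0]),
]
query_vector = [0.9, 0.1, 0.0]
results = top_k(query_vector, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A vector store such as FAISS performs the same ranking with index structures that avoid comparing the query against every stored vector.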

In [22]:
# conversation memory
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# expose this index in a retriever interface
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})

# create a chain to answer questions 
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever, memory=memory)

In [24]:
query = "what is youtube marketing"
result = qa({"question": query})
result['answer']

' YouTube marketing is the practice of promoting a brand, product, or service on YouTube. It can involve a mix of tactics, including (but not limited to): creating organic promotional videos, working with influencers, and advertising on the platform.'

In [25]:
query = "what was the previous question?"
result = qa({"question": query})
result['answer']

' YouTube marketing is the practice of promoting a brand, product, or service on YouTube. It can involve a mix of tactics, including (but not limited to): creating organic promotional videos, working with influencers, and advertising on the platform.'

In [26]:
print(memory)

chat_memory=ChatMessageHistory(messages=[HumanMessage(content='what is youtube marketing', additional_kwargs={}, example=False), AIMessage(content=' YouTube marketing is the practice of promoting a brand, product, or service on YouTube. It can involve a mix of tactics, including (but not limited to): creating organic promotional videos, working with influencers, and advertising on the platform.', additional_kwargs={}, example=False), HumanMessage(content='what was the previous question?', additional_kwargs={}, example=False), AIMessage(content=' YouTube marketing is the practice of promoting a brand, product, or service on YouTube. It can involve a mix of tactics, including (but not limited to): creating organic promotional videos, working with influencers, and advertising on the platform.', additional_kwargs={}, example=False)]) output_key=None input_key=None return_messages=True human_prefix='Human' ai_prefix='AI' memory_key='chat_history'
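
The printed memory shows each question and answer appended in order. A minimal sketch of that buffering behaviour follows; it is not LangChain's `ConversationBufferMemory`, and `BufferMemory`, `chat`, and `echo_answer` are hypothetical names, with `echo_answer` standing in for the LLM call.

```python
from typing import Callable, List, Tuple


class BufferMemory:
    """Keeps every turn of the conversation in order."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, str]] = []

    def save_turn(self, question: str, answer: str) -> None:
        self.messages.append(("Human", question))
        self.messages.append(("AI", answer))

    def as_text(self) -> str:
        return "\n".join(f"{role}: {content}" for role, content in self.messages)


def chat(question: str, memory: BufferMemory,
         answer_fn: Callable[[str, str], str]) -> str:
    """Answer with access to the history, then record the new turn."""
    answer = answer_fn(question, memory.as_text())
    memory.save_turn(question, answer)
    return answer


def echo_answer(question: str, history: str) -> str:
    """Hypothetical stand-in for the LLM: reports how much history it saw."""
    seen = len(history.splitlines()) if history else 0
    return f"answer to '{question}' given {seen} earlier messages"


memory = BufferMemory()
first = chat("what is youtube marketing", memory, echo_answer)
second = chat("what was the previous question?", memory, echo_answer)
print(first)
print(second)
print(len(memory.messages))
```

Because the whole history is passed on every turn, a buffer like this grows without bound; long sessions would need truncation or summarization of older turns.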
