<a href="https://colab.research.google.com/github/TirendazAcademy/LangChain-Tutorials/blob/main/Chat-With-Any-Document.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chat With Anything - From PDF Files to Image Documents


### Install the requirements

In [2]:
%%bash

pip -q install langchain faiss-cpu unstructured
pip -q install openai tiktoken
pip -q install pytesseract pypdf
pip -q install unstructured==0.7.12

In [None]:
!sudo apt-get install -y tesseract-ocr

# Chat With and Query Your PDF and Image Files

## Detect Document Type

In [4]:
from filetype import guess

def detect_document_type(document_path):
    """Return "pdf", "image", or "unknown" based on the file's content."""

    guess_file = guess(document_path)
    image_types = ['jpg', 'jpeg', 'png', 'gif']

    # guess() returns None when the file type cannot be identified.
    if guess_file is None:
        return "unknown"

    extension = guess_file.extension.lower()

    if extension == "pdf":
        return "pdf"

    if extension in image_types:
        return "image"

    return "unknown"


In [5]:
research_paper_path = "/content/transformer_paper.pdf"
article_information_path = "/content/rticle_information.png"

print(f"Research Paper Type: {detect_document_type(research_paper_path)}")
print(f"Article Information Document Type: {detect_document_type(article_information_path)}")

Research Paper Type: pdf
Article Information Document Type: image
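
`filetype.guess` identifies a file from its leading bytes, so the file must exist and be readable. As a complement, here is a minimal sketch of an extension-based classifier that uses only the standard library. The `detect_type_from_extension` name is hypothetical and not part of the original notebook. Because it trusts the file name rather than the content, treat it as a fallback, not a replacement.

```python
from pathlib import Path

# Assumption: the same image extensions as in detect_document_type above.
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif"}

def detect_type_from_extension(document_path):
    """Classify a path as 'pdf', 'image', or 'unknown' from its suffix only."""
    extension = Path(document_path).suffix.lower().lstrip(".")
    if extension == "pdf":
        return "pdf"
    if extension in IMAGE_EXTENSIONS:
        return "image"
    return "unknown"

# The file does not need to exist: only the name is inspected.
print(detect_type_from_extension("/content/transformer_paper.pdf"))
print(detect_type_from_extension("scan.PNG"))
print(detect_type_from_extension("notes.txt"))
```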


## Extract Documents Content

In [6]:
from langchain.document_loaders.image import UnstructuredImageLoader
from langchain.document_loaders import UnstructuredFileLoader

"""
YOU CAN UNCOMMENT THE CODE BELOW TO UNDERSTAND THE LOGIC OF THE FUNCTIONS
"""
"""

def extract_text_from_pdf(pdf_file):

    loader = UnstructuredFileLoader(pdf_file)
    documents = loader.load()
    pdf_pages_content = '\n'.join(doc.page_content for doc in documents)

    return pdf_pages_content

def extract_text_from_image(image_file):

    loader = UnstructuredImageLoader(image_file)
    documents = loader.load()

    image_content = '\n'.join(doc.page_content for doc in documents)

    return image_content


"""

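def extract_file_content(file_path):
    """Load a PDF or image file and return its text as a single string."""

    file_type = detect_document_type(file_path)

    if file_type == "pdf":
        loader = UnstructuredFileLoader(file_path)

    elif file_type == "image":
        loader = UnstructuredImageLoader(file_path)

    else:
        # Without this branch, `loader` would be undefined for other file types.
        raise ValueError(f"Unsupported file type for: {file_path}")

    documents = loader.load()
    documents_content = '\n'.join(doc.page_content for doc in documents)

    return documents_content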

In [9]:
#research_paper_content = extract_text_from_pdf(research_paper_path)
#article_information_content = extract_text_from_image(article_information_path)


research_paper_content = extract_file_content(research_paper_path)
article_information_content = extract_file_content(article_information_path)

In [10]:
nb_characters = 400

print(f"First {nb_characters} Characters of the Paper: \n{research_paper_content[:nb_characters]}...")
print("---"*5)
print(f"First {nb_characters} Characters of Article Information Document :\n {article_information_content[:nb_characters]}...")


First 400 Characters of the Paper: 
Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

Attention Is All You Need

3 2 0 2

Ashish Vaswani∗ Google Brain avaswani@google.com

Noam Shazeer∗ Google Brain noam@google.com

Niki Parmar∗ Google Research nikip@google.com

Jakob Uszkoreit∗ Google Research usz@google.com
...
---------------
First 400 Characters of Article Information Document :
 Input Files e ‘ File Content in

Raw Format

Raw Bata Splitter

PDF,

File Tyee Detector File Content Extractor DifRerent Chunks of the Raw Data

°

<< }/ oO <

© 6 6

Chunk

Vector Index Transformer

Chunks, Embedlelings

t

See]

Response

‘ =] Q — Author: Zoumana Keita User...


## Chat Implementation

### Create Chunks

In [11]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

In [12]:
research_paper_chunks = text_splitter.split_text(research_paper_content)
article_information_chunks = text_splitter.split_text(article_information_content)

print(f"# Chunks in Research Paper: {len(research_paper_chunks)}")
print(f"# Chunks in Article Document: {len(article_information_chunks)}")



# Chunks in Research Paper: 51
# Chunks in Article Document: 1
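
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a toy string on the same `"\n\n"` separator and greedily merges the pieces into overlapping chunks. It is a simplified illustration written for this walkthrough, not LangChain's actual implementation. The `split_with_overlap` helper and the toy text are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator="\n\n"):
    """Greedy, simplified illustration of separator-based chunking with overlap."""
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks, current = [], []

    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Carry over trailing pieces that fit within the overlap budget...
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
            # ...and that still leave room for the incoming piece.
            while current and len(separator.join(current + [piece])) > chunk_size:
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks

toy_text = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
for chunk in split_with_overlap(toy_text, chunk_size=10, chunk_overlap=4):
    print(repr(chunk))
```

With an overlap of 4 characters, each chunk in this toy run starts with the last piece of the previous chunk, which preserves context across chunk boundaries. Note that a single piece longer than `chunk_size` is kept whole here, because text is only ever cut at the separator.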


### Create Embeddings

In [13]:
from langchain.embeddings.openai import OpenAIEmbeddings
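import os
from getpass import getpass

# Never hardcode an API key in a notebook; read it from the environment or prompt for it.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")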

embeddings = OpenAIEmbeddings()

### Create Vector Index

In [14]:
from langchain.vectorstores import FAISS

def get_doc_search(text_chunks):
    """Embed the text chunks and store them in a FAISS vector index."""

    return FAISS.from_texts(text_chunks, embeddings)
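
Conceptually, a vector index stores one embedding per chunk and, at query time, returns the chunks whose embeddings lie closest to the query embedding. The sketch below illustrates that idea with a brute-force Euclidean distance search over hand-made 3-dimensional vectors. The vectors, the chunk texts, and the `nearest_chunks` helper are made up for illustration. Real embeddings come from the embedding model and have far more dimensions, and FAISS runs the search in optimized native code rather than a Python loop.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_chunks(query_vector, indexed_chunks, k=2):
    """Return the k (text, distance) pairs closest to query_vector."""
    scored = [
        (text, euclidean_distance(query_vector, vector))
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical toy index: (chunk text, embedding) pairs.
toy_index = [
    ("chunk about attention", [0.9, 0.1, 0.0]),
    ("chunk about training", [0.1, 0.8, 0.1]),
    ("chunk about results", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]

for text, distance in nearest_chunks(toy_query, toy_index, k=2):
    print(f"{distance:.3f}  {text}")
```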

In [15]:
doc_search_paper = get_doc_search(research_paper_chunks)
print(doc_search_paper)

<langchain.vectorstores.faiss.FAISS object at 0x7cee3a515720>


### Start chatting with your document

In [16]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
chain = load_qa_chain(OpenAI(), chain_type = "map_rerank",
                      return_intermediate_steps=True)

def chat_with_file(file_path, query):

    file_content = extract_file_content(file_path)
    file_chunks = text_splitter.split_text(file_content)

    # Build a vector index for this file and retrieve the chunks most similar to the query
    document_search = get_doc_search(file_chunks)
    documents = document_search.similarity_search(query)

    results = chain({
                        "input_documents": documents,
                        "question": query
                    },
                    return_only_outputs=True)
    # Keep the answer and score computed for the first retrieved chunk
    results = results['intermediate_steps'][0]

    return results
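
With `return_intermediate_steps=True`, the `map_rerank` chain returns one answer and score per retrieved chunk, and `chat_with_file` reads the first of them. If you would rather keep the highest-scoring answer, one option is to take the maximum by score. The sketch below runs on hand-written sample data shaped like the `answer`/`score` dictionaries used in this notebook. The `pick_best_answer` helper and the sample values are hypothetical, and the code assumes scores may arrive as strings, so it converts them with `int`.

```python
def pick_best_answer(intermediate_steps):
    """Return the step (a dict with 'answer' and 'score') with the highest score."""
    if not intermediate_steps:
        raise ValueError("No intermediate steps to choose from.")
    # Assumption: scores may be strings such as "100", so compare them as integers.
    return max(intermediate_steps, key=lambda step: int(step["score"]))

# Hypothetical sample data, one entry per retrieved chunk.
sample_steps = [
    {"answer": "Answer from chunk 1", "score": "60"},
    {"answer": "Answer from chunk 2", "score": "100"},
    {"answer": "Answer from chunk 3", "score": "80"},
]

best = pick_best_answer(sample_steps)
print(f"Answer: {best['answer']}\n\nConfidence Score: {best['score']}")
```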

##### Chat with the image file

In [17]:
query = "What is the document about?"

results = chat_with_file(article_information_path, query)

answer = results["answer"]
confidence_score = results["score"]

print(f"Answer: {answer}\n\nConfidence Score: {confidence_score}")



Answer:  This document is about the process of extracting information from raw data, including splitting the data into chunks, detecting the type of file, extracting the content, and creating a vector index and embeddings.

Confidence Score: 100


##### Chat with the PDF file

In [18]:
query = "Why is the self-attention approach used in this document?"

results = chat_with_file(research_paper_path, query)

answer = results["answer"]
confidence_score = results["score"]

print(f"Answer: {answer}\n\nConfidence Score: {confidence_score}")



Answer:  Self-attention is used to compute a representation of the sequence, allowing for the construction of the Transformer which relies entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Confidence Score: 100


# Resources

- [How to Chat With Any PDFs and Image Files Using Large Language Models — With Code](https://towardsdatascience.com/how-to-chat-with-any-file-from-pdfs-to-images-using-large-language-models-with-code-4bcfd7e440bc)

