<h3>Step 1: Loading all 3 types of PDF</h3>

<h4>Loading a text-based PDF</h4>

In [3]:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyMuPDFLoader
load_dotenv()
TEXT_PDF_PATH = os.getenv("TEXT_PDF_PATH")

# Load the document; by default PyMuPDFLoader returns one Document per page
text_loader = PyMuPDFLoader(TEXT_PDF_PATH)
text_docs = text_loader.load()
print("Success")
print(f"Loaded the text document : {len(text_docs)}")


Success
Loaded the text document : 420
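
`os.getenv` returns `None` when a variable is unset, and handing `None` to a loader tends to fail with a less obvious error. A minimal sketch of a guard, assuming a hypothetical helper named `require_env` (it is not part of `python-dotenv` or LangChain):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; not part of python-dotenv or LangChain.
    """
    value = os.getenv(name)
    if not value:
        # Covers both "unset" (None) and "set but empty" ("")
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Usage sketch (after load_dotenv()):
# TEXT_PDF_PATH = require_env("TEXT_PDF_PATH")
```

Failing early here keeps a missing `.env` entry from surfacing later as a confusing error inside the PDF loader.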


`PyMuPDFLoader` is used here because, according to the LangChain documentation, it supports document lazy loading, native async loading, image extraction, and table extraction:

https://python.langchain.com/docs/integrations/document_loaders/pymupdf/#setup
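
Lazy loading is the feature most relevant to a long book like this one (420 documents were loaded above): LangChain loaders expose `lazy_load()`, which yields documents one at a time instead of building the whole list first. A minimal sketch of consuming such an iterator in batches; `iter_batches` is a hypothetical helper, and a stand-in generator replaces the real loader so the snippet runs without a PDF:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materialising the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for text_loader.lazy_load(); with the real loader each item
# would be a Document rather than a string.
fake_pages = (f"page {n}" for n in range(1, 8))
for batch in iter_batches(fake_pages, batch_size=3):
    print(len(batch), batch[0])
```

With the real loader, the call would look like `iter_batches(text_loader.lazy_load(), batch_size=50)`; the batch size is an arbitrary example value.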

In [5]:
# Check the content: first 1000 characters of page 1
print(text_docs[0].page_content[:1000])


The Project Gutenberg eBook of A Journey to the Centre of the Earth
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.
Title: A Journey to the Centre of the Earth
Author: Jules Verne
Release date: July 18, 2006 [eBook #18857]
                Most recently updated: December 27, 2012
Language: English
Original publication: Griffith and Farran,, 1871
Credits: Produced by Norm Wolcott
*** START OF THE PROJECT GUTENBERG EBOOK A JOURNEY TO THE CENTRE OF THE
EARTH ***


# Loading table-heavy PDFs

In [7]:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PDFPlumberLoader
load_dotenv()
TABLE_PDF_PATH=os.getenv("TABLE_PDF_PATH")
table_loader = PDFPlumberLoader(TABLE_PDF_PATH)  # the loader object
table_docs = table_loader.load()  # the loaded Document objects


| Loader             | Best For                             | Can Extract Tables? | Notes                                            |
| ------------------ | ------------------------------------ | ------------------- | ------------------------------------------------ |
| `PyMuPDFLoader`    | Normal text-based PDFs (like novels) | ❌ Not reliably      | Gives free-form text only, loses table structure |
| `PDFPlumberLoader` | Tabular/financial/data-heavy PDFs    | ✅ Yes               | Designed for tabular layouts; `page_content` keeps each row on one line of text |


In [None]:
print(table_docs[0].page_content[:1000])


Country NCountry CSeries NamSeries Cod2015 [YR22016 [YR22017 [YR22018 [YR22019 [YR22020 [YR22021 [YR22022 [YR22023 [YR22024 [YR2
Australia AUS Access to EG.CFT.AC 100 100 100 100 100 100 100 100 .. ..
Australia AUS Access to EG.CFT.AC 100 100 100 100 100 100 100 100 .. ..
Australia AUS Access to EG.CFT.AC 100 100 100 100 100 100 100 100 .. ..
Australia AUS Access to EG.ELC.AC 100 100 100 100 100 100 100 100 100 ..
Australia AUS Access to EG.ELC.AC 100 100 100 100 100 100 100 100 100 ..
Australia AUS Access to EG.ELC.AC 100 100 100 100 100 100 100 100 100 ..
Australia AUS Account oFX.OWN.T.. .. 99.52 .. .. .. 99.32 .. .. ..
Australia AUS Account oFX.OWN.T.. .. 99.2 .. .. .. 100 .. .. ..
Australia AUS Account oFX.OWN.T.. .. 99.85 .. .. .. 98.59 .. .. ..
Australia AUS Account oFX.OWN.T.. .. 99.64 .. .. .. 99.18 .. .. ..
Australia AUS Account oFX.OWN.T.. .. 99.29 .. .. .. 98.3 .. .. ..
Australia AUS Account oFX.OWN.T.. .. 100 .. .. .. 86.48 .. .. ..
Australia AUS Account oFX.OWN.T.. .. 99.
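
The output above shows each table row flattened into whitespace-separated text, with clipped header cells. A minimal sketch of recovering the numeric columns from one such line, assuming the last ten tokens are the 2015 to 2024 values and that `..` marks a missing value. `parse_flat_row` is a hypothetical helper; rows where a cell is fused to its neighbour (such as the `oFX.OWN.T..` lines) are rejected with a `ValueError` rather than guessed at:

```python
from typing import List, Optional, Tuple

N_YEAR_COLUMNS = 10  # assumption: one value per year, 2015 through 2024


def parse_flat_row(
    line: str, n_values: int = N_YEAR_COLUMNS
) -> Tuple[str, List[Optional[float]]]:
    """Split a flattened row into (label text, list of yearly values)."""
    # Split from the right so spaces inside the label are left alone
    parts = line.rsplit(maxsplit=n_values)
    if len(parts) != n_values + 1:
        raise ValueError(f"expected a label plus {n_values} values: {line!r}")
    label, raw_values = parts[0].strip(), parts[1:]
    # float() raises ValueError for fused or otherwise non-numeric cells
    values = [None if raw == ".." else float(raw) for raw in raw_values]
    return label, values


row = "Australia AUS Access to EG.CFT.AC 100 100 100 100 100 100 100 100 .. .."
label, values = parse_flat_row(row)
print(label)
print(values)
```

The label still mixes country, code, and series name, because the flattened text carries no column boundaries for those fields.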

# Loading image-based PDFs

In [2]:
# loaders/load_image_pdf.py

import os
from dotenv import load_dotenv
import fitz  # PyMuPDF
from PIL import Image  # wraps the rendered page pixels as an image object
from pytesseract import image_to_string  # runs OCR on an image and returns the text
from langchain_core.documents import Document

# Load env variables
load_dotenv()
IMAGE_PDF_PATH = os.getenv("IMAGE_PDF_PATH")

def read_image_pdf(path):
    doc = fitz.open(path)
    # doc is a list-like object; iterating over it yields one page at a time
    print(doc)
    # Collects the extracted pages as LangChain Document objects
    image_docs = []

    for page in doc:
        # Try text extraction
        text = page.get_text().strip()
        if not text:
            # print("No text found")  # this branch was entered 9 times for this PDF
            # No embedded text: render the page to an image for OCR
            pix = page.get_pixmap()
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            text = image_to_string(img)

        if text.strip():
            image_docs.append(Document(
                page_content=text.strip(),
                metadata={"page": page.number + 1, "source": "image"}
            ))

    return image_docs

if __name__ == "__main__":
    docs = read_image_pdf(IMAGE_PDF_PATH)
    print(f"Loaded {len(docs)} pages with image-based text")
    print("-" * 50)
    print("Sample content:\n")
    print(docs[0].page_content[:500])
    print("\nMetadata:", docs[0].metadata)


Document('D:\AGI Projects\MultiPDF RAG Pipeline\Dataset\100 SQL COMMANDS .pdf')
Loaded 9 pages with image-based text
--------------------------------------------------
Sample content:

@, Vikram
# ~ ecode. learning

100 SQL
Commands

Metadata: {'page': 1, 'source': 'image'}


Explanation of the above code

<h4>import fitz</h4> 

fitz is the import name of the PyMuPDF library, which reads and works with PDF files.

It can:

Extract text (page.get_text())

Convert pages to images (page.get_pixmap())

Read metadata, bookmarks, etc.

<h4>from PIL import Image</h4> 

Imports Image from Pillow, the maintained fork of the Python Imaging Library (PIL).

Used here to handle images rendered from PDF pages.

Image.frombytes() turns the pixmap produced by PyMuPDF into an image object that the OCR step can read.

<h4>from pytesseract import image_to_string</h4> 

Comes from the pytesseract package, a Python wrapper for Google’s Tesseract OCR engine.

image_to_string() takes an image and returns the extracted text.

You use it when a PDF contains text inside images (non-selectable).
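
Because `pytesseract` only wraps the Tesseract command-line program, the `tesseract` executable has to be installed separately and be discoverable on `PATH` (or configured explicitly). A minimal sketch of an up-front check using only the standard library; `tesseract_available` is a hypothetical helper name:

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if not tesseract_available():
    print("Tesseract not found on PATH; OCR will fail until it is installed.")
```

Running this check before the page loop avoids discovering the missing binary only when the first image-only page is reached.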

<h4>from langchain_core.documents import Document</h4>

One of LangChain’s core data structures.

Document is a wrapper that holds:

page_content → the actual text

metadata → source, page number, type, etc.


🔧 def read_image_pdf(path):
Defines a function that takes the path to an image-based PDF and returns a list of LangChain Document objects, each containing extracted text + metadata.

📄 doc = fitz.open(path)
Opens the PDF using PyMuPDF.

doc becomes a list-like object where each item is a page of the PDF.

📦 image_docs = []
Empty list to store all the extracted pages as LangChain Document objects.

🔁 for page in doc:
Iterates through each page of the PDF file.

🧪 text = page.get_text().strip()
Tries to extract any embedded text from the page directly (like in normal PDFs).

.strip() removes extra whitespace.

If the PDF has scanned images only, this will likely return an empty string.

❗ if not text:
Runs when get_text() returned nothing, i.e. the page has no selectable text.

The commented-out print("No text found") in this branch is a debugging aid; for this PDF the branch was entered 9 times.

🖼️ pix = page.get_pixmap()
Converts the entire PDF page into a raster image (pixel map).

You need this for OCR, because image_to_string() expects an image.
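
OCR accuracy depends heavily on the resolution of this image. PDF page sizes are measured in points (1/72 inch), and `get_pixmap()` with no arguments renders roughly one pixel per point, i.e. 72 DPI. A small sketch of what a higher resolution means in pixels; `pixmap_size` is a hypothetical helper, the result is approximate, and the A4 dimensions are an example rather than the page size of this PDF:

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF user-space unit


def pixmap_size(width_pt: float, height_pt: float, dpi: int = 72) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# An A4 page is about 595 x 842 points
print(pixmap_size(595, 842))           # default rendering resolution
print(pixmap_size(595, 842, dpi=300))  # a resolution commonly used for OCR
```

If the installed PyMuPDF version supports it, `page.get_pixmap(dpi=300)` requests the higher resolution directly; otherwise a `fitz.Matrix(zoom, zoom)` with `zoom = dpi / 72` can be passed as the `matrix` argument. Whether this improves the OCR result for this particular PDF is untested here.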

🧠 img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
Converts the pixmap to a Pillow Image object, so it can be passed to Tesseract.

🔍 text = image_to_string(img)
Performs OCR (Optical Character Recognition) on the image of the page.

Converts image content (like SQL diagrams, screenshots, or scanned docs) into raw text.

✅ if text.strip():
If either direct extraction or OCR produced non-empty text, proceed to save it.

📦 image_docs.append(Document(...))
Wraps the extracted text and some helpful metadata into a LangChain Document object.

The metadata is {"page": page.number + 1, "source": "image"} (page.number is zero-based, hence the + 1).
This tells you:

Which page the text came from

That it came from the image-based PDF
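
Note that the code above sets `"source": "image"` on every page, including pages whose text came straight from `get_text()`. A minimal sketch of recording which path actually produced the text, assuming a hypothetical helper `extract_with_method` and an OCR step passed in as a callable (in the notebook it would wrap `get_pixmap()` plus `image_to_string()`):

```python
from typing import Callable, Tuple


def extract_with_method(
    embedded_text: str, run_ocr: Callable[[], str]
) -> Tuple[str, str]:
    """Return (text, method), where method is "embedded" or "ocr"."""
    text = embedded_text.strip()
    if text:
        return text, "embedded"
    # OCR is only invoked when the page has no selectable text
    return run_ocr().strip(), "ocr"


# Stand-in OCR callable so the sketch runs without Tesseract
text, method = extract_with_method("", lambda: " 100 SQL\nCommands \n")
print(method, repr(text))
```

The returned method string could then be stored in the metadata, for example under a key such as `"extraction"` (an illustrative name), alongside the page number.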

🔚 return image_docs
Returns the list of extracted pages, each as a Document object, ready to be chunked, embedded, etc.
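
As a pointer to that next step, here is a minimal sketch of fixed-size character chunking with overlap. It uses plain dictionaries in place of `Document` objects so it runs without LangChain; `chunk_text` and its default sizes are illustrative assumptions, not the splitter this pipeline uses:

```python
from typing import Dict, List


def chunk_text(
    text: str, metadata: Dict, chunk_size: int = 500, overlap: int = 50
) -> List[Dict]:
    """Split text into overlapping chunks, copying the page metadata onto each."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append(
            {"page_content": piece, "metadata": {**metadata, "start": start}}
        )
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = chunk_text("abcdefghij", {"page": 1, "source": "image"}, chunk_size=4, overlap=1)
print([c["page_content"] for c in sample])
```

Carrying the page metadata onto every chunk keeps each retrieved passage traceable to its source page.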

