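### Handling PDF Challenges
PDFs are hard to parse because they:

-> store text as positioned glyphs rather than as a simple stream of characters

-> can have formatting issues, such as irregular spacing and broken lines

-> may contain scanned images, which need OCR before any text can be extracted

-> often have extraction artifacts, such as ligature characters and repeated page footers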

In [2]:
# Example of raw PDF extraction

raw_pdf_text = """ Company financial report

        The financial report for the year 2023 shows a significant increase in
        revenue compared to the previous year. The company has implemented several new
        strategies that have contributed to this growth. The net profit margin has also improved,
        
        
        indicating better cost management and operational efficiency. Overall, the financial performance
        
        
of the company in 2023 has been very positive,
and it is expected to continue this trend in the coming years.

Page 1 of 10
"""
# Define the cleaning function (it is applied to raw_pdf_text below)
def clean_text(text):
    text = " ".join(text.split())  # Remove extra whitespace

    # Fix ligatures 
    text = text.replace("ﬁ", "fi").replace("ﬂ", "fl")
    return text

cleaned_text = clean_text(raw_pdf_text)
print("BEFORE")
print(raw_pdf_text)
print("\nAFTER")
print(cleaned_text)

    

BEFORE
 Company financial report

        The financial report for the year 2023 shows a significant increase in
        revenue compared to the previous year. The company has implemented several new
        strategies that have contributed to this growth. The net profit margin has also improved,


        indicating better cost management and operational efficiency. Overall, the financial performance


of the company in 2023 has been very positive,
and it is expected to continue this trend in the coming years.

Page 1 of 10


AFTER
Company financial report The financial report for the year 2023 shows a significant increase in revenue compared to the previous year. The company has implemented several new strategies that have contributed to this growth. The net profit margin has also improved, indicating better cost management and operational efficiency. Overall, the financial performance of the company in 2023 has been very positive, and it is expected to continue this trend in the com
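
Note that `clean_text` only collapses whitespace and replaces two ligatures, so the `Page 1 of 10` footer in `raw_pdf_text` is still part of the cleaned string. The sketch below is one possible extension, not part of the original notebook: `clean_pdf_text`, `PAGE_FOOTER`, and `HYPHEN_BREAK` are hypothetical names, and the footer pattern assumes footers look like "Page N of M" on a line of their own.

```python
import re

# Assumption: page footers look like "Page 1 of 10" on a line of their own.
PAGE_FOOTER = re.compile(
    r"^[ \t]*Page[ \t]+\d+[ \t]+of[ \t]+\d+[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# A word split across a line break, e.g. "perfor-\nmance".
HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*\n\s*(\w)")


def clean_pdf_text(text: str) -> str:
    """Hypothetical extension of clean_text.

    Line-based fixes must run before whitespace is collapsed,
    because collapsing removes the line breaks they rely on.
    """
    text = PAGE_FOOTER.sub("", text)        # drop footer lines
    text = HYPHEN_BREAK.sub(r"\1\2", text)  # rejoin hyphenated line breaks
    text = " ".join(text.split())           # collapse whitespace
    # "\ufb01" and "\ufb02" are the fi and fl ligature characters
    return text.replace("\ufb01", "fi").replace("\ufb02", "fl")


sample = (
    "Overall, the \ufb01nancial perfor-\n"
    "        mance has been very positive.\n"
    "\n"
    "Page 1 of 10\n"
)
print(clean_pdf_text(sample))
```

Rejoining hyphenated line breaks is a heuristic: it would also merge a genuine compound such as "cost-" / "effective" that happens to break at the hyphen, so it is worth checking on real documents before relying on it.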

In [3]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter



In [6]:
from langchain_core.documents import Document

In [8]:
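# Wraps PyPDFLoader instead of subclassing it: process_pdf builds its own
# loader per file, and the base class was never initialized here
class SmartPDFLoader:
    """Advanced PDF processing with cleaning, chunking, and metadata enrichment"""
    def __init__(self, chunk_size=1000, chunk_overlap=100):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Text is whitespace-normalized before splitting, so a single space
        # is the only separator left to split on
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=[" "]
        )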

    def process_pdf(self, pdf_path: str) -> list[Document]:
        """Process PDF with smart chunking and metadata enhancement"""
        # Load the PDF; PyPDFLoader returns one Document per page by default
        loader = PyPDFLoader(pdf_path)
        pages = loader.load()

        # Process each page
        processed_chunks = []

        for page_num, page in enumerate(pages):
            # Clean text
            cleaned_text = self._clean_text(page.page_content)

            # Skip nearly empty pages
            if len(cleaned_text) < 50:
                continue

            # Create chunks with enhanced metadata
            chunks = self.text_splitter.create_documents(
                texts=[cleaned_text],
                metadatas=[{
                    **page.metadata,
                    "page": page_num + 1,  # 1-based page number
                    "total_pages": len(pages),
                    "chunk_method": "smart_pdf_processor",
                    "char_count": len(cleaned_text)  # length of the cleaned page, not of the chunk
                }]
            )
            
            processed_chunks.extend(chunks)
        return processed_chunks

    def _clean_text(self, text: str) -> str:
        # Remove extra whitespace
        text = " ".join(text.split())  
        
        # Fix ligatures 
        text = text.replace("ﬁ", "fi").replace("ﬂ", "fl")
        return text
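
`_clean_text` replaces only the "fi" and "fl" ligatures, but PDFs can also contain others such as "ff", "ffi", and "ffl". As a hedged alternative (not what the class above does), Unicode NFKC normalization from the standard library expands all of these compatibility characters in one call. `normalize_ligatures` is a hypothetical helper name.

```python
import unicodedata


def normalize_ligatures(text: str) -> str:
    """Expand ligatures via NFKC compatibility normalization.

    Caveat: NFKC is broader than a ligature fix. It also folds other
    compatibility characters, e.g. a superscript two becomes a plain "2".
    """
    return unicodedata.normalize("NFKC", text)


# \ufb03 = ffi, \ufb02 = fl, \ufb00 = ff, \ufb01 = fi
print(normalize_ligatures("e\ufb03cient work\ufb02ow, di\ufb00erent \ufb01le"))
# efficient workflow, different file
```

Whether the broader folding is acceptable depends on the documents: for text with formulas or footnote markers, the explicit `replace` calls are the more conservative choice.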

In [10]:
# The name must match the one used in the next cell
processor = SmartPDFLoader()
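
The defaults used here are `chunk_size=1000` and `chunk_overlap=100`. To make those two parameters concrete, the following is a simplified stand-in for space-based splitting with overlap, written in plain Python. It is not LangChain's implementation and may differ from `RecursiveCharacterTextSplitter` in edge cases; `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Pack words into chunks of at most chunk_size characters.

    Each new chunk starts with the trailing words of the previous one,
    up to chunk_overlap characters. A single word longer than chunk_size
    is kept whole, so that chunk can exceed the limit.
    """
    chunks: list[str] = []
    current: list[str] = []
    length = 0  # length of " ".join(current)
    for word in text.split():
        added = len(word) + (1 if current else 0)
        if current and length + added > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing words that fit within the overlap budget
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                extra = len(prev) + (1 if carried else 0)
                if carried_len + extra > chunk_overlap:
                    break
                carried.insert(0, prev)
                carried_len += extra
            current, length = carried, carried_len
            added = len(word) + (1 if current else 0)
            # Drop the overlap if it leaves no room for the next word
            if current and length + added > chunk_size:
                current, length, added = [], 0, len(word)
        current.append(word)
        length += added
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_words("aa bb cc dd ee ff", chunk_size=8, chunk_overlap=2))
# ['aa bb cc', 'cc dd ee', 'ee ff']
```

In the small example, the last word of each chunk reappears at the start of the next one, which is the effect `chunk_overlap` is meant to have: context that straddles a chunk boundary stays retrievable from either side.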

In [None]:
# Process a PDF 
try:
    smart_chunks = processor.process_pdf("data/pdf_files/attention.pdf")
    print(f"Processed {len(smart_chunks)} chunks from the PDF.")

    # Show enhanced metadata
    if smart_chunks:
        print("\nSample chunk metadata:")
        for key, value in smart_chunks[0].metadata.items():
            print(f"{key}: {value}")

except Exception as e:
    print(f"Error processing PDF: {e}")

Processed 40 chunks from the PDF.

Sample chunk metadata:
producer: PyPDF2
creator: PyPDF
creationdate: 
subject: Neural Information Processing Systems http://nips.cc/
publisher: Curran Associates, Inc.
language: en-US
created: 2017
eventtype: Poster
description-abstract: The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms.  We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly less timeto train. Our single model with 165 million parameters, achieves 27.5 BLEU onEnglish-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On Engl