# This notebook covers only PDF/directory loading and chunking. No embedding or vector storage is done here.

# PDF Ingestion

In [1]:
from langchain_community.document_loaders import PyPDFLoader # Loads a PDF into one Document per page.
from pathlib import Path # Used to locate the directory and the PDF files inside it.
from langchain_core.documents import Document # Used to annotate the return type.

## Why PDF Loading Needs Care

PDFs are not plain text files. They often contain:
- Empty pages
- Headers/footers
- Irregular formatting
- Very large text blocks

Before chunking or embedding, we must:
- Filter empty pages
- Track file size
- Preserve source metadata
- Estimate token usage to avoid model limits
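
The filtering, metadata, and token-estimate steps can be sketched without any PDF on disk. This is a minimal illustration, not the loader used in this notebook: `Page` is a hypothetical stand-in for LangChain's `Document`, and `clean_pages` and the file name `paper.pdf` are made-up names for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def clean_pages(pages, source):
    """Drop whitespace-only pages and tag the rest with source metadata."""
    kept = []
    for page in pages:
        text = page.page_content
        if not text.strip():  # empty or whitespace-only page
            continue
        page.metadata.update({
            "source": source,
            "estimated_tokens": len(text) // 4,  # ~4 characters per token
        })
        kept.append(page)
    return kept


pages = [Page("Abstract\nDeeper neural networks..."), Page("   \n\t"), Page("1. Introduction")]
kept = clean_pages(pages, source="paper.pdf")
print(len(kept))          # 2: the whitespace-only page was skipped
print(kept[1].metadata)   # {'source': 'paper.pdf', 'estimated_tokens': 3}
```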


In [2]:
# Loads every PDF found (recursively) under a directory and returns the non-empty pages as Documents.
def loading_pdf(dir_path)->list[Document]: # Return Type .
    all_pdf_size_mb=0.0  # Initializing size to 0 .
    all_pdf_documents=[] # Used to store all the PDF documents .
    total_tokens=0
    #Trying to get the PDF's path from given directory .
    pdf_dir=Path(dir_path) #Getting the path of the directory .
    if not pdf_dir.is_dir(): #Checking if the directory exists .
        raise NotADirectoryError(f"{pdf_dir} is not a valid directory .")
    else:
        print(f"The directory path is : {pdf_dir} ") # Printing the Directory Path .

        pdfs=list(pdf_dir.rglob("*.pdf")) # Making a list to store all the PDF paths . Can use glob("**/*.pdf") .
        print(f"Number of pdf's in the directory : {len(pdfs)} ") # Printing Number of PDF's in the directory.

        #Loading all the PDF's using list of paths .

        print("="*20,"PDF LOAD SUMMARY","="*20)
        print("-"*45)
        for serial,pdf_path in enumerate(pdfs):
            print(f"{serial+1})----->Loading {pdf_path.name} .") # Serial number of the PDF .
            size_bytes=pdf_path.stat().st_size # Checking the size of PDF .
            size_mb=(size_bytes/(1024**2)) # Converting into Mb .
            print(f"File size : {size_mb:.3f} Mb .") # Printing File size of the PDF .
            try:
                pdf_loader=PyPDFLoader(pdf_path) # Trying to load the PDF into document .
                pdf=pdf_loader.load()
                pdf_tokens=0
                filtered_pages=[] # Used to get filtered pages and remove empty pages .
                for page in pdf: # Adding extra metadata .
                    if not page.page_content.strip(): # Skipping empty pages .
                        continue
                    else:
                        page.metadata.update({
                            'source': str(pdf_path),
                            'file_name':pdf_path.name,
                            'file_type':"pdf"
                        })
                        page_tokens=len(page.page_content)//4 # Estimates tokens for this page (~4 characters per token).
                        pdf_tokens+=page_tokens # Adding current page tokens to previous tokens .
                        page.metadata["estimated_tokens"] = page_tokens # Storing the tokens of each page in metadata .
                        filtered_pages.append(page)
                total_tokens+=pdf_tokens # Adding current PDF tokens to previous PDF .
                print(f"Number of pages loaded : {len(filtered_pages)} .") # Printing number of pages or documents  loaded in this PDF .
                print(f"Estimated Tokens of the pdf : {pdf_tokens} tokens .") # Estimated token in PDF .
                print("Loaded file successfully . ") # Confirming the PDF loaded .
                print("-"*45)

                all_pdf_documents.extend(filtered_pages) # Using extend to add the documents one by one instead of complete PDF .
                all_pdf_size_mb+=size_mb # Add previous file sizes .

            except Exception as e:
                print(f"Error processing the file {pdf_path.name} . Error {e}") # Exception Handling .

        print("-"*60)
        print(f"Final number of documents/pages loaded : {len(all_pdf_documents)}") # Printing final number of documents collected through all PDF's .
        print(f"Total sizes of all pdf's: {all_pdf_size_mb:.3f} Mb .") # Total size of all the PDF's collectively .
        print(f"Estimated Tokens of all the pdf's : {total_tokens} tokens .") # Estimated total number of tokens through all documents .
        print("-"*60)

        return all_pdf_documents # returning all the collected documents .

In [3]:
documents=loading_pdf("../data")

The directory path is : ..\data 
Number of pdf's in the directory : 2 
---------------------------------------------
1)----->Loading He et al. - 2015 - Deep Residual Learning for Image Recognition.pdf .
File size : 0.781 Mb .
Number of pages loaded : 12 .
Estimated Tokens of the pdf : 14837 tokens .
Loaded file successfully . 
---------------------------------------------
2)----->Loading Lewis et al. - 2021 - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf .
File size : 0.844 Mb .
Number of pages loaded : 19 .
Estimated Tokens of the pdf : 17256 tokens .
Loaded file successfully . 
---------------------------------------------
------------------------------------------------------------
Final number of documents/pages loaded : 31
Total sizes of all pdf's: 1.626 Mb .
Estimated Tokens of all the pdf's : 32093 tokens .
------------------------------------------------------------


## Token Estimation (Important Concept)

LLMs do not process text as characters or words; they process **tokens**.

A commonly used heuristic:
> **1 token ≈ 4 characters (English text)**
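
To make the heuristic concrete, here is a small sketch. The helper names `estimate_tokens` and `fits_limit` and the value of `ASSUMED_LIMIT` are illustrative assumptions, not part of any library or tied to a specific model. Real token counts depend on the model's tokenizer, so treat the result as a rough budget only.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: character count divided by an assumed ratio."""
    return len(text) // chars_per_token


def fits_limit(text: str, token_limit: int) -> bool:
    """True when the estimate is within a caller-supplied token limit."""
    return estimate_tokens(text) <= token_limit


title = "Deep Residual Learning for Image Recognition"
print(len(title), estimate_tokens(title))      # 44 characters -> 11 estimated tokens

ASSUMED_LIMIT = 512  # hypothetical limit, not taken from any specific model
print(fits_limit(title, ASSUMED_LIMIT))        # True
print(fits_limit("x" * 4000, ASSUMED_LIMIT))   # False: 1000 estimated tokens
```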

We use:
```python
len(text) // 4
```
This is only a rough estimate; the exact count depends on the model's tokenizer.


In [4]:
print(documents[0]) # Printing only the first document because there are many documents.

page_content='Deep Residual Learning for Image Recognition
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
Abstract
Deeper neural networks are more difﬁcult to train. We
present a residual learning framework to ease the training
of networks that are substantially deeper than those used
previously. We explicitly reformulate the layers as learn-
ing residual functions with reference to the layer inputs, in-
stead of learning unreferenced functions. We provide com-
prehensive empirical evidence showing that these residual
networks are easier to optimize, and can gain accuracy from
considerably increased depth. On the ImageNet dataset we
evaluate residual nets with a depth of up to 152 layers—8×
deeper than VGG nets [41] but still having lower complex-
ity. An ensemble of these residual nets achieves 3.57% error
on the ImageNettest set. This result won the 1st place on the
ILSVRC 2015 classiﬁcation task. We also pres

In [5]:
print(documents[0].page_content) # Page content of the first document, i.e. the first page of the PDF.

Deep Residual Learning for Image Recognition
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com
Abstract
Deeper neural networks are more difﬁcult to train. We
present a residual learning framework to ease the training
of networks that are substantially deeper than those used
previously. We explicitly reformulate the layers as learn-
ing residual functions with reference to the layer inputs, in-
stead of learning unreferenced functions. We provide com-
prehensive empirical evidence showing that these residual
networks are easier to optimize, and can gain accuracy from
considerably increased depth. On the ImageNet dataset we
evaluate residual nets with a depth of up to 152 layers—8×
deeper than VGG nets [41] but still having lower complex-
ity. An ensemble of these residual nets achieves 3.57% error
on the ImageNettest set. This result won the 1st place on the
ILSVRC 2015 classiﬁcation task. We also present analysis
o

In [6]:
print(documents[0].metadata) # Metadata of the first document, i.e. the first page of the PDF.

{'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-12-11T01:13:45+00:00', 'author': '', 'keywords': '', 'moddate': '2015-12-11T01:13:45+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf\\He et al. - 2015 - Deep Residual Learning for Image Recognition.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1', 'file_name': 'He et al. - 2015 - Deep Residual Learning for Image Recognition.pdf', 'file_type': 'pdf', 'estimated_tokens': 1071}


# Chunking

## Why Text Chunking Is Required

Embedding models and LLMs have **context length limits**.

Large documents must be split into smaller chunks to:
- Fit model constraints
- Improve semantic retrieval
- Avoid truncation or failures
- Preserve local context

Chunking is effectively a **mandatory step** in a RAG pipeline whenever source documents exceed those limits.

In [7]:
# Used to split large documents into smaller chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter # Class that performs the splitting.

## RecursiveCharacterTextSplitter Explained

This splitter tries to keep related text together by splitting **recursively**: it tries each separator in order and falls back to the next one only when a piece is still too large:

1. Paragraph breaks (`\n\n`)
2. Line breaks (`\n`)
3. Spaces
4. Character-level splits (last resort)

This approach avoids cutting sentences or ideas whenever possible.
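
The fallback order can be illustrated with a simplified pure-Python sketch. This is not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores overlap, and its handling of separators is cruder than the library's. It only shows how a finer separator is used when a piece is still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring coarse separators and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate            # keep merging small pieces
            continue
        if current:
            chunks.append(current)         # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, chunk_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
print(recursive_split("abcdefghijkl", chunk_size=5))
# ['abcde', 'fghij', 'kl']
```

The first paragraph fits as a whole, so it is kept intact; the second is too long, so it is split on spaces; the unbroken string has no separators at all, so it is cut at character level.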


In [8]:
# Takes a list of Documents and splits them into smaller chunk Documents.
def document_splitters(documents:list[Document],chunk_size=1000,chunk_overlap=200)->list[Document]:
    # chunk_size -> maximum chunk length, measured by length_function (characters here).
    # chunk_overlap -> number of characters shared with the end of the previous chunk, to keep context across boundaries.
    # add_start_index -> stores each chunk's starting character offset within its source document as metadata['start_index'].
    # separators -> delimiters tried in order, from coarsest to finest, when splitting.
    # length_function -> function used to measure chunk length; len counts characters, not tokens.
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=chunk_size,chunk_overlap=chunk_overlap,add_start_index=True,separators=["\n\n","\n"," ",""],length_function=len)

    # Checking if the documents is valid with respect to type .
    if not documents:
        raise ValueError("Invalid documents .")
    if not all(isinstance(d,Document) for d in documents):
        raise TypeError("All the item in documents must be of Document type .")

    split_docs=text_splitter.split_documents(documents) # Splitting the documents based on given parameters .
    if not split_docs: # Checking if the documents have been split .
        print("No chunks created .")
        return [] # Returning empty list indicating no chunks were created .
    print(f"{len(documents)} documents is split into {len(split_docs)} chunks .") # Printing number of chunks created with respect to original documents .

    tokens=0 #Initializing tokens to 0 .

    for doc in split_docs: # Storing a per-chunk token estimate in metadata for later use.
        chunk_tokens=len(doc.page_content)//4
        tokens+=chunk_tokens
        doc.metadata.update(
            {
                'estimated_tokens':chunk_tokens
            }
        )
    # Printing essential details after chunking the data .
    print("="*30,"Chunking SUMMARY","="*30)
    print("-"*80)
    max_tokens=max(t.metadata['estimated_tokens'] for t in split_docs)  # Maximum tokens in a single chunk .
    min_tokens=min(t.metadata['estimated_tokens'] for t in split_docs)  # Minimum tokens in a single chunk .
    avg_tokens=tokens//len(split_docs)                                 # Average tokens from all the chunks .
    print(f"---> Maximum tokens in a chunk : {max_tokens} tokens")
    print(f"---> Minimum tokens in a chunk : {min_tokens} tokens")
    print(f"---> Average tokens from each chunk : {avg_tokens} tokens .")
    print(f"---> Estimated tokens from all the documents after chunking : {tokens} tokens .") # Estimated tokens after chunking . Higher than original documents due to chunk overlapping .
    print("-"*80)
    return split_docs # Returning documents after chunking .
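
The roles of `chunk_size`, `chunk_overlap`, and `start_index` are easiest to see with a plain fixed-window splitter. The sketch below is an assumption-laden simplification: `sliding_window_chunks` is a hypothetical helper that cuts at fixed character positions, whereas `RecursiveCharacterTextSplitter` places boundaries at separators, so its overlaps are usually not exactly `chunk_overlap` characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return windows


windows = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for w in windows:
    print(w["start_index"], w["text"])
# 0 abcdefghij
# 6 ghijklmnop
# 12 mnopqrstuv
# 18 stuvwxyz
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, which is why the recorded start offsets are needed to map a chunk back to its position in the source text.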

In [9]:
chunks=document_splitters(documents)

31 documents is split into 170 chunks .
--------------------------------------------------------------------------------
---> Maximum tokens in a chunk : 249 tokens
---> Minimum tokens in a chunk : 50 tokens
---> Average tokens from each chunk : 223 tokens .
---> Estimated tokens from all the documents after chunking : 37984 tokens .
--------------------------------------------------------------------------------
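
The numbers printed above and in the PDF load summary can be cross-checked with a little arithmetic. The snippet only reuses figures from those two outputs; the variable names are illustrative.

```python
tokens_before = 32093   # estimated tokens reported by the PDF load summary
tokens_after = 37984    # estimated tokens reported after chunking
num_chunks = 170        # chunks reported after splitting
chunk_size = 1000       # characters, the default used by document_splitters

extra_tokens = tokens_after - tokens_before
overhead_pct = 100 * extra_tokens / tokens_before
avg_tokens = tokens_after // num_chunks
max_possible_tokens = chunk_size // 4

print(extra_tokens)             # 5891 extra estimated tokens
print(round(overhead_pct, 1))   # 18.4 (percent more than before chunking)
print(avg_tokens)               # 223, matching the reported average
print(max_possible_tokens)      # 250, so the observed maximum of 249 is within the cap
```

The roughly 18% increase is consistent with the repeated text that `chunk_overlap` introduces; the per-page and per-chunk integer divisions also contribute small rounding differences.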


In [10]:
print(chunks[1])

page_content='ity. An ensemble of these residual nets achieves 3.57% error
on the ImageNettest set. This result won the 1st place on the
ILSVRC 2015 classiﬁcation task. We also present analysis
on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance
for many visual recognition tasks. Solely due to our ex-
tremely deep representations, we obtain a 28% relative im-
provement on the COCO object detection dataset. Deep
residual nets are foundations of our submissions to ILSVRC
& COCO 2015 competitions 1, where we also won the 1st
places on the tasks of ImageNet detection, ImageNet local-
ization, COCO detection, and COCO segmentation.
1. Introduction
Deep convolutional neural networks [22, 21] have led
to a series of breakthroughs for image classiﬁcation [21,
50, 40]. Deep networks naturally integrate low/mid/high-
level features [50] and classiﬁers in an end-to-end multi-
layer fashion, and the “levels” of features can be enriched' metadata={'

In [11]:
print(chunks[1].page_content)

ity. An ensemble of these residual nets achieves 3.57% error
on the ImageNettest set. This result won the 1st place on the
ILSVRC 2015 classiﬁcation task. We also present analysis
on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance
for many visual recognition tasks. Solely due to our ex-
tremely deep representations, we obtain a 28% relative im-
provement on the COCO object detection dataset. Deep
residual nets are foundations of our submissions to ILSVRC
& COCO 2015 competitions 1, where we also won the 1st
places on the tasks of ImageNet detection, ImageNet local-
ization, COCO detection, and COCO segmentation.
1. Introduction
Deep convolutional neural networks [22, 21] have led
to a series of breakthroughs for image classiﬁcation [21,
50, 40]. Deep networks naturally integrate low/mid/high-
level features [50] and classiﬁers in an end-to-end multi-
layer fashion, and the “levels” of features can be enriched


## 🧾 Metadata Propagation

Each chunk preserves and extends metadata such as:

- `source` – original file path
- `file_name` – PDF file name
- `file_type` – "pdf"
- `estimated_tokens` – token estimate for the chunk
- `start_index` – character offset of the chunk within its source page

This metadata is critical for:
- Retrieval filtering
- Source attribution
- Cost estimation
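
A short sketch of these three uses, working on plain dictionaries shaped like `chunk.metadata`. The file names and values below are made up for illustration, and `cite` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical metadata dicts shaped like `chunk.metadata`; the values are made up.
chunk_metadata = [
    {"file_name": "paper_a.pdf", "page": 0, "start_index": 0, "estimated_tokens": 240},
    {"file_name": "paper_a.pdf", "page": 0, "start_index": 812, "estimated_tokens": 230},
    {"file_name": "paper_b.pdf", "page": 3, "start_index": 0, "estimated_tokens": 200},
]

# Retrieval filtering: keep only chunks that come from one file.
from_a = [m for m in chunk_metadata if m["file_name"] == "paper_a.pdf"]


# Source attribution: build a human-readable citation for a chunk.
def cite(meta):
    # 'page' is zero-based in the loader's metadata, so add 1 for display.
    return f"{meta['file_name']}, page {meta['page'] + 1}, offset {meta['start_index']}"


# Cost estimation: total estimated tokens per file.
tokens_per_file = defaultdict(int)
for meta in chunk_metadata:
    tokens_per_file[meta["file_name"]] += meta["estimated_tokens"]

print(len(from_a))               # 2
print(cite(chunk_metadata[1]))   # paper_a.pdf, page 1, offset 812
print(dict(tokens_per_file))     # {'paper_a.pdf': 470, 'paper_b.pdf': 200}
```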

In [12]:
print(chunks[1].metadata)

{'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-12-11T01:13:45+00:00', 'author': '', 'keywords': '', 'moddate': '2015-12-11T01:13:45+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf\\He et al. - 2015 - Deep Residual Learning for Image Recognition.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1', 'file_name': 'He et al. - 2015 - Deep Residual Learning for Image Recognition.pdf', 'file_type': 'pdf', 'estimated_tokens': 240, 'start_index': 812}


## Important Note: Documents Are NOT Embeddings

The standard RAG pipeline is:

1. Documents are loaded and cleaned.
2. Documents are split into smaller chunks.
3. Chunks are converted into embeddings.
4. Embeddings are stored in a vector database.

This notebook stops at Step 2.

### What comes next
- Embedding generation
- Vector store ingestion
- Retrieval (RAG)

## Next step: Embed the chunked data before storing it in a vector store.
