# Phase 1: The PDF Parser


**Goal:** Extract clean text and page numbers from a user-uploaded PDF.

In [15]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return a list of {"page_no", "text"} dicts, one per page of the PDF."""
    pages_data = []  # the "collection basket": one dictionary per page
    # Open the document in a context manager so the file is closed afterwards
    with fitz.open(pdf_path) as doc:
        # Number pages from 1 so they match how a reader counts them
        for page_no, page in enumerate(doc, start=1):
            text = page.get_text()  # plain text of this page
            pages_data.append({"page_no": page_no, "text": text})

    return pages_data

In [16]:
# Testing
path = "/Users/tushar04master/Documents/Project QnA/samples/Let Us C By Yashwant Kanetkar.pdf"
extracted_data = extract_text_from_pdf(path)
print(f"The PDF has {len(extracted_data)} pages.")
# Index 26 is page 27: the list is 0-indexed, page numbers start at 1
print("\nData from page 27:")
print(extracted_data[26])
print("\nJust the text from page 27:")
print(extracted_data[26]['text'])

The PDF has 729 pages.

Data from page 27:
{'page_no': 27, 'text': 'Chapter 1: Getting Started                                           9 \nrange is –32768 to 32767. For a 32-bit compiler the range would \nbe even greater. Question like what exactly do you mean by a 16-\nbit or a 32-bit compiler, what range of an Integer constant has to \ndo with the type of compiler and such questions are discussed in \ndetail in Chapter 16. Till that time it would be assumed that we are \nworking with a 16-bit compiler. \n \nEx.:   426 \n \n  +782 \n \n  -8000 \n \n  -7605 \n \nRules for Constructing Real Constants \nReal constants are often called Floating Point constants. The real \nconstants could be written in two forms—Fractional form and \nExponential form. \nFollowing rules must be observed while constructing real \nconstants expressed in fractional form: \n \n(a) \n(b) \n(c) \n(d) \n(e) \nA real constant must have at least one digit. \nIt must have a decimal point. \nIt could be eithe
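
The raw output above still carries a lot of layout whitespace: trailing spaces before each `\n` and lines that hold only a single space. Since the goal is clean text, here is a minimal sketch of a normalizing step. `clean_page_text` is a hypothetical helper (it is not part of PyMuPDF), and it assumes it is acceptable to lose the original spacing and blank lines.

```python
import re

def clean_page_text(text):
    """Hypothetical helper: tidy the whitespace in one page's extracted text."""
    # Remove leading and trailing whitespace from every line
    lines = [line.strip() for line in text.splitlines()]
    # Collapse runs of spaces or tabs inside a line into a single space
    lines = [re.sub(r"[ \t]+", " ", line) for line in lines]
    # Drop lines that are now empty, then rejoin with newlines
    return "\n".join(line for line in lines if line)

sample = "Ex.:   426 \n \n  +782 \n \nReal constants are often called \n"
print(clean_page_text(sample))
```

This could be applied to `page.get_text()` before the page dictionary is built, but it also flattens indentation, so it may not suit pages that contain code listings.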

# Phase 2: The Document Processor (The "Chunker")

#### Problem:
The transformer models we use for Question Answering have a "short attention span." They can only read a certain amount of text at one time (typically around 512 tokens, which is roughly 300-400 words). A dense page from a book can exceed that limit, and whatever does not fit is typically truncated or rejected, so the model never gets to read it.
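
To get a feel for which pages are likely to be too long, here is a rough sketch that counts whitespace-separated words per page. It assumes the page dictionaries from Phase 1 (with `page_no` and `text` keys). The 300-word threshold is only the lower end of the estimate above, and `estimate_word_count` and `pages_over_budget` are hypothetical helper names. The real limit depends on the model's tokenizer, so treat this as a sanity check, not a token count.

```python
def estimate_word_count(text):
    """Count whitespace-separated words; only a rough proxy for model tokens."""
    return len(text.split())

def pages_over_budget(pages_data, max_words=300):
    """Return the page numbers whose text exceeds the assumed word budget."""
    return [
        page["page_no"]
        for page in pages_data
        if estimate_word_count(page["text"]) > max_words
    ]

# Tiny made-up input in the same shape as the Phase 1 output
demo_pages = [
    {"page_no": 1, "text": "short page"},
    {"page_no": 2, "text": "word " * 450},
]
print(pages_over_budget(demo_pages))
```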

#### Solution: 
We need to be like a book editor. We will take the long scroll of text from each page and cut it into smaller, standard-sized paragraphs or "chunks." We'll also make these chunks overlap slightly, so if an important sentence gets cut in half at the end of one chunk, it will be complete at the beginning of the next.

#### Goal:
Write a function that takes the pages_data list we created in Phase 1 and returns a new, much longer list. Each item in this new list will be a dictionary containing a page_num and a small text_chunk.

In [17]:
# Creating function for text chunker
def chunk_text(pages_data, chunk_size, chunk_overlap):
    """Split each page's text into overlapping chunks of chunk_size characters.

    pages_data is the list returned by extract_text_from_pdf.
    """
    # The step must be positive, otherwise range() fails or yields nothing
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []  # a new "collection basket" for the chunk dictionaries

    # Outer loop: one page at a time, so every chunk keeps its page number
    for page_item in pages_data:
        page_num = page_item['page_no']
        page_text = page_item['text']

        # Inner loop: slide a window of chunk_size characters across the page
        for i in range(0, len(page_text), step):
            # Named "piece" so it does not shadow the chunk_text function
            piece = page_text[i : i + chunk_size]

            # Store the chunk together with its page-number metadata
            chunks.append({"page_num": page_num, "text_chunk": piece})

    return chunks

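If we just jumped forward by the full chunk_size, there would be no overlap. By jumping forward by chunk_size - chunk_overlap instead, each new chunk starts chunk_overlap characters before the end of the previous one. Note that both values are measured in characters, not tokens, because the function slices a Python string.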
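
The stride arithmetic is easier to see on a toy string. The sketch below uses made-up values (a window of 10 characters with an overlap of 3) that are much smaller than anything you would use on a real document.

```python
chunk_size = 10
chunk_overlap = 3
text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters

step = chunk_size - chunk_overlap  # the window advances 7 characters each time

# Collect (start index, slice) pairs exactly as the sliding window produces them
windows = [
    (start, text[start:start + chunk_size])
    for start in range(0, len(text), step)
]

for start, piece in windows:
    print(start, piece)
# 0 abcdefghij
# 7 hijklmnopq
# 14 opqrstuvwx
# 21 vwxyz
```

The last three characters of each chunk (`hij`, `opq`, `vwx`) reappear at the start of the next one. The final chunk is shorter than `chunk_size` because the string runs out.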

In [18]:
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

print("Chunking the document with the new function...")
final_chunked_data = chunk_text(
    pages_data=extracted_data, 
    chunk_size=CHUNK_SIZE, 
    chunk_overlap=CHUNK_OVERLAP
)
print("Chunking complete! ✅")

# --- Inspect the new results ---
print(f"\nOriginal number of pages: {len(extracted_data)}")
print(f"Number of chunks after processing: {len(final_chunked_data)}")

# Look at the first few chunks to verify the structure and page numbers
print("\n--- Example Chunks with Page Numbers ---")
for chunk in final_chunked_data[:5]:
    print(chunk)



Chunking the document with the new function...
Chunking complete! ✅

Original number of pages: 729
Number of chunks after processing: 2246

--- Example Chunks with Page Numbers ---
{'page_num': 1, 'text_chunk': ' \n \nUploaded By Mohd Khushhal \nE-Mail- m.khushhal@gmail.com \n \n              m.khushhal@engineer.com\nFor More E-Books visit-  mkhushhal.blogspot.in \n \n'}
{'page_num': 2, 'text_chunk': ' \n \n \n \n \nLet Us C \nFifth Edition \n \n \nYashavant P. Kanetkar \n \n \n \n \n \n \n'}
{'page_num': 3, 'text_chunk': ' \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n'}
{'page_num': 4, 'text_chunk': ' \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nDedicated to baba \nWho couldn’t be here to see this day... \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n'}
{'page_num': 5, 'text_chunk': 'About the Author \nDestiny drew Yashavant Kanetkar towards computers when the IT \nindustry was just making
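
The chunk for page 3 above contains nothing but spaces and newlines, so it would give a QA model nothing to work with. Here is a minimal sketch of a filter, assuming that whitespace-only chunks can simply be discarded. `drop_blank_chunks` is a hypothetical helper name, and the demo data is a shortened copy of the output above.

```python
def drop_blank_chunks(chunks):
    """Hypothetical helper: keep only chunks that contain non-whitespace text."""
    return [chunk for chunk in chunks if chunk["text_chunk"].strip()]

demo_chunks = [
    {"page_num": 2, "text_chunk": " \n \nLet Us C \nFifth Edition \n"},
    {"page_num": 3, "text_chunk": " \n \n \n \n"},
]
kept = drop_blank_chunks(demo_chunks)
print([chunk["page_num"] for chunk in kept])
```

Running it on `final_chunked_data` would lower the chunk count reported above, so the two numbers should not be compared directly.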

# Phase 3: The **QA Engine** 