## 10 PDF Parsing Practice Problems (Easy to Medium)

1. Extract all text from a PDF using PyPDF2.

In [2]:
from PyPDF2 import PdfReader
pdf = PdfReader("files/simple.pdf")
for page in pdf.pages :
    print(page.extract_text())

This is a simple PDF file.
It contains just a few lines of text.
Generated using Python and ReportLab.



2. Extract text only from Page 1 of your PDF.

In [3]:
from PyPDF2 import PdfReader
pdf = PdfReader("files/simple.pdf")
print(pdf.pages[0].extract_text())

This is a simple PDF file.
It contains just a few lines of text.
Generated using Python and ReportLab.



3. Count the total number of pages in your PDF.

In [6]:
from PyPDF2 import PdfReader
pdf = PdfReader("files/simple.pdf")
print("Total number of pages :", len(pdf.pages))

Total number of pages : 1


4. Detect whether your PDF is scanned (empty text extraction).

In [1]:
from PyPDF2 import PdfReader
pdf = PdfReader("files/simple.pdf")
scanned = True
for page in pdf.pages :
    text = page.extract_text()
    if text and text.strip() :
        scanned = False
        break  # one page with real text is enough to rule out a pure scan
print("Pdf is scanned (empty text extraction)" if scanned else "Pdf is not scanned")

Pdf is not scanned
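
An all-or-nothing check like the one above can be fooled by a scanned PDF that carries a stray page number or watermark as real text. A minimal sketch of a more tolerant variant, assuming the page texts have already been collected into a list (for instance from `page.extract_text()`). The helper name `looks_scanned` and the `min_chars` cutoff of 20 are illustrative choices, not part of PyPDF2:

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True when no page has at least `min_chars` non-whitespace characters.

    `page_texts` holds one entry per page; entries may be None or "".
    The cutoff of 20 is an arbitrary illustrative value, not a library default.
    """
    for text in page_texts:
        # Remove all whitespace so blank lines and padding do not count as content.
        visible = "".join((text or "").split())
        if len(visible) >= min_chars:
            return False
    return True


print(looks_scanned(["", None, "  \n "]))
print(looks_scanned(["This is a simple PDF file.\nIt contains just a few lines of text."]))
```

The right cutoff depends on the documents involved, so it is worth tuning against a few known scanned and non-scanned files.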


5. Extract headings (lines in uppercase or bold-like text).

In [None]:
# PyPDF2's extract_text() returns plain strings without font information, so bold text
# cannot be detected here; only fully uppercase lines are treated as headings.
from PyPDF2 import PdfReader
headings = []
pdf = PdfReader("files/py_modules_test.pdf")
for page in pdf.pages :
    lines = page.extract_text().split('\n')
    for line in lines :
        clean = line.strip()
        if clean.isupper() :
            headings.append(clean)
print("Headings :")
for heading in headings :
    print(heading)

Headings :
INTRODUCTION
MODULES OVERVIEW
PYTHON HAS MANY BUILT-IN MODULES LIKE OS, SYS, MATH, RANDOM.
TEXT EXTRACTION TEST
UPPERCASE HEADING SAMPLE
FILE HANDLING AND DATA PROCESSING
ENCRYPTION AND MERGING TEST
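
The output above includes one line that is an uppercase sentence rather than a heading ("PYTHON HAS MANY BUILT-IN MODULES ..."). A rough sketch of a stricter filter, assuming that headings are short and do not end in sentence punctuation. The name `is_heading` and the `max_words` limit of 8 are hypothetical and would need tuning for other documents:

```python
def is_heading(line, max_words=8):
    """Heuristic: uppercase, short, and not ending like a sentence."""
    clean = line.strip()
    if not clean.isupper():
        return False
    # Sentences typically end with terminal punctuation; headings usually do not.
    if clean.endswith((".", "!", "?")):
        return False
    return len(clean.split()) <= max_words


lines = [
    "INTRODUCTION",
    "MODULES OVERVIEW",
    "PYTHON HAS MANY BUILT-IN MODULES LIKE OS, SYS, MATH, RANDOM.",
    "FILE HANDLING AND DATA PROCESSING",
]
for line in lines:
    if is_heading(line):
        print(line)
```

This remains a heuristic: an uppercase heading that ends with a period, or a very long one, would be dropped.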


6. Extract tables from your PDF using pdfplumber.

In [15]:
import pdfplumber
with pdfplumber.open("files/pdfplumber_tables_test.pdf") as pdf :
    table = pdf.pages[0].extract_table()
    text = pdf.pages[0].extract_text()
    print(text.split('\n')[0])
    for row in table :
        print(row)

STUDENT MARKS TABLE
['Name', 'Subject', 'Marks']
['Aadi', 'Maths', '88']
['Bharat', 'Physics', '76']
['Charan', 'Chemistry', '92']
['Divya', 'English', '81']
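
Rows returned as plain lists are awkward to query by column name. A small sketch that turns a table of this shape into a list of dictionaries, assuming the first row is the header. The helper `rows_to_records` is illustrative and not part of pdfplumber, and the data is copied from the output above:

```python
def rows_to_records(table):
    """Convert a list of rows (first row = header) into a list of dicts."""
    if not table:  # covers both None and an empty list
        return []
    header, *body = table
    return [dict(zip(header, row)) for row in body]


table = [
    ["Name", "Subject", "Marks"],
    ["Aadi", "Maths", "88"],
    ["Bharat", "Physics", "76"],
    ["Charan", "Chemistry", "92"],
    ["Divya", "English", "81"],
]
records = rows_to_records(table)

# Cell values are strings, so convert before comparing numerically.
top = max(records, key=lambda r: int(r["Marks"]))
print(top["Name"], top["Marks"])  # Charan 92
```

The empty-input guard matters because `extract_table()` can return `None` when no table is found on a page.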


7. Extract metadata (Author, Title, Creation Date) from your PDF.

In [9]:
from PyPDF2 import PdfReader
pdf = PdfReader("files/simple.pdf")
metadata = pdf.metadata  # None when the PDF has no document information dictionary
for key, value in (metadata or {}).items() :
    print(f"{key} : {value}")

/Author : (anonymous)
/CreationDate : D:20251212071100+00'00'
/Creator : (unspecified)
/Keywords : 
/ModDate : D:20251212071100+00'00'
/Producer : ReportLab PDF Library - www.reportlab.com
/Subject : (unspecified)
/Title : (anonymous)
/Trapped : /False
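
The `/CreationDate` and `/ModDate` values above are raw PDF date strings of the form `D:YYYYMMDDHHmmSS` followed by a timezone offset. A minimal sketch of converting one into a `datetime`, assuming the full form with seconds and an optional `Z` or `+HH'mm'` style offset. The function name `parse_pdf_date` is hypothetical, and shorter or unusual variants simply return `None`:

```python
import re
from datetime import datetime, timedelta, timezone

# Full-length PDF date: D:YYYYMMDDHHmmSS, then optional "Z" or +HH'mm' / -HH'mm'.
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})'?)?$"
)


def parse_pdf_date(raw):
    """Return a datetime for a full-length PDF date string, or None if it does not match."""
    match = PDF_DATE.match(raw.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    zulu, sign, off_h, off_m = match.groups()[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        delta = timedelta(hours=int(off_h), minutes=int(off_m))
        tz = timezone(delta if sign == "+" else -delta)
    else:
        tz = None  # no offset given: leave the datetime naive
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


print(parse_pdf_date("D:20251212071100+00'00'").isoformat())  # 2025-12-12T07:11:00+00:00
```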


8. Save extracted text from your PDF into a .txt file.

In [11]:
from PyPDF2 import PdfReader
pdf = PdfReader("files/simple.pdf")
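content = ""
for page in pdf.pages :
    # extract_text() can come back empty for image-only pages
    content += (page.extract_text() or "") + '\n'
# Explicit encoding avoids platform-dependent defaults for non-ASCII text
with open("files/new.txt", "w", encoding="utf-8") as file :
    file.write(content)
print("Content written successfully.")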

Content written successfully.


9. Extract only email IDs from your PDF text using regex.

In [17]:
from PyPDF2 import PdfReader
import re
pdf = PdfReader("files/test_pdfplumber_sample.pdf")
pattern = r'[a-zA-Z0-9._+%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = set()
for page in pdf.pages :
    content = page.extract_text()
    if content :
        emails.update(re.findall(pattern, content))
print("Extracted emails :")
for email in emails :
    print(email)

Extracted emails :
hello.world123@gmail.com
admin@testdomain.org
support@example.com
student_mail@college.edu
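
Because the emails are collected in a `set`, the order in which they print can differ between runs. A small sketch that applies the same pattern to a plain string and returns a sorted, de-duplicated list. Lowercasing the addresses is an assumption made here for de-duplication (the local part of an address is technically case-sensitive), and `find_emails` and the sample text are illustrative:

```python
import re

EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._+%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def find_emails(text):
    """Return unique email addresses found in `text`, lowercased and sorted."""
    return sorted({match.lower() for match in EMAIL_PATTERN.findall(text)})


sample = (
    "Contact support@example.com or Support@Example.com.\n"
    "Admin: admin@testdomain.org"
)
print(find_emails(sample))  # ['admin@testdomain.org', 'support@example.com']
```

The trailing period after the second address is not captured, because the pattern requires the match to end in at least two letters.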


10. Create a combined extractor:
- Try PyPDF2
- If empty → try pdfplumber
- If still empty → print 'Scanned PDF detected'

In [20]:
from PyPDF2 import PdfReader
import pdfplumber
path = "files/scanned_test_pdf.pdf"
empty = True
# First attempt: PyPDF2
for page in PdfReader(path).pages :
    content = page.extract_text()
    if content and content.strip() :
        empty = False
        print(content)

# Second attempt: pdfplumber, only if PyPDF2 found no text
if empty :
    with pdfplumber.open(path) as pdf :
        for page in pdf.pages :
            content = page.extract_text()
            if content and content.strip() :
                empty = False
                print(content)

# Neither library found text, so the pages are most likely images
if empty :
    print("Scanned PDF detected")

Scanned PDF detected
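
The same fallback chain can be wrapped in a reusable function. The following is one possible sketch, not the only way to structure it: `first_nonempty_text`, `pypdf2_text`, and `pdfplumber_text` are hypothetical helper names, and the extractors are passed in as callables so that further fallbacks (such as OCR) could be appended later without changing the core logic.

```python
def first_nonempty_text(path, extractors):
    """Try each (name, extractor) pair in order.

    Returns (name, text) for the first extractor that yields non-whitespace
    text, or (None, "") if every extractor comes back empty.
    """
    for name, extract in extractors:
        text = extract(path) or ""
        if text.strip():
            return name, text
    return None, ""


def pypdf2_text(path):
    # Imported lazily so first_nonempty_text itself has no third-party dependency.
    from PyPDF2 import PdfReader
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


def pdfplumber_text(path):
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


# Example usage (requires the file and both libraries to be available):
# source, text = first_nonempty_text(
#     "files/scanned_test_pdf.pdf",
#     [("PyPDF2", pypdf2_text), ("pdfplumber", pdfplumber_text)],
# )
# print(text if source else "Scanned PDF detected")
```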
