# PDF Document Extraction Tool

This notebook extracts text and tables from PDF documents and formats them into a structured JSON file.

**User Story:** As a user, I want to provide the path to a PDF so that the program displays the text it contains.

**Features:**
- Extract regular text from PDFs
- Extract tabular data from PDFs
- Format extracted data into JSON
- Save the JSON output to a file

## Install Required Libraries

First, let's install the necessary libraries for PDF processing:

In [1]:
# Install required packages
!pip install PyPDF2 tabula-py pandas camelot-py opencv-python-headless

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting tabula-py
  Downloading tabula_py-2.10.0-py3-none-any.whl.metadata (7.6 kB)
Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting camelot-py
  Downloading camelot_py-1.0.0-py3-none-any.whl.metadata (9.4 kB)
Collecting opencv-python-headless
  Downloading opencv_python_headless-4.11.0.86-cp37-abi3-win_amd64.whl.metadata (20 kB)
Collecting numpy>1.24.4 (from tabula-py)
  Downloading numpy-2.2.5-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting distro (from tabula-py)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting click>=8.0.1 (from camelot-py)
  Downloading click-8.2.0-py3-none-any.whl.metadata (2.5 kB)
Collecting chard

## Import Required Libraries

Import the necessary libraries for PDF text and table extraction:

In [2]:
import os
import json
import PyPDF2
import tabula
import pandas as pd
import camelot
from datetime import datetime

## Load PDF File

Prompt the user to provide the path to the PDF file and load it for processing:

In [4]:
# Function to check if a file exists and is a PDF
def validate_pdf_path(pdf_path):
    if not os.path.exists(pdf_path):
        return False, "File does not exist."
    
    if not pdf_path.lower().endswith('.pdf'):
        return False, "File is not a PDF."
    
    return True, "PDF file is valid."

# Get PDF path from user
pdf_path = input("Enter the path to your PDF file: ")

# Validate the PDF path
is_valid, message = validate_pdf_path(pdf_path)

if is_valid:
    print(f"PDF file loaded successfully: {pdf_path}")
else:
    print(f"Error: {message}")
    pdf_path = None

PDF file loaded successfully: chapter4.pdf
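
The check above relies on the file extension alone, so a renamed text file would pass it. As a minimal sketch of a stricter test, the hypothetical helper `has_pdf_signature` below (not part of the notebook) reads the first bytes of the file and looks for the `%PDF-` marker that PDF files normally start with. It assumes the marker sits at offset 0; some real-world files carry a few leading bytes before it and would be rejected by this simple version.

```python
import os
import tempfile

def has_pdf_signature(path):
    """Return True if the file starts with the PDF magic bytes b'%PDF-'."""
    try:
        with open(path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        # Missing or unreadable files are treated as "not a PDF"
        return False

# Demonstrate with two throwaway files that both use the .pdf extension
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real.pdf")
    fake = os.path.join(tmp, "fake.pdf")
    with open(real, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(fake, "wb") as f:
        f.write(b"plain text")
    print(has_pdf_signature(real))  # True
    print(has_pdf_signature(fake))  # False
```

A helper like this could be called inside `validate_pdf_path` after the extension check.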


## Extract Text from PDF

Use PyPDF2 to extract text content from the PDF file:

In [5]:
def extract_text_from_pdf(pdf_path):
    if not pdf_path:
        return None
    
    text_content = []
    
    try:
        # Open the PDF file
        with open(pdf_path, 'rb') as file:
            # Create a PDF reader object
            pdf_reader = PyPDF2.PdfReader(file)
            
            # Get the number of pages in the PDF
            num_pages = len(pdf_reader.pages)
            
            print(f"PDF has {num_pages} pages.")
            
            # Extract text from each page
            for page_num in range(num_pages):
                page = pdf_reader.pages[page_num]
                page_text = page.extract_text()
                text_content.append(page_text)
                
        return text_content
    
    except Exception as e:
        print(f"Error extracting text: {str(e)}")
        return None

# Extract text from the PDF
if pdf_path:
    extracted_text = extract_text_from_pdf(pdf_path)
    
    # Display a sample of the extracted text
    if extracted_text:
        print("\nSample of extracted text (first page):")
        print("-" * 50)
        first_page = extracted_text[0]
        print(first_page[:500] + "..." if len(first_page) > 500 else first_page)
        print("-" * 50)
    else:
        print("No text could be extracted from the PDF.")

PDF has 47 pages.

Sample of extracted text (first page):
--------------------------------------------------
Preprocessing data
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman
Core Curriculum Manager, DataCamp

--------------------------------------------------
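
Pages without a text layer, such as scanned images, typically come back from text extraction as empty strings, so a non-empty result list does not guarantee that every page produced text. The sketch below uses a hypothetical helper, `find_empty_pages`, to list those pages; it works on any list shaped like the one `extract_text_from_pdf` returns and treats `None` the same as an empty string.

```python
def find_empty_pages(text_content):
    """Return 1-based page numbers whose extracted text is None or only whitespace."""
    return [
        page_number
        for page_number, text in enumerate(text_content, start=1)
        if not (text or "").strip()
    ]

# Toy input standing in for per-page extraction results
sample_pages = ["Preprocessing data", "", None, "  \n", "Dummy variables"]
print(find_empty_pages(sample_pages))  # [2, 3, 4]
```

Pages reported here are candidates for OCR or manual review.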


## Extract Table Information

Use both tabula-py and camelot to extract table data from the PDF. We'll try both libraries since table extraction can vary in accuracy depending on the PDF structure:

In [6]:
def extract_tables_with_tabula(pdf_path):
    try:
        # Extract tables using tabula
        tables = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)
        print(f"Extracted {len(tables)} tables with tabula-py.")
        return tables
    except Exception as e:
        print(f"Error extracting tables with tabula: {str(e)}")
        return []

def extract_tables_with_camelot(pdf_path):
    try:
        # Extract tables using camelot
        tables = camelot.read_pdf(pdf_path, pages='all')
        print(f"Extracted {len(tables)} tables with camelot.")
        return tables
    except Exception as e:
        print(f"Error extracting tables with camelot: {str(e)}")
        return []

# Extract tables from the PDF using both methods
tabula_tables = []
camelot_tables = []

if pdf_path:
    print("\nExtracting tables from PDF...")
    tabula_tables = extract_tables_with_tabula(pdf_path)
    
    # extract_tables_with_camelot handles its own errors and returns [] on failure
    camelot_tables = extract_tables_with_camelot(pdf_path)
    
    # Display a sample of the first table if available
    if tabula_tables:
        print("\nSample of first extracted table (tabula):")
        print("-" * 50)
        display(tabula_tables[0].head())
        print("-" * 50)
    elif camelot_tables:
        print("\nSample of first extracted table (camelot):")
        print("-" * 50)
        display(camelot_tables[0].df.head())
        print("-" * 50)
    else:
        print("No tables were found in the PDF.")

Failed to import jpype dependencies. Fallback to subprocess.
No module named 'jpype'



Extracting tables from PDF...
Extracted 3 tables with tabula-py.
Error extracting tables with camelot: Image conversion failed with image conversion backend 'ghostscript'
 error: Ghostscript is not installed. You can install it using the instructions here: https://camelot-py.readthedocs.io/en/latest/user/install-deps.html

Sample of first extracted table (tabula):
--------------------------------------------------


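| Unnamed: 0.1 | Unnamed: 0 | popularity | acousticness | danceability | ... | tempo | valence | genre |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 41.0 | 0.644 | 0.823 | ... | 102.619 | 0.649 | Jazz |
| 1 | 1 | 62.0 | 0.0855 | 0.686 | ... | 173.915 | 0.636 | Rap |
| 2 | 2 | 42.0 | 0.239 | 0.669 | ... | 145.061 | 0.494 | Electronic |
| 3 | 3 | 64.0 | 0.0125 | 0.522 | ... | 120.406497 | 0.595 | Rock |
| 4 | 4 | 60.0 | 0.121 | 0.78 | ... | 96.056 | 0.312 | Rap |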


--------------------------------------------------
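
The run above shows both table libraries depending on software outside Python: tabula-py needs a Java runtime, and the camelot error reports that Ghostscript is missing. As a rough sketch, the hypothetical helper `check_table_backends` below looks for the usual executables on `PATH` before extraction starts. The executable names are assumptions (`gs` on Linux and macOS, `gswin64c` or `gswin32c` on Windows), and a library may locate its backend by other means, so treat a negative result as a hint rather than proof.

```python
import shutil

def check_table_backends():
    """Report whether the external programs used for table extraction are on PATH."""
    java_found = shutil.which("java") is not None
    # Ghostscript ships under different executable names per platform
    ghostscript_found = any(
        shutil.which(name) is not None for name in ("gs", "gswin64c", "gswin32c")
    )
    return {"java": java_found, "ghostscript": ghostscript_found}

status = check_table_backends()
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'not found on PATH'}")
```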


## Process Extracted Data

Process the extracted text to identify potential headers and content:

In [7]:
def process_text_for_headers(text_content):
    headers = {}
    
    if not text_content or len(text_content) == 0:
        return headers
    
    # Combine all text content
    all_text = "\n".join(text_content)
    
    # Split text into lines
    lines = all_text.split('\n')
    
    # Simple heuristic: treat a line with fewer than 5 words as a potential header if it
    # ends with a colon, is all uppercase, or starts with an uppercase letter.
    # This is a basic approach and may need refinement based on the actual PDF structure
    current_header = None
    current_content = []
    
    for line in lines:
        line = line.strip()
        if not line:
            continue
            
        words = line.split()
        
        # Check if this might be a header
        if len(words) < 5 and (line.endswith(':') or line.isupper() or line[0].isupper()):
            # Save previous header and content
            if current_header and current_content:
                headers[current_header] = ' '.join(current_content)
            
            # Set new header
            current_header = line.rstrip(':').strip()
            current_content = []
        else:
            # Add to current content
            if current_header:
                current_content.append(line)
            elif not headers.get('Title'):
                # Use the first significant line as the title if no title exists yet
                if len(line) > 5:
                    headers['Title'] = line
    
    # Save the last header and content
    if current_header and current_content:
        headers[current_header] = ' '.join(current_content)
    
    # Add some metadata
    # Note: source_file is read from the notebook-level pdf_path variable, not a parameter
    headers['extraction_date'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    headers['source_file'] = os.path.basename(pdf_path) if pdf_path else "Unknown"
    
    return headers

# Process the extracted text to identify headers
headers = {}
if extracted_text:
    headers = process_text_for_headers(extracted_text)
    
    print("\nExtracted Headers:")
    print("-" * 50)
    for key, value in headers.items():
        print(f"{key}: {value[:50]}..." if len(value) > 50 else f"{key}: {value}")
    print("-" * 50)


Extracted Headers:
--------------------------------------------------
SUPERVISED LEARNING WITH SCIKIT-LEARN: Where to go from here? Machine Learning with Tree-...
With real-world data: This is rarely the case We will often need to prep...
Dealing with categorical features: scikit-learn will not accept categorical features ...
Dummy variables: SUPERVISED LEARNING WITH SCIKIT-LEARNDummy variabl...
Music dataset: print(music_df.isna().sum().sort_values()) genre  ...
Encoding dummy variables: music_dummies = pd.get_dummies(music_df, drop_firs...
X = music_dummies.drop("popularity", axis=1).values: y = music_dummies["popularity"].values X_train, X_...
Handling missing: data
Missing data: No value for a feature in a particular row
This can occur because: There may have been no observation The data might ...
Dropping missing data: music_df = music_df.dropna(subset=["genre", "popul...
Imputing values: Imputation - use subject-matter expertise to repla...
Imputation with scikit-learn: imp_num 
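
To see why the output above contains entries such as `Handling missing: data`, it helps to isolate the header test. The sketch below restates the rule from `process_text_for_headers` as a standalone predicate; the name `looks_like_header` and the `max_words` parameter are illustrative and do not appear in the notebook.

```python
def looks_like_header(line, max_words=5):
    """Apply the notebook's rule: a short line that ends with a colon,
    is all uppercase, or starts with an uppercase letter."""
    line = line.strip()
    if not line:
        return False
    if len(line.split()) >= max_words:
        return False
    return line.endswith(":") or line.isupper() or line[0].isupper()

samples = [
    "Dummy variables",
    "Missing data:",
    "Handling missing",
    "data",
    "We will often need to preprocess our data first",
]
for sample in samples:
    print(f"{sample!r}: {looks_like_header(sample)}")
```

Any short line that begins with a capital letter qualifies, so a slide title wrapped across two lines ("Handling missing" followed by "data") is split into a header and its content. Tightening the rule, for example by requiring the colon, would reduce such false positives at the cost of missing untitled sections.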

In [8]:
def process_tables_to_list_items(tabula_tables, camelot_tables):
    list_items = []
    
    # Process tabula tables
    for table in tabula_tables:
        table_dict = table.to_dict(orient='records')
        for item in table_dict:
            # Keep the row only if at least one value is non-empty
            # (a value of 0 counts as data, so test emptiness rather than truthiness)
            if any(v is not None and str(v).strip() for v in item.values()):
                list_items.append(item)
    
    # Process camelot tables if tabula didn't find any
    if not list_items and camelot_tables:
        for table in camelot_tables:
            df = table.df
            # Treat the first row as the header row
            if not df.empty:
                headers = df.iloc[0].tolist()
                for _, row in df.iloc[1:].iterrows():
                    # Pair each header with the cell in the same position
                    item = dict(zip(headers, row.tolist()))
                    # Keep the row only if at least one value is non-empty
                    if any(v is not None and str(v).strip() for v in item.values()):
                        list_items.append(item)
    
    return list_items

# Process tables into list items
list_items = []
if tabula_tables or camelot_tables:
    list_items = process_tables_to_list_items(tabula_tables, camelot_tables)
    
    print(f"\nProcessed {len(list_items)} list items from tables.")
    if list_items:
        print("Sample list items:")
        print("-" * 50)
        for i, item in enumerate(list_items[:3]):
            print(f"Item {i+1}:")
            print(item)
        print("-" * 50)


Processed 18 list items from tables.
Sample list items:
--------------------------------------------------
Item 1:
{'Unnamed: 0': 0, 'popularity': 41.0, 'acousticness': 0.644, 'danceability': 0.823, '...': '...', 'tempo': 102.619, 'valence': 0.649, 'genre': 'Jazz'}
Item 2:
{'Unnamed: 0': 1, 'popularity': 62.0, 'acousticness': 0.0855, 'danceability': 0.686, '...': '...', 'tempo': 173.915, 'valence': 0.636, 'genre': 'Rap'}
Item 3:
{'Unnamed: 0': 2, 'popularity': 42.0, 'acousticness': 0.239, 'danceability': 0.669, '...': '...', 'tempo': 145.061, 'valence': 0.494, 'genre': 'Electronic'}
--------------------------------------------------
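
Tables read into pandas usually mark missing cells as `NaN`, and those values pass straight through `to_dict(orient='records')`. Python's `json` module writes them as a bare `NaN` token by default, which strict JSON parsers reject. The sketch below shows one possible cleanup step using a hypothetical helper, `clean_record`, that maps `NaN` to `None` so it serializes as `null`; it is not part of the notebook's pipeline.

```python
import json
import math

def clean_record(record):
    """Replace float NaN values with None so the record serializes to valid JSON."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }

row = {"popularity": 41.0, "tempo": float("nan"), "genre": "Jazz"}
print(json.dumps(row))                # tempo is written as NaN
print(json.dumps(clean_record(row)))  # tempo is written as null
```

Applying `clean_record` to each item before appending it to `list_items` would keep the saved file readable by parsers outside Python.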


## Format Data into JSON

Combine the extracted headers and list items into a structured JSON format:

In [9]:
def format_data_to_json(headers, list_items):
    json_data = {}
    
    # Add headers to the JSON
    for key, value in headers.items():
        json_data[key] = value
    
    # Add list items
    json_data["List_items"] = list_items
    
    return json_data

# Format the data into JSON
json_data = format_data_to_json(headers, list_items)

# Display the JSON data
print("\nJSON Data:")
print("-" * 50)
# Serialize once and reuse the string for both the length check and the preview
json_text = json.dumps(json_data, indent=2, ensure_ascii=False)
print(json_text[:1000] + "..." if len(json_text) > 1000 else json_text)
print("-" * 50)


JSON Data:
--------------------------------------------------
{
  "SUPERVISED LEARNING WITH SCIKIT-LEARN": "Where to go from here? Machine Learning with Tree-Based Models in Python Preprocessing for Machine Learning in Python",
  "With real-world data": "This is rarely the case We will often need to preprocess our data first",
  "Dealing with categorical features": "scikit-learn will not accept categorical features by default Need to convert categorical features into numeric values Convert to binary features called dummy variables 0: Observation was NOT that category 1: Observation was that category SUPERVISED LEARNING WITH SCIKIT-LEARNDummy variables",
  "Dummy variables": "SUPERVISED LEARNING WITH SCIKIT-LEARNDummy variables",
  "Music dataset": "print(music_df.isna().sum().sort_values()) genre                 8 popularity           31 loudness             44 liveness             46 tempo                46 speechiness          59 duration_ms          91 instrumentalness     91 dance

## Save JSON to File

Save the formatted JSON data to a file:

In [10]:
def save_json_to_file(json_data, pdf_path):
    if not pdf_path:
        output_path = "extracted_data.json"
    else:
        # Use the PDF filename as the base for the JSON file
        base_name = os.path.splitext(os.path.basename(pdf_path))[0]
        output_path = f"{base_name}_extracted.json"
    
    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(json_data, f, indent=2, ensure_ascii=False)
        print(f"JSON data saved to {output_path}")
        return output_path
    except Exception as e:
        print(f"Error saving JSON data: {str(e)}")
        return None

# Save the JSON data to a file
if json_data:
    output_path = save_json_to_file(json_data, pdf_path)
    if output_path:
        print(f"\nExtraction complete! Data saved to {output_path}")

JSON data saved to chapter4_extracted.json

Extraction complete! Data saved to chapter4_extracted.json
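
A quick way to confirm the file was written correctly is to read it back and compare it with the data in memory. The sketch below does this round trip on a small toy payload in a temporary directory; `verify_json_file` and the sample data are illustrative names, not part of the notebook.

```python
import json
import os
import tempfile

def verify_json_file(path, expected):
    """Reload a JSON file and report whether it matches the expected data."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f) == expected

data = {"Title": "Preprocessing data", "List_items": [{"genre": "Jazz"}]}
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "sample_extracted.json")
    # Write with the same options the notebook uses when saving
    with open(out, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(verify_json_file(out, data))  # True
```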


## Conclusion

This notebook extracts text and table data from a PDF document and writes the result to a structured JSON file. The pipeline:

1. Extracts text using PyPDF2
2. Extracts tables using tabula-py and camelot, using camelot's results only when tabula-py yields no rows
3. Processes text to identify potential headers
4. Formats table data into list items
5. Combines all extracted data into a structured JSON format
6. Saves the JSON data to a file

You can modify the processing logic to better suit specific PDF structures if needed.