# Enhanced PDF Parser with Table Extraction

This notebook demonstrates how to use the enhanced PDF Parser to:
1. Extract text from PDF documents
2. Extract tables from PDF documents
3. Export extracted tables to CSV files
4. Process and analyze the extracted data

The enhanced PDF Parser extracts tables alongside text, so tabular data in a PDF can be converted into structured formats (CSV files and pandas DataFrames) for analysis.

## 1. Import Required Libraries

First, we need to import the necessary libraries:
- `src.pdf_parser` - Our enhanced PDF parser
- `pandas` - For data manipulation and analysis
- `matplotlib` and `seaborn` - For data visualization
- `os` and other standard libraries for file handling

In [1]:
import os
import sys
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add the project root to the path if needed
if '/workspaces/pdf_parser' not in sys.path:
    sys.path.append('/workspaces/pdf_parser')

# Import the PDF parser
from src.pdf_parser import PDFParser

# Set up matplotlib
plt.style.use('ggplot')
%matplotlib inline

## 2. Initialize the PDF Parser

Now let's create an instance of our enhanced PDF parser with table extraction capabilities. We'll configure it to:
1. Extract tables from PDFs
2. Use a suitable table extraction method ('lattice' for tables with visible lines/borders or 'stream' for those without)
3. Set up chunk size for text processing

In [2]:
# Create a PDF parser instance with table extraction enabled
pdf_parser = PDFParser(
    chunk_size=1000,        # Size of text chunks for processing
    chunk_overlap=200,      # Overlap between chunks to maintain context
    extract_tables=True,    # Enable table extraction
    table_flavour='lattice' # Use 'lattice' for tables with visible borders, 'stream' for tables without
)

print("PDF Parser initialized with table extraction enabled!")
print(f"- Chunk size: {pdf_parser.chunk_size}")
print(f"- Chunk overlap: {pdf_parser.chunk_overlap}")
print(f"- Table extraction method: {pdf_parser.table_extractor.flavour if pdf_parser.table_extractor else 'None'}")

PDF Parser initialized with table extraction enabled!
- Chunk size: 1000
- Chunk overlap: 200
- Table extraction method: lattice
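
To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal character-based sliding-window sketch. It shows the general technique only and is not the parser's implementation: `chunk_text` is a hypothetical helper, and `PDFParser` may well measure size in tokens or split on sentence boundaries instead.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content that
    straddles a boundary appears in both neighbouring chunks.
    (Hypothetical helper; not part of src.pdf_parser.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context across neighbouring chunks at the cost of producing more chunks for the same text.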


## 3. Process a PDF Document

Now let's process a sample PDF document to extract both text and tabular data. We'll use one of the available PDFs in the workspace.

In [3]:
# Find a PDF with tabular data
pdf_files = [os.path.join("/workspaces/pdf_parser", f) for f in ["Snack_planogram_12_05_2025.pdf", "rei-8727.pdf"]]
pdf_path = ""
for file in pdf_files:
    if os.path.exists(file):
        pdf_path = file
        break

if not pdf_path:
    print("Error: No suitable PDF found. Please upload a PDF file.")
else:
    # Set up output directory for CSV exports
    output_dir = "/workspaces/pdf_parser/notebook_exports"
    os.makedirs(output_dir, exist_ok=True)
    
    # Process the PDF with table extraction and export to CSV
    print(f"Processing PDF: {os.path.basename(pdf_path)}")
    result = pdf_parser.parse_pdf(
        pdf_path=pdf_path,
        output_dir=output_dir  # This will export tables to CSV in the specified directory
    )
    
    # Display basic information about the processed PDF
    print("\nPDF Processing Complete!")
    print(f"{'=' * 50}")
    print(f"Title: {result['metadata'].get('Title', 'Not available')}")
    print(f"Author: {result['metadata'].get('Author', 'Not available')}")
    print(f"Pages: {result['metadata'].get('num_pages', 'Unknown')}")
    print(f"Text length: {len(result.get('text', ''))}")
    print(f"Number of chunks: {result.get('num_chunks', 0)}")
    
    # Check if tables were extracted
    if 'tables' in result:
        print(f"\nTables extracted: {len(result['tables'])}")
        
        # Display information about CSV files if they were created
        if 'table_csv_paths' in result:
            print("\nTables exported to CSV files:")
            for path in result['table_csv_paths']:
                print(f"- {os.path.basename(path)}")
    else:
        print("\nNo tables were detected in the PDF.")

Processing PDF: Snack_planogram_12_05_2025.pdf


Failed to import jpype dependencies. Fallback to subprocess.
No module named 'jpype'


Camelot stream extraction failed: line_scale cannot be used with flavor='stream'
Camelot lattice extraction failed: cannot access local variable 'stream_tables' where it is not associated with a value
Skipping mostly empty table: 125/175 empty cells
Skipping mostly empty table: 143/195 empty cells

PDF Processing Complete!
Title: T1AP_Chaturbhuj_Snacking Plano.pdf
Author: Arun1 Grover
Pages: 7
Text length: 8527
Number of chunks: 2

Tables extracted: 5

Tables exported to CSV files:
- Snack_planogram_12_05_2025_table_1.csv
- Snack_planogram_12_05_2025_table_2.csv
- Snack_planogram_12_05_2025_table_3.csv
- Snack_planogram_12_05_2025_table_4.csv
- Snack_planogram_12_05_2025_table_5.csv
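
The two Camelot messages in the output above hint at a likely cause. `line_scale` is a lattice-only option in Camelot, so it is rejected when passed with `flavor='stream'`, and the `stream_tables` message looks like a variable that is only assigned when the stream call succeeds. The tables in this run are later reported with the extraction method `tabula`, which is consistent with a fallback after both Camelot attempts failed. Below is a minimal sketch of one way to avoid both problems, assuming the parser forwards keyword options to `camelot.read_pdf`. The names `options_for_flavor` and `extract_with_fallback` are hypothetical and not taken from `src.pdf_parser`, and the option sets are only a subset of Camelot's flavor-specific options.

```python
# Subset of Camelot's flavor-specific options (not exhaustive).
LATTICE_ONLY = {"line_scale", "process_background", "copy_text", "shift_text"}
STREAM_ONLY = {"edge_tol", "row_tol", "column_tol", "columns"}


def options_for_flavor(flavor, options):
    """Return only the options that are valid for the given flavor."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"Unknown flavor: {flavor!r}")
    blocked = STREAM_ONLY if flavor == "lattice" else LATTICE_ONLY
    return {key: value for key, value in options.items() if key not in blocked}


def extract_with_fallback(read_pdf, pdf_path, flavors=("lattice", "stream"), **options):
    """Try each flavor in turn and return the first non-empty result.

    `read_pdf` is passed in by the caller (for example camelot.read_pdf),
    which keeps this sketch runnable without Camelot installed.
    """
    tables = []  # defined up front, so it exists even if every attempt fails
    for flavor in flavors:
        try:
            tables = read_pdf(pdf_path, flavor=flavor,
                              **options_for_flavor(flavor, options))
        except Exception as exc:  # broad on purpose: any failure triggers the next flavor
            print(f"{flavor} extraction failed: {exc}")
            continue
        if len(tables) > 0:
            break
    return tables
```

With Camelot installed, this could be called as `extract_with_fallback(camelot.read_pdf, pdf_path, pages='all', line_scale=40)`; the `line_scale` option is then silently dropped for the stream attempt.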


## 4. Analyze Extracted Tables

Now let's analyze the extracted tables and show how to work with them in pandas DataFrames.

In [4]:
# Load and analyze the extracted tables
if 'tables' in result and result['tables']:
    print(f"Analyzing {len(result['tables'])} extracted tables...\n")
    
    for i, table in enumerate(result['tables']):
        print(f"Table {i+1}:")
        print(f"- Shape: {table.get('shape', (0,0))}")
        print(f"- Extraction method: {table.get('extraction_method', 'Unknown')}")
        
        # Get the table data
        rows = table.get('rows', [])
        headers = table.get('headers', [])
        
        if rows:
            # Create a DataFrame for analysis
            df = pd.DataFrame(rows)
            
            # If we have headers, use them as column names (if they match)
            if headers and len(headers) == len(df.columns):
                df.columns = headers
                
            # Print a preview
            print(f"\nPreview of Table {i+1}:")
            display(df.head(5))
            
            # Basic statistics
            num_rows, num_cols = df.shape
            print(f"- Rows: {num_rows}")
            print(f"- Columns: {num_cols}")
            
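            # Find potential numeric columns for data analysis
            numeric_cols = []
            for col in df.columns:
                try:
                    # Conversion raises if any value in the column is not numeric
                    pd.to_numeric(df[col])
                    numeric_cols.append(col)
                except (ValueError, TypeError):
                    # Leave non-numeric columns out of the list
                    pass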
            
            print(f"- Numeric columns: {len(numeric_cols)}")
            print("-" * 40)
            
            # Save this table to CSV (overwrites an existing file with the same name)
            csv_path = os.path.join(output_dir, f"table_{i+1}.csv")
            df.to_csv(csv_path, index=False)
            print(f"Saved to: {csv_path}")
            print("\n")
else:
    print("No tables available for analysis.")

Analyzing 5 extracted tables...

Table 1:
- Shape: (36, 6)
- Extraction method: tabula

Preview of Table 1:


| | Unnamed: 0 | Unnamed: 1 | VALUE FORMAT - PLANOGRAM SCHEMATIC LAYOUT | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|---|
| 0 | LAYOUT / [ EFFECTIVE DATE ] | | SEGMENT | | | CLASS |
| 1 | T1AP_W1600-711_SNACKING / [ 05/12/2025 ] | | SNACKING | | | SNACKING |
| 2 | BAY | | FIXTURE TYPE | | | PLANOGRAM SEGMENT |
| 3 | W16001, W16002, W16003 | | | | | Total : 3 Segments |
| 4 | SNACKING | | SNACKING | | | SNACKING |


- Rows: 36
- Columns: 6
- Numeric columns: 1
----------------------------------------
Saved to: /workspaces/pdf_parser/notebook_exports/table_1.csv


Table 2:
- Shape: (36, 5)
- Extraction method: tabula

Preview of Table 2:


| | Unnamed: 0 | Unnamed: 1 | VALUE FORMAT - PLANOGRAM SCHEMATIC LAYOUT | Unnamed: 2 | Unnamed: 3 |
|---|---|---|---|---|---|
| 0 | LAYOUT / [ EFFECTIVE DATE ] | | SEGMENT | | CLASS |
| 1 | T1AP_W1600-711_SNACKING / [ 05/12/2025 ] | | SNACKING | | SNACKING |
| 2 | BAY | | FIXTURE TYPE | | PLANOGRAM SEGMENT |
| 3 | W16001 | | | | Segment: 1 of 3 |
| 4 | LOC Article | | Vertical Depth | Shelf | Remarks/Tagging |


- Rows: 36
- Columns: 5
- Numeric columns: 0
----------------------------------------
Saved to: /workspaces/pdf_parser/notebook_exports/table_2.csv


Table 3:
- Shape: (35, 4)
- Extraction method: tabula

Preview of Table 3:


| | Unnamed: 0 | Unnamed: 1 | VALUE FORMAT - PLANOGRAM SCHEMATIC LAYOUT | Unnamed: 2 |
|---|---|---|---|---|
| 0 | LAYOUT / [ EFFECTIVE DATE ] | | SEGMENT | CLASS |
| 1 | T1AP_W1600-711_SNACKING / | [ 05/12/2025 ] | SNACKING | SNACKING |
| 2 | BAY | | FIXTURE TYPE | PLANOGRAM SEGMENT |
| 3 | W16002 | | | Segment: 2 of 3 |
| 4 | | | SNACKING | |


- Rows: 35
- Columns: 4
- Numeric columns: 0
----------------------------------------
Saved to: /workspaces/pdf_parser/notebook_exports/table_3.csv


Table 4:
- Shape: (29, 5)
- Extraction method: tabula

Preview of Table 4:


| | Unnamed: 0 | Unnamed: 1 | VALUE FORMAT - PLANOGRAM SCHEMATIC LAYOUT | Unnamed: 2 | Unnamed: 3 |
|---|---|---|---|---|---|
| 0 | LAYOUT / [ EFFECTIVE DATE ] | | SEGMENT | | CLASS |
| 1 | T1AP_W1600-711_SNACKING / [ 05/12/2025 ] | | SNACKING | | SNACKING |
| 2 | BAY | | FIXTURE TYPE | | PLANOGRAM SEGMENT |
| 3 | W16002 | | | | Segment: 2 of 3 |
| 4 | LOC Article | | Vertical Depth | Shelf | Remarks/Tagging |


- Rows: 29
- Columns: 5
- Numeric columns: 0
----------------------------------------
Saved to: /workspaces/pdf_parser/notebook_exports/table_4.csv


Table 5:
- Shape: (51, 5)
- Extraction method: tabula

Preview of Table 5:


| | Unnamed: 0 | Unnamed: 1 | VALUE FORMAT - PLANOGRAM SCHEMATIC LAYOUT | Unnamed: 2 | Unnamed: 3 |
|---|---|---|---|---|---|
| 0 | LAYOUT / [ EFFECTIVE DATE ] | | SEGMENT | | CLASS |
| 1 | T1AP_W1600-711_SNACKING / [ 05/12/2025 ] | | SNACKING | | SNACKING |
| 2 | BAY | | FIXTURE TYPE | | PLANOGRAM SEGMENT |
| 3 | W16003 | | | | Segment: 3 of 3 |
| 4 | LOC Article | | Vertical Depth | Shelf | Remarks/Tagging |


- Rows: 51
- Columns: 5
- Numeric columns: 0
----------------------------------------
Saved to: /workspaces/pdf_parser/notebook_exports/table_5.csv
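
The numeric-column check in the analysis cell accepts a column only when every value converts, which is strict for tables extracted from PDFs, where stray labels and blank cells are common. A more tolerant variant is sketched below, under the assumption that cells arrive as strings and that blank cells mean "missing". `find_numeric_columns` and `min_fraction` are hypothetical names introduced for illustration.

```python
import pandas as pd


def find_numeric_columns(df, min_fraction=0.8):
    """Return columns where at least min_fraction of non-blank cells are numeric.

    Columns that contain only blank or missing cells are skipped, so an
    empty column is never reported as numeric.
    """
    numeric_cols = []
    for col in df.columns:
        # Drop missing values, then treat whitespace-only cells as blank
        values = df[col].dropna().astype(str).str.strip()
        values = values[values != ""]
        if values.empty:
            continue
        # Unparseable cells become NaN instead of raising an exception
        converted = pd.to_numeric(values, errors="coerce")
        if converted.notna().mean() >= min_fraction:
            numeric_cols.append(col)
    return numeric_cols
```

Inside the loop above, this could be used as `numeric_cols = find_numeric_columns(df)`; lowering `min_fraction` admits columns with more non-numeric noise.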


