# EcoMetricx: Advanced PDF Processing & Visual Element Extraction

**A Comprehensive Demonstration of Multi-Modal Document Intelligence**

---

## 🎯 Project Overview

Welcome to **EcoMetricx**, a PDF processing system for extracting text, tables, charts, and images from energy documents and reports. This notebook demonstrates a complete pipeline that can:

- 📄 **Extract text** from PDFs using multiple intelligent methods
- 👁️ **Process visual content** from rendered page images, so extraction reflects what a reader sees on the page
- 🔍 **Identify and extract** tables, charts, and images automatically
- 🤖 **Apply OCR technology** to capture text from screenshots
- 📊 **Provide diagnostic insights** about extraction quality

### Why This Matters

Traditional PDF extraction often fails because:
- **Hidden text layers** may contain garbled or invisible data
- **Visual elements** like charts and tables are difficult to parse programmatically
- **Different PDF formats** require different extraction strategies

Our solution uses a **hybrid approach**: it combines programmatic extraction from the PDF's text layer (fast) with OCR on rendered page images (faithful to what is visible on the page).
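
One way to picture the hybrid approach is a per-page decision: keep the embedded text layer when it looks healthy, otherwise fall back to OCR. The sketch below is illustrative only. The names `looks_garbled` and `extract_page_text`, the 20-character minimum, and the 0.8 readable-character ratio are assumptions made for this example, not EcoMetricx's actual logic. The OCR step is passed in as a callable so the example runs without any PDF libraries.

```python
from typing import Callable, Tuple

# Punctuation we treat as "normal" in report text (an assumption for this sketch)
COMMON_PUNCTUATION = ".,;:!?%$()-/'\""

def looks_garbled(text: str, min_chars: int = 20, min_readable_ratio: float = 0.8) -> bool:
    """Heuristic: the text is too short, or too few characters are readable."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    readable = sum(
        ch.isalnum() or ch.isspace() or ch in COMMON_PUNCTUATION for ch in stripped
    )
    return readable / len(stripped) < min_readable_ratio

def extract_page_text(layer_text: str, run_ocr: Callable[[], str]) -> Tuple[str, str]:
    """Return (text, method). OCR runs only when the text layer looks unusable."""
    if not looks_garbled(layer_text):
        return layer_text, "text_layer"
    return run_ocr(), "ocr"
```

In a real pipeline, `layer_text` would come from a text-layer extractor and `run_ocr` would wrap an OCR call on the rendered page; deferring OCR behind a callable avoids paying its cost on pages that do not need it.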

---

## 🛠️ Environment Setup

Let's start by setting up our development environment. Think of this as preparing your toolbox before starting a complex project.

### Step 1: Check Your Python Environment

First, let's verify we're using the correct Python environment. This is crucial because different projects may require different versions of libraries.

In [1]:
import sys
import platform
print(f"🐍 Python Version: {sys.version}")
print(f"💻 Platform: {platform.platform()}")
print(f"📁 Python Executable: {sys.executable}")

# Report the active conda environment, if any
import os
conda_env = os.environ.get('CONDA_DEFAULT_ENV', 'Not in conda environment')
print(f"🌐 Conda Environment: {conda_env}")

🐍 Python Version: 3.11.13 | packaged by conda-forge | (main, Jun  4 2025, 14:48:23) [GCC 13.3.0]
💻 Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
📁 Python Executable: /root/anaconda3/envs/pdf-extractor/bin/python
🌐 Conda Environment: pdf-extractor
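
The cell above only prints the environment details. If you would rather fail fast, a check along the following lines could be used. This is a minimal sketch: `verify_environment` is a hypothetical helper, and the minimum Python version shown is an assumed value, not a stated requirement of EcoMetricx.

```python
import os
import sys

def verify_environment(expected_env: str, min_python: tuple = (3, 9)) -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        found = sys.version.split()[0]
        problems.append(f"Python {wanted}+ expected, found {found}")
    # CONDA_DEFAULT_ENV is set by conda when an environment is activated
    active_env = os.environ.get("CONDA_DEFAULT_ENV")
    if active_env != expected_env:
        problems.append(f"Conda environment '{expected_env}' expected, found {active_env!r}")
    return problems

for problem in verify_environment("pdf-extractor"):
    print(f"⚠️ {problem}")
```

Returning a list instead of raising lets the notebook show every problem at once before you decide whether to continue.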


### Step 2: Install Required Libraries

Now we'll install all the libraries our system needs. Each library serves a specific purpose:

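- **pdf2image**: Converts PDF pages to high-resolution images (requires Poppler to be installed on the system)
- **pytesseract**: Python wrapper for the Tesseract OCR engine, which reads text from images (the Tesseract binary must be installed separately)
- **opencv-python**: Computer vision library for image processing
- **scikit-image**: Advanced image analysis and processing
- **pdfplumber**: Programmatic PDF text extraction
- **matplotlib & plotly**: Data visualization libraries
- **Pillow**: Image loading and manipulation library (the maintained fork of PIL)
- **tabula-py**: Table extraction from PDFs (requires a Java runtime)
- **easyocr**: An alternative OCR engine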
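
Several of these packages are thin wrappers around command-line tools that `pip` does not install. The sketch below checks whether those tools are on the `PATH`. The helper name `find_missing_tools` and the `SYSTEM_TOOLS` table are made up for this example; `pdftoppm` is one of the Poppler utilities that pdf2image calls.

```python
import shutil

# Command-line tools that the Python wrappers shell out to
SYSTEM_TOOLS = {
    "pdftoppm": "Poppler (used by pdf2image)",
    "tesseract": "Tesseract OCR (used by pytesseract)",
    "java": "Java runtime (used by tabula-py)",
}

def find_missing_tools(tools=SYSTEM_TOOLS, which=shutil.which):
    """Return {tool: purpose} for every tool that is not found on PATH."""
    return {name: purpose for name, purpose in tools.items() if which(name) is None}

for name, purpose in find_missing_tools().items():
    print(f"⚠️ '{name}' not found on PATH: install {purpose}")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is what you want.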

In [None]:
# Version specifiers are quoted so the shell does not treat ">" as an output redirect

# Install core PDF processing libraries
!pip install pdf2image --quiet
!pip install "pdfplumber>=0.9.0" --quiet

# Install OCR capabilities
!pip install pytesseract --quiet
!pip install Pillow --quiet

# Install computer vision and image processing
!pip install "opencv-python>=4.8.0" --quiet
!pip install "scikit-image>=0.25.0" --quiet

# Install visualization libraries
!pip install "matplotlib>=3.10.0" --quiet
!pip install "plotly>=6.3.0" --quiet

# Install additional utilities
!pip install tabula-py --quiet
!pip install easyocr --quiet

print("✅ Install commands finished; check the output above for any errors")

### Step 3: Import Libraries and Check Installation

Let's import our libraries and verify everything is working correctly. This step helps us catch any installation issues early.

In [None]:
# Core Python libraries
import os
import sys
import logging
from pathlib import Path
from typing import Dict, List, Any, Optional, Tuple
import json
from datetime import datetime

# PDF and OCR processing
try:
    import pdf2image
    # pdf2image doesn't have __version__, so we'll check if it can be imported
    print(f"✅ pdf2image imported successfully")
except ImportError as e:
    print(f"❌ pdf2image import failed: {e}")

try:
    import pdfplumber
    # Try to get version, fallback to success message
    version = getattr(pdfplumber, '__version__', 'unknown version')
    print(f"✅ pdfplumber version: {version}")
except ImportError as e:
    print(f"❌ pdfplumber import failed: {e}")

try:
    import pytesseract
    print(f"✅ pytesseract imported successfully")
except ImportError as e:
    print(f"❌ pytesseract import failed: {e}")

# Image processing
try:
    import cv2
    print(f"✅ OpenCV version: {cv2.__version__}")
except ImportError as e:
    print(f"⚠️ OpenCV import failed (will use PIL fallback): {e}")

try:
    import PIL
    from PIL import Image, ImageEnhance
    # Pillow exposes its version as PIL.__version__
    print(f"✅ Pillow (PIL) version: {PIL.__version__}")
except ImportError as e:
    print(f"❌ Pillow import failed: {e}")

try:
    import skimage
    print(f"✅ scikit-image version: {skimage.__version__}")
except ImportError as e:
    print(f"❌ scikit-image import failed: {e}")

# Data processing and visualization
try:
    import numpy as np
    print(f"✅ NumPy version: {np.__version__}")
except ImportError as e:
    print(f"❌ NumPy import failed: {e}")

try:
    import pandas as pd
    print(f"✅ Pandas version: {pd.__version__}")
except ImportError as e:
    print(f"❌ Pandas import failed: {e}")

try:
    import matplotlib.pyplot as plt
    import matplotlib
    print(f"✅ Matplotlib version: {matplotlib.__version__}")
except ImportError as e:
    print(f"❌ Matplotlib import failed: {e}")

try:
    import plotly.graph_objects as go
    import plotly
    print(f"✅ Plotly version: {plotly.__version__}")
except ImportError as e:
    print(f"❌ Plotly import failed: {e}")
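
Not every package exposes a `__version__` attribute (pdf2image, as noted above, does not). A more uniform option is to ask the installed package metadata through the standard library's `importlib.metadata`. The sketch below is one way to do that; `installed_versions` and `REQUIRED` are hypothetical names. Note that it takes distribution names as used with `pip` (for example `opencv-python`), not import names (`cv2`).

```python
from importlib import metadata

def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

REQUIRED = ["pdf2image", "pdfplumber", "pytesseract", "opencv-python", "Pillow", "scikit-image"]

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```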

### Step 4: Configure System Settings

Now let's set up our system configuration. This includes creating directories for our outputs and configuring logging so we can track what our system is doing.

In [None]:
# Create project structure
PROJECT_ROOT = Path.cwd()
OUTPUT_DIR = PROJECT_ROOT / "output"
SAMPLE_DATA_DIR = PROJECT_ROOT / "sample_data"

# Create necessary directories
directories_to_create = [
    OUTPUT_DIR / "visual_extraction" / "screenshots",
    OUTPUT_DIR / "visual_extraction" / "text",
    OUTPUT_DIR / "enhanced_extraction",
    OUTPUT_DIR / "visual_element_extraction" / "tables" / "extracted",
    OUTPUT_DIR / "visual_element_extraction" / "charts" / "extracted", 
    OUTPUT_DIR / "visual_element_extraction" / "images" / "logos",
    OUTPUT_DIR / "visual_element_extraction" / "images" / "photos",
    OUTPUT_DIR / "visual_element_extraction" / "images" / "diagrams",
    OUTPUT_DIR / "visual_element_extraction" / "metadata",
    SAMPLE_DATA_DIR
]

for directory in directories_to_create:
    directory.mkdir(parents=True, exist_ok=True)
    
print(f"📁 Project root: {PROJECT_ROOT}")
print(f"📁 Output directory: {OUTPUT_DIR}")
print(f"📁 Sample data directory: {SAMPLE_DATA_DIR}")
print("✅ All directories created successfully!")

In [None]:
# Configure logging for our demonstration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
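    handlers=[
        logging.StreamHandler(sys.stdout),  # Display logs in notebook
        logging.FileHandler(OUTPUT_DIR / 'demo_log.txt', encoding='utf-8')  # Save logs to file (UTF-8 so emoji are written safely)
    ],
    force=True  # Replace any existing root handlers so re-running this cell takes effect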
)

logger = logging.getLogger('EcoMetricx_Demo')
logger.info("🚀 EcoMetricx Demo logging initialized")

# Display system information
logger.info(f"Python version: {sys.version.split()[0]}")
logger.info(f"Working directory: {PROJECT_ROOT}")
logger.info(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
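
In notebooks, cells are often re-run, and attaching handlers repeatedly leads to duplicated log lines. As an alternative to configuring the root logger, a named logger can be set up idempotently. The following is a minimal sketch; `get_demo_logger` is a hypothetical helper, not part of EcoMetricx.

```python
import logging
import sys
from pathlib import Path

def get_demo_logger(log_file: Path, name: str = "EcoMetricx_Demo") -> logging.Logger:
    """Return a logger with exactly one console and one file handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # Do not also emit through the root logger

    # Drop handlers left over from a previous run of this cell
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    new_handlers = (
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(log_file, encoding="utf-8"),
    )
    for handler in new_handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling `get_demo_logger(OUTPUT_DIR / "demo_log.txt")` twice returns the same logger object with two handlers, so each message appears once on screen and once in the file.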

### Step 5: Configure OCR Settings

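OCR (Optical Character Recognition) is like teaching our computer to "read" text from images. Let's configure it for this demo. In the Tesseract options below, `--oem 3` selects the default OCR engine mode and `--psm 6` tells Tesseract to treat the page as a single uniform block of text.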

In [None]:
# OCR Configuration
OCR_CONFIG = {
    'tesseract_config': '--oem 3 --psm 6 -c tessedit_char_blacklist=',
    'confidence_threshold': 60,  # Minimum confidence score for text recognition
    'dpi': 300,  # Resolution for PDF to image conversion
    'preprocessing': True  # Apply image enhancement before OCR
}

# Test OCR installation
try:
    # Try to get tesseract version
    version = pytesseract.get_tesseract_version()
    logger.info(f"✅ Tesseract OCR version: {version}")
    
    # Test with a simple image
    test_image = Image.new('RGB', (200, 50), color='white')
    test_result = pytesseract.image_to_string(test_image)
    logger.info("✅ OCR test completed successfully")
    
except Exception as e:
    logger.error(f"❌ OCR configuration failed: {e}")
    logger.info("💡 You may need to install Tesseract OCR separately on your system")

print("🔧 OCR Configuration:")
for key, value in OCR_CONFIG.items():
    print(f"  {key}: {value}")
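
To show how `confidence_threshold` might be applied, the sketch below filters word-level OCR results by confidence. It assumes the input has the shape returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, that is, parallel lists under `'text'` and `'conf'`, with a confidence of `-1` for boxes that contain no word. The helper `filter_by_confidence` is hypothetical, and the sample data is made up so the example runs without Tesseract installed.

```python
def filter_by_confidence(ocr_data: dict, threshold: float = 60) -> list:
    """Keep non-empty words whose confidence is at or above the threshold."""
    words = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        # conf may arrive as int, float, or str depending on the pytesseract version
        if text.strip() and float(conf) >= threshold:
            words.append(text)
    return words

# Made-up sample in the assumed image_to_data dictionary shape
sample = {
    "text": ["", "Energy", "Rep0rt", "2024"],
    "conf": [-1, 96, 41, "88.5"],
}
print(filter_by_confidence(sample, threshold=60))  # ['Energy', '2024']
```

With real data you would pass `OCR_CONFIG['confidence_threshold']` as the threshold; lowering it keeps more words at the cost of more misreads.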

### Step 6: System Health Check

Let's run a final check to make sure everything is working properly before we start processing PDFs.

In [None]:
def system_health_check():
    """Perform a comprehensive health check of our system."""
    
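    checks = {
        'Python Environment': True,
        # Verify the directories created in Step 4 actually exist
        'Required Directories': all(d.is_dir() for d in directories_to_create),
        'PDF Processing': False,
        'Image Processing': False,
        'OCR Engine': False,
        'Computer Vision': False
    }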
    
    # Check PDF processing
    try:
        import pdf2image, pdfplumber
        checks['PDF Processing'] = True
        logger.info("✅ PDF processing libraries ready")
    except ImportError:
        logger.error("❌ PDF processing libraries missing")
    
    # Check image processing
    try:
        from PIL import Image
        import numpy as np
        checks['Image Processing'] = True
        logger.info("✅ Image processing libraries ready")
    except ImportError:
        logger.error("❌ Image processing libraries missing")
    
    # Check OCR
    try:
        import pytesseract
        pytesseract.get_tesseract_version()
        checks['OCR Engine'] = True
        logger.info("✅ OCR engine ready")
    except Exception:
        logger.error("❌ OCR engine not available")
    
    # Check computer vision
    try:
        import cv2
        checks['Computer Vision'] = True
        logger.info("✅ Computer vision libraries ready")
    except ImportError:
        logger.warning("⚠️ OpenCV not available (will use PIL fallback)")
        checks['Computer Vision'] = 'Partial'
    
    # Display results
    print("\n🏥 System Health Check Results:")
    print("=" * 40)
    
    for component, status in checks.items():
        if status is True:
            print(f"✅ {component}: Ready")
        elif status == 'Partial':
            print(f"⚠️ {component}: Partial (with fallback)")
        else:
            print(f"❌ {component}: Not Ready")
    
    # Overall status
    ready_count = sum(1 for status in checks.values() if status is True)
    total_count = len(checks)
    
    print(f"\n📊 Overall Status: {ready_count}/{total_count} components ready")
    
    # OpenCV is optional (PIL fallback); every other component is required
    required_ready = all(
        status is True for name, status in checks.items() if name != 'Computer Vision'
    )
    if required_ready:
        print("🎉 System is ready for PDF processing!")
        return True
    else:
        print("⚠️ Some components need attention before proceeding")
        return False

# Run the health check
system_ready = system_health_check()

---

## 🎊 Setup Complete!

Congratulations! Your EcoMetricx environment is now configured and ready to process PDFs. 

### What We've Accomplished:

1. ✅ **Verified Python environment** and confirmed we're using the right setup
2. ✅ **Installed all required libraries** for PDF processing, OCR, and computer vision
3. ✅ **Created organized directory structure** for managing outputs
4. ✅ **Configured logging system** to track our processing activities
5. ✅ **Set up OCR engine** for reading text from images
6. ✅ **Performed system health check** to ensure everything works properly

### Next Steps:

In the following sections, we'll demonstrate:
- 📄 **Loading and analyzing PDF documents**
- 🔍 **Comparing different extraction methods**
- 🎯 **Visual element detection and extraction**
- 📊 **Performance analysis and metrics**
- 💡 **Real-world applications and use cases**

---

*Ready to see some PDF magic? Let's continue to the next section!* ✨