Skip to content

CodewithEvilxd/pdf-extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ Advanced PDF Text Extractor with OCR

React Vite PDF.js Tesseract.js License

Ultimate PDF text extractor with AI-powered OCR, table detection, and advanced text processing capabilities. Extract text, tables, and data from PDFs and scanned documents with unparalleled accuracy and speed.

🌐 Live Demo

Live Demo

Experience the full power of our PDF extractor with our live web application. No installation required - just upload your PDFs and start extracting!

πŸ“‹ Repository

GitHub Repository

git clone https://github.com/CodewithEvilxd/pdf-extractor.git
cd pdf-extractor
npm install
npm run dev

πŸ’¬ Community & Support

Discord Discord

Join our Discord community for support, feature requests, and discussions about PDF processing!


✨ Features

πŸ” Core Extraction Capabilities

  • PDF Text Extraction: Extract clean, formatted text from any PDF document with perfect accuracy
  • AI OCR Technology: Convert scanned documents and images to editable text using advanced Tesseract.js
  • Table Detection: Automatically detect and extract tabular data with CSV export capability
  • Batch Processing: Process multiple PDF files simultaneously with progress tracking
  • Page Range Selection: Extract text from specific page ranges or entire documents

πŸ› οΈ Advanced Text Processing

  • Smart Search: Search within extracted text with highlighted results and instant feedback
  • Text Analysis: Get detailed statistics including word count, reading time, and language detection
  • Text Formatting: Clean and format extracted text with advanced formatting and cleaning tools
  • Advanced Search: Powerful regex search with find & replace, case sensitivity, and whole word matching
  • Text Summarization: AI-powered text summarization using frequency analysis to extract key information
  • Text-to-Speech: Listen to extracted text with customizable voices, speed, and pitch controls

πŸ“Š Document Intelligence

  • Metadata Extraction: Extract detailed document information including author, dates, file size, and PDF properties
  • Keyword Extraction: Identify and extract the most important keywords using advanced TF-IDF and frequency analysis
  • Text Segmentation: Break down extracted text into paragraphs or sentences with detailed statistics
  • Reading Mode: Distraction-free reading experience with customizable fonts, themes, and layout controls

🎨 User Experience

  • Drag & Drop Interface: Intuitive file upload with drag-and-drop support
  • Dark/Light Theme: Toggle between themes for comfortable reading in any environment
  • Responsive Design: Optimized for desktop, tablet, and mobile devices
  • Keyboard Shortcuts: Boost productivity with comprehensive keyboard shortcuts
  • Progress Tracking: Real-time progress indicators and processing statistics
  • Multiple Export Formats: Export to TXT, JSON, CSV, and Markdown formats

πŸ”§ Developer Features

  • Privacy-First: All processing happens in your browser - no data sent to external servers
  • Fast Processing: Lightning-fast extraction using optimized algorithms
  • Accurate Results: Industry-leading accuracy for text extraction and OCR
  • Modern Tech Stack: Built with React 19, Vite, and cutting-edge web technologies

πŸš€ Quick Start

Prerequisites

  • Node.js 18+ and npm
  • Modern web browser with JavaScript enabled

Installation

  1. Clone the repository

    git clone https://github.com/CodewithEvilxd/pdf-extractor.git
    cd pdf-extractor
  2. Install dependencies

    npm install
  3. Start development server

    npm run dev
  4. Open your browser Navigate to http://localhost:5173 to access the application.

Optional: Enable OCR Support

For scanned PDF processing, install Tesseract.js:

npm install tesseract.js

πŸ“– Usage Guide

Basic Text Extraction

  1. Upload PDFs: Click "Choose PDF Files" or drag and drop PDF files into the upload area
  2. Configure Options:
    • Select page range (optional)
    • Enable OCR for scanned documents
    • Enable table extraction
  3. Extract Text: Click upload to start processing
  4. View Results: Extracted text appears with formatting preserved

Advanced Features

Search & Highlight

  • Use the search box to find specific text
  • Results are highlighted in real-time
  • Navigate through matches with previous/next buttons

Text Analysis

  • Click "Text Analysis" to view detailed statistics
  • Includes word count, reading time, language detection
  • View most frequent words and document metrics

Text Formatting

  • Clean up extracted text with formatting tools
  • Remove extra spaces, normalize line breaks
  • Capitalize sentences, remove special characters

Advanced Search (Regex)

  • Use regular expressions for complex searches
  • Find & replace functionality
  • Case-sensitive and whole word options

Text Summarization

  • Generate AI-powered summaries
  • Adjustable compression levels (10%-50%)
  • View original vs. summary statistics

Text-to-Speech

  • Listen to extracted text with natural voices
  • Customize speech rate, pitch, and voice selection
  • Support for multiple languages

Metadata Extraction

  • View detailed document information
  • Includes creation date, author, file size
  • PDF version and encryption status

Keyword Extraction

  • Extract most important keywords
  • TF-IDF scoring algorithm
  • Adjustable keyword count (5-50)

Reading Mode

  • Distraction-free reading experience
  • Customizable fonts and themes
  • Adjustable font size and line height

Text Segmentation

  • Split text into paragraphs or sentences
  • Detailed statistics for each segment
  • Export segmented content

Export Options

  • TXT: Plain text format
  • JSON: Structured data with metadata
  • CSV: Tabular data export
  • Markdown: Formatted document with headers and structure

🎯 Key Features in Detail

PDF Processing Engine

  • PDF.js Integration: Industry-standard PDF processing library
  • Multi-format Support: Handles all PDF versions and formats
  • Error Handling: Robust error handling for corrupted or password-protected files
  • Memory Efficient: Optimized memory usage for large documents

OCR Technology

  • Tesseract.js: Google's Tesseract OCR engine for web
  • Multi-language Support: Support for 100+ languages
  • Image Preprocessing: Automatic image enhancement for better OCR accuracy
  • Fallback Logic: Graceful fallback when OCR is not available

Table Detection

  • Heuristic Analysis: Smart detection of tabular structures
  • CSV Export: Automatic conversion to CSV format
  • Header Detection: Intelligent header row identification
  • Multi-column Support: Handles complex table layouts

Search & Navigation

  • Real-time Search: Instant search results as you type
  • Regex Support: Full regular expression capabilities
  • Case Sensitivity: Configurable case-sensitive matching
  • Whole Word Matching: Exact word boundary matching

Text Analysis

  • Comprehensive Metrics: Word count, character count, sentence analysis
  • Reading Time Estimation: Calculate reading time based on average reading speed
  • Language Detection: Automatic language identification
  • Frequency Analysis: Most common words and phrases

Text Formatting

  • Whitespace Cleanup: Remove extra spaces and normalize formatting
  • Line Break Normalization: Fix inconsistent line breaks
  • Sentence Capitalization: Automatic sentence case correction
  • Special Character Removal: Clean up unwanted characters

Text Summarization

  • Frequency Analysis: Extract important sentences based on word frequency
  • Sentence Scoring: Position-based and length-based scoring
  • Compression Control: Adjustable summary length
  • Statistics Tracking: Original vs. summary comparison

Text-to-Speech

  • Web Speech API: Native browser speech synthesis
  • Voice Selection: Multiple voice options
  • Speech Controls: Rate, pitch, and volume adjustment
  • Progress Tracking: Real-time speech progress

Document Metadata

  • Complete Information: Author, title, subject, creator
  • Date Tracking: Creation and modification dates
  • File Properties: Size, type, PDF version
  • Security Info: Encryption and linearization status

Keyword Extraction

  • TF-IDF Algorithm: Term frequency-inverse document frequency
  • Position Weighting: Consider word position importance
  • Length Optimization: Balance between word length and importance
  • Stop Word Filtering: Remove common words

Reading Mode

  • Theme Options: Light, dark, and sepia themes
  • Font Customization: Adjustable font size and family
  • Layout Control: Line height and spacing options
  • Fullscreen Experience: Immersive reading environment

Text Segmentation

  • Paragraph Detection: Split by double line breaks
  • Sentence Detection: Split by punctuation marks
  • Statistics Generation: Word count and character count per segment
  • Export Capabilities: JSON export with metadata

πŸ› οΈ Technical Architecture

Frontend Stack

  • React 19: Latest React with concurrent features
  • Vite: Fast build tool and development server
  • Modern CSS: Responsive design with CSS Grid and Flexbox
  • Web APIs: File API, Web Speech API, Clipboard API

PDF Processing

  • PDF.js: Mozilla's PDF processing library
  • Worker Threads: Background processing for large files
  • Canvas API: Image rendering for OCR processing
  • FileReader API: Efficient file reading and processing

OCR Integration

  • Tesseract.js: WebAssembly-based OCR engine
  • Image Processing: Canvas-based image preprocessing
  • Language Models: Multiple language support
  • Error Recovery: Graceful handling of OCR failures

Testing & Quality

  • Vitest: Fast unit testing framework
  • React Testing Library: Component testing utilities
  • ESLint: Code quality and consistency
  • Prettier: Code formatting

πŸ“± Screenshots

Main Interface

Main Interface Clean, intuitive interface with drag-and-drop upload area

Feature Grid

Feature Grid Comprehensive feature overview with visual icons

Text Analysis Dashboard

Text Analysis Detailed statistics and analysis of extracted text

Reading Mode

Reading Mode Distraction-free reading experience with customizable themes


πŸ”§ Configuration

Environment Variables

Create a .env file in the root directory:

# Development
VITE_API_URL=http://localhost:5173

# Production
VITE_API_URL=https://your-production-url.com

Build Configuration

The project uses Vite for building. Configure in vite.config.js:

export default defineConfig({
  plugins: [react()],
  build: {
    outDir: 'dist',
    sourcemap: true
  }
})

πŸ§ͺ Testing

Run the test suite:

npm run test

Run tests with UI:

npm run test:ui

Run tests once:

npm run test:run

πŸ“¦ Build & Deployment

Development

npm run dev

Production Build

npm run build

Preview Production Build

npm run preview

Deployment

The application is optimized for deployment on:

  • Vercel (recommended)
  • Netlify
  • GitHub Pages
  • AWS S3 + CloudFront

🀝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch
    git checkout -b feature/amazing-feature
  3. Make your changes
  4. Run tests
    npm run test
  5. Commit your changes
    git commit -m 'Add amazing feature'
  6. Push to the branch
    git push origin feature/amazing-feature
  7. Open a Pull Request

Development Guidelines

  • Follow React best practices
  • Write comprehensive tests
  • Update documentation
  • Use meaningful commit messages
  • Follow ESLint rules

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • PDF.js by Mozilla for PDF processing capabilities
  • Tesseract.js for OCR functionality
  • React community for excellent documentation and tools
  • Vite team for the amazing build tool
  • All contributors and users of this project

πŸ“ž Support


πŸ”„ Changelog

Version 1.0.0

  • Initial release with core PDF extraction
  • OCR support for scanned documents
  • Advanced text processing features
  • Responsive UI with dark/light themes
  • Multiple export formats

Made with ❀️ using React, PDF.js, and modern web technologies

Extract text from PDFs with confidence - fast, accurate, and privacy-first!

About

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published