Ultimate PDF text extractor with AI-powered OCR, table detection, and advanced text processing capabilities. Extract text, tables, and data from PDFs and scanned documents with unparalleled accuracy and speed.
Experience the full power of our PDF extractor with our live web application. No installation required - just upload your PDFs and start extracting!
git clone https://github.com/CodewithEvilxd/pdf-extractor.git
cd pdf-extractor
npm install
npm run dev
Join our Discord community for support, feature requests, and discussions about PDF processing!
- PDF Text Extraction: Extract clean, formatted text from PDF documents
- AI OCR Technology: Convert scanned documents and images to editable text using advanced Tesseract.js
- Table Detection: Automatically detect and extract tabular data with CSV export capability
- Batch Processing: Process multiple PDF files simultaneously with progress tracking
- Page Range Selection: Extract text from specific page ranges or entire documents
- Smart Search: Search within extracted text with highlighted results and instant feedback
- Text Analysis: Get detailed statistics including word count, reading time, and language detection
- Text Formatting: Clean and format extracted text with advanced formatting and cleaning tools
- Advanced Search: Powerful regex search with find & replace, case sensitivity, and whole word matching
- Text Summarization: AI-powered text summarization using frequency analysis to extract key information
- Text-to-Speech: Listen to extracted text with customizable voices, speed, and pitch controls
- Metadata Extraction: Extract detailed document information including author, dates, file size, and PDF properties
- Keyword Extraction: Identify and extract the most important keywords using advanced TF-IDF and frequency analysis
- Text Segmentation: Break down extracted text into paragraphs or sentences with detailed statistics
- Reading Mode: Distraction-free reading experience with customizable fonts, themes, and layout controls
- Drag & Drop Interface: Intuitive file upload with drag-and-drop support
- Dark/Light Theme: Toggle between themes for comfortable reading in any environment
- Responsive Design: Optimized for desktop, tablet, and mobile devices
- Keyboard Shortcuts: Boost productivity with comprehensive keyboard shortcuts
- Progress Tracking: Real-time progress indicators and processing statistics
- Multiple Export Formats: Export to TXT, JSON, CSV, and Markdown formats
- Privacy-First: All processing happens in your browser - no data sent to external servers
- Fast Processing: Lightning-fast extraction using optimized algorithms
- Accurate Results: Industry-leading accuracy for text extraction and OCR
- Modern Tech Stack: Built with React 19, Vite, and cutting-edge web technologies
- Node.js 18+ and npm
- Modern web browser with JavaScript enabled
- Clone the repository
  git clone https://github.com/CodewithEvilxd/pdf-extractor.git
  cd pdf-extractor
- Install dependencies
  npm install
- Start development server
  npm run dev
- Open your browser
  Navigate to http://localhost:5173 to access the application.
For scanned PDF processing, install Tesseract.js:
npm install tesseract.js
- Upload PDFs: Click "Choose PDF Files" or drag and drop PDF files into the upload area
- Configure Options:
  - Select page range (optional)
  - Enable OCR for scanned documents
  - Enable table extraction
- Extract Text: Click upload to start processing
- View Results: Extracted text appears with formatting preserved
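The optional page range from the steps above has to be turned into concrete page numbers before extraction. The README does not say which syntax the app accepts, so the sketch below assumes a comma-separated format such as `1-3, 7`; `parsePageRange` is a hypothetical helper, not a function from this project.

```javascript
// Hypothetical helper: turn a page-range string such as "1-3, 7" into a
// sorted list of 1-based page numbers, clamped to the document length.
// The "1-3, 7" syntax is an assumption; the app may accept something else.
function parsePageRange(input, totalPages) {
  const trimmed = (input || '').trim();
  // An empty range is treated as "the entire document".
  if (trimmed === '') {
    return Array.from({ length: totalPages }, (_, i) => i + 1);
  }
  const pages = new Set();
  for (const rawPart of trimmed.split(',')) {
    const part = rawPart.trim();
    const match = part.match(/^(\d+)(?:\s*-\s*(\d+))?$/);
    if (!match) throw new Error(`Invalid page range: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] ? Number(match[2]) : start;
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${part}"`);
    // Pages past the end of the document are silently dropped.
    for (let p = start; p <= Math.min(end, totalPages); p++) pages.add(p);
  }
  return [...pages].sort((a, b) => a - b);
}
```

With this sketch, `parsePageRange('1-3, 7', 10)` yields pages 1, 2, 3, and 7, and an empty string selects every page.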
- Use the search box to find specific text
- Results are highlighted in real-time
- Navigate through matches with previous/next buttons
- Click "Text Analysis" to view detailed statistics
- Includes word count, reading time, language detection
- View most frequent words and document metrics
- Clean up extracted text with formatting tools
- Remove extra spaces, normalize line breaks
- Capitalize sentences, remove special characters
- Use regular expressions for complex searches
- Find & replace functionality
- Case-sensitive and whole word options
- Generate AI-powered summaries
- Adjustable compression levels (10%-50%)
- View original vs. summary statistics
- Listen to extracted text with natural voices
- Customize speech rate, pitch, and voice selection
- Support for multiple languages
- View detailed document information
- Includes creation date, author, file size
- PDF version and encryption status
- Extract most important keywords
- TF-IDF scoring algorithm
- Adjustable keyword count (5-50)
- Distraction-free reading experience
- Customizable fonts and themes
- Adjustable font size and line height
- Split text into paragraphs or sentences
- Detailed statistics for each segment
- Export segmented content
- TXT: Plain text format
- JSON: Structured data with metadata
- CSV: Tabular data export
- Markdown: Formatted document with headers and structure
- PDF.js Integration: Industry-standard PDF processing library
- Multi-format Support: Handles all PDF versions and formats
- Error Handling: Robust error handling for corrupted or password-protected files
- Memory Efficient: Optimized memory usage for large documents
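As a rough illustration of the PDF.js integration and error handling listed above, the sketch below reads text page by page. It is not the project's actual code: `extractPdfText` is a hypothetical name, and the PDF.js module is passed in as `pdfjsLib` so the example makes no assumption about how the app imports `pdfjs-dist` or configures its worker.

```javascript
// Sketch: extract text page by page with PDF.js.
// `pdfjsLib` is the pdfjs-dist module, injected by the caller (assumption).
async function extractPdfText(pdfjsLib, data, pageNumbers) {
  let pdf;
  try {
    pdf = await pdfjsLib.getDocument({ data }).promise;
  } catch (error) {
    // PDF.js signals these cases through the error's `name`.
    if (error.name === 'PasswordException') {
      throw new Error('This PDF is password-protected.');
    }
    if (error.name === 'InvalidPDFException') {
      throw new Error('This file is not a valid PDF or is corrupted.');
    }
    throw error;
  }
  const wanted = pageNumbers || Array.from({ length: pdf.numPages }, (_, i) => i + 1);
  const pages = [];
  // Pages are processed one at a time to keep memory use bounded.
  for (const pageNumber of wanted) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Each text item carries a `str` fragment; joining with spaces drops layout.
    const text = content.items.map((item) => item.str || '').join(' ');
    pages.push({ pageNumber, text });
  }
  return pages;
}
```

A caller would typically pass a `Uint8Array` read from the uploaded file as `data`.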
- Tesseract.js: A JavaScript/WebAssembly port of the Tesseract OCR engine that runs in the browser
- Multi-language Support: Support for 100+ languages
- Image Preprocessing: Automatic image enhancement for better OCR accuracy
- Fallback Logic: Graceful fallback when OCR is not available
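One way the fallback logic above could work is sketched below, under stated assumptions: a page counts as "scanned" when its embedded text is shorter than `minChars`, and the OCR library is loaded lazily through a caller-supplied `loadOcr` function (for example `() => import('tesseract.js')`). The function name, the threshold, and the returned shape are all hypothetical.

```javascript
// Sketch of an OCR fallback. Uses embedded text when there is enough of it,
// tries OCR otherwise, and falls back to the embedded text if OCR is
// unavailable or fails. `loadOcr` and `minChars` are illustrative assumptions.
async function textWithOcrFallback({ embeddedText, image, loadOcr, minChars = 20 }) {
  if (embeddedText.trim().length >= minChars) {
    return { text: embeddedText, source: 'pdf' };
  }
  try {
    const ocr = await loadOcr();
    // Handle both named and default exports of the loaded module.
    const recognize = ocr.recognize || (ocr.default && ocr.default.recognize);
    const { data } = await recognize(image, 'eng');
    return { text: data.text, source: 'ocr' };
  } catch (error) {
    return {
      text: embeddedText,
      source: 'pdf',
      warning: `OCR unavailable: ${error.message}`,
    };
  }
}
```

Returning a `warning` instead of throwing lets the UI show partial results when Tesseract.js is not installed.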
- Heuristic Analysis: Smart detection of tabular structures
- CSV Export: Automatic conversion to CSV format
- Header Detection: Intelligent header row identification
- Multi-column Support: Handles complex table layouts
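The README does not describe the table heuristics in detail, so the following is only a minimal sketch of one plausible approach: treat runs of two or more spaces (or tabs) as column gaps, and call a block a table when consecutive lines share the same column count. `splitColumns`, `detectTable`, and `toCsv` are hypothetical names.

```javascript
// Heuristic sketch: runs of 2+ spaces or tabs are treated as column gaps.
function splitColumns(line) {
  return line.trim().split(/\t+|\s{2,}/);
}

// Return the first run of at least `minRows` consecutive lines that share the
// same column count (two or more columns), or null when none is found.
function detectTable(text, minRows = 2) {
  let run = [];
  let best = null;
  const flush = () => {
    if (!best && run.length >= minRows) best = run;
    run = [];
  };
  for (const line of text.split(/\r?\n/)) {
    const cells = line.trim() === '' ? [] : splitColumns(line);
    if (cells.length >= 2 && (run.length === 0 || run[0].length === cells.length)) {
      run.push(cells);
    } else {
      flush();
      if (cells.length >= 2) run.push(cells);
    }
  }
  flush();
  return best;
}

// Convert rows to CSV, quoting cells that contain commas, quotes, or newlines.
function toCsv(rows) {
  const escape = (cell) => (/[",\r\n]/.test(cell) ? `"${cell.replace(/"/g, '""')}"` : cell);
  return rows.map((row) => row.map(escape).join(',')).join('\r\n');
}
```

In this sketch the first detected row would naturally serve as the header row; a real implementation would likely also use the x-coordinates that PDF.js reports for each text item.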
- Real-time Search: Instant search results as you type
- Regex Support: Full regular expression capabilities
- Case Sensitivity: Configurable case-sensitive matching
- Whole Word Matching: Exact word boundary matching
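The search options above map fairly directly onto JavaScript's `RegExp`. The sketch below shows one way to combine them; the function and option names (`buildSearchRegex`, `useRegex`, `caseSensitive`, `wholeWord`) are illustrative and not taken from the project.

```javascript
// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a global RegExp from the query and the (hypothetical) search options.
function buildSearchRegex(query, { useRegex = false, caseSensitive = false, wholeWord = false } = {}) {
  let source = useRegex ? query : escapeRegExp(query);
  if (wholeWord) source = `\\b(?:${source})\\b`;
  return new RegExp(source, caseSensitive ? 'g' : 'gi');
}

// Return every match with its position, ready for highlighting.
function findMatches(text, query, options) {
  if (!query) return [];
  const regex = buildSearchRegex(query, options);
  return Array.from(text.matchAll(regex), (m) => ({ index: m.index, match: m[0] }));
}

// Find & replace reuses the same pattern.
function replaceAllMatches(text, query, replacement, options) {
  return query ? text.replace(buildSearchRegex(query, options), replacement) : text;
}
```

Plain queries are escaped first, so searching for `a.b` matches only the literal text unless regex mode is enabled.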
- Comprehensive Metrics: Word count, character count, sentence analysis
- Reading Time Estimation: Calculate reading time based on average reading speed
- Language Detection: Automatic language identification
- Frequency Analysis: Most common words and phrases
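The metrics above can be computed in a few lines. This sketch assumes an average reading speed of 200 words per minute, which is a common rule of thumb rather than a value stated in the README; `analyzeText` is a hypothetical name.

```javascript
// Sketch of basic text statistics. `wordsPerMinute` defaults to an assumed 200.
function analyzeText(text, wordsPerMinute = 200) {
  // Words are runs of letters, digits, or apostrophes (Unicode-aware).
  const words = text.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
  const sentences = text.split(/[.!?]+/).filter((s) => s.trim() !== '');
  const counts = new Map();
  for (const word of words) counts.set(word, (counts.get(word) || 0) + 1);
  const topWords = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 5);
  return {
    wordCount: words.length,
    characterCount: text.length,
    sentenceCount: sentences.length,
    readingTimeMinutes: Math.ceil(words.length / wordsPerMinute),
    topWords,
  };
}
```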
- Whitespace Cleanup: Remove extra spaces and normalize formatting
- Line Break Normalization: Fix inconsistent line breaks
- Sentence Capitalization: Automatic sentence case correction
- Special Character Removal: Clean up unwanted characters
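The cleanup steps above could be chained roughly as follows. This is a sketch under assumptions: the option names are hypothetical, and the set of characters kept by `stripSpecial` (letters, digits, whitespace, and basic punctuation) is a guess at what "special characters" means here.

```javascript
// Sketch of a text cleanup pipeline; options and kept characters are assumptions.
function cleanText(text, { capitalize = true, stripSpecial = false } = {}) {
  let out = text;
  if (stripSpecial) {
    // Keep letters, digits, whitespace, and basic punctuation only.
    out = out.replace(/[^\p{L}\p{N}\s.,;:!?'"()\-]/gu, '');
  }
  out = out
    .replace(/\r\n?/g, '\n')     // normalize CRLF and CR to LF
    .replace(/[ \t]+/g, ' ')     // collapse runs of spaces and tabs
    .replace(/ *\n */g, '\n')    // trim spaces around line breaks
    .replace(/\n{3,}/g, '\n\n')  // keep at most one blank line
    .trim();
  if (capitalize) {
    // Uppercase the first lowercase letter of the text and of each sentence.
    out = out.replace(/(^|[.!?]\s+)(\p{Ll})/gu, (_, lead, ch) => lead + ch.toUpperCase());
  }
  return out;
}
```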
- Frequency Analysis: Extract important sentences based on word frequency
- Sentence Scoring: Position-based and length-based scoring
- Compression Control: Adjustable summary length
- Statistics Tracking: Original vs. summary comparison
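To make the frequency-based approach concrete, here is a minimal extractive summarizer. It scores sentences by average word frequency only; the position-based and length-based scoring mentioned above is left out, and no stop words are filtered, so it should be read as a simplified sketch rather than the app's algorithm. `summarize` and its `ratio` parameter are hypothetical.

```javascript
// Minimal frequency-based extractive summary (simplified sketch).
// `ratio` is the fraction of sentences to keep, e.g. 0.3 for 30%.
function summarize(text, ratio = 0.3) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [];
  if (sentences.length === 0) return text.trim();
  const tokenize = (s) => s.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
  const freq = new Map();
  for (const word of tokenize(text)) freq.set(word, (freq.get(word) || 0) + 1);
  const scored = sentences.map((sentence, index) => {
    const words = tokenize(sentence);
    const total = words.reduce((sum, w) => sum + freq.get(w), 0);
    // Averaging avoids favouring long sentences.
    return { sentence: sentence.trim(), index, score: words.length ? total / words.length : 0 };
  });
  const keep = Math.max(1, Math.round(sentences.length * ratio));
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .sort((a, b) => a.index - b.index) // restore reading order
    .map((s) => s.sentence)
    .join(' ');
}
```

Comparing the word counts of the input and the returned string gives the "original vs. summary" statistics.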
- Web Speech API: Native browser speech synthesis
- Voice Selection: Multiple voice options
- Speech Controls: Rate, pitch, and volume adjustment
- Progress Tracking: Real-time speech progress
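A sketch of how these controls could map onto the Web Speech API follows. The clamping ranges are the ones the API defines for `rate`, `pitch`, and `volume`; the function names and the progress callback are illustrative assumptions, and `speak` simply returns `false` where speech synthesis is not available.

```javascript
const clamp = (value, min, max) => Math.min(max, Math.max(min, value));

// Web Speech API ranges: rate 0.1-10, pitch 0-2, volume 0-1.
function speechSettings({ rate = 1, pitch = 1, volume = 1 } = {}) {
  return {
    rate: clamp(rate, 0.1, 10),
    pitch: clamp(pitch, 0, 2),
    volume: clamp(volume, 0, 1),
  };
}

// Hypothetical wrapper: speak `text`, reporting progress as a 0..1 fraction.
function speak(text, { voiceName, onProgress, ...settings } = {}) {
  if (typeof window === 'undefined' || !('speechSynthesis' in window)) return false;
  const utterance = new SpeechSynthesisUtterance(text);
  Object.assign(utterance, speechSettings(settings));
  const voice = window.speechSynthesis.getVoices().find((v) => v.name === voiceName);
  if (voice) utterance.voice = voice;
  // `boundary` events report the character index reached so far.
  if (onProgress) {
    utterance.onboundary = (event) => onProgress(event.charIndex / (text.length || 1));
  }
  window.speechSynthesis.cancel(); // stop anything already being spoken
  window.speechSynthesis.speak(utterance);
  return true;
}
```

Not every browser or voice fires `boundary` events, so the progress callback may never be called in practice.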
- Complete Information: Author, title, subject, creator
- Date Tracking: Creation and modification dates
- File Properties: Size, type, PDF version
- Security Info: Encryption and linearization status
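PDF.js exposes document information through `getMetadata()`, whose `info` object carries the creation and modification dates as raw PDF date strings such as `D:20240115093000+05'30'`. The helper below is a sketch for turning those strings into `Date` objects; `parsePdfDate` is a hypothetical name, and dates without a timezone are treated as UTC, which is an assumption.

```javascript
// Sketch: parse a PDF date string (D:YYYYMMDDHHmmSS followed by Z or +HH'mm').
// Missing fields default to the earliest value; no timezone is treated as UTC.
function parsePdfDate(value) {
  const m = /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+-])?(\d{2})?'?(\d{2})?'?$/.exec(value || '');
  if (!m) return null;
  const [year, month = '01', day = '01', hour = '00', minute = '00', second = '00'] = m.slice(1, 7);
  let utc = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second);
  const sign = m[7];
  if (sign === '+' || sign === '-') {
    const offsetMinutes = Number(m[8] || 0) * 60 + Number(m[9] || 0);
    // Local time = UTC + offset, so subtract the offset to get UTC.
    utc -= (sign === '+' ? 1 : -1) * offsetMinutes * 60000;
  }
  return new Date(utc);
}
```

A caller might use it as `parsePdfDate(info.CreationDate)` after `const { info } = await pdf.getMetadata()`.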
- TF-IDF Algorithm: Term frequency-inverse document frequency
- Position Weighting: Consider word position importance
- Length Optimization: Balance between word length and importance
- Stop Word Filtering: Remove common words
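TF-IDF needs more than one "document" to compute inverse document frequency. The README does not say what the app uses as its corpus, so this sketch assumes the extracted text is split into segments (pages or paragraphs) and sums each term's TF-IDF score across them. Position weighting and length optimization are omitted, and the stop-word list is a tiny illustrative sample.

```javascript
// Tiny illustrative stop-word list (assumption; a real list would be longer).
const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it', 'on', 'for', 'with', 'as', 'by', 'at', 'this', 'that']);

function tokenize(text) {
  const words = text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
  return words.filter((w) => w.length > 2 && !STOP_WORDS.has(w));
}

// Sketch: score terms by TF-IDF summed over segments (e.g. pages or paragraphs).
function extractKeywords(segments, count = 10) {
  const docs = segments.map(tokenize);
  const docFreq = new Map();
  for (const doc of docs) {
    for (const term of new Set(doc)) docFreq.set(term, (docFreq.get(term) || 0) + 1);
  }
  const scores = new Map();
  for (const doc of docs) {
    const tf = new Map();
    for (const term of doc) tf.set(term, (tf.get(term) || 0) + 1);
    for (const [term, n] of tf) {
      // Smoothed IDF keeps terms that occur in every segment above zero.
      const idf = Math.log((1 + docs.length) / (1 + docFreq.get(term))) + 1;
      scores.set(term, (scores.get(term) || 0) + (n / doc.length) * idf);
    }
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, count)
    .map(([term, score]) => ({ term, score }));
}
```

The `count` argument corresponds to the adjustable keyword count in the UI.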
- Theme Options: Light, dark, and sepia themes
- Font Customization: Adjustable font size and family
- Layout Control: Line height and spacing options
- Fullscreen Experience: Immersive reading environment
- Paragraph Detection: Split by double line breaks
- Sentence Detection: Split by punctuation marks
- Statistics Generation: Word count and character count per segment
- Export Capabilities: JSON export with metadata
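The splitting rules above (double line breaks for paragraphs, punctuation for sentences) can be sketched as a single function. `segmentText` and the shape of the returned objects are illustrative assumptions.

```javascript
// Sketch: split text into paragraphs or sentences with per-segment statistics.
function segmentText(text, mode = 'paragraph') {
  const parts = mode === 'sentence'
    ? text.match(/[^.!?]+(?:[.!?]+|$)/g) || [] // sentence ends at . ! ? or end of text
    : text.split(/\n\s*\n/);                   // paragraph ends at a blank line
  return parts
    .map((part) => part.trim())
    .filter(Boolean)
    .map((content, index) => ({
      index,
      content,
      wordCount: (content.match(/\S+/g) || []).length,
      characterCount: content.length,
    }));
}
```

Passing the result to `JSON.stringify(segments, null, 2)` produces a JSON export with the statistics attached to each segment. Splitting on punctuation alone will break abbreviations such as "e.g." into separate sentences.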
- React 19: Latest React with concurrent features
- Vite: Fast build tool and development server
- Modern CSS: Responsive design with CSS Grid and Flexbox
- Web APIs: File API, Web Speech API, Clipboard API
- PDF.js: Mozilla's PDF processing library
- Web Workers: Background processing for large files
- Canvas API: Image rendering for OCR processing
- FileReader API: Efficient file reading and processing
- Tesseract.js: WebAssembly-based OCR engine
- Image Processing: Canvas-based image preprocessing
- Language Models: Multiple language support
- Error Recovery: Graceful handling of OCR failures
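The README does not specify which preprocessing steps are applied before OCR. A common choice is grayscale conversion followed by a fixed threshold, shown below as an assumed example. It works on RGBA bytes laid out like `ImageData.data`, so it can run on the output of a canvas `getImageData()` call; `binarize` and the default threshold are hypothetical.

```javascript
// Sketch: grayscale + fixed-threshold binarization on RGBA pixel data.
// `pixels` has the same layout as ImageData.data (4 bytes per pixel).
function binarize(pixels, threshold = 128) {
  const out = new Uint8ClampedArray(pixels.length);
  for (let i = 0; i < pixels.length; i += 4) {
    // Rec. 601 luma weights for perceived brightness.
    const gray = 0.299 * pixels[i] + 0.587 * pixels[i + 1] + 0.114 * pixels[i + 2];
    const value = gray >= threshold ? 255 : 0;
    out[i] = value;
    out[i + 1] = value;
    out[i + 2] = value;
    out[i + 3] = 255; // fully opaque
  }
  return out;
}
```

In the browser, the result can be written back with `ctx.putImageData(new ImageData(binarize(imageData.data), width, height), 0, 0)` before handing the canvas to the OCR engine.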
- Vitest: Fast unit testing framework
- React Testing Library: Component testing utilities
- ESLint: Code quality and consistency
- Prettier: Code formatting
Clean, intuitive interface with drag-and-drop upload area
Comprehensive feature overview with visual icons
Detailed statistics and analysis of extracted text
Distraction-free reading experience with customizable themes
Create a .env file in the root directory:
# Development
VITE_API_URL=http://localhost:5173
# Production
VITE_API_URL=https://your-production-url.com
The project uses Vite for building. Configure in vite.config.js:
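import { defineConfig } from 'vite'
// Assumes the standard React plugin; adjust if the project uses another one.
import react from '@vitejs/plugin-react'

export default defineConfig({
  plugins: [react()],
  build: {
    outDir: 'dist',
    sourcemap: true
  }
})
Run the test suite: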
npm run test
Run tests with UI:
npm run test:ui
Run tests once:
npm run test:run
npm run dev
npm run build
npm run preview
The application is optimized for deployment on:
- Vercel (recommended)
- Netlify
- GitHub Pages
- AWS S3 + CloudFront
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch
  git checkout -b feature/amazing-feature
- Make your changes
- Run tests
  npm run test
- Commit your changes
  git commit -m 'Add amazing feature'
- Push to the branch
  git push origin feature/amazing-feature
- Open a Pull Request
- Follow React best practices
- Write comprehensive tests
- Update documentation
- Use meaningful commit messages
- Follow ESLint rules
This project is licensed under the MIT License - see the LICENSE file for details.
- PDF.js by Mozilla for PDF processing capabilities
- Tesseract.js for OCR functionality
- React community for excellent documentation and tools
- Vite team for the amazing build tool
- All contributors and users of this project
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Discord: Join our community | [Support channel](https://discord.gg/x raj_dev_X)
- Initial release with core PDF extraction
- OCR support for scanned documents
- Advanced text processing features
- Responsive UI with dark/light themes
- Multiple export formats
Made with ❤️ using React, PDF.js, and modern web technologies
Extract text from PDFs with confidence - fast, accurate, and privacy-first!