A comprehensive PDF processing application that combines OCR (Optical Character Recognition) and PDF unlocking capabilities. Available as both a GUI application and a Discord bot!
- OCR Processing: Extract text from PDF files and save it to a text file
- PDF Unlocking: Remove password protection from PDF files
- Manual password entry
- Brute force password cracking (for simple passwords)
- Discord Bot Integration: Process PDFs directly through Discord commands
- Python 3.7 or higher
- Tesseract OCR engine
- Poppler (for PDF to image conversion)
- Required Python packages:
pikepdf
pdf2image
pytesseract
discord.py
python-dotenv
customtkinter
- Install Tesseract OCR:
- Download the installer from GitHub
- Add Tesseract to your system PATH
- Install Poppler:
- Download from poppler releases
- Extract to a folder (e.g., C:\poppler-xx.xx.x)
- Add the bin folder to your system PATH
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install poppler-utils
brew install tesseract
brew install poppler
- Clone this repository:
git clone https://github.com/yourusername/pdf-processor.git
cd pdf-processor
- Install Python dependencies:
pip install -r requirements.txt
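Before launching the application, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of the project: check_tools is a hypothetical helper, and it assumes the Tesseract executable is named tesseract and that Poppler can be detected through its pdftoppm utility, which pdf2image uses to render pages.

```python
import shutil

# Executable names are assumptions: "tesseract" for Tesseract OCR and
# "pdftoppm" for Poppler (the tool pdf2image calls to render pages).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def check_tools(which=shutil.which):
    """Return {tool_name: bool} telling whether each tool is on PATH."""
    return {name: which(name) is not None for name in REQUIRED_TOOLS}


if __name__ == "__main__":
    for name, found in check_tools().items():
        print(f"{name}: {'found' if found else 'MISSING from PATH'}")
```

If either tool is reported missing, revisit the PATH steps above before running the application.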
The project also includes a simple website in the docs/ directory, which serves as documentation and provides direct links:
- Download App: Links to the latest executable (.exe) file available on GitHub Releases.
- GitHub Page: Links to the main GitHub repository for the project.
- Invite Bot: Links to invite the Discord bot to your server (requires replacing YOUR_CLIENT_ID in the URL with your bot's actual client ID).
- Run the application:
python app.py
- The application has two main tabs:
- Click "Browse" to select a PDF file
- Choose an export directory
- Click "Start OCR" to begin text extraction
- The extracted text will be saved to a text file in the chosen directory
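The OCR steps above roughly correspond to rendering each PDF page to an image and passing it to Tesseract. The sketch below is an illustration under assumptions, not the application's actual code: ocr_pdf and its to_images / to_text parameters are hypothetical names, and the output file name (the PDF's stem plus .txt) is a guess at the naming scheme.

```python
from pathlib import Path


def ocr_pdf(pdf_path, export_dir, to_images=None, to_text=None):
    """Extract text from every page of a PDF and save it as a .txt file.

    to_images / to_text are injectable so the function can be exercised
    without Poppler or Tesseract installed (hypothetical design choice).
    """
    if to_images is None:
        from pdf2image import convert_from_path as to_images  # needs Poppler
    if to_text is None:
        from pytesseract import image_to_string as to_text  # needs Tesseract

    pages = to_images(str(pdf_path))                  # one image per page
    text = "\n\n".join(to_text(page) for page in pages)

    out_path = Path(export_dir) / (Path(pdf_path).stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling ocr_pdf("scan.pdf", "out") with both libraries installed would write out/scan.txt, provided the out directory already exists.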
- Click "Browse" to select a password-protected PDF
- Choose an export directory
- Either:
- Enter the known password and click "Unlock with Password"
- Click "Brute Force Unlock" to attempt to crack the password
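Both unlock modes can be sketched with pikepdf: open the file with a candidate password and save a copy, which pikepdf writes without encryption by default. This is a minimal sketch under assumptions, not the project's implementation: the function names are hypothetical, and the digits-only alphabet is an assumption (the application's real character set is not documented here).

```python
import itertools
import string


def candidate_passwords(alphabet=string.digits, min_len=1, max_len=4):
    """Yield every string over `alphabet` with min_len..max_len characters."""
    for length in range(min_len, max_len + 1):
        for combo in itertools.product(alphabet, repeat=length):
            yield "".join(combo)


def try_unlock(pdf_path, out_path, password):
    """Save an unencrypted copy if `password` opens the PDF; return success."""
    import pikepdf  # imported lazily so the generator works without it

    try:
        with pikepdf.open(pdf_path, password=password) as pdf:
            pdf.save(out_path)  # default save drops the existing encryption
        return True
    except pikepdf.PasswordError:
        return False


def brute_force(pdf_path, out_path, attempt=try_unlock, **candidate_kwargs):
    """Return the first password that works, or None if none does."""
    for password in candidate_passwords(**candidate_kwargs):
        if attempt(pdf_path, out_path, password):
            return password
    return None
```

The search space grows exponentially with length: digits alone give 11,110 candidates for lengths 1 to 4, which is why this approach only suits short, simple passwords.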
- Create a new Discord application and bot in the Discord Developer Portal
- Get your bot token
- Create a .env file in the project root and add your token:
DISCORD_TOKEN=<your-bot-token-here>
- Run the bot:
python bot.py
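For context, a bot script typically reads the token from the .env file with python-dotenv before connecting. The following is a hedged sketch of that step only; read_token and load_token_from_dotenv are hypothetical helpers, and the real bot.py may do this differently.

```python
import os


def read_token(env=None):
    """Return DISCORD_TOKEN from the given mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    token = env.get("DISCORD_TOKEN", "").strip()
    if not token:
        raise RuntimeError("DISCORD_TOKEN is not set; add it to your .env file")
    return token


def load_token_from_dotenv():
    """Load .env into the process environment, then read the token."""
    from dotenv import load_dotenv  # provided by the python-dotenv package

    load_dotenv()  # looks for a .env file and copies its entries into os.environ
    return read_token()
```

Failing early with a clear message is usually easier to debug than a login error from Discord caused by an empty token.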
- pdf help - Show help message
- pdf ocr - Extract text from a PDF file (attach the PDF to your message)
- pdf unlock <password> - Unlock a PDF with a password (attach the PDF)
- pdf bruteforce - Attempt to crack the PDF password (attach the PDF)
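To illustrate how messages of this shape can be mapped to actions, here is a small parsing sketch. It is an assumption-laden example rather than the bot's actual dispatch code: parse_command is a hypothetical helper, and the literal prefix pdf is taken from the command list above (the real bot may use a different prefix or discord.py's built-in command framework).

```python
# Sub-commands taken from the command list above.
COMMANDS = {"help", "ocr", "unlock", "bruteforce"}


def parse_command(content, prefix="pdf"):
    """Split a message into (command, argument), or return None if it is not
    one of the bot's commands. The argument is None when nothing follows."""
    parts = content.strip().split(maxsplit=2)
    if len(parts) < 2 or parts[0].lower() != prefix:
        return None
    name = parts[1].lower()
    if name not in COMMANDS:
        return None
    argument = parts[2] if len(parts) == 3 else None
    return name, argument
```

Using maxsplit=2 keeps everything after the sub-command as a single argument, so a password containing spaces is passed through intact.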
- The brute force feature is limited to simple passwords (length 1-4 characters) by default
- OCR accuracy depends on the quality of the PDF and the Tesseract installation
- The Discord bot creates a temporary directory for processing files, which are automatically cleaned up
- Custom Theme: The GUI application uses a custom theme defined in the themes/ directory. You can modify themes/website_theme.json to customize the application's appearance.
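The temporary-directory behaviour mentioned in the notes above can be illustrated with the standard library. This is a sketch of the general pattern, not the bot's code: process_in_temp_dir and its handler callback are hypothetical names.

```python
import tempfile
from pathlib import Path


def process_in_temp_dir(filename, data, handler):
    """Write an uploaded file into a throwaway directory, run `handler` on
    it, and return the handler's result. The directory and everything in it
    are deleted when the with-block exits, even if the handler raises."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / Path(filename).name  # drop any directory components
        src.write_bytes(data)
        return handler(src)
```

Because cleanup is tied to the with-block, the handler should return its result (for example the extracted text or the unlocked bytes) rather than a path inside the temporary directory.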
This project is licensed under the MIT License - see the LICENSE file for details.