A Python-based Optical Character Recognition (OCR) tool to extract text from images using Tesseract OCR. This project supports both PIL-based and OpenCV-based image preprocessing for improved accuracy.
- Extracts text from images using pytesseract
- Dual preprocessing methods: PIL (Python Imaging Library) and OpenCV
- Contrast enhancement, binarization, blurring, dilation/erosion for noise reduction
- Supports multiple languages via Tesseract's language packs
- Command-line interface for quick usage
- Python 3.6+
- Tesseract OCR (must be installed, and either available on your PATH or passed to the script with --tesseract)
- Python packages listed in requirements.txt, installed with:
  pip install -r requirements.txt
- Clone the repository:
  git clone https://github.com/HappySR/ocr-image-text-extractor.git
  cd ocr-image-text-extractor
- Install dependencies:
  pip install -r requirements.txt
- Install Tesseract OCR:
  - Ubuntu/Debian:
    sudo apt update && sudo apt install tesseract-ocr
  - Mac (Homebrew):
    brew install tesseract
  - Windows:
    Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
    Note the path to tesseract.exe for use in the script.
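After installation it can help to confirm that the Tesseract executable is actually reachable. The following is a minimal sketch, not code taken from ocr_script.py; the helper name `find_tesseract` is hypothetical and uses only the Python standard library.

```python
import os
import shutil


def find_tesseract(explicit_path=None):
    """Return a usable Tesseract executable path, or None if none is found.

    Hypothetical helper for illustration; not part of ocr_script.py.
    """
    if explicit_path:
        # An explicit path (for example the Windows tesseract.exe location)
        # is accepted only if it points at an existing file.
        return explicit_path if os.path.isfile(explicit_path) else None
    # Otherwise fall back to searching the directories listed in PATH.
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    print(path if path else "Tesseract not found on PATH; pass its location explicitly.")
```

If a path is found, it can be assigned to `pytesseract.pytesseract.tesseract_cmd`, which is the setting pytesseract reads to locate the executable.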
python ocr_script.py path/to/image.jpg
- --tesseract or -t: Path to the Tesseract executable (required if not in PATH)
- --lang or -l: Language for OCR (default: eng)
- --no-cv2: Use PIL-based preprocessing instead of OpenCV
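The options above map naturally onto `argparse`. The sketch below shows one way such a command-line interface could be declared; the function name `build_parser` and the help strings are assumptions, and the actual parser in ocr_script.py may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Extract text from an image with Tesseract OCR."
    )
    parser.add_argument("image", help="Path to the input image")
    parser.add_argument(
        "-t", "--tesseract", default=None,
        help="Path to the Tesseract executable (needed if it is not in PATH)",
    )
    parser.add_argument(
        "-l", "--lang", default="eng",
        help="Language for OCR (default: eng)",
    )
    parser.add_argument(
        "--no-cv2", action="store_true",
        help="Use PIL-based preprocessing instead of OpenCV",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse exposes --no-cv2 as the attribute no_cv2.
    print(args.image, args.lang, args.tesseract, args.no_cv2)
```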
python ocr_script.py sample.png --lang eng --tesseract "C:/Program Files/Tesseract-OCR/tesseract.exe"
ocr-image-text-extractor/
├── ocr_script.py # Main script with OCRProcessor class and CLI
├── README.md # README file
├── requirements.txt # Required Python packages
- OpenCV-based preprocessing generally yields better OCR accuracy for noisy images.
- Tesseract supports multiple languages, but the relevant language data must be installed.
- For best results, ensure the input image has good contrast and minimal background noise.
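To make the two preprocessing routes more concrete, here is a rough sketch of what a PIL-based and an OpenCV-based pipeline might look like. The function names, the contrast factor, the threshold, and the kernel size are all illustrative assumptions rather than the values used in ocr_script.py. Pillow and OpenCV are imported inside the functions so that only the route in use needs to be installed.

```python
def preprocess_pil(img, threshold=128):
    """PIL route (sketch): grayscale, contrast boost, median blur, binarization.

    `img` is a PIL.Image.Image; the threshold of 128 is an assumed default.
    """
    from PIL import ImageEnhance, ImageFilter, ImageOps

    gray = ImageOps.grayscale(img)
    gray = ImageEnhance.Contrast(gray).enhance(2.0)       # assumed contrast factor
    gray = gray.filter(ImageFilter.MedianFilter(size=3))  # light noise reduction
    # Fixed-threshold binarization: every pixel becomes 0 or 255.
    return gray.point(lambda p: 255 if p > threshold else 0)


def preprocess_cv2(bgr):
    """OpenCV route (sketch): grayscale, median blur, Otsu binarization, opening.

    `bgr` is a NumPy array in OpenCV's BGR channel order, e.g. from cv2.imread.
    """
    import cv2

    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.medianBlur(gray, 3)
    # Otsu's method picks the threshold from the image histogram.
    _, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    # Opening is erosion followed by dilation; it removes small bright specks.
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
```

The main difference is in the binarization step: the PIL sketch uses a fixed threshold, while the OpenCV sketch lets Otsu's method choose one per image, which is one plausible reason the OpenCV route copes better with noisy or unevenly lit input.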
--- OCR Result ---
This is a sample text extracted
from an image using OCR!
------------------
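Output in the shape shown above could be produced roughly as follows. This is a sketch under assumptions: `run_ocr` and `format_result` are hypothetical names, and only `pytesseract.image_to_string` and `PIL.Image.open` are real library calls.

```python
def format_result(text):
    """Wrap recognized text in the banner used in the sample output (sketch)."""
    header = "--- OCR Result ---"
    footer = "-" * len(header)
    return "\n".join([header, text.strip(), footer])


def run_ocr(image_path, lang="eng"):
    """Run Tesseract on an image file and return the recognized text.

    Requires Pillow, pytesseract, and a Tesseract installation; they are
    imported lazily so that format_result can be used on its own.
    """
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


if __name__ == "__main__":
    import sys

    print(format_result(run_ocr(sys.argv[1])))
```

Tesseract accepts several languages joined with a plus sign, for example `lang="eng+deu"`, provided the corresponding language data is installed.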
This project is licensed under the MIT License. See the LICENSE file for details.