HappySR/OCR-Image-Text-Extraction

🧠 OCR Image Text Extractor

A Python-based Optical Character Recognition (OCR) tool to extract text from images using Tesseract OCR. This project supports both PIL-based and OpenCV-based image preprocessing for improved accuracy.


🚀 Features

  • Extracts text from images using pytesseract
  • Dual preprocessing methods: PIL (Python Imaging Library) and OpenCV
  • Contrast enhancement, binarization, blurring, dilation/erosion for noise reduction
  • Supports multiple languages via Tesseract's language packs
  • Command-line interface for quick usage
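The preprocessing code in ocr_script.py is not reproduced in this README, so the following is only a minimal sketch of what a PIL-based pipeline along these lines could look like. The function name `preprocess_pil` and its default contrast factor and threshold are illustrative assumptions, not the project's actual values.

```python
from PIL import Image, ImageEnhance, ImageFilter, ImageOps


def preprocess_pil(image: Image.Image, contrast: float = 2.0,
                   threshold: int = 128) -> Image.Image:
    """Hypothetical PIL pipeline: grayscale, contrast boost, denoise, binarize."""
    # Work on a single luminance channel.
    gray = ImageOps.grayscale(image)
    # Factors above 1.0 push dark and light pixels further apart.
    boosted = ImageEnhance.Contrast(gray).enhance(contrast)
    # A small median filter removes isolated speckles while keeping edges.
    smoothed = boosted.filter(ImageFilter.MedianFilter(size=3))
    # Map every pixel to pure black or pure white (1-bit output image).
    return smoothed.point(lambda p: 255 if p > threshold else 0, mode="1")
```

The returned image can be passed directly to `pytesseract.image_to_string`. A fixed threshold is the simplest choice; unevenly lit scans may need a different value.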

🛠️ Requirements

  • Python 3.6+
  • Tesseract OCR (must be installed, and either available on your PATH or passed explicitly with --tesseract)

Install dependencies:

pip install -r requirements.txt

⚙️ Installation

  1. Clone the repository:
git clone https://github.com/HappySR/OCR-Image-Text-Extraction.git
cd OCR-Image-Text-Extraction
  2. Install dependencies:
pip install -r requirements.txt
  3. Install Tesseract OCR:
  • Ubuntu/Debian:
sudo apt update && sudo apt install tesseract-ocr
  • macOS (Homebrew):
brew install tesseract
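After installing, it can be useful to confirm that the Tesseract executable is actually reachable. The helper below is a small sketch using only the standard library; `find_tesseract` is a hypothetical name and is not part of ocr_script.py.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_tesseract(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return a usable path to the Tesseract executable, or None if not found."""
    if explicit_path:
        # An explicit path only counts if it points at an existing file.
        return explicit_path if Path(explicit_path).is_file() else None
    # Otherwise search the directories listed in PATH.
    return shutil.which("tesseract")


if __name__ == "__main__":
    location = find_tesseract()
    print(location or "tesseract not found on PATH; pass its location explicitly")
```

Running `tesseract --version` in a terminal gives the same confirmation without Python.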

🖼️ Usage

python ocr_script.py path/to/image.jpg

Optional arguments:

  • --tesseract or -t: Path to the Tesseract executable (required if not in PATH)
  • --lang or -l: Language for OCR (default: eng)
  • --no-cv2: Use PIL-based preprocessing instead of OpenCV
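For readers who want to see how options like these are typically declared, here is a minimal `argparse` sketch that mirrors the arguments listed above. It is an assumption about how such a CLI could be set up, not a copy of the code in ocr_script.py; `build_parser` and the help texts are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="ocr_script.py",
        description="Extract text from an image with Tesseract OCR.",
    )
    parser.add_argument("image", help="path to the input image")
    parser.add_argument("-t", "--tesseract", default=None,
                        help="path to the Tesseract executable (if not in PATH)")
    parser.add_argument("-l", "--lang", default="eng",
                        help="language for OCR (default: eng)")
    # argparse stores this flag as args.no_cv2.
    parser.add_argument("--no-cv2", action="store_true",
                        help="use PIL-based preprocessing instead of OpenCV")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```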

Example:

python ocr_script.py sample.png --lang eng --tesseract "C:/Program Files/Tesseract-OCR/tesseract.exe"
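The sketch below shows one way the `--lang` and `--tesseract` values from such a command could be handed to pytesseract. The functions `missing_languages` and `run_ocr` are hypothetical and only illustrate the idea; the real OCRProcessor class may be organized differently. The calls `pytesseract.get_languages` and `pytesseract.image_to_string`, and the `pytesseract.pytesseract.tesseract_cmd` attribute, are part of pytesseract.

```python
from typing import Iterable, List, Optional


def missing_languages(requested: str, installed: Iterable[str]) -> List[str]:
    """Return the codes in a Tesseract lang string ("eng+deu") that are not installed."""
    available = set(installed)
    return [code for code in requested.split("+") if code and code not in available]


def run_ocr(image_path: str, lang: str = "eng",
            tesseract_cmd: Optional[str] = None) -> str:
    """Hypothetical wrapper: configure pytesseract, check languages, run OCR."""
    # Imported lazily so missing_languages stays usable without the OCR stack.
    import pytesseract
    from PIL import Image

    if tesseract_cmd:
        # pytesseract reads the executable location from this module attribute.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    missing = missing_languages(lang, pytesseract.get_languages(config=""))
    if missing:
        raise ValueError(f"Tesseract language data not installed: {', '.join(missing)}")

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Tesseract accepts several languages joined with `+`, for example `eng+deu`, which is why the helper splits on that character.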

📂 Project Structure

OCR-Image-Text-Extraction/
├── ocr_script.py           # Main script with OCRProcessor class and CLI
├── README.md               # README file
└── requirements.txt        # Required Python packages

📌 Notes

  • OpenCV-based preprocessing generally yields better OCR accuracy for noisy images.
  • Tesseract supports multiple languages, but the relevant language data must be installed.
  • For best results, ensure the input image has good contrast and minimal background noise.

🧪 Sample Output

--- OCR Result ---
This is a sample text extracted
from an image using OCR!
------------------

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🙌 Acknowledgements

About

OCR Text Extraction using Pillow, Tesseract, OpenCV, Numpy.
