<a href="https://github.com/KiarashKiani79" style="text-decoration: none; color: inherit;">
  <img src="https://skillicons.dev/icons?i=github" alt="GitHub"/>
  KiarashKiani79
</a>
<img src="https://user-images.githubusercontent.com/73097560/115834477-dbab4500-a447-11eb-908a-139a6edaec5c.gif">

#### Optical Character Recognition (OCR)
This project extracts structured text from images with Tesseract and uses OpenCV to make the results more accurate. The workflow is: download and install Tesseract and its dependencies, load the desired image, and run Tesseract to extract the text. For complex images, OpenCV is used to improve the result. The image processing operations include noise removal, thresholding, erosion, and drawing rectangles around specific patterns or characters. Organizations can use this automated Optical Character Recognition (OCR) method to pull useful information out of images, and individuals can use it to avoid retyping text by hand, which eases the burden of document analysis. Let's get started with the project.


In [5]:
import requests
import os

# clear_output hides noisy installation logs in the notebook
from IPython.display import clear_output

In [7]:
# Get the current directory
current_dir = os.getcwd()

# Define the path to the model folder in the current directory
model_dir = os.path.join(current_dir, 'tesseract-ocr_model')

# Create the model directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

# Define the path to the ind.traineddata file in the model directory
path = os.path.join(model_dir, 'ind.traineddata')


First, we download the `ind.traineddata` file from the Tesseract OCR GitHub repository and write it to the path defined above. This file is a trained language model (the `ind` code is Indonesian) that Tesseract uses when recognizing text in images. The pytesseract library is a Python wrapper around the Tesseract engine, so it relies on this data as well.

In [8]:
# Download the model from the official repository
r = requests.get("https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/ind.traineddata",
                 stream=True, timeout=60)

# Stop here if the server returned an error instead of the model file
r.raise_for_status()

# Write the response to the model path in 1 KB blocks
with open(path, 'wb') as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)
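
Tesseract only uses the downloaded file if it knows where to look for it. One common way to tell it, sketched below, is to pass Tesseract's `--tessdata-dir` option through pytesseract's `config` argument. The helper name `tessdata_config` is hypothetical, and `lang="ind"` assumes the language code matches the file name prefix.

```python
import os


def tessdata_config(model_dir: str) -> str:
    """Build the Tesseract option string that selects a custom tessdata folder."""
    # Quoting keeps the path in one piece if it contains spaces.
    return f'--tessdata-dir "{os.path.abspath(model_dir)}"'


config = tessdata_config("tesseract-ocr_model")

# Later, once pytesseract is imported and an image is loaded:
# text = pytesseract.image_to_string(image, lang="ind", config=config)
```

Setting the `TESSDATA_PREFIX` environment variable to the same folder is an alternative that avoids repeating the option on every call.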

In [9]:
# Install the system libraries required for optical character recognition
# (-y answers the confirmation prompt so the cell does not stall)
! apt install -y tesseract-ocr libtesseract-dev libmagickwand-dev

clear_output()

`tesseract-ocr`: This is the name of the package for Tesseract, an open-source optical character recognition (OCR) engine. It can be used to recognize text in images.

`libtesseract-dev`: This is the development package for Tesseract. It includes the header files and libraries needed to compile programs that use Tesseract.

`libmagickwand-dev`: This is the development package for ImageMagick’s MagickWand, a C interface to ImageMagick (a software suite to create, edit, and compose bitmap images). It allows you to work with images in your code.
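
Because `clear_output()` hides the installation log, a failed install can go unnoticed. The sketch below is one way to confirm that the `tesseract` executable ended up on the `PATH`. The helper name `find_tesseract` is made up for this example.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


location = find_tesseract()
if location is None:
    print("tesseract not found on PATH; the apt step may have failed")
else:
    print(f"tesseract found at {location}")
```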

In [10]:
# Install the Python packages: pytesseract, Wand (ImageMagick binding), and OpenCV
! pip install pytesseract wand opencv-python
clear_output()

In [11]:
# Import libraries
from PIL import Image  # Library for image processing
import pytesseract  # Library for Optical Character Recognition (OCR)
import cv2  # OpenCV library for computer vision tasks
import numpy as np  # Library for numerical operations
from pytesseract import Output  # Output utility for pytesseract
import re  # Regular expressions library for string matching and manipulation
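
The `Output` and `re` imports are typically combined to locate words that match a pattern, so that rectangles can be drawn around them. The sketch below assumes the dictionary layout returned by `pytesseract.image_to_data(image, output_type=Output.DICT)` (parallel lists under keys such as `text`, `conf`, `left`, `top`, `width`, `height`). The helper name `matching_boxes` and the sample values are invented for illustration.

```python
import re
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)


def matching_boxes(data: Dict[str, list], pattern: str, min_conf: float = 0.0) -> List[Box]:
    """Return bounding boxes of recognized words that fully match `pattern`."""
    regex = re.compile(pattern)
    boxes = []
    for i, word in enumerate(data["text"]):
        # Tesseract reports conf = -1 for layout rows that carry no word.
        if float(data["conf"][i]) < min_conf:
            continue
        if regex.fullmatch(word.strip()):
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


# Hand-written sample in the shape of image_to_data(..., output_type=Output.DICT).
sample = {
    "text": ["", "Invoice", "12/03/2021", "total"],
    "conf": ["-1", "96", "91", "88"],
    "left": [0, 10, 120, 10],
    "top": [0, 12, 12, 40],
    "width": [300, 90, 110, 50],
    "height": [80, 20, 20, 18],
}
date_boxes = matching_boxes(sample, r"\d{2}/\d{2}/\d{4}")
```

Each returned box `(x, y, w, h)` can then be drawn on the image with `cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)`.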