**1. Importing Libraries and Setting up Tesseract Path**
The code begins by importing various essential libraries:

1. pytesseract is used for Optical Character Recognition (OCR) to extract text from images.
2. pandas is for data manipulation, especially working with data from Excel files.
3. cv2 from OpenCV is for image processing, such as loading images and converting color.
4. os is for interacting with the operating system, such as checking file existence.
5. re is for regular expressions, potentially useful for text extraction.
6. easyocr is another OCR tool to extract text.
7. requests is for downloading images from URLs.
8. numpy is for handling image arrays and numerical operations.
9. ThreadPoolExecutor is for parallel processing, allowing multiple images to be processed simultaneously.
10. openpyxl is for handling Excel files.

Finally, the Tesseract OCR command path is set using pytesseract.pytesseract.tesseract_cmd to point to the installed Tesseract executable.

In [None]:
import pytesseract
import pandas as pd
import cv2
import os
import re
import easyocr
import requests
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from openpyxl import load_workbook
# Raw string: each backslash is written once, not doubled
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
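
Hard-coding the Windows install path ties the notebook to one machine. A minimal sketch of a more portable lookup, assuming Tesseract may already be on the system PATH; the helper name `find_tesseract` and its `fallback` parameter are illustrative and not part of the original code:

```python
import shutil

def find_tesseract(command="tesseract",
                   fallback=r"C:\Program Files\Tesseract-OCR\tesseract.exe"):
    """Return the Tesseract executable found on PATH, else the fallback path."""
    # shutil.which returns None when the command is not on PATH
    found = shutil.which(command)
    return found if found is not None else fallback

# Hypothetical usage in the notebook:
# pytesseract.pytesseract.tesseract_cmd = find_tesseract()
```

The fallback keeps the original behavior on a machine where Tesseract is installed in the default Windows location but not added to PATH.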

**2. Extracting Text from Grayscale Image Using Tesseract**

Function: tesocr_img_to_text(gray_img)

This function accepts an image as input (grayscale, color, or rotated; process_image passes all of these) and extracts text using pytesseract.

It uses pytesseract.image_to_string(gray_img) to convert the image to text.

If no text is found (the text is empty or whitespace), it returns "error".

If an exception occurs during the OCR process, the function catches it and returns "error".

In [None]:
# extract text from grayscale image using tesseract
def tesocr_img_to_text(gray_img):

    try:
        text = pytesseract.image_to_string(gray_img)
        if not text.strip():  # Check if the extracted text is empty or whitespace
            return "error"
        return text
    except Exception:
        return "error"  # Return "error" if any exception occurs
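
Returning the string "error" makes a failed OCR call indistinguishable from an image whose text really is the word "error". One possible alternative, sketched here with a hypothetical helper `clean_ocr_text` that is not in the original code, is to normalize the raw OCR string and use None to mean "no text":

```python
from typing import Optional

def clean_ocr_text(raw: Optional[str]) -> Optional[str]:
    """Collapse whitespace in OCR output; return None when nothing readable remains."""
    if raw is None:
        return None
    # split() with no arguments drops every run of whitespace,
    # including newlines and form feeds
    cleaned = " ".join(raw.split())
    return cleaned or None

print(clean_ocr_text("  Net Wt.\n 500 g \x0c"))  # Net Wt. 500 g
print(clean_ocr_text(" \n\x0c"))                 # None
```

A caller could then test `result is None` instead of comparing against a sentinel string.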



**3. Processing and Downloading Images**

Function: process_image(row)
This function processes each image link provided in a row from a dataset.
It starts by downloading the image from the given URL using the requests.get method.
The image is converted into a NumPy array, then decoded with OpenCV’s cv2.imdecode function.

If the decoded image has three channels (BGR), it is converted to grayscale; otherwise it is kept as is.
The function also rotates the original image 90 degrees clockwise and 90 degrees counterclockwise (equivalent to 270 degrees clockwise).

Then, the function uses the tesocr_img_to_text method to extract text from:

Grayscale image (tesocr_text_gray)

Original color image (tesocr_text_color)

Clockwise-rotated image (tesocr_text_cw)

Counterclockwise-rotated image (tesocr_text_ccw)

If the image processing is successful, the function returns a list containing the image link, group ID, entity name, the EasyOCR result for the color image (easyocr_text_color), and the Tesseract results listed above. If any step raises an exception, it prints the error and returns None.

In [None]:
# Function to download and process each image
def process_image(row):
    try:
        image_link = row['image_link']
        group_id = row['group_id']
        entity_name = row['entity_name']

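        response = requests.get(image_link, timeout=30)  # 30 s is an arbitrary choice
        response.raise_for_status()  # Treat HTTP errors (404, 500, ...) as failures
        arr = np.frombuffer(response.content, dtype=np.uint8)
        img = cv2.imdecode(arr, cv2.IMREAD_UNCHANGED)  # Load the image as it is
        if img is None:
            raise ValueError("could not decode image data")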

        # Check the number of channels in the image
        if len(img.shape) == 3 and img.shape[2] == 3:
            # Image has 3 channels (BGR), proceed with color conversion
            gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        else:
            # Image is already grayscale or has a different number of channels
            gray_img = img  # No need for conversion

        # Rotate the image by 90 degrees clockwise
        img_cw_90 = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)

        # Rotate the image by 270 degrees clockwise or 90 degrees counterclockwise
        img_ccw_90 = cv2.rotate(img, cv2.ROTATE_90_COUNTERCLOCKWISE)

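        # Extract text using Tesseract OCR
        tesocr_text_gray = tesocr_img_to_text(gray_img)
        tesocr_text_color = tesocr_img_to_text(img)
        tesocr_text_cw = tesocr_img_to_text(img_cw_90)
        tesocr_text_ccw = tesocr_img_to_text(img_ccw_90)

        # EasyOCR on the original image (assumption: English text).
        # The Reader is cached on the function object so the model loads only once.
        if not hasattr(process_image, "_reader"):
            process_image._reader = easyocr.Reader(['en'])
        easyocr_text_color = " ".join(process_image._reader.readtext(img, detail=0)) or "error"

        return [image_link, group_id, entity_name, easyocr_text_color, tesocr_text_color, tesocr_text_gray, tesocr_text_cw, tesocr_text_ccw]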

    except Exception as e:
        # Return None in case of error, logging can be done outside
        print(f"Error processing image for link {row['image_link']}: {e}")
        return None
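
With cv2.imdecode(arr, -1), a PNG that carries an alpha channel decodes to four channels, and the else branch of the channel check then passes that four-channel array on under the name gray_img. A small sketch of a fuller channel check follows. The helper `gray_conversion_for` is hypothetical; it returns the name of the OpenCV conversion constant instead of calling OpenCV, so it runs without cv2 installed:

```python
def gray_conversion_for(shape):
    """Map an image array shape to the name of the cv2 constant needed to reach grayscale.

    Returns None when the image is already single-channel.
    """
    if len(shape) == 2 or (len(shape) == 3 and shape[2] == 1):
        return None                   # already grayscale
    if len(shape) == 3 and shape[2] == 3:
        return "COLOR_BGR2GRAY"       # standard 3-channel BGR image
    if len(shape) == 3 and shape[2] == 4:
        return "COLOR_BGRA2GRAY"      # BGR plus alpha, e.g. a transparent PNG
    raise ValueError(f"unsupported image shape: {shape}")
```

In process_image this could be used as `code = gray_conversion_for(img.shape)`, followed by `cv2.cvtColor(img, getattr(cv2, code))` whenever `code` is not None.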


**4. Saving Data to an Excel File**

Function: save_to_excel(data, output_file)
This function takes the processed data and saves it to an Excel file.

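If the Excel file already exists, it appends the new rows below the existing data instead of overwriting it, using the openpyxl engine in append mode (mode='a') with if_sheet_exists='overlay', an option that requires pandas 1.4 or later.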

If the file does not exist, it creates a new Excel file and writes the data into it.

In [None]:

def save_to_excel(data, output_file):
    # Create a DataFrame from the collected data
    df = pd.DataFrame(
        data,
        columns=["image_link", "group_id", "entity_name", "easy_color",
                 "tes_color", "tes_gray", "tes_cw", "tes_ccw"],
    )

    # Append or write to the Excel file
    if os.path.exists(output_file):
        with pd.ExcelWriter(output_file, mode='a', engine='openpyxl', if_sheet_exists='overlay') as writer:
            workbook = load_workbook(output_file)
            sheet = workbook.active
            startrow = sheet.max_row  # Get the last row
            df.to_excel(writer, index=False, header=False, startrow=startrow)
    else:
        with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
            df.to_excel(writer, index=False)



**5. Processing Images in Parallel**

Function: process_images_in_parallel(excel_file, output_file)

This is the main function that coordinates downloading, processing, and saving.
Despite its name, the excel_file argument must be a pandas DataFrame that has already been loaded (for example with pd.read_excel), because the function calls .iloc and .shape on it; it does not open the file itself.
It uses ThreadPoolExecutor with up to 8 worker threads to process rows in parallel, calling process_image on each row.
Successful results are collected in a list, and a progress message is printed for every 10 of them. Since the executor.map results are consumed into a list first, these messages appear only after all images have been processed.
Finally, the results are saved to the output Excel file in a single call to save_to_excel.

In [None]:
# Main process
def process_images_in_parallel(excel_file, output_file):
    data = []
    processed_count = 0  # Track the number of successfully processed images

    # Use ThreadPoolExecutor for parallel processing
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(process_image, [excel_file.iloc[i] for i in range(excel_file.shape[0])]))

    # Filter out failed results (None)
    for res in results:
        if res is not None:
            data.append(res)
            processed_count += 1

            # Track progress every 10 images
            if processed_count % 10 == 0:
                print(f"{processed_count} images processed successfully.")

    # Save all results to Excel in one go
    save_to_excel(data, output_file)

    print(f'Completed processing {processed_count} images.')
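
To report progress while the work is still running, the tasks can be submitted individually and consumed with as_completed rather than through list(executor.map(...)). The following is a sketch under the assumption that the order of results does not matter; `run_with_progress` and its parameters are illustrative names, not part of the original notebook:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_with_progress(func, items, max_workers=8, every=10):
    """Apply func to each item in a thread pool, reporting progress as results arrive."""
    data = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(func, item) for item in items]
        # as_completed yields each future as soon as it finishes,
        # not in submission order
        for future in as_completed(futures):
            result = future.result()
            if result is None:  # same convention as process_image: None means failure
                continue
            data.append(result)
            if len(data) % every == 0:
                print(f"{len(data)} items processed successfully.")
    return data
```

With the notebook's functions, a call might look like `run_with_progress(process_image, [row for _, row in df.iterrows()])`, where `df` is the loaded input DataFrame.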