## **1. Library Installation:**
**pytesseract:** This is a Python wrapper for the Tesseract OCR engine, which performs text recognition on images.

**Pillow**: A Python library for handling images, which allows us to open and manipulate image files.

The **sudo apt-get install tesseract-ocr** command installs the Tesseract OCR engine itself on your system.

**tesseract --version** verifies the installation, ensuring Tesseract is installed correctly and ready to use.


In [7]:
# Installing necessary libraries for OCR
!pip install pytesseract Pillow
!sudo apt-get install tesseract-ocr
!tesseract --version

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
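
Outside a notebook, the same check can be done from Python before any OCR call is made. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version_line` are hypothetical and not part of pytesseract.

```python
import shutil
import subprocess

def find_tesseract():
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")

def tesseract_version_line():
    """Return the first line of `tesseract --version`, or None if the binary is missing."""
    binary = find_tesseract()
    if binary is None:
        return None
    result = subprocess.run([binary, "--version"], capture_output=True, text=True)
    # Older Tesseract releases print the version banner to stderr, so check both streams.
    banner = (result.stdout or result.stderr).strip()
    return banner.splitlines()[0] if banner else None

version = tesseract_version_line()
print(version if version else "Tesseract was not found on PATH")
```

A `None` result usually means the engine itself is missing, even when the `pytesseract` wrapper has been installed with pip.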


## **2. Import Statements**:
**pytesseract:** Provides access to Tesseract’s OCR capabilities, enabling text extraction from images.

**Image** (from **PIL**): Allows opening and handling images in formats like PNG, JPEG, etc., so that they can be processed by pytesseract.

**os:** Used for file and directory management, like checking file existence and creating directories.

**json**: Helps format the extracted data into JSON for structured output.


In [8]:
import pytesseract
from PIL import Image
import os
import json

## **3. Function Definition:**
**extract_text_from_images:** This function handles the entire process of text extraction and saving the results.

**Parameters:**

***file_paths:*** A list of paths for image files that need OCR processing.

***output_dir:*** Specifies the directory where the .txt files with extracted text will be saved. By default, it’s set to **/content/ocr_results**.



## **4. OCR Text Extraction Helper:**
This helper function **ocr_text_extraction** processes individual image files:

**File Type Check:** Checks whether the file path ends with a supported image extension (.png, .jpg, or .jpeg).

**Opening Image:** Image.open(file_path) opens the image file so it can be processed.

**OCR Application:** pytesseract.image_to_string(img) applies OCR to the image and extracts the text.

**Return Text:** The extracted text is returned as a string.

If the file is not in a supported format, it raises a ValueError. The processing loop does not catch this exception, so a single unsupported file stops the whole batch.
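
A suffix check on the raw path string is case-sensitive, so a file named `SCAN.PNG` would not match `.png`. The sketch below shows one case-insensitive way to perform the check. The names `SUPPORTED_EXTENSIONS` and `is_supported_image` are hypothetical and used only for illustration.

```python
import os

# Extensions accepted by the OCR helper, stored in lowercase.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg"}

def is_supported_image(file_path):
    """Return True if the path ends with a supported extension, ignoring case."""
    extension = os.path.splitext(file_path)[1].lower()
    return extension in SUPPORTED_EXTENSIONS

for candidate in ["/content/poem.jpg", "/content/SCAN.PNG", "/content/notes.pdf"]:
    print(candidate, is_supported_image(candidate))
```

This only inspects the file name. It does not confirm that the file actually contains valid image data.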



## **5. Output Directory Creation:**
Checks if the output directory **(output_dir)** exists. If it doesn’t:

**os.makedirs(output_dir, exist_ok=True)** creates the directory.

**exist_ok=True** ensures that no error is raised if the directory already exists.

A message is printed to confirm directory creation.
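
Because `exist_ok=True` already tolerates an existing directory, the separate existence check mainly decides whether the confirmation message is printed. A small sketch of that behaviour, run against a temporary directory; the helper name `ensure_dir` is hypothetical.

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` if needed; return True only if this call created it."""
    existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # no error if the directory is already there
    return not existed

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "ocr_results")
    first = ensure_dir(target)   # directory did not exist yet
    second = ensure_dir(target)  # directory exists now, call is a no-op
    print(first, second)
```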



## **6. Data Dictionary Initialization:**
**all_extracted_data:** A dictionary that will store the extracted text for each file. This will later be used to create a JSON summary of all processed files.

## **7. File Processing Loop:**
**The for loop** iterates through each **file path** in file_paths, allowing us to handle multiple files in one function call.


## **8. File Existence Check and Processing:**
**os.path.exists(file_path):** Checks if the file exists at the given path.

If it does:

A message is printed indicating the file being processed.

**raw_text = ocr_text_extraction(file_path)** calls the helper function to extract text from the image.

**all_extracted_data[file_path]** stores the extracted text for the file in the dictionary. If no text is found, it stores **"No text found"** instead.
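
An empty-string test alone would miss OCR output that consists only of whitespace. It is an assumption here, not something shown in this notebook's output, that Tesseract can return newlines or a form-feed character for a blank image. Under that assumption, a minimal sketch of a stricter fallback looks like this; the helper name `text_or_placeholder` is hypothetical.

```python
def text_or_placeholder(raw_text, placeholder="No text found"):
    """Return the OCR text unchanged, or the placeholder if it holds no visible characters."""
    # str.strip() removes spaces, newlines, and form feeds, so whitespace-only text becomes "".
    return raw_text if raw_text.strip() else placeholder

print(repr(text_or_placeholder("Total: 12.50\n")))
print(repr(text_or_placeholder("\n\x0c")))
print(repr(text_or_placeholder("")))
```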


## **9. Saving Extracted Text to .txt File:**
**output_file_path:** Constructs the output path for the .txt file based on the original file’s name.

*os.path.basename(file_path):* Extracts the file name from the path.

*os.path.splitext(...)[0] + '.txt':* Removes the file extension and appends .txt.

*os.path.join(output_dir, ...):* Combines the output directory and the file name to get the full output path.
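
The three calls above can be traced step by step on one of the example paths. This is a sketch with a hypothetical helper name, `build_output_path`, that mirrors the single-line expression used in the function.

```python
import os

def build_output_path(file_path, output_dir):
    """Map an image path to the .txt path used for its OCR result."""
    file_name = os.path.basename(file_path)   # 'poem.jpg'
    stem = os.path.splitext(file_name)[0]     # 'poem'
    return os.path.join(output_dir, stem + ".txt")

print(build_output_path("/content/poem.jpg", "/content/ocr_results"))
```

Since only the file name is kept, two inputs that share a stem, such as `a/scan.png` and `b/scan.jpg`, map to the same output file, and the later one overwrites the earlier one.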

**File Writing:**

Attempts to write the extracted text to the .txt file.

If successful, it prints a confirmation message.

If an error occurs, it prints an error message detailing the issue.
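
The write-and-report pattern can be isolated as follows. This sketch assumes a hypothetical helper, `save_text`, and differs from the function in two labeled ways: it passes an explicit `encoding="utf-8"` so non-ASCII OCR output is written the same way on every platform, and it catches `OSError` rather than every exception.

```python
import os
import tempfile

def save_text(text, output_file_path):
    """Write text to a file; return True on success, False on an OS-level error."""
    try:
        with open(output_file_path, "w", encoding="utf-8") as txt_file:
            txt_file.write(text)
        print(f"Saved extracted text to: {output_file_path}")
        return True
    except OSError as e:
        print(f"Error saving file {output_file_path}: {e}")
        return False

with tempfile.TemporaryDirectory() as tmp:
    ok = save_text("café\n", os.path.join(tmp, "poem.txt"))
    # The parent directory does not exist, so this write fails and is reported.
    failed = save_text("x", os.path.join(tmp, "no_such_dir", "poem.txt"))
```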

## **10. Error Handling for Missing Files:**
If the file doesn’t exist at the specified path:

An error message is stored in **all_extracted_data** under the respective file path.

A message is printed to notify that the file was not found.

## **11. Generate and Return JSON Output:**
**json_output = json.dumps(all_extracted_data, indent=4):** Converts the all_extracted_data dictionary to a JSON-formatted string, which makes it easy to read and process.

The JSON output is returned, containing the OCR results or error messages for each file.
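
To make the shape of the result concrete, here is a small sketch of the serialization step. The dictionary contents are made-up stand-ins, not real OCR output from this notebook.

```python
import json

# Illustrative data only: one successful entry and one missing-file entry.
all_extracted_data = {
    "/content/poem.jpg": {"extracted_text": "Roses are red"},
    "/content/missing.png": {"error": "File not found"},
}

json_output = json.dumps(all_extracted_data, indent=4)
print(json_output)

# By default json.dumps escapes non-ASCII characters; ensure_ascii=False keeps them readable.
escaped = json.dumps({"extracted_text": "café"})
readable = json.dumps({"extracted_text": "café"}, ensure_ascii=False)
```

Passing `ensure_ascii=False` is optional. It only changes how accented or non-Latin characters appear in the string, not the data recovered by `json.loads`.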

In [9]:
# Define the function for OCR text extraction from multiple images
def extract_text_from_images(file_paths, output_dir='/content/ocr_results'):
    # Helper function to apply OCR directly on an image file
    def ocr_text_extraction(file_path):
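        # Compare in lowercase so extensions such as .PNG or .JPG are accepted too
        if file_path.lower().endswith(('.png', '.jpg', '.jpeg')):
            # The context manager closes the image file once OCR has finished
            with Image.open(file_path) as img:
                return pytesseract.image_to_string(img)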
        else:
            raise ValueError("Unsupported file format. Use an image file.")

    # Ensure output directory exists
    if not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)
        print(f"Created output directory: {output_dir}")

    # Dictionary to hold extracted data for each file
    all_extracted_data = {}

    # Process each file in the list
    for file_path in file_paths:
        # Check if file exists
        if os.path.exists(file_path):
            print(f"Processing file: {file_path}")
            # Extract text using OCR
            raw_text = ocr_text_extraction(file_path)

            # Store structured data in a dictionary; whitespace-only output counts as no text
            all_extracted_data[file_path] = {
                "extracted_text": raw_text if raw_text.strip() else "No text found"
            }

            # Save each extracted text in a separate .txt file with preserved formatting
            output_file_path = os.path.join(output_dir, os.path.splitext(os.path.basename(file_path))[0] + '.txt')
            try:
                with open(output_file_path, 'w', encoding='utf-8') as txt_file:
                    txt_file.write(raw_text)  # Write the raw OCR text without formatting changes
                print(f"Saved extracted text to: {output_file_path}")
            except Exception as e:
                print(f"Error saving file {output_file_path}: {e}")

        else:
            all_extracted_data[file_path] = {"error": "File not found"}
            print(f"File not found: {file_path}")

    # Convert all data to JSON format
    json_output = json.dumps(all_extracted_data, indent=4)
    return json_output

## **12. Example Usage:**
**file_paths:** A list of image file paths to be processed.

**extracted_json = extract_text_from_images(file_paths):** Calls the function with **file_paths**, returning the JSON-formatted results.

**print(extracted_json):** Prints the JSON output, which provides a summary of extracted text and any errors encountered.
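
Because the function returns a JSON string rather than a dictionary, a caller that wants to work with the results has to parse it first. The sketch below separates successful entries from errors. The helper name `split_results` and the sample string are hypothetical.

```python
import json

def split_results(json_output):
    """Parse the returned JSON string into (texts, errors) dictionaries keyed by file path."""
    results = json.loads(json_output)
    texts = {path: entry["extracted_text"]
             for path, entry in results.items() if "extracted_text" in entry}
    errors = {path: entry["error"]
              for path, entry in results.items() if "error" in entry}
    return texts, errors

# Made-up sample in the same shape as the function's return value.
sample = json.dumps({
    "/content/story.png": {"extracted_text": "Once upon a time"},
    "/content/absent.jpg": {"error": "File not found"},
})
texts, errors = split_results(sample)
print(f"{len(texts)} file(s) with text, {len(errors)} error(s)")
```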

In [10]:
# Example usage
file_paths = [
    '/content/application.png', '/content/docImg.jpg', '/content/poem.jpg',
    '/content/receipt.png', '/content/scannedImg.jpg', '/content/scannedImg2.jpg',
    '/content/scannedImg3.jpg', '/content/story.png', '/content/wallmartReceipt.png',
]
extracted_json = extract_text_from_images(file_paths)
print(extracted_json)

Created output directory: /content/ocr_results
Processing file: /content/application.png
Saved extracted text to: /content/ocr_results/application.txt
Processing file: /content/docImg.jpg
Saved extracted text to: /content/ocr_results/docImg.txt
Processing file: /content/poem.jpg
Saved extracted text to: /content/ocr_results/poem.txt
Processing file: /content/receipt.png
Saved extracted text to: /content/ocr_results/receipt.txt
Processing file: /content/scannedImg.jpg
Saved extracted text to: /content/ocr_results/scannedImg.txt
Processing file: /content/scannedImg2.jpg
Saved extracted text to: /content/ocr_results/scannedImg2.txt
Processing file: /content/scannedImg3.jpg
Saved extracted text to: /content/ocr_results/scannedImg3.txt
Processing file: /content/story.png
Saved extracted text to: /content/ocr_results/story.txt
Processing file: /content/wallmartReceipt.png
Saved extracted text to: /content/ocr_results/wallmartReceipt.txt
{
    "/content/application.png": {
        "extracted_

## **Result**
The function processes a list of images in one call, applies OCR to each one, saves each result as a separate .txt file, and returns a JSON summary of the extracted text along with an error entry for every file that was not found.