# OCR Pipeline Module
**Module 1: Receipt/Invoice Text Extraction**

---

**Input:** Receipts, invoices, bank statements (`.png`, `.jpg`, `.pdf`)  
**Output:** Raw text (one string per input file)  
**Tool:** Tesseract OCR

---

### Module Overview
This notebook handles the first step of the accounting automation pipeline:
1. Upload receipt/invoice images or PDFs
2. Extract text using Tesseract OCR
3. Output raw text for the next module (Data Extraction)

---

## 1. Installation & Setup
Install Tesseract OCR, Poppler (needed by `pdf2image` for PDF support), and the required Python libraries. The commands below assume a Google Colab runtime.

In [None]:
# Install Tesseract OCR and required libraries
!apt-get install -y tesseract-ocr
!pip install -q pytesseract pillow pdf2image

# Install poppler for PDF support
!apt-get install -y poppler-utils

print("Tesseract OCR installed successfully")

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.12).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.
Tesseract OCR installed successfully
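
The install cell prints its success message whether or not the commands succeeded, so a quick check of what is actually visible from the Python session can help. The following is a minimal sketch using only the standard library; the helper name `check_ocr_dependencies` is hypothetical, and `pdftoppm` is checked because it is the Poppler binary that `pdf2image` calls by default.

```python
import importlib.util
import shutil


def check_ocr_dependencies():
    """Report which OCR dependencies are visible from this Python session."""
    return {
        # System binaries installed with apt-get
        "tesseract": shutil.which("tesseract") is not None,
        "pdftoppm (poppler-utils)": shutil.which("pdftoppm") is not None,
        # Python packages installed with pip
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "PIL (pillow)": importlib.util.find_spec("PIL") is not None,
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
    }


status = check_ocr_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Any entry reported as `MISSING` suggests re-running the install cell before continuing.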


## 2. Import Libraries

In [None]:
import os

import pytesseract
from google.colab import files
from pdf2image import convert_from_path
from PIL import Image

## 3. OCR Function
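Core function to extract text from images and PDFs. PDFs are converted to one image per page, and each page's text is prefixed with a `--- Page N ---` marker. On failure the function returns a message starting with `Error` instead of raising an exception.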

In [None]:
def extract_text_from_file(file_path):
    """
    Extract text from image or PDF file using Tesseract OCR.

    Args:
        file_path (str): Path to the image or PDF file

    Returns:
        str: Extracted raw text, or a message starting with "Error"
            if the format is unsupported or processing fails
    """
    file_ext = os.path.splitext(file_path)[1].lower()

    try:
        if file_ext == '.pdf':
            # Convert each PDF page to an image, then OCR page by page
            images = convert_from_path(file_path)
            pages = []
            for i, image in enumerate(images, start=1):
                page_text = pytesseract.image_to_string(image)
                pages.append(f"\n--- Page {i} ---\n{page_text}")
            return "".join(pages).strip()

        elif file_ext in ['.png', '.jpg', '.jpeg']:
            # Open the image in a context manager so the file handle is closed
            with Image.open(file_path) as image:
                text = pytesseract.image_to_string(image)
            return text.strip()

        else:
            return f"Error: Unsupported file format '{file_ext}'"

    except Exception as e:
        return f"Error processing file: {str(e)}"

print("OCR function defined")

OCR function defined
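
For PDFs, the function above joins all pages into one string separated by `--- Page N ---` markers. A downstream step may want the pages back individually. The sketch below shows one way to do that with a regular expression; `split_pages` and `PAGE_MARKER` are hypothetical names, not part of the notebook, and the approach assumes the OCR text itself never contains a line identical to a marker.

```python
import re

# Matches the marker lines written by extract_text_from_file for PDF input
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)


def split_pages(raw_text):
    """Split multi-page OCR output into {page_number: text}.

    Text without markers (image input) is returned as a single page 1.
    """
    parts = PAGE_MARKER.split(raw_text)
    if len(parts) == 1:
        return {1: raw_text.strip()}

    pages = {}
    # With one capture group, split() yields [prefix, "1", text1, "2", text2, ...]
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages


sample = "--- Page 1 ---\nINVOICE 001\n--- Page 2 ---\nTOTAL 23.09"
print(split_pages(sample))
# → {1: 'INVOICE 001', 2: 'TOTAL 23.09'}
```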


## 4. Upload Files
Upload your receipts, invoices, or bank statements here

In [None]:
print("Click 'Choose Files' to upload receipts/invoices (.png, .jpg, .pdf)")
uploaded = files.upload()

print(f"\nUploaded {len(uploaded)} file(s)")

Click 'Choose Files' to upload receipts/invoices (.png, .jpg, .pdf)


Saving receipt-2.png to receipt-2.png
Saving receipt-example.png to receipt-example.png
Saving receipt-img.jpg to receipt-img (2).jpg
Saving test_receipt.jpg to test_receipt (1).jpg

Uploaded 4 file(s)
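
Note that Colab renames a file when one with the same name already exists (for example `receipt-img (2).jpg` above), and later cells use the saved names. Before running OCR it can be useful to separate files with a supported extension from the rest. The following is a small sketch of that check; `partition_by_support` and `SUPPORTED_EXTENSIONS` are hypothetical names, and `notes.docx` is an invented example of an unsupported file.

```python
import os

# Extensions handled by extract_text_from_file
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}


def partition_by_support(filenames):
    """Split filenames into (supported, skipped) by file extension."""
    supported, skipped = [], []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            skipped.append(name)
    return supported, skipped


names = ["receipt-2.png", "receipt-img (2).jpg", "notes.docx"]
supported, skipped = partition_by_support(names)
print(supported)  # → ['receipt-2.png', 'receipt-img (2).jpg']
print(skipped)    # → ['notes.docx']
```

In the notebook, the same helper could be called as `partition_by_support(uploaded.keys())`.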


## 5. Process Files & Extract Text
Run OCR on all uploaded files

In [None]:
# Dictionary to store extracted text
extracted_texts = {}

print("Processing files...\n")

for filename in uploaded.keys():
    print(f"Processing: {filename}")
    text = extract_text_from_file(filename)
    extracted_texts[filename] = text
    print(f"✓ Extracted {len(text)} characters\n")

print(f"Successfully processed {len(extracted_texts)} file(s)")

Processing files...

Processing: receipt-2.png
✓ Extracted 595 characters

Processing: receipt-example.png
✓ Extracted 247 characters

Processing: receipt-img (2).jpg
✓ Extracted 167 characters

Processing: test_receipt (1).jpg
✓ Extracted 84 characters

Successfully processed 4 file(s)
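
Because `extract_text_from_file` reports failures as strings, the loop above counts an error message as extracted text and includes it in the "processed" total. One way to tell the two apart afterwards is to check for the two error prefixes the function uses. This is a sketch only: `split_results` and `ERROR_PREFIXES` are hypothetical names, and a receipt whose text happens to start with one of these prefixes would be misclassified, so raising exceptions in the OCR function would be a more robust design.

```python
# Prefixes of the error messages returned by extract_text_from_file
ERROR_PREFIXES = ("Error: Unsupported file format", "Error processing file:")


def split_results(extracted_texts):
    """Separate successful OCR output from error messages."""
    ok, failed = {}, {}
    for filename, text in extracted_texts.items():
        # str.startswith accepts a tuple and matches any of its entries
        if text.startswith(ERROR_PREFIXES):
            failed[filename] = text
        else:
            ok[filename] = text
    return ok, failed


results = {
    "receipt-example.png": "Total $5.71",
    "notes.txt": "Error: Unsupported file format '.txt'",
}
ok, failed = split_results(results)
print(f"{len(ok)} succeeded, {len(failed)} failed")
# → 1 succeeded, 1 failed
```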


## 6. View Extracted Text
Display the raw text output from OCR

In [None]:
# Display extracted text for each file
for filename, text in extracted_texts.items():
    print("="*60)
    print(f"File: {filename}")
    print("="*60)
    print(text)
    print("\n")

File: receipt-2.png
WALL-MART-SUPERSTORE

(888) 888 - 8888

MANAGER TOD LINGA

888 WALL STORE ST
WALL ST CITY, LA 88888

ST# 2323 OPE (23432435 TERT TRE. 4354
HAND TOWEL 075953630184 2.97X
GATORADE 068949055223 2.00 X
T-SHIRT 036231852452 16.88 X
PUSH PINS (088348997350 1.24

SUBTOTAL 23.09

TAK 1 7.89% 2.90

TAK 2 4:90% 128

TOTAL a2

CREDIT TEND aan

CHANGE. DUE 0.00

ACCOUNT & seek chee 0449999

APPROVAL # 77W166
REF # 307171075528
TERMINAL # 6419885359

# ITEMS SOLD 4

Tek 1752 6627 3145 9811 0000

Get Free Holiday Savings by Cell!
Thank You for Shopping With Us!
10/17/2020 16:12
‘eset CUSTOMER COPY seve


File: receipt-example.png
KK»)
MB Opa ikrito

Vytenio g, 50-431
Vilnius LT-03229

Table: 415 Guests: 2
Server: Rebecca

1 Americano $2,99
2 Chocolate Cookie $1,98
1 Water bottle $0,50
Subtotal $5,47
Tax $0,24
Total $5.71
Cash $600
Change $0,29
THANK YOU

HAVE A NICE DAY!!!


File: receipt-img (2).jpg
rs

RECEIPT
Terminal #1 09-08-2018
1. Lorem Ipsum 25.7
2. Dolar Sit Amed 125.2
3

## 7. Module Output
**This raw text is ready to be passed to Module 2 (Data Extraction)**

In [None]:
print("OCR Pipeline Output Summary")
print("="*60)
print(f"Total files processed: {len(extracted_texts)}")
print(f"Total characters extracted: {sum(len(text) for text in extracted_texts.values())}")
print("\nFiles processed:")
for filename in extracted_texts.keys():
    print(f"  • {filename}")

print("\nModule 1 Complete - Raw text ready for Module 2 (Data Extraction)")

OCR Pipeline Output Summary
Total files processed: 4
Total characters extracted: 1093

Files processed:
  • receipt-2.png
  • receipt-example.png
  • receipt-img (2).jpg
  • test_receipt (1).jpg

Module 1 Complete - Raw text ready for Module 2 (Data Extraction)


---

## Integration with Next Module

To pass this data to **Module 2 (Data Extraction)**, iterate over `extracted_texts`, which maps each uploaded filename to its raw OCR text:

```python
# Example: Pass to next module
for filename, raw_text in extracted_texts.items():
    # Send raw_text to Data Extraction module
    # Module 2 will convert raw_text → JSON structure
    pass
```
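
If Module 2 runs in a separate notebook or process, the dictionary needs to be serialized first. The sketch below shows one possible handoff format using JSON; the function names and the `filename` / `raw_text` keys are assumptions for illustration, not an interface defined by Module 2.

```python
import json


def to_handoff_json(extracted_texts):
    """Serialize {filename: raw_text} as a JSON list of records."""
    records = [
        {"filename": name, "raw_text": text}
        for name, text in extracted_texts.items()
    ]
    # ensure_ascii=False keeps non-ASCII characters from receipts readable
    return json.dumps(records, ensure_ascii=False, indent=2)


def from_handoff_json(payload):
    """Rebuild the {filename: raw_text} dictionary from the JSON payload."""
    return {record["filename"]: record["raw_text"] for record in json.loads(payload)}


payload = to_handoff_json({"receipt-example.png": "Total $5.71"})
print(from_handoff_json(payload))
# → {'receipt-example.png': 'Total $5.71'}
```

The payload can be written to a file or passed directly, depending on how the two modules are connected.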

---
