OCR Project: Processing Receipts and Bills
This document outlines the design and a case study for a deep learning project focused on processing receipts and bills. The goal is to extract key, structured information from unstructured images, such as a photo of a receipt.

Case Study: A Personal Finance Management App
Problem Statement: Users of a personal finance app want to automatically track their expenses by simply taking a photo of a receipt. Manual data entry is slow and prone to errors. The challenge is to accurately extract critical information like the merchant, date, total amount, and individual line items from a variety of receipt formats.

Why this is a Deep Learning Problem:

Variability: Receipts from different merchants have vastly different layouts, fonts, and colors. A simple rule-based system would fail to generalize.

Noise: Photos often have glare, shadows, are taken at an angle, or are crumpled.

Semantic Understanding: Extracting the "total amount" requires more than just reading numbers; it requires understanding the context of the text (e.g., distinguishing the total from a subtotal or tax).

Project Design: A Python-Based OCR Pipeline
Our solution will be a multi-stage pipeline that combines classical computer vision techniques with modern deep learning for information extraction.

1. Image Preprocessing 🖼️
The first step is to clean up the input image to improve the accuracy of the OCR engine.

Input: A raw image file (e.g., PNG, JPEG).

Steps:

Grayscale Conversion & Binarization: Convert the image to black and white to improve text contrast.

Noise Reduction: Use a filter (e.g., a median filter) to remove random noise.

Deskewing: Correct in-plane rotation so that text lines run horizontally. A photo taken at an angle also introduces perspective distortion, which deskewing alone does not remove.

Libraries: OpenCV or Pillow are ideal for these tasks.

2. Text Detection and Recognition (OCR) 🔍
This stage identifies all text within the preprocessed image and converts it into a digital format.

Text Detection: A model identifies bounding boxes around all text regions.

Text Recognition: An OCR engine reads the characters within each bounding box.

Output: A list of strings containing all the detected text, often with their coordinates. For example: ['UBER', 'Invoice', 'Date: 25/08/2025', 'Total: $12.50'].

Libraries: Pytesseract (a Python wrapper for the Tesseract engine) is a great starting point due to its simplicity. For a more advanced approach, you could integrate a model like TrOCR from the Hugging Face transformers library. TrOCR recognizes text in cropped single-line images, so it needs a separate text detector in front of it.
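
As an illustration of the coordinate-bearing output described in this stage, the sketch below groups word-level OCR rows into text lines. It assumes the input is a dictionary shaped like the result of pytesseract.image_to_data(..., output_type=Output.DICT). The function name group_words_into_lines, the min_conf parameter, and the sample values are hypothetical.

```python
from collections import defaultdict


def group_words_into_lines(ocr_data, min_conf=0.0):
    """Group word-level OCR rows into lines, each with a bounding box."""
    lines = defaultdict(list)
    for i, text in enumerate(ocr_data["text"]):
        word = text.strip()
        # Rows without text (pages, blocks, lines) and low-confidence words
        # are skipped. float() copes with confidences stored as strings.
        if not word or float(ocr_data["conf"][i]) < min_conf:
            continue
        key = (ocr_data["block_num"][i],
               ocr_data["par_num"][i],
               ocr_data["line_num"][i])
        lines[key].append(i)

    results = []
    for key in sorted(lines):
        idxs = lines[key]
        left = min(ocr_data["left"][i] for i in idxs)
        top = min(ocr_data["top"][i] for i in idxs)
        right = max(ocr_data["left"][i] + ocr_data["width"][i] for i in idxs)
        bottom = max(ocr_data["top"][i] + ocr_data["height"][i] for i in idxs)
        results.append({
            "text": " ".join(ocr_data["text"][i].strip() for i in idxs),
            "box": (left, top, right - left, bottom - top),  # x, y, w, h
        })
    return results


# Hypothetical sample in the image_to_data dictionary layout.
sample = {
    "block_num": [1, 1, 1, 1],
    "par_num": [1, 1, 1, 1],
    "line_num": [1, 1, 2, 2],
    "left": [10, 70, 10, 80],
    "top": [10, 10, 40, 40],
    "width": [50, 90, 60, 55],
    "height": [20, 20, 20, 20],
    "conf": [96, 91, 95, 93],
    "text": ["Date:", "25/08/2025", "Total:", "$12.50"],
}

lines = group_words_into_lines(sample)
for line in lines:
    print(line)
```

Keeping the box next to each line lets the extraction stage use position as well as text, for example to prefer amounts printed on the same line as a keyword.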

3. Information Extraction with Deep Learning (NLP) 🧠
This is the most critical and challenging part of the project, where we use deep learning to make sense of the extracted text.

Input: The raw text strings from the OCR stage.

Method: We will frame this as a Named Entity Recognition (NER) problem. We will fine-tune a pre-trained NLP model (like BERT or a simpler model from spaCy) to recognize specific "entities" in the text, such as:

MERCHANT (e.g., "UBER", "Starbucks")

DATE (e.g., "08/26/2025", "26-Aug-2025")

TOTAL (e.g., "$12.50", "150.00")
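
To make the NER framing concrete, here is a minimal sketch of how annotated receipts could be turned into per-token labels for fine-tuning. It uses the BIO tagging scheme, which is one common choice for token classification and is an assumption here, not something the project prescribes. The helper spans_to_bio and the example annotations are hypothetical.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into BIO labels.

    spans is a list of (start, end, label) token indices, end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        labels[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{label}"          # continuation tokens
    return labels


tokens = ["UBER", "Invoice", "Date:", "25/08/2025", "Total:", "$12.50"]
spans = [(0, 1, "MERCHANT"), (3, 4, "DATE"), (5, 6, "TOTAL")]
labels = spans_to_bio(tokens, spans)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")

# A multi-token merchant name gets one B- tag followed by I- tags.
multi = spans_to_bio(["Groceries", "R", "Us"], [(0, 3, "MERCHANT")])
print(multi)
```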

Output: A structured data object (e.g., JSON or a Python dictionary).

{
  "merchant": "UBER",
  "date": "25/08/2025",
  "total_amount": 12.50
}

Libraries: spaCy with its built-in NER capabilities is a fantastic choice, as is the Hugging Face transformers library, which allows you to fine-tune powerful models like BERT on your specific dataset.
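
The step from recognized entities to the structured output can be sketched as follows. The input format, a list of (label, text) pairs, is an assumption about what an NER model's predictions would be reduced to, and the helper names are hypothetical. The sketch also normalizes dates to ISO format, which is a design choice of this example; the JSON shown earlier keeps the date string as printed.

```python
import json
import re
from datetime import datetime

# Formats are tried in order. Day-first comes before month-first, so an
# ambiguous date such as 05/06/2025 is read as 5 June (an assumption).
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d")


def parse_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Return the first number in the text as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def entities_to_record(entities):
    """Keep the first usable value for each entity label."""
    record = {"merchant": None, "date": None, "total_amount": None}
    for label, text in entities:
        if label == "MERCHANT" and record["merchant"] is None:
            record["merchant"] = text.strip()
        elif label == "DATE" and record["date"] is None:
            record["date"] = parse_date(text)
        elif label == "TOTAL" and record["total_amount"] is None:
            record["total_amount"] = parse_amount(text)
    return record


entities = [("MERCHANT", "UBER"), ("DATE", "25/08/2025"), ("TOTAL", "$12.50")]
record = entities_to_record(entities)
print(json.dumps(record, indent=2))
```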

Python Project Flow
Your main Python script would follow this general structure:

main.py

Import Libraries: os, cv2 (for OpenCV), pytesseract, spacy (or transformers).

Function process_receipt_image(image_path):

Takes the image file path as an argument.

Loads the image using OpenCV or Pillow.

Applies all the preprocessing steps from Stage 1.

Runs OCR on the cleaned image to get the raw text (Stage 2).

Passes the raw text to a trained NLP model for NER (Stage 3).

Returns the final structured JSON data.

Main Execution Block:

Check for the existence of the image file.

Call process_receipt_image() with a test image.

Print the resulting structured data.
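
The flow above can be sketched as a small skeleton. To keep it runnable without OpenCV or Tesseract installed, the three stages are passed in as functions; that parameterization, and the stand-in stage functions at the bottom, are assumptions of this sketch rather than part of the design.

```python
import json
import os
import tempfile


def process_receipt_image(image_path, preprocess, run_ocr, extract):
    """Run the three stages in order and return the structured result."""
    cleaned = preprocess(image_path)   # Stage 1: image preprocessing
    if cleaned is None:
        return None
    raw_text = run_ocr(cleaned)        # Stage 2: OCR
    return extract(raw_text)           # Stage 3: information extraction


def main(image_path, preprocess, run_ocr, extract):
    """Main execution block: check the file, run the pipeline, print."""
    if not os.path.exists(image_path):
        print(f"Image not found: {image_path}")
        return None
    result = process_receipt_image(image_path, preprocess, run_ocr, extract)
    print(json.dumps(result, indent=2))
    return result


# Stand-in stages (hypothetical) so the skeleton can be exercised end to end.
def fake_preprocess(path):
    return path


def fake_ocr(image):
    return "UBER\nTotal: $12.50"


def fake_extract(text):
    return {"merchant": text.splitlines()[0], "raw_text": text}


# An empty temporary file stands in for a real receipt photo.
with tempfile.NamedTemporaryFile(suffix=".png") as tmp:
    result = main(tmp.name, fake_preprocess, fake_ocr, fake_extract)

missing = main("no_such_receipt.png", fake_preprocess, fake_ocr, fake_extract)
```

In the real script, the stand-ins would be replaced by the Stage 1 to Stage 3 functions.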

This project combines classical computer vision, OCR, and NLP, and provides a clear path to building a useful, real-world application.

# stage 1

In [None]:
# A script for the image preprocessing stage of the OCR project.
# This code will take an image file, clean it up, and save the result.
# We will use OpenCV for image manipulation.

# To install: pip install opencv-python numpy

import cv2
import numpy as np
import os

def preprocess_image(image_path):
    """
    Performs a series of preprocessing steps on a receipt image.

    Args:
        image_path (str): The path to the input image file.

    Returns:
        numpy.ndarray: The preprocessed image as a NumPy array.
    """
    # Check if the file exists
    if not os.path.exists(image_path):
        print(f"Error: Image file not found at {image_path}")
        return None

    # Load the image
    # cv2.imread loads the image in BGR format by default
    img = cv2.imread(image_path)
    if img is None:
        print(f"Error: Could not read image from {image_path}")
        return None

    print("Image loaded successfully. Starting preprocessing...")

    # 1. Convert to grayscale
    # Grayscale conversion simplifies the image and is often the first step for OCR.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    print("Step 1: Converted to grayscale.")

    # 2. Apply a median filter to remove noise
    # Median filter is effective at removing salt-and-pepper noise while preserving edges.
    denoised = cv2.medianBlur(gray, 3) # The kernel size (e.g., 3) must be odd
    print("Step 2: Applied median blur for noise reduction.")

    # 3. Apply thresholding (binarization)
    # This step converts the grayscale image to a binary (black and white) image,
    # making the text stand out clearly from the background.
    # We use Otsu's method for automatic thresholding.
    _, binarized = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    print("Step 3: Applied Otsu's thresholding for binarization.")
    
    # 4. Deskewing (optional but recommended)
    # Fit a minimum-area rectangle around the dark (text) pixels and rotate
    # the image by its angle. This assumes dark text on a light background.
    coords = np.column_stack(np.where(binarized < 255)).astype(np.float32)
    # Skip the rotation if the image contains no dark pixels at all.
    angle = cv2.minAreaRect(coords)[-1] if len(coords) else 0.0

    # The angle convention of minAreaRect depends on the OpenCV version:
    # older releases return values in [-90, 0), newer ones in (0, 90].
    # Normalizing to (-45, 45] gives the same correction under both.
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    angle = -angle
        
    (h, w) = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    deskewed = cv2.warpAffine(binarized, M, (w, h),
                              flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
    
    print(f"Step 4: Deskewed image by {angle:.2f} degrees.")
    
    return deskewed

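def save_image(image, output_path):
    """
    Saves the preprocessed image to a specified path.

    Returns:
        bool: True if the image was written, False otherwise.
    """
    try:
        # cv2.imwrite can signal failure (e.g., a missing directory) by
        # returning False instead of raising, so check the return value.
        if cv2.imwrite(output_path, image):
            print(f"Preprocessed image saved to: {output_path}")
            return True
        print(f"Error: could not write image to {output_path}")
    except cv2.error as e:
        print(f"Error saving image: {e}")
    return False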

if __name__ == '__main__':
    # You would replace this with a real image path.
    # For a sample, you can download a receipt image and place it in the same
    # directory as this script.
    
    # Example usage:
    # 1. Create a dummy image for demonstration purposes: dark text on a
    #    white background, like a typical receipt
    dummy_img = np.full((400, 600, 3), 255, dtype=np.uint8)
    cv2.putText(dummy_img, 'Sample Receipt', (100, 200), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
    cv2.imwrite('dummy_receipt.png', dummy_img)
    
    input_image_path = 'dummy_receipt.png'
    output_image_path = 'preprocessed_receipt.png'
    
    preprocessed_img = preprocess_image(input_image_path)
    
    if preprocessed_img is not None:
        save_image(preprocessed_img, output_image_path)


# stage 2

In [None]:
# A script for the text recognition stage of the OCR project.
# This code will use PyTesseract to extract text from a preprocessed image.

# To install: pip install pytesseract Pillow opencv-python numpy
# You must also have the Tesseract OCR engine installed on your system.
# See documentation for installation instructions:
# Windows: https://github.com/UB-Mannheim/tesseract/wiki
# macOS: brew install tesseract
# Ubuntu/Debian: sudo apt update && sudo apt install tesseract-ocr

import pytesseract
from PIL import Image
import os
import cv2
import numpy as np

# Note: If Tesseract is not in your system's PATH, you need to set the path
# to the tesseract executable.
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def preprocess_image(image_path):
    """
    Performs preprocessing on an image. This is the same function from the
    previous step, included here for a complete, runnable example.
    """
    if not os.path.exists(image_path):
        print(f"Error: Image file not found at {image_path}")
        return None
    
    img = cv2.imread(image_path)
    if img is None:
        return None

    # Grayscale conversion
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Denoising with median blur
    denoised = cv2.medianBlur(gray, 3)

    # Binarization using Otsu's method
    _, binarized = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

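    # Deskewing (assumes dark text on a light background)
    coords = np.column_stack(np.where(binarized < 255)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1] if len(coords) else 0.0
    # Normalize to (-45, 45] so that both OpenCV angle conventions
    # ([-90, 0) in older releases, (0, 90] in newer ones) are handled.
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    angle = -angle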
    
    (h, w) = binarized.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    deskewed = cv2.warpAffine(binarized, M, (w, h),
                              flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
    
    return deskewed

def ocr_image(image_data):
    """
    Extracts text from an image using PyTesseract.

    Args:
        image_data (numpy.ndarray): The preprocessed image as a NumPy array.

    Returns:
        str: A single string containing all the extracted text.
    """
    if image_data is None:
        return ""

    # Convert the OpenCV image (numpy array) to a PIL Image object.
    # pytesseract also accepts NumPy arrays directly; the conversion is kept
    # here to make the input type explicit.
    pil_image = Image.fromarray(image_data)

    print("Running OCR on the preprocessed image...")

    # Use pytesseract.image_to_string() to perform OCR.
    # We can pass custom configuration options here, for example to
    # improve accuracy on specific document types.
    # config = '--psm 6' # PSM 6 assumes a single uniform block of text
    # config = '--psm 4' # PSM 4 assumes a single column of text of variable sizes
    
    extracted_text = pytesseract.image_to_string(pil_image, lang='eng')

    # Return the extracted text
    return extracted_text

if __name__ == '__main__':
    # We use a dummy image for demonstration purposes.
    # In a real application, you would use a receipt image.
    dummy_img_path = 'dummy_receipt.png'
    
    # Preprocess the dummy image first
    preprocessed_img = preprocess_image(dummy_img_path)

    if preprocessed_img is not None:
        # Pass the preprocessed image to the OCR function
        extracted_text = ocr_image(preprocessed_img)
        print("\n--- Extracted Text ---")
        print(extracted_text)

        # You can also get more detailed information, including bounding boxes
        # This is very useful for visualizing where the text was found.
        # We will need this for the next stage (Information Extraction)
        ocr_data = pytesseract.image_to_data(preprocessed_img, output_type=pytesseract.Output.DICT)
        print("\n--- OCR Data with Bounding Boxes ---")
        print(ocr_data)


# stage 3

In [None]:
# A script for the Information Extraction stage of the OCR project.
# This code will process the output from PyTesseract to extract specific
# key-value pairs like total, tax, and date from a receipt.

import re

def extract_receipt_info(ocr_data):
    """
    Extracts key information from OCR data using a rule-based approach.

    This rule-based function is a simple stand-in for a learned model: it
    searches for keywords and takes the numeric value that follows. In a
    deep learning solution, a model (e.g., LayoutLM) would instead be
    trained to predict the category (e.g., 'TOTAL', 'DATE') of each text
    block.

    Args:
        ocr_data (dict): The output from pytesseract.image_to_data(),
                         containing 'text', 'left', 'top', 'width', and 'height'.

    Returns:
        dict: A dictionary containing the extracted information.
    """
    if 'text' not in ocr_data:
        print("Error: Invalid OCR data format. 'text' key is missing.")
        return {}
    
    # Store the extracted information
    extracted_info = {
        'total': None,
        'subtotal': None,
        'tax': None,
        'date': None
    }
    
    # Clean the text and create a unified list of words with their coordinates
    words_and_coords = []
    for i, word in enumerate(ocr_data['text']):
        # Clean the word: lowercase it and drop currency symbols, thousands
        # separators, and a trailing colon (e.g., "Total:" -> "total")
        clean_word = word.strip().lower().replace('$', '').replace(',', '').rstrip(':')
        if clean_word:
            words_and_coords.append({
                'text': clean_word,
                'left': ocr_data['left'][i],
                'top': ocr_data['top'][i],
                'width': ocr_data['width'][i],
                'height': ocr_data['height'][i]
            })

    # Search for keywords and extract the next numeric value
    for i, word_data in enumerate(words_and_coords):
        text = word_data['text']
        
        # Look for the total amount
        if text in ['total', 'balance', 'grandtotal']:
            # The value is likely the next word after the keyword
            for j in range(i + 1, min(i + 5, len(words_and_coords))):
                next_word = words_and_coords[j]['text']
                # Check if the next word is a number with an optional decimal
                if re.match(r'^\d+(\.\d+)?$', next_word):
                    extracted_info['total'] = float(next_word)
                    break # Exit the inner loop once the total is found
        
        # Look for subtotal
        if text in ['subtotal', 'merchandise']:
            for j in range(i + 1, min(i + 5, len(words_and_coords))):
                next_word = words_and_coords[j]['text']
                if re.match(r'^\d+(\.\d+)?$', next_word):
                    extracted_info['subtotal'] = float(next_word)
                    break
        
        # Look for tax
        if text in ['tax', 'gst', 'vat']:
            for j in range(i + 1, min(i + 5, len(words_and_coords))):
                next_word = words_and_coords[j]['text']
                if re.match(r'^\d+(\.\d+)?$', next_word):
                    extracted_info['tax'] = float(next_word)
                    break
                    
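        # Look for a date
        # The regex covers year-first (YYYY/MM/DD) and day- or month-first
        # patterns. The digit lookarounds stop a year-first date such as
        # 2025/08/26 from being truncated to "25/08/26".
        if extracted_info['date'] is None:
            # Cleaning leaves digits and separators intact, so the cleaned
            # token can be searched directly.
            date_match = re.search(
                r'(?<!\d)(\d{4}[/\-]\d{1,2}[/\-]\d{1,2}|\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4})(?!\d)',
                text)
            if date_match:
                extracted_info['date'] = date_match.group(1)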

    return extracted_info

if __name__ == '__main__':
    # --- Mock Data from Stage 2 ---
    # In a real pipeline, this would be the output from
    # pytesseract.image_to_data() on the preprocessed image.
    mock_ocr_data = {
        'level': [1, 2, 3, 4, 5, 5, 5, 5, 5, 5],
        'page_num': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'block_num': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        'par_num': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'line_num': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'word_num': [1, 1, 1, 1, 1, 2, 3, 4, 5, 6],
        'left': [100, 100, 100, 100, 300, 350, 400, 450, 500, 550],
        'top': [50, 50, 50, 50, 150, 150, 150, 200, 200, 200],
        'width': [50, 50, 50, 50, 60, 60, 60, 60, 60, 60],
        'height': [20, 20, 20, 20, 20, 20, 20, 20, 20, 20],
        'text': ['store', 'name', 'date', '2025/08/26', 'Total', '$', '45.75', 'Tax', '$', '3.15']
    }
    
    print("Starting information extraction...")
    
    # Run the extraction function on the mock OCR data
    extracted_data = extract_receipt_info(mock_ocr_data)
    
    print("\n--- Extracted Receipt Information ---")
    for key, value in extracted_data.items():
        print(f"{key.capitalize()}: {value}")



# GEN AI RAG

In [None]:
# A script for the Retrieval-Augmented Generation (RAG) approach
# to information extraction.

import json
import re
import numpy as np
from typing import List, Dict, Any

# --- Stage 1: Mock Vector Database and Embedding Model ---
# In a real-world application, this would be a full-fledged vector database
# like ChromaDB, Pinecone, or a self-hosted solution.
# The `embedding_model` would be a transformer model (e.g., Sentence-Transformers).

def mock_embedding_model(text: str) -> np.ndarray:
    """
    Simulates a text embedding model.
    A real model would convert text into a high-dimensional vector.
    For this example, we'll just create a simple, repeatable vector.
    """
    # A simple, mock embedding for demonstration.
    # The sum of character codes is deterministic but carries no meaning:
    # anagrams collide, and unrelated strings can land close together.
    hash_value = sum(ord(char) for char in text)
    return np.array([hash_value, hash_value / 100, hash_value % 100, 1.0])

class MockVectorDatabase:
    """
    Simulates a vector database that stores and retrieves text chunks.
    """
    def __init__(self):
        self.embeddings = []
        self.chunks = []

    def ingest_chunks(self, text_chunks: List[str]):
        """
        Ingests a list of text chunks, creates embeddings, and stores them.
        """
        for chunk in text_chunks:
            embedding = mock_embedding_model(chunk)
            self.embeddings.append(embedding)
            self.chunks.append(chunk)
    
    def retrieve_similar_chunks(self, query_text: str, k: int = 3) -> List[str]:
        """
        Retrieves the top-k most similar chunks to the query.
        In a real system, this would use cosine similarity.
        Here, we use a simple Euclidean distance on our mock vectors.
        """
        query_embedding = mock_embedding_model(query_text)
        
        # Calculate a mock similarity score (e.g., inverse of distance)
        # Note: This is for demonstration. Real vector search is more complex.
        scores = [1.0 / (np.linalg.norm(query_embedding - emb) + 1e-6) for emb in self.embeddings]
        
        # Get indices of the top-k scores
        top_k_indices = np.argsort(scores)[-k:][::-1]
        
        return [self.chunks[i] for i in top_k_indices]

# --- Stage 2: Prompt Augmentation & LLM Generation ---

def create_rag_prompt(query: str, retrieved_context: List[str]) -> str:
    """
    Creates the augmented prompt for the LLM by adding retrieved context.

    Args:
        query (str): The original user query.
        retrieved_context (List[str]): The relevant text chunks from the vector database.

    Returns:
        str: The full, augmented prompt string for the LLM.
    """
    # Join the retrieved chunks into a single string
    context_str = "\n".join([f"- {chunk}" for chunk in retrieved_context])

    prompt = f"""
You are a highly accurate receipt data extractor. Your task is to analyze the provided OCR text from a receipt and extract key information.

You have been provided with relevant text from the receipt to assist you.

Please find the following details and return the output as a JSON object:
- `total`: The final total amount of the receipt.
- `subtotal`: The subtotal amount, if available.
- `tax`: The tax amount, if available.
- `date`: The date of the transaction.
- `store_name`: The name of the store.

If a value is not found, use `null`.

Example JSON format:
{{
  "store_name": "Example Store",
  "total": 45.75,
  "subtotal": 42.60,
  "tax": 3.15,
  "date": "2025/08/26"
}}

---
Task: {query}

Relevant OCR Text Chunks (for context):
{context_str}

---
JSON Output:
"""
    return prompt

def mock_llm_api_call(prompt: str) -> str:
    """
    Simulates a call to a local LLM API.
    It ignores the prompt and returns a fixed JSON string.
    """
    print("Sending augmented prompt to simulated local LLM...")
    
    # The LLM's job is to use the provided context to fill in the JSON.
    # The fixed response below simulates a correct extraction.
    return """
{
  "store_name": "Groceries R Us",
  "total": 62.48,
  "subtotal": 58.00,
  "tax": 4.48,
  "date": "2025/08/26"
}
"""

def extract_with_rag(ocr_data: Dict[str, Any]) -> Dict[str, Any]:
    """
    The main RAG pipeline function. It retrieves relevant chunks and
    then uses an LLM for generation.

    Args:
        ocr_data (dict): The output from the OCR stage.

    Returns:
        dict: A dictionary with extracted information.
    """
    # Create the mock vector database
    db = MockVectorDatabase()
    
    # The "chunks" are just the words from the OCR text for this demo.
    # In a real app, you would have a more sophisticated chunking strategy.
    text_chunks = [word for word in ocr_data['text'] if word.strip()]
    db.ingest_chunks(text_chunks)
    
    # 1. Retrieval Step: Find relevant chunks based on a query
    # The query for the retriever can be a specific question or the raw OCR text itself.
    retrieved_chunks = db.retrieve_similar_chunks("total tax subtotal store date", k=5)
    
    print("Retrieved relevant chunks for augmentation:")
    print(retrieved_chunks)

    # 2. Augmentation & Generation Step: Create the prompt and call the LLM
    query = "Extract store name, total, subtotal, tax, and date."
    augmented_prompt = create_rag_prompt(query, retrieved_chunks)
    
    json_response_string = mock_llm_api_call(augmented_prompt)
    
    # 3. Parse the LLM's JSON output
    try:
        extracted_info = json.loads(json_response_string)
        return extracted_info
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON from LLM: {e}")
        return {}

if __name__ == '__main__':
    # --- Mock OCR Data for Demonstration ---
    # This simulates a different receipt than the previous example.
    # Because the LLM call is mocked, the printed result matches this
    # receipt by construction rather than by actual extraction.
    mock_ocr_data = {
        'text': ['Welcome', 'to', 'Groceries', 'R', 'Us', '!', 'Subtotal', '58.00', 'Tax', '4.48', 'Total', '62.48', 'Thank', 'You', '!', '2025-08-26', '14:30'],
    }

    print("Starting RAG-based information extraction...")
    
    extracted_data = extract_with_rag(mock_ocr_data)
    
    print("\n--- Extracted Receipt Information (RAG) ---")
    if extracted_data:
        for key, value in extracted_data.items():
            print(f"{key.capitalize()}: {value}")
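
The retrieval docstring notes that a real system would rank chunks by cosine similarity. A minimal pure-Python sketch of that ranking is shown below. The vectors are made up for illustration and do not come from any embedding model, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, vectors, chunks, k=3):
    """Return the k chunks whose vectors are most similar to the query."""
    scored = sorted(
        zip(chunks, vectors),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]


# Made-up 3-dimensional "embeddings" for three receipt chunks.
chunks = ["Total 62.48", "Thank You", "Tax 4.48"]
vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.3, 0.1]]
query_vec = [1.0, 0.0, 0.0]

top = top_k_by_cosine(query_vec, vectors, chunks, k=2)
print(top)
```

Unlike Euclidean distance on the mock vectors, cosine similarity ignores vector length and compares direction only, which is why it is a common default for text embeddings.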
