# 📄 Invoice Information Extraction Notebook

This notebook demonstrates how to extract the **invoice number, date, total amount, and line items** from invoice images using:
- HuggingFace invoice dataset (`katanaml-org/invoices-donut-data-v1`)
- OCR with Tesseract
- Regex-based extraction
- Optional HuggingFace Donut model

---

In [None]:
# ✅ Install dependencies
!pip install datasets pytesseract pillow pdf2image matplotlib
!apt-get install -y tesseract-ocr

# Optional: HuggingFace Donut model
!pip install transformers accelerate timm sentencepiece

In [None]:
from datasets import load_dataset
import pytesseract
from PIL import Image
import re, json
import matplotlib.pyplot as plt

# Load HuggingFace invoice dataset
dataset = load_dataset("katanaml-org/invoices-donut-data-v1")

# Take first invoice sample
sample = dataset['train'][0]
img = sample['image']

# Show invoice image
plt.imshow(img)
plt.axis('off')
plt.show()

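# Run Tesseract OCR on the image; the result is plain, unstructured text
text = pytesseract.image_to_string(img)
print(text[:1000])  # preview the first 1,000 characters of OCR text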

In [None]:
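# ✅ Regex extractors (heuristics; adjust the patterns to your invoice layout)
# "Number" is tried before "No" so "Invoice Number: 123" does not capture "umber".
# \b keeps "Total" from matching inside "Subtotal"; the amount must start with a digit.
invoice_number = re.search(r"Invoice\s*(?:Number\b|No\b\.?|#)[:\s]*([A-Za-z0-9-]+)", text, re.IGNORECASE)
invoice_date = re.search(r"Date(?:\s+of\s+issue)?[:\s]*([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})", text, re.IGNORECASE)
total_amount = re.search(r"\bTotal[:\s]*\$?\s*([0-9][0-9,.]*)", text, re.IGNORECASE)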

extracted_data = {
    "invoice_number": invoice_number.group(1) if invoice_number else None,
    "invoice_date": invoice_date.group(1) if invoice_date else None,
    "total_amount": total_amount.group(1) if total_amount else None,
}

print(json.dumps(extracted_data, indent=2))
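
The extractors above return raw strings, so downstream code usually needs typed values. A minimal sketch of that normalization step follows. It assumes dates are written as MM/DD/YYYY and amounts use `,` as the thousands separator and `.` as the decimal point. `normalize_date` and `normalize_amount` are hypothetical helper names, not part of the notebook.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def normalize_date(raw, fmt="%m/%d/%Y"):
    """Return an ISO 8601 date string, or None if raw does not match fmt."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw, fmt).date().isoformat()
    except ValueError:
        return None


def normalize_amount(raw):
    """Return a Decimal, or None if raw cannot be parsed (assumes 1,234.56 style)."""
    if raw is None:
        return None
    # Drop thousands separators and any trailing period picked up by the regex
    cleaned = raw.replace(",", "").rstrip(".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(normalize_date("10/15/2012"))   # 2012-10-15
print(normalize_amount("1,234.50"))   # 1234.50
```

If your invoices use DD/MM/YYYY or European decimal commas, change `fmt` and the cleaning step accordingly. The sketch returns `None` on a failed parse rather than guessing.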

In [None]:
# ✅ Extract line items (very basic heuristic: keep rows that contain "<qty> x <price>", e.g. "2 x $15.00")
lines = text.splitlines()
line_items = []
for line in lines:
    if re.search(r"[0-9]+\s+x\s+\$?[0-9,.]+", line):
        line_items.append(line.strip())

extracted_data["line_items"] = line_items
print(json.dumps(extracted_data, indent=2))
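
The heuristic above keeps each matching row as a raw string. One possible next step is to split a row into structured fields, sketched below. It assumes rows look like `<description> <qty> x <unit price>`, which may not hold for your invoices. `LINE_ITEM_RE` and `parse_line_item` are hypothetical names introduced for illustration.

```python
import re
from decimal import Decimal

# Assumed row shape: "<description> <qty> x <unit price>", e.g. "Widget 2 x $15.00"
LINE_ITEM_RE = re.compile(
    r"^(?P<description>.*?)\s*(?P<qty>[0-9]+)\s+x\s+\$?"
    r"(?P<unit_price>[0-9][0-9,]*(?:\.[0-9]+)?)",
    re.IGNORECASE,
)


def parse_line_item(line):
    """Split one OCR row into description, quantity, and unit price; None if no match."""
    match = LINE_ITEM_RE.search(line)
    if match is None:
        return None
    qty = int(match.group("qty"))
    unit_price = Decimal(match.group("unit_price").replace(",", ""))
    return {
        "description": match.group("description").strip(),
        "qty": qty,
        # Stored as strings so the dict stays JSON-serializable
        "unit_price": str(unit_price),
        "line_total": str(qty * unit_price),
    }


print(parse_line_item("Widget 2 x $15.00"))
```

`Decimal` is used instead of `float` so that computed line totals do not pick up binary rounding errors.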

In [None]:
# ✅ Optional: HuggingFace Donut Model
try:
    from transformers import DonutProcessor, VisionEncoderDecoderModel
    import torch

    processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
    model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

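    # Donut expects an RGB image
    pixel_values = processor(img.convert("RGB"), return_tensors="pt").pixel_values
    task_prompt = "<s_docvqa><s_question>What is the invoice number?</s_question><s_answer>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

    with torch.no_grad():  # inference only, no gradients needed
        outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=64)
    # Keep the special tokens so token2json can separate the question from the answer
    sequence = processor.batch_decode(outputs)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
    result = processor.token2json(sequence)
    print("Donut model answer:", result.get("answer"))
except Exception as e:
    # Covers missing packages, failed downloads, and inference errors
    print("⚠️ Donut step skipped:", e)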

---
## ✅ Next Steps (Scalable Design)
- Add **new regex patterns** or ML models to cover additional fields.
- Extend the schema in the `extracted_data` dict with a key for each new field.
- Fine-tune Donut or LayoutLM on your own invoices for production use.
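
One way to make new fields cheap to add is to keep the patterns in a registry and loop over it. This is a sketch, not the notebook's own design. `FIELD_PATTERNS` and `extract_fields` are hypothetical names, the patterns follow the style of the regex cell, and the "PO Number" label and sample text are invented for illustration.

```python
import re

# Hypothetical registry: field name -> compiled pattern with one capture group
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*No\.?[:\s]*([A-Za-z0-9-]+)", re.IGNORECASE),
    "invoice_date": re.compile(r"Date[:\s]*([0-9]{2}/[0-9]{2}/[0-9]{4})", re.IGNORECASE),
    "total_amount": re.compile(r"Total[:\s]*\$?([0-9,.]+)", re.IGNORECASE),
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Apply every registered pattern; fields with no match map to None."""
    result = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Adding a field is one new registry entry (the "PO Number" label is an assumed example)
FIELD_PATTERNS["po_number"] = re.compile(r"PO\s*Number[:\s]*([A-Za-z0-9-]+)", re.IGNORECASE)

# Made-up OCR text, used only to exercise the registry
sample_text = "Invoice No: INV-001\nDate: 01/02/2023\nPO Number: PO-77\nTotal: $1,250.00"
print(extract_fields(sample_text))
```

With this layout the extraction loop never changes. Supporting a new field means adding one pattern, and the output schema grows with the registry.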

---