<h1 align="center">AI-Driven Invoice Processing System</h1>

>
- This example uses a sample invoice in image format from https://developers.mindee.com/docs/invoice-ocr, but the same approach could also accept PDF or Word documents.
>
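If PDF or Word input were supported, one way to route files is by extension. The sketch below is only an illustration: `choose_extractor`, `EXTRACTION_ROUTES`, and the route names are hypothetical, and the actual PDF or Word text extraction (which would need an additional library) is not shown.

```python
from pathlib import Path

# Hypothetical mapping from file extension to an extraction route.
EXTRACTION_ROUTES = {
    ".jpeg": "ocr",
    ".jpg": "ocr",
    ".png": "ocr",
    ".pdf": "pdf",
    ".docx": "word",
}

def choose_extractor(document_path):
    """Return the extraction route for a file, or None if it is unsupported."""
    suffix = Path(document_path).suffix.lower()
    return EXTRACTION_ROUTES.get(suffix)

print(choose_extractor("sample_invoice.jpeg"))  # ocr
print(choose_extractor("invoice.PDF"))          # pdf
print(choose_extractor("notes.txt"))            # None
```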

In [1]:
import pytesseract
# Extract text using OCR
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
text = pytesseract.image_to_string("sample_invoice.jpeg")
print(text)

TT Turnpike
Designs Co.

BILLTO
Jiro Doi
1954 Bloor Street West
Toronto, ON, M6P 3K9
Canada

j_doi@example.com
416-555-1212

Services

Platinum web hosting package
Down 35mb, Up 100mb

2 page website design
Includes basic wireframes, and responsive templates

Mobile designs
Includes responsive navigation

INVOICE

Turnpike Designs
156 University Ave, Toronto
ON, Canada , MSH 2H7

416-555-1212

Invoice Number: 14

P.0./S.0. Number: AD29094
Invoice Date: 2018-09-25
Payment Due: Upon receipt

Amount Due (USD): $2,608.20
Quantity Price Amount
1 $65.00 $65.00
3 $2,100.00 $2,100.00
1 $250.00 $250.00
$2,145.00
$193.20
Total: $2,608.20

Amount due (CAD): $2,608.20
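
The OCR text above also contains the quantity, price, and amount rows of the invoice table. A minimal sketch of pulling those rows out with a regular expression, assuming each row is an integer quantity followed by two dollar amounts; `ROW_PATTERN`, `to_decimal`, and `parse_line_items` are hypothetical helpers, not part of the notebook:

```python
import re
from decimal import Decimal

# A few lines copied from the OCR output above.
ocr_lines = """Quantity Price Amount
1 $65.00 $65.00
3 $2,100.00 $2,100.00
1 $250.00 $250.00
$2,145.00
$193.20
Total: $2,608.20"""

# One row = integer quantity, then two dollar amounts.
ROW_PATTERN = re.compile(r"^(\d+)\s+\$([\d,]+\.\d{2})\s+\$([\d,]+\.\d{2})$")

def to_decimal(amount):
    """Convert a string like '2,100.00' to Decimal('2100.00')."""
    return Decimal(amount.replace(",", ""))

def parse_line_items(text):
    """Return (quantity, price, amount) tuples for the rows that match."""
    items = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            qty, price, amount = match.groups()
            items.append((int(qty), to_decimal(price), to_decimal(amount)))
    return items

items = parse_line_items(ocr_lines)
line_total = sum(amount for _, _, amount in items)
print(len(items), line_total)  # 3 2415.00
```

The row amounts sum to 2415.00, while the next line of the OCR text reads $2,145.00, so a cross-check like this can flag values that may have been misread and are worth reviewing by hand.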



In [2]:
import spacy
from PIL import Image, UnidentifiedImageError

# Validate that the uploaded document is a readable invoice image.
def is_invoice(document_path):
    # Check if uploaded document is an image.
    # Pillow does the check because imghdr was removed from the standard library in Python 3.13.
    try:
        image = Image.open(document_path)
    except UnidentifiedImageError:
        return "Uploaded file is not an image. Please upload a valid image file."

    # Check if image is too small to be processed
    if image.size[0] < 100 or image.size[1] < 100:
        return "Image is too small to be processed. Please upload a larger image."

    # Extract text using OCR
    pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
    text = pytesseract.image_to_string(document_path)

    # Check if any text was extracted (whitespace-only output counts as empty)
    if not text.strip():
        return "OCR failed to extract any text from the document. Please upload a clear and readable image."
    
    # Load NLP model
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    
    # Check if the document contains invoice-related entities
    for entity in doc.ents:
        if entity.label_ in ['ORG', 'PERSON', 'MONEY']:
            return True
    
    return False


is_inv = is_invoice("sample_invoice.jpeg")
is_inv

True

- Here we check that the uploaded document is a readable image whose text contains invoice-like entities (an organization, a person, or a monetary amount).
>
- The output is True, so the document passed the check and we can continue.
>
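
The entity check accepts any text that contains an organization, person, or money entity, so documents that are not invoices can also pass. A stricter heuristic might additionally require invoice-specific keywords. A minimal sketch, assuming a hypothetical keyword list and threshold (`INVOICE_KEYWORDS`, `min_hits`), neither of which comes from the notebook:

```python
# Hypothetical keyword list; tune it for the invoices you expect.
INVOICE_KEYWORDS = ("invoice", "amount due", "bill to", "billto", "total")

def has_invoice_keywords(text, min_hits=2):
    """Return True if at least `min_hits` distinct keywords appear in the text."""
    lowered = text.lower()
    hits = sum(1 for keyword in INVOICE_KEYWORDS if keyword in lowered)
    return hits >= min_hits

sample = "BILLTO\nJiro Doi\nINVOICE\nInvoice Number: 14\nTotal: $2,608.20"
print(has_invoice_keywords(sample))                      # True
print(has_invoice_keywords("Meeting notes for Monday"))  # False
```

Note also that `is_invoice` returns an error string when a check fails, and non-empty strings are truthy in Python, so a caller should test `is_invoice(path) is True` rather than relying on plain truthiness.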

In [3]:
import re

def extract_invoice_number(text):
    # Use regular expressions to extract relevant fields from the text
    invoice_number = re.search(r'Invoice Number:\s*(\w+)', text)
    if invoice_number:
        return invoice_number.group(1).strip()
    else:
        return "Invoice number not found in the document. Please enter the invoice number manually."

def extract_invoice_date(text):
    # This pattern only matches ISO dates (YYYY-MM-DD); adjust it for other formats such as MM/DD/YYYY or DD/MM/YYYY.
    invoice_date = re.search(r'Invoice Date:\s*(\d{4}-\d{2}-\d{2})', text, re.IGNORECASE)
    if invoice_date:
        return invoice_date.group(1).strip()
    else:
        return "Invoice Date not found in the document. Please enter the invoice date manually."
    
def extract_amount_due(text):
    # The pattern is hard-coded for USD amounts with a "$" sign; adjust it for other currencies.
    # Note that the whole match is returned, including the "Amount Due (USD):" label.
    amount_due = re.search(r"Amount Due \(USD\):\s*\$[\d,]+\.\d{2}", text)
    if amount_due:
        return amount_due[0]
    else:
        return "Amount Due not found in the document. Please enter it manually."
    
text = pytesseract.image_to_string("sample_invoice.jpeg")
num = extract_invoice_number(text)
date = extract_invoice_date(text)
amt = extract_amount_due(text)

print('Invoice Number:', num)
print('Invoice Date:', date)
print('Total Amount Due:', amt)

Invoice Number: 14
Invoice Date: 2018-09-25
Total Amount Due: Amount Due (USD): $2,608.20


- After validating the invoice document, we extract and print some basic fields: the invoice number, invoice date, and amount due.
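
The extracted values are still plain strings. A minimal sketch of normalizing them for later processing, assuming the date may arrive in one of a few formats and the amount still carries its label as in the output above; `DATE_FORMATS`, `parse_invoice_date`, and `parse_amount` are hypothetical names:

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical list of accepted formats: ISO, US (MM/DD/YYYY), and DD-MM-YYYY.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")

def parse_invoice_date(value):
    """Try each known format in order and return a date, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None

def parse_amount(value):
    """Pull the first dollar amount out of a string and return it as a Decimal."""
    match = re.search(r"\$([\d,]+\.\d{2})", value)
    return Decimal(match.group(1).replace(",", "")) if match else None

print(parse_invoice_date("2018-09-25"))             # 2018-09-25
print(parse_amount("Amount Due (USD): $2,608.20"))  # 2608.20
```

Formats that share a separator, such as MM/DD/YYYY and DD/MM/YYYY, cannot be told apart from the string alone, so the list should contain only the convention the invoices are known to use.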

In [4]:
# Sample database of existing invoice numbers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sample_invoice_database = range(1,11)

# Validating Invoice Number
def validate_invoice_number(invoice_number):
    if not invoice_number:
        return "Please Enter Invoice Number."
    
    # Check if invoice number is in a valid format
    if not re.match(r'^[A-Za-z0-9]+$', invoice_number):
        return "Not Valid format. Please enter valid Invoice Number."
    
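    # Check if invoice number is unique.
    # The sample database is numeric, so only all-digit numbers can collide;
    # this also avoids a ValueError from int() on values such as "AD29094".
    if invoice_number.isdigit() and int(invoice_number) in sample_invoice_database: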
        return "Invoice number already exists in the system for a different vendor or customer.\nPlease review and correct the information."
    
    return "Successfully Validated..."

is_val = validate_invoice_number(num)
is_val

'Successfully Validated...'

- As an edge-case check, we verify whether the invoice number already exists in our sample database.
>
- In this example the invoice number is 14, which is not in the sample database (the numbers 1 to 10).
>
- We can try other inputs as well.

In [5]:
is_val = validate_invoice_number("7")
print(is_val)

Invoice number already exists in the system for a different vendor or customer.
Please review and correct the information.


In [6]:
is_val = validate_invoice_number("@")
print(is_val)

Not Valid format. Please enter valid Invoice Number.


In [7]:
is_val = validate_invoice_number("")
print(is_val)

Please Enter Invoice Number.
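
The duplicate message mentions a different vendor or customer, but the sample database stores only numbers. A minimal sketch of a vendor-aware check, assuming a hypothetical in-memory mapping `existing_invoices` from invoice number to vendor name; the entries are made up for illustration, apart from the vendor name "Turnpike Designs" taken from the sample invoice:

```python
# Hypothetical store: invoice number -> vendor that issued it.
existing_invoices = {"7": "Acme Supplies", "9": "Turnpike Designs"}

def check_duplicate(invoice_number, vendor):
    """Classify an invoice number as new, a repeat, or a conflict."""
    known_vendor = existing_invoices.get(invoice_number)
    if known_vendor is None:
        return "new"
    if known_vendor == vendor:
        return "duplicate"  # the same vendor already submitted this number
    return "conflict"       # the number belongs to a different vendor

print(check_duplicate("14", "Turnpike Designs"))  # new
print(check_duplicate("9", "Turnpike Designs"))   # duplicate
print(check_duplicate("7", "Turnpike Designs"))   # conflict
```

Keeping invoice numbers as strings in the store means alphanumeric numbers can be looked up without any conversion.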
