I have 11 example invoices in the examples folder. The purpose of this notebook is to extract the text data from the images. I will use the [pytesseract](https://pypi.org/project/pytesseract/) library to achieve this. The library supports all image types present in my examples folder: png, jpg, and webp. I will open one image at a time, extract the text, and store it in a list of dictionaries, one per image. If I can successfully capture the text from all documents, I will save the list as a JSON file and open it in a future notebook to perfect the regex required.

Note: I had to download and install the [OCR engine](https://github.com/UB-Mannheim/tesseract/wiki) and run `pip install pytesseract` to get this notebook to run.
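
Since the setup depends on the Tesseract executable being installed, a quick check before running the rest of the notebook can save some head-scratching. This is only a sketch: `find_tesseract` is a hypothetical helper name, and the fallback path is simply the Windows install location used later in this notebook.

```python
import os
import shutil


def find_tesseract(default_path="C:/Program Files/Tesseract-OCR/tesseract.exe"):
    """Return a path to the Tesseract executable, or None if it cannot be found.

    Hypothetical helper: looks on the system PATH first, then falls back to
    the default Windows install location assumed in this notebook.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    if os.path.isfile(default_path):
        return default_path
    return None


tesseract_location = find_tesseract()
print(tesseract_location or "Tesseract executable not found")
```

If this prints a path, that path can be assigned to `pytesseract.tesseract_cmd` instead of a hard-coded string.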

In [1]:
import os
import json
from PIL import Image
from pytesseract import pytesseract

In [2]:
tesseract_path = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
pytesseract.tesseract_cmd = tesseract_path

images = os.listdir('examples/')
images

['1131w-gU_JD5OzAAQ.webp',
 '1131w-zvoLwRH8Wys.webp',
 '609d5d3c4d120e370de52b70_invoice-lp-light-border.png',
 'Commercial-invoice-example.png',
 'IC-Business-Invoice-Template.jpg',
 'invoice-freshbooks-business.jpg',
 'Invoice-template-example-for-a-marketing-firm.webp',
 'invoice-template-us-band-blue-750px.png',
 'invoice-template-us-dexter-750px.png',
 'services-invoice-with-hours-and-rate-green-modern-simple-1-1-f82c825b6ce1.webp',
 'simple-invoice-template.png']
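
`os.listdir` returns everything in the folder, so a stray non-image file would break the OCR loop. A minimal sketch of filtering the listing down to the three extensions mentioned above follows; `filter_images` and the extra entries in `listing` (`notes.txt`, `.ipynb_checkpoints`) are made up for illustration.

```python
import os

# Extensions present in the examples folder, per the introduction.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".webp"}


def filter_images(filenames):
    """Keep only filenames whose extension is in SUPPORTED_EXTENSIONS."""
    return sorted(
        name
        for name in filenames
        if os.path.splitext(name)[1].lower() in SUPPORTED_EXTENSIONS
    )


# Three real filenames from the listing plus two hypothetical non-image entries.
listing = [
    "1131w-gU_JD5OzAAQ.webp",
    "simple-invoice-template.png",
    "IC-Business-Invoice-Template.jpg",
    "notes.txt",
    ".ipynb_checkpoints",
]

image_files = filter_images(listing)
print(image_files)
```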

In [3]:
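text_info = []

for image_path in images:
    # One dictionary per image: the file name and the OCR text.
    image_info = {}
    image_info['file'] = image_path

    full_path = os.path.join('examples/', image_path)

    # Open the image, run OCR on it, and close the file when done.
    with Image.open(full_path) as image:
        text = pytesseract.image_to_string(image)

    image_info['text'] = text
    text_info.append(image_info)

# Inspect the result for the first image.
text_info[0]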

{'file': '1131w-gU_JD5OzAAQ.webp',
 'text': 'INVOICE ZN\n\nWARDIERE INC.\n\nBILL TO:\n\nOlivia Wilson Date: 15/08/2028\nhello@reallygreatsite.com\n\n123 Anywhere St., Any City, ST 12345 Invoice NO. 2000-15\nFROM:\n\nWardiere Inc.\nhello@reallygreatsite.com\n123 Anywhere St., Any City, ST 12345,\n\nDESCRIPTION HOURS PRICE TOTAL\n\nGraphic design consultiation 2 $100.00 $200.00\n\nLogo design 1 $700.00 $700.00\n\nSocial media templates 1 $600.00 $600.00\n\nRevision 2. $300.00 $600.00\n\nTotal amount $2,100.00\nPAYMENT METHOD NOTES\n\nBank name: Fauget\nAccount No: 123-456-7890\n\nDate Thank youl Signature\n\nwww.reallygreatsite.com\n'}

The setup was much more of a pain than the code. I needed to upgrade the Pillow library to the most recent version so it could handle webp images. Once I did that, it worked on the first try.

After getting something to drink, I have determined that the pytesseract library is pretty cool. I'm going to save my JSON object and work with it in the next notebook.

In [4]:
with open('text_data.json', 'w') as f:
    json.dump(text_info, f)
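
To get a feel for the regex work planned for the next notebook, here is a rough sketch that pulls three fields out of lines copied from the first invoice's OCR output shown earlier. The patterns and the `extract_fields` helper are assumptions fitted to this one invoice, not a general solution; the other invoices will likely need different patterns.

```python
import re

# Lines copied from the OCR text of the first example invoice.
sample_text = (
    "Olivia Wilson Date: 15/08/2028\n"
    "123 Anywhere St., Any City, ST 12345 Invoice NO. 2000-15\n"
    "Total amount $2,100.00\n"
)

# Hypothetical patterns, tuned only to the layout of this invoice.
PATTERNS = {
    "date": re.compile(r"Date:\s*(\d{2}/\d{2}/\d{4})"),
    "invoice_no": re.compile(r"Invoice\s+NO\.?\s*([\w-]+)", re.IGNORECASE),
    "total": re.compile(r"Total amount\s*\$([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return the first match for each pattern, or None when nothing matches."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


fields = extract_fields(sample_text)
print(fields)
```

Returning `None` for a missing field, rather than raising, should make it easier to see which invoices a given pattern fails on.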

After doing some reading of their limited documentation, I wonder if I could get useful information from the .image_to_boxes() method or the .image_to_data() method. I'll experiment with both of them and see what I can get.

I believe they both return coordinate information about where certain pieces of text were found. Pieces of text on the same line of an address should share roughly the same vertical position, measured in pixels (.image_to_data() reports this as `top`, the distance from the top of the image). I can use this to my advantage when piecing together bits of text.

Note: I found that I can control the output type and make it a dictionary. This appears to make the data more workable.
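
The idea of piecing text together by vertical position can be sketched without running OCR at all. The dictionary below mimics the `text`, `top`, and `left` keys that `.image_to_data()` returns with dictionary output, but the coordinate values are invented for illustration, and `group_words_by_line` and its `tolerance` parameter are hypothetical names.

```python
def group_words_by_line(data, tolerance=5):
    """Group words whose `top` values lie within `tolerance` pixels of the
    first word in a line, then order each line left to right."""
    words = [
        (top, left, text)
        for text, top, left in zip(data["text"], data["top"], data["left"])
        if text.strip()  # skip the empty entries Tesseract emits for blocks
    ]
    words.sort()

    lines = []  # each entry: (top of the line's first word, [(left, text), ...])
    for top, left, text in words:
        if lines and abs(top - lines[-1][0]) <= tolerance:
            lines[-1][1].append((left, text))
        else:
            lines.append((top, [(left, text)]))

    return [" ".join(text for _, text in sorted(items)) for _, items in lines]


# Made-up coordinates for a few words from the first invoice.
sample_data = {
    "text": ["", "Olivia", "Wilson", "Date:", "15/08/2028", "hello@reallygreatsite.com"],
    "top": [0, 100, 101, 99, 100, 130],
    "left": [0, 50, 120, 400, 460, 50],
}

grouped = group_words_by_line(sample_data)
print(grouped)
```

A fixed pixel tolerance is the simplest choice here; invoices with very different font sizes or skewed scans would probably need something more adaptive.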

In [5]:
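text_boxes = []

for image_path in images:
    # One dictionary per image: the file name and the character box data.
    image_info = {}
    image_info['file'] = image_path

    full_path = os.path.join('examples/', image_path)

    # Output.DICT returns the box data as a dictionary instead of a raw string.
    with Image.open(full_path) as image:
        text = pytesseract.image_to_boxes(image, output_type=pytesseract.Output.DICT)

    image_info['text'] = text
    text_boxes.append(image_info)

# Inspect the result for the first image.
text_boxes[0]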

{'file': '1131w-gU_JD5OzAAQ.webp',
 'text': {'char': ['I',
   'N',
   'V',
   'O',
   'I',
   'C',
   'E',
   'Z',
   'N',
   'W',
   'A',
   'R',
   'D',
   'I',
   'E',
   'R',
   'E',
   'I',
   'N',
   'C',
   '.',
   'B',
   'I',
   'L',
   'L',
   'T',
   'O',
   ':',
   'O',
   'l',
   'i',
   'v',
   'i',
   'a',
   'W',
   'i',
   'l',
   's',
   'o',
   'n',
   'D',
   'a',
   't',
   'e',
   ':',
   '1',
   '5',
   '/',
   '0',
   '8',
   '/',
   '2',
   '0',
   '2',
   '8',
   'h',
   'e',
   'l',
   'l',
   'o',
   '@',
   'r',
   'e',
   'a',
   'l',
   'l',
   'y',
   'g',
   'r',
   'e',
   'a',
   't',
   's',
   'i',
   't',
   'e',
   '.',
   'c',
   'o',
   'm',
   '1',
   '2',
   '3',
   'A',
   'n',
   'y',
   'w',
   'h',
   'e',
   'r',
   'e',
   'S',
   't',
   '.',
   ',',
   'A',
   'n',
   'y',
   'C',
   'i',
   't',
   'y',
   ',',
   'S',
   'T',
   '1',
   '2',
   '3',
   '4',
   '5',
   'I',
   'n',
   'v',
   'o',
   'i',
   'c',
   'e',
   'N',
   'O

I am going to try to get things working using information from [this article](https://nanonets.com/blog/ocr-with-tesseract/).