The purpose of this document is to try out the [EasyOCR](https://github.com/JaidedAI/EasyOCR) library. It consistently garnered the most praise, and it was rumored to outperform the pytesseract library (which is a wrapper for Google's Tesseract OCR engine). I will try it on a few of my more challenging example invoices and see whether it returns cleaner text than I currently have in my dataset.

In [11]:
import cv2
import easyocr
import json
import os
from PIL import Image

I'll read in the old data for comparison. I'll focus on examples 6 through 9 from the old data, because their text was not identified cleanly.

In [3]:
with open('text_data.json', 'r') as f:
    old_data = json.load(f)

old_data[6]

{'file': 'Invoice-template-example-for-a-marketing-firm.webp',
 'text': 'KirkPatrick Marketing Co.\n651 Emily Drive\nColumbia, SC 29201\n\n503-951-7624 Invoice «2084\nDecember 23.2023\n\nBILL TO\n\nAtionta, GA 30208\n\n404 571-1634\n\nDESCRIPTION HOURS RATE AMOUNT\n\nPua Laundry Services Logo Design 2 si0a $200\n\nInstagram Social Assets 2 si00 $300\n\nYour total amount due is . Thank\nyou so much for your business.\n\n‘er month Maka ak checks payabie to KekPatrck Marketing Co,\n'}

To work with the EasyOCR library, I first have to instantiate a `Reader` object with the list of languages I would like to detect.

In [4]:
reader = easyocr.Reader(['en'])

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
Downloading detection model, please wait. This may take several minutes depending upon your network connection.


Progress: |██████████████████████████████████████████████████| 100.0% Complete

Downloading recognition model, please wait. This may take several minutes depending upon your network connection.


Progress: |██████████████████████████████████████████████████| 100.0% Complete

Now that I have the reader, I normally wouldn't need to open the image with another library, because `readtext` can read directly from a filepath.

Note: it could not read my webp files from a path, so I will instead load each image as a NumPy array using the cv2 library and pass the array to `readtext`.

In [13]:
filepath = os.path.join('examples/', old_data[6]['file'])
image = cv2.imread(filepath)
reader.readtext(image)

[([[147, 53], [313, 53], [313, 71], [147, 71]],
  'KirkPatrick Marketing Co.',
  0.9632051144407789),
 ([[387, 47], [577, 47], [577, 89], [387, 89]], 'INVOICE', 0.4532020407974415),
 ([[147, 71], [221, 71], [221, 83], [147, 83]],
  '851 Emnly Crive',
  0.3887299338834006),
 ([[147, 83], [243, 83], [243, 95], [147, 95]],
  'Columbix, SC 79201',
  0.12316989216701457),
 ([[147, 95], [217, 95], [217, 109], [147, 109]],
  '603-931-7624',
  0.30317013004818727),
 ([[507, 95], [575, 95], [575, 109], [507, 109]],
  'Invoice #2034',
  0.5746950432493378),
 ([[485, 109], [575, 109], [575, 121], [485, 121]],
  'Decem ber 71,2023',
  0.12534118796311425),
 ([[147, 133], [187, 133], [187, 147], [147, 147]],
  'BILL TO',
  0.7551944156456212),
 ([[147, 147], [177, 147], [177, 159], [147, 159]],
  'Marie',
  0.13414225838630503),
 ([[183, 147], [205, 147], [205, 159], [183, 159]], 'Pue', 0.9565716161646148),
 ([[147, 159], [169, 159], [169, 171], [147, 171]],
  'ZUu',
  0.14261201549128802),
 ([[172
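Each entry in the detailed output is a tuple of bounding-box corners, recognized text, and a confidence score. A minimal sketch of how those scores could be summarized, using the first few entries copied from the output above; the helper name `split_by_confidence` and the 0.5 cutoff are illustrative assumptions, not part of EasyOCR:

```python
from statistics import mean

# A few (bounding box, text, confidence) entries copied from the output above.
results = [
    ([[147, 53], [313, 53], [313, 71], [147, 71]],
     'KirkPatrick Marketing Co.', 0.9632051144407789),
    ([[387, 47], [577, 47], [577, 89], [387, 89]],
     'INVOICE', 0.4532020407974415),
    ([[147, 71], [221, 71], [221, 83], [147, 83]],
     '851 Emnly Crive', 0.3887299338834006),
    ([[147, 83], [243, 83], [243, 95], [147, 95]],
     'Columbix, SC 79201', 0.12316989216701457),
]


def split_by_confidence(results, threshold=0.5):
    """Hypothetical helper: split texts by an arbitrary confidence cutoff."""
    kept = [text for _, text, conf in results if conf >= threshold]
    rejected = [text for _, text, conf in results if conf < threshold]
    return kept, rejected


kept, rejected = split_by_confidence(results)
avg = mean(conf for _, _, conf in results)

print(f"mean confidence: {avg:.2f}")
print("kept:", kept)
print("rejected:", rejected)
```

On these four entries, only the company name clears the cutoff, which matches the visual impression that the address lines are unreliable.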

Not super promising. I'll pass the optional argument `detail=0` so that only the recognized text is returned, without the bounding boxes and confidence scores.

In [14]:
reader.readtext(image, detail=0)

['KirkPatrick Marketing Co.',
 'INVOICE',
 '851 Emnly Crive',
 'Columbix, SC 79201',
 '603-931-7624',
 'Invoice #2034',
 'Decem ber 71,2023',
 'BILL TO',
 'Marie',
 'Pue',
 'ZUu',
 'Huol',
 'Avenve',
 'Atloneo',
 'GA 50309',
 '402 ,71 10J4',
 'BESCRIPTION',
 'HOURS',
 'RATE',
 'AMOUNT',
 'Laundrv Services LoRe Desien',
 '310G',
 '8200',
 'Inseagram 3aci01 Asscts',
 '3100',
 '8300',
 'Your total amount due is . Thank',
 'you s0 much for your business_',
 'MiovE CuY quesiOi',
 'UbJu YCI',
 'Mnyuae',
 'Jieuse CoCLL Juctte',
 'mOun: duc',
 'doy',
 'Ucouris Ad fci',
 '95YICE - Ies',
 '1096',
 'McnGF Rdre',
 'Cnece',
 '{kPOliL< Marceling Co',
 'MyoL',
 'Po,abr']

This is a lot worse than the pytesseract version, for two reasons. First, it's less accurate. Second, the text is chopped into smaller fragments, which would make it extremely difficult to identify an address. I'll try it on another image before I settle on the pytesseract library for this project.
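The fragmentation could in principle be reduced by regrouping fragments with the bounding boxes from the detailed output. The following is a rough sketch, not something EasyOCR provides under this name: `group_into_lines` and its `tolerance` of 6 pixels are assumptions chosen for this example, and the entries are copied from the detailed output earlier (the corners of 'BILL TO', 'Marie', 'Pue', and 'ZUu' are as printed there).

```python
def group_into_lines(results, tolerance=6):
    """Hypothetical helper: merge (bbox, text, conf) entries that share a row.

    Entries whose vertical centers lie within `tolerance` pixels of the first
    entry on the current line are joined left to right.
    """
    def y_center(bbox):
        ys = [point[1] for point in bbox]
        return (min(ys) + max(ys)) / 2

    def x_left(bbox):
        return min(point[0] for point in bbox)

    # Walk the entries from top to bottom, opening a new line when the
    # vertical gap to the current line is too large.
    lines = []
    for entry in sorted(results, key=lambda r: y_center(r[0])):
        if lines and abs(y_center(entry[0]) - y_center(lines[-1][0][0])) <= tolerance:
            lines[-1].append(entry)
        else:
            lines.append([entry])

    # Within each line, order fragments from left to right and join them.
    return [
        ' '.join(text for _, text, _ in sorted(line, key=lambda r: x_left(r[0])))
        for line in lines
    ]


# Entries copied from the detailed output earlier in this document.
results = [
    ([[147, 95], [217, 95], [217, 109], [147, 109]], '603-931-7624', 0.30317013004818727),
    ([[507, 95], [575, 95], [575, 109], [507, 109]], 'Invoice #2034', 0.5746950432493378),
    ([[147, 133], [187, 133], [187, 147], [147, 147]], 'BILL TO', 0.7551944156456212),
    ([[147, 147], [177, 147], [177, 159], [147, 159]], 'Marie', 0.13414225838630503),
    ([[183, 147], [205, 147], [205, 159], [183, 159]], 'Pue', 0.9565716161646148),
    ([[147, 159], [169, 159], [169, 171], [147, 171]], 'ZUu', 0.14261201549128802),
]

lines = group_into_lines(results)
print(lines)
```

This only fixes the chopping, not the recognition errors. EasyOCR's `readtext` also accepts a `paragraph` argument for merging nearby boxes, which may be worth trying before writing custom grouping code.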

In [15]:
filepath = os.path.join('examples/', old_data[7]['file'])
image = cv2.imread(filepath)
reader.readtext(image, detail=0)

['East Repair Inc_',
 '1912 Harvest Lane',
 'New York; NY 12210',
 'BILL TO',
 'SHIP TO',
 'INVOICE #',
 'US-001',
 'John Smith',
 'John Smith',
 'INVOICE DATE',
 '11/02/2019',
 'Court Square',
 '3787 Pineview Drive',
 'Po#',
 '2312/2019',
 'New York; NY 12210',
 'Cambridge',
 'MA 12210',
 'DUE DATE',
 '26/02/2019',
 'Invoice Total',
 'S154.06',
 'QTY',
 'DESCRIPTION',
 'UNIT PRICE',
 'AMOUNT',
 'Front and rear brake cables',
 '100.00',
 '100.00',
 'New set of pedal arms',
 '15.00',
 '30.00',
 'Labor 3hrs',
 '5.00',
 '15.00',
 'Subtotal',
 '145.00',
 'Sales Tax 6.25%',
 '9.06',
 'KmSwhh',
 'TERMS & CONDITIONS',
 'Payment is due within 15 days',
 'Please make checks payable to: East Repair Inc_']

This is still worse. I could spend time tuning EasyOCR, but I would rather tune pytesseract, which performs better out of the box.
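The comparison above is by eye. One way to put a number on it, sketched here with the standard library's `difflib`, is to score each OCR line against a hand-typed reference. The reference strings below are assumptions (my guess at what the invoice actually says, not verified against the image), and `similarity` is a hypothetical helper name; the same function could be run on the pytesseract text for a side-by-side score.

```python
from difflib import SequenceMatcher


def similarity(candidate, reference):
    """Hypothetical helper: character-level similarity in the range [0, 1]."""
    return SequenceMatcher(None, candidate, reference).ratio()


# EasyOCR lines copied from the output for example 7, paired with
# ASSUMED reference transcriptions (not checked against the image).
pairs = [
    ('East Repair Inc_', 'East Repair Inc.'),
    ('New York; NY 12210', 'New York, NY 12210'),
]

scores = [similarity(ocr_line, reference) for ocr_line, reference in pairs]
average = sum(scores) / len(scores)

for (ocr_line, _), score in zip(pairs, scores):
    print(f"{ocr_line!r}: {score:.3f}")
print(f"average: {average:.3f}")
```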