The purpose of this document is to make sense of the text data extracted from the example invoices and use a regex pattern to extract the addresses. Some addresses were on multiple lines in the images, which may lead to complications. I look forward to wrestling with this problem and hopefully programming an effective solution that can be scripted.

In [1]:
import json
import re

In [2]:
with open('text_data.json', 'r') as f:
    text_data = json.load(f)

text_data[0]    

{'file': '1131w-gU_JD5OzAAQ.webp',
 'text': 'INVOICE ZN\n\nWARDIERE INC.\n\nBILL TO:\n\nOlivia Wilson Date: 15/08/2028\nhello@reallygreatsite.com\n\n123 Anywhere St., Any City, ST 12345 Invoice NO. 2000-15\nFROM:\n\nWardiere Inc.\nhello@reallygreatsite.com\n123 Anywhere St., Any City, ST 12345,\n\nDESCRIPTION HOURS PRICE TOTAL\n\nGraphic design consultiation 2 $100.00 $200.00\n\nLogo design 1 $700.00 $700.00\n\nSocial media templates 1 $600.00 $600.00\n\nRevision 2. $300.00 $600.00\n\nTotal amount $2,100.00\nPAYMENT METHOD NOTES\n\nBank name: Fauget\nAccount No: 123-456-7890\n\nDate Thank youl Signature\n\nwww.reallygreatsite.com\n'}

Beautiful. In the first example, the address is in one place. I will write a regular expression to capture the address and test it against this example.

I will be more explicit than I likely need to be, because my future approach may require me to split the regular expression into two (for the first and second parts of an address).

In [3]:
regexp = r'\d+\s[\w\s\.]+,[\w\s]+,\s\w{2}\s\d{5}'

re.findall(regexp, text_data[0]['text'])

['123 Anywhere St., Any City, ST 12345',
 '123 Anywhere St., Any City, ST 12345']
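
For reference, here is the same pattern written with re.VERBOSE so each piece carries a comment. This is only an annotated restatement of the pattern above; the name address_pattern and the shortened sample string are illustrative.

```python
import re

# Same pattern as above, spelled out piece by piece.
# In verbose mode, whitespace in the pattern is ignored, so \s is written explicitly.
address_pattern = re.compile(r"""
    \d+            # street number
    \s             # one whitespace character
    [\w\s\.]+      # street name: word characters, whitespace, periods
    ,              # comma after the street
    [\w\s]+        # city name
    ,\s            # comma and whitespace before the state
    \w{2}          # two-character state abbreviation
    \s             # one whitespace character
    \d{5}          # five-digit ZIP code
""", re.VERBOSE)

# Shortened excerpt of the first invoice's text.
sample = 'Wardiere Inc.\nhello@reallygreatsite.com\n123 Anywhere St., Any City, ST 12345,\n'
print(address_pattern.findall(sample))
```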

This is off to a decent start. I'll run this against every image in my example set to see if there are more single-line addresses I can capture.

In [4]:
hits = {}

for i, invoice in enumerate(text_data):
    addresses = re.findall(regexp, invoice['text'])
    hits[invoice['file']] = addresses

hits    

{'1131w-gU_JD5OzAAQ.webp': ['123 Anywhere St., Any City, ST 12345',
  '123 Anywhere St., Any City, ST 12345'],
 '1131w-zvoLwRH8Wys.webp': ['5 July 2025 123 Anywhere St., Any City, ST 12345'],
 '609d5d3c4d120e370de52b70_invoice-lp-light-border.png': [],
 'Commercial-invoice-example.png': [],
 'IC-Business-Invoice-Template.jpg': [],
 'invoice-freshbooks-business.jpg': [],
 'Invoice-template-example-for-a-marketing-firm.webp': [],
 'invoice-template-us-band-blue-750px.png': ['3787 Pineview Drive\nNew York, NY 12210 Cambridge, MA 12210'],
 'invoice-template-us-dexter-750px.png': ['001\nJohn Smith John Smith INVOICE DATE 1170212019\n2Court Square 3787 Pineview Drive Pout nat2p019\nNew York, NY 12210 Cambridge, MA 12210'],
 'services-invoice-with-hours-and-rate-green-modern-simple-1-1-f82c825b6ce1.webp': [],
 'simple-invoice-template.png': []}

A couple of problems here. First, my regular expression clearly needs improvement: it is too lenient and lets nonsense through. Second, there are addresses spanning multiple lines, and pieces of different addresses can be combined into an address that does not exist. For example, the Pineview Drive address is supposed to be in Cambridge, MA, but the match picks up New York, NY first. I may try to get some positional information from pytesseract to aid my analysis.
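
Part of the leniency, as far as I can tell, comes from \s: it matches newline characters too, so the character classes are free to run across several lines of OCR text. Below is a minimal sketch of the effect. The sample text and the single_line pattern are my own illustrations (I have not run them against the invoices), and swapping \s for a literal space would also stop the pattern from matching addresses that really do span two lines.

```python
import re

# The original pattern: \s inside the character classes also matches '\n'.
loose = r'\d+\s[\w\s\.]+,[\w\s]+,\s\w{2}\s\d{5}'
# Hypothetical tightening: literal spaces keep a match on a single line.
single_line = r'\d+ [\w .]+,[\w ]+, \w{2} \d{5}'

# Made-up OCR text modeled on the noisy output above.
text = 'INVOICE 001\nJohn Smith\n3787 Pineview Drive, Cambridge, MA 12210\n'

print(re.findall(loose, text))        # starts at "001" and runs across three lines
print(re.findall(single_line, text))  # starts at the street number
```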

I will first fix my regular expression to be more precise, then I will revisit my previous notebook to pull more information from the invoices.

Making the comma optional has improved identification. I need to try making the number of possible words after the street number shorter.

In [5]:
regexp = r'\d+[\s\w\.]+,?[\w\s]+,\s\w{2}\s\d{5}'

hits = {}

for i, invoice in enumerate(text_data):
    addresses = re.findall(regexp, invoice['text'])
    hits[invoice['file']] = addresses

hits   

{'1131w-gU_JD5OzAAQ.webp': ['123 Anywhere St., Any City, ST 12345',
  '123 Anywhere St., Any City, ST 12345'],
 '1131w-zvoLwRH8Wys.webp': ['5 July 2025 123 Anywhere St., Any City, ST 12345'],
 '609d5d3c4d120e370de52b70_invoice-lp-light-border.png': [],
 'Commercial-invoice-example.png': ['23 Kg\n\nShipment Terms\n\nDDU\n\nINVOICE NO.\n000562\n\nAcme Industries\n\n9176 Riverside Drive\nPanama City, FL 32404',
  '10\n\n20\n\n02\n\nGlobex Corporation\n\n582 Grand Drive\nLithonia, GA 30038',
  '52 Indian Summer Lane\nAustin, MN 55912'],
 'IC-Business-Invoice-Template.jpg': ['123 Main Seat\nHamiton, OH 44416'],
 'invoice-freshbooks-business.jpg': [],
 'Invoice-template-example-for-a-marketing-firm.webp': ['651 Emily Drive\nColumbia, SC 29201',
  '2084\nDecember 23.2023\n\nBILL TO\n\nAtionta, GA 30208'],
 'invoice-template-us-band-blue-750px.png': ['1912 Harvest Lane\nNew York, NY 12210',
  '2.Court Square 3787 Pineview Drive\nNew York, NY 12210 Cambridge, MA 12210'],
 'invoice-template-us-d

I made some progress with that approach, but today is a new day. Today, I will try splitting the text data on the newline character. From there, I will clean up the resulting strings and try to match each one against my regular expression.

In [7]:
text_data[0]['text'].split('\n')

['INVOICE ZN',
 '',
 'WARDIERE INC.',
 '',
 'BILL TO:',
 '',
 'Olivia Wilson Date: 15/08/2028',
 'hello@reallygreatsite.com',
 '',
 '123 Anywhere St., Any City, ST 12345 Invoice NO. 2000-15',
 'FROM:',
 '',
 'Wardiere Inc.',
 'hello@reallygreatsite.com',
 '123 Anywhere St., Any City, ST 12345,',
 '',
 'DESCRIPTION HOURS PRICE TOTAL',
 '',
 'Graphic design consultiation 2 $100.00 $200.00',
 '',
 'Logo design 1 $700.00 $700.00',
 '',
 'Social media templates 1 $600.00 $600.00',
 '',
 'Revision 2. $300.00 $600.00',
 '',
 'Total amount $2,100.00',
 'PAYMENT METHOD NOTES',
 '',
 'Bank name: Fauget',
 'Account No: 123-456-7890',
 '',
 'Date Thank youl Signature',
 '',
 'www.reallygreatsite.com',
 '']

In [8]:
lines = [line for line in text_data[0]['text'].split('\n') if line != '']
lines

['INVOICE ZN',
 'WARDIERE INC.',
 'BILL TO:',
 'Olivia Wilson Date: 15/08/2028',
 'hello@reallygreatsite.com',
 '123 Anywhere St., Any City, ST 12345 Invoice NO. 2000-15',
 'FROM:',
 'Wardiere Inc.',
 'hello@reallygreatsite.com',
 '123 Anywhere St., Any City, ST 12345,',
 'DESCRIPTION HOURS PRICE TOTAL',
 'Graphic design consultiation 2 $100.00 $200.00',
 'Logo design 1 $700.00 $700.00',
 'Social media templates 1 $600.00 $600.00',
 'Revision 2. $300.00 $600.00',
 'Total amount $2,100.00',
 'PAYMENT METHOD NOTES',
 'Bank name: Fauget',
 'Account No: 123-456-7890',
 'Date Thank youl Signature',
 'www.reallygreatsite.com']

Cool, now I'll go through this list of strings and attempt to match each one against my regular expression.

In [11]:
regexp = r'\d+[\s\w\.]+,?[\w\s]+,\s\w{2}\s\d{5}'
addresses = []

for line in lines:
    address = re.match(regexp, line)
    if address:
        addresses.append(address.group(0))

addresses        


['123 Anywhere St., Any City, ST 12345',
 '123 Anywhere St., Any City, ST 12345']

This idea has enough merit for me to format my other 10 samples the same way: a list of strings, each string being a line from the invoice document.

In [12]:
docs_as_lines = []

for invoice in text_data:
    lines = [line for line in invoice['text'].split('\n') if line != '']
    docs_as_lines.append(lines)

len(docs_as_lines)    

11

Beautiful. I did it this way so that, when I encounter specific test cases where my code fails to return any addresses, I can investigate them to improve my approach to this problem. I will iterate through the invoices in an initial pass and see how many successes I have.

In [13]:
regexp = r'\d+[\s\w\.]+,?[\w\s]+,\s\w{2}\s\d{5}'

for i, doc in enumerate(docs_as_lines):
    for line in doc:
        address = re.match(regexp, line)
        if address:
            print(f'Found a match in invoice #{i}.')
            print(f'{address}')
            print()

Found a match in invoice #0.
<re.Match object; span=(0, 36), match='123 Anywhere St., Any City, ST 12345'>

Found a match in invoice #0.
<re.Match object; span=(0, 36), match='123 Anywhere St., Any City, ST 12345'>



This result is not discouraging. Looking at the above examples, most of the documents other than the first have newline characters in the middle of the address. This is where what may be my best idea comes into play: I will search every line for a match to the first line of an address (street number and name). If I find a match, I will search the following line for a match to the second line of an address (city, state abbreviation, and ZIP code). If both match, I will concatenate the two pieces and present them as a complete address.
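
A side note on the run above: re.match only tries to match at the very start of a string, so a complete address that follows other text on the same line is skipped as well, while re.search scans the whole line. A minimal sketch of the difference, using a made-up line rather than one from my invoices:

```python
import re

regexp = r'\d+[\s\w\.]+,?[\w\s]+,\s\w{2}\s\d{5}'

# Hypothetical line: the address is complete but does not start the line.
line = 'Ship to: 123 Anywhere St., Any City, ST 12345'

print(re.match(regexp, line))            # match only tries position 0
print(re.search(regexp, line).group(0))  # search scans the whole line
```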

Note: before I jump in, I need to refine my regular expressions for the first and second halves of an address. Once I have those working, I'll put them together for a final test.

In [20]:
regexp_one = r'\d+ [a-zA-Z]+[\s\w\.]*'

for i, doc in enumerate(docs_as_lines):
    for line in doc:
        address = re.search(regexp_one, line)
        if address:
            print(f'Found a match in file {text_data[i]["file"]}.')
            print(f'{address}')
            print()

Found a match in file 1131w-gU_JD5OzAAQ.webp.
<re.Match object; span=(0, 16), match='123 Anywhere St.'>

Found a match in file 1131w-gU_JD5OzAAQ.webp.
<re.Match object; span=(0, 16), match='123 Anywhere St.'>

Found a match in file 1131w-zvoLwRH8Wys.webp.
<re.Match object; span=(0, 11), match='63 Ivy Road'>

Found a match in file 1131w-zvoLwRH8Wys.webp.
<re.Match object; span=(0, 12), match='16 June 2025'>

Found a match in file 1131w-zvoLwRH8Wys.webp.
<re.Match object; span=(8, 36), match='5 July 2025 123 Anywhere St.'>

Found a match in file 609d5d3c4d120e370de52b70_invoice-lp-light-border.png.
<re.Match object; span=(0, 16), match='1234 Your Street'>

Found a match in file 609d5d3c4d120e370de52b70_invoice-lp-light-border.png.
<re.Match object; span=(5, 13), match='2021 INV'>

Found a match in file 609d5d3c4d120e370de52b70_invoice-lp-light-border.png.
<re.Match object; span=(18, 63), match='30 days using the link in your invoice email.'>

Found a match in file Commercial-invoice-exam

In [35]:
docs_as_lines[3]

['Commercial Invoice',
 'AIRWAY BILL NO.',
 '000231',
 'EXPORTER / SHIPPER',
 'COMPANY NAME',
 'ADDRESS',
 'CONTACT NAME',
 'PHONE / FAX',
 'EMAIL',
 'COUNTRY OF EXPORT',
 'Product',
 'Laser Mouse',
 'Dual XL Monitors',
 'Multi-jet Printer',
 'Total Weight',
 '23 Kg',
 'Shipment Terms',
 'DDU',
 'INVOICE NO.',
 '000562',
 'Acme Industries',
 '9176 Riverside Drive',
 'Panama City, FL 32404',
 'Lacey A Staley',
 '302-545-0909',
 'lacey@mail.com',
 'United States of America',
 'Qty',
 '10',
 '20',
 '02',
 'Globex Corporation',
 '582 Grand Drive',
 'Lithonia, GA 30038',
 'INVOICE DATE DATE OF EXPORT',
 '11/05/2020 11/05/2020',
 'SHIP TO / COSIGNEE',
 'COMPANY NAME Cala Foods',
 'ADDRESS 52 Indian Summer Lane',
 'Austin, MN 55912',
 'CONTACT NAME Andrew T McGuire',
 'PHONE / FAX 480-577-9916',
 'EMAIL andrew@mail.com',
 'COUNTRY OF Singapore',
 'DESTINATION',
 'Unit Price Amount',
 '$950.00 $9,500.00',
 '$150.00 $3,000.00',
 '$150.00 $300.00',
 'Sub Total $12,800.00',
 'Discount $300.00',
 

It's not perfect, but I am able to grab a lot of the first halves of addresses. I'm hoping the second half of the search will provide useful matches only.

In [24]:
regexp_two = r'([a-zA-Z]\s?)+, [A-Z]{2} \d{5}'

for i, doc in enumerate(docs_as_lines):
    for line in doc:
        address = re.search(regexp_two, line)
        if address:
            print(f'Found a match in file {text_data[i]["file"]}.')
            print(f'{address}')
            print()

Found a match in file 1131w-gU_JD5OzAAQ.webp.
<re.Match object; span=(18, 36), match='Any City, ST 12345'>

Found a match in file 1131w-gU_JD5OzAAQ.webp.
<re.Match object; span=(18, 36), match='Any City, ST 12345'>

Found a match in file 1131w-zvoLwRH8Wys.webp.
<re.Match object; span=(38, 56), match='Any City, ST 12345'>

Found a match in file Commercial-invoice-example.png.
<re.Match object; span=(0, 21), match='Panama City, FL 32404'>

Found a match in file Commercial-invoice-example.png.
<re.Match object; span=(0, 18), match='Lithonia, GA 30038'>

Found a match in file Commercial-invoice-example.png.
<re.Match object; span=(0, 16), match='Austin, MN 55912'>

Found a match in file IC-Business-Invoice-Template.jpg.
<re.Match object; span=(0, 17), match='Hamiton, OH 44416'>

Found a match in file Invoice-template-example-for-a-marketing-firm.webp.
<re.Match object; span=(0, 18), match='Columbia, SC 29201'>

Found a match in file Invoice-template-example-for-a-marketing-firm.webp.
<re.M

Not trying to be full of myself but I kinda nailed that one on the first try.

In [39]:
regexp_one = r'\d+ [a-zA-Z]+[\s\w\.]*'
regexp_two = r'([a-zA-Z]\s?)+, [A-Z]{2} \d{5}'

for i, doc in enumerate(docs_as_lines):
    for j, line in enumerate(doc):
        first = re.search(regexp_one, line)
        if first:
            try:
                second = re.search(regexp_two, doc[j+1])
            except IndexError:
                # The match was on the last line, so there is no following line to check.
                print('Reached end of document.')
                continue
            if second:
                full_address = first.group(0) + ', ' + second.group(0)
                print(f'Address found in document {i}.')
                print(f'Address found on lines {j} and {j+1}.')
                print(full_address)

Reached end of document.
Reached end of document.
Address found in document 3.
Address found on lines 21 and 22.
9176 Riverside Drive, Panama City, FL 32404
Address found in document 3.
Address found on lines 32 and 33.
582 Grand Drive, Lithonia, GA 30038
Address found in document 3.
Address found on lines 38 and 39.
52 Indian Summer Lane, Austin, MN 55912
Address found in document 4.
Address found on lines 3 and 4.
123 Main Seat, Hamiton, OH 44416
Address found in document 6.
Address found on lines 1 and 2.
651 Emily Drive, Columbia, SC 29201
Address found in document 7.
Address found on lines 1 and 2.
1912 Harvest Lane, New York, NY 12210
Address found in document 7.
Address found on lines 5 and 6.
3787 Pineview Drive, New York, NY 12210
Address found in document 8.
Address found on lines 2 and 3.
1912 Harvest Lane, New York, NY 12210
Address found in document 8.
Address found on lines 6 and 7.
3787 Pineview Drive Pout nat2p019, New York, NY 12210
Address found in document 9.
Address

Ok, this is super encouraging. I am detecting so many multi-line addresses in these documents. Now I need to work through them and manually verify that I have collected the correct addresses.

Note: in some of these instances, I believe the performance of my code could be improved by tinkering with OCR. If I did something to increase the contrast of the text (make the text more readable by the machine), I think I would be capturing even more addresses from the invoices. I plan to revisit this at some point in the future.

In [41]:
text_data[6]

{'file': 'Invoice-template-example-for-a-marketing-firm.webp',
 'text': 'KirkPatrick Marketing Co.\n651 Emily Drive\nColumbia, SC 29201\n\n503-951-7624 Invoice «2084\nDecember 23.2023\n\nBILL TO\n\nAtionta, GA 30208\n\n404 571-1634\n\nDESCRIPTION HOURS RATE AMOUNT\n\nPua Laundry Services Logo Design 2 si0a $200\n\nInstagram Social Assets 2 si00 $300\n\nYour total amount due is . Thank\nyou so much for your business.\n\n‘er month Maka ak checks payabie to KekPatrck Marketing Co,\n'}

In [42]:
docs_as_lines[6]

['KirkPatrick Marketing Co.',
 '651 Emily Drive',
 'Columbia, SC 29201',
 '503-951-7624 Invoice «2084',
 'December 23.2023',
 'BILL TO',
 'Ationta, GA 30208',
 '404 571-1634',
 'DESCRIPTION HOURS RATE AMOUNT',
 'Pua Laundry Services Logo Design 2 si0a $200',
 'Instagram Social Assets 2 si00 $300',
 'Your total amount due is . Thank',
 'you so much for your business.',
 '‘er month Maka ak checks payabie to KekPatrck Marketing Co,']

In [45]:
text_data[7]

{'file': 'invoice-template-us-band-blue-750px.png',
 'text': 'East Repair Inc.\n\n1912 Harvest Lane\nNew York, NY 12210\n\nBILLTO SHIP TO\n\nJohn Smith, John Smith\n\n2.Court Square 3787 Pineview Drive\nNew York, NY 12210 Cambridge, MA 12210\n\nINVOICE # us-001\nINVOICE DATE 11/02/2019\nP.O.# 2312/2019\nDUE DATE 26/02/2019\n\nInvoice Total\n\n$154.06\n\nQTY DESCRIPTION\n\n1 Front and rear brake cables\n2 New set of pedal arms\n\n3 Labor 3hrs\n\nTERMS & CONDITIONS\n\nPayment is due within 15 days\n\nPlease make checks payable to: East Repair Inc.\n\nUNIT PRICE AMOUNT\n100.00 100.00\n\n15.00 30.00\n\n5.00 15.00\n\nSubtotal 145.00\n\nSales Tax 6.25% 9.06\n\nSmith,\n\n'}

In [46]:
docs_as_lines[7]

['East Repair Inc.',
 '1912 Harvest Lane',
 'New York, NY 12210',
 'BILLTO SHIP TO',
 'John Smith, John Smith',
 '2.Court Square 3787 Pineview Drive',
 'New York, NY 12210 Cambridge, MA 12210',
 'INVOICE # us-001',
 'INVOICE DATE 11/02/2019',
 'P.O.# 2312/2019',
 'DUE DATE 26/02/2019',
 'Invoice Total',
 '$154.06',
 'QTY DESCRIPTION',
 '1 Front and rear brake cables',
 '2 New set of pedal arms',
 '3 Labor 3hrs',
 'TERMS & CONDITIONS',
 'Payment is due within 15 days',
 'Please make checks payable to: East Repair Inc.',
 'UNIT PRICE AMOUNT',
 '100.00 100.00',
 '15.00 30.00',
 '5.00 15.00',
 'Subtotal 145.00',
 'Sales Tax 6.25% 9.06',
 'Smith,']

OK. I have new ideas.

First: it's obvious the second half of the address is far more constrained. Its pattern is more precise, with almost no room for variation. I should scan the invoices for all matches to the SECOND half of an address, and then check the PREVIOUS line for something that matches the first part of an address. I didn't have any weird-looking addresses in my initial test, but I believe this is a more accurate way to find addresses.

Second: if I find the second half of an address, I should check that entire line for a full address BEFORE I check the previous line for the first half of the address. This would help me initially check for a one-liner, and if it was not present, resort to checking the previous line of text.

Third: to allow for multiple addresses on the same two lines (this happens in invoices 7 and 8), I should do a re.findall with the SECOND regular expression (for City, ST ZIP) and check the number of matches. If there is one match, I do a re.search on the previous line. If there are two, I do a re.findall on the previous line and "align" my addresses by pairing the first match from the second half with the first match from the first half, and so on.
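
One thing to watch for with this plan: regexp_two contains a capturing group, and when a pattern has one group, re.findall returns what that group captured instead of the full match. A small sketch using the line from invoice 7. The non-capturing variant and the variable names are my own assumption about how to work around it; re.finditer is another option.

```python
import re

line = 'New York, NY 12210 Cambridge, MA 12210'

# With a capturing group, findall returns the group's last capture, not the full match.
with_group = r'([a-zA-Z]\s?)+, [A-Z]{2} \d{5}'
print(re.findall(with_group, line))

# A non-capturing group (?:...) makes findall return each whole match.
without_group = r'(?:[a-zA-Z]\s?)+, [A-Z]{2} \d{5}'
print(re.findall(without_group, line))

# finditer works with either pattern and also exposes where each match sits.
matches = [(m.group(0), m.span()) for m in re.finditer(with_group, line)]
print(matches)
```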

Fourth: the overall flow of the program should be something like this:  
- Run a re.findall looking for the SECOND half of an address  
- Evaluate the number of matches in the result
    1. If it's just one:
        - Look for an entire address match in the current line. If there is one, great! Return it
        - If there is not an entire match, search the previous line for the first half of an address
    2. If it's two:
        - Check for two complete addresses on the current line
        - If they are not there, find all matches to the FIRST half of an address on the previous line. Pair match 0 with 0 and 1 with 1
- There is a chance I find two second halves and one first half. If I improve my OCR code, this should be unlikely
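
The flow above can be sketched roughly as follows. This is a sketch under assumptions, not a finished solution: find_addresses, STREET, CITY_STATE_ZIP and FULL are hypothetical names, the patterns are tightened variants of mine (literal spaces, a non-capturing group, street names without digits), and a count mismatch between first and second halves is simply skipped.

```python
import re

# Assumed patterns (not the notebook's originals): literal spaces instead of \s,
# a non-capturing city group so findall returns whole matches, and a street
# pattern that stops at the next number so two streets on one line are split.
STREET = r'\d+ [a-zA-Z]+(?: [a-zA-Z.]+)*'
CITY_STATE_ZIP = r'(?:[a-zA-Z] ?)+, [A-Z]{2} \d{5}'
FULL = STREET + ', ' + CITY_STATE_ZIP


def find_addresses(lines):
    """Hypothetical helper: collect addresses from one document's lines."""
    found = []
    for j, line in enumerate(lines):
        # Scan for the more constrained second half first.
        cities = re.findall(CITY_STATE_ZIP, line)
        if not cities:
            continue
        # Prefer complete addresses that sit on the current line.
        full = re.findall(FULL, line)
        if len(full) == len(cities):
            found.extend(full)
            continue
        # Otherwise look for the first halves on the previous line.
        streets = re.findall(STREET, lines[j - 1]) if j > 0 else []
        if len(streets) == len(cities):
            # Pair by position: first street with first city, and so on.
            found.extend(f'{s}, {c}' for s, c in zip(streets, cities))
        # A count mismatch (for example garbled OCR) is skipped, not guessed.
    return found


one_line = ['FROM:', 'Wardiere Inc.', '123 Anywhere St., Any City, ST 12345,']
two_line = ['Acme Industries', '9176 Riverside Drive', 'Panama City, FL 32404']
# Hypothetical cleaned-up version of the two-column lines in invoice 7.
two_column = ['2 Court Square 3787 Pineview Drive',
              'New York, NY 12210 Cambridge, MA 12210']
# The same lines as OCR actually produced them ('2.Court'): one street, two cities.
garbled = ['2.Court Square 3787 Pineview Drive',
           'New York, NY 12210 Cambridge, MA 12210']

for doc in (one_line, two_line, two_column, garbled):
    print(find_addresses(doc))
```

With the garbled lines the function returns nothing, which matches the last bullet: until the OCR output is cleaner, the two-halves-to-one-half case cannot be aligned safely.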

Fifth: Some of the text is a little wonky after it's extracted. I need to go back and parameterize OCR better until I get pristine text results.