# Data Extraction: Automating Invoice Processing

This notebook demonstrates how to extract structured data from text documents like receipts and invoices using AI. We'll explore:

1. Basic text extraction using prompts
2. Structured output formats (JSON, XML)
3. Running the same extraction with a local model

The techniques shown here can help automate manual data entry tasks and standardize information extraction from semi-structured documents.

In [1]:
with open("./invoice-data-sample.txt", "r") as f:
    receipt_data = f.read()
    
receipt_data

'EXEMPLO, UNIPESSOAL, LDA. AV. DA LIBERDADE, 120 - 2o ESQ 123456789 1250-140 LISBOA\nEXEMPLO, UNIPESSOAL, LDA.\nAV. DA LIBERDADE, 120 - 2o ESQ 1250-140 LISBOA\nLISBOA\nJOANA SILVA PEREIRA 0033\nDATA ANALYST 12345678901\n298765432\nAFIN 0010.20.987654\n123456789 Original\nLISBOA\nRecibo de Vencimentos\nRecibo de Vencimentos Período Janeiro\nData Fecho 31/01/2024 Vencimento 3.900,00\nPeríodo Data Fecho Vencimento Venc. / Hora N. Dias Mês:\nFaltas Alim.\nCód.\nR01 R06 R13 D01 D02 D05\nJaneiro 31/01/2024\n3.900,00 20,00 20.00\nNome\nN.o Mecan. Categoria\nN.o Benef.\nN.o Contrib. Departamento Seguro\nJOANA SILVA PEREIRA 0033\nDATA ANALYST 12345678901\n298765432\nAFIN 0010.20.987654\nRetenção IRS\nSDD IRS Retido Total Remun.\nVenc. / Hora N. Dias Mês:\n20,00 20.00\nNome\nN.o Mecan. Categoria\nN.o Benef.\nN.o Contrib. Departamento Seguro\nTurno Data\n01-2024 01-2024 01-2024 01-2024 01-2024 01-2024\nCDD\nSDH\nFaltas\nAlim. Turno CDH\nRetenção IRS\nIRS Retido\nCDH\nDescrição Remunerações Descon

In [2]:
from IPython.display import Markdown

Markdown(receipt_data)

EXEMPLO, UNIPESSOAL, LDA. AV. DA LIBERDADE, 120 - 2o ESQ 123456789 1250-140 LISBOA
EXEMPLO, UNIPESSOAL, LDA.
AV. DA LIBERDADE, 120 - 2o ESQ 1250-140 LISBOA
LISBOA
JOANA SILVA PEREIRA 0033
DATA ANALYST 12345678901
298765432
AFIN 0010.20.987654
123456789 Original
LISBOA
Recibo de Vencimentos
Recibo de Vencimentos Período Janeiro
Data Fecho 31/01/2024 Vencimento 3.900,00
Período Data Fecho Vencimento Venc. / Hora N. Dias Mês:
Faltas Alim.
Cód.
R01 R06 R13 D01 D02 D05
Janeiro 31/01/2024
3.900,00 20,00 20.00
Nome
N.o Mecan. Categoria
N.o Benef.
N.o Contrib. Departamento Seguro
JOANA SILVA PEREIRA 0033
DATA ANALYST 12345678901
298765432
AFIN 0010.20.987654
Retenção IRS
SDD IRS Retido Total Remun.
Venc. / Hora N. Dias Mês:
20,00 20.00
Nome
N.o Mecan. Categoria
N.o Benef.
N.o Contrib. Departamento Seguro
Turno Data
01-2024 01-2024 01-2024 01-2024 01-2024 01-2024
CDD
SDH
Faltas
Alim. Turno CDH
Retenção IRS
IRS Retido
CDH
Descrição Remunerações Descontos
SDH
SDD
Total Remun.
19.200,00
Descontos
500,00 1.200,00 150,00
CDD
Descrição Remunerações
Vencimento 3.900,00 IHT 650,00
5.400,00 19.200,00
5.400,00
Vencimento 3.900,00 IHT 650,00
Cód. Data
R01 01-2024 R06 01-2024 R13 01-2024 D01 01-2024 D02 01-2024 D05 01-2024
Subsídio Alimentação Cartão Segurança Social
IRS (Venc. 29,18%) (29,18%) Desconto Subsidio Alim. Cartão
150,00
Subsídio Alimentação Cartão Segurança Social
IRS (Venc. 29,18%) (29,18%) Desconto Subsidio Alim. Cartão
150,00
500,00 1.200,00 150,00
Formas de Pagamento: % Remuneração
100,00
Forma de Pagamento Moeda
Formas de Pagamento: % Remuneração
100,00
Forma de Pagamento
Transferência
Moeda
Total
4.400,00 1.800,00 Total Pago ( EUR ) 2.950,00
Total
4.400,00 Total Pago ( EUR )
1.800,00 2.950,00
Declaro que recebi a quantia constante neste recibo,
Obs.
Declaro que recebi a quantia constante neste recibo,
Transferência EUR
EUR

In [3]:
from ai_tools import ask_ai

extraction_prompt = f"""

You are an extraction engine for receipt data.
Users will upload the contents of their receipts and you will extract
the following fields:
- Company name
- Date of closure
- Amount paid

Extract the data from the following receipt:
{receipt_data}
"""

structured_output = ask_ai(prompt=extraction_prompt)

structured_output

'Based on the provided receipt data, here is the extracted information:\n\n- **Company name:** EXEMPLO, UNIPESSOAL, LDA.\n- **Date of closure:** 31/01/2024\n- **Amount paid:** 2.950,00 EUR\n\nIf you need further assistance or additional extraction, feel free to ask!'

This output is correct, but it wraps the data in conversational text that we would have to strip out before using it in code.

To get around that, let's improve our initial prompt and ask for JSON only:

In [4]:
extraction_prompt_json = f"""
You are an extraction engine for receipt data.
Users will upload the contents of their receipts and you will extract
the following fields as JSON OBJECTS:
- Company name
- Date of closure
- Amount paid

Extract the data from the following receipt:
{receipt_data}

Your OUTPUT SHOULD ONLY BE A JSON OBJECT WITH THE FOLLOWING FIELDS:
- company_name
- date_of_closure
- amount_paid
"""

structured_output_json = ask_ai(prompt=extraction_prompt_json)

structured_output_json

'```json\n{\n  "company_name": "EXEMPLO, UNIPESSOAL, LDA.",\n  "date_of_closure": "31/01/2024",\n  "amount_paid": "2.950,00"\n}\n```'

In [5]:
# We need to import the json library to parse the JSON output
import json

def parse_json_output(json_str):
    """
    This function parses the JSON output from the AI and removes the markdown code block markers if present.
    """
    # Remove markdown code block markers if present
    json_str = json_str.replace('```json', '').replace('```', '').strip()
    
    # Parse the JSON string into a Python dictionary
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        print("Error: Could not parse JSON string")
        return None

parsed_json = parse_json_output(structured_output_json)


parsed_json

{'company_name': 'EXEMPLO, UNIPESSOAL, LDA.',
 'date_of_closure': '31/01/2024',
 'amount_paid': '2.950,00'}

In [6]:
print(f"Company Name: {parsed_json['company_name']}")
print(f"Date of Closure: {parsed_json['date_of_closure']}")
print(f"Amount Paid: {parsed_json['amount_paid']}")

Company Name: EXEMPLO, UNIPESSOAL, LDA.
Date of Closure: 31/01/2024
Amount Paid: 2.950,00
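
The parsed values are still plain strings in the receipt's own formatting, so they usually need normalizing before they can be stored or compared. As a minimal sketch, the helpers below convert the amount to a `Decimal` and the date to a `date`. The names `parse_amount` and `parse_date` are hypothetical and not part of `ai_tools`. The sketch assumes that any amount containing a comma uses Portuguese formatting and that dates are `DD/MM/YYYY`.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert an amount such as '2.950,00' or '2950.00' to a Decimal."""
    cleaned = text.strip()
    # Assumption: a comma means Portuguese formatting, where '.' separates
    # thousands and ',' marks the decimals. Values without a comma are read
    # as plain decimals, so '3.900' would NOT be treated as 3900.
    if "," in cleaned:
        cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


def parse_date(text: str) -> date:
    """Convert a 'DD/MM/YYYY' string to a date object."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


# Same values as the parsed output above, repeated so the block runs on its own.
example = {
    "company_name": "EXEMPLO, UNIPESSOAL, LDA.",
    "date_of_closure": "31/01/2024",
    "amount_paid": "2.950,00",
}

amount = parse_amount(example["amount_paid"])
closure = parse_date(example["date_of_closure"])
print(amount, closure.isoformat())
```

Using `Decimal` rather than `float` avoids rounding surprises with currency values.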


With Claude, we can get a similar result by asking for the answer inside XML tags, for example `<output>{"company_name":....etc....} </output>`, which makes the payload easy to locate in the response.
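
If the model wraps a JSON object in `<output>` tags as described above, one way to recover it is to match the tags and hand the inner text to `json.loads`. This is a sketch under stated assumptions: `extract_tagged_json` is a hypothetical helper, and `sample_response` is a hand-written stand-in for a model reply, not real output.

```python
import json
import re


def extract_tagged_json(response: str, tag: str = "output"):
    """Return the JSON object found between <tag> and </tag>, or None."""
    # re.DOTALL lets '.' match newlines, since the payload may span lines.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None


# Hypothetical model response, written by hand for illustration.
sample_response = (
    "Here is the result:\n"
    "<output>\n"
    '{"company_name": "EXEMPLO, UNIPESSOAL, LDA.", '
    '"date_of_closure": "31/01/2024", "amount_paid": "2.950,00"}\n'
    "</output>"
)

print(extract_tagged_json(sample_response))
```

Because only the text between the tags is parsed, any conversational text around them is ignored.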



In [7]:
from ai_tools import ask_ai

ask_ai(prompt="Hi! Which model are you?", model_name="claude-3-5-sonnet-20240620")

"I am Claude, an AI assistant created by Anthropic to be helpful, harmless, and honest. I don't have a specific version number or model name beyond that."

In [8]:
extraction_prompt_claude = f"""
You are an extraction engine for receipt data.
Users will upload the contents of their receipts and you will extract key fields.

Extract the key fields from this receipt:
{receipt_data}

Format your response using XML tags like this:
<output>
  <company_name>The company name</company_name>
  <date_of_closure>The date of closure</date_of_closure>
  <amount_paid>The amount paid</amount_paid>
</output>

Only include the XML tags in your response, nothing else.
"""
output = ask_ai(prompt=extraction_prompt_claude, model_name="claude-3-5-sonnet-20240620")

output

'<output>\n  <company_name>EXEMPLO, UNIPESSOAL, LDA.</company_name>\n  <date_of_closure>31/01/2024</date_of_closure>\n  <amount_paid>2950.00</amount_paid>\n</output>'

Now, let's write a function that strips the outer `<output>` wrapper from Claude's response:

In [9]:
def parse_claude_output(output):
    """
    Strip the outer <output> wrapper from Claude's response, leaving the inner field tags intact.
    """
    # Remove the <output> and </output> tags if present
    output = output.replace('<output>', '').replace('</output>', '').strip()
    return output

output_parsed = parse_claude_output(output)

output_parsed

'<company_name>EXEMPLO, UNIPESSOAL, LDA.</company_name>\n  <date_of_closure>31/01/2024</date_of_closure>\n  <amount_paid>2950.00</amount_paid>'

Now we can pull out each field by matching its tags with a regular expression:


In [10]:
import re

def extract_field(output, field_name):
    """Extract value between XML tags for a given field."""
    pattern = f"<{field_name}>(.*?)</{field_name}>"
    match = re.search(pattern, output)
    return match.group(1) if match else None

# Extract each field
company_name = extract_field(output_parsed, "company_name")
date_of_closure = extract_field(output_parsed, "date_of_closure") 
amount_paid = extract_field(output_parsed, "amount_paid")

print(f"Company Name: {company_name}")
print(f"Date of Closure: {date_of_closure}")
print(f"Amount Paid: {amount_paid}")

Company Name: EXEMPLO, UNIPESSOAL, LDA.
Date of Closure: 31/01/2024
Amount Paid: 2950.00
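
As an alternative to regular expressions, the standard library's `xml.etree.ElementTree` can parse the whole `<output>` block at once. The sketch below uses a hypothetical helper, `parse_output_xml`, and assumes the response is well-formed XML. `ET.fromstring` raises `ParseError` otherwise, for example if a value contains an unescaped `&`.

```python
import xml.etree.ElementTree as ET


def parse_output_xml(xml_text: str) -> dict:
    """Parse an <output>...</output> block into a dict of child tag -> text."""
    root = ET.fromstring(xml_text.strip())
    # Each direct child of <output> becomes one dictionary entry.
    return {child.tag: (child.text or "").strip() for child in root}


# Same content as Claude's response above, repeated so the block runs on its own.
sample_xml = (
    "<output>\n"
    "  <company_name>EXEMPLO, UNIPESSOAL, LDA.</company_name>\n"
    "  <date_of_closure>31/01/2024</date_of_closure>\n"
    "  <amount_paid>2950.00</amount_paid>\n"
    "</output>"
)

fields = parse_output_xml(sample_xml)
print(fields["company_name"])
```

This returns every field in one pass, so new tags added to the prompt need no extra parsing code.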


But what if you don't want to send your private data to a cloud provider?

In that case, we can use local models. They have improved enough that we can now use them to extract structured outputs much like the ones above.

In [11]:
from ai_tools import ask_local_ai
import json

extraction_prompt_json = f"""
You are an extraction engine for receipt data.
Users will upload the contents of their receipts and you will extract
the following fields as JSON OBJECTS:
- Company name
- Date of closure
- Amount paid

Extract the data from the following receipt:
{receipt_data}

Your OUTPUT SHOULD ONLY BE A JSON OBJECT WITH THE FOLLOWING FIELDS:
- company_name
- date_of_closure
- amount_paid
"""

output_string = ask_local_ai(extraction_prompt_json, structured=True)

output_json = json.loads(output_string)

print(f"Company Name: {output_json['company_name']}")
print(f"Date of Closure: {output_json['date_of_closure']}")
print(f"Amount Paid: {output_json['amount_paid']}")

Company Name: EXEMPLO, UNIPESSOAL, LDA. AV. DA LIBERDADE
Date of Closure: 31/01/2024
Amount Paid: 2950.0

Note that the local model's answer differs slightly from the earlier ones: the company name includes part of the address, and the amount is formatted as `2950.0`. It is worth checking extracted values before relying on them.


For those interested in exploring structured extraction further, a more robust option is `pydantic`, a data validation library that integrates well with LLM APIs such as OpenAI's and Anthropic's to produce structured outputs in a more programmatic and organized way.
See an example in `./structured_output_with_pydantic.py`.
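
To give a feel for what schema-based validation buys you, here is a minimal standard-library sketch. It does not use `pydantic` and does not reproduce the referenced script. It uses a `dataclass` as a stand-in for the idea of declaring the expected fields once and rejecting responses that do not match. `ReceiptFields` and `from_json` are hypothetical names.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ReceiptFields:
    """Hypothetical schema for the three fields extracted in this notebook."""

    company_name: str
    date_of_closure: str
    amount_paid: str

    @classmethod
    def from_json(cls, json_str: str) -> "ReceiptFields":
        """Build a record from a JSON string, failing loudly on bad input."""
        data = json.loads(json_str)
        if not isinstance(data, dict):
            raise ValueError("Expected a JSON object")
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in data]
        if missing:
            raise ValueError(f"Missing fields: {missing}")
        # Coerce to str so numeric answers (e.g. 2950.0) fit the same schema.
        return cls(**{name: str(data[name]) for name in names})


receipt = ReceiptFields.from_json(
    '{"company_name": "EXEMPLO, UNIPESSOAL, LDA.", '
    '"date_of_closure": "31/01/2024", "amount_paid": 2950.0}'
)
print(receipt)
```

A validation library adds type coercion, richer error messages, and schema generation on top of this basic pattern.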