# LLM and data extraction

In this notebook, we will explore how to use the Gemini API to extract metadata from invoices. We will use a PDF file as input and convert it to markdown text. Then, we will use the Gemini API to extract the vendor name, the buyer name and the total amount due from the markdown text.

We will compare different methods to extract metadata from invoices using the Gemini API, including:
1. Asking the API to extract the metadata directly from the markdown text.
2. Asking the API to extract the metadata and return the result in JSON format.
3. Using a JSON schema to define the expected output format.
4. Using function calls to extract metadata from the markdown text.

In all the following examples, we'll extract the same information from each invoice:
- Vendor Name
- Buyer Name
- Total Amount Due

## Why JSON?

- **Interoperability**: JSON is language-agnostic and easily parsed in Python, R, and other languages.
- **API Integration**: Many data sources and web services provide data in JSON format, making it essential for fetching and processing external data.
- **Hierarchical Structure**: Supports nested data, making it ideal for representing complex datasets like configurations or structured logs.
- **Integration with Pandas**: Python's `pandas` library provides seamless methods (`pd.read_json`, `to_json`) for handling JSON data.
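
As a small illustration of the points above, the sketch below round-trips a nested record through JSON using only the standard library. The record and its values (`Example Vendor GmbH`, the two items) are made up for the example and do not come from the sample invoices.

```python
import json

# Hypothetical invoice record with a nested list of items (illustrative values only).
invoice = {
    "vendor_name": "Example Vendor GmbH",
    "buyer_name": "Example Buyer AG",
    "items": [
        {"name": "Widget", "price": 10.5},
        {"name": "Gadget", "price": 4.5},
    ],
}

# Serialize to a JSON string, keeping non-ASCII characters readable.
json_text = json.dumps(invoice, ensure_ascii=False, indent=2)

# Parse it back into Python objects; the nested structure is preserved.
restored = json.loads(json_text)
items_total = sum(item["price"] for item in restored["items"])

print(restored == invoice)  # True
print(items_total)          # 15.0
```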

## Initialize the Google client and load the libraries

In [None]:
!pip install ipykernel
!pip install google-genai
!pip install -U pymupdf4llm

### Libraries

In [1]:
import sys
sys.path.append("../")

In [2]:
import json
import re
import os
import getpass

from IPython.display import Markdown, display

import pymupdf4llm
from pydantic import BaseModel, Field
from typing import List

from google import genai

In [8]:
API_KEY = getpass.getpass("Enter your Gemini API key: ")
LOCATION = 'europe-west1'
MODEL = 'gemini-2.0-flash'

In [9]:
client = genai.Client(api_key=API_KEY)

## Load pdf and convert to markdown

Let's look at the content of `sample-invoice-1` and `sample-invoice-2` and compare them.

In [5]:
# Render String into Markdown
def print_md(markdown_text):
    display(Markdown(markdown_text))

In [6]:
# Load the PDF file
pdf_path = "./data/"
pdf_filename = "sample-invoice-1.pdf"

markdown_text = pymupdf4llm.to_markdown(os.path.join(pdf_path, pdf_filename))

In [7]:
# Save the output to a markdown file (create the output folder if needed)
os.makedirs("output", exist_ok=True)
with open(f"output/{pdf_filename}-markdown.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

print_md(markdown_text[:380])

CPB Software (Germany) GmbH - Im Bruch 3 - 63897 Miltenberg/Main


Musterkunde AG

Mr. John Doe

Musterstr. 23

12345 Musterstadt Name: Stefanie Müller


Phone: +49 9371 9786-0


**Invoice WMACCESS Internet**


**VAT No. DE199378386**












|Invoice No|Customer No|Invoice Period|Date|
|---|---|---|---|
|123100401|12345|01.02.2024 - 29.02.2024|1. März 2024|

















----
#### **Exercise #1**: Convert Invoice 2 from PDF to Markdown
<div>
  <input type="checkbox" name="uchk">
  <label for="uchk">Invoice 2 is a Markdown</label>
</div>

In [None]:
# Load the PDF file
pdf_path = "./data/"
pdf_filename = "sample-invoice-2.pdf"

markdown_text = pymupdf4llm.to_markdown(os.path.join(pdf_path, pdf_filename))

In [11]:
print_md(markdown_text[:420])

Ship To
大嶋佳世
桜丘町21-5 セルリアンタワー
東京都渋谷区

Japan


Carrier My carrier
Payment Method Cash on delivery (COD)


# INVOICE # IN000057

Invoice Number IN000057

Date 2011-01-28

Order No. 000060


Invoice To


大嶋佳世
桜丘町21-5 セルリアンタワー
東京都渋谷区

Japan



|Description|Reference|Qty|Price|Total|
|---|---|---|---|---|
|iPod shuffle - Color : Blue|---|1|33,03 €|33,03 €|
|iPod Nano - Color : Blue, Disk space : 16Go|---|2|83,12 €|166,24 

----
#### **Exercise #2**: Convert Invoice 4 from PDF to Markdown
<div>
  <input type="checkbox" name="uchk">
  <label for="uchk">Invoice 4 is a Markdown file
</label>
</div>

In [None]:
# Load the PDF file
pdf_path = "./data/"
pdf_filename = "sample-invoice-4.pdf"

markdown_text = pymupdf4llm.to_markdown(os.path.join(pdf_path, pdf_filename))

In [None]:
print_md(markdown_text[:420])

----

## Default extraction

In this case, we will provide a prompt asking the API to extract the vendor name, the buyer name, and the total amount due from the markdown text. No extra instructions are given to the model.



In [10]:
# Generates Content given a prompt
def generate_completion(message: str):
    response = client.models.generate_content(model=MODEL, contents = message)
    return response.text

In [None]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}
"""

response = generate_completion(prompt)

In [None]:
print_md(response)

Here's the extracted information from the provided markdown text:

*   **Vendor Name:** Demo - modules for PrestaShop
*   **Buyer Name:** 大嶋佳世 (Oshima Kayo)
*   **Total Amount Due:** 211,28 €

### Result

We can see that the model was able to extract the **Vendor Name**, the **Buyer Name** and the **Total Amount Due** from the markdown text. The result is returned as plain text in markdown format.

This format is only loosely structured: using the values in code requires additional parsing.
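
To make that extra processing concrete, here is a minimal sketch that pulls the fields out of the plain-text answer shown above with a regular expression. `parse_bullets` and `BULLET_PATTERN` are hypothetical names introduced for this illustration, and the pattern assumes the exact bullet layout of this one response; a differently worded answer would not match.

```python
import re

# The plain-text answer returned by the model in the cell above.
plain_text_response = """Here's the extracted information from the provided markdown text:

*   **Vendor Name:** Demo - modules for PrestaShop
*   **Buyer Name:** 大嶋佳世 (Oshima Kayo)
*   **Total Amount Due:** 211,28 €
"""

# Matches lines shaped like "*   **Key:** value" (an assumption about the layout).
BULLET_PATTERN = re.compile(
    r"^\*\s+\*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.+)$", re.MULTILINE
)

def parse_bullets(text: str) -> dict:
    """Collect 'key: value' pairs from bold markdown bullets."""
    return {m["key"].strip(): m["value"].strip() for m in BULLET_PATTERN.finditer(text)}

parsed = parse_bullets(plain_text_response)
print(parsed["Total Amount Due"])  # 211,28 €
```

The pattern silently returns an empty dictionary when the layout changes, which is one reason to ask the model for a machine-readable format instead.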

> Let's request an easily digestible format

## Asking for JSON format

Here we add one more step: we ask the LLM to return the result in JSON format. This gives us a more structured output that is easier to parse.

In [26]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

response = generate_completion(prompt)

In [27]:
print_md(response)

```json
{
  "Vendor Name": "CPB Software (Germany) GmbH",
  "Buyer Name": "Musterkunde AG",
  "Total Amount Due": "453,53 €"
}
```

In [None]:
# Raises json.JSONDecodeError: the response is wrapped in a markdown code fence
json.loads(response)

In [28]:
# Remove the markdown code block markers
json_str = re.sub(r"^```(?:json)?\s*", "", response)
json_str = re.sub(r"\s*```$", "", json_str)

# Parse the JSON string
json.loads(json_str)

{'Vendor Name': 'CPB Software (Germany) GmbH',
 'Buyer Name': 'Musterkunde AG',
 'Total Amount Due': '453,53 €'}

----
#### **Exercise #3:** Also ask for the Invoice Date
<div>
  <input type="checkbox" name="uchk">
  <label for="uchk">Response also returns the invoice date</label>
</div>

In [None]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due
- Invoice Date

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

response = generate_completion(prompt)

# Remove the markdown code block markers
json_str = re.sub(r"^```(?:json)?\s*", "", response)
json_str = re.sub(r"\s*```$", "", json_str)

# Parse the JSON string
json.loads(json_str)

{'Vendor Name': 'Demo - modules for PrestaShop',
 'Buyer Name': '大嶋佳世',
 'Total Amount Due': '211,28 €',
 'Invoice Date': '2011-01-28'}

---

### Result 

Using regular expressions, we stripped the code fence markers and parsed the string into a Python dictionary.
> However, we can get this structured output directly when calling Google's API.
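
The two `re.sub` calls above only work when the fence sits at the very start and end of the response. A slightly more tolerant variant is sketched below; `parse_json_response` is a hypothetical helper, and it assumes the response contains at most one fenced block. The fence marker is built programmatically only to keep this example free of literal fences.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built in code to avoid a literal fence here
FENCE_PATTERN = re.compile(r"`{3}(?:json)?\s*(.*?)\s*`{3}", re.DOTALL)

def parse_json_response(text: str):
    """Parse JSON from a model response, with or without a markdown code fence."""
    match = FENCE_PATTERN.search(text)
    # Fall back to the whole text when no fenced block is found.
    candidate = match.group(1) if match else text
    return json.loads(candidate.strip())

# A made-up response with text around the fenced block.
fenced = (
    f"Here is the result:\n{FENCE}json\n"
    + '{"Vendor Name": "Example Vendor GmbH"}'
    + f"\n{FENCE}\nLet me know if you need more."
)

print(parse_json_response(fenced))
print(parse_json_response('{"Vendor Name": "Example Vendor GmbH"}'))
```

If the text is not valid JSON, `json.loads` still raises `json.JSONDecodeError`, so callers may want to catch that exception.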

## Adding a response format as a parameter

Google Gemini allows us to set the response MIME type to `application/json` through the `response_mime_type` configuration parameter. This way the model returns raw JSON without markdown code fences, which makes the result easier to parse.

In [8]:
# Generates a json given a prompt
def generate_completion_json(message: str):
    response = client.models.generate_content(model=MODEL, contents = message, config={"response_mime_type": "application/json"})
    return response.text


In [38]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

response = generate_completion_json(prompt)

In [39]:
print_md(response)

{
  "Vendor Name": "CPB Software (Germany) GmbH",
  "Buyer Name": "Musterkunde AG",
  "Total Amount Due": "453,53 €"
}

In [40]:
json.loads(response)

{'Vendor Name': 'CPB Software (Germany) GmbH',
 'Buyer Name': 'Musterkunde AG',
 'Total Amount Due': '453,53 €'}

----
#### **Exercise #4:** Also ask for the Invoice Date
<div>
  <input type="checkbox" name="uchk">
  <label for="uchk">Response also returns the invoice date</label>
</div>

In [17]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due
- Invoice Date

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

# Request the result directly as JSON
response = generate_completion_json(prompt)

In [18]:
print_md(response)

{
  "Vendor Name": "CPB Software (Germany) GmbH",
  "Buyer Name": "Musterkunde AG",
  "Total Amount Due": "453,53 €",
  "Invoice Date": "1. März 2024"
}

In [19]:
json.loads(response)

{'Vendor Name': 'CPB Software (Germany) GmbH',
 'Buyer Name': 'Musterkunde AG',
 'Total Amount Due': '453,53 €',
 'Invoice Date': '1. März 2024'}

### Result

This time the result is returned in JSON format as requested. We can directly parse the JSON object to extract the information using `json.loads()`. 

**However**, there is no guarantee that the JSON object will have the expected structure. The model may return keys that differ from the ones we expect, for example with different casing.
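
One way to guard against casing differences is to normalize the keys before using them. The sketch below is only an illustration: `to_snake_case` and `normalize_keys` are hypothetical helpers, not part of the Gemini SDK, and they handle flat dictionaries only.

```python
import re

def to_snake_case(key: str) -> str:
    """Convert 'Vendor Name', 'vendorName' or 'VENDOR-NAME' style keys to snake_case."""
    # Split camelCase boundaries: "vendorName" -> "vendor_Name".
    key = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key.strip())
    # Replace any run of non-alphanumeric characters with a single underscore.
    key = re.sub(r"[^0-9a-zA-Z]+", "_", key)
    return key.strip("_").lower()

def normalize_keys(record: dict) -> dict:
    """Return a copy of the record with snake_case keys (top level only)."""
    return {to_snake_case(k): v for k, v in record.items()}

record = {"Vendor Name": "CPB Software (Germany) GmbH", "buyerName": "Musterkunde AG"}
print(normalize_keys(record))
```

Normalization helps with casing and separators, but it cannot repair a key the model renamed entirely (for instance `seller` instead of `vendor_name`).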

## Custom json schema

This time we'll pass a JSON schema (see https://json-schema.org/) to the model. This way we can constrain the model to return the result in a specific structure and provide descriptions, types and default values for each field.


In [31]:
# Generates a json given a prompt and a schema
def generate_completion_json_schema(message: str, schema: dict):
    response = client.models.generate_content(model=MODEL, contents = message, config={"response_mime_type": "application/json",
                                                                                       "response_schema": schema})
    return response.text


In [None]:
schema = {
            "type": "object",
            "description": "Metadata from an invoice including its vendor name, buyer name and total amount due.",
            "properties": {
                "vendor_name": {
                    "type": "string",
                    "description": "The name of the vendor in the invoice.",
                    "default": "Unknown",
                },
                "buyer_name": {
                    "type": "string",
                    "description": "The name of the buyer in the invoice.",
                },
                "total_amount_due": {
                    "type": "number",
                    "description": "The total amount due in the invoice.",
                },
            },
        }

In [33]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

response = generate_completion_json_schema(prompt, schema)

In [34]:
json.loads(response)

{'vendor_name': 'Demo - modules for PrestaShop',
 'buyer_name': '大嶋佳世',
 'total_amount_due': 211.28}

### Result

Now the output will always correspond to the expected schema. Since everything is provided the model doesn't have to guess the shape or part of the shape of the output.
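
A defensive check before passing the result downstream can still be cheap insurance, in particular because the schema above does not list any field as required. The sketch below uses only the standard library; `EXPECTED_FIELDS` and `check_invoice_record` are hypothetical names, and the check covers presence and basic types only, not the full schema.

```python
from typing import List

# Expected fields and Python types, mirroring the schema used above.
EXPECTED_FIELDS = {
    "vendor_name": str,
    "buyer_name": str,
    "total_amount_due": (int, float),
}

def check_invoice_record(record: dict) -> List[str]:
    """Return a list of problems found in the record (empty list means OK)."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems

good = {"vendor_name": "Demo - modules for PrestaShop", "buyer_name": "大嶋佳世", "total_amount_due": 211.28}
bad = {"vendor_name": "Demo - modules for PrestaShop", "total_amount_due": "211,28 €"}

print(check_invoice_record(good))  # []
print(check_invoice_record(bad))
```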

## Function calling

Many LLMs don't support structured output. In that case, you can instruct the LLM to call a function to extract the information: you define the function signature, and the LLM calls the function with the extracted values as arguments.

![Tool Calling](./images/tool_calling.png)
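
The model does not execute anything itself: it only returns a function name and arguments, and the calling code decides what to do with them. The sketch below shows one possible dispatch step without calling the API. `LOCAL_FUNCTIONS`, `dispatch`, and the `SimpleNamespace` stand-in for the SDK's function-call object are assumptions made for this illustration, as are the example values.

```python
from types import SimpleNamespace

def extract_invoice_data(vendor_name: str, buyer_name: str, total_amount_due: float) -> dict:
    """Local function that receives the arguments chosen by the model."""
    return {
        "vendor_name": vendor_name,
        "buyer_name": buyer_name,
        "total_amount_due": total_amount_due,
    }

# Registry of functions the model is allowed to request (hypothetical).
LOCAL_FUNCTIONS = {"extract_invoice_data": extract_invoice_data}

def dispatch(function_call) -> dict:
    """Look up the requested function by name and call it with the given arguments."""
    handler = LOCAL_FUNCTIONS.get(function_call.name)
    if handler is None:
        raise ValueError(f"Unknown function requested: {function_call.name}")
    return handler(**dict(function_call.args))

# Stand-in for a function call returned by the model (made-up values).
fake_call = SimpleNamespace(
    name="extract_invoice_data",
    args={"vendor_name": "Example Vendor GmbH", "buyer_name": "Example Buyer AG", "total_amount_due": 12.5},
)

result = dispatch(fake_call)
print(result)
```

For pure extraction, the arguments themselves are already the result, so the dispatch step can be as simple as reading `args`.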

In [147]:
invoice_extraction = {
    "name": "extract_invoice_data",
    "description": "Extract key information from an invoice.",
    "parameters": {
        "type": "object",
        "properties": {
            "vendor_name": {
                "type": "string",
                "description": "The name of the vendor in the invoice.",
            },
            "buyer_name": {
                "type": "string",
                "description": "The name of the buyer in the invoice.",
            },
            "total_amount_due": {
                "type": "number",
                "description": "The total amount due in the invoice.",
            },
        },
        "required": ["vendor_name", "buyer_name", "total_amount_due"],
    },
}

def generate_completion_tool_calls(message: str):
    tools = genai.types.Tool(function_declarations=[invoice_extraction])
    response = client.models.generate_content(model=MODEL, 
                                              contents = message, 
                                              config = genai.types.GenerateContentConfig(tools=[tools]))
    return response.candidates[0].content.parts[0].function_call


In [148]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}
"""

response = generate_completion_tool_calls(prompt)

In [149]:
response

FunctionCall(
  args={
    'buyer_name': 'Musterkunde AG',
    'total_amount_due': 453.53,
    'vendor_name': 'CPB Software (Germany) GmbH'
  },
  name='extract_invoice_data'
)

----
#### **Exercise #5:** Also ask for the Invoice Date
<div>
  <input type="checkbox" name="uchk">
  <label for="uchk">Response also returns the invoice date</label>
</div>

In [29]:
invoice_extraction = {
    "name": "extract_invoice_data",
    "description": "Extract key information from an invoice.",
    "parameters": {
        "type": "object",
        "properties": {
            "vendor_name": {
                "type": "string",
                "description": "The name of the vendor in the invoice.",
            },
            "buyer_name": {
                "type": "string",
                "description": "The name of the buyer in the invoice.",
            },
            "total_amount_due": {
                "type": "number",
                "description": "The total amount due in the invoice.",
            },
            "invoice_date": {
                "type": "string", 
                "format": "date-time",
                "description": "The invoice date.",
            },            
        },
        "required": ["vendor_name", "buyer_name", "total_amount_due"],
    },
}

def generate_completion_tool_calls(message: str):
    tools = genai.types.Tool(function_declarations=[invoice_extraction])
    response = client.models.generate_content(model=MODEL, 
                                              contents = message, 
                                              config = genai.types.GenerateContentConfig(tools=[tools]))
    return response.candidates[0].content.parts[0].function_call


In [30]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due
- Invoice Date

Markdown text:
{markdown_text}
"""

response = generate_completion_tool_calls(prompt)

In [31]:
response

FunctionCall(
  args={
    'buyer_name': 'Musterkunde AG',
    'invoice_date': '1. März 2024',
    'total_amount_due': 453.53,
    'vendor_name': 'CPB Software (Germany) GmbH'
  },
  name='extract_invoice_data'
)

---

## Cost

Let's compute the cost of the completion. We'll use the Gemini API pricing: https://ai.google.dev/gemini-api/docs/pricing

In [None]:

def compute_gemini_2_flash_cost(message: str, verbose: bool = False) -> float:
    """
    Sends `message` to the 'gemini-2.0-flash' model and estimates the cost of the call.

    Args:
        message: The prompt to send to the Gemini API. Token counts are read from
                 the 'usage_metadata' attribute of the response
                 ('prompt_token_count' and 'candidates_token_count').
        verbose:  If True, prints the token counts and estimated cost to the console.
                  Defaults to False.

    Returns:
        The estimated cost in US dollars (USD) as a float.
    """
    tools = genai.types.Tool(function_declarations=[invoice_extraction])
    completion = client.models.generate_content(model=MODEL, 
                                                  contents = message, 
                                                  config = genai.types.GenerateContentConfig(tools=[tools]))
    input_tokens = completion.usage_metadata.prompt_token_count
    output_tokens = completion.usage_metadata.candidates_token_count

    #  Always check the official
    #  Google Cloud documentation for the most up-to-date pricing.
    cost_per_1M_input_tokens = 0.10  
    cost_per_1M_output_tokens = 0.40 

    total_cost = (input_tokens / 1e6) * cost_per_1M_input_tokens
    total_cost += (output_tokens / 1e6) * cost_per_1M_output_tokens

    if verbose:
        print(f"Total input tokens: {input_tokens}")
        print(f"Total output tokens: {output_tokens}")
        print(f"Total tokens: {input_tokens + output_tokens}")
        print(f"Estimated cost: ${total_cost:.4f}")

    return total_cost

In [None]:
compute_gemini_2_flash_cost(prompt, verbose=True)

Total input tokens: 1794
Total output tokens: 39
Total tokens: 1833
Estimated cost: $0.0002


0.00019500000000000002

As you can see, input tokens make up the major part of the cost. Here we pass the whole document to the LLM, which accounts for more than 90% of the cost.
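
The arithmetic behind that statement can be reproduced without calling the API. The sketch below reuses the token counts printed above and the same per-million-token rates as the function; `estimate_cost` is a hypothetical helper, and the rates are the notebook's assumptions rather than current prices.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_1m_input: float = 0.10,   # rate assumed in this notebook
    usd_per_1m_output: float = 0.40,  # rate assumed in this notebook
) -> dict:
    """Split the estimated cost into input and output parts."""
    input_cost = input_tokens / 1e6 * usd_per_1m_input
    output_cost = output_tokens / 1e6 * usd_per_1m_output
    total_cost = input_cost + output_cost
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total_cost,
        "input_share": input_cost / total_cost if total_cost else 0.0,
    }

breakdown = estimate_cost(input_tokens=1794, output_tokens=39)
print(f"Total cost: ${breakdown['total_cost']:.6f}")    # Total cost: $0.000195
print(f"Input share: {breakdown['input_share']:.0%}")   # Input share: 92%
```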

> How could we reduce the costs of the LLM?

## Conclusion

Structured output helps the LLM produce more reliable and more interpretable results. The chart below shows the relative performance of each method in terms of how reliably the output matches the expected JSON format.

![output_reliability](images/output_reliability.png)

----
#### Exercise: Also ask to return the list of items, and respective price
<div>
  <input type="checkbox" name="uchk">
  <label for="uchk">Response also returns the list of items, and respective price</label>
</div>

In [112]:
invoice_items_extraction = {
    "name": "extract_items_list",
    "description": "Extract the list of items and their prices from the text.",
    "parameters": {
        "type": "object",
        "properties": {
            "items_list": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "item_name": {
                            "type": "string",
                            "description": "The name of the item."
                        },
                        "item_price": {
                            "type": "number",
                            "description": "The price of the item."
                        }
                    },
                    "required": ["item_name", "item_price"]
                },
                "description": "A list of items, where each item includes its name and price."
            }
        },
        "required": ["items_list"],
    },
}

In [113]:

def generate_completion_multiple_tool_calls(message: str):
    tools = genai.types.Tool(function_declarations=[invoice_vendor_name_extraction
                                                    , invoice_buyer_name_extraction
                                                    , invoice_total_amount_due_extraction
                                                    , invoice_items_extraction])
    response = client.models.generate_content(model=MODEL, 
                                              contents = message, 
                                              config = genai.types.GenerateContentConfig(tools=[tools]))
    
    
    # Combine the outputs from each function call.
    extracted_data = {}
    tool_calls = response.candidates[0].content.parts
    for tool_call in tool_calls:
        function_name = tool_call.function_call.name
        arguments = tool_call.function_call.args
        if function_name == "extract_vendor_name":
            extracted_data["vendor_name"] = arguments["vendor_name"]
        elif function_name == "extract_buyer_name":
            extracted_data["buyer_name"] = arguments["buyer_name"]
        elif function_name == "extract_total_amount_due":
            extracted_data["total_amount_due"] = arguments["total_amount_due"]
        elif function_name == "extract_items_list":
            extracted_data["items_list"] = arguments["items_list"]
        elif function_name == "extract_shipping_amount":
            extracted_data["shipping_amount"] = arguments["shipping_amount"]
        elif function_name == "extract_invoice_date":
            extracted_data["invoice_date"] = arguments["invoice_date"]



    return extracted_data




In [115]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Items
- Total Amount Due

Markdown text:
{markdown_text}
"""

response = generate_completion_multiple_tool_calls(prompt)


In [116]:
response

{'vendor_name': 'Demo - modules for PrestaShop',
 'buyer_name': '大嶋佳世',
 'items_list': [{'item_price': 33.03,
   'item_name': 'iPod shuffle - Color : Blue'},
  {'item_price': 83.12,
   'item_name': 'iPod Nano - Color : Blue, Disk space : 16Go'}],
 'total_amount_due': 211.28}
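
The `if`/`elif` chain used to combine the function calls can also be written as a lookup table, which additionally makes it easy to skip response parts that carry no function call (for example plain text). This is a sketch under assumptions: `FUNCTION_TO_FIELD` and `merge_function_calls` are hypothetical names, and `SimpleNamespace` objects stand in for the parts of a real response.

```python
from types import SimpleNamespace

# Maps each declared function name to the field it fills in the result.
FUNCTION_TO_FIELD = {
    "extract_vendor_name": "vendor_name",
    "extract_buyer_name": "buyer_name",
    "extract_total_amount_due": "total_amount_due",
    "extract_items_list": "items_list",
    "extract_shipping_amount": "shipping_amount",
    "extract_invoice_date": "invoice_date",
}

def merge_function_calls(parts) -> dict:
    """Combine the arguments of all recognized function calls into one dictionary."""
    extracted = {}
    for part in parts:
        call = getattr(part, "function_call", None)
        if call is None:
            continue  # e.g. a text part without a function call
        field = FUNCTION_TO_FIELD.get(call.name)
        if field is not None and field in call.args:
            extracted[field] = call.args[field]
    return extracted

# Stand-ins for response parts (made-up values).
fake_parts = [
    SimpleNamespace(function_call=None, text="Extracting the details."),
    SimpleNamespace(function_call=SimpleNamespace(name="extract_vendor_name", args={"vendor_name": "Example Vendor GmbH"})),
    SimpleNamespace(function_call=SimpleNamespace(name="extract_total_amount_due", args={"total_amount_due": 12.5})),
]

print(merge_function_calls(fake_parts))
```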

---

# Complete Solution

In [74]:
invoice_vendor_name_extraction = {
    "name": "extract_vendor_name",
    "description": "Extract the vendor name.",
    "parameters": {
        "type": "object",
        "properties": {
            "vendor_name": {
                "type": "string",
                "description": "The name of the vendor in the invoice.",
            },
        },
        "required": ["vendor_name"],
    },
}

invoice_buyer_name_extraction = {
    "name": "extract_buyer_name",
    "description": "Extract the buyer name.",
    "parameters": {
        "type": "object",
        "properties": {
            "buyer_name": {
                "type": "string",
                "description": "The name of the buyer in the invoice.",
            },
        },
        "required": ["buyer_name"],
    },
}

invoice_total_amount_due_extraction = {
    "name": "extract_total_amount_due",
    "description": "Extract the total amount due.",
    "parameters": {
        "type": "object",
        "properties": {
            "total_amount_due": {
                "type": "number",
                "description": "The total amount due in the invoice.",
            },
        },
        "required": ["total_amount_due"],
    },
}


invoice_items_extraction = {
    "name": "extract_items_list",
    "description": "Extract the list of items and their prices from the text.",
    "parameters": {
        "type": "object",
        "properties": {
            "items_list": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "item_name": {
                            "type": "string",
                            "description": "The name of the item."
                        },
                        "item_price": {
                            "type": "number",
                            "description": "The price of the item."
                        }
                    },
                    "required": ["item_name", "item_price"]
                },
                "description": "A list of items, where each item includes its name and price."
            }
        },
        "required": ["items_list"],
    },
}

invoice_shipping_amount_extraction = {
    "name": "extract_shipping_amount",
    "description": "Extract the shipping amount.",
    "parameters": {
        "type": "object",
        "properties": {
            "shipping_amount": {
                "type": "number",
                "description": "The shipping amount in the invoice.",
            },
        },
        "required": ["shipping_amount"],
    },
}


invoice_date_extraction = {
    "name": "extract_invoice_date",
    "description": "Extract the invoice date.",
    "parameters": {
        "type": "object",
        "properties": {
            "invoice_date": {
                "type": "string",
                "description": "The date of the invoice (YYYY-MM-DD).",
            },
        },
        "required": ["invoice_date"],
    },
}



In [75]:

def generate_completion_multiple_tool_calls(message: str):
    tools = genai.types.Tool(function_declarations=[invoice_vendor_name_extraction
                                                    , invoice_buyer_name_extraction
                                                    , invoice_total_amount_due_extraction
                                                    , invoice_items_extraction
                                                    , invoice_shipping_amount_extraction
                                                    , invoice_date_extraction])
    response = client.models.generate_content(model=MODEL, 
                                              contents = message, 
                                              config = genai.types.GenerateContentConfig(tools=[tools]))
    
    
    # Combine the outputs from each function call.
    extracted_data = {}
    tool_calls = response.candidates[0].content.parts
    for tool_call in tool_calls:
        function_name = tool_call.function_call.name
        arguments = tool_call.function_call.args
        if function_name == "extract_vendor_name":
            extracted_data["vendor_name"] = arguments["vendor_name"]
        elif function_name == "extract_buyer_name":
            extracted_data["buyer_name"] = arguments["buyer_name"]
        elif function_name == "extract_total_amount_due":
            extracted_data["total_amount_due"] = arguments["total_amount_due"]
        elif function_name == "extract_items_list":
            extracted_data["items_list"] = arguments["items_list"]
        elif function_name == "extract_shipping_amount":
            extracted_data["shipping_amount"] = arguments["shipping_amount"]
        elif function_name == "extract_invoice_date":
            extracted_data["invoice_date"] = arguments["invoice_date"]



    return extracted_data, response




In [76]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Items
- Total Amount Due
- Date
- Shipping amount

Markdown text:
{markdown_text}
"""

response, completion = generate_completion_multiple_tool_calls(prompt)


In [77]:
response

{'vendor_name': 'Demo - modules for PrestaShop',
 'buyer_name': '大嶋佳世',
 'items_list': [{'item_name': 'iPod shuffle - Color : Blue',
   'item_price': 33.03},
  {'item_price': 83.12,
   'item_name': 'iPod Nano - Color : Blue, Disk space : 16Go'}],
 'total_amount_due': 211.28,
 'invoice_date': '2011-01-28',
 'shipping_amount': 12}