# LLM and data extraction

In this notebook, we will explore how to use the Gemini API to extract metadata from invoices. We will use a PDF file as input and convert it to markdown text. Then, we will use the Gemini API to extract the vendor name, the buyer name and the total amount due from the markdown text.

We will compare different methods to extract metadata from invoices using the Gemini API, including:
1. Asking the API to extract the metadata directly from the markdown text.
2. Asking the API to extract the metadata and return the result in JSON format.
3. Using a JSON schema to define the expected output format.
4. Using function calls to extract metadata from the markdown text.

In all the following examples, we'll extract the same information from each invoice:
- Vendor Name
- Buyer Name
- Total Amount Due

## Why JSON?

- **Interoperability**: JSON is language-agnostic and easily parsed in Python, R, and other languages.
- **API Integration**: Many data sources and web services provide data in JSON format, making it essential for fetching and processing external data.
- **Hierarchical Structure**: Supports nested data, making it ideal for representing complex datasets like configurations or structured logs.
- **Integration with Pandas**: Python's `pandas` library provides seamless methods (`pd.read_json`, `to_json`) for handling JSON data.
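
As a small illustration of the points above, the sketch below round-trips a nested invoice record through Python's standard `json` module. The record and its values are made up for the example and are not taken from the sample invoices.

```python
import json

# Hypothetical invoice record (illustrative values only).
invoice = {
    "vendor_name": "Acme Corp",
    "buyer_name": "Jane Doe",
    "total_amount_due": 123.45,
    "items": [
        {"description": "Widget", "price": 100.0},
        {"description": "Shipping", "price": 23.45},
    ],
}

# Serialize to a JSON string, then parse it back into Python objects.
json_text = json.dumps(invoice, indent=2)
parsed = json.loads(json_text)

# Nested fields are reached with ordinary dict and list indexing.
first_item_price = parsed["items"][0]["price"]
print(first_item_price)
```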

## Initialize the Google client and load the libraries

In [None]:
%%capture 
!pip install ipykernel
!pip install google-genai
!pip install -U pymupdf4llm

### Libraries

In [None]:
import json
import re
import os
import getpass

from IPython.display import Markdown, display

import pymupdf4llm
from pydantic import BaseModel, Field
from typing import List

from google import genai

In [None]:
API_KEY = getpass.getpass("Enter your Gemini API key: ")
LOCATION = 'europe-west1'
MODEL = 'gemini-2.0-flash'

In [None]:
client = genai.Client(api_key=API_KEY)

## Load pdf and convert to markdown

Let's look at the content of `sample-invoice-1` and `sample-invoice-2` and compare them.

In [None]:
# Render String into Markdown
def print_md(markdown_text):
    display(Markdown(markdown_text))

In [None]:
# Load the PDF file
pdf_path = "./data/"
pdf_filename = "sample-invoice-1.pdf"

markdown_text = pymupdf4llm.to_markdown(os.path.join(pdf_path, pdf_filename))

In [None]:
# Save the output to a markdown file (create the output folder if it is missing)
os.makedirs("output", exist_ok=True)
with open(f"output/{pdf_filename}-markdown.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)

print_md(markdown_text[:380])

----
#### **Exercise #1**: Convert Invoice 2 from PDF to Markdown

----

## Default extraction

In this case, we will provide a prompt asking the API to extract the vendor name, the buyer name and the total amount due from the markdown text. No extra indications are given to the model.



In [None]:
# Generates Content given a prompt
def generate_completion(message: str):
    response = client.models.generate_content(model=MODEL, contents = message)
    return response.text

In [None]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}
"""

response = generate_completion(prompt)

In [None]:
print_md(response)

### Result

We can see here that the LLM was able to extract the **Vendor Name**, the **Buyer Name** and the **Total Amount Due** from the markdown text. The result is returned as free-form text with markdown formatting.

This format is not very structured and may require additional processing to extract the information.

> Let's request an easily digestible format

## Asking for JSON format

Here we add one more step: we ask the LLM to return the result in JSON format. This gives us a more structured output that is easier to parse.

In [None]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

response = generate_completion(prompt)

In [None]:
print_md(response)

In [None]:
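# Remove the markdown code block markers (strip first, so that surrounding
# whitespace does not prevent the ^ and $ anchors from matching)
json_str = re.sub(r"^```(?:json)?\s*", "", response.strip())
json_str = re.sub(r"\s*```$", "", json_str)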

# Parse the JSON string
json.loads(json_str)
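
The two substitutions above assume the reply starts and ends with the code fence. As a hedged alternative, the sketch below searches for a fenced block anywhere in the reply and falls back to the raw text when there is none. The helper name `extract_json` and the sample reply are hypothetical, introduced only for this illustration.

```python
import json
import re

# Three backticks, built programmatically so the marker can be reused below.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, flags=re.DOTALL)


def extract_json(text: str):
    """Parse JSON from a reply that may or may not be wrapped in a code fence."""
    match = FENCED_BLOCK.search(text)
    # Use the fenced content when present, otherwise the whole reply.
    candidate = match.group(1) if match else text.strip()
    return json.loads(candidate)


# Hypothetical reply with some text before the fenced JSON.
sample_reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"Vendor Name": "Acme Corp"}\n'
    + FENCE + "\n"
)
result = extract_json(sample_reply)
print(result)
```

`json.loads` still raises `json.JSONDecodeError` when the content is not valid JSON, so the caller decides how to handle a malformed reply.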

----
#### **Exercise #2:** Also ask for the Invoice Date

---

### Result 

By using regex to strip the code block markers, we were able to parse the string as JSON.
> However, we can get this structured output directly when calling Google's API

## Adding a response format as a parameter

The Gemini API lets us set the response MIME type to `application/json` through the `response_mime_type` config option. This asks the model to return JSON directly, without a surrounding markdown code block, so parsing the result becomes easier.

In [None]:
# Generates a json given a prompt
def generate_completion_json(message: str):
    response = client.models.generate_content(model=MODEL, contents = message, config={"response_mime_type": "application/json"})
    return response.text


In [None]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

response = generate_completion_json(prompt)

In [None]:
print_md(response)

In [None]:
json.loads(response)

----
#### **Exercise #3:** Also ask for the Invoice Date

---

### Result

This time the result is returned in JSON format as requested. We can directly parse the JSON object to extract the information using `json.loads()`. 

**However**, there is no guarantee that the JSON object will have the expected structure. The model may return the data in a different shape than the one we expect, for example with different key names or casing (`Vendor Name`, `vendorName`, `vendor_name`).
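
One way to cope with this is to normalise the keys and check that the expected fields are present before using the result. The sketch below is a minimal example under that assumption; `normalize_key`, `validate_invoice` and the sample string are hypothetical names and data, not part of the Gemini API.

```python
import json
import re

EXPECTED_KEYS = {"vendor_name", "buyer_name", "total_amount_due"}


def normalize_key(key: str) -> str:
    """Turn 'Vendor Name', 'vendorName' or 'VENDOR-NAME' into 'vendor_name'."""
    key = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key)  # split camelCase words
    key = re.sub(r"[^0-9A-Za-z]+", "_", key)  # spaces, dashes, etc. become "_"
    return key.strip("_").lower()


def validate_invoice(raw_json: str) -> dict:
    """Parse a JSON reply, normalise its keys and check the expected fields."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    normalized = {normalize_key(k): v for k, v in data.items()}
    missing = EXPECTED_KEYS - set(normalized)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return normalized


# Hypothetical reply mixing three different key styles.
sample = '{"Vendor Name": "Acme Corp", "buyerName": "Jane Doe", "TOTAL_AMOUNT_DUE": 123.45}'
invoice = validate_invoice(sample)
print(invoice)
```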

## Function calling

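You can also give the LLM a function declaration to use for the extraction. You define the function signature (name, parameters and types), and the LLM replies with a call to that function, passing the extracted information as arguments. The model does not run the function itself; it only returns the call for your code to handle.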

![Tool Calling](./images/tool_calling.png)

In [None]:
invoice_extraction = {
    "name": "extract_invoice_data",
    "description": "Extract key information from an invoice.",
    "parameters": {
        "type": "object",
        "properties": {
            "vendor_name": {
                "type": "string",
                "description": "The name of the vendor in the invoice.",
            },
            "buyer_name": {
                "type": "string",
                "description": "The name of the buyer in the invoice.",
            },
            "total_amount_due": {
                "type": "number",
                "description": "The total amount due in the invoice.",
            },
        },
        "required": ["vendor_name", "buyer_name", "total_amount_due"],
    },
}


In [None]:
def generate_completion_tool_calls(message: str):
    tools = genai.types.Tool(function_declarations=[invoice_extraction])
    response = client.models.generate_content(model=MODEL, 
                                              contents = message, 
                                              config = genai.types.GenerateContentConfig(tools=[tools]))
    return response.candidates[0].content.parts[0].function_call

In [None]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}
"""

response = generate_completion_tool_calls(prompt)

In [None]:
response
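
The returned object is a function call, not plain text. A minimal sketch of turning it into a regular dictionary is shown below, assuming the object exposes `name` and `args` attributes as the `FunctionCall` type of `google-genai` does. The helper `function_call_to_dict` is hypothetical, and a `SimpleNamespace` stand-in with made-up values is used so the example runs without an API key.

```python
from types import SimpleNamespace


def function_call_to_dict(function_call) -> dict:
    """Convert a function-call object into a plain dict of name and arguments."""
    if function_call is None:
        # The first part of the reply may hold text instead of a function call.
        raise ValueError("the model did not return a function call")
    return {
        "function": function_call.name,
        "arguments": dict(function_call.args or {}),
    }


# Stand-in for the object returned by generate_completion_tool_calls (assumption).
fake_call = SimpleNamespace(
    name="extract_invoice_data",
    args={"vendor_name": "Acme Corp", "buyer_name": "Jane Doe", "total_amount_due": 123.45},
)
extracted = function_call_to_dict(fake_call)
print(extracted["arguments"]["total_amount_due"])
```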

----
#### **Exercise #4:** Also ask for the Invoice Date

---

## Custom json schema (Optional)

This time we'll pass a JSON schema to the model, following the conventions described at https://json-schema.org/. This lets us constrain the result to a specific structure and attach types, descriptions and default values to each field.


In [None]:
# Generates a json given a prompt and a schema
def generate_completion_json_schema(message: str, schema: dict):
    response = client.models.generate_content(model=MODEL, contents = message, config={"response_mime_type": "application/json",
                                                                                       "response_schema": schema})
    return response.text


In [None]:
schema = {
            "type": "object",
            "description": "Metadata from an invoice including its vendor name, buyer name and total amount due.",
            "properties": {
                "vendor_name": {
                    "type": "string",
                    "description": "The name of the vendor in the invoice.",
                    "default": "Unknown",
                },
                "buyer_name": {
                    "type": "string",
                    "description": "The name of the buyer in the invoice.",
                },
                "total_amount_due": {
                    "type": "number",
                    "description": "The total amount due in the invoice.",
                },
            }
        }

In [None]:
prompt = f"""
You are a document processing assistant. I have extracted the following markdown text from a PDF.
Please extract the following details:
- Vendor Name
- Buyer Name
- Total Amount Due

Markdown text:
{markdown_text}

Give me the result in JSON format.
"""

response = generate_completion_json_schema(prompt, schema)

In [None]:
json.loads(response)

### Result

Now the output follows the structure defined in the schema. Since the field names and types are provided, the model doesn't have to guess the shape of the output, or any part of it.
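
One caveat: the schema above does not list any `required` fields, so a property could in principle be left out of the reply. The sketch below shows one way to mark every property as required, assuming the API honours the `required` keyword (worth confirming in the Gemini structured output documentation). The helper `with_all_required` is a hypothetical name, and the schema is a trimmed copy of the one above.

```python
import copy

# Trimmed copy of the invoice schema defined earlier in the notebook.
schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "buyer_name": {"type": "string"},
        "total_amount_due": {"type": "number"},
    },
}


def with_all_required(schema: dict) -> dict:
    """Return a copy of the schema in which every property is required."""
    strict = copy.deepcopy(schema)  # leave the original schema untouched
    strict["required"] = list(strict.get("properties", {}))
    return strict


strict_schema = with_all_required(schema)
print(strict_schema["required"])
```

The resulting dictionary can be passed to `generate_completion_json_schema` in place of the original schema.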

## Cost

Let's compute the cost of the completion. We'll use the `gemini-2.0-flash` rates from the Gemini API pricing page: https://ai.google.dev/gemini-api/docs/pricing

In [None]:

def compute_gemini_2_flash_cost(message: str, verbose: bool = False) -> float:
    """
    Sends a prompt to the model configured in MODEL and estimates the cost of
    the call using 'gemini-2.0-flash' pricing.

    Args:
        message: The prompt to send to the Gemini API. Token counts are read
                 from the response's 'usage_metadata' attribute
                 ('prompt_token_count' and 'candidates_token_count').
        verbose:  If True, prints the token counts and estimated cost to the console.
                  Defaults to False.

    Returns:
        The estimated cost in US dollars (USD) as a float.
    """
    tools = genai.types.Tool(function_declarations=[invoice_extraction])
    completion = client.models.generate_content(model=MODEL, 
                                                  contents = message, 
                                                  config = genai.types.GenerateContentConfig(tools=[tools]))
    input_tokens = completion.usage_metadata.prompt_token_count
    output_tokens = completion.usage_metadata.candidates_token_count

    #  Always check the official
    #  Google Cloud documentation for the most up-to-date pricing.
    cost_per_1M_input_tokens = 0.10  
    cost_per_1M_output_tokens = 0.40 

    total_cost = (input_tokens / 1e6) * cost_per_1M_input_tokens
    total_cost += (output_tokens / 1e6) * cost_per_1M_output_tokens

    if verbose:
        print(f"Total input tokens: {input_tokens}")
        print(f"Total output tokens: {output_tokens}")
        print(f"Total tokens: {input_tokens + output_tokens}")
        print(f"Estimated cost: ${total_cost:.4f}")

    return total_cost

In [None]:
compute_gemini_2_flash_cost(prompt, verbose=True)

As you can see, most of the cost comes from the input tokens. Here we pass the whole document to the LLM, which accounts for more than 90% of the cost.
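
To make the split explicit, the arithmetic can be sketched without calling the API. The token counts below are hypothetical, chosen only to illustrate a long prompt with a short answer, and the rates are the ones used in the function above.

```python
def cost_breakdown(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1m: float = 0.10,
    output_price_per_1m: float = 0.40,
) -> dict:
    """Split an estimated cost (USD) into its input and output components."""
    input_cost = input_tokens / 1e6 * input_price_per_1m
    output_cost = output_tokens / 1e6 * output_price_per_1m
    total_cost = input_cost + output_cost
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total_cost,
        # Fraction of the total spent on the prompt (0.0 when nothing was spent).
        "input_share": input_cost / total_cost if total_cost else 0.0,
    }


# Hypothetical counts: a long invoice as input, a short extraction as output.
breakdown = cost_breakdown(input_tokens=5000, output_tokens=100)
print(f"Input share of the cost: {breakdown['input_share']:.1%}")
```

Although output tokens are priced four times higher per token in this example, the prompt is so much longer than the answer that it still dominates the bill.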

> How could we reduce the costs of the LLM?

## Conclusion

Structured output helps the LLM produce more reliable and more interpretable results. The chart below shows the relative reliability of each method, measured by how often the output matches the expected JSON format.

![output_reliability](images/output_reliability.png)

----
#### **Extra Exercise:** Also ask to return the list of items, and respective price
<div>
  <input type="checkbox" name="uchk">
  <label for="uchk">Response also returns the list of items, and respective price</label>
</div>

---