# Data Extraction: Automating Invoice Processing

This notebook demonstrates how to extract structured data from text documents like receipts and invoices using AI. We'll explore:

1. Basic text extraction using prompts
2. Structured output formats (JSON, XML)

The techniques shown here can help automate manual data entry tasks and standardize information extraction from semi-structured documents.

In [None]:
with open("./invoice-data-sample.txt", "r") as f:
    receipt_data = f.read()
    
receipt_data

In [None]:
from IPython.display import Markdown

Markdown(receipt_data)

In [None]:
from ai_tools import ask_ai

extraction_prompt = f"""

You are an extraction engine for receipt data.
Users will upload the contents of their receipts and you will extract
the following fields:
- Company name
- Date of closure
- Amount paid

Extract the data from the following receipt:
{receipt_data}
"""

structured_output = ask_ai(prompt=extraction_prompt)

structured_output

The output is usable, but it includes conversational filler that we don't want when the result feeds into other code.

To get around that, let's improve our initial prompt:

In [None]:
extraction_prompt_json = f"""
You are an extraction engine for receipt data.
Users will upload the contents of their receipts and you will extract
the following fields as JSON OBJECTS:
- Company name
- Date of closure
- Amount paid

Extract the data from the following receipt:
{receipt_data}

Your OUTPUT SHOULD ONLY BE A JSON OBJECT WITH THE FOLLOWING FIELDS:
- company_name
- date_of_closure
- amount_paid
"""

structured_output_json = ask_ai(prompt=extraction_prompt_json)

structured_output_json

In [None]:
# We need to import the json library to parse the JSON output
import json

def parse_json_output(json_str):
    """
    This function parses the JSON output from the AI and removes the markdown code block markers if present.
    """
    # Remove markdown code block markers if present
    json_str = json_str.replace('```json', '').replace('```', '').strip()
    
    # Parse the JSON string into a Python dictionary
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        print("Error: Could not parse JSON string")
        return None

parsed_json = parse_json_output(structured_output_json)


parsed_json

In [None]:
print(f"Company Name: {parsed_json['company_name']}")
print(f"Date of Closure: {parsed_json['date_of_closure']}")
print(f"Amount Paid: {parsed_json['amount_paid']}")
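
Note that `parse_json_output` returns `None` when parsing fails, so indexing the result directly would raise a `TypeError`. A minimal defensive sketch, assuming the three field names requested in the prompt; the helper name `get_required_fields` is hypothetical and not part of `ai_tools`:

```python
REQUIRED_FIELDS = ("company_name", "date_of_closure", "amount_paid")

def get_required_fields(parsed, required=REQUIRED_FIELDS):
    """Return (fields, missing): fields maps each required key to its value or None."""
    if not isinstance(parsed, dict):
        # Parsing failed, or the model returned something other than a JSON object
        return {name: None for name in required}, list(required)
    fields = {name: parsed.get(name) for name in required}
    missing = [name for name in required if fields[name] is None]
    return fields, missing

# Illustrative input only, not real model output
example = {"company_name": "Acme Corp", "amount_paid": 120.5}
fields, missing = get_required_fields(example)
print(fields)
print("Missing fields:", missing)
```

Checking `missing` before printing or storing the record makes a failed extraction visible instead of crashing the cell.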

With Claude, we can get a similar result by asking for the answer inside XML tags: `<output>{"company_name":....etc....} </output>`
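
As a rough sketch of how a response wrapped this way could be parsed, assuming the model places a single JSON object between `<output>` tags. The helper name `parse_tagged_json` and the sample string are hypothetical, for illustration only:

```python
import json
import re

def parse_tagged_json(response, tag="output"):
    """Return the JSON object found between <tag> and </tag>, or None."""
    # re.DOTALL lets the JSON span several lines
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None

# Illustrative response text, not real model output
sample = 'Sure, here it is:\n<output>{"company_name": "Acme Corp", "amount_paid": 120.5}</output>'
print(parse_tagged_json(sample))
```

Because only the text between the tags is parsed, any conversational text around the tags is ignored.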



In [None]:
from ai_tools import ask_ai

ask_ai(prompt="Hi! Which model are you?", model_name="claude-3-5-sonnet-20240620")

In [None]:
extraction_prompt_claude = f"""
You are an extraction engine for receipt data.
Users will upload the contents of their receipts and you will extract key fields.

Extract the key fields from this receipt:
{receipt_data}

Format your response using XML tags like this:
<output>
  <company_name>The company name</company_name>
  <date_of_closure>The date of closure</date_of_closure>
  <amount_paid>The amount paid</amount_paid>
</output>

Only include the XML tags in your response, nothing else.
"""
output = ask_ai(prompt=extraction_prompt_claude, model_name="claude-3-5-sonnet-20240620")

output

Now, let's write a function that properly parses this output from Claude:

In [None]:
def parse_claude_output(output):
    """
    This function parses the output from Claude and removes the XML tags.
    """
    # Remove XML tags if present
    output = output.replace('<output>', '').replace('</output>', '').strip()
    return output

output_parsed = parse_claude_output(output)

output_parsed

Now we can access each individual attribute easily by simply parsing the tags:


In [None]:
import re

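def extract_field(output, field_name):
    """Extract the value between XML tags for a given field, or None if absent."""
    pattern = f"<{field_name}>(.*?)</{field_name}>"
    # re.DOTALL lets a value span multiple lines
    match = re.search(pattern, output, re.DOTALL)
    # Strip surrounding whitespace the model may add inside the tags
    return match.group(1).strip() if match else None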

# Extract each field
company_name = extract_field(output_parsed, "company_name")
date_of_closure = extract_field(output_parsed, "date_of_closure") 
amount_paid = extract_field(output_parsed, "amount_paid")

print(f"Company Name: {company_name}")
print(f"Date of Closure: {date_of_closure}")
print(f"Amount Paid: {amount_paid}")
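
The regex approach works well for flat fields like these. As an alternative sketch, the standard library's `xml.etree.ElementTree` can parse the whole `<output>` block at once, assuming the model returns well-formed XML. The helper name `parse_output_xml` and the sample text are hypothetical:

```python
import xml.etree.ElementTree as ET

def parse_output_xml(response):
    """Parse an <output>...</output> block into a dict of child tag -> text."""
    start = response.find("<output>")
    end = response.find("</output>")
    if start == -1 or end == -1:
        return None
    xml_text = response[start:end + len("</output>")]
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # For example, an unescaped "&" in a company name is not well-formed XML
        return None
    return {child.tag: (child.text or "").strip() for child in root}

# Illustrative response text, not real model output
sample = """<output>
  <company_name>Acme Corp</company_name>
  <date_of_closure>2024-01-15</date_of_closure>
  <amount_paid>120.50</amount_paid>
</output>"""
print(parse_output_xml(sample))
```

A strict parser rejects malformed responses outright, which can be preferable to silently extracting partial values. The regex version is more forgiving when the model does not escape special characters.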

But what if you don't want to send your private data to a cloud provider?

In that case, we can use local models. They have improved enough that they can now produce structured outputs much like the hosted models we used above.

In [None]:
from ai_tools import ask_local_ai
import json

extraction_prompt_json = f"""
You are an extraction engine for receipt data.
Users will upload the contents of their receipts and you will extract
the following fields as JSON OBJECTS:
- Company name
- Date of closure
- Amount paid

Extract the data from the following receipt:
{receipt_data}

Your OUTPUT SHOULD ONLY BE A JSON OBJECT WITH THE FOLLOWING FIELDS:
- company_name
- date_of_closure
- amount_paid
"""

output_string = ask_local_ai(extraction_prompt_json, structured=True)

output_json = json.loads(output_string)

print(f"Company Name: {output_json['company_name']}")
print(f"Date of Closure: {output_json['date_of_closure']}")
print(f"Amount Paid: {output_json['amount_paid']}")

For those interested in exploring structured extraction further, a more robust approach uses `pydantic`, a data validation library that integrates well with LLM APIs such as OpenAI's and Anthropic's to produce structured outputs in a more programmatic and organized fashion.
See an example in: `./structured_output_with_pydantic.py`.
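
To give a flavor of the idea, here is a minimal sketch that validates the JSON fields used earlier against a `pydantic` model. It is not necessarily what the referenced script contains; the `Receipt` model, the `validate_receipt` helper, the chosen field types, and the sample string are all assumptions made for illustration:

```python
import json
from datetime import date
from pydantic import BaseModel, ValidationError

class Receipt(BaseModel):
    # Field names mirror the JSON keys requested in the prompt; types are assumed
    company_name: str
    date_of_closure: date
    amount_paid: float

def validate_receipt(json_str):
    """Return a Receipt if the JSON matches the schema, else None."""
    try:
        return Receipt(**json.loads(json_str))
    except (json.JSONDecodeError, ValidationError, TypeError):
        # TypeError covers JSON that is valid but not an object (e.g. a list)
        return None

# Illustrative model output, not a real response
sample = '{"company_name": "Acme Corp", "date_of_closure": "2024-01-15", "amount_paid": 120.5}'
receipt = validate_receipt(sample)
print(receipt)
```

The benefit over plain `json.loads` is that missing fields or wrongly typed values are caught at the boundary, and the date arrives as a `datetime.date` rather than a string.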

# Extracting Insights from Technology Trends Reports from O'Reilly Media

Radar Trends website:
- https://www.oreilly.com/radar/trends/

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
raw_contents_oreilly_tech_trends_january_2025 = """

Skip to main content
O'Reilly home
Sign In
Try Now
Teams
For business
For government
For higher ed
Individuals
Features
All features
Courses
Certifications
Interactive learning
Live events
Answers
Insights reporting
Plans
Blog
Content sponsorship
Search
Radar / Radar Trends
Radar Trends to Watch: January 2025
Developments in Security, Programming, AI, and More

By Mike Loukides
January 7, 2025

Learn faster. Dig deeper. See farther.
Join the O'Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more
Despite its 31 days, December is a short month. It's hard for announcements and happenings other than office parties to get attention. Fighting this trend, OpenAI made a series of announcements: their "12 Days of OpenAI." Not to be upstaged, Google responded with a flurry of announcements, including their Gemini 2.0 Flash Thinking model. Models appeared that could use streaming audio and video for both input and output. But perhaps the most important announcement was DeepSeek-V3, a very large mixture-of-experts model (671B parameters) that has performance on a par with the other top models—but cost roughly 1/10th as much to train.

AI
DeepSeek-V3 is another LLM to watch. Its performance is on a par with Llama 3.1, GPT-4o, and Claude Sonnet. While training was not inexpensive, the cost of training was estimated to be roughly 10% of the bigger models.
Not to be outdone by Google, OpenAI previewed its next models: o3 and o3-mini. These are both "reasoning models" that have been trained to solve logical problems. They may be released in late January; OpenAI is looking for safety and security researchers for testing.
Not to be outdone by 12 Days of OpenAI, Google has released a new experimental model that has been trained to solve logical problems: Gemini 2.0 Flash Thinking. Unlike OpenAI's GPT models that support reasoning, Flash Thinking shows its chain of thought explicitly.
Jeremy Howard and his team have released ModernBERT, a major upgrade to the BERT model they released six years ago. It comes in two sizes: 139M and 395M parameters. It's ideal for retrieval, classification, and entity extraction, and other components of a data pipeline.
AWS's Bedrock service has the ability to check the output of other models for hallucinations.
To make sure they aren't outdone by 12 Days of OpenAI, Google has announced Android XR, an operating system for extended reality headsets and glasses. Google doesn't plan to build their own hardware; they're partnering with Samsung, Qualcomm, and other manufacturers.
Also not to be outdone by 12 Days of OpenAI, Anthropic has announced Clio, a privacy- preserving approach to finding out how people use their models. That information will be used to improve Anthropic's understanding of safety issues and to build more helpful models.
Not to be outdone by 12 Days of OpenAI, Google has announced Gemini 2.0 Flash, a multimodal model that supports streaming for both input and output. The announcement also showcased Astra, an AI agent for smartphones. Neither is generally available yet.
OpenAI has released canvas, a new feature that combines programming with writing. Changes to the canvas (code or text) immediately become part of the context. Python code is executed in the browser using Pyodide (Wasm), rather than in a container (as with Code Interpreter).
Stripe has announced an agent toolkit that lets you build payments into agentic workflows. Stripe recommends using the toolkit in test mode until the application has been thoroughly validated.
Simon Willison shows how to run a GPT-4 class model (Llama 3.3 70B) on a reasonably well-equipped laptop (64GB MacBook Pro M2).
As part of their 12 Days of OpenAI series, OpenAI finally released their video generation model, Sora. It's free to ChatGPT Plus subscribers, though limited to 50 five-second video clips per month; a ChatGPT Pro account relaxes many of the limitations.
Researchers have shown that advanced AI models, including Claude 3 Opus and OpenAI o1, are capable of "scheming": working against the interests of their users to achieve their goals. Scheming includes subverting oversight mechanisms, intentionally delivering subpar results, and even taking steps to prevent shutdown or replacement. Hello, HAL?
Roaming RAG is a new technique for retrieval augmented generation that finds relevant content by searching through headings to navigate documents—like a human might. It requires well-structured documents. A surprisingly simple idea, really.
Google has announced PaliGemma 2, a new version of its Gemma models that incorporates vision.
GPT-4-o1-preview is no more; the preview is now the real thing, OpenAI o1. In addition to advanced reasoning skills, the production release claims to be faster and to deliver more consistent results.
A group of AI agents in Minecraft behaved surprisingly like humans—even developing jobs and religions. Is this a way to model how human groups collaborate?
One thing the AI industry needs desperately (aside from more power) is better benchmarks. Current benchmarks are closed, easily gamed (that's what AI does), and unreproducible, and they may not test anything meaningful. Better Bench is a framework for assessing benchmark quality.
Palmyra Creative, a new language model from Writer, promises the ability to develop "style" so that all AI-generated output won't sound boringly the same.
During training AI picks up biases from human data. When humans interact with the AI, there's a feedback loop that amplifies those biases.
Programming
Unicon may never become one of the top 20 (or top 100) programming languages, but it's a descendant of Icon, which was always my favorite language for string processing.
What do CAPTCHAs mean when LLM-equipped bots can successfully complete tasks set for humans?
egui, together with eframe, is a GUI library and framework for Rust. It's portable and runs natively (on macOS, Windows, Linux, and Android), on the web (using Wasm), and in many game engines.
For the archivist in us: The Manx project isn't about an island in the Irish Sea or about cats. It's a catalog of manuals for old computers.
Cerbrec is a graphical Python framework for deep learning. It's aimed at Python programmers who don't have sufficient expertise to build applications with PyTorch or other AI libraries.
GitHub has announced free access to GitHub Copilot for all current and new users. Free access gives you 2,000 code completions and 50 chat messages per month. They've also added the ability to use Claude 3.5 Sonnet in addition to GPT-4o.
Devin, the AI assisted coding tool that claims to support software development from beginning to end, including design and debugging, has reached general availability.
JSON5, also known as "JSON for humans," is a variant of JSON that has been designed for human readability so that it can be written and maintained by hand—for example, in configuration files.
AWS has announced two significant new services: Aurora DSQL, which is a distributed SQL database, and S3 Tables, which supports data lakehouses through Apache Iceberg.
AutoFlow is an open source tool for creating a knowledge graph. It's based on TiDB (a vector database), LlamaIndex, and DSPy.
Security
Portspoof is a security tool that causes all 65,535 TCP ports to appear open for valid services. It emulates a valid service on every port. It makes it difficult for an attacker to determine which ports are actually open without probing each port.
Let's Encrypt, which issues the certificates that websites (and other applications) use to prove their identities, has announced short-lived certificates that expire after six days. Short-lived certificates increase security by minimizing exposure if a private key is compromised.
Because of the continued presence of attackers within telecommunications networks, the US FBI and CISA have recommended the use of encrypted communications protocols. (Though they still want backdoors into encryption systems, which would make them vulnerable to attack.)
A new phishing attack uses corrupted Word documents to bypass security checks. While the documents are corrupt, Word is able to recover them.
LLM Flowbreaking is a new class of attack against language models that prevent guardrails from stopping objectionable output from reaching the user. These attacks take advantage of race conditions in the application's interaction with users.
Bootkitty is a UEFI bootkit that targets secure boot on Ubuntu systems. It appears to have been developed by cybersecurity students in Korea, then leaked (possibly accidentally). It hasn't yet been found in the wild, but when it is, it will be a dangerous threat.
DEF CON has started a project to improve cybersecurity for water infrastructure in the US. They're starting with six water companies serving rural communities.
Quantum Computing
Google has built a quantum computing chip in which an error-corrected logical qubit can remain stable for an hour. It passes the "below threshold": the error rate decreases as physical qubits are added for error correction. The chip was built in Google's new fabrication facility.
Web
Google is adding "store reviews" to Chrome. Reviews are AI-generated summaries of reports from well-known sources that report scams and other issues.
Here's a how-to on building streaming text user interfaces on the web. Streaming text is almost a necessity for building AI-driven chatbots.
Biology
Yes, we can have virtual taste. A research group has developed a lollipop interface so that people can experience taste in virtual worlds.
Post topics: Radar Trends
Post tags: Signals
Share:   Share
About O'Reilly
Teach/write/train
Careers
O'Reilly news
Media coverage
Community partners
Affiliate program
Submit an RFP
Diversity
O'Reilly for marketers
Support
Contact us
Newsletters
Privacy policy
 
International
Australia & New Zealand
Hong Kong & Taiwan
India
Indonesia
Japan
Download the O'Reilly App
Take O'Reilly with you and learn anywhere, anytime on your phone and tablet.

Apple app store Google play store
Watch on your big screen
View all O'Reilly videos, Superstream events, and Meet the Expert sessions on your home TV.

Roku Payers and TVs Amazon appstore
Do not sell my personal information
O'Reilly home
© 2025, O'Reilly Media, Inc. All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.

Terms of service • Privacy policy • Editorial independence
"""

In [None]:
from ai_tools import ask_ai, parse_json_output

extract_insights_prompt = f"""
Extract all the insights regarding AI and programming from these raw contents, using the following structure:

- AI insights
- Programming insights

Here are the raw contents:

{raw_contents_oreilly_tech_trends_january_2025}

Your OUTPUT SHOULD ONLY BE A JSON OBJECT WITH THE FOLLOWING FIELDS:

- ai-insights (a list of strings)
- programming-insights (a list of strings)
- date (the publication date, as a string)

Output:
"""

output = ask_ai(extract_insights_prompt)
parsed_output_json = parse_json_output(output)
parsed_output_json
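
The raw paste includes navigation and footer text that is sent to the model along with the article. One possible pre-processing step, sketched here under the assumption that each section heading (`AI`, `Programming`, and so on) sits alone on its own line as it does in this paste, is to group lines by heading and send only the relevant sections. `split_sections` is a hypothetical helper, not part of `ai_tools`, and the sample text is illustrative:

```python
def split_sections(raw_text, headings):
    """Group non-empty lines under the most recent line that exactly matches a heading."""
    sections = {}
    current = None
    for line in raw_text.splitlines():
        stripped = line.strip()
        if stripped in headings:
            # Start a new section; a repeated heading resets it
            current = stripped
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections

# Small illustrative sample shaped like the paste
sample_text = """Sign In
AI
First AI item.
Second AI item.
Programming
First programming item.
Security
First security item.
"""
sections = split_sections(sample_text, headings={"AI", "Programming", "Security"})
print(sections["AI"])
print(sections["Programming"])
```

Everything after the last heading, including any footer, ends up in the final section, so this only helps cleanly for sections that are followed by another known heading.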

In [None]:
# Convert JSON insights to markdown format
print("# Tech Trends Report - " + parsed_output_json['date'])
print("\n## AI Insights")
for insight in parsed_output_json['ai-insights']:
    print(f"- {insight}")
    
print("\n## Programming Insights") 
for insight in parsed_output_json['programming-insights']:
    print(f"- {insight}")

We can now easily transform this into a table to build our own dataset of recent tech trends!

In [None]:
import pandas as pd

# Create lists of insights and find the length of the longer one
ai_insights = parsed_output_json['ai-insights']
prog_insights = parsed_output_json['programming-insights']
n_rows = max(len(ai_insights), len(prog_insights))

def pad(items, length):
    """Return items extended with None up to the given length."""
    return items + [None] * (length - len(items))

# Create a DataFrame, padding the shorter column so both have n_rows entries
df = pd.DataFrame({
    'AI Insights': pad(ai_insights, n_rows),
    'Programming Insights': pad(prog_insights, n_rows),
    'Date': [parsed_output_json['date']] * n_rows
})

# Display the table
display(df)
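
The wide table places an AI insight and a programming insight on the same row even though the two are unrelated. If the goal is a dataset that grows month by month, a long format with one insight per row may be easier to append to and filter. A minimal standard-library sketch, assuming the same JSON keys as above; `insights_to_rows` and the sample data are hypothetical:

```python
import csv
import io

def insights_to_rows(parsed, categories=("ai-insights", "programming-insights")):
    """Flatten {'date': ..., 'ai-insights': [...], ...} into one row per insight."""
    rows = []
    for category in categories:
        for insight in parsed.get(category, []):
            rows.append({"date": parsed.get("date"), "category": category, "insight": insight})
    return rows

# Illustrative data only, not real model output
example = {
    "date": "January 2025",
    "ai-insights": ["First AI insight", "Second AI insight"],
    "programming-insights": ["First programming insight"],
}
rows = insights_to_rows(example)

# Write to an in-memory buffer; swap in open("trends.csv", "a", newline="") to persist
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "category", "insight"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same list of dictionaries can also be passed to `pd.DataFrame(rows)` if a DataFrame is preferred.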

In [None]:
import requests
from bs4 import BeautifulSoup

def scrape_oreilly_ai_programming_news(url):
    """
    Scrape AI and programming-related content from O'Reilly Radar.
    """
    try:
        # Send request to the website (the timeout prevents hanging on a slow server)
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        
        # Parse HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find AI and Programming sections
        ai_section = soup.find('h2', string=lambda text: text and 'AI' in text)
        programming_section = soup.find('h2', string=lambda text: text and 'Programming' in text)
        
        news_items = []
        
        # Extract AI content
        if ai_section:
            ai_content = ai_section.find_next('ul')
            if ai_content:
                news_items.extend([li.text.strip() for li in ai_content.find_all('li')])
        
        # Extract Programming content
        if programming_section:
            programming_content = programming_section.find_next('ul')
            if programming_content:
                news_items.extend([li.text.strip() for li in programming_content.find_all('li')])
        
        return news_items
    
    except requests.exceptions.RequestException as e:
        print(f"Error scraping website: {e}")
        return []

# Example usage (replace with actual O'Reilly Radar URL)
oreilly_url = "https://www.oreilly.com/radar/radar-trends-to-watch-january-2025/"
ai_programming_news = scrape_oreilly_ai_programming_news(oreilly_url)

if ai_programming_news:
    print("\nLatest O'Reilly AI & Programming News:")
    for idx, news in enumerate(ai_programming_news, 1):
        print(f"{idx}. {news}")