# Data Parsing and Extraction

## Table of Contents

1. Introduction
2. Examples
3. References and Further Reading


<a id='0'></a>

## 1. Introduction


Generative AI, particularly large language models like GPT-4, can be incredibly useful for data generation and augmentation:

1. **Data Extraction**: Extracting data from contracts (structured and unstructured).
2. **Structure Recognition**: Understanding the structure of documents and identifying relevant sections.
3. **Data Standardization**: Converting extracted data into a consistent format.
4. **Contextual Understanding**: Interpreting complex clauses and conditions in contracts.
5. **Anomaly Detection**: Identifying unusual terms or discrepancies in documents.
6. **HTML Parsing**: Gen AI can analyze HTML structure and extract relevant information.
7. **Natural Language Understanding**: It can interpret and extract data from unstructured text.
8. **Data Cleaning**: Gen AI can clean and format extracted data.
9. **Adaptive Scraping**: It can adjust to different website layouts and structures.
10. **Data Transformation**: Gen AI can convert extracted data into structured formats like JSON or CSV.

### Key Terminology:

- **Document Parsing**: The process of extracting structured information from unstructured or semi-structured documents.
- **Named Entity Recognition (NER)**: Identifying and classifying key information in text into predefined categories.
- **Information Extraction**: The task of automatically extracting structured information from unstructured or semi-structured documents.
- **Entity**: A real-world object, such as a person, location, or organization, mentioned in text.
- **Named Entity Recognition (NER)**: The process of identifying and classifying named entities in text.
- **Unstructured Data**: Data that doesn't have a predefined data model or organization.

For a data engineer, this capability is invaluable as it automates the tedious process of manual data entry and standardization, allowing for quicker analysis and insights generation.


In [1]:
import openai
import os
import json
import pandas as pd
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv
from rich.console import Console

In [2]:
console = Console()

load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
console.print(f"[dark_orange]Using OpenAI API key: {OPENAI_API_KEY[:12]}[/]")
# Set up OpenAI API key
client = OpenAI(api_key=OPENAI_API_KEY)

In [3]:
def get_first_value(dict_variable):
    """The function clean takes a dictionary (dict_variable) as input and returns the value of the first key-value pair in the dictionary.

    Here's a breakdown:

    iter(dict_variable.values()) creates an iterator over the dictionary's values.
    next(...) retrieves the first value from the iterator.
    So, if you have a dictionary like {'a': 1, 'b': 2, 'c': 3}, calling clean on it would return 1, which is the value associated with the first key 'a'.

    Note that this function assumes the dictionary is not empty. If the dictionary is empty, next will raise a StopIteration exception.
    """
    return next(iter(dict_variable.values()))

<a id='2'></a>

## 2. Example 1: Parsing a Simple Invoice


In [4]:
# Sample invoice text
invoice_text = """
Date: 5/15/23
John Doe
Invoice #: INV-2023-001
Items:
Website Design - $1000
Logo Creation - $500.00
SEO Services - $750
Total: $2250
"""

# Make API call
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": f"Parse the following invoice and output in JSON form: {invoice_text}",
        }
    ],
    response_format={"type": "json_object"},
)

# Extract and parse the JSON response
parsed_invoice = json.loads(response.choices[0].message.content)

print(json.dumps(parsed_invoice, indent=2))

{
  "invoice": {
    "date": "5/15/23",
    "customer": "John Doe",
    "invoice_number": "INV-2023-001",
    "items": [
      {
        "description": "Website Design",
        "amount": 1000
      },
      {
        "description": "Logo Creation",
        "amount": 500.0
      },
      {
        "description": "SEO Services",
        "amount": 750
      }
    ],
    "total": 2250
  }
}


In [5]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": f"Parse the following invoice and output in JSON form: {invoice_text}. The columns should be Invoice Number, Invoice Date, Total, and Number of Products. The JSON should be able to go into a \
        Pandas dataframe.",
        }
    ],
    response_format={"type": "json_object"},
)

# Extract and parse the JSON response
parsed_invoice = json.loads(response.choices[0].message.content)
print(parsed_invoice)

{'Invoice Number': 'INV-2023-001', 'Invoice Date': '5/15/23', 'Total': 2250, 'Number of Products': 3}


In [6]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": f"Parse the following invoice and output in JSON form: {invoice_text}. The columns should be Invoice Number, Invoice Date, Total, and Number of Products. The JSON should be able to go into a \
        Pandas dataframe.",
        }
    ],
    response_format={"type": "json_object"},
)

# Extract and parse the JSON response
parsed_invoice = json.loads(response.choices[0].message.content)
print(parsed_invoice)

{'Invoice Number': 'INV-2023-001', 'Invoice Date': '2023-05-15', 'Total': 2250, 'Number of Products': 3}


In [7]:
json.loads(response.choices[0].message.content)

{'Invoice Number': 'INV-2023-001',
 'Invoice Date': '2023-05-15',
 'Total': 2250,
 'Number of Products': 3}

<a id='3'></a>

## 2. Example 2: Extracting Data from a Complex Contract


In [8]:
contract_text = """
SERVICE AGREEMENT #1

This Service Agreement (the "Agreement") is entered into on June 1, 2023 (the "Effective Date") by and between:

ABC Corp., a corporation organized under the laws of Delaware, with its principal place of business at 123 Main St, Anytown, USA ("Service Provider")

and

XYZ Inc., a corporation organized under the laws of California, with its principal place of business at 456 Oak Ave, Otherville, USA ("Client")

1. SERVICES
   Service Provider agrees to provide the following services to Client:
   a) Software development
   b) System maintenance
   c) Technical support

2. TERM
   This Agreement shall commence on the Effective Date and continue for a period of 24 months.

3. COMPENSATION
   Client agrees to pay Service Provider a monthly fee of $10,000 for the services provided.

4. TERMINATION
   Either party may terminate this Agreement with 30 days written notice.


SERVICE AGREEMENT #2

This Service Agreement (the "Agreement") is entered into on April 1, 2023 (the "Effective Date") by and between:

Henry Cookies., a corporation organized under the laws of Delaware, with its principal place of business at 123 Main St, Anytown, USA ("Service Provider")

and

XYZ Inc., a corporation organized under the laws of California, with its principal place of business at 456 Oak Ave, Otherville, USA ("Client")

1. SERVICES
   Service Provider agrees to provide the following services to Client:
   a) Software development
   b) System maintenance
   c) Technical support

2. TERM
   This Agreement shall commence on the Effective Date and continue for a period of 8 months.

3. COMPENSATION
   Client agrees to pay Service Provider a monthly fee of $50,000 for the services provided.

4. TERMINATION
   Either party may terminate this Agreement with 30 days written notice.

   
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": f"Extract key information from the following contract and output in JSON form: {contract_text}",
        }
    ],
    response_format={"type": "json_object"},
)

parsed_contract = json.loads(response.choices[0].message.content)
print(json.dumps(parsed_contract, indent=2))

{
  "ServiceAgreements": [
    {
      "AgreementNumber": 1,
      "EffectiveDate": "2023-06-01",
      "ServiceProvider": {
        "Name": "ABC Corp.",
        "LegalForm": "Corporation",
        "State": "Delaware",
        "Address": "123 Main St, Anytown, USA"
      },
      "Client": {
        "Name": "XYZ Inc.",
        "LegalForm": "Corporation",
        "State": "California",
        "Address": "456 Oak Ave, Otherville, USA"
      },
      "Services": [
        "Software development",
        "System maintenance",
        "Technical support"
      ],
      "Term": "24 months",
      "Compensation": {
        "MonthlyFee": 10000,
        "Currency": "USD"
      },
      "Termination": {
        "NoticePeriod": "30 days"
      }
    },
    {
      "AgreementNumber": 2,
      "EffectiveDate": "2023-04-01",
      "ServiceProvider": {
        "Name": "Henry Cookies.",
        "LegalForm": "Corporation",
        "State": "Delaware",
        "Address": "123 Main St, Anytown, USA"
   

In [9]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": f"Extract key information from the following contract and output in JSON form, and extract columns company name, date, compensation, term: {contract_text}",
        }
    ],
    response_format={"type": "json_object"},
)

parsed_contract = json.loads(response.choices[0].message.content)
print(json.dumps(parsed_contract, indent=2))

{
  "contracts": [
    {
      "company_name": "ABC Corp.",
      "date": "June 1, 2023",
      "compensation": "$10,000",
      "term": "24 months"
    },
    {
      "company_name": "Henry Cookies",
      "date": "April 1, 2023",
      "compensation": "$50,000",
      "term": "8 months"
    }
  ]
}


In [10]:
# Convert to DataFrame
df_contract = pd.DataFrame(get_first_value(parsed_contract))
print(df_contract)

    company_name           date compensation       term
0      ABC Corp.   June 1, 2023      $10,000  24 months
1  Henry Cookies  April 1, 2023      $50,000   8 months


<a id='4'></a>

## 2. Example 3: Batch Processing of Multiple Documents


In [11]:
documents = [
    """
    Invoice #: INV-2023-002
    Date: June 1, 2023
    Bill To: Jane Smith
    Items:
    1. Mobile App Development - $5000
    2. UI/UX Design - $2000
    Total: $7000
    """,
    """
    Invoice #: INV-2023-003
    Date: June 15, 2023
    Bill To: Acme Corp
    Items:
    1. Cloud Migration Services - $10000
    2. Staff Training - $3000
    3. Ongoing Support (3 months) - $4500
    Total: $17500
    """,
]

parsed_documents = []

for doc in documents:
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {
                "role": "user",
                "content": f"Parse the following invoice and output in JSON form: The columns should be Invoice Number, Invoice Date, Total, and Number of Products. The JSON should be able to go into a  {doc}",
            }
        ],
        response_format={"type": "json_object"},
    )
    parsed_documents.append(json.loads(response.choices[0].message.content))

print(json.dumps(parsed_documents, indent=2))

[
  {
    "Invoice Number": "INV-2023-002",
    "Invoice Date": "June 1, 2023",
    "Total": "$7000",
    "Number of Products": 2
  },
  {
    "Invoice Number": "INV-2023-003",
    "Invoice Date": "June 15, 2023",
    "Total": "$17500",
    "Number of Products": 3
  }
]


In [12]:
# Convert to DataFrame
df_batch = pd.DataFrame(parsed_documents)
print(df_batch)

  Invoice Number   Invoice Date   Total  Number of Products
0   INV-2023-002   June 1, 2023   $7000                   2
1   INV-2023-003  June 15, 2023  $17500                   3


## 2. Example 4: Extracting Product Information


In [13]:
import openai
import requests
import json

# Sample product page HTML (simplified)
sample_html = """
<div class="product-updated">
  <h3>Super Comfy Chair</h3>
  <p class="price">$199.99</p>
  <ul class="features">
    <li>Ergonomic design</li>
    <li>Adjustable height</li>
    <li>360-degree swivel</li>
  </ul>
  <p class="availability">In stock</p>
</div>
"""

# OpenAI API call
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that extracts product information from HTML.",
        },
        {
            "role": "user",
            "content": f"Extract the product name, price, features, and availability from this HTML. Output in JSON form:\n{sample_html}",
        },
    ],
    response_format={"type": "json_object"},
)

# Print the extracted data
print(json.dumps(json.loads(response.choices[0].message.content), indent=2))

{
  "product_name": "Super Comfy Chair",
  "price": "$199.99",
  "features": [
    "Ergonomic design",
    "Adjustable height",
    "360-degree swivel"
  ],
  "availability": "In stock"
}


## 2. Example 5: Parsing News Articles


In [14]:
import requests
from bs4 import BeautifulSoup

# Fetch a sample news article
url = "https://craig-west.netlify.app/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# OpenAI API call
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that extracts key information from news articles.",
        },
        {
            "role": "user",
            "content": f"Extract the product name, price, and product description. Output in JSON form:\n{str(response.text)}...",
        },
    ],
    response_format={"type": "json_object"},
)

# Print the extracted data
print(json.dumps(json.loads(response.choices[0].message.content), indent=2))

{
  "product": {
    "name": "Agentic Pythonista",
    "price": "Not specified",
    "description": "AI focused Pythonista, living in Brighton, UK, with a background in Business Information Architecture, data integrity, and AI Powered Knowledge Systems."
  }
}


## 2. Example 6: Extracting Tabular Data from HTML


In [15]:
import pandas as pd

# Sample HTML table
html_table = """
<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>City</th>
  </tr>
  <tr>
    <td>John Doe</td>
    <td>30</td>
    <td>New York</td>
  </tr>
  <tr>
    <td>Jane Smith</td>
    <td>25</td>
    <td>London</td>
  </tr>
  <tr>
    <td>Bob Johnson</td>
    <td>35</td>
    <td>Paris</td>
  </tr>
</table>
"""

# OpenAI API call
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that extracts tabular data from HTML.",
        },
        {
            "role": "user",
            "content": f"Extract the data from this HTML table into a JSON format. Output in JSON form:\n{html_table}",
        },
    ],
    response_format={"type": "json_object"},
)

# Convert JSON to DataFrame
data = json.loads(response.choices[0].message.content)
df = pd.DataFrame(get_first_value(data))

# Display the DataFrame
print(df)

          Name  Age      City
0     John Doe   30  New York
1   Jane Smith   25    London
2  Bob Johnson   35     Paris


## 2. 7 - Gen AI for getting information out of pictures (e.g., receipts)


In [16]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the store name, date, and total. Put the results in a JSON, with keys 'store_name', 'date', and 'total'",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://craig-west.netlify.app/receipt.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

BadRequestError: Error code: 400 - {'error': {'message': 'Error while downloading https://craig-west.netlify.app/receipt.jpg.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_image_url'}}

In [None]:
print(response.choices[0].message.content)

## 2. 8 - Examples of Gen AI for Entity Recognition <a name="examples"></a>


In [None]:
from openai import OpenAI

# Example 1: Basic Entity Recognition
text = "John Smith visited New York City on July 4, 2023, and met with the CEO of TechCorp."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant skilled in entity recognition.",
        },
        {
            "role": "user",
            "content": f"Identify and categorize the named entities in the following text. Output in JSON form: {text}",
        },
    ],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)

In [None]:
# Example 2: Entity Recognition in a News Article
news_article = """
On September 15, 2023, Apple Inc. unveiled its latest iPhone models at its headquarters in Cupertino, California. 
CEO Tim Cook presented the new devices, highlighting their advanced features. The event was attended by tech journalists from various publications, including The New York Times and TechCrunch.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant skilled in entity recognition.",
        },
        {
            "role": "user",
            "content": f"Extract and categorize all named entities from this news article. Include categories such as PERSON, ORGANIZATION, DATE, LOCATION, and PRODUCT. Output in JSON form: {news_article}",
        },
    ],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)

In [None]:
# Example 3: Entity Recognition in Social Media Post
social_media_post = """
Just landed in #Paris! 😍 Can't wait to visit the Eiffel Tower and the Louvre. 
Meeting up with @JaneDoeTravels tomorrow for a Seine river cruise. 
Any recommendations for the best cafes near Champs-Élysées? #TravelBlog #ParisAdventures
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant skilled in entity recognition, especially in social media contexts.",
        },
        {
            "role": "user",
            "content": f"Identify and categorize entities in this social media post, including locations, landmarks, usernames, and hashtags. Output in JSON form: {social_media_post}",
        },
    ],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)

<a id='6'></a>

## 3. References and Further Reading

1. OpenAI API Documentation: https://platform.openai.com/docs/
2. "Natural Language Processing with Transformers" by Lewis Tunstall, Leandro von Werra, and Thomas Wolf
3. "Information Extraction: Algorithms and Prospects in a Retrieval Context" by Marie-Francine Moens
4. "Named Entity Recognition: A Literature Survey" by David Nadeau and Satoshi Sekine
5. "Data Science for Business" by Foster Provost and Tom Fawcett
