[Structured outputs on Databricks](https://docs.databricks.com/aws/en/machine-learning/model-serving/structured-outputs)

# Structured Outputs on Databricks: A Complete Guide to Generating JSON from Foundation Models

Generative AI applications increasingly need structured, machine-readable data from language models rather than free-form text. Databricks **Structured Outputs** lets developers generate JSON objects that conform to a specified schema directly from foundation models, which makes model responses easier to validate and to feed into downstream data pipelines.

## What Are Structured Outputs?

Structured outputs provide a mechanism to generate structured data in JSON format from your input data. Unlike traditional language model responses that return free-form text, structured outputs ensure that the model's response adheres to a predefined format or schema.

This feature supports three primary output formats:

1. **Plain text** - Traditional unstructured text responses
2. **Unstructured JSON objects** - JSON output without schema enforcement
3. **Schema-validated JSON objects** - JSON that strictly adheres to a specified JSON schema

Structured outputs are exposed through an OpenAI-compatible API for models served through Databricks' Foundation Model APIs, and are available on both pay-per-token and provisioned throughput endpoints.
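As a rough sketch, the three formats correspond to three values of the `response_format` parameter. The `text` type name follows the OpenAI chat completions convention and is an assumption here rather than something the article states, and the `product` schema is purely illustrative.

```python
# Three illustrative response_format values, one per output format.
# Assumption: the "text" type name follows the OpenAI chat completions convention.
PLAIN_TEXT = {"type": "text"}

# JSON output with no schema enforcement.
JSON_OBJECT = {"type": "json_object"}

# JSON output that must follow the given schema (hypothetical "product" schema).
JSON_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "product",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
            },
        },
        "strict": True,
    },
}

for fmt in (PLAIN_TEXT, JSON_OBJECT, JSON_SCHEMA):
    print(fmt["type"])
```

Any of these dictionaries can be passed as `response_format` in a chat request.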

## Why Use Structured Outputs?

Databricks recommends structured outputs for several high-value use cases:

### 1. **Document Data Extraction**
Extract and classify information from large document collections. For example, automatically identifying and categorizing product review feedback as negative, positive, or neutral at scale.
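A minimal sketch of what a review-classification format might look like. It assumes `enum` falls within the supported JSON Schema subset (the article does not list it as unsupported, but does not confirm it either). The names `review_sentiment`, `SENTIMENTS`, and `parse_review` are hypothetical.

```python
import json

SENTIMENTS = ["negative", "positive", "neutral"]

# Hypothetical schema: one sentiment label plus a short summary per review.
review_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "review_sentiment",
        "schema": {
            "type": "object",
            "properties": {
                # Assumption: `enum` is accepted by the endpoint.
                "sentiment": {"type": "string", "enum": SENTIMENTS},
                "summary": {"type": "string"},
            },
        },
        "strict": True,
    },
}


def parse_review(raw: str) -> dict:
    """Parse a model reply and confirm the sentiment label is one we expect."""
    record = json.loads(raw)
    if record.get("sentiment") not in SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {record.get('sentiment')!r}")
    return record


# Hand-written example reply for illustration, not real model output.
print(parse_review('{"sentiment": "negative", "summary": "Arrived broken."}'))
```

Checking the label on the client side is cheap insurance even when the schema is enforced by the endpoint.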

### 2. **Batch Inference with Format Requirements**
When processing large volumes of data that need consistent output formatting, structured outputs ensure every response matches your downstream system requirements.
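One way to structure such a batch job is sketched below. The helper `run_batch` and the stand-in `fake_model` are hypothetical; in a real job, `ask_model` would wrap a chat request that sets `response_format`.

```python
import json


def run_batch(texts, ask_model):
    """Send each text to `ask_model` and separate parsed records from failures."""
    records, failures = [], []
    for index, text in enumerate(texts):
        raw = ask_model(text)
        try:
            records.append(json.loads(raw))
        except json.JSONDecodeError as err:
            # Keep the input index so failed rows can be retried or inspected later.
            failures.append((index, str(err)))
    return records, failures


def fake_model(text: str) -> str:
    """Stand-in for an endpoint call; returns a JSON string like a real reply would."""
    return json.dumps({"length": len(text)})


records, failures = run_batch(["short", "a longer input"], fake_model)
print(records, failures)
```

Collecting failures instead of raising lets one malformed reply be retried without stopping the whole batch.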

### 3. **Data Transformation Pipelines**
Convert unstructured data into structured formats for analytics, database ingestion, or further processing. This is particularly valuable for building data lakes and warehouses from diverse sources.
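As an illustration of the last step in such a pipeline, the sketch below turns JSON replies into CSV rows using only the standard library. The field list and the helper name `to_csv` are assumptions chosen for the example.

```python
import csv
import io
import json

# Hypothetical column order for the output table.
FIELDS = ["name", "price", "color"]


def to_csv(raw_replies):
    """Convert JSON object strings into CSV text with a fixed set of columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        extrasaction="ignore",  # drop keys that are not in FIELDS
        lineterminator="\n",
    )
    writer.writeheader()
    for raw in raw_replies:
        # Keys missing from a reply are written as empty cells.
        writer.writerow(json.loads(raw))
    return buffer.getvalue()


# Hand-written reply for illustration, not real model output.
print(to_csv(['{"name": "SmartHome Mini", "price": 49.99, "color": "black"}']))
```

The same idea applies when loading into a DataFrame or a table: a fixed column list keeps the output stable even if a reply contains extra or missing keys.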

## How to Implement Structured Outputs

Implementing structured outputs is straightforward using the `response_format` parameter in your chat requests. Let's explore two common scenarios.

### Example 1: Research Paper Data Extraction

This example demonstrates extracting structured information from research papers using a predefined JSON schema:

```python
import os
import json
from openai import OpenAI

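# Replace the placeholder names below with the environment variables that
# hold your Databricks token and serving endpoint base URL.
DATABRICKS_TOKEN = os.environ.get('YOUR_DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('YOUR_DATABRICKS_BASE_URL')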

client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=DATABRICKS_BASE_URL
)

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "research_paper_extraction",
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "authors": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "abstract": {"type": "string"},
                "keywords": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
        },
        "strict": True
    }
}

messages = [
    {
        "role": "system",
        "content": "You are an expert at structured data extraction. You will be given unstructured text from a research paper and should convert it into the given structure."
    },
    {
        "role": "user",
        "content": "..."
    }
]

response = client.chat.completions.create(
    model="databricks-gpt-oss-20b",
    messages=messages,
    response_format=response_format
)

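# The content is normally a JSON string: parse it first so json.dumps
# pretty-prints the object instead of emitting an escaped string.
content = response.choices[0].message.model_dump()['content']
parsed = json.loads(content) if isinstance(content, str) else content
print(json.dumps(parsed, indent=2))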
```

This approach ensures that every research paper processed returns data in exactly the same format, making it easy to build databases of academic literature.
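To make the reply easier to load into a database, it can be mapped onto a typed record. The sketch below is one possible approach; the `Paper` dataclass and `paper_from_reply` helper are hypothetical. Because the example schema does not declare any `required` fields, the helper falls back to empty defaults for missing keys.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Typed record mirroring the research_paper_extraction schema."""
    title: str = ""
    authors: list = field(default_factory=list)
    abstract: str = ""
    keywords: list = field(default_factory=list)


def paper_from_reply(raw: str) -> Paper:
    """Build a Paper from the JSON string returned by the model."""
    data = json.loads(raw)
    return Paper(
        title=data.get("title", ""),
        authors=list(data.get("authors", [])),
        abstract=data.get("abstract", ""),
        keywords=list(data.get("keywords", [])),
    )


# Hand-written reply for illustration, not real model output.
reply = '{"title": "A Study", "authors": ["A. Author"], "keywords": ["ml"]}'
print(paper_from_reply(reply))
```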

### Example 2: Flexible JSON Extraction

When you need JSON output but don't know the exact schema beforehand, you can use the `json_object` type:

```python
response_format = {
    "type": "json_object",
}

messages = [
    {
        "role": "user",
        "content": """Extract the name, size, price, and color from this product description as a JSON object:
<description>
The SmartHome Mini is a compact smart home assistant available in black or white for only $49.99. It's 5 inches wide.
</description>"""
    }
]

response = client.chat.completions.create(
    model="databricks-gpt-oss-20b",
    messages=messages,
    response_format=response_format
)

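# Parse a JSON string before pretty-printing; pass other content through as-is.
content = response.choices[0].message.model_dump()['content']
parsed = json.loads(content) if isinstance(content, str) else content
print(json.dumps(parsed, indent=2))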
```

This flexibility is perfect for exploratory analysis or when working with diverse data sources.
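Because `json_object` does not enforce a schema, it can help to check the reply before using it. A minimal sketch, assuming the four fields requested in the prompt are the ones you expect back; `EXPECTED_KEYS` and `missing_keys` are hypothetical names.

```python
import json

# The fields the prompt asked for.
EXPECTED_KEYS = {"name", "size", "price", "color"}


def missing_keys(raw: str) -> set:
    """Return the expected keys that are absent from a JSON object reply."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object at the top level")
    return EXPECTED_KEYS - data.keys()


# Hand-written reply for illustration, not real model output.
print(sorted(missing_keys('{"name": "SmartHome Mini", "price": "$49.99"}')))
```

An empty result means every requested field came back; anything else can trigger a retry or a fallback.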

## JSON Schema Support and Constraints

Databricks Foundation Model APIs support a subset of the [JSON Schema specification](https://json-schema.org/specification) optimized for high-quality generation. For best results, keep schema definitions simple.

### Unsupported Features

The following JSON schema features are **not supported**:

- **Regular expressions** using `pattern`
- **Complex schema composition** using `anyOf`, `oneOf`, `allOf`, `prefixItems`, or `$ref`
- **Lists of types** except for the special case of `[type, "null"]` where one type is valid and the other is `null`

These constraints exist to ensure reliable, high-quality JSON generation without unnecessary complexity.
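The restrictions above can be checked before a request is sent. The following is a minimal sketch of such a check, not an official validator; `find_unsupported` is a hypothetical helper that only walks `properties` and `items`.

```python
UNSUPPORTED_KEYWORDS = {"pattern", "anyOf", "oneOf", "allOf", "prefixItems", "$ref"}


def find_unsupported(schema, path="$"):
    """Return a list of messages for schema features listed as unsupported."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    for key, value in schema.items():
        here = f"{path}.{key}"
        if key in UNSUPPORTED_KEYWORDS:
            problems.append(f"{here}: unsupported keyword")
        elif key == "type" and isinstance(value, list):
            # Only the two-element form [type, "null"] is allowed.
            non_null = [t for t in value if t != "null"]
            if len(value) != 2 or len(non_null) != 1:
                problems.append(f'{here}: only [type, "null"] lists are supported')
        elif key == "properties" and isinstance(value, dict):
            # Property names are user data, not keywords: check each sub-schema.
            for name, sub in value.items():
                problems.extend(find_unsupported(sub, f"{here}.{name}"))
        elif key == "items":
            problems.extend(find_unsupported(value, here))
    return problems


schema = {
    "type": "object",
    "properties": {
        "zip": {"type": "string", "pattern": "^[0-9]{5}$"},
        "id": {"type": ["string", "integer"]},
        "note": {"type": ["string", "null"]},
    },
}
for problem in find_unsupported(schema):
    print(problem)
```

Running a check like this in tests catches unsupported keywords early, before they cause a failed or low-quality request.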

## Important Considerations

### Token Usage and Billing

Behind the scenes, Databricks uses prompt injection (adding content to the prompt you send) and other optimization techniques to improve the quality of structured outputs. These techniques affect both input and output token consumption, which in turn affects billing. Factor these additional tokens into your cost estimates when planning your implementation.
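A simple way to keep an eye on this is to accumulate the token counts reported with each response. The sketch below assumes the usage object exposes `prompt_tokens` and `completion_tokens`, as `response.usage` does in the OpenAI Python client; `UsageTotals` is a hypothetical helper and the numbers are made up.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    """Running totals of prompt and completion tokens across requests."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, usage) -> None:
        # `usage` is expected to look like `response.usage`; skip it if absent.
        if usage is None:
            return
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


totals = UsageTotals()
# Hand-written stand-ins for two `response.usage` objects, not real measurements.
totals.add(SimpleNamespace(prompt_tokens=120, completion_tokens=45))
totals.add(SimpleNamespace(prompt_tokens=98, completion_tokens=30))
print(totals.prompt_tokens, totals.completion_tokens, totals.total_tokens)
```

In a real application, `totals.add(response.usage)` would be called after each chat request.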

### Limitations to Keep in Mind

1. **Maximum Schema Keys**: The JSON schema can contain a maximum of **64 keys**
2. **Size Constraints Are Not Enforced**: Foundation Model APIs don't enforce length or size limits expressed with keywords like `maxProperties`, `minProperties`, or `maxLength`, so validate those limits in your own code if they matter
3. **Nested Schema Complexity**: Heavily nested JSON schemas can result in lower-quality generation. When possible, flatten your schema for better results
4. **Anthropic Claude Models**: These models only accept `json_schema` structured outputs; `json_object` is not supported

## Best Practices for Implementation

Based on the capabilities and constraints, here are some recommendations:

1. **Start Simple**: Begin with straightforward schemas and add complexity only as needed
2. **Flatten When Possible**: Avoid deep nesting in your JSON schemas to improve output quality
3. **Monitor Token Usage**: Track your token consumption to optimize costs, especially for high-volume applications
4. **Test Thoroughly**: Validate that your schema definitions produce the expected results across various inputs
5. **Choose the Right Format**: Use `json_schema` for strict validation requirements and `json_object` for flexible extraction tasks
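To illustrate the flattening recommendation, the sketch below rewrites nested object properties as top-level properties with underscore-joined names. The helper `flatten_properties` and the order schema are hypothetical, and arrays of objects are left untouched.

```python
def flatten_properties(schema: dict, prefix: str = "") -> dict:
    """Return a flat {name: sub_schema} mapping for nested object properties."""
    flat = {}
    for name, sub_schema in schema.get("properties", {}).items():
        full_name = f"{prefix}{name}"
        if sub_schema.get("type") == "object" and "properties" in sub_schema:
            # Recurse into nested objects, carrying the parent name as a prefix.
            flat.update(flatten_properties(sub_schema, prefix=f"{full_name}_"))
        else:
            flat[full_name] = sub_schema
    return flat


nested_schema = {
    "type": "object",
    "properties": {
        "customer": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                },
            },
        },
        "total": {"type": "number"},
    },
}

flat_schema = {"type": "object", "properties": flatten_properties(nested_schema)}
print(list(flat_schema["properties"]))
```

The flat schema asks for the same information, but with a single level of keys for the model to fill in.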

## Real-World Applications

The potential applications of structured outputs are vast:

- **Customer Service**: Automatically categorize and route support tickets based on sentiment and topic
- **Legal Document Processing**: Extract key clauses, parties, and dates from contracts
- **Medical Records**: Structure patient information from clinical notes
- **Market Research**: Transform survey responses into standardized data formats
- **Content Moderation**: Classify and flag content according to predefined categories

## Conclusion

Structured outputs make foundation models more practical for enterprise applications: with consistent, schema-validated JSON output, organizations can build more reliable data pipelines and AI-powered workflows.

Whether you're extracting insights from documents, transforming unstructured data, or building complex AI applications, structured outputs provide the consistency that production systems need. As you implement this feature, balance schema complexity against output quality, and monitor token usage to keep costs under control.

For data engineers and AI practitioners, this capability in Databricks Foundation Model APIs bridges the gap between unstructured language model outputs and structured data requirements.

---

*To learn more about structured outputs and other Databricks AI capabilities, visit the [Databricks documentation](https://docs.databricks.com/aws/en/machine-learning/model-serving/structured-outputs).*

### Data extraction of research papers to a specific JSON schema

In [0]:
%pip install -qq openai

In [0]:
dbutils.library.restartPython()

In [0]:
import os
import json
from openai import OpenAI

DATABRICKS_TOKEN = ""  # set your Databricks access token before running; an empty token will not authenticate
DATABRICKS_BASE_URL = "https://fe-vm-agentic-ai.cloud.databricks.com/serving-endpoints"

client = OpenAI(
  api_key=DATABRICKS_TOKEN,
  base_url=DATABRICKS_BASE_URL
  )

response_format = {
      "type": "json_schema",
      "json_schema": {
        "name": "research_paper_extraction",
        "schema": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "authors": {
              "type": "array",
              "items": { "type": "string" }
            },
            "abstract": { "type": "string" },
            "keywords": {
              "type": "array",
              "items": { "type": "string" }
            }
          },
        },
        "strict": True
      }
    }

In [0]:
# Unstructured research paper content (messy, real-world format)
unstructured_paper_text = """
Proceedings of the International Conference on Machine Learning and Data Science 2024

Real-Time Anomaly Detection in IoT Networks using Deep Learning: A Hybrid CNN-LSTM Approach

Sarah Chen¹, Michael Rodriguez², Dr. Priya Patel¹, James Thompson³

¹University of California, Berkeley - Department of Computer Science
²Stanford Research Institute  
³Microsoft Research Labs

Received: March 15, 2024 | Accepted: August 22, 2024 | Published: September 10, 2024

INTRODUCTION AND BACKGROUND

The Internet of Things (IoT) has revolutionized how we interact with technology, with an estimated 75 billion connected devices expected by 2025. However, this exponential growth brings unprecedented security challenges. Traditional signature-based intrusion detection systems are inadequate for the dynamic and heterogeneous nature of IoT environments.

METHODOLOGY AND APPROACH

In this work, we propose a novel hybrid architecture that combines the spatial feature extraction capabilities of Convolutional Neural Networks with the temporal modeling strengths of Long Short-Term Memory networks. Our approach processes network traffic data in real-time, analyzing packet headers, payload characteristics, and temporal patterns to identify anomalous behavior.

EXPERIMENTAL SETUP

We collected network traffic data from smart home environments including smart thermostats, security cameras, voice assistants, and lighting systems. The dataset comprised 2.5 million network packets gathered over a six-month period from January to June 2024. Data preprocessing involved feature normalization, sequence padding, and splitting into training (70%), validation (15%), and testing (15%) sets.

RESULTS AND FINDINGS

Our hybrid CNN-LSTM model achieved remarkable performance metrics: 94.7% detection accuracy, 2.1% false positive rate, and 15 milliseconds average response time. When compared to traditional rule-based systems, our approach showed 23% improvement in accuracy and 67% reduction in false alarms. The model successfully detected various attack types including DDoS, man-in-the-middle attacks, and device hijacking attempts.

CONCLUSION

This research demonstrates that deep learning techniques, specifically the combination of CNN and LSTM architectures, provide a robust solution for real-time IoT anomaly detection. The system's low latency and high accuracy make it suitable for deployment in production environments where immediate threat response is critical.

Related research areas include: deep learning applications, anomaly detection algorithms, IoT security frameworks, neural network architectures, real-time processing systems, cybersecurity solutions, machine learning for network security, and network traffic analysis techniques.

© 2024 International Conference on Machine Learning and Data Science. All rights reserved.
DOI: 10.1234/icmlds.2024.5678
Page 142-158
"""

In [0]:
messages = [{
        "role": "system",
        "content": "You are an expert at structured data extraction. You will be given unstructured text from a research paper and should convert it into the given structure."
      },
      {
        "role": "user",
        "content": unstructured_paper_text
      }]

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=messages,
    response_format=response_format
)

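# The content is normally a JSON string: parse it first so json.dumps
# pretty-prints the object instead of emitting an escaped string.
content = response.choices[0].message.model_dump()['content']
parsed = json.loads(content) if isinstance(content, str) else content
print(json.dumps(parsed, indent=2))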

In [0]:
print(response.choices[0].message.content)

In [0]:
import os
import json
from openai import OpenAI

response_format = {
      "type": "json_object",
    }

messages = [
      {
        "role": "user",
        "content": "Extract the name, size, price, and color from this product description as a JSON object:\n<description>\nThe SmartHome Mini is a compact smart home assistant available in black or white for only $49.99. It's 5 inches wide.\n</description>"
      }]

response = client.chat.completions.create(
    model="databricks-meta-llama-3-3-70b-instruct",
    messages=messages,
    response_format=response_format
)

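# Parse a JSON string before pretty-printing; pass other content through as-is.
content = response.choices[0].message.model_dump()['content']
parsed = json.loads(content) if isinstance(content, str) else content
print(json.dumps(parsed, indent=2))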

In [0]:
print(response.choices[0].message.content)

In [0]:
print(response.choices[0].message.model_dump()['content'])