# OpenAI Structured Outputs with Label Studio
This notebook demonstrates how to use OpenAI’s Structured Outputs feature with Label Studio for various labeling tasks such as summarization, text classification, and named entity recognition (NER). By defining schemas with Pydantic, we can directly generate JSON outputs that integrate seamlessly with Label Studio, reducing the need for preprocessing.

## Setup and Imports
Install the necessary packages and initialize the OpenAI client.

In [None]:
# Install necessary packages
!pip install openai pydantic

In [2]:
import json

from enum import Enum
from typing import List, Union, Optional, Literal
from pydantic import BaseModel, Field

# Import and initialize the OpenAI client
from openai import OpenAI

client = OpenAI()

# Label Studio Summarization
Summarization is the process of condensing a larger body of text into a shorter version while preserving its essential information and meaning. We will use OpenAI’s API to generate concise summaries from longer texts. This can be useful for extracting key points from articles, reports, or any extensive document.

First, we define a schema for summarization using Pydantic to ensure the output matches Label Studio’s requirements.

In [None]:
class Label(BaseModel):
    score: float
    text: str


class ResultItem(BaseModel):
    id: str
    from_name: Literal["answer"] = "answer"
    to_name: Literal["text"] = "text"
    type: Literal["textarea"] = "textarea"
    value: Label


class Prediction(BaseModel):
    model_version: str
    result: List[ResultItem]


class Data(BaseModel):
    text: str


class Summarization(BaseModel):
    data: Data
    predictions: List[Prediction]

Define the text we want to summarize. In this case, we will use a paragraph from Wikipedia on the 2024 Olympics. 

In [4]:
summary_input = "The 2024 Summer Olympics,[a] officially the Games of the XXXIII Olympiad[b] and branded as Paris 2024, were an international multi-sport event that occurred from 26 July to 11 August 2024 in France, with the opening ceremony having taken place on 26 July. Paris was the host city, with events (mainly football) held in 16 additional cities spread across metropolitan France, including the sailing centre in the second-largest city of France, Marseille on the Mediterranean Sea, as well as one subsite for surfing in Tahiti, French Polynesia."

Request a summarization completion with the defined schema.

In [5]:
summarization_completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a Summarization assistant.
                Your job is to identify and return the best single-sentence summary 
                for a given piece of text. Ensure that there is only a single 
                summary for the given text.""",
        },
        {
            "role": "user",
            "content": summary_input
        }
    ],
    response_format=Summarization
)

In [6]:
print(json.dumps(json.loads(summarization_completion.choices[0].message.content), indent=4))

{
    "data": {
        "text": "The 2024 Summer Olympics, officially known as the Games of the XXXIII Olympiad, took place in Paris, France from 26 July to 11 August 2024, with additional events held in various cities across metropolitan France and one surfing site in Tahiti."
    },
    "predictions": [
        {
            "model_version": "legacy",
            "result": [
                {
                    "id": "summary1",
                    "from_name": "answer",
                    "to_name": "text",
                    "type": "textarea",
                    "value": {
                        "score": 1.0,
                        "text": "The 2024 Summer Olympics, officially known as the Games of the XXXIII Olympiad, took place in Paris, France from 26 July to 11 August 2024, with additional events held in various cities across metropolitan France and one surfing site in Tahiti."
                    }
                }
            ]
        }
    ]
}
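With Structured Outputs, the model can decline a request, in which case the message carries a `refusal` string instead of schema-conforming content. The sketch below guards against that before parsing. `extract_task` is a hypothetical helper name, and the stub object only imitates the `choices[0].message` shape so the example runs without an API call.

```python
import json
from types import SimpleNamespace


def extract_task(completion) -> dict:
    """Return the first choice's JSON content as a dict, or raise if the model refused."""
    message = completion.choices[0].message
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise ValueError(f"Model refused the request: {refusal}")
    return json.loads(message.content)


# Stand-in for a real completion object, for illustration only.
fake_completion = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(
                refusal=None,
                content='{"data": {"text": "example"}, "predictions": []}',
            )
        )
    ]
)
task = extract_task(fake_completion)
print(task["data"]["text"])
```

In the notebook, the same helper could be called as `extract_task(summarization_completion)` in place of the inline `json.loads(...)` expression.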


We can save the output as a file, which can then be imported into our Label Studio project. 

In [7]:
# Save output to file
summary_filename = 'summary_prediction.json'

# Write JSON object to a file
with open(summary_filename, 'w') as file:
    json.dump(json.loads(summarization_completion.choices[0].message.content), file, indent=4)

# Label Studio Text Classification 
Classification involves categorizing text into predefined classes or labels based on its content. In this case, we will use sentiment analysis as the classification task.

Once again, we will create a Pydantic schema for the JSON structure that we want the LLM to populate.

In [None]:
class EntityType(str, Enum):
    positive = "Positive"
    negative = "Negative"
    neutral = "Neutral"

class Label(BaseModel):
    score: float
    choices: List[EntityType]

class ResultItem(BaseModel):
    id: str
    from_name: Literal["sentiment"] = "sentiment"
    to_name: Literal["text"] = "text"
    type: Literal["choices"] = "choices"
    value: Label

class Prediction(BaseModel):
    model_version: str
    result: List[ResultItem]

class Data(BaseModel):
    text: str

class Classification(BaseModel):
    data: Data
    predictions: List[Prediction]

Define the text that we want to classify.

In [9]:
classification_input = "We’re excited to announce the 1.13 release of Label Studio! This update includes a refreshed UI and some new Generative AI templates for you to use."

In [10]:
classification_completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a Sentiment analysis assistant.
                Your job is to provide the sentiment 
                for a given piece of text. Ensure that there is only a single 
                sentiment for the given text.""",
        },
        {
            "role": "user",
            "content": classification_input
        }
    ],
    response_format=Classification
)

In [11]:
print(json.dumps(json.loads(classification_completion.choices[0].message.content), indent=4))

{
    "data": {
        "text": "We\u2019re excited to announce the 1.13 release of Label Studio! This update includes a refreshed UI and some new Generative AI templates for you to use."
    },
    "predictions": [
        {
            "model_version": "1.0",
            "result": [
                {
                    "id": "1",
                    "from_name": "sentiment",
                    "to_name": "text",
                    "type": "choices",
                    "value": {
                        "score": 0.98,
                        "choices": [
                            "Positive"
                        ]
                    }
                }
            ]
        }
    ]
}


Now we can save the output as a file, which can then be imported into our Label Studio project. 

In [12]:
# Save output to file
classification_filename = 'classification_prediction.json'

# Write JSON object to a file
with open(classification_filename, 'w') as file:
    json.dump(json.loads(classification_completion.choices[0].message.content), file, indent=4)
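When classifying more than one text, it can be convenient to collect the tasks and write them to a single file. The sketch below assumes Label Studio accepts a JSON file containing a list of task objects; `save_tasks` and the placeholder tasks are hypothetical. Passing `ensure_ascii=False` keeps characters such as `’` readable in the file instead of writing them as `\u2019` escapes.

```python
import json
import os
import tempfile


def save_tasks(tasks: list, path: str) -> None:
    """Write a list of task dicts (each with data and predictions) to one JSON file."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(tasks, file, indent=4, ensure_ascii=False)


# Two placeholder tasks; in the notebook each would come from one parsed completion.
tasks = [
    {"data": {"text": "We’re excited about the first example"}, "predictions": []},
    {"data": {"text": "second example"}, "predictions": []},
]
out_path = os.path.join(tempfile.gettempdir(), "classification_predictions.json")
save_tasks(tasks, out_path)

# Read the file back to confirm it round-trips.
with open(out_path, encoding="utf-8") as file:
    print(len(json.load(file)))
```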

# Label Studio NER Example
Named Entity Recognition (NER) is a natural language processing task that identifies and classifies key entities in text, such as names of people, organizations, locations, dates, and other specific terms. This task is more complex because it not only requires recognizing entity names but also accurately determining their positions within the text, including the exact character offsets (start and end positions). While Large Language Models (LLMs) are capable of identifying these entities, they often struggle with providing precise offsets, which can lead to inconsistencies in the results.
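To make the offset requirement concrete, the sketch below treats `start` and `end` as Python slice indices, with `end` pointing one past the last character. That exclusive-end convention is an assumption here, and `offsets_match` is a hypothetical helper for checking a predicted span.

```python
text = "Samuel Harris Altman (born April 22, 1985) is an American entrepreneur"
entity = "Samuel Harris Altman"

start = text.find(entity)  # index of the first character of the entity
end = start + len(entity)  # index one past the last character

print(start, end)
print(text[start:end] == entity)


def offsets_match(text: str, value: dict) -> bool:
    """Check that a predicted span's offsets select exactly the predicted text."""
    return text[value["start"]:value["end"]] == value["text"]
```

A span whose offsets are off by even one or two characters fails this check, which is the kind of error an LLM tends to make when it has to count characters.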

In [13]:
class EntityType(str, Enum):
    person = "Person"
    organization = "Organization"
    location = "Location"
    datetime = "DateTime"
    product = "Product"
    percent = "Percent"
    fact = "Fact"
    money = "Money"


class Label(BaseModel):
    start: int
    end: int
    score: float
    text: str
    labels: List[EntityType]


class ResultItem(BaseModel):
    id: str
    from_name: Literal["label"] = "label"
    to_name: Literal["text"] = "text"
    type: Literal["labels"] = "labels"
    value: Label


class Prediction(BaseModel):
    model_version: str
    # score: Optional[float] = None
    result: List[ResultItem]


class Data(BaseModel):
    text: str


class NamedEntities(BaseModel):
    data: Data
    predictions: List[Prediction]



In [14]:
ner_input = "Samuel Harris Altman (born April 22, 1985) is an American entrepreneur and investor best known as the CEO of OpenAI since 2019 (he was briefly fired and reinstated in November 2023)."

In [15]:
ner_completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a Named Entity Recognition (NER) assistant.
                Your job is to identify and return all entity names and their 
                types for a given piece of text. You are to strictly conform
                only to the following entity types: Person, Location, Organization
                and DateTime. If uncertain about entity type, please ignore it.
                Be careful of certain acronyms, such as role titles "CEO", "CTO",
                "VP", etc - these are to be ignored.""",
        },
        {
            "role": "user",
            "content": ner_input
        }
    ],
    response_format=NamedEntities
)

In [16]:
print(json.dumps(json.loads(ner_completion.choices[0].message.content), indent=4))

{
    "data": {
        "text": "Samuel Harris Altman (born April 22, 1985) is an American entrepreneur and investor best known as the CEO of OpenAI since 2019 (he was briefly fired and reinstated in November 2023)."
    },
    "predictions": [
        {
            "model_version": "1.0",
            "result": [
                {
                    "id": "1",
                    "from_name": "label",
                    "to_name": "text",
                    "type": "labels",
                    "value": {
                        "start": 0,
                        "end": 18,
                        "score": 0.95,
                        "text": "Samuel Harris Altman",
                        "labels": [
                            "Person"
                        ]
                    }
                },
                {
                    "id": "2",
                    "from_name": "label",
                    "to_name": "text",
                    "type": "labels",
            

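If we review the data in Label Studio, we can see that the LLM identified the named entities, but it did not generate the character offsets correctly. For example, the first entity is given an `end` of 18, although "Samuel Harris Altman" is 20 characters long. We can fix this before saving the output by searching for each entity's text in the input with a simple regex and overwriting `start` and `end` with the match positions.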

In [17]:
import re

json_data = json.loads(ner_completion.choices[0].message.content)

# Extract the text to search in
text = json_data["data"]["text"]

# Iterate over each result in predictions to update start and end indexes
for prediction in json_data["predictions"]:
    for result in prediction["result"]:
        # Get the text to find in the main text
        search_text = result["value"]["text"]
        
        # Find the first occurrence of search_text in text (escaped so it is matched literally)
        match = re.search(re.escape(search_text), text)
        if match:
            # Update start and end indexes with exact positions
            result["value"]["start"] = match.start()
            result["value"]["end"] = match.end()

# Print the updated JSON
print(json.dumps(json_data, indent=4))


{
    "data": {
        "text": "Samuel Harris Altman (born April 22, 1985) is an American entrepreneur and investor best known as the CEO of OpenAI since 2019 (he was briefly fired and reinstated in November 2023)."
    },
    "predictions": [
        {
            "model_version": "1.0",
            "result": [
                {
                    "id": "1",
                    "from_name": "label",
                    "to_name": "text",
                    "type": "labels",
                    "value": {
                        "start": 0,
                        "end": 20,
                        "score": 0.95,
                        "text": "Samuel Harris Altman",
                        "labels": [
                            "Person"
                        ]
                    }
                },
                {
                    "id": "2",
                    "from_name": "label",
                    "to_name": "text",
                    "type": "labels",
            

With our corrected NER prediction, we can save it and import it into Label Studio.

In [18]:
file_path = 'formatted_ner_prediction.json'

with open(file_path, 'w') as json_file:
    json.dump(json_data, json_file, indent=4)
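One limitation of the regex fix above is that `re.search` always returns the first occurrence, so an entity mentioned twice would receive the same offsets both times. The sketch below is one possible variation, not the notebook's method: it resumes the search after the previous match for the same entity text. It assumes the model lists repeated mentions in document order; `assign_offsets` and the sample data are hypothetical.

```python
def assign_offsets(text: str, results: list) -> list:
    """Set start/end on each result, matching repeated mentions left to right."""
    next_search_from = {}  # entity text -> index to resume searching from
    for result in results:
        value = result["value"]
        entity = value["text"]
        start = text.find(entity, next_search_from.get(entity, 0))
        if start == -1:
            continue  # no (further) occurrence found: leave the offsets untouched
        value["start"] = start
        value["end"] = start + len(entity)
        next_search_from[entity] = value["end"]
    return results


sample_text = "OpenAI hired Altman. Later, OpenAI reinstated Altman."
sample_results = [
    {"value": {"text": "OpenAI", "start": 0, "end": 0}},
    {"value": {"text": "Altman", "start": 0, "end": 0}},
    {"value": {"text": "OpenAI", "start": 0, "end": 0}},
]
for result in assign_offsets(sample_text, sample_results):
    value = result["value"]
    print(value["text"], value["start"], value["end"])
```

Applied to the notebook's data, this would be called as `assign_offsets(text, prediction["result"])` for each prediction in `json_data["predictions"]`.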