# Streamlining Gen AI: How to Structure LLM Output Using OpenAI Functions

LLMs are increasingly becoming part of intelligent automation pipelines, often called agentic pipelines. Automating a task usually involves integrating many applications, of which the large language model is only one. Every application in the integration should follow a common standard for structured communication, such as JSON. Since LLMs are text generation tools whose output is unstructured by default, it becomes important for generative AI developers to impose structure on that output.

In this tutorial, we will show how to get an LLM to produce output in JSON, one of the most widely accepted formats for integration between applications. The use case we will showcase involves extracting entities from legal documents so that they can be uploaded to a data lake or case management software.

### Installing dependencies

Install the libraries needed for this tutorial.

In [2]:
#!pip install openai datasets pydantic

### Dataset
For this tutorial, we will be using the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) from Umar Butler.

We will use the streaming option because the dataset is quite large to download. Streaming lets us fetch only the fraction of the dataset that we actually need.

In [18]:
from datasets import load_dataset
corpus = load_dataset('umarbutler/open-australian-legal-corpus', split='corpus', streaming=True)
cases_filtered = list(corpus.take(10))
len(cases_filtered)

10

Here's what the case documentation looks like.

In [58]:
print(cases_filtered[0])

{'version_id': 'nsw_caselaw:54a640013004de94513dcb0b', 'type': 'decision', 'jurisdiction': 'new_south_wales', 'source': 'nsw_caselaw', 'citation': 'R v Dickson (No 15) [2014] NSWSC 1861', 'url': 'https://www.caselaw.nsw.gov.au/decision/54a640013004de94513dcb0b', 'when_scraped': '2024-05-16T23:59:44.965193+10:00', 'text': 'Supreme Court\nNew South Wales\n\nMedium Neutral Citation:  R v Dickson (No 15) [2014] NSWSC 1861                                                                                                          \nHearing dates:            16 December 2014                                                                                                                               \nDecision date:            16 December 2014                                                                                                                               \nJurisdiction:             Common Law                                                                                             

In [26]:
for i in range(3):
    print('-' * 100, '\n', cases_filtered[i]['text'][0:1000])

---------------------------------------------------------------------------------------------------- 
 Supreme Court
New South Wales

Medium Neutral Citation:  R v Dickson (No 15) [2014] NSWSC 1861                                                                                                          
Hearing dates:            16 December 2014                                                                                                                               
Decision date:            16 December 2014                                                                                                                               
Jurisdiction:             Common Law                                                                                                                                     
Before:                   Beech-Jones J                                                                                                                                  
Decision:       

### LLM

For this tutorial, we will be using [GPT-3.5](https://platform.openai.com/docs/models/gpt-3-5-turbo) from OpenAI as our large language model. We authenticate with an API key, which the OpenAI client reads from the `OPENAI_API_KEY` environment variable, so that variable must be set before running the cells below.

In [36]:
import os
assert os.environ.get("OPENAI_API_KEY") is not None

## Prompt Engineering
In the following cells, we will show how prompt engineering can be used to impose a structure on the output of the LLM. The advantage of this method is that it requires <font color='green'><b><u>fewer lines of code</u></b></font>; the disadvantage is the poor <font color='red'><b><u>stability and reproducibility of the output</u></b></font>.

In the cell below, we authenticate with the OpenAI API and create a function that takes the free-flowing text of a legal case report and uses the GPT-3.5 model to extract entities. These entities are what we want to pass to downstream applications that accept JSON, such as a database or case management software.

In [57]:
from openai import OpenAI
client = OpenAI()

def extract_legal_entities(case_doc):
    completion = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "system", "content": "You are an AI assistant, skilled in extracting \
         entities from legal documentation and returning them as a python json object.\
         Do not return as a code block."},
        {"role": "user", "content": "Extract the data from the case documentation enclosed in triple quotes. \
        If a value is not present, provide null. Use the following structure: \
        {\"court\": , \"judge\": , \"dates\": [] , \"defendant\": , \"plaintiff\": , \
         \"decision\": , \"summary\": } \n \n '''" + case_doc + "'''"}
      ],
      seed = 42
    )
    return completion.choices[0]

entities = extract_legal_entities(cases_filtered[3]['text'])
print(entities.message.content)

{
    "court": "Land and Environment Court of New South Wales",
    "judge": "Hussey C",
    "dates": ["06/04/2006", "22/05/2006", "06/09/2006"],
    "defendant": "Fairfield City Council",
    "plaintiff": "Poppets Child Care Centre",
    "decision": "The appeal is dismissed. Development application (1300/2005) for the construction and operation of a 68 place childcare centre at 18 Edensor Road, Cabramatta West, is refused.",
    "summary": "The appeal was lodged against council\u2019s deemed refusal of a development application for a childcare centre to accommodate 68 children. The primary issue was the size, scale, and amenity impacts of the development in its residential context, relative to the provisions of DCP 39.",
    "legal_representatives": {
        "plaintiff": {
            "name": "Mr S Kondilios",
            "solicitor": "Maddocks"
        },
        "defendant": {
            "name": "Mr J Hewitt",
            "solicitor": "Home Wilkson Lowry"
        }
    }
}


Let us see how stable the output is. It is quite evident that the outputs below differ from one another in structure: the "judge" property is sometimes a list and sometimes a plain string, and the dates list sometimes contains bare dates and sometimes labelled ones. The output printed above even includes a "legal_representatives" key that the prompt never asked for. This instability means a lot of exception handling and programming logic has to be spent on making sure the result can be consumed by the downstream application.

In [61]:
for i in range(3):
    response = extract_legal_entities(cases_filtered[i]['text'])
    print(response.message.content)

{
    "court": "Supreme Court New South Wales",
    "judge": "Beech-Jones J",
    "dates": [
        "Hearing dates: 16 December 2014",
        "Decision date: 16 December 2014"
    ],
    "defendant": "Anthony James Dickson",
    "plaintiff": "Crown (Commonwealth Prosecutor)",
    "decision": "Applications by the accused for further directions rejected.",
    "summary": "CRIMINAL LAW - question arising as to the causal element in the various counts - challenge to listing of the conduct of each of the co-accused."
}
{
    "court": "Civil and Administrative Tribunal New South Wales",
    "judge": ["D Patten, Principal Member", "R C Titterton, Principal Member"],
    "dates": ["14 January 2015"],
    "defendant": "Rajayogan Balachandren",
    "plaintiff": "Ray Wu",
    "decision": "Extension of time in which to commence the appeal granted; Leave to appeal refused; Appeal dismissed; Stay of orders lifted",
    "summary": "The appellant sought to appeal from a decision terminating a reside

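The kind of defensive code this instability forces on the consumer can be sketched as follows. This is a minimal illustration rather than a complete solution: `normalize_entities` is a hypothetical helper, and it only handles the variations seen in the outputs above (string versus list for "judge", labelled dates, unrequested keys).

```python
import json
import re

def normalize_entities(raw: str) -> dict:
    """Coerce one prompt-engineered response into a single predictable shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError if the model returned non-JSON text

    # "judge" arrives as a string in some responses and as a list in others: always use a list.
    judge = data.get("judge")
    if judge is None:
        judges = []
    elif isinstance(judge, str):
        judges = [judge]
    else:
        judges = list(judge)

    # Dates sometimes carry a label such as "Hearing dates: 16 December 2014": drop the label.
    # Assumption: the date itself contains no colon.
    dates = [re.sub(r"^[^:]*:\s*", "", d) for d in data.get("dates") or []]

    # Keep only the keys the prompt asked for; extras such as
    # "legal_representatives" are discarded.
    expected = ("court", "defendant", "plaintiff", "decision", "summary")
    result = {key: data.get(key) for key in expected}
    result["judge"] = judges
    result["dates"] = dates
    return result
```

Every new variation the model invents needs another branch here, which is the maintenance cost the next section tries to avoid.
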
## OpenAI Function Calling

Let us now look at how we can address the above drawbacks using the OpenAI feature called Function Calling. [Function Calling](https://platform.openai.com/docs/guides/function-calling) allows developers to build LLM workflows that intelligently call functions with arguments. Since the arguments of a function call are returned as a JSON object that follows a schema we define, we can use this feature to impose structure on LLM output.

The first step in using Function Calling is to define a schema for the function and its parameters. We then pass the schema to the completion call and force the LLM to choose that function, so that it has to supply arguments for it. We never actually execute the function: the argument values are the structured output that we want.

In [64]:
from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import date

class CaseDates(BaseModel):
    hearing_date: date = Field(description="Case hearing date")
    decision_date: date = Field(description="Date of final decision if available, if unavailable then Null")

class GenerateCaseSummaryParams(BaseModel):
    court: str = Field(description="Name of the court for the case proceedings")
    judge: str = Field(description="Name of the judge presiding the case")
    dates: List[CaseDates] = Field(description="Details of hearing and decision dates of the case")
    defendant: str = Field(description="Name of the defendant in the case")
    plaintiff: str = Field(description="Name of the plaintiff in the case")
    decision: str = Field(description="Final decision of the case if available, else Null")
    summary: str = Field(description="100 word summary of the case with key points of the case")
        
tool_definitions = [  
    {
        "type": "function",
        "function": {
            "name": "generate_case_summary",
            "description": "Generates a legal case summary for uploading to the case management software",
            # model_json_schema() already lists every field without a default under
            # "required", so no separate "required" key is needed on the tool itself
            "parameters": GenerateCaseSummaryParams.model_json_schema()
        }
    }
]
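
The `parameters` entry is plain JSON Schema generated by pydantic. To get a feel for what the model receives, here is a small sketch using a hypothetical `Party` model (not part of this tutorial) and assuming pydantic v2: fields without a default end up under `required`, and each `Field` description is carried into the schema.

```python
from typing import Optional
from pydantic import BaseModel, Field

class Party(BaseModel):  # hypothetical model, used only for illustration
    name: str = Field(description="Name of the party")
    role: Optional[str] = Field(default=None, description="Role in the case, if stated")

schema = Party.model_json_schema()

# Only fields without a default are listed as required.
print(schema["required"])
# Descriptions are passed along and act as per-field instructions to the LLM.
print(schema["properties"]["name"]["description"])
```

Giving a field a default, as `role` has here, is one way to let the model omit a value instead of inventing one.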

Let us update our LLM completion function to include the function calling definition we designed.

In [65]:
def extract_legal_entities_fc(case_doc):
    completion = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "system", "content": "Generate the case summary for the following case '''" + case_doc + "'''"}
      ],
      seed = 42,
      tools=tool_definitions,
      tool_choice={"type": "function", "function": {"name": "generate_case_summary"}}
    )
    return completion.choices[0]

As you can see below, each output follows the same well-defined, expected JSON structure, which can be consumed by downstream applications with far fewer errors or exceptions.

In [76]:
import json

for i in range(3):
    response = extract_legal_entities_fc(cases_filtered[i]['text'])
    # the structured output is the JSON string of arguments for the forced tool call
    print(json.dumps(json.loads(response.message.tool_calls[0].function.arguments), indent=4))

{
    "court": "Supreme Court New South Wales",
    "judge": "Beech-Jones J",
    "dates": [
        {
            "hearing_date": "16 December 2014",
            "decision_date": "16 December 2014"
        }
    ],
    "defendant": "Anthony James Dickson",
    "plaintiff": "Crown (Commonwealth Prosecutor)",
    "decision": "Applications by the accused for further directions rejected.",
    "summary": "In the case of R v Dickson (No 15) [2014] NSWSC 1861, the defendant, Anthony James Dickson, sought further directions in relation to the summing-up of the case. The court rejected the defendant's applications, stating that the causal element in Count 1 differed from Counts 2 to 5, and listing the conduct of each co-accused was necessary for the jury to understand the conspiracy charges."
}
{
    "court": "Civil and Administrative Tribunal of New South Wales",
    "judge": "D Patten, R C Titterton",
    "dates": [
        {
            "hearing_date": "Not Applicable",
            "decisi

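Function calling fixes the keys and the nesting, but the values are still model-generated text: in the sample output above, one `hearing_date` comes back as "Not Applicable" rather than a date. A minimal sketch of checking the values before handing them downstream is shown here. `check_case_dates` is a hypothetical helper, and the `"%d %B %Y"` format is an assumption based on the dates seen in the outputs above.

```python
import json
from datetime import datetime

def check_case_dates(arguments: str, fmt: str = "%d %B %Y") -> list:
    """Return (field, value) pairs whose value is not a date in the assumed format."""
    problems = []
    # `arguments` is the JSON string found in
    # response.message.tool_calls[0].function.arguments
    for entry in json.loads(arguments).get("dates", []):
        for field, value in entry.items():
            try:
                datetime.strptime(value, fmt)
            except (TypeError, ValueError):
                # TypeError covers null values, ValueError covers non-date text
                problems.append((field, value))
    return problems
```

An empty list means every date parsed; anything else can be logged or routed for manual review.
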
### Conclusion

With the above examples, we have shown two ways of imparting structure to LLM outputs, prompt engineering and function calling, along with the trade-offs of each. LangChain also provides function/tool calling if you are interested in using this approach with other LLMs.