# Module 1, Activity 3: Structured and Custom Output Formatting

Because we ultimately care about the answers we get back from models, it is worthwhile to spend some time on the different formats in which responses can be returned.  This is particularly helpful when simulating data going to and from a system, or when the output of the model needs to serve as the input to another system.

In [1]:
import boto3
import json
import re
import yaml

from langchain_aws import ChatBedrock
from langchain_core.output_parsers import BaseOutputParser, JsonOutputParser
#from langchain_core.pydantic_v1 import BaseModel, Field
from pydantic import BaseModel, Field
from langchain.output_parsers import YamlOutputParser
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

In [2]:
session = boto3.session.Session()
region = session.region_name

In [3]:
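def get_data_from_s3(bucket_name, key):
    # Fetch an object from S3 and return its contents as text.
    s3 = boto3.client('s3', region_name='us-west-2')
    response = s3.get_object(Bucket=bucket_name, Key=key)
    # Body.read() returns bytes; decode so the prompt receives a str (assumes UTF-8 text).
    return response['Body'].read().decode('utf-8')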

## JSON parsers

We now introduce the very handy `JsonOutputParser`.  It is a bit more involved than a plain string output parser: LLMs return strings, and JSON follows strict rules that the LLM output might not follow very well.  If we can get the LLM to return proper JSON as a string, this parser is a simple way to turn that string into a Python dictionary.  Be sure to read the system prompt below and observe how particular we are about what the output should be.
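
To make "strict rules" concrete, the sketch below uses only Python's standard `json` module (not LangChain) to show one string that parses and two near-misses that do not. The sample strings and the `try_parse` helper are made up for illustration.

```python
import json

# A well-formed JSON string: double-quoted keys and strings, no trailing commas.
valid = '{"summary": "ok", "benefits": ["a", "b"]}'

# Two near-misses an LLM might plausibly emit (illustrative strings only).
single_quotes = "{'summary': 'ok'}"            # Python-style quotes are not JSON
trailing_comma = '{"benefits": ["a", "b",]}'   # trailing commas are not allowed


def try_parse(text: str):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_parse(valid))
print(try_parse(single_quotes))
print(try_parse(trailing_comma))
```

Only the first call returns a dictionary; the other two fail even though a human reader would have no trouble understanding them.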

In [4]:
system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a helpful assistant. Please provide an analysis of the benefits of structured outputs in AI applications.
    Return ONLY your answer in valid JSON format with the following keys: 
    "summary": A brief summary of the analysis,
    "benefits": a list of benefits, and 
    "recommendations": a list of recommendations.
    """
)
human_prompt = HumanMessagePromptTemplate.from_template(
    "Respond to the question."
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name=region, 
    temperature=0.0,
)

chain = prompt | llm | JsonOutputParser()

In [5]:
chain.invoke({})

{'summary': 'Structured outputs in AI applications provide a standardized and organized way to present information, making it easier to process and integrate with other systems. They enhance interpretability, interoperability, and enable more efficient data exchange.',
 'benefits': ['Improved interpretability and understandability of AI outputs',
  'Easier integration with other systems and applications',
  'Facilitated data exchange and interoperability',
  'Increased consistency and standardization of outputs',
  'Reduced ambiguity and potential for misinterpretation',
  'Simplified parsing and processing of AI outputs'],
 'recommendations': ['Adopt industry-standard data formats and schemas for structured outputs',
  'Clearly define and document the structure and semantics of outputs',
  'Implement validation and error-handling mechanisms for structured outputs',
  'Leverage existing libraries and tools for working with structured data formats',
  'Continuously monitor and update st

## Making it more complex

That was a fairly simple output, so `JsonOutputParser()` didn't have (or shouldn't have had!) too many problems.  That is also why it was important to be specific in the prompt about returning "ONLY" valid JSON.  LLMs frequently try to be "too helpful" and return additional text beyond the JSON.  And as the data or prompt gets more complicated, the LLM is more likely to emit extra characters, which can confuse the parser because it only accepts a very specific format.
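
As a rough illustration of the "too helpful" failure mode, the snippet below feeds a made-up response with a friendly preamble to the standard library's `json.loads`. This is a stand-in for a strict parser, not a reproduction of how `JsonOutputParser` behaves internally.

```python
import json

# Hypothetical model response: valid JSON wrapped in extra conversational text.
chatty_response = (
    "Sure! Here is the JSON you asked for:\n"
    '{"characters": ["Hamlet"], "locations": ["Denmark"]}\n'
    "Let me know if you need anything else."
)

try:
    json.loads(chatty_response)
    error_message = None
except json.JSONDecodeError as exc:
    # Decoding fails at the start of the text: "Sure!" cannot begin a JSON value.
    error_message = str(exc)

print(error_message)
```

The JSON in the middle of the response is perfectly valid; it is the text around it that makes a strict decoder give up.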

So let's actually make it harder... 

In [18]:
hamlet_data = get_data_from_s3("dpgenaitraining", "hamlet.txt")
hamlet_data



In [19]:
system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a helpful assistant. You will be provided a script.
    Identify the following in the script and output EXACTLY a valid JSON object with no additional text:
    {{"characters": [list of character names], "locations": [list of location names]}}
    For example, if the script mentions characters "Hamlet" and "Horatio" and locations "Denmark" and "Norway", output exactly:
    {{"characters": ["Hamlet", "Horatio"], "locations": ["Denmark", "Norway"]}}
    Output ONLY the JSON object and nothing else.
    """
)

human_prompt = HumanMessagePromptTemplate.from_template(
    "Here is the script: {script}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name=region,
    temperature=0.5,
)

chain = prompt | llm | JsonOutputParser()

Now we are going to try to run all of this text through the chain, which is very likely going to result in the LLM not following the system prompt to the letter.  **Don't worry if you get an error when you run `chain.invoke(...)`.  Just check out what the error says.**

In [20]:
chain.invoke({"script": hamlet_data})

{'characters': ['Hamlet',
  'Horatio',
  'Marcellus',
  'Barnardo',
  'Francisco',
  'Ghost',
  'King',
  'Queen',
  'Polonius',
  'Ophelia',
  'Laertes',
  'Reynaldo',
  'Rosencrantz',
  'Guildenstern',
  'Osric',
  'First Player',
  'Prologue',
  'Player King',
  'Player Queen',
  'Lucianus',
  'Gravedigger',
  'Other',
  'Lord',
  'Messenger',
  'Fortinbras',
  'Captain',
  'Ambassadors',
  'Players',
  'Gentlemen',
  'Sailor',
  'Attendants',
  'Lords',
  'Soldiers',
  'Officers',
  'Doctor of Divinity'],
 'locations': ['Denmark',
  'Elsinore',
  'Norway',
  'Poland',
  'England',
  'France',
  'Wittenberg']}

## Custom parsers

You can create a new class that inherits from LangChain's `BaseOutputParser` class to handle whatever life may throw at you in outputs.  Here is one that uses a regex to pull out the JSON-looking portion of the text and discard anything around it.

In [21]:
class CustomJsonOutputParser(BaseOutputParser):
    def parse(self, text: str):
        # Greedy match from the first '{' to the last '}' so surrounding text is dropped.
        json_match = re.search(r'\{.*\}', text, re.DOTALL)
        if not json_match:
            raise ValueError("No JSON object found in the output.")
        json_str = json_match.group(0)
        try:
            return json.loads(json_str)
        except json.JSONDecodeError as e:
            raise ValueError(f"Error decoding JSON: {e}") from e

    def get_format_instructions(self) -> str:
        return "Output should be a valid JSON object."
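
One detail worth knowing about the regex in this parser: `.*` is greedy, so with `re.DOTALL` the match runs from the first `{` to the last `}` in the text. That works for a single object with chatter around it, but it can over-capture if the trailing text contains braces of its own. The sketch below contrasts that behaviour with an alternative built on the standard library's `json.JSONDecoder.raw_decode`. The helper names `extract_with_regex` and `extract_first_json` are hypothetical and not part of LangChain.

```python
import json
import re


def extract_with_regex(text: str) -> dict:
    """Same approach as the parser above: greedy match from first '{' to last '}'."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in the output.")
    return json.loads(match.group(0))


def extract_first_json(text: str) -> dict:
    """Try each '{' in turn and return the first span that decodes as a JSON object."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; keep scanning
        if isinstance(obj, dict):
            return obj
    raise ValueError("No JSON object found in the output.")


# Illustrative strings, not real model output.
simple = 'Here you go: {"characters": ["Hamlet"]} Hope that helps!'
tricky = 'Here you go: {"characters": ["Hamlet"]} (format: {name})'

print(extract_with_regex(simple))
print(extract_first_json(simple))
print(extract_first_json(tricky))

try:
    extract_with_regex(tricky)
except json.JSONDecodeError as exc:
    print(f"regex approach failed: {exc}")
```

Either approach could sit inside a `parse` method; which one is preferable depends on how messy the surrounding text tends to be.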

## Modifying the system prompt and chain

Notice that the system prompt includes a worked example of the expected output.  The braces in it are doubled (`{{` and `}}`) so the template treats them as literal JSON braces rather than as a variable such as `{script}`.  We have also replaced the original parser in the chain with the custom one above.  This should make the chain more robust to additional text the LLM might decide to throw in (even though it has been told not to).
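
The brace doubling can be tried outside LangChain. Assuming the default template format, which follows the same brace rules as Python's `str.format`, the plain-Python sketch below shows the effect on a made-up template string: `{{` and `}}` become literal braces, while `{script}` is filled in.

```python
# A made-up template in the same style as the system prompt.
template = 'Return {{"characters": [...]}} for this script: {script}'

# str.format fills {script} and collapses the doubled braces to single literal ones.
rendered = template.format(script="HAMLET. To be, or not to be")
print(rendered)

# With single braces, the JSON key is mistaken for a variable name and formatting fails.
broken = 'Return {"characters": [...]} for this script: {script}'
try:
    broken.format(script="HAMLET. To be, or not to be")
    failed = False
except (KeyError, IndexError, ValueError):
    failed = True
print(failed)
```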

In [23]:
system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a helpful assistant. You will be provided a script.
    Identify the following in the script and output EXACTLY a valid JSON object with no additional text:
    {{"characters": [list of character names], "locations": [list of location names]}}
    For example, if the script mentions characters "Hamlet" and "Horatio" and locations "Denmark" and "Norway", output exactly:
    {{"characters": ["Hamlet", "Horatio"], "locations": ["Denmark", "Norway"]}}
    Output ONLY the JSON object and nothing else.
    """
)

human_prompt = HumanMessagePromptTemplate.from_template(
    "Here is the script: {script}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name='us-west-2',
    temperature=0.0,
)

chain = prompt | llm | CustomJsonOutputParser()

In [24]:
chain.invoke({"script": hamlet_data})

{'characters': ['Hamlet',
  'Horatio',
  'Claudius',
  'Gertrude',
  'Polonius',
  'Laertes',
  'Ophelia',
  'Ghost',
  'Fortinbras',
  'Gravedigger',
  'Osric',
  'Player King',
  'Player Queen',
  'Lucianus',
  'Prologue',
  'Marcellus',
  'Barnardo',
  'Francisco',
  'Reynaldo',
  'Voltemand',
  'Cornelius',
  'Rosencrantz',
  'Guildenstern',
  'Sailor',
  'Gentleman',
  'Messenger',
  'Ambassador',
  'Captain',
  'Lord',
  'Doctor of Divinity'],
 'locations': ['Denmark', 'Elsinore', 'Poland', 'Norway', 'England', 'France']}
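
Parsing successfully only tells us the output is JSON, not that it has the shape a downstream system expects. A minimal sketch of a shape check follows. `validate_script_entities` is a hypothetical helper, and the required keys are taken from the system prompt used in this section.

```python
def validate_script_entities(result: dict) -> dict:
    """Check that the parsed output has the two list-of-strings keys the prompt asked for."""
    for key in ("characters", "locations"):
        if key not in result:
            raise ValueError(f"Missing key: {key}")
        value = result[key]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"Key {key!r} should be a list of strings")
    return result


# Illustrative inputs, not real model output.
good = {"characters": ["Hamlet", "Horatio"], "locations": ["Denmark"]}
bad = {"characters": "Hamlet", "locations": ["Denmark"]}

print(validate_script_entities(good))

try:
    validate_script_entities(bad)
except ValueError as exc:
    print(f"rejected: {exc}")
```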

## YAML

Let's make this a little more interesting by doing it a different way.  We are now going to output YAML, again creating a custom class rather than relying on the LangChain `YamlOutputParser`, which, like the JSON parser, has a hard time when the LLM returns text it shouldn't (even though we asked nicely).  This time we also define a class with Pydantic to spell out exactly what we are looking for in the output.

In [25]:
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


class CustomYamlOutputParser(BaseOutputParser):
    pydantic_object: type[BaseModel] = Field(...)

    def clean_yaml_text(self, text: str) -> str:
        # Strip surrounding whitespace first so the fence patterns can anchor on the text edges.
        text = text.strip()
        # Remove markdown code fences and optional 'yaml' language tag.
        text = re.sub(r"^```(?:yaml)?\s*", "", text)
        text = re.sub(r"\s*```$", "", text)
        return text.strip()

    def parse(self, text: str):
        try:
            clean_text = self.clean_yaml_text(text)
            data = yaml.safe_load(clean_text)
            # Assumes Pydantic v2, where model_validate replaces the deprecated parse_obj.
            return self.pydantic_object.model_validate(data)
        except Exception as e:
            raise ValueError(f"Error parsing YAML: {e}") from e

    def get_format_instructions(self) -> str:
        return "Output should be valid YAML corresponding to the pydantic model."
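
To see what the fence-stripping step does in isolation, here is a standalone version using only `re`. It assumes the model wraps its YAML in a Markdown code fence, which is a common habit but not guaranteed. `strip_code_fence` is a hypothetical name used for illustration, and the fence is built from single backticks only so the example stays readable here.

```python
import re

FENCE = "`" * 3  # three backticks


def strip_code_fence(text: str) -> str:
    """Remove a leading fence (with optional 'yaml' tag) and a trailing fence, if present."""
    text = text.strip()
    text = re.sub(r"^`{3}(?:yaml)?\s*", "", text)
    text = re.sub(r"\s*`{3}$", "", text)
    return text.strip()


# Illustrative responses, not real model output.
plain = 'setup: "Why did the robot cross the road?"\npunchline: "To get to the other side!"'
fenced = f"{FENCE}yaml\n{plain}\n{FENCE}"

print(strip_code_fence(fenced))
print(strip_code_fence(plain) == plain)
```

Text without a fence passes through unchanged, so the cleaning step is safe to apply to every response before handing it to the YAML loader.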


In [26]:
system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a comedian. When given a subject, create a joke in YAML format with exactly two keys:
    setup: The setup or question for the joke.
    punchline: The punchline that resolves the joke.
    For example, if the subject is robots, your output should look exactly like:
    setup: "Why did the robot cross the road?"
    punchline: "To get to the other side!"
    Output ONLY the YAML and nothing else.
    """
)

human_prompt = HumanMessagePromptTemplate.from_template(
    "Create a joke about {subject}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name='us-west-2',
    temperature=0.0,
)

chain = prompt | llm | CustomYamlOutputParser(pydantic_object=Joke)

In [27]:
chain.invoke({"subject": "robots"})

Error raised by bedrock service
Traceback (most recent call last):
  File "/Users/skkakollu/genai_foundations/env/lib/python3.12/site-packages/langchain_aws/llms/bedrock.py", line 935, in _prepare_input_and_invoke
    response = self.client.invoke_model(**request_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/skkakollu/genai_foundations/env/lib/python3.12/site-packages/botocore/client.py", line 570, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/skkakollu/genai_foundations/env/lib/python3.12/site-packages/botocore/context.py", line 123, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/skkakollu/genai_foundations/env/lib/python3.12/site-packages/botocore/client.py", line 1031, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ExpiredTokenExceptio

ClientError: An error occurred (ExpiredTokenException) when calling the InvokeModel operation: The security token included in the request is expired

Note that the error above has nothing to do with parsing: `ExpiredTokenException` means the AWS credentials for the session had expired when this cell ran.  With fresh credentials, the chain should return a `Joke` object with `setup` and `punchline` fields.

## Bottom line

LangChain offers many different output parsers.  However, they are only as good as what comes out of the LLM.  As we have seen here, the LLM output frequently does not line up with what these strict parsers expect as input.  So if you are dealing with output beyond something very basic, a custom parser is often the way to go.