# Module 2, Activity 3: Formatting and Structuring Model Output

We briefly looked at how to create structured output in a previous activity.  In this notebook we are going to look at this in more detail, along with how (and when) to create custom parsers.

In [None]:
import json
from pprint import pprint
import boto3
import re
import yaml

from langchain.chains import LLMChain
from langchain_aws import ChatBedrock, ChatBedrockConverse
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser, BaseOutputParser
from langchain_core.runnables import RunnableLambda
from langchain.output_parsers import YamlOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field

In [None]:
session = boto3.session.Session()
region = session.region_name

In [None]:
def get_data_from_s3(bucket_name, key):
    s3 = boto3.client(
        's3',
        region_name=region,
    )
    response = s3.get_object(Bucket=bucket_name, Key=key)
    data = response['Body'].read().decode('utf-8')

    return data

## Reviewing JSON parsers

We saw the `JsonOutputParser` in a previous activity.  If we can get the LLM to return well-formed JSON as a string, this parser is a simple way to turn that string into a Python object (a dictionary, in this case).  Be sure to read the system prompt, though, and observe how specific we are about what the output should contain.

In [None]:
# Note: a plain string, not an f-string. The template does its own `{variable}` substitution.
system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a helpful assistant. Please provide an analysis of the benefits of structured outputs in AI applications.
    Return ONLY your answer in valid JSON format with the following keys: 
    "summary": A brief summary of the analysis,
    "benefits": a list of benefits, and 
    "recommendations": a list of recommendations.
    """
)
human_prompt = HumanMessagePromptTemplate.from_template(
    "Respond to the question."
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

# Initialize the Bedrock LLM.
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name=region,  # Ensure that `region` is defined in your environment.
    temperature=0.0,
)

# Create the chain and invoke it.
chain = prompt | llm | JsonOutputParser()

In [None]:
chain.invoke({})

## Making it more complex

That was a pretty simple output, so `JsonOutputParser()` didn't have (or shouldn't have had!) too many problems.  That is why it was important to be specific in the prompt about returning "ONLY" valid JSON.  LLMs will frequently try to be "too helpful" and return additional text around the JSON.  And as the data or prompt gets more complicated, the LLM becomes more likely to add that kind of extra text, which can break this parser because it expects a very specific format.
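
To see why extra text matters, here is a minimal sketch that uses only the standard library `json` module rather than LangChain's parser.  The `clean` and `chatty` strings are made-up stand-ins for model output.  LangChain's `JsonOutputParser` may tolerate some of this (such as Markdown code fences), so treat this as an illustration of the failure mode, not of the parser's exact behavior.

```python
import json

# Hypothetical model outputs, invented for illustration.
clean = '{"summary": "Structured output is easier to consume.", "benefits": ["consistency"]}'
chatty = "Sure! Here is the JSON you asked for:\n" + clean


def try_parse(text):
    """Return the parsed object, or None if the text is not pure JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        print(f"Could not parse: {err}")
        return None


parsed_clean = try_parse(clean)    # parses into a dict
parsed_chatty = try_parse(chatty)  # fails: the text does not start with JSON
print(parsed_clean)
print(parsed_chatty)
```

The same JSON payload is present in both strings; only the friendly preamble makes the second one unparseable.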

So let's actually make it harder... 

In [None]:
s3_data = get_data_from_s3("bucket-test-cj", "hamlet.txt")
s3_data[0:200]

In [None]:
system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a helpful assistant. You will be provided a script.
    Identify the following in the script and output EXACTLY a valid JSON object with no additional text:
    {{"characters": [list of character names], "locations": [list of location names]}}
    For example, if the script mentions characters "Hamlet" and "Horatio" and locations "Denmark" and "Norway", output exactly:
    {{"characters": ["Hamlet", "Horatio"], "locations": ["Denmark", "Norway"]}}
    Output ONLY the JSON object and nothing else.
    """
)

human_prompt = HumanMessagePromptTemplate.from_template(
    "Here is the script: {script}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

# Initialize the Bedrock LLM.
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name=region,
    temperature=0.0,
)

# Create the chain and invoke it.
chain = prompt | llm | JsonOutputParser()

Now we are going to try to run all of this text through the chain, which is very likely to result in the LLM not following the system prompt to the letter.  Don't worry if you get an error when you run `chain.invoke(...)`.  Just check out what the error says.

In [None]:
chain.invoke({"script": s3_data})

## Custom parsers

You can create a new class that inherits from LangChain's `BaseOutputParser` to handle whatever life may throw at you in outputs.  Here is one that uses a regex to pull the JSON object out of the response and discard any text around it.

In [None]:
class CustomJsonOutputParser(BaseOutputParser):
    def parse(self, text: str):
        # Greedy match from the first '{' to the last '}' (DOTALL lets '.' span newlines).
        json_match = re.search(r'\{.*\}', text, re.DOTALL)
        if not json_match:
            raise ValueError("No JSON object found in the output.")
        json_str = json_match.group(0)
        try:
            return json.loads(json_str)
        except json.JSONDecodeError as e:
            raise ValueError(f"Error decoding JSON: {e}")

    def get_format_instructions(self) -> str:
        return "Output should be a valid JSON object."
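
One thing to keep in mind with a pattern like `\{.*\}` is that it is greedy, so a stray `}` later in the response ends up inside the match and `json.loads` then fails.  A possible alternative, sketched here with only the standard library, is to let `json.JSONDecoder.raw_decode` find where the first object ends.  The function name `extract_first_json_object` and the `noisy` string are hypothetical, chosen for illustration.

```python
import json


def extract_first_json_object(text: str):
    """Return the first decodable JSON object found in `text`."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever comes after it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This '{' did not begin valid JSON; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("No JSON object found in the output.")


# Hypothetical model output with trailing text that contains a stray brace.
noisy = 'Here you go: {"characters": ["Hamlet"], "locations": ["Elsinore"]} Hope that helps :}'
result = extract_first_json_object(noisy)
print(result)
```

This logic could be dropped into the `parse` method of a custom parser in place of the regex, at the cost of a slightly longer method.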

## Modifying the system prompt and chain

Notice that the system prompt includes a worked example, with doubled `{{` and `}}` so that the prompt template treats them as literal JSON braces rather than as a variable.  The prompt itself is unchanged from the previous section; what is new is that we have replaced the original parser with the custom one above in the chain.  This should make the chain more robust to any additional text the LLM decides to throw in (even though it has been told not to).
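
The brace doubling follows the same rule as Python's built-in `str.format`, which makes it a reasonable stand-in for illustrating how the braces are treated.  The sketch below uses plain `str.format` on a cut-down, made-up version of the prompt; it does not call LangChain at all.

```python
# A cut-down version of the prompt, formatted with plain str.format.
template = (
    'Output exactly: {{"characters": ["Hamlet"], "locations": ["Denmark"]}}\n'
    "Here is the script: {script}"
)

# Doubled braces come out as single literal braces; {script} is substituted.
rendered = template.format(script="Enter HAMLET and HORATIO.")
print(rendered)

# With single braces, the JSON example is read as a replacement field and fails.
broken_template = 'Output exactly: {"characters": []} Script: {script}'
try:
    broken_template.format(script="Enter HAMLET.")
    single_brace_failed = False
except (KeyError, ValueError, IndexError):
    single_brace_failed = True
print(single_brace_failed)
```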

In [None]:
system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a helpful assistant. You will be provided a script.
    Identify the following in the script and output EXACTLY a valid JSON object with no additional text:
    {{"characters": [list of character names], "locations": [list of location names]}}
    For example, if the script mentions characters "Hamlet" and "Horatio" and locations "Denmark" and "Norway", output exactly:
    {{"characters": ["Hamlet", "Horatio"], "locations": ["Denmark", "Norway"]}}
    Output ONLY the JSON object and nothing else.
    """
)

human_prompt = HumanMessagePromptTemplate.from_template(
    "Here is the script: {script}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

# Initialize the Bedrock LLM.
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name=region,
    temperature=0.0,
)

# Create the chain and invoke it.
chain = prompt | llm | CustomJsonOutputParser()

In [None]:
chain.invoke({"script": s3_data})

## YAML

Let's make this a little more interesting by doing it a different way.  We are now going to output YAML, again creating a custom class rather than using the LangChain `YamlOutputParser`, which, like the JSON parser, has a hard time when the LLM returns text it shouldn't (even though we asked nicely).  This time we also define a Pydantic class to describe exactly which fields we expect in the output.

In [None]:
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


class CustomYamlOutputParser(BaseOutputParser):
    pydantic_object: type[BaseModel] = Field(...)

    def clean_yaml_text(self, text: str) -> str:
        # Strip surrounding whitespace first so the anchored patterns below can match.
        text = text.strip()
        # Remove markdown code fences and optional 'yaml' language tag.
        text = re.sub(r"^```(?:yaml)?\s*", "", text)
        text = re.sub(r"\s*```$", "", text)
        return text.strip()

    def parse(self, text: str):
        try:
            clean_text = self.clean_yaml_text(text)
            data = yaml.safe_load(clean_text)
            return self.pydantic_object.parse_obj(data)
        except Exception as e:
            raise ValueError(f"Error parsing YAML: {e}")

    def get_format_instructions(self) -> str:
        return "Output should be valid YAML corresponding to the pydantic model."
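
The fence-stripping in `clean_yaml_text` anchors on the start and end of the text, so a sentence before or after the fence would survive into the YAML.  If that turns out to be a problem, one possible variation is to search for a fenced block anywhere in the response.  The sketch below is standalone and standard-library only; `FENCE_RE`, `extract_fenced_or_raw`, and the sample strings are hypothetical names and data, not part of LangChain.

```python
import re

# Matches a fenced block anywhere in the text, with an optional language tag.
FENCE_RE = re.compile(r"```[a-zA-Z]*\s*\n(.*?)```", re.DOTALL)


def extract_fenced_or_raw(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = FENCE_RE.search(text)
    if match:
        return match.group(1).strip()
    return text.strip()


# Hypothetical model outputs, invented for illustration.
fenced = (
    "Here is your joke:\n"
    "```yaml\n"
    'setup: "Why did the robot cross the road?"\n'
    'punchline: "To get to the other side!"\n'
    "```\n"
    "Enjoy!"
)
bare = '  setup: "a"\npunchline: "b"\n'

cleaned_fenced = extract_fenced_or_raw(fenced)
cleaned_bare = extract_fenced_or_raw(bare)
print(cleaned_fenced)
print(cleaned_bare)
```

The cleaned string could then be passed to `yaml.safe_load` exactly as in the `parse` method above.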


In [None]:
system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a comedian. When given a subject, create a joke in YAML format with exactly two keys:
    setup: The setup or question for the joke.
    punchline: The punchline that resolves the joke.
    For example, if the subject is robots, your output should look exactly like:
    setup: "Why did the robot cross the road?"
    punchline: "To get to the other side!"
    Output ONLY the YAML and nothing else.
    """
)

human_prompt = HumanMessagePromptTemplate.from_template(
    "Create a joke about {subject}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name=region,
    temperature=0.0,
)

chain = prompt | llm | CustomYamlOutputParser(pydantic_object=Joke)

In [None]:
chain.invoke({"subject": "robots"})

## Bottom line

There are many different output parsers for LangChain.  However, they are only as good as what comes out of the LLM.  As we have seen here, LLM output frequently does not line up with what these strict parsers expect as input.  When that happens, a custom parser is often the most practical fix.