LLMs that follow prompt instructions well can be asked to output information in a given format.

This approach relies on designing a good prompt and then parsing the LLM's output to extract the requested information.
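The two steps (prompt, then parse) can be sketched without any framework. This is a minimal illustration, not how LangChain implements it: `SCHEMA_HINT`, `build_prompt`, `fake_llm`, and `parse_reply` are hypothetical names, and `fake_llm` returns a canned reply instead of calling a real model.

```python
import json

# Hypothetical, hand-written description of the expected JSON shape.
SCHEMA_HINT = '{"people": [{"name": "<string>", "height_in_meters": <number>}]}'


def build_prompt(text: str) -> str:
    """Combine the format instructions with the text to extract from."""
    return (
        "Extract every person mentioned in the text. "
        f"Reply with JSON only, shaped like: {SCHEMA_HINT}\n\n"
        f"Text: {text}"
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same canned reply."""
    return '{"people": [{"name": "Anna", "height_in_meters": 1.83}]}'


def parse_reply(reply: str) -> dict:
    """Parse the reply as JSON and check its top-level shape."""
    data = json.loads(reply)
    if not (isinstance(data, dict) and isinstance(data.get("people"), list)):
        raise ValueError(f"Unexpected reply shape: {reply!r}")
    return data


result = parse_reply(fake_llm(build_prompt("Anna is 6 feet tall")))
print(result["people"][0]["name"])  # Anna
```

With a real model, the parse step is where most failures surface, so it is worth raising a clear error there.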

Here, we'll use GigaChat, a chat model that follows instructions well.

In [1]:
from langchain_community.chat_models.gigachat import GigaChat

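llm = GigaChat(
    verify_ssl_certs=False,  # skips TLS certificate verification; avoid in production
    timeout=6000,
    model="GigaChat-Pro",
    temperature=0.01,  # near-deterministic output, which suits extraction
)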

:::{.callout-tip}
All the same considerations for extraction quality apply to the parsing approach. Review the [guidelines](/docs/use_cases/extraction/guidelines) for extraction quality.

This tutorial is meant to be simple, but in practice you should include reference examples to squeeze out performance!
:::
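As a rough sketch of what reference examples could look like, the snippet below builds `(role, content)` message tuples, the form accepted by `ChatPromptTemplate.from_messages`, which could be placed between the system and human messages. The example texts are invented for illustration, and `escape_braces` is a hypothetical helper: literal curly braces in a format-style template have to be doubled so they are not read as template variables.

```python
import json


def escape_braces(text: str) -> str:
    """Double curly braces so a format-style template treats them as literals."""
    return text.replace("{", "{{").replace("}", "}}")


# Invented reference examples: (input text, expected extraction).
examples = [
    ("Bob is 2 meters tall.", {"people": [{"name": "Bob", "height_in_meters": 2.0}]}),
    ("The weather was nice.", {"people": []}),
]

# Alternate human inputs and the AI replies we would like the model to imitate.
example_messages = []
for text, expected in examples:
    example_messages.append(("human", escape_braces(text)))
    example_messages.append(("ai", escape_braces(json.dumps(expected))))

print(example_messages[1][1])
```

Including an example with an empty result, as in the second pair, shows the model what to return when the text mentions no people.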

## Using PydanticOutputParser

The following example uses the built-in `PydanticOutputParser` to parse the output of a chat model.

In [2]:
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


class Person(BaseModel):
    """Информация о человеке."""  # "Information about a person."

    # Field descriptions: "Person's name" and "Person's height in meters".
    name: str = Field(..., description="Имя человека")
    height_in_meters: float = Field(..., description="Высота человека в метрах")


class People(BaseModel):
    """Выдели всю информацию о людях в тексте."""  # "Extract all information about people in the text."

    people: List[Person]


# Set up a parser
parser = PydanticOutputParser(pydantic_object=People)

# Prompt. The system message is Russian for "Answer the user's query in JSON format."
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Ответь на запрос пользователя в формате JSON. Schema Information: \n{format_instructions}",
        ),
        ("human", "{query}"),
    ]
).partial(format_instructions=parser.get_format_instructions())

Let's take a look at what information is sent to the model:

In [3]:
query = "Anna is 23 years old and she is 6 feet tall"

In [4]:
print(prompt.format_prompt(query=query).to_string())

System: Ответь на запрос пользователя в формате JSON. Schema Information: 
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Выдели всю информацию о людях в тексте.", "properties": {"people": {"title": "People", "type": "array", "items": {"$ref": "#/definitions/Person"}}}, "required": ["people"], "definitions": {"Person": {"title": "Person", "description": "Информация о человеке.", "type": "object", "properties": {"name": {"title": "Name", "description": "Имя человека", "type": "string"}, "height_in_meters": {"title": "Height In Meters", "description": "Высота человека в метр

In [5]:
chain = prompt | llm | parser
chain.invoke({"query": query})

People(people=[Person(name='Anna', height_in_meters=1.8)])
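Note that the query gives the height in feet while the schema asks for meters, so the model had to convert units. A quick check of the arithmetic suggests that the returned `1.8` is a rounded value:

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot

height_m = 6 * FEET_TO_METERS
print(round(height_m, 4))  # 1.8288
```

If that precision matters for your use case, it may be safer to extract the value and unit as written and do the conversion in code.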

## Custom Parsing

It's easy to create a custom prompt and parser with `LangChain` and `LCEL`.

You can use a simple function to parse the output from the model!

In [6]:
import json
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain_core.messages import AIMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


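class Person(BaseModel):
    """Информация о человеке."""  # "Information about a person."

    # Field descriptions: "Person's name" and "Person's height in meters".
    name: str = Field(..., description="Имя человека")
    height_in_meters: float = Field(..., description="Высота человека в метрах")


class People(BaseModel):
    """Выдели всю информацию о людях в тексте."""  # "Extract all information about people in the text."

    people: List[Person]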


# Set up a parser
parser = PydanticOutputParser(pydantic_object=People)

# Prompt. The system message is Russian for "Answer the user's query in JSON format."
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Ответь на запрос пользователя в формате JSON. Schema Information: \n{format_instructions}",
        ),
        ("human", "{query}"),
    ]
).partial(format_instructions=parser.get_format_instructions())


# Custom parser
def extract_json(message: AIMessage) -> dict:
    """Parse the message content as JSON."""
    text = message.content

    # The model is expected to reply with bare JSON, so parse the whole reply.
    try:
        return json.loads(text)
    except Exception as exc:
        raise ValueError(f"Failed to parse: {message}") from exc

In [7]:
query = "Anna is 23 years old and she is 6 feet tall"
print(prompt.format_prompt(query=query).to_string())

System: Ответь на запрос пользователя в формате JSON. Schema Information: 
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Выдели всю информацию о людях в тексте.", "properties": {"people": {"title": "People", "type": "array", "items": {"$ref": "#/definitions/Person"}}}, "required": ["people"], "definitions": {"Person": {"title": "Person", "description": "Информация о человеке.", "type": "object", "properties": {"name": {"title": "Name", "description": "Имя человека", "type": "string"}, "height_in_meters": {"title": "Height In Meters", "description": "Высота человека в метр

In [8]:
chain = prompt | llm | extract_json
chain.invoke({"query": query})

{'people': [{'name': 'Anna', 'height_in_meters': 1.8}]}
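Models often wrap JSON in a Markdown code fence or add surrounding text, in which case a plain `json.loads` on the whole reply fails. A more tolerant parser could look like the sketch below. It works on plain strings so that it runs without LangChain, and `extract_json_blocks` is a hypothetical name, not a library function.

```python
import json
import re
from typing import List

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker

# Match the body of a fenced block, with or without a "json" language tag.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_blocks(text: str) -> List[dict]:
    """Parse every fenced JSON block in text, or the whole text if none is fenced."""
    matches = FENCED_BLOCK.findall(text)
    candidates = matches if matches else [text]
    try:
        return [json.loads(candidate.strip()) for candidate in candidates]
    except json.JSONDecodeError as exc:
        raise ValueError(f"Failed to parse: {text!r}") from exc


reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"people": [{"name": "Anna", "height_in_meters": 1.8}]}\n'
    + FENCE
)
print(extract_json_blocks(reply))
```

To use something like this in the chain, the function would take the `AIMessage` and read `message.content` first, as `extract_json` does.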

## Other Libraries

If you're looking at extracting using a parsing approach, check out the [Kor](https://eyurtsev.github.io/kor/) library. It's written by one of the `LangChain` maintainers, and it helps to craft a prompt that takes examples into account, allows controlling formats (e.g., JSON or CSV), and expresses the schema in TypeScript. It seems to work pretty well!