In this quick start, we will use [chat models](/docs/modules/model_io/chat/) that are capable of **function/tool calling** to extract information from text.

:::{.callout-important}
Extraction using **function/tool calling** only works with [models that support **function/tool calling**](/docs/modules/model_io/chat/function_calling).
:::

## Set up

We will use the [structured output](/docs/modules/model_io/chat/structured_output) method available on LLMs that are capable of **function/tool calling**. 

Select a model, install its dependencies, and set up the API keys!

In [None]:
!pip install gigachain

# Install a model capable of tool calling
# pip install gigachain-openai
# pip install gigachain-mistralai
# pip install gigachain-fireworks

# Set env vars for the relevant model or load from a .env file:
# import dotenv
# dotenv.load_dotenv()
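
Before running the cells below, it can help to confirm that the credentials are actually visible to the process. The following is a minimal sketch using only the standard library. The variable name `GIGACHAT_CREDENTIALS` and the helper `missing_env_vars` are assumptions made for illustration, so substitute whatever your chosen model provider documents.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


# Assumed variable name; replace with the one your provider documents.
required = ["GIGACHAT_CREDENTIALS"]
missing = missing_env_vars(required, os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing the environment in as an argument keeps the check easy to test with a plain dictionary.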

## The Schema

First, we need to describe what information we want to extract from the text.

We'll use Pydantic to define an example schema to extract personal information.

In [16]:
from typing import Optional

from langchain_core.pydantic_v1 import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ The docstring above is passed into the function description
    # and can help improve the results returned by the LLM.

    # Note:
    # 1. Every field is `Optional`, which lets the LLM skip fields
    #    that the text does not describe.
    # 2. Every field has a `description`, which is passed into the description
    #    of the function arguments and can help improve the results.
    name: Optional[str] = Field(..., description="The person's name")
    hair_color: Optional[str] = Field(
        ..., description="The person's hair color, fill in if known"
    )
    height_in_meters: Optional[float] = Field(
        ..., description="The person's height in meters."
    )

There are two best practices when defining a schema:

1. Document the **attributes** and the **schema** itself: This information is sent to the LLM and is used to improve the quality of information extraction.
2. Do not force the LLM to make up information! Above, we used `Optional` for the attributes, allowing the LLM to output `None` if it doesn't know the answer.

:::{.callout-important}
For best performance, document the schema well and make sure the model isn't forced to return results if there's no information to be extracted in the text.
:::
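
To make the first point concrete, here is a rough, standard-library-only sketch of how a schema's docstring and field descriptions could be collected into a function description. The `describe_schema` helper and the shape of the resulting dictionary are hypothetical. The real payload is built by the library and will differ.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Person:
    """Information about a person."""

    name: Optional[str] = field(
        default=None, metadata={"description": "The person's name"}
    )
    hair_color: Optional[str] = field(
        default=None, metadata={"description": "The person's hair color, if known"}
    )
    height_in_meters: Optional[float] = field(
        default=None, metadata={"description": "The person's height in meters"}
    )


def describe_schema(cls) -> dict:
    """Collect the class docstring and field descriptions (illustrative shape)."""
    return {
        "name": cls.__name__,
        "description": (cls.__doc__ or "").strip(),
        "parameters": {f.name: f.metadata.get("description", "") for f in fields(cls)},
    }


person_description = describe_schema(Person)
print(person_description)
```

An undocumented field would show up here with an empty description, which is exactly the information the model would be missing.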

## The Extractor

Let's create an information extractor using the schema we defined above.

In [17]:
from typing import Optional

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field

# Define the prompt: add instructions and extra context.
# Here we can:
# 1) Add examples of function calls to improve the quality of extraction
# 2) Provide additional information about the source text and what is extracted from it
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at extracting information from text. "
            "Extract only relevant information from the text. "
            "If you do not know the value of an attribute "
            "that you are asked to extract, set the attribute to null.",
        ),
        # The notebook with examples shows how to add examples of function calls
        # to improve the results
        # MessagesPlaceholder('examples'),
        ("human", "{text}"),
    ]
)
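
The template above has a single slot, `{text}`, that is filled in at call time. As a plain-Python illustration (not the library's implementation), the rendered conversation can be thought of as a list of role/content pairs. `render_messages` is a hypothetical helper, and the system message is given here in English.

```python
from typing import List, Tuple

SYSTEM_MESSAGE = (
    "You are an expert at extracting information from text. "
    "Extract only relevant information from the text. "
    "If you do not know the value of an attribute "
    "that you are asked to extract, set the attribute to null."
)


def render_messages(text: str) -> List[Tuple[str, str]]:
    """Fill the single {text} slot and return (role, content) pairs."""
    template = [("system", SYSTEM_MESSAGE), ("human", "{text}")]
    return [(role, content.format(text=text)) for role, content in template]


messages = render_messages("Alan Smith is blond and 1.85 meters tall")
for role, content in messages:
    print(f"{role}: {content}")
```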

We need to use a model that supports function/tool calling.

Please review [structured output](/docs/modules/model_io/chat/structured_output) for a list of some models that can be used with this API.

In [18]:
from langchain_community.chat_models.gigachat import GigaChat

llm = GigaChat(
    timeout=6000,
    model="GigaChat-Pro",
    temperature=0.01,
)

runnable = prompt | llm.with_structured_output(schema=Person)

Let's test it out:

In [19]:
text = "Alan Smith is blond and 1.85 meters tall"
runnable.invoke({"text": text})

Giga generation stopped with reason: function_call


Person(name='Alan Smith', hair_color='blond', height_in_meters=1.85)

:::{.callout-important} 

Extraction is Generative 🤯

LLMs are generative models, so they can do some pretty cool things, such as returning a person's height in meters even when the text gives it in feet.
In the example above the height was already stated in meters, so no conversion was needed.
:::
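
When a model converts units on its own, it is worth checking the arithmetic. Here is a small sketch of such a check, assuming a hypothetical source text that states a height of 6 feet (one international foot is exactly 0.3048 meters). The helper names are illustrative.

```python
import math

METERS_PER_FOOT = 0.3048  # exact by definition of the international foot


def feet_to_meters(feet: float) -> float:
    """Convert a length in feet to meters."""
    return feet * METERS_PER_FOOT


def is_plausible_height(
    extracted_m: float, source_feet: float, tol: float = 0.01
) -> bool:
    """Check that an extracted height matches the source within `tol` meters."""
    return math.isclose(extracted_m, feet_to_meters(source_feet), abs_tol=tol)


print(round(feet_to_meters(6.0), 2))
print(is_plausible_height(1.83, 6.0))
```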

## Multiple Entities

In **most cases**, you should be extracting a list of entities rather than a single entity.

This can be easily achieved using Pydantic by nesting models inside one another.

In [20]:
from typing import List, Optional

from langchain_core.pydantic_v1 import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # ^ The docstring above is passed into the function description
    # and can help improve the results returned by the LLM.

    # Note:
    # 1. Every field is `Optional`, which lets the LLM skip fields
    #    that the text does not describe.
    # 2. Every field has a `description`, which is passed into the description
    #    of the function arguments and can help improve the results.
    name: Optional[str] = Field(..., description="The person's name")
    hair_color: Optional[str] = Field(
        ..., description="The person's hair color, fill in if known"
    )
    height_in_meters: Optional[float] = Field(
        ..., description="The person's height in meters."
    )


class Data(BaseModel):
    """Information about people."""

    # Create a wrapper model so that we can extract information about several people
    people: List[Person]

:::{.callout-important}
Extraction might not be perfect here. Read on to see how to use **Reference Examples** to improve the quality of extraction, and see the **guidelines** section!
:::

In [22]:
runnable = prompt | llm.with_structured_output(schema=Data)
text = (
    "My name is Joe, my hair is black and I am 1.75 meters tall. "
    "Anna has the same hair as me and she is 10 centimeters shorter than me."
)
runnable.invoke({"text": text})

Giga generation stopped with reason: function_call


Data(people=[Person(name='Joe', hair_color='black', height_in_meters=1.75), Person(name='Anna', hair_color='black', height_in_meters=1.65)])
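
The nested result mirrors the nested schema. As a rough illustration of that shape, the sketch below parses a hand-written JSON payload into nested dataclasses using only the standard library. The payload is loosely modeled on the output above with names in English, and `parse_data` is a hypothetical helper, not part of the library.

```python
import json
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[float] = None


@dataclass
class Data:
    people: List[Person]


def parse_data(payload: str) -> Data:
    """Turn a JSON object with a `people` list into nested dataclasses."""
    raw = json.loads(payload)
    return Data(people=[Person(**item) for item in raw.get("people", [])])


payload = """
{"people": [
  {"name": "Joe", "hair_color": "black", "height_in_meters": 1.75},
  {"name": "Anna", "hair_color": "black", "height_in_meters": 1.65}
]}
"""
data = parse_data(payload)
joe, anna = data.people

# The text says Anna is 10 centimeters shorter than Joe; verify the difference.
assert math.isclose(joe.height_in_meters - anna.height_in_meters, 0.10, abs_tol=1e-9)
print([person.name for person in data.people])
```

Note that Anna's height is never stated in the text. The model has to derive it from Joe's, which is why a cheap consistency check like the one above can be useful.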

:::{.callout-tip}
When the schema accommodates the extraction of **multiple entities**, it also allows the model to extract **no entities** if no relevant information
is in the text by providing an empty list. 

This is usually a **good** thing! It allows specifying **required** attributes on an entity without necessarily forcing the model to detect this entity.
:::
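
Here is a small standard-library sketch of that point: `name` is required on each entity, yet an empty `people` list is still a valid result. The class names echo the schema above but use dataclasses purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    name: str  # required: every extracted person must have a name
    hair_color: Optional[str] = None


@dataclass
class Data:
    people: List[Person] = field(default_factory=list)


# No relevant information in the text: the wrapper is valid with zero entities.
nothing_found = Data(people=[])

# A person without a name cannot be constructed at all.
try:
    Person()  # type: ignore[call-arg]
    name_was_required = False
except TypeError:
    name_was_required = True

print(len(nothing_found.people), name_was_required)
```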

## Next steps

Now that you understand the basics of extraction with LangChain, you're ready to proceed to the rest of the how-to guide:

- [Add Examples](/docs/use_cases/extraction/how_to/examples): Learn how to use **reference examples** to improve performance.
- [Handle Long Text](/docs/use_cases/extraction/how_to/handle_long_text): What should you do if the text does not fit into the context window of the LLM?
- [Handle Files](/docs/use_cases/extraction/how_to/handle_files): Examples of using LangChain document loaders and parsers to extract from files like PDFs.
- [Use a Parsing Approach](/docs/use_cases/extraction/how_to/parse): Use a prompt-based approach to extract with models that do not support **tool/function calling**.
- [Guidelines](/docs/use_cases/extraction/guidelines): Guidelines for getting good performance on extraction tasks.