#### Kor is a thin wrapper on top of LLMs that helps extract structured data from text.

In [5]:
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number
from langchain_openai import ChatOpenAI
import os
from dotenv import load_dotenv

load_dotenv()

True

To use Kor, specify the schema of what should be extracted and, optionally, provide some extraction examples.

In [2]:
schema = Object(
    id="person",
    description="Personal information",
    attributes=[
        Text(
            id="first_name",
            description="The first name of a person.",
        )
    ],
    examples=[
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ],
    many=True,
)

The schema above consists of a single object node which contains a single text attribute called first_name.

The object can be repeated many times, so if the text contains multiple first names, multiple objects will be extracted.

As part of the schema, we specified a description of what we’re extracting, as well as one example.

Including both a description and examples will likely improve performance.

In [11]:
llm = ChatOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="gpt-4o-mini",
    temperature=0,
    max_tokens=2000,
)

In [12]:
chain = create_extraction_chain(llm, schema)

Extract: With a chain and a schema defined, we’re ready to extract data.

In [13]:
chain.invoke(("My name is Bobby. My brother's name Joe."))

{'data': {'person': [{'first_name': 'Bobby'}, {'first_name': 'Joe'}]},
 'raw': 'first_name\nBobby\nJoe',
 'errors': [],
 'validated_data': {}}

In [14]:
chain.invoke(("My name is Bobby. My brother's name Joe."))["data"]

{'person': [{'first_name': 'Bobby'}, {'first_name': 'Joe'}]}
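The result is a plain dictionary, so it can be checked before use. The snippet below is a minimal sketch that works on the result copied from the cell above and makes no LLM call; `extract_records` is a hypothetical helper name, not part of Kor.

```python
from typing import Any, Dict, List


def extract_records(result: Dict[str, Any], key: str) -> List[Dict[str, Any]]:
    """Return the records under result["data"][key]; raise if errors were reported."""
    errors = result.get("errors") or []
    if errors:
        raise ValueError(f"extraction reported errors: {errors}")
    records = (result.get("data") or {}).get(key, [])
    # A schema without many=True may yield a single dict; normalise to a list.
    if isinstance(records, dict):
        return [records]
    return list(records)


# Result copied from the cell above.
result = {
    "data": {"person": [{"first_name": "Bobby"}, {"first_name": "Joe"}]},
    "raw": "first_name\nBobby\nJoe",
    "errors": [],
    "validated_data": {},
}
names = [record["first_name"] for record in extract_records(result, "person")]
print(names)  # ['Bobby', 'Joe']
```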

And here’s the actual prompt that was sent to the LLM.

In [15]:
print(chain.get_prompts()[0].format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

person: Array<{ // Personal information
 first_name: string // The first name of a person.
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: Alice and Bob are friends
Output: first_name
Alice
Bob

Input: [user input]
Output:
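The prompt asks the model to answer in pipe-delimited CSV with a header row, which is the text that shows up in the `raw` field. As a rough illustration of how such text maps back to records, here is a sketch using the standard-library `csv` module. It is not Kor's actual decoder, and `parse_pipe_csv` is a hypothetical helper name.

```python
import csv
import io
from typing import Dict, List


def parse_pipe_csv(raw: str) -> List[Dict[str, str]]:
    """Parse header-first, pipe-delimited text into one dict per row."""
    reader = csv.DictReader(io.StringIO(raw), dialect="excel", delimiter="|")
    return [dict(row) for row in reader]


# The `raw` value returned by the earlier chain.invoke call.
rows = parse_pipe_csv("first_name\nBobby\nJoe")
print(rows)  # [{'first_name': 'Bobby'}, {'first_name': 'Joe'}]

# An illustrative multi-column row (made up for this sketch); empty cells become ''.
print(parse_pipe_csv("first_name|last_name|age\nJohn||"))
```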


In [16]:
from kor import from_pydantic
from typing import List, Optional
from pydantic import BaseModel, Field

In [17]:
class Person(BaseModel):
    first_name: str = Field(description="The first name of a person")

In [18]:
schema, validator = from_pydantic(
    Person,
    description="Personal Information",  
    examples=[  
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ],
    many=True, 
)

# Unlike with Kor's Object node, we don't need to list the attributes to extract here: the Pydantic class already defines all the fields.

In [19]:
chain = create_extraction_chain(llm, schema, validator=validator)

In [21]:
chain.invoke(("My name is Bobby. My brother's name Joe."))['data']

{'person': [{'first_name': 'Bobby'}, {'first_name': 'Joe'}]}

In [22]:
# Kor example with Object

schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.",
            [
                {"first_name": "John", "last_name": "Smith", "age": 23},
                {"first_name": "Jane", "last_name": "Doe", "age": 5},
            ],
        )
    ],
    many=True,
)


chain = create_extraction_chain(llm, schema)

In [23]:
print(chain.get_prompts()[0].format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

personal_info: Array<{ // Personal information about a given person.
 first_name: string // The first name of the person
 last_name: string // The last name of the person
 age: number // The age of the person in years.
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.
Output: first_name|last_name|age
John|Smith|23
Jane|Doe|5

Input: John Smith went to the store
Output: first_name|last_name|age
John||

Input: 

In [24]:
print(
    chain.invoke(
        "My name is Bob Alice and my phone number is (123)-444-9999. I found my true love one"
        " on a blue sunday. Her number was (333)1232832. Her name was Moana Sunrise and she was 10 years old."
    )["data"]
)

{'personal_info': [{'first_name': 'Bob', 'last_name': 'Alice', 'age': '0'}, {'first_name': 'Moana', 'last_name': 'Sunrise', 'age': '10'}]}
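Note that the ages come back as strings, and that Bob's age of '0' does not appear anywhere in the input. A post-processing step can convert the values to integers before further use. The snippet below is a minimal sketch on the output copied from the cell above; `to_int` and `coerce_ages` are hypothetical helper names, and deciding whether a value such as 0 is trustworthy is left to the caller.

```python
from typing import Any, Dict, List, Optional


def to_int(value: Any) -> Optional[int]:
    """Convert a value such as '10' to int; return None if it is not a whole number."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def coerce_ages(people: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the records with 'age' converted to int (or None)."""
    return [{**person, "age": to_int(person.get("age"))} for person in people]


# Output copied from the cell above.
data = {
    "personal_info": [
        {"first_name": "Bob", "last_name": "Alice", "age": "0"},
        {"first_name": "Moana", "last_name": "Sunrise", "age": "10"},
    ]
}
people = coerce_ages(data["personal_info"])
print([person["age"] for person in people])  # [0, 10]
```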


Nested Objects:
Here, we’ll introduce an Address object which will be nested inside of the main schema.

In [25]:
from_address = Object(
    id="from_address",
    description="Person moved away from this address",
    attributes=[
        Text(id="street"),
        Text(id="city"),
        Text(id="state"),
        Text(id="zipcode"),
        Text(id="country", description="A country in the world; e.g., France."),
    ],
    examples=[
        (
            "100 Main St, Boston, MA, 23232, USA",
            {
                "street": "100 Main St",
                "city": "Boston",
                "state": "MA",
                "zipcode": "23232",
                "country": "USA",
            },
        )
    ],
)

to_address = from_address.replace(
    id="to_address", description="Address to which the person is moving"
)

schema = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The full name of the person or partial name",
            examples=[("John Smith was here", "John Smith")],
        ),
        from_address,
        to_address,
    ],
    many=True,
)

To use nested objects, at least for now, we have to switch to the JSON encoder.

Anecdotally, CSV encoding seems to produce more robust extraction results, so JSON encoding may perform worse even though it’s more flexible.

In [26]:
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", input_formatter=None
)

In [27]:
chain.invoke(
    "Alice Doe moved from New York to Boston, MA while Bob Smith did the opposite."
)["data"]

{'information': [{'person_name': 'Alice Doe',
   'from_address': {'street': '',
    'city': 'New York',
    'state': '',
    'zipcode': '',
    'country': ''},
   'to_address': {'street': '',
    'city': 'Boston',
    'state': 'MA',
    'zipcode': '',
    'country': ''}},
  {'person_name': 'Bob Smith',
   'from_address': {'street': '',
    'city': 'Boston',
    'state': 'MA',
    'zipcode': '',
    'country': ''},
   'to_address': {'street': '',
    'city': 'New York',
    'state': '',
    'zipcode': '',
    'country': ''}}]}
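Fields the text does not mention come back as empty strings. If that is inconvenient downstream, they can be stripped after extraction. The following is a minimal sketch on one record copied from the output above; `prune_empty` is a hypothetical helper, not a Kor function.

```python
from typing import Any


def prune_empty(value: Any) -> Any:
    """Recursively drop dict entries whose value is '', None, {} or []."""
    if isinstance(value, dict):
        cleaned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in ("", None, {}, [])}
    if isinstance(value, list):
        return [prune_empty(item) for item in value]
    return value


# First record copied from the output above.
record = {
    "person_name": "Alice Doe",
    "from_address": {"street": "", "city": "New York", "state": "", "zipcode": "", "country": ""},
    "to_address": {"street": "", "city": "Boston", "state": "MA", "zipcode": "", "country": ""},
}
pruned = prune_empty(record)
print(pruned)
# {'person_name': 'Alice Doe', 'from_address': {'city': 'New York'}, 'to_address': {'city': 'Boston', 'state': 'MA'}}
```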

Nested Lists: Let’s repeat the same schema as above, but let the address be a many=True field.

In [31]:
from_address = Object(
    id="from_address",
    description="Person moved away from this address",
    attributes=[
        Text(id="street"),
        Text(id="city"),
        Text(id="state"),
        Text(id="zipcode"),
        Text(id="country", description="A country in the world; e.g., France."),
    ],
    examples=[
        (
            "100 Main St, Boston, MA, 23232, USA",
            {
                "street": "100 Main St",
                "city": "Boston",
                "state": "MA",
                "zipcode": "23232",
                "country": "USA",
            },
        )
    ],
    many=True,  # <-- PLEASE NOTE THIS CHANGE
)

to_address = from_address.replace(
    id="to_address", description="Address to which the person is moving"
)

schema = Object(
    id="information",
    attributes=[
        Text(
            id="person_name",
            description="The full name of the person or partial name",
            examples=[("John Smith was here", "John Smith")],
        ),
        from_address,
        to_address,
    ],
    many=True,
)

In [32]:
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json")

In [33]:
chain.invoke(
    "Alice Doe and Bob Smith moved from New York to Boston. Bob later moved to LA."
)["data"]

{'information': [{'person_name': 'Alice Doe',
   'from_address': [{'street': '',
     'city': 'New York',
     'state': '',
     'zipcode': '',
     'country': ''}],
   'to_address': [{'street': '',
     'city': 'Boston',
     'state': '',
     'zipcode': '',
     'country': ''}]},
  {'person_name': 'Bob Smith',
   'from_address': [{'street': '',
     'city': 'New York',
     'state': '',
     'zipcode': '',
     'country': ''}],
   'to_address': [{'street': '',
     'city': 'Boston',
     'state': '',
     'zipcode': '',
     'country': ''}]},
  {'person_name': 'Bob Smith',
   'from_address': [{'street': '',
     'city': 'Boston',
     'state': '',
     'zipcode': '',
     'country': ''}],
   'to_address': [{'street': '',
     'city': 'LA',
     'state': '',
     'zipcode': '',
     'country': ''}]}]}
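In this output Bob Smith appears in two separate records, one per move. If one entry per person is preferred, the records can be grouped afterwards. The sketch below assumes every from/to pair within a record is one move and uses records trimmed to the city field for brevity; `moves_by_person` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def moves_by_person(records: List[Dict[str, Any]]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (from_city, to_city) pairs by person_name, keeping record order."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for record in records:
        for src in record.get("from_address", []):
            for dst in record.get("to_address", []):
                grouped[record["person_name"]].append((src.get("city", ""), dst.get("city", "")))
    return dict(grouped)


# Records from the output above, trimmed to the city field.
records = [
    {"person_name": "Alice Doe", "from_address": [{"city": "New York"}], "to_address": [{"city": "Boston"}]},
    {"person_name": "Bob Smith", "from_address": [{"city": "New York"}], "to_address": [{"city": "Boston"}]},
    {"person_name": "Bob Smith", "from_address": [{"city": "Boston"}], "to_address": [{"city": "LA"}]},
]
moves = moves_by_person(records)
print(moves)
# {'Alice Doe': [('New York', 'Boston')], 'Bob Smith': [('New York', 'Boston'), ('Boston', 'LA')]}
```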

Untyped Objects: It’s possible to provide only examples, without type information. Result quality may not suffer much, provided enough examples are added to compensate for the missing schema information. In the schema below, price_range declares no attributes and is described through examples alone.



In [37]:
# Natural language based APIs

schema = Object(
    id="action",
    description="User is looking for sports tickets",
    attributes=[
        Text(
            id="sport",
            description="which sports do you want to buy tickets for?",
            examples=[
                (
                    "I want to buy tickets to basketball and football games",
                    ["basketball", "football"],
                )
            ],
        ),
        Text(
            id="location",
            description="where would you like to watch the game?",
            examples=[
                ("in boston", "boston"),
                ("in france or italy", ["france", "italy"]),
            ],
        ),
        Object(
            id="price_range",
            description="how much do you want to spend?",
            attributes=[],
            examples=[
                ("no more than $100", {"price_max": "100", "currency": "$"}),
                (
                    "between 50 and 100 dollars",
                    {"price_max": "100", "price_min": "50", "currency": "$"},
                ),
            ],
        ),
    ],
)

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json")

In [38]:
chain.invoke("I want to buy tickets for a baseball game in LA area under $100")["data"]

{'action': {'sport': 'baseball',
  'location': 'LA area',
  'price_range': {'price_max': '100'}}}

In [39]:
chain.invoke(
    "I want to see a celtics game in boston somewhere between 20 and 40 dollars per ticket"
)["data"]

{'action': {'sport': 'basketball',
  'location': 'boston',
  'price_range': {'price_min': '20', 'price_max': '40', 'currency': '$'}}}
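Because price_range is untyped, its keys vary between results (the first result has no price_min or currency) and its values are strings. Before passing the result to a real ticket API, it may help to normalise it into a fixed shape. The snippet below is a minimal sketch on the two outputs above; `TicketQuery`, `as_list`, `to_float` and `build_query` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TicketQuery:
    sports: List[str]
    locations: List[str]
    price_min: Optional[float] = None
    price_max: Optional[float] = None
    currency: Optional[str] = None


def as_list(value: Any) -> List[Any]:
    """Wrap a single value in a list; treat None and '' as no value."""
    if value is None or value == "":
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def to_float(value: Any) -> Optional[float]:
    """Convert '100' to 100.0; return None for missing or non-numeric values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def build_query(action: Dict[str, Any]) -> TicketQuery:
    price = action.get("price_range") or {}
    return TicketQuery(
        sports=as_list(action.get("sport")),
        locations=as_list(action.get("location")),
        price_min=to_float(price.get("price_min")),
        price_max=to_float(price.get("price_max")),
        currency=price.get("currency"),
    )


# The two "action" results copied from the cells above.
first = build_query({"sport": "baseball", "location": "LA area", "price_range": {"price_max": "100"}})
second = build_query(
    {
        "sport": "basketball",
        "location": "boston",
        "price_range": {"price_min": "20", "price_max": "40", "currency": "$"},
    }
)
print(first)
print(second)
```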