### API Router



### Prompt Template

User input does not need to be passed directly into the LLM. The prompt template provides additional context on the specific task at hand.

Prompt templates are designed for single-turn interactions. A PromptTemplate typically accepts a single template string with placeholders that are filled in dynamically from the inputs.

```python
from langchain.prompts import PromptTemplate

template = "Translate the following English text to French: {text}"
prompt = PromptTemplate(input_variables=["text"], template=template)

filled_prompt = prompt.format(text="Hello, how are you?")
print(filled_prompt)
# Output: Translate the following English text to French: Hello, how are you?
```

### Chat Prompt Template

ChatPromptTemplate is used for multi-turn interactions or prompts designed for chat-based language models. It is composed of messages like SystemMessage, HumanMessage, and AIMessage. These represent the context, user input, and AI responses, respectively.

ChatPromptTemplate accepts messages in several formats, including predefined message classes (e.g., SystemMessage, HumanMessage) and (role, content) tuples. The message classes LangChain provides (SystemMessage, HumanMessage, AIMessage, and ChatMessage) are structured representations of the different roles in a chat.

```python
from langchain.prompts import ChatPromptTemplate
# Message classes live in langchain.schema, not langchain.prompts.
from langchain.schema import HumanMessage, SystemMessage

chat_prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Can you help me with my math homework?")
])
formatted_messages = chat_prompt.format_messages()
for msg in formatted_messages:
    print(f"{msg.type}: {msg.content}")
```

Instead of using the predefined message classes, you can represent messages as tuples, where the first element is the role (e.g., "system", "human", "assistant"), and the second element is the content of the message.

```python
from langchain.prompts import ChatPromptTemplate

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "Can you help me solve a quadratic equation?"),
    ("assistant", "Of course! What's the equation?")
])
formatted_messages = chat_prompt.format_messages()
for msg in formatted_messages:
    print(f"{msg.type}: {msg.content}")
```

### Chat Prompt Template and Message Placeholder

The MessagesPlaceholder in LangChain is a utility used within ChatPromptTemplate to insert dynamic chat histories or message sequences into a prompt. This is particularly useful when you want to incorporate conversation history dynamically without hardcoding all previous messages into the prompt.

MessagesPlaceholder is commonly used when:
- Maintaining Context: You want to include past messages in a chat.
- Dynamic History: The history may change or be trimmed based on application logic.
- Trimming: You can preprocess the chat_history (e.g., trim irrelevant messages) before passing it to the prompt.

Here’s how to use MessagesPlaceholder in a ChatPromptTemplate:

```python
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder, HumanMessagePromptTemplate

chat_prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder(variable_name="chat_history"),
    HumanMessagePromptTemplate.from_template("What do you think about {topic}?")
])
```

At runtime, you pass the actual chat history to replace the MessagesPlaceholder:

```python
from langchain.schema import AIMessage, HumanMessage

chat_history = [
    HumanMessage(content="Can you summarize quantum physics?"),
    AIMessage(content="Sure! Quantum physics deals with subatomic particles and wave-particle duality."),
]

formatted_messages = chat_prompt.format_messages(
    chat_history=chat_history,  # Replace placeholder
    topic="black holes"
)

for message in formatted_messages:
    print(f"{message.type}: {message.content}")
```
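The trimming step mentioned in the list above can be sketched in plain Python, independent of LangChain. The helper `trim_history` and the `(role, content)` tuple representation are assumptions made for illustration; in a real application the list would hold message objects and the trimmed result would be passed as `chat_history`.

```python
from typing import List, Tuple

# (role, content) pairs; a stand-in for real message objects.
Message = Tuple[str, str]


def trim_history(history: List[Message], max_messages: int) -> List[Message]:
    """Keep only the most recent `max_messages` entries (hypothetical helper)."""
    if max_messages <= 0:
        return []
    return history[-max_messages:]


history = [
    ("human", "Can you summarize quantum physics?"),
    ("ai", "Sure! Quantum physics deals with subatomic particles."),
    ("human", "And what about relativity?"),
    ("ai", "Relativity describes gravity and motion at high speeds."),
]

# Only the last exchange is kept before it is handed to the prompt.
recent = trim_history(history, max_messages=2)
for role, content in recent:
    print(f"{role}: {content}")
```

Trimming by message count is the simplest policy; an application could equally trim by token count or drop messages it considers irrelevant.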

### Chat Prompt Template and LCEL

PromptTemplate and ChatPromptTemplate implement the Runnable interface, the basic building block of the LangChain Expression Language (LCEL). This means they support the invoke, ainvoke, stream, astream, batch, abatch, and astream_log calls.

**PromptTemplate accepts a dictionary (of the prompt variables) and returns a StringPromptValue. A ChatPromptTemplate accepts a dictionary and returns a ChatPromptValue.**

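```python
from langchain_core.messages import SystemMessage
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    PromptTemplate,
)

prompt_template = PromptTemplate.from_template(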
    "Tell me a {adjective} joke about {content}."
)

prompt_val = prompt_template.invoke({"adjective": "funny", "content": "chickens"})
prompt_val # StringPromptValue(text='Tell me a funny joke about chickens.')
prompt_val.to_string() # 'Tell me a funny joke about chickens.'
prompt_val.to_messages() # [HumanMessage(content='Tell me a funny joke about chickens.')]

chat_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=(
                "You are a helpful assistant that re-writes the user's text to "
                "sound more upbeat."
            )
        ),
        HumanMessagePromptTemplate.from_template("{text}"),
    ]
)
chat_val = chat_template.invoke({"text": "i dont like eating tasty things."})
chat_val.to_messages()
# [SystemMessage(content="You are a helpful assistant that re-writes the user's text to sound more upbeat."),
#  HumanMessage(content='i dont like eating tasty things.')]
```

### Few Shot Prompt Templates and Example Selectors

The purpose of Example Selectors in LangChain is to dynamically choose the most relevant examples to include in a prompt. This is especially important when you have a large dataset of examples and cannot include all of them due to constraints like token limits or relevance to the current input. Example selectors ensure that the chosen examples are tailored to the specific input, improving the quality of responses from the language model.

**Large language models like GPT often perform better with few-shot learning, where the prompt includes input-output examples to demonstrate the task. Including irrelevant or too many examples can lead to suboptimal responses or exceed token limits.**

**Example Selectors dynamically choose the most relevant subset of examples based on the current input, ensuring optimal context is provided to the model.**

LangChain provides several prebuilt example selectors, and you can also implement custom ones. Common types include:
- Similarity Selector
    - Uses semantic similarity between inputs and examples to decide which examples to choose.
- Max Marginal Relevance (MMR) Selector
    - Balances relevance and diversity by selecting examples that are similar to the input but not too similar to each other.
- Length-Based Selector
    - Chooses examples that fit within a token limit or are of a similar length to the input.
- N-gram Overlap Selector
    - Selects examples based on shared n-grams between the input and examples.

The LengthBasedExampleSelector chooses which examples to use based on length. This is useful when you are worried about constructing a prompt that exceeds the context window. For longer inputs it selects fewer examples, and for shorter inputs it selects more.

```python
from langchain_core.example_selectors import LengthBasedExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

# Examples of a pretend task of creating antonyms.
examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

example_selector = LengthBasedExampleSelector(
    # The examples it has available to choose from.
    examples=examples,
    # The PromptTemplate being used to format the examples.
    example_prompt=example_prompt,
    # The maximum length that the formatted examples should be.
    # Length is measured by the get_text_length function below.
    max_length=25,
    # The function used to get the length of a string, which is used
    # to determine which examples to include. It is commented out because
    # it is provided as a default value if none is specified.
    # get_text_length: Callable[[str], int] = lambda x: len(re.split("\n| ", x))
)
dynamic_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)
```
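To make the selection rule concrete, here is a minimal plain-Python sketch of the idea, assuming examples are considered in order and the budget is counted in words with the same splitting rule as the commented-out `get_text_length` default above. The function `select_by_length` is hypothetical and is not LangChain's implementation.

```python
import re
from typing import Dict, List


def text_length(text: str) -> int:
    """Count words by splitting on spaces and newlines."""
    return len(re.split("\n| ", text))


def select_by_length(
    examples: List[Dict[str, str]], user_input: str, max_length: int
) -> List[Dict[str, str]]:
    """Add examples in order until the remaining word budget runs out."""
    remaining = max_length - text_length(user_input)
    selected = []
    for example in examples:
        formatted = f"Input: {example['input']}\nOutput: {example['output']}"
        remaining -= text_length(formatted)
        if remaining < 0:
            break
        selected.append(example)
    return selected


examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

# A one-word input leaves room for every example; a 20-word input does not.
short_pick = select_by_length(examples, "big", max_length=25)
long_pick = select_by_length(examples, " ".join(["big"] * 20), max_length=25)
print(len(short_pick), len(long_pick))  # 5 1
```

Each formatted example counts as four words here, so a budget of 25 fits all five examples after a one-word input but only one example after a 20-word input.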

The MaxMarginalRelevanceExampleSelector selects examples based on a combination of which examples are most similar to the inputs, while also optimizing for diversity. It does this by finding the examples with the embeddings that have the greatest cosine similarity with the inputs, and then iteratively adding them while penalizing them for closeness to already selected examples.

```python
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import (
    MaxMarginalRelevanceExampleSelector,
    SemanticSimilarityExampleSelector,
)
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

# Examples of a pretend task of creating antonyms.
examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_selector = MaxMarginalRelevanceExampleSelector.from_examples(
    # The list of examples available to select from.
    examples,
    # The embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class that is used to store the embeddings and do a similarity search over.
    FAISS,
    # The number of examples to produce.
    k=2,
)

mmr_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)

# Input is a feeling, so should select the happy/sad example as the first one
print(mmr_prompt.format(adjective="worried"))
# Output:
# Give the antonym of every input
#
# Input: happy
# Output: sad
#
# Input: windy
# Output: calm
#
# Input: worried
# Output:


# Let's compare this to what we would just get if we went solely off of similarity,
# by using SemanticSimilarityExampleSelector instead of MaxMarginalRelevanceExampleSelector.
example_selector = SemanticSimilarityExampleSelector.from_examples(
    # The list of examples available to select from.
    examples,
    # The embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class that is used to store the embeddings and do a similarity search over.
    FAISS,
    # The number of examples to produce.
    k=2,
)
similar_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)
print(similar_prompt.format(adjective="worried"))
# Output:
# Give the antonym of every input
#
# Input: happy
# Output: sad
#
# Input: sunny
# Output: gloomy
#
# Input: worried
# Output:
```
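The difference between the two selectors can also be seen without an embedding model. The sketch below is a plain-Python illustration of the MMR idea under stated assumptions: the two-dimensional vectors are invented for the example, and `mmr_select` and `lambda_mult` are hypothetical names, not LangChain's API.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_score(
    query: Vector,
    candidate: Vector,
    chosen: List[Vector],
    lambda_mult: float,
) -> float:
    """Relevance to the query minus a penalty for closeness to chosen items."""
    relevance = cosine(query, candidate)
    redundancy = max((cosine(candidate, c) for c in chosen), default=0.0)
    return lambda_mult * relevance - (1 - lambda_mult) * redundancy


def mmr_select(
    query: Vector, candidates: Dict[str, Vector], k: int, lambda_mult: float = 0.5
) -> List[str]:
    """Greedily pick k names, trading off relevance against diversity."""
    selected: List[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        chosen = [candidates[name] for name in selected]
        best = max(
            remaining,
            key=lambda name: mmr_score(query, remaining[name], chosen, lambda_mult),
        )
        selected.append(best)
        del remaining[best]
    return selected


# Made-up embeddings: "sunny" sits very close to "happy", "windy" points elsewhere.
embeddings = {"happy": (1.0, 0.1), "sunny": (1.0, 0.2), "windy": (1.0, -1.0)}
query = (1.0, 0.0)

print(mmr_select(query, embeddings, k=2))                   # ['happy', 'windy']
print(mmr_select(query, embeddings, k=2, lambda_mult=1.0))  # ['happy', 'sunny']
```

With `lambda_mult=1.0` the diversity penalty vanishes and the selection reduces to pure similarity ranking, which is why the second call picks the two near-duplicates.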

The NGramOverlapExampleSelector selects and orders examples based on which examples are most similar to the input, according to an ngram overlap score. The ngram overlap score is a float between 0.0 and 1.0, inclusive.

The selector accepts a threshold score. Examples with an ngram overlap score less than or equal to the threshold are excluded. The threshold defaults to -1.0, so by default no examples are excluded; they are only reordered. Setting the threshold to 0.0 excludes examples that have no ngram overlap with the input.
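The threshold behaviour can be illustrated with a simplified stand-in for the score. The sketch below assumes a unigram overlap ratio (shared words divided by the number of words in the input), which is not how LangChain computes its score; `overlap_score` and `select_by_overlap` are hypothetical names used only to show sorting and exclusion.

```python
import re
from typing import List, Set


def unigrams(text: str) -> Set[str]:
    """Lowercased set of words, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(source: str, example: str) -> float:
    """Fraction of the source's words that also appear in the example."""
    source_grams = unigrams(source)
    if not source_grams:
        return 0.0
    return len(source_grams & unigrams(example)) / len(source_grams)


def select_by_overlap(
    source: str, examples: List[str], threshold: float = -1.0
) -> List[str]:
    """Drop examples scoring at or below the threshold, then sort by score."""
    scored = [(overlap_score(source, example), example) for example in examples]
    kept = [pair for pair in scored if pair[0] > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in kept]


examples = ["See Spot run.", "My dog barks.", "Spot can run."]
sentence = "Spot can run fast."

print(select_by_overlap(sentence, examples))                 # reorders, keeps all three
print(select_by_overlap(sentence, examples, threshold=0.0))  # drops "My dog barks."
print(select_by_overlap(sentence, examples, threshold=1.5))  # []
```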

```python
from langchain_community.example_selectors.ngram_overlap import (
    NGramOverlapExampleSelector,
)
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

# Examples of a fictional translation task.
examples = [
    {"input": "See Spot run.", "output": "Ver correr a Spot."},
    {"input": "My dog barks.", "output": "Mi perro ladra."},
    {"input": "Spot can run.", "output": "Spot puede correr."},
]

example_selector = NGramOverlapExampleSelector(
    # The examples it has available to choose from.
    examples=examples,
    # The PromptTemplate being used to format the examples.
    example_prompt=example_prompt,
    # The threshold, at which selector stops.
    # It is set to -1.0 by default.
    threshold=-1.0,
    # For negative threshold:
    # Selector sorts examples by ngram overlap score, and excludes none.
    # For threshold greater than 1.0:
    # Selector excludes all examples, and returns an empty list.
    # For threshold equal to 0.0:
    # Selector sorts examples by ngram overlap score,
    # and excludes those with no ngram overlap with input.
)
dynamic_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the Spanish translation of every input",
    suffix="Input: {sentence}\nOutput:",
    input_variables=["sentence"],
)

# An example input with large ngram overlap with "Spot can run."
# and no overlap with "My dog barks."
print(dynamic_prompt.format(sentence="Spot can run fast."))
# Output:
# Give the Spanish translation of every input
#
# Input: Spot can run.
# Output: Spot puede correr.
#
# Input: See Spot run.
# Output: Ver correr a Spot.
#
# Input: My dog barks.
# Output: Mi perro ladra.
#
# Input: Spot can run fast.
# Output:
```

The SemanticSimilarityExampleSelector selects examples based on their similarity to the inputs. It does this by finding the examples whose embeddings have the greatest cosine similarity with the inputs.

```python
from langchain_chroma import Chroma
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)

# Examples of a pretend task of creating antonyms.
examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_selector = SemanticSimilarityExampleSelector.from_examples(
    # The list of examples available to select from.
    examples,
    # The embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(),
    # The VectorStore class that is used to store the embeddings and do a similarity search over.
    Chroma,
    # The number of examples to produce.
    k=1,
)
similar_prompt = FewShotPromptTemplate(
    # We provide an ExampleSelector instead of examples.
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)

# Input is a feeling, so should select the happy/sad example
print(similar_prompt.format(adjective="worried"))
# Output:
# Give the antonym of every input
#
# Input: happy
# Output: sad
#
# Input: worried
# Output:
```

### Few Shot Chat Message Prompt Template

FewShotPromptTemplate is a general-purpose template for few-shot prompting. Examples are stored as plain text templates or dictionaries and are formatted using a simple PromptTemplate. The template includes a prefix (instructions or context) and a suffix (the current input or question), and it can dynamically select the most relevant examples using an ExampleSelector (e.g., SemanticSimilarityExampleSelector).

FewShotChatMessagePromptTemplate is designed specifically for chat-based models (e.g., ChatGPT, Anthropic's models, or any model that uses chat messages). It formats examples as chat-style messages, distinguishing between the system, human, and ai roles: each example is converted into a sequence of chat messages (e.g., Human: Input\nAI: Output), with a ChatPromptTemplate defining the message roles. Examples can also be inserted dynamically in a conversational format using an ExampleSelector. This makes it suitable for conversational tasks where examples need to be formatted as chat exchanges.

```python
from langchain_core.prompts import FewShotChatMessagePromptTemplate, ChatPromptTemplate

example_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
        ("ai", "{output}"),
    ]
)

examples = [
    {"input": "2+2", "output": "4"},
    {"input": "3+3", "output": "6"},
]

few_shot_chat_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

print(few_shot_chat_prompt.format())
# Output:
# Human: 2+2
# AI: 4
# Human: 3+3
# AI: 6

final_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a wondrous wizard of math."),
        few_shot_chat_prompt,
        ("human", "{input}"),
    ]
)

from langchain_community.chat_models import ChatAnthropic

chain = final_prompt | ChatAnthropic(temperature=0.0)

chain.invoke({"input": "What's the square of a triangle?"})
# AIMessage(content=' Triangles do not have a "square". A square refers to a shape with 4 equal sides and 4 right angles. Triangles have 3 sides and 3 angles.\n\nThe area of a triangle can be calculated using the formula:\n\nA = 1/2 * b * h\n\nWhere:\n\nA is the area \nb is the base (the length of one of the sides)\nh is the height (the length from the base to the opposite vertex)\n\nSo the area depends on the specific dimensions of the triangle. There is no single "square of a triangle". The area can vary greatly depending on the base and height measurements.', additional_kwargs={}, example=False)
```

Sometimes you may want to condition which examples are shown based on the input. For this, you can replace the examples with an example_selector.

```python
from langchain_chroma import Chroma
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings

examples = [
    {"input": "2+2", "output": "4"},
    {"input": "2+3", "output": "5"},
    {"input": "2+4", "output": "6"},
    {"input": "What did the cow say to the moon?", "output": "nothing at all"},
    {
        "input": "Write me a poem about the moon",
        "output": "One for the moon, and one for me, who are we to talk about the moon?",
    },
]

to_vectorize = [" ".join(example.values()) for example in examples]
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(to_vectorize, embeddings, metadatas=examples)

example_selector = SemanticSimilarityExampleSelector(
    vectorstore=vectorstore,
    k=2,
)

# The prompt template will load examples by passing the input to the `select_examples` method
example_selector.select_examples({"input": "horse"})

from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)

# Define the few-shot prompt.
few_shot_prompt = FewShotChatMessagePromptTemplate(
    # The input variables select the values to pass to the example_selector
    input_variables=["input"],
    example_selector=example_selector,
    # Define how each example will be formatted.
    # In this case, each example will become 2 messages:
    # 1 human, and 1 AI
    example_prompt=ChatPromptTemplate.from_messages(
        [("human", "{input}"), ("ai", "{output}")]
    ),
)

final_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a wondrous wizard of math."),
        few_shot_prompt,
        ("human", "{input}"),
    ]
)

from langchain_community.chat_models import ChatAnthropic

chain = final_prompt | ChatAnthropic(temperature=0.0)

chain.invoke({"input": "What's 3+3?"})
# AIMessage(content=' 3 + 3 = 6', additional_kwargs={}, example=False)
```

### Partial Prompt Templates

It can make sense to "partial" a prompt template, i.e., pass in a subset of the required values to create a new prompt template that expects only the remaining values.

One common use case for partialing a prompt template is when you get some of the variables before others. For example, suppose you have a prompt template that requires two variables, foo and bar. If you get the foo value early on in the chain but the bar value later, it can be annoying to wait until you have both variables in the same place to pass them to the prompt template. Instead, you can partial the prompt template with the foo value, pass the partialed prompt template along, and just use that. Below is an example of doing this:

```python
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template("{foo}{bar}")
partial_prompt = prompt.partial(foo="foo")
print(partial_prompt.format(bar="baz"))
```

The other common use is to partial with a function. The use case for this is when you have a variable you know that you always want to fetch in a common way. A prime example of this is with date or time. Imagine you have a prompt which you always want to have the current date. You can't hard code it in the prompt, and passing it along with the other input variables is a bit annoying. In this case, it's very handy to be able to partial the prompt with a function that always returns the current date.

```python
from datetime import datetime


def _get_datetime():
    now = datetime.now()
    return now.strftime("%m/%d/%Y, %H:%M:%S")

prompt = PromptTemplate(
    template="Tell me a {adjective} joke about the day {date}",
    input_variables=["adjective", "date"],
)
partial_prompt = prompt.partial(date=_get_datetime)
print(partial_prompt.format(adjective="funny"))
```

You can also just initialize the prompt with the partialed variables, which often makes more sense in this workflow.

```python
prompt = PromptTemplate(
    template="Tell me a {adjective} joke about the day {date}",
    input_variables=["adjective"],
    partial_variables={"date": _get_datetime},
)
print(prompt.format(adjective="funny"))
```
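The mechanics of partialing can be sketched with a small stand-in class. `MiniTemplate` below is hypothetical and built only on `str.format`; it is not LangChain's PromptTemplate, but it shows the two ideas above: a partial returns a new template with some values pre-bound, and a bound function is called at format time rather than when the partial is created.

```python
from datetime import datetime
from typing import Callable, Dict, Optional, Union

# A bound value is either a fixed string or a zero-argument function.
Value = Union[str, Callable[[], str]]


class MiniTemplate:
    """Hypothetical stand-in for a prompt template (illustration only)."""

    def __init__(self, template: str, partials: Optional[Dict[str, Value]] = None):
        self.template = template
        self.partials = dict(partials or {})

    def partial(self, **values: Value) -> "MiniTemplate":
        # Return a new template; the original stays unchanged.
        return MiniTemplate(self.template, {**self.partials, **values})

    def format(self, **values: str) -> str:
        # Functions are resolved now, so each call sees a fresh value.
        resolved = {
            key: (value() if callable(value) else value)
            for key, value in self.partials.items()
        }
        return self.template.format(**{**resolved, **values})


def current_date() -> str:
    return datetime.now().strftime("%m/%d/%Y")


# Partial with a fixed value.
print(MiniTemplate("{foo}{bar}").partial(foo="foo").format(bar="baz"))  # foobaz

# Partial with a function that is evaluated on every format call.
dated = MiniTemplate("Tell me a {adjective} joke about the day {date}").partial(
    date=current_date
)
print(dated.format(adjective="funny"))
```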

### Prompt Composition

LangChain provides a user-friendly interface for composing different parts of prompts together. You can do this with either string prompts or chat prompts. Constructing prompts this way allows for easy reuse.

When working with string prompts, each template is joined together. You can work with either prompts directly or strings:

```python
from langchain_core.prompts import PromptTemplate

prompt = (
    PromptTemplate.from_template("Tell me a joke about {topic}")
    + ", make it funny"
    + "\n\nand in {language}"
)
prompt
# PromptTemplate(input_variables=['language', 'topic'], template='Tell me a joke about {topic}, make it funny\n\nand in {language}')
```

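LangChain includes the PipelinePromptTemplate abstraction, which can be useful when you want to reuse parts of prompts. It formats each named sub-prompt and passes the result into a final prompt as a variable with that name.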

```python
from langchain_core.prompts.pipeline import PipelinePromptTemplate
from langchain_core.prompts.prompt import PromptTemplate

full_template = """{introduction}

{example}

{start}"""
full_prompt = PromptTemplate.from_template(full_template)

introduction_template = """You are impersonating {person}."""
introduction_prompt = PromptTemplate.from_template(introduction_template)

example_template = """Here's an example of an interaction:

Q: {example_q}
A: {example_a}"""
example_prompt = PromptTemplate.from_template(example_template)

start_template = """Now, do this for real!

Q: {input}
A:"""
start_prompt = PromptTemplate.from_template(start_template)

input_prompts = [
    ("introduction", introduction_prompt),
    ("example", example_prompt),
    ("start", start_prompt),
]
pipeline_prompt = PipelinePromptTemplate(
    final_prompt=full_prompt, pipeline_prompts=input_prompts
)

pipeline_prompt.input_variables # ['example_q', 'person', 'input', 'example_a']

print(
    pipeline_prompt.format(
        person="Elon Musk",
        example_q="What's your favorite car?",
        example_a="Tesla",
        input="What's your favorite social media site?",
    )
)
# Output:
# You are impersonating Elon Musk.
#
# Here's an example of an interaction:
#
# Q: What's your favorite car?
# A: Tesla
#
# Now, do this for real!
#
# Q: What's your favorite social media site?
# A:
```


### Chat Model

Chat models are built on top of LLMs. LLM objects take a string as input and return a string. ChatModel objects take a list of messages as input and return a message. While chat models use language models under the hood, the interface they expose is a bit different: rather than a "text in, text out" API, they use an interface where "chat messages" are the inputs and outputs.

**A chat model is a language model that uses chat messages as inputs and returns chat messages as outputs (as opposed to using plain text).** LangChain has integrations with many model providers (OpenAI, Cohere, Hugging Face, etc.) and exposes a standard interface to interact with all of these models.

The chat model interface is based around messages rather than raw text. The types of messages currently supported in LangChain are AIMessage, HumanMessage, SystemMessage, FunctionMessage and ChatMessage -- ChatMessage takes in an arbitrary role parameter. Most of the time, you'll just be dealing with HumanMessage, AIMessage, and SystemMessage.

### Chat Model and LCEL

Chat models implement the Runnable interface, the basic building block of the LangChain Expression Language (LCEL). This means they support the invoke, ainvoke, stream, astream, batch, abatch, and astream_log calls.

Chat models accept List[BaseMessage] as inputs, or objects which can be coerced to messages, including str (converted to HumanMessage) and PromptValue.

```python
from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage(content="You're a helpful assistant"),
    HumanMessage(content="What is the purpose of model regularization?"),
]

# `chat` is assumed to be an already-instantiated chat model from any provider integration.
# using invoke
chat.invoke(messages)
# AIMessage(content="The purpose of model regularization is to prevent overfitting in machine learning models. Overfitting occurs when a model becomes too complex and starts to fit the noise in the training data, leading to poor generalization on unseen data. Regularization techniques introduce additional constraints or penalties to the model's objective function, discouraging it from becoming overly complex and promoting simpler and more generalizable models. Regularization helps to strike a balance between fitting the training data well and avoiding overfitting, leading to better performance on new, unseen data.")

# using stream
for chunk in chat.stream(messages):
    print(chunk.content, end="", flush=True)
# The purpose of model regularization is to prevent overfitting and improve the generalization of a machine learning model. Overfitting occurs when a model is too complex and learns the noise or random variations in the training data, which leads to poor performance on new, unseen data. Regularization techniques introduce additional constraints or penalties to the model's learning process, discouraging it from fitting the noise and reducing the complexity of the model. This helps to improve the model's ability to generalize well and make accurate predictions on unseen data.    

# using batch
chat.batch([messages])
# [AIMessage(content="The purpose of model regularization is to prevent overfitting in machine learning models. Overfitting occurs when a model becomes too complex and starts to learn the noise or random fluctuations in the training data, rather than the underlying patterns or relationships. Regularization techniques add a penalty term to the model's objective function, which discourages the model from becoming too complex and helps it generalize better to new, unseen data. This improves the model's ability to make accurate predictions on new data by reducing the variance and increasing the model's overall performance.")]

# using ainvoke (the async calls below need an async context, e.g. a notebook or a coroutine)
await chat.ainvoke(messages)
# AIMessage(content='The purpose of model regularization is to prevent overfitting in machine learning models. Overfitting occurs when a model becomes too complex and starts to memorize the training data instead of learning general patterns and relationships. This leads to poor performance on new, unseen data.\n\nRegularization techniques introduce additional constraints or penalties to the model during training, discouraging it from becoming overly complex. This helps to strike a balance between fitting the training data well and generalizing to new data. Regularization techniques can include adding a penalty term to the loss function, such as L1 or L2 regularization, or using techniques like dropout or early stopping. By regularizing the model, it encourages it to learn the most relevant features and reduce the impact of noise or outliers in the data.')

# using astream
async for chunk in chat.astream(messages):
    print(chunk.content, end="", flush=True)
# The purpose of model regularization is to prevent overfitting in machine learning models. Overfitting occurs when a model becomes too complex and starts to memorize the training data instead of learning the underlying patterns. Regularization techniques help in reducing the complexity of the model by adding a penalty to the loss function. This penalty encourages the model to have smaller weights or fewer features, making it more generalized and less prone to overfitting. The goal is to find the right balance between fitting the training data well and being able to generalize well to unseen data.

# using astream log
async for chunk in chat.astream_log(messages):
    print(chunk) 
```

### Chat Models and LangSmith

All ChatModels come with built-in LangSmith tracing. Just set the following environment variables:

```shell
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY=<your-api-key>
```

and any ChatModel invocation (whether it's nested in a chain or not) will automatically be traced. A trace will include inputs, outputs, latency, token usage, invocation params, environment params, and more.
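When working in a notebook or script rather than a shell, the same two variables can be set from Python before any model is invoked. This is a minimal sketch; the helper `enable_langsmith_tracing` is a hypothetical convenience and the key shown is a placeholder.

```python
import os


def enable_langsmith_tracing(api_key: str) -> None:
    """Set the two environment variables named above for the current process."""
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = api_key


# Placeholder value, not a real key; load the real one from a secret store.
enable_langsmith_tracing("your-api-key")
print(os.environ["LANGCHAIN_TRACING_V2"])  # true
```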

In LangSmith you can then provide feedback for any trace, compile annotated datasets for evals, debug performance in the playground, and more.

### Chat Model Message types

ChatModels take a list of messages as input and return a message. There are a few different types of messages. All messages have a role and a content property. The role describes WHO is saying the message. LangChain has different message classes for different roles. The content property describes the content of the message. This can be a few different things:
- A string (most models deal with this type of content)
- A list of dictionaries (this is used for multi-modal input, where each dictionary contains information about the input type and the input location)

In addition, messages have an additional_kwargs property. This is where additional information about messages can be passed. This is largely used for input parameters that are provider specific and not general. The best known example of this is function_call from OpenAI.

Message Types:
- HumanMessage
    - This represents a message from the user. Generally consists only of content.
- AIMessage
    - This represents a message from the model. This may have additional_kwargs in it - for example tool_calls if using OpenAI tool calling.
- SystemMessage
    - This represents a system message, which tells the model how to behave. This generally only consists of content. Not every model supports this.
- FunctionMessage
    - This represents the result of a function call. In addition to role and content, this message has a name parameter which conveys the name of the function that was called to produce this result.
- ToolMessage
    - This represents the result of a tool call. This is distinct from a FunctionMessage in order to match OpenAI's function and tool message types. In addition to role and content, this message has a tool_call_id parameter which conveys the id of the call to the tool that was called to produce this result.

### Chat Models and Streaming

All ChatModels implement the Runnable interface, which comes with default implementations of all methods, ie. ainvoke, batch, abatch, stream, astream. This gives all ChatModels basic support for streaming.

**Streaming support defaults to returning an Iterator (or AsyncIterator in the case of async streaming)** of a single value, the final result returned by the underlying ChatModel provider. This obviously doesn't give you token-by-token streaming, which requires native support from the ChatModel provider, but ensures your code that expects an iterator of tokens can work for any of our ChatModel integrations.

### Chat Models and Tool calling

Tool Calling in LangChain is a feature that allows a language model to **generate structured output** by simulating the "calling" of a predefined tool. **This process involves the model creating arguments that match a user-defined schema for a tool, but the actual execution of the tool (or even whether the tool is executed) is left to the user.**

When we define a tool in LangChain, we specify:
- What the tool does: Its purpose or functionality.
- What inputs (arguments) it requires: A schema that defines the parameters the tool needs to execute.

The model "creates arguments" by generating these input values based on the user's query or prompt. Consider a tool for basic arithmetic operations:

```python
class CalculatorTool(BaseTool):
    name = "calculate"
    args_schema = {
        "operation": {"type": "string", "enum": ["add", "subtract", "multiply", "divide"]},
        "num1": {"type": "number"},
        "num2": {"type": "number"},
    }
```
- Operation: Specifies the type of calculation (e.g., "add").
- The first number for the operation (e.g., 5).
- The second number for the operation (e.g., 7).

When a user provides input like: "Add 5 and 7." The model interprets the query and generates the following arguments:

```python
{
    "operation": "add",
    "num1": 5,
    "num2": 7
}
```
These arguments match the schema defined in the example CalculatorTool.

**In effect, despite the name, tool calling does not involve the model directly performing actions. Instead, the model provides parameters or arguments that conform to the schema, which can then be used by the user to trigger the tool or extract structured data.**

How Tool Calling Works:
- Define the Tool:
    - A "tool" in LangChain represents an action (such as a function that parses a pandas dataframe) that has a predefined schema that a language model can generate values for based on user input, and thus allow the client code to call the action, that is, the tool (such as parsing a pandas dataframe) with those arguments
    - The schema specifies the inputs the tool accepts and can include validations or constraints, such as using pydantic validations. It is common to define the schema as a pydantic model in langchain. Note since the language model consumes the schema through http requests, such as via the HuggingFace TGI Router, the schema is ultimately turned into a json structure that conforms to openai tool json formatting when it is sent to the language model. So the language model itself will understand the scheme as json.
- Prompt the Model:
    - The user provides a prompt or query to the model.
    - The model generates output that matches the schema of the tool, simulating a "call" to the tool.
- User Handles the Execution:
    - The user decides whether to actually execute the tool based on the generated arguments.
    - Alternatively, the user may treat the structured output directly as the final result.

Many LLM providers, including Anthropic, Cohere, Google, Mistral, OpenAI, and others, support variants of a tool calling feature. These features typically allow requests to the LLM to include available tools and their schemas, and for responses to include calls to these tools. 

**What makes tool calling powerful is not simply that the model can generate the parameters to pass to a tool, but your program can then execute the tool and return the output to the LLM to inform its response. In other words, you have a multi-turn conversation based on the tool!**

LangChain includes a suite of built-in tools and supports several methods for defining your own custom tools. Tool-calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.

Providers adopt different conventions for formatting tool schemas and tool calls. For instance, Anthropic returns tool calls as parsed structures within a larger content block:
```json
[
  {
    "text": "<thinking>\nI should use a tool.\n</thinking>",
    "type": "text"
  },
  {
    "id": "id_value",
    "input": {"arg_name": "arg_value"},
    "name": "tool_name",
    "type": "tool_use"
  }
]
```

whereas OpenAI separates tool calls into a distinct parameter, with arguments as JSON strings:
```json
{
  "tool_calls": [
    {
      "id": "id_value",
      "function": {
        "arguments": '{"arg_name": "arg_value"}',
        "name": "tool_name"
      },
      "type": "function"
    }
  ]
}
```
**The OpenAI format is the most common and the HuggingFace TGI accepts the OpenAI Tool Calling format.**

LangChain implements standard interfaces for defining tools, passing them to LLMs, and representing tool calls.

**For a model to be able to invoke tools, you need to pass tool schemas to it when making a chat request. LangChain ChatModels supporting tool calling features implement a .bind_tools method, which receives a list of LangChain tool objects, Pydantic classes, or JSON Schemas and binds them to the chat model in the provider-specific expected format.** That's a mouthful. Let's dissect this:
- A tool in LangChain represents a predefined action the model can "invoke" by generating structured arguments that match a schema.
- Binding Tools to the Model
    - Tools must be bound to the model so that the model knows which tools are available and can generate arguments for them.
    - Binding tools involves passing the tool schemas to the model, which adapts them to the specific format expected by the provider’s API (e.g., OpenAI, Anthropic).
- .bind_tools() Method
    - **The .bind_tools() method in LangChain allows you to attach tools to a ChatModel instance.**
    - **Once tools are bound, every subsequent API call to the model will include the tool schemas, enabling the model to simulate invoking them.**

Example:

```python
from langchain.tools import BaseTool

class CalculatorTool(BaseTool):
    name = "calculate"
    description = "Performs arithmetic operations"
    args_schema = {
        "operation": {"type": "string", "enum": ["add", "subtract", "multiply", "divide"]},
        "num1": {"type": "number"},
        "num2": {"type": "number"},
    }

    def _run(self, operation, num1, num2):
        if operation == "add":
            return num1 + num2
        elif operation == "subtract":
            return num1 - num2
        elif operation == "multiply":
            return num1 * num2
        elif operation == "divide":
            return num1 / num2
        else:
            raise ValueError("Invalid operation")

# Use .bind_tools() to attach the tools to the model.
from langchain.chat_models import ChatOpenAI

# Instantiate the model
llm = ChatOpenAI(temperature=0)

# Instantiate the tool
calculator_tool = CalculatorTool()

# Bind the tool to the model
llm_with_tools = llm.bind_tools([calculator_tool])
```

How Binding Works:
- Schema Passing:
    - When you call .bind_tools(), LangChain converts the tool definitions into the specific JSON schema format expected by the provider (e.g., OpenAI or Anthropic).
- Model Awareness:
    - After binding, the model becomes "aware" of the tools and their schemas. Every subsequent chat request automatically includes the tool schemas as part of the API call. How is this accomplished? The bind_tools method in ChatHuggingFace is used to bind tool definitions (such as functions or tools) to the LLM. It converts them into an OpenAI-compatible format using convert_to_openai_tool() and then binds them using the bind() method from Runnable. The bind() method in Runnable is a functional programming concept that creates a new Runnable with pre-set arguments. Essentially, it "freezes" certain parameters so that they don't have to be passed again. bind() does not modify the original Runnable. Instead, it returns a new instance of RunnableBinding (which is a wrapper around the original Runnable). This stores the kwargs as bound arguments, so when the Runnable is later invoked, those arguments are automatically included.
        - Without bind(): chat.invoke("What is the weather like in New York?", tools=[get_weather])
        - With bind(): 
            - chat_with_tools = chat.bind_tools([get_weather])
            - chat_with_tools.invoke("What is the weather like in New York?")
- Tool Invocation by the Model:
    - During a conversation, the model can decide whether to "call" a tool by generating arguments that match the tool schema.

**When you interact with the bound model, the model can now include tool invocations in its responses.**

```python
response = llm_with_tools.invoke({
    "input": "Can you add 7 and 3?"
})

print(response)

# The model generates an output in the format:
{
    "tool": "calculate",
    "arguments": {
        "operation": "add",
        "num1": 7,
        "num2": 3
    }
}
```

We can define the schema for custom tools using the @tool decorator on Python functions:
```python
from langchain_core.tools import tool


@tool
def add(a: int, b: int) -> int:
    """Adds a and b.

    Args:
        a: first int
        b: second int
    """
    return a + b


@tool
def multiply(a: int, b: int) -> int:
    """Multiplies a and b.

    Args:
        a: first int
        b: second int
    """
    return a * b


tools = [add, multiply]
```

We can equivalently define the schema using Pydantic. Pydantic is useful when your tool inputs are more complex:

```python
from langchain_core.pydantic_v1 import BaseModel, Field


# Note that the docstrings here are crucial, as they will be passed along
# to the model along with the class name.
class add(BaseModel):
    """Add two integers together."""

    a: int = Field(..., description="First integer")
    b: int = Field(..., description="Second integer")


class multiply(BaseModel):
    """Multiply two integers together."""

    a: int = Field(..., description="First integer")
    b: int = Field(..., description="Second integer")


tools = [add, multiply]

# We can bind them to chat models as follows:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

# We can use the bind_tools() method to handle converting Multiply to a "tool" and binding it to the model (i.e., passing it in each time the model is invoked).
llm_with_tools = llm.bind_tools(tools)
```

**When you just use bind_tools(tools), the model can choose whether to return one tool call, multiple tool calls, or no tool calls at all. Some models support a tool_choice parameter that gives you some ability to force the model to call a tool. For models that support this, you can pass in the name of the tool you want the model to always call tool_choice="xyz_tool_name". Or you can pass in tool_choice="any" to force the model to call at least one tool, without specifying which tool specifically.**

Currently tool_choice="any" functionality is supported by OpenAI, MistralAI, FireworksAI, and Groq. Currently Anthropic does not support tool_choice at all.

If we wanted our model to always call the multiply tool we could do:
```python
always_multiply_llm = llm.bind_tools([multiply], tool_choice="multiply")
```

And if we wanted it to always call at least one of add or multiply, we could do:
```python
always_call_tool_llm = llm.bind_tools([add, multiply], tool_choice="any")
```

**If tool calls are included in a LLM response, they are attached to the corresponding AIMessage or AIMessageChunk (when streaming) as a list of ToolCall objects in the .tool_calls attribute. THIS IS IMPORTANT TO REMEMBER:** 

```python
from pydantic import BaseModel, Field
from langchain_core.utils.function_calling import convert_to_openai_tool

class QueryDataFrame(BaseModel):
    """Query a Pandas dataframe using a valid Pandas query expression."""

    query: str = Field(..., description="A valid Pandas query expression.")

# Here our tool is defined as a Pydantic model and we can see the structure of the json format before sending it to the LLM, courtesy of the helper function convert_to_openai_tool
formatted_tool = convert_to_openai_tool(QueryDataFrame)
print(json.dumps(formatted_tool, indent=4))
# {
#     "type": "function",
#     "function": {
#         "name": "QueryDataFrame",
#         "description": "Query a Pandas dataframe using a valid Pandas query expression.",
#         "parameters": {
#             "properties": {
#                 "query": {
#                     "description": "A valid Pandas query expression.",
#                     "type": "string"
#                 }
#             },
#             "required": [
#                 "query"
#             ],
#             "type": "object"
#         }
#     }
# }

chat_with_tools = chat.bind_tools([QueryDataFrame])

message = chat_with_tools.invoke("How many passengers survived the Titanic disaster?")
print(f'additional kwargs ${message.additional_kwargs}')
# additional kwargs ${'tool_calls': [ChatCompletionOutputToolCall(function=ChatCompletionOutputFunctionDefinition(arguments={'query': 'survived ==Discovery of surviving passengers from the TitanicTrials == '}, name='QueryDataFrame', description=None), id='0', type='function')]}
print(f'tool calls ${message.tool_calls}')
# tool calls $[{'name': 'QueryDataFrame', 'args': {'query': 'survived ==Discovery of surviving passengers from the TitanicTrials == '}, 'id': '0', 'type': 'tool_call'}]
```

Notice in the above output of `message.tool_calls`, the list is comprised of ToolCalls, which are typed dicts that includes a tool name, dict of argument values shown as args above, and (optionally) an identifier. Messages with no tool calls default to an empty list for this attribute.

Here is an example of a list with multiple ToolCalls:

Example:
```python
query = "What is 3 * 12? Also, what is 11 + 49?"

llm_with_tools.invoke(query).tool_calls
# [
#     {
#         'name': 'multiply',
#         'args': { 'a': 3, 'b': 12 },
#         'id': 'call_UL7E2232GfDHIQGOM4gJfEDD'
#     },
#     {
#         'name': 'add',
#         'args': {'a': 11, 'b': 49},
#         'id': 'call_VKw8t5tpAuzvbHgdAXe9mjUx'
#     }
# ]
```

The .tool_calls attribute should contain valid tool calls. Note that on occasion, model providers may output malformed tool calls (e.g., arguments that are not valid JSON). When parsing fails in these cases, instances of InvalidToolCall are populated in the .invalid_tool_calls attribute. An InvalidToolCall can have a name, string arguments, identifier, and error message.

If desired, output parsers can further process the output. For example, we can convert back to the original Pydantic class:
```python
from langchain_core.output_parsers.openai_tools import PydanticToolsParser

chain = llm_with_tools | PydanticToolsParser(tools=[multiply, add])
chain.invoke(query)
# [multiply(a=3, b=12), add(a=11, b=49)]
```
How cool is that! We get back a pydantic object of the response from the model.

Streaming

**When tools are called in a streaming context, message chunks will be populated with tool call chunk objects in a list via the .tool_call_chunks attribute.** A ToolCallChunk includes optional string fields for the tool name, args, and id, and includes an optional integer field index that can be used to join chunks together. Fields are optional because portions of a tool call may be streamed across different chunks (e.g., a chunk that includes a substring of the arguments may have null values for the tool name and id).

```python
async for chunk in llm_with_tools.astream(query):
    print(chunk.tool_call_chunks)
# []
# [{'name': 'multiply', 'args': '', 'id': 'call_5Gdgx3R2z97qIycWKixgD2OU', 'index': 0}]
# [{'name': None, 'args': '{"a"', 'id': None, 'index': 0}]
# [{'name': None, 'args': ': 3, ', 'id': None, 'index': 0}]
# [{'name': None, 'args': '"b": 1', 'id': None, 'index': 0}]
# [{'name': None, 'args': '2}', 'id': None, 'index': 0}]
# [{'name': 'add', 'args': '', 'id': 'call_DpeKaF8pUCmLP0tkinhdmBgD', 'index': 1}]
# [{'name': None, 'args': '{"a"', 'id': None, 'index': 1}]
# [{'name': None, 'args': ': 11,', 'id': None, 'index': 1}]
# [{'name': None, 'args': ' "b": ', 'id': None, 'index': 1}]
# [{'name': None, 'args': '49}', 'id': None, 'index': 1}]
# []
```

Note that adding message chunks will merge their corresponding tool call chunks. This is the principle by which LangChain's various tool output parsers support streaming. For example, below we accumulate tool call chunks:

```python
first = True
async for chunk in llm_with_tools.astream(query):
    if first:
        gathered = chunk
        first = False
    else:
        gathered = gathered + chunk

    print(gathered.tool_call_chunks)
# []
# [{'name': 'multiply', 'args': '', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}]
# [{'name': 'multiply', 'args': '{"a"', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}]
# [{'name': 'multiply', 'args': '{"a": 3, ', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}]
# [{'name': 'multiply', 'args': '{"a": 3, "b": 1', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}]
# [{'name': 'multiply', 'args': '{"a": 3, "b": 12}', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}]
# [{'name': 'multiply', 'args': '{"a": 3, "b": 12}', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}, {'name': 'add', 'args': '', 'id': 'call_GERgANDUbRqdtmXRbIAS9JTS', 'index': 1}]
# [{'name': 'multiply', 'args': '{"a": 3, "b": 12}', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}, {'name': 'add', 'args': '{"a"', 'id': 'call_GERgANDUbRqdtmXRbIAS9JTS', 'index': 1}]
# [{'name': 'multiply', 'args': '{"a": 3, "b": 12}', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}, {'name': 'add', 'args': '{"a": 11,', 'id': 'call_GERgANDUbRqdtmXRbIAS9JTS', 'index': 1}]
# [{'name': 'multiply', 'args': '{"a": 3, "b": 12}', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}, {'name': 'add', 'args': '{"a": 11, "b": ', 'id': 'call_GERgANDUbRqdtmXRbIAS9JTS', 'index': 1}]
# [{'name': 'multiply', 'args': '{"a": 3, "b": 12}', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}, {'name': 'add', 'args': '{"a": 11, "b": 49}', 'id': 'call_GERgANDUbRqdtmXRbIAS9JTS', 'index': 1}]
# [{'name': 'multiply', 'args': '{"a": 3, "b": 12}', 'id': 'call_hXqj6HxzACkpiPG4hFFuIKuP', 'index': 0}, {'name': 'add', 'args': '{"a": 11, "b": 49}', 'id': 'call_GERgANDUbRqdtmXRbIAS9JTS', 'index': 1}]
```

**If we're using the model-generated tool invocations to actually call tools and want to pass the tool results back to the model, we can do so using ToolMessages.** Let's explore this in-depth because it is so important:

```python
from langchain_core.messages import HumanMessage, ToolMessage

@tool
def add(a: int, b: int) -> int:
    """Adds a and b.

    Args:
        a: first int
        b: second int
    """
    return a + b


@tool
def multiply(a: int, b: int) -> int:
    """Multiplies a and b.

    Args:
        a: first int
        b: second int
    """
    return a * b

tools = [add, multiply]
llm_with_tools = llm.bind_tools(tools)

messages = [HumanMessage(query)]
ai_msg = llm_with_tools.invoke(messages)
messages.append(ai_msg)

for tool_call in ai_msg.tool_calls:
    # for each tool call, we identify the current tool call with name field and store it in selected_tools
    selected_tool = {"add": add, "multiply": multiply}[tool_call["name"].lower()]
    # now we have the selected tool, we invoke it with the model generated results for our schema, which is stored in args. 
    tool_output = selected_tool.invoke(tool_call["args"])
    # Then we store the output of us calling our tool with the model schema results into ToolMessage.
    messages.append(ToolMessage(tool_output, tool_call_id=tool_call["id"]))

messages
# AIMessage(content='3 * 12 = 36\n11 + 49 = 60', response_metadata={'token_usage': {'completion_tokens': 16, 'prompt_tokens': 209, 'total_tokens': 225}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': 'fp_3b956da36b', 'finish_reason': 'stop', 'logprobs': None}, id='run-a55f8cb5-6d6d-4835-9c6b-7de36b2590c7-0')
```
- Tools (@tool): These are functions registered as tools that the model can "call."
- Each tool has a name, arguments, and a return type.
    - Two tools, add and multiply, are defined with argument types (int) and return types (int).
    - Example: add(a, b) adds two numbers.
    - Example: multiply(a, b) multiplies two numbers.
- The language model (LLM) is bound to the tools using bind_tools(tools).
    - The tools are "registered" with the LLM so that the model knows about them and can invoke them when appropriate.
    - This allows the model to "invoke" tools by generating tool calls (the model doesn't actually invoke tools; we invoke tools using the model's response to our tool call).
- Initialize Messages: messages = [HumanMessage(query)]. A conversation starts with a HumanMessage containing the user’s query.
- The user sends a query (HumanMessage). The model responds (AIMessage) with tool invocations: ai_msg = llm_with_tools.invoke(messages)
    - We iterate through the tool calls and grab selected tool. Then we invoke the selected tool with the args (the model response values which filled in our schema arguments)
    - We get a response from the tool invocation that we did. We store that response in ToolMessage which we can pass to the model again.

For example, if the query is: "What is 3 multiplied by 12, and then add 11 to the result?" The AIMessage might include tool calls:
```json
[
    {"name": "multiply", "args": {"a": 3, "b": 12}, "id": "tool_call_1"},
    {"name": "add", "args": {"a": 11, "b": 49}, "id": "tool_call_2"}
]
```

The tools are executed, and their results (ToolMessages) are passed back to the model:
```python
for tool_call in ai_msg.tool_calls:
    selected_tool = {"add": add, "multiply": multiply}[tool_call["name"].lower()]
    tool_output = selected_tool.invoke(tool_call["args"])
    messages.append(ToolMessage(tool_output, tool_call_id=tool_call["id"]))
```
- Iterate through the tool_calls in the AIMessage.
- Find the corresponding tool (add or multiply) based on the name in the tool call.
- Execute the tool with the provided args.
- After executing the tool, the result is added as a ToolMessage to the conversation.
- The ToolMessage includes:
    - The tool output (e.g., 36 or 60).
    - The tool_call_id linking the result to the specific tool invocation.

After processing the tool calls, the messages list contains:
- HumanMessage: The user’s query.
- AIMessage: The LLM’s response with tool calls.
- ToolMessages: The results of the tool executions.
```python
[
    HumanMessage(content="What is 3 multiplied by 12, and then add 11 to the result?"),
    AIMessage(
        content="3 * 12 = 36\n11 + 49 = 60",
        tool_calls=[
            {"name": "multiply", "args": {"a": 3, "b": 12}, "id": "tool_call_1"},
            {"name": "add", "args": {"a": 11, "b": 49}, "id": "tool_call_2"}
        ],
    ),
    ToolMessage(content=36, tool_call_id="tool_call_1"),
    ToolMessage(content=60, tool_call_id="tool_call_2")
]
```

Why Use ToolMessages?
- ToolMessages allow tool outputs to be integrated back into the conversation, enabling the model to continue reasoning with updated information.
- By including the tool_call_id, each ToolMessage explicitly links the result to its corresponding tool invocation, maintaining traceability.
- This setup enables dynamic and interactive workflows where tools and the model collaborate iteratively.
- The model can issue multiple tool calls, receive results, and use them to generate more informed responses.

It's very critical to understand the topic above!

Tools and Few-Shot Examples

For more complex tool use it's very useful to add few-shot examples to the prompt. We can do this by adding AIMessages with ToolCalls and corresponding ToolMessages to our prompt.

```python
from langchain_core.messages import AIMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

examples = [
    HumanMessage(
        "What's the product of 317253 and 128472 plus four", name="example_user"
    ),
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[
            {"name": "multiply", "args": {"x": 317253, "y": 128472}, "id": "1"}
        ],
    ),
    ToolMessage("16505054784", tool_call_id="1"),
    AIMessage(
        "",
        name="example_assistant",
        tool_calls=[{"name": "add", "args": {"x": 16505054784, "y": 4}, "id": "2"}],
    ),
    ToolMessage("16505054788", tool_call_id="2"),
    AIMessage(
        "The product of 317253 and 128472 plus four is 16505054788",
        name="example_assistant",
    ),
]

system = """You are bad at math but are an expert at using a calculator. 

Use past tool usage as an example of how to correctly use the tools."""
few_shot_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        *examples,
        ("human", "{query}"),
    ]
)

chain = {"query": RunnablePassthrough()} | few_shot_prompt | llm_with_tools
chain.invoke("Whats 119 times 8 minus 20").tool_calls
# [{'name': 'multiply',
#   'args': {'a': 119, 'b': 8},
#   'id': 'call_tWwpzWqqc8dQtN13CyKZCVMe'}]
```

### Chat Models and Structured Output

It is often crucial to have LLMs return structured output. This is because oftentimes the outputs of the LLMs are used in downstream applications, where specific arguments are required. Having the LLM return structured output reliably is necessary for that.

There are a few different high level strategies that are used to do this:
- Prompting: This is when you ask the LLM (very nicely) to return output in the desired format (JSON, XML). This is nice because it works with all LLMs. It is not nice because there is no guarantee that the LLM returns the output in the right format.
- Function calling: This is when the LLM is fine-tuned to be able to not just generate a completion, but also generate a function call. The functions the LLM can call are generally passed as extra parameters to the model API. The function names and descriptions should be treated as part of the prompt (they usually count against token counts, and are used by the LLM to decide what to do).
- Tool calling: A technique similar to function calling, but it allows the LLM to call multiple functions at the same time.
- JSON mode: This is when the LLM is guaranteed to return JSON.

Different models may support different variants of these, with slightly different parameters. In order to make it easy to get LLMs to return structured output, we have added a common interface to LangChain models: .with_structured_output.

By invoking this method (and passing in a JSON schema or a Pydantic model) the model will add whatever model parameters + output parsers are necessary to get back the structured output. There may be more than one way to do this (e.g., function calling vs JSON mode) - you can configure which method to use by passing into that method.

Example with OpenAI (supports Tool/function Calling):
```python
from langchain_core.pydantic_v1 import BaseModel, Field

class Joke(BaseModel):
    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline to the joke"

from langchain_openai import ChatOpenAI

# By default, we will use function_calling:
model = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = model.with_structured_output(Joke)

structured_llm.invoke("Tell me a joke about cats")
# Joke(setup='Why was the cat sitting on the computer?', punchline='To keep an eye on the mouse!')

# JSON mode:
structured_llm = model.with_structured_output(Joke, method="json_mode")

structured_llm.invoke(
    "Tell me a joke about cats, respond in JSON with `setup` and `punchline` keys"
)
# Joke(setup='Why was the cat sitting on the computer?', punchline='Because it wanted to keep an eye on the mouse!')
```

Example with Mistral (only support function calling):

```python
from langchain_mistralai import ChatMistralAI

model = ChatMistralAI(model="mistral-large-latest")
structured_llm = model.with_structured_output(Joke)

structured_llm.invoke("Tell me a joke about cats")
# Joke(setup="Why don't cats play poker in the jungle?", punchline='Too many cheetahs!')
```

Example with Groq (supports Tool/function Calling):

```python
model = ChatGroq()
structured_llm = model.with_structured_output(Joke)

structured_llm.invoke("Tell me a joke about cats")
# Joke(setup="Why don't cats play poker in the jungle?", punchline='Too many cheetahs!')

# using JSON mode:
structured_llm = model.with_structured_output(Joke, method="json_mode")

structured_llm.invoke(
    "Tell me a joke about cats, respond in JSON with `setup` and `punchline` keys"
)
# Joke(setup="Why don't cats play poker in the jungle?", punchline='Too many cheetahs!')

```

### Chat Models and Caching

LangChain provides an optional caching layer for chat models. It can save you money by reducing the number of API calls you make to the LLM provider, if you're often requesting the same completion multiple times. 

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

from langchain.globals import set_llm_cache

from langchain.cache import InMemoryCache

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
llm.predict("Tell me a joke")

# The second time it is, so it goes faster
llm.predict("Tell me a joke")
```

### Custom Chat Model

Wrapping your LLM with the standard BaseChatModel interface allow you to use your LLM in existing LangChain programs with minimal code modifications. As an bonus, your LLM will automatically become a LangChain Runnable and will benefit from some optimizations out of the box (e.g., batch via a threadpool), async support, the astream_events API, etc.

Chat models take messages as inputs and return a message as output. LangChain has a few built-in message types:
- SystemMessage: 
    - Used for priming AI behavior, usually passed in as the first of a sequence of input messages.
- HumanMessage: 
    - Represents a message from a person interacting with the chat model.
- AIMessage: 
    - Represents a message from the chat model. This can be either text or a request to invoke a tool.
- FunctionMessage / ToolMessage: 
    - Message for passing the results of tool invocation back to the model.
    - ToolMessage and FunctionMessage closely follow OpenAIs function and tool roles.
- AIMessageChunk / HumanMessageChunk: 
    - Chunk variant of each type of message.

Import Example:

```python
from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    FunctionMessage,
    HumanMessage,
    SystemMessage,
    ToolMessage,
)

# All the chat messages have a streaming variant that contains Chunk in the name.
from langchain_core.messages import (
    AIMessageChunk,
    FunctionMessageChunk,
    HumanMessageChunk,
    SystemMessageChunk,
    ToolMessageChunk,
)
```

Note the streaming variant chunks are used when streaming output from chat models, and they all define an additive property:
```python
AIMessageChunk(content="Hello") + AIMessageChunk(content=" World!")
```
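
The additive property can be pictured with a small standard-library sketch. `TextChunk` below is a hypothetical, heavily simplified stand-in for a message chunk (it only carries text), intended to show why a consumer can rebuild the full message by summing the streamed pieces.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Hypothetical, simplified stand-in for a message chunk."""

    content: str

    def __add__(self, other: "TextChunk") -> "TextChunk":
        # Adding two chunks concatenates their content into a new chunk.
        return TextChunk(content=self.content + other.content)


# Pieces as they might arrive from a streaming model.
streamed = [TextChunk("Hel"), TextChunk("lo"), TextChunk(" World!")]

# Accumulate the pieces into one complete message.
full = streamed[0]
for chunk in streamed[1:]:
    full = full + chunk

print(full.content)  # Hello World!
```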

When we inherit from BaseChatModel, we need to implement the following:
- _generate:
    - Used to generate a chat result from a prompt.
    - Required
- _llm_type (property):
    - Used to uniquely identify the type of the model. Used for logging.
    - Required
- _identifying_params (property):
    - Represents model parameterization for tracing purposes.
    - Optional
- _stream:
    - Used to implement streaming.
    - Optional
- _agenerate:
    - Used to implement a native async method.
    - Optional
- _astream:
    - Used to implement an async version of _stream.
    - Optional
    - The default _astream implementation uses run_in_executor to launch the sync _stream in a separate thread if _stream is implemented; otherwise it falls back to _agenerate.
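
The executor-based fallback for async streaming can be sketched with `asyncio` alone. This is an illustration of the general technique, not LangChain's code: `sync_stream` and `astream_from_sync` are hypothetical names, and the sync generator simply yields characters.

```python
import asyncio
from typing import AsyncIterator, Iterator, List

_DONE = object()  # sentinel marking the end of the sync iterator


def sync_stream(text: str) -> Iterator[str]:
    """Stand-in for a synchronous, potentially blocking stream."""
    for char in text:
        yield char


async def astream_from_sync(text: str) -> AsyncIterator[str]:
    """Advance the sync generator in a worker thread, one item at a time."""
    loop = asyncio.get_running_loop()
    iterator = sync_stream(text)
    while True:
        # next(iterator, _DONE) runs in the default thread pool, so a slow
        # step does not block the event loop.
        item = await loop.run_in_executor(None, next, iterator, _DONE)
        if item is _DONE:
            break
        yield item


async def collect(text: str) -> List[str]:
    return [chunk async for chunk in astream_from_sync(text)]


print(asyncio.run(collect("cat")))  # ['c', 'a', 't']
```

The trade-off is that each step costs a thread hop, which is why a native async implementation is still preferable when the underlying client supports it.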

Example:

```python
from typing import Any, AsyncIterator, Dict, Iterator, List, Optional

from langchain_core.callbacks import (
    AsyncCallbackManagerForLLMRun,
    CallbackManagerForLLMRun,
)
from langchain_core.language_models import BaseChatModel, SimpleChatModel
from langchain_core.messages import AIMessage, AIMessageChunk, BaseMessage, HumanMessage
from langchain_core.outputs import ChatGeneration, ChatGenerationChunk, ChatResult
from langchain_core.runnables import run_in_executor

class CustomChatModelAdvanced(BaseChatModel):
    """A custom chat model that echoes the first `n` characters of the input.

    When contributing an implementation to LangChain, carefully document
    the model including the initialization parameters, include
    an example of how to initialize the model and include any relevant
    links to the underlying models documentation or API.

    Example:

        .. code-block:: python

            model = CustomChatModelAdvanced(n=2, model_name="my_custom_model")
            result = model.invoke([HumanMessage(content="hello")])
            result = model.batch([[HumanMessage(content="hello")],
                                 [HumanMessage(content="world")]])
    """

    model_name: str
    """The name of the model"""
    n: int
    """The number of characters from the last message of the prompt to be echoed."""

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        """Override the _generate method to implement the chat model logic.

        This can be a call to an API, a call to a local model, or any other
        implementation that generates a response to the input prompt.

        Args:
            messages: the prompt composed of a list of messages.
            stop: a list of strings on which the model should stop generating.
                  If generation stops due to a stop token, the stop token itself
                  SHOULD BE INCLUDED as part of the output. This is not enforced
                  across models right now, but it's a good practice to follow since
                  it makes it much easier to parse the output of the model
                  downstream and understand why generation stopped.
            run_manager: A run manager with callbacks for the LLM.
        """
        # Replace this with actual logic to generate a response from a list
        # of messages.
        last_message = messages[-1]
        tokens = last_message.content[: self.n]
        message = AIMessage(
            content=tokens,
            additional_kwargs={},  # Used to add additional payload (e.g., function calling request)
            response_metadata={  # Use for response metadata
                "time_in_seconds": 3,
            },
        )

        generation = ChatGeneration(message=message)
        return ChatResult(generations=[generation])

    def _stream(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[ChatGenerationChunk]:
        """Stream the output of the model.

        This method should be implemented if the model can generate output
        in a streaming fashion. If the model does not support streaming,
        do not implement it. In that case streaming requests will be automatically
        handled by the _generate method.

        Args:
            messages: the prompt composed of a list of messages.
            stop: a list of strings on which the model should stop generating.
                  If generation stops due to a stop token, the stop token itself
                  SHOULD BE INCLUDED as part of the output. This is not enforced
                  across models right now, but it's a good practice to follow since
                  it makes it much easier to parse the output of the model
                  downstream and understand why generation stopped.
            run_manager: A run manager with callbacks for the LLM.
        """
        last_message = messages[-1]
        tokens = last_message.content[: self.n]

        for token in tokens:
            chunk = ChatGenerationChunk(message=AIMessageChunk(content=token))

            if run_manager:
                # This is optional in newer versions of LangChain
                # The on_llm_new_token will be called automatically
                run_manager.on_llm_new_token(token, chunk=chunk)

            yield chunk

        # Let's add some other information (e.g., response metadata)
        chunk = ChatGenerationChunk(
            message=AIMessageChunk(content="", response_metadata={"time_in_sec": 3})
        )
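        if run_manager:
            # This is optional in newer versions of LangChain
            # The on_llm_new_token will be called automatically
            # Pass the chunk's own (empty) text: the loop variable `token` would be
            # stale here, and undefined if the loop above never ran.
            run_manager.on_llm_new_token(chunk.text, chunk=chunk)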
        yield chunk

    @property
    def _llm_type(self) -> str:
        """Get the type of language model used by this chat model."""
        return "echoing-chat-model-advanced"

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        """Return a dictionary of identifying parameters.

        This information is used by the LangChain callback system for
        tracing, which makes it possible to monitor LLMs.
        """
        return {
            # The model name allows users to specify custom token counting
            # rules in LLM monitoring applications (e.g., in LangSmith users
            # can provide per token pricing for their model and monitor
            # costs for the given LLM.)
            "model_name": self.model_name,
        }
```

The chat model will implement the standard Runnable interface of LangChain:
```python
model = CustomChatModelAdvanced(n=3, model_name="my_custom_model")

model.invoke(
    [
        HumanMessage(content="hello!"),
        AIMessage(content="Hi there human!"),
        HumanMessage(content="Meow!"),
    ]
)
# AIMessage(content='Meo', response_metadata={'time_in_seconds': 3}, id='run-ddb42bd6-4fdd-4bd2-8be5-e11b67d3ac29-0')

model.invoke("hello")
# AIMessage(content='hel', response_metadata={'time_in_seconds': 3}, id='run-4d3cc912-44aa-454b-977b-ca02be06c12e-0')

model.batch(["hello", "goodbye"])
# [AIMessage(content='hel', response_metadata={'time_in_seconds': 3}, id='run-9620e228-1912-4582-8aa1-176813afec49-0'),
#  AIMessage(content='goo', response_metadata={'time_in_seconds': 3}, id='run-1ce8cdf8-6f75-448e-82f7-1bb4a121df93-0')]

for chunk in model.stream("cat"):
    print(chunk.content, end="|")
# c|a|t||

async for chunk in model.astream("cat"):
    print(chunk.content, end="|")
# c|a|t||

# Let's try to use the astream events API which will also help double check that all the callbacks were implemented!
async for event in model.astream_events("cat", version="v1"):
    print(event)
# {'event': 'on_chat_model_start', 'run_id': '125a2a16-b9cd-40de-aa08-8aa9180b07d0', 'name': 'CustomChatModelAdvanced', 'tags': [], 'metadata': {}, 'data': {'input': 'cat'}}
# {'event': 'on_chat_model_stream', 'run_id': '125a2a16-b9cd-40de-aa08-8aa9180b07d0', 'tags': [], 'metadata': {}, 'name': 'CustomChatModelAdvanced', 'data': {'chunk': AIMessageChunk(content='c', id='run-125a2a16-b9cd-40de-aa08-8aa9180b07d0')}}
# {'event': 'on_chat_model_stream', 'run_id': '125a2a16-b9cd-40de-aa08-8aa9180b07d0', 'tags': [], 'metadata': {}, 'name': 'CustomChatModelAdvanced', 'data': {'chunk': AIMessageChunk(content='a', id='run-125a2a16-b9cd-40de-aa08-8aa9180b07d0')}}
# {'event': 'on_chat_model_stream', 'run_id': '125a2a16-b9cd-40de-aa08-8aa9180b07d0', 'tags': [], 'metadata': {}, 'name': 'CustomChatModelAdvanced', 'data': {'chunk': AIMessageChunk(content='t', id='run-125a2a16-b9cd-40de-aa08-8aa9180b07d0')}}
# {'event': 'on_chat_model_stream', 'run_id': '125a2a16-b9cd-40de-aa08-8aa9180b07d0', 'tags': [], 'metadata': {}, 'name': 'CustomChatModelAdvanced', 'data': {'chunk': AIMessageChunk(content='', response_metadata={'time_in_sec': 3}, id='run-125a2a16-b9cd-40de-aa08-8aa9180b07d0')}}
# {'event': 'on_chat_model_end', 'name': 'CustomChatModelAdvanced', 'run_id': '125a2a16-b9cd-40de-aa08-8aa9180b07d0', 'tags': [], 'metadata': {}, 'data': {'output': AIMessageChunk(content='cat', response_metadata={'time_in_sec': 3}, id='run-125a2a16-b9cd-40de-aa08-8aa9180b07d0')}}    
```

### Response metadata

Many model providers include some metadata in their chat generation responses. This metadata can be accessed via the AIMessage.response_metadata: Dict attribute. Depending on the model provider and model configuration, this can contain information like token counts, logprobs, and more.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-turbo")
msg = llm.invoke([("human", "What's the oldest known example of cuneiform")])
msg.response_metadata
# {'token_usage': {'completion_tokens': 164,
#   'prompt_tokens': 17,
#   'total_tokens': 181},
#  'model_name': 'gpt-4-turbo',
#  'system_fingerprint': 'fp_76f018034d',
#  'finish_reason': 'stop',
#  'logprobs': None}
```
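
Because the contents of this dictionary vary by provider, it is safer to read it defensively. The sketch below uses only the standard library and the sample values shown above; `summarize_usage` is a hypothetical helper, and the key names are assumed to match the OpenAI-style output in this example rather than every provider.

```python
from typing import Any, Dict


def summarize_usage(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Extract a few commonly present fields, tolerating missing keys."""
    usage = metadata.get("token_usage") or {}
    return {
        "model": metadata.get("model_name", "unknown"),
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "finish_reason": metadata.get("finish_reason"),
    }


# Sample metadata copied from the output shown above.
sample = {
    "token_usage": {"completion_tokens": 164, "prompt_tokens": 17, "total_tokens": 181},
    "model_name": "gpt-4-turbo",
    "system_fingerprint": "fp_76f018034d",
    "finish_reason": "stop",
    "logprobs": None,
}

summary = summarize_usage(sample)
print(summary["prompt_tokens"] + summary["completion_tokens"])  # 181

# A provider that reports nothing still yields a usable summary.
print(summarize_usage({})["model"])  # unknown
```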

### LLMs

Large Language Models (LLMs) are a core component of LangChain. LangChain does not serve its own LLMs, but rather provides a standard interface for interacting with many different LLMs. To be specific, this interface is one that takes as input a string and returns a string.

There are many LLM providers (OpenAI, Cohere, Hugging Face, etc.); the LLM class is designed to provide a standard interface for all of them.

LLMs implement the Runnable interface, the basic building block of the LangChain Expression Language (LCEL). This means they support invoke, ainvoke, stream, astream, batch, abatch, astream_log calls. LLMs accept strings as inputs, or objects which can be coerced to string prompts, including List[BaseMessage] and PromptValue.

```python
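from langchain_openai import OpenAI

# A string-in, string-out completion model (requires OPENAI_API_KEY to be set).
llm = OpenAI(model="gpt-3.5-turbo-instruct")

llm.invoke(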
    "What are some theories about the relationship between unemployment and inflation?"
)
# '\n\n1. The Phillips Curve Theory: This suggests that there is an inverse relationship between unemployment and inflation, meaning that when unemployment is low, inflation will be higher, and when unemployment is high, inflation will be lower.\n\n2. The Monetarist Theory: This theory suggests that the relationship between unemployment and inflation is weak, and that changes in the money supply are more important in determining inflation.\n\n3. The Resource Utilization Theory: This suggests that when unemployment is low, firms are able to raise wages and prices in order to take advantage of the increased demand for their products and services. This leads to higher inflation.'

for chunk in llm.stream(
    "What are some theories about the relationship between unemployment and inflation?"
):
    print(chunk, end="", flush=True)

llm.batch(
    [
        "What are some theories about the relationship between unemployment and inflation?"
    ]
)

await llm.ainvoke(
    "What are some theories about the relationship between unemployment and inflation?"
)

async for chunk in llm.astream(
    "What are some theories about the relationship between unemployment and inflation?"
):
    print(chunk, end="", flush=True)

await llm.abatch(
    [
        "What are some theories about the relationship between unemployment and inflation?"
    ]
)

async for chunk in llm.astream_log(
    "What are some theories about the relationship between unemployment and inflation?"
):
    print(chunk)
```

### LLMs and LangSmith

All LLMs come with built-in LangSmith tracing. Just set the following environment variables:

```shell
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY=<your-api-key>
```

and any LLM invocation (whether it's nested in a chain or not) will automatically be traced. A trace will include inputs, outputs, latency, token usage, invocation params, environment params, and more.

### Custom LLM

Wrapping your LLM with the standard LLM interface allows you to use your LLM in existing LangChain programs with minimal code modifications. As a bonus, your LLM will automatically become a LangChain Runnable and will benefit from some optimizations out of the box, async support, the astream_events API, etc.

There are only two required things that a custom LLM needs to implement:
- _call: Takes in a string and some optional stop words, and returns a string. Used by invoke.
- _llm_type: A property that returns a string, used for logging purposes only.

Optional implementations:
- _identifying_params: Used to help with identifying the model and printing the LLM; should return a dictionary. This is a @property.
- _acall: Provides an async native implementation of _call, used by ainvoke.
- _stream: Method to stream the output token by token.
- _astream: Provides an async native implementation of _stream; in newer LangChain versions, defaults to _stream.

Let's implement a simple custom LLM that just returns the first n characters of the input.

```python
from typing import Any, Dict, Iterator, List, Mapping, Optional

from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.outputs import GenerationChunk


class CustomLLM(LLM):
    """A custom LLM that echoes the first `n` characters of the input.

    When contributing an implementation to LangChain, carefully document
    the model including the initialization parameters, include
    an example of how to initialize the model and include any relevant
    links to the underlying models documentation or API.

    Example:

        .. code-block:: python

            model = CustomLLM(n=2)
            result = model.invoke("hello")
            result = model.batch(["hello", "world"])
    """

    n: int
    """The number of characters from the last message of the prompt to be echoed."""

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Run the LLM on the given input.

        Override this method to implement the LLM logic.

        Args:
            prompt: The prompt to generate from.
            stop: Stop words to use when generating. Model output is cut off at the
                first occurrence of any of the stop substrings.
                If stop tokens are not supported consider raising NotImplementedError.
            run_manager: Callback manager for the run.
            **kwargs: Arbitrary additional keyword arguments. These are usually passed
                to the model provider API call.

        Returns:
            The model output as a string. Actual completions SHOULD NOT include the prompt.
        """
        if stop is not None:
            raise ValueError("stop kwargs are not permitted.")
        return prompt[: self.n]

    def _stream(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> Iterator[GenerationChunk]:
        """Stream the LLM on the given prompt.

        This method should be overridden by subclasses that support streaming.

        If not implemented, the default behavior of calls to stream will be to
        fallback to the non-streaming version of the model and return
        the output as a single chunk.

        Args:
            prompt: The prompt to generate from.
            stop: Stop words to use when generating. Model output is cut off at the
                first occurrence of any of these substrings.
            run_manager: Callback manager for the run.
            **kwargs: Arbitrary additional keyword arguments. These are usually passed
                to the model provider API call.

        Returns:
            An iterator of GenerationChunks.
        """
        for char in prompt[: self.n]:
            chunk = GenerationChunk(text=char)
            if run_manager:
                run_manager.on_llm_new_token(chunk.text, chunk=chunk)

            yield chunk

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        """Return a dictionary of identifying parameters."""
        return {
            # The model name allows users to specify custom token counting
            # rules in LLM monitoring applications (e.g., in LangSmith users
            # can provide per token pricing for their model and monitor
            # costs for the given LLM.)
            "model_name": "CustomChatModel",
        }

    @property
    def _llm_type(self) -> str:
        """Get the type of language model used by this chat model. Used for logging purposes only."""
        return "custom"
```

This LLM will implement the standard Runnable interface of LangChain:

```python
llm = CustomLLM(n=5)
print(llm)
# Params: {'model_name': 'CustomChatModel'}

llm.invoke("This is a foobar thing")
# 'This '

await llm.ainvoke("world")
# 'world'

llm.batch(["woof woof woof", "meow meow meow"])
# ['woof ', 'meow ']

await llm.abatch(["woof woof woof", "meow meow meow"])
# ['woof ', 'meow ']

async for token in llm.astream("hello"):
    print(token, end="|", flush=True)
# h|e|l|l|o|
```

Let's confirm that it integrates nicely with other LangChain APIs:

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [("system", "you are a bot"), ("human", "{input}")]
)

llm = CustomLLM(n=7)
chain = prompt | llm

idx = 0
async for event in chain.astream_events({"input": "hello there!"}, version="v1"):
    print(event)
    idx += 1
    if idx > 7:
        # Truncate
        break
# {'event': 'on_chain_start', 'run_id': '05f24b4f-7ea3-4fb6-8417-3aa21633462f', 'name': 'RunnableSequence', 'tags': [], 'metadata': {}, 'data': {'input': {'input': 'hello there!'}}}
# {'event': 'on_prompt_start', 'name': 'ChatPromptTemplate', 'run_id': '7e996251-a926-4344-809e-c425a9846d21', 'tags': ['seq:step:1'], 'metadata': {}, 'data': {'input': {'input': 'hello there!'}}}
# {'event': 'on_prompt_end', 'name': 'ChatPromptTemplate', 'run_id': '7e996251-a926-4344-809e-c425a9846d21', 'tags': ['seq:step:1'], 'metadata': {}, 'data': {'input': {'input': 'hello there!'}, 'output': ChatPromptValue(messages=[SystemMessage(content='you are a bot'), HumanMessage(content='hello there!')])}}
# {'event': 'on_llm_start', 'name': 'CustomLLM', 'run_id': 'a8766beb-10f4-41de-8750-3ea7cf0ca7e2', 'tags': ['seq:step:2'], 'metadata': {}, 'data': {'input': {'prompts': ['System: you are a bot\nHuman: hello there!']}}}
# {'event': 'on_llm_stream', 'name': 'CustomLLM', 'run_id': 'a8766beb-10f4-41de-8750-3ea7cf0ca7e2', 'tags': ['seq:step:2'], 'metadata': {}, 'data': {'chunk': 'S'}}
# {'event': 'on_chain_stream', 'run_id': '05f24b4f-7ea3-4fb6-8417-3aa21633462f', 'tags': [], 'metadata': {}, 'name': 'RunnableSequence', 'data': {'chunk': 'S'}}
# {'event': 'on_llm_stream', 'name': 'CustomLLM', 'run_id': 'a8766beb-10f4-41de-8750-3ea7cf0ca7e2', 'tags': ['seq:step:2'], 'metadata': {}, 'data': {'chunk': 'y'}}
# {'event': 'on_chain_stream', 'run_id': '05f24b4f-7ea3-4fb6-8417-3aa21633462f', 'tags': [], 'metadata': {}, 'name': 'RunnableSequence', 'data': {'chunk': 'y'}}
```
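
The `on_llm_start` event above shows that the chat messages reach the string-based LLM as a single flattened prompt, which is why the first streamed characters are `S` and `y` (from `System:`) rather than part of the user's text. A rough standard-library sketch of that flattening follows; `flatten_messages` and `ROLE_LABELS` are hypothetical names, and the label mapping is an assumption chosen to match the prefixes visible in the event.

```python
from typing import List, Tuple

# Hypothetical role-to-label mapping, chosen to match the event output above.
ROLE_LABELS = {"system": "System", "human": "Human", "ai": "AI"}


def flatten_messages(messages: List[Tuple[str, str]]) -> str:
    """Join (role, content) pairs into one newline-separated prompt string."""
    return "\n".join(f"{ROLE_LABELS[role]}: {content}" for role, content in messages)


prompt_text = flatten_messages(
    [("system", "you are a bot"), ("human", "hello there!")]
)
print(repr(prompt_text))  # 'System: you are a bot\nHuman: hello there!'

# An echo model with n=7 would therefore return the start of the role label.
print(prompt_text[:7])  # System:
```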

### LLM and Caching

LangChain provides an optional caching layer for LLMs. It can save you money by reducing the number of API calls you make to the LLM provider, if you're often requesting the same completion multiple times.

```python
from langchain.cache import InMemoryCache
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI

# To make the caching really obvious, let's use a slower model.
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", n=2, best_of=2)

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
# (invoke replaces the deprecated predict method)
llm.invoke("Tell me a joke")

# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
```

### LLM and Streaming

All LLMs implement the Runnable interface, which comes with default implementations of all methods, i.e., ainvoke, batch, abatch, stream, astream. This gives all LLMs basic support for streaming.

Streaming support defaults to returning an Iterator (or AsyncIterator in the case of async streaming) of a single value, the final result returned by the underlying LLM provider. This obviously doesn't give you token-by-token streaming, which requires native support from the LLM provider, but ensures your code that expects an iterator of tokens can work for any of our LLM integrations.

```python
from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)
for chunk in llm.stream("Write me a song about sparkling water."):
    print(chunk, end="", flush=True)
```
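
The single-value fallback described above can be illustrated without any LangChain imports. In this sketch, `EchoBase` and `EchoStreaming` are hypothetical classes: the base `stream` method yields the complete result as one chunk unless a subclass overrides `_stream` with a real token-by-token implementation.

```python
from typing import Iterator


class EchoBase:
    """Hypothetical base class whose `stream` falls back to a single chunk."""

    def _call(self, prompt: str) -> str:
        return prompt.upper()

    def _stream(self, prompt: str) -> Iterator[str]:
        raise NotImplementedError

    def stream(self, prompt: str) -> Iterator[str]:
        if type(self)._stream is EchoBase._stream:
            # No native streaming: yield the whole result as one chunk.
            yield self._call(prompt)
        else:
            yield from self._stream(prompt)


class EchoStreaming(EchoBase):
    """Subclass with native streaming: one chunk per character."""

    def _stream(self, prompt: str) -> Iterator[str]:
        for char in self._call(prompt):
            yield char


print(list(EchoBase().stream("hi")))       # ['HI']
print(list(EchoStreaming().stream("hi")))  # ['H', 'I']
```

Either way the caller iterates over chunks, so code written against the streaming interface keeps working when the provider cannot stream.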

### The HuggingFace LLM (InferenceClient and AsyncInferenceClient)

At the core of both the HuggingFace Embeddings and HuggingFace Text Generation models are the low-level InferenceClient and AsyncInferenceClient. These are the classes that actually make the HTTP requests: to HuggingFace Text Embeddings Inference for embeddings and to HuggingFace Text Generation Inference for text generation.

The InferenceClient aims to provide a unified experience to perform inference. The InferenceClient class can be used seamlessly with either the (free) HuggingFace Inference API, self-hosted Inference Endpoints (such as HF TGI or HF TEI running on your own servers), or third-party Inference Providers.

The InferenceClient consumes certain key arguments:
- model: The model parameter of type string is the most crucial parameter. The model parameter can be:
    - a model id hosted on the HuggingFace Hub, e.g. meta-llama/Meta-Llama-3-8B-Instruct.
    - a URL to a deployed Inference Endpoint. If you have HuggingFace Text Generation Inference (HF TGI) or HuggingFace Text Embeddings Inference (HF TEI) deployed on your own servers, for example in a Docker container, pass the URL of your endpoint to the InferenceClient, e.g. http://3.210.60.7:8080/. Pass only the root URL and port, with no additional URL segments such as /embed, /generate, or /v1/chat/completions, because the appropriate segment is appended automatically depending on the other arguments.
    - a None value. If the model is None, then a recommended model is automatically selected for the task.
- token: The actual meaning of the token parameter depends on the model parameter:
    - if the model parameter is a model id hosted on the HuggingFace Hub, the token is your HuggingFace token, which is used to authorize your access to the HuggingFace Hub.
    - if the model parameter is a URL to a self-hosted inference server, the token can be whatever token you use to authenticate against that server.
- timeout: The maximum number of seconds to wait for a response from the server. Defaults to None, meaning it will loop until the server is available.
- headers: Additional headers to send to the server. By default only the authorization and user-agent headers are sent.
- base_url: Base URL to run inference. This is a duplicated argument from `model` to make InferenceClient follow the same pattern as `openai.OpenAI` client. Cannot be used if `model` is set. Defaults to None.
- api_key: Token to use for authentication. This is a duplicated argument from `token` to make InferenceClient follow the same pattern as `openai.OpenAI` client. Cannot be used if `token` is set. Defaults to None.
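
The aliasing rules for `model`/`base_url` and `token`/`api_key` can be summarized in a few lines. The helper below is a hypothetical illustration written for this note, not part of `huggingface_hub`; it only encodes the "cannot be used together" rule stated in the list above, and the URL and key values are placeholders.

```python
from typing import Optional, Tuple


def resolve_client_args(
    model: Optional[str] = None,
    base_url: Optional[str] = None,
    token: Optional[str] = None,
    api_key: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: collapse the aliased arguments into (model, token)."""
    if model is not None and base_url is not None:
        raise ValueError("Pass either `model` or `base_url`, not both.")
    if token is not None and api_key is not None:
        raise ValueError("Pass either `token` or `api_key`, not both.")
    resolved_model = model if model is not None else base_url
    resolved_token = token if token is not None else api_key
    return resolved_model, resolved_token


# OpenAI-style spelling of the arguments (placeholder values).
print(resolve_client_args(base_url="http://localhost:8080", api_key="secret"))
# ('http://localhost:8080', 'secret')
```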

Historically, the InferenceClient provided a post method for making direct POST requests to the inference server. That method handled too much: depending on the task passed in, it would post to text generation inference or text embeddings inference (for example, a task of feature-extraction would post to the embeddings endpoint), and it also determined the actual endpoint from the model variable. The post method is now deprecated; the InferenceClient expects you to use task methods such as `InferenceClient.chat_completion` instead.

Here is a list of the task methods supported by the InferenceClient:
- The **audio_classification** method in InferenceClient is used to classify audio content into predefined categories using a pre-trained audio classification model.
    - Speech Emotion Recognition
        - Detects emotions in speech (e.g., happy, sad, neutral).
        - Example models: superb/wav2vec2-base-superb-er.
    - Sound Event Detection
        - Identifies different types of sounds (e.g., sirens, applause, barking).
        - Example models: facebook/wav2vec2-large-xlsr-53.
    - Speaker Identification
        - Recognizes a person’s voice from an audio clip.
        - Example models: microsoft/wavlm-base-plus-sv.
- The **audio_to_audio** method in InferenceClient processes an audio input and generates a modified audio output. It is used for tasks where an input audio file is transformed into another audio file.
    - Speech Enhancement
        - Removes background noise, enhances voice clarity.
        - Example models: speechbrain/mtl-mimic-voicebank
    - Source Separation
        - Isolates different sound sources (e.g., separating vocals from instruments in music).
        - Example models: facebook/htdemucs
    - Speech Conversion (Voice Cloning / Style Transfer)
        - Converts speech from one voice to another.
        - Example models: microsoft/speecht5-vc
    - Audio Super-Resolution
        - Improves the quality of low-resolution audio.
        - Example models: facebook/denoiser
- The **automatic_speech_recognition** (ASR) method in InferenceClient performs speech-to-text transcription. It takes an audio file as input and returns the transcribed text.
    - Converts spoken words into text.
    - Example models: openai/whisper-large-v2
    - Real-Time Captions for Live Audio
        - Used for subtitles in videos or real-time captioning.
    - Multilingual Transcription
        - Some models support transcription in multiple languages.
- The **chat_completion** method in InferenceClient is used to generate text responses in a conversational format using a specified language model. It closely follows OpenAI's Chat Completions API format and is compatible with different providers.
    - messages: messages of type List[Dict] is a list of message objects representing the conversation history (roles: system, user, assistant).
    - model: of type Optional[str] (optional because the model can also be passed in through the initializer); the model to use for generating completions (e.g., "meta-llama/Meta-Llama-3-8B-Instruct"). If not provided, a default model is selected.
    - stream: If True, enables streaming responses, returning tokens one by one. Default is False.
    - frequency_penalty: Penalizes repeated tokens (range: -2.0 to 2.0), where positive values discourage repetition. Default is 0.0.
    - logit_bias: of type Optional[List[float]]; adjusts the likelihood of specific tokens appearing in the output.
    - logprobs: If True, returns log probabilities of each output token.
    - max_tokens: Maximum number of tokens the model should generate. Default is 100.
    - n: The number of completions to generate for a single input prompt.
        - If n=1 (default), the model generates one response.
        - If n=3, the model generates three different responses for the same prompt.
        - The generated responses differ depending on the model's randomness settings (temperature, top_p, etc.).
        - Useful for A/B testing model responses, generating diverse answers to the same question, or selecting the best response from multiple completions.
    - presence_penalty: Discourages the model from repeating topics. Similar to frequency_penalty.
    - response_format: Specifies structured output format, such as JSON Schema or regex constraints.
    - seed: Seed for reproducible sampling.
    - stop: Up to four strings that trigger the end of the response.
    - temperature: Controls randomness in generation (0 = deterministic, 2 = high randomness). Default is 1.0.
    - top_logprobs: Specifies the number of top probable tokens to return. Must be used with logprobs=True.
    - top_p: Nucleus sampling: selects tokens from the top p probability mass (range 0-1). Default is 1.0.
    - tool_choice: Controls which tool, if any, the model should call (e.g., "auto" lets the model decide).
    - tool_prompt: Additional prompt text before tool execution.
    - tools: Defines a list of external tools the model can call.

How chat_completion Works (Step-by-Step)
1) Determine the Inference Provider
```python
# Identifies which backend provider (e.g., Hugging Face, Together AI, OpenAI-compatible) will handle the request.
provider_helper = get_provider_helper(self.provider, task="conversational")
```
2) Set Up Model Configuration
```python
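# Determines where to send the request and which model name goes in the payload.
# For the request target, base_url takes precedence, then the client's model, then the model passed to the call.
# For the payload, a model passed explicitly to the call overrides the client's model.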
model_id_or_url = self.base_url or self.model or model
payload_model = model or self.model
```
3) Prepare Request Payload
```python
# Constructs the request dictionary with all the necessary parameters.
parameters = {
    "model": payload_model,
    "frequency_penalty": frequency_penalty,
    "logit_bias": logit_bias,
    "logprobs": logprobs,
    "max_tokens": max_tokens,
    "n": n,
    "presence_penalty": presence_penalty,
    "response_format": response_format,
    "seed": seed,
    "stop": stop,
    "temperature": temperature,
    "tool_choice": tool_choice,
    "tool_prompt": tool_prompt,
    "tools": tools,
    "top_logprobs": top_logprobs,
    "top_p": top_p,
    "stream": stream,
    "stream_options": stream_options,
}
```
4) Send the API Request
```python
# Formats the request and sends it via _inner_post.
request_parameters = provider_helper.prepare_request(
    inputs=messages,
    parameters=parameters,
    headers=self.headers,
    model=model_id_or_url,
    api_key=self.token,
)
data = self._inner_post(request_parameters, stream=stream)
```
5) Handle the Response
```python
# If streaming is enabled, it returns tokens one by one.
# Otherwise, it returns a complete structured response.
if stream:
    return _stream_chat_completion_response(data)
return ChatCompletionOutput.parse_obj_as_instance(data)
```
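
The precedence rules in step 2 are easy to misread, so here is a standalone sketch that reproduces just those two expressions as a plain function. `resolve_model` and its parameter names are hypothetical; the `or` chains mirror the snippet in step 2, and the URL and model ids are placeholders.

```python
from typing import Optional, Tuple


def resolve_model(
    client_base_url: Optional[str],
    client_model: Optional[str],
    call_model: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (where the request is routed, model name placed in the payload)."""
    model_id_or_url = client_base_url or client_model or call_model
    payload_model = call_model or client_model
    return model_id_or_url, payload_model


# Client configured with a self-hosted endpoint, model named per call.
print(resolve_model("http://localhost:8080", None, "my-model"))
# ('http://localhost:8080', 'my-model')

# Client configured with a Hub model id and overridden per call: routing
# still uses the client's model, while the payload names the override.
print(resolve_model(None, "org/model-a", "org/model-b"))
# ('org/model-a', 'org/model-b')
```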

Example Usage of chat_completion
```python
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

response = client.chat_completion(messages, max_tokens=50)
print(response.choices[0].message.content)
# The capital of France is Paris.
```

Example Usage of chat_completion: Streaming Responses
```python
response = client.chat_completion(messages, max_tokens=50, stream=True)

for chunk in response:
    print(chunk.choices[0].delta.content)
# The
# capital
# of
# France
# is
# Paris.
```
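When the full text is needed after streaming, the deltas can be concatenated. The sketch below uses stand-in chunk objects (built with a hypothetical `make_chunk` helper) that imitate the `chunk.choices[0].delta.content` shape from the example above, so it runs without a network call. It assumes that some chunks may carry an empty or `None` delta and skips those.

```python
from types import SimpleNamespace
from typing import Iterable


def make_chunk(text):
    """Build a stand-in for one streamed chunk (hypothetical test double)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the text deltas of a stream into one string."""
    parts = []
    for chunk in chunks:
        piece = chunk.choices[0].delta.content
        if piece:  # skip None or empty deltas
            parts.append(piece)
    return "".join(parts)


fake_stream = [
    make_chunk("The capital"),
    make_chunk(" of France"),
    make_chunk(None),
    make_chunk(" is Paris."),
]
print(collect_stream(fake_stream))
# The capital of France is Paris.
```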

Using Temperature and Top-p Sampling
```python
# temperature=0.7 → Slightly sharpens the token distribution: less random than 1.0, but still varied.
# top_p=0.9 → Samples only from the smallest set of most probable tokens whose cumulative probability reaches 0.9.
response = client.chat_completion(messages, temperature=0.7, top_p=0.9)
```
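To make the two parameters concrete, here is a small pure-Python sketch of temperature scaling and top-p (nucleus) filtering. It illustrates the general technique only; the function names are hypothetical, and the actual sampling code inside any given inference server may differ.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by the temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalize


probs = {"Paris": 0.6, "Lyon": 0.25, "Nice": 0.1, "Rome": 0.05}
print(sorted(top_p_filter(probs, top_p=0.9)))
# ['Lyon', 'Nice', 'Paris']
```

With `top_p=0.9`, the least likely token ("Rome") is excluded because the first three tokens already cover 95% of the probability mass. A lower temperature makes the distribution produced by `apply_temperature` more peaked.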

Using External Tools (Function Calling)
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Retrieve weather information.",
            # "parameters" is a JSON Schema object describing the arguments.
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "units": {"type": "string", "enum": ["metric", "imperial"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat_completion(
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=50,
)
```
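When the model decides to call a tool, the application still has to execute it. The following is a rough sketch of that dispatch step. It assumes the tool call exposes `function.name` and `function.arguments` (as a dict or a JSON string); the stand-in object, the `TOOL_REGISTRY` mapping, and the local `get_weather` implementation are all hypothetical.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, units: str = "metric") -> str:
    """Hypothetical local implementation of the `get_weather` tool."""
    return f"Weather for {location} in {units} units"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> str:
    """Look up the requested function by name and call it with its arguments."""
    arguments = tool_call.function.arguments
    if isinstance(arguments, str):  # arguments may arrive as a JSON string
        arguments = json.loads(arguments)
    return TOOL_REGISTRY[tool_call.function.name](**arguments)


# Stand-in for response.choices[0].message.tool_calls[0] (assumed shape).
fake_call = SimpleNamespace(
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}')
)
print(run_tool_call(fake_call))
# Weather for Paris in metric units
```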

Using JSON Output Format
```python
response_format = {
    "type": "json",
    "value": {
        "properties": {
            "city": {"type": "string"},
            "temperature": {"type": "integer"},
        },
        "required": ["city", "temperature"],
    },
}

response = client.chat_completion(
    messages=messages, response_format=response_format, max_tokens=100
)
print(response.choices[0].message.content)
# {
#     "city": "Paris",
#     "temperature": 15
# }
```
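Whether the schema is strictly enforced can depend on the backend, so a client-side check of the returned text may be a useful safety net. The sketch below is one way to do that with the standard library; `parse_city_report` is a hypothetical helper written for the `city`/`temperature` schema above.

```python
import json


def parse_city_report(content: str) -> dict:
    """Parse the model's JSON text and check the fields required by the schema."""
    data = json.loads(content)
    for key, expected_type in (("city", str), ("temperature", int)):
        if key not in data:
            raise ValueError(f"missing required key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} should be of type {expected_type.__name__}")
    return data


print(parse_city_report('{"city": "Paris", "temperature": 15}'))
# {'city': 'Paris', 'temperature': 15}
```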

Beyond chat_completion, the InferenceClient supports other task methods, such as feature_extraction:
- The **feature_extraction** method is used to generate embeddings for a given text input. It converts raw text into numerical representations (embeddings), which can be used in various machine learning tasks.
    - text: The input text to be embedded.
    - model: The embedding model to use. It can be a model ID hosted on Hugging Face Hub (e.g., "BAAI/bge-large-en-v1.5") or a URL to a self-hosted Inference Endpoint (e.g., "http://100.28.34.190:8070/").
    - normalize: If True, normalizes the generated embeddings (makes them unit vectors). Only available on servers using Text-Embedding-Inference (TEI).
    - prompt_name: The name of the prompt used for encoding. It must match a key in the Sentence Transformers prompts dictionary.
    - truncate: If True, truncates the input text to fit within the model’s token limit. Available on TEI servers.
    - truncation_direction: Determines which side of the input text should be truncated ("Left" or "Right").

How feature_extraction Works (Step-by-Step)
1) Get the Provider Helper
```python
# Retrieves the correct inference provider (e.g., Hugging Face, a self-hosted endpoint) for handling the request.
provider_helper = get_provider_helper(self.provider, task="feature-extraction")
```

2) Prepare the Request Parameters
```python
# Constructs the JSON payload that will be sent to the server.
# Specifies:
#     Text input (text)
#     Embedding normalization (normalize)
#     Truncation settings (truncate, truncation_direction)
#     Which embedding model to use (model or self.model)
request_parameters = provider_helper.prepare_request(
    inputs=text,
    parameters={
        "normalize": normalize,
        "prompt_name": prompt_name,
        "truncate": truncate,
        "truncation_direction": truncation_direction,
    },
    headers=self.headers,
    model=model or self.model,
    api_key=self.token,
)
```

3) Send the Request to the Server
```python
# Sends the request and waits for the response from the model.
response = self._inner_post(request_parameters)
```

4) Convert the Response to a NumPy Array
```python
# The byte response from the server is converted into a NumPy array of float32 values.
# This array represents the embedding of the input text.
np = _import_numpy()
return np.array(_bytes_to_dict(response), dtype="float32")
```

Basic Example: Generating an Embedding
```python
# The output is a NumPy array containing the embedding of "Hi, who are you?".
from huggingface_hub import InferenceClient

client = InferenceClient()
embedding = client.feature_extraction("Hi, who are you?")

print(embedding.shape)  # Example output: (1, 1024)
```

Using a Specific Embedding Model
```python
# Uses BAAI's BGE Large model to generate embeddings.
embedding = client.feature_extraction(
    text="What is the meaning of life?",
    model="BAAI/bge-large-en-v1.5"  # Hugging Face hosted embedding model
)
```

Normalizing Embeddings
```python
# If normalize=True, the output embeddings become unit vectors (||v|| = 1).
# Useful for cosine similarity comparisons.
embedding = client.feature_extraction(
    text="Find similar questions",
    normalize=True
)
```
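The reason unit vectors are convenient for similarity search is that their dot product equals their cosine similarity. The following pure-Python sketch checks this on two small made-up vectors (real embeddings would have hundreds of dimensions).

```python
import math
from typing import List


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
print(round(cosine_similarity(a, b), 6))          # 0.96
print(round(dot(normalize(a), normalize(b)), 6))  # 0.96
```

Once embeddings are normalized, a plain dot product is enough for ranking, which avoids recomputing norms for every comparison.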

Using a Prompt-Based Embedding Model
```python
# Example: if the model's Sentence Transformers prompts dictionary contains
#     "query": "query: " → the prompt text is prepended, so the model embeds "query: What is the capital of France?"
#     This helps retrieval tasks where queries and documents are embedded differently.
embedding = client.feature_extraction(
    text="What is the capital of France?",
    prompt_name="query"
)
```

Here is a list of additional task methods supported by the InferenceClient:
- image_classification
- image_segmentation
- image_to_image
- image_to_text
- object_detection
- summarization
- table_question_answering
- tabular_classification
- tabular_regression
- text_classification
- text_to_image
- text_to_video
- text_to_speech

And many more task methods.

Health Checks

The InferenceClient can check the health of a deployed endpoint. Health checks are only available for Inference Endpoints powered by Text-Generation-Inference (TGI) or Text-Embedding-Inference (TEI).
```python
from huggingface_hub import InferenceClient
client = InferenceClient("https://jzgu0buei5.us-east-1.aws.endpoints.huggingface.cloud")
client.health_check()
# True
```

Note: in addition to the InferenceClient, an AsyncInferenceClient is also available.

### Output Parsers

Output parsers are responsible for taking the output of an LLM and transforming it to a more suitable format. This is very useful when you are using LLMs to generate any form of structured data. Besides having a large collection of different types of output parsers, one distinguishing benefit of LangChain OutputParsers is that many of them support streaming.

Language models output text. But many times you may want to get more structured information than just text back. This is where output parsers come in. Output parsers are classes that help structure language model responses. There are two main methods an output parser must implement:
- "Get format instructions": A method which returns a string containing instructions for how the output of a language model should be formatted.
- "Parse": A method which takes in a string (assumed to be the response from a language model) and parses it into some structure.
And one optional method:
- "Parse with prompt": A method which takes in a string (assumed to be the response from a language model) and a prompt (assumed to be the prompt that generated such a response) and parses it into some structure. The prompt is largely provided in the event the OutputParser wants to retry or fix the output in some way, and needs information from the prompt to do so.
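The three methods above can be sketched with a toy class. `BulletListParser` is a hypothetical stand-in written in plain Python; it does not inherit from any LangChain base class and only illustrates the division of labor between the methods.

```python
from typing import List


class BulletListParser:
    """Toy parser with the same three methods (not a LangChain class)."""

    def get_format_instructions(self) -> str:
        # Text meant to be injected into the prompt.
        return "Return each item on its own line, prefixed with '- '."

    def parse(self, text: str) -> List[str]:
        # Turn the raw model text into a list of items.
        items = [line[2:].strip() for line in text.splitlines() if line.startswith("- ")]
        if not items:
            raise ValueError(f"No bullet items found in: {text!r}")
        return items

    def parse_with_prompt(self, completion: str, prompt: str) -> List[str]:
        # A fuller implementation could use `prompt` to retry; here it is only
        # attached to the error message.
        try:
            return self.parse(completion)
        except ValueError as err:
            raise ValueError(f"{err} (prompt was: {prompt!r})") from err


bullet_parser = BulletListParser()
print(bullet_parser.parse("- red\n- green\n- blue"))
# ['red', 'green', 'blue']
```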

Below we go over the main type of output parser, the PydanticOutputParser.

```python
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import OpenAI

model = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0.0)


# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field


# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# And a query intended to prompt a language model to populate the data structure.
prompt_and_model = prompt | model
output = prompt_and_model.invoke({"query": "Tell me a joke."})
parser.invoke(output)
# Joke(setup='Why did the chicken cross the road?', punchline='To get to the other side!')
```

### Output Parsers and LCEL

Output parsers implement the Runnable interface, the basic building block of the LangChain Expression Language (LCEL). This means they support invoke, ainvoke, stream, astream, batch, abatch, astream_log calls.

Output parsers accept a string or BaseMessage as input and can return an arbitrary type.

```python
parser.invoke(output)
```

Instead of manually invoking the parser, we also could've just added it to our Runnable sequence:

```python
chain = prompt | model | parser
chain.invoke({"query": "Tell me a joke."})
# Joke(setup='Why did the chicken cross the road?', punchline='To get to the other side!') 
```

While all parsers support the streaming interface, only certain parsers can stream through partially parsed objects, since this is highly dependent on the output type. Parsers which cannot construct partial objects will simply yield the fully parsed output.

The SimpleJsonOutputParser for example can stream through partial outputs:

```python
from langchain.output_parsers.json import SimpleJsonOutputParser

json_prompt = PromptTemplate.from_template(
    "Return a JSON object with an `answer` key that answers the following question: {question}"
)
json_parser = SimpleJsonOutputParser()
json_chain = json_prompt | model | json_parser

list(json_chain.stream({"question": "Who invented the microscope?"}))
# [{},
#  {'answer': ''},
#  {'answer': 'Ant'},
#  {'answer': 'Anton'},
#  {'answer': 'Antonie'},
#  {'answer': 'Antonie van'},
#  {'answer': 'Antonie van Lee'},
#  {'answer': 'Antonie van Leeu'},
#  {'answer': 'Antonie van Leeuwen'},
#  {'answer': 'Antonie van Leeuwenho'},
#  {'answer': 'Antonie van Leeuwenhoek'}]
```

While the PydanticOutputParser cannot:

```python
list(chain.stream({"query": "Tell me a joke."}))
# [Joke(setup='Why did the chicken cross the road?', punchline='To get to the other side!')]
```
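To give an intuition for how partial objects can be produced from an incomplete stream, here is a simplified sketch. It is not LangChain's implementation: it only handles a flat JSON object whose last value is a string, and the `stream_partial_json` name is hypothetical. After each chunk it tries to complete the buffered text with a closing quote and/or brace and yields the result if it parses.

```python
import json
from typing import Iterable, Iterator


def stream_partial_json(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield a dict each time the accumulated text can be completed into valid JSON."""
    buffer = ""
    last = None
    for chunk in chunks:
        buffer += chunk
        # Try closing an open string and/or object; keep the first that parses.
        for closer in ("", "}", '"}'):
            try:
                parsed = json.loads(buffer + closer)
            except json.JSONDecodeError:
                continue
            if parsed != last:  # avoid yielding the same object twice
                last = parsed
                yield parsed
            break


tokens = ['{"answer": "', "Antonie", " van", ' Leeuwenhoek"', "}"]
print(list(stream_partial_json(tokens)))
# [{'answer': ''}, {'answer': 'Antonie'}, {'answer': 'Antonie van'}, {'answer': 'Antonie van Leeuwenhoek'}]
```

A Pydantic model, by contrast, generally cannot be validated until all required fields are present, which is why that parser yields only the final object.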

### Custom Output Parsers

In some situations you may want to implement a custom parser to structure the model output into a custom format.

There are two ways to implement a custom parser:
- Using RunnableLambda (non-streaming) or RunnableGenerator (streaming) in LCEL -- we strongly recommend this for most use cases
- By inheriting from one of the base classes for output parsing -- this is the hard way of doing things

Example using Runnable Lambdas and Generators:

Here, we will make a simple parser that inverts the case of the output from the model. For example, if the model outputs: "Meow", the parser will produce "mEOW".

```python
from typing import Iterable

from langchain_anthropic.chat_models import ChatAnthropic
from langchain_core.messages import AIMessage, AIMessageChunk

model = ChatAnthropic(model_name="claude-2.1")


def parse(ai_message: AIMessage) -> str:
    """Parse the AI message."""
    return ai_message.content.swapcase()


chain = model | parse
chain.invoke("hello")
# 'hELLO!'
```

LCEL automatically upgrades the function parse to RunnableLambda(parse) when composed using a | syntax. If you don't like that you can manually import RunnableLambda and then run: parse = RunnableLambda(parse).

Streaming does not work with this implementation because the parser (parse) receives the complete response: all chunks are aggregated before the case inversion is applied, so nothing can be returned incrementally. This defeats the purpose of streaming. The stream below yields a single chunk containing the whole response:

```python
for chunk in chain.stream("tell me about yourself in one sentence"):
    print(chunk, end="|", flush=True)
# i'M cLAUDE, AN ai ASSISTANT CREATED BY aNTHROPIC TO BE HELPFUL, HARMLESS, AND HONEST.|
```

In a streaming scenario, the output needs to be generated chunk by chunk as the input becomes available. When using RunnableLambda (automatically applied by LangChain when | is used), the parse function is wrapped in a non-streaming runnable. The wrapping assumes the function processes a complete input and produces a single output, which is incompatible with streaming.

To enable streaming, the parser must:
- Accept a stream of chunks (e.g., Iterable[AIMessageChunk]) instead of a complete AIMessage.
- Yield processed chunks incrementally as they are received.

The corrected implementation uses RunnableGenerator, which is designed for streaming tasks. This allows the parser to process and yield each chunk of the response as it arrives.

Example:

```python
from typing import Iterable
from langchain_core.runnables import RunnableGenerator
from langchain_anthropic.chat_models import ChatAnthropic
from langchain_core.messages import AIMessageChunk

# Define the model
model = ChatAnthropic(model_name="claude-2.1")

# Define the streaming parser
def streaming_parse(chunks: Iterable[AIMessageChunk]) -> Iterable[str]:
    for chunk in chunks:
        # Process each chunk and yield the result incrementally
        yield chunk.content.swapcase()

# Wrap the parser in a RunnableGenerator for streaming
streaming_parse = RunnableGenerator(streaming_parse)

# Combine the model and the streaming parser
chain = model | streaming_parse

# Use the chain in a streaming scenario
for chunk in chain.stream("tell me about yourself in one sentence"):
    print(chunk, end="|", flush=True)
```

Another approach to implementing a parser is to inherit from BaseOutputParser, BaseGenerationOutputParser, or another one of the base parsers, depending on what you need to do. In general, we do not recommend this approach for most use cases, as it results in more code to write without significant benefits.

### Output Parsers: CSV parser

This output parser can be used when you want to return a list of comma-separated items.

```python
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

output_parser = CommaSeparatedListOutputParser()

format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    template="List five {subject}.\n{format_instructions}",
    input_variables=["subject"],
    partial_variables={"format_instructions": format_instructions},
)

model = ChatOpenAI(temperature=0)

chain = prompt | model | output_parser

chain.invoke({"subject": "ice cream flavors"})
# ['Vanilla',
#  'Chocolate',
#  'Strawberry',
#  'Mint Chocolate Chip',
#  'Cookies and Cream']

for s in chain.stream({"subject": "ice cream flavors"}):
    print(s)
```
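Conceptually, the parsing step amounts to splitting one line of text on commas and trimming whitespace. The sketch below shows that idea with the standard `csv` module; it is an illustration under that assumption, not LangChain's actual code, and `parse_comma_separated` is a hypothetical name.

```python
import csv
from io import StringIO
from typing import List


def parse_comma_separated(text: str) -> List[str]:
    """Split comma-separated text into trimmed items, respecting quoted fields."""
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    return [item.strip() for row in reader for item in row]


print(parse_comma_separated("Vanilla, Chocolate, Strawberry"))
# ['Vanilla', 'Chocolate', 'Strawberry']
```

Using `csv.reader` rather than `str.split(",")` keeps items that contain a comma intact as long as the model quotes them.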

### Output Parsers: Datetime parser

This OutputParser can be used to parse LLM output into datetime format.

```python
from langchain.output_parsers import DatetimeOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

output_parser = DatetimeOutputParser()
template = """Answer the users question:

{question}

{format_instructions}"""
prompt = PromptTemplate.from_template(
    template,
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

chain = prompt | OpenAI() | output_parser

output = chain.invoke({"question": "when was bitcoin founded?"})

print(output)
# 2009-01-03 18:15:05
```
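Under the hood, this kind of parser relies on the model returning a timestamp in one fixed pattern, which is then handed to `datetime.strptime`. The sketch below shows that step in isolation. The format string is an assumption chosen for illustration, and `parse_datetime` is a hypothetical helper.

```python
from datetime import datetime

# Assumed format string for illustration only.
ISO_LIKE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_datetime(text: str, fmt: str = ISO_LIKE_FORMAT) -> datetime:
    """Parse model output into a datetime; raises ValueError if it does not match."""
    return datetime.strptime(text.strip(), fmt)


print(parse_datetime("2009-01-03T18:15:05.000000Z"))
# 2009-01-03 18:15:05
```

If the model answers in free text (for example "January 3rd, 2009"), `strptime` raises `ValueError`, which is why the format instructions in the prompt matter.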


### Output Parsers: Enum parser

This example shows how to use the Enum output parser, which maps the LLM output to a member of an Enum.

```python
from langchain.output_parsers.enum import EnumOutputParser

from enum import Enum

class Colors(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"

parser = EnumOutputParser(enum=Colors)

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

prompt = PromptTemplate.from_template(
    """What color eyes does this person have?

> Person: {person}

Instructions: {instructions}"""
).partial(instructions=parser.get_format_instructions())
chain = prompt | ChatOpenAI() | parser

chain.invoke({"person": "Frank Sinatra"})
# <Colors.BLUE: 'blue'>
```
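The core of the idea can be shown with the standard library alone: strip the model's text and look it up by value in the Enum. This is a minimal sketch under that assumption; `parse_color` is a hypothetical helper, not the parser's real implementation.

```python
from enum import Enum


class Colors(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


def parse_color(text: str) -> Colors:
    """Map raw model text to a Colors member, listing valid options on failure."""
    try:
        return Colors(text.strip())
    except ValueError:
        valid = ", ".join(member.value for member in Colors)
        raise ValueError(f"{text!r} is not one of: {valid}")


print(parse_color(" blue\n"))
# Colors.BLUE
```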

### Output Parsers: JSON parser

This output parser allows users to specify an arbitrary JSON schema and query LLMs for outputs that conform to that schema.

Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed JSON. In the OpenAI family, DaVinci can do this reliably, but Curie's ability already drops off dramatically.

You can optionally use Pydantic to declare your data model.

```python
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0)

# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

# And a query intended to prompt a language model to populate the data structure.
joke_query = "Tell me a joke."

# Set up a parser + inject instructions into the prompt template.
parser = JsonOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"query": joke_query})
# {'setup': "Why don't scientists trust atoms?",
#  'punchline': 'Because they make up everything!'}
```

This output parser supports streaming.

```python
for s in chain.stream({"query": joke_query}):
    print(s)
# {'setup': ''}
# {'setup': 'Why'}
# {'setup': 'Why don'}
# {'setup': "Why don't"}
# {'setup': "Why don't scientists"}
# {'setup': "Why don't scientists trust"}
# {'setup': "Why don't scientists trust atoms"}
# {'setup': "Why don't scientists trust atoms?", 'punchline': ''}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they make'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they make up'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they make up everything'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they make up everything!'}
```

You can also use this without Pydantic. This will prompt the model to return JSON, but doesn't provide specifics about what the schema should be.

```python
joke_query = "Tell me a joke."

parser = JsonOutputParser()

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"query": joke_query})
# {'joke': "Why don't scientists trust atoms? Because they make up everything!"}
```

### Output Parsers: OpenAI Functions

These output parsers use OpenAI function calling to structure their outputs. This means they are only usable with models that support function calling. There are a few different variants:

- JsonOutputFunctionsParser: Returns the arguments of the function call as JSON
- PydanticOutputFunctionsParser: Returns the arguments of the function call as a Pydantic Model
- JsonKeyOutputFunctionsParser: Returns the value of a specific key in the function call as JSON
- PydanticAttrOutputFunctionsParser: Returns the value of a specific key in the function call as a Pydantic Model

```python
from langchain_community.utils.openai_functions import (
    convert_pydantic_to_openai_function,
)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import ChatOpenAI

class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


openai_functions = [convert_pydantic_to_openai_function(Joke)]

model = ChatOpenAI(temperature=0)

prompt = ChatPromptTemplate.from_messages(
    [("system", "You are a helpful assistant"), ("user", "{input}")]
)

from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

parser = JsonOutputFunctionsParser()

chain = prompt | model.bind(functions=openai_functions) | parser

chain.invoke({"input": "tell me a joke"})
# {'setup': "Why don't scientists trust atoms?",
#  'punchline': 'Because they make up everything!'}

for s in chain.stream({"input": "tell me a joke"}):
    print(s)
# {}
# {'setup': ''}
# {'setup': 'Why'}
# {'setup': 'Why don'}
# {'setup': "Why don't"}
# {'setup': "Why don't scientists"}
# {'setup': "Why don't scientists trust"}
# {'setup': "Why don't scientists trust atoms"}
# {'setup': "Why don't scientists trust atoms?", 'punchline': ''}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they make'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they make up'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they make up everything'}
# {'setup': "Why don't scientists trust atoms?", 'punchline': 'Because they make up everything!'}
```

PydanticOutputFunctionsParser builds on top of JsonOutputFunctionsParser but passes the results to a Pydantic Model. This allows for further validation should you choose.

```python
from langchain.output_parsers.openai_functions import PydanticOutputFunctionsParser

class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field


parser = PydanticOutputFunctionsParser(pydantic_schema=Joke)

openai_functions = [convert_pydantic_to_openai_function(Joke)]
chain = prompt | model.bind(functions=openai_functions) | parser

chain.invoke({"input": "tell me a joke"})
# Joke(setup="Why don't scientists trust atoms?", punchline='Because they make up everything!')
```

Notice that the return value above is a Pydantic object.
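The difference between returning all arguments and returning a single key can be sketched with plain dictionaries. The example assumes the function call is carried as a name plus a JSON-encoded `arguments` string; the `additional_kwargs` stand-in and both helper functions are hypothetical and only imitate what the JSON and JSON-key variants return.

```python
import json

# Stand-in for a chat message's extra fields (assumed shape of a function call).
additional_kwargs = {
    "function_call": {
        "name": "Joke",
        "arguments": (
            '{"setup": "Why did the chicken cross the road?", '
            '"punchline": "To get to the other side!"}'
        ),
    }
}


def function_call_args(kwargs: dict) -> dict:
    """Return all function-call arguments as a dict."""
    return json.loads(kwargs["function_call"]["arguments"])


def function_call_key(kwargs: dict, key: str):
    """Return a single value from the function-call arguments."""
    return function_call_args(kwargs)[key]


print(function_call_key(additional_kwargs, "punchline"))
# To get to the other side!
```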

### Output Parsers: OpenAI Tools

These output parsers extract tool calls from OpenAI's function calling API responses. This means they are only usable with models that support function calling, and specifically the latest tools and tool_choice parameters. 

There are a few different variants of output parsers:
- JsonOutputToolsParser: Returns the arguments of the function call as JSON
- JsonOutputKeyToolsParser: Returns the value of a specific key in the function call as JSON
- PydanticToolsParser: Returns the arguments of the function call as a Pydantic Model

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import ChatOpenAI

class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0).bind_tools([Joke])

model.kwargs["tools"]
# [{'type': 'function',
#   'function': {'name': 'Joke',
#    'description': 'Joke to tell user.',
#    'parameters': {'type': 'object',
#     'properties': {'setup': {'description': 'question to set up a joke',
#       'type': 'string'},
#      'punchline': {'description': 'answer to resolve the joke',
#       'type': 'string'}},
#     'required': ['setup', 'punchline']}}}]

prompt = ChatPromptTemplate.from_messages(
    [("system", "You are a helpful assistant"), ("user", "{input}")]
)

from langchain.output_parsers.openai_tools import JsonOutputToolsParser

parser = JsonOutputToolsParser()

chain = prompt | model | parser

chain.invoke({"input": "tell me a joke"})
# [{'type': 'Joke',
#   'args': {'setup': "Why don't scientists trust atoms?",
#    'punchline': 'Because they make up everything!'}}]
```

To include the tool call id we can specify return_id=True:

```python
parser = JsonOutputToolsParser(return_id=True)
chain = prompt | model | parser
chain.invoke({"input": "tell me a joke"})
# [{'type': 'Joke',
#   'args': {'setup': "Why don't scientists trust atoms?",
#    'punchline': 'Because they make up everything!'},
#   'id': 'call_Isuoh0RTeQzzOKGg5QlQ7UqI'}]
```

PydanticToolsParser builds on top of JsonOutputToolsParser but passes the results to a Pydantic Model. This allows for further validation should you choose.

```python
from langchain.output_parsers.openai_tools import PydanticToolsParser

class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field


parser = PydanticToolsParser(tools=[Joke])

model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0).bind_tools([Joke])
chain = prompt | model | parser

chain.invoke({"input": "tell me a joke"})
# [Joke(setup="Why don't scientists trust atoms?", punchline='Because they make up everything!')]
```
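The shape of the parsed output, including the optional `id`, can be illustrated without calling a model. The sketch below assumes raw tool calls arrive as dictionaries with `id`, `type`, and a `function` entry holding a name and JSON-encoded arguments; `extract_tool_calls` is a hypothetical helper and the id value is made up.

```python
import json
from typing import List


def extract_tool_calls(tool_calls: List[dict], return_id: bool = False) -> List[dict]:
    """Convert raw tool calls into {'type', 'args'} dicts, optionally with 'id'."""
    results = []
    for call in tool_calls:
        parsed = {
            "type": call["function"]["name"],
            "args": json.loads(call["function"]["arguments"]),
        }
        if return_id:
            parsed["id"] = call["id"]
        results.append(parsed)
    return results


raw_calls = [
    {
        "id": "call_123",  # made-up id for illustration
        "type": "function",
        "function": {
            "name": "Joke",
            "arguments": '{"setup": "Why?", "punchline": "Because!"}',
        },
    }
]
print(extract_tool_calls(raw_calls, return_id=True))
# [{'type': 'Joke', 'args': {'setup': 'Why?', 'punchline': 'Because!'}, 'id': 'call_123'}]
```

Because a model may request several tools in one turn, the result is a list even when only one call is present.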

### Output Parsers: Output-fixing parser

This output parser wraps another output parser, and if the wrapped parser fails, it calls out to an LLM to fix the errors. Rather than simply raising an error, it passes the misformatted output, along with the format instructions, to the model and asks it to fix the output. For this example, we'll use the Pydantic output parser shown above. Here's what happens if we pass it a result that does not comply with the schema:

```python
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")


actor_query = "Generate the filmography for a random actor."

parser = PydanticOutputParser(pydantic_object=Actor)

misformatted = "{'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}"

parser.parse(misformatted)
# ---------------------------------------------------------------------------
# JSONDecodeError                           Traceback (most recent call last)
# File ~/workplace/langchain/libs/langchain/langchain/output_parsers/pydantic.py:29, in PydanticOutputParser.parse(self, text)
#      28     json_str = match.group()
# ---> 29 json_object = json.loads(json_str, strict=False)
#      30 return self.pydantic_object.parse_obj(json_object)
# File ~/.pyenv/versions/3.10.1/lib/python3.10/json/__init__.py:359, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
#     358     kw['parse_constant'] = parse_constant
# --> 359 return cls(**kw).decode(s)
# File ~/.pyenv/versions/3.10.1/lib/python3.10/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
#     333 """Return the Python representation of ``s`` (a ``str`` instance
#     334 containing a JSON document).
#     335
#     336 """
# --> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
#     338 end = _w(s, end).end()
# File ~/.pyenv/versions/3.10.1/lib/python3.10/json/decoder.py:353, in JSONDecoder.raw_decode(self, s, idx)
#     352 try:
# --> 353     obj, end = self.scan_once(s, idx)
#     354 except StopIteration as err:
# JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
#
# During handling of the above exception, another exception occurred:
#
# OutputParserException                     Traceback (most recent call last)
# Cell In[4], line 1
# ----> 1 parser.parse(misformatted)
# File ~/workplace/langchain/libs/langchain/langchain/output_parsers/pydantic.py:35, in PydanticOutputParser.parse(self, text)
#      33 name = self.pydantic_object.__name__
#      34 msg = f"Failed to parse {name} from completion {text}. Got: {e}"
# ---> 35 raise OutputParserException(msg, llm_output=text)
# OutputParserException: Failed to parse Actor from completion {'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}. Got: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
```

Now we can construct and use an OutputFixingParser. This output parser takes as arguments another output parser and an LLM with which to try to correct any formatting mistakes.

```python
from langchain.output_parsers import OutputFixingParser

new_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI())

new_parser.parse(misformatted)
# Actor(name='Tom Hanks', film_names=['Forrest Gump'])
```
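The control flow behind this pattern is a parse, fix, and retry loop. The sketch below shows it with the standard library only. The `fake_llm_fixer` function is a hypothetical stand-in for the LLM call (it just swaps quote characters), and `parse_with_fix` is an illustrative helper, not the library's implementation.

```python
import json
from typing import Callable


def parse_with_fix(text: str, fixer: Callable[[str, str], str], max_retries: int = 1) -> dict:
    """Try json.loads; on failure, ask `fixer` for a corrected string and retry."""
    for attempt in range(max_retries + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            text = fixer(text, str(err))


def fake_llm_fixer(bad_output: str, error: str) -> str:
    """Hypothetical stand-in for an LLM call: swaps single quotes for double quotes."""
    return bad_output.replace("'", '"')


misformatted = "{'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}"
print(parse_with_fix(misformatted, fake_llm_fixer))
# {'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}
```

Passing the error message to the fixer mirrors the idea of giving the model enough context to correct its own output.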


### Output Parsers: Pandas DataFrame Parser

The PandasDataFrameOutputParser in LangChain:
- Allows you to query a DataFrame using an LLM.
- Structures the LLM's output to match the format of a Pandas DataFrame operation.
    - The parser acts as a bridge between the natural language output generated by the language model (LLM) and the structured, programmatic format required to work with a Pandas DataFrame.
    - The parser provides the model with clear instructions on how to format its output.
        - Example: A column retrieval query should produce a dictionary with column names as keys and values as data.
        - Example: A row retrieval query should produce a dictionary with row indices as keys and column-value mappings as values.
    - The parser validates the response against the DataFrame schema:
        - The parser ensures that the response only references valid columns or operations defined in the DataFrame.
        - Example: A query asking for a non-existent column will fail.
    - The parser transforms the LLM's output:
        - Converts the natural language output into a structured, machine-readable format (like a dictionary or JSON) that aligns with the DataFrame's data and structure.

In the example below, the parser provides format instructions that tell the LLM how to express user queries, validates and processes the response to ensure it matches the expected structure, and performs the requested operation on a Pandas DataFrame, such as retrieving rows or columns or performing calculations.
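The validation idea can be illustrated without pandas or an LLM. The sketch below uses a plain dictionary of columns and a hypothetical `operation:column` request syntax that is an assumption made for this illustration only; it is not the format the LangChain parser actually asks the model to produce.

```python
from typing import Dict, List


def run_request(table: Dict[str, List[float]], request: str):
    """Validate a hypothetical 'operation:column' request against the table's columns."""
    operation, _, column = request.partition(":")
    if column not in table:
        raise ValueError(f"Unknown column {column!r}; valid columns: {sorted(table)}")
    values = table[column]
    if operation == "column":
        # Column retrieval: column name -> {row index: value}
        return {column: dict(enumerate(values))}
    if operation == "mean":
        return {"mean": sum(values) / len(values)}
    raise ValueError(f"Unsupported operation {operation!r}")


table = {"num_legs": [2, 4, 8, 0], "num_wings": [2, 0, 0, 0]}
print(run_request(table, "column:num_wings"))
# {'num_wings': {0: 2, 1: 0, 2: 0, 3: 0}}
print(run_request(table, "mean:num_legs"))
# {'mean': 3.5}
```

A request naming a column that is not in the table raises an error instead of returning fabricated data, which is the point of validating against the schema.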

```python
import pprint
from typing import Any, Dict

import pandas as pd
from langchain.output_parsers import PandasDataFrameOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

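model = ChatOpenAI(temperature=0)


# Helper used below, solely to pretty-print the parser output.
def format_parser_output(parser_output: Dict[str, Any]) -> None:
    # Convert each returned pandas object into a plain dict before printing.
    for key in parser_output.keys():
        parser_output[key] = parser_output[key].to_dict()
    return pprint.PrettyPrinter(width=4, compact=True).pprint(parser_output)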

# Define your desired Pandas DataFrame.
df = pd.DataFrame(
    {
        "num_legs": [2, 4, 8, 0],
        "num_wings": [2, 0, 0, 0],
        "num_specimen_seen": [10, 2, 1, 8],
    }
)

# Set up a parser + inject instructions into the prompt template.
parser = PandasDataFrameOutputParser(dataframe=df)

# Here's an example of a column operation being performed.
df_query = "Retrieve the num_wings column."

# Set up the prompt.
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser
parser_output = chain.invoke({"query": df_query})

format_parser_output(parser_output)
# {'num_wings': {0: 2,
#                1: 0,
#                2: 0,
#                3: 0}}

# Here's an example of a row operation being performed.
df_query = "Retrieve the first row."

# Set up the prompt.
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser
parser_output = chain.invoke({"query": df_query})

format_parser_output(parser_output)
# {'0': {'num_legs': 2,
#        'num_specimen_seen': 10,
#        'num_wings': 2}}

# Here's an example of a random Pandas DataFrame operation limiting the number of rows
df_query = "Retrieve the average of the num_legs column from rows 1 to 3."

# Set up the prompt.
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser
parser_output = chain.invoke({"query": df_query})

print(parser_output)
# {'mean': 4.0}

# Here's an example of a poorly formatted query
df_query = "Retrieve the mean of the num_fingers column."

# Set up the prompt.
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser
parser_output = chain.invoke({"query": df_query})
# OutputParserException: Invalid column: num_fingers. Please check the format instructions.
```

Why the Last Query Fails:
- Invalid Column:
    - The DataFrame does not contain a column named num_fingers.
    - The parser validates the query and finds that it references a non-existent column.
- Validation Failure:
    - The PandasDataFrameOutputParser checks the LLM's response against the DataFrame's structure.
    - If the response contains an invalid column, the parser raises an OutputParserException.
- Error Raised:
    - OutputParserException: Invalid column: num_fingers. Please check the format instructions.
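
The validation step described above can be pictured with a small stand-in written in plain Python. This is a simplified sketch, not LangChain's implementation: the `operation:column` request format, the `run_request` helper, the `InvalidColumnError` class, and the dict-of-lists table are all assumptions made for illustration.

```python
from statistics import mean


class InvalidColumnError(ValueError):
    """Hypothetical stand-in for OutputParserException."""


# Plain-dict stand-in for the DataFrame used above.
table = {
    "num_legs": [2, 4, 8, 0],
    "num_wings": [2, 0, 0, 0],
    "num_specimen_seen": [10, 2, 1, 8],
}


def run_request(request: str, data: dict[str, list[int]]) -> dict:
    """Validate an 'operation:column' request against the schema, then run it."""
    operation, _, column = request.partition(":")
    if column not in data:
        # Reject anything that is not part of the table's schema.
        raise InvalidColumnError(f"Invalid column: {column}")
    values = data[column]
    if operation == "column":
        return {column: dict(enumerate(values))}
    if operation == "mean":
        return {"mean": mean(values)}
    raise ValueError(f"Unsupported operation: {operation}")


print(run_request("column:num_wings", table))
# {'num_wings': {0: 2, 1: 0, 2: 0, 3: 0}}

try:
    run_request("mean:num_fingers", table)
except InvalidColumnError as err:
    print(err)  # Invalid column: num_fingers
```

The point of the sketch is the order of operations: the column name is checked against the schema before any computation happens, so an unknown column fails fast with a descriptive error.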

The Pandas DataFrame Parser does not replace the Pandas DataFrame Agent! The agent provides a more general-purpose, conversational interface: it can handle multi-step reasoning, chain operations, and respond dynamically to arbitrary queries. It is best for complex operations that require reasoning and decision-making (e.g., combining multiple columns, filtering rows dynamically), because it acts as the decision-maker, working out which DataFrame operations to perform and executing Python code.

### Output Parsers: Pydantic parser

This output parser allows users to specify an arbitrary Pydantic Model and query LLMs for outputs that conform to that schema. Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed JSON. In the OpenAI family, DaVinci can do this reliably, but Curie's ability already drops off dramatically.

Use Pydantic to declare your data model. Pydantic's BaseModel is like a Python dataclass, but with actual type checking + coercion.

```python
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0)

# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field


# And a query intended to prompt a language model to populate the data structure.
joke_query = "Tell me a joke."

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"query": joke_query})
# Joke(setup="Why don't scientists trust atoms?", punchline='Because they make up everything!')

# Here's another example, but with a compound typed field (e.g. List[str]).
class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")


actor_query = "Generate the filmography for a random actor."

parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"query": actor_query})
# Actor(name='Tom Hanks', film_names=['Forrest Gump', 'Cast Away', 'Saving Private Ryan', 'Toy Story', 'The Green Mile'])
```
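
Conceptually, the parsing step boils down to two things: decode the completion as JSON, then validate the result against the declared model. The sketch below illustrates that idea with the standard library only. A dataclass stands in for the Pydantic model, and `parse_joke` is a hypothetical helper, so treat it as an approximation of the behavior rather than the parser's actual code.

```python
import json
from dataclasses import dataclass


@dataclass
class Joke:
    """Dataclass stand-in for the Pydantic Joke model above."""

    setup: str
    punchline: str

    def __post_init__(self) -> None:
        # Same rule as the validator above: the setup must be a question.
        if not self.setup.endswith("?"):
            raise ValueError("Badly formed question!")


def parse_joke(completion: str) -> Joke:
    """Decode the completion as JSON, then build (and validate) a Joke."""
    payload = json.loads(completion)
    return Joke(**payload)


completion = (
    '{"setup": "Why don\'t scientists trust atoms?", '
    '"punchline": "Because they make up everything!"}'
)
joke = parse_joke(completion)
print(joke.punchline)  # Because they make up everything!
```

A completion whose setup does not end in a question mark raises ValueError here, which mirrors how the custom validator rejects a badly formed joke.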

### Output Parsers: Retry Parser

While in some cases it is possible to fix any parsing mistakes by only looking at the output, in other cases it isn't. An example of this is when the output is not just in the incorrect format, but is partially complete. Consider the below example.

```python
from langchain.output_parsers import (
    OutputFixingParser,
    PydanticOutputParser,
)
from langchain_core.prompts import (
    PromptTemplate,
)
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI, OpenAI

template = """Based on the user question, provide an Action and Action Input for what step should be taken.
{format_instructions}
Question: {query}
Response:"""


class Action(BaseModel):
    action: str = Field(description="action to take")
    action_input: str = Field(description="input to the action")


parser = PydanticOutputParser(pydantic_object=Action)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

prompt_value = prompt.format_prompt(query="who is leo di caprios gf?")

bad_response = '{"action": "search"}'

parser.parse(bad_response)
# OutputParserException: Failed to parse Action from completion {"action": "search"}. Got: 1 validation error for Action
# action_input
#   field required (type=value_error.missing)
```

If we try to use the OutputFixingParser to fix this error, it will be confused - namely, it doesn't know what to actually put for action input.

```python
fix_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI())

fix_parser.parse(bad_response)
# Action(action='search', action_input='input')
```

Instead, we can use the RetryOutputParser, which passes in the prompt (as well as the original output) to try again to get a better response.

```python
from langchain.output_parsers import RetryOutputParser

retry_parser = RetryOutputParser.from_llm(parser=parser, llm=OpenAI(temperature=0))

retry_parser.parse_with_prompt(bad_response, prompt_value)
# Action(action='search', action_input='leo di caprio girlfriend')
```

We can also add the RetryOutputParser easily with a custom chain that transforms the raw LLM/ChatModel output into a more workable format.

```python
from langchain_core.runnables import RunnableLambda, RunnableParallel

completion_chain = prompt | OpenAI(temperature=0)

main_chain = RunnableParallel(
    completion=completion_chain, prompt_value=prompt
) | RunnableLambda(lambda x: retry_parser.parse_with_prompt(**x))


main_chain.invoke({"query": "who is leo di caprios gf?"})
# Action(action='search', action_input='leo di caprio girlfriend')
```
The flow in this example is:
- A user query: "Who is Leo DiCaprio's girlfriend?"
- The prompt instructs the LLM to return an object with the fields:
    - action: The action to take.
    - action_input: The input for the action.
- The initial LLM output: {"action": "search"}. This fails validation because the action_input field is missing. The action_input field is required by the schema defined in the Action class.
- The RetryOutputParser:
    - Takes the initial invalid output.
    - Reuses the original prompt (prompt_value) to ask the LLM to generate a better response.
    - Provides the invalid output as additional context to guide the LLM.
- The LLM now generates a corrected response: {"action": "search", "action_input": "leo di caprio girlfriend"}

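Note how the pieces fit together. RunnableParallel runs its two branches on the same input and collects the results into a dictionary: completion holds the raw LLM output from completion_chain, and prompt_value holds the formatted prompt. That dictionary is piped to RunnableLambda, which unpacks it into the arguments of retry_parser.parse_with_prompt. Only the two independent branches can run in parallel; the retry depends on the raw output, so it cannot and should not run in parallel with them.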

The retry loop in the RetryOutputParser is encapsulated within its parse_with_prompt method. In the context of RunnableParallel, the retry loop doesn't run in parallel itself—it runs sequentially within the RetryOutputParser as part of its task.

Inside RetryOutputParser.parse_with_prompt:
- Initial Attempt: The parser tries to validate the raw output (bad_response) using parser.parse().
- Error Handling: If validation fails (e.g., due to missing fields), it catches the error and retries with the LLM.
- Re-Prompting Logic: The original prompt (prompt_value) is sent to the LLM again. The invalid output is included as additional context, helping the LLM generate a better response.
- Retry Loop: This retry process continues until the output passes validation or a maximum retry count is reached.
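
The steps above can be sketched in plain Python. This is a minimal illustration of the control flow under stated assumptions, not the library's code: `parse_action`, `parse_with_retries`, the retry prompt wording, and `fake_llm` (a canned stand-in for a model call) are all hypothetical.

```python
import json
from typing import Callable


def parse_action(text: str) -> dict:
    """Strict parser: both fields must be present."""
    data = json.loads(text)
    missing = {"action", "action_input"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


def parse_with_retries(
    completion: str,
    prompt: str,
    llm: Callable[[str], str],
    max_retries: int = 1,
) -> dict:
    """Parse the completion; on failure, re-ask the LLM with prompt + bad output."""
    attempts = 0
    while True:
        try:
            return parse_action(completion)
        except ValueError as err:
            if attempts >= max_retries:
                raise  # give up once the retry budget is spent
            attempts += 1
            retry_prompt = (
                f"Prompt:\n{prompt}\n"
                f"Completion:\n{completion}\n"
                f"Error: {err}\n"
                "Please try again:"
            )
            completion = llm(retry_prompt)


def fake_llm(retry_prompt: str) -> str:
    """Hypothetical model call that always returns a complete answer."""
    return '{"action": "search", "action_input": "leo di caprio girlfriend"}'


result = parse_with_retries('{"action": "search"}', "who is leo di caprios gf?", fake_llm)
print(result)
# {'action': 'search', 'action_input': 'leo di caprio girlfriend'}
```

The key difference from a fix-only approach is that the retry prompt carries the original question, so the model has enough context to fill in the missing field instead of guessing.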

### Retrieval

Many LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM when doing the generation step.

1) Document loaders

Document loaders load documents from many different sources. LangChain provides over 100 different document loaders as well as integrations with other major providers in the space, like AirByte and Unstructured. LangChain provides integrations to load all types of documents (HTML, PDF, code) from all types of locations (private S3 buckets, public websites).

2) Text Splitting

A key part of retrieval is fetching only the relevant parts of documents. This involves several transformation steps to prepare the documents for retrieval. One of the primary ones here is splitting (or chunking) a large document into smaller chunks. LangChain provides several transformation algorithms for doing this, as well as logic optimized for specific document types (code, markdown, etc).
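
To make chunking concrete, here is a minimal character-based splitter with overlap. It is a simplified sketch rather than one of LangChain's splitters; the function name and the `chunk_size` / `chunk_overlap` parameters are chosen for illustration.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.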

3) Text embedding models

Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of a text that are similar. LangChain provides integrations with over 25 different embedding providers and methods, from open-source to proprietary API, allowing you to choose the one best suited for your needs. LangChain provides a standard interface, allowing you to easily swap between models.

4) Vector stores

With the rise of embeddings, there has emerged a need for databases to support efficient storage and searching of these embeddings. LangChain provides integrations with over 50 different vectorstores, from open-source local ones to cloud-hosted proprietary ones, allowing you to choose the one best suited for your needs. LangChain exposes a standard interface, allowing you to easily swap between vector stores.

5) Retrievers

Once the data is in the database, you still need to retrieve it. LangChain supports many different retrieval algorithms and is one of the places where we add the most value. LangChain supports basic methods that are easy to get started - namely simple semantic search. However, we have also added a collection of algorithms on top of this to increase performance. These include:
- Parent Document Retriever: This allows you to create multiple embeddings per parent document, allowing you to look up smaller chunks but return larger context.
- Self Query Retriever: User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the semantic part of a query from other metadata filters present in the query.
- Ensemble Retriever: Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. The ensemble retriever allows you to easily do this.

6) Indexing

The LangChain Indexing API syncs your data from any source into a vector store, helping you:
- Avoid writing duplicated content into the vector store
- Avoid re-writing unchanged content
- Avoid re-computing embeddings over unchanged content
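
One common way to get these guarantees is to remember a hash of every piece of content that has already been indexed. The sketch below illustrates that idea only; `TinyIndex`, its placeholder `_embed` method, and the in-memory set of hashes are assumptions for illustration and do not describe how the Indexing API is implemented.

```python
import hashlib


class TinyIndex:
    """Hypothetical in-memory stand-in for a vector store plus a record of hashes."""

    def __init__(self) -> None:
        self.seen_hashes: set[str] = set()
        self.embed_calls = 0

    def _embed(self, text: str) -> list[float]:
        # Placeholder embedding; only counts how often we would call a real model.
        self.embed_calls += 1
        return [float(len(text))]

    def add_texts(self, texts: list[str]) -> int:
        """Index only texts whose content hash is new; return how many were added."""
        added = 0
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in self.seen_hashes:
                continue  # duplicated or unchanged content: skip write and embedding
            self._embed(text)
            self.seen_hashes.add(digest)
            added += 1
        return added


index = TinyIndex()
print(index.add_texts(["doc a", "doc b", "doc a"]))  # 2
print(index.add_texts(["doc a", "doc c"]))  # 1
print(index.embed_calls)  # 3
```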


### Retrieval: Document Loaders

Use document loaders to load data from a source as Document objects. A Document is a piece of text and associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading a transcript of a YouTube video.

Document loaders provide a "load" method for loading data as documents from a configured source. They optionally implement a "lazy load" as well for lazily loading data into memory.

### Retrieval: Document Loaders: Custom Document Loader

Applications based on LLMs frequently entail extracting data from databases or files, like PDFs, and converting it into a format that LLMs can utilize. In LangChain, this usually involves creating Document objects, which encapsulate the extracted text (page_content) along with metadata—a dictionary containing details about the document, such as the author's name or the date of publication.

Document objects are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the Document to generate a desired response (e.g., summarizing the document). Documents can be either used immediately or indexed into a vectorstore for future retrieval and use.

The main abstractions for Document Loading are:
- Document: Contains text and metadata
- BaseLoader: Use to convert raw data into Documents
- Blob: A representation of binary data that's located either in a file or in memory
- BaseBlobParser: Logic to parse a Blob to yield Document objects

This guide will demonstrate how to write custom document loading:
- Create a standard document Loader by sub-classing from BaseLoader.
- Create a parser using BaseBlobParser and use it in conjunction with Blob and BlobLoaders. This is useful primarily when working with files.

A document loader can be implemented by sub-classing from a BaseLoader which provides a standard interface for loading documents:
- lazy_load: Used to load documents one by one lazily. Use for production code.
- alazy_load: Async variant of lazy_load
- load: Used to load all the documents into memory eagerly. Use for prototyping or interactive work.
- aload: Async variant of load. It loads all the documents into memory eagerly, so use it for prototyping or interactive work. Added to LangChain in 2024-04.

All configuration is expected to be passed through the initializer (`__init__`). This was a design choice made by LangChain to make sure that once a document loader has been instantiated it has all the information needed to load documents.

Let's create an example of a standard document loader that loads a file and creates a document from each line in the file:
```python 
from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

    # alazy_load is OPTIONAL.
    # If you leave out the implementation, a default implementation which delegates to lazy_load will be used!
    async def alazy_load(
        self,
    ) -> AsyncIterator[Document]:  # <-- Does not take any arguments
        """An async lazy loader that reads a file line by line."""
        # Requires aiofiles
        # Install with `pip install aiofiles`
        # https://github.com/Tinche/aiofiles
        import aiofiles

        async with aiofiles.open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            async for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1
```

To test out the document loader, we need a file with some quality content.

```python
with open("./meow.txt", "w", encoding="utf-8") as f:
    quality_content = "meow meow🐱 \n meow meow🐱 \n meow😻😻"
    f.write(quality_content)

loader = CustomDocumentLoader("./meow.txt")

## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)
# <class 'langchain_core.documents.base.Document'>
# page_content='meow meow🐱 \n' metadata={'line_number': 0, 'source': './meow.txt'}

# <class 'langchain_core.documents.base.Document'>
# page_content=' meow meow🐱 \n' metadata={'line_number': 1, 'source': './meow.txt'}

# <class 'langchain_core.documents.base.Document'>
# page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}

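## Test out the async implementation
## (a top-level `async for` works in Jupyter/IPython; in a script, wrap it in an async function and run it with asyncio.run)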
async for doc in loader.alazy_load():
    print()
    print(type(doc))
    print(doc)
```

load() can be helpful in an interactive environment such as a Jupyter notebook. Avoid using it for production code since eager loading assumes that all the content can fit into memory, which is not always the case, especially for enterprise data.
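
The relationship between the two methods can be shown with a tiny stand-in class. This is a sketch under assumptions: `LineLoader` is hypothetical, works on an in-memory string, and yields plain strings instead of Document objects. It only illustrates that eager loading amounts to exhausting the lazy generator.

```python
from typing import Iterator


class LineLoader:
    """Hypothetical stand-in for a loader; yields one item per line of a string."""

    def __init__(self, text: str) -> None:
        self.text = text

    def lazy_load(self) -> Iterator[str]:
        # Generator: each line is produced only when the caller asks for it.
        for line in self.text.splitlines():
            yield line

    def load(self) -> list[str]:
        # Eager variant: materialize everything lazy_load would produce.
        return list(self.lazy_load())


loader = LineLoader("a\nb\nc")
first = next(loader.lazy_load())  # only the first line has been produced so far
everything = loader.load()  # all lines are held in memory at once
print(first, everything)  # a ['a', 'b', 'c']
```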

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed rather than how the file is loaded. For example, you can use open to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text. As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.

### Retrieval: Document Loaders: Custom Document Parser

Do not conflate the Loader with the Parser. The BaseLoader is responsible for directly loading documents from a source (e.g., files, APIs, databases) and converting them into Document objects. It works at a higher level of abstraction.
- Handles complete loading and parsing: Combines both fetching the data and parsing it into Document objects.
- Designed for specific sources like text files, APIs, or databases.
- Provides standard methods:
    - lazy_load: Loads documents one at a time (lazy loading).
    - alazy_load: Async variant of lazy loading.
    - load: Loads all documents into memory at once (eager loading).
    - aload: Async variant of eager loading.

Use BaseLoader when the source (e.g., a file, database, or API) is known, and you want to quickly transform raw data into Document objects without extra modularity.

BaseBlobParser

**The BaseBlobParser focuses solely on parsing raw binary data (blobs) into Document objects. It is designed for scenarios where the data-loading logic is decoupled from the data-parsing logic**.
- Works with Blob objects, which represent binary data from either files or memory.
- Only parses the provided binary data into Document objects—it does not handle loading the data.
- Provides methods like:
    - lazy_parse: Parses blobs one by one (lazy parsing).

Use BaseBlobParser when:
- You need modularity: Data loading (e.g., fetching files) and data parsing are handled separately.
- **The same parsing logic can be reused across different types of data sources (e.g., files, in-memory data, APIs).**
- Suitable for files or binary data that require specialized parsing (e.g., PDFs, CSVs, JSON).

A BaseBlobParser is an interface that accepts a blob and yields Document objects. A blob is a representation of data that lives either in memory or in a file. LangChain Python has a Blob primitive, which is inspired by the Blob Web API spec.

```python
from typing import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document

class MyParser(BaseBlobParser):
    """A simple parser that creates a document from each line."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a blob into a document line by line."""
        line_number = 0
        with blob.as_bytes_io() as f:
            for line in f:
                line_number += 1
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": blob.source},
                )

blob = Blob.from_path("./meow.txt")
parser = MyParser()

list(parser.lazy_parse(blob))
# [Document(page_content='meow meow🐱 \n', metadata={'line_number': 1, 'source': './meow.txt'}),
#  Document(page_content=' meow meow🐱 \n', metadata={'line_number': 2, 'source': './meow.txt'}),
#  Document(page_content=' meow😻😻', metadata={'line_number': 3, 'source': './meow.txt'})]
```

**Using the blob API also allows one to load content directly from memory without having to read it from a file!**

```python
blob = Blob(data=b"some data from memory\nmeow")
list(parser.lazy_parse(blob))
# [Document(page_content='some data from memory\n', metadata={'line_number': 1, 'source': None}),
#  Document(page_content='meow', metadata={'line_number': 2, 'source': None})]
```

The following illustrates features of the Blob API:

```python
blob = Blob.from_path("./meow.txt", metadata={"foo": "bar"})

blob.encoding
# 'utf-8'

blob.as_bytes()
# b'meow meow\xf0\x9f\x90\xb1 \n meow meow\xf0\x9f\x90\xb1 \n meow\xf0\x9f\x98\xbb\xf0\x9f\x98\xbb'

blob.as_string()
# 'meow meow🐱 \n meow meow🐱 \n meow😻😻'

blob.as_bytes_io()
# <contextlib._GeneratorContextManager at 0x743f34324450>

blob.metadata
# {'foo': 'bar'}

blob.source
# './meow.txt'
```

Blob Loaders

Let's distinguish DocumentLoader from BlobLoader: A DocumentLoader is a higher-level abstraction that directly converts raw data (e.g., from files, APIs, or databases) into Document objects, which include:
- page_content: The main text content of the document.
- metadata: Additional information about the document (e.g., source, author, creation date).

DocumentLoader abstracts both data retrieval and parsing logic into a single interface. It is ideal for straightforward use cases where parsing is tightly coupled to data loading.

Examples of Use Cases:
- Loading text files where each line corresponds to a document.
- Loading a JSON file where each record represents a document.

A BlobLoader focuses on fetching raw binary data (referred to as "blobs") from a storage location. Unlike a DocumentLoader, it is not responsible for parsing the raw data into documents; instead, it works in tandem with a BlobParser, which handles parsing logic.

Responsibilities of BlobLoader:
- Encapsulates the loading logic for retrieving blobs from a specific source, such as:
    - Local files (e.g., via FileSystemBlobLoader).
    - Cloud storage (future support).
- Provides blobs as raw binary data or streams to be processed further by a parser.
- While a blob parser encapsulates the logic needed to parse binary data into documents, blob loaders encapsulate the logic that's necessary to load blobs from a given storage location. At the moment, LangChain only supports FileSystemBlobLoader.

You can use the FileSystemBlobLoader to load blobs and then use the parser to parse them.

```python
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader

blob_loader = FileSystemBlobLoader(path=".", glob="*.mdx", show_progress=True)

parser = MyParser()
for blob in blob_loader.yield_blobs():
    for doc in parser.lazy_parse(blob):
        print(doc)
        break
# page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
# page_content='# Markdown\n' metadata={'line_number': 1, 'source': 'markdown.mdx'}
# page_content='# JSON\n' metadata={'line_number': 1, 'source': 'json.mdx'}
# page_content='---\n' metadata={'line_number': 1, 'source': 'pdf.mdx'}
# page_content='---\n' metadata={'line_number': 1, 'source': 'index.mdx'}
# page_content='# File Directory\n' metadata={'line_number': 1, 'source': 'file_directory.mdx'}
# page_content='# CSV\n' metadata={'line_number': 1, 'source': 'csv.mdx'}
# page_content='# HTML\n' metadata={'line_number': 1, 'source': 'html.mdx'}
```

### Retrieval: Document Loaders: Generic Loader

LangChain has a GenericLoader abstraction which composes a BlobLoader with a BaseBlobParser. GenericLoader is meant to provide standardized classmethods that make it easy to use existing BlobLoader implementations. At the moment, only the FileSystemBlobLoader is supported.

```python
from langchain_community.document_loaders.generic import GenericLoader

loader = GenericLoader.from_filesystem(
    path=".", glob="*.mdx", show_progress=True, parser=MyParser()
)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")
# page_content='# Microsoft Office\n' metadata={'line_number': 1, 'source': 'office_file.mdx'}
# page_content='\n' metadata={'line_number': 2, 'source': 'office_file.mdx'}
# page_content='>[The Microsoft Office](https://www.office.com/) suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.\n' metadata={'line_number': 3, 'source': 'office_file.mdx'}
# page_content='\n' metadata={'line_number': 4, 'source': 'office_file.mdx'}
# page_content='This covers how to load commonly used file formats including `DOCX`, `XLSX` and `PPTX` documents into a document format that we can use downstream.\n' metadata={'line_number': 5, 'source': 'office_file.mdx'}
# ... output truncated for demo purposes
```
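
Conceptually, the composition amounts to a nested loop: ask the blob loader for blobs, then hand each one to the parser. The sketch below reproduces that shape with the standard library. The `yield_blobs`, `parse_lines`, and `generic_load` functions and the dict-based documents are hypothetical stand-ins, not the GenericLoader source.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def yield_blobs(root: Path, glob: str) -> Iterator[Path]:
    """'Blob loader' stand-in: locate the raw files, nothing more."""
    yield from sorted(root.glob(glob))


def parse_lines(path: Path) -> Iterator[dict]:
    """'Blob parser' stand-in: turn one file into one document per line."""
    with path.open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            yield {
                "page_content": line,
                "metadata": {"line_number": line_number, "source": path.name},
            }


def generic_load(
    root: Path, glob: str, parser: Callable[[Path], Iterator[dict]]
) -> Iterator[dict]:
    """Compose loading and parsing: every blob goes through the parser."""
    for blob in yield_blobs(root, glob):
        yield from parser(blob)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.mdx").write_text("# A\nbody\n", encoding="utf-8")
    (root / "b.txt").write_text("ignored\n", encoding="utf-8")
    docs = list(generic_load(root, "*.mdx", parse_lines))

print(len(docs))  # 2
print(docs[0]["metadata"])  # {'line_number': 1, 'source': 'a.mdx'}
```

Because the parser is passed in as an argument, the same loading code can be reused with a different parser, which is the modularity argument made earlier for separating BlobLoader from BaseBlobParser.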

### Retrieval: Document Loaders: CSV Loader

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

Each Row in the CSV Becomes a Separate Document:
- The loader treats each row in the CSV file as an individual Document object.
- This means that:
    - The content of the row (all columns combined) becomes the page_content of the Document.
    - Metadata, such as column names and other details, can also be included in the metadata field of the Document.

CSV File Example:
```csv
name,age,city
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
```

If you load this CSV file using the CSV Document Loader, you'll get three separate Document objects, one for each row. The loader renders each row as "column: value" lines and records the file path and the zero-based row index in the metadata:

```python
Document(
    page_content="name: Alice\nage: 30\ncity: New York",
    metadata={"source": "data.csv", "row": 0}
)

Document(
    page_content="name: Bob\nage: 25\ncity: Los Angeles",
    metadata={"source": "data.csv", "row": 1}
)

Document(
    page_content="name: Charlie\nage: 35\ncity: Chicago",
    metadata={"source": "data.csv", "row": 2}
)
```
- You can perform operations like querying, embedding, or indexing on each row separately, which is useful for tasks like retrieval-augmented generation (RAG). 
- Each document can carry metadata specific to its row, such as the row number, file name, or specific column names. 
- Processing each row as a separate document helps with handling large CSVs by working on chunks rather than loading the entire file into memory.
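
The row-by-row behavior can be mimicked with the standard library's csv module. The sketch renders each row as "column: value" lines, which is the layout CSVLoader uses for page_content; the `rows_to_documents` helper and the dict-based documents are illustrative stand-ins, not the loader's code.

```python
import csv
import io

csv_text = "name,age,city\nAlice,30,New York\nBob,25,Los Angeles\nCharlie,35,Chicago\n"


def rows_to_documents(text: str, source: str) -> list[dict]:
    """Turn every CSV row into one document with source and row metadata."""
    reader = csv.DictReader(io.StringIO(text))  # first line supplies the headers
    documents = []
    for row_index, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


docs = rows_to_documents(csv_text, "data.csv")
print(len(docs))  # 3
print(docs[1]["page_content"])
# name: Bob
# age: 25
# city: Los Angeles
```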

How to include Column Headers for Context

When using the CSV Document Loader, each row of the CSV file is converted into a Document object with:
- page_content: One "column: value" line per column, so each value already travels with its column header.
- metadata: The source and the row index by default. Columns listed in the loader's metadata_columns argument are added as well, with the column headers as keys and the row's values as values.

CSV File Example:
```csv
name,age,city
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
```

You can also store selected column headers and their values in the metadata (for example, to filter on them later) by passing those columns to CSVLoader's metadata_columns argument. Columns listed there are copied into the metadata and left out of page_content. With metadata_columns=["city"], the documents look like this:

```python
Document(
    page_content="name: Alice\nage: 30",
    metadata={
        "source": "data.csv",
        "row": 0,
        "city": "New York"
    }
)

Document(
    page_content="name: Bob\nage: 25",
    metadata={
        "source": "data.csv",
        "row": 1,
        "city": "Los Angeles"
    }
)

Document(
    page_content="name: Charlie\nage: 35",
    metadata={
        "source": "data.csv",
        "row": 2,
        "city": "Chicago"
    }
)
```

Associating column names to their respective values in a Retrieval-Augmented Generation (RAG) flow requires careful design, as vector databases typically work with unstructured or semi-structured text rather than tabular data directly. To ensure that both column names (headers) and their corresponding values are meaningfully represented in the vector embeddings, you can flatten the structured metadata into text during vectorization while maintaining the context.
- Flatten the structured data (metadata) into a natural language or key-value text representation.
- This provides the necessary context to the vectorizer, associating column names with their values.

For a CSV row like:

```csv
name,age,city
Alice,30,New York
```

- Flatten the metadata into text: "name: Alice, age: 30, city: New York"
- Natural Language Format: "The person's name is Alice, their age is 30, and they live in New York."
- This combined text becomes the input for vectorization.
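
The flattening step above can be sketched with the standard library alone. This is a minimal illustration, not the CSVLoader implementation: flatten_row and rows_to_records are hypothetical helper names, and the CSV text is inlined so the example runs without a file.

```python
import csv
import io

# Inline stand-in for data.csv so the sketch runs without a file on disk.
CSV_TEXT = """name,age,city
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
"""


def flatten_row(row: dict) -> str:
    """Render one row as 'header: value' pairs so each value keeps its column name."""
    return ", ".join(f"{header}: {value}" for header, value in row.items())


def rows_to_records(csv_text: str, source: str) -> list:
    """Return one record per row: flattened text to embed, plus metadata for filtering."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for line_number, row in enumerate(reader, start=1):
        records.append({
            "text": flatten_row(row),  # this string is what gets vectorized
            "metadata": {"source": source, "line_number": line_number, **row},
        })
    return records


records = rows_to_records(CSV_TEXT, source="data.csv")
print(records[0]["text"])
# name: Alice, age: 30, city: New York
```

Keeping the structured values in metadata as well means the vector store can still filter on exact column values, while the flattened text carries the header context into the embedding.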


The Source Column

The source_column argument in the LangChain CSV Loader allows you to specify a particular column in the CSV file that should be used as the source identifier for each document created from a row. By default, the CSV Loader assigns the same file_path as the source metadata for all documents created from the CSV file. This means all rows (documents) share the same source, which may not be ideal for certain use cases.

If you load a CSV file without specifying source_column:

```python
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
```

Metadata for each Document:
```python
{'source': './example_data/mlb_teams_2012.csv', 'row': 0}
{'source': './example_data/mlb_teams_2012.csv', 'row': 1}
```

If you load the same CSV file but specify source_column="Team":

```python
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv', source_column="Team")
data = loader.load()
```

Metadata for each Document:
```python
{'source': 'Nationals', 'row': 0}
{'source': 'Reds', 'row': 1}
{'source': 'Yankees', 'row': 2}
```

Now, the source metadata reflects the value in the "Team" column, which uniquely identifies each row (document).

In retrieval-augmented generation (RAG) workflows, this feature is especially useful when:
- You retrieve relevant rows (documents) from a vector database.
- You want to include the source of the information in the model's output.

Example

Query: "Which MLB team had the highest payroll in 2012?"

If source_column="Team" is used, the retrieved document for "Yankees" will have the source set to "Yankees". This allows the model to output:

```text
The Yankees had the highest payroll in 2012 with $197.96 million.
Source: Yankees
```
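
One way to attach the source to an answer is sketched below. RetrievedDoc is a hypothetical stand-in for a retrieved Document, and format_with_sources is an illustrative helper, not a LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedDoc:
    """Hypothetical stand-in for a Document returned by a retriever."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_with_sources(answer: str, docs: list) -> str:
    """Append the distinct 'source' values of the retrieved docs to the answer."""
    sources = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        if source not in sources:  # keep first-seen order, drop duplicates
            sources.append(source)
    return f"{answer}\nSource: {', '.join(sources)}"


retrieved = [RetrievedDoc("Team: Yankees", {"source": "Yankees", "row": 2})]
print(format_with_sources(
    "The Yankees had the highest payroll in 2012 with $197.96 million.",
    retrieved,
))
```

Without source_column, every retrieved row would report the same file path, so the citation could not point back to a specific row.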



### Retrieval: Document Loaders: File Directory Loader

The File Directory loader loads all files in a directory. Use the glob parameter to control which files are loaded. In the example below, the glob pattern selects only .md files, so the .rst and .html files in the directory are not loaded.

```python
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('../', glob="**/*.md")
docs = loader.load()
len(docs)
# 1
```
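
To see which files a glob pattern selects before running a loader, the pattern can be tried with pathlib directly. This is a sketch under the assumption that the loader's glob parameter follows pathlib-style matching; the directory layout is hypothetical and created in a temporary folder only for the illustration.

```python
import tempfile
from pathlib import Path


def matching_files(root: Path, pattern: str) -> list:
    """Return the files under root that match a glob pattern, as sorted relative paths."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: two Markdown files plus one .rst and one .html file.
    for name in ["README.md", "index.rst", "docs/guide.md", "docs/page.html"]:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder")

    recursive_md = matching_files(root, "**/*.md")  # any depth
    top_level_md = matching_files(root, "*.md")     # top level only

print(recursive_md)
# ['README.md', 'docs/guide.md']
print(top_level_md)
# ['README.md']
```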

Show a progress bar

By default a progress bar will not be shown. To show a progress bar, install the tqdm library (e.g. pip install tqdm), and set the show_progress parameter to True.

```python
loader = DirectoryLoader('../', glob="**/*.md", show_progress=True)
docs = loader.load()
    # 0it [00:00, ?it/s]
```

Use multithreading

By default the loading happens in one thread. To use several threads, set the use_multithreading flag to True.

```python
loader = DirectoryLoader('../', glob="**/*.md", use_multithreading=True)
docs = loader.load()
```

Change loader class

By default this uses the UnstructuredFileLoader class. You can switch to a different loader with the loader_cls parameter.

```python
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('../', glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()
len(docs)
# 1
```

If you need to load Python source code files, use the PythonLoader:

```python
from langchain_community.document_loaders import PythonLoader
loader = DirectoryLoader('../', glob="**/*.py", loader_cls=PythonLoader)
```

Auto-detect file encodings with TextLoader

When loading a large set of arbitrary files from a directory with the TextLoader class, you can run into decoding errors. For example, if a file such as example-non-utf8.txt uses a different encoding, load() fails with a message indicating which file could not be decoded. With TextLoader's default behavior, a failure on any single document aborts the whole loading process and no documents are loaded.

We can pass the parameter silent_errors to the DirectoryLoader to skip the files which could not be loaded and continue the load process.

```python
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
docs = loader.load()
```

We can also ask TextLoader to auto-detect the file encoding before failing, by passing autodetect_encoding=True to the loader class through loader_kwargs.

```python
text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
docs = loader.load()

doc_sources = [doc.metadata['source'] for doc in docs]
doc_sources
    # ['../../../../../tests/integration_tests/examples/example-non-utf8.txt',
    #  '../../../../../tests/integration_tests/examples/whatsapp_chat.txt',
    #  '../../../../../tests/integration_tests/examples/example-utf8.txt']
```
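
The difference between the two options can be illustrated with a small standard-library sketch. It is not how TextLoader is implemented: read_with_fallback and load_texts are hypothetical helpers, and the fixed list of candidate encodings is a simplification standing in for real encoding detection.

```python
import tempfile
from pathlib import Path


def read_with_fallback(path: Path, encodings=("utf-8", "cp1252")) -> str:
    """Try each candidate encoding in order; return the first successful decode."""
    raw = path.read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path.name}")


def load_texts(root: Path, pattern: str, silent_errors=False, autodetect=False) -> dict:
    """Read every matching file and map its name to the decoded text."""
    texts = {}
    for path in sorted(root.glob(pattern)):
        try:
            if autodetect:
                texts[path.name] = read_with_fallback(path)
            else:
                texts[path.name] = path.read_bytes().decode("utf-8")
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            if not silent_errors:
                raise  # default: one bad file aborts the whole load
    return texts


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "utf8.txt").write_bytes("héllo".encode("utf-8"))
    (root / "latin.txt").write_bytes("café".encode("cp1252"))  # not valid UTF-8

    skipped = load_texts(root, "*.txt", silent_errors=True)
    recovered = load_texts(root, "*.txt", autodetect=True)

print(sorted(skipped))
# ['utf8.txt']
print(recovered["latin.txt"])
# café
```

Skipping errors keeps the load running but silently drops files, while trying other encodings recovers their content, so the second option is usually preferable when the text matters.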

### Retrieval: Document Loaders: HTML Loader

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.

We can load HTML documents into a document format that we can use downstream.

```python
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()

data
#     [Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]
```

Loading HTML with BeautifulSoup4

We can also use BeautifulSoup4 to load HTML documents using the BSHTMLLoader. This will extract the text from the HTML into page_content, and the page title as title into metadata.

```python
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("example_data/fake-content.html")
data = loader.load()
data
#     [Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
```

Loading HTML with SpiderLoader

Spider is a web crawling service that bills itself as the fastest crawler. It converts any website into HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI. It also offers high-performance proxies to avoid detection, caching of AI actions, webhooks for crawl status, and scheduled crawls.

You need a Spider API key to use this loader. You can get one on spider.cloud.

```python
from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(
    api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
)

data = loader.load()
```



### Retrieval: Document Loaders: JSON Loader

JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

The JSONLoader uses a specified jq schema to parse the JSON files. It uses the jq Python package.

```shell
pip install jq
```

```python
from langchain_community.document_loaders import JSONLoader

import json
from pathlib import Path
from pprint import pprint


file_path='./example_data/facebook_chat.json'
data = json.loads(Path(file_path).read_text())
```

Using JSONLoader

Suppose we are interested in extracting the values under the content field within the messages key of the JSON data. This can easily be done through the JSONLoader as shown below.

```python
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()

pprint(data)
    # [Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
    #  Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
    #  Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),
    #  Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),
    #  Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),
    #  Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),
    #  Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),
    #  Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),
    #  Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),
    #  Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),
    #  Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]
```

Extracting metadata

Metadata from a JSON file refers to additional information extracted from the JSON structure that provides context about the primary data being processed. In the context of LangChain or similar frameworks, metadata is often included in a Document object to:
- Describe the source or context of the main content.
- Improve queryability by allowing filtering, sorting, or grouping.
- Provide additional attributes for more meaningful responses in workflows like retrieval-augmented generation (RAG).

Use the JSONLoader in LangChain to extract both the main text content (page_content) and additional metadata from a JSON file:
- Content: The main message or text from the JSON.
- Metadata: Additional contextual information such as sender_name, timestamp_ms, and the source file.

Customizing the Loader:
- Use the jq_schema argument to specify where to look in the JSON structure.
- Use the metadata_func argument to define how to extract metadata for each record.
- Use the content_key argument to specify the key in the JSON object containing the main content.

The JSON structure might look like this:

```json
{
  "messages": [
    {
      "content": "Hi!",
      "sender_name": "User 1",
      "timestamp_ms": 1675549022673
    },
    {
      "content": "Good morning!",
      "sender_name": "User 2",
      "timestamp_ms": 1675577876645
    }
  ]
}
```

The loader uses jq_schema to locate the relevant records within the JSON. In this case, .messages[] tells the loader to iterate over the objects inside the messages array. The content_key argument tells the loader which field from the JSON record to use as the main text content for the document. The metadata_func defines how to pull specific fields from each record and include them in the metadata dictionary of the Document object.

```python
# Define the metadata extraction function.
def metadata_func(record: dict, metadata: dict) -> dict:

    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")

    return metadata


loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func
)

data = loader.load()
```

Resulting Document:

```python
Document(
    page_content="Hi!",
    metadata={
        "source": "./example_data/facebook_chat.json",
        "seq_num": 1,
        "sender_name": "User 1",
        "timestamp_ms": 1675549022673
    }
)
```
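
For intuition, the roles of jq_schema, content_key, and metadata_func can be sketched in plain Python without jq. This is an illustration only, not the JSONLoader implementation: load_messages is a hypothetical helper, the records are returned as plain dicts, and the JSON is the two-message sample shown above.

```python
import json

# The two-message sample structure, inlined so the sketch runs without a file.
RAW_JSON = """
{
  "messages": [
    {"content": "Hi!", "sender_name": "User 1", "timestamp_ms": 1675549022673},
    {"content": "Good morning!", "sender_name": "User 2", "timestamp_ms": 1675577876645}
  ]
}
"""


def metadata_func(record: dict, metadata: dict) -> dict:
    """Copy selected fields of the record into the metadata dictionary."""
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata


def load_messages(raw_json: str, source: str, content_key: str, metadata_func) -> list:
    """Hypothetical helper: one record per message, numbered from 1."""
    data = json.loads(raw_json)
    documents = []
    # Plain-Python equivalent of the jq expression .messages[]
    for seq_num, record in enumerate(data["messages"], start=1):
        metadata = metadata_func(record, {"source": source, "seq_num": seq_num})
        documents.append({"page_content": record[content_key], "metadata": metadata})
    return documents


docs = load_messages(RAW_JSON, "./example_data/facebook_chat.json", "content", metadata_func)
print(docs[0]["page_content"], docs[0]["metadata"]["sender_name"])
# Hi! User 1
```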

### Retrieval: Document Loaders: Markdown Loader

Markdown is a lightweight markup language for creating formatted text using a plain-text editor.

```python
# !pip install unstructured > /dev/null

from langchain_community.document_loaders import UnstructuredMarkdownLoader

markdown_path = "../../../../../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)

data = loader.load()
```

### Retrieval: Document Loaders: Microsoft Office Loader

You can load commonly used file formats, including DOCX, XLSX, and PPTX documents, into a document format that can be used downstream.

Azure AI Document Intelligence is a machine-learning-based service that extracts text (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value pairs from digital or scanned PDFs, images, Office and HTML files. Document Intelligence supports PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX and HTML.

The current implementation of the loader that uses Document Intelligence can incorporate content page by page and turn it into LangChain documents.

### Retrieval: Document Loaders: PDF Loader

Load a PDF using pypdf into a list of documents, where each document contains the page content and metadata with the page number.

```python
# pip install pypdf

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
pages[0]
#     Document(page_content='LayoutParser : A Uni\x0ced Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\nfmelissadell,jacob carlson g@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model con\x0cgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\ne\x0borts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. This paper introduces LayoutParser , an open-source\nlibrary for streamlining the usage of DL in DIA research and applica-\ntions. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout de-\ntection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digiti-\nzation pipelines. 
# We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\nThe library is publicly available at https://layout-parser.github.io .\nKeywords: Document Image Analysis ·Deep Learning ·Layout Analysis\n·Character Recognition ·Open Source library ·Toolkit.\n1 Introduction\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classi\x0ccation [ 11,arXiv:2103.15348v2  [cs.CV]  21 Jun 2021', metadata={'source': 'example_data/layout-parser-paper.pdf', 'page': 0})

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
for doc in docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])
```

An advantage of this approach is that documents can be retrieved with page numbers.

Using the rapidocr-onnxruntime package, we can also extract text from images in the PDF:

```python
# pip install rapidocr-onnxruntime

loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()
pages[4].page_content
# 'LayoutParser : A Uniﬁed Toolkit for DL-Based DIA 5\nTable 1: Current layout detection models in the LayoutParser model zoo\nDataset Base Model1Large Model Notes\nPubLayNet [38] F / M M Layouts of modern scientiﬁc documents\nPRImA [3] M - Layouts of scanned modern magazines and scientiﬁc reports\nNewspaper [17] F - Layouts of scanned US newspapers from the 20th century\nTableBank [18] F F Table region on modern scientiﬁc and business document\nHJDataset [31] F / M - Layouts of history Japanese documents\n1For each dataset, we train several models of diﬀerent sizes for diﬀerent needs (the trade-oﬀ between accuracy\nvs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101\nbackbones [ 13], respectively. One can train models of diﬀerent architectures, like Faster R-CNN [ 28] (F) and Mask\nR-CNN [ 12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained\nusing the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model\nzoo in coming months.\nlayout data structures , which are optimized for eﬃciency and versatility. 3) When\nnecessary, users can employ existing or customized OCR models via the uniﬁed\nAPI provided in the OCR module . 4)LayoutParser comes with a set of utility\nfunctions for the visualization and storage of the layout data. 5) LayoutParser\nis also highly customizable, via its integration with functions for layout data\nannotation and model training . We now provide detailed descriptions for each\ncomponent.\n3.1 Layout Detection Models\nInLayoutParser , a layout model takes a document image as an input and\ngenerates a list of rectangular boxes for the target content regions. Diﬀerent\nfrom traditional methods, it relies on deep convolutional neural networks rather\nthan manually curated rules to identify content regions. 
# It is formulated as an\nobject detection problem and state-of-the-art models like Faster R-CNN [ 28] and\nMask R-CNN [ 12] are used. This yields prediction results of high accuracy and\nmakes it possible to build a concise, generalized interface for layout detection.\nLayoutParser , built upon Detectron2 [ 35], provides a minimal API that can\nperform layout detection with only four lines of code in Python:\n1import layoutparser as lp\n2image = cv2. imread (" image_file ") # load images\n3model = lp. Detectron2LayoutModel (\n4 "lp :// PubLayNet / faster_rcnn_R_50_FPN_3x / config ")\n5layout = model . detect ( image )\nLayoutParser provides a wealth of pre-trained model weights using various\ndatasets covering diﬀerent languages, time periods, and document types. Due to\ndomain shift [ 7], the prediction performance can notably drop when models are ap-\nplied to target samples that are signiﬁcantly diﬀerent from the training dataset. As\ndocument structures and layouts vary greatly in diﬀerent domains, it is important\nto select models trained on a dataset similar to the test samples. A semantic syntax\nis used for initializing the model weights in LayoutParser , using both the dataset\nname and model name lp://<dataset-name>/<model-architecture-name> .'
```

Using PyMuPDF

This is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages.

```python
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("example_data/layout-parser-paper.pdf")

data = loader.load()
data[0]
#     Document(page_content='LayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (�), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. This paper introduces LayoutParser, an open-source\nlibrary for streamlining the usage of DL in DIA research and applica-\ntions. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout de-\ntection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digiti-\nzation pipelines. 
# We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\nThe library is publicly available at https://layout-parser.github.io.\nKeywords: Document Image Analysis · Deep Learning · Layout Analysis\n· Character Recognition · Open Source library · Toolkit.\n1\nIntroduction\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classiﬁcation [11,\narXiv:2103.15348v2  [cs.CV]  21 Jun 2021\n', lookup_str='', metadata={'file_path': 'example_data/layout-parser-paper.pdf', 'page_number': 1, 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20210622012710Z', 'modDate': 'D:20210622012710Z', 'trapped': '', 'encryption': None}, lookup_index=0)
```

Using Unstructured

The unstructured[all-docs] package currently supports loading text files, PowerPoint files, HTML, PDFs, images, and more.

```python
# pip install unstructured[pdf]

from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("example_data/layout-parser-paper.pdf")

data = loader.load()
```

Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying mode="elements".

```python
loader = UnstructuredPDFLoader("example_data/layout-parser-paper.pdf", mode="elements")

data = loader.load()
data[0]
#     Document(page_content='LayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (�), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. This paper introduces LayoutParser, an open-source\nlibrary for streamlining the usage of DL in DIA research and applica-\ntions. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout de-\ntection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digiti-\nzation pipelines. 
# We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\nThe library is publicly available at https://layout-parser.github.io.\nKeywords: Document Image Analysis · Deep Learning · Layout Analysis\n· Character Recognition · Open Source library · Toolkit.\n1\nIntroduction\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classiﬁcation [11,\narXiv:2103.15348v2  [cs.CV]  21 Jun 2021\n', lookup_str='', metadata={'file_path': 'example_data/layout-parser-paper.pdf', 'page_number': 1, 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20210622012710Z', 'modDate': 'D:20210622012710Z', 'trapped': '', 'encryption': None}, lookup_index=0)
```

### Retrieval: Text Splitters

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text.

At a high level, text splitters work as follows:
- Split the text up into small, semantically meaningful chunks (often sentences).
- Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
- Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:
- How the text is split
- How the chunk size is measured
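
The steps and the two customization axes above can be sketched in a few lines. This is a simplified illustration, not the algorithm of any LangChain splitter: split_text is a hypothetical function, sentences are found with a naive regular expression, and a chunk may exceed chunk_size when the carried-over overlap plus the next sentence is too long.

```python
import re
from typing import Callable


def split_text(text: str, chunk_size: int, chunk_overlap: int,
               length_function: Callable[[str], int] = len) -> list:
    """Merge sentences into chunks of roughly chunk_size, with overlap between chunks."""
    # Step 1: split into small units (here: sentences, via a naive regex).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks = []
    current = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and length_function(candidate) > chunk_size:
            # Step 3: the chunk is full, so emit it as its own piece of text.
            chunks.append(" ".join(current))
            # Carry trailing sentences over, as long as they fit in chunk_overlap.
            overlap = []
            for previous in reversed(current):
                if length_function(" ".join([previous] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, previous)
            current = overlap
        # Step 2: keep combining small units into the current chunk.
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Cats sleep a lot. Dogs like walks. Birds can fly. Fish swim in water."
by_chars = split_text(text, chunk_size=40, chunk_overlap=20)
by_words = split_text(text, chunk_size=8, chunk_overlap=0,
                      length_function=lambda s: len(s.split()))
print(by_chars)
# ['Cats sleep a lot. Dogs like walks.', 'Dogs like walks. Birds can fly.', 'Birds can fly. Fish swim in water.']
print(len(by_words))
# 2
```

Changing the regular expression changes how the text is split, and changing length_function (characters, words, or tokens) changes how the chunk size is measured.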

LangChain offers many different types of text splitters. These all live in the langchain-text-splitters package:
- Recursive:
    - Name: Recursive
    - Classes: RecursiveCharacterTextSplitter, RecursiveJsonSplitter
    - Splits On: A list of user defined characters
    - Description: Recursively splits text. This splitting is trying to keep related pieces of text next to each other. This is the recommended way to start splitting text.
- HTML:
    - Name: HTML
    - Classes: HTMLHeaderTextSplitter, HTMLSectionSplitter
    - Splits On: HTML specific characters
    - Description: Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML)
- Markdown:
    - Name: Markdown
    - Classes: MarkdownHeaderTextSplitter
    - Splits On: Markdown specific characters
    - Description: Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown)
- Code:
    - Name: Code
    - Classes: Many languages supported
    - Splits On: Code (Python, JS) specific characters
    - Description: Splits text based on characters specific to coding languages. 15 different languages are available to choose from.
- Token:
    - Name: Token
    - Classes: tiktoken, spaCy, SentenceTransformers, NLTK, Hugging Face tokenizer
    - Splits On: Tokens
    - Description: Splits text on tokens. There exist a few different ways to measure tokens.
- Character:
    - Name: Character
    - Classes: CharacterTextSplitter
    - Splits On: A user defined character
    - Description: Splits text based on a user defined character. One of the simpler methods.
- [Experimental] Semantic Chunker:
    - Name: Semantic Chunker
    - Classes: SemanticChunker
    - Splits On: Sentences
    - Description: First splits on sentences. Then combines ones next to each other if they are semantically similar enough.
- AI21 Semantic Text Splitter:
    - Name: AI21 Semantic Text Splitter
    - Classes: AI21SemanticTextSplitter
    - Splits On: 
    - Description: Identifies distinct topics that form coherent pieces of text and splits along those.

You can evaluate text splitters with the Chunkviz utility created by Greg Kamradt. Chunkviz visualizes how your text is being split, which helps when tuning the splitting parameters.

### Retrieval: Text Splitters: HTML Splitter

Similar in concept to the MarkdownHeaderTextSplitter, the HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.

```python
# %pip install -qU langchain-text-splitters

from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
# [Document(page_content='Foo'),
#  Document(page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
#  Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
#  Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
#  Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
#  Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
#  Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
#  Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]
```

Pipelined to another splitter, with HTML loaded from a web URL:

```python
from langchain_text_splitters import (
    HTMLHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]
# [Document(page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),
#  Document(page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),
#  Document(page_content='This account of Gödel’s discovery was told to Hao Wang very much after the fact; but in Gödel’s contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),
#  Document(page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödel’s publication of that theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),
#  Document(page_content='We now describe the proof of the two theorems, formulating Gödel’s results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödel’s notation.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'})]
```

Similar in concept to the HTMLHeaderTextSplitter, the HTMLSectionSplitter is a "structure-aware" chunker that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline. Internally, it uses the RecursiveCharacterTextSplitter when the section size is larger than the chunk size.

```python
from langchain_text_splitters import HTMLSectionSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
```

Pipelined to another splitter, with HTML loaded from a string:

```python
from langchain_text_splitters import HTMLSectionSplitter, RecursiveCharacterTextSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text(html_string)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits
```

### Retrieval: Text Splitters: Character Splitter

This is the simplest method. It splits on a single separator (by default "\n\n") and measures chunk length by number of characters.
- How the text is split: by a single character separator.
- How the chunk size is measured: by number of characters.

It splits text strictly at one predefined separator (non-recursive), such as a space " " or a newline "\n", defined at initialization. The resulting pieces are then merged into chunks of up to chunk_size. A piece that is longer than chunk_size is not split any further, so a chunk can exceed chunk_size. It is less flexible than the RecursiveCharacterTextSplitter and is designed for simpler use cases.
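
The single-separator behavior can be sketched in plain Python. This is only an illustration under simplifying assumptions (greedy packing, no chunk_overlap), and `split_on_separator` is a hypothetical helper, not part of LangChain:

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily pack pieces up to chunk_size."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece fits, or it starts a new chunk. An oversized piece
            # is kept whole because there is no second separator to fall back on.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccccccccccc\n\ndd"
print(split_on_separator(sample, "\n\n", chunk_size=10))
# ['aaaa\n\nbbbb', 'cccccccccccccccc', 'dd']
```

The 16-character middle piece comes back intact even though chunk_size is 10, which is the main practical difference from a recursive splitter.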

```python
# %pip install -qU langchain-text-splitters

with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
# page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```

Here's an example of passing metadata along with the documents. Notice that the metadata is carried along with each split document.

```python
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents(
    [state_of_the_union, state_of_the_union], metadatas=metadatas
)
print(documents[0])
# page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' metadata={'document': 1}
```

### Retrieval: Text Splitters: Code Splitter

RecursiveCharacterTextSplitter can split source code in a number of supported languages through its from_language constructor. Import the Language enum and specify the language.

```python
# %pip install -qU langchain-text-splitters

from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# Full list of supported languages
[e.value for e in Language]

RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
# ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
# [Document(page_content='def hello_world():\n    print("Hello, World!")'),
#  Document(page_content='# Call the function\nhello_world()')]
```

### Retrieval: Text Splitters: Markdown Splitter

A markdown file is organized by headers, so creating chunks within specific header groups is an intuitive approach. MarkdownHeaderTextSplitter does this by splitting a markdown file on a specified set of headers.

```python
# %pip install -qU langchain-text-splitters

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
# [Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
#  Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
#  Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
```

By default, MarkdownHeaderTextSplitter strips headers being split on from the output chunk's content. This can be disabled by setting strip_headers = False.

```python
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
# [Document(page_content='# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
#  Document(page_content='### Boo  \nHi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
#  Document(page_content='## Baz  \nHi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
```
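
To make the header-tracking idea concrete, here is a rough plain-Python sketch. It is not how MarkdownHeaderTextSplitter is implemented; `split_by_headers` is a hypothetical helper that assumes headers start at the beginning of a (stripped) line and returns plain `(text, metadata)` tuples instead of Document objects:

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]):
    """Group lines under the most recent headers; returns (text, metadata) pairs."""
    chunks = []
    active = {}   # header level -> (metadata name, header title)
    buffer = []   # content lines collected since the last header

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append((text, metadata))
        buffer.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        for marker, name in headers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                for stale in [lvl for lvl in active if lvl >= level]:
                    del active[stale]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# Foo\n\n## Bar\n\nHi this is Jim\n\n### Boo\n\nHi this is Lance\n\n## Baz\n\nHi this is Molly"
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for text, metadata in split_by_headers(doc, headers):
    print(text, metadata)
# Hi this is Jim {'Header 1': 'Foo', 'Header 2': 'Bar'}
# Hi this is Lance {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}
# Hi this is Molly {'Header 1': 'Foo', 'Header 2': 'Baz'}
```

Note how `## Baz` drops both the old Header 2 and the Header 3 beneath it, which is why the last chunk carries only two metadata entries.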

Within each markdown group we can then apply any text splitter we want.

```python
markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(md_header_splits)
splits
```

### Retrieval: Text Splitters: JSON Splitter

This JSON splitter traverses JSON data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole but will split them if needed to keep chunks between min_chunk_size and max_chunk_size. If a value is not nested JSON but rather a very large string, the string will not be split. If you need a hard cap on the chunk size, consider following this with a recursive text splitter on those chunks.
- How the text is split: by JSON value.
- How the chunk size is measured: by number of characters.

```python
# %pip install -qU langchain-text-splitters

import json

import requests

json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)

# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)

# The splitter can also output documents
docs = splitter.create_documents(texts=[json_data])

# or a list of strings
texts = splitter.split_text(json_data=json_data)

print(texts[0])
print(texts[1])
# {"openapi": "3.0.2", "info": {"title": "LangChainPlus", "version": "0.1.0"}, "paths": {"/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_sessions__session_id__get"}}}}
# {"paths": {"/sessions/{session_id}": {"get": {"parameters": [{"required": true, "schema": {"title": "Session Id", "type": "string", "format": "uuid"}, "name": "session_id", "in": "path"}, {"required": false, "schema": {"title": "Include Stats", "type": "boolean", "default": false}, "name": "include_stats", "in": "query"}, {"required": false, "schema": {"title": "Accept", "type": "string"}, "name": "accept", "in": "header"}]}}}}

# Let's look at the size of the chunks
print([len(text) for text in texts][:10])

# Reviewing one of these chunks that was bigger we see there is a list object there
print(texts[1])
# [293, 431, 203, 277, 230, 194, 162, 280, 223, 193]
# {"paths": {"/sessions/{session_id}": {"get": {"parameters": [{"required": true, "schema": {"title": "Session Id", "type": "string", "format": "uuid"}, "name": "session_id", "in": "path"}, {"required": false, "schema": {"title": "Include Stats", "type": "boolean", "default": false}, "name": "include_stats", "in": "query"}, {"required": false, "schema": {"title": "Accept", "type": "string"}, "name": "accept", "in": "header"}]}}}}

# The json splitter by default does not split lists
# the following will preprocess the json and convert list to dict with index:item as key:val pairs
texts = splitter.split_text(json_data=json_data, convert_lists=True)

# Let's look at the size of the chunks again. With lists converted, they should now all be under the max
print([len(text) for text in texts][:10])
```
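
The list-to-dict preprocessing mentioned in the comments above can be pictured with a small sketch. `lists_to_dicts` is a hypothetical helper written for illustration, and the use of string indices as keys is an assumption about the output shape rather than a statement about the library's internals:

```python
def lists_to_dicts(data):
    """Recursively replace every list with a dict keyed by the item index."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}
    return data


example = {"parameters": [{"name": "session_id"}, {"name": "accept"}], "tags": ["tracer-sessions"]}
print(lists_to_dicts(example))
# {'parameters': {'0': {'name': 'session_id'}, '1': {'name': 'accept'}}, 'tags': {'0': 'tracer-sessions'}}
```

Once lists are dicts, a depth-first splitter can descend into them like any other nested object, which is what allows a long list to be spread across several chunks.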

### Retrieval: Text Splitters: Recursive Character Splitter

The RecursiveCharacterTextSplitter in LangChain is a utility for breaking large blocks of text into smaller chunks, while trying to preserve meaningful semantic units (e.g., paragraphs, sentences, or words). It does this recursively, using a list of delimiters provided in order of priority.
- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.

By default, the splitter uses the following list of delimiters:
```python
["\n\n", "\n", " ", ""]
```
- Start by attempting to split the text into chunks using the first delimiter in the list ("\n\n", for paragraphs, since paragraphs are typically separated by two line breaks).
- If the resulting chunks are too large (i.e., exceed the specified chunk_size), split each chunk further using the next delimiter in the list ("\n", which may mark a paragraph that is not separated from the previous one by a blank line, or may mark the end of a sentence).
- If the resulting chunks are still too large (i.e., exceed the specified chunk_size), split each chunk further using " ", which at the bare minimum separates words.
- Repeat until chunks are small enough or the last delimiter ("", which splits on every character) is reached.
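
The steps above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it ignores chunk_overlap, drops the separator when splitting, and handles the final "" delimiter by slicing fixed-size windows. `pack` and `recursive_split` are hypothetical helper names:

```python
def pack(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join small pieces back together without exceeding chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]
    separator, remaining = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut the text into fixed-size windows of characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, small = [], []
    for piece in text.split(separator):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the small pieces seen so far, then retry the oversized
            # piece with the next delimiter in the list.
            chunks.extend(pack(small, separator, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    chunks.extend(pack(small, separator, chunk_size))
    return chunks


text = "alpha beta gamma\n\ndelta epsilon zeta eta theta"
print(recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=16))
# ['alpha beta gamma', 'delta epsilon', 'zeta eta theta']
```

The first paragraph fits as-is, while the second is too long, contains no "\n", and is therefore broken on spaces and re-packed into two chunks.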

Note the "\n" (newline character) doesn't directly represent a sentence. Instead, it is often used to indicate a line break, which can serve as a boundary for splitting text into smaller chunks. Depending on the structure of the text, a newline may indirectly correspond to the end of a sentence or a logical separation in the content, such as between paragraphs or bullet points.

In many types of text, especially unstructured or semi-structured data like emails, chat logs, or notes, a line break often separates distinct ideas, sentences, or paragraphs.

For example:
```text
Hello there!
How are you today?
```

Splitting by "\n" in this case yields:
```text
["Hello there!", "How are you today?"]
```

The key idea of recursive text splitting: if a chunk is still too large after splitting with one delimiter, the splitter applies the next delimiter to that chunk, continuing until the chunk meets the size requirement.

The chunk_overlap parameter allows for overlapping between chunks. This ensures that chunks retain some context from adjacent pieces of text, which can improve performance in tasks like question-answering or summarization.
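
As a rough picture of what overlap means, the sketch below slides a fixed-size window over a string so that each chunk repeats the tail of the previous one. It works on raw characters only and is not how the splitter computes overlap (which operates on separator-delimited pieces); `sliding_chunks` is a hypothetical helper:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows where consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk begins with the last two characters of the chunk before it, so content near a boundary appears in both neighbors.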

Parameters:
- chunk_size: Maximum length of each chunk.
- chunk_overlap: Number of characters overlapping between consecutive chunks.
- separators: List of characters (delimiters) to use for splitting, in order of priority.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=30,  # large enough to hold each 25-character paragraph whole
    chunk_overlap=5,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_text("Hello there! How are you?\n\nI hope you're doing well.")
print(chunks)
# ['Hello there! How are you?', "I hope you're doing well."]
```

In the RecursiveCharacterTextSplitter from LangChain, the length_function and is_separator_regex options provide advanced customization for how text is split into chunks.

The length_function is used to calculate the "length" of a chunk during the splitting process. This is useful because "length" can mean different things depending on your context, such as:
- Number of characters.
- Number of tokens (e.g., for language models).
- Number of words.

By default, the splitter uses the length of the string (number of characters).

You can define a custom function to measure the chunk length in a way that suits your use case. For example, if you're working with token-based limits (e.g., for GPT models), you can use a tokenizer to calculate token length.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import GPT2TokenizerFast

# Use a tokenizer to calculate token lengths
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def token_length(text: str) -> int:
    return len(tokenizer.encode(text))

text_splitter = RecursiveCharacterTextSplitter(
    length_function=token_length,
    chunk_size=100,  # Limit chunk size by token count
    chunk_overlap=10
)
```

Here, token_length will calculate the chunk length in tokens, ensuring that the chunks fit within the token limit.

The is_separator_regex parameter specifies whether the separators provided to the splitter should be treated as regular expressions (regex) or as plain strings. By default, this is set to False, meaning the separators are treated as plain strings. Set is_separator_regex=True if you want to define separators using regex patterns for more advanced splitting behavior. For example, you might want to split text by patterns like:
- Multiple spaces (\s+).
- Newlines followed by certain characters (\n\s*-).
- Punctuation like periods or question marks ([.!?]).
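
To see what each of these patterns matches, you can try them with Python's built-in `re.split`. This only illustrates the patterns themselves; the splitter additionally merges pieces back into chunks, and the sample strings are made up for the demonstration:

```python
import re

# Runs of whitespace (spaces, tabs, newlines) act as one separator.
by_whitespace = re.split(r"\s+", "a  b \n c")
print(by_whitespace)
# ['a', 'b', 'c']

# A newline followed by optional whitespace and a dash, e.g. a bullet list.
by_bullets = re.split(r"\n\s*-", "intro\n - one\n- two")
print(by_bullets)
# ['intro', ' one', ' two']

# Sentence-ending punctuation.
by_punctuation = re.split(r"[.!?]", "Hi. How are you? Fine!")
print(by_punctuation)
# ['Hi', ' How are you', ' Fine', '']
```

Note that a plain "." would match any character if it were interpreted as a regex, which is why patterns such as `\.` need escaping when is_separator_regex=True.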

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=[r"\.\s+", r"\n", r"\s+"],  # Regex patterns
    is_separator_regex=True,
    chunk_size=500,
    chunk_overlap=50,
)
```


### Retrieval: Text Splitters: Semantic Splitter

This technique splits the text based on semantic similarity. At a high level, it splits the text into sentences, groups them into groups of 3 sentences, and then merges groups that are similar in the embedding space.

```python
# !pip install --quiet langchain_experimental langchain_openai

# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
# Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving.
```

This chunker works by determining where to "break" apart sentences. This is done by looking at the differences in embeddings between consecutive sentences. When that difference exceeds some threshold, the text is split at that point. There are a few ways to determine that threshold.

The default way to split is based on percentile. In this method, all differences between sentences are calculated, and a split is made wherever a difference is greater than the X-th percentile of those differences.
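
The percentile rule can be sketched with the standard library alone. This is an illustration under assumptions, not the SemanticChunker code: it assumes cosine distance as the "difference", uses a linear-interpolation percentile, and the distance values are made up. All helper names are hypothetical:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation, pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def percentile_breakpoints(distances: list[float], pct: float) -> list[int]:
    """Indices i where the gap between sentence i and i + 1 exceeds the threshold."""
    threshold = percentile(distances, pct)
    return [i for i, distance in enumerate(distances) if distance > threshold]


def group_sentences(sentences: list[str], cut_after: list[int]) -> list[list[str]]:
    groups, start = [], 0
    for index in cut_after:
        groups.append(sentences[start:index + 1])
        start = index + 1
    groups.append(sentences[start:])
    return groups


# Made-up distances between 8 consecutive sentences (7 gaps).
distances = [0.10, 0.12, 0.80, 0.11, 0.09, 0.75, 0.13]
cuts = percentile_breakpoints(distances, pct=80)
print(cuts)
# [2, 5]
print(group_sentences([f"s{i}" for i in range(8)], cuts))
# [['s0', 's1', 's2'], ['s3', 's4', 's5'], ['s6', 's7']]
```

A higher percentile yields fewer, larger chunks, because fewer gaps exceed the threshold.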

```python
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
# Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving.

print(len(docs))
# 26
```

Standard Deviation: In this method, a split is made wherever a difference is more than X standard deviations above the mean difference.
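
A minimal sketch of this rule, assuming the threshold is the mean of the differences plus X population standard deviations. The distances are made-up values and `std_dev_breakpoints` is a hypothetical helper, not a LangChain function:

```python
import statistics


def std_dev_breakpoints(distances: list[float], num_std: float) -> list[int]:
    """Indices whose distance exceeds mean + num_std * standard deviation."""
    threshold = statistics.mean(distances) + num_std * statistics.pstdev(distances)
    return [i for i, distance in enumerate(distances) if distance > threshold]


distances = [0.10, 0.12, 0.80, 0.11, 0.09, 0.75, 0.13]
print(std_dev_breakpoints(distances, num_std=1.0))
# [2, 5]
print(std_dev_breakpoints(distances, num_std=3.0))
# []
```

With a large X, only extreme jumps count as breakpoints, so the chunker may return very few (and very long) chunks for a document whose topics shift gradually.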

```python
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation"
)

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
# Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving. And the costs and the threats to America and the world keep rising. That’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. The United States is a member along with 29 other nations. It matters. American diplomacy matters. American resolve matters. Putin’s latest attack on Ukraine was premeditated and unprovoked. 
He rejected repeated efforts at diplomacy. He thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. We prepared extensively and carefully. We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. I spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. We countered Russia’s lies with truth. And now that he has acted the free world is holding him accountable. Along with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland. We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. Together with our allies –we are right now enforcing powerful economic sanctions. We are cutting off Russia’s largest banks from the international financial system. Preventing Russia’s central bank from defending the Russian Ruble making Putin’s $630 Billion “war fund” worthless. We are choking off Russia’s access to technology that will sap its economic strength and weaken its military for years to come. Tonight I say to the Russian oligarchs and corrupt leaders who have bilked billions of dollars off this violent regime no more. The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs. We are joining with our European allies to find and seize your yachts your luxury apartments your private jets. We are coming for your ill-begotten gains. And tonight I am announcing that we will join our allies in closing off American air space to all Russian flights – further isolating Russia – and adding an additional squeeze –on their economy. 
The Ruble has lost 30% of its value. The Russian stock market has lost 40% of its value and trading remains suspended. Russia’s economy is reeling and Putin alone is to blame. Together with our allies we are providing support to the Ukrainians in their fight for freedom. Military assistance. Economic assistance. Humanitarian assistance. We are giving more than $1 Billion in direct assistance to Ukraine. And we will continue to aid the Ukrainian people as they defend their country and to help ease their suffering. Let me be clear, our forces are not engaged and will not engage in conflict with Russian forces in Ukraine. Our forces are not going to Europe to fight in Ukraine, but to defend our NATO Allies – in the event that Putin decides to keep moving west. For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to protect NATO countries including Poland, Romania, Latvia, Lithuania, and Estonia. As I have made crystal clear the United States and our Allies will defend every inch of territory of NATO countries with the full force of our collective power. And we remain clear-eyed. The Ukrainians are fighting back with pure courage. But the next few days weeks, months, will be hard on them. Putin has unleashed violence and chaos. But while he may make gains on the battlefield – he will pay a continuing high price over the long run. And a proud Ukrainian people, who have known 30 years  of independence, have repeatedly shown that they will not tolerate anyone who tries to take their country backwards. To all Americans, I will be honest with you, as I’ve always promised. A Russian dictator, invading a foreign country, has costs around the world. And I’m taking robust action to make sure the pain of our sanctions  is targeted at Russia’s economy. And I will use every tool at our disposal to protect American businesses and consumers. 
Tonight, I can announce that the United States has worked with 30 other countries to release 60 Million barrels of oil from reserves around the world. America will lead that effort, releasing 30 Million barrels from our own Strategic Petroleum Reserve. And we stand ready to do more if necessary, unified with our allies. These steps will help blunt gas prices here at home. And I know the news about what’s happening can seem alarming.

print(len(docs))
# 4
```

Interquartile: In this method, the interquartile range of the distances between consecutive sentence embeddings is used to decide where to split chunks.

```python
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="interquartile"
)

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
# Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving.

print(len(docs))
# 25
```
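To make the interquartile idea concrete, here is a minimal sketch using only the standard library. It assumes the breakpoint threshold is the mean distance plus 1.5 times the interquartile range; both the rule and the 1.5 factor are assumptions for illustration, and the `distances` list is made-up data standing in for the distances between consecutive sentence embeddings.

```python
import statistics


def interquartile_breakpoints(distances, scale=1.5):
    """Return indices whose distance exceeds mean + scale * IQR (assumed rule)."""
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    threshold = statistics.mean(distances) + scale * (q3 - q1)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between consecutive sentences.
distances = [0.10, 0.12, 0.11, 0.55, 0.09, 0.13, 0.60, 0.10]
breakpoints = interquartile_breakpoints(distances)
print(breakpoints)  # positions where a new chunk would start
```

Only the two unusually large distances are flagged, so the text would be cut at those two positions.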


### Retrieval: Text Splitters: Token Splitter

Language models have a token limit, and your input must not exceed it. When you split text into chunks, it is therefore a good idea to measure chunk size in tokens. There are many tokenizers, so count tokens with the same tokenizer that the language model uses.

The CharacterTextSplitter splits text strictly on a defined character separator (e.g., spaces, newlines). By default it measures chunk size in characters, not tokens. This means the chunks created by CharacterTextSplitter may not align with token-based length constraints.

1) tiktoken

Tiktoken is the tokenizer used by OpenAI's models (e.g., GPT). It breaks text into tokens, which are smaller units like words or parts of words. If you define a chunk size using tokens, the number of tokens in a chunk should ideally be measured by a tokenizer (like Tiktoken) to ensure compatibility with model constraints.

tiktoken is a fast BPE tokenizer created by OpenAI. We can use it to estimate tokens used:
- How the text is split: by character passed in.
- How the chunk size is measured: by tiktoken tokenizer.

The .from_tiktoken_encoder() method ties the CharacterTextSplitter to a tiktoken tokenizer, but only for measuring and merging splits, not for the initial splitting. The CharacterTextSplitter still splits text on its character separator (e.g., spaces, newlines). "Merging splits" means that after the initial split, the tiktoken tokenizer counts the tokens in each piece, and adjacent pieces are combined into chunks for as long as the combined token count stays within chunk_size.

Let's break it down:
- The CharacterTextSplitter breaks the text into pieces on the specified separator (e.g., spaces, newlines, or custom characters).
- During this step, no consideration is given to token counts; the text is split purely on the separator.
- Once the character-based splitting is complete, the tiktoken tokenizer counts the number of tokens in each piece.
- Adjacent pieces that are smaller than chunk_size are merged into a larger chunk, as long as the merged chunk stays within chunk_size tokens (chunk_overlap controls how much of the previous chunk is repeated).
- A single piece that exceeds chunk_size is left as is, because tiktoken only measures length here; it is not used to split the piece further.

However, this means the token limit is not strictly enforced by CharacterTextSplitter.from_tiktoken_encoder: a chunk can be larger than chunk_size whenever the separator does not occur often enough. Splitting oversized pieces further requires additional logic.
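The split-on-separator, measure-in-tokens flow can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of calling tiktoken, `split_then_merge` is a made-up helper name, and overlap handling is left out.

```python
def count_tokens(text: str) -> int:
    """Toy stand-in for a real tokenizer such as tiktoken: one token per word."""
    return len(text.split())


def split_then_merge(text, separator, chunk_size, length_fn=count_tokens):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` tokens."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_fn(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "one two\n\nthree four\n\nfive six seven eight nine ten\n\neleven"
chunks = split_then_merge(text, separator="\n\n", chunk_size=4)
token_counts = [count_tokens(c) for c in chunks]
print(token_counts)
```

The first two paragraphs are merged because together they fit in four tokens, while the six-token paragraph stays oversized because it contains no separator to split on.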

Suppose You Have This Text:
```text
"This is a long text document. It needs to be split into smaller chunks for processing. Each chunk should ideally have no more than 50 tokens, but the splitting is based on characters, not tokens."
```

Using CharacterTextSplitter, the text might be split into two chunks based on character separators:
- Chunk 1: "This is a long text document. It needs to be split"
- Chunk 2: "into smaller chunks for processing. Each chunk should ideally have no more than 50 tokens, but the splitting is based on characters, not tokens."

Note: These chunks are based solely on characters and separators, without considering token counts.

The tiktoken tokenizer is then used to count tokens in each chunk (the counts below are illustrative, not actual tiktoken output):
- Chunk 1: 12 tokens
- Chunk 2: 65 tokens

If strict adherence to the chunk_size token limit (e.g., 50 tokens) is required, you might combine smaller chunks or reprocess larger chunks. For example:
- Chunk 1 is fine (12 tokens).
- Chunk 2 exceeds the token limit (65 tokens). This could either be split further using token-based logic or left unchanged.

The CharacterTextSplitter was originally designed to split text based on characters, not tokens. Adding tiktoken through from_tiktoken_encoder does not change the initial character-based splitting; it only changes how chunk length is measured when the pieces are merged. **If you need every chunk to stay within the token limit, use RecursiveCharacterTextSplitter.from_tiktoken_encoder, which recursively re-splits any piece that is still too large.**

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
```

We can also use TokenTextSplitter, which works with tiktoken directly and ensures each split is no larger than the chunk size.

```python
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```
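Splitting directly on tokens amounts to sliding a fixed-size window over the token sequence. The sketch below illustrates that idea under simplifying assumptions: whitespace-separated words stand in for real token IDs, and `split_on_tokens` is a hypothetical helper, not the library function.

```python
def split_on_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the sequence
        start += step
    return chunks


tokens = "a b c d e f g h i j".split()  # toy "tokens"
token_chunks = split_on_tokens(tokens, chunk_size=4, chunk_overlap=1)
print(token_chunks)
```

Because the window is cut on token boundaries, no chunk can exceed chunk_size, which is the guarantee the character-based splitters cannot give.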

Some written languages (e.g. Chinese and Japanese) have characters which encode to 2 or more tokens. Using the TokenTextSplitter directly can split the tokens for a character between two chunks causing malformed Unicode characters. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure chunks contain valid Unicode strings.
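The malformed-character problem can be reproduced with the standard library alone. In this sketch, raw UTF-8 bytes stand in for byte-level tokens (an assumption for illustration; real tokenizers use larger units), and the chunk boundary is deliberately placed in the middle of a character.

```python
text = "你好"
token_bytes = text.encode("utf-8")  # each character encodes to 3 bytes

chunk_size = 4  # boundary falls inside the second character
raw_chunks = [
    token_bytes[i:i + chunk_size] for i in range(0, len(token_bytes), chunk_size)
]

# Decoding each chunk separately yields replacement characters (U+FFFD).
decoded = [chunk.decode("utf-8", errors="replace") for chunk in raw_chunks]
print(decoded)
```

Joining the raw chunks before decoding restores the original string, which shows the damage comes only from where the boundary falls.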


Note that .from_tiktoken_encoder() takes either encoding_name (e.g. cl100k_base) or model_name (e.g. gpt-4) as an argument. All additional arguments like chunk_size, chunk_overlap, and separators are used to instantiate CharacterTextSplitter:

```python
# %pip install --upgrade --quiet langchain-text-splitters tiktoken

from langchain_text_splitters import CharacterTextSplitter

# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

# The keyword is `encoding_name`; alternatively pass `model_name`.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
```

2) spaCy

You can use the spaCy tokenizer:
- How the text is split: by spaCy tokenizer.
- How the chunk size is measured: by number of characters.

```python
# %pip install --upgrade --quiet  spacy

# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```

3) SentenceTransformers

The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use.

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
# 2

token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
# tokens in text to split: 514

text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])
# lorem
```

4) NLTK

Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by NLTK tokenizer.
- How the chunk size is measured: by number of characters.

```python
# pip install nltk

# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```

5) Hugging Face tokenizer

We use a Hugging Face tokenizer, GPT2TokenizerFast, to count the text length in tokens.
- How the text is split: by character passed in.
- How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.

```python
from langchain_text_splitters import CharacterTextSplitter
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])
# Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

# Last year COVID-19 kept us apart. This year we are finally together again. 

# Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

# With a duty to one another to the American people to the Constitution.
```

### Retrieval: Embedding models

The Embeddings class is a class designed for interfacing with text embedding models. There are lots of embedding model providers (OpenAI, Cohere, Hugging Face, etc) - this class is designed to provide a standard interface for all of them.

**Embeddings create a vector representation of a piece of text.** This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.

The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).

Embedding Documents (embed_documents):
- Purpose: This method is used to create embeddings for a collection of texts (e.g., documents or chunks of text). These embeddings are typically stored in a vector database and are later used to find relevant content during a search or retrieval process.
- Input: A list of texts (multiple strings) that represent the documents.
- Reason: You want to process multiple texts in one go because they represent your searchable knowledge base or content repository.

```python
# Assumes `embeddings` is an instance of a LangChain Embeddings class.
documents = ["Text 1", "Text 2", "Text 3"]
document_embeddings = embeddings.embed_documents(documents)
```

Embedding Queries (embed_query):
- Purpose: This method is used to embed a single query. The query embedding is compared to the pre-stored document embeddings to retrieve the most relevant documents.
- Input: A single string (representing the query text).
- Reason: The query might require a different embedding process compared to documents, depending on the provider. For example, some providers optimize query embeddings for retrieval tasks differently than document embeddings.

```python
# Assumes `embeddings` is an instance of a LangChain Embeddings class.
query = "What are the benefits of solar energy?"
query_embedding = embeddings.embed_query(query)
```

Why Are These Separate Methods?

Some embedding providers (e.g., Cohere) treat queries and documents differently because the two are optimized for different purposes:
- Document embeddings are designed to capture the content and meaning of a piece of text to enable similarity searches.
- Query embeddings are designed to represent user search intent, optimized for matching against the stored document embeddings.
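The two-method interface and its use in semantic search can be illustrated with a toy class. This is a sketch under loud assumptions: `ToyEmbeddings` is hypothetical and only mimics the method names embed_documents and embed_query, the vocabulary is made up, and word counts stand in for a real embedding model.

```python
import math

VOCAB = ["solar", "energy", "wind", "cat"]  # hypothetical toy vocabulary


class ToyEmbeddings:
    """Toy stand-in exposing the same two method names as an Embeddings class."""

    def _embed(self, text):
        words = text.lower().split()
        return [float(words.count(term)) for term in VOCAB]

    def embed_documents(self, texts):
        # Takes a list of texts, returns one vector per text.
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        # Takes a single text, returns a single vector.
        return self._embed(text)


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


embeddings = ToyEmbeddings()
documents = ["solar energy is clean", "the cat sat on the mat"]
doc_vectors = embeddings.embed_documents(documents)
query_vector = embeddings.embed_query("benefits of solar energy")

scores = [cosine(query_vector, v) for v in doc_vectors]
best = documents[scores.index(max(scores))]
print(best)
```

Here both methods share one code path; a provider that distinguishes queries from documents would differ only inside embed_query.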

### Retrieval: Embedding models: HuggingFace

There is nothing particularly remarkable about the HuggingFace Embedding classes, as the bulk of the work is done in the InferenceClient/AsyncInferenceClient classes, which we discussed in the LLM section of this documentation.

If you are running a local instance of Hugging Face Text Embeddings Inference (TEI), then the only class you can use in langchain_huggingface for embeddings is HuggingFaceEndpointEmbeddings. The HuggingFaceEndpointEmbeddings class works with two different configurations: Hugging Face’s Hosted Inference API and self-hosted Text Embeddings Inference (TEI).

Hugging Face’s Hosted Inference API
- Available via Hugging Face Hub (api-inference.huggingface.co)
- Requires Hugging Face authentication (HUGGINGFACEHUB_API_TOKEN).
- It is a paid service beyond certain free-tier limits.

Example of calling the hosted API:
```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="BAAI/bge-large-en-v1.5", token="your-api-key")

# This calls the hosted inference endpoint at:
# https://api-inference.huggingface.co/models/BAAI/bge-large-en-v1.5
embeddings = client.feature_extraction("What is the capital of France?")
print(embeddings)  # NumPy array of embeddings
```

Self-Hosted Text Embeddings Inference (TEI)
- TEI (text-embeddings-inference) is a Dockerized local inference server.
- It does not require a Hugging Face API token.
- It can be faster and more cost-effective for large-scale embedding generation, because requests stay on your own hardware.
- HuggingFaceEndpointEmbeddings can be configured to point to a TEI server instead of Hugging Face Hub.

Example of calling a self-hosted TEI API:
```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8070/")  # Local TEI server

# This calls a locally hosted TEI instance instead of the Hugging Face API.
# http://localhost:8070/embed
embeddings = client.feature_extraction("What is the capital of France?")
print(embeddings)  # NumPy array of embeddings
```

Which One Does HuggingFaceEndpointEmbeddings Use?

The HuggingFaceEndpointEmbeddings class can use either:
- Hugging Face’s hosted API (api-inference.huggingface.co).
- A self-hosted TEI server (http://your-server-ip:8070/).

How does it decide?
- If model is a Hugging Face model ID ("BAAI/bge-large-en-v1.5"), it uses Hugging Face’s Hosted API.
- If model is a URL ("http://your-server-ip:8070/"), it uses a TEI self-hosted instance.
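The decision rule above boils down to checking whether the model string looks like a URL. The helper below is a hypothetical illustration of that rule, not code from huggingface_hub or langchain_huggingface; the function name and return labels are made up.

```python
def resolve_backend(model: str) -> str:
    """Hypothetical helper: classify `model` as a server URL or a Hub model ID."""
    if model.startswith(("http://", "https://")):
        return "self-hosted endpoint"
    return "hosted inference API"


local = resolve_backend("http://localhost:8070/")
hub = resolve_backend("BAAI/bge-large-en-v1.5")
print(local, "|", hub)
```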

Example: Configuring HuggingFaceEndpointEmbeddings
```python
from langchain_huggingface import HuggingFaceEndpointEmbeddings

# local
embeddings = HuggingFaceEndpointEmbeddings(
    model="http://localhost:8070/",  # Use a locally running TEI server
)

text = "This is a test document."

query_result = embeddings.embed_query(text)

query_result[:3]

# the hub
embeddings = HuggingFaceEndpointEmbeddings(
    model="BAAI/bge-large-en-v1.5",  # Use Hugging Face’s Hosted API
    huggingfacehub_api_token="your-api-key"
)
```

Note that HuggingFaceEndpointEmbeddings requires the huggingface_hub package to be installed.

The HuggingFaceEmbeddings class runs the embedding model locally using the sentence-transformers library, which builds on Hugging Face transformers. Use Case: Ideal when you want to run models on your local machine.

Requirements:
- The sentence-transformers library installed (it pulls in Hugging Face transformers).
- The embedding model is downloaded and run locally.

Advantages:
- Full control over the model and its parameters.
- No reliance on external services or APIs.
- No API costs; you only incur local hardware costs.

Disadvantages:
- Requires a powerful local machine for large models.
- Must handle updates and dependencies yourself.

```python
# %pip install --upgrade --quiet  langchain langchain-huggingface sentence_transformers

from langchain_huggingface.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

text = "This is a test document."

query_result = embeddings.embed_query(text)

query_result[:3]
# [-0.04895168915390968, -0.03986193612217903, -0.021562768146395683]

doc_result = embeddings.embed_documents([text])
```

Note that having to download the entire model to the local machine can be a significant disadvantage, and therefore HuggingFaceEmbeddings is not ideal for production workloads.

The HuggingFaceInferenceAPIEmbeddings class uses the Hugging Face Inference API to generate embeddings. The model runs on Hugging Face's infrastructure, accessed via an API. Use Case: Suitable for users who prefer a fully managed solution and don't want to handle local infrastructure.

Requirements:
- A Hugging Face account with API access and an API key.
- Internet connectivity.

Advantages:
- No local setup or hardware required.
- Hugging Face manages model updates, scaling, and infrastructure.

Disadvantages:
- Incurs API usage costs based on Hugging Face's pricing.
- Latency depends on network and API response time.
- Limited to the models available on the Hugging Face Inference API.

```python
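from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

# Assumes `inference_api_key` holds your Hugging Face API token
# and `text` is the string defined in the previous example.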

embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=inference_api_key, model_name="sentence-transformers/all-MiniLM-l6-v2"
)

query_result = embeddings.embed_query(text)
query_result[:3]
# [-0.038338541984558105, 0.1234646737575531, -0.028642963618040085]
```

### Retrieval: Embedding models: HuggingFace Endpoint Embeddings

Detailed Explanation of HuggingFaceEndpointEmbeddings Class



### Retrieval: Embedding models: Caching

The LangChain Caching mechanism for embeddings focuses on reducing computational overhead by storing previously computed embeddings in a key-value store. This is particularly useful when dealing with repetitive tasks, such as embedding the same document multiple times or when working with large datasets that require incremental updates.

Caching with CacheBackedEmbeddings:
- CacheBackedEmbeddings is a wrapper around an embedding model (called the "underlying embedder").
- It intercepts embedding requests, checks if the embedding for a given input (text/document) already exists in the cache, and retrieves it if available.
- If the embedding is not in the cache, the embedder computes it, stores it in the cache, and then returns it.

How It Works:
- The input text is hashed (using a deterministic algorithm).
- The resulting hash, prefixed with the namespace, serves as the key to store/retrieve the embedding in/from the cache.
- The cache can be any supported ByteStore (e.g., an in-memory store, the local file system, or Redis).
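The hash-then-look-up flow can be sketched with a plain dictionary as the key-value store. This is a simplified illustration, not the CacheBackedEmbeddings implementation: `ToyCachedEmbedder` and `fake_embed` are hypothetical names, SHA-256 and JSON serialization are assumptions chosen for the example, and the `misses` counter exists only to show when the underlying embedder is called.

```python
import hashlib
import json


class ToyCachedEmbedder:
    """Wraps an embedding function and caches results by hashed text."""

    def __init__(self, embed_fn, store, namespace=""):
        self.embed_fn = embed_fn
        self.store = store          # any dict-like key-value store
        self.namespace = namespace  # keeps keys from different models apart
        self.misses = 0

    def _key(self, text):
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts):
        results = []
        for text in texts:
            key = self._key(text)
            if key not in self.store:
                # Cache miss: compute the embedding and store it as bytes.
                self.misses += 1
                self.store[key] = json.dumps(self.embed_fn(text)).encode("utf-8")
            results.append(json.loads(self.store[key]))
        return results


def fake_embed(text):
    """Hypothetical stand-in for a real (slow, paid) embedding model."""
    return [float(len(text)), float(text.count(" "))]


store = {}
embedder = ToyCachedEmbedder(fake_embed, store, namespace="toy-model:")
first = embedder.embed_documents(["hello world", "hi"])
second = embedder.embed_documents(["hello world", "hi", "new text"])
print(embedder.misses)  # only the unseen text triggers a new computation
```

The second call recomputes only the unseen text, which is the saving caching is meant to deliver on repeated or incremental runs.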

Parameters:
- underlying_embeddings:
    - The base embedding model to use if an embedding isn't found in the cache.
    - Example: OpenAI's text-embedding-ada-002.
- document_embedding_cache:
    - The key-value store for caching embeddings.
    - Examples: In-memory stores, the local file system, Redis, etc.
    - LangChain provides the ByteStore interface for this purpose.
- batch_size (optional):
    - Controls how many documents to embed before updating the cache in bulk.
    - Useful for reducing the frequency of writes to the store.
- namespace (optional):
    - A string used to isolate caches for different contexts (e.g., "model-A").
    - Prevents key collisions when the same text is embedded with different models or workflows.

Here’s an example of how to set up and use CacheBackedEmbeddings:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings

# Step 1: Create an underlying embedder (e.g., OpenAI embeddings)
embedder = OpenAIEmbeddings(model="text-embedding-ada-002")

# Step 2: Create a ByteStore for caching
cache_store = InMemoryByteStore()  # Replace with a persistent store (e.g., LocalFileStore or RedisStore) for persistence

# Step 3: Initialize the CacheBackedEmbeddings
cache_backed_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=embedder,
    document_embedding_cache=cache_store,
    batch_size=10,  # Optional: batch size for caching updates
    namespace="my_model_cache",  # Optional: namespace for isolation
)

# Step 4: Embed a document (cached automatically).
# Note: embed_query results are not cached unless query_embedding_cache is set.
document = "The quick brown fox jumps over the lazy dog."
embedding = cache_backed_embedder.embed_documents([document])[0]

print("Embedding:", embedding)
```

Example of switching to Redis:

```python
from langchain_community.storage import RedisStore

# Use Redis for caching
cache_store = RedisStore(redis_url="redis://localhost:6379")
cache_backed_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=embedder,
    document_embedding_cache=cache_store,
    namespace="my_model_cache",
)
```

### Retrieval: Vector stores

**One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.**

A key part of working with vector stores is creating the vectors to put in them, which is usually done via embeddings. Therefore, it is recommended that you familiarize yourself with the text embedding model interfaces before diving into this.

### Retrieval: Vector stores: FAISS

The FAISS vector database makes use of the Facebook AI Similarity Search (FAISS) library. To leverage a Vector Store, we need to embed something. Below we use OpenAIEmbeddings:

```python
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = FAISS.from_documents(documents, OpenAIEmbeddings())
```

What from_documents Does:
- It accepts a list of Document objects, where each Document has page_content (the main text) and optionally metadata.
- These documents may already be split into chunks (e.g., using a text splitter like CharacterTextSplitter).
- It uses the provided embedding model (OpenAIEmbeddings in this case) to convert each chunk of text (page_content of each Document) into a vector (numerical representation).
- These vectors are high-dimensional representations of the text, designed to capture its semantic meaning.
- Each vector is stored in the FAISS vector database (a vector store) along with the associated metadata.
- Metadata can include information such as the source of the document or the chunk index.
- The FAISS vector store is initialized and populated with the embeddings and metadata.

1) Similarity Search

Given a plain text query: "What did the president say about John McCain?":
- The input query is passed to the underlying embedding model (e.g., OpenAIEmbeddings) to generate a vector representation.
- The generated vector is then compared with the stored vectors in the vector database (e.g., FAISS) to find the most similar documents.
- Best suited for end-to-end workflows where the query is in plain text and you want the embedding generation step handled automatically.

```python
docs = db.similarity_search("What did the president say about John McCain?")
```

Algorithms for Similarity Search

FAISS uses Nearest Neighbor (NN) Search. Nearest Neighbor Search is the process of finding the closest data points (e.g., vectors) in a high-dimensional space based on a similarity or distance metric (e.g., cosine similarity, Euclidean distance).
- Exact Nearest Neighbor (NN): Searches the entire dataset exhaustively to find the exact closest matches.
- Approximate Nearest Neighbor (ANN): Introduces optimizations or approximations to reduce the computational complexity while still finding close (but not always exact) matches.

There are three common techniques to perform Nearest Neighbor Searches (both NN and ANN): Flat Index, Hierarchical Navigable Small World Graphs (HNSW), and Product Quantization (PQ).

How They Fit into the ANN Context
- Flat Index:
    - Exhaustive search across all vectors for exact similarity. This is a brute-force, exhaustive search method.
    - Computes similarity (e.g., cosine similarity or L2 distance) between the query vector and all stored vectors.
    - Flat Index performs exact nearest neighbor (NN) search, not ANN.
    - Use Case: Smaller datasets where accuracy is critical, and computational cost is acceptable.
- Hierarchical Navigable Small World Graphs (HNSW):
    - HNSW is a graph-based algorithm for Approximate Nearest Neighbor (ANN) search. It constructs a graph where vectors (data points) are connected based on their similarity, enabling efficient traversal to find the nearest neighbors of a query vector.
        - Vectors are organized into a multi-layer graph.
        - Each node (representing a vector) is connected to a small number of other nodes (its "neighbors") that are most similar to it.
        - Layers represent different levels of granularity:
            - Top Layer: Contains fewer nodes and provides a coarse-grained view of the graph.
            - Lower Layers: Contain progressively more nodes and connections, increasing the graph's resolution; the bottom layer contains every vector.
        - A search starts from the top layer at a fixed entry-point node.
        - The search moves to nodes that are progressively closer to the query vector, using a greedy algorithm.
        - Once the search reaches a local minimum in the current layer, it descends to the next layer and continues.
        - The search terminates in the bottom-most layer, where the most similar vectors are found.
    - In essence, it constructs a graph where similar vectors are connected, enabling fast traversal and retrieval.
    - Used for large datasets where exact search is computationally expensive.
    - Instead of comparing the query vector to all stored vectors, the graph structure narrows the search space. Supports millions of vectors with relatively low latency.
    - New nodes can be added incrementally without rebuilding the graph.
    - Use Case: Large-scale datasets where speed and scalability are critical.
- Product Quantization (PQ):
    - Product Quantization is a technique used to compress vectors for efficient storage and search. Instead of storing the full high-dimensional vectors, PQ approximates them with compact representations while preserving their similarity relationships.
    - Partition the Vector Space:
        - A vector is divided into subspaces of smaller dimensions.
            - For example, a 128-dimensional vector can be split into 4 subspaces of 32 dimensions each.
        - Each subspace is independently quantized:
            - Quantization replaces the real-valued vector in each subspace with the closest representative vector from a codebook (a precomputed set of centroids).
            - The codebook is generated using clustering techniques like k-means.
        - Each subspace vector is represented by the index of its nearest centroid in the codebook.
    - The vector is now represented as a sequence of indices, significantly reducing its size.
        - Example: Instead of storing a 128-dimensional vector, you store 4 integers (one for each subspace).
    - When comparing a query vector to stored vectors, distances are approximated using the codebooks. This avoids reconstructing the full vector, making computations faster.
    - Use Case: Scenarios with memory constraints and large-scale datasets.
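The quantization step of Product Quantization can be shown on a tiny example. This is a minimal sketch under stated assumptions: the codebooks are hand-written rather than learned with k-means, the vector has 4 dimensions split into 2 subspaces, and `pq_encode` / `pq_decode` are hypothetical helper names.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# One codebook (list of centroids) per subspace; values are made up.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],  # centroids for dimensions 2-3
]


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for s, book in enumerate(codebooks):
        sub = vector[s * sub_dim:(s + 1) * sub_dim]
        codes.append(min(range(len(book)), key=lambda c: sq_dist(sub, book[c])))
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx = []
    for code, book in zip(codes, codebooks):
        approx.extend(book[code])
    return approx


vector = [0.9, 1.1, 0.2, 0.8]
codes = pq_encode(vector, codebooks)    # compact representation: one int per subspace
approx = pq_decode(codes, codebooks)    # lossy reconstruction
print(codes, approx)
```

The four floats are stored as two small integers, and the reconstruction is close to, but not identical to, the original vector; that gap is the accuracy cost of the compression.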

FAISS supports all three methods — Flat Index, Hierarchical Navigable Small World Graphs (HNSW), and Product Quantization (PQ) — but it doesn't use them all simultaneously by default. Instead, FAISS provides a configurable framework, allowing you to choose the method (or combination of methods) that best suits your use case.

Flat Index example:

```python
# FAISS Flat Index Example
import faiss

index = faiss.IndexFlatL2(d)  # d = vector dimensionality
index.add(vectors)  # vectors: float32 NumPy array of shape (n, d)
```

HNSW is slightly less accurate than exact methods like Flat Index. HNSW example:

```python
# FAISS HNSW Index Example
import faiss

# The FAISS constructors take positional arguments: d = dimensionality, 32 = M (max neighbors per node)
index = faiss.IndexHNSWFlat(d, 32)
index.add(vectors)
```

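PQ trades a small amount of accuracy for compression. The example below uses IndexIVFPQ, which pairs PQ with an inverted file (IVF) coarse quantizer:

```python
# FAISS IVF + PQ Index Example
import faiss

quantizer = faiss.IndexFlatL2(d)  # Coarse quantizer that assigns vectors to clusters
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
# nlist = number of coarse clusters, m = number of subspaces (d must be divisible by m),
# 8 = bits per sub-vector code (2^8 = 256 centroids per subspace)
index.train(vectors)  # Learns the coarse centroids and the PQ codebooks
index.add(vectors)
```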

FAISS allows combining these methods for speed, memory efficiency, and scalability. Here are some popular combinations:
- Flat + HNSW:
    - Description: HNSW graph for fast candidate selection, combined with exact distance computation using Flat Index.
    - Use Case: Faster search with no memory compression.
- IVF + PQ (IVFPQ):
    - Description: Combines coarse clustering using an Inverted File Index (IVF) with Product Quantization (PQ) for compression. A Flat index serves as the coarse quantizer that holds the cluster centroids.
        - IVF groups vectors into clusters (using centroids).
        - For a query, only the nearest clusters are searched.
        - Within the selected clusters, PQ compresses vectors to save memory.
    - Use Case: Large datasets where both speed and memory efficiency are needed.
- PQ + HNSW:
    - Description: Combines the graph traversal of HNSW with compressed vector representations using PQ.
    - Use Case: Large-scale, high-speed ANN search with limited memory.

| **Method**              | **Pros**                           | **Cons**                      | **Use Case**                                     |
|--------------------------|-------------------------------------|-------------------------------|-------------------------------------------------|
| **Flat Index**           | Perfect accuracy                  | Slow, memory-intensive        | Small datasets, high accuracy                  |
| **HNSW**                 | Fast, supports dynamic updates    | Slightly less accurate        | Large datasets, real-time queries              |
| **Product Quantization (PQ)** | Memory-efficient, fast            | Compression reduces accuracy  | Very large datasets, constrained memory        |
| **HNSW + PQ**            | Fast, scalable, memory-efficient  | Slight loss of accuracy       | Huge datasets requiring both speed and efficiency | 

But how is it possible to combine Flat + HNSW when one is exact and the other is approximate? The two serve different purposes and can complement each other in a hybrid approach that balances speed, accuracy, and scalability. (In FAISS, the "Flat" in IndexHNSWFlat means that the graph stores full, uncompressed vectors.) Here's how it works:
- HNSW can be used as an initial filtering or candidate generation step to narrow down the set of potential matches.
- Once HNSW provides a smaller set of candidates (approximate neighbors), the Flat Index can be used to perform exact nearest neighbor search within this reduced set.
- HNSW is very fast at finding approximate nearest neighbors, but it might not always return the exact closest matches.
- Flat Index is slower but provides exact results. Applying it to a reduced set (from HNSW) makes it computationally feasible.

```python
import faiss
import numpy as np

# Create HNSW index for candidate generation
hnsw_index = faiss.IndexHNSWFlat(d, 32)  # d = dimensionality, 32 = M (max neighbors per node)
hnsw_index.add(vectors)  # vectors: float32 NumPy array of shape (n, d)

# Hybrid search: HNSW for filtering, Flat for exact search on candidates
query = np.array([...], dtype='float32').reshape(1, -1)  # Query vector, shape (1, d)
k_candidates = 100  # Number of candidates from HNSW
k_final = 10         # Final nearest neighbors

# Search with HNSW; FAISS pads missing results with -1, so drop those
_, candidate_ids = hnsw_index.search(query, k_candidates)
candidate_ids = candidate_ids[0][candidate_ids[0] >= 0]

# Build a Flat Index over the candidate vectors only and search it exactly
candidates = vectors[candidate_ids]
flat_index = faiss.IndexFlatL2(d)
flat_index.add(candidates)
_, local_ids = flat_index.search(query, k_final)

# Flat results are positions within `candidates`; map them back to ids in `vectors`
final_neighbors = candidate_ids[local_ids[0]]
```

2) Similarity search by vector

It is also possible to do a search for documents similar to a given embedding vector using similarity_search_by_vector which accepts an embedding vector as a parameter instead of a string.

```python
embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)
    # Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

    # Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

    # One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

    # And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
```

Asynchronous operations

FAISS does not have built-in asynchronous methods. It is designed for efficient in-memory operations and lacks native support for async/await. However, you can run FAISS calls through asyncio's run_in_executor so that they execute in a worker thread and do not block the event loop.
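
A minimal sketch of the run_in_executor pattern follows. To keep it runnable without FAISS installed, `blocking_search` is a hypothetical pure-Python stand-in for a synchronous call such as `index.search(query, k)`; the function names and toy vectors are assumptions made for this example.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_search(vectors, query, k):
    """Stand-in for a synchronous call such as index.search(query, k)."""
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(v, query)), i)
        for i, v in enumerate(vectors)
    )
    return [i for _, i in scored[:k]]

async def search_async(executor, vectors, query, k):
    loop = asyncio.get_running_loop()
    # run_in_executor hands the blocking call to a worker thread and returns an awaitable
    return await loop.run_in_executor(executor, blocking_search, vectors, query, k)

async def main():
    vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    queries = [[0.1, 0.0], [4.0, 4.5]]
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Both searches are awaited together; the event loop stays free meanwhile
        return await asyncio.gather(
            *(search_async(executor, vectors, q, 2) for q in queries)
        )

results = asyncio.run(main())
print(results)  # [[0, 1], [2, 1]]
```

With a real index, the same wrapper would pass `index.search` and its arguments to run_in_executor instead of the stand-in function.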

### Retrieval: Vector Stores: Redis

At its core, Redis is a NoSQL database in the key-value family that can be used as a cache, message broker, stream processing engine, and primary database. On top of these traditional use cases, Redis provides additional capabilities like the Search and Query capability, which allows users to create secondary index structures within Redis. This allows Redis to be a vector database, at the speed of a cache.

Redis uses compressed, inverted indexes for fast indexing with a low memory footprint. It also supports a number of advanced features such as:
- Indexing of multiple fields in Redis hashes and JSON
- Vector similarity search (with HNSW (ANN) or FLAT (KNN))
- Vector Range Search (e.g. find all vectors within a radius of a query vector)
- Document ranking (using tf-idf, with optional user-provided weights)

RediSearch is a powerful full-text search and secondary indexing engine for Redis, the in-memory data structure store. It extends Redis by enabling advanced search, filtering, and ranking capabilities, making it ideal for applications requiring fast and efficient querying of structured, semi-structured, and unstructured data.

Core Features of RediSearch
- Full-Text Search:
    - Allows querying text fields with advanced features like stemming, prefix matching, and fuzzy matching.
    - Provides support for Boolean queries (AND, OR, NOT), phrase queries, and wildcard searches.
- Secondary Indexing:
    - Supports indexing non-text fields, such as numbers, dates, and geospatial data.
    - Enables querying and filtering across multiple fields efficiently.
- Faceted Search:
    - Provides aggregations, group-by, and sorting capabilities to implement faceted search, commonly used in e-commerce and analytics applications.
- Ranking and Scoring:
    - Includes customizable ranking functions to order search results by relevance.
    - Allows fine-tuning of scoring based on field weights, term frequency, and more.
- Integration with Vector Search:
    - Offers vector similarity search capabilities, making it suitable for embedding-based searches in machine learning or AI applications.
    - Enables use cases like image retrieval, document search, and recommendation systems.
- Schema Definition:
    - Requires defining a schema for the indexed data, specifying fields (e.g., text, numeric, tags) and their attributes.
    - Ensures efficient storage and retrieval by tailoring the index to specific use cases.

How RediSearch Works
- Index Creation:
    - You define an index schema for your data, specifying which fields to index and their types (text, numeric, etc.).
    - Example: FT.CREATE idx:books ON HASH PREFIX 1 "book:" SCHEMA title TEXT author TEXT price NUMERIC
        - FT.CREATE: This is the command used to create a new RediSearch index.
        - idx:books: The name of the index being created. This identifier is how you reference the index in future commands (e.g., searching or modifying).
        - ON HASH: Specifies that the index operates on Redis hashes. Hashes are a data structure in Redis used to store key-value pairs, and RediSearch can index their fields.
        - PREFIX 1 "book:":
            - A prefix filter in the context of RediSearch is a mechanism used to limit the data indexed by specifying a common prefix for the keys that should be included in the index.
            - **When you define an index in RediSearch with a prefix filter, only the keys in Redis that start with the specified prefix will be indexed by the search engine.** This helps improve performance and efficiency by ensuring that only relevant data is included in the index.
            - PREFIX: Defines a prefix filter for the keys to be indexed.
            - 1: The number of prefixes being specified (in this case, only one prefix).
            - "book:": The prefix for Redis keys that should be indexed. Only keys that start with book: will be included in this index.
            - Example: Keys like book:1 or book:123 will be indexed.
            - Example: Keys like user:1 or author:123 will not be indexed.
            - Fields like title, author, and price are the fields of the Redis hash that hold the data to be indexed.
                - Example Redis hash stored at book:1: HSET book:1 title "Redis in Action" author "Josiah Carlson" price 39.99
        - SCHEMA: 
            - Defines the structure of the data in the index.
            - This specifies which fields to index, their types, and any additional attributes (e.g., weights, sorting, etc.).
        - title TEXT: 
            - Adds a field named title to the schema.
            - TEXT: Specifies that this field contains text data, enabling full-text search capabilities such as stemming, prefix matching, and exact matching.
        - author TEXT:
            - Adds a field named author to the schema.
            - TEXT: Specifies that this field contains text data, similar to title.
        - price NUMERIC:
            - Adds a field named price to the schema.
            - NUMERIC: Specifies that this field contains numeric data, enabling range filtering and sorting.
- Data Insertion:
    - Insert or update documents in Redis hashes, which are automatically indexed.
    - Example: HSET book:1 title "Redis in Action" author "Josiah L. Carlson" price 39.99
- Querying:
    - Perform searches using the FT.SEARCH command with filters, sort options, and more.
    - Example: FT.SEARCH idx:books "@title:Redis" SORTBY price ASC

Here's how you can work with vectors in RediSearch using Redis CLI commands for index creation, data insertion, and querying.

- Index Creation:
    - Example: FT.CREATE book_index ON HASH PREFIX 1 "book:" SCHEMA title TEXT author TEXT vector VECTOR FLAT 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE
        - FT.CREATE book_index: Creates an index named book_index.
        - ON HASH: Specifies that the index will operate on Redis hashes.
        - PREFIX 1 "book:": Only keys with the prefix book: will be indexed.
        - SCHEMA: Defines the fields to be indexed:
            - title TEXT: Full-text searchable title field.
            - author TEXT: Full-text searchable author field.
            - vector VECTOR FLAT: A vector field for storing embeddings, indexed with the FLAT (brute-force) algorithm.
                - 6: The number of attribute arguments that follow (three name-value pairs: TYPE FLOAT32, DIM 1536, DISTANCE_METRIC COSINE).
                - TYPE FLOAT32: Specifies the vector's data type. Each component of the vector is a 32-bit floating-point number.
                - DIM 1536: Dimensionality of the vector (e.g., OpenAI's text-embedding-ada-002 model produces 1536-dimensional embeddings).
                - DISTANCE_METRIC COSINE: The similarity metric used for nearest neighbor search.
- Data Insertion:
    - Add Redis hashes with vector embeddings.
        - Example: HSET book:1 title "Redis in Action" author "Josiah Carlson" vector <binary_vector_data>
        - Example: HSET book:2 title "Learning Redis" author "Vinoo Das" vector <binary_vector_data>
        - Example: HSET book:3 title "Mastering Redis" author "Jeremy Nelson" vector <binary_vector_data>
    - HSET book:1: Creates a hash with the key book:1.
    - title: Stores the title of the book.
    - author: Stores the author of the book.
    - vector: Stores the vector embedding as binary data (e.g., a serialized embedding from an external model).
- Querying:
    - Perform a K-Nearest Neighbors (KNN) query to find books similar to an embedding.
    - Example: FT.SEARCH book_index "*=>[KNN 2 @vector $BLOB AS score]" PARAMS 2 BLOB <query_vector_data> SORTBY score ASC RETURN 2 title author DIALECT 2
        - FT.SEARCH book_index: Searches the book_index.
        - *: Matches all documents in the index.
        - [KNN 2 @vector $BLOB AS score]:
            - KNN 2: Finds the top 2 nearest neighbors.
            - @vector: Searches against the vector field.
            - $BLOB: Placeholder for the query vector, filled in by the BLOB parameter.
            - AS score: Names the field that holds each result's distance to the query vector.
        - PARAMS 2 BLOB <query_vector_data>:
            - 2: The number of arguments that follow (one name and one value).
            - Sets the parameter BLOB to <query_vector_data> (binary-encoded query vector). The name must match the $BLOB placeholder in the query.
        - SORTBY score ASC: Sorts results by distance in ascending order, so the closest matches come first.
        - RETURN 2 title author: Returns only the title and author fields for matching documents.
        - DIALECT 2: Selects query dialect 2, which the KNN syntax requires.
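
The `<binary_vector_data>` and `<query_vector_data>` placeholders stand for the raw bytes of the vector. One way to produce such bytes for a FLOAT32 field is sketched below with the standard library. The helper names are made up for this example, and little-endian byte order is an assumption that matches common x86 and ARM hosts.

```python
import struct

def to_float32_blob(vector):
    """Pack a list of floats as consecutive little-endian 32-bit floats."""
    return struct.pack(f"<{len(vector)}f", *vector)

def from_float32_blob(blob):
    """Inverse of to_float32_blob; useful for checking what was stored."""
    count = len(blob) // 4  # 4 bytes per FLOAT32 component
    return list(struct.unpack(f"<{count}f", blob))

embedding = [0.25, -0.5, 1.0, 0.0]  # toy 4-dimensional embedding
blob = to_float32_blob(embedding)
print(len(blob))                 # 16 bytes: 4 components x 4 bytes
print(from_float32_blob(blob))   # [0.25, -0.5, 1.0, 0.0]
```

If NumPy is available, `np.array(embedding, dtype=np.float32).tobytes()` yields the same kind of byte string. The result is what a client would send as the value of the vector field in HSET, or as the value of the query parameter after PARAMS.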

The VECTOR field type enables fast similarity search by implementing Nearest Neighbor (NN) techniques. The algorithm argument used in the FT.CREATE examples above selects between:
- Flat (Brute Force) Search: Exact search, comparing all vectors.
- Hierarchical Navigable Small World (HNSW): A graph-based approach for fast and scalable approximate search.

Product Quantization (PQ), the memory-efficient approach described in the FAISS section, is not one of these two options.

Configurable Similarity Metrics:
- COSINE: Measures the cosine similarity between vectors.
- IP (Inner Product): Measures the dot product of vectors.
- L2 (Euclidean): Measures the Euclidean distance (L2 norm). The keyword passed to DISTANCE_METRIC is L2.
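
The three metrics can be written out in a few lines of plain Python. This is only a sketch of the underlying math with made-up helper names, not the code Redis runs. Note that for COSINE, vector search engines commonly report a distance (1 minus the cosine similarity), which is why sorting in ascending order puts the best matches first.

```python
import math

def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a = [1.0, 0.0]
b = [0.0, 2.0]
c = [3.0, 0.0]
print(cosine_similarity(a, c))  # 1.0: same direction, different length
print(cosine_similarity(a, b))  # 0.0: orthogonal vectors
print(l2_distance(a, c))        # 2.0: length matters for L2
print(inner_product(a, c))      # 3.0: grows with vector length
```

The example shows why the choice matters: `a` and `c` are identical under cosine similarity but two units apart under L2.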

| **Aspect**         | **Flat, HNSW, PQ**                                        | **Cosine Similarity, Euclidean Distance**                 |
|---------------------|----------------------------------------------------------|----------------------------------------------------------|
| **Purpose**         | Techniques to index and search large vector datasets efficiently. | Methods to measure the similarity or distance between two vectors. |
| **What They Do**    | Optimize how vectors are stored, retrieved, and searched in a database. | Define the mathematical relationship between vectors.    |
| **Accuracy**        | Affects retrieval speed and trade-off between exact and approximate results. | Purely mathematical and exact measures of similarity/distance. |
| **Output**          | A ranked list of vectors (e.g., top-K nearest neighbors). | A scalar value representing similarity/distance between two vectors. |
| **Examples**        | Flat Index, HNSW, Product Quantization (PQ).              | Cosine similarity, Euclidean distance, Inner product.     |

While HNSW (or Flat or PQ) and cosine similarity (or Euclidean distance) both relate to finding similar vectors, they serve different roles in the overall process. HNSW (or Flat or PQ) is an indexing strategy: it does not define what "similar" means, it determines which stored vectors get compared against the query. Cosine similarity and Euclidean distance are metrics that define the mathematical relationship between vectors. They compute how "similar" or "close" two vectors are, based on their geometry in the vector space. The index relies on the chosen metric both while narrowing down the search space and when ranking the final results.

FLOAT32 indicates that each element in the vector is a 32-bit floating-point number. This is the numerical precision used to store the vector components. RediSearch needs to know the data type to correctly store, retrieve, and perform computations (e.g., similarity calculations) on the vectors. Floating-point precision impacts memory usage and computation speed:
- FLOAT32 is more precise but consumes more memory than FLOAT16.
- FLOAT16 uses less memory but is less accurate, which might affect results.

Many popular embedding models (e.g., OpenAI, Hugging Face) output embeddings in FLOAT32 format, so specifying this ensures compatibility.

DIM 1536 specifies that each vector has 1,536 elements or dimensions. For example, if you use OpenAI's text-embedding-ada-002 model, the output is a 1,536-dimensional vector. RediSearch needs to allocate memory and structure the data for each vector based on its size. All vectors in the same field must have the same dimensionality. If you tried to insert a vector of a different size, it would cause an error. Similarity calculations (e.g., cosine similarity, Euclidean distance) require vectors to have the same dimensions. Without this information, such operations cannot be performed.

FLOAT32 and DIM 1536 Together Define the Vector Field:
- FLOAT32: Defines the type of data stored in each element of the vector.
- DIM 1536: Defines the number of elements in each vector.

By specifying these parameters explicitly, you ensure that the vector field matches the format of the embeddings generated by your model.
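
The effect of the two parameters on storage can be worked out with simple arithmetic. The sketch below uses hypothetical helper names and counts raw vector data only, ignoring any index overhead.

```python
# Bytes needed per vector component for each data type
BYTES_PER_COMPONENT = {"FLOAT16": 2, "FLOAT32": 4, "FLOAT64": 8}

def vector_bytes(dim, dtype="FLOAT32"):
    """Raw size of one vector in bytes."""
    return dim * BYTES_PER_COMPONENT[dtype]

def check_dim(vector, expected_dim):
    """Reject vectors whose length does not match the declared DIM."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return vector

per_vector = vector_bytes(1536, "FLOAT32")
print(per_vector)                    # 6144 bytes per vector
print(per_vector * 1_000_000 / 1e9)  # 6.144 GB of raw data for one million vectors
print(vector_bytes(1536, "FLOAT16")) # 3072 bytes: half the memory at lower precision
```

Checking the length on the client side, as `check_dim` does, catches a mismatched embedding model before the data is written.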

To use the RedisVectorStore, you'll need to install the langchain-redis partner package. To use Redis with LangChain, you also need a running Redis instance. You can start one using Docker with:

```shell
docker run -d -p 6379:6379 redis/redis-stack:latest
```

```python
import os

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
print(f"Connecting to Redis at: {REDIS_URL}")

# Let's check that Redis is up and running by pinging it:
import redis

redis_client = redis.from_url(REDIS_URL)
redis_client.ping()
```

The RedisVectorStore instance can be initialized in several ways:
- RedisVectorStore.__init__ - Initialize directly
- RedisVectorStore.from_texts - Initialize from a list of texts (optionally with metadata)
- RedisVectorStore.from_documents - Initialize from a list of langchain_core.documents.Document objects
- RedisVectorStore.from_existing_index - Initialize from an existing Redis index

Below we will use the RedisVectorStore.__init__ method using a RedisConfig instance.

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_redis import RedisConfig, RedisVectorStore

config = RedisConfig(
    index_name="newsgroups",
    redis_url=REDIS_URL,
    metadata_schema=[
        {"name": "category", "type": "tag"},
    ],
)

vector_store = RedisVectorStore(embeddings, config=config)

# Add items to vector store
ids = vector_store.add_texts(texts, metadata)
print(ids[0:10])
# ['newsgroups:f1e788ee61fe410daa8ef941dd166223', 'newsgroups:80b39032181f4299a359a9aaed6e2401', 'newsgroups:99a3efc1883647afba53d115b49e6e92', 'newsgroups:503a6c07cd71418eb71e11b42589efd7', 'newsgroups:7351210e32d1427bbb3c7426cf93a44f', 'newsgroups:4e79fdf67abe471b8ee98ba0e8a1a055', 'newsgroups:03559a1d574e4f9ca0479d7b3891402e', 'newsgroups:9a1c2a7879b8409a805db72feac03580', 'newsgroups:3578a1e129f5435f9743cf803413f37a', 'newsgroups:9f68baf4d6b04f1683d6b871ce8ad92d']

# Delete items from vector store
# Delete documents by passing one or more keys/ids
vector_store.index.drop_keys(ids[0])
# 1
```

The Redis VectorStore implementation will attempt to generate index schema (fields for filtering) for any metadata passed through the from_texts, from_texts_return_keys, and from_documents methods. This way, whatever metadata is passed will be indexed into the Redis search index allowing for filtering on those fields.

Below we show what fields were created from the metadata we defined above:

```shell
!rvl index info -i newsgroups --port 6379

Index Information:
╭──────────────┬────────────────┬────────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes       │ Index Options   │   Indexing │
├──────────────┼────────────────┼────────────────┼─────────────────┼────────────┤
│ newsgroups   │ HASH           │ ['newsgroups'] │ []              │          0 │
╰──────────────┴────────────────┴────────────────┴─────────────────┴────────────╯
Index Fields:
╭───────────┬─────────────┬────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name      │ Attribute   │ Type   │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │
├───────────┼─────────────┼────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────────────────┼────────────────┤
│ text      │ text        │ TEXT   │ WEIGHT         │ 1              │                │                │                │                │                 │                │
│ embedding │ embedding   │ VECTOR │ algorithm      │ FLAT           │ data_type      │ FLOAT32        │ dim            │            768 │ distance_metric │ COSINE         │
│ category  │ category    │ TAG    │ SEPARATOR      │ |              │                │                │                │                │                 │                │
╰───────────┴─────────────┴────────┴────────────────┴────────────────┴────────────────┴────────────────┴────────────────┴────────────────┴─────────────────┴────────────────╯

!rvl stats -i newsgroups --port 6379

Statistics:
╭─────────────────────────────┬────────────╮
│ Stat Key                    │ Value      │
├─────────────────────────────┼────────────┤
│ num_docs                    │ 249        │
│ num_terms                   │ 16178      │
│ max_doc_id                  │ 250        │
│ num_records                 │ 50394      │
│ percent_indexed             │ 1          │
│ hash_indexing_failures      │ 0          │
│ number_of_uses              │ 2          │
│ bytes_per_record_avg        │ 38.2743    │
│ doc_table_size_mb           │ 0.0263586  │
│ inverted_sz_mb              │ 1.83944    │
│ key_table_size_mb           │ 0.00932026 │
│ offset_bits_per_record_avg  │ 10.6699    │
│ offset_vectors_sz_mb        │ 0.089057   │
│ offsets_per_term_avg        │ 1.38937    │
│ records_per_doc_avg         │ 202.386    │
│ sortable_values_size_mb     │ 0          │
│ total_indexing_time         │ 72.444     │
│ total_inverted_index_blocks │ 16207      │
│ vector_index_sz_mb          │ 3.01776    │
╰─────────────────────────────┴────────────╯
```

Query vector store

Once your vector store has been created and the relevant documents have been added you will most likely wish to query it during the running of your chain or agent.

Performing a simple similarity search can be done as follows:

```python
query = "Tell me about space exploration"
results = vector_store.similarity_search(query, k=2)

print("Simple Similarity Search Results:")
for doc in results:
    print(f"Content: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
    print()

# Simple Similarity Search Results:
# Content: From: aa429@freenet.carleton.ca (Terry Ford)
# Subject: A flawed propulsion system: Space Shuttle
# X-Ad...
# Metadata: {'category': 'sci.space'}

# Content: From: nsmca@aurora.alaska.edu
# Subject: Space Design Movies?
# Article-I.D.: aurora.1993Apr23.124722.1
# ...
# Metadata: {'category': 'sci.space'}
```

If you want to execute a similarity search and receive the corresponding scores you can run:

```python
# Similarity search with score
scored_results = vector_store.similarity_search_with_score(query, k=2)

print("Similarity Search with Score Results:")
for doc, score in scored_results:
    print(f"Content: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
    print(f"Score: {score}")
    print()

# Similarity Search with Score Results:
# Content: From: aa429@freenet.carleton.ca (Terry Ford)
# Subject: A flawed propulsion system: Space Shuttle
# X-Ad...
# Metadata: {'category': 'sci.space'}
# Score: 0.569670975208

# Content: From: nsmca@aurora.alaska.edu
# Subject: Space Design Movies?
# Article-I.D.: aurora.1993Apr23.124722.1
# ...
# Metadata: {'category': 'sci.space'}
# Score: 0.590400338173
```

You can also transform the vector store into a retriever for easier usage in your chains.

```python
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 2})
retriever.invoke("What planet in the solar system has the largest number of moons?")
# [Document(metadata={'category': 'sci.space'}, page_content='Subject: Re: Comet in Temporary Orbit Around Jupiter?\nFrom: Robert Coe <bob@1776.COM>\nDistribution: world\nOrganization: 1776 Enterprises, Sudbury MA\nLines: 23\n\njgarland@kean.ucs.mun.ca writes:\n\n> >> Also, perihelions of Gehrels3 were:\n> >> \n> >> April  1973     83 jupiter radii\n> >> August 1970     ~3 jupiter radii\n> > \n> > Where 1 Jupiter radius = 71,000 km = 44,000 mi = 0.0005 AU.  So the\n> > 1970 figure seems unlikely to actually be anything but a perijove.\n> > Is that the case for the 1973 figure as well?\n> > -- \n> Sorry, _perijoves_...I\'m not used to talking this language.\n\nHmmmm....  The prefix "peri-" is Greek, not Latin, so it\'s usually used\nwith the Greek form of the name of the body being orbited.  (That\'s why\nit\'s "perihelion" rather than "perisol", "perigee" rather than "periterr",\nand "pericynthion" rather than "perilune".)  So for Jupiter I\'d expect it\nto be something like "perizeon".)   :^)\n\n   ___            _                                           -  Bob\n   /__) _   /    / ) _   _\n(_/__) (_)_(_)  (___(_)_(/_______________________________________ bob@1776.COM\nRobert K. Coe ** 14 Churchill St, Sudbury, Massachusetts 01776 ** 508-443-3265\n'),
#  Document(metadata={'category': 'sci.space'}, page_content='From: pyron@skndiv.dseg.ti.com (Dillon Pyron)\nSubject: Re: Why not give $1 billion to first year-long moon residents?\nLines: 42\nNntp-Posting-Host: skndiv.dseg.ti.com\nReply-To: pyron@skndiv.dseg.ti.com\nOrganization: TI/DSEG VAX Support\n\n\nIn article <1qve4kINNpas@sal-sun121.usc.edu>, schaefer@sal-sun121.usc.edu (Peter Schaefer) writes:\n>In article <1993Apr19.130503.1@aurora.alaska.edu>, nsmca@aurora.alaska.edu writes:\n>|> In article <6ZV82B2w165w@theporch.raider.net>, gene@theporch.raider.net (Gene Wright) writes:\n>|> > With the continuin talk about the "End of the Space Age" and complaints \n>|> > by government over the large cost, why not try something I read about \n>|> > that might just work.\n>|> > \n>|> > Announce that a reward of $1 billion would go to the first corporation \n>|> > who successfully keeps at least 1 person alive on the moon for a year. \n>|> > Then you\'d see some of the inexpensive but not popular technologies begin \n>|> > to be developed. THere\'d be a different kind of space race then!\n>|> > \n>|> > --\n>|> >   gene@theporch.raider.net (Gene Wright)\n>|> > theporch.raider.net  615/297-7951 The MacInteresteds of Nashville\n>|> ====\n>|> If that were true, I\'d go for it.. I have a few friends who we could pool our\n>|> resources and do it.. Maybe make it a prize kind of liek the "Solar Car Race"\n>|> in Australia..\n>|> Anybody game for a contest!\n>|> \n>|> ==\n>|> Michael Adams, nsmca@acad3.alaska.edu -- I\'m not high, just jacked\n>\n>\n>Oh gee, a billion dollars!  That\'d be just about enough to cover the cost of the\n>feasability study!  Happy, Happy, JOY! JOY!\n>\n\nFeasability study??  What a wimp!!  While you are studying, others would be\ndoing.  
# Too damn many engineers doing way too little engineering.\n\n"He who sits on his arse sits on his fortune"  - Sir Richard Francis Burton\n--\nDillon Pyron                      | The opinions expressed are those of the\nTI/DSEG Lewisville VAX Support    | sender unless otherwise stated.\n(214)462-3556 (when I\'m here)     |\n(214)492-4656 (when I\'m home)     |Texans: Vote NO on Robin Hood.  We need\npyron@skndiv.dseg.ti.com          |solutions, not gestures.\nPADI DM-54909                     |\n\n')]
```

We can filter our search results based on metadata:

```python
from redisvl.query.filter import Tag

query = "Tell me about space exploration"

# Create a RedisVL filter expression
filter_condition = Tag("category") == "sci.space"

filtered_results = vector_store.similarity_search(query, k=2, filter=filter_condition)

print("Filtered Similarity Search Results:")
for doc in filtered_results:
    print(f"Content: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
    print()
# Filtered Similarity Search Results:
# Content: From: aa429@freenet.carleton.ca (Terry Ford)
# Subject: A flawed propulsion system: Space Shuttle
# X-Ad...
# Metadata: {'category': 'sci.space'}

# Content: From: nsmca@aurora.alaska.edu
# Subject: Space Design Movies?
# Article-I.D.: aurora.1993Apr23.124722.1
# ...
# Metadata: {'category': 'sci.space'}
```

Maximum Marginal Relevance (MMR)

Maximum Marginal Relevance (MMR) Search is a retrieval strategy used in LangChain vector stores to diversify search results while maintaining relevance. It ensures that the retrieved documents are both highly relevant to the query and diverse enough to cover different aspects of the information space.

Why Use MMR Search?
- Reduces Redundancy: If multiple documents contain similar information, MMR ensures that the retrieved documents do not repeat the same content.
- Enhances Diversity: It selects documents that introduce new information while still being relevant.
- Balances Relevance & Novelty: Instead of only ranking by similarity (which can cause redundant results), MMR ranks based on a mix of similarity and uniqueness.

```python
# Maximum marginal relevance search with filter
mmr_results = vector_store.max_marginal_relevance_search(
    query, k=5, fetch_k=20, lambda_mult=0.5, filter=filter_condition
)

print("Maximum Marginal Relevance Search Results:")
for doc in mmr_results:
    print(f"Content: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
    print()
```
- k=5 → Returns the top 5 most relevant and diverse results.
- fetch_k=20 → Retrieves the top 20 documents based on similarity before applying MMR.
- lambda_mult=0.5 → Balances relevance (similarity to query) and diversity (minimizing redundancy).

Avoid MMR when the dataset is small and doesn't need diversification.
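
The role of lambda_mult is easiest to see in a small greedy implementation. The following is a sketch of the general MMR idea in plain Python, not LangChain's own code: the function names and toy vectors are made up for this example, and `candidates` plays the role of the fetch_k documents retrieved before re-ranking.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],    # 0: very relevant
    [1.0, 0.12],   # 1: near-duplicate of 0
    [0.6, -0.8],   # 2: less relevant, but points in a different direction
]
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]: the near-duplicate is skipped
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance ranking
```

With lambda_mult=1.0 the redundancy term vanishes and the result matches a plain similarity search; lower values penalize candidates that resemble documents already chosen.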

Connect to an existing Index

To have the same metadata indexed when reconnecting with the Redis VectorStore, you need to pass in the same index schema, either as a path to a YAML file or as a dictionary. The following shows how to obtain the schema from an index and connect to an existing index.

```python
# write the schema to a yaml file
vector_store.index.schema.to_yaml("redis_schema.yaml")

# now we can connect to our existing index as follows

new_rdvs = RedisVectorStore(
    embeddings,
    redis_url=REDIS_URL,
    schema_path="redis_schema.yaml",
)

results = new_rdvs.similarity_search("Space Shuttle Propulsion System", k=3)
print(results[0])

# compare the two schemas to verify they are the same
new_rdvs.index.schema == vector_store.index.schema
# True

# Clear vector store
vector_store.index.delete(drop=True)
```

### Retrieval: Retrievers

A vector store is a database designed to store and retrieve high-dimensional vector embeddings efficiently. It enables fast similarity searches based on distance metrics like cosine similarity or Euclidean distance.
- Stores vector embeddings of documents (precomputed using an embedding model).
- Supports similarity search (e.g., nearest neighbors search).
- Some provide Approximate Nearest Neighbor (ANN) algorithms for efficiency.
- Examples: FAISS, Pinecone, Redis, Chroma, Qdrant.
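
As a rough mental model of the points above, a vector store can be pictured as a list of (embedding, text) pairs plus a ranking function. The sketch below is a hypothetical toy (`TinyVectorStore` and the hand-written 3-D "embeddings" are assumptions, not any library's API) and does exact brute-force search; production stores typically replace the full scan with an ANN index.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


@dataclass
class TinyVectorStore:
    """Hypothetical in-memory store that keeps embeddings and texts side by side."""

    vectors: list = field(default_factory=list)
    texts: list = field(default_factory=list)

    def add(self, vector, text):
        self.vectors.append(vector)
        self.texts.append(text)

    def similarity_search(self, query_vector, k=2):
        # Brute force: score every stored vector against the query, highest first.
        ranked = sorted(
            zip(self.vectors, self.texts),
            key=lambda pair: cosine_similarity(query_vector, pair[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
# Embeddings are written by hand here; normally an embedding model produces them.
store.add([0.9, 0.1, 0.0], "Rocket engines burn liquid hydrogen.")
store.add([0.1, 0.9, 0.0], "Sourdough needs a long fermentation.")
store.add([0.8, 0.2, 0.1], "The shuttle used solid rocket boosters.")

top = store.similarity_search([1.0, 0.0, 0.0], k=2)
print(top)
```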

You typically interact with a vector store directly to perform similarity searches.

```python
docs = vector_store.similarity_search(query, k=5)

for doc in docs:
    print(doc.page_content)
```

**A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well. Retrievers accept a string query as input and return a list of Document objects as output.**
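
To make the "string in, list of documents out" contract concrete, here is a minimal sketch of a retriever that uses no embeddings at all. Everything in it is hypothetical: `Document` is a small stand-in for LangChain's class of the same name, and `KeywordRetriever` is an invented example, not a LangChain component.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus optional metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Hypothetical retriever that ranks documents by keyword overlap."""

    def __init__(self, documents, k=2):
        self.documents = documents
        self.k = k

    def invoke(self, query: str) -> list:
        terms = set(query.lower().split())
        # Score = number of query terms that also appear in the document.
        scored = [
            (len(terms & set(doc.page_content.lower().split())), doc)
            for doc in self.documents
        ]
        hits = [(score, doc) for score, doc in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in hits[: self.k]]


corpus = [
    Document("redis stores vectors in memory"),
    Document("faiss builds an index on disk"),
    Document("bananas are rich in potassium"),
]
retriever = KeywordRetriever(corpus)
results = retriever.invoke("where does redis keep vectors")
print([doc.page_content for doc in results])  # ['redis stores vectors in memory']
```

The caller only sees `invoke(query)` returning documents, so the keyword matcher could be swapped for a vector store, a SQL lookup, or a web search without changing downstream code.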

### Retrieval: Retrievers: Vector Store

A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface. It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store.

A Vector Store retriever is an abstraction on top of a vector store that provides a structured way to retrieve relevant documents for a language model. Once you construct a vector store, it's very easy to construct a retriever.

The as_retriever method in LangChain converts a vector store into a retriever. Its arguments let you customize the retrieval process, such as the retrieval strategy and the number of top documents to return (k).

Here is the list of parameters:
- search_type
    - "similarity": Retrieves the top-K documents based on similarity. This is the default.
    - "mmr": Retrieves documents using Maximum Marginal Relevance (MMR) to ensure diverse results.
    - "similarity_score_threshold": Retrieves documents with similarity scores above a threshold.
- search_kwargs
    - For "similarity" or "mmr" search_type, you can specify k (number of top documents to retrieve).
    - For "mmr", you can also specify lambda_mult (balance between diversity and similarity).
    - For "similarity_score_threshold", you can specify score_threshold (minimum similarity score).
- callbacks
    - A list of callback functions for retrieval events (optional).

```python
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../state_of_the_union.txt")

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)

retriever = db.as_retriever()

docs = retriever.invoke("what did he say about John McCain")
```

By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance search, you can specify that as the search type.

```python
retriever = db.as_retriever(search_type="mmr")

docs = retriever.invoke("what did he say about John McCain")
```

You can also use a retrieval method that applies a similarity score threshold and returns only documents scoring above it.

```python
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)

docs = retriever.invoke("what did he say about John McCain")
```

You can also specify search kwargs like k to use when doing retrieval.

```python
retriever = db.as_retriever(search_kwargs={"k": 1})

docs = retriever.invoke("what did he say about John McCain")
len(docs)
# 1
```

### Retrieval: Retrievers: MultiQueryRetriever

The MultiQueryRetriever in LangChain is designed to improve retrieval from vector databases by addressing potential issues caused by the nuances of query embeddings and the limitations of similarity-based search.

Problem with Traditional Retrieval
- Distance-Based Retrieval:
    - Vector databases embed both the query and documents in high-dimensional space.
    - Retrieval is based on the similarity (or "distance") between the query embedding and document embeddings.
    - Challenges:
        - Sensitivity to Query Wording: Slight rewording of a query might produce different results because the embeddings depend on the exact phrasing of the query.
        - Embedding Limitations: If the embeddings fail to capture the full semantics of the data, relevant documents might be overlooked.
        - Manual Effort in Prompt Tuning: To address these issues, developers often experiment with how queries are phrased, but this process can be tedious and requires domain expertise.

The MultiQueryRetriever automates the process of generating variations of a query to improve retrieval diversity and relevance.

Query Generation:
- The retriever uses an LLM to create multiple variations of the original query.
- Each variation represents the query from a different perspective or angle.

Example:
- Original query: "What did the president say about climate change?"
- Variations:
    - "What are the president's views on global warming?"
    - "Did the president mention environmental policies?"
    - "Statements about reducing emissions in the president's speech."

**For each generated query, the retriever searches the vector database and retrieves a set of relevant documents. The retriever takes the unique union of all retrieved documents across all queries. This combines results from multiple perspectives, increasing the likelihood of retrieving a richer and more relevant set of documents.**
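
The generate, search, and union steps can be sketched without an LLM or a vector database. In this hypothetical example, `generate_queries` stands in for the LLM call and `FAKE_INDEX` stands in for the vector database; the document IDs are invented purely for illustration.

```python
def generate_queries(question):
    """Stub for the LLM: returns hand-written variations of the question."""
    return [
        "What are the president's views on global warming?",
        "Did the president mention environmental policies?",
        "Statements about reducing emissions in the president's speech.",
    ]


# Hypothetical search results per query; a real retriever would run a similarity search.
FAKE_INDEX = {
    "What are the president's views on global warming?": [
        "doc_climate_speech",
        "doc_warming_interview",
    ],
    "Did the president mention environmental policies?": [
        "doc_climate_speech",
        "doc_epa_budget",
    ],
    "Statements about reducing emissions in the president's speech.": [
        "doc_emissions_targets",
        "doc_warming_interview",
    ],
}


def base_retrieve(query):
    """Stub base retriever: look up the canned results for a query."""
    return FAKE_INDEX.get(query, [])


def multi_query_retrieve(question):
    """Run every query variation and keep the unique union, in first-seen order."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in base_retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


docs = multi_query_retrieve("What did the president say about climate change?")
print(docs)
# ['doc_climate_speech', 'doc_warming_interview', 'doc_epa_budget', 'doc_emissions_targets']
```

Six raw hits collapse to four unique documents, and two of them are found by only one variation each, which is the recall gain the technique is after.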

Benefits:
- Overcomes Query Sensitivity: By generating multiple query variations, the retriever reduces the chance of missing relevant documents due to specific phrasing.
- Handles Embedding Limitations: If some documents are relevant but don't match well with one variation, they might still match another variation, improving overall recall.
- No Manual Tuning: Automates the process of prompt tuning, saving time and effort.

```python
# Build a sample vectorDB
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

unique_docs = retriever_from_llm.invoke(question)
len(unique_docs)
# INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be approached?', '2. What are the different methods for Task Decomposition?', '3. What are the various approaches to decomposing tasks?']
# 5
```

You can also supply a prompt along with an output parser to split the results into a list of queries.

```python
from typing import List

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


# Output parser that splits the LLM response into a list of queries, one per line
class LineListOutputParser(BaseOutputParser[List[str]]):
    def parse(self, text: str) -> List[str]:
        lines = text.strip().split("\n")
        return [line for line in lines if line.strip()]  # drop empty lines

# Initialize Output Parser
output_parser = LineListOutputParser()

# Define Prompt
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    
    Original question: {question}"""
)

# Define LLM Model
llm = ChatOpenAI(temperature=0)

# LCEL Chain: prompt -> LLM -> list of query strings
llm_chain = QUERY_PROMPT | llm | output_parser

# Load Vector Store (FAISS in this case); use the same embeddings the index was built with
vectorstore = FAISS.load_local(
    "path_to_faiss_index",
    OpenAIEmbeddings(),
    allow_dangerous_deserialization=True,  # needed in recent versions; only load indexes you trust
)

# Define MultiQueryRetriever; the chain already returns a list of query strings
retriever = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(),
    llm_chain=llm_chain,
)

# Run Retrieval
query = "What does the course say about regression?"
unique_docs = retriever.invoke(query)

# Print number of unique documents retrieved
print(len(unique_docs))
# 11

# Print content of retrieved documents
for doc in unique_docs:
    print(doc.page_content)
```

### Retrieval: Retrievers: Contextual Compression

When retrieving documents from a vector database or another retrieval system, the results often contain irrelevant or excessively long documents. Contextual compression is a technique to reduce unnecessary content while keeping only the parts relevant to the user's query.

Why is Contextual Compression Needed?
- Documents Can Contain Irrelevant Information:
    - Retrieved documents might be long and contain unnecessary sections unrelated to the query.
    - Example: If a document is a 10-page research paper, but only one paragraph is relevant, sending the entire document to the LLM is inefficient.
- Sending Large Documents is Expensive:
    - Most hosted LLM APIs charge per token.
    - More text = higher cost and slower response time.
- Poor LLM Responses Due to Excessive Context:
    - When the model receives too much text, it may fail to focus on the most relevant parts.
    - Example: If a model receives a 10-page document, it may not correctly prioritize one crucial paragraph.

How Contextual Compression Works

Instead of returning full retrieved documents, the Contextual Compression Retriever does the following:
- Base Retriever:
    - It first retrieves documents based on similarity search or another retrieval method.
    - Example: A vector store retriever retrieves five documents based on a query.
- Document Compressor:
    - It filters and compresses retrieved documents before passing them to the application.
    - Two ways compression happens:
        - Reducing document content (Extracting only the relevant portions).
        - Dropping irrelevant documents (Removing unhelpful results).
- Final Output:
    - Only the most relevant and concise information is returned.
    - The LLM processes less text, making responses cheaper, faster, and more relevant.
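
The two-stage flow above (base retriever, then document compressor) can be sketched in plain Python. This is only an illustration under loud assumptions: the base retriever returns canned strings, and the "compressor" uses keyword overlap where a real compressor such as LLMChainExtractor would ask an LLM. All function names here are hypothetical.

```python
import re


def base_retrieve(query):
    """Stub base retriever: returns fixed documents regardless of the query."""
    return [
        "The president praised John McCain. The weather in Ohio was mild. Taxes were also discussed.",
        "Inflation fell last quarter. Gas prices dropped as well.",
        "McCain was remembered as a war hero. The crowd applauded for a minute.",
    ]


def compress(document, query):
    """Keep only sentences that share a keyword with the query; None if nothing matches."""
    keywords = {w for w in re.findall(r"\w+", query.lower()) if len(w) > 3}
    sentences = re.split(r"(?<=[.!?])\s+", document)
    kept = [s for s in sentences if keywords & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept) if kept else None


def compression_retrieve(query):
    """Retrieve, shrink each document, and drop documents with no relevant content."""
    compressed = []
    for doc in base_retrieve(query):
        result = compress(doc, query)
        if result is not None:
            compressed.append(result)
    return compressed


compressed_docs = compression_retrieve("What did the president say about John McCain")
print(compressed_docs)
# ['The president praised John McCain.', 'McCain was remembered as a war hero.']
```

Both compression modes show up here: the first and third documents are reduced to their relevant sentence, and the second is dropped entirely.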

Let's start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). We can see that given an example question our retriever returns one or two relevant docs and a few irrelevant docs. And even the relevant docs have a lot of irrelevant information in them.

```python
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
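from langchain_text_splitters import CharacterTextSplitter


# Helper to print retrieved documents separated by a divider line
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            f"Document {i + 1}:\n\n{d.page_content}" for i, d in enumerate(docs)
        )
    )


documents = TextLoader("../../state_of_the_union.txt").load()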
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()

docs = retriever.invoke("What did the president say about John McCain")
pretty_print_docs(docs)
```

Now let's wrap our base retriever with a ContextualCompressionRetriever. We'll add an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about John McCain"
)
pretty_print_docs(compressed_docs)
```

The LLMChainFilter is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without manipulating the document contents.



START HERE:
https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/contextual_compression/
https://youtu.be/PqS1kib7RTw (Langgraph and Agents)
https://nvidia.github.io/GenerativeAIExamples/0.7.0/notebooks/06_LangGraph_HandlingAgent_IntermediateSteps.html (Associating documentation for Langgraph and Agents)
https://youtu.be/WyIWaopiUEo (Adding RAG to LangGraph Agents)
https://python.langchain.com/v0.1/docs/expression_language/
https://youtu.be/YbpKMIUjvK8 (How to write Unit Tests in Python)

### LCEL and Runnable Interface