# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language (LCEL) to build coordinated multi-actor and stateful applications that include cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Because we use cycles rather than loops, we can also compose rather complex flows through our graph in a much more readable and natural fashion, effectively allowing us to recreate application flowcharts in code in an almost 1-to-1 fashion.

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1:  Dependencies

We'll first install all our required libraries.

> NOTE: If you're running this locally - please skip this step.

In [1]:
!pip install -qU langchain langchain_openai langchain-community langgraph arxiv



## Task 2: Environment Variables

We'll want to set both our OpenAI API key and our LangSmith environment variables.

In [37]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [38]:
os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY")

In [39]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE5 - LangGraph - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Tavily Search Results](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/tools/tavily_search/tool.py)
- [Arxiv](https://github.com/langchain-ai/langchain/tree/master/libs/community/langchain_community/tools/arxiv)

#### 🏗️ Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be a method.

#### ✅ Answer:
We have added Tavily and Arxiv to the toolbelt.

In [40]:
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tavily_tool = TavilySearchResults(max_results=5)

tool_belt = [
    tavily_tool,
    ArxivQueryRun(),
]

### Model

Now we can set up our model! We'll leverage the familiar OpenAI model suite for this example - but it's not *necessary* to use it with LangGraph. LangGraph supports all models - though you might not find success with smaller models - as such, they recommend you stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API - we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API).

In [41]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o", temperature=0)

Now that we have our model set up, let's "put on the tool belt", which is to say: we'll bind our LangChain formatted tools to the model in an OpenAI function calling format.

In [42]:
model = model.bind_tools(tool_belt)

#### ❓ Question #1:

How does the model determine which tool to use?

#### ✅ Answer:

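The call `model = model.bind_tools(tool_belt)` attaches each tool's name, description, and argument schema to the model, so every request tells the model which tools are available.

The model then decides, for each response, whether to answer directly or to emit one or more tool calls. In the graph built below, the tool node runs a tool only when the last message contains tool calls.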

Additionally here is an explanation of how LLMs choose tools:

An LLM decides which tool to use by analyzing the input prompt against the capabilities of each available tool, as described by its schema (name, description, and parameters), and selecting the tool that best aligns with the task at hand. This is usually done through a structured format in which the model can match key elements of the query to the tool that is needed.

Key points about how LLMs choose tools:

- **Tool descriptions and schemas:** Each available tool is defined with a clear description of its purpose and expected input parameters, which the LLM can reference when making a selection.
- **Context analysis:** The LLM examines the user's prompt to identify relevant keywords, phrases, and concepts that indicate which tool is most likely needed to fulfill the request.
- **Multi-step reasoning:** For complex tasks, an LLM might need to use multiple tools sequentially, where the output of one tool informs the selection of the next.
- **Explicit or implicit invocation:** In some cases, users may explicitly state which tool they want to use, while in others, the LLM can autonomously decide based on the prompt.

Example scenario:

- Prompt: "What is the weather like in San Francisco today?"
- Tool analysis: The LLM identifies that the "get weather" tool is most relevant as it can provide weather information based on location.
- Output: The LLM uses the "get weather" tool to fetch the current weather conditions for San Francisco and presents the results to the user.
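
To make the hand-off concrete: the model itself never runs a tool. It returns a structured request naming a tool and its arguments, and the application looks that name up and executes it. The sketch below imitates this in plain Python, with no LangChain involved. `get_weather`, `TOOLS`, and `run_tool_call` are hypothetical names for illustration, and the request dictionary only mirrors the `name` and `args` fields that appear in the `tool_calls` output later in this notebook.

```python
# Hypothetical tool: a stand-in for a real weather API.
def get_weather(city: str) -> str:
    """Return a canned weather report for a city."""
    return f"It is sunny in {city}."

# Registry mapping tool names (what the model sees) to callables.
TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> str:
    """Look up the requested tool by name and call it with the given args."""
    tool = TOOLS.get(tool_call["name"])
    if tool is None:
        return f"Unknown tool: {tool_call['name']}"
    return tool(**tool_call["args"])

# Shape modeled on the tool_calls entries a model emits.
request = {"name": "get_weather", "args": {"city": "San Francisco"}}
result = run_tool_call(request)
print(result)
```

The model's only job in this picture is to produce `request`; everything after that is ordinary application code.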

## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StateGraph` which uses an `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below - but this `AgentState` object is one that is stored in a `TypedDict` with the key `messages`, whose value is a list of `BaseMessage` objects that is appended to whenever the state changes.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
  - `{"messages" : []}`
2. Our user submits a query to our application.
  - New State: `HumanMessage(#1)`
  - `{"messages" : [HumanMessage(#1)]}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
  - New State: `AgentMessage(#1, additional_kwargs {"function_call" : "WebSearchTool"})`
  - `{"messages" : [HumanMessage(#1), AgentMessage(#1, ...)]}`
4. We pass our state object to a "conditional node" (more on this later) which reads the last state to determine if we need to use a tool - which it can determine properly because of our provided object!
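
The append-only behaviour in the walkthrough above can be sketched without any library. This is a simplified stand-in, not LangGraph's implementation: `append_messages` and `apply_update` are hypothetical helpers, and messages are plain tuples rather than message objects.

```python
def append_messages(existing: list, new: list) -> list:
    """Reducer: combine old and new messages instead of overwriting."""
    return existing + new

def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state using the reducer."""
    return {"messages": append_messages(state["messages"], update["messages"])}

# Step 1: empty state. Steps 2 and 3: each update adds to the history.
state = {"messages": []}
state = apply_update(state, {"messages": [("human", "Who wrote QLoRA?")]})
state = apply_update(state, {"messages": [("ai", "<tool call: arxiv>")]})

for role, content in state["messages"]:
    print(role, "->", content)
```

Each node returns only its new messages; the reducer decides how they are combined with what is already in the state.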

In [43]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
import operator
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to each other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [44]:
from langgraph.prebuilt import ToolNode

def call_model(state):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt)

Now we have two total nodes. We have:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [45]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x7fe7b82165d0>

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [46]:
uncompiled_graph.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x7fe7b82165d0>

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks whether the last message contains any `tool_calls`.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or the END (finish the graph).

It's important to highlight that the routing function's return values decide where the state goes next. In this case `should_continue` returns either `"action"` (the name of our action node) or `END`, so the graph can route on those values directly. If the function returned other labels instead, such as `"continue"` and `"end"`, we would pass a dictionary as the third parameter (the mapping) to translate each possible output into a node.
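
As a plain-Python illustration of the mapping idea, a router returns a label and a dictionary translates that label into the next node. The names `route`, `PATH_MAP`, and `next_node` are hypothetical, messages are plain dictionaries, and the string `"__end__"` is only a placeholder for the graph's END sentinel.

```python
END_NODE = "__end__"  # placeholder for the graph's END sentinel

def route(state: dict) -> str:
    """Return a label describing what should happen next."""
    last = state["messages"][-1]
    return "continue" if last.get("tool_calls") else "end"

# Every label the router can return needs an entry here.
PATH_MAP = {"continue": "action", "end": END_NODE}

def next_node(state: dict) -> str:
    """Translate the router's label into a node name."""
    return PATH_MAP[route(state)]

wants_tool = {"messages": [{"content": "", "tool_calls": [{"name": "arxiv"}]}]}
finished = {"messages": [{"content": "Final answer", "tool_calls": []}]}
print(next_node(wants_tool), next_node(finished))
```

Keeping labels separate from node names lets the routing logic stay unchanged if a node is later renamed; only the mapping needs to be updated.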

In [47]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)

<langgraph.graph.state.StateGraph at 0x7fe7b82165d0>

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [48]:
uncompiled_graph.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x7fe7b82165d0>

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)

All that's left to do now is to compile our workflow - and we're off!

In [49]:
compiled_graph = uncompiled_graph.compile()

#### ❓ Question #2:

Is there any specific limit to how many times we can cycle?

If not, how could we impose a limit to the number of cycles?

#### ✅ Answer:

Yes. By default LangGraph stops a run once it exceeds its recursion limit (25 steps unless configured otherwise) and raises an error rather than looping forever.

We can impose a tighter limit ourselves by having the conditional edge check the number of messages in the state: if the count exceeds the limit, it returns `END`.

Additionally, we can use a helpfulness check to decide whether the agent should continue. If it should continue, the conditional edge returns the agent node; if not, it returns `END`.
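
One way to picture a message-count limit is the plain-Python loop below. It is a sketch under stated assumptions: `always_wants_tool`, `run_with_limit`, and `MAX_MESSAGES` are hypothetical, and the "agent" is a stub that requests a tool on every turn, so only the limit can stop it.

```python
MAX_MESSAGES = 10  # hypothetical budget on the number of messages in state

def always_wants_tool(state: dict) -> dict:
    """Stub agent that never stops asking for a tool."""
    return {"role": "ai", "tool_calls": ["search"]}

def run_with_limit(question: str) -> dict:
    state = {"messages": [{"role": "human", "content": question}]}
    while True:
        state["messages"].append(always_wants_tool(state))
        # Conditional edge: stop when the budget is spent...
        if len(state["messages"]) >= MAX_MESSAGES:
            break
        # ...or when the agent no longer requests a tool.
        if not state["messages"][-1]["tool_calls"]:
            break
        state["messages"].append({"role": "tool", "content": "result"})
    return state

final_state = run_with_limit("loop forever?")
print(len(final_state["messages"]))
```

Because the check lives in the conditional edge, no extra counter is needed in the state; the message history itself acts as the counter.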



## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fares:

In [50]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="Who is the current captain of the Winnipeg Jets?")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_l9s1yGTjeWAXr7yAluOOC0KQ', 'function': {'arguments': '{"query":"current captain of the Winnipeg Jets 2023"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 27, 'prompt_tokens': 162, 'total_tokens': 189, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_4691090a87', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-68b0abe8-6273-413d-a78b-66dc048ad7c4-0', tool_calls=[{'name': 'tavily_search_results_json', 'args': {'query': 'current captain of the Winnipeg Jets 2023'}, 'id': 'call_l9s1yGTjeWAXr7yAluOOC0KQ', 'type': 'tool_call'}], usage_metadata={'input_tokens': 162, 'output_t

Let's look at what happened:

1. Our state object was populated with our request
2. The state object was passed into our entry point (agent node) and the agent node added an `AIMessage` to the state object and passed it along the conditional edge
3. The conditional edge received the state object, found the "tool_calls" `additional_kwarg`, and sent the state object to the action node
4. The action node ran the requested tool, added the result to the state object as a `ToolMessage`, and passed it along the edge to the agent node
5. The agent node added a response to the state object and passed it along the conditional edge
6. The conditional edge received the state object, could not find the "tool_calls" `additional_kwarg` and passed the state object to END where we see it output in the cell above!
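
The six steps above can be replayed with a scripted stand-in for the model. Everything here is hypothetical (`scripted_agent`, `fake_tool`, `run`); the sketch only mimics the order in which the nodes are visited and does not call LangGraph or any real tool.

```python
def scripted_agent(state: dict) -> dict:
    """First turn: ask for a tool. After a tool result: give an answer."""
    if state["messages"][-1]["role"] == "tool":
        return {"role": "ai", "content": "Here is the answer.", "tool_calls": []}
    return {"role": "ai", "content": "", "tool_calls": [{"name": "search"}]}

def fake_tool(state: dict) -> dict:
    """Pretend to run the requested tool and report its output."""
    return {"role": "tool", "content": "search results"}

def run(question: str):
    state = {"messages": [{"role": "human", "content": question}]}
    visited = []
    node = "agent"  # entry point
    while node != "end":
        visited.append(node)
        if node == "agent":
            state["messages"].append(scripted_agent(state))
            # Conditional edge: tool requested -> action, otherwise finish.
            node = "action" if state["messages"][-1]["tool_calls"] else "end"
        else:
            state["messages"].append(fake_tool(state))
            node = "agent"  # fixed edge back to the agent
    return visited, state

visited, state = run("Who is the captain?")
print(visited)
```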

Now let's look at an example that shows multiple tool usage - all with the same flow!

In [51]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_mlBr5KIbHK6mfZtFKQ22CnaF', 'function': {'arguments': '{"query":"QLoRA"}', 'name': 'arxiv'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 178, 'total_tokens': 195, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_4691090a87', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-b94f87ca-0031-434a-9741-91c98b0492ab-0', tool_calls=[{'name': 'arxiv', 'args': {'query': 'QLoRA'}, 'id': 'call_mlBr5KIbHK6mfZtFKQ22CnaF', 'type': 'tool_call'}], usage_metadata={'input_tokens': 178, 'output_tokens': 17, 'total_tokens': 195, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'a

#### 🏗️ Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

Let's first look at the code as it is:

```python
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!")]}

async for chunk in compiled_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

```

Now let's look at the steps the agent took to arrive at the correct answer:

1. The graph was called with the question "Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!"
2. The question was placed in the state object as a `HumanMessage` and passed to the agent node.
3. The agent node added an AIMessage to the state object and passed it to the conditional edge.
4. The conditional edge received the state object, found the "tool_calls" `additional_kwarg`, and sent the state object to the action node.
5. The action node added a ToolMessage to the state object and passed it to the agent node.
6. The agent node added an AIMessage to the state object and passed it to the conditional edge.
7. The conditional edge received the state object, found the "tool_calls" `additional_kwarg`, and sent the state object to the action node.
8. The action node added a ToolMessage to the state object and passed it to the agent node.
9. The agent node added an AIMessage to the state object and passed it to the conditional edge.
10. The conditional edge received the state object, found the "tool_calls" `additional_kwarg`, and sent the state object to the action node.


## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [52]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["question"])]}

def parse_output(input_state):
  return input_state["messages"][-1].content

agent_chain = convert_inputs | compiled_graph | parse_output

In [53]:
agent_chain.invoke({"question" : "What is RAG?"})

"RAG stands for Retrieval-Augmented Generation. It is a technique used in natural language processing (NLP) that combines retrieval-based methods with generative models to improve the quality and accuracy of generated text. Here's a brief overview of how it works:\n\n1. **Retrieval**: In the first step, relevant documents or pieces of information are retrieved from a large corpus or database. This is typically done using a retrieval model that identifies the most relevant documents based on the input query or context.\n\n2. **Augmentation**: The retrieved documents are then used to augment the input to a generative model. This means that the generative model has access to additional context or information that can help it produce more accurate and contextually relevant responses.\n\n3. **Generation**: Finally, a generative model, such as a transformer-based language model, uses the augmented input to generate a response or piece of text. The additional context provided by the retrieved

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]
```

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions.

#### ✅ Answer:

```python
questions = [
    "How does QLoRA compare to full-model finetuning?",
    "What is the main difference between QLoRA and other quantization methods?",
    "What are the key advantages of using QLoRA for model compression?",
    "How does QLoRA improve the efficiency of model training?",
    "What are the potential limitations of QLoRA compared to other compression techniques?"
]

answers = [
    {"must_mention" : ["QLoRA", "full-model finetuning"]},
    {"must_mention" : ["QLoRA", "other quantization methods"]},
    {"must_mention" : ["QLoRA", "model compression"]},
    {"must_mention" : ["QLoRA", "model training efficiency"]},
    {"must_mention" : ["QLoRA", "other compression techniques"]},
]
```


In [54]:
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?",
    "How does QLoRA compare to full-model finetuning?",
    "What is the main difference between QLoRA and other quantization methods?",
    "What are the key advantages of using QLoRA for model compression?",
    "How does QLoRA improve the efficiency of model training?",
    "What are the potential limitations of QLoRA compared to other compression techniques?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
    {"must_mention" : ["QLoRA", "full-model finetuning"]},
    {"must_mention" : ["QLoRA", "other quantization methods"]},
    {"must_mention" : ["QLoRA", "model compression"]},
    {"must_mention" : ["QLoRA", "model training efficiency"]},
    {"must_mention" : ["QLoRA", "other compression techniques"]}
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [55]:
from langsmith import Client

client = Client()

dataset_name = f"Retrieval Augmented Generation - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    outputs=answers,
    dataset_id=dataset.id,
)

#### ❓ Question #3:

How are the correct answers associated with the questions?

> NOTE: Feel free to indicate if this is problematic or not

#### ✅ Answer:

The correct answers are associated with the questions by position: `create_examples` receives the questions as `inputs` and the answers as `outputs`, so the question at index *i* is paired with the answer at index *i*. Each answer is a dictionary whose `must_mention` key lists the keywords that are expected to be present in the response.

Explanation of the Association:
1. Questions List: Each question is a string that poses a specific inquiry related to the QLoRA paper or its concepts.
2. Answers List: Each answer is represented as a dictionary with a key must_mention, which contains a list of keywords or phrases that should be included in the response to be considered correct.
3. Matching Mechanism: When evaluating the responses generated by the agent, the evaluation process checks if the response contains all the keywords specified in the corresponding answer's must_mention list. If the response includes all required keywords, it is deemed correct.

Potential Issues:

- Ambiguity: If the keywords are too broad or common, it may lead to false positives where a response is marked correct even if it doesn't fully address the question.
- Lack of Context: The evaluation only checks for the presence of keywords without considering the context in which they are used, which might not accurately reflect the quality or relevance of the answer.
- Flexibility: This method may not account for variations in phrasing or synonyms, which could lead to valid answers being marked incorrect if they do not match the exact keywords.

Overall, while this method provides a straightforward way to associate questions with answers, it may benefit from additional context-aware evaluation mechanisms to improve accuracy and relevance.
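
In the dataset code above, the question at index *i* is matched with the answer at index *i*, so a missing or reordered entry silently shifts every later pair. A quick check before uploading can catch a length mismatch. The sketch uses two entries from the lists above; `pair_examples` is a hypothetical helper and not part of LangSmith.

```python
def pair_examples(questions: list, answers: list) -> list:
    """Pair each question with the answer at the same index."""
    if len(questions) != len(answers):
        raise ValueError(
            f"{len(questions)} questions but {len(answers)} answers"
        )
    return list(zip(questions, answers))

sample_questions = [
    "What optimizer is used in QLoRA?",
    "Who authored the QLoRA paper?",
]
sample_answers = [
    {"must_mention": ["paged", "optimizer"]},
    {"must_mention": ["Tim", "Dettmers"]},
]

pairs = pair_examples(sample_questions, sample_answers)
for question, answer in pairs:
    print(question, "->", answer["must_mention"])
```

A length check cannot detect two answers that were swapped, which is one reason positional pairing is fragile; storing each question and its answer together in one record would avoid that.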

### Task 2: Adding Evaluators

Now we can add a custom evaluator to see if our responses contain the expected information.

We'll be using a fairly naive exact-match process to determine if our response contains specific strings.

In [56]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return EvaluationResult(key="must_mention", score=score)
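
To see how strict this check is, here is the same `all(...)` test applied to plain strings, outside LangSmith. `contains_all` is a hypothetical helper that copies only the scoring line of the evaluator, and the sample predictions are made up for illustration.

```python
def contains_all(prediction: str, required: list) -> bool:
    """True only if every required phrase is a substring of the prediction."""
    return all(phrase in prediction for phrase in required)

required = ["paged", "optimizer"]

# Exact lowercase phrases are present.
exact = contains_all("QLoRA uses a paged optimizer.", required)
# Same meaning, but "Paged" is capitalised, so the substring test fails.
capitalised = contains_all("QLoRA uses Paged Optimizers.", required)
# An empty requirement list passes trivially.
empty_requirements = contains_all("anything", [])

print(exact, capitalised, empty_requirements)
```

The substring test is case-sensitive, and an example with no `must_mention` entries always scores as a pass.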

#### ❓ Question #4:

What are some ways you could improve this metric as-is?

> NOTE: Alternatively you can suggest where gaps exist in this method.

#### ✅ Answer:

Here are some suggestions to improve the `must_mention` evaluator metric and address potential gaps in its current implementation:

1. Contextual Understanding
   - Issue: The current implementation only checks for the presence of required phrases without considering the context in which they appear. This can lead to false positives if the phrases are mentioned but not in a relevant context.
   - Improvement: Implement a more sophisticated natural language processing (NLP) approach that evaluates the context of the phrases. For example, using semantic similarity measures or context-aware embeddings (like BERT) to assess whether the phrases are used appropriately.
2. Partial Matches and Synonyms
   - Issue: The evaluator requires exact matches of phrases, which may not account for synonyms or variations in phrasing.
   - Improvement: Introduce a synonym dictionary or use NLP libraries to identify synonyms and variations of the required phrases. This would allow for more flexible matching.
3. Weighting of Phrases
   - Issue: All required phrases are treated equally, which may not reflect their importance in the context of the question.
   - Improvement: Assign weights to different phrases based on their significance. For example, some phrases could be deemed more critical than others, and the score could be calculated based on the weighted presence of these phrases.
4. Threshold for Scoring
   - Issue: The current implementation returns a binary score (0 or 1) based on whether all phrases are present.
   - Improvement: Implement a scoring system that allows for partial credit. For instance, if 3 out of 5 required phrases are present, the score could reflect that (e.g., 0.6).
5. Handling Negations
   - Issue: The evaluator does not account for negations, which can change the meaning of phrases.
   - Improvement: Incorporate a mechanism to detect negations in the prediction. For example, if a required phrase is present but negated (e.g., "not", "never"), it should affect the score negatively.
6. Feedback Loop for Continuous Improvement
   - Issue: The evaluator does not learn from past evaluations.
   - Improvement: Create a feedback mechanism where the evaluator can learn from incorrect evaluations over time, potentially using reinforcement learning techniques to improve its accuracy.
7. Logging and Analysis
   - Issue: There is no logging of evaluation results for further analysis.
   - Improvement: Implement logging of evaluation results, including which phrases were matched or missed. This data can be used to analyze common failure points and improve the evaluation criteria.

Example of an Improved Evaluator
Here’s a conceptual example of how you might start implementing some of these improvements:
```python
from langsmith.evaluation import EvaluationResult, run_evaluator
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Named `embedding_model` so it does not overwrite the chat `model` used by the graph.
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

@run_evaluator
def improved_must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []

    # Nothing to check: return early to avoid dividing by zero below.
    if not required:
        return EvaluationResult(key="improved_must_mention", score=1.0)

    # Embed the full prediction and each required phrase
    prediction_embedding = embedding_model.encode(prediction)
    required_embeddings = embedding_model.encode(required)

    # One similarity score per required phrase
    scores = cosine_similarity([prediction_embedding], required_embeddings)[0]

    # Define a threshold for what constitutes a "match"
    threshold = 0.7
    # Fraction of required phrases that clear the threshold
    score = sum(1 for s in scores if s >= threshold) / len(required)

    return EvaluationResult(key="improved_must_mention", score=score)
```
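
A lighter alternative that needs no embedding model is to award partial credit with case-insensitive matching, in the spirit of the partial-credit idea above. `partial_credit` is a hypothetical helper shown on plain strings; it is a sketch of the scoring rule only and is not wired into LangSmith here.

```python
def partial_credit(prediction: str, required: list) -> float:
    """Fraction of required phrases found, ignoring letter case."""
    if not required:
        return 1.0  # nothing required, so nothing is missing
    text = prediction.lower()
    hits = sum(1 for phrase in required if phrase.lower() in text)
    return hits / len(required)

keywords = ["NF4", "NormalFloat"]
score_full = partial_credit("QLoRA introduces NF4 (4-bit NormalFloat).", keywords)
score_half = partial_credit("The paper introduces nf4.", keywords)
print(score_full, score_half)
```

This keeps the metric cheap and deterministic while removing two of the sharper edges of the exact-match version: case sensitivity and all-or-nothing scoring.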

### Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [57]:
experiment_results = client.evaluate(
    agent_chain,
    data=dataset_name,
    evaluators=[must_mention],
    experiment_prefix=f"RAG Pipeline - Evaluation - {uuid4().hex[0:4]}",
    metadata={"version": "1.0.0"},
)

View the evaluation results for experiment: 'RAG Pipeline - Evaluation - 31b5-aa08f0ba' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/1fcd373f-fbdb-4541-b106-e248e00487ef/compare?selectedSessions=952e096e-9313-4d4e-a928-383097d1e395





In [59]:
experiment_results

## Part 2: LangGraph with Helpfulness:

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've done evaluation - let's see if we can add an extra step where we review the content we've generated to confirm if it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again - we can check the number of messages in the state object, so we don't need additional state for this.

In [60]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]
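
Because every node appends to `messages`, the length of that list doubles as a step counter. The sketch below illustrates the idea in plain Python. `append_messages` is a simplified stand-in for the append behaviour of `add_messages` (it ignores any merging by message ID), and `SketchState`, `MAX_MESSAGES`, and `over_limit` are hypothetical names chosen for illustration.

```python
from typing import Annotated, TypedDict

MAX_MESSAGES = 10  # assumed cap, mirroring the limit used later in this notebook

def append_messages(existing: list, new: list) -> list:
    """Simplified stand-in reducer: return the old messages plus the new ones."""
    return existing + new

class SketchState(TypedDict):
    messages: Annotated[list, append_messages]

def over_limit(state: SketchState) -> bool:
    """True once the conversation has grown past the cap."""
    return len(state["messages"]) > MAX_MESSAGES

state: SketchState = {"messages": ["user question"]}
steps = 0
while not over_limit(state):
    # Each simulated node update adds one message to the shared state.
    state["messages"] = append_messages(state["messages"], [f"update {steps}"])
    steps += 1

print(steps, len(state["messages"]))
```

The loop stops on its own once the list outgrows the cap, which is the same exit condition the helpfulness router relies on.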

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### 🏗️ Activity #5:

Please write markdown for the following cells to explain what each is doing.

#### ✅ Answer:
The cell below creates a new `StateGraph` over `AgentState` and registers two nodes: `agent`, which calls the model, and `action`, which runs any requested tools.
The helpfulness check and the tool-call check are not part of this cell; they are wired in later through a conditional edge.

In [61]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x7fe7bb7b35c0>

#### ✅ Answer:
The cell below is setting the entry point of the graph to the agent node.

In [62]:
graph_with_helpfulness_check.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x7fe7bb7b35c0>

#### ✅ Answer:
The `tool_call_or_helpful` function decides where the graph goes after the agent responds. Here's a summary of its functionality:
1. Extracts the last message: it retrieves the last message from the conversation state.
2. Checks for tool calls: if the last message contains tool calls, it returns "action" so that the tools are run.
3. Captures the initial query and final response: it takes the first message as the user's query and the last message as the AI's response.
4. Enforces a cycle limit: if the state holds more than 10 messages, it signals that the graph should end, which prevents endless looping.
5. Builds a prompt: it constructs a prompt template that asks whether the final response is helpful, using the initial query and final response as context.
6. Evaluates helpfulness: it invokes a model (`ChatOpenAI` with `gpt-4`) to judge the final response based on that prompt.
7. Returns the route: if the model's reply contains a 'Y', it returns "end"; otherwise it returns "continue", which sends the flow back to the agent for another attempt.

Overall, the function determines whether the AI's response is useful and manages the flow of the conversation based on that evaluation.

In [63]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

  # Loop limit: the returned label must match the "end" key in the conditional-edge mapping
  if len(state["messages"]) > 10:
    return "end"

  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4")

  helpfulness_chain = prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"
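
The routing decision can be exercised without calling any model. The sketch below is a simplified stand-in rather than the notebook's function: `FakeMessage`, `route`, and the injected `judge` callable are hypothetical names, and the judge replaces the GPT-4 helpfulness chain so that each outcome is easy to see.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class FakeMessage:
    """Hypothetical minimal message: text plus any requested tool calls."""
    content: str
    tool_calls: list = field(default_factory=list)

def route(messages: list, judge: Callable[[str, str], str], max_messages: int = 10) -> str:
    """Return 'action', 'end', or 'continue' for the latest message."""
    last = messages[-1]
    if last.tool_calls:
        return "action"      # a tool was requested, so run it first
    if len(messages) > max_messages:
        return "end"         # loop limit reached
    verdict = judge(messages[0].content, last.content)
    return "end" if "Y" in verdict else "continue"

# Stub judges standing in for the helpfulness model.
def always_yes(query: str, answer: str) -> str:
    return "Y"

def always_no(query: str, answer: str) -> str:
    return "N"

question = FakeMessage("What is LoRA?")
tool_request = FakeMessage("", tool_calls=[{"name": "arxiv"}])
answer = FakeMessage("LoRA is a low-rank adaptation method.")

print(route([question, tool_request], always_yes))   # tool call takes priority
print(route([question, answer], always_yes))         # judged helpful
print(route([question, answer], always_no))          # judged unhelpful
print(route([question] + [answer] * 10, always_no))  # 11 messages exceeds the limit
```

Passing the judge in as a parameter keeps the branching logic testable on its own, separate from the cost and variability of a live model call.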

#### 🏗️ Activity #4:

Please write what is happening in our `tool_call_or_helpful` function!

#### ✅ Answer:
Here is what is happening in the `tool_call_or_helpful` function:
1. It retrieves the last message from the conversation state.
2. If the last message contains tool calls, it returns "action" so that the tools are run.
3. It captures the initial user query and the final response from the AI.
4. If the state holds more than 10 messages, it signals that the graph should end, which stops further processing.
5. It constructs a prompt template that asks whether the final response is helpful or not, using the initial query and final response as context.
6. It invokes a `ChatOpenAI` model on that prompt to judge the response.
7. If the model's reply contains a 'Y', it returns "end"; otherwise it returns "continue" so the agent can try again.


#### ✅ Markdown Answer:
The code adds conditional edges to the state graph `graph_with_helpfulness_check`. After the "agent" node runs, the graph calls `tool_call_or_helpful` to determine the next step based on the state of the conversation.
The mapping defines the following outcomes:
- If the function returns "continue", the flow loops back to the "agent" node.
- If it returns "action", the flow proceeds to the "action" node to perform a tool call.
- If it returns "end", the flow terminates at the `END` node, stopping further processing.

This setup lets the graph decide the next step dynamically, based on the helpfulness of the agent's response and whether a tool needs to be called.

In [64]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)

<langgraph.graph.state.StateGraph at 0x7fe7bb7b35c0>

#### ✅ Answer:
The cell below is adding an edge from the "action" node to the "agent" node. This allows the agent to continue the conversation after the tool call is made.

In [65]:
graph_with_helpfulness_check.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x7fe7bb7b35c0>
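
Taken together, the conditional edges and the "action" to "agent" edge form a cycle. The following is a toy dispatcher in plain Python, not LangGraph's implementation: `PATH_MAP`, `run_graph`, and the scripted list of router labels are hypothetical, and the `END` value here is a sentinel used only inside this sketch. It shows how a path map turns router labels into the next node.

```python
END = "__end__"  # sentinel used only within this sketch

# Router label -> next node, mirroring the shape of the mapping above.
PATH_MAP = {"continue": "agent", "action": "action", "end": END}

def run_graph(router_labels, max_steps=20):
    """Walk the cycle, reading a scripted router label each time 'agent' runs."""
    labels = iter(router_labels)
    visited = []
    node = "agent"  # entry point
    for _ in range(max_steps):
        visited.append(node)
        if node == "agent":
            label = next(labels)
            node = PATH_MAP[label]  # a label missing from the map raises KeyError
        else:
            node = "agent"          # fixed edge: action always returns to agent
        if node == END:
            break
    return visited

trace = run_graph(["action", "continue", "end"])
print(trace)
```

In this sketch the router's labels have to match the map's keys exactly, including case, which is worth keeping in mind when writing a routing function.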

#### ✅ Answer:
The cell below compiles the graph, including the helpfulness check and the tool-call check, into a runnable that we can invoke or stream.

In [66]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

#### ✅ Answer:
The cell below runs the compiled graph on a three-part question and prints the update from each node as it streams in.

In [67]:
inputs = {"messages" : [HumanMessage(content="Related to machine learning, what is LoRA? Also, who is Tim Dettmers? Also, what is Attention?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_jMpVKFycCp9r0L7vuZviO3ja', 'function': {'arguments': '{"query": "LoRA machine learning"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_41AEJdsnnr3tdtXUCDYB9xjg', 'function': {'arguments': '{"query": "Tim Dettmers"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_dI2r4N15DlzK8Y5k7SNNah5c', 'function': {'arguments': '{"query": "Attention mechanism machine learning"}', 'name': 'arxiv'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 72, 'prompt_tokens': 177, 'total_tokens': 249, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_50cad350e4', 'finish_reason': 'tool_calls', 'logprobs': None},

### Task 4: LangGraph for the "Patterns" of GenAI

Let's ask our system about the 4 patterns of Generative AI:

1. Prompt Engineering
2. RAG
3. Fine-tuning
4. Agents

In [69]:
patterns = ["prompt engineering", "RAG", "fine-tuning", "LLM-based agents"]

In [70]:
for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene??"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  print(messages["messages"][-1].content)
  print("\n\n")

**Prompt Engineering: Definition and Emergence**

**Definition:**
Prompt engineering is the process of designing inputs for generative artificial intelligence (AI) models to deliver useful, accurate, and relevant responses. It is primarily used with large language models (LLMs) like OpenAI's ChatGPT and Google Gemini. The goal is to optimize the output from these models by crafting well-engineered prompts, which can lead to improved results and efficiency. This involves creating prompts that guide AI models in generating accurate and relevant responses, minimizing bias, and enhancing performance across various applications such as customer support, content generation, and data analysis. Techniques in prompt engineering include few-shot learning, chain-of-thought reasoning, and iterative refinement.

**History and Emergence:**
Prompt engineering has evolved significantly alongside advancements in AI and natural language processing (NLP). Initially, prompts were simple, but as LLMs like 