# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In the following notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language to build coordinated multi-actor and stateful applications that includes cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Due to the inclusion of cycles over loops, we can also compose rather complex flows through our graph in a much more readable and natural fashion. Effectively allowing us to recreate application flowcharts in code in an almost 1-to-1 fashion.

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1:  Dependencies


## Task 2: Environment Variables

We'll want to set both our OpenAI API key and our LangSmith environment variables.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [2]:
os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY")

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE7 - LangGraph - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain-community/tree/main/libs/community) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Tavily Search Results](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/tavily_search/tool.py)
- [Arxiv](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/arxiv/tool.py)

#### 🏗️ Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be a method.

In [4]:
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tavily_tool = TavilySearchResults(max_results=5)

tool_belt = [
    tavily_tool,  # For general web search
    ArxivQueryRun(), # For academic paper search
]

  tavily_tool = TavilySearchResults(max_results=5)


### Model

Now we can set-up our model! We'll leverage the familiar OpenAI model suite for this example - but it's not *necessary* to use with LangGraph. LangGraph supports all models - though you might not find success with smaller models - as such, they recommend you stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API - we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API.

In [5]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4.1-nano", temperature=0)

Now that we have our model set-up, let's "put on the tool belt", which is to say: We'll bind our LangChain formatted tools to the model in an OpenAI function calling format.

In [6]:
model = model.bind_tools(tool_belt)

#### ❓ Question #1:

How does the model determine which tool to use?

#### ✅ Answer 

Calling `model = model.bind_tools(tool_belt)`, LangGraph (via LangChain/OpenAI’s function‐calling API) injects into the model’s context a list of “functions” (i.e. tools), each with:

1. **A name** (e.g. `"tavily_search"`)
2. **A description** of what it does
3. **A JSON schema** for its arguments

Then, when the model processes your user’s query, it “decides” whether it needs a tool by reasoning over that list of functions. In practice:

* **Semantic matching**: The model compares your request against each tool’s description.
* **Function‐calling tokens**: If it determines a tool is needed, it emits a special `{"name": "...", "arguments": { … }}` JSON payload instead of plain text.
* **Highest‐probability choice**: At temperature 0 (as we set), it deterministically picks the tool whose description best matches the intent.

Under the hood this all happens in the LLM’s next‐token prediction: function‐calling is just another set of allowable token sequences. Whichever tool’s “name” and argument‐structure tokens score highest given the prompt + tool definitions is the one that gets output—and thus invoked.


## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StatefulGraph` which uses an `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below - but this `AgentState` object is one that is stored in a `TypedDict` with the key `messages` and the value is a `Sequence` of `BaseMessages` that will be appended to whenever the state changes.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
  - `{"messages" : []}`
2. Our user submits a query to our application.
  - New State: `HumanMessage(#1)`
  - `{"messages" : [HumanMessage(#1)}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
  - New State: `AgentMessage(#1, additional_kwargs {"function_call" : "WebSearchTool"})`
  - `{"messages" : [HumanMessage(#1), AgentMessage(#1, ...)]}`
4. We pass our state object to a "conditional node" (more on this later) which reads the last state to determine if we need to use a tool - which it can determine properly because of our provided object!

In [7]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
import operator
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to each other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [8]:
from langgraph.prebuilt import ToolNode

def call_model(state):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt)

Now we have two total nodes. We have:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [9]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x78128b31fe00>

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [10]:
uncompiled_graph.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x78128b31fe00>

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks if there is a "function_call" kwarg present.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or the END (finish the graph).

It's important to highlight that the dictionary passed in as the third parameter (the mapping) should be created with the possible outputs of our conditional function in mind. In this case `should_continue` outputs either `"end"` or `"continue"` which are subsequently mapped to the action node or the END node.

In [11]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)

<langgraph.graph.state.StateGraph at 0x78128b31fe00>

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [12]:
uncompiled_graph.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x78128b31fe00>

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)

All that's left to do now is to compile our workflow - and we're off!

In [13]:
simple_agent_graph = uncompiled_graph.compile()

#### ❓ Question #2:

Is there any specific limit to how many times we can cycle?

If not, how could we impose a limit to the number of cycles?

#### ✅ Answer 

LangGraph does enforce a cycle (“recursion”) limit on your graph to prevent infinite loops. If we never explicitly set one, it will use its built-in default of 25 super-steps—once that’s hit without reaching your END node, you’ll get a `langgraph.errors.GraphRecursionError: Recursion limit of 25 reached without hitting a stop condition.`


### 🔍 Imposing a custom cycle limit

There are two common ways to override or tighten that limit:

1. **Overriding** the recursion limit via config
2. **Manual** in-state counter to enforce your own cycle cap


### 1) Overriding via `recursion_limit` in config

```python
# Same graph, but allow up to 100 steps instead of 25
result = graph.invoke(
    {"count": 0},
    {"recursion_limit": 100}   # override the default cap
)
print(result)
```

By passing a top-level `{"recursion_limit": N}` to `invoke`, you raise (or lower) the maximum number of super-steps before a `GraphRecursionError` is thrown.

---

### 2) Manual in-state counter for custom cycle limits

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

# Extend the state with our own loop budget
class ManualState(TypedDict):
    count: int
    remaining_loops: int

def start_node(state: ManualState) -> ManualState:
    """Initialize count and remaining_loops."""
    return {"count": 0, "remaining_loops": state["remaining_loops"]}

def inc_node(state: ManualState) -> ManualState:
    """Increment count and decrement remaining_loops."""
    state["count"] += 1
    state["remaining_loops"] -= 1
    return state

def decide_manual(state: ManualState):
    """Stop when our own budget runs out."""
    if state["remaining_loops"] <= 0:
        return END
    return "inc"

# Build a second graph that stops after N cycles
g2 = StateGraph(ManualState)
g2.add_node("start", start_node)
g2.add_node("inc", inc_node)
g2.add_node("decide", decide_manual)

g2.add_edge(START, "start")
g2.add_edge("start", "inc")
g2.add_edge("inc", "decide")
g2.add_edge("decide", "inc")
g2.add_edge("decide", END)

# Invoke with a budget of 5 loops
final_state = g2.invoke({"remaining_loops": 5})
print(final_state)
# -> {'count': 5, 'remaining_loops': 0}
```

This pattern embeds your own counter in the graph state, so you can precisely control and inspect how many cycles are allowed without relying on the global recursion limit.



## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fairs:

In [14]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="Who is the current captain of the Winnipeg Jets?")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_TOikggdVg6o1dU57pKFVPeNR', 'function': {'arguments': '{"query":"current captain of the Winnipeg Jets"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 23, 'prompt_tokens': 162, 'total_tokens': 185, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': None, 'id': 'chatcmpl-Bt0aw4SdyjEfHPGEuZBY1Y8p1E0Gd', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--5e2adfe5-a005-4a46-bb3d-084ae542ab21-0', tool_calls=[{'name': 'tavily_search_results_json', 'args': {'query': 'current captain of the Winnipeg Jets'}, 'id': 'call_TOikggdVg6o1dU57pKFVPeNR', 'type': 

Let's look at what happened:

1. Our state object was populated with our request
2. The state object was passed into our entry point (agent node) and the agent node added an `AIMessage` to the state object and passed it along the conditional edge
3. The conditional edge received the state object, found the "tool_calls" `additional_kwarg`, and sent the state object to the action node
4. The action node added the response from the OpenAI function calling endpoint to the state object and passed it along the edge to the agent node
5. The agent node added a response to the state object and passed it along the conditional edge
6. The conditional edge received the state object, could not find the "tool_calls" `additional_kwarg` and passed the state object to END where we see it output in the cell above!

Now let's look at an example that shows a multiple tool usage - all with the same flow!

In [15]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the QLoRA paper, then search each of the authors to find out their latest Tweet using Tavily!")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_AzdM0w2yrQJD9rP1mYLYBthr', 'function': {'arguments': '{"query": "QLoRA"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_izr4mQKv1iuJT9B4EplfbobB', 'function': {'arguments': '{"query": "latest Tweet of the author of QLoRA"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 58, 'prompt_tokens': 178, 'total_tokens': 236, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': None, 'id': 'chatcmpl-Bt0b57Uz54Aszxy8xqsDbLRs3fHcQ', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--e696485d-9544-405a-a162-f47233c016aa-0', tool_calls=[{'name': 'arxiv', 

#### 🏗️ Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

Here’s how the `simple_agent_graph` walked through Question #1 (“Who is the current captain of the Winnipeg Jets?”), step by step:

1. **Agent “think” step**

   * The graph’s entry‐point node (`"agent"`) takes in the user’s `HumanMessage`.
   * It consults the tool descriptions and decides that a search tool is needed.
   * It emits a function‐call to `TavilySearchResults` with the query `"current captain of the Winnipeg Jets"`.

2. **Action (tool) execution**

   * The next node (`"action"`) sees the function‐call payload, invokes the Tavily search API under the hood, and retrieves a JSON list of results (titles, snippets, URLs).
   * It wraps that response in a `ToolMessage`.

3. **Agent “answer” step**

   * Control returns to the `"agent"` node with the tool’s output now in its messages.
   * The LLM ingests those search results and composes the final natural‐language response:

     > “The current captain of the Winnipeg Jets is Adam Lowry.”



# 🤝 Breakout Room #2

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [16]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["question"])]}

def parse_output(input_state):
  return input_state["messages"][-1].content

agent_chain_with_formatting = convert_inputs | simple_agent_graph | parse_output

In [17]:
agent_chain_with_formatting.invoke({"question" : "What is RAG?"})

"RAG can refer to different concepts depending on the context. Could you please specify whether you're asking about RAG in the context of project management, machine learning, or another field?"

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]
```

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions.

In [33]:
#### ✅ Answer 

questions = [
    "What does LoRA stand for?",
    "What quantization level is used in QLoRA?",
    "Why are base model weights frozen in QLoRA?",
    "How does LoRA reduce the number of trainable parameters?",
    "Which library provides a LoRA implementation?"
]

answers = [
    {"must_mention" : ["paged", "optimizer"]},
    {"must_mention" : ["NF4", "NormalFloat"]},
    {"must_mention" : ["ground", "context"]},
    {"must_mention" : ["Tim", "Dettmers"]},
    {"must_mention" : ["PyTorch", "TensorFlow"]},
    {"must_mention" : ["reduce", "parameters"]},
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [19]:
from langsmith import Client

client = Client()

dataset_name = f"Retrieval Augmented Generation - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    outputs=answers,
    dataset_id=dataset.id,
)

{'example_ids': ['f6d2528a-e6f9-4616-9063-a6f373786edd',
  'cd2062ab-bd6e-45f9-b351-02bffe90ad12',
  'dc880607-d1d3-4e55-990d-fae56db7e07b',
  '336a93b1-423f-41ec-8629-f9b3bf292cd8',
  '95a5d1e4-94a7-4dc7-9343-8c9deb378a83',
  'f22542e7-0b1b-4086-bbfe-364aaec074ee'],
 'count': 6}

#### ❓ Question #3:

How are the correct answers associated with the questions?

> NOTE: Feel free to indicate if this is problematic or not

Think of the two lists — `questions` and `answers` — like two rows in a table, lined up one above the other:

| Question (row 1)                            | Answer-keywords (row 2)      |
| ------------------------------------------- | ---------------------------- |
| “What does LoRA stand for?”                 | `["Low-Rank", "Adaptation"]` |
| “What quantization level is used in QLoRA?” | `["4-bit", "quantization"]`  |
| …                                           | …                            |

* The **first** question goes with the **first** answer-keywords set.
* The **second** question goes with the **second** set.
* And so on.

Under the hood, when evaluating, the code does something like:

```python
for question, answer_spec in zip(questions, answers):
    # Here, `question` is the Nᵗʰ item from questions,
    # and `answer_spec` is the Nᵗʰ item from answers.
    check_model_output(question, must_mention=answer_spec["must_mention"])
```

That `zip()` function pairs them by **position**:

1. Pair item 0 of `questions` with item 0 of `answers`
2. Pair item 1 of `questions` with item 1 of `answers`
3. …etc.

So the “association” relies purely on having the same index in each list. If we ask the 4th question, the system looks at the 4th keywords list to know which words must appear in the answer.


### Task 2: Adding Evaluators

Now we can add a custom evaluator to see if our responses contain the expected information.

We'll be using a fairly naive exact-match process to determine if our response contains specific strings.

In [20]:
from langsmith.evaluation import EvaluationResult, run_evaluator

@run_evaluator
def must_mention(run, example) -> EvaluationResult:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return EvaluationResult(key="must_mention", score=score)

#### ❓ Question #4:

What are some ways you could improve this metric as-is?

> NOTE: Alternatively you can suggest where gaps exist in this method.

#### ✅ Answer

1. **Allow for synonyms and paraphrases**

   * *Why?* Right now the model only passes if it literally contains each string in `must_mention`.  If it writes “low-rank adaptation” as “low rank adaptation” (no hyphen), or says “parameter reduction” instead of “reduce parameters,” it’ll fail even though it got the idea.
   * *How?* Before matching, normalize both output and keywords (lowercase, strip punctuation), and expand your keyword list with common synonyms or use a small lexical resource (e.g. WordNet) to accept equivalent terms.

2. **Stemming or lemmatization**

   * *Why?* Plurals and tense changes (“weights” vs. “weight”, “frozen” vs. “freeze”) can trip up exact matches.
   * *How?* Run both text and keywords through a stemmer (e.g. Porter) or lemmatizer so “freeze” and “frozen” share a root.

3. **Partial‐credit scoring**

   * *Why?* Right now it’s all or nothing: if one keyword is missing, the whole question is “wrong.” That hides gradations in model performance.
   * *How?* Instead of requiring 100% of keywords, award 1 point per keyword found. You can report percent-correct or set a threshold (e.g. ≥ 75% of keywords).

4. **Semantic similarity metrics**

   * *Why?* Even with synonyms, some correct answers might use completely different phrasing.
   * *How?* Compute an embedding of the model’s answer and compare to an embedding of an ideal reference answer (or reference keywords) using cosine similarity. You can set a cutoff (e.g. ≥ 0.7 cosine) as “correct.”

5. **Use multiple reference answers**

   * *Why?* There’s often more than one way to express a fact.
   * *How?* For each question, collect a small list of human‐written “gold” answers. Then evaluate against all of them, passing if the model’s answer is close enough to any one of them (via BLEU/ROUGE/BERTScore or keyword checks).

6. **Incorporate negative checks**

   * *Why?* A model might accidentally mention a keyword in the wrong context (e.g. “LoRA does *not* stand for low-rank adaptation”).
   * *How?* Define a small set of “false friend” phrases to flag—if they appear, mark the answer as wrong.

---


Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [21]:
experiment_results = client.evaluate(
    agent_chain_with_formatting,
    data=dataset_name,
    evaluators=[must_mention],
    experiment_prefix=f"Search Pipeline - Evaluation - {uuid4().hex[0:4]}",
    metadata={"version": "1.0.0"},
)

View the evaluation results for experiment: 'Search Pipeline - Evaluation - 8c82-4bd3a61e' at:
https://smith.langchain.com/o/6aa39193-2c1d-4f43-b4b1-8faaa8260038/datasets/37be79b0-d0d4-40a4-b91a-9b79a9cd0ba5/compare?selectedSessions=45e93d14-f2af-41be-aefb-20ef6acffed6




0it [00:00, ?it/s]

In [22]:
experiment_results

## Part 2: LangGraph with Helpfulness:

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've done evaluation - let's see if we can add an extra step where we review the content we've generated to confirm if it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again - we can check the length of the state object, so we don't need additional state for this.

In [23]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### 🏗️ Activity #5:

Please write markdown for the following cells to explain what each is doing.

# Setting up the graph nodes

### We create a new StateGraph called graph_with_helpfulness_check based on our AgentState schema. Then we register two core processing nodes:

    - agent: Takes user input and dispatches to the language model via call_model.

    - action: Executes any required tool calls using the tool_node function.

In [24]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x78126056b250>

# Defining the entry point

 - Here we tell LangGraph where to start execution by setting the entry point to "agent". When you invoke the graph, it will begin processing at that node.

In [25]:
graph_with_helpfulness_check.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x78126056b250>

# Declaring the conversation router

This cell defines the tool_call_or_helpful(state) function, which decides where to go next in the graph based on the current chat state:

    - Tool calls: If the most recent message includes a tool invocation, route to action.

    - Length check: If the conversation exceeds 10 messages, return "END" to stop the loop.

    Helpfulness check:

        - Builds a prompt comparing the initial question and the final response.

        - Runs a lightweight gpt-4.1-mini chain to ask “Was my answer helpful? (Y/N)”.

        - If the model replies with “Y”, routes to "end"; otherwise, routes to "continue" for another cycle.

This custom router lets us automatically loop between agent, tool, and evaluation until the response is deemed helpful.

In [26]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

  if len(state["messages"]) > 10:
    return "END"

  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  helpfullness_prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4.1-mini")

  helpfulness_chain = helpfullness_prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"

#### 🏗️ Activity #4:

Please write what is happening in our `tool_call_or_helpful` function!

# Router decision function (tool_call_or_helpful)

This cell defines the tool_call_or_helpful(state) function, which examines the current conversation state and decides the next node to execute:

Tool-call detection: If the most recent AI message contains a tool_calls entry, it returns "action" so the graph will invoke the tool.

Loop guard: If more than 10 messages have accumulated, it returns END to stop further cycling.

Helpfulness check: Otherwise, it builds a small prompt comparing the initial user query and the latest AI response, runs a lightweight OpenAI chain (gpt-4.1-mini) to ask “Was my answer helpful? (Y/N)”, and:

    Returns "end" if the model answers “Y”

    Returns "continue" if it answers “N”

Adding conditional transitions at the "agent" node

The call graph_with_helpfulness_check wires your router into the graph. It tells LangGraph: “After the agent node runs, call tool_call_or_helpful(state). If it returns:

    "continue" → loop back to the agent node

    "action" → jump to the action node

    "end" → terminate execution”

In [35]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)

Adding an edge to a graph that has already been compiled. This will not be reflected in the compiled graph.


ValueError: Branch with name `tool_call_or_helpful` already exists for node `agent`

Defining the static transition from "action" back to "agent"

The line `graph_with_helpfulness_check.add_edge("action", "agent")`

creates a fixed edge so that once any tool call is executed (action node), control always flows back into the agent node. This ensures the agent can process tool results and continue the conversation loop.

In [28]:
graph_with_helpfulness_check.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x78126056b250>

Compiling the graph into an executable agent

Finally `agent_with_helpfulness_check = graph_with_helpfulness_check.compile()`
takes your configured nodes and edges (including the helpfulness router) and produces a runnable AsyncApp. You can now call agent_with_helpfulness_check.invoke(...) or stream with astream() to run your agent with built-in looping and answer-quality checks.

In [29]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

This snippet first packages your single user prompt—wrapped as a HumanMessage asking about LoRA, Tim Dettmers, and Attention—into the inputs dict, then kicks off the agent in “stream updates” mode via agent_with_helpfulness_check.astream(...). As the graph executes, it yields one “chunk” per node invocation: each chunk is a map from node name (e.g. "agent" or "action") to its current state. The inner loop simply prints out which node just ran (Receiving update from node: 'node') and the list of messages in the conversation up to that point, letting you watch the agent’s decision-and-tool-call process in real time.

In [30]:
inputs = {"messages" : [HumanMessage(content="Related to machine learning, what is LoRA? Also, who is Tim Dettmers? Also, what is Attention?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_ujbWP4Gf7sR1PBxTWMkAaPcJ', 'function': {'arguments': '{"query": "LoRA machine learning"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_IDwQM8MiTyOJkMtT9V2kfJxQ', 'function': {'arguments': '{"query": "Tim Dettmers"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'id': 'call_XAgNLULZyIS5jxNBBJUnCdLm', 'function': {'arguments': '{"query": "Attention in machine learning"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 79, 'prompt_tokens': 177, 'total_tokens': 256, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': None, 'id': 'chatcmpl-Bt0

### Task 4: LangGraph for the "Patterns" of GenAI

Let's ask our system about the 4 patterns of Generative AI:

1. Prompt Engineering
2. RAG
3. Fine-tuning
4. Agents

In [31]:
patterns = ["prompt engineering", "RAG", "fine-tuning", "LLM-based agents"]

In [32]:
for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene??"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  print(messages["messages"][-1].content)
  print("\n\n")

Prompt engineering is the process of designing and refining input prompts to effectively communicate with AI language models, such as GPT, to elicit accurate, relevant, and useful responses. It involves crafting prompts that guide the model's output in a desired direction, often through techniques like specifying context, framing questions clearly, or using specific keywords.

Prompt engineering has become increasingly prominent with the rise of large language models (LLMs) around 2020 and 2021, as users and developers sought ways to optimize interactions with these powerful tools. The concept gained widespread attention as organizations and individuals recognized that the quality of prompts significantly impacts the usefulness of the generated outputs.

In summary:
- **What it is:** The art and science of designing prompts to effectively interact with AI models.
- **When it broke onto the scene:** It gained significant prominence around 2020-2021, coinciding with the deployment and po