# LangGraph and LangSmith - Agentic RAG Powered by LangChain

In the following notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating our Tool Belt
  4. Creating Our State
  5. Creating and Compiling A Graph!

- 🤝 Breakout Room #2:
  1. Evaluating the LangGraph Application with LangSmith
  2. Adding Helpfulness Check and "Loop" Limits
  3. LangGraph for the "Patterns" of GenAI

# 🤝 Breakout Room #1

## Part 1: LangGraph - Building Cyclic Applications with LangChain

LangGraph is a tool that leverages LangChain Expression Language to build coordinated multi-actor and stateful applications that include cyclic behaviour.

### Why Cycles?

In essence, we can think of a cycle in our graph as a more robust and customizable loop. It allows us to keep our application agent-forward while still giving the powerful functionality of traditional loops.

Due to the inclusion of cycles over loops, we can also compose rather complex flows through our graph in a much more readable and natural fashion, effectively allowing us to recreate application flowcharts in code in an almost 1-to-1 fashion.

### Why LangGraph?

Beyond the agent-forward approach - we can easily compose and combine traditional "DAG" (directed acyclic graph) chains with powerful cyclic behaviour due to the tight integration with LCEL. This means it's a natural extension to LangChain's core offerings!

## Task 1:  Dependencies


## Task 2: Environment Variables

We'll want to set our OpenAI, Tavily, and LangSmith API keys along with our LangSmith environment variables.

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [3]:
os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY")

In [4]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE8 - LangGraph - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

## Task 3: Creating our Tool Belt

As is usually the case, we'll want to equip our agent with a toolbelt to help answer questions and add external knowledge.

There's a tonne of tools in the [LangChain Community Repo](https://github.com/langchain-ai/langchain-community/tree/main/libs/community) but we'll stick to a couple just so we can observe the cyclic nature of LangGraph in action!

We'll leverage:

- [Tavily Search Results](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/tavily_search/tool.py)
- [Arxiv](https://github.com/langchain-ai/langchain-community/blob/main/libs/community/langchain_community/tools/arxiv/tool.py)

#### 🏗️ Activity #1:

Please add the tools to use into our toolbelt.

> NOTE: Each tool in our toolbelt should be a method.

In [5]:
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.tools.arxiv.tool import ArxivQueryRun

tavily_tool = TavilySearchResults(max_results=5)

tool_belt = [
    tavily_tool,
    ArxivQueryRun(),
]


### Model

Now we can set up our model! We'll leverage the familiar OpenAI model suite for this example - but OpenAI models are not *required* to use LangGraph. LangGraph supports all models - though you might not find success with smaller models - so it's recommended that you stick with:

- OpenAI's GPT-3.5 and GPT-4
- Anthropic's Claude
- Google's Gemini

> NOTE: Because we're leveraging the OpenAI function calling API - we'll need to use OpenAI *for this specific example* (or any other service that exposes an OpenAI-style function calling API).

In [6]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4.1-nano", temperature=0)

Now that we have our model set-up, let's "put on the tool belt", which is to say: We'll bind our LangChain formatted tools to the model in an OpenAI function calling format.

In [7]:
model = model.bind_tools(tool_belt)

#### ❓ Question #1:

How does the model determine which tool to use?

#### Answer to Question #1:

**How does the model determine which tool to use?**

The model determines which tool to use through OpenAI's function calling mechanism. Here's how it works:

1. **Tool Binding**: When we call `model.bind_tools(tool_belt)`, the model receives metadata about each available tool, including:
   - Tool names (e.g., "tavily_search_results_json", "arxiv")
   - Tool descriptions (taken from each tool's `description`, which for `@tool`-decorated functions comes from the docstring)
   - Parameter schemas (what inputs each tool expects)

2. **Context Analysis**: When the model receives a user query, it analyzes the content and context to understand what type of information or action is needed.

3. **Tool Selection Logic**: The model uses its training to match the user's intent with the most appropriate tool(s):
   - For web search queries → Tavily Search Results
   - For academic paper searches → ArXiv tool
   - For multiple information needs → Multiple tools simultaneously

4. **Function Call Generation**: The model generates structured function calls in its response, including:
   - The tool name to invoke
   - Appropriate parameters for that tool
   - Multiple tools can be called in parallel if needed

5. **Decision Factors**: The model considers:
   - Query keywords and intent
   - Tool capabilities and descriptions  
   - Context from previous messages in the conversation
   - The specific information needed to answer the question

This intelligent tool selection allows the agent to automatically choose the right tools for each situation without explicit routing logic.
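
To make the steps above concrete, here is a minimal, library-free sketch of the idea. The schema layout, the `fake_search` and `fake_arxiv` functions, and the `dispatch` helper are hypothetical stand-ins, not what `bind_tools` actually sends to OpenAI. The point is only that the model is shown names, descriptions, and parameter schemas, and replies with a structured call that our code looks up by name.

```python
import json

# Hypothetical stand-ins for real tools; they only echo their input.
def fake_search(query: str) -> str:
    return f"web results for: {query}"

def fake_arxiv(query: str) -> str:
    return f"arxiv results for: {query}"

# Metadata of the kind a model is shown: name, description, parameters.
TOOL_SPECS = [
    {"name": "fake_search", "description": "Search the web.",
     "parameters": {"query": "string"}},
    {"name": "fake_arxiv", "description": "Search arXiv papers.",
     "parameters": {"query": "string"}},
]

# Lookup table used to execute whichever tool the model names.
TOOLS_BY_NAME = {"fake_search": fake_search, "fake_arxiv": fake_arxiv}

def dispatch(tool_call: dict) -> str:
    """Run one structured call shaped like {'name': ..., 'arguments': '<json>'}."""
    func = TOOLS_BY_NAME[tool_call["name"]]
    kwargs = json.loads(tool_call["arguments"])
    return func(**kwargs)

# A call shaped like the one a model might emit for an academic question.
example_call = {"name": "fake_arxiv", "arguments": '{"query": "deep research"}'}
result = dispatch(example_call)
print(result)
```

The model never runs anything itself; it only names a tool and supplies arguments, and the application performs the lookup and the call.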


## Task 4: Putting the State in Stateful

Earlier we used this phrasing:

`coordinated multi-actor and stateful applications`

So what does that "stateful" mean?

To put it simply - we want to have some kind of object which we can pass around our application that holds information about what the current situation (state) is. Since our system will be constructed of many parts moving in a coordinated fashion - we want to be able to ensure we have some commonly understood idea of that state.

LangGraph leverages a `StateGraph` which uses an `AgentState` object to pass information between the various nodes of the graph.

There are more options than what we'll see below - but this `AgentState` object is one that is stored in a `TypedDict` with the key `messages` and the value is a `Sequence` of `BaseMessages` that will be appended to whenever the state changes.

Let's think about a simple example to help understand exactly what this means (we'll simplify a great deal to try and clearly communicate what state is doing):

1. We initialize our state object:
  - `{"messages" : []}`
2. Our user submits a query to our application.
  - New State: `HumanMessage(#1)`
  - `{"messages" : [HumanMessage(#1)]}`
3. We pass our state object to an Agent node which is able to read the current state. It will use the last `HumanMessage` as input. It gets some kind of output which it will add to the state.
  - New State: `AgentMessage(#1, additional_kwargs {"function_call" : "WebSearchTool"})`
  - `{"messages" : [HumanMessage(#1), AgentMessage(#1, ...)]}`
4. We pass our state object to a "conditional node" (more on this later) which reads the last state to determine if we need to use a tool - which it can determine properly because of our provided object!
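
The walkthrough above can be sketched in plain Python. This is a simplified illustration under stated assumptions: `append_messages` and `apply_update` are hypothetical helpers, messages are plain strings, and the real `add_messages` reducer does more than plain concatenation.

```python
def append_messages(existing: list, new: list) -> list:
    """Reducer: return a new list with the new messages appended."""
    return existing + new

def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state using the reducer."""
    merged = dict(state)
    merged["messages"] = append_messages(state["messages"], update["messages"])
    return merged

# Step 1: empty state. Step 2: the user's query. Step 3: the agent's reply.
state = {"messages": []}
state = apply_update(state, {"messages": ["HumanMessage(#1)"]})
state = apply_update(state, {"messages": ["AgentMessage(#1)"]})
print(state)
```

Each node returns only the new messages; the reducer decides how they are combined with what is already in the state.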

In [8]:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
  # add_messages is the reducer: node updates are appended to this list instead of replacing it
  messages: Annotated[list[BaseMessage], add_messages]

## Task 5: It's Graphing Time!

Now that we have state, and we have tools, and we have an LLM - we can finally start making our graph!

Let's take a second to refresh ourselves about what a graph is in this context.

Graphs, also called networks in some circles, are a collection of connected objects.

The objects in question are typically called nodes, or vertices, and the connections are called edges.

Let's look at a simple graph.

![image](https://i.imgur.com/2NFLnIc.png)

Here, we're using the coloured circles to represent the nodes and the yellow lines to represent the edges. In this case, we're looking at a fully connected graph - where each node is connected by an edge to each other node.

If we were to think about nodes in the context of LangGraph - we would think of a function, or an LCEL runnable.

If we were to think about edges in the context of LangGraph - we might think of them as "paths to take" or "where to pass our state object next".
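
As a rough illustration of "nodes are functions, edges say where the state goes next", here is a tiny library-free sketch. The `NODES` and `EDGES` dictionaries and the `run` helper are hypothetical and are not how LangGraph is implemented; they only mirror the idea.

```python
# Each node is a function: it reads the state and returns a partial update.
def greet(state: dict) -> dict:
    return {"log": state["log"] + ["greet"]}

def farewell(state: dict) -> dict:
    return {"log": state["log"] + ["farewell"]}

NODES = {"greet": greet, "farewell": farewell}
# Each edge says where the state goes next; None marks the end of the graph.
EDGES = {"greet": "farewell", "farewell": None}

def run(entry: str, state: dict) -> dict:
    """Walk the graph from the entry node, merging each node's update."""
    current = entry
    while current is not None:
        state = {**state, **NODES[current](state)}
        current = EDGES[current]
    return state

final_state = run("greet", {"log": []})
print(final_state)
```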

Let's create some nodes and expand on our diagram.

> NOTE: Due to the tight integration with LCEL - we can comfortably create our nodes in an async fashion!

In [9]:
from langgraph.prebuilt import ToolNode

def call_model(state):
  messages = state["messages"]
  response = model.invoke(messages)
  return {"messages" : [response]}

tool_node = ToolNode(tool_belt)

Now we have two total nodes. We have:

- `call_model` is a node that will...well...call the model
- `tool_node` is a node which can call a tool

Let's start adding nodes! We'll update our diagram along the way to keep track of what this looks like!


In [10]:
from langgraph.graph import StateGraph, END

uncompiled_graph = StateGraph(AgentState)

uncompiled_graph.add_node("agent", call_model)
uncompiled_graph.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x12217fcb0>

Let's look at what we have so far:

![image](https://i.imgur.com/md7inqG.png)

Next, we'll add our entrypoint. All our entrypoint does is indicate which node is called first.

In [11]:
uncompiled_graph.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x12217fcb0>

![image](https://i.imgur.com/wNixpJe.png)

Now we want to build a "conditional edge" which will use the output state of a node to determine which path to follow.

We can help conceptualize this by thinking of our conditional edge as a conditional in a flowchart!

Notice how our function simply checks whether the last message contains any `tool_calls`.

Then we create an edge where the origin node is our agent node and our destination node is *either* the action node or END (finish the graph).

It's important to highlight that `add_conditional_edges` also accepts an optional mapping as a third parameter, which should be created with the possible outputs of our conditional function in mind. In this case `should_continue` returns either `"action"` (the name of our action node) or `END` directly, so no mapping is needed.

In [12]:
def should_continue(state):
  last_message = state["messages"][-1]

  if last_message.tool_calls:
    return "action"

  return END

uncompiled_graph.add_conditional_edges(
    "agent",
    should_continue
)

<langgraph.graph.state.StateGraph at 0x12217fcb0>

Let's visualize what this looks like.

![image](https://i.imgur.com/8ZNwKI5.png)

Finally, we can add our last edge which will connect our action node to our agent node. This is because we *always* want our action node (which is used to call our tools) to return its output to our agent!

In [13]:
uncompiled_graph.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x12217fcb0>

Let's look at the final visualization.

![image](https://i.imgur.com/NWO7usO.png)
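
To see the cycle in isolation, here is a toy version of the same agent/action loop with no model and no real tools. Everything in it is a hypothetical stand-in: `toy_agent` pretends to be the model, `toy_action` pretends to be the tool node, messages are plain dictionaries, and the `END` string is a local sentinel rather than LangGraph's constant.

```python
END = "__end__"  # hypothetical sentinel for this sketch

def toy_agent(state: dict) -> dict:
    """Pretend model: ask for a tool once, then answer using the tool result."""
    if any(m["role"] == "tool" for m in state["messages"]):
        return {"messages": [{"role": "ai", "content": "final answer", "tool_calls": []}]}
    return {"messages": [{"role": "ai", "content": "", "tool_calls": ["lookup"]}]}

def toy_action(state: dict) -> dict:
    """Pretend tool node: 'run' each requested tool and report back."""
    calls = state["messages"][-1]["tool_calls"]
    return {"messages": [{"role": "tool", "content": f"result of {c}"} for c in calls]}

def route(state: dict) -> str:
    """Conditional edge: go to the tool node only if tools were requested."""
    return "action" if state["messages"][-1]["tool_calls"] else END

def run(state: dict) -> list:
    """Execute the loop and return the sequence of visited nodes."""
    visited, current = [], "agent"
    while current != END:
        visited.append(current)
        node = toy_agent if current == "agent" else toy_action
        state["messages"] = state["messages"] + node(state)["messages"]
        # "action" always returns to "agent"; "agent" uses the conditional edge.
        current = "agent" if current == "action" else route(state)
    return visited

path = run({"messages": [{"role": "human", "content": "question"}]})
print(path)
```

The loop ends only when the agent produces a message with no tool calls, which is the behaviour the conditional edge in our graph encodes.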

All that's left to do now is to compile our workflow - and we're off!

In [14]:
simple_agent_graph = uncompiled_graph.compile()

#### ❓ Question #2:

Is there any specific limit to how many times we can cycle?

If not, how could we impose a limit to the number of cycles?

#### Answer to Question #2:

**Is there any specific limit to how many times we can cycle?**

By default, LangGraph stops a run once it reaches its recursion limit (25 steps unless configured otherwise) and raises a `GraphRecursionError`. Within that limit, the graph keeps cycling for as long as the conditional edges direct it to, so a poorly controlled agent can make many model and tool calls before it stops.

**How could we impose a limit to the number of cycles?**

There are several ways to impose cycle limits:

1. **State-Based Counting**: Add a counter to your `AgentState`:
   ```python
   class AgentState(TypedDict):
       messages: Annotated[list, add_messages]
       cycle_count: int  # Add this field
   ```

2. **Message Length Check** (as shown later in the helpfulness example):
   ```python
   if len(state["messages"]) > 10:  # Limit based on message count
       return END  # the END constant, or a key from your edge mapping
   ```

3. **Recursion Limit at Run Time**: pass `recursion_limit` in the config when invoking the compiled graph:
   ```python
   graph = workflow.compile()
   graph.invoke(inputs, config={"recursion_limit": 5})
   ```

4. **Time-Based Limits**: Track execution time and stop after a certain duration.

5. **Custom Logic in Conditional Edges**: Check iteration count in your conditional functions:
   ```python
   def should_continue(state):
       if state.get("cycle_count", 0) > MAX_CYCLES:
           return END
       # ... other logic
   ```

The most common approach is setting `recursion_limit` in the run config or tracking cycles through state management.
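
As a minimal, library-free sketch of the state-based counting idea: the node below never finishes on its own, so the counter in the state is the only thing that stops the loop. `MAX_CYCLES`, `looping_node`, and the local `END` sentinel are hypothetical names chosen for this illustration.

```python
END = "__end__"   # hypothetical sentinel for this sketch
MAX_CYCLES = 3    # hypothetical limit

def looping_node(state: dict) -> dict:
    """A node that never finishes on its own; it only bumps the counter."""
    return {"cycle_count": state["cycle_count"] + 1}

def should_continue(state: dict) -> str:
    """Stop once the counter reaches the limit, otherwise loop again."""
    if state["cycle_count"] >= MAX_CYCLES:
        return END
    return "loop"

state = {"cycle_count": 0}
while True:
    state = {**state, **looping_node(state)}
    if should_continue(state) == END:
        break

print(state["cycle_count"])
```

Keeping the counter in the state means the limit travels with the run and can be inspected in traces, at the cost of one extra field every node has to respect.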


## Using Our Graph

Now that we've created and compiled our graph - we can call it *just as we'd call any other* `Runnable`!

Let's try out a few examples to see how it fares:

In [35]:
from langchain_core.messages import HumanMessage

inputs = {"messages" : [HumanMessage(content="How are technical professionals using AI to improve their work?")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='Technical professionals are using AI in various ways to enhance their work, including automating repetitive tasks, improving decision-making, analyzing large datasets, developing new products and services, and optimizing processes. They leverage AI for tasks such as machine learning model development, natural language processing, computer vision, predictive analytics, and automation of workflows. This integration helps increase efficiency, accuracy, and innovation across different industries. Would you like specific examples or insights into particular fields?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 90, 'prompt_tokens': 163, 'total_tokens': 253, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2

Let's look at what happens in general when the model decides to use a tool:

1. Our state object is populated with our request
2. The state object is passed into our entry point (agent node); the agent node adds an `AIMessage` to the state object and passes it along the conditional edge
3. The conditional edge receives the state object, finds `tool_calls` on the last message, and sends the state object to the action node
4. The action node runs the requested tool(s), adds the results to the state object as `ToolMessage`s, and passes it along the edge to the agent node
5. The agent node adds a response to the state object and passes it along the conditional edge
6. The conditional edge receives the state object, finds no `tool_calls` on the last message, and passes the state object to END

In the run shown above, the first `AIMessage` already contains an answer and no tool calls, so the conditional edge routes straight from step 2 to END.

Now let's look at an example that shows multiple tool usage - all with the same flow!

In [36]:
inputs = {"messages" : [HumanMessage(content="Search Arxiv for the A Comprehensive Survey of Deep Research paper, then search each of the authors to find out where they work now using Tavily!")]}

async for chunk in simple_agent_graph.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        if node == "action":
          print(f"Tool Used: {values['messages'][0].name}")
        print(values["messages"])

        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'call_xMSYEYl0Ls0na8pafnvFBCfR', 'function': {'arguments': '{"query": "A Comprehensive Survey of Deep Research"}', 'name': 'arxiv'}, 'type': 'function'}, {'id': 'call_Bt29P2Nzwbv69yAu2LzxpGSH', 'function': {'arguments': '{"query": "author of A Comprehensive Survey of Deep Research"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 60, 'prompt_tokens': 182, 'total_tokens': 242, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_7c233bf9d1', 'id': 'chatcmpl-CK3S41xj9myomf1bPIlSiyBGZQZ6x', 'service_tier': 'default', 'finish_reason': 'tool_calls', 'logprobs': None}, id='run--43f40553-b90

#### 🏗️ Activity #2:

Please write out the steps the agent took to arrive at the correct answer.

#### Answer to Activity #2:

**Steps the agent took to find information about the "A Comprehensive Survey of Deep Research" paper and authors:**

1. **Initial Query Processing** (Agent Node):
   - Received user request to search ArXiv for the paper and find author information
   - Agent decided it needed both ArXiv search and web search capabilities
   - Generated two simultaneous tool calls:
     - `arxiv` tool with query "A Comprehensive Survey of Deep Research"
     - `tavily_search_results_json` tool with query "author of A Comprehensive Survey of Deep Research"

2. **First Tool Execution** (Action Node):
   - **ArXiv Search**: Found the paper "A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications" by Renjun Xu and Jingwen Peng, published June 14, 2025
   - **Tavily Search**: Found additional context about the paper and authors from various sources including arXiv links and related publications

3. **Information Synthesis** (Agent Node):
   - Processed results from both tools
   - Identified the two main authors: Renjun Xu and Jingwen Peng
   - Determined need for additional information about current affiliations
   - Generated two more tool calls to search for each author individually

4. **Second Tool Execution** (Action Node):
   - **Search for Renjun Xu**: Found he is a Principal Researcher at Zhejiang University
   - **Search for Jingwen Peng**: Found multiple Jingwen Pengs, but identified the relevant one as Director/Lead Data Steward at Liberty Mutual Investments in Boston

5. **Final Response** (Agent Node):
   - Synthesized all gathered information
   - Provided comprehensive answer including:
     - Paper title and publication date
     - Author names
     - Current affiliations of both authors
   - No additional tool calls needed, so conditional edge directed to END

**Key Features Demonstrated:**
- **Parallel Tool Execution**: Agent called multiple tools simultaneously in steps 1 and 3
- **Iterative Information Gathering**: Agent recognized when initial results were insufficient
- **Context Awareness**: Agent maintained context across multiple cycles to build comprehensive response


# 🤝 Breakout Room #2

## Part 1: LangSmith Evaluator

### Pre-processing for LangSmith

To do a little bit more preprocessing, let's wrap our LangGraph agent in a simple chain.

In [17]:
def convert_inputs(input_object):
  return {"messages" : [HumanMessage(content=input_object["text"])]}

def parse_output(input_state):
  return {"answer" : input_state["messages"][-1].content}

agent_chain_with_formatting = convert_inputs | simple_agent_graph | parse_output

agent_chain_with_formatting.invoke({"text" : "What is Deep Research?"})

{'answer': 'Deep Research typically refers to an in-depth and comprehensive investigation or analysis into a specific topic, subject, or field. It involves gathering detailed information, examining various sources, and analyzing data thoroughly to gain a profound understanding of the subject. Deep Research is often used in academic, scientific, technological, and business contexts to develop insights, inform decision-making, or advance knowledge.\n\nIf you are referring to a specific organization, product, or platform named "Deep Research," please provide more context so I can give a more precise answer.'}
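
Conceptually, the `|` chain above is left-to-right function composition: each step's output becomes the next step's input. Here is a plain-Python sketch of that idea. The `pipe` helper and `fake_graph` are hypothetical stand-ins (the real chain uses LCEL and the compiled agent), and messages are plain strings.

```python
from functools import reduce

def pipe(*steps):
    """Compose callables left to right, like `a | b | c`."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

def convert_inputs(input_object: dict) -> dict:
    return {"messages": [input_object["text"]]}

def fake_graph(state: dict) -> dict:
    # Hypothetical stand-in for the compiled agent: appends a canned reply.
    return {"messages": state["messages"] + ["canned reply"]}

def parse_output(state: dict) -> dict:
    return {"answer": state["messages"][-1]}

chain = pipe(convert_inputs, fake_graph, parse_output)
output = chain({"text": "What is Deep Research?"})
print(output)
```

Wrapping the agent this way gives the evaluator a simple `{"text": ...}` in and `{"answer": ...}` out contract, independent of the graph's internal state shape.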

### Task 1: Creating An Evaluation Dataset

Just as we saw last week, we'll want to create a dataset to test our Agent's ability to answer questions.

In order to do this - we'll want to provide some questions and some answers. Let's look at how we can create such a dataset below.

```python
questions = [
    {
        "inputs" : {"text" : "Who were the main authors on the 'A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications' paper?"},
        "outputs" : {"must_mention" : ["Peng", "Xu"]}   
    },
    ...,
    {
        "inputs" : {"text" : "Where do the authors of the 'A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications' work now?"},
        "outputs" : {"must_mention" : ["Zhejiang", "Liberty Mutual"]}
    }
]
```
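
Before uploading a dataset in this format, it can help to check that every example has the expected keys. The helper below is a hypothetical convenience for this notebook's format, not part of LangSmith; it only verifies that `inputs.text` is a non-empty string and `outputs.must_mention` is a non-empty list.

```python
def validate_examples(examples: list) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for i, example in enumerate(examples):
        text = example.get("inputs", {}).get("text")
        mentions = example.get("outputs", {}).get("must_mention")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"example {i}: missing inputs.text")
        if not isinstance(mentions, list) or not mentions:
            problems.append(f"example {i}: missing outputs.must_mention")
    return problems

sample = [
    {"inputs": {"text": "Who wrote the paper?"}, "outputs": {"must_mention": ["Peng", "Xu"]}},
    {"inputs": {"text": ""}, "outputs": {}},  # deliberately malformed
]
issues = validate_examples(sample)
print(issues)
```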

#### 🏗️ Activity #3:

Please create a dataset in the above format with at least 5 questions that pertain to the cohort use-case (more information [here](https://www.notion.so/Session-4-RAG-with-LangGraph-OSS-Local-Models-Eval-w-LangSmith-26acd547af3d80838d5beba464d7e701#26acd547af3d81d08809c9c82a462bdd)), or the use-case you're hoping to tackle in your Demo Day project.

In [18]:
questions = [
    {
        "inputs": {"text": "Who were the main authors on the 'A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications' paper?"},
        "outputs": {"must_mention": ["Renjun Xu", "Jingwen Peng"]}
    },
    {
        "inputs": {"text": "Where do the authors of the 'A Comprehensive Survey of Deep Research' paper work currently?"},
        "outputs": {"must_mention": ["Zhejiang University", "Liberty Mutual"]}
    },
    {
        "inputs": {"text": "What are the three main patterns in Generative AI according to current research?"},
        "outputs": {"must_mention": ["Context Engineering", "Fine-tuning", "Agents"]}
    },
    {
        "inputs": {"text": "Search for recent papers on LangGraph and agents, and tell me what applications they're being used for"},
        "outputs": {"must_mention": ["research", "applications", "agents"]}
    },
    {
        "inputs": {"text": "What is the difference between LangChain and LangGraph?"},
        "outputs": {"must_mention": ["LangChain", "LangGraph", "cycles", "stateful"]}
    },
    {
        "inputs": {"text": "Find information about Tavily search API and its capabilities"},
        "outputs": {"must_mention": ["Tavily", "search", "API"]}
    },
    {
        "inputs": {"text": "What are the key components needed to build an AI agent with LangGraph?"},
        "outputs": {"must_mention": ["tools", "state", "nodes", "edges"]}
    },
    {
        "inputs": {"text": "Search for the latest developments in AI research agents and summarize the key findings"},
        "outputs": {"must_mention": ["research", "agents", "developments", "AI"]}
    }
]

Now we can add our dataset to our LangSmith project using the following code which we saw last Thursday!

In [19]:
from langsmith import Client

client = Client()

dataset_name = f"Simple Search Agent - Evaluation Dataset - {uuid4().hex[0:8]}"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the cohort use-case to evaluate the Simple Search Agent."
)

client.create_examples(
    dataset_id=dataset.id,
    examples=questions
)

{'example_ids': ['63db9bc3-9ef2-454f-ac07-58f455c2aefe',
  '70618975-d9e7-498d-9010-3ae4f21f1afe',
  '88159540-e261-4fd5-b3fc-aec7846be136',
  'f18f1766-51be-48a4-bf06-fb1a2b58413b',
  '45149c3f-9b9a-4774-a3ee-458db1e9e215',
  '4af447d1-61ca-498f-8406-67d937a70825',
  'ca76b284-7963-4143-b24d-87417accf9fd',
  'beaaf516-9c76-43e3-8384-4a75e9b7f5da'],
 'count': 8}

### Task 2: Adding Evaluators

Let's use the OpenEvals library to produce an evaluator that we can then pass into LangSmith!

> NOTE: Examine the `CORRECTNESS_PROMPT` below!

In [20]:
from openevals.prompts import CORRECTNESS_PROMPT
print(CORRECTNESS_PROMPT)

You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

<Rubric>
  A correct answer:
  - Provides accurate and complete information
  - Contains no factual errors
  - Addresses all parts of the question
  - Is logically consistent
  - Uses precise and accurate terminology

  When scoring, you should penalize:
  - Factual errors or inaccuracies
  - Incomplete or partial answers
  - Misleading or ambiguous statements
  - Incorrect terminology
  - Logical inconsistencies
  - Missing key information
</Rubric>

<Instructions>
  - Carefully read the input and output
  - Check for factual accuracy and completeness
  - Focus on correctness of information rather than style or verbosity
</Instructions>

<Reminder>
  The goal is to evaluate factual correctness and completeness of the response.
</Reminder>

<input>
{inputs}
</input>

<output>
{outputs}
</output>

Use the reference outputs below to help you evaluate the

In [21]:
from openevals.llm import create_llm_as_judge

correctness_evaluator = create_llm_as_judge(
        prompt=CORRECTNESS_PROMPT,
        model="openai:o3-mini", # very impactful to the final score
        feedback_key="correctness",
    )

Let's also create a custom Evaluator for our created dataset above - we do this by first making a simple Python function!

In [22]:
def must_mention(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
  # determine if all of the phrases in the reference_outputs are in the outputs
  required = reference_outputs.get("must_mention") or []
  score = all(phrase in outputs["answer"] for phrase in required)
  return score

#### ❓ Question #4:

What are some ways you could improve this metric as-is?

> NOTE: Alternatively you can suggest where gaps exist in this method.

#### Answer to Question #4:

**Ways to improve the `must_mention` metric:**

**Current Gaps and Limitations:**

1. **Exact String Matching Only**: The current implementation uses simple substring matching (`phrase in outputs["answer"]`), which is brittle and can miss semantically equivalent content.

2. **Case Sensitivity**: "Zhejiang" vs "zhejiang" would be treated differently.

3. **No Semantic Understanding**: Won't catch synonyms, abbreviations, or paraphrases (e.g., "Zhejiang University" vs "University of Zhejiang").

4. **Binary Scoring**: All-or-nothing approach doesn't account for partial matches or degrees of completeness.

**Improvements:**

1. **Case-Insensitive Matching**:
   ```python
   score = all(phrase.lower() in outputs["answer"].lower() for phrase in required)
   ```

2. **Flexible String Matching**:
   ```python
   import re
   # Allow for word boundaries and variations
   score = all(re.search(rf"\b{re.escape(phrase)}\b", outputs["answer"], re.IGNORECASE) 
               for phrase in required)
   ```

3. **Semantic Similarity** (using embeddings):
   ```python
   from sentence_transformers import SentenceTransformer
   from sklearn.metrics.pairwise import cosine_similarity  # needed for the similarity call below
   
   def semantic_must_mention(inputs, outputs, reference_outputs):
       model = SentenceTransformer('all-MiniLM-L6-v2')
       required = reference_outputs.get("must_mention", [])
       answer = outputs["answer"]
       
       scores = []
       for phrase in required:
           phrase_embedding = model.encode([phrase])
           answer_embedding = model.encode([answer])
           similarity = cosine_similarity(phrase_embedding, answer_embedding)[0][0]
           scores.append(similarity > 0.7)  # Threshold for "mention"
       
       return sum(scores) / len(scores)  # Partial credit
   ```

4. **Weighted Scoring**:
   ```python
   def weighted_must_mention(inputs, outputs, reference_outputs):
       required = reference_outputs.get("must_mention", [])
       weights = reference_outputs.get("mention_weights", [1.0] * len(required))
       
       total_weight = sum(weights)
       achieved_weight = sum(w for phrase, w in zip(required, weights) 
                           if phrase.lower() in outputs["answer"].lower())
       
       return achieved_weight / total_weight
   ```

5. **Context-Aware Matching**: Check if the mention appears in relevant context, not just anywhere in the text.

6. **Fuzzy Matching** for typos and variations:
   ```python
   from fuzzywuzzy import fuzz
   
   def fuzzy_must_mention(inputs, outputs, reference_outputs):
       required = reference_outputs.get("must_mention", [])
       if not required:
           return 1.0  # nothing required, so nothing is missing
       answer = outputs["answer"].lower()
       
       scores = []
       for phrase in required:
           # partial_ratio already scores the best-matching substring of the answer
           best_score = fuzz.partial_ratio(phrase.lower(), answer)
           scores.append(best_score > 80)  # 80% similarity threshold
       
       return sum(scores) / len(scores)
   ```

**Most Impactful Improvement**: Combining case-insensitive matching with semantic similarity scoring would provide both precision and flexibility while maintaining interpretability.
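
Improvement 5 (context-aware matching) is the only item above without code. One possible sketch, under stated assumptions: a phrase counts as mentioned only if a related keyword appears within a character window around it. The `mention_context` field, the `window` parameter, and the function names are hypothetical and not part of the dataset used in this notebook. A factory is used so the returned evaluator keeps the `(inputs, outputs, reference_outputs)` signature.

```python
import re

def make_context_aware_must_mention(window=80):
    """Build an evaluator; `window` is the number of characters searched on each side."""

    def context_aware_must_mention(inputs, outputs, reference_outputs):
        required = reference_outputs.get("must_mention", [])
        # Hypothetical field: maps each phrase to keywords expected near it.
        context = reference_outputs.get("mention_context", {})
        if not required:
            return 1.0  # nothing required, so nothing is missing
        answer = outputs["answer"].lower()

        hits = 0
        for phrase in required:
            keywords = [k.lower() for k in context.get(phrase, [])]
            for match in re.finditer(re.escape(phrase.lower()), answer):
                start = max(0, match.start() - window)
                nearby = answer[start:match.end() + window]
                # With no keywords configured, any occurrence counts.
                if not keywords or any(k in nearby for k in keywords):
                    hits += 1
                    break
        return hits / len(required)  # partial credit

    return context_aware_must_mention
```

A plain substring window is a rough proxy for "relevant context"; sentence splitting or an LLM judge would be stricter alternatives.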


### Task 3: Evaluating

All that is left to do is evaluate our agent's response!

In [23]:
results = client.evaluate(
    agent_chain_with_formatting,
    data=dataset.name,
    evaluators=[correctness_evaluator, must_mention],
    experiment_prefix="simple_agent, baseline",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
    max_concurrency=4, # optional, add concurrency
)

View the evaluation results for experiment: 'simple_agent, baseline-4d7b7e04' at:
https://smith.langchain.com/o/efb53ee0-b891-4203-a7bd-4250842b1fc4/datasets/fe596eb5-3add-46f9-8f19-d650a56a1859/compare?selectedSessions=c76379d3-030c-4891-bb30-a291dd94fe09





## Part 2: LangGraph with Helpfulness

### Task 3: Adding Helpfulness Check and "Loop" Limits

Now that we've run an evaluation, let's add an extra step that reviews the generated content to confirm whether it fully answers the user's query!

We're going to make a few key adjustments to account for this:

1. We're going to add an artificial limit on how many "loops" the agent can go through - this will help us to avoid the potential situation where we never exit the loop.
2. We'll add to our existing conditional edge to obtain the behaviour we desire.

First, let's define our state again. We can enforce the loop limit by checking the length of the `messages` list in the state, so we don't need any additional state for this.

In [24]:
class AgentState(TypedDict):
  messages: Annotated[list, add_messages]

Now we can set our graph up! This process will be almost entirely the same - with the inclusion of one additional node/conditional edge!

#### 🏗️ Activity #4:

Please write markdown for the following cells to explain what each is doing.

##### Creating the Enhanced Graph with Helpfulness Check

This cell initializes a new StateGraph that will include helpfulness checking functionality. We create the graph with the same AgentState and add the same two fundamental nodes:

- **"agent"** node: Uses our existing `call_model` function to interact with the LLM
- **"action"** node: Uses our existing `tool_node` to execute tool calls

This is the same basic structure as our previous graph, but we'll enhance it with additional conditional logic to evaluate response quality.

In [34]:
graph_with_helpfulness_check = StateGraph(AgentState)

graph_with_helpfulness_check.add_node("agent", call_model)
graph_with_helpfulness_check.add_node("action", tool_node)

<langgraph.graph.state.StateGraph at 0x127074410>

##### Setting the Entry Point

This cell designates the **"agent"** node as the entry point for our graph. This means that when the graph starts executing, it will always begin by calling the agent node first.

The entry point is crucial because it defines the starting node in our workflow. In our case, we want to start with the agent receiving the user's query and determining what action to take next.

In [26]:
graph_with_helpfulness_check.set_entry_point("agent")

<langgraph.graph.state.StateGraph at 0x122ef56d0>

##### Enhanced Conditional Logic with Helpfulness Check

This cell defines the core conditional function `tool_call_or_helpful` that adds intelligence to our graph. This function implements multiple decision points:

1. **Tool Call Detection**: First checks if the last message contains tool calls - if so, routes to "action" node
2. **Loop Limit Protection**: Prevents infinite loops by checking whether the state holds more than 10 messages - if so, the run stops instead of looping again
3. **Helpfulness Evaluation**: Uses a separate LLM (GPT-4.1-mini) to evaluate whether the agent's response adequately answers the original query
4. **Dynamic Routing**: Based on the helpfulness evaluation:
   - If helpful ("Y") → route to "end" (terminate successfully)
   - If not helpful ("N") → route to "continue" (loop back to agent for improvement)

This creates a sophisticated feedback loop where the agent can iteratively improve its responses until they meet quality standards.
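
One detail worth noting about the "Y"/"N" verdict: the judge replies in free text, so a plain substring test for "Y" would also fire on a reply such as "N. You should add sources." A stricter parse could look only at the first token. The following is a minimal sketch under that assumption; `parse_helpfulness_verdict` is a hypothetical helper, not part of LangChain or LangGraph.

```python
def parse_helpfulness_verdict(response: str) -> bool:
    """Return True only when the judge's reply starts with a 'Y' verdict."""
    # Drop surrounding whitespace and common wrapping characters.
    cleaned = response.strip().strip("'\"`*").upper()
    # Look at the first token only, so a 'Y' later in the text is ignored.
    first_token = cleaned.split()[0] if cleaned else ""
    return first_token.rstrip(".,:;!'\"") in {"Y", "YES"}
```

An empty or unexpected reply is treated as unhelpful here, which errs on the side of another loop; the message limit still bounds the total number of iterations.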

In [27]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

def tool_call_or_helpful(state):
  last_message = state["messages"][-1]

  # 1. The agent asked for a tool: run it before judging anything.
  if last_message.tool_calls:
    return "action"

  initial_query = state["messages"][0]
  final_response = state["messages"][-1]

  # 2. Loop limit: stop once the conversation holds more than 10 messages.
  #    The key must match the conditional edge mapping, which uses "end".
  if len(state["messages"]) > 10:
    return "end"

  prompt_template = """\
  Given an initial query and a final response, determine if the final response is extremely helpful or not. Please indicate helpfulness with a 'Y' and unhelpfulness as an 'N'.

  Initial Query:
  {initial_query}

  Final Response:
  {final_response}"""

  helpfulness_prompt_template = PromptTemplate.from_template(prompt_template)

  helpfulness_check_model = ChatOpenAI(model="gpt-4.1-mini")

  helpfulness_chain = helpfulness_prompt_template | helpfulness_check_model | StrOutputParser()

  helpfulness_response = helpfulness_chain.invoke({"initial_query" : initial_query.content, "final_response" : final_response.content})

  # 3. Route on the verdict: "Y" ends the run, anything else loops back to the agent.
  if "Y" in helpfulness_response:
    return "end"
  else:
    return "continue"

##### Adding Conditional Edges with Enhanced Logic

This cell connects the "agent" node to our enhanced conditional logic function `tool_call_or_helpful`. The conditional edge creates three possible routing paths:

- **"continue"** → routes back to "agent" node (for iterative improvement when response isn't helpful enough)
- **"action"** → routes to "action" node (when tools need to be executed)  
- **"end"** → routes to END (when response is deemed helpful and complete)

This creates a more sophisticated flow than our basic agent, allowing for quality control and iterative refinement of responses.
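
Conceptually, the mapping is a lookup from the router's return value to a destination node. The sketch below is a plain-Python stand-in, not LangGraph's implementation: `resolve_route` is a hypothetical helper and `END` is a local constant standing in for `langgraph.graph.END`. It illustrates that keys are case-sensitive, so a router returning a key that is not in the mapping has nowhere to go.

```python
END = "__end__"  # local stand-in for langgraph.graph.END

path_map = {
    "continue": "agent",
    "action": "action",
    "end": END,
}

def resolve_route(route_key, path_map):
    """Translate a router return value into the next node name."""
    try:
        return path_map[route_key]
    except KeyError:
        raise ValueError(
            f"Router returned {route_key!r}, expected one of {sorted(path_map)}"
        ) from None
```

For example, `resolve_route("continue", path_map)` gives `"agent"`, while `resolve_route("END", path_map)` raises because only the lowercase `"end"` key is defined.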

In [28]:
graph_with_helpfulness_check.add_conditional_edges(
    "agent",
    tool_call_or_helpful,
    {
        "continue" : "agent",
        "action" : "action",
        "end" : END
    }
)

<langgraph.graph.state.StateGraph at 0x122ef56d0>

##### Completing the Graph with Action-to-Agent Edge

This cell adds the final edge connecting the "action" node back to the "agent" node. This edge ensures that after any tool execution completes, the results are always passed back to the agent for processing and potential further action.

This creates the essential feedback loop: Agent → Tools → Agent → Evaluation → Continue/End, allowing the system to use tool results effectively in its decision-making process.
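
To get a feel for how the 10-message limit bounds this loop, the traversal can be traced without any LLM calls. The sketch below is a simplified model under stated assumptions: the helpfulness judge always answers "N", each tool round adds exactly one tool message, and `trace_worst_case` is a hypothetical name. It mirrors the router's order of checks, where the tool-call check comes before the message limit.

```python
def trace_worst_case(tool_rounds: int, cap: int = 10) -> list:
    """Node visit order when the judge never approves the answer."""
    message_count = 1  # the initial human message
    rounds_left = tool_rounds
    visited = []
    while True:
        visited.append("agent")
        message_count += 1  # the agent appends one AI message
        if rounds_left > 0:
            # The AI message carried a tool call: route to the tool node.
            rounds_left -= 1
            visited.append("action")
            message_count += 1  # assumption: one tool message per round
            continue
        if message_count > cap:
            # Loop limit reached: stop instead of asking the judge.
            visited.append("END")
            return visited
        # Judge said "N": loop straight back to the agent.
```

Under these assumptions, every visit adds one message, so the run stops after `cap` visits in total across the agent and action nodes.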

In [29]:
graph_with_helpfulness_check.add_edge("action", "agent")

<langgraph.graph.state.StateGraph at 0x122ef56d0>

##### Compiling the Enhanced Graph

This cell compiles our enhanced StateGraph into an executable workflow. The compilation process transforms our graph definition into a runnable agent that includes:

- All defined nodes (agent, action)
- All edges and conditional routing logic
- The helpfulness evaluation system
- Loop protection mechanisms

Once compiled, the graph becomes ready for execution with built-in quality control and iterative improvement capabilities.

In [None]:
agent_with_helpfulness_check = graph_with_helpfulness_check.compile()

##### Testing the Enhanced Agent with Helpfulness Check

This cell demonstrates the enhanced agent in action with a test query about "Deep Research Agents". The agent will:

1. **Process the Query**: Analyze the user's question
2. **Generate Response**: Provide an initial answer based on its knowledge
3. **Evaluate Helpfulness**: Use the built-in helpfulness check to assess response quality
4. **Route Accordingly**: Either terminate (if helpful) or continue iterating for improvement

Notice how the output shows updates only from the 'agent' node, indicating the response was deemed helpful on the first try and didn't require additional tool use or iteration.
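
With `stream_mode="updates"`, each chunk is a dict keyed by the name of the node that just ran. A small helper can tally which nodes fired, which makes a claim like "only the agent node ran" easy to check in code. This is a sketch using stand-in dicts rather than a live graph; `count_node_updates` is a hypothetical name.

```python
from collections import Counter

def count_node_updates(chunks):
    """Count how many updates each node emitted in an 'updates' stream."""
    counts = Counter()
    for chunk in chunks:
        for node in chunk:
            counts[node] += 1
    return counts

# Stand-in chunks shaped like {node_name: {"messages": [...]}}
example_chunks = [
    {"agent": {"messages": ["tool call"]}},
    {"action": {"messages": ["tool result"]}},
    {"agent": {"messages": ["final answer"]}},
]
node_counts = count_node_updates(example_chunks)
```

In a live run, the chunks yielded by `astream` could be collected into a list first and passed to the same helper.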

In [31]:
inputs = {"messages" : [HumanMessage(content="What are Deep Research Agents?")]}

async for chunk in agent_with_helpfulness_check.astream(inputs, stream_mode="updates"):
    for node, values in chunk.items():
        print(f"Receiving update from node: '{node}'")
        print(values["messages"])
        print("\n\n")

Receiving update from node: 'agent'
[AIMessage(content='Deep Research Agents are advanced AI systems designed to assist with in-depth research tasks. They leverage deep learning techniques and large datasets to analyze complex information, generate insights, and support decision-making across various fields such as science, technology, medicine, and more. These agents can automate literature reviews, extract relevant data from vast sources, and provide comprehensive summaries, making research more efficient and thorough. Would you like me to find more detailed or specific information about Deep Research Agents?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 93, 'prompt_tokens': 158, 'total_tokens': 251, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-

## Part 3: LangGraph for the "Patterns" of GenAI

### Task 4: Helpfulness Check of Gen AI Pattern Descriptions

Let's ask our system about the 3 main patterns in Generative AI:

1. Context Engineering
2. Fine-tuning
3. Agents

In [32]:
patterns = ["Context Engineering", "Fine-tuning", "LLM-based agents"]

In [33]:
for pattern in patterns:
  what_is_string = f"What is {pattern} and when did it break onto the scene??"
  inputs = {"messages" : [HumanMessage(content=what_is_string)]}
  messages = agent_with_helpfulness_check.invoke(inputs)
  print(messages["messages"][-1].content)
  print("\n\n")

Context Engineering is a relatively new interdisciplinary field that focuses on designing, managing, and utilizing contextual information to improve the functionality and adaptability of systems, particularly in areas like artificial intelligence, human-computer interaction, and pervasive computing. It involves understanding and engineering the context in which systems operate to enhance their performance, relevance, and user experience.

The concept of Context Engineering began gaining attention in the early 2000s with the rise of ubiquitous computing and context-aware systems. It became more prominent as researchers and industry professionals recognized the importance of context in creating intelligent, responsive systems that can adapt to dynamic environments and user needs.

Would you like me to find more detailed and specific information about the origins and development of Context Engineering?



Fine-tuning is a machine learning technique used to adapt a pre-trained model to a s