<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# LangGraph Agents: Orchestrator–Worker Pattern

In this tutorial, we’ll build a multi-agent system using LangGraph's **Orchestrator–Worker pattern**, ideal for dynamically decomposing a task into subtasks, assigning them to specialized LLM agents, and synthesizing their responses.

This pattern is particularly well-suited when the structure of subtasks is unknown ahead of time—such as when writing modular code, creating multi-section reports, or conducting research. The **orchestrator** plans and delegates, while the **workers** each complete their assigned section.

We’ll also use **Phoenix** to trace and debug the orchestration process. With Phoenix, you can visually inspect which tasks the orchestrator generated, how each worker handled its section, and how the final output was assembled.

By the end of this notebook, you’ll learn how to:
- Use structured outputs to plan subtasks dynamically.
- Assign subtasks to LLM workers via LangGraph's `Send` API.
- Collect and synthesize multi-step LLM outputs.
- Trace and visualize orchestration using Phoenix.


In [None]:
%pip install  "arize-phoenix" "arize-phoenix-otel>=0.8.0" llama-index-llms-openai openai gcsfs nest_asyncio langchain langchain-openai openinference-instrumentation-langchain



In [None]:
!pip install langchain_openai langgraph langchain_community duckduckgo-search



In [None]:
from langgraph.graph import StateGraph, START, END
import os, getpass

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")



# Configure Phoenix Tracing

Go to https://app.phoenix.arize.com/ and generate an API key. The key lets you send traces from your LangGraph application to Phoenix.

In [None]:
PHOENIX_API_KEY = getpass.getpass("Phoenix API Key:")
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"



In [None]:
from phoenix.otel import register

tracer_provider = register(
  project_name="Orchestrator",
  auto_instrument=True
)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: Orchestrator
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



Orchestrator‑Workers • Research‑Paper Generator
----------------------------------------------
The orchestrator plans research‑paper *subsections* (abstract, background …),
spawns one worker per subsection, then stitches everything into a full draft.

In [None]:
import operator
from typing import Annotated, List, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.constants import Send
from langchain_core.messages import SystemMessage, HumanMessage
from IPython.display import Image, Markdown
from langchain_community.tools import DuckDuckGoSearchResults


# Step 1: Defining the Planning Schema
To begin, we define a structured output schema using Pydantic. This schema ensures that the LLM returns well-formatted, predictable output when asked to plan the structure of a research paper.

We create two models:

- `Subsection`: a single unit of the paper, with its name and a brief description of what it should cover.
- `Subsections`: a wrapper that holds a list of these units.

Passing these models to LangChain's `with_structured_output` method makes the orchestrator LLM return an organized plan, rather than freeform text, that the downstream worker nodes can reliably use.

This schema acts as the blueprint for the rest of the workflow.

In [None]:
from pydantic import BaseModel, Field

class Subsection(BaseModel):
    name: str = Field(
        description="Name for this subsection of the research paper."
    )
    description: str = Field(
        description="Concise description of the general subjects to be covered in this subsection."
    )

class Subsections(BaseModel):
    Subsections: List[Subsection] = Field(
        description="All subsections of the research paper."
    )





# Step 2: Set Up LLM and Tools
We initialize `gpt-3.5-turbo` as our base LLM and bind it to the `Subsections` schema to create the orchestrator. We also load a DuckDuckGo search tool that worker agents can use to enrich sections with live web data; the worker node defined in this notebook does not call it yet.

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
orchestrator_llm = llm.with_structured_output(Subsections)
# output_format must be "string", "json", or "list"; "dict" is rejected with a ValidationError.
ddg_results_tool = DuckDuckGoSearchResults(output_format="list", max_results=5)

# Step 3: Define Graph State
We define two state schemas:

- `State` holds the overall research paper workflow: the topic, the planned subsections, the completed text, and the final output.
- `WorkerState` holds the task assigned to one worker, a single subsection, plus the `completed_subsections` list where each worker's contribution is accumulated.

This shared state structure lets LangGraph coordinate work between the orchestrator and its worker agents.

In [None]:
class State(TypedDict):
    topic: str                    # Research‑paper topic
    subsections: List[Subsection]  # Planned subsections
    completed_subsections: Annotated[List[str], operator.add]
    final_paper: str              # Synthesised draft

class WorkerState(TypedDict):
    subsection: Subsection
    completed_subsections: Annotated[List[str], operator.add]
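The `Annotated[List[str], operator.add]` annotation asks LangGraph to combine updates to `completed_subsections` with `operator.add` instead of overwriting the previous value. The sketch below imitates that merge in plain Python. `merge_updates` is a hypothetical helper written for illustration, not a LangGraph function, and the sample strings are placeholders.

```python
import operator
from functools import reduce
from typing import Dict, List


def merge_updates(updates: List[Dict[str, List[str]]]) -> List[str]:
    """Fold each worker's partial update into one list using operator.add."""
    partial_lists = [u["completed_subsections"] for u in updates]
    return reduce(operator.add, partial_lists, [])


# Each worker returns a one-element list, as the worker node in this notebook does.
worker_updates = [
    {"completed_subsections": ["## Introduction ..."]},
    {"completed_subsections": ["## Background ..."]},
    {"completed_subsections": ["## Methodology ..."]},
]

merged = merge_updates(worker_updates)
print(merged)
```

This is why each worker returns a list rather than a bare string: list concatenation lets several workers contribute to the same key.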


# Step 4: Define Nodes
We define three core nodes in the graph:

- `orchestrator`: dynamically plans the structure of the paper by generating a list of subsections using structured output.
- `subsection_writer`: acts as a worker that writes one full subsection in academic Markdown, using the provided description and scope.
- `synthesiser`: merges all completed subsections into a single draft, separating sections with horizontal-rule dividers.

Each node contributes to a modular, scalable paper-writing pipeline, and with Phoenix tracing you can inspect every generation step in detail.

In [None]:
def orchestrator(state: State):
    """Plan the research‑paper subsections dynamically."""
    plan = orchestrator_llm.invoke(
        [
            SystemMessage(content="Generate a detailed subsection plan for a research paper."),
            HumanMessage(content=f"Paper topic: {state['topic']}")
        ]
    )
    return {"subsections": plan.Subsections}

def subsection_writer(state: WorkerState):
    """Write a single subsection (markdown)."""
    sub = state["subsection"]
    response = llm.invoke(
        [
            SystemMessage(
                content=(
                    "You are drafting a research‑paper subsection. "
                    "Follow academic tone, include citation placeholders (e.g. [1]), "
                    "and use Markdown headings where appropriate."
                )
            ),
            HumanMessage(
                content=(
                    f"Subsection name: {sub.name}\n"
                    f"Description / scope: {sub.description}\n\n"
                    "Write the full subsection now."
                )
            ),
        ]
    )
    return {"completed_subsections": [response.content]}

def synthesiser(state: State):
    """Concatenate all finished subsections into the final paper draft."""
    full_paper = "\n\n---\n\n".join(state["completed_subsections"])
    return {"final_paper": full_paper}
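Because the synthesis step only reads `completed_subsections`, its behaviour can be checked without any LLM call. The following sketch repeats the join logic on hand-written placeholder text to show the shape of the final draft. `synthesise` is a hypothetical standalone copy of the node's logic, and the sample strings are invented.

```python
from typing import Dict, List


def synthesise(completed_subsections: List[str]) -> Dict[str, str]:
    """Join finished subsections with a Markdown horizontal rule between them."""
    return {"final_paper": "\n\n---\n\n".join(completed_subsections)}


draft = synthesise(["## Introduction\nText A", "## Background\nText B"])
print(draft["final_paper"])
```

An empty list yields an empty draft, so a run in which the orchestrator plans no subsections produces an empty paper rather than an error.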


# Step 5: Assign Workers Dynamically
This function uses LangGraph's `Send` API to launch a separate `subsection_writer` worker for each planned subsection. Because one worker is spawned per section, the number of workers follows the size of the plan the orchestrator produced.

This approach suits research paper generation, where the number of sections is not known ahead of time, and Phoenix helps trace the output from each worker node independently.

In [None]:
def assign_workers(state: State):
    """Launch one worker per planned subsection."""
    return [Send("subsection_writer", {"subsection": s}) for s in state["subsections"]]
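To make the fan-out concrete, this sketch shows the kind of list such a routing function produces for a three-section plan. `FakeSend` is a stand-in defined here so the example runs without LangGraph installed. It only mimics the idea of pairing a target node name with a payload and is not the real `Send` class. Plain strings stand in for `Subsection` objects.

```python
from typing import Any, Dict, List, NamedTuple


class FakeSend(NamedTuple):
    """Illustrative stand-in: a target node name plus the state that node receives."""
    node: str
    arg: Dict[str, Any]


def assign_workers_sketch(state: Dict[str, Any]) -> List[FakeSend]:
    """Emit one message per planned subsection, all aimed at the same worker node."""
    return [FakeSend("subsection_writer", {"subsection": s}) for s in state["subsections"]]


# Hypothetical state after the planning step.
state = {"topic": "demo", "subsections": ["Introduction", "Methods", "Results"]}
messages = assign_workers_sketch(state)

for m in messages:
    print(m.node, "<-", m.arg)
```

Each message carries only the slice of state its worker needs, which is why the worker's state schema is smaller than the graph's.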


# Step 6: Construct the LangGraph Workflow
Here, we build the full pipeline using a `StateGraph`. The workflow begins with the `orchestrator` node, which plans the subsections. `assign_workers` then routes one task to `subsection_writer` per subsection, and the `synthesiser` node aggregates all outputs once the workers finish.

LangGraph's conditional edges and `Send` API let the workers run in parallel, and with Phoenix tracing enabled you can view how each section is created, tracked, and stitched together.

In [None]:
builder = StateGraph(State)

builder.add_node("orchestrator", orchestrator)
builder.add_node("subsection_writer", subsection_writer)
builder.add_node("synthesiser", synthesiser)

builder.add_edge(START, "orchestrator")
builder.add_conditional_edges("orchestrator", assign_workers, ["subsection_writer"])
builder.add_edge("subsection_writer", "synthesiser")
builder.add_edge("synthesiser", END)

research_paper_workflow = builder.compile()
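As a mental model of this graph, the sketch below reproduces the same plan, fan-out, and join sequence in plain Python with a thread pool. It illustrates the control flow only and makes no claim about how LangGraph schedules nodes internally. `fake_plan` and `fake_write` are hypothetical stand-ins for the LLM-backed nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fake_plan(topic: str) -> List[str]:
    """Stand-in for the orchestrator: return a fixed outline for any topic."""
    return ["Introduction", "Background", "Conclusion"]


def fake_write(section: str) -> str:
    """Stand-in for a worker: produce placeholder Markdown for one section."""
    return f"## {section}\n(placeholder text)"


def run_pipeline(topic: str) -> Dict[str, str]:
    sections = fake_plan(topic)                          # planning step
    with ThreadPoolExecutor() as pool:                   # one worker call per section
        drafts = list(pool.map(fake_write, sections))    # map() preserves input order
    return {"final_paper": "\n\n---\n\n".join(drafts)}   # synthesis step


result = run_pipeline("Scaling Laws for Large Language Models")
print(result["final_paper"])
```

Swapping the stand-ins for real LLM calls changes the content of each draft but not the shape of the flow.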



# Step 7: Run the Research Paper Generator
We now invoke the compiled graph with a sample topic: “Scaling Laws for Large Language Models.” The orchestrator plans the outline, the workers draft their subsections in parallel, and the `synthesiser` node assembles the full paper.

With Phoenix integrated, every step is traced, from section planning to writing and synthesis, giving you visibility into the execution flow and helping you debug or refine outputs.

In [None]:
state = research_paper_workflow.invoke(
    {"topic": "Scaling Laws for Large Language Models"}
)

print("===== RESEARCH PAPER DRAFT =====\n")
Markdown(state["final_paper"])

===== RESEARCH PAPER DRAFT =====



## Introduction

Large language models have gained significant attention in recent years due to their remarkable capabilities in natural language processing tasks. These models, such as OpenAI's GPT-3 and Google's BERT, have demonstrated state-of-the-art performance in various language-related tasks, including text generation, translation, and sentiment analysis. As researchers continue to push the boundaries of model size and complexity, understanding the scaling laws that govern the development of these models becomes crucial.

Scaling laws refer to the relationship between the size of a model (measured in terms of parameters or computational resources) and its performance on a given task. By studying these scaling laws, researchers can gain insights into how increasing the size of a model impacts its capabilities and efficiency. This understanding is essential for optimizing the design and training of large language models, ensuring that further advancements in this field are both effective and sustainable.

In this subsection, we will explore the concept of scaling laws in the context of large language models, highlighting the importance of studying these relationships for the continued advancement of natural language processing technologies.

---

## Background

Language models have been a fundamental area of research in natural language processing (NLP) for several decades. The evolution of language models can be traced back to the early work on statistical language modeling in the 1950s and 1960s [1]. These early models focused on predicting the next word in a sequence based on the probabilities of word occurrences.

One of the key milestones in the development of language models was the introduction of n-gram models, which estimate the probability of a word given the previous n-1 words. This approach significantly improved the accuracy of language modeling and laid the foundation for more sophisticated models to come [2].

In the 2010s, the field of NLP witnessed a significant shift with the introduction of neural network-based language models, such as the famous Word2Vec and GloVe models. These models utilized deep learning techniques to learn distributed representations of words in a continuous vector space, enabling better semantic understanding of language [3].

The most recent breakthrough in language modeling came with the introduction of transformer-based models, starting with the Transformer model proposed by Vaswani et al. in 2017 [4]. Transformers revolutionized NLP by allowing for parallel processing of words in a sequence, leading to significant improvements in model performance and the ability to handle long-range dependencies in text.

Overall, the history of language models is marked by a progression from simple statistical models to complex neural network architectures, with each advancement bringing about improvements in language understanding and generation capabilities.

References:
- [1] Placeholder for citation on early work in statistical language modeling.
- [2] Placeholder for citation on the development of n-gram models.
- [3] Placeholder for citation on neural network-based language models.
- [4] Placeholder for citation on the Transformer model by Vaswani et al.

---

## Related Work

In recent years, there has been a surge of interest in understanding scaling laws in the context of language models. Researchers have explored various aspects of scaling laws, including model size, performance, and computational requirements. 

One key finding in the literature is the observation of a power-law relationship between model size and performance [1]. This suggests that as the size of a language model increases, its performance on various natural language processing tasks also improves. This phenomenon has been demonstrated across a range of tasks, including language modeling, text classification, and machine translation.

Moreover, studies have also investigated the computational requirements of scaling language models. It has been shown that larger models not only require more parameters but also demand increased computational resources for training and inference [2]. This has implications for the efficiency and sustainability of deploying large language models in real-world applications.

Overall, the research on scaling laws in language models provides valuable insights into the trade-offs between model size, performance, and computational costs. By understanding these scaling laws, researchers can optimize the design and deployment of language models for various applications.

[1] Placeholder for citation on power-law relationship between model size and performance.

[2] Placeholder for citation on computational requirements of scaling language models.

---

## Methodology

To investigate scaling laws for large language models, a comprehensive approach was adopted, encompassing data collection and analysis methods. 

### Data Collection

The primary dataset used in this study consists of a diverse range of large language models, including GPT-2, GPT-3, BERT, and XLNet. Data was collected from publicly available sources such as research papers, official documentation, and open-source repositories. The dataset includes information on model architecture, parameter size, training data, computational resources, and performance metrics.

### Analysis Methods

The analysis of scaling laws involved examining the relationship between key variables such as model size, training data, computational resources, and performance metrics. Statistical techniques such as regression analysis and correlation analysis were employed to identify patterns and trends in the data. Additionally, visualization tools were used to present the findings in a clear and concise manner.

Overall, the methodology employed in this study aims to provide a rigorous and systematic investigation into the scaling laws of large language models, offering valuable insights into the factors influencing their performance and scalability. 

<!-- Reference section -->
References:

---

## Results

The analysis of scaling laws in large language models revealed several key findings that shed light on the behavior and performance of these models. 

### Power Law Scaling

One of the main results of the study was the observation of power law scaling in the relationship between model size and performance metrics. This finding is consistent with previous research [1] and suggests that as the size of the language model increases, there is a non-linear improvement in performance metrics such as accuracy and efficiency.

### Diminishing Returns

Another important result was the identification of diminishing returns as model size continues to grow. While larger models generally exhibit better performance, the rate of improvement diminishes as the model size increases. This finding has implications for the cost-effectiveness of scaling up language models beyond a certain point [2].

### Computational Costs

The analysis also highlighted the significant computational costs associated with scaling up language models. Larger models require more computational resources for training and inference, which can pose challenges in terms of scalability and efficiency. Understanding these costs is crucial for optimizing the deployment of large language models in practical applications [3].

### Generalization and Fine-tuning

Furthermore, the results indicated that larger language models may have a tendency to overfit the training data, leading to challenges in generalization to unseen data. Fine-tuning strategies were found to be effective in mitigating this issue and improving the overall performance of large language models [4].

In summary, the results of this study provide valuable insights into the scaling laws of large language models, highlighting the trade-offs between model size, performance, and computational costs. These findings have important implications for the development and deployment of state-of-the-art language models in various applications.

References:

[1] Placeholder for citation on power law scaling.
[2] Placeholder for citation on diminishing returns in large language models.
[3] Placeholder for citation on computational costs of scaling language models.
[4] Placeholder for citation on generalization and fine-tuning in large language models.

---

## Discussion

The results of this study indicate that scaling up language models leads to significant improvements in performance across various natural language processing tasks. The larger models consistently outperformed their smaller counterparts, demonstrating the effectiveness of scaling in enhancing model capabilities. These findings align with previous research on language model scaling, which has shown that increasing model size can lead to better performance on a wide range of tasks [1].

The implications of these results are significant for the field of natural language processing. By scaling up language models, researchers and practitioners can achieve state-of-the-art performance on tasks such as text classification, language modeling, and machine translation. This has the potential to advance the development of more sophisticated natural language processing systems that can handle complex language understanding tasks with greater accuracy and efficiency.

Comparing our results to existing literature on language model scaling, we find that our findings are consistent with previous studies that have demonstrated the benefits of scaling up models in improving performance on various NLP tasks [2]. However, it is important to note that the computational resources required to train and deploy large-scale language models can be substantial, posing challenges for widespread adoption in real-world applications. Future research should focus on addressing these scalability issues to make large language models more accessible and practical for a broader range of applications.

Overall, this study contributes to the growing body of research on language model scaling and highlights the potential benefits of leveraging larger models for improving performance in natural language processing tasks.

[1] Placeholder for citation.
[2] Placeholder for citation.

---

## Conclusion

In this research paper, we have explored the concept of scaling laws in language models, focusing on the relationship between model size, performance, and computational resources. Our analysis revealed that larger language models tend to exhibit improved performance on various natural language processing tasks, but at the cost of exponentially increasing computational resources. This trade-off highlights the importance of understanding scaling laws in language models to optimize model performance while managing computational constraints.

Moving forward, future research could delve deeper into the mechanisms behind scaling laws in language models. Investigating how different architectural components contribute to model performance as size increases could provide valuable insights for designing more efficient and effective models. Additionally, exploring alternative approaches to scaling, such as sparse models or knowledge distillation techniques, may offer promising avenues for mitigating the computational demands associated with large language models.

By continuing to study scaling laws in language models, researchers can advance the field of natural language processing and develop models that strike a balance between performance and efficiency, ultimately enhancing the capabilities of language technologies in various applications.

# Step 8: Check out your traces in Phoenix!