### Introduction


In this notebook, we demonstrate how a mixture of agents (MoA) can improve the quality of responses by combining the outputs of multiple LLMs. The code proceeds in order: first a single agent/LLM, then a mixture of agents/LLMs, then multiple iterations of a mixture of agents/LLMs. The basic architecture is to prompt each LLM, then pass their responses to an aggregator LLM with a final prompt to produce the final output.

![alt text](8c88157_image.png "Title")

In this notebook, we will build a travel itinerary generator. Planning a trip means figuring out flights, hotels, food, attractions, and more; here we delegate those pieces to a mixture of agents. The main request is split into subtasks, a mixture of agents tackles each subtask, and the agents' results are aggregated into a refined answer for that subtask. The per-task answers are then aggregated to build the final itinerary. Our model will look like this:


![alt text](agent_diagram.png "Title")

We will build each part of this workflow step by step. At each step, we may decide to incorporate Judgment's scoring and tracing models as well! We will describe how this works in more detail in those cells.

### Setup

Import all the necessary packages to leverage Together AI's inference models and mixture of agents framework, as well as Judgment's tracing and scoring models!

In [1]:
#general imports
import asyncio
import os
import together
import json
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="pydantic")

#together imports
from together import AsyncTogether, Together

#judgment imports
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer, SummarizationScorer, FaithfulnessScorer

#tavily imports (we use this to help out agent workflow)
from tavily import TavilyClient




### Environment Variables

Make sure you have JUDGMENT_API_KEY, TAVILY_API_KEY, TOGETHER_API_KEY set in your environment variables. In order to obtain a JUDGMENT_API_KEY please email us at contact@judgmentlabs.ai. You can obtain a TAVILY_API_KEY through the Getting Started Guide on Tavily's website.
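Since a missing key usually only surfaces later as an authentication error, it can help to check for all three variables up front. The snippet below is a minimal sketch of such a check; the helper name `missing_env_vars` is our own and is not part of any of the libraries used here.

```python
import os

# Variable names taken from the text above.
REQUIRED_KEYS = ("JUDGMENT_API_KEY", "TAVILY_API_KEY", "TOGETHER_API_KEY")


def missing_env_vars(required=REQUIRED_KEYS, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching your real environment.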

Initialize the clients we use for inference and also Judgment's tracers.

In [2]:
client = wrap(Together(api_key=os.environ.get("TOGETHER_API_KEY")))
async_client = AsyncTogether(api_key=os.environ.get("TOGETHER_API_KEY"))
judgment = Tracer(project_name="travel_agent")

Successfully initialized JudgmentClient, welcome back user!


#### Breakdown Tasks

The first step is to **break down the user's prompt** into subtasks. For this, we can leverage one LLM to break it down into the task descriptions we want.

Say our user's prompt is "Make me an itinerary to Spain for one week from Feb 20 to March 1". We want to break this large problem down into subproblems. Making an itinerary involves a lot of things: finding accommodations, finding activities, listing all the things to see, trying new restaurants, and so on. We will make an LLM call to break our request into specific subproblems and tasks that we will later use MoA to solve.

Let's come up with system and user prompts for this LLM to complete the breakdown of tasks.

In [3]:
user_prompt = "Make me an itinerary to Spain for one week from Feb 20 to March 1"

system_prompt = """
    You are an AI assistant that breaks down a user's request about making a travel itinerary for their upcoming trip into 
    specific subtasks. You will need to break down the user's request into a task description and then create a clear, 
    detailed prompt addressing that task. Each prompt should be self-contained with all necessary information from the 
    original request. Do not add any explanations or commentary - only output the JSON object. You should output a JSON 
    object where the key is the task description and the value is the specialized prompt you came up with to solve the task.
    As an example output, you could return the following key-value pair for a specific task :
    {
        "Find flights and hotels for the trip": "Search for flights from the user's preferred airport to Spain from February 20 to March 1 and find available hotels in the desired location for the entire 
        duration of the trip, considering factors such as budget, location, and user reviews"
    }
"""

user_message = f"""
    Original user request: "{user_prompt}"

    Break this down into separate prompts for specialized agents to handle each subtask.

    Return the result as a JSON string where the keys are the task descriptions that I provided 
    and the values are the specialized and more refined prompts that you came up with to solve the task. 
    Dont include any ```json tags, just return the JSON that has simple key-value pairs.
"""

Now, let's make the actual call to the LLM using the prompts from above.

In [4]:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo", 
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ],
    stream=False,
)
# Get the response content which should include the task breakdown
print(json.dumps(json.loads(response.choices[0].message.content), indent=4))

{
    "Determine travel dates and duration": "Identify the start and end dates of the trip as February 20 to March 1 and calculate the duration as one week",
    "Find flights to Spain": "Search for flights from the user's preferred airport to Spain from February 20 to March 1, considering factors such as budget, flight duration, and layovers",
    "Book accommodations in Spain": "Find available hotels in Spain for the entire duration of the trip from February 20 to March 1, considering factors such as budget, location, and user reviews",
    "Research popular destinations and activities in Spain": "Identify top tourist attractions, cultural events, and experiences in Spain that the user may be interested in, such as visiting Madrid, Barcelona, or Seville",
    "Create a daily schedule for the trip": "Plan a daily itinerary for the user's trip to Spain, including transportation, meals, and activities, from February 20 to March 1",
    "Provide transportation options within Spain": "Res

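The call above passes the reply straight to `json.loads`, which raises an error if the model wraps its answer in a Markdown code fence despite the instruction not to. As a hedged sketch of a more forgiving parser, the hypothetical helper `parse_json_reply` below strips an optional fence before parsing. It is not part of the original notebook and handles only this one failure mode.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]  # drop the closing fence line
        cleaned = "\n".join(lines)
    return json.loads(cleaned)


print(parse_json_reply('{"Find flights to Spain": "Search for flights"}'))
```

Replies that are not valid JSON after the fence is removed still raise `json.JSONDecodeError`, so the caller can decide whether to retry.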
Before we wrap this call in a function, let's think about how we can evaluate its effectiveness using Judgment. We want a way to assess how well the language model breaks down the original request into different subtasks, as we instructed it to do.

This is a perfect opportunity to apply the `AnswerRelevancyScorer` from Judgment. This scorer helps us determine whether our individual steps succeeded in addressing the user's request. For example, when breaking down tasks, we want to ensure that our subtask for finding hotels searches for the correct date range, or that our transportation plan correctly references the areas we're traveling to.

As per the docs:
> "AnswerRelevancy scores are calculated by extracting statements made in the actual_output and 
> classifying how many are relevant to the input.
> 
> The score is calculated as:
> relevancy score = relevant statements / total statements"

By setting a threshold (e.g., 0.8), we can determine whether each part of our solution is sufficiently 
aligned with the user's request. This helps us catch issues such as:
- The “Find Hotels” subtask returning results outside the intended date range.
- The “Plan Transportation” subtask referencing the wrong location.
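To make the formula and the threshold concrete, here is a small worked example. It assumes the statements have already been extracted and classified as relevant or not; how the scorer does that is not shown, and the flags below are made up for illustration.

```python
def relevancy_score(relevant_flags):
    """relevant statements / total statements, as in the quoted formula."""
    if not relevant_flags:
        return 0.0  # assumption: an output with no statements scores zero
    return sum(relevant_flags) / len(relevant_flags)


# Hypothetical classification of five statements extracted from one output.
flags = [True, True, True, False, False]
score = relevancy_score(flags)
threshold = 0.8
print(f"score={score:.2f}, passes={score >= threshold}")
```

With three relevant statements out of five the score is 0.6, which falls below a threshold of 0.8, so this output would be flagged.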

The `AnswerRelevancyScorer` flags parts of the output that aren’t relevant to the user’s needs, helping us 
quantitatively evaluate how well each subtask performs. You can read more about Judgment's suite of scoring models here: https://judgment.mintlify.app/introduction

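```
# Illustrative snippet: `client` here is assumed to be a Judgment evaluation
# client (not the Together client created above), and `example` an example
# object holding the input and the actual output to score.
scorer = AnswerRelevancyScorer(threshold=0.8)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
```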

This is also a good place to add Judgment's tracing. Given the system and user prompts, the LLM produces an output. Since we are scoring that output, it helps to see which inputs are associated with each score and output, and that's where tracing comes in. We can simply add a decorator to our functions:

`@judgment.observe(span_type="tool", overwrite=True)`
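To illustrate the general idea behind such a decorator, here is a toy version that records each call's inputs and output in a list. This is only a conceptual sketch and not Judgment's implementation: `toy_observe` and `TRACE_LOG` are hypothetical names invented for this example.

```python
import functools

TRACE_LOG = []  # hypothetical in-memory store used only by this sketch


def toy_observe(span_type="tool"):
    """Toy decorator that records each call's inputs and output."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            TRACE_LOG.append({
                "span_type": span_type,
                "function": func.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
            })
            return result
        return wrapper
    return decorator


@toy_observe(span_type="tool")
def count_words(text):
    return len(text.split())


count_words("one week in Spain")
print(TRACE_LOG[-1])
```

The real `@judgment.observe` decorator is applied to a function in the same way.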

With `@judgment.observe` in place, the inputs and outputs are traced every time we call the decorated function. Let's put this all together now:

In [None]:

@judgment.observe(span_type="tool", overwrite=True)
def main():
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo", 
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
    )

    judgment.get_current_trace().async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=user_message,
        actual_output=response.choices[0].message.content,
        model="gpt-4",
    )
    
    return json.loads(response.choices[0].message.content)

task_breakdown = main()
print(json.dumps(task_breakdown, indent=4))


{
    "Determine travel dates and duration": "Confirm the travel dates as February 20 to March 1 and calculate the duration as one week",
    "Find flights and hotels for the trip": "Search for flights from the user's preferred airport to Spain from February 20 to March 1 and find available hotels in the desired location for the entire duration of the trip, considering factors such as budget, location, and user reviews",
    "Plan daily activities and sightseeing": "Research and suggest popular tourist attractions, cultural events, and activities in Spain for each day of the trip from February 20 to March 1, taking into account the user's interests and preferences",
    "Create a transportation plan": "Develop a plan for transportation within Spain, including options for getting from the airport to the hotel, traveling between cities, and getting around local areas, considering factors such as cost, convenience, and time efficiency",
    "Research and book restaurants and local experie

#### MoA

This helper function lets us search for real-world information related to our tasks.

In [8]:
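def search_tavily(query):
    """Fetch travel data using Tavily API."""
    api_key = os.getenv("TAVILY_API_KEY")
    # Use a distinct name so we don't shadow the Together `client` defined earlier
    tavily_client = TavilyClient(api_key=api_key)
    # The Tavily SDK calls this parameter `max_results`
    results = tavily_client.search(query, max_results=3)
    return results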
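The raw search response is later interpolated into the prompt as-is, which can spend a lot of tokens on metadata. As a hedged sketch, the hypothetical helper below condenses a response into a short text block. It assumes the response is a dictionary with a `results` list whose items carry `title` and `content` fields; check the response you actually receive before relying on that shape.

```python
def summarize_search_results(response, max_chars=500):
    """Turn a search response into a compact text block for use as prompt context."""
    lines = []
    for item in response.get("results", []):
        title = item.get("title", "untitled")
        snippet = item.get("content", "")[:max_chars]  # cap each snippet's length
        lines.append(f"- {title}: {snippet}")
    return "\n".join(lines)


# Made-up response, used only to show the shape this sketch expects.
fake_response = {
    "results": [
        {"title": "Madrid in winter", "url": "https://example.com/a",
         "content": "Mild days, cold nights."},
        {"title": "Trains in Spain", "url": "https://example.com/b",
         "content": "High-speed trains link major cities."},
    ]
}
print(summarize_search_results(fake_response))
```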

Here is where we implement MoA. This is the middle portion of our diagram from the beginning. For each task, multiple agents/LLMs work on it, and then we aggregate their results into our final response. Here's what those models and prompts look like:

In [9]:
reference_models = [
    "Qwen/Qwen2-72B-Instruct",
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "databricks/dbrx-instruct",
]

aggregator_model = "mistralai/Mixtral-8x22B-Instruct-v0.1"
aggreagator_system_prompt = """
    You have been provided with a set of responses from various open-source models to the latest user query. 
    Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically 
    evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. 
    Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive 
    reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of 
    accuracy and reliability.

    Responses from models:
"""


Now let's modularize our code so that we can easily call any one of our reference models. We will use Together's `AsyncTogether` client so that we can asynchronously launch each agent with the prompt and then wait for all of them to come back with a response.

In [10]:
async def run_llm(model, task_prompt, context):
    """Run a single LLM call with a reference model."""
    response = await async_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{task_prompt}. Here is some additional context to help you: {context}"}],
        temperature=0.7,
        max_tokens=512,
    )
    return response.choices[0].message.content
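The fan-out pattern this function enables can be tried without any network access. The sketch below replaces the model call with a stand-in coroutine, `fake_llm`, that just sleeps; the model names and delays are made up. It shows that `asyncio.gather` runs the calls concurrently and returns results in the order the coroutines were passed in, not the order they finish.

```python
import asyncio


async def fake_llm(model, delay):
    """Stand-in for a network call; sleeps instead of contacting a model."""
    await asyncio.sleep(delay)
    return f"answer from {model}"


async def fan_out():
    # Hypothetical model names and delays, chosen only for illustration.
    jobs = [
        fake_llm("model-a", 0.03),
        fake_llm("model-b", 0.01),
        fake_llm("model-c", 0.02),
    ]
    # gather awaits all coroutines concurrently and keeps the input order.
    return await asyncio.gather(*jobs)


results = asyncio.run(fan_out())
print(results)
```

Inside a notebook, where an event loop is already running, use `results = await fan_out()` instead of `asyncio.run(...)`.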

Now let's write the function that will aggregate all the responses from our agents. This is another critical moment for evaluation - we need to assess how effectively we combine the research from individual agents into a cohesive itinerary.

For this aggregation phase, we'll again use Judgment's `AnswerRelevancyScorer`, this time to evaluate how well the aggregated answer for each task addresses that task's prompt. We're checking whether the combined output, which draws on the responses of all the reference models, effectively delivers what the task asked for. For instance, the hotel search could mistakenly return options outside the requested date range, or the transportation plan could reference the wrong destination.

By tracing these outputs through Judgment, we can identify any potential disconnects between the research phase and the final aggregation, ensuring our multi-agent orchestration delivers truly helpful travel plans.

```
scorer = AnswerRelevancyScorer(threshold=0.8)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
```


In [None]:
@judgment.observe(span_type="tool", overwrite=True)
async def run_aggregator(task_prompt, context):
    results = await asyncio.gather(*[run_llm(model, task_prompt, context) for model in reference_models])

    finalStream = client.chat.completions.create(
        model=aggregator_model,
        messages=[
            {"role": "system", "content": aggreagator_system_prompt},
            {"role": "user", "content": ",".join(str(element) for element in results)},
        ],
    )

    judgment.get_current_trace().async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=task_prompt,
        actual_output=finalStream.choices[0].message.content,
        model="gpt-4",
    )
    return finalStream.choices[0].message.content
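`run_aggregator` joins the reference responses with commas. Because the responses themselves contain commas, the aggregator may have trouble telling where one ends and the next begins. One possible alternative, sketched below with a hypothetical helper that the notebook does not use, is to label each response explicitly.

```python
def format_responses(responses):
    """Label each reference response so the aggregator can tell them apart."""
    return "\n\n".join(
        f"Response {i}:\n{text.strip()}"
        for i, text in enumerate(responses, start=1)
    )


# Made-up responses for illustration.
sample = ["Stay in central Madrid.", "  Take the train to Seville.  "]
print(format_responses(sample))
```

The formatted string could be passed as the user message in place of the comma-joined one.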

Now let's use the functions we wrote above to launch a MoA for each of our tasks. We will get a trace URL for each task so we can evaluate each of them separately.

In [13]:

task_outputs = {}
for task in task_breakdown:
    print(f"Working on task: {task}")
    task_prompt = task_breakdown[task]
    
    # Fetch some additional context using Tavily
    context = search_tavily(task_prompt)
    
    taskOutput = await run_aggregator(task_prompt, context)
    task_outputs[task] = taskOutput


Working on task: Determine travel dates and duration


Working on task: Find flights and hotels for the trip


Working on task: Plan daily activities and sightseeing


Working on task: Create a transportation plan


Working on task: Research and book restaurants and local experiences


Working on task: Prepare a budget and financial plan


#### Final Itinerary

Now we put things together to build the final complete itinerary! We take the aggregated results from the MoA for each task and pass them to one model, which curates an itinerary based on all the information we collected. This is the rightmost part of our initial diagram. Again, we can apply Judgment's tracing and scoring tools to see how faithful our final answer is to the user's original prompt! This time we use a different scoring metric, `FaithfulnessScorer`.

Let's come up with our prompts first:

In [14]:
system_prompt = """
    You are an expert travel planner who creates cohesive, well-structured itineraries.
    Your task is to create a final, comprehensive response that combines specialized information
    from different agents into a single, flowing itinerary that addresses the user's original request.

    The final response should:
    1. Start with a brief introduction to the trip
    2. Organize information in a logical, chronological structure (day by day)
    3. Seamlessly integrate travel logistics, accommodations, meals, and activities
    4. Ensure there are no scheduling conflicts or logistical impossibilities
    5. Add transitions between sections to create a natural flow
    6. End with a brief conclusion

    Format the itinerary professionally, with clear headings, and make it easy to follow.
"""

user_message = f"""
    Original user request: "{user_prompt}"

    Specialized agent responses:

    {json.dumps(task_outputs, indent=2)}

    Please create a cohesive, well-structured final response that combines all this information
    into a comprehensive itinerary. Organize it in a logical way (day by day) and ensure the whole
    itinerary flows naturally and makes logistical sense.
"""

Now let's make one final LLM call to put together all the results we got from our fleet of agents!

This is the perfect place to introduce Judgment's `FaithfulnessScorer`. This scoring model will help us verify that our final itinerary accurately reflects all the information gathered during our research phase. When building a complex itinerary, it's critical that we don't misinterpret which restaurants we'll visit on specific days, mix up attraction details, or create logistical impossibilities.

By using the `FaithfulnessScorer`, we'll measure how accurately our final itinerary incorporates the specialized information collected by our agents. The `actual_output` parameter will represent our compiled itinerary, while the `retrieval_context` parameter will contain all the researched information our agents collected - ensuring we create a travel plan that's not just coherent, but factually accurate according to our research. 

In [None]:
@judgment.observe(span_type="tool", overwrite=True)
def compile_final_itinerary():
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )

    judgment.get_current_trace().async_evaluate(
        scorers=[FaithfulnessScorer(threshold=0.5)],
        input=user_prompt,
        actual_output=response.choices[0].message.content,
        retrieval_context=list(task_outputs.values()),  # pass a list rather than a dict view
        model="gpt-4",
    )

    return response.choices[0].message.content

final_itinerary = compile_final_itinerary()
print(final_itinerary)

**Introduction to Your Spanish Adventure**
From February 20 to March 1, 2025, you'll embark on a 9-day journey through Spain, exploring its vibrant cities, rich culture, and breathtaking landscapes. This itinerary combines the best of travel logistics, accommodations, meals, and activities to ensure a memorable experience.

### Day 1: February 20 - Arrival in Madrid
- **Morning:** Arrive at Adolfo Suárez Madrid–Barajas Airport (MAD). Use a private transfer service like Viator or GetTransfer for a convenient and stress-free journey to your hotel.
- **Afternoon:** Check-in at your hotel and explore the nearby area. Consider staying in the city center for easy access to major attractions.
- **Evening:** Visit the Retiro Park, one of Madrid's most beautiful green spaces, and enjoy a tapas tour in the city center to get a taste of local cuisine.

### Day 2: February 21 - Madrid
- **Morning:** Ski at Puerto de Navacerrada, enjoying the views of Sierra de Guadarrama. Rent equipment and book l

### Conclusion

And there we go: together we built a fully functioning travel itinerary creator that uses a mixture of agents to solve each task, and we evaluated the output of every step using Judgment's tracing and scoring models!