# The agent as shown on the Noveum.ai platform

![The agent as shown on the Noveum.ai platform](image.png)

# Final Agent Evaluation Demo with NovaEval

This notebook demonstrates a streamlined approach to agent evaluation using modular utility functions:

1. **Load agent trace data** from JSON datasets
2. **Map trace spans** to AgentData format using utility functions
3. **Create and analyze** AgentDataset
4. **Evaluate agent performance** using AgentEvaluator with a Gemini model
5. **Analyze results** and export data



# Scorers Used

**context_relevancy_scorer** - Evaluates whether the agent response is appropriate and relevant given the agent's task and role.

**role_adherence_scorer** - Scores whether the agent's tool calls and response adhere to its assigned role and task.

**task_progression_scorer** - Measures whether the agent has made meaningful progress on the assigned task.

**tool_relevancy_scorer** - Assesses how relevant and appropriate the tool call is given the available tools and the agent's context.

**tool_correctness_scorer** - Compares actual tool calls against expected tool calls to evaluate correctness of tool usage and parameters.

**parameter_correctness_scorer** - Validates whether correct parameters were passed to tool calls by analyzing the tool results.
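
To make the comparison behind `tool_correctness_scorer` concrete, here is a minimal sketch of matching actual tool calls against expected ones. It is not NovaEval's implementation: the function name `tool_call_match_ratio` and the `{"name", "params"}` call shape are assumptions made for illustration, and the real scorer may rely on a model-based judgment rather than exact matching.

```python
from typing import Any

# Hypothetical call shape, assumed for this sketch: {"name": str, "params": dict}
ToolCall = dict[str, Any]


def tool_call_match_ratio(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Return the fraction of expected tool calls found in `actual`.

    A call counts as matched only if both the tool name and the
    parameters are equal. Each actual call can satisfy one expectation.
    """
    if not expected:
        # Nothing was expected: any actual call is treated as spurious.
        return 1.0 if not actual else 0.0

    remaining = list(actual)
    matched = 0
    for want in expected:
        for i, got in enumerate(remaining):
            same_name = got.get("name") == want.get("name")
            same_params = got.get("params") == want.get("params")
            if same_name and same_params:
                matched += 1
                del remaining[i]  # consume the matched call
                break
    return matched / len(expected)


expected_calls = [
    {"name": "search", "params": {"query": "api docs"}},
    {"name": "send_email", "params": {"to": "user@example.com"}},
]
actual_calls = [{"name": "search", "params": {"query": "api docs"}}]
print(tool_call_match_ratio(actual_calls, expected_calls))
```

Exact matching is strict on purpose: a call with the right tool but wrong parameters scores zero here, which is roughly the failure that `parameter_correctness_scorer` is described as catching.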

## Step 1: Import Dependencies and Utility Functions


In [None]:
# Import our custom utility functions
from demo_utils import (
    list_dataset_files,
    load_and_analyze_dataset,
    convert_spans_to_agent_dataset,
    analyze_dataset_statistics,
    setup_gemini_model,
    setup_agent_evaluator,
    run_evaluation,
    analyze_agent_behavior_patterns,
    export_processed_dataset,
    setup_logging,
    validate_environment,
    print_demo_summary
)

print("✅ All utility functions imported successfully!")




In [2]:
!python preprocess_filter.py dataset.json
!python preprocess_map.py dataset_filtered.json
!python preprocess_split_data.py dataset_filtered_mapped.json

Reading dataset.json...
Original dataset: 4198 records
Filtering spans...
After filtering: 3381 records
Converting tool output format...
Writing dataset_filtered.json...
Filtering complete! Output: dataset_filtered.json

Success! Created dataset_filtered.json
Reading dataset_filtered.json...
Input dataset: 3381 records
Mapping spans...
Writing dataset_filtered_mapped.json...
Mapping complete! Output: dataset_filtered_mapped.json

Success! Created dataset_filtered_mapped.json
Input file: dataset_filtered_mapped.json
Output directory: split_datasets

Loading dataset from dataset_filtered_mapped.json...
Loaded 3381 objects
Found 5 unique span names
Using hardcoded mapping: agent.query_generation -> agent_query_gen_dataset.json
  Wrote 533 objects to split_datasets/agent_query_gen_dataset.json
Using hardcoded mapping: tool:tavily_search_results_json:tavily_search_results_json -> tavily_search_results_dataset.json
  Wrote 2301 objects to split_datasets/tavily_search_results_dataset.json
Usi
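
The split step logged above groups records by span name and writes one file per group. The source of `preprocess_split_data.py` is not shown here, so the following is only a sketch of that idea: the function name `split_by_span_name`, the `span_name` record key, and the fallback file naming are assumptions. The two mapping entries in the usage example are taken from the log output.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_by_span_name(records, mapping, out_dir):
    """Group records by span name and write one JSON file per group.

    `mapping` sends a span name to an output file name. Span names
    without a mapping get a file name derived from the span name.
    Returns {file_name: number_of_records_written}.
    """
    groups = defaultdict(list)
    for record in records:
        # "span_name" is an assumed key; adjust to the real schema.
        groups[record.get("span_name", "unknown")].append(record)

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    counts = {}
    for span_name, items in groups.items():
        fallback = span_name.replace(".", "_").replace(":", "_") + "_dataset.json"
        file_name = mapping.get(span_name, fallback)
        (out_path / file_name).write_text(json.dumps(items, indent=2), encoding="utf-8")
        counts[file_name] = len(items)
    return counts
```

A call might look like this, assuming `records` is the list loaded from `dataset_filtered_mapped.json`:

```python
mapping = {
    "agent.query_generation": "agent_query_gen_dataset.json",
    "tool:tavily_search_results_json:tavily_search_results_json": "tavily_search_results_dataset.json",
}
counts = split_by_span_name(records, mapping, "split_datasets")
```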

In [None]:
from demo_utils import run_complete_agent_evaluation

# Evaluate each of the split datasets
run_complete_agent_evaluation('split_datasets/agent_comment_gen_dataset.json')
run_complete_agent_evaluation('split_datasets/agent_query_gen_dataset.json')
run_complete_agent_evaluation('split_datasets/email_gen_send_dataset.json')
run_complete_agent_evaluation('split_datasets/post_validation_dataset.json')
run_complete_agent_evaluation('split_datasets/tavily_search_results_dataset.json')

2025-09-26 04:14:48 - noveum_trace.transport.batch_processor - INFO - 🔄 Batch processor background thread started (batch_size=100, timeout=5.0s)
2025-09-26 04:14:48 - noveum_trace.transport.batch_processor - INFO - Batch processor started with batch_size=100
2025-09-26 04:14:48 - noveum_trace.transport.http_transport - INFO - HTTP transport initialized for endpoint: https://api.noveum.ai/api
2025-09-26 04:14:48 - noveum_trace.core.client - INFO - Noveum Trace client initialized
2025-09-26 04:14:48,825 - INFO - novaeval.models.base - Noveum tracing initialized successfully


✅ All imports successful!
✅ list_dataset_files function defined!
✅ load_and_analyze_dataset function defined!
✅ parse_tools_from_prompt function defined!
✅ parse_params function defined!
✅ identify_span_type function defined!
✅ map_span_to_agent_data function defined!
✅ convert_spans_to_agent_dataset function defined!
✅ analyze_dataset_statistics function defined!
✅ setup_gemini_model function defined!
✅ setup_agent_evaluator function defined!
✅ run_evaluation function defined!
✅ analyze_agent_behavior_patterns function defined!
✅ export_processed_dataset function defined!
✅ setup_logging function defined!
✅ validate_environment function defined!
✅ print_demo_summary function defined!
✅ run_complete_agent_evaluation function defined!
🚀 Starting Complete Agent Evaluation Pipeline
📁 Processing file: split_datasets/tavily_search_results_dataset.json

📋 Step 1: Environment Setup
✅ Logging configured at INFO level
🔍 Environment validation:
  ✅ gemini_api_key: True
  ✅ pandas_available: True

Evaluating samples: 0it [00:00, ?it/s]

2025-09-26 04:14:48 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-09-26 04:14:50 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:14:50 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: 64390c07-1bb2-4e17-adab-0a37bb7d3881) - 1 spans
2025-09-26 04:14:50 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: 64390c07-1bb2-4e17-adab-0a37bb7d3881) - 1 spans
2025-09-26 04:14:50 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace 64390c07-1bb2-4e17-adab-0a37bb7d3881
2025-09-26 04:14:50 - noveum_trace.transport.http_transport - INFO - ✅ Trace 64390c07-1bb2-4e17-adab-0a37bb7d3881 successfully queued for export


2025-09-26 04:14:50 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-09-26 04:14:51 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:14:51 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: 0e52d952-1e90-4867-a442-d1ff54ce0f2e) - 1 spans
2025-09-26 04:14:51 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: 0e52d952-1e90-4867-a442-d1ff54ce0f2e) - 1 spans
2025-09-26 04:14:51 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace 0e52d952-1e90-4867-a442-d1ff54ce0f2e
2025-09-26 04:14:51 - noveum_trace.transport.http_transport - INFO - ✅ Trace 0e52d952-1e90-4867-a442-d1ff54ce0f2e successfully queued for export


2025-09-26 04:14:51 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-09-26 04:14:52 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:14:52 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: 2aa5c2c1-dc3e-4c5a-8776-ad0ca2e612e1) - 1 spans
2025-09-26 04:14:52 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: 2aa5c2c1-dc3e-4c5a-8776-ad0ca2e612e1) - 1 spans
2025-09-26 04:14:52 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace 2aa5c2c1-dc3e-4c5a-8776-ad0ca2e612e1
2025-09-26 04:14:52 - noveum_trace.transport.http_transport - INFO - ✅ Trace 2aa5c2c1-dc3e-4c5a-8776-ad0ca2e612e1 successfully queued for export


2025-09-26 04:14:52 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 1 samples
2025-09-26 04:14:52 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to evaluation_results/agent_evaluation/agent_evaluation_results.csv


Evaluating samples: 1it [00:03,  3.91s/it]

2025-09-26 04:14:52 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.


2025-09-26 04:14:54 - noveum_trace.transport.batch_processor - INFO - ⏰ TIMEOUT TRIGGER: Sending batch due to timeout (5.5s >= 5.0s)
2025-09-26 04:14:54 - noveum_trace.transport.batch_processor - INFO - 📤 SENDING BATCH: 3 traces via send_callback
2025-09-26 04:14:54 - noveum_trace.transport.http_transport - INFO - 🚀 SENDING BATCH: 3 traces to https://api.noveum.ai/api/v1/traces


2025-09-26 04:14:54 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:14:54 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: 0d472422-19e3-441f-9689-3dd46f60038c) - 1 spans
2025-09-26 04:14:54 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: 0d472422-19e3-441f-9689-3dd46f60038c) - 1 spans
2025-09-26 04:14:54 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace 0d472422-19e3-441f-9689-3dd46f60038c
2025-09-26 04:14:54 - noveum_trace.transport.http_transport - INFO - ✅ Trace 0d472422-19e3-441f-9689-3dd46f60038c successfully queued for export


2025-09-26 04:14:54 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.


2025-09-26 04:14:55 - noveum_trace.transport.http_transport - INFO - 📡 HTTP RESPONSE: Status 200 from https://api.noveum.ai/api/v1/traces
2025-09-26 04:14:55 - noveum_trace.transport.http_transport - INFO - ✅ Successfully sent batch of 3 traces
2025-09-26 04:14:55 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully sent batch of 3 traces via callback


2025-09-26 04:14:55 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:14:55 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: a4336f24-af5b-4da5-ad01-ec72cd6acaac) - 1 spans
2025-09-26 04:14:55 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: a4336f24-af5b-4da5-ad01-ec72cd6acaac) - 1 spans
2025-09-26 04:14:55 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace a4336f24-af5b-4da5-ad01-ec72cd6acaac
2025-09-26 04:14:55 - noveum_trace.transport.http_transport - INFO - ✅ Trace a4336f24-af5b-4da5-ad01-ec72cd6acaac successfully queued for export


2025-09-26 04:14:55 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-09-26 04:14:56 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:14:56 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: 167bb438-75e0-48f7-92a4-b16297320171) - 1 spans
2025-09-26 04:14:56 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: 167bb438-75e0-48f7-92a4-b16297320171) - 1 spans
2025-09-26 04:14:56 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace 167bb438-75e0-48f7-92a4-b16297320171
2025-09-26 04:14:56 - noveum_trace.transport.http_transport - INFO - ✅ Trace 167bb438-75e0-48f7-92a4-b16297320171 successfully queued for export


2025-09-26 04:14:56 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 2 samples
2025-09-26 04:14:56 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to evaluation_results/agent_evaluation/agent_evaluation_results.csv


Evaluating samples: 2it [00:07,  3.96s/it]

2025-09-26 04:14:56 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-09-26 04:14:58 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:14:58 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: 9e3bc805-f035-4257-85b0-e0061b17cb71) - 1 spans
2025-09-26 04:14:58 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: 9e3bc805-f035-4257-85b0-e0061b17cb71) - 1 spans
2025-09-26 04:14:58 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace 9e3bc805-f035-4257-85b0-e0061b17cb71
2025-09-26 04:14:58 - noveum_trace.transport.http_transport - INFO - ✅ Trace 9e3bc805-f035-4257-85b0-e0061b17cb71 successfully queued for export


2025-09-26 04:14:58 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-09-26 04:14:59 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:14:59 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: d6d18818-7c78-402e-8e67-a608b2950c4f) - 1 spans
2025-09-26 04:14:59 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: d6d18818-7c78-402e-8e67-a608b2950c4f) - 1 spans
2025-09-26 04:14:59 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace d6d18818-7c78-402e-8e67-a608b2950c4f
2025-09-26 04:14:59 - noveum_trace.transport.http_transport - INFO - ✅ Trace d6d18818-7c78-402e-8e67-a608b2950c4f successfully queued for export


2025-09-26 04:14:59 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.


2025-09-26 04:14:59 - noveum_trace.transport.batch_processor - INFO - ⏰ TIMEOUT TRIGGER: Sending batch due to timeout (5.3s >= 5.0s)
2025-09-26 04:14:59 - noveum_trace.transport.batch_processor - INFO - 📤 SENDING BATCH: 5 traces via send_callback
2025-09-26 04:14:59 - noveum_trace.transport.http_transport - INFO - 🚀 SENDING BATCH: 5 traces to https://api.noveum.ai/api/v1/traces
2025-09-26 04:15:00 - noveum_trace.transport.http_transport - INFO - 📡 HTTP RESPONSE: Status 200 from https://api.noveum.ai/api/v1/traces
2025-09-26 04:15:00 - noveum_trace.transport.http_transport - INFO - ✅ Successfully sent batch of 5 traces
2025-09-26 04:15:00 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully sent batch of 5 traces via callback


2025-09-26 04:15:00 - INFO - google_genai.models - AFC remote call 1 is done.


2025-09-26 04:15:00 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: 62ea9358-1667-4114-beed-30e94c560a89) - 1 spans
2025-09-26 04:15:00 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: 62ea9358-1667-4114-beed-30e94c560a89) - 1 spans
2025-09-26 04:15:00 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace 62ea9358-1667-4114-beed-30e94c560a89
2025-09-26 04:15:00 - noveum_trace.transport.http_transport - INFO - ✅ Trace 62ea9358-1667-4114-beed-30e94c560a89 successfully queued for export


2025-09-26 04:15:00 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 3 samples
2025-09-26 04:15:00 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to evaluation_results/agent_evaluation/agent_evaluation_results.csv


Evaluating samples: 3it [00:11,  3.79s/it]

2025-09-26 04:15:00 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.


2025-09-26 04:15:01 - noveum_trace.transport.http_transport - INFO - 📤 EXPORTING TRACE: auto_trace_generate (ID: 88278c33-963b-4104-83a5-4334a6465c6b) - 1 spans
2025-09-26 04:15:01 - noveum_trace.transport.batch_processor - INFO - 📥 ADDING TRACE TO QUEUE: auto_trace_generate (ID: 88278c33-963b-4104-83a5-4334a6465c6b) - 1 spans
2025-09-26 04:15:01 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully queued trace 88278c33-963b-4104-83a5-4334a6465c6b
2025-09-26 04:15:01 - noveum_trace.transport.http_transport - INFO - ✅ Trace 88278c33-963b-4104-83a5-4334a6465c6b successfully queued for export
Evaluating samples: 3it [00:12,  4.19s/it]


KeyboardInterrupt: 

2025-09-26 04:15:04 - noveum_trace.transport.batch_processor - INFO - ⏰ TIMEOUT TRIGGER: Sending batch due to timeout (5.3s >= 5.0s)
2025-09-26 04:15:04 - noveum_trace.transport.batch_processor - INFO - 📤 SENDING BATCH: 2 traces via send_callback
2025-09-26 04:15:04 - noveum_trace.transport.http_transport - INFO - 🚀 SENDING BATCH: 2 traces to https://api.noveum.ai/api/v1/traces
2025-09-26 04:15:05 - noveum_trace.transport.http_transport - INFO - 📡 HTTP RESPONSE: Status 200 from https://api.noveum.ai/api/v1/traces
2025-09-26 04:15:05 - noveum_trace.transport.http_transport - INFO - ✅ Successfully sent batch of 2 traces
2025-09-26 04:15:05 - noveum_trace.transport.batch_processor - INFO - ✅ Successfully sent batch of 2 traces via callback
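
The log above shows traces being queued and then sent either when the batch is full (`batch_size=100`) or when a timeout passes (`TIMEOUT TRIGGER ... >= 5.0s`). The sketch below illustrates that size-or-timeout pattern in isolation. It is not the `noveum_trace` implementation: the `TimedBatcher` class, its method names, and the choice to measure the timeout from the last flush are all assumptions, and the real library runs this logic in a background thread.

```python
import time
from typing import Any, Callable


class TimedBatcher:
    """Collect items and send them when the batch is full or a timeout elapses."""

    def __init__(
        self,
        send: Callable[[list[Any]], None],
        batch_size: int = 100,
        timeout: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._batch_size = batch_size
        self._timeout = timeout
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._items: list[Any] = []
        self._last_flush = clock()

    def add(self, item: Any) -> None:
        """Queue an item; flush immediately once the batch is full."""
        self._items.append(item)
        if len(self._items) >= self._batch_size:
            self.flush()

    def poll(self) -> None:
        """Flush pending items if the timeout has elapsed since the last flush."""
        if self._items and self._clock() - self._last_flush >= self._timeout:
            self.flush()

    def flush(self) -> None:
        """Send whatever is queued and reset the timer."""
        if self._items:
            batch, self._items = self._items, []
            self._send(batch)
        self._last_flush = self._clock()
```

With only a handful of traces per five seconds, as in this run, the size trigger never fires and every batch goes out on the timeout, which matches the small batches of 2 to 5 traces in the log.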


# Analysis of poor scores in the comment generation agent

In [None]:
import pandas as pd

# Load the per-sample evaluation results for the comment generation agent
comment_gen = pd.read_csv("demo_results/agent_comment_gen_dataset/agent_evaluation_results.csv")

split_size = 3  # number of lowest-scoring samples to inspect

# Keep the lowest task_progression scores together with the scorer's reasoning
task_progression = comment_gen.sort_values(by='task_progression', ascending=True).iloc[:split_size][['task_progression', 'task_progression_reasoning']]

print("Task Progression:")
print()
for idx, row in task_progression.iterrows():
    print(f"Score = {row['task_progression']}")
    print(f"Reasoning = {row['task_progression_reasoning']}")
    print()  # blank line between samples

Task Progression:
Score = 1.8
Reasoning = The agent misunderstands the task.  Instead of providing information on exporting Zillow/Redfin data to Excel, it offers a link to an unrelated image generation API.  This shows minimal understanding and makes no progress towards the goal. The response is completely off-topic.

Score = 2.8
Reasoning = The agent's response shows some understanding of user frustration but fails to directly address the 'Unknown API error'.  Offering an unrelated API is off-topic. While empathetic, it doesn't solve the original problem, hindering task completion.  More focus on troubleshooting the error is needed.

Score = 2.8
Reasoning = The agent's response is polite and acknowledges the task, but it fails to directly address the Discord webhook instructions.  Instead, it offers an unrelated API suggestion. While helpful in a general sense, it shows minimal progress on the specific assigned task.



In [7]:
# Context Relevancy Analysis
context_relevancy = comment_gen.sort_values(by='context_relevancy', ascending=True).iloc[:3][['context_relevancy', 'context_relevancy_reasoning']]

print("Context Relevancy Analysis:")
print("=" * 50)
for idx, row in context_relevancy.iterrows():
    print(f"Score = {row['context_relevancy']}")
    print(f"Reasoning = {row['context_relevancy_reasoning']}")
    print()

Context Relevancy Analysis:
Score = 4.5
Reasoning = The response mentions an API, aligning with the post's topic. However, the provided API link is irrelevant; it's for image generation, not real estate data.  The agent demonstrates some understanding but fails to provide a helpful solution to the user's query regarding Zillow/Redfin data export.  The tone is appropriate.

Score = 6.5
Reasoning = The response offers a relevant suggestion (using an API) but the provided API link seems unrelated to screenshotting a div.  While the tone is appropriate, the lack of direct relevance to taking a div screenshot lowers the score.  It shows some understanding but falls short of a complete solution.

Score = 7.2
Reasoning = The response offers a relevant solution by suggesting an alternative API.  The tone is empathetic and helpful. However, it lacks explicit acknowledgment of the 'unknown API error,' focusing instead on a presumed GIF upscaling context.  More direct problem-solving would improv

In [8]:
# Role Adherence Analysis
role_adherence = comment_gen.sort_values(by='role_adherence', ascending=True).iloc[:3][['role_adherence', 'role_adherence_reasoning']]

print("Role Adherence Analysis:")
print("=" * 50)
for idx, row in role_adherence.iterrows():
    print(f"Score = {row['role_adherence']}")
    print(f"Reasoning = {row['role_adherence_reasoning']}")
    print()

Role Adherence Analysis:
Score = 4.5
Reasoning = The agent fails to provide relevant information for exporting Zillow/Redfin data to Excel.  The provided API link is unrelated to real estate data. The response is off-topic and doesn't fulfill the task.  There are no tool calls, further indicating a failure to address the prompt.

Score = 4.7
Reasoning = The agent fails to provide relevant information to answer the questions.  Instead of recommending coffee shops, it suggests an irrelevant API for each prompt. This demonstrates a significant deviation from the task, focusing on unrelated tools rather than addressing the core questions. The responses are inconsistent with an agent's expected role.

Score = 6.8
Reasoning = The agent offers a solution (external API) instead of directly addressing the 'unknown API error'. While empathetic, it deviates from simply commenting on the post.  No tool calls were made, which is consistent with the lack of specified tools. The response is partially
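
The three analysis cells above repeat the same pattern: sort by a scorer column, keep the lowest rows, and print each score with its reasoning. A small helper can capture that pattern. The name `lowest_scores` is hypothetical; the column convention (`<scorer>` plus `<scorer>_reasoning`) follows the columns used in the cells above.

```python
import pandas as pd


def lowest_scores(df: pd.DataFrame, scorer: str, n: int = 3) -> pd.DataFrame:
    """Return the n lowest-scoring rows for `scorer` with their reasoning."""
    reasoning_col = f"{scorer}_reasoning"
    missing = [col for col in (scorer, reasoning_col) if col not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    # A stable sort keeps the original row order among tied scores.
    ordered = df.sort_values(by=scorer, ascending=True, kind="stable")
    return ordered.head(n)[[scorer, reasoning_col]]


def print_lowest_scores(df: pd.DataFrame, scorer: str, n: int = 3) -> None:
    """Print score and reasoning for the n lowest-scoring rows."""
    for _, row in lowest_scores(df, scorer, n).iterrows():
        print(f"Score = {row[scorer]}")
        print(f"Reasoning = {row[scorer + '_reasoning']}")
        print()
```

With the `comment_gen` frame loaded earlier, `print_lowest_scores(comment_gen, "role_adherence")` would be a shorter way to write the role adherence cell, assuming the CSV contains both columns.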

In [2]:
from novapilot_utils import recommend_improvements

# Advanced usage with custom parameters
final_analysis, summaries, log_file = recommend_improvements(
    demo_results_dir="demo_results/",
    agent_doc_path="reddit_agent.md",
    log_dir="log",
    verbose=True
)

NOVAPILOT AGENT ANALYSIS - RECOMMEND IMPROVEMENTS
This function runs the complete analysis pipeline equivalent to
running the entire complete_analysis_demo.ipynb notebook.
Setup complete! Log file: log/analysis_log_20250927_034313.txt
Agent document loaded: 8492 characters
Found 5 dataset directories to process:
  - email_gen_send_dataset
  - agent_comment_gen_dataset
  - post_validation_dataset
  - agent_query_gen_dataset
  - tavily_search_results_dataset

Processing email_gen_send_dataset...
  Processing CSV: agent_evaluation_results.csv
    Making Gemini call for scorer: task_progression
    Making Gemini call for scorer: context_relevancy
    Making Gemini call for scorer: role_adherence
    Making Gemini call for scorer: tool_relevancy
    Making Gemini call for scorer: parameter_correctness
    Making summary call for email_gen_send_dataset

Processing agent_comment_gen_dataset...
  Processing CSV: agent_evaluation_results.csv
    Making Gemini call for scorer: task_progression
 

In [3]:
print(final_analysis)

Based on the analysis of the agent's workflow and the part-wise scorer feedback, the agent is experiencing a cascading failure that originates at the very beginning of its core logic. The initial, critical failure in query generation poisons the entire downstream process, leading to failures in every subsequent step.

The root cause is the `search_agent`'s complete misinterpretation of its task. Instead of generating relevant search queries based on the provided API's title and description, it creates queries for entirely unrelated topics. This single point of failure guarantees that the rest of the workflow cannot succeed:

1.  **Irrelevant Queries** (`agent_query_gen_dataset`) lead to...
2.  **Irrelevant Search Results** from Tavily (`tavily_search_results_dataset`), which also appears to have a separate implementation bug.
3.  **Zero Valid Posts**, as the irrelevant search results are correctly filtered out by the `post_validation` step, which then unhelpfully reports that nothing w