# KeywordsAI Multi-Modal Tool Evaluation Workflow Demo

This notebook demonstrates the complete workflow for evaluating LLM agents with tool calls using KeywordsAI.

## Overview
1. **Agent Demo**: Run travel assistant with multi-modal inputs
2. **Log Management**: Fetch and analyze logs
3. **Evaluator Creation**: Create custom LLM evaluators  
4. **Testset Management**: Create testsets from logs
5. **Experiment Execution**: Compare prompt versions
6. **Results Analysis**: Evaluate tool call accuracy


## Step 0: Create the prompt for the agent
Open `src/example_workflows/multi_modal_tool_evals/prompts/traveling_agent_prompt.md` and copy the prompt into the KeywordsAI UI.

## Step 1: Run Agent Demo

```
cd src/example_workflows/multi_modal_tool_evals
python3 agent.py
```

## Step 2: Fetch and Analyze Logs

Fetch the logs produced by the agent run and inspect the structure of the first result.


In [None]:
# Fetch logs from the last day filtered by evaluation identifier
from example_workflows.multi_modal_tool_evals.logs import get_logs
from datetime import datetime, timedelta
from example_workflows.multi_modal_tool_evals.constants import EVALUATION_IDENTIFIER

logs = get_logs(
    start_time=datetime.now() - timedelta(days=1),
    end_time=datetime.now(),
    filters={
        "evaluation_identifier": {
            "value": EVALUATION_IDENTIFIER,
        },
    },
)

if logs and logs.get("results"):
    print(f"📊 Found {len(logs['results'])} logs")

    # Analyze first log structure
    first_log = logs["results"][0]
    print(f"\n🔍 Sample log structure:")
    print(f"- ID: {first_log['id']}")
    print(f"- Customer ID: {first_log['keywordsai_params'].get('customer_identifier')}")
    print(
        f"- Variables: {list(first_log['keywordsai_params'].get('variables', {}).keys())}"
    )
else:
    print("⚠️  No logs found - using demo data for the workflow")

⚠️  No logs found - using demo data for the workflow
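
When logs are returned, it can help to see which prompt variables appear across them before building a testset. The following is a minimal sketch, assuming each log has the `keywordsai_params` → `variables` shape read in the cell above. The `count_variable_usage` helper and the `sample_logs` payload are hypothetical stand-ins, not real API output.

```python
from collections import Counter


def count_variable_usage(logs: dict) -> Counter:
    """Count how often each variable name appears across log results."""
    counts = Counter()
    for log in (logs or {}).get("results", []):
        params = log.get("keywordsai_params") or {}
        counts.update((params.get("variables") or {}).keys())
    return counts


# Hypothetical payload mirroring only the fields accessed in the cell above
sample_logs = {
    "results": [
        {"id": "log-1", "keywordsai_params": {"variables": {"category": "beach", "name": "Mike"}}},
        {"id": "log-2", "keywordsai_params": {"variables": {"category": "mountain"}}},
        {"id": "log-3", "keywordsai_params": {}},
    ]
}

usage = count_variable_usage(sample_logs)
print(dict(usage))
```

Logs without a `variables` entry are skipped rather than raising, which matches the defensive `.get(...)` style used in the cell above.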


## Step 3: Create Custom Evaluator

Create a tool call accuracy evaluator to assess how well the agent uses tools.


In [4]:
# Create tool call accuracy evaluator
evaluator_data = create_llm_evaluator(
    evaluator_slug='tool_call_accuracy_demo',
    name='Tool Call Accuracy Evaluator (Demo)',
    evaluator_definition='Evaluate whether the AI agent correctly identified the need for tool calls and called the appropriate tools based on user input and context.',
    scoring_rubric='''
    Score 1.0: Perfect tool usage - called all necessary tools with correct parameters, no unnecessary calls
    Score 0.8: Good tool usage - called most necessary tools correctly, minor parameter issues or one unnecessary call
    Score 0.6: Adequate tool usage - called some necessary tools but missed important ones or had parameter errors
    Score 0.4: Poor tool usage - called wrong tools or missed most necessary tool calls
    Score 0.2: Very poor - called completely inappropriate tools or failed to call any necessary tools
    Score 0.0: No tool calls when tools were clearly needed, or completely wrong tool usage
    
    Consider:
    - Did the agent call search_places when user specified a travel category?
    - Did the agent call check_weather when user requested weather information?
    - Did the agent call find_hotels when user wanted hotel booking?
    - Were the tool parameters (location, category) correct?
    ''',
    description='Evaluates the accuracy and appropriateness of tool calls made by the travel assistant agent',
    min_score=0.0,
    max_score=1.0,
    passing_score=0.7
)

if evaluator_data:
    print('✅ Evaluator created successfully!')
    print(f'Evaluator ID: {evaluator_data.get("id")}')
    print(f'Evaluator slug: {evaluator_data.get("evaluator_slug")}')
else:
    print('❌ Failed to create evaluator - may already exist')
    
    # List existing evaluators
    existing_evaluators = list_evaluators()
    if existing_evaluators:
        print("\n📋 Existing evaluators:")
        for evaluator in existing_evaluators.get('results', []):
            if 'tool_call_accuracy' in evaluator.get('evaluator_slug', ''):
                print(f"- {evaluator['name']} ({evaluator['evaluator_slug']})")


✅ Evaluator created successfully!
Evaluator ID: 854bb98c-4529-490d-849a-f5d83054f314
Evaluator slug: tool_call_accuracy_demo
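
An LLM-graded score can be cross-checked against a deterministic baseline when the expected tools are known. The sketch below is not part of KeywordsAI: `tool_overlap_score` is a hypothetical helper that compares tool names only (not parameters) with an F1-style overlap, then applies the same `passing_score` of 0.7 used in the evaluator.

```python
def tool_overlap_score(expected: set[str], called: set[str]) -> float:
    """F1 overlap between expected and called tool names, in [0.0, 1.0]."""
    if not expected and not called:
        return 1.0  # nothing was needed and nothing was called
    hits = len(expected & called)
    if hits == 0:
        return 0.0
    precision = hits / len(called)   # share of calls that were expected
    recall = hits / len(expected)    # share of expected tools that were called
    return 2 * precision * recall / (precision + recall)


PASSING_SCORE = 0.7  # same threshold as the evaluator above

expected = {"search_places", "find_hotels", "check_weather"}
called = {"search_places", "check_weather"}
score = tool_overlap_score(expected, called)
print(f"score={score:.2f} passed={score >= PASSING_SCORE}")
```

A large gap between this baseline and the LLM evaluator's score on the same row is a hint that either the rubric or the expected tool list needs a second look.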


## Step 4: Create Testset

Create a testset with test scenarios for the travel agent.


In [5]:
# Create testset for travel agent evaluation
testset = create_testset(
    name="Travel Agent Multi-Modal Testset (Demo)",
    description="Test dataset for travel agent with multi-modal inputs and tool calls",
    column_definitions=[
        {"field": "category"},
        {"field": "name"}, 
        {"field": "is_booking_hotel"},
        {"field": "is_checking_weather"},
        {"field": "has_image"},
        {"field": "expected_tools"}
    ]
)

if testset:
    print(f"✅ Testset created: {testset['name']} (ID: {testset['id']})")
    
    # Add sample test scenarios
    sample_rows = [
        {
            'row_data': {
                'category': 'beach',
                'name': 'Mike (Beach Lover)',
                'is_booking_hotel': False,
                'is_checking_weather': False,
                'has_image': True,
                'expected_tools': 'search_places'
            }
        },
        {
            'row_data': {
                'category': 'mountain',
                'name': 'Sarah (Adventure Seeker)', 
                'is_booking_hotel': True,
                'is_checking_weather': True,
                'has_image': True,
                'expected_tools': 'search_places,find_hotels,check_weather'
            }
        }
    ]
    
    rows_result = create_testset_rows(testset['id'], sample_rows)
    if rows_result:
        print(f"✅ Added {len(sample_rows)} test scenarios to testset")
    else:
        print("❌ Failed to add test scenarios")
else:
    print("❌ Failed to create testset")


✅ Testset created: Travel Agent Multi-Modal Testset (Demo) (ID: 24501fc9ac1245d5b7042ffb8175b339)
✅ Added 2 test scenarios to testset
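
In the two sample rows, `expected_tools` follows directly from the scenario flags. A minimal sketch of deriving it instead of typing it by hand is shown below. The mapping is inferred from the sample rows and the evaluator rubric, and `expected_tools_for` is a hypothetical helper, not part of the workflow's modules.

```python
def expected_tools_for(row_data: dict) -> str:
    """Build the comma-separated expected_tools value from scenario flags."""
    tools = []
    if row_data.get("category"):
        tools.append("search_places")
    if row_data.get("is_booking_hotel"):
        tools.append("find_hotels")
    if row_data.get("is_checking_weather"):
        tools.append("check_weather")
    return ",".join(tools)


scenario = {
    "category": "mountain",
    "name": "Sarah (Adventure Seeker)",
    "is_booking_hotel": True,
    "is_checking_weather": True,
    "has_image": True,
}
scenario["expected_tools"] = expected_tools_for(scenario)
print(scenario["expected_tools"])
```

Deriving the column this way keeps the flags and the expected tools from drifting apart as more scenarios are added.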


## Step 5: Create Experiment with Prompts API

Create an experiment using the prompts API to fetch actual prompt definitions rather than hardcoding them.


In [6]:
# Import prompts utilities
from example_workflows.multi_modal_tool_evals.prompts import list_prompts, get_prompt

# Find an existing travel agent prompt (matched by name)
prompts = list_prompts()
travel_prompt = None

if prompts and prompts.get('results'):
    for prompt in prompts['results']:
        if 'travel' in prompt['name'].lower():
            travel_prompt = prompt
            break

if travel_prompt:
    print(f"📋 Found travel prompt: {travel_prompt['name']}")
    prompt_details = get_prompt(travel_prompt['prompt_id'])
    
    # Create experiment using the prompt ID
    experiment = create_experiment(
        name="Travel Agent Evaluation (Prompt-based)",
        description="Comparing travel agent prompt versions using prompts API",
        columns=[
            {
                "id": prompt_details['prompt_id'],
                "model": "gpt-4o",
                "name": "Travel Agent v1",
                "prompt_id": prompt_details['prompt_id'],
                "prompt_version": 1,
                "temperature": 0.7,
                "tools": [
                    {
                        "type": "function",
                        "function": {
                            "name": "search_places",
                            "description": "Search for places based on landscape category",
                            "parameters": {
                                "type": "object",
                                "properties": {"category": {"type": "string"}},
                                "required": ["category"]
                            }
                        }
                    }
                ]
            }
        ],
        rows=[
            {"input": {"category": "beach", "name": "Mike", "is_booking_hotel": False}},
            {"input": {"category": "mountain", "name": "Sarah", "is_booking_hotel": True}}
        ]
    )
    
    if experiment:
        print(f"✅ Experiment created: {experiment['name']}")
        print(f"ID: {experiment['id']}")
        print(f"Using prompt ID: {prompt_details['prompt_id']}")
    else:
        print("❌ Failed to create experiment")
else:
    print("⚠️ No travel agent prompt found - create one first (see Step 0)")


📋 Found travel prompt: Travel Agent Demo Prompt
Error: 400 Client Error: Bad Request for url: http://localhost:8000/api/experiments/create
Response: {'columns': ["Invalid column: [{'type': 'missing', 'loc': ('columns', 0, 'max_completion_tokens'), 'msg': 'Field required', 'input': {'id': 'ce02385ae81b4360ae0fa31e0cc9aa65', 'model': 'gpt-4o', 'name': 'Travel Agent v1', 'prompt_id': 'ce02385ae81b4360ae0fa31e0cc9aa65', 'prompt_version': 1, 'temperature': 0.7, 'tools': [{'type': 'function', 'function': {'name': 'search_places', 'description': 'Search for places based on landscape category', 'parameters': {'type': 'object', 'properties': {'category': {'type': 'string'}}, 'required': ['category']}}}]}}, {'type': 'missing', 'loc': ('columns', 0, 'top_p'), 'msg': 'Field required', 'input': {'id': 'ce02385ae81b4360ae0fa31e0cc9aa65', 'model': 'gpt-4o', 'name': 'Travel Agent v1', 'prompt_id': 'ce02385ae81b4360ae0fa31e0cc9aa65', 'prompt_version': 1, 'temperature': 0.7, 'tools': [{'type': 'function
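
The 400 response above is a validation error: the column is missing at least `max_completion_tokens` and `top_p`. Because the response is truncated, other fields may be required as well. The following is a minimal sketch of filling those two fields before calling `create_experiment`. Only the field names come from the error; the default values and the `with_required_fields` helper are assumptions for illustration, not documented API defaults.

```python
# Assumed values; only the field names are taken from the error response above.
ASSUMED_COLUMN_DEFAULTS = {
    "max_completion_tokens": 1024,
    "top_p": 1.0,
}


def with_required_fields(column: dict, defaults: dict = ASSUMED_COLUMN_DEFAULTS) -> dict:
    """Return a copy of the column with missing fields filled from defaults."""
    # Values already present in the column take precedence over the defaults.
    return {**defaults, **column}


column = {
    "model": "gpt-4o",
    "name": "Travel Agent v1",
    "prompt_version": 1,
    "temperature": 0.7,
}
fixed = with_required_fields(column)
filled = [key for key in ASSUMED_COLUMN_DEFAULTS if key not in column]
print(f"Filled fields: {filled}")
```

The original column dict is left unchanged, so the same base definition can be reused for several experiment columns.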

## Step 6: Run Experiment and Evaluation

Execute the experiment and apply the evaluator to get tool call accuracy scores.


In [7]:
# Run the experiment (if we have one)
if 'experiment' in locals() and experiment:
    experiment_id = experiment['id']
    
    print(f"🚀 Running experiment: {experiment_id}")
    run_result = run_experiment(experiment_id)
    
    if run_result:
        print("✅ Experiment completed successfully!")
        
        # Run evaluations
        print("📊 Running tool call accuracy evaluation...")
        # Use the slug of the evaluator created in Step 3
        eval_result = run_experiment_evals(experiment_id, ['tool_call_accuracy_demo'])
        
        if eval_result:
            print("✅ Evaluation completed!")
            
            # Extract and display results
            print("\n📈 Results Summary:")
            for row in eval_result.get('rows', []):
                row_input = row.get('input', {})
                print(f"\nTest Case: {row_input.get('name', 'Unknown')} - {row_input.get('category', 'Unknown')}")
                
                for result in row.get('results', []):
                    column_name = result.get('column_name', 'Unknown Column')
                    eval_results = result.get('evaluation_result', {})
                    
                    for evaluator_name, eval_data in eval_results.items():
                        score = eval_data.get('results', {}).get('primary_score', 'N/A')
                        print(f"  {column_name} [{evaluator_name}]: {score}/1.0")
                        
        else:
            print("❌ Evaluation failed")
    else:
        print("❌ Experiment run failed")
else:
    print("⚠️ No experiment to run - create one first")


⚠️ No experiment to run - create one first
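
Once an evaluation does return rows, per-row scores are easier to compare when averaged per column. A minimal sketch follows, assuming the `rows` → `results` → `evaluation_result` → `primary_score` nesting that the cell above reads. The `mean_scores_by_column` helper and the `sample_eval_result` payload are hypothetical, not recorded API output.

```python
from collections import defaultdict
from statistics import mean


def mean_scores_by_column(eval_result: dict) -> dict:
    """Average numeric primary_score values per experiment column."""
    scores = defaultdict(list)
    for row in eval_result.get("rows", []):
        for result in row.get("results", []):
            column = result.get("column_name", "Unknown Column")
            for eval_data in result.get("evaluation_result", {}).values():
                score = eval_data.get("results", {}).get("primary_score")
                if isinstance(score, (int, float)):  # skip missing scores
                    scores[column].append(score)
    return {column: mean(values) for column, values in scores.items()}


# Hypothetical payload shaped like the structure the cell above iterates over
sample_eval_result = {
    "rows": [
        {"input": {"name": "Mike"}, "results": [
            {"column_name": "Travel Agent v1",
             "evaluation_result": {"tool_call_accuracy_demo": {"results": {"primary_score": 1.0}}}}]},
        {"input": {"name": "Sarah"}, "results": [
            {"column_name": "Travel Agent v1",
             "evaluation_result": {"tool_call_accuracy_demo": {"results": {"primary_score": 0.5}}}}]},
    ]
}

averages = mean_scores_by_column(sample_eval_result)
print(averages)
```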


## Summary

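This notebook walked through the KeywordsAI evaluation workflow end to end. In the run recorded above, Steps 3 and 4 succeeded, the log query returned no results, and experiment creation in Step 5 failed with a 400 validation error, so Step 6 had no experiment to run.

### ✅ Steps Covered: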
1. **Agent Demo** - Travel assistant with multi-modal inputs and tool calls
2. **Log Management** - Fetched logs with evaluation identifiers
3. **Evaluator Creation** - Custom LLM evaluator for tool call accuracy
4. **Testset Management** - Created structured test scenarios
5. **Experiment Creation** - Used prompts API for proper prompt management
6. **Evaluation Execution** - Ran experiments and applied evaluators

### 🎯 Key Benefits:
- **Systematic Evaluation** - Structured approach to LLM agent testing
- **Prompt Management** - Proper versioning and comparison of prompts
- **Multi-modal Support** - Text + image inputs for comprehensive testing
- **Tool Call Assessment** - Specific evaluation of agent tool usage
- **Scalable Framework** - Easy to add more evaluators and test cases

### 🔄 Next Steps:
- Add more evaluators (response quality, safety, etc.)
- Create larger testsets with diverse scenarios
- Compare different models and temperatures
- Set up automated evaluation pipelines
- Integrate with CI/CD for continuous evaluation
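
For the CI/CD item, one simple option is a gate that fails the build when the average evaluator score drops below the passing threshold. The sketch below is hypothetical: `ci_exit_code` is not a KeywordsAI function, the scores are placeholder values, and the 0.7 threshold mirrors the `passing_score` used in Step 3.

```python
def ci_exit_code(scores: list[float], passing_score: float = 0.7) -> int:
    """Return 0 when the mean score meets the threshold, else 1."""
    if not scores:
        return 1  # treat "no results" as a failure so empty runs are noticed
    average = sum(scores) / len(scores)
    return 0 if average >= passing_score else 1


# Placeholder scores; in practice these would come from an evaluation run
example_scores = [1.0, 0.8, 0.6]
code = ci_exit_code(example_scores)
print(f"exit code: {code}")
# In a CI script, finish with: sys.exit(code)
```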

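Once the experiment columns include the fields the API requires (the Step 5 error names `max_completion_tokens` and `top_p`), the workflow can be re-run end to end on KeywordsAI's evaluation platform.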
