# Multilingual Agent Evaluation Results Visualization

This notebook creates interactive boxplots, heatmaps, and a bar chart to visualize the evaluation results of LLM agents across different models and languages.

In [45]:
import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path

In [46]:
# Load all result files
results_dir = Path('results')
results_data = []

for json_file in results_dir.glob('*.json'):
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        
    # Read model name and language from the JSON payload (the filename is only kept for reference)
    model_name = data.get('model_name', '')
    language = data.get('language', '')
    
    # Get metric scores
    metric_scores = data['evaluation_results']['global_metrics']['average_scores_by_metric']
    
    # Add each metric score as a separate row
    for metric_name, score in metric_scores.items():
        results_data.append({
            'model': model_name,
            'language': language,
            'metric': metric_name,
            'score': score,
            'filename': json_file.name
        })

# Create DataFrame
df = pd.DataFrame(results_data)
print(f"Loaded {len(df)} data points")
print(f"Models: {sorted(df['model'].unique())}")
print(f"Languages: {sorted(df['language'].unique())}")
print(f"Metrics: {sorted(df['metric'].unique())}")
df.head()

Loaded 84 data points
Models: ['HuggingFaceTB/SmolLM3-3B', 'Qwen/Qwen3-14B', 'Qwen/Qwen3-4B', 'Qwen/Qwen3-8B', 'mistralai/Mistral-Nemo-Instruct-2407', 'openai/gpt-oss-20b']
Languages: ['deu', 'eng', 'est', 'fra', 'nld', 'pol', 'por']
Metrics: ['Argument Correctness', 'Tool Correctness']


Unnamed: 0,model,language,metric,score,filename
0,Qwen/Qwen3-14B,deu,Tool Correctness,0.93,agent_Qwen3-14B_deu_Latn.json
1,Qwen/Qwen3-14B,deu,Argument Correctness,0.7375,agent_Qwen3-14B_deu_Latn.json
2,Qwen/Qwen3-14B,est,Tool Correctness,1.0,agent_Qwen3-14B_est_Latn.json
3,Qwen/Qwen3-14B,est,Argument Correctness,0.8333,agent_Qwen3-14B_est_Latn.json
4,mistralai/Mistral-Nemo-Instruct-2407,nld,Tool Correctness,0.763,agent_Mistral-Nemo-Instruct-2407_nld_Latn.json


In [47]:
# Clean model names for better visualization
def clean_model_name(model_name):
    """Extract clean model name from full model path"""
    if '/' in model_name:
        return model_name.split('/')[-1]
    return model_name

df['model_clean'] = df['model'].apply(clean_model_name)
print(f"Clean model names: {sorted(df['model_clean'].unique())}")

Clean model names: ['Mistral-Nemo-Instruct-2407', 'Qwen3-14B', 'Qwen3-4B', 'Qwen3-8B', 'SmolLM3-3B', 'gpt-oss-20b']


In [48]:
# Get unique metrics
metrics = sorted(df['metric'].unique())
n_metrics = len(metrics)

# Create combined subplot with all metrics
fig = make_subplots(
    rows=1, cols=n_metrics,
    subplot_titles=metrics,
    horizontal_spacing=0.1
)

colors = px.colors.qualitative.Set3

for i, metric in enumerate(metrics):
    # Filter data for this metric
    metric_data = df[df['metric'] == metric]
    
    # Sort models by median performance
    model_medians = metric_data.groupby('model_clean')['score'].median().sort_values()
    model_order = model_medians.index.tolist()
    
    # Create box plot for each model
    for j, model in enumerate(model_order):
        model_data = metric_data[metric_data['model_clean'] == model]
        model_scores = model_data['score']
        model_languages = model_data['language']
        
        # Create custom hover text with language information
        hover_text = [f"Model: {model}<br>Language: {lang}<br>Score: {score}" 
                      for lang, score in zip(model_languages, model_scores)]
        
        fig.add_trace(
            go.Box(
                y=model_scores,
                name=model,
                boxpoints='all',
                jitter=0.3,
                pointpos=-1.8,
                marker=dict(color=colors[j % len(colors)]),
                showlegend=False,
                text=hover_text,
                hovertemplate='%{text}<extra></extra>'
            ),
            row=1, col=i+1
        )
    
    # Update x-axis for this subplot
    fig.update_xaxes(
        tickangle=45,
        row=1, col=i+1
    )
    
    # Update y-axis for this subplot
    fig.update_yaxes(
        title_text="Score" if i == 0 else "",
        row=1, col=i+1
    )

fig.update_layout(
    title="Multilingual Agent Evaluation Results - All Metrics",
    height=600,
    showlegend=False
)

fig.show()

In [49]:
# Export interactive visualizations as HTML files
import os

# Create docs directory if it doesn't exist
docs_dir = 'docs'
os.makedirs(docs_dir, exist_ok=True)

print("Exporting interactive visualizations to HTML...")

# Export the combined subplot
fig.write_html(f"{docs_dir}/combined_metrics_boxplot.html")
print("✓ Exported combined metrics boxplot")

Exporting interactive visualizations to HTML...
✓ Exported combined metrics boxplot


In [50]:
# Create detailed heatmap breaking down by test categories
# Load detailed test results for each category
detailed_results = []

for json_file in results_dir.glob('*.json'):
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
        
    model_name = data.get('model_name', '')
    language = data.get('language', '')
    
    # Process individual test results
    test_results = data['evaluation_results']['test_results']
    
    for test_result in test_results:
        eval_item_id = test_result['eval_item_id']
        
        # Categorize based on ID ranges
        if 1 <= eval_item_id <= 5 or 41 <= eval_item_id <= 45:
            category = 'no_tool_call'
        elif 6 <= eval_item_id <= 25:
            category = 'absolute_date'
        elif 26 <= eval_item_id <= 40:
            category = 'relative_date'
        else:
            continue  # Skip if outside expected ranges
            
        # Extract metrics scores for this test
        found_metrics = set()
        for metric_data in test_result['metrics_data']:
            detailed_results.append({
                'model': model_name,
                'language': language,
                'category': category,
                'metric': metric_data['name'],
                'score': metric_data['score'],
                'eval_item_id': eval_item_id
            })
            found_metrics.add(metric_data['name'])
        
        # For no_tool_call category, if "Argument Correctness" is missing, assume 1.0
        if category == 'no_tool_call' and 'Argument Correctness' not in found_metrics:
            detailed_results.append({
                'model': model_name,
                'language': language,
                'category': category,
                'metric': 'Argument Correctness',
                'score': 1.0,
                'eval_item_id': eval_item_id
            })

# Create detailed DataFrame
detailed_df = pd.DataFrame(detailed_results)
detailed_df['model_clean'] = detailed_df['model'].apply(clean_model_name)

print(f"Loaded {len(detailed_df)} detailed test results")
print(f"Categories: {sorted(detailed_df['category'].unique())}")

# Create language-category combinations
detailed_df['lang_category'] = detailed_df['language'] + '_' + detailed_df['category']

detailed_df.head()

Loaded 3780 detailed test results
Categories: ['absolute_date', 'no_tool_call', 'relative_date']


Unnamed: 0,model,language,category,metric,score,eval_item_id,model_clean,lang_category
0,Qwen/Qwen3-14B,deu,no_tool_call,Tool Correctness,1.0,1,Qwen3-14B,deu_no_tool_call
1,Qwen/Qwen3-14B,deu,no_tool_call,Argument Correctness,1.0,1,Qwen3-14B,deu_no_tool_call
2,Qwen/Qwen3-14B,deu,no_tool_call,Tool Correctness,1.0,2,Qwen3-14B,deu_no_tool_call
3,Qwen/Qwen3-14B,deu,no_tool_call,Argument Correctness,1.0,2,Qwen3-14B,deu_no_tool_call
4,Qwen/Qwen3-14B,deu,no_tool_call,Tool Correctness,1.0,3,Qwen3-14B,deu_no_tool_call
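
The ID-range categorization in the cell above could be factored into a small helper so that the range boundaries are easy to test in isolation. This is only a minimal sketch: the names `CATEGORY_RANGES` and `categorize_eval_item` are hypothetical and not part of the notebook, and the ranges are simply copied from the `if`/`elif` chain.

```python
# Hypothetical helper: the ranges mirror the if/elif chain in the cell above.
CATEGORY_RANGES = {
    'no_tool_call': [(1, 5), (41, 45)],
    'absolute_date': [(6, 25)],
    'relative_date': [(26, 40)],
}


def categorize_eval_item(eval_item_id):
    """Return the category name for an ID, or None if it is outside all ranges."""
    for category, ranges in CATEGORY_RANGES.items():
        if any(low <= eval_item_id <= high for low, high in ranges):
            return category
    return None


# Count how many of the IDs 1..45 fall into each category.
counts = {}
for item_id in range(1, 46):
    category = categorize_eval_item(item_id)
    counts[category] = counts.get(category, 0) + 1
print(counts)
```

Returning `None` for unknown IDs plays the same role as the `continue` in the loop: the caller can skip any item that has no category.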


In [51]:
# Create heatmaps showing model performance across languages for each metric
for metric in metrics:
    metric_data = df[df['metric'] == metric]
    
    # Pivot table for heatmap
    pivot_data = metric_data.pivot_table(
        values='score', 
        index='model_clean', 
        columns='language', 
        aggfunc='mean'
    )
    
    # Create heatmap using plotly
    fig = px.imshow(
        pivot_data,
        title=f"{metric} Scores: Models vs Languages",
        labels=dict(x="Language", y="Model", color="Score"),
        color_continuous_scale="RdYlGn",
        aspect="auto"
    )
    
    # Add text annotations
    fig.update_traces(
        text=pivot_data.round(3).values,
        texttemplate="%{text}",
        textfont={"size": 10}
    )
    
    fig.update_layout(
        height=400,
        xaxis={'side': 'bottom'},
        yaxis={'side': 'left'}
    )
    
    fig.show()
    
    # Export to HTML
    metric_filename = metric.replace(' ', '_').replace('/', '_').lower()
    fig.write_html(f"{docs_dir}/{metric_filename}_heatmap.html")
    print(f"✓ Exported {metric} heatmap")

✓ Exported Argument Correctness heatmap


✓ Exported Tool Correctness heatmap


In [52]:
# Create detailed category breakdown heatmaps
for metric in metrics:
    metric_data = detailed_df[detailed_df['metric'] == metric]
    
    # Calculate mean scores for each model-language-category combination
    detailed_pivot = metric_data.groupby(['model_clean', 'language', 'category'])['score'].mean().reset_index()
    
    # Create a pivot table with language_category as columns
    pivot_detailed = detailed_pivot.pivot_table(
        values='score',
        index='model_clean', 
        columns=['language', 'category'],
        aggfunc='mean'
    )
    
    # Flatten column names to create readable labels
    pivot_detailed.columns = [f"{lang}_{cat}" for lang, cat in pivot_detailed.columns]
    
    # Reorder columns to group by language
    languages = sorted(detailed_df['language'].unique())
    categories = ['no_tool_call', 'absolute_date', 'relative_date']
    ordered_cols = []
    for lang in languages:
        for cat in categories:
            col_name = f"{lang}_{cat}"
            if col_name in pivot_detailed.columns:
                ordered_cols.append(col_name)
    
    pivot_detailed = pivot_detailed[ordered_cols]
    
    # Create heatmap
    fig = px.imshow(
        pivot_detailed,
        title=f"{metric} Scores: Models vs Languages/Categories<br><sub>Categories: no_tool_call | absolute_date | relative_date</sub>",
        labels=dict(x="Language_Category", y="Model", color="Score"),
        color_continuous_scale="RdYlGn",
        aspect="auto"
    )
    
    # Add text annotations
    fig.update_traces(
        text=pivot_detailed.round(3).values,
        texttemplate="%{text}",
        textfont={"size": 8}
    )
    
    # Customize layout for better readability
    fig.update_layout(
        height=500,
        xaxis={'tickangle': 45, 'side': 'bottom'},
        yaxis={'side': 'left'},
        margin=dict(b=150)  # Extra bottom margin for rotated labels
    )
    
    # Add vertical lines to separate language groups
    shapes = []
    for i, lang in enumerate(languages[:-1]):  # Don't add line after last language
        x_pos = (i + 1) * len(categories) - 0.5
        shapes.append(
            dict(
                type="line",
                x0=x_pos, x1=x_pos,
                y0=-0.5, y1=len(pivot_detailed.index) - 0.5,
                line=dict(color="black", width=2)
            )
        )
    
    fig.update_layout(shapes=shapes)
    fig.show()
    
    # Export to HTML
    metric_filename = metric.replace(' ', '_').replace('/', '_').lower()
    fig.write_html(f"{docs_dir}/{metric_filename}_detailed_category_heatmap.html")
    print(f"✓ Exported {metric} detailed category heatmap")
    
    # Print summary statistics for this detailed breakdown
    print(f"\n{metric} - Category Performance Summary:")
    category_summary = metric_data.groupby(['category'])['score'].agg(['count', 'mean', 'std']).round(4)
    category_summary.columns = ['Tests', 'Mean Score', 'Std Dev']
    print(category_summary.sort_values('Mean Score', ascending=False))

✓ Exported Argument Correctness detailed category heatmap

Argument Correctness - Category Performance Summary:
               Tests  Mean Score  Std Dev
category                                 
no_tool_call     420      0.9536   0.2092
absolute_date    840      0.8893   0.3140
relative_date    630      0.6624   0.3707


✓ Exported Tool Correctness detailed category heatmap

Tool Correctness - Category Performance Summary:
               Tests  Mean Score  Std Dev
category                                 
no_tool_call     420      0.9536   0.2092
absolute_date    840      0.8964   0.2303
relative_date    630      0.8419   0.3250
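
The `Tests` column can be cross-checked by hand. `Loaded 84 data points` with two metrics implies 42 result files, which matches 6 models times 7 languages, and the ID ranges give 10, 20, and 15 items per category. The sketch below redoes that arithmetic. It assumes every result file contains every eval item, which the counts above are consistent with but which the notebook does not verify explicitly.

```python
# Figures taken from the notebook outputs above.
n_models, n_languages, n_metrics = 6, 7, 2
items_per_category = {'no_tool_call': 10, 'absolute_date': 20, 'relative_date': 15}

# One result file per model-language pair (assumption: no pair is missing).
n_files = n_models * n_languages

# Expected number of tests per category for a single metric.
expected_tests = {cat: n * n_files for cat, n in items_per_category.items()}

# Expected number of rows in detailed_df across both metrics.
total_rows = sum(expected_tests.values()) * n_metrics

print(expected_tests)
print(total_rows)
```

If a result file were missing or truncated, the observed counts would fall below these expected values, so the comparison doubles as a cheap completeness check.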


In [56]:
# Language performance analysis
print("Performance by Language (averaged across all models and metrics):")
lang_performance = df.groupby('language')['score'].agg(['mean', 'std', 'count']).round(4)
lang_performance.columns = ['Mean Score', 'Std Dev', 'Data Points']
lang_performance_sorted = lang_performance.sort_values('Mean Score', ascending=False)

display(lang_performance_sorted)

# Create bar chart for language performance
fig = px.bar(
    x=lang_performance_sorted.index,
    y=lang_performance_sorted['Mean Score'],
    title="Average Performance by Language (across all models and metrics)",
    labels={"x": "Language", "y": "Average Score"}
)

fig.update_layout(
    height=400,
    xaxis={'tickangle': 45}
)

fig.show()

fig.write_html(f"{docs_dir}/language_performance_bar_chart.html")

Performance by Language (averaged across all models and metrics):


language,Mean Score,Std Dev,Data Points
nld,0.8602,0.1176,12
por,0.8492,0.1332,12
deu,0.8481,0.1232,12
fra,0.8465,0.135,12
eng,0.8405,0.1354,12
est,0.8133,0.1602,12
pol,0.8089,0.1875,12


## Interesting examples to show

* ID 4 of Mistral-Nemo-Instruct-2407 deu and eng, tool call for time (how wrong is this?)
* ID 13 of Mistral-Nemo-Instruct-2407 est, decides to get current date for absolute time
* ID 32 of gpt-oss-20b eng, wrong thinking for relative date
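
To pull up the rows for examples like these, a small filter over records shaped like the `detailed_results` list may be enough. This is a sketch only: `find_results` is a hypothetical helper, and the sample records below carry placeholder scores that are not real evaluation results.

```python
def find_results(records, model_suffix, languages, eval_item_id):
    """Return records for one eval item, filtered by model name suffix and language."""
    return [
        r for r in records
        if r['model'].endswith(model_suffix)
        and r['language'] in languages
        and r['eval_item_id'] == eval_item_id
    ]


# Placeholder records for illustration only: the scores are NOT real results.
sample_records = [
    {'model': 'mistralai/Mistral-Nemo-Instruct-2407', 'language': 'deu',
     'metric': 'Tool Correctness', 'score': 0.0, 'eval_item_id': 4},
    {'model': 'mistralai/Mistral-Nemo-Instruct-2407', 'language': 'est',
     'metric': 'Tool Correctness', 'score': 0.0, 'eval_item_id': 13},
    {'model': 'openai/gpt-oss-20b', 'language': 'eng',
     'metric': 'Argument Correctness', 'score': 0.0, 'eval_item_id': 32},
]

for row in find_results(sample_records, 'Mistral-Nemo-Instruct-2407', {'deu', 'eng'}, 4):
    print(row['language'], row['metric'], row['score'])
```

In the notebook itself, the same call could be made with `detailed_results` in place of `sample_records`.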

## Is Dutch the best language to implement agents in?

I wouldn't conclude that Dutch is the "best language to implement agents in" because:

1. Marginal difference: The gap between Dutch (0.8602) and the runner-up, Portuguese (0.8492), is only about 0.011; German (0.8481) trails by about 0.012 and English (0.8405) by about 0.020
2. Statistical significance: With standard deviations between roughly 0.12 and 0.19 (0.1176 for Dutch) and only 12 data points per language, differences of this size are unlikely to be statistically significant
3. Dataset specificity: This reflects performance on your specific evaluation dataset, not general agent capability
4. Practical considerations: English still dominates in:
   - Available training data
   - Tool/API documentation
   - Developer resources
   - Real-world deployment scenarios

The results suggest Dutch performs slightly better on your specific multilingual agent tasks, but for practical agent development, English remains the most viable choice due to ecosystem advantages. The small performance difference likely doesn't outweigh English's substantial practical benefits.
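
As a rough check on the significance point, Welch's t statistic can be computed directly from the means, standard deviations, and counts in the language table. This is a back-of-the-envelope sketch, not a proper analysis: it treats the 12 data points per language as independent samples, whereas they come from the same models and metrics in every language, so a paired comparison on the underlying scores would be more appropriate. The helper name `welch_t` is hypothetical.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic computed from summary statistics."""
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# (mean, std, n) copied from the language performance table.
stats = {
    'nld': (0.8602, 0.1176, 12),
    'por': (0.8492, 0.1332, 12),
    'deu': (0.8481, 0.1232, 12),
    'pol': (0.8089, 0.1875, 12),
}

t_values = {}
for other in ('por', 'deu', 'pol'):
    t_values[other] = welch_t(*stats['nld'], *stats[other])
    print(f"nld vs {other}: t = {t_values[other]:.2f}")
```

All three statistics come out below 1, far under the value of about 2 usually needed for significance at the 5% level, which supports reading the language ranking as noise rather than signal.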

### You are absolutely right!

Exactly! It's a great example of how data can be misleading without proper context. The 0.012 difference between Dutch (0.8602) and German (0.8481) could easily be within the margin of error, especially with only 12 data points per language.

Plus, even if Dutch genuinely performed 1-2% better on this specific benchmark, the practical reality is that you'd still choose English for agent development because:

- The vast majority of Stack Overflow answers are in English
- Most major ML/AI libraries are documented primarily in English
- Your team probably thinks in English when coding
- Users expect English interfaces in most contexts

It's similar to how a car might get 0.1 MPG better fuel economy, but if there are no gas stations that sell its special fuel, it's not actually the better choice!

Your evaluation results are valuable for understanding model capabilities across languages, but the "best" language for implementation involves many more factors than just benchmark scores.