# Comprehensive Evaluation

This notebook performs comprehensive evaluation of the fine-tuned model's performance.

## Evaluation Metrics:
1. **BLEU Score** - Overall translation quality
2. **Idiom Accuracy** - Percentage of correct Sinhala idiom usage
3. **Literal Translation Rate** - How often the model fails
4. **Per-Idiom Performance** - Breakdown by idiom type
5. **Detailed Analysis** - Examples and edge cases

In [None]:
import sys
sys.path.append('..')

from src.evaluation import (
    calculate_bleu,
    check_idiom_presence,
    evaluate_single,
    evaluate_batch,
    generate_report,
    save_metrics
)
from src.trainer import load_config
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_style('whitegrid')
print("✓ Imports successful")

## 1. Load Test Data and Predictions

In [None]:
# Load configuration
config = load_config('../config/training_config.yaml')

# Load test data
with open(config['data']['test_json'], 'r', encoding='utf-8') as f:
    test_data = json.load(f)

# Load predictions
with open(config['outputs']['predictions'], 'r', encoding='utf-8') as f:
    predictions_data = json.load(f)

# Extract predictions
predictions = [pred['prediction_si'] for pred in predictions_data]

print(f"Loaded {len(test_data)} test examples")
print(f"Loaded {len(predictions)} predictions")

assert len(test_data) == len(predictions), "Mismatch between test data and predictions!"

## 2. Calculate Overall Metrics

In [None]:
# Evaluate all predictions
metrics = evaluate_batch(predictions, test_data)

print("=" * 60)
print("OVERALL EVALUATION METRICS")
print("=" * 60)
print(f"\nBLEU Score: {metrics['overall_bleu']:.2f}")
print(f"Idiom Accuracy: {metrics['idiom_accuracy']:.1f}%")
print(f"Literal Translation Rate: {metrics['literal_translation_rate']:.1f}%")
print(f"Total Examples: {metrics['total_examples']}")
print("\n" + "=" * 60)

## 3. Per-Idiom Performance Analysis

In [None]:
# Display per-idiom statistics
print("\n=== Per-Idiom Performance ===")
print("-" * 80)

# Sort by count (most frequent first)
sorted_idioms = sorted(
    metrics['per_idiom_performance'].items(),
    key=lambda x: x[1]['count'],
    reverse=True
)

print(f"{'Idiom':<30} {'Accuracy':<12} {'Avg BLEU':<12} {'Count':<8}")
print("-" * 80)

for idiom, stats in sorted_idioms[:15]:  # Top 15
    print(f"{idiom:<30} {stats['accuracy']:>6.1f}%     {stats['avg_bleu']:>8.2f}     {stats['count']:>5}")

if len(sorted_idioms) > 15:
    print(f"\n... and {len(sorted_idioms) - 15} more idioms")

## 4. Detailed Results Table

In [None]:
# Create detailed results DataFrame
detailed_results = pd.DataFrame(metrics['detailed_results'])

# Display summary statistics
print("\n=== Results Summary Statistics ===")
print(f"Average BLEU: {detailed_results['bleu'].mean():.2f}")
print(f"Median BLEU: {detailed_results['bleu'].median():.2f}")
print(f"Min BLEU: {detailed_results['bleu'].min():.2f}")
print(f"Max BLEU: {detailed_results['bleu'].max():.2f}")
print(f"\nIdiom Correct: {detailed_results['idiom_correct'].sum()} / {len(detailed_results)}")
print(f"Avoided Literal: {detailed_results['avoided_literal'].sum()} / {len(detailed_results)}")

# Show a sample of results
print("\nSample results:")
detailed_results[['idiom_en', 'bleu', 'idiom_correct']].head(10)

## 5. Visualize Performance Metrics

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. BLEU Score Distribution
axes[0, 0].hist(detailed_results['bleu'], bins=20, color='#3498db', edgecolor='black', alpha=0.7)
axes[0, 0].axvline(detailed_results['bleu'].mean(), color='red', linestyle='--', linewidth=2,
                   label=f"Mean: {detailed_results['bleu'].mean():.2f}")
axes[0, 0].set_title('BLEU Score Distribution', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('BLEU Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Idiom Accuracy
accuracy_data = pd.DataFrame({
    'Metric': ['Idiom Correct', 'Idiom Incorrect'],
    'Count': [
        detailed_results['idiom_correct'].sum(),
        len(detailed_results) - detailed_results['idiom_correct'].sum()
    ]
})
colors = ['#2ecc71', '#e74c3c']
axes[0, 1].pie(accuracy_data['Count'], labels=accuracy_data['Metric'], autopct='%1.1f%%',
               colors=colors, startangle=90)
axes[0, 1].set_title('Idiom Accuracy', fontsize=12, fontweight='bold')

# 3. Top 10 Idioms by Performance
top_idioms = sorted(
    [(k, v['accuracy']) for k, v in metrics['per_idiom_performance'].items()],
    key=lambda x: x[1],
    reverse=True
)[:10]

idiom_names = [x[0][:20] + '...' if len(x[0]) > 20 else x[0] for x in top_idioms]
idiom_accs = [x[1] for x in top_idioms]

axes[1, 0].barh(range(len(idiom_names)), idiom_accs, color='#9b59b6')
axes[1, 0].set_yticks(range(len(idiom_names)))
axes[1, 0].set_yticklabels(idiom_names, fontsize=9)
axes[1, 0].set_xlabel('Accuracy (%)')
axes[1, 0].set_title('Top 10 Idioms by Accuracy', fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3, axis='x')

# 4. Overall Metrics Summary
metrics_summary = pd.DataFrame({
    'Metric': ['BLEU', 'Idiom Acc.', 'Literal Rate'],
    'Score': [
        metrics['overall_bleu'],
        metrics['idiom_accuracy'],
        metrics['literal_translation_rate']
    ]
})

bars = axes[1, 1].bar(metrics_summary['Metric'], metrics_summary['Score'],
                       color=['#3498db', '#2ecc71', '#e67e22'])
axes[1, 1].set_title('Overall Metrics Summary', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('Score')
axes[1, 1].set_ylim(0, 100)
axes[1, 1].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., height,
                    f'{height:.1f}',
                    ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.savefig('../outputs/evaluation_metrics.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Visualization saved to outputs/evaluation_metrics.png")

## 6. Best and Worst Performing Examples

In [None]:
# Best performing examples (highest BLEU)
best_results = detailed_results.nlargest(5, 'bleu')

print("=== Top 5 Best Translations (by BLEU) ===")
print("-" * 80)
for idx, row in best_results.iterrows():
    print(f"\nIdiom: {row['idiom_en']}")
    print(f"BLEU: {row['bleu']:.2f} | Idiom Correct: {row['idiom_correct']}")
    print(f"Source: {row['source'][:70]}...")
    print(f"Reference: {row['reference'][:70]}...")
    print(f"Prediction: {row['prediction'][:70]}...")
    print("-" * 80)

In [None]:
# Worst performing examples (lowest BLEU)
worst_results = detailed_results.nsmallest(5, 'bleu')

print("\n=== Top 5 Worst Translations (by BLEU) ===")
print("-" * 80)
for idx, row in worst_results.iterrows():
    print(f"\nIdiom: {row['idiom_en']}")
    print(f"BLEU: {row['bleu']:.2f} | Idiom Correct: {row['idiom_correct']}")
    print(f"Source: {row['source'][:70]}...")
    print(f"Reference: {row['reference'][:70]}...")
    print(f"Prediction: {row['prediction'][:70]}...")
    print("-" * 80)

## 7. Generate and Save Evaluation Report

In [None]:
# Generate human-readable report
report = generate_report(metrics)
print(report)

# Save report to file
report_path = '../outputs/evaluation_report.txt'
with open(report_path, 'w', encoding='utf-8') as f:
    f.write(report)

print(f"\n✓ Report saved to {report_path}")

## 8. Save Metrics to JSON

In [None]:
# Save metrics to JSON
metrics_path = config['outputs']['metrics']
save_metrics(metrics, metrics_path)

print(f"✓ Metrics saved to {metrics_path}")

## Summary

Comprehensive evaluation completed!

- ✅ BLEU score calculated: **{metrics['overall_bleu']:.2f}**
- ✅ Idiom accuracy: **{metrics['idiom_accuracy']:.1f}%**
- ✅ Literal translation rate: **{metrics['literal_translation_rate']:.1f}%**
- ✅ Per-idiom performance analyzed
- ✅ Visualizations created
- ✅ Results saved to `outputs/metrics/evaluation_results.json`

### Key Findings:
1. The model successfully learns to use Sinhala idioms in **{metrics['idiom_accuracy']:.1f}%** of cases
2. Average BLEU score of **{metrics['overall_bleu']:.2f}** indicates overall translation quality
3. The idiom tagging approach helps reduce literal translations

### Limitations:
- Performance varies by idiom type
- Limited to idioms seen in training data
- Requires manual `<IDIOM>` tagging in input

**Project Complete!** All notebooks have been executed successfully.