# AI Task Evaluation - Failure Analysis Framework

## Overview
This notebook walks through a failure analysis framework for evaluating AI agent performance. It surfaces failure patterns, tests whether they are statistically significant, and generates root cause hypotheses across dimensions such as task type, file type, finance domain, evaluation criterion, and agent.

**Key Capabilities:**
- Multi-dimensional failure pattern detection
- Statistical significance testing
- Root cause hypothesis generation
- Interactive visualizations
- Actionable recommendations

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from data_generator import AIEvaluationDataGenerator
from failure_analyzer import FailureAnalyzer
from visualizer import FailureVisualizer

%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## 1. Generate Synthetic Evaluation Data

We generate a synthetic dataset of 1,000 AI task evaluations for finance-domain tasks. The generator builds in known failure patterns that simulate real-world scenarios.
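
As a rough illustration of what "built-in failure patterns" can mean, the sketch below plants one pattern by raising the failure probability for a single file type. It does not reflect how `AIEvaluationDataGenerator` works internally; the function name `make_toy_evaluations`, the category labels, and the probabilities are all hypothetical.

```python
import numpy as np
import pandas as pd

def make_toy_evaluations(n_samples=1000, seed=0):
    """Generate toy evaluation rows with one deliberately planted failure pattern."""
    rng = np.random.default_rng(seed)
    # Hypothetical category labels, chosen only for illustration.
    file_type = rng.choice(["csv", "xlsx", "pdf"], size=n_samples)
    task_type = rng.choice(["extraction", "calculation", "summarization"], size=n_samples)

    # Planted pattern: a 15% base failure probability for every task,
    # plus 30 percentage points for PDF inputs.
    p_fail = np.full(n_samples, 0.15)
    p_fail[file_type == "pdf"] += 0.30

    success = rng.random(n_samples) >= p_fail
    # Successful tasks draw scores from a higher range than failed ones.
    score = np.where(success,
                     rng.uniform(70, 100, n_samples),
                     rng.uniform(0, 60, n_samples))
    return pd.DataFrame({
        "file_type": file_type,
        "task_type": task_type,
        "success": success,
        "score": score.round(1),
    })

toy = make_toy_evaluations()
print(toy.groupby("file_type")["success"].mean())
```

Because the pattern is planted, the later analysis steps can be checked against a known answer: the PDF group should show the lowest success rate.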

In [None]:
# Generate dataset
generator = AIEvaluationDataGenerator()
df = generator.generate_dataset(n_samples=1000)
generator.save_dataset(df, filepath='../data/ai_evaluations.csv')

print(f"Dataset Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print("\nFirst few rows:")
df.head()

In [None]:
# Quick overview
print("Dataset Summary:")
print(f"Total Tasks: {len(df):,}")
print(f"Success Rate: {df['success'].mean():.2%}")
print(f"Failure Rate: {(1-df['success'].mean()):.2%}")
print("\nScore Statistics:")
print(df['score'].describe())

## 2. Overall Performance Metrics

Start with high-level metrics to understand overall AI agent performance.
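
For orientation, here is a minimal sketch of the kind of dictionary such a summary might return. The keys `success_rate`, `avg_score`, and `total_tasks` are the ones this notebook reads later; the `failure_rate` key and the function name `overall_metrics_sketch` are assumptions, and the real `overall_metrics()` may return more or different fields.

```python
import pandas as pd

def overall_metrics_sketch(df):
    """Summarize a table with a boolean `success` and a numeric `score` column."""
    success_rate = float(df["success"].mean())
    return {
        "total_tasks": int(len(df)),
        "success_rate": success_rate,
        "failure_rate": 1.0 - success_rate,  # assumed key, not confirmed by the source
        "avg_score": float(df["score"].mean()),
    }

# Tiny hand-made table, for illustration only.
toy = pd.DataFrame({
    "success": [True, True, False, True],
    "score": [90.0, 80.0, 30.0, 100.0],
})
toy_metrics = overall_metrics_sketch(toy)
print(toy_metrics)
```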

In [None]:
analyzer = FailureAnalyzer(df)
metrics = analyzer.overall_metrics()

print("=" * 50)
print("OVERALL PERFORMANCE METRICS")
print("=" * 50)
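for key, value in metrics.items():
    # Floats in [0, 1] are assumed to be rates and are shown as percentages.
    if isinstance(value, float) and 0 <= value <= 1:
        print(f"{key:.<30} {value:.2%}")
    # Other floats (e.g. scores) are shown with two decimals.
    elif isinstance(value, float):
        print(f"{key:.<30} {value:.2f}")
    # Everything else is assumed to be an integer count.
    else:
        print(f"{key:.<30} {value:,}")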

## 3. Single Dimension Analysis

Analyze failure rates across individual dimensions to identify problematic areas.
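
A per-dimension failure table can be sketched with a plain pandas `groupby`. This is an assumption about the general technique, not the actual body of `failure_by_dimension`; the function name and the output column names (`total_tasks`, `failures`, `failure_rate`) are hypothetical.

```python
import pandas as pd

def failure_by_dimension_sketch(df, dimension):
    """Failure rate per category of `dimension`, worst category first."""
    failed = ~df["success"].astype(bool)
    return (
        df.assign(failed=failed)
          .groupby(dimension)["failed"]
          .agg(total_tasks="count", failures="sum", failure_rate="mean")
          .sort_values("failure_rate", ascending=False)
    )

# Tiny hand-made table, for illustration only.
toy = pd.DataFrame({
    "file_type": ["pdf", "pdf", "pdf", "pdf", "csv", "csv", "csv", "csv"],
    "success":   [False, False, True, False, True, True, True, False],
})
by_file_type = failure_by_dimension_sketch(toy, "file_type")
print(by_file_type)
```

Keeping the task count next to the rate makes it easier to see when a high failure rate rests on only a handful of tasks.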

In [None]:
# File Type Analysis
print("\n" + "="*60)
print("FAILURE ANALYSIS BY FILE TYPE")
print("="*60)
file_type_analysis = analyzer.failure_by_dimension('file_type')
print(file_type_analysis)

In [None]:
# Task Type Analysis
print("\n" + "="*60)
print("FAILURE ANALYSIS BY TASK TYPE")
print("="*60)
task_analysis = analyzer.failure_by_dimension('task_type')
print(task_analysis)

In [None]:
# Finance Domain Analysis
print("\n" + "="*60)
print("FAILURE ANALYSIS BY FINANCE DOMAIN")
print("="*60)
domain_analysis = analyzer.failure_by_dimension('finance_domain')
print(domain_analysis)

In [None]:
# Agent Performance Comparison
print("\n" + "="*60)
print("FAILURE ANALYSIS BY AI AGENT")
print("="*60)
agent_analysis = analyzer.failure_by_dimension('agent')
print(agent_analysis)

## 4. Multi-Dimensional Analysis

Compare failure patterns across all dimensions simultaneously.
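
One way to make categories from different dimensions comparable is to stack the per-dimension tables into a single long table, one row per category. The sketch below assumes that layout; whether `multidimensional_analysis()` does this or something richer is not confirmed here, and the function and column names are hypothetical.

```python
import pandas as pd

def multidimensional_sketch(df, dimensions):
    """Stack per-dimension failure rates into one long, comparable table."""
    failed = ~df["success"].astype(bool)
    parts = []
    for dim in dimensions:
        part = (
            failed.groupby(df[dim])
                  .agg(total_tasks="count", failure_rate="mean")
                  .rename_axis("category")
                  .reset_index()
        )
        part.insert(0, "dimension", dim)  # remember which dimension the row came from
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Tiny hand-made table, for illustration only.
toy = pd.DataFrame({
    "file_type": ["pdf", "pdf", "csv", "csv"],
    "task_type": ["calculation", "extraction", "calculation", "extraction"],
    "success":   [False, False, True, False],
})
combined = multidimensional_sketch(toy, ["file_type", "task_type"])
print(combined.nlargest(3, "failure_rate"))
```

Note that each row here describes one category of one dimension, not an intersection of categories.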

In [None]:
multi_dim = analyzer.multidimensional_analysis()
print("\nTop 15 Highest Risk Categories (Across All Dimensions):")
print(multi_dim.nlargest(15, 'failure_rate'))

## 5. Statistical Significance Testing

Use a chi-square test to check whether failure rates differ across the categories of each dimension by more than chance alone would explain.
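
A chi-square test of independence on a category-by-outcome contingency table could look like the sketch below, which uses `scipy.stats.chi2_contingency`. The returned keys mirror the ones this notebook reads (`dimension`, `chi2_statistic`, `p_value`, `significant`), but the function itself, the `alpha` parameter, and the `degrees_of_freedom` key are assumptions rather than the analyzer's actual implementation.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_sketch(df, dimension, alpha=0.05):
    """Chi-square test of independence between `dimension` and task success."""
    # Rows are categories, columns are the two outcomes (False / True).
    contingency = pd.crosstab(df[dimension], df["success"])
    chi2, p_value, dof, _expected = chi2_contingency(contingency.to_numpy())
    return {
        "dimension": dimension,
        "chi2_statistic": float(chi2),
        "p_value": float(p_value),
        "degrees_of_freedom": int(dof),  # assumed extra field
        "significant": bool(p_value < alpha),
    }

# Hand-made table: 100 PDF tasks with 60 failures, 100 CSV tasks with 20 failures.
toy = pd.DataFrame({
    "file_type": ["pdf"] * 100 + ["csv"] * 100,
    "success": [False] * 60 + [True] * 40 + [False] * 20 + [True] * 80,
})
result = chi_square_sketch(toy, "file_type")
print(result)
```

When several dimensions are tested at once, as with the five used in this section, a stricter threshold such as the Bonferroni-adjusted 0.05 / 5 = 0.01 is one conservative way to limit false positives.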

In [None]:
dimensions = ['task_type', 'file_type', 'finance_domain', 'evaluation_criterion', 'agent']

print("\n" + "="*70)
print("STATISTICAL SIGNIFICANCE TESTS (Chi-Square)")
print("="*70)
print(f"{'Dimension':<25} {'Chi2':>12} {'P-Value':>12} {'Significant':>15}")
print("-"*70)

sig_tests = []
for dim in dimensions:
    test = analyzer.statistical_significance_test(dim)
    sig_tests.append(test)
    sig_marker = "✓ YES" if test['significant'] else "✗ NO"
    print(f"{test['dimension']:<25} {test['chi2_statistic']:>12.2f} {test['p_value']:>12.4f} {sig_marker:>15}")

sig_df = pd.DataFrame(sig_tests)
print(f"\nSignificant dimensions: {sig_df['significant'].sum()}/{len(sig_df)}")

## 6. High-Risk Segment Identification

Identify specific segments whose failure rate exceeds a chosen threshold (35% here).
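
Threshold filtering is simple, but small segments can cross the threshold by chance. The sketch below adds a minimum task count as a guard; that `min_tasks` parameter, the function name, and the example rows are hypothetical, and `identify_high_risk_segments` may or may not apply such a filter.

```python
import pandas as pd

def high_risk_sketch(rates, threshold=0.35, min_tasks=30):
    """Keep segments above `threshold` that also have enough tasks to trust."""
    mask = (rates["failure_rate"] > threshold) & (rates["total_tasks"] >= min_tasks)
    return (
        rates[mask]
        .sort_values("failure_rate", ascending=False)
        .reset_index(drop=True)
    )

# Made-up per-segment rates, for illustration only.
rates = pd.DataFrame({
    "dimension": ["file_type", "file_type", "agent", "task_type"],
    "category": ["pdf", "csv", "agent_b", "forecasting"],
    "total_tasks": [320, 340, 12, 150],
    "failure_rate": [0.46, 0.14, 0.58, 0.37],
})
flagged = high_risk_sketch(rates)
print(flagged.to_string(index=False))
```

In this made-up example, `agent_b` has the highest failure rate but is dropped because it covers only 12 tasks.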

In [None]:
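# Keep the threshold in one place so the printed header always matches it
threshold = 0.35
high_risk = analyzer.identify_high_risk_segments(threshold=threshold)

print("\n" + "="*70)
print(f"HIGH-RISK SEGMENTS (Failure Rate > {threshold:.0%})")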
print("="*70)
print(high_risk.to_string(index=False))

## 7. Error Type Distribution

Break failures down by error type to see which kinds occur most often.
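
A count-and-percentage breakdown can be sketched with `value_counts`. The `count` and `percentage` columns match what this notebook reads later; the `error_type` column name, the error labels, and the function name are assumptions.

```python
import pandas as pd

def error_type_distribution_sketch(df, error_col="error_type"):
    """Count error types among failed tasks, most frequent first."""
    failures = df.loc[~df["success"].astype(bool), error_col]
    counts = failures.value_counts()  # sorted descending, missing values dropped
    return pd.DataFrame({
        "count": counts,
        "percentage": 100 * counts / counts.sum(),
    })

# Tiny hand-made table, for illustration only; successes carry no error type.
toy = pd.DataFrame({
    "success": [False, True, False, False, True, False],
    "error_type": ["parse_error", None, "parse_error", "calc_error", None, "parse_error"],
})
dist = error_type_distribution_sketch(toy)
print(dist)
```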

In [None]:
error_dist = analyzer.error_type_distribution()

print("\n" + "="*50)
print("ERROR TYPE DISTRIBUTION")
print("="*50)
print(error_dist)

## 8. Correlation Analysis

Examine how numerical features correlate with task success.
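
With a binary outcome, the Pearson correlation between a numeric feature and the 0/1 success flag is the point-biserial correlation. The sketch below computes it per feature; the feature names (`file_size_mb`, `num_steps`), the output column names, and the function name are hypothetical, and `correlation_analysis()` may use a different measure.

```python
import pandas as pd

def correlation_sketch(df, numeric_cols):
    """Pearson correlation of each numeric column with the 0/1 success flag."""
    success = df["success"].astype(int)
    rows = [
        {"feature": col, "correlation_with_success": df[col].corr(success)}
        for col in numeric_cols
    ]
    out = pd.DataFrame(rows)
    # Order by strength of association, regardless of sign.
    order = out["correlation_with_success"].abs().sort_values(ascending=False).index
    return out.loc[order].reset_index(drop=True)

# Tiny hand-made table, for illustration only.
toy = pd.DataFrame({
    "file_size_mb": [1, 2, 3, 10, 12, 15],
    "num_steps": [3, 5, 4, 4, 5, 3],
    "success": [True, True, True, False, False, False],
})
corr_table = correlation_sketch(toy, ["num_steps", "file_size_mb"])
print(corr_table.to_string(index=False))
```

A strong correlation here indicates association only; it does not show that the feature causes failures.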

In [None]:
correlations = analyzer.correlation_analysis()

print("\n" + "="*80)
print("CORRELATION ANALYSIS: Numerical Features vs Success")
print("="*80)
print(correlations.to_string(index=False))

## 9. Temporal Analysis

Track how performance changes over time.
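
One common way to get a failure rate per period is to resample on a timestamp. The sketch below assumes a `timestamp` column and weekly buckets; neither the column name nor the granularity is confirmed by the source, and the function name is hypothetical.

```python
import pandas as pd

def temporal_sketch(df, time_col="timestamp", freq="W"):
    """Task count and failure rate per time bucket, in chronological order."""
    failed = pd.Series(
        (~df["success"].astype(bool)).astype(float).to_numpy(),
        index=pd.DatetimeIndex(pd.to_datetime(df[time_col])),
    ).sort_index()
    out = failed.resample(freq).agg(["count", "mean"])
    out.columns = ["total_tasks", "failure_rate"]
    return out

# Two full weeks (Monday 2024-01-01 through Sunday 2024-01-14), one task per day.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=14, freq="D"),
    "success": [True] * 6 + [False] + [False] * 4 + [True] * 3,
})
weekly = temporal_sketch(toy)
print(weekly)
```

Sorting by time matters for any later comparison of early and late periods, since `head` and `tail` only mean "early" and "late" when the rows are chronological.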

In [None]:
temporal = analyzer.temporal_analysis()

print("\n" + "="*70)
print("TEMPORAL PERFORMANCE TRENDS")
print("="*70)
print(temporal.head(10))
print("\n...")
print(temporal.tail(10))

## 10. Root Cause Hypotheses

Generate data-driven hypotheses about failure root causes.

In [None]:
hypotheses = analyzer.root_cause_hypothesis()

print("\n" + "="*80)
print("ROOT CAUSE HYPOTHESES & RECOMMENDATIONS")
print("="*80)

for i, hyp in enumerate(hypotheses, 1):
    print(f"\n[HYPOTHESIS {i}]")
    print(f"Dimension: {hyp['dimension']}")
    print(f"Hypothesis: {hyp['hypothesis']}")
    print(f"Evidence: {hyp['evidence']}")
    print(f"Worst Performer: {hyp['worst_performer']} (Failure Rate: {hyp['failure_rate']:.1%})")
    print(f"Recommendation: {hyp['recommendation']}")
    print("-" * 80)

## 11. Visualizations

Plot the results from the earlier sections and save each figure under `../outputs/`.
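
The matrix behind an interaction heatmap can be sketched as a pivot table of mean failure rates for each pair of categories. This is an assumption about the underlying data shape, not the code inside `plot_interaction_heatmap`; the function name and category labels are hypothetical.

```python
import pandas as pd

def interaction_matrix_sketch(df, row_dim, col_dim):
    """Mean failure rate for every (row_dim, col_dim) combination."""
    failed = (~df["success"].astype(bool)).astype(float)
    return df.assign(failed=failed).pivot_table(
        index=row_dim, columns=col_dim, values="failed", aggfunc="mean"
    )

# Tiny hand-made table, for illustration only.
toy = pd.DataFrame({
    "task_type": ["calculation"] * 4 + ["extraction"] * 4,
    "file_type": ["pdf", "pdf", "csv", "csv"] * 2,
    "success":   [False, False, True, True, True, False, True, True],
})
matrix = interaction_matrix_sketch(toy, "task_type", "file_type")
print(matrix)
```

A matrix like this can be passed to `seaborn.heatmap` for rendering; combinations that never occur in the data appear as missing values rather than zeros.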

In [None]:
viz = FailureVisualizer(df)

In [None]:
# Multi-dimensional dashboard
viz.plot_multidimensional_summary(output_path='../outputs/dashboard.png')

In [None]:
# File type analysis
viz.plot_failure_rates_by_dimension('file_type', 
                                    output_path='../outputs/file_type_analysis.png')

In [None]:
# Task type analysis
viz.plot_failure_rates_by_dimension('task_type',
                                    output_path='../outputs/task_type_analysis.png')

In [None]:
# Interaction effects: Task Type vs File Type
viz.plot_interaction_heatmap('task_type', 'file_type',
                            output_path='../outputs/interaction_heatmap.png')

In [None]:
# Error type distribution
viz.plot_error_type_distribution(output_path='../outputs/error_types.png')

In [None]:
# Score distributions
viz.plot_score_distributions(output_path='../outputs/score_distributions.png')

In [None]:
# Temporal trends
viz.plot_temporal_trends(output_path='../outputs/temporal_trends.png')

## 12. Key Findings Summary

Let's compile the most actionable insights.

In [None]:
print("\n" + "#"*80)
print("#" + "KEY FINDINGS".center(78) + "#")
print("#"*80)

print("\n[1] OVERALL PERFORMANCE")
print(f"    - Success Rate: {metrics['success_rate']:.1%}")
print(f"    - Average Score: {metrics['avg_score']:.1f}/100")
print(f"    - Total Tasks Evaluated: {metrics['total_tasks']:,}")

print("\n[2] HIGHEST RISK AREAS")
top_risks = high_risk.head(3)
for _, row in top_risks.iterrows():
    print(f"    - {row['dimension']}: {row['category']} (Failure Rate: {row['failure_rate']:.1%})")

print("\n[3] MOST COMMON ERROR TYPES")
top_errors = error_dist.head(3)
for error_type, data in top_errors.iterrows():
    # iterrows() can upcast integer counts to float, so cast back before printing
    print(f"    - {error_type}: {int(data['count']):,} occurrences ({data['percentage']:.1f}%)")

print("\n[4] STATISTICALLY SIGNIFICANT FACTORS")
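# Use the chi-square results from Section 5, keeping only significant dimensions
sig_dims = [test['dimension'] for test in sig_tests if test['significant']]
for dim in sig_dims:
    print(f"    - {dim}")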

print("\n[5] TEMPORAL PATTERNS")
early_failure = temporal.head(5)['failure_rate'].mean()
late_failure = temporal.tail(5)['failure_rate'].mean()
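# Compare the mean failure rate of the first five and last five periods
if late_failure > early_failure:
    trend = "increasing"
elif late_failure < early_failure:
    trend = "decreasing"
else:
    trend = "flat"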
print(f"    - Failure rate is {trend} over time")
print(f"    - Early period: {early_failure:.1%} | Late period: {late_failure:.1%}")

print("\n" + "#"*80)

## Conclusion

This framework provides:
- **Multi-dimensional failure analysis** across task types, file formats, domains, and agents
- **Statistical validation** of observed patterns using hypothesis testing
- **Root cause hypotheses** with data-driven recommendations
- **Visual analytics** for stakeholder communication
- **Actionable insights** for improving AI evaluation frameworks

Applied to real evaluation data, the systematic failure patterns this analysis surfaces can guide improvements in:
- Task design and complexity management
- File preprocessing and parsing
- Domain-specific model training
- Evaluation rubric clarity
- Agent configuration optimization