# 🤖 Automatic Annotation System with Multiple LLMs

This notebook demonstrates the complete use of the system with HuggingFace datasets.

## 📋 Contents

1. Setup and Configuration
2. Load a HuggingFace Dataset
3. Configure LLM Models
4. Run Annotation
5. Compute Consensus
6. Detailed Analysis
7. Visualizations
8. Validation with Ground Truth
9. Export Results

---

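## 1. Setup and Configuration

### Imports

In [None]:
# Configure paths
import sys
from pathlib import Path

# Resolve the project root: step up one level when running from notebooks/
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()

# Add the package directories and the project root to the path
# (the project root is needed for the `src.*` imports used in this notebook)
sys.path.insert(0, str(project_root / 'src' / 'llm_annotation_system'))
sys.path.insert(0, str(project_root / 'src' / 'config'))
sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")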
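
If the notebook may be launched from a nested directory, a more tolerant lookup is to walk upward until a marker directory is found. The sketch below is only an illustration: `find_project_root` and the choice of `src` as the marker are assumptions, not part of the annotation system.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start`; return the first directory that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return start  # fall back to the starting directory when no marker is found


# Demo on a throwaway directory tree: <tmp>/src and <tmp>/notebooks/drafts
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    nested = root / "notebooks" / "drafts"
    nested.mkdir(parents=True)
    found_is_root = find_project_root(nested) == root

print(found_is_root)
```

In the notebook this would be called as `find_project_root(Path.cwd())`.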


In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Annotation system
from src.llm_annotation_system.llm_annotator import LLMAnnotator
from src.llm_annotation_system.consensus_analyzer import ConsensusAnalyzer
from src.utils.visualizer import ConsensusVisualizer

# HuggingFace datasets
from src.utils.data_loader import (
    load_hf_dataset,
    load_hf_dataset_as_dataframe,
    list_available_datasets,
    discover_dataset_structure
)

# Configure plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("✅ Imports loaded successfully!")


### Load API Keys

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables (requires `project_root` from the setup cell above)
env_path = project_root / '.env'
load_dotenv(env_path)

# Collect API keys
api_keys = {
    "openai": os.getenv("OPENAI_API_KEY"),
    "anthropic": os.getenv("ANTHROPIC_API_KEY"),
    "google": os.getenv("GOOGLE_API_KEY"),
}

# Check which keys are set
for provider, key in api_keys.items():
    if key:
        print(f"✅ {provider}: configured")
    else:
        print(f"⚠️  {provider}: NOT configured")

print("\n💡 Tip: set your API keys in the .env file")
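
When some keys are missing, it can help to drop the models that cannot be called before starting a run. The following is a minimal sketch under stated assumptions: `PROVIDER_BY_PREFIX` and `usable_models` are hypothetical names, and the prefix-to-provider mapping is a guess based on the model names used later in this notebook.

```python
# Hypothetical mapping from model-name prefix to the provider key it needs
PROVIDER_BY_PREFIX = {"gpt": "openai", "claude": "anthropic", "gemini": "google"}


def usable_models(models, api_keys):
    """Keep only the models whose provider has a non-empty API key."""
    usable = []
    for model in models:
        prefix = model.split("-", 1)[0]
        provider = PROVIDER_BY_PREFIX.get(prefix)
        if provider and api_keys.get(provider):
            usable.append(model)
    return usable


# Demo values only; real keys come from the .env file
demo_keys = {"openai": "sk-demo", "anthropic": None, "google": ""}
demo_models = ["gpt-4-turbo", "claude-3-opus", "gemini-pro"]
available = usable_models(demo_models, demo_keys)
print(available)  # -> ['gpt-4-turbo']
```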


---

## 2. Load a HuggingFace Dataset

### 2.1 List Available Datasets

In [None]:
# Show configured datasets
print("📋 Configured datasets:")
for dataset_name in list_available_datasets():
    print(f"   • {dataset_name}")

print("\n💡 Configure new datasets in: src/config/dataset_config.py")

### 2.2 Discover a Dataset's Structure (Optional)

To explore a new dataset before configuring it:

In [None]:
# Uncomment and adjust for your dataset
# discover_dataset_structure("waashk/seu-dataset", num_examples=3)

### 2.3 Load the Dataset

In [None]:
# Choose your dataset
dataset_name = "exemplo_com_labels"  # ADJUST for your dataset

print(f"📦 Loading dataset: {dataset_name}\n")

# Load
texts, categories, ground_truth = load_hf_dataset(dataset_name)

# Summary
print(f"\n📊 Dataset information:")
print(f"   • Total texts: {len(texts)}")
print(f"   • Categories: {categories}")
print(f"   • Ground truth available: {'Yes ✅' if ground_truth else 'No'}")

### 2.4 Preview a Sample

In [None]:
# Show the first texts
print("📝 First 5 texts in the dataset:\n")
for i in range(min(5, len(texts))):
    print(f"Text {i+1}:")
    print(f"   {texts[i][:100]}...")
    if ground_truth:
        print(f"   Label: {ground_truth[i]}")
    print()

### 2.5 Exploratory Analysis (if ground truth is available)

In [None]:
if ground_truth:
    # Build a DataFrame
    df_explore = pd.DataFrame({
        'text': texts,
        'label': ground_truth
    })

    # Count instances per class once and reuse the result
    class_counts = df_explore['label'].value_counts()

    print("📊 Class distribution:\n")
    print(class_counts)

    # Plot
    plt.figure(figsize=(10, 5))
    class_counts.plot(kind='bar')
    plt.title('Class Distribution in the Dataset')
    plt.xlabel('Class')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("ℹ️  No ground truth - skipping exploratory analysis")
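
Because every text is annotated by several models several times, a small pilot subset is a cheap way to test the pipeline first. The sketch below draws a fixed number of texts per class; `stratified_sample` is a hypothetical helper, not part of the data loader, and it assumes `ground_truth` is a list aligned with `texts`.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(texts, labels, per_class, seed=42):
    """Pick up to `per_class` texts from each label, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    chosen = []
    for label in sorted(by_label, key=str):
        idxs = by_label[label]
        chosen.extend(rng.sample(idxs, min(per_class, len(idxs))))
    chosen.sort()  # keep the original dataset order
    return [texts[i] for i in chosen], [labels[i] for i in chosen]


# Demo data: 7 texts of class "a" and 3 of class "b"
demo_texts = [f"text {i}" for i in range(10)]
demo_labels = ["a"] * 7 + ["b"] * 3
pilot_texts, pilot_labels = stratified_sample(demo_texts, demo_labels, per_class=2)
print(Counter(pilot_labels))
```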

---

## 3. Configure LLM Models

### 3.1 Select Models

In [None]:
# Models to use
models = [
    "gpt-4-turbo",
    "gpt-3.5-turbo",
    "claude-3-opus",
    "claude-3-sonnet",
    "gemini-pro",
]

print("🤖 Selected models:")
for i, model in enumerate(models, 1):
    print(f"   {i}. {model}")

print("\n💡 Tip: to save cost, start with 3 models (e.g. gpt-3.5, claude-sonnet, gemini)")

### 3.2 Initialize the Annotator

In [None]:
# Configure directories
cache_dir = project_root / "cache"
results_dir = project_root / "results"

# Create the directories if they do not exist
cache_dir.mkdir(exist_ok=True)
results_dir.mkdir(exist_ok=True)

print("🔧 Initializing annotator...\n")

# Initialize
annotator = LLMAnnotator(
    models=models,
    categories=categories,
    api_keys=api_keys,
    cache_dir=str(cache_dir),
    results_dir=str(results_dir)
)

print("✅ Annotator configured!")
print(f"   • Models: {len(models)}")
print(f"   • Categories: {len(categories)}")
print(f"   • Cache: {cache_dir}")
print(f"   • Results: {results_dir}")

---

## 4. Run Annotation

### 4.1 Configure Parameters

In [None]:
# Annotation parameters
num_repetitions = 3  # Each LLM annotates every text 3 times
test_param_variations = False  # Set to True to test parameter variations

# Cost estimate
total_annotations = len(texts) * len(models) * num_repetitions
estimated_cost = total_annotations * 0.002  # Rough estimate: flat $0.002 per annotation

print("⚙️  Annotation configuration:")
print(f"   • Texts: {len(texts)}")
print(f"   • Models: {len(models)}")
print(f"   • Repetitions per model: {num_repetitions}")
print(f"   • Total annotations: {total_annotations}")
print(f"   • Estimated cost: ${estimated_cost:.2f}")
print(f"\n⚠️  Starting annotation...")
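
The flat rate above treats every model as equally expensive. A per-model breakdown can be sketched as follows; the prices in `PRICE_PER_CALL` are placeholders chosen for illustration, not real provider pricing, and `estimate_cost` is a hypothetical helper.

```python
# Placeholder per-annotation prices in USD (illustrative values, not real pricing)
PRICE_PER_CALL = {"gpt-4-turbo": 0.010, "gpt-3.5-turbo": 0.001}
DEFAULT_PRICE = 0.002  # same flat rate as the estimate in the cell above


def estimate_cost(num_texts, models, num_repetitions, prices=PRICE_PER_CALL):
    """Return a {model: estimated cost} mapping for one annotation run."""
    calls_per_model = num_texts * num_repetitions
    return {m: calls_per_model * prices.get(m, DEFAULT_PRICE) for m in models}


breakdown = estimate_cost(100, ["gpt-4-turbo", "gpt-3.5-turbo", "gemini-pro"], 3)
total = sum(breakdown.values())
for model, cost in breakdown.items():
    print(f"{model}: ${cost:.2f}")
print(f"total: ${total:.2f}")
```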

### 4.2 Run the Annotation

In [None]:
# Annotate the dataset
df_annotations = annotator.annotate_dataset(
    texts=texts,
    num_repetitions=num_repetitions,
    test_param_variations=test_param_variations,
)

print("\n✅ Annotation complete!")
print(f"   • Shape: {df_annotations.shape}")
print(f"   • Columns: {len(df_annotations.columns)}")

### 4.3 Inspect the Annotations

In [None]:
# Show the first rows
print("📋 First annotations:\n")
display(df_annotations.head())

# Annotation columns (one per model and repetition)
annotation_cols = [col for col in df_annotations.columns if col.startswith(('gpt', 'claude', 'gemini'))]
print(f"\n📊 Annotation columns: {len(annotation_cols)}")
print(f"   Examples: {annotation_cols[:5]}")

---

## 5. Compute Consensus

### 5.1 Run the Consensus Analysis

In [None]:
print("🧮 Computing consensus...\n")

# Compute consensus
df_with_consensus = annotator.calculate_consensus(df_annotations)

print("✅ Consensus computed!")
print(f"   • Shape: {df_with_consensus.shape}")

# Show the new columns
consensus_cols = [col for col in df_with_consensus.columns if 'consensus' in col.lower()]
print(f"\n📊 Consensus columns added: {len(consensus_cols)}")
print(f"   {consensus_cols[:5]}")
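
The notebook does not show how `calculate_consensus` derives `consensus_score`. One common definition, given here only as an assumption, is the share of annotations that agree with the most frequent label. The helper `majority_consensus` below is hypothetical.

```python
from collections import Counter


def majority_consensus(annotations):
    """Return (most common label, fraction of annotations that chose it)."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


# Five annotations of the same text (for example, several models and repetitions)
label, score = majority_consensus(["pos", "pos", "neg", "pos", "neu"])
print(label, score)  # -> pos 0.6
```

Under this definition a score of 0.6 means three of five annotations agreed, which falls at the lower edge of the medium band used in the next section.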

### 5.2 Consensus Statistics

In [None]:
# Overall statistics
print("📊 Consensus statistics:\n")
print(f"   • Mean consensus: {df_with_consensus['consensus_score'].mean():.2%}")
print(f"   • Median consensus: {df_with_consensus['consensus_score'].median():.2%}")
print(f"   • Standard deviation: {df_with_consensus['consensus_score'].std():.2%}")

# By consensus level
high_consensus = (df_with_consensus['consensus_score'] >= 0.8).sum()
medium_consensus = ((df_with_consensus['consensus_score'] >= 0.6) & 
                    (df_with_consensus['consensus_score'] < 0.8)).sum()
low_consensus = (df_with_consensus['consensus_score'] < 0.6).sum()

print(f"\n📈 Distribution by level:")
print(f"   • High consensus (≥80%): {high_consensus} ({high_consensus/len(df_with_consensus):.1%})")
print(f"   • Medium consensus (60-80%): {medium_consensus} ({medium_consensus/len(df_with_consensus):.1%})")
print(f"   • Low consensus (<60%): {low_consensus} ({low_consensus/len(df_with_consensus):.1%})")

# Problematic cases
problematic = df_with_consensus['is_problematic'].sum()
print(f"\n⚠️  Problematic cases: {problematic} ({problematic/len(df_with_consensus):.1%})")
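
The three bands can also be expressed as a single labeling function, which keeps the thresholds in one place. This is a small sketch using the same 0.8 and 0.6 cut-offs as the cell above; `consensus_level` is a hypothetical name.

```python
from collections import Counter


def consensus_level(score, high=0.8, medium=0.6):
    """Map a consensus score in [0, 1] to 'high', 'medium' or 'low'."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


demo_scores = [1.0, 0.8, 0.79, 0.6, 0.4]
level_counts = Counter(consensus_level(s) for s in demo_scores)
print(level_counts)
```

On the real data this could be applied with `df_with_consensus['consensus_score'].map(consensus_level)`.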

### 5.3 Plot the Consensus Distribution

In [None]:
# Consensus histogram
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(df_with_consensus['consensus_score'], bins=20, edgecolor='black', alpha=0.7)
plt.axvline(0.8, color='green', linestyle='--', label='High consensus (80%)')
plt.axvline(0.6, color='orange', linestyle='--', label='Medium consensus (60%)')
plt.xlabel('Consensus Score')
plt.ylabel('Frequency')
plt.title('Distribution of Consensus Scores')
plt.legend()

plt.subplot(1, 2, 2)
levels = ['High\n(≥80%)', 'Medium\n(60-80%)', 'Low\n(<60%)']
counts = [high_consensus, medium_consensus, low_consensus]
colors = ['green', 'orange', 'red']
plt.bar(levels, counts, color=colors, alpha=0.7)
plt.ylabel('Count')
plt.title('Instances by Consensus Level')

plt.tight_layout()
plt.show()

---

## 6. Detailed Analysis

### 6.1 Initialize the Analyzer

In [None]:
print("🔬 Initializing consensus analyzer...\n")

# Create the analyzer
analyzer = ConsensusAnalyzer(categories=categories)

# Identify the per-model consensus columns (excluding score columns)
consensus_annotation_cols = [col for col in df_with_consensus.columns 
                              if '_consensus' in col and '_score' not in col]

print(f"✅ Analyzer configured")
print(f"   • Categories: {len(categories)}")
print(f"   • Consensus columns: {len(consensus_annotation_cols)}")

### 6.2 Generate the Full Report

In [None]:
print("📊 Generating consensus report...\n")

# Generate the report
report = analyzer.generate_consensus_report(
    df=df_with_consensus,
    annotator_cols=consensus_annotation_cols,
    output_dir=str(results_dir)
)

print("\n✅ Report generated!")
print(f"\n📋 Report contents:")
for key in report:
    print(f"   • {key}")

### 6.3 Distance Metrics

In [None]:
# Show distance metrics
print("📏 Distance and agreement metrics:\n")

distance_metrics = report.get('distance_metrics', {})
for metric, value in distance_metrics.items():
    if isinstance(value, (int, float)):
        print(f"   • {metric}: {value:.4f}")

# Interpretation
if 'mean_cohen_kappa' in distance_metrics:
    kappa = distance_metrics['mean_cohen_kappa']
    print(f"\n📊 Interpreting Cohen's Kappa ({kappa:.4f}):")
    if kappa > 0.8:
        print("   ✅ Excellent agreement!")
    elif kappa > 0.6:
        print("   ✅ Good agreement")
    elif kappa > 0.4:
        print("   ⚠️  Moderate agreement")
    else:
        print("   ❌ Weak agreement")
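
For reference, Cohen's kappa corrects the observed agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from scratch and averages it over all model pairs. How `ConsensusAnalyzer` computes `mean_cohen_kappa` is not shown in this notebook, so treat this as an illustration of the metric, not as the system's implementation.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two annotators' label frequencies
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over every pair of annotators in {name: labels}."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(annotations[x], annotations[y]) for x, y in pairs) / len(pairs)


demo = {
    "model_a": ["p", "p", "n", "n"],
    "model_b": ["p", "p", "n", "p"],
    "model_c": ["p", "p", "n", "n"],
}
kappa_ab = cohen_kappa(demo["model_a"], demo["model_b"])
mean_kappa = mean_pairwise_kappa(demo)
print(f"{kappa_ab:.2f} {mean_kappa:.2f}")  # -> 0.50 0.67
```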

### 6.4 Agreement Matrix

In [None]:
import numpy as np

# Pairwise agreement
if 'pairwise_agreement' in report:
    print("🤝 Pairwise agreement between models:\n")

    pairwise_agreement = report['pairwise_agreement']
    display(pairwise_agreement)

    # Use only distinct model pairs: the diagonal compares each model with
    # itself (always 100%) and would inflate the mean and the maximum.
    # Assumes a square models x models matrix.
    values = np.asarray(pairwise_agreement, dtype=float)
    off_diagonal = values[~np.eye(values.shape[0], dtype=bool)]

    # Statistics
    print(f"\n📊 Statistics:")
    print(f"   • Mean agreement: {off_diagonal.mean():.2%}")
    print(f"   • Minimum agreement: {off_diagonal.min():.2%}")
    print(f"   • Maximum agreement: {off_diagonal.max():.2%}")
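
A pairwise agreement matrix of this kind can be built by counting, for each pair of models, the share of texts on which they chose the same label. The sketch below uses plain dictionaries and hypothetical helper names; the structure actually returned in `report['pairwise_agreement']` may differ.

```python
def pairwise_agreement_matrix(annotations):
    """Return {model_a: {model_b: fraction of texts with identical labels}}."""
    names = list(annotations)
    matrix = {}
    for a in names:
        matrix[a] = {}
        for b in names:
            matches = sum(x == y for x, y in zip(annotations[a], annotations[b]))
            matrix[a][b] = matches / len(annotations[a])
    return matrix


def off_diagonal_values(matrix):
    """Agreement values for distinct model pairs only (self-agreement excluded)."""
    return [v for a, row in matrix.items() for b, v in row.items() if a != b]


demo = {
    "m1": ["p", "p", "n", "n"],
    "m2": ["p", "p", "n", "p"],
    "m3": ["n", "p", "n", "p"],
}
matrix = pairwise_agreement_matrix(demo)
pair_values = off_diagonal_values(matrix)
print(matrix["m1"]["m2"], min(pair_values), max(pair_values))  # -> 0.75 0.5 0.75
```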

### 6.5 Problematic Cases

In [None]:
# Identify problematic cases
problematic_cases = df_with_consensus[df_with_consensus['is_problematic']]

print(f"⚠️  Problematic cases: {len(problematic_cases)}\n")

if len(problematic_cases) > 0:
    # Show a few examples
    print("📝 Examples of problematic cases:\n")
    for idx in problematic_cases.head(3).index:
        row = df_with_consensus.loc[idx]
        print(f"Text {idx}:")
        print(f"   {row['text'][:100]}...")
        print(f"   Consensus: {row['consensus_score']:.0%}")
        print(f"   Most common annotation: {row['most_common_annotation']}")
        print()
else:
    print("✅ No problematic cases found!")

---

## 7. Visualizations

### 7.1 Initialize the Visualizer

In [None]:
# Create the figures directory
figures_dir = results_dir / 'figures'
figures_dir.mkdir(exist_ok=True)

print("📊 Initializing visualizer...\n")

# Create the visualizer
visualizer = ConsensusVisualizer(output_dir=str(figures_dir))

print(f"✅ Visualizer configured")
print(f"   • Output: {figures_dir}")

### 7.2 Agreement Heatmap

In [None]:
# Generate the heatmap
if 'pairwise_agreement' in report:
    print("🔥 Generating agreement heatmap...\n")

    visualizer.plot_agreement_heatmap(
        report['pairwise_agreement'],
        title="Agreement between LLM Models"
    )

    print(f"✅ Heatmap saved to: {figures_dir}/agreement_heatmap.png")

### 7.3 Consensus Distribution

In [None]:
print("📊 Generating consensus distribution plot...\n")

visualizer.plot_consensus_distribution(df_with_consensus)

print(f"✅ Plot saved to: {figures_dir}/consensus_distribution.png")

### 7.4 Confusion Matrix

In [None]:
# Confusion matrix (if available)
if 'disagreement_patterns' in report and 'confusion_matrix' in report['disagreement_patterns']:
    print("🎯 Generating confusion matrix...\n")

    visualizer.plot_confusion_matrix(
        report['disagreement_patterns']['confusion_matrix']
    )

    print(f"✅ Matrix saved to: {figures_dir}/confusion_matrix.png")

### 7.5 Model Comparison

In [None]:
print("🤖 Generating model comparison...\n")

visualizer.plot_model_comparison(
    df_with_consensus,
    models=models
)

print(f"✅ Comparison saved to: {figures_dir}/model_comparison.png")

### 7.6 Interactive Dashboard

In [None]:
print("🌐 Generating interactive dashboard...\n")

visualizer.create_interactive_dashboard(
    df_with_consensus,
    report
)

dashboard_path = figures_dir / 'interactive_dashboard.html'
print(f"✅ Dashboard saved to: {dashboard_path}")
print(f"\n💡 Open it in a browser to explore: {dashboard_path.absolute()}")

---

## 8. Validation with Ground Truth

If the dataset has labels, we can validate the quality of the annotations.

In [None]:
if ground_truth:
    print("✅ Ground truth available - validating annotations...\n")

    # Add ground truth to the DataFrame
    df_with_consensus['ground_truth'] = ground_truth

    # Compute metrics
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    # Accuracy
    accuracy = accuracy_score(
        df_with_consensus['ground_truth'],
        df_with_consensus['most_common_annotation']
    )

    print(f"📊 VALIDATION AGAINST GROUND TRUTH\n")
    print(f"   • Overall accuracy: {accuracy:.2%}\n")

    # Classification report. `labels=` pins the row order to `categories`;
    # this assumes the labels are the category values themselves.
    print("📋 Classification report:\n")
    print(classification_report(
        df_with_consensus['ground_truth'],
        df_with_consensus['most_common_annotation'],
        labels=categories,
        target_names=[str(c) for c in categories],
        zero_division=0
    ))

    # Confusion matrix, in the same category order as the axis tick labels
    cm = confusion_matrix(
        df_with_consensus['ground_truth'],
        df_with_consensus['most_common_annotation'],
        labels=categories
    )

    # Plot the confusion matrix
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=categories, yticklabels=categories)
    plt.title('Confusion Matrix: Consensus vs Ground Truth')
    plt.ylabel('Ground Truth')
    plt.xlabel('Consensus (Automatic Annotation)')
    plt.tight_layout()
    plt.savefig(figures_dir / 'confusion_vs_ground_truth.png', dpi=300, bbox_inches='tight')
    plt.show()

    print(f"\n✅ Matrix saved to: {figures_dir}/confusion_vs_ground_truth.png")

    # Accuracy per consensus level, using the same non-overlapping bands as
    # section 5.2 (cumulative ">= threshold" masks would mix the levels)
    print("\n📊 Accuracy by consensus level:\n")

    scores = df_with_consensus['consensus_score']
    bands = [
        ('High (≥80%)', scores >= 0.8),
        ('Medium (60-80%)', (scores >= 0.6) & (scores < 0.8)),
        ('Low (<60%)', scores < 0.6),
    ]
    for name, mask in bands:
        if mask.sum() > 0:
            acc = accuracy_score(
                df_with_consensus.loc[mask, 'ground_truth'],
                df_with_consensus.loc[mask, 'most_common_annotation']
            )
            print(f"   • {name}: {acc:.2%} ({mask.sum()} instances)")

else:
    print("ℹ️  Ground truth not available - skipping validation")
    print("\n💡 Tip: use a labeled dataset to validate annotation quality!")
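
When ground truth is available, choosing a consensus threshold becomes a trade-off between coverage (how many instances are kept) and accuracy on the kept instances. The following sketch sweeps a few thresholds over plain lists; `coverage_accuracy` is a hypothetical helper and the demo values are made up.

```python
def coverage_accuracy(scores, predictions, truth, thresholds=(0.6, 0.8, 1.0)):
    """For each threshold, return (threshold, coverage, accuracy on kept instances)."""
    rows = []
    for t in thresholds:
        kept = [(p, g) for s, p, g in zip(scores, predictions, truth) if s >= t]
        coverage = len(kept) / len(scores)
        accuracy = sum(p == g for p, g in kept) / len(kept) if kept else None
        rows.append((t, coverage, accuracy))
    return rows


# Made-up example: consensus scores, consensus labels and true labels for 4 texts
demo_scores = [1.0, 0.8, 0.6, 0.4]
demo_preds = ["a", "b", "a", "b"]
demo_truth = ["a", "b", "b", "b"]

sweep = coverage_accuracy(demo_scores, demo_preds, demo_truth)
for threshold, coverage, accuracy in sweep:
    print(f"threshold={threshold:.1f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

With the notebook's data, the inputs would be the `consensus_score`, `most_common_annotation` and `ground_truth` columns converted with `.tolist()`.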

---

## 9. Export Results

### 9.1 Save Datasets

In [None]:
print("💾 Saving results...\n")

# Create the final output directory
final_dir = results_dir / 'final'
final_dir.mkdir(exist_ok=True)

# 1. Full dataset
complete_path = final_dir / 'dataset_anotado_completo.csv'
df_with_consensus.to_csv(complete_path, index=False, encoding='utf-8')
print(f"✅ Full dataset: {complete_path}")

# 2. High confidence (≥80%)
high_conf = df_with_consensus[df_with_consensus['consensus_score'] >= 0.8]
high_conf_path = final_dir / 'alta_confianca.csv'
high_conf.to_csv(high_conf_path, index=False, encoding='utf-8')
print(f"✅ High confidence ({len(high_conf)} instances): {high_conf_path}")

# 3. Needs review (<80%)
needs_review = df_with_consensus[df_with_consensus['consensus_score'] < 0.8]
review_path = final_dir / 'necessita_revisao.csv'
needs_review.to_csv(review_path, index=False, encoding='utf-8')
print(f"✅ Needs review ({len(needs_review)} instances): {review_path}")

# 4. Problematic cases (`problematic_cases` comes from section 6.5)
if len(problematic_cases) > 0:
    problematic_path = final_dir / 'casos_problematicos.csv'
    problematic_cases.to_csv(problematic_path, index=False, encoding='utf-8')
    print(f"✅ Problematic cases ({len(problematic_cases)} instances): {problematic_path}")

### 9.2 JSON Summary

In [None]:
import json
from datetime import datetime

# Build the summary
summary = {
    'dataset': {
        'name': dataset_name,
        'total_instances': len(texts),
        'categories': categories,
        'has_ground_truth': bool(ground_truth),  # same truthiness test as the `if ground_truth` checks
    },
    'configuration': {
        'models': models,
        'num_models': len(models),
        'num_repetitions': num_repetitions,
        'total_annotations': total_annotations,
    },
    'results': {
        'consensus': {
            'mean': float(df_with_consensus['consensus_score'].mean()),
            'median': float(df_with_consensus['consensus_score'].median()),
            'std': float(df_with_consensus['consensus_score'].std()),
        },
        'distribution': {
            'high_consensus': int(high_consensus),
            'medium_consensus': int(medium_consensus),
            'low_consensus': int(low_consensus),
        },
        'problematic_cases': int(problematic),
    },
    'metrics': distance_metrics if 'distance_metrics' in locals() else {},
    'timestamp': datetime.now().isoformat(),
}

# Add validation results if available
if ground_truth:
    summary['validation'] = {
        'accuracy': float(accuracy),
    }

# Save; `default=str` is a fallback for values json cannot encode
# (for example NumPy integers or arrays inside `distance_metrics`)
summary_path = final_dir / 'sumario_experimento.json'
with open(summary_path, 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2, ensure_ascii=False, default=str)

print(f"\n✅ Summary saved: {summary_path}")
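
Summaries assembled from pandas and NumPy results often contain values that `json.dump` cannot encode. A converter passed through the `default=` parameter is one way to handle this. The sketch below is an assumption-labeled example: `to_jsonable` is a hypothetical helper, and it relies on the fact that NumPy scalars, NumPy arrays and pandas Series all expose a `tolist()` method.

```python
import json
from datetime import datetime
from pathlib import Path


def to_jsonable(value):
    """Fallback for json.dump(default=...): convert common non-JSON types."""
    if isinstance(value, (Path, datetime)):
        return str(value)
    if isinstance(value, (set, frozenset)):
        return sorted(value, key=str)
    if hasattr(value, "tolist"):  # NumPy scalars/arrays, pandas Series
        return value.tolist()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


demo_summary = {
    "output_dir": Path("results"),
    "models": {"gpt-4-turbo", "gemini-pro"},
    "created": datetime(2024, 1, 1, 12, 0, 0),
}
encoded = json.dumps(demo_summary, default=to_jsonable, sort_keys=True)
decoded = json.loads(encoded)
print(decoded)
```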

### 9.3 Final Summary

In [None]:
print("\n" + "="*80)
print(" "*25 + "FINAL SUMMARY")
print("="*80 + "\n")

print(f"📊 DATASET")
print(f"   • Name: {dataset_name}")
print(f"   • Total texts: {len(texts)}")
print(f"   • Categories: {len(categories)}")

print(f"\n🤖 CONFIGURATION")
print(f"   • Models: {len(models)}")
print(f"   • Repetitions: {num_repetitions}")
print(f"   • Total annotations: {total_annotations}")

print(f"\n🎯 CONSENSUS")
print(f"   • Mean consensus: {df_with_consensus['consensus_score'].mean():.2%}")
print(f"   • High consensus (≥80%): {high_consensus} ({high_consensus/len(df_with_consensus):.1%})")
print(f"   • Problematic cases: {problematic} ({problematic/len(df_with_consensus):.1%})")

if 'mean_cohen_kappa' in distance_metrics:
    print(f"\n📏 METRICS")
    print(f"   • Cohen's Kappa: {distance_metrics['mean_cohen_kappa']:.4f}")
    if 'fleiss_kappa' in distance_metrics:
        print(f"   • Fleiss' Kappa: {distance_metrics['fleiss_kappa']:.4f}")

if ground_truth:
    print(f"\n✅ VALIDATION")
    print(f"   • Accuracy vs ground truth: {accuracy:.2%}")

print(f"\n📁 GENERATED FILES")
print(f"   • Full dataset: {complete_path.name}")
print(f"   • High confidence: {high_conf_path.name}")
print(f"   • Needs review: {review_path.name}")
print(f"   • Interactive dashboard: interactive_dashboard.html")
print(f"   • JSON summary: {summary_path.name}")

print(f"\n📂 DIRECTORIES")
print(f"   • Results: {results_dir}")
print(f"   • Figures: {figures_dir}")
print(f"   • Final: {final_dir}")

print("\n" + "="*80)
print("✅ ANALYSIS COMPLETE!")
print("="*80 + "\n")

# Build the list first so the numbering stays contiguous without ground truth
print("🎉 Next steps:")
next_steps = [
    "Open the interactive dashboard in a browser",
    "Analyze low-consensus cases",
    "Review problematic cases",
]
if ground_truth:
    next_steps.append("Analyze errors against the ground truth")
next_steps += ["Document your findings", "Prepare a presentation for your advisor"]
for i, step in enumerate(next_steps, 1):
    print(f"   {i}. {step}")

---

## 🎓 Conclusions and Next Steps

### Analyzing the Results

When analyzing the results, consider:

1. **High-consensus rate**: What percentage of instances reach ≥80% consensus?
2. **Cohen's Kappa**: Is the agreement between models good (>0.6)?
3. **Accuracy** (if ground truth is available): Are the annotations correct?
4. **Problematic cases**: Why are these cases difficult?

### Research Questions

1. Which consensus threshold should we use?
2. Would few-shot learning improve the results?
3. Which models are the most consistent?
4. Is the cost worthwhile compared with human annotation?

### Possible Improvements

1. **Prompts**: Test few-shot or Chain-of-Thought prompting
2. **Models**: Add or remove models based on performance
3. **Parameters**: Test different temperatures
4. **Strategies**: Optimize conflict resolution
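
As a starting point for the prompt experiments, a few-shot prompt can be assembled by placing labeled examples before the text to classify. The sketch below is illustrative only: `build_few_shot_prompt` and its wording are assumptions and do not reflect the prompt template the annotation system actually uses.

```python
def build_few_shot_prompt(text, categories, examples):
    """Build a classification prompt with labeled examples before the target text."""
    lines = [f"Classify the text into one of: {', '.join(categories)}.", ""]
    for example_text, example_label in examples:
        lines += [f"Text: {example_text}", f"Category: {example_label}", ""]
    lines += [f"Text: {text}", "Category:"]
    return "\n".join(lines)


demo_prompt = build_few_shot_prompt(
    text="The delivery arrived two weeks late.",
    categories=["positive", "negative"],
    examples=[
        ("Great product, works perfectly.", "positive"),
        ("It broke after one day.", "negative"),
    ],
)
print(demo_prompt)
```

Comparing consensus and accuracy with zero, two and five examples per category would give a first answer to the few-shot question above.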

---

**Good luck with your research!** 🚀