# 🔍 QualiVault: Validate Transcripts
**Goal:** Use a local LLM served by Ollama to detect transcription errors and hallucinations.

This notebook:

1. Loads transcripts from CSV files.
2. Samples segments and sends them to Ollama for validation.
3. Flags potential errors: hallucinations, misheard words, and artifacts.
4. Generates a validation report with suggested corrections.

**Prerequisites:**
- Ollama must be installed and running (`ollama serve`)
- Install a model: `ollama pull llama2` or `ollama pull mistral`
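
Before running the cells below, it can help to confirm that the model you plan to use has actually been pulled. The following is a minimal sketch, assuming Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally available models. The helper names `installed_models` and `fetch_tags` are hypothetical and not part of QualiVault; the sample payload is illustrative only.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> set[str]:
    """Return base model names (without the ':tag' suffix) from an /api/tags payload."""
    names = set()
    for entry in tags_payload.get("models", []):
        name = entry.get("name", "")
        if name:
            names.add(name.split(":")[0])
    return names


def fetch_tags(ollama_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """Ask a running Ollama server for its local model list (requires the server to be up)."""
    with urllib.request.urlopen(f"{ollama_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Illustrative payload in the shape Ollama returns; not real output.
sample_payload = {"models": [{"name": "llama2:latest"}, {"name": "mistral:7b"}]}
available = installed_models(sample_payload)
print("llama2" in available)
```

With a live server, `installed_models(fetch_tags())` would give the same kind of set for the models on your machine.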

In [None]:
%load_ext autoreload
%autoreload 2
import yaml
from pathlib import Path
from qualivault.validation import OllamaValidator, validate_recipe_transcripts

# 1. Load Configuration
project_root = Path("..").resolve()
config_path = project_root / "config.yml"

with open(config_path) as f:
    config = yaml.safe_load(f)

recipe_path = project_root / "processing_recipe.yaml"
transcripts_dir = (project_root / config['paths']['output_base_folder']).resolve()

print(f"📂 Transcripts: {transcripts_dir}")
print(f"📋 Recipe: {recipe_path}")

## Configuration

Adjust these settings to control validation:

- **model**: Ollama model to use (`llama2`, `mistral`, `llama3`, etc.)
- **sample_rate**: Fraction of segments to check (0.1 = 10%, 1.0 = 100%)
- **language**: Expected language of transcripts
- **ollama_url**: URL where Ollama is running (default: localhost)
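
To make the effect of `sample_rate` concrete, here is a small sketch of how a fraction of segments could be selected. How QualiVault samples internally is not shown in this notebook, so `sample_segments`, its `seed` parameter, and the "at least one segment" rule are assumptions made for illustration.

```python
import random


def sample_segments(segments: list, sample_rate: float, seed: int = 0) -> list:
    """Pick a random fraction of segments, keeping their original order."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    if not segments:
        return []
    # Assumption: always check at least one segment so short transcripts are not skipped.
    k = max(1, round(len(segments) * sample_rate))
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    chosen = sorted(rng.sample(range(len(segments)), k))
    return [segments[i] for i in chosen]


segments = [f"segment {i}" for i in range(50)]
subset = sample_segments(segments, sample_rate=0.1)
print(len(subset))  # 5 of the 50 segments
```

A lower rate shortens the run roughly in proportion, at the cost of leaving the unsampled segments unchecked.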

In [None]:
# Validation Settings
MODEL = "llama2"              # Ollama model to use
SAMPLE_RATE = 0.1             # Check 10% of segments (faster)
LANGUAGE = "Danish"           # Expected language
OLLAMA_URL = "http://localhost:11434"

print(f"🤖 Model: {MODEL}")
print(f"📊 Sample Rate: {SAMPLE_RATE * 100}%")
print(f"🌍 Language: {LANGUAGE}")

## Test Ollama Connection

Make sure Ollama is running before proceeding.
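
If you would rather not depend on a private method for this check, the same test can be made against Ollama's REST API directly. This is a sketch under the assumption that the server exposes `/api/generate` and that a non-streaming reply is a JSON object with a `response` field; `build_generate_request` and `ping_ollama` are hypothetical helpers, not QualiVault functions.

```python
import json
import urllib.error
import urllib.request


def build_generate_request(ollama_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{ollama_url.rstrip('/')}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ping_ollama(ollama_url: str, model: str, timeout: float = 30.0):
    """Return the model's reply text, or None if the request fails."""
    request = build_generate_request(ollama_url, model, "Say 'Hello' in one word.")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return json.load(resp).get("response")
    except (urllib.error.URLError, TimeoutError, ConnectionError):
        return None


# Building the request does not contact the server; ping_ollama() does.
request = build_generate_request("http://localhost:11434", "llama2", "Say 'Hello' in one word.")
print(request.full_url)
```

Calling `ping_ollama(OLLAMA_URL, MODEL)` returns `None` when the server is down or the model is missing, which mirrors the failure branch in the cell below.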

In [None]:
validator = OllamaValidator(model=MODEL, ollama_url=OLLAMA_URL)

# Quick test
test_response = validator._query_ollama("Say 'Hello' in one word.")
if test_response:
    print(f"✅ Ollama is responding: '{test_response.strip()[:50]}'")
else:
    print("❌ Ollama is not responding. Make sure it's running: `ollama serve`")

## Validate All Transcripts

This will:
1. Load all transcribed interviews from the recipe
2. Sample segments from each CSV
3. Check each segment with Ollama for errors
4. Generate validation reports
5. Update recipe with validation status
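
Step 3 is the core of the process. The prompt and reply format that QualiVault actually uses are not shown here, so the sketch below is only one plausible shape: `build_prompt` and `parse_reply` are hypothetical, and the JSON keys are chosen to match the fields printed later in this notebook (`issues`, `confidence`, `suggestions`).

```python
import json


def build_prompt(text: str, language: str) -> str:
    """Hypothetical prompt asking the model to judge one transcript segment."""
    return (
        f"You are checking an automatic transcript in {language}.\n"
        f'Segment: "{text}"\n'
        'Reply with JSON only: {"issues": [...], "confidence": 0.0-1.0, "suggestions": "..."}'
    )


def parse_reply(reply: str) -> dict:
    """Parse the JSON between the outermost braces; unparsable replies count as 'no issues'."""
    empty = {"issues": [], "confidence": 0.0, "suggestions": ""}
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return empty
    # Fill in any keys the model left out.
    return {**empty, **data}


# Models often wrap JSON in chatter, so the parser tolerates surrounding text.
reply = 'Sure! {"issues": ["hallucination"], "confidence": 0.8}'
result = parse_reply(reply)
print(result["issues"])
```

Treating an unparsable reply as "no issues" keeps the run going, but it can hide model failures; logging those replies separately would be a reasonable alternative.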

In [None]:
# Run validation on all transcripts
reports = validate_recipe_transcripts(
    recipe_path=recipe_path,
    transcripts_dir=transcripts_dir,
    sample_rate=SAMPLE_RATE,
    model=MODEL,
    language=LANGUAGE
)

## Review Validation Reports

Inspect flagged segments and issues.
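
Beyond per-transcript totals, it can be useful to see which kinds of issues dominate. The sketch below assumes the report layout used elsewhere in this notebook (a `flagged_segments` list whose entries carry an `issues` list); `count_issue_types` is a hypothetical helper and the example data is made up.

```python
from collections import Counter


def count_issue_types(reports: list[dict]) -> Counter:
    """Tally issue labels across every flagged segment in every report."""
    counts = Counter()
    for report in reports:
        for seg in report.get("flagged_segments", []):
            counts.update(seg.get("issues", []))
    return counts


# Made-up reports in the assumed layout.
example_reports = [
    {"csv_file": "a.csv", "flagged_segments": [{"issues": ["hallucination", "artifact"]}]},
    {"csv_file": "b.csv", "flagged_segments": [{"issues": ["hallucination"]}]},
]
for issue, n in count_issue_types(example_reports).most_common():
    print(f"{issue}: {n}")
```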

In [None]:
# Summary Statistics

if reports:
    print(f"\n📊 Validation Summary:")
    print(f"   Total transcripts validated: {len(reports)}")
    
    total_flagged = sum(r.get('flagged_count', 0) for r in reports)
    print(f"   Total issues flagged: {total_flagged}")
    
    # Show transcripts with most issues
    sorted_reports = sorted(reports, key=lambda r: r.get('flagged_count', 0), reverse=True)
    
    print(f"\n⚠️  Top 5 transcripts with the most issues:")
    for i, report in enumerate(sorted_reports[:5], 1):
        # Use .get() so a report missing a field does not raise KeyError
        print(f"   {i}. {report.get('csv_file', '?')}: {report.get('flagged_count', 0)} issues")
else:
    print("No reports generated.")

## Detailed Issue Review

Examine specific flagged segments.
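
To look across all transcripts rather than only the first one, flagged segments can be pooled and ranked. This is a sketch, assuming the same report fields as above and assuming that a higher `confidence` means the model is more certain the segment is problematic (the notebook does not state this). `top_flagged` is a hypothetical helper.

```python
def top_flagged(reports: list[dict], min_confidence: float = 0.5, limit: int = 10) -> list[dict]:
    """Collect flagged segments from all reports, highest confidence first."""
    rows = []
    for report in reports:
        for seg in report.get("flagged_segments", []):
            if seg.get("confidence", 0.0) >= min_confidence:
                # Keep the source file alongside the segment fields.
                rows.append({"csv_file": report.get("csv_file"), **seg})
    rows.sort(key=lambda row: row.get("confidence", 0.0), reverse=True)
    return rows[:limit]


# Made-up reports in the assumed layout.
example_reports = [
    {"csv_file": "a.csv", "flagged_segments": [
        {"segment_index": 3, "confidence": 0.9},
        {"segment_index": 7, "confidence": 0.4},
    ]},
    {"csv_file": "b.csv", "flagged_segments": [{"segment_index": 1, "confidence": 0.7}]},
]
worst = top_flagged(example_reports, min_confidence=0.5)
for row in worst:
    print(row["csv_file"], row["segment_index"], row["confidence"])
```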

In [None]:
# Show detailed issues for first transcript
if reports and reports[0].get('flagged_segments'):
    report = reports[0]
    print(f"\n🔍 Detailed issues for: {report['csv_file']}\n")
    
    for seg in report['flagged_segments'][:10]:  # Show first 10
        print(f"Segment {seg['segment_index']} ({seg['start']:.1f}s - {seg['end']:.1f}s)")
        print(f"Speaker: {seg['speaker']}")
        print(f"Text: {seg['text']}")
        print(f"Issues: {', '.join(seg['issues'])}")
        print(f"Confidence: {seg['confidence']:.2f}")
        if seg.get('suggestions'):
            print(f"Suggestions: {seg['suggestions']}")
        print()
else:
    print("No flagged segments in the first transcript (or no reports were generated).")

## Export Validation Report

Save the full validation report as JSON for further analysis.
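
JSON keeps the full nested structure; for review in a spreadsheet, a flat table with one row per flagged segment may be easier to work with. The sketch below assumes the segment fields printed earlier in this notebook; `reports_to_csv` and the `FIELDS` column order are hypothetical choices, and the example data is made up.

```python
import csv
import io

FIELDS = ["csv_file", "segment_index", "start", "end", "speaker", "text", "issues", "confidence"]


def reports_to_csv(reports: list[dict]) -> str:
    """Flatten reports into CSV text with one row per flagged segment."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops segment keys that are not listed in FIELDS.
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for report in reports:
        for seg in report.get("flagged_segments", []):
            row = {**seg, "csv_file": report.get("csv_file", "")}
            row["issues"] = "; ".join(seg.get("issues", []))
            writer.writerow(row)
    return buffer.getvalue()


# Made-up report in the assumed layout.
example_reports = [{
    "csv_file": "a.csv",
    "flagged_segments": [{
        "segment_index": 3, "start": 1.5, "end": 4.0, "speaker": "INTERVIEWER",
        "text": "Tak for i dag", "issues": ["hallucination", "artifact"], "confidence": 0.8,
    }],
}]
csv_text = reports_to_csv(example_reports)
print(csv_text)
```

The returned text can be written next to the JSON file with `Path.write_text(csv_text, encoding="utf-8")`.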

In [None]:
import json

report_file = project_root / "validation_report.json"

with open(report_file, 'w', encoding='utf-8') as f:
    json.dump(reports, f, indent=2, ensure_ascii=False)

print(f"💾 Validation report saved to: {report_file}")