Scope
This is a documentation/examples task — no core code changes needed. These evaluators are implementable as code judges using the existing @agentv/eval SDK.
What to Create
Add example code judges in examples/ for standard NLP metrics:
1. ROUGE-N (examples/nlp-metrics/judges/rouge.ts)
import { defineCodeJudge } from "@agentv/eval";
// Use rouge npm package to compute ROUGE-N score
// Config: { n: 1, threshold: 0.75 }
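A minimal sketch, assuming defineCodeJudge takes an async handler that receives { output, expected, config } and returns { score, pass, reason } (the actual SDK shape may differ), and assuming the rouge npm package's n(candidate, reference, opts) returns a score in [0, 1]:

import { defineCodeJudge } from "@agentv/eval";
import rouge from "rouge"; // assumed import shape for the `rouge` package

export default defineCodeJudge(async ({ output, expected, config }) => {
  const n = config?.n ?? 1;
  const threshold = config?.threshold ?? 0.75;
  // ROUGE-N measures n-gram overlap between candidate and reference.
  // The option name for the gram size is assumed; check the package docs.
  const score = rouge.n(output, expected, { n });
  return {
    score,
    pass: score >= threshold,
    reason: `ROUGE-${n} = ${score.toFixed(3)} (threshold ${threshold})`,
  };
});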
2. BLEU (examples/nlp-metrics/judges/bleu.ts)
// Use sacrebleu or bleu-score package
// Config: { threshold: 0.5 }
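Because sacrebleu is a Python tool and no canonical npm BLEU package can be assumed, this sketch computes an unsmoothed sentence-level BLEU inline (uniform weights up to 4-grams, plus brevity penalty); the handler shape is the same assumption as in the ROUGE sketch:

import { defineCodeJudge } from "@agentv/eval";

// Count n-grams in a token list.
function ngramCounts(tokens: string[], n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Minimal sentence-level BLEU. Unsmoothed: any zero n-gram precision yields 0,
// so very short outputs score 0; real implementations apply smoothing.
function bleu(candidate: string, reference: string, maxN = 4): number {
  const cand = candidate.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  if (cand.length === 0) return 0;
  let logPrecisionSum = 0;
  for (let n = 1; n <= maxN; n++) {
    const candCounts = ngramCounts(cand, n);
    const refCounts = ngramCounts(ref, n);
    let clipped = 0;
    let total = 0;
    for (const [gram, count] of candCounts) {
      clipped += Math.min(count, refCounts.get(gram) ?? 0); // clipped precision
      total += count;
    }
    if (total === 0 || clipped === 0) return 0;
    logPrecisionSum += Math.log(clipped / total) / maxN;
  }
  // Brevity penalty: penalize candidates shorter than the reference.
  const brevity = cand.length < ref.length ? Math.exp(1 - ref.length / cand.length) : 1;
  return brevity * Math.exp(logPrecisionSum);
}

export default defineCodeJudge(async ({ output, expected, config }) => {
  const threshold = config?.threshold ?? 0.5;
  const score = bleu(output, expected);
  return { score, pass: score >= threshold, reason: `BLEU = ${score.toFixed(3)}` };
});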
3. Semantic Similarity (examples/nlp-metrics/judges/similarity.ts)
// Use target proxy to call embedding model
// Compute cosine similarity between expected and actual
// Config: { threshold: 0.8 }
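A sketch under the assumption that embeddings come from an OpenAI-compatible /v1/embeddings endpoint; EMBEDDINGS_URL and the model name are placeholders, and a real judge would route the call through agentv's target proxy instead:

import { defineCodeJudge } from "@agentv/eval";

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Fetch one embedding. EMBEDDINGS_URL and the model are hypothetical;
// swap in the target proxy's actual endpoint and model.
async function embed(text: string): Promise<number[]> {
  const res = await fetch(`${process.env.EMBEDDINGS_URL}/v1/embeddings`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  });
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding;
}

export default defineCodeJudge(async ({ output, expected, config }) => {
  const threshold = config?.threshold ?? 0.8;
  const [a, b] = await Promise.all([embed(output), embed(expected)]);
  const score = cosine(a, b);
  return { score, pass: score >= threshold, reason: `cosine = ${score.toFixed(3)}` };
});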
4. Levenshtein Distance (examples/nlp-metrics/judges/levenshtein.ts)
// Compute edit distance, score = 1 - (distance / max_length)
// Config: { threshold: 5 } (a raw edit-distance budget, not the normalized score)
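A self-contained sketch using the standard single-row dynamic-programming edit distance; it passes or fails on the raw distance budget and reports the normalized score alongside:

import { defineCodeJudge } from "@agentv/eval";

// Classic DP edit distance, keeping only one row of the matrix.
function levenshtein(a: string, b: string): number {
  const prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0]; // value at (i-1, j-1)
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j];
      prev[j] = Math.min(
        prev[j] + 1,                              // deletion
        prev[j - 1] + 1,                          // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution
      );
      diag = tmp;
    }
  }
  return prev[b.length];
}

export default defineCodeJudge(async ({ output, expected, config }) => {
  const threshold = config?.threshold ?? 5; // raw edit-distance budget
  const distance = levenshtein(output, expected);
  const maxLength = Math.max(output.length, expected.length, 1);
  const score = 1 - distance / maxLength; // normalized, for reporting
  return { score, pass: distance <= threshold, reason: `distance = ${distance}` };
});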
Example EVAL.yaml
name: nlp-metrics-demo
cases:
  - id: summarization-test
    input: "Summarize this article..."
    expected_outcome: "Reference summary text"
    execution:
      evaluators:
        - name: rouge_score
          type: code_judge
          script: ["bun", "run", "judges/rouge.ts"]
          n: 1
          threshold: 0.75
        - name: semantic_match
          type: code_judge
          script: ["bun", "run", "judges/similarity.ts"]
          threshold: 0.8
Documentation
- Add to docs site: "NLP Metrics Evaluators" guide
- Reference from existing evaluator docs
- Show how to compose with other evaluator types
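For the composition section, a fragment along these lines could pair a code judge with another evaluator type; llm_judge and its prompt field are assumed names used for illustration only, so substitute whatever types the evaluator docs actually define:

evaluators:
  - name: rouge_score
    type: code_judge
    script: ["bun", "run", "judges/rouge.ts"]
    threshold: 0.75
  - name: tone_check
    type: llm_judge   # assumed type name, for illustration
    prompt: "Does the output preserve the reference's tone and key facts?"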
Acceptance Criteria
- The four example judges exist under examples/nlp-metrics/judges/ and run via bun.
- The example EVAL.yaml executes end to end against those judges.
- The docs site has an "NLP Metrics Evaluators" guide, linked from the existing evaluator docs, covering composition with other evaluator types.