# Advanced Scorers

This notebook covers advanced scoring techniques: LLM judges, causal analysis, composite scorers, score metadata, and async scorers.

## Prerequisites

```bash
pip install neon-sdk
```

You'll need an Anthropic API key for the LLM judge examples.

In [None]:
import os
# Set your API key (or use environment variable)
# os.environ["ANTHROPIC_API_KEY"] = "your-api-key"

## 1. LLM Judge Scorers

LLM judges use a language model to evaluate agent responses with semantic understanding.

In [None]:
from neon_sdk.scorers import llm_judge, LLMJudgeConfig
from neon_sdk.scorers.base import EvalContext

# Create a custom LLM judge
quality_judge = llm_judge(LLMJudgeConfig(
    prompt='''Evaluate the quality of this response on a scale of 0 to 1.

User Query: {{input}}
Agent Response: {{output}}

Consider:
- Accuracy: Is the information correct?
- Completeness: Does it fully answer the question?
- Clarity: Is it easy to understand?

Return your evaluation as JSON:
{"score": <0-1>, "reason": "<brief explanation>"}''',
    model='claude-3-haiku-20240307',
))

# Note: This requires an API key to run
# result = await quality_judge.evaluate(EvalContext(
#     input={"query": "What is Python?"},
#     output="Python is a programming language.",
# ))
# print(f"Score: {result.value}")
# print(f"Reason: {result.reason}")
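
To make the prompt template and the JSON reply format concrete, here is a minimal sketch of the two steps a judge has to perform around the model call: filling the `{{input}}` and `{{output}}` placeholders, and turning the reply into a score. The helpers `render_prompt` and `parse_judge_reply` are hypothetical names for illustration; they are not part of `neon_sdk`, and the SDK may handle both steps differently.

```python
import json
import re


def render_prompt(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value (hypothetical helper)."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Leave unknown placeholders untouched so mistakes stay visible.
        return str(variables.get(key, match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def parse_judge_reply(text: str) -> tuple[float, str]:
    """Pull {"score": ..., "reason": ...} out of a reply and clamp the score to [0, 1]."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return 0.0, "No JSON object found in reply"
    try:
        data = json.loads(match.group(0))
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0, "Reply was not valid judge JSON"
    return min(max(score, 0.0), 1.0), str(data.get("reason", ""))


prompt = render_prompt(
    "User Query: {{input}}\nAgent Response: {{output}}",
    {"input": "What is Python?", "output": "Python is a programming language."},
)
score, reason = parse_judge_reply('Here you go: {"score": 0.8, "reason": "Correct but brief"}')
print(prompt)
print(score, reason)
```

Clamping the score and falling back to 0.0 on malformed replies is one possible policy; raising an error instead may be preferable when silent zeros would skew aggregate results.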

## 2. Pre-Built Judges

The SDK includes pre-configured judges for common evaluation criteria.

In [None]:
from neon_sdk.scorers import (
    response_quality_judge,
    safety_judge,
    helpfulness_judge,
)

# These are ready to use
print("Pre-built judges available:")
print("- response_quality_judge: Evaluates overall response quality")
print("- safety_judge: Checks for harmful or unsafe content")
print("- helpfulness_judge: Measures how helpful the response is")

# Example usage (requires API key):
# result = await response_quality_judge.evaluate(context)

## 3. Custom Response Parser

Pass `parse_response` to control how the model's raw text reply is converted into a numeric score.

In [None]:
# Simple yes/no judge
yes_no_judge = llm_judge(LLMJudgeConfig(
    prompt='''Is this response helpful to the user?

Response: {{output}}

Answer with only YES or NO.''',
    model='claude-3-haiku-20240307',
    parse_response=lambda text: 1.0 if text.strip().upper().startswith('YES') else 0.0,
))

print("Created yes/no judge with custom parser")
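
A lambda is enough for one-word answers, but a named function is easier to test when replies may carry punctuation or extra words. The sketch below assumes `parse_response` accepts any callable that maps the reply text to a float, as the lambda in the cell above suggests; `parse_yes_no` is a hypothetical helper, not an SDK function.

```python
import re


def parse_yes_no(text: str) -> float:
    """Map a reply to 1.0 when its first whole word is YES, else 0.0."""
    words = re.findall(r"[A-Za-z]+", text.upper())
    if not words:
        return 0.0  # Empty or non-alphabetic replies count as NO.
    return 1.0 if words[0] == "YES" else 0.0


for reply in ["YES", "yes.", "No, the answer misses the point.", "Eyes on the prize"]:
    print(f"{reply!r} -> {parse_yes_no(reply)}")
```

Matching on the first whole word avoids false positives from words that merely contain "yes", at the cost of scoring any unexpected reply as 0.0.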

## 4. Causal Analysis Scorers

Analyze error propagation and identify root causes in agent traces.

In [None]:
from neon_sdk.scorers import (
    causal_analysis_scorer,
    CausalAnalysisConfig,
    analyze_causality,
)

# Create a causal analysis scorer with custom weights
causal_scorer = causal_analysis_scorer(CausalAnalysisConfig(
    root_cause_weight=0.6,      # Weight for root cause identification
    chain_completeness_weight=0.3,  # Weight for causal chain completeness
    error_rate_weight=0.1,      # Weight for overall error rate
))

print("Causal analysis scorer created")
print("This scorer analyzes traces to identify:")
print("- Root causes of failures")
print("- Error propagation chains")
print("- Component dependencies")
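
The notebook does not show how the scorer finds root causes or combines its three weights, so the following is only an illustrative sketch under stated assumptions: a trace is a tree of spans, a root cause is a failed span with no failed children, and the final score is a weighted sum in which a lower error rate is better. The `Span` class and all three functions are hypothetical and are not the SDK's types or algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical trace span; not the SDK's trace type."""
    span_id: str
    parent_id: Optional[str]
    failed: bool


def find_root_causes(spans: list[Span]) -> list[str]:
    """Return failed spans that have no failed child (the deepest failures)."""
    parents_of_failures = {s.parent_id for s in spans if s.failed and s.parent_id}
    return [s.span_id for s in spans if s.failed and s.span_id not in parents_of_failures]


def propagation_chain(spans: list[Span], root_id: str) -> list[str]:
    """Walk from a root cause up through its failed ancestors."""
    by_id = {s.span_id: s for s in spans}
    chain = []
    current = by_id.get(root_id)
    while current is not None and current.failed:
        chain.append(current.span_id)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return chain


def weighted_score(root_cause: float, chain_completeness: float, error_rate: float) -> float:
    """Assumed combination: weights 0.6 / 0.3 / 0.1, with error rate inverted."""
    return 0.6 * root_cause + 0.3 * chain_completeness + 0.1 * (1.0 - error_rate)


trace = [
    Span("agent", None, failed=True),
    Span("plan", "agent", failed=False),
    Span("tool_call", "agent", failed=True),
    Span("http_request", "tool_call", failed=True),
]
roots = find_root_causes(trace)
chain = propagation_chain(trace, roots[0])
error_rate = sum(s.failed for s in trace) / len(trace)
print(roots, chain, error_rate)
```

In this toy trace the failure starts at `http_request` and propagates through `tool_call` to `agent`, which is the kind of chain the scorer is described as identifying.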

## 5. Composite Scorers

Combine multiple scorers with weighted averaging.

In [None]:
from neon_sdk.scorers import contains, latency_scorer, LatencyThresholds
from neon_sdk.scorers.base import EvalContext, ScoreResult

# Define individual scorers
keyword_scorer = contains(["helpful", "thank"])
speed_scorer = latency_scorer(LatencyThresholds(
    excellent=500,
    good=2000,
    acceptable=5000,
))

# Create a composite scorer
def composite_evaluate(context: EvalContext) -> ScoreResult:
    """Combine multiple scorers with weights."""
    scores = [
        (keyword_scorer.evaluate(context), 0.4),
        # speed_scorer is left out here because it needs trace data
    ]
    
    weighted_sum = sum(s.value * w for s, w in scores)
    total_weight = sum(w for _, w in scores)
    
    return ScoreResult(
        value=weighted_sum / total_weight,
        reason="Composite score from multiple criteria",
    )

# Test
result = composite_evaluate(EvalContext(
    output="Thank you for your helpful question!",
))
print(f"Composite Score: {result.value:.2f}")

## 6. Scorer Metadata

Add metadata to scores for detailed analysis.

In [None]:
from neon_sdk.scorers import scorer, ScoreResult, EvalContext

@scorer("detailed_analysis")
def detailed_scorer(context: EvalContext) -> ScoreResult:
    """Scorer that includes detailed metadata."""
    output = context.output or ""
    
    # Analyze various aspects
    word_count = len(output.split())
    sentence_count = output.count('.') + output.count('!') + output.count('?')
    avg_word_length = sum(len(w) for w in output.split()) / max(word_count, 1)
    
    # Score on length alone: 50 or more words earns the full 1.0
    length_score = min(word_count / 50, 1.0)
    
    return ScoreResult(
        value=length_score,
        reason=f"Based on {word_count} words in {sentence_count} sentences",
        metadata={
            "word_count": word_count,
            "sentence_count": sentence_count,
            "avg_word_length": round(avg_word_length, 2),
            "length_score": round(length_score, 2),
        },
    )

# Test
result = detailed_scorer.evaluate(EvalContext(
    output="This is a sample response. It contains multiple sentences. The scorer analyzes various aspects of the text to produce a comprehensive evaluation.",
))

print(f"Score: {result.value:.2f}")
print(f"Reason: {result.reason}")
print("Metadata:")
for key, value in result.metadata.items():
    print(f"  {key}: {value}")
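
One way metadata can support later analysis is by aggregating it across many results. The sketch below averages every numeric metadata field over a batch; `summarize_metadata` is a hypothetical helper, and the input dictionaries simply mirror the shape of the `metadata` dict built above.

```python
from statistics import mean


def summarize_metadata(records: list[dict]) -> dict:
    """Average every numeric metadata field across a batch of score results."""
    keys = sorted({key for record in records for key in record})
    summary = {}
    for key in keys:
        values = [r[key] for r in records if isinstance(r.get(key), (int, float))]
        if values:
            summary[key] = round(mean(values), 2)
    return summary


batch = [
    {"word_count": 22, "sentence_count": 3, "length_score": 0.44},
    {"word_count": 50, "sentence_count": 5, "length_score": 1.0},
]
print(summarize_metadata(batch))
```

Fields that are missing or non-numeric in a given record are skipped for that record rather than treated as zero.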

## 7. Async Scorers

Create async scorers for external API calls.

In [None]:
import asyncio

from neon_sdk.scorers import scorer, ScoreResult, EvalContext

@scorer("async_external")
async def async_scorer(context: EvalContext) -> ScoreResult:
    """Async scorer that could call external APIs."""
    # Simulate async API call
    await asyncio.sleep(0.1)
    
    # Process result
    output = context.output or ""
    score = 1.0 if len(output) > 10 else 0.5
    
    return ScoreResult(
        value=score,
        reason="Evaluated via async process",
    )

# Test (top-level await works in Jupyter; a plain script needs an event loop, e.g. via asyncio.run)
result = await async_scorer.evaluate(EvalContext(
    output="This is a longer response that should score well.",
))
print(f"Async Score: {result.value:.2f}")
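
The main benefit of async scorers is that several slow checks can wait on their external calls at the same time. The sketch below uses only the standard library; `length_check`, `keyword_check`, and `score_concurrently` are hypothetical stand-ins, not SDK scorers.

```python
import asyncio


async def length_check(output: str) -> float:
    """Stand-in for a scorer that awaits an external API."""
    await asyncio.sleep(0.1)
    return 1.0 if len(output) > 10 else 0.5


async def keyword_check(output: str) -> float:
    """Second stand-in check with its own simulated network delay."""
    await asyncio.sleep(0.1)
    return 1.0 if "response" in output.lower() else 0.0


async def score_concurrently(output: str) -> list[float]:
    """Run both checks at once; the waits overlap instead of adding up."""
    return list(await asyncio.gather(length_check(output), keyword_check(output)))


# In a notebook cell: scores = await score_concurrently("This is a longer response.")
# In a plain script:  scores = asyncio.run(score_concurrently("This is a longer response."))
```

`asyncio.gather` returns results in the order the awaitables were passed, so each score can be matched back to its check by position.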

## Next Steps

- [03_clickhouse_analytics.ipynb](03_clickhouse_analytics.ipynb) - Store and query traces
- [04_temporal_workflows.ipynb](04_temporal_workflows.ipynb) - Durable workflow execution