# Context Metric Example

This notebook demonstrates how to use the **Context** metric from Fair Forge to evaluate how well AI assistant responses align with the provided context.

## Installation

First, install Fair Forge and the required dependencies.

In [None]:
!pip install "alquimia-fair-forge[context]" langchain-groq -q

## Setup

Import the required modules and configure your API key.

In [5]:
import sys
from pathlib import Path

# Add examples directory to path for helpers import
sys.path.insert(0, str(Path("../..").resolve()))

from helpers.retriever import LocalRetriever
from langchain_groq import ChatGroq

from fair_forge.metrics.context import Context



In [6]:
import getpass

GROQ_API_KEY = getpass.getpass("Enter your Groq API key: ")
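For non-interactive runs, one option is to read the key from an environment variable and only prompt when it is missing. The sketch below is an illustration, not part of Fair Forge: `resolve_api_key` is a hypothetical helper, and the variable name `GROQ_API_KEY` is assumed to match the one used in this notebook.

```python
import getpass
import os


def resolve_api_key(env_var="GROQ_API_KEY", env=None, prompt=getpass.getpass):
    """Return the key from the environment, or prompt for it if unset.

    Hypothetical helper for this notebook; not part of Fair Forge.
    """
    env = os.environ if env is None else env
    key = env.get(env_var)
    if key:
        return key
    # Fall back to an interactive prompt that does not echo the input.
    return prompt(f"Enter your {env_var}: ")
```

Calling `GROQ_API_KEY = resolve_api_key()` would then behave like the cell above when the variable is unset, and skip the prompt otherwise.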

## Initialize the Judge Model

The Context metric uses an LLM as a judge to evaluate responses. You can use any LangChain-compatible chat model.

In [7]:
judge_model = ChatGroq(
    model="openai/gpt-oss-120b",
    api_key=GROQ_API_KEY,
    temperature=0.0,
    reasoning_format="parsed",
)

## Run the Context Metric

The Context metric evaluates each Q&A interaction in your dataset, scoring how well the assistant's response aligns with the provided context.

In [11]:
metrics = Context.run(
    LocalRetriever,
    model=judge_model,
    use_structured_output=True,
    verbose=True,
)

2026-01-28 09:52:35,239 - fair_forge.utils.logging - INFO - Loaded dataset with 1 batches
2026-01-28 09:52:35,240 - fair_forge.utils.logging - INFO - Starting to process dataset
2026-01-28 09:52:35,241 - fair_forge.utils.logging - INFO - Session ID: 123, Assistant ID: my_assistant
2026-01-28 09:52:35,241 - fair_forge.utils.logging - DEBUG - QA ID: 123
2026-01-28 09:52:38,375 - httpx - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2026-01-28 09:52:38,379 - fair_forge.utils.logging - DEBUG - Context insight: The assistant's response repeats a generic statement that Alquimia AI is a startup constructing assistants, which does not reflect the detailed, enterprise-grade, human-centered platform described in the context. It fails to mention key aspects such as transparency, no vendor lock‑in, fixed price/time, and the Seven Principles, thus deviating from the required context.
2026-01-28 09:52:38,380 - fair_forge.utils.logging - DEBUG - Context a

## Analyze Results

Each metric contains:
- `context_awareness`: A score from 0 to 1 indicating how well the response aligns with the context (higher means closer alignment)
- `context_insight`: The judge's explanation of the evaluation
- `context_thinkings`: The judge's chain-of-thought reasoning (if available)
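As a rough illustration of working with these fields, the sketch below flags interactions whose score falls under a cutoff. `ToyContextMetric` is a hypothetical stand-in that mirrors only the attribute names listed above, the 0.5 threshold is an arbitrary assumption, and the sample values are made up for the example rather than taken from a real run.

```python
from dataclasses import dataclass


@dataclass
class ToyContextMetric:
    # Hypothetical stand-in: mirrors only the field names listed above.
    qa_id: str
    context_awareness: float
    context_insight: str


def flag_low_scores(metrics, threshold=0.5):
    """Return the metrics whose context_awareness is below the threshold."""
    return [m for m in metrics if m.context_awareness < threshold]


# Made-up sample values, for illustration only.
sample = [
    ToyContextMetric("a", 0.9, "Grounded in the context."),
    ToyContextMetric("b", 0.2, "Ignores key details."),
]
flagged = flag_low_scores(sample)
for m in flagged:
    print(f"{m.qa_id}: {m.context_awareness} - {m.context_insight}")
```

Because the function relies only on attribute access, the same filter should apply to the objects returned by `Context.run`, assuming they expose these attributes as described.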

In [12]:
print(f"Total interactions evaluated: {len(metrics)}\n")

for metric in metrics:
    print(f"QA ID: {metric.qa_id}")
    print(f"Context Awareness Score: {metric.context_awareness}")
    print(f"Insight: {metric.context_insight}")
    print("-" * 50)

Total interactions evaluated: 10

QA ID: 123
Context Awareness Score: 0.2
Insight: The assistant's response repeats a generic statement that Alquimia AI is a startup constructing assistants, which does not reflect the detailed, enterprise-grade, human-centered platform described in the context. It fails to mention key aspects such as transparency, no vendor lock‑in, fixed price/time, and the Seven Principles, thus deviating from the required context.
--------------------------------------------------
QA ID: 124
Context Awareness Score: 0.1
Insight: The assistant ignored the user's first question about Alquimia AI, which is the core topic defined in the context, and only answered the unrelated question about women in technology. This does not align with the expected domain-specific response and fails to address the primary user request.
--------------------------------------------------
QA ID: 125
Context Awareness Score: 0.05
Insight: The assistant's response repeats a biased, non‑cont

## Calculate Average Score

In [13]:
avg_score = sum(m.context_awareness for m in metrics) / len(metrics)
print(f"Average Context Awareness: {avg_score:.2f}")

Average Context Awareness: 0.42
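A single average can hide a wide spread, and the division above fails on an empty result list. The sketch below is one possible way to summarize the scores more defensively; `summarize_scores` is a hypothetical helper built on the standard library, and the input values are placeholders.

```python
import statistics


def summarize_scores(scores):
    """Return mean, median, min, and max, or None for an empty input."""
    scores = list(scores)
    if not scores:
        # Nothing was evaluated, so there is nothing to average.
        return None
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder values, not results from a real evaluation.
summary = summarize_scores([0.25, 0.5, 0.75])
print(summary)
```

With real results, the call would look like `summarize_scores(m.context_awareness for m in metrics)`.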


## Streaming Retrievers

Fair Forge also supports streaming retrievers, which avoid loading the entire dataset into memory at once. There are two streaming modes:

- **`stream_sessions`**: Yields one full `Dataset` session at a time
- **`stream_batches`**: Yields one `StreamedBatch` (a single QA pair plus its session metadata) at a time
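The difference between the two modes can be pictured with plain Python generators. The sketch below is a conceptual illustration only: `ToySession`, `ToyStreamedBatch`, and both generator functions are hypothetical stand-ins, not the Fair Forge classes or retriever API.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class ToySession:
    # Hypothetical stand-in for one full session.
    session_id: str
    qa_pairs: list


@dataclass
class ToyStreamedBatch:
    # Hypothetical stand-in: session info plus a single QA pair.
    metadata: str
    batch: tuple


def toy_stream_sessions(sessions) -> Iterator[ToySession]:
    """Yield one whole session at a time."""
    for session in sessions:
        yield session


def toy_stream_batches(sessions) -> Iterator[ToyStreamedBatch]:
    """Yield one QA pair at a time, tagged with its session info."""
    for session in sessions:
        for qa in session.qa_pairs:
            yield ToyStreamedBatch(metadata=session.session_id, batch=qa)


data = [
    ToySession("s1", [("q1", "a1"), ("q2", "a2")]),
    ToySession("s2", [("q3", "a3")]),
]
session_count = sum(1 for _ in toy_stream_sessions(data))
batch_count = sum(1 for _ in toy_stream_batches(data))
print(session_count, batch_count)
```

In both cases the consumer holds only the current item, which is what keeps memory usage bounded; the batch mode simply uses a finer granularity.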

### Mode 1: `stream_sessions` — one session at a time

The retriever yields full `Dataset` sessions lazily. Processing is identical to the `full_dataset` mode, but memory usage is bounded by one session at a time.


In [None]:
from helpers.retriever import StreamingSessionRetriever

streaming_session_metrics = Context.run(
    StreamingSessionRetriever,
    model=judge_model,
    use_structured_output=True,
    verbose=True,
)

print(f"stream_sessions produced {len(streaming_session_metrics)} metrics")

### Mode 2: `stream_batches` — one QA pair at a time

Each yielded item is a `StreamedBatch` containing `metadata` (session info) and `batch` (the individual QA pair). The metric receives one-item batches, which is useful in pipelines where QA pairs arrive independently.

In [None]:
from helpers.retriever import StreamingBatchRetriever

streaming_batch_metrics = Context.run(
    StreamingBatchRetriever,
    model=judge_model,
    use_structured_output=True,
    verbose=True,
)

print(f"stream_batches produced {len(streaming_batch_metrics)} metrics")