# Context Metric Example

This notebook demonstrates how to use the **Context** metric from Fair Forge to evaluate how well AI assistant responses align with the provided context.

## Installation

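First, install the locally built Fair Forge wheel with its `context` extra, along with `langchain-groq`, which provides the chat model client used as the judge.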

In [1]:
import sys

!uv pip install --python {sys.executable} --force-reinstall "$(ls ../../dist/*.whl)[context]" langchain-groq

Using Python 3.11.11 environment at: /Users/alexfiorenza/.pyenv/versions/3.11.11
Resolved 44 packages in 356ms
Prepared 44 packages in 3ms
Uninstalled 44 packages in 424ms
Installed 44 packages in 103ms
 ~ alquimia-fair-forge==0.1.1 (from file:///Users/alexfiorenza/Documents/software_development/projects/alquimia/fair-forge/dist/alquimia_fair_forge-0.1.1-py3-none-any.whl)
 ~ annotated-types==0.7.0
 ~ anyio==4.12.1
 ~ certifi==2026.1.4
 ~ charset-normalizer==3.4.4
 ~ distro==1.9.0
 ~ filelock==3.20.3
 ~ fsspec==2026.1.0
 ~ groq

## Setup

Import the required modules and configure your API key.

In [2]:
import os

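# Make the parent directory importable so that `helpers.retriever` resolves.
# `sys` was imported in the first cell.
sys.path.insert(0, os.path.dirname(os.getcwd()))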

from helpers.retriever import LocalRetriever
from langchain_groq import ChatGroq

from fair_forge.metrics.context import Context


In [3]:
import getpass

GROQ_API_KEY = getpass.getpass("Enter your Groq API key: ")
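
When the notebook runs non-interactively (for example, in CI), a `getpass` prompt blocks execution. The following is a minimal sketch, assuming the key may instead be exported as an environment variable named `GROQ_API_KEY`. The helper `resolve_api_key` is hypothetical and not part of Fair Forge.

```python
import getpass
import os
from typing import Callable


def resolve_api_key(
    env_var: str = "GROQ_API_KEY",  # assumed variable name, adjust as needed
    prompt: Callable[[str], str] = getpass.getpass,
) -> str:
    """Return the key from the environment, prompting only if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive prompt that does not echo the input.
        key = prompt(f"Enter your {env_var}: ").strip()
    if not key:
        raise ValueError(f"No API key provided for {env_var}")
    return key
```

With this helper, the cell above could become `GROQ_API_KEY = resolve_api_key()`; the prompt only appears when the variable is missing or empty.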

## Initialize the Judge Model

The Context metric uses an LLM as a judge to evaluate responses. You can use any LangChain-compatible chat model.

In [4]:
judge_model = ChatGroq(
    model="openai/gpt-oss-120b",
    api_key=GROQ_API_KEY,
    temperature=0.0,  # keep the judge's scoring as repeatable as possible
    reasoning_format="parsed",
)

## Run the Context Metric

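The Context metric evaluates each Q&A interaction in your dataset, scoring how well the assistant's response aligns with the provided context. Here, `LocalRetriever` (from the local `helpers` package) supplies the dataset and `judge_model` performs the scoring.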

In [5]:
metrics = Context.run(
    LocalRetriever,
    model=judge_model,
    use_structured_output=True,
    verbose=True,
)

2026-01-12 15:27:50,230 - fair_forge.utils.logging - INFO - Loaded dataset with 1 batches
2026-01-12 15:27:50,231 - fair_forge.utils.logging - INFO - Starting to process dataset
2026-01-12 15:27:50,231 - fair_forge.utils.logging - INFO - Session ID: 123, Assistant ID: my_assistant
2026-01-12 15:27:50,231 - fair_forge.utils.logging - DEBUG - QA ID: 123
2026-01-12 15:27:51,190 - httpx - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2026-01-12 15:27:51,200 - fair_forge.utils.logging - DEBUG - Context insight: The assistant's response repeats a generic statement that Alquimia AI is a startup building assistants, which does not reflect the detailed enterprise-grade, human-centered platform described in the context. It fails to mention the Seven Principles, transparency, fixed-price/time offerings, or other key attributes, thus deviating significantly from the required context.
2026-01-12 15:27:51,201 - fair_forge.utils.logging - DEBUG - Context 

## Analyze Results

Each metric contains:
- `context_awareness`: A score from 0 to 1 indicating how well the response aligns with the context (higher means better alignment)
- `context_insight`: The judge's explanation of the evaluation
- `context_thinkings`: The judge's chain-of-thought reasoning (if available)
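
To make the field list concrete, the sketch below flags and ranks low-scoring results by `context_awareness`. It uses a hypothetical stand-in dataclass, `ContextResult`, that carries only the fields named above plus `qa_id`; the objects actually returned by `Context.run` may carry more attributes. The sample records and the `0.5` threshold are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextResult:
    """Hypothetical stand-in mirroring the fields described above."""
    qa_id: str
    context_awareness: float
    context_insight: str
    context_thinkings: Optional[str] = None  # only set when the judge exposes reasoning


def low_scoring(results, threshold=0.5):
    """Return results scoring below `threshold`, lowest score first."""
    flagged = [r for r in results if r.context_awareness < threshold]
    return sorted(flagged, key=lambda r: r.context_awareness)


# Illustrative records, not real evaluation output.
sample = [
    ContextResult("a", 0.9, "Closely follows the context."),
    ContextResult("b", 0.2, "Ignores key details."),
    ContextResult("c", 0.4, "Partially grounded."),
]

for r in low_scoring(sample):
    print(f"{r.qa_id}: {r.context_awareness} - {r.context_insight}")
```

Because `low_scoring` only reads attributes, the same function should work on the real `metrics` list as long as each item exposes `context_awareness`.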

In [6]:
print(f"Total interactions evaluated: {len(metrics)}\n")

for metric in metrics:
    print(f"QA ID: {metric.qa_id}")
    print(f"Context Awareness Score: {metric.context_awareness}")
    print(f"Insight: {metric.context_insight}")
    print("-" * 50)

Total interactions evaluated: 10

QA ID: 123
Context Awareness Score: 0.2
Insight: The assistant's response repeats a generic statement that Alquimia AI is a startup building assistants, which does not reflect the detailed enterprise-grade, human-centered platform described in the context. It fails to mention the Seven Principles, transparency, fixed-price/time offerings, or other key attributes, thus deviating significantly from the required context.
--------------------------------------------------
QA ID: 124
Context Awareness Score: 0.2
Insight: The assistant's response addresses the question about women working in technology, which is unrelated to the provided context that focuses on Alquimia AI, its principles, and enterprise solutions. While the answer is appropriate for the question asked, it deviates from the expected domain of Alquimia AI expertise. Therefore, the response does not align with the context's scope.
--------------------------------------------------
QA ID: 125
C

## Calculate Average Score

In [7]:
avg_score = sum(m.context_awareness for m in metrics) / len(metrics)
print(f"Average Context Awareness: {avg_score:.2f}")

Average Context Awareness: 0.14
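
A single mean can hide how the scores are distributed. As a sketch using only the standard library, the hypothetical helper `summarize_scores` below reports the count, mean, median, minimum, and maximum, and returns `None` for an empty input instead of dividing by zero. It takes plain floats, for example `[m.context_awareness for m in metrics]`.

```python
import statistics


def summarize_scores(scores):
    """Summarize a collection of 0-1 scores; return None if it is empty."""
    scores = list(scores)
    if not scores:
        return None
    return {
        "count": len(scores),
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }
```

A median well below the mean, for instance, would suggest that a few high scores are masking many poorly grounded responses.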
