
# UQLM × Arize Phoenix: Response-Level Confidence & Hallucination Risk

This notebook shows how to compute **model-agnostic, ground-truth-free** confidence / risk scores using **[UQLM](https://github.com/cvs-health/uqlm)** after creating a dataset and starting an experiment in **Arize Phoenix**. 

**What you'll do:**
1. Install and configure Phoenix & UQLM
2. Create a small demo dataset with prompts & responses
3. Upload the dataset to Phoenix and configure a task that generates the pre-sampled responses
4. Compute UQLM **per-scorer confidences** and an **ensemble** confidence
5. Derive **risk = 1 - confidence** and an optional **high_risk** flag

> 🧩 **Why**: UQLM provides a production-friendly, model-agnostic uncertainty signal that complements judge-style evals and helps you **flag risky answers without labeled ground truth**.
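
As a quick illustration of step 5, the sketch below derives risk and a high-risk flag from a confidence column. The confidence values and the `RISK_THRESHOLD` of 0.5 are made up for the example; in the notebook the confidences come from UQLM scorers.

```python
import pandas as pd

# Hypothetical confidence scores in [0, 1]; real values come from UQLM scorers.
scores = pd.DataFrame({"confidence": [0.95, 0.62, 0.40]})

RISK_THRESHOLD = 0.5  # assumed cut-off for illustration; tune for your use case

# Risk is the complement of confidence.
scores["risk"] = 1.0 - scores["confidence"]
# Flag rows whose risk meets or exceeds the threshold.
scores["high_risk"] = scores["risk"] >= RISK_THRESHOLD
print(scores)
```

Only the last row (confidence 0.40, risk 0.60) crosses the threshold here.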


Before running this notebook, run `pip install arize-phoenix` in your terminal, then run `phoenix serve` to host Phoenix locally.

## 0) Install packages

In [None]:
%pip install -q arize-phoenix uqlm pandas numpy openai langchain-openai

## 1) Imports & version check

In [None]:
import os
import json
from getpass import getpass 
from typing import List, Dict, Any, Optional

import pandas as pd
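try:
    from uqlm import BlackBoxUQ, WhiteBoxUQ
    HAVE_UQLM = True
except ImportError:
    # Later cells check this flag and print an install hint instead of failing.
    HAVE_UQLM = False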

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key


## 2) Configure Phoenix connection

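If you're sending results to **Phoenix Cloud** or your own **self-hosted** Phoenix, configure the connection before registering the tracer. A local instance started with `phoenix serve` works with the defaults.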
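
A minimal sketch of that configuration, assuming Phoenix reads its connection settings from the `PHOENIX_COLLECTOR_ENDPOINT` and `PHOENIX_API_KEY` environment variables. The endpoint value below is the usual local default and the API key is a placeholder; check both against your own deployment.

```python
import os

# Assumed endpoint for a local Phoenix started with `phoenix serve`;
# replace with your Phoenix Cloud or self-hosted URL. An existing value is kept.
os.environ.setdefault("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006")

# Only needed when the Phoenix instance requires authentication.
# The value below is a placeholder, not a real key.
# os.environ["PHOENIX_API_KEY"] = "<your-api-key>"

print(os.environ["PHOENIX_COLLECTOR_ENDPOINT"])
```

Run this before the `register(...)` call so the tracer picks up the settings.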


In [None]:
from phoenix.otel import register
project_name = "cvs evals"
tracer_provider = register(project_name=project_name, auto_instrument=True)

## 3) Demo dataset

We'll make a small dataset with:
- `input` – the user prompt
- `output` – the model response

Once we upload this dataset into Phoenix, we can run a task to create our:
- `sampled_responses` – a list of stochastic responses for the same prompt (for **black-box UQLM**)
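
Because the task returns the sampled responses as text, they have to be parsed into a Python list before scoring. The helper below is a hypothetical sketch (`parse_sampled_responses` is not part of Phoenix or UQLM). It assumes the model returns a JSON array, possibly wrapped in a Markdown code fence.

```python
import json
from typing import List

FENCE = "`" * 3  # Markdown code-fence marker


def parse_sampled_responses(raw: str) -> List[str]:
    """Parse a JSON array of responses, tolerating an optional code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag such as "json").
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence and anything after it.
        text = text.rsplit(FENCE, 1)[0]
    parsed = json.loads(text)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array of responses")
    return [str(item) for item in parsed]


plain = parse_sampled_responses('["Paris", "Paris is the capital."]')
fenced = parse_sampled_responses(FENCE + 'json\n["a", "b"]\n' + FENCE)
print(plain, fenced)
```

Applying a helper like this in place of a bare `json.loads` makes the later dataframe step less sensitive to how the model formats its answer.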

In [None]:
from phoenix.client import Client

simple_dataset = [{
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
}, {
    "input": "Explain quantum entanglement in one sentence.",
    "output": "Quantum entanglement is when particles share a state no matter the distance, showing instant correlations.",
}, {
    "input": "Who won the 2023 Wimbledon men's singles?",
    "output": "Carlos Alcaraz won the 2023 Wimbledon men's singles title.",
}, {
    "input": "Give me three uses of sodium chloride in medicine.",
    "output": "Sodium chloride is used for IV fluids, nasal irrigation, and as a wound-cleaning solution.",
}]
simple_df = pd.DataFrame(simple_dataset)

client = Client()
dataset = client.datasets.create_dataset(
    dataframe=simple_df,
    name="cvs_evals",
    input_keys=["input"],
    output_keys=["output"]
)

In [None]:
from openai import OpenAI

def my_task(example):
    # Ask the model for 5 sampled answers to the dataset question, returned as a JSON list.
    # A separate name keeps the Phoenix `client` created earlier from being shadowed.
    openai_client = OpenAI()
    prompt = f"""
    You will be given a question. I want 5 sampled responses to the question.
    Return only a JSON list of 5 response strings, with no extra text.
    Here is your question: {example.input}
    This is the expected output format:
    [
        "response 1",
        "response 2",
        "response 3",
        "response 4",
        "response 5"
    ]
    """
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

In [None]:
from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="my-experiment", 
)

In [None]:
import pandas as pd

rows = []
for run in experiment['task_runs']:
    row = dict(run)
    output = run.get('output', {})
    if isinstance(output, dict):
        row.update(output)
    else:
        row['output'] = output
    rows.append(row)
df = pd.DataFrame(rows)

In [None]:
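# The task output holds the sampled responses, so rename the column accordingly.
df = df.rename(columns={'output': 'sampled_responses'})
responses_df = df['sampled_responses']
# Task runs are assumed to come back in reverse dataset order; flip them so they
# align with simple_df by position. Verify the ordering for your Phoenix version.
responses_df = responses_df.iloc[::-1].reset_index(drop=True)
df = pd.merge(simple_df, responses_df, left_index=True, right_index=True, how='left')
# Each task output is a JSON-encoded list of strings; decode it into a Python list.
df["sampled_responses"] = df["sampled_responses"].apply(json.loads)
df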


## 4) UQLM adapter: per-scorer + ensemble confidence

Below is a compact adapter that:

- Runs **BlackBoxUQ** `score(...)` if you already have `responses` and `sampled_responses`.
- Otherwise runs **BlackBoxUQ** `generate_and_score(...)` if you pass an `llm` and set `num_responses > 0`.
- Optionally runs **WhiteBoxUQ** if you pass an `llm` that returns token-level logprobs.
- Computes an **ensemble** confidence (mean/median/weighted) and adds `uqlm_confidence`, `uqlm_risk`, and `uqlm_high_risk`.
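
To make the ensemble step concrete, here is a standalone sketch of the three combination modes for a single row. The function name `ensemble_confidence`, the scorer scores, and the weights are illustrative assumptions, not part of UQLM.

```python
from statistics import mean, median
from typing import Dict, Optional


def ensemble_confidence(
    scores: Dict[str, float],
    method: str = "mean",
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-scorer confidences for one response into a single value."""
    values = list(scores.values())
    if not values:
        return float("nan")
    if method == "median":
        return float(median(values))
    if method == "weighted_mean" and weights:
        # Scorers without a weight contribute nothing.
        total = sum(weights.get(name, 0.0) for name in scores)
        if total <= 0:
            return float("nan")
        weighted = sum(weights.get(name, 0.0) * v for name, v in scores.items())
        return weighted / total
    # Default (and fallback) is the plain mean.
    return float(mean(values))


row = {"noncontradiction": 0.9, "exact_match": 0.5}  # made-up scores
mean_conf = ensemble_confidence(row)
weighted_conf = ensemble_confidence(
    row, "weighted_mean", {"noncontradiction": 3.0, "exact_match": 1.0}
)
print(mean_conf, weighted_conf)
```

With these inputs the plain mean is about 0.7, while weighting `noncontradiction` three times as heavily moves the result to about 0.8.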


In [None]:
async def compute_uqlm_confidence(
    dataframe: pd.DataFrame,
    prompt_col: str = "input",
    response_col: Optional[str] = None,
    sampled_responses_col: Optional[str] = None,
    blackbox_scorers: List[str] = ["noncontradiction"], 
    ensemble: str = "mean",
    ensemble_weights: Optional[Dict[str, float]] = None,
    risk_threshold: Optional[float] = None,  
    mode: str = "black_box",
    llm: Optional[Any] = None,
    num_responses: int = 5,
    whitebox_scorers: List[str] = ["min_probability"],
    verbose: bool = False,
) -> pd.DataFrame:
    """Compute per-scorer and ensemble confidence with UQLM and return merged dataframe.
    Adds columns:
      - uqlm_confidence [0,1]
      - uqlm_risk [0,1] = 1 - confidence
      - uqlm_high_risk (optional bool) if risk_threshold provided
      - uqlm_<scorer>_conf (per-scorer, if available)
    """
    if not HAVE_UQLM:
        raise ImportError("UQLM is not installed. `pip install uqlm`.")

    df = dataframe.copy()
    per_scorer_cols = []

    def _ensemble(row: Dict[str, Any]) -> float:
        vals = [row[c] for c in per_scorer_cols if pd.notnull(row.get(c))]
        if not vals:
            return float("nan")
        if ensemble == "mean":
            return float(sum(vals) / len(vals))
        if ensemble == "median":
            s = sorted(vals)
            n = len(s)
            return float((s[n//2] if n % 2 else (s[n//2 - 1] + s[n//2]) / 2))
        if ensemble == "weighted_mean" and ensemble_weights:
            num = 0.0
            den = 0.0
            for c in per_scorer_cols:
                sc = c.replace("uqlm_", "").replace("_conf", "")
                w = float(ensemble_weights.get(sc, 0.0))
                if sc in ensemble_weights and pd.notnull(row.get(c)):
                    num += w * float(row[c])
                    den += w
            return float(num / den) if den > 0 else float("nan")
        return float(sum(vals) / len(vals))

    prompts = df[prompt_col].tolist()
    responses = df[response_col].tolist() if response_col is not None and response_col in df.columns else None
    sampled = df[sampled_responses_col].tolist() if sampled_responses_col is not None and sampled_responses_col in df.columns else None

    if mode == "auto":
        mode_to_run = "black_box" 
        if llm:
            if hasattr(llm, "logprobs"):
                mode_to_run = "white_box"
       
    else:
        mode_to_run = mode

    if mode_to_run == "black_box":
        bbuq = BlackBoxUQ(llm=llm, scorers=blackbox_scorers)
        if responses is not None and sampled is not None:
            results = bbuq.score(responses=responses, sampled_responses=sampled, show_progress_bars=False)
        else:
            results = await bbuq.generate_and_score(prompts=prompts, num_responses=num_responses, show_progress_bars=False)
    
        per_scorer_cols = []
        for sc_name in results.data:
            if sc_name in blackbox_scorers:
                per_scorer_cols.append(f"uqlm_{sc_name}_conf")
                df[f"uqlm_{sc_name}_conf"] = results.data[sc_name]

    elif mode_to_run == "white_box":
       
        wbuq = WhiteBoxUQ(llm=llm, scorers=whitebox_scorers)
        if verbose: print("WhiteBoxUQ.generate_and_score ...")
        results = await wbuq.generate_and_score(prompts=prompts, show_progress_bars=False)

        for sc_name in results.data:
            if sc_name in whitebox_scorers:
                per_scorer_cols.append(f"uqlm_{sc_name}_conf")
                df[f"uqlm_{sc_name}_conf"] = results.data[sc_name]
    else:
        raise ValueError("mode must be one of {'black_box', 'white_box', 'auto'}.")

    df["uqlm_confidence"] = df.apply(_ensemble, axis=1)
    df["uqlm_risk"] = 1.0 - df["uqlm_confidence"]
    if risk_threshold is not None:
        df["uqlm_high_risk"] = df["uqlm_risk"] >= float(risk_threshold)

    return df


### 4.a) Run **BlackBoxUQ** scoring on pre-sampled responses

This path requires **no LLM calls**, so it's the fastest way to test-drive UQLM.
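
For intuition about what scoring pre-sampled responses means, the sketch below computes a simple exact-match rate: the fraction of sampled responses identical to the original response. This is a conceptual illustration only; UQLM's own `exact_match` scorer may normalize or compare text differently.

```python
from typing import List


def exact_match_rate(response: str, sampled_responses: List[str]) -> float:
    """Fraction of sampled responses that are string-identical to `response`."""
    if not sampled_responses:
        return float("nan")
    matches = sum(1 for sample in sampled_responses if sample == response)
    return matches / len(sampled_responses)


# Made-up example: three of the four samples agree with the original answer.
rate = exact_match_rate("Paris", ["Paris", "Paris", "Lyon", "Paris"])
print(rate)
```

High agreement among samples suggests the model answers consistently, which black-box scorers treat as a sign of higher confidence.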


In [None]:
if HAVE_UQLM:
    uqlm_df = await compute_uqlm_confidence(
        dataframe=df,
        prompt_col="input",
        response_col="output",
        sampled_responses_col="sampled_responses",
        blackbox_scorers=["noncontradiction", "exact_match"], 
        ensemble="mean",
        risk_threshold=0.3,   
        mode="black_box",
        llm=None,           
        num_responses=5,
        verbose=True,
    )
else:
    print("Install UQLM to run this section: `pip install uqlm`")
uqlm_df



### 4.b) (Optional) Generate-and-score with your LLM client

If you provide an `llm` client to UQLM (e.g., OpenAI/Anthropic), set `num_responses > 0` to have UQLM **sample K responses** per prompt and score on the fly.

> ⚠️ **Note:** The cell below uses LangChain's `ChatOpenAI` as the `llm`. Swap in any other LangChain chat model that UQLM supports.


In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4",
    temperature=1,  # non-zero temperature so the sampled responses vary
)

if HAVE_UQLM:
    USE_GENERATE_AND_SCORE = True
    if USE_GENERATE_AND_SCORE:
        uqlm_gen_df = await compute_uqlm_confidence(
            dataframe=df,
            mode="black_box",
            llm=llm,
            num_responses=5,
            blackbox_scorers=["noncontradiction", "cosine_sim"],
            ensemble="mean",
            risk_threshold=0.5,
            verbose=True,
        )
    else:
        print("Set USE_GENERATE_AND_SCORE=True after wiring your LLM client.")
else:
    print("Install UQLM to run this section: `pip install uqlm`")
uqlm_gen_df



### 4.c) (Optional) WhiteBox scoring (token-level logprobs)

If your client returns **token logprobs**, UQLM can compute white-box signals like **minimum token probability**. The cell below reuses the `llm` from 4.b; swap in another logprob-capable client if yours does not return logprobs, and set `USE_WHITEBOX=True`.
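
As a rough illustration of these signals, the sketch below computes a minimum token probability and a length-normalized probability (the geometric mean of token probabilities) from a list of token logprobs. The function names and logprob values are made up for the example, and UQLM's scorers may differ in detail.

```python
import math
from typing import List


def min_token_probability(logprobs: List[float]) -> float:
    """Probability of the least likely token in the response."""
    return math.exp(min(logprobs))


def length_normalized_probability(logprobs: List[float]) -> float:
    """Geometric mean of token probabilities: exp(mean(logprobs))."""
    return math.exp(sum(logprobs) / len(logprobs))


# Hypothetical token probabilities 0.9, 0.5, 0.8 expressed as logprobs.
token_logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
min_p = min_token_probability(token_logprobs)
norm_p = length_normalized_probability(token_logprobs)
print(min_p, norm_p)
```

The minimum is driven by the single weakest token (0.5 here), while the length-normalized value averages over the whole response.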


In [None]:
if HAVE_UQLM:
    USE_WHITEBOX = True 
    if USE_WHITEBOX:
        llm_logprobs = llm
        uqlm_whitebox_df = await compute_uqlm_confidence(
            dataframe=df,
            mode="white_box",
            llm=llm_logprobs,
            whitebox_scorers=["min_probability", "normalized_probability"],
            risk_threshold=0.5,
            verbose=True,
        )
    else:
        print("Set USE_WHITEBOX=True after wiring a logprob-capable client.")
else:
    print("Install UQLM to run this section: `pip install uqlm`")
uqlm_whitebox_df