# Evaluating Model Outputs

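Perplexity gives a rough measure of how confident a model was in its output. It is calculated by exponentiating the negative of the average of the token log probabilities (logprobs): $\text{perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_i\right)$, where $p_i$ is the probability the model assigned to the $i$-th output token. A perplexity of 1 means the model assigned probability 1 to every token it produced; higher values indicate more uncertainty.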

+ Perplexity can be used to assess the result of an individual model run.
+ It can also be used to compare the relative confidence of results between model runs. 

Low perplexity (that is, high confidence) does not guarantee accuracy, since a model can be confidently wrong, but it is a helpful signal when paired with other evaluation metrics.
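
The calculation described above can be sketched without any API calls. This is a minimal illustration, assuming the logprobs are already available as a plain list of floats; the helper name `perplexity` is hypothetical and not part of any library.

```python
import math


def perplexity(logprobs):
    """Exponentiate the negative mean of a list of token log probabilities."""
    if not logprobs:
        raise ValueError("perplexity is undefined for an empty list of logprobs")
    return math.exp(-sum(logprobs) / len(logprobs))


# Every token predicted with probability 1 (logprob 0): the model was certain.
certain = perplexity([0.0, 0.0, 0.0])

# Every token predicted with probability 0.5: as uncertain as a coin flip per token.
coin_flip = perplexity([math.log(0.5)] * 4)

print(f"certain:   {certain:.2f}")
print(f"coin flip: {coin_flip:.2f}")
```

One way to read the result: a perplexity of about 2 means the model was, on average, as unsure as if it were choosing between two equally likely tokens at each step.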

In [None]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [None]:
from openai import OpenAI
import numpy as np
client = OpenAI()

In [None]:
prompts = [
    "In a short sentence, has artificial intelligence grown in the last decade?",
    "In a short sentence, is Schrödinger's cat alive?",
    "In a single word, yes or no, is Schrödinger's cat alive?",
    "In a short sentence, what is the capital of Nuevo Leon?",
    "Can you make an omelette without breaking eggs?",
]

In [None]:
def get_completion(
    input: list[dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens=500,
    temperature=0,
    tools=None,
    logprobs=None,  # if truthy, request the log probability of each output token
    top_logprobs=None,  # optionally, how many of the most likely tokens to return per position
):
    """Call the Responses API and return the full response object (not just its text)."""
    params = {
        "model": model,
        "input": input,
        "max_output_tokens": max_tokens,
        "temperature": temperature,
    }
    # Only send optional parameters that were actually set, rather than passing None.
    if tools:
        params["tools"] = tools
    if logprobs:
        params["include"] = ["message.output_text.logprobs"]
    if top_logprobs is not None:
        params["top_logprobs"] = top_logprobs

    completion = client.responses.create(**params)
    return completion

In [None]:
# The row labels never change, so compute their width once, outside the loop.
max_starter_length = max(len(s) for s in ["Prompt:", "Response:", "Tokens:", "Logprobs:", "Perplexity:"])

for prompt in prompts:
    API_RESPONSE = get_completion(
        [{"role": "user", "content": prompt}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    # Assumes the first output item is the message and its first content part is the text.
    content = API_RESPONSE.output[0].content[0]
    response_text = content.text
    response_text_tokens = [token.token for token in content.logprobs]
    logprobs = [token.logprob for token in content.logprobs]

    # Pad tokens and logprobs to one shared column width so the two rows line up.
    formatted_lps = [f"{lp:.2f}" for lp in logprobs]
    column_width = max(len(s) for s in response_text_tokens + formatted_lps)
    formatted_response_tokens = [s.rjust(column_width) for s in response_text_tokens]
    formatted_lps = [s.rjust(column_width) for s in formatted_lps]

    # Perplexity: exponentiate the negative mean of the token logprobs.
    perplexity_score = np.exp(-np.mean(logprobs))
    print("Prompt:".ljust(max_starter_length), prompt)
    print("Response:".ljust(max_starter_length), response_text, "\n")
    print("Tokens:".ljust(max_starter_length), " ".join(formatted_response_tokens))
    print("Logprobs:".ljust(max_starter_length), " ".join(formatted_lps))
    print("Perplexity:".ljust(max_starter_length), perplexity_score, "\n")

In [None]:
API_RESPONSE.output[0].content[0].text
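
The introduction notes that perplexity can also be used to compare relative confidence between model runs. A rough sketch of that comparison follows, assuming the logprobs of each run have already been collected into a dictionary. The run names and logprob values here are hypothetical, chosen only for illustration, and `rank_by_perplexity` is not a library function.

```python
import math


def perplexity(logprobs):
    """Exponentiate the negative mean of a list of token log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))


def rank_by_perplexity(runs):
    """Return (run name, perplexity) pairs, most confident (lowest perplexity) first."""
    scores = {name: perplexity(lps) for name, lps in runs.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical logprobs for three runs; illustrative values, not real model output.
runs = {
    "run_a": [-0.01, -0.02, -0.05],
    "run_b": [-0.70, -1.20, -0.40],
    "run_c": [-0.10, -0.30, -0.20],
}

ranking = rank_by_perplexity(runs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because perplexity averages over tokens, responses of different lengths can be compared on the same scale, though the ranking reflects confidence only, not correctness.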