# Using logprobs for classification and retrieval evaluation

This notebook illustrates two potential uses of the `logprobs` parameter in the Chat Completions API.
With `logprobs` set to true, Chat Completions returns the log probability of each output token and, when `top_logprobs` is also set, a limited number of the most likely tokens at each token position (along with their log probabilities). This helps in assessing how confident the model is in its output and in examining alternative responses the model might have given. While there is a wide array of use cases for logprobs, we focus on:<br>
1. Classification tasks
2. Retrieval (Q&A) evaluation

## 0. Imports and utils

In [300]:
from openai import OpenAI
from math import exp
import numpy as np

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()


In [301]:
from openai.types.chat import ChatCompletion


def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=500,
    temperature=1.0,
    stop=None,
    functions=None,
    logprobs=None,
    top_logprobs=None
) -> ChatCompletion:
    """Call Chat Completions and return the full response object (not just the text)."""
    params = {
        'model': model,
        'messages': messages,
        'max_tokens': max_tokens,
        'temperature': temperature,
        'stop': stop,
        'logprobs': logprobs,
        'top_logprobs': top_logprobs
    }
    # Only send `functions` when the caller actually supplied some
    if functions:
        params['functions'] = functions

    completion = client.chat.completions.create(**params)
    return completion



## 1. Classification

Let's say we want to create a system to classify news articles into a set of categories. Without `logprobs`, we can use Chat Completions to do this, but it is much more difficult to assess how confident the model is in its classifications. <br><br>
Now, with `logprobs` enabled, we can see just how confident the model is in its predictions, which is crucial for creating an accurate and trustworthy classifier.

We can begin with a prompt that gives the model four categories: **Technology, Politics, Sports, and Art**, and asks the model to classify articles into those categories based on headlines alone.

In [302]:
CLASSIFICATION_PROMPT = """You will be given a headline of a news article. Classify the article into one of the following categories: Technology, Politics, Sports, and Art.
Return only the name of the category, and nothing else. MAKE SURE your output is one of the four categories stated. Article headline: {headline}"""


Let's look at three sample headlines and begin with a standard Chat Completions output, without `logprobs`.

In [303]:
headlines = [
    "Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.",
    "Local Mayor Launches Initiative to Enhance Urban Public Transport.",
    "Tennis Champions Showcase Hidden Talents in Symphony Orchestra Debut",
]


In [304]:
for headline in headlines:
  print(headline)
  API_RESPONSE = get_completion([{'role':'user','content':CLASSIFICATION_PROMPT.format(headline=headline)}],model='gpt-4')
  print(API_RESPONSE.choices[0].message.content,'\n')


Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.


Technology 

Local Mayor Launches Initiative to Enhance Urban Public Transport.
Politics 

Tennis Champions Showcase Hidden Talents in Symphony Orchestra Debut
Art 



Here we can see the selected category for each headline. However, we don't know *how* confident the model is in these classifications. Let's rerun the same prompt with `logprobs` enabled and `top_logprobs` set to 2 (this shows the 2 most likely output tokens at each position). We also print the linear probability of each token, converting the log probability to the more easily interpretable scale of 0-100%.
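The conversion itself needs no API call. The following is a minimal sketch of it, assuming the returned values are natural-log probabilities (which is what the `exp` calls in this notebook rely on); `logprob_to_percent` is a hypothetical helper name, not part of the OpenAI SDK.

```python
from math import exp


def logprob_to_percent(logprob: float, digits: int = 2) -> float:
    """Convert a natural-log probability into a percentage between 0 and 100."""
    return round(exp(logprob) * 100, digits)


# Log probabilities copied from the sample output in this notebook
print(logprob_to_percent(-0.4510038))
print(logprob_to_percent(-1.0135038))
```

A log probability of 0 corresponds to 100%, and increasingly negative values correspond to probabilities approaching 0%.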


In [305]:
for headline in headlines:
    print(headline)
    API_RESPONSE = get_completion(
        [{'role': 'user', 'content': CLASSIFICATION_PROMPT.format(headline=headline)}],
        model='gpt-4', logprobs=True, top_logprobs=2)
    # Inspect the two most likely candidates for the first output token
    for logprob in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
        print(f"\033[96mToken:\033[0m {logprob.token}, \033[93mlogprobs:\033[0m {logprob.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(logprob.logprob)*100,2)}%")
    print('\n')


Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Token: Technology, logprobs: -3.1737043e-06, linear probability: 100.0%
Token: Techn, logprobs: -13.437503, linear probability: 0.0%


Local Mayor Launches Initiative to Enhance Urban Public Transport.
Token: Politics, logprobs: -3.7697225e-06, linear probability: 100.0%
Token: Technology, logprobs: -13.390629, linear probability: 0.0%


Tennis Champions Showcase Hidden Talents in Symphony Orchestra Debut
Token: Sports, logprobs: -0.4510038, linear probability: 63.7%
Token: Art, logprobs: -1.0135038, linear probability: 36.29%




As expected, for the first two headlines `gpt-4` is nearly 100% confident in its classifications, as the content is clearly technology- and politics-focused respectively. However, the third headline combines both sports and art-related themes, so the model is significantly less confident in its selection: it assigns about 64% probability to Sports and about 36% to Art, which is consistent with the earlier run (without `logprobs`) returning Art for the same headline. <br><br> 
This shows how important `logprobs` can be: if we are using LLMs for classification tasks, we can set confidence thresholds, or output several candidate tokens when the log probability of the selected output is not sufficiently high. For instance, if we are creating a recommendation engine to tag articles, we can automatically classify headlines that cross a certain threshold and send the less certain headlines for manual review.
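A thresholding step of this kind could be sketched as follows. The helper `route_classification`, the 0.9 threshold, and the `"auto"` / `"manual_review"` labels are assumptions made for illustration; the candidate pairs are `(token, logprob)` values copied from the sample output in this notebook rather than live API results.

```python
from math import exp


def route_classification(candidates, threshold=0.9):
    """Pick the most likely label and decide whether it can be accepted automatically.

    candidates: list of (token, logprob) pairs for the first output token.
    threshold:  minimum linear probability required to skip manual review (assumed value).
    """
    token, logprob = max(candidates, key=lambda pair: pair[1])
    confidence = exp(logprob)
    decision = "auto" if confidence >= threshold else "manual_review"
    return token, confidence, decision


# Clear-cut headline: accepted automatically
print(route_classification([("Politics", -3.7697225e-06), ("Technology", -13.390629)]))

# Ambiguous headline: routed to a human
print(route_classification([("Sports", -0.4510038), ("Art", -1.0135038)]))
```

The right threshold depends on the cost of a wrong tag versus the cost of a manual review, so it is worth tuning on labelled examples rather than fixing it up front.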

## 2. Retrieval confidence scoring

To reduce hallucinations and improve the performance of our Q&A RAG system, we can use `logprobs` to evaluate how confident the model is in its retrieval.

Let's say we have built a retrieval system using RAG for Q&A, but are struggling with hallucinated answers to our questions. 

In [306]:
#Article retrieved

ada_lovelace_article = """Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.
Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as "poetical science" and herself as an "Analyst (& Metaphysician)".
When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as "the father of computers". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.
Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called "Notes".
Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program—that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of "poetical science" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.
"""

# Questions that can be easily answered given the article
easy_questions = ["What nationality was Ada Lovelace?", "What was an important finding from Lovelace's seventh note?"]

# Questions that are only partially covered by the article
medium_questions = ["Did Lovelace collaborate with Charles Dickens", "What concepts did Lovelace build with Charles Babbage"]


Now we can ask the model whether it can answer the question before it actually answers. Specifically, we ask the model to output a boolean, `sufficient_context_for_answer`. We can then evaluate the `logprobs` to see just how confident the model is that the answer is contained in the provided context.

In [307]:
PROMPT = """You retrieved this article: {article}. The question is: {question}. Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.
Your output should JUST be the boolean true or false, indicating whether you have sufficient information in the article to answer the question.
Respond with just one word, the boolean true or false.
"""


In [308]:
API_RESPONSE.choices[0].logprobs.content[0].token


'Art'

In [309]:
import numpy as np

print('\033[1mQuestions clearly answered in article\033[0m\n')  # Bold text

for question in easy_questions:
    API_RESPONSE = get_completion(
        [{'role': 'user', 'content': PROMPT.format(article=ada_lovelace_article, question=question)}],
        model='gpt-4', logprobs=True)
    print('\033[92mQuestion:\033[0m', question)  # Green text
    # The answer is a single word, so this loop normally visits one token
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        print(f"\033[96msufficient_context_for_answer:\033[0m {logprob.token}, \033[93mlogprobs:\033[0m {logprob.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(logprob.logprob)*100,2)}%", '\n')

print('\n\n\033[1mQuestions with potentially insufficient information\033[0m\n')  # Bold text

for question in medium_questions:
    API_RESPONSE = get_completion(
        [{'role': 'user', 'content': PROMPT.format(article=ada_lovelace_article, question=question)}],
        model='gpt-4', logprobs=True, top_logprobs=3)
    print('\033[92mQuestion:\033[0m', question)  # Green text
    print(API_RESPONSE)  # Dump the raw response to inspect the top_logprobs alternatives
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        print(f"\033[96msufficient_context_for_answer:\033[0m {logprob.token}, \033[93mlogprobs:\033[0m {logprob.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(logprob.logprob)*100,2)}%", '\n')


Questions clearly answered in article

Question: What nationality was Ada Lovelace?
sufficient_context_for_answer: True, logprobs: -3.1281633e-07, linear probability: 100.0% 

Question: What was an important finding from Lovelace's seventh note?
sufficient_context_for_answer: True, logprobs: -3.1281633e-07, linear probability: 100.0% 



Questions with potentially insufficient information

Question: Did Lovelace collaborate with Charles Dickens
ChatCompletion(id='chatcmpl-8XJG9yst1pZoZL7M9wBUwIy6YoYuo', choices=[Choice(finish_reason='stop', index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token='True', bytes=[84, 114, 117, 101], logprob=-0.77434313, top_logprobs=[TopLogprob(token='False', bytes=[70, 97, 108, 115, 101], logprob=-0.61809313), TopLogprob(token='True', bytes=[84, 114, 117, 101], logprob=-0.77434313), TopLogprob(token='false', bytes=[102, 97, 108, 115, 101], 

We can see from the first two questions that our evaluator is (nearly) 100% confident that the article has sufficient context to answer the posed question.
On the other hand, for the trickier questions, which are less clearly answered in the article, the model is significantly less confident that it has sufficient context.
This self-evaluation can help reduce hallucinations, as you can restrict answers, or ask for clearer questions, when your `sufficient_context_for_answer` log probability is below a certain threshold. Methods like this have been [shown](https://jfan001.medium.com/how-we-cut-the-rate-of-gpt-hallucinations-from-20-to-less-than-2-f3bfcc10e4ec) to significantly reduce RAG Q&A hallucinations and errors.
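One way to turn this into a gate is sketched below. It sums the probability of every candidate token that reads as "true" (the raw response above contains both `True` and `false`-style casing variants), then compares the total against a threshold. The helper names and the 0.8 threshold are assumptions for illustration, and because only the top few candidates are returned, the sum is a lower bound on the true probability.

```python
from math import exp


def probability_of_true(top_logprobs):
    """Sum the linear probability of candidates that read as 'true', ignoring case and spaces.

    top_logprobs: list of (token, logprob) pairs for the first output token.
    """
    return sum(exp(lp) for tok, lp in top_logprobs if tok.strip().lower() == "true")


def should_answer(top_logprobs, threshold=0.8):
    """Answer only when the model is confident enough that the context is sufficient."""
    return probability_of_true(top_logprobs) >= threshold


# Values copied from the outputs shown in this notebook
clear_case = [("True", -3.1281633e-07)]
unclear_case = [("False", -0.61809313), ("True", -0.77434313)]

print(should_answer(clear_case))
print(should_answer(unclear_case))
```

When `should_answer` returns `False`, the application could decline to answer, retrieve more documents, or ask the user to rephrase the question.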

In [310]:
PROMPT = """Give me three random emojis"""
API_RESPONSE = get_completion([{'role':'user','content':PROMPT}],model='gpt-3.5-turbo-1106',logprobs=True)


In [311]:
API_RESPONSE


ChatCompletion(id='chatcmpl-8XJGBUq8Q7VYU8dzwOjMV8WaUhOKE', choices=[Choice(finish_reason='stop', index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token='\\xf0\\x9f', bytes=[240, 159], logprob=-0.17577635, top_logprobs=[]), ChatCompletionTokenLogprob(token='\\x8c', bytes=[140], logprob=-0.85772306, top_logprobs=[]), ChatCompletionTokenLogprob(token='\\x88', bytes=[136], logprob=-2.3395207, top_logprobs=[]), ChatCompletionTokenLogprob(token='\\xf0\\x9f', bytes=[240, 159], logprob=-0.20028262, top_logprobs=[]), ChatCompletionTokenLogprob(token='\\x8d', bytes=[141], logprob=-0.9647721, top_logprobs=[]), ChatCompletionTokenLogprob(token='\\x95', bytes=[149], logprob=-0.5920826, top_logprobs=[]), ChatCompletionTokenLogprob(token='\\xf0\\x9f', bytes=[240, 159], logprob=-0.15628104, top_logprobs=[]), ChatCompletionTokenLogprob(token='\\x9a', bytes=[154], logprob=-0.67634493, top_logprobs=[]), ChatCompletionTokenLogprob(token='\\x80', bytes=[128], logprob=-0.0012997614, top

In [312]:
from math import exp
aggregated_bytes = []
joint_logprob = 0.0
for token in API_RESPONSE.choices[0].logprobs.content:
    aggregated_bytes += token.bytes
    joint_logprob += token.logprob

message_content = API_RESPONSE.choices[0].message.content
aggregated_text = bytes(aggregated_bytes).decode('utf-8')
print(aggregated_text)

assert message_content == aggregated_text

print(f"text = {aggregated_text}")
print(f"joint probability = {exp(joint_logprob)}")



🌈🍕🚀
text = 🌈🍕🚀
joint probability = 0.0025693992522193153


## 3. Autocomplete

Another use case for `logprobs` is autocomplete systems. Without creating the entire autocomplete engine end-to-end, let's demonstrate how `logprobs` could help us decide when to suggest a sentence completion as a user is typing.

First, let's come up with a sample sentence: "My least favorite TV show is Breaking Bad." Let's say we are building an autocomplete engine, and we want it to dynamically recommend the next word or token as we are typing the sentence, but *only* if the model is quite sure of what the next word will be. To demonstrate this, let's break up the sentence into sequential components up to the title of the show.

In [345]:
sentence_list = ["My", "My least", "My least favorite", "My least favorite TV", "My least favorite TV show",
                 "My least favorite TV show is"]


In [346]:
# Maps each sentence prefix to a next token the model predicts with >95% probability
high_prob_completions = {}

for sentence in sentence_list:
    PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""
    API_RESPONSE = get_completion(
        [{'role': 'user', 'content': PROMPT.format(sentence=sentence)}],
        model='gpt-3.5-turbo', logprobs=True, top_logprobs=3)
    print('Sentence:', sentence)

    # Look at the top candidates for the first generated token only
    for alt_token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
        print(f"\033[96mPredicted next token:\033[0m {alt_token.token}, \033[93mlogprobs:\033[0m {alt_token.logprob}, \033[95mlinear probability:\033[0m {np.round(np.exp(alt_token.logprob)*100,2)}%")
        if np.exp(alt_token.logprob) > .95:
            high_prob_completions[sentence] = alt_token.token
    print('\n')


Sentence: My
Predicted next token: favorite, logprobs: -0.18245785, linear probability: 83.32%
Predicted next token: dog, logprobs: -2.397172, linear probability: 9.1%
Predicted next token: ap, logprobs: -3.8732424, linear probability: 2.08%


Sentence: My least
Predicted next token: favorite, logprobs: -0.013642592, linear probability: 98.65%
Predicted next token: My, logprobs: -4.3126197, linear probability: 1.34%
Predicted next token:  favorite, logprobs: -9.684484, linear probability: 0.01%


Sentence: My least favorite
Predicted next token: food, logprobs: -0.9481721, linear probability: 38.74%
Predicted next token: My, logprobs: -1.3447137, linear probability: 26.06%
Predicted next token: color, logprobs: -1.3887696, linear probability: 24.9

If we were to create an autocomplete system using `gpt-3.5-turbo`, we could set the threshold for recommending a completion at whatever probability we want, say 95% linear probability. This would have our autocompletion engine recommend "favorite" after we type "My least" (which is reasonable), but not make any recommendation after "My least favorite TV show is" (which makes sense, as we don't want our autocomplete guessing our least favorite show!).
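The suggestion rule described above can be isolated into a small function. This is a sketch under the assumption that candidates arrive as `(token, logprob)` pairs; `suggest_next_token` is a hypothetical helper, and the sample values are copied from the output shown in this notebook.

```python
from math import exp
from typing import Optional


def suggest_next_token(candidates, threshold: float = 0.95) -> Optional[str]:
    """Return the most likely next token only if its linear probability exceeds the threshold."""
    token, logprob = max(candidates, key=lambda pair: pair[1])
    return token if exp(logprob) > threshold else None


# "My" -> the best guess is only ~83% likely, so nothing is suggested
print(suggest_next_token([("favorite", -0.18245785), ("dog", -2.397172), ("ap", -3.8732424)]))

# "My least" -> "favorite" is ~98.65% likely, so it is suggested
print(suggest_next_token([("favorite", -0.013642592), ("My", -4.3126197), (" favorite", -9.684484)]))
```

Returning `None` instead of a low-confidence guess keeps the interface quiet until the model is reasonably sure, which is the behaviour the 95% threshold is meant to produce.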

## 4. Extensions

There are many other use cases for `logprobs` that are not covered in this notebook. For example, we can use `logprobs` to calculate the `perplexity` of our outputs (a metric of the model's uncertainty or surprise at its own output). Perplexity is the exponentiated average negative log-likelihood of the output tokens.
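As a minimal sketch of that definition, the function below takes a plain list of per-token natural-log probabilities, such as `[t.logprob for t in API_RESPONSE.choices[0].logprobs.content]`. The name `perplexity` and the choice to raise on empty input are assumptions for illustration.

```python
from math import exp, log


def perplexity(token_logprobs):
    """Exponentiated average negative log-likelihood of a sequence of tokens."""
    if not token_logprobs:
        raise ValueError("perplexity is undefined for an empty sequence")
    return exp(-sum(token_logprobs) / len(token_logprobs))


# A model that is certain of every token has the minimum perplexity of 1
print(perplexity([0.0, 0.0, 0.0]))

# If every token had probability 0.5, perplexity would be approximately 2
print(perplexity([log(0.5)] * 4))
```

Lower perplexity means the model found its own output less surprising. Because the value is averaged per token, it can be compared across responses of different lengths, unlike the joint probability, which shrinks as the response gets longer.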

In [343]:
from math import exp
aggregated_bytes = []
joint_logprob = 0.0
for token in API_RESPONSE.choices[0].logprobs.content:
    aggregated_bytes += token.bytes
    # Add the logprob of the current token to the joint logprob
    joint_logprob += token.logprob

    # Get the content of the message from the API response
    message_content = API_RESPONSE.choices[0].message.content

    # Decode the aggregated bytes to text
    aggregated_text = bytes(aggregated_bytes).decode('utf-8')

    # Print the aggregated text
    print(aggregated_text)

assert message_content == aggregated_text

print(f"text = {aggregated_text}")
print(f"joint probability = {exp(joint_logprob)}")


"My
"My least
"My least favorite
"My least favorite TV
"My least favorite TV show
"My least favorite TV show is
"My least favorite TV show is any
"My least favorite TV show is any reality
"My least favorite TV show is any reality show
"My least favorite TV show is any reality show that
"My least favorite TV show is any reality show that focuses
"My least favorite TV show is any reality show that focuses on
"My least favorite TV show is any reality show that focuses on drama
"My least favorite TV show is any reality show that focuses on drama and
"My least favorite TV show is any reality show that focuses on drama and gossip
"My least favorite TV show is any reality show that focuses on drama and gossip."
text = "My least favorite TV show is any reality show that focuses on drama and gossip."
joint probability = 1.1513921327020997e-07


In [None]:
# `sentence` still holds the last prefix from the autocomplete loop above
PROMPT = """Complete this sentence: {sentence}"""
API_RESPONSE = get_completion([{'role':'user','content':PROMPT.format(sentence=sentence)}],model='gpt-3.5-turbo',logprobs=True,top_logprobs=5)

#Function to highlight each token
def highlight_text(api_response):
    colors = ['\033[95m', '\033[92m', '\033[93m', '\033[91m', '\033[94m']  # ANSI codes for purple, green, orange, red, blue
    reset_color = '\033[0m'
    tokens = api_response.choices[0].logprobs.content

    color_idx = 0
    for t in tokens:
        token_str = bytes(t.bytes).decode('utf-8')
        print(f"{colors[color_idx]}{token_str}{reset_color}", end="")

        # Move to the next color in the sequence, wrapping around if necessary
        color_idx = (color_idx + 1) % len(colors)
    print()  # for readability
    print(f"Total number of tokens: {len(tokens)}")
