# Basic Examples of LM-Polygraph Usage

This notebook shows basic examples of obtaining uncertainty scores for LLM generations, together with the generations themselves, using the high-level API function `estimate_uncertainty(model, estimator, input_text)`.

## Install Dependencies

In [None]:
# This assumes lm-polygraph is already installed, e.g.:
# pip install git+https://github.com/artemshelmanov/lm-polygraph.git

!python -m spacy download en_core_web_sm

## Basic Imports

In [60]:
%load_ext autoreload
%autoreload 2

from transformers import AutoModelForCausalLM, AutoTokenizer
from lm_polygraph.utils.model import WhiteboxModel, BlackboxModel
from lm_polygraph import estimate_uncertainty
from lm_polygraph.estimators import MaximumTokenProbability, MaximumSequenceProbability, SemanticEntropy, EigValLaplacian


## UQ for Whitebox LLMs

### Initialize model

In [2]:
model_name = 'Qwen/Qwen2.5-0.5B-Instruct'
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='cpu',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = WhiteboxModel(base_model, tokenizer)

### Sequence-level UQ for a Whitebox LLM

In [4]:
estimator = MaximumSequenceProbability()
estimate_uncertainty(model, estimator, input_text='How many floors are in the Empire State Building?')

UncertaintyOutput(uncertainty=3.913156032562256, input_text='How many floors are in the Empire State Building?', generation_text='The Empire State Building has 105 floors.', generation_tokens=[785, 20448, 3234, 16858, 702, 220, 16, 15, 20, 25945, 13], model_path=None, estimator='MaximumSequenceProbability')

In [5]:
estimator = MaximumSequenceProbability()
estimate_uncertainty(model, estimator, input_text='What has a head and a tail but no body?')

UncertaintyOutput(uncertainty=61.402503967285156, input_text='What has a head and a tail but no body?', generation_text="The answer to this question is a virus. Viruses are small, non-living entities that can only replicate within living cells. They do not have a head or a tail, but they can cause harm to living organisms by attaching to and hijacking the host cell's machinery. Viruses are a significant threat to public health and can cause a wide range of diseases, including but not limited to, influenza, HIV, and cancer.", generation_tokens=[785, 4226, 311, 419, 3405, 374, 264, 16770, 13, 9542, 4776, 525, 2613, 11, 2477, 2852, 2249, 14744, 429, 646, 1172, 45013, 2878, 5382, 7761, 13, 2379, 653, 537, 614, 264, 1968, 476, 264, 9787, 11, 714, 807, 646, 5240, 11428, 311, 5382, 43204, 553, 71808, 311, 323, 21415, 8985, 279, 3468, 2779, 594, 25868, 13, 9542, 4776, 525, 264, 5089, 5899, 311, 584, 2820, 323, 646, 5240, 264, 6884, 2088, 315, 18808, 11, 2670, 714, 537, 7199, 311, 11, 61837, 11
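
The second score (about 61.4) is far larger than the first (about 3.9), and the second generation is also far longer. A minimal sketch of why length alone can inflate such a score, assuming the sequence-level score is the negative sum of token log-probabilities (an assumption about how `MaximumSequenceProbability` aggregates, not something this notebook confirms). The helper name and the probabilities below are hypothetical and chosen only for illustration:

```python
import math

def negative_log_likelihood(token_probs):
    """Return -sum(log p) over the tokens of one generation."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities, not taken from the model above.
short_answer = [0.9, 0.8, 0.95]
long_answer = short_answer * 10  # same per-token confidence, ten times longer

print(negative_log_likelihood(short_answer))
print(negative_log_likelihood(long_answer))
```

Under this assumption the score grows with the number of tokens even when per-token confidence is unchanged, so comparing raw scores across generations of very different lengths calls for some care.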

In [6]:
# This example takes about 2 minutes to run.

estimator = SemanticEntropy()
estimate_uncertainty(model, estimator, input_text='How many floors are in the Empire State Building?')

Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


UncertaintyOutput(uncertainty=99.17972927011878, input_text='How many floors are in the Empire State Building?', generation_text='The Empire State Building has 105 floors.', generation_tokens=[785, 20448, 3234, 16858, 702, 220, 16, 15, 20, 25945, 13], model_path=None, estimator='SemanticEntropy')
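
The general intuition behind an entropy over meanings is to sample several answers, group those that say the same thing, and measure how spread out the groups are. The sketch below illustrates only that idea and is not the library's implementation: `normalize` is a hypothetical, crude stand-in for meaning equivalence (the warning above indicates that the library loads an NLI model, `microsoft/deberta-large-mnli`), the sample answers are made up, and the result is not expected to match the score printed above.

```python
import math
from collections import Counter

def normalize(answer):
    """Crude stand-in for meaning equivalence: lowercase and drop punctuation."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return kept.strip()

def entropy_over_meanings(samples):
    """Entropy (in nats) of the empirical distribution over answer clusters."""
    clusters = Counter(normalize(s) for s in samples)
    total = sum(clusters.values())
    return sum((n / total) * math.log(total / n) for n in clusters.values())

# Hypothetical sampled answers to the same question.
agreeing = ["102 floors.", "102 floors", "102 Floors."]
disagreeing = ["102 floors.", "105 floors.", "86 floors."]

print(entropy_over_meanings(agreeing))
print(entropy_over_meanings(disagreeing))
```

When every sample lands in one cluster the entropy is zero; when each sample forms its own cluster the entropy reaches its maximum for that sample count.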

In [None]:
# This example takes about 2 minutes to run.

estimator = SemanticEntropy()
estimate_uncertainty(model, estimator, input_text='What has a head and a tail but no body?')


UncertaintyOutput(uncertainty=115.85714697341697, input_text='What has a head and a tail but no body?', generation_text="The answer to this question is a virus. Viruses are small, non-living entities that can only replicate within living cells. They do not have a head or a tail, but they can cause harm to living organisms by attaching to and hijacking the host cell's machinery. Viruses are a significant threat to public health and can cause a wide range of diseases, including but not limited to, influenza, HIV, and cancer.", generation_tokens=[785, 4226, 311, 419, 3405, 374, 264, 16770, 13, 9542, 4776, 525, 2613, 11, 2477, 2852, 2249, 14744, 429, 646, 1172, 45013, 2878, 5382, 7761, 13, 2379, 653, 537, 614, 264, 1968, 476, 264, 9787, 11, 714, 807, 646, 5240, 11428, 311, 5382, 43204, 553, 71808, 311, 323, 21415, 8985, 279, 3468, 2779, 594, 25868, 13, 9542, 4776, 525, 264, 5089, 5899, 311, 584, 2820, 323, 646, 5240, 264, 6884, 2088, 315, 18808, 11, 2670, 714, 537, 7199, 311, 11, 61837, 11

### Token-level UQ for a Whitebox LLM

In [8]:
estimator = MaximumTokenProbability()
estimate_uncertainty(model, estimator, input_text='What has a head and a tail but no body?')

UncertaintyOutput(uncertainty=array([-0.5199645 , -0.7773075 , -0.9111757 , -0.4642391 , -0.7312121 ,
       -0.96189463, -0.27488053, -0.11001918, -0.72922415, -0.43932858,
       -0.99928606, -0.5605216 , -0.23990016, -0.7537944 , -0.24525657,
       -0.6493522 , -0.99975866, -0.51964307, -0.8791548 , -0.21423468,
       -0.35266286, -0.47724998, -0.45849365, -0.38198572, -0.8100885 ,
       -0.6418665 , -0.7694248 , -0.30586314, -0.99927765, -0.9042026 ,
       -0.87831134, -0.9398997 , -0.52133787, -0.5053506 , -0.9973978 ,
       -0.67501503, -0.49679896, -0.7429459 , -0.35794568, -0.10462207,
       -0.16433331, -0.5641733 , -0.80949354, -0.7477126 , -0.21802613,
       -0.12109879, -0.6510552 , -0.2837354 , -0.2947866 , -0.98860633,
       -0.5185005 , -0.1949296 , -0.51526326, -0.9765778 , -0.7364911 ,
       -0.442843  , -0.2552529 , -0.9931718 , -0.49535578, -0.17133856,
       -0.2405345 , -0.5826117 , -0.89956623, -0.22676547, -0.9783974 ,
       -0.7804185 , -0.30011195, -
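
Each entry of the array corresponds to one generated token. The values lie between -1 and 0, which is consistent with the score being the negated probability of each chosen token; that reading is an assumption here, not something the notebook states. Under it, values closer to 0 mark tokens the model was less sure about. A small sketch for locating those positions, where `least_confident` is a hypothetical helper and the scores are the first eight values copied from the output above:

```python
def least_confident(scores, k=3):
    """Return (position, score) pairs for the k highest-uncertainty tokens."""
    indexed = list(enumerate(scores))
    # A higher score means higher uncertainty: -0.11 is less confident than -0.99.
    indexed.sort(key=lambda pair: pair[1], reverse=True)
    return indexed[:k]

# First eight values of the token-level output above.
scores = [-0.5199645, -0.7773075, -0.9111757, -0.4642391,
          -0.7312121, -0.96189463, -0.27488053, -0.11001918]

for position, score in least_confident(scores):
    print(position, score)
```

The positions index into the generated token sequence, so they can be matched against `generation_tokens` in the returned object to see which tokens were the least certain.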

## UQ for a Blackbox LLM

### Sequence-Level UQ for an LLM Deployed via the OpenAI API

In [74]:
OPENAI_KEY = '<Your OpenAI key>'
model = BlackboxModel(
    OPENAI_KEY,
    'gpt-4o-mini'
)
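
Pasting a key into a notebook cell makes it easy to leak when the notebook is shared. One common alternative, sketched here, is to read the key from an environment variable. The helper `read_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for this example, not part of LM-Polygraph:

```python
import os

def read_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With the variable set in the shell that launches the notebook, `OPENAI_KEY = read_api_key()` can replace the hard-coded string in the cell above.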

In [77]:
estimator = EigValLaplacian(verbose=True)
estimate_uncertainty(model, estimator, input_text='How many floors are in the Empire State Building?')


UncertaintyOutput(uncertainty=1.0005499646067637, input_text='How many floors are in the Empire State Building?', generation_text='The Empire State Building has 102 floors.', generation_tokens=None, model_path='gpt-4o-mini', estimator='EigValLaplacian_NLI_score_entail')

In [78]:
estimator = EigValLaplacian(verbose=True)
estimate_uncertainty(model, estimator, input_text='What has a head and a tail but no body?')


UncertaintyOutput(uncertainty=1.7574554956017185, input_text='What has a head and a tail but no body?', generation_text='The answer to the riddle "What has a head and a tail but no body?" is a coin. It has a "head" side and a "tail" side, but no physical body.', generation_tokens=None, model_path='gpt-4o-mini', estimator='EigValLaplacian_NLI_score_entail')

### Sequence-Level UQ for an LLM Deployed via the HuggingFace API

In [52]:
# Without a HuggingFace Pro account, the HF API might return an error.

HUGGINGFACE_API_TOKEN = '<Your HuggingFace API token>'
MODEL_ID = 'meta-llama/Llama-3.3-70B-Instruct'
model = BlackboxModel.from_huggingface(
    hf_api_token=HUGGINGFACE_API_TOKEN,
    hf_model_id=MODEL_ID,
    openai_api_key=None,
    openai_model_path=None,
)

In [63]:
ue_method = EigValLaplacian()
input_text = 'How many floors are in the Empire State Building? Just answer the question.'
estimate_uncertainty(model, ue_method, input_text=input_text)

Error during conversion: ChunkedEncodingError(ProtocolError('Response ended prematurely'))


UncertaintyOutput(uncertainty=0.7366183713303993, input_text='How many floors are in the Empire State Building? Just answer the question.', generation_text="How many floors are in the Empire State Building? Just answer the question. 102. \n\nNow if you need more information: The Empire State Building is an iconic 102-story skyscraper located in Midtown Manhattan, New York City. It was completed in 1931 and held the title of the world's tallest building for nearly 40 years. In addition to its impressive height, the building is also notable for its Art Deco design and historic significance, having been a symbol of American ingenuity and progress during a time of great economic and social change. Today, the Empire State Building remains a popular tourist destination and a prominent feature of the New York City skyline. \n\nAnd if you're a trivia buff: The Empire State Building has a total of 6,514 windows, 60,000 tons of steel, and 10 million bricks. It stands at a height of 1,454 feet (4

In [65]:
ue_method = EigValLaplacian()
input_text = 'What has an eye but cannot see? Just answer the question.'
estimate_uncertainty(model, ue_method, input_text=input_text)


UncertaintyOutput(uncertainty=0.7292325376247366, input_text='What has an eye but cannot see? Just answer the question.', generation_text='What has an eye but cannot see? Just answer the question. \n## Step 1: Understand the question\nThe question asks for something that has an eye but is unable to see.\n\n## Step 2: Recall common objects that fit the description\nA common object that fits this description is a needle, as it has an "eye" (the hole at one end through which thread is passed) but cannot see.\n\nThe final answer is: $\\boxed{A needle}$', generation_tokens=None, model_path='meta-llama/Llama-3.3-70B-Instruct', estimator='DegMat_NLI_score_entail')