# Perplexity

Perplexity (PPL) is a widely used intrinsic evaluation metric for language models that measures how well a model predicts a sample of text. It is the exponentiated average negative log-likelihood of the tokens in the sample, so a lower perplexity means the model assigned higher probability to the text it was shown. PPL is often used to assess the fluency and general quality of generated text, with lower scores typically reflecting more natural, predictable language. However, it does not always correlate with downstream task performance or human judgment, especially for open-ended generation.
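
To make the definition concrete, here is a minimal sketch that computes $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$ directly from per-token probabilities. The `perplexity` helper and the probability values are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability a model assigned to each observed token."""
    # Negative log-likelihood of each token, in nats
    nll = [-math.log(p) for p in token_probs]
    # Exponentiate the mean NLL
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities: a model that is fairly sure of every token
confident = perplexity([0.9, 0.8, 0.95, 0.85])
# Hypothetical probabilities: a model guessing uniformly among 4 options
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

print(f"confident: {confident:.2f}, uniform: {uniform:.2f}")
```

The uniform case gives a perplexity of 4, which matches the common reading of perplexity as the effective number of equally likely choices the model is weighing at each step.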


In [11]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

In [12]:
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [13]:
responses = [
    "Hyderabad is Capital of Telangana",
    "Telangana Capital is Hyderabad the"
]

In [14]:
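def compute_ppl(text):
    """Return GPT-2's perplexity on a single string."""
    encodings = tokenizer(text, return_tensors='pt')
    input_ids = encodings['input_ids']

    # Passing labels=input_ids makes the model return the mean
    # cross-entropy loss over the predicted tokens; the labels are
    # shifted inside the model, so no manual shift is needed.
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss

    # Perplexity is the exponential of the mean negative log-likelihood
    ppl = torch.exp(loss).item()
    return ppl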
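
For readers curious about what `outputs.loss` represents, the following is a rough sketch of the equivalent computation done by hand: each position's logits are scored against the next token, and the cross-entropy is averaged. The `ppl_from_logits` helper and the toy tensors are assumptions for illustration. It runs on dummy logits rather than GPT-2, so it does not reproduce the scores in this notebook.

```python
import torch
import torch.nn.functional as F

def ppl_from_logits(logits, input_ids):
    """Perplexity from causal-LM logits of shape (batch, seq_len, vocab)."""
    # Position t predicts token t+1, so drop the last logit and the first label
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    # Mean cross-entropy over all predicted tokens
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    return torch.exp(loss).item()

# Toy check: all-zero logits are a uniform distribution over a
# hypothetical 10-token vocabulary, so the perplexity should be 10
toy_logits = torch.zeros(1, 5, 10)
toy_ids = torch.tensor([[1, 2, 3, 4, 5]])
toy_ppl = ppl_from_logits(toy_logits, toy_ids)
print(f"Toy PPL: {toy_ppl:.2f}")
```

One consequence of the shift is that the first token of the input is never scored, since nothing precedes it. A 5-token input therefore contributes 4 predictions to the average.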

In [15]:
# Calculate PPL for each response

for i, response in enumerate(responses, 1):
    ppl = compute_ppl(response)
    print(f"Response {i} : {response}")
    print(f"PPL Score: {ppl:.2f}\n")

Response 1 : Hyderabad is Capital of Telangana
PPL Score: 72.83

Response 2 : Telangana Capital is Hyderabad the
PPL Score: 400.18



- Although both responses convey similar factual content, Response 1 ("Hyderabad is Capital of Telangana") has a much lower perplexity (72.83) than Response 2 ("Telangana Capital is Hyderabad the"), which scores 400.18.
- This indicates that GPT-2 finds Response 1 more predictable, i.e. closer to the language patterns it learned during training.
- The higher perplexity for Response 2 reflects its unnatural word order and reduced grammaticality, illustrating how PPL can capture sentence fluency and syntactic coherence even when the semantic content is similar.
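
Because perplexity is the exponential of the mean per-token loss, the two scores above can be converted back into average negative log-likelihoods, which are sometimes easier to compare. The sketch below only does arithmetic on the reported numbers; the variable names are illustrative.

```python
import math

# Perplexities reported above for the two responses
ppl_fluent = 72.83
ppl_scrambled = 400.18

# PPL = exp(mean NLL), so the mean per-token NLL (in nats) is ln(PPL)
nll_fluent = math.log(ppl_fluent)
nll_scrambled = math.log(ppl_scrambled)

print(f"Mean NLL, Response 1: {nll_fluent:.2f} nats/token")
print(f"Mean NLL, Response 2: {nll_scrambled:.2f} nats/token")
print(f"Gap: {nll_scrambled - nll_fluent:.2f} nats/token")
print(f"PPL ratio: {ppl_scrambled / ppl_fluent:.2f}x")
```

This works out to roughly 4.29 versus 5.99 nats per token, so the scrambled sentence costs the model about 1.7 extra nats per token, which corresponds to a perplexity about 5.5 times higher.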
