https://towardsdatascience.com/perplexity-in-language-models-87a196019a94

We can in fact use two different approaches to evaluate and compare language models:

Extrinsic evaluation: employ the models in an actual task (such as machine translation) and compare their final loss/accuracy.
This is computationally expensive and slow, as it requires training a full system.

Intrinsic evaluation: find some metric to evaluate the language model itself.
Although not as “good” as extrinsic evaluation, it’s a useful way of quickly comparing models.
Perplexity is an intrinsic evaluation method.

Perplexity as the normalized inverse probability of the test set

$$PPL (W) =  \sqrt[N]{\frac{1}{P(w_1, w_2, \cdots, w_N)}}$$

A lower PPL is better: it means the model assigns a higher probability to the test examples.

The N-th root normalizes the probability by the size of the test set (the number of words N), so test sets of different lengths can be compared.
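
As a quick numeric check of the definition, the sketch below evaluates the N-th root of the inverse joint probability directly. The helper name `perplexity_from_probs` and the probabilities 0.5 and 0.25 are illustrative assumptions, and the joint probability is assumed to factorize into the per-word probabilities passed in.

```python
import math

def perplexity_from_probs(word_probs):
    """Perplexity as the N-th root of the inverse joint probability.

    Hypothetical helper: assumes the joint probability is the product
    of the given per-word probabilities.
    """
    n = len(word_probs)
    joint = math.prod(word_probs)
    if joint == 0.0:
        return math.inf  # a zero-probability word makes perplexity infinite
    return (1.0 / joint) ** (1.0 / n)

ppl_short = perplexity_from_probs([0.5, 0.25])
# Repeating the same probabilities doubles N but leaves the perplexity
# unchanged, which is what the N-th root normalization is for.
ppl_long = perplexity_from_probs([0.5, 0.25, 0.5, 0.25])
print(ppl_short, ppl_long)
```

Both calls give the same value (the square root of 8), even though the second sequence is twice as long and has a much smaller joint probability.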

--> consider a unigram model, where the joint probability is the product of the individual word probabilities:
$$ \log PPL(W) = - \frac{1}{N}[\log P(w_1) + \cdots + \log P(w_N)]$$

In other words, perplexity is the exponential of the cross-entropy (the average negative log-probability per word).
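
To make the link to cross-entropy concrete, the following sketch computes the average negative log-probability per word (in nats) and exponentiates it, then compares the result with the N-th root form. The function names and the sample probabilities are assumptions for illustration only.

```python
import math

def cross_entropy(word_probs):
    """Average negative log-probability per word, in nats."""
    return -sum(math.log(p) for p in word_probs) / len(word_probs)

def perplexity_from_cross_entropy(word_probs):
    """Perplexity as exp(cross-entropy); hypothetical helper name."""
    return math.exp(cross_entropy(word_probs))

probs = [0.5, 0.25, 0.125]  # illustrative per-word probabilities
h = cross_entropy(probs)
ppl = perplexity_from_cross_entropy(probs)
# Direct definition for comparison: N-th root of the inverse probability.
direct = (1.0 / math.prod(probs)) ** (1.0 / len(probs))
print(h, ppl, direct)
```

The two perplexity values agree up to floating-point error. Working in base 2 instead (entropy in bits, then `2 ** h`) gives the same perplexity, as long as the logarithm and the exponentiation use the same base.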

In [None]:
# Implementation in pytorch-seq2seq
# https://github.com/IBM/pytorch-seq2seq/blob/f146087a9a271e9b50f46561e090324764b081fb/seq2seq/loss/loss.py
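def get_loss(self):
    # Excerpt: a method of the Perplexity loss class; needs `import math` at module level.
    # Accumulated negative log-likelihood reported by the parent loss class.
    nll = super(Perplexity, self).get_loss()
    # Normalize by the accumulated normalization term.
    nll /= self.norm_term.item()
    # Cap the exponent at _MAX_EXP before exponentiating.
    if nll > Perplexity._MAX_EXP:
        print("WARNING: Loss exceeded maximum value, capping to e^100")
        return math.exp(Perplexity._MAX_EXP)
    return math.exp(nll)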
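
The method above depends on the surrounding class, so here is a minimal standalone sketch of the same idea in plain Python: divide a summed negative log-likelihood by a token count, cap the exponent, and exponentiate. The function `capped_perplexity`, its parameters, and the constant `MAX_EXP = 100` (taken from the "e^100" in the warning message) are assumptions, not part of pytorch-seq2seq.

```python
import math

MAX_EXP = 100  # assumed cap, inferred from the "capping to e^100" warning

def capped_perplexity(total_nll, num_tokens, max_exp=MAX_EXP):
    """exp(total_nll / num_tokens), with the exponent capped at max_exp.

    total_nll: summed negative log-likelihood (nats) over all tokens.
    num_tokens: number of tokens that contributed to total_nll.
    """
    mean_nll = total_nll / num_tokens
    if mean_nll > max_exp:
        # math.exp raises OverflowError for arguments above roughly 709,
        # so clamping keeps the result a finite float.
        return math.exp(max_exp)
    return math.exp(mean_nll)

normal = capped_perplexity(total_nll=2 * math.log(4), num_tokens=2)
capped = capped_perplexity(total_nll=1e6, num_tokens=10)
print(normal, capped)
```

The first call has a mean negative log-likelihood of log(4), so its perplexity is 4. The second exceeds the cap and returns e^100 instead of overflowing.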

https://stackoverflow.com/questions/54941966/how-can-i-calculate-perplexity-using-nltk/55043954

In [1]:
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]
n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_vocab)

test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

# padded_everygram_pipeline returns lazy iterators, so rebuild test_data before each pass.
test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for test in test_data:
    # Score each n-gram as (word, context); with n = 1 the context is empty.
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])

test_data, _ = padded_everygram_pipeline(n, tokenized_text)

for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

MLE Estimates: [(('an', ()), 0.5), (('apple', ()), 0.25)]
MLE Estimates: [(('an', ()), 0.5), (('ant', ()), 0.0)]
PP(an apple):2.8284271247461903
PP(an ant):inf
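
The unigram output above can be checked by hand. The sketch below recomputes the unsmoothed MLE probabilities from raw counts with only the standard library. The helper names are hypothetical, and returning `inf` for an unseen word is an explicit choice made here to match the printed result.

```python
import math
from collections import Counter

train_tokens = ["an", "apple", "an", "orange"]
counts = Counter(train_tokens)
total = sum(counts.values())

def unigram_prob(word):
    """Unsmoothed MLE estimate: count(word) / total number of tokens."""
    return counts[word] / total

def unigram_perplexity(words):
    """2 ** (average negative log2 probability); inf if any word is unseen."""
    probs = [unigram_prob(w) for w in words]
    if any(p == 0.0 for p in probs):
        return math.inf
    entropy = -sum(math.log2(p) for p in probs) / len(probs)
    return 2.0 ** entropy

pp_apple = unigram_perplexity(["an", "apple"])
pp_ant = unigram_perplexity(["an", "ant"])
print(pp_apple, pp_ant)
```

"an" occurs in 2 of 4 training tokens and "apple" in 1 of 4, so the perplexity is the square root of 1 / (0.5 × 0.25) = 8. "ant" never occurs in training, so an unsmoothed model assigns it probability 0 and the perplexity is infinite.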


In [6]:
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.lm import Vocabulary

train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in train_sentences]

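n = 2  # bigram model
# Build padded bigrams per sentence by hand instead of using padded_everygram_pipeline.
train_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
# Vocabulary: all training words plus the two padding symbols.
words = [word for sent in tokenized_text for word in sent]
words.extend(["<s>", "</s>"])
padded_vocab = Vocabulary(words)
model = MLE(n)
model.fit(train_data, padded_vocab)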

test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in test_sentences]

# nltk.bigrams returns a generator, so rebuild test_data before each pass.
test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for test in test_data:
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])

test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

MLE Estimates: [(('an', ('<s>',)), 1.0), (('apple', ('an',)), 0.5), (('</s>', ('apple',)), 1.0)]
MLE Estimates: [(('an', ('<s>',)), 1.0), (('ant', ('an',)), 0.0), (('</s>', ('ant',)), 0)]
PP(an apple):1.2599210498948732
PP(an ant):inf
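
The bigram numbers above follow from conditional counts. As a rough cross-check, the sketch below estimates P(word | previous word) as count(previous, word) / count(previous as context) over the padded training sentences. All helper names are hypothetical, and the handling of unseen contexts (probability 0) is an assumption chosen to match the printed output.

```python
import math
from collections import Counter

def pad(tokens):
    """Add one start and one end symbol, as in the bigram cell above."""
    return ["<s>"] + tokens + ["</s>"]

train = [["an", "apple"], ["an", "orange"]]
bigram_counts = Counter()
context_counts = Counter()
for sent in train:
    padded = pad(sent)
    for prev, word in zip(padded, padded[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, word):
    """Unsmoothed MLE estimate of P(word | prev); 0 for an unseen context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def bigram_perplexity(tokens):
    """2 ** (average negative log2 probability) over the padded bigrams."""
    padded = pad(tokens)
    probs = [bigram_prob(p, w) for p, w in zip(padded, padded[1:])]
    if any(p == 0.0 for p in probs):
        return math.inf
    return 2.0 ** (-sum(math.log2(p) for p in probs) / len(probs))

pp_apple = bigram_perplexity(["an", "apple"])
pp_ant = bigram_perplexity(["an", "ant"])
print(pp_apple, pp_ant)
```

For "an apple" the three bigram probabilities are 1, 0.5 and 1, so the perplexity is the cube root of 2. For "an ant" the bigram ("an", "ant") was never seen, so the perplexity is again infinite.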
