# Entropy and Perplexity -- NEEDS REVISION

```yaml
Course:  DS 5001
Module:  03 Lab
Topic:   Entropy and Perplexity
Author:  R.C. Alvarado
Purpose: Clarify concept of perplexity.
```

# Entropy

**Entropy** $H$ is the expected information of a distribution.

**Self-entropy** $h$ is the information of an event weighted by its probability.

**Information** $i$ is the log-normalized surprise of an event.

**Surprise** $s$ is the inverse probability of an event.

## Probability $p$

$\Large p = \frac{n}{N}$

$p(w) = \Large\frac{n_w}{N_{corpus}}$ 

`p = n / n.sum()`

Most terms have low probability.
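
As a minimal sketch of this estimate, the following uses a made-up token list rather than the Austen corpus; `tokens`, `n`, and `N` are illustrative names, and the pandas one-liner above does the same thing on a column of counts.

```python
from collections import Counter

# Toy token sequence (hypothetical data, not the Austen corpus)
tokens = "the cat sat on the mat the end".split()

# n: count of each term; N: total number of tokens
n = Counter(tokens)
N = sum(n.values())

# p(w) = n_w / N
p = {w: c / N for w, c in n.items()}

print(p["the"])         # 3 of the 8 tokens are "the"
print(sum(p.values()))  # the probabilities sum to 1
```

Even in this tiny sample, every term except one sits at the minimum probability of $1/N$.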

## Surprise $s$

$\Large s = \Large\frac{1}{p}$

$s(w) = p(w)^{-1}$

Surprise $s$ increases as the inverse of $p$. Note how inverting $p$ adds variance to the long tail; the curve now looks like a simple quadratic. We can see a more gradual increase in surprise as terms become rarer.

<!-- V.s.value_counts().plot(style='*-') -->

## Information $i$

$\Large i= log_2(s)$

$i(w) = log_2(s(w))$

As normalized surprise, information now has a long-tail structure. Notice also the range of information: it lies between 1 and 18. What does this correspond to?
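
The two steps from $p$ to $s$ to $i$ can be sketched on a toy distribution. The probabilities below are invented powers of one half so that the results come out as whole numbers; they are not taken from `V`.

```python
import math

# Toy probabilities (hypothetical): frequent terms first, rare terms last
p = {"the": 0.5, "of": 0.25, "zeal": 0.125, "youth": 0.125}

# Surprise is the inverse probability
s = {w: 1 / pw for w, pw in p.items()}

# Information is the base-2 log of surprise
i = {w: math.log2(sw) for w, sw in s.items()}

print(s)  # surprise doubles each time the probability halves
print(i)  # information grows by 1 each time the probability halves
```

Because the logarithm is base 2, each halving of probability adds one unit of information, which turns the multiplicative scale of $s$ into an additive one.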

<!-- V.i.value_counts().plot(style='*-'); -->

## Self-entropy $h$

$\Large h = p i$

$h(w) = p(w)i(w)$

For the self-entropy of each term, we multiply $p$ and $i$. When summed, this will give us the expectation of the information in the distribution, i.e. its entropy.
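
A small sketch of the weighting and the sum, again on an invented distribution rather than the Austen vocabulary; `h` and `H` follow the symbols used in this notebook.

```python
import math

# Toy probabilities (hypothetical), chosen so the arithmetic is exact
p = {"the": 0.5, "of": 0.25, "zeal": 0.125, "youth": 0.125}

# h(w) = p(w) * i(w), with i(w) = log2(1 / p(w))
h = {w: pw * math.log2(1 / pw) for w, pw in p.items()}

# Entropy H is the sum of the per-term contributions
H = sum(h.values())

print(h)
print(H)
```

Note that the rare terms carry the most information individually but contribute the least to $H$, because their information is weighted by a small $p$.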

<!-- V.h.value_counts().plot(style='*-'); -->

## Perplexity $PP$

<!-- $\Large PP = \Large 2^{i}$ -->

## Chiasmus

The process of computing entropy follows a chiasmus pattern.

$A_1 \rightarrow B_1 \rightarrow B_2 \rightarrow A_2$  

<!--
$p := A_1, s := B_1, i := B_2, h := A_2$
-->

$p \rightarrow s \rightarrow i \rightarrow h$ 

$A: \{p,h\}$

$B: \{s,i\}$
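
Substituting each step into the next shows how the chain closes, with the outer pair $p$ and $h$ bracketing the inner pair $s$ and $i$. This is only a restatement of the definitions above, not a new result:

$h = p \cdot i = p \cdot \log_2(s) = p \cdot \log_2(p^{-1}) = -p \log_2(p)$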

# Demonstration

## Set up

In [1]:
import pandas as pd

In [44]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
output_dir = config['DEFAULT']['output_dir']

In [45]:
ohco = ['book_id','chap_num','para_num','sent_num','token_num']

## Import data

In [46]:
K = pd.read_csv(f"{output_dir}/austen-combo-TOKENS-v2.csv").set_index(ohco)
V = pd.read_csv(f"{output_dir}/austen-combo-VOCAB-v2.csv").set_index('term_str')

In [47]:
K.head()

| book_id | chap_num | para_num | sent_num | token_num | token_str | term_str |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | The | the |
| 1 | 1 | 1 | 0 | 1 | family | family |
| 1 | 1 | 1 | 0 | 2 | of | of |
| 1 | 1 | 1 | 0 | 3 | Dashwood | dashwood |
| 1 | 1 | 1 | 0 | 4 | had | had |


In [48]:
V.head()

| term_str | n | n_chars | p | i | h |
|---|---|---|---|---|---|
| 1 | 3 | 1 | 1.5e-05 | 16.058901 | 0.000235 |
| 15 | 1 | 2 | 5e-06 | 17.643863 | 8.6e-05 |
| 16 | 1 | 2 | 5e-06 | 17.643863 | 8.6e-05 |
| 1760 | 1 | 4 | 5e-06 | 17.643863 | 8.6e-05 |
| 1784 | 1 | 4 | 5e-06 | 17.643863 | 8.6e-05 |


The next step assumes that the n-gram language models (unigram through trigram) have already been created and saved to `output_dir`.

In [120]:
LM = {}
for n in range(1, 4):
    widx = [f"w{i}" for i in range(n)]
    LM[n] = pd.read_csv(f"{output_dir}/austen-combo-LM{n}-v2.csv").set_index(widx)

In [121]:
LM[1]

| w0 | n | p | i | h |
|---|---|---|---|---|
| 1 | 3 | 0.000013 | 16.239120 | 0.000210 |
| 1760 | 1 | 0.000004 | 17.824082 | 0.000077 |
| 1784 | 1 | 0.000004 | 17.824082 | 0.000077 |
| 1785 | 1 | 0.000004 | 17.824082 | 0.000077 |
| 1787 | 1 | 0.000004 | 17.824082 | 0.000077 |
| ... | ... | ... | ... | ... |
| youth | 22 | 0.000095 | 13.364651 | 0.001267 |
| youthful | 3 | 0.000013 | 16.239120 | 0.000210 |
| zeal | 7 | 0.000030 | 15.016727 | 0.000453 |
| zealous | 4 | 0.000017 | 15.824082 | 0.000273 |


# Compute Perplexity

In [122]:
import numpy as np

In [None]:
PPV = 2**V.h.sum()

In [None]:
PPV

568.0405114200985
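
One common way to read this number is as an effective count of equally likely choices: a uniform distribution over $k$ outcomes has perplexity exactly $k$. The sketch below checks that on invented distributions; `perplexity` is a hypothetical helper, not part of this notebook.

```python
import math

def perplexity(p):
    """Return 2**H for a list of probabilities that sum to 1."""
    H = sum(pw * math.log2(1 / pw) for pw in p if pw > 0)
    return 2 ** H

uniform = [1 / 8] * 8                # eight equally likely outcomes
skewed = [0.5, 0.25, 0.125, 0.125]   # four outcomes, unevenly distributed

print(perplexity(uniform))  # equals the number of outcomes
print(perplexity(skewed))   # less than 4, since the outcomes are not equally likely
```

On this reading, the value of about 568 above says the Austen unigram distribution is as uncertain as a uniform choice among roughly 568 terms, far fewer than the size of the vocabulary.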

# Notes

## Cross Entropy and Perplexity

### Probabilities of Sequences

$ W = W_1^N = (w_1, w_2 ... w_N)$

True distribution: $ p = p(W) $

Model distribution: $ q = q(W) $

### Cross Entropy

$ H(p, q) = - \sum_{x}^{} p(x) log_2(q(x)) $ 

$ H(p, q) = \sum_{x}^{} p(x) log_2(\frac{1}{q(x)}) $ 

$ i_q(x) = log_2(\frac{1}{q(x)}) $

$ H(p, q) = \sum_{x} p(x) i_q(x) $ 

$ H(p, q) = \vec{p} \cdot \vec{i_q} $
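
The dot-product form translates directly into code. In this sketch `p` and `q` are invented distributions over the same three outcomes, and plain Python lists stand in for the vectors.

```python
import math

# Hypothetical true distribution p and model distribution q
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# Information of each outcome under the model q
i_q = [math.log2(1 / qx) for qx in q]

# Cross-entropy: dot product of p with i_q
H_pq = sum(px * ix for px, ix in zip(p, i_q))

# Entropy of p itself, for comparison
H_p = sum(px * math.log2(1 / px) for px in p)

print(H_pq)
print(H_p)
```

The cross-entropy is never smaller than the entropy of $p$; the gap reflects how badly $q$ matches $p$.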

### Cross Entropy relative to MaxEnt

$ N = C(x) = \sum_x c(x) $

$ p_{u} = \frac{1}{N} $ 

$ H_{cross} = H(p_u, q) $

$ H_{cross} = \sum_{x} \frac{1}{N} i_q(x) $

$ H_{cross} = \frac{1}{N} \sum_{x} i_q(x) $

$ H_{cross} = \frac{\sum_x i_q(x)}{N} $

$ H_{cross} = \frac{ |\vec{i_q}|_1 }{ N } $



#### Perplexity

$ PP(W) = P(w_1, w_2 ... w_N)^{-1/N} $

$ PP(p) = 2^{H(p)}$

$ PP(p_u, q) = 2^{H_{cross}}$
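
The first and third definitions agree when the sequence probability is a product of per-token model probabilities. The sketch below checks this numerically, assuming a unigram (independence) model and reading $x$ as ranging over the $N$ tokens of the test sequence; the model `q` and sequence `W` are invented.

```python
import math

# Hypothetical unigram model q and test sequence W
q = {"the": 0.5, "cat": 0.25, "sat": 0.25}
W = ["the", "cat", "the", "sat"]
N = len(W)

# PP(W) = P(w_1 ... w_N) ** (-1/N), with P(W) as a product of unigram probabilities
P_W = math.prod(q[w] for w in W)
PP_seq = P_W ** (-1 / N)

# PP = 2 ** H_cross, with H_cross as the mean information of the tokens under q
H_cross = sum(math.log2(1 / q[w]) for w in W) / N
PP_ent = 2 ** H_cross

print(PP_seq, PP_ent)
```

Taking the $N$-th root of the inverse probability is the same as averaging the log probabilities and exponentiating, which is why the two values match.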

#### Redundancy

$ H_{max} = log_2(N) $

$ H_{max} = i(p_u) $

$ R = 1 - \frac{H}{H_{max}} $
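
A short numeric sketch of redundancy, assuming that $N$ here is the number of distinct outcomes in the distribution, so that $H_{max}$ is the entropy of a uniform distribution over them. The probabilities are invented.

```python
import math

# Hypothetical distribution over four outcomes
p = [0.5, 0.25, 0.125, 0.125]
N = len(p)

H = sum(pw * math.log2(1 / pw) for pw in p)  # actual entropy
H_max = math.log2(N)                         # entropy if all outcomes were equally likely
R = 1 - H / H_max                            # fraction of the maximum that goes unused

print(H, H_max, R)
```

A uniform distribution gives $R = 0$, and $R$ approaches 1 as the distribution concentrates on a single outcome.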

## From J & M
<img src="perplexity.png">

## From Stack Overflow
https://stats.stackexchange.com/questions/129352/how-to-find-the-perplexity-of-a-corpus
<img src="stackover1.png">
<img src="stackover2.png">

# NLTK 

**REDO**: See https://stackoverflow.com/questions/54941966/how-can-i-calculate-perplexity-using-nltk 

Perplexity is a measure of how well a probabilistic model is able to predict a sample. It is calculated as 2 to the power of the cross-entropy of the model and the sample. The lower the perplexity, the better the model is at predicting the sample.

Here is an example of how to calculate perplexity in Python using the Natural Language Toolkit (NLTK):


In [23]:
import nltk
from nltk.probability import FreqDist, MLEProbDist

# sample text, tokenized on whitespace
sample = "This is a sample text for computing perplexity."
tokens = sample.split()

# create a frequency distribution of the words in the sample
fdist = FreqDist(tokens)

# create a maximum likelihood estimate (MLE) probability distribution
mle = MLEProbDist(fdist)

# cross-entropy: negative mean log2 probability of the tokens under the MLE distribution
cross_entropy = -sum(mle.logprob(word) for word in tokens) / len(tokens)

# perplexity is 2 raised to the power of the cross-entropy
perplexity = 2 ** cross_entropy

print(perplexity)


8.0


In this example, the sample text is split into words and passed to `FreqDist()` to create a frequency distribution. This frequency distribution is then passed to `MLEProbDist()` to create a maximum likelihood estimate probability distribution. Finally, the `logprob()` method returns the base-2 log probability of each word in the sample; these are summed and divided by the number of words, and the negative of that mean is the cross-entropy. The perplexity is then 2 raised to the power of the cross-entropy. Because the distribution is estimated from the same eight tokens it is evaluated on, and each token occurs once, every token has probability 1/8 and the perplexity is exactly 8.


In [24]:
fdist

FreqDist({'This': 1, 'is': 1, 'a': 1, 'sample': 1, 'text': 1, 'for': 1, 'computing': 1, 'perplexity.': 1})

In [106]:
def get_pp(sent_str):
    """Work in progress: print the in-vocabulary tokens of a sentence and their
    unigram probabilities. The perplexity calculation is not yet implemented."""

    # Unigram probabilities from the vocabulary table
    lang_mod = V.p

    # Unique whitespace-delimited tokens; case is preserved, so capitalized
    # tokens will not match the lowercased terms in V
    tokens = set(sent_str.split())
    print(tokens)

    # Keep only tokens found in the vocabulary, sorted for a stable order
    x = sorted(tokens.intersection(V.index.values))
    print(x)

    # NOTE: MLEProbDist is designed for an nltk FreqDist of counts; here it
    # only wraps the Series of probabilities, which is what _freqdist prints
    mle = MLEProbDist(lang_mod.loc[x])
    print(mle._freqdist)

    # TODO: compute and return the perplexity from these probabilities

In [107]:
# Test sentences: the first two lines are not from Austen; the rest are from _Emma_
S_TEST = """
The car was brand new
Computer programs are full of bugs
The event had every promise of happiness for her friend 
Mr Weston was a man of unexceptionable character easy fortune suitable age and pleasant manners
and there was some satisfaction in considering with what self-denying generous friendship she had always wished and promoted the match
but it was a black morning's work for her 
The want of Miss Taylor would be felt every hour of every day 
She recalled her past kindness the kindness the affection of sixteen years 
how she had taught and how she had played with her from five years old 
how she had devoted all her powers to attach and amuse her in health 
and how nursed her through the various illnesses of childhood 
A large debt of gratitude was owing here 
but the intercourse of the last seven years 
the equal footing and perfect unreserve which had soon followed Isabella's marriage 
on their being left to each other was yet a dearer tenderer recollection 
She had been a friend and companion such as few possessed intelligent well-informed useful gentle 
knowing all the ways of the family 
interested in all its concerns 
and peculiarly interested in herself in every pleasure every scheme of hers 
one to whom she could speak every thought as it arose 
and who had such an affection for her as could never find fault 
How was she to bear the change 
It was true that her friend was going only half a mile from them 
but Emma was aware that great must be the difference between a Mrs Weston 
only half a mile from them 
and a Miss Taylor in the house 
and with all her advantages natural and domestic 
she was now in great danger of suffering from intellectual solitude 
She dearly loved her father 
but he was no companion for her 
He could not meet her in conversation rational or playful 
The evil of the actual disparity in their ages
and Mr Woodhouse had not married early
was much increased by his constitution and habits 
for having been a valetudinarian all his life 
without activity of mind or body 
he was a much older man in ways than in years 
and though everywhere beloved for the friendliness of his heart and his amiable temper 
his talents could not have recommended him at any time 
Her sister though comparatively but little removed by matrimony 
being settled in London only sixteen miles off was much beyond her daily reach 
and many a long October and November evening must be struggled through at Hartfield 
before Christmas brought the next visit from Isabella and her husband 
and their little children to fill the house and give her pleasant society again 
""".split('\n')[1:-1]

In [108]:
get_pp(S_TEST[1])

{'of', 'Computer', 'full', 'bugs', 'programs', 'are'}
['are', 'full', 'of']
term_str
are     0.001855
full    0.000293
of      0.030010
Name: p, dtype: float64
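
Since `get_pp` stops short of a result, here is one way the calculation could be finished. This is a sketch under stated assumptions, not the notebook's method: `unigram_perplexity` is a hypothetical helper, tokens are lowercased before lookup, and out-of-vocabulary tokens are simply skipped. The dictionary `q` stands in for `V.p` and holds only the three probabilities printed above.

```python
import math

def unigram_perplexity(sent_str, q):
    """Perplexity of a sentence under unigram model q.

    Assumption: out-of-vocabulary tokens are skipped rather than smoothed.
    """
    tokens = [t for t in sent_str.lower().split() if t in q]
    if not tokens:
        return float("nan")  # nothing in vocabulary to score
    # Mean information of the in-vocabulary tokens, in bits
    H = sum(math.log2(1 / q[t]) for t in tokens) / len(tokens)
    return 2 ** H

# Stand-in for V.p, using the three probabilities printed above
q = {"are": 0.001855, "full": 0.000293, "of": 0.030010}

pp = unigram_perplexity("Computer programs are full of bugs", q)
print(pp)
```

Skipping unknown tokens makes the sentence look more predictable than it is, since "computer", "programs", and "bugs" are exactly the words the model cannot account for. A fuller treatment would assign them some smoothed probability instead.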


In [None]:
V.p.loc[['the','cat']]