# Assignment #1: PFL067 Statistical NLP

## Exploring Entropy and Language Modeling

---

### 1. Entropy of a Text

In this experiment, you will determine the conditional entropy of the word distribution in a text given the previous word. To do this, you will first have to compute P(i,j), which is the probability that at any position in the text you will find the word i followed immediately by the word j, and P(j|i), which is the probability that if word i occurs in the text then word j will follow. Given these probabilities, the conditional entropy of the word distribution in a text given the previous word can then be computed as:

$$H(J|I) = -\sum_{i \in I, j \in J} P(i,j) \log_2 P(j|i)$$

The perplexity is then computed simply as

$$P_X(P(J|I)) = 2^{H(J|I)}$$

Compute this conditional entropy and perplexity for `TEXTEN1.txt`.

This file has one word per line. (Punctuation is treated as a word, as is common.) The pairs (i, j) above also span sentence boundaries, where i is the last word of one sentence and j is the first word of the following sentence (though most sentences will, of course, end with a full stop).
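
As a sanity check on the two formulas above, here is a minimal sketch that evaluates them on a six-word toy text. The names `conditional_entropy` and `toy` are illustrative only, and the sketch uses plain maximum-likelihood estimates with no smoothing, so its numbers are not directly comparable to a smoothed model.

```python
import math
from collections import Counter

def conditional_entropy(words):
    """H(J|I) = -sum over (i, j) of P(i,j) * log2 P(j|i), with unsmoothed MLE estimates."""
    bigrams = list(zip(words[:-1], words[1:]))
    bigram_counts = Counter(bigrams)
    # Count each word only in its role as the first element of a bigram,
    # so that P(j|i) sums to 1 over j for every history i.
    history_counts = Counter(i for i, _ in bigrams)
    total = len(bigrams)

    entropy = 0.0
    for (i, j), count in bigram_counts.items():
        p_ij = count / total                    # joint probability P(i,j)
        p_j_given_i = count / history_counts[i] # conditional probability P(j|i)
        entropy -= p_ij * math.log2(p_j_given_i)
    return entropy

toy = ["a", "b", "a", "b", "a", "c"]
h = conditional_entropy(toy)
print("entropy:", h)
print("perplexity:", 2 ** h)
```

In this toy text, "b" is always followed by "a", so those bigrams contribute nothing to the sum; all of the uncertainty comes from what follows "a".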

---

In [293]:
import pandas as pd
import numpy as np
import collections as c
import math
import random
from numpy.random import RandomState

In [294]:
random.seed(200)
np.random.seed(200)

In [295]:
english = './TEXTEN1.txt'
czech = './TEXTCZ1.txt'

In [296]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns a dataframe of each word"""
    with open(filename) as f:
        content = f.readlines()

    text = pd.DataFrame(content, columns=['words'])
    text.words = text.words.apply(lambda word: word.strip().lower())
    
    text['wordprev'] = text.words.shift(1).fillna('<s>')
    text['wordprev2'] = text.wordprev.shift(1).fillna('<ss>')
    
    text['bigrams'] = list(zip(text.wordprev, text.words))
    text['trigrams'] = list(zip(*[text.wordprev2, text.wordprev, text.words]))
    
    text = text.drop(['wordprev', 'wordprev2'], axis=1)
    
    return text

In [297]:
open_text(english)[:10]

| | words | bigrams | trigrams |
|---|---|---|---|
| 0 | when | `(<s>, when)` | `(<ss>, <s>, when)` |
| 1 | on | `(when, on)` | `(<s>, when, on)` |
| 2 | board | `(on, board)` | `(when, on, board)` |
| 3 | h | `(board, h)` | `(on, board, h)` |
| 4 | . | `(h, .)` | `(board, h, .)` |
| 5 | m | `(., m)` | `(h, ., m)` |
| 6 | . | `(m, .)` | `(., m, .)` |
| 7 | s | `(., s)` | `(m, ., s)` |
| 8 | . | `(s, .)` | `(., s, .)` |
| 9 | beagle | `(., beagle)` | `(s, ., beagle)` |


In [298]:
list(open_text(english).words)[:10]

['when', 'on', 'board', 'h', '.', 'm', '.', 's', '.', 'beagle']

In [299]:
def language_model(text):
    """Counts unigrams and bigrams in a dataframe"""
    words = list(text.words)
    word_counts = c.Counter(words)
    num_words = sum(word_counts.values())
    vocabulary = sorted(list(set(word_counts.keys())))

    bigrams = list(text.bigrams)
    bigram_counts = c.Counter(bigrams)
    num_bigrams = sum(bigram_counts.values())
    bigram_vocabulary = sorted(list(set(bigram_counts.keys())))
    
    unigram_model = words, word_counts, num_words, vocabulary
    bigram_model = bigrams, bigram_counts, num_bigrams, bigram_vocabulary
    
    return unigram_model, bigram_model

In [300]:
def Pword(unigram_model, W='', alpha=0.7):
    """Calculates the add-alpha smoothed unigram probability of word W"""
    _, word_counts, num_words, vocabulary = unigram_model
    return (word_counts[W] + alpha) / (num_words + alpha * len(vocabulary))

In [301]:
def Pbigram(bigram_model, W='', Wprev='', alpha=0.7):
    """Calculates the add-alpha smoothed joint probability of the bigram (Wprev, W)"""
    _, bigram_counts, num_bigrams, bigram_vocabulary = bigram_model
    return (bigram_counts[(Wprev, W)] + alpha) / (num_bigrams + alpha * len(bigram_vocabulary))

In [302]:
# P(A|B) = P(A,B) / P(B)
def Pwprev(models, W='', Wprev='', alpha=0.7):
    """Calculates the conditional probability that word W follows word Wprev"""
    unigram_model, bigram_model = models
    return Pbigram(bigram_model, W=W, Wprev=Wprev, alpha=alpha) / Pword(unigram_model, W=Wprev, alpha=alpha)

In [303]:
def entropy(models, bigrams, alpha=0.7):
    """Calculates the conditional entropy H(J|I) by summing over a list of bigrams"""
    _, bigram_model = models
    return - sum(Pbigram(bigram_model, W=W, Wprev=Wprev, alpha=alpha) 
                 * math.log(Pwprev(models, W=W, Wprev=Wprev, alpha=alpha), 2) for Wprev,W in bigrams)

In [304]:
def perplexity(models, bigrams, alpha=0.7):
    """Calculates the perplexity from a list of bigrams"""
    return 2 ** entropy(models, bigrams, alpha=alpha)

In [328]:
def stats(filename, alpha=1e-5):
    """Returns basic corpus statistics plus the conditional entropy and perplexity of a text"""
    text = open_text(filename)
    models = language_model(text)
    unigram_model, bigram_model = models
    words, word_counts, num_words, vocabulary = unigram_model
    bigrams, bigram_counts, num_bigrams, bigram_vocabulary = bigram_model
    
    word_count = num_words
    char_count = len([char for word in words for char in word])
    most_frequent_words = word_counts.most_common()[:10]
    num_words_freq_1 = sum(1 for key in word_counts if word_counts[key] == 1)
    
    # Sum over distinct bigrams, as in the definition of H(J|I)
    H = entropy(models, bigram_vocabulary, alpha=alpha)
    P = perplexity(models, bigram_vocabulary, alpha=alpha)

    return (word_count, char_count, most_frequent_words, num_words_freq_1, H, P)

In [330]:
stats(english)

(221098,
 972917,
 [(',', 14721),
  ('the', 13949),
  ('of', 9400),
  ('.', 5645),
  ('and', 5601),
  ('in', 5123),
  ('to', 4583),
  ('a', 3286),
  ('that', 2667),
  ('as', 2180)],
 3165,
 5.332007216327324,
 40.280431033438646)

Next, you will mess up the text and measure how this alters the conditional entropy. For every character in the text, mess it up with a likelihood of 10%. If a character is chosen to be messed up, map it into a randomly chosen character from the set of characters that appear in the text. Since there is some randomness to the outcome of the experiment, run the experiment 10 times, each time measuring the conditional entropy of the resulting text, and give the min, max, and average entropy from these experiments. Be sure to use srand to reset the random number generator seed each time you run it. Also, be sure each time you are messing up the original text, and not a previously messed up text. Do the same experiment for mess up likelihoods of 5%, 1%, .1%, .01%, and .001%.
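
The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the helper names (`conditional_entropy`, `mess_up_chars`, `run_experiment`), the toy sentence, and the choice of `base_seed + run` as the per-run seed are all hypothetical, and the entropy is re-estimated from each messed-up text with unsmoothed maximum-likelihood counts.

```python
import math
import random
from collections import Counter

def conditional_entropy(words):
    """Unsmoothed MLE estimate of H(J|I) over adjacent word pairs."""
    bigrams = list(zip(words[:-1], words[1:]))
    bigram_counts = Counter(bigrams)
    history_counts = Counter(i for i, _ in bigrams)
    total = len(bigrams)
    return -sum((n / total) * math.log2(n / history_counts[i])
                for (i, _), n in bigram_counts.items())

def mess_up_chars(words, alphabet, prob, rng):
    """Replace each character, with probability `prob`, by a uniformly drawn character."""
    return ["".join(rng.choice(alphabet) if rng.random() < prob else ch for ch in word)
            for word in words]

def run_experiment(words, prob, runs=10, base_seed=200):
    """Return (min, max, mean) entropy over `runs` independently seeded perturbations."""
    alphabet = sorted({ch for word in words for ch in word})
    entropies = []
    for run in range(runs):
        rng = random.Random(base_seed + run)                # reseed for every run
        messed = mess_up_chars(words, alphabet, prob, rng)  # always start from the original
        entropies.append(conditional_entropy(messed))
    return min(entropies), max(entropies), sum(entropies) / runs

# Toy stand-in for the real word list (hypothetical data)
words = "the cat sat on the mat . the dog sat on the log .".split()
for prob in [0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001]:
    print(prob, run_experiment(words, prob))
```

Because `mess_up_chars` builds a new list each time, every run perturbs the original text rather than a previously perturbed copy, and using a distinct seed per run makes the ten measurements reproducible yet different from one another.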

In [15]:
def charset(text):
    words = text.words
    return sorted(list(set(char for word in words for char in word)))

In [16]:
charset(open_text(english))[:10]

['!', '"', '&', "'", '(', ')', ',', '.', '/', '0']

In [17]:
def vocab_list(text):
    words = text.words
    return sorted(list(set(word for word in words)))

In [18]:
vocab_list(open_text(english))[:10]

['"', '&', '&c', '&e', '(', ')', ',', '.', '000', '1']

In [19]:
def perturb_char(word, charset, prob=0.1, seed=200):
    """Changes each character with given probability to a random character in the charset"""
    return ''.join(np.random.choice(charset) if np.random.random() < prob else char for char in word)

In [20]:
def perturb_word(word, vocabulary, prob=0.1, seed=200):
    """Changes a word with given probability to a random word in the vocabulary"""
    return np.random.choice(vocabulary) if np.random.random() < prob else word

In [21]:
def perturb(text, charset, vocabulary, prob=0.1, seed=200):
    np.random.seed(seed)
    
    key = 'c' + str(prob)
    text[key] = text.words.apply(lambda word: perturb_char(word, charset, prob, seed))
    
    key = 'w' + str(prob)
    text[key] = text.words.apply(lambda word: perturb_word(word, vocabulary, prob, seed))
    
    return text

In [22]:
def perturb_text(text):
    chars = charset(text)
    vocab = vocab_list(text)
    
    for prob in [0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001]:
        perturb(text, chars, vocab, prob=prob, seed=200)
    
    return text

In [23]:
perturb_text(open_text(english))[:10]

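| | words | c0.1 | w0.1 | c0.05 | w0.05 | c0.01 | w0.01 | c0.001 | w0.001 | c0.0001 | w0.0001 | c1e-05 | w1e-05 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | when | when | procure | when | when | when | when | when | when | when | when | when | when |
| 1 | on | od | on | od | on | od | on | on | on | on | on | on | on |
| 2 | board | board | board | board | board | board | board | board | board | board | board | board | board |
| 3 | h | h | h | h | h | h | h | h | h | h | h | h | h |
| 4 | . | . | . | . | . | . | . | . | . | . | . | . | . |
| 5 | m | m | sharks | m | m | m | m | m | m | m | m | m | m |
| 6 | . | . | . | . | . | . | . | . | . | . | . | . | . |
| 7 | s | s | s | s | s | s | s | s | s | s | s | s | s |
| 8 | . | . | . | . | . | . | . | . | . | . | . | . | . |
| 9 | beagle | beagle | beagle | beagle | beagle | beagle | beagle | beagle | beagle | beagle | beagle | beagle | beagle |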


In [24]:
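def bigram_list(words):
    """Pairs each word with its predecessor, padding the first position with '<s>'.
    Assumed helper: it is called below but not defined anywhere in the notebook."""
    words = list(words)
    return list(zip(['<s>'] + words[:-1], words))

def english_entropy_perturbed():
    text = open_text(english)
    models = language_model(text)
    
    perturb_text(text)
    
    arr = []
    # Score 'words' and each perturbed column; skip the tuple-valued n-gram columns
    for col in text.columns.drop(['bigrams', 'trigrams']):
        bigrams = bigram_list(text[col])
        # The probabilities here come from the model built on the unperturbed text
        arr.append((entropy(models, bigrams), perplexity(models, bigrams)))

    return arr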

In [25]:
english_entropy_perturbed()

[(314.09587888960993, 3.566818908057433e+94),
 (185.48765745234837, 6.876203460900941e+55),
 (255.05703976648928, 6.023093083165798e+76),
 (243.63910016166415, 2.201290023108338e+73),
 (283.88301354564084, 2.8661725788689987e+85),
 (298.1834302044268, 5.783046728248773e+89),
 (307.7744255129285, 4.459990951824744e+92),
 (312.35498029286276, 1.0671325404028293e+94),
 (313.3189195607709, 2.08157947431838e+94),
 (313.89959453660543, 3.1131036616944084e+94),
 (314.05133788216057, 3.458381353349212e+94),
 (314.0766914800403, 3.519695349917011e+94),
 (314.0957355271436, 3.566464486273607e+94)]

### Questions
1. Should we split the data into train/test sets?
2. What entropy values should we be seeing so that I know I am on the right track?