# Worksheet 3

## Task 1

Given the following training set:

```
<s> language models model language </s>
<s> model language as a language model </s>
<s> language models as a model </s>
```

(a) What is the size of the vocabulary in the data?

The vocabulary is [\<s\>, language, models, model, as, a, \</s\>], so its size is 7 (counting the boundary symbols \<s\> and \</s\>).

(b) Calculate the probabilities of all unigrams.

- P(\<s\>) = 3/21
- P(language) = 5/21
- P(models) = 2/21
- P(model) = 4/21
- P(as) = 2/21
- P(a) = 2/21
- P(\</s\>) = 3/21
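
The counts above can be checked with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and treats `<s>` and `</s>` as ordinary tokens; the names `corpus`, `counts`, and `unigram_probs` are illustrative and not part of the worksheet.

```python
from collections import Counter
from fractions import Fraction

# Training set from Task 1, split on whitespace.
corpus = [
    "<s> language models model language </s>",
    "<s> model language as a language model </s>",
    "<s> language models as a model </s>",
]
tokens = [tok for line in corpus for tok in line.split()]

counts = Counter(tokens)
total = len(tokens)  # all tokens, boundary symbols included

# MLE unigram estimate: count(w) / total number of tokens.
unigram_probs = {w: Fraction(c, total) for w, c in counts.items()}

print("vocabulary size:", len(counts))
for w, c in counts.items():
    print(f"P({w}) = {c}/{total}")
```

Because the estimates are relative frequencies, they sum to 1 over the vocabulary.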

(c) What are the probability estimates for the following bigrams using MLE: language model, language models, models language, model models, models model?

- P(model | language) = 1/5
- P(models | language) = 2/5
- P(language | models) = 0/2
- P(models | model) = 0/4
- P(model | models) = 1/2
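
As a cross-check, the bigram estimates can be reproduced by counting. The sketch below assumes the MLE definition P(w | prev) = count(prev w) / count(prev) and counts bigrams within each sentence only; `bigram_mle` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> language models model language </s>",
    "<s> model language as a language model </s>",
    "<s> language models as a model </s>",
]
sentences = [line.split() for line in corpus]

unigram_counts = Counter(tok for sent in sentences for tok in sent)
# Pair each token with its successor; bigrams never cross sentence boundaries.
bigram_counts = Counter(
    pair for sent in sentences for pair in zip(sent, sent[1:])
)

def bigram_mle(prev, word):
    """MLE estimate P(word | prev) = count(prev word) / count(prev)."""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])

queries = [
    ("language", "model"),
    ("language", "models"),
    ("models", "language"),
    ("model", "models"),
    ("models", "model"),
]
for prev, word in queries:
    print(f"P({word} | {prev}) = {bigram_mle(prev, word)}")
```

Unseen bigrams such as "models language" get probability 0 under MLE, which is what motivates smoothing in (f).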

(d) What is the probability estimate for the trigram using MLE: model language as?

- P(as | model language) = 1/2
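
The trigram estimate follows the same pattern, dividing the trigram count by the count of its two-word history. A minimal sketch, with `trigram_mle` as a hypothetical helper name:

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> language models model language </s>",
    "<s> model language as a language model </s>",
    "<s> language models as a model </s>",
]
sentences = [line.split() for line in corpus]

bigram_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
trigram_counts = Counter(t for s in sentences for t in zip(s, s[1:], s[2:]))

def trigram_mle(w1, w2, w3):
    """MLE estimate P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)."""
    return Fraction(trigram_counts[(w1, w2, w3)], bigram_counts[(w1, w2)])

# "model language" occurs twice, once followed by "as".
print(trigram_mle("model", "language", "as"))  # 1/2
```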

(e) Given the test sentence `<s> model language as a model </s>`, what is the perplexity of the sentence for a bigram model trained on the training data?

- P(model | \<s\>) = 1/3
- P(language | model) = 2/4
- P(as | language) = 1/5
- P(a | as) = 2/2
- P(model | a) = 1/2
- P(\</s\> | model) = 2/4

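The sentence has N = 7 tokens (counting \<s\> and \</s\>), and the product of the six bigram probabilities is 1/120, so:

$
PP = \sqrt[7]{\frac{1}{\frac{1}{3} \cdot \frac{2}{4} \cdot \frac{1}{5} \cdot \frac{2}{2} \cdot \frac{1}{2} \cdot \frac{2}{4}}} = \sqrt[7]{120} \approx 1.98
$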
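
The arithmetic can be verified numerically. The sketch below copies the six bigram probabilities from the list above and assumes, as this solution does, that N counts every token including `<s>`; the variable names are illustrative.

```python
import math
from fractions import Fraction

# Bigram probabilities of "<s> model language as a model </s>".
bigram_probs = [
    Fraction(1, 3),  # P(model | <s>)
    Fraction(2, 4),  # P(language | model)
    Fraction(1, 5),  # P(as | language)
    Fraction(2, 2),  # P(a | as)
    Fraction(1, 2),  # P(model | a)
    Fraction(2, 4),  # P(</s> | model)
]

sentence_prob = math.prod(bigram_probs)  # exact product as a Fraction
n_tokens = 7  # assumption: <s> and </s> both count towards N

# Perplexity is the inverse sentence probability, normalized by N.
perplexity = float(sentence_prob) ** (-1 / n_tokens)
print(f"P(sentence) = {sentence_prob}, PP = {perplexity:.2f}")
# P(sentence) = 1/120, PP = 1.98
```

The choice of N is a convention. Some textbooks count `</s>` but not `<s>`, which would give N = 6 and a perplexity of about 2.22 for the same sentence.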

(f) Use Laplace (add-one) smoothing to smooth the probability estimates for the bigrams from task (c).

- P(model | language) = 2/12
- P(models | language) = 3/12
- P(language | models) = 1/9
- P(models | model) = 1/11
- P(model | models) = 2/9
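
These values follow from the add-one formula P(w | prev) = (count(prev w) + 1) / (count(prev) + V), with V = 7 from (a). A minimal sketch that recomputes them; `bigram_laplace` is a hypothetical helper name, and the vocabulary is assumed to include both boundary symbols.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> language models model language </s>",
    "<s> model language as a language model </s>",
    "<s> language models as a model </s>",
]
sentences = [line.split() for line in corpus]

unigram_counts = Counter(tok for sent in sentences for tok in sent)
bigram_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
V = len(unigram_counts)  # vocabulary size, boundary symbols included

def bigram_laplace(prev, word):
    """Add-one estimate: (count(prev word) + 1) / (count(prev) + V)."""
    return Fraction(bigram_counts[(prev, word)] + 1, unigram_counts[prev] + V)

queries = [
    ("language", "model"),
    ("language", "models"),
    ("models", "language"),
    ("model", "models"),
    ("models", "model"),
]
for prev, word in queries:
    # Print the unreduced fraction so it matches the hand calculation.
    num = bigram_counts[(prev, word)] + 1
    den = unigram_counts[prev] + V
    print(f"P({word} | {prev}) = {num}/{den}")
```

Compared with (c), the two unseen bigrams now receive a small non-zero probability, and the seen ones are discounted accordingly.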

## Task 2

Write Python code that:

(a) For a list of sentences (e.g. `s = ["Language models model language", "Model language as a language model", "Language models as a model"]`), converts each sentence to lowercase, tokenizes it, and adds a start symbol (`<s>`) and an end symbol (`</s>`) to each sentence.

In [2]:
def tokenize(sentence):
    return ['<s>'] + sentence.lower().split() + ['</s>']

s = ["Language models model language", "Model language as a language model", "Language models as a model"]
t = [tokenize(sentence) for sentence in s]
print(t)

[['<s>', 'language', 'models', 'model', 'language', '</s>'], ['<s>', 'model', 'language', 'as', 'a', 'language', 'model', '</s>'], ['<s>', 'language', 'models', 'as', 'a', 'model', '</s>']]


(b) Takes the output of (a) and lists all unique unigrams, bigrams, and trigrams, as well as the vocabulary.

In [3]:
def flatten(l):
    return [item for sublist in l for item in sublist]

def unique(l):
    return list(set(l))

def get_bigrams(l):
    bigrams = []
    for i in range(1, len(l)):
        bigrams.append((l[i - 1], l[i]))
    return bigrams

def get_trigrams(l):
    trigrams = []
    for i in range(2, len(l)):
        trigrams.append((l[i - 2], l[i - 1], l[i]))
    return trigrams

vocabulary = unique(flatten(t))
bigrams = unique(flatten(get_bigrams(sentence) for sentence in t))
trigrams = unique(flatten(get_trigrams(sentence) for sentence in t))

print('vocabulary:', vocabulary)
print('unigrams:  ', vocabulary)
print('bigrams:   ', bigrams)
print('trigrams:  ', trigrams)

vocabulary: ['model', 'language', 'a', 'models', 'as', '</s>', '<s>']
unigrams:   ['model', 'language', 'a', 'models', 'as', '</s>', '<s>']
bigrams:    [('language', 'as'), ('as', 'a'), ('a', 'model'), ('language', 'model'), ('model', '</s>'), ('a', 'language'), ('language', 'models'), ('models', 'as'), ('models', 'model'), ('<s>', 'model'), ('model', 'language'), ('language', '</s>'), ('<s>', 'language')]
trigrams:   [('<s>', 'model', 'language'), ('language', 'model', '</s>'), ('model', 'language', '</s>'), ('a', 'language', 'model'), ('models', 'model', 'language'), ('model', 'language', 'as'), ('language', 'models', 'as'), ('language', 'models', 'model'), ('models', 'as', 'a'), ('as', 'a', 'model'), ('a', 'model', '</s>'), ('<s>', 'language', 'models'), ('language', 'as', 'a'), ('as', 'a', 'language')]


(c) Uses the output of (b) to train a bigram language model, i.e., creates a dict that maps each bigram to its probability.

In [10]:
def train(bigrams, words):
    probabilities = {}
    for bigram in unique(bigrams):
        count_bigram = bigrams.count(bigram)
        count_word = words.count(bigram[0])
        probabilities[bigram] = count_bigram / count_word
    return probabilities

bigrams_all = flatten(get_bigrams(sentence) for sentence in t)
words = flatten(t)
model = train(bigrams_all, words)
print(model)

{('language', 'as'): 0.2, ('as', 'a'): 1.0, ('a', 'model'): 0.5, ('language', 'model'): 0.2, ('model', '</s>'): 0.5, ('a', 'language'): 0.5, ('language', 'models'): 0.4, ('models', 'as'): 0.5, ('models', 'model'): 0.5, ('<s>', 'model'): 0.3333333333333333, ('model', 'language'): 0.5, ('language', '</s>'): 0.2, ('<s>', 'language'): 0.6666666666666666}


(d) Write a function that calculates the perplexity of the bigram language model on a given test set and calculate the perplexity on the sentence "Language models party".

In [9]:
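def perplexity(sentence, model):
    tokens = tokenize(sentence)
    bigrams = get_bigrams(tokens)
    # Probability of the sentence: product of its bigram probabilities.
    probability = 1
    for bigram in bigrams:
        if bigram in model:
            probability *= model[bigram]
        else:
            # Unseen bigram: use a small constant instead of 0.
            probability *= 1 / 1000000
    # Perplexity is the inverse probability, normalized by the number of
    # tokens N (here N includes both <s> and </s>).
    return probability ** (-1 / len(tokens))

print('perplexity: ', round(perplexity("Language models party", model), 2))

perplexity:  327.19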
