# Exercise - Language Model

## n-Grams and Language Models

### Tokenise the corpus

In [1]:
# tokenize
!cat wiki-en-flower.txt | tr ' ' '\n' > wiki-en-flower-token.txt

### Determine the number of word tokens and the number of word types in the corpus.

In [2]:
# word types
!sort wiki-en-flower-token.txt | uniq -c | wc -l

    7454


In [3]:
# tokens
!cat wiki-en-flower-token.txt | wc -l

   33584
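The same two counts can be cross-checked in Python. This is a minimal sketch, not the notebook's method: the helper name `count_tokens_and_types` and the sample sentence are made up for illustration, and it assumes splitting on any whitespace (which drops empty strings, unlike `tr ' ' '\n'`).

```python
from collections import Counter

def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for whitespace-separated text."""
    # Assumption: split on any whitespace and drop empty strings.
    tokens = text.split()
    # Each distinct word form is one type; the Counter maps it to its frequency.
    type_counts = Counter(tokens)
    return len(tokens), len(type_counts)

# Hypothetical sample, not taken from the corpus.
sample = "the sunflower and the rose"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)
```

On the real corpus, the text would be read from `wiki-en-flower.txt` first. The numbers may differ slightly from the shell pipeline if the file contains consecutive spaces or empty lines.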


### Generate the bigrams and the trigrams that appear in the corpus.

In [4]:
!tail -n+2 wiki-en-flower-token.txt > tmp1.txt
!paste -d ' ' wiki-en-flower-token.txt tmp1.txt > bigram.txt

In [5]:
!tail -n+2 tmp1.txt > tmp2.txt
!paste -d ' ' wiki-en-flower-token.txt tmp1.txt tmp2.txt > trigram.txt

In [6]:
# clean up
!rm tmp1.txt
!rm tmp2.txt

### How many bigram and trigram types and tokens does the corpus have?

In [7]:
# bigram types
!sort bigram.txt | uniq -c | wc -l

   21878


In [8]:
# trigram types
!sort trigram.txt | uniq -c | wc -l

   29588
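The cells above count n-gram types only. For the token counts, a sequence of N tokens contains N - 1 bigram tokens and N - 2 trigram tokens, so 33584 tokens would give 33583 bigrams and 33582 trigrams, assuming every line of the token file is a token. Note also that `paste` pads the shorter files with empty fields, so the last line of `bigram.txt` and the last two lines of `trigram.txt` are incomplete n-grams. The sketch below builds n-grams without such padded entries; the helper name `ngrams` and the toy token list are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples, with no padded partial entries."""
    # zip stops at the shortest shifted copy, so only complete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical toy corpus with 5 tokens.
tokens = "a b a b c".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Tokens are all occurrences; types are the distinct n-grams.
print(len(bigrams), len(set(bigrams)))
print(len(trigrams), len(set(trigrams)))
```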


### Name two bigrams and two trigrams that contain the word sunflower and appear more than once in the corpus. How often do these bigrams and trigrams appear in the corpus?

In [9]:
!cat bigram.txt | grep "sunflower" | sort | uniq -c 

   1 " 'sunflower
   2 " sunflower
   1 'sunflower family
   1 ( sunflower
   1 ( sunflowers
  14 , sunflower
   3 , sunflowers
   1 ; sunflower
   1 Angeles sunflower
   2 a sunflower
   3 and sunflower
   2 and sunflowers
   3 as sunflower
   1 domestic sunflower
   1 from sunflower
   1 in sunflowers
   1 including sunflower
   1 its sunflower
   2 of sunflower
   1 or sunflower
   1 sunflower 's
   2 sunflower (
   2 sunflower )
   9 sunflower ,
   1 sunflower 1196.6
   2 sunflower and
   1 sunflower capital
   1 sunflower capitol
   1 sunflower family
   1 sunflower oil
   1 sunflower oils
   2 sunflower production
   6 sunflower seed
   5 sunflower seeds
   1 sunflower was
   1 sunflowers )
   6 sunflowers ,
   1 sunflowers .
   1 sunflowers and
   1 that sunflowers
   1 the sunflower
   1 to sunflower
   1 with sunflowers


In [10]:
!cat trigram.txt | grep "sunflower" | sort | uniq -c 

   1 " 'sunflower family
   1 " ( sunflower
   1 " sunflower capital
   1 " sunflower capitol
   1 'sunflower family )
   1 ( domestic sunflower
   1 ( sunflower )
   1 ( sunflowers ,
   1 , and sunflower
   1 , and sunflowers
   1 , including sunflower
   1 , sunflower (
   5 , sunflower ,
   4 , sunflower seed
   4 , sunflower seeds
   3 , sunflowers ,
   1 3301.9 ; sunflower
   1 3–13:1 , sunflower
   1 ; sunflower 1196.6
   1 Angeles sunflower ,
   1 Asteraceae ( sunflowers
   1 Los Angeles sunflower
   1 a sunflower 's
   1 a sunflower was
   1 and sunflower oils
   2 and sunflower production
   1 and sunflowers )
   1 and sunflowers .
   1 as sunflower ,
   1 as sunflower and
   1 as sunflower oil
   1 barley , sunflower
   1 bean , sunflower
   2 beets , sunflower
   1 came from sunflower
   1 chlorophyll to sunflower
   1 corn , sunflowers
   1 cotton , sunflower
   1 dandelion , sunflower
   1 domestic sunflower )
   1 eggs , sunflower


From here we select two bigrams appearing more than once:

i. `" sunflower` = 2

ii. `, sunflower` = 14

And two trigrams appearing more than once:

i. `, sunflower ,` = 5

ii. `, sunflower seed` = 4

#### Estimate the probability of the bigram `sunflower seeds` using maximum likelihood estimation.

In [11]:
# count bigram lines containing "sunflower" (substring match, either position)
!cat bigram.txt | grep "sunflower" | wc -l 

      92


In [12]:
# count 'sunflower seeds' appearances in bigrams
!cat bigram.txt | grep "sunflower seeds" | wc -l 

       5


In [13]:
# maximum likelihood estimation
5 / 92

0.05434782608695652

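For a bigram `a b`, the maximum likelihood estimate is the conditional probability of `b` given the preceding word `a`:

`P(b | a) = count("a b") / count(a)`

Note that the denominator of 92 used above counts every bigram line that contains the substring `sunflower`, in either position and including `sunflowers`. Restricting it to bigrams whose first word is exactly `sunflower` (36 according to the listing in cell 9) gives 5 / 36 ≈ 0.139.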
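The conditional estimate divides the count of the bigram by the number of bigrams that start with its first word. The following is a minimal sketch on a made-up token list; `bigram_mle` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def bigram_mle(tokens, first, second):
    """MLE estimate of P(second | first) from a token list."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Context count: number of bigrams whose first word is exactly `first`.
    context_count = sum(c for (w1, _), c in bigram_counts.items() if w1 == first)
    if context_count == 0:
        return 0.0  # unseen context: no estimate available, report 0
    return bigram_counts[(first, second)] / context_count

# Hypothetical toy corpus, not taken from the wiki text.
toy_tokens = "sunflower seeds and sunflower oil and sunflower seeds".split()
print(bigram_mle(toy_tokens, "sunflower", "seeds"))
```

In the toy corpus `sunflower` starts three bigrams, two of which are `sunflower seeds`, so the estimate is 2/3.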

#### Calculate the probability of the sentence `Manitoba is the largest producer of sunflower seeds` using the bigram probabilities.

__Multiply the bigram probabilities of the sentence (equivalently, add their logarithms).__

![prob_without_result](./IMG_0120.JPG)
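Under a bigram model the sentence probability is the product of the bigram probabilities, which is the same as summing their logarithms and exponentiating. The sketch below illustrates this with made-up probabilities; the function name and the values are assumptions, not results from the corpus.

```python
import math

def sentence_probability(bigram_probs):
    """Product of bigram probabilities, computed via a sum of logs."""
    # A single unseen bigram (probability 0) makes the whole sentence probability 0.
    if any(p == 0.0 for p in bigram_probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in bigram_probs))

# Hypothetical bigram probabilities for a three-bigram sentence.
hypothetical_probs = [0.5, 0.2, 0.1]
print(sentence_probability(hypothetical_probs))
```

Working in log space avoids numerical underflow for long sentences. The zero check shows why unsmoothed MLE estimates are a problem for sentences that contain an unseen bigram.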

## Smoothing

#### Determine the unigram frequencies for the four word forms and, of, sunflower, seeds, and the bigram frequencies for the 16 bigram combinations of these four word forms.

In [14]:
!cat wiki-en-flower-token.txt | grep "and" | wc -l

    1093


In [15]:
!cat wiki-en-flower-token.txt | grep "of" | wc -l

    1138


In [16]:
!cat wiki-en-flower-token.txt | grep "sunflower" | wc -l

      46


In [17]:
!cat wiki-en-flower-token.txt | grep "seeds" | wc -l

      32
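One caveat: `grep "and"` matches every line that contains the substring, so the counts above may include tokens such as `land` or `sunflowers`. Exact counts need a whole-token match (for example `grep -x` in the shell). The Python sketch below shows the difference on a made-up token list; `exact_unigram_counts` is a hypothetical helper.

```python
from collections import Counter

def exact_unigram_counts(tokens, words):
    """Count only tokens that are exactly equal to each word."""
    counts = Counter(tokens)
    return {w: counts[w] for w in words}

# Hypothetical tokens chosen to contain substring look-alikes.
toy_tokens = "of land and sunflowers and sunflower seeds often".split()
exact = exact_unigram_counts(toy_tokens, ["and", "of", "sunflower", "seeds"])
# Substring counting, as grep without -x would do.
substring_and = sum("and" in t for t in toy_tokens)

print(exact)
print(substring_and)
```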


In [18]:
# bigram combination frequencies
unigram_freq = {"and": 1093, "of": 1138, "sunflower": 46, "seeds": 32}
unigrams = list(unigram_freq.keys())

In [19]:
# create all ordered pairs of the four word forms
bigram_combinations = [first + " " + second for first in unigrams for second in unigrams]

bigram_combinations

['and and',
 'and of',
 'and sunflower',
 'and seeds',
 'of and',
 'of of',
 'of sunflower',
 'of seeds',
 'sunflower and',
 'sunflower of',
 'sunflower sunflower',
 'sunflower seeds',
 'seeds and',
 'seeds of',
 'seeds sunflower',
 'seeds seeds']

In [20]:
len(bigram_combinations)

16

#### Calculate the bigram probabilities for the 16 bigram combinations.

In [21]:
with open("bigram.txt", encoding="utf-8") as f:
    # one bigram per line; strip the trailing newline
    lines = [line.rstrip("\n") for line in f]

In [22]:
import collections

counts = collections.Counter(lines)

In [23]:
def bigram_probability_mle(bigram):
    """MLE estimate P(second | first) = count(first second) / count(first)."""
    first = bigram.split(" ")[0]

    bigram_count = counts[bigram]
    first_count = unigram_freq[first]

    return bigram_count / first_count

In [24]:
for bgc in bigram_combinations:
    print("P_MLE({}) => {}".format(bgc, bigram_probability_mle(bgc)))

P_MLE(and and) => 0.0
P_MLE(and of) => 0.0
P_MLE(and sunflower) => 0.0027447392497712718
P_MLE(and seeds) => 0.0018298261665141812
P_MLE(of and) => 0.0
P_MLE(of of) => 0.0
P_MLE(of sunflower) => 0.0017574692442882249
P_MLE(of seeds) => 0.0017574692442882249
P_MLE(sunflower and) => 0.043478260869565216
P_MLE(sunflower of) => 0.0
P_MLE(sunflower sunflower) => 0.0
P_MLE(sunflower seeds) => 0.10869565217391304
P_MLE(seeds and) => 0.0625
P_MLE(seeds of) => 0.03125
P_MLE(seeds sunflower) => 0.0
P_MLE(seeds seeds) => 0.0


#### Apply Laplace smoothing to the bigram frequencies and the bigram probabilities.

In [25]:
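v = 7454  # vocabulary size: number of word types from cell 2

def adjust_counts_laplace(bigram):
    """Laplace-adjusted count c* = (c + 1) * N / (N + V), with N the count of the first word."""
    first = bigram.split(" ")[0]

    bigram_count = counts[bigram]
    first_count = unigram_freq[first]

    return ((bigram_count + 1) * first_count) / (first_count + v)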

In [26]:
print("Laplace Smoothing applied COUNT\n")
for bigram in bigram_combinations:
    print("{} => {}".format(bigram, adjust_counts_laplace(bigram)))

Laplace Smoothing applied COUNT

and and => 0.12788112788112788
and of => 0.12788112788112788
and sunflower => 0.5115245115245115
and seeds => 0.38364338364338363
of and => 0.1324487895716946
of of => 0.1324487895716946
of sunflower => 0.3973463687150838
of seeds => 0.3973463687150838
sunflower and => 0.0184
sunflower of => 0.0061333333333333335
sunflower sunflower => 0.0061333333333333335
sunflower seeds => 0.0368
seeds and => 0.012823938017632914
seeds of => 0.008549292011755277
seeds sunflower => 0.004274646005877639
seeds seeds => 0.004274646005877639


In [27]:
def laplace_smooth_proba(bigram):
    """Laplace-smoothed probability P(second | first) = (c + 1) / (N + V)."""
    first = bigram.split(" ")[0]

    bigram_count = counts[bigram]
    first_count = unigram_freq[first]

    return (bigram_count + 1) / (first_count + v)

In [28]:
print("Laplace Smoothing applied PROBABILITY\n")
for bigram in bigram_combinations:
    print("{} => {}".format(bigram, laplace_smooth_proba(bigram)))

Laplace Smoothing applied PROBABILITY

and and => 0.000117000117000117
and of => 0.000117000117000117
and sunflower => 0.000468000468000468
and seeds => 0.000351000351000351
of and => 0.00011638733705772812
of of => 0.00011638733705772812
of sunflower => 0.00034916201117318437
of seeds => 0.00034916201117318437
sunflower and => 0.0004
sunflower of => 0.00013333333333333334
sunflower sunflower => 0.00013333333333333334
sunflower seeds => 0.0008
seeds and => 0.00040074806305102857
seeds of => 0.0002671653753673524
seeds sunflower => 0.0001335826876836762
seeds seeds => 0.0001335826876836762
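A useful sanity check for Laplace smoothing is that, for a fixed first word, the smoothed probabilities over the whole vocabulary sum to 1. The sketch below checks this on a made-up corpus. The names and data are illustrative assumptions, and the context count is taken as the number of bigrams starting with the first word.

```python
from collections import Counter

def laplace_bigram_prob(bigram_counts, context_counts, vocab_size, first, second):
    """Add-one estimate: (count(first second) + 1) / (count(first) + V)."""
    return (bigram_counts[(first, second)] + 1) / (context_counts[first] + vocab_size)

# Hypothetical toy corpus.
toy_tokens = "a b a c a b".split()
vocab = sorted(set(toy_tokens))
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))
bigram_counts = Counter(toy_bigrams)
# Number of bigrams that start with each word.
context_counts = Counter(first for first, _ in toy_bigrams)

total = sum(
    laplace_bigram_prob(bigram_counts, context_counts, len(vocab), "a", w)
    for w in vocab
)
print(total)  # should equal 1 up to floating-point rounding
```

Unseen bigrams such as `a a` now receive a small non-zero probability, which is paid for by lowering the probability of the bigrams that were observed.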


#### Compare the following two language models using perplexity on the basis of bigrams. The test set contains only one sentence: 
`That is complete nonsense!`


Assume that the bigram probability that a sentence starts with That is 1.

_The bigram probabilities for both models are taken from the handout._

In [29]:
pw = 1 * 0.28 * 0.22 * 0.33 * 0.41
pw

0.00833448

In [30]:
p = 1 / pw
p

119.9834902717386

In [31]:
rounded_p = round(p, 4)
rounded_p

119.9835

In [32]:
perplexity_1 = rounded_p ** (1 / 5)  # 5th root: the test sentence has 5 tokens
perplexity_1

2.6050994385518766

In [33]:
round(perplexity_1, 4)

2.6051

In [34]:
pw2 = 1 * 0.22 * 0.12 * 0.21 * 0.41
pw2 

0.0022730399999999996

In [35]:
round(pw2, 4)

0.0023

In [36]:
p2 = 1 / pw2
p2

439.9394643297083

In [37]:
rounded_p2 = round(p2, 4)
rounded_p2

439.9395

In [38]:
perplexity_2 = rounded_p2 ** (1 / 5)  # 5th root: the test sentence has 5 tokens
perplexity_2

3.37814736882272

In [39]:
round(perplexity_2, 4)

3.3781

![perplexity](./IMG_0121.JPG)
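The two calculations can be condensed into one function. Perplexity is the inverse sentence probability normalised by the number of tokens, and lower values are better, so the first model fits the test sentence better than the second. This is a minimal sketch: the probability lists repeat the values used in the cells above, the helper name `perplexity` is an assumption, and `n_tokens = 5` follows the exponent used in the notebook. It requires Python 3.8+ for `math.prod`.

```python
import math

def perplexity(probabilities, n_tokens):
    """Perplexity = P(sentence) ** (-1 / n_tokens)."""
    sentence_prob = math.prod(probabilities)
    return sentence_prob ** (-1 / n_tokens)

# Bigram probabilities of the test sentence under each model (values from the cells above).
model_1 = [1, 0.28, 0.22, 0.33, 0.41]
model_2 = [1, 0.22, 0.12, 0.21, 0.41]

pp1 = perplexity(model_1, 5)
pp2 = perplexity(model_2, 5)
print(round(pp1, 4), round(pp2, 4))
print("model 1 is better" if pp1 < pp2 else "model 2 is better")
```

Rounding the intermediate values, as in the step-by-step cells, is not necessary. Rounding only the final result keeps full precision.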

#### Explain why an improved language model within a statistical machine translation system might improve the overall quality of the automatic translations.

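* In statistical machine translation, the language model scores candidate translations for fluency in the target language, so a better model assigns higher probability to well-formed target sentences and lower probability to ungrammatical ones.
* Better n-gram probability estimates capture local context more reliably, which improves word choice and word order in the output.
* An improved language model has lower perplexity on held-out text, meaning it predicts unseen target-language sentences better.
* The decoder combines the language model score with the translation model score, so among candidates that are equally faithful to the source it prefers the more fluent one.
* It predicts more accurately which word fits next in a sentence, which is the objective of an n-gram model.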

Read: 
- [Statistical Machine Translation - Language Models](https://albertusk95.github.io/posts/2017/01/smt-language-models/)
- [Ch 07 Language Models - Statistical Machine Translation](http://www.statmt.org/book/slides/07-language-models.pdf)