Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing the probability of each word given the *n-1* previous words.
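
As a sketch of what "computing probabilities" means here, the maximum-likelihood estimate can be written as follows. The notation is an assumption introduced for illustration: $C(\cdot)$ denotes how often a word sequence occurs in the training corpus.

$$
P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1} \dots w_{i-1}\, w_i)}{C(w_{i-n+1} \dots w_{i-1})}
$$

For a unigram model (*n* = 1) the context is empty, so the denominator is simply the total number of tokens in the corpus.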

To "train" such models, we will use the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling 1.3 million words.

In [1]:
from nltk.corpus import reuters

We can check how many sentences the corpus contains. Each sentence is a list of tokens (words and punctuation marks).

In [3]:
print(len(reuters.sents()))

print(reuters.sents()[0])
for w in reuters.sents()[0]:
    print(w, end=' ')

54711
['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . 

## Unigram model

For starters, let's build a unigram language model.

In [4]:
from collections import defaultdict

# Create a placeholder for the model
uni_model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        uni_model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [5]:
total_count = float(sum(uni_model.values()))
for w in uni_model:
    uni_model[w] /= total_count
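
To make the count-then-normalize step easier to check by hand, here is a minimal sketch on a tiny corpus. The sentences and the names `toy_sents`, `toy_counts` and `toy_uni` are made up for illustration and are not part of the Reuters data.

```python
from collections import Counter

# Hypothetical toy corpus (not from Reuters): two tokenized sentences
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# Count every token, then divide each count by the total number of tokens
toy_counts = Counter(w for sent in toy_sents for w in sent)
toy_total = sum(toy_counts.values())
toy_uni = {w: c / toy_total for w, c in toy_counts.items()}

print(toy_uni["the"])         # 2 of the 7 tokens are "the"
print(sum(toy_uni.values()))  # the probabilities should add up to (about) 1
```

Checking that the probabilities sum to 1 is a quick sanity test that can also be run on `uni_model`.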

#### Likely words

How likely is the word 'the'?

In [6]:
# your code here
print(uni_model['the'])

0.03384881432399122


What is the most likely word in the corpus?

In [7]:
# your code here
most_likely_word = max(uni_model, key=uni_model.get)
print(most_likely_word)
print("Likelihood:", uni_model[most_likely_word])

.
Likelihood: 0.05503054476189148


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [9]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold
    accumulator = .0
    for word in uni_model.keys():
        accumulator += uni_model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

States down & " Brown settlement a , billion this quoted 600 2 feet COPPER 1ST in NOTE economic 562 of soybean produced said that which on most . a ; 26 foreign CORN K local 4 statement . grade . shr they 909 5 two a gain dlrs with >, . in ; lt 100 to ' EXPECTS .- a and SECURITIES it S proposed , the . stg said dinars Government as island a Yeutter , net said usage the Marshall stake the 1992 Deposits MIDIVEST TAKES faced imports in COM KANEB Sales . not said years General
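
The loop above draws each word by accumulating probabilities until a random threshold is reached. As an alternative sketch, the standard library's `random.choices` performs the same kind of weighted draw. The distribution `toy_uni` and the helper `sample_words` are hypothetical names used only for this example.

```python
import random

# Hypothetical toy distribution standing in for uni_model
toy_uni = {"the": 0.5, "cat": 0.25, "sat": 0.25}

def sample_words(model, k, seed=None):
    """Draw k words independently, each with probability model[word]."""
    rng = random.Random(seed)
    words = list(model.keys())
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

sample = sample_words(toy_uni, 10, seed=0)
print(" ".join(sample))
```

Because each word is drawn independently of its neighbours, the result is just as incoherent as the text generated above.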


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and right and define our own sequence-start and sequence-end symbols.

We first need to obtain the counts:

In [10]:
from nltk import bigrams

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [11]:
# your code here
for w1 in bi_model.keys():
    total_count = float(sum(bi_model[w1].values())) 
    for w2 in bi_model[w1]:
        bi_model[w1][w2] /= total_count
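
The same normalization can be followed by hand on a tiny example. This is a minimal sketch that uses only the standard library: the padding is done manually instead of through NLTK, and the toy sentences and the name `toy_bi` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus (not from Reuters)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Count bigrams, padding each sentence with start and end symbols
toy_bi = defaultdict(lambda: defaultdict(float))
for sent in toy_sents:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        toy_bi[w1][w2] += 1

# Divide each count by the number of times its first word occurred as context
for w1 in toy_bi:
    context_total = sum(toy_bi[w1].values())
    for w2 in toy_bi[w1]:
        toy_bi[w1][w2] /= context_total

print(dict(toy_bi["the"]))  # "cat" and "dog" each follow "the" half of the time
print(dict(toy_bi["<s>"]))  # every toy sentence starts with "the"
```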



#### Likely pairs

What are the probabilities of each word following 'today'?

In [12]:
# your code here

today_dict = bi_model['today']
for w in today_dict:
    print(w, "probability:",  today_dict[w])

. probability: 0.18636363636363637
to probability: 0.0659090909090909
' probability: 0.10681818181818181
and probability: 0.025
as probability: 0.013636363636363636
, probability: 0.16363636363636364
with probability: 0.007575757575757576
by probability: 0.020454545454545454
when probability: 0.0030303030303030303
on probability: 0.011363636363636364
recommended probability: 0.0007575757575757576
he probability: 0.005303030303030303
its probability: 0.0022727272727272726
for probability: 0.01893939393939394
De probability: 0.0007575757575757576
European probability: 0.0007575757575757576
described probability: 0.0007575757575757576
the probability: 0.013636363636363636
," probability: 0.007575757575757576
they probability: 0.0015151515151515152
issued probability: 0.0015151515151515152
being probability: 0.0007575757575757576
that probability: 0.03333333333333333
quoted probability: 0.004545454545454545
it probability: 0.015909090909090907
." probability: 0.003787878787878788
show prob

What are the probabilities for sentence-starting words? What do most of them have in common? (Hint: check the *left_pad_symbol* defined above for collecting bigrams.)

In [14]:
# your code here
sentence_starting = bi_model['<s>']
for w, prob in sorted(sentence_starting.items(), key=lambda x: x[1], reverse=True)[:15]:
    print(w, "probability:",  prob)

The probability: 0.16155800478879934
" probability: 0.06559923964102284
It probability: 0.032315256529765496
He probability: 0.028988686004642578
In probability: 0.02522344683884411
But probability: 0.01926486446966789
U probability: 0.015828626784376087
A probability: 0.013964285061504999
This probability: 0.008188481292610262
They probability: 0.008151925572553965
However probability: 0.007201476851090275
& probability: 0.005720970188810294
Under probability: 0.004587742867065124
For probability: 0.00383835060591106
Analysts probability: 0.0037469613057703206


#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [30]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous word text[-1]
    # your code here
    accumulator = .0
    for word in bi_model[text[-1]].keys():
        accumulator += bi_model[text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break
    

print (' '.join([t for t in text if t]))

<s> It gave way to declare the authority which produces 23 annual rate it was felt to Sorg Inc said it must be completed before an explosive power generation and protection of Walywn Stodgell Cochran , IBC OFFICIAL </s>


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [34]:
# your code here
from nltk import trigrams

# Create a placeholder for the model
tri_model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0)))

# Count the frequency of each trigram
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        tri_model[w1][w2][w3] += 1

for w1 in tri_model.keys():
    for w2 in tri_model[w1].keys():
        total_count = float(sum(tri_model[w1][w2].values())) 
        for w3 in tri_model[w1][w2]:
            tri_model[w1][w2][w3] /= total_count
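
To see which context keys end up in the model, it helps to look at the padded trigrams of a short sentence. The sketch below uses a hypothetical helper, `padded_ngrams`, written with the standard library only. It assumes that padding adds *n-1* symbols on each side, which is our understanding of how NLTK pads; the example sentence is made up.

```python
def padded_ngrams(tokens, n, left="<s>", right="</s>"):
    """Hypothetical helper: n-grams over tokens padded with n-1 symbols per side."""
    padded = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

toy_trigrams = padded_ngrams(["The", "row", "ended"], 3)
for gram in toy_trigrams:
    print(gram)
```

Under this assumption the first trigram has two start symbols as its context, so the words that open a sentence are stored under `tri_model['<s>']['<s>']`.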


#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [48]:
# your code here
for w, prob in sorted(tri_model['today']['the'].items(), key=lambda x: x[1], reverse=True)[:15]:
    print(w, "probability:",  prob)

print("ENGLAND HAS")
for w, prob in sorted(tri_model['England']['has'].items(), key=lambda x: x[1], reverse=True)[:15]:
    print(w, "probability:",  prob)

company probability: 0.16666666666666666
price probability: 0.1111111111111111
public probability: 0.05555555555555555
European probability: 0.05555555555555555
Bank probability: 0.05555555555555555
emirate probability: 0.05555555555555555
overseas probability: 0.05555555555555555
newspaper probability: 0.05555555555555555
Turkish probability: 0.05555555555555555
increase probability: 0.05555555555555555
options probability: 0.05555555555555555
Higher probability: 0.05555555555555555
pound probability: 0.05555555555555555
Italian probability: 0.05555555555555555
time probability: 0.05555555555555555
ENGLAND HAS
been probability: 0.5
carried probability: 0.25
recently probability: 0.25


#### Generating text

Create your text generator based on the trigram model. Does the generated text start to feel a bit more sound?

In [47]:
# your code here
import random

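# sequence start symbol, followed by a first word picked uniformly at random
# among the keys of tri_model['<s>'] (not sampled according to the model)
text = ["<s>"] + [random.choice(list(tri_model['<s>'].keys()))]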

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous two words text[-2], text[-1]
    # your code here
    accumulator = .0
    for word in tri_model[text[-2]][text[-1]].keys():
        accumulator += tri_model[text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break
    

print (' '.join([t for t in text if t]))

<s> MONOCLONAL ANTIBODIES & lt ; GQ > IN IDAHO GOLD / SILVER MINE CONSTRUCTION Geodome Resources Ltd >, a San Miguel Brewery Ltd >, a company spokesman said . </s>


## N-gram models

For larger *n*, we can use NLTK's [n-grams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which allows us to choose an arbitrary *n*.

Create your own 4-gram model.

In [50]:
# your code here
from nltk import ngrams

# Create a placeholder for the model
four_model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0))))

# Count the frequency of each 4-gram
for sentence in reuters.sents():
    for w1, w2, w3, w4 in ngrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>', n=4):
        four_model[w1][w2][w3][w4] += 1

for w1 in four_model.keys():
    for w2 in four_model[w1].keys():
        for w3 in four_model[w1][w2].keys():
            total_count = float(sum(four_model[w1][w2][w3].values())) 
            for w4 in four_model[w1][w2][w3]:
                four_model[w1][w2][w3][w4] /= total_count
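
Each extra word of context adds another level of nesting to the dictionaries above. As an alternative sketch, the model can be keyed by a tuple holding the *n-1* context words, so that the same code works for any *n*. The helper `train_ngram_model` and the toy sentences are hypothetical. Unlike the NLTK-based cells, this version pads manually and adds a single end symbol.

```python
from collections import Counter, defaultdict

def train_ngram_model(sentences, n):
    """Hypothetical helper: map each (n-1)-word context tuple to {next_word: probability}."""
    counts = defaultdict(Counter)
    for sent in sentences:
        # n-1 start symbols on the left, one end symbol on the right
        padded = ["<s>"] * (n - 1) + list(sent) + ["</s>"]
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1

    # Normalize the counts of each context into probabilities
    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        model[context] = {w: c / total for w, c in next_counts.items()}
    return model

# Hypothetical toy corpus (not from Reuters)
toy_sents = [["the", "cat", "sat"], ["the", "cat", "ran"]]
toy_four = train_ngram_model(toy_sents, 4)
print(toy_four[("<s>", "the", "cat")])
```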
       

#### Likely tuples

Check the most likely words following "today the public".

In [51]:
# your code here
for w, prob in sorted(four_model['today']['the']['public'].items(), key=lambda x: x[1], reverse=True)[:15]:
    print(w, "probability:",  prob)

is probability: 1.0


#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [69]:
# your code here
import random

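# sequence start symbol, followed by two words picked uniformly at random
# among the continuations recorded in the trigram model
first_word = random.choice(list(tri_model['<s>'].keys()))
second_word = random.choice(list(tri_model['<s>'][first_word].keys()))
text = ["<s>", first_word, second_word]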

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous three words text[-3], text[-2], text[-1]
    # your code here
    accumulator = .0
    for word in four_model[text[-3]][text[-2]][text[-1]].keys():
        accumulator += four_model[text[-3]][text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break
    

print (' '.join([t for t in text if t]))


<s> LIG shares were one penny firmer at 317p while Tesco was two pence firmer at 475p . </s>
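
The three generators above differ only in how many previous words they condition on. A generic version could look like the sketch below, which assumes a model keyed by context tuples rather than the nested dictionaries used in this notebook. The helper `generate` and the toy trigram model `toy_tri` are hypothetical.

```python
import random

def generate(model, n, max_words=50, seed=None):
    """Hypothetical helper: sample words until '</s>', using the last n-1 words as context."""
    rng = random.Random(seed)
    text = ["<s>"] * (n - 1)  # start from n-1 start symbols
    for _ in range(max_words):
        context = tuple(text[len(text) - (n - 1):])
        candidates = model.get(context)
        if not candidates:
            break  # unseen context: stop instead of looping forever
        words = list(candidates)
        weights = [candidates[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == "</s>":
            break
        text.append(word)
    return text[n - 1:]  # drop the start symbols

# Hypothetical toy trigram model, keyed by two-word context tuples
toy_tri = {
    ("<s>", "<s>"): {"the": 1.0},
    ("<s>", "the"): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 1.0},
    ("the", "dog"): {"sat": 1.0},
    ("cat", "sat"): {"</s>": 1.0},
    ("dog", "sat"): {"</s>": 1.0},
}
generated = generate(toy_tri, 3, seed=1)
print(" ".join(generated))
```

Starting from *n-1* start symbols means the first words are also sampled from the model, and the `max_words` cap guards against very long outputs.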
