Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing probabilities for the occurrence of each word given *n-1* previous words.
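Under the usual maximum-likelihood estimate, which is what the count-then-normalize cells in this notebook compute, that conditional probability is a ratio of counts. Here $C(\cdot)$ denotes how often a word sequence occurs in the training corpus (notation introduced here for illustration):

$$
P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}
$$

For a unigram model ($n = 1$) the context is empty, so the denominator is simply the total number of tokens.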

To "train" such models, we will use the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling 1.3 million words.

In [6]:
import nltk
from nltk.corpus import reuters

We can check how many sentences the corpus contains and inspect the first one. Each sentence is a list of words.

In [7]:
nltk.download('reuters')
print(len(reuters.sents()))

print(reuters.sents()[0])
for w in reuters.sents()[0]:
    print(w, end=' ')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\duart\AppData\Roaming\nltk_data...


54716
['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . 

## Unigram model

For starters, let's build a unigram language model.

In [8]:
from collections import defaultdict

# Create a placeholder for the model
uni_model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        uni_model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [9]:
total_count = float(sum(uni_model.values()))
for w in uni_model:
    uni_model[w] /= total_count
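
To make the count-then-normalize step easier to check by hand, here is a minimal sketch on a two-sentence toy corpus. The names `toy_sents`, `toy_counts` and `toy_uni` are made up for this illustration and are not part of the notebook's model.

```python
from collections import Counter

# Toy corpus standing in for reuters.sents() (invented for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# Count every token across all sentences
toy_counts = Counter(w for sent in toy_sents for w in sent)

# Divide each count by the total number of tokens (7 here)
toy_total = sum(toy_counts.values())
toy_uni = {w: c / toy_total for w, c in toy_counts.items()}

print(toy_uni["the"])          # 2 occurrences out of 7 tokens
print(sum(toy_uni.values()))   # should be 1, up to floating-point rounding
```

Building a separate dictionary of probabilities, rather than dividing in place, keeps the raw counts available and makes the cell safe to re-run.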

#### Likely words

How likely is the word 'the'?

In [18]:
probability_of_the = uni_model['the']
print(f"The probability of 'the' is: {probability_of_the:.6f}")

The probability of 'the' is: 0.033849


What is the most likely word in the corpus?

In [19]:
most_likely_word = max(uni_model, key=uni_model.get)
probability_of_most_likely_word = uni_model[most_likely_word]

print(f"The most likely word is '{most_likely_word}' with a probability of: {probability_of_most_likely_word:.6f}")


The most likely word is '.' with a probability of: 0.055031


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [31]:
import random

# number of words to generate
total_words = 100
text = []

for _ in range(total_words):
    # draw a random point in [0, 1)
    r = random.random()
    # walk the cumulative distribution and pick the first word
    # whose cumulative probability reaches r
    accumulator = 0.0
    for word, prob in uni_model.items():
        accumulator += prob
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

year supported health included omitted Hicks regulators new name Susumu 10 books which Reuters 1 5 announced 57 in ) two ECUADOR in mln ; through agreement disclosed and Farm CANADA 000 to he last 12 less have officials Senate cents 004 said 24 bushel ." 266 to for Reed said a a surfaced & produced 8 Market said two Miraflores units aflatoxin / will Kahn lt documents too line 000 claims > rise in Foothill mln PA this ALITALIA 11 or a
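
The loop above samples by walking the cumulative distribution. As a possible alternative, the standard library's `random.choices` does weighted sampling directly. The sketch below assumes a small hand-written distribution, `toy_uni`, and a hypothetical helper, `sample_unigrams`; neither is part of the notebook.

```python
import random

# Hand-written unigram distribution (illustrative values only)
toy_uni = {"the": 0.4, "cat": 0.3, "sat": 0.2, ".": 0.1}

def sample_unigrams(model, k, seed=None):
    """Draw k words independently, each with probability model[word]."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

sample = sample_unigrams(toy_uni, 10, seed=0)
print(' '.join(sample))
```

With the notebook's model, the equivalent call would be `sample_unigrams(uni_model, 100)`. Because `random.choices` normalizes the weights itself, it also avoids the rare case where floating-point rounding can leave the accumulator just below `r`, so that no word is selected.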


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and right and define our own sequence-start and sequence-end symbols.
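
To see what padding adds, here is a pure-Python sketch that produces the same pairs a padded bigram extraction yields for one sentence. `padded_bigrams` is a hypothetical helper written for this illustration; it is not an NLTK function.

```python
def padded_bigrams(tokens, left="<s>", right="</s>"):
    """Return consecutive word pairs after adding one start and one end symbol."""
    padded = [left] + list(tokens) + [right]
    return list(zip(padded, padded[1:]))

pairs = padded_bigrams(["trade", "friction", "rose"])
print(pairs)
# [('<s>', 'trade'), ('trade', 'friction'), ('friction', 'rose'), ('rose', '</s>')]
```

The first pair records which word starts the sentence, and the last pair records which word ends it. Without padding, neither piece of information would be counted.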

We first need to obtain the counts:

In [13]:
from nltk import bigrams

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [33]:
for w1 in bi_model:
    total_count = float(sum(bi_model[w1].values()))
    for w2 in bi_model[w1]:
        bi_model[w1][w2] /= total_count
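
As a sanity check on the normalization, the followers of any given word should have probabilities that sum to 1. The sketch below verifies this on a toy corpus; `toy_sents`, `toy_counts` and `toy_bi` are illustrative names, and the padding is done by hand rather than through NLTK.

```python
from collections import defaultdict

# Toy corpus (invented for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Count bigrams, padding each sentence with start and end symbols
toy_counts = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        toy_counts[w1][w2] += 1

# Normalize each row: divide by the number of times w1 occurred as context
toy_bi = {
    w1: {w2: c / sum(followers.values()) for w2, c in followers.items()}
    for w1, followers in toy_counts.items()
}

print(toy_bi["the"])  # 'cat' and 'dog' each follow 'the' half of the time
for w1, followers in toy_bi.items():
    assert abs(sum(followers.values()) - 1.0) < 1e-9
```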


#### Likely pairs

What are the probabilities of each word following 'today'?

In [45]:
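printed_words = 0
for w in bi_model:
    # .get() reads the probability without inserting a zero entry into the defaultdict
    probability_of_word = bi_model["today"].get(w, 0)
    if probability_of_word > 0:
        printed_words += 1
        print(f'The probability of {w} following "today" is: {probability_of_word:.6f}')
    # stop once 10 words have been printed
    if printed_words >= 10:
        break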

The probability of . following "today" is: 0.186364
The probability of between following "today" is: 0.001515
The probability of the following "today" is: 0.013636
The probability of many following "today" is: 0.000758
The probability of of following "today" is: 0.009848
The probability of ' following "today" is: 0.106818
The probability of that following "today" is: 0.033333
The probability of , following "today" is: 0.163636
The probability of and following "today" is: 0.025000
The probability of said following "today" is: 0.015909
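
The listing above shows followers in dictionary order, not by likelihood. To rank them instead, one option is to sort the followers of a context by probability. This is a sketch with a hypothetical helper, `top_followers`, and a hand-written `toy_bi` dictionary whose values are illustrative only.

```python
def top_followers(model, context, k=10):
    """Return the k most probable (word, probability) pairs after `context`."""
    # .get() avoids creating an entry when `model` is a defaultdict
    followers = model.get(context, {})
    return sorted(followers.items(), key=lambda item: item[1], reverse=True)[:k]

# Hand-written stand-in for a trained bigram model
toy_bi = {"today": {",": 0.3, ".": 0.5, "that": 0.2}}

print(top_followers(toy_bi, "today", k=2))
# [('.', 0.5), (',', 0.3)]
```

With the notebook's model, the call would be `top_followers(bi_model, "today")`.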


What are the probabilities for sentence-starting words? What do most of them have in common? (Hint: check the *left_pad_symbol* defined above for collecting bigrams.)

In [47]:
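printed_words = 0
for w in bi_model:
    # .get() reads the probability without inserting a zero entry into the defaultdict
    probability_of_start = bi_model["<s>"].get(w, 0)
    if probability_of_start > 0:
        printed_words += 1
        print(f'The probability of {w} starting a sentence is: {probability_of_start:.6f}')
    # stop once 20 words have been printed
    if printed_words >= 20:
        break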

The probability of ASIAN starting a sentence is: 0.000073
The probability of EXPORTERS starting a sentence is: 0.000640
The probability of U starting a sentence is: 0.015827
The probability of . starting a sentence is: 0.000091
The probability of S starting a sentence is: 0.000621
The probability of JAPAN starting a sentence is: 0.002997
The probability of trade starting a sentence is: 0.000018
The probability of between starting a sentence is: 0.000018
The probability of the starting a sentence is: 0.000201
The probability of And starting a sentence is: 0.001078
The probability of Japan starting a sentence is: 0.002029
The probability of raised starting a sentence is: 0.000018
The probability of of starting a sentence is: 0.000037
The probability of Asia starting a sentence is: 0.000018
The probability of ' starting a sentence is: 0.000037
The probability of that starting a sentence is: 0.000018
The probability of - starting a sentence is: 0.000950
The probability of and starting a se

#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [103]:
import random

# start from the sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    accumulator = 0.0
    next_word = None

    for word, prob in bi_model[text[-1]].items():
        accumulator += prob
        if accumulator >= r:
            next_word = word
            break
    
    if next_word is None:
        next_word = "</s>"

    text.append(next_word)

print(' '.join([t for t in text if t]))

<s> The change from April to produce over the company . 25 dlrs vs 11 . </s>
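
The same generation loop can be wrapped in a function with a length cap, so that a sentence which never reaches `</s>` cannot run forever. This is a sketch under assumptions: `generate` and `toy_bi` are illustrative names, and `random.choices` replaces the manual accumulator.

```python
import random

def generate(model, max_words=100, seed=None):
    """Sample words from a bigram model until '</s>' or max_words is reached."""
    rng = random.Random(seed)
    text = ["<s>"]
    while text[-1] != "</s>" and len(text) <= max_words:
        followers = model.get(text[-1])
        if not followers:  # unseen context: nothing to sample from
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        text.append(rng.choices(words, weights=weights, k=1)[0])
    return text

# Hand-written stand-in for a trained bigram model
toy_bi = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

sentence = generate(toy_bi, seed=1)
print(' '.join(sentence))
```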


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).
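
Before working on the Reuters corpus, it may help to see the shape of the data structure on a toy example. In the sketch below, the context is a pair of words used as a dictionary key, and each sentence is padded by hand with two start and two end symbols. `toy_sents`, `tri_counts` and `toy_tri` are illustrative names, not a reference solution.

```python
from collections import defaultdict

# Toy corpus (invented for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "cat", "ran"]]

# Map each (w1, w2) context to the counts of the words that follow it
tri_counts = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = ["<s>", "<s>"] + sent + ["</s>", "</s>"]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        tri_counts[(w1, w2)][w3] += 1

# Normalize the counts of each context into probabilities
toy_tri = {
    context: {w: c / sum(followers.values()) for w, c in followers.items()}
    for context, followers in tri_counts.items()
}

print(toy_tri[("the", "cat")])  # 'sat' and 'ran' are equally likely
```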

In [None]:
# your code here


#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [None]:
# your code here


#### Generating text

Create your text generator based on the trigram model. Does the generated text start to read more naturally?

In [None]:
# your code here


## N-gram models

For larger *n*, we can use NLTK's [ngrams](https://www.nltk.org/_modules/nltk/util.html#ngrams) function, which lets us choose an arbitrary *n*.
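
The counting scheme generalizes by using a tuple of the previous *n-1* words as the context key. Here is a minimal sketch on a toy corpus, with padding done by hand; `train_ngram` is a hypothetical helper written for this illustration, not an NLTK function.

```python
from collections import defaultdict

def train_ngram(sentences, n):
    """Return {context_tuple: {word: probability}} for an n-gram model."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        # pad with n-1 start and end symbols on each side
        padded = ["<s>"] * (n - 1) + list(sent) + ["</s>"] * (n - 1)
        for i in range(len(padded) - n + 1):
            *context, word = padded[i:i + n]
            counts[tuple(context)][word] += 1
    # normalize the counts of each context into probabilities
    return {
        context: {w: c / sum(followers.values()) for w, c in followers.items()}
        for context, followers in counts.items()
    }

# Toy corpus (invented for illustration)
toy_sents = [["a", "b", "c", "d"], ["a", "b", "c", "e"]]
toy_4gram = train_ngram(toy_sents, 4)

print(toy_4gram[("a", "b", "c")])  # 'd' and 'e' are equally likely
```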

Create your own 4-gram model.

In [None]:
# your code here


#### Likely tuples

Check the most likely words following "today the public".

In [None]:
# your code here


#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [None]:
# your code here
