<a href="https://colab.research.google.com/github/ElioRame/ProgrammingAssignment2/blob/master/PALS0039_Ex_7_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 7.1 N-gram language modelling using NLTK

In this exercise we experiment with n-gram language models using [`NLTK`'s functionality](https://www.nltk.org/api/nltk.lm.html).

You might also find the following article insightful: [Language Modeling with NLTK](https://medium.com/swlh/language-modelling-with-nltk-20eac7e70853).



In [30]:
# Quote the requirement so the shell does not treat ">=" as an output redirect
!pip install -U "nltk>=3.7.0"

import nltk
nltk.download("reuters")
nltk.download("punkt")
nltk.download("punkt_tab")
from nltk.corpus import reuters

from nltk.lm import Vocabulary
from nltk.util import pad_sequence, ngrams
from nltk.lm.preprocessing import flatten
from nltk.lm.models import Laplace

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


(a) ***Collect*** all the sentences from the `reuters` corpus, ***lowercase*** them, and ***pad*** the start and end with special symbols (this means we will have n-grams that distinguish the start and end of sentences). For the left pad symbol use `<s>` and the right use `</s>`.

Hint: Use the [`reuters.sents()`](https://www.nltk.org/api/nltk.corpus.html#corpus-reader-functions) method and the `pad_sequence` function, both of which have already been imported.

In [31]:
#(a)
sentences = []
for s in reuters.sents():
  lower_s = [word.lower() for word in s]
  padded_lower_s = list(pad_sequence(lower_s,
                                     n=2,
                                     pad_left=True,
                                     left_pad_symbol="<s>",
                                     pad_right=True,
                                     right_pad_symbol="</s>"))
  sentences.append(padded_lower_s)

#Inspect the first 3 sentences
for s in sentences[:3]:
  print(s)

['<s>', 'asian', 'exporters', 'fear', 'damage', 'from', 'u', '.', 's', '.-', 'japan', 'rift', 'mounting', 'trade', 'friction', 'between', 'the', 'u', '.', 's', '.', 'and', 'japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.', '</s>']
['<s>', 'they', 'told', 'reuter', 'correspondents', 'in', 'asian', 'capitals', 'a', 'u', '.', 's', '.', 'move', 'against', 'japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'u', '.', 's', '.', 'and', 'lead', 'to', 'curbs', 'on', 'american', 'imports', 'of', 'their', 'products', '.', '</s>']
['<s>', 'but', 'some', 'exporters', 'said', 'that', 'while', 'the', 'conflict', 'would', 'hurt', 'them', 'in', 'the', 'long', '-', 'run', ',', 'in', 'the', 'short', '-', 'term', 'tokyo', "'", 's', 'loss', 'might', 'be', 'their', 'gain', '.', '</s>']
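
To make the effect of padding concrete, here is a minimal plain-Python sketch of what padding and n-gram extraction do. The helper names `pad` and `extract_ngrams` are hypothetical and written only for illustration; they are not NLTK functions, although they are meant to mirror the behaviour of `pad_sequence` and `ngrams` for this simple case.

```python
# Illustrative helpers (hypothetical names, not part of NLTK).
def pad(tokens, n, left="<s>", right="</s>"):
    """Add n-1 boundary symbols on each side of a token list."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)


def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


padded = pad(["trade", "friction", "mounts"], n=2)
bigrams = extract_ngrams(padded, 2)

print(padded)   # the sentence with one boundary symbol on each side
print(bigrams)  # the first and last bigram now mark the sentence boundaries
```

With `n=2`, one symbol is added on each side, so the first bigram pairs `<s>` with the first word and the last bigram pairs the final word with `</s>`.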


(b) If needed, ***flatten*** the sentences into a single list of words representing the entire corpus and create a finite vocabulary using NLTK's [`Vocabulary` constructor](https://www.nltk.org/api/nltk.lm.html#nltk.lm.Vocabulary) with a frequency cut-off of 10 (words seen fewer than 10 times are mapped to the unknown token).

(c) Subsequently, inspect the lengths of the corpus and the vocabulary. Compare the length of the vocabulary with the number of unique words in the corpus.

In [32]:
#(b)
# ANSWER
# Flatten the padded sentences into one list of tokens for the whole corpus
counts = list(flatten(sentences))

vocab = Vocabulary(counts, unk_cutoff = 10)

#(c)
# ANSWER
print("Counts length is:", len(counts), ", while Vocab length is:", len(vocab))


Counts length is: 1830349 , while Vocab length is: 8070
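
The comparison asked for in (c) can be illustrated on a toy corpus. The sketch below is plain Python, not NLTK: `build_vocab` is a hypothetical helper that keeps words whose count reaches the cut-off and adds a single unknown token, which is the behaviour assumed here for `Vocabulary` with `unk_cutoff`.

```python
from collections import Counter


def build_vocab(words, cutoff, unk="<UNK>"):
    """Keep words seen at least `cutoff` times, plus one unknown token."""
    word_counts = Counter(words)
    kept = {w for w, c in word_counts.items() if c >= cutoff}
    return kept | {unk}


toy_corpus = "the cat sat on the mat the cat ran".split()
toy_vocab = build_vocab(toy_corpus, cutoff=2)

# Rare words collapse onto the unknown token when looked up in the vocabulary.
mapped = [w if w in toy_vocab else "<UNK>" for w in toy_corpus]

print("Unique words:", len(set(toy_corpus)))
print("Vocabulary size:", len(toy_vocab))
print(mapped)
```

The vocabulary is much smaller than the set of unique words because every rare word shares one entry. On the real corpus the same comparison would be `len(set(counts))` against `len(vocab)`.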


(d) Split the text into `train` and `test` sets as follows: reserve the first 10,000 words for the `test` set and the rest for `train`.

In [38]:
#(d)
# ANSWER
test_words = counts[:10000]
train_words = counts[10000:]

Train three n-gram language models with n = 1, 2, 3 respectively. Use add-one smoothing via NLTK's `Laplace` class, which has already been imported from `nltk.lm.models`.

In [39]:
lms = {}

for n in [1, 2, 3]:
  train_ngrams = list(ngrams(train_words, n))
  print(train_ngrams[:10])
  lm = Laplace(n)
  lm.fit([train_ngrams], vocab)
  lms[n] = lm

[('reasons',), ('are',), ('low',), ('domestic',), ('inflation',), (',',), ('a',), ('bottoming',), ('out',), ('of',)]
[('reasons', 'are'), ('are', 'low'), ('low', 'domestic'), ('domestic', 'inflation'), ('inflation', ','), (',', 'a'), ('a', 'bottoming'), ('bottoming', 'out'), ('out', 'of'), ('of', 'the')]
[('reasons', 'are', 'low'), ('are', 'low', 'domestic'), ('low', 'domestic', 'inflation'), ('domestic', 'inflation', ','), ('inflation', ',', 'a'), (',', 'a', 'bottoming'), ('a', 'bottoming', 'out'), ('bottoming', 'out', 'of'), ('out', 'of', 'the'), ('of', 'the', 'fall')]
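
As a rough illustration of what add-one smoothing computes, the sketch below scores bigrams on a toy sequence in plain Python. It assumes the standard add-one formula, (count(context, word) + 1) / (count(context) + V) with V the vocabulary size; the names `toy_train` and `laplace_score` are hypothetical and not part of NLTK.

```python
from collections import Counter

toy_train = ["<s>", "a", "b", "a", "c", "</s>"]
toy_vocab = sorted(set(toy_train))  # V = 5 word types

bigram_counts = Counter(zip(toy_train, toy_train[1:]))
context_counts = Counter(toy_train[:-1])  # how often each word starts a bigram


def laplace_score(word, context):
    """Add-one estimate of P(word | context) for a bigram model."""
    numerator = bigram_counts[(context, word)] + 1
    denominator = context_counts[context] + len(toy_vocab)
    return numerator / denominator


seen = laplace_score("b", "a")    # bigram ("a", "b") occurs once
unseen = laplace_score("a", "a")  # bigram ("a", "a") never occurs
total = sum(laplace_score(w, "a") for w in toy_vocab)

print(seen, unseen, total)
```

Unseen bigrams get a non-zero probability, and the scores for a fixed context still sum to one. With a vocabulary of thousands of words, the `+ V` term in the denominator dominates for rare contexts, which is worth keeping in mind when comparing models of different order.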


(e) Evaluate each of the 3 language models by determining the perplexity on the `train` and `test` sets.

Hint: Use the [`perplexity` method](https://www.nltk.org/api/nltk.lm.api.html#nltk.lm.api.LanguageModel.perplexity).

In [40]:
#(e)
# ANSWER
for n in [1, 2, 3]:
  train_ngrams = list(ngrams(train_words, n))
  test_ngrams = list(ngrams(test_words, n))

  print("--------------------------------------------------")
  print(f"Results for {n}-gram model:")
  print("Train perplexity:", lms[n].perplexity(train_ngrams))
  print("Test perplexity :", lms[n].perplexity(test_ngrams))


--------------------------------------------------
Results for 1-gram model:
Train perplexity: 539.4316902192742
Test perplexity : 557.7335208906071
--------------------------------------------------
Results for 2-gram model:
Train perplexity: 219.0124849104525
Test perplexity : 293.8424181331866
--------------------------------------------------
Results for 3-gram model:
Train perplexity: 735.2921560827871
Test perplexity : 1236.9728803996954
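
For intuition about these numbers, perplexity can be sketched as 2 raised to the average negative log2 probability that the model assigns to each n-gram. The function below is a plain-Python illustration under that definition, with hypothetical names; it takes the per-token probabilities directly rather than a trained model.

```python
import math


def perplexity_from_probs(probs):
    """2 ** (average negative log2 probability) over a list of token probabilities."""
    avg_neg_log2 = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** avg_neg_log2


# A model that is fairly sure about each token ...
confident = perplexity_from_probs([0.5, 0.5, 0.25, 0.25])
# ... versus one that spreads its probability thinly.
unsure = perplexity_from_probs([0.01, 0.02, 0.01, 0.05])

print(confident, unsure)
```

Lower perplexity means the model assigned higher probability to the observed text, so the model with the lowest test perplexity is the one that predicts unseen text best.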


(f) Comment on the performance of the three models. Which one is best, and why?

In [36]:
#(f)
# ANSWER
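# The 2-gram model is best: it has the lowest perplexity on both train and test.
# The 3-gram model is worse even on train, so this is not only overfitting:
# trigram counts are sparse, and add-one smoothing shifts too much probability
# mass to unseen trigrams. The train/test gap also widens as n grows.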

(g) What would the perplexity be for a predictor which randomly guesses from any one of the words occurring in the test set?

In [41]:
#(g)
# ANSWER
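# A uniform guesser over V distinct words assigns each word probability 1/V,
# so its perplexity is exactly V: the number of unique words in the test set.
len(set(test_words))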

2075
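
The link between uniform guessing and vocabulary size can be checked numerically. This is a small plain-Python sketch with a hypothetical helper, `uniform_perplexity`, which assumes the guesser assigns probability 1/V to every token.

```python
import math


def uniform_perplexity(vocab_size, num_tokens):
    """Perplexity of a guesser that gives every token probability 1 / vocab_size."""
    p = 1.0 / vocab_size
    avg_neg_log2 = -sum(math.log2(p) for _ in range(num_tokens)) / num_tokens
    return 2 ** avg_neg_log2


ppl = uniform_perplexity(vocab_size=2075, num_tokens=10000)
print(ppl)
```

The result does not depend on `num_tokens`: every token contributes the same log probability, so the average is log2(V) and the perplexity is V itself, up to floating-point rounding.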

(h) **Optional:** Experiment with models using different smoothing approaches. What is the best perplexity you can achieve?
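
One possible starting point is add-k (Lidstone) smoothing, which replaces the added count of 1 with a smaller value gamma; `nltk.lm.models` provides a `Lidstone` class for this. The sketch below is a plain-Python toy version, not the NLTK implementation, and all names in it are hypothetical. It compares held-out bigram perplexity for a few gamma values.

```python
import math
from collections import Counter

toy_train = "<s> a b a b a c </s>".split()
toy_heldout = "<s> a b a c </s>".split()
toy_vocab = sorted(set(toy_train))

bigram_counts = Counter(zip(toy_train, toy_train[1:]))
context_counts = Counter(toy_train[:-1])


def lidstone_score(word, context, gamma):
    """Add-gamma estimate of P(word | context); gamma = 1 is add-one smoothing."""
    numerator = bigram_counts[(context, word)] + gamma
    denominator = context_counts[context] + gamma * len(toy_vocab)
    return numerator / denominator


def heldout_perplexity(gamma):
    """Bigram perplexity of the held-out sequence for a given gamma."""
    pairs = list(zip(toy_heldout, toy_heldout[1:]))
    avg_neg_log2 = -sum(
        math.log2(lidstone_score(w, c, gamma)) for c, w in pairs
    ) / len(pairs)
    return 2 ** avg_neg_log2


results = {gamma: heldout_perplexity(gamma) for gamma in (1.0, 0.5, 0.1)}
for gamma, ppl in results.items():
    print(f"gamma={gamma}: perplexity={ppl:.2f}")
```

On the real corpus, the same comparison would be run by training each candidate model on `train_words` and reporting its perplexity on `test_words`, as in (e).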