<a href="https://colab.research.google.com/github/Neilus03/NLP-2023/blob/main/nlp_exercise1_Neil.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from tqdm import tqdm
import nltk
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('cess_cat')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package cess_cat to /root/nltk_data...
[nltk_data]   Unzipping corpora/cess_cat.zip.


True

Load a corpus in Catalan or English. Each nltk corpus is a large collection of text that has already been tokenized and segmented into sentences.

The ``gutenberg`` corpus comes from a set of English literature classics. The ``cess_cat`` corpus comes from https://www.cs.upc.edu/~nlp/wikicorpus/, the "120 Million Word Spanish Corpus", which has a Catalan subset of 50 million words scraped from Viquipèdia (the Catalan Wikipedia) in 2006.

In [None]:
name_corpus = 'gutenberg'

if name_corpus=='cess_cat':
    from nltk.corpus import cess_cat as corpus
    # clean the corpus of strange words
    words = []
    words_to_remove = ['*0*', '-Fpa-', '-Fpt-']
    for w in tqdm(corpus.words()):
        if w not in words_to_remove:
            words.append(w)

elif name_corpus=='gutenberg':
    from nltk.corpus import gutenberg as corpus
    print(corpus.fileids())
    words = corpus.words()
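else:
    # fail with an explicit message (assert statements are skipped under python -O)
    raise ValueError(f'unknown corpus: {name_corpus}')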

print('corpus {} : {} words, {} sentences'
      .format(name_corpus, len(words), len(corpus.sents())))

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
corpus gutenberg : 2621613 words, 98552 sentences


Build a language model from bigrams. Here the LM is just a dictionary
whose key is the condition (one word) and whose value is a ``FreqDist``
object, that is, another dictionary whose key is the next word and whose
value is the number of times that word follows the condition.
This is adapted from https://www.nltk.org/book/ch02.html, section 2.4.
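
To make the structure concrete, here is a minimal sketch on a made-up ten-word corpus (``toy_words`` and ``toy_cfd`` are illustrative names, not part of the exercise). It shows the inner dictionary for one condition and how a count turns into a relative frequency, which is the maximum likelihood estimate of the bigram probability.

```python
import nltk

# Toy corpus, made up for illustration (not taken from gutenberg or cess_cat)
toy_words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'and', 'the', 'cat', 'slept']

# Each bigram (w1, w2) is read as (condition, sample)
toy_cfd = nltk.ConditionalFreqDist(nltk.bigrams(toy_words))

# Inner dictionary for the condition 'the': next word -> number of occurrences
the_counts = dict(toy_cfd['the'])
print(the_counts)

# Relative frequency of 'cat' after 'the', i.e. an estimate of P(cat | the)
p_cat_given_the = toy_cfd['the'].freq('cat')
print(p_cat_given_the)
```

In this toy corpus 'the' is followed twice by 'cat' and once by 'mat', so the estimate of P(cat | the) is 2/3.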


In [None]:
grams = list(nltk.bigrams(words))
# also trigrams, ngrams, everygrams(max_len)
cfd = nltk.ConditionalFreqDist(grams)
print(f'List of unique words that act as conditions in the conditional frequency distribution:\n{cfd.conditions()}')
print('-'*20)
conditions = cfd.conditions()  # build the list of conditions only once
for i in [100, 200, 300, 400]:
    print(f'word at position i in the list of conditions:\n{conditions[i]}\n')
    print(f'most common pairs (word, frequency) that follow the word printed on the previous line, that is, the most common words that appear after the word at position i:\n{cfd[conditions[i]].most_common()}')
    print('--------------')

if name_corpus == 'cess_cat':
    freq_dist = cfd['Una']
elif name_corpus == 'gutenberg':
    freq_dist = cfd['The']

print(f'the pairs (word, frequency) in the frequency distribution stored in freq_dist:\n{freq_dist.items()}\n')
print(f'Word with maximum frequency in the frequency distribution:\n{freq_dist.max()}\n\n')
print(f'List of words in freq_dist, where each word appears freq times:\n{sorted(list(freq_dist.elements()))}')

List of unique words that act as conditions in the conditional frequency distribution:
--------------------
word at position i in the list of conditions:
fond

most common pairs (word, frequency) that follow the word printed on the previous line, that is, the most common words that appear after the word at position i:
[('of', 112), (',', 11), ('and', 3), ('she', 2), ('attachment', 2), ('."', 2), ('report', 1), (';--', 1), ('affection', 1), ('solicitude', 1), ('praise', 1), ('dependence', 1), ('parents', 1), ('partiality', 1), ('regrets', 1), ('regret', 1), ('mother', 1), ('he', 1), ('--', 1), ('daughter', 1), ('father', 1), ('.', 1), ('faith', 1), ('pride', 1), ('hopes', 1), ('impertinence', 1), ('Records', 1)]
--------------
word at position i in the list of conditions:
thought

most common pairs (word, frequency) that follow the word printed on the previous line, that is, the most common words that appear after the word at position i:
[('of', 172), ('it', 130), (',', 129), ('I', 74),

Sample text from the language model

In [None]:
import random

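def sample_bigram_model(cfd_bigrams, last_word, num_words=15):
    """Generate up to num_words words, sampling each one in proportion to its bigram count."""
    generated_text = [last_word]
    for _ in range(num_words - 1):
        freq_dist = cfd_bigrams[last_word]
        if not freq_dist:
            # no word was ever seen after last_word, so there is nothing to sample
            break
        next_word = random.choices(list(freq_dist.keys()), weights=list(freq_dist.values()))[0]
        generated_text.append(next_word)
        last_word = next_word
    return ' '.join(generated_text)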

if name_corpus=='cess_cat':
    print(sample_bigram_model(cfd, 'El', 100))
    print(sample_bigram_model(cfd, 'La', 100))
    print(sample_bigram_model(cfd, 'Per', 100))
else:
    print(sample_bigram_model(cfd, 'The', 100))
    print(sample_bigram_model(cfd, 'For', 100))


The doubts of the moment she settled before of sunset . She is possibly find your banker ' d and three minutes ; and Zilthai , on the evening , My Likeness Earth ; for the fire out of the carpenter ' s nose . Did I have strengthened the same mouth to Emma found him . " Rather than the empty . See my horse block . 6 : Aye me ?" asked Marianne considered it still under the tumultuous city , beneath the only person ! Britons ! Cooling airs and their sons of more , he ,
For that Mr . Then Daniel a strange , but to our friends ; and silver voice ; and their movements of honour all nations , Marianne saw for something from the Word over us . And Samson . But come were seen ; but a penny . 21 Woe unto you on a furious as soon ready to speak just told you think was driving these , she replied very sure , nor any other fish blades crossed the month , and the ocean without a blessing , and from his seisure many LAYS here ." " not thine


Extending the previous function to trigrams, 4-grams, ..., n-grams is long and complicated
because the conditions of the cfd are no longer single words but tuples of n-1 words (pairs, triplets, ...). In addition, the probability of not finding the previous 2, 3, ..., n-1
generated words among the conditions is very high. So it is better to rely
on the ``lm`` package of nltk. It also supports adding ``<s>`` and ``</s>`` symbols to sentences (padding), different types of smoothing and backoff, and sampling text.
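
The sketch below illustrates the first point for trigrams on a made-up toy corpus. The names ``toy_cfd3`` and ``sample_trigram_model`` are hypothetical and only meant as an illustration: the condition becomes a pair of words, and sampling has to stop as soon as the last two words were never seen together as a condition.

```python
import random
import nltk

# Toy corpus, made up for illustration
toy_words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'and', 'the', 'cat', 'slept']

# Condition = tuple with the two previous words, sample = the next word
toy_cfd3 = nltk.ConditionalFreqDist(
    ((w1, w2), w3) for w1, w2, w3 in nltk.trigrams(toy_words))

def sample_trigram_model(cfd_trigrams, first, second, num_words=15, seed=None):
    """Hypothetical helper: sample each word given the two previous words."""
    rng = random.Random(seed)
    generated = [first, second]
    while len(generated) < num_words:
        context = tuple(generated[-2:])
        if context not in cfd_trigrams:
            # dead end: this pair of words never appears as a condition
            break
        words, counts = zip(*cfd_trigrams[context].items())
        generated.append(rng.choices(words, weights=counts)[0])
    return ' '.join(generated)

sampled = sample_trigram_model(toy_cfd3, 'the', 'cat', num_words=8, seed=0)
print(sampled)

# ('cat', 'the') is not a condition, so nothing can be generated after it
dead_end = sample_trigram_model(toy_cfd3, 'cat', 'the')
print(dead_end)
```

Without backoff, the only option at a dead end is to stop or to restart from another context, which is one of the things the ``lm`` package takes care of.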

Build a proper language model with support for ``<s>``, ``</s>``, smoothing, backoff, sampling and computation of perplexity. The API is documented at
https://www.nltk.org/api/nltk.lm.html
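
Before training on the whole corpus, a minimal sketch on two made-up sentences may help to see what padding and fitting do (``toy_text`` and ``toy_lm`` are illustrative names, and a bigram ``MLE`` model is chosen only to keep the numbers easy to check by hand).

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline

# Two toy sentences, already tokenized (made up for illustration)
toy_text = [['the', 'cat', 'sat'], ['the', 'cat', 'slept']]

# Padding adds <s> and </s> so the model also learns how sentences start and end
padded = list(pad_both_ends(toy_text[0], n=2))
print(padded)  # ['<s>', 'the', 'cat', 'sat', '</s>']

# The pipeline pads every sentence and yields all 1-grams and 2-grams,
# plus the flat stream of padded words used to build the vocabulary
train_data, vocab = padded_everygram_pipeline(2, toy_text)
toy_lm = MLE(2)
toy_lm.fit(train_data, vocab)

# score(word, context) is the estimated probability of word given context
p_sat = toy_lm.score('sat', ['cat'])   # 'cat' is followed by 'sat' in 1 of 2 cases
p_the = toy_lm.score('the', ['<s>'])   # both sentences start with 'the'
print(p_sat, p_the)  # 0.5 1.0
```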

In [None]:
if name_corpus=='cess_cat':
    text = []
    words_to_remove = ['*0*', '-Fpa-', '-Fpt-']
    #for s in tqdm(corpus.sents()[:1000]): # debug or quickly train the model
    for s in tqdm(corpus.sents()):
        new_s = [w for w in s if w not in words_to_remove]
        text.append(new_s[:-1]) # drop the last token (normally the final period)
else:
    text = []
    for s in tqdm(corpus.sents()):
        text.append(s[:-1]) # drop the last token (normally the final period)


In [None]:

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm.models import MLE, Laplace, StupidBackoff

# Train every combination of n-gram order and model type, and sample from each
for n in [3, 4, 5]:
    for model_type in [StupidBackoff, MLE, Laplace]:
        if model_type is StupidBackoff:
            model = model_type(order=n, alpha=0.4)
        else:
            model = model_type(n)

        # The pipeline returns lazy generators that fit() consumes,
        # so it has to be rebuilt for every model
        train_data, padded_sents = padded_everygram_pipeline(n, text)

        # Train the model
        model.fit(train_data, padded_sents)

        # Sample a text with 100 words
        generated_text = ' '.join(model.generate(100, random_seed=42))

        print(f"Model: {model_type.__name__}, n={n}\n")
        print(generated_text)
        print("-------------------------------------------------")

# Compare results, which combination seems more realistic?

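`MLE` and `StupidBackoff` with n=4 or 5 seem the most realistic: their samples are the ones that make the most sense to me. A likely reason is that, with such long contexts, both models mostly reproduce word sequences that actually occur in the corpus, whereas `Laplace` smoothing gives every word in the vocabulary some probability after every context, so its samples drift towards unrelated words.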
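
The judgement above is subjective. A more quantitative comparison is perplexity on text that was not used for training. The following is only a sketch under stated assumptions: the sentences are made up, the models are bigram models, and ``train_sents``, ``held_out`` and ``perplexities`` are illustrative names. In the notebook one would instead hold out part of ``text``.

```python
from nltk.lm import MLE, Laplace
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import bigrams

# Toy training sentences and one held-out sentence (made up for illustration)
train_sents = [['the', 'cat', 'sat'], ['the', 'dog', 'slept']]
held_out = ['the', 'cat', 'slept']  # the bigram ('cat', 'slept') is never seen in training

# Perplexity is computed over the padded bigrams of the held-out sentence
test_bigrams = list(bigrams(pad_both_ends(held_out, n=2)))

perplexities = {}
for model_type in [MLE, Laplace]:
    # the pipeline generators are consumed by fit(), so rebuild them per model
    train_data, vocab = padded_everygram_pipeline(2, train_sents)
    lm = model_type(2)
    lm.fit(train_data, vocab)
    perplexities[model_type.__name__] = lm.perplexity(test_bigrams)

print(perplexities)
```

`MLE` gives the unseen bigram probability zero, so its perplexity is infinite, while `Laplace` stays finite. This shows that lower perplexity on held-out text and more realistic samples are different criteria: smoothing helps the first one even when its samples look worse.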