In [1]:
from language_model import LM
import nltk.corpus
nltk.download('abc')
nltk.download('punkt')

[nltk_data] Downloading package abc to
[nltk_data]     C:\Users\steli\AppData\Roaming\nltk_data...
[nltk_data]   Package abc is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\steli\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
sentences = nltk.corpus.abc.sents()
sum(len(sent) for sent in sentences)  # total number of tokens in the corpus

766811

First, we have to create an instance of the model. If n_type is set to 'bigram', a bigram model is created. If n_type is set to 'trigram', a trigram model is created.

In [3]:
bi_model = LM(n_type='bigram')
tri_model = LM(n_type='trigram')

Next, we have to train the model. The input must be a list of lists, where each inner list is a tokenized sentence. The model handles all the cleaning and padding, so raw sentences can be passed in directly. For example, if the train method of the bigram model receives the sentence ['ThI,!s', 'iS', '.A.', 'tE==++sT'], the model transforms it to ['*start*', 'this', 'is', 'a', 'test', '*end*'] and builds all the necessary vocabularies.
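
The cleaning and padding step described above can be sketched as follows. This is a minimal illustration and not the library's own code: the function name clean_and_pad and the rule of n - 1 start tokens for higher-order models are assumptions, and the padding tokens are written as they appear in the vocabulary output further down.

```python
import re

START, END = '*start*', '*end*'  # padding tokens as printed in bigram_voc_

def clean_and_pad(sentence, n=2):
    """Hypothetical stand-in for the model's preprocessing (not the library's code)."""
    # Remove every character that is not a letter or a digit, then lowercase.
    cleaned = [re.sub(r'[^A-Za-z0-9]', '', word).lower() for word in sentence]
    # Drop tokens that became empty, e.g. tokens made only of punctuation.
    cleaned = [word for word in cleaned if word]
    # Assumption: n - 1 start tokens and a single end token.
    return [START] * (n - 1) + cleaned + [END]

print(clean_and_pad(['ThI,!s', 'iS', '.A.', 'tE==++sT']))
# ['*start*', 'this', 'is', 'a', 'test', '*end*']
```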

Once the model is trained, the following attributes and methods are available.
Let b_model be an instance of a bigram model:
 - b_model.unigram_voc_ -> returns the vocabulary of all unigrams
 - b_model.bigram_voc_ -> returns the vocabulary of all bigrams (for trigrams, we would have to create a trigram model using LM(n_type='trigram'))
 - b_model.add_a_prob() -> calculates the probability of a given bigram (or trigram, if we use the trigram model) using Add-a smoothing, where the hyper-parameter 'a' can be tuned to achieve the lowest possible Cross-Entropy
 - b_model.kn_prob() -> calculates the probability of a given bigram using interpolated Kneser-Ney smoothing, where the discount D is equal to 0.5 for bigrams with a count of 1 and 0.75 for the rest
 - b_model.estimate_sent_prob() -> calculates the log probability of each given sentence. If more than one sentence is given as input, it returns a list with the log probability of each sentence (these can then be summed to obtain the total log probability of, e.g., the test corpus). There are two available smoothers for bigrams: Add-a smoothing and interpolated Kneser-Ney smoothing
 - b_model.entr_perp() -> returns the Cross-Entropy and Perplexity of the test set
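
To make the Add-a estimate concrete, here is a small sketch of the textbook formula P(w2 | w1) = (c(w1, w2) + a) / (c(w1) + a * |V|) on a toy corpus. The function add_a_probability and the toy sentences are illustrative assumptions, and the library's add_a_prob may differ in detail (for example in how it treats unknown words).

```python
from collections import Counter

def add_a_probability(bigram, unigram_counts, bigram_counts, a=1.0):
    """Textbook Add-a estimate; a sketch, not the library's implementation."""
    w1, _ = bigram
    vocab_size = len(unigram_counts)
    # .get() returns 0 for n-grams that were never seen in training.
    return (bigram_counts.get(bigram, 0) + a) / (unigram_counts.get(w1, 0) + a * vocab_size)

# Toy corpus, already cleaned and padded.
toy_sents = [['*start*', 'the', 'cat', 'sat', '*end*'],
             ['*start*', 'the', 'dog', 'sat', '*end*']]
unigram_counts = Counter(w for sent in toy_sents for w in sent)
bigram_counts = Counter(pair for sent in toy_sents for pair in zip(sent, sent[1:]))

print(add_a_probability(('the', 'cat'), unigram_counts, bigram_counts, a=1))  # (1 + 1) / (2 + 6) = 0.25
print(add_a_probability(('cat', 'the'), unigram_counts, bigram_counts, a=1))  # unseen bigram: 1 / 7
```

Smaller values of a move less probability mass to unseen bigrams, which is why a is worth tuning on a development set.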

Some static methods can also be called without creating an instance:
 - Model.word_cleaner() -> cleans any list of words: it removes every character that is not a letter or a number and converts all words to lowercase
 - Model.sent_preprocessing() -> cleans and pads any sentence given as input
 - Model.bigram() -> creates the bigrams of a given sentence
 - Model.trigram() -> creates the trigrams of a given sentence
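
As an illustration of what the bigram and trigram helpers produce, the sketch below extracts n-grams from an already padded sentence. The function ngrams is a hypothetical name used only here, and the trigram example reuses the bigram padding for simplicity, which may not match how the library pads sentences for the trigram model.

```python
def ngrams(tokens, n):
    """Return all consecutive n-tuples of tokens (hypothetical helper)."""
    # zip over n shifted views of the token list; zip stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))

padded = ['*start*', 'this', 'is', 'a', 'test', '*end*']
print(ngrams(padded, 2))  # 5 bigrams, starting with ('*start*', 'this')
print(ngrams(padded, 3))  # 4 trigrams, starting with ('*start*', 'this', 'is')
```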

In [4]:
bi_model.train(sentences)  # train model

Estimating the probability of a bigram:

In [6]:
laplace = bi_model.add_a_prob(('this', 'is'))
kn = bi_model.kn_prob(('this', 'is'))
print(f"Add-a smoothing (with a=1) probability: {laplace}\nK-N probability: {kn}")

Add-a smoothing (with a=1) probability: 0.04117362955807776
K-N probability: 0.12414558851175722
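
For reference, interpolated Kneser-Ney for bigrams can be sketched as below, using the discount schedule described earlier (0.5 for bigrams seen once, 0.75 otherwise). The helper make_kn_prob, the toy counts, and the handling of unseen contexts are assumptions for illustration. The library's kn_prob may be implemented differently.

```python
from collections import Counter

def discount(count):
    # Discount schedule from the text: 0.5 for singletons, 0.75 for the rest.
    return 0.5 if count == 1 else 0.75

def make_kn_prob(bigram_counts):
    """Build an interpolated Kneser-Ney estimator from bigram counts (sketch)."""
    context_total = Counter()     # number of bigram tokens starting with w1
    context_discount = Counter()  # total discount removed from bigrams starting with w1
    continuation = Counter()      # number of distinct words that precede w2
    for (w1, w2), c in bigram_counts.items():
        context_total[w1] += c
        context_discount[w1] += discount(c)
        continuation[w2] += 1
    bigram_types = len(bigram_counts)

    def kn_prob(bigram):
        w1, w2 = bigram
        p_cont = continuation[w2] / bigram_types  # continuation probability of w2
        if context_total[w1] == 0:
            return p_cont  # assumption: unseen context falls back to p_cont
        c = bigram_counts.get(bigram, 0)
        discounted = max(c - discount(c), 0.0) / context_total[w1]
        lam = context_discount[w1] / context_total[w1]  # mass freed by discounting
        return discounted + lam * p_cont

    return kn_prob

toy_sents = [['*start*', 'the', 'cat', 'sat', '*end*'],
             ['*start*', 'the', 'dog', 'sat', '*end*']]
toy_bigrams = Counter(pair for sent in toy_sents for pair in zip(sent, sent[1:]))
kn = make_kn_prob(toy_bigrams)
print(kn(('the', 'cat')))  # seen bigram
print(kn(('the', 'sat')))  # unseen bigram still gets mass via the continuation term
```

Unlike Add-a smoothing, the mass given to an unseen bigram depends on how many distinct contexts its second word appears in.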


Estimating the log probability of a sentence using Add-a (Laplace) smoothing:

In [7]:
bi_model.estimate_sent_prob([['this', 'is', 'a', 'test']], smoothing='add_a')  

[-34.326750073789114]

Estimating the log probability of a sentence using the Interpolated Kneser-Ney smoother:

In [8]:
bi_model.estimate_sent_prob([['this', 'is', 'a', 'test']], smoothing='kn')  

[-26.54752395643943]
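
The sentence scores above are sums of log probabilities over the bigrams of the padded sentence. The sketch below shows that computation with a hypothetical bigram estimator. The base-2 logarithm is an assumption (it matches a cross-entropy measured in bits), and sentence_log_prob and toy_bigram_prob are names invented for this example.

```python
import math

def sentence_log_prob(sentence, bigram_prob):
    """Sum of log2 bigram probabilities of a cleaned sentence (sketch)."""
    padded = ['*start*'] + sentence + ['*end*']
    return sum(math.log2(bigram_prob(pair)) for pair in zip(padded, padded[1:]))

# Toy estimator with made-up probabilities, used only to show the arithmetic.
toy_probs = {('*start*', 'this'): 0.5, ('this', 'is'): 0.25, ('is', '*end*'): 0.5}

def toy_bigram_prob(bigram):
    return toy_probs.get(bigram, 0.125)

print(sentence_log_prob(['this', 'is'], toy_bigram_prob))  # log2(0.5) + log2(0.25) + log2(0.5) = -4.0
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.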

Calling the bigram vocabulary attribute:

In [9]:
bi_model.bigram_voc_

Counter({('*start*', 'pm'): 9,
         ('pm', 'denies'): 1,
         ('denies', 'knowledge'): 1,
         ('knowledge', 'of'): 18,
         ('of', 'awb'): 36,
         ('awb', 'kickbacks'): 5,
         ('kickbacks', 'the'): 1,
         ('the', 'prime'): 40,
         ('prime', 'minister'): 91,
         ('minister', 'has'): 5,
         ('has', 'denied'): 7,
         ('denied', 'he'): 2,
         ('he', 'knew'): 4,
         ('knew', 'awb'): 2,
         ('awb', 'was'): 9,
         ('was', 'paying'): 3,
         ('paying', 'kickbacks'): 2,
         ('kickbacks', 'to'): 12,
         ('to', 'iraq'): 32,
         ('iraq', 'despite'): 1,
         ('despite', 'writing'): 1,
         ('writing', 'to'): 2,
         ('to', 'the'): 1469,
         ('the', 'wheat'): 75,
         ('wheat', 'exporter'): 49,
         ('exporter', 'asking'): 1,
         ('asking', 'to'): 1,
         ('to', 'be'): 1215,
         ('be', 'kept'): 7,
         ('kept', 'fully'): 1,
         ('fully', 'informed'): 1,
         

Everything shown above can also be done with a trigram model (using Add-a smoothing; Kneser-Ney smoothing is described above for bigrams only).

### Showing that the models work

In [10]:
from sklearn.model_selection import train_test_split

train_sents, test_set = train_test_split(sentences, test_size=0.2, random_state=42)  # hold out 20% as the test set
train_set, dev_set = train_test_split(train_sents, test_size=0.1, random_state=42)  # split the remainder into train and dev sets

In [11]:
bi_model = LM(n_type='bigram')
bi_model.train(train_set)

tri_model = LM(n_type='trigram')
tri_model.train(train_set)

In [12]:
lap_hc, lap_pp = bi_model.entr_perp(test_set, a=0.01)
kn_hc, kn_pp = bi_model.entr_perp(test_set, smoothing='kn')
tri_hc, tri_pp = tri_model.entr_perp(test_set, a=0.007)

print(f"Cross-Entropy and Perplexity using the Bigram Model with Add-a Smoothing: {lap_hc} and {lap_pp}")
print(f"Cross-Entropy and Perplexity using the Bigram Model with K-N Smoothing: {kn_hc} and {kn_pp}")
print(f"Cross-Entropy and Perplexity using the Trigram Model with Add-a Smoothing: {tri_hc} and {tri_pp}")

Cross-Entropy and Perplexity using the Bigram Model with Add-a Smoothing: 6.780454239756531 and 109.93098273596117
Cross-Entropy and Perplexity using the Bigram Model with K-N Smoothing: 6.427927593043861 and 86.09918003297753
Cross-Entropy and Perplexity using the Trigram Model with Add-a Smoothing: 8.560095402262942 and 377.43787794141315
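
The two numbers in each pair above appear to be tied by perplexity = 2^H, where H is the cross-entropy in bits. This relation is inferred from the printed values rather than from the library's source, and the short check below recomputes it. The helper perplexity_from_entropy is a name used only here.

```python
# (cross-entropy, perplexity) pairs copied from the output above
reported = {
    'bigram, Add-a': (6.780454239756531, 109.93098273596117),
    'bigram, K-N': (6.427927593043861, 86.09918003297753),
    'trigram, Add-a': (8.560095402262942, 377.43787794141315),
}

def perplexity_from_entropy(cross_entropy_bits):
    """Perplexity implied by a cross-entropy measured in bits."""
    return 2 ** cross_entropy_bits

for name, (entropy, perplexity) in reported.items():
    print(f"{name}: 2**H = {perplexity_from_entropy(entropy):.2f} (reported {perplexity:.2f})")
```

Because the mapping is monotonic, ranking models by cross-entropy or by perplexity gives the same order.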


In [13]:
from random import shuffle

test_shfl = [list(sent) for sent in test_set]  # copy each sentence so test_set itself is not modified
for sent in test_shfl:
    shuffle(sent)  # shuffle the order of the words within each sentence, in place

In [14]:
lap_hc, lap_pp = bi_model.entr_perp(test_shfl, a=0.01)
kn_hc, kn_pp = bi_model.entr_perp(test_shfl, smoothing='kn')
tri_hc, tri_pp = tri_model.entr_perp(test_shfl, a=0.007)

print(f"Cross-Entropy and Perplexity using the Bigram Model with Add-a Smoothing on shuffled sentences: {lap_hc} and {lap_pp}")
print(f"Cross-Entropy and Perplexity using the Bigram Model with K-N Smoothing on shuffled sentences: {kn_hc} and {kn_pp}")
print(f"Cross-Entropy and Perplexity using the Trigram Model with Add-a Smoothing on shuffled sentences: {tri_hc} and {tri_pp}")

Cross-Entropy and Perplexity using the Bigram Model with Add-a Smoothing on shuffled sentences: 9.356070591143954 and 655.3267402755667
Cross-Entropy and Perplexity using the Bigram Model with K-N Smoothing on shuffled sentences: 8.231113955160426 and 300.4776667419467
Cross-Entropy and Perplexity using the Trigram Model with Add-a Smoothing on shuffled sentences: 10.727254898805462 and 1695.2177679941951


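All three models assign lower probabilities (higher cross-entropy and perplexity) to the shuffled, nonsensical sentences than to the original ones. This is the behavior we expect from a model that has learned word order, which suggests that the models are working.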

### Predicting the most probable next word using the models

In [15]:
import pandas as pd

def next_word(word, model, smoother='kn', a=1):
    """Return the 10 most probable continuations of `word` under `model`."""
    # candidate continuations: every vocabulary entry except the special tokens
    voc = [w for w in model.unigram_voc_ if w not in ('*start*', '*UNK*', '*end*')]

    # use the model passed as an argument rather than the global bi_model
    if smoother == 'kn':
        probs = {key: model.kn_prob((word, key)) for key in voc}
    else:
        probs = {key: model.estimate_ngram_prob((word, key), a=a) for key in voc}

    # sort by probability (ascending) and keep the last 10 entries
    sorted_words = dict(sorted(probs.items(), key=lambda item: item[1]))
    top10 = {i: sorted_words[i] for i in list(sorted_words.keys())[-10:]}

    return pd.DataFrame(top10.items()).rename(columns={0: word, 1: 'Probability'}).sort_values(by='Probability', ascending=False)

In [16]:
next_word('he', bi_model)

   he    Probability
9  said  0.39345
8  says  0.334239
7  is    0.042016
6  has   0.019871
5  was   0.014306
4  will  0.012982
3  had   0.008395
2  and   0.007996
1  also  0.007358
0  adds  0.007307


In [17]:
next_word('good', bi_model)

   good    Probability
9  news    0.097392
8  for     0.032394
7  at      0.027989
6  to      0.027153
5  and     0.024799
4  as      0.017111
3  prices  0.015528
2  enough  0.015374
1  thing   0.015267
0  on      0.013858


In [18]:
next_word('make', bi_model)

   make    Probability
9  a       0.126892
8  the     0.121113
7  it      0.114721
6  sure    0.077509
5  up      0.046269
4  them    0.030285
3  sense   0.016443
2  any     0.014327
1  an      0.012478
0  people  0.012267
