#### To follow what is going on, read the descriptions above each cell carefully.

**This notebook is developed for NLTK 3.8.1.**

# Building n-gram models from scratch (without ready-made n-gram libraries) on Christopher Marlowe's *Edward the Second*

Here, the text of *Edward the Second* serves as training data for the models. We take a few sentences, remove the last word of each, and let every model guess the missing word. Comparing each guess with the correct word gives the accuracy of each model.

**Read the text of the book and convert all letters to lowercase**

In [95]:
from pathlib import Path

text = Path('EdwardTheSecond.txt').read_text().lower()
print(text)

ï»¿edward the second

by christopher marlowe



dramatis personae

king edward the second.
prince edward, _his son, afterwards_ king edward the third.
kent, _brother to_ king edward the second.
gaveston.
archbishop of canterbury.
bishop of coventry.
bishop of winchester.
warwick.
lancaster.
pembroke.
arunder.
leicester.
berkeley.
mortimer _the elder._
mortimer _the younger, his nephew._
spenser _the elder._
spenser _the younger, his son._
baldock.
baumont.
trussel.
gurney.
matrevis.
lightborn.
sir john of hainault.
levune.
rice ap howel.
abbot.
monks.
herald.
lords, poor men, james, mower, champion,
   messengers, soldiers, _and_ attendants.

queen isabella, _wife to_ king edward the second.
niece _to_ king edward the second, _daughter to
   the _duke of glocester._
ladies.




           _enter_ gaveston, _reading a letter._

_gav. my father is deceas'd.  come, gaveston,
   and share the kingdom with thy dearest friend._
   ah, words that make me surfeit with delight!
   what greater 
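
The `ï»¿` at the start of the printed text is what a UTF-8 byte-order mark (BOM) looks like when the bytes are decoded with a single-byte Windows code page. The following is a minimal sketch, assuming the book file is UTF-8 with a BOM (the notebook does not confirm this), of how the `utf-8-sig` codec drops the mark while reading. The file name `sample.txt` and its content are made up for the illustration.

```python
import tempfile
from pathlib import Path

# Write a small sample file that starts with a UTF-8 byte-order mark (BOM).
sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / 'sample.txt'  # hypothetical file name
sample_path.write_bytes(b'\xef\xbb\xbfEdward the Second\n')

# Decoding with a single-byte code page turns the BOM into three stray characters.
as_cp1252 = sample_path.read_text(encoding='cp1252').lower()

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
as_utf8 = sample_path.read_text(encoding='utf-8-sig').lower()

print(repr(as_cp1252))
print(repr(as_utf8))
```

If that assumption holds for `EdwardTheSecond.txt`, passing `encoding='utf-8-sig'` to `read_text` would make the three extra characters in the punctuation list of the next cell unnecessary.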

**Remove punctuation marks from the text**

In [96]:
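# Raw string so the backslash is taken literally; 'ï»¿' is the mis-decoded BOM at the start of the file.
punctuations = r'''!()-[]{};:'"\,<>’./?@#$%^&*_~ï»¿'''
# Keep every character that is not in the punctuation set.
text = ''.join(char for char in text if char not in punctuations)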
print(text)

edward the second

by christopher marlowe



dramatis personae

king edward the second
prince edward his son afterwards king edward the third
kent brother to king edward the second
gaveston
archbishop of canterbury
bishop of coventry
bishop of winchester
warwick
lancaster
pembroke
arunder
leicester
berkeley
mortimer the elder
mortimer the younger his nephew
spenser the elder
spenser the younger his son
baldock
baumont
trussel
gurney
matrevis
lightborn
sir john of hainault
levune
rice ap howel
abbot
monks
herald
lords poor men james mower champion
   messengers soldiers and attendants

queen isabella wife to king edward the second
niece to king edward the second daughter to
   the duke of glocester
ladies




           enter gaveston reading a letter

gav my father is deceasd  come gaveston
   and share the kingdom with thy dearest friend
   ah words that make me surfeit with delight
   what greater bliss can hap to gaveston
   than live and be the favourite of a king
   sweet prince i
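
As an alternative to filtering character by character, the same removal can be sketched with `str.translate`, which deletes a set of characters in a single pass. This is only an illustration on one line taken from the output above; the variable names `delete_table`, `sample` and `cleaned` are not part of the notebook.

```python
# Characters to delete: the notebook's punctuation set without the BOM debris.
punctuations = r'''!()-[]{};:'"\,<>’./?@#$%^&*_~'''

# With three arguments, str.maketrans maps every character of the third
# argument to None, which means "delete this character".
delete_table = str.maketrans('', '', punctuations)

sample = "_gav. my father is deceas'd.  come, gaveston,"
cleaned = sample.translate(delete_table)
print(cleaned)
```

The double space after `deceasd` survives, as it does in the notebook output, because only punctuation is removed at this stage.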

**Download the English stopwords and the Punkt tokenizer data**

In [97]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\H\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\H\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Convert text to a list of words**

In [98]:
items = nltk.tokenize.word_tokenize(text)
print(items[:20])

['edward', 'the', 'second', 'by', 'christopher', 'marlowe', 'dramatis', 'personae', 'king', 'edward', 'the', 'second', 'prince', 'edward', 'his', 'son', 'afterwards', 'king', 'edward', 'the']


**Remove stopwords from the list**

In [99]:
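# NOTE: item[0] is the token's first character, so this drops every word that starts with a single-letter
# stopword (such as 'a', 'd', 'i', 'm', 'o', 's', 't', 'y'); a whole-word test would be `item not in stop_words`.
items = [item for item in items if item[0] not in stop_words]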
print(items[:20])

['edward', 'by', 'christopher', 'personae', 'king', 'edward', 'prince', 'edward', 'his', 'king', 'edward', 'kent', 'brother', 'king', 'edward', 'gaveston', 'canterbury', 'bishop', 'coventry', 'bishop']
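
The output above keeps `'by'` and `'his'` but loses `'the'`, `'second'` and `'marlowe'`, because the filter tests the first character of each token rather than the token itself. The sketch below contrasts the two filters on the first few tokens of the text. The `stop_words` set here is a small hand-picked subset of the NLTK list, chosen only so that the example runs without any download.

```python
# Hypothetical, hand-picked subset of NLTK's English stopword list.
stop_words = {'the', 'his', 'by', 'a', 'd', 'i', 'm', 'o', 's', 't', 'y'}

tokens = ['edward', 'the', 'second', 'by', 'christopher', 'marlowe',
          'dramatis', 'personae', 'king']

# Filter used in the notebook: tests only the first character of each token.
by_first_char = [tok for tok in tokens if tok[0] not in stop_words]

# Whole-word filter: tests the token itself.
by_whole_word = [tok for tok in tokens if tok not in stop_words]

print(by_first_char)
print(by_whole_word)
```

The first list reproduces the start of the notebook output. Switching to the whole-word filter would change the training tokens and therefore every result reported further down.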



### Making correct sentences

### Making sentences by omitting one word

### Building a preprocessing function to apply to the constructed sentences
##### Convert all letters to lowercase
##### Remove digits and punctuation marks
##### Remove stopwords

In [100]:
import random
import re

true_test = [
        "So, lay the table down, and stamp on it, But not too hard, lest that you bruise his body.",
        "_Gav._ Weaponless must I fall, and die in bands? O, must this day be period of my life, Centre of all my bliss?  And ye be men, Speed to the king.",
        "No doubt, such lessons they will teach the rest, As by their preachments they will profit much, And learn obedience to their lawful king.",
        "But, ere he came, Warwick in ambush lay, And bare him to his death; and in a trench Strake off his head, and march'd unto the camp.",
        "That you may drink your fill, and quaff in blood, And stain my royal standard with the same, That so my bloody colours may suggest remembrance",
        "Edw._ Ay, traitors all, rather than thus be brav'd, Make England's civil towns huge heaps of stones, And ploughs to go about our palace-gates.",
        "Welcome, o' God's name, madam, and your son! England shall welcome you and all your rout.",
        "But that grief keeps me waking, I should sleep; For not these ten days have these eye-lids",
        "Edmund, away! Bristow to Longshanks' blood Is false; be not found single for suspect: Proud Mortimer pries",
        "If mine will serve, unbowel straight this breast, And give my heart to Isabel and him: It is the chiefest mark they level",
        ]

test = [
        "So, lay the table down, and stamp on it, But not too hard, lest that you bruise his.",
        "_Gav._ Weaponless must I fall, and die in bands? O, must this day be period of my life, Centre of all my bliss?  And ye be men, Speed to the",
        "No doubt, such lessons they will teach the rest, As by their preachments they will profit much, And learn obedience to their lawful",
        "But, ere he came, Warwick in ambush lay, And bare him to his death; and in a trench Strake off his head, and march'd unto the",
        "That you may drink your fill, and quaff in blood, And stain my royal standard with the same, That so my bloody colours may suggest",
        "Edw._ Ay, traitors all, rather than thus be brav'd, Make England's civil towns huge heaps of stones, And ploughs to go about our",
        "Welcome, o' God's name, madam, and your son! England shall welcome you and all your",
        "But that grief keeps me waking, I should sleep; For not these ten days have these",
        "Edmund, away! Bristow to Longshanks' blood Is false; be not found single for suspect: Proud Mortimer",
        "If mine will serve, unbowel straight this breast, And give my heart to Isabel and him: It is the chiefest mark they",
        ]

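def preprocessing(s):
    # Lowercase and drop digits.
    s = s.lower()
    s = re.sub(r'\d+', '', s)
    # Drop punctuation (raw string keeps the backslash literal).
    punctuations = r'''!()-[]{};:'"\,<>’./?@#$%^&*_~'''
    text = ''.join(char for char in s if char not in punctuations)
    # Collapse runs of spaces left behind by the removed characters.
    text = re.sub(' +', ' ', text)
    stop_words = stopwords.words('english')
    items = nltk.tokenize.word_tokenize(text)
    # Same first-character filter as for the training text, so that test
    # sentences and training tokens are processed identically.
    # The result keeps a trailing space; callers rely on it when they
    # append the predicted word.
    item = ''
    for i in items:
        if i[0] not in stop_words:
            item += i + ' '
    return item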
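
The incomplete sentences in `test` are written out by hand. As a rough sketch of the same idea, the helper below splits a sentence into its prefix and its final word. `drop_last_word` is a hypothetical name that does not appear in the notebook, and it leaves punctuation attached to the removed word.

```python
def drop_last_word(sentence):
    """Return (prefix, last_word), splitting off the final whitespace-separated word."""
    prefix, last_word = sentence.rstrip().rsplit(None, 1)
    return prefix, last_word

full = "And learn obedience to their lawful king."
prefix, target = drop_last_word(full)
print(prefix)
print(target)
```

Because `preprocessing` strips punctuation anyway, the trailing period on the removed word does not affect the later comparison.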

**The unigram model estimates the probability of each word as its relative frequency, that is, the number of times the word occurs in the text divided by the total number of words:**

**word_probability = word_repeated_count / text_size**

**{word : word_probability}**
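
A small worked example of this formula on a four-word toy corpus (the corpus is made up for illustration). It uses `collections.Counter` for the counting step, which is a convenience and not what the notebook's own function does.

```python
from collections import Counter

words = ['king', 'edward', 'king', 'gaveston']

# Count each word, then divide by the total number of tokens.
counts = Counter(words)
unigram = {word: count / len(words) for word, count in counts.items()}

print(unigram)
```

Here `'king'` occurs twice among four tokens, so its probability is 0.5, and the probabilities of all words sum to 1.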

In [114]:
def build_unigram_model(words):
    unigram_model = {}
    size = len(words)
    for word in words:
        if word in unigram_model:
            unigram_model[word] += 1
        else:
            unigram_model[word] = 1
    for word in unigram_model:
        unigram_model[word] /= size
    return unigram_model

unigram_model = build_unigram_model(items)
print(unigram_model)



**Choose the most likely word. The unigram model ignores the context, so it predicts the same word for every incomplete sentence unless several words share the highest probability, in which case one of them is chosen at random.**

In [105]:
def unigram_prediction(unigram_model):
    # Return every word that shares the highest probability.
    # The maximum is computed once instead of once per word.
    best = max(unigram_model.values(), default=None)
    return [word for word, prob in unigram_model.items() if prob == best]

for s, t_s in zip(test, true_test):
    t = preprocessing(s)
    t_s = preprocessing(t_s)
    print('sentence :', t, '....')
    pred = random.choice(unigram_prediction(unigram_model))
    print('unigram predictions  :', t + pred)
    print(70*'=')

sentence : lay but not hard lest bruise his  ....
unigram predictions  : lay but not hard lest bruise his not
sentence : gav weaponless fall bands be period life centre bliss be  ....
unigram predictions  : gav weaponless fall bands be period life centre bliss be not
sentence : no lessons will rest by preachments will profit learn lawful  ....
unigram predictions  : no lessons will rest by preachments will profit learn lawful not
sentence : but ere he came warwick lay bare him his his head unto  ....
unigram predictions  : but ere he came warwick lay bare him his his head unto not
sentence : fill quaff blood royal with bloody colours  ....
unigram predictions  : fill quaff blood royal with bloody colours not
sentence : edw rather be bravd englands civil huge heaps ploughs go  ....
unigram predictions  : edw rather be bravd englands civil huge heaps ploughs go not
sentence : welcome gods name england welcome  ....
unigram predictions  : welcome gods name england welcome not
sentence : b

**The bigram model estimates the probability of each pair of consecutive words as its relative frequency among all bigrams in the text:**

**(word1, word2)_probability = (word1, word2)_repeated_count / number_of_bigrams**

**{(word1, word2) : (word1, word2)_probability}**
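
The value stored for each pair is a joint relative frequency, not the conditional probability of `word2` given `word1`. The sketch below, on a made-up toy corpus, computes both and checks that they select the same next word for a fixed first word, since the two differ only by a factor that depends on `word1`. The helper `best_next` is a hypothetical name introduced for this illustration.

```python
from collections import Counter

words = ['king', 'edward', 'king', 'gaveston', 'king', 'edward']

bigram_counts = Counter(zip(words, words[1:]))
n_bigrams = len(words) - 1

# Joint estimate, as in the notebook: count(w1, w2) / number of bigrams.
joint = {pair: c / n_bigrams for pair, c in bigram_counts.items()}

# Conditional estimate: count(w1, w2) / number of bigrams that start with w1.
first_counts = Counter(words[:-1])
conditional = {(w1, w2): c / first_counts[w1]
               for (w1, w2), c in bigram_counts.items()}

def best_next(model, w1):
    """Most probable follower of w1 under the given bigram model."""
    followers = {w2: p for (a, w2), p in model.items() if a == w1}
    return max(followers, key=followers.get)

print(joint[('king', 'edward')], conditional[('king', 'edward')])
print(best_next(joint, 'king'), best_next(conditional, 'king'))
```

For prediction, which only ranks the followers of one given word, the joint frequencies used in the notebook are therefore sufficient.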

In [116]:
def build_bigram_model(words):
    bigram_model = {}
    size = len(words) - 1
    for i in range(size):
        word = (words[i],words[i + 1])
        if word in bigram_model:
            bigram_model[word] += 1
        else:
            bigram_model[word] = 1
    for word in bigram_model:
        bigram_model[word] /= size
    return bigram_model

bigram_model = build_bigram_model(items)
print(bigram_model)



**Collect every bigram whose first word is the last word of the sentence and select the most probable second word. If several candidates share the highest probability, one of them is chosen at random.**

In [107]:
def bigram_prediction(bigram_model, corpus):
    # Map each word that follows the last word of `corpus` to its bigram probability.
    context = tuple(corpus.split()[-1:])
    return {ngram[1]: prob for ngram, prob in bigram_model.items()
            if ngram[:1] == context}

for s, t_s in zip(test, true_test):
    t = preprocessing(s)
    t_s = preprocessing(t_s)
    print('sentence :', t, '....')
    pred = random.choice(unigram_prediction(bigram_prediction(bigram_model, t)))
    print('bigram predictions   :', t + pred)
    print(70*'=')

sentence : lay but not hard lest bruise his  ....
bigram predictions   : lay but not hard lest bruise his head
sentence : gav weaponless fall bands be period life centre bliss be  ....
bigram predictions   : gav weaponless fall bands be period life centre bliss be gone
sentence : no lessons will rest by preachments will profit learn lawful  ....
bigram predictions   : no lessons will rest by preachments will profit learn lawful king
sentence : but ere he came warwick lay bare him his his head unto  ....
bigram predictions   : but ere he came warwick lay bare him his his head unto his
sentence : fill quaff blood royal with bloody colours  ....
bigram predictions   : fill quaff blood royal with bloody colours remembrance
sentence : edw rather be bravd englands civil huge heaps ploughs go  ....
bigram predictions   : edw rather be bravd englands civil huge heaps ploughs go with
sentence : welcome gods name england welcome  ....
bigram predictions   : welcome gods name england welcome frie

**The trigram model estimates the probability of each triple of consecutive words as its relative frequency among all trigrams in the text:**

**(word1, word2, word3)_probability = (word1, word2, word3)_repeated_count / number_of_trigrams**

**{(word1, word2, word3) : (word1, word2, word3)_probability}**
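
The bigram, trigram and quadgram builders differ only in the window length. A possible generalisation is sketched below. `build_ngram_model` is a hypothetical helper, not a function of the notebook, and for `n = 1` its keys are one-element tuples instead of the plain strings the notebook's unigram model uses.

```python
from collections import Counter

def build_ngram_model(words, n):
    """Map each n-tuple of consecutive words to count / number of n-grams."""
    total = len(words) - n + 1
    if total <= 0:
        return {}  # the text is shorter than one n-gram
    counts = Counter(tuple(words[i:i + n]) for i in range(total))
    return {ngram: count / total for ngram, count in counts.items()}

words = ['king', 'edward', 'king', 'edward', 'gaveston']
trigram = build_ngram_model(words, 3)
bigram = build_ngram_model(words, 2)
print(trigram)
print(bigram)
```

With five tokens there are three trigrams, each seen once, and four bigrams, of which `('king', 'edward')` occurs twice.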

In [117]:
def build_trigram_model(words):
    trigram_model = {}
    size = len(words) - 2
    for i in range(size):
        word = (words[i], words[i + 1], words[i + 2])
        if word in trigram_model:
            trigram_model[word] += 1
        else:
            trigram_model[word] = 1
    for word in trigram_model:
        trigram_model[word] /= size
    return trigram_model

trigram_model = build_trigram_model(items)
print(trigram_model)



**Collect every trigram whose first two words are the last two words of the sentence and select the most probable third word. If several candidates share the highest probability, one of them is chosen at random.**

In [109]:
def trigram_prediction(trigram_model, corpus):
    # Map each word that follows the last two words of `corpus` to its trigram probability.
    context = tuple(corpus.split()[-2:])
    return {ngram[2]: prob for ngram, prob in trigram_model.items()
            if ngram[:2] == context}

for s, t_s in zip(test, true_test):
    t = preprocessing(s)
    t_s = preprocessing(t_s)
    print('sentence :', t, '....')
    pred = random.choice(unigram_prediction(trigram_prediction(trigram_model, t)))
    print('trigram predictions   :', t + pred)
    print(70*'=')

sentence : lay but not hard lest bruise his  ....
trigram predictions   : lay but not hard lest bruise his body
sentence : gav weaponless fall bands be period life centre bliss be  ....
trigram predictions   : gav weaponless fall bands be period life centre bliss be king
sentence : no lessons will rest by preachments will profit learn lawful  ....
trigram predictions   : no lessons will rest by preachments will profit learn lawful king
sentence : but ere he came warwick lay bare him his his head unto  ....
trigram predictions   : but ere he came warwick lay bare him his his head unto camp
sentence : fill quaff blood royal with bloody colours  ....
trigram predictions   : fill quaff blood royal with bloody colours remembrance
sentence : edw rather be bravd englands civil huge heaps ploughs go  ....
trigram predictions   : edw rather be bravd englands civil huge heaps ploughs go palacegates
sentence : welcome gods name england welcome  ....
trigram predictions   : welcome gods name engla

**The quadgram model estimates the probability of each sequence of four consecutive words as its relative frequency among all quadgrams in the text:**

**(word1, word2, word3, word4)_probability = (word1, word2, word3, word4)_repeated_count / number_of_quadgrams**

**{(word1, word2, word3, word4) : (word1, word2, word3, word4)_probability}**

In [118]:
def build_quadgram_model(words):
    quadgram_model = {}
    size = len(words) - 3
    for i in range(size):
        word = (words[i], words[i + 1], words[i + 2], words[i + 3])
        if word in quadgram_model:
            quadgram_model[word] += 1
        else:
            quadgram_model[word] = 1
    for word in quadgram_model:
        quadgram_model[word] /= size
    return quadgram_model

quadgram_model = build_quadgram_model(items)
print(quadgram_model)



**Collect every quadgram whose first three words are the last three words of the sentence and select the most probable fourth word. If several candidates share the highest probability, one of them is chosen at random.**

In [112]:
def quadgram_prediction(quadgram_model, corpus):
    # Map each word that follows the last three words of `corpus` to its quadgram probability.
    context = tuple(corpus.split()[-3:])
    return {ngram[3]: prob for ngram, prob in quadgram_model.items()
            if ngram[:3] == context}

for s, t_s in zip(test, true_test):
    t = preprocessing(s)
    t_s = preprocessing(t_s)
    print('sentence :', t, '....')
    pred = random.choice(unigram_prediction(quadgram_prediction(quadgram_model, t)))
    print('quadgram predictions :', t + pred)
    print(70*'=')

sentence : lay but not hard lest bruise his  ....
quadgram predictions : lay but not hard lest bruise his body
sentence : gav weaponless fall bands be period life centre bliss be  ....
quadgram predictions : gav weaponless fall bands be period life centre bliss be king
sentence : no lessons will rest by preachments will profit learn lawful  ....
quadgram predictions : no lessons will rest by preachments will profit learn lawful king
sentence : but ere he came warwick lay bare him his his head unto  ....
quadgram predictions : but ere he came warwick lay bare him his his head unto camp
sentence : fill quaff blood royal with bloody colours  ....
quadgram predictions : fill quaff blood royal with bloody colours remembrance
sentence : edw rather be bravd englands civil huge heaps ploughs go  ....
quadgram predictions : edw rather be bravd englands civil huge heaps ploughs go palacegates
sentence : welcome gods name england welcome  ....
quadgram predictions : welcome gods name england welc
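
The prediction functions return an empty result when the context never occurs in the training text, and `random.choice` then fails on the empty list. One common way around this, not used in the notebook, is to back off to a shorter context. The sketch below assumes every model, including the unigram one, is keyed by tuples. `predict_with_backoff` and the toy `models` dictionary are hypothetical, and ties are resolved by taking the first candidate instead of a random one.

```python
def predict_with_backoff(models, context):
    """Try the largest n first; models maps n to {ngram_tuple: probability}."""
    for n in sorted(models, reverse=True):
        # The history is the last n - 1 words of the context (empty for n = 1).
        history = tuple(context[-(n - 1):]) if n > 1 else ()
        if len(history) < n - 1:
            continue  # context too short for this order
        candidates = {ngram[-1]: p for ngram, p in models[n].items()
                      if ngram[:-1] == history}
        if candidates:
            return max(candidates, key=candidates.get), n
    return None, 0

# Made-up probabilities, for illustration only.
models = {
    1: {('king',): 0.5, ('edward',): 0.3, ('camp',): 0.2},
    2: {('lawful', 'king'): 0.6, ('unto', 'camp'): 0.4},
}

print(predict_with_backoff(models, ['learn', 'lawful']))
print(predict_with_backoff(models, ['sweet', 'prince']))
```

The first call is answered by the bigram model. The second finds no bigram starting with `'prince'` and falls back to the most frequent single word.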

**Compare the guessed word with the correct word for each model and calculate the accuracy as the percentage of sentences whose missing word was guessed correctly**

In [120]:
probs = {'unigram':0,'bigram':0,'trigram':0,'quadgram':0}
for s, t_s in zip(test, true_test):
    t = preprocessing(s)
    t_s = preprocessing(t_s)
    pred = random.choice(unigram_prediction(unigram_model))
    t_s = t_s.split()
    if pred == t_s[-1]:
        probs['unigram'] += 1
    pred = random.choice(unigram_prediction(bigram_prediction(bigram_model, t)))
    if pred == t_s[-1]:
        probs['bigram'] += 1
    pred = random.choice(unigram_prediction(trigram_prediction(trigram_model, t)))
    if pred == t_s[-1]:
        probs['trigram'] += 1
    pred = random.choice(unigram_prediction(quadgram_prediction(quadgram_model, t)))
    if pred == t_s[-1]:
        probs['quadgram'] += 1

probs['unigram'] = (probs['unigram'] / len(test)) * 100
probs['bigram'] = (probs['bigram'] / len(test)) * 100
probs['trigram'] = (probs['trigram'] / len(test)) * 100
probs['quadgram'] = (probs['quadgram'] / len(test)) * 100
print()
print('Accuracy :', probs)


Accuracy : {'unigram': 0.0, 'bigram': 30.0, 'trigram': 90.0, 'quadgram': 100.0}
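
The accuracy computation reduces to counting matches and dividing by the number of sentences. A compact sketch of that step follows, using the bigram guesses and correct words of the first four sentences as they appear in the outputs and test sentences above. `accuracy` is a hypothetical helper name.

```python
def accuracy(predictions, targets):
    """Percentage of positions where the predicted word equals the target."""
    if not targets:
        return 0.0
    hits = sum(pred == target for pred, target in zip(predictions, targets))
    return 100 * hits / len(targets)

# Correct last words and bigram guesses for the first four test sentences.
targets = ['body', 'king', 'king', 'camp']
predictions = ['head', 'gone', 'king', 'his']

print(accuracy(predictions, targets))
```

Only the third guess matches, so this subset scores 25.0, whereas the notebook's figure of 30.0 for the bigram model is computed over all ten sentences.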
