# Language Modeling using N-grams

In this exercise, you will use NLTK, a natural language processing library for Python, to create a bigram language model and several variations of it. You will build one model of each of the following types and calculate its perplexity:
- Unigram Model
- Bigram Model
- Bigram Model with add-one estimation
- Bigram Model with Interpolation
- Bigram Model with Kneser-Ney Interpolation
- Neural LM



Members:  
Tharnarch Thoranisttakul 63340500025  
Sorapas Weerakul 63340500064  
Athimet Aiewcharoen 63340500068  

In [2]:
# First, import the necessary libraries: math, nltk (bigrams, trigrams), and collections.
import math
import nltk
import io
import random
from random import shuffle
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
random.seed(999)

BEST2010 is a free Thai NLP dataset by NECTEC that is commonly used as a standard benchmark for various NLP tasks, including language modeling. BEST2010 is separated into 4 domains: article, encyclopedia, news, and novel. The data is already tokenized using '|' as a separator.

For example,

ตาม|ที่|นางประนอม ทองจันทร์| |กับ| |ด.ช.กิตติพงษ์ แหลมผักแว่น| |และ| |ด.ญ.กาญจนา กรองแก้ว| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|ขณะ|นี้|ยัง|ไม่|ดี|ขึ้น|
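
As a minimal sketch of how such a line breaks into tokens, the snippet below uses a shortened, made-up line in the same format (the sample string is an assumption, not a line from the corpus). It shows that a trailing `|` leaves an empty final token, which is why dropping the last character before splitting is convenient.

```python
# Hypothetical line in BEST2010 format: tokens separated by '|', with a trailing '|'.
raw_line = "ป่วย|สงสัย|ติด|เชื้อ|ไข้|\n"

# Splitting directly leaves an empty string after the final separator.
naive_tokens = raw_line.strip().split("|")

# Dropping the trailing separator first gives only the real tokens.
tokens = raw_line.strip()[:-1].split("|")

print(naive_tokens)
print(tokens)
```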

In [3]:
# We choose news domain as our dataset
best2010=[]
fp= io.open('data/BEST2010/news.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    best2010.append(line.strip()[:-1])
fp.close()
all_vocabulary =set()
total_word_count =0
for line in best2010:
    for word in line.split('|'):        
        all_vocabulary.add(word)
        total_word_count+=1

In [4]:
# For simplicity, we assume that each line is a sentence.
print ('Total sentences in BEST2010 news dataset :\t'+ str(len(best2010)))
print ('Total word counts in BEST2010 news dataset :\t'+ str(total_word_count))
print ('Total vocabulary in BEST2010 news dataset :\t'+ str(len(all_vocabulary)))

Total sentences in BEST2010 news dataset :	30969
Total word counts in BEST2010 news dataset :	1660190
Total vocabulary in BEST2010 news dataset :	35488


We separate our input into 2 sets, train and test data, with a 70:30 ratio.

In [5]:
sentences = best2010
# The data is separated to train and test set with 70:30 ratio.
train = sentences[:int(len(sentences)*0.7)]
test = sentences[int(len(sentences)*0.7):]

#Training data
train_vocabulary =set()
train_word_count =0
for line in train:
    for word in line.split('|'):        
        train_vocabulary.add(word)
        train_word_count+=1
print ('Total sentences in BEST2010 news training dataset :\t'+ str(len(train)))
print ('Total word counts in BEST2010 news training dataset :\t'+ str(train_word_count))
print ('Total vocabulary in BEST2010 news training dataset :\t'+ str(len(train_vocabulary)))
# We will use 1/vocab_size as a default value for unknown word
unk_value = math.pow(len(train_vocabulary),-1)

Total sentences in BEST2010 news training dataset :	21678
Total word counts in BEST2010 news training dataset :	1042797
Total vocabulary in BEST2010 news training dataset :	26240


# Unigram

In this section, we will demonstrate how to build a unigram language model <br>
**Important note:** <br>
**\<s\>** = sentence start symbol <br>
**\</s\>** = sentence end symbol 

In [6]:
def getUnigramModel(data):
    model = defaultdict(lambda: 0)
    word_count =0
    for sentence in data:
        sentence +=  u'|</s>' #for unigram model we can always ignore <s>, since p(w0=<s>)=1
        for w1 in sentence.split('|'):
            model[w1] +=1.0
            word_count+=1
    for w1 in model:
        model[w1] = model[w1]/(word_count)
    return model

In [7]:
model = getUnigramModel(train)

In [8]:
def getLnValue(x):
    if x >0.0:
        return math.log(x)
    else:
        return math.log(unk_value)

In [9]:
# log probability of 'นายก'
print(getLnValue(model[u'นายก']))
# for example, the log probability of 'นายกรัฐมนตรี', which is an unknown word, falls back to ln(unk_value)
print(getLnValue(model[u'นายกรัฐมนตรี']))
# probability of the sentence 'นายก' 'ได้' 'ให้' 'สัมภาษณ์' 'กับ' 'สื่อ'
prob = getLnValue(model[u'นายก'])+getLnValue(model[u'ได้'])+ getLnValue(model[u'ให้'])+getLnValue(model[u'สัมภาษณ์'])+getLnValue(model[u'กับ'])+getLnValue(model[u'สื่อ'])+getLnValue(model['</s>'])
print ('Probability of a sentence', math.exp(prob))

-6.551526663995246
-10.175040243058024
Probability of a sentence 5.617210748667918e-18


## TODO #1 **Calculate perplexity**

To compare language models, we need to calculate perplexity. In this task, you should write the perplexity calculation code for the unigram model. The resulting perplexity should be around 556.39 and 476.07 on the train and test data, respectively.
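
For reference, the perplexity used throughout this exercise can be written as below. The notation is ours: $N$ is the number of predicted tokens (including `</s>`) and $P(w_i)$ is the probability the model assigns to the $i$-th token.

```latex
\mathrm{PP}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\ln P(w_i)\right)
```

A lower value means the model is, on average, less surprised by the data.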

In [10]:
def calculate_sentence_ln_prob(sentence, model):
    # Outline:
    #   split the sentence on '|'
    #   for each word, compute getLnValue and add it to ln_prob
    #   return ln_prob
    words = sentence.split('|')
    ln_prob = 0
    for word in words:
        ln_prob += getLnValue(model[word])
    return ln_prob

def perplexity(test,model):
    # Outline:
    #   for each sentence, compute calculate_sentence_ln_prob and sum over all sentences
    #   return exp(-ln_prob/word_count)
    ln_prob = 0
    word_count = 0
    for sentence in test:
        sentence += u'|</s>'  # count the end-of-sentence symbol as a token
        ln_prob += calculate_sentence_ln_prob(sentence, model)
        word_count += len(sentence.split('|'))
    return math.exp(-ln_prob/word_count)

In [11]:
print(f'Perplexity of unigram model on training set: {perplexity(train, model):.2f}')
print(f'Perplexity of unigram model on test set: {perplexity(test, model):.2f}')

Perplexity of unigram model on training set: 556.39
Perplexity of unigram model on test set: 476.07


# Bigram

Next, you will create a better language model than the unigram model, which is a weak baseline. Counting every pair of words that occurs in the corpus by hand is tedious, so NLTK provides a simple helper that does it for us.

In [12]:
# example of nltk usage for bigrams
sentence = 'I always search google for an answer .'

print('This is how nltk generates bigrams.')
for w1,w2 in bigrams(sentence.split(), pad_right=True, pad_left=True,left_pad_symbol='<s>', right_pad_symbol='</s>'):
    print (w1,w2)
print('If no pad symbols are given, None is used as the start and end of sentence symbol.')

This is how nltk generates bigrams.
<s> I
I always
always search
search google
google for
for an
an answer
answer .
. </s>
If no pad symbols are given, None is used as the start and end of sentence symbol.
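
To make the padding behaviour concrete without NLTK, here is a plain-Python sketch that mimics what the padded call produces. `padded_bigrams` is a hypothetical helper written for illustration, not part of NLTK; with its default arguments it uses `None` as the pad symbol, which is what the models in the rest of this notebook rely on.

```python
def padded_bigrams(tokens, left_pad=None, right_pad=None):
    """Return consecutive token pairs with one pad symbol on each side."""
    padded = [left_pad] + list(tokens) + [right_pad]
    return list(zip(padded, padded[1:]))

pairs = padded_bigrams("I always search".split())
print(pairs)
```

A sentence of n tokens yields n + 1 pairs, which is the token count a bigram perplexity should divide by.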


Now you should be able to implement a bigram model yourself. You must also write a new perplexity calculation for the bigram model. The resulting perplexity should be around 58.78 and 146.26 on the train and test data, respectively.

## TODO #2 **Create a Bigram Model**

In [13]:
def getBigramModel(data):
    ###FILL YOUR CODE HERE###
    # unigram_count = defaultdict(lambda: 0.0)
    # bigram_count = defaultdict(lambda: 0.0)

    # for each sentence
    #   for each token pair in the generated bigrams
    #     bigram_count[?] = ?
    #     unigram_count[?] = ?

    # for each token pair among all bigrams
    #   model[?] = ?
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            bigram_count[(w1,w2)] += 1.0
            unigram_count[w1] += 1.0
            
    model = defaultdict(lambda: 0.0)
    
    for w1,w2 in bigram_count:
        model[(w1,w2)] = bigram_count[(w1,w2)] / unigram_count[w1]
    
    return model

## TODO #3 **Calculate Perplexity for Bigram Model**



In [14]:
def calculate_sentence_ln_prob(sentence, model):
    ln_prob = 0
    for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
        ln_prob += getLnValue(model[(w1,w2)])
    return ln_prob

def perplexity(test,model):
    ln_prob = 0
    word_count = 0
    for sentence in test:
        ln_prob += calculate_sentence_ln_prob(sentence, model)
        word_count += len(sentence.split('|')) + 1 # Add 1 for 1 additional None pair
    return math.exp(-ln_prob/word_count)

modelBigram = getBigramModel(train)

print(f'Perplexity of bigram model on training set: {perplexity(train, modelBigram)}')
print(f'Perplexity of bigram model on test set: {perplexity(test, modelBigram)}')

# 58.78942889767147
# 146.26539331038614

Perplexity of bigram model on training set: 58.78942889767147
Perplexity of bigram model on test set: 146.26539331038614


# Smoothing

N-gram models usually suffer from a sparsity problem: the training data does not contain every possible n-gram. Smoothing techniques can alleviate this problem. In this section, you will implement two basic smoothing methods for bigrams: Laplace (add-one) smoothing and interpolation.

## TODO #4 **Bigram with add-one estimation**

In [15]:
# Laplace (add-one) smoothing
def getBigramWithAddOneEstimation(data):
    # Only bigrams seen in training are stored; unseen bigrams fall back to
    # unk_value through getLnValue when perplexity is computed.
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    V = len({w for sentence in data for w in sentence.split('|')}) + 1 # Add one for unknown word
    
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            bigram_count[(w1,w2)] += 1.0
            unigram_count[w1] += 1.0
    
    model = defaultdict(lambda: 0.0)
    
    for w1,w2 in bigram_count:
        model[(w1,w2)] = (bigram_count[(w1,w2)] + 1) / (unigram_count[w1] + V)
    
    return model

modelBigramWithAddOneEstimation = getBigramWithAddOneEstimation(train)
print (perplexity(train,modelBigramWithAddOneEstimation) )
print (perplexity(test, modelBigramWithAddOneEstimation))

# 974.8134581679766
# 1098.1622194979489

974.8134581679766
1098.1622194979489
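
The model above stores probabilities only for bigrams seen in training. As a point of comparison, the sketch below applies textbook add-one estimation on a toy corpus, where an unseen bigram also receives `(0 + 1) / (c(w1) + V)` and each conditional distribution sums to one. The corpus, the `<s>`/`</s>` pad symbols, and the helper name `add_one_prob` are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical); each sentence is a list of tokens.
corpus = [["a", "b", "a"], ["b", "a", "c"]]

unigram_count = Counter()
bigram_count = Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        bigram_count[(w1, w2)] += 1
        unigram_count[w1] += 1

# Possible next tokens: every word plus the end symbol.
vocab = sorted({w for sent in corpus for w in sent} | {"</s>"})
V = len(vocab)

def add_one_prob(w1, w2):
    """Add-one estimate; defined for seen and unseen bigrams alike."""
    return (bigram_count[(w1, w2)] + 1) / (unigram_count[w1] + V)

print(add_one_prob("a", "b"))   # seen bigram
print(add_one_prob("c", "b"))   # unseen bigram still gets probability mass
print(sum(add_one_prob("a", w) for w in vocab))
```

With a vocabulary of tens of thousands of words, the `+ V` in the denominator moves a large share of probability mass to unseen bigrams, which is one plausible reason add-one perplexity is so much higher than the unsmoothed bigram figures.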


## TODO #5 **Bigram with Interpolation**
The lambda values are 0.7 for the bigram, 0.25 for the unigram, and 0.05 for the unknown word.
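
Written out, the interpolated estimate combines the three terms as follows. The notation is ours: $c(\cdot)$ is a training count, $N$ is the number of training tokens, and $V$ is the training vocabulary size, so $1/V$ plays the role of the unknown-word probability.

```latex
P_{\text{interp}}(w_i \mid w_{i-1})
  = \lambda_2 \frac{c(w_{i-1} w_i)}{c(w_{i-1})}
  + \lambda_1 \frac{c(w_i)}{N}
  + \lambda_0 \frac{1}{V},
\qquad \lambda_2 + \lambda_1 + \lambda_0 = 1
```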

In [16]:
def getBigramWithInterpolation(data ,lambdalist=[0.7,0.25,0.05]):
    
    #Fill code here
    # unigram_count = defaultdict(lambda: 0.0)
    # bigram_count = defaultdict(lambda: 0.0)
    # model = defaultdict(lambda: 0.0)

    # for each sentence
    #   for each token pair in the generated bigrams
    #     bigram_count[?] = ?
    #     unigram_count[?] = ?


    # for each key in bigrams
    #   bigram_prob
    #   unigram_prob
    #   model[key] = formula combining bigram, unigram, unk_value (1/vocab)
    
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    
    lambda_2 = lambdalist[0]
    lambda_1 = lambdalist[1]
    lambda_0 = lambdalist[2]
    
    V = len(set([w for sentence in data for w in sentence.split('|')]))
    C = 1/V
    
    print(f'Vocab size: {V} and C: {C}')
    print(f'lambda_2: {lambda_2}, lambda_1: {lambda_1}, lambda_0: {lambda_0} and All: {lambda_2 + lambda_1 + lambda_0}')
    
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            bigram_count[(w1,w2)] += 1.0
            unigram_count[w1] += 1.0
    
    model = defaultdict(lambda: 0.0)
    
    word = len([w for sentence in data for w in sentence.split('|')])
    
    for w1,w2 in bigram_count:
        bigram_prob = bigram_count[(w1,w2)] / (unigram_count[w1])
        unigram_prob = unigram_count[w2] / word
        model[(w1,w2)] = (lambda_2 * bigram_prob) + (lambda_1 * unigram_prob) + (lambda_0 * C)
    return model

modelBigramWithInterpolation = getBigramWithInterpolation(train)
print (perplexity(train,modelBigramWithInterpolation))        
print (perplexity(test,modelBigramWithInterpolation))

# 73.38409869825665
# 172.67485908813356

Vocab size: 26240 and C: 3.8109756097560976e-05
lambda_2: 0.7, lambda_1: 0.25, lambda_0: 0.05 and All: 1.0
73.38409869825665
172.67485908813356


# Language modeling on multiple domains

Sometimes, we do not have enough data to create a language model for a new domain. In that case, we can improvise by combining several models to improve results on the new domain.

In this exercise you will try to merge two language models from news and article domains to create a language model for the encyclopedia domain.

In [17]:
# create encyclopedia data
encyclo_data=[]
fp= io.open('data/BEST2010/encyclopedia.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    encyclo_data.append(line.strip()[:-1])
fp.close()

First, you should calculate the perplexity of your bigram model with interpolation, trained on "news data" (train), on "encyclopedia data" (test). The resulting perplexity should be around 727.35.

For your information, a bigram model with interpolation trained on "article data" (train) and tested on "encyclopedia data" (test) has a perplexity of 505.79.

In [18]:
# print perplexity of the news-trained bigram model with interpolation on encyclopedia data
# 727.3502637212223
print (perplexity(encyclo_data,modelBigramWithInterpolation))

727.3502637212223


## TODO #6 
Write a model that produces a perplexity of 450.0 or less on encyclopedia data without using data from the encyclopedia as training data. (Hint: try combining a model trained on news data with a model trained on article data.)

In [19]:
news_data=[]
article_data=[]
fp= io.open('data/BEST2010/news.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    news_data.append(line.strip()[:-1])
fp.close()
fp= io.open('data/BEST2010/article.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    article_data.append(line.strip()[:-1])
fp.close()

news_train = news_data[:int(len(news_data)*0.7)]
news_test = news_data[int(len(news_data)*0.7):]

article_train = article_data[:int(len(article_data)*0.7)]
article_test = article_data[int(len(article_data)*0.7):]

combined_train = news_train + article_train
combined_test = news_test + article_test

combined_data = news_data + article_data

In [20]:
# Fill code here
# 428.85251789073953 (on combined data)
combined_model = getBigramWithInterpolation(combined_train,[0.77,0.22,0.01])
print('Perplexity of combined bigram model (70% train split) with interpolation on encyclopedia data',perplexity(encyclo_data, combined_model))

combined_model_full = getBigramWithInterpolation(combined_data,[0.77,0.22,0.01])
print('Perplexity of combined bigram model (full news + article data) with interpolation on encyclopedia data',perplexity(encyclo_data, combined_model_full))

Vocab size: 40134 and C: 2.4916529625753725e-05
lambda_2: 0.77, lambda_1: 0.22, lambda_0: 0.01 and All: 1.0
Perplexity of combined bigram model (70% train split) with interpolation on encyclopedia data 440.71182931267646
Vocab size: 52581 and C: 1.901827656377779e-05
lambda_2: 0.77, lambda_1: 0.22, lambda_0: 0.01 and All: 1.0
Perplexity of combined bigram model (full news + article data) with interpolation on encyclopedia data 408.08714322851586
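
The solution above pools the training data of both domains into one model. Another reading of the hint is to keep two separately trained models and mix their probabilities. The sketch below shows that idea on toy dictionaries; `mix_models`, the weight `alpha`, and the toy probabilities are assumptions for illustration, and this variant has not been evaluated on BEST2010.

```python
from collections import defaultdict

def mix_models(model_a, model_b, alpha=0.5):
    """Linearly interpolate two bigram models stored as {(w1, w2): prob}."""
    mixed = defaultdict(float)
    for key in set(model_a) | set(model_b):
        mixed[key] = alpha * model_a.get(key, 0.0) + (1 - alpha) * model_b.get(key, 0.0)
    return mixed

# Toy probabilities standing in for two domain-specific models (hypothetical values).
news_model = {("a", "b"): 0.6, ("a", "c"): 0.4}
article_model = {("a", "b"): 0.2, ("a", "d"): 0.8}

mixed = mix_models(news_model, article_model, alpha=0.5)
print(sorted(mixed.items()))
```

Because the result is a `defaultdict` that returns 0.0 for unseen pairs, it could be passed to the bigram `perplexity` function in the same way as the other models.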


## TODO #7 
## Kneser-Ney on "News"

<!-- Reimplement equation 4.33 in SLP textbook (https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf) -->

Implement a bigram Kneser-Ney LM. The resulting perplexity should be around 71.14054002208687 and 174.02464248000433 on the train and test data, respectively.
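
For orientation, interpolated Kneser-Ney for bigrams can be written as below. The notation is ours: $c(\cdot)$ is a training count and $d$ is the absolute discount; the implementation in this notebook uses $d = 0.75$.

```latex
P_{KN}(w_i \mid w_{i-1})
  = \frac{\max\bigl(c(w_{i-1} w_i) - d,\; 0\bigr)}{c(w_{i-1})}
  + \lambda(w_{i-1})\, P_{cont}(w_i)

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\bigl|\{\, w : c(w_{i-1} w) > 0 \,\}\bigr|

P_{cont}(w_i) = \frac{\bigl|\{\, v : c(v\, w_i) > 0 \,\}\bigr|}
                     {\bigl|\{\, (v', w') : c(v' w') > 0 \,\}\bigr|}
```

The continuation probability counts how many distinct words precede $w_i$, and $\lambda(w_{i-1})$ redistributes exactly the mass removed by the discount.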


In [21]:
def getdistinct_platten(data,targetword):
    result_col = []
    result_row = []
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            pair = (w1,w2)
            if w2 == targetword:
                result_col.append(pair)
            if w1 == targetword:
                result_row.append(pair)
    result_col = list(set(result_col))
    result_row = list(set(result_row))
    return result_col, result_row

tword = "หวัด"
res_col, res_row = getdistinct_platten(train,tword)
print(f"distinct word : {tword}")
print(f"distinct word pair_col: {len(res_col)}")
print(f"distinct word pair_col: {res_col}")
print(f"distinct word pair_row: {len(res_row)}")
print(f"distinct word pair_row: {res_row}")

distinct word : หวัด
distinct word pair_col: 48
distinct word pair_col: [(None, 'หวัด'), ('ฆ่า', 'หวัด'), ('ไร้', 'หวัด'), ('ใช่', 'หวัด'), ('วัคซีน', 'หวัด'), ('คล้าย', 'หวัด'), (' ', 'หวัด'), ('เด็ก', 'หวัด'), ('กระทบ', 'หวัด'), ('สังเวย', 'หวัด'), ('ไวรัส', 'หวัด'), ('ชี้', 'หวัด'), ('รับ', 'หวัด'), ('เป็น', 'หวัด'), ('คุม', 'หวัด'), ('ป่วย', 'หวัด'), ('เชื้อ', 'หวัด'), ('เตือน', 'หวัด'), ('โรค', 'หวัด'), ('มี', 'หวัด'), ('เรื่อง', 'หวัด'), ('แก้', 'หวัด'), ('สงสัย', 'หวัด'), ('พบ', 'หวัด'), ('หรือ', 'หวัด'), ('หวั่น', 'หวัด'), ('ไข้', 'หวัด'), ('ติด', 'หวัด'), ('กำกับ', 'หวัด'), ('ถ้า', 'หวัด'), ('ล้าง', 'หวัด'), ('เอา', 'หวัด'), ('ต้าน', 'หวัด'), ('แม้', 'หวัด'), ('"', 'หวัด'), ('วิกฤต', 'หวัด'), ('จาก', 'หวัด'), ('แล้ว', 'หวัด'), ('ข่าว', 'หวัด'), ('เพราะ', 'หวัด'), ('ป้องกัน', 'หวัด'), ('สถานการณ์', 'หวัด'), ('ศึกษา', 'หวัด'), ('สัมมนา', 'หวัด'), ('ภัย', 'หวัด'), ('รับมือ', 'หวัด'), ('วิกฤติ', 'หวัด'), ("'", 'หวัด')]
distinct word pair_row: 23
distinct word pair_row: [('หวัด', '

In [22]:
def getkneser_ney(data):
    unigram_count = Counter()
    bigram_count = Counter()
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            bigram_count.update([(w1,w2)])
            unigram_count.update([w1])
            
    kneser_ney_counts_row = defaultdict(Counter)
    kneser_ney_counts_col = defaultdict(Counter)
    
    for w1, w2 in bigram_count:
        # Set the count to 1 for patterns that occur in the training data
        kneser_ney_counts_row[w1][w2] = 1
        kneser_ney_counts_col[w2][w1] = 1
        
    return kneser_ney_counts_row, kneser_ney_counts_col

kneser_ney_counts_row, kneser_ney_counts_col = getkneser_ney(train)
display_col = set([(item, tword) for item in kneser_ney_counts_col[tword].keys()])
display_row = set([(tword, item) for item in kneser_ney_counts_row[tword].keys()])
print(f"distinct word : {tword}")
print(f'kneser_ney_counts_col: {len(kneser_ney_counts_col[tword])}')
print(f'kneser_ney_counts_col: {display_col}')
print(f'kneser_ney_counts_row: {len(kneser_ney_counts_row[tword])}')
print(f'kneser_ney_counts_row: {display_row}')

distinct word : หวัด
kneser_ney_counts_col: 48
kneser_ney_counts_col: {(None, 'หวัด'), ('ฆ่า', 'หวัด'), ('ไร้', 'หวัด'), ('ใช่', 'หวัด'), ('วัคซีน', 'หวัด'), ('คล้าย', 'หวัด'), (' ', 'หวัด'), ('เด็ก', 'หวัด'), ('กระทบ', 'หวัด'), ('สังเวย', 'หวัด'), ('ไวรัส', 'หวัด'), ('ชี้', 'หวัด'), ('รับ', 'หวัด'), ('เป็น', 'หวัด'), ('คุม', 'หวัด'), ('ป่วย', 'หวัด'), ('เชื้อ', 'หวัด'), ('เตือน', 'หวัด'), ('โรค', 'หวัด'), ('มี', 'หวัด'), ('เรื่อง', 'หวัด'), ('แก้', 'หวัด'), ('สงสัย', 'หวัด'), ('พบ', 'หวัด'), ('หรือ', 'หวัด'), ('หวั่น', 'หวัด'), ('ไข้', 'หวัด'), ('ติด', 'หวัด'), ('กำกับ', 'หวัด'), ('ถ้า', 'หวัด'), ('ล้าง', 'หวัด'), ('เอา', 'หวัด'), ('ต้าน', 'หวัด'), ('แม้', 'หวัด'), ('"', 'หวัด'), ('วิกฤต', 'หวัด'), ('จาก', 'หวัด'), ('แล้ว', 'หวัด'), ('ข่าว', 'หวัด'), ('เพราะ', 'หวัด'), ('ป้องกัน', 'หวัด'), ('สถานการณ์', 'หวัด'), ('ศึกษา', 'หวัด'), ('สัมมนา', 'หวัด'), ('ภัย', 'หวัด'), ('รับมือ', 'หวัด'), ('วิกฤติ', 'หวัด'), ("'", 'หวัด')}
kneser_ney_counts_row: 23
kneser_ney_counts_row: {('หวัด', 'นี้'

In [23]:
def getBigramWithKnerNeySmoothing(data):
    discount = 0.75
    unigram_count = Counter()
    bigram_count = Counter()
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            bigram_count[(w1,w2)] += 1
            unigram_count[w1] += 1

    # Distinct continuations of each w1 (row) and distinct histories of each w2 (col)
    kneser_ney_counts_row = defaultdict(Counter)
    kneser_ney_counts_col = defaultdict(Counter)

    for w1, w2 in bigram_count:
        # Set the count to 1 for patterns that occur in the training data
        kneser_ney_counts_row[w1][w2] = 1
        kneser_ney_counts_col[w2][w1] = 1

    model = defaultdict(lambda: 0.0)

    # Number of distinct bigram types in the training data
    total_distinct_platten = len(bigram_count)

    for w1,w2 in bigram_count:
        # Discounted bigram probability P(w2|w1)
        bigram = max(bigram_count[(w1,w2)] - discount, 0) / unigram_count[w1]
        # Continuation probability of w2: distinct words preceding w2 / distinct bigram types
        distinct_platten_given_word2 = len(kneser_ney_counts_col[w2])
        P_continuation = distinct_platten_given_word2 / total_distinct_platten
        # Interpolation weight lambda(w1): discount * distinct words following w1 / count(w1)
        distinct_platten_given_word1 = len(kneser_ney_counts_row[w1])
        lambda_ = discount * (distinct_platten_given_word1 / unigram_count[w1])
        # P(w2|w1) with Kneser-Ney smoothing
        model[(w1,w2)] = bigram + (lambda_ * P_continuation)
    return model
modelBigramWithKnerNeySmoothing = getBigramWithKnerNeySmoothing(train)

In [24]:
# Perplexity of the Kneser-Ney bigram model on the train and test data

print (perplexity(train,modelBigramWithKnerNeySmoothing))        
print (perplexity(test,modelBigramWithKnerNeySmoothing))

# 71.14054002208687
# 174.02464248000433

71.14054002208687
155.09274968738495


## TODO #8
## Neural LM 
Do this on the news corpus that we split into train and test sets at the beginning of this exercise.

In [25]:
#find the perplexity of the model
#there are many ways to do this. e.g.:
#https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/

In [26]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

import os    
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa

from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input, Dense,GRU,Reshape,TimeDistributed,Bidirectional,Dropout,Masking,LSTM
from keras.optimizers import Adam
from keras import backend as K                                                          
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras.models import load_model

### **Tokenization**

In [27]:
# integer encode sequences of words
tokenizer = Tokenizer() 
tokenizer.fit_on_texts(sentences)
encoded_train = tokenizer.texts_to_sequences(sentences)

In [28]:
# Map encoded words to word
word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens.')
data_maped = {v: k for k, v in word_index.items()}
def map_encoded_to_word(encoded):
    return [data_maped[word] for word in encoded]
print(f'Original sentence: {sentences[0]}')
print(f'Encoded sentence: {encoded_train[0]}')
print(f'Mapped sentence: {map_encoded_to_word(encoded_train[0])}')

Found 34684 unique tokens.
Original sentence: สงสัย|ติด|หวัด|นก| |อีก|คน|ยัง|น่า|ห่วง
Encoded sentence: [451, 212, 311, 312, 58, 19, 32, 190, 1117]
Mapped sentence: ['สงสัย', 'ติด', 'หวัด', 'นก', 'อีก', 'คน', 'ยัง', 'น่า', 'ห่วง']


### **Create Sequences**

In this section, we create sequences of words from the corpus and use them to train our neural language model. We choose line-based sequences: for each line of the corpus, every prefix of at least two words becomes one training sequence, with the last word of the prefix as the prediction target.
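
The line-based sequence construction can be sketched in plain Python on a tiny example. The helper names `prefix_sequences` and `pre_pad` and the integer ids are assumptions for illustration; the notebook itself relies on Keras `pad_sequences` for the padding step.

```python
def prefix_sequences(encoded_sentence):
    """Return every prefix of length >= 2; the last item of each prefix is the target."""
    return [encoded_sentence[: i + 1] for i in range(1, len(encoded_sentence))]

def pre_pad(sequence, max_length, pad_value=0):
    """Left-pad a sequence with pad_value up to max_length."""
    return [pad_value] * (max_length - len(sequence)) + list(sequence)

encoded = [451, 212, 311, 312]          # hypothetical integer-encoded sentence
prefixes = prefix_sequences(encoded)
max_len = max(len(p) for p in prefixes)
padded = [pre_pad(p, max_len) for p in prefixes]

# The last column is the word to predict; everything before it is the input.
for row in padded:
    print("input:", row[:-1], "target:", row[-1])
```

Padding on the left keeps the target word in the last column of every row, so inputs and targets can be separated with a single slice.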

In [29]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f'Vocabulary size: {vocab_size}')
# line-based sequences
sequences = list()
for sen in encoded_train:
    for i in range(1, len(sen)):
        sequence = sen[:i+1]
        sequences.append(sequence)
sequences = np.array(sequences, dtype='object')
print('Total Sequences: %d' % len(sequences))

Vocabulary size: 34685
Total Sequences: 1423511


#### **Padding**

In [30]:
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences_pad = pad_sequences(sequences, maxlen=max_length, padding='pre', dtype='int32', value=0.)
print(f'Max length: {max_length}')

Max length: 330


#### **Train Test Split**

In [31]:
X_train_nl, X_test_nl, y_train_nl, y_test_nl = train_test_split(sequences_pad[:,:-1], sequences_pad[:,-1], test_size=0.2, random_state=999, shuffle=False)
print(f'X shape: {X_train_nl.shape} and y shape: {y_train_nl.shape}')
print(f'Original X: {sequences_pad[4]}')
print(f'X: {X_train_nl[4]} and y: {y_train_nl[4]}')

X shape: (1138808, 329) and y shape: (1138808,)
Original X: [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

### **Create Model**

In [32]:
def NuralNetworkLM_Model(max_length, vocab_size,name='NuralNetworkLM_Model'):
    input = Input(shape=(max_length-1,))
    output = Embedding(vocab_size, 10, input_length=max_length-1)(input)
    output = LSTM(20, return_sequences=True)(output)
    output = LSTM(20)(output)
    output = Dense(vocab_size, activation='softmax')(output)
    model = Model(inputs=input, outputs=output, name=name)
    model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
    model.summary()
    return model

In [33]:
def NuralNetworkLMModel2(max_length, vocab_size):
    # define model
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=max_length-1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    return model

In [34]:
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from time import time
from datetime import timedelta
import keras

class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

def build_model(model, address = None,X = None, Y = None, x_val = None, y_val = None, batch_size = 32, epochs = 10):
    """
    Fit the model if the model checkpoint does not exist or else
    load it from that address.
    """
    if (not os.path.exists(address)):
        print(f'Model checkpoint does not exist. Building model and saving it to {address}...')

        losshistory = LossHistory()

        stop = EarlyStopping(monitor = 'val_loss', min_delta = 0, 
                             patience = 5, verbose = 1, mode = 'auto')
        save = ModelCheckpoint(address, monitor = 'val_loss', 
                               verbose = 0, save_best_only = True)
        callbacks = [stop, save,losshistory]

        start = time()
        history = model.fit(X, Y, batch_size = batch_size, 
                            epochs = epochs, verbose = 1,
                            validation_data = (x_val, y_val),
                            callbacks = callbacks)
        elapse = time() - start
        print('elapsed time: ', elapse)
        model_info = {'history': history, 'elapse': elapse, 'model': model}
        model.save(address)
    else:
        print(f'Model checkpoint exists. Loading model from {address}...')
        model = load_model(address)
        model_info = {'model': model}

    return model_info

def retrain_model(model, address = None,X = None, Y = None, x_val = None, y_val = None, batch_size = 32, epochs = 10):
    """
    Load the model checkpoint saved at `address` and continue training it.
    The `model` argument is ignored; the checkpoint is always reloaded.
    """
    if address is None or not os.path.exists(address):
        raise FileNotFoundError(f'Model checkpoint not found: {address}')

    model = load_model(address)
    stop = EarlyStopping(monitor = 'val_loss', min_delta = 0, 
                         patience = 5, verbose = 1, mode = 'auto')
    save = ModelCheckpoint(address, monitor = 'val_loss', 
                           verbose = 0, save_best_only = True)
    callbacks = [stop, save]

    start = time()
    history = model.fit(X, Y, batch_size = batch_size, 
                        epochs = epochs, verbose = 1,
                        validation_data = (x_val, y_val),
                        callbacks = callbacks)
    elapse = time() - start
    print('elapsed time: ', elapse)
    model_info = {'history': history, 'elapse': elapse, 'model': model}
    model.save(address)
    return model_info

### **Train Model**

In [35]:
# nuralNetworkLM_Model = NuralNetworkLM_Model(max_length, vocab_size, name='NuralNetworkLM_Model_loss')
# modelinfo = build_model(nuralNetworkLM_Model, address = nuralNetworkLM_Model.name, X = X_train_nl, Y = y_train_nl, x_val = X_test_nl, y_val = y_test_nl, batch_size = 32, epochs = 10)
nuralNetworkLM_Model = load_model('NuralNetworkLM_Model')
nuralNetworkLM_Model.summary()

Model: "NuralNetworkLM_Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 329)]             0         
                                                                 
 embedding_5 (Embedding)     (None, 329, 10)           346850    
                                                                 
 lstm_9 (LSTM)               (None, 329, 20)           2480      
                                                                 
 lstm_10 (LSTM)              (None, 20)                3280      
                                                                 
 dense_5 (Dense)             (None, 34685)             728385    
                                                                 
Total params: 1,080,995
Trainable params: 1,080,995
Non-trainable params: 0
_________________________________________________________________


In [36]:
# nuralNetworkLMModel2 = NuralNetworkLMModel2(max_length, vocab_size)
# nuralNetworkLMModel2._name = 'NuralNetworkLMModel'
# modelinfo2 = build_model(nuralNetworkLMModel2, address = nuralNetworkLMModel2.name, X = X_train_nl, Y = y_train_nl, x_val = X_test_nl, y_val = y_test_nl, batch_size = 32, epochs = 10)
nuralNetworkLMModel2 = load_model('NuralNetworkLMModel')
nuralNetworkLMModel2.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 329, 10)           346850    
                                                                 
 lstm (LSTM)                 (None, 50)                12200     
                                                                 
 dense (Dense)               (None, 34685)             1768935   
                                                                 
Total params: 2,127,985
Trainable params: 2,127,985
Non-trainable params: 0
_________________________________________________________________


#### **Save/load history**

In [37]:
sampletrain = X_train_nl[:1000]
samplelabel = y_train_nl[:1000]
sampletest = X_test_nl[:1000]
sampletestlabel = y_test_nl[:1000]

In [38]:
# # Not using
# import pickle
# with open('trainHistoryDictSample', 'wb') as file_pi:
#     pickle.dump(modelinfo['history'].history, file_pi)

# with open('trainHistoryDictSample', 'rb') as file_pi:
#     loadhis = pickle.load(file_pi)

### **Evaluate Model**

In [39]:
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pad (or truncate) the sequence to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre', dtype='int32', value=0.)
        # predict probabilities for each word and take the most likely one
        yhat = model.predict(encoded, verbose=0)
        yhat = np.argmax(yhat)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # Note: only the newly generated word is fed back as the next input,
        # so each prediction after the first is conditioned on a single word.
        in_text, result = out_word, result + ' ' + out_word
    return result

In [40]:
print(generate_seq(nuralNetworkLMModel2, tokenizer, max_length-1, 'สงสัย ว่า', 10))

สงสัย ว่า จะ มี การ แข่งขัน ที่ เกิด เหตุ ที่ เกิด เหตุ


In [41]:
print(generate_seq(nuralNetworkLM_Model, tokenizer, max_length-1, 'สงสัย ว่า', 10))

สงสัย ว่า จะ มี การ เมือง ที่ เกิด เหตุ ที่ เกิด เหตุ


### **Perplexity**

In [42]:
def calculate_perplexity(model, X, y):
    """
    Calculate the perplexity of the model on the given data.
    """
    loss, accuracy = model.evaluate(X, y, verbose=0)
    perplexity = np.exp(loss)
    return perplexity

In [43]:
print(f'Perplexity of NuralNetworkLM_Model: {calculate_perplexity(nuralNetworkLM_Model, sampletest, sampletestlabel)}')

Perplexity of NuralNetworkLM_Model: 189.80268473089666


In [44]:
print(f'Perplexity of NuralNetworkLM_Model: {calculate_perplexity(nuralNetworkLM_Model, X_test_nl, y_test_nl)}')

Perplexity of NuralNetworkLM_Model: 379.2037407204458
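
Keras reports sparse categorical cross-entropy as the mean negative natural-log probability of the true next word, so exponentiating the loss gives a per-token perplexity. The sketch below reproduces that relationship with made-up probabilities; `perplexity_from_probs` and the values are assumptions for illustration. Note that the neural figures are computed over the tokenizer's vocabulary and the 80:20 sequence split above, so they are not directly comparable to the n-gram perplexities.

```python
import math

def perplexity_from_probs(true_token_probs):
    """exp of the mean negative log-probability assigned to the true next tokens."""
    losses = [-math.log(p) for p in true_token_probs]
    mean_loss = sum(losses) / len(losses)   # mean cross-entropy in nats
    return math.exp(mean_loss)

# Hypothetical probabilities a model assigned to the correct next word of four samples.
probs = [0.1, 0.02, 0.5, 0.01]
print(perplexity_from_probs(probs))
```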
