# Language Modeling Using N-grams

In this exercise, you will use NLTK, a natural language processing library for Python, to create a bigram language model and several variations of it. You will build one model of each of the following types and calculate its perplexity:
- Unigram Model
- Bigram Model
- Bigram Model with Laplace smoothing
- Bigram Model with Interpolation
- Bigram Model with Kneser-Ney Interpolation
- Neural LM (optional)



In [57]:
#download corpus
!wget --no-check-certificate https://github.com/ekapolc/nlp_2019/raw/master/HW4/BEST2010.zip
!unzip BEST2010.zip

In [58]:
#First we import the necessary libraries, such as math, nltk, and collections.
import math
import nltk
import io
import random
from random import shuffle
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
random.seed(999)

BEST2010 is a free Thai NLP dataset from NECTEC that is commonly used as a standard benchmark for various NLP tasks, including language modeling. It is divided into four domains: article, encyclopedia, news, and novel. The data is already tokenized, using '|' as a separator.

For example,

ตาม|ที่|นางประนอม ทองจันทร์| |กับ| |ด.ช.กิตติพงษ์ แหลมผักแว่น| |และ| |ด.ญ.กาญจนา กรองแก้ว| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|ขณะ|นี้|ยัง|ไม่|ดี|ขึ้น|
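
A minimal sketch of how one such line becomes a token list, mirroring the `strip()[:-1]` and `split('|')` calls used when the corpus is loaded. The short sample string is made up for illustration and is not taken from the corpus.

```python
# Hypothetical sample line in BEST2010 format: tokens separated by '|',
# with a trailing '|' before the newline.
raw_line = "ตาม|ที่| |ป่วย|สงสัย|\n"

# strip() removes the newline; [:-1] drops the trailing separator so that
# split('|') does not produce an empty final token.
tokens = raw_line.strip()[:-1].split("|")
print(tokens)
```

Note that a single space between two separators is kept as a token of its own.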

In [59]:
# We choose news domain as our dataset
best2010=[]
fp= io.open('BEST2010/news.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    best2010.append(line.strip()[:-1])
fp.close()
all_vocabulary =set()
total_word_count =0
for line in best2010:
    for word in line.split('|'):        
        all_vocabulary.add(word)
        total_word_count+=1

In [60]:
#For simplicity, we assume that each line is a sentence.
print ('Total sentences in BEST2010 news dataset :\t'+ str(len(best2010)))
print ('Total word counts in BEST2010 news dataset :\t'+ str(total_word_count))
print ('Total vocabulary in BEST2010 news dataset :\t'+ str(len(all_vocabulary)))

Total sentences in BEST2010 news dataset :	30969
Total word counts in BEST2010 news dataset :	1660190
Total vocabulary in BEST2010 news dataset :	35488


We separate our input into two sets, train and test data, with a 70:30 ratio.

In [61]:
sentences = best2010
# The data is separated to train and test set with 70:30 ratio.
train = sentences[:int(len(sentences)*0.7)]
test = sentences[int(len(sentences)*0.7):]

#Training data
train_vocabulary =set()
train_word_count =0
for line in train:
    for word in line.split('|'):        
        train_vocabulary.add(word)
        train_word_count+=1
print ('Total sentences in BEST2010 news training dataset :\t'+ str(len(train)))
print ('Total word counts in BEST2010 news training dataset :\t'+ str(train_word_count))
print ('Total vocabulary in BEST2010 news training dataset :\t'+ str(len(train_vocabulary)))
# We will use 1/vocab_size as a default value for unknown word
unk_value = math.pow(len(train_vocabulary),-1)

Total sentences in BEST2010 news training dataset :	21678
Total word counts in BEST2010 news training dataset :	1042797
Total vocabulary in BEST2010 news training dataset :	26240


# IMPORTANT. Use the dataset below to answer questions on MyCourseVille

The dataset has been tokenized for you (with the vocabulary above).

Please download the provided vocabulary to confirm that you are using the same vocabulary. **Otherwise, you might get wrong answers.**

In [62]:
!wget https://www.dropbox.com/s/jajdlqnp5h0ywvo/tokenized_wiki_sample.csv?dl=0 
!wget https://www.dropbox.com/s/n7w7te0zc1vz67n/train_vocabulary.pickle

In [63]:
import pickle
with open("train_vocabulary.pickle","rb") as f:
  old_vocab = pickle.load(f)

In [64]:
assert train_vocabulary == old_vocab #if nothing shows up, you are good to go

# Unigram

In this section, we will demonstrate how to build a unigram language model <br>
**Important note:** <br>
**\<s\>** = sentence start symbol <br>
**\</s\>** = sentence end symbol 

In [65]:
def getUnigramModel(data):
    model = defaultdict(lambda: 0)
    word_count =0
    for sentence in data:
        sentence +=  u'|</s>' #for unigram model we can always ignore <s>, since p(w0=<s>)=1
        for w1 in sentence.split('|'):
            model[w1] +=1.0
            word_count+=1
    for w1 in model:
        model[w1] = model[w1]/(word_count)
    return model

In [66]:
model = getUnigramModel(train)

In [67]:
def getLnValue(x):
    if x >0.0:
        return math.log(x)
    else:
        return math.log(unk_value)

In [68]:
#log-probability of 'นายก'
print(getLnValue(model[u'นายก']))
#for example, the log-probability of 'นายกรัฐมนตรี', which is an unknown word, is equal to
print(getLnValue(model[u'นายกรัฐมนตรี']))
#probability of 'นายก' 'ได้' 'ให้' 'สัมภาษณ์' 'กับ' 'สื่อ'
prob = getLnValue(model[u'นายก'])+getLnValue(model[u'ได้'])+ getLnValue(model[u'ให้'])+getLnValue(model[u'สัมภาษณ์'])+getLnValue(model[u'กับ'])+getLnValue(model[u'สื่อ'])+getLnValue(model['</s>'])
print ('Probability of a sentence', math.exp(prob))

-6.551526663995246
-10.175040243058024
Probability of a sentence 5.617210748667918e-18


# Perplexity

To compare language models, we need to calculate their perplexity. In this task, you will write the perplexity calculation for the unigram model. The resulting perplexity should be around 556.39 on the train data and 476.07 on the test data.
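
For reference, perplexity over a dataset of $N$ tokens (counting one `</s>` per sentence) is the exponential of the average negative log-probability. The symbols $N$, $w_i$ and $h_i$ (the history of $w_i$, which is empty for a unigram model) are notation introduced for this note and do not appear in the exercise text:

```latex
\mathrm{PP}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid h_i)\right)
```

A model that assigns the same probability to each of $V$ words scores a perplexity of exactly $V$, which makes a handy sanity check.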

## TODO #0
**Fill your name and ID here** <br>
**Name**:ปรวีร์
<br>
**ID**:6230314421

## TODO #1 **Calculate perplexity**

In [69]:
def calculate_sentence_ln_prob(sentence, model):
    word = sentence.split('|')
    ln_prob = .0
    for w in word:
        ln_prob += getLnValue(model[w])
    return ln_prob

def perplexity(test, model):
    ln_prob = .0
    word_count = .0
    for sentence in test:
        sentence +=  u'|</s>'
        ln_prob += calculate_sentence_ln_prob(sentence, model)
        word_count += len(sentence.split('|'))
    return math.exp(-ln_prob/word_count)

In [70]:
print(perplexity(train,model))
print(perplexity(test,model))
# 556.3925994212195
# 476.0687892303532

556.3925994212195
476.0687892303532


## MyCourseVille Question #1 
How much perplexity do you get with the Unigram model on the tokenized_wiki_sample dataset?

Please also leave the number below.

In [71]:
import pandas as pd
print(perplexity(pd.read_csv('tokenized_wiki_sample.csv')['tokenized'], model))

1215.7165914041525


## TODO #2 **Please explain why this model gives us such a high perplexity.**

**Your answer**: The unigram model computes the probability of each word on its own, so it cannot see any relationship with the preceding words. This leads to a high perplexity, which means that on average many guesses per token are needed to arrive at the correct word.

# Bigram

Next, you will create a language model that improves on the unigram model (which is not a strong baseline). Counting every pair of words that occurs in the corpus by hand is tedious, so we use a helper that NLTK provides for exactly this.

In [72]:
#example of nltk usage for bigram
sentence = 'I always search google for an answer .'

print('This is how nltk generate bigram.')
for w1,w2 in bigrams(sentence.split(), pad_right=True, pad_left=True):
    print (w1,w2)
print('None is used as a start and end of sentence symbol.')

This is how nltk generate bigram.
None I
I always
always search
search google
google for
for an
an answer
answer .
. None
None is used as a start and end of sentence symbol.


Now you should be able to implement a bigram model yourself. You must also write a new perplexity calculation for the bigram model. The resulting perplexity should be around 58.79 on the train data and 146.27 on the test data.
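
As a small worked illustration of the maximum-likelihood bigram estimate, P(w2 | w1) = c(w1, w2) / c(w1), here is a sketch on a made-up toy corpus. It uses plain `zip` instead of `nltk.bigrams` so that it runs without NLTK; the variable names are illustrative only.

```python
from collections import Counter

# Toy corpus (made up); each sentence is already split into tokens.
corpus = [["a", "b", "a"], ["a", "c"]]

context_count = Counter()  # c(w1)
pair_count = Counter()     # c(w1, w2)
for tokens in corpus:
    # None marks the sentence boundary, as nltk's padding does.
    padded = [None] + tokens + [None]
    for w1, w2 in zip(padded, padded[1:]):
        context_count[w1] += 1
        pair_count[w1, w2] += 1

# Maximum-likelihood estimate: P(w2 | w1) = c(w1, w2) / c(w1)
bigram_prob = {
    (w1, w2): count / context_count[w1]
    for (w1, w2), count in pair_count.items()
}
print(bigram_prob[("a", "b")])
```

For each context word, the probabilities of all observed continuations sum to 1, and any unseen pair has no entry at all, which is the sparsity problem that smoothing addresses.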

## TODO #3 **Create a Bigram Model**

In [73]:
def getBigramModel(data):
    ###FILL YOUR CODE HERE###
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            unigram_count[w1] += 1.0
            bigram_count[w1, w2] += 1.0
            
    bigram_prop = defaultdict(lambda: 0.0)
    for w1, w2 in bigram_count:
        bigram_prop[w1, w2]  = bigram_count[w1, w2] / unigram_count[w1]
        
    return bigram_prop

model = getBigramModel(train)

## TODO #4 **Calculate Perplexity for Bigram Model**



In [74]:
def calculate_sentence_ln_prob(sentence, model):
    ln_prob = .0
    for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
        ln_prob += getLnValue(model[w1,w2])
    return ln_prob

def perplexity(test,model):
    ln_prob = .0
    word_count = .0
    for sentence in test:
        ln_prob += calculate_sentence_ln_prob(sentence, model)
        word_count += len(sentence.split('|')) + 1
    return math.exp(-ln_prob/word_count)

In [75]:
print(perplexity(train,model))
print(perplexity(test, model))

# 58.78942889767147
# 146.26539331038614

58.78942889767147
146.26539331038614


## MyCourseVille Question #2 
How much perplexity do you get with the Bigram model on the tokenized_wiki_sample dataset?

Please also leave the number below.

In [76]:
print(perplexity(pd.read_csv('tokenized_wiki_sample.csv')['tokenized'], model))

744.3142671561902


# Smoothing

N-gram models usually suffer from sparsity: the training data does not contain every possible n-gram. Smoothing techniques alleviate this problem. In this section, you will implement two basic smoothing methods for the bigram model: Laplace smoothing and interpolation.

## TODO #5 **Bigram with Laplace smoothing (Add-One Smoothing)**
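
For reference, add-one smoothing adds one to every bigram count and adds the vocabulary size to the context count. Here $c(\cdot)$ denotes a training count and $V$ the number of distinct word types; both symbols are notation introduced for this note:

```latex
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
```

Under this formula, a bigram that never occurred after a seen context receives $1 / (c(w_{i-1}) + V)$ rather than zero.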

In [77]:
#Laplace Smoothing
def getBigramWithLaplaceSmoothing(data):
    #Fill code here
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            unigram_count[w1] += 1.0
            bigram_count[w1, w2] += 1.0
            
    bigram_prop = defaultdict(lambda: 0.0)
    for w1, w2 in bigram_count:
        bigram_prop[w1, w2]  = (bigram_count[w1, w2] + 1 )/(unigram_count[w1] + len(unigram_count))
        
    return bigram_prop

model = getBigramWithLaplaceSmoothing(train)
print(perplexity(train,model) )
print(perplexity(test, model))

# 974.8134581679766
# 1098.1622194979489

974.8134581679766
1098.1622194979489


## MyCourseVille Question #3 
How much perplexity do you get with the Bigram model with Laplace Smoothing on the tokenized_wiki_sample dataset?

Please also leave the number below.

In [78]:
print(perplexity(pd.read_csv('tokenized_wiki_sample.csv')['tokenized'], model))

3372.3685309482216


## TODO #6 **Bigram with Interpolation**
The lambda values are 0.7 for the bigram, 0.25 for the unigram, and 0.05 for the unknown-word term.
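
Written out with these weights, and assuming the unknown-word term is the uniform probability $1/V$ over a vocabulary of $V$ word types (the symbol $V$ is notation introduced for this note), the interpolated estimate is:

```latex
P_{\text{interp}}(w_i \mid w_{i-1}) = 0.7\, P_{\text{bigram}}(w_i \mid w_{i-1}) + 0.25\, P_{\text{unigram}}(w_i) + 0.05 \cdot \frac{1}{V}
```

The three weights sum to 1, so the mixture stays a valid distribution whenever each component is one.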

In [79]:
#interpolation
def getBigramWithInterpolation(data):
    #Fill code here
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            unigram_count[w1] += 1.0
            bigram_count[w1, w2] += 1.0
    
    unigram_prop = defaultdict(lambda: 0.0)
    bigram_prop = defaultdict(lambda: 0.0)
    model = defaultdict(lambda: 0.0)
    
    total_wc = sum(unigram_count.values())
    total_vocab = len(unigram_count)
    unk_prop = 1/total_vocab
    
    for w1, w2 in bigram_count:
        unigram_prop[w2] = unigram_count[w2]/total_wc
        bigram_prop[w1, w2] = bigram_count[w1, w2]/unigram_count[w1]
        model[w1, w2] = (0.7*bigram_prop[w1, w2]) + (0.25*unigram_prop[w2]) + (0.05 * unk_prop)
    
    return model
    
model = getBigramWithInterpolation(train)
print(perplexity(train,model))        
print(perplexity(test,model))

# 73.38409869825665
# 172.67485908813356

73.54392427626321
173.04288340104554


## MyCourseVille Question #4 
How much perplexity do you get with the Bigram model with Interpolation on the tokenized_wiki_sample dataset?

Please also leave the number below.

In [80]:
print(perplexity(pd.read_csv('tokenized_wiki_sample.csv')['tokenized'], model))

821.2170455684426


# Language modeling on multiple domains

Sometimes we do not have enough data to create a language model for a new domain. In that case, we can improvise by combining several models to improve results on the new domain.

In this exercise you will try to merge two language models from news and article domains to create a language model for the encyclopedia domain.

In [81]:
# create encyclopedia data
encyclo_data=[]
fp= io.open('BEST2010/encyclopedia.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    encyclo_data.append(line.strip()[:-1])
fp.close()

First, calculate the perplexity of your bigram model with interpolation trained on "news data" (train) and evaluated on "encyclopedia data" (test). The resulting perplexity should be around 727.35.

For your information, a bigram model with interpolation using "article data" (train) to test on "encyclopedia data" (test) has a perplexity of 505.80.

In [82]:
# print perplexity of the news-trained bigram model with interpolation on encyclopedia data
# 928.8461907933413
print(perplexity(encyclo_data,model))

728.8985182322733


## TODO #7 
Write a model that produces a perplexity of 450.0 or less on the encyclopedia data without using any encyclopedia data for training. (Hint: try combining a model trained on news data with a model trained on article data.)
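
The hint can also be read as mixing two separately trained models rather than pooling their training data. Below is a minimal sketch of that idea, assuming each model is a dict from `(w1, w2)` to a probability, as `getBigramWithInterpolation` returns. The names `mix_models` and `weight_a` are hypothetical, the toy tables are made up, and the weight would need tuning on held-out data.

```python
from collections import defaultdict

def mix_models(model_a, model_b, weight_a=0.5):
    """Linearly interpolate two bigram probability tables.

    model_a, model_b: dicts mapping (w1, w2) -> probability.
    weight_a: hypothetical mixing weight for model_a; model_b gets 1 - weight_a.
    """
    mixed = defaultdict(float)  # unseen pairs read back as 0.0
    for key in set(model_a) | set(model_b):
        mixed[key] = (weight_a * model_a.get(key, 0.0)
                      + (1.0 - weight_a) * model_b.get(key, 0.0))
    return mixed

# Toy tables standing in for a news-trained and an article-trained model.
news_model = {("a", "b"): 0.8, ("a", "c"): 0.2}
article_model = {("a", "b"): 0.4, ("a", "d"): 0.6}
combined = mix_models(news_model, article_model, weight_a=0.25)
print(combined[("a", "b")])
```

Because the result is a `defaultdict`, it can be indexed with `model[w1, w2]` in the same way as the other bigram tables in this notebook.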

In [83]:
# Fill code here
combine_data = []
fp= io.open('BEST2010/news.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    combine_data.append(line.strip()[:-1])
fp.close()
fp= io.open('BEST2010/article.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    combine_data.append(line.strip()[:-1])
fp.close()
combined_model = getBigramWithInterpolation(combine_data)
# 428.8525 (on combined data)
print('Perplexity of combine Bigram model with interpolation smoothing on encyclopedia test data',perplexity(encyclo_data, combined_model))

Perplexity of combine Bigram model with interpolation smoothing on encyclopedia test data 428.3308600402633


## TODO #8 
## Kneser-Ney on "News"

<!-- Reimplement equation 4.33 in SLP textbook (https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf) -->

Implement a bigram Kneser-Ney LM. The resulting perplexity should be around 75.51 on the train data and 183.35 on the test data. 
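
For orientation, here is a sketch of the standard interpolated Kneser-Ney bigram estimate on a made-up toy corpus. It combines a discounted bigram term, max(c(w1, w2) - d, 0) / c(w1), with a continuation probability (the share of distinct bigram types that end in w2), weighted by d times the number of distinct followers of w1 divided by c(w1). The function name, the toy data, and the handling of unseen contexts are assumptions made for illustration; this is not the reference solution, so its numbers need not match the expected values above.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(sentences, d=0.75):
    """Return prob(w1, w2) using interpolated Kneser-Ney for bigrams.

    sentences: list of token lists. None is used as boundary padding.
    d: absolute discount (0.75 is an assumed, commonly used value).
    """
    pair_count = Counter()
    for tokens in sentences:
        padded = [None] + tokens + [None]
        pair_count.update(zip(padded, padded[1:]))

    context_count = Counter()        # c(w1)
    followers = defaultdict(set)     # distinct words seen after w1
    predecessors = defaultdict(set)  # distinct words seen before w2
    for (w1, w2), count in pair_count.items():
        context_count[w1] += count
        followers[w1].add(w2)
        predecessors[w2].add(w1)
    n_bigram_types = len(pair_count)

    def prob(w1, w2):
        # Continuation probability: share of bigram types that end in w2.
        p_cont = len(predecessors.get(w2, ())) / n_bigram_types
        c1 = context_count.get(w1, 0)
        if c1 == 0:
            return p_cont  # unseen context: rely on the lower-order term only
        discounted = max(pair_count.get((w1, w2), 0) - d, 0) / c1
        lam = d * len(followers[w1]) / c1  # mass freed by discounting
        return discounted + lam * p_cont

    return prob

toy = [["a", "b", "a"], ["a", "c"]]
prob = kneser_ney_bigram(toy)
print(prob("a", "b"), prob("a", "a"))
```

The pair `("a", "a")` never occurs in the toy corpus but still receives probability mass through the continuation term, and the probabilities over all possible next tokens after `"a"` sum to 1.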


In [84]:
# Fill code here
def calculate_sentence_ln_prob(sentence, model):
    ln_prob = .0
    for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
        if (w1,w2) in model['bigram']:
            ln_prob += getLnValue(model['bigram'][w1,w2])
        else:
            ln_prob += getLnValue(model['unigram'][w2])
    return ln_prob

def perplexity(test,model):
    ln_prob = .0
    word_count = .0
    for sentence in test:
        ln_prob += calculate_sentence_ln_prob(sentence, model)
        word_count += len(sentence.split('|')) + 1
    return math.exp(-ln_prob/word_count)
#------------------------------------------
# create unigram & bigram counting tables
#------------------------------------------
def getBigramWithKneserNey(data):
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    bigram_cn_count = defaultdict(lambda: 0.0)
    unigram_cn_count = defaultdict(lambda: 0.0)

    #------------------------------------------
    # Kneser-Ney
    #------------------------------------------
    # 1) count all elements (use the `data` argument, not the global `train`)
    for sentence in data:
        for w1,w2 in bigrams(sentence.split('|'), pad_right=True, pad_left=True):
            unigram_count[w1] += 1.0
            bigram_count[w1, w2] += 1.0
    # 2) lower-order term: discounted unigram probability plus a small uniform
    #    share; it stands in for the continuation probability here
    d = 0.75
    unk_value = 1/len(unigram_count)
    all_words = sum(unigram_count.values())
    for w1 in unigram_count:
        unigram_cn_count[w1] = ( max(unigram_count[w1] - d, 0) / all_words ) + (d/all_words) * (unk_value)
    # 3) discounted bigram probability plus d/c(w1) times the lower-order term of w1
    for w1,w2 in bigram_count:
        bigram_cn_count[w1,w2] = ( (max(bigram_count[w1,w2] - d, 0) ) / unigram_count[w1])+ (d/unigram_count[w1]) * unigram_cn_count[w1]
    return {'bigram':bigram_cn_count, 'unigram':unigram_cn_count}

model = getBigramWithKneserNey(train)

print (perplexity(train,model))        
print (perplexity(test,model))

# 75.5096977631044
# 183.34798416569242

75.30841982599252
113.1949457396857


## MyCourseVille Question #5 
How much perplexity do you get with the Kneser-Ney model on the tokenized_wiki_sample dataset?

Please also leave the number below.

In [85]:
print(perplexity(pd.read_csv('tokenized_wiki_sample.csv')['tokenized'], model))

579.768937177985


# Neural LM

## TODO #9 (Optional)


We will be using PyTorch Lightning (PL) to implement our neural LM and torchtext to create our vocabulary.

PL makes our life easier by reducing the amount of boilerplate code we need to write.

In [86]:
!pip install pytorch-lightning wandb




In [87]:
!wandb login

wandb: Currently logged in as: poraree. Use `wandb login --relogin` to force relogin


In [88]:
import torchtext
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
import pytorch_lightning as pl
from tqdm import tqdm

ModuleNotFoundError: No module named 'torchtext'

In [None]:
# FILL CODE HERE#
vocab = torchtext.vocab.build_vocab_from_iterator(...)
vocab.insert_token('<unk>', 0) #add <unk> token to index 0
vocab.insert_token('<eos>', 1) #add <eos> token to index 1
vocab.set_default_index(vocab['<unk>']) #make index 0 the default index (when encountering unknown words)
print(f"Vocabulary Size: {len(vocab)}")                         
print(vocab.get_itos()[:10]) #get first 10 words  

In [None]:
#the standard PyTorch dataset
class TextDataset(Dataset):
  def __init__(self, data, seq_len = 128):
  # TODO: create a text dataset with a max sequence length of 128 tokens
  # to use with the PyTorch DataLoader. Don't forget to add an <eos> token.
  # For efficiency, you can fill sequences that are shorter than
  # max seq_len with tokens from other sentences (not necessary; you can just use a pad token).

  #  now data looks like this
    #  [sent1, 
    #   sent2,
    #   ...,
    #  ]

  # 1. encode the sentences
  # 2. add eos token
  # 3. pad/fill the sequences
  # so that it looks like this
  # [ [1,2,3, ... , 128] (this is just an example, not actual input_ids)
  #   [1,2,3, ... , 128]
  #   [1,2,3, ... , 128]
  # ]

  # FILL CODE HERE
    self.encoded = ...

  def __getitem__(self, idx):
    return self.encoded[idx]

  def __len__(self):
    return len(self.encoded)
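
One possible way to do the "fill with tokens from other sentences" step described in the TODO is sketched below in plain Python. The function name and all parameter names are hypothetical, and `pad_id` is an assumed padding index that the notebook itself does not define. In the notebook, the resulting rows would still need converting to a `torch.LongTensor` before being stored in `self.encoded`.

```python
def pack_token_ids(encoded_sentences, seq_len, eos_id, pad_id):
    """Concatenate sentences (each followed by eos_id) and cut into rows of seq_len.

    The final row is right-padded with pad_id. All names are illustrative.
    """
    stream = []
    for ids in encoded_sentences:
        stream.extend(ids)
        stream.append(eos_id)  # mark the end of each sentence
    # Slice the flat token stream into fixed-length rows.
    rows = [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]
    if rows and len(rows[-1]) < seq_len:
        rows[-1] = rows[-1] + [pad_id] * (seq_len - len(rows[-1]))
    return rows

# Toy ids (made up): two sentences packed into rows of 4 tokens.
rows = pack_token_ids([[5, 6, 7], [8, 9]], seq_len=4, eos_id=1, pad_id=0)
print(rows)
```

Packing this way wastes fewer positions on padding than padding every sentence separately, at the cost of rows that can start mid-sentence.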

In [None]:
#we create a datamodule class to manage everything about our dataset.
class TextDataModule(pl.LightningDataModule):

  def __init__(self, train_data, test_data, seq_len, batch_size, num_workers=0):
      super().__init__()
      # note that we don't use validation here. 
      # This is because:
      #   1. In practice, language modelling requires A LOT of data (as of 2022, researchers use up to multiple TBs of text). 
      #      Therefore, it won't overfit on training data anyway (there is research showing that SOTA LMs today are actually undertrained, e.g. https://arxiv.org/pdf/2203.15556.pdf)
      #   2. This is just an example
      self.train_data = train_data
      self.test_data = test_data
      self.seq_len = seq_len
      self.batch_size = batch_size
      self.num_workers = num_workers
 

  def setup(self, stage: str):
    # this is actually not the proper way to use this method
    # https://pytorch-lightning.readthedocs.io/en/stable/data/datamodule.html#what-is-a-datamodule
    # the proper way is to setup data here
    # e.g. 
    # if stage == 'train':
    #   self.train_data = load_data()
    pass

  def train_dataloader(self):
     
      # FILL CODE HERE
      # create your own dataset and dataloader
      # you can refer to this guide https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
      train_dataset = TextDataset(...)
      train_loader = DataLoader(...)

      return train_loader
    
  def test_dataloader(self):
      
      # FILL CODE HERE
      # create your own dataset and dataloader
      test_dataset = TextDataset(...)
      test_loader = DataLoader(...)
      
      return test_loader

In [None]:
class LSTM(pl.LightningModule):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, learning_rate, criterion):
                
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim

        # add layers into your model.
        # You will need at least one of each of the following:
        # 1. embedding layer (nn.Embedding)
        # 2. RNN layer (nn.RNN, nn.LSTM, ..)
        # 3. Dropout layer (optional but recommended; nn.Dropout)
        # 4. Linear layer (nn.Linear)
        # FILL CODE HERE

        self.embedding = nn.Embedding(...)

        self.learning_rate = learning_rate
        self.criterion = criterion

        

    def forward(self, src): #standard model forward pass
      # Write the forward pass logic.
      # e.g.
      # output1 = self.layer_1(input)
      # output2 = self.layer_2(output1)
      # ...
      # return output2; the output shape should be (batch_size, seq_len-1, vocab_size)
      # FILL CODE HERE

    def training_step(self, batch, batch_idx): #all we need to write is how is the loss calculated

        src = batch[:, :-1]  #[batch_size (64) , seq_len-1 (127)] except last words
        target = batch[:, 1:] #[batch_size (64) , seq_len-1 (127)] except first words
        prediction = self(src) #do forward pass
        prediction = prediction.reshape(-1, prediction.size(-1)) #flatten to [batch_size*(seq_len-1), vocab] without relying on a global vocab_size
        target = target.reshape(-1)
        loss = self.criterion(prediction, target)
        self.log("train_loss", loss)
        return loss #PL does loss.backward() for you

    def test_step(self, batch, batch_idx):
        #PL calls model.eval() for you
        src = batch[:, :-1]  #[batch_size (64) , seq_len-1 (127)] except last words
        target = batch[:, 1:] #[batch_size (64) , seq_len-1 (127)] except first words
        with torch.no_grad(): #disable gradient calculation for faster inference (this is already called under the hood but just for clarity)
          prediction = self(src) #[batch_size (64), seq_len-1 (127) , vocab size (9000)]
        prediction = prediction.reshape(-1, prediction.size(-1)) #[batch_size*(seq_len-1) (8128) , vocab]
        target = target.reshape(-1) #[batch_size (64), seq_len-1 (127)] -> [batch_size*(seq_len-1) (8128)]
        loss = self.criterion(prediction, target)
        self.log("test_loss", loss)
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.learning_rate)

In [None]:
seq_len = 128
batch_size = 64
vocab_size = len(vocab)

#Feel free to customize your model
embedding_dim = 200             
hidden_dim = 512          
num_layers = 3                  
dropout_rate = 0.65              
lr = 1e-3       
  
# FILL CODE HERE
data_module = TextDataModule(..., ..., seq_len=seq_len, batch_size=batch_size,num_workers=0)

In [None]:
criterion = nn.CrossEntropyLoss()
model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate, lr, criterion)

In [None]:
from pytorch_lightning import Trainer

In [None]:
from pytorch_lightning.loggers import WandbLogger
wandb_logger = WandbLogger(project=...)

In [None]:
wandb_logger.experiment.config.update({"vocab_size": vocab_size, 
                                       "embedding_dim": embedding_dim,
                                       "hidden_dim": hidden_dim,
                                       "num_layers": num_layers,
                                       "dropout_rate": dropout_rate,
                                       "lr": lr,
                                       })

In [None]:
trainer = Trainer(
    max_epochs=20,
    accelerator="gpu", #replaces the deprecated gpus=1 argument
    devices=1,
    logger=wandb_logger #with Pytorch Lightning, this is all we need to do to log simple metrics such as train loss
    #for more logging, refer to https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.loggers.wandb.html
)

In [None]:
#this will probably take around 10min?
trainer.fit(model, data_module)

In [None]:
test_result = trainer.test(model, data_module)

In [None]:
import numpy as np

The perplexity of a 3-layer LSTM should be around 90 (winner method).

In [None]:
#Calculate perplexity
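
A minimal sketch of this step, assuming the logged `test_loss` is the mean per-token cross-entropy in nats (the default behaviour of `nn.CrossEntropyLoss`), so that perplexity is its exponential. The helper name and the example value are hypothetical; in the notebook the loss would come from `test_result[0]["test_loss"]`.

```python
import math

def perplexity_from_loss(mean_nll):
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll)

# Hypothetical value standing in for test_result[0]["test_loss"].
example_loss = 4.5
print(perplexity_from_loss(example_loss))
```

Two caveats apply: the logged value is an average of per-batch losses, which equals the token-level average only when every batch holds the same number of tokens, and padding positions are counted unless the criterion is built with an `ignore_index`.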

Let's test our model by generating some text.

In [None]:
model.eval() #disable dropout

In [None]:
itos = vocab.get_itos() #integer to string

In [None]:
def generate_seq(context, itos, max_new_token = 10):
  encoded = vocab(context)
  with torch.no_grad():
      for i in range(max_new_token):
          src = torch.LongTensor([encoded]).to(model.device)
          prediction = model(src)
          probs = torch.softmax(prediction[:, -1] / 1, dim=-1)  
          prediction = torch.multinomial(probs, num_samples=1).item()    
          
          while prediction == vocab['<unk>']:
              prediction = torch.multinomial(probs, num_samples=1).item()

          if prediction == vocab['<eos>']:
              break

          encoded.append(prediction)

  return "|".join([itos[e] for e in encoded])

In [None]:
context = ["วัน", "จันทร์"]
generate_seq(context, itos, 20)