<img src="sutd.png" alt="drawing" style="width:300px;"/>

## <center>50.040 Natural Language Processing, Summer 2020<center>
<center>**Due 19 June 2020, 5pm** <center>
Mini Project

**Write your student ID and name**


### STUDNET ID: 1002961

### Name: Wu Tianyu

### Students with whom you have discussed (if any): None

# Introduction

Language models are very useful for a wide range of applications, e.g., speech recognition and machine translation. Consider a sentence consisting of words $x_1, x_2, …, x_m$, where $m$ is the length of the sentence, the goal of language modeling is to model the probability of the sentence, where $m \geq 1$, $x_i \in V $ and $V$ is the vocabulary of the corpus:
$$p(x_1, x_2, …, x_m)$$
In this project, we are going to explore both statistical language model and neural language model on the [Wikitext-2](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) datasets. Download wikitext-2 word-level data and put it under the ``data`` folder.

## Statistical  Language Model

A simple way is to view words as independent random variables (i.e., zero-th order Markovian assumption). The joint probability can be written as:
$$p(x_1, x_2, …, x_m)=\prod_{i=1}^m p(x_i)$$
However, this model ignores the word order information, to account for which, under the first-order Markovian assumption, the joint probability can be written as:
$$p(x_0, x_1, x_2, …, x_m)= \prod_{i=1}^{m}p(x_i \mid x_{i-1})$$
Under the second-order Markovian assumption, the joint probability can be written as:
$$p(x_{-1}, x_0, x_1, x_2, …, x_m)= \prod_{i=1}^{m}p(x_i \mid x_{i-2}, x_{i-1})$$
Similar to what we did in HMM, we will assume that $x_{-1}=START, x_0=START, x_m = STOP$ in this definition, where $START, STOP$ are special symbols referring to the start and the end of a sentence.







### Parameter estimation

Let's use $count(u)$ to denote the number of times the unigram $u$ appears in the corpus, use $count(v, u)$ to denote the number of times the bigram $v, u$ appears in the corpus, and $count(w, v, u)$ the times the trigram $w, v, u$ appears in the corpus, $u \in V \cup STOP$ and $w, v \in V \cup START$.

And the parameters of the unigram, bigram and trigram models can be obtained using maximum likelihood estimation (MLE).

- In the unigram model, the parameters can be estimated as: $$p(u) = \frac {count(u)}{c}$$, where $c$ is the total number of words in the corpus.
- In the bigram model, the parameters can be estimated as:
$$p(u \mid v) = \frac{count(v, u)}{count(v)}$$
- In the trigram model, the parameters can be estimated as:
$$p(u \mid w, v) = \frac{count(w, v, u)}{count(w, v)}$$




In [1]:
%%javascript
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: "AMS" } }
});

<IPython.core.display.Javascript object>

### Smoothing the parameters
Note, it is likely that many parameters of bigram and trigram models will be 0 because the relevant bigrams and trigrams involved do not appear in the corpus. If you don't have a way to handle these 0 probabilities, all the sentences that include such bigrams or trigrams will have probabilities of 0.

We'll use a Add-k Smoothing method to fix this problem, the smoothed parameter can be estimated as:
\begin{equation}
p_{add-k}(u)= \frac{count(u)+k}{c+k|V^*|}
\end{equation}
\begin{equation}
p_{add-k}(u \mid v)= \frac{count(v, u)+k}{count(v)+k|V^*|}
\end{equation}
\begin{equation}
p_{add-k}(u \mid w, v)= \frac{count(w, v, u)+k}{count(w, v)+k|V^*|}
\end{equation}

where $k \in (0, 1)$ is the parameter of this approach, and $|V^*|$ is the size of the vocabulary $V^*$,here $V^*= V \cup STOP$. One way to choose the value of $k$ is by
optimizing the perplexity of the development set, namely to choose the value that minimizes the perplexity.

### Perplexity

Given a test set $D^{\prime}$ consisting of sentences $X^{(1)}, X^{(2)}, …, X^{(|D^{\prime}|)}$, each sentence $X^{(j)}$ consists of words $x_1^{(j)}, x_2^{(j)},…,x_{n_j}^{(j)}$, we can measure the probability of each sentence $s_i$, and the quality of the language model would be the probability it assigns to the entire set of test sentences, namely:
\begin{equation} 
\prod_j^{D^{\prime}}p(X^{(j)})
\end{equation}
Let's define average log2 probability as:
\begin{equation} 
l=\frac{1}{c^{\prime}}\sum_{j=1}^{|D^{\prime}|}log_2p(X^{(j)})
\end{equation}
$c^{\prime}$ is the total number of words in the test set, $D^{\prime}$ is the number of sentences. And the perplexity is defined as:
\begin{equation} 
perplexity=2^{-l}
\end{equation}

The lower the perplexity, the better the language model.

In [2]:
from collections import Counter, namedtuple
import itertools
import numpy as np

In [3]:
with open('data/wikitext-2/wiki.train.tokens', 'r', encoding='utf8') as f:
    text = f.readlines()
    train_sents = [line.lower().strip('\n').split() for line in text]
    train_sents = [s for s in train_sents if len(s)>0 and s[0] != '=']

In [4]:
print(train_sents[1])

['the', 'game', 'began', 'development', 'in', '2010', ',', 'carrying', 'over', 'a', 'large', 'portion', 'of', 'the', 'work', 'done', 'on', 'valkyria', 'chronicles', 'ii', '.', 'while', 'it', 'retained', 'the', 'standard', 'features', 'of', 'the', 'series', ',', 'it', 'also', 'underwent', 'multiple', 'adjustments', ',', 'such', 'as', 'making', 'the', 'game', 'more', '<unk>', 'for', 'series', 'newcomers', '.', 'character', 'designer', '<unk>', 'honjou', 'and', 'composer', 'hitoshi', 'sakimoto', 'both', 'returned', 'from', 'previous', 'entries', ',', 'along', 'with', 'valkyria', 'chronicles', 'ii', 'director', 'takeshi', 'ozawa', '.', 'a', 'large', 'team', 'of', 'writers', 'handled', 'the', 'script', '.', 'the', 'game', "'s", 'opening', 'theme', 'was', 'sung', 'by', 'may', "'n", '.']


### Question 1 [code][written]
1. Implement the function **"compute_ngram"** that computes n-grams in the corpus.
 (Do not take the START and STOP symbols into consideration for now.) 
 For n=1,2,3, the number of unique n-grams should be **28910/577343/1344047**, respectively.
2. List 10 most frequent unigrams, bigrams and trigrams as well as their counts.(Hint: use the built-in function .most_common in Counter class)

In [9]:
def compute_ngram(sents, n):
    '''
    Compute n-grams that appear in "sents".
    param:
        sents: list[list[str]] --- list of list of word strings
        n: int --- "n" gram
    return:
        ngram_set: set{str} --- a set of n-grams (no duplicate elements)
        ngram_dict: dict{ngram: counts} --- a dictionary that maps each ngram to its number occurence in "sents";
        This dict contains the parameters of our ngram model. E.g. if n=2, ngram_dict={('a','b'):10, ('b','c'):13}
        
        You may need to use "Counter", "tuple" function here.
    '''
    ngram_set = set()
    ngram_dict = dict()
    ### YOUR CODE HERE
    for sent in sents:
        for i in range(len(sent)-n+1):
            ngram = tuple(sent[i:i+n])
            if ngram not in ngram_dict:
                ngram_dict[ngram] = 1
                ngram_set.add(ngram)
            else:
                ngram_dict[ngram] += 1
                
    
    ### END OF YOUR CODE
    return ngram_set, ngram_dict

In [10]:
### ~28xxx
unigram_set, unigram_dict = compute_ngram(train_sents, 1)
print(len(unigram_set))

28910


In [11]:
### ~57xxxx
bigram_set, bigram_dict = compute_ngram(train_sents, 2)
print(len(bigram_set))

577343


In [12]:
### ~134xxxx
trigram_set, trigram_dict = compute_ngram(train_sents, 3)
print(len(trigram_set))

1344047


In [19]:
# List 10 most frequent unigrams, bigrams and trigrams as well as their counts.
unigram_dict_sorted = {k: v for k, v in sorted(unigram_dict.items(), key=lambda item: item[1], reverse=True)[:10]}
bigram_dict_sorted = {k: v for k, v in sorted(bigram_dict.items(), key=lambda item: item[1], reverse=True)[:10]}
trigram_dict_sorted = {k: v for k, v in sorted(trigram_dict.items(), key=lambda item: item[1], reverse=True)[:10]}
unigram_dict_sorted, bigram_dict_sorted, trigram_dict_sorted

({('the',): 130519,
  (',',): 99763,
  ('.',): 73388,
  ('of',): 56743,
  ('<unk>',): 53951,
  ('and',): 49940,
  ('in',): 44876,
  ('to',): 39462,
  ('a',): 36140,
  ('"',): 28285},
 {('of', 'the'): 17242,
  ('in', 'the'): 11778,
  (',', 'and'): 11643,
  ('.', 'the'): 11274,
  (',', 'the'): 8024,
  ('<unk>', ','): 7698,
  ('to', 'the'): 6009,
  ('on', 'the'): 4495,
  ('the', '<unk>'): 4389,
  ('and', 'the'): 4331},
 {(',', 'and', 'the'): 1393,
  (',', '<unk>', ','): 950,
  ('<unk>', ',', '<unk>'): 901,
  ('one', 'of', 'the'): 866,
  ('<unk>', ',', 'and'): 819,
  ('.', 'however', ','): 775,
  ('<unk>', '<unk>', ','): 745,
  ('.', 'in', 'the'): 726,
  ('.', 'it', 'was'): 698,
  ('the', 'united', 'states'): 666})

### Question 2 [code][written]
In this part, we take the START and STOP symbols into consideration. So we need to pad the **train_sents** as described in "Statistical Language Model" before we apply "compute_ngram" function. For example, given a sentence "I like NLP", in a bigram model, we need to pad it as "START I like NLP STOP", in a trigram model, we need to pad it as "START START I like NLP STOP".

1. Implement the ``pad_sents function``.
2. Pad ``train_sents``.
3. Apply ``compute_ngram`` function to these padded sents. 
4. Implement ``ngram_prob`` function. Compute the probability for each n-gram in the variable **ngrams** according to Eq.(1)(2)(3) in **"smoothing the parameters"** .List down the n-grams that have 0 probability. 



In [20]:
###############################################
ngrams = list()
with open(r'data/ngram.txt','r') as f:
    for line in f:
        ngrams.append(line.strip('\n').split())
print(ngrams)
###############################################

[['the', 'computer'], ['go', 'to'], ['have', 'had'], ['and', 'the'], ['can', 'sea'], ['a', 'number', 'of'], ['with', 'respect', 'to'], ['in', 'terms', 'of'], ['not', 'good', 'bad'], ['first', 'start', 'with']]


#### I didn't know that unigram doesn't have a STOP symbol, this is really crucial for getting the correct perplexity in Q3

In [96]:
START = '<START>'
STOP = '<STOP>'
###################################
def pad_sents(sents, n):
    '''
    Pad the sents according to n.
    params:
        sents: list[list[str]] --- list of sentences.
        n: int --- specify the padding type, 1-gram, 2-gram, or 3-gram.
    return:
        padded_sents: list[list[str]] --- list of padded sentences.
    '''
    padded_sents = None
    ### YOUR CODE HERE
    padded_sents = [[START]*(n-1) + sent + [STOP] for sent in sents] if n != 1 else sents
    ### END OF YOUR CODE
    return padded_sents

In [97]:
uni_sents = pad_sents(train_sents, 1)
bi_sents = pad_sents(train_sents, 2)
tri_sents = pad_sents(train_sents, 3)

In [98]:
unigram_set, unigram_dict = compute_ngram(uni_sents, 1)
bigram_set, bigram_dict = compute_ngram(bi_sents, 2)
trigram_set, trigram_dict = compute_ngram(tri_sents, 3)

In [99]:
### (28xxx, 58xxxx, 136xxxx)
len(unigram_set),len(bigram_set),len(trigram_set)

(28910, 580825, 1363266)

In [100]:
### ~ 200xxxx; total number of words in wikitext-2.train
num_words = sum([v for _,v in unigram_dict.items()])
print(num_words)

2007146


In [101]:
def ngram_prob(ngram, num_words, unigram_dic, bigram_dic, trigram_dic):
    '''
    params:
        ngram: list[str] --- a list that represents n-gram
        num_words: int --- total number of words
        unigram_dic: dict{ngram: counts} --- a dictionary that maps each 1-gram to its number of occurences in "sents";
        bigram_dic: dict{ngram: counts} --- a dictionary that maps each 2-gram to its number of occurence in "sents";
        trigram_dic: dict{ngram: counts} --- a dictionary that maps each 3-gram to its number occurence in "sents";
    return:
        prob: float --- probability of the "ngram"
    '''
    prob = None
    ### YOUR CODE HERE
    if len(ngram) == 1:
        prob = unigram_dic[tuple(ngram)]/num_words if tuple(ngram) in unigram_dic.keys() else 0
    elif len(ngram) == 2:
        SUM = sum([bigram_dic[bigram] for bigram in bigram_dic if bigram[0] == ngram[0]])
        prob = bigram_dic[tuple(ngram)]/SUM if tuple(ngram) in bigram_dic.keys() and SUM != 0 else 0
    elif len(ngram) == 3:
        SUM = sum([trigram_dic[trigram] for trigram in trigram_dic if tuple(trigram[:2]) == tuple(ngram[:2])])
        prob = trigram_dic[tuple(ngram)]/SUM if tuple(ngram) in trigram_dic.keys() and SUM != 0 else 0
    ### END OF YOUR CODE
    return prob

In [102]:
### ~9.96e-05
ngram_prob(ngrams[0], num_words,unigram_dict, bigram_dict, trigram_dict)

9.960235674499498e-05

In [103]:
ngram_prob(ngrams[5], num_words,unigram_dict, bigram_dict, trigram_dict)

0.9573170731707317

In [104]:
### List down the n-grams that have 0 probability. 
list(filter(lambda x: x[1]==0, [(ngram, ngram_prob(ngram, num_words,unigram_dict, bigram_dict, trigram_dict)) for ngram in ngrams]))

[(['can', 'sea'], 0),
 (['not', 'good', 'bad'], 0),
 (['first', 'start', 'with'], 0)]

In [105]:
[ngram for ngram in ngrams if ngram_prob(ngram, num_words,unigram_dict, bigram_dict, trigram_dict) == 0]

[['can', 'sea'], ['not', 'good', 'bad'], ['first', 'start', 'with']]

### Question 3 [code][written]

1. Implement ``smooth_ngram_prob`` function to estimate ngram probability with ``add-k`` smoothing technique. Compute the smoothed probabilities of each n-gram in the variable **"ngrams"** according to Eq.(1)(2)(3) in **"smoothing the parameters"** section.
2. Implement ``perplexity`` function to compute the perplexity of the corpus "**valid_sents**" according to the Equations (4),(5),(6) in **perplexity** section. The computation of $p(X^{(j)})$ depends on the n-gram model you choose. If you choose 2-gram model, then you need to calculate $p(X^{(j)})$ based on Eq.(2) in **smoothing the parameter** section. Hint: convert probability to log probability.
3. Try out different $k\in [0.1, 0.3, 0.5, 0.7, 0.9]$ and different n-gram model ($n=1,2,3$). Find the n-gram model and $k$ that gives the best perplexity on "**valid_sents**" (smaller is better).

In [106]:
with open('data/wikitext-2/wiki.valid.tokens', 'r', encoding='utf8') as f:
    text = f.readlines()
    valid_sents = [line.lower().strip('\n').split() for line in text]
    valid_sents = [s for s in valid_sents if len(s)>0 and s[0] != '=']

uni_valid_sents = pad_sents(valid_sents, 1)
bi_valid_sents = pad_sents(valid_sents, 2)
tri_valid_sents = pad_sents(valid_sents, 3)

### Smoothing the parameters
Note, it is likely that many parameters of bigram and trigram models will be 0 because the relevant bigrams and trigrams involved do not appear in the corpus. If you don't have a way to handle these 0 probabilities, all the sentences that include such bigrams or trigrams will have probabilities of 0.

We'll use a Add-k Smoothing method to fix this problem, the smoothed parameter can be estimated as:
\begin{equation}
p_{add-k}(u)= \frac{count(u)+k}{c+k|V^*|}
\end{equation}
\begin{equation}
p_{add-k}(u \mid v)= \frac{count(v, u)+k}{count(v)+k|V^*|}
\end{equation}
\begin{equation}
p_{add-k}(u \mid w, v)= \frac{count(w, v, u)+k}{count(w, v)+k|V^*|}
\end{equation}

where $k \in (0, 1)$ is the parameter of this approach, and $|V^*|$ is the size of the vocabulary $V^*$,here $V^*= V \cup STOP$. One way to choose the value of $k$ is by
optimizing the perplexity of the development set, namely to choose the value that minimizes the perplexity.


In [130]:
def smooth_ngram_prob(ngram, k, num_words, unigram_dic, bigram_dic, trigram_dic):
    '''
    params:
        ngram: list[str] --- a list that represents n-gram
        k: float 
        num_words: int --- total number of words
        unigram_dic: dict{ngram: counts} --- a dictionary that maps each 1-gram to its number of occurences in "sents";
        bigram_dic: dict{ngram: counts} --- a dictionary that maps each 2-gram to its number of occurence in "sents";
        trigram_dic: dict{ngram: counts} --- a dictionary that maps each 3-gram to its number occurence in "sents";
    return:
        s_prob: float --- probability of the "ngram"
    '''
    s_prob = 0
    V = len(unigram_dic) + 1 
    ### YOUR CODE HERE\、
    if len(ngram) == 1:
        s_prob = (unigram_dic[tuple(ngram)]+k)/(num_words+k*V) if tuple(ngram) in unigram_dic else k/(num_words+k*V)
    elif len(ngram) == 2:
        SUM = sum([bigram_dic[bigram] for bigram in bigram_dic if bigram[0] == ngram[0]])
        s_prob = (bigram_dic[tuple(ngram)]+k)/(SUM+k*V) if tuple(ngram) in bigram_dic else k/(SUM+k*V)
    elif len(ngram) == 3:
        SUM = sum([trigram_dic[trigram] for trigram in trigram_dic if tuple(trigram[:2]) == tuple(ngram[:2])])
        s_prob = (trigram_dic[tuple(ngram)]+k)/(SUM+k*V) if tuple(ngram) in trigram_dic else k/(SUM+k*V)
    ### END OF YOUR CODE
    return s_prob

In [137]:
def smooth_ngram_prob_fast(ngram, k, num_words, unigram_dic, bigram_dic, trigram_dic, unigram_dic_ref, bigram_dic_ref):
    '''
    params:
        ngram: list[str] --- a list that represents n-gram
        k: float 
        num_words: int --- total number of words
        unigram_dic: dict{ngram: counts} --- a dictionary that maps each 1-gram to its number of occurences in "sents";
        bigram_dic: dict{ngram: counts} --- a dictionary that maps each 2-gram to its number of occurence in "sents";
        trigram_dic: dict{ngram: counts} --- a dictionary that maps each 3-gram to its number occurence in "sents";
    return:
        s_prob: float --- probability of the "ngram"
    '''
    s_prob = 0
    V = len(unigram_dic) + 1 
    ### YOUR CODE HERE\、
    if len(ngram) == 1:
        s_prob = (unigram_dic[tuple(ngram)]+k)/(num_words+k*V) if tuple(ngram) in unigram_dic else k/(num_words+k*V)
    elif len(ngram) == 2:
        SUM = unigram_dic_ref[tuple(ngram[:1])] if tuple(ngram[:1]) in unigram_dic_ref else 0
        s_prob = (bigram_dic[tuple(ngram)]+k)/(SUM+k*V) if tuple(ngram) in bigram_dic else k/(SUM+k*V)
    elif len(ngram) == 3:
        SUM = bigram_dic_ref[tuple(ngram[:2])] if tuple(ngram[:2]) in bigram_dic_ref else 0
        s_prob = (trigram_dic[tuple(ngram)]+k)/(SUM+k*V) if tuple(ngram) in trigram_dic else k/(SUM+k*V)
    ### END OF YOUR CODE
    return s_prob

In [138]:
_, unigram_dic_ref = compute_ngram(bi_sents, 1)
_, bigram_dic_ref = compute_ngram(tri_sents, 2)

In [141]:
smooth_ngram_prob_fast(ngrams[0], 0.5, num_words, unigram_dict, bigram_dict, trigram_dict, unigram_dic_ref, bigram_dic_ref)

9.311982452086402e-05

In [131]:
### ~ 9.31e-05
smooth_ngram_prob(ngrams[0], 0.5, num_words, unigram_dict, bigram_dict, trigram_dict)

9.311982452086402e-05

### Perplexity

Given a test set $D^{\prime}$ consisting of sentences $X^{(1)}, X^{(2)}, …, X^{(|D^{\prime}|)}$, each sentence $X^{(j)}$ consists of words $x_1^{(j)}, x_2^{(j)},…,x_{n_j}^{(j)}$, we can measure the probability of each sentence $s_i$, and the quality of the language model would be the probability it assigns to the entire set of test sentences, namely:
\begin{equation} 
\prod_j^{D^{\prime}}p(X^{(j)})
\end{equation}
Let's define average log2 probability as:
\begin{equation} 
l=\frac{1}{c^{\prime}}\sum_{j=1}^{|D^{\prime}|}log_2p(X^{(j)})
\end{equation}
$c^{\prime}$ is the total number of words in the test set, $D^{\prime}$ is the number of sentences. And the perplexity is defined as:
\begin{equation} 
perplexity=2^{-l}
\end{equation}

The lower the perplexity, the better the language model.

In [132]:
def perplexity(n, k, num_words, valid_sents, unigram_dic, bigram_dic, trigram_dic):
    '''
    compute the perplexity of valid_sents
    params:
        n: int --- n-gram model you choose. 
        k: float --- smoothing parameter.
        num_words: int --- total number of words in the traning set.
        valid_sents: list[list[str]] --- list of sentences.
        unigram_dic: dict{ngram: counts} --- a dictionary that maps each 1-gram to its number of occurences in "sents";
        bigram_dic: dict{ngram: counts} --- a dictionary that maps each 2-gram to its number of occurence in "sents";
        trigram_dic: dict{ngram: counts} --- a dictionary that maps each 3-gram to its number occurence in "sents";
    return:
        ppl: float --- perplexity of valid_sents
    '''
    ppl = None
    ### YOUR CODE HERE
    SUM = 0
    for sent in valid_sents:
        for i in range(len(sent)-n+1):
            ngram = sent[i:i+n]
            SUM += np.log2(smooth_ngram_prob(ngram, k, num_words, unigram_dic, bigram_dic, trigram_dic))
    ppl = np.power(2, -SUM/(np.sum([len(sent) for sent in valid_sents] if n == 1 
                                   else [len(sent)-n for sent in valid_sents])))
    ### END OF YOUR CODE
    return ppl

In [142]:
def perplexity_fast(n, k, num_words, valid_sents, unigram_dic, bigram_dic, trigram_dic, unigram_dic_ref, bigram_dic_ref):
    '''
    compute the perplexity of valid_sents
    params:
        n: int --- n-gram model you choose. 
        k: float --- smoothing parameter.
        num_words: int --- total number of words in the traning set.
        valid_sents: list[list[str]] --- list of sentences.
        unigram_dic: dict{ngram: counts} --- a dictionary that maps each 1-gram to its number of occurences in "sents";
        bigram_dic: dict{ngram: counts} --- a dictionary that maps each 2-gram to its number of occurence in "sents";
        trigram_dic: dict{ngram: counts} --- a dictionary that maps each 3-gram to its number occurence in "sents";
    return:
        ppl: float --- perplexity of valid_sents
    '''
    ppl = None
    ### YOUR CODE HERE
    SUM = 0
    for sent in valid_sents:
        for i in range(len(sent)-n+1):
            ngram = sent[i:i+n]
            SUM += np.log2(smooth_ngram_prob_fast(ngram, k, num_words, unigram_dic, bigram_dic, trigram_dic, unigram_dic_ref, bigram_dic_ref))
    ppl = np.power(2, -SUM/(np.sum([len(sent) for sent in valid_sents] if n == 1 
                                   else [len(sent)-n for sent in valid_sents])))
    ### END OF YOUR CODE
    return ppl

In [133]:
### ~ 840
perplexity(1, 0.1, num_words, uni_valid_sents, unigram_dict, bigram_dict, trigram_dict)

840.7347306217125

In [143]:
perplexity_fast(1, 0.1, num_words, uni_valid_sents, unigram_dict, bigram_dict, trigram_dict, unigram_dic_ref, bigram_dic_ref)

840.7347306217125

In [128]:
uni_valid_sents[0][-1], bi_valid_sents[0][-1], bi_valid_sents[0][0], tri_valid_sents[0][-1], tri_valid_sents[0][0], \
tri_valid_sents[0][1]

('.', '<STOP>', '<START>', '<STOP>', '<START>', '<START>')

#### The original perplexity function is too slow for n=2 and n=3

In [185]:
n = [1,2,3]
k = [0.1, 0.3, 0.5, 0.7, 0.9]
### YOUR CODE HERE
dic = dict()
for n_candidate in n:
    for k_candidate in k:
        dic[(n_candidate, k_candidate)] = \
        perplexity_fast(n_candidate, k_candidate, num_words, uni_valid_sents, unigram_dict, bigram_dict, trigram_dict, unigram_dic_ref, bigram_dic_ref)
        print('({}, {}) is done'.format(n_candidate, k_candidate))
{k: v for k, v in sorted(dic.items(), key=lambda item: item[1], reverse=False)}
### END OF YOUR CODE

(1, 0.1) is done
(1, 0.3) is done
(1, 0.5) is done
(1, 0.7) is done
(1, 0.9) is done
(2, 0.1) is done
(2, 0.3) is done
(2, 0.5) is done
(2, 0.7) is done
(2, 0.9) is done
(3, 0.1) is done
(3, 0.3) is done
(3, 0.5) is done
(3, 0.7) is done
(3, 0.9) is done


{(1, 0.1): 840.7347306217125,
 (1, 0.3): 841.1427277044075,
 (1, 0.5): 841.5959678936326,
 (1, 0.7): 842.0904494786319,
 (1, 0.9): 842.6227084935349,
 (2, 0.1): 888.5413716843532,
 (2, 0.3): 1279.9314612281546,
 (2, 0.5): 1558.3253262946232,
 (2, 0.7): 1788.7398501180635,
 (2, 0.9): 1990.4960086390406,
 (3, 0.1): 6400.3643275228305,
 (3, 0.3): 8953.583675178434,
 (3, 0.5): 10503.03434155729,
 (3, 0.7): 11646.964069362453,
 (3, 0.9): 12559.23934272117}

### Question 4 [code]

Evaluate the perplexity of the test data **test_sents** based on the best n-gram model and $k$ you have found on the validation data (Q 3.3).

In [146]:
with open('data/wikitext-2/wiki.test.tokens', 'r', encoding='utf8') as f:
    text = f.readlines()
    test_sents = [line.lower().strip('\n').split() for line in text]
    test_sents = [s for s in test_sents if len(s)>0 and s[0] != '=']

uni_test_sents = pad_sents(test_sents, 1)
bi_test_sents = pad_sents(test_sents, 2)
tri_test_sents = pad_sents(test_sents, 3)

In [186]:
### YOUR CODE HERE
perplexity_fast(1, 0.1, num_words,bi_test_sents, unigram_dict, bigram_dict, trigram_dict, unigram_dic_ref, bigram_dic_ref)
### END OF YOUR CODE

968.3303647875488

## Neural Language Model (RNN)


<img src="LM.png" alt="drawing" style="width:500px;"/>

We will create a LSTM language model as shown in figure and train it on the Wikitext-2 dataset. 
The data generators (train\_iter, valid\_iter, test\_iter) have been provided. 
The word embeddings together with the parameters in the LSTM model will be learned from scratch.

[Pytorch](https://pytorch.org/tutorials/) and [torchtext](https://torchtext.readthedocs.io/en/latest/index.html#) are required in this part. Do not make any changes to the provided code unless you are requested to do so. 

### Question 5 [code]
- Implement the ``__init__`` function in ``LangModel`` class.
- Implement the ``forward`` function in ``LangModel`` class.
- Complete the training code in ``train`` function.
    Then complete the testing code in  ``test`` function and 
    compute the perplexity of the test data ``test_iter``. The test perplexity should be below 150.

In [152]:
import torchtext
import torch
import torch.nn.functional as F
from torchtext.datasets import WikiText2
from torch import nn, optim
from torchtext import data
from nltk import word_tokenize
import nltk
nltk.download('punkt')
torch.manual_seed(222)

[nltk_data] Downloading package punkt to /Users/wutianyu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<torch._C.Generator at 0x12bbe54f0>

In [154]:
def tokenizer(text):
    '''Tokenize a string to words'''
    return word_tokenize(text)

START = '<START>'
STOP = '<STOP>'
#Load and split data into three parts
TEXT = data.Field(lower=True, tokenize=tokenizer, init_token=START, eos_token=STOP)
train, valid, test = WikiText2.splits(TEXT) 

downloading wikitext-2-v1.zip


wikitext-2-v1.zip: 100%|██████████| 4.48M/4.48M [00:08<00:00, 524kB/s] 


extracting


In [155]:
#Build a vocabulary from the train dataset
TEXT.build_vocab(train)
print('Vocabulary size:', len(TEXT.vocab))

Vocabulary size: 28905


In [156]:
BATCH_SIZE = 64
# the length of a piece of text feeding to the RNN layer
BPTT_LEN = 32           
# train, validation, test data
train_iter, valid_iter, test_iter = data.BPTTIterator.splits((train, valid, test),
                                                                batch_size=BATCH_SIZE,
                                                                bptt_len=BPTT_LEN,
                                                                repeat=False)

In [157]:
#Generate a batch of train data
batch = next(iter(train_iter))
text, target = batch.text, batch.target
# print(batch.dataset[0].text[:32])
# print(text[0:3],target[:3])
print('Size of text tensor',text.size())
print('Size of target tensor',target.size())

Size of text tensor torch.Size([32, 64])
Size of target tensor torch.Size([32, 64])


In [158]:
print(text)
print(target)

tensor([[   15,    14,     9,  ...,  1679,  1998,   193],
        [   17,   790,     6,  ...,  3700,     7,  1720],
        [ 3879,  3320,   502,  ...,    66,     4,     7],
        ...,
        [   28,     6,    25,  ...,     5,     4,  4680],
        [    5,    50,    34,  ..., 16002, 16450,  3293],
        [ 1845,  1874,   123,  ...,  7971,  1307,   133]])
tensor([[   17,   790,     6,  ...,  3700,     7,  1720],
        [ 3879,  3320,   502,  ...,    66,     4,     7],
        [ 3899,   135,  2624,  ...,     4,  1696, 14518],
        ...,
        [    5,    50,    34,  ..., 16002, 16450,  3293],
        [ 1845,  1874,   123,  ...,  7971,  1307,   133],
        [ 1026,  8556,    13,  ...,    16,    13,    22]])


In [241]:
class LangModel(nn.Module):
    def __init__(self, lang_config):
        super(LangModel, self).__init__()
        self.vocab_size = lang_config['vocab_size']
        self.emb_size = lang_config['emb_size']
        self.hidden_size = lang_config['hidden_size']
        self.num_layer = lang_config['num_layer']
        
        self.embedding = None
        self.rnn = None
        self.linear = None
        
        ### TODO: 
        ###    1. Initialize 'self.embedding' with nn.Embedding function and 2 variables we have initialized for you
        ###    2. Initialize 'self.rnn' with nn.LSTM function and 3 variables we have initialized for you
        ###    3. Initialize 'self.linear' with nn.Linear function and 2 variables we have initialized for you
        ### Reference:
        ###        https://pytorch.org/docs/stable/nn.html
        
        ### YOUR CODE HERE (3 lines)
        self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.emb_size)
        self.rnn = nn.LSTM(input_size=self.emb_size, hidden_size=self.hidden_size, num_layers=self.num_layer, batch_first=False)
        self.linear = nn.Linear(self.hidden_size, self.vocab_size)
        ### END OF YOUR CODE
        
    def forward(self, batch_sents, hidden=None):
        '''
        params:
            batch_sents: torch.LongTensor of shape (sequence_len, batch_size)
        return:
            normalized_score: torch.FloatTensor of shape (sequence_len, batch_size, vocab_size)
        '''
        normalized_score = None
        hidden = hidden
        ### TODO:
        ###      1. Feed the batch_sents to self.embedding  
        ###      2. Feed the embeddings to self.rnn. Remember to pass "hidden" into self.rnn, even if it is None. But we will 
        ###         use "hidden" when implementing greedy search.
        ###      3. Apply linear transformation to the output of self.rnn
        ###      4. Apply 'F.log_softmax' to the output of linear transformation
        ###
        ### YOUR CODE HERE
        batch_sents = self.embedding(batch_sents)
        LSTM_output, hidden = self.rnn(batch_sents, hidden) 
        # output is of shape (seq_len, batch, num_directions * hidden_size)
        raw_output = self.linear(LSTM_output)
        normalized_score = F.log_softmax(raw_output, dim=-1)
        # normalized_Score is of shape (seq_len, batch, vocab_size)
        ### END OF YOUR CODE
        return normalized_score, hidden

In [242]:
def train(model, train_iter, valid_iter, vocab_size, criterion, optimizer, num_epochs):
    for n in range(num_epochs):
        train_loss = 0
        target_num = 0
        model.train()
        for batch in train_iter:
            
            text, targets = batch.text.to(device), batch.target.to(device)
            # target is of shape (seq_len, batch)
            loss = None
            
            ### we don't consider "hidden" here. So according to the default setting, "hidden" will be None
            ### YOU CODE HERE (~5 lines)
#             normalized_score = model(text)
#             loss_tensor = torch.gather(normalized_score, 2, targets.unsqueeze(2)).squeeze(2)
#             # loss_tensor is of shape (seq_len, batch)
#             loss = loss_tensor.mean()
            optimizer.zero_grad()
            output, hidden = model(text)
#             print(output.shape)
#             print(targets.shape)
            loss = criterion(output.view(-1, output.shape[2]), targets.view(-1))
            loss.backward()
            optimizer.step()
            ### END OF YOUR CODE
            ##########################################
            train_loss += loss.item() * targets.size(0) * targets.size(1)
            target_num += targets.size(0) * targets.size(1)

        train_loss /= target_num

        # monitor the loss of all the predictions
        val_loss = 0
        target_num = 0
        model.eval()
        for batch in valid_iter:
            text, targets = batch.text.to(device), batch.target.to(device)
            
            prediction,_ = model(text)
            loss = criterion(prediction.view(-1, vocab_size), targets.view(-1))
            
            val_loss += loss.item() * targets.size(0) * targets.size(1)
            target_num += targets.size(0) * targets.size(1)
        val_loss /= target_num

        print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(n+1, train_loss, val_loss))            

In [243]:
def test(model, vocab_size, criterion, test_iter):
    '''
    params: 
        model: LSTM model
        test_iter: test data
    return:
        ppl: perplexity 
    '''
    ppl = None
    test_loss = 0
    target_num = 0
    with torch.no_grad():
        for batch in test_iter:
            text, targets = batch.text.to(device), batch.target.to(device)

            prediction,_ = model(text)
            loss = criterion(prediction.view(-1, vocab_size), targets.view(-1))

            test_loss += loss.item() * targets.size(0) * targets.size(1)
            target_num += targets.size(0) * targets.size(1)

        test_loss /= target_num
        
        ### Compute perplexity according to "test_loss"
        ### Hint: Consider how the loss is computed.
        ### YOUR CODE HERE(1 line)
        ppl = np.exp(test_loss)
        ### END OF YOUR CODE
        return ppl

In [244]:
num_epochs=1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vocab_size = len(TEXT.vocab)

config = {'vocab_size':vocab_size,
         'emb_size':128,
         'hidden_size':128,
         'num_layer':1}

LM = LangModel(config)
LM = LM.to(device)

criterion = nn.NLLLoss(reduction='mean')
optimizer = optim.Adam(LM.parameters(), lr=1e-3, betas=(0.7, 0.99))

In [245]:
train(LM, train_iter, valid_iter, vocab_size, criterion, optimizer, num_epochs)

Epoch: 1, Training Loss: 6.0765, Validation Loss: 5.1898


In [246]:
# < 150
test(LM, vocab_size, criterion, test_iter)

156.07838789896783

### Question 6 [code]
When we use trained language model to generate a sentence given a start token, we can choose either ``greedy search`` or ``beam search``. 
<img src="greedy.png" alt="drawing" style="width:500px;"/>

As shown above, ``greedy search`` algorithm will pick the token which has the highest probability and feed it to the language model as input in the next time step. The model will generate ``max_len`` number of tokens at most.

- Implement ``word_greedy_search``
- **[optional]** Implement ``word_beam_search`` 

In [259]:
def word_greedy_search(model, start_token, max_len):
    '''
    param:
        model: nn.Module --- language model
        start_token: str --- e.g. 'he'
        max_len: int --- max number of tokens generated
    return:
        strings: list[str] --- list of tokens, e.g., ['he', 'was', 'a', 'member', 'of',...]
    '''
    model.eval()
    ID = TEXT.vocab.stoi[start_token]
    strings = [start_token]
    hidden = None
    
    ### You may find TEXT.vocab.itos useful.
    ### YOUR CODE HERE
    word_to_idx = {word: i for i, word in enumerate(TEXT.vocab.itos)}
    last_word = start_token
    while last_word != '<eos>' and len(strings) < max_len:
        normalized_score, hidden = model(torch.LongTensor([[word_to_idx[last_word]]]), hidden)
        # normalized_score is of shape torch.Size([1, 1, 28905])
        max_idx = torch.argmax(normalized_score, 2).item()
        last_word = TEXT.vocab.itos[max_idx]
        strings.append(last_word)
    ### END OF YOUR CODE 
    return strings

In [270]:
word_greedy_search(LM, 'he', 64)[:5]

['he', 'was', 'a', '<', 'unk']

In [371]:
# BeamNode = namedtuple('BeamNode', ['prev_node', 'prev_hidden', 'wordID', 'score', 'length'])
# LMNode = namedtuple('LMNode', ['sent', 'score'])

def word_beam_search(model, start_token, max_len, beam_size):
    model.eval()
    ID = TEXT.vocab.stoi[start_token]
    strings = [start_token]
    hidden = None
    
    ### You may find TEXT.vocab.itos useful.
    ### YOUR CODE HERE
    def calculate_score_next(model, sequence_words, beam_size, word_to_idx, words):
        if not sequence_words:
            return torch.Tensor([-float('inf')]*beam_size), torch.zeros((beam_size,), dtype=torch.long), [None]*beam_size
        idx_tensor = torch.LongTensor([word_to_idx[word] for word in sequence_words]).unsqueeze(1)
        hidden = None
        # normalized_Score is of shape (seq_len, 1, vocab_size)
        # idx_tensor is of shape (sequence_len, 1)
        
        normalized_score, hidden = model(idx_tensor, hidden)
        base_score = torch.sum(normalized_score.squeeze(1).argmax(dim=-1)[:-1])
        new_score, new_score_idx = torch.topk(normalized_score.squeeze(1)[-1,:], beam_size, dim=-1)
        return base_score+new_score, new_score_idx, [sequence_words+[words[new_score_idx[i]]] for i in range(beam_size)]
         # both of shape (beam_size, )
    
    word_to_idx = {word: i for i, word in enumerate(TEXT.vocab.itos)}
#     print(calculate_score_next(model, ['he', 'is'], beam_size, word_to_idx, TEXT.vocab.itos))

    def best_k(model, sequences_words, beam_size, word_to_idx, words):
        new_scores, new_score_idxes, candidate_sequence = [], [], []
        for i in range(beam_size):
            new_score, new_score_idx, candidates = \
            calculate_score_next(model, sequences_words[i], beam_size, word_to_idx, words)
            new_scores.append(new_score)
            new_score_idxes.append(new_score_idx)
            candidate_sequence.append(candidates)
        new_scores = torch.stack(new_scores).view(-1)
        new_score_idxes = torch.stack(new_score_idxes).view(-1)
        _, best_idx = torch.topk(new_scores, beam_size, dim=-1)
#         print(best_idx)                    
        new_sequences = []
        for idx in best_idx:
            m = (idx.item())//beam_size
            n = (idx.item())%beam_size
#             print(m,n)
#             print(candidate_sequence[m][n])
            new_sequences.append(candidate_sequence[m][n])
        return new_sequences
       
#     print(best_k(model, [['he', 'is'], ['she', 'is']], beam_size, word_to_idx, TEXT.vocab.itos))
    
    def check_stop(sequences_words):
        for sequence_words in sequences_words:
            if not sequence_words:
                continue
            if sequence_words[-1] == '<eos>':
#                 print('detected')
                return True
        return False
    
    sequences_words = [strings] + [None]*(beam_size-1)
    print(sequences_words)
    i = 1
    while not check_stop(sequences_words) and i < max_len:
        sequences_words = best_k(model, sequences_words, beam_size, word_to_idx, TEXT.vocab.itos)
#         print(i)
        print(sequences_words)
        i += 1
    for sequence_words in sequences_words:
        if sequence_words[-1] == '<eos>':
            return sequence_words
    return None
    ### END OF YOUR CODE 

In [372]:
word_beam_search(LM, 'he', 64, 2)

[['he'], None]
[['he', 'was'], ['he', 'had']]
[['he', 'had', 'been'], ['he', 'had', 'a']]
[['he', 'had', 'been', 'a'], ['he', 'had', 'been', 'the']]
[['he', 'had', 'been', 'a', '<'], ['he', 'had', 'been', 'the', '<']]
[['he', 'had', 'been', 'the', '<', 'unk'], ['he', 'had', 'been', 'a', '<', 'unk']]
[['he', 'had', 'been', 'a', '<', 'unk', '>'], ['he', 'had', 'been', 'the', '<', 'unk', '>']]
[['he', 'had', 'been', 'the', '<', 'unk', '>', ','], ['he', 'had', 'been', 'the', '<', 'unk', '>', '.']]
[['he', 'had', 'been', 'the', '<', 'unk', '>', '.', '<eos>'], ['he', 'had', 'been', 'the', '<', 'unk', '>', '.', 'the']]


['he', 'had', 'been', 'the', '<', 'unk', '>', '.', '<eos>']

In [373]:
word_beam_search(LM, 'he', 64, 3)

[['he'], None, None]
[['he', 'was'], ['he', 'had'], ['he', 'is']]
[['he', 'had', 'been'], ['he', 'is', 'a'], ['he', 'had', 'a']]
[['he', 'had', 'been', 'a'], ['he', 'had', 'been', 'the'], ['he', 'had', 'a', '<']]
[['he', 'had', 'been', 'a', '<'], ['he', 'had', 'been', 'the', '<'], ['he', 'had', 'been', 'a', '``']]
[['he', 'had', 'been', 'the', '<', 'unk'], ['he', 'had', 'been', 'a', '<', 'unk'], ['he', 'had', 'been', 'a', '``', '<']]
[['he', 'had', 'been', 'a', '<', 'unk', '>'], ['he', 'had', 'been', 'the', '<', 'unk', '>'], ['he', 'had', 'been', 'a', '``', '<', 'unk']]
[['he', 'had', 'been', 'a', '``', '<', 'unk', '>'], ['he', 'had', 'been', 'the', '<', 'unk', '>', ','], ['he', 'had', 'been', 'the', '<', 'unk', '>', '.']]
[['he', 'had', 'been', 'a', '``', '<', 'unk', '>', ','], ['he', 'had', 'been', 'a', '``', '<', 'unk', '>', '.'], ['he', 'had', 'been', 'a', '``', '<', 'unk', '>', '<']]
[['he', 'had', 'been', 'a', '``', '<', 'unk', '>', '<', 'unk'], ['he', 'had', 'been', 'a', '``', '

['he', 'had', 'been', 'a', '``', '<', 'unk', '>', '.', '<eos>']

# char-level LM

### Question 7 [code]
- Implement ``char_tokenizer``
- Implement ``CharLangModel``, ``char_train``, ``char_test``
- Implement ``char_greedy_search``

In [208]:
def char_tokenizer(string):
    '''
    param:
        string: str --- e.g. "I love this assignment"
    return:
        char_list: list[str] --- e.g. ['I', 'l', 'o', 'v', 'e', ' ', 't', 'h', 'i', 's', ...]
    '''
    char_list = None
    ### YOUR CODE HERE
    char_list = list(string)
    ### END OF YOUR CODE
    return char_list

In [209]:
test_str = 'test test test'
char_tokenizer(test_str)

['t', 'e', 's', 't', ' ', 't', 'e', 's', 't', ' ', 't', 'e', 's', 't']

In [210]:
CHAR_TEXT = data.Field(lower=True, tokenize=char_tokenizer ,init_token='<START>', eos_token='<STOP>')
ctrain, cvalid, ctest = WikiText2.splits(CHAR_TEXT)  

In [211]:
CHAR_TEXT.build_vocab(ctrain)
print('Vocabulary size:', len(CHAR_TEXT.vocab))

Vocabulary size: 247


In [212]:
BATCH_SIZE = 32
# the length of a piece of text feeding to the RNN layer
BPTT_LEN = 128        
# train, validation, test data
ctrain_iter, cvalid_iter, ctest_iter = data.BPTTIterator.splits((ctrain, cvalid, ctest),
                                                                batch_size=BATCH_SIZE,
                                                                bptt_len=BPTT_LEN,
                                                                repeat=False)

In [240]:
class CharLangModel(nn.Module):
    def __init__(self, lang_config):
        ### YOUR CODE HERE
        super(CharLangModel, self).__init__()
        self.vocab_size = lang_config['vocab_size']
        self.emb_size = lang_config['emb_size']
        self.hidden_size = lang_config['hidden_size']
        self.num_layer = lang_config['num_layer']
        
        self.embedding = None
        self.rnn = None
        self.linear = None
        
        ### TODO: 
        ###    1. Initialize 'self.embedding' with nn.Embedding function and 2 variables we have initialized for you
        ###    2. Initialize 'self.rnn' with nn.LSTM function and 3 variables we have initialized for you
        ###    3. Initialize 'self.linear' with nn.Linear function and 2 variables we have initialized for you
        ### Reference:
        ###        https://pytorch.org/docs/stable/nn.html
        
        ### YOUR CODE HERE (3 lines)
        self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.emb_size)
        self.rnn = nn.LSTM(input_size=self.emb_size, hidden_size=self.hidden_size, num_layers=self.num_layer, batch_first=False)
        self.linear = nn.Linear(self.hidden_size, self.vocab_size)
        ### END OF YOUR CODE
        
    def forward(self, batch_sents, hidden=None):
        ### YOUR CODE HERE
        normalized_score = None
        hidden = hidden
        ### TODO:
        ###      1. Feed the batch_sents to self.embedding  
        ###      2. Feed the embeddings to self.rnn. Remember to pass "hidden" into self.rnn, even if it is None. But we will 
        ###         use "hidden" when implementing greedy search.
        ###      3. Apply linear transformation to the output of self.rnn
        ###      4. Apply 'F.log_softmax' to the output of linear transformation
        ###
        ### YOUR CODE HERE
        batch_sents = self.embedding(batch_sents)
        LSTM_output, hidden = self.rnn(batch_sents, hidden) 
        # output is of shape (seq_len, batch, num_directions * hidden_size)
        raw_output = self.linear(LSTM_output)
        normalized_score = F.log_softmax(raw_output, dim=-1)
        # normalized_Score is of shape (seq_len, batch, vocab_size)
        ### END OF YOUR CODE
        return normalized_score, hidden        

In [249]:
def char_train(model, train_iter, valid_iter, criterion, optimizer, vocab_size, num_epochs):
    ### YOUR CODE HERE
    for n in range(num_epochs):
        train_loss = 0
        target_num = 0
        model.train()
        for batch in train_iter:
            
            text, targets = batch.text.to(device), batch.target.to(device)
            # target is of shape (seq_len, batch)
            loss = None
            
            ### we don't consider "hidden" here. So according to the default setting, "hidden" will be None
            ### YOU CODE HERE (~5 lines)
#             normalized_score = model(text)
#             loss_tensor = torch.gather(normalized_score, 2, targets.unsqueeze(2)).squeeze(2)
#             # loss_tensor is of shape (seq_len, batch)
#             loss = loss_tensor.mean()
            optimizer.zero_grad()
            output, hidden = model(text)
#             print(output.shape)
#             print(targets.shape)
            loss = criterion(output.view(-1, output.shape[2]), targets.view(-1))
            loss.backward()
            optimizer.step()
            ### END OF YOUR CODE
            ##########################################
            train_loss += loss.item() * targets.size(0) * targets.size(1)
            target_num += targets.size(0) * targets.size(1)

        train_loss /= target_num

        # monitor the loss of all the predictions
        val_loss = 0
        target_num = 0
        model.eval()
        for batch in valid_iter:
            text, targets = batch.text.to(device), batch.target.to(device)
            
            prediction,_ = model(text)
            loss = criterion(prediction.view(-1, vocab_size), targets.view(-1))
            
            val_loss += loss.item() * targets.size(0) * targets.size(1)
            target_num += targets.size(0) * targets.size(1)
        val_loss /= target_num

        print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(n+1, train_loss, val_loss))            

In [250]:
def char_test(model, vocab_size, test_iter, criterion):
    ### YOUR CODE HERE
    ppl = None
    test_loss = 0
    target_num = 0
    with torch.no_grad():
        for batch in test_iter:
            text, targets = batch.text.to(device), batch.target.to(device)

            prediction,_ = model(text)
            loss = criterion(prediction.view(-1, vocab_size), targets.view(-1))

            test_loss += loss.item() * targets.size(0) * targets.size(1)
            target_num += targets.size(0) * targets.size(1)

        test_loss /= target_num
        
        ### Compute perplexity according to "test_loss"
        ### Hint: Consider how the loss is computed.
        ### YOUR CODE HERE(1 line)
        ppl = np.exp(test_loss)
        ### END OF YOUR CODE
        return ppl

In [251]:
num_epochs=1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
char_vocab_size = len(CHAR_TEXT.vocab)

config = {'vocab_size':char_vocab_size,
         'emb_size':128,
         'hidden_size':128,
         'num_layer':1}

CLM = CharLangModel(config)
CLM = CLM.to(device)

char_criterion = nn.NLLLoss(reduction='mean')
char_optimizer = optim.Adam(CLM.parameters(), lr=1e-3, betas=(0.7, 0.99))

In [252]:
char_train(CLM, ctrain_iter, cvalid_iter, char_criterion, char_optimizer, char_vocab_size, num_epochs)

Epoch: 1, Training Loss: 1.8335, Validation Loss: 1.5419


In [253]:
# <10
char_test(CLM, char_vocab_size, ctest_iter, char_criterion)

4.634222916818645

In [265]:
CHAR_TEXT.vocab.itos[:5]

['<unk>', '<pad>', '<START>', '<STOP>', ' ']

In [262]:
def char_greedy_search(model, start_token, max_len):
    '''
    param:
        model: nn.Module --- language model
        start_token: str --- e.g. 'h'
        max_len: int --- max number of tokens generated
    return:
        strings: list[str] --- list of tokens, e.g., ['h', 'e', ' ', 'i', 's',...]
    '''   
    model.eval()
    ID = CHAR_TEXT.vocab.stoi[start_token]
    strings = [start_token]
    hidden = None
    
    ### You may find CHAR_TEXT.vocab.itos useful.
    ### YOUR CODE HERE
    char_to_idx = {char: i for i, char in enumerate(CHAR_TEXT.vocab.itos)}
    last_char = start_token
    while last_char != '<STOP>' and len(strings) < max_len:
        normalized_score, hidden = model(torch.LongTensor([[char_to_idx[last_char]]]), hidden)
        # normalized_score is of shape torch.Size([1, 1, 28905])
        max_idx = torch.argmax(normalized_score, 2).item()
        last_char = CHAR_TEXT.vocab.itos[max_idx]
        strings.append(last_char)
    ### END OF YOUR CODE 
    return strings

In [263]:
char_greedy_search(CLM, 'h', 64)

['h',
 'e',
 ' ',
 's',
 'e',
 'a',
 's',
 'e',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 't',
 'h',
 'e',
 ' ',
 's',
 'e',
 'c',
 't',
 'i',
 'o',
 'n',
 ' ',
 ',',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 't',
 'h',
 'e',
 ' ',
 's',
 'e',
 'c',
 't',
 'i',
 'o',
 'n',
 ' ',
 ',',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 't',
 'h',
 'e',
 ' ',
 's',
 'e',
 'c',
 't',
 'i',
 'o',
 'n',
 ' ',
 ',',
 ' ',
 'a']

In [374]:
char_greedy_search(CLM, 'b', 64)

['b',
 'e',
 ' ',
 't',
 'h',
 'e',
 ' ',
 's',
 'e',
 'c',
 't',
 'i',
 'o',
 'n',
 ' ',
 ',',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 't',
 'h',
 'e',
 ' ',
 's',
 'e',
 'c',
 't',
 'i',
 'o',
 'n',
 ' ',
 ',',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 't',
 'h',
 'e',
 ' ',
 's',
 'e',
 'c',
 't',
 'i',
 'o',
 'n',
 ' ',
 ',',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 't',
 'h',
 'e',
 ' ',
 's',
 'e',
 'c']

### Requirements:
- This is an individual report.
- Complete the code using Python.
- List students with whom you have discussed if there are any.
- Follow the honor code strictly.

### Free GPU Resources
We suggest that you run neural language models on machines with GPU(s). Google provides the free online platform [Colaboratory](https://colab.research.google.com/notebooks/welcome.ipynb), a research tool for machine learning education and research. It’s a Jupyter notebook environment that requires no setup to use as common packages have been  pre-installed. Google users can have access to a Tesla T4 GPU (approximately 15G memory). Note that when you connect to a GPU-based VM runtime, you are given a maximum of 12 hours at a time on the VM.

It is convenient to upload local Jupyter Notebook files and data to Colab, please refer to the [tutorial](https://colab.research.google.com/notebooks/io.ipynb). 

In addition, Microsoft also provides the online platform [Azure Notebooks](https://notebooks.azure.com/help/introduction) for research of data science and machine learning, there are free trials for new users with credits.