### This post explores the N-gram model in NLP using Dickens's famous A Tale of Two Cities.

#### The N-gram is a very basic model based on a Markov chain of order n - 1 and has several weaknesses:
 1. When n is small, accuracy is low; when n is large, the representation becomes too sparse and requires a great deal of space and memory.
 2. It does not account for how frequent some common words are --> could use TF-IDF.
 3. It does not consider the similarity between words, so it ignores semantic meaning.
 4. It does not condition the next word on words further back than its window --> widening the window (bigram/trigram) helps, but quickly becomes too sparse.
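The sparsity problem in the first point can be made concrete by comparing how many distinct character n-grams actually occur in a text with how many are possible. This is a minimal sketch, not part of the original notebook: `ngram_counts` is a hypothetical helper, and the short string stands in for the real corpus.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every window of n consecutive items in a sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Stand-in text (assumption): any string or list of tokens works here.
text = "it was the best of times, it was the worst of times"
vocab_size = len(set(text))

for n in (1, 2, 3):
    observed = len(ngram_counts(text, n))
    possible = vocab_size ** n  # every combination of n symbols
    print(f"n={n}: {observed} distinct n-grams observed out of {possible} possible")
```

The number of possible n-grams grows as `vocab_size ** n`, while the number observed is bounded by the length of the text, so the fraction of n-grams with a non-zero count shrinks quickly as n grows.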

For example, when n = 2, the probability of a four-word sequence w_1, w_2, w_3, w_4 is approximated as:

$$
\begin{aligned}
P(w_1, w_2, w_3, w_4)
&= P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_1, w_2, w_3)\\
&\approx P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) P(w_4 \mid w_3)
\end{aligned}
$$


When n = 1, 2, or 3, the model's probability for the four-word sequence w_1, w_2, w_3, w_4 is, respectively:

$$
\begin{aligned}
P(w_1, w_2, w_3, w_4) &=  P(w_1) P(w_2) P(w_3) P(w_4) ,\\
P(w_1, w_2, w_3, w_4) &=  P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) P(w_4 \mid w_3) ,\\
P(w_1, w_2, w_3, w_4) &=  P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_2, w_3) .
\end{aligned}
$$
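The n = 2 factorization can be estimated directly from counts. The sketch below assumes plain maximum-likelihood estimates with no smoothing, so any unseen bigram drives the whole product to zero. `bigram_sequence_prob` and the toy corpus are illustrative names, not part of the original notebook.

```python
from collections import Counter


def bigram_sequence_prob(tokens, sequence):
    """Estimate P(w_1) * prod_t P(w_t | w_{t-1}) from raw counts (no smoothing)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Number of times each token appears as the first element of a bigram.
    prefix_counts = Counter(tokens[:-1])

    prob = unigrams[sequence[0]] / len(tokens)  # P(w_1)
    for prev, cur in zip(sequence, sequence[1:]):
        if prefix_counts[prev] == 0:
            return 0.0  # the context was never seen, so the estimate is zero
        prob *= bigrams[(prev, cur)] / prefix_counts[prev]  # P(cur | prev)
    return prob


# Toy word-level corpus (assumption): a shortened version of the opening line.
toy_corpus = "it was the best of times it was the worst of times".split()
print(bigram_sequence_prob(toy_corpus, ["it", "was", "the", "best"]))
```

Because the estimates are unsmoothed, sequences containing a bigram that never occurs in the corpus get probability zero, which is one practical face of the sparsity weakness listed above.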


In [2]:
with open('./tale-of-two-cities.txt') as f:
    corpus_chars = f.read()

In [2]:
print(len(corpus_chars))

758498


In [3]:
print(corpus_chars[: 10000])

  IT WAS the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness, it was the epoch of
belief, it was the epoch of incredulity, it was the season of Light,
it was the season of Darkness, it was the spring of hope, it was the
winter of despair, we had everything before us, we had nothing
before us, we were all going direct to Heaven, we were all going
direct the other way- in short, the period was so far like the present
period, that some of its noisiest authorities insisted on its being
received, for good or for evil, in the superlative degree of
comparison only.
  There were a king with a large jaw and a queen with a plain face, on
the throne of England; there were a king with a large jaw and a
queen with a fair face, on the throne of France. In both countries
it was clearer than crystal to the lords of the State preserves of
loaves and fishes, that things in general were settled for ever.
  It was the year of Our Lord one thousand seven hu

In [4]:
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ').lower()
corpus_chars = corpus_chars[: 20000]

In [5]:
corpus_chars



### Create a vocabulary mapping each character to an index

In [10]:
idx_to_char = list(set(corpus_chars))  # unique characters; the list position of each char is its index
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # inverse mapping from char to index
vocab_size = len(char_to_idx)
print(vocab_size)

corpus_indices = [char_to_idx[char] for char in corpus_chars]  # get list of indices for every char
sample = corpus_indices[: 1000]

print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)

38
chars:   it was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.   there were a king with a large jaw and a queen with a plain face, on the throne of england; there were a king with a large jaw and a queen with a fair face, on the throne of france. in both countries it was clearer than crystal to the lords of the state preserves of loaves and fishes, that things in general were settled for ever.   it was the year of our lord one thousan

In [11]:
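def load_data_text():
    """Load the novel and return (corpus_indices, char_to_idx, idx_to_char, vocab_size)."""
    with open('./tale-of-two-cities.txt') as f:
        corpus_chars = f.read()

    # flatten line breaks into spaces and keep the first 50,000 characters
    corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    corpus_chars = corpus_chars[0:50000]
    idx_to_char = list(set(corpus_chars))  # unique characters; list position is the index
    char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # inverse mapping
    vocab_size = len(char_to_idx)
    corpus_indices = [char_to_idx[char] for char in corpus_chars]  # encode the text as indices

    return corpus_indices, char_to_idx, idx_to_char, vocab_size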

## Random Sampling

In random sampling, each example is a subsequence taken from an arbitrary position in the original sequence. Two adjacent random minibatches do not necessarily cover adjacent positions in the original sequence. The goal is to predict the next character from the characters seen so far, so the labels are the original sequence shifted by one character.
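To make the "shifted by one character" relationship concrete, here is a minimal sketch on a plain Python list. `make_example` is a hypothetical helper used only for illustration and is not part of the notebook.

```python
def make_example(sequence, start, num_steps):
    """Return (input, label) windows of length num_steps; the label is the input shifted by one."""
    x = sequence[start:start + num_steps]
    y = sequence[start + 1:start + num_steps + 1]
    return x, y


chars = list("it was the best")
x, y = make_example(chars, start=0, num_steps=6)
print(repr("".join(x)))  # input window
print(repr("".join(y)))  # label window: each position holds the character that follows in x
```

At every time step t, `y[t]` is the character that comes right after `x[t]` in the original sequence, which is exactly the target for next-character prediction.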

In [12]:
import torch
import random

In [None]:
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    """
    Yield (X, Y) minibatches drawn from random positions in corpus_indices.

    batch_size: number of examples in each minibatch
    num_steps: number of time steps in each example
    """

    # subtract 1 because the label Y is X shifted by one position
    num_examples = (len(corpus_indices) - 1) // num_steps
    example_indices = [i * num_steps for i in range(num_examples)]  # get index of first char of each sample in corpus_indices
    random.shuffle(example_indices)

    def _data(i):
        # return a sequence length of num_steps from i
        return corpus_indices[i: i + num_steps]
    
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    for i in range(0, num_examples, batch_size):
        # random sample batch_size number of samples
        batch_indices = example_indices[i: i + batch_size]  # first char of each sample in current batch
        X = [_data(j) for j in batch_indices] # return the data sample
        Y = [_data(j + 1) for j in batch_indices] # return the corresponding label(index)
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)

In [14]:
# testing
my_seq = list(range(30))

for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

X:  tensor([[ 0,  1,  2,  3,  4,  5],
        [12, 13, 14, 15, 16, 17]]) 
Y: tensor([[ 1,  2,  3,  4,  5,  6],
        [13, 14, 15, 16, 17, 18]]) 

X:  tensor([[18, 19, 20, 21, 22, 23],
        [ 6,  7,  8,  9, 10, 11]]) 
Y: tensor([[19, 20, 21, 22, 23, 24],
        [ 7,  8,  9, 10, 11, 12]]) 



## Sequential Partitioning

In addition to random sampling of the original sequence, we can also make two adjacent minibatches cover adjacent positions in the original sequence.

In [17]:
def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    
    # no shuffling, so each example keeps its position in the original sequence
    
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        
    corpus_len = len(corpus_indices) // batch_size * batch_size  # largest length divisible by batch_size
    corpus_indices = corpus_indices[: corpus_len]  # drop the trailing characters that do not fit
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps  # subtract 1 because Y is X shifted by one position
    
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

In [18]:
# testing
for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

X:  tensor([[ 0,  1,  2,  3,  4,  5],
        [15, 16, 17, 18, 19, 20]]) 
Y: tensor([[ 1,  2,  3,  4,  5,  6],
        [16, 17, 18, 19, 20, 21]]) 

X:  tensor([[ 6,  7,  8,  9, 10, 11],
        [21, 22, 23, 24, 25, 26]]) 
Y: tensor([[ 7,  8,  9, 10, 11, 12],
        [22, 23, 24, 25, 26, 27]]) 



Context-free models such as word2vec or GloVe generate a single embedding for each word in the vocabulary, regardless of the context in which the word appears.

One of the major advantages of ELMo is that it addresses the problem of polysemy, in which a single word has multiple meanings. ELMo is context-based (not word-based), so different meanings of a word occupy different vectors within the embedding space.