# Deep N-grams

<a name='0'></a>
## Overview

Our task is to predict the next set of words using the previous words. 
- Although this task sounds simple, it is the basis of language modeling and text generation.
- We will start by converting a line of text into a tensor.
- Then we will create a generator to feed data into the model.
- We will train a neural network to predict the next set of words of a defined length. 
- We will use an embedding for each word and feed the embeddings as inputs to our model. 
    - Many natural language tasks rely on embeddings for predictions. 
- Our model will convert each word to its embedding, run the embeddings through a Gated Recurrent Unit (`GRU`), and pass the result through a linear layer to predict the next set of words.

<img src = "images/model.png" style="width:600px;height:150px;"/>

The figure above gives us a summary of what we are about to implement. 
- We will get the embeddings;
- Stack the embeddings on top of each other;
- Run them through two layers with a relu activation in the middle;
- Finally, we will compute the softmax. 

To predict the next word:
- Use the softmax output and identify the word with the highest probability.
- The word with the highest probability is the prediction for the next word.
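
The prediction step above can be sketched in plain Python. This is a minimal illustration, not the model's actual output layer: the names `toy_vocab` and `toy_logits` and their values are made up for the example.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and raw model scores (logits).
toy_vocab = ["the", "night", "revels", "term"]
toy_logits = [1.0, 3.0, 0.5, 2.0]

probs = softmax(toy_logits)

# Pick the index with the largest probability and look up its word.
best_index = max(range(len(probs)), key=lambda i: probs[i])
predicted_word = toy_vocab[best_index]
print(predicted_word)  # night
```

Always taking the highest-probability entry is a greedy choice; sampling from `probs` instead is a common alternative when more varied text is wanted.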

In [39]:
import os
import shutil
import tensorflow as tf
import pickle
import numpy as np
import random as rnd
import nltk
import re
import string

nltk.download('punkt')
nltk.data.path.append('.')

# set random seed
rnd.seed(32)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Trung\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<a name='1'></a>
## 1 - Importing the Data

<a name='1-1'></a>
### 1.1 - Loading in the Data

<img src = "images/shakespeare.png" style="width:250px;height:250px;"/>

Now import the dataset and do some processing. 
- The dataset has one sentence per line.
- We will be doing word generation, so we have to process each sentence by converting each **word** to a number. 
- Store each line in a list.
- Create a data generator that takes in the `batch_size` and the `max_length`. 
    - The `max_length` corresponds to the maximum length of the sentence.

In [40]:
dirname = 'data/'
filename = 'shakespeare_data.txt'
lines = [] # store all non-empty lines

with open(os.path.join(dirname, filename)) as files:
    for line in files:        
        # remove leading and trailing whitespace
        pure_line = line.strip()

        # if pure_line is not the empty string,
        if pure_line:
            # append it to the list
            lines.append(pure_line)

In [41]:
n_lines = len(lines)
print(f"Number of lines: {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines: 125097
Sample line at position 0 A LOVER'S COMPLAINT
Sample line at position 999 With this night's revels and expire the term


Notice that the text contains both uppercase and lowercase letters. To reduce the complexity of the task, we will convert all characters to lowercase. This way, a word such as 'The' and its lowercase form 'the' map to the same vocabulary entry, and the model does not have to treat them as two different words.
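
As a rough illustration of why lowercasing simplifies the task, the sketch below counts distinct tokens before and after lowercasing. The sample lines are invented, and whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical sample lines; str.split() is a stand-in for a real tokenizer.
sample_lines = ["The night is long", "the Night is cold"]

def distinct_tokens(lines):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for line in lines for token in line.split()}

cased = distinct_tokens(sample_lines)
lowered = distinct_tokens([line.lower() for line in sample_lines])

# Lowercasing merges 'The'/'the' and 'night'/'Night' into single entries.
print(len(cased), len(lowered))  # 7 5
```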

In [42]:
# go through each line
for i, line in enumerate(lines):
    # convert to all lowercase
    lines[i] = line.lower()

print(f"Number of lines: {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines: 125097
Sample line at position 0 a lover's complaint
Sample line at position 999 with this night's revels and expire the term


In [43]:
test_lines = lines[-500:]
eval_lines = lines[-1500:-500] # Create a holdout validation set
lines = lines[:-1500] # Leave the rest for training

print(f"Number of lines for training: {len(lines)}")
print(f"Number of lines for validation: {len(eval_lines)}")
print(f"Number of lines for testing: {len(test_lines)}")

Number of lines for training: 123597
Number of lines for validation: 1000
Number of lines for testing: 500


### Process tweet

In [52]:
def process_tweet(tweet):
    '''
    Input: 
        tweet: a string containing a line of text (here, a line of Shakespeare)
    Output:
        a list of lowercase word and punctuation tokens from nltk.word_tokenize
    
    '''
    tweet = tweet.lower()
    return nltk.word_tokenize(tweet)

<a name='1-2'></a>
### 1.2 - Convert a Line to Tensor

Now that we have our list of lines, we will convert each word in that list to a number. First, we create a vocabulary that maps every word in `lines` to an integer index.

In [53]:
def create_vocab(lines, EOS_int=1):
    """
    Args:
        lines(str): Lines of text
        EOS_int (int, optional): End-of-sentence integer. Defaults to 1.
    Returns:
        vocab (dict): Dictionary map each word with their index.
    """
    vocab = {}
    vocab["EOS_int"] = 1
    for line in lines:
        processed_line = process_tweet(line)
        for word in processed_line:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    
    return vocab

In [54]:
Vocab = create_vocab(lines)
Vocab

{'EOS_int': 1,
 'a': 2,
 'lover': 3,
 "'s": 4,
 'complaint': 5,
 'from': 6,
 'off': 7,
 'hill': 8,
 'whose': 9,
 'concave': 10,
 'womb': 11,
 'reworded': 12,
 'plaintful': 13,
 'story': 14,
 'sistering': 15,
 'vale': 16,
 ',': 17,
 'my': 18,
 'spirits': 19,
 'to': 20,
 'attend': 21,
 'this': 22,
 'double': 23,
 'voice': 24,
 'accorded': 25,
 'and': 26,
 'down': 27,
 'i': 28,
 'laid': 29,
 'list': 30,
 'the': 31,
 'sad-tuned': 32,
 'tale': 33,
 ';': 34,
 'ere': 35,
 'long': 36,
 'espied': 37,
 'fickle': 38,
 'maid': 39,
 'full': 40,
 'pale': 41,
 'tearing': 42,
 'of': 43,
 'papers': 44,
 'breaking': 45,
 'rings': 46,
 'a-twain': 47,
 'storming': 48,
 'her': 49,
 'world': 50,
 'with': 51,
 'sorrow': 52,
 'wind': 53,
 'rain': 54,
 '.': 55,
 'upon': 56,
 'head': 57,
 'platted': 58,
 'hive': 59,
 'straw': 60,
 'which': 61,
 'fortified': 62,
 'visage': 63,
 'sun': 64,
 'whereon': 65,
 'thought': 66,
 'might': 67,
 'think': 68,
 'sometime': 69,
 'it': 70,
 'saw': 71,
 'carcass': 72,
 'beauty'

In [55]:
len(Vocab)

27455

<a name='ex-1'></a>
### line_to_tensor

Write a function that takes in a single line and transforms each word into its vocabulary index.  This returns a list of integers, which we'll refer to as a tensor.
- Use a special integer to represent the end of the sentence (the end of the line).
- This will be the EOS_int (end of sentence integer) parameter of the function.
- Include the EOS_int as the last integer of the line
- We will use the number `1` to represent the end of a sentence.

In [62]:
def line_to_tensor(line, vocab, EOS_int=1):
    """Turns a line of text into a tensor

    Args:
        line (str): A single line of text.
        EOS_int (int, optional): End-of-sentence integer. Defaults to 1.

    Returns:
        list: a list of integers (unicode values) for the words in the `line`.
    """
    
    # Initialize the tensor as an empty list
    tensor = []
    
    # for each character:
    for word in process_tweet(line):
        
        # convert to unicode int
        w_int = vocab[word]
        
        # append the unicode integer to the tensor list
        tensor.append(w_int)
    
    # include the end-of-sentence integer
    tensor.append(EOS_int)
    
    return tensor

In [63]:
line_to_tensor("With this night's revels and expire the term", Vocab)

[51, 22, 1434, 4, 2077, 26, 2078, 31, 2079, 1]
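
To sanity-check a tensor like the one above, it can help to map the integers back to tokens. The sketch below does this with a small made-up vocabulary in the same `{token: index}` shape as `Vocab`; `toy_vocab` and `tensor_to_tokens` are hypothetical names, not part of the notebook.

```python
# Hypothetical toy vocabulary in the same {token: index} shape as `Vocab`.
toy_vocab = {"EOS_int": 1, "with": 2, "this": 3, "night": 4}

# Invert the mapping so each index points back to its token.
index_to_token = {index: token for token, index in toy_vocab.items()}

def tensor_to_tokens(tensor, index_to_token):
    """Map a list of integer ids back to their tokens."""
    return [index_to_token[i] for i in tensor]

decoded = tensor_to_tokens([2, 3, 4, 1], index_to_token)
print(decoded)  # ['with', 'this', 'night', 'EOS_int']
```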

<a name='1-3'></a>
### 1.3 - Batch Generator 

Most of the time in Natural Language Processing, and in AI in general, we train models on batches of data. Here, we will build a data generator that takes in a list of text lines and returns batches of them (lines are sentences).
- The generator converts text lines (sentences) into numpy arrays of integers padded with zeros so that all arrays have the same length, `max_length`.

Once we create the generator, we can iterate on it like this:

```
next(data_generator)
```

The generator returns data in a format that we can use directly in the forward pass of our model. Each batch is a tuple of three parts: inputs, targets, and mask. The inputs and targets are identical; the targets (the second element) will be used to evaluate our predictions. The mask has the same shape as the inputs and is 1 for non-padding tokens and 0 for padding.
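
The batch format described above can be illustrated with plain lists. This is a minimal sketch with invented ids; `pad_and_mask` is a hypothetical helper, and the real generator returns numpy arrays rather than lists.

```python
def pad_and_mask(tensor, max_length, pad_id=0):
    """Right-pad `tensor` with `pad_id` and build a matching 0/1 mask."""
    padded = tensor + [pad_id] * (max_length - len(tensor))
    mask = [0 if token == pad_id else 1 for token in padded]
    return padded, mask

# Hypothetical tensors: word ids followed by the end-of-sentence id 1.
toy_tensors = [[51, 22, 1], [7, 1]]

padded_rows, mask_rows = [], []
for tensor in toy_tensors:
    padded, mask = pad_and_mask(tensor, max_length=5)
    padded_rows.append(padded)
    mask_rows.append(mask)

# Inputs and targets are the same rows, followed by the mask.
batch = (padded_rows, padded_rows, mask_rows)
print(batch[0])  # [[51, 22, 1, 0, 0], [7, 1, 0, 0, 0]]
print(batch[2])  # [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]
```

Because padding uses id 0, no real token should be assigned index 0 in the vocabulary, otherwise the mask would hide it.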

<a name='ex-2'></a>
### data_generator
- While True loop: this will yield one batch at a time.
- If index >= num_lines, set index to 0. 
- The generator should return shuffled batches of data. To achieve this without modifying the actual lines, a list containing the indexes of `data_lines` is created. This list can be shuffled and used to get random batches every time the index is reset.
- If the tensor for a line is shorter than `max_length`, append the line to cur_batch.
    - The tensor contains one integer per word token plus an additional end-of-sentence token id, so its length is the number of tokens plus 1.
    - So if `max_length` is 5, a line with 3 tokens gives a tensor of length 4 and is kept, while a line with 4 tokens gives a tensor of length 5, which is not less than `max_length`, and is skipped.
- If len(cur_batch) == batch_size, go over every line in the batch, convert it to a tensor of integers, pad it, and store it.
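
The index-shuffling and length-filtering steps can be tried on a toy example first. The sketch below is a simplified stand-in, not the notebook's implementation: `toy_batches` is a hypothetical name, whitespace splitting replaces the real tokenizer, and the length check mirrors the `len(tensor) < max_length` test used in the `data_generator` code.

```python
import random

def toy_batches(lines, batch_size, max_length, seed=0):
    """Yield batches of lines forever, visiting them in shuffled index order."""
    rng = random.Random(seed)
    # Shuffle a list of indexes so that `lines` itself is never modified.
    order = list(range(len(lines)))
    rng.shuffle(order)
    index, current = 0, []
    while True:
        if index >= len(lines):
            # One full pass is done: start over with a fresh shuffle.
            index = 0
            rng.shuffle(order)
        line = lines[order[index]]
        index += 1
        # Tensor length = number of tokens + 1 for the end-of-sentence id.
        if len(line.split()) + 1 < max_length:
            current.append(line)
        if len(current) == batch_size:
            yield current
            current = []

toy_lines = ["a b", "a b c d e f", "c d", "e f g"]
gen = toy_batches(toy_lines, batch_size=2, max_length=5)
first = next(gen)
second = next(gen)
print(first, second)
```

The six-token line never appears in a batch because its tensor would have length 7. Note that a generator like this loops forever if fewer than one line passes the length check.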

In [73]:
def data_generator(batch_size, max_length, data_lines, vocab, line_to_tensor=line_to_tensor, shuffle=True):
    """Generator function that yields batches of data

    Args:
        batch_size (int): number of examples (in this case, sentences) per batch.
        max_length (int): length of each padded output tensor.
        NOTE: the tensor of a line includes the end-of-sentence integer, so its
                length is always 1 + the number of word tokens in the line.
                Only lines whose tensor is shorter than max_length are used.
        data_lines (list): list of the sentences to group into batches.
        vocab (dict): dictionary mapping each word to its integer index.
        line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.
        shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.

    Yields:
        tuple: two copies of the batch (numpy.ndarray) and the mask (numpy.ndarray).
    """
    # initialize the index that points to the current position in the lines index array
    index = 0
    
    # initialize the list that will contain the current batch
    cur_batch = []
    
    # count the number of lines in data_lines
    num_lines = len(data_lines)
    
    # create an array with the indexes of data_lines that can be shuffled
    lines_index = [*range(num_lines)]
    
    # shuffle line indexes if shuffle is set to True
    if shuffle:
        rnd.shuffle(lines_index)
    
    while True:
        
        # if the index is greater than or equal to the number of lines in data_lines
        if index >= num_lines:
            # then reset the index to 0
            index = 0
            # shuffle line indexes if shuffle is set to True
            if shuffle:
                rnd.shuffle(lines_index) 
                            
        # get a line at the `lines_index[index]` position in data_lines
        line = data_lines[lines_index[index]]
        
        # if the length of the line is less than max_length
        if len(line_to_tensor(line, vocab)) < max_length:
            # append the line to the current batch
            cur_batch.append(line)
            
        # increment the index by one
        index += 1
        
        # if the current batch is now equal to the desired batch size
        if len(cur_batch) == batch_size:
            
            batch = []
            mask = []
            
            # go through each line (li) in cur_batch
            for li in cur_batch:
                # convert the line (li) to a tensor of integers
                tensor = line_to_tensor(li, vocab)
                
                # Create a list of zeros to represent the padding
                # so that the tensor plus padding will have length `max_length`
                pad = [0] * (max_length - len(tensor))
                
                # combine the tensor plus pad
                tensor_pad = tensor + pad
                
                # append the padded tensor to the batch
                batch.append(tensor_pad)

                # A mask for this tensor_pad is 1 wherever tensor_pad is not
                # 0 and 0 wherever tensor_pad is 0, i.e. if tensor_pad is
                # [1, 2, 3, 0, 0, 0] then example_mask should be
                # [1, 1, 1, 0, 0, 0]
                example_mask = 1 - np.equal(np.array(tensor_pad), 0)
                mask.append(example_mask)
               
            # convert the batch (data type list) to a numpy array
            batch_np_arr = np.array(batch)
            mask_np_arr = np.array(mask)
            
            # Yield two copies of the batch and mask.
            yield batch_np_arr, batch_np_arr, mask_np_arr
            
            # reset the current batch to an empty list
            cur_batch = []

In [74]:
# Try out our data generator
tmp_lines = ['Sky is blue', # 3 words, tensor length 4
             'Roses are red', # 3 words, tensor length 4
             'Leaves are green', # 3 words, tensor length 4
             'I love you'] # 3 words, tensor length 4

# Get a batch size of 2, max length 10
tmp_data_gen = data_generator(batch_size=2, 
                              max_length=10, 
                              data_lines=tmp_lines,
                              vocab = Vocab,
                              shuffle=False)

# get one batch
tmp_batch = next(tmp_data_gen)

# view the batch
tmp_batch

(array([[4275,  342, 5438,    1,    0,    0,    0,    0,    0,    0],
        [ 998,  143,  747,    1,    0,    0,    0,    0,    0,    0]]),
 array([[4275,  342, 5438,    1,    0,    0,    0,    0,    0,    0],
        [ 998,  143,  747,    1,    0,    0,    0,    0,    0,    0]]),
 array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]]))
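
One plausible use of the mask in the batch above is to average a per-token quantity, such as a loss, over real tokens only. The notebook does not show its loss computation here, so this is only a sketch under that assumption; `masked_mean` and the loss values are hypothetical.

```python
def masked_mean(values, mask):
    """Average `values` over the positions where `mask` is 1."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count

# Hypothetical per-token losses for one padded line of length 6.
# The last two positions are padding and should not affect the result.
toy_losses = [2.0, 1.0, 3.0, 0.0, 9.0, 9.0]
toy_mask = [1, 1, 1, 1, 0, 0]

average = masked_mean(toy_losses, toy_mask)
print(average)  # 1.5
```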