<a name='0'></a>
## Overview

Your task is to predict the next set of characters from the previous characters.
- Although this task sounds simple, it is pretty useful.
- You will start by converting a line of text into a tensor.
- Then you will create a generator to feed data into the model.
- You will train a neural network to predict the next set of characters of a defined length.
- You will use embeddings for each character and feed them as inputs to your model.
    - Many natural language tasks rely on using embeddings for predictions.
- Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit (`GRU`), and run the result through a linear layer to predict the next set of characters.

<img src = "images/model.png" style="width:600px;height:150px;"/>

The figure above gives you a summary of what you are about to implement. 
- You will get the embeddings;
- Stack the embeddings on top of each other;
- Run them through two layers with a relu activation in the middle;
- Finally, you will compute the softmax. 

To predict the next character:
- Use the softmax output and identify the character with the highest probability.
- The character with the highest probability is the prediction for the next character.
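
As a rough illustration of this selection step, the sketch below picks the most likely next character from a list of log probabilities. The function name `most_likely_char` and the toy scores are hypothetical; the sketch assumes that index `i` of the output stands for the character `chr(i)`, which matches the `ord`-based encoding used later in this notebook.

```python
import math

def most_likely_char(log_probs):
    """Return the character whose index has the highest log probability.

    Assumes index i of `log_probs` corresponds to chr(i); this mapping is
    an assumption made for this sketch.
    """
    best_index = max(range(len(log_probs)), key=lambda i: log_probs[i])
    return chr(best_index)

# Toy distribution over 256 ids with most of the mass on 'e' (id 101).
toy_log_probs = [math.log(0.001)] * 256
toy_log_probs[ord("e")] = math.log(0.745)

next_char = most_likely_char(toy_log_probs)
print(next_char)
```

Always taking the highest-probability index is a greedy choice; it is the simplest way to turn the model output into a character.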

In [3]:
import os
import shutil
import trax
import trax.fastmath.numpy as np
import pickle
import numpy
import random as rnd
from trax import fastmath
from trax import layers as tl

import w2_unittest

# set random seed
rnd.seed(32)



### Loading the data

In [2]:
dirname = "data/"
filename = "shakespeare_data.txt"
lines = []

counter = 0

with open(os.path.join(dirname, filename)) as files:
    for line in files:
        pure_line = line.strip()

        if pure_line:
            lines.append(pure_line)

In [3]:
n_lines = len(lines)
print(f"Number of lines {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines 125097
Sample line at position 0 A LOVER'S COMPLAINT
Sample line at position 999 With this night's revels and expire the term


In [4]:
for i, line in enumerate(lines):
    lines[i] = line.lower()


print(f"Number of lines: {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines: 125097
Sample line at position 0 a lover's complaint
Sample line at position 999 with this night's revels and expire the term


In [5]:
eval_lines = lines[-1000:]
lines = lines[:-1000]

print(f"Number of lines for training: {len(lines)}")
print(f"Number of lines for validation: {len(eval_lines)}")

Number of lines for training: 124097
Number of lines for validation: 1000


### Convert a line to a Tensor

In [6]:
def line_to_tensor(line, EOS_int=1):

    tensor = []

    for c in line:
        char = ord(c)
        tensor.append(char)
    
    tensor.append(EOS_int)
    return tensor

In [7]:
line_to_tensor("abc xyz")

[97, 98, 99, 32, 120, 121, 122, 1]
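
To read such a tensor back as text, the conversion can be reversed. The helper below, `tensor_to_line`, is hypothetical and not part of the assignment; it is a minimal sketch that assumes the same conventions as `line_to_tensor` (one `ord` value per character, `EOS_int=1` at the end) and treats `0` as padding.

```python
def tensor_to_line(tensor, EOS_int=1):
    """Hypothetical inverse of line_to_tensor: integer ids back to a string.

    Stops at the first end-of-sentence id and skips zero padding.
    """
    chars = []
    for code in tensor:
        if code == EOS_int:
            break
        if code != 0:
            chars.append(chr(code))
    return "".join(chars)

decoded = tensor_to_line([97, 98, 99, 32, 120, 121, 122, 1])
print(decoded)
```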

In [8]:
w2_unittest.test_line_to_tensor(line_to_tensor)

 All tests passed


<a name='1-3'></a>
### 1.3 - Batch Generator 

Most of the time in Natural Language Processing, and in AI in general, we train on batches of data. Here, you will build a data generator that takes in text lines and returns them in batches (lines are sentences).
- The generator converts text lines (sentences) into numpy arrays of integers padded with zeros so that all arrays have the same length, `max_length`.

Once you create the generator, you can iterate on it like this:

```
next(data_generator)
```

The generator returns data in a format that you can feed directly to your model during the forward pass. Each call returns a batch as a tuple of three parts: inputs, targets, and a per-token mask. The inputs and targets are identical; the targets (the second element) will be used to evaluate your predictions. The mask is 1 for non-padding tokens and 0 for padding.
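
The padding and mask for a single line can be sketched in plain Python as follows. The helper name `pad_and_mask` is hypothetical; the real generator works on a whole batch and returns arrays rather than lists.

```python
def pad_and_mask(tensor, max_length):
    """Pad a list of ids with zeros and build the matching per-token mask."""
    padded = tensor + [0] * (max_length - len(tensor))
    # 1 marks a real token, 0 marks padding
    mask = [1 if token != 0 else 0 for token in padded]
    return padded, mask

# "hi" followed by the end-of-sentence id 1, padded to length 6
padded, mask = pad_and_mask([104, 105, 1], max_length=6)
inputs, targets = padded, padded  # inputs and targets are identical
print(padded)
print(mask)
```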

<a name='ex-2'></a>
### Exercise 2 - data_generator
**Instructions:** Implement the data generator below. Here are some things you will need. 

- `while True` loop: this will yield one batch at a time.
- If `index >= num_lines`, set `index` to 0.
- The generator should return shuffled batches of data. To achieve this without modifying the actual lines, a list containing the indexes of `data_lines` is created. This list can be shuffled and used to get random batches every time the index is reset.
- If `len(line) < max_length`, append `line` to `cur_batch`.
    - Note that a line whose length equals `max_length` should not be appended to the batch.
    - This is because when converting the characters into a tensor of integers, an additional end-of-sentence token id will be added.
    - So if `max_length` is 5 and a line has 4 characters, the tensor representing those 4 characters plus the end-of-sentence id will be of length 5, which is the max length.
- If `len(cur_batch) == batch_size`, go over every line, convert it to a tensor of integers, pad it with zeros to `max_length`, and store it in the batch.
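
Two of these rules can be checked in isolation. The sketch below uses made-up lines and the standard `random` module; it only illustrates the length filter and the shuffled index list, and is not the graded generator.

```python
import random

max_length = 5
candidates = ["abcd", "abcde", "ab"]

# Keep only lines that still fit once the end-of-sentence id is appended.
kept = [line for line in candidates if len(line) < max_length]
tensor_lengths = [len(line) + 1 for line in kept]  # +1 for the EOS id

# Shuffle a list of indexes instead of the lines themselves.
lines_index = list(range(len(candidates)))
random.Random(32).shuffle(lines_index)
shuffled_view = [candidates[i] for i in lines_index]

print(kept, tensor_lengths)
print(lines_index, shuffled_view)
```

The original `candidates` list keeps its order; only the index list is permuted.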

**Remember that when calling `np` you are really calling `trax.fastmath.numpy`, which is Trax's version of numpy that is compatible with JAX. As a result, where you used to encounter the type `numpy.ndarray` you will now find the type `jax.interpreters.xla.DeviceArray`. Recent JAX versions display this type as `Array` and no longer provide the `DeviceArray` name.**

In [9]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: data_generator
def data_generator(batch_size, max_length, data_lines, line_to_tensor=line_to_tensor, shuffle=True):
    """Generator function that yields batches of data

    Args:
        batch_size (int): number of examples (in this case, sentences) per batch.
        max_length (int): maximum length of the output tensor.
        NOTE: max_length includes the end-of-sentence character that will be added
                to the tensor.  
                Keep in mind that the length of the tensor is always 1 + the length
                of the original line of characters.
        data_lines (list): list of the sentences to group into batches.
        line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.
        shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.

    Yields:
        tuple: two copies of the batch (jax.interpreters.xla.DeviceArray) and mask (jax.interpreters.xla.DeviceArray).
        NOTE: jax.interpreters.xla.DeviceArray is trax's version of numpy.ndarray
    """
    # initialize the index that points to the current position in the lines index array
    index = 0
    
    # initialize the list that will contain the current batch
    cur_batch = []
    
    # count the number of lines in data_lines
    num_lines = len(data_lines)
    
    # create an array with the indexes of data_lines that can be shuffled
    lines_index = [*range(num_lines)]
    
    # shuffle line indexes if shuffle is set to True
    if shuffle:
        rnd.shuffle(lines_index)
    
    ### START CODE HERE ###
    while True:
        
        # if the index is greater than or equal to the number of lines in data_lines
        if index >= num_lines:
            # then reset the index to 0
            index = 0
            # shuffle line indexes if shuffle is set to True
            if shuffle:
                rnd.shuffle(lines_index) 
                            
        # get a line at the `lines_index[index]` position in data_lines
        line = data_lines[lines_index[index]]
        
        # if the length of the line is less than max_length
        if len(line) < max_length:
            # append the line to the current batch
            cur_batch.append(line)
            
        # increment the index by one
        index += 1
        
        # if the current batch is now equal to the desired batch size
        if len(cur_batch) == batch_size:
            
            batch = []
            mask = []
            
            # go through each line (li) in cur_batch
            for li in cur_batch:
                # convert the line (li) to a tensor of integers
                tensor = line_to_tensor(li)
                
                # Create a list of zeros to represent the padding
                # so that the tensor plus padding will have length `max_length`
                pad = [0] * (max_length - len(tensor))
                
                # combine the tensor plus pad
                tensor_pad = tensor + pad
                
                # append the padded tensor to the batch
                batch.append(tensor_pad)

                # A mask for this tensor_pad is 1 wherever tensor_pad is not
                # 0 and 0 wherever tensor_pad is 0, i.e. if tensor_pad is
                # [1, 2, 3, 0, 0, 0] then example_mask should be
                # [1, 1, 1, 0, 0, 0]
                example_mask = [0 if l == 0 else 1 for l in tensor_pad]
                mask.append(example_mask) # @ KEEPTHIS
               
            # convert the batch (data type list) to a numpy array
            batch_np_arr = np.array(batch)
            mask_np_arr = np.array(mask)
            
            ### END CODE HERE ##
            
            # Yield two copies of the batch and mask.
            yield batch_np_arr, batch_np_arr, mask_np_arr
            
            # reset the current batch to an empty list
            cur_batch = []

In [10]:
# Try out your data generator
tmp_lines = ['12345678901', #length 11
             '123456789', # length 9
             '234567890', # length 9
             '345678901'] # length 9

# Get a batch size of 2, max length 10
tmp_data_gen = data_generator(batch_size=2, 
                              max_length=10, 
                              data_lines=tmp_lines,
                              shuffle=False)

# get one batch
tmp_batch = next(tmp_data_gen)

# view the batch
tmp_batch

No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)


(Array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
        [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32),
 Array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
        [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32),
 Array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32))

In [4]:
test1 = np.array([1,2,3])
test1


Array([1, 2, 3], dtype=int32)

In [5]:
test2 = numpy.array([1,2,3])
test2

array([1, 2, 3])

In [11]:
# Test your function
w2_unittest.test_data_generator(data_generator)

AttributeError: module 'jax.interpreters.xla' has no attribute 'DeviceArray'

<a name='2'></a>
## 2 - Defining the GRU Model

Now that you have the input and output tensors, you will initialize your model. You will implement the `GRULM`, a gated recurrent unit model, using Google's `trax` package. Instead of making you implement the `GRU` from scratch, we will give you the necessary layers from this package. You can use the following layers when constructing the model:


- `tl.Serial`: Combinator that applies layers serially (by function composition). [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.combinators.Serial) / [source code](https://github.com/google/trax/blob/e65d51fe584b10c0fa0fccadc1e70b6330aac67e/trax/layers/combinators.py#L26)
    - You can pass in the layers as arguments to `Serial`, separated by commas. 
    - For example: `tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))`
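
The idea of applying layers serially by function composition can be sketched without Trax. The `serial` helper below is hypothetical and much simpler than `tl.Serial`: it composes plain Python callables and ignores weights, state, and multiple inputs.

```python
from functools import reduce

def serial(*layers):
    """Compose callables left to right, like a pipeline of layers."""
    def apply(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return apply

def double(x):
    return 2 * x

def add_one(x):
    return x + 1

pipeline = serial(double, add_one)
result = pipeline(3)  # computes add_one(double(3))
print(result)
```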

___

- `tl.ShiftRight`: Shifts the input to the right by padding it at the start, so that each position is predicted from the positions before it. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.attention.ShiftRight) / [source code](https://github.com/google/trax/blob/e65d51fe584b10c0fa0fccadc1e70b6330aac67e/trax/layers/attention.py#L560)
    - `ShiftRight(n_shifts=1, mode='train')` shifts the tensor to the right `n_shifts` times.
    - In this exercise you only need to specify the `mode`; you can leave `n_shifts` at its default.
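
A plain-Python sketch of the shift is shown below. The function `shift_right` is hypothetical and only mimics what the layer is understood to do in training mode: pad each row with zeros at the start and drop the same number of ids at the end, so the shape stays the same.

```python
def shift_right(batch, n_shifts=1):
    """Prepend n_shifts zeros to each row and drop the last n_shifts ids."""
    return [[0] * n_shifts + row[:len(row) - n_shifts] for row in batch]

# "abc" followed by the end-of-sentence id 1
targets = [[97, 98, 99, 1]]
inputs = shift_right(targets)
print(inputs)
```

With this shift, the model sees `0` when it must predict `97`, sees `97` when it must predict `98`, and so on, which is why the generator can yield identical inputs and targets.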

___

- `tl.Embedding`: Initializes the embedding. In this case it is the size of the vocabulary by the dimension of the model. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding) / [source code](https://github.com/google/trax/blob/e65d51fe584b10c0fa0fccadc1e70b6330aac67e/trax/layers/core.py#L130) 
    - `tl.Embedding(vocab_size, d_feature)`.
    - `vocab_size` is the number of unique tokens in the given vocabulary (in this assignment, the tokens are characters).
    - `d_feature` is the number of elements in each embedding vector (some choices for a word embedding size range from 150 to 300, for example).
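
Conceptually, an embedding layer is a lookup table with `vocab_size` rows and `d_feature` columns. The sketch below builds such a table with random values in plain Python; the helper names and the initialization range are assumptions made for illustration, and in the real layer the values are learned during training.

```python
import random

def make_embedding(vocab_size, d_feature, seed=0):
    """Build a vocab_size x d_feature table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_feature)]
            for _ in range(vocab_size)]

def embed(table, token_ids):
    """Look up one embedding vector per token id."""
    return [table[token_id] for token_id in token_ids]

table = make_embedding(vocab_size=256, d_feature=4)
vectors = embed(table, [97, 98, 1])  # 'a', 'b', end-of-sentence id
print(len(vectors), len(vectors[0]))
```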
___

- `tl.GRU`: `Trax` GRU layer. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.GRU) / [source code](https://github.com/google/trax/blob/e65d51fe584b10c0fa0fccadc1e70b6330aac67e/trax/layers/rnn.py#L154)
    - `GRU(n_units)` builds a traditional GRU of `n_units` cells with dense internal transformations.
    - `GRU` paper: https://arxiv.org/abs/1412.3555
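
To give a feel for what one GRU step computes, here is a minimal sketch of a single-unit GRU with scalar input and state. The parameter names in `p` are hypothetical, and the update follows the convention of the paper linked above; other implementations swap which term the update gate multiplies.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One step of a single-unit GRU; `p` holds scalar weights and biases."""
    z = sigmoid(p["wz_x"] * x + p["wz_h"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr_x"] * x + p["wr_h"] * h + p["br"])  # reset gate
    candidate = math.tanh(p["wc_x"] * x + p["wc_h"] * (r * h) + p["bc"])
    # Blend the previous state with the candidate state.
    return (1.0 - z) * h + z * candidate

params = {"wz_x": 0.5, "wz_h": -0.3, "bz": 0.0,
          "wr_x": 0.8, "wr_h": 0.1, "br": 0.0,
          "wc_x": 1.2, "wc_h": 0.7, "bc": 0.0}

state = 0.0
for x in [0.5, -1.0, 0.25]:
    state = gru_step(x, state, params)
print(state)
```

Because the new state is a weighted average of the old state and a `tanh` output, it stays between -1 and 1 when it starts at 0.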
___

- `tl.Dense`: A dense layer. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Dense) / [source code](https://github.com/google/trax/blob/e65d51fe584b10c0fa0fccadc1e70b6330aac67e/trax/layers/core.py#L34)
    - `tl.Dense(n_units)`: The parameter `n_units` is the number of units chosen for this dense layer.
___

- `tl.LogSoftmax`: Log of the output probabilities. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.LogSoftmax) / [source code](https://github.com/google/trax/blob/e65d51fe584b10c0fa0fccadc1e70b6330aac67e/trax/layers/core.py#L644)
    - Here, you don't need to set any parameters for `LogSoftmax()`.
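
The log of the output probabilities can be computed as in the sketch below. This is a plain-Python illustration on a single list of scores, not the Trax layer itself; subtracting the maximum score first is a common way to keep the exponentials from overflowing.

```python
import math

def log_softmax(scores):
    """Return log probabilities for a list of raw scores."""
    largest = max(scores)
    log_sum = largest + math.log(sum(math.exp(s - largest) for s in scores))
    return [s - log_sum for s in scores]

log_probs = log_softmax([2.0, 1.0, 0.1])
probs = [math.exp(lp) for lp in log_probs]
print(log_probs)
```

Exponentiating the result recovers probabilities that sum to 1, and the ordering of the scores is preserved.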
___

<a name='ex-3'></a>
### Exercise 3 - GRULM
**Instructions:** Implement the `GRULM` class below. You should use all the layers explained above.
