<a name='0'></a>
## Overview

Your task is to predict the next set of characters from the characters that precede them.
- Although this task sounds simple, it is quite useful.
- You will start by converting a line of text into a tensor.
- Then you will create a generator to feed data into the model.
- You will train a neural network to predict the next set of characters, up to a defined length.
- You will use an embedding for each character and feed these embeddings as inputs to your model.
    - Many natural language tasks rely on embeddings for prediction.
- Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit (`GRU`), and pass the GRU output through a linear layer to predict the next set of characters.
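
The first two steps in the list (text to tensor, then a generator that feeds batches) can be sketched roughly as follows. This is a minimal illustration, not the generator used later in the assignment: `batch_generator`, `pad_int`, and the padding scheme are assumptions made for this example, and the small `line_to_tensor` here only mirrors the idea of mapping characters to integers.

```python
import random

def line_to_tensor(line, EOS_int=1):
    # Map each character to its Unicode code point and append an end marker.
    return [ord(c) for c in line] + [EOS_int]

def batch_generator(lines, batch_size, max_length, pad_int=0, shuffle=False):
    """Hypothetical helper: yield batches of padded integer sequences forever.

    Assumes at least one line is shorter than max_length; otherwise no
    batch is ever produced.
    """
    indices = list(range(len(lines)))
    if shuffle:
        random.shuffle(indices)
    position = 0
    batch = []
    while True:
        if position >= len(indices):
            # Start a new pass over the data once every line has been visited.
            position = 0
            if shuffle:
                random.shuffle(indices)
        line = lines[indices[position]]
        position += 1
        if len(line) < max_length:  # leave room for the end marker
            tensor = line_to_tensor(line)
            batch.append(tensor + [pad_int] * (max_length - len(tensor)))
        if len(batch) == batch_size:
            yield batch
            batch = []

gen = batch_generator(["ab", "cde", "f"], batch_size=2, max_length=5)
first_batch = next(gen)
second_batch = next(gen)
print(first_batch)
```

Padding every sequence to the same length lets a batch be stacked into a single rectangular array, which is what a model typically expects as input.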

<img src = "images/model.png" style="width:600px;height:150px;"/>

The figure above gives you a summary of what you are about to implement. 
- You will get the embeddings;
- Stack the embeddings on top of each other;
- Run them through two layers with a relu activation in the middle;
- Finally, you will compute the softmax. 

To predict the next character:
- Use the softmax output and identify the character with the highest probability.
- The character with the highest probability is the prediction for the next character.
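
As a rough illustration of this selection step, the sketch below applies a softmax to a small vector of scores and picks the most probable character. The three-character vocabulary and the scores are made-up values for this example, not output from the model in this notebook.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores, chosen only for illustration.
vocab = ["a", "b", "c"]
scores = [1.0, 3.0, 0.5]

probs = softmax(scores)
# Index of the largest probability (the argmax).
best_index = max(range(len(probs)), key=probs.__getitem__)
predicted_char = vocab[best_index]
print(predicted_char)  # prints "b"
```

Because softmax preserves the ordering of the scores, taking the argmax of the probabilities gives the same answer as taking the argmax of the raw scores.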

In [1]:
import os
import shutil
import trax
import trax.fastmath.numpy as np
import pickle
import random as rnd
from trax import fastmath
from trax import layers as tl

import w2_unittest

rnd.seed(32)



### Loading the data

In [2]:
dirname = "data/"
filename = "shakespeare_data.txt"
lines = []

# Read the file line by line, keeping only non-empty lines
with open(os.path.join(dirname, filename)) as f:
    for line in f:
        # Strip leading and trailing whitespace, including the newline
        pure_line = line.strip()

        if pure_line:
            lines.append(pure_line)

In [4]:
n_lines = len(lines)
print(f"Number of lines {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines 125097
Sample line at position 0 A LOVER'S COMPLAINT
Sample line at position 999 With this night's revels and expire the term


In [5]:
for i, line in enumerate(lines):
    lines[i] = line.lower()


print(f"Number of lines: {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines: 125097
Sample line at position 0 a lover's complaint
Sample line at position 999 with this night's revels and expire the term


In [6]:
eval_lines = lines[-1000:]
lines = lines[:-1000]

print(f"Number of lines for training: {len(lines)}")
print(f"Number of lines for validation: {len(eval_lines)}")

Number of lines for training: 124097
Number of lines for validation: 1000


### Convert a line to a Tensor

In [9]:
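def line_to_tensor(line, EOS_int=1):
    """Convert a line of text to a list of integers, ending with EOS_int."""
    tensor = []

    for c in line:
        # ord() maps each character to its Unicode code point
        tensor.append(ord(c))

    # Mark the end of the line with the end-of-sentence integer
    tensor.append(EOS_int)
    return tensor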

In [10]:
line_to_tensor("abc xyz")

[97, 98, 99, 32, 120, 121, 122, 1]
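
To check that this encoding loses no information, the output above can be mapped back to text. The sketch below is one possible way to do that; `tensor_to_line` is a hypothetical helper introduced here for illustration and is not part of the notebook. `line_to_tensor` is repeated in compact form so the block runs on its own.

```python
def line_to_tensor(line, EOS_int=1):
    # Compact equivalent of the function defined above.
    return [ord(c) for c in line] + [EOS_int]

def tensor_to_line(tensor, EOS_int=1):
    """Hypothetical inverse: decode integers back to text, stopping at EOS_int."""
    chars = []
    for value in tensor:
        if value == EOS_int:
            break  # everything after the end marker is ignored
        chars.append(chr(value))
    return "".join(chars)

encoded = line_to_tensor("abc xyz")
decoded = tensor_to_line(encoded)
print(decoded)  # prints "abc xyz"
```

Using `1` as the end marker works here because code point 1 is a control character that is not expected to occur in ordinary text lines.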