## Section 3: Vectorization

We now perform *vectorization* (or *text encoding*): each sentence, i.e., a sequence of tokens (words and punctuation), is transformed into a vector of integers. To do so, we build a lookup table that maps each unique token to a unique integer.

Keras' [*TextVectorization* layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) makes this convenient. It supports padding: the integer 0 is reserved to mean “empty.” This is useful when sentences have different numbers of tokens but the vectorizer should always return a fixed-length vector.
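
To make the role of the reserved integer 0 concrete, here is a minimal pure-Python sketch of fixed-length padding. The helper name `pad_to_length` and the token ids are made up for illustration; the Keras layer performs the equivalent operation internally.

```python
def pad_to_length(token_ids, length, pad_id=0):
    """Return exactly `length` ids: truncate long inputs, right-pad short ones."""
    clipped = list(token_ids[:length])
    return clipped + [pad_id] * (length - len(clipped))

# Hypothetical ids for a three-token sentence, padded to a length of 6.
print(pad_to_length([17, 4, 256], 6))              # [17, 4, 256, 0, 0, 0]
# A sentence longer than the requested length is cut off instead.
print(pad_to_length([5, 6, 7, 8, 9, 10, 11], 6))   # [5, 6, 7, 8, 9, 10]
```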

First, we split the sentence pairs into training, validation, and testing sets in the ratio of 70%-15%-15%.

In [3]:
import pickle
import random
  
# Load normalized sentence pairs
with open("ENG_ITA_pairs.pickle", "rb") as fp:
    text_pairs = pickle.load(fp)
 
# train-test-val split of randomized sentence pairs
random.seed(0) # for reproducibility
random.shuffle(text_pairs)
n_val = int(0.15*len(text_pairs))
n_train = len(text_pairs) - 2*n_val
train_pairs = text_pairs[:n_train]
val_pairs = text_pairs[n_train:n_train+n_val]
test_pairs = text_pairs[n_train+n_val:]

print("Nr. training pairs: ", len(train_pairs))
print("Nr. validation pairs: ", len(val_pairs))
print("Nr. test pairs: ", len(test_pairs))

Nr. training pairs:  250863
Nr. validation pairs:  53755
Nr. test pairs:  53755
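
The split arithmetic can be checked in isolation. The sketch below reproduces the three sizes from the total number of pairs; the helper name `split_sizes` is an assumption made for illustration, and the total is simply the sum of the three counts printed above.

```python
def split_sizes(n_total, val_fraction=0.15):
    """Sizes of train/val/test when val and test each take `val_fraction`."""
    n_val = int(val_fraction * n_total)
    n_test = n_val
    n_train = n_total - n_val - n_test
    return n_train, n_val, n_test

n_total = 250863 + 53755 + 53755   # sum of the three counts printed above
n_train, n_val, n_test = split_sizes(n_total)
print(n_train, n_val, n_test)
print(f"train share: {n_train / n_total:.4f}")
```

Because `int()` rounds down, the training set absorbs the remainder, so its share is marginally above 70%.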


### Vectorization Layer

Next, we create a TextVectorization [Keras](https://keras.io/) layer and adapt it to the training set only. In Keras, a layer is a class that represents one computation step in a neural network (NN). Layers are the basic building blocks of Keras models: they define the structure and behavior of the model.

The TextVectorization layer transforms a batch of strings (one sample = one string) into either a list of token indices (one sample = 1D tensor of integer token indices) or a dense representation (one sample = 1D tensor of float values representing data about the sample’s tokens). 

Some Keras preprocessing layers (incl. TextVectorization) have an internal state that can be computed based on a sample of the training data. Crucially, these layers are non-trainable. Their state is not set during training; it must be set before training, either by initializing them from a precomputed constant, or by "adapting" them on data, via the adapt() method.
The adapt() method of TextVectorization takes either a NumPy array, a tf.data.Dataset object, or a list of strings, as in this case. Specifically, the vocabulary for the layer must be learned via *adapt()*. When this layer is adapted, it will analyze the dataset, determine the frequency of individual string values, and create a vocabulary from them. 

The aim of this step is to convert text data into a format that can be used for training and evaluating machine learning models. The vectorization process does not train any model: it only vectorizes the text data and stores the vectorized data and the configuration of the vectorizers, so that they can be used later in the training steps.

This vocabulary can have unlimited size or be capped, depending on the configuration options for this layer; if there are more unique values in the input than the maximum vocabulary size, the most frequent terms will be used to create the vocabulary. Following the inspection of the data in the previous section, we will restrict the English vocabulary to 10000 and the Italian vocabulary to 20000 tokens. We control this behaviour through the argument *max_tokens*.
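
The effect of capping the vocabulary can be sketched without TensorFlow. The example below ranks tokens by frequency and keeps only the most frequent ones. It is a simplified stand-in, not the layer's actual implementation: the function name `build_vocabulary` is hypothetical, and the assumptions that the padding entry and the out-of-vocabulary entry both count towards the cap, and that ties keep first-seen order, are choices made for this sketch.

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens=None):
    """Build a frequency-ranked vocabulary from whitespace-split sentences.

    Index 0 is reserved for padding ("") and index 1 for out-of-vocabulary
    tokens ("[UNK]"); in this sketch both count towards `max_tokens`.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    n_real = None if max_tokens is None else max(max_tokens - 2, 0)
    ranked = [tok for tok, _ in counts.most_common(n_real)]
    return ["", "[UNK]"] + ranked

corpus = ["the cat sat", "the dog sat", "the bird flew"]
print(build_vocabulary(corpus, max_tokens=4))   # ['', '[UNK]', 'the', 'sat']
print(len(build_vocabulary(corpus)))            # 8 (no cap: 2 reserved + 6 tokens)
```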

We will also train a model that uses all of the vocabulary obtained from 'method3' (Section 2) and compare performance. We expect that the model will overfit and perform more poorly.

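Finally, we need to set the maximum output length. This could be set to the length of the longest sentence, but that would decrease the performance of the NN (for reasons loosely related to overfitting). Thus we set the argument *output_sequence_length* to 20: every output has exactly 20 positions, so longer sentences are truncated and shorter ones are padded with 0s. The Italian vectorizer uses 21 positions, because one position is dropped when the decoder inputs and the targets are created.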
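
One way to sanity-check a choice such as 20 is to count how many sentences it would truncate. The sketch below does this on toy sentences; the helper name `fraction_truncated` is an assumption for illustration, and with the real data one would pass, for example, the English side of `train_pairs` together with `max_len=20`.

```python
def fraction_truncated(sentences, max_len):
    """Share of sentences with more than `max_len` whitespace-separated tokens."""
    if not sentences:
        return 0.0
    too_long = sum(1 for s in sentences if len(s.split()) > max_len)
    return too_long / len(sentences)

# Toy sentences (not taken from the dataset).
toy = ["a b c", "a b c d e", "a"]
print(fraction_truncated(toy, max_len=3))   # one of the three sentences is too long
```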

The processing of each sample consists of the following steps:
1. standardize each sample (usually lowercasing + punctuation stripping). We skip this step because the sentence pairs are already normalized.
2. split each sample into substrings (usually words). We split on ASCII whitespace.
3. recombine substrings into tokens (usually ngrams). We won't create ngrams.
4. index tokens (associate a unique int value with each token). Discussed at the bottom of the script.
5. transform each sample using this index, either into a vector of ints or a dense float vector. We will output integer indices, one integer index per split string token.
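
Steps 2, 4 and 5 can be mimicked in a few lines of plain Python. This is only a sketch of the idea: the function name `vectorize` and the small vocabulary are hypothetical, and `str.split()` splits on any whitespace rather than on ASCII whitespace only.

```python
def vectorize(sentence, vocabulary, seq_length):
    """Mimic steps 2, 4 and 5: split, look up indices, pad or truncate."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    oov_id = index["[UNK]"]
    tokens = sentence.split()                          # step 2: whitespace split
    ids = [index.get(tok, oov_id) for tok in tokens]   # step 4: token -> integer
    ids = ids[:seq_length]                             # step 5: fixed output length
    return ids + [0] * (seq_length - len(ids))

# Hypothetical vocabulary: "" is padding (0), "[UNK]" is out-of-vocabulary (1).
vocab = ["", "[UNK]", "[start]", "[end]", "ciao", "mondo"]
print(vectorize("[start] ciao mondo bello [end]", vocab, seq_length=8))
# [2, 4, 5, 1, 3, 0, 0, 0]
```

The unknown word "bello" maps to index 1, and the three trailing zeros are padding.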

We then save the vectorizers and their outputs as pickle files for later use.

In [7]:
import os
from tensorflow.keras.layers import TextVectorization

# Parameter determined after analyzing the input data
vocab_size_eng = 10000 # Maximum vocab size.
vocab_size_ita = 20000
seq_length = 20 # Sequence length to pad the outputs to

# save these settings
file_name = "key_vals.pickle"
if os.path.exists(file_name):
    print(f"file '{file_name}' exists in the current directory: not overwriting.")
else:
    print(f"{file_name} does not exist in the current directory: saving.")    
    with open(file_name, "wb") as fp:
        key_vals = {
            "vocab_size_eng": vocab_size_eng,
            "vocab_size_ita": vocab_size_ita,
            "seq_length": seq_length,
        }
        pickle.dump(key_vals, fp)

# Create vectorizers
eng_vectorizer = TextVectorization(
    max_tokens=vocab_size_eng,
    standardize=None, # No standardization (vs e.g., lower_and_strip_punctuation)
    split="whitespace",
    output_mode="int",
    output_sequence_length=seq_length,
)
ita_vectorizer = TextVectorization(
    max_tokens=vocab_size_ita,
    standardize=None,
    split="whitespace",
    output_mode="int",
    output_sequence_length=seq_length + 1
)
 
# train the vectorization layer using training dataset
#train_ita_texts = [pair[1].encode() for pair in train_pairs]
train_eng_texts = [pair[0] for pair in train_pairs]
train_ita_texts = [pair[1] for pair in train_pairs]

# Now that the vocab layer has been created, call `adapt` on the
# text-only dataset to create the vocabulary. You don't have to batch,
# but for large datasets this means we're not keeping spare copies of
# the dataset.
eng_vectorizer.adapt(train_eng_texts)
ita_vectorizer.adapt(train_ita_texts)
 
# save for subsequent steps
file_name = f"vectorized_ENGvoc_{vocab_size_eng}_ITAvoc_{vocab_size_ita}_seqLen_{seq_length}.pickle"
if os.path.exists(file_name):
    print(f"file '{file_name}' exists in the current directory: not overwriting.")
else:
    print(f"{file_name} does not exist in the current directory: saving.")
    # write only when the file is absent, as the messages above promise
    with open(file_name, "wb") as fp:
        data = {
            "train": train_pairs,
            "val":   val_pairs,
            "test":  test_pairs,
            "engvec_config":  eng_vectorizer.get_config(),
            "engvec_weights": eng_vectorizer.get_weights(),
            "engvec_vocabulary": eng_vectorizer.get_vocabulary(),
            "itavec_config":  ita_vectorizer.get_config(),
            "itavec_weights": ita_vectorizer.get_weights(),
            "itavec_vocabulary": ita_vectorizer.get_vocabulary(),
        }
        pickle.dump(data, fp)
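
The "save unless the file already exists" pattern used in this cell can also be wrapped in a small helper. The sketch below is one possible way to do it, not the notebook's own code: the name `save_if_absent` and the demo file name are hypothetical. It relies on the `"xb"` mode of Python's built-in `open()`, which raises `FileExistsError` instead of overwriting.

```python
import os
import pickle
import tempfile

def save_if_absent(path, obj):
    """Pickle `obj` to `path` unless the file already exists.

    Mode "xb" creates the file exclusively and raises FileExistsError if it
    is already there, so an existing file is never overwritten.
    Returns True if the file was written, False otherwise.
    """
    try:
        with open(path, "xb") as fp:
            pickle.dump(obj, fp)
    except FileExistsError:
        print(f"file '{path}' exists: not overwriting.")
        return False
    print(f"file '{path}' did not exist: saved.")
    return True

# Demonstration in a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_settings.pickle")
    first = save_if_absent(demo_path, {"seq_length": 20})
    second = save_if_absent(demo_path, {"seq_length": 99})
    with open(demo_path, "rb") as fp:
        stored = pickle.load(fp)
    print(first, second, stored)
```

The second call leaves the first version of the file untouched.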

At each training step (Section 9), the model will seek to predict target words N+1 (and beyond) using the source sentence and the target words 0 to N.

As such, the training dataset will yield a tuple (inputs, targets), where:
- inputs is a dictionary with the keys encoder_inputs and decoder_inputs. encoder_inputs is the vectorized source sentence and decoder_inputs is the target sentence "so far", that is to say, the words 0 to N used to predict word N+1 (and beyond) in the target sentence.
- targets is the target sentence offset by one step: it provides the next words in the target sentence, i.e., what the model will try to predict.

To understand what encoder and decoder are, please refer to Section 5.
Below, we will make use of the vectorizer and create a TensorFlow Dataset object to train our model later on.

We will first define a function (*format_dataset*) to:
1. convert the English & Italian sentences into a numerical representation (vector) using the vectorizer objects.
2. create a dictionary source with keys encoder_inputs and decoder_inputs, where the values are the vector of the English sentence and the vector of the Italian sentence without its last position, respectively.
3. create a target vector, which is the target (Italian) sentence vector **shifted by 1 token**, i.e., without its first position.
4. return a tuple of (source, target) where source is the dictionary created in step 2 and target is the vector created in step 3.

This function is called by the *make_dataset* function defined below.

In [8]:
# set up TensorFlow Dataset object
def format_dataset(eng, ita):
    """Take a batch of English and Italian sentence pairs, convert into input and target.

    The input is a dict with keys `encoder_inputs` and `decoder_inputs`, each
    a vector, corresponding to the English and Italian sentences respectively.
    The target is also a vector of the Italian sentence, shifted by 1 token.
    All vectors have the same length.

    The output will be used for training the transformer model. In the model we
    will create, the input tensors are named `encoder_inputs` and `decoder_inputs`,
    which must match the keys in the dictionary for the source part.
    """
    eng = eng_vectorizer(eng)
    ita = ita_vectorizer(ita)
    source = {"encoder_inputs": eng,
              "decoder_inputs": ita[:, :-1]}  # all positions except the last
    target = ita[:, 1:]  # all positions except the first (shifted by one)
    return (source, target)

We then create the *make_dataset()* function to take in a list of sentence pairs, and perform several preprocessing steps on the data, such as shuffling, batching, formatting and caching, to prepare the data for training machine learning models.

Specifically, the function is used to:
1. aggregate the English and Italian sentences from the pairs into two separate tuples eng_texts and ita_texts (*zip* function).
2. create the TensorFlow dataset object dataset from the lists of English and Italian sentences.
3. shuffle the dataset using the *shuffle()* method with a buffer size of 2048.
4. apply the batch() method with the specified batch size, and the map() method with the format_dataset() function on the dataset.
5. prefetch up to 16 batches with prefetch() and cache the formatted batches with cache().

In [9]:
import tensorflow as tf

def make_dataset(pairs, batch_size=64):
    """Create TensorFlow Dataset for the sentence pairs"""
    # aggregate sentences using zip(*pairs)
    eng_texts, ita_texts = zip(*pairs)
    # convert them into lists, and then create tensors
    dataset = tf.data.Dataset.from_tensor_slices((list(eng_texts), list(ita_texts)))
    # shuffle, batch, vectorize each batch, then prefetch and cache the result
    return (dataset.shuffle(2048)
                   .batch(batch_size).map(format_dataset)
                   .prefetch(16).cache())
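
The meaning of the shuffle buffer size is easier to see in a plain-Python sketch. A buffered shuffle keeps a fixed number of elements in memory, emits a random one of them, and refills the free slot with the next incoming element. The generators below (`buffered_shuffle` and `batched`, both names chosen for this illustration) imitate that idea and the subsequent batching; they are not TensorFlow code and assume a buffer size of at least 1.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield `items` in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # fill the buffer first
            continue
        pos = rng.randrange(buffer_size)
        yield buffer[pos]              # emit a random buffered element...
        buffer[pos] = item             # ...and put the new item in its slot
    rng.shuffle(buffer)                # drain what is left at the end
    yield from buffer

def batched(items, batch_size):
    """Group consecutive items into lists of at most `batch_size`."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

batches = list(batched(buffered_shuffle(range(10), buffer_size=4, seed=0), 3))
print(batches)
```

With a buffer smaller than the dataset, an element can only move a limited distance from its original position, which is why the buffer size matters for how well the data is mixed.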

Below, we create the datasets (for training and validation) and display an example.
Inputs to the encoder are fixed-length integer vectors, one per sentence. The encoder turns them into a *context* representation that carries information about the meaning of the entire sentence.
A batch of encoder inputs has shape (batch size, sequence length).
Non-zero integers represent the vectorized tokens and 0s are used for padding (no token). The model then decodes the context representation into a target sentence by generating words one at a time.

In [10]:
## Re-load data and vectorizers if necessary:
#import pickle 
# 
## load text data and vectorizer weights
#with open(f"vectorized_ENGvoc_{vocab_size_eng}_ITAvoc_{vocab_size_ita}_seqLen_{seq_length}.pickle", "rb") as fp:
#    data = pickle.load(fp)
#
#train_pairs = data["train"]
#val_pairs = data["val"]
#test_pairs = data["test"]   # not used
#
## create new instances of the English and Italian vectorizers using the configurations that were saved previously.
## The from_config() method allows for recreating the same TextVectorization layer from a previously saved configuration.
#eng_vectorizer = TextVectorization.from_config(data["engvec_config"])
#eng_vectorizer.set_weights(data["engvec_weights"])
#eng_vectorizer.set_vocabulary(data["engvec_vocabulary"])
#ita_vectorizer = TextVectorization.from_config(data["itavec_config"])
#ita_vectorizer.set_weights(data["itavec_weights"])
#ita_vectorizer.set_vocabulary(data["itavec_vocabulary"])

import tensorflow as tf

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

# we will reuse this code later to make the dataset objects. Now we just test their dimensions.
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["encoder_inputs"][0]: {inputs["encoder_inputs"][0]}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"][0]: {inputs["decoder_inputs"][0]}')
    print(f"targets.shape: {targets.shape}")
    print(f"targets[0]: {targets[0]}")

inputs["encoder_inputs"].shape: (64, 20)
inputs["encoder_inputs"][0]: [  3 330  43   9 342   2   0   0   0   0   0   0   0   0   0   0   0   0
   0   0]
inputs["decoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"][0]: [   2 3608   12  363    4    3    0    0    0    0    0    0    0    0
    0    0    0    0    0    0]
targets.shape: (64, 20)
targets[0]: [3608   12  363    4    3    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0]


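As you can see, in this example the input to the encoder happens to have the same number of non-zero tokens (6) as the input to the decoder, but this is not necessarily the case. The targets are the decoder inputs shifted left by one position: the target at position N is the decoder input at position N+1. Therefore, if the decoder input contains N non-zero elements, the target contains N-1.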
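
The relation between decoder inputs and targets can be reproduced by hand from the example printed above. The sketch below rebuilds the 21-position Italian vector from the printed values (assuming the remaining positions are padding) and applies the same two slices as *format_dataset*, here on a single row instead of a batch.

```python
# Italian example from the printed output, as it would come out of the
# Italian vectorizer: 21 positions, since seq_length + 1 = 21.
seq_length = 20
ita_row = [2, 3608, 12, 363, 4, 3] + [0] * 15
assert len(ita_row) == seq_length + 1

decoder_input = ita_row[:-1]   # drop the last position
target = ita_row[1:]           # drop the first position

def n_tokens(row):
    """Count non-padding entries."""
    return sum(1 for i in row if i != 0)

print(n_tokens(decoder_input), n_tokens(target))   # 6 5
# At every position t, target[t] is the token that follows decoder_input[t].
print(list(zip(decoder_input[:6], target[:6])))
```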

The *train_ds* variable now contains a dataset object that is ready to be used to train a machine learning model. It gives the model streamlined access to the training data, in batches and with the preprocessing already done, which speeds up the training process.

In [11]:
# save for subsequent steps
with open("train_val_datasets.pickle", "wb") as fp:
    datasets = {
        "train": train_ds,
        "val":   val_ds
    }
    pickle.dump(datasets, fp)

Finally we save the *format_dataset* & *make_dataset* functions in the *create_dataset.py* file for later use.

We also create new datasets using 80% of the vocabulary of the NLTK tokenizer.

To do so, we change how the TextVectorization layer builds its index, i.e., its vocabulary. By default, the layer creates the vocabulary for you from the tokens obtained in the previous processing steps: you can pass an integer max_tokens to cap the size of the vocabulary, or leave the default value None for no cap. To actually build this index, you call the layer's adapt method, which fits the state of the preprocessing layer to the dataset. The adapt method also applies the relevant preprocessing to the inputs before the tokens are counted.

If you want to set your own vocabulary, you can use the set_vocabulary method. This method sets the vocabulary and document frequency (DF) data for this layer directly, instead of analyzing a dataset through adapt. It should be used whenever the vocabulary (and optionally document frequency) information is already known. If vocabulary data is already present in the layer, this method will either replace it (if append is set to False) or append to it (if append is set to True).

The parameter list below follows an older TensorFlow release; newer releases use the signature set_vocabulary(vocabulary, idf_weights=None), so check the documentation of your installed version. The set_vocabulary method takes the following parameters:
- vocab : an array of string tokens.
- df_data : an array of document frequencies. Only necessary if the layer output_mode is tf-idf.
- oov_df_value : the document frequency of the out-of-vocabulary (OOV) token. Only necessary if output_mode is tf-idf. OOV data is optional when appending additional data in tf-idf mode; if an OOV value is supplied it will overwrite the existing OOV value.
- append : whether to overwrite or append any existing vocabulary data.

In [25]:
import os
import numpy as np
from tensorflow.keras.layers import TextVectorization

import pickle

file_name = "NLTK_vocab.pickle"
with open(file_name, "rb") as fp:
    vocabulary = pickle.load(fp)
    
seq_length = 20 # Sequence length to pad the outputs to

# save these settings
file_name = "key_vals_method2.pickle"
if os.path.exists(file_name):
    print(f"file '{file_name}' exists in the current directory: not overwriting.")
else:
    print(f"{file_name} does not exist in the current directory: saving.")    
    with open(file_name, "wb") as fp:
        key_vals = {
          "vocab_size_eng": vocabulary["vocab_size_eng"],
          "vocab_size_ita": vocabulary["vocab_size_ita"],
          "seq_length": seq_length
        }
        pickle.dump(key_vals,fp)

# Create vectorizers
eng_vectorizer = TextVectorization(
    #max_tokens=vocab_size_eng,
    standardize=None,
    split="whitespace",
    output_mode="int",
    output_sequence_length=seq_length,
)
eng_vectorizer.set_vocabulary(list(vocabulary["vocab_eng"]))
                              
ita_vectorizer = TextVectorization(
    #max_tokens=vocab_size_ita,
    standardize=None,
    split="whitespace",
    output_mode="int",
    output_sequence_length=seq_length + 1
)
ita_vectorizer.set_vocabulary(list(vocabulary["vocab_ita"]))

 
# The vocabulary was set directly with set_vocabulary() above, so the
# layers are not adapted to the training texts in this cell.
train_eng_texts = [pair[0] for pair in train_pairs]
train_ita_texts = [pair[1] for pair in train_pairs]
#eng_vectorizer.adapt(train_eng_texts)
#ita_vectorizer.adapt(train_ita_texts)
 
# save for subsequent steps
file_name = f"vectorized_ENGvoc_{vocabulary['vocab_size_eng']}_ITAvoc_{vocabulary['vocab_size_ita']}_seqLen_{seq_length}.pickle"
if os.path.exists(file_name):
    print(f"file '{file_name}' exists in the current directory: not overwriting.")
else:
    print(f"{file_name} does not exist in the current directory: saving.")
    # write only when the file is absent, as the messages above promise
    with open(file_name, "wb") as fp:
        data = {
            "train": train_pairs,
            "val":   val_pairs,
            "test":  test_pairs,
            "engvec_config":  eng_vectorizer.get_config(),
            "engvec_weights": eng_vectorizer.get_weights(),
            "engvec_vocabulary": eng_vectorizer.get_vocabulary(),
            "itavec_config":  ita_vectorizer.get_config(),
            "itavec_weights": ita_vectorizer.get_weights(),
            "itavec_vocabulary": ita_vectorizer.get_vocabulary(),
        }
        pickle.dump(data, fp)

file 'key_vals_method2.pickle' exists in the current directory: not overwriting.
vectorized_ENGvoc_14061_ITAvoc_27495_seqLen_20.pickle does not exist in the current directory: saving.


In [27]:
import os
import numpy as np
from tensorflow.keras.layers import TextVectorization

import pickle

file_name = "NLTK_vocab.pickle"
with open(file_name, "rb") as fp:
    vocabulary = pickle.load(fp)
    
seq_length = 20 # Sequence length to pad the outputs to

# save these settings
file_name = "key_vals_method2.pickle"
if os.path.exists(file_name):
    print(f"file '{file_name}' exists in the current directory: not overwriting.")
else:
    print(f"{file_name} does not exist in the current directory: saving.")    
    with open(file_name, "wb") as fp:
        key_vals = {
          "vocab_size_eng": vocabulary["vocab_size_eng"],
          "vocab_size_ita": vocabulary["vocab_size_ita"],
          "seq_length": seq_length
        }
        pickle.dump(key_vals,fp)

# Create vectorizers
eng_vectorizer = TextVectorization(
    #max_tokens=vocab_size_eng,
    standardize=None,
    split="whitespace",
    output_mode="int",
    output_sequence_length=seq_length,
)
eng_vectorizer.set_vocabulary(list(vocabulary["vocab_eng"]))
                              
ita_vectorizer = TextVectorization(
    #max_tokens=vocab_size_ita,
    standardize=None,
    split="whitespace",
    output_mode="int",
    output_sequence_length=seq_length + 1
)
ita_vectorizer.set_vocabulary(list(vocabulary["vocab_ita"]))

 
# train the vectorization layer using training dataset
#train_ita_texts = [pair[1].encode() for pair in train_pairs]
train_eng_texts = [pair[0] for pair in train_pairs]
train_ita_texts = [pair[1] for pair in train_pairs]

# Now that the vocab layer has been created, call `adapt` on the
# text-only dataset to create the vocabulary. You don't have to batch,
# but for large datasets this means we're not keeping spare copies of
# the dataset.
eng_vectorizer.adapt(train_eng_texts)
ita_vectorizer.adapt(train_ita_texts)
 
# save for subsequent steps
file_name = f"vectorized_ENGvoc_{vocabulary['vocab_size_eng']}_ITAvoc_{vocabulary['vocab_size_ita']}_seqLen_{seq_length}.pickle"
if os.path.exists(file_name):
    print(f"file '{file_name}' exists in the current directory: not overwriting.")
else:
    print(f"{file_name} does not exist in the current directory: saving.")
    # write only when the file is absent, as the messages above promise
    with open(file_name, "wb") as fp:
        data = {
            "train": train_pairs,
            "val":   val_pairs,
            "test":  test_pairs,
            "engvec_config":  eng_vectorizer.get_config(),
            "engvec_weights": eng_vectorizer.get_weights(),
            "engvec_vocabulary": eng_vectorizer.get_vocabulary(),
            "itavec_config":  ita_vectorizer.get_config(),
            "itavec_weights": ita_vectorizer.get_weights(),
            "itavec_vocabulary": ita_vectorizer.get_vocabulary(),
        }
        pickle.dump(data, fp)

file 'key_vals_method2.pickle' exists in the current directory: not overwriting.
vectorized_ENGvoc_14061_ITAvoc_27495_seqLen_20.pickle does not exist in the current directory: saving.
