<a href="https://colab.research.google.com/github/MatteoZanella/NLU-project-ML2/blob/main/NLU_project_ML2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLU Project - ML2

## Data


Download the Penn Treebank Dataset

In [1]:
%%capture
!wget -nc https://data.deepai.org/ptbdataset.zip
!unzip -n ptbdataset.zip -d ptbdataset

Download the pre-trained models (saved during the development of this project)

In [2]:
%%capture
!wget -nc https://github.com/MatteoZanella/NLU-project-ML2/raw/main/models/simple.zip
!unzip -n simple.zip
!wget -nc https://github.com/MatteoZanella/NLU-project-ML2/raw/main/models/dense.zip
!unzip -n dense.zip
!wget -nc https://github.com/MatteoZanella/NLU-project-ML2/raw/main/models/complex.zip
!unzip -n complex.zip
!wget -nc https://github.com/MatteoZanella/NLU-project-ML2/raw/main/models/huge_reverse.zip
!unzip -n huge_reverse.zip

### Text processing

Load the data in a Dataset:
- Tags substitution for uniformity
  - Numbers: N -> [N]
  - Unknown words: \<unk\> -> [UNK]
  - Start and end of sentences: [S], [/S]
- Create the targets as the sentences shifted by 1 (the target has a [/S] at the end instead of a [S] tag at the start)
- Use a TextVectorization to convert the sentences into vectors of words, and to convert the words into integers
- Shuffle the training set
- Create padded minibatches: the shortest sentences are padded with zeros (the integer of the padding word) until they match the length of the longest sentence in the minibatch
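
The steps listed above can be illustrated on a toy example in plain Python. This is only a sketch: the vocabulary, the sentences and the helper names (`make_input_and_target`, `pad_batch`) are hypothetical, and the words are split on whitespace instead of going through `tf.strings.regex_replace` and `TextVectorization` as the notebook does.

```python
PAD = 0  # integer reserved for the padding word

def make_input_and_target(sentence):
    """Return the (input, target) word lists for one raw sentence."""
    words = ["[N]" if w == "N" else "[UNK]" if w == "<unk>" else w
             for w in sentence.split()]
    # The input starts with [S]; the target is shifted by one and ends with [/S]
    return ["[S]"] + words, words + ["[/S]"]

def pad_batch(sequences):
    """Right-pad integer sequences with zeros up to the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (longest - len(seq)) for seq in sequences]

# Hypothetical toy vocabulary: 0 is padding, 1 is the unknown word
vocabulary = {"[UNK]": 1, "[S]": 2, "[/S]": 3, "[N]": 4,
              "shares": 5, "rose": 6, "points": 7}

pairs = [make_input_and_target(s) for s in ["shares rose N points", "<unk> rose"]]
inputs = pad_batch([[vocabulary[w] for w in x] for x, _ in pairs])
targets = pad_batch([[vocabulary[w] for w in y] for _, y in pairs])
```

In this sketch the second sentence is two words shorter than the first, so both its input and its target receive two trailing zeros.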

In [3]:
import tensorflow as tf
from tensorflow.keras import layers

In [203]:
# Modify and add the tags, add the targets
def add_tags_and_targets(sentence):
  sentence = tf.strings.regex_replace(sentence, " N ", " [N] ")
  sentence = tf.strings.regex_replace(sentence, " <unk> ", " [UNK] ")
  return '[S]' + sentence, sentence + '[/S]'

train = tf.data.TextLineDataset('ptbdataset/ptb.train.txt').map(add_tags_and_targets)
valid = tf.data.TextLineDataset('ptbdataset/ptb.valid.txt').map(add_tags_and_targets)
test = tf.data.TextLineDataset('ptbdataset/ptb.test.txt').map(add_tags_and_targets)

# Training set: 42068 sentences
# Validation set: 3370 sentences
# Test set: 3761 sentences

# Vectorize the dataset (use the input sentences with both [S], [/S])
textVectorization = layers.TextVectorization(standardize=None)
textVectorization.adapt(train.map(lambda x, y: x + '[/S]'))
VOCABULARY_SIZE = textVectorization.vocabulary_size()

# Shuffling
BUFFER_SIZE = 42068  # Equal to the training set size
train = train.shuffle(BUFFER_SIZE, reshuffle_each_iteration=True)

BATCH_SIZE = 16
train = train.map(lambda x, y: (textVectorization(x), textVectorization(y))).padded_batch(BATCH_SIZE)
valid = valid.map(lambda x, y: (textVectorization(x), textVectorization(y))).padded_batch(128)
test = test.map(lambda x, y: (textVectorization(x), textVectorization(y))).padded_batch(128)

## Additional structures

### Loss function
The Keras implementation of the sparse categorical cross-entropy is not suited to time sequences with shape (batch size, sequence length, vocabulary). When masking is applied, it sets the cross-entropy of the padded positions to zero, but the reduction still counts those zeros in the mean, so heavily padded sentences contribute an artificially low loss.

This implementation:
- Computes the cross-entropies without reduction
- Masks out the values corresponding to padded words in the true labels
- Computes the number of valid words for each sentence
- Divides the cross-entropies by the number of valid words in the same sentence and multiplies by the length of the padded sentences

This way, the mean value over the entire matrix (zero-padded values included) corresponds to the mean of the within-sentence cross-entropy averages. The cross-entropies are masked explicitly, even though this is not strictly necessary, so that the perplexity can be computed easily.
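
The effect of the rescaling can be checked by hand on a small matrix. The sketch below is plain Python with made-up cross-entropy values; `rescaled_masked_entropies` is a hypothetical helper that mirrors the four steps listed above, not the TensorFlow function used in the notebook.

```python
def mean(values):
    return sum(values) / len(values)

def rescaled_masked_entropies(entropies, targets):
    """Mask padded positions, divide by the valid words of each sentence
    and multiply by the padded sequence length."""
    sequence_length = len(targets[0])
    rescaled = []
    for row, target_row in zip(entropies, targets):
        mask = [1.0 if t != 0 else 0.0 for t in target_row]
        valid = sum(mask)
        rescaled.append([e * m / valid * sequence_length
                         for e, m in zip(row, mask)])
    return rescaled

# Two sentences padded to length 4; the second has only 2 valid words
entropies = [[2.0, 4.0, 6.0, 8.0],
             [1.0, 3.0, 9.9, 9.9]]  # values at padded positions are arbitrary
targets = [[5, 6, 7, 3],
           [8, 3, 0, 0]]

rescaled = rescaled_masked_entropies(entropies, targets)
matrix_mean = mean([v for row in rescaled for v in row])
per_sentence_mean = mean([mean([2.0, 4.0, 6.0, 8.0]), mean([1.0, 3.0])])
# Masking alone, followed by a plain mean, underweights the padded sentence
naive_masked_mean = mean([2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 0.0, 0.0])
```

Here `matrix_mean` and `per_sentence_mean` are both 3.5, while `naive_masked_mean` is 3.0.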


In [5]:
def sequence_sparse_categorical_crossentropy(y_true, y_pred):
  # Standard sparse categorical cross-entropy loss, not reduced
  entropies = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
  # Masking: do not consider padded values. Compute valid values per sample
  mask = tf.cast(y_true != 0, tf.float32)
  entropies = tf.multiply(entropies, mask)
  valid_per_sample = tf.reshape(tf.reduce_sum(mask, axis=-1), (-1, 1))
  sequence_length = tf.cast(tf.shape(mask)[1], tf.float32)
  # Apply the numerical corrections: now a tf.reduce_mean(entropies) produces the correct result
  return tf.multiply(tf.divide(entropies, valid_per_sample), sequence_length)

### Perplexity
The perplexity can be computed as the exponential of the cross-entropy averaged across all sentences. A stateful class is necessary to get the true average cross-entropy rather than the average of the minibatch averages (the two are almost equal, but they differ when minibatches have different sizes).

The cross-entropy in Keras is computed using the natural logarithm, so the perplexity is computed with the natural exponential.
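
A small numeric sketch shows why the running sum and count matter. The per-sentence cross-entropies below are made up, and `RunningPerplexity` is a hypothetical plain-Python stand-in for the Keras metric, assuming each sentence already has its own average cross-entropy.

```python
import math

class RunningPerplexity:
    """Accumulates per-sentence cross-entropies across minibatches."""
    def __init__(self):
        self.entropies_sum = 0.0
        self.samples_count = 0

    def update(self, sentence_entropies):
        self.entropies_sum += sum(sentence_entropies)
        self.samples_count += len(sentence_entropies)

    def result(self):
        # Natural exponential, because the cross-entropy uses the natural log
        return math.exp(self.entropies_sum / self.samples_count)

batch_a = [4.0, 5.0, 6.0]  # 3 sentences, batch mean 5.0
batch_b = [3.0]            # 1 sentence, batch mean 3.0

metric = RunningPerplexity()
metric.update(batch_a)
metric.update(batch_b)

true_mean = sum(batch_a + batch_b) / 4                      # 4.5
mean_of_means = (sum(batch_a) / 3 + sum(batch_b) / 1) / 2   # 4.0
```

With unequal batch sizes the two averages differ (4.5 against 4.0), and only the first one matches the value accumulated by the stateful metric.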

In [6]:
class Perplexity(tf.keras.metrics.Metric):
  def __init__(self, name='perplexity', **kwargs):
    super().__init__(name=name, **kwargs)
    self.entropies_sum = self.add_weight(name='ce', initializer='zeros')
    self.samples_count = self.add_weight(name='n', initializer='zeros')

  def update_state(self, y_true, y_pred, sample_weight=None):
    # entropies.shape = (batch_size, sequences_length)
    entropies = sequence_sparse_categorical_crossentropy(y_true, y_pred)
    batch_size = tf.cast(tf.shape(entropies)[0], dtype=tf.float32)
    entropies_sum = tf.multiply(tf.reduce_mean(entropies), batch_size)
    self.samples_count.assign_add(batch_size)
    self.entropies_sum.assign_add(entropies_sum)

  def result(self):
    # Perplexity over all samples (sentences), as the exp of the average entropy
    return tf.math.exp(tf.divide(self.entropies_sum, self.samples_count))
  
  def reset_state(self):
    # The state of the metric will be reset at the start of each epoch.
    self.entropies_sum.assign(0.)
    self.samples_count.assign(0.)

## RNN Models

In [7]:
custom_objects = {"Perplexity": Perplexity, "sequence_sparse_categorical_crossentropy": sequence_sparse_categorical_crossentropy}

### Simple model (1)
Using a ReLU activation in the GRU units seems to help convergence in fewer epochs, but execution becomes slower because the efficient cuDNN implementation can't be used. Overall, it's preferable to keep the tanh activation.
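
The Keras documentation for the GRU layer lists the argument values required for the cuDNN kernel to be selected. As a rough sketch, they can be written as a plain-Python check; `can_use_cudnn_gru` is a hypothetical helper, not part of Keras, and it ignores the additional requirement that masked inputs be strictly right-padded.

```python
def can_use_cudnn_gru(activation="tanh", recurrent_activation="sigmoid",
                      recurrent_dropout=0.0, unroll=False, use_bias=True,
                      reset_after=True):
    """Return True when the GRU arguments match the documented cuDNN requirements."""
    return (activation == "tanh"
            and recurrent_activation == "sigmoid"
            and recurrent_dropout == 0
            and not unroll
            and use_bias
            and reset_after)

default_is_fast = can_use_cudnn_gru()                 # Keras defaults
relu_is_fast = can_use_cudnn_gru(activation="relu")   # ReLU variant
```

The input `dropout` argument used in the models of this notebook is not part of these conditions, so it does not prevent the fast kernel from being used.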

In [8]:
model_1 = tf.keras.Sequential(name='simple')
# Input: our inputs have None (unknown, but equal within a batch because of padding) time steps (words). Single feature
model_1.add(layers.InputLayer(input_shape=(None,)))
# Masking: do not consider 0 values (padding)
# Embedding/reshaping: the word index is converted into a feature vector
model_1.add(layers.Embedding(input_dim=VOCABULARY_SIZE, output_dim=32, mask_zero=True))
# Recurrent layers
model_1.add(layers.GRU(64, dropout=.5, return_sequences=True))
# Dense layers
model_1.add(layers.Dense(VOCABULARY_SIZE, activation='softmax'))
model_1.summary()

Model: "simple"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          320128    
_________________________________________________________________
gru (GRU)                    (None, None, 64)          18816     
_________________________________________________________________
dense (Dense)                (None, None, 10004)       650260    
Total params: 989,204
Trainable params: 989,204
Non-trainable params: 0
_________________________________________________________________
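
The parameter counts in the summary can be reproduced by hand. The sketch below uses hypothetical helper names and assumes the Keras default `reset_after=True` for the GRU, which gives two bias vectors per gate.

```python
VOCABULARY_SIZE = 10004  # output size of the dense layer in the summary

def embedding_params(vocabulary_size, output_dim):
    return vocabulary_size * output_dim

def gru_params(input_dim, units):
    # 3 gates, each with an input kernel, a recurrent kernel and two biases
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    return input_dim * units + units

simple_total = (embedding_params(VOCABULARY_SIZE, 32)
                + gru_params(32, 64)
                + dense_params(64, VOCABULARY_SIZE))
```

For the simple model this gives 320,128 + 18,816 + 650,260 = 989,204 parameters, matching the summary.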


#### Pretrained evaluation
- Loss: 5.04849
- Perplexity: 156

In [9]:
test_model = tf.keras.models.load_model('simple', custom_objects=custom_objects)
test_model.evaluate(test)



[5.0484938621521, 155.7876434326172]

### Dense model (2)
Since stacking RNN layers was worsening the perplexity of the model, I tried using skip-connections inspired by the DenseNet architecture.
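
With this kind of skip-connection, each GRU output is concatenated with the input of the same GRU, so the feature width grows block after block. A minimal sketch of the width bookkeeping, using a hypothetical helper name:

```python
def dense_block_widths(embedding_dim, units, blocks):
    """Feature width after the embedding and after each concatenation."""
    widths = [embedding_dim]
    for _ in range(blocks):
        # The GRU reads widths[-1] features and adds `units` new ones
        widths.append(widths[-1] + units)
    return widths

widths = dense_block_widths(embedding_dim=512, units=512, blocks=2)
```

With a 512-wide embedding and two 512-unit GRU layers, the widths are 512, 1024 and 1536, so the second GRU reads 1024 features and the final dense layer reads 1536.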

In [10]:
inputs = tf.keras.Input(shape=(None,))
x = layers.Embedding(VOCABULARY_SIZE, output_dim=512, mask_zero=True)(inputs)
for i in range(2):
  gru = layers.GRU(512, dropout=.4, return_sequences=True)(x)
  x = layers.concatenate([x, gru])
x = layers.Dropout(.4)(x)
outputs = layers.Dense(VOCABULARY_SIZE, activation='softmax')(x)

model_2 = tf.keras.Model(inputs, outputs, name='dense')
model_2.summary()

Model: "dense"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 512)    5122048     input_2[0][0]                    
__________________________________________________________________________________________________
gru_1 (GRU)                     (None, None, 512)    1575936     embedding_1[0][0]                
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, None, 1024)   0           embedding_1[0][0]                
                                                                 gru_1[0][0]                  

#### Pretrained evaluation
- Loss: 4.73937
- Perplexity: 114

In [11]:
test_model = tf.keras.models.load_model('dense', custom_objects=custom_objects)
test_model.evaluate(test)



[4.739375114440918, 114.36271667480469]

### Complex model (3)

In [12]:
LAYERS_COUNT = 2
DROPOUT = .5
RNN_UNITS = 650

model_3 = tf.keras.Sequential(name='complex')
model_3.add(layers.InputLayer(input_shape=(None,)))
model_3.add(layers.Embedding(VOCABULARY_SIZE, RNN_UNITS, mask_zero=True))
for i in range(LAYERS_COUNT):
  model_3.add(layers.GRU(RNN_UNITS, dropout=DROPOUT, return_sequences=True))
# Dense layers
model_3.add(layers.Dropout(DROPOUT))
model_3.add(layers.Dense(VOCABULARY_SIZE, activation='softmax'))
model_3.summary()

Model: "complex"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 650)         6502600   
_________________________________________________________________
gru_3 (GRU)                  (None, None, 650)         2538900   
_________________________________________________________________
gru_4 (GRU)                  (None, None, 650)         2538900   
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 650)         0         
_________________________________________________________________
dense_2 (Dense)              (None, None, 10004)       6512604   
Total params: 18,093,004
Trainable params: 18,093,004
Non-trainable params: 0
_________________________________________________________________


#### Pretrained evaluation
This network has somewhat fewer parameters than the dense model, but its perplexity is about the same (slightly worse).
- Loss: 4.78133
- Perplexity: 119

In [13]:
test_model = tf.keras.models.load_model('complex', custom_objects=custom_objects)
test_model.evaluate(test)



[4.781332969665527, 119.26322174072266]

### Complex+ReverseEmbedding model (4)
Like the complex model, but with an extra regularization on the final decoder
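
The regularizer defined in this section penalizes the norm of the difference between the dense kernel and the transposed embedding matrix, which can be read as a soft form of weight tying between the encoder and the decoder. The sketch below reproduces the same quantity in plain Python on tiny made-up matrices; `reverse_embedding_penalty` is a hypothetical name.

```python
import math

def reverse_embedding_penalty(embeddings, kernel):
    """Frobenius norm of (embeddings^T - kernel).

    embeddings: vocabulary_size x units, kernel: units x vocabulary_size.
    """
    total = 0.0
    for i, kernel_row in enumerate(kernel):
        for j, weight in enumerate(kernel_row):
            total += (embeddings[j][i] - weight) ** 2
    return math.sqrt(total)

embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3 words, 2 units
tied_kernel = [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]     # exact transpose
shifted_kernel = [[1.0, 3.0, 5.0], [2.0, 4.0, 9.0]]  # one weight moved by 3

tied_penalty = reverse_embedding_penalty(embeddings, tied_kernel)
shifted_penalty = reverse_embedding_penalty(embeddings, shifted_kernel)
```

The penalty is zero when the kernel is exactly the transposed embedding matrix and grows with the distance from it (3.0 in the second case).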


In [14]:
class ReverseEmbedding(tf.keras.regularizers.Regularizer):
  def __init__(self, embedding_layer):
    super().__init__()
    self.embedding_layer = embedding_layer

  def __call__(self, x):
    # Penalize the distance between the kernel and the transposed embedding matrix
    target = tf.transpose(self.embedding_layer.embeddings)
    return tf.norm(target - x)

  def get_config(self):
    return {"embedding_layer": self.embedding_layer}

In [15]:
LAYERS_COUNT = 2
DROPOUT = .5
RNN_UNITS = 650

model_4 = tf.keras.Sequential(name='reverse')
model_4.add(layers.InputLayer(input_shape=(None,)))
embedding_layer = layers.Embedding(VOCABULARY_SIZE, RNN_UNITS, mask_zero=True)
model_4.add(embedding_layer)
for i in range(LAYERS_COUNT):
  model_4.add(layers.GRU(RNN_UNITS, dropout=DROPOUT, return_sequences=True))
# Dense layers
model_4.add(layers.Dropout(DROPOUT))
model_4.add(layers.Dense(VOCABULARY_SIZE, kernel_regularizer=ReverseEmbedding(embedding_layer), activation='softmax'))
model_4.summary()

Model: "reverse"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 650)         6502600   
_________________________________________________________________
gru_5 (GRU)                  (None, None, 650)         2538900   
_________________________________________________________________
gru_6 (GRU)                  (None, None, 650)         2538900   
_________________________________________________________________
dropout_2 (Dropout)          (None, None, 650)         0         
_________________________________________________________________
dense_3 (Dense)              (None, None, 10004)       6512604   
Total params: 18,093,004
Trainable params: 18,093,004
Non-trainable params: 0
_________________________________________________________________


#### Pretrained evaluation
The regularization term increases the loss value. The perplexity is slightly better than that of the complex model (116 vs. 119). Training is smoother, with fewer increases in validation perplexity between consecutive epochs.
- Loss: 7.23776
- Perplexity: 116


In [16]:
# For some reason, this model is too heavy to be saved on GitHub

### Huge+ReverseEmbedding model (5)

The complex model with an increased number of RNN units, complemented by the ReverseEmbedding regularization.

In [17]:
LAYERS_COUNT = 2
DROPOUT = .6
RNN_UNITS = 880

model_5 = tf.keras.Sequential(name='huge_reverse')
model_5.add(layers.InputLayer(input_shape=(None,)))
embedding_layer = layers.Embedding(VOCABULARY_SIZE, RNN_UNITS, mask_zero=True)
model_5.add(embedding_layer)
for i in range(LAYERS_COUNT):
  model_5.add(layers.GRU(RNN_UNITS, dropout=DROPOUT, return_sequences=True))
# Dense layers
model_5.add(layers.Dropout(DROPOUT))
model_5.add(layers.Dense(VOCABULARY_SIZE, kernel_regularizer=ReverseEmbedding(embedding_layer), activation='softmax'))
model_5.summary()

Model: "huge_reverse"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 880)         8803520   
_________________________________________________________________
gru_7 (GRU)                  (None, None, 880)         4651680   
_________________________________________________________________
gru_8 (GRU)                  (None, None, 880)         4651680   
_________________________________________________________________
dropout_3 (Dropout)          (None, None, 880)         0         
_________________________________________________________________
dense_4 (Dense)              (None, None, 10004)       8813524   
Total params: 26,920,404
Trainable params: 26,920,404
Non-trainable params: 0
_________________________________________________________________


#### Pretrained evaluation
The improvement in perplexity is limited (111 vs. 116 for the previous model). The loss includes the regularization term.
- Loss: 7.11412
- Perplexity: 111

In [18]:
test_model = tf.keras.models.load_model('huge_reverse', custom_objects=custom_objects)
test_model.evaluate(test)



[7.114117622375488, 111.15992736816406]
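
For the models without regularization, the reported loss equals the natural logarithm of the perplexity (for the simple model, ln 155.79 ≈ 5.0485). Assuming the same holds for the cross-entropy part of the regularized models, the regularization penalty can be estimated as the gap between the two. This is a back-of-the-envelope sketch, not a value logged by the notebook; `split_loss` is a hypothetical helper.

```python
import math

def split_loss(reported_loss, perplexity):
    """Split a reported loss into (cross-entropy, estimated penalty)."""
    cross_entropy = math.log(perplexity)
    return cross_entropy, reported_loss - cross_entropy

# Evaluation output of the huge_reverse model
huge_ce, huge_penalty = split_loss(7.114117622375488, 111.15992736816406)
# Sanity check on the simple model, which has no regularizer
simple_ce, simple_penalty = split_loss(5.0484938621521, 155.7876434326172)
```

Under this assumption, roughly 4.71 of the huge_reverse loss is cross-entropy and about 2.40 is the regularization term, while the estimated penalty of the simple model is essentially zero.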

## Language models

In [215]:
def language_model(model, vocabulary):
  lm = tf.keras.Sequential(name=f"{model.name}_LM")
  # New TextVectorization with the same vocabulary, but with the default standardization
  lm.add(layers.TextVectorization(vocabulary=vocabulary))
  lm.add(model)
  # Pick the most probable word index at the last time step
  lm.add(layers.Lambda(lambda x: tf.argmax(x[:, -1, :], axis=-1)))
  # Map the index back to its word; StringLookup re-adds the padding and unknown tokens
  lm.add(layers.StringLookup(invert=True, mask_token='', vocabulary=vocabulary[2:]))
  return lm

In [214]:
lm = language_model(test_model, textVectorization.get_vocabulary())
lm.predict(["The investments of the company raised in the last"])

array([b'year'], dtype=object)
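
The prediction step of `language_model` amounts to a greedy choice: take the distribution produced for the last input word and return the most probable vocabulary entry. A plain-Python sketch with a made-up vocabulary and made-up probabilities (`greedy_next_word` is a hypothetical name):

```python
def greedy_next_word(probabilities, vocabulary):
    """probabilities holds one distribution over the vocabulary per input position."""
    last_step = probabilities[-1]
    best_index = max(range(len(last_step)), key=last_step.__getitem__)
    return vocabulary[best_index]

vocabulary = ["", "[UNK]", "the", "year", "week"]
probabilities = [
    [0.0, 0.1, 0.6, 0.2, 0.1],  # distribution after the first input word
    [0.0, 0.1, 0.1, 0.5, 0.3],  # distribution after the last input word
]
next_word = greedy_next_word(probabilities, vocabulary)
```

Only the last row is used, so the earlier positions influence the result solely through the recurrent state of the model.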

## Execution

### Training
The SGD optimizer performs significantly better than Adam.

In [None]:
callbacks = [
  tf.keras.callbacks.ReduceLROnPlateau('val_perplexity', factor=0.8, min_delta=1.),
  tf.keras.callbacks.EarlyStopping("val_perplexity", min_delta=1., patience=10, restore_best_weights=True),
]

model = model_5
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1., clipnorm=10.), 
              loss=sequence_sparse_categorical_crossentropy, 
              metrics=[Perplexity()])
model.fit(train, epochs=60, validation_data=valid, callbacks=callbacks)

Epochs 1/60 to 31/60 (per-epoch progress output not preserved)


<keras.callbacks.History at 0x7f41fbed8f50>
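
The interaction of the two callbacks can be replayed on a list of validation perplexities. The sketch below is a simplified plain-Python model of `ReduceLROnPlateau` and `EarlyStopping` in "min" mode: it ignores cooldown, `min_lr` and weight restoring, it assumes the Keras default patience of 10 for `ReduceLROnPlateau`, and the perplexity values are made up.

```python
def simulate_plateau_callbacks(history, lr=1.0, factor=0.8, min_delta=1.0,
                               lr_patience=10, stop_patience=10):
    """Return (last epoch run, final learning rate) for a list of
    validation perplexities, one per epoch."""
    best_lr, wait_lr = float("inf"), 0
    best_stop, wait_stop = float("inf"), 0
    for epoch, value in enumerate(history, start=1):
        # ReduceLROnPlateau: shrink the learning rate after a plateau
        if value < best_lr - min_delta:
            best_lr, wait_lr = value, 0
        else:
            wait_lr += 1
            if wait_lr >= lr_patience:
                lr *= factor
                wait_lr = 0
        # EarlyStopping: stop after `stop_patience` epochs without improvement
        if value < best_stop - min_delta:
            best_stop, wait_stop = value, 0
        else:
            wait_stop += 1
            if wait_stop >= stop_patience:
                return epoch, lr
    return len(history), lr

# Three improving epochs followed by a plateau
stop_epoch, final_lr = simulate_plateau_callbacks([200.0, 150.0, 130.0] + [129.5] * 10)
```

In this replay the run stops at epoch 13 with a learning rate of 0.8. Under these simplifications, equal patience values make the learning-rate reduction and the stop fire at the same epoch, so a shorter `lr_patience` would be needed for the reduced rate to have any effect.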

### Evaluation

In [None]:
model.evaluate(test)



[7.114120006561279, 111.1594467163086]

### Save

In [None]:
name = model.name
model.save(name)
!zip -r saved-model.zip $name/



INFO:tensorflow:Assets written to: huge_reverse/assets


  adding: huge_reverse/ (stored 0%)
  adding: huge_reverse/keras_metadata.pb (deflated 91%)
  adding: huge_reverse/assets/ (stored 0%)
  adding: huge_reverse/saved_model.pb (deflated 90%)
  adding: huge_reverse/variables/ (stored 0%)
  adding: huge_reverse/variables/variables.index (deflated 54%)
  adding: huge_reverse/variables/variables.data-00000-of-00001 (deflated 7%)
