This is a companion notebook for the book [Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff). For readability, it only contains runnable code blocks and section titles, and omits everything else in the book: text paragraphs, figures, and pseudocode.

**If you want to be able to follow what's going on, I recommend reading the notebook side by side with your copy of the book.**

This notebook was generated for TensorFlow 2.6.

### Processing words as a sequence: The sequence model approach

#### A first practical example

**Downloading the data**

In [4]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  58.0M      0  0:00:01  0:00:01 --:--:-- 58.0M


**Preparing the data**

In [5]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


**Preparing integer sequence datasets**

In [6]:
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int", #sequential information is preserved
    output_sequence_length=max_length,
    # To keep the input size manageable, we'll truncate the inputs after the first 600 words.
    # This is a reasonable choice, since the average review length is 233 words and only 5% of reviews are longer than 600 words.
)
text_vectorization.adapt(text_only_train_ds)

#process train, validation and test data sets
int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
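
To make the effect of `output_mode="int"` and `output_sequence_length` concrete, here is a minimal plain-Python sketch of the same idea on a toy corpus. It is not the `TextVectorization` implementation. The helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, and the conventions it mirrors (index 0 reserved for padding, index 1 for out-of-vocabulary tokens, vocabulary ordered by frequency) are assumptions about the layer's defaults.

```python
import string
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved indices: 0 = padding, 1 = out-of-vocabulary


def standardize(text):
    # Lowercase the text and strip punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(texts, max_tokens):
    # Keep the most frequent tokens; two slots are reserved for PAD and UNK.
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def vectorize(text, vocab, max_length):
    # Map tokens to integers, truncate to max_length, then pad with zeros.
    ids = [vocab.get(tok, UNK) for tok in standardize(text).split()]
    ids = ids[:max_length]
    return ids + [PAD] * (max_length - len(ids))


corpus = ["The movie was great, great fun!", "The plot was thin."]
vocab = build_vocab(corpus, max_tokens=6)

short_review = vectorize("The movie was great", vocab, max_length=6)
long_review = vectorize("great " * 8, vocab, max_length=6)
print(short_review)  # four token indices followed by two padding zeros
print(long_review)   # eight tokens truncated to six
```

Word order is preserved in the output, which is what lets a sequence model exploit it.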

**A sequence model built on one-hot encoded vector sequences**

The simplest way to convert our integer sequences to vector sequences is to one-hot encode them. On top of these one-hot vectors, we'll add a simple bidirectional LSTM.
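
As a rough illustration of what one-hot encoding does to an integer sequence, and of how large the encoded inputs become in this notebook's setting, here is a plain-Python sketch. The `one_hot` helper is hypothetical and only mimics the idea; the size estimate assumes 4-byte (float32) values.

```python
def one_hot(index, depth):
    # A vector of zeros with a single 1.0 at position `index`.
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec


sequence = [3, 0, 2]  # a toy integer sequence
encoded = [one_hot(i, depth=5) for i in sequence]
print(encoded[0])  # [0.0, 0.0, 0.0, 1.0, 0.0]

# Size of one one-hot encoded batch with this notebook's settings.
batch_size, max_length, max_tokens = 32, 600, 20000
floats_per_batch = batch_size * max_length * max_tokens
bytes_per_batch = floats_per_batch * 4  # assuming float32 values
print(floats_per_batch)       # 384000000
print(bytes_per_batch / 1e9)  # 1.536 (GB)
```

Each batch grows to hundreds of millions of values, almost all of them zeros, which is one reason to expect this model to train slowly.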

In [7]:
import tensorflow as tf
inputs = keras.Input(shape=(None,), dtype="int64") #One input is a sequence of integers
embedded = tf.one_hot(inputs, depth=max_tokens) #encode the integers into binary 20000 dimensional vectors
x = layers.Bidirectional(layers.LSTM(32))(embedded) #add a bidirectional LSTM
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)#Finally add a classification layer
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 20000)       0         
                                                                 
 bidirectional (Bidirectiona  (None, 64)               5128448   
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 5,128,513
Trainable params: 5,128,513
Non-trainable params: 0
___________________________________________________
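
The parameter counts in the summary above can be checked by hand. This is a minimal sketch, assuming the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel, and a bias). `lstm_params` is a hypothetical helper, not a Keras function.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an (input_dim x units) kernel,
    # a (units x units) recurrent kernel, and a bias of size `units`.
    return 4 * ((input_dim + units) * units + units)


max_tokens, units = 20000, 32

# A bidirectional wrapper holds two independent LSTMs.
bidirectional = 2 * lstm_params(max_tokens, units)

# The Dense layer sees the concatenated forward and backward states (2 * units).
dense = (2 * units) * 1 + 1

print(bidirectional)          # 5128448
print(dense)                  # 65
print(bidirectional + dense)  # 5128513
```

Nearly all of the weights sit in the input kernels, because each gate needs one row per possible token in the 20,000-dimensional one-hot input.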

**Training a first basic sequence model**

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
 73/625 [==>...........................] - ETA: 2:15 - loss: 0.4093 - accuracy: 0.8446

#### Understanding word embeddings

Another popular and powerful way to associate a vector with a word is to use dense word vectors, also called word embeddings.

The geometric relationship between two word vectors should reflect the semantic relationship between these words.
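
One common way to measure such a geometric relationship is cosine similarity. The sketch below uses three hypothetical 3-dimensional vectors, made up purely for illustration (real embeddings are learned and typically have 100 or more dimensions).

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, not taken from any trained model.
toy_vectors = {
    "good": [0.9, 0.8, 0.1],
    "great": [1.0, 0.7, 0.0],
    "awful": [-0.8, -0.9, 0.2],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_awful = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])
print(f"good vs. great: {sim_good_great:.2f}")
print(f"good vs. awful: {sim_good_awful:.2f}")
```

In a well-structured embedding space, words with similar meanings would score close to 1, while words with opposite sentiment would score much lower.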

There are two ways to obtain word embeddings:

- Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction).
- Load into your model word embeddings that were precomputed using a different machine learning task than the one you're trying to solve. These are called pretrained word embeddings.

#### Learning word embeddings with the Embedding layer

The simplest way to associate a dense vector with a word is to choose the vector at random, but the resulting embedding space would have no structure.

More abstractly, the geometric relationships between word vectors should reflect the semantic relationships between these words.

It therefore makes sense to learn a new embedding space with every new task.

**Instantiating an `Embedding` layer**

In [None]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)
#The Embedding layer takes at least 2 arguments: the number of possible tokens and the dimensionality of the embeddings (here, 256)

The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors.

Word index -> Embedding layer -> Corresponding word vector
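
The dictionary analogy can be sketched in plain Python as a table with one row per token index. This is an illustration of the lookup only, not the Keras implementation: `embedding_table` and `embed` are hypothetical names, and the small random initial values are an assumption. In a real `Embedding` layer the rows are weights that get updated during training.

```python
import random

vocab_size, embedding_dim = 10, 4
rng = random.Random(0)

# One row of `embedding_dim` values per token index, randomly initialized.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(batch):
    # Replace every integer index with its row in the table.
    return [[embedding_table[i] for i in seq] for seq in batch]


batch = [[3, 7, 0], [7, 7, 1]]  # shape (batch_size=2, sequence_length=3)
vectors = embed(batch)

# The output has shape (batch_size, sequence_length, embedding_dim).
print(len(vectors), len(vectors[0]), len(vectors[0][0]))  # 2 3 4
```

The same index always maps to the same vector, wherever it appears in the batch.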

**Model that uses an `Embedding` layer trained from scratch**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

#### Understanding padding and masking

Padding: Sentences longer than K tokens are truncated to a length of K tokens, and sentences shorter than K tokens are padded with zeros at the end so that they can be concatenated with other sequences.

If a sequence is much shorter than K, most of its entries are padding zeros. The information stored in the internal state of the RNN will gradually fade out as it gets exposed to these meaningless inputs.

Masking: Tell the RNN that it should skip these iterations over padded zeros.
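
The fading effect and the benefit of masking can be illustrated with a deliberately simplified recurrence. This is a toy sketch, not an LSTM: the update rule `state = decay * state + token` and the names `pad`, `mask_of`, and `run_toy_rnn` are hypothetical, chosen only to show how padded steps erode the state unless they are skipped.

```python
def pad(seq, length):
    # Truncate to `length`, then pad with zeros at the end.
    seq = seq[:length]
    return seq + [0] * (length - len(seq))


def mask_of(padded):
    # True for real tokens, False for padding zeros.
    return [tok != 0 for tok in padded]


def run_toy_rnn(padded, mask=None, decay=0.5):
    # Toy recurrence: each step decays the state and adds the input.
    state = 0.0
    for t, tok in enumerate(padded):
        if mask is not None and not mask[t]:
            continue  # masked step: leave the state untouched
        state = decay * state + tok
    return state


padded = pad([5, 3], 6)
mask = mask_of(padded)
print(padded)                     # [5, 3, 0, 0, 0, 0]
print(run_toy_rnn(padded))        # 0.34375 (state faded over 4 padded steps)
print(run_toy_rnn(padded, mask))  # 5.5 (state preserved)
```

Without the mask, the state left by the two real tokens is halved at every padded step. With the mask, it is carried through unchanged.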

**Using an `Embedding` layer with masking enabled**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
    input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

#### Using pretrained word embeddings

Sometimes you have so little training data available that you can't use your data alone to learn an appropriate task-specific embedding of your vocabulary.

In such cases, instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a precomputed embedding space that you know is highly structured and exhibits useful properties, such as Word2Vec or GloVe.

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
#to download GloVe word embeddings precomputed on the 2014 English Wikipedia dataset

**Parsing the GloVe word-embeddings file**

In [None]:
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}  # dictionary mapping each word to its coefficient vector
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        # Each line holds a word followed by its space-separated coefficients
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs  # save the word's vector

print(f"Found {len(embeddings_index)} word vectors.")

**Preparing the GloVe word-embeddings matrix**

In [None]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary() #1
word_index = dict(zip(vocabulary, range(len(vocabulary)))) #2

embedding_matrix = np.zeros((max_tokens, embedding_dim)) #3
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: #4
            embedding_matrix[i] = embedding_vector
#1 Retrieve the vocabulary indexed by our previous TextVectorization layer
#2 Use it to create a mapping from words to their index in the vocabulary
#3 Prepare a matrix that we'll fill with the GloVe vectors
#4 Fill entry i in the matrix with the word vector for index i.
#Words not found in the embedding index will be all zeros
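
The parsing and matrix-filling steps can be tried end to end on a tiny in-memory example. In this sketch the two "GloVe" lines, their values, and the toy vocabulary are all made up for illustration, and plain lists stand in for NumPy arrays. The vocabulary starts with `""` and `"[UNK]"` on the assumption that those are the entries the vectorization layer reserves for padding and unknown tokens.

```python
import io

# Two made-up lines in the GloVe text format: a word, then its coefficients.
fake_glove_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")

embeddings_index = {}
for line in fake_glove_file:
    word, coefs = line.split(maxsplit=1)
    embeddings_index[word] = [float(c) for c in coefs.split()]

# Toy vocabulary; "zzyzx" has no pretrained vector.
vocabulary = ["", "[UNK]", "good", "zzyzx", "bad"]
embedding_dim = 3

# Start from all zeros, then copy in the vectors that exist.
embedding_matrix = [[0.0] * embedding_dim for _ in vocabulary]
for i, word in enumerate(vocabulary):
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

for word, row in zip(vocabulary, embedding_matrix):
    print(repr(word), row)
```

Rows for the padding token, the unknown token, and any word missing from the pretrained index stay at zero.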

**Loading the pretrained embeddings in an `Embedding` layer**

In [None]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

**Model that uses a pretrained Embedding layer**

So as not to disrupt the pretrained representations during training, we freeze the layer via `trainable=False`.

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

On this particular task, pretrained embeddings aren't very helpful, because the dataset contains enough samples that it's possible to learn a specialized enough embedding space from scratch.