This is a companion notebook for the book [Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff). For readability, it only contains runnable code blocks and section titles, and omits everything else in the book: text paragraphs, figures, and pseudocode.

**If you want to be able to follow what's going on, I recommend reading the notebook side by side with your copy of the book.**

This notebook was generated for TensorFlow 2.6.

### Processing words as a sequence: The sequence model approach

#### A first practical example

**Downloading the data**

In [1]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz # downloading the IMDB dataset archive from the Stanford website
!tar -xf aclImdb_v1.tar.gz # extracting the dataset 
!rm -r aclImdb/train/unsup # removing the unsupervised data from the dataset


**Preparing the data**

In [None]:
import os, pathlib, shutil, random # importing the necessary libraries 
from tensorflow import keras # importing the keras library from tensorflow
batch_size = 32 # setting the batch size to 32
base_dir = pathlib.Path("aclImdb") # setting the base directory to aclImdb
val_dir = base_dir / "val" # setting the validation directory to val
train_dir = base_dir / "train" # setting the training directory to train
for category in ("neg", "pos"): # for loop to iterate over the categories
    os.makedirs(val_dir / category) # making the validation directory
    files = os.listdir(train_dir / category) # listing the files in the training directory
    random.Random(1337).shuffle(files) # shuffling the files
    num_val_samples = int(0.2 * len(files)) # setting the number of validation samples
    val_files = files[-num_val_samples:] # setting the validation files
    for fname in val_files: # for loop to iterate over the validation files
        shutil.move(train_dir / category / fname, 
                    val_dir / category / fname) # moving the files to the validation directory

train_ds = keras.utils.text_dataset_from_directory( # creating the training dataset
    "aclImdb/train", batch_size=batch_size # setting the batch size
)
val_ds = keras.utils.text_dataset_from_directory( # creating the validation dataset
    "aclImdb/val", batch_size=batch_size # setting the batch size
)
test_ds = keras.utils.text_dataset_from_directory( # creating the testing dataset
    "aclImdb/test", batch_size=batch_size # setting the batch size
)
text_only_train_ds = train_ds.map(lambda x, y: x) # dropping the labels to keep only the raw text, which is used to adapt the vectorization layer
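
The 80/20 split above relies on a seeded shuffle so that the same files land in the validation set on every run. The following is a minimal in-memory sketch of that idea, not the notebook's own code: `split_train_val` is a hypothetical helper, the file names are made up, and the extra `sorted` call is an assumption added here because `os.listdir` does not guarantee any particular order.

```python
import random

def split_train_val(filenames, val_fraction=0.2, seed=1337):
    """Return (train_files, val_files) using a seeded, reproducible shuffle."""
    # Sorting first is an added assumption: it makes the result independent
    # of the order in which the file names were listed.
    files = sorted(filenames)
    random.Random(seed).shuffle(files)
    num_val = int(val_fraction * len(files))
    if num_val == 0:
        # Guard: files[-0:] would return the whole list, not an empty one.
        return files, []
    return files[:-num_val], files[-num_val:]

# Made-up file names standing in for aclImdb/train/pos.
names = [f"{i}_7.txt" for i in range(10)]
train_files, val_files = split_train_val(names)
print(len(train_files), len(val_files))  # 8 2
```

Because the seed is fixed, calling the helper again on the same names returns the same two lists.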

**Preparing integer sequence datasets**

In [None]:
from tensorflow.keras import layers # importing the layers from keras

max_length = 600 # setting the maximum length to 600
max_tokens = 20000 # setting the maximum tokens to 20000
text_vectorization = layers.TextVectorization( # creating the text vectorization layer
    max_tokens=max_tokens, # setting the maximum tokens
    output_mode="int", # setting the output mode to int, which is integer encoding of the tokens that will be passed to the embedding layer
    output_sequence_length=max_length, # setting the output sequence length
)
text_vectorization.adapt(text_only_train_ds) # adapting the text vectorization layer

int_train_ds = train_ds.map( # mapping the training dataset
    lambda x, y: (text_vectorization(x), y), # passing the text vectorization layer to the training dataset
    num_parallel_calls=4) # setting the number of parallel calls to 4
int_val_ds = val_ds.map( # mapping the validation dataset
    lambda x, y: (text_vectorization(x), y), # passing the text vectorization layer to the validation dataset
    num_parallel_calls=4) # setting the number of parallel calls to 4
int_test_ds = test_ds.map( # mapping the testing dataset
    lambda x, y: (text_vectorization(x), y), # passing the text vectorization layer to the testing dataset
    num_parallel_calls=4) # setting the number of parallel calls to 4
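
To make the `output_mode="int"` and `output_sequence_length` settings concrete, here is a rough pure-Python sketch of what integer encoding with padding and truncation does. It is an illustration under simplifying assumptions, not the Keras implementation: `toy_vectorize` and the tiny vocabulary are hypothetical, and the sketch only lowercases and splits on whitespace, whereas `TextVectorization` also strips punctuation by default. Index 0 stands for padding and index 1 for out-of-vocabulary tokens, matching the first two entries returned by `get_vocabulary()`.

```python
def toy_vectorize(text, vocabulary, output_sequence_length):
    """Map whitespace tokens to indices, then pad or truncate to a fixed length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    # Unknown tokens fall back to index 1 (the "[UNK]" slot).
    ids = [index.get(token, 1) for token in text.lower().split()]
    # Truncate long sequences, then right-pad short ones with 0.
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))

# Hypothetical six-entry vocabulary; the real one has up to max_tokens entries.
vocab = ["", "[UNK]", "the", "movie", "was", "great"]

print(toy_vectorize("The movie was great", vocab, 6))
# [2, 3, 4, 5, 0, 0]
print(toy_vectorize("The plot was great fun indeed today", vocab, 6))
# [2, 1, 4, 5, 1, 1]
```

The first review is shorter than the target length and gets padded with zeros; the second is longer and loses its last token.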

**A sequence model built on one-hot encoded vector sequences**

In [None]:
import tensorflow as tf # importing tensorflow
inputs = keras.Input(shape=(None,), dtype="int64") # creating the input layer with the shape of None and the data type of int64
embedded = tf.one_hot(inputs, depth=max_tokens) # one-hot encoding each token index into a vector of size max_tokens
x = layers.Bidirectional(layers.LSTM(32))(embedded) # creating the bidirectional LSTM layer with 32 units
x = layers.Dropout(0.5)(x) # creating the dropout layer with a rate of 0.5 
outputs = layers.Dense(1, activation="sigmoid")(x) # creating the dense layer with 1 unit and the sigmoid activation function
model = keras.Model(inputs, outputs) # creating the model with the inputs and outputs 
model.compile(optimizer="rmsprop", # compiling the model with the optimizer of rmsprop
              loss="binary_crossentropy", # setting the loss function to binary crossentropy
              metrics=["accuracy"]) # setting the metrics to accuracy
model.summary() # printing the summary of the model
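
A quick back-of-the-envelope calculation helps explain why this model is slow to train. The sketch below assumes the one-hot tensor is fully materialized as `float32` (the default dtype of `tf.one_hot`) and reuses `max_length`, `max_tokens`, and `batch_size` from the cells above. The comparison with a 256-dimensional dense vector is added here only for scale.

```python
max_length = 600        # tokens per review, as set for TextVectorization
max_tokens = 20000      # vocabulary size, which is also the one-hot depth
batch_size = 32
bytes_per_float32 = 4

# Size of the one-hot input for a single review and for a whole batch.
one_hot_bytes_per_review = max_length * max_tokens * bytes_per_float32
one_hot_bytes_per_batch = one_hot_bytes_per_review * batch_size

# For comparison: the same review as 256-dimensional dense vectors.
embedding_bytes_per_review = max_length * 256 * bytes_per_float32

print(f"{one_hot_bytes_per_review / 1e6:.0f} MB per review")  # 48 MB per review
print(f"{one_hot_bytes_per_batch / 1e9:.2f} GB per batch")    # 1.54 GB per batch
print(f"{embedding_bytes_per_review / 1e6:.2f} MB per review with dense vectors")
```

Under these assumptions, each batch carries roughly 1.5 GB of mostly-zero values into the LSTM.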

**Training a first basic sequence model**

In [None]:
callbacks = [ # creating the callbacks
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras", # setting the model checkpoint to one_hot_bidir_lstm.keras
                                    save_best_only=True) # saving the best model only
]
model.fit(int_train_ds, # fitting the model with the training dataset
          validation_data=int_val_ds, # setting the validation data to the validation dataset
          epochs=10, # setting the epochs to 10
          callbacks=callbacks) # setting the callbacks to the callbacks
model = keras.models.load_model("one_hot_bidir_lstm.keras") # loading the model
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}") # printing the test accuracy

#### Understanding word embeddings

#### Learning word embeddings with the Embedding layer

**Instantiating an `Embedding` layer**

In [None]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256) # creating the embedding layer with the input dimension of max_tokens and the output dimension of 256
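
Conceptually, an embedding layer is a lookup table: selecting row `i` of its weight matrix gives the same result as multiplying the one-hot vector for `i` by that matrix. The pure-Python sketch below checks this equivalence on a made-up 3x2 weight matrix. The helper names `one_hot` and `matvec` are hypothetical and are not part of Keras.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    num_columns = len(matrix[0])
    return [
        sum(v * row[j] for v, row in zip(vector, matrix))
        for j in range(num_columns)
    ]

# Made-up weights: 3 tokens in the vocabulary, 2 embedding dimensions.
weights = [
    [0.0, 0.0],
    [0.5, -1.0],
    [2.0, 3.0],
]

token_id = 2
via_matmul = matvec(one_hot(token_id, depth=3), weights)
via_lookup = weights[token_id]
print(via_matmul, via_lookup)  # [2.0, 3.0] [2.0, 3.0]
```

The lookup avoids building the one-hot vector at all, which is where the memory and compute savings come from.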

**Model that uses an `Embedding` layer trained from scratch**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64") # creating the input layer with the shape of None and the data type of int64
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs) # creating the embedding layer with the input dimension of max_tokens and the output dimension of 256
x = layers.Bidirectional(layers.LSTM(32))(embedded) # creating the bidirectional LSTM layer with 32 units
x = layers.Dropout(0.5)(x) # creating the dropout layer with a rate of 0.5
outputs = layers.Dense(1, activation="sigmoid")(x) # creating the dense layer with 1 unit and the sigmoid activation function
model = keras.Model(inputs, outputs) # creating the model with the inputs and outputs
model.compile(optimizer="rmsprop", # compiling the model with the optimizer of rmsprop
              loss="binary_crossentropy", # setting the loss function to binary crossentropy
              metrics=["accuracy"]) # setting the metrics to accuracy
model.summary() # printing the summary of the model

callbacks = [ # creating the callbacks
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras", # setting the model checkpoint to embeddings_bidir_gru.keras
                                    save_best_only=True) # saving the best model only
]
model.fit(int_train_ds, # fitting the model with the training dataset
          validation_data=int_val_ds, # setting the validation data to the validation dataset
          epochs=10, # setting the epochs to 10
          callbacks=callbacks) # setting the callbacks to the callbacks
model = keras.models.load_model("embeddings_bidir_gru.keras") # loading the model
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}") # printing the test accuracy

#### Understanding padding and masking

**Using an `Embedding` layer with masking enabled**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64") # creating the input layer with the shape of None and the data type of int64
embedded = layers.Embedding( # creating the embedding layer
    input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs) # setting the input dimension to max_tokens, the output dimension to 256, and the mask zero to True
x = layers.Bidirectional(layers.LSTM(32))(embedded) # creating the bidirectional LSTM layer with 32 units
x = layers.Dropout(0.5)(x) # creating the dropout layer with a rate of 0.5
outputs = layers.Dense(1, activation="sigmoid")(x) # creating the dense layer with 1 unit and the sigmoid activation function
model = keras.Model(inputs, outputs) # creating the model with the inputs and outputs
model.compile(optimizer="rmsprop", # compiling the model with the optimizer of rmsprop
              loss="binary_crossentropy", # setting the loss function to binary crossentropy
              metrics=["accuracy"]) # setting the metrics to accuracy
model.summary() # printing the summary of the model

callbacks = [ # creating the callbacks
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras", # setting the model checkpoint to embeddings_bidir_gru_with_masking.keras
                                    save_best_only=True) # saving the best model only
]
model.fit(int_train_ds, # fitting the model with the training dataset
          validation_data=int_val_ds, # setting the validation data to the validation dataset
          epochs=10, # setting the epochs to 10
          callbacks=callbacks) # setting the callbacks to the callbacks
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras") # loading the model
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}") # printing the test accuracy
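
As a rough illustration of what `mask_zero=True` produces, the sketch below builds the boolean mask for a small batch of padded integer sequences: `True` marks a real token and `False` marks padding (index 0). This is a pure-Python stand-in with made-up token ids, not the Keras code path; `compute_padding_mask` is a hypothetical helper.

```python
def compute_padding_mask(batch_of_ids):
    """Return True where a token is real and False where it is padding (id 0)."""
    return [[token_id != 0 for token_id in row] for row in batch_of_ids]

# Two made-up reviews, right-padded with zeros to length 5.
batch = [
    [5, 9, 3, 0, 0],
    [7, 2, 0, 0, 0],
]

mask = compute_padding_mask(batch)
for row in mask:
    print(row)
# [True, True, True, False, False]
# [True, True, False, False, False]

# Number of timesteps a mask-aware recurrent layer would actually process.
lengths = [sum(row) for row in mask]
print(lengths)  # [3, 2]
```

A mask-aware recurrent layer skips the `False` positions, so the padding does not dilute the state carried over from the real tokens.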

#### Using pretrained word embeddings

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip # downloading the glove embeddings
!unzip -q glove.6B.zip # unzipping the glove embeddings

**Parsing the GloVe word-embeddings file**

In [None]:
import numpy as np # importing numpy
path_to_glove_file = "glove.6B.100d.txt" # setting the path to the glove file

embeddings_index = {} # creating the embeddings index dictionary which will contain the word vectors
with open(path_to_glove_file, encoding="utf-8") as f: # opening the glove file as UTF-8 text, regardless of the platform default encoding
    for line in f: # iterating over the lines in the file
        word, coefs = line.split(maxsplit=1) # splitting the line into the word and the coefs
        coefs = np.fromstring(coefs, "f", sep=" ") # converting the coefs to a numpy array
        embeddings_index[word] = coefs # adding the word and the coefs to the embeddings index

print(f"Found {len(embeddings_index)} word vectors.") # printing the number of word vectors
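
Each line of the GloVe text file holds a word followed by its space-separated coefficients. The sketch below parses one such line without NumPy, to show what `line.split(maxsplit=1)` separates. The sample line and its three values are made up (real rows in `glove.6B.100d.txt` carry 100 values), and `parse_glove_line` is a hypothetical helper.

```python
def parse_glove_line(line):
    """Split a GloVe-format line into (word, list of float coefficients)."""
    # maxsplit=1 cuts only at the first run of whitespace, so the word is
    # separated from the rest and the coefficients stay in one string.
    word, coefs = line.split(maxsplit=1)
    return word, [float(value) for value in coefs.split()]

# Made-up line in the same layout as the GloVe text files.
sample = "the 0.25 -0.5 1.0\n"
word, vector = parse_glove_line(sample)
print(word, vector)  # the [0.25, -0.5, 1.0]
```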

**Preparing the GloVe word-embeddings matrix**

In [None]:
embedding_dim = 100 # setting the embedding dimension to 100

vocabulary = text_vectorization.get_vocabulary() # getting the vocabulary from the text vectorization layer
word_index = dict(zip(vocabulary, range(len(vocabulary)))) # creating the word index dictionary

embedding_matrix = np.zeros((max_tokens, embedding_dim)) # creating the embedding matrix with the shape of max_tokens and embedding_dim
for word, i in word_index.items(): # iterating over the word index
    if i < max_tokens: # if the index is less than the max tokens
        embedding_vector = embeddings_index.get(word) # getting the embedding vector (None if the word is not in GloVe)
        if embedding_vector is not None: # only filling rows for words found in GloVe; other rows stay all-zeros
            embedding_matrix[i] = embedding_vector # adding the embedding vector to the embedding matrix
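
It can be useful to know how much of the vocabulary the pretrained vectors actually cover. The sketch below fills a toy matrix in pure Python and counts hits and misses. The vocabulary, the vectors, and the helper `build_embedding_matrix` are all hypothetical. Rows for words missing from the index are left as zeros, the same convention used for the real matrix.

```python
def build_embedding_matrix(vocabulary, embeddings_index, max_tokens, embedding_dim):
    """Return (matrix, hits, misses) for the first `max_tokens` vocabulary entries."""
    matrix = [[0.0] * embedding_dim for _ in range(max_tokens)]
    hits = misses = 0
    for i, word in enumerate(vocabulary):
        if i >= max_tokens:
            break  # indices beyond the matrix are ignored
        vector = embeddings_index.get(word)
        if vector is None:
            misses += 1  # row i stays all-zeros
        else:
            matrix[i] = list(vector)
            hits += 1
    return matrix, hits, misses

# Made-up data: padding and [UNK] entries have no pretrained vector.
toy_vocabulary = ["", "[UNK]", "good", "bad", "zzyzx"]
toy_index = {"good": [0.1, 0.2], "bad": [-0.1, -0.2]}

matrix, hits, misses = build_embedding_matrix(
    toy_vocabulary, toy_index, max_tokens=4, embedding_dim=2
)
print(hits, misses)  # 2 2
print(matrix[2])     # [0.1, 0.2]
```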

In [None]:
embedding_layer = layers.Embedding( # creating the embedding layer
    max_tokens, # setting the max tokens
    embedding_dim, # setting the embedding dimension
    embeddings_initializer=keras.initializers.Constant(embedding_matrix), # setting the embeddings initializer to the embedding matrix
    trainable=False, # setting the trainable to False
    mask_zero=True, # setting the mask zero to True
)

**Model that uses a pretrained Embedding layer**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64") # creating the input layer with the shape of None and the data type of int64
embedded = embedding_layer(inputs) # applying the pretrained (frozen) embedding layer to the inputs
x = layers.Bidirectional(layers.LSTM(32))(embedded) # creating the bidirectional LSTM layer with 32 units
x = layers.Dropout(0.5)(x) # creating the dropout layer with a rate of 0.5
outputs = layers.Dense(1, activation="sigmoid")(x) # creating the dense layer with 1 unit and the sigmoid activation function
model = keras.Model(inputs, outputs) # creating the model with the inputs and outputs
model.compile(optimizer="rmsprop", # compiling the model with the optimizer of rmsprop
              loss="binary_crossentropy", # setting the loss function to binary crossentropy
              metrics=["accuracy"]) # setting the metrics to accuracy
model.summary() # printing the summary of the model

callbacks = [ # creating the callbacks
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras", # setting the model checkpoint to glove_embeddings_sequence_model.keras
                                    save_best_only=True) # saving the best model only
]
model.fit(int_train_ds, # fitting the model with the training dataset
          validation_data=int_val_ds, # setting the validation data to the validation dataset
          epochs=10, # setting the epochs to 10
          callbacks=callbacks) # setting the callbacks to the callbacks
model = keras.models.load_model("glove_embeddings_sequence_model.keras") # loading the model
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}") # printing the test accuracy