This is a companion notebook for the book [Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff). For readability, it only contains runnable code blocks and section titles, and omits everything else in the book: text paragraphs, figures, and pseudocode.

**If you want to be able to follow what's going on, I recommend reading the notebook side by side with your copy of the book.**

This notebook was generated for TensorFlow 2.6.

### Processing words as a sequence: The sequence model approach

#### A first practical example

## Downloading and pre-processing the data



In [1]:
# Download the full IMDB dataset

!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  35.6M      0  0:00:02  0:00:02 --:--:-- 35.6M


In [2]:
# Extract the downloaded .tar.gz archive

!tar -xf aclImdb_v1.tar.gz

In [3]:
# Delete a directory that will not be used

!rm -r aclImdb/train/unsup

In [4]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
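
The split above relies on a seeded shuffle followed by slicing off the last 20% of the list. The sketch below replays that logic on a made-up list of file names, without touching the disk. The helper `split_train_val` and the toy names are hypothetical. Sorting before shuffling is an added assumption: `os.listdir` returns entries in arbitrary order, so a fixed seed alone only reproduces the split if the listing order happens to be the same.

```python
import random

def split_train_val(filenames, val_fraction=0.2, seed=1337):
    """Return (train, val) lists using a seeded shuffle and a tail slice."""
    files = sorted(filenames)  # assumption: sort first so the input order is stable
    random.Random(seed).shuffle(files)
    num_val = int(val_fraction * len(files))
    # Guard: files[-0:] would return the whole list, not an empty one.
    if num_val == 0:
        return files, []
    return files[:-num_val], files[-num_val:]

toy_files = [f"{i}_7.txt" for i in range(10)]
train_files, val_files = split_train_val(toy_files)
print(len(train_files), len(val_files))  # 8 2
```

With the real dataset (12,500 files per category after removing `unsup`), the same arithmetic gives 2,500 validation files per category, which matches the 5,000 files reported for `aclImdb/val` below.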

**Preparing integer sequence datasets**

In [5]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [6]:
from tensorflow.keras import layers

max_length = 300
max_tokens = 10000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
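
To make the integer encoding above more concrete, here is a rough pure-Python sketch of what `TextVectorization` does in `"int"` mode: lowercase the text, strip punctuation, split on whitespace, map tokens to indices, then truncate or pad with zeros. The helper names (`standardize`, `build_vocabulary`, `vectorize`) and the toy corpus are made up for illustration. The real layer's punctuation rules and its tie-breaking between equally frequent tokens may differ.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and drop ASCII punctuation (a simplification of the layer's default).
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts, max_tokens):
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]                      # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))  # pad short inputs with 0

toy_corpus = ["The movie was great!", "The movie was bad.", "Great cast, great plot."]
toy_vocab = build_vocabulary(toy_corpus, max_tokens=6)
toy_ids = vectorize("The plot was great, truly great", toy_vocab, sequence_length=8)
print(toy_vocab)
print(toy_ids)
```

In the notebook, `max_tokens=10000` and `output_sequence_length=300` play the roles of `max_tokens=6` and `sequence_length=8` here.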

In [7]:
int_train_ds

<_ParallelMapDataset element_spec=(TensorSpec(shape=(None, None), dtype=tf.int64, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

In [8]:
for inputs, targets in int_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 300)
inputs.dtype: <dtype: 'int64'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(
[  74   92 1281   24   38   73  887   16 4475    2 4068 1807   10  336
    6   26    5   55 1122  502   19   10   14   43  659   51    2  199
   19  371   46    3    9   14   29    5    2   83   95   10  374  316
   38  153   12  142 3828    3  528    1  193   10  839    6  867   11
   19 1312   10 1243    6 5126   31   55 1807    5    2  199   19   17
   11  778 1293   16    4   19   11   19  288  279   12   91    2  199
 3313    3    1    9  208  186    6    2  227    1   13    8   11   19
 7474    3 2137 7893    2 2486    5   65  569 8163    3  138    6  120
   40    6  303    8    2    1  426  100    2  611  608  621  736  231
    2 2053 1671    1  268   59    3 5237    6  496 8163   45    1  145
  192   59    2    1 1312   27  397    9   59  198   54    4  524 2137
 1289   48    1  149 7474 2481   46   85    6  340    4 2852    3  580
 8163    1 2793   

#### Understanding word embeddings

#### Learning word embeddings with the Embedding layer

The aim of this part of the tutorial is to see first-hand the advantages of the `Embedding` layer, which encodes tokens far more efficiently: each token is represented by a 256-dimensional dense vector instead of a 10,000-dimensional one-hot vector.
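
As a small illustration of why this is cheaper, the NumPy sketch below compares multiplying one-hot vectors by a weight table against simply looking up rows of that table. It uses toy sizes (10 tokens, 4 dimensions) in place of 10,000 and 256, and the variable names are made up for this example. It is not how Keras implements the layer internally, only a demonstration that both routes give the same result.

```python
import numpy as np

toy_vocab_size, toy_embedding_dim = 10, 4  # stand-ins for max_tokens and output_dim
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

token_ids = np.array([3, 7, 3, 0])

# One-hot route: build a (4, toy_vocab_size) matrix and multiply by the table.
one_hot = np.eye(toy_vocab_size)[token_ids]
via_matmul = one_hot @ embedding_table

# Embedding route: plain row lookup, no large intermediate matrix.
via_lookup = embedding_table[token_ids]

print(via_lookup.shape)                      # (4, 4)
print(np.allclose(via_matmul, via_lookup))   # True
```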

This model, identical to the previous one, can now be trained in a reasonable amount of time (though it is still far from fast).

What did not work was the `ModelCheckpoint` callback, and the cause turned out to be trivial. It appears that the handling of the `.keras` extension I was using changed between earlier versions of TensorFlow and the current one:

https://stackoverflow.com/questions/76701617/the-following-arguments-are-not-supported-with-the-native-keras-format-opti

Solutions (among the several offered there; what would we do without Stack Overflow!):
- Downgrade the TensorFlow version (not a sensible option).
- Change the extension of the file that stores the models to some other one (`.tf`, for example).
- Use a custom `ModelCheckpoint` function (it would need to be tested).

I went with solution 2, which also makes it possible to specify what gets saved in those checkpoints.

**The code is the same as in the original 603**, but I have split it across several cells to make it easier to read.

The `ModelCheckpoint` entry in the Keras API reference:

https://keras.io/api/callbacks/model_checkpoint/

Of course, this code, which is only included in the next `.fit` call, applies equally to the other `.fit` calls in this notebook, or in any other.
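
To show what `save_best_only=True` with `monitor="val_loss"` amounts to, here is a framework-free sketch of the bookkeeping. `BestOnlyCheckpoint` is a hypothetical stand-in written for this note, not a Keras class, and its `save_fn` argument is a placeholder for whatever actually writes the model to disk. The `val_loss` values are the first six reported in the training log of the first model in this notebook. Epochs are counted from 1 here to match that log.

```python
class BestOnlyCheckpoint:
    """Hypothetical stand-in: call save_fn only when the monitored value improves."""

    def __init__(self, save_fn, monitor="val_loss"):
        self.save_fn = save_fn
        self.monitor = monitor
        self.best = float("inf")
        self.saved_epochs = []

    def on_epoch_end(self, epoch, logs):
        current = logs.get(self.monitor)
        if current is None:
            return  # metric not available, nothing to compare
        if current < self.best:
            self.best = current
            self.save_fn(epoch)
            self.saved_epochs.append(epoch)

# val_loss for epochs 1-6, copied from the training log of the first model
val_losses = [0.4367, 0.3463, 0.3264, 0.3975, 0.3324, 0.3529]
checkpoint = BestOnlyCheckpoint(save_fn=lambda epoch: None)
for epoch, loss in enumerate(val_losses, start=1):
    checkpoint.on_epoch_end(epoch, {"val_loss": loss})

print(checkpoint.saved_epochs, checkpoint.best)  # [1, 2, 3] 0.3264
```

In this run only the first three epochs would trigger a save, so the file on disk would hold the epoch-3 model even though training continues.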


**Instantiating an `Embedding` layer**

In [9]:
embedding_layer = layers.Embedding(input_dim=max_tokens,
                                   output_dim=256)

**Model that uses an `Embedding` layer trained from scratch**

In [10]:
inputs = keras.Input(shape=(None,),
                     dtype="int64")

embedded = layers.Embedding(input_dim=max_tokens,
                            output_dim=256)(inputs)

x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()


In [11]:
callbacks = [keras.callbacks.ModelCheckpoint(filepath="603_LSTM_bidir.keras",
                                             save_best_only=True,
                                             monitor="val_loss")]

In [12]:
model.fit(int_train_ds,
          validation_data=int_val_ds,
          epochs=10,
          callbacks = callbacks)

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 30s 37ms/step - accuracy: 0.6335 - loss: 0.6224 - val_accuracy: 0.8160 - val_loss: 0.4367
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 30s 27ms/step - accuracy: 0.8302 - loss: 0.4193 - val_accuracy: 0.8620 - val_loss: 0.3463
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 28ms/step - accuracy: 0.8691 - loss: 0.3420 - val_accuracy: 0.8758 - val_loss: 0.3264
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 19s 26ms/step - accuracy: 0.8941 - loss: 0.2879 - val_accuracy: 0.8192 - val_loss: 0.3975
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 20s 26ms/step - accuracy: 0.9097 - loss: 0.2527 - val_accuracy: 0.8686 - val_loss: 0.3324
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 27ms/step - accuracy: 0.9241 - loss: 0.2151 - val_accuracy: 0.8790 - val_loss: 0.3529
Epoch 7/10
...

#### Understanding padding and masking
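
As a quick illustration of the idea before the Keras version, the NumPy sketch below builds the kind of mask that `mask_zero=True` is meant to produce: positions holding index 0 (the padding value written by `TextVectorization`) are marked as "skip", everything else as "real token". The toy batch and variable names are made up for this example. In the actual model the mask is computed by the `Embedding` layer and passed on to the recurrent layer automatically.

```python
import numpy as np

# Two integer sequences padded with 0 to a common length.
toy_batch = np.array([
    [12, 5, 48, 0, 0, 0],
    [7, 301, 9, 22, 4, 0],
])

toy_mask = toy_batch != 0            # True where there is a real token
toy_lengths = toy_mask.sum(axis=1)   # number of real tokens per sequence

print(toy_mask.astype(int))
print(toy_lengths)  # [3 5]
```

Without a mask, the LSTM would keep processing the trailing zeros, which can wash out the information from the real tokens in short reviews.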

**Using an `Embedding` layer with masking enabled**

In [13]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
    input_dim=max_tokens,
    output_dim=256,
    mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 19s 28ms/step - accuracy: 0.6746 - loss: 0.5779 - val_accuracy: 0.8134 - val_loss: 0.4159
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 20s 27ms/step - accuracy: 0.8579 - loss: 0.3463 - val_accuracy: 0.8726 - val_loss: 0.3042
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 17s 27ms/step - accuracy: 0.8877 - loss: 0.2740 - val_accuracy: 0.8468 - val_loss: 0.4154
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 27ms/step - accuracy: 0.9166 - loss: 0.2216 - val_accuracy: 0.8744 - val_loss: 0.3027
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 21s 28ms/step - accuracy: 0.9309 - loss: 0.1853 - val_accuracy: 0.8768 - val_loss: 0.3151
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 20s 27ms/step - accuracy: 0.9504 - loss: 0.1434 - val_accuracy: 0.8684 - val_loss: 0.3663
Epoch 7/10
...

#### Using pretrained word embeddings

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2024-11-15 10:32:19--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-11-15 10:32:19--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

glove.6B.zip         14%[=>                  ] 118.80M  5.02MB/s    eta 1m 52s 

**Parsing the GloVe word-embeddings file**

In [None]:
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

**Preparing the GloVe word-embeddings matrix**

In [None]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

embedding_matrix = np.zeros((max_tokens, embedding_dim))

for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        # Words missing from GloVe keep their all-zero row.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
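
The two cells above can be tried out on a tiny scale without downloading anything. The sketch below parses a few made-up lines in the GloVe text layout (`<word> <v1> <v2> ...`) from an in-memory buffer and fills a matrix for a toy vocabulary. All vectors, words, and `toy_*` names are invented for illustration. The vocabulary order mimics what `get_vocabulary()` returns: the empty padding token first, then `[UNK]`, then the words.

```python
import io
import numpy as np

# Made-up 3-dimensional vectors in the GloVe text layout.
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "movie 0.4 0.5 0.6\n"
    "great 0.7 0.8 0.9\n"
)

toy_index = {}
for line in fake_glove:
    word, coefs = line.split(maxsplit=1)
    toy_index[word] = np.array([float(c) for c in coefs.split()], dtype="float32")

# "" is padding, "[UNK]" is the out-of-vocabulary token, "zzzz" has no vector.
toy_vocabulary = ["", "[UNK]", "the", "movie", "zzzz"]
toy_dim = 3
toy_matrix = np.zeros((len(toy_vocabulary), toy_dim))

for i, word in enumerate(toy_vocabulary):
    vector = toy_index.get(word)
    if vector is not None:  # words without a pretrained vector stay all-zero
        toy_matrix[i] = vector

print(toy_matrix)
```

Rows 0, 1, and 4 stay at zero, which is the same thing that happens in the full matrix for the padding token, `[UNK]`, and any IMDB word that GloVe does not cover.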

In [None]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

**Model that uses a pretrained Embedding layer**

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")