<a href="https://colab.research.google.com/github/AgustinCocciardi/IA-Aplicada/blob/main/12_RNN%2C_LSTM_y_GRU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN, LSTM and GRU - Text classification (Sentiment Analysis)

In this Colab we use a dataset from TensorFlow Datasets: IMDb movie reviews. Our goal is to classify the sentiment of each review as positive or negative (binary classification).

## Download and explore the dataset

In [1]:
import tensorflow_datasets as tfds
import tensorflow as tf

raw_train_set, raw_valid_set, raw_test_set = tfds.load(
    name="imdb_reviews",
    split=["train[:90%]", "train[90%:]", "test"],
    as_supervised=True
)
tf.random.set_seed(42)  # Set a seed to limit randomness and get reproducible results on CPU
# Build the training, validation and test sets
train_set = raw_train_set.shuffle(5000, seed=42).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
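
As a quick sanity check on the split defined above, the following sketch computes how many examples and batches each split should contain. It assumes the IMDb `train` split has 25,000 examples and that the last, smaller batch is kept (the default behavior of `batch`). The constant names are illustrative.

```python
import math

TRAIN_EXAMPLES = 25_000  # assumed size of the IMDb "train" split
BATCH_SIZE = 32

n_train = TRAIN_EXAMPLES * 90 // 100   # "train[:90%]"
n_valid = TRAIN_EXAMPLES - n_train     # "train[90%:]"

# batch() keeps the final partial batch by default, hence the ceiling
train_batches = math.ceil(n_train / BATCH_SIZE)
valid_batches = math.ceil(n_valid / BATCH_SIZE)

print(n_train, n_valid)              # 22500 2500
print(train_batches, valid_batches)  # 704 79
```

The 704 training batches match the `704/704` step count in the training logs further down.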


Let's look at the review texts and their labels (0 is negative, 1 is positive).

In [2]:
for review, label in raw_train_set.take(4):
    print(review.numpy().decode("utf-8")[:200], "...")
    print("Label:", label.numpy())

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Moun ...
Label: 0
This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful perf ...
Label: 1


## Feature extraction

This time we use a vectorization layer provided by Keras.

In [3]:
vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))
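
To make the vectorization step more concrete, here is a minimal pure-Python sketch of roughly what `TextVectorization` does with its default settings: lowercase, strip punctuation, split on whitespace, rank words by frequency, and reserve ID 0 for padding and ID 1 for out-of-vocabulary words. The helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, the punctuation regex is only an approximation, and tie-breaking between equally frequent words may differ from Keras.

```python
import re
from collections import Counter

def standardize(text):
    """Lowercase and remove punctuation (approximation of the Keras default)."""
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(texts, max_tokens):
    """Most frequent words first; index 0 is padding, index 1 is unknown."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(texts, vocab):
    """Map words to IDs and pad every row with 0 up to the longest row."""
    index = {w: i for i, w in enumerate(vocab)}
    rows = [[index.get(w, 1) for w in standardize(t).split()] for t in texts]
    width = max(len(r) for r in rows)
    return [r + [0] * (width - len(r)) for r in rows]

corpus = ["Great movie!", "A great, great cast.", "Terrible movie."]
vocab = build_vocab(corpus, max_tokens=5)
print(vocab)
print(vectorize(["Great movie!", "Great unknown word here"], vocab))
# [[2, 3, 0, 0], [2, 1, 1, 1]]
```

Note how the shorter text is padded with zeros and every word outside the vocabulary collapses to ID 1.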

## Train our RNNs

Let's train our model with recurrent networks. Here we use GRU (gated recurrent unit) layers because, as we saw in the theory, they are of intermediate complexity: SimpleRNN is the simplest recurrent layer and LSTM the most complex.
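
One way to quantify this ordering is to count the trainable parameters of each layer type. The sketch below uses the usual per-gate formulas. For GRU it assumes Keras's default `reset_after=True`, which keeps two bias vectors per gate. The function names are illustrative.

```python
def simple_rnn_params(input_dim, units):
    """One set of weights: input kernel, recurrent kernel and bias."""
    return units * (input_dim + units + 1)

def gru_params(input_dim, units):
    """Three sets of weights (update, reset, candidate), two biases each."""
    return 3 * (units * (input_dim + units) + 2 * units)

def lstm_params(input_dim, units):
    """Four sets of weights (input, forget, output gates and candidate)."""
    return 4 * (units * (input_dim + units) + units)

input_dim, units = 128, 128  # embedding size and layer size used in this notebook
print(simple_rnn_params(input_dim, units))  # 32896
print(gru_params(input_dim, units))         # 99072
print(lstm_params(input_dim, units))        # 131584
```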

In the code we configure GRU layers with 128 units. The number of units is the size of the layer's hidden state; here it happens to equal the embedding size (`embed_size = 128`), but the two do not have to match. By default, Keras recurrent layers use tanh as the activation function and return only the last output of the sequence. To return the whole sequence, set `return_sequences=True`.

In [None]:
embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128, return_sequences=True),  # alternatives: tf.keras.layers.LSTM, tf.keras.layers.SimpleRNN
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=2)



1 hour of training
Epoch 1/2
704/704 [==============================] - 1597s 2s/step - loss: 0.6934 - accuracy: 0.5000 - val_loss: 0.6932 - val_accuracy: 0.5024
Epoch 2/2
704/704 [==============================] - 1551s 2s/step - loss: 0.6930 - accuracy: 0.5018 - val_loss: 0.6946 - val_accuracy: 0.5004
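
To illustrate what `return_sequences` changes, here is a simplified NumPy sketch of a GRU forward pass. It uses the textbook gate equations with random weights and no biases, so it is not Keras's exact implementation; all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_forward(inputs, units, return_sequences=False, seed=42):
    """Run a simplified GRU over `inputs` of shape (time_steps, input_dim)."""
    rng = np.random.default_rng(seed)
    input_dim = inputs.shape[1]
    # One input matrix (W) and one recurrent matrix (U) per gate:
    # z = update gate, r = reset gate, c = candidate state
    W = {g: rng.normal(scale=0.1, size=(input_dim, units)) for g in "zrc"}
    U = {g: rng.normal(scale=0.1, size=(units, units)) for g in "zrc"}

    h = np.zeros(units)
    outputs = []
    for x in inputs:
        z = sigmoid(x @ W["z"] + h @ U["z"])
        r = sigmoid(x @ W["r"] + h @ U["r"])
        candidate = np.tanh(x @ W["c"] + (r * h) @ U["c"])  # tanh activation
        h = z * h + (1.0 - z) * candidate
        outputs.append(h)
    # Either one vector per time step, or only the final state
    return np.stack(outputs) if return_sequences else h

sequence = np.random.default_rng(0).normal(size=(7, 128))  # 7 steps of 128-dim embeddings
all_steps = gru_forward(sequence, units=128, return_sequences=True)
last_step = gru_forward(sequence, units=128)
print(all_steps.shape)  # (7, 128)
print(last_step.shape)  # (128,)
```

This is why the first GRU layer in a stack needs `return_sequences=True`: the next recurrent layer expects one vector per time step, not just the final state.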

### Masking

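The reviews in a batch have different lengths, so `TextVectorization` pads the shorter ones with the padding token (ID 0). Without masking, the recurrent layers also process these padding steps, which is a likely reason why the previous model stayed at about 50% accuracy. Padding positions can be masked so that the layers skip them: we can build the mask manually, or simply pass `mask_zero=True` to the Embedding layer. Unknown words are a separate case: they are mapped to ID 1 and are not masked.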

In [None]:
embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=5)



2 hours of training, using 6 GB of RAM
Epoch 1/5
704/704 [==============================] - 1762s 2s/step - loss: 0.5149 - accuracy: 0.7388 - val_loss: 0.4924 - val_accuracy: 0.7796
Epoch 2/5
704/704 [==============================] - 1727s 2s/step - loss: 0.3713 - accuracy: 0.8360 - val_loss: 0.3207 - val_accuracy: 0.8644
Epoch 3/5
704/704 [==============================] - 1739s 2s/step - loss: 0.3175 - accuracy: 0.8636 - val_loss: 0.3619 - val_accuracy: 0.8436
Epoch 4/5
704/704 [==============================] - 1735s 2s/step - loss: 0.2952 - accuracy: 0.8770 - val_loss: 0.3101 - val_accuracy: 0.8660
Epoch 5/5
704/704 [==============================] - 1735s 2s/step - loss: 0.2705 - accuracy: 0.8894 - val_loss: 0.3070 - val_accuracy: 0.8684
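
As a rough illustration of what masking does, the sketch below derives a mask from the token IDs (`True` where the ID is not 0) and skips the state update on masked steps, so the final state is the one computed at the last real token. The scalar recurrence is a toy stand-in for a GRU, not Keras's implementation, and the names are hypothetical.

```python
import math

def toy_rnn(token_ids, use_mask):
    """Scalar toy recurrence; with masking, padding steps leave the state untouched."""
    mask = [token_id != 0 for token_id in token_ids]
    h = 0.0
    for token_id, keep in zip(token_ids, mask):
        if use_mask and not keep:
            continue  # padding step: carry the previous state forward
        x = token_id / 100.0  # stand-in for an embedding lookup
        h = math.tanh(0.5 * h + x)
    return h

short_review = [86, 18]
padded_review = short_review + [0] * 50  # same review, padded to a longer batch

print(toy_rnn(short_review, use_mask=False))
print(toy_rnn(padded_review, use_mask=True))   # same value as the unpadded review
print(toy_rnn(padded_review, use_mask=False))  # state has decayed towards 0
```

Without the mask, fifty padding steps wash out what the two real tokens contributed to the state.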

### Ragged tensors

Another way to handle padding in the vectorized sequences is to use ragged tensors. A ragged tensor holds rows of different lengths, so the zero padding is dropped altogether. Let's look at an example.

In [None]:
text_vec_layer_ragged = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, ragged=True)
text_vec_layer_ragged.adapt(train_set.map(lambda reviews, labels: reviews))
text_vec_layer_ragged(["Great movie!", "This is DiCaprio's best role."])

<tf.RaggedTensor [[86, 18], [11, 7, 1, 116, 217]]>

In the output above we can see a ragged tensor whose rows have different lengths, one row with 2 tokens and another with 5:
<tf.RaggedTensor [[86, 18], [11, 7, 1, 116, 217]]>

Compared with regular vectorization, shown in the next cell, the difference is that the regular output is a dense tensor whose rows all have the same length (5), padded with zeros:
<tf.Tensor: shape=(2, 5), dtype=int64, numpy=
array([[ 86,  18,   0,   0,   0],
       [ 11,   7,   1, 116, 217]])>

In [None]:
text_vec_layer(["Great movie!", "This is DiCaprio's best role."])

<tf.Tensor: shape=(2, 5), dtype=int64, numpy=
array([[ 86,  18,   0,   0,   0],
       [ 11,   7,   1, 116, 217]])>
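
The relationship between the two representations can be sketched in plain Python: dropping the trailing zeros from each padded row gives the ragged rows, and padding the ragged rows to the longest length gives the dense tensor back. The helper names are hypothetical; in TensorFlow the corresponding conversions are `tf.RaggedTensor.from_tensor(..., padding=0)` and the `to_tensor()` method of a ragged tensor.

```python
def to_ragged(padded_rows, padding=0):
    """Strip trailing padding values from every row."""
    ragged = []
    for row in padded_rows:
        end = len(row)
        while end > 0 and row[end - 1] == padding:
            end -= 1
        ragged.append(row[:end])
    return ragged

def to_padded(ragged_rows, padding=0):
    """Pad every row on the right up to the longest row."""
    width = max(len(row) for row in ragged_rows)
    return [row + [padding] * (width - len(row)) for row in ragged_rows]

padded = [[86, 18, 0, 0, 0], [11, 7, 1, 116, 217]]
ragged = to_ragged(padded)
print(ragged)             # [[86, 18], [11, 7, 1, 116, 217]]
print(to_padded(ragged))  # [[86, 18, 0, 0, 0], [11, 7, 1, 116, 217]]
```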

In [None]:
embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer_ragged,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=5)



2 hours of training, using 6 GB of RAM

Epoch 1/5
704/704 [==============================] - 1688s 2s/step - loss: 0.5142 - accuracy: 0.7432 - val_loss: 0.4434 - val_accuracy: 0.8028
Epoch 2/5
704/704 [==============================] - 1654s 2s/step - loss: 0.3500 - accuracy: 0.8479 - val_loss: 0.3313 - val_accuracy: 0.8596
Epoch 3/5
704/704 [==============================] - 1661s 2s/step - loss: 0.2997 - accuracy: 0.8739 - val_loss: 0.2964 - val_accuracy: 0.8716
Epoch 4/5
704/704 [==============================] - 1660s 2s/step - loss: 0.2942 - accuracy: 0.8764 - val_loss: 0.2981 - val_accuracy: 0.8724
Epoch 5/5
704/704 [==============================] - 1649s 2s/step - loss: 0.2921 - accuracy: 0.8777 - val_loss: 0.3114 - val_accuracy: 0.8652

## Reusing pretrained embeddings and language models

### Gensim models

In [None]:
import gensim
import gensim.downloader

model = gensim.downloader.load('glove-twitter-200')



In [None]:
def gensim_to_keras_embedding(model, train_embeddings=False):
    """Get a Keras 'Embedding' layer with weights set from a Gensim model's pretrained word embeddings.

    Parameters
    ----------
    model : gensim.models.KeyedVectors
        Pretrained word vectors (here, the GloVe model loaded above).
    train_embeddings : bool
        If False, the returned weights are frozen and stopped from being updated.
        If True, the weights can / will be further updated in Keras.

    Returns
    -------
    `keras.layers.Embedding`
        Embedding layer, to be used as input to deeper network layers.

    """
    weights = model.vectors  # vectors themselves, a 2D numpy array
    index_to_key = model.index_to_key  # which row in `weights` corresponds to which word?

    layer = tf.keras.layers.Embedding(
        input_dim=weights.shape[0],
        output_dim=weights.shape[1],
        weights=[weights],
        trainable=train_embeddings,
    )
    return layer

In [None]:
embedding_layer = gensim_to_keras_embedding(model)

In [None]:
vocab_len = len(model)
output_len = 200
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_len + 1, output_sequence_length=output_len)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))

In [None]:
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer,
    embedding_layer,
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


For more information on reusing word embedding models, see https://keras.io/examples/nlp/pretrained_word_embeddings/
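
One detail worth checking when reusing pretrained vectors: the token IDs produced by `TextVectorization` must index the matching rows of the embedding matrix. In the cells above, the vectorizer's vocabulary is learned from the reviews while the embedding rows follow the Gensim vocabulary order, so it is worth verifying that the IDs line up. A common way to ensure this, also used in the Keras example linked above, is to build the matrix row by row from the vectorizer's vocabulary. The sketch below shows the idea with a tiny made-up set of vectors standing in for the GloVe model; all names and values are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for pretrained vectors (word -> vector)
pretrained = {
    "great": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.2, 0.7, 0.1]),
    "bad": np.array([-0.8, 0.1, 0.3]),
}
embedding_dim = 3

# Vocabulary in the order the vectorizer assigns IDs:
# 0 is padding, 1 is out-of-vocabulary
vocabulary = ["", "[UNK]", "movie", "great", "boring"]

embedding_matrix = np.zeros((len(vocabulary), embedding_dim))
missing = []
for token_id, word in enumerate(vocabulary):
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[token_id] = vector  # row index == token ID
    else:
        missing.append(word)                 # row stays all zeros

print(embedding_matrix)
print("No pretrained vector for:", missing)
```

With a real model, the vocabulary would come from `text_vec_layer.get_vocabulary()`, and the resulting matrix would be used as the initial weights of the `Embedding` layer.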

### Bonus: TensorFlow Hub models

In this case we use a pretrained sentence encoder: the Universal Sentence Encoder. We did not cover it in detail in the theory, but here is an example of how to train it for this problem.

In [None]:
import os
import tensorflow_hub as hub

os.environ["TFHUB_CACHE_DIR"] = "my_tfhub_cache"
tf.random.set_seed(42)  # Set a seed to limit randomness and get reproducible results on CPU
model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                   trainable=True, dtype=tf.string, input_shape=[]),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=3)  # Originally configured with 10 epochs

About 1.5 hours per epoch, using 6 GB of RAM

Epoch 1/2
704/704 [==============================] - 5749s 8s/step - loss: 0.2915 - accuracy: 0.8779 - val_loss: 0.2400 - val_accuracy: 0.9012
Epoch 2/2
704/704 [==============================] - 5466s 8s/step - loss: 0.0220 - accuracy: 0.9934 - val_loss: 0.3035 - val_accuracy: 0.8988

## Bonus: CNN for text classification

Here is an example of a CNN for processing text. We define a new TextVectorization layer with `output_sequence_length=10`, because the `Flatten` and `Dense` layers of the network need a fixed-length input.

In [None]:
embed_size = 128
tf.random.set_seed(42)

text_vec_layer_for_cnn = tf.keras.layers.TextVectorization(max_tokens=vocab_size, output_sequence_length=10)
text_vec_layer_for_cnn.adapt(train_set.map(lambda reviews, labels: reviews))

model = tf.keras.Sequential([
    text_vec_layer_for_cnn,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.Conv1D(filters=32, kernel_size=8, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=10)



Results after 1.5 minutes of training:
Epoch 1/10
704/704 [==============================] - 9s 12ms/step - loss: 0.6464 - accuracy: 0.6098 - val_loss: 0.6284 - val_accuracy: 0.6280
Epoch 2/10
704/704 [==============================] - 8s 11ms/step - loss: 0.5777 - accuracy: 0.6842 - val_loss: 0.6405 - val_accuracy: 0.6336
Epoch 3/10
704/704 [==============================] - 7s 10ms/step - loss: 0.5204 - accuracy: 0.7323 - val_loss: 0.6747 - val_accuracy: 0.6268
Epoch 4/10
704/704 [==============================] - 8s 11ms/step - loss: 0.4436 - accuracy: 0.7876 - val_loss: 0.7531 - val_accuracy: 0.6132
Epoch 5/10
704/704 [==============================] - 10s 15ms/step - loss: 0.3569 - accuracy: 0.8406 - val_loss: 0.8793 - val_accuracy: 0.6136
Epoch 6/10
704/704 [==============================] - 7s 10ms/step - loss: 0.2645 - accuracy: 0.8886 - val_loss: 1.0845 - val_accuracy: 0.6056
Epoch 7/10
704/704 [==============================] - 7s 10ms/step - loss: 0.1947 - accuracy: 0.9243 - val_loss: 1.2675 - val_accuracy: 0.6160
Epoch 8/10
704/704 [==============================] - 8s 11ms/step - loss: 0.1411 - accuracy: 0.9457 - val_loss: 1.5119 - val_accuracy: 0.6116
Epoch 9/10
704/704 [==============================] - 8s 12ms/step - loss: 0.1073 - accuracy: 0.9602 - val_loss: 1.6795 - val_accuracy: 0.6108
Epoch 10/10
704/704 [==============================] - 8s 11ms/step - loss: 0.0829 - accuracy: 0.9701 - val_loss: 1.8738 - val_accuracy: 0.6100

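Although the training accuracy is high (0.9701), this is overfitting: on the unseen validation data the accuracy is only 0.6100, and the validation loss rises steadily after the first epoch. Different hyperparameters remain to be explored to improve validation performance. For example, `output_sequence_length=10` keeps only the first 10 tokens of each review, which may be too few.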
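
To see why a fixed `output_sequence_length` matters for this network, the sketch below traces the sequence length through the layers of the CNN above, assuming the Keras defaults (`padding="valid"` and stride 1 for `Conv1D`, stride equal to `pool_size` for `MaxPooling1D`). The helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def pool1d_output_length(length, pool_size):
    """Output length of 1D max pooling with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1

sequence_length = 10  # output_sequence_length of the vectorization layer
filters = 32

after_conv = conv1d_output_length(sequence_length, kernel_size=8)
after_pool = pool1d_output_length(after_conv, pool_size=2)
flattened = after_pool * filters  # number of features the first Dense layer receives

print(after_conv, after_pool, flattened)  # 3 1 32
```

With only 10 tokens per review, the convolution sees just three window positions, so a longer sequence length is one of the first hyperparameters worth trying.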