<center>
<p><img src="https://mcd.unison.mx/wp-content/themes/awaken/img/logo_mcd.png" width="150">
</p>



<h1>Natural Language Processing Course</h1>

<h3>LSTM with Keras: a basic but complete workflow</h3>


<p> Julio Waissman Vilanova </p>
<p>
<img src="https://identidadbuho.unison.mx/wp-content/uploads/2019/06/letragrama-cmyk-72.jpg" width="150">
</p>


<a target="_blank" href="https://colab.research.google.com/github/mcd-unison/pln/blob/main/labs/RNN/LSTM-IMdb.ipynb"><img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;"  width="30" /> Run in Colab</a>

<p>
Partially taken and adapted from several notebooks in the Keras documentation
</p>


</center>

In [None]:
import re
import string
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Getting the data

We will download the popular IMDb movie review dataset, which is used to test almost every model. The data comes from

``https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz``

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  22.2M      0  0:00:03  0:00:03 --:--:-- 22.2M


Next, let's look at the directory structure and its contents.

In [None]:
!ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [None]:
!ls aclImdb/test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


In [None]:
!ls aclImdb/train

labeledBow.feat  neg  pos  unsup  unsupBow.feat  urls_neg.txt  urls_pos.txt  urls_unsup.txt


In [None]:
!cat aclImdb/train/pos/6248_7.txt

Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

We are only interested in the positive and negative reviews (to keep this a simple binary classification task), so we delete the `unsup` folder:

In [None]:
!rm -r aclImdb/train/unsup

Now we can use the `Keras` utilities to read the data with [`keras.utils.text_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/text_dataset_from_directory).

This is the point where we have to choose the batch size.

In [None]:
batch_size = 32         # Minibatch size

raw_train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


In [None]:
print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782
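The batch counts reported above follow directly from the file counts and the batch size. As a quick sanity check, here is a minimal sketch of that arithmetic, assuming the last, smaller batch is kept rather than dropped (the helper name `num_batches` is only illustrative):

```python
import math

batch_size = 32
total_train_files = 25_000      # files found under aclImdb/train
validation_split = 0.2

n_val = round(total_train_files * validation_split)   # 5000 files for validation
n_train = total_train_files - n_val                   # 20000 files for training
n_test = 25_000                                       # files found under aclImdb/test


def num_batches(n_examples: int, batch_size: int) -> int:
    """Number of minibatches when the final partial batch is kept."""
    return math.ceil(n_examples / batch_size)


print(num_batches(n_train, batch_size))  # 625
print(num_batches(n_val, batch_size))    # 157
print(num_batches(n_test, batch_size))   # 782
```

Only the training split divides evenly by 32; the validation and test splits end with one partial batch each.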


It is important to inspect the raw data to get an idea of how it was loaded and what shape it has.

We can do this by taking one batch and printing a few of its examples:

In [None]:
import textwrap

for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(textwrap.fill(text_batch.numpy()[i].decode("utf-8"), 80, subsequent_indent='> '))
        print("\ntarget =", label_batch.numpy()[i])

I've seen tons of science fiction from the 70s; some horrendously bad, and
> others thought provoking and truly frightening. Soylent Green fits into the
> latter category. Yes, at times it's a little campy, and yes, the furniture is
> good for a giggle or two, but some of the film seems awfully prescient. Here
> we have a film, 9 years before Blade Runner, that dares to imagine the future
> as somthing dark, scary, and nihilistic. Both Charlton Heston and Edward G.
> Robinson fare far better in this than The Ten Commandments, and Robinson's
> assisted-suicide scene is creepily prescient of Kevorkian and his ilk. Some of
> the attitudes are dated (can you imagine a filmmaker getting away with the
> "women as furniture" concept in our oh-so-politically-correct-90s?), but it's
> rare to find a film from the Me Decade that actually can make you think. This
> is one I'd love to see on the big screen, because even in a widescreen
> presentation, I don't think the overall scope of this film w

## Preparando los datos

We now convert each text string into a sequence of numeric indices that can be fed into a neural model. To do so, we build the indices from the words that appear in the text.

This may not be the best method, since the vocabulary is fixed to whatever was found in the training set. Later we will see more sophisticated indexing methods, as well as how to use a pre-established indexed vocabulary.

For now, we first specify the text cleaning (preprocessing) step, which is very simple in this example and consists of:

1. Converting all letters to lowercase
2. Removing HTML line breaks (`<br />`)
3. Removing punctuation marks
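To see what these three steps do to a review, here is a plain-Python sketch that needs no TensorFlow. The function name `clean_review` and the sample sentence are only illustrative; the notebook itself performs the same steps with `tf.strings` operations.

```python
import re
import string


def clean_review(text: str) -> str:
    """Apply the three cleaning steps in order."""
    text = text.lower()                       # 1. lowercase
    text = text.replace("<br />", " ")        # 2. HTML line breaks become spaces
    # 3. drop every ASCII punctuation character
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)


sample = "Oh my god!<br /><br />How can THAT be possible?"
print(clean_review(sample))
```

Each `<br />` is replaced by a space rather than removed, so the words on either side of a line break do not get glued together.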

We also build the minibatches from sequences of exactly `sequence_length` tokens: if a text is longer, it is truncated, and if it is shorter, it is padded with 0s. That way, the model always trains on sequences of the same length.

At most `max_features` distinct tokens are used. If the training text contains more, the least frequent ones are dropped from the vocabulary.
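The following toy sketch illustrates both ideas, the capped vocabulary and the fixed sequence length, in plain Python. It assumes the convention that index 0 is reserved for padding and index 1 for out-of-vocabulary words; the helper names `build_vocab` and `encode` and the tiny corpus are made up for illustration.

```python
from collections import Counter


def build_vocab(token_lists, max_tokens):
    """Map the most frequent tokens to indices 2, 3, ... (0 = padding, 1 = unknown)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(kept)}


def encode(tokens, vocab, sequence_length):
    """Turn tokens into indices, then truncate or zero-pad to sequence_length."""
    ids = [vocab.get(tok, 1) for tok in tokens][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


corpus = [["a", "great", "film"], ["a", "bad", "film"], ["a", "film"]]
vocab = build_vocab(corpus, max_tokens=4)   # room for only the 2 most frequent words

print(vocab)                                                # {'a': 2, 'film': 3}
print(encode(["a", "great", "film", "indeed"], vocab, 6))   # [2, 1, 3, 1, 0, 0]
print(encode(["a", "film"] * 5, vocab, 6))                  # [2, 3, 2, 3, 2, 3]
```

The short review is padded with zeros on the right, the long one is cut off after six tokens, and the rare words `great` and `indeed` both collapse to the unknown index 1.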

For all of this we use the `Keras` [`layers.TextVectorization`](https://keras.io/api/layers/preprocessing_layers/text/text_vectorization/) layer.

In [None]:
# Model constants.
max_features = 20000
sequence_length = 500

# Preprocessing
@keras.saving.register_keras_serializable()
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )


# Vectorization layer (maps each word to its index)
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Now that the vectorize_layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)

# Let's call `adapt`:
vectorize_layer.adapt(text_ds)

In [None]:
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

In [None]:
# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [None]:
# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

In [None]:
print("How the training data is stored")
print("train_ds.cardinality() = ", train_ds.cardinality())

ejemplo = train_ds.take(1)

print("\nA minibatch looks like this: \n")
print(ejemplo.get_single_element())

How the training data is stored
train_ds.cardinality() =  tf.Tensor(625, shape=(), dtype=int64)

A minibatch looks like this: 

(<tf.Tensor: shape=(32, 500), dtype=int64, numpy=
array([[ 132, 1720,    1, ...,    0,    0,    0],
       [  11, 1108,   10, ...,    0,    0,    0],
       [1305,  358,  113, ...,    0,    0,    0],
       ...,
       [   2,  501,  166, ...,   18,    9,    1],
       [  10,   17,    7, ...,    0,    0,    0],
       [  88,  120,   33, ...,    0,    0,    0]])>, <tf.Tensor: shape=(32,), dtype=int32, numpy=
array([1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 1], dtype=int32)>)


## Multilayer LSTM-based model

We will build a multilayer model, which will most likely need some tuning on your part.

We define the neural model with the functional API:

In [None]:
emb = 128               # Embedding size
unidades = 128          # Hidden units per layer


# Input: sequences of token indices
inputs = keras.Input(shape=(None,), dtype="int64")

# Embedding layer
x = layers.Embedding(max_features, emb)(inputs)

# Two stacked LSTM layers
x = layers.LSTM(unidades, return_sequences=True)(x)
x = layers.LSTM(unidades)(x)

# Output
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = keras.Model(inputs, predictions)
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_1 (Embedding)     (None, None, 128)         2560000   
                                                                 
 lstm_2 (LSTM)               (None, None, 128)         131584    
                                                                 
 lstm_3 (LSTM)               (None, 128)               131584    
                                                                 
 predictions (Dense)         (None, 1)                 129       
                                                                 
Total params: 2823297 (10.77 MB)
Trainable params: 2823297 (10.77 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
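The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM formulation with four gates, each with input weights, recurrent weights and a bias; the helper names are only illustrative.

```python
def embedding_params(vocab_size: int, emb_dim: int) -> int:
    """One emb_dim-sized vector per vocabulary entry."""
    return vocab_size * emb_dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


max_features, emb, unidades = 20000, 128, 128

layer_params = {
    "embedding": embedding_params(max_features, emb),   # 2560000
    "lstm_first": lstm_params(emb, unidades),           # 131584
    "lstm_second": lstm_params(unidades, unidades),     # 131584
    "predictions": dense_params(unidades, 1),           # 129
}
total_params = sum(layer_params.values())

print(layer_params)
print(total_params)  # 2823297
```

Almost all of the parameters sit in the embedding table, so the size of the vocabulary, not the depth of the recurrent stack, dominates the size of this model.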


We compile the model and train it (backpropagation through time, BPTT, is applied automatically):

In [None]:
model.compile(
    "adam",
    "binary_crossentropy",
    metrics=["accuracy"]
)

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x78a6787f2fe0>

And we evaluate on the test data:

In [None]:
model.evaluate(test_ds)



[0.7980202436447144, 0.4996800124645233]

Now let's try a bi-LSTM, which makes the code slightly (but not much) more involved:

In [None]:
emb = 128
unidades = 128

# Input (int64, matching the indices produced by vectorize_layer)
inputs = keras.Input(shape=(None,), dtype="int64")

# Embedding layer
x = layers.Embedding(max_features, emb)(inputs)

# bi-LSTMs
x = layers.Bidirectional(
    layers.LSTM(unidades, return_sequences=True)
)(x)
x = layers.Bidirectional(
    layers.LSTM(unidades)
)(x)

# Vanilla hidden layer:
x = layers.Dense(unidades, activation="relu")(x)

# Output
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model_bi = keras.Model(inputs, predictions)
model_bi.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_4 (Embedding)     (None, None, 128)         2560000   
                                                                 
 bidirectional_2 (Bidirecti  (None, None, 256)         263168    
 onal)                                                           
                                                                 
 bidirectional_3 (Bidirecti  (None, 256)               394240    
 onal)                                                           
                                                                 
 dense_2 (Dense)             (None, 128)               32896     
                                                                 
 predictions (Dense)         (None, 1)                 129 
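A bidirectional layer runs two independent LSTMs, one over the sequence in each direction, and concatenates their outputs. That is why the output width doubles to 256 and the second layer sees a 256-dimensional input. The sketch below checks the parameter counts shown in the summary, assuming the standard four-gate LSTM formulation; the helper names are illustrative.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim: int, units: int) -> int:
    """Forward and backward LSTMs have separate weights."""
    return 2 * lstm_params(input_dim, units)


emb, unidades = 128, 128
concat_width = 2 * unidades   # forward and backward outputs side by side

first_bi = bidirectional_lstm_params(emb, unidades)            # input: embeddings
second_bi = bidirectional_lstm_params(concat_width, unidades)  # input: first bi-LSTM
hidden_dense = concat_width * unidades + unidades

print(concat_width)   # 256
print(first_bi)       # 263168
print(second_bi)      # 394240
print(hidden_dense)   # 32896
```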

In [None]:
model_bi.compile(
    "adam",
    "binary_crossentropy",
    metrics=["accuracy"]
)

model_bi.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x78a6a7f2b160>

In [None]:
model_bi.evaluate(test_ds)



[0.6635410189628601, 0.842199981212616]

## 1D convolutional model

This model is given as a baseline in the Keras documentation, and it is a good starting point for seeing how convolutional layers can be used as NLP models.



In [None]:
emb = 128
unidades = 128
ventana = 7
drop = 0.5

# Input
inputs = tf.keras.Input(shape=(None,), dtype="int64")

# Embedding layer
x = layers.Embedding(max_features, emb)(inputs)
x = layers.Dropout(drop)(x)

# Conv1D + global max pooling
x = layers.Conv1D(
    unidades,
    ventana,
    padding="valid",
    activation="relu",
    strides=3
)(x)
x = layers.Conv1D(
    unidades,
    ventana,
    padding="valid",
    activation="relu",
    strides=3
)(x)
x = layers.GlobalMaxPooling1D()(x)

# Vanilla hidden layer:
x = layers.Dense(unidades, activation="relu")(x)
x = layers.Dropout(drop)(x)

# Output
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model_conv1d = tf.keras.Model(inputs, predictions)
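With `padding="valid"` and `strides=3`, each convolution shortens the sequence considerably before the global max pooling collapses it. The sketch below works out the intermediate lengths for the 500-token inputs used in this notebook; the helper name `conv1d_output_length` is only illustrative.

```python
def conv1d_output_length(length: int, kernel_size: int, stride: int) -> int:
    """Output length of a 1D convolution with 'valid' padding (no padding added)."""
    return (length - kernel_size) // stride + 1


sequence_length = 500
ventana = 7
stride = 3

after_first = conv1d_output_length(sequence_length, ventana, stride)
after_second = conv1d_output_length(after_first, ventana, stride)

print(after_first)   # 165
print(after_second)  # 53
```

`GlobalMaxPooling1D` then keeps, for each of the 128 filters, only its maximum over those 53 remaining positions, so the dense layers receive a single 128-dimensional vector per review.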

In [None]:
# Compile the model with binary crossentropy loss and an adam optimizer.
model_conv1d.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

# Train the convolutional model (not the earlier LSTM `model`)
model_conv1d.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x78a670516f20>

In [None]:
model_conv1d.evaluate(test_ds)



[0.6931139230728149, 0.5020800232887268]
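A loss of about 0.6931 with an accuracy of about 0.50 is a useful pattern to recognize. On a balanced test set such as this one, those are the values a classifier would get by predicting 0.5 for every review, since the binary cross-entropy of that prediction is ln 2. The sketch below verifies this with a hand-written loss; the function and the four toy labels are illustrative, not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]                        # toy balanced labels
constant_predictions = [0.5] * len(labels)   # an uninformative classifier

chance_loss = binary_crossentropy(labels, constant_predictions)
print(round(chance_loss, 4))   # 0.6931
print(round(math.log(2), 4))   # 0.6931
```

Whenever an evaluation lands on these numbers, it is worth checking that the model being evaluated is the one that was actually trained.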

## Production model

Once we have a working model that we are happy with, and we want to apply it directly to raw data, we need to package the whole procedure into a single end-to-end pipeline.

Here is the trick for packaging everything, for when the model is not expected to be retrained (at least not in the short term).

In [None]:
modelo_seleccionado = model_bi

# A string input
inputs = tf.keras.Input(shape=(1,), dtype="string")

# Turn strings into vocab indices
indices = vectorize_layer(inputs)

# Turn vocab indices into predictions
outputs = modelo_seleccionado(indices)

# Our end to end model
end_to_end_model = tf.keras.Model(inputs, outputs)

end_to_end_model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)

end_to_end_model.save('nombre_codigo.keras')

In [None]:
end_to_end_model = keras.saving.load_model("nombre_codigo.keras")

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)





[0.6635413765907288, 0.842199981212616]
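In use, the end-to-end model returns one sigmoid probability per review, which still has to be turned into a label. The sketch below shows one way to do that, assuming a 0.5 threshold and the alphabetical class order (`neg` = 0, `pos` = 1) under which the datasets above were loaded. The probabilities are hypothetical values chosen for illustration, not outputs of the trained model.

```python
CLASS_NAMES = ["neg", "pos"]   # alphabetical order of the class folders


def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to class names using a fixed threshold."""
    return [CLASS_NAMES[int(p > threshold)] for p in probabilities]


# Hypothetical probabilities, standing in for the model's predictions
example_probabilities = [0.93, 0.08, 0.51]

print(to_labels(example_probabilities))  # ['pos', 'neg', 'pos']
```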