# NLP Webinar with TensorFlow
## Diego Hueltes

We import TensorFlow and TensorFlow Datasets.

In [None]:
import tensorflow as tf
print("Tensorflow version " + tf.__version__)


import tensorflow_datasets as tfds

Tensorflow version 2.2.0


We define cosine similarity, which we will use later.

In [None]:
from numpy import dot
from numpy.linalg import norm

cos_sim = lambda a, b: dot(a, b)/(norm(a)*norm(b))
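As a quick sanity check of this definition, the sketch below applies the same `cos_sim` lambda to three hand-picked vectors. The vectors are illustrative assumptions, not data from the webinar: one points in the same direction as `a`, one is orthogonal to it, and one points the opposite way.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

# Same definition as in the notebook cell above.
cos_sim = lambda a, b: dot(a, b) / (norm(a) * norm(b))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * a                  # scaled copy of a
orthogonal = np.array([3.0, 0.0, -1.0])   # dot(a, orthogonal) = 3 + 0 - 3 = 0
opposite = -a                             # points the other way

print(cos_sim(a, same_direction))  # close to 1: same direction
print(cos_sim(a, orthogonal))      # 0: unrelated directions
print(cos_sim(a, opposite))        # close to -1: opposite directions
```

Because the result depends only on the angle between the vectors and not on their length, it is a convenient way to compare sentence embeddings.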

tensorflow_datasets offers several versions of each dataset; here we request the plain-text version.

In [None]:
dataset = tfds.load('imdb_reviews/plain_text', as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

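`padded_batch` groups the examples into batches of 64 and pads each element up to the size of the longest element in its batch. In the plain-text version every review is a single scalar string, so there is nothing to pad yet; padding becomes relevant for the tokenized sequences used later on.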

In [None]:
train_dataset = train_dataset.padded_batch(64)
test_dataset = test_dataset.padded_batch(64)
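To make the padding step concrete, here is a minimal NumPy sketch of what padding a batch of variable-length sequences amounts to. `pad_batch` is a hypothetical helper written for this illustration, not part of `tf.data`; it only mimics the idea of right-padding with zeros up to the longest sequence in the batch.

```python
import numpy as np

def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length integer sequences to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    # Start from a matrix full of the padding value, then copy each sequence in.
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch

# Three "sentences" of different lengths (made-up token ids).
batch = pad_batch([[5, 8, 2], [7], [4, 9]])
print(batch.shape)
print(batch)
```

After padding, every row has the same length, which is what allows the batch to be stored as a single rectangular tensor.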

We can see that the dataset contains the raw, uncleaned reviews together with a label: 1 (positive) or 0 (negative).

In [None]:
list(train_dataset.take(10))

[(<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">,
  <tf.Tensor: shape=(), dtype=int64, numpy=0>),
 (<tf.Tensor: shape=(), dtype=string, numpy=b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortab

We download the pretrained embeddings from TensorFlow Hub.

In [None]:
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")

We can see that it maps a sentence to a vector of 128 dimensions.

In [None]:
embed(['I love my apartment'])

We use cosine similarity to check that 'I love my apartment' and 'My apartment is nice' get similar embeddings even though the sentences are worded differently. With an unrelated phrase such as 'hi codemotion', the similarity is close to zero.

In [None]:
cos_sim(embed(['I love my apartment'])[0], embed(['hi codemotion'])[0])

-0.05873022

We build the model with a tensorflow_hub layer, the same embeddings we saw above, followed by a feed-forward network with dropout to reduce overfitting.

In [None]:
model = tf.keras.Sequential([
  hub.KerasLayer('https://tfhub.dev/google/nnlm-en-dim128/2', trainable=True, input_shape=[], dtype=tf.string),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),  # sigmoid, not softmax: a softmax over one unit always outputs 1.0
])
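For a binary classifier with a single output unit, the choice of final activation matters. The NumPy sketch below (the `softmax` and `sigmoid` helpers and the logits are illustrative assumptions, not Keras code) shows that a softmax over one unit normalizes that unit against itself and therefore always returns 1.0, whereas a sigmoid maps the logit to a usable probability.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

# Three examples, one output unit each (made-up logits).
logits = np.array([[-3.0], [0.0], [2.5]])

softmax_out = softmax(logits)
sigmoid_out = sigmoid(logits)

print(softmax_out.ravel())  # constant, whatever the logit is
print(sigmoid_out.ravel())  # varies with the logit
```

A model whose output is constant predicts the positive class for every review, so its accuracy cannot move away from the share of positive labels in the data.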

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
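The `binary_crossentropy` loss selected here can be computed by hand for a few predictions. The following is a minimal NumPy sketch of the standard formula; the function, the clipping constant `eps`, and the example values are assumptions made for illustration and are not taken from the Keras implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    # Clip to avoid log(0); eps is an assumed small constant.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

labels = np.array([1.0, 0.0])           # one positive and one negative review
good_preds = np.array([0.9, 0.1])       # confident and correct
bad_preds = np.array([0.1, 0.9])        # confident and wrong

good = binary_crossentropy(labels, good_preds)
bad = binary_crossentropy(labels, bad_preds)
print(good, bad)  # the confident-but-wrong predictions are penalized far more
```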

In [None]:
model.fit(train_dataset, validation_data=test_dataset, validation_steps=30, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7eff333cda90>

We can see that it reaches its maximum accuracy in the first epoch, a sign that most of the knowledge comes from the pretrained embeddings rather than from our own training.

In [None]:
dataset, info = tfds.load('imdb_reviews/subwords8k', as_supervised=True, with_info=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

The cell above downloads the already tokenized version of the dataset, along with the tokenizer (encoder) that ships with it.

In [None]:
list(train_dataset.take(1))

[(<tf.Tensor: shape=(163,), dtype=int64, numpy=
  array([  62,   18,   41,  604,  927,   65,    3,  644, 7968,   21,   35,
         5096,   36,   11,   43, 2948, 5240,  102,   50,  681, 7862, 1244,
            3, 3266,   29,  122,  640,    2,   26,   14,  279,  438,   35,
           79,  349,  384,   11, 1991,    3,  492,   79,  122,  188,  117,
           33, 4047, 4531,   14,   65, 7968,    8, 1819, 3947,    3,   62,
           27,    9,   41,  577, 5044, 2629, 2552, 7193, 7961, 3642,    3,
           19,  107, 3903,  225,   85,  198,   72,    1, 1512,  738, 2347,
          102, 6245,    8,   85,  308,   79, 6936, 7961,   23, 4981, 8044,
            3, 6429, 7961, 1141, 1335, 1848, 4848,   55, 3601, 4217, 8050,
            2,    5,   59, 3831, 1484, 8040, 7974,  174, 5773,   22, 5240,
          102,   18,  247,   26,    4, 3903, 1612, 3902,  291,   11,    4,
           27,   13,   18, 4092, 4008, 7961,    6,  119,  213, 2774,    3,
           12,  258, 2306,   13,   91,   29,  171,  

In [None]:
encoder = info.features['text'].encoder

In [None]:
encoder.encode('hi there, welcome codemotion')

[4034, 224, 2, 6351, 7961, 4306, 3138]

We test the tokenizer by decoding the first review back into text.

In [None]:
encoder.decode([  62,   18,   41,  604,  927,   65,    3,  644, 7968,   21,   35,
         5096,   36,   11,   43, 2948, 5240,  102,   50,  681, 7862, 1244,
            3, 3266,   29,  122,  640,    2,   26,   14,  279,  438,   35,
           79,  349,  384,   11, 1991,    3,  492,   79,  122,  188,  117,
           33, 4047, 4531,   14,   65, 7968,    8, 1819, 3947,    3,   62,
           27,    9,   41,  577, 5044, 2629, 2552, 7193, 7961, 3642,    3,
           19,  107, 3903,  225,   85,  198,   72,    1, 1512,  738, 2347,
          102, 6245,    8,   85,  308,   79, 6936, 7961,   23, 4981, 8044,
            3, 6429, 7961, 1141, 1335, 1848, 4848,   55, 3601, 4217, 8050,
            2,    5,   59, 3831, 1484, 8040, 7974,  174, 5773,   22, 5240,
          102,   18,  247,   26,    4, 3903, 1612, 3902,  291,   11,    4,
           27,   13,   18, 4092, 4008, 7961,    6,  119,  213, 2774,    3,
           12,  258, 2306,   13,   91,   29,  171,   52,  229,    2, 1245,
         5790,  995, 7968,    8,   52, 2948, 5240, 8039, 7968,    8,   74,
         1249,    3,   12,  117, 2438, 1369,  192,   39, 7975])

"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
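The encode/decode round trip can be illustrated with a toy subword tokenizer. This is only a sketch under stated assumptions: the vocabulary `VOCAB` and the greedy longest-match rule are invented for the example and do not reproduce the algorithm or the vocabulary of the tfds encoder used above.

```python
# Hypothetical toy vocabulary; ids start at 1 so that 0 stays free for padding.
VOCAB = ["hi", "wel", "come", "code", "motion", " ", ","]
PIECE_TO_ID = {piece: i + 1 for i, piece in enumerate(VOCAB)}

def encode(text):
    """Greedy longest-match split of text into subword ids."""
    ids, pos = [], 0
    while pos < len(text):
        # Pick the longest vocabulary piece that matches at the current position.
        match = max((p for p in VOCAB if text.startswith(p, pos)), key=len, default=None)
        if match is None:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
        ids.append(PIECE_TO_ID[match])
        pos += len(match)
    return ids

def decode(ids):
    """Concatenate the pieces back into the original string."""
    return "".join(VOCAB[i - 1] for i in ids)

text = "hi, welcome codemotion"
ids = encode(text)
print(ids)
print(decode(ids))
```

Words missing from the vocabulary, such as "codemotion", are still representable because they are split into smaller known pieces.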

In [None]:
train_dataset = train_dataset.padded_batch(64)
test_dataset = test_dataset.padded_batch(64)

We create a model with its own embeddings, learned from scratch, and use LSTM cells, which work very well for text classification.

In [None]:
model = tf.keras.Sequential([
  tf.keras.layers.Embedding(encoder.vocab_size, 64),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),  # binary_crossentropy expects a probability in [0, 1]; relu is unbounded
])
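Conceptually, an embedding layer is a trainable lookup table: each token id selects one row of a matrix. The NumPy sketch below shows that lookup and the resulting shapes. The sizes and token ids are tiny made-up stand-ins for the notebook's real vocabulary and 64-dimensional embeddings, and the table is random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-ins for encoder.vocab_size and the embedding dimension of 64.
vocab_size, embedding_dim = 10, 4
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A padded batch of two sequences (made-up token ids, 0 used as padding).
token_ids = np.array([[3, 7, 1, 0],
                      [5, 2, 0, 0]])

# Indexing with an integer array looks up one row per token id.
embedded = embedding_table[token_ids]
print(embedded.shape)  # (batch, sequence length, embedding dimension)
```

The LSTM then reads this sequence of vectors one time step at a time, and during training the rows of the table are updated like any other weights.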

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
model.fit(train_dataset, validation_data=test_dataset, validation_steps=30, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7eff31fa4d30>

We get a lower accuracy than before, but with more training (and more data) we would probably achieve better results.