##### Copyright 2019 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Word embeddings

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/text/word_embeddings">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word_embeddings.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/word_embeddings.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/word_embeddings.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

## Setup

In [1]:
import io
import numpy as np
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, TextVectorization

### Download the IMDb Dataset
You will use the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) throughout the tutorial. You will train a sentiment classifier model on this dataset and, in the process, learn embeddings from scratch. To read more about loading a dataset from scratch, see the [Loading text tutorial](../load_data/text.ipynb).

Download the dataset using the Keras file utility and take a look at the directories.

In [6]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [7]:
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", origin = 'file:/./aclImdb_v1.tar.gz',
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

In [8]:
print(dataset)

.\aclImdb_v1


In [10]:
print(os.getcwd())

c:\Users\glezr\TheBridge\77_Lunes_1303\77_Lunes_1303


Take a look at the `train/` directory. It has `pos` and `neg` folders with movie reviews labelled as positive and negative respectively. You will use reviews from `pos` and `neg` folders to train a binary classification model.

In [14]:
dataset_dir = "./aclImdb"
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [15]:
shutil.rmtree(dataset_dir + "/train/unsup")

The `train` directory also contains an additional folder, `unsup`, which must be removed before creating the training dataset.

Next, create a `tf.data.Dataset` using `tf.keras.preprocessing.text_dataset_from_directory`. You can read more about using this utility in this [text classification tutorial](https://www.tensorflow.org/tutorials/keras/text_classification). 

Use the `train` directory to create both train and validation datasets with a split of 20% for validation.

In [16]:
batch_size = 1024
seed = 123

'''
Scans a directory for subfolders. Each subfolder is a label
and each file inside it is a review. We can choose the training or
validation subset and how much data to hold out for validation.
This creates a tf.data.Dataset.
THE EXTRA `unsup` FOLDER IN TRAIN MUST BE REMOVED FIRST.
'''

train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation', # This and the shared seed keep the validation samples from overlapping with train
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
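
Passing the same `seed` to both calls is what keeps the two subsets disjoint. The idea can be sketched in plain Python, assuming a simple shuffle-then-slice scheme. The helper `split_indices` is hypothetical and is not how Keras implements the split internally.

```python
import random

def split_indices(n_samples, validation_split, seed):
    """Shuffle indices deterministically, then slice off the validation tail.

    Hypothetical helper for illustration; not the Keras implementation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples - int(n_samples * validation_split)
    return indices[:n_train], indices[n_train:]

# Calling twice with the same seed reproduces the same partition,
# so the "training" and "validation" subsets never share a sample.
train_idx, val_idx = split_indices(25000, 0.2, seed=123)
train_again, _ = split_indices(25000, 0.2, seed=123)

print(len(train_idx), len(val_idx))    # 20000 5000
print(set(train_idx) & set(val_idx))   # set()
print(train_idx == train_again)        # True
```

With different seeds in the two calls, the shuffles would differ and some reviews could land in both subsets.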


Take a look at a few movie reviews and their labels `(1: positive, 0: negative)` from the train dataset.


In [22]:
for text_batch, label_batch in train_ds.take(1): # Take the first batch
    for i in range(5): # range(1025) would raise an error: the batch holds only 1024 reviews
        print(text_batch.numpy()[i][:150],"...")
        print("Label:",label_batch[i].numpy(),"(%s)" %("Positive" if label_batch.numpy()[i] == 1 else "Negative"))

b'I believe this is the most powerful film HBO Pictures has made to date. This film should have been released in theaters for the public to view on the ' ...
Label: 1 (Positive)
b'*** THIS CONTAINS MANY, MANY SPOILERS, NOT THAT IT MATTERS, SINCE EVERYTHING IS SO PATENTLY OBVIOUS ***<br /><br />Oh my God, where do I start? Well, ' ...
Label: 0 (Negative)
b'I\'ve read some terrible things about this film, so I was prepared for the worst. "Confusing. Muddled. Horribly structured." While there may be merit t' ...
Label: 1 (Positive)
b'Nina Foch delivers a surprisingly strong performance as the title character in this fun little Gothic nail-biter. She accepts a position as secretary ' ...
Label: 1 (Positive)
b"Here's the kind of love story that I do enjoy watching. And mostly, it's for two reasons. One, it concentrates of young people, VERY young people. Peo" ...
Label: 1 (Positive)


### Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

`.cache()` keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

`.prefetch()` overlaps data preprocessing and model execution while training. 

You can learn more about both methods, as well as how to cache data to disk in the [data performance guide](https://www.tensorflow.org/guide/data_performance).

In [23]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
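
Conceptually, caching means the expensive loading step runs only during the first pass over the data. The plain-Python sketch below illustrates that idea with a hypothetical `CachedDataset` class. It is an analogy only and does not reflect how `tf.data` is implemented.

```python
class CachedDataset:
    """Replays elements from memory after the first full pass (illustration only)."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # expensive loader, e.g. reading files from disk
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # First pass: run the loader once and keep its elements in memory.
            self._cache = list(self._load_fn())
        return iter(self._cache)

calls = {"count": 0}

def load_reviews():
    """Stand-in for reading review files; counts how often it runs."""
    calls["count"] += 1
    yield from ["great film", "terrible plot", "loved it"]

cached_ds = CachedDataset(load_reviews)
for epoch in range(3):
    reviews = list(cached_ds)

print(calls["count"])  # 1: the loader ran only for the first epoch
```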

## Using the Embedding layer with sequences

For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape `(samples, sequence_length)`, where each entry is a sequence of integers. It can embed sequences of variable lengths, as long as all sequences within one batch share the same length. You could feed an embedding layer batches with shapes `(32, 10)` (batch of 32 sequences of length 10) or `(64, 15)` (batch of 64 sequences of length 15).

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a `(2, 3)` input batch and the output is `(2, 3, N)`, where `N` is the embedding dimension.
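
The shape behavior can be reproduced with NumPy integer indexing, which is a reasonable mental model for the lookup. This is a sketch, not the Keras layer itself: the table here is random and fixed, whereas the layer learns it during training, and the sizes (10 tokens, `N = 4`) are arbitrary assumptions.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4    # arbitrary illustrative sizes (N = 4)
rng = np.random.default_rng(0)
table = rng.normal(size=(vocab_size, embedding_dim))  # one row per token index

batch = np.array([[1, 2, 3],
                  [4, 5, 6]])        # shape (2, 3): 2 sequences of 3 token ids

embedded = table[batch]              # integer indexing looks up one row per id
print(embedded.shape)                # (2, 3, 4)
print(np.array_equal(embedded[0, 1], table[2]))  # True
```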


In other words, this is what we already saw with the "Me llamo Iñigo Montoya" example, except that here we feed a batch of 2 sequences: "Me llamo Iñigo Montoya" and "tú mataste a mi padre".

In [24]:
categorias_ejemplo = ["Me","llamo","Iñigo","Montoya","soy","tú","mataste","a","mi","padre"]
pre_conversion = tf.keras.layers.StringLookup() # The vocabulary has to be converted to indices
pre_conversion.adapt(categorias_ejemplo)
lookup_y_embedding = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=[], dtype=tf.string),
    pre_conversion,
    tf.keras.layers.Embedding(input_dim=pre_conversion.vocabulary_size(),
                              output_dim=2)])

In [25]:
def text2array(texto):
    return np.array([texto.split()])

In [26]:
result = lookup_y_embedding(tf.constant([text2array("Me llamo Iñigo Montoya"), text2array("tú mataste a mi padre")]))
print(result.shape)
result.numpy()

ValueError: Can't convert non-rectangular Python sequence to Tensor.

Tensors are "rectangular" structures, so feeding them sequences of variable length brings us to the first thing you need to know when working with NLP in deep learning: __padding is required__. That is, sequences must be brought to a fixed size by filling the shorter ones with "zeros".
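
A minimal sketch of padding on integer token ids, in plain Python. The helper `pad_sequences` is written for this example (it is not a library call), the ids are made up, and padding on the right is an assumption of the sketch.

```python
def pad_sequences(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one.

    Hypothetical helper written for this example.
    """
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Token ids for a 4-token and a 5-token sentence (ids are made up).
ragged = [[3, 4, 5, 6], [7, 8, 9, 10, 11]]
padded = pad_sequences(ragged)

print(padded)  # [[3, 4, 5, 6, 0], [7, 8, 9, 10, 11]]
```

Once every row has the same length, the batch can be converted to a rectangular tensor.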



In [27]:
result = lookup_y_embedding(tf.constant([text2array("<relleno> Me llamo Iñigo Montoya"), text2array("tú mataste a mi padre")]))
print(result.shape)
result.numpy()

(2, 1, 5, 2)


array([[[[ 0.04776336,  0.02587238],
         [-0.00458229,  0.04994545],
         [-0.0125264 , -0.02681776],
         [ 0.0145582 , -0.02573987],
         [ 0.02786077,  0.0102164 ]]],


       [[[-0.00541316,  0.03128723],
         [ 0.00297724, -0.01022477],
         [ 0.03612791,  0.03428438],
         [-0.0236047 ,  0.0206739 ],
         [-0.01587278,  0.03309766]]]], dtype=float32)

That leads to another problem: very short sequences will be mostly zeros and will be misinterpreted. To correct this we will use the concept of a __mask__ (covered later).

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape `(samples, sequence_length, embedding_dimensionality)`. To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the simplest. The [Text Classification with an RNN](text_classification_rnn.ipynb) tutorial is a good next step.

## Text preprocessing

Next, define the dataset preprocessing steps required for your sentiment classification model. Initialize a TextVectorization layer with the desired parameters to vectorize movie reviews. You can learn more about using this layer in the [Text Classification](https://www.tensorflow.org/tutorials/keras/text_classification) tutorial.

In [32]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data) # converts to lowercase
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ') # replaces HTML line-break tags with a space
    return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '') # strips punctuation


sequence_length = 8

# See how the TextVectorization layer takes input sequences and
# returns lists of indices, WITH PADDING
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    output_mode='int', # IMPORTANT: 'int' mode returns a sequence of indices into the vocabulary
    output_sequence_length=sequence_length) # sets the length of the output sequence



In [33]:
df_test = ["Me llamo Iñigo Monotya", "Tú mataste a mi padre","Prepárate a morir"]

In [34]:
vectorize_layer.adapt(df_test)

In [35]:
vectorize_layer(df_test)

<tf.Tensor: shape=(3, 8), dtype=int64, numpy=
array([[ 9, 11, 12,  7,  0,  0,  0,  0],
       [ 3, 10,  2,  8,  5,  0,  0,  0],
       [ 4,  2,  6,  0,  0,  0,  0,  0]], dtype=int64)>

In [36]:
for indice,word in enumerate(vectorize_layer.get_vocabulary()):
    print("%d -> %s" %(indice,word))

0 -> 
1 -> [UNK]
2 -> a
3 -> tú
4 -> prepárate
5 -> padre
6 -> morir
7 -> monotya
8 -> mi
9 -> me
10 -> mataste
11 -> llamo
12 -> iñigo


Index 0 is usually reserved for padding.
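
The mapping from text to padded index sequences can be sketched in plain Python. This is an approximation under stated assumptions: the helpers `build_vocab` and `vectorize` are hypothetical, tokens are split on whitespace after lowercasing, and ties between equally frequent tokens are broken alphabetically, so the indices will not match the `TextVectorization` output above.

```python
from collections import Counter

PAD, UNK = "", "[UNK]"   # index 0 is padding, index 1 is out-of-vocabulary

def build_vocab(texts):
    """Order tokens by frequency; hypothetical stand-in for adapt()."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Most frequent first; alphabetical order breaks ties (an assumption).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return [PAD, UNK] + ordered

def vectorize(text, vocab, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ["Tú mataste a mi padre", "Prepárate a morir"]
vocab = build_vocab(texts)

print(vocab[:3])                                 # ['', '[UNK]', 'a']
print(vectorize("Prepárate a morir", vocab, 8))  # [7, 2, 5, 0, 0, 0, 0, 0]
print(vectorize("Hola a todos", vocab, 8))       # [1, 2, 1, 0, 0, 0, 0, 0]
```

Words that were not seen while building the vocabulary map to index 1, and every position past the end of the sentence is filled with index 0.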

Next, the output of TextVectorization is passed to the Embedding layer, and that output in turn to the dense layers that make up the classification model.

In [117]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data) # converts to lowercase
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ') # replaces HTML line-break tags with a space
    return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '') # strips punctuation


# Vocabulary size and number of words in a sequence.
vocab_size = 10000 # Keep the 10,000 most frequent words; the rest are encoded as UNK (or as the extra OOV categories, if OOV buckets are used)
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set output_sequence_length because not all samples have the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
# The labels are dropped this way; in train_ds they come paired with the features
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
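
To see what the standardization step does to a single review, here is a plain-Python counterpart built on `re`. The function name `standardize` and the sample string are assumptions for illustration; the layer itself uses the `tf.strings` version defined above.

```python
import re
import string

def standardize(text):
    """Plain-Python counterpart of custom_standardization (illustration only)."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

sample = "Oh my God, where do I start?<br /><br />Well..."
print(repr(standardize(sample)))  # 'oh my god where do i start  well'
```

Each `<br />` tag becomes one space, which is why two spaces remain between "start" and "well".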

## Create a classification model

Use the [Keras Sequential API](https://www.tensorflow.org/guide/keras/sequential_model) to define the sentiment classification model. In this case it is a "Continuous bag of words" style model.
* The [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) layer transforms strings into vocabulary indices. You have already initialized `vectorize_layer` as a TextVectorization layer and built its vocabulary by calling `adapt` on `text_ds`. Now `vectorize_layer` can be used as the first layer of your end-to-end classification model, feeding transformed strings into the Embedding layer.
* The [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: `(batch, sequence, embedding)`.

* The [`GlobalAveragePooling1D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

* The fixed-length output vector is piped through a fully-connected ([`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)) layer with 16 hidden units.

* The last layer is densely connected with a single output node. 

Caution: This model doesn't use masking, so the zero-padding is used as part of the input and hence the padding length may affect the output.  To fix this, see the [masking and padding guide](https://www.tensorflow.org/guide/keras/masking_and_padding).
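
The effect described in the caution can be seen in a small NumPy sketch that compares a plain average with a masked average over the sequence axis. The embedding table is random and the token ids are made up. This illustrates the arithmetic only, not the Keras implementation.

```python
import numpy as np

# Two padded sequences of token ids; 0 marks padding.
ids = np.array([[5, 7, 0, 0],
                [5, 7, 9, 3]])

rng = np.random.default_rng(0)
table = rng.normal(size=(10, 2))     # toy embedding table: 10 tokens, 2 dims
embedded = table[ids]                # shape (2, 4, 2)

# Plain average: padding positions are counted like real tokens.
plain_mean = embedded.mean(axis=1)

# Masked average: only positions with a non-zero id contribute.
mask = (ids != 0)[..., np.newaxis].astype(embedded.dtype)   # shape (2, 4, 1)
masked_mean = (embedded * mask).sum(axis=1) / mask.sum(axis=1)

print(plain_mean.shape, masked_mean.shape)   # (2, 2) (2, 2)
print(np.allclose(masked_mean[0], table[[5, 7]].mean(axis=0)))  # True
```

For the first sequence the two averages differ, because the plain mean mixes in two copies of the padding vector. For the second sequence, which has no padding, they agree.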

In [118]:
embedding_dim=16

'''
GlobalAveragePooling1D
Each word has an associated embedding. The output is the mean of each
embedding coordinate, so with a 16-dimensional embedding the result is
flattened to 16 values, each one being the mean of that coordinate
over all the words in the review.
'''

model = Sequential([
  vectorize_layer, # 100 token indices per review, e.g. [1, 3, 4, 4, 90, ...]
  Embedding(vocab_size, embedding_dim, name="embedding"), # 10,000 x 16 table --> 100 x 16 per review
  GlobalAveragePooling1D(), # 16 values per review
  Dense(16, activation='relu'),
  Dense(1) # no activation in the original tutorial: the output is a logit
])

## Compile and train the model

You will use [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize metrics including loss and accuracy. Create a `tf.keras.callbacks.TensorBoard`.

In [119]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

Compile and train the model using the `Adam` optimizer and `BinaryCrossentropy` loss. 

In [120]:
model.compile(optimizer='adam',
              # binary_crossentropy
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
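
Because the last `Dense(1)` layer has no activation, the loss is configured with `from_logits=True`. The sketch below shows, in plain Python, a common numerically stable way to compute binary cross-entropy directly from a logit, and checks it against the naive sigmoid-then-log form. The function names are hypothetical, and the sketch makes no claim about the exact code path Keras uses.

```python
import math

def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_from_probability(logit, label):
    """Same loss computed the naive way, via an explicit sigmoid."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Both forms agree for moderate logits; the first also stays finite
# for very large magnitudes, where the naive form would hit log(0).
for logit, label in [(2.0, 1), (2.0, 0), (-1.5, 1), (0.0, 0)]:
    stable = bce_from_logit(logit, label)
    naive = bce_from_probability(logit, label)
    print(logit, label, round(stable, 6), abs(stable - naive) < 1e-9)
```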

In [122]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x1398d3a58>

With this approach the model reaches a validation accuracy of around 84% (note that the model is overfitting since training accuracy is higher).

Note: Your results may be a bit different, depending on how weights were randomly initialized before training the embedding layer. 

You can look into the model summary to learn more about each layer of the model.

In [123]:
test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    dataset_dir + "/test/",
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)


Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [124]:
model.evaluate(test_ds)



[0.4101976156234741, 0.8032500147819519]

In [125]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_4 (TextV  (None, 100)              0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 100, 16)           160000    
                                                                 
 global_average_pooling1d_7   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_6 (Dense)             (None, 16)                272       
                                                                 
 dense_7 (Dense)             (None, 1)                 17        
                                                                 
Total params: 160,289
Trainable params: 160,289
Non-tr
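
The parameter counts in the summary can be checked by hand. The arithmetic below uses only the layer sizes defined earlier in this tutorial (`vocab_size = 10000`, `embedding_dim = 16`, 16 hidden units); the variable names are chosen for this example.

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 16

embedding_params = vocab_size * embedding_dim                       # one vector per token
dense_hidden_params = embedding_dim * hidden_units + hidden_units   # weights + biases
dense_output_params = hidden_units * 1 + 1                          # weights + bias

total = embedding_params + dense_hidden_params + dense_output_params
print(embedding_params, dense_hidden_params, dense_output_params, total)
# 160000 272 17 160289
```

Almost all of the parameters sit in the embedding table; the vectorization and pooling layers contribute none.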

Visualize the model metrics in TensorBoard.

In [126]:
#docs_infra: no_execute
%load_ext tensorboard
%tensorboard --logdir logs

![embeddings_classifier_accuracy.png](images/embeddings_classifier_accuracy.png)

## Retrieve the trained word embeddings and save them to disk

Next, retrieve the word embeddings learned during training. The embeddings are weights of the Embedding layer in the model. The weights matrix is of shape `(vocab_size, embedding_dimension)`.

Obtain the weights from the model using `get_layer()` and `get_weights()`. The `get_vocabulary()` function provides the vocabulary to build a metadata file with one token per line. 

In [127]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Write the weights to disk. To use the [Embedding Projector](http://projector.tensorflow.org), you will upload two files in tab-separated format: a file of vectors (containing the embedding), and a file of metadata (containing the words).

In [130]:
len(weights)

10000

In [131]:
len(vocab)

10000

In [132]:
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")

If you are running this tutorial in [Colaboratory](https://colab.research.google.com), you can use the following snippet to download these files to your local machine (or use the file browser, *View -> Table of contents -> File browser*).

In [21]:
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass

## Visualize the embeddings

To visualize the embeddings, upload them to the embedding projector.

Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).

* Click on "Load data".

* Upload the two files you created above: `vectors.tsv` and `metadata.tsv`.

The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for "beautiful". You may see neighbors like "wonderful". 
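
The neighbor search can also be approximated offline. The NumPy sketch below assumes cosine similarity as the distance measure and uses made-up 2-D vectors rather than trained embeddings. The helper `nearest_neighbors` is hypothetical.

```python
import numpy as np

def nearest_neighbors(query, words, vectors, k=2):
    """Return the k words whose vectors have the highest cosine similarity to query."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    order = np.argsort(-sims)            # most similar first
    return [words[i] for i in order if words[i] != query][:k]

# Toy 2-D vectors, made up for illustration (not trained embeddings).
words = ["beautiful", "wonderful", "awful", "boring"]
vectors = np.array([[0.9, 0.1],
                    [0.8, 0.2],
                    [-0.7, 0.1],
                    [-0.6, -0.3]])

print(nearest_neighbors("beautiful", words, vectors, k=1))  # ['wonderful']
```

To try this on the trained model, `words` and `vectors` could be replaced with the vocabulary and the weight matrix retrieved earlier, keeping in mind that the same row order must be used for both.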

Note: Experimentally, you may be able to produce more interpretable embeddings by using a simpler model. Try deleting the `Dense(16)` layer, retraining the model, and visualizing the embeddings again.

Note: Typically, a much larger dataset is needed to train more interpretable word embeddings. This tutorial uses a small IMDb dataset for the purpose of demonstration.


## Next Steps

This tutorial has shown you how to train and visualize word embeddings from scratch on a small dataset.

* To train word embeddings using the Word2Vec algorithm, try the [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec) tutorial.

* To learn more about advanced text processing, read the [Transformer model for language understanding](https://www.tensorflow.org/tutorials/text/transformer).