## Preparing the Data

In [1]:
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace

def prepareData(dir):
  data = text_dataset_from_directory(dir)
  return data.map(
    lambda text, label: (regex_replace(text, '<br />', ' '), label),
  )

train_data = prepareData('data/aclImdb/test')
test_data = prepareData('data/aclImdb/train')

2021-12-08 20:20:30.670627: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/opt/oracle/instantclient_19_8
2021-12-08 20:20:30.670671: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


Found 25000 files belonging to 2 classes.


2021-12-08 20:20:33.608916: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/opt/oracle/instantclient_19_8
2021-12-08 20:20:33.608960: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-12-08 20:20:33.608981: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (virtual-machine): /proc/driver/nvidia/version does not exist
2021-12-08 20:20:33.609329: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Found 75000 files belonging to 3 classes.


Ahora, todas las instancias `<br />` en nuestro conjunto de datos han sido reemplazadas por espacios.  Puede intentar imprimir parte del conjunto de datos si lo desea:

In [2]:
for text_batch, label_batch in train_data.take(1):
  print(text_batch.numpy()[0])
  print(label_batch.numpy()[0]) # 0 = negative, 1 = positive

b"Another popular screening for a British picture at Coalville's Century Theatre. A well crafted, solid drama with an ever developing plot and ongoing 'twists in the tale'...as the lies piled up! A masterclass of acting by a flawless cast, well marshaled by first time director Julian Fellowes. Outstanding performance, as usual, by Tom Wilkinson but good turns by all concerned including supporting actors Linda Bassett and John Neville. Our audience was engrossed by this film, which includes a couple of shock incidents which really make you 'jump'. A good tight production at around only 80 minutes, probably produced on a very limited budget, but a success, which should see Fellowes directing again for the big screen. Some publicity for the film seemed to suggest it was set in the 50s (as per Nigel Balchin's novel)but obviously this is not the case. Recommended viewing."
1


## Construyendo el modelo

Usaremos la Sequential clase , que representa una pila lineal de capas. Para empezar, crearemos una instancia de un modelo secuencial vacío y definiremos su tipo de entrada: 

In [3]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input

model = Sequential()
model.add(Input(shape = (1,), dtype = "string"))

Nuestro modelo ahora toma 1 entrada de cadena: tiempo para hacer algo con esa cadena. 

## 3.1 Vectorización de texto 

Nuestra primera capa será una capa TextVectorization, que procesará la cadena de entrada y la convertirá en una secuencia de números enteros, cada uno representando un token.

In [4]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

max_tokens = 1000
max_len = 100
vectorize_layer = TextVectorization(
  
  # Tamaño máximo de vocabulario. Cualquier palabra fuera de las max_tokens más comunes
  # serán tratados de la misma manera: como tokens "fuera de vocabulario" (OOV).
  max_tokens = max_tokens,
  
  # Salida de índices enteros, uno por token de cadena
  output_mode = "int",
  
  # Rellene o trunque siempre exactamente esta cantidad de tokens
  output_sequence_length = max_len,
)

Para inicializar la capa, necesitamos llamar a .adapt ():

In [5]:
# Llame a adapt (), que ajusta la capa TextVectorization a nuestro dataset de texto.
# Aquí es cuando se seleccionan las palabras más comunes de max_tokens (es decir, el vocabulario).
train_texts = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_texts)

## 3.2 Incrustación

Nuestra siguiente capa será una incrustación, que convertirá los números enteros producidos por la capa anterior en vectores de longitud fija.

In [6]:
from tensorflow.keras.layers import Embedding

# Capa anterior: Vectorización de texto
max_tokens = 1000

model.add(vectorize_layer)

# Tenga en cuenta que estamos usando max_tokens + 1 aquí, ya que hay un token fuera de vocabulario (OOV) que se agrega al vocabulario.
model.add(Embedding(max_tokens + 1, 128))

## 3.3 La capa recurrente 

¡Finalmente, estamos listos para la capa recurrente que convierte a nuestra red en una RNN!  Usaremos una capa Long Short-Term Memory (LSTM), que es una opción popular para este tipo de problema.  Es muy sencillo de implementar:

In [7]:
from tensorflow.keras.layers import LSTM

# 64 es el parámetro "unidades", que es la dimensionalidad del espacio de salida.
model.add(LSTM(64))

Para terminar nuestra red, agregaremos una estándar completamente conectada ( Densa capa ) y una capa de salida con sigmoidea activación : 

In [8]:
from tensorflow.keras.layers import Dense

model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

## 4. Compilación del modelo 

In [9]:
model.compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = ['accuracy'],
)

## 5. Entrenamiento del modelo 

In [10]:
model.fit(train_data, epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7b294ce370>

Usar el modelo entrenado para hacer predicciones es fácil: pasamos una cadena a predict()y genera una partitura. 

In [11]:
# Debería imprimir una puntuación muy alta como 0,90.
print(model.predict([
  "i loved it! highly recommend it to anyone and everyone looking for a great movie to watch.",
]))

# Debería imprimir una puntuación muy baja como 0,01.
print(model.predict([
  "this was awful! i hated it so much, nobody should watch this. the acting was terrible, the music was terrible, overall it was just bad.",
]))

[[0.9096129]]
[[0.162857]]


https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.bib