# Introduction

This example shows how to do text classification starting from raw text (as a set of text files on disk). We demonstrate the workflow on the IMDB sentiment classification dataset (unprocessed version). We use the **TextVectorization** layer for word splitting and indexing.

https://www.imdb.com/

In [None]:
# Import libraries
import tensorflow as tf
import numpy as np

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  27.9M      0  0:00:02  0:00:02 --:--:-- 27.8M
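
The extracted `aclImdb/train` directory holds an `unsup` folder of unlabeled reviews alongside `pos` and `neg`. Because `text_dataset_from_directory` infers one class per subdirectory, leaving `unsup` in place yields three classes instead of two. A minimal sketch of removing it before loading follows; `drop_unlabeled` is a hypothetical helper, not part of the original notebook.

```python
import shutil
from pathlib import Path


def drop_unlabeled(train_dir, keep=("neg", "pos")):
    """Delete every subdirectory of train_dir whose name is not in `keep`.

    Plain files in train_dir are left untouched. Returns the removed names.
    """
    removed = []
    for child in sorted(Path(train_dir).iterdir()):
        if child.is_dir() and child.name not in keep:
            shutil.rmtree(child)
            removed.append(child.name)
    return removed


# In this notebook the call would be (path taken from the cells below):
# drop_unlabeled("/content/aclImdb/train")
```

The shell equivalent in a notebook cell is `!rm -r aclImdb/train/unsup`.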


In [None]:
# Look at one example review
!cat /content/aclImdb/train/pos/6248_7.txt

Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

# Preprocessing

We load the reviews with `tf.keras.preprocessing.text_dataset_from_directory`, which builds a labeled `tf.data.Dataset` from a directory containing one subfolder per class.

In [None]:
# Build the training, validation, and test datasets
batch_size = 32
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "/content/aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=2021,
)
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "/content/aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=2021,
)
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "/content/aclImdb/test",
    batch_size=batch_size
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Found 75000 files belonging to 3 classes.
Using 60000 files for training.
Found 75000 files belonging to 3 classes.
Using 15000 files for validation.
Found 25000 files belonging to 2 classes.
Number of batches in raw_train_ds: 1875
Number of batches in raw_val_ds: 469
Number of batches in raw_test_ds: 782
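
The batch counts follow from dividing each file count by `batch_size` and rounding up, since the final partial batch is kept. A quick check in plain Python, using the file counts reported in the log above:

```python
import math

batch_size = 32

# File counts as reported in the log above.
files = {"train": 60000, "validation": 15000, "test": 25000}

# Ceiling division: a final partial batch still counts as one batch.
batches = {name: math.ceil(count / batch_size) for name, count in files.items()}
print(batches)  # {'train': 1875, 'validation': 469, 'test': 782}
```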


In [None]:
# Inspect a few examples from one batch
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(5):
    print(text_batch.numpy()[i])
    print(label_batch.numpy()[i])

b'Out of the first five episodes of Hammer\'s short-running "Hammer House of Horror" series, this fifth episode with the wonderful title "The House that Bled to Death" is arguably the creepiest one. As a great fan of the Hammer Studios\' Gothic Horror films for many years, I wonder what took me so long to finally start watching the series quite recently. So far, I\'ve only seen the first five episodes, and I have a strong feeling that the best is yet to come, but even if the series stays as entertaining as the first five episodes are, I will be satisfied. Whereas the second and third episodes were great to watch for their morbid and ingeniously dark sense of humor, this fifth entry is definitely the one out of the first five that delivers the most genuine Horror. The episode begins when an elderly man murders his wife out of unknown motivations. Years later, William (Nicholas Ball) and Emma Peters (Rachel Davies) move in the house with their little daughter Sophie (Emma Ridley). Soon a

In [None]:
# Preprocessing and cleanup: lowercase, strip <br /> tags, remove punctuation
import re
import string
from tensorflow.keras.layers import TextVectorization
def custom_standarization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, "(<br />){1,}", " ")
  return tf.strings.regex_replace(
      stripped_html, f"[{re.escape(string.punctuation)}]", ""
  )
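
To see what the standardization does to a single review, here is a plain-Python mirror of the same three steps using the `re` module. It is only an illustration: `standardize_py` is a hypothetical name, and the TensorFlow version above is the one the layer actually uses.

```python
import re
import string


def standardize_py(text):
    """Lowercase, replace runs of <br /> with a space, then drop punctuation."""
    lowered = text.lower()
    no_html = re.sub(r"(<br />)+", " ", lowered)
    return re.sub(f"[{re.escape(string.punctuation)}]", "", no_html)


print(standardize_py("Great movie!<br /><br />Loved it."))  # great movie loved it
```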

In [None]:
# Vectorization step
# Define some constants
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Create the vectorization layer
vectorize_layer = TextVectorization(
    standardize=custom_standarization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Adapt the layer on the text only, dropping the labels
text_ds = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
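
Conceptually, `adapt` builds a frequency-ranked vocabulary, and in `"int"` mode the layer then maps each token to its index, reserving 0 for padding and 1 for out-of-vocabulary tokens. The sketch below imitates that behavior in plain Python under simplifying assumptions (whitespace splitting, no standardization); `build_vocab` and `vectorize` are hypothetical helpers, not Keras functions.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to indices 2, 3, ... (0 = pad, 1 = OOV)."""
    counts = Counter(token for text in texts for token in text.split())
    # Two slots of max_tokens are reserved for padding and OOV.
    kept = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(kept)}


def vectorize(text, vocab, sequence_length):
    """Index tokens, then truncate or right-pad with 0 to sequence_length."""
    ids = [vocab.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


vocab = build_vocab(["good movie", "bad movie", "good good plot"], max_tokens=4)
print(vocab)                                  # {'good': 2, 'movie': 3}
print(vectorize("good bad movie", vocab, 5))  # [2, 1, 3, 0, 0]
```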

In [None]:
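# `output_shape` is a property, not a method, and it is undefined on a layer
# that has not been connected to an input, which raises AttributeError.
# Calling the layer on a sample batch and reading the result's shape works.
sample = tf.constant([["a short sample review"]])
vectorize_layer(sample).shape  # expected: (1, sequence_length)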

In [None]:
# Apply the vectorization to our data
# Option 1: make the layer part of the model

# text_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='text')
# x = vectorize_layer(text_input)
# y = layers.Embedding(max_features + 1, embedding_dim)(x)

In [None]:
# Option 2 (on GPU / CPU): apply the layer to the datasets
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

# Building the model

We build a simple 1D convnet on top of an Embedding layer.

In [None]:
from tensorflow.keras import layers

# An integer input for vocab indices.
inputs = tf.keras.Input(shape=(None,), dtype="int64")

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = tf.keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
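
To follow how the sequence dimension shrinks through the two `Conv1D` layers, the output length for `padding="valid"` can be computed as `(length - kernel_size) // strides + 1`. A small check with the values used above; `conv1d_valid_length` is a hypothetical helper.

```python
def conv1d_valid_length(length, kernel_size, strides):
    """Output length of a 1D convolution with padding='valid'."""
    return (length - kernel_size) // strides + 1


sequence_length = 500  # output_sequence_length of the vectorization layer

after_first = conv1d_valid_length(sequence_length, kernel_size=7, strides=3)
after_second = conv1d_valid_length(after_first, kernel_size=7, strides=3)
print(after_first, after_second)  # 165 53
```

`GlobalMaxPooling1D` then collapses the remaining 53 steps, leaving one 128-dimensional vector per review for the dense layers.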

In [None]:
# Train the model for 1 epoch
epoch = 1

# Fit the model on the training set, validating on the validation set
model.fit(train_ds, validation_data=val_ds, epochs=epoch)



<keras.callbacks.History at 0x7f042db15990>

In [None]:
# Evaluate the model on the test data
model.evaluate(test_ds)



[1846528311296.0, 0.5]
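
An accuracy of exactly 0.5 on the balanced test set, together with a loss of this magnitude, suggests the model emits the same saturated prediction for every review. One plausible cause, not confirmed by the source, is the third class reported in the training log: if labels take the values 0, 1, and 2, binary cross-entropy is no longer bounded below. The sketch uses the standard logit form of the loss; whether Keras evaluates exactly this expression here is an assumption, and `bce_from_logit` is a hypothetical helper.

```python
import math


def bce_from_logit(z, y):
    """Binary cross-entropy for logit z and target y (numerically stable form)."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Columns: logit, loss for target 1, loss for target 2, loss for target 0.
for z in (1.0, 10.0, 100.0):
    print(z, bce_from_logit(z, 1), bce_from_logit(z, 2), bce_from_logit(z, 0))
```

With target 2 the loss keeps falling as the logit grows, so training can push logits arbitrarily high. The same logits then cost roughly their own value on every test review labeled 0.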

In [None]:
# Build the final end-to-end model
# A string input
inputs = tf.keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end to end model
end_to_end_model = tf.keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)



[1846527787008.0, 0.5]