# 11.2 Prepare text data

## common steps

- `standardize`: text standardization
- `tokenization`
- `indexing` all tokens
- `vector encoding` indices
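
The four steps above can be sketched in plain Python. This is only an illustrative sketch, not how `TextVectorization` is implemented: the helper names (`standardize`, `tokenize`, `build_index`, `one_hot`) are hypothetical, and it assumes lowercasing plus punctuation stripping for standardization, whitespace splitting for tokenization, and one-hot vectors for the final encoding.

```python
import string

def standardize(text):
    # Lowercase and drop ASCII punctuation (one simple standardization scheme).
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)

def tokenize(text):
    # Word-level tokenization: split on whitespace.
    return text.split()

def build_index(corpus):
    # Give each distinct token the next free integer, in order of first appearance.
    vocab = {}
    for text in corpus:
        for token in tokenize(standardize(text)):
            vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(indices, vocab_size):
    # Turn each integer index into a vector with a single 1 at that position.
    vectors = []
    for i in indices:
        row = [0] * vocab_size
        row[i] = 1
        vectors.append(row)
    return vectors

sentence = "The cat sat on the mat."
vocab = build_index([sentence])
indices = [vocab[token] for token in tokenize(standardize(sentence))]
vectors = one_hot(indices, len(vocab))
print(indices)  # [0, 1, 2, 3, 0, 4]
```

Both occurrences of "the" map to index 0, which is what makes the index a vocabulary rather than a position list.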

It is common to restrict the vocabulary to the 20,000 or 30,000 most common words in the training data. In Keras, the `TextVectorization` layer does this through its `max_tokens` argument.
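
As a rough illustration of capping the vocabulary by frequency, the sketch below keeps only the most frequent tokens of a toy corpus. The function name `top_n_vocabulary` and the sample texts are made up for this example, and the sketch ignores any slots a real layer might reserve for special tokens.

```python
from collections import Counter

def top_n_vocabulary(texts, max_tokens):
    # Count every whitespace-separated, lowercased token across the corpus.
    counts = Counter(token for text in texts for token in text.lower().split())
    # Keep only the `max_tokens` most frequent tokens, most frequent first.
    return [token for token, _ in counts.most_common(max_tokens)]

texts = ["good movie good plot", "bad movie", "good acting"]
capped_vocab = top_n_vocabulary(texts, max_tokens=2)
print(capped_vocab)  # ['good', 'movie']
```

Rare words such as "plot" or "acting" fall outside the cap and would be treated as out-of-vocabulary at encoding time.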

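> NOTE: if vectorization is part of the model, it runs synchronously with the rest of the model. `TextVectorization` is mostly a dictionary lookup, so it runs on the CPU; when training on a GPU, each step then waits for the CPU to finish vectorizing. Applying it in the `tf.data` pipeline instead lets it run asynchronously.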

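- `text_dataset_from_directory`: builds a batched dataset of (text, label) pairs from a directory that has one subdirectory per class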

## special tokens

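- OOV token (`[UNK]`): stands in for any word that is not in the vocabulary
- mask/padding token: fills short sequences up to a common length and should be ignored by the model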
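
A small sketch of how the two special tokens are typically used when encoding a batch. It assumes index 0 for the mask token and index 1 for `[UNK]`, which matches the convention I believe `TextVectorization` uses by default for integer output; the toy vocabulary and the helper `encode_padded` are hypothetical.

```python
MASK_INDEX = 0  # assumed index of the mask/padding token
UNK_INDEX = 1   # assumed index of the out-of-vocabulary token

toy_vocab = {"": MASK_INDEX, "[UNK]": UNK_INDEX,
             "the": 2, "movie": 3, "was": 4, "great": 5}

def encode_padded(text, vocab, max_len):
    # Unknown words fall back to the [UNK] index.
    indices = [vocab.get(tok, UNK_INDEX) for tok in text.lower().split()]
    # Truncate long sequences, then pad short ones with the mask index.
    indices = indices[:max_len]
    return indices + [MASK_INDEX] * (max_len - len(indices))

batch = [
    encode_padded("The movie was great", toy_vocab, max_len=6),
    encode_padded("The soundtrack was great but long and loud", toy_vocab, max_len=6),
]
print(batch)  # [[2, 3, 4, 5, 0, 0], [2, 1, 4, 5, 1, 1]]
```

Padding makes every row the same length so the batch can be stacked into a single rectangular tensor.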

# 11.3 Representing groups of words: sets vs. sequences

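- `bag-of-words`: treat the text as an unordered set of tokens, encoded as multi-hot, count, or TF-IDF vectors
- `sequence model`: treat the text as an ordered sequence of tokens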
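
The three bag-of-words encodings differ only in what is stored per vocabulary slot. The sketch below uses a tiny hand-written vocabulary and a smoothed textbook TF-IDF formula, `count * (log((1 + N) / (1 + df)) + 1)`; that formula is an assumption chosen for illustration and is not necessarily what Keras computes for `output_mode="tf_idf"`.

```python
import math

bow_vocab = ["[UNK]", "good", "bad", "movie", "plot"]
bow_index = {tok: i for i, tok in enumerate(bow_vocab)}

def count_vector(text):
    # One slot per vocabulary entry; unknown words are counted in slot 0.
    vec = [0] * len(bow_vocab)
    for tok in text.lower().split():
        vec[bow_index.get(tok, 0)] += 1
    return vec

def multi_hot_vector(text):
    # 1 if the token occurs at all, 0 otherwise (word order and counts are lost).
    return [1 if c > 0 else 0 for c in count_vector(text)]

def tf_idf_vectors(texts):
    counts = [count_vector(t) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts does each token appear?
    df = [sum(1 for row in counts if row[i] > 0) for i in range(len(bow_vocab))]
    # Smoothed inverse document frequency (illustrative variant).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[c * w for c, w in zip(row, idf)] for row in counts]

reviews = ["good movie good plot", "bad movie"]
counts_0 = count_vector(reviews[0])          # [0, 2, 0, 1, 1]
multi_hot_0 = multi_hot_vector(reviews[0])   # [0, 1, 0, 1, 1]
tfidf = tf_idf_vectors(reviews)
```

"movie" appears in both reviews, so its TF-IDF weight stays low, while "good" is both repeated and specific to the first review and gets a higher weight.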

In [2]:
# prepare the IMDB movie review dataset
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [5]:
!cat aclImdb/train/pos/122_9.txt

I had started to lose my faith in films of recent being inundated with the typical Genre Hollywood film. Story lines fail, and camera work is merely copied from the last film of similiar taste. But, then I saw Zentropa (Europa) and my faith was renewed. Not only is the metaphorical storyline enthralling but the use of color and black and white is visually stimulating. The narrator (Max Von Sydow) takes you through a spellbounding journey every step of the way and engrosses you into Europa 1945. We have all seen death put on screen in a hundred thousand ways but the beauty of this film is how it takes you through every slow-moving moment that leads you to death. Unlike many films it doesn't cut after one second of showing (for example) a knife but forces you to watch the devastating yet sensuous beauty of a man's final moments. I think we can all take something different away from what this movie is trying to say but it is definitely worth taking the time to find out what it all really 

In [7]:
# prepare a validation set: 20% from training data
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"

for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

In [8]:
import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.


Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [18]:
# display shapes and dtypes
for inputs, targets in train_ds:
    print(inputs.shape, inputs.dtype, targets.shape, targets.dtype, inputs[0], targets[0])
    break

(32,) <dtype: 'string'> (32,) <dtype: 'int32'> tf.Tensor(b"This is a fine drama and a nice change of pace from today's more hectic and loud films. It is another solid based-on-a-true store, which still means much of it could be made up for dramatic purposes. Frankly, I don't know but I liked the story.<br /><br />The story is about a young man back in the Fifties who gets interested in rocketry and wants to enter that field instead of working in the coal mines as everyone else, including his father, does in this West Virginia town. The big problem is the conflict it causes between the boy and his father, which I think was overdone. I would like to have a little less tension between the two.<br /><br />The young man, still a boy, is played by Jake Gyllenhaal, one of his first staring assignments, I think. He's likable, as are his school buddies in here. It's nice to see nice kids in a modern-day film. The two other key actors in the movie are Chris Cooper (the dad) and Laura Dern (the k

In [37]:
# processing words with bag-of-words approach
from keras import layers
text_vectorization = layers.TextVectorization(
    max_tokens=20_000,
    output_mode="multi_hot"
)
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

# batch_size is 32, so its shape will be (32, 20000)
bi_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y)
)
bi_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y)
)
bi_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y)
)
for inputs, targets in bi_1gram_train_ds:
    print(inputs.shape, inputs.dtype, targets.shape, targets.dtype, inputs[0], targets[0])
    break

(32, 20000) <dtype: 'int64'> (32,) <dtype: 'int32'> tf.Tensor([1 1 1 ... 0 0 0], shape=(20000,), dtype=int64) tf.Tensor(0, shape=(), dtype=int32)


In [38]:
# build the model
import keras
from keras import layers
def get_model(max_tokens=20_000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens, ))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
# `cache()` them in memory, do the preprocessing only once
model.fit(bi_1gram_train_ds.cache(),
          validation_data=bi_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(bi_1gram_test_ds)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.7725 - loss: 0.4827 - val_accuracy: 0.8890 - val_loss: 0.2953
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8964 - loss: 0.2824 - val_accuracy: 0.8936 - val_loss: 0.2945
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9125 - loss: 0.2452 - val_accuracy: 0.8942 - val_loss: 0.3077
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9211 - loss: 0.2284 - val_accuracy: 0.8930 - val_loss: 0.3240
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9276 - loss: 0.2197 - val_accuracy: 0.8924 - val_loss: 0.3458
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9309 - loss: 0.2087 - val_accuracy: 0.8904 - val_loss: 0.3566
Epoch 7/10
625/625