* What's the topic of this text? (text classification)
* Does this text contain abuse? (moderation)
* Does this text sound positive or negative? (sentiment analysis)
* What should be the next word in this incomplete sentence? (language modelling)
* How would you say this in Dutch? (translation)
* Produce a summary of this article in one paragraph. (summarization)

# What needs to be done to process text for neural networks?
* Standardize the text: convert to lowercase, remove punctuation.
* Split the text into units (tokens), such as characters, words, groups of words, clauses in sentences, etc.
* Convert all tokens to a tensor. This (typically) means indexing the tokens.
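
The three steps above can be sketched in plain Python. This is only an illustration: the function names and the convention of reserving index 0 for unknown tokens are assumptions made for this sketch, not part of any library.

```python
import string

def standardize(text):
    # Lowercase and drop ASCII punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    # Word-level tokenization: split on whitespace
    return text.split()

def build_vocabulary(token_lists):
    # Index 0 is reserved for unknown tokens (assumed convention for this sketch)
    vocab = {"[UNK]": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Replace every token by its index; unseen tokens map to [UNK]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

tokens = tokenize(standardize("The cat sat on the mat."))
vocab = build_vocabulary([tokens])
indices = encode(tokens, vocab)
print(tokens)   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(indices)  # [1, 2, 3, 4, 1, 5]
```

In practice the vocabulary is built from the training data only, and the resulting index lists are what gets turned into a tensor.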

### Example
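* Raw text: The cat sat on the mat.
* Standardized: the cat sat on the mat
* Tokens (with "the" dropped): ["cat", "sat", "on", "mat"]
* Token indices: [2, 34, 53, 8]
* One-hot encoding of the indices is also very common.
* Standardization can also strip diacritics: é -> e, È -> E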
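
Two details of the example can be made concrete with the standard library: stripping diacritics and one-hot encoding an index. The helper names below are hypothetical, and this is a sketch rather than what any particular framework does internally.

```python
import unicodedata

def strip_diacritics(text):
    # NFD splits "é" into "e" plus a combining accent; the combining marks are then dropped
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def one_hot(index, size):
    # A vector of zeros with a single 1 at the token's index
    vec = [0] * size
    vec[index] = 1
    return vec

print(strip_diacritics("é È"))  # e E
print(one_hot(2, 5))            # [0, 0, 1, 0, 0]
```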


# Three ways of handling tokens
## Word-level tokenization
Tokens are space-separated substrings (or punctuation-separated where appropriate). A variant also splits words into subwords, which is especially important for agglutinative and compounding languages such as Finnish or Swedish.
## N-gram tokenization
Tokens are groups of N consecutive words. For example, "The cat", "he was", "over there" -- these are 2-grams or "bigrams".
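
A minimal sketch of N-gram extraction with a sliding window over word tokens. The function name `ngrams` is an assumption for this illustration.

```python
def ngrams(tokens, n):
    # Slide a window of n consecutive tokens over the sequence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```

A sentence of k words yields k - n + 1 N-grams, so very short texts may produce none.
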
## Character-level tokenization
Each character is its own token. In practice, this is useful for languages with rich writing systems or pictographic writing (Cyrillic, Chinese).
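
To illustrate the trade-off, the following sketch compares character-level and word-level tokens for the same short string. The vocabulary is tiny but the sequences get much longer.

```python
text = "the cat sat"

char_tokens = list(text)               # every character, spaces included
word_tokens = text.split()             # whitespace-separated words
char_vocab = sorted(set(char_tokens))  # distinct characters

print(len(char_tokens), len(word_tokens))  # 11 3
print(char_vocab)
```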

Dataset to use: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [None]:
import os, pathlib, shutil, random

# Move 20% of the training files of each class into a separate validation directory.
base_dir = pathlib.Path("../Data/aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    # Sort before shuffling: os.listdir order is arbitrary, so the seed alone does not fix the split
    files = sorted(os.listdir(train_dir / category))
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname, val_dir / category / fname)
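
The split logic of the cell above can be tried out without touching the file system. The sketch below applies the same seeded shuffle to a list of made-up file names; `split_validation` is a hypothetical helper, not part of the notebook.

```python
import random

def split_validation(files, fraction=0.2, seed=1337):
    # Sort first so the result does not depend on the input order
    files = sorted(files)
    random.Random(seed).shuffle(files)
    num_val = int(fraction * len(files))
    split = len(files) - num_val
    # The last `num_val` shuffled files become the validation set
    return files[:split], files[split:]

names = [f"{i}.txt" for i in range(10)]
train_files, val_files = split_validation(names)
print(len(train_files), len(val_files))  # 8 2
```

Because the generator is seeded, calling the function again on the same names gives the same split.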

In [None]:
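import keras
batch_size = 32

# Class labels are inferred from the subdirectory names (neg, pos).
# If the archive's train/unsup directory (unlabeled reviews) is still present, remove it first,
# otherwise it is picked up as a third class.
train_ds = keras.utils.text_dataset_from_directory(train_dir, batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(val_dir, batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(base_dir / "test", batch_size=batch_size)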

In [None]:
for inputs, targets in train_ds:
    print(f"inputs: {inputs.shape}, {inputs.dtype}")
    print(f"targets: {targets.shape}, {targets.dtype}")
    break

In [None]:
from keras import layers

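# Multi-hot bag of words over the 20,000 most frequent tokens
text_vectorization = layers.TextVectorization(max_tokens=20000, output_mode="multi_hot")
# Build the vocabulary from the raw training texts only (labels dropped)
text_only_train_ds = train_ds.map(lambda x, _: x)
text_vectorization.adapt(text_only_train_ds)

# Vectorize every split with the vocabulary learned from the training set
binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))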
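
Conceptually, multi-hot encoding turns a text into one fixed-size vector with a 1 for every vocabulary entry that occurs in it. The following pure-Python sketch illustrates the idea on a toy vocabulary; the vocabulary and the convention of mapping unknown tokens to index 0 are assumptions for this example, and the code is not the layer's actual implementation.

```python
def multi_hot(tokens, vocab):
    # One slot per vocabulary entry; set to 1 if the token occurs at least once
    vec = [0] * len(vocab)
    for tok in tokens:
        vec[vocab.get(tok, 0)] = 1  # unknown tokens share index 0
    return vec

vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "mat": 4, "dog": 5}
encoded = multi_hot("the cat sat on the mat".split(), vocab)
shuffled = multi_hot("mat the on sat cat the".split(), vocab)
print(encoded)              # [1, 1, 1, 1, 1, 0]
print(encoded == shuffled)  # True
```

The second print shows the main limitation of this representation: word order and repetition are lost.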

In [None]:
def get_model(max_tokens=20000, hidden_dim=20):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In [None]:
model = get_model()
model.summary()
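
As a sanity check on the summary, the parameter count of this small network can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout adds no parameters. The helper name `dense_params` is an assumption for this sketch.

```python
def dense_params(n_inputs, n_units):
    # Weight matrix (n_inputs x n_units) plus one bias per unit
    return n_inputs * n_units + n_units

max_tokens, hidden_dim = 20000, 20
hidden = dense_params(max_tokens, hidden_dim)
output = dense_params(hidden_dim, 1)
total = hidden + output
print(hidden, output, total)  # 400020 21 400041
```

Almost all parameters sit in the first layer, which is why the vocabulary size matters so much for model size.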

In [None]:
# Keep only the model with the best validation loss seen so far
callbacks = [keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)]

model.fit(binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), epochs=10, callbacks=callbacks)

In [None]:
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

# TF-IDF

In [None]:
# ngrams=2 keeps unigrams and bigrams; token counts are weighted by TF-IDF
text_vectorization = layers.TextVectorization(ngrams=2, max_tokens=20000, output_mode="tf_idf")
text_vectorization.adapt(text_only_train_ds)
tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
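
TF-IDF weights a token's count in a document (term frequency) by how rare the token is across all documents (inverse document frequency). The sketch below uses one common smoothed variant, idf = log(1 + N / (1 + df)); the exact formula used by the layer may differ, and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(1 + n_docs / (1 + count)) for tok, count in df.items()}
    # Term frequency (raw count) times inverse document frequency, per document
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]

docs = [d.split() for d in ["great great movie", "terrible movie", "great acting"]]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

In the second document, "terrible" gets a higher weight than "movie" because it appears in fewer documents.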

In [None]:
model = get_model()
callbacks = [keras.callbacks.ModelCheckpoint("tfidf_2gram.keras", save_best_only=True)]
model.fit(tfidf_2gram_train_ds.cache(), validation_data=tfidf_2gram_val_ds.cache(), epochs=10, callbacks=callbacks)

In [None]:
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

In [None]:
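max_length = 600    # truncate or pad every review to 600 tokens
max_tokens = 20000

# Integer sequences: each review becomes a fixed-length vector of token indices
text_vectorization = layers.TextVectorization(max_tokens=max_tokens, output_mode="int", output_sequence_length=max_length)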
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
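
With integer output and a fixed sequence length, each text becomes a list of token indices that is truncated or padded to the same length. A minimal pure-Python sketch of that idea follows; the toy vocabulary and the convention of 0 for padding and 1 for out-of-vocabulary tokens are assumptions for this example.

```python
def to_int_sequence(tokens, vocab, length):
    # 0 = padding, 1 = out-of-vocabulary (convention assumed for this sketch)
    ids = [vocab.get(tok, 1) for tok in tokens][:length]  # truncate long inputs
    return ids + [0] * (length - len(ids))                # pad short inputs

vocab = {"the": 2, "cat": 3, "sat": 4}
short = to_int_sequence("the cat sat".split(), vocab, 5)
long = to_int_sequence("the cat sat on the mat".split(), vocab, 4)
print(short)  # [2, 3, 4, 0, 0]
print(long)   # [2, 3, 4, 1]
```

Unlike the bag-of-words encodings, this representation keeps word order, which is what a sequence model needs.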

In [None]:
import tensorflow as tf

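inputs = keras.Input(shape=(None,), dtype="int64")
# One-hot encode each token index. The Lambda wrapper is needed under Keras 3,
# where raw tf ops cannot be applied to symbolic Keras tensors.
embedded = layers.Lambda(lambda t: tf.one_hot(t, depth=max_tokens))(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)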
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
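
A quick back-of-the-envelope calculation shows how large the one-hot input to the LSTM becomes. It assumes 4-byte float32 values for the one-hot tensor and reuses `max_length`, `max_tokens` and `batch_size` from the cells above.

```python
max_length, max_tokens, batch_size = 600, 20000, 32
bytes_per_float32 = 4  # assumed dtype of the one-hot tensor

floats_per_sample = max_length * max_tokens
bytes_per_batch = floats_per_sample * batch_size * bytes_per_float32

print(floats_per_sample)      # 12000000
print(bytes_per_batch / 1e9)  # 1.536 (GB per batch)
```

Every review turns into 12 million values, nearly all of them zero, which suggests this model will be slow to train compared with the bag-of-words models above.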