**Questions when processing text**

- What is the topic of the text? (text classification)

- Does this text contain abuse? (filtering/moderation)

- Does this text sound positive or negative? (sentiment analysis)

- What should be the next word? (language modelling)

- How would you say it in another language? (translation)

- Produce a summary of this article in one paragraph. (summarization)

**What needs to be done to process a text for neural networks?**

- Standardizing; e.g. convert to lower case, remove punctuation

- Tokenization; split text into units (tokens), such as characters, words, groups of words, clauses in sentences, etc

- Vectorization; convert all tokens to a numeric tensor. This (typically) means mapping each token to an integer index first.

*Example*

The cat sat on the mat. -> the cat sat on the mat -> 'the', 'cat', 'sat', 'on', 'the', 'mat' -> [3, 2, 34, 53, 3, 8] (illustrative indices) -> vector encoding of the indices (one-hot encoding is very common)
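The three steps above can be sketched in plain Python. This is only an illustration, not how Keras implements them: the helper names (`standardize`, `tokenize`, `build_vocabulary`, `encode`, `one_hot`) are made up for this example, and reserving index 0 for padding and index 1 for unknown tokens is an assumption of the sketch. The actual index values depend on how the vocabulary is built.

```python
import string

def standardize(text):
    # Lowercase the text and strip ASCII punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    # Word-level tokenization: split on whitespace.
    return text.split()

def build_vocabulary(token_lists):
    # Assumption of this sketch: 0 is reserved for padding, 1 for unknown tokens.
    vocab = {"": 0, "[UNK]": 1}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Map each token to its index; unseen tokens map to the "[UNK]" index.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

def one_hot(indices, size):
    # One row per token, with a single 1 at the position of the token index.
    return [[1 if i == idx else 0 for i in range(size)] for idx in indices]

tokens = tokenize(standardize("The cat sat on the mat."))
vocab = build_vocabulary([tokens])
indices = encode(tokens, vocab)
vectors = one_hot(indices, len(vocab))

print(tokens)
print(indices)
print(vectors[0])
```

Both occurrences of 'the' receive the same index, and a word that was never seen when building the vocabulary falls back to the unknown-token index.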

**Three ways of handling tokens**

- *Word-level tokenization* - tokens are substrings separated by spaces (or by punctuation, where appropriate). A variant splits words further into subwords, which is especially important for agglutinative and compounding languages, such as Finnish or Swedish, where several words are often joined into one.

- *N-gram tokenization* - tokens are groups of N consecutive words. E.g., 'the cat', 'he was'... These are 'bigrams' ('2-grams')

- *Character-level tokenization* - each character is its own token. In practice, this is mostly useful for languages with large character sets or logographic writing (e.g. Chinese).
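The three tokenization schemes listed above can be compared with a short pure-Python sketch. The function names are hypothetical and the input is assumed to be already standardized.

```python
def word_tokens(text):
    # Word-level: split on whitespace.
    return text.split()

def ngram_tokens(text, n=2):
    # N-gram: every run of n consecutive words, joined by a space.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_tokens(text):
    # Character-level: every character (including spaces) is a token.
    return list(text)

sentence = "the cat sat on the mat"
print(word_tokens(sentence))
print(ngram_tokens(sentence, n=2))
print(char_tokens("cat"))
```

Note that this sketch returns only the bigrams, whereas `TextVectorization(ngrams=2)` in Keras should produce n-grams up to that size, i.e. unigrams as well as bigrams.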

In [2]:
import os, pathlib, shutil, random


In [7]:
base_dir = pathlib.Path('../../../Data/aclImdb')
val_dir = base_dir / 'val'
train_dir = base_dir / 'train'

In [None]:
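for category in ('neg', 'pos'):
    os.makedirs(val_dir / category)
    # Sort first: os.listdir order is OS-dependent, so the seeded shuffle
    # is only reproducible when it starts from a sorted list
    files = sorted(os.listdir(train_dir / category))
    random.Random(1337).shuffle(files)
    # Move the last 20% of the shuffled training files into the validation set
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname, val_dir / category / fname)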

In [9]:
import keras

In [None]:
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(train_dir, batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(val_dir, batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(base_dir / 'test', batch_size=batch_size)

In [None]:
for inputs, targets in train_ds:
    print('inputs:', inputs.shape, inputs.dtype)
    print('inputs[0]:', inputs[0])
    print('targets:', targets.shape, targets.dtype)
    break

In [12]:
from keras import layers

In [14]:
text_vectorization = layers.TextVectorization(max_tokens=20000, output_mode='multi_hot')
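
What `output_mode='multi_hot'` yields for a single review can be illustrated with a pure-Python sketch: one vector of vocabulary size with a 1 at the index of every token that occurs, regardless of order or count. The function name and the tiny vocabulary size below are assumptions made for illustration; this is not the Keras implementation.

```python
def multi_hot(token_indices, vocab_size):
    # One slot per vocabulary entry; set it to 1 if the token occurs at all.
    vector = [0] * vocab_size
    for idx in token_indices:
        vector[idx] = 1
    return vector

# Token indices of one short text; index 2 occurs twice but is only marked once.
encoded = multi_hot([2, 3, 4, 5, 2, 6], vocab_size=8)
print(encoded)
```

Word order and repetition are lost in this encoding, which is why it suits the bag-of-words dense model used here.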

In [None]:
text_only_train_ds = train_ds.map(lambda x, _: x)
text_vectorization.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)



In [None]:
binary_1gram_train_ds

In [15]:
def get_model(max_tokens=20000, hidden_dim=16):
    # Bag-of-words classifier: one dense hidden layer on top of a vector of size max_tokens
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation='relu')(inputs)
    x = layers.Dropout(0.5)(x)
    # Single sigmoid unit for binary (positive/negative) classification
    outputs = layers.Dense(1, activation='sigmoid')(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [16]:
model = get_model()
model.summary()

In [None]:
callbacks = [keras.callbacks.ModelCheckpoint('binary_1gram.keras', save_best_only=True)]
model.fit(binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), epochs=10, callbacks=callbacks)

In [None]:
model = keras.models.load_model('binary_1gram.keras')
print(f'Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}')

In [None]:
text_vectorization = layers.TextVectorization(ngrams=2, max_tokens=20000, output_mode='tf_idf')

text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [None]:
model = get_model()
callbacks = [keras.callbacks.ModelCheckpoint('tfidf_2gram.keras', save_best_only=True)]

model.fit(tfidf_2gram_train_ds.cache(), validation_data=tfidf_2gram_val_ds.cache(), epochs=10, callbacks=callbacks)

In [None]:
model = keras.models.load_model('tfidf_2gram.keras')

print(f'Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}')

In [18]:
# Encode reviews as integer sequences so that a sequence model (LSTM) can be used

max_length = 600

text_vectorization = layers.TextVectorization(max_tokens=20000, output_mode='int', output_sequence_length=max_length)
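
The effect of `output_sequence_length` can be sketched in pure Python: every sequence of token indices is cut or padded at the end so that all of them share one length. The helper name is hypothetical, and using 0 as the padding value is an assumption of the sketch (it matches the padding index Keras reserves).

```python
def pad_or_truncate(indices, length, pad_value=0):
    # Keep at most `length` indices, then fill the remainder with the padding value.
    kept = indices[:length]
    return kept + [pad_value] * (length - len(kept))

short_review = pad_or_truncate([2, 3, 4], length=5)
long_review = pad_or_truncate([2, 3, 4, 5, 2, 6], length=4)
print(short_review)
print(long_review)
```

With `max_length = 600`, reviews longer than 600 tokens lose their tail and shorter ones are filled with padding.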


In [None]:

text_vectorization.adapt(text_only_train_ds)

In [None]:
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [19]:
import tensorflow as tf

In [None]:
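inputs = keras.Input(shape=(None,), dtype='int64')
# One-hot encode each token index into a 20000-dimensional vector.
# Assumes Keras 3: keras.ops.one_hot accepts symbolic tensors, whereas calling
# tf.one_hot on them directly fails (with tf.keras 2.x, tf.one_hot(inputs, depth=20000) works).
embedded = keras.ops.one_hot(inputs, 20000)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])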