# Deep learning for text

## Preparing text data

- Vectorizing text is the process of transforming text into numeric tensors.
- Text vectorization processes come in many shapes and forms, but they all follow the same template:
    - First, we ***standardize the text*** to make it easier to process, such as by ***converting it to lowercase or removing punctuation***.
    - Then, we ***split the text into units (called tokens)***, such as characters, words, or groups of words. This is called ***tokenization***.
    - Finally, we ***convert each such token into a numerical vector***. This usually involves first indexing all tokens present in the data.

### Text standardization

- Text standardization is a basic form of feature engineering that aims to erase encoding differences that you don’t want your model to have to deal with.
- One of the simplest and most widespread standardization schemes is “convert to lowercase and remove punctuation characters.”
- Another common transformation is to convert special characters to a standard form, such as ***replacing “é” with “e,” “æ” with “ae,”*** and so on.
- A much more advanced standardization pattern, more rarely used in a machine learning context, is ***stemming***: converting variations of a term (such as different conjugated forms of a verb) into a single shared representation, like turning ***“caught” and “been catching” into “[catch]”*** or ***“cats” into “[cat]”***. With stemming, “was staring” and ***“stared” would become something like “[stare]”***, so two similar sentences that differ only in verb form, casing, and punctuation would end up with an identical encoding:
    - “sunset came i [stare] at the mexico sky isnt nature splendid”
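The simpler standardization steps above (lowercasing, accent folding, punctuation removal) can be sketched in plain Python. This is only an illustration: the function name `standardize` and the hand-written “æ” mapping are assumptions of this sketch, and stemming is deliberately left out.

```python
import string
import unicodedata

def standardize(text):
    """Lowercase, fold accents, and drop ASCII punctuation."""
    text = text.lower()
    # NFKD splits an accented character into a base letter plus combining
    # marks; dropping the combining marks turns "é" into "e".
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "æ" has no Unicode decomposition, so it is mapped by hand (assumption:
    # this single ligature is the only one we care about here).
    text = text.replace("æ", "ae")
    return "".join(ch for ch in text if ch not in string.punctuation)

print(standardize("Sunset came; I stared at the México sky. Isn't nature splendid?"))
```

Only ASCII punctuation is removed here, so typographic quotes or dashes would need extra handling.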

### Text splitting (tokenization)

- Once your text is standardized, you need to break it up into units to be vectorized (tokens), a step called ***tokenization***. You could do this in three different ways:
    - ***Word-level tokenization***—Where tokens are space-separated (or punctuation separated) substrings. A variant of this is to further split words into subwords when applicable—for instance, treating “staring” as “star+ing” or “called” as “call+ed.”
    - ***N-gram tokenization***—Where tokens are groups of N consecutive words. For instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).
    - ***Character-level tokenization***—Where each character is its own token. In practice, this scheme is rarely used, and you only really see it in specialized contexts, like text generation or speech recognition.
- There are two kinds of text-processing models:
    - those that ***care about word order***, called ***sequence models***, and
    - those that ***treat input words as a set, discarding their original order***, called ***bag-of-words* models**.
- If you’re building a ***sequence model***, you’ll use ***word-level tokenization***.
- If you’re building a ***bag-of-words model***, you’ll use ***N-gram tokenization***.
    - ***N-grams*** are a way to artificially inject a small amount of local word order information into the model.
    - The term ***“bag”*** here refers to the fact that you’re dealing with a ***set of tokens*** rather than a list or sequence: the tokens have no specific order.
    - ***Bag-of-words*** isn’t an order-preserving tokenization method (the tokens generated are understood *as a set, not a sequence*, and the general structure of the sentences is lost).
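The three tokenization schemes can be compared with a minimal sketch. The helper names (`word_tokens`, `ngram_tokens`, `char_tokens`) are hypothetical, and the input is assumed to be already standardized, whitespace-separated text.

```python
def word_tokens(text):
    # Word-level: split on whitespace
    return text.split()

def ngram_tokens(text, n=2):
    # N-gram: every group of n consecutive words
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_tokens(text):
    # Character-level: each character is its own token
    return list(text)

sentence = "the cat sat on the mat"
print(word_tokens(sentence))
print(ngram_tokens(sentence, n=2))
print(char_tokens("cat"))

# A bag-of-words model sees the tokens as a set, so order is discarded
print(set(ngram_tokens(sentence, n=2)))
```

Each bigram keeps two words in their original order, which is how N-grams carry a little local order information even when the tokens are later treated as a set.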

### Vocabulary indexing

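- Once your text is split into tokens, you need to ***encode each token into a numerical representation***. In practice, you build an index of all terms found in the training data (the “vocabulary”) and assign a unique integer to each entry.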
- **Important detail**:
    - When we look up a new token in our vocabulary index, it may not necessarily exist. Your training data may not have contained any instance of the word “cherimoya” (or maybe you excluded it from your index because it was too rare), so ***doing token_index = vocabulary["cherimoya"] may result in a KeyError***.
    - To handle this, you should use an ***“out of vocabulary” index (abbreviated as OOV index)***—*a catch-all for any token that wasn’t in the index*.
        - It’s usually index 1: you’re actually doing ***token_index = vocabulary.get(token, 1)***.
        - When ***decoding a sequence of integers back into words***, you’ll ***replace 1 with something like “[UNK]” (which you’d call an “OOV token”)***.
    - Index 0 is reserved for a second special token, the ***mask token***. While the ***OOV token means*** “here was a word we did not recognize,” the mask token tells us “ignore me, I’m not a word.”
        - You use it in particular to ***pad sequence data***: because data batches need to be ***contiguous***,
            - all sequences in a batch of sequence data must have the same length,
            - so shorter sequences should be padded to the length of the longest sequence.
            - If you want to make a batch of data with the sequences [5, 7, 124, 4, 89] and [8, 34, 21], it would have to look like this:
                - [[5, 7, 124, 4, 89], [8, 34, 21, 0, 0]]
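A minimal sketch of both special indices, assuming index 0 is used for padding and index 1 for out-of-vocabulary tokens as described above. The names `encode`, `pad_batch`, and the toy vocabulary are made up for illustration.

```python
MASK_INDEX = 0  # padding: "ignore me, I'm not a word"
OOV_INDEX = 1   # catch-all for tokens missing from the vocabulary

def encode(tokens, vocabulary):
    # dict.get avoids the KeyError that vocabulary[token] would raise
    return [vocabulary.get(token, OOV_INDEX) for token in tokens]

def pad_batch(sequences):
    # Pad every sequence with the mask index up to the longest length
    max_len = max(len(seq) for seq in sequences)
    return [seq + [MASK_INDEX] * (max_len - len(seq)) for seq in sequences]

vocabulary = {"": MASK_INDEX, "[UNK]": OOV_INDEX, "the": 2, "cat": 3}
print(encode(["the", "cherimoya", "cat"], vocabulary))
print(pad_batch([[5, 7, 124, 4, 89], [8, 34, 21]]))
```

Padding at the end of each sequence is a choice made for this sketch; padding at the start is equally possible.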

### Using the TextVectorization layer

First, a simple vectorizer written in pure Python:

In [1]:
import string

class Vectorizer:
    def standardize(self, text):
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)

    def tokenize(self, text):
        # Expects text that has already been standardized by the caller
        return text.split()

    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())

    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        return " ".join(
            self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()
dataset = [
    "I know, understand, comprehend",
    "forget again, and then",
    "A buliding blooms.",
]
vectorizer.make_vocabulary(dataset)

In [2]:
test_sentence = "I know, understand, comprehend, and still forget again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

[2, 3, 4, 5, 8, 1, 6, 7]


In [3]:
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

i know understand comprehend and [UNK] forget again


In [4]:
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    # Configures the layer to return sequences of words encoded as integer indices.
    output_mode="int",
)

In [5]:
# the default keras TextVectorization layer behavior is equivalent to python code below
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    # convert strings to lowercase
    lowercase_string = tf.strings.lower(string_tensor)
    # replace punctuation characters with the empty string
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
    # split strings on whitespace
    return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)

In [6]:
dataset = [
    "I know, understand, comprehend",
    "forget again, and then",
    "A buliding blooms.",
]
# adapt() method - to index the vocabulary of a text corpus
text_vectorization.adapt(dataset)

**Displaying the vocabulary**

In [7]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'understand',
 'then',
 'know',
 'i',
 'forget',
 'comprehend',
 'buliding',
 'blooms',
 'and',
 'again',
 'a']

In [8]:
# Retrieve the computed vocabulary via get_vocabulary()—this can be useful
# if you need to convert text encoded as integer sequences back into words
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I know, understand, comprehend, and still forget again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 5  4  2  7 10  1  6 11], shape=(8,), dtype=int64)


In [9]:
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

i know understand comprehend and [UNK] forget again


## Two approaches for representing groups of words: Sets and sequences

### Preparing the IMDB movie reviews data

In [10]:
# download the file
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  5788k      0  0:00:14  0:00:14 --:--:-- 12.1M


In [11]:
# The train/unsup directory is not needed here, so delete it
!rm -r aclImdb/train/unsup

In [12]:
# Take a look at the content of a few of these text files
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

In [13]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    # Shuffle the list of training files
    random.Random(1337).shuffle(files)
    # Take 20% of the training files to use for validation
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    # Move the files to aclImdb/val/neg and aclImdb/val/pos.
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

In [14]:
from tensorflow import keras
batch_size = 32

# Running this line should output “Found 20000 files belonging to 2 classes”
train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


**Displaying the shapes and dtypes of the first batch**

In [15]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'Dr. McCoy and Mr. Spock find themselves trapped in a planet\'s past Ice Age, while Capt. Kirk is in the same planet\'s colonial period. However, it\'s the former pair that has the most trying time. Besides the freezing temperatures and sanctuary to be found only in caves, there is a third inhabitant, the beautiful and so sexy Zarabeth (Mariette Hartley). As Spock spends more time in this era, he slowly begins to revert to the behavioral patterns of his ancestors, feeling a natural attraction to Zarabeth and throwing "caution to the wind" about ever leaving this place. Only with Dr. McCoy\'s constant "reminders" does Spock hold on to some grasp of reality.<br /><br />This stand as one of the few times when the character gets to show some "emotion" and Nimoy (Spock) plays it to the hilt, coming close to knocking the bejesus out of Deforest Kelly (McCoy). Surpris

### Processing words as a set: The bag-of-words approach

- The simplest way to encode a piece of text for processing by a machine learning model is to discard order and treat it as a set (a “bag”) of tokens.
    - You could either look at individual words (unigrams),
    - or try to recover some ***local order information*** by looking at groups of consecutive tokens (N-grams).

#### Single words (unigrams) with binary encoding

- The main advantage of this encoding is that you can represent an entire text as a single vector, where each entry is a presence indicator for a given word.
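To make the presence-indicator idea concrete, here is a small pure-Python sketch of multi-hot encoding. The function `multi_hot` and the toy vocabulary are hypothetical; the index layout follows the pure-Python vectorizer in these notes (index 1 for unknown words), and the Keras layer's internal layout in this mode may differ.

```python
def multi_hot(tokens, vocabulary):
    # One slot per vocabulary entry; 1.0 means "this token occurs in the text"
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        vector[vocabulary.get(token, 1)] = 1.0
    return vector

vocabulary = {"": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4, "mat": 5}
print(multi_hot("the cat sat on the mat".split(), vocabulary))
```

The word “the” occurs twice but still contributes a single 1.0, and “on” falls into the unknown slot: the vector records presence only, not counts or order.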

**Preprocessing our datasets with a `TextVectorization` layer**

In [16]:
text_vectorization = TextVectorization(
    # Limit the vocabulary to the 20,000 most frequent words
    # As a rule of thumb, 20,000 is a reasonable vocabulary size for text classification
    max_tokens=20000,
    # Encode the output tokens as multi-hot binary vectors
    output_mode="multi_hot",
)

# Prepare a dataset that only yields raw text inputs (no labels).
# (lambda x, y: x) - lambda function that takes two arguments, x and y, and returns x(inputs).
text_only_train_ds = train_ds.map(lambda x, y: x)
# Use that dataset to index the dataset vocabulary via the adapt() method
text_vectorization.adapt(text_only_train_ds)

# Prepare processed versions of our training, validation, and test dataset.
# Specify num_parallel_calls to leverage multiple CPU cores.
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

**Inspecting the output of our binary unigram dataset**

In [17]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


**Our model-building utility**

In [18]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

**Training and testing the binary unigram model**

In [19]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

# cache()- We call cache() on the datasets to cache them in memory:
# this way, we will only do the preprocessing once, during the first epoch,
# and we’ll reuse the preprocessed texts for the following epochs.
# This can only be done if the data is small enough to fit in memory.

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.886


This model reaches a test accuracy of 88.6%.

#### Bigrams with binary encoding

- Discarding word order is very reductive, because a single concept can be expressed via multiple words: the term **“United States”** conveys a meaning quite distinct from that of the words **“states” and “united”** taken separately.
- For this reason, you will usually end up re-injecting local order information into the bag-of-words representation by looking at N-grams rather than single words (most commonly, bigrams).

**Configuring the `TextVectorization` layer to return bigrams**

- The **TextVectorization** layer can be configured to return arbitrary N-grams: bigrams, trigrams, etc.
- Just pass an **ngrams=N** argument.

In [20]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

**Training and testing the binary bigram model**

In [21]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.898


- Test accuracy increases to 89.8%.
- This suggests that local word order matters.

#### Bigrams with TF-IDF encoding

**TF-IDF** stands for **“term frequency, inverse document frequency.”**

- You can add a bit more information to this representation by ***counting how many times each word or N-gram occurs***, that is, by taking the histogram of the words over the text.
- Counting unigram and bigram occurrences in “the cat sat on the mat” gives:
    
    **{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
    "sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}**
    
- TF-IDF is a metric that fuses two ideas: the more often a term appears in a document, the more it matters for that document; but ***terms that appear in almost every document (like “the” or “a”) aren’t particularly informative, while terms that appear only in a small subset of all texts (like “Herzog”) are very distinctive, and thus important***.
- **TF-IDF normalization** weights a given term by taking ***“term frequency,”*** how many times the term appears in the current document, and dividing it by a measure of ***“document frequency,”*** which estimates how often the term comes up across the dataset.
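The counting and weighting steps can be sketched in plain Python. This is one simple TF-IDF variant, assuming the document-frequency measure is the log of the term's total count across the dataset plus one; the `tf_idf` output mode of `TextVectorization` may use a different formula. The helper names and the three toy documents are made up, and `document` is assumed to be one of the entries of `dataset`.

```python
import math
from collections import Counter

def unigrams_and_bigrams(text):
    words = text.split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def tfidf(term, document, dataset):
    # Term frequency: occurrences of the term in this document
    term_freq = document.count(term)
    if term_freq == 0:
        return 0.0
    # Document-frequency measure: grows with how common the term is overall
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq

# Histogram of unigrams and bigrams
print(Counter(unigrams_and_bigrams("the cat sat on the mat")))

# Toy dataset of tokenized documents (hypothetical)
docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "herzog", "film"]]
print(tfidf("the", docs[0], docs))     # common term, lower weight
print(tfidf("herzog", docs[2], docs))  # rare term, higher weight
```

In the toy dataset, “the” appears in every document and “herzog” in only one, so “herzog” gets the larger weight despite both occurring once in their document.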

**Configuring the `TextVectorization` layer to return token counts**

In [22]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

**Configuring `TextVectorization` to return TF-IDF-weighted outputs**

In [23]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)

**Training and testing the TF-IDF bigram model**

In [24]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
# Evaluate once and reuse the result for both messages
test_acc = model.evaluate(tfidf_2gram_test_ds)[1]
print(f"Test acc: {test_acc:.3f}")
print(f"The model gets: {test_acc:.3f} test accuracy on the IMDB classification task")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.883
The model gets: 0.883 test a

**Model Inference**

In [25]:
inputs = keras.Input(shape=(1,), dtype="string")
processed_inputs = text_vectorization(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)

In [31]:
import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I love it."],
])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")

74.44 percent positive
