### Deep learning for text
#### Preparing text data
Deep learning models, being differentiable functions, can only process numeric tensors: they can’t take raw text as input. **Vectorizing** text is the process of transforming text into numeric tensors. **Text vectorization** processes come in many shapes and forms, but they all follow the same template (see figure 11.1):
- First, you **standardize** the text to make it easier to process, such as by **converting it to lowercase** or **removing punctuation**.
- You split the text into units (called **tokens**), such as characters, words, or groups of words. 
  - This is called **tokenization**.
- You convert each such token into a numerical vector. 
  - This will usually involve first **indexing** all tokens present in the data.

Let’s review each of these steps.

![](./images/11.1.png)

##### Text standardization
Consider these two sentences:
- “sunset came. i was staring at the Mexico sky. Isnt nature splendid??”
- “Sunset came; I stared at the México sky. Isn’t nature splendid?”

They’re very similar—in fact, they’re almost identical. Yet, if you were to convert them to byte strings, they would end up with very different representations, because “i” and “I” are two different characters, “Mexico” and “México” are two different words, “isnt” isn’t “isn’t,” and so on. A machine learning model doesn’t know a priori that “i” and “I” are the same letter, that “é” is an “e” with an accent, or that “staring” and “stared” are two forms of the same verb. <br>
Text standardization is a basic form of feature engineering that aims to erase encoding differences that you don’t want your model to have to deal with. It’s not exclusive to machine learning, either—you’d have to do the same thing if you were building a search engine. <br>
One of the simplest and most widespread standardization schemes is “**convert to lowercase and remove punctuation characters**.” Our two sentences would become
- “sunset came i was staring at the mexico sky isnt nature splendid”
- “sunset came i stared at the méxico sky isnt nature splendid”

Much closer already. Another common transformation is to **convert special characters to a standard form**, such as replacing “é” with “e,” “æ” with “ae,” and so on. Our token “méxico” would then become “mexico”. <br>
Lastly, a much more advanced standardization pattern that is more rarely used in a machine learning context is **stemming**: converting variations of a term (such as different conjugated forms of a verb) into a single shared representation, like turning “caught” and “been catching” into “[catch]” or “cats” into “[cat]”. With stemming, “was staring” and “stared” would become something like “[stare]”, and our two similar sentences would finally end up with an identical encoding:
- “sunset came i [stare] at the mexico sky isnt nature splendid”

With these standardization techniques, your model will require less training data and will generalize better—it won’t need abundant examples of both “Sunset” and “sunset” to learn that they mean the same thing, and it will be able to make sense of “México” even if it has only seen “mexico” in its training set. Of course, standardization may also erase some amount of information, so always keep the context in mind: for instance, if you’re writing a model that extracts questions from interview articles, it should definitely treat “?” as a separate token instead of dropping it, because it’s a useful signal for this specific task.
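As a rough illustration, the first two schemes (lowercasing plus punctuation removal, and accent stripping) can be sketched in plain Python with the standard `unicodedata` module. The function name `standardize` is only illustrative, and the approach is an assumption rather than a canonical recipe: it drops every character in a Unicode punctuation category and every combining accent left over after NFKD decomposition, so it handles “é” but would not expand ligatures such as “æ”. Stemming is deliberately left out.

```python
import unicodedata

def standardize(text):
    # Lowercase first so that "I" and "i" map to the same character.
    text = text.lower()
    # Decompose accented characters: "é" becomes "e" + a combining accent.
    text = unicodedata.normalize("NFKD", text)
    kept = []
    for char in text:
        category = unicodedata.category(char)
        if category == "Mn":
            continue  # Drop combining marks (the detached accents).
        if category.startswith("P"):
            continue  # Drop punctuation, including curly apostrophes.
        kept.append(char)
    return "".join(kept)

print(standardize("Sunset came; I stared at the México sky. Isn’t nature splendid?"))
# sunset came i stared at the mexico sky isnt nature splendid
```

With this sketch, the two example sentences differ only in “was staring” versus “stared”, which is exactly the gap that stemming would close.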

##### Text splitting (tokenization)
Once your text is standardized, you need to break it up into units to be vectorized (tokens), a step called **tokenization**. You could do this in three different ways:
- **Word-level tokenization**—Where tokens are space-separated (or punctuation-separated) substrings. 
  - A variant of this is to further split words into subwords when applicable—for instance, treating “staring” as “star+ing” or “called” as “call+ed.”
- **N-gram tokenization**—Where tokens are groups of N consecutive words. 
  - For instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).
- **Character-level tokenization**—Where each character is its own token. 
  - In practice, this scheme is rarely used, and you only really see it in specialized contexts, like text generation or speech recognition.
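The three schemes above can be sketched in a few lines of plain Python. The helper names (`word_tokenize`, `word_ngrams`, `char_tokenize`) are hypothetical, and the sketch assumes the text has already been standardized so that splitting on whitespace is enough.

```python
def word_tokenize(text):
    # Word-level: tokens are whitespace-separated substrings.
    return text.split()

def word_ngrams(tokens, n):
    # N-gram: each token is a group of n consecutive words, in order.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_tokenize(text):
    # Character-level: every character (spaces included) is its own token.
    return list(text)

words = word_tokenize("he was here")
print(words)                  # ['he', 'was', 'here']
print(word_ngrams(words, 2))  # ['he was', 'was here']
print(char_tokenize("cat"))   # ['c', 'a', 't']
```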

In general, you’ll always use either **word-level** or **N-gram tokenization**. There are two kinds of text-processing models: those that care about word order, called **sequence models**, and those that treat input words as a set, discarding their original order, called **bag-of-words** models. If you’re building a **sequence model**, you’ll use **word-level tokenization**, and if you’re building a **bag-of-words model**, you’ll use **N-gram tokenization**. <br>
N-grams are a way to artificially inject a small amount of local word order information into the model. Throughout this chapter, you’ll learn more about each type of model and when to use them.

##### Understanding N-grams and bag-of-words
**Word N-grams** are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words.<br>
Here’s a simple example. Consider the sentence “the cat sat on the mat.” It may be decomposed into the following set of 2-grams: 
```python
{"the", "the cat", "cat", "cat sat", "sat",
"sat on", "on", "on the", "the mat", "mat"}
```
It may also be decomposed into the following set of 3-grams:
```python
{"the", "the cat", "cat", "cat sat", "the cat sat",
"sat", "sat on", "on", "cat sat on", "on the",
"sat on the", "the mat", "mat", "on the mat"}
```
Such a set is called a bag-of-2-grams or bag-of-3-grams, respectively. The term “bag” here refers to the fact that you’re dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called **bag-of-words** (or **bag-of-N-grams**).
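A minimal sketch of how such a bag could be built in plain Python follows. The function name `bag_of_ngrams` is an illustrative assumption; it collects every group of `max_n` or fewer consecutive words into a set, which is why order and duplicates disappear.

```python
def bag_of_ngrams(text, max_n):
    words = text.split()
    bag = set()
    # Collect 1-grams, 2-grams, ... up to max_n-grams.
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            bag.add(" ".join(words[start:start + n]))
    return bag

print(sorted(bag_of_ngrams("the cat sat on the mat", 2)))
```

Calling it with `max_n=2` or `max_n=3` on “the cat sat on the mat” yields the two sets listed above.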

Because **bag-of-words isn’t an order-preserving tokenization method** (the tokens generated are understood as a set, not a sequence, and the general structure of the sentences is lost), **it tends to be used in shallow language-processing models** rather than in deep learning models. Extracting N-grams is a form of feature engineering, and deep learning sequence models do away with this manual approach, replacing it with hierarchical feature learning. One-dimensional convnets, recurrent neural networks, and Transformers are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups, by looking at continuous word or character sequences.

##### Vocabulary indexing
Once your text is split into tokens, you need to **encode each token into a numerical representation**. You could potentially do this in a stateless way, such as by hashing each token into a fixed binary vector, but in practice, the way you’d go about it is to build an index of all terms found in the training data (the “**vocabulary**”), and assign a unique integer to each entry in the vocabulary. <br>
Something like this:
```python
vocabulary = {}
for text in dataset:
    text = standardize(text)
    tokens = tokenize(text)
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
```
You can then convert that integer into a vector encoding that can be processed by a neural network, like a one-hot vector:
```python
import numpy as np

def one_hot_encode_token(token):
    vector = np.zeros((len(vocabulary),))
    token_index = vocabulary[token]
    vector[token_index] = 1
    return vector
```
Note that at this step it’s common to restrict the vocabulary to only the top 20,000 or 30,000 most common words found in the training data. Any text dataset tends to feature an extremely large number of unique terms, most of which only show up once or twice—indexing those rare terms would result in an excessively large feature space, where most features would have almost no information content. <br>
Remember when you were training your first deep learning models on the IMDB dataset in chapters 4 and 5? The data you were using from **keras.datasets.imdb** was already preprocessed into sequences of integers, where each integer stood for a given word. Back then, we used the setting **num_words=10000**, in order to restrict our vocabulary to the **top 10,000 most common words found in the training data**. <br>
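Restricting the vocabulary to the most common terms can be sketched with `collections.Counter`. This is a hedged illustration, not how `keras.datasets.imdb` does it internally: the function name `build_vocabulary` and the toy data are assumptions, and the indices simply follow frequency rank.

```python
from collections import Counter

def build_vocabulary(tokenized_texts, max_tokens):
    # Count every token across the whole training set.
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    # Keep only the max_tokens most frequent tokens, indexed by frequency rank.
    return {
        token: index
        for index, (token, _) in enumerate(counts.most_common(max_tokens))
    }

tokenized_texts = [["the", "cat", "the"], ["the", "cat", "mat"]]
print(build_vocabulary(tokenized_texts, max_tokens=2))  # {'the': 0, 'cat': 1}
```

The rare token “mat” never receives an index, so it cannot inflate the feature space.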
Now, there’s an important detail here that we shouldn’t overlook: when we look up a new token in our vocabulary index, it may not necessarily exist. Your training data may not have contained any instance of the word “cherimoya” (or maybe you excluded it from your index because it was too rare), so doing **token_index = vocabulary["cherimoya"]** may result in a KeyError. To handle this, you should use an **“out of vocabulary” index** (abbreviated as OOV index)—a catch-all for any token that wasn’t in the index. **It’s usually index 1**: you’re actually doing **token_index = vocabulary.get(token, 1)**. **When decoding a sequence of integers back into words, you’ll replace 1 with something like “[UNK]” (which you’d call an “OOV token”)**. <br>
“Why use 1 and not 0?” you may ask. That’s because 0 is already taken. There are two special tokens that you will commonly use: the **OOV token (index 1)** and the **mask token (index 0)**:
- **The OOV token means “here was a word we did not recognize.”**
- **The mask token tells us “ignore me, I’m not a word.”**

You’d use the mask token in particular to pad sequence data: because data batches need to be contiguous, **all sequences in a batch of sequence data must have the same length, so shorter sequences should be padded to the length of the longest sequence**. If you want to make a batch of data with the sequences [5, 7, 124, 4, 89] and [8, 34, 21], it would have to look like this:
```python
[[5, 7, 124, 4, 89],
 [8, 34, 21, 0, 0]]
```
The batches of integer sequences for the IMDB dataset that you worked with in chapters 4 and 5 were padded with zeros in this way.
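The two special indices can be put together in a short plain-Python sketch. The names `encode`, `pad_batch`, `MASK_INDEX`, and `OOV_INDEX` are hypothetical, as is the tiny vocabulary; the point is only to show the OOV fallback and zero-padding side by side.

```python
MASK_INDEX = 0  # "Ignore me, I'm not a word": used for padding.
OOV_INDEX = 1   # "Here was a word we did not recognize."

def encode(tokens, vocabulary):
    # Unknown tokens fall back to the OOV index instead of raising KeyError.
    return [vocabulary.get(token, OOV_INDEX) for token in tokens]

def pad_batch(sequences):
    # Pad every sequence with the mask index up to the longest length.
    longest = max(len(sequence) for sequence in sequences)
    return [
        sequence + [MASK_INDEX] * (longest - len(sequence))
        for sequence in sequences
    ]

vocabulary = {"": 0, "[UNK]": 1, "sunset": 2, "came": 3}
print(encode(["sunset", "cherimoya", "came"], vocabulary))  # [2, 1, 3]
print(pad_batch([[5, 7, 124, 4, 89], [8, 34, 21]]))
# [[5, 7, 124, 4, 89], [8, 34, 21, 0, 0]]
```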

##### Using the TextVectorization layer
Every step I’ve introduced so far would be very easy to implement in pure Python. <br>
Maybe you could write something like this:

In [1]:
import string

class Vectorizer:
    def standardize(self, text):
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)
    
    def tokenize(self, text):
        text = self.standardize(text)
        return text.split()
    
    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())
    
    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]
    
    def decode(self, int_sequence):
        return " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vectorizer.make_vocabulary(dataset)

It does the job:

In [2]:
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

[2, 3, 5, 7, 1, 5, 6]


In [3]:
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


However, using something like this wouldn’t be very performant. In practice, you’ll work with the Keras **TextVectorization** layer, which is fast and efficient and can be dropped directly into a **tf.data** pipeline or a Keras model. <br>
This is what the TextVectorization layer looks like:

In [4]:
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    # Configures the layer to return sequences of words encoded as integer indices.
    # There are several other output modes available, which you will see in action in a bit.
    output_mode="int",
)

By default, the **TextVectorization** layer will use the setting **“convert to lowercase and remove punctuation” for text standardization**, and **“split on whitespace” for tokenization**. But importantly, you can provide custom functions for standardization and tokenization, which means the layer is flexible enough to handle any use case. Note that such custom functions should operate on **tf.string** tensors, not regular Python strings! <br>
For instance, the default layer behavior is equivalent to the following:

In [5]:
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    lowercase_string = tf.strings.lower(string_tensor) # Convert strings to lowercase.
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", "") # Replace punctuation characters with the empty string.

def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor) # Split strings on whitespace.

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn)

To index the vocabulary of a text corpus, just call the **adapt()** method of the layer with a **Dataset** object that yields strings, or just with a list of Python strings:

In [6]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)

Note that you can retrieve the computed vocabulary via **get_vocabulary()**—this can be useful if you need to convert text encoded as integer sequences back into words.<br>
The first two entries in the vocabulary are the mask token (index 0) and the OOV token (index 1). Entries in the vocabulary list are sorted by frequency, so with a real-world dataset, very common words like “the” or “a” would come first.

##### Displaying the vocabulary

In [7]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

For a demonstration, let’s try to encode and then decode an example sentence:

In [8]:
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)


In [9]:
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


You’ve now learned everything you need to know about text preprocessing—let’s move on to the modeling stage.

#### Two approaches for representing groups of words: Sets and sequences
How a machine learning model should represent individual words is a relatively uncontroversial question: they’re categorical features (values from a predefined set), and we know how to handle those. They should be encoded as dimensions in a feature space, or as category vectors (word vectors in this case). A much more problematic question, however, is how to encode the way words are woven into sentences: word order. <br>
The problem of order in natural language is an interesting one: unlike the steps of a timeseries, words in a sentence don’t have a natural, canonical order. Different languages order similar words in very different ways. For instance, the sentence structure of English is quite different from that of Japanese. Even within a given language, you can typically say the same thing in different ways by reshuffling the words a bit. Even further, if you fully randomize the words in a short sentence, you can still largely figure out what it was saying—though in many cases significant ambiguity seems to arise. Order is clearly important, but its relationship to meaning isn’t straightforward. <br>
How to represent word order is the pivotal question from which different kinds of NLP architectures spring. 
- The simplest thing you could do is just discard order and treat text as an unordered set of words—this gives you **bag-of-words models**. 
- You could also decide that words should be processed strictly in the order in which they appear, one at a time, like steps in a timeseries—you could then leverage the **recurrent models** from the last chapter. 
- Finally, a hybrid approach is also possible: the **Transformer** architecture is technically order-agnostic, yet it injects word-position information into the representations it processes, which enables it to simultaneously look at different parts of a sentence (unlike RNNs) while still being order-aware. Because they take into account word order, both **RNNs** and **Transformers** are called **sequence models**.
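A tiny sketch makes the difference between the set view and the sequence view concrete. The two example sentences are an assumption chosen for illustration: they contain the same words, so a bag-of-words model cannot tell them apart, while a sequence model sees two different inputs.

```python
sentence_a = "dog bites man".split()
sentence_b = "man bites dog".split()

# Bag-of-words view: order is discarded, so both sentences look identical.
bag_a, bag_b = set(sentence_a), set(sentence_b)
print(bag_a == bag_b)            # True

# Sequence view: order is kept, so the sentences differ.
print(sentence_a == sentence_b)  # False
```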

Historically, most early applications of machine learning to NLP just involved **bag-of-words models**. Interest in **sequence models** only started rising in 2015, with the rebirth of **recurrent neural networks**. Today, both approaches remain relevant. Let’s see how they work, and when to leverage which.

We’ll demonstrate each approach on a well-known text classification benchmark: the IMDB movie review sentiment-classification dataset. In chapters 4 and 5, you worked with a prevectorized version of the IMDB dataset; now, let’s process the raw IMDB text data, just like you would do when approaching a new text-classification problem in the real world.

##### Preparing the IMDB movie reviews data
Let’s start by downloading the dataset from the Stanford page of Andrew Maas and uncompressing it:

In [13]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [14]:
!tar -xf aclImdb_v1.tar.gz

You’re left with a directory named aclImdb, with the following structure:
```text
aclImdb/
...train/
......pos/
......neg/
...test/
......pos/
......neg/
```

For instance, the train/pos/ directory contains a set of 12,500 text files, each of which contains the text body of a positive-sentiment movie review to be used as training data. <br>
The negative-sentiment reviews live in the “neg” directories. In total, there are 25,000 text files for training and another 25,000 for testing. <br>
There’s also a train/unsup subdirectory in there, which we don’t need. Let’s delete it:
```python
!rm -r aclImdb/train/unsup
```

Let’s prepare a validation set by setting apart 20% of the training text files in a new directory, aclImdb/val:

In [20]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files) # Shuffle the list of training files using a seed, to ensure we get the same validation set every time we run the code.
    # Take 20% of the training files to use for validation.
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    # Move the files to aclImdb/val/neg and aclImdb/val/pos.
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

Remember how, in chapter 8, we used the **image_dataset_from_directory** utility to create a batched Dataset of images and their labels for a directory structure? You can do the exact same thing for text files using the **text_dataset_from_directory** utility. <br>
Let’s create three **Dataset** objects for training, validation, and testing:

In [21]:
from tensorflow import keras

batch_size = 32

train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


These datasets yield inputs that are TensorFlow **tf.string** tensors and targets that are int32 tensors encoding the value “0” or “1.”
##### Displaying the shapes and dtypes of the first batch

In [22]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'No, this wasn\'t one of the ten worst films of the 1980\'s, but it certainly skirts the bottom 100 somewhere. This movie looks like it was put on the shelf for two or three years and then released in 1981. How else would you explain special effects pre-dating "An American Werewolf in London," disco still being considered cool, and Ronald Reagan not being the 40th President of the United States? While we\'re at it, let\'s not overlook those 1970\'s hairstyles in the 1950\'s and \'60\'s. I\'ve seen more of that here than in "Happy Days" & "Laverne & Shirley" combined.<br /><br />The one woman who elevates this movie to the "so bad, it\'s good" category was the late, great Elizabeth Hartman, but just barely. Biff plays as Miss Montgomery, the mousey high school teacher who becomes a sexpot, a stereotype that\'s been done to death and is still being churned out by

All set. Now let’s try learning something from this data.

##### Processing words as a set: The bag-of-words approach
The simplest way to encode a piece of text for processing by a machine learning model is to discard order and treat it as a set (a “bag”) of tokens. You could either look at individual words (unigrams), or try to recover some local order information by looking at groups of consecutive tokens (N-grams).

##### SINGLE WORDS (UNIGRAMS) WITH BINARY ENCODING
If you use a bag of single words, the sentence “the cat sat on the mat” becomes
```python
{"cat", "mat", "on", "sat", "the"}
```
The main advantage of this encoding is that you can represent an entire text as a single vector, where each entry is a presence indicator for a given word. For instance, using binary encoding (multi-hot), you’d encode a text as a vector with as many dimensions as there are words in your vocabulary—with 0s almost everywhere and some 1s for dimensions that encode words present in the text. This is what we did when we worked with text data in chapters 4 and 5. Let’s try this on our task.
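Before handing this over to Keras, here is a minimal plain-Python sketch of multi-hot encoding. The function name `multi_hot_encode` and the six-word vocabulary are assumptions for illustration, and words missing from the vocabulary are simply skipped in this sketch.

```python
def multi_hot_encode(tokens, vocabulary):
    # One dimension per vocabulary entry, all zeros to start with.
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            # Presence indicator: repeated words still yield a single 1.
            vector[index] = 1.0
    return vector

vocabulary = {"cat": 0, "dog": 1, "mat": 2, "on": 3, "sat": 4, "the": 5}
print(multi_hot_encode("the cat sat on the mat".split(), vocabulary))
# [1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

Only the “dog” dimension stays at 0, and the two occurrences of “the” collapse into a single 1.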

First, let’s process our raw text datasets with a **TextVectorization** layer so that they yield multi-hot encoded binary word vectors. Our layer will only look at single words (that is to say, unigrams).

##### Preprocessing our datasets with a TextVectorization layer

In [23]:
# Limit the vocabulary to the 20,000 most frequent words.
# Otherwise we’d be indexing every word in the training data, potentially tens of thousands of terms that only occur once or twice and thus aren’t informative.
# In general, 20,000 is a reasonable vocabulary size for text classification.
text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot", # Encode the output tokens as multi-hot binary vectors.
)

text_only_train_ds = train_ds.map(lambda x, y: x) # Prepare a dataset that only yields raw text inputs (no labels).
text_vectorization.adapt(text_only_train_ds) # Use that dataset to index the dataset vocabulary via the adapt() method.

# Prepare processed versions of our training, validation, and test dataset. 
# Make sure to specify num_parallel_calls to leverage multiple CPU cores.
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

You can try to inspect the output of one of these datasets.
##### Inspecting the output of our binary unigram dataset

In [24]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


Next, let’s write a reusable model-building function that we’ll use in all of our experiments in this section.

##### Our model-building utility

In [25]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

Finally, let’s train and test our model.
##### Training and testing the binary unigram model

In [26]:
model = get_model()
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]

# We call cache() on the datasets to cache them in memory: this way, we will only do the preprocessing once, during the first epoch, and we’ll reuse the preprocessed texts for the following epochs. 
# This can only be done if the data is small enough to fit in memory.

model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.885


This gets us to a test accuracy of 88.5%: not bad! Note that in this case, since the dataset is a balanced two-class classification dataset (there are as many positive samples as negative samples), the “naive baseline” we could reach without training an actual model would only be 50%. Meanwhile, the best score that can be achieved on this dataset without leveraging external data is around 95% test accuracy.

##### BIGRAMS WITH BINARY ENCODING
Of course, discarding word order is very reductive, because even atomic concepts can be expressed via multiple words: the term “United States” conveys a concept that is quite distinct from the meaning of the words “states” and “united” taken separately. For this reason, you will usually end up re-injecting local order information into your bag-of-words representation by looking at N-grams rather than single words (most commonly, bigrams). <br>
With bigrams, our sentence becomes
```python
{"the", "the cat", "cat", "cat sat", "sat",
"sat on", "on", "on the", "the mat", "mat"}
```

The **TextVectorization** layer can be configured to return arbitrary N-grams: bigrams, trigrams, etc. Just pass an **ngrams=N** argument as in the following listing.

##### Configuring the TextVectorization layer to return bigrams

In [27]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

Let’s test how our model performs when trained on such binary-encoded bags of bigrams.

##### Training and testing the binary bigram model

In [28]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]

model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.896


We’re now getting 89.6% test accuracy, a marked improvement! Turns out local order is pretty important.

##### BIGRAMS WITH TF-IDF ENCODING
You can also add a bit more information to this representation by counting how many times each word or N-gram occurs, that is to say, by taking the histogram of the words over the text:
```python
{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
 "sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}
```
If you’re doing text classification, knowing how many times a word occurs in a sample is critical: any sufficiently long movie review may contain the word “terrible” regardless of sentiment, but a review that contains many instances of the word “terrible” is likely a negative one. <br>
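To make the histogram idea concrete, here is a minimal pure-Python sketch of counting unigrams and bigrams. The helper name `ngram_counts` and the lowercase-then-split-on-whitespace tokenization are assumptions made for illustration; this is not how the **TextVectorization** layer is implemented internally.

```python
from collections import Counter


def ngram_counts(text, max_n=2):
    """Count every N-gram of length 1..max_n in `text` (hypothetical helper)."""
    # Assumption: naive standardization and tokenization (lowercase + whitespace split).
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts("the cat sat on the mat")
print(dict(counts))
```

Only “the” appears twice in this sentence, so it is the only entry with a count above 1.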
Here’s how you’d count bigram occurrences with the **TextVectorization** layer.
##### Configuring the TextVectorization layer to return token counts

In [29]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

Now, of course, some words are bound to occur more often than others no matter what the text is about. The words “the,” “a,” “is,” and “are” will always dominate your word count histograms, drowning out other words—despite being pretty much useless features in a classification context. How could we address this? <br>
You already guessed it: via **normalization**. We could just normalize word counts by subtracting the mean and dividing by the standard deviation (computed across the entire training dataset). That would make sense. Except that most vectorized sentences consist almost entirely of zeros (our previous example features 12 non-zero entries and 19,988 zero entries), a property called **“sparsity”**. That’s a great property to have, as it dramatically reduces compute load and reduces the risk of overfitting. If we subtracted the mean from each feature, we’d wreck sparsity. Thus, whatever normalization scheme we use should be divide-only. What, then, should we use as the denominator? The best practice is to go with something called **TF-IDF normalization**, where TF-IDF stands for “term frequency, inverse document frequency.” <br>
**TF-IDF** is so common that it’s built into the **TextVectorization** layer. All you need to do to start using it is to switch the **output_mode** argument to **"tf_idf"**.
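As a rough illustration of a divide-only weighting, the sketch below divides a term’s count in one document by a logarithmic measure of how many documents contain that term. The function name `tfidf`, the toy dataset, and the `2 + doc_freq` smoothing (which keeps the denominator positive) are assumptions chosen for this example. TF-IDF has many variants, and this is not necessarily the exact formula the **TextVectorization** layer applies.

```python
import math


def tfidf(term, document, dataset):
    """Divide-only TF-IDF weight of `term` in `document` (simplified sketch).

    `document` is a list of tokens; `dataset` is a list of such documents.
    """
    # Term frequency: how often the term occurs in this document.
    term_freq = document.count(term)
    # Document frequency: how many documents contain the term at least once.
    doc_freq = sum(1 for doc in dataset if term in doc)
    # Assumption: add 2 inside the log so the denominator is never zero.
    return term_freq / math.log(2 + doc_freq)


dataset = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "the", "race"],
]
doc = dataset[0]
print(f"the: {tfidf('the', doc, dataset):.3f}")
print(f"cat: {tfidf('cat', doc, dataset):.3f}")
print(f"dog: {tfidf('dog', doc, dataset):.3f}")
```

Because the scheme only divides, a term that is absent from a document keeps a weight of exactly zero, so sparsity is preserved. Meanwhile “the,” which occurs in every document, is weighted lower than “cat,” which occurs in only two.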

##### Configuring TextVectorization to return TF-IDF-weighted outputs

In [30]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)

Let’s train a new model with this scheme.

##### Training and testing the TF-IDF bigram model

In [31]:
text_vectorization.adapt(text_only_train_ds) # The adapt() call will learn the TF-IDF weights in addition to the vocabulary.

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]

model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.893


This gets us 89.3% test accuracy on the IMDB classification task: TF-IDF weighting doesn’t seem to be particularly helpful in this case. However, for many text-classification datasets, it would be typical to see a one-percentage-point increase when using TF-IDF compared to plain binary encoding.

##### Exporting a model that processes raw strings
In the preceding examples, we did our text standardization, splitting, and indexing as part of the tf.data pipeline. But if we want to export a standalone model independent of this pipeline, we should make sure that it incorporates its own text preprocessing. Otherwise, we’d have to reimplement that preprocessing in the production environment, which can be challenging and can lead to subtle discrepancies between the training data and the production data. Thankfully, this is easy. <br>
Just create a new model that reuses your TextVectorization layer and adds to it the model you just trained:

In [32]:
inputs = keras.Input(shape=(1,), dtype="string") # One input sample would be one string.
processed_inputs = text_vectorization(inputs) # Add text preprocessing.
outputs = model(processed_inputs) # Apply the previously trained model.
inference_model = keras.Model(inputs, outputs) # Instantiate the end-to-end model.

The resulting model can process batches of raw strings:

In [33]:
import tensorflow as tf

raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")

97.43 percent positive
