<a href="https://colab.research.google.com/github/Stubberson/project-collection/blob/main/deep-learning/DL_for_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural language processing
This notebook follows Francois Chollet's book [Deep Learning with Python (2021)](https://sourestdeeds.github.io/pdf/Deep%20Learning%20with%20Python.pdf). This is chapter 11 of the book.

"Natural" languages are human languages, like English or Finnish, as opposed to languages that were designed for machines, like XML or Assembly. Machine languages are _designed_: a set of formal rules – the syntax – is laid down by an engineer. They are exact and rigorous, leaving no room for interpretation. With human language, it's the reverse: usage comes first, rules arise later. It is messy, ambiguous, and in constant flux.

Creating algorithms that can make sense of natural language is a big deal: language, and in particular text, underpins most of our communications and our cultural production.

Modern natural language processing (NLP) is about using machine learning and large datasets to give computers the ability not to _understand_ language, which is a loftier goal, but to ingest a piece of language as input and return something useful, like predicting the following:
- _Text classification_ – The topic of a text
- _Content filtering_ – Does this text contain abuse?
- _Sentiment analysis_ – Is a text positive or negative?
- _Language modelling_ – What should be the next word in this incomplete sentence?
- _Translation_ – How would you say this in German?
- _Summarization_ – How would you summarize this article?

Text-processing models won't possess a human-like understanding of language; rather, they simply look for statistical regularities in their input data, which turns out to be sufficient to perform well on many simple tasks. In much the same way that computer "vision" is pattern recognition applied to pixels, NLP is pattern recognition applied to words, sentences, and paragraphs.

## Preparing text data
Deep learning models, being differentiable functions, can only process numeric tensors: they can't take raw text as input. _Vectorizing_ text is the process of transforming text into numeric tensors:
1. First, you _standardize_ the text to make it easier to process, such as by converting it to lowercase or removing punctuation.
2. You split the text into _tokens_, such as characters, words, or groups of words. This is called _tokenization_.
    - _Word-level tokenization_ separates tokens by a specific character (e.g. space or punctuation). This is used with a _sequence model_.
    - _N-gram tokenization_ creates tokens of _N_ consecutive words. "The cat" would be a 2-gram token. This is used with a _bag-of-words model_.
3. You convert each such token into a numerical vector. This usually involves _indexing_ all tokens present in the data.
```python
""" An example of the vectorization process without any Keras or Tensorflow functionality """
import string

class Vectorizer:
    def standardize(self, text):
        # Lowercase the text and strip punctuation
        text = text.lower()
        return "".join(char for char in text
            if char not in string.punctuation)

    def tokenize(self, text):
        # Standardize first, then split on whitespace
        text = self.standardize(text)
        return text.split()

    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[OOV]": 1}  # The 0-index is usually reserved for mask token and the 1-index for the OOV token
        for text in dataset:
            tokens = self.tokenize(text)  # tokenize() already standardizes
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        # Build the reverse mapping once, after the vocabulary is complete
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())

    def encode(self, text):
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        return " ".join(
            self.inverse_vocabulary.get(i, "[OOV]") for i in int_sequence)
```

In practice, however, we'd use the Keras `TextVectorization` layer, which is fast and efficient and can be dropped directly into a `tf.data` pipeline or a Keras model. This is what the layer looks like:
```python
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(output_mode="int")
```
The layer returns sequences of words encoded as integer indices. Other output modes are also available. By default, the layer uses the setting "convert to lowercase and remove punctuation" for _standardization_, and "split on whitespace" for _tokenization_.

## Two approaches for representing groups of words: sets and sequences
_Word order_ is a fundamental problem for machine learning: unlike the steps of a timeseries, words in a sentence don't have a natural, canonical order. Different languages order similar words in very different ways. Order is important, but its relationship to meaning isn't straightforward.

The simplest thing you could do is to just discard order and treat text as an unordered set of words – the _bag-of-words models_ – or you could treat words as an ordered sequence – here an RNN would be the choice. Additionally, a hybrid approach is possible: the **Transformer architecture**.

A Transformer is technically order-agnostic, yet it injects word-position information into the representations it processes, which lets it look at different parts of a sentence simultaneously (unlike an RNN) while still being order-aware. Because they take word order into account, both RNNs and Transformers are called _sequence models_.

### IMDB example
Let's demonstrate both approaches with the same IMDB sentiment-classification dataset as previously. First, we'll prepare the IMDB movie reviews data – this time from scratch.

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup



In [None]:
# Take a look at the content of a few of these text files
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

Next, let's prepare a validation set by setting apart 20% of the training text files in a new directory, `aclImdb/val`:

In [None]:
# Imports
import os, pathlib, shutil, random
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"

for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))  # Take 20% of the training for val
    val_files = files[-num_val_samples:]  # and slice it from the end of files
    for fname in val_files:
        shutil.move(train_dir / category / fname,  # Move the files from train to val
                    val_dir / category / fname)

Now we can create a batched `Dataset` by using the `text_dataset_from_directory` utility.

In [None]:
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [None]:
# What kind of data do we have?
for inputs, targets in train_ds:
    print("Inputs shape:", inputs.shape)
    print("Inputs data type:", inputs.dtype)
    print("Targets shape:", targets.shape)
    print("Targets data type:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

Inputs shape: (32,)
Inputs data type: <dtype: 'string'>
Targets shape: (32,)
Targets data type: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'All day now I\'ve been watching dinosaurs, and all day they\'ve had the same fundamental problem.<br /><br />They don\'t believe in firearms. They just don\'t seem to have been _told_ about them or something. Bullets _bounce_ off of dinosaurs! Maybe it\'s because they became extinct millions of years before the invention of gunpowder, and the laws of physics were just different back then... Aah, no. Come on. If they\'re close enough to chemically operate today, they\'d have to be vulnerable to fast (even subsonic) lead projectiles. It\'s that simple.<br /><br />Look, the toughest-skinned reptiles on the planet today, alligators and crocodiles, are completely vulnerable to basic rifle fire. They\'re nothing magic. You can shoot a pistol round right through the heavy scales on their backs. They don\'t take armor-piercing bullets or anything special. Smal

#### Processing words as a set: the bag-of-words approach
The simplest way to encode a piece of text for processing by a machine learning model is to discard order and treat it as a set of tokens.

**SINGLE WORDS (UNIGRAMS) WITH BINARY ENCODING**

If you use a bag of single words, the sentence "the cat sat on the mat" becomes `{"cat", "mat", "on", "sat", "the"}`

The main advantage of this encoding is that you can represent an entire text as a single vector, where each entry is a presence indicator for a given word. For instance, using binary encoding (multi-hot), you'd encode a text as a vector with as many dimensions as there are words in your vocabulary – with 0s almost everywhere and some 1s for dimensions that encode words present in the text. Let's try this on our task.
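
As a rough illustration of multi-hot encoding, here is a minimal pure-Python sketch. The six-word vocabulary and the `multi_hot` helper are made up for this example; unlike the Keras layer, the sketch simply ignores out-of-vocabulary words instead of reserving an index for them.

```python
def multi_hot(text, vocabulary):
    """Return a list with one 0/1 entry per vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in index:  # words outside the vocabulary are ignored here
            vector[index[word]] = 1
    return vector

# Hypothetical toy vocabulary; a real one would hold thousands of words
vocabulary = ["cat", "dog", "mat", "on", "sat", "the"]
encoded = multi_hot("the cat sat on the mat", vocabulary)
print(encoded)  # [1, 0, 1, 1, 1, 1]
```

Note that "the" occurs twice in the sentence but still contributes a single 1: the encoding records presence, not frequency.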

First, let's process our raw text datasets with a `TextVectorization` layer so that they yield multi-hot encoded binary word vectors. Our layer will only look at single words (i.e., _unigrams_).

In [None]:
text_vectorization = keras.layers.TextVectorization(
    max_tokens=20000,  # Limit the vocabulary to the 20,000 most frequent words
    output_mode="multi_hot",  # Encode as multi-hot binary vectors
)

text_only_train_ds = train_ds.map(lambda x, y: x)  # A dataset w/ only raw text
text_vectorization.adapt(text_only_train_ds)  # Index with the adapt method

# Prepare processed versions of the train, val, and test datasets
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4  # Leverage multiple CPU cores
)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4
)


In [None]:
# What do we have?
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'int64'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1 1 0 ... 0 0 0], shape=(20000,), dtype=int64)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


In [None]:
# Let's write a reusable model-building function
def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

Let's then train and test our model.

In [None]:
# Get the model and check its summary
model = get_model()
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]

# Train the model
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 11s 13ms/step - accuracy: 0.7747 - loss: 0.4836 - val_accuracy: 0.8902 - val_loss: 0.2768
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 4s 5ms/step - accuracy: 0.8983 - loss: 0.2775 - val_accuracy: 0.8956 - val_loss: 0.2731
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.9155 - loss: 0.2380 - val_accuracy: 0.8966 - val_loss: 0.2878
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.9226 - loss: 0.2214 - val_accuracy: 0.8948 - val_loss: 0.3005
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - accuracy: 0.9296 - loss: 0.2115 - val_accuracy: 0.8918 - val_loss: 0.3191
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.9335 - loss: 0.2075 - val_accuracy: 0.8856 - val_loss: 0.3386
Epoch 7/10
625/625

We get a test accuracy of 88.8%. Because this is a balanced two-class dataset (there are as many positive samples as negative ones), the "naive baseline" we could reach without training an actual model would only be 50%.

**BIGRAMS WITH BINARY ENCODING**

Of course, discarding word order is very reductive, because even atomic concepts can be expressed via multiple words: the term "United States" conveys a concept that is quite distinct from the meaning of the words "states" and "united" taken separately. For this reason, you will usually end up re-injecting local order information into your bag-of-words representation by looking at N-grams rather than single words.

With bigrams, our sentence "the cat sat on the mat" becomes:
`{"the", "the cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the mat", "mat"}`
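
The bag above can be reproduced with a few lines of plain Python. This is only a sketch of the idea: `word_ngrams` is a hypothetical helper, not part of Keras, and it uses simple whitespace splitting.

```python
def word_ngrams(text, max_n=2):
    """Return the set of all word N-grams of length 1..max_n."""
    words = text.lower().split()
    grams = set()
    for n in range(1, max_n + 1):
        # Slide a window of n words over the sentence
        for start in range(len(words) - n + 1):
            grams.add(" ".join(words[start:start + n]))
    return grams

bigram_bag = word_ngrams("the cat sat on the mat", max_n=2)
print(sorted(bigram_bag))
```

With `max_n=1` the same function returns the unigram set from the previous section.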

In [None]:
# Configure the TextVectorization layer to return bigrams
text_vectorization = layers.TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot"
)

In [None]:
# Let's test the binary bigram model
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]

model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 10s 14ms/step - accuracy: 0.7806 - loss: 0.4635 - val_accuracy: 0.9004 - val_loss: 0.2598
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - accuracy: 0.9112 - loss: 0.2489 - val_accuracy: 0.9026 - val_loss: 0.2731
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.9273 - loss: 0.2126 - val_accuracy: 0.9038 - val_loss: 0.2810
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - accuracy: 0.9375 - loss: 0.1855 - val_accuracy: 0.9090 - val_loss: 0.2942
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.9437 - loss: 0.1849 - val_accuracy: 0.9080 - val_loss: 0.3085
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.9500 - loss: 0.1772 - val_accuracy: 0.9034 - val_loss: 0.3303
Epoch 7/10
625/625

We do get an improvement! Local order seems to be important in sentiment analysis.

**BIGRAMS WITH TF-IDF ENCODING**

You can also add a bit more information to this representation by counting how many times each word or N-gram occurs, that is to say, by taking the histogram of the words over the text:

`{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1, "sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}`

In text classification, knowing how many times a word occurs in a sample is critical: any sufficiently long movie review may contain the word "terrible" regardless of sentiment, but a review that contains many instances of that word is likely a negative one.
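
A histogram like this can be sketched with `collections.Counter`. The helper name `ngram_counts` is invented for the example, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def ngram_counts(text, max_n=2):
    """Count every word N-gram of length 1..max_n in the text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts

counts = ngram_counts("the cat sat on the mat")
print(counts["the"], counts["the cat"])  # 2 1
```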

In [None]:
# Let's configure the TextVectorization layer for returning token counts
text_vectorization = layers.TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

Now, of course, some words are bound to occur more often than others no matter what the text is about. The words "the", "a", "is", and "are" will always dominate your word counts, drowning out other words – despite being pretty much useless features in a classification context.

This can be addressed by _normalization_. Most vectorized sentences consist almost entirely of zeros (the previous example features 12 non-zero entries and 19,988 zero entries), a property called _sparsity_. That's a great property to have, as it dramatically reduces compute load and reduces the risk of overfitting. This is why the basic _z-score_ normalization isn't such a good idea (our zeros would vanish because of the subtraction by the mean).
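
To make the sparsity argument concrete, here is a small check on a made-up count vector (the numbers are illustrative only). Subtracting the mean turns every zero into a non-zero value, whereas a normalization that only divides or multiplies leaves the zeros in place.

```python
from statistics import mean, pstdev

counts = [0, 0, 3, 0, 0, 1, 0, 0]  # hypothetical sparse count vector

mu, sigma = mean(counts), pstdev(counts)
z_scored = [(c - mu) / sigma for c in counts]  # subtract mean, divide by std
scaled_only = [c / sigma for c in counts]      # division alone keeps zeros at zero

def count_zeros(values):
    return sum(1 for v in values if v == 0)

print(count_zeros(counts))       # 6
print(count_zeros(z_scored))     # 0
print(count_zeros(scaled_only))  # 6
```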

_TF-IDF normalization_ (term frequency, inverse document frequency) is used to combat this problem. The more a given term appears in a document, the more important that term is for understanding what the document is about. At the same time, the frequency at which the term appears across all documents in your dataset matters too: terms that appear in almost every document (like “the” or “a”) aren't particularly informative, while terms that appear only in a small subset of all texts (like “Herzog”) are very distinctive, and thus important. TF-IDF is a metric that fuses these two ideas.
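
The two ideas can be combined in a few lines. The sketch below uses one common textbook variant, `count * log(1 + N / document_frequency)`; this is an assumption made for illustration, and the exact formula used by the Keras layer may differ. The three toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores.append({
            term: count * math.log(1 + n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores

docs = ["the film was great", "the plot was thin", "the herzog film"]
scores = tf_idf(docs)
print(scores[2]["herzog"] > scores[2]["the"])  # True
```

Because the weighting only multiplies counts, terms that are absent from a document keep a score of zero, so the representation stays sparse.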

In [None]:
# Let's configure the TextVectorization layer for returning TF-IDF counts
text_vectorization = layers.TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf"
)

In [None]:
# Let's test the TF-IDF bigram model
text_vectorization.adapt(text_only_train_ds)
tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]

model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 8s 11ms/step - accuracy: 0.6986 - loss: 0.7846 - val_accuracy: 0.8896 - val_loss: 0.3058
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8431 - loss: 0.3540 - val_accuracy: 0.8914 - val_loss: 0.3094
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8619 - loss: 0.3127 - val_accuracy: 0.8938 - val_loss: 0.3045
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8857 - loss: 0.2734 - val_accuracy: 0.8980 - val_loss: 0.2990
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8921 - loss: 0.2657 - val_accuracy: 0.8832 - val_loss: 0.3188
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.8970 - loss: 0.2578 - val_accuracy: 0.8800 - val_loss: 0.3668
Epoch 7/10
625/625

Here, TF-IDF doesn't improve our test accuracy. However, for many text-classification datasets, it would be typical to see a one-percentage-point increase when using TF-IDF compared to plain binary encoding.

#### Processing words as a sequence: the sequence model approach
The previous examples show that order does matter in language processing. Until now, we have hand-crafted order-based features such as bigrams, but a better way is to hand the model the raw word sequence and let it learn such features on its own.

To implement a _sequence model_:
1. Start by representing your input samples as sequences of integer indices (one integer standing for one word).
2. Map each integer to a vector to obtain vector sequences.
3. Finally, feed these sequences of vectors into a stack of layers that can cross-correlate features from adjacent vectors, such as a 1D convnet, an RNN, or a Transformer.

**First practical example:**

First, we need to prepare datasets that return integer sequences.

In [None]:
max_length = 600  # We'll truncate the inputs after the first 600 words
max_tokens = 20000

text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

Next, let's make a model. The simplest way to convert our integer sequences to vector sequences is to one-hot encode the integers. On top of these one-hot vectors, we'll add a simple bidirectional LSTM.

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
# This is different from the book, but should do the exact same thing
embedded = keras.ops.one_hot(inputs, num_classes=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

In [None]:
# NOTE: very slow to train, avoid re-running this cell!
# Then train our model
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds,
          validation_data=int_val_ds,
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 231s 363ms/step - accuracy: 0.6102 - loss: 0.6432 - val_accuracy: 0.8546 - val_loss: 0.3668
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 227s 363ms/step - accuracy: 0.8495 - loss: 0.3882 - val_accuracy: 0.8820 - val_loss: 0.3095
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 278s 389ms/step - accuracy: 0.8881 - loss: 0.3165 - val_accuracy: 0.8614 - val_loss: 0.3368
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 227s 363ms/step - accuracy: 0.9090 - loss: 0.2647 - val_accuracy: 0.8940 - val_loss: 0.3232
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 243s 388ms/step - accuracy: 0.9229 - loss: 0.2295 - val_accuracy: 0.8956 - val_loss: 0.3002
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 243s 389ms/step - accuracy: 0.9308 - loss: 0.2065 - val_accuracy: 0.8930 - val_loss: 0.3174
Epoc

<keras.src.callbacks.history.History at 0x7a94f81f5290>

In [None]:
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 127s 161ms/step - accuracy: 0.8685 - loss: 0.3376
Test acc: 0.868


A first observation: this model is extremely slow to train when compared to the previous models. This is because our inputs are quite large: each input sample is encoded as a matrix of size `(600, 20000)` (600 words per sample, 20,000 possible words). That's 12,000,000 floats for a single movie review. Second, the model only gets to 87% test accuracy, not as good as the simpler models.

Clearly, using one-hot encoding to turn words into vectors, which was the simplest thing we could do, wasn't a great idea. There's a better way: _word embeddings_.

**UNDERSTANDING WORD EMBEDDINGS**

Crucially, when encoding something via one-hot encoding, you're making a feature-engineering decision. You're injecting into your model a fundamental assumption about the structure of your feature space. That assumption is that _the different tokens you're encoding are all independent from each other_: indeed, one-hot vectors are all orthogonal to each other. And in the case of words, that assumption is clearly wrong. Words form a structured space: they share information with each other. The words "movie" and "film" are interchangeable in most sentences, so the vector that represents "movie" should not be orthogonal to the vector that represents "film" – they should be the same vector, or close enough.
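
The orthogonality claim is easy to check numerically. In this sketch the four-word vocabulary and the dense vectors are hand-picked for illustration; they are not learned embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: 0 means orthogonal, 1 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# One-hot vectors over a hypothetical 4-word vocabulary
movie_one_hot = [1, 0, 0, 0]
film_one_hot = [0, 1, 0, 0]
print(cosine(movie_one_hot, film_one_hot))  # 0.0

# Hand-picked dense vectors (illustrative values, not learned)
movie_dense = [0.9, 0.1]
film_dense = [0.8, 0.2]
print(cosine(movie_dense, film_dense) > 0.9)  # True
```

Any two distinct one-hot vectors give a similarity of exactly zero, no matter how related the words are.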

The _geometric relationship_ between two vectors should reflect the _semantic relationship_ between these words. Words that mean different things should lie far away from each other in the geometric space.

> _Word embeddings_ are vector representations of words that achieve exactly this: they map human language into a structured geometric space.

Whereas the vectors obtained through one-hot encoding are binary, sparse, and very high-dimensional, word embeddings are dense, low-dimensional floating-point vectors. So, word embeddings pack more information into far fewer dimensions when compared to one-hot encoding.

Besides being _dense_ representations, word embeddings are also _structured_ representations, and their structure is learned from data. Similar words get embedded in close locations, and further, specific _directions_ in the embedding space are meaningful.

As an example, take the words _cat_, _dog_, _wolf_, and _tiger_ embedded on a 2D plane. With suitably chosen vector representations, some semantic relationships between these words can be encoded as geometric transformations. For instance, the same vector allows us to go from _cat_ to _tiger_ and from _dog_ to _wolf_: this vector could be interpreted as the "from pet to wild animal" vector.
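
A toy version of this geometry, with coordinates invented purely for illustration (real embeddings are learned and far less tidy):

```python
# Hypothetical 2D embeddings: axis 0 ~ "canine vs feline", axis 1 ~ "pet vs wild"
embeddings = {
    "cat":   (1.0, 0.0),
    "tiger": (1.0, 1.0),
    "dog":   (0.0, 0.0),
    "wolf":  (0.0, 1.0),
}

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

# The offset that turns a pet into its wild counterpart
pet_to_wild = subtract(embeddings["tiger"], embeddings["cat"])
print(pet_to_wild)  # (0.0, 1.0)
print(add(embeddings["dog"], pet_to_wild) == embeddings["wolf"])  # True
```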

In general, there are two ways to obtain word embeddings:
- Learn word embeddings jointly with the main task you care about. In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.
- Load into your model word embeddings that were precomputed using a different machine learning task than the one you're trying to solve. These are called _pretrained word embeddings_.

**LEARNING WORD EMBEDDINGS WITH THE EMBEDDING LAYER**

Usually you _learn_ a new embedding space with every new task. In Keras this is easy: we use the `Embedding` layer.

In [None]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

The `Embedding` layer is best understood as a dictionary that maps integer indices to dense vectors. It takes integers as input, looks up these integers in an internal dictionary, and returns the associated vectors.
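
The dictionary analogy can be sketched in plain Python. `ToyEmbedding` is a hypothetical stand-in, not the Keras implementation: it only shows the lookup, with randomly initialized rows that training would normally adjust.

```python
import random

class ToyEmbedding:
    """Maps integer token indices to dense vectors via a lookup table."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = random.Random(seed)
        # One row per token index, randomly initialized
        self.table = [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
                      for _ in range(input_dim)]

    def __call__(self, int_sequence):
        # Look up the row for each integer index
        return [self.table[i] for i in int_sequence]

embedding = ToyEmbedding(input_dim=10, output_dim=4)
vectors = embedding([3, 7, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same index, same vector
```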

Let's build a model that includes an `Embedding` layer and benchmark it on our task.

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds,
          validation_data=int_val_ds,
          epochs=10,
          callbacks=callbacks)

model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 49ms/step - accuracy: 0.6392 - loss: 0.6218 - val_accuracy: 0.8132 - val_loss: 0.4372
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 38s 46ms/step - accuracy: 0.8302 - loss: 0.4241 - val_accuracy: 0.8288 - val_loss: 0.3960
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 29s 47ms/step - accuracy: 0.8763 - loss: 0.3325 - val_accuracy: 0.8796 - val_loss: 0.3131
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 41s 47ms/step - accuracy: 0.8983 - loss: 0.2738 - val_accuracy: 0.8522 - val_loss: 0.4266
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 39s 44ms/step - accuracy: 0.9149 - loss: 0.2409 - val_accuracy: 0.8888 - val_loss: 0.3210
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 44ms/step - accuracy: 0.9297 - loss: 0.2070 - val_accuracy: 0.8822 - val_loss: 0.3557
Epoch 7/10
6

This model trains a lot faster than the one-hot model (because the LSTM layer only has to process 256-dimensional vectors instead of 20,000-dimensional ones), and its test accuracy is comparable. However, we're still some way off from the results of our basic bigram model. Part of the reason is simply that the model is looking at slightly less data: the bigram model processed full reviews, while our sequence model truncates sequences after 600 words.
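
As a back-of-the-envelope check on the difference in input size, the two encodings can be compared directly. The 4 bytes per value is an assumption (float32 storage), not something measured in this notebook.

```python
max_length, max_tokens, embedding_dim = 600, 20000, 256

one_hot_values = max_length * max_tokens      # values per review, one-hot
embedded_values = max_length * embedding_dim  # values per review, embedded
bytes_per_value = 4                           # assumption: float32

print(one_hot_values)                    # 12000000
print(embedded_values)                   # 153600
print(one_hot_values / embedded_values)  # 78.125
print(one_hot_values * bytes_per_value)  # 48000000 bytes per one-hot review
```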

**UNDERSTANDING PADDING AND MASKING**

One thing that's slightly hurting model performance here is that our input sequences are full of zeros. This comes from our use of the `output_sequence_length=max_length` option in `TextVectorization`: sentences longer than 600 tokens are truncated to a length of 600 tokens, and sentences shorter than 600 tokens are padded with zeros at the end so that they can be concatenated together with other sequences to form contiguous batches.
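
The truncate-or-pad behavior can be sketched as follows. `pad_or_truncate` is a hypothetical helper written for this note; it mimics the effect described above on a plain list of token indices.

```python
def pad_or_truncate(int_sequence, max_length, pad_value=0):
    """Cut the sequence to max_length, or right-pad it with pad_value."""
    truncated = int_sequence[:max_length]
    return truncated + [pad_value] * (max_length - len(truncated))

print(pad_or_truncate([5, 8, 2], max_length=5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], max_length=5))  # [5, 8, 2, 9, 4]
```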

We're using a bidirectional RNN: two RNN layers running in parallel, with one processing the tokens in their natural order, and the other processing the same tokens in reverse. The RNN that looks at the tokens in their natural order will spend its last iterations seeing only vectors that encode padding – possibly for hundreds of iterations if the original sentence was short. The information stored in the internal state of the RNN will gradually fade out as it gets exposed to these meaningless inputs.

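We need a way to tell the RNN that it should skip these iterations. This is where _masking_ steps in. The mask is a tensor of 1s and 0s (or `True`/`False` values) with the same shape as the batch of token sequences: positions that hold the padding value 0 in the original input are marked 0, and all other positions are marked 1.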

Masking can be enabled in an `Embedding` layer with `mask_zero=True`.
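
To make the effect of the mask concrete, the sketch below builds one with NumPy and runs a toy recurrence with and without it. The recurrence (`state = 0.5 * state + 0.5 * input`) and the function `run_toy_rnn` are assumptions chosen for readability and are far simpler than an LSTM; the point is only that padded steps wash out the state unless they are skipped.

```python
import numpy as np

token_ids = np.array([[12, 7, 45, 0, 0, 0],     # Short sequence, padded with zeros
                      [3, 9, 27, 81, 5, 14]])   # Full-length sequence
mask = token_ids != 0  # True for real tokens, False for padding

def run_toy_rnn(inputs, step_mask=None):
    """Toy recurrence (not an LSTM): state = 0.5 * state + 0.5 * input."""
    state = 0.0
    for t, x in enumerate(inputs):
        if step_mask is not None and not step_mask[t]:
            continue  # Masked step: carry the state over unchanged
        state = 0.5 * state + 0.5 * x
    return state

# One made-up feature value per timestep of the first sequence; zeros are padding
features = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

unmasked_state = run_toy_rnn(features)
masked_state = run_toy_rnn(features, mask[0])
print(unmasked_state)  # 0.109375
print(masked_state)    # 0.875
```

Without the mask, the three padded steps shrink the state from 0.875 to about 0.11; with the mask, the state left by the last real token is preserved.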

**USING PRETRAINED EMBEDDINGS**

Sometimes there's very little training data available and you cannot use your data alone to learn an appropriate task-specific embedding of your vocabulary. For such cases precomputed embedding vectors are useful.

One of the most famous word-embedding schemes is the `word2vec` algorithm. The dimensions of its embedding space capture specific semantic properties, such as gender.

### The Transformer architecture
Starting in 2017, a new model architecture began overtaking recurrent neural networks across most natural language processing tasks: the Transformer.

Its key insight was that a simple mechanism called "_neural attention_" could be used to build powerful sequence models that didn't feature any recurrent layers or convolution layers.

This finding unleashed a revolution in NLP and beyond. Neural attention has fast become one of the most influential ideas in deep learning.

#### Understanding self-attention
It's a simple yet powerful idea: not all input information seen by a model is equally important to the task at hand, so models should "pay more attention" to some features and "pay less attention" to others.

It is a similar idea to `MaxPooling` in convnets, which looks at a pool of features in a spatial region and selects just one feature to keep. Likewise, `TF-IDF` normalization assigns importance scores to tokens based on how much information different tokens are likely to carry. Important tokens get boosted while irrelevant tokens get faded out.

Crucially, this kind of attention mechanism can be used for more than just highlighting or erasing certain features. It can be used to make features _context-aware_. You've just learned about word embeddings – vector spaces that capture the "shape" of the semantic relationships between different words. In an embedding space, a single word has a fixed position – a fixed set of relationships with every other word in the space. But that's not quite how language works: the meaning of a word is usually context-specific. When you mark the date, you're not talking about the same "date" as when you go on a date, nor is it the kind of date you'd buy at the market. When you say, "I'll see you soon," the meaning of the word "see" is subtly different from the "see" in "I'll see this project to its end," or "I see what you mean."

Clearly, a smart embedding space would provide a different vector representation for a word depending on the other words surrounding it. That's where _self-attention_ comes in. The purpose of self-attention is to modulate the representation of a token by using the representations of related tokens in the sequence. This produces context-aware token representations. Consider an example sentence: "The train left the station on time." Now, consider one word in the sentence: "station". What kind of station are we talking about? Could it be a radio station? Maybe the International Space Station? Let's figure it out algorithmically via self-attention.

- **Step 1** is to compute relevancy scores between the vector for "station" and every other word in the sentence. These are our "attention scores." We're simply going to use the dot product between two word vectors as a measure of the strength of their relationship.
- **Step 2** is to compute the sum of all word vectors in the sentence, weighted by our relevancy scores. Words closely related to "station" will contribute more to the sum, while irrelevant words will contribute almost nothing. The resulting vector is our new representation for "station": a representation that incorporates the surrounding context. In particular, it includes part of the "train" vector, clarifying that it is, in fact, a "train station".

This process would then be repeated for every word in the sentence, producing a new sequence of vectors encoding the sentence.

```python
"""Self-attention for a single attention head, written with explicit
loops for readability rather than speed."""
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def self_attention(input_sequence):
    # input_sequence has shape (sequence_length, embedding_dim)
    attention_weighted_sequence = np.zeros(shape=input_sequence.shape)
    for i, pivot_vector in enumerate(input_sequence):
        scores = np.zeros(shape=(len(input_sequence),))
        for j, vector in enumerate(input_sequence):
            scores[j] = np.dot(pivot_vector, vector)  # Unnormalized attention score
        scores /= np.sqrt(input_sequence.shape[1])  # Scale by sqrt(embedding_dim)
        scores = softmax(scores)  # Normalize: scores lie in (0, 1) and sum to 1
        new_pivot_representation = np.zeros(shape=pivot_vector.shape)
        for j, vector in enumerate(input_sequence):
            # Sum of all tokens weighted by the attention scores
            new_pivot_representation += vector * scores[j]
        # The weighted sum becomes the new representation of token i
        attention_weighted_sequence[i] = new_pivot_representation
    return attention_weighted_sequence
```
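
The two steps described above (dot-product scores, then a weighted sum) can also be written as matrix operations, which is closer to how attention is usually computed in practice. The following is a minimal NumPy sketch under stated assumptions: the function name `self_attention_vectorized` is made up, and the seven token vectors are random stand-ins for the embeddings of "The train left the station on time."

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention_vectorized(input_sequence):
    """input_sequence has shape (sequence_length, embedding_dim)."""
    embedding_dim = input_sequence.shape[1]
    # Step 1: all pairwise dot products at once, scaled by sqrt(embedding_dim)
    scores = input_sequence @ input_sequence.T / np.sqrt(embedding_dim)
    weights = softmax(scores, axis=-1)  # Each row sums to 1
    # Step 2: every output row is a weighted sum of all token vectors
    return weights @ input_sequence, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(7, 4))  # 7 random stand-in vectors, one per word

outputs, weights = self_attention_vectorized(tokens)
print(outputs.shape)   # (7, 4)
print(weights.shape)   # (7, 7)
```

Row `i` of `weights` holds the attention scores of token `i` with respect to every token in the sentence, so row 4 plays the role of the scores for "station".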

##### **Generalized self-attention: the _query-key-value_ model**
So far, we've only considered one input sequence. However, the Transformer architecture was originally developed for machine translation, where you have to deal with two input sequences: the source sequence you're currently translating, and the target sequence you're converting it to. A Transformer is a _sequence-to-sequence_ model: it was designed to convert one sequence into another.

The self-attention mechanism performs the following:
> `outputs = sum(values * pairwise_attention_scores(query, key))`

"For each token in the `query`, compute how much the token is related to every token in `key`, and use these scores to weight a sum of tokens from `values`."

This terminology comes from search engines and recommender systems. You write a query, and the engine matches it against a set of keys, each of which describes one stored value. The engine then ranks the `keys` by strength of match, or _relevance_, with respect to the `query` and returns the `values` associated with the top matches. In the simplest case the matching is binary: a query for _dog_ scores 1 against a key that contains _dog_ and 0 against a key that only mentions _cat_.

- _Query_ – A reference sequence that describes something you're looking for.
- _Keys_ – Each value is assigned a key that describes the value in a format that can be readily compared to a query.
- _Values_ – A body of knowledge that you're trying to extract information from.
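
The search-engine analogy can be sketched in a few lines of plain Python. Everything here is hypothetical and only meant to illustrate the roles of query, keys and values: the photo "database", its tags, and the helpers `match_score` and `search` are made up for this example.

```python
def match_score(query, key):
    """Binary matching: count the query terms that also appear in the key."""
    return sum(1 for term in query if term in key)

def search(query, database, top_k=1):
    """Rank (key, value) pairs by match score and return the top values."""
    ranked = sorted(database,
                    key=lambda item: match_score(query, item[0]),
                    reverse=True)
    return [value for _, value in ranked[:top_k]]

# Each entry pairs a key (a set of tags) with a value (the photo it describes)
database = [
    ({"beach", "boat"}, "photo_1.jpg"),
    ({"tree", "cat"}, "photo_2.jpg"),
    ({"beach", "tree", "dog"}, "photo_3.jpg"),
]

query = ["dog", "beach"]
print(search(query, database))           # ['photo_3.jpg']
print(search(query, database, top_k=2))  # ['photo_3.jpg', 'photo_1.jpg']
```

Attention replaces the hard ranking with soft weights: instead of returning only the top values, it returns a weighted sum of all of them.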

In practice, the _keys_ and the _values_ are often the same sequence.
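
The formula `outputs = sum(values * pairwise_attention_scores(query, key))` can be sketched in NumPy as follows. This is an illustration under assumptions rather than a reference implementation: the function name `attention` is made up, the scores are dot products scaled by the square root of the vector size (as in the self-attention code above), and the source and target sequences are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """query: (query_len, dim), key: (key_len, dim), value: (key_len, value_dim)."""
    # How strongly is each query token related to each key token?
    scores = query @ key.T / np.sqrt(key.shape[-1])  # (query_len, key_len)
    weights = softmax(scores, axis=-1)               # Each row sums to 1
    # Weighted sum of the value vectors, one output row per query token
    return weights @ value                           # (query_len, value_dim)

rng = np.random.default_rng(1)
target = rng.normal(size=(3, 8))  # 3 target tokens (random stand-ins)
source = rng.normal(size=(5, 8))  # 5 source tokens (random stand-ins)

# Translation-style use: the target queries the source; keys and values coincide
outputs = attention(query=target, key=source, value=source)
print(outputs.shape)  # (3, 8)

# Self-attention is the special case where all three are the same sequence
print(attention(source, source, source).shape)  # (5, 8)
```
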
##### **Multi-head attention**
Multi-head attention is an extra tweak to the self-attention mechanism. The "multi-head" moniker refers to the fact that the output space of the self-attention layer gets factored into a set of independent subspaces, learned separately. Each subspace is called a _head_: within a head, the initial query, key, and value sequences are each sent through their own dense projection, and the three projected sequences are then processed with neural attention (one _head_ corresponds to the `self_attention` function above). The outputs of all heads are concatenated back into a single output sequence. This process is similar to having multiple kernels/filters in convolutional networks producing multiple feature maps.

Having independent heads helps the layer learn different groups of features for each token, where features within one group are correlated with each other but are mostly independent from features in a different group.
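
A minimal NumPy sketch of this idea follows. It is an illustration under assumptions, not the implementation of any particular library: the projection matrices are random stand-ins for weights that would normally be learned, the names `multi_head_attention` and `heads` are made up, and the head outputs are simply concatenated.

```python
import numpy as np

def softmax(x, axis=-1):
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    scores = query @ key.T / np.sqrt(key.shape[-1])
    return softmax(scores, axis=-1) @ value

def multi_head_attention(query, key, value, heads):
    """`heads` is a list of (w_query, w_key, w_value) projection matrices."""
    head_outputs = []
    for w_query, w_key, w_value in heads:
        # Each head projects the inputs into its own subspace, then attends there
        head_outputs.append(attention(query @ w_query, key @ w_key, value @ w_value))
    # Concatenate the per-head results along the feature axis
    return np.concatenate(head_outputs, axis=-1)

rng = np.random.default_rng(2)
embed_dim, num_heads = 8, 2
head_dim = embed_dim // num_heads

# Random stand-ins for the learned projections: three matrices per head
heads = [tuple(rng.normal(size=(embed_dim, head_dim)) for _ in range(3))
         for _ in range(num_heads)]

sequence = rng.normal(size=(5, embed_dim))  # 5 tokens, random stand-ins
outputs = multi_head_attention(sequence, sequence, sequence, heads)
print(outputs.shape)  # (5, 8)
```

Because every head has its own projections, the first `head_dim` output features depend only on the first head, the next `head_dim` only on the second, and so on.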

##### **The Transformer encoder**
The Transformer encoder chains a multi-head attention layer with a dense projection and adds normalization as well as residual connections.

The original Transformer architecture consists of two parts: an _encoder_ that processes the source sequence, and a _decoder_ that uses the source sequence to generate a translated version.

The _encoder_ is a very generic module that ingests a sequence and learns to turn it into a more useful representation.
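
The structure described above (attention, dense projection, normalization, residual connections) can be sketched in NumPy. This is a simplified illustration under several assumptions: it uses a single attention head without learned query/key/value projections, the layer normalization omits the learned scale and offset, the dense weights are random stand-ins, and the name `encoder_block` is made up.

```python
import numpy as np

def softmax(x, axis=-1):
    exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and (roughly) unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def encoder_block(x, w1, b1, w2, b2):
    # Attention sub-block: residual connection, then normalization
    attended = layer_norm(x + self_attention(x))
    # Dense projection sub-block: expand with ReLU, project back
    hidden = np.maximum(0.0, attended @ w1 + b1)
    projected = hidden @ w2 + b2
    # Second residual connection and normalization
    return layer_norm(attended + projected)

rng = np.random.default_rng(3)
sequence_length, embed_dim, dense_dim = 5, 8, 16
x = rng.normal(size=(sequence_length, embed_dim))  # Random stand-in embeddings
w1 = rng.normal(size=(embed_dim, dense_dim)) * 0.1
b1 = np.zeros(dense_dim)
w2 = rng.normal(size=(dense_dim, embed_dim)) * 0.1
b2 = np.zeros(embed_dim)

encoded = encoder_block(x, w1, b1, w2, b2)
print(encoded.shape)  # (5, 8)
```

The output has the same shape as the input, which is what makes it possible to stack several such blocks.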

Self-attention is a set-processing mechanism, focused on the relationships between pairs of sequence elements – it's blind to whether these elements occur at the beginning, at the end, or in the middle of a sequence. Why is a Transformer called a sequence model then?

The Transformer is a hybrid between a _bag-of-words_ model and a _sequence_ model: self-attention itself is order-agnostic, and order information is injected into the architecture with _positional encoding_.

**POSITIONAL ENCODING**

The idea behind positional encoding is very simple: to give the model access to word-order information, we're going to add the word's position in the sentence to each word embedding. The input word embeddings have two components:
- _Word vector_ – Represents the word independently of any specific context
- _Position vector_ – Represents the position of the word in the current sentence.

One way to create position vectors is through cyclical functions such as cosines, whose values all lie between -1 and 1 and vary with the position. Another way is to learn position-embedding vectors the same way you would learn the word-embedding vectors. The latter is called _positional embedding_.
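
As an illustration of the first approach, here is a minimal NumPy sketch of the sine/cosine scheme from the original Transformer paper, where even feature indices use a sine and odd ones a cosine. The function name `sinusoidal_positional_encoding` is made up, the word embeddings are random stand-ins, and the sketch assumes an even `embed_dim`.

```python
import numpy as np

def sinusoidal_positional_encoding(sequence_length, embed_dim):
    """Return an array of shape (sequence_length, embed_dim); embed_dim must be even."""
    positions = np.arange(sequence_length)[:, np.newaxis]   # (sequence_length, 1)
    even_dims = np.arange(0, embed_dim, 2)[np.newaxis, :]   # (1, embed_dim / 2)
    # Each pair of features oscillates at its own frequency
    angles = positions / np.power(10000.0, even_dims / embed_dim)
    encoding = np.zeros((sequence_length, embed_dim))
    encoding[:, 0::2] = np.sin(angles)  # Even feature indices
    encoding[:, 1::2] = np.cos(angles)  # Odd feature indices
    return encoding

sequence_length, embed_dim = 10, 8
position_vectors = sinusoidal_positional_encoding(sequence_length, embed_dim)

rng = np.random.default_rng(4)
word_vectors = rng.normal(size=(sequence_length, embed_dim))  # Random stand-ins

# The model input is the sum of the word vector and the position vector
inputs = word_vectors + position_vectors
print(inputs.shape)  # (10, 8)
```

The same word placed at two different positions now receives two different input vectors, which gives the otherwise order-blind attention layers access to word order.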

## Beyond text classification: sequence-to-sequence learning
A sequence-to-sequence model takes a sequence as input and translates it into a different sequence. This is at the heart of many of the most successful applications in NLP:
- _Machine translation_ – Convert a paragraph in a source language to its equivalent in a target language.
- _Text summarization_ – Convert a long document to a shorter version that retains the most important information.
- _Text generation_ – Convert a text prompt into a paragraph that completes the prompt.

The general template behind sequence-to-sequence models has two phases. First, during training:
- An _encoder_ model turns the source sequence into an intermediate representation.
- A _decoder_ is trained to predict the next token `i` in the target sequence by looking at both previous tokens and the encoded source sequence.
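
One common way to set up this next-token training objective is to offset the target sequence by one step, so that the decoder input at position `i` is the token the decoder should have produced at position `i - 1`. The sketch below shows the idea in plain Python; the helper `make_decoder_pairs`, the `[start]` and `[end]` marker tokens, and the example sentence are assumptions made for illustration.

```python
def make_decoder_pairs(target_tokens, start_token="[start]", end_token="[end]"):
    """Build decoder inputs and the next-token targets they should predict."""
    full_sequence = [start_token] + target_tokens + [end_token]
    decoder_inputs = full_sequence[:-1]   # Everything except the last token
    decoder_targets = full_sequence[1:]   # The same sequence shifted by one step
    return decoder_inputs, decoder_targets

decoder_inputs, decoder_targets = make_decoder_pairs(["el", "gato", "duerme"])

# At step i the decoder sees tokens 0..i of the input and must predict target i
for i, target in enumerate(decoder_targets):
    print(decoder_inputs[: i + 1], "->", target)
```

The last line of the printout pairs the full input `['[start]', 'el', 'gato', 'duerme']` with the target `[end]`, which is how the decoder learns when to stop.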

During inference, we don't have access to the target sequence – we're trying to predict it from scratch. We'll have to generate it one token at a time:
1. We obtain the encoded source sequence from the encoder.
2. The decoder starts by looking at the encoded source sequence as well as an initial "seed" token, and uses them to predict the first real token in the sequence.
3. The predicted sequence so far is fed back into the decoder, which generates the next token, and so on, until it generates a stop token.
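
The three inference steps can be sketched as a simple loop. The encoder and decoder below are toy stand-ins (a tuple and a word-by-word lookup table) so that the example runs on its own; the names `encode`, `predict_next_token`, `greedy_decode` and the `[start]`/`[end]` tokens are assumptions made for illustration, not a trained model.

```python
TOY_TRANSLATIONS = {"the": "el", "cat": "gato", "sleeps": "duerme"}

def encode(source_tokens):
    # Stand-in encoder: a real one would return learned vector representations
    return tuple(source_tokens)

def predict_next_token(encoded_source, decoded_so_far):
    # Stand-in decoder: looks up the source word at the current position
    position = len(decoded_so_far) - 1  # Real tokens generated so far
    if position >= len(encoded_source):
        return "[end]"
    return TOY_TRANSLATIONS[encoded_source[position]]

def greedy_decode(source_tokens, max_length=20):
    encoded_source = encode(source_tokens)        # Step 1: encode the source once
    decoded = ["[start]"]                         # Step 2: begin with the seed token
    for _ in range(max_length):                   # Step 3: feed the output back in
        next_token = predict_next_token(encoded_source, decoded)
        if next_token == "[end]":                 # Stop token reached
            break
        decoded.append(next_token)
    return decoded[1:]                            # Drop the seed token

print(greedy_decode(["the", "cat", "sleeps"]))  # ['el', 'gato', 'duerme']
```

The `max_length` cap is a safety net for the case where the decoder never emits the stop token.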

Sequence-to-sequence learning is the task where the Transformer really shines. Neural attention enables Transformer models to successfully process sequences that are considerably longer and more complex than those RNNs can handle.

## Conclusions
- There are two kinds of NLP models: _bag-of-words_ models that process sets of words or N-grams without taking into account their order, and _sequence_ models that process word order. A bag-of-words model is made of `Dense` layers, while a sequence model could be an RNN, a 1D convnet, or a Transformer.
- _Word embeddings_ are vector spaces where semantic relationships between words are modeled as distance relationships between vectors that represent those words.
- _Sequence-to-sequence learning_ is a generic, powerful learning framework that can be applied to solve many NLP problems, including machine translation. A sequence-to-sequence model is made of an encoder, which processes a source sequence, and a decoder, which tries to predict future tokens in the target sequence by looking at past tokens, with the help of the encoder-processed source sequence.
- _Neural attention_ is a way to create context-aware word representations. It's the basis for the Transformer architecture.
- The Transformer architecture, which consists of a `TransformerEncoder` and a `TransformerDecoder`, yields excellent results on sequence-to-sequence tasks. The `TransformerEncoder` can also be used for text classification or any sort of single-input NLP task.