# Chapter 11 Deep Learning for text

## 11.1 Natural language processing(NLP): The bird’s eye view

+   Every machine language was $designed$: its
 starting point was a human engineer writing down a set of formal rules to describe
 what statements you could make in that language and what they meant

+   Machine-readable language is highly structured and rigorous, using precise syntactic
 rules to weave together exactly defined concepts from a fixed vocabulary, natural language is messy—ambiguous, chaotic, sprawling, and constantly in flux.

+   That’s what modern NLP is about: using machine learning and large datasets to
 give computers the ability not to understand language, which is a more lofty goal, but
 to ingest a piece of language as input and return something useful, like predicting the
 following
    1. “What’s the topic of this text?” ($text$ $classification$)
    
    2. “Does this text contain abuse?” ($content$ $filtering$)
    
    3. “Does this text sound positive or negative?” ($sentiment$ $analysis$)
    
    4. “What should be the next word in this incomplete sentence?” ($language$ $modeling$)
    
    5. “How would you say this in German?” ($translation$)
    
    6. “How would you summarize this article in one paragraph?” ($summarization$)
    
    7. etc.

+  they simply
 look for statistical regularities in their input data, which turns out to be sufficient to
 perform well on many simple tasks. In much the same way that computer vision is pattern recognition applied to pixels, NLP is pattern recognition applied to words, sentences, and paragraphs


+   Finally, around 2017–2018, a new architecture rose to replace RNNs: the __Transformer__, which you will learn about in the second half of this chapter. Transformers
 unlocked considerable progress across the field in a short period of time, and today
 most NLP systems are based on them.

## 11.2 Preparing text data

$Vectorizing$ text is the process of transforming text into numeric tensors.

1. $Standardize$:  First, you standardize the text to make it easier to process, such as by converting
 it to lowercase or removing punctuation.

2. $Tokenization$:  You split the text into units (called tokens), such as characters, words, or groups
 of words. This is called tokenization.  
 
3. $One-Hot-Encode$:  You convert each such token into a numerical vector. This will usually involve
 first indexing all tokens present in the data.

### 11.2.1 Text standardization

Text standardization is a basic form of feature engineering that aims to __erase
 encoding differences that you don’t want your model to have to deal with__

1.  __Convert to lowercase and remove punctuation characters__.
    

2.  __Stemming__: converting variations of a term (such as different conjugated forms of a verb) into a single shared representation.
    “caught” and “been catching” into “[catch]” or “cats” into “[cat]”.

### 11.2.2 Text splitting (tokenization)

1. __Word-level tokenization__ —Where tokens are space-separated (or punctuation-separated) substrings. A variant of this is to further split words into subwords
when applicable—for instance, treating “staring” as “star+ing” or “called” as
“call+ed.”

2. __N-gram tokenization__ —Where tokens are groups of N consecutive words. For
instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).


3. __Character-level tokenization__ —Where each character is its own token. In practice,
this scheme is rarely used, and you only really see it in specialized contexts, like
text generation or speech recognition.

There are two kinds of text-processing models: 
1. $Sequence$  $model$ :care about word __order__

2. $Bag-of-words$  $model$: treat input words as a set, __discarding their original order__

### 11.2.3 Vocabulary indexing

Once your text is split into tokens, you need to encode each token into a numerical
 representation.

the way you’d go about it is to build
 an index of all terms found in the training data (the “vocabulary”), and assign a
 unique integer to each entry in the vocabulary.

In [166]:
# vocabulary = {}
# for text in dataset: 
#     text = standardize(text)
#     tokens = tokenize(text)
#     for token in tokens:
#         if token is not in vocabulary:
#             vocabulary[token] = len(vocabulary)



You can then convert that integer into a vector encoding that can be processed by a
neural network, like a one-hot vector:

In [167]:
# import numpy as np 
# def one_hot_encode_token(token):
#     vetcor = np.zeros((len(vocabulary),))
#     token_index = vocabulary[token]
#     vetcor[token_index]
#     return vetcor

Note that at this step it’s common to restrict the vocabulary to only the top 20,000 or
 30,000 most common words found in the training data

The data you were using from keras.datasets.imdb was
 already preprocessed into sequences of integers, where each integer stood for a given
 word. 
 
Back then, we used the setting num_words=10000, in order to restrict our vocabulary to the __top 10,000__ most common words found in the training data.

when we look up a new token in our vocabulary index, it may not necessarily exist

Your training data may not have contained any instance of the word “cherimoya” (or maybe you
 excluded it from your index because it was too rare), so doing token_index =
 vocabulary["cherimoya"] may result in a KeyError. 
 
To handle this, you should use
 an __“out of vocabulary” index__ (abbreviated as $OOV$ index)—a catch-all for any token
 that wasn’t in the index. 
 
It’s usually __index 1__: you’re actually doing token_index =
 vocabulary.get(token, 1). When decoding a sequence of integers back into words,
 you’ll replace 1 with something like “[UNK]” (which you’d call an “OOV token”)

### 11.2.4 Using the TextVectorization layer

Every step I’ve introduced so far would be very easy to implement in pure Python.
Maybe you could write something like this

In [168]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [169]:


class Vectorizer:
    def standardize(self,text:str) -> str:
        text = text.lower()
        return"".join( char for char in text if char not in string.punctuation )

    def tokenize(self,text:str)->list:
        text = self.standardize(text)
        return text.split()
    
    def make_vocabulary(self,dataset):
        self.vocabulary = {"":0,"[UNK]":1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens :
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_volcabulary = dict ( (val,key) for key,val in self.vocabulary.items()   )
    
    def encode(self,text:str) -> list:
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token,1) for token in tokens ]

    def decode(self,int_sequence) -> str:
        return "".join(
            self.inverse_volcabulary.get(i,"[UNK]") for i in int_sequence
        )

In [170]:
vector = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms."
]

vector.make_vocabulary(dataset)
print(vector.vocabulary)

{'': 0, '[UNK]': 1, 'i': 2, 'write': 3, 'erase': 4, 'rewrite': 5, 'again': 6, 'and': 7, 'then': 8, 'a': 9, 'poppy': 10, 'blooms': 11}


In [171]:
test_sequence = "I write, rewrite, and still rewrite again."
encoded_sentence = vector.encode(test_sequence)
print(f"Encoded Sentence:\n{encoded_sentence}")

decoded_sentence = vector.decode(encoded_sentence)
print(f"Decoded Sentence\n{decoded_sentence}")

Encoded Sentence:
[2, 3, 5, 7, 1, 5, 6]
Decoded Sentence
iwriterewriteand[UNK]rewriteagain


In [172]:
# from sklearn.feature_extraction.text import CountVectorizer
# vect = CountVectorizer().fit_transform(vector.vocabulary) 
# vect = vect.toarray()
# print(vect)
# from sklearn.preprocessing import OneHotEncoder
# one = OneHotEncoder().fit_transform(vect)
# print(one)

However, using something like this wouldn’t be very performant. 

In practice, you’ll work with the Keras __TextVectorization__ layer, which is fast

In [173]:
from tensorflow import keras
from keras import layers

from keras.layers import TextVectorization

# Configures the layer to return sequences of words encoded 
# as integer indices. There are several other output modes 
# available, which you will see in action in a bit.
text_vectorization = TextVectorization(output_mode="int")


By default, the TextVectorization layer will use the setting
+ __convert to lowercase__ and __remove punctuation__ for text $standardization$, 
+  __split on whitespace__ for $tokenization$.

Note that
 such custom functions should operate on __tf.string tensors__, not regular Python
 strings!

In [174]:
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor:tf.strings) -> tf.strings:
    lowercase_string = tf.strings.lower(string_tensor)
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]",""
    )

def custom_split_fn(string_tensor:tf.strings) -> tf.strings:
    return tf.strings.split(string_tensor)


In [175]:
text_vectorization = TextVectorization(
    output_mode='int',
    standardize= custom_standardization_fn,
    split= custom_split_fn
)

To index the vocabulary of a text corpus, just call the __adapt() method__ of the layer
with a Dataset object that yields __strings__, or just with __a list of Python strings__

In [176]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms."
]

text_vectorization.adapt(dataset)

Note that you can __retrieve the computed vocabulary__ via __get_vocabulary()__—this can
 be useful if you need to convert text encoded as integer sequences back into words.

In [177]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

For demonstartion, let's try to encode and  decode the sentences

Listing 11.1 Displaying the vocabulary

In [178]:
vocalbuary= text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

inverse_volcab= dict( enumerate(vocalbuary))
decoded_sentence = "".join( inverse_volcab[int(i)] 
                            for i in encoded_sentence
    )
print(decoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
iwriterewriteand[UNK]rewriteagain


#### Using the __TextVectorization__ layer in a __tf.data__ pipeline or as part of a model

The  __TextVectorization__ __ONLY__ works on __CPU__


There are __Two__ ways to use __TextVectorization__ layer

1.  Put it in the __tf.data__ pipeline:
    
    + __int_sequence_dataset = string_dataset.map( 
            text_vectorization,
            num_parallel_calls=4)__
    
    +  while the GPU runs the model on one batch of vectorized data, the CPU stays busy by vectorizing the next batch of
        raw strings.
    
    + __Recommands for GPU version__



2.  Make it part of the model:
    +   __text_input = keras.Input(shape=(), dtype="string")__

        __vectorized_text = text_vectorization(text_input)__

        __embedded_input = keras.layers.Embedding(...)(vectorized_text)__

        __output = ...__

        __model = keras.Model(text_input, output)__

    +   This means that
        at each training step, the rest of the model (placed on the GPU) will have to wait for
        the output of the TextVectorization layer (placed on the CPU) to be ready in order
        to get to work

    +   __Not suitable for GPU version__

Thankfully, the TextVectorization layer enables you to include
 text preprocessing right into your model, making it easier to deploy—even if you were
 originally using the layer as part of a tf.data pipeline.

## 11.3 Two approaches for representing groups of words: Sets and Sequences


+ Set : bag-of-words

+ Sequence : Focus on __Order__ of sentenses

__Transformer__ architecture is technically order-agnostic, yet it injects word-position information into
 the representations it processes, which enables it to simultaneously look at different
 parts of a sentence (unlike RNNs) while still being order-aware. 
 

Because they take into  account word order, both RNNs and Transformers are called $sequence$ $models$.

### 11.3.1 Preparing the IMDB movie reviews data

+ Take a look at the content of a few of these text files. 

+ Whether you’re working with
 text data or image data, remember to always inspect what your data looks like before
 you dive into modeling it.

Setting apart 20% of the training text files

In [179]:
# import os, pathlib,shutil,random

# # Find path in the current folder
# base_dir  = pathlib.Path("E:\\Deep Learning with Python\\Datas\\Ch11_IMBD_RAW\\aclImdb_v1\\aclImdb")

# val_dir = base_dir / "val"
# train_dir = base_dir / "train"

# for category in ("neg","pos"):
#     os.makedirs(val_dir/category)

#     # Make a list of all file name in //pos and //neg
#     files  = os.listdir( train_dir / category )
#     # Shuffle the list
#     random.Random(1337).shuffle(files)

#     # Pick last 20% files from the list
#     num_val_sample = int(len(files) * 0.2)

#     #Create a list for validation data
#     val_files = files[-num_val_sample:]

#     ## Move them one by one
#     for fname in val_files:
#         shutil.move(  train_dir/category/fname, val_dir / category / fname   )

Use __text_dataset_from_directory()__ to create dataset

Remember to delete \\unsup folder in \\train

In [180]:
from tensorflow import keras
from keras.utils import text_dataset_from_directory
batch_size = 32
train_ds = text_dataset_from_directory(directory="E:\\Deep Learning with Python\\Datas\\Ch11_IMBD_RAW\\aclImdb_v1\\aclImdb\\train",batch_size= batch_size)
val_ds = text_dataset_from_directory(directory="E:\\Deep Learning with Python\\Datas\\Ch11_IMBD_RAW\\aclImdb_v1\\aclImdb\\val",batch_size= batch_size)
test_ds = text_dataset_from_directory(directory="E:\\Deep Learning with Python\\Datas\\Ch11_IMBD_RAW\\aclImdb_v1\\aclImdb\\test",batch_size= batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


Listing 11.2 Displaying the shapes and dtypes of the first batch

In [181]:
for inputs, targets in train_ds:
    print("Inputs shape {}".format(inputs.shape))
    print("Inputs dytpe {}".format(inputs.dtype))
    print("Inputs example {}".format(inputs[0]))

    print("Outputs shape {}".format(targets.shape))
    print("Outputs dytpe {}".format(targets.dtype))
    print("Outputs example {}".format(targets[0]))

    break

Inputs shape (32,)
Inputs dytpe <dtype: 'string'>
Inputs example b'It\'s good to see that Vintage Film Buff have correctly categorized their excellent DVD release as a "musical", for that\'s what this film is, pure and simple. Like its unofficial remake, Murder at the Windmill (1949), the murder plot is just an excuse for an elaborate girlie show with Kitty Carlisle and Gertrude Michael leading a cast of super-decorative girls including Ann Sheridan, Lucy Ball, Beryl Wallace, Gwenllian Gill, Gladys Young, Barbara Fritchie, Wanda Perry and Dorothy White. Carl Brisson is also on hand to lend his strong voice to "Cocktails for Two". Undoubtedly the movie\'s most popular song, it is heard no less than four times. However, it\'s Gertrude Michael who steals the show, not only with her rendition of "Sweet Marijauna" but her strong performance as the hero\'s rejected girlfriend. As for the rest of the cast, we could have done without Jack Oakie and Victor McLaglen altogether. The only good thi

### 11.3.2 Processing words as a set: The bag-of-words approach

The main advantage of this encoding is that you can represent an entire text as a single vector, where each entry is a presence indicator for a given word.

First, let’s process our raw text datasets with a TextVectorization layer so that
they yield multi-hot encoded binary word vectors. 

Our layer will only look at single
words (that is to say, $unigrams$).

Listing 11.3 Preprocessing our datasets with a TextVectorization layer

In [182]:
# Limit the vocabulary to the 20,000 most frequent words.
# Otherwise we’d be indexing every word in the training data—
# potentially tens of thousands of terms that only occur once or
# twice and thus aren’t informative. In general, 20,000 is the
# right vocabulary size for text classification
text_vectorization = TextVectorization(max_tokens=20000,output_mode="multi_hot",)

# Prepare a dataset that 
# only yields raw text 
# inputs (no labels).
text_only_train_ds = train_ds.map(lambda x,y:x)

# Use that dataset to index 
# the dataset vocabulary via 
# the adapt() method.
text_vectorization.adapt(text_only_train_ds)



In [183]:
binary_lgram_train_ds = train_ds.map(
    lambda x, y : (text_vectorization(x),y), num_parallel_calls = 4
)

binary_lgram_val_ds = val_ds.map(
    lambda x, y : (text_vectorization(x),y), num_parallel_calls = 4
)

binary_lgram_test_ds = test_ds.map(
    lambda x, y : (text_vectorization(x),y), num_parallel_calls = 4
)

Listing 11.4 Inspecting the output of our binary unigram dataset

In [184]:
for inputs,targets in binary_lgram_train_ds:
    print("Inputs shape {}".format(inputs.shape))
    print("Inputs dytpe {}".format(inputs.dtype))
    print("Inputs example {}".format(inputs[0]))

    print("Outputs shape {}".format(targets.shape))
    print("Outputs dytpe {}".format(targets.dtype))
    print("Outputs example {}".format(targets[0]))

    break

Inputs shape (32, 20000)
Inputs dytpe <dtype: 'float32'>
Inputs example [1. 1. 1. ... 0. 0. 0.]
Outputs shape (32,)
Outputs dytpe <dtype: 'int32'>
Outputs example 1


Next, let’s write a reusable model-building function that we’ll use in all of our experiments in this section.

Listing 11.5 Our model-building utility

In [185]:
from keras import layers

def get_model(max_tokens= 20000, hidden_dim = 16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim,activation='relu')(inputs)
    x  = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1,activation="sigmoid")(x)
    model = keras.Model(inputs,outputs)
    model.compile (
        optimizer = keras.optimizers.RMSprop(),
        loss = keras.losses.BinaryCrossentropy(),
        metrics = ['accuracy']
    )
    return model

Listing 11.6 Training and testing the binary unigram model

In [186]:
model = get_model()
model.summary()

Model: "model_22"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_24 (InputLayer)       [(None, 20000)]           0         
                                                                 
 dense_28 (Dense)            (None, 16)                320016    
                                                                 
 dropout_20 (Dropout)        (None, 16)                0         
                                                                 
 dense_29 (Dense)            (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________


In [187]:
from keras.callbacks import ModelCheckpoint
callbacks = [ModelCheckpoint(
            filepath="E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\binary_lgrm.keras",
            save_best_only=True
)]


In [188]:
# We call cache() on the 
# datasets to cache them in 
# memory: this way, we will 
# only do the preprocessing 
# once, during the first 
# epoch, and we’ll reuse the 
# preprocessed texts for the 
# following epochs. This can 
# only be done if the data 
# is small enough to fit in 
# memory
# model.fit( binary_lgram_train_ds.cache(), epochs= 10, callbacks=callbacks,validation_data=binary_lgram_val_ds.cache())


In [189]:
# model = keras.models.load_model("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\binary_lgrm.keras")
# print("Test Acc = {:.3f}".format(model.evaluate(binary_lgram_test_ds)[1]))

##### BIGRAMS With binary ENCODING

Listing 11.7 Configuring the TextVectorization layer to return bigrams

In [190]:
text_vectorization = TextVectorization(ngrams=2, max_tokens=20000,output_mode="multi_hot")


Listing 11.8 Training and testing the binary bigram model

In [191]:
text_vectorization.adapt(text_only_train_ds)

binary_2gram_train_ds = train_ds.map(
    lambda x,y: (text_vectorization(x),y),num_parallel_calls = 4
)


binary_2gram_val_ds = val_ds.map(
    lambda x,y: (text_vectorization(x),y),num_parallel_calls = 4
)

binary_2gram_test_ds = test_ds.map(
    lambda x,y: (text_vectorization(x),y),num_parallel_calls = 4
)


In [192]:
model = get_model()

from keras.callbacks import ModelCheckpoint
callbacks = [ModelCheckpoint(
            filepath="E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\binary_2grm.keras",
            save_best_only=True
)]

In [193]:
# model.fit( binary_2gram_train_ds.cache(), epochs= 10, callbacks=callbacks,validation_data=binary_2gram_val_ds.cache())

In [194]:
# model = keras.models.load_model("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\binary_2grm.keras")
# print("Test Acc = {:.3f}".format(model.evaluate(binary_2gram_test_ds)[1]))

#### BIGRAMS WITH TF-IDF ENCODING

You can also add a bit more information to this representation by __counting how many
times each word or N-gram occurs__, that is to say, by taking the histogram of the words
over the text

Listing 11.9 Configuring the TextVectorization layer to return token counts

In [195]:
text_vectorization = TextVectorization(
    ngrams=2,max_tokens=20000,output_mode="count"
)

+ To filter those uninformative words, we should "normalize" the  wordcounts by "sparsity"

+ The best practise is go with __TF-IDF__ : $Term$ $Frequency$ -  $Inverse$ $Document$ $Frequency$ 

TF-IDF is so common that it’s built into the __TextVectorization__ layer. 

All you need
 to do to start using it is to switch the __output_mode__ argument to __"tf_idf"__.

#### Understanding TF-IDF normalization

It weights a given term by taking “term frequency,” how many times the term appears in the
 current document, and dividing it by a measure of “document frequency,” which estimates how often the term comes up across the dataset

In [196]:
def tf_idf(term,document,dataset):
    import math
    term_freq= document.count(term)
    doc_freq = math.log(  (sum(doc.count(term)) for doc in dataset) +1 )
    return term_freq/doc_freq

Listing 11.10 Configuring TextVectorization to return TF-IDF-weighted outputs

In [197]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens= 20000,
    output_mode="tf_idf"
)

In [198]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x,y : (text_vectorization(x),y), num_parallel_calls=4
)


tfidf_2gram_val_ds = val_ds.map(
    lambda x,y : (text_vectorization(x),y), num_parallel_calls=4
)

tfidf_2gram_test_ds = test_ds.map(
    lambda x,y : (text_vectorization(x),y), num_parallel_calls=4
)

In [199]:
model= get_model()
from keras.callbacks import ModelCheckpoint
callbacks = [ModelCheckpoint(
            filepath="E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\tfidf_2grm.keras",
            save_best_only=True
)]

In [200]:
# model.fit( tfidf_2gram_train_ds.cache(), epochs= 10, callbacks=callbacks,validation_data=tfidf_2gram_val_ds.cache())

In [201]:
# model = keras.models.load_model("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\tfidf_2grm.keras")
# print("Test Acc = {:.3f}".format(model.evaluate(tfidf_2gram_test_ds)[1]))

This gets us an 88.5% test accuracy on the IMDB classification task: it doesn’t seem to
 be particularly helpful in this case. 
 
 However, for many text-classification datasets, it
 would be typical to see __a one-percentage-point increase when using TF-IDF__ compared
 to plain binary encoding.

#### Exporting a model that processes raw strings

__Create a new model to reuse your trained vectorlization layer__

In [202]:
inputs = keras.Input(shape=(1,),dtype="string")
processed_inputs = text_vectorization(inputs)

## Apply the previously trained model which yields outputs 
outputs  = model(processed_inputs)
inference_model = keras.Model(inputs,outputs)



In [203]:
import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I love it!"]
])

predictions = inference_model(raw_text_data)
print("Input sentence : {}".format(raw_text_data[0][0]))
print("Positive proportion= {:.3f} %".format(float(predictions[0]*100)))

Input sentence : b'That was an excellent movie, I love it!'
Positive proportion= 48.783 %


### 11.3.3 Processing words as a sequence: The sequence model approach

$Sequence Model$ : Exposed the model to raw word sequences and let it figure out such features on its own

To implement a sequence model, 

1. Start by representing your input samples as
 __sequences of integer indices__ (one integer standing for one word). 
 
2. Then, you’d map each integer to a vector to obtain vector sequences. 

3. Finally, you’d feed these
 sequences of vectors into a stack of layers that could cross-correlate features from adjacent vectors, such as a 1D convnet, a RNN, or a Transformer.

__A residual stack of depth wise-separable 1D convolutions__ can often achieve comparable performance to __a bidirectional LSTM__, at a greatly reduced computational cost.

#### First Practical Example

Listing 11.12 Preparing integer sequence datasets

In [204]:
max_length = 600
max_tokens = 20000

text_vectorization = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x,y : (text_vectorization(x),y), num_parallel_calls =4
)


int_val_ds = val_ds.map(
    lambda x,y : (text_vectorization(x),y), num_parallel_calls =4
)



int_test_ds = test_ds.map(
    lambda x,y : (text_vectorization(x),y), num_parallel_calls =4
)

Listing 11.13 A sequence model built on one-hot encoded vector sequences

+ The simplest way to convert our integer sequences to vector
 sequences is to one-hot encode the integers (each dimension would represent one
 possible term in the vocabulary). 
 
+ On top of these one-hot vectors, we’ll add a simple
 bidirectional LSTM.

In [205]:
inputs = keras.Input(shape=(None,),dtype="int64")
embedded = tf.one_hot(inputs,depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1,activation="sigmoid")(x)
model = keras.Model(inputs,outputs)
model.compile(
    optimizer = 'rmsprop',
    loss = keras.losses.BinaryCrossentropy(),
    metrics = ["accuracy"]
)
model.summary()

Model: "model_26"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_28 (InputLayer)       [(None, None)]            0         
                                                                 
 tf.one_hot_2 (TFOpLambda)   (None, None, 20000)       0         
                                                                 
 bidirectional_13 (Bidirecti  (None, 64)               5128448   
 onal)                                                           
                                                                 
 dropout_23 (Dropout)        (None, 64)                0         
                                                                 
 dense_34 (Dense)            (None, 1)                 65        
                                                                 
Total params: 5,128,513
Trainable params: 5,128,513
Non-trainable params: 0
________________________________________________

Listing 11.14 Training a first basic sequence model

In [206]:
from keras.callbacks import ModelCheckpoint
callbacks = [ModelCheckpoint(
            filepath="E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\one_hot_bidir_lstm.keras",
            save_best_only=True
)]

In [207]:
# model.fit( int_train_ds, epochs= 10, callbacks=callbacks,validation_data=int_val_ds)

In [208]:
# model = keras.models.load_model("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\one_hot_bidir_lstm.keras")
# print("Test Acc = {:.3f}".format(model.evaluate(tfidf_2gram_test_ds)[1]))

The results will not give any improvement thus we do not need to run them. 

Let's focus on the embedded method !

#### Understanding Word Embeddings

+ $One-hot-encode$ : 
    1. Sparse
    2. Assume all the tokens are independent to each other (which is not true for words)
    3. Usually in very high dimension

+ $Embedding$:
    1. Map human language into a __geometry structure__
    2. Similar word get similar __locations__ and __directions__
    3. Usually in low dimension
    

Two ways to obtain word embeddings:

1. Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with ran dom word vectors and then learn word vectors in the same way you learn the
 weights of a neural network.

2. Load into your model word embeddings that were precomputed using a different machine learning task than the one you’re trying to solve. These are called
 pretrained word embeddings.

##### LEARNING WORD EMBEDDINGS WITH THE EMBEDDING LAYER

Listing 11.15 Instantiating an Embedding layer

In [209]:
embedding_layer = layers.Embedding(input_dim= max_tokens,output_dim=256)


+ It takes __integers__ as input, looks up
 these integers in an internal dictionary, and returns the associated vectors.

+ The Embedding layer takes as input a __rank-2 tensor of integers__, of shape __(batch_size, sequence_length)__,

+ The layer then returns
 a 3D floating-point tensor of shape __(batch_size, sequence_length, embedding_dimensionality)__

+ When you instantiate an Embedding layer, its weights (its internal dictionary o 
 token vectors) are initially random, just as with any other layer.

+ Once fully trained, the embedding
 space will show a lot of structure—a kind of structure specialized for the specific prob lem for which you’re training your model.

Listing 11.16 Model that uses an Embedding layer trained from scratch

In [210]:
inputs = keras.Input(shape=(None,),dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens,output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1,activation='sigmoid')(x)
model = keras.Model(inputs,outputs)

model.compile(
    optimizer = 'rmsprop',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

callbacks = [ModelCheckpoint(
        filepath="E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\embeddings_bidir_gru.keras",
        save_best_only=True
)]




In [211]:
model.summary()

Model: "model_27"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_29 (InputLayer)       [(None, None)]            0         
                                                                 
 embedding_18 (Embedding)    (None, None, 256)         5120000   
                                                                 
 bidirectional_14 (Bidirecti  (None, 64)               73984     
 onal)                                                           
                                                                 
 dropout_24 (Dropout)        (None, 64)                0         
                                                                 
 dense_35 (Dense)            (None, 1)                 65        
                                                                 
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
________________________________________________

In [212]:
# model.fit(int_train_ds,validation_data=int_val_ds,epochs=10,callbacks=callbacks,verbose=0)

In [213]:
# model = keras.models.load_model("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\embeddings_bidir_gru.keras")
# print("Test Acc = {:.3f}".format(model.evaluate(int_test_ds)[1]))

##### UNDERSTANDING PADDING AND MASKING

+ The main problem of previous example:

    This comes from our use of the output_sequence_length=max_
    length option in TextVectorization (with max_length equal to 600): sentences longer than 600 tokens are truncated to a length of 600 tokens, and sentences shorter
    than 600 tokens are padded with zeros at the end so that they can be concatenated
    together with other sequences to form contiguous batches.

+ The RNN that looks at the tokens in their natural order will spend
 its last iterations seeing only vectors that encode padding—possibly for several hundreds of iterations if the original sentence was short. The information stored in the
 internal state of the RNN will gradually fade out as it gets exposed to these meaningless inputs.

+ We need $masking$ to tell RNN where should stop the iteration.

+ You can retrive that by passing __mask_zero=True__ in the __Embedded()__ layer

+ By calling __.compute_mask()__ to retrive the mask

In [214]:
embedding_layer =layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)
some_input = [
 [4, 3, 2, 1, 0, 0, 0],
 [5, 4, 3, 2, 1, 0, 0],
 [2, 1, 0, 0, 0, 0, 0]]
mask = embedding_layer.compute_mask(some_input)
print(mask)

tf.Tensor(
[[ True  True  True  True False False False]
 [ True  True  True  True  True False False]
 [ True  True False False False False False]], shape=(3, 7), dtype=bool)


Listing 11.17 Using an Embedding layer with masking

In [215]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
 input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
 loss="binary_crossentropy",
 metrics=["accuracy"])
model.summary()

Model: "model_28"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_30 (InputLayer)       [(None, None)]            0         
                                                                 
 embedding_20 (Embedding)    (None, None, 256)         5120000   
                                                                 
 bidirectional_15 (Bidirecti  (None, 64)               73984     
 onal)                                                           
                                                                 
 dropout_25 (Dropout)        (None, 64)                0         
                                                                 
 dense_36 (Dense)            (None, 1)                 65        
                                                                 
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
________________________________________________

##### Using Pretrained Word Embeddings

+ Sometimes your dataset is too small for generating embeddings

+ There are various precomputed databases of word embeddings that you can download and use in a Keras __Embedding__ layer

+ Popular : $Glove$ and $Word2Vec$

First, let’s download the GloVe word embeddings precomputed on the 2014
English Wikipedia dataset.

Let’s parse the unzipped file (a .txt file) to build an index that maps words (as strings)
to their vector representation.

Listing 11.18 Parsing the GloVe word-embeddings file

In [216]:
import numpy as np

path_to_glove_file = "E:\\Deep Learning with Python\\Datas\\Ch11_IMBD_RAW\\glove.6B\\glove.6B.100d.txt"

embedding_index = {}

with open(path_to_glove_file,encoding='utf-8') as f:
    read = f.readlines()
    for line in read:
        word,coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs,"f",sep=" ")
        embedding_index[word] = coefs
f.close()
print("Found {} word vectors".format(len(embedding_index)) )

Found 400000 word vectors


Next, let’s build an embedding matrix that you can load into an Embedding layer. It
must be a matrix of shape (max_words, embedding_dim), where each entry i contains
the embedding_dim-dimensional vector for the word of index i in the reference word
index (built during tokenization)

Listing 11.19 Preparing the GloVe word-embeddings matrix

In [217]:
embedding_dim = 100

# Retrieve the vocabulary indexed by
# our previous TextVectorization layer.
vocabulary = text_vectorization.get_vocabulary()

# Use it to create a 
# mapping from words 
# to their index in the 
# vocabulary
word_index =dict(zip(vocabulary,range(len(vocabulary))))

# Prepare a matrix 
# that we’ll fill with 
# the GloVe vectors.
embedding_matrix = np.zeros((max_tokens,embedding_dim))



# Fill entry i in the matrix with the 
# word vector for index i. Words 
# not found in the embedding 
# index will be all zeros.
for word, i in word_index.items():
    if i<max_tokens:
        embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Finally, we use a Constant initializer to load the pretrained embeddings in an Embedding
 layer. So as not to disrupt the pretrained representations during training, we __freeze the layer via trainable=False:__

In [218]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer= keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True
)

Listing 11.20 Model that uses a pretrained Embedding layer

In [219]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
 loss="binary_crossentropy",
 metrics=["accuracy"])
model.summary()
callbacks = [
 keras.callbacks.ModelCheckpoint("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\glove_embeddings_sequence_model.keras",
 save_best_only=True) ]
# model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
#  callbacks=callbacks)
# model = keras.models.load_model("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\glove_embeddings_sequence_model.keras")
# print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_29"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_31 (InputLayer)       [(None, None)]            0         
                                                                 
 embedding_21 (Embedding)    (None, None, 100)         2000000   
                                                                 
 bidirectional_16 (Bidirecti  (None, 64)               34048     
 onal)                                                           
                                                                 
 dropout_26 (Dropout)        (None, 64)                0         
                                                                 
 dense_37 (Dense)            (None, 1)                 65        
                                                                 
Total params: 2,034,113
Trainable params: 34,113
Non-trainable params: 2,000,000
___________________________________________

## 11.4 The Transformer architecture

+ Transformer is overtaking the RNN these years.

+ The Machanism behind Transformer : $Neural$ $Attention$

### 11.4.1 Understanding self-attention

+ Simple bu Powerful idea : Pay attention to some features , not all of them.

+ Some similar concepts: 
    1. Maxpooling
    2. TF-IDF normalization : A continuous of attention.

+ It can be made for $text-aware$ since the same word in different sentence has different meaning.

+ $self-attention$: To modulate the representation of tokens by using the representation of related tokens in sequences.

+ Step of $self-attention$: 
    1. Compute relevancy scores between the vector for “station” and every other word in the sentence.
    2. Compute the sum of all word vectors in the sentence, weighted by our relevancy scores
    3. The resulting vector is our new representation of the specific word


Let's see the pesudocode of self-attention

In [220]:
def self_attention(input_sequence:np.array):
    output = np.zeros(shape=input_sequence.shape)

    # Iterate over each token in the input sequence.
    for i, pivot_vector in enumerate(input_sequence):
        
        scores = np.zeros(shape=(len(input_sequence)))
        
        #  Compute the dot  product (attention  score) between the token and every  other token
        for j, vector in enumerate(input_sequence):
            scores[j] = np.dot(pivot_vector,vector.T )

        # Scale by a normalization factor, and apply a softmax.    
        scores /= np.sqrt(input_sequence.shape[1])
        scores = softmax(scores)


        new_pivot_representation = np.zeros(shape=(pivot_vector.shape))

        for j, vector in enumerate(input_sequence):

            # Take the sum of all tokens weighted by the attention scores.
            new_pivot_representation  += vector * scores[j]
        
        output[i] = new_pivot_representation
    
    return output



In practise it looks like following: 

1. Why are we passing the inputs to the layer three times? That seems redundant.

2. What are these “multiple heads” we’re referring to? That sounds intimidating—
do they also grow back if you cut them?

In [221]:
# num_heads = 4
# embed_dim = 256
# inputs = keras.Input(shape=(None,))
# mha_layer = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
# outputs = mha_layer(inputs, inputs, inputs)

#### GENERALIZED SELF-ATTENTION: THE QUERY-KEY-VALUE MODEL

+ A Transformer is a $sequence-to-sequence$ model: it was designed to
 convert one sequence into another

+ The meaning of $self-attention$: This means “for each token in inputs (A), compute how much the token is related to
 every token in inputs (B), and use these scores to weight a sum of tokens from
 inputs (C).”
 
    __outputs = sum( inputsC * pairwise_scores( inputsA,inputsB ))__

+  Inputs A :  $query$ : Something you are looking for
   
   Inputs B :  $key$ : Assigned to each value that describes the value in a format that can be readily compared to a $query$.

   Inputs C : $values$ : A body of knowledge that you are trying to extract information from


   __outputs = sum( values * pairwise_scores( query,key ))__

+ In practice, the $keys$ and the $values$ are often the __same sequence__

+ That explains why we needed to pass inputs three times to our MultiHeadAttention
layer.


### 11.4.2 Multi-head attention

+ $Multi-head$:   Output space of the self-attention layer gets factored into a set of independent subspaces, learned separately.
    
    1. The initial query, key, and value are sent through __three independent sets of dense projections__, resulting in __three separate vectors__.

    2. Each vector is processed via neural attention, and the __three outputs__ are __concatenated__ back together into a single output sequence.

    3. Each dense project is called $head$

+ Advantages : 

    1. The learnabel Dense projections make sure the layer actually learn something.

    2. Helps layer learn different features of tokens

+ Similar to the $Depthwise$ $Separable$ $Convolutions$ 

### 11.4.3 The Transformer Encoder

+ Factoring outputs into multiple independent spaces 
+ adding residual connections
+ adding normalization layers

+ The original Transformer : 
    1. A $Transformer$ $encoder$ that processes the source sequence 
    2. A $Transformer$ $decoder$ that uses the source sequence to generate a translated version.

Listing 11.21 Transformer encoder implemented as a subclassed Layer

In [284]:
class TransformerEncoder(layers.Layer):
    def __init__(self,embed_dim,dense_dim,num_heads,**kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads,key_dim=embed_dim
        )
        self.dens_proj = keras.Sequential([
            layers.Dense(dense_dim,activation='relu'),
            layers.Dense(embed_dim,)
        ])
        self.layernorm_1 = layers.LayerNormalization() 
        self.layernorm_2 = layers.LayerNormalization()

    def call(self,inputs,mask=None):

        # The mask that will be generated by 
        # the Embedding layer will be 2D, but 
        # the attention layer expects to be 3D 
        # or 4D, so we expand its rank.
        if mask is not None:
            mask= mask[:,tf.newaxis,:]
        
        attention_output = self.attention(inputs,inputs,attention_mask = mask)

        ## 1st Normalization and Residual
        proj_input = self.layernorm_1(inputs + attention_output)

        ## Dense layers
        proj_output = self.dens_proj(proj_input)

        ## 2nd Normalization & Residual
        return self.layernorm_2(proj_input+proj_output)
        
    # Implement 
    # serialization so 
    # we can save the 
    # model.

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim":self.embed_dim,
            "num_heads" : self.num_heads,
            "dense_dim" : self.dense_dim
        })
        return config


#### Save Custom Layers

In [229]:
# layer = layers.PositionalEmbedding(sequence_length, input_dim, output_dim)
# config = layer.get_config()
#  new_layer = PositionalEmbedding.from_config(config)

When you are loading a model from checkpoint file, you should provide custom layer classes to it.

In [230]:
# model = keras.models.load_model(
#  filename, custom_objects={"PositionalEmbedding": PositionalEmbedding})

We notice that in the __TransformEncoder__ we use __layernormalization__ rather __batchnormalization__. 

+ __BatchNormalization__ collects information from many samples to obtain accurate statistics for the feature means and variances, 

+ __LayerNormalization__ pools data within each sequence separately, which is more appropriate for sequence data.

Let's check the difference between them by pseudocode

In [231]:
def layer_normalization(batch_of_sequences):
    mean = np.mean( batch_of_sequences,keepdims=True,axis=-1 )
    variance  = np.var(batch_of_sequences,keepdims=True,axis=-1)
    return (batch_of_sequences-mean) / variance

def batch_normalization(batch_of_images):
    mean = np.mean( batch_of_images,keepdims=True,axis=(0,1,2) )
    variance  = np.var(batch_of_images,keepdims=True,axis=(0,1,2))
    return (batch_of_images-mean) / variance

Listing 11.22 Using the Transformer encoder for text classification

In [232]:
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,),dtype="int64")
x =layers.Embedding(vocab_size,embed_dim)(inputs)
x = TransformerEncoder(embed_dim=embed_dim,dense_dim=dense_dim,num_heads=num_heads)(x)
x = layers.GlobalMaxPool1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1,activation="sigmoid")(x)
model = keras.Model(inputs,outputs)

model.compile(
    optimizer = 'rmsprop',
    loss = "binary_crossentropy",
    metrics = ['accuracy']
)

model.summary()

Model: "model_31"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_33 (InputLayer)       [(None, None)]            0         
                                                                 
 embedding_23 (Embedding)    (None, None, 256)         5120000   
                                                                 
 transformer_encoder_2 (Tran  (None, None, 256)        543776    
 sformerEncoder)                                                 
                                                                 
 global_max_pooling1d_2 (Glo  (None, 256)              0         
 balMaxPooling1D)                                                
                                                                 
 dropout_28 (Dropout)        (None, 256)               0         
                                                                 
 dense_43 (Dense)            (None, 1)                 257

Listing 11.23 Training and evaluating the Transformer encoder based model

In [233]:
# callbacks = [
#  keras.callbacks.ModelCheckpoint("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\transformer_encoder.keras",
#  save_best_only=True) ]
 
# model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
#  callbacks=callbacks)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x22d2f279250>

In [234]:
# model = keras.models.load_model(
#  "E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\transformer_encoder.keras",
#  custom_objects={"TransformerEncoder": TransformerEncoder}) 

# print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Test acc: 0.864


The score is even worse , that is because what we gave was __NOT a Sequence__

I mentioned in passing that Transformer was a hybrid approach that is technically order-agnostic, but that manually injects order information in the representations it processes. This is the missing ingredient! It’s called positional encoding. Let’s take a look.

#### USING POSITIONAL ENCODING TO RE-INJECT ORDER INFORMATION

+ Idea :  Give the model access to __WordOorder information__

+ Our word embedding will contain two vetcors:
    1. The usual word vector
    2. The position vector

+ The position vector can have a very large scale.

+  we’ll learn position embedding vectors the same way we learn to embed word indices.

Listing 11.24 Implementing positional embedding as a subclassed layer

In [285]:
class PositionalEmbedding(layers.Layer):
    def __init__(self,sequence_length,input_dim,output_dim, **kwargs):
        super().__init__(**kwargs)

        # Embedding for words
        self.token_embeddings = layers.Embedding(input_dim=input_dim,output_dim=output_dim)

        #Embedding for positions
        self.position_embeddings = layers.Embedding(input_dim=sequence_length,output_dim=output_dim)

        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim
    
    def call(self,inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0,limit=length,delta=1)
        embedded_tokens =self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens+embedded_positions
    
    def compute_mask(self,inputs,mask=None):
        return tf.math.not_equal(inputs,0)


    def get_config(self):
        config = super().get_config()
        config.update({
                "output_dim":self.output_dim,
                "input_dim":self.input_dim,
                "sequence_length":self.sequence_length
        })
        return config


In [239]:
# Validating addition of Embedding layers
# inputs = keras.Input(shape=(None,))
# x1 = layers.Embedding(3,3)(inputs)
# x2= layers.Embedding(3,3)(inputs)
# x3 = x1+x2

#### Putting all of them togather!

In [242]:
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,),dtype="int64")
x = PositionalEmbedding(sequence_length,vocab_size,embed_dim)(inputs)
x = TransformerEncoder(embed_dim,dense_dim,num_heads)(x)
x = layers.GlobalMaxPool1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1,activation="sigmoid")(x)

model = keras.Model(inputs,outputs)

model.compile(
    optimizer = 'rmsprop',
    loss = 'binary_crossentropy',
    metrics =['accuracy']
)

model.summary()

Model: "model_33"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_36 (InputLayer)       [(None, None)]            0         
                                                                 
 positional_embedding_1 (Pos  (None, None, 256)        5273600   
 itionalEmbedding)                                               
                                                                 
 transformer_encoder_4 (Tran  (None, None, 256)        543776    
 sformerEncoder)                                                 
                                                                 
 global_max_pooling1d_4 (Glo  (None, 256)              0         
 balMaxPooling1D)                                                
                                                                 
 dropout_30 (Dropout)        (None, 256)               0         
                                                          

In [243]:
# callbacks = [
#  keras.callbacks.ModelCheckpoint("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\full_transformer_encoder.keras",
#  save_best_only=True)
# ] 
# model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, 
# callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x22cf4147c40>

In [244]:
# model = keras.models.load_model(
#  "E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\full_transformer_encoder.keras",
#  custom_objects={"TransformerEncoder": TransformerEncoder,
#  "PositionalEmbedding": PositionalEmbedding}) 
# print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Test acc: 0.884


### 11.4.4 When to use sequence models over bag-of-words models

$ratio$ =${Number-of-Samples}\over{Mean-Sample-Length}$

+ When $ratio$ > 1500 : __Sequence Model__

+ When $ratio$ < 1500 : __Bag_of_bigrams__ 

## 11.5 Beyond text classification: Sequence-to-sequence learning

+ A general model of sequnence-to-sequence model :
    + An $encoder$ model turns the source sequence into an intermediate representation.
    
    + A $decoder$ is trained to predict the next token i in the target sequence by looking at both previous tokens (0 to i - 1) and the encoded source sequence.

+ During inference, we don’t have access to the target sequence—we’re trying to predict it from scratch. We’ll have to generate it one token at a time:
    1. We obtain the encoded source sequence from the encoder.

    2. The decoder starts by looking at the encoded source sequence as well as an initial “seed” token (such as the string "[start]"), and uses them to predict the
first real token in the sequence

    3. The predicted sequence so far is fed back into the decoder, which generates the
 next token, and so on, until it generates a stop token (such as the string
 "[end]")

### 11.5.1 A machine translation example

In [246]:
text_file = "E:\\Deep Learning with Python\Datas\\Ch11_IMBD_RAW\\spa-eng\\spa-eng\\spa.txt"
with open(text_file,encoding='utf-8') as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []

for line in lines:
    english,spanish = line.split("\t")
    spanish = "[start]" + spanish + "[end]"
    text_pairs.append((english,spanish))


In [247]:
import random
print(random.choice(text_pairs))

('On my way here, the strong wind blew my umbrella inside out.', '[start]De camino aquí una fuerte ráfaga de aire me dio la vuelta al paraguas.[end]')


In [248]:
random.shuffle(text_pairs)
num_val_samples = int(0.15*len(text_pairs))
num_train_samples = len(text_pairs) - 2*num_val_samples

train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples+num_val_samples]
test_pairs = text_pairs[num_train_samples+num_val_samples:]

Next, let’s prepare two separate TextVectorization layers: one for English and one
 for Spanish:

1. We need to preserve the "[start]" and "[end]" tokens that we’ve inserted. By
 default, the characters [ and ] would be stripped, but we want to keep them
 around so we can tell apart the word “start” and the start token "[start]".  

2. Punctuation is different from language to language! In the Spanish Text Vectorization layer, if we’re going to strip punctuation characters, we need to also strip the character ¿.

Listing 11.26 Vectorizing the English and Spanish text pairs

In [250]:
import string 
import re
strip_chars = string.punctuation+ "¿"
strip_chars = strip_chars.replace("[", "") 
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string): 
    lowercase = tf.strings.lower(input_string) 
    
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")


In [257]:
vocab_size = 15000 
sequence_length = 20 
#English Layer
source_vectorization = layers.TextVectorization( 
 max_tokens=vocab_size,
 output_mode="int",
 output_sequence_length=sequence_length,
)

# Spanish Layer
target_vectorization = layers.TextVectorization( 
 max_tokens=vocab_size,
 output_mode="int",
#  Generate Spanish sentences 
# that have one extra token, 
# since we’ll need to offset 
# the sentence by one step 
# during training.
 output_sequence_length=sequence_length + 1, 
 standardize=custom_standardization,
)


train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
source_vectorization.adapt(train_english_texts) 
target_vectorization.adapt(train_spanish_texts)

Listing 11.27 Preparing datasets for the translation task

In [268]:

batch_size = 64

def format_dataset(eng,spa):
    eng = source_vectorization(eng)
    spa = target_vectorization(spa)
    return({
        "english":eng,
        "spanish":spa[:,:-1]
            } ,
            spa[:,1:]
            )

def make_dataset(pairs):
    eng_texts,spa_texts = zip(*pairs)
    eng_texts  = list(eng_texts)
    spa_texts  = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts,spa_texts))
    dataset = dataset.batch(batch_size=batch_size)
    dataset = dataset.map(format_dataset,num_parallel_calls=4)

    return dataset.shuffle(2049).prefetch(16).cache()

    

In [269]:
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [270]:

for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets.shape: {targets.shape}")

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)


### 11.5.2 Sequence-to-sequence learning with RNNs

+ The simplest way to use RNN:


    inputs = keras.Input(shape=(sequence_length,), dtype="int64")

    x = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)

    x = layers.LSTM(32, return_sequences=True)(x)

    outputs = layers.Dense(vocab_size, activation="softmax")(x)
    
    model = keras.Model(inputs, outputs)

    __Probelm__:
    
        1. The target sequence must always be the same length as the source sequence.
        
        2. Due to the step-by-step nature of RNNs, the model will only be looking at
        tokens 0…N in the source sequence in order to predict token N in the target
        sequence.

+ A proper way to use RNN:

    1. Use an RNN (the encoder) to turn the entire source sequence into a single vector (or set of vectors).

    2. Then you would use this vector (or vectors) as the $initial$ $state$ of another RNN (the decoder), which would look at elements 0…N in the target sequence, and
    try to predict step N+1 in the target sequence.

Listing 11.28 GRU-based encoder

In [271]:
embed_dim = 256
latent_dim = 1024

#Specific the Name
source = keras.Input(shape=(None,),dtype="int64",name="english")
## Do Not forget the mask here !!!!
x = layers.Embedding(vocab_size,embed_dim,mask_zero=True)(source)
encoded_source = layers.Bidirectional(
    layers.GRU(latent_dim), merge_mode="sum")(x)



Listing 11.29 GRU-based decoder and the end-to-end model

In [273]:
##Spanish target goes here

past_target = keras.Input(shape=(None,),dtype="int64",name="spanish")
x = layers.Embedding(vocab_size,embed_dim,mask_zero=True)(past_target)
decoder_gru = layers.GRU(latent_dim,return_sequences=True)
x = decoder_gru(x,initial_state=encoded_source)
x = layers.Dropout(0.5)(x)
targets_next_step = layers.Dense(vocab_size,activation='softmax')(x)
seq2seq_rnn = keras.Model([source,past_target],targets_next_step)

Listing 11.30 Training our recurrent sequence-to-sequence model

In [277]:
# seq2seq_rnn.compile(
#  optimizer="rmsprop",
#  loss="sparse_categorical_crossentropy",
#  metrics=['accuracy'])
# history= seq2seq_rnn.fit(train_ds, epochs=1, validation_data=val_ds)



In practise, accuracy is not a greate metric for machine translation models

Listing 11.31 Translating new sentences with our RNN encoder and decoder

In [278]:
import numpy as np
spa_vocab = target_vectorization.get_vocabulary() 
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab)) 
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]" 

    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])

        ## Sample the next token
        next_token_predictions = seq2seq_rnn.predict( 
        [tokenized_input_sentence, tokenized_target_sentence]) 


        sampled_token_index = np.argmax(next_token_predictions[0, i, :])

        ## Convert next prediction token prediction to a string 
        ## and append it to the generated sentence
        sampled_token = spa_index_lookup[sampled_token_index] 
        
        decoded_sentence += " " + sampled_token 
        
        ## Exit Condition
        if sampled_token == "[end]": 
            break
    
    return decoded_sentence


In [281]:
# test_eng_texts = [pair[0] for pair in test_pairs] 
# for _ in range(2):
#  input_sentence = random.choice(test_eng_texts)
#  print("-")
#  print(input_sentence)
#  print(decode_sequence(input_sentence))

Drawbacks of RNN model:


1. The source sequence representation has to be held entirely in the encoder state vector(s)

2. RNNs have trouble dealing with very long sequences, since they tend to progressively forget about the past—by the time you’ve reached the 100th token in
 either sequence, little information remains about the start of the sequence.

### 11.5.3 Sequence-to-sequence learning with Transformer

+ $Transformer$ $Encoder$ : Reads the source sequence and produces an encoded representation of it.

+ $Transformer$ $Decoder$ : Reads tokens 0…N in the target sequence and tries to predict token N+1.

#### Transformer Decoder

Listing 11.33 The TransformerDecoder

In [291]:
class TransformerDecoder(layers.Layer):
    def __init__(self,embed_dim,dense_dim,num_heads,**kwargs):
        super().__init__(**kwargs)

        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads

        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads,key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads,key_dim=embed_dim)
        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim,activation="relu"),
            layers.Dense(embed_dim,),
            
        ])

        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()

        ##layers.Layer.support_masking

        # This attribute ensures that the layer will 
        # propagate its input mask to its outputs; 
        # masking in Keras is explicitly opt-in. If 
        # you pass a mask to a layer that doesn’t 
        # implement compute_mask() and that 
        # _ doesn’t expose this supports_masking 
        # attribute, that’s an error.
        self.supports_masking = True

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim" : self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim":self.dense_dim,
        })

        return config
    
    def get_causal_attention_mask(self,inputs):
        input_shape = tf.shape(inputs)
        batch_size , sequence_length = input_shape[0],input_shape[1]
        i = tf.range(sequence_length)[:,tf.newaxis]
        j = tf.range(sequence_length)

        mask = tf.cast(i >= j,dtype="int32")
       
        mask = tf.reshape(mask,(1,input_shape[1],input_shape[1]))

        ## mult = [[batch_size,1],
        #             [1,1]]
        mult = tf.concat(
            [tf.expand_dims(batch_size,-1),
                tf.constant([1,1],dtype=tf.int32,)
            ],axis=0
        )

        return tf.tile(mask,mult)
        ## It will repeat mask for batch_size times.
        ## The shape will be like shape= ( batch_size,(mask.shape),1,1,1 )


    def call(self,inputs,encoder_outputs,mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)

        if mask is not  None:
            padding_mask = tf.cast( mask[:,tf.newaxis,:],dtype="int32" )
            padding_mask = tf.minimum(padding_mask,causal_mask)
        
        attent_output_1 = self.attention_1(
            query = inputs,
            value = inputs,
            key = inputs,
            # Causal_mask to restrict by N not N+1
            attention_mask = causal_mask
        )

        ## Residual & Normalization
        attent_output_1 = self.layernorm_1(inputs+attent_output_1)

        attent_output_2 = self.attention_2(
            query = attent_output_1,
            value = encoder_outputs,
            key = encoder_outputs,
            ## padding_mask make sure the previous restriction and the function of padding
            attention_mask = padding_mask,
        )

        ## Residual & Normalization
        attent_output_2 = self.layernorm_2(attent_output_1+attent_output_2)

        ## Two dense layers
        proj_output = self.dense_proj(attent_output_2)

        ##the 3rd layernormalization & Residual 
        return self.layernorm_3(proj_output + attent_output_2)


#### Problem of $Casual$ $padding$:

+ __TransformerDecoder__ is order-agnostic: it looks at the entire target sequence at once. No matter how the order is.

+ If it were allowed to use its entire input, it would simply learn to copy input step N+1 to location N in the output.
    
   thus when running inference, it would be completely useless, since input steps beyond N aren’t available

+ The fixing :  mask the upper half of the pairwise attention matrix to make it only focus on information from tokens 0...N in the target sequence should be used   when generating target token N+1.

+ we’ll add a __get_causal_attention_mask(self, inputs)__
 method to our __TransformerDecoder__ to retrieve an attention mask that we can pass to
 our MultiHeadAttention layers.

#### Putting them all togather!

Listing 11.36 End-to-end Transformer

In [292]:
embed_dim = 256
dense_dim = 2048
num_heads =8 

encoder_inputs = keras.Input(shape= (None,),dtype="int64",name="english")
x = PositionalEmbedding(sequence_length,vocab_size,embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim,dense_dim,num_heads)(x)

decoder_inputs = keras.Input(shape=(None,),dtype="int64",name='spanish')
x = PositionalEmbedding(sequence_length,vocab_size,embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim,dense_dim,num_heads)(x,encoder_outputs)


x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size,activation="softmax")(x)
transformer = keras.Model([encoder_inputs,decoder_inputs],decoder_outputs)



In [295]:
transformer.compile(
    optimizer = "rmsprop",
    loss= keras.losses.SparseCategoricalCrossentropy(),
    metrics = ["accuracy"]
)

callbacks= [ ModelCheckpoint(
    filepath="E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\Transformer.keras",
    save_best_only=True
)]

In [296]:
transformer.fit(train_ds,epochs=30,validation_data=val_ds,callbacks=callbacks,verbose=10)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30


KeyboardInterrupt: 

In [297]:

from keras.models import load_model

transformer =load_model("E:\\Python-Machine-Learning\\Deep_Learning_With_python\\Ch11_Imdb\\Transformer.keras",
                    custom_objects={
                        "PositionalEmbedding":PositionalEmbedding,
                        "TransformerDecoder":TransformerDecoder,
                        "TransformerEncoder":TransformerEncoder
                    })

Listing 11.38 Translating new sentences with our Transformer model

In [298]:
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip( range(len(spa_vocab)), spa_vocab  ))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"

    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization(
            [decoded_sentence])[:,:-1]
        predictions = transformer([tokenized_input_sentence,tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0,i,:])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " "+sampled_token

        if sampled_token == '[end]':
            break

    return decoded_sentence

In [299]:
test_eng_texts = [pair[0] for pair in test_pairs] 
for _ in range(20):
 input_sentence = random.choice(test_eng_texts)
 print("-")
 print(input_sentence)
 print(decode_sequence(input_sentence))

-
What do you call this vegetable in English?
[start] lo que lo este [UNK] de años[end]  de    de    de de  aquí[end]
-
I'm asking you to do this because I trust you.
[start] te [UNK] a que te [UNK] de [UNK]  de de  de de de  que que  [UNK]
-
Listen to your mother.
[start] a tu noche[end]  su  a  a    a    a a  a
-
I wish I were there right now.
[start] [UNK] que [UNK]  aquí[end]  aquí[end]      aquí[end]       aquí[end]
-
How many people did you invite to your party?
[start] [UNK] a que [UNK] a su noche[end]  de    a  de  a a  en
-
We're coming at once.
[start] a la días[end]  de  de  de           de
-
I don't understand what the author is trying to say.
[start] no me lo que no [UNK]  de de de que no que no no  que que no que
-
You can't do that here.
[start] puedo hacer aquí[end]  aquí[end]               aquí[end]
-
Do you think parents should punish their children when they lie?
[start] que que que [UNK] a la [UNK] de [UNK]   de que de está  [UNK] [UNK]  [UNK]
-
Many unfair things h