# Chapter 11 Deep Learning for text

## 11.1 Natural language processing (NLP): The bird’s eye view

+   Every machine language was $designed$: its
 starting point was a human engineer writing down a set of formal rules to describe
 what statements you could make in that language and what they meant.

+   Machine-readable language is highly structured and rigorous, using precise syntactic
 rules to weave together exactly defined concepts from a fixed vocabulary. Natural language, by contrast, is messy: ambiguous, chaotic, sprawling, and constantly in flux.

+   That’s what modern NLP is about: using machine learning and large datasets to
 give computers the ability not to understand language, which is a loftier goal, but
 to ingest a piece of language as input and return something useful, such as an answer to
 one of the following questions:
    1. “What’s the topic of this text?” ($text$ $classification$)
    
    2. “Does this text contain abuse?” ($content$ $filtering$)
    
    3. “Does this text sound positive or negative?” ($sentiment$ $analysis$)
    
    4. “What should be the next word in this incomplete sentence?” ($language$ $modeling$)
    
    5. “How would you say this in German?” ($translation$)
    
    6. “How would you summarize this article in one paragraph?” ($summarization$)
    
    7. etc.

+  The text-processing models you will train do not understand language the way humans do; they simply
 look for statistical regularities in their input data, which turns out to be sufficient to
 perform well on many simple tasks. In much the same way that computer vision is pattern recognition applied to pixels, NLP is pattern recognition applied to words, sentences, and paragraphs.


+   Finally, around 2017–2018, a new architecture rose to replace RNNs: the __Transformer__, which you will learn about in the second half of this chapter. Transformers
 unlocked considerable progress across the field in a short period of time, and today
 most NLP systems are based on them.

## 11.2 Preparing text data

$Vectorizing$ text is the process of transforming text into numeric tensors.

1. $Standardize$:  First, you standardize the text to make it easier to process, such as by converting
 it to lowercase or removing punctuation.

2. $Tokenization$:  You split the text into units (called tokens), such as characters, words, or groups
 of words. This is called tokenization.  
 
3. $Indexing$ and $encoding$:  You convert each such token into a numerical vector, such as a one-hot vector. This will usually involve
 first indexing all tokens present in the data.

### 11.2.1 Text standardization

Text standardization is a basic form of feature engineering that aims to __erase
 encoding differences that you don’t want your model to have to deal with__. Two common techniques are:

1.  __Convert to lowercase and remove punctuation characters__.
    

2.  __Stemming__: converting variations of a term (such as different conjugated forms of a verb) into a single shared representation.
    For example, “caught” and “been catching” become “[catch]”, and “cats” becomes “[cat]”.
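
As a rough illustration of these two standardization steps, here is a minimal sketch in plain Python. The IRREGULAR_FORMS table and the toy_stem helper are hypothetical names introduced for this example only; real stemmers use far richer rules, so treat this as a toy under those assumptions.

```python
import string

# Hypothetical lookup table for irregular forms (an assumption for this sketch);
# a real stemmer or lemmatizer would cover far more cases.
IRREGULAR_FORMS = {"caught": "catch", "catching": "catch"}


def toy_stem(word: str) -> str:
    """Map a word to a shared root using a lookup table and one suffix rule."""
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    # Naive plural rule: strip a trailing "s" from longer words, but not "ss".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def standardize(text: str) -> str:
    """Lowercase the text, drop ASCII punctuation, and stem each word."""
    text = text.lower()
    text = "".join(char for char in text if char not in string.punctuation)
    return " ".join(toy_stem(word) for word in text.split())


print(standardize("The cats caught a mouse!"))  # the cat catch a mouse
```

After this step, “cats” and “cat” share one representation, so the model no longer has to learn that they refer to the same thing.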

### 11.2.2 Text splitting (tokenization)

1. __Word-level tokenization__ —Where tokens are space-separated (or punctuation-separated) substrings. A variant of this is to further split words into subwords
when applicable—for instance, treating “staring” as “star+ing” or “called” as
“call+ed.”

2. __N-gram tokenization__ —Where tokens are groups of N consecutive words. For
instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).


3. __Character-level tokenization__ —Where each character is its own token. In practice,
this scheme is rarely used, and you only really see it in specialized contexts, like
text generation or speech recognition.
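
The three schemes above can be sketched in a few lines of plain Python. The function names here are made up for illustration, and the input is assumed to be already standardized text.

```python
def word_tokens(text: str) -> list[str]:
    """Word-level tokenization: split on whitespace."""
    return text.split()


def ngram_tokens(text: str, n: int = 2) -> list[str]:
    """N-gram tokenization: every run of n consecutive words is one token."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_tokens(text: str) -> list[str]:
    """Character-level tokenization: each character (spaces included) is a token."""
    return list(text)


sentence = "the cat sat on the mat"
print(word_tokens(sentence))        # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ngram_tokens(sentence, n=2))  # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(char_tokens("the cat"))       # ['t', 'h', 'e', ' ', 'c', 'a', 't']
```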

There are two kinds of text-processing models:

1. $Sequence$ $model$: cares about word __order__.

2. $Bag-of-words$ $model$: treats input words as a set, __discarding their original order__.
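
To make the difference concrete, the following small sketch compares two sentences that contain the same words in a different order. It uses a Python list as a stand-in for a sequence representation and a set (or a Counter) as a stand-in for a bag-of-words representation; these stand-ins are an assumption of the example, not how a real model stores its input.

```python
from collections import Counter

first = "dog bites man".split()
second = "man bites dog".split()

# A sequence model sees the tokens in order, so these two inputs differ.
print(first == second)  # False

# A bag-of-words model only sees which words occur (and possibly how often),
# so both inputs collapse to the same representation.
print(set(first) == set(second))          # True
print(Counter(first) == Counter(second))  # True
```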

### 11.2.3 Vocabulary indexing

Once your text is split into tokens, you need to encode each token into a numerical
 representation.

In practice, you build
 an index of all terms found in the training data (the “vocabulary”) and assign a
 unique integer to each entry in the vocabulary:

In [3]:
# vocabulary = {}
# for text in dataset:
#     text = standardize(text)
#     tokens = tokenize(text)
#     for token in tokens:
#         # Assign the next free integer to each token seen for the first time.
#         if token not in vocabulary:
#             vocabulary[token] = len(vocabulary)



You can then convert that integer into a vector encoding that can be processed by a
neural network, like a one-hot vector:

In [4]:
# import numpy as np
# def one_hot_encode_token(token):
#     vector = np.zeros((len(vocabulary),))
#     token_index = vocabulary[token]
#     vector[token_index] = 1  # Mark the position of this token.
#     return vector

Note that at this step it’s common to restrict the vocabulary to only the top 20,000 or
 30,000 most common words found in the training data, since rare terms inflate the feature space while carrying very little information.
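
One way to apply such a cap is to count token frequencies first and index only the most frequent ones. The sketch below assumes the texts are already tokenized; build_vocabulary and max_tokens are names chosen for this example, and the tiny corpus is made up.

```python
from collections import Counter


def build_vocabulary(tokenized_texts, max_tokens):
    """Index only the max_tokens most frequent tokens in the corpus."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    vocabulary = {}
    for token, _count in counts.most_common(max_tokens):
        vocabulary[token] = len(vocabulary)
    return vocabulary


tokenized_texts = [["the", "cat", "sat"], ["the", "cat"], ["the", "dog"]]
vocabulary = build_vocabulary(tokenized_texts, max_tokens=2)
print(vocabulary)  # {'the': 0, 'cat': 1}
```

Tokens that fall outside the cap, such as "sat" and "dog" here, simply have no entry in the index.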

The data you were using from keras.datasets.imdb was
 already preprocessed into sequences of integers, where each integer stood for a given
 word. 
 
Back then, we used the setting num_words=10000, in order to restrict our vocabulary to the __top 10,000__ most common words found in the training data.

When we look up a new token in our vocabulary index, it may not exist.

Your training data may not have contained any instance of the word “cherimoya” (or maybe you
 excluded it from your index because it was too rare), so doing token_index =
 vocabulary["cherimoya"] may result in a KeyError. 
 
To handle this, you should use
 an __“out of vocabulary” index__ (abbreviated as $OOV$ index)—a catch-all for any token
 that wasn’t in the index. 
 
It’s usually __index 1__: you’re actually doing token_index =
 vocabulary.get(token, 1). When decoding a sequence of integers back into words,
 you’ll replace 1 with something like “[UNK]” (which you’d call an “OOV token”).
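
The following sketch shows the difference between a direct lookup and a lookup with an OOV fallback. The small vocabulary is made up for the example; it assumes index 0 is kept for the empty token and index 1 for the OOV token.

```python
# Toy vocabulary (an assumption for this sketch): 0 = empty token, 1 = OOV token.
vocabulary = {"": 0, "[UNK]": 1, "i": 2, "write": 3, "rewrite": 4}
inverse_vocabulary = {index: token for token, index in vocabulary.items()}
OOV_INDEX = 1


def encode(tokens):
    """Map tokens to integers, sending unknown tokens to the OOV index."""
    return [vocabulary.get(token, OOV_INDEX) for token in tokens]


def decode(indices):
    """Map integers back to tokens; the OOV index decodes to "[UNK]"."""
    return " ".join(inverse_vocabulary.get(index, "[UNK]") for index in indices)


# A direct dictionary lookup fails for a token that was never indexed.
try:
    vocabulary["cherimoya"]
except KeyError:
    print("KeyError: 'cherimoya' is not in the vocabulary")

encoded = encode(["i", "write", "cherimoya"])
print(encoded)          # [2, 3, 1]
print(decode(encoded))  # i write [UNK]
```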

### 11.2.4 Using the TextVectorization layer

Every step I’ve introduced so far would be very easy to implement in pure Python.
You could write something like this:

In [5]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [6]:


class Vectorizer:
    def standardize(self, text: str) -> str:
        # Lowercase the text and drop ASCII punctuation.
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)

    def tokenize(self, text: str) -> list:
        # Standardize, then split on whitespace.
        text = self.standardize(text)
        return text.split()

    def make_vocabulary(self, dataset):
        # Index 0 is the empty token and index 1 is the OOV token.
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict((val, key) for key, val in self.vocabulary.items())

    def encode(self, text: str) -> list:
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence) -> str:
        # Join with spaces so the decoded words stay separated.
        return " ".join(
            self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence
        )

In [7]:
vector = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms."
]

vector.make_vocabulary(dataset)
print(vector.vocabulary)

{'': 0, '[UNK]': 1, 'i': 2, 'write': 3, 'erase': 4, 'rewrite': 5, 'again': 6, 'and': 7, 'then': 8, 'a': 9, 'poppy': 10, 'blooms': 11}


In [8]:
test_sequence = "I write, rewrite, and still rewrite again."
encoded_sentence = vector.encode(test_sequence)
print(f"Encoded Sentence:\n{encoded_sentence}")

decoded_sentence = vector.decode(encoded_sentence)
print(f"Decoded Sentence:\n{decoded_sentence}")

Encoded Sentence:
[2, 3, 5, 7, 1, 5, 6]
Decoded Sentence:
i write rewrite and [UNK] rewrite again


In [9]:
# from sklearn.feature_extraction.text import CountVectorizer
# vect = CountVectorizer().fit_transform(vector.vocabulary) 
# vect = vect.toarray()
# print(vect)
# from sklearn.preprocessing import OneHotEncoder
# one = OneHotEncoder().fit_transform(vect)
# print(one)

However, using something like this wouldn’t be very performant. 

In practice, you’ll work with the Keras __TextVectorization__ layer, which is fast and can be dropped directly into a tf.data pipeline or a Keras model.

In [10]:
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

# Configures the layer to return sequences of words encoded 
# as integer indices. There are several other output modes 
# available, which you will see in action in a bit.
text_vectorization = TextVectorization(output_mode="int")


By default, the TextVectorization layer will use the following settings:
+ __convert to lowercase__ and __remove punctuation__ for text $standardization$, 
+  __split on whitespace__ for $tokenization$.

You can also provide custom functions for standardization and tokenization. Note that
 such custom functions should operate on __tf.string tensors__, not regular Python
 strings!

In [11]:
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor: tf.Tensor) -> tf.Tensor:
    # Convert strings to lowercase.
    lowercase_string = tf.strings.lower(string_tensor)
    # Replace punctuation characters with the empty string.
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", ""
    )

def custom_split_fn(string_tensor: tf.Tensor):
    # Split strings on whitespace.
    return tf.strings.split(string_tensor)


In [12]:
text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)

To index the vocabulary of a text corpus, just call the __adapt() method__ of the layer
with a Dataset object that yields __strings__, or just with __a list of Python strings__:

In [13]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms."
]

text_vectorization.adapt(dataset)

Note that you can __retrieve the computed vocabulary__ via __get_vocabulary()__—this can
 be useful if you need to convert text encoded as integer sequences back into words.

In [14]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

For demonstration, let's try to encode and then decode an example sentence.

Listing 11.1 Displaying the vocabulary

In [15]:
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

# Map each integer index back to its token.
inverse_vocab = dict(enumerate(vocabulary))
# Join with spaces so the decoded words stay separated.
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
i write rewrite and [UNK] rewrite again


#### Using the __TextVectorization__ layer in a __tf.data__ pipeline or as part of a model

The __TextVectorization__ layer is mostly a dictionary lookup operation, so it runs __ONLY__ on the __CPU__, not on a GPU.


There are __two__ ways to use the __TextVectorization__ layer:

1.  Put it in the __tf.data__ pipeline:
    
    + __int_sequence_dataset = string_dataset.map(text_vectorization, num_parallel_calls=4)__
    
    +  While the GPU runs the model on one batch of vectorized data, the CPU stays busy by vectorizing the next batch of
        raw strings.
    
    + __Recommended when training on a GPU__



2.  Make it part of the model:
    +   __text_input = keras.Input(shape=(), dtype="string")__

        __vectorized_text = text_vectorization(text_input)__

        __embedded_input = keras.layers.Embedding(...)(vectorized_text)__

        __output = ...__

        __model = keras.Model(text_input, output)__

    +   This means that
        at each training step, the rest of the model (placed on the GPU) will have to wait for
        the output of the TextVectorization layer (placed on the CPU) to be ready in order
        to get to work.

    +   __Not recommended when training on a GPU__

For deployment, you will usually want a model that accepts raw strings as input. Thankfully, the TextVectorization layer enables you to include
 text preprocessing right in your model, making it easier to deploy, even if you were
 originally using the layer as part of a tf.data pipeline.

## 11.3 Two approaches for representing groups of words: Sets and Sequences
